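Notes on a problem I ran into while training Qwen with ms-swift
The error I hit is shown below. It occurred with both torch==2.7.1 and torch==2.6.0. Other users on the same server did not run into it, so the GPUs themselves are presumably fine, and the likely cause is still the torch version.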
```
Traceback (most recent call last):
[rank1]: File "/home/yusun/env/anaconda3/envs/ms-swift/lib/python3.10/site-packages/swift/cli/sft.py", line 10, in <module>
[rank1]: sft_main()
[rank1]: File "/home/yusun/env/anaconda3/envs/ms-swift/lib/python3.10/site-packages/swift/llm/train/sft.py", line 324, in sft_main
[rank1]: return SwiftSft(args).main()
[rank1]: File "/home/yusun/env/anaconda3/envs/ms-swift/lib/python3.10/site-packages/swift/llm/train/sft.py", line 27, in __init__
[rank1]: super().__init__(args)
[rank1]: File "/home/yusun/env/anaconda3/envs/ms-swift/lib/python3.10/site-packages/swift/llm/base.py", line 19, in __init__
[rank1]: self.args = self._parse_args(args)
[rank1]: File "/home/yusun/env/anaconda3/envs/ms-swift/lib/python3.10/site-packages/swift/llm/base.py", line 31, in _parse_args
[rank1]: args, remaining_argv = parse_args(self.args_class, args)
[rank1]: File "/home/yusun/env/anaconda3/envs/ms-swift/lib/python3.10/site-packages/swift/utils/utils.py", line 152, in parse_args
[rank1]: args, remaining_args = parser.parse_args_into_dataclasses(argv, return_remaining_strings=True)
[rank1]: File "/home/yusun/env/anaconda3/envs/ms-swift/lib/python3.10/site-packages/transformers/hf_argparser.py", line 358, in parse_args_into_dataclasses
[rank1]: obj = dtype(**inputs)
[rank1]: File "<string>", line 319, in __init__
[rank1]: File "/home/yusun/env/anaconda3/envs/ms-swift/lib/python3.10/site-packages/swift/llm/argument/train_args.py", line 178, in __post_init__
[rank1]: self._add_version()
[rank1]: File "/home/yusun/env/anaconda3/envs/ms-swift/lib/python3.10/site-packages/swift/llm/argument/train_args.py", line 229, in _add_version
[rank1]: self.output_dir = add_version_to_work_dir(self.output_dir)
[rank1]: File "/home/yusun/env/anaconda3/envs/ms-swift/lib/python3.10/site-packages/swift/utils/utils.py", line 133, in add_version_to_work_dir
[rank1]: dist.broadcast_object_list(obj_list)
[rank1]: File "/home/yusun/env/anaconda3/envs/ms-swift/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank1]: return func(*args, **kwargs)
[rank1]: File "/home/yusun/env/anaconda3/envs/ms-swift/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3479, in broadcast_object_list
[rank1]: broadcast(object_sizes_tensor, src=global_src, group=group)
[rank1]: File "/home/yusun/env/anaconda3/envs/ms-swift/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank1]: return func(*args, **kwargs)
[rank1]: File "/home/yusun/env/anaconda3/envs/ms-swift/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2726, in broadcast
[rank1]: work = group.broadcast([tensor], opts)
[rank1]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:268, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rank1]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank1]: Last error:
[rank1]: Cuda failure 1 'invalid argument'
```
In the end, on a classmate's advice, I added the following line at the top of the sft.sh script:
`export NCCL_NVLS_ENABLE=0`
That resolved the problem.
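For reference, here is a minimal sketch of what the top of sft.sh might look like with the fix applied. `NCCL_NVLS_ENABLE=0` turns off NCCL's NVLink SHARP (NVLS) support, and `NCCL_DEBUG=INFO` is the setting the error message itself suggests for getting more detail. The `echo` line and the placeholder comment for the training command are my own additions for illustration. The actual contents of sft.sh are not shown in this post.

```shell
#!/usr/bin/env bash
# Sketch of the top of sft.sh (assumed layout; the real script is not shown here).

# Disable NCCL's NVLink SHARP (NVLS) support. This must be exported
# before the training processes are launched so every rank inherits it.
export NCCL_NVLS_ENABLE=0

# Optional: ask NCCL for detailed logs, as the error message recommends.
export NCCL_DEBUG=INFO

# Illustrative sanity check that the variables are set in this shell.
echo "NCCL_NVLS_ENABLE=${NCCL_NVLS_ENABLE} NCCL_DEBUG=${NCCL_DEBUG}"

# ... the original training command from sft.sh follows here, unchanged ...
```

Because the variables are exported rather than just assigned, any process started later in the same script inherits them. Once training runs cleanly, the `NCCL_DEBUG` line can be removed to keep the logs short.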