ms-swift error: cannot pickle '_io.TextIOWrapper' object
Fine-tuning qwen2.5-omni with ms-swift.
Project scenario:
Fine-tuning qwen2.5-omni with ms-swift.
Problem description:
[rank0]:   File "/all/data3/pengtao/ms-swift/swift/llm/train/sft.py", line 96, in run
[rank0]:     train_dataset, val_dataset = self._encode_dataset(train_dataset, val_dataset)
[rank0]:   File "/all/data3/pengtao/ms-swift/swift/llm/train/sft.py", line 224, in _encode_dataset
[rank0]:     train_dataset = packing_dataset_cls(
[rank0]:   File "/all/data3/pengtao/ms-swift/swift/llm/dataset/utils.py", line 307, in __init__
[rank0]:     self.create_packed_dataset()
[rank0]:   File "/all/data3/pengtao/ms-swift/swift/llm/dataset/utils.py", line 321, in create_packed_dataset
[rank0]:     worker.start()
[rank0]:   File "/root/anaconda3/envs/swift/lib/python3.10/multiprocessing/process.py", line 121, in start
[rank0]:     self._popen = self._Popen(self)
[rank0]:   File "/root/anaconda3/envs/swift/lib/python3.10/multiprocessing/context.py", line 224, in _Popen
[rank0]:     return _default_context.get_context().Process._Popen(process_obj)
[rank0]:   File "/root/anaconda3/envs/swift/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
[rank0]:     return Popen(process_obj)
[rank0]:   File "/root/anaconda3/envs/swift/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
[rank0]:     super().__init__(process_obj)
[rank0]:   File "/root/anaconda3/envs/swift/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
[rank0]:     self._launch(process_obj)
[rank0]:   File "/root/anaconda3/envs/swift/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 47, in _launch
[rank0]:     reduction.dump(process_obj, fp)
[rank0]:   File "/root/anaconda3/envs/swift/lib/python3.10/multiprocessing/reduction.py", line 60, in dump
[rank0]:     ForkingPickler(file, protocol).dump(obj)
[rank0]: TypeError: cannot pickle '_io.TextIOWrapper' object
Packing (num_proc=1): 0%| | 0/45873 [00:01<?, ?it/s]
[rank0]:[W708 03:02:16.397089511 CudaIPCTypes.cpp:100] Producer process tried to deallocate over 1000 memory blocks referred by consumer processes. Deallocation might be significantly slowed down. We assume it will never going to be the case, but if it is, please file but to https://github.com/pytorch/pytorch
[rank0]:[W708 03:02:16.456878546 ProcessGroupNCCL.cpp:1479] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank0]:[W708 03:02:20.740747554 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
W0708 03:02:20.769000 92751 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 92898 closing signal SIGTERM
/root/anaconda3/envs/swift/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
E0708 03:02:21.189000 92751 site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 92897) of binary: /root/anaconda3/envs/swift/bin/python3.10
Traceback (most recent call last):
  File "/root/anaconda3/envs/swift/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/anaconda3/envs/swift/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/root/anaconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/run.py", line 896, in <module>
    main()
  File "/root/anaconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/root/anaconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in main
    run(args)
  File "/root/anaconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/run.py", line 883, in run
    elastic_launch(
  File "/root/anaconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda3/envs/swift/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/all/data3/pengtao/ms-swift/swift/cli/sft.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2025-07-08_03:02:20
host : localhost.localdomain
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 92897)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
The key failure is the io.TextIOWrapper error raised while the dataset packing worker process is being started:
TypeError: cannot pickle '_io.TextIOWrapper' object
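The traceback ends in popen_spawn_posix.py calling reduction.dump, so the process object is being pickled before the worker starts. The following is a minimal sketch of that failure mode in isolation, not the actual ms-swift or deepspeed code: PackingWorker, pickle_error and demo are hypothetical names used only for illustration, and the sketch assumes the unpicklable object is an open text file handle held somewhere in the object being sent to the child process.

```python
import os
import pickle
import tempfile


class PackingWorker:
    """Hypothetical stand-in for an object handed to a spawned process."""

    def __init__(self, log_path):
        # An open text file is an _io.TextIOWrapper, which pickle cannot serialize.
        self.log_file = open(log_path, "w", encoding="utf-8")

    def close(self):
        self.log_file.close()


def pickle_error(obj):
    """Return the TypeError message raised when pickling obj, or None on success."""
    try:
        pickle.dumps(obj)
    except TypeError as exc:
        return str(exc)
    return None


def demo():
    """Build a worker that holds an open file and try to pickle it."""
    fd, path = tempfile.mkstemp(suffix=".log")
    os.close(fd)
    worker = PackingWorker(path)
    try:
        return pickle_error(worker)
    finally:
        worker.close()
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

Any attribute reachable from the process object is pickled along with it under the spawn start method, so one open file handle anywhere in that object graph is enough to trigger the error.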
Solution:
Downgrade deepspeed to version 0.16.9:
pip install deepspeed==0.16.9
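Before downgrading, it can help to confirm which deepspeed version is actually installed in the environment. The sketch below uses only the standard library; the helper names (release_tuple, installed_version, is_newer_than_known_good) are hypothetical. This post only reports that 0.16.9 worked, so treating every newer release as affected is an assumption rather than a confirmed fact.

```python
from importlib import metadata

KNOWN_GOOD = "0.16.9"  # version reported to work in this post


def release_tuple(version):
    """Turn '0.17.1+cu121' into (0, 17, 1); stops at the first non-numeric part."""
    public = version.split("+", 1)[0]
    parts = []
    for piece in public.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def installed_version(package="deepspeed"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_newer_than_known_good(version, known_good=KNOWN_GOOD):
    """True if version is a later release than the known-good one."""
    return release_tuple(version) > release_tuple(known_good)


if __name__ == "__main__":
    current = installed_version()
    if current is None:
        print("deepspeed is not installed")
    elif is_newer_than_known_good(current):
        # Assumption: newer releases may hit the pickling error described above.
        print(f"deepspeed {current} is newer than {KNOWN_GOOD}; consider downgrading")
    else:
        print(f"deepspeed {current} is at or below {KNOWN_GOOD}")
```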