Error when fine-tuning Yi-34B using HybridParallel
Hi, I am fine-tuning the Yi-34B model using HybridParallel and got the following errors. My PyTorch version is 2.0 and CUDA is 11.8. Could you please give me some help? Thanks!
```
Gradient checkpointing enabled successfully
Flash-attention enabled successfully
Model params: 32.03 B
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/shard/sharder.py", line 197, in _replace_sub_module
    replace_layer = target_module.from_native_module(
  File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/layer/linear.py", line 166, in from_native_module
    linear_1d = Linear1D_Col(
  File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/layer/linear.py", line 104, in __init__
    self.randomizer = create_randomizer_with_offset(seed, process_group=self.process_group)
  File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/layer/utils.py", line 273, in create_randomizer_with_offset
    is_synchronized = Randomizer.is_randomizer_index_synchronized(process_group)
  File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/layer/utils.py", line 217, in is_randomizer_index_synchronized
    dist.all_gather(gathered_index, index_tensor, process_group)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1436, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2435, in all_gather
    work = group.allgather([tensor_list], [tensor])
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1207, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed. Last error:
Bootstrap : no socket interface found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/code/multi_node/ColossalAI/applications/Colossal-LLaMA-2/train.py", line 423, in
```
Hey @puppet101,
This seems to be an NCCL communication error.
Could you first try setting NCCL_SOCKET_IFNAME so that NCCL binds to a specific network interface?
For example:
```
export NCCL_SOCKET_IFNAME=eth
```
You might want to use ifconfig (or ip link) to list your network interfaces, and then make sure the value of NCCL_SOCKET_IFNAME matches one of them.
Reference: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
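For reference, a quick sketch of the diagnostic steps above. The interface name `eth0` is only a placeholder; substitute one of the names printed by the first command:

```shell
# List available network interfaces (names appear in the second column);
# guarded with || true in case the ip tool is unavailable
ip -o link show 2>/dev/null | awk -F': ' '{print $2}' || true

# Pin NCCL to a specific interface -- replace eth0 with a real
# interface name from the list above (eth0 is just an example)
export NCCL_SOCKET_IFNAME=eth0

# Optionally enable NCCL's own logging to confirm which interface
# it actually selects during bootstrap
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET
```

With NCCL_DEBUG=INFO set, the rank-0 log at startup should print the interface NCCL chose, which makes it easy to verify the setting took effect.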
Closing as the issue has been inactive for over a month. Please let us know if there are any further issues.