
Error when finetuning Yi-34B using HybridParallel

puppet101 opened this issue · 1 comment

Hi, I am finetuning the Yi-34B model using the HybridParallel plugin and got the errors below. My PyTorch version is 2.0 and CUDA is 11.8. Could you please help me? Thanks!

Gradient checkpointing enabled successfully
Flash-attention enabled successfully
Model params: 32.03 B
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/shard/sharder.py", line 197, in _replace_sub_module
    replace_layer = target_module.from_native_module(
  File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/layer/linear.py", line 166, in from_native_module
    linear_1d = Linear1D_Col(
  File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/layer/linear.py", line 104, in __init__
    self.randomizer = create_randomizer_with_offset(seed, process_group=self.process_group)
  File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/layer/utils.py", line 273, in create_randomizer_with_offset
    is_synchronized = Randomizer.is_randomizer_index_synchronized(process_group)
  File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/layer/utils.py", line 217, in is_randomizer_index_synchronized
    dist.all_gather(gathered_index, index_tensor, process_group)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1436, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2435, in all_gather
    work = group.allgather([tensor_list], [tensor])
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1207, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error: Bootstrap : no socket interface found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/code/multi_node/ColossalAI/applications/Colossal-LLaMA-2/train.py", line 423, in <module>
    main()
  File "/opt/code/multi_node/ColossalAI/applications/Colossal-LLaMA-2/train.py", line 279, in main
    model, optimizer, _, dataloader, lr_scheduler = booster.boost(
  File "/usr/local/lib/python3.10/dist-packages/colossalai/booster/booster.py", line 138, in boost
    model, optimizer, criterion, dataloader, lr_scheduler = self.plugin.configure(
  File "/usr/local/lib/python3.10/dist-packages/colossalai/booster/plugin/hybrid_parallel_plugin.py", line 1106, in configure
    model = HybridParallelModule(
  File "/usr/local/lib/python3.10/dist-packages/colossalai/booster/plugin/hybrid_parallel_plugin.py", line 70, in __init__
    module, self.shared_params = shardformer.optimize(module, policy=custom_policy)
  File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/shard/shardformer.py", line 54, in optimize
    shared_params = sharder.shard()
  File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/shard/sharder.py", line 43, in shard
    self._replace_module(include=held_layers)
  File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/shard/sharder.py", line 67, in _replace_module
    self._recursive_replace_layer(
  File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/shard/sharder.py", line 115, in _recursive_replace_layer
    self._recursive_replace_layer(
  File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/shard/sharder.py", line 115, in _recursive_replace_layer
    self._recursive_replace_layer(
  File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/shard/sharder.py", line 115, in _recursive_replace_layer
    self._recursive_replace_layer(
  File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/shard/sharder.py", line 112, in _recursive_replace_layer
    self._replace_sub_module(module, sub_module_replacement, include)
  File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/shard/sharder.py", line 201, in _replace_sub_module
    raise RuntimeError(
RuntimeError: Failed to replace self_attn.q_proj of type Linear with Linear1D_Col with the exception: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1207, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error: Bootstrap : no socket interface found.
Please check your model configuration or sharding policy, you can set up an issue for us to help you as well.

puppet101 · Mar 12 '24

Hey @puppet101 ,

This seems to be an NCCL communication error.

Could you first try setting NCCL_SOCKET_IFNAME to control which network interface NCCL uses? For example:

export NCCL_SOCKET_IFNAME=eth

You can use ifconfig (or ip addr) to inspect the network interfaces on each node, and then set NCCL_SOCKET_IFNAME to match the interface that actually carries inter-node traffic.
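As a quick way to see which interface names exist on a node (so that the NCCL_SOCKET_IFNAME value matches a real device), something like the following works on Linux; any interface names it prints (eth0, ens3, lo, ...) are specific to your machines:

```shell
# List network interface names known to the kernel (Linux).
# Works even when `ifconfig` is not installed.
ls /sys/class/net

# If iproute2 is available, also show each interface's addresses,
# to help pick the one that carries inter-node traffic.
if command -v ip >/dev/null 2>&1; then
    ip -o addr show
fi
```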

Reference: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html
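To confirm which interface NCCL actually selects during bootstrap, its debug logging can be enabled alongside the interface setting. This is a hedged sketch: eth0 below is a placeholder, not a recommendation, and must be replaced with an interface that exists on your nodes:

```shell
# Placeholder: replace eth0 with the interface that carries
# inter-node traffic on your cluster (see `ifconfig` / `ip addr`).
export NCCL_SOCKET_IFNAME=eth0

# Make NCCL log which socket interface it picks at startup;
# restrict the log to the bootstrap/network subsystems to reduce noise.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET
```

Set these in the environment of every rank (e.g. in the launch script) before starting training, then look for the "Bootstrap" and "NET/Socket" lines in the NCCL output.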

yuanheng-zhao · Mar 13 '24

Closed as the issue has been inactive for over a month. Please let us know if any further issues arise.

yuanheng-zhao · Apr 17 '24