[BUG]: Error when running [train.py](https://github.com/hpcaitech/ColossalAI/blob/main/examples/tutorial/sequence_parallel/train.py#L76)
🐛 Describe the bug
I ran the code with synthetic data using the following command, and it failed with an error.
torchrun --nproc_per_node=4 train.py --synthetic 2>&1 | tee run.log
The error log is below:
Traceback (most recent call last):
File "train.py", line 240, in <module>
Traceback (most recent call last):
File "train.py", line 240, in <module>
main()
File "train.py", line 181, in main
main()
File "train.py", line 181, in main
tokens, types, sentence_order, loss_mask, lm_labels, padding_mask = get_batch_for_sequence_parallel(
File "/data/xxxx/coloss/ColossalAI/examples/tutorial/sequence_parallel/data/bert_helper.py", line 147, in get_batch_for_sequence_parallel
tokens, types, sentence_order, loss_mask, lm_labels, padding_mask = get_batch_for_sequence_parallel(
File "/data/xxxx/coloss/ColossalAI/examples/tutorial/sequence_parallel/data/bert_helper.py", line 147, in get_batch_for_sequence_parallel
print("data_b['text].shape:",tokens.data_b['text'].shape)
AttributeError: 'Tensor' object has no attribute 'data_b'
print("data_b['text].shape:",tokens.data_b['text'].shape)
AttributeError: 'Tensor' object has no attribute 'data_b'
Traceback (most recent call last):
File "train.py", line 240, in <module>
Traceback (most recent call last):
File "train.py", line 240, in <module>
main()
File "train.py", line 181, in main
tokens, types, sentence_order, loss_mask, lm_labels, padding_mask = get_batch_for_sequence_parallel(
File "/data/xxxx/coloss/ColossalAI/examples/tutorial/sequence_parallel/data/bert_helper.py", line 147, in get_batch_for_sequence_parallel
main()
File "train.py", line 181, in main
print("data_b['text].shape:",tokens.data_b['text'].shape)
AttributeError: 'Tensor' object has no attribute 'data_b'
tokens, types, sentence_order, loss_mask, lm_labels, padding_mask = get_batch_for_sequence_parallel(
File "/data/xxxx/coloss/ColossalAI/examples/tutorial/sequence_parallel/data/bert_helper.py", line 147, in get_batch_for_sequence_parallel
print("data_b['text].shape:",tokens.data_b['text'].shape)
AttributeError: 'Tensor' object has no attribute 'data_b'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1568) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/xxxxal/.local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/xxxxal/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
run(args)
File "/home/xxxxal/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/home/xxxxal/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/xxxxal/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
Environment
No response
Hi @lambda7xx, thank you for your feedback. We will try to reproduce your issue and fix it soon. For sequence parallel, you can try the new version of the example at https://github.com/hpcaitech/ColossalAI/tree/main/examples/tutorial/sequence_parallel
By the way, we are restructuring the documents and examples, and the new versions of the examples will be available at the following link: https://github.com/hpcaitech/ColossalAI/tree/main/examples
I tried this command
torchrun --nproc_per_node=4 train.py --synthetic 2>&1 | tee run.log
to run [train.py](https://github.com/hpcaitech/ColossalAI/blob/main/examples/tutorial/sequence_parallel/train.py), and it still doesn't work.
Could you provide more details? We have made many updates since then. This issue was closed due to inactivity. Thanks.
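For anyone landing here with the same traceback: the failing line is a debug print at bert_helper.py line 147 that looks up `data_b` as an attribute of `tokens`, and the log says `tokens` is a plain Tensor, which has no such attribute. The snippet below is only a minimal sketch of that failure mode, not the actual contents of bert_helper.py. `FakeTensor` is a hypothetical stand-in for `torch.Tensor` so the example runs without PyTorch, and treating `data_b` as a dict keyed by `'text'` is an assumption inferred from the label in the print statement; the shape `(4, 128)` is made up.

```python
class FakeTensor:
    """Hypothetical stand-in for torch.Tensor: has .shape, but no .data_b."""

    def __init__(self, shape):
        self.shape = shape


# Assumption: data_b is a dict of tensors keyed by field name.
data_b = {"text": FakeTensor((4, 128))}
tokens = data_b["text"]


def failing_debug_print(tokens):
    # Mirrors the expression in the traceback: data_b is looked up
    # as an attribute of the tensor, which raises AttributeError.
    return tokens.data_b["text"].shape


def working_debug_print(data_b):
    # Index the dict directly instead of going through the tensor.
    return data_b["text"].shape


try:
    failing_debug_print(tokens)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
print("data_b['text'].shape:", working_debug_print(data_b))
```

If the local copy of bert_helper.py contains a similar hand-added print, removing it or printing `tokens.shape` directly would likely avoid this particular error. Whether the rest of the example then runs is not something this sketch can confirm.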