[BUG]: problem when running [train.py](https://github.com/hpcaitech/ColossalAI/blob/main/examples/tutorial/sequence_parallel/train.py#L76)

Open lambda7xx opened this issue 3 years ago • 2 comments

🐛 Describe the bug

I ran the following command to run the code with synthetic data, and it failed: `torchrun --nproc_per_node=4 train.py --synthetic 2>&1 | tee run.log`

The error log is below:

Traceback (most recent call last):
  File "train.py", line 240, in <module>
Traceback (most recent call last):
  File "train.py", line 240, in <module>
    main()
      File "train.py", line 181, in main
main()
  File "train.py", line 181, in main
    tokens, types, sentence_order, loss_mask, lm_labels, padding_mask = get_batch_for_sequence_parallel(
  File "/data/xxxx/coloss/ColossalAI/examples/tutorial/sequence_parallel/data/bert_helper.py", line 147, in get_batch_for_sequence_parallel
    tokens, types, sentence_order, loss_mask, lm_labels, padding_mask = get_batch_for_sequence_parallel(
  File "/data/xxxx/coloss/ColossalAI/examples/tutorial/sequence_parallel/data/bert_helper.py", line 147, in get_batch_for_sequence_parallel
    print("data_b['text].shape:",tokens.data_b['text'].shape)
AttributeError: 'Tensor' object has no attribute 'data_b'
    print("data_b['text].shape:",tokens.data_b['text'].shape)
AttributeError: 'Tensor' object has no attribute 'data_b'
Traceback (most recent call last):
  File "train.py", line 240, in <module>
Traceback (most recent call last):
  File "train.py", line 240, in <module>
    main()
  File "train.py", line 181, in main
    tokens, types, sentence_order, loss_mask, lm_labels, padding_mask = get_batch_for_sequence_parallel(
  File "/data/xxxx/coloss/ColossalAI/examples/tutorial/sequence_parallel/data/bert_helper.py", line 147, in get_batch_for_sequence_parallel
    main()
  File "train.py", line 181, in main
    print("data_b['text].shape:",tokens.data_b['text'].shape)
AttributeError: 'Tensor' object has no attribute 'data_b'
    tokens, types, sentence_order, loss_mask, lm_labels, padding_mask = get_batch_for_sequence_parallel(
  File "/data/xxxx/coloss/ColossalAI/examples/tutorial/sequence_parallel/data/bert_helper.py", line 147, in get_batch_for_sequence_parallel
    print("data_b['text].shape:",tokens.data_b['text'].shape)
AttributeError: 'Tensor' object has no attribute 'data_b'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1568) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/xxxxal/.local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/xxxxal/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/home/xxxxal/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/xxxxal/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/xxxxal/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
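
The traceback is repeated and interleaved because each of the four processes started by `torchrun --nproc_per_node=4` prints its own copy. All of them fail on the same debug `print` at line 147 of `bert_helper.py`, which looks up `data_b` as an attribute of the `tokens` tensor. The surrounding code of `bert_helper.py` is not shown in this issue, so the following is only a minimal sketch under assumptions: `data_b` is assumed to be a dict of tensors keyed by `'text'`, and the `Tensor` class is a hypothetical stand-in for `torch.Tensor` so the snippet runs without PyTorch. It reproduces the error message and shows what the line was presumably meant to print.

```python
class Tensor:
    """Hypothetical stand-in for torch.Tensor; it only carries a shape."""

    def __init__(self, shape):
        self.shape = shape


# Assumption: data_b is a dict of tensors keyed by field name.
data_b = {"text": Tensor((4, 128))}
tokens = data_b["text"]


def failing_debug_print():
    # Mirrors the line in the traceback: data_b is looked up on the tensor,
    # which has no such attribute.
    print("data_b['text'].shape:", tokens.data_b["text"].shape)


def intended_debug_print():
    # Presumably intended: index the dict directly (or print tokens.shape).
    print("data_b['text'].shape:", data_b["text"].shape)
    return data_b["text"].shape


try:
    failing_debug_print()
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print("AttributeError:", error_message)
shape = intended_debug_print()
```

If that assumption holds, removing the debug `print` or changing it to index `data_b` directly should let `get_batch_for_sequence_parallel` get past line 147; whether anything else fails afterwards cannot be told from this log.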

Environment

No response

lambda7xx (Nov 15 '22 01:11)

Hi @lambda7xx, thank you for your feedback. We will try to reproduce your issue and fix it soon. For sequence parallelism, you can try the new version of the example: https://github.com/hpcaitech/ColossalAI/tree/main/examples/tutorial/sequence_parallel

By the way, we are restructuring the documentation and examples. The new versions of the examples will be provided at the following link: https://github.com/hpcaitech/ColossalAI/tree/main/examples

binmakeswell (Nov 15 '22 05:11)

I tried this command, `torchrun --nproc_per_node=4 train.py --synthetic 2>&1 | tee run.log`, to run [train.py](https://github.com/hpcaitech/ColossalAI/blob/main/examples/tutorial/sequence_parallel/train.py), and it still doesn't work.

lambda7xx (Nov 15 '22 07:11)

Could you provide more details? We have made many updates since then. This issue was closed due to inactivity. Thanks.

binmakeswell (Apr 14 '23 08:04)