Traceback (most recent call last):
  File "xx/OpenDiT-master/train.py", line 383, in <module>
    main(args)
  File "xx/OpenDiT-master/train.py", line 275, in main
    batch = next(dataloader_iter)
  File "/root/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/root/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1329, in _next_data
    idx, data = self._get_data()
  File "/root/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1285, in _get_data
    success, data = self._try_get_data()
  File "/root/miniconda3/envs/opendit/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1146, in _try_get_data
    raise RuntimeError(f'DataLoader worker (pid(s) {pids_str}) exited unexpectedly') from e
RuntimeError: DataLoader worker (pid(s) 2365846) exited unexpectedly
How did you launch your script?
I think there may be a memory leak: memory usage grew beyond 600G, so the worker process was killed.
It is related to the torch DataLoader. You can call gc.collect() periodically to avoid this problem.
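
A minimal sketch of the gc.collect() suggestion, in case it helps. The helper name iterate_with_gc and the collect_every parameter are hypothetical and not part of OpenDiT or torch; the stand-in loader is a plain generator so the snippet runs without torch. Whether this actually stops the memory growth depends on where the leak comes from, so treat it as something to try rather than a confirmed fix.

```python
import gc
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def iterate_with_gc(batches: Iterable[T], collect_every: int = 100) -> Iterator[T]:
    """Yield items from `batches`, calling gc.collect() after every
    `collect_every`-th item has been consumed.

    Hypothetical helper: `batches` can be any iterable, for example a
    torch DataLoader or an iterator created from one.
    """
    if collect_every < 1:
        raise ValueError("collect_every must be >= 1")
    for step, batch in enumerate(batches, start=1):
        yield batch
        # Execution resumes here when the caller asks for the next batch,
        # i.e. after the caller is done with the current one.
        if step % collect_every == 0:
            gc.collect()


if __name__ == "__main__":
    # Stand-in for a DataLoader: 250 dummy "batches".
    fake_loader = ([0] * 1000 for _ in range(250))
    seen = 0
    for batch in iterate_with_gc(fake_loader, collect_every=100):
        seen += 1  # the training step would go here
    print(seen)
```

Since the traceback shows train.py calling next(dataloader_iter), one way to try this would be to wrap the existing iterator, e.g. dataloader_iter = iterate_with_gc(dataloader_iter), assuming it is an ordinary iterator; the returned generator also works with next().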