Anhelor
Why not use `tensor.isnan()` and `tensor.isinf()`?
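For illustration, here is a minimal sketch of what such a check could look like. The helper name `count_nonfinite` and the sample tensor are made up for this example and are not taken from DeepSpeed or `ds_to_universal.py`; only `Tensor.isnan()` and `Tensor.isinf()` are real PyTorch methods.

```python
import torch


def count_nonfinite(t: torch.Tensor) -> dict:
    """Count NaN and Inf entries in a tensor (hypothetical helper)."""
    return {
        # isnan() / isinf() return boolean masks; summing counts the True entries
        "nan": int(t.isnan().sum().item()),
        "inf": int(t.isinf().sum().item()),
    }


# Made-up sample data: one NaN, one +Inf, one -Inf
sample = torch.tensor([1.0, float("nan"), float("inf"), -float("inf"), 0.0])
stats = count_nonfinite(sample)
print(stats)
```

Note that `isinf()` flags both positive and negative infinity, so the two are counted together here.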
I am facing the same issue when I try to convert a DeepSpeed checkpoint in ZeRO-3 (degree 16) to a universal checkpoint: concurrent.futures.process._RemoteTraceback: """ Traceback (most recent call last): File "/home/xxx/miniforge3/lib/python3.10/concurrent/futures/process.py",...
[nohup.out.txt](https://github.com/user-attachments/files/16956774/nohup.out.txt) I added these print statements to `ds_to_universal.py`; the output is in nohup.out.txt.