DeepDeWedge

PytorchStreamReader failed reading file data.pkl: file read failed

Open andreanans opened this issue 4 months ago • 0 comments

DeepDeWedge training very often crashes with seemingly random file read errors when running on multiple GPUs. For example:

"Error loading .//subtomos/fitting_subtomos/subtomo0/229.pt Error message is: PytorchStreamReader failed reading file data.pkl: file read failed"

Every time this happens, training waits for some time and then retries, again and again, until it can read the file and continue. But sometimes, after N attempts, the job just crashes. The file is not missing: it was written by a previous job, so this is not a file that failed to be written while the current job was running.
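To make the behaviour I am describing concrete, here is a minimal sketch of a retry loop with exponential backoff and jitter. It is not DeepDeWedge's actual code: `load_with_retry` and all of its parameters are hypothetical names, and `loader` stands in for whatever reads the file (for example `torch.load`, which reports this kind of failure as a `RuntimeError`).

```python
import random
import time


def load_with_retry(path, loader, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call loader(path), retrying with exponential backoff on failure.

    Hypothetical helper for illustration only; `loader` could be torch.load.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return loader(path)
        except (OSError, RuntimeError) as err:
            last_error = err
            if attempt == attempts - 1:
                break  # no point sleeping after the final attempt
            # Exponential backoff with jitter, so that several workers
            # do not all retry the same file at the same moment.
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            sleep(delay)
    raise RuntimeError(
        f"could not load {path} after {attempts} attempts"
    ) from last_error
```

With a loop of this shape, a transient read failure is absorbed by the retries, while a persistent one surfaces as a crash once `attempts` is exhausted, which matches what I am seeing.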

Is this caused by multiple GPU processes trying to access the same file at the same time? Is there anything we can do to avoid this error?
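One check that might help narrow this down is to test whether a failing file is actually damaged on disk or only fails to read intermittently. The sketch below assumes the `.pt` files were written by `torch.save` in its default zip-based format, in which case the standard library can verify the archive without PyTorch. `check_pt_archive` is a hypothetical helper name, not part of DeepDeWedge or PyTorch.

```python
import zipfile


def check_pt_archive(path):
    """Return (ok, detail) for a file assumed to be a zip-format torch.save file."""
    if not zipfile.is_zipfile(path):
        return False, "not a zip archive (truncated, or saved in the legacy format)"
    try:
        with zipfile.ZipFile(path) as zf:
            # The error message mentions data.pkl, so confirm the record exists.
            if not any(name.endswith("data.pkl") for name in zf.namelist()):
                return False, "no data.pkl record in archive"
            # testzip() reads every record and verifies its CRC;
            # it returns the first bad record name, or None if all pass.
            bad_record = zf.testzip()
            if bad_record is not None:
                return False, f"corrupt record: {bad_record}"
    except zipfile.BadZipFile as err:
        return False, f"bad zip file: {err}"
    return True, "ok"
```

If this reports `(True, "ok")` for a file that training failed to load, the file itself is probably intact and the failure is more likely transient (for example, related to the filesystem or to concurrent access) than a result of corruption.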

andreanans · Oct 03 '25 13:10