Sporadic error during training
Hi, I often get this error during training:
"/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_py
thon3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/mrcfile/mrcinterpret
er.py", line 192, in _read_header
raise ValueError("Couldn't read enough bytes for MRC header")
ValueError: Couldn't read enough bytes for MRC header
which is then followed by this error:
FileNotFoundError: [Errno 2] No such file or directory: 'home/recons/ddw/8/logs/version_0/checkpoints/epoch/epoch=999.ckpt'
The jobs then continue training after we set resume_from_checkpoint: ".." to the last val_loss checkpoint. The weird thing is that not all jobs fail like this; it happens only sporadically, sometimes very early in the process, sometimes later, and sometimes never. After re-running once (or multiple times) they eventually finish (and produce amazing tomograms! :D)
Thanks for the help :) Cheers, Philippe
Just wanted to note that the second error happens only because we have the prediction command (ddw refine-tomogram) in our script: if training fails, the checkpoint expected for prediction is not there, so prediction fails too.
The real error is the first one and indeed it seems to happen at random 🙁
Hello Philippe and Ricardo,
thanks for opening this issue!
The error you describe sounds like something is wrong with reading and/or writing the model input and target subtomos as .mrc files. Unfortunately, I have no idea what that might be, and the fact that it happens at random makes it hard to debug, but I will try my best!
Is the error message you shared the complete one? If not, could you please share the full message? That might help me find a place to start.
Best, Simon
Hi Simon,
Thanks for looking into it nevertheless! The entire error is this:
ValueError: Caught ValueError in DataLoader worker process 9.
Original Traceback (most recent call last):
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_python3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_python3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_python3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/DeepDeWedge/ddw/utils/subtomo_dataset.py", line 61, in __getitem__
    subtomo0 = load_mrc_data(subtomo0_file)
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/DeepDeWedge/ddw/utils/mrctools.py", line 12, in load_mrc_data
    with mrcfile.open(mrc_file, permissive=True) as mrc:
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_python3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/mrcfile/load_functions.py", line 139, in open
    return NewMrc(name, mode=mode, permissive=permissive,
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_python3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/mrcfile/mrcfile.py", line 115, in __init__
    self._read(header_only)
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_python3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/mrcfile/mrcfile.py", line 131, in _read
    super(MrcFile, self)._read(header_only)
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_python3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/mrcfile/mrcinterpreter.py", line 170, in _read
    self._read_header()
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_python3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/mrcfile/mrcinterpreter.py", line 192, in _read_header
    raise ValueError("Couldn't read enough bytes for MRC header")
ValueError: Couldn't read enough bytes for MRC header
and actually I just saw another one, which seems to originate from the same problem:
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_python3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_python3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/Miniconda3/miniconda3_python3.12.1/envs/DeepDeWedge-ub/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/DeepDeWedge/ddw/utils/subtomo_dataset.py", line 61, in __getitem__
    subtomo0 = load_mrc_data(subtomo0_file)
  File "/scicore/projects/scicore-p-structsoft/ubuntu/software/DeepDeWedge/ddw/utils/mrctools.py", line 14, in load_mrc_data
    data = torch.tensor(mrc.data)
RuntimeError: Could not infer dtype of NoneType
Cheers, Philippe
Hi Philippe,
thank you for sharing the full error messages. The problem is indeed related to reading and writing the headers of the subtomogram .mrc files. For that, DDW uses the mrcfile package, and some research online shows that other people have already had similar-seeming issues with mrcfile and headers. I will look into this further today and tomorrow and will let you know as soon as I have made any progress.
In the meantime, if you are still using DDW and encounter another mrc-header-related error, it would be great if you could find out which subtomogram is causing the error and send me the corrupted file. This would be helpful for debugging, since I don't know what exactly is causing the header error.
To find the corrupted subtomogram after the error has occurred, you could for example try to open all subtomograms in your DDW fitting data directory (e.g. PATH_TO_DDW_PROJECT/subtomos/fitting_subtomos/) in Python using the mrcfile package; at least one of them should give you a header error. Let me know if you need help with this.
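A minimal sketch of what I have in mind (PATH_TO_DDW_PROJECT is a placeholder, and the check simply mirrors how DDW opens the files):

import glob
import mrcfile

# Try to open every fitting subtomo the same way DDW does and report any file
# whose header or data cannot be read.
for path in sorted(glob.glob("PATH_TO_DDW_PROJECT/subtomos/fitting_subtomos/**/*.mrc", recursive=True)):
    try:
        with mrcfile.open(path, permissive=True) as mrc:
            if mrc.data is None:
                print(f"Header/data problem in {path}")
    except Exception as err:
        print(f"Could not read {path}: {err}")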
Best, Simon
Hi Philippe and Ricardo,
just a quick heads up: After thinking about this issue for a while, I think that the cleanest solution would be to save the subtomograms as .pt files using PyTorch, which gets rid of .mrc headers altogether.
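Roughly, the idea is this (illustrative sketch only; the tensor size and file name are made up, and this is not the exact code on the new branch):

import torch

# Saving the subtomogram tensor directly avoids any MRC header.
subtomo = torch.randn(96, 96, 96)    # stand-in for a real subtomogram
torch.save(subtomo, "0.pt")          # instead of writing an .mrc file
restored = torch.load("0.pt")        # instead of load_mrc_data(...)
assert torch.equal(subtomo, restored)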
I have already implemented this new strategy in a separate branch and am currently running a sanity check model fitting to make sure that the changes have not broken anything. Once the check is successful, you can test the changes on your data and if your problem is solved, I will merge the changes into main.
Best, Simon
Dear Simon,
Thanks a lot! The .pt approach seems like a good fix! Happy to test it as soon as it passes the checks.
I tried to read all the fitting_subtomos as well as the val_subtomos from a job that encountered the error above. However, the error did not pop up and I was able to read all the .mrc files in Python without any problems...
Kind regards, Philippe
Dear Philippe,
I have just pushed the version with the .pt subtomos to a new branch called torch_subtomos. You can test this version by switching to the torch_subtomos branch and reinstalling DDW (re-running pip install . inside the DeepDeWedge/ directory should be sufficient). If your installation was successful, all subtomos should be saved with a .pt extension.
If this fix has resolved your issue, I will merge the changes into the main branch.
Please let me know if you have any questions or problems!
Best, Simon
Thanks a lot Simon! We will test it.
If I may give some extra feedback on this: as far as we know, the issue happens at random, which I guess is related to the very intensive I/O when dealing with many subtomogram files. I'm not sure whether just changing the file format will prevent this from happening again. My suspicion is some thermal fluctuation on the GPUs generates weird numbers (like NaN or None) in the subtomograms during fitting, which ultimately causes the MRC header calculation to fail. Another possibility is related to our filesystem: the script we normally use to run ddw copies the entire "project" to a local SSD on the compute node to optimize performance, and it could be some random failure in this process. Now I'm curious to see whether the problem also happens when not using the SSD but only the cluster parallel filesystem. The drawback is that this will certainly be slower. Anyway, happy to test your solution!
My suspicion is some thermal fluctuation on the GPUs generates weird numbers (like NaN or None) in the subtomograms during fitting, which ultimately causes the MRC header calculation to fail.
That's an interesting point! If we indeed get random NaN or None values, I would probably have to implement some kind of checking routine that tries to load sub-tomograms and re-saves them if loading fails or something in that direction.
Anyway, happy to test your solution!
I am very curious about the outcome of your tests! Let me know if you have any problems.
I have also had this error with fit-model in the last week. My scripts overwrote the log file, but when I ran on the same tomogram several times, it happened at various epochs between 250 and 850. Most of the errors happened closer to epoch 850 than to 250, which I think suggests that it's a random event that has a chance of happening every epoch (maybe more likely as time goes on). I'm interested to hear if the new branch fixed it.
I previously ran without any training issues on a different dataset, but with refine-tomogram I kept getting an "Errno 28: No space left on device" error. I managed to fix that by setting TMPDIR (in our Slurm system) to be in my filespace on our GPFS storage system instead of on the node.
Thanks for sharing @RHennellJames! The fact that you get the same error suggests that the issue does not only occur for a certain hardware setup, which is good to know.
@Phaips @rdrighetto did switching to the torch subtomos resolve the issue in your experiments?
Hi @SimWdm!
We finally got to test the torch_subtomos branch. The good news is that it works. However, in a comparison with the vanilla code based on MRC subtomograms with identical settings, including the same random seed and running on the same hardware, the original code was noticeably faster (461min vs 521min when fitting for 1000 epochs). Also, while both results look good (predicting using the model from the last epoch), they are not identical, which puzzles me. They should look exactly the same, no? Am I missing something?
To summarize, the new code works, but it seems to be slower and we cannot really say whether it solves the issue, given the random nature of the problem. We will be on lab retreat for the next couple days but I'd be happy to share more details once we're back. In any case, thanks a lot for looking into this!
Thanks for the update @rdrighetto. I will ask our Scientific Computing manager to install it here and see if it solves the problem for me as well.
Best, Rory
Thanks for testing the fix @rdrighetto!
However, in a comparison with the vanilla code based on MRC subtomograms with identical settings, including the same random seed and running on the same hardware, the original code was noticeably faster (461min vs 521min when fitting for 1000 epochs).
The longer runtime could be due to less efficient reading/writing of the torch subtomos. I have to test this in an isolated environment to be on the safe side, and I will get back to you once I know more!
Also, while both results look good (predicting using the model from the last epoch), they are not identical, which puzzles me. They should look exactly the same, no? Am I missing something?
That is another interesting observation! As the seed was the same in both runs, I would have expected identical results as well! Do training and validation curves look different as well? Have you tried to exactly reproduce results with the .mrc subtomos before? If so, was it successful?
We will be on lab retreat for the next couple days but I'd be happy to share more details once we're back. In any case, thanks a lot for looking into this!
Have a good time at the lab retreat! I am sure we'll eventually sort out all these subtomo-related problems! 🙂 🙏🏼
It seems that the torch branch also fixed the problem for me, at least on the dataset where I had the issue before.
Thanks very much for sorting this!
That's great news @RHennellJames! Was the model fitting slower for you as well?
Hi @SimWdm, the per-epoch time looked to be the same. I'm not sure if the total time was longer as it never got to completion for this dataset with the old code
Hi @SimWdm,
Do training and validation curves look different as well?
They look very similar but not identical:
[screenshot: training/validation curves, vanilla]
[screenshot: training/validation curves, torch_subtomos]
Are there any random factors playing a role in the fitting process that are outside the scope of the enforced random seed?
Have you tried to exactly reproduce results with the .mrc subtomos before? If so, was it successful?
I am carrying out this test right now, both for the original .mrc subtomos as well as the torch_subtomos, will report back soon 😉
Thanks for the update @rdrighetto!
Are there any random factors playing a role in the fitting process that are outside the scope of the enforced random seed?
Initially, I thought that everything should be seeded, but your question made me check again, and I noticed that the random rotations of the sub-tomograms during model fitting are indeed not seeded (see ddw.utils.subtomo_dataset.SubtomoDataset).
This should be easy to fix. For clarity, I will open another issue on reproducibility and link it to this one. 🤓
Thank you for paying close attention to your experiments, which helps uncover these nasty little flaws! 🙂
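Just to illustrate the direction I have in mind (purely a sketch, not the actual SubtomoDataset code), the rotations could be drawn from a dedicated, seeded generator instead of the global RNG:

import torch

class SeededRotation:
    # Illustrative augmentation: random 90-degree rotations drawn from a seeded RNG,
    # so the same seed always produces the same sequence of rotations.
    def __init__(self, seed=42):
        self.generator = torch.Generator().manual_seed(seed)

    def __call__(self, subtomo):
        # subtomo: (D, H, W) tensor; rotate by a random multiple of 90 degrees
        k = int(torch.randint(0, 4, (1,), generator=self.generator))
        return torch.rot90(subtomo, k=k, dims=(-2, -1))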
Edit: The new issue is here https://github.com/MLI-lab/DeepDeWedge/issues/9#issue-2511502473
I ran each test (MRC vs. torch subtomos) identically one more time, and in this second run the MRC subtomos ran into the sporadic error while the torch_subtomos branch ran to completion. So now I'm more confident that the torch subtomos are more stable than the MRC files (though to be honest not yet 100% sure). Regarding performance, the torch subtomos now gave the best performance overall: 513min, so I don't think there's any fundamental bottleneck compared to the old MRC subtomo code -- probably just random fluctuations of our hardware :-)
Regarding the reproducibility, I do have an interesting observation but will continue that discussion in #9.
Thanks again @SimWdm for the prompt responses here!
Let me summarise: As observed by @RHennellJames and @rdrighetto (thanks again to both of you!), the torch subtomo solution does not seem to cause any slowdowns and has not caused any crashes during data loading so far. So I think it should be safe to merge the subtomo branch into main.
Any objections? 🙂
Sounds good to me!
Hi, first of all, thanks so much for this discussion - it really helped me make sense of the errors I was getting!
Secondly, I wanted to share my experience: I tried both the MRC and Torch subtomo approaches, and both worked well in general. However, I found that the Torch subtomo solution succeeded for a tomogram where the MRC subtomo approach had previously failed due to the header error you described. I think I agree with @rdrighetto's assessment - it may be an I/O issue causing the MRC error.
That said, I ran into similar issues with a different dataset while using the .pt subtomos. The process ran for 779 epochs before failing, and I was able to restart from the last checkpoint and push it to completion, but it shows that a similar error is still possible with the Torch subtomo approach. Here's the relevant part of the error message, which to me feels like it may be related to I/O issues:
RuntimeError: Caught RuntimeError in DataLoader worker process 1.
Original Traceback (most recent call last):
  File "/home/dacostac/.conda/envs/ddw_torch/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/home/dacostac/.conda/envs/ddw_torch/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/dacostac/.conda/envs/ddw_torch/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/dacostac/.conda/envs/ddw_torch/lib/python3.10/site-packages/ddw/utils/subtomo_dataset.py", line 60, in __getitem__
    subtomo0 = torch.load(subtomo0_file)
  File "/home/dacostac/.conda/envs/ddw_torch/lib/python3.10/site-packages/torch/serialization.py", line 1072, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "/home/dacostac/.conda/envs/ddw_torch/lib/python3.10/site-packages/torch/serialization.py", line 480, in __init__
    super().__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
Epoch 779: 100%|██████████| 116/116 [01:51<00:00, 1.04it/s, loss=30, v_num=0, fitting_loss=29.10, val_loss=29.00]
real    951m23.637s
user    8650m1.955s
sys     198m50.308s
Apologies @SimWdm - I just saw that you merged the .pt subtomos approach to master as I was writing this. I am still inclined to agree that it probably makes sense - but worth keeping in mind the issue may not be completely gone :)
Thanks for sharing your experience @TomasPascoa! No reason to apologize: if the error persists (although in a different form), the issue should be kept open, so I have re-opened it.
Any help is highly appreciated! 😅
@TomasPascoa confirms that my fears have come true 😅
Hi @SimWdm, thanks for your help! Unfortunately, I continue to experience problems similar to @TomasPascoa, so I just wanted to give you another data point. I have been trying to debug this with limited success. I printed out the path to the subtomo that causes this issue and can't seem to find any problems with it. It is the correct size, and I can load it and print out values if I load the subtomo by hand in a Python shell. I will let you know if I manage to solve the issue, but it does seem to be quite random.
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/pscratch/sd/h/hnjones/ddw_final/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/pscratch/sd/h/hnjones/ddw_final/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/pscratch/sd/h/hnjones/ddw_final/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/pscratch/sd/h/hnjones/ddw_final/lib/python3.10/site-packages/torch/utils/data/dataset.py", line 335, in __getitem__
    return self.datasets[dataset_idx][sample_idx]
  File "/pscratch/sd/h/hnjones/ddw_final/lib/python3.10/site-packages/ddw/utils/subtomo_dataset.py", line 61, in __getitem__
    subtomo0 = torch.load(subtomo0_file)
  File "/pscratch/sd/h/hnjones/ddw_final/lib/python3.10/site-packages/torch/serialization.py", line 1005, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "/pscratch/sd/h/hnjones/ddw_final/lib/python3.10/site-packages/torch/serialization.py", line 457, in __init__
    super().__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
Thank you for joining the discussion and for looking into this @henrynjones!
I have just created a new branch 8-sporadic-error-during-training in which I replaced the standard torch.load in the dataloader with a safe_load function that tries to load the subtomo and, if a RuntimeError occurs, waits for one second and then tries again:
import time

import torch


def safe_load(file_path, max_retries=3, delay=1):
    # Load a .pt subtomo, retrying a few times if a RuntimeError occurs.
    for attempt in range(max_retries):
        try:
            return torch.load(file_path)
        except RuntimeError as e:
            print(f"Error loading {file_path}")
            if attempt == max_retries - 1:
                raise e  # re-raise if this was the last attempt
            print(f"Retrying in {delay} seconds")
            time.sleep(delay)  # wait before retrying
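On the new branch, this roughly amounts to replacing the direct torch.load call in SubtomoDataset.__getitem__ (sketch, not the exact diff):

subtomo0 = safe_load(subtomo0_file)  # previously: subtomo0 = torch.load(subtomo0_file)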
While this is definitely a bit hacky, I think it might work because, as @henrynjones reports, the subtomo that causes the error during model fitting can be safely loaded later on.
@henrynjones, @rdrighetto, @Phaips, @TomasPascoa Please feel free to try this hotfix on the new 8-sporadic-error-during-training branch.
I suggest writing all command line output to a log file (e.g. ddw fit-model [YOUR ARGS HERE] > log.txt) so that we can see whether safe_load caught any RuntimeError.
Let me know if you run into any issues with the new branch!
Best, Simon 😊
@SimWdm I have started testing the new branch and am running into some issues, but it is hard for me to disentangle what might be a problem with my configuration (I am using the NERSC HPC system). I usually seem to get this error within the update_subtomo_missing_wedge call (for debugging I made this happen after every epoch), but here I don't get a traceback at all, which is strange. It also seems strange that it only failed once and then seemed to continue until something found a problem. I also changed the print statements in safe_load to have flush=True to make sure I see things in time. I again checked that the problem file can be loaded without issues in a shell. Note that this is after the second epoch (epoch 1), so I fully completed an earlier update_subtomo_missing_wedge call after epoch 0.
Epoch 1, global step 216: 'fitting_loss' reached 392410.87500 (best 392410.87500), saving model to 'data_prep/logs/version_2/checkpoints/fitting_loss/epoch=1-fitting_loss=392410.87500.ckpt' as top 5
Epoch 1, global step 216: 'val_loss' reached 389776.81250 (best 389776.81250), saving model to 'data_prep/logs/version_2/checkpoints/val_loss/epoch=1-val_loss=389776.81250-v1.ckpt' as top 3
^MUpdating subtomo missing wedges: 0%| | 0/135 [00:00<?, ?it/s]^[[A^MUpdating subtomo missing wedges: 0%| | 0/135 [00:21<?, ?it/s]
^MUpdating subtomo missing wedges: 0%| | 0/135 [00:00<?, ?it/s]^[[A
^MUpdating subtomo missing wedges: 0%| | 0/135 [00:00<?, ?it/s]^[[A^MUpdating subtomo missing wedges: 0%| | 0/135 [00:28<?, ?it/s]
^MUpdating subtomo missing wedges: 1%| | 1/135 [00:40<1:31:06, 40.79s/it]^[[A
^M
^[[A^MEpoch 1: 100%|â<96><88>â<96><88>â<96><88>â<96><88>â<96><88>â<96><88>â<96><88>â<96><88>â<96><88>â<96><88>| 135/135 [14:23<00:00, 6.40s/it, loss=3.9e+05, v_num=2, val_loss=3.9e+5, fitting_loss=3.92e+5]
Error loading ./data_prep/subtomos/fitting_subtomos/subtomo0/7.pt
Retrying in 1 seconds
^MUpdating subtomo missing wedges: 1%|â<96><8f> | 2/135 [00:46<44:24, 20.03s/it] ^[[A
^MUpdating subtomo missing wedges: 2%|â<96><8f> | 3/135 [00:52<29:57, 13.61s/it]^[[A
^MUpdating subtomo missing wedges: 3%|â<96><8e> | 4/135 [00:57<22:46, 10.43s/it]^[[A
^MUpdating subtomo missing wedges: 0%| | 0/135 [00:00<?, ?it/s]^[[A
^MUpdating subtomo missing wedges: 4%|â<96><8e> | 5/135 [01:03<18:49, 8.69s/it]^[[A
^MUpdating subtomo missing wedges: 1%| | 1/135 [00:40<1:30:01, 40.31s/it]^[[A
^MUpdating subtomo missing wedges: 4%|â<96><8d> | 6/135 [01:08<16:15, 7.56s/it]^[[A
^MUpdating subtomo missing wedges: 1%|â<96><8f> | 2/135 [00:45<44:01, 19.86s/it] ^[[A
^MUpdating subtomo missing wedges: 5%|â<96><8c> | 7/135 [01:13<14:09, 6.64s/it]^[[A
^MUpdating subtomo missing wedges: 2%|â<96><8f> | 3/135 [00:51<29:26, 13.39s/it]^[[A
^MUpdating subtomo missing wedges: 6%|â<96><8c> | 8/135 [01:18<12:58, 6.13s/it]^[[A
^MUpdating subtomo missing wedges: 7%|â<96><8b> | 9/135 [01:23<12:01, 5.73s/it]^[[A
^MUpdating subtomo missing wedges: 3%|â<96><8e> | 4/135 [00:57<22:39, 10.38s/it]^[[A
^MUpdating subtomo missing wedges: 7%|â<96><8b> | 10/135 [01:28<11:22, 5.46s/it]^[[A
^MUpdating subtomo missing wedges: 4%|â<96><8e> | 5/135 [01:02<18:45, 8.66s/it]^[[A
^MUpdating subtomo missing wedges: 8%|â<96><8a> | 11/135 [01:33<11:04, 5.36s/it]^[[A
^MUpdating subtomo missing wedges: 4%|â<96><8d> | 6/135 [01:08<16:20, 7.60s/it]^[[A
^MUpdating subtomo missing wedges: 9%|â<96><89> | 12/135 [01:38<10:35, 5.17s/it]^[[A
^MUpdating subtomo missing wedges: 5%|â<96><8c> | 7/135 [01:13<14:14, 6.68s/it]^[[A
^MUpdating subtomo missing wedges: 10%|â<96><89> | 13/135 [01:42<10:07, 4.98s/it]^[[A
^MUpdating subtomo missing wedges: 6%|â<96><8c> | 8/135 [01:18<12:53, 6.09s/it]^[[A
^MUpdating subtomo missing wedges: 10%|â<96><88> | 14/135 [01:47<09:46, 4.85s/it]^[[A
^MUpdating subtomo missing wedges: 7%|â<96><8b> | 9/135 [01:23<12:03, 5.74s/it]^[[A
^MUpdating subtomo missing wedges: 11%|â<96><88> | 15/135 [01:51<09:25, 4.72s/it]^[[A
^MUpdating subtomo missing wedges: 7%|â<96><8b> | 10/135 [01:28<11:31, 5.53s/it]^[[A
^MUpdating subtomo missing wedges: 12%|â<96><88>â<96><8f> | 16/135 [01:56<09:15, 4.67s/it]^[[A
^MUpdating subtomo missing wedges: 8%|â<96><8a> | 11/135 [01:33<11:13, 5.43s/it]^[[A
^MUpdating subtomo missing wedges: 13%|â<96><88>â<96><8e> | 17/135 [02:00<09:02, 4.59s/it]^[[A
Aborted.
Aborted.
^MEpoch 1: 100%|â<96><88>â<96><88>â<96><88>â<96><88>â<96><88>â<96><88>â<96><88>â<96><88>â<96><88>â<96><88>| 135/135 [17:09<00:00, 7.62s/it, loss=3.9e+05, v_num=2, val_loss=3.9e+5, fitting_loss=3.92e+5]
^M
^[[A^MEpoch 1: 100%|â<96><88>â<96><88>â<96><88>â<96><88>â<96><88>â<96><88>â<96><88>â<96><88>â<96><88>â<96><88>| 135/135 [14:26<00:00, 6.42s/it, loss=3.9e+05, v_num=2, val_loss=3.9e+5, fitting_loss=3.92e+5]
^MEpoch 1: 100%|â<96><88>â<96><88>â<96><88>â<96><88>â<96><88>â<96><88>â<96><88>â<96><88>â<96><88>â<96><88>| 135/135 [17:10<00:00, 7.63s/it, loss=3.9e+05, v_num=2, val_loss=3.9e+5, fitting_loss=3.92e+5]
^MUpdating subtomo missing wedges: 9%|â<96><89> | 12/135 [01:37<10:37, 5.18s/it]^[[A
srun: error: nid001469: task 2: Exited with exit code 1
srun: Terminating StepId=32196346.0
^MUpdating subtomo missing wedges: 13%|â<96><88>â<96><8e> | 18/135 [02:05<08:54, 4.57s/it]^[[A
srun: error: nid001469: task 0: Exited with exit code 1
slurmstepd: error: *** STEP 32196346.0 ON nid001469 CANCELLED AT 2024-10-25T21:15:51 ***
Thanks for testing the fix so quickly, @henrynjones!
From the output, it looks like safe_load successfully prevented the RuntimeError we saw before, but another error seems to occur further down, without showing a clear error message. Is that correct?
Unfortunately, I don’t have experience with NERSC HPC. It looks like the system uses SLURM, though. Would it be possible for you to test the fix directly on a server, without using SLURM?
One other thing I noticed in your output is that the progress bar doesn’t seem to appear in chronological order and includes some strange characters. Did you see the same behavior with the code in the main branch, or is this new?