nnUNetv2_plan_and_preprocess stops without giving any error message
I have started arranging my CT dataset for training a segmentation task using nnUNet on a Debian GNU/Linux 11 (bullseye) VM. However, nnUNetv2_plan_and_preprocess halts when I try to verify the dataset integrity. No error message is given, and after many hours I had to press Ctrl+Z to get the command-line prompt back. Does anyone have an idea about what is halting this program?
```
nnUNetv2_plan_and_preprocess -d 3 --verify_dataset_integrity
Fingerprint extraction...
Dataset003_HeartTest
Using <class 'nnunetv2.imageio.simpleitk_reader_writer.SimpleITKIO'> as reader/writer
```
My dataset is organized as follows:
```
(3dconvnet) ono@ct-model-training:~/data/nnUNet_raw/Dataset003_HeartTest$ ls -la
total 28
drwxr-xr-x 5 ono ono 4096 Mar  5 09:39 .
drwxr-xr-x 3 ono ono 4096 Mar  4 12:07 ..
-rw-r--r-- 1 ono ono 5348 Mar  4 12:36 dataset.json
drwxr-xr-x 2 ono ono 4096 Mar  4 12:07 imagesTr
drwxr-xr-x 2 ono ono 4096 Mar  5 09:39 imagesTs
drwxr-xr-x 2 ono ono 4096 Mar  4 12:07 labelsTr
(3dconvnet) ono@ct-model-training:~/data/nnUNet_raw/Dataset003_HeartTest$ pwd
/home/ono/data/nnUNet_raw/Dataset003_HeartTest
(3dconvnet) ono@ct-model-training:~/data/nnUNet_raw/Dataset003_HeartTest$ head -30 dataset.json
{
    "channel_names": {
        "0": "CT"
    },
    "labels": {
        "background": 0,
        "heart": 1,
        "lv": 2
    },
    "numTraining": 20,
    "file_ending": ".nii.gz",
    "training": [
        {
            "image": "./imagesTr/AHFP_Round_4_Control_1_7569_Norsvin_AHFP_Heart_with_contrast_back_IM00005_0000.nii.gz",
            "label": "./labelsTr/AHFP_Round_4_Control_1_7569_Norsvin_AHFP_Heart_with_contrast_back_IM00005.nii.gz"
        },
        {
            ...
```
I encountered a similar issue before, though it might have had a different cause. My run also seemed stuck, but after a few hours it finished correctly. When debugging this, I pressed Ctrl+C to interrupt the running scripts; you could try the same and see where the code is stuck. From my experience, these preprocessing scripts use multiprocessing, and you can add print statements in each worker function (wherever the code gets stuck). If you are comfortable reading Python code, you can try my method to watch the progress of preprocessing (to see whether it is slow because your dataset is big, or for some other reason).
Thanks for your suggestion, @jackhu-bme! When manually stopping the code after a couple of hours, it seems like it's stuck in the multiprocessing within verify_dataset_integrity.py (see the screen dump below), so there appears to be a problem with the multiprocessing. If I hard-code num_processes=1 at the beginning of the verify_dataset_integrity() function, the code continues, but I then get a RuntimeError ("One of your background processes is missing") printed from fingerprint_extractor.py.
```
(3dconvnet) ono@ct-model-training:~$ nnUNetv2_plan_and_preprocess -d 3 --verify_dataset_integrity
Fingerprint extraction...
Dataset003_HeartTest
Using <class 'nnunetv2.imageio.simpleitk_reader_writer.SimpleITKIO'> as reader/writer
^CProcess SpawnPoolWorker-18:
Process SpawnPoolWorker-19:
Process SpawnPoolWorker-20:
Process SpawnPoolWorker-13:
Process SpawnPoolWorker-14:
Process SpawnPoolWorker-17:
Process SpawnPoolWorker-16:
Process SpawnPoolWorker-21:
Traceback (most recent call last):
  File "/home/ono/.virtualenvs/3dconvnet/bin/nnUNetv2_plan_and_preprocess", line 8, in <module>
  File "/home/ono/repos/ct-segmentation/nnUNet/nnunetv2/experiment_planning/plan_and_preprocess_api.py", line 30, in extract_fingerprint_dataset
    verify_dataset_integrity(join(nnUNet_raw, dataset_name), num_processes)
  File "/home/ono/repos/ct-segmentation/nnUNet/nnunetv2/experiment_planning/verify_dataset_integrity.py", line 204, in verify_dataset_integrity
    result = p.starmap(
```
This is the same point where the run on my dataset took quite a long time. You can look at this function and add some prints to see what is happening!
Go to verify_dataset_integrity.py and look at the code inside verify_dataset_integrity(), around line 204:
```python
# check whether only the desired labels are present
with multiprocessing.get_context("spawn").Pool(num_processes) as p:
    result = p.starmap(
        verify_labels,
        zip(labelfiles, [reader_writer_class] * len(labelfiles), [expected_labels] * len(labelfiles))
    )
```
You can see that the script uses a multiprocessing pool to run the verify_labels function.
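If you are not familiar with p.starmap, here is a minimal, self-contained example of the same spawn-pool pattern (the file names and check_file function are made up for illustration; nothing here is nnU-Net code):

```python
import multiprocessing

def check_file(path: str, expected: list) -> bool:
    # stand-in for verify_labels: each worker gets one (path, expected) pair
    print(f"checking {path} against {expected}")
    return True

if __name__ == '__main__':
    files = ['a.nii.gz', 'b.nii.gz', 'c.nii.gz']
    expected = [0, 1, 2]
    with multiprocessing.get_context("spawn").Pool(2) as p:
        # starmap unpacks each tuple from zip(...) into check_file's arguments
        results = p.starmap(check_file, zip(files, [expected] * len(files)))
    print(results)  # [True, True, True]
```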
Then go to that function and print the progress (around line 33); see where I added the code below:
```python
def verify_labels(label_file: str, readerclass: Type[BaseReaderWriter], expected_labels: List[int]) -> bool:
    rw = readerclass()
    print(f"verifying {label_file}")  # added for debugging
    seg, properties = rw.read_seg(label_file)
    print(f"verifying {label_file} done, seg: {seg.shape}, properties: {properties}")  # added for debugging
    found_labels = np.sort(pd.unique(seg.ravel()))  # np.unique(seg)
    unexpected_labels = [i for i in found_labels if i not in expected_labels]
    if len(found_labels) == 1 and found_labels[0] == 0:  # exactly one label, and it is background
        print('WARNING: File %s only has label 0 (which should be background). This may be intentional or not, '
              'up to you.' % label_file)
    if len(unexpected_labels) > 0:
        print("Error: Unexpected labels found in file %s.\nExpected: %s\nFound: %s" % (label_file, expected_labels,
                                                                                       found_labels))
        return False
    return True
```
Then run nnUNetv2_plan_and_preprocess -d 3 --verify_dataset_integrity again and you will see the progress file by file; the last file that prints "verifying" without a matching "done" line is the one a worker is stuck on. If the checking runs correctly, set num_processes back to your CPU count; if not, dig further inside verify_labels.
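Depending on your nnU-Net version, you may also be able to shrink the worker pools directly from the command line instead of editing code; check nnUNetv2_plan_and_preprocess -h to confirm these flags exist in your install:

```
# -npfp: workers for fingerprint extraction, -np: workers for preprocessing
nnUNetv2_plan_and_preprocess -d 3 --verify_dataset_integrity -npfp 1 -np 1
```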
You can do this for all the multiprocessing functions in this repo. Also, you can ask GPT or DeepSeek to explain my answer if it seems a bit complex.
How large is your dataset? Maybe you could initially try to work with a subset (e.g. 1 or 5 volumes) and check how long it runs. With large ct volumes this might take some time.
Very good suggestion! You can try this and see whether it gets stuck. @oyvindnordbo
Does this also happen without the verify dataset integrity flag?
> How large is your dataset? Maybe you could initially try to work with a subset (e.g. 1 or 5 volumes) and check how long it runs. With large ct volumes this might take some time.
Thanks for a good suggestion! The dataset I have tested with is not very big, about 2.5 GB, and the program seems to halt at different points, somewhat arbitrarily. I cut the dataset down further, so it's now 0.5 GB, and the program gets further in the process. But it overloads the memory somewhere in the preprocessing, even though I have 30 GB of RAM available on the machine. When running nnUNetv2_plan_and_preprocess -d 4 --verify_dataset_integrity on this smaller dataset, I get a lot of output, but it ends up like this:
```
...
Configuration: 2d...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:31<00:00, 7.7
Configuration: 3d_fullres...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:28<00:00, 7.1
Configuration: 3d_lowres...
0%| | 0/4 [00:19<?,
Traceback (most recent call last):
  File "/home/ono/.virtualenvs/3dconvnet/bin/nnUNetv2_plan_and_preprocess", line 8, in <module>
```
> Does this also happen without the verify dataset integrity flag?
Hi, and thanks for helping me! When I run without the --verify_dataset_integrity flag, the program gets further than with the flag, but it seems to stop due to memory issues, even though the dataset is relatively small compared with the available RAM (2.5 GB vs. 30 GB). The end of the output is then (see also the memory-monitoring sketch after it):
```
Preprocessing dataset Dataset003_HeartTest
Configuration: 2d...
0%| | 0/20 [00:14<?, ?it/s]
Traceback (most recent call last):
  File "/home/ono/.virtualenvs/3dconvnet/bin/nnUNetv2_plan_and_preprocess", line 8, in <module>
```
Can you please try the current master? I think there was an issue with blosc2 saving with memmap enabled.

Best,
Fabian
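If you installed nnU-Net as an editable clone (the tracebacks above point at ~/repos/ct-segmentation/nnUNet), updating to the current master is roughly the following (a sketch; adjust the path to your setup):

```
cd ~/repos/ct-segmentation/nnUNet
git checkout master && git pull
pip install -e .
```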
Edit: Indeed, the issue appears to be what Fabian mentioned above. Quick and dirty solution: simply updating the file training/dataloading/nnunet_dataset.py fixed the problem.
I ran into this issue with one of the datasets I was working on (about 10 GB), on a cluster with more than 500 GB of RAM. I hadn't tried updating nnUNet to a newer version and just wanted a quick and dirty solution. Simply avoiding the "multiprocessing magic" worked for me.
Go to preprocessing/preprocessors/default_preprocessor.py, around L277, where you see the line "# multiprocessing magic", and change it like this, so that you can manually decide whether to use multiprocessing:
```python
if True:  # set to False to restore the original multiprocessing path
    for k in dataset.keys():
        print(k)
        self.run_case_save(join(output_directory, k), dataset[k]['images'], dataset[k]['label'],
                           plans_manager, configuration_manager, dataset_json)
else:
    # multiprocessing magic below
    ...
```
This is not ideal, but doing this would have saved me much more time than debugging. I hope it helps someone.
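If you want something slightly less hacky than `if True:`, you could gate on the number of workers instead, so a single-process run takes the sequential path and anything else keeps the original behaviour (a sketch; I'm assuming a num_processes variable is in scope at that point in your nnU-Net version). The sequential path also keeps only one case's arrays in memory at a time, which is presumably why it sidesteps the OOM kills:

```python
if num_processes <= 1:  # assumed in scope; sequential path keeps peak RAM low
    for k in dataset.keys():
        print(k)
        self.run_case_save(join(output_directory, k), dataset[k]['images'], dataset[k]['label'],
                           plans_manager, configuration_manager, dataset_json)
else:
    # original multiprocessing magic below
    ...
```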
Thanks for the update and glad it works now! Can I mark this as resolved?
Hi, I have the exact same problem with my full dataset. When I tried using only 3 sample volumes, verification went through just fine, but then after that:
```
Fingerprint extraction...
Dataset408_LTS
Using <class 'nnunetv2.imageio.simpleitk_reader_writer.SimpleITKIO'> as reader/writer

####################
verify_dataset_integrity Done.
If you didn't see any error messages then your dataset is most likely OK!
####################

Using <class 'nnunetv2.imageio.simpleitk_reader_writer.SimpleITKIO'> as reader/writer
100% 3/3 [00:23<00:00,  7.80s/it]
Experiment planning...

############################
INFO: You are using the old nnU-Net default planner. We have updated our recommendations. Please consider using those instead!
Read more here: https://github.com/MIC-DKFZ/nnUNet/blob/master/documentation/resenc_presets.md
############################

Attempting to find 3d_lowres config. Current spacing: [1.8 0.72421875 0.72421875]. Current patch size: (np.int64(80), np.int64(160), np.int64(160)). Current median shape: [287. 497.08737864 497.08737864]
Attempting to find 3d_lowres config. Current spacing: [1.8 0.74594531 0.74594531]. Current patch size: (np.int64(80), np.int64(160), np.int64(160)). Current median shape: [287. 482.60910548 482.60910548]
...
    raise RuntimeError('Some background worker is 6 feet under. Yuck. \n'
RuntimeError: Some background worker is 6 feet under. Yuck. OK jokes aside. One of your background processes is missing. This could be because of an error (look for an error message) or because it was killed by your OS due to running out of RAM. If you don't see an error message, out of RAM is likely the problem. In that case reducing the number of workers might help
```
For context, I was using Google Colab Pro. When processing the full dataset, the system RAM initially spikes, then decreases, and eventually stays flat while the run remains stuck at the line `Using <class 'nnunetv2.imageio.simpleitk_reader_writer.SimpleITKIO'>`.