[Bug] Error while running tools/train.py: _pickle.UnpicklingError: pickle data was truncated
Prerequisite
- [X] I have searched Issues and Discussions but cannot get the expected help.
- [X] I have read the FAQ documentation but cannot get the expected help.
- [X] The bug has not been fixed in the latest versions (dev-1.x or dev-1.0).
Task
I'm using the official example scripts/configs for the officially supported tasks/models/datasets.
Branch
main branch https://github.com/open-mmlab/mmdetection3d
Environment
System environment:
- sys.platform: linux
- Python: 3.8.19 (default, Mar 20 2024, 19:58:24) [GCC 11.2.0]
- CUDA available: True
- MUSA available: False
- numpy_random_seed: 793778121
- GPU 0: NVIDIA A100-PCIE-40GB
- CUDA_HOME: /usr/local/cuda
- NVCC: Cuda compilation tools, release 11.3, V11.3.58
- GCC: gcc (Ubuntu 8.4.0-1ubuntu1~18.04) 8.4.0
- PyTorch: 1.11.0
- PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) oneAPI Math Kernel Library Version 2023.1-Product Build 20230303 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.5.2 (Git Hash a9302535553c73243c632ad3c4c80beec3d19a1e)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.3
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.2
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.11.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF
- TorchVision: 0.12.0
- OpenCV: 4.10.0
- MMEngine: 0.10.4

Runtime environment:
- cudnn_benchmark: False
- mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
- dist_cfg: {'backend': 'nccl'}
- seed: 793778121
- Distributed launcher: none
- Distributed training: False
- GPU number: 1
Reproduces the problem - code sample
python tools/train.py configs/detection/mv-det3d_8xb4_embodiedscan-3d-284class-9dof.py --work-dir=work_dirs/mv-3ddet
Reproduces the problem - command or script
python tools/train.py configs/detection/mv-det3d_8xb4_embodiedscan-3d-284class-9dof.py --work-dir=work_dirs/mv-3ddet
Reproduces the problem - error message
09/06 03:16:31 - mmengine - WARNING - Failed to search registry with scope "embodiedscan" in the "loop" registry tree. As a workaround, the current "loop" registry in "mmengine" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "embodiedscan" is a correct scope, or whether the registry is initialized.
09/06 03:16:31 - mmengine - WARNING - euler-depth is not a meta file, simply parsed as meta information
Traceback (most recent call last):
File "tools/train.py", line 133, in <module>
main()
File "tools/train.py", line 129, in main
runner.train()
File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1728, in train
self._train_loop = self.build_train_loop(
File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1520, in build_train_loop
loop = LOOPS.build(
File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/registry/registry.py", line 570, in build
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
obj = obj_cls(**args) # type: ignore
File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/runner/loops.py", line 44, in __init__
super().__init__(runner, dataloader)
File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/runner/base_loop.py", line 26, in __init__
self.dataloader = runner.build_dataloader(
File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1370, in build_dataloader
dataset = DATASETS.build(dataset_cfg)
File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/registry/registry.py", line 570, in build
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
obj = obj_cls(**args) # type: ignore
File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/dataset/dataset_wrapper.py", line 223, in __init__
self.dataset = DATASETS.build(dataset)
File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/registry/registry.py", line 570, in build
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
obj = obj_cls(**args) # type: ignore
File "/root/wwf/EmbodiedScan/embodiedscan/datasets/embodiedscan_dataset.py", line 59, in __init__
super().__init__(ann_file=ann_file,
File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/dataset/base_dataset.py", line 247, in __init__
self.full_init()
File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/dataset/base_dataset.py", line 298, in full_init
self.data_list = self.load_data_list()
File "/root/wwf/EmbodiedScan/embodiedscan/datasets/embodiedscan_dataset.py", line 342, in load_data_list
data_info = self.parse_data_info(raw_data_info)
File "/root/wwf/EmbodiedScan/embodiedscan/datasets/embodiedscan_dataset.py", line 147, in parse_data_info
info['ann_info'] = self.parse_ann_info(info)
File "/root/wwf/EmbodiedScan/embodiedscan/datasets/embodiedscan_dataset.py", line 238, in parse_ann_info
occ_masks = mmengine.load(mask_filename)
File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/fileio/io.py", line 856, in load
obj = handler.load_from_fileobj(f, **kwargs)
File "/root/miniconda3/envs/embodiedscan/lib/python3.8/site-packages/mmengine/fileio/handlers/pickle_handler.py", line 12, in load_from_fileobj
return pickle.load(file, **kwargs)
_pickle.UnpicklingError: pickle data was truncated
Additional information
No response
It looks like the annotation file you downloaded is broken; try downloading it again.
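To confirm that the file itself is truncated (rather than something in the data pipeline going wrong), you can also try unpickling it directly. A minimal sketch; the path below is a placeholder, so substitute the mask_filename from your traceback:

```python
import pickle

# Placeholder path -- substitute the mask_filename from the traceback.
path = 'path/to/visible_occupancy.pkl'
with open(path, 'rb') as f:
    data = pickle.load(f)  # raises _pickle.UnpicklingError if the file is truncated
print(type(data))
```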
Thanks for your answer!
I re-downloaded the dataset you placed on Google Drive and re-ran the script extract_occupancy_ann.py; it reported that everything was fine, but training still fails with the same error.
I also noticed that the README under the data folder lists json files whose names start with embodiedscan_infos, while the files extracted from Google Drive start with embodiedscan. Does this matter? Do I have to rename these files?
By the way, I would also like to know whether these warnings are normal. If not, what should I do to get rid of them?
09/06 03:16:31 - mmengine - WARNING - Failed to search registry with scope "embodiedscan" in the "loop" registry tree. As a workaround, the current "loop" registry in "mmengine" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "embodiedscan" is a correct scope, or whether the registry is initialized.
09/06 03:16:31 - mmengine - WARNING - euler-depth is not a meta file, simply parsed as meta information
@Mintinson
Could you please provide the sample_idx of this scene?
Just replace

```python
occ_masks = mmengine.load(mask_filename)
```

with

```python
try:
    occ_masks = mmengine.load(mask_filename)
except Exception:
    # Report which scene owns the unreadable file, then re-raise.
    print(info['sample_idx'])
    raise
```

This helps us localize the problem.
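If more than one scene might be affected, a broader check is to walk the data directory and try loading every visible_occupancy.pkl in one pass. A minimal sketch, assuming the annotations sit under data/ as in this thread:

```python
import pickle
from pathlib import Path

# Try to unpickle every visible_occupancy.pkl under data/ and report
# the files that fail, e.g. because a download or extraction was truncated.
for pkl in sorted(Path('data').rglob('visible_occupancy.pkl')):
    try:
        with open(pkl, 'rb') as f:
            pickle.load(f)
    except Exception as e:
        print(f'{pkl}: {e}')
```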
Here is the output:
scannet/scene0031_00
Traceback (most recent call last):
...
and here is the structure of the corresponding scene:
location: data/scannet/scans/scene0031_00/
scene0031_00
├── occupancy
│ ├── occupancy.npy
│ └── visible_occupancy.pkl
├── scene0031_00_2d-instance-filt.zip
├── scene0031_00_2d-instance.zip
├── scene0031_00_2d-label-filt.zip
├── scene0031_00_2d-label.zip
├── scene0031_00.aggregation.json
├── scene0031_00.sens
├── scene0031_00.txt
├── scene0031_00_vh_clean_2.0.010000.segs.json
├── scene0031_00_vh_clean_2.labels.ply
├── scene0031_00_vh_clean_2.ply
├── scene0031_00_vh_clean.aggregation.json
├── scene0031_00_vh_clean.ply
└── scene0031_00_vh_clean.segs.json
1 directory, 15 files
location: data/scannet/scans/posed_images/scene0031_00/
scene0031_00
├── 00000.jpg
├── 00000.png
├── 00000.txt
├── 00010.jpg
├── ...
├── 02750.txt
├── depth_intrinsic.txt
└── intrinsic.txt
location: data/embodiedscan_occupancy/scannet/scene0031_00/
scene0031_00
├── occupancy.npy
└── visible_occupancy.pkl
@Mintinson
Could you please check the sha256 hash values of visible_occupancy.pkl and occupancy.npy?
The hash of visible_occupancy.pkl is 405f14770ab2126e24282977d5f897d1b35569bfea3f60431d63351def49ef3a and the hash of occupancy.npy is da1b32fd3753626401446669f6df3edd3530783e784a5edee01e56c78eb6b5d1.
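On Linux you can get these with `sha256sum <file>`; in Python, a minimal sketch (the paths are the ones from your tree above):

```python
import hashlib

def sha256sum(path, chunk_size=1 << 20):
    # Stream the file in chunks so large .pkl/.npy files need not fit in memory.
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

print(sha256sum('data/embodiedscan_occupancy/scannet/scene0031_00/visible_occupancy.pkl'))
print(sha256sum('data/embodiedscan_occupancy/scannet/scene0031_00/occupancy.npy'))
```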
Thank you so much for your help! I checked the hash of my visible_occupancy.pkl and it was indeed different from the hash of the copy inside embodiedscan_occupancy, so I deleted the occupancy folder in the raw data and ran the script again:
python embodiedscan/converter/extract_occupancy_ann.py --src data/embodiedscan_occupancy --dst data
This time the file has the correct hash! I'm not sure what went wrong the first time I extracted these annotations, but now train.py runs without reporting errors!
I would also like to ask how much RAM this project needs: when I run train.py, the process gets killed because it runs out of memory.
The memory problem is caused by the design of the mmengine dataloader, which copies the annotation files num_gpu * num_workers times. We are trying to fix this problem.
For a quick workaround, see #29 for details.
I tried the above solution but it didn't work. I am wondering whether 125 G of RAM is enough; if I need more, I would like to know now so that I can replace my server sooner.
It usually takes ~140 G of RAM on my server. Maybe you can try setting fewer dataloader workers in the config?
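For reference, the worker count lives in the dataloader section of the config. A hedged sketch of the kind of override meant here (the exact defaults are in the real config; the numbers below are illustrative):

```python
# Illustrative override, not the verbatim config: fewer workers means fewer
# duplicated in-memory copies of the annotations, and a smaller batch_size
# also lowers GPU memory.
train_dataloader = dict(
    batch_size=4,   # '8xb4' in the config name suggests 4 samples per GPU
    num_workers=2,  # try lowering this first to cut RAM usage
    # dataset=...   # unchanged from the original config
)
```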
I will try that. Thank you for your timely help~
I would like to ask why this project uses so much RAM: all the projects I have worked on before used less than 30 G while loading data, so why does this one reach hundreds of gigabytes? Also, what are the GPU memory requirements for this project, so that I can allocate hardware resources in advance?
I apologize for the RAM problem. We are working on fixing it.
For GPU memory, the default setting of the EmbodiedScan detection task, e.g. mv-det3d_8xb4_embodiedscan-3d-284class-9dof.py, requires ~20 G of GPU memory. It can be further reduced by decreasing the batch size.
PS: The default setting uses ~600 G of RAM in total. I'm sorry for the previous incorrect response.