Failed to compute features for scenario token | EOFError: Ran out of input
Describe the bug
Training is interrupted by the error "Failed to compute features for scenario token XXX in log XXX. Error: Ran out of input".
Looking at the closed issues, I thought this had already been fixed back in v0.4.
Setup
- Ubuntu 22.04 LTS
- 1x 8 core CPU
- 1x RTX 3060
- 32GB RAM
- Conda: same as environment.yaml
- nuplan-devkit release 1.2 (current master branch)
Steps To Reproduce
Steps to reproduce the behavior:
- Run command python nuplan/planning/script/run_training.py ...
- Interrupt training
- Run command python nuplan/planning/script/run_training.py ... again, specifying the same cache directory
- Sometimes this error appears (rather annoying when it's towards the end of an epoch...)
Stack Trace
(nuplan) tk@tk-ubuntu:~/nuplan/nuplan-devkit$ python nuplan/planning/script/run_training.py group=/home/tk/nuplan/my_experiments/experiment_v014_resnet_more cache.cache_path=/home/tk/nuplan/my_experiments/cache experiment_name=training_raster_experiment job_name=train_default_raster py_func=train +training=training_raster_model scenario_builder=nuplan_mini scenario_filter.limit_total_scenarios=32000 lightning.trainer.params.accelerator=ddp lightning.trainer.params.max_epochs=4 lightning.trainer.checkpoint.resume_training=true data_loader.params.batch_size=80 data_loader.params.num_workers=8 logger_level=warning optimizer.lr=8e-5 lr_scheduler=multistep_lr lr_scheduler.milestones=[1,2,4,8,12,16] warm_up_lr_scheduler=linear_warm_up worker.threads_per_node=8
Global seed set to 0
2023-05-26 17:33:46,871 INFO worker.py:1625 -- Started a local Ray instance.
Ray objects: 100%|████████████████████████████████████████████████████████████████████████| 8/8 [00:02<00:00, 3.01it/s]
/train_default_raster/2023.05.26.17.15.46/checkpoints/epoch=0.ckpt
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Using native 16bit precision.
Global seed set to 0
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All DDP processes registered. Starting ddp with 1 processes
----------------------------------------------------------------------------------------------------
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name | Type | Params
--------------------------------------
0 | model | RasterModel | 25.3 M
--------------------------------------
17.0 M Trainable params
8.3 M Non-trainable params
25.3 M Total params
101.105 Total estimated model params size (MB)
/home/tk/anaconda3/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/trainer/callback_hook.py:307: LightningDeprecationWarning: `Callback.on_load_checkpoint` signature has changed in v1.3. `trainer` and `pl_module` parameters have been added. Support for the old signature will be removed in v1.5
rank_zero_deprecation(
Restored states from the checkpoint file at /home/tk/nuplan/my_experiments/experiment_v014_resnet_more/training_raster_experiment/train_default_raster/2023.05.26.17.15.46/checkpoints/epoch=0.ckpt
2023-05-26 17:34:06,712 ERROR {/home/tk/nuplan/nuplan-devkit/nuplan/planning/training/preprocessing/feature_preprocessor.py:104} Failed to compute features for scenario token 64217a7437a55598 in log 2021.08.17.17.17.01_veh-45_02314_02798
Error: Ran out of input
Epoch 1: 0%| | 0/326 [00:12<?, ?it/s]
Traceback (most recent call last):
File "/home/tk/nuplan/nuplan-devkit/nuplan/planning/training/preprocessing/feature_preprocessor.py", line 93, in compute_features
all_features, all_feature_cache_metadata = self._compute_all_features(scenario, self._feature_builders)
File "/home/tk/nuplan/nuplan-devkit/nuplan/planning/training/preprocessing/feature_preprocessor.py", line 122, in _compute_all_features
feature, feature_metadata_entry = compute_or_load_feature(
File "/home/tk/nuplan/nuplan-devkit/nuplan/planning/training/preprocessing/utils/utils_cache.py", line 83, in compute_or_load_feature
feature = storing_mechanism.load_computed_feature_from_folder(file_name, builder.get_feature_type())
File "/home/tk/nuplan/nuplan-devkit/nuplan/planning/training/preprocessing/utils/feature_cache.py", line 88, in load_computed_feature_from_folder
data = pickle.load(f)
EOFError: Ran out of input
/home/tk/anaconda3/envs/nuplan/lib/python3.9/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /pytorch/c10/core/TensorImpl.h:1156.)
return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
Epoch 1: 1%|▋ | 4/326 [00:15<20:47, 3.87s/it, loss=383, v_num=]
Error executing job with overrides: ['group=/home/tk/nuplan/my_experiments/experiment_v014_resnet_more', 'cache.cache_path=/home/tk/nuplan/my_experiments/cache', 'experiment_name=training_raster_experiment', 'job_name=train_default_raster', 'py_func=train', '+training=training_raster_model', 'scenario_builder=nuplan_mini', 'scenario_filter.limit_total_scenarios=32000', 'lightning.trainer.params.accelerator=ddp', 'lightning.trainer.params.max_epochs=4', 'lightning.trainer.checkpoint.resume_training=true', 'data_loader.params.batch_size=80', 'data_loader.params.num_workers=8', 'logger_level=warning', 'optimizer.lr=8e-5', 'lr_scheduler=multistep_lr', 'lr_scheduler.milestones=[1,2,4,8,12,16]', 'warm_up_lr_scheduler=linear_warm_up', 'worker.threads_per_node=8']
Traceback (most recent call last):
File "/home/tk/nuplan/nuplan-devkit/nuplan/planning/script/run_training.py", line 64, in main
engine.trainer.fit(model=engine.model, datamodule=engine.datamodule)
File "/home/tk/anaconda3/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 460, in fit
self._run(model)
File "/home/tk/anaconda3/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 758, in _run
self.dispatch()
File "/home/tk/anaconda3/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 799, in dispatch
self.accelerator.start_training(self)
File "/home/tk/anaconda3/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
self.training_type_plugin.start_training(trainer)
File "/home/tk/anaconda3/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
self._results = trainer.run_stage()
File "/home/tk/anaconda3/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in run_stage
return self.run_train()
File "/home/tk/anaconda3/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 871, in run_train
self.train_loop.run_training_epoch()
File "/home/tk/anaconda3/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py", line 491, in run_training_epoch
for batch_idx, (batch, is_last_batch) in train_dataloader:
File "/home/tk/anaconda3/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/profiler/profilers.py", line 112, in profile_iterable
value = next(iterator)
File "/home/tk/anaconda3/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/trainer/supporters.py", line 534, in prefetch_iterator
for val in it:
File "/home/tk/anaconda3/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/trainer/supporters.py", line 464, in __next__
return self.request_next_batch(self.loader_iters)
File "/home/tk/anaconda3/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/trainer/supporters.py", line 478, in request_next_batch
return apply_to_collection(loader_iters, Iterator, next)
File "/home/tk/anaconda3/envs/nuplan/lib/python3.9/site-packages/pytorch_lightning/utilities/apply_func.py", line 85, in apply_to_collection
return function(data, *args, **kwargs)
File "/home/tk/anaconda3/envs/nuplan/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
data = self._next_data()
File "/home/tk/anaconda3/envs/nuplan/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1183, in _next_data
return self._process_data(data)
File "/home/tk/anaconda3/envs/nuplan/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
data.reraise()
File "/home/tk/anaconda3/envs/nuplan/lib/python3.9/site-packages/torch/_utils.py", line 425, in reraise
raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 5.
Original Traceback (most recent call last):
File "/home/tk/nuplan/nuplan-devkit/nuplan/planning/training/preprocessing/feature_preprocessor.py", line 93, in compute_features
all_features, all_feature_cache_metadata = self._compute_all_features(scenario, self._feature_builders)
File "/home/tk/nuplan/nuplan-devkit/nuplan/planning/training/preprocessing/feature_preprocessor.py", line 122, in _compute_all_features
feature, feature_metadata_entry = compute_or_load_feature(
File "/home/tk/nuplan/nuplan-devkit/nuplan/planning/training/preprocessing/utils/utils_cache.py", line 83, in compute_or_load_feature
feature = storing_mechanism.load_computed_feature_from_folder(file_name, builder.get_feature_type())
File "/home/tk/nuplan/nuplan-devkit/nuplan/planning/training/preprocessing/utils/feature_cache.py", line 88, in load_computed_feature_from_folder
data = pickle.load(f)
EOFError: Ran out of input
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/tk/anaconda3/envs/nuplan/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
data = fetcher.fetch(index)
File "/home/tk/anaconda3/envs/nuplan/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/tk/anaconda3/envs/nuplan/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/tk/nuplan/nuplan-devkit/nuplan/planning/training/data_loader/scenario_dataset.py", line 48, in __getitem__
features, targets, _ = self._feature_preprocessor.compute_features(scenario)
File "/home/tk/nuplan/nuplan-devkit/nuplan/planning/training/preprocessing/feature_preprocessor.py", line 106, in compute_features
raise RuntimeError(msg)
RuntimeError: Failed to compute features for scenario token 64217a7437a55598 in log 2021.08.17.17.17.01_veh-45_02314_02798
Error: Ran out of input
Additional context
It appears that this error arises after I keyboard-interrupt (Ctrl+C) a previous run of python nuplan/planning/script/run_training.py ... and then start a new experiment that shares the same cache directory.
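For reference, "Ran out of input" is the message Python's pickle module gives when it is asked to load from a stream that contains no bytes. Below is a minimal sketch of that behaviour, using in-memory bytes rather than a real cache file. The try_load helper is made up for illustration, and whether the failing cache file is actually empty is an assumption that has not been verified here.

```python
import io
import pickle


def try_load(raw: bytes) -> str:
    """Return 'ok' if the bytes unpickle, otherwise the EOFError message."""
    try:
        pickle.load(io.BytesIO(raw))
        return "ok"
    except EOFError as exc:
        return f"EOFError: {exc}"


# An empty stream, as a file created but never written to would look.
print(try_load(b""))  # EOFError: Ran out of input

# A complete pickle loads without complaint.
print(try_load(pickle.dumps({"feature": [1, 2, 3]})))  # ok
```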
Hey @tinkei, thanks for reporting. The issue isn't fully resolved, but I'll make some changes to make it more robust. Two questions:
- To sanity-check, are you using the pre-cached features? You should have arguments like
cache.cache_path={CACHE_PATH} cache.use_cache_without_dataset=True
- I think what's happening is that a feature cache file is created but isn't fully written before training ends. We could fix this by writing to a temp file and moving it into place once the feature has been computed and written. Can you confirm this by checking whether an empty file exists for the feature that fails?
I'll also try to replicate. Thanks!
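A rough sketch of the "write to a temp file, then move" idea, using only the Python standard library. This is not the devkit's code: the function name atomic_pickle_dump and the ".tmp" suffix are hypothetical, and it assumes the cache entry is a plain pickle written to a local filesystem.

```python
import os
import pickle
import tempfile
from pathlib import Path
from typing import Any


def atomic_pickle_dump(obj: Any, target: Path) -> None:
    """Pickle obj to target so readers never see a partially written file."""
    target = Path(target)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file in the same directory so the final rename
    # stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=target.parent, prefix=target.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f)
            f.flush()
            os.fsync(f.fileno())
        # os.replace swaps the finished file into place in one step.
        os.replace(tmp_name, target)
    except BaseException:
        # Also covers KeyboardInterrupt (Ctrl+C): drop the partial temp file.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

With this pattern an interrupted write leaves at most a stray temp file, which is cleaned up when the exception propagates. The target path either does not exist yet or holds a complete pickle.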
Hello, is there an update on this? The same problem happens if you interrupt caching and rerun the caching with the same cache directory.
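Until a fix lands, one possible stopgap is to scan the cache directory for zero-byte files before rerunning. The sketch below assumes the broken entries really are empty files, which has not been confirmed in this thread. The helper names are hypothetical, and it makes no assumption about the devkit's file naming beyond walking the directory tree.

```python
from pathlib import Path
from typing import List


def find_empty_files(cache_root: Path) -> List[Path]:
    """Return every zero-byte regular file under cache_root, sorted by path."""
    root = Path(cache_root)
    return sorted(
        p for p in root.rglob("*") if p.is_file() and p.stat().st_size == 0
    )


def remove_empty_files(cache_root: Path, dry_run: bool = True) -> List[Path]:
    """List zero-byte files; delete them only when dry_run is False."""
    empty = find_empty_files(cache_root)
    if not dry_run:
        for p in empty:
            p.unlink()
    return empty
```

Calling remove_empty_files(cache_dir) with the default dry_run=True only reports the candidates, so the list can be checked against the failing scenario token before anything is deleted.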