[BUG] resume training
I tried to continue the training process from a checkpoint and got an error:
from autogluon.multimodal import MultiModalPredictor
import uuid
# model_path = f"./tmp/{uuid.uuid4().hex}-automm_shopee"
# predictor = MultiModalPredictor(label="label", problem_type="binary", path=model_path)
predictor = MultiModalPredictor.load(
"tmp/a1c00ebdec7043e6865278e9a06c3aad-automm_shopee/epoch=9-step=1270.ckpt", resume=True
)
hyperparameter_tune_kwargs = {
"env.per_gpu_batch_size": 128,
}
predictor.fit(
train_data=train_df,
seed=42,
time_limit=60 * 30, # seconds
)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[3], line 12
      6 predictor = MultiModalPredictor.load(
      7     "tmp/a1c00ebdec7043e6865278e9a06c3aad-automm_shopee/epoch=9-step=1270.ckpt", resume=True
      8 )
      9 hyperparameter_tune_kwargs = {
     10     "env.per_gpu_batch_size": 128,
     11 }
---> 12 predictor.fit(
     13     train_data=train_df,
     14     seed=42,
     15     time_limit=60 * 30, # seconds
     16 )
File ~/miniconda3/envs/autogluon/lib/python3.10/site-packages/autogluon/multimodal/predictor.py:509, in MultiModalPredictor.fit(self, train_data, presets, tuning_data, max_num_tuning_data, id_mappings, time_limit, save_path, hyperparameters, column_types, holdout_frac, teacher_predictor, seed, standalone, hyperparameter_tune_kwargs, clean_ckpts)
    507 else:
    508     teacher_learner = teacher_predictor._learner
--> 509 self._learner.fit(
    510     train_data=train_data,
    511     presets=presets,
    512     tuning_data=tuning_data,
    513     max_num_tuning_data=max_num_tuning_data,
    514     time_limit=time_limit,
    515     save_path=save_path,
    516     hyperparameters=hyperparameters,
    517     column_types=column_types,
    518     holdout_frac=holdout_frac,
    519     teacher_learner=teacher_learner,
    520     seed=seed,
    521     standalone=standalone,
    522     hyperparameter_tune_kwargs=hyperparameter_tune_kwargs,
    523     clean_ckpts=clean_ckpts,
    524     id_mappings=id_mappings,
    525 )
    527 return self
File ~/miniconda3/envs/autogluon/lib/python3.10/site-packages/autogluon/multimodal/learners/base.py:654, in BaseLearner.fit(self, train_data, presets, tuning_data, time_limit, save_path, hyperparameters, column_types, holdout_frac, teacher_learner, seed, standalone, hyperparameter_tune_kwargs, clean_ckpts, **kwargs)
    647 self.fit_sanity_check()
    648 self.prepare_fit_args(
    649     time_limit=time_limit,
    650     seed=seed,
    651     standalone=standalone,
    652     clean_ckpts=clean_ckpts,
    653 )
--> 654 fit_returns = self.execute_fit()
    655 self.on_fit_end(
    656     training_start=training_start,
    657     strategy=fit_returns.get("strategy", None),
   (...)
    660     clean_ckpts=clean_ckpts,
    661 )
    663 return self
File ~/miniconda3/envs/autogluon/lib/python3.10/site-packages/autogluon/multimodal/learners/base.py:566, in BaseLearner.execute_fit(self)
    564     return dict()
    565 else:
--> 566     attributes = self.fit_per_run(**self._fit_args)
    567     self.update_attributes(**attributes) # only update attributes for non-HPO mode
    568     return attributes
File ~/miniconda3/envs/autogluon/lib/python3.10/site-packages/autogluon/multimodal/learners/base.py:1343, in BaseLearner.fit_per_run(self, max_time, save_path, ckpt_path, resume, enable_progress_bar, seed, hyperparameters, advanced_hyperparameters, config, df_preprocessor, data_processors, model, standalone, clean_ckpts)
   1320 self.run_trainer(
   1321     trainer=trainer,
   1322     litmodule=litmodule,
   (...)
   1325     resume=resume,
   1326 )
   1327 self.on_fit_per_run_end(
   1328     save_path=save_path,
   1329     standalone=standalone,
   (...)
   1334     model=model,
   1335 )
   1337 return dict(
   1338     config=config,
   1339     df_preprocessor=df_preprocessor,
   1340     data_processors=data_processors,
   1341     model=model,
   1342     model_postprocess_fn=model_postprocess_fn,
-> 1343     best_score=trainer.callback_metrics[f"val_{self._validation_metric_name}"].item(),
   1344     strategy=strategy,
   1345     strict_loading=not peft_param_names,
   1346 )
KeyError: 'val_roc_auc'
Update: this happens when training with a medium batch size. I have 16 GB of GPU memory, and with, for example, 12 GB allocated for training, the run crashes at this step. After the crash I can't resume training and get the error described above. If I train with a very small batch size, everything is OK.
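For readers following the traceback: the failing line indexes `trainer.callback_metrics` directly with the key `val_roc_auc`. A minimal sketch of that failure mode, using a plain dict as a stand-in for `callback_metrics`, is below. It assumes (this is not confirmed anywhere in this thread) that the resumed run finished without logging any validation metric. The helper `read_best_score` is hypothetical, is not part of AutoGluon, and is not necessarily what the linked pull request does.

```python
# Plain-dict stand-in for `trainer.callback_metrics`; no AutoGluon or Lightning needed.

def read_best_score(callback_metrics, metric_name):
    """Hypothetical helper: return the validation score, or None if it was never logged."""
    key = f"val_{metric_name}"
    value = callback_metrics.get(key)
    if value is None:
        return None
    # Real entries are tensors exposing .item(); plain numbers are converted to float.
    return value.item() if hasattr(value, "item") else float(value)


# Assumed scenario: a resumed run that logged no validation metrics.
empty_metrics = {}
try:
    empty_metrics["val_roc_auc"]  # same unguarded lookup pattern as in the traceback
except KeyError as err:
    print(f"KeyError: {err}")

print(read_best_score(empty_metrics, "roc_auc"))          # guarded lookup, nothing logged
print(read_best_score({"val_roc_auc": 0.93}, "roc_auc"))  # guarded lookup, metric present
```

The guarded lookup only turns the crash into a missing value; it does not explain why the metric is absent after resuming.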
I'm unable to reproduce the reported bug. When simulating the issue by inserting an exit() line at the checkpoint fusing stage, the current code functions correctly.
However, the pull request https://github.com/autogluon/autogluon/pull/4449 should address and resolve this issue.
Closing the issue for now. Please feel free to reopen it if the error still exists after this fix.