
[BUG] resume training

Open oleg-korshunov opened this issue 1 year ago • 1 comments

I tried to continue the training process and got an error:

from autogluon.multimodal import MultiModalPredictor
import uuid

# model_path = f"./tmp/{uuid.uuid4().hex}-automm_shopee"
# predictor = MultiModalPredictor(label="label", problem_type="binary", path=model_path)
predictor = MultiModalPredictor.load(
    "tmp/a1c00ebdec7043e6865278e9a06c3aad-automm_shopee/epoch=9-step=1270.ckpt", resume=True
)
hyperparameter_tune_kwargs = {
    "env.per_gpu_batch_size": 128,
}
predictor.fit(
    train_data=train_df,
    seed=42,
    time_limit=60 * 30,  # seconds
)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[3], line 12
      6 predictor = MultiModalPredictor.load(
      7     "tmp/a1c00ebdec7043e6865278e9a06c3aad-automm_shopee/epoch=9-step=1270.ckpt", resume=True
      8 )
      9 hyperparameter_tune_kwargs = {
     10     "env.per_gpu_batch_size": 128,
     11 }
---> 12 predictor.fit(
     13     train_data=train_df,
     14     seed=42,
     15     time_limit=60 * 30,  # seconds
     16 )

File ~/miniconda3/envs/autogluon/lib/python3.10/site-packages/autogluon/multimodal/predictor.py:509, in MultiModalPredictor.fit(self, train_data, presets, tuning_data, max_num_tuning_data, id_mappings, time_limit, save_path, hyperparameters, column_types, holdout_frac, teacher_predictor, seed, standalone, hyperparameter_tune_kwargs, clean_ckpts)
    507 else:
    508     teacher_learner = teacher_predictor._learner
--> 509 self._learner.fit(
    510     train_data=train_data,
    511     presets=presets,
    512     tuning_data=tuning_data,
    513     max_num_tuning_data=max_num_tuning_data,
    514     time_limit=time_limit,
    515     save_path=save_path,
    516     hyperparameters=hyperparameters,
    517     column_types=column_types,
    518     holdout_frac=holdout_frac,
    519     teacher_learner=teacher_learner,
    520     seed=seed,
    521     standalone=standalone,
    522     hyperparameter_tune_kwargs=hyperparameter_tune_kwargs,
    523     clean_ckpts=clean_ckpts,
    524     id_mappings=id_mappings,
    525 )
    527 return self

File ~/miniconda3/envs/autogluon/lib/python3.10/site-packages/autogluon/multimodal/learners/base.py:654, in BaseLearner.fit(self, train_data, presets, tuning_data, time_limit, save_path, hyperparameters, column_types, holdout_frac, teacher_learner, seed, standalone, hyperparameter_tune_kwargs, clean_ckpts, **kwargs)
    647 self.fit_sanity_check()
    648 self.prepare_fit_args(
    649     time_limit=time_limit,
    650     seed=seed,
    651     standalone=standalone,
    652     clean_ckpts=clean_ckpts,
    653 )
--> 654 fit_returns = self.execute_fit()
    655 self.on_fit_end(
    656     training_start=training_start,
    657     strategy=fit_returns.get("strategy", None),
   (...)
    660     clean_ckpts=clean_ckpts,
    661 )
    663 return self

File ~/miniconda3/envs/autogluon/lib/python3.10/site-packages/autogluon/multimodal/learners/base.py:566, in BaseLearner.execute_fit(self)
    564     return dict()
    565 else:
--> 566     attributes = self.fit_per_run(**self._fit_args)
    567     self.update_attributes(**attributes)  # only update attributes for non-HPO mode
    568     return attributes

File ~/miniconda3/envs/autogluon/lib/python3.10/site-packages/autogluon/multimodal/learners/base.py:1343, in BaseLearner.fit_per_run(self, max_time, save_path, ckpt_path, resume, enable_progress_bar, seed, hyperparameters, advanced_hyperparameters, config, df_preprocessor, data_processors, model, standalone, clean_ckpts)
   1320 self.run_trainer(
   1321     trainer=trainer,
   1322     litmodule=litmodule,
   (...)
   1325     resume=resume,
   1326 )
   1327 self.on_fit_per_run_end(
   1328     save_path=save_path,
   1329     standalone=standalone,
   (...)
   1334     model=model,
   1335 )
   1337 return dict(
   1338     config=config,
   1339     df_preprocessor=df_preprocessor,
   1340     data_processors=data_processors,
   1341     model=model,
   1342     model_postprocess_fn=model_postprocess_fn,
-> 1343     best_score=trainer.callback_metrics[f"val_{self._validation_metric_name}"].item(),
   1344     strategy=strategy,
   1345     strict_loading=not peft_param_names,
   1346 )

KeyError: 'val_roc_auc'

oleg-korshunov • Aug 25 '24 21:08
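
The traceback above ends in a plain dictionary lookup, `trainer.callback_metrics[f"val_{self._validation_metric_name}"]`. In PyTorch Lightning, `callback_metrics` is a dict that only contains metrics that were actually logged, so one possible explanation (an assumption, not confirmed anywhere in this thread) is that the resumed run ended without logging a validation metric. The sketch below reproduces that failure mode with plain dicts standing in for the trainer state; `read_best_score` is a hypothetical helper and not part of AutoGluon.

```python
from typing import Mapping, Optional


def read_best_score(callback_metrics: Mapping[str, float], metric_name: str) -> Optional[float]:
    """Return the logged validation score, or None if it was never logged.

    Hypothetical helper for illustration only; not AutoGluon code.
    """
    key = f"val_{metric_name}"
    value = callback_metrics.get(key)
    if value is None:
        return None
    # Lightning stores tensors here; plain floats keep this sketch dependency-free.
    return float(value)


# Stand-in for trainer.callback_metrics after a run that logged validation metrics.
metrics_after_validation = {"val_roc_auc": 0.91, "train_loss": 0.2}
# Stand-in for a resumed run that ended before validation was logged (assumed scenario).
metrics_without_validation = {}

score_ok = read_best_score(metrics_after_validation, "roc_auc")
score_missing = read_best_score(metrics_without_validation, "roc_auc")

# Direct indexing, as in the traceback, raises KeyError when the key is absent.
try:
    metrics_without_validation["val_roc_auc"]
    direct_lookup_failed = False
except KeyError:
    direct_lookup_failed = True
```

A guard like this only turns the crash into an explicit "no score" result; it does not address why the metric is missing.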

Update: this happens when training with a medium batch size. For example, I have 16 GB of GPU memory, and with about 12 GB allocated for training, the run crashes at this step. After the crash I can't resume training and get the error described above. If I train with a very small batch size, everything is OK. [image]

oleg-korshunov • Aug 26 '24 11:08
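
In the snippet from the report, `hyperparameter_tune_kwargs` is built but never passed to `fit`, so the batch size setting there has no effect on the run. As far as I know, AutoMM takes config overrides such as `env.per_gpu_batch_size` through the `hyperparameters` argument of `fit`, while `hyperparameter_tune_kwargs` configures hyperparameter search. A minimal sketch of assembling the arguments for a smaller batch under that assumption; `build_fit_kwargs` and the value 8 are illustrative and do not come from the thread.

```python
from typing import Any, Dict


def build_fit_kwargs(
    train_data: Any, per_gpu_batch_size: int, time_limit_s: int, seed: int = 42
) -> Dict[str, Any]:
    """Assemble keyword arguments for MultiModalPredictor.fit (hypothetical helper)."""
    if per_gpu_batch_size < 1:
        raise ValueError("per_gpu_batch_size must be a positive integer")
    return {
        "train_data": train_data,
        "seed": seed,
        "time_limit": time_limit_s,  # seconds
        # Assumption: config overrides are passed through `hyperparameters`.
        "hyperparameters": {"env.per_gpu_batch_size": per_gpu_batch_size},
    }


# train_data=None is a placeholder; a real run would pass the training DataFrame.
fit_kwargs = build_fit_kwargs(train_data=None, per_gpu_batch_size=8, time_limit_s=60 * 30)
# With a loaded predictor this would be used as: predictor.fit(**fit_kwargs)
```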

I'm unable to reproduce the reported bug. When simulating the issue by inserting an exit() line at the checkpoint fusing stage, the current code functions correctly. However, the pull request https://github.com/autogluon/autogluon/pull/4449 should address and resolve this issue. Closing the issue for now. Please feel free to reopen it if the error still exists after this fix.

FANGAreNotGnu • Aug 31 '24 00:08