
[AutoMM] Configurable Ray checkpoints in HPO

Open · GuillemGSubies opened this issue 1 year ago · 1 comment

Description

Right now, it is really hard to perform HPO with transformer models because a large number of checkpoints gets created, resulting in TBs of storage needed just to run a single HPO.

Specifically, I've had trouble with MultiModalPredictor on an NER task.

I think this would be simpler if the parameters below could be configured:

https://github.com/autogluon/autogluon/blob/949a7815717e48f4675fcf079db0455fd509e444/multimodal/src/autogluon/multimodal/utils/hpo.py#L174
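For illustration, here is a minimal sketch of what making this configurable could look like using Ray Tune's `CheckpointConfig`. The `trainable`, the metric name `val_score`, and the value of `num_to_keep` are placeholders, and the import paths assume a recent Ray 2.x (older releases expose these under `ray.air`):

```python
from ray import train, tune
from ray.train import CheckpointConfig, RunConfig

def trainable(config):
    # Stand-in for the per-trial training loop AutoGluon builds internally.
    train.report({"val_score": 0.0})

# Keep only the best few checkpoints per trial instead of all of them.
checkpoint_config = CheckpointConfig(
    num_to_keep=2,                           # illustrative; currently hardcoded in hpo.py
    checkpoint_score_attribute="val_score",  # metric used to rank checkpoints (assumed name)
    checkpoint_score_order="max",            # higher score = better checkpoint
)

tuner = tune.Tuner(
    trainable,
    run_config=RunConfig(checkpoint_config=checkpoint_config),
)
results = tuner.fit()
```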

However, ideally there would be some kind of checkpoint cleaner that deletes the .pkl files of models that did not achieve a good score and will not be used for inference. Simply keeping a config file with the statistics would suffice.
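Such a cleaner might look like the sketch below. The directory layout, the `result.json` format, and the `*.pkl` glob are assumptions (the pattern follows the wording above; AutoMM may write `.ckpt` files instead):

```python
import json
from pathlib import Path

def clean_hpo_checkpoints(hpo_dir: str, keep_top_k: int = 1, metric: str = "val_score") -> None:
    """Delete model weights of trials outside the top-k; keep config/metrics files."""
    scored_trials = []
    for trial_dir in Path(hpo_dir).iterdir():
        result_file = trial_dir / "result.json"
        if not result_file.is_file():
            continue
        # Ray Tune's result.json holds one JSON record per reported result;
        # the last line contains the trial's final metrics.
        lines = result_file.read_text().strip().splitlines()
        if not lines:
            continue
        last_record = json.loads(lines[-1])
        if metric in last_record:
            scored_trials.append((last_record[metric], trial_dir))

    # Rank trials by score (best first) and drop weights outside the top k.
    scored_trials.sort(key=lambda t: t[0], reverse=True)
    for _, trial_dir in scored_trials[keep_top_k:]:
        for weights in trial_dir.glob("**/*.pkl"):
            weights.unlink()  # remove model weights; config and metrics files stay
```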

GuillemGSubies, Apr 15 '24

TODO:

  1. Add num_to_keep for HPO in our multimodal configs instead of hardcoding it (see the sketch after this list): https://github.com/autogluon/autogluon/blob/c51aa59cd4c32fd96420c79140fd832e7dd09fc7/multimodal/src/autogluon/multimodal/utils/hpo.py#L175
  2. Add checkpoint selection (and cleaning) before HPO to reduce the peak storage: https://github.com/autogluon/autogluon/blob/c51aa59cd4c32fd96420c79140fd832e7dd09fc7/multimodal/src/autogluon/multimodal/learners/base.py#L597
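For TODO 1, the user-facing API might end up looking like this. The key `optimization.hpo.num_to_keep` is purely hypothetical (not an existing AutoGluon option), and the NER model key follows AutoMM's dotted config convention but may differ:

```python
from ray import tune
from autogluon.multimodal import MultiModalPredictor

train_data = ...  # pandas DataFrame with a text column and an entity_annotations column

predictor = MultiModalPredictor(problem_type="ner", label="entity_annotations")
predictor.fit(
    train_data,
    hyperparameters={
        "model.ner_text.checkpoint_name": tune.choice(
            ["bert-base-cased", "roberta-base"]
        ),
        "optimization.hpo.num_to_keep": 1,  # hypothetical key proposed by TODO 1
    },
    hyperparameter_tune_kwargs={"num_trials": 8, "searcher": "bayes", "scheduler": "ASHA"},
)
```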

FANGAreNotGnu, Jun 28 '24