
[feature] add `mle analyze` to summarize training logs and give optimal config

Open leeeizhang opened this issue 1 year ago • 5 comments

As titled.

leeeizhang avatar Sep 23 '24 13:09 leeeizhang

Any updates on the issue?

huangyz0918 avatar Oct 07 '24 18:10 huangyz0918

I think `mle analyze` could work as follows:

  1. ML Experiment Summary: summarize recent experiment runs, including the hyper-parameters users tried and the resulting metrics (e.g., loss or accuracy) for each run. To support this, we should first integrate some ML tracking tools (e.g., MLflow, W&B); see the sketch after this list.

  2. Training Suggester (e.g., HPO, NAS): suggest hyper-parameters and model architectures based on the experiment summary. In the future, we could also let mle-agent automatically explore training configurations and execute them.

  3. Weekly Experiment Report: the ML experiment summary could also feed into the report, since ML scientists are often keen to track their weekly experimental progress; it could be an important data source for the report agent.
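
For point 1, the MLflow side of the integration could start from something like this (an untested sketch; `summarize_experiment` and the experiment name are placeholders, not existing mle-agent code):

```python
import mlflow

def summarize_experiment(experiment_name: str):
    """Collect the hyper-parameters and latest metrics of every run."""
    exp = mlflow.get_experiment_by_name(experiment_name)
    runs = mlflow.search_runs(experiment_ids=[exp.experiment_id])
    # `runs` is a pandas DataFrame; the "params.*" / "metrics.*" columns
    # hold the hyper-parameters and the latest logged value of each metric.
    param_cols = [c for c in runs.columns if c.startswith("params.")]
    metric_cols = [c for c in runs.columns if c.startswith("metrics.")]
    return runs[["run_id"] + param_cols + metric_cols]
```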

The divided tasks are:

  • [ ] [integrate] ML tracking integration
    • [ ] MLflow (mle-agent, repx)
    • [ ] W&B (repx only)
  • [ ] [agent] experiment summarizer
  • [ ] [agent] weekly experiment reporter
  • [ ] [agent] HPO and NAS suggester

References:

  • https://arxiv.org/pdf/2402.01881
  • https://arxiv.org/pdf/2309.01125
  • https://arxiv.org/pdf/2302.14838
  • https://github.com/automl/CAAFE

leeeizhang avatar Oct 08 '24 06:10 leeeizhang

What about integrating W&B first? By fetching the experiments, we can get the metrics, the code, the hyper-parameters, etc. Then we can use an agent to give suggestions, also calling some web-search functions (e.g., the Papers with Code benchmarks) to produce a summary and suggestions (e.g., how to tune the parameters to improve certain metrics).
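
Something like this, perhaps (an untested sketch using the public W&B API; the entity/project names are placeholders):

```python
import wandb

api = wandb.Api()
runs = api.runs("my-entity/my-project")  # placeholder entity/project

for run in runs:
    # hyper-parameters, dropping W&B-internal keys that start with "_"
    config = {k: v for k, v in run.config.items() if not k.startswith("_")}
    # final logged metric values of the run
    summary = run.summary._json_dict
    print(run.name, config, summary)
```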

huangyz0918 avatar Oct 08 '24 16:10 huangyz0918

The train/validation accuracy and the loss are time-series data, and how to analyze such data with an LLM is itself an interesting problem. Alternatively, we can fetch the visualizations from W&B and try to analyze the images directly using multi-modal ability. You can have a try and then we can discuss.
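
One simple text-only baseline could be to downsample each metric curve into a short table that fits in the prompt (an untested sketch; the metric name and sample count are assumptions):

```python
def history_to_prompt(run, metric="val_loss", max_points=20):
    """Downsample one W&B metric curve into a short text block for an LLM."""
    df = run.history(keys=[metric]).dropna(subset=[metric])
    stride = max(len(df) // max_points, 1)
    rows = df.iloc[::stride]
    return "\n".join(
        f"step {int(r['_step'])}: {metric}={r[metric]:.4f}"
        for _, r in rows.iterrows()
    )
```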

huangyz0918 avatar Oct 08 '24 16:10 huangyz0918

> What about integrating W&B first? By fetching the experiments, we can get the metrics, the code, the hyper-parameters, etc. Then we can use an agent to give suggestions, also calling some web-search functions (e.g., the Papers with Code benchmarks) to produce a summary and suggestions (e.g., how to tune the parameters to improve certain metrics).

That's right, I will integrate W&B first. For the analysis part, we can incorporate the existing AdviseAgent to summarize and suggest.

> The train/validation accuracy and the loss are time-series data, and how to analyze such data with an LLM is itself an interesting problem. Alternatively, we can fetch the visualizations from W&B and try to analyze the images directly using multi-modal ability. You can have a try and then we can discuss.

Agree with you! How to analyze time-series data is still an open question for exploration, and using multi-modal ability to directly analyze experimental plots or charts is well worth trying. Nevertheless, since the most common NAS and HPO algorithms still rely on the final and best accuracy/loss for analysis and tuning, we may also keep the option of directly using each run's best/final metrics as prompts to build the agent in our initial PoC, e.g. something like the sketch below.
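
An untested sketch of that PoC prompt, one line per run (`build_suggestion_prompt` and the metric name are hypothetical, and `runs` is the W&B run list fetched above):

```python
def build_suggestion_prompt(runs, metric="best_accuracy"):
    """One line per run: hyper-parameters plus its best/final metric."""
    lines = []
    for run in runs:
        config = {k: v for k, v in run.config.items() if not k.startswith("_")}
        lines.append(f"- config={config}, {metric}={run.summary.get(metric)}")
    return (
        "Here are my recent training runs:\n"
        + "\n".join(lines)
        + f"\nSuggest hyper-parameter changes likely to improve {metric}."
    )
```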

leeeizhang avatar Oct 09 '24 13:10 leeeizhang