Integrate SDK for managed diagnostics
Description
Integrate SDK for managed diagnostics
- include new SDK google-cloud-mldiagnostics seed-env: --seed-commit=3d8c7df56aa55f363d5453fc82b64c5093ea2863
- add new config params
- modify profiler.py to add ML run and profiling
- modify metrics_logger.py to upload metrics
Four modes will be supported:
- Only upload configs and metrics, not profiling (managed_mldiagnostics=True)
- Upload configs and metrics, and do profiling on the first device (managed_mldiagnostics=True and profiler=xplane)
- Upload configs and metrics, and do profiling on all devices (managed_mldiagnostics=True, profiler=xplane, and upload_all_profiler_results=True)
- On-demand profiling support (TODO, not in this PR)
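The three implemented modes can be read as a small decision over the config flags. The following is a minimal sketch of that mapping, not the code in this PR: the `DiagnosticsMode` enum and the `resolve_diagnostics_mode` helper are hypothetical names introduced for illustration, and it assumes that `upload_all_profiler_results` has no effect unless `profiler=xplane` is also set.

```python
from enum import Enum


class DiagnosticsMode(Enum):
    """Hypothetical labels for the modes listed above (not part of the PR)."""

    DISABLED = "disabled"
    METRICS_ONLY = "metrics_only"
    PROFILE_FIRST_DEVICE = "profile_first_device"
    PROFILE_ALL_DEVICES = "profile_all_devices"


def resolve_diagnostics_mode(
    managed_mldiagnostics: bool,
    profiler: str = "",
    upload_all_profiler_results: bool = False,
) -> DiagnosticsMode:
    """Map the three config flags onto one of the modes described above."""
    if not managed_mldiagnostics:
        # Feature switched off: nothing is uploaded.
        return DiagnosticsMode.DISABLED
    if profiler != "xplane":
        # Configs and metrics only, no profiling.
        # Assumption: upload_all_profiler_results is ignored without xplane.
        return DiagnosticsMode.METRICS_ONLY
    if upload_all_profiler_results:
        # Configs, metrics, and profiles from all devices.
        return DiagnosticsMode.PROFILE_ALL_DEVICES
    # Configs, metrics, and a profile from the first device only.
    return DiagnosticsMode.PROFILE_FIRST_DEVICE
```

For example, `resolve_diagnostics_mode(True, profiler="xplane")` corresponds to the second mode in the list (profiling on the first device only).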
IMPORTANT
Because GCP UI support has not been formally rolled out yet, this feature currently works only in the supercomputer-testing project / us-central1 region. Enabling it in other projects or regions will fail.
Tests
Command:
- Enable the feature with managed_mldiagnostics=True, no profiling
python3 -m MaxText.train src/MaxText/configs/base.yml run_name="xibin-run22" model_name="gpt3-52k" base_output_directory=gs://xibin-images/ dataset_type=synthetic steps=22 managed_mldiagnostics=True managed_mldiagnostics_run_group="xibin-demo" log_period=5
Note: managed_mldiagnostics_run_group is optional
- Enable the feature with managed_mldiagnostics=True profiler=xplane for single-device profiling
python3 -m MaxText.train src/MaxText/configs/base.yml run_name="xibin-run23" model_name="gpt3-52k" base_output_directory=gs://xibin-images/ dataset_type=synthetic steps=22 profiler=xplane managed_mldiagnostics=True log_period=5
- Enable the feature with managed_mldiagnostics=True profiler=xplane upload_all_profiler_results=True to profile on all TPU devices
python3 -m MaxText.train src/MaxText/configs/base.yml run_name="xibin-run24" model_name="gpt3-52k" base_output_directory=gs://xibin-images/ dataset_type=synthetic steps=22 profiler=xplane managed_mldiagnostics=True upload_all_profiler_results=True log_period=5
See all uploaded runs in the managed profiler GCP UI
Checklist
Before submitting this PR, please make sure (put X in square brackets):
- [x] I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
- [x] I have necessary comments in my code, particularly in hard-to-understand areas.
- [x] I have run end-to-end tests and provided workload links above if applicable.
- [x] I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.
How is the correct GCP project info picked up?
The ML Run (managed profiler UI) is always created under the project / regions where the workload is running
Thanks @xibinliu. In that case, the info is coming from XPK, right?
The mldiagnostics SDK figures it out by itself. Even when the run is on a GCE VM, it can find the right project / region info.
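For readers wondering how project / region discovery can work on a GCE VM at all, here is a minimal sketch that queries the Compute Engine metadata server. This only illustrates the general mechanism available on GCE; it is not taken from the mldiagnostics SDK, whose actual implementation is not shown in this thread, and the helper names `region_from_zone_path`, `fetch_metadata`, and `discover_project_and_region` are hypothetical.

```python
import urllib.request

# Standard GCE metadata server root; only reachable from inside a GCE VM.
METADATA_ROOT = "http://metadata.google.internal/computeMetadata/v1"


def region_from_zone_path(zone_path: str) -> str:
    """Derive the region from a zone value such as 'projects/123/zones/us-central1-a'."""
    zone = zone_path.rsplit("/", 1)[-1]  # e.g. 'us-central1-a'
    return zone.rsplit("-", 1)[0]  # drop the trailing zone letter


def fetch_metadata(path: str, timeout: float = 2.0) -> str:
    """Read one metadata key; the Metadata-Flavor header is required by the server."""
    request = urllib.request.Request(
        f"{METADATA_ROOT}/{path}", headers={"Metadata-Flavor": "Google"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


def discover_project_and_region() -> tuple[str, str]:
    """Return (project_id, region) for the VM this code runs on."""
    project = fetch_metadata("project/project-id")
    region = region_from_zone_path(fetch_metadata("instance/zone"))
    return project, region
```

`discover_project_and_region()` only succeeds on a GCE VM (or another environment that exposes the metadata server); elsewhere the request fails or times out.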