Description

Integrate SDK for managed diagnostics

Include the new SDK google-cloud-mldiagnostics (seed-env: --seed-commit=3d8c7df56aa55f363d5453fc82b64c5093ea2863)
Add new config params
Modify profiler.py to add the ML run and profiling
Modify metrics_logger.py to upload metrics

Four modes will be supported:

Upload configs and metrics only, with no profiling (managed_mldiagnostics=True)
Upload configs and metrics, and profile the first device (managed_mldiagnostics=True and profiler=xplane)
Upload configs and metrics, and profile all devices (managed_mldiagnostics=True, profiler=xplane, and upload_all_profiler_results=True)
On-demand profiling (TODO, not in this PR)
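
As a reading aid, the flag combinations for the first three modes can be sketched as a small decision function. This is only an illustration, not the code in this PR: `DiagnosticsConfig`, `diagnostics_mode`, and the returned mode names are hypothetical, and the empty-string default for `profiler` is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiagnosticsConfig:
    """Hypothetical stand-in for the three config params named above."""

    managed_mldiagnostics: bool = False
    profiler: str = ""  # assumed default: empty string means profiling is off
    upload_all_profiler_results: bool = False


def diagnostics_mode(cfg: DiagnosticsConfig) -> str:
    """Return an illustrative name for the mode a flag combination selects."""
    if not cfg.managed_mldiagnostics:
        # The feature is off; nothing is uploaded.
        return "disabled"
    if cfg.profiler != "xplane":
        # Mode 1: configs and metrics only.
        return "configs_and_metrics_only"
    if cfg.upload_all_profiler_results:
        # Mode 3: profile every device.
        return "profile_all_devices"
    # Mode 2: profile the first device only.
    return "profile_first_device"
```

In this sketch, setting upload_all_profiler_results=True without profiler=xplane falls back to the metrics-only mode; how the PR treats that combination is not stated here.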

IMPORTANT

Because GCP UI support has not been formally rolled out yet, this feature currently works only in supercomputer-testing / us-central1. Enabling it in any other project or region will fail.

Tests

Command:

Enable the feature with managed_mldiagnostics=True, no profiling

python3 -m MaxText.train src/MaxText/configs/base.yml run_name="xibin-run22" model_name="gpt3-52k" base_output_directory=gs://xibin-images/  dataset_type=synthetic steps=22 managed_mldiagnostics=True managed_profiler_run_group="xibin-demo" log_period=5

Note: managed_mldiagnostics_run_group is optional

Enable the feature with managed_mldiagnostics=True and profiler=xplane for single-device profiling

python3 -m MaxText.train src/MaxText/configs/base.yml run_name="xibin-run23" model_name="gpt3-52k" base_output_directory=gs://xibin-images/  dataset_type=synthetic steps=22 profiler=xplane managed_mldiagnostics=True  log_period=5

Enable the feature with managed_mldiagnostics=True, profiler=xplane, and upload_all_profiler_results=True to profile all TPU devices

python3 -m MaxText.train src/MaxText/configs/base.yml run_name="xibin-run24" model_name="gpt3-52k" base_output_directory=gs://xibin-images/  dataset_type=synthetic steps=22 profiler=xplane managed_mldiagnostics=True upload_all_profiler_results=True  log_period=5

See all uploaded runs in the managed profiler GCP UI

Checklist

Before submitting this PR, please make sure (put X in square brackets):

[x] I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
[x] I have necessary comments in my code, particularly in hard-to-understand areas.
[x] I have run end-to-end tests and provided workload links above if applicable.
[x] I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

Oct 24 '25 14:10 xibinliu

How is the correct GCP project info picked up?

The ML Run (managed profiler UI) is always created under the project and region where the workload is running.

Oct 27 '25 18:10 xibinliu

Thanks @xibinliu. In that case, the info is coming from XPK, right?

Oct 28 '25 20:10 bvandermoon

The mldiagnostics SDK figures it out by itself. Even when the run is on a GCE VM, it can find the right project / region info.

Oct 31 '25 20:10 xibinliu