Integrate SDK for managed diagnostics

Open xibinliu opened this issue 3 months ago • 3 comments

Description

Integrate SDK for managed diagnostics

  • include new SDK google-cloud-mldiagnostics (seed-env: --seed-commit=3d8c7df56aa55f363d5453fc82b64c5093ea2863)
  • add new config params
  • modify profiler.py to add ML run and profiling
  • modify metrics_logger.py to upload metrics

Four modes will be supported:

  1. Upload configs and metrics only, no profiling (managed_mldiagnostics=True)
  2. Upload configs and metrics, and profile the first device (managed_mldiagnostics=True and profiler=xplane)
  3. Upload configs and metrics, and profile all devices (managed_mldiagnostics=True, profiler=xplane, and upload_all_profiler_results=True)
  4. On-demand profiling support (TODO, not in this PR)
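
As an illustration only, the mapping from the three config params to modes 1-3 can be sketched as below. The `DiagnosticsMode` enum and `resolve_mode` helper are hypothetical names and are not part of this PR. How profiler values other than xplane interact with managed_mldiagnostics is not stated here; the sketch assumes they fall back to metrics-only.

```python
from enum import Enum


class DiagnosticsMode(Enum):
  """Hypothetical labels for the modes listed above (not MaxText identifiers)."""
  DISABLED = "disabled"
  METRICS_ONLY = "metrics_only"          # mode 1
  PROFILE_FIRST_DEVICE = "first_device"  # mode 2
  PROFILE_ALL_DEVICES = "all_devices"    # mode 3


def resolve_mode(managed_mldiagnostics=False, profiler="", upload_all_profiler_results=False):
  """Maps the three config params to a mode, following the list above."""
  if not managed_mldiagnostics:
    return DiagnosticsMode.DISABLED
  # Assumption: only profiler=xplane triggers managed profiling uploads.
  if profiler != "xplane":
    return DiagnosticsMode.METRICS_ONLY
  if upload_all_profiler_results:
    return DiagnosticsMode.PROFILE_ALL_DEVICES
  return DiagnosticsMode.PROFILE_FIRST_DEVICE


if __name__ == "__main__":
  print(resolve_mode(managed_mldiagnostics=True, profiler="xplane"))
```

Under this sketch, setting upload_all_profiler_results=True without profiler=xplane has no effect, which matches the list above, where mode 3 requires all three settings.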

IMPORTANT

Since the GCP UI support is not formally rolled out yet, currently this feature only works in supercomputer-testing / us-central1. Enabling this feature in other projects and regions will fail.

Tests

Command:

  1. Enable the feature with managed_mldiagnostics=True, no profiling
python3 -m MaxText.train src/MaxText/configs/base.yml run_name="xibin-run22" model_name="gpt3-52k" base_output_directory=gs://xibin-images/  dataset_type=synthetic steps=22 managed_mldiagnostics=True managed_mldiagnostics_run_group="xibin-demo" log_period=5

Note: managed_mldiagnostics_run_group is optional

  2. Enable the feature with managed_mldiagnostics=True profiler=xplane for single device profiling
python3 -m MaxText.train src/MaxText/configs/base.yml run_name="xibin-run23" model_name="gpt3-52k" base_output_directory=gs://xibin-images/  dataset_type=synthetic steps=22 profiler=xplane managed_mldiagnostics=True  log_period=5

  3. Enable the feature with managed_mldiagnostics=True profiler=xplane upload_all_profiler_results=True to profile on all TPU devices
python3 -m MaxText.train src/MaxText/configs/base.yml run_name="xibin-run24" model_name="gpt3-52k" base_output_directory=gs://xibin-images/  dataset_type=synthetic steps=22 profiler=xplane managed_mldiagnostics=True upload_all_profiler_results=True  log_period=5

See all uploaded runs in the managed profiler GCP UI

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • [x] I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • [x] I have necessary comments in my code, particularly in hard-to-understand areas.
  • [x] I have run end-to-end tests and provided workload links above if applicable.
  • [x] I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

Oct 24 '25 14:10 xibinliu

> How is the correct GCP project info picked up?

The ML Run (managed profiler UI) is always created under the project / regions where the workload is running

Oct 27 '25 18:10 xibinliu

> > How is the correct GCP project info picked up?
>
> The ML Run (managed profiler UI) is always created under the project / regions where the workload is running

Thanks @xibinliu. In that case, the info is coming from XPK, right?

Oct 28 '25 20:10 bvandermoon

> > > How is the correct GCP project info picked up?
> >
> > The ML Run (managed profiler UI) is always created under the project / regions where the workload is running
>
> Thanks @xibinliu. In that case, the info is coming from XPK, right?

The mldiagnostics SDK figures it out by itself. Even if the run is on a GCE VM, it can still find the right project / region info.

Oct 31 '25 20:10 xibinliu
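
For readers wondering how project / region discovery can work without XPK: this thread does not say how the mldiagnostics SDK does it. One common approach on a GCE VM is to query the instance metadata server, sketched below as an assumption rather than a description of the SDK. The helper names are hypothetical; the metadata paths and the Metadata-Flavor header are the standard GCE ones.

```python
import urllib.request

# Standard GCE metadata server endpoint, reachable only from inside a VM.
METADATA_URL = "http://metadata.google.internal/computeMetadata/v1/"


def fetch_metadata(path, timeout=2.0):
  """Reads one metadata value; the Metadata-Flavor header is required."""
  req = urllib.request.Request(METADATA_URL + path, headers={"Metadata-Flavor": "Google"})
  with urllib.request.urlopen(req, timeout=timeout) as resp:
    return resp.read().decode("utf-8")


def region_from_zone(zone_path):
  """Turns 'projects/<number>/zones/us-central1-a' into 'us-central1'."""
  zone = zone_path.rsplit("/", 1)[-1]  # keep only the zone name
  return zone.rsplit("-", 1)[0]        # drop the trailing zone letter


def detect_project_and_region():
  """Hypothetical helper: returns (project_id, region) for the current VM."""
  project = fetch_metadata("project/project-id")
  region = region_from_zone(fetch_metadata("instance/zone"))
  return project, region
```

Only `region_from_zone` can be exercised off-VM; `detect_project_and_region` needs to run on a GCE instance (or a GKE node, where the same endpoint is available) to reach the metadata server.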