Description

Adding Webhooks, Injector, and connection-operator installations during cluster creation.

Testing

Test with the cluster create command and add the flag --managed-mldiagnostics. The cluster will then install the required ML Diagnostics components: cert-manager, the injection-webhook, and the connection-operator. like this: https://paste.googleplex.com/6009656740806656

Nov 04 '25 03:11 DannyLiCom

@xibinliu and @Shuang-cnt please review it.

Nov 04 '25 03:11 DannyLiCom

What's the justification for this change? Why are we adding this for every cluster?
What's diagon?
Could you cover the changes with unit tests please? https://github.com/AI-Hypercomputer/xpk/blob/main/docs/testing.md#unit-test

Nov 04 '25 08:11 jamOne-

Also, please merge the current main branch, because the PR changes are mixed with the sub-slicing changes, making it harder to review.

Nov 04 '25 08:11 jamOne-

Also, please merge the current main branch, because the PR changes are mixed with the sub-slicing changes, making it harder to review.

When I run git checkout -b, should I branch off of main or develop? My file changes only touch 3 files, but if I merge into main, there are now 6 changes.

Nov 04 '25 08:11 DannyLiCom

What's the justification for this change? Why are we adding this for every cluster?

What's diagon?

Could you cover the changes with unit tests please? https://github.com/AI-Hypercomputer/xpk/blob/main/docs/testing.md#unit-test

Perhaps @xibinliu needs to explain for it to be clearer.

Nov 04 '25 09:11 DannyLiCom

Also, please merge the current main branch, because the PR changes are mixed with the sub-slicing changes, making it harder to review.

When I run git checkout -b, should I branch off of main or develop? My file changes only touch 3 files, but if I merge into main, there are now 6 changes.

The main. Please just pull latest main and merge it into your branch - that should do the job.

Nov 04 '25 09:11 scaliby

Also, please merge the current main branch, because the PR changes are mixed with the sub-slicing changes, making it harder to review.

When I run git checkout -b, should I branch off of main or develop? My file changes only touch 3 files, but if I merge into main, there are now 6 changes.

The main. Please just pull latest main and merge it into your branch - that should do the job.

Done

Nov 05 '25 01:11 DannyLiCom

What's the justification for this change? Why are we adding this for every cluster?

What's diagon?

Could you cover the changes with unit tests please? https://github.com/AI-Hypercomputer/xpk/blob/main/docs/testing.md#unit-test

ML Diagnostics components

For point 1: Add the --managed-mldiagnostics flag to allow the user to decide whether or not to install these ML Diagnostics components.

Nov 05 '25 07:11 DannyLiCom

@jamOne- @scaliby The unit tests are failing because our installation involves the Kubernetes API server. Does the installation process require unit tests? If so, it will get stuck at get-credentials.

[XPK] ********************************************************************************
[XPK] b'E1106 04:14:54.063296    1983 memcache.go:265] "Unhandled Error" err="couldn\'t get current server API group list: Get \\"http://localhost:8080/api?timeout=32s\\": dial tcp [::1]:8080: connect: connection refused"\nE1106 04:14:54.063919    1983 memcache.go:265] "Unhandled Error" err="couldn\'t get current server API group list: Get \\"http://localhost:8080/api?timeout=32s\\": dial tcp [::1]:8080: connect: connection refused"\nE1106 04:14:54.065489    1983 memcache.go:265] "Unhandled Error" err="couldn\'t get current server API group list: Get \\"http://localhost:8080/api?timeout=32s\\": dial tcp [::1]:8080: connect: connection refused"\nE1106 04:14:54.065944    1983 memcache.go:265] "Unhandled Error" err="couldn\'t get current server API group list: Get \\"http://localhost:8080/api?timeout=32s\\": dial tcp [::1]:8080: connect: connection refused"\nThe connection to the server localhost:8080 was refused - did you specify the right host or port?\n'
[XPK] ********************************************************************************

Nov 06 '25 06:11 DannyLiCom

@DannyLiCom please just mock this behavior on command level, so there will be no actual command execution happening.

Nov 06 '25 10:11 scaliby

What's the justification for this change? Why are we adding this for every cluster?

What's diagon?

Could you cover the changes with unit tests please? https://github.com/AI-Hypercomputer/xpk/blob/main/docs/testing.md#unit-test

Perhaps @xibinliu needs to explain for it to be clearer.

Hi @jamOne- Google Cloud ML Diagnostics is an end-to-end managed platform for ML Engineers to optimize and diagnose their AI/ML workloads on Google Cloud. ML Engineers need to integrate their ML workload with google-cloud-mldiagnostics open source SDK (see the PR in MaxText) as well as deploy some GKE webhooks and operators in GKE cluster (this PR) to get a seamless workload tracking and profiling experience.

Nov 07 '25 14:11 xibinliu

What's the justification for this change? Why are we adding this for every cluster?

What's diagon?

Could you cover the changes with unit tests please? https://github.com/AI-Hypercomputer/xpk/blob/main/docs/testing.md#unit-test

Perhaps @xibinliu needs to explain for it to be clearer.

Hi @jamOne- Google Cloud ML Diagnostics is an end-to-end managed platform for ML Engineers to optimize and diagnose their AI/ML workloads on Google Cloud. ML Engineers need to integrate their ML workload with google-cloud-mldiagnostics open source SDK (see the PR in MaxText) as well as deploy some GKE webhooks and operators in GKE cluster (this PR) to get a seamless workload tracking and profiling experience.

c.c. @scaliby

Nov 10 '25 01:11 DannyLiCom

Hi @jamOne- or @scaliby Our code is largely complete here, and I would like to ask you to review it. Thank you!

Nov 10 '25 02:11 DannyLiCom

Hi @jamOne- Moving the installation logic to managed_ml_diagnostics.py is causing the unit test to fail.

=================================== FAILURES ===================================
__________ test_install_mldiagnostics_prerequisites_commands_executed __________

mocks = _Mocks(common_print_mock=<MagicMock name='xpk_print' id='140425777035104'>, commands_print_mock=<MagicMock name='xpk_p...ype' id='140425771897264'>, commands_tester=<xpk.core.testing.commands_tester.CommandsTester object at 0x7fb76c4419f0>)
mocker = <pytest_mock.plugin.MockerFixture object at 0x7fb76c5fb520>

    def test_install_mldiagnostics_prerequisites_commands_executed(
        mocks: _Mocks,
        mocker,
    ):
    
      mocks.commands_tester.set_result_for_command(
          (0, ''),
          'kubectl',
          'rollout',
          'status',
          'deployment/kueue-controller-manager',
      )
    
      mocks.commands_tester.set_result_for_command(
          (0, ''),
          'kubectl',
          'rollout',
          'status',
          'deployment/cert-manager-webhook',
      )
    
      mocks.commands_tester.set_result_for_command(
          (0, ''),
          'kubectl',
          'apply',
          '-f',
          'https://github.com/cert-manager/cert-manager/releases/',
      )
    
      mocks.commands_tester.set_result_for_command(
          (0, ''),
          'gcloud',
          'artifacts',
          'generic',
          'download',
      )
    
      mocks.commands_tester.set_result_for_command(
          (0, ''),
          'kubectl',
          'create',
          'namespace',
          'gke-mldiagnostics',
      )
    
      mocks.commands_tester.set_result_for_command(
          (0, ''),
          'kubectl',
          'apply',
          '-f',
          '-n',
          'gke-mldiagnostics',
      )
    
      mocks.commands_tester.set_result_for_command(
          (0, ''),
          'kubectl',
          'label',
          'namespace',
          'default',
          'managed-mldiagnostics-gke=true',
      )
    
>     mocks.commands_tester.assert_command_run(
          'kubectl',
          'rollout',
          'status',
          'deployment/kueue-controller-manager',
          times=1,
      )

src/xpk/commands/cluster_test.py:320: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <xpk.core.testing.commands_tester.CommandsTester object at 0x7fb76c4419f0>
times = 1
command_parts = ('kubectl', 'rollout', 'status', 'deployment/kueue-controller-manager')
matching = []

    def assert_command_run(self, *command_parts: str, times: int = 1):
      """Asserts the command composed from the command parts (joined with '.*') was run exactly `times` times."""
      matching = self.get_matching_commands(*command_parts)
      if not matching:
>       raise AssertionError(
            f"{command_parts} was not found in {self.commands_history}"
E           AssertionError: ('kubectl', 'rollout', 'status', 'deployment/kueue-controller-manager') was not found in []

src/xpk/core/testing/commands_tester.py:62: AssertionError
=========================== short test summary info ============================
FAILED src/xpk/commands/cluster_test.py::test_install_mldiagnostics_prerequisites_commands_executed - AssertionError: ('kubectl', 'rollout', 'status', 'deployment/kueue-controller-manager') was not found in []
======================== 1 failed, 163 passed in 1.85s =========================
make: *** [Makefile:39: run-unittests] Error 1

The following error occurs because the path provided here is under cluster:

run_command_with_updates_path=(
              'xpk.commands.cluster.run_command_with_updates'
          ),
          run_command_for_value_path=(
              'xpk.commands.cluster.run_command_for_value'
          ),

The unit test was passing before I moved the code to managed_ml_diagnostics.py.

If I want to modify this, it looks like I need to change the code inside the CommandsTester class.

Nov 13 '25 07:11 DannyLiCom

Hi @jamOne- Moving the installation logic to managed_ml_diagnostics.py is causing the unit test to fail.

=================================== FAILURES ===================================
__________ test_install_mldiagnostics_prerequisites_commands_executed __________

mocks = _Mocks(common_print_mock=<MagicMock name='xpk_print' id='140425777035104'>, commands_print_mock=<MagicMock name='xpk_p...ype' id='140425771897264'>, commands_tester=<xpk.core.testing.commands_tester.CommandsTester object at 0x7fb76c4419f0>)
mocker = <pytest_mock.plugin.MockerFixture object at 0x7fb76c5fb520>

    def test_install_mldiagnostics_prerequisites_commands_executed(
        mocks: _Mocks,
        mocker,
    ):
    
      mocks.commands_tester.set_result_for_command(
          (0, ''),
          'kubectl',
          'rollout',
          'status',
          'deployment/kueue-controller-manager',
      )
    
      mocks.commands_tester.set_result_for_command(
          (0, ''),
          'kubectl',
          'rollout',
          'status',
          'deployment/cert-manager-webhook',
      )
    
      mocks.commands_tester.set_result_for_command(
          (0, ''),
          'kubectl',
          'apply',
          '-f',
          'https://github.com/cert-manager/cert-manager/releases/',
      )
    
      mocks.commands_tester.set_result_for_command(
          (0, ''),
          'gcloud',
          'artifacts',
          'generic',
          'download',
      )
    
      mocks.commands_tester.set_result_for_command(
          (0, ''),
          'kubectl',
          'create',
          'namespace',
          'gke-mldiagnostics',
      )
    
      mocks.commands_tester.set_result_for_command(
          (0, ''),
          'kubectl',
          'apply',
          '-f',
          '-n',
          'gke-mldiagnostics',
      )
    
      mocks.commands_tester.set_result_for_command(
          (0, ''),
          'kubectl',
          'label',
          'namespace',
          'default',
          'managed-mldiagnostics-gke=true',
      )
    
>     mocks.commands_tester.assert_command_run(
          'kubectl',
          'rollout',
          'status',
          'deployment/kueue-controller-manager',
          times=1,
      )

src/xpk/commands/cluster_test.py:320: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <xpk.core.testing.commands_tester.CommandsTester object at 0x7fb76c4419f0>
times = 1
command_parts = ('kubectl', 'rollout', 'status', 'deployment/kueue-controller-manager')
matching = []

    def assert_command_run(self, *command_parts: str, times: int = 1):
      """Asserts the command composed from the command parts (joined with '.*') was run exactly `times` times."""
      matching = self.get_matching_commands(*command_parts)
      if not matching:
>       raise AssertionError(
            f"{command_parts} was not found in {self.commands_history}"
E           AssertionError: ('kubectl', 'rollout', 'status', 'deployment/kueue-controller-manager') was not found in []

src/xpk/core/testing/commands_tester.py:62: AssertionError
=========================== short test summary info ============================
FAILED src/xpk/commands/cluster_test.py::test_install_mldiagnostics_prerequisites_commands_executed - AssertionError: ('kubectl', 'rollout', 'status', 'deployment/kueue-controller-manager') was not found in []
======================== 1 failed, 163 passed in 1.85s =========================
make: *** [Makefile:39: run-unittests] Error 1

The following error occurs because the path provided here is under cluster:

run_command_with_updates_path=(
              'xpk.commands.cluster.run_command_with_updates'
          ),
          run_command_for_value_path=(
              'xpk.commands.cluster.run_command_for_value'
          ),

The unit test was passing before I moved the code to managed_ml_diagnostics.py.

If I want to modify this, it looks like I need to change the code inside the CommandsTester class.

I suspect the problem is that now the mocked function are in a different module. Please move the related unit tests to the new file as well: managed_ml_diagnostics_test.py. See https://github.com/AI-Hypercomputer/xpk/blob/3ca9359a2a9d35ab596e6744b4fd584b034daf15/src/xpk/commands/cluster_test.py#L54 on how to set up the CommandTester. Just note to pass the correct path there, it'll be different now, since you moved the code to the new file.

Nov 13 '25 08:11 jamOne-

Hi @jamOne- I believe I have made all the modifications. Please review it again. Thanks!

Nov 14 '25 08:11 DannyLiCom

Since this PR adds new dependencies and a new flag, we need to discuss this with our PM on Monday.

Nov 14 '25 10:11 jamOne-

Since this PR adds new dependencies and a new flag, we need to discuss this with our PM on Monday.

Okay sure, thank you.

Nov 14 '25 13:11 DannyLiCom

Hi @jamOne- Any updates?

Nov 18 '25 06:11 DannyLiCom

Hi @jamOne- @scaliby I think I'm done with the modifications. Please review it again.

Nov 21 '25 05:11 DannyLiCom

@scaliby I don't have the permission to merge.

Nov 21 '25 15:11 DannyLiCom

Managed ml diagnostics and xpk integration

Description

Testing