xpk icon indicating copy to clipboard operation
xpk copied to clipboard

Managed ml diagnostics and xpk integration

Open DannyLiCom opened this issue 2 months ago • 19 comments

Description

Adding Webhooks, Injector, and connection-operator installations during cluster creation.

Testing

Test with the cluster create command and add the flag --managed-mldiagnostics. The cluster will then install the required ML Diagnostics components: cert-manager, the injection-webhook, and the connection-operator. like this: https://paste.googleplex.com/6009656740806656

DannyLiCom avatar Nov 04 '25 03:11 DannyLiCom

@xibinliu and @Shuang-cnt please review it.

DannyLiCom avatar Nov 04 '25 03:11 DannyLiCom

  1. What's the justification for this change? Why are we adding this for every cluster?
  2. What's diagon?
  3. Could you cover the changes with unit tests please? https://github.com/AI-Hypercomputer/xpk/blob/main/docs/testing.md#unit-test

jamOne- avatar Nov 04 '25 08:11 jamOne-

Also, please merge the current main branch, because the PR changes are mixed with the sub-slicing changes, making it harder to review.

jamOne- avatar Nov 04 '25 08:11 jamOne-

Also, please merge the current main branch, because the PR changes are mixed with the sub-slicing changes, making it harder to review.

When I run git checkout -b, should I branch off of main or develop? My file changes only touch 3 files, but if I merge into main, there are now 6 changes.

DannyLiCom avatar Nov 04 '25 08:11 DannyLiCom

  1. What's the justification for this change? Why are we adding this for every cluster?
  2. What's diagon?
  3. Could you cover the changes with unit tests please? https://github.com/AI-Hypercomputer/xpk/blob/main/docs/testing.md#unit-test

Perhaps @xibinliu needs to explain for it to be clearer.

DannyLiCom avatar Nov 04 '25 09:11 DannyLiCom

Also, please merge the current main branch, because the PR changes are mixed with the sub-slicing changes, making it harder to review.

When I run git checkout -b, should I branch off of main or develop? My file changes only touch 3 files, but if I merge into main, there are now 6 changes.

The main. Please just pull latest main and merge it into your branch - that should do the job.

scaliby avatar Nov 04 '25 09:11 scaliby

Also, please merge the current main branch, because the PR changes are mixed with the sub-slicing changes, making it harder to review.

When I run git checkout -b, should I branch off of main or develop? My file changes only touch 3 files, but if I merge into main, there are now 6 changes.

The main. Please just pull latest main and merge it into your branch - that should do the job.

Done

DannyLiCom avatar Nov 05 '25 01:11 DannyLiCom

  1. What's the justification for this change? Why are we adding this for every cluster?
  2. What's diagon?
  3. Could you cover the changes with unit tests please? https://github.com/AI-Hypercomputer/xpk/blob/main/docs/testing.md#unit-test

ML Diagnostics components

For point 1: Add the --managed-mldiagnostics flag to allow the user to decide whether or not to install these ML Diagnostics components.

DannyLiCom avatar Nov 05 '25 07:11 DannyLiCom

@jamOne- @scaliby The unit tests are failing because our installation involves the Kubernetes API server. Does the installation process require unit tests? If so, it will get stuck at get-credentials.

[XPK] ********************************************************************************
[XPK] b'E1106 04:14:54.063296    1983 memcache.go:265] "Unhandled Error" err="couldn\'t get current server API group list: Get \\"http://localhost:8080/api?timeout=32s\\": dial tcp [::1]:8080: connect: connection refused"\nE1106 04:14:54.063919    1983 memcache.go:265] "Unhandled Error" err="couldn\'t get current server API group list: Get \\"http://localhost:8080/api?timeout=32s\\": dial tcp [::1]:8080: connect: connection refused"\nE1106 04:14:54.065489    1983 memcache.go:265] "Unhandled Error" err="couldn\'t get current server API group list: Get \\"http://localhost:8080/api?timeout=32s\\": dial tcp [::1]:8080: connect: connection refused"\nE1106 04:14:54.065944    1983 memcache.go:265] "Unhandled Error" err="couldn\'t get current server API group list: Get \\"http://localhost:8080/api?timeout=32s\\": dial tcp [::1]:8080: connect: connection refused"\nThe connection to the server localhost:8080 was refused - did you specify the right host or port?\n'
[XPK] ********************************************************************************

DannyLiCom avatar Nov 06 '25 06:11 DannyLiCom

@DannyLiCom please just mock this behavior on command level, so there will be no actual command execution happening.

scaliby avatar Nov 06 '25 10:11 scaliby

  1. What's the justification for this change? Why are we adding this for every cluster?
  2. What's diagon?
  3. Could you cover the changes with unit tests please? https://github.com/AI-Hypercomputer/xpk/blob/main/docs/testing.md#unit-test

Perhaps @xibinliu needs to explain for it to be clearer.

Hi @jamOne- Google Cloud ML Diagnostics is an end-to-end managed platform for ML Engineers to optimize and diagnose their AI/ML workloads on Google Cloud. ML Engineers need to integrate their ML workload with google-cloud-mldiagnostics open source SDK (see the PR in MaxText) as well as deploy some GKE webhooks and operators in GKE cluster (this PR) to get a seamless workload tracking and profiling experience.

xibinliu avatar Nov 07 '25 14:11 xibinliu

  1. What's the justification for this change? Why are we adding this for every cluster?
  2. What's diagon?
  3. Could you cover the changes with unit tests please? https://github.com/AI-Hypercomputer/xpk/blob/main/docs/testing.md#unit-test

Perhaps @xibinliu needs to explain for it to be clearer.

Hi @jamOne- Google Cloud ML Diagnostics is an end-to-end managed platform for ML Engineers to optimize and diagnose their AI/ML workloads on Google Cloud. ML Engineers need to integrate their ML workload with google-cloud-mldiagnostics open source SDK (see the PR in MaxText) as well as deploy some GKE webhooks and operators in GKE cluster (this PR) to get a seamless workload tracking and profiling experience.

c.c. @scaliby

DannyLiCom avatar Nov 10 '25 01:11 DannyLiCom

Hi @jamOne- or @scaliby Our code is largely complete here, and I would like to ask you to review it. Thank you!

DannyLiCom avatar Nov 10 '25 02:11 DannyLiCom

Hi @jamOne- Moving the installation logic to managed_ml_diagnostics.py is causing the unit test to fail.

=================================== FAILURES ===================================
__________ test_install_mldiagnostics_prerequisites_commands_executed __________

mocks = _Mocks(common_print_mock=<MagicMock name='xpk_print' id='140425777035104'>, commands_print_mock=<MagicMock name='xpk_p...ype' id='140425771897264'>, commands_tester=<xpk.core.testing.commands_tester.CommandsTester object at 0x7fb76c4419f0>)
mocker = <pytest_mock.plugin.MockerFixture object at 0x7fb76c5fb520>

    def test_install_mldiagnostics_prerequisites_commands_executed(
        mocks: _Mocks,
        mocker,
    ):
    
      mocks.commands_tester.set_result_for_command(
          (0, ''),
          'kubectl',
          'rollout',
          'status',
          'deployment/kueue-controller-manager',
      )
    
      mocks.commands_tester.set_result_for_command(
          (0, ''),
          'kubectl',
          'rollout',
          'status',
          'deployment/cert-manager-webhook',
      )
    
      mocks.commands_tester.set_result_for_command(
          (0, ''),
          'kubectl',
          'apply',
          '-f',
          'https://github.com/cert-manager/cert-manager/releases/',
      )
    
      mocks.commands_tester.set_result_for_command(
          (0, ''),
          'gcloud',
          'artifacts',
          'generic',
          'download',
      )
    
      mocks.commands_tester.set_result_for_command(
          (0, ''),
          'kubectl',
          'create',
          'namespace',
          'gke-mldiagnostics',
      )
    
      mocks.commands_tester.set_result_for_command(
          (0, ''),
          'kubectl',
          'apply',
          '-f',
          '-n',
          'gke-mldiagnostics',
      )
    
      mocks.commands_tester.set_result_for_command(
          (0, ''),
          'kubectl',
          'label',
          'namespace',
          'default',
          'managed-mldiagnostics-gke=true',
      )
    
>     mocks.commands_tester.assert_command_run(
          'kubectl',
          'rollout',
          'status',
          'deployment/kueue-controller-manager',
          times=1,
      )

src/xpk/commands/cluster_test.py:320: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <xpk.core.testing.commands_tester.CommandsTester object at 0x7fb76c4419f0>
times = 1
command_parts = ('kubectl', 'rollout', 'status', 'deployment/kueue-controller-manager')
matching = []

    def assert_command_run(self, *command_parts: str, times: int = 1):
      """Asserts the command composed from the command parts (joined with '.*') was run exactly `times` times."""
      matching = self.get_matching_commands(*command_parts)
      if not matching:
>       raise AssertionError(
            f"{command_parts} was not found in {self.commands_history}"
E           AssertionError: ('kubectl', 'rollout', 'status', 'deployment/kueue-controller-manager') was not found in []

src/xpk/core/testing/commands_tester.py:62: AssertionError
=========================== short test summary info ============================
FAILED src/xpk/commands/cluster_test.py::test_install_mldiagnostics_prerequisites_commands_executed - AssertionError: ('kubectl', 'rollout', 'status', 'deployment/kueue-controller-manager') was not found in []
======================== 1 failed, 163 passed in 1.85s =========================
make: *** [Makefile:39: run-unittests] Error 1

The following error occurs because the path provided here is under cluster:

run_command_with_updates_path=(
              'xpk.commands.cluster.run_command_with_updates'
          ),
          run_command_for_value_path=(
              'xpk.commands.cluster.run_command_for_value'
          ),

The unit test was passing before I moved the code to managed_ml_diagnostics.py.

If I want to modify this, it looks like I need to change the code inside the CommandsTester class.

DannyLiCom avatar Nov 13 '25 07:11 DannyLiCom

Hi @jamOne- Moving the installation logic to managed_ml_diagnostics.py is causing the unit test to fail.

=================================== FAILURES ===================================
__________ test_install_mldiagnostics_prerequisites_commands_executed __________

mocks = _Mocks(common_print_mock=<MagicMock name='xpk_print' id='140425777035104'>, commands_print_mock=<MagicMock name='xpk_p...ype' id='140425771897264'>, commands_tester=<xpk.core.testing.commands_tester.CommandsTester object at 0x7fb76c4419f0>)
mocker = <pytest_mock.plugin.MockerFixture object at 0x7fb76c5fb520>

    def test_install_mldiagnostics_prerequisites_commands_executed(
        mocks: _Mocks,
        mocker,
    ):
    
      mocks.commands_tester.set_result_for_command(
          (0, ''),
          'kubectl',
          'rollout',
          'status',
          'deployment/kueue-controller-manager',
      )
    
      mocks.commands_tester.set_result_for_command(
          (0, ''),
          'kubectl',
          'rollout',
          'status',
          'deployment/cert-manager-webhook',
      )
    
      mocks.commands_tester.set_result_for_command(
          (0, ''),
          'kubectl',
          'apply',
          '-f',
          'https://github.com/cert-manager/cert-manager/releases/',
      )
    
      mocks.commands_tester.set_result_for_command(
          (0, ''),
          'gcloud',
          'artifacts',
          'generic',
          'download',
      )
    
      mocks.commands_tester.set_result_for_command(
          (0, ''),
          'kubectl',
          'create',
          'namespace',
          'gke-mldiagnostics',
      )
    
      mocks.commands_tester.set_result_for_command(
          (0, ''),
          'kubectl',
          'apply',
          '-f',
          '-n',
          'gke-mldiagnostics',
      )
    
      mocks.commands_tester.set_result_for_command(
          (0, ''),
          'kubectl',
          'label',
          'namespace',
          'default',
          'managed-mldiagnostics-gke=true',
      )
    
>     mocks.commands_tester.assert_command_run(
          'kubectl',
          'rollout',
          'status',
          'deployment/kueue-controller-manager',
          times=1,
      )

src/xpk/commands/cluster_test.py:320: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <xpk.core.testing.commands_tester.CommandsTester object at 0x7fb76c4419f0>
times = 1
command_parts = ('kubectl', 'rollout', 'status', 'deployment/kueue-controller-manager')
matching = []

    def assert_command_run(self, *command_parts: str, times: int = 1):
      """Asserts the command composed from the command parts (joined with '.*') was run exactly `times` times."""
      matching = self.get_matching_commands(*command_parts)
      if not matching:
>       raise AssertionError(
            f"{command_parts} was not found in {self.commands_history}"
E           AssertionError: ('kubectl', 'rollout', 'status', 'deployment/kueue-controller-manager') was not found in []

src/xpk/core/testing/commands_tester.py:62: AssertionError
=========================== short test summary info ============================
FAILED src/xpk/commands/cluster_test.py::test_install_mldiagnostics_prerequisites_commands_executed - AssertionError: ('kubectl', 'rollout', 'status', 'deployment/kueue-controller-manager') was not found in []
======================== 1 failed, 163 passed in 1.85s =========================
make: *** [Makefile:39: run-unittests] Error 1

The following error occurs because the path provided here is under cluster:

run_command_with_updates_path=(
              'xpk.commands.cluster.run_command_with_updates'
          ),
          run_command_for_value_path=(
              'xpk.commands.cluster.run_command_for_value'
          ),

The unit test was passing before I moved the code to managed_ml_diagnostics.py.

If I want to modify this, it looks like I need to change the code inside the CommandsTester class.

I suspect the problem is that now the mocked function are in a different module. Please move the related unit tests to the new file as well: managed_ml_diagnostics_test.py. See https://github.com/AI-Hypercomputer/xpk/blob/3ca9359a2a9d35ab596e6744b4fd584b034daf15/src/xpk/commands/cluster_test.py#L54 on how to set up the CommandTester. Just note to pass the correct path there, it'll be different now, since you moved the code to the new file.

jamOne- avatar Nov 13 '25 08:11 jamOne-

Hi @jamOne- I believe I have made all the modifications. Please review it again. Thanks!

DannyLiCom avatar Nov 14 '25 08:11 DannyLiCom

Since this PR adds new dependencies and a new flag, we need to discuss this with our PM on Monday.

jamOne- avatar Nov 14 '25 10:11 jamOne-

Since this PR adds new dependencies and a new flag, we need to discuss this with our PM on Monday.

Okay sure, thank you.

DannyLiCom avatar Nov 14 '25 13:11 DannyLiCom

Hi @jamOne- Any updates?

DannyLiCom avatar Nov 18 '25 06:11 DannyLiCom

Hi @jamOne- @scaliby I think I'm done with the modifications. Please review it again.

DannyLiCom avatar Nov 21 '25 05:11 DannyLiCom

@scaliby I don't have the permission to merge.

DannyLiCom avatar Nov 21 '25 15:11 DannyLiCom