Managed ml diagnostics and xpk integration
Description
Adding Webhooks, Injector, and connection-operator installations during cluster creation.
Testing
Test with the cluster create command and add the flag --managed-mldiagnostics. The cluster will then install the required ML Diagnostics components: cert-manager, the injection-webhook, and the connection-operator.
like this: https://paste.googleplex.com/6009656740806656
@xibinliu and @Shuang-cnt please review it.
- What's the justification for this change? Why are we adding this for every cluster?
- What's diagon?
- Could you cover the changes with unit tests please? https://github.com/AI-Hypercomputer/xpk/blob/main/docs/testing.md#unit-test
Also, please merge the current main branch, because the PR changes are mixed with the sub-slicing changes, making it harder to review.
Also, please merge the current main branch, because the PR changes are mixed with the sub-slicing changes, making it harder to review.
When I run git checkout -b, should I branch off of main or develop? My file changes only touch 3 files, but if I merge into main, there are now 6 changes.
- What's the justification for this change? Why are we adding this for every cluster?
- What's diagon?
- Could you cover the changes with unit tests please? https://github.com/AI-Hypercomputer/xpk/blob/main/docs/testing.md#unit-test
Perhaps @xibinliu needs to explain for it to be clearer.
Also, please merge the current main branch, because the PR changes are mixed with the sub-slicing changes, making it harder to review.
When I run
git checkout -b, should I branch off of main or develop? My file changes only touch 3 files, but if I merge into main, there are now 6 changes.
The main. Please just pull latest main and merge it into your branch - that should do the job.
Also, please merge the current main branch, because the PR changes are mixed with the sub-slicing changes, making it harder to review.
When I run
git checkout -b, should I branch off of main or develop? My file changes only touch 3 files, but if I merge into main, there are now 6 changes.The
main. Please just pull latestmainand merge it into your branch - that should do the job.
Done
- What's the justification for this change? Why are we adding this for every cluster?
- What's diagon?
- Could you cover the changes with unit tests please? https://github.com/AI-Hypercomputer/xpk/blob/main/docs/testing.md#unit-test
ML Diagnostics components
For point 1: Add the --managed-mldiagnostics flag to allow the user to decide whether or not to install these ML Diagnostics components.
@jamOne- @scaliby The unit tests are failing because our installation involves the Kubernetes API server. Does the installation process require unit tests? If so, it will get stuck at get-credentials.
[XPK] ********************************************************************************
[XPK] b'E1106 04:14:54.063296 1983 memcache.go:265] "Unhandled Error" err="couldn\'t get current server API group list: Get \\"http://localhost:8080/api?timeout=32s\\": dial tcp [::1]:8080: connect: connection refused"\nE1106 04:14:54.063919 1983 memcache.go:265] "Unhandled Error" err="couldn\'t get current server API group list: Get \\"http://localhost:8080/api?timeout=32s\\": dial tcp [::1]:8080: connect: connection refused"\nE1106 04:14:54.065489 1983 memcache.go:265] "Unhandled Error" err="couldn\'t get current server API group list: Get \\"http://localhost:8080/api?timeout=32s\\": dial tcp [::1]:8080: connect: connection refused"\nE1106 04:14:54.065944 1983 memcache.go:265] "Unhandled Error" err="couldn\'t get current server API group list: Get \\"http://localhost:8080/api?timeout=32s\\": dial tcp [::1]:8080: connect: connection refused"\nThe connection to the server localhost:8080 was refused - did you specify the right host or port?\n'
[XPK] ********************************************************************************
@DannyLiCom please just mock this behavior on command level, so there will be no actual command execution happening.
- What's the justification for this change? Why are we adding this for every cluster?
- What's diagon?
- Could you cover the changes with unit tests please? https://github.com/AI-Hypercomputer/xpk/blob/main/docs/testing.md#unit-test
Perhaps @xibinliu needs to explain for it to be clearer.
Hi @jamOne- Google Cloud ML Diagnostics is an end-to-end managed platform for ML Engineers to optimize and diagnose their AI/ML workloads on Google Cloud. ML Engineers need to integrate their ML workload with google-cloud-mldiagnostics open source SDK (see the PR in MaxText) as well as deploy some GKE webhooks and operators in GKE cluster (this PR) to get a seamless workload tracking and profiling experience.
- What's the justification for this change? Why are we adding this for every cluster?
- What's diagon?
- Could you cover the changes with unit tests please? https://github.com/AI-Hypercomputer/xpk/blob/main/docs/testing.md#unit-test
Perhaps @xibinliu needs to explain for it to be clearer.
Hi @jamOne- Google Cloud ML Diagnostics is an end-to-end managed platform for ML Engineers to optimize and diagnose their AI/ML workloads on Google Cloud. ML Engineers need to integrate their ML workload with google-cloud-mldiagnostics open source SDK (see the PR in MaxText) as well as deploy some GKE webhooks and operators in GKE cluster (this PR) to get a seamless workload tracking and profiling experience.
c.c. @scaliby
Hi @jamOne- or @scaliby Our code is largely complete here, and I would like to ask you to review it. Thank you!
Hi @jamOne-
Moving the installation logic to managed_ml_diagnostics.py is causing the unit test to fail.
=================================== FAILURES ===================================
__________ test_install_mldiagnostics_prerequisites_commands_executed __________
mocks = _Mocks(common_print_mock=<MagicMock name='xpk_print' id='140425777035104'>, commands_print_mock=<MagicMock name='xpk_p...ype' id='140425771897264'>, commands_tester=<xpk.core.testing.commands_tester.CommandsTester object at 0x7fb76c4419f0>)
mocker = <pytest_mock.plugin.MockerFixture object at 0x7fb76c5fb520>
def test_install_mldiagnostics_prerequisites_commands_executed(
mocks: _Mocks,
mocker,
):
mocks.commands_tester.set_result_for_command(
(0, ''),
'kubectl',
'rollout',
'status',
'deployment/kueue-controller-manager',
)
mocks.commands_tester.set_result_for_command(
(0, ''),
'kubectl',
'rollout',
'status',
'deployment/cert-manager-webhook',
)
mocks.commands_tester.set_result_for_command(
(0, ''),
'kubectl',
'apply',
'-f',
'https://github.com/cert-manager/cert-manager/releases/',
)
mocks.commands_tester.set_result_for_command(
(0, ''),
'gcloud',
'artifacts',
'generic',
'download',
)
mocks.commands_tester.set_result_for_command(
(0, ''),
'kubectl',
'create',
'namespace',
'gke-mldiagnostics',
)
mocks.commands_tester.set_result_for_command(
(0, ''),
'kubectl',
'apply',
'-f',
'-n',
'gke-mldiagnostics',
)
mocks.commands_tester.set_result_for_command(
(0, ''),
'kubectl',
'label',
'namespace',
'default',
'managed-mldiagnostics-gke=true',
)
> mocks.commands_tester.assert_command_run(
'kubectl',
'rollout',
'status',
'deployment/kueue-controller-manager',
times=1,
)
src/xpk/commands/cluster_test.py:320:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <xpk.core.testing.commands_tester.CommandsTester object at 0x7fb76c4419f0>
times = 1
command_parts = ('kubectl', 'rollout', 'status', 'deployment/kueue-controller-manager')
matching = []
def assert_command_run(self, *command_parts: str, times: int = 1):
"""Asserts the command composed from the command parts (joined with '.*') was run exactly `times` times."""
matching = self.get_matching_commands(*command_parts)
if not matching:
> raise AssertionError(
f"{command_parts} was not found in {self.commands_history}"
E AssertionError: ('kubectl', 'rollout', 'status', 'deployment/kueue-controller-manager') was not found in []
src/xpk/core/testing/commands_tester.py:62: AssertionError
=========================== short test summary info ============================
FAILED src/xpk/commands/cluster_test.py::test_install_mldiagnostics_prerequisites_commands_executed - AssertionError: ('kubectl', 'rollout', 'status', 'deployment/kueue-controller-manager') was not found in []
======================== 1 failed, 163 passed in 1.85s =========================
make: *** [Makefile:39: run-unittests] Error 1
The following error occurs because the path provided here is under cluster:
run_command_with_updates_path=(
'xpk.commands.cluster.run_command_with_updates'
),
run_command_for_value_path=(
'xpk.commands.cluster.run_command_for_value'
),
The unit test was passing before I moved the code to managed_ml_diagnostics.py.
If I want to modify this, it looks like I need to change the code inside the CommandsTester class.
Hi @jamOne- Moving the installation logic to
managed_ml_diagnostics.pyis causing the unit test to fail.=================================== FAILURES =================================== __________ test_install_mldiagnostics_prerequisites_commands_executed __________ mocks = _Mocks(common_print_mock=<MagicMock name='xpk_print' id='140425777035104'>, commands_print_mock=<MagicMock name='xpk_p...ype' id='140425771897264'>, commands_tester=<xpk.core.testing.commands_tester.CommandsTester object at 0x7fb76c4419f0>) mocker = <pytest_mock.plugin.MockerFixture object at 0x7fb76c5fb520> def test_install_mldiagnostics_prerequisites_commands_executed( mocks: _Mocks, mocker, ): mocks.commands_tester.set_result_for_command( (0, ''), 'kubectl', 'rollout', 'status', 'deployment/kueue-controller-manager', ) mocks.commands_tester.set_result_for_command( (0, ''), 'kubectl', 'rollout', 'status', 'deployment/cert-manager-webhook', ) mocks.commands_tester.set_result_for_command( (0, ''), 'kubectl', 'apply', '-f', 'https://github.com/cert-manager/cert-manager/releases/', ) mocks.commands_tester.set_result_for_command( (0, ''), 'gcloud', 'artifacts', 'generic', 'download', ) mocks.commands_tester.set_result_for_command( (0, ''), 'kubectl', 'create', 'namespace', 'gke-mldiagnostics', ) mocks.commands_tester.set_result_for_command( (0, ''), 'kubectl', 'apply', '-f', '-n', 'gke-mldiagnostics', ) mocks.commands_tester.set_result_for_command( (0, ''), 'kubectl', 'label', 'namespace', 'default', 'managed-mldiagnostics-gke=true', ) > mocks.commands_tester.assert_command_run( 'kubectl', 'rollout', 'status', 'deployment/kueue-controller-manager', times=1, ) src/xpk/commands/cluster_test.py:320: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ self = <xpk.core.testing.commands_tester.CommandsTester object at 0x7fb76c4419f0> times = 1 command_parts = ('kubectl', 'rollout', 'status', 'deployment/kueue-controller-manager') matching = [] def assert_command_run(self, *command_parts: str, times: int = 1): """Asserts the command composed from the command parts (joined with '.*') was run exactly `times` times.""" matching = self.get_matching_commands(*command_parts) if not matching: > raise AssertionError( f"{command_parts} was not found in {self.commands_history}" E AssertionError: ('kubectl', 'rollout', 'status', 'deployment/kueue-controller-manager') was not found in [] src/xpk/core/testing/commands_tester.py:62: AssertionError =========================== short test summary info ============================ FAILED src/xpk/commands/cluster_test.py::test_install_mldiagnostics_prerequisites_commands_executed - AssertionError: ('kubectl', 'rollout', 'status', 'deployment/kueue-controller-manager') was not found in [] ======================== 1 failed, 163 passed in 1.85s ========================= make: *** [Makefile:39: run-unittests] Error 1The following error occurs because the path provided here is under cluster:
run_command_with_updates_path=( 'xpk.commands.cluster.run_command_with_updates' ), run_command_for_value_path=( 'xpk.commands.cluster.run_command_for_value' ),The unit test was passing before I moved the code to
managed_ml_diagnostics.py.If I want to modify this, it looks like I need to change the code inside the
CommandsTester class.
I suspect the problem is that now the mocked function are in a different module. Please move the related unit tests to the new file as well: managed_ml_diagnostics_test.py. See https://github.com/AI-Hypercomputer/xpk/blob/3ca9359a2a9d35ab596e6744b4fd584b034daf15/src/xpk/commands/cluster_test.py#L54 on how to set up the CommandTester. Just note to pass the correct path there, it'll be different now, since you moved the code to the new file.
Hi @jamOne- I believe I have made all the modifications. Please review it again. Thanks!
Since this PR adds new dependencies and a new flag, we need to discuss this with our PM on Monday.
Since this PR adds new dependencies and a new flag, we need to discuss this with our PM on Monday.
Okay sure, thank you.
Hi @jamOne- Any updates?
Hi @jamOne- @scaliby I think I'm done with the modifications. Please review it again.
@scaliby I don't have the permission to merge.