
`xpk cluster adapt`

Open gcie opened this issue 9 months ago • 6 comments

Fixes / Features

  • New command: `xpk cluster adapt`, which prepares an existing cluster for XPK.

gcie avatar Apr 29 '25 12:04 gcie

What about testing this? Is it possible to create a test scenario with a TPU involved? For example, create a cluster with a gcloud command and run xpk cluster adapt on it?

pawloch00 avatar Apr 30 '25 11:04 pawloch00

What about the scenario where the cluster is created with Cluster Toolkit, then xpk cluster adapt is run, and then we want to remove the cluster using xpk cluster delete?

pawloch00 avatar Apr 30 '25 12:04 pawloch00

What about testing this? Is it possible to create a test scenario with a TPU involved? For example, create a cluster with a gcloud command and run xpk cluster adapt on it?

We will have coverage in the GitHub workflow after we refactor cluster_create to use cluster_adapt.

What about the scenario where the cluster is created with Cluster Toolkit, then xpk cluster adapt is run, and then we want to remove the cluster using xpk cluster delete?

I tested it manually and it works. If you want to cover it with tests, we first need to finish the integration tests for GPUs.

gcie avatar Apr 30 '25 12:04 gcie

But it should work for TPU too, which makes it eligible for testing now.

pawloch00 avatar Apr 30 '25 12:04 pawloch00

But it should work for TPU too, which makes it eligible for testing now.

We will have TPU tests in the GitHub workflows after the refactoring, because cluster_create will then depend on cluster_adapt. Also, the scope of this task was to make this command work for A3 Ultra only. We'll add TPU support in later PRs.

gcie avatar Apr 30 '25 12:04 gcie

But it should work for TPU too, which makes it eligible for testing now.

We will have TPU tests in the GitHub workflows after the refactoring, because cluster_create will then depend on cluster_adapt. Also, the scope of this task was to make this command work for A3 Ultra only. We'll add TPU support in later PRs.

Then the README needs to state clearly that it is supported only for A3 Ultra.

[edit] Please add a README section for this command.

pawloch00 avatar Apr 30 '25 12:04 pawloch00

And one more question about additional networks and subnetwork names for A3U machines. Are we assuming the subnetwork names are what we expect in the A3 workloads with RDMA annotations?

Can you let me know why you think it would matter in this case? Btw, xpk workload create now automatically sets subnetwork names on the fly to whatever was configured on the cluster.

gcie avatar Jun 09 '25 10:06 gcie

And one more question about additional networks and subnetwork names for A3U machines. Are we assuming the subnetwork names are what we expect in the A3 workloads with RDMA annotations?

Can you let me know why you think it would matter in this case? Btw, xpk workload create now automatically sets subnetwork names on the fly to whatever was configured on the cluster.

We expect the network object names to be in a specific format; otherwise the workloads will fail with errors.

sharabiani avatar Jun 09 '25 10:06 sharabiani