mpi-operator

questions about applying for nodes and gpus

Open ThomaswellY opened this issue 2 years ago • 9 comments

Hi, I have been using mpi-operator to achieve distributed training recently. The command I use most is `kubectl apply -f <yaml>`. Take the following MPIJob yaml for example:

```yaml
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: cifar
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          nodeName:
          containers:
            - image: 10.252.39.13:5000/deepspeed_ms:v2
              name: mpijob-cifar-deepspeed-container
              imagePullPolicy: Always
              command:
                - mpirun
                - --allow-run-as-root
                - python
                - cifar/cifar10_deepspeed.py
                - --epochs=100
                - --deepspeed_mpi
                - --deepspeed
                - --deepspeed_config
                - cifar/ds_config.json
              env:
                - name: OMP_NUM_THREADS
                  value: "1"
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          nodeName:
          containers:
            - image: 10.252.39.13:5000/deepspeed_ms:v2
              name: deepspeed-mpijob-container
              resources:
                limits:
                  cpu: 2
                  memory: 8Gi
                  nvidia.com/gpu: 2
```

There are some questions I'm confused about:

  1. The GPU resource request seems to live in the "Worker" section. Do the cifar-worker-0 and cifar-worker-1 pods each separately request a node (in the k8s cluster) with 2 GPUs? And then what role does "slotsPerWorker" play?
  2. I have executed `kubectl apply -f` on the example yaml with different replica counts ("replicas: 1", "replicas: 4"), while keeping the resource limit fixed at "nvidia.com/gpu: 1". I found some interesting results:
     * When replicas is set to a larger number, the cifar-launcher pod takes a bit more time to complete.
     * The logs printed in the cifar-launcher pod (when replicas: 4) look just like the result (when replicas: 1) repeated 4 times. So does this mean the four pods each separately requested one GPU (from a node in the k8s cluster, preferentially from the same node if enough GPUs are available) and printed out the average result, and the whole process had nothing to do with distribution?
     * By the way, when setting "replicas: 3", an error is reported in my case: `train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size: 64 != 21 * 1 * 3`. This did confuse me.
  3. If I have node-A with 1 GPU and node-B with 3 GPUs, and want to request 4 GPUs, how should I modify the "Worker" part? Thanks in advance for your reply~
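For reference on the "replicas: 3" error above: DeepSpeed enforces the consistency check quoted in the error message, so `train_batch_size` must equal the product of the micro batch size, gradient accumulation steps, and world size. A minimal sketch of that check (the function name is ours, not DeepSpeed's):

```python
def batch_config_is_consistent(train_batch_size: int,
                               micro_batch_per_gpu: int,
                               gradient_acc_steps: int,
                               world_size: int) -> bool:
    """Mirror of the check quoted in the error message (name is illustrative)."""
    return train_batch_size == micro_batch_per_gpu * gradient_acc_steps * world_size

# train_batch_size=64 works with world sizes that divide 64...
assert batch_config_is_consistent(64, 16, 1, 4)
assert batch_config_is_consistent(64, 32, 1, 2)
# ...but no integer micro batch fits 3 workers: 64 / 3 is not an integer,
# hence the reported "64 != 21 * 1 * 3".
assert not batch_config_is_consistent(64, 21, 1, 3)
```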

ThomaswellY avatar May 24 '23 02:05 ThomaswellY

@ThomaswellY Can you create an issue on https://github.com/kubeflow/training-operator since the mpi-operator doesn't support v1 API?

tenzen-y avatar May 25 '23 19:05 tenzen-y

or you can consider upgrading to the v2beta API :)

To answer some of your questions: ideally, the number of workers should match the number of nodes you want to run on. The slotsPerWorker field denotes how many tasks will run in each worker. Generally, this should match the number of GPUs you have per worker. You don't need to set OMP_NUM_THREADS, since that's actually what slotsPerWorker sets.
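As a sketch of that rule of thumb (the field values here are illustrative, not taken from the original yaml), a worker with 2 GPUs would get slotsPerWorker: 2, so mpirun sees replicas × slotsPerWorker ranks in total:

```yaml
spec:
  slotsPerWorker: 2        # MPI ranks per worker; match GPUs per worker
  mpiReplicaSpecs:
    Worker:
      replicas: 2          # one worker pod per node you want to use
      template:
        spec:
          containers:
            - image: 10.252.39.13:5000/deepspeed_ms:v2
              name: deepspeed-mpijob-container
              resources:
                limits:
                  nvidia.com/gpu: 2   # GPUs per worker = slotsPerWorker
# total ranks launched by mpirun: replicas * slotsPerWorker = 4
```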

If I have node-A with 1 GPU and node-B with 3 GPUs, and want to request 4 GPUs, how should I modify the "Worker" part?

In that case, you might want to set the number of GPUs per worker to 1 (along with slotsPerWorker to 1) and have replicas=4. Not ideal, but it should work.
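Concretely, that suggestion would look something like this (a sketch based on the example yaml; only the relevant fields are shown):

```yaml
spec:
  slotsPerWorker: 1        # 1 rank per worker
  mpiReplicaSpecs:
    Worker:
      replicas: 4          # 4 worker pods, scheduled wherever GPUs are free
      template:
        spec:
          containers:
            - image: 10.252.39.13:5000/deepspeed_ms:v2
              name: deepspeed-mpijob-container
              resources:
                limits:
                  nvidia.com/gpu: 1   # 1 GPU per worker -> 4 GPUs total
```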

alculquicondor avatar May 25 '23 19:05 alculquicondor

@ThomaswellY Can you create an issue on https://github.com/kubeflow/training-operator since the mpi-operator doesn't support v1 API?

Thanks for your reply~ The api-resources of my k8s cluster are shown below:

```
(base) [root@gpu-233 operator]# kubectl api-resources | grep jobs
cronjobs      cj   batch/v1          true   CronJob
jobs               batch/v1          true   Job
mpijobs            kubeflow.org/v1   true   MPIJob
mxjobs             kubeflow.org/v1   true   MXJob
pytorchjobs        kubeflow.org/v1   true   PyTorchJob
tfjobs             kubeflow.org/v1   true   TFJob
xgboostjobs        kubeflow.org/v1   true   XGBoostJob
```

Doesn't that indicate that, in my k8s cluster env, mpijobs are supported by the kubeflow.org/v1 API?
I have applied the example yaml with the kubeflow.org/v1 API successfully, and have seen no significant errors in the pod logs. @tenzen-y

ThomaswellY avatar May 26 '23 00:05 ThomaswellY

Thanks for your reply~ I am a little confused about which API version supports my resource (mpijob in my case). The command "kubectl api-resources" shows that mpijobs in my k8s cluster are supported by kubeflow.org/v1; if that is not the right way to check, what is the suitable way to confirm which API supports my mpijobs resource? Any official docs would be helpful~

or you can consider upgrading to the v2beta API :)

To answer some of your questions: ideally, the number of workers should match the number of nodes you want to run on. The slotsPerWorker field denotes how many tasks will run in each worker. Generally, this should match the number of GPUs you have per worker. You don't need to set OMP_NUM_THREADS, since that's actually what slotsPerWorker sets.

If I have node-A with 1 GPU and node-B with 3 GPUs, and want to request 4 GPUs, how should I modify the "Worker" part?

In that case, you might want to set the number of GPUs per worker to 1 (along with slotsPerWorker to 1) and have replicas=4. Not ideal, but it should work.

I have applied the example yaml in this way successfully, but it seems that the 4 GPUs are used separately by 4 pods, and what each worker executed was single-GPU training. So it's not distributed training (in this case, I mean multi-node training with a single GPU per node), and the whole process takes more time than single-GPU training in one pod with "replicas: 1". What confused me is that the value of "replicas" seems to only serve as a multiplier for "nvidia.com/gpu". In general, there are some things I want to confirm:

  1. How can I confirm which API supports the mpi-operator? If "kubectl api-resources" does not work, which command should be submitted?
  2. When the resource limit sets the GPU number to 1 (because each node of the k8s cluster has only one GPU available in this case), distributed training cannot be launched; even though multiple pods can each execute single-GPU training when replicas > 1, it is in fact just a repetition of single-GPU training.
  3. If I have node-1 with 2 GPUs and node-2 with 4 GPUs, the most effective distributed training that mpi-operator can launch is 2 nodes with 2 GPUs per node, and the ideal config is "slotsPerWorker: 2", "replicas: 2", and "nvidia.com/gpu: 2".

  The questions are a bit many; I am sorry if that troubles you. Thanks in advance~ @alculquicondor

ThomaswellY avatar May 26 '23 01:05 ThomaswellY

Doesn't that indicate that, in my k8s cluster env, mpijobs are supported by the kubeflow.org/v1 API?

That is correct. @tenzen-y's point is that the v1 implementation is no longer hosted in this repo. If you wish to use the newer v2beta1 version, you have to disable training-operator and install the operator from this repo: https://github.com/kubeflow/mpi-operator#installation

The rest of the questions:

  1. the command did work, you are running v1.
  2. It sounds like a problem in your application, not mpi-operator. Did you miss any parameters in your command? I'm not familiar with deepspeed.
  3. yes
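For point 1, one way to confirm which API group and versions serve a resource (these are standard kubectl commands; the CRD name `mpijobs.kubeflow.org` is an assumption based on the operator's install manifests):

```shell
# List all resources served by the kubeflow.org group, with their versions
kubectl api-resources --api-group=kubeflow.org

# Inspect the MPIJob CRD directly to see every version it serves
kubectl get crd mpijobs.kubeflow.org -o jsonpath='{.spec.versions[*].name}'
```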

alculquicondor avatar May 26 '23 12:05 alculquicondor

@ThomaswellY Thanks @alculquicondor. Yes, I meant this repo doesn't support kubeflow.org/v1, and this repo supports only kubeflow.org/v2beta1. Currently, the kubeflow.org/v1 is supported in https://github.com/kubeflow/training-operator.

Also, I would suggest the v2beta1 MPIJob for DeepSpeed, given https://github.com/kubeflow/training-operator/issues/1792#issuecomment-1519576554.

tenzen-y avatar May 26 '23 17:05 tenzen-y

Also, it seems that https://github.com/kubeflow/mpi-operator/pull/549 shows that v2beta1 can run DeepSpeed.

alculquicondor avatar May 26 '23 17:05 alculquicondor

@alculquicondor @tenzen-y Thanks for your kind help! Maybe I should use v2beta1 for DeepSpeed. Anyway, I have executed #549 successfully even on v1; however, it seems only cifar10_deepspeed.py needs no modifications. As for gan_deepspeed_train.py, an extra modification is necessary (like `args.local_rank = int(os.environ['LOCAL_RANK'])`). So #549 is only an example of applying mpi-operator with DeepSpeed; maybe we can do more to support other scripts with DeepSpeed.
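A sketch of the modification mentioned above: when a DeepSpeed script is launched via mpirun rather than the deepspeed launcher, `--local_rank` may never be passed on the command line, so the script can fall back to an environment variable. The function name is ours, and `LOCAL_RANK` is assumed to be set by the image or an entrypoint wrapper (Open MPI itself exposes `OMPI_COMM_WORLD_LOCAL_RANK`):

```python
import argparse


def resolve_local_rank(argv, env):
    """Return the local rank, preferring --local_rank, then env variables."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=-1)
    args = parser.parse_args(argv)
    if args.local_rank < 0:
        # Fall back to LOCAL_RANK, then Open MPI's variable, then rank 0.
        args.local_rank = int(env.get("LOCAL_RANK",
                                      env.get("OMPI_COMM_WORLD_LOCAL_RANK", "0")))
    return args.local_rank
```

In a training script you would call it as `args.local_rank = resolve_local_rank(sys.argv[1:], os.environ)` before initializing DeepSpeed.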

ThomaswellY avatar May 29 '23 00:05 ThomaswellY

@ThomaswellY Thank you for the report!

So https://github.com/kubeflow/mpi-operator/pull/549 is only an example for applying mpi-operator with deepspeed, maybe we can do more for normally applying other script with deepspeed.

Feel free to open PRs. I'm happy to review them :)

tenzen-y avatar May 29 '23 04:05 tenzen-y