Add AWS Inf2 instances support for aws_batch scheduler

Open shixianc opened this issue 1 year ago • 10 comments

Add AWS Inf2 instance support to the aws_batch scheduler. There are use cases for using torchx to launch data-parallel inference jobs on Inf2 instances on AWS Batch.

The configurations are based on https://aws.amazon.com/ec2/instance-types/inf2/
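As a rough illustration of what a named resource for this instance type could look like, here is a minimal sketch. It does not import torchx: the `Resource` dataclass is a hypothetical stand-in, the `NEURON_DEVICE` key is an assumption, and the CPU, memory, and device counts are simply copied from the dry-run output in the test plan (the config was later corrected in this thread, so the merged values may differ).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Assumed device key, used here only for illustration.
NEURON_DEVICE = "aws.amazon.com/neurondevice"


@dataclass
class Resource:
    """Hypothetical stand-in for a torchx-style resource spec."""

    cpu: int
    gpu: int
    memMB: int
    devices: Dict[str, int] = field(default_factory=dict)


def aws_inf2_48xlarge() -> Resource:
    # Values taken from the dry-run output in the test plan:
    # nano_cpus 192000000000 -> 192 CPUs, mem_limit 377472m,
    # and 12 /dev/neuron* devices. Inf2 instances carry no GPUs.
    return Resource(cpu=192, gpu=0, memMB=377472, devices={NEURON_DEVICE: 12})


# Registry keyed by the name passed to `-h` on the command line.
NAMED_RESOURCES: Dict[str, Callable[[], Resource]] = {
    "aws_inf2.48xlarge": aws_inf2_48xlarge,
}
```

Looking up `NAMED_RESOURCES["aws_inf2.48xlarge"]()` returns the resource that a scheduler would then translate into container settings.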

Test plan: Updated unit tests to cover the new changes.

671 passed, 98 warnings in 177.52s (0:02:57)
❯ torchx run -s local_docker --dryrun dist.ddp -h aws_inf2.48xlarge -j 1 --m abc

=== SCHEDULER REQUEST ===

!!python/object:torchx.schedulers.docker_scheduler.DockerContainer
command:
- bash
- -c
- torchrun --rdzv_backend c10d --rdzv_endpoint localhost:0 --rdzv_id 'abc-kr12qr37093rz' --nnodes 1 --nproc_per_node 1 --tee 3 --role '' -m abc
image: sha256:3f8f845e25030d9523bf299dee1c3ca6f2b008fc2bf0a2161ed949efe168a3e1
kwargs:
  devices:
  - /dev/neuron0:/dev/neuron0:rwm
  - /dev/neuron1:/dev/neuron1:rwm
  - /dev/neuron2:/dev/neuron2:rwm
  - /dev/neuron3:/dev/neuron3:rwm
  - /dev/neuron4:/dev/neuron4:rwm
  - /dev/neuron5:/dev/neuron5:rwm
  - /dev/neuron6:/dev/neuron6:rwm
  - /dev/neuron7:/dev/neuron7:rwm
  - /dev/neuron8:/dev/neuron8:rwm
  - /dev/neuron9:/dev/neuron9:rwm
  - /dev/neuron10:/dev/neuron10:rwm
  - /dev/neuron11:/dev/neuron11:rwm
  environment:
    LOGLEVEL: WARNING
    TORCHX_JOB_ID: local_docker://torchx/abc-kr12qr37093rz
    TORCHX_RANK0_HOST: abc-kr12qr37093rz-abc-0
    TORCHX_TRACKING_EXPERIMENT_NAME: default-experiment
  hostname: abc-kr12qr37093rz-abc-0
  labels:
    torchx.pytorch.org/app-id: abc-kr12qr37093rz
    torchx.pytorch.org/replica-id: '0'
    torchx.pytorch.org/role-name: abc
    torchx.pytorch.org/version: 0.8.0dev0
  mem_limit: 377472m
  mounts: []
  name: abc-kr12qr37093rz-abc-0
  nano_cpus: 192000000000
  network: torchx
  privileged: false
  shm_size: 377472m
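The fields in the dry-run request above follow a simple pattern, which can be sketched as below. This is an illustrative reconstruction rather than the scheduler's actual code: the helper names `neuron_device_mappings` and `docker_limits` are hypothetical, and the only facts relied on are the `host:container:permissions` device format and Docker's convention of expressing CPUs in units of 1e-9.

```python
from typing import Dict, List, Union


def neuron_device_mappings(count: int) -> List[str]:
    """Build Docker device mappings in host:container:permissions form."""
    return [f"/dev/neuron{i}:/dev/neuron{i}:rwm" for i in range(count)]


def docker_limits(cpu: int, mem_mb: int) -> Dict[str, Union[int, str]]:
    """Translate CPU and memory counts into the Docker-style fields shown above."""
    return {
        "nano_cpus": cpu * 10**9,  # Docker counts CPUs in billionths of a CPU
        "mem_limit": f"{mem_mb}m",
        "shm_size": f"{mem_mb}m",
    }


if __name__ == "__main__":
    # Inputs mirror the dry-run request: 12 devices, 192 CPUs, 377472 MB.
    for mapping in neuron_device_mappings(12):
        print(mapping)
    print(docker_limits(192, 377472))
```

With 12 devices, 192 CPUs, and 377472 MB as inputs, the helpers produce the same device list, `nano_cpus`, `mem_limit`, and `shm_size` values that appear in the request.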

shixianc avatar Nov 25 '24 21:11 shixianc

Hi @shixianc!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

facebook-github-bot avatar Nov 25 '24 21:11 facebook-github-bot

@d4l3k, @kiukchung could you help review? thanks!

shixianc avatar Nov 25 '24 21:11 shixianc

> LGTM -- can you sign the CLA?

Signed a few minutes ago; it will probably take some time to propagate.

shixianc avatar Nov 25 '24 21:11 shixianc

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

facebook-github-bot avatar Nov 25 '24 22:11 facebook-github-bot

Hi @d4l3k, could you help check on the lint failure? I ran lintrunner --skip PYRE --force-color --all-files locally and it returned "ok No lint issues." Is this something I need to look into further on my end?

shixianc avatar Nov 26 '24 01:11 shixianc

@shixianc can you rebase onto main? https://github.com/pytorch/torchx/commit/53933e31490e830accb7926316b3025d3455e8c4 updates pyre version.

kiukchung avatar Nov 26 '24 20:11 kiukchung

@kiukchung @d4l3k rebased. Could you help rerun the checks?

shixianc avatar Nov 26 '24 21:11 shixianc

@ashvinnihalani @d4l3k @kiukchung I made a small correction to the config. Could you help merge the CR? (Not sure if the lint blocker is still there.)

shixianc avatar Dec 20 '24 16:12 shixianc

@kiukchung Small ping about this.

ashvinnihalani avatar Dec 24 '24 20:12 ashvinnihalani

@kiukchung @d4l3k kindly pinging again to request that the lint workflow be kicked off, thanks.

shixianc avatar Jan 03 '25 18:01 shixianc

This landed in https://github.com/meta-pytorch/torchx/pull/1002

d4l3k avatar Oct 04 '25 00:10 d4l3k