Add AWS Inf2 instances support for aws_batch scheduler
Add AWS Inf2 instance support for the aws_batch scheduler. There are use cases for using torchx to launch data-parallel inference jobs on Inf2 instances via AWS Batch.
Configurations reference https://aws.amazon.com/ec2/instance-types/inf2/
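As a rough sketch of what such a configuration table could look like, the snippet below maps Inf2 instance types to their published vCPU, memory, and Neuron-device counts and builds the `/dev/neuron*` device strings seen in the dryrun output further down. The names (`InstanceSpec`, `neuron_device_mounts`, the `aws_inf2.*` keys) are illustrative, not torchx's actual API; the 48xlarge figures (192 vCPUs, 12 Neuron devices) match the dryrun output below.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class InstanceSpec:
    """Illustrative per-instance-type capacity record (not torchx's real type)."""

    vcpus: int
    mem_gib: int
    neuron_devices: int  # number of /dev/neuron* character devices exposed


# Per-size figures from the public AWS Inf2 instance-type page.
INF2_SPECS: Dict[str, InstanceSpec] = {
    "aws_inf2.xlarge": InstanceSpec(vcpus=4, mem_gib=16, neuron_devices=1),
    "aws_inf2.8xlarge": InstanceSpec(vcpus=32, mem_gib=128, neuron_devices=1),
    "aws_inf2.24xlarge": InstanceSpec(vcpus=96, mem_gib=384, neuron_devices=6),
    "aws_inf2.48xlarge": InstanceSpec(vcpus=192, mem_gib=768, neuron_devices=12),
}


def neuron_device_mounts(instance_type: str) -> List[str]:
    """Build Docker device strings like the ones in the dryrun output below."""
    spec = INF2_SPECS[instance_type]
    return [f"/dev/neuron{i}:/dev/neuron{i}:rwm" for i in range(spec.neuron_devices)]
```

For `aws_inf2.48xlarge` this yields twelve entries, `/dev/neuron0` through `/dev/neuron11`, matching the `devices:` list in the scheduler request.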
Test plan: updated the unit tests to cover the new changes.
671 passed, 98 warnings in 177.52s (0:02:57)
❯ torchx run -s local_docker --dryrun dist.ddp -h aws_inf2.48xlarge -j 1 --m abc
=== SCHEDULER REQUEST ===
- !!python/object:torchx.schedulers.docker_scheduler.DockerContainer
command:
- bash
- -c
- torchrun --rdzv_backend c10d --rdzv_endpoint localhost:0 --rdzv_id 'abc-kr12qr37093rz'
--nnodes 1 --nproc_per_node 1 --tee 3 --role '' -m abc
image: sha256:3f8f845e25030d9523bf299dee1c3ca6f2b008fc2bf0a2161ed949efe168a3e1
kwargs:
devices:
- /dev/neuron0:/dev/neuron0:rwm
- /dev/neuron1:/dev/neuron1:rwm
- /dev/neuron2:/dev/neuron2:rwm
- /dev/neuron3:/dev/neuron3:rwm
- /dev/neuron4:/dev/neuron4:rwm
- /dev/neuron5:/dev/neuron5:rwm
- /dev/neuron6:/dev/neuron6:rwm
- /dev/neuron7:/dev/neuron7:rwm
- /dev/neuron8:/dev/neuron8:rwm
- /dev/neuron9:/dev/neuron9:rwm
- /dev/neuron10:/dev/neuron10:rwm
- /dev/neuron11:/dev/neuron11:rwm
environment:
  LOGLEVEL: WARNING
  TORCHX_JOB_ID: local_docker://torchx/abc-kr12qr37093rz
  TORCHX_RANK0_HOST: abc-kr12qr37093rz-abc-0
  TORCHX_TRACKING_EXPERIMENT_NAME: default-experiment
hostname: abc-kr12qr37093rz-abc-0
labels:
  torchx.pytorch.org/app-id: abc-kr12qr37093rz
  torchx.pytorch.org/replica-id: '0'
  torchx.pytorch.org/role-name: abc
  torchx.pytorch.org/version: 0.8.0dev0
mem_limit: 377472m
mounts: []
name: abc-kr12qr37093rz-abc-0
nano_cpus: 192000000000
network: torchx
privileged: false
shm_size: 377472m
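The dryrun request above exposes all 12 Neuron devices and 192 vCPUs for the 48xlarge size. A quick sanity check one could write against such a container spec (the dict fields mirror the YAML above; Docker expresses CPU limits in nano-CPU units, so 192 vCPUs is 192 × 10⁹):

```python
# Illustrative check against a container spec shaped like the dryrun YAML above.
container = {
    "devices": [f"/dev/neuron{i}:/dev/neuron{i}:rwm" for i in range(12)],
    "nano_cpus": 192_000_000_000,  # Docker nano-CPU units: 192 vCPUs
    "shm_size": "377472m",
}

# Every Neuron device from the 48xlarge should be mounted read/write/mknod.
assert len(container["devices"]) == 12
assert all(d.endswith(":rwm") for d in container["devices"])

# nano_cpus divided by 1e9 recovers the vCPU count.
assert container["nano_cpus"] // 1_000_000_000 == 192
```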
Hi @shixianc!
Thank you for your pull request and welcome to our community.
Action Required
In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.
Process
In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.
Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.
If you have received this in error or have any questions, please contact us at [email protected]. Thanks!
@d4l3k, @kiukchung could you help review? thanks!
LGTM -- can you sign the CLA?
Signed a few minutes ago; it'll probably take some time to propagate.
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!
Hi @d4l3k
Could you help check on the lint failure? I ran `lintrunner --skip PYRE --force-color --all-files` locally
and it reported "ok No lint issues."
Is this something I need to further look into from my end?
@shixianc can you rebase onto main? https://github.com/pytorch/torchx/commit/53933e31490e830accb7926316b3025d3455e8c4 updates pyre version.
@kiukchung @d4l3k rebased. Could you help rerun the checks?
@ashvinnihalani @d4l3k @kiukchung I made a small correction to the config. Could you help merge the PR? (Not sure if the lint blocker is still there.)
@kiukchung Small ping about this.
@kiukchung @d4l3k kindly pinging again to request kicking off the lint workflow, thanks.
this landed in https://github.com/meta-pytorch/torchx/pull/1002