Add AWS Inf2 instances support for aws_batch scheduler
Add AWS Inf2 instance support for the aws_batch scheduler. There are use cases for using torchx to launch data-parallel inference jobs on Inf2 instances via AWS Batch.
Configurations reference https://aws.amazon.com/ec2/instance-types/inf2/
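As a rough sketch of what such a configuration table could look like, the snippet below maps Inf2 instance types to their published vCPU, memory, and Neuron-device counts and builds the `/dev/neuron*` device strings seen in the dryrun output further down. The names (`InstanceSpec`, `neuron_device_mounts`, the `aws_inf2.*` keys) are illustrative, not torchx's actual API; the 48xlarge figures (192 vCPUs, 12 Neuron devices) match the dryrun output below.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class InstanceSpec:
    """Illustrative per-instance-type capacity record (not torchx's real type)."""

    vcpus: int
    mem_gib: int
    neuron_devices: int  # number of /dev/neuron* character devices exposed


# Per-size figures from the public AWS Inf2 instance-type page.
INF2_SPECS: Dict[str, InstanceSpec] = {
    "aws_inf2.xlarge": InstanceSpec(vcpus=4, mem_gib=16, neuron_devices=1),
    "aws_inf2.8xlarge": InstanceSpec(vcpus=32, mem_gib=128, neuron_devices=1),
    "aws_inf2.24xlarge": InstanceSpec(vcpus=96, mem_gib=384, neuron_devices=6),
    "aws_inf2.48xlarge": InstanceSpec(vcpus=192, mem_gib=768, neuron_devices=12),
}


def neuron_device_mounts(instance_type: str) -> List[str]:
    """Build Docker device strings like the ones in the dryrun output below."""
    spec = INF2_SPECS[instance_type]
    return [f"/dev/neuron{i}:/dev/neuron{i}:rwm" for i in range(spec.neuron_devices)]
```

For `aws_inf2.48xlarge` this yields twelve entries, `/dev/neuron0` through `/dev/neuron11`, matching the `devices:` list in the scheduler request.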
Test plan: updated the unit tests to cover the new changes.
671 passed, 98 warnings in 177.52s (0:02:57)
❯ torchx run -s local_docker --dryrun dist.ddp -h aws_inf2.48xlarge -j 1 --m abc
=== SCHEDULER REQUEST ===
- !!python/object:torchx.schedulers.docker_scheduler.DockerContainer
command:
- bash
- -c
- torchrun --rdzv_backend c10d --rdzv_endpoint localhost:0 --rdzv_id 'abc-kr12qr37093rz'
--nnodes 1 --nproc_per_node 1 --tee 3 --role '' -m abc
image: sha256:3f8f845e25030d9523bf299dee1c3ca6f2b008fc2bf0a2161ed949efe168a3e1
kwargs:
devices:
- /dev/neuron0:/dev/neuron0:rwm
- /dev/neuron1:/dev/neuron1:rwm
- /dev/neuron2:/dev/neuron2:rwm
- /dev/neuron3:/dev/neuron3:rwm
- /dev/neuron4:/dev/neuron4:rwm
- /dev/neuron5:/dev/neuron5:rwm
- /dev/neuron6:/dev/neuron6:rwm
- /dev/neuron7:/dev/neuron7:rwm
- /dev/neuron8:/dev/neuron8:rwm
- /dev/neuron9:/dev/neuron9:rwm
- /dev/neuron10:/dev/neuron10:rwm
- /dev/neuron11:/dev/neuron11:rwm
environment:
  LOGLEVEL: WARNING
  TORCHX_JOB_ID: local_docker://torchx/abc-kr12qr37093rz
  TORCHX_RANK0_HOST: abc-kr12qr37093rz-abc-0
  TORCHX_TRACKING_EXPERIMENT_NAME: default-experiment
hostname: abc-kr12qr37093rz-abc-0
labels:
  torchx.pytorch.org/app-id: abc-kr12qr37093rz
  torchx.pytorch.org/replica-id: '0'
  torchx.pytorch.org/role-name: abc
  torchx.pytorch.org/version: 0.8.0dev0
mem_limit: 377472m
mounts: []
name: abc-kr12qr37093rz-abc-0
nano_cpus: 192000000000
network: torchx
privileged: false
shm_size: 377472m
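The dryrun request above exposes all 12 Neuron devices and 192 vCPUs for the 48xlarge size. A quick sanity check one could write against such a container spec (the dict fields mirror the YAML above; Docker expresses CPU limits in nano-CPU units, so 192 vCPUs is 192 × 10⁹):

```python
# Illustrative check against a container spec shaped like the dryrun YAML above.
container = {
    "devices": [f"/dev/neuron{i}:/dev/neuron{i}:rwm" for i in range(12)],
    "nano_cpus": 192_000_000_000,  # Docker nano-CPU units: 192 vCPUs
    "shm_size": "377472m",
}

# Every Neuron device from the 48xlarge should be mounted read/write/mknod.
assert len(container["devices"]) == 12
assert all(d.endswith(":rwm") for d in container["devices"])

# nano_cpus divided by 1e9 recovers the vCPU count.
assert container["nano_cpus"] // 1_000_000_000 == 192
```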
Hi @shixianc!
Thank you for your pull request and welcome to our community.
Action Required
In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.
Process
In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.
Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.
If you have received this in error or have any questions, please contact us at [email protected]. Thanks!
@d4l3k, @kiukchung could you help review? thanks!
LGTM -- can you sign the CLA?
Signed a few minutes ago; it'll probably take some time to propagate.
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!
Hi @d4l3k
Could you help check on the lint failure? I ran `lintrunner --skip PYRE --force-color --all-files` locally
and it reported "ok No lint issues."
Is this something I need to further look into from my end?
@shixianc can you rebase onto main? https://github.com/pytorch/torchx/commit/53933e31490e830accb7926316b3025d3455e8c4 updates pyre version.
@kiukchung @d4l3k rebased. Could you help rerun the checks?
@ashvinnihalani @d4l3k @kiukchung I made a small correction to the config. Could you help merge the PR? (Not sure if the lint blocker is still there.)
@kiukchung Small ping about this.
@kiukchung @d4l3k kindly pinging again to request kicking off the lint workflow, thanks.
this landed in https://github.com/meta-pytorch/torchx/pull/1002