
[bug] smdistributed is not included in HuggingFace training image

Open dbpprt opened this issue 1 year ago • 2 comments

Checklist

  • [x] I've prepended issue tag with type of change: [bug]
  • [ ] (If applicable) I've attached the script to reproduce the bug
  • [x] (If applicable) I've documented below the DLC image/dockerfile this relates to
  • [ ] (If applicable) I've documented below the tests I've run on the DLC image
  • [x] I'm using an existing DLC image listed here: https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html
  • [ ] I've built my own container based off DLC (and I've attached the code used to build my own image)

Concise Description: smdistributed is not available.

ModuleNotFoundError: No module named 'smdistributed'

DLC image/dockerfile: 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-training:2.1.0-transformers4.36.0-gpu-py310-cu121-ubuntu20.04

Current behavior:

Expected behavior:

Additional context: Installing it manually gives the following error:

ErrorMessage "ImportError: libsmddpcpp.so: cannot open shared object file: No such file or directory

from: https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.1.0/cu121/2024-02-04/smdistributed_dataparallel-2.1.0-cp310-cp310-linux_x86_64.whl
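The two errors above are different failure modes: `ModuleNotFoundError` means the package is not installed at all, while the `ImportError` for `libsmddpcpp.so` means the package was found but a native library it depends on could not be loaded. A minimal sketch for telling them apart inside the container is below. The helper name `probe` and the submodule path `smdistributed.dataparallel` are assumptions for illustration, not taken from the report.

```python
import importlib


def probe(module_name: str) -> str:
    """Classify an import attempt as 'ok', 'missing', or 'broken: <reason>'.

    `probe` is a hypothetical helper for illustration only.
    """
    try:
        importlib.import_module(module_name)
    except ModuleNotFoundError:
        # Must be caught first: ModuleNotFoundError subclasses ImportError.
        # The module, or something it imports, is not installed.
        return "missing"
    except ImportError as exc:
        # The module was found but failed to load, for example because a
        # shared object such as libsmddpcpp.so could not be opened.
        return f"broken: {exc}"
    return "ok"


if __name__ == "__main__":
    # Module names below are assumptions; adjust to the import that fails.
    for name in ("smdistributed", "smdistributed.dataparallel"):
        print(name, "->", probe(name))
```

Running this in the image before and after the manual wheel install would show whether the result changes from "missing" to "broken".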

dbpprt avatar Jun 05 '24 10:06 dbpprt

Any progress here?

niklas-palm avatar Dec 03 '24 07:12 niklas-palm

Hi @dbpprt, would you mind providing more information about what you're trying to do (repro steps, etc.)?

Based on the documentation (https://sagemaker.readthedocs.io/en/v2.91.0/api/training/sdp_versions/latest/smd_data_parallel_pytorch.html#pytorch-api), it appears that torch.distributed should be used over smdistributed.

Moreover, if you could try the latest training image, 763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-training:2.8.0-transformers4.56.2-gpu-py312-cu129-ubuntu22.04, we can verify whether your issue persists in supported versions of the container.

(release notes https://github.com/aws/deep-learning-containers/releases/tag/v1.0-hf-4.56.2-pt-2.8.0-tr-gpu-py312)
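For the torch.distributed suggestion above, here is a rough sketch of a training script that does not hard-depend on smdistributed. It assumes the `smddp` backend is registered by importing `smdistributed.dataparallel.torch.torch_smddp`, as the linked SageMaker documentation describes. The helper names `pick_backend` and `init_distributed` and the fallback to `nccl` are illustrative choices, not something confirmed for this image.

```python
import importlib.util


def pick_backend() -> str:
    """Return 'smddp' if smdistributed is installed, otherwise 'nccl'.

    Hypothetical helper; falling back to 'nccl' is an assumption.
    """
    if importlib.util.find_spec("smdistributed") is not None:
        return "smddp"
    return "nccl"


def init_distributed() -> str:
    """Initialize torch.distributed with whichever backend is available."""
    import torch.distributed as dist  # imported lazily; requires PyTorch

    backend = pick_backend()
    if backend == "smddp":
        # Assumption from the SageMaker docs: this import registers the
        # 'smddp' backend with torch.distributed.
        import smdistributed.dataparallel.torch.torch_smddp  # noqa: F401

    # Relies on RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT being set by
    # the launcher (for example torchrun).
    dist.init_process_group(backend=backend)
    return backend
```

Note that `pick_backend` only checks that the package is installed. It would not catch the `libsmddpcpp.so` load failure reported above, which would still surface when the backend module is imported.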

arjkesh avatar Oct 27 '25 23:10 arjkesh

This issue has been automatically marked as stale due to 60 days of inactivity. Please comment or remove the stale label to keep it open. It will be closed in 7 days if no further activity occurs.

github-actions[bot] avatar Dec 28 '25 19:12 github-actions[bot]