
No module named mpi4py [bug]

Open ProxJ opened this issue 3 years ago • 1 comment

Checklist

  • [x] I've prepended issue tag with type of change: [bug]
  • [x] (If applicable) I've attached the script to reproduce the bug
  • [x] (If applicable) I've documented below the DLC image/dockerfile this relates to
  • [x] (If applicable) I've documented below the tests I've run on the DLC image
  • [x] I'm using an existing DLC image listed here: https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html
  • [ ] I've built my own container based off DLC (and I've attached the code used to build my own image)

Concise Description: When running most, if not all, of the PyTorch CPU images with distribution set to MPI, I get "No module named mpi4py". To reproduce:

from sagemaker.pytorch import PyTorch
from torchvision.datasets import MNIST
from torchvision import transforms

import sagemaker

sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker/DEMO-pytorch-mnist'

role = sagemaker.get_execution_role()

MNIST.mirrors = ["https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/"]

MNIST(
    'data',
    download=True,
    transform=transforms.Compose(
        [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]
    )
)
inputs = sagemaker_session.upload_data(path='data', bucket=bucket, key_prefix=prefix)
print('input spec (in this case, just an S3 path): {}'.format(inputs))

estimator = PyTorch(entry_point='mnist.py',
                    role=role,
                    py_version='py38',
                    framework_version='1.11.0',
                    instance_count=2,
                    instance_type='ml.m5.large',
                    hyperparameters={
                        'epochs': 1,
                        'backend': 'gloo'
                    },
                    distribution={
                        "mpi": {
                            "enabled": True,
                            "processes_per_host": 1,
                        }
                    }
                   )
estimator.fit({'training': inputs})
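
A possible workaround until the CPU images ship mpi4py (a sketch only; it assumes the training toolkit pip-installs requirements.txt dependencies before it launches the mpirun command, which I have not verified on these images): put mnist.py and a requirements.txt containing the single line mpi4py in a local directory and pass it as source_dir.

# Workaround sketch (assumed local layout):
#   code/
#     mnist.py
#     requirements.txt   <- contains one line: mpi4py
estimator = PyTorch(entry_point='mnist.py',
                    source_dir='code',   # toolkit should install requirements.txt from here
                    role=role,
                    py_version='py38',
                    framework_version='1.11.0',
                    instance_count=2,
                    instance_type='ml.m5.large',
                    hyperparameters={'epochs': 1, 'backend': 'gloo'},
                    distribution={'mpi': {'enabled': True, 'processes_per_host': 1}})
estimator.fit({'training': inputs})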

DLC image/dockerfile:

Current behavior:

Error log
Invoking script with the following command:
mpirun --host algo-1,algo-2 -np 2 --allow-run-as-root --display-map --tag-output -mca btl_tcp_if_include eth0 -mca oob_tcp_if_include eth0 -mca plm_rsh_no_tree_spawn 1 -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib -mca orte_abort_on_non_zero_status 1 -mca btl_vader_single_copy_mechanism none -x NCCL_MIN_NRINGS=4 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x LD_PRELOAD=/opt/conda/lib/python3.8/site-packages/gethostname.cpython-38-x86_64-linux-gnu.so -x SM_HOSTS -x SM_NETWORK_INTERFACE_NAME -x SM_HPS -x SM_USER_ENTRY_POINT -x SM_FRAMEWORK_PARAMS -x SM_RESOURCE_CONFIG -x SM_INPUT_DATA_CONFIG -x SM_OUTPUT_DATA_DIR -x SM_CHANNELS -x SM_CURRENT_HOST -x SM_MODULE_NAME -x SM_LOG_LEVEL -x SM_FRAMEWORK_MODULE -x SM_INPUT_DIR -x SM_INPUT_CONFIG_DIR -x SM_OUTPUT_DIR -x SM_NUM_CPUS -x SM_NUM_GPUS -x SM_MODEL_DIR -x SM_MODULE_DIR -x SM_TRAINING_ENV -x SM_USER_ARGS -x SM_OUTPUT_INTERMEDIATE_DIR -x SM_CHANNEL_TRAINING -x SM_HP_BACKEND -x SM_HP_EPOCHS -x PYTHONPATH /opt/conda/bin/python3.8 -m mpi4py mnist.py --backend gloo --epochs 1
Warning: Permanently added 'algo-2,10.x.xxx.182' (ECDSA) to the list of known hosts.
Data for JOB [41164,1] offset 0 Total slots allocated 2
 ========================   JOB MAP   ========================
 Data for node: algo-1#011Num slots: 1#011Max slots: 0#011Num procs: 1
 #011Process OMPI jobid: [41164,1] App: 0 Process rank: 0 Bound: N/A
 Data for node: algo-2#011Num slots: 1#011Max slots: 0#011Num procs: 1
 #011Process OMPI jobid: [41164,1] App: 0 Process rank: 1 Bound: N/A
 =============================================================
Data for JOB [41164,1] offset 0 Total slots allocated 2
 ========================   JOB MAP   ========================
 Data for node: algo-1#011Num slots: 1#011Max slots: 0#011Num procs: 1
 #011Process OMPI jobid: [41164,1] App: 0 Process rank: 0 Bound: N/A
 Data for node: algo-2#011Num slots: 1#011Max slots: 0#011Num procs: 1
 #011Process OMPI jobid: [41164,1] App: 0 Process rank: 1 Bound: N/A
 =============================================================
[1,mpirank:0,algo-1]<stderr>:/opt/conda/bin/python3.8: No module named mpi4py
[1,mpirank:0,algo-1]<stderr>:
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
/opt/conda/lib/python3.8/site-packages/paramiko/transport.py:236: CryptographyDeprecationWarning: Blowfish has been deprecated
  "class": algorithms.Blowfish,
2022-05-30 14:15:47,290 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2022-05-30 14:15:47,296 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-05-30 14:15:47,310 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2022-05-30 14:15:47,318 sagemaker_pytorch_container.training INFO     Invoking user training script.
2022-05-30 14:15:47,858 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-05-30 14:15:47,871 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2022-05-30 14:15:47,884 sagemaker-training-toolkit INFO     Starting MPI run as worker node.
2022-05-30 14:15:47,885 sagemaker-training-toolkit INFO     Waiting for MPI Master to create SSH daemon.
2022-05-30 14:15:47,902 paramiko.transport INFO     Connected (version 2.0, client OpenSSH_8.2p1)
2022-05-30 14:15:48,201 paramiko.transport INFO     Authentication (publickey) successful!
2022-05-30 14:15:48,201 sagemaker-training-toolkit INFO     Can connect to host algo-1
2022-05-30 14:15:48,201 sagemaker-training-toolkit INFO     MPI Master online, creating SSH daemon.
2022-05-30 14:15:48,201 sagemaker-training-toolkit INFO     Writing environment variables to /etc/environment for the MPI process.
2022-05-30 14:15:48,208 sagemaker-training-toolkit INFO     Waiting for MPI process to finish.
2022-05-30 14:15:50,213 sagemaker-training-toolkit INFO     Process[es]: [psutil.Process(pid=51, name='orted', status='sleeping', started='14:15:48')]
2022-05-30 14:15:50,213 sagemaker-training-toolkit INFO     Orted process found [psutil.Process(pid=51, name='orted', status='sleeping', started='14:15:48')]
2022-05-30 14:15:50,213 sagemaker-training-toolkit INFO     Waiting for orted process [psutil.Process(pid=51, name='orted', status='sleeping', started='14:15:48')]
2022-05-30 14:15:50,546 sagemaker-training-toolkit INFO     Orted process exited
[1,mpirank:1,algo-2]<stderr>:/opt/conda/bin/python3.8: No module named mpi4py
[1,mpirank:1,algo-2]<stderr>:
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
  Process name: [[41164,1],0]
  Exit code:    1
--------------------------------------------------------------------------
2022-05-30 14:15:50,526 sagemaker-training-toolkit ERROR    Reporting training FAILURE
2022-05-30 14:15:50,526 sagemaker-training-toolkit ERROR    ExecuteUserScriptError:
ExitCode 1
ErrorMessage ""
Command "mpirun --host algo-1,algo-2 -np 2 --allow-run-as-root --display-map --tag-output -mca btl_tcp_if_include eth0 -mca oob_tcp_if_include eth0 -mca plm_rsh_no_tree_spawn 1 -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib -mca orte_abort_on_non_zero_status 1 -mca btl_vader_single_copy_mechanism none -x NCCL_MIN_NRINGS=4 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x LD_PRELOAD=/opt/conda/lib/python3.8/site-packages/gethostname.cpython-38-x86_64-linux-gnu.so -x SM_HOSTS -x SM_NETWORK_INTERFACE_NAME -x SM_HPS -x SM_USER_ENTRY_POINT -x SM_FRAMEWORK_PARAMS -x SM_RESOURCE_CONFIG -x SM_INPUT_DATA_CONFIG -x SM_OUTPUT_DATA_DIR -x SM_CHANNELS -x SM_CURRENT_HOST -x SM_MODULE_NAME -x SM_LOG_LEVEL -x SM_FRAMEWORK_MODULE -x SM_INPUT_DIR -x SM_INPUT_CONFIG_DIR -x SM_OUTPUT_DIR -x SM_NUM_CPUS -x SM_NUM_GPUS -x SM_MODEL_DIR -x SM_MODULE_DIR -x SM_TRAINING_ENV -x SM_USER_ARGS -x SM_OUTPUT_INTERMEDIATE_DIR -x SM_CHANNEL_TRAINING -x SM_HP_BACKEND -x SM_HP_EPOCHS -x PYTHONPATH /opt/conda/bin/python3.8 -m mpi4py mnist.py --backend gloo --epochs 1"
2022-05-30 14:15:50,526 sagemaker-training-toolkit ERROR    Encountered exit_code 1

2022-05-30 14:16:00 Uploading - Uploading generated training model
2022-05-30 14:16:20,575 sagemaker-training-toolkit INFO     MPI process finished.
2022-05-30 14:16:20,575 sagemaker-training-toolkit INFO     Reporting training SUCCESS

2022-05-30 14:16:26 Failed - Training job failed
ProfilerReport-1653919936: NoIssuesFound

Expected behavior: It should run successfully.

Additional context: The PyTorch Dockerfile.cpu files do not pip install mpi4py, while the Dockerfile.gpu ones, and all of the TensorFlow Dockerfiles, do.
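
A quick way to confirm this from inside a running container (a hypothetical sanity check, not part of the original report) is to probe for the module with the image's own interpreter:

# Run with the image's Python (/opt/conda/bin/python3.8) to check whether mpi4py is present.
import importlib.util

if importlib.util.find_spec("mpi4py") is None:
    print("mpi4py is NOT installed in this image")
else:
    import mpi4py
    print("mpi4py {} is installed".format(mpi4py.__version__))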

ProxJ · May 30 '22 16:05

Hi @ProxJ - I know this issue was opened long ago - are you still having issues with this?

We have mpi4py installed in the latest PT 2.0 training containers: https://github.com/aws/deep-learning-containers/blob/master/pytorch/training/docker/2.0/py3/Dockerfile.cpu#L122
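
For anyone still hitting this on the older CPU images, pointing the estimator at a PT 2.0 image should pick up a container with mpi4py preinstalled. A sketch only; the exact framework_version and py_version strings are assumptions and should be checked against the published image list linked above.

estimator = PyTorch(entry_point='mnist.py',
                    role=role,
                    py_version='py310',          # assumed; verify against the image list
                    framework_version='2.0.0',   # assumed; verify against the image list
                    instance_count=2,
                    instance_type='ml.m5.large',
                    hyperparameters={'epochs': 1, 'backend': 'gloo'},
                    distribution={'mpi': {'enabled': True, 'processes_per_host': 1}})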

arjkesh · Jun 08 '23 07:06