deep-learning-containers
No module named mpi4py [bug]
Checklist
- [x] I've prepended issue tag with type of change: [bug]
- [x] (If applicable) I've attached the script to reproduce the bug
- [x] (If applicable) I've documented below the DLC image/dockerfile this relates to
- [x] (If applicable) I've documented below the tests I've run on the DLC image
- [x] I'm using an existing DLC image listed here: https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html
- [ ] I've built my own container based off DLC (and I've attached the code used to build my own image)
Concise Description:
When running most, if not all, of the PyTorch CPU images with distribution set to MPI, I get `No module named mpi4py`.
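Context for the error: mpirun launches the entry point as `python -m mpi4py mnist.py` (see the command in the error log below), so the job fails as soon as the image's interpreter cannot import mpi4py. The following minimal check is only an illustration, not part of the original repro; it would be run with the image's /opt/conda/bin/python3.8.

```python
# Illustration only: what the failing `-m mpi4py` launch boils down to.
# If mpi4py is absent from the image's Python environment, mpirun's
# `python -m mpi4py mnist.py` dies immediately with "No module named mpi4py".
import importlib.util

if importlib.util.find_spec("mpi4py") is None:
    print("mpi4py is not installed; `python -m mpi4py <script>` will fail")
else:
    import mpi4py
    print("mpi4py", mpi4py.__version__, "is available")
```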
To reproduce:
```python
from sagemaker.pytorch import PyTorch
from torchvision.datasets import MNIST
from torchvision import transforms
import sagemaker

sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker/DEMO-pytorch-mnist'
role = sagemaker.get_execution_role()

MNIST.mirrors = ["https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/"]
MNIST(
    'data',
    download=True,
    transform=transforms.Compose(
        [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]
    )
)

inputs = sagemaker_session.upload_data(path='data', bucket=bucket, key_prefix=prefix)
print('input spec (in this case, just an S3 path): {}'.format(inputs))

estimator = PyTorch(entry_point='mnist.py',
                    role=role,
                    py_version='py38',
                    framework_version='1.11.0',
                    instance_count=2,
                    instance_type='ml.m5.large',
                    hyperparameters={
                        'epochs': 1,
                        'backend': 'gloo'
                    },
                    distribution={
                        "mpi": {
                            "enabled": True,
                            "processes_per_host": 1,
                        }
                    })

estimator.fit({'training': inputs})
```
DLC image/dockerfile:
Current behavior:
Error log
Invoking script with the following command:
mpirun --host algo-1,algo-2 -np 2 --allow-run-as-root --display-map --tag-output -mca btl_tcp_if_include eth0 -mca oob_tcp_if_include eth0 -mca plm_rsh_no_tree_spawn 1 -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib -mca orte_abort_on_non_zero_status 1 -mca btl_vader_single_copy_mechanism none -x NCCL_MIN_NRINGS=4 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x LD_PRELOAD=/opt/conda/lib/python3.8/site-packages/gethostname.cpython-38-x86_64-linux-gnu.so -x SM_HOSTS -x SM_NETWORK_INTERFACE_NAME -x SM_HPS -x SM_USER_ENTRY_POINT -x SM_FRAMEWORK_PARAMS -x SM_RESOURCE_CONFIG -x SM_INPUT_DATA_CONFIG -x SM_OUTPUT_DATA_DIR -x SM_CHANNELS -x SM_CURRENT_HOST -x SM_MODULE_NAME -x SM_LOG_LEVEL -x SM_FRAMEWORK_MODULE -x SM_INPUT_DIR -x SM_INPUT_CONFIG_DIR -x SM_OUTPUT_DIR -x SM_NUM_CPUS -x SM_NUM_GPUS -x SM_MODEL_DIR -x SM_MODULE_DIR -x SM_TRAINING_ENV -x SM_USER_ARGS -x SM_OUTPUT_INTERMEDIATE_DIR -x SM_CHANNEL_TRAINING -x SM_HP_BACKEND -x SM_HP_EPOCHS -x PYTHONPATH /opt/conda/bin/python3.8 -m mpi4py mnist.py --backend gloo --epochs 1
Warning: Permanently added 'algo-2,10.x.xxx.182' (ECDSA) to the list of known hosts.
Data for JOB [41164,1] offset 0 Total slots allocated 2
======================== JOB MAP ========================
Data for node: algo-1    Num slots: 1    Max slots: 0    Num procs: 1
    Process OMPI jobid: [41164,1] App: 0 Process rank: 0 Bound: N/A
Data for node: algo-2    Num slots: 1    Max slots: 0    Num procs: 1
    Process OMPI jobid: [41164,1] App: 0 Process rank: 1 Bound: N/A
=============================================================
[1,mpirank:0,algo-1]<stderr>:/opt/conda/bin/python3.8: No module named mpi4py
[1,mpirank:0,algo-1]<stderr>:
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
/opt/conda/lib/python3.8/site-packages/paramiko/transport.py:236: CryptographyDeprecationWarning: Blowfish has been deprecated
"class": algorithms.Blowfish,
2022-05-30 14:15:47,290 sagemaker-training-toolkit INFO Imported framework sagemaker_pytorch_container.training
2022-05-30 14:15:47,296 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)
2022-05-30 14:15:47,310 sagemaker_pytorch_container.training INFO Block until all host DNS lookups succeed.
2022-05-30 14:15:47,318 sagemaker_pytorch_container.training INFO Invoking user training script.
2022-05-30 14:15:47,858 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)
2022-05-30 14:15:47,871 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)
2022-05-30 14:15:47,884 sagemaker-training-toolkit INFO Starting MPI run as worker node.
2022-05-30 14:15:47,885 sagemaker-training-toolkit INFO Waiting for MPI Master to create SSH daemon.
2022-05-30 14:15:47,902 paramiko.transport INFO Connected (version 2.0, client OpenSSH_8.2p1)
2022-05-30 14:15:48,201 paramiko.transport INFO Authentication (publickey) successful!
2022-05-30 14:15:48,201 sagemaker-training-toolkit INFO Can connect to host algo-1
2022-05-30 14:15:48,201 sagemaker-training-toolkit INFO MPI Master online, creating SSH daemon.
2022-05-30 14:15:48,201 sagemaker-training-toolkit INFO Writing environment variables to /etc/environment for the MPI process.
2022-05-30 14:15:48,208 sagemaker-training-toolkit INFO Waiting for MPI process to finish.
2022-05-30 14:15:50,213 sagemaker-training-toolkit INFO Process[es]: [psutil.Process(pid=51, name='orted', status='sleeping', started='14:15:48')]
2022-05-30 14:15:50,213 sagemaker-training-toolkit INFO Orted process found [psutil.Process(pid=51, name='orted', status='sleeping', started='14:15:48')]
2022-05-30 14:15:50,213 sagemaker-training-toolkit INFO Waiting for orted process [psutil.Process(pid=51, name='orted', status='sleeping', started='14:15:48')]
2022-05-30 14:15:50,546 sagemaker-training-toolkit INFO Orted process exited
[1,mpirank:1,algo-2]<stderr>:/opt/conda/bin/python3.8: No module named mpi4py
[1,mpirank:1,algo-2]<stderr>:
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[41164,1],0]
Exit code: 1
--------------------------------------------------------------------------
2022-05-30 14:15:50,526 sagemaker-training-toolkit ERROR Reporting training FAILURE
2022-05-30 14:15:50,526 sagemaker-training-toolkit ERROR ExecuteUserScriptError:
ExitCode 1
ErrorMessage ""
Command "mpirun --host algo-1,algo-2 -np 2 --allow-run-as-root --display-map --tag-output -mca btl_tcp_if_include eth0 -mca oob_tcp_if_include eth0 -mca plm_rsh_no_tree_spawn 1 -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib -mca orte_abort_on_non_zero_status 1 -mca btl_vader_single_copy_mechanism none -x NCCL_MIN_NRINGS=4 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x LD_PRELOAD=/opt/conda/lib/python3.8/site-packages/gethostname.cpython-38-x86_64-linux-gnu.so -x SM_HOSTS -x SM_NETWORK_INTERFACE_NAME -x SM_HPS -x SM_USER_ENTRY_POINT -x SM_FRAMEWORK_PARAMS -x SM_RESOURCE_CONFIG -x SM_INPUT_DATA_CONFIG -x SM_OUTPUT_DATA_DIR -x SM_CHANNELS -x SM_CURRENT_HOST -x SM_MODULE_NAME -x SM_LOG_LEVEL -x SM_FRAMEWORK_MODULE -x SM_INPUT_DIR -x SM_INPUT_CONFIG_DIR -x SM_OUTPUT_DIR -x SM_NUM_CPUS -x SM_NUM_GPUS -x SM_MODEL_DIR -x SM_MODULE_DIR -x SM_TRAINING_ENV -x SM_USER_ARGS -x SM_OUTPUT_INTERMEDIATE_DIR -x SM_CHANNEL_TRAINING -x SM_HP_BACKEND -x SM_HP_EPOCHS -x PYTHONPATH /opt/conda/bin/python3.8 -m mpi4py mnist.py --backend gloo --epochs 1"
2022-05-30 14:15:50,526 sagemaker-training-toolkit ERROR Encountered exit_code 1
2022-05-30 14:16:00 Uploading - Uploading generated training model
2022-05-30 14:16:20,575 sagemaker-training-toolkit INFO MPI process finished.
2022-05-30 14:16:20,575 sagemaker-training-toolkit INFO Reporting training SUCCESS
2022-05-30 14:16:26 Failed - Training job failed
ProfilerReport-1653919936: NoIssuesFound
Expected behavior: The training job should run successfully.
Additional context: The PyTorch Dockerfile.cpu files do not pip install mpi4py, while the Dockerfile.gpu ones, and all of the TensorFlow ones, do.
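Until that is fixed, one possible interim workaround (an untested sketch, not something I have verified) is to ship mpi4py as a pip requirement next to the entry point: the SageMaker training toolkit installs a requirements.txt found in source_dir before running the script. Whether pip can actually build mpi4py inside the CPU image (it needs the OpenMPI compiler wrappers that ship with the container) is an assumption to verify.

```python
# Untested workaround sketch, continuing from the repro above: have the training
# toolkit pip-install mpi4py from a requirements.txt shipped with the entry point.
# Assumes the CPU image can build mpi4py (OpenMPI itself is clearly present,
# since mpirun runs).
#
# src/
#   mnist.py
#   requirements.txt   # a single line: mpi4py

estimator = PyTorch(entry_point='mnist.py',
                    source_dir='src',  # requirements.txt is picked up from here
                    role=role,
                    py_version='py38',
                    framework_version='1.11.0',
                    instance_count=2,
                    instance_type='ml.m5.large',
                    hyperparameters={'epochs': 1, 'backend': 'gloo'},
                    distribution={'mpi': {'enabled': True, 'processes_per_host': 1}})
```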
Hi @ProxJ - I know this issue was opened long ago - are you still having issues with this?
We have mpi4py installed in the latest PT 2.0 training containers: https://github.com/aws/deep-learning-containers/blob/master/pytorch/training/docker/2.0/py3/Dockerfile.cpu#L122
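For anyone still on the older images, a minimal sketch of pointing the same job at a PT 2.0 CPU container might look like the following; the exact framework_version/py_version strings are placeholders to check against the released image list.

```python
# Sketch only: run the same job on a PyTorch 2.0 training container, where
# mpi4py is installed. The version strings below are assumptions; check the
# available-images list for the exact supported combinations.
estimator = PyTorch(entry_point='mnist.py',
                    role=role,
                    py_version='py310',
                    framework_version='2.0.0',
                    instance_count=2,
                    instance_type='ml.m5.large',
                    hyperparameters={'epochs': 1, 'backend': 'gloo'},
                    distribution={'mpi': {'enabled': True, 'processes_per_host': 1}})
```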