[ECS] Add support for GPU with Docker 19.03
Summary
Docker 19.03 has built-in GPU support, so there is no need to specify an alternate runtime. However, --gpus all (or a specific set of GPUs) has to be passed as an argument to docker run at run time; it cannot be set in the dockerd configuration.
Description
Without --gpus all
$ docker run --rm nvidia/cuda:10.1-base nvidia-smi
docker: Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "exec: \"nvidia-smi\": executable file not found in $PATH": unknown.
With --gpus all
$ docker run --gpus all --runtime runc --rm nvidia/cuda:10.1-base nvidia-smi
Thu Aug 29 13:54:36 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00 Driver Version: 418.87.00 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 00000000:00:0F.0 Off | 0 |
| N/A 35C P8 26W / 149W | 0MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 On | 00000000:00:10.0 Off | 0 |
| N/A 32C P8 29W / 149W | 0MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 On | 00000000:00:11.0 Off | 0 |
| N/A 40C P8 27W / 149W | 0MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 On | 00000000:00:12.0 Off | 0 |
| N/A 35C P8 29W / 149W | 0MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla K80 On | 00000000:00:13.0 Off | 0 |
| N/A 36C P8 26W / 149W | 0MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla K80 On | 00000000:00:14.0 Off | 0 |
| N/A 34C P8 30W / 149W | 0MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla K80 On | 00000000:00:15.0 Off | 0 |
| N/A 40C P8 26W / 149W | 0MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla K80 On | 00000000:00:16.0 Off | 0 |
| N/A 33C P8 29W / 149W | 0MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 8 Tesla K80 On | 00000000:00:17.0 Off | 0 |
| N/A 36C P8 26W / 149W | 0MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 9 Tesla K80 On | 00000000:00:18.0 Off | 0 |
| N/A 31C P8 30W / 149W | 0MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 10 Tesla K80 On | 00000000:00:19.0 Off | 0 |
| N/A 37C P8 26W / 149W | 0MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 11 Tesla K80 On | 00000000:00:1A.0 Off | 0 |
| N/A 33C P8 29W / 149W | 0MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 12 Tesla K80 On | 00000000:00:1B.0 Off | 0 |
| N/A 38C P8 26W / 149W | 0MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 13 Tesla K80 On | 00000000:00:1C.0 Off | 0 |
| N/A 34C P8 32W / 149W | 0MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 14 Tesla K80 On | 00000000:00:1D.0 Off | 0 |
| N/A 39C P8 27W / 149W | 0MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 15 Tesla K80 On | 00000000:00:1E.0 Off | 0 |
| N/A 34C P8 30W / 149W | 0MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
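As noted in the summary, --gpus also accepts a specific set of GPUs or a count instead of all. For illustration, with the same test image (device indices are examples, syntax per the Docker 19.03 --gpus option):
$ docker run --gpus '"device=0,1"' --rm nvidia/cuda:10.1-base nvidia-smi
$ docker run --gpus 2 --rm nvidia/cuda:10.1-base nvidia-smi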
Expected Behavior
Containers with GPU requirements should start.
Observed Behavior
Containers started through ECS do not get --gpus, so the NVIDIA devices and libraries are not exposed (see the nvidia-smi error above).
Environment Details
$ docker info
Client:
Debug Mode: false
Server:
Containers: 2
Running: 1
Paused: 0
Stopped: 1
Images: 4
Server Version: 19.03.1
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: splunk
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 894b81a4b802e4eb2a91d1ce216b8817763c29fb
runc version: 425e105d5a03fabd737a126ad93d62a9eeede87f
init version: fec3683
Security Options:
seccomp
Profile: default
Kernel Version: 3.10.0-957.27.2.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 64
Total Memory: 720.3GiB
Name: ip-10-45-8-153.us-west-2.compute.internal
ID: GA4Z:BCED:2FQG:AUKO:KUAX:7X5W:SBAR:NWB3:IHCH:6HQN:TIFW:PLOB
Docker Root Dir: /var/lib/docker
Debug Mode: true
File Descriptors: 31
Goroutines: 51
System Time: 2019-08-29T13:55:51.172917082Z
EventsListeners: 0
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: true
$ curl http://localhost:51678/v1/metadata
{"Cluster":"BATCHCLUSTER_Batch_0d2792a4-22a0-37e9-8e8a-5e8b68c1be17","ContainerInstanceArn":"arn:aws:ecs:us-west-2::container-instance/BATCHCLUSTER_Batch_0d2792a4-22a0-37e9-8e8a-5e8b68c1be17/50e649e34b83423189684b82669a1cea","Version":"Amazon ECS Agent - v1.30.0 (02ff320c)"}
Supporting Log Snippets
Logs can be provided on request.
ECS already has support for running workloads that leverage GPU - https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-gpu.html
Blog post: https://aws.amazon.com/blogs/compute/scheduling-gpus-for-deep-learning-tasks-on-amazon-ecs/
Are you looking for something else that's not provided by the feature?
I know, but it does not work with Docker 19.03, only with 18.09.
With Docker 18.09, you had to specify --runtime nvidia to run GPU workloads. With 19.03 that is no longer required; instead, you have to pass the --gpus argument to a container at run time to expose the GPUs.
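For context, the documented ECS path is to declare a GPU resource requirement in the task definition and let the agent wire up the runtime. A minimal container definition sketch per the ECS GPU documentation linked above (name and image are illustrative):
{
  "containerDefinitions": [
    {
      "name": "gpu-test",
      "image": "nvidia/cuda:10.1-base",
      "command": ["nvidia-smi"],
      "memory": 512,
      "resourceRequirements": [
        { "type": "GPU", "value": "1" }
      ]
    }
  ]
}
On Docker 18.09 hosts the agent satisfies this via the nvidia runtime; the ask in this issue is for it to also work on 19.03 via --gpus.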
Here is a workaround: change the default runtime for Docker on GPU instances.
- Override the systemd configuration for docker: create a file /etc/systemd/system/docker.service.d/override.conf with:
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd --host=fd://
- Set nvidia as the default runtime for the docker daemon, in /etc/docker/daemon.json:
{
"runtimes": {
"nvidia": {
"path": "/usr/bin/nvidia-container-runtime",
"runtimeArgs": []
}
},
"default-runtime": "nvidia"
}
- Restart docker:
systemctl daemon-reload
systemctl restart docker
- Check the docker service:
systemctl status docker
● docker.service - Docker Application Container Engine
Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/docker.service.d
└─override.conf
Active: active (running) since Mon 2020-01-20 17:35:47 CET; 17h ago
Docs: https://docs.docker.com
Main PID: 9065 (dockerd)
Tasks: 14
Memory: 100.1M
CGroup: /system.slice/docker.service
└─9065 /usr/bin/dockerd --host=fd://
- Check the default runtime:
docker -D info | grep Runtime
Runtimes: nvidia runc
Default Runtime: nvidia
- Now you can launch your "gpu" containers without the --gpus all option, and so they work on ECS, voilà.
- WARNING: if you have other docker containers on the same GPU instance, you have to launch them with the --runtime=runc option. Example with the ecs-agent systemd definition:
[Unit]
Description=AWS ECS Agent
...
[Service]
...
ExecStart=/usr/bin/docker run \
--runtime=runc \
--name=ecs-agent \
....
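For reference, a fuller sketch of that ExecStart, assuming the standard manual-install invocation of the ECS agent from the AWS docs (cluster name, volumes and image tag are illustrative and should be adjusted):
[Service]
# Illustrative only: apart from --runtime=runc, these flags follow the manual
# ECS agent install docs; tweak env vars and mounts for your setup.
ExecStart=/usr/bin/docker run \
    --runtime=runc \
    --name=ecs-agent \
    --net=host \
    --volume=/var/run/docker.sock:/var/run/docker.sock \
    --volume=/var/log/ecs:/log \
    --volume=/var/lib/ecs/data:/data \
    --env=ECS_LOGFILE=/log/ecs-agent.log \
    --env=ECS_DATADIR=/data \
    --env=ECS_CLUSTER=my-gpu-cluster \
    amazon/amazon-ecs-agent:latest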
This would be really nice. I recently spent some time getting GPU support working on my own Ubuntu AMIs, and one thing I ran into was that the agent still tries to force the nvidia runtime even though it is no longer required. Switching to the new --gpus argument instead of the runtime argument would simplify provisioning machines.
We ran into this issue ourselves. This bug of using the wrong runtime with docker 19.03 is blocking us from using a dynamic ECS instance directly from a task via an autoscaling group. There is no way to control the Docker runtime, Docker version, or AMI for an ECS autoscaling group.
As mentioned, the docker runtime (nvidia) and version (19.03) currently do work together on GPU instances run purely without ECS and with no modifications to the instance.
Just checking to see if there are any plans to do something about this issue. I don't see anybody assigned, but this seems like a major shortcoming of ECS GPU support. It'd be nice to be able to let ECS start up containers that need GPUs without having to hack in device support.
Edit: I'm not able to get the workaround working. Hardcoding the nvidia runtime as default makes ecs-agent fail to come up (which needs runc). Leaving the default runc runtime fails to bring up the nvidia runtime for my gpu-related container.
For the ECS agent, you have to build your own AMI and set up the ECS agent as a systemd service with the runtime option set explicitly:
[Unit]
Description=AWS ECS Agent
[Service]
...
ExecStart=/usr/bin/docker run \
--runtime=runc \
--name=ecs-agent \
I use Packer with Ansible to easily build my own AMI.
But you're right: it is just a workaround, and ECS with GPU support would be appreciated!
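Not from the thread, but as an illustration of that kind of build: a minimal Packer template sketch with an Ansible provisioner (region, instance type, source AMI ID, SSH user and playbook name are all placeholders/assumptions):
{
  "builders": [
    {
      "type": "amazon-ebs",
      "region": "us-west-2",
      "instance_type": "p2.xlarge",
      "source_ami": "ami-0123456789abcdef0",
      "ssh_username": "ec2-user",
      "ami_name": "custom-ecs-gpu-{{timestamp}}"
    }
  ],
  "provisioners": [
    {
      "type": "ansible",
      "playbook_file": "./ecs-gpu.yml"
    }
  ]
}
The playbook would install the NVIDIA driver, nvidia-container-runtime, and the ecs-agent systemd unit shown above.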
Hello, so as I understand it the issue that you are all experiencing is that you are not using the ECS-optimized AMI, and you would like to run GPU-enabled ECS tasks without having to configure nvidia-container-runtime?
As a workaround you can essentially do what alecks3474 has suggested in https://github.com/aws/containers-roadmap/issues/457#issuecomment-576606795.
The one caveat being that you shouldn't set nvidia as the default runtime. ECS Agent handles setting nvidia as the runtime for GPU containers when there is a GPU present in the task definition, so setting it as the default is not necessary, and, as @kevinclark found, it will cause issues for the ecs-agent container.
Below is from a GPU ecs-optimized AMI, which shows that we have the nvidia runtime enabled but the default runtime set to runc, so that the ecs agent and other non-GPU containers can still run properly.
% docker -D info | grep Runtime
Runtimes: nvidia runc
Default Runtime: runc
% docker -D info | grep "Server Version"
Server Version: 19.03.6-ce
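To make that caveat concrete, here is a daemon.json sketch that registers the nvidia runtime without changing the default, adapted from the workaround above (the path is the usual nvidia-container-runtime location; adjust for your install):
{
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
With this in place the agent can select the nvidia runtime per container while ecs-agent itself keeps running under runc.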
If we run a task/service from the AWS ECS console (new console), where do we pass --gpus all so it can work with an auto scaling group? Right now it works if we manually run the image on EC2 (after logging in through SSH), but if I want to manage this from the ECS console, I'm not sure how or where I can pass --gpus all.
I already tried https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-gpu.html but am still not able to run a GPU-based container. If I run it with --gpus all manually (after logging in to EC2) it works, but it should work from the ECS console alone.
Any help would be appreciated.
In case anyone stumbles on this issue, the documentation here https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#getting-started provides steps on how to configure docker to run a docker container with nvidia gpus.
These are the steps I followed to get a docker image running with the GPU.
Setting up the NVIDIA Container Toolkit: set up the package repository and the GPG key:
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Run a container with the GPU runtime:
sudo docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
Is there any way to enable GPU sharing based on time slicing in ECS, like in EKS?