
Added GPU-enabled sandbox image. (v2?)

Open danpf opened this issue 2 years ago • 16 comments

Preface: This combines work done by @ahlgol and @Future-Outlier with some extra testing, evaluation, and a bunch of NVIDIA-headache fixes to get it working fully on Ubuntu Server. https://github.com/flyteorg/flyte/pull/3256

If @ahlgol merges this into the previous PR, this one will be closed; otherwise we can just use this one (I kept the previous PR's commits).

Setup / testing

0. Prerequisites

Ensure you have installed all of the following and can run them (a quick sanity-check sketch follows this list):

  • Installing the NVIDIA Container Toolkit: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html

  • Nvidia container-toolkit sample-workload: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/sample-workload.html

  • Support for Container Device Interface: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/cdi-support.html

  • NVIDIA device plugin for Kubernetes (finish all of the Quick Start steps): https://github.com/NVIDIA/k8s-device-plugin#quick-start

  • A public Docker registry may be necessary (I pushed my image just to make sure), so log in with docker login

  • General requirements:

    • kustomize
    • helm
    • kubectl
    • docker
  • Flyte envd requirements:

    • pip install flytekitplugins-envd
    • create and run create_envd_context.sh (see Testing scripts below)
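
Before going further, it helps to confirm the container toolkit actually works end to end. A minimal sanity check (the ubuntu image is the one used in the NVIDIA sample-workload doc, and the CDI check only applies if you enabled CDI and have a recent nvidia-ctk):

# Run the NVIDIA sample workload through Docker
docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

# If you generated CDI specs, confirm they are present
nvidia-ctk cdi list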

My environment (may or may not be necessary; a note on applying the Docker and containerd changes follows this list):

  • /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}
  • docker context list
NAME        DESCRIPTION                               DOCKER ENDPOINT               ERROR
default *   Current DOCKER_HOST based configuration   unix:///var/run/docker.sock
  • /etc/containerd/config.toml
version = 2

[plugins]

  [plugins."io.containerd.grpc.v1.cri"]

    [plugins."io.containerd.grpc.v1.cri".containerd]

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
            SystemdCgroup = true
  • lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.3 LTS
Release:        22.04
Codename:       jammy
  • nvidia-smi
Wed Nov  1 03:52:14 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.06              Driver Version: 545.23.06    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       On  | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P8               9W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla T4                       On  | 00000000:00:05.0 Off |                    0 |
| N/A   32C    P8               9W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
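
If you change /etc/docker/daemon.json or /etc/containerd/config.toml as above, restart the corresponding daemons so the nvidia runtime is picked up. A minimal sketch, assuming a systemd host:

sudo systemctl restart docker
sudo systemctl restart containerd

# Confirm Docker registered nvidia as a runtime (and as the default runtime)
docker info | grep -i runtime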

1. Get the branch

Clone the repo, check out the branch, build the Dockerfile, tag the image, and push it:

git clone https://github.com/danpf/flyte
cd flyte
git checkout danpf-sandbox-gpu
cd docker/sandbox-bundled
make build-gpu
docker tag flyte-sandbox-gpu:latest dancyrusbio/flyte-sandbox-gpu:latest
docker login
docker push dancyrusbio/flyte-sandbox-gpu:latest
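
If you push to your own registry rather than dancyrusbio, the same steps look roughly like this with the image name parameterized (REGISTRY is a placeholder for your Docker Hub user or registry host):

REGISTRY=your-dockerhub-user   # placeholder
IMAGE="$REGISTRY/flyte-sandbox-gpu:latest"

make build-gpu
docker tag flyte-sandbox-gpu:latest "$IMAGE"
docker push "$IMAGE"
# then start the cluster with it (step 2)
flytectl demo start --image "$IMAGE" --disable-agent --force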

2. Start the cluster

flytectl demo start --image dancyrusbio/flyte-sandbox-gpu:latest --disable-agent --force

3. Check that the cluster can see the GPUs

$ kubectl describe node | grep -i gpu
  nvidia.com/gpu:     2
  nvidia.com/gpu:     2
  nvidia.com/gpu     0           0
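
If nvidia.com/gpu does not show up on the node, the device plugin usually hasn't started. A couple of hedged checks (pod names and labels depend on how the plugin was deployed):

# Look for the device plugin (and any gpu-operator) pods
kubectl get pods -A | grep -i nvidia

# Inspect the node's allocatable resources directly
kubectl get nodes -o jsonpath='{.items[*].status.allocatable}'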

4. Run the final job

Create the runme.py script shown below (under Testing scripts), and then run:

pyflyte run --remote runme.py check_if_gpu_available
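
pyflyte prints a console URL for the execution; you can also watch the task pod directly. Assuming the sandbox defaults of project flytesnacks and domain development, the pod runs in the flytesnacks-development namespace:

kubectl get pods -n flytesnacks-development -w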

Testing scripts

# create_envd_context.sh
envd context create --name flyte-sandbox --builder tcp --builder-address localhost:30003 --use
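
This points envd at the builder the sandbox exposes on localhost:30003, so ImageSpec builds run against the cluster. Assuming your envd version has the context list subcommand, you can confirm the new context is active with:

envd context ls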

Quickly rebuild and push your Docker image (change the name to your own registry):

# rebuild.sh
make build-gpu && docker tag flyte-sandbox-gpu dancyrusbio/flyte-sandbox-gpu && docker push dancyrusbio/flyte-sandbox-gpu

Start a new Flyte sandbox cluster:

# start_new_flyte_cluster.sh
flytectl demo start --image dancyrusbio/flyte-sandbox-gpu:latest --disable-agent --force
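
Once it is up, it's worth confirming the sandbox itself is healthy before submitting work (the flyte namespace is the sandbox default, and demo status assumes a reasonably recent flytectl):

flytectl demo status
kubectl get pods -n flyte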

This is the final Flyte script that checks whether your GPU is working:

# runme.py
from flytekit import ImageSpec, Resources, task

gpu = "1"

@task(
    retries=2,
    cache=True,
    cache_version="1.0",
    requests=Resources(gpu=gpu),
    environment={"PYTHONPATH": "/root"},
    container_image=ImageSpec(
            cuda="11.8.0",
            python_version="3.9.13",
            packages=["flytekit", "torch"],
            apt_packages=["git"],
            registry="localhost:30000",
    )
)
def check_if_gpu_available() -> bool:
    import torch
    return torch.cuda.is_available()
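
If the task returns True you're done. To double-check that the pod actually requested a GPU, you can inspect its resource limits (the pod name below is a placeholder for whatever the execution created):

kubectl get pods -n flytesnacks-development
kubectl describe pod <task-pod-name> -n flytesnacks-development | grep -i -A3 limits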

Proof!

$ kubectl describe node | grep -i gpu
  nvidia.com/gpu:     2
  nvidia.com/gpu:     2
  nvidia.com/gpu     0           0

Previous PR

A new Dockerfile and build target "build-gpu" in docker/sandbox-bundled that builds a CUDA-enabled image named flyte-sandbox-gpu.

Describe your changes

  • Build target added in the Makefile for "build-gpu" that builds Dockerfile.gpu
  • Build target added in the Makefile for "manifests-gpu" that adds gpu-operator.yaml to the manifests
  • Dockerfile.gpu is based on the existing Dockerfile, but uses a base image from NVIDIA, installs k3s and crictl, and adds a containerd config template for the NVIDIA container runtime
  • Adds bin/k3d-entrypoint-gpu-check.sh, which checks whether the container was started with an NVIDIA-enabled runtime and exits otherwise (a sketch of that kind of check follows this list)
  • bin/k3d-entrypoint.sh has been modified to let stderr pass through to the output, so warnings from the other entrypoint scripts can be seen (they will now be missing from the log file, however)
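
For reference, the sort of check that entrypoint script performs looks roughly like the sketch below; this is an illustrative example, not the actual contents of bin/k3d-entrypoint-gpu-check.sh:

#!/bin/sh
# Illustrative sketch: refuse to start the GPU sandbox image if the NVIDIA runtime is not usable
if ! command -v nvidia-smi >/dev/null 2>&1 || ! nvidia-smi >/dev/null 2>&1; then
    echo "NVIDIA GPU/runtime not detected; this image requires the nvidia container runtime" >&2
    exit 1
fi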

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Note to reviewers

Changes have been added following info from these sources (plus some trial and error):

  • https://itnext.io/enabling-nvidia-gpus-on-k3s-for-cuda-workloads-a11b96f967b0
  • https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
  • https://k3d.io/v5.4.6/usage/advanced/cuda/

danpf • Nov 01 '23 03:11

Thank you for opening this pull request! 🙌

These tips will help get your PR across the finish line:

  • Most of the repos have a PR template; if not, fill it out to the best of your knowledge.
  • Sign off your commits (Reference: DCO Guide).

welcome[bot] • Nov 01 '23 03:11

Thanks a lot for your help; you and the author of the first PR have really made a significant contribution to Flyte.

Future-Outlier • Nov 01 '23 03:11

Hi, thanks a lot for your contributions. These are really amazing.

Future-Outlier • Nov 01 '23 11:11

Here are some questions! I believe that if you can answer them, you will help lots of Flyte users use the sandbox GPU image, and also help reviewers review it more easily.

  1. Do we need to taint the GPU node? Why or why not?

  2. Do we need to set the config in the Flyte sandbox-config? Why or why not?

  3. Do we need to change the k3d-entrypoint-gpu-check permissions? Why or why not?

  4. Does the CUDA version need to be the same as your GPU's CUDA version? Does it have any limit?

The questions above are related to the first GPU PR's discussion here: https://github.com/flyteorg/flyte/pull/3256#issuecomment-1784590139

Future-Outlier • Nov 01 '23 11:11

I think that after we solve the security issue and remove everything related to the gpu-operator file, this PR can be merged. Thanks for your tons of work.

Future-Outlier • Nov 06 '23 12:11

I'm not sure where else to explain this, but to answer any questions about Dockerfile.gpu vs. the original Dockerfile:

Here is a side-by-side diff screenshot of the two files:

The differences between the two files are shown in red. Essentially everything that is added to Dockerfile.gpu is due to the fact that the base image of k3s is scratch, while the base image of our CUDA image is Ubuntu. So you need to install a few requirements, install crictl, set the kubectl alias, and set some extra volumes/paths (at least according to the various docs).
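
To make that concrete, the additions are roughly of the following shape; this is an illustrative sketch (package names, the crictl version, and paths are assumptions), not the actual RUN lines from Dockerfile.gpu:

# Install basic requirements that the scratch-based k3s image did not need
apt-get update && apt-get install -y curl ca-certificates

# Install crictl (version is an example)
CRICTL_VERSION="v1.28.0"
curl -L "https://github.com/kubernetes-sigs/cri-tools/releases/download/${CRICTL_VERSION}/crictl-${CRICTL_VERSION}-linux-amd64.tar.gz" \
  | tar -C /usr/local/bin -xz

# k3s ships kubectl as a subcommand, so alias it for convenience
echo 'alias kubectl="k3s kubectl"' >> /root/.bashrc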

danpf • Nov 08 '23 00:11

@danpf, it looks good to me. I think that after removing these 2 changes, it's time to merge it. Thanks a lot.

Future-Outlier • Nov 08 '23 03:11

Do you think we could get anyone to try to follow/install this? Does it still work for you on WSL?

danpf • Nov 08 '23 05:11

@pingsutw will use an EC2 instance to test this

Future-Outlier • Nov 08 '23 05:11

It works on WSL, but WSL needs some additional settings, which are complicated for me. In my WSL, I saw all the GPU-related pods start, so I think it's correct.

Future-Outlier • Nov 09 '23 06:11

Hey folks. I am working on a project that would greatly benefit from being able to have tasks be able to utilize GPUs in Sandbox. What is the current status of this PR?

granthamtaylor • Feb 20 '24 01:02

> Hey folks. I am working on a project that would greatly benefit from being able to have tasks be able to utilize GPUs in Sandbox. What is the current status of this PR?

It works, but we haven't added tests and it hasn't been reviewed by other maintainers.

Future-Outlier • Feb 20 '24 01:02

> Hey folks. I am working on a project that would greatly benefit from being able to have tasks be able to utilize GPUs in Sandbox. What is the current status of this PR?

You can

cd flyte
gh pr checkout 4340
make build-gpu

to create the image, thank you!

Future-Outlier • Feb 20 '24 01:02

Do we still need help testing/installing this? If so, what are the most up-to-date instructions?

davidmirror-ops • Feb 20 '24 15:02

@davidmirror-ops The current instructions in the OP are up to date (to my knowledge, but it has been some time). We couldn't convince anyone to test/install this. You will need an NVIDIA GPU to do so.

danpf • Feb 20 '24 15:02

I am building a PC to function as a private workstation. I will be getting a 4090 in about two weeks. I can test once it is finished.

This contribution is extremely useful for my intent, thank you for developing the feature!

granthamtaylor • Feb 20 '24 20:02