GPU driver installer fails on 2.1 image with cuda=11.5
export ACCELERATOR_TYPE="nvidia-tesla-t4"
export CUDA_VERSION=11.5
export MACHINE_TYPE=n1-standard-1
export IMAGE_VERSION=2.1
date
time gcloud dataproc clusters create ${CLUSTER_NAME} \
--region ${REGION} \
--zone ${ZONE} \
--subnet ${SUBNET} \
--no-address \
--service-account=${GSA} \
--master-machine-type ${MACHINE_TYPE} \
--worker-machine-type ${MACHINE_TYPE} \
--master-boot-disk-type pd-standard \
--master-boot-disk-size 1024 \
--image-version ${IMAGE_VERSION} \
--tags=${TAGS} \
--bucket ${BUCKET} \
--initialization-action-timeout=15m \
--max-idle=${IDLE_TIMEOUT} \
--enable-component-gateway \
--metadata include-gpus=true \
--worker-accelerator type=${ACCELERATOR_TYPE} \
--master-accelerator type=${ACCELERATOR_TYPE} \
--metadata gpu-driver-provider=NVIDIA \
--initialization-actions ${INIT_ACTIONS_ROOT}/gpu/install_gpu_driver.sh \
--metadata init-actions-repo=${INIT_ACTIONS_ROOT} \
--metadata install-gpu-agent=true \
--metadata cuda-version=${CUDA_VERSION} \
--scopes 'https://www.googleapis.com/auth/cloud-platform'
date
Cluster creation fails with:
Waiting for cluster creation operation...done.
ERROR: (gcloud.dataproc.clusters.create) Operation [projects/cjac-2021-00/regions/us-central1/operations/bc05b7c9-4ab6-32f9-afb9-d39e2415fe52] failed: Multiple Errors:
- Initialization action failed. Failed action 'gs://cjac-docker-on-yarn/dataproc-initialization-actions/gpu/install_gpu_driver.sh', see output in: gs://cjac-docker-on-yarn/google-cloud-dataproc-metainfo/08184442-2a2e-4898-b1d5-b1b942e50879/cluster-1668020639-m/dataproc-initialization-script-1_output
- Initialization action failed. Failed action 'gs://cjac-docker-on-yarn/dataproc-initialization-actions/gpu/install_gpu_driver.sh', see output in: gs://cjac-docker-on-yarn/google-cloud-dataproc-metainfo/08184442-2a2e-4898-b1d5-b1b942e50879/cluster-1668020639-w-0/dataproc-initialization-script-1_output
- Initialization action failed. Failed action 'gs://cjac-docker-on-yarn/dataproc-initialization-actions/gpu/install_gpu_driver.sh', see output in: gs://cjac-docker-on-yarn/google-cloud-dataproc-metainfo/08184442-2a2e-4898-b1d5-b1b942e50879/cluster-1668020639-w-1/dataproc-initialization-script-1_output.
real 3m55.995s
user 0m1.388s
sys 0m0.117s
+ date
Wed May 24 04:10:14 PM PDT 2023
Kernel driver build fails because of failure to sign the driver:
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 495.29.05.....
ERROR: The kernel module failed to load. Secure boot is enabled on this system, so this is likely because it was not signed by a key that is trusted by the kernel. Please try installing the driver again, and sign the kernel module when prompted to do so.
ERROR: Unable to load the kernel module 'nvidia.ko'. This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA device(s), or no NVIDIA device installed in this system is supported by this NVIDIA Linux graphics driver release.
Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/var/log/nvidia-installer.log' for more information.
ERROR: Installation has failed. Please see the file '/var/log/nvidia-installer.log' for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
Adding the --no-shielded-secure-boot flag will at least get your cluster going.
Dataproc 2.1 enabled secure boot by default, and the current approach to signing drivers used in several of this repository's scripts (e.g. spark-rapids.sh and install_gpu_driver.sh) is not working correctly.
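To confirm on a node that secure boot is what blocks the module load, a check along these lines may help. It is only a sketch: `parse_sb_state` is a hypothetical helper name, and it assumes `mokutil` is installed on the node and prints its usual "SecureBoot enabled" / "SecureBoot disabled" line.

```shell
# Classify the text printed by `mokutil --sb-state`.
# Assumption: the output contains "SecureBoot enabled" or "SecureBoot disabled".
parse_sb_state() {
  case "$1" in
    *"SecureBoot enabled"*)  echo "enabled" ;;
    *"SecureBoot disabled"*) echo "disabled" ;;
    *)                       echo "unknown" ;;
  esac
}

# Run on a cluster node; falls back to "unknown" when mokutil is not available.
if command -v mokutil >/dev/null 2>&1; then
  state="$(parse_sb_state "$(mokutil --sb-state 2>&1)")"
else
  state="unknown"
fi
echo "secure boot: ${state}"
```

If this reports "enabled" and /var/log/nvidia-installer.log shows the module load error quoted above, the signing problem is the likely cause.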
Thanks! I'll patch that in to the 2.1 tests and ask the product team if there's a better way to support signed drivers.
@cjac @rwlee is there a way to set the --no-shielded-secure-boot flag properly when creating a Dataproc cluster via the API? I tried some of the options in the docs, but they didn't work well. I did need to fix the GPU presets in 2.0-x images.
Try it with gcloud and pass the verbose argument. You should see in the logs how the REST HTTP requests are formatted.
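As a rough sketch of what to look for: gcloud's global `--log-http` flag prints each request and response to stderr. The JSON fragment below is my understanding of how the Dataproc v1 `clusters.create` body expresses the flag (a `shieldedInstanceConfig` under `gceClusterConfig`); treat the field names as an assumption and confirm them against the request gcloud actually logs. `REQUEST_FRAGMENT` is just an illustrative variable name.

```shell
# Request-body fragment for clusters.create (Dataproc v1 REST API).
# Assumption: field names follow the ShieldedInstanceConfig message;
# verify them against the output of `gcloud ... --log-http`.
REQUEST_FRAGMENT='{
  "config": {
    "gceClusterConfig": {
      "shieldedInstanceConfig": {
        "enableSecureBoot": false
      }
    }
  }
}'
printf '%s\n' "${REQUEST_FRAGMENT}"

# To see the real request, add --log-http to the create command, e.g.:
#   gcloud dataproc clusters create "${CLUSTER_NAME}" --region "${REGION}" \
#     --image-version 2.1 --no-shielded-secure-boot --log-http
```

The fragment would be merged into whatever cluster config the API client already sends; it is not a complete request on its own.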
Okay, this should work now. 2.1 does not support secure boot at this time; the kernel fails to load the MOK from the EFI variables. Until or unless this changes, even custom images created with the new --trusted-cert argument will not successfully execute the installer with secure boot enabled.
So pass --no-shielded-secure-boot when the Dataproc image version is 2.1.
I tested with CUDA 11.8 and not 11.5; I hope this will suffice.
This should work now for all supported OS variants. 2.1 does not support secure boot at this time; the kernel fails to load the MOK from the EFI variables. Until or unless this changes, even custom images created with the new --trusted-cert argument will not successfully execute the installer with secure boot enabled.
So pass --no-shielded-secure-boot when the Dataproc image version is 2.1.
Working as of #1205
The 2.1 and 2.0 images do not pass the db EFI variables from the GCE images to the Dataproc cluster nodes, so the kernel won't load the modules even if they're signed by the proper key. My recommendation is to use 2.2 clusters.
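For scripts that still have to create 2.0 or 2.1 clusters, the flag can be picked from the image version. This is a minimal sketch, assuming (per the comments above) that 2.0 and 2.1 need secure boot off and that 2.2 and later do not; `secure_boot_flag` is a hypothetical helper, and the final line only prints the command as a dry run.

```shell
# Pick the secure boot flag from the Dataproc image version.
# Assumption from this thread: 2.0 and 2.1 images need secure boot disabled.
secure_boot_flag() {
  case "$1" in
    2.0|2.0.*|2.0-*|2.1|2.1.*|2.1-*) echo "--no-shielded-secure-boot" ;;
    *) echo "" ;;
  esac
}

IMAGE_VERSION="${IMAGE_VERSION:-2.1}"
SECURE_BOOT_FLAG="$(secure_boot_flag "${IMAGE_VERSION}")"

# Dry run: print the command instead of executing it. Drop the leading
# `echo` and add the remaining arguments from the report to run it for real.
echo gcloud dataproc clusters create "${CLUSTER_NAME:-my-cluster}" \
  --image-version "${IMAGE_VERSION}" ${SECURE_BOOT_FLAG}
```

`SECURE_BOOT_FLAG` is left unquoted on purpose so that an empty value adds no argument.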