gpu driver installer fails on 2.1 image with cuda=11.5

Open cjac opened this issue 2 years ago • 4 comments

export ACCELERATOR_TYPE="nvidia-tesla-t4"
export CUDA_VERSION=11.5
export MACHINE_TYPE=n1-standard-1
export IMAGE_VERSION=2.1


  date
  time gcloud dataproc clusters create ${CLUSTER_NAME} \
    --region ${REGION} \
    --zone ${ZONE} \
    --subnet ${SUBNET} \
    --no-address \
    --service-account=${GSA} \
    --master-machine-type ${MACHINE_TYPE} \
    --worker-machine-type ${MACHINE_TYPE} \
    --master-boot-disk-type pd-standard \
    --master-boot-disk-size 1024 \
    --image-version ${IMAGE_VERSION} \
    --tags=${TAGS} \
    --bucket ${BUCKET} \
    --initialization-action-timeout=15m \
    --max-idle=${IDLE_TIMEOUT} \
    --enable-component-gateway \
    --metadata include-gpus=true \
    --worker-accelerator type=${ACCELERATOR_TYPE} \
    --master-accelerator type=${ACCELERATOR_TYPE} \
    --metadata gpu-driver-provider=NVIDIA \
    --initialization-actions ${INIT_ACTIONS_ROOT}/gpu/install_gpu_driver.sh \
    --metadata init-actions-repo=${INIT_ACTIONS_ROOT} \
    --metadata install-gpu-agent=true \
    --metadata cuda-version=${CUDA_VERSION} \
    --scopes 'https://www.googleapis.com/auth/cloud-platform'
  date

Cluster creation fails with:

Waiting for cluster creation operation...done.
ERROR: (gcloud.dataproc.clusters.create) Operation [projects/cjac-2021-00/regions/us-central1/operations/bc05b7c9-4ab6-32f9-afb9-d39e2415fe52] failed: Multiple Errors:
 - Initialization action failed. Failed action 'gs://cjac-docker-on-yarn/dataproc-initialization-actions/gpu/install_gpu_driver.sh', see output in: gs://cjac-docker-on-yarn/google-cloud-dataproc-metainfo/08184442-2a2e-4898-b1d5-b1b942e50879/cluster-1668020639-m/dataproc-initialization-script-1_output
 - Initialization action failed. Failed action 'gs://cjac-docker-on-yarn/dataproc-initialization-actions/gpu/install_gpu_driver.sh', see output in: gs://cjac-docker-on-yarn/google-cloud-dataproc-metainfo/08184442-2a2e-4898-b1d5-b1b942e50879/cluster-1668020639-w-0/dataproc-initialization-script-1_output
 - Initialization action failed. Failed action 'gs://cjac-docker-on-yarn/dataproc-initialization-actions/gpu/install_gpu_driver.sh', see output in: gs://cjac-docker-on-yarn/google-cloud-dataproc-metainfo/08184442-2a2e-4898-b1d5-b1b942e50879/cluster-1668020639-w-1/dataproc-initialization-script-1_output.

real    3m55.995s
user    0m1.388s
sys     0m0.117s
+ date
Wed May 24 04:10:14 PM PDT 2023

Kernel driver build fails because of failure to sign the driver:

Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 495.29.05........

ERROR: The kernel module failed to load. Secure boot is enabled on this system, so this is likely because it was not signed by a key that is trusted by the kernel. Please try installing the driver again, and sign the kernel module when prompted to do so.


ERROR: Unable to load the kernel module 'nvidia.ko'.  This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA device(s), or no NVIDIA device installed in this system is supported by this NVIDIA Linux graphics driver release.

Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/var/log/nvidia-installer.log' for more information.


ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

cjac avatar May 24 '23 23:05 cjac

Adding the --no-shielded-secure-boot flag will at least get your cluster going.

Dataproc 2.1 enabled secure boot by default, and the current approach to signing drivers used by several of this repository's scripts (e.g. spark-rapids.sh and install_gpu_driver.sh) is not working correctly.
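
A minimal sketch of that workaround: the flag is simply appended to the create command from the original report. The `run` helper and its `DRY_RUN` switch are hypothetical names added here so the command can be inspected before it is executed, and the default values are examples only. The remaining flags from the report are omitted for brevity and would be passed unchanged.

```shell
# Hypothetical helper: print the command when DRY_RUN=1 (the default here),
# execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# CLUSTER_NAME, REGION and IMAGE_VERSION are the same placeholders used in
# the report; the fallbacks below are illustrative values.
run gcloud dataproc clusters create "${CLUSTER_NAME:-example-cluster}" \
  --region "${REGION:-us-central1}" \
  --image-version "${IMAGE_VERSION:-2.1}" \
  --no-shielded-secure-boot
```

Setting `DRY_RUN=0` runs the command instead of printing it.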

rwlee avatar Jun 29 '23 23:06 rwlee

Thanks! I'll patch that into the 2.1 tests and ask the product team if there's a better way to support signed drivers.

cjac avatar Jun 30 '23 00:06 cjac

@cjac @rwlee is there a way to set the --no-shielded-secure-boot flag properly when creating a Dataproc cluster via the API? I tried some options from the docs, but they didn't work well. I had to fix the GPU presets on the 2.0-x images.

ryukinix avatar Oct 25 '23 19:10 ryukinix

Try it with gcloud and pass the verbose argument. You should see in the logs how the REST HTTP requests are formatted.
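
One way to do that, as a sketch: gcloud's global `--log-http` flag writes the HTTP requests and responses to stderr, and `--verbosity=debug` adds more detail. The `with_http_logging` helper is a hypothetical name; it only prints the command so it can be reviewed first. The Dataproc REST field that corresponds to the flag is probably `config.gceClusterConfig.shieldedInstanceConfig.enableSecureBoot`, but that is an assumption to confirm against the logged request body.

```shell
# Hypothetical helper: append gcloud's global debugging flags to a gcloud
# command line and print the result (nothing is executed).
with_http_logging() {
  printf '%s --log-http --verbosity=debug\n' "$*"
}

# CLUSTER_NAME and REGION are the placeholders from the original report;
# the fallbacks are illustrative values.
with_http_logging gcloud dataproc clusters create \
  "${CLUSTER_NAME:-example-cluster}" \
  --region "${REGION:-us-central1}" \
  --no-shielded-secure-boot
```

Running the printed command and searching stderr for the create request should show how the flag appears in the JSON body.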

cjac avatar Oct 25 '23 22:10 cjac

Okay, this should work now. 2.1 does not support secure boot at this time; the kernel fails to load the MOK from the EFI variables. Until or unless this changes, even custom images created with the new --trusted-cert argument will not successfully execute the installer with secure boot enabled.

So pass --no-shielded-secure-boot when the Dataproc image version is 2.1.

cjac avatar Aug 06 '24 01:08 cjac

I tested with CUDA 11.8 rather than 11.5; I hope this will suffice.

This should work now for all supported OS variants. 2.1 does not support secure boot at this time; the kernel fails to load the MOK from the EFI variables. Until or unless this changes, even custom images created with the new --trusted-cert argument will not successfully execute the installer with secure boot enabled.

So pass --no-shielded-secure-boot when the Dataproc image version is 2.1.
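
That rule can be sketched as a small shell helper. `secure_boot_flag` is a hypothetical name, and the patterns assume image versions written as `2.1`, `2.1-debian11`, or `2.1.N-...`; extend them if other versions turn out to need the same treatment.

```shell
# Hypothetical helper: echo the Secure Boot flag needed for a given Dataproc
# image version, or nothing when no workaround is required.
secure_boot_flag() {
  case "$1" in
    2.1|2.1.*|2.1-*) printf '%s\n' '--no-shielded-secure-boot' ;;
    *) : ;;  # other versions: no extra flag
  esac
}

# Prints --no-shielded-secure-boot when IMAGE_VERSION is unset or is a 2.1 image.
secure_boot_flag "${IMAGE_VERSION:-2.1}"

# Intended use (word splitting is deliberate, since the flag may be empty):
#   gcloud dataproc clusters create "${CLUSTER_NAME}" \
#     --image-version "${IMAGE_VERSION}" \
#     $(secure_boot_flag "${IMAGE_VERSION}")
```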

cjac avatar Aug 06 '24 01:08 cjac

Working as of #1205

cjac avatar Aug 06 '24 01:08 cjac

The 2.1 and 2.0 images do not pass the db EFI variables from the GCE images to the Dataproc cluster nodes, so the kernel won't load the modules even if they're signed by the proper key. My recommendation is to use 2.2 clusters.
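
To check on a node whether Secure Boot is what blocks the module, something along these lines may help. It is a sketch that assumes `mokutil` and `modinfo` are present on the image, which is not guaranteed; `sb_state` is a hypothetical helper that classifies the text printed by `mokutil --sb-state`.

```shell
# Hypothetical helper: classify the output of `mokutil --sb-state`.
sb_state() {
  case "$1" in
    *"SecureBoot enabled"*)  echo enabled ;;
    *"SecureBoot disabled"*) echo disabled ;;
    *)                       echo unknown ;;
  esac
}

if command -v mokutil >/dev/null 2>&1; then
  echo "secure boot: $(sb_state "$(mokutil --sb-state 2>&1)")"
else
  echo "mokutil not installed; cannot query Secure Boot state"
fi

# Signer recorded in the nvidia module; empty if the module is unsigned,
# missing, or modinfo is unavailable.
signer="$(modinfo -F signer nvidia 2>/dev/null)"
echo "nvidia module signer: ${signer:-<none>}"
```

If Secure Boot reports enabled and the module has a signer but still fails to load, that matches the missing-key situation described above.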

cjac avatar Oct 19 '24 01:10 cjac