Not all worker nodes install the NVIDIA kernel modules correctly on 2.0-rocky8 clusters with GPU resources
Command:
gcloud dataproc clusters create $CLUSTER_NAME \
--region $GCS_REGION \
--zone XXX \
--project XXX \
--image-version=2.0-rocky8 \
--master-machine-type n1-standard-4 \
--num-workers $NUM_WORKERS \
--subnet default \
--worker-machine-type n1-standard-4 \
--initialization-actions XXXX/install_gpu_driver.sh,XXXX/rapids.sh \
--metadata gpu-driver-provider="NVIDIA" \
--metadata rapids-runtime=SPARK \
--metadata cuda-version="$CUDA_VER" \
--worker-accelerator type=nvidia-tesla-t4,count=$NUM_GPUS \
--optional-components=JUPYTER,ZEPPELIN \
--bucket $GCS_BUCKET \
--enable-component-gateway \
The nvidia_uvm and nvidia kernel modules are not installed on worker 0, and the cluster hangs.
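A quick way to confirm this on a node (a sketch, assuming SSH access to the workers; the module names are the standard NVIDIA ones, not something specific to this script):

# On the affected worker (e.g. ${CLUSTER_NAME}-w-0): check whether the NVIDIA
# kernel modules exist for, and are loaded into, the running kernel.
uname -r                                   # running kernel version
lsmod | grep -E '^nvidia'                  # a healthy node lists nvidia, nvidia_uvm, ...
modinfo -n nvidia nvidia_uvm 2>/dev/null \
  || echo "nvidia modules not built for kernel $(uname -r)"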
diff dataproc-initialization-script-0_output ../w-1/dataproc-initialization-script-0_output
119,126d118
< bash-4.4.20-4.el8_6.x86_64
< dbus-1:1.12.8-18.el8_6.1.x86_64
< dbus-common-1:1.12.8-18.el8_6.1.noarch
< dbus-daemon-1:1.12.8-18.el8_6.1.x86_64
< dbus-libs-1:1.12.8-18.el8_6.1.x86_64
< dbus-tools-1:1.12.8-18.el8_6.1.x86_64
< device-mapper-8:1.02.181-3.el8_6.2.x86_64
< device-mapper-libs-8:1.02.181-3.el8_6.2.x86_64
128,132d119
< kernel-headers-4.18.0-372.19.1.el8_6.x86_64
< kernel-tools-4.18.0-372.19.1.el8_6.x86_64
< kernel-tools-libs-4.18.0-372.19.1.el8_6.x86_64
< kpartx-0.8.4-22.el8_6.1.x86_64
< libdnf-0.63.0-8.1.el8_6.x86_64
134,143d120
< openssl-1:1.1.1k-7.el8_6.x86_64
< openssl-devel-1:1.1.1k-7.el8_6.x86_64
< openssl-libs-1:1.1.1k-7.el8_6.x86_64
< pcre2-10.32-3.el8_6.x86_64
< pcre2-devel-10.32-3.el8_6.x86_64
< pcre2-utf16-10.32-3.el8_6.x86_64
< pcre2-utf32-10.32-3.el8_6.x86_64
< python3-hawkey-0.63.0-8.1.el8_6.x86_64
< python3-libdnf-0.63.0-8.1.el8_6.x86_64
< python3-perf-4.18.0-372.19.1.el8_6.x86_64
145,152d121
< rng-tools-6.14-6.git.b2b7934e.el8_6.x86_64
< selinux-policy-3.14.3-95.el8_6.1.noarch
< selinux-policy-targeted-3.14.3-95.el8_6.1.noarch
< systemd-239-58.el8_6.3.x86_64
< systemd-libs-239-58.el8_6.3.x86_64
< systemd-pam-239-58.el8_6.3.x86_64
< systemd-udev-239-58.el8_6.3.x86_64
< tuned-2.18.0-2.el8_6.1.noarch
156,160d124
< vim-minimal-2:8.0.1763-19.el8_6.4.x86_64
< Installed:
< kernel-4.18.0-372.19.1.el8_6.x86_64
< kernel-core-4.18.0-372.19.1.el8_6.x86_64
< kernel-modules-4.18.0-372.19.1.el8_6.x86_64
182c146
< kernel-devel-4.18.0-372.19.1.el8_6.x86_64
---
> kernel-devel-4.18.0-372.16.1.el8_6.0.1.x86_64
427d390
< ++ sed -n 's/.*version[[:blank:]]\+\([0-9]\+\.[0-9]\).*/\1/p'
428a392
> ++ sed -n 's/.*version[[:blank:]]\+\([0-9]\+\.[0-9]\).*/\1/p'
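The relevant part of the diff is the kernel mismatch: the init action on worker 0 pulled in kernel/kernel-devel 4.18.0-372.19.1, while worker 1 kept kernel-devel 4.18.0-372.16.1, so on worker 0 the driver was presumably built against headers for a kernel other than the one that is actually running. A small check for that mismatch (a sketch, assuming SSH access; it only compares the running kernel against the newest installed kernel-devel):

# Run on each worker: if the running kernel and newest kernel-devel differ,
# the driver modules were likely built for a kernel that is not running.
running=$(uname -r)
devel=$(rpm -q --qf '%{VERSION}-%{RELEASE}.%{ARCH}\n' kernel-devel | sort -V | tail -n1)
echo "running kernel : $running"
echo "kernel-devel   : $devel"
[ "$running" = "$devel" ] || echo "MISMATCH: modules would target $devel, not $running"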
Hi @medb, could you please help with this issue? spark-rapids will support Rocky8 starting with the upcoming 22.08 release, so customers will hit this issue when spinning up Rocky8 clusters.
@nvliyuan - I created a Rocky8 cluster with GPUs, but instead of running the install_gpu_driver.sh script I ran this GPU installation script.
Below is the nvidia-smi output from the master and workers:
Master
Fri Oct 14 11:27:39 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.46 Driver Version: 495.46 CUDA Version: 11.5 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |
| N/A 76C P0 34W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Worker 0
Fri Oct 14 11:27:48 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.46 Driver Version: 495.46 CUDA Version: 11.5 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |
| N/A 77C P0 35W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Worker 1
Fri Oct 14 11:27:54 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.46 Driver Version: 495.46 CUDA Version: 11.5 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |
| N/A 77C P0 34W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Therefore, can we use this script in install_gpu_driver.sh to set up the drivers with the correct kernel headers, and keep the current logic that configures the YARN and Spark settings as it is?
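If that is the direction, here is a minimal sketch of the idea (a hypothetical fragment, not the current install_gpu_driver.sh logic): install the devel/headers packages that match the running kernel instead of letting dnf pull in a newer kernel, then proceed with the driver install and leave the YARN/Spark configuration steps untouched. This assumes the matching packages for the image's point release are still resolvable from the configured repos (on Rocky 8 older point releases may only be in the vault repos).

# Hypothetical init-action fragment for Rocky/RHEL 8 workers.
KERNEL_VERSION=$(uname -r)
# Pin the build dependencies to the running kernel rather than the latest available.
dnf install -y "kernel-devel-${KERNEL_VERSION}" "kernel-headers-${KERNEL_VERSION}" gcc make
# ...then run the NVIDIA driver installer against this kernel and keep the
# existing YARN/Spark configuration logic unchanged.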
@pulkit-jain-G - FYI.
Will close this issue after PR #1028 is merged.