[rapids] Adjust some Spark config defaults when using RAPIDS Accelerator
Adds a few more settings to the Spark defaults that are updated as part of setting up the cluster to run with the RAPIDS Accelerator. We have found these settings perform better than the Dataproc Spark defaults when queries are running on the GPU.
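For context, the change amounts to writing property overrides into the cluster's Spark defaults during init-action setup. The snippet below is only a sketch: the `spark.sql.autoBroadcastJoinThreshold` value comes from the discussion further down, the other keys and values are illustrative placeholders, and appending to `spark-defaults.conf` is an assumed mechanism rather than the exact code in the diff.

```bash
# Sketch only -- the authoritative keys and values are in this PR's diff.
# Append overrides to the Dataproc cluster's Spark defaults file.
cat >>/etc/spark/conf/spark-defaults.conf <<'EOF'
spark.sql.autoBroadcastJoinThreshold=10m
spark.sql.files.maxPartitionBytes=512m
spark.locality.wait=0s
EOF
```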
cc: @mengdong @viadea
Signed-off-by: Jason Lowe [email protected]
@nvliyuan fyi as well
@medb could you help review and approve? Thanks. This is just a few default config settings for Spark.
@jlowe we do not need to explicitly set spark.sql.autoBroadcastJoinThreshold=10m because this is the default value, right?
@jayadeep-jayaraman can you help review this?
> we do not need to explicitly set spark.sql.autoBroadcastJoinThreshold=10m because this is the default value, right?
Apologies for the late reply. We need this because it is not the default on Dataproc: Dataproc overrides spark.sql.autoBroadcastJoinThreshold to a different value. For example, check /etc/spark/conf/spark-defaults.conf after creating a Dataproc cluster.
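A quick way to confirm what Dataproc ships with on an existing cluster (the cluster name and zone below are placeholders):

```bash
# Inspect the Spark defaults on the master node of a Dataproc cluster.
gcloud compute ssh my-cluster-m --zone=us-central1-a \
  --command="grep autoBroadcastJoinThreshold /etc/spark/conf/spark-defaults.conf"
```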
/gcbrun
I am seeing a few errors while running the tests.
**Error 1**
2022-10-26T17:07:17.898739918Z /opt/conda/default/lib/python3.8/site-packages/scipy/__init__.py:138: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.23.4)
2022-10-26T17:07:17.898747252Z warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion} is required for this version of "
2022-10-26T17:07:17.898754795Z Traceback (most recent call last):
2022-10-26T17:07:17.898777281Z File "verify_rapids_dask.py", line 1, in <module>
2022-10-26T17:07:17.898784606Z import cudf
2022-10-26T17:07:17.898792677Z File "/opt/conda/default/lib/python3.8/site-packages/cudf/__init__.py", line 5, in <module>
2022-10-26T17:07:17.898800120Z validate_setup()
2022-10-26T17:07:17.898807795Z File "/opt/conda/default/lib/python3.8/site-packages/cudf/utils/gpu_utils.py", line 20, in validate_setup
2022-10-26T17:07:17.898815406Z from rmm._cuda.gpu import (
2022-10-26T17:07:17.898822504Z File "/opt/conda/default/lib/python3.8/site-packages/rmm/__init__.py", line 16, in <module>
2022-10-26T17:07:17.898829592Z from rmm import mr
2022-10-26T17:07:17.898836858Z File "/opt/conda/default/lib/python3.8/site-packages/rmm/mr.py", line 14, in <module>
2022-10-26T17:07:17.898843973Z from rmm._lib.memory_resource import (
2022-10-26T17:07:17.898851286Z File "/opt/conda/default/lib/python3.8/site-packages/rmm/_lib/__init__.py", line 15, in <module>
2022-10-26T17:07:17.898858783Z from .device_buffer import DeviceBuffer
2022-10-26T17:07:17.898866026Z File "device_buffer.pyx", line 1, in init rmm._lib.device_buffer
2022-10-26T17:07:17.898873438Z TypeError: C function cuda.ccudart.cudaStreamSynchronize has wrong signature (expected __pyx_t_4cuda_7ccudart_cudaError_t (__pyx_t_4cuda_7ccudart_cudaStream_t), got cudaError_t (cudaStream_t))
**Error 2**
Can we add bigger-sized PDs (persistent disks) while creating the cluster instead of using the default disk size? See the gcloud sketch after the log below.
1/1 local-dirs usable space is below configured utilization percentage/no more usable space [ /hadoop/yarn/nm-local-dir : used space above threshold of 90.0% ] ; 1/1 log-dirs usable space is below configured utilization percentage/no more usable space [ /var/log/hadoop-yarn/userlogs : used space above threshold of 90.0%
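Something along these lines, assuming the test harness uses `gcloud dataproc clusters create` (cluster name, region, and disk sizes are placeholders):

```bash
# Sketch: request larger boot disks at cluster creation so the YARN local dirs
# and log dirs do not hit the 90% disk utilization threshold.
gcloud dataproc clusters create rapids-test-cluster \
  --region=us-central1 \
  --master-boot-disk-size=200GB \
  --worker-boot-disk-size=200GB
```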
**Error 3**
For Ubuntu, I am seeing the error below:
ExpiredTokenRemover received java.lang.InterruptedException: sleep interrupted
I found this link, which is for Ubuntu 16.04 rather than the version we are using, but it points to Ubuntu killing the process when running jobs in standalone mode. Can we instantiate the spark-shell to use YARN instead of standalone mode and see if this error gets resolved?
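For example, a minimal sketch of launching the shell against YARN (assuming the YARN and Spark configs Dataproc provides are already on the node):

```bash
# Run spark-shell against YARN rather than a standalone master,
# so YARN supervises the executor processes.
spark-shell --master yarn --deploy-mode client
```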
This change has been pulled into #1018, so this is no longer necessary.