
[rapids] Adjust some Spark config defaults when using RAPIDS Accelerator

Open jlowe opened this issue 3 years ago • 5 comments

Adds a few more settings to the Spark defaults that are updated as part of setting up the cluster to run with the RAPIDS Accelerator. We have found these settings to perform better than the Dataproc defaults for Spark when queries are running on the GPU.
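For context, here is a minimal sketch of how an init action might apply such overrides to /etc/spark/conf/spark-defaults.conf. Apart from spark.sql.autoBroadcastJoinThreshold=10m, which is discussed later in this thread, the property keys and values shown are illustrative assumptions, not necessarily the exact set in this PR.

```bash
#!/usr/bin/env bash
# Sketch only: update Spark defaults on a Dataproc node. Property keys/values
# other than spark.sql.autoBroadcastJoinThreshold are assumed examples.
readonly SPARK_CONF='/etc/spark/conf/spark-defaults.conf'

set_spark_conf() {
  local key="$1" value="$2"
  # Replace the property if Dataproc already set it, otherwise append it.
  if grep -q "^${key}" "${SPARK_CONF}"; then
    sed -i "s|^${key}.*|${key} ${value}|" "${SPARK_CONF}"
  else
    echo "${key} ${value}" >> "${SPARK_CONF}"
  fi
}

set_spark_conf 'spark.sql.autoBroadcastJoinThreshold' '10m'  # discussed in this thread
set_spark_conf 'spark.sql.files.maxPartitionBytes' '512m'    # assumed example
set_spark_conf 'spark.locality.wait' '0s'                    # assumed example
```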

cc: @mengdong @viadea

Signed-off-by: Jason Lowe [email protected]

jlowe avatar Sep 19 '22 20:09 jlowe

@nvliyuan fyi as well

viadea avatar Sep 19 '22 21:09 viadea

@medb could you help review and approve? Thanks. These are just some default config settings for Spark.

viadea avatar Sep 19 '22 21:09 viadea

@jlowe we do not need to explicitly set spark.sql.autoBroadcastJoinThreshold=10m because this is the default value, right?

viadea avatar Sep 30 '22 22:09 viadea

@jayadeep-jayaraman can you help review this?

mattahrens avatar Oct 10 '22 18:10 mattahrens

> we do not need to explicitly set spark.sql.autoBroadcastJoinThreshold=10m because this is the default value, right?

Apologies for the late reply. We need this because it is not the default on Dataproc. Dataproc overrides spark.sql.autoBroadcastJoinThreshold to a different value. For example, check /etc/spark/conf/spark-defaults.conf after creating a Dataproc cluster.
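A quick way to confirm the override on a freshly created cluster (run on the master node); this is just a verification sketch, assuming the standard Dataproc conf path mentioned above.

```bash
# Print any Dataproc-provided value for the broadcast join threshold.
grep -i 'autoBroadcastJoinThreshold' /etc/spark/conf/spark-defaults.conf
```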

jlowe avatar Oct 11 '22 16:10 jlowe

/gcbrun

jayadeep-jayaraman avatar Oct 26 '22 15:10 jayadeep-jayaraman

I am seeing a few errors while running the tests.

Error 1

2022-10-26T17:07:17.898739918Z /opt/conda/default/lib/python3.8/site-packages/scipy/__init__.py:138: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.23.4)
2022-10-26T17:07:17.898747252Z   warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion} is required for this version of "
2022-10-26T17:07:17.898754795Z Traceback (most recent call last):
2022-10-26T17:07:17.898777281Z   File "verify_rapids_dask.py", line 1, in <module>
2022-10-26T17:07:17.898784606Z     import cudf
2022-10-26T17:07:17.898792677Z   File "/opt/conda/default/lib/python3.8/site-packages/cudf/__init__.py", line 5, in <module>
2022-10-26T17:07:17.898800120Z     validate_setup()
2022-10-26T17:07:17.898807795Z   File "/opt/conda/default/lib/python3.8/site-packages/cudf/utils/gpu_utils.py", line 20, in validate_setup
2022-10-26T17:07:17.898815406Z     from rmm._cuda.gpu import (
2022-10-26T17:07:17.898822504Z   File "/opt/conda/default/lib/python3.8/site-packages/rmm/__init__.py", line 16, in <module>
2022-10-26T17:07:17.898829592Z     from rmm import mr
2022-10-26T17:07:17.898836858Z   File "/opt/conda/default/lib/python3.8/site-packages/rmm/mr.py", line 14, in <module>
2022-10-26T17:07:17.898843973Z     from rmm._lib.memory_resource import (
2022-10-26T17:07:17.898851286Z   File "/opt/conda/default/lib/python3.8/site-packages/rmm/_lib/__init__.py", line 15, in <module>
2022-10-26T17:07:17.898858783Z     from .device_buffer import DeviceBuffer
2022-10-26T17:07:17.898866026Z   File "device_buffer.pyx", line 1, in init rmm._lib.device_buffer
2022-10-26T17:07:17.898873438Z TypeError: C function cuda.ccudart.cudaStreamSynchronize has wrong signature (expected __pyx_t_4cuda_7ccudart_cudaError_t (__pyx_t_4cuda_7ccudart_cudaStream_t), got cudaError_t (cudaStream_t))
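The traceback points at mismatched Python package versions in the cluster's conda environment (the NumPy/SciPy warning, and cuda-python vs. RMM for the signature error). As an assumed diagnostic step, not something in this PR, one could list the implicated packages to spot the mismatch; the pip path below is inferred from the traceback.

```bash
# Assumed diagnostic (not part of this PR): show the versions of the packages
# named in the traceback so incompatible pins are easy to spot.
/opt/conda/default/bin/pip list 2>/dev/null | grep -Ei 'numpy|scipy|cudf|rmm|cuda'
```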

Error 2

Can we add bigger-sized PDs (persistent disks) while creating the cluster instead of using the default disk size?

1/1 local-dirs usable space is below configured utilization percentage/no more usable space [ /hadoop/yarn/nm-local-dir : used space above threshold of 90.0% ] ; 1/1 log-dirs usable space is below configured utilization percentage/no more usable space [ /var/log/hadoop-yarn/userlogs : used space above threshold of 90.0% 
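If larger PDs are the preferred fix, here is a hedged sketch of creating the test cluster with bigger boot disks; the cluster name, region, and sizes are placeholders, and the rest of the real test command (init actions, GPU accelerators, etc.) is omitted.

```bash
# Placeholder name/region/sizes; larger boot disks keep /hadoop/yarn/nm-local-dir
# and the YARN userlogs directory under the 90% utilization threshold.
gcloud dataproc clusters create rapids-test-cluster \
  --region=us-central1 \
  --master-boot-disk-size=200GB \
  --worker-boot-disk-size=200GB
```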

Error 3

For Ubuntu I am seeing the error below:

ExpiredTokenRemover received java.lang.InterruptedException: sleep interrupted

I found this link, which is for Ubuntu 16.04 and not the version we are using, but it points to Ubuntu killing the process when running jobs in standalone mode. Can we instantiate the spark-shell to use YARN instead of standalone mode and see if this error gets resolved?
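A minimal sketch of the suggested change, assuming the test currently launches spark-shell in standalone/local mode; submitting through YARN would let the NodeManager manage the process instead:

```bash
# Launch the shell on YARN (client mode is required for an interactive shell).
spark-shell --master yarn --deploy-mode client
```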

jayadeep-jayaraman avatar Oct 26 '22 17:10 jayadeep-jayaraman

This change has been pulled into #1018, so this is no longer necessary.

jlowe avatar Oct 26 '22 18:10 jlowe