Q27 intermittent failure in nightly automation
This is the intermittent CUBLAS_STATUS_NOT_INITIALIZED error that I thought was in Q28 @VibhuJawa
Encountered Exception while running query
Traceback (most recent call last):
File "/raid/nicholasb/prod/tpcx-bb/tpcx_bb/xbb_tools/utils.py", line 280, in run_dask_cudf_query
config=config,
File "/raid/nicholasb/prod/tpcx-bb/tpcx_bb/xbb_tools/utils.py", line 61, in benchmark
result = func(*args, **kwargs)
File "queries/q27/tpcx_bb_query_27.py", line 130, in main
["review_idx_global_pos", "pr_item_sk", "word", "sentence"]
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/dask_cudf/core.py", line 255, in sor
t_values
ignore_index=ignore_index,
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/dask_cudf/sorting.py", line 225, in
sort_values
divisions = quantile_divisions(df, by, npartitions)
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/dask_cudf/sorting.py", line 177, in
quantile_divisions
divisions = _approximate_quantile(df[by], qn).compute()
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/dask/base.py", line 279, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/dask/base.py", line 567, in compute
results = schedule(dsk, keys, **kwargs)
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/distributed/client.py", line 2676, i
n get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/distributed/client.py", line 1991, i
n gather
asynchronous=asynchronous,
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/distributed/client.py", line 832, in sync
self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/distributed/utils.py", line 340, in
sync
raise exc.with_traceback(tb)
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/distributed/utils.py", line 324, in
f
result[0] = yield future
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/tornado/gen.py", line 762, in run
value = future.result()
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/distributed/client.py", line 1850, i
n _gather
raise exception.with_traceback(traceback)
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/dask/optimization.py", line 963, in
__call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/dask/core.py", line 151, in get
result = _execute_task(task, cache)
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/dask/core.py", line 121, in _execute
_task
return func(*(_execute_task(a, cache) for a in args))
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/dask/utils.py", line 30, in apply
return func(*args, **kwargs)
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/dask/dataframe/core.py", line 5401,
in apply_and_enforce
df = func(*args, **kwargs)
File "queries/q27/tpcx_bb_query_27.py", line 63, in ner_parser
for doc in docs:
File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/spacy/language.py", line 819, in pipe
for doc in docs:
File "nn_parser.pyx", line 253, in pipe
File "nn_parser.pyx", line 273, in spacy.syntax.nn_parser.Parser.predict
File "nn_parser.pyx", line 286, in spacy.syntax.nn_parser.Parser.greedy_parse
File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/thinc/neural/_classes/model.py", line 167, in __call__
return self.predict(x)
File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/thinc/neural/_classes/model.py", line 131, in predict
y, _ = self.begin_update(X, drop=None)
File "_parser_model.pyx", line 243, in spacy.syntax._parser_model.ParserModel.begin_update
File "_parser_model.pyx", line 293, in spacy.syntax._parser_model.ParserStepModel.__init__
File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/thinc/neural/_classes/feed_forward.py", line 46, in begin_update
X, inc_layer_grad = layer.begin_update(X, drop=drop)
File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/thinc/api.py", line 295, in begin_update
X, bp_layer = layer.begin_update(layer.ops.flatten(seqs_in, pad=pad), drop=drop)
File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/thinc/neural/_classes/feed_forward.py", line 46, in begin_update
X, inc_layer_grad = layer.begin_update(X, drop=drop)
File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/thinc/api.py", line 379, in uniqued_fwd
Y_uniq, bp_Y_uniq = layer.begin_update(X_uniq, drop=drop)
File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/thinc/neural/_classes/feed_forward.py", line 46, in begin_update
X, inc_layer_grad = layer.begin_update(X, drop=drop)
File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/thinc/neural/_classes/layernorm.py", line 62, in begin_update
X, backprop_child = self.child.begin_update(X, drop=0.0)
File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/thinc/neural/_classes/maxout.py", line 76, in begin_update
output__boc = self.ops.gemm(X__bi, W, trans2=True)
File "ops.pyx", line 986, in thinc.neural.ops.CupyOps.gemm
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/cupy/linalg/_product.py", line 34, i
n dot
return a.dot(b, out)
File "cupy/core/core.pyx", line 1412, in cupy.core.core.ndarray.dot
File "cupy/core/_routines_linalg.pyx", line 418, in cupy.core._routines_linalg.dot
File "cupy/core/_routines_linalg.pyx", line 503, in cupy.core._routines_linalg.tensordot_core
File "cupy/cuda/device.pyx", line 47, in cupy.cuda.device.get_cublas_handle
File "cupy/cuda/device.pyx", line 213, in cupy.cuda.device.Device.cublas_handle.__get__
File "cupy/cuda/device.pyx", line 200, in cupy.cuda.device.Device._get_handle
File "cupy/cuda/device.pyx", line 201, in cupy.cuda.device.Device._get_handle
File "cupy_backends/cuda/libs/cublas.pyx", line 346, in cupy_backends.cuda.libs.cublas.create
File "cupy_backends/cuda/libs/cublas.pyx", line 350, in cupy_backends.cuda.libs.cublas.create
File "cupy_backends/cuda/libs/cublas.pyx", line 339, in cupy_backends.cuda.libs.cublas.check_status
cupy_backends.cuda.libs.cublas.CUBLASError: CUBLAS_STATUS_NOT_INITIALIZED
CUDA version mismatches would be the normal culprit, but I'm not sure how this would only show up intermittently. @jakirkham do you have any thoughts on this?
cc @anaruse (in case you have thoughts here 🙂)
Hmm, it might have something to do with the issue below. https://github.com/cupy/cupy/issues/3935
Based on the comments there, they recommend setting the CUPY_GPU_MEMORY_LIMIT="90%" environment variable, as they say there's not enough memory left to create the cuBLAS handle.
QQ: Is the cuBLAS handle here supposed to use the existing RMM pool? (Because if it's not, this may explain the pool growth we see while using spaCy.)
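A minimal sketch of how that variable might be applied to the workers (assuming the existing client from this thread, and assuming the variable is read before CuPy makes its first allocation on each worker):

import os

def limit_cupy_memory():
    # Sketch only: this has to run before CuPy allocates anything on the worker,
    # otherwise the limit from CUPY_GPU_MEMORY_LIMIT may not be picked up.
    os.environ["CUPY_GPU_MEMORY_LIMIT"] = "90%"

client.run(limit_cupy_memory)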
cc: @beckernick
QQ: Is the cuBLAS handle here supposed to use the existing RMM pool? (Because if it's not, this may explain the pool growth we see while using spaCy.)
I don't think so (meaning it shouldn't use the pool).
Edit: To expand on this, we would need an API in cuBLAS (that CuPy would then use), which would allow us to specify a chunk of memory to use for the initialization.
Is there a way for us to trigger cuBLAS initialization early? If so, maybe we can do this as part of Dask-CUDA startup.
Thanks for the pointer @anaruse @jakirkham. Let's look into triggering this early and/or perhaps reserving memory.
@anaruse @jakirkham , do you think there is any downside/risk of initializing a handle to the cuBLAS library context and then essentially "throwing it away" before we do anything else on the cluster?
I.e., running something like this on every Dask worker?
def init_cublas():
    from cupy.cuda import cublas
    cublas.create()  # allocates 64MB of GPU memory and returns a handle
    return None

client.run(init_cublas)
Initializing the cuBLAS context beforehand wouldn't necessarily change the allocation dynamics of the workload that triggers RMM to grow the pool right up to the edge of total capacity. Without visibility into that chunk of memory, this might still be a risk, right?
Perhaps in combination with CUPY_GPU_MEMORY_LIMIT we may be able to avoid it, but it's not immediately clear to me how that would work.
I don't know about creating a cuBLAS handle like that. The library may expect us to do cleanup, and I'm not sure what happens if we skip it.
That said, maybe we can do some warmup step (like matrix multiplication), which would get CuPy to initialize cuBLAS. Admittedly that's a bit hacky, but perhaps workable.
@anaruse may have a better suggestion
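A rough sketch of the warmup step suggested above (assuming the existing client; the array shape and dtype are arbitrary, the point is just to route one GEMM through CuPy so the cuBLAS handle gets created early):

import cupy as cp

def warmup_cublas():
    # A tiny float32 matmul goes through CuPy's cuBLAS-backed GEMM, which
    # forces creation of the per-device cuBLAS handle (and its workspace).
    a = cp.ones((2, 2), dtype=cp.float32)
    _ = a @ a  # result discarded; we only want the initialization side effect
    cp.cuda.Stream.null.synchronize()

client.run(warmup_cublas)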
I poked around the CuPy code and think something like this might work. I should add that CuPy takes care of the cleanup of the handle in this case.
In [1]: import cupy
In [2]: cupy.cuda.device.get_cublas_handle()
Out[2]: 94517515719584
In [3]: cupy.cuda.device.get_cublas_handle()
Out[3]: 94517515719584
The first run is a bit slow (as it allocates the handle), but the second one is faster (as the handle is cached). Note that the pointer, returned as a Python int, is the same in both cases.
We could call this with client.run or similar to make sure this is set up on all workers.
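A minimal sketch of what that could look like (the wrapper function name is made up; client is the existing Dask client from this thread):

import cupy

def init_cublas_handle():
    # The first call creates and caches the per-device cuBLAS handle (so CuPy
    # also owns its cleanup); later calls just return the cached pointer.
    return cupy.cuda.device.get_cublas_handle()

client.run(init_cublas_handle)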
As mentioned in an edit in the RMM thread, it seems spaCy just uses CuPy for cuBLAS and doesn't use cuBLAS directly. So I think that initialization step should be sufficient.