Q27 intermittent failure in nightly automation
This is the intermittent CUBLAS_STATUS_NOT_INITIALIZED error that I thought was in Q28 @VibhuJawa
Encountered Exception while running query
Traceback (most recent call last):
File "/raid/nicholasb/prod/tpcx-bb/tpcx_bb/xbb_tools/utils.py", line 280, in run_dask_cudf_query
config=config,
File "/raid/nicholasb/prod/tpcx-bb/tpcx_bb/xbb_tools/utils.py", line 61, in benchmark
result = func(*args, **kwargs)
File "queries/q27/tpcx_bb_query_27.py", line 130, in main
["review_idx_global_pos", "pr_item_sk", "word", "sentence"]
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/dask_cudf/core.py", line 255, in sor
t_values
ignore_index=ignore_index,
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/dask_cudf/sorting.py", line 225, in
sort_values
divisions = quantile_divisions(df, by, npartitions)
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/dask_cudf/sorting.py", line 177, in
quantile_divisions
divisions = _approximate_quantile(df[by], qn).compute()
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/dask/base.py", line 279, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/dask/base.py", line 567, in compute
results = schedule(dsk, keys, **kwargs)
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/distributed/client.py", line 2676, i
n get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/distributed/client.py", line 1991, i
n gather
asynchronous=asynchronous,
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/distributed/client.py", line 832, in sync
self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/distributed/utils.py", line 340, in
sync
raise exc.with_traceback(tb)
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/distributed/utils.py", line 324, in
f
result[0] = yield future
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/tornado/gen.py", line 762, in run
value = future.result()
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/distributed/client.py", line 1850, i
n _gather
raise exception.with_traceback(traceback)
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/dask/optimization.py", line 963, in
__call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/dask/core.py", line 151, in get
result = _execute_task(task, cache)
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/dask/core.py", line 121, in _execute
_task
return func(*(_execute_task(a, cache) for a in args))
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/dask/utils.py", line 30, in apply
return func(*args, **kwargs)
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/dask/dataframe/core.py", line 5401,
in apply_and_enforce
df = func(*args, **kwargs)
File "queries/q27/tpcx_bb_query_27.py", line 63, in ner_parser
for doc in docs:
File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/spacy/language.py", line 819, in pipe
for doc in docs:
File "nn_parser.pyx", line 253, in pipe
File "nn_parser.pyx", line 273, in spacy.syntax.nn_parser.Parser.predict
File "nn_parser.pyx", line 286, in spacy.syntax.nn_parser.Parser.greedy_parse
File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/thinc/neural/_classes/model.py", line 167, in __call__
return self.predict(x)
File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/thinc/neural/_classes/model.py", line 131, in predict
y, _ = self.begin_update(X, drop=None)
File "_parser_model.pyx", line 243, in spacy.syntax._parser_model.ParserModel.begin_update
File "_parser_model.pyx", line 293, in spacy.syntax._parser_model.ParserStepModel.__init__
File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/thinc/neural/_classes/feed_forward.py", line 46, in begin_update
X, inc_layer_grad = layer.begin_update(X, drop=drop)
File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/thinc/api.py", line 295, in begin_update
X, bp_layer = layer.begin_update(layer.ops.flatten(seqs_in, pad=pad), drop=drop)
File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/thinc/neural/_classes/feed_forward.py", line 46, in begin_update
X, inc_layer_grad = layer.begin_update(X, drop=drop)
File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/thinc/api.py", line 379, in uniqued_fwd
Y_uniq, bp_Y_uniq = layer.begin_update(X_uniq, drop=drop)
File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/thinc/neural/_classes/feed_forward.py", line 46, in begin_update
X, inc_layer_grad = layer.begin_update(X, drop=drop)
File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/thinc/neural/_classes/layernorm.py", line 62, in begin_update
X, backprop_child = self.child.begin_update(X, drop=0.0)
File "/home/nfs/nicholasb/.local/lib/python3.7/site-packages/thinc/neural/_classes/maxout.py", line 76, in begin_update
output__boc = self.ops.gemm(X__bi, W, trans2=True)
File "ops.pyx", line 986, in thinc.neural.ops.CupyOps.gemm
File "/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-automated-tests/lib/python3.7/site-packages/cupy/linalg/_product.py", line 34, i
n dot
return a.dot(b, out)
File "cupy/core/core.pyx", line 1412, in cupy.core.core.ndarray.dot
File "cupy/core/_routines_linalg.pyx", line 418, in cupy.core._routines_linalg.dot
File "cupy/core/_routines_linalg.pyx", line 503, in cupy.core._routines_linalg.tensordot_core
File "cupy/cuda/device.pyx", line 47, in cupy.cuda.device.get_cublas_handle
File "cupy/cuda/device.pyx", line 213, in cupy.cuda.device.Device.cublas_handle.__get__
File "cupy/cuda/device.pyx", line 200, in cupy.cuda.device.Device._get_handle
File "cupy/cuda/device.pyx", line 201, in cupy.cuda.device.Device._get_handle
File "cupy_backends/cuda/libs/cublas.pyx", line 346, in cupy_backends.cuda.libs.cublas.create
File "cupy_backends/cuda/libs/cublas.pyx", line 350, in cupy_backends.cuda.libs.cublas.create
File "cupy_backends/cuda/libs/cublas.pyx", line 339, in cupy_backends.cuda.libs.cublas.check_status
cupy_backends.cuda.libs.cublas.CUBLASError: CUBLAS_STATUS_NOT_INITIALIZED
CUDA version mismatches would be the normal culprit, but I'm not sure how this would only show up intermittently. @jakirkham do you have any thoughts on this?
cc @anaruse (in case you have thoughts here 🙂)
Hmm, it might have something to do with the issue below. https://github.com/cupy/cupy/issues/3935
Based on the comments there, they recommend setting the CUPY_GPU_MEMORY_LIMIT="90%" environment variable, as they say there's not enough memory left to create the cuBLAS handle.
QQ: Is the cuBLAS handle here supposed to use the existing RMM pool? (Because if it's not, this may explain the pool growth we see while using spaCy.)
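A minimal sketch of how that variable might be applied to the workers (assuming the existing client from this thread, and assuming the variable is read before CuPy makes its first allocation on each worker):

import os

def limit_cupy_memory():
    # Sketch only: this has to run before CuPy allocates anything on the worker,
    # otherwise the limit from CUPY_GPU_MEMORY_LIMIT may not be picked up.
    os.environ["CUPY_GPU_MEMORY_LIMIT"] = "90%"

client.run(limit_cupy_memory)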
cc: @beckernick
QQ: Is the cuBLAS handle here supposed to use the existing RMM pool? (Because if it's not, this may explain the pool growth we see while using spaCy.)
I don't think so (meaning it shouldn't use the pool).
Edit: To expand on this, we would need an API in cuBLAS (that CuPy would then use), which would allow us to specify a chunk of memory to use for the initialization.
Is there a way for us to trigger cuBLAS initialization early? If so, maybe we can do this as part of Dask-CUDA startup.
Thanks for the pointer @anaruse @jakirkham. Let's look into triggering this early and/or perhaps reserving memory.
@anaruse @jakirkham , do you think there is any downside/risk of initializing a handle to the cuBLAS library context and then essentially "throwing it away" before we do anything else on the cluster?
I.e., running something like this on every Dask worker?
def init_cublas():
    from cupy.cuda import cublas
    cublas.create()  # allocates 64MB of GPU memory and returns a handle
    return None

client.run(init_cublas)
Initializing the cuBLAS context beforehand wouldn't necessarily change the allocation dynamics of the workload that triggers RMM to grow the pool right up to the edge of total capacity. Without visibility into that chunk of memory, this might still be a risk, right?
Perhaps in combination with CUPY_GPU_MEMORY_LIMIT we may be able to avoid it, but it's not immediately clear to me how that would work.
I don't know about creating a cuBLAS handle like that. The library may expect us to do cleanup, and I'm not sure what happens if we skip it.
That said, maybe we can do some warmup step (like matrix multiplication), which would get CuPy to initialize cuBLAS. Admittedly that's a bit hacky, but perhaps workable.
@anaruse may have a better suggestion
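A rough sketch of the warmup step suggested above (assuming the existing client; the array shape and dtype are arbitrary, the point is just to route one GEMM through CuPy so the cuBLAS handle gets created early):

import cupy as cp

def warmup_cublas():
    # A tiny float32 matmul goes through CuPy's cuBLAS-backed GEMM, which
    # forces creation of the per-device cuBLAS handle (and its workspace).
    a = cp.ones((2, 2), dtype=cp.float32)
    _ = a @ a  # result discarded; we only want the initialization side effect
    cp.cuda.Stream.null.synchronize()

client.run(warmup_cublas)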
I poked around the CuPy code and think something like this might work. I should add that CuPy takes care of the cleanup of the handle in this case.
In [1]: import cupy
In [2]: cupy.cuda.device.get_cublas_handle()
Out[2]: 94517515719584
In [3]: cupy.cuda.device.get_cublas_handle()
Out[3]: 94517515719584
The first run is a bit slow (as it allocates the handle), but the second one is faster (as the handle is cached). Note that the pointer, returned as a Python int, is the same in both cases.
We could call this with client.run or similar to make sure this is set up on all workers.
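A minimal sketch of what that could look like (the wrapper function name is made up; client is the existing Dask client from this thread):

import cupy

def init_cublas_handle():
    # The first call creates and caches the per-device cuBLAS handle (so CuPy
    # also owns its cleanup); later calls just return the cached pointer.
    return cupy.cuda.device.get_cublas_handle()

client.run(init_cublas_handle)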
As mentioned in an edit in the RMM thread, it seems spaCy just uses CuPy for cuBLAS and doesn't use cuBLAS directly. So I think that initialization step should be sufficient.