[BUG] cuDF failure processing the full Criteo dataset when the Parquet files are exported from BigQuery
Describe the bug cuDF fails when processing the full Criteo dataset if the Parquet files were exported from BigQuery. CSV input and Parquet read from GCS work fine.
Steps/Code to reproduce bug https://gist.github.com/mengdong/d6a24fc266d9806ccd74cd9890b67c6a (replacing the GCS location with a local file on an SSD leads to the same error)
On a system with 4x T4 GPUs:
sudo docker run --gpus '"device=0,1,2,3"' -it --rm \
    --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
    -v ~:/scripts train_merlin python3 /scripts/merlin/gcs_nvt_test.py
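A minimal sketch of the kind of script in the gist, reconstructed from the traceback below (the column list, cluster settings, and Parquet path are assumptions, not taken from the gist):

```python
# Sketch only: fit an NVTabular Categorify workflow on Criteo Parquet data.
import nvtabular as nvt
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

def main():
    client = Client(LocalCUDACluster(n_workers=4))  # one worker per T4
    nvt.utils.set_dask_client(client)

    # Categorify the 26 Criteo categorical columns (assumed naming C1..C26).
    cat_features = [f"C{i}" for i in range(1, 27)] >> nvt.ops.Categorify()
    workflow = nvt.Workflow(cat_features)

    # Point at the BigQuery-exported Parquet (GCS path or local SSD copy).
    dataset = nvt.Dataset("gs://<bucket>/criteo/*.parquet", engine="parquet")
    workflow.fit(dataset)  # fails in _write_uniques, see the traceback below

if __name__ == "__main__":
    main()
```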
Expected behavior The workflow should fit the dataset without error. Instead, it fails with:
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
<Client: 'tcp://127.0.0.1:40781' processes=4 threads=4, memory=372.53 GiB>
/nvtabular/nvtabular/utils.py:166: FutureWarning: The `client` argument is deprecated from Dataset and will be removed in a future version of NVTabular. By default, a global client in the same python context will be detected automatically, and `nvt.utils.set_dask_client` can be used for explicit control.
warnings.warn(
/usr/local/lib/python3.8/dist-packages/cudf/core/dataframe.py:1253: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(
/nvtabular/nvtabular/utils.py:166: FutureWarning: The `client` argument is deprecated from Workflow and will be removed in a future version of NVTabular. By default, a global client in the same python context will be detected automatically, and `nvt.utils.set_dask_client` can be used for explicit control.
warnings.warn(
/usr/local/lib/python3.8/dist-packages/cudf/core/dataframe.py:1253: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(
/nvtabular/nvtabular/ops/categorify.py:1036: UserWarning: Category DataFrame (with columns: Index(['C1', 'C1_size'], dtype='object')) is 3560679468 bytes in size. This is large compared to the suggested upper limit of 1980465152 bytes!(12.5% of the total memory by default)
warnings.warn(
/usr/local/lib/python3.8/dist-packages/cudf/core/dataframe.py:1253: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:3077: FutureWarning: keep_index is deprecated and will be removed in the future.
warnings.warn(
/usr/local/lib/python3.8/dist-packages/cudf/core/dataframe.py:1253: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(
distributed.worker - WARNING - Compute Failed
Function: _write_uniques
args: ([pyarrow.Table
C20: string
C20_size: int64, pyarrow.Table
C20: string
C20_size: int64, pyarrow.Table
C20: string
C20_size: int64, pyarrow.Table
C20: string
C20_size: int64, pyarrow.Table
C20: string
C20_size: int64, pyarrow.Table
C20: string
C20_size: int64, pyarrow.Table
C20: string
C20_size: int64, pyarrow.Table
C20: string
C20_size: int64], './/categories', <nvtabular.graph.selector.ColumnSelector object at 0x7f40b81bc730>, FitOptions(col_groups=[<nvtabular.graph.selector.ColumnSelector object at 0x7f40c12ebaf0>, <nvtabular.graph.selector.ColumnSelector object at 0x7f40c5541df0>, <nvtabular.graph.selector.ColumnSelector object at 0x7f40928af340>, <nvtabular.graph.selector.ColumnSelector object at 0x7f40928af280>, <nvtabular.graph.selector.ColumnSelector object at 0x7f409e1b95e0>, <nvtabular.graph.selector.ColumnSelector object at 0x7f409e1b9520>, <nvtabular.graph.selector.ColumnSelector object at 0x7f40a473d9a0>, <nvtabular.graph.selector.ColumnSelector object at 0x7f40a473d8b0>, <
kwargs: {}
Exception: "RuntimeError('cuDF failure at: ../src/copying/concatenate.cu:391: Total number of concatenated chars exceeds size_type range')"
Traceback (most recent call last):
File "/scripts/merlin/gcp_test.py", line 97, in <module>
main()
File "/scripts/merlin/gcp_test.py", line 87, in main
criteo_workflow = analyze_dataset(criteo_workflow, dataset)
File "/scripts/merlin/gcp_test.py", line 33, in analyze_dataset
workflow.fit(dataset)
File "/nvtabular/nvtabular/workflow/workflow.py", line 206, in fit
results = [r.result() for r in dask_client.compute(stats)]
File "/nvtabular/nvtabular/workflow/workflow.py", line 206, in <listcomp>
results = [r.result() for r in dask_client.compute(stats)]
File "/usr/local/lib/python3.8/dist-packages/distributed/client.py", line 236, in result
raise exc.with_traceback(tb)
File "/usr/lib/python3.8/contextlib.py", line 75, in inner
return func(*args, **kwds)
File "/nvtabular/nvtabular/ops/categorify.py", line 1020, in _write_uniques
df = dispatch._concat_columns([dispatch._from_host(df)] + size_columns)
File "/nvtabular/nvtabular/dispatch.py", line 551, in _from_host
return cudf.DataFrame.from_arrow(x)
File "/usr/local/lib/python3.8/dist-packages/cudf/core/dataframe.py", line 4863, in from_arrow
out = super().from_arrow(table)
File "/usr/lib/python3.8/contextlib.py", line 75, in inner
return func(*args, **kwds)
File "/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py", line 2198, in from_arrow
else libcudf.interop.from_arrow(data, data.column_names)[0]
File "cudf/_lib/interop.pyx", line 167, in cudf._lib.interop.from_arrow
RuntimeError: cuDF failure at: ../src/copying/concatenate.cu:391: Total number of concatenated chars exceeds size_type range
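For context on the error (my reading, not from the NVTabular docs): libcudf's size_type is a 32-bit signed integer, so the character data of a single string column cannot exceed 2**31 - 1 bytes, and the C20 uniques being concatenated in _write_uniques apparently cross that limit. A host-side check of the pyarrow chunks against that limit, using an assumed helper, could look like:

```python
# Hedged sketch: estimate whether concatenating string chunks would exceed
# cuDF's 32-bit size_type limit on total characters in one column.
import pyarrow.compute as pc

CUDF_SIZE_TYPE_MAX = 2**31 - 1  # libcudf indexes characters with int32

def total_string_bytes(tables, column):
    """Sum UTF-8 byte lengths of `column` across a list of pyarrow Tables."""
    total = 0
    for tbl in tables:
        total += pc.sum(pc.binary_length(tbl[column])).as_py() or 0
    return total

# e.g. with the chunks passed to _write_uniques:
# if total_string_bytes(chunks, "C20") > CUDF_SIZE_TYPE_MAX:
#     print("C20 uniques will not fit in a single cuDF string column")
```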
Environment details:
- Google Cloud (also fails on an on-prem system)
- NGC merlin-training image 22.02
- 1 x N1 instance with 96 vCPUs and 360 GB RAM
- 4 x T4 GPUs
Parquet files exported directly from BigQuery have row group sizes that are incompatible with cuDF. Our docs highlight the steps to fix this:
https://nvidia-merlin.github.io/NVTabular/main/resources/troubleshooting.html#checking-the-schema-of-the-parquet-file
It may be possible to fix this on the BigQuery side by specifying the row group size and partitions, but I'm not sure whether BigQuery exposes that in its Parquet export. If not, it's something we should bring up with Google.
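For reference, here is a sketch of one way to check and fix the row-group layout with pyarrow, along the lines of what the troubleshooting page describes (its exact snippet may differ; the file name and row_group_size value below are illustrative):

```python
# Inspect the row-group layout of a BigQuery-exported Parquet file and,
# if the groups are too large for a single GPU, rewrite with smaller ones.
import pyarrow.parquet as pq

path = "bq_export_000000000000.parquet"  # hypothetical exported file

md = pq.ParquetFile(path).metadata
for i in range(md.num_row_groups):
    rg = md.row_group(i)
    print(f"row group {i}: {rg.num_rows} rows, {rg.total_byte_size / 1e6:.0f} MB")

# Rewriting loads the full table into host memory, so do it file by file.
table = pq.read_table(path)
pq.write_table(table, "rewritten.parquet", row_group_size=1_000_000)
```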
@mengdong please follow the steps at the link and let us know if that resolves your issue.
Dong Meng recommended P1, and I would also recommend moving this to P1 for now. From my side, on the Vertex front, the goal is to make sure the whole pipeline works end to end so we can release it; it has been delayed time after time. Since this is specific to BigQuery, I think we can put it aside for the moment.
Closing because the issue is stale and there's a proposed work-around. If this comes back up, feel free to re-open.