[BUG] cuDF failure processing the full Criteo dataset when the Parquet files are exported from BigQuery
Describe the bug cuDF fails when processing the full Criteo dataset if the Parquet files were exported from BigQuery. CSV input and Parquet read from GCS work fine.
Steps/Code to reproduce bug https://gist.github.com/mengdong/d6a24fc266d9806ccd74cd9890b67c6a (replacing the GCS location with a local file on an SSD leads to the same error)
On a system with 4x T4 GPUs:
sudo docker run --gpus '"device=0,1,2,3"' -it --rm \
    --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
    -v ~:/scripts train_merlin python3 /scripts/merlin/gcs_nvt_test.py
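A minimal sketch of the kind of script in the gist, reconstructed from the traceback below (the column list, cluster settings, and Parquet path are assumptions, not taken from the gist):

```python
# Sketch only: fit an NVTabular Categorify workflow on Criteo Parquet data.
import nvtabular as nvt
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

def main():
    client = Client(LocalCUDACluster(n_workers=4))  # one worker per T4
    nvt.utils.set_dask_client(client)

    # Categorify the 26 Criteo categorical columns (assumed naming C1..C26).
    cat_features = [f"C{i}" for i in range(1, 27)] >> nvt.ops.Categorify()
    workflow = nvt.Workflow(cat_features)

    # Point at the BigQuery-exported Parquet (GCS path or local SSD copy).
    dataset = nvt.Dataset("gs://<bucket>/criteo/*.parquet", engine="parquet")
    workflow.fit(dataset)  # fails in _write_uniques, see the traceback below

if __name__ == "__main__":
    main()
```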
Expected behavior The workflow should fit the dataset without error. Instead, it fails with:
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
<Client: 'tcp://127.0.0.1:40781' processes=4 threads=4, memory=372.53 GiB>
/nvtabular/nvtabular/utils.py:166: FutureWarning: The `client` argument is deprecated from Dataset and will be removed in a future version of NVTabular. By default, a global client in the same python context will be detected automatically, and `nvt.utils.set_dask_client` can be used for explicit control.
warnings.warn(
/usr/local/lib/python3.8/dist-packages/cudf/core/dataframe.py:1253: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(
/nvtabular/nvtabular/utils.py:166: FutureWarning: The `client` argument is deprecated from Workflow and will be removed in a future version of NVTabular. By default, a global client in the same python context will be detected automatically, and `nvt.utils.set_dask_client` can be used for explicit control.
warnings.warn(
/usr/local/lib/python3.8/dist-packages/cudf/core/dataframe.py:1253: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(
/nvtabular/nvtabular/ops/categorify.py:1036: UserWarning: Category DataFrame (with columns: Index(['C1', 'C1_size'], dtype='object')) is 3560679468 bytes in size. This is large compared to the suggested upper limit of 1980465152 bytes!(12.5% of the total memory by default)
warnings.warn(
/usr/local/lib/python3.8/dist-packages/cudf/core/dataframe.py:1253: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:3077: FutureWarning: keep_index is deprecated and will be removed in the future.
warnings.warn(
/usr/local/lib/python3.8/dist-packages/cudf/core/dataframe.py:1253: UserWarning: The deep parameter is ignored and is only included for pandas compatibility.
warnings.warn(
distributed.worker - WARNING - Compute Failed
Function: _write_uniques
args: ([pyarrow.Table
C20: string
C20_size: int64, pyarrow.Table
C20: string
C20_size: int64, pyarrow.Table
C20: string
C20_size: int64, pyarrow.Table
C20: string
C20_size: int64, pyarrow.Table
C20: string
C20_size: int64, pyarrow.Table
C20: string
C20_size: int64, pyarrow.Table
C20: string
C20_size: int64, pyarrow.Table
C20: string
C20_size: int64], './/categories', <nvtabular.graph.selector.ColumnSelector object at 0x7f40b81bc730>, FitOptions(col_groups=[<nvtabular.graph.selector.ColumnSelector object at 0x7f40c12ebaf0>, <nvtabular.graph.selector.ColumnSelector object at 0x7f40c5541df0>, <nvtabular.graph.selector.ColumnSelector object at 0x7f40928af340>, <nvtabular.graph.selector.ColumnSelector object at 0x7f40928af280>, <nvtabular.graph.selector.ColumnSelector object at 0x7f409e1b95e0>, <nvtabular.graph.selector.ColumnSelector object at 0x7f409e1b9520>, <nvtabular.graph.selector.ColumnSelector object at 0x7f40a473d9a0>, <nvtabular.graph.selector.ColumnSelector object at 0x7f40a473d8b0>, <
kwargs: {}
Exception: "RuntimeError('cuDF failure at: ../src/copying/concatenate.cu:391: Total number of concatenated chars exceeds size_type range')"
Traceback (most recent call last):
File "/scripts/merlin/gcp_test.py", line 97, in <module>
main()
File "/scripts/merlin/gcp_test.py", line 87, in main
criteo_workflow = analyze_dataset(criteo_workflow, dataset)
File "/scripts/merlin/gcp_test.py", line 33, in analyze_dataset
workflow.fit(dataset)
File "/nvtabular/nvtabular/workflow/workflow.py", line 206, in fit
results = [r.result() for r in dask_client.compute(stats)]
File "/nvtabular/nvtabular/workflow/workflow.py", line 206, in <listcomp>
results = [r.result() for r in dask_client.compute(stats)]
File "/usr/local/lib/python3.8/dist-packages/distributed/client.py", line 236, in result
raise exc.with_traceback(tb)
File "/usr/lib/python3.8/contextlib.py", line 75, in inner
return func(*args, **kwds)
File "/nvtabular/nvtabular/ops/categorify.py", line 1020, in _write_uniques
df = dispatch._concat_columns([dispatch._from_host(df)] + size_columns)
File "/nvtabular/nvtabular/dispatch.py", line 551, in _from_host
return cudf.DataFrame.from_arrow(x)
File "/usr/local/lib/python3.8/dist-packages/cudf/core/dataframe.py", line 4863, in from_arrow
out = super().from_arrow(table)
File "/usr/lib/python3.8/contextlib.py", line 75, in inner
return func(*args, **kwds)
File "/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py", line 2198, in from_arrow
else libcudf.interop.from_arrow(data, data.column_names)[0]
File "cudf/_lib/interop.pyx", line 167, in cudf._lib.interop.from_arrow
RuntimeError: cuDF failure at: ../src/copying/concatenate.cu:391: Total number of concatenated chars exceeds size_type range
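For context on the error (my reading, not from the NVTabular docs): libcudf's size_type is a 32-bit signed integer, so the character data of a single string column cannot exceed 2**31 - 1 bytes, and the C20 uniques being concatenated in _write_uniques apparently cross that limit. A host-side check of the pyarrow chunks against that limit, using an assumed helper, could look like:

```python
# Hedged sketch: estimate whether concatenating string chunks would exceed
# cuDF's 32-bit size_type limit on total characters in one column.
import pyarrow.compute as pc

CUDF_SIZE_TYPE_MAX = 2**31 - 1  # libcudf indexes characters with int32

def total_string_bytes(tables, column):
    """Sum UTF-8 byte lengths of `column` across a list of pyarrow Tables."""
    total = 0
    for tbl in tables:
        total += pc.sum(pc.binary_length(tbl[column])).as_py() or 0
    return total

# e.g. with the chunks passed to _write_uniques:
# if total_string_bytes(chunks, "C20") > CUDF_SIZE_TYPE_MAX:
#     print("C20 uniques will not fit in a single cuDF string column")
```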
Environment details:
- Google Cloud (also fails on an on-prem system)
- NGC merlin-training image 22.02
- 1 x N1 instance with 96 vCPUs and 360 GB RAM
- 4 x T4 GPUs
Parquet files exported directly from BigQuery have row group sizes that are incompatible with cuDF. Our docs highlight the steps to fix this:
https://nvidia-merlin.github.io/NVTabular/main/resources/troubleshooting.html#checking-the-schema-of-the-parquet-file
It may be possible to fix this on the BigQuery side by specifying the row group size and partitions, but I'm not sure whether BigQuery exposes that in its Parquet export. If not, it's something we should bring up with Google.
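For reference, here is a sketch of one way to check and fix the row-group layout with pyarrow, along the lines of what the troubleshooting page describes (its exact snippet may differ; the file name and row_group_size value below are illustrative):

```python
# Inspect the row-group layout of a BigQuery-exported Parquet file and,
# if the groups are too large for a single GPU, rewrite with smaller ones.
import pyarrow.parquet as pq

path = "bq_export_000000000000.parquet"  # hypothetical exported file

md = pq.ParquetFile(path).metadata
for i in range(md.num_row_groups):
    rg = md.row_group(i)
    print(f"row group {i}: {rg.num_rows} rows, {rg.total_byte_size / 1e6:.0f} MB")

# Rewriting loads the full table into host memory, so do it file by file.
table = pq.read_table(path)
pq.write_table(table, "rewritten.parquet", row_group_size=1_000_000)
```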
@mengdong please follow the steps at the link and let us know if that resolves your issue.
Dong Meng recommended P1, and I would also recommend moving this to P1 for now. From my side, on the Vertex front, the goal is to make sure the whole pipeline works end to end so we can release it; it has been delayed time after time. Since this is specific to BigQuery, I think we can put it aside for the moment.
Closing because the issue is stale and there's a proposed work-around. If this comes back up, feel free to re-open.