
[SUPPORT] Looking for guidance on enabling CDC on an existing hudi table

Open Dakarun opened this issue 8 months ago • 3 comments

Tips before filing an issue

  • Have you gone through our FAQs? -> Page 404s

  • Join the mailing list to engage in conversations and get faster support at [email protected].

  • If you have triaged this as a bug, then file an issue directly.

Describe the problem you faced

We recently upgraded to Hudi 0.15, and there are a number of tables I'd like to enable the CDC feature on going forward. So far I've tried:

  • Adding hoodie.table.cdc.enabled true in my write properties
  • Updating the .hoodie/hoodie.properties file to include hoodie.table.cdc.enabled=true and hoodie.table.cdc.supplemental.logging.mode=DATA_BEFORE_AFTER key/values.
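
For reference, the writes look roughly like this. A minimal sketch, assuming an existing DataFrame df; the table name, key fields, and S3 path are placeholders for our actual job:

# Sketch of a write with the CDC flag enabled; hoodie.table.name, the
# record key/precombine fields, and the path below are placeholders.
hudi_options = {
    "hoodie.table.name": "my_table",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.table.cdc.enabled": "true",
    "hoodie.table.cdc.supplemental.logging.mode": "DATA_BEFORE_AFTER",
}

df.write.format("hudi") \
    .options(**hudi_options) \
    .mode("append") \
    .save("s3://bucket/prefix/")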

I've been unable to find a migration strategy for enabling CDC on existing Hudi tables in the documentation or blog posts, though the documentation search appears to be broken, so I may have missed something. Could someone provide guidance on the recommended way to migrate an existing table to enable CDC? Is the only path to rewrite the entire table with a bulk insert?

To Reproduce

Steps to reproduce the behavior:

  1. Have a non-CDC Hudi table (ours were created with Hudi 0.12)
  2. Perform a write with the CDC flags turned on, or update .hoodie/hoodie.properties directly and then perform the writes
  3. Read the table via hudi_table_changes (example query below)
  4. Hit pyspark.errors.exceptions.captured.IllegalArgumentException: It isn't a CDC hudi table on s3://bucket/prefix
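
The hudi_table_changes read in step 3 looked roughly like this; the table path and the 'earliest' begin instant are placeholders (the function also accepts a table identifier and an optional end instant):

spark.sql("""
    SELECT *
    FROM hudi_table_changes('s3://bucket/prefix', 'cdc', 'earliest')
""").show()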

Expected behavior

CDC data is provided for any writes that have happened since the setting was enabled

Environment Description

  • Hudi version : 0.15.0

  • Spark version : 3.5.2

  • Hive version : Glue Catalog

  • Hadoop version : N/A

  • Storage (HDFS/S3/GCS..) : S3

  • Running on Docker? (yes/no) : Yes

Additional context

Performing the tests with an AWS Glue container.

Stacktrace

  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/session.py", line 1631, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery, litArgs), self)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
  File "/usr/lib/spark/python/pyspark/errors/exceptions/captured.py", line 185, in deco
    raise converted from None
pyspark.errors.exceptions.captured.IllegalArgumentException: It isn't a CDC hudi table on s3://bucket/prefix

Dakarun avatar May 19 '25 14:05 Dakarun

I did manage to get it running using a filesystem read, i.e.

# Incremental read in CDC format directly from the table path; "from" and
# "to" are placeholders for actual begin/end instant times (yyyyMMddHHmmss).
spark.read.format("hudi") \
    .option("hoodie.datasource.query.type", "incremental") \
    .option("hoodie.datasource.query.incremental.format", "cdc") \
    .option("hoodie.datasource.read.begin.instanttime", "from") \
    .option("hoodie.datasource.read.end.instanttime", "to") \
    .load("s3://bucket/prefix/")

So it does seem the writers started producing CDC files. Reads through SQL still fail, however.

Dakarun avatar May 19 '25 16:05 Dakarun

Hi @Dakarun

Enabling CDC on existing tables is not directly supported. The recommended approach is to create a new table from the existing data, then drop the original table and rename the newly created one.

rangareddy avatar May 30 '25 12:05 rangareddy

Thanks @rangareddy. Do you happen to know whether I should expect the current "workaround" (i.e. the filesystem-based read from my previous comment) to stop working in future versions?

Dakarun avatar May 30 '25 18:05 Dakarun

Hi @Dakarun

To migrate a non-CDC table to a CDC table, please follow these steps (a sketch follows the list):

  1. Create a new Hudi table with hoodie.table.cdc.enabled=true.
  2. Copy (rewrite) the data from the old table to the new table.
  3. Use the new table for future writes and CDC queries.
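
A minimal PySpark sketch of those steps; the paths, table name, and key fields are placeholders for your actual schema:

# 1. Read the full contents of the existing (non-CDC) table and drop the
#    Hudi metadata columns so they are regenerated on the rewrite.
old_df = spark.read.format("hudi").load("s3://bucket/old-table/")
data_df = old_df.drop(*[c for c in old_df.columns if c.startswith("_hoodie_")])

# 2. Rewrite into a new table created with CDC enabled from the start.
data_df.write.format("hudi") \
    .option("hoodie.table.name", "my_table_cdc") \
    .option("hoodie.datasource.write.recordkey.field", "id") \
    .option("hoodie.datasource.write.precombine.field", "ts") \
    .option("hoodie.datasource.write.operation", "bulk_insert") \
    .option("hoodie.table.cdc.enabled", "true") \
    .option("hoodie.table.cdc.supplemental.logging.mode", "DATA_BEFORE_AFTER") \
    .mode("overwrite") \
    .save("s3://bucket/new-table/")

# 3. Point future writers and CDC readers at the new path, then retire the
#    old table once validated.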

rangareddy avatar Aug 26 '25 11:08 rangareddy