[SUPPORT] Looking for guidance on enabling CDC on an existing hudi table
Tips before filing an issue
- Have you gone through our FAQs? -> Page 404s
- Join the mailing list to engage in conversations and get faster support at [email protected].
- If you have triaged this as a bug, then file an issue directly.
Describe the problem you faced
We recently upgraded to Hudi 0.15, and there are a number of tables I'd like to enable the CDC feature on, on a go-forward basis. So far I've tried:
- Adding `hoodie.table.cdc.enabled=true` in my write properties (roughly as sketched below)
- Updating the `.hoodie/hoodie.properties` file to include the `hoodie.table.cdc.enabled=true` and `hoodie.table.cdc.supplemental.logging.mode=DATA_BEFORE_AFTER` key/values.
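For concreteness, the write-side attempt looks roughly like this; the table name, the `df` DataFrame, and the S3 path are placeholders rather than our real config:

```python
# Hypothetical sketch of the writer config; names and paths are placeholders.
hudi_options = {
    "hoodie.table.name": "my_table",  # placeholder table name
    "hoodie.datasource.write.operation": "upsert",
    # The CDC flags in question:
    "hoodie.table.cdc.enabled": "true",
    "hoodie.table.cdc.supplemental.logging.mode": "DATA_BEFORE_AFTER",
}

# df is an existing DataFrame of incoming records
(df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://bucket/prefix/"))
```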
I've been unable to find a migration strategy for enabling CDC on existing Hudi tables in the documentation or blog, and the documentation search appears to be broken, so I may have missed something. Would someone be able to provide guidance on the recommended way to migrate an existing table to enable CDC? Is the only path to rewrite the entire table with a bulk insert?
To Reproduce
Steps to reproduce the behavior:
- Have a non-CDC Hudi table (ours were created with Hudi 0.12)
- Perform a write with the CDC flags turned on, or update `.hoodie/hoodie.properties` and then perform the writes
- Read the table via `hudi_table_changes` (a query of the shape sketched below)
- Hit `pyspark.errors.exceptions.captured.IllegalArgumentException: It isn't a CDC hudi table on s3://bucket/prefix`
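For reference, the failing read is a Spark SQL call of roughly this shape; the path and begin instant are placeholders, and `hudi_table_changes` is the table-valued function Hudi ships for incremental/CDC queries:

```python
# Hypothetical reproduction of the failing read; bucket/prefix is a placeholder.
spark.sql(
    "SELECT * FROM hudi_table_changes('s3://bucket/prefix', 'cdc', 'earliest')"
).show()
```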
Expected behavior
CDC data is provided for any writes that have happened since the setting was enabled.
Environment Description
- Hudi version : 0.15.0
- Spark version : 3.5.2
- Hive version : Glue Catalog
- Hadoop version : N/A
- Storage (HDFS/S3/GCS..) : S3
- Running on Docker? (yes/no) : Yes
Additional context
Performing the tests with an AWS Glue container.
Stacktrace
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/session.py", line 1631, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery, litArgs), self)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
File "/usr/lib/spark/python/pyspark/errors/exceptions/captured.py", line 185, in deco
raise converted from None
pyspark.errors.exceptions.captured.IllegalArgumentException: It isn't a CDC hudi table on s3://bucket/prefix```
I did manage to get it running using a filesystem read (with "from"/"to" standing in for actual commit instant times), i.e.
```python
spark.read.format("hudi") \
    .option("hoodie.datasource.query.type", "incremental") \
    .option("hoodie.datasource.query.incremental.format", "cdc") \
    .option("hoodie.datasource.read.begin.instanttime", "from") \
    .option("hoodie.datasource.read.end.instanttime", "to") \
    .load("s3://bucket/prefix/")
```
So it does seem the writers started producing CDC files. Reads through SQL still fail, however.
Hi @Dakarun
Implementing CDC on existing tables is not directly feasible. The recommended approach is to create a new table based on the existing data, then drop the original table and rename the newly created table.
Thanks @rangareddy. Do you happen to know if I should expect the current "workaround" (e.g. my previous comment about a filesystem-based read) to stop working in future versions?
Hi @Dakarun
To migrate a non-CDC table to a CDC table, please follow these steps:
- Create a new Hudi table with `hoodie.table.cdc.enabled=true`.
- Copy (rewrite) the data from the old table to the new table (a sketch follows this list).
- Use the new table for future writes and CDC queries.
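A minimal PySpark sketch of those steps, assuming a plain copy via bulk insert; the table name, record key, precombine field, and paths are hypothetical and would need to match the real table's config:

```python
# 1) Read the latest snapshot of the existing (non-CDC) table.
existing_df = spark.read.format("hudi").load("s3://bucket/prefix/")

# Drop Hudi's meta columns so the new writer repopulates them cleanly.
payload_df = existing_df.drop(
    *[c for c in existing_df.columns if c.startswith("_hoodie_")]
)

# 2) Rewrite the data into a new table created with CDC enabled from its
#    first commit. Record key / precombine field below are assumptions.
(payload_df.write.format("hudi")
    .option("hoodie.table.name", "my_table_cdc")               # hypothetical
    .option("hoodie.datasource.write.recordkey.field", "id")   # assumption
    .option("hoodie.datasource.write.precombine.field", "ts")  # assumption
    .option("hoodie.datasource.write.operation", "bulk_insert")
    .option("hoodie.table.cdc.enabled", "true")
    .option("hoodie.table.cdc.supplemental.logging.mode", "DATA_BEFORE_AFTER")
    .mode("overwrite")
    .save("s3://bucket/prefix_cdc/"))

# 3) Point writers and CDC readers at the new table going forward.
```

After the cutover, `hudi_table_changes` / CDC reads against the new table should work for all commits made after its creation, since the table was CDC-enabled from the start.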