Delta Kernel Draft PR
Important Read
- Please ensure the GitHub issue is mentioned at the beginning of the PR
What is the purpose of the pull request
(For example: This pull request implements the sync for delta format.)
Brief change log
(for example:)
- Fixed JSON parsing error when persisting state
- Added unit tests for schema evolution
Verify this pull request
(Please pick either of the following options)
This pull request is a trivial rework / code cleanup without any test coverage.
(or)
This pull request is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
(example:)
- Added integration tests for end-to-end.
- Added TestConversionController to verify the change.
- Manually verified the change by running a job locally.
@vaibhavk1992 can you write up a summary of next steps and blockers for this feature?
Below is a summary of the differences between the two schema responses (Delta Log vs. Delta Kernel), along with the differences that still remain between the two.
Comparison of Schema Responses: Delta Kernel vs Delta Log
This document outlines the differences in schema responses when using Delta Kernel and Delta Log APIs to retrieve changes in a Delta table. The comparison highlights the structure and format of the responses, providing insights into how the two approaches differ.
Delta Kernel Schema Response
When using the DeltaKernelIncrementalChangesState class to retrieve changes, each change is returned as a row backed by a columnar batch. Each row is an object implementing the io.delta.kernel.data.Row interface, which provides methods to access individual fields. The response is minimal and focuses on the raw data representation.
Sample Output
1 row is an object ==> io.delta.kernel.internal.data.ColumnarBatchRow@20c03e47
Key Characteristics
- Row Representation: Each row is an instance of ColumnarBatchRow, which provides methods to access fields, such as getLong and getString.
- Minimal Metadata: The response contains only the essential fields (e.g., version, timestamp, commitInfo).
- Raw Data: The schema is not enriched with additional metadata or actions; it is a direct representation of the data in the columnar batch.
Use Case
This format is suitable for low-level data processing where the focus is on performance and accessing raw data.
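The sample output above has the shape of Java's default Object.toString() (class name, '@', hash), so the field values only become visible through the typed accessors. The sketch below illustrates that idea with hypothetical stand-in types: SketchRow, ArrayRow, and describe are invented for this example and are not Delta Kernel classes, and the column layout (0 = version, 1 = timestamp, 2 = commitInfo flattened to a string) is an assumption made for brevity.

```java
// Hypothetical stand-ins for illustration only; these are NOT Delta Kernel classes.
class KernelRowSketch {

    /** Assumed subset of ordinal-based typed accessors, loosely modeled on a row interface. */
    public interface SketchRow {
        boolean isNullAt(int ordinal);
        long getLong(int ordinal);
        String getString(int ordinal);
    }

    /** Array-backed row that deliberately does not override toString(). */
    public static final class ArrayRow implements SketchRow {
        private final Object[] values;

        public ArrayRow(Object... values) {
            this.values = values;
        }

        @Override
        public boolean isNullAt(int ordinal) {
            return values[ordinal] == null;
        }

        @Override
        public long getLong(int ordinal) {
            return (Long) values[ordinal];
        }

        @Override
        public String getString(int ordinal) {
            return (String) values[ordinal];
        }
    }

    // Assumed layout for this sketch: 0 = version, 1 = timestamp, 2 = commitInfo (as a plain string).
    public static String describe(SketchRow row) {
        String version = row.isNullAt(0) ? "null" : Long.toString(row.getLong(0));
        String timestamp = row.isNullAt(1) ? "null" : Long.toString(row.getLong(1));
        String commitInfo = row.isNullAt(2) ? "null" : row.getString(2);
        return "version=" + version + ", timestamp=" + timestamp + ", commitInfo=" + commitInfo;
    }

    public static void main(String[] args) {
        SketchRow row = new ArrayRow(2L, 1755250246045L, "WRITE");
        System.out.println(row);           // default toString(): class name, '@', hash
        System.out.println(describe(row)); // values read through the typed accessors
    }
}
```

In the real interface, commitInfo would likely be a nested structure rather than a string; the flattening here only keeps the example short.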
Delta Log Schema Response
When using the DeltaLog.getChanges method, the response is a tuple containing the version number and a list of actions. The actions include detailed metadata about the changes, such as CommitInfo and AddFile.
Sample Output from delta table changes
(2,
Vector(
CommitInfo(None, 2025-08-15 15:00:46.05, None, None, WRITE, Map(mode -> Append, partitionBy -> []), None, None, None, Some(1), Some(Serializable), Some(true), Some(Map(numFiles -> 1, numOutputRows -> 50, numOutputBytes -> 10226)), None, None, Some(Apache-Spark/3.4.2 Delta-Lake/2.4.0), Some(cf7b1472-4c68-4f89-aa97-c8f16512ecfc)),
AddFile(part-00000-e8eeadc8-4e26-46a7-8c61-0bf60e5e7ada-c000.snappy.parquet, Map(), 10226, 1755250246045, true, {"numRecords":50,"minValues":{"id":51,"firstName":"0WI98","lastName":"08VkW","gender":"Female","birthDate":"2013-02-16T21:18:43.000+05:30","level":"ERROR","date_field":"2025-08-15","timestamp_field":"2025-08-15T15:00:45.884+05:30","double_field":0.018425752795049544,"float_field":0.109567106,"long_field":-8844008067348082419,"record_field":{"nested_int":-2060061976}},"maxValues":{"id":100,"firstName":"xZnER","lastName":"ymLQw","gender":"Male","birthDate":"2023-08-07T15:06:55.000+05:30","level":"WARN","date_field":"2025-08-15","timestamp_field":"2025-08-15T15:00:45.885+05:30","double_field":0.9914942463945434,"float_field":0.9841615,"long_field":8775924211265194460,"record_field":{"nested_int":1923869027}},"nullCount":{"id":0,"firstName":25,"lastName":23,"gender":0,"birthDate":0,"level":0,"boolean_field":28,"date_field":24,"timestamp_field":26,"double_field":25,"float_field":28,"long_field":25,"binary_field":32,"primitive_map":22,"record_map":25,"primitive_list":28,"record_list":29,"record_field":{"nested_int":28}}}, null, null)
))
Key Characteristics
- Rich Metadata: The response includes detailed metadata such as CommitInfo (e.g., operation type, timestamp, and user metadata) and AddFile (e.g., file path, size, and statistics).
- Structured Actions: Each action is represented as a specific object (e.g., CommitInfo, AddFile), making it easier to interpret the changes.
- Verbose Output: The response is more verbose, providing a comprehensive view of the changes.
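To make the "structured actions" point concrete, here is a minimal sketch of walking one (version, actions) entry and dispatching on the action type. The CommitInfo, AddFile, and VersionedActions classes below are hypothetical, heavily simplified stand-ins written for this example; the real action classes carry many more fields, as the sample output shows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified stand-ins; not the actual Delta action classes.
class DeltaLogActionsSketch {

    public static final class CommitInfo {
        final String operation;

        public CommitInfo(String operation) {
            this.operation = operation;
        }
    }

    public static final class AddFile {
        final String path;
        final long size;

        public AddFile(String path, long size) {
            this.path = path;
            this.size = size;
        }
    }

    /** One (version, actions) entry, mirroring the tuple shape described above. */
    public static final class VersionedActions {
        final long version;
        final List<Object> actions;

        public VersionedActions(long version, List<Object> actions) {
            this.version = version;
            this.actions = actions;
        }
    }

    /** Produces one readable line per action by checking its concrete type. */
    public static List<String> summarize(VersionedActions entry) {
        List<String> lines = new ArrayList<>();
        for (Object action : entry.actions) {
            if (action instanceof CommitInfo) {
                lines.add("v" + entry.version + " commit: " + ((CommitInfo) action).operation);
            } else if (action instanceof AddFile) {
                AddFile add = (AddFile) action;
                lines.add("v" + entry.version + " add: " + add.path + " (" + add.size + " bytes)");
            } else {
                lines.add("v" + entry.version + " other: " + action.getClass().getSimpleName());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Values taken from the sample output above (version 2, WRITE, one 10226-byte file).
        VersionedActions entry = new VersionedActions(2L, Arrays.<Object>asList(
            new CommitInfo("WRITE"),
            new AddFile("part-00000-e8eeadc8-4e26-46a7-8c61-0bf60e5e7ada-c000.snappy.parquet", 10226L)));
        for (String line : summarize(entry)) {
            System.out.println(line);
        }
    }
}
```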
This issue is currently blocked. I have raised it with the Delta team several times but have not received a response: https://delta-users.slack.com/archives/C04TRPG3LHZ/p1758730559515289
@vaibhavk1992 I pushed some changes in the latest commit to extract the add and remove files per version.
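As a rough illustration of what extracting add and remove files per version can look like (this is not the code from the commit), the sketch below groups file-level changes by version. The Change, Kind, and FileDiff types and the groupByVersion helper are hypothetical names invented for this example, and the flat list of changes is an assumed input shape.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: none of these types come from Delta Kernel or XTable.
class FilesPerVersionSketch {

    public enum Kind { ADD, REMOVE, OTHER }

    /** One action observed at a given table version (assumed, flattened shape). */
    public static final class Change {
        final long version;
        final Kind kind;
        final String path;

        public Change(long version, Kind kind, String path) {
            this.version = version;
            this.kind = kind;
            this.path = path;
        }
    }

    /** Files added and removed by a single version. */
    public static final class FileDiff {
        public final List<String> added = new ArrayList<>();
        public final List<String> removed = new ArrayList<>();
    }

    /**
     * Groups add/remove paths by version, in ascending version order.
     * Versions that contain only non-file actions do not appear in the result.
     */
    public static Map<Long, FileDiff> groupByVersion(List<Change> changes) {
        Map<Long, FileDiff> byVersion = new TreeMap<>();
        for (Change change : changes) {
            if (change.kind == Kind.OTHER) {
                continue; // e.g. commit metadata: no file to record
            }
            FileDiff diff = byVersion.computeIfAbsent(change.version, v -> new FileDiff());
            if (change.kind == Kind.ADD) {
                diff.added.add(change.path);
            } else {
                diff.removed.add(change.path);
            }
        }
        return byVersion;
    }

    public static void main(String[] args) {
        List<Change> changes = Arrays.asList(
            new Change(2L, Kind.OTHER, null),
            new Change(2L, Kind.ADD, "a.parquet"),
            new Change(3L, Kind.REMOVE, "a.parquet"),
            new Change(3L, Kind.ADD, "b.parquet"));
        for (Map.Entry<Long, FileDiff> e : groupByVersion(changes).entrySet()) {
            System.out.println("version " + e.getKey()
                + " added=" + e.getValue().added + " removed=" + e.getValue().removed);
        }
    }
}
```

A TreeMap is used here so that versions come back in commit order, which is usually what an incremental sync wants to replay.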
@the-other-tim-brown @vinishjail97 @rahil-c All the comments have been addressed and the build is passing too. Please review for the final merge.
@vaibhavk1992 make sure to fill out the PR template
@vaibhavk1992 Thanks for pushing on this. I took a brief scan and the PR looks close to landing. My main question is around testing: did you happen to test conversion end to end using your Delta Kernel source to Hudi or to Iceberg?
I believe the IT class ITConversionController registers the sources in order to trigger conversions from different sourceFormats to different targetFormats, see https://github.com/apache/incubator-xtable/blob/main/xtable-core/src/test/java/org/apache/xtable/ITConversionController.java#L191.
I am wondering whether we should add an IT flow similar to the class above to confirm that Delta Kernel works correctly end to end with other target formats. At the very least, I wanted to know whether you did some manual testing to confirm that you can convert to an Iceberg or Hudi table using your Delta Kernel source. I think we can also add this end-to-end IT as a fast follow in a separate PR.
@rahil-c As discussed, ITConversionController will be part of the follow-up PR. I have also resolved your 2 comments. If things look good, can you and @the-other-tim-brown please approve the changes?