Support source/sink for plain Parquet/ORC/Avro Tables

Open anoopj opened this issue 2 years ago • 9 comments

Supporting plain Parquet/ORC/Avro (partitioned as well as unpartitioned) may be useful for "upgrading" legacy data to table formats. Sink may be useful for exporting a specific snapshot for interoperability reasons.

This feature is lower priority, as Iceberg, Delta, etc. have native support for metadata-only conversions and offer Spark procedures.

anoopj avatar Nov 03 '23 00:11 anoopj

@anoopj what would the metadata look like for a sink export?

I like the idea of a generic bootstrap so that users could take existing data and try out all 3 formats if they want to do some testing with other tools.

the-other-tim-brown avatar Nov 04 '23 22:11 the-other-tim-brown

> @anoopj what would the metadata look like for a sink export?

Sink could be based on manifest files in SymlinkTextInputFormat. BigQuery also now supports manifest files.

> I like the idea of a generic bootstrap so that users could take existing data and try out all 3 formats if they want to do some testing with other tools.

Yes, bootstrap is probably higher priority than sink.

anoopj avatar Nov 06 '23 16:11 anoopj
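
For readers unfamiliar with the manifest idea mentioned above: a SymlinkTextInputFormat-style manifest is a plain text file that lists one data file path per line. The following is a minimal sketch of writing such a file for one snapshot. It is not XTable code; the function name `write_symlink_manifest`, the output file name `manifest`, and the example paths are assumptions made for illustration.

```python
from pathlib import Path
from typing import Iterable


def write_symlink_manifest(data_files: Iterable[str], manifest_dir: str) -> Path:
    """Write a text manifest with one data file path per line.

    Hypothetical helper: the name and the "manifest" file name are
    illustrative choices, not part of any XTable API.
    """
    out_dir = Path(manifest_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    manifest = out_dir / "manifest"
    # Sort and de-duplicate so the output is deterministic for a given snapshot.
    paths = sorted(set(data_files))
    manifest.write_text("".join(f"{p}\n" for p in paths), encoding="utf-8")
    return manifest
```

For a partitioned table, one manifest per partition directory would likely be needed so that engines can still prune partitions; that detail is left out of the sketch.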

@jackwener any interest in looking into something like this?

the-other-tim-brown avatar Dec 20 '23 03:12 the-other-tim-brown

@the-other-tim-brown I'm trying to find a good first issue to ramp up on XTable. Can I take a look at this one? Perhaps we can split it into different issues. One initial task could be to add support for the Parquet input data format, for example. I'm not sure what the code looks like, but ultimately we can create something modular enough to extend to Avro or other formats later, if it hasn't been done already. I would be interested in discussing possible approaches to filling in the partitioning and statistics info...

However, just to check that I understand the scenario correctly: if today I wanted to bootstrap 2 different systems, Hudi and Iceberg, with existing Parquet files, couldn't I use the native capabilities of either system for an initial import, and then use the current XTable to generate the metadata files for the remaining system?

marqub avatar Apr 30 '24 14:04 marqub

> @the-other-tim-brown I'm trying to find a good first issue to ramp up on XTable. Can I take a look at this one? Perhaps we can split it into different issues. One initial task could be to add support for the Parquet input data format, for example. I'm not sure what the code looks like, but ultimately we can create something modular enough to extend to Avro or other formats later, if it hasn't been done already. I would be interested in discussing possible approaches to filling in the partitioning and statistics info...

I think it makes sense to start with just one of the file formats like Parquet. We can discuss how to get the info you would need.

> However, just to check that I understand the scenario correctly: if today I wanted to bootstrap 2 different systems, Hudi and Iceberg, with existing Parquet files, couldn't I use the native capabilities of either system for an initial import, and then use the current XTable to generate the metadata files for the remaining system?

Yes, you could do that as well.

There is another issue I had my eye on that I could guide you through as well if you are interested: https://github.com/apache/incubator-xtable/issues/411

the-other-tim-brown avatar May 01 '24 05:05 the-other-tim-brown
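
On the partitioning question raised above, one possible starting point is to infer partition values from the directory layout. The sketch below assumes Hive-style `key=value` directory names, which the thread does not confirm as the intended approach; the function name `partition_values` and the example path are hypothetical. Column statistics are not covered here and would have to come from the Parquet file footers.

```python
from pathlib import PurePosixPath
from typing import Dict
from urllib.parse import unquote


def partition_values(relative_path: str) -> Dict[str, str]:
    """Infer partition columns from a data file path relative to the table root.

    Assumes a Hive-style layout such as year=2023/month=11/part-0000.parquet.
    Hypothetical helper for illustration, not part of XTable.
    """
    values: Dict[str, str] = {}
    # Only directory segments are inspected; the file name itself is skipped.
    for segment in PurePosixPath(relative_path).parent.parts:
        key, sep, value = segment.partition("=")
        if sep and key:
            # Hive percent-encodes special characters in partition values.
            values[key] = unquote(value)
    return values
```

A file directly under the table root yields an empty mapping, which would correspond to the unpartitioned case mentioned in the issue description.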

> I think it makes sense to start with just one of the file formats like Parquet. We can discuss how to get the info you would need.

> > However, just to check that I understand the scenario correctly: if today I wanted to bootstrap 2 different systems, Hudi and Iceberg, with existing Parquet files, couldn't I use the native capabilities of either system for an initial import, and then use the current XTable to generate the metadata files for the remaining system?

> Yes, you could do that as well.

OK, if you agree that we want to move away from this workaround approach, then I think supporting Parquet is a good first issue for me to smooth the learning curve.

> There is another issue I had my eye on that I could guide you through as well if you are interested: #411

OK, this one could be a good next step, but for now I prefer to limit the amount of novelty.

I should have some time to start on the Parquet issue next week. How do you prefer to communicate? Is there a Slack channel?

marqub avatar May 02 '24 08:05 marqub

@marqub we do not have a Slack set up for the project yet. I can shoot you an email to connect and discuss any of the details in the meantime.

the-other-tim-brown avatar May 04 '24 01:05 the-other-tim-brown

Hi, is someone working on it? I am new to this project and would like to get started.

Reactor11 avatar Oct 10 '24 05:10 Reactor11

> Hi, is someone working on it? I am new to this project and would like to get started.

@Reactor11 there is a similar effort for a Parquet file source that is being worked on: https://github.com/apache/incubator-xtable/issues/553

the-other-tim-brown avatar Oct 16 '24 01:10 the-other-tim-brown