
First commit on supporting parquet

Open sapienza88 opened this issue 11 months ago • 3 comments

Important Read

  • Please ensure the GitHub issue is mentioned at the beginning of the PR

What is the purpose of the pull request

This PR is a first attempt (building on a previous attempt) to add Parquet file support for syncing with XTable.

Brief change log

  • Added a schema extractor for Parquet (closely mirroring Avro's)
  • Added a table extractor that uses Parquet file metadata
  • Added a conversion source script

sapienza88 avatar Feb 15 '25 16:02 sapienza88

Thanks for working on the PR @unical1988, added comments.

There seems to be some confusion about extracting partition values; let me know what you think of this.

basePath/
    p1/..        (can be recursive partitions for parquet files)
    p2/..
    p3/..
    .hoodie/     (Hudi metadata)
    metadata/    (Iceberg metadata)
    _delta_log/  (Delta metadata)
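The layout above suggests that only some children of basePath are partition directories. A minimal sketch of scanning it might look like the following; the metadata directory names (`.hoodie`, `metadata`, `_delta_log`) come from the tree above, while the class and method names are hypothetical:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

public class PartitionDirScanner {
    // Table-format metadata directories that must not be treated as partitions.
    private static final Set<String> METADATA_DIRS =
            Set.of(".hoodie", "metadata", "_delta_log");

    /** Lists the candidate partition directories directly under basePath. */
    public static List<Path> partitionDirs(Path basePath) throws IOException {
        try (Stream<Path> children = Files.list(basePath)) {
            return children
                    .filter(Files::isDirectory)
                    .filter(p -> !METADATA_DIRS.contains(p.getFileName().toString()))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }
}
```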

To extract the partition fields (emphasis on fields here, not the actual values), we can do it in two ways:

  1. Assume the table is not partitioned. This would just sync the parquet files into the target formats using the physical paths you have extracted in one of the classes. When you read those tables, partition pruning won't work.
  2. Ask for user input (from YAML configuration) for the partition fields in the parquet file schema. Many of these analytical datasets are partitioned by date, either through an actual date column in the parquet file or through a timestamp field from which the date is derived.
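For option 2, the user-supplied partition fields might look roughly like this in the dataset YAML. This is an illustrative sketch only; the key names and transform values here are assumptions, not the actual XTable configuration schema:

```yaml
datasets:
  - tableBasePath: s3://bucket/events        # hypothetical table location
    tableName: events
    partitionSpec:                           # hypothetical key: user-declared partition fields
      - fieldName: event_ts                  # timestamp column in the parquet schema
        transformType: DAY                   # date is derived from the timestamp
      - fieldName: region
        transformType: VALUE                 # partition by the raw column value
```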

We would want to read the configuration (or the partition fields) into a Java object (if I am not wrong). p1/ could then be date (year, month, day), p2/ could be location, and p3/ could be ID. Given these fields, we could extract the partitionValues located at the related subdirectories for a specific parquet file. Is that correct? If yes, how could the Java object be defined?
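If the user-declared fields are ordered to match the directory nesting, the partition values for a file can be read straight off its path relative to basePath. A hedged sketch (class and method names are hypothetical, not XTable APIs):

```java
import java.nio.file.*;
import java.util.*;

public class PartitionValueExtractor {
    /**
     * Zips user-declared partition field names (outermost first) with the
     * directory components of a parquet file's path under basePath.
     * E.g. fields [year, month, day] and basePath/2025/03/14/xyz.parquet
     * yield {year=2025, month=03, day=14}.
     */
    public static Map<String, String> extract(Path basePath, Path parquetFile,
                                              List<String> partitionFields) {
        Path rel = basePath.relativize(parquetFile);
        // The last component is the file itself; the rest are partition directories.
        int dirCount = rel.getNameCount() - 1;
        if (dirCount != partitionFields.size()) {
            throw new IllegalArgumentException(
                    "Expected " + partitionFields.size() + " partition dirs, found " + dirCount);
        }
        Map<String, String> values = new LinkedHashMap<>();
        for (int i = 0; i < dirCount; i++) {
            values.put(partitionFields.get(i), rel.getName(i).toString());
        }
        return values;
    }
}
```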

sapienza88 avatar Feb 24 '25 19:02 sapienza88

Regarding option 2 above (user input for the partition fields via YAML configuration), the Java object could be defined as:
public class InputPartitionColumn {
    String fieldName;                     // partition column in the parquet file schema
    PartitionTransformType transformType; // e.g. raw value vs. date derived from a timestamp
}

InputPartitionKeyConfig should be part of the Table object in DatasetConfig.

1. No transform -> the values for the partition keys in the parquet file are concatenated and the partitionPath is generated. Configure this in the InternalTable object.
2. Transform -> timestamp -> transform(timestamp) -> year/month/day/xyz.parquet
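The two cases above could be sketched as follows. This is a minimal illustration, assuming a UTC year/month/day path format and a hypothetical helper class; the real transform types and path conventions would come from the XTable codebase:

```java
import java.time.*;
import java.time.format.DateTimeFormatter;
import java.util.*;

public class PartitionPathBuilder {
    /** Case 1: no transform — the raw partition key values are concatenated. */
    static String fromValues(List<String> values) {
        return String.join("/", values);
    }

    /** Case 2: timestamp transform — transform(timestamp) -> year/month/day. */
    static String fromTimestamp(Instant timestamp) {
        return DateTimeFormatter.ofPattern("yyyy/MM/dd")
                .withZone(ZoneOffset.UTC)   // a zone is required to format an Instant
                .format(timestamp);
    }
}
```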

vinishjail97 avatar Mar 12 '25 23:03 vinishjail97


@vinishjail97 I made a slight change to the proposed class InputPartitionColumn in the latest commit, pls check it out!

sapienza88 avatar Mar 14 '25 02:03 sapienza88