Batch Ingestion from Delta Table

Open athultr1997 opened this issue 3 years ago • 2 comments

Currently, Pinot cannot perform batch ingestion from a Delta table. This would be a valuable feature, since the project is open source and has many users.

We could build a new RecordReader implementation for Delta tables using the delta-standalone library.

athultr1997, Jul 21 '22 16:07

Yes, this would be a great feature to add. cc: @xiangfu0

mayankshriv, Jul 21 '22 17:07

I just read a bit about the delta lib. A simple flow might look like the one below: open the Delta table with the library and loop through all the records. The library also supports data filtering, which could later be exposed as an advanced ingestion option.

import io.delta.standalone.DeltaLog;
import io.delta.standalone.Snapshot;
import io.delta.standalone.data.CloseableIterator;
import io.delta.standalone.data.RowRecord;
import org.apache.hadoop.conf.Configuration;

// Open the Delta table and take its latest snapshot.
DeltaLog log = DeltaLog.forTable(new Configuration(), "/data/sales");
Snapshot snapshot = log.update();
CloseableIterator<RowRecord> dataIter = snapshot.open();

try {
    while (dataIter.hasNext()) {
        // Each element is a Delta RowRecord, which could then be converted to a Pinot GenericRow.
        RowRecord row = dataIter.next();
        int year = row.getInt("year");
        String customer = row.getString("customer");
        float totalCost = row.getFloat("total_cost");
    }
} finally {
    // Release the underlying file handles.
    dataIter.close();
}
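
To make the reader idea a bit more concrete, here is a rough, stdlib-only sketch of how the loop above could be wrapped behind a hasNext/next/rewind/close shape. Everything in it is a hypothetical stand-in: `ClosingIterator` plays the role of the iterator returned by `Snapshot.open()`, `SketchRecordReader` is not Pinot's actual RecordReader interface, and plain maps stand in for both RowRecord and GenericRow. It only illustrates the wrapping and rewind logic, under the assumption that reopening the table is an acceptable way to rewind.

```java
import java.io.Closeable;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical stand-in for an iterator that must be closed after use. */
interface ClosingIterator<T> extends Iterator<T>, Closeable {
    @Override
    void close(); // narrowed to throw nothing, to keep the sketch short
}

/** Hypothetical reader shape; NOT Pinot's real RecordReader interface. */
final class SketchRecordReader implements Closeable {
    private final Supplier<ClosingIterator<Map<String, Object>>> opener;
    private ClosingIterator<Map<String, Object>> current;

    SketchRecordReader(Supplier<ClosingIterator<Map<String, Object>>> opener) {
        this.opener = opener;
        this.current = opener.get();
    }

    boolean hasNext() {
        return current.hasNext();
    }

    /** Copies the source row into a fresh map (the stand-in for a GenericRow). */
    Map<String, Object> next() {
        return new LinkedHashMap<>(current.next());
    }

    /** Rewinds by closing the current iterator and reopening the source. */
    void rewind() {
        current.close();
        current = opener.get();
    }

    @Override
    public void close() {
        current.close();
    }
}

final class DeltaReaderSketch {
    /** Builds one illustrative row using the column names from the example above. */
    static Map<String, Object> row(int year, String customer, float totalCost) {
        Map<String, Object> r = new LinkedHashMap<>();
        r.put("year", year);
        r.put("customer", customer);
        r.put("total_cost", totalCost);
        return r;
    }

    /** In-memory source used here in place of a real Delta snapshot. */
    static ClosingIterator<Map<String, Object>> inMemory(List<Map<String, Object>> rows) {
        final Iterator<Map<String, Object>> it = rows.iterator();
        return new ClosingIterator<Map<String, Object>>() {
            @Override public boolean hasNext() { return it.hasNext(); }
            @Override public Map<String, Object> next() { return it.next(); }
            @Override public void close() { /* nothing to release in memory */ }
        };
    }

    /** Drains the reader from its current position into a list. */
    static List<Map<String, Object>> readAll(SketchRecordReader reader) {
        List<Map<String, Object>> out = new ArrayList<>();
        while (reader.hasNext()) {
            out.add(reader.next());
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> rows = new ArrayList<>();
        rows.add(row(2021, "acme", 10.5f));
        rows.add(row(2022, "globex", 20.0f));

        SketchRecordReader reader = new SketchRecordReader(() -> inMemory(rows));
        System.out.println("first pass: " + readAll(reader).size() + " rows");
        reader.rewind();
        System.out.println("second pass: " + readAll(reader).size() + " rows");
        reader.close();
    }
}
```

In a real reader, the supplier would presumably open the Delta snapshot as in the snippet above, and `next()` would copy each column into a Pinot GenericRow based on the table schema; those parts are left out here because they depend on the actual Pinot and delta-standalone APIs.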

klsince, Jul 26 '22 17:07