graphnet
Batch ParquetDataset by event rather than by file
The _calculate_sizes function in the ParquetDataset class, which computes the number of events in each batch, does so by appending the length of each file in the batch. Batch boundaries are therefore tied to file boundaries.
It would be useful to batch by event rather than by file, so that high-energy events (which have many rows per event) can be processed without manually chunking the .parquet file beforehand. The Parquet format is well suited to large files anyway.
I am working on a fix by updating the way query_table batches events.
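To illustrate the idea, here is a minimal sketch of event-based batching, assuming the number of events in each file is known up front. It is not graphnet code: iter_event_batches, batch_sizes, and the (file_index, start, stop) slice format are hypothetical names chosen for this example, and the real change to query_table may look quite different.

```python
from typing import Iterator, List, Tuple

# Hypothetical slice format: (file_index, first_event, stop_event), stop exclusive.
EventSlice = Tuple[int, int, int]


def iter_event_batches(
    events_per_file: List[int], batch_size: int
) -> Iterator[List[EventSlice]]:
    """Yield batches of at most `batch_size` events.

    A batch is a list of slices and may span file boundaries, so a large
    file is split across several batches instead of forming a single one.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")

    batch: List[EventSlice] = []
    room = batch_size  # events still needed to fill the current batch
    for file_index, n_events in enumerate(events_per_file):
        start = 0
        while start < n_events:
            take = min(room, n_events - start)
            batch.append((file_index, start, start + take))
            start += take
            room -= take
            if room == 0:
                yield batch
                batch, room = [], batch_size
    if batch:
        # Final, possibly smaller, batch.
        yield batch


def batch_sizes(events_per_file: List[int], batch_size: int) -> List[int]:
    """Number of events in each batch produced by iter_event_batches."""
    return [
        sum(stop - start for _, start, stop in batch)
        for batch in iter_event_batches(events_per_file, batch_size)
    ]
```

With this approach the batch sizes depend only on the total number of events and the requested batch size, not on how the events happen to be distributed across files. For example, batch_sizes([5, 2, 6], 4) gives [4, 4, 4, 1], where the second batch draws events from all three files.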