
Batch ParquetDataset by event rather than by file

Open OscarBarreraGithub opened this issue 1 year ago • 0 comments

The _calculate_sizes function in the ParquetDataset class, which determines the number of events in each batch, computes each batch's size from the lengths of the files in that batch. Batches are therefore defined per file, not per event.

It would be useful to batch by event rather than by file, so that high-energy events (which have many rows per event) can be processed without manually chunking the .parquet files beforehand. The .parquet format is well suited to handling large files anyway.

I am working on a fix by updating the way query_table batches events.
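To make the idea concrete, here is a minimal sketch of event-level batching, assuming only that the number of events in each file is known up front. The function name `batch_by_event` and its parameters are hypothetical and are not part of graphnet; the sketch does not reflect how `_calculate_sizes` or `query_table` are actually implemented.

```python
from typing import List, Tuple

# Hypothetical helper, not part of graphnet: plans batches with a fixed
# number of events, letting a batch span file boundaries when needed.
def batch_by_event(
    file_event_counts: List[int], events_per_batch: int
) -> List[List[Tuple[int, int, int]]]:
    """Return batches as lists of (file_index, start_event, stop_event) slices."""
    if events_per_batch <= 0:
        raise ValueError("events_per_batch must be positive")

    batches: List[List[Tuple[int, int, int]]] = []
    current: List[Tuple[int, int, int]] = []
    room = events_per_batch  # events still needed to fill the current batch

    for file_index, n_events in enumerate(file_event_counts):
        start = 0
        while start < n_events:
            # Take as many events as fit, without reading past the file's end.
            take = min(room, n_events - start)
            current.append((file_index, start, start + take))
            start += take
            room -= take
            if room == 0:
                batches.append(current)
                current = []
                room = events_per_batch

    if current:
        # The last batch may hold fewer events than requested.
        batches.append(current)
    return batches


# Two files with 5 and 3 events, batched 4 events at a time.
plan = batch_by_event([5, 3], events_per_batch=4)
sizes = [sum(stop - start for _, start, stop in batch) for batch in plan]
```

With this kind of plan, each batch holds the same number of events regardless of how events are distributed across files, so a single large file no longer forces a single large batch.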

OscarBarreraGithub, Jul 15 '24 15:07