iceberg-rust

expose statistics in iceberg-datafusion for fast COUNT()

Open debugmiller opened this issue 7 months ago • 2 comments

Is your feature request related to a problem or challenge?

Running SELECT COUNT(1) through iceberg-datafusion results in a full table scan. This can be avoided by implementing ExecutionPlan::statistics. DataFusion does this for its built-in Parquet scanner by fetching statistics from the Parquet metadata when it constructs the ExecutionPlan. I was planning to implement this in a similar way (at least for tables without deletes) by iterating over the ManifestEntry values and summing their record_count fields. I have a draft PR but wanted to confirm this approach is acceptable before putting in the work to clean it up.
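To make the proposed summation concrete, here is a minimal sketch in plain Rust. It does not use the real iceberg crate types: EntryStatus, EntrySummary, and sum_live_record_counts are hypothetical stand-ins that keep only the fields needed for counting, and the sketch assumes every entry describes a data file. Entries marked as deleted in the manifest are skipped, since they no longer belong to the current snapshot.

```rust
/// Hypothetical stand-in for the status of a manifest entry.
/// The Iceberg spec defines the three states EXISTING, ADDED, and DELETED.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EntryStatus {
    Existing,
    Added,
    Deleted,
}

/// Hypothetical, stripped-down view of a manifest entry for a data file.
/// The real type carries far more metadata (path, partition, bounds, ...).
#[derive(Debug, Clone)]
pub struct EntrySummary {
    pub status: EntryStatus,
    pub record_count: u64,
}

/// Sums `record_count` over all live entries (those not marked `Deleted`).
/// Returns `None` if the sum overflows `u64`.
pub fn sum_live_record_counts(entries: &[EntrySummary]) -> Option<u64> {
    entries
        .iter()
        .filter(|e| e.status != EntryStatus::Deleted)
        .try_fold(0u64, |acc, e| acc.checked_add(e.record_count))
}
```

In an actual implementation the resulting total would be reported as the row count in the statistics returned by the scan's ExecutionPlan, so that the planner can answer the count from metadata alone.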

Describe the solution you'd like

COUNT(*) in DataFusion should not perform a table scan.

Willingness to contribute

I would be willing to contribute to this feature with guidance from the Iceberg Rust community

debugmiller avatar Jun 09 '25 18:06 debugmiller

Thanks @debugmiller for reporting this. I think it's feasible, but we need to take delete files into account. If there are equality deletes, this will not work.
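One way to respect this constraint is to report an exact count only when the snapshot has no delete files, and otherwise fall back to an inexact upper bound. The sketch below is an assumption about how that could look, not the project's implementation. RowCount is a local stand-in modeled on DataFusion's Precision enum (Exact, Inexact, Absent), and SnapshotTotals and classify_row_count are hypothetical names.

```rust
/// Local stand-in modeled on DataFusion's `Precision` enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RowCount {
    /// The value is the true number of rows.
    Exact(u64),
    /// The value is only an estimate (here: an upper bound).
    Inexact(u64),
    /// Nothing is known.
    Absent,
}

/// Hypothetical per-snapshot totals gathered while reading manifests.
#[derive(Debug, Clone)]
pub struct SnapshotTotals {
    /// Sum of `record_count` over live data files, if it could be computed.
    pub data_rows: Option<u64>,
    pub position_delete_files: usize,
    pub equality_delete_files: usize,
}

/// Decides how much trust the summed record count deserves.
pub fn classify_row_count(totals: &SnapshotTotals) -> RowCount {
    match totals.data_rows {
        None => RowCount::Absent,
        Some(rows) => {
            let has_deletes =
                totals.position_delete_files > 0 || totals.equality_delete_files > 0;
            if has_deletes {
                // Deletes can only remove rows, so the sum is an upper bound.
                RowCount::Inexact(rows)
            } else {
                RowCount::Exact(rows)
            }
        }
    }
}
```

Treating position deletes as inexact too is a conservative choice: several delete files can reference the same row, so subtracting their record counts is not guaranteed to give the true total. An optimization that answers COUNT from statistics should only apply to exact values, so tables with any delete files would still be scanned.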

liurenjie1024 avatar Jun 12 '25 10:06 liurenjie1024

@debugmiller Do you have your branch up somewhere? It would be great to have the TableProvider take statistics into account.

tonyalaribe avatar Oct 14 '25 21:10 tonyalaribe