Expose statistics in iceberg-datafusion for fast COUNT()
Is your feature request related to a problem or challenge?
Running SELECT COUNT(1) through iceberg-datafusion currently results in a full table scan. This can be avoided by implementing ExecutionPlan::statistics. DataFusion does this for its built-in Parquet scanner by reading the statistics from the Parquet metadata when constructing the ExecutionPlan. I would like to implement this in a similar way (at least for tables without deletes) by iterating over the ManifestEntry values and summing their record_count fields. I have a draft PR, but I wanted to confirm that this approach is acceptable before putting in the work to clean it up.
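To make the idea concrete, here is a rough sketch of the summation. It uses simplified stand-in types, not the real iceberg-rust API: `ContentType`, `Entry`, and `exact_row_count` are hypothetical names that only mirror the two pieces of information needed from each live manifest entry (what kind of file it tracks and its record count). The sketch gives up and returns `None` as soon as it sees any delete file, which matches the "tables without deletes" restriction above.

```rust
/// Simplified stand-in for the kind of file a manifest entry tracks.
/// (Hypothetical local type, not the iceberg-rust enum.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ContentType {
    Data,
    PositionDeletes,
    EqualityDeletes,
}

/// Simplified stand-in for a live manifest entry.
/// (Hypothetical local type, not the iceberg-rust struct.)
#[derive(Debug, Clone)]
struct Entry {
    content: ContentType,
    record_count: u64,
}

/// Sums the record counts of all data files.
///
/// Returns `None` if any delete file is present (the sum would then
/// overcount) or if the sum overflows `u64`.
fn exact_row_count(entries: &[Entry]) -> Option<u64> {
    let mut total: u64 = 0;
    for entry in entries {
        match entry.content {
            ContentType::Data => total = total.checked_add(entry.record_count)?,
            // Any delete file means the plain sum is no longer exact.
            ContentType::PositionDeletes | ContentType::EqualityDeletes => return None,
        }
    }
    Some(total)
}
```

In a real implementation the resulting value would presumably feed the row-count field of whatever ExecutionPlan::statistics returns, and a `None` would fall back to the normal scan.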
Describe the solution you'd like
COUNT(*) in DataFusion should not perform a table scan.
Willingness to contribute
I would be willing to contribute to this feature with guidance from the Iceberg Rust community
Thanks @debugmiller for reporting this. I think it's feasible, but we need to take delete files into account. If equality deletes exist, this will not work.
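One way to handle the delete-file concern, sketched under assumptions: report the summed record count as exact only when the snapshot has no delete files at all, and otherwise downgrade it to an inexact upper bound (deletes can only remove rows). The `RowCountEstimate` enum below is a local stand-in that loosely mirrors the exact/inexact distinction in DataFusion's statistics, and `FileCounts` and `row_count_estimate` are hypothetical names for illustration. As far as I understand, DataFusion only answers COUNT from statistics when the value is exact, so the inexact case would still scan.

```rust
/// Local stand-in for an exact-versus-estimated row count.
/// (Hypothetical type; it only loosely mirrors DataFusion's notion of precision.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RowCountEstimate {
    /// No delete files: the sum of data file record counts is the true count.
    Exact(u64),
    /// Delete files exist: the sum is only an upper bound.
    Inexact(u64),
}

/// Hypothetical summary of a snapshot's live files.
#[derive(Debug, Clone, Copy)]
struct FileCounts {
    /// Sum of record counts over all data files.
    data_rows: u64,
    /// Number of position delete files.
    position_delete_files: usize,
    /// Number of equality delete files.
    equality_delete_files: usize,
}

/// Decides how much to trust the summed record count.
fn row_count_estimate(counts: &FileCounts) -> RowCountEstimate {
    let has_deletes = counts.position_delete_files > 0 || counts.equality_delete_files > 0;
    if has_deletes {
        // Deleted rows are still included in `data_rows`, so it overcounts.
        RowCountEstimate::Inexact(counts.data_rows)
    } else {
        RowCountEstimate::Exact(counts.data_rows)
    }
}
```

Treating position deletes the same as equality deletes is the conservative choice here; whether position deletes could be subtracted to recover an exact count is a separate question this sketch does not attempt to answer.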
@debugmiller Do you have your branch up somewhere? It would be great to have the TableProvider take statistics into account.