iceberg icon indicating copy to clipboard operation
iceberg copied to clipboard

API: Fix estimated row count in ContentScanTask

Open wypoon opened this issue 3 years ago • 2 comments

#5720 added estimatedRowsCount to ScanTask and provided an implementation in ContentScanTask. For a ScanTask that scans the entire data file, we can do better for this estimate, and return the row count from the metadata. I added a parameterized test, that fails without this fix, for the parquet and orc format (it passes for the avro format). Instead of returning a count of 10000 for the number of rows (we wrote 10000 rows), it returns 9998 for parquet and 9937 for orc. With the fix, the row count is 10000 for all 3 formats. In case a file is bigger than the read split size, then there is no escaping underestimating the row count, because whenever the scannedFileFraction < 1.0, there is rounding down when multiplying the row count by the fraction and casting to long.

Note: This bug is already present in #4446.

wypoon avatar Sep 13 '22 18:09 wypoon

@aokolnychyi please look. Also, if you don't mind, I'd like to change estimatedRowsCount to estimatedRowCount, as that is more idiomatic. Row count is the count of rows. If you agree, I'll update this PR to make that change as well. cc @singhpk234

wypoon avatar Sep 13 '22 23:09 wypoon

@rdblue would you mind taking a look? This is a simple fix.

wypoon avatar Sep 20 '22 16:09 wypoon

@stevenzwu congratulations on becoming a committer! (I read it in Ryan Blue's board report draft.) Once you have your credentials, can you please merge this? Thanks!

wypoon avatar Oct 12 '22 19:10 wypoon

thank @wypoon for the contribution and @singhpk234 for the review.

stevenzwu avatar Oct 12 '22 19:10 stevenzwu