Ilya Cherkasov

Results 23 comments of Ilya Cherkasov

>mor read_optimized can use it. can i set spark-sql to use read_optimized to test it out?

Okay, so let's compare. For a clean experiment, I created two separate sessions for the queries below.

```
scala> spark.time({
     |   val df = spark.read
     |     .format("org.apache.hudi")
     |     .option("hoodie.datasource.query.type", "read_optimized")
     |     .load("s3://path/table/")
...
```

For snapshot: 441,483,112, query time 28141 ms. For read-optimized: 22,887,045, query time 26054 ms.

![read-optimized](https://github.com/apache/hudi/assets/892781/d61438ac-3792-4217-9b79-23783128def1)
![snapshot](https://github.com/apache/hudi/assets/892781/3d8d3326-8eb6-4a3a-88a7-0b46d27405e7)

```
scala> spark.time({
     |   val df = spark.read
     |     .format("org.apache.hudi")
     |     .option("hoodie.datasource.query.type", "read_optimized")
     |     .load("s3://table/")
     |...
```
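To put the snapshot and read-optimized figures quoted in this comment side by side, here is a small arithmetic sketch that can be pasted into the Scala REPL. The variable names are made up for illustration, and treating the two large numbers as row counts is an assumption, since the comment does not label them.

```scala
// Figures copied from the comment; what the two counts measure is an
// assumption (they are unlabeled in the original).
val snapshotCount = 441483112L
val readOptimizedCount = 22887045L
val snapshotMs = 28141L
val readOptimizedMs = 26054L

// How many times larger the snapshot count is than the read-optimized one.
val countRatio = snapshotCount.toDouble / readOptimizedCount

// Absolute and relative difference in wall-clock query time.
val timeSavedMs = snapshotMs - readOptimizedMs
val timeSavedPct = 100.0 * timeSavedMs / snapshotMs

println(f"count ratio: $countRatio%.1fx")
println(f"time saved: $timeSavedMs ms ($timeSavedPct%.1f%%)")
```

The counts differ by roughly a factor of 19, while the query times differ by only about two seconds.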

```WholeStageCodegen (1) duration: total (min, med, max) 13.4 m (79 ms, 1.5 s, 3.4 s)``` for snapshot. ```WholeStageCodegen (1) duration: total (min, med, max) 6.5 m (249 ms, 552...

@KnightChess Did I understand you correctly: are you claiming that bloom filters actually work correctly?

How can we confirm that the difference is not caused by the read-optimized and snapshot paths themselves, setting aside any bloom filters on indexes?

I.e., is it caused by the RO reader just reading different files?
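One way to probe this is to compare which files each query type actually returns rows from. The following is a rough sketch, not a verified procedure: it relies on Hudi's `_hoodie_file_name` meta column, the object and helper names are hypothetical, and the table path is passed in as an argument. It only sees files that contributed rows to the result, and for records served from log files in a MOR snapshot read the column may not name the log file itself.

```scala
import org.apache.spark.sql.SparkSession

object CompareHudiFileSets {
  // Distinct values of the _hoodie_file_name meta column for one query type.
  // Hypothetical helper, written for illustration only.
  def fileNames(spark: SparkSession, path: String, queryType: String): Set[String] =
    spark.read
      .format("org.apache.hudi")
      .option("hoodie.datasource.query.type", queryType)
      .load(path)
      .select("_hoodie_file_name")
      .distinct()
      .collect()
      .map(_.getString(0))
      .toSet

  def main(args: Array[String]): Unit = {
    val tablePath = args(0) // table base path, supplied by the caller
    val spark = SparkSession.builder().appName("compare-hudi-file-sets").getOrCreate()

    val ro = fileNames(spark, tablePath, "read_optimized")
    val snap = fileNames(spark, tablePath, "snapshot")

    // Files that produced rows under one query type but not the other.
    println(s"only in read_optimized: ${(ro -- snap).size}")
    println(s"only in snapshot: ${(snap -- ro).size}")
    println(s"in both: ${(ro intersect snap).size}")

    spark.stop()
  }
}
```

If the two sets differ substantially, the gap between the two runs would be at least partly explained by the readers touching different files rather than by bloom filters.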

>What do you think about,

TBH, a bit of mixed emotions here. With 0.14 there is practically no way of understanding how indexing or statistical means are affecting queries apart...

>when number of output rows with bloom is clearly lot less than number of output rows without bloom.

@ad1happy2go The query performance is the same for both the RO and snapshot cases,...