Ilya Cherkasov
> MOR read_optimized can use it.

Can I set spark-sql to use read_optimized to test it out?
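For context: with the DataFrame API you pass the query type as a read option (as the next message shows). For SQL, one option is to register the read-optimized DataFrame as a temp view; with Hive-synced MOR tables, Hudi also registers `_ro` (read-optimized) and `_rt` (snapshot) tables you can query directly. A minimal spark-shell sketch, with a placeholder path and view name:
```scala
// Minimal sketch: expose a read-optimized view of a MOR table to SQL.
// The S3 path is a placeholder; hoodie.datasource.query.type accepts
// "snapshot", "read_optimized", and "incremental".
val ro = spark.read
  .format("hudi")
  .option("hoodie.datasource.query.type", "read_optimized")
  .load("s3://path/table/")
ro.createOrReplaceTempView("my_table_ro")
spark.sql("SELECT count(*) FROM my_table_ro").show()
```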
Okay, so let's compare. For a clean experiment, I created two separate sessions for the queries below.
```
scala> spark.time({
     |   val df = spark.read
     |     .format("org.apache.hudi")
     |     .option("hoodie.datasource.query.type", "read_optimized")
     |     .load("s3://path/table/")
...
```
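For reference, a self-contained version of that comparison (the query body above is truncated, so the count() here is illustrative, not the original aggregation; the path is a placeholder):
```scala
// Time the same query under both query types; ideally run each in a
// fresh session, as above, to avoid caching effects.
def timeQuery(queryType: String): Unit = spark.time({
  val df = spark.read
    .format("org.apache.hudi")
    .option("hoodie.datasource.query.type", queryType)
    .load("s3://path/table/")
  println(s"$queryType: ${df.count()} rows")
})

timeQuery("read_optimized")
timeQuery("snapshot")
```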
Sure, but anything specific you want to see?
For snapshot: 441,483,112, query time 28,141 ms. For read-optimized: 22,887,045, query time 26,054 ms.
```
scala> spark.time({
     |   val df = spark.read
     |     .format("org.apache.hudi")
     |     .option("hoodie.datasource.query.type", "read_optimized")
     |     .load("s3://table/")
...
```
```
WholeStageCodegen (1)
duration: total (min, med, max)
13.4 m (79 ms, 1.5 s, 3.4 s)
```
for snapshot.
```
WholeStageCodegen (1)
duration: total (min, med, max)
6.5 m (249 ms, 552...
```
@KnightChess Did I understand you correctly: you are claiming that bloom filters actually work correctly?
How can we verify that the difference is not caused by the read-optimized and snapshot paths themselves, excluding any bloom filters on indexes? I.e., is it caused by the RO reader just reading different files?
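One way to check that from spark-shell: compare the file lists the two plans actually bind to. DataFrame.inputFiles is a best-effort Spark API and may not surface Hudi log files on the snapshot path, so treat the result as indicative and cross-check the scan node in the Spark UI. A sketch with a placeholder path:
```scala
// Compare which files back the RO scan vs. the snapshot scan.
def files(queryType: String): Set[String] = spark.read
  .format("hudi")
  .option("hoodie.datasource.query.type", queryType)
  .load("s3://path/table/")
  .inputFiles.toSet

val roFiles   = files("read_optimized")
val snapFiles = files("snapshot")
println(s"RO only: ${(roFiles -- snapFiles).size} files")
println(s"snapshot only: ${(snapFiles -- roFiles).size} files")
```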
> What do you think about

TBH, a bit of mixed emotions here. With 0.14 there is practically no way to understand how indexing or statistics are affecting queries, apart...
> when the number of output rows with bloom is clearly a lot less than the number of output rows without bloom.

@ad1happy2go The query performance is the same for both RO and snapshot cases,...
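One way to separate the bloom/column-stats question from the RO-vs-snapshot question is to hold the query type fixed and toggle data skipping only. A sketch assuming the standard Hudi read options hoodie.metadata.enable and hoodie.enable.data.skipping (verify your 0.14 build honors them); the path and the filter column are hypothetical:
```scala
// Same query type, data skipping on vs. off; compare timings and the
// scan metrics ("number of files read", "number of output rows") in the
// Spark UI SQL tab. "key_col = 'x'" is a hypothetical predicate.
def timedCount(skipping: Boolean): Unit = spark.time({
  val df = spark.read
    .format("hudi")
    .option("hoodie.datasource.query.type", "read_optimized")
    .option("hoodie.metadata.enable", "true")
    .option("hoodie.enable.data.skipping", skipping.toString)
    .load("s3://path/table/")
    .filter("key_col = 'x'")
  println(s"data skipping=$skipping: ${df.count()} rows")
})

timedCount(skipping = true)
timedCount(skipping = false)
```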