Implement predicate pruning for `like` expressions
cc @alamb
This is very clever -- I will review it tomorrow
@alamb can these stats be truncated? I know stats in pages truncate large strings, e.g. if the min value is "B", could the actual min value be "BA"? If so, I think this approach may not work at all. Imagine we have a row group with data ["BA", "ZD"], which generates min/max stats ["B", "Z"]. Now we want to know if col LIKE '%A%' is possible. Clearly the answer should be yes, but if we convert it to the predicate form we get 'B' <= '' AND '' <= 'Z', which gives false 😞. I think this could be resolved by truncating the stats to the same length as the prefix?
@adriangb in theory I think parquet statistics can be truncated.
Now we want to know if col LIKE '%A%'
I don't think we can use statistics for substring match -- we can only use statistics for equality and prefix matching
so, e.g., col LIKE 'A%'
The predicate would be transformed into 'B' <= 'A' AND 'A' <= 'Z' which I do think is correct
Consider the values ["ABC", "XYZ"] with stats ["AB", "XY"] and the filter col like 'A%'. This becomes 'AB' <= 'A' AND 'A' <= 'XY', which is false, but we need true. To fix this we'd need to truncate the stats to the length of the filter prefix, giving 'A' <= 'A' AND 'A' <= 'XY', which produces the right result. '%A%' is just the obvious failure case because the prefix is '', which clearly breaks the comparison.
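To make that concrete, here is a rough Rust sketch (illustrative only, not the code in this PR) of a prefix check that truncates the stats before comparing. It assumes the min/max statistics remain valid lower/upper bounds for every value in the row group even after the writer truncated them:

/// Illustrative sketch only: could a row group whose string stats are
/// `min`/`max` contain a value starting with `prefix`? The stats are
/// truncated to the prefix length first, so that truncated statistics
/// (e.g. "AB" stored for a value "ABC") cannot cause a false negative.
fn prefix_might_match(min: &str, max: &str, prefix: &str) -> bool {
    let k = prefix.chars().count();
    let trunc = |s: &str| -> String { s.chars().take(k).collect() };
    // Any value that starts with `prefix` truncates to exactly `prefix`,
    // so it must sort between the truncated min and max.
    trunc(min).as_str() <= prefix && prefix <= trunc(max).as_str()
}

fn main() {
    // Values ["ABC", "XYZ"] with truncated stats ["AB", "XY"], filter col LIKE 'A%':
    assert!(prefix_might_match("AB", "XY", "A"));
    // An empty prefix (e.g. from '%A%') can never prune anything:
    assert!(prefix_might_match("B", "Z", ""));
}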
Okay @alamb I pushed a pretty big rework. Lots of new test cases, lots of comments explaining what's going on. I removed the not like part; I'm thinking this is complex enough as is and most of the benefit (maybe even in ClickBench?) will come from like. We can tackle not like, ilike and not ilike in future PRs. Especially since it's going to be important to evaluate each test case carefully.
I will note that I am a bit concerned about the interaction of truncated stats and how we apply these filters. Take the (absurd) case of stats that were truncated so that all you have is "","". You basically know nothing about the data, there could be anything in there. Yet col = 'A' transforms into '' <= 'A' and 'A' <= '' which is false. Substitute in non-truncated stats and 'A' <= 'A' and 'A' <= 'Z' is true. I added a test case for this behavior on the existing code. It doesn't differentiate between "ABC" truncated to "" and "" actually being the min string but it shows the behavior which would be the same in both cases.
This is important because for the case of a max stat of "A" and a filter col like 'A_': if the stat might have been truncated from "AB" I need to let it pass; if I know for a fact that "A" is the max string in the column I can indeed reject the column.
Argh the only ClickBench query this maybe could improve is 23:
WHERE "Title" LIKE '%Google%' AND "URL" NOT LIKE '%.google.%'
Since the like starts with a wildcard this won't help. Maybe we can get not like to do something smart there, tbd...
I am wondering if simple like patterns are not already converted to startswith etc., such that pruning is already applied, or whether we need to implement that case instead of like?
Good question. Based on performance of some queries I saw I'd say no, but it's worth double checking. Any suggestions as to a definitive, easy way to check? I guess I can run datafusion-cli against a very large parquet file in object storage (high latency) with a query that should be able to filter (col like 'A%') and one that can't?
I don't see where startswith or any other transformations (lower, upper, etc.) are handled in the pruning transformation.
I would take a look at the physical plan of queries involving like first, to see if it still uses like or is transformed to another function.
I made a big parquet file as follows:
import random
import string
import polars as pl

# 1,000,000 rows, each "A" followed by 1,000 random ASCII letters (~1 GB uncompressed)
df = pl.DataFrame({'col': ["A" + "".join(random.choices(string.ascii_letters, k=1_000)) for _ in range(1_000_000)]})
df.write_parquet('data.parquet', compression='uncompressed')
This came out to ~1GB. I then uploaded it to a GCS bucket.
I ran queries col = 'Z' and col like 'Z' against it and got 2s and 23s respectively. IMO that means it's not getting pushed down.
The explain plans reflect that as well:
ParquetExec: file_groups={10 groups: [[data.parquet:0..100890471], [data.parquet:100890471..201780942], [data.parquet:201780942..302671413], [data.parquet:302671413..403561884], [data.parquet:403561884..504452355], ...]}, projection=[col], predicate=col@0 = Z, pruning_predicate=CASE WHEN col_null_count@2 = col_row_count@3 THEN false ELSE col_min@0 <= Z AND Z <= col_max@1 END, required_guarantees=[col in (Z)], metrics=[output_rows=0, elapsed_compute=10ns, predicate_evaluation_errors=0, bytes_scanned=19368790, row_groups_pruned_bloom_filter=0, row_groups_pruned_statistics=3, pushdown_rows_filtered=0, page_index_rows_filtered=0, row_groups_matched_statistics=0, row_groups_matched_bloom_filter=0, file_scan_errors=0, file_open_errors=0, num_predicate_creation_errors=0, time_elapsed_scanning_until_data=18.748µs, time_elapsed_opening=7.717746249s, time_elapsed_processing=64.457827ms, page_index_eval_time=10.134µs, pushdown_eval_time=20ns, time_elapsed_scanning_total=19.21µs]
ParquetExec: file_groups={10 groups: [[data.parquet:0..100890471], [data.parquet:100890471..201780942], [data.parquet:201780942..302671413], [data.parquet:302671413..403561884], [data.parquet:403561884..504452355], ...]}, projection=[col], predicate=col@0 LIKE Z, metrics=[output_rows=1000000, elapsed_compute=10ns, predicate_evaluation_errors=0, bytes_scanned=1006955145, row_groups_pruned_bloom_filter=0, row_groups_pruned_statistics=0, pushdown_rows_filtered=0, page_index_rows_filtered=0, row_groups_matched_statistics=0, row_groups_matched_bloom_filter=0, file_scan_errors=0, file_open_errors=0, num_predicate_creation_errors=0, time_elapsed_scanning_until_data=49.346124581s, time_elapsed_opening=2.18377s, time_elapsed_processing=1.545583231s, page_index_eval_time=20ns, pushdown_eval_time=20ns, time_elapsed_scanning_total=49.654700084s]
The like query also has:
FilterExec: col@0 LIKE Z, metrics=[output_rows=0, elapsed_compute=1.878551ms]
So it doesn't seem like it's being transformed into another expression. It probably would be smart to do so as a general optimization outside of pruning.
I also think pruning should handle whatever that produces (startswith in the case of like 'A%' or = in the case of like 'A') as well as additional simple cases like upper(), lower(), etc.
That certainly makes sense to me
Perhaps we could implement some sort of simplification in https://github.com/apache/datafusion/blob/main/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs for LIKE into starts_with https://docs.rs/datafusion/latest/datafusion/functions/expr_fn/fn.starts_with.html (though since it is a function figuring out how to make the rewrite work might be tricky)
Then we can implement the rules in pruning predicate for starts_with 🤔
At most we could simplify the col like 'A%' case, but we can't simplify col like 'A%B', so I think it's still worth it to implement pruning rewrites for both.
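For illustration, a small Rust sketch of the prefix extraction this relies on (the name and shape are hypothetical, not the PR's actual helper): everything before the first wildcard is a usable prefix, and a leading wildcard leaves nothing to prune on:

/// Hypothetical helper: extract the literal prefix of a LIKE pattern,
/// i.e. everything before the first wildcard. Escape sequences are
/// ignored here for simplicity.
fn like_prefix(pattern: &str) -> Option<String> {
    let prefix: String = pattern
        .chars()
        .take_while(|c| *c != '%' && *c != '_')
        .collect();
    // A leading wildcard means there is no prefix, so min/max pruning
    // is impossible and we should generate no conditions at all.
    if prefix.is_empty() { None } else { Some(prefix) }
}

fn main() {
    assert_eq!(like_prefix("A%"), Some("A".to_string()));  // could become starts_with / =
    assert_eq!(like_prefix("A%B"), Some("A".to_string())); // prefix still usable for pruning
    assert_eq!(like_prefix("%A%"), None);                  // bail out entirely
}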
Do you have any thoughts on my concerns for possibly truncated stats, in particular how = may even be wrong as of today if the stats are truncated enough?
I think it’s fine to support like for now and leave the further combination / optimization for future work. I see that only simplifying to starts_with won't get all the benefits.
Of course, the pruning needs to be correct :). We could add (negative) test cases / add issues if the already implemented logic for = pruning is incorrect.
Well, there are no tests for hypothetical cases with truncated stats. All of the tests are against the stats themselves, with no indication of how those are meant to correspond to the original data. There were no unit tests of Utf8 filtering at all as far as I can tell.
The current implementation of = is certainly not wrong in the real world, but I'm not sure if that's because it's not used in situations where stats are truncated, or because truncation only happens at extremes (like a 10MB value) where practically it's not a problem, etc.
I suggest we (I can help tomorrow):
- File a ticket to simplify `like` to `=` and `starts_with` when possible (to help follow-on optimizations like this)
- File / find a ticket about testing with truncated statistics
- Determine what, if anything, is left to add directly to the pruning predicate
Sounds good.
All of that said, I think this PR is currently as correct as = and has pretty good test coverage. Does it need to wait on those tasks or can it proceed in parallel?
@alamb I re-arranged some of the comments on assertions in https://github.com/apache/datafusion/pull/12978/commits/ae3426da4b19ff9e8adf038547b6a4e552190f1f which I feel like helped a lot with readability of the tests. There's a couple other tests with a similar pattern that I think could benefit.
I was also thinking about doing some more black-box testing: given any min, max you can always convert that into a RecordBatch with an array of the form [min, max], and the pruning should never say the array can be excluded when the array actually has matches. Does that make sense? Maybe this could even be fuzz tested?
We could also randomly generate an array with an element or elements within the min max range. This would cover not only the logic at the edges, but also values inside the range.
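A rough sketch of the property such a fuzz test could assert (hypothetical helper names, plain strings instead of a real RecordBatch, and not DataFusion's PruningPredicate API), restricted to pure prefix patterns 'P%':

/// Soundness property: pruning may only say "skip" when no value in the
/// container actually matches the prefix.
fn pruning_never_skips_matches(values: &[&str], prefix: &str) -> bool {
    let min = *values.iter().min().unwrap();
    let max = *values.iter().max().unwrap();
    // Conservative range check: truncate the stats to the prefix length first.
    let k = prefix.chars().count();
    let trunc = |s: &str| -> String { s.chars().take(k).collect() };
    let can_skip = !(trunc(min).as_str() <= prefix && prefix <= trunc(max).as_str());
    let has_match = values.iter().any(|v| v.starts_with(prefix));
    !(can_skip && has_match)
}

fn main() {
    // A fuzzer would generate random values and prefixes; a couple of spot checks:
    assert!(pruning_never_skips_matches(&["ABC", "XYZ"], "A"));
    assert!(pruning_never_skips_matches(&["BA", "ZD"], "Q"));
}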
Some sqllogictests failed and this is expected. @adriangb can you please update them with `cargo test --test sqllogictests -- --complete`?
You will need to merge with main to reproduce the failures (the failing tests are new).
When you do so, you should get a diff like this
@@ -510,7 +510,7 @@ physical_plan
01)CoalesceBatchesExec: target_batch_size=8192
02)--FilterExec: binary_col@0 LIKE %a% AND largebinary_col@1 LIKE %a% AND binaryview_col@2 LIKE %a%
03)----RepartitionExec: partitioning=RoundRobinBatch(2), input_partitions=1
-04)------ParquetExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/binary_as_string.parquet]]}, projection=[binary_col, largebinary_col, binaryview_col], predicate=binary_col@0 LIKE %a% AND largebinary_col@1 LIKE %a% AND binaryview_col@2 LIKE %a%
+04)------ParquetExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/binary_as_string.parquet]]}, projection=[binary_col, largebinary_col, binaryview_col], predicate=binary_col@0 LIKE %a% AND largebinary_col@1 LIKE %a% AND binaryview_col@2 LIKE %a%, pruning_predicate=CASE WHEN binary_col_null_count@2 = binary_col_row_count@3 THEN false ELSE binary_col_min@0 <= AND <= binary_col_max@1 END AND CASE WHEN largebinary_col_null_count@6 = largebinary_col_row_count@7 THEN false ELSE largebinary_col_min@4 <= AND <= largebinary_col_max@5 END AND CASE WHEN binaryview_col_null_count@10 = binaryview_col_row_count@11 THEN false ELSE binaryview_col_min@8 <= AND <= binaryview_col_max@9 END, required_guarantees=[]
Thoughts:
- for a predicate LIKE '%a%' we should not generate additional conditions at all. There is no min/max predicate we can practically derive from a LIKE '%a%' expression.
- the U+10FFFF character is problematic to read and to work with. Some tools don't handle it properly. For example, when pasting into iTerm with python3 running for inspection, it got displayed as \U+0FFFF even though the code point is 0x10FFFF.
Some sqllogictests failed and this is expected. @adriangb can you please update them with `cargo test --test sqllogictests -- --complete`? You will need to merge with main to reproduce the failures (the failing tests are new). When you do so, you should get a diff like this
Thanks I was wondering how to update that. When I run it I get various errors along the lines of:
Execution error: Error completing "string/string_view.slt": failed to rename file from test_files/string/init_data.slt.part.temp to test_files/string/./init_data.slt.part
Execution error: Error completing "string/large_string.slt": failed to rename file from test_files/string/string_query.slt.part.temp to test_files/string/./string_query.slt.part
Execution error: Error completing "string/dictionary_utf8.slt": failed to rename file from test_files/string/string_query.slt.part.temp to test_files/string/./string_query.slt.part
Any idea what's going on?
Yes -- https://github.com/apache/datafusion/issues/12752
but don't worry. see https://github.com/apache/datafusion/pull/12978#issuecomment-2441489968 . this test case shouldn't change after all
Having other issues now. I tried in 46531f5bf7be7a59255fc8723f0be3efd3beb908 but got a bunch of errors from missing files. I tried to follow what CI is doing, but when I got to compiling the databricks repo I hit a missing malloc.h error, so I stopped there since I wasn't even sure I was barking up the right tree. Is this the expected workflow for updating these snapshots? It might be helpful if failed CI runs offered them as a downloadable artifact.
for a predicate LIKE '%a%' we should not generate additional conditions at all. There is no min/max predicate we can practically derive from a LIKE '%a%' expression
By the way I added a special case for the wildcard being the first character (no prefix) to bail out and add no filtering
Looks like the special case on '%...' actually eliminated the diff in those tests, so no need to help me figure out how to update them now. I'll re-request review from both of you. Thank you for your help so far.
This is on my list to review today
I didn't make it here today -- it is the top of my list for tomorrow
Thanks for the review Andrew. I'll address your comments tomorrow and hopefully we can merge this week!
Let's open an issue to track a fuzz tester for predicate pushdown.
@alamb @findepi I think I've accepted or applied all of your excellent feedback, and I added some coverage / more test cases to the increment_utf8 function 😄
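For readers following along, a simplified sketch of the idea behind increment_utf8 (not the exact code in the PR): bump the last character of the prefix to the next valid Unicode scalar value, so the result can serve as an exclusive upper bound for col LIKE 'prefix%', i.e. prefix <= col AND col < increment(prefix):

/// Simplified sketch: produce a string greater than every string that
/// starts with `prefix`, by bumping its last character. Returns None if
/// no such bound exists (every character is already at the maximum).
fn increment_utf8(prefix: &str) -> Option<String> {
    let mut chars: Vec<char> = prefix.chars().collect();
    while let Some(last) = chars.pop() {
        // Try the next Unicode scalar value after `last`; from_u32 returns
        // None inside the surrogate range, in which case we jump to U+E000.
        if let Some(next) = char::from_u32(last as u32 + 1).or_else(|| char::from_u32(0xE000)) {
            if next > last {
                chars.push(next);
                return Some(chars.into_iter().collect());
            }
        }
        // `last` could not be bumped (it was char::MAX): drop it and try
        // the previous character instead.
    }
    None
}

fn main() {
    assert_eq!(increment_utf8("A").as_deref(), Some("B"));
    assert_eq!(increment_utf8("Az").as_deref(), Some("A{"));
    assert_eq!(increment_utf8("A\u{10FFFF}").as_deref(), Some("B"));
}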