Implement predicate pruning for `like` expressions
cc @alamb
This is very clever -- I will review it tomorrow
@alamb can these stats be truncated? I know stats in pages truncate large strings, e.g. if the min value is "B", could the actual min value be "BA"? If so, I think this approach may not work at all. Imagine we have a row group with data ["BA", "ZD"], which generates min/max stats ["B", "Z"]. Now we want to know if col LIKE '%A%' is possible. Clearly the answer should be yes, but if we convert it to the predicate form we get 'B' <= '' AND '' <= 'Z', which gives false 😞. I think this could be resolved by truncating the stats to the same length as the prefix?
@adriangb in theory I think parquet statistics can be truncated.
Now we want to know if col LIKE '%A%'
I don't think we can use statistics for substring match -- we can only use statistics for equality and prefix matching
so, e.g., col LIKE 'A%'
The predicate would be transformed into 'B' <= 'A' AND 'A' <= 'Z' which I do think is correct
Consider the values ["ABC", "XYZ"] with stats ["AB", "XY"] and the filter col like 'A%'. This becomes 'AB' <= 'A' AND 'A' <= 'XY', which is false, but we need true. To fix this we'd need to truncate the stats to the length of the filter prefix, giving 'A' <= 'A' AND 'A' <= 'XY', which produces the right result. '%A%' is just the obvious failure case because the prefix is '', which clearly breaks the comparison.
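To make that concrete, here is a rough Rust sketch (illustrative only, not the code in this PR) of a prefix check that truncates the stats before comparing. It assumes the min/max statistics remain valid lower/upper bounds for every value in the row group even after the writer truncated them:

/// Illustrative sketch only: could a row group whose string stats are
/// `min`/`max` contain a value starting with `prefix`? The stats are
/// truncated to the prefix length first, so that truncated statistics
/// (e.g. "AB" stored for a value "ABC") cannot cause a false negative.
fn prefix_might_match(min: &str, max: &str, prefix: &str) -> bool {
    let k = prefix.chars().count();
    let trunc = |s: &str| -> String { s.chars().take(k).collect() };
    // Any value that starts with `prefix` truncates to exactly `prefix`,
    // so it must sort between the truncated min and max.
    trunc(min).as_str() <= prefix && prefix <= trunc(max).as_str()
}

fn main() {
    // Values ["ABC", "XYZ"] with truncated stats ["AB", "XY"], filter col LIKE 'A%':
    assert!(prefix_might_match("AB", "XY", "A"));
    // An empty prefix (e.g. from '%A%') can never prune anything:
    assert!(prefix_might_match("B", "Z", ""));
}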
Okay @alamb I pushed a pretty big rework. Lots of new test cases, lots of comments explaining what's going on. I removed the not like part; I'm thinking this is complex enough as is and most of the benefit (maybe even in ClickBench?) will come from like. We can tackle not like, ilike and not ilike in future PRs. Especially since it's going to be important to evaluate each test case carefully.
I will note that I am a bit concerned about the interaction of truncated stats and how we apply these filters. Take the (absurd) case of stats that were truncated so that all you have is "","". You basically know nothing about the data, there could be anything in there. Yet col = 'A' transforms into '' <= 'A' and 'A' <= '' which is false. Substitute in non-truncated stats and 'A' <= 'A' and 'A' <= 'Z' is true. I added a test case for this behavior on the existing code. It doesn't differentiate between "ABC" truncated to "" and "" actually being the min string but it shows the behavior which would be the same in both cases.
This is important because for the case of a max stat of "A" and a filter col like 'A_': if the stat might have been truncated from "AB" I need to let it pass; if I know for a fact that "A" is the max string in the column I can indeed reject the column.
Argh the only ClickBench query this maybe could improve is 23:
WHERE "Title" LIKE '%Google%' AND "URL" NOT LIKE '%.google.%'
Since the like starts with a wildcard this won't help. Maybe we can get not like to do something smart there, tbd...
I am wondering if simple like patterns are not already converted to startswith etc., such that pruning is already applied, or whether we need to implement that case instead of like?
Good question. Based on performance of some queries I saw I'd say no, but it's worth double checking. Any suggestions as to a definitive, easy way to check? I guess I can run datafusion-cli against a very large parquet file in object storage (high latency) with a query that should be able to filter (col like 'A%') and one that can't?
I don't see where startswith or any other transformations (lower, upper, etc.) are handled in the pruning transformation.
I would take a look at the physical plan of queries involving like first, to see if it still uses like or is transformed to another function.
I made a big parquet file as follows:
import random
import string
import polars as pl

# 1,000,000 rows, each "A" followed by 1,000 random ASCII letters (~1 GB uncompressed)
df = pl.DataFrame({'col': ["A" + "".join(random.choices(string.ascii_letters, k=1_000)) for _ in range(1_000_000)]})
df.write_parquet('data.parquet', compression='uncompressed')
This came out to ~1GB. I then uploaded it to a GCS bucket.
I ran queries col = 'Z' and col like 'Z' against it and got 2s and 23s respectively. IMO that means it's not getting pushed down.
The explain plans reflect that as well:
ParquetExec: file_groups={10 groups: [[data.parquet:0..100890471], [data.parquet:100890471..201780942], [data.parquet:201780942..302671413], [data.parquet:302671413..403561884], [data.parquet:403561884..504452355], ...]}, projection=[col], predicate=col@0 = Z, pruning_predicate=CASE WHEN col_null_count@2 = col_row_count@3 THEN false ELSE col_min@0 <= Z AND Z <= col_max@1 END, required_guarantees=[col in (Z)], metrics=[output_rows=0, elapsed_compute=10ns, predicate_evaluation_errors=0, bytes_scanned=19368790, row_groups_pruned_bloom_filter=0, row_groups_pruned_statistics=3, pushdown_rows_filtered=0, page_index_rows_filtered=0, row_groups_matched_statistics=0, row_groups_matched_bloom_filter=0, file_scan_errors=0, file_open_errors=0, num_predicate_creation_errors=0, time_elapsed_scanning_until_data=18.748µs, time_elapsed_opening=7.717746249s, time_elapsed_processing=64.457827ms, page_index_eval_time=10.134µs, pushdown_eval_time=20ns, time_elapsed_scanning_total=19.21µs]
ParquetExec: file_groups={10 groups: [[data.parquet:0..100890471], [data.parquet:100890471..201780942], [data.parquet:201780942..302671413], [data.parquet:302671413..403561884], [data.parquet:403561884..504452355], ...]}, projection=[col], predicate=col@0 LIKE Z, metrics=[output_rows=1000000, elapsed_compute=10ns, predicate_evaluation_errors=0, bytes_scanned=1006955145, row_groups_pruned_bloom_filter=0, row_groups_pruned_statistics=0, pushdown_rows_filtered=0, page_index_rows_filtered=0, row_groups_matched_statistics=0, row_groups_matched_bloom_filter=0, file_scan_errors=0, file_open_errors=0, num_predicate_creation_errors=0, time_elapsed_scanning_until_data=49.346124581s, time_elapsed_opening=2.18377s, time_elapsed_processing=1.545583231s, page_index_eval_time=20ns, pushdown_eval_time=20ns, time_elapsed_scanning_total=49.654700084s]
The like query also has:
FilterExec: col@0 LIKE Z, metrics=[output_rows=0, elapsed_compute=1.878551ms]
So it doesn't seem like it's being transformed into another expression. It probably would be smart to do so as a general optimization outside of pruning.
I also think pruning should handle whatever that produces (startswith in the case of like 'A%' or = in the case of like 'A') as well as additional simple cases like upper(), lower(), etc.
That certainly makes sense to me
Perhaps we could implement some sort of simplification in https://github.com/apache/datafusion/blob/main/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs for LIKE into starts_with https://docs.rs/datafusion/latest/datafusion/functions/expr_fn/fn.starts_with.html (though since it is a function figuring out how to make the rewrite work might be tricky)
Then we can implement the rules in pruning predicate for starts_with 🤔
At most we could simplify the col like 'A%' case, but we can't simplify col like 'A%B', so I think it's still worth it to implement pruning rewrites for both.
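For illustration, a small Rust sketch of the prefix extraction this relies on (the name and shape are hypothetical, not the PR's actual helper): everything before the first wildcard is a usable prefix, and a leading wildcard leaves nothing to prune on:

/// Hypothetical helper: extract the literal prefix of a LIKE pattern,
/// i.e. everything before the first wildcard. Escape sequences are
/// ignored here for simplicity.
fn like_prefix(pattern: &str) -> Option<String> {
    let prefix: String = pattern
        .chars()
        .take_while(|c| *c != '%' && *c != '_')
        .collect();
    // A leading wildcard means there is no prefix, so min/max pruning
    // is impossible and we should generate no conditions at all.
    if prefix.is_empty() { None } else { Some(prefix) }
}

fn main() {
    assert_eq!(like_prefix("A%"), Some("A".to_string()));  // could become starts_with / =
    assert_eq!(like_prefix("A%B"), Some("A".to_string())); // prefix still usable for pruning
    assert_eq!(like_prefix("%A%"), None);                  // bail out entirely
}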
Do you have any thoughts on my concerns for possibly truncated stats, in particular how = may even be wrong as of today if the stats are truncated enough?
I think it’s fine to support like for now and leave the further combination / optimization for future work. I see that only simplifying to starts_with won't get all the benefits.
Of course, the pruning needs to be correct :). We could add (negative) test cases / add issues if the already implemented logic for = pruning is incorrect.
Well, there are no tests for hypothetical cases with truncated stats. All of the tests are against the stats themselves, with no indication of how those are meant to correspond to the original data. There were no unit tests of Utf8 filtering at all as far as I can tell.
The current implementation of = is certainly not wrong in the real world, but I'm not sure if that's because it's not used in situations where stats are truncated, or because truncation only happens at extremes (like a 10MB value) where practically it's not a problem, etc.
I suggest we (I can help tomorrow):
- File a ticket to simplify `like` to `=` and `starts_with` when possible (to help follow-on optimizations like this)
- File / find a ticket about testing with truncated statistics
- Determine what, if anything, is left to add directly to the pruning predicate
Sounds good.
All of that said, I think this PR is currently as correct as = and has pretty good test coverage. Does it need to wait on those tasks or can it proceed in parallel?
@alamb I re-arranged some of the comments on assertions in https://github.com/apache/datafusion/pull/12978/commits/ae3426da4b19ff9e8adf038547b6a4e552190f1f which I feel like helped a lot with readability of the tests. There's a couple other tests with a similar pattern that I think could benefit.
I was also thinking about doing some more black-box testing: given any min, max you can always convert that into a RecordBatch with an array of the form [min, max], and the pruning should never say the array can be excluded when the array actually has matches. Does that make sense? Maybe this could even be fuzz tested?
We could also randomly generate an array with an element or elements within the min max range. This would cover not only the logic at the edges, but also values inside the range.
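A rough sketch of the property such a fuzz test could assert (hypothetical helper names, plain strings instead of a real RecordBatch, and not DataFusion's PruningPredicate API), restricted to pure prefix patterns 'P%':

/// Soundness property: pruning may only say "skip" when no value in the
/// container actually matches the prefix.
fn pruning_never_skips_matches(values: &[&str], prefix: &str) -> bool {
    let min = *values.iter().min().unwrap();
    let max = *values.iter().max().unwrap();
    // Conservative range check: truncate the stats to the prefix length first.
    let k = prefix.chars().count();
    let trunc = |s: &str| -> String { s.chars().take(k).collect() };
    let can_skip = !(trunc(min).as_str() <= prefix && prefix <= trunc(max).as_str());
    let has_match = values.iter().any(|v| v.starts_with(prefix));
    !(can_skip && has_match)
}

fn main() {
    // A fuzzer would generate random values and prefixes; a couple of spot checks:
    assert!(pruning_never_skips_matches(&["ABC", "XYZ"], "A"));
    assert!(pruning_never_skips_matches(&["BA", "ZD"], "Q"));
}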
Some sqllogictests failed and this is expected. @adriangb can you please update them with `cargo test --test sqllogictests -- --complete`?
You will need to merge with main to reproduce the failures (the failing tests are new).
When you do so, you should get a diff like this
@@ -510,7 +510,7 @@ physical_plan
01)CoalesceBatchesExec: target_batch_size=8192
02)--FilterExec: binary_col@0 LIKE %a% AND largebinary_col@1 LIKE %a% AND binaryview_col@2 LIKE %a%
03)----RepartitionExec: partitioning=RoundRobinBatch(2), input_partitions=1
-04)------ParquetExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/binary_as_string.parquet]]}, projection=[binary_col, largebinary_col, binaryview_col], predicate=binary_col@0 LIKE %a% AND largebinary_col@1 LIKE %a% AND binaryview_col@2 LIKE %a%
+04)------ParquetExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/binary_as_string.parquet]]}, projection=[binary_col, largebinary_col, binaryview_col], predicate=binary_col@0 LIKE %a% AND largebinary_col@1 LIKE %a% AND binaryview_col@2 LIKE %a%, pruning_predicate=CASE WHEN binary_col_null_count@2 = binary_col_row_count@3 THEN false ELSE binary_col_min@0 <= AND <= binary_col_max@1 END AND CASE WHEN largebinary_col_null_count@6 = largebinary_col_row_count@7 THEN false ELSE largebinary_col_min@4 <= AND <= largebinary_col_max@5 END AND CASE WHEN binaryview_col_null_count@10 = binaryview_col_row_count@11 THEN false ELSE binaryview_col_min@8 <= AND <= binaryview_col_max@9 END, required_guarantees=[]
Thoughts:
- for a predicate LIKE '%a%' we should not generate additional conditions at all. There is no min/max predicate we can practically derive from a LIKE '%a%' expression.
- the U+10FFFF character is problematic to read and to work with. Some tools don't handle it properly. For example, when pasting into iTerm with python3 running for inspection, it got displayed as \U+0FFFF even though the code point is 0x10FFFF.
Some sqllogictests failed and this is expected. @adriangb can you please update them with `cargo test --test sqllogictests -- --complete`? You will need to merge with main to reproduce the failures (the failing tests are new). When you do so, you should get a diff like this
Thanks I was wondering how to update that. When I run it I get various errors along the lines of:
Execution error: Error completing "string/string_view.slt": failed to rename file from test_files/string/init_data.slt.part.temp to test_files/string/./init_data.slt.part
Execution error: Error completing "string/large_string.slt": failed to rename file from test_files/string/string_query.slt.part.temp to test_files/string/./string_query.slt.part
Execution error: Error completing "string/dictionary_utf8.slt": failed to rename file from test_files/string/string_query.slt.part.temp to test_files/string/./string_query.slt.part
Any idea what's going on?
Yes -- https://github.com/apache/datafusion/issues/12752
but don't worry. see https://github.com/apache/datafusion/pull/12978#issuecomment-2441489968 . this test case shouldn't change after all
Having other issues now. I tried in 46531f5bf7be7a59255fc8723f0be3efd3beb908 but got a bunch of errors from missing files. I tried to follow what CI is doing, but when I got to compiling the databricks repo I hit a missing malloc.h error, so I stopped there since I wasn't even sure I was barking up the right tree. Is this the expected workflow for updating these snapshots? It might be helpful if failed CI runs offered them as a downloadable artifact.
for a predicate LIKE '%a%' we should not generate additional conditions at all. There is no min/max predicate we can practically derive from a LIKE '%a%' expression
By the way I added a special case for the wildcard being the first character (no prefix) to bail out and add no filtering
Looks like the special case on '%...' actually eliminated the diff in those tests, so no need to help me figure out how to update them now. I'll re-request review from both of you. Thank you for your help so far.
This is on my list to review today
I didn't make it here today -- it is the top of my list for tomorrow
Thanks for the review Andrew. I'll address your comments tomorrow and hopefully we can merge this week!
Let's open an issue to track a fuzz tester for predicate pushdown.
@alamb @findepi I think I've accepted or applied all of your excellent feedback, and I added some coverage / more test cases to the increment_utf8 function 😄
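For readers following along, a simplified sketch of the idea behind increment_utf8 (not the exact code in the PR): bump the last character of the prefix to the next valid Unicode scalar value, so the result can serve as an exclusive upper bound for col LIKE 'prefix%', i.e. prefix <= col AND col < increment(prefix):

/// Simplified sketch: produce a string greater than every string that
/// starts with `prefix`, by bumping its last character. Returns None if
/// no such bound exists (every character is already at the maximum).
fn increment_utf8(prefix: &str) -> Option<String> {
    let mut chars: Vec<char> = prefix.chars().collect();
    while let Some(last) = chars.pop() {
        // Try the next Unicode scalar value after `last`; from_u32 returns
        // None inside the surrogate range, in which case we jump to U+E000.
        if let Some(next) = char::from_u32(last as u32 + 1).or_else(|| char::from_u32(0xE000)) {
            if next > last {
                chars.push(next);
                return Some(chars.into_iter().collect());
            }
        }
        // `last` could not be bumped (it was char::MAX): drop it and try
        // the previous character instead.
    }
    None
}

fn main() {
    assert_eq!(increment_utf8("A").as_deref(), Some("B"));
    assert_eq!(increment_utf8("Az").as_deref(), Some("A{"));
    assert_eq!(increment_utf8("A\u{10FFFF}").as_deref(), Some("B"));
}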