PyIceberg scan using custom Python filters
Feature Request / Improvement
Hi all,
I'm looking for a way to work around the limited set of row filters supported by PyIceberg without raising an out-of-memory error on my running instance (see #170).
A user, for example, can query by `id = 12` (that's fine, it's supported), but she can also compose a complex query like `id = 12 OR LOG(a_number) = 12345` (this filter does not make much sense, but it gives an idea of complex filters).
In this case, I'm forced to relax her query into `id = 12 OR NOT a_number IS NULL`, and once I have the final Arrow table, I filter it by the user's actual predicate:
```python
# First, scan with a relaxed filter that PyIceberg can evaluate.
tmp_res = table.scan(
    row_filter='id = 12 OR NOT a_number IS NULL',
    selected_fields=('id', 'a_number'),
).to_arrow()

# Then apply the user's real filters on the materialized Arrow table.
res = apply_users_filters(
    table=tmp_res,
    filters='id = 12 OR LOG(a_number) = 12345',
)
```
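For reference, `apply_users_filters` is just post-hoc filtering on the already-materialized table. A minimal sketch with `pyarrow.compute`, hard-coding the example predicate instead of parsing the `filters` string, and assuming `LOG` means the natural logarithm (`pc.ln`):

```python
import pyarrow as pa
import pyarrow.compute as pc

def apply_users_filters(table: pa.Table, filters: str) -> pa.Table:
    # In reality `filters` would be parsed; here the example predicate
    # id = 12 OR LOG(a_number) = 12345 is hard-coded for illustration.
    mask = pc.or_(
        pc.equal(table.column('id'), 12),
        pc.equal(pc.ln(table.column('a_number')), 12345),
    )
    # Rows where the mask is null (e.g. a_number is null) are dropped.
    return table.filter(mask)
```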
But this is risky, especially if the user is scanning a multi-terabyte table and, for some reason, the columns she's filtering on are strings.
As a temporary workaround, would it be possible to extend the scan and project_table functions to accept an arbitrary Python callable, invoked every time a new file is scanned, that filters out unnecessary records before they are merged into the final table? A rough sketch of what I have in mind follows.
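This is only an illustration of the requested hook; the `row_filter_fn` keyword does not exist in PyIceberg today, and the predicate itself is the toy example from above:

```python
import pyarrow as pa
import pyarrow.compute as pc

# A per-batch predicate the scan would call for every scanned file,
# so rows are dropped before the per-file results are concatenated.
def user_predicate(batch: pa.RecordBatch) -> pa.BooleanArray:
    return pc.or_(
        pc.equal(batch['id'], 12),
        pc.equal(pc.ln(batch['a_number']), 12345),
    )

res = table.scan(
    row_filter='id = 12 OR NOT a_number IS NULL',
    selected_fields=('id', 'a_number'),
    # row_filter_fn=user_predicate,  # <- the requested hook (hypothetical)
).to_arrow()
```

In the meantime, if your PyIceberg version exposes a streaming reader on the scan (e.g. an Arrow record-batch reader), a similar effect can be approximated by filtering each batch as it arrives and concatenating only the surviving rows, which keeps peak memory bounded by the batch size rather than the full table.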