
PyIceberg scan using custom Python filters

bigluck opened this issue 1 year ago · 0 comments

Feature Request / Improvement

Ciao all,

I'm looking for a way to work around the limited set of filters supported by PyIceberg without triggering an out-of-memory error on my running instance (see #170).

A user, for example, can query by id = 12 (that's fine, it's supported), but she can also compose a complex query like id = 12 OR LOG(a_number) = 12345 (this filter does not make much sense, but it gives an idea of what a complex filter looks like).

In this case, I'm forced to rewrite her query into the broader id = 12 OR NOT a_number IS NULL for the scan, and once I have the resulting Arrow table, I filter it again using the user's actual filters.

tmp_res = table.scan(
    row_filter='id = 12 or NOT a_number IS NULL',
    selected_fields=('id', 'a_number')
).to_arrow()
res = apply_users_filters(
    table=tmp_res,
    filters='id = 12 or LOG(a_number) = 12345',
)

But this is risky, especially if the user is scanning a multi-TB table and, for some reason, the columns she's filtering on are strings.

As a temporary workaround, is it possible to extend the scan & project_table functions to support arbitrary Python lambda functions invoked every time a new file is scanned to filter out all the unnecessary records before merging them into the final table?

bigluck · Apr 09 '24 07:04