datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

POC: Try to optimize `columnized_output_exprs`

Open goldmedal opened this issue 1 year ago • 1 comments

Which issue does this PR close?

Just a POC for https://github.com/apache/datafusion/issues/13015#issuecomment-2424113317

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

goldmedal avatar Oct 19 '24 17:10 goldmedal

The latest commit doesn't make the performance better. Just clean up the codebase.

goldmedal avatar Oct 20 '24 15:10 goldmedal

I may be misreading this, but it seems like columnize_output_expr is called in a loop for each expression being projected:

https://github.com/apache/datafusion/blob/ac827abe1b66b1dfa02ce65ae857477f68667843/datafusion/expr/src/logical_plan/builder.rs#L1505-L1511

However, then for each expr it builds up the same hash table: https://github.com/apache/datafusion/blob/ac827abe1b66b1dfa02ce65ae857477f68667843/datafusion/expr/src/utils.rs#L879-L883

Maybe we could refactor the code so the hash table is built once rather than once per expression? Something like

let columizer = Columnizer(plan);
for e in exprs {
  columnizer.columnize(e)
}

🤔

coupled with your changes to avoid hash map might make a significant difference in planning

alamb avatar Oct 24 '24 11:10 alamb

coupled with your changes to avoid hash map might make a significant difference in planning

Thanks, @alamb. I'll check it in a few days. If it makes sense, I will do the benchmark again 👍

goldmedal avatar Oct 25 '24 07:10 goldmedal

However, then for each expr it builds up the same hash table:

https://github.com/apache/datafusion/blob/ac827abe1b66b1dfa02ce65ae857477f68667843/datafusion/expr/src/utils.rs#L879-L883

It doesn't work for this case because the hash tables are different. The case is

SELECT max(a1), max(a2), ... max(a200) from t

I think sharing the hash map may work in some cases like

SELECT max(a1), max(a1) + 1, max(a1) +2, ... from t

We only need to hash max(a1) once, and then look up the column reference many times. In physical_select_aggregates_from_200 case, we still need to hash every expression.

goldmedal avatar Oct 27 '24 14:10 goldmedal

Thanks for double checking @goldmedal

alamb avatar Oct 28 '24 11:10 alamb

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Dec 28 '24 01:12 github-actions[bot]