Add a ScalarUDFImpl::simplfy() API
Is your feature request related to a problem or challenge?
As part of porting arrow_cast https://github.com/apache/arrow-datafusion/issues/9287 and make_array https://github.com/apache/arrow-datafusion/issues/9288, it has become clear that some functions have special simplification semantics:
-
arrow_castsimplifies tocastof a particular data type. It is important for the rest of datafusion to understand theCastsemantics as the there are several special cases for Expr::Cast (such asunwrap_cast_in_comparison) -
make_arrayhas special rewrite rules to combine / fold witharray_appendandarray_prepaendin the simplifier (source link)
Also I think some systems may want to provide the ability to define UDFs that are more like macros (can be expressed in terms of other built in expressions), which needs some way for datafusion to "inline" the definition
Similarly, specialized functions (e.g replace regexp_match with a version that had the regexp pre-compiled ...) - https://github.com/apache/arrow-datafusion/issues/8051 sound very similar
Describe the solution you'd like
I propose adding a function such as the following, we could implement the simplifications for make_array and arrow_cast in the UDF (and not in the main simplify code):
/// Was the expression simplified?
enum Simplified {
/// The function call was simplified to an entirely new Expr
Rewritten(Expr),
/// the function call could not be simplified, and the arguments
/// are return unmodified
Original(Vec<Expr>)
}
pub trait ScalarUDFImpl {
...
/// Apply any function specific simplification rules
///
/// Some functions like arrow_cast have special semantic simplifications
/// (into `Expr::Cast` for example) that can improve planning.
///
/// If there is a simpler representation of calling this function with the
/// specified arguments, return `Simplified::Rewritten` with the simplification.
/// If no such simplification was possible, returns `Simplified::Original` with
/// the unmodified arguments (the default)
///
/// This function should only apply simplifications specific to this function.
/// DataFusion will automatically simplify the arguments with a variety
/// of rewrites during optimization
fn simplify(&self, args: Vec<Expr>) -> Result<Simplified> {
Ok(Simplified::Original(args)
}
...
}
Describe alternatives you've considered
There may be better
Additional context
No response
FYI @jayzhan211 -- I wonder if you have any thoughts about this ticket (or maybe have time to try implementing it).
I think a good test case would be to implement a udf function such as cast_to_int64 that was rewritten into cast(x as int64) or something
cc @universalmind303 and @thinkharderdev as I think you have been thinking about similar things
In this case, should we add a function-level optimizer trying to rewrite the function? I think it would be a large match expression. I have done a similar issue today #9213 . Maybe I could do parts of it.
In this case, should we add a function-level optimizer trying to rewrite the function? I think it would be a large match expression. I have done a similar issue today #9213 . Maybe I could do parts of it.
Yes, @Lordworms I think that is an excellent idea. cc @mustafasrepo for your thoughts
Note that https://github.com/apache/arrow-datafusion/issues/9213 is for an aggregate function, which I think would have a similar API though it would be on a different struct (AggreagateUDF)
In this case, should we add a function-level optimizer trying to rewrite the function? I think it would be a large match expression. I have done a similar issue today #9213 . Maybe I could do parts of it.
Yes, @Lordworms I think that is an excellent idea. cc @mustafasrepo for your thoughts
Note that #9213 is for an aggregate function, which I think would have a similar API though it would be on a different struct (
AggreagateUDF)
I think, function level optimizer would be much more scalable. Also, simplification logic will be more contained. Also, once we can bring similar changes in ScalarUDFImpl::simplfy to AggregateUDF for support of this feature.
@jayzhan211 -- I think this proposal is also likely to be needed to get make_array moved into datafusion-functions-array - https://github.com/apache/arrow-datafusion/issues/9288
@alamb How is simplify like for array concat operator rewrite? If we have MakeArrayUDF, args are for MakeArray only, but we need BinaryExpr that includes two array functions and 1 operator to start the simplification. It seems function level optimizer can only works for simplification like A function to B Expr / function.
simplify for arrow_cast is just trying to do actual casting x as i64 right?
@alamb How is
simplifylike for array concat operator rewrite? If we have MakeArrayUDF,argsare for MakeArray only, but we need BinaryExpr that includes two array functions and 1 operator to start the simplification. It seems function level optimizer can only works for simplification like A function to B Expr / function.
🤔 that is a good point that the simplification is needed for StringConcat operator 🤔
simplify for arrow_cast is just trying to do actual casting x as i64 right?
Yes
Sorry for my delay, I have been out the last few days. I am catching up with reviews