substrait icon indicating copy to clipboard operation
substrait copied to clipboard

[bug] concat:vchar is not very usable

Open miretskiy opened this issue 1 year ago • 5 comments

concat:vchar overload is not very usable:

impls:
      - args:
          - value: "varchar<L1>"
            name: "input"
        variadic:
          min: 1
        options:
          null_handling:
            values: [ IGNORE_NULLS, ACCEPT_NULLS ]
        return: "varchar<L1>"

Each argument must be varchar<L1> -- that is it must be varchar arguments with exactly the same length; and it must produce same result (varchar<L1>). That means that arguments of length L1 cannot be concatenated together; in general the length of each argument must be at most L1/n (on average). While this works, it's probably not the desired behavior.

The additional usability issue with this overload is that if using e.g. substrait-go library to construct varchar literals, the produced literals carry the exact length of the argument (https://github.com/substrait-io/substrait-go/blob/1a2a6ae5fd9986045f69d6ec590875c9af45b88f/literal/utils.go#L296) This means that you cannot construct an expression to call this overload with e.g. concat(NewVarChar('hello'), NewVarChar(' world'))

Given that concat already provides variadic overload for strings, this version is superfluous, and error prone. It should probably be removed.

miretskiy avatar Apr 14 '25 17:04 miretskiy

The variadic attribute should allow any number of arguments as long as the minimum of one is provided. There is no maximum specified.

EpsilonPrime avatar Apr 14 '25 21:04 EpsilonPrime

The variadic attribute should allow any number of arguments as long as the minimum of one is provided. There is no maximum specified.

Right; the issue is not in the number of arguments, but the fact that all arguments must be varchar<L1> type.

miretskiy avatar Apr 14 '25 21:04 miretskiy

Ah, I see. Both DuckDB and Snowflake use a maximum length for their varchar literals which is different than Substrait Go's behavior. The L1 requirement does seem to be the issue as it would restrict the input and output to have the same length. The output length could be calculated given the input lengths if we had some way of referencing the lengths of the variadic count. We don't have that capability yet.

EpsilonPrime avatar Apr 14 '25 22:04 EpsilonPrime

The L1 requirement does seem to be the issue as it would restrict the input and output to have the same length.

This is the issue I assumed we're talking about. Specifically, the output having the same length as the input.

For

but the fact that all arguments must be varchar<L1> type.

The expectation is that the producer should/can upcast all of the input types. That being said, I don't know if this function is particularly usable. I think a better concat definition for general use might be

concat(varchar<N>, varchar<M>) -> varchar<N + M>

with no variadics, unless we want a concat version that truncates the output.

vbarua avatar Apr 14 '25 22:04 vbarua

Good point @EpsilonPrime

Both DuckDB and Snowflake use a maximum length for their varchar literals

It's an option; literal.NewVarChar is a helper that takes a string -- it is reasonable for it to use VarChar<maxL> imo. On the other hand, if the caller cares that the varchars are of specific length, they can use expression.NewLiteral[T] where they can be explicit about the type.

miretskiy avatar Apr 15 '25 14:04 miretskiy