Simplify Inserts of Default Values/Dictionaries
Currently the Client insert method requires the caller to create a complete, populated dataset, including any default values. There should be a mechanism where default values are automatically supplied for missing/None values when inserting into non-nullable columns.
This is closely related to an interface that allows building an insert from maps/dictionaries, using the map/dict keys to determine column names. Missing keys would use the ClickHouse default value.
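A minimal sketch of the dict-based half of this, assuming the existing `client.insert` call with its `column_names` parameter. The helper name `dicts_to_insert_args` is hypothetical; the idea is that columns no row ever supplies are simply omitted from the insert, so ClickHouse applies its own defaults for them, while keys missing from individual rows become None (which only works for Nullable columns today, hence this issue):

```python
from typing import Any, Dict, List, Optional, Sequence, Tuple

def dicts_to_insert_args(
    rows: List[Dict[str, Any]],
    table_columns: Sequence[str],
) -> Tuple[List[str], List[List[Optional[Any]]]]:
    """Build (column_names, row_data) for client.insert from a list of dicts.

    Columns never mentioned in any row are omitted entirely, so the server
    fills in its defaults; keys missing from individual rows become None,
    which only round-trips cleanly for Nullable columns.
    """
    used = [c for c in table_columns if any(c in r for r in rows)]
    data = [[row.get(col) for col in used] for row in rows]
    return used, data

# Hypothetical usage:
# cols, data = dicts_to_insert_args(rows, ['id', 'name', 'created'])
# client.insert('my_table', data, column_names=cols)
```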
Determining "non-standard" default values from the "Describe Table" query to ClickHouse is somewhat challenging, as the default expression is a string that ClickHouse Connect would have to evaluate.
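To make the challenge concrete: `DESCRIBE TABLE` returns `default_type` and `default_expression` columns, and the expression comes back as an unevaluated string. A sketch of collecting the problematic columns (the helper name and the tuple shape are illustrative; real code would pull the rows from `client.query('DESCRIBE TABLE ...')`):

```python
from typing import Dict, Sequence, Tuple

def defaults_needing_eval(
    describe_rows: Sequence[Tuple[str, str, str, str]],
) -> Dict[str, str]:
    """From (name, type, default_type, default_expression) rows of a
    DESCRIBE TABLE result, collect columns whose DEFAULT expression the
    client would have to evaluate itself.

    A simple literal like '0' could be parsed client-side, but anything
    involving functions or other columns -- now(), col_a * col_b -- can't
    be computed without re-implementing ClickHouse expression evaluation.
    """
    return {
        name: expr
        for name, _type, default_type, expr in describe_rows
        if default_type == 'DEFAULT' and expr
    }
```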
Hi @genzgd are there any plans to implement this behavior into clickhouse-connect soon?
And if not: are you open to accepting a PR for this?
This is not currently a priority, so a PR in this area would be welcome.
Just a warning, one of the reasons that it's not a priority is because it's a tricky problem. When inserting data using Native format, there's no way to specify that the inserted value should be converted to the ClickHouse default value (although this functionality was implemented for RowBinary using a new format). So I don't see any easy way to make it work for calculated default values like now() or column_1 * column_2.
One possible solution would be to do such inserts using a JSON or RowBinaryWithDefaults format, but that would be significantly slower and require a fair amount of additional code.
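For reference, the RowBinaryWithDefaults format (per the ClickHouse docs) prefixes each field with one byte: 1 means "use the column's default" and no value follows, 0 means an actual value follows. A rough sketch for a row of UInt32 columns only, to illustrate the per-value encoding work involved:

```python
import struct
from typing import Optional, Sequence

def encode_uint32_row_with_defaults(values: Sequence[Optional[int]]) -> bytes:
    """Encode one row of UInt32 columns, RowBinaryWithDefaults style:
    each field gets a 1-byte prefix -- 1 = apply the column default
    (no value bytes follow), 0 = a literal value follows.
    """
    out = bytearray()
    for v in values:
        if v is None:
            out += b'\x01'  # tell the server to apply the column default
        else:
            out += b'\x00' + struct.pack('<I', v)  # little-endian UInt32
    return bytes(out)
```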
Because of the complexity of the problem, I would recommend outlining your planned approach here for review before writing a lot of code.
Got it - Thanks for elaborating on that. That does sound more complex than I was hoping it'd be ;-)
It also does sound like RowBinaryWithDefaults would be the way to go, to ensure it would work with the calculated values you describe.
Why would it be significantly slower - if I understand correctly there's an extra byte added to each field?
This library originally started with the RowBinary format -- if you look at very old (pre 0.3.0) versions, you'll see the original code. However, it turns out to be very slow to call a separate conversion function on each value when building (or reading) the data for the binary row. The ClickHouse Native format is columnar, so hundreds or thousands of values in the same format can be converted using a loop in one function without the need for (expensive) Python function call overhead. Such functions can also be translated to C and optimized. The performance difference is at least 10x based on that early code.
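The row-vs-column cost difference described above can be sketched in two lines of Python (illustrative only, not the library's actual serialization code):

```python
import struct
from typing import List

def pack_per_value(values: List[int]) -> bytes:
    # Row-oriented style: one Python-level pack call per value,
    # roughly what the old RowBinary-era code had to do.
    return b''.join(struct.pack('<I', v) for v in values)

def pack_columnar(values: List[int]) -> bytes:
    # Column-oriented style: a single call converts the whole
    # column at once, amortizing the call overhead away.
    return struct.pack(f'<{len(values)}I', *values)
```

Both produce identical bytes; timing them on a column of a few hundred thousand values shows the columnar version winning by a wide margin, which is the overhead the Native-format code avoids.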
So adding the extra byte is not really a big issue -- the problem is essentially that you would have to rewrite all of the currently optimized serialization code just to get a much slower result. The real answer is probably to push harder for support for a NativeWithDefaults format in ClickHouse server that would add an additional "defaults" column similar to the current "nullable" column, but that's going to be a non-trivial amount of C++ work.
I see - Thanks for the explanation. Going back and losing that 10x performance improvement indeed seems very undesirable. So it makes sense to not pick this up until NativeWithDefaults is supported. I can ask for support for that. There's a good case to be made for it :]
Depends on https://github.com/ClickHouse/ClickHouse/issues/58662