clickhouse sink doesn't support encoding.codec = raw_message

Open acpeakhour opened this issue 1 year ago • 12 comments

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

The clickhouse sink doesn't support encoding.codec = raw_message, which causes high CPU load from parsing JSON in a transform.

Use Cases

Many of the sinks already support encoding.codec = raw_message - why not clickhouse?

Attempted Solutions

No response

Proposal

No response

References

No response

Version

No response

acpeakhour avatar Jun 20 '24 04:06 acpeakhour

Hi @acpeakhour ,

The clickhouse sink only deals with structured data which is why the codec option is not supported. Could you describe your use-case a bit more? How would you expect the raw_message to be sent to Clickhouse?

jszwedko avatar Jun 20 '24 13:06 jszwedko

The use case arises when the payload is JSON and is contained within the message field, which is typical for JSON data from external sources. Inserting this structured data directly into Clickhouse is convenient.

Currently, the only solution for this is a transform with:

    source = '''
    . |= object!(parse_json!(.message))
    '''

This is followed by using encoding.except_fields or a further del(.message) to remove Vector-specific metadata from the event. Clickhouse already parses JSON data during insertion. Supporting raw_message for the Clickhouse sink would:

  • Reduce CPU usage in Vector for this use case
  • Provide flexibility to users in data handling when inserting into Clickhouse
  • Allow direct insertion of raw data into Clickhouse

Without this feature, we're forced to use Vector's event schema in Clickhouse and then parse it there, or have Vector parse the JSON, which is less efficient.
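For reference, the parse-in-Vector workaround might look roughly like the following pipeline fragment. This is an untested sketch: the component names (`in`, `parse_message`, `ch`), the endpoint, the database, and the table name are assumptions for illustration and do not come from the thread. The VRL removes `.message` before merging so that a `message` key inside the payload itself is not deleted.

```toml
# Assumes a source named "in" is defined elsewhere and that each event's
# .message field holds one JSON object as a string.
[transforms.parse_message]
type = "remap"
inputs = ["in"]
source = '''
# Parse the payload; this is the CPU cost the issue is about.
parsed = object!(parse_json!(.message))
# Drop the raw string, then merge the parsed fields into the event root.
del(.message)
. |= parsed
'''

[sinks.ch]
type = "clickhouse"
inputs = ["parse_message"]
endpoint = "http://localhost:8123"  # assumed local ClickHouse HTTP endpoint
database = "default"                # assumed
table = "events"                    # assumed
```

With this setup every event is parsed once in Vector, re-serialized by the sink, and parsed again by ClickHouse on insert.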

The HTTP sink supports encoding.codec = raw_message, but it doesn't support templates in the URI, limiting its usefulness. As a result, we currently have to accept high Vector CPU usage when inserting into Clickhouse.
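A minimal sketch of the HTTP sink alternative, assuming ClickHouse's HTTP interface is reachable on its default port and that the target table is fixed. The sink name `ch_http`, the source name `in`, the host, and the table `events` are hypothetical. The INSERT statement travels in the query string and the raw JSON lines form the request body.

```toml
# Assumes a source named "in" whose .message field is already one JSON object per event.
[sinks.ch_http]
type = "http"
inputs = ["in"]
method = "post"
# The table name is hard-coded because the uri is not templated.
uri = "http://localhost:8123/?query=INSERT%20INTO%20events%20FORMAT%20JSONEachRow"
# Send .message as-is, one event per line, which is the shape JSONEachRow expects.
encoding.codec = "raw_message"
framing.method = "newline_delimited"
```

This avoids parsing in Vector, but routing to different tables would need one sink per table.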

Supporting raw_message in the Clickhouse sink would align it with other sinks' capabilities and provide users with more control over their data pipeline, potentially improving performance and reducing complexity.

acpeakhour avatar Jun 21 '24 00:06 acpeakhour

This is the same issue for the Elasticsearch Sink

acpeakhour avatar Jun 21 '24 03:06 acpeakhour

Thanks for the additional detail! I think I'm still missing something though. I'm not super familiar with Clickhouse, but it seems to require you to insert structured / formatted data: https://clickhouse.com/docs/en/sql-reference/statements/insert-into. What would the INSERT statement look like with raw text? Maybe LineAsString (https://clickhouse.com/docs/en/sql-reference/formats#lineasstring)? Could you give an example INSERT statement?

jszwedko avatar Jun 21 '24 17:06 jszwedko

INSERT INTO table FORMAT JSONEachRow

https://clickhouse.com/docs/en/integrations/data-formats/json

acpeakhour avatar Jun 21 '24 21:06 acpeakhour

> INSERT INTO table FORMAT JSONEachRow
>
> https://clickhouse.com/docs/en/integrations/data-formats/json

That's what Vector already uses though 🤔

jszwedko avatar Jun 21 '24 22:06 jszwedko

Ah, I think I see what you are saying. If message is already JSON we could insert it directly rather than requiring it to be parsed in Vector. Agreed, that seems like a reasonable enhancement to this sink.

jszwedko avatar Jun 21 '24 22:06 jszwedko

Yes, that is what I am saying. Supporting raw_message in the sink saves the parse_json call in the transform in the case where the message content is JSON. I think it is likely the same for the elasticsearch sink as well.

For me at least, all our events are JSON. It seems using parse_json was a common workaround for this. I jumped for joy when I saw raw_message supported as a codec, but cried when it wasn't supported for clickhouse and elastic.

I believe this is a common use case.

acpeakhour avatar Jun 21 '24 22:06 acpeakhour

Agreed, this does seem like a potentially common use-case if not using Vector for any event processing (which would typically require parsing).

jszwedko avatar Jun 24 '24 17:06 jszwedko

I quickly hacked something together, changing the type of the encoding member of the sink's ClickhouseConfig from Transformer to EncodingConfigWithFraming: https://github.com/vectordotdev/vector/commit/8b99f9836d095b3845595ab3c9f0e28aab613657

This change allows (and requires) encoding.codec to be specified by the user. I ran some quick tests and clickhouse rows were correctly inserted for both encoding.codec = raw_message and encoding.codec = json.
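With that experimental commit applied, a sink definition along these lines would presumably be what was tested. This is a sketch under assumptions: the sink name, source name, endpoint, and table are hypothetical, and a stock Vector build from the time of this thread would not accept `encoding.codec` on the clickhouse sink.

```toml
# Only meaningful with the experimental commit above; not accepted by the
# clickhouse sink in released Vector at the time of this thread.
[sinks.ch]
type = "clickhouse"
inputs = ["in"]                     # assumed source carrying JSON in .message
endpoint = "http://localhost:8123"  # assumed
table = "events"                    # assumed
# Pass .message through untouched instead of serializing the whole event.
encoding.codec = "raw_message"
```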

Since I have no substantial knowledge of either Rust or Vector, the code should be taken with a huge grain of salt, but it may indicate that no major changes are necessary.

zu3st avatar Jul 10 '24 14:07 zu3st