
gcp: Switch BigQuery output to Storage Write API.

grosskur opened this issue 1 year ago • 8 comments

Currently, the BigQuery output uses the "load jobs" API to batch load data, which works but has limitations when making frequent updates, as is typical in a streaming system. For example, the default quota is 1,500 load jobs per table per day, which means you can only create roughly one load job per minute. See:

https://cloud.google.com/bigquery/quotas#load_jobs

The Storage Write API is a newer API that works for both batch and streaming and has some advantages. See:

https://cloud.google.com/bigquery/docs/write-api#advantages

This change switches the BQ output to use the Storage Write API. At a high level, it now creates a "managed stream" and then repeatedly calls AppendRows on it to send batches of data. Each row is a serialized protobuf message, and the API uses gRPC bi-directional streaming:

https://github.com/googleapis/googleapis/tree/master/google/cloud/bigquery/storage/v1
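
For readers unfamiliar with that flow, here is a minimal sketch (not the code in this PR) of appending a batch through the default stream using the Go managedwriter package. The descriptorProto and serializedRows inputs are assumed to be produced elsewhere from the target table's schema:

```go
// Package bqwritesketch is an illustrative sketch, not the PR's implementation.
package bqwritesketch

import (
	"context"
	"fmt"

	"cloud.google.com/go/bigquery/storage/managedwriter"
	"google.golang.org/protobuf/types/descriptorpb"
)

// appendBatch writes one batch of pre-serialized protobuf rows to a table's
// default stream (at-least-once delivery), mirroring the approach described above.
func appendBatch(ctx context.Context, project, dataset, table string,
	descriptorProto *descriptorpb.DescriptorProto, serializedRows [][]byte) error {

	client, err := managedwriter.NewClient(ctx, project)
	if err != nil {
		return fmt.Errorf("creating managedwriter client: %w", err)
	}
	defer client.Close()

	dest := fmt.Sprintf("projects/%s/datasets/%s/tables/%s", project, dataset, table)

	// The "default" stream matches the delivery semantics used in this PR.
	ms, err := client.NewManagedStream(ctx,
		managedwriter.WithDestinationTable(dest),
		managedwriter.WithType(managedwriter.DefaultStream),
		managedwriter.WithSchemaDescriptor(descriptorProto),
	)
	if err != nil {
		return fmt.Errorf("creating managed stream: %w", err)
	}
	defer ms.Close()

	// AppendRows sends the batch over the bi-directional gRPC stream and
	// returns a result that resolves once the server acknowledges the append.
	result, err := ms.AppendRows(ctx, serializedRows)
	if err != nil {
		return fmt.Errorf("appending rows: %w", err)
	}
	if _, err := result.GetResult(ctx); err != nil {
		return fmt.Errorf("append not acknowledged: %w", err)
	}
	return nil
}
```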

Notes:

  • Only simple JSON messages with STRING, INTEGER, and DATETIME fields have been tested. CSV messages have not been tested.

  • The type of each field in a message needs to be mapped to a protobuf field type for transport, which is handled using reflection and heuristics. To make this work reliably, the BQ table must already exist so the target schema can be inferred from it (see the sketch after this list).

  • The "default" stream (at-least-once delivery) is always used. Using an application-created stream (exactly-once delivery, ACID features) is not supported.

  • The options write_disposition, create_disposition, ignore_unknown_values, max_bad_records, and auto_detect are no longer supported because they were features of the load jobs API. Some of them could potentially be reimplemented.

  • The option job_labels is no longer supported because it is specific to the load jobs API.
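
To illustrate the schema-inference note above, here is a hedged sketch (an assumption about one possible approach, not the PR's actual implementation) of deriving a protobuf descriptor from an existing BigQuery table using the adapt helpers that ship alongside managedwriter:

```go
// Package bqwritesketch is an illustrative sketch, not the PR's implementation.
package bqwritesketch

import (
	"context"
	"fmt"

	"cloud.google.com/go/bigquery"
	"cloud.google.com/go/bigquery/storage/managedwriter/adapt"
	"google.golang.org/protobuf/reflect/protodesc"
	"google.golang.org/protobuf/reflect/protoreflect"
	"google.golang.org/protobuf/types/descriptorpb"
)

// descriptorForTable reads the schema of an existing table and converts it
// into a protobuf DescriptorProto suitable for the Storage Write API, which
// is why the table must exist before writing.
func descriptorForTable(ctx context.Context, project, dataset, table string) (*descriptorpb.DescriptorProto, error) {
	bq, err := bigquery.NewClient(ctx, project)
	if err != nil {
		return nil, err
	}
	defer bq.Close()

	// Fetch the table metadata; each BigQuery field type (STRING, INTEGER,
	// DATETIME, ...) maps to a protobuf field type for transport.
	meta, err := bq.Dataset(dataset).Table(table).Metadata(ctx)
	if err != nil {
		return nil, fmt.Errorf("fetching table metadata: %w", err)
	}

	storageSchema, err := adapt.BQSchemaToStorageTableSchema(meta.Schema)
	if err != nil {
		return nil, err
	}
	descriptor, err := adapt.StorageSchemaToProto2Descriptor(storageSchema, "root")
	if err != nil {
		return nil, err
	}
	msgDesc, ok := descriptor.(protoreflect.MessageDescriptor)
	if !ok {
		return nil, fmt.Errorf("unexpected descriptor type %T", descriptor)
	}
	return protodesc.ToDescriptorProto(msgDesc), nil
}
```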

Fixes: #1167

grosskur avatar Feb 11 '24 05:02 grosskur

This PR is still pretty preliminary, but I wanted to send it in case it's useful. It doesn't have as much functionality as the current gcp_bigquery output that uses the load jobs API, so maybe it could be added under a different name, e.g., gcp_bigquery_write, so as not to disrupt existing users. Open to any/all thoughts and feedback.

grosskur avatar Feb 11 '24 05:02 grosskur

Hey @grosskur, yeah I think a separate output type specifically for this API would be easier for us than trying to reach parity with the current behaviour. We could document the advantages/disadvantages and leave it to users to decide. I think gcp_bigquery_write works as a name.

Jeffail avatar Feb 14 '24 16:02 Jeffail

Cool, thanks for the feedback. I'll spend some time and get it cleaned up as a separate output called gcp_bigquery_write.

grosskur avatar Feb 15 '24 09:02 grosskur

@grosskur Any progress? We are also interested in leveraging BQ

bir avatar Apr 03 '24 19:04 bir

Hey @grosskur are you still using this streaming api? Does this code work? Thanks in advance.

firegrass avatar Jun 24 '24 19:06 firegrass

Just a quick update. I have this working here if anyone wants to copy it.

https://github.com/mooselabs/redpanda-connect-plugin-gcp-bigquery-stream

firegrass avatar Jun 24 '24 21:06 firegrass