
gcp: Switch BigQuery output to Storage Write API.

grosskur opened this issue 1 year ago • 8 comments

Currently, the BigQuery output uses the "load jobs" API to batch load data, which works but has limitations when making frequent updates, as is typical in a streaming system. For example, the default quota is 1,500 load jobs per table per day, which means you can only create roughly one load job per minute. See:

https://cloud.google.com/bigquery/quotas#load_jobs

The Storage Write API is a newer API that works for both batch and streaming and has some advantages. See:

https://cloud.google.com/bigquery/docs/write-api#advantages

This change switches the BQ output to use the Storage Write API. At a high level, it now creates a "managed stream" and then repeatedly calls AppendRows on it to send batches of data. Each row is a serialized protobuf message, and the API uses gRPC bi-directional streaming:

https://github.com/googleapis/googleapis/tree/master/google/cloud/bigquery/storage/v1
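
For readers unfamiliar with that flow, here is a minimal sketch (not the code in this PR) of appending a batch through the default stream using the Go managedwriter package. The descriptorProto and serializedRows inputs are assumed to be produced elsewhere from the target table's schema:

```go
// Package bqwritesketch is an illustrative sketch, not the PR's implementation.
package bqwritesketch

import (
	"context"
	"fmt"

	"cloud.google.com/go/bigquery/storage/managedwriter"
	"google.golang.org/protobuf/types/descriptorpb"
)

// appendBatch writes one batch of pre-serialized protobuf rows to a table's
// default stream (at-least-once delivery), mirroring the approach described above.
func appendBatch(ctx context.Context, project, dataset, table string,
	descriptorProto *descriptorpb.DescriptorProto, serializedRows [][]byte) error {

	client, err := managedwriter.NewClient(ctx, project)
	if err != nil {
		return fmt.Errorf("creating managedwriter client: %w", err)
	}
	defer client.Close()

	dest := fmt.Sprintf("projects/%s/datasets/%s/tables/%s", project, dataset, table)

	// The "default" stream matches the delivery semantics used in this PR.
	ms, err := client.NewManagedStream(ctx,
		managedwriter.WithDestinationTable(dest),
		managedwriter.WithType(managedwriter.DefaultStream),
		managedwriter.WithSchemaDescriptor(descriptorProto),
	)
	if err != nil {
		return fmt.Errorf("creating managed stream: %w", err)
	}
	defer ms.Close()

	// AppendRows sends the batch over the bi-directional gRPC stream and
	// returns a result that resolves once the server acknowledges the append.
	result, err := ms.AppendRows(ctx, serializedRows)
	if err != nil {
		return fmt.Errorf("appending rows: %w", err)
	}
	if _, err := result.GetResult(ctx); err != nil {
		return fmt.Errorf("append not acknowledged: %w", err)
	}
	return nil
}
```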

Notes:

  • Only simple JSON messages with STRING, INTEGER, and DATETIME fields have been tested. CSV messages have not been tested.

  • The type of each field in a message needs to be mapped to a protobuf field type for transport, which is handled using reflection and heuristics. To make this work reliably, the BQ table must already exist so the target schema can be inferred from it (see the sketch after this list).

  • The "default" stream (at-least-once delivery) is always used. Using an application-created stream (exactly-once delivery, ACID features) is not supported.

  • The options write_disposition, create_disposition, ignore_unknown_values, max_bad_records, and auto_detect are no longer supported because they were features of the load jobs API. Some of them could potentially be reimplemented.

  • The option job_labels is no longer supported because it is specific to the load jobs API.
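
To illustrate the schema-inference note above, here is a hedged sketch (an assumption about one possible approach, not the PR's actual implementation) of deriving a protobuf descriptor from an existing BigQuery table using the adapt helpers that ship alongside managedwriter:

```go
// Package bqwritesketch is an illustrative sketch, not the PR's implementation.
package bqwritesketch

import (
	"context"
	"fmt"

	"cloud.google.com/go/bigquery"
	"cloud.google.com/go/bigquery/storage/managedwriter/adapt"
	"google.golang.org/protobuf/reflect/protodesc"
	"google.golang.org/protobuf/reflect/protoreflect"
	"google.golang.org/protobuf/types/descriptorpb"
)

// descriptorForTable reads the schema of an existing table and converts it
// into a protobuf DescriptorProto suitable for the Storage Write API, which
// is why the table must exist before writing.
func descriptorForTable(ctx context.Context, project, dataset, table string) (*descriptorpb.DescriptorProto, error) {
	bq, err := bigquery.NewClient(ctx, project)
	if err != nil {
		return nil, err
	}
	defer bq.Close()

	// Fetch the table metadata; each BigQuery field type (STRING, INTEGER,
	// DATETIME, ...) maps to a protobuf field type for transport.
	meta, err := bq.Dataset(dataset).Table(table).Metadata(ctx)
	if err != nil {
		return nil, fmt.Errorf("fetching table metadata: %w", err)
	}

	storageSchema, err := adapt.BQSchemaToStorageTableSchema(meta.Schema)
	if err != nil {
		return nil, err
	}
	descriptor, err := adapt.StorageSchemaToProto2Descriptor(storageSchema, "root")
	if err != nil {
		return nil, err
	}
	msgDesc, ok := descriptor.(protoreflect.MessageDescriptor)
	if !ok {
		return nil, fmt.Errorf("unexpected descriptor type %T", descriptor)
	}
	return protodesc.ToDescriptorProto(msgDesc), nil
}
```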

Fixes: #1167

grosskur avatar Feb 11 '24 05:02 grosskur

This PR is still pretty preliminary, but I wanted to send it in case it's useful. It doesn't have as much functionality as the current gcp_bigquery output that uses the load jobs API, so maybe it could be added under a different name, e.g., gcp_bigquery_write, so as not to disrupt existing users. Open to any/all thoughts and feedback.

grosskur avatar Feb 11 '24 05:02 grosskur

Hey @grosskur, yeah I think a separate output type specifically for this API would be easier for us than trying to reach parity with the current behaviour. We could document the advantages/disadvantages and leave it to users to decide. I think gcp_bigquery_write works as a name.

Jeffail avatar Feb 14 '24 16:02 Jeffail

Cool, thanks for the feedback. I'll spend some time and get it cleaned up as a separate output called gcp_bigquery_write.

grosskur avatar Feb 15 '24 09:02 grosskur

@grosskur Any progress? We are also interested in leveraging BQ

bir avatar Apr 03 '24 19:04 bir

Hey @grosskur are you still using this streaming api? Does this code work? Thanks in advance.

firegrass avatar Jun 24 '24 19:06 firegrass

Just a quick update. I have this working here if anyone wants to copy it.

https://github.com/mooselabs/redpanda-connect-plugin-gcp-bigquery-stream

firegrass avatar Jun 24 '24 21:06 firegrass