gcp: Switch BigQuery output to Storage Write API.
Currently, the BigQuery output uses the "load jobs" API to batch-load data. This works, but it has limitations when making frequent updates, as is typical in a streaming system: for example, the default quota is 1,500 load jobs per table per day, which allows only about one load job per minute. See:
https://cloud.google.com/bigquery/quotas#load_jobs
The Storage Write API is a newer API that works for both batch and streaming and has some advantages. See:
https://cloud.google.com/bigquery/docs/write-api#advantages
This change switches the BQ output to use the Storage Write API. At a high level, it now creates a "managed stream" and then repeatedly calls AppendRows on it to send batches of data. Each row is a serialized protobuf message, and the API uses gRPC bi-directional streaming:
https://github.com/googleapis/googleapis/tree/master/google/cloud/bigquery/storage/v1
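For orientation, here's a minimal sketch of that flow using the Go managedwriter package (cloud.google.com/go/bigquery/storage/managedwriter). The function and variable names are placeholders, not the names used in this PR:

```go
package main

import (
	"context"

	"cloud.google.com/go/bigquery/storage/managedwriter"
	"google.golang.org/protobuf/types/descriptorpb"
)

// writeBatch opens a managed stream against the table's default stream
// (at-least-once delivery) and appends one batch of pre-serialized
// protobuf rows over the bi-directional gRPC stream.
func writeBatch(ctx context.Context, projectID, datasetID, tableID string,
	schema *descriptorpb.DescriptorProto, rows [][]byte) error {

	client, err := managedwriter.NewClient(ctx, projectID)
	if err != nil {
		return err
	}
	defer client.Close()

	// The schema descriptor tells the backend how to decode each
	// serialized row; see the schema-inference note below.
	ms, err := client.NewManagedStream(ctx,
		managedwriter.WithDestinationTable(
			managedwriter.TableParentFromParts(projectID, datasetID, tableID)),
		managedwriter.WithType(managedwriter.DefaultStream),
		managedwriter.WithSchemaDescriptor(schema),
	)
	if err != nil {
		return err
	}
	defer ms.Close()

	// AppendRows queues the batch on the stream; GetResult blocks until
	// the server acknowledges it.
	result, err := ms.AppendRows(ctx, rows)
	if err != nil {
		return err
	}
	_, err = result.GetResult(ctx)
	return err
}
```

In the actual output the client and stream would be created once and AppendRows called repeatedly, one call per batch; the sketch folds everything into a single function for brevity.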
Notes:

- Only simple JSON messages have been tested, with `STRING`, `INTEGER`, and `DATETIME` fields. CSV messages have not been tested.
- The type of each field in a message needs to be mapped to a protobuf field type for transport, which is handled using reflection and heuristics. To make this work reliably, the BQ table must already exist so that the target schema can be inferred from it (see the sketch after this list).
- The "default" stream (at-least-once delivery) is always used. Using an application-created stream (exactly-once delivery, ACID features) is not supported.
- The options `write_disposition`, `create_disposition`, `ignore_unknown_values`, `max_bad_records`, and `auto_detect` are no longer supported because they were features of the load jobs API. Some of them could potentially be reimplemented.
- The option `job_labels` is no longer supported because it is specific to the load jobs API.
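To illustrate the schema-inference note, here's a rough sketch of one way to derive the protobuf descriptor from an existing table and serialize a single row, using the adapt helpers bundled with managedwriter. It deliberately elides the coercion heuristics this PR actually implements, and all names are placeholders:

```go
package main

import (
	"context"
	"errors"

	"cloud.google.com/go/bigquery"
	"cloud.google.com/go/bigquery/storage/managedwriter/adapt"
	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/reflect/protoreflect"
	"google.golang.org/protobuf/types/descriptorpb"
	"google.golang.org/protobuf/types/dynamicpb"
)

// tableDescriptor fetches the existing table's schema and converts it
// into a protobuf descriptor suitable for WithSchemaDescriptor.
func tableDescriptor(ctx context.Context, projectID, datasetID, tableID string) (protoreflect.MessageDescriptor, *descriptorpb.DescriptorProto, error) {
	bq, err := bigquery.NewClient(ctx, projectID)
	if err != nil {
		return nil, nil, err
	}
	defer bq.Close()

	// The table must already exist; its metadata is the source of truth
	// for the target schema.
	meta, err := bq.Dataset(datasetID).Table(tableID).Metadata(ctx)
	if err != nil {
		return nil, nil, err
	}
	storageSchema, err := adapt.BQSchemaToStorageTableSchema(meta.Schema)
	if err != nil {
		return nil, nil, err
	}
	descriptor, err := adapt.StorageSchemaToProto2Descriptor(storageSchema, "root")
	if err != nil {
		return nil, nil, err
	}
	md, ok := descriptor.(protoreflect.MessageDescriptor)
	if !ok {
		return nil, nil, errors.New("expected a message descriptor")
	}
	dp, err := adapt.NormalizeDescriptor(md)
	if err != nil {
		return nil, nil, err
	}
	return md, dp, nil
}

// serializeRow builds a dynamic message from the descriptor and marshals
// it to the protobuf wire format that AppendRows expects.
func serializeRow(md protoreflect.MessageDescriptor, row map[string]interface{}) ([]byte, error) {
	msg := dynamicpb.NewMessage(md)
	for name, value := range row {
		fd := md.Fields().ByName(protoreflect.Name(name))
		if fd == nil {
			continue // unknown field; real code might error instead
		}
		// protoreflect.ValueOf only accepts the exact Go kinds protobuf
		// expects (e.g. int64, not int); the reflection/heuristics
		// mentioned above are needed to coerce decoded JSON values first.
		msg.Set(fd, protoreflect.ValueOf(value))
	}
	return proto.Marshal(msg)
}
```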
Fixes: #1167
This PR is still pretty preliminary, but I wanted to send it in case it's useful. It doesn't have as much functionality as the current gcp_bigquery output that uses the load jobs API, so maybe it could be added under a different name, e.g. gcp_bigquery_write, so as not to disrupt existing users. Open to any/all thoughts and feedback.
Hey @grosskur, yeah I think a separate output type specifically for this API would be easier for us than trying to reach parity with the current behaviour; we could document the advantages/disadvantages and leave it to users to decide. I think gcp_bigquery_write works as a name.
Cool, thanks for the feedback. I'll spend some time and get it cleaned up as a separate output called gcp_bigquery_write.
@grosskur Any progress? We are also interested in leveraging BQ.
Hey @grosskur, are you still using this streaming API? Does this code work? Thanks in advance.
Just a quick update. I have this working here if anyone wants to copy it.
https://github.com/mooselabs/redpanda-connect-plugin-gcp-bigquery-stream