[Feature Request]: Set quota project in `beam.io.ReadFromBigQuery`

Open shahar1 opened this issue 1 year ago • 7 comments

What would you like to happen?

This issue relates to the Python SDK, but it is probably relevant to other SDKs as well. We have a use case where queries initiated by beam.io.ReadFromBigQuery should be billed to a specific GCP project ID. Because we use a custom container, the only option for now is to set the environment variable GOOGLE_CLOUD_QUOTA_PROJECT in the Dockerfile, but that affects all other GCP services as well. It would be best to make this configurable via the connector (i.e., beam.io.ReadFromBigQuery(..., quota_project_id='some-project-id')). For the implementation, you could take inspiration from a similar feature in beam.io.WriteToBigQuery: https://github.com/apache/beam/pull/16186.
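
To illustrate the difference between the two approaches, here is a minimal sketch of how a connector-level setting could take precedence over the process-wide environment variable. The helper `resolve_quota_project` and its parameters are hypothetical and not part of Beam; only the `GOOGLE_CLOUD_QUOTA_PROJECT` variable name comes from the description above.

```python
import os

# Environment variable named in the issue; it applies to the whole process.
QUOTA_ENV_VAR = "GOOGLE_CLOUD_QUOTA_PROJECT"


def resolve_quota_project(quota_project_id=None, environ=None):
    """Hypothetical helper: pick the quota project for a single connector.

    An explicit per-connector value wins; otherwise fall back to the
    process-wide environment variable; otherwise return None.
    """
    env = os.environ if environ is None else environ
    return quota_project_id or env.get(QUOTA_ENV_VAR) or None


if __name__ == "__main__":
    # Simulated environment, standing in for a value baked into a Dockerfile.
    fake_env = {QUOTA_ENV_VAR: "project-from-dockerfile"}
    print(resolve_quota_project(environ=fake_env))
    print(resolve_quota_project("some-project-id", environ=fake_env))
```

With a per-connector value, other GCP clients in the same container would keep whatever the environment provides, which is the behavior the request is after.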

Issue Priority

Priority: 2 (default / most feature requests should be filed as P2)

Issue Components

  • [X] Component: Python SDK
  • [X] Component: Java SDK
  • [X] Component: Go SDK
  • [X] Component: Typescript SDK
  • [X] Component: IO connector
  • [ ] Component: Beam YAML
  • [ ] Component: Beam examples
  • [ ] Component: Beam playground
  • [ ] Component: Beam katas
  • [ ] Component: Website
  • [ ] Component: Spark Runner
  • [ ] Component: Flink Runner
  • [ ] Component: Samza Runner
  • [ ] Component: Twister2 Runner
  • [ ] Component: Hazelcast Jet Runner
  • [X] Component: Google Cloud Dataflow Runner

shahar1 avatar Apr 28 '24 08:04 shahar1

@shahar1 this level of customization might make sense.

Let's explore your specific concern for a moment. I might have others, but I imagine it is worth understanding your needs/use case:

What quotas are getting hit that are problematic? Or, what are the specific billing charges you are looking to attribute elsewhere?
Are you running on Dataflow, or something else? [ not critical, but curious ]
What read method? [ BQ Storage Read API? ]
Do you want to run the compute in one GCP project, but use BQ from another?
If this unloads, writes to GCS and then into Dataflow [ that is another way that can occur ], do you intend to specify which project [ bucket within ] the data is written to?

brucearctor avatar Apr 29 '24 21:04 brucearctor

Also, I wonder whether implementing this issue would help with https://github.com/apache/beam/issues/30747

brucearctor avatar Apr 29 '24 21:04 brucearctor

Thanks for your response! Here are the answers to your questions:

  1. I'd like to attribute the query execution to another project. In my case, BigQuery is on project A, and beam.io.ReadFromBigQuery runs on project B - I'd like to bill project B for the queries (for that matter it could also be project C).
  2. We use Dataflow and the direct runner (when implementing, it should preferably be a general solution and not Dataflow-specific).
  3. I use both methods - if I'm not wrong, in both cases you could set the quota_project_id via ClientOptions.
  4. Yup, you got the idea correctly :)
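
The approach in item 3 can be sketched as follows. This assumes the `google-cloud-bigquery` and `google-api-core` packages are installed and that default credentials are available; `client_options_kwargs` and `make_bigquery_client` are hypothetical helper names, not Beam APIs, and the sketch does not show how Beam constructs its clients internally.

```python
def client_options_kwargs(quota_project_id=None):
    """Build keyword arguments for ClientOptions; empty if no quota project is set."""
    return {"quota_project_id": quota_project_id} if quota_project_id else {}


def make_bigquery_client(project, quota_project_id=None):
    """Hypothetical helper: create a BigQuery client with an optional quota project."""
    # Imported lazily so the helper above can be used without the Google libraries.
    from google.api_core.client_options import ClientOptions
    from google.cloud import bigquery

    options = ClientOptions(**client_options_kwargs(quota_project_id))
    return bigquery.Client(project=project, client_options=options)
```

Calling `make_bigquery_client(...)` needs working credentials, so only the option-building step can be checked offline.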

As for #30747 - it is related, but the implementation might differ somewhat, since GCS is a project's resource rather than a service.

shahar1 avatar Apr 30 '24 06:04 shahar1

@shahar1 sounds like you've got a decent idea/design in mind, which could be supported.

Are you interested in contributing? Feel free to start and include me on PRs, if that's the case.

brucearctor avatar Apr 30 '24 15:04 brucearctor

I'd be happy to try! I first need to learn how development works here (I'm coming from the Airflow community).

shahar1 avatar Apr 30 '24 17:04 shahar1

.take-issue

shahar1 avatar Apr 30 '24 19:04 shahar1

This should be pretty good --> https://github.com/apache/beam/blob/master/CONTRIBUTING.md

If you find a problem [ or something that is outdated ], let's work through it and fix the docs along the way.

brucearctor avatar Apr 30 '24 19:04 brucearctor

Are there any updates on this?

eminik avatar Sep 12 '24 17:09 eminik

I haven't managed to work on it yet. If you (or anyone else) find this feature useful and want to take over, please let me know.

shahar1 avatar Sep 12 '24 18:09 shahar1