github-activity-counter

Disabling Dataflow?

Open ahmetb opened this issue 6 years ago • 3 comments

I feel like there should be a deployment option to disable the use of Cloud Dataflow.

Pretty much everything else used by this tool feels pay-as-you-go/serverless.

However, it seems like Dataflow provisions an n1-standard-4 instance ($97/mo). That is simply not within my budget.

I'd love to see an option to disable Dataflow in the setup scripts, and also an explanation of what Dataflow does here.

Note: I'm not at all familiar with Dataflow, so I'm not sure where it's currently used in this tool. (It feels like the Cloud Run service could write directly to Stackdriver and/or BigQuery.)

ahmetb avatar Oct 28 '19 05:10 ahmetb

Agreed, this is the one bit that's unlike the others in this stack. Cloud Run could insert the events directly into BigQuery, but that's an antipattern given the low quota on individual inserts. Will consider an alternative way of streaming events from PubSub to BigQuery.
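For a sense of what "directly into BigQuery" would mean: each event becomes its own streaming insert, roughly like the sketch below (illustrative only; the table spec and payload are made up, and this is exactly the per-row pattern the insert quota discourages).

# Illustrative sketch only: one streaming insert per event, via the bq CLI.
# The table spec and payload here are made-up examples.
echo '{"id":"123","type":"PushEvent","actor":"ahmetb","repo":"mchmarny/github-activity-counter"}' > event.json

# bq insert streams newline-delimited JSON rows into the table
bq insert ${PROJECT}:${SERVICE_NAME}.events event.json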

mchmarny avatar Oct 29 '19 15:10 mchmarny

I've modified the PubSub to BigQuery pipeline to use at most 1 worker, which should significantly reduce the cost (~$30/mo). I still need to test it, but you should be able to use this in the setup:

gcloud dataflow jobs run $SERVICE_NAME \
    --gcs-location gs://cloudylabs-public/cloudylabs-pipelines/pubsub-to-bigquery.json \
    --region $SERVICE_REGION \
    --parameters "inputTopic=projects/${PROJECT}/topics/${SERVICE_NAME},outputTableSpec=${PROJECT}:${SERVICE_NAME}.events"

mchmarny avatar Oct 29 '19 16:10 mchmarny

$30/mo is still significant for something that runs maybe a few times a day. I see the quota point; what if we published to PubSub, drained it once a day, and did a batch insert?
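Roughly what I have in mind is a daily scheduled job along these lines (just a sketch; the subscription name and the one-JSON-object-per-message format are assumptions):

# Sketch of the once-a-day drain + batch load idea.
# The subscription name and message format are assumptions.

# 1) Drain up to 1000 messages from the subscription (auto-acked),
#    writing one JSON event per line.
gcloud pubsub subscriptions pull ${SERVICE_NAME}-drain \
    --auto-ack --limit 1000 \
    --format="value(message.data.decode(base64))" > events.ndjson

# 2) Batch load the file into BigQuery: a load job, not streaming inserts,
#    so the per-row insert quota doesn't come into play.
bq load --source_format=NEWLINE_DELIMITED_JSON \
    ${PROJECT}:${SERVICE_NAME}.events events.ndjson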

ahmetb avatar Oct 29 '19 19:10 ahmetb