stackdriver-tools icon indicating copy to clipboard operation
stackdriver-tools copied to clipboard

Firehose data loss during cf-deployments

Open anthonysroberts opened this issue 7 years ago • 4 comments

bosh deployment.txt metrics loss

Firehose data (logs, mertics) not showing in stackdriver project whilst performing bosh cf-deployments (either CF-deployment or stackdriver-tools deployment). Loss starts when routers (go routers) upgrade is started.

The bosh deployment timeline is attached which shows that firehose data is flowing until the deployment starts to the 2 router instances (fire hose data stops 09:32:23) and does not resume until the stackdirver-nozzle upgrade completes at (10:22:55) -

Task 3026 | 09:31:52 | Updating instance router: router/d99294ac-35f9-40e2-9933-57d3e464c7cf (0) (canary) (00:01:41) Task 3026 | 09:33:33 | Updating instance router: router/90ae7733-8ec0-4382-98b0-de2680634d51 (1) (00:01:46) Task 3026 | 10:22:21 | Updating instance stackdriver-nozzle: stackdriver-nozzle/42560ccc-9fe4-4b01-a1eb-6715610f7d92 (0) (canary) (00:01:00)

I have also attached a stackdriver metrics screen shot for the remaining diego-cell memory (across both cf deployments). This illustrates the loss of metrics for the deployment being upgraded. From the left hand side, you will notice that the remaining available memory for diego is circa 100 GB until 10:22 when metric data returns post upgrade. You will also see remaining memory drop off again around 11 am – that coincides with the stackdriver-tools upgrade to the other cf-deployment (point at which the upgrade hits the routers)

For reference, we have 2 cf-deployments each running a single stackdriver-nozzle both nozzles reporting up to the same single stackdriver project. stackdriver-tools 2.0.1 cf-deployment 1.14.0 GCP IaaS

anthonysroberts avatar Feb 26 '18 13:02 anthonysroberts

We need to differentiate between the nozzle failing to connect to the firehose and a long term outage of the firehose. I'd expect the nozzle to be down and not reporting for ~1 minute. Anything beyond that is not expected.

Few questions:

  • Can you pull the logs for the stackdriver-nozzle and take a look to see what errors it was hitting?
  • Did the CF API/firehose go down during this hour long period?
  • Is the stackdriver-nozzle connecting to firehose via a GCP load balancer? Did that endpoint change during this deployment?

johnsonj avatar Feb 26 '18 18:02 johnsonj

Hi, Errors in nozzle as follows @ 09:32 when router upgrade started (firehose data loss start) {"timestamp":"1519637543.238842964","source":"stackdriver-nozzle","message":"stackdriver-nozzle.firehose","log_level":2,"data":{"error":"websocket: close 1006 (abnormal closure): unexpected EOF"}} {"timestamp":"1519637544.205738306","source":"stackdriver-nozzle","message":"stackdriver-nozzle.firehose","log_level":2,"data":{"error":"Error getting token: oauth2: cannot fetch token: 401 Unauthorized\nResponse: {"error":"unauthorized","error_description":"No client authentication found. Remember to put a filter upstream of the TokenEndpointAuthenticationFilter."}"}}

The stackdriver-nozzle is connecting to firehose via a GCP load balancer

Did the CF API/Firehose go down for this hour long period? i cannot go back in retrospect - but on a CF deployment (upgrade to 1.15.0) that is current in the "no log period - i get the following via a heath check of the traffic control (which show no firehose connections):

log-api/e2ca3f18-d010-4f0b-aada-0deb0e8863b8:~$ curl http://localhost:14825/health HELP loggregator_trafficcontroller_appStreamCount Number of open app streams TYPE loggregator_trafficcontroller_appStreamCount gauge loggregator_trafficcontroller_appStreamCount 0 HELP loggregator_trafficcontroller_firehoseStreamCount Number of open firehose streams TYPE loggregator_trafficcontroller_firehoseStreamCount gauge loggregator_trafficcontroller_firehoseStreamCount 0 HELP loggregator_trafficcontroller_slowConsumerCount Number of stream consumers disconnected to avoid backpressure on the Loggregator system TYPE loggregator_trafficcontroller_slowConsumerCount gauge loggregator_trafficcontroller_slowConsumerCount 0 FYI the firehoseStreamCount increments to 1 once the deployment completes the last deployment task which is the stackdriver-nozzle. This is when the firehose data re-appears again

Note that this particular cf-deployment only rolled out stackdriver-tools across all cf components (so fluentd etc)

anthonysroberts avatar Feb 27 '18 11:02 anthonysroberts

Does the nozzle continue to spit out logs after the first websocket closure and oauth error? Is there an error that continues to repeat itself or does it just hang till it comes back online?

johnsonj avatar Feb 28 '18 16:02 johnsonj

the following continue to be shown repeatably in the logs until the point the nozzle is updated: "error":"Error getting token: oauth2: cannot fetch token: 401 Unauthorized\nResponse: {"error":"unauthorized","error_description":"No client authentication found. Remember to put a filter upstream of the TokenEndpointAuthenticationFilter."}"}}

anthonysroberts avatar Mar 02 '18 10:03 anthonysroberts