gcp_stackdriver_logs: 401 Unauthorized roughly every hour
Problem
Roughly every hour we observe "Http status: 401 Unauthorized" in our Vector logs, coming from the gcp_stackdriver_logs sink.
Vector is running in GKE as a headless service and consuming logs from Kafka. We do not use credentials_path and instead rely on the Service Account to authenticate. The problem resolves itself after 2 to 3 minutes; however, my understanding is that the affected requests are not retried, so those logs are discarded.
Configuration
type = "gcp_stackdriver_logs"
inputs = ["final_cleanup"]
log_id = "{{ type }}"
project_id = "centralized-logging"
severity_key = "log_level"
batch.max_events = 1000
batch.max_bytes = 9900000
resource.type = "{{ resource_type }}"
resource.project_id = "{{ gcp_project_id }}"
resource.instance_id = "{{ hostname }}"
Version
0.34.1-distroless-libc
References
- https://github.com/vectordotdev/vector/issues/17559
- https://github.com/vectordotdev/vector/issues/8616
You are correct: those requests do not appear to be retried. I'd argue they should be (per https://github.com/vectordotdev/vector/issues/10870), and that the token should also be refreshed before it expires.
Retry logic:
https://github.com/vectordotdev/vector/blob/131ab453d4611699e6f6989546c4b5d289e8768a/src/sinks/util/http.rs#L517-L531
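To make the two suggestions above concrete, here is a minimal sketch of what "retry on 401" and "refresh before expiry" could look like. This is not Vector's actual code: `RetryDecision`, `classify_status`, `CachedToken`, and `needs_refresh` are hypothetical names invented for illustration, and the exact set of retriable status codes is an assumption.

```rust
use std::time::{Duration, Instant};

/// Hypothetical outcome of inspecting an HTTP response status.
/// (Illustrative only; not the type used in Vector's retry logic.)
#[derive(Debug, PartialEq, Eq)]
pub enum RetryDecision {
    Success,
    Retry,
    DontRetry,
}

/// Classify a status code, treating 401 as retriable on the assumption
/// that it usually means the access token expired and a refreshed token
/// will succeed on the next attempt.
pub fn classify_status(status: u16) -> RetryDecision {
    match status {
        200..=299 => RetryDecision::Success,
        401 => RetryDecision::Retry,       // assumed: stale token, retry after refresh
        429 => RetryDecision::Retry,       // rate limited
        500..=599 => RetryDecision::Retry, // server-side errors
        _ => RetryDecision::DontRetry,     // other client errors will not improve on retry
    }
}

/// Hypothetical cached access token with a known expiry time.
pub struct CachedToken {
    pub value: String,
    pub expires_at: Instant,
}

impl CachedToken {
    /// Returns true once `now` is within `margin` of the expiry, so the
    /// caller can fetch a new token before the old one stops working.
    pub fn needs_refresh(&self, now: Instant, margin: Duration) -> bool {
        now + margin >= self.expires_at
    }
}
```

With a one-hour token and a five-minute margin, `needs_refresh` would start returning true at the 55-minute mark, which leaves room for the refresh to finish before any request is sent with an expired token. Retrying on 401 then acts as a safety net for the cases where the refresh still loses the race.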
Coming back to this: the root issue seems to relate to running more than one gcp_stackdriver_logs sink (we had a separate sink sending a subset of logs to a different GCP project). Vector's handling of authentication token refreshes may have a timing or race issue when more than one sink is in play. When we removed the additional sink, the 401s were no longer observed.
Update: The 401s have returned, even though we have scaled back to a single gcp_stackdriver_logs sink.
@jszwedko I've taken a stab at changing how the token is refreshed in https://github.com/vectordotdev/vector/pull/20574.
Closing. Fixed in https://github.com/vectordotdev/vector/pull/20574