feat(stackdriver_exporter): Add ErrorLogger for promhttp
I recently hit #103 and #166 in production, and it took quite some time to recognize there was a problem with stackdriver_exporter because nothing was logged to indicate that gathering metrics was failing. From my perspective the pod was healthy and online, and I could curl /metrics and get results. Grafana Agent, however, was getting errors when scraping, specifically errors like this:
```
[from Gatherer #2] collected metric "stackdriver_gce_instance_compute_googleapis_com_instance_disk_write_bytes_count" { label:{name:"device_name"
value:"REDACTED_FOR_SECURITY"} label:{name:"device_type" value:"permanent"} label:{name:"instance_id" value:"2924941021702260446"} label:{name:"instance_name" value:"REDACTED_FOR_SECURITY"} label:{name:"project_id" value:"REDACTED_FOR_SECURITY"} label:{name:"storage_type" value:"pd-ssd"} label:{name:"unit" value:"By"} label:{name:"zone" value:"us-central1-a"}
counter:{value:0} timestamp_ms:1698871080000} was collected before with the same name and label values
```
To help identify the root cause, I've added the ability to opt into logging errors that come from the handler. Specifically, I've created a struct `customPromErrorLogger` that implements the `promhttp.Logger` interface. There is a new flag, `monitoring.enable-promhttp-custom-logger`: when it is set to true, we create an instance of `customPromErrorLogger` and use it as the `ErrorLog` value in `promhttp.HandlerOpts`. Otherwise, stackdriver_exporter works as it did before and does not log errors encountered while collecting metrics.
- refs #103, #166
@SuperQ when you get a chance, mind providing a review? This would be really helpful for us so we can at least get alerted when we enter this state.