Add support for targeting services for scraping
An increasing number of providers are placing their metrics endpoints behind a Service rather than exposing them directly from a Deployment; Kyverno and ArgoCD are notable examples. Other Prometheus collector implementations have a CRD that lets you scrape metrics by targeting a Service: https://github.com/prometheus-community/helm-charts/blob/main/charts/kube-prometheus-stack/crds/crd-servicemonitors.yaml
The only way to target these endpoints currently is to ignore the Service, dig through the Deployment that's generating the metrics for a label you can use, and construct a PodMonitoring resource for each Deployment that generates metrics. It sucks because it turns what should be a single, simple monitoring resource into multiple monitoring resources that are generally more brittle.
There might be some way to use the PodMonitoring resource to target a Service that I'm just not aware of. Here's an example YAML, based on the prom-example sample in GCP's documentation, that includes a Service. I feel like this should work, but it doesn't...
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prom-example
  namespace: gmp-test
  labels:
    app: prom-example
spec:
  selector:
    matchLabels:
      app: prom-example
  replicas: 3
  template:
    metadata:
      labels:
        app: prom-example
    spec:
      containers:
      - image: nilebox/prometheus-example-app@sha256:dab60d038c5d6915af5bcbe5f0279a22b95a8c8be254153e22d7cd81b21b84c5
        name: prom-example
        ports:
        - name: metrics
          containerPort: 1234
        command:
        - "/main"
        - "--process-metrics"
        - "--go-metrics"
---
apiVersion: v1
kind: Service
metadata:
  name: gmp-test-service
  namespace: gmp-test
spec:
  ports:
  - port: 5678
    protocol: TCP
    targetPort: metrics
  selector:
    app: prom-example
  sessionAffinity: None
  type: ClusterIP
---
apiVersion: monitoring.googleapis.com/v1alpha1
kind: PodMonitoring
metadata:
  name: prom-example
  namespace: gmp-test
spec:
  selector:
    matchLabels:
      app: prom-example
  endpoints:
  - port: 5678
    interval: 5s
Hello,
We currently support Pod scraping only due to potential scalability concerns with monitoring services and endpoints on larger clusters.
What about using the label selector defined on the Service's .spec.selector in your PodMonitoring's .spec.selector.matchLabels and specifying the underlying deployment's container port in .endpoints?
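For example, a minimal sketch of that workaround against the prom-example manifests above (field values are illustrative): the PodMonitoring selects on the same app: prom-example labels the Service selects on, and points at the named container port metrics rather than the Service port.

apiVersion: monitoring.googleapis.com/v1alpha1
kind: PodMonitoring
metadata:
  name: prom-example
  namespace: gmp-test
spec:
  selector:
    matchLabels:
      app: prom-example   # same labels the Service's .spec.selector uses
  endpoints:
  - port: metrics         # the container port (name or number), not the Service port 5678
    interval: 30s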
That's pretty much what we're doing right now; it just translates into more work for services like ArgoCD that publish metrics from multiple endpoints.
Gotcha. That makes sense.
For the time being, we don't plan on supporting ServiceMonitoring, mainly due to the scalability concerns I mentioned earlier. PodMonitoring also generally fits most use cases, albeit occasionally with workarounds, as you stated.
If you would like to leverage the prometheus-operator ServiceMonitor CRD, another option is to replace the OSS image with the gke.gcr.io/prometheus-engine/prometheus image in the prometheus-operator or kube-prometheus stack.
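As a rough sketch of what that override could look like in the kube-prometheus-stack Helm chart (the exact values keys vary by chart version, and the tag is a placeholder to be replaced with a published prometheus-engine release):

# values.yaml snippet for kube-prometheus-stack; keys may differ across chart versions
prometheus:
  prometheusSpec:
    image:
      registry: gke.gcr.io
      repository: prometheus-engine/prometheus
      tag: "<GMP release tag>"   # placeholder; use a tag published by prometheus-engine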
If I have N pods running behind my service, does a matching PodMonitoring scrape all of them at the same time? I have some metrics I don't want to scrape in parallel - getting the metrics from just one pod is enough - and for this, I think I'll need ServiceMonitoring.
@reith - Prometheus does not scrape targets at the same time, but uses an offset algorithm to spread the load amongst the targets found in a job.
Can you describe more what you're trying to scrape behind your service?
@pintohutch The metric I'm trying to push is directly read from a DB; it's not a per-pod metric and I don't need to calculate the metric - run the query - for each pod.
I know it doesn't sound perfect to push these metrics from pods, but they fit my architecture best. Ideally, I'd prefer GCP Monitoring to make a chart from the data that is also replicated to BigQuery, but that doesn't seem viable. I could also create another single-pod deployment, but it brings some overhead.
The metric I'm trying to push is directly read from a DB; it's not a per-pod metric and I don't need to calculate the metric - run the query - for each pod.
IIUC you have N pods, each with essentially the same set of metrics you're trying to scrape (i.e. foo_total=1 on all N pods?) And you don't want to have to write N copies of the data to GMP?
I know it doesn't sound perfect to push these metrics from pods, but they fit my architecture best. Ideally, I'd prefer GCP Monitoring to make a chart from the data that is also replicated to BigQuery, but that doesn't seem viable. I could also create another single-pod deployment, but it brings some overhead.
@lyanco may be able to answer any questions you have around product gaps in Google Cloud Monitoring, but I think we'd need a little more detail.
The metric I'm trying to push is directly read from a DB; it's not a per-pod metric and I don't need to calculate the metric - run the query - for each pod.
IIUC you have N pods, each with essentially the same set of metrics you're trying to scrape (i.e. foo_total=1 on all N pods?) And you don't want to have to write N copies of the data to GMP?
That's right. I think a ServiceMonitor would let me do this but I haven't worked with other implementations of Prometheus operators.
I think you can do this with PodMonitoring by adding a specific label to one of the pods and then adding that label to the selector field in the PodMonitoring.
Yea - if there's a particular pod that is, say leader: true or something, that could work.
It could also work if you have a special label on your metrics from one of the pods using metricRelabeling to drop the time series from other pods.
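A rough sketch of that idea, assuming the "leader" pod adds a hypothetical role="leader" label to the metrics it exposes so series from every other pod can be dropped at scrape time:

apiVersion: monitoring.googleapis.com/v1alpha1
kind: PodMonitoring
metadata:
  name: prom-example-leader-only
  namespace: gmp-test
spec:
  selector:
    matchLabels:
      app: prom-example
  endpoints:
  - port: metrics
    interval: 30s
    metricRelabeling:
    - sourceLabels: [role]   # hypothetical label exposed only by the leader pod
      regex: leader
      action: keep           # keep only series labeled role="leader", drop the rest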
The fundamental reason we haven't supported service-based monitoring is scaling concerns when running Prometheus as a DaemonSet: having every collector pod in a 1,000-node cluster watching K8s endpoints stresses the API server.
I think you can do this with PodMonitoring by adding a specific label to one of the pods and then adding that label to the selector field in the PodMonitoring.
The pods are part of a Deployment and have the same set of labels. I think it's an anti-pattern to treat specific pods of Deployments differently. I don't want to bother myself with relabeling pods once a pod crashes or the number of replicas decreases.
It could also work if you have a special label on your metrics from one of the pods using metricRelabeling to drop the time series from other pods.
I'd also like to decrease the number of redundant readings, not just the number of samples pushed to GMP.
The fundamental reason we haven't supported service-based monitoring is scaling concerns when running Prometheus as a DaemonSet: having every collector pod in a 1,000-node cluster watching K8s endpoints stresses the API server.
I don't understand why it'd need to watch endpoints. The operator could watch services and configure Prometheus to scrape cluster IPs, couldn't it?
I don't understand why it'd need to watch endpoints.
If we wanted service monitoring, it may be preferable to watch Endpoints so we can get the service labels, as well as the pod or node hosting the service (via __meta_kubernetes_endpoint_address_target_kind), to enrich the target relabeling we do (this is what prometheus-operator does, for example).
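For context, this is roughly what plain Prometheus Endpoints-based discovery provides. A sketch of a scrape config using those meta labels (the job name and relabeling rules are illustrative, and this is upstream Prometheus syntax, not GMP's):

scrape_configs:
- job_name: service-endpoints   # example job name
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind]
    regex: Pod
    action: keep                # only scrape endpoint addresses backed by Pods
  - source_labels: [__meta_kubernetes_service_name]
    target_label: service       # attach the Service name to the scraped series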
The operator could watch services and configure Prometheus to scrape cluster IPs, couldn't it?
Indeed, but that would essentially fold Prometheus service discovery features into the operator, which would be a pretty big expansion in concerns and complexity, not to mention the risk of OOMing the operator in larger clusters with dozens or hundreds of services (which is bad for a lot of reasons, not least because it also serves as our webhook server).
This is not to say it's impossible or that we won't pursue this feature some day, but it just hasn't been a high-enough priority to warrant feature development, as most users can get by with pod-level monitoring.
I had a discussion with @pintohutch regarding how to watch endpoints efficiently for service monitoring. Although it's more of a general discussion than GMP-specific, we both think others may find it helpful. So I'll post it here:
@simonpasquier @ArthurSens Please check my comment above. For more details please check my (just submitted) GSoC proposal.