Daniel Clark issues

Results: 21 issues of Daniel Clark

Use cloudbuild for multi-arch images

This was running too slowly locally and is usually not needed during normal iteration. Move the multi-arch images to cloudbuild. Also add some doc nits in the manifests.

feat: initial Prometheus analyzers

Closes # ## 📑 Description Added a Prometheus integration with two analyzers: 1. `PrometheusConfigValidate` 2. `PrometheusConfigRelabelReport` The integration does not deploy any Prometheus stack in the cluster. Instead, it searches...

Access NVIDIA GPUs in K8s in a non-privileged container

Hello - I'm trying to see if it's possible to deploy NVIDIA DCGM on K8s with the `securityContext.privileged` field set to `false` for security reasons. I was able to get...

[DRAFT] Prevent pod restarts on startup

It is tempting to rely only on defaulting webhooks to ensure that any change to the OperatorConfig updates the `.collection.externalLabels` and `.rules.externalLabels` fields with the default project, location, and cluster labels....

[Design] Optimizations for target status reporting

The feature [doesn't scale well](https://github.com/GoogleCloudPlatform/prometheus-engine/issues/774) in larger clusters with lots of PodMonitorings. Let's explore ways to reduce the resource footprint of the operator when this feature is enabled. Acceptance Criteria:...

Investigate failing "open" for some CRs

In cases where the webhooks can't reach the operator (e.g. operator OOMs), is it worth trying out a [`failurePolicy=Ignore`](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#failure-policy) in some cases? Acceptance Criteria: - Assess trade-offs of "failing open"...

Re-introduce controller-runtime to e2e tests

If feasible, this would be nice: controller-runtime's API is easier to manage, and it lets us avoid regenerating clientsets that are only used for testing.

Add TLSInsecureSkipVerify to NodeMonitoring

This would let NodeMonitoring scrape the kubelet in clusters with self-signed certs (e.g. kind). Akin to https://github.com/GoogleCloudPlatform/prometheus-engine/issues/223 but for NodeMonitoring.

[Design] Dynamic resource usage for GMP operator

Can we find ways to avoid OOM crashes in the gmp-operator? Maybe using a [VPA](https://gist.github.com/pintohutch/65bc578f1ca7f9d07ad44ff944168bb6)? Acceptance criteria: - Proposal with design and trade-offs

Add project_id, location, and cluster labels via relabeling to kubelet metrics

The hardcoded `scrape_config` for the kubelet does [not include](https://github.com/GoogleCloudPlatform/prometheus-engine/blob/f1923f31bfc1c75457198674d865b45630938afc/pkg/operator/collection.go#L558-L563) `project_id`, `location`, or `cluster`, which is in contrast to the `scrape_config` [relabeling](https://github.com/GoogleCloudPlatform/prometheus-engine/blob/f1923f31bfc1c75457198674d865b45630938afc/pkg/operator/apis/monitoring/v1/types.go#L626-L642) from `PodMonitoring`. In practice, this isn't a big...
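
As a rough illustration of what the requested relabeling would do to a kubelet target's label set, here is a minimal Python sketch. It is not the operator's code. The function name `add_static_labels`, the label values, and the example target are all hypothetical. It mimics the effect of Prometheus `replace` relabel rules that use a fixed replacement and no source labels, which set the target label unconditionally.

```python
from typing import Dict

# Hypothetical values for illustration; a real cluster supplies its own.
STATIC_LABELS = {
    "project_id": "example-project",
    "location": "us-central1",
    "cluster": "example-cluster",
}


def add_static_labels(
    target_labels: Dict[str, str], static_labels: Dict[str, str]
) -> Dict[str, str]:
    """Return a copy of target_labels with every static label set.

    This mirrors a `replace` relabel rule with a fixed replacement:
    the target label is written whether or not it already exists.
    """
    relabeled = dict(target_labels)  # leave the input untouched
    for name, value in static_labels.items():
        relabeled[name] = value
    return relabeled


# Hypothetical kubelet target before relabeling.
kubelet_target = {"job": "kubelet", "instance": "node-1:10250"}
relabeled = add_static_labels(kubelet_target, STATIC_LABELS)
print(relabeled)
```

With labels attached this way, kubelet series would carry the same `project_id`, `location`, and `cluster` labels that the issue describes for `PodMonitoring` targets.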