Add support for filtering tags in k8s workloads.

Open · lmello opened this pull request 2 years ago • 13 comments

What does this PR do?

Just add the annotation tags.datadog.com/disable: "comma,separated,tags,to,be,disabled"
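
For illustration, a sketch of how this would look on a pod, based on the description above (the annotation name and value format come from this PR; the pod spec around it is hypothetical, and pod_name is the example tag from the Motivation section):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-service               # hypothetical pod
  annotations:
    # Proposed in this PR: comma-separated list of tags the Agent
    # should stop attaching to this pod's telemetry.
    tags.datadog.com/disable: "pod_name"
spec:
  containers:
    - name: app
      image: my-service:1.0      # hypothetical image
```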

Add support in taglist to remove tags.

The idea is to use this before calling the tags.Compute() method. The current implementation is done for k8s workloads, but it could be expanded to other places.

(PS: I am still learning Go, please forgive any mistakes.)

Motivation

I wanted to exclude tags from the Kubernetes workload for specific pods; this PR allows doing that at runtime.

Example: pod_name is high-cardinality; I want it enabled for some services, but not for others.

Additional Notes

It reuses an existing annotation and is compatible with previous use cases of the annotation (only kube_service was supported before my change.)

Possible Drawbacks / Trade-offs

Describe how to test/QA your changes

Tested using unit tests; planning to roll out the agent with the changes to a subset of our infrastructure.

Reviewer's Checklist

  • [ ] If known, an appropriate milestone has been selected; otherwise the Triage milestone is set.
  • [ ] Use the major_change label if your change either has a major impact on the code base, impacts multiple teams, or changes important, well-established internals of the Agent. This label will be used during QA to make sure each team pays extra attention to the changed behavior. For any customer-facing change, use a release note.
  • [ ] A release note has been added or the changelog/no-changelog label has been applied.
  • [ ] Changed code has automated tests for its functionality.
  • [ ] Adequate QA/testing plan information is provided if the qa/skip-qa label is not applied.
  • [ ] At least one team/.. label has been applied, indicating the team(s) that should QA this change.
  • [ ] If applicable, docs team has been notified or an issue has been opened on the documentation repo.
  • [ ] If applicable, the need-change/operator and need-change/helm labels have been applied.
  • [ ] If applicable, the k8s/<min-version> label has been applied, indicating the lowest Kubernetes version compatible with this feature.
  • [ ] If applicable, the config template has been updated.

lmello avatar Oct 16 '23 14:10 lmello

CLA assistant check
All committers have signed the CLA.

bits-bot avatar Oct 16 '23 14:10 bits-bot

Can someone provide feedback on this PR?

lmello avatar Jan 05 '24 19:01 lmello

@lmello maybe you need to contact support to get them to send engineers here to look at it? I'm interested in your feature being added, so I hope they'll expedite the process.

2rs2ts avatar Jan 10 '24 19:01 2rs2ts

maybe you need to contact support to get them to send engineers here to look at it? I'm interested in your feature being added, so I hope they'll expedite the process.

@2rs2ts I gave up; I contacted support and our TAM, and it didn't move up. We ended up building our own pipeline, using Vector, to sanitize and clean tags before shipping metrics and logs to Datadog.

lmello avatar Jul 23 '24 00:07 lmello

I gave up; I contacted support and our TAM, and it didn't move up. We ended up building our own pipeline, using Vector, to sanitize and clean tags before shipping metrics and logs to Datadog.

Boy is that a common story. Sorry to hear it, but I'm not surprised; it feels like engineering must be a skeleton crew over there

2rs2ts avatar Oct 23 '24 22:10 2rs2ts

It's amazing that someone at Datadog won't even ack this PR. The company is just telling us that all they care about is money, and they're making firewalling tags in metrics so hard that no one wants to do it.

mzupan avatar Apr 23 '25 16:04 mzupan

This pull request has been automatically marked as stale because it has not had activity in the past 15 days.

It will be closed in 30 days if no further activity occurs. If this pull request is still relevant, adding a comment or pushing new commits will keep it open. Also, you can always reopen the pull request if you missed the window.

Thank you for your contributions!

dd-octo-sts[bot] avatar Oct 18 '25 22:10 dd-octo-sts[bot]

Incredible that if the Datadog engineers just don't get to a PR in a month, the bot will just close it. That's insane and will definitely lead to a lot of wasted work. And spamming OP with messages doesn't help anybody...

2rs2ts avatar Nov 05 '25 01:11 2rs2ts

Hi @2rs2ts,

Thank you for taking the time to share your feedback. We recently introduced a system to help manage long-standing issues through automated stale management. We understand this may have come as a surprise for some existing issues. We want to apologize if it felt that way. It was not our intent, and we want to reassure you that every issue is important to us and reviewed carefully. For your specific issue, we’re reviewing it to ensure the discussion progresses and receives the right level of attention. Thank you for helping improve the Datadog Agent.

— The Datadog Agent Team

chouetz avatar Nov 12 '25 17:11 chouetz

Hi @lmello,

First of all, I want to sincerely apologize for the long silence on this pull request. You’re absolutely right; it’s completely unacceptable for a contribution to wait this long, and I’m truly sorry that your work didn’t receive the attention it deserved. This PR unfortunately fell through the cracks due to a missing triage label, and we’ve started an internal initiative to ensure this kind of situation doesn’t happen again.

Regarding the intent of this change: the need you identified is totally valid, and we’ve heard similar feedback from other users over time. However, we believe the specific implementation in this PR could introduce a few significant issues.

  • Metric collisions: Allowing arbitrary tag removal can cause collisions in the Agent’s metric aggregator (for example, with gauge metrics), potentially producing inaccurate values.
    • For instance, removing the pod_name tag when two pods from the same deployment run on the same host can merge their metrics incorrectly.
  • Impact on official integrations: Since the tag removal happens at the “tagger” component level, it would also affect built-in integration metrics and internal product features that rely on consistent tagging, leading to degraded functionality and user experience.

The good news is that over the past year we've developed a safer and more flexible approach that addresses this same need. You can now configure tag cardinality for DogStatsD metrics directly, either at the Agent, client, or even metric level, as sketched after the list below.

  • The Agent setting dogstatsd_tag_cardinality supports none, low, orchestrator, or high.
    • For Kubernetes, we recommend orchestrator to avoid the type of metric collisions mentioned above.
  • DogStatsD client libraries now also support configuring cardinality:
    • Globally via the DD_CARDINALITY environment variable (set in the application's environment) or client option.
    • Per-metric when submitting new data points.
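
As a rough sketch of both options above (the setting name, environment variable, and accepted values are taken from this comment; the file placement and manifest fragment are illustrative):

```yaml
# datadog.yaml: Agent-level default for DogStatsD metrics
dogstatsd_tag_cardinality: orchestrator   # one of: none, low, orchestrator, high
```

```yaml
# Client-level: set DD_CARDINALITY in the application's pod spec
# (hypothetical deployment fragment)
env:
  - name: DD_CARDINALITY
    value: "low"
```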

You can find more details here:

Would this new capability address the original problem you were trying to solve?
If not, we’d love to continue the conversation to explore other options within the Agent or the client libraries.

Your feedback is extremely valuable, and we truly appreciate your patience and the effort you’ve put into contributing to the project.

Thanks again for raising this, and sorry once more for the delay.

clamoriniere avatar Nov 13 '25 21:11 clamoriniere

Can't speak for OP, but FWIW, I think I filed a support ticket saying "please implement this person's FR" a long time ago, so I can answer for myself. And then with OP's feedback maybe we will have a picture of where to go from there.

So, the ability to set cardinality levels at the DogStatsD client is pretty good. I would want to see the same for checks cardinality. A problem we've had to deal with is needing to add custom tags of different names for stuff that exposes prometheus metrics that give you stats on pods, nodes, etc. The prometheus check will collect the name of the pod, node, etc. that's providing those metrics, which would get conflated with the prometheus labels trying to convey the same kind of info, but about other pods/nodes. So, we'd just create custom labels or mappings, adding to our cardinality and making monitors and dashboards way more annoying to write.

Also, the way you can only modify the cardinality level with low granularity is disappointing. It'd be better to get as granular as possible. But yeah, overriding a particular client to use low cardinality instead of orchestrator, then having to add in some tags back manually, could be useful for some cases. It's definitely better than nothing, and I appreciate it existing.

2rs2ts avatar Dec 01 '25 23:12 2rs2ts

Hi @2rs2ts 👋

So, the ability to set cardinality levels at the DogStatsD client is pretty good. I would want to see the same for checks cardinality.

Good news — this capability already exists for check instances. The documentation is a bit hard to find, but you can configure per-check tag cardinality here:
https://docs.datadoghq.com/getting_started/integrations/#per-check-tag-configuration
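
A rough sketch of what that can look like in a check's conf.yaml, assuming the OpenMetrics check: the endpoint and metric names are hypothetical, and the per-instance cardinality key shown here is an assumption, so confirm the exact field name against the linked page:

```yaml
# conf.d/openmetrics.d/conf.yaml (illustrative)
instances:
  - openmetrics_endpoint: http://my-app:8080/metrics   # hypothetical target
    namespace: my_app
    metrics:
      - requests_total
    # Assumed per-instance key (verify against the docs linked above);
    # cardinality levels: none | low | orchestrator | high
    tag_cardinality: low
```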

Let us know if this works for your Prometheus checks.
If you need even more control over the data exposed by a Prometheus/OpenMetrics check, you can also build a custom integration as described here:
https://docs.datadoghq.com/developers/custom_checks/prometheus/

clamoriniere avatar Dec 08 '25 15:12 clamoriniere

@clamoriniere

Good news — this capability already exists for check instances.

Thanks, that's great to hear! I wish it was easier to find that stuff in the docs, but then again, it's not like I actively looked for it; I just assumed, based on the earlier reply, that it was only for StatsD clients.

Let us know if this works for your Prometheus checks.

I'll test it out next time it comes up, but FWIW, rather than "working" for me, it just probably gets me closer to my ideal setup (e.g., get rid of a lot of stuff I don't want, at the cost of a few things I do want that I will have to add back in manually or possibly do without.) I believe this will be true for most users: their particular situations call for different sets of tags, and the cardinality levels are not granular enough to align for most of them. But, again, this is way better than not having the option!

If you need even more control over the data exposed by a Prometheus/OpenMetrics check, you can also build a custom integration

Yeah, in our case we were scraping metrics we created ourselves, so implementing a custom check was not necessary (we'd have to do this work either way, might as well create only one moving part, rather than two.)

2rs2ts avatar Dec 15 '25 23:12 2rs2ts