Gate histogram features with a feature flag
Before merging the sparsehistogram branch into main, we need to gate all features related to native histograms behind a feature flag.
Some code paths don't need gating, though, because they will never be taken if, e.g. native histograms are never ingested in the first place.
The UI is a bit tricky here because e.g. it shouldn't auto-suggest PromQL functions related to native histograms, but it cannot directly access the feature flags set. @Nexucis has prepared a solution here: #11090 .
This issue probably deserves a TODO list because it will comprise many sub tasks.
@beorn7 do we only want to have a flag for ingestion of histograms while always allowing querying of native histograms even after disabling its ingestion? Or does turning off native histograms flag stops both ingestion and querying for them?
Good question. I guess for the least amount of surprise, there should be no visible change of behavior. I.e. if we only disable ingestion, we would still show histogram related PromQL functions in the autocomplete, and we would not error out on any usage of those functions. Maybe that's not a big deal, but it might be better if we let people decide that are not so deeply involved in the histogram development. Their perspective might be closer to that of a normal user.
The UI is a bit tricky here because e.g. it shouldn't auto-suggest PromQL functions related to native histograms, but it cannot directly access the feature flags set. @Nexucis has prepared a solution here: https://github.com/prometheus/prometheus/pull/11090 .
Actually it won't be that tricky. Once the feature flag is added to the code, we just need to modify the UI to get the status of the flag and made the trick to filter or not the associated keywords :)
get the status of the flag
@Nexucis do you also get the config file info in the UI?
@codesome for the moment no. Why ? Do you think it's better to get the config instead of the status flag ?
In https://github.com/prometheus/prometheus/pull/11253 I am planning to make it a runtime config for histograms. I don't know how the status flag is connected to this. If we can get this info in status flag, then it should work.
https://github.com/prometheus/prometheus/issues/11168#issuecomment-1219614131
I was thinking about this earlier and I thought about Prometheus's guarantee of read after write and how we should respect it i.e what has been ingested should be always queryable even if then we disable ingesting new histograms data.
The only worry I have is what if there is a bug in one of the new query paths for histograms and we have no control to disable them to avoid issues. But then again I'm not yet that familiar with the code so I don't know if an issue querying histograms could impact querying other series.
Also, since "experimental" means that we do not have any stability guarantees, we might (need to) change the on-disk format of histograms further down the road. In that case, feature flag enabled or not, you won't be able to read what an older Prometheus server has written using the previous version of the (experimental) histogram format.
In a way, the moment you use an experimental feature, you are waiving all guarantees anyway, so I tend more and more towards only gating the ingestion. That should be enough that a user never flipping the feature flag will never notice anything.
What's a bit more subtle in accomplishing the latter is if we need to gate things in PromQL so that e.g. histogram_fraction keeps yielding an error or if it is fine that it will suddenly work (and return empty results as long as no actual histograms have been ingested). Similarly (and already discussed above) we need to think about autocompletion and tool tips. For the latter, I think it's more important to not show anything histogram related because it will be very confusing for users not in touch with histograms.
In #11253 I am planning to make it a runtime config for histograms. I don't know how the status flag is connected to this. If we can get this info in status flag, then it should work.
mmm from my knowledge, the flag status and the flag are supposed to be static, so I don't see how it is going to change at runtime.
But if the preferred solution is to have a runtime config and the UI need to get it to know if histogram is active or not, then I guess we can do it. As long as you have an HTTP endpoint that is providing it of course and a nice json struct so it is easy to extract the useful information
by the way, for an experimental feature, is it really worth it to have a runtime way to deactivate it ? That's quite unusual comparing to other experimental feature.
From my understanding, experimental feature has been always activated through a flag. There is certainly a need/requirement I don't get here with the runtime activation.
@Nexucis yes, you are right. The idea surely comes from those offering Prometheus as a service, as they use the code in a multi-tenant setup and don't want to restart everything if a single tenant wants the feature activated. That alone shouldn't be a reason to have runtime configurability in prometheus/prometheus. But perhaps there are other reasons. Generally, I like to think if a configuration is easy to implement it in a way changeable during runtime, we should consider doing so. (There had been some other discussions to move more flags into the config.)
So yeah, in general, I'd like to see more feedback on the runtime config for feature "flags" to see if it is worth it outside the multi-tenant use case above (which could still be implemented in other ways as long as we allow the plumbing for it to be in the code).
~How do you feel about this being a runtime reloadable config via files vs re-using the current CLI flag to enable this feature? Having it in the file will not require a restart to enable or disable this.~
Looks like there is a consensus on using the CLI flag https://github.com/prometheus/prometheus/pull/11253
so for the followup around the topic: should we remove the histogram functions from the autocomplete list when the flag is not set. From the last comment/suggestion on #11253, we can of course just put a tooltip explaining this is an experimental feature and behaviour might change in future release.
It's really depending of what you want @codesome @beorn7, just you can know it's possible to deactivate or add the tooltip, both options are viable :). On my side I don't have a strong opinion. It might felt weird to use the histogram function and the experimental flag has never been activated. But that's maybe a corner case.
@Nexucis At the current state, I think it makes sense to always show the autocomplete with the message saying this is for the experimental histogram feature. I think this is also a good way to make people aware of native histograms for those who do not look at the changelog carefully.
That said, I am moving the issue from P1 to P2 since the requirements for P1 is satisfied.
What we have now is the absolute minimum: Gate the ingestion of histograms (and point out in the documentation, including tool tips etc., that histogram features are experimental and ingestion has to be enabled by a feature flag – this latter point is still TODO).
If we go with that, we will process histograms in PromQL if encountered in the storage, even if the feature flag is not set. We will not error out on using histogram_fraction. We will also silently allow scraping using the old protobuf format again (which is a feature we might drop again, so at the very least we have to point out in the documentation that it is not a stable feature). And there might be a few other subtle things in the same spirit as the protobuf ingestion (where the relation to Native Histograms isn't immediately obvious).
The main problem with that is that users might accidentally use those experimental features, unaware of their experimental nature. They might then be taken by surprise if those features suddenly change or disappear.
Personally, I'm fine with the minimal approach because I think it's sufficiently unlikely that users will run into them without realizing what they are doing. However, I would like to make sure that we are all on the same page. In particular @roidelapluie and @LeviHarrison as the relevant lead maintainers should agree.
I'm fine with this approach too :)
I had one more line of thought about this:
- To scrape Native Histograms, Prometheus needs to content-negotiate the old protobuf format for now.
- client_golang still supports that, even if no Native Histograms are used.
- Therefore, if we don't gate protobuf scraping by the feature flag, all updated Prometheus servers in the world will suddenly scrape targets instrumented with client_golang via protobuf (again).
- I would expect that this will improve things in most cases: client_golang uses the rather inefficient official protobuf Go code, but the expensive code paths are taken anyway, even if serving text format. In total, I would expect less usage of CPU, RAM, and the network.
- However, we cannot be 100% sure that will be the case in all situations. In some scenarios, things might get worse. Also, it is a quite invasive change in how things are done, and it affects a lot of scrapes in the world.
My conclusion here is that we should probably gate the protobuf scrapes with a feature flag, too. The question is if it should be a separate one (so that you could use protobuf scraping without using Native Histograms) or if the one histogram flag should also gate protobuf scraping (practically it makes more sense, but it has a weird semantics – on the other hand, even with two separate flags, we should probably let the histogram flag implicitly activate the protobuf flag).
Any thoughts?
Bumped to P1 again, as we should really not merge before we have at least an informed decision to not gate protobuf scraping (or, what I think is more likely, gate protobuf scraping after all).
With some offline discussions, since no one is asking for protobuf scraping actively, we decided to enable protobuf scraping iff histograms are enabled. Else it stays disabled by default.
Maybe we can now even close this? I will amend the autocomplete doc strings to mention that it is an experimental feature, and then I think we don't need to gate the features in the UI. @Nexucis seems to be fine with that approach, see above.
Closing for now, simply re-open if you disagree with my statement above.
totally ok with that @beorn7 :)