common icon indicating copy to clipboard operation
common copied to clipboard

Release v1.0.0 stable version of common (or subset of packages)

Open bwplotka opened this issue 10 months ago • 26 comments

Hi 👋🏽

It's time to take common to a higher stability level. We damage the ecosystem by making breaking changes (I authored some 🙃 ) e.g. https://github.com/thanos-io/thanos/pull/8162 (pulling some dep suddenly break another dep etc.).

  • Why not doing this?
  • Should we have a separate common/exp module with 0.x like we do in client_golang now for experimental packages (that we can move to stable mod one day?)
  • Should we be more strict on merging new code (in terms of API flexibility, naming) given it will be hard to change?
  • Do we need to do any major changes before 1.0?

If we need more data on how much problematic 0.x common version is, to motivate this, I can try to research exact numbers, let me know!

cc @aknuds1 @SuperQ

bwplotka avatar Mar 17 '25 21:03 bwplotka

Should we have a separate common/exp module with 0.x like we do in client_golang now for experimental packages (that we can move to stable mod one day?)

I lack experience personally with having an exp module, but it sounds like it could be a good idea, to allow for experimentation.

Do we need to do any major changes before 1.0?

I'd like for the otlptranslator package to be put to use in intended clients, e.g. Prometheus and OTel Collector, before stabilizing. I'm not sure about its current API.

aknuds1 avatar Mar 18 '25 06:03 aknuds1

I don't think an exp module would be useful, because as soon as we depend on it the stability requirements are the same. "There's no such thing as a temporary patch".

We already do have good practices around breaking changes.

For example with promslog, we actually created a whole new package in order to keep the existing promlog package in place. We created a release and converted the whole ecosystem over. Then we marked the promlog package deprecated and cut another release. It wasn't until another couple releases that we removed the old promlog package. This was ~6 month process.

There is no preventing change. We must move our codebases forward. We can do our best to review changes to make sure interfaces are stable and well designed. But mistakes happen, design patterns change, and we need to publish a breaking change.

I don't think we need or want a 1.0 of this library.

How would this actually work? What's the day-2 plan? Would we keep release-1.x branches? How long would we support them for?

IMO we don't have the people-time to maintain several release branches over any time in order to support older major versions. We barely do this (LTS) for prometheus itself.

I also think this will just churn the major version of the module for no utility. And that's going to be even more painful for us, as well as downstream users.

Also, how do you define breaking? Just function signatures? Structs? Just function removals? There are lots of small things that could be breaking or deprecated. Any behavior change comes with concequences.

I think we need to improve our design and review quality.

Easy to say, hard to do. But that's where I'm at with what are the realistic ways we can improve the quality of life for downstream users of our code. Similar to how Linux has a "don't break userspace" creed, I think we need to consider our downstream users more when making changes.

SuperQ avatar Mar 18 '25 07:03 SuperQ

Nice writeup @SuperQ :)

aknuds1 avatar Mar 18 '25 08:03 aknuds1

I think we need to improve our design and review quality.

@SuperQ @bwplotka Is there anything concrete we can do in this regard? As a concrete example, I think actually the otlptranslator package shouldn't have been merged into main before making a PoC of its integration into relevant client projects (Prometheus, OTel Collector, Mimir, Thanos, others?). Hammering out the integration in retrospect has me feeling like I'm catching up.

aknuds1 avatar Mar 18 '25 09:03 aknuds1

As a concrete example, I think actually the otlptranslator package shouldn't have been merged into main before making a PoC

Perhaps we shouldn't have it in common. We could/should reduce the friction to creating new repos for Go libraries that don't need to be in common.

The definition in the README.md is this:

This repository contains Go libraries that are shared across Prometheus components and libraries. They are considered internal to Prometheus, without any stability guarantees for external usage.

If a library is meant for external use, perhaps it needs to be split out.

SuperQ avatar Mar 18 '25 09:03 SuperQ

If a library is meant for external use, perhaps it needs to be split out.

That's a good point. @ArthurSens WDYT?

aknuds1 avatar Mar 18 '25 09:03 aknuds1

I'm not sure if Bartek created this issue just because of otlptranslator, but I'm happy to move it elsewhere, no problems!

Regarding releasing Common 1.0, I generally agree with what Ben mentions above. The problem I'm seeing here in this repo is that it does way too many things, and we have way too few hands to maintain such a generic codebase. Some things are not that important and will need a 1.0 release with all the complicated stability guarantees, but other parts have so much adoption that changing any part of it will break dozens of downstream projects.

It would make sense to create a separate repository and assign maintainers who show expertise in the most adopted parts of the codebase. I can take ownership of otlptranslator and help with the expfmt package, but I'm oblivious to what the other packages do or should do.

ArthurSens avatar Mar 18 '25 11:03 ArthurSens

@bwplotka Should we require CI checks and reviews before being able to merge a PR? I notice they are currently not required.

aknuds1 avatar Mar 19 '25 06:03 aknuds1

Perhaps we shouldn't have it in common. We could/should reduce the friction to creating new repos for Go libraries that don't need to be in common.

Agree. Maybe we should offer a template Go repo so it's easier to create those. Also automation to release it (go releases, our CI GH actions propagation etc).

The definition in the README.md is this:

This repository contains Go libraries that are shared across Prometheus components and libraries. They are considered internal to Prometheus, without any stability guarantees for external usage.

If a library is meant for external use, perhaps it needs to be split out.

Let's talk about this more. (:

One could argue that this is no longer true ("considered internal to Prometheus"). There are pieces that leaks to external use, e.g.:

  • Our unlucky global variable that controls validation (https://github.com/prometheus/common/blob/main/model/metric.go#L49). Changing it's default value, changes validation behaviour which absolutely breaks external behaviour of stable projects that imports common for this e.g. Prometheus, client_golang, Kubernetes, Otel SDK etc. (See https://github.com/prometheus/client_golang/releases/tag/v1.21.0 as the consequence).
  • expfmt pieces for auto-generating _total suffixes, which client_golang maintainers were not aware of. It's so hidden that if common maintainers will decide to change at some point, it's very easy to miss on client_golang side. It might some change that might slip even through you review @SuperQ.
  • I see many projects in ecosystem using promlog/promslog package for consistent logging. Not only Prometheus (e.g. Thanos).

It's just not possible to stop users from using useful common utilities outside of Prometheus 🙃

Second argument/question is.. should stable module depend on unstable module? How can we ensure stability if dependency can break anytime? Doesn't go mod tooling punish quite much deps with indirect deps that have breaking changes? It's ok for binaries maybe, but is it ok for libraries like client_golang?

How would this actually work? What's the day-2 plan? Would we keep release-1.x branches? How long would we support them for?

What do you do with Prometheus 1.x version? You keep it, don't support it. That's ok and enough. Let's give option for people to control when they have time to make manual changes to their code, not wake downstream users in random moments because common hiddently broke 4 layers down the upgrade for irrelevant pieces of code.

IMO we don't have the people-time to maintain several release branches over any time in order to support older major versions. We barely do this (LTS) for prometheus itself.

Yes. But equally we don't have people-time to deal with the common upgrade breaking other people code the moment they upgrade client_golang (e.g. https://github.com/prometheus/client_golang/releases/tag/v1.21.0) which suppose to be stable. It just feels like a ticking bomb.

Also, how do you define breaking? Just function signatures? Structs? Just function removals? There are lots of small things that could be breaking or deprecated. Any behavior change comes with concequences.

All of the above like client_golang is enduring for years, with all it's glory and flaming feedback when we break people by mistake 🙃 (or intentionally - sometimes you do break but you make it clear and loud, and you give mitigation plans).

bwplotka avatar Apr 01 '25 22:04 bwplotka

One alternative is for client_golang to stop depending on prometheus/common which might be the desired outcome here. We could vendor what we need if API stability is not on the roadmap for this module.

I definitely agree common has so many different packages with variety of stability needs.

bwplotka avatar Apr 01 '25 22:04 bwplotka

If anyone is interested on why stability of Go libs is essential, feel free to read a short story from @dims on k8s channel: https://kubernetes.slack.com/archives/CHGFYJVAN/p1741621071818569

bwplotka avatar Apr 08 '25 10:04 bwplotka

Just linking another argument for v1. By chained dependency, stable Otel Go SDK depends on stable client_golang which depends on unstable common which causes some deprecation risks.

bwplotka avatar Apr 09 '25 10:04 bwplotka

So the tag v1.20.3 should not be used? renovate suggested the update to me

ptman avatar Sep 03 '25 06:09 ptman

@ptman I can see there's a series of v1.x.x tags, but I have no idea why they were made. I'm unable to find any corresponding releases. @ArthurSens Might you know the history of the v1 series of tags? The latest one, v1.20.3, references a commit of yours.

aknuds1 avatar Sep 03 '25 09:09 aknuds1

Those are likely missapplied tags that were meant for client_golang

SuperQ avatar Sep 03 '25 10:09 SuperQ

I've raised an issue for this, as its breaking my linting https://github.com/prometheus/common/issues/831

TomHellier avatar Sep 03 '25 12:09 TomHellier

Weird, I don't even know how that could happen. I haven't pushed any tags anywhere in the last days

ArthurSens avatar Sep 03 '25 12:09 ArthurSens

I don't know when the tag was created, but the referenced commit is from last year.

aknuds1 avatar Sep 03 '25 12:09 aknuds1

Another data-point: https://github.com/kubernetes/kubernetes/issues/133878

bwplotka avatar Sep 03 '25 17:09 bwplotka

Discussion: https://cloud-native.slack.com/archives/C01AUBA4PFE/p1756921591447069

bwplotka avatar Sep 04 '25 04:09 bwplotka

We chatted about this in client_golang maintainers sync.. I think we would love to maintain a stable version of code for the pieces client_golang depends and ensure client_golang does not import common.

Things that are important for client_golang are expfmt and some model pieces only. One way of doing this is to move (or copy) some or all pieces of expfmt and model to client_golang under 1.0, with a slight API refactor that will allow future flexibility (variadic options etc, review what could be removed).

I was thinking about creating client_golang/exposition for expfmt that will also include experimental OM2 implementation and perhaps decoders that use ABNF (cc @ywwg). For model we only use it in experimental api package, so we could vendor it (don't share it with public) or check what exactly is needed from model.

Then for common/expfmt and common/model -- it's up to the common module maintainers, but I think stable client_golang/exposition package would be a good replacement, so common could deprecate its own common/expfmt and Prometheus could use client_golang ones too. For model, I think we don't plant to share it outside of client_golang so common will be one to use.

Just early thoughts, WDYT? (:

bwplotka avatar Sep 13 '25 13:09 bwplotka

I think it might be better to move expfmt to it's own repo.

SuperQ avatar Sep 13 '25 17:09 SuperQ

What is the aim of moving things around to different repos? Is the idea to stabilize expfmt/model and allow the other libraries in common to remain without a stability guarantee? (Why not do as this issue says and just stabilize this library?) Or is the idea to have a stable version of the code in one place, and a more experimental version elsewhere? Moving those libraries will cause a lot of refactoring as every reference to "model.Foo" will need to be changed in all downstream projects, so I'd prefer a strong reason for relocating the code.

ywwg avatar Sep 15 '25 14:09 ywwg

@SuperQ

I think it might be better to move expfmt to it's own repo.

Yup, that's a solid option, but it's generally more expensive to maintain and have some risk on breaking changes management (they will occur as @SuperQ you mentioned (hopefully less though), even for v1, just it would be easier to coordinate in client_golang).

Was thinking about the client_golang because it exists and have stable releases already and it's arguably the main user of expfmt module (critical path). It's also popular for client_golang users to also use expfmt directly to test things out additionally etc. I'd argue, there is no or almost no Go module that use expfmt but does NOT use client_golang. We can ensure common maintainers have explicit ownership there etc.

@ywwg

What is the aim of moving things around to different repos? Is the idea to stabilize expfmt/model and allow the other libraries in common to remain without a stability guarantee?

That's exactly why yes.

(Why not do as this issue says and just stabilize this library?)

@SuperQ stated it clearly above and I partially agree - there's a huge cost in making things stable and majority of packages in common does NOT need to (or shouldn't be) be stable.

Or is the idea to have a stable version of the code in one place, and a more experimental version elsewhere?

I don't think this is the main reason, we don't plan major changes, we just want to reduce risk of breaking API.

Moving those libraries will cause a lot of refactoring as every reference to "model.Foo" will need to be changed in all downstream projects, so I'd prefer a strong reason for relocating the code.

Yes, that's why I'd propose we move expfmt to stable client_golang and find a way to NOT stabilize model (e.g. client_golang can vendor). Plus some migration period for common users and explicit move by us in Kubernetes and Prometheus.

bwplotka avatar Sep 15 '25 15:09 bwplotka

Looking at the commit history, the parts of common that are not expfmt/model don't have much in the way of recent contributions... which modules do you have in mind that still need the flexibility of no stability guarantee?

ywwg avatar Sep 15 '25 17:09 ywwg

It's not about recent changes but intention. Are we sure other parts have an API that we can maintain in stable manner?

One example for other packages was promlog. It was eventually removed, as common maintainers did not want to drag this tech debt further and this is why you keep experimental support in general (https://github.com/prometheus/common/issues/771#issuecomment-2731987593)

For example with promslog, we actually created a whole new package in order to keep the existing promlog package in place. We created a release and converted the whole ecosystem over. Then we marked the promlog package deprecated and cut another release. It wasn't until another couple releases that we removed the old promlog package. This was ~6 month proces

The promslog is new and there's some high risk of new changes.

BTW, fun story: In Google alone (at least places I know of) we didn't migrate yet off promlog. I know about at least ~2 SWEw of work to move to a new package in my near area but likely SWEmonths+ in wider Google ecosystem. I'm sure the wider ecosystem scale has even bigger workload caused by seemingly easy decision on our side. It's not our fault, we created this code for free and majority of users of this code never helped us, so we can take it back anytime -- I just want give a perspective of the alternative cost and the effective trust we are maintaining. For example keeping promlog package forever with clear no maintenance mode, would likely save considerable amount of $$$ on ecosystem and add near zero cost to us.

FYI this is a policy that Google tries to use for OSS maintenance and is to mitigate things we discussed here. I don't think we have engineering time to use this for common or even main prometheus repo. However, I think it's worth to make stability for critical pieces.

To make client_golang great again I think we simply need expfmt (:

bwplotka avatar Sep 16 '25 11:09 bwplotka