Renamed packages are flagged immediately

Open smoelius opened this issue 1 year ago • 8 comments

Suppose a package X is renamed as Y. Then package X will be flagged immediately with "X (not in <repo>)". This is a false positive.

You can see this right now with the icu_locid package:

$ cargo unmaintained -p icu_locid
Scanning 1 packages and their dependencies (pass --verbose for more information)
1/1 (100%)        
icu_locid (not in https://github.com/unicode-org/icu4x)

The package was renamed recently: https://github.com/unicode-org/icu4x/blob/83441614dd3b7ce628031926c279fb25c97f7461/CHANGELOG.md?plain=1#L21-L22

I am not yet sure how best to fix this.

smoelius avatar Nov 17 '24 19:11 smoelius

If a package cannot be found by name, maybe it could be identified by other characteristics.

smoelius avatar Jan 17 '25 12:01 smoelius

@smoelius, can I work on this issue?

behalshabnam avatar Mar 30 '25 06:03 behalshabnam

@behalshabnam How would you do it, i.e., what would you change?

I'm asking because I think this issue may require some research.

Here was one idea I had. Given two packages X and Y, compare all of their fields except their names. If only their names differ, consider the packages the same.

But that idea may be silly. There may be fields that one would expect to differ and that should be ignored. I wouldn't know unless I experimented with this idea.
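
As a rough illustration of the "compare everything but the name" idea, here is a minimal Rust sketch. The `PackageMeta` struct and its fields are hypothetical stand-ins, not cargo-unmaintained's actual types, and which fields belong in the comparison is exactly the open question.

```rust
/// Hypothetical subset of a package's manifest metadata. The field names are
/// illustrative only and are not taken from cargo-unmaintained.
#[derive(Clone, Debug, PartialEq, Eq)]
struct PackageMeta {
    name: String,
    version: String,
    authors: Vec<String>,
    repository: Option<String>,
    description: Option<String>,
}

/// Returns true if `x` and `y` agree on every field except `name`.
fn same_except_name(x: &PackageMeta, y: &PackageMeta) -> bool {
    // Copy `x`, overwrite its name with `y`'s, and let the derived equality
    // compare all remaining fields. Any field added to the struct later is
    // then included in the comparison automatically.
    let renamed = PackageMeta {
        name: y.name.clone(),
        ..x.clone()
    };
    renamed == *y
}
```

Because the comparison is exact, any field that legitimately changes during a rename (a bumped version, say) makes the match fail, which is one way to "see what breaks".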

Are your thoughts similar? Or completely different?

smoelius avatar Mar 30 '25 12:03 smoelius

@smoelius I've looked into this issue, and my thoughts are similar to yours.

Instead of just comparing names, we can compute a “fingerprint” for each package from a selected set of metadata fields, such as version, authors, and repository URL, that are expected to remain invariant across a rename.

Then, when cargo-unmaintained can’t find a package by its original name, it would compare this fingerprint against the fingerprint of the package currently available (even if its name has changed) and treat them as the same package if all those key fields match.
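
A minimal sketch of exact fingerprint matching, under stated assumptions: the `Fingerprint` type, its three fields, and both helper functions are hypothetical, and whether fields such as the version really stay invariant across a rename is untested here.

```rust
use std::collections::HashMap;

/// Hypothetical fingerprint built from fields assumed to survive a rename.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct Fingerprint {
    version: String,
    authors: Vec<String>,
    repository: String,
}

/// Groups the names of the packages currently in a repository by fingerprint.
fn index_by_fingerprint(packages: &[(String, Fingerprint)]) -> HashMap<Fingerprint, Vec<String>> {
    let mut index: HashMap<Fingerprint, Vec<String>> = HashMap::new();
    for (name, fp) in packages {
        index.entry(fp.clone()).or_default().push(name.clone());
    }
    index
}

/// Returns the names of packages whose fingerprint equals `fp`, excluding the
/// missing package's own name. An empty result means no exact match.
fn rename_candidates(
    index: &HashMap<Fingerprint, Vec<String>>,
    missing_name: &str,
    fp: &Fingerprint,
) -> Vec<String> {
    index
        .get(fp)
        .map(|names| {
            names
                .iter()
                .filter(|n| n.as_str() != missing_name)
                .cloned()
                .collect()
        })
        .unwrap_or_default()
}
```

Hashing the whole fingerprint keeps the match precise: two packages either share every selected field or they do not, with no scores or thresholds to tune.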

Besides, I thought about using weighted matching, since not all fields are equally important. For example:

  • repository URL and author/maintainer's name can have a higher weight
  • description can have a lower weight

In summary, I would have a two-stage process:

  • First, a quick filter to find candidates
  • Then a detailed similarity scoring for flagged pairs

behalshabnam avatar Apr 01 '25 12:04 behalshabnam

@behalshabnam I very much appreciate the thought you've put into this.

I would prefer to not go down the "weighted matching" route, though.

I would prefer the approach we adopt here be precise rather than fuzzy, because the latter could turn into a "tuning parameters" can of worms. (I hope that makes sense.)


Thinking out loud: once this issue is fixed, we should have a test, though I'm not 100% sure what form that test should take.

The test might be just a snapbox test based on a fixture, though I'm not sure.

The package icu_locid mentioned in the issue description might be part of that test.


Would you be willing to try the "compare all fields but name and see what breaks" approach?

smoelius avatar Apr 02 '25 13:04 smoelius

Here is another example of this problem: toml_write was recently renamed to toml_writer: https://github.com/toml-rs/toml/blob/b3594df3b76a95d5d21f5af3a9847e44917c640a/crates/toml_writer/CHANGELOG.md?plain=1#L12

Hence, toml_write is flagged as unmaintained.

As I write this, GitHub reports that the change is 8 hours old.

smoelius avatar Jul 08 '25 10:07 smoelius

What about ...

When cargo-unmaintained sees the crate can't be found in the repository, it checks out a commit older than a reasonable period, or the date of the latest crate release, to find the crate name in the old-ish repository.

If the crate name was found earlier in the repository, then

a) the message can now be "X not present in latest commit of repo Y, but was found at commit Z"

b) trigger a different set of "is this maintained?" rules, such as

  • how long after a repo no longer contains a crate should the crate be considered unmaintained?
  • has this repo done a release of any crate since the date the crate was removed?
    • or a simple proxy for this is: does the repo have a new tag after the crate was removed?
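
The two questions above could be computed as plain signals, roughly as sketched below. Everything here is an assumption for illustration: the type and function names are hypothetical, times are Unix seconds obtained elsewhere (for example from commit and tag dates), and how the signals should be combined into a verdict is deliberately left open.

```rust
/// Hypothetical inputs for a crate that is no longer in its repository.
/// All times are Unix seconds.
struct RemovalInfo {
    /// Time of the commit after which the crate is absent.
    removed_at: u64,
    /// Times of the repository's tags.
    tag_times: Vec<u64>,
}

/// The two signals raised in the list above.
#[derive(Debug, PartialEq, Eq)]
struct RemovalSignals {
    /// How long the crate has been absent from the repository.
    secs_since_removal: u64,
    /// Whether the repository has a tag newer than the removal.
    tagged_since_removal: bool,
}

fn removal_signals(info: &RemovalInfo, now: u64) -> RemovalSignals {
    RemovalSignals {
        // saturating_sub avoids underflow if `now` precedes `removed_at`.
        secs_since_removal: now.saturating_sub(info.removed_at),
        tagged_since_removal: info.tag_times.iter().any(|&t| t > info.removed_at),
    }
}
```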

jayvdb avatar Sep 08 '25 04:09 jayvdb

@jayvdb Thanks very much for your suggestions.

When cargo-unmaintained sees the crate can't be found in the repository, it checks out a commit older than a reasonable period, or the date of the latest crate release, to find the crate name in the old-ish repository.

Of the two ideas you proposed ("older than a reasonable period" and "date of the latest crate release"), I think I would be more open to the latter. I will try to give this some thought.

Generally speaking, it would be nice if it were easier to tie a crates.io crate to a commit.

smoelius avatar Sep 09 '25 10:09 smoelius