Renamed packages are flagged immediately
Suppose a package X is renamed as Y. Then package X will be flagged immediately with "X (not in <repo>)". This is a false positive.
You can see this right now with the icu_locid package:
$ cargo unmaintained -p icu_locid
Scanning 1 packages and their dependencies (pass --verbose for more information)
1/1 (100%)
icu_locid (not in https://github.com/unicode-org/icu4x)
The package was renamed recently: https://github.com/unicode-org/icu4x/blob/83441614dd3b7ce628031926c279fb25c97f7461/CHANGELOG.md?plain=1#L21-L22
I am not yet sure how best to fix this.
If a package cannot be found by name, maybe it could be identified by other characteristics.
@smoelius, can I work on this issue?
@behalshabnam How would you do it, i.e., what would you change?
I'm asking because I think this issue may require some research.
Here was one idea I had. Given two packages X and Y, compare all of their fields except their names. If only their names differ, consider the packages the same.
But that idea may be silly. There may be fields that one would expect to differ and that should be ignored. I wouldn't know unless I experimented with this idea.
Are your thoughts similar? Or completely different?
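For illustration, the "compare all fields except name" idea could be sketched roughly as follows. `PackageMeta` and its fields are hypothetical stand-ins chosen for this example, not cargo-unmaintained's actual types, and which fields really belong in the comparison is exactly what would need experimenting with.

```rust
/// Hypothetical subset of a package's manifest metadata. The real set of
/// fields to compare would have to be determined by experiment.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct PackageMeta {
    pub name: String,
    pub description: Option<String>,
    pub authors: Vec<String>,
    pub license: Option<String>,
    pub repository: Option<String>,
}

/// Returns true if `x` and `y` agree on every field other than `name`.
pub fn same_except_name(x: &PackageMeta, y: &PackageMeta) -> bool {
    // Destructure exhaustively so that adding a field to `PackageMeta`
    // forces this function to be revisited.
    let PackageMeta {
        name: _,
        description,
        authors,
        license,
        repository,
    } = x;
    *description == y.description
        && *authors == y.authors
        && *license == y.license
        && *repository == y.repository
}
```

Fields that turn out to differ legitimately across a rename could be dropped from the comparison one at a time.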
@smoelius I've looked into this issue and my thoughts are similar to yours.
Instead of just comparing names, we can determine a “fingerprint” for each package using a selected set of metadata fields, such as version, authors, and repository URL, that are expected to remain unchanged across a rename.
Then, when Cargo can’t find a package by its original name, it will compare this fingerprint against the fingerprint of the package currently available (even if its name has changed) and decide that they’re the same package if all those key fields match.
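A minimal sketch of exact fingerprint matching, under stated assumptions: `Fingerprint`, `Package`, and `find_renamed` are hypothetical names made up for this example, and the three fields are simply the ones listed above. Whether they actually stay the same across a rename is untested.

```rust
/// Hypothetical fingerprint built from the fields named in the comment above.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct Fingerprint {
    pub version: String,
    pub authors: Vec<String>,
    pub repository: Option<String>,
}

/// Hypothetical minimal view of a package found in the repository.
#[derive(Clone, Debug)]
pub struct Package {
    pub name: String,
    pub fingerprint: Fingerprint,
}

/// Returns the name of the package in `packages` whose fingerprint equals
/// `missing`, but only if exactly one package matches. Zero matches or an
/// ambiguous match both yield `None`.
pub fn find_renamed<'a>(missing: &Fingerprint, packages: &'a [Package]) -> Option<&'a str> {
    let mut matches = packages.iter().filter(|p| p.fingerprint == *missing);
    match (matches.next(), matches.next()) {
        (Some(p), None) => Some(p.name.as_str()),
        _ => None,
    }
}
```

Requiring a unique match keeps the comparison exact: there are no scores or thresholds to tune.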
Besides, I thought about using weighted matching, since not all fields are equally important. For example:
- repository URL and author/maintainer's name can have a higher weight
- description can have a lower weight
In summary, I would have a two stage process:
- First, a quick filter to find candidates
- Then a detailed similarity scoring for flagged pairs
@behalshabnam I very much appreciate the thought you've put into this.
I would prefer to not go down the "weighted matching" route, though.
I would prefer the approach we adopt here be precise rather than fuzzy, because the latter could turn into a "tuning parameters" can of worms. (I hope that makes sense.)
Thinking out loud: once this issue is fixed, we should have a test, though I'm not 100% sure what form that test should take.
The test might be just a snapbox test based on a fixture, though I'm not sure.
The package icu_locid mentioned in the issue description might be part of that test.
Would you be willing to try the "compare all fields but name and see what breaks" approach?
Here is another example of this problem: toml_write was recently renamed to toml_writer: https://github.com/toml-rs/toml/blob/b3594df3b76a95d5d21f5af3a9847e44917c640a/crates/toml_writer/CHANGELOG.md?plain=1#L12
Hence, toml_write is flagged as unmaintained.
As I write this, GitHub reports that the change is 8 hours old.
What about ...
When cargo-unmaintained sees that the crate can't be found in the repository, it checks out an older commit, either one older than some reasonable period or one from the date of the latest crate release, and looks for the crate name in that older state of the repository.
If the crate name was found earlier in the repository, then
a) the message can now be "X not present in latest commit of repo Y, but was found at commit Z"
b) trigger a different set of "is this maintained?" rules, such as
- how long after a repo stops containing a crate should the crate be considered unmaintained?
- has this repo done a release of any crate since the date the crate was removed?
- or a simple proxy for this is: does the repo have a new tag after the crate was removed?
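To make the rules in (b) concrete, here is one possible sketch. Everything in it is an assumption for illustration: the type and function names are hypothetical, timestamps are plain seconds since the Unix epoch, and the way the two signals are combined (grace period expired, or a tag newer than the removal) is only one reading of the questions above.

```rust
/// Hypothetical facts gathered about a crate that is missing from the latest
/// commit. All timestamps are seconds since the Unix epoch.
pub struct RemovalFacts {
    /// When the crate was removed from the repository, or `None` if the
    /// crate was never found in any examined commit.
    pub removed_at: Option<u64>,
    /// Timestamp of the newest tag in the repository, if any.
    pub newest_tag_at: Option<u64>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    /// The crate name was never found in the repository.
    NeverInRepo,
    /// The crate was removed and should now be reported.
    Unmaintained,
    /// The crate was removed, but too recently to draw a conclusion.
    RemovedRecently,
}

/// Assumed rule: report the crate once the grace period has passed, or as
/// soon as the repository has a tag newer than the removal.
pub fn classify(facts: &RemovalFacts, now: u64, grace_period: u64) -> Verdict {
    let removed_at = match facts.removed_at {
        Some(t) => t,
        None => return Verdict::NeverInRepo,
    };
    let tagged_since_removal = facts.newest_tag_at.map_or(false, |tag| tag > removed_at);
    let grace_expired = now.saturating_sub(removed_at) > grace_period;
    if tagged_since_removal || grace_expired {
        Verdict::Unmaintained
    } else {
        Verdict::RemovedRecently
    }
}
```

The `RemovedRecently` case would correspond to the message in (a), where the crate is reported as absent from the latest commit but present at an earlier one.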
@jayvdb Thanks very much for your suggestions.
> When cargo-unmaintained sees that the crate can't be found in the repository, it checks out an older commit, either one older than some reasonable period or one from the date of the latest crate release, and looks for the crate name in that older state of the repository.
Of the two ideas you proposed ("older than a reasonable period" and "date of the latest crate release"), I think I would be more open to the latter. I will try to give this some thought.
Generally speaking, it would be nice if it were easier to tie a crates.io crate to a commit.