Define an order of precedence + provide a means to override it (REUSE.yaml)
Help, I got here because I got a PendingDeprecationWarning
PendingDeprecationWarning: Copyright and licensing information for 'my-project/foo.py' has been found in both 'my-project/foo.py' and in the DEP5 file located at '.reuse/dep5'. The information for these two sources has been aggregated. In the future this behaviour will change, and you will need to explicitly enable aggregation. See https://github.com/fsfe/reuse-tool/issues/779. You need do nothing yet. Run with
--suppress-deprecationto hide this warning.
Don't worry, there's nothing you need to do quite yet, except be vigilant of upcoming changes.
The reason you're getting this warning is because of the following scenario. You have a file my-project/foo.py which contains the following header:
# SPDX-FileCopyrightText: 2023 Jane Doe
#
# SPDX-License-Identifier: GPL-3.0-or-later
But you also have a .reuse/dep5 file which contains the following section:
Files: my-project/*.py
Copyright: 2020 Example NGO
License: 0BSD
The problem: Under which licence is the file? Who are the copyright holders?
Currently, we err on the side of caution, and just aggregate the results. The answer to both questions is 'both', as far as the tool is concerned.
However, that behaviour is not actually specified in the REUSE Specification, and there is a consensus among the maintainers of REUSE that the current behaviour isn't great. So we want to change it.
A naïve proposal + some history
We want to define an order of precedence for copyright and licensing information. Here is a concrete proposal:
Copyright and Licensing Information is considered according to the following order of precedence:
- Information defined in the
.licensefile.- Information defined in the Commentable File.
- Information defined in
.reuse/dep5.There is no merging of information from different sources. Only the source with the highest precedence is considered.
In fact, this proposal is so concrete that—for a few hours—it was in REUSE Specification 3.1 and tool version 2.0.0! However, because of quick negative feedback, this update to the specification was promptly reverted, and tool version 2.0.0 was yanked. A little embarrassing on our part, but we're thankful for the constructive feedback.
Copied from the change log:
While the intention of the breaking change was sound (don't mix information sources; define a single source of truth), there were legitimate use-cases that were broken as a result of this.
The legitimate use-case is the following scenario: You copy a project Foo wholesale into your own project as a static dependency. Foo is not REUSE-compliant, but does contain copyright statements in some code headers. You write a section into .reuse/dep5 broadly declaring that static/Foo/* is under its declared licence, and attribute The Foo Authors as the copyright holders. However, because the DEP5 file is now no longer applied to the files that contain copyright statements, REUSE will complain that these files do not have a declared licence.
Within the restrictions of the above proposal, there is no good workable solution to this use-case. You could manually edit the headers (not great, especially when Foo is big, or you regularly need to update it), or you could manually add .license files, which may be a huge task.
An actual but not-so-concrete proposal
We still want to define an order of precedence. But we must provide a way to force aggregation (current behaviour) or hard-coding precedence (e.g., prefer .reuse/dep5 over the file contents).
There does not yet exist a concrete way of doing this, but you may think of it like this. Given the example .reuse/dep5 section at the start of this issue, we could instead write this:
Files: my-project/*.py
Copyright: 2020 Example NGO
License: 0BSD
Precedence: [file (default)|aggregate|dep5]
The problem, however, is that DEP5 does not support this field, and we don't want to make it support this field.
So we want to pivot away from DEP5 and adopt a different configuration method. We've been brainstorming this since 2021 (volunteer projects aren't very fast), and we're internally referring to it as REUSE.yaml (although the YAML part is a bit up in the air).
An actual for-real-this-time concrete proposal
Find below a real and actual concrete proposal:
# The version of the TOML schema. A simple integer should be fine.
# Mandatory.
version = 1
[[annotations]]
# The path (or paths) that are covered, relative to REUSE.toml's directory. Mandatory.
path = [
".bumpversion.cfg",
"setup.cfg"
]
# A string that defines the precedence of copyright and licensing information.
# The choices are:
#
# - "toml" -> Treat the information in this file as the ultimate authority of
# the copyright and licensing of the covered files. If multiple nested
# REUSE.tomls have this precedence for the same file, then the topmost REUSE.toml
# is authoritative.
# - "closest" -> Use the information closest to the file (including inside the file)
# if available. If no such information is found, then the information inside this
# REUSE.toml is applied to that file. TODO: what if there is only partial information
# inside the file?
# - "aggregate" -> Aggregate the information from this file with the information
# inside of the covered files.
#
# Not mandatory. Defaults to "closest".
precedence = "toml"
# The copyright notice (or notices) that apply to the above paths. Mandatory.
SPDX-FileCopyrightText = "2017 Free Software Foundation Europe e.V. <https://fsfe.org>"
# The license expression (or expressions) that apply to the above paths.
# Mandatory.
SPDX-License-Identifier = "GPL-3.0-or-later"
# Subsequent tables override previous tables. This does NOT interact with the
# 'precedence' key.
[[annotations]]
# Can contain globs. I would define the globbing behaviour as equal to Python's
# glob.glob (recursive=True, include_hidden=True), but that would require copying
# the glob.glob code and tweaking it to work without filesystem reads. So instead,
# I used the wcmatch.glob 3rd-party library, with GLOBSTAR, MATCHBASE, and DOTGLOB,
# which gives a behaviour very similar to .gitignore and glob.glob (but not precisely
# matching either).
path = "docs/reuse*.rst"
precedence = "toml"
SPDX-FileCopyrightText = [
"2017 Free Software Foundation Europe e.V. <https://fsfe.org>",
"2023 Jane Doe",
]
# These SHOULD be joined with AND, but files support multiple separate
# SPDX-License-Identifier tags, so let's support it here as well.
SPDX-License-Identifier = [
"CC-BY-SA-4.0",
"GPL-3.0-or-later"
]
Some notes on the implementation:
- I chose TOML. I had previously been partial to YAML, but reading over the discussions in the linked issues, and having worked a little more with TOML recently, it's a lot more fool-proof to write, especially as concerns indentation. It's not very good for nesting of data, but we're not doing that, so it's fine. We could bikeshed this choice further, but I propose that we just go with it.
-
path,SPDX-FileCopyrightText, andSPDX-License-Identifiercan be either a single string or a list of strings. This (partially) matches DEP5 behaviour, making it easy to port. It's also convenient to not mandate lists; we'll probably convert string values into single-value lists in the under-the-hood implementation (edit: that is exactly what I did). - I chose to use the full
SPDX-[...]key names. This works better in TOML than in YAML (because the semicolon in YAML messes with this tool's parsing). It's a bit more annoying to type, but it's also very consistent, and means that the user has to memorise less. - The
versionkey doesn't do much of anything precently. I'm not sure if it'll ever become important, but if it does, it'll be good to have.
Some notes about the file itself:
- I intend to support exclusively
REUSE.toml, and NOT.REUSE.toml. People will be peeved by this choice (they don't like random tools littering their clean workspace), but I propose that we stand by this choice. By allowing dotfiles, we would run the risk that the licensing information is hidden on some computers. Licensing information should not be hidden, ergo let's not do dotfiles. - REUSE.toml files can be nested! You can place them anywhere in the project, not just in the root directory. They use the same
closest/aggregate/tomlprecedence system.closestresolves to the file itself OR to the nearest REUSE.toml (can be self).aggregatejust aggregates the REUSE.toml's information always, and then behaves likeclosest.tomlis an aggressive "ignore everything underneath me; I am the ultimate authority here" precedence setting. The topmost REUSE.toml withtomlis authoritative.
Related issues
Here are some issues of relevance (in order of relevance, feel free to reference more):
- https://github.com/fsfe/reuse-docs/issues/81
- https://github.com/fsfe/reuse-docs/issues/70
- https://github.com/fsfe/reuse-tool/issues/775
- https://github.com/fsfe/reuse-tool/issues/572
- https://github.com/fsfe/reuse-tool/issues/661
This issue will exist as a sort of meta issue to refer back to and track work in other issues.
Just documenting this here as well. I think having a precedence implies there is some sort of legal precedence as well. This is a problem if a repo supplies licensing information in a funny way, i.e. adds information in one way and then in another way while keeping both.
So I think it would be good to keep on aggregating and issue a warning whenever that happens because that means licensing information is unclear and that's not what we want. In later versions, we could perhaps even make the warning a failure. This would improve licensing overall by making it clearer.
However, the legitimate use case Carmen describes above, is an issue with my suggestion. Something we should figure out is how common static dependencies are and if they are typically necessary. If the answers are very common and necessary, then perhaps there is a way to declare something a static dependency and in that case be able to override included licensing information? Or perhaps we could ignore something that is declared a dependency? However, there is an underlying question we should also debate: Can a repo be REUSE compliant with static dependencies that aren't?
I think that a common scenario that will be encountered will be documentation inside the repository explaining – with an example – how to add the correct copyright headers to source files. Maybe it's necessary to have a way to escape the copyright information in files to mitigate this?
@DerMolly This mechanism exists :D See https://github.com/fsfe/reuse-tool/blob/6b81d04dfa198423b3a1adc0483368af7b73fc7c/docs/usage.rst#ignoring-parts-of-a-file
Between .license files and the ignore tags, that use-case is well-covered at the moment. It doesn't, however, address the problem of static dependencies.
I added a proposal for REUSE.toml in the section 'An actual for-real-this-time concrete proposal' of the issue.
@mxmehl @silverhook @nicorikken
I support the TOML choice and also that we avoid having a dotfile. REUSE.toml should be plenty fine.
Regarding using SPDX- tags, I’m fine with it, but there is a question whether it is a benefit or a problem for (other) tools. IMHO it’s a benefit, as the content is correct, albeit in a different file than tools might suspect (the same is true for *.license files too, so nothing new). but it might make sense to reach out to ScanCode and FOSSology for feedback. (also regarding merging this with ScanCode’s ABOUT files)
The example TOML looks good to me. I would expect that that using SPDX annotation within the field would also work fine. e.g.:
[[annotations]]
path = "docs/reuse*.rst"
precedence = "toml"
SPDX-FileCopyrightText = [
"2017 Free Software Foundation Europe e.V. <https://fsfe.org>",
"2023 Jane Doe",
]
SPDX-License-Identifier = [
"(CC-BY-SA-4.0 AND GPL-3.0-or-later) OR MIT"
]
I have some concerns about the precedence field and allowing the REUSE.toml to be anywhere. There is room for misuse there, but on the other hand, people will game whatever they get. So I’m not too concerned. Still perhaps something to keep in mind, as it does complicate things quite a bit and makes it a bit less predictable.
~~I no longer intend to implement REUSE.toml files in multiple locations. This is non-trivial to implement, and there are so many permutations of potentially conflicting information.~~
Let's no longer make 'precedence' mandatory -> default to file.
I no longer intend to implement REUSE.toml files in multiple locations. This is non-trivial to implement, and there are so many permutations of potentially conflicting information.
After the discussion today, do you think differently? I think we need this feature, that was one of the USPs of REUSE.toml/yaml.
I wrote that just before the meeting. My mind was changed during the meeting :)
There is also an issue as the deprecation warning is getting printed when there is nothing to aggregate / both .reuse/dep5 and the header are not in conflict, at least with reuse version 2.1.0. Example with a wildcard .reuse/dep5 and two files with license headers:
$ tree -a -I '.git' /tmp/issue779/
/tmp/issue779/
├── foobar.html
├── foobar.yaml
├── LICENSES
│ └── GPL-3.0-or-later.txt
└── .reuse
└── dep5
$ cat /tmp/issue779/.reuse/dep5
Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
Upstream-Name: Issue 779
Upstream-Contact: ACME Inc. <https://example.com>
Source: <URL of the public repository, https://github.com/...>
Files: *
Copyright: ACME Inc. <https://example.com>
License: GPL-3.0-or-later
$ cat /tmp/issue779/foobar.yaml
# Test
#
# SPDX-License-Identifier: GPL-3.0-or-later
# SPDX-FileCopyrightText: ACME Inc. <https://example.com>
---
foo: bar
$ cat /tmp/issue779/foobar.html
<html>
<!--
SPDX-License-Identifier: GPL-3.0-or-later
SPDX-FileCopyrightText: ACME Inc. <https://example.com>
-->
I would expect no warnings as the information is redundant but in sync, no licenses to aggregate and therefore no precedence issues. However, a warning gets printed:
$ reuse --version lint
reuse 2.1.0
$ reuse --root /tmp/issue779/ lint
/tmp/.venv/issue779/lib64/python3.12/site-packages/reuse/project.py:224: PendingDeprecationWarning: Copyright and licensing information for '/tmp/issue779/foobar.html' has been found in both '/tmp/issue779/foobar.html' and in the DEP5 file located at '.reuse/dep5'. The information for these two sources has been aggregated. In the future this behaviour will change, and you will need to explicitly enable aggregation. See <https://github.com/fsfe/reuse-tool/issues/779>. You need do nothing yet. Run with `--suppress-deprecation` to hide this warning.
warnings.warn(
/tmp/.venv/issue779/lib64/python3.12/site-packages/reuse/project.py:224: PendingDeprecationWarning: Copyright and licensing information for '/tmp/issue779/foobar.yaml' has been found in both '/tmp/issue779/foobar.yaml' and in the DEP5 file located at '.reuse/dep5'. The information for these two sources has been aggregated. In the future this behaviour will change, and you will need to explicitly enable aggregation. See <https://github.com/fsfe/reuse-tool/issues/779>. You need do nothing yet. Run with `--suppress-deprecation` to hide this warning.
warnings.warn(
# SUMMARY
* Bad licenses: 0
* Deprecated licenses: 0
* Licenses without file extension: 0
* Missing licenses: 0
* Unused licenses: 0
* Used licenses: GPL-3.0-or-later
* Read errors: 0
* files with copyright information: 2 / 2
* files with license information: 2 / 2
Congratulations! Your project is compliant with version 3.0 of the REUSE Specification :-)
Is this expected behavior? If so, senseful wildcards would effectively prevent users to add .license files or header comments in files.
I agree. The code just concatenates data from both sources without any attempt to avoid duplication, and emits multiple warnings. I logged an issue offering to fix this four months ago but there was no response. So I don't think there's any interest in that aproach. It's possible to suppress the warnings, but it's not possible to suppress the duplicates (in spdx output). Maybe it's messy but harmless. But I wish it was at least possible to instruct the code to not scan the source when there is a .reuse/dep5 file. That seems to me to be a very sensible thing to do, but it is also possible to scan both and cull duplicates (it's just less efficient if there is total agreement - where there is disagreement targeted warnings could help to achieve agreement). I think we have to wait until a decision is made about a configuration file that instructs reuse what to do in this regard. I just hope that the decision supports projects that want to continue to specify all of the licensing information (for REUSE purposes) in a single file separate to the source files (i.e., .reuse/dep5).
@andreashaerter That is an interesting bug/feature/behaviour that probably shouldn't be too difficult to fix. However, #863 should fix this class of issues, so it's a touch low on the priority list. I'll create an issue for it nonetheless.
Interestingly, the code does actually avoid duplication, ultimately. But at that exact stage in the code, not quite.
If there is duplicate removal somewhere, it would be good if it occurred as part of or before the spdx subcommand. Its output contains all the duplicates.
I updated the first post with some new details after doing some tinkering work in #863.
-
fileprecedence renamed toclosest. -
closestprecedence is now the implicit default, and the precedence key is no longer mandatory. - REUSE.toml files can now be nested in a project. Or rather, you can plop down REUSE.toml files anywhere. They use the same
closest/aggregate/tomlprecedence system.closestresolves to the file itself OR to the nearest REUSE.toml (can be self).aggregatejust aggregates the REUSE.toml's information always, and then behaves likeclosest.tomlis an aggressive "ignore everything underneath me; I am the ultimate authority here" precedence setting. The topmost REUSE.toml withtomlis authoritative. - The
pathglobbing behaviour was defined. Copying: I would define the globbing behaviour as equal to Python's glob.glob (recursive=True, include_hidden=True), but that would require copying the glob.glob code and tweaking it to work without filesystem reads. So instead, I used the wcmatch.glob 3rd-party library, with GLOBSTAR (match**), MATCHBASE (match directory-less pattern also in subdirectories), and DOTGLOB (match.foo.pywith*.py), which gives a behaviour very similar to .gitignore and glob.glob (but not precisely matching either).
Some notes:
- The precedence handling of the nesting of REUSE.toml files is simple, but appears complete to me. However:
- What to do if the REUSE.toml precedence is
closest, but the file only contains a copyright line? Presently, this means that the REUSE.toml's information is never applied, and the file has no licence applied to it. There are three options:- This behaviour is desirable actually, and shouldn't be changed.
- If the file contains exclusively copyright info or exclusively licensing info, then apply all the REUSE.toml information. This one seems good to me.
- Selectively apply exclusively the licensing information if there is only copyright information in the file, and vice versa. This seems needlessly complicated to me.
- While I'm fairly happy with how the globbing works, it's going to be a pain to describe in the spec. We could cheat and say 'globbing works like wcmatch.glob with GLOBSTAR, MATCHBASE, and DOTGLOB', but that's a bit weird.
2. If the file contains exclusively copyright info or exclusively licensing info, then apply all the REUSE.toml information. This one seems good to me.
Practically speaking, this sounds like a good trade-off. But the tool should raise a warning if that happens.
If someone is already using REUSE.toml to override info, then it’s on them to make sure the override is exact.
(It’s going to be fun explaining it in the spec.)