Proposal: using codemetapy and codemeta.json to help researchers and software maintainers make software easier to cite
Hey Maarten,
I was hoping we might get a chance to chat sometime. I am currently an eScience Center Fellow and research software engineer at WUR. For my fellowship, I aim to build tools that help researchers credit those who create software and that prevent accidental or intentional plagiarism of software. As part of my work, I began to create a Python package called cff2toml (it probably should have been called toml2cff), which I hoped would help automate the creation of CFF files from pyproject.toml files. You can find it here: https://pypi.org/project/cff2toml/
I was hoping to use some of these tools to create a knowledge graph of citation metadata across PyPI packages and to try to estimate the amount of potential plagiarism occurring among research software engineers.
Then I discovered the CodeMeta project, and from there I discovered your tool.
As part of my fellowship, I was hoping to help develop some tools, like yours, and would love to help you if you are open to it.
Here's a problem I have been thinking about. The current strategy for improving citation metadata for software seems to be to automate the creation of citation metadata files (like CFF, .zenodo, .tributors, codemeta.json, etc.). However, this strategy assumes that the maintainers of these packages will use these tools. And this is a real problem from a research ethics perspective, since researchers still need to cite software even if its creators did not make it easy to cite them. Indeed, some software projects no longer have maintainers. Some projects have maintainers who do not want to create these files, perhaps because it takes too much time. Other maintainers might not create these files in time for a researcher to cite them. And still other maintainers have no idea that there is demand for them to create these files. And even those who create these files often do not keep them updated, so they may be inaccurate or incomplete. And finally, even if a maintainer is willing to keep accurate citation metadata, they may not be technically able to do so for released versions; in these cases, they cannot update the source for tagged releases (so there is no way to add codemeta.json files retroactively), and they have no clear way to supply missing metadata or correct inaccurate metadata.
Here's my proposal for addressing these issues for citation metadata with respect to Python modules using your tool, codemetapy.
Enhance codemetapy so that:
- From the command line, it can create a default codemeta.json file for any version of any Python package on PyPI.
Currently the tool can do this, but only if the package is already installed locally. Perhaps the command line could be extended so that users can write something like "codemetapy {packagename} {version}" and the tool would generate this file. It could still adopt the strategy of generating it from a locally installed version, but in that case it would be nice if, for users who do not have the package installed locally, it could create a temporary environment, install the package there, and then destroy the temporary environment afterwards. Alternatively, perhaps this can be done without requiring any local installation at all, for example by looking the file up from some central server that does the installation remotely, stores the result in a database, and serves it back.
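To sketch the temporary-environment idea in code (everything here is hypothetical, including the function name; it only assumes that codemetapy accepts an installed package name and writes JSON to stdout, as its README shows):

    import subprocess
    import tempfile
    import venv
    from pathlib import Path

    def codemeta_for(package: str, version: str) -> str:
        """Generate codemeta JSON for a PyPI package in a throwaway venv."""
        with tempfile.TemporaryDirectory() as tmp:
            env_dir = Path(tmp) / "env"
            venv.create(env_dir, with_pip=True)
            bin_dir = env_dir / "bin"  # "Scripts" on Windows
            subprocess.run(
                [str(bin_dir / "pip"), "install",
                 f"{package}=={version}", "codemetapy"],
                check=True,
            )
            result = subprocess.run(
                [str(bin_dir / "codemetapy"), package],
                check=True, capture_output=True, text=True,
            )
            return result.stdout  # the codemeta.json text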
- Permit the code maintainer to override the default codemeta.json for any version of any Python package on PyPI.
I thought of this approach. First, define and pilot a new standard protocol for metadata harvesters/tools (like this one) for finding and using codemeta files on the main/master branch. For example, tell maintainers to add a codemeta_{version}.json file to their main/master branch, and tell metadata tool makers (like codemetapy) to check for and use this file before trying to generate a codemeta file through other means. For codemetapy, we could make it use the GitLab or GitHub API to check, in order: (1) the main/master branch for codemeta_{version}.json; (2) the tag associated with the version for codemeta_{version}.json; (3) that same tag for a plain codemeta.json; and (4) if all of that fails, fall back to generating a codemeta.json by other means, like parsing project files. We could update the command line options of this tool to allow users to specify a tag prefix (like "v" or "version") and the name of the master/main branch.
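A minimal sketch of that lookup order, using GitHub's contents API as a stand-in for whichever forge API we'd settle on (the function and defaults are illustrative, not existing codemetapy behaviour):

    import requests

    def find_codemeta(owner: str, repo: str, version: str,
                      tag_prefix: str = "v", branch: str = "main") -> dict | None:
        """Try the proposed lookup order; None means 'generate it instead'."""
        candidates = [
            (f"codemeta_{version}.json", branch),                # 1. versioned file on main
            (f"codemeta_{version}.json", tag_prefix + version),  # 2. versioned file on the tag
            ("codemeta.json", tag_prefix + version),             # 3. plain codemeta.json on the tag
        ]
        for path, ref in candidates:
            r = requests.get(
                f"https://api.github.com/repos/{owner}/{repo}/contents/{path}",
                params={"ref": ref},
                headers={"Accept": "application/vnd.github.raw+json"},
            )
            if r.status_code == 200:
                return r.json()
        return None  # 4. caller falls back to generating by other means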
- Make it easy for maintainers to learn about and include codemeta.json files.
Here are some feature ideas:
- Create a command line option that can save the file under the new naming scheme that includes the version (e.g., codemeta_{version}.json)
- For projects that have a maintainer or author email, send an email with a copy of the codemeta_{version}.json
- For GitHub projects, add a command line option that creates an issue and copies/attaches this file into the issue (a sketch follows below the list).
- For GitHub projects, add a command line option that forks the project, adds the file to the fork's main/master branch, and then opens a pull request against the upstream main/master branch.
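For the issue idea, a rough sketch against GitHub's standard "create an issue" REST endpoint (the function name and message text are made up):

    import requests

    def open_codemeta_issue(owner: str, repo: str, token: str, codemeta: str) -> None:
        """File an issue carrying a generated codemeta.json for review."""
        r = requests.post(
            f"https://api.github.com/repos/{owner}/{repo}/issues",
            headers={"Authorization": f"Bearer {token}",
                     "Accept": "application/vnd.github+json"},
            json={"title": "Proposed codemeta.json for this project",
                  "body": "Generated with codemetapy; please review:\n\n" + codemeta},
        )
        r.raise_for_status()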
What do you think?
Hi Will,
On Thu Apr 24, 2025 at 4:45 PM CEST, Will Riley wrote:
I was hoping we might get a chance to chat sometime. I am currently an eScience Center Fellow and research software engineer at WUR. For my fellowship, I aim to build tools that help researchers credit those who create software and that prevent accidental or intentional plagiarism of software.
I'm definitely open to chat about this, yes! Thanks for thinking along! Good to hear you're also on board with the eScience Center as a fellow; they're doing nice work with the Research Software Directory and I think both our projects could benefit from one another.
As part of my work, I began to create a Python package called cff2toml (it probably should have been called toml2cff :)), which I hoped would help automate the creation of CFF files from pyproject.toml files. Then I discovered the CodeMeta project, and from there I discovered your tool.
As part of my fellowship, I was hoping to help develop some tools, like yours, and would love to help you if you are open to it.
Yes, I think it's a good idea if we can all build together on a good open-source software metadata ecosystem and reuse as many tools as possible.
Here's a problem I have been thinking about. The current strategy for improving citation metadata for software seems to be to automate the creation of citation metadata files (like CFF, .zenodo, .tributors, codemeta.json, etc.). However, this strategy assumes that the maintainers of these packages will use these tools.
I see what you mean. In the work I have done for the CLARIAH Tool Discovery project we try to partially address this problem as follows:
We use our metadata harvester (https://github.com/proycon/codemeta-harvester), which clones the source git repository and effectively runs codemetapy to extract and combine the different kinds of metadata it can find into one codemeta.json. This works for repositories that have no codemeta.json yet (if they do have one, that one is authoritative and is used directly). The information from the harvester (which is just a codemeta.json for each tool) can then be published elsewhere. See for example https://tools.clariah.nl; the raw codemeta is also accessible from there (or via a web API, even with SPARQL).
We also allow for the option to add a codemeta-harvest.json to a repo, which is basically just like codemeta.json, but it will be read by our harvester and needs to contain only the information that the harvester can't extract automatically (from e.g. a pyproject.toml, package.json, etc.). That means developers don't need to add the bits that change regularly and can be extracted automatically anyway. Of course it doesn't eliminate your concern entirely.
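For illustration, a minimal codemeta-harvest.json could look something like this (the values are invented; it only needs to carry what can't be derived automatically from e.g. pyproject.toml):

    {
        "@context": "https://w3id.org/codemeta/3.0",
        "@type": "SoftwareSourceCode",
        "developmentStatus": "https://www.repostatus.org/#active",
        "author": [
            { "@type": "Person", "givenName": "Jane", "familyName": "Doe" }
        ]
    }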
If a codemeta.json is provided, the harvester also does some checks to see if it's up to date with the actual package version.
You might want to check out these two resources to get the full picture, if you haven't seen them yet:
- A paper we wrote that is pending publication in the CLARIN post-conference proceedings (this is the camera-ready version): https://github.com/CLARIAH/tool-discovery/blob/master/papers/tooldiscovery.pdf
- A short presentation I did two years ago that explains this harvesting pipeline: https://video.anaproy.nl/w/9WtwRuoEFFUima4LNRVmwt
And this is a real problem from a research ethics perspective, since researchers still need to cite software even if its creators did not make it easy to cite them. Indeed, some software projects no longer have maintainers. Some projects have maintainers who do not want to create these files, perhaps because it takes too much time. Other maintainers might not create these files in time for a researcher to cite them. And still other maintainers have no idea that there is demand for them to create these files. And even those who create these files often do not keep them updated, so they may be inaccurate or incomplete.
Yes, that is a valid concern. I already addressed it a bit above. Another way to address it (partially, again) is to have developers set up a CI/CD pipeline or a simple local git pre-commit hook that automatically regenerates the codemeta.json upon each commit (or each commit that touches certain relevant files). For GitHub users a dedicated GitHub Action could do the job, although I don't want to focus merely on GitHub. There's definitely room for improvement here and help is always appreciated.
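As a rough sketch of such a pre-commit hook (hypothetical, including the package name; it only assumes codemetapy is installed and prints the codemeta JSON to stdout when given the installed package's name):

    #!/usr/bin/env python3
    # .git/hooks/pre-commit -- regenerate codemeta.json before each commit
    import subprocess
    import sys

    PACKAGE = "mypackage"  # hypothetical: the project's own package name

    result = subprocess.run(["codemetapy", PACKAGE],
                            capture_output=True, text=True)
    if result.returncode != 0:
        sys.exit(0)  # don't block the commit if generation fails
    with open("codemeta.json", "w") as f:
        f.write(result.stdout)
    subprocess.run(["git", "add", "codemeta.json"], check=True)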
And finally, even if a maintainer is willing to keep accurate citation metadata, they may not be technically able to do so for released versions; in these cases, they cannot update the source for tagged releases (so there is no way to add codemeta.json files retroactively), and they have no clear way to supply missing metadata or correct inaccurate metadata.
I assume you're referring to the chicken-and-egg problem that arises when describing software: is the metadata part of the tag or not? We indeed take the former approach: even if only the metadata changes, maintainers will have to tag a new release.
Here's my proposal for addressing these issues for citation metadata with respect to Python modules using your tool, codemetapy.
Enhance codemetapy so that:
- From the command line, it can create a default codemeta.json file for any version of any Python package on PyPI.
Currently the tool can do this, but only if the package is already installed locally. Perhaps the command line could be extended so that users can write something like "codemetapy {packagename} {version}" and the tool would generate this file. It could still adopt the strategy of generating it from a locally installed version, but in that case it would be nice if, for users who do not have the package installed locally, it could create a temporary environment, install the package there, and then destroy the temporary environment afterwards. Alternatively, perhaps this can be done without requiring any local installation at all, for example by looking the file up from some central server that does the installation remotely, stores the result in a database, and serves it back.
codemetapy also works for tools that are not installed, including non-Python tools. It can directly read things like pyproject.toml (Python), package.json (npm), Cargo.toml (Rust), and various others for which codemeta has crosswalks. So indeed, if you download a package from PyPI or clone a git repo, codemetapy can extract the metadata. The latter is exactly what codemeta-harvester does ;)
- Permit the code maintainer to override the default codemeta.json for any version of any Python package on PyPI.
If it's the maintainer, they can just add codemeta.json to the source repo, right? Or are you talking about situations where there is no relation between the source code maintainer and the package maintainer? For packaging in Linux distros such a separation would be common, but for things like PyPI/npm/cargo it isn't.
One of the starting principles of our approach is to keep metadata at the source, and the source is the source code hosted on some public git forge. Codemeta also puts its focus on the source code; even PyPI is already considered secondary.
I thought of this approach. First, define and pilot a new standard protocol for metadata harvesters/tools (like this one) for finding and using codemeta files on the main/master branch. For example, tell maintainers to add a codemeta_{version}.json file to their main/master branch, and tell metadata tool makers (like codemetapy) to check for and use this file before trying to generate a codemeta file through other means. For codemetapy, we could make it use the GitLab or GitHub API to check, in order: (1) the main/master branch for codemeta_{version}.json; (2) the tag associated with the version for codemeta_{version}.json; (3) that same tag for a plain codemeta.json; and (4) if all of that fails, fall back to generating a codemeta.json by other means.
codemeta-harvester does something like that currently. It will read (or generate) codemeta.json data for the latest release (i.e., the latest git tag following a semantic versioning pattern, indeed allowing for a prefix like "v", as you suggested). The harvester also checks the latest master/main branch (which is also the fallback if there are no git tags at all). Certain metadata properties from the master branch take precedence, such as codemeta:developmentStatus, for which we use the repostatus.org vocabulary, and the current maintainer.
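In sketch form, the tag selection amounts to something like this (the regex and helper are illustrative, not the harvester's actual code):

    import re

    SEMVER_TAG = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")

    def latest_release(tags: list[str]) -> str | None:
        """Pick the highest semantic-versioning tag, allowing a 'v' prefix."""
        releases = [(tuple(int(g) for g in m.groups()), tag)
                    for tag in tags if (m := SEMVER_TAG.match(tag))]
        return max(releases)[1] if releases else None

    # latest_release(["v1.2.3", "v1.10.0", "0.9.1"]) -> "v1.10.0"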
I'm not a fan of adding codemeta_{version}.json files, though. Git already handles versions perfectly fine, so files should not carry version numbers in their names. We want to keep the codemeta.json (or codemeta-harvest.json) next to the version it describes, so there can be no confusion.
We could update the command line options of this tool to allow users to specify a tag prefix (like "v" or "version") and the name of the master/main branch.
- Make it easy for maintainers to learn about and include codemeta.json files.
I think we used to have a contributed functionality in codemetapy that could autogenerate codemeta.json whenever setup.py was invoked. It may have been deprecated, as setup.py itself is. I'll have to check.
Here are some feature ideas:
- Create a command line option that can save the file under the new naming scheme that includes the version (e.g., codemeta_{version}.json)
As I said earlier, I don't like extra versioning on top of git (or whatever other VCS); that's what the VCS is for.
- For projects that have a maintainer or author email, send an email with a copy of the codemeta_{version}.json
Sending automated mails to maintainers is a nice idea. We could even use git send-email for that, though not all maintainers will be accustomed to such a workflow (see https://git-send-email.io/ for details).
- For GitHub projects, add a command line option that creates an issue and copies/attaches this file into the issue.
Automatically raising issues is also something I can get behind, yeah. In both situations, though, we do have to be careful not to be too spammy. We certainly wouldn't want to harass software maintainers with automated requests they didn't sign up for, so there's a delicate balance to keep here.
- For GitHub projects, add a command line option that forks the project, adds the file to the fork's main/master branch, and then opens a pull request against the upstream main/master branch.
What do you think?
I like the ideas and appreciate you thinking along. We do have a bit of an architectural difference here, but that's not a big deal: you're proposing changes/enhancements to codemetapy, whereas I'm more envisioning other tools that use codemetapy as a dependency and do such things (such as codemeta-harvester). The reason is that I want to prevent codemetapy from getting too bloated/complex, because it already does quite a lot, maybe even too much, as it is. If the code grows too complex it'll be too hard to maintain. So the things you describe I'd rather see as additions to codemeta-harvester or another independent codebase.
Kind regards,
--
Maarten van Gompel
Digital Infrastructure, Humanities Cluster, Koninklijke Nederlandse Akademie van Wetenschappen (KNAW)
web: https://proycon.anaproy.nl gpg: 0x39FE11201A31555C
Thank you, Maarten, for your kind and detailed reply, and for the links.
I am still thinking about some of the current design assumptions.
You mentioned the chicken-and-egg problem:
"I assume you're referring to the chicken-and-egg problem that arises when describing software. Is the metadata part of the tag or not? We indeed take that approach: even if only the metadata changes, maintainers will have to tag a new release."
If I understand you correctly, you are suggesting that if maintainers want to correct or update the metadata (including the author/contributors metadata) for a released piece of software, they can do this by adding metadata to the next released version of the software. I assume that one of your goals for keeping the metadata close to the code is that it maintains a historical record of metadata (including inaccurate metadata). I am trying to figure out how to handle a different goal: I want to know how I can discover the correct or accurate metadata for a version of software that is missing this metadata in its tagged source code, or whose metadata is inaccurate or incomplete (e.g., missing an author).
For example, suppose you and I create a Python module and I release it, but I accidentally forget to include you in the codemeta.json as an author. Now suppose that codemeta-harvester comes along, harvests this inaccurate metadata, and redistributes it. Suppose a researcher then finds this inaccurate authorship metadata in a database generated by codemeta-harvester, and they publish a paper with a citation that includes me but not you.
What I want to know is how we can prevent such a situation and how we can recover from it after the version has been released.
It seems to me this problem is not sufficiently addressed by the design principle of only keeping the metadata close to the (tagged) code, which serves a historical function of accurately recording a mistake but does not necessarily prevent and may even foster the spread of inaccurate information. I could be wrong, but it seems to me that it requires an additional mechanism for correcting misinformation. I am curious how others have solved/addressed this problem in similar domains.
Zenodo, for example, decouples the source code from the metadata. It allows researchers to modify the record but not the deposited source code. https://support.zenodo.org/help/en-gb/1-upload-deposit/64-can-i-edit-records-after-they-have-been-published
I call this the central-database strategy for correcting the spread of inaccurate metadata for research software. Theoretically, it can even track the metadata changes.
There is nothing wrong with harvesting and reharvesting inaccurate metadata per se. It becomes problematic when such inaccurate metadata is shared and then used under the assumption that it is accurate. Researchers have a special ethical duty to accurately credit those who worked on the software, and I had assumed that researchers and research support organizations would use tools such as codemeta-harvester to gather accurate authorship metadata about software so that they can cite it ethically.
I am trying to figure out how to systematically support this research integrity need. There are other research integrity needs, like reproducibility of results, which I am not focused on, but which might also be assisted by tracking who contributed to the development of software. For example, accurately crediting people might help figure out whom to go to when the system breaks, or when assistance is needed to reuse the tool.
I am not an expert in this, but I was also wondering how academic publishing handles metadata mistakes in publications. I suppose they make corrections through addendums, which can be cited. However, this puts a great burden on people to read these corrections. Now that most people access papers digitally, we have new ways to dynamically recover from publication mistakes. If people provide links to a Zenodo DOI, they can see the updated metadata, including updated authorship information. We can imagine other kinds of scholarly technologies that alert us to issues with papers, like a button that lets us toggle between viewing a paper before and after corrections. Digital scholarship tools and practices allow us to prevent and mitigate error propagation and misinformation at different levels of the knowledge production process.

It becomes increasingly important for designers to understand how their tools will actually be used by end users, and not just how they tell people to use them. I hypothesize that the end users of codemeta-harvester will trust the tool and assume that it generates accurate metadata and that it takes into account corrections from maintainers or some other trusted metadata authority. My worry, which I am still pondering, is that it seems to lack a mechanism for collecting corrections to authorship metadata, and thus it would unintentionally, and even ironically, preserve and reinforce an inaccurate history of who contributed to the creation of software. I say ironically, because I know that is not the intention of you or the project, and that you are actually working to do the opposite (and making much progress in that direction).
With respect to codemetapy, I tried to imagine a harvesting protocol that prevents the spread of inaccurate authorship metadata. This involved creating a special file to help harvesters propagate accurate authorship metadata for already released and archived software. My approach was to create multiple files named with version numbers. I'm open to alternative naming conventions or a smaller number of files; in general, I'm open to and curious to learn about any technical strategy that actually solves the problem. I proposed one close-to-the-code, decentralized approach using codemetapy (or codemeta-harvester), which I think would move us closer to reducing the spread of inaccurate authorship metadata, and I also recognize a more centralized approach (like Zenodo, which allows updating and correcting records linked to DOIs), but I'm really interested in hearing other approaches/strategies.
Also, I have not had a chance to read/review the links you provided. I will do that as soon as possible. So I apologize in advance if those resources already address the problems I raised.
And of course, I am still interested in helping with this project, even if it does not adopt this (or other) proposals I suggest. I really appreciate the work you've done on this project.
Thanks again for your thoughts! It took me some time to get back to you, but I've given it some further thought too:
I am still thinking about some of the current design assumptions.
You mentioned the chicken-and-egg problem:
"I assume you're referring to the chicken-and-egg problem that arises when describing software. Is the metadata part of the tag or not? We indeed take that approach: even if only the metadata changes, maintainers will have to tag a new release."
If I understand you correctly, you are suggesting that if maintainers want to correct or update the metadata (including the author/contributors metadata) for a released piece of software, they can do this by adding metadata to the next released version of the software.
Correct
I assume that one of your goals for keeping the metadata close to the code is that it maintains a historical record of metadata (including inaccurate metadata).
Yes, and to ensure that the link between the source code version and the metadata version cannot be mistaken.
I am trying to figure out how to handle a different goal: I want to know how I can discover the correct or accurate metadata for a version of software that is missing this metadata in its tagged source code, or whose metadata is inaccurate or incomplete (e.g., missing an author).
For example, suppose you and I create a Python module and I release it, but I accidentally forget to include you in the codemeta.json as an author. Now suppose that codemeta-harvester comes along, harvests this inaccurate metadata, and redistributes it. Suppose a researcher then finds this inaccurate authorship metadata in a database generated by codemeta-harvester, and they publish a paper with a citation that includes me but not you.
What I want to know is how we can prevent such a situation.
Your goal is indeed a bit different from mine and some design principles are a bit in conflict. But I see where you're coming from and what problem you want to solve.
If you distribute an explicit codemeta.json with an error, then the proper solution in my system would be to release a new version of the software with the corrected metadata, just as one would do for a bugfix. The new version is the one that should be cited. This can still work for older releases and doesn't necessarily have to be linear: say software v1.2.3 (assuming semantic versioning) is the version used and has a metadata error, and a newer v2.0 has already been released as well (but is not what the researcher used); then a v1.2.4 (if not already released) or otherwise a v1.2.3.1 can be released that fixes the older metadata error.
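To make the non-linear case concrete, such a fix release could be cut straight from the old tag (the commands below are plain illustrative git usage, not part of any tool):

    git checkout -b metadata-fix-1.2 v1.2.3   # branch off the old release tag
    $EDITOR codemeta.json                     # correct the authorship metadata
    git commit -am "Fix author metadata for the 1.2 series"
    git tag v1.2.4                            # the corrected release to cite
    git push origin v1.2.4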
Alternatively, you could choose not to distribute a codemeta.json with the source at all and let the harvester generate it at run-time, but then you'd rely on at least some (automated) intermediary to get the codemeta.json from, like https://tools.clariah.nl.
It seems to me this problem is not sufficiently addressed by the design principle of only keeping the metadata close to the code, which serves a historical function of accurately recording a mistake but does not necessarily prevent and may even foster the spread of inaccurate information. I could be wrong, but it seems to me that it requires an additional mechanism for correcting misinformation. I am curious how others have solved/addressed this problem in similar domains.
I see what you mean, yes.
Zenodo, for example, decouples the source code from the metadata. It allows researchers to modify the record but not the deposited source code. https://support.zenodo.org/help/en-gb/1-upload-deposit/64-can-i-edit-records-after-they-have-been-published
I call this the central-database strategy for correcting the spread of inaccurate metadata for research software. Theoretically, it can even track the metadata changes.
The central-database strategy itself has a big pitfall when it comes to the spread of inaccurate/incomplete metadata as well: the database may be out of sync with the source. This especially happens if the central database is updated only infrequently by a human in the loop, and if it's not strictly pinned to some version. I think in practice this is a very prevalent source of metadata problems.
There is nothing wrong with harvesting and reharvesting inaccurate metadata per se. It becomes problematic when such inaccurate metadata is shared and then used under the assumption that it is accurate. Researchers have a special ethical duty to accurately credit those who worked on the software, and I had assumed that researchers and research support organizations would use tools such as codemeta-harvester to gather accurate authorship metadata about software so that they can cite it ethically.
Agreed, this is a valid concern. Tools such as codemeta-harvester should hopefully make it easier to extract complete metadata, as they look at multiple sources. But it is not always perfect and may repropagate errors already present in the existing metadata sources, which I think ideally need to be corrected at that level (say, for instance, a wrong or missing author in pyproject.toml).
I am trying to figure out how to systematically support this research integrity need. There are other research integrity needs, like reproducibility of results, which I am not focused on, but which might also be assisted by tracking who contributed to the development of software. For example, accurately crediting people might help figure out whom to go to when the system breaks, or when assistance is needed to reuse the tool.
I am not an expert in this, but I was also wondering how academic publishing handles metadata mistakes in publications. I suppose they make corrections through addendums, which can be cited. However, this puts a great burden on people to read these corrections. Now that most people access papers digitally, we have new ways to dynamically recover from publication mistakes. If people provide links to a Zenodo DOI, they can see the updated metadata, including updated authorship information.
The addendums/errata solution is probably the way to go in this scenario as well, if it can be automated a bit further. Not entirely unlike what you suggested.
With respect to codemetapy, I tried to imagine a harvesting protocol that prevents the spread of inaccurate authorship metadata. This involved creating a special file to help harvesters propagate accurate authorship metadata for already released and archived software. My approach was to create multiple files named with version numbers. I'm open to alternative naming conventions or a smaller number of files; in general, I'm open to and curious to learn about any technical strategy that actually solves the problem. I proposed one close-to-the-code, decentralized approach using codemetapy (or codemeta-harvester), which I think would move us closer to reducing the spread of inaccurate authorship metadata, and I also recognize a more centralized approach (like Zenodo, which allows updating and correcting records linked to DOIs), but I'm really interested in hearing other approaches/strategies.
Ok, let me think out loud a bit about how we could technically solve this issue whilst staying as close as possible to the current design principles (authorship with the authors / decentralisation / close to the source / use the VCS):
An addendum/errata mechanism could be considered on two levels, the original source-code level and a more centralised level at some intermediary. The former only works if the tool maintainers are still active and willing to cooperate.
- My first solution is what I already described above: just release a new version with the codemeta.json fix. The advantage is that this already works and is compatible with semantic versioning; the disadvantage is that it doesn't fully address your concern and relies on the cooperation of the maintainers.
- Second solution: instead of having multiple codemeta-$version.json files in a single source tree as you suggested (which defies git), I'd suggest just using git (or another VCS) itself. Technically (with git at least), it IS possible to retag an old version, so the codemeta.json could be updated and retagged. The advantage of this solution is that it already works; a big disadvantage is that retagging is a bit frowned upon (old versions should ideally be immutable) and may lead to problems in other pipelines (like packagers), as it does suddenly change the overall state of the older version (the checksum changes).
- Another idea, which avoids retagging, may be to introduce a convention for errata. Say git tag v1.2.3 has a metadata error; then a git branch v1.2.3-errata could be published that contains metadata fixes to that release. This branch can be updated as more errors are found. The advantage is that we stick to immutable version tags; the disadvantage is that we add complexity and a convention that people would have to learn and use, which they might not want to do. (A sketch of this convention follows below the list.)
- Now, what if the original maintainers aren't active or cooperative? Then you'd need a more centralised intermediary system decoupled from the original source code: a single 'errata' database, if you will. You could have such a central database in one git repo for multiple tools, one directory per tool, and in this directory you could keep codemeta-$version.json files exactly as you suggested (or codemeta-harvester-$version.json files for the harvester). Since this VCS is decoupled from the actual source code, I think the versioned files in git are acceptable here. The alternative is to create git branches like "mytool-v1.2.3" tracking a mytool/codemeta.json, but that may be overly complex. This approach does beg the question: what if the authors and the intermediary are in conflict? Whose codemeta.json has the final say? The two systems could exist alongside each other, with one taking precedence over the other (but you have to pick, perhaps dynamically on the basis of the latest modification time; see the second sketch below the list). This central errata repository could be tied to the harvester so that the harvester makes use of the corrections in it.
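As a rough illustration of the errata-branch convention from the third point (plain git usage; nothing here is implemented anywhere yet):

    git checkout -b v1.2.3-errata v1.2.3   # branch off the immutable release tag
    $EDITOR codemeta.json                  # fix e.g. the missing author
    git commit -am "Errata: add missing author for v1.2.3"
    git push origin v1.2.3-errata          # harvesters could look for *-errata branches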
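And for the precedence question in the last point, a minimal sketch of how a harvester might dynamically pick between the source's codemeta.json and an errata record, assuming both carry codemeta's dateModified property (the function and policy are hypothetical):

    # Prefer whichever record was modified most recently; ISO 8601 date
    # strings compare correctly as plain strings.
    def authoritative(source: dict, errata: dict | None) -> dict:
        if errata is None:
            return source
        if errata.get("dateModified", "") >= source.get("dateModified", ""):
            return errata
        return source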
I hope I've given some further food for thought; it's an interesting issue. It may also be worth discussing in the wider codemeta community.