
Add purls (Package URLs) to `PackageRecord`

Open baszalmstra opened this issue 2 years ago • 9 comments

This CEP describes a change to the `PackageRecord` format and the corresponding `repodata.json` file to include purls (Package URLs) of repackaged packages, to identify packages across multiple ecosystems.
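For context, the proposal amounts to attaching an optional list of purls to each record. A hypothetical `repodata.json` excerpt (field name taken from the CEP title; the exact schema and placement here are my assumption, not spec text):

```json
{
  "packages.conda": {
    "django-1.10.1-py35_0.conda": {
      "name": "django",
      "version": "1.10.1",
      "build": "py35_0",
      "depends": ["python >=3.5,<3.6.0a0"],
      "purls": ["pkg:pypi/django"]
    }
  }
}
```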

rendered

baszalmstra avatar Nov 23 '23 14:11 baszalmstra

Awesome CEP! :)

wolfv avatar Nov 23 '23 20:11 wolfv

Would this also help us address Repology's needs for supporting Conda packages ( https://github.com/repology/repology-updater/issues/518 )?

Edit: Nvm, missed that Jaime has the same idea

jakirkham avatar May 08 '24 17:05 jakirkham

> Where in the current `meta.yaml` should we define the PURLs? `about` seems to be the most obvious one, which means this will probably end up in `info/about.json`.

I agree that about makes the most sense. However, this adds the redundancy of defining the upstream package twice in the recipe. A more sophisticated solution would be adding a new purl source type for the source section, which gets resolved to a PyPI tarball URL by conda build. The purls for a package could then be automatically inferred from the sources it has been built from. In all cases, a manual option to define the purls likely has to remain for some specific use cases.

While this would facilitate simplicity, avoid redundancy, and avoid errors in the recipe, I see the following downsides with that solution:

  • different package outputs or variants may not actually use all sources that are available, requiring manually overriding the purls or another clever solution for that
  • how to verify the hash of a source tarball is more evident if the source URL is stated explicitly in the recipe
  • to avoid complexity, we could only support a subset of purl types. PyPI is by far the most important IMO. It could confuse people if only a subset of purls are valid sources.
  • introducing a new source type is simply more work than introducing a new about field - especially in related tooling such as cf-scripts or conda-smithy
  • backward compatibility?

> Whether to serve the PURLs separately in a `purls.json` or not. I honestly don't think putting it in `repodata.json` is a good idea. I get that it makes sense if you want to have a canonical link between PyPI and conda-forge so Pixi can solve things nicely. It might also be served in `channeldata.json` (since most of the time PURLs are tied to the source, not the platform-dependent target artifact).

I do not have a strong opinion here since I am not too involved with the tools that would need to process that data.

ytausch avatar Oct 14 '24 17:10 ytausch

I think a broader question is whether `package-url` can be adopted more directly by the ecosystem. Relying on "the URL where you get your package" or "what it's called on disk" isn't as effective as an agreed-upon grammar for identifying packages, especially in the "is this CVE relevant to me?" scenario.

Brief aside, and likely worth including in the text

A purl is a URL composed of seven components:

 scheme:type/namespace/name@version?qualifiers#subpath
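To illustrate that grammar, here is a minimal splitter for those components. This is a sketch, not a spec-complete parser (it ignores percent-encoding subtleties and per-type rules); real tooling should use the `packageurl-python` library instead.

```python
from urllib.parse import parse_qs, unquote


def split_purl(purl: str) -> dict:
    """Split a purl string into its seven components (illustrative only)."""
    scheme, _, rest = purl.partition(":")
    assert scheme == "pkg", "a purl always uses the pkg: scheme"
    # peel off subpath, then qualifiers, then version, right to left
    rest, _, subpath = rest.partition("#")
    rest, _, qualifiers = rest.partition("?")
    rest, _, version = rest.partition("@")
    parts = rest.split("/")
    ptype, name = parts[0], parts[-1]
    namespace = "/".join(parts[1:-1])  # may be empty, e.g. for pkg:pypi
    return {
        "type": ptype,
        "namespace": namespace,
        "name": unquote(name),
        "version": unquote(version),
        "qualifiers": {k: v[0] for k, v in parse_qs(qualifiers).items()},
        "subpath": subpath,
    }
```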

To put this in the context of the above, a given .conda package might claim a PURL as a proxy in one or more other types, but by existing, it should claim one in the conda type. Indeed, a subset is already part of the spec. For example, an old version of django:
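(The inline example appears to have been lost in rendering; judging from the purl spec's conda test data and the expansion later in this comment, it was presumably along these lines:)

```
pkg:conda/django@1.10.1?build=py35_0&channel=main&subdir=win-32&type=tar.bz2
```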

It might make sense to advocate for some changes to the conda part of the spec (and test data), namely:

  • update the default channel from r.a.com -> c.a.org
    • then the namespace part of the purl would encode the conda channel (e.g. pkg:conda/conda-forge/django)
    • I don't know what the "new null" would be, but I'm pretty sure it can't be defaults
  • use label for e.g. main (not channel, as in the example)
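Concretely, the two proposed changes would shift the channel out of the qualifiers and into the namespace (my sketch of the before/after, not spec text):

```
# today, per the purl spec's conda examples (channel as a qualifier)
pkg:conda/django@1.10.1?build=py35_0&channel=main&subdir=win-32

# proposed (channel as the namespace, label as a qualifier)
pkg:conda/conda-forge/django@1.10.1?build=py35_0&label=main&subdir=win-32
```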

While I don't think much can be done about "where you got the source tarball" (because of GitHub sources, etc.), I don't think a recipe author should have to calculate all of these things by hand... but they certainly could, given the data available today:

# meta.yaml
{% set version = "1.10.1" %}
package: 
  name: django
  version: {{ version }}
# ...
about:
  # ...
  purls:
    - pkg:pypi/django@{{ version }}
    # this should be fully automated, either at build time (weird?) or trivially-derivable
    - pkg:conda/{{ channel_targets.split(" ")[0] }}/django@{{ version }}?subdir={{ target_platform }}&label={{ channel_targets.split(" ")[1] }}&build=py{{ py }}_{{ build_number }}

So the above full purls might expand to

purls:
- pkg:pypi/django@1.10.1
- pkg:conda/conda-forge/django@1.10.1?subdir=win-32&label=main&build=py35_0

bollwyvl avatar Oct 23 '24 01:10 bollwyvl

Thinking about this more in the context of "accidental cross-ecosystem namesquatting" on Zulip: since `pkg:` isn't presently a valid package name, the MatchSpec grammar could be expanded to include purls as an alternative package identifier:

dependencies:
- pkg:pypi/django >=1.10.1,<1.11

Treating everything after the whitespace as "this part is about conda" would still allow for all our variant business, but could presumably be expanded eventually to allow per-ecosystem fields... luckily, PyPI only has semi-irrelevant stuff like `file_name`, but other ecosystems could be more complex.
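That parsing rule is simple enough to sketch. This is a hypothetical illustration of the proposal (no conda tool implements such a grammar today):

```python
def split_purl_spec(spec: str) -> tuple[str, str]:
    """Split a hypothetical purl-extended MatchSpec into
    (purl identifier, conda-side constraint).

    Everything after the first whitespace is treated as an ordinary
    MatchSpec version/build constraint, per the proposal above.
    """
    ident, _, constraint = spec.partition(" ")
    return ident, constraint.strip()
```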

bollwyvl avatar Jan 04 '25 16:01 bollwyvl

> include purls as an alternative package identifier

I don't follow entirely. What would your example refer to? The PyPI package or the corresponding conda-forge package?

ytausch avatar Jan 16 '25 09:01 ytausch

> > include purls as an alternative package identifier
>
> I don't follow entirely. What would your example refer to? The PyPI package or the corresponding conda-forge package?

Right, the user wants the corresponding conda package, from the highest-priority channel, but resolved in the parallel namespace.

# e.g. in pixi.toml
[dependencies]
# |  a new package identifier
# V
"pkg:pypi/django" = ">=1.10.1,<1.11"
#                    ^      
#                    |  the conda constraints, in the MatchSpec grammar
"pkg:golang/github.com/rhysd/actionlint" = ">=1.7.7"

Where this would be most excellent, for the PyPI case, is if the spec could also capture [extras], which are seeing increasing usage in Python inter-package dependencies (even though pip is bad at them).

There is no consensus in conda packaging on how to capture such "optional dependencies": some packages just ship all the optional dependencies (which hurts for Big Cloud Vendor API extras), and others have multiple outputs, but again without any naming convention (e.g. some folks just use `-{extra-name}`; I generally push for `-with-{extra-name}`).

An extreme case might be fastapi:

# e.g. in rattler-build recipe.yaml
recipe:
  version: ${{ version }}
outputs:
  # with fully-specified purls
  - package:
      name: fastapi
      purl: pkg:pypi/fastapi@${{ version }}
    dependencies:
      run:
        - pkg:pypi/starlette >=0.40.0,<0.42.0
        - pkg:pypi/pydantic >=1.7.4,!=1.8,!=1.8.1,!=2.0.0,!=2.0.1,!=2.1.0,<3.0.0
        - pkg:pypi/typing-extensions >=4.8.0
  # or maybe it makes sense to CURIE them, using a `pip:`-like syntax
  - package:
      name: fastapi-standard
      purl:
        pkg:pypi:
          - fastapi[standard]@${{ version }}
    dependencies:
      run:
        - ${{ pin_subpackage("fastapi", exact=True) }}
        - pkg:pypi:
          - fastapi-cli[standard] >=0.0.5
          - httpx >=0.23.0
          - jinja2 >=2.11.2
          - python-multipart >=0.0.7
          - itsdangerous >=1.1.0
          - pyyaml >=5.3.1
          - ujson >=4.0.1,!=4.0.2,!=4.1.0,!=4.2.0,!=4.3.0,!=5.0.0,!=5.1.0
          - orjson >=3.2.1
          - email-validator >=2.0.0
          - uvicorn[standard] >=0.12.0
          - pydantic-settings >=2.0.0
          - pydantic-extra-types >=2.0.0

The latter form would remove most of the package-naming impedance mismatch, meaning tools like grayskull would need far less bespoke logic to maintain package-name mappings.

bollwyvl avatar Jan 20 '25 14:01 bollwyvl