Standardising on a format for describing affected functions.
Context: https://github.com/ossf/osv-schema/issues/127

Summary: Many ecosystems/databases are starting to encode more fine-grained, "vulnerable symbol" information in advisories.
Go has defined its own `ecosystem_specific` format for describing affected functions: https://go.dev/security/vuln/database#schema. Most vulns in the Go vuln DB have this.
RustSec has this in their security advisories: https://github.com/rustsec/advisory-db/blob/ee74d43ec2c6203750e38a254f4029f6181c4362/crates/RUSTSEC-2021-0027.json#L30
GHSA also has its own: https://github.com/github/advisory-database/blob/046aafb21a71eadbe8bdd5a72b1491a5f1209fb7/advisories/github-reviewed/2023/07/GHSA-57fc-8q82-gfp3/GHSA-57fc-8q82-gfp3.json#L25
Given that ecosystems have their own nuances for describing these accurately (e.g. OS and arch specifiers), it makes more sense for each ecosystem to define its own `ecosystem_specific` field. We should define an `ecosystem_specific` for Python as well, as part of this DB.
@di @sethmlarson thoughts?
My concern is that I'm not sure how we'd get that information automatically; it'd be a manual process for each advisory. Then we'd have to teach pip-audit how to parse ASTs to apply it? cc @woodruffw
> My concern is that I'm not sure how we'd get that information automatically; it'd be a manual process for each advisory. Then we'd have to teach pip-audit how to parse ASTs to apply it?
Yes, it'd be manual for now, but I can imagine automation helping out with various pieces here, based on identifying fix commits and extracting the list of changed functions as a rough starting point.
We've been building support for Go (via govulncheck) and Rust in osv-scanner: https://github.com/google/osv-scanner/issues/476, so there is some precedent in other ecosystems.
> We should define an `ecosystem_specific` for Python as well, as part of this DB.
Yes!
In particular: in addition to relaying vulnerable symbols/functions, one thing that would be valuable for Python specifically is being able to signal the kind of distribution the vulnerability is relevant to.
This has come up a few times in the past, particularly with PyCA Cryptography: they've released security advisories that pertain only to wheel distributions (since those distributions include a static build of OpenSSL), and it'd be great if non-wheel users weren't unnecessarily included in those advisories.
CC @alex (I could have sworn he opened an issue for this somewhere before, but I can't immediately find it)
I don't remember if I filed a bug, or just groused a bunch :-)
Tracking affected functions makes sense to me, though making use of that information may be more challenging than in a statically typed language.
> Tracking affected functions makes sense to me, though making use of that information may be more challenging than in a statically typed language.
To flag a few (not exhaustive) examples of this:
- Python modules and packages can re-export from other modules and packages (including third-party ones)
- Decorators do "interesting" things to function and class names if they're done manually rather than via `functools` (specifically, they result in functions and classes having unexpected `<locals>` qualnames due to being defined dynamically within a function or other callable; see the sketch below)
I wouldn't expect a standard format here to solve either of these perfectly (they're fundamentally too dynamic), but both are worth making considerations around. In particular, for re-exports, it's probably worth nailing down whether all re-exports (potentially many!) will be listed for a vulnerable package, or whether the single original site of export gets listed (which itself might be funky, since it could be a "private"-to-public re-export).
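To make the decorator point above concrete, here's a minimal sketch of the `<locals>` behavior (the function names are hypothetical, not from any advisory):

```python
import functools

def plain_decorator(func):
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper  # no functools.wraps, so metadata is lost

@plain_decorator
def vulnerable(): ...

# The decorated function reports the wrapper's qualname, not its own:
print(vulnerable.__qualname__)  # 'plain_decorator.<locals>.wrapper'

def wrapping_decorator(func):
    @functools.wraps(func)  # copies __qualname__, __module__, etc.
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

@wrapping_decorator
def also_vulnerable(): ...

print(also_vulnerable.__qualname__)  # 'also_vulnerable'
```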
Hi everyone, my name is Itamar Sher and this is my first time commenting here. So hello 👋 We're identifying the affected function for vulnerabilities we're patching in our public vulnerability patches database (https://github.com/seal-community/patches). I planned to go via the regular contribution channel, but since I noticed this discussion I thought it might be a good opportunity to collaborate on how we can automatically share/feed this information back into the advisory without encumbering the maintainers.
Python has a spec for a fully qualified name (since Python 3.3): https://peps.python.org/pep-3155/. I haven't dug too deeply into it, but it appears to be a walk down the AST for any given class, e.g.:
```
>>> class C:
...     def f(): pass
...     class D:
...         def g(): pass
...
>>> C.__qualname__
'C'
>>> C.f.__qualname__
'C.f'
>>> C.D.__qualname__
'C.D'
>>> C.D.g.__qualname__
'C.D.g'
```
The one thing I would suggest be added to that form is that paths be rooted with the package's PEP 503 normalized name (https://peps.python.org/pep-0503/#normalized-names).

E.g. packages `foo` and `bar`, each with a top-level function (or class) `parse`, would have that function (or class) referenced as `foo.parse` and `bar.parse`.
Yeah, IMO the full `__qualname__` is a reasonable starting point.
> The one thing I would suggest be added to that form is that paths be rooted with the package's PEP 503 normalized name (https://peps.python.org/pep-0503/#normalized-names).
I might be misunderstanding, but I don't think this will be necessary: the OSV finding itself will contain the package name, so including it in the path will be redundant (as well as confusing, since the package name isn't guaranteed to be a valid Python identifier or the same as the top-level module name).
This also leaves open the re-export and decorator/`<locals>` issues, but I think a 99% solution is probably better to get started with 🙂
> I might be misunderstanding, but I don't think this will be necessary: the OSV finding itself will contain the package name, so including it in the path will be redundant
I think we do need a root of some sort to capture the case where the package is a single object/function.

E.g. some package `foo` where a user would use `foo` as:

```python
import foo

foo(some_input)
```
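For what it's worth, a package can even make its top-level module itself callable; a minimal sketch of one way to do it (hypothetical package `mylib`, relying on CPython allowing a module's `__class__` to be reassigned):

```python
# mylib/__init__.py
import sys
import types

class _CallableModule(types.ModuleType):
    def __call__(self, data):
        # The package's single entry point lives here.
        return f"processed {data!r}"

# Swap the module's class so `import mylib; mylib(...)` works.
sys.modules[__name__].__class__ = _CallableModule
```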
> package name isn't guaranteed to be a valid Python identifier or the same as the top-level module name
Can you expand on this? I think maybe I've confused the package name with the top-level identifier. I was under the impression that the PEP 503 package name was the top-level identifier, but perhaps that's just a commonality and not a rule. If I'm in error there, then yes, maybe the top-level identifier is the correct approach. Is there a PEP for this?
> Can you expand on this? I think maybe I've confused the package name with the top-level identifier. I was under the impression that the PEP 503 package name was the top-level identifier, but perhaps that's just a commonality and not a rule. If I'm in error there, then yes, maybe the top-level identifier is the correct approach. Is there a PEP for this?
Correct: it's just a convention, not a rule. The two technically don't have to be similar at all, and the normalized package name still isn't guaranteed to be a valid Python identifier.
A good canonical example of this is PIL and Pillow -- Pillow is a maintained fork of PIL, so it still uses `import PIL` as its top-level module despite being installed via `pip install Pillow`.
As far as I know, there's no actual PEP stating this anywhere; it's just a quirk of how Python packaging works 🙂
As a result of this, a `__qualname__` like `PIL.foo.bar.baz` may ambiguously refer to either the PIL package or the Pillow one, but the surrounding OSV content (including the package name) provides the disambiguation for us. So in theory just the `__qualname__` should be sufficient, since it's understood it's coming from a specific package.
Ah yes, Pillow. Good example 🙂
I guess the question then is: when reading code, how does one determine what the top-level module is? I assume that's discoverable and that there's only one top-level module.
> So in theory just the `__qualname__` should be sufficient, since it's understood it's coming from a specific package.
Just to clarify: `__qualname__` should include the top-level module, correct?
> I guess the question then is: when reading code, how does one determine what the top-level module is? I assume that's discoverable and that there's only one top-level module.
Yeah, the discovery process for matching a vulnerability to an environment (or source code) would roughly be (sketched below):

- Confirm that the top-level package in the OSV finding is present
- Iterate through all statically discoverable imports in the target, and normalize them (handling `import` aliases, `from ... import ...`, etc.)
- Compare each fully normalized import's members against all fully qualified paths in the OSV finding
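A minimal sketch of the second and third steps, using only the standard library; the `vulnerable` set and the intersection-based matching are illustrative assumptions, not part of any spec (aliases and relative imports are ignored here):

```python
import ast

def imported_names(source: str) -> set[str]:
    """Collect fully qualified names reachable via static, absolute imports."""
    names: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                names.add(alias.name)  # e.g. "pip_audit._audit"
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            for alias in node.names:
                # e.g. "pip_audit._audit" + "." + "AuditOptions"
                names.add(f"{node.module}.{alias.name}")
    return names

# Hypothetical vulnerable path from an OSV finding:
vulnerable = {"pip_audit._audit.AuditOptions"}

print(imported_names("from pip_audit._audit import AuditOptions") & vulnerable)
# {'pip_audit._audit.AuditOptions'}
```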
> Just to clarify: `__qualname__` should include the top-level module, correct?
Hmm, I thought it did, but from a quick look `__qualname__` only covers the qualified name within a module, rather than the full module resolution path. Using a random example from my host:
```
>>> from pip_audit import _cli
>>> _cli.AuditOptions.__qualname__
'AuditOptions'
```
So we'd need to go a little beyond `__qualname__`, and also probably use `__module__` and similar to build up the full path:
```
>>> _cli.AuditOptions.__module__
'pip_audit._audit'
```
(this has the nice effect of also canonicalizing the fully qualified name, since `AuditOptions` isn't actually defined in `_cli`, just imported into it).
So would this be a "simple" concatenation of `__module__` and `__qualname__`, or a more complex operation? I'm not following where `_audit` came from in your example, and I'm quite (completely) ignorant of the code inside that `pip_audit` module.
> So would this be a "simple" concatenation of `__module__` and `__qualname__`, or a more complex operation?
Yep, it should be "just" that 99% of the time (famous last words). It gets a little bit more complicated when the target is a module or package itself (perhaps it should never be, as a matter of policy?) but even in those cases it should be just different concatenations of `__name__`, etc.
Sorry for the confusion with my example 🙂 -- `_audit` is another module under `pip_audit`, so I was trying to demonstrate that a member imported from one module (`pip_audit._cli`) can actually have a fully qualified name under another module (`pip_audit._audit`). That was mostly to demonstrate the complexity of resolution, e.g. a user might do this in their code:

```python
from pip_audit import _cli

_cli.AuditOptions(...)
```

and we would want to catch that, despite the "canonical" path for `AuditOptions` being `pip_audit._audit.AuditOptions`, not `pip_audit._cli.AuditOptions`.
> Yep, it should be "just" that 99% of the time (famous last words).
So `foo.__module__` + `foo.__qualname__`?
Also, I did a quick look for the source of these two attributes and couldn't get at them quickly. I'll try to dig them up and link them here when I get some more time. I don't really wanna be in a world where the spec is basically whatever these two attributes spit out.
> even in those cases it should be just different concatenations of `__name__`, etc.
We should enumerate those cases 🙂
On your example: I guess `AuditOptions` can be pulled in from `_cli` in practice, so it's available as a re-export or something? E.g. `_cli` pulls it in for whatever reason, and a user could get access through that module?
I agree that we should stay focused on a canonical path and maybe this example brings up a question of how to capture aliases for a path, but maybe that's a question to resolve in v2.
So, then my question on `AuditOptions` is: what makes `pip_audit._audit.AuditOptions` the canonical path, and how would I make that determination for a different project?
> So `foo.__module__` + `foo.__qualname__`?
Yep, 99% of the time. There are some items in Python that don't have a `__module__` or `__qualname__` (like module objects themselves); these need to use `__name__` instead (and maybe some other things that I'm forgetting?).
It'd be good to check what Python actually says about those attributes, but it's also worth noting that they're fundamentally dynamic anyways: a misbehaving class can do `__module__` shenanigans, and can rewrite its own `__qualname__`. There's very little Python will probably actually guarantee about these 🙂
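For the concatenation itself, a minimal sketch of that fallback logic (`fq_name` is a hypothetical helper, not part of any proposed spec):

```python
import types

def fq_name(obj) -> str:
    """Best-effort dotted path for a class, function, or module."""
    if isinstance(obj, types.ModuleType):
        # Modules have no __qualname__; __name__ is already the full path.
        return obj.__name__
    module = getattr(obj, "__module__", None)
    qualname = getattr(obj, "__qualname__", None)
    if module and qualname:
        return f"{module}.{qualname}"
    # Fall back for odd objects (some builtins, C extensions, etc.).
    return getattr(obj, "__name__", repr(obj))

import json
print(fq_name(json))                     # 'json'
print(fq_name(json.JSONDecoder))         # 'json.decoder.JSONDecoder'
print(fq_name(json.JSONDecoder.decode))  # 'json.decoder.JSONDecoder.decode'
```

(Note that `json.JSONDecoder` demonstrates the canonicalization effect from earlier: it's usually imported from `json`, but `__module__` reports its definition site, `json.decoder`.)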
> On your example: I guess `AuditOptions` can be pulled in from `_cli` in practice, so it's available as a re-export or something? E.g. `_cli` pulls it in for whatever reason, and a user could get access through that module?
Yep, exactly. In practice, Python's lack of module visibility or explicit exporting means that any internal use of a package by itself results in re-exports that users can accidentally depend on.
> So, then my question on `AuditOptions` is: what makes `pip_audit._audit.AuditOptions` the canonical path, and how would I make that determination for a different project?
This is, unfortunately, a good question with no good answer 😅
As far as I know, Python's module system has no actual definition of "canonical" for a particular resolution path. As such, a tool/ecosystem like OSV will need to make its own determinations about how best to canonicalize paths. A few options:

- Always favor the site of definition. In other words: whichever module `class Foo` appears in is the canonical module, even if `Foo` gets re-exported elsewhere.
  - Problems: Lots of packages use the "re-export to public" pattern for public APIs, meaning that we'll end up "canonicalizing" paths that no user should ever actually import (e.g. `somepkg._internal.Foo` rather than `somepkg.public.Foo`).
- Follow first-party re-exports and `__all__`.
  - Problems: Can cause duplicates; unclear how to tiebreak between them. Also unclear what to do about wildcards.
- Don't bother canonicalizing; list every definition path and first-party re-export.
  - Problems: Extremely noisy; will include lots of paths that are "valid" import paths but that no user will ever actually import.
(There are also further problems, like packages that have canonical paths for public APIs but intentionally use `__getattr__` to force dynamic lookups as part of their deprecation process. Pydantic is a good example of this.)
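For reference, that dynamic-lookup pattern is module-level `__getattr__` (PEP 562); a minimal sketch with a hypothetical `somepkg`:

```python
# somepkg/__init__.py
def __getattr__(name: str):
    """Resolve deprecated top-level names lazily, at attribute-access time."""
    if name == "Foo":
        import warnings
        warnings.warn(
            "somepkg.Foo is deprecated; use somepkg.public.Foo",
            DeprecationWarning,
            stacklevel=2,
        )
        from somepkg.public import Foo
        return Foo
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```

Nothing in `somepkg`'s static surface mentions a top-level `Foo`, so a purely static scanner won't see that `somepkg.Foo` resolves at runtime.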
> It'd be good to check what Python actually says about those attributes, but it's also worth noting that they're fundamentally dynamic anyways: a misbehaving class can do `__module__` shenanigans, and can rewrite its own `__qualname__`. There's very little Python will probably actually guarantee about these 🙂
Oh, for sure. I'm not worried about modules that are intentionally obtuse for a v1 spec. A 99% solution will get us a lot of value for most users, and we can tackle the infinite regress of edge cases in v2 (or v3, maybe 🙂)
But yes, let's formalize `__module__` + `__qualname__` as more than just the output of those attributes, and if that's straightforward we can probably start moving on that 🤔
> This is, unfortunately, a good question with no good answer 😅
:(
> As such, a tool/~~ecosystem~~ format like OSV will need to make its own determinations about how best to canonicalize paths.
I guess for the moment we could also say that it's a data provider's obligation to do a best-effort canonicalization. There will be aliases in practice, of course, but maybe that's an acceptable choice. I tend to prefer matching whatever a user of a package would actually use, so that scanning tools can be more easily maintained and made lower-noise. So if there's a re-export pattern in some library, but the developer documentation advises users to use that re-export, then we prefer that export. That's human guidance, I know.
> But yes, let's formalize `__module__` + `__qualname__` as more than just the output of those attributes, and if that's straightforward we can probably start moving on that 🤔
Sounds good! As a starting point: we can say that these names take the form:

```
module1.module2.module3:attribute
```

...where the LHS of the `:` is an absolute importable module path, i.e. a value that could be passed into `importlib.import_module(...)` without needing a relative package reference to resolve against.

The RHS is a gettable attribute within the importable module, meaning that `getattr(module3, attribute)` is expected to not raise `AttributeError` (although this isn't guaranteed).

The reason I suggest `:` as the attribute delimiter is for consistency with other bits of the ecosystem: it's what setuptools and other packaging tooling use for `entry_points` syntax, and it also shows up in Pyramid and a few other frameworks. So there's precedent for it, and it maintains separation between the "importable component" and "gettable component."

When an entire module is meant to be flagged in a finding, we can probably just drop the `:attribute`, e.g.:

```
module1.module2.module3
```

...which follows the same `importlib.import_module(...)` rules.
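To make those rules concrete, a minimal sketch of dynamically resolving this form (`resolve` is a hypothetical helper; real tooling would also need to handle `ImportError`/`AttributeError`):

```python
import importlib

def resolve(spec: str):
    """Resolve 'pkg.module:attr' (or a bare 'pkg.module') to an object."""
    module_path, _, attribute = spec.partition(":")
    module = importlib.import_module(module_path)
    return getattr(module, attribute) if attribute else module

print(resolve("json.decoder:JSONDecoder"))  # <class 'json.decoder.JSONDecoder'>
print(resolve("json.decoder"))              # <module 'json.decoder' from ...>
```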
> I guess for the moment we could also say that it's a data provider's obligation to do a best-effort canonicalization. There will be aliases in practice, of course, but maybe that's an acceptable choice. I tend to prefer matching whatever a user of a package would actually use, so that scanning tools can be more easily maintained and made lower-noise. So if there's a re-export pattern in some library, but the developer documentation advises users to use that re-export, then we prefer that export.
That seems very reasonable to me. There isn't a perfect answer here, so I think anything (like this) that gets us to the 99% case is perfectly suitable 🙂
Any chance you can give some examples of the `:` usage? Not sure I follow what you're suggesting there.
> Any chance you can give some examples of the `:` usage? Not sure I follow what you're suggesting there.
Sure, using the example above:

```
pip_audit._audit.AuditOptions
```

would become:

```
pip_audit._audit:AuditOptions
```

On the other hand, a report for an entire module would remain the same:

```
foo.bar.baz
```

That's really the only change; the main reasons I suggest it are because it's what other tools in the Python packaging ecosystem do, and because it disambiguates different parts of the path well.
Assuming we're going off https://github.com/pypa/pip-audit/blob/main/pip_audit/_audit.py for your example, `AuditOptions` is a class. I guess I'm not following why that would change. One could still import `pip_audit._audit.AuditOptions`, correct?
> One could still import `pip_audit._audit.AuditOptions`, correct?
Not through the importlib machinery:

```
>>> import importlib
>>> importlib.import_module("pip_audit._audit.AuditOptions")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/william/.pyenv/versions/3.11.2/lib/python3.11/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1206, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1178, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1137, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'pip_audit._audit.AuditOptions'; 'pip_audit._audit' is not a package
>>> importlib.import_module("pip_audit._audit")
<module 'pip_audit._audit' from '/Users/william/devel/pip-audit/pip_audit/_audit.py'>
```
Similarly with the `import` keyword:

```
>>> import pip_audit._audit.AuditOptions
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'pip_audit._audit.AuditOptions'; 'pip_audit._audit' is not a package
```
That's why the disambiguation with `:` is nice: it makes it clear which parts can actually be imported (which a triaging tool might need to do) versus which parts are attributes.
Ah OK, so the `:` wouldn't necessarily be a path indicator, but an indicator of how to interact with the object. I'm wondering how I would validate that, and I'm not sure I can commit to supporting it reliably. Any thoughts on that end?
> Ah OK, so the `:` wouldn't necessarily be a path indicator, but an indicator of how to interact with the object.
Right: it's an indicator that the thing on the RHS isn't a part of the path, but is rather a class, object, function, etc. (i.e. the things that are the actual vulnerable API surface, rather than the path to the vulnerable API surface).
> I'm wondering how I would validate that, and I'm not sure I can commit to supporting it reliably. Any thoughts on that end?
On one level, validating this is the same as it would be with the `foo.bar.Baz` syntax -- you just check to see if the input matches `\w[\w\d]*(\.\w[\w\d]*)*(:\w[\w\d]*)?` (very roughly).
If you're asking how you'd check that the LHS is a valid path and the RHS is a valid attribute: you can either do it dynamically with `importlib` + `getattr`, or get 95% of the way there statically. But neither is 100% guaranteed to resolve all valid attributes under valid paths.
(This is true for the `foo.bar.Baz` syntax as well -- there's no generalized way to validate that a particular Python module/attribute path is "real" for all users. So I don't think this syntax has any advantages in that respect; the main advantages are that the Python community already uses it elsewhere, and it's slightly less ambiguous about the RHS.)
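For the purely syntactic side, a minimal sketch of that (very rough) pattern, anchored and compiled -- note this checks shape only, not actual identifier validity:

```python
import re

# Rough shape of "module1.module2:attribute" / "module1.module2".
SYMBOL_RE = re.compile(r"^\w[\w\d]*(\.\w[\w\d]*)*(:\w[\w\d]*)?$")

assert SYMBOL_RE.match("pip_audit._audit:AuditOptions")
assert SYMBOL_RE.match("foo.bar.baz")            # a whole module
assert not SYMBOL_RE.match("foo..bar")           # malformed path
assert not SYMBOL_RE.match("foo.bar:Baz:extra")  # at most one attribute part
```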
> If you're asking how you'd check that the LHS is a valid path and the RHS is a valid attribute:
This. Intuitively it seems easier to discover valid `foo.bar.Baz` paths than to know if the trailing element is an attribute or not. Knowing if something is an attribute or not seems like it would require much more familiarity with any given code base. I do agree it's less ambiguous, but if we can't reliably provide that data, would `foo.bar.Baz` be an acceptable fallback?
> Intuitively it seems easier to discover valid `foo.bar.Baz` paths than to know if the trailing element is an attribute or not.
Could you elaborate a bit more on what you mean by "discover valid"? In a strict sense, `foo.bar.Baz` and `foo.bar:Baz` are exactly as easy to discover and exactly as hard to validate; the benefit of the `:` is solely to disambiguate the actual API being identified versus a path to it (and because it's what the rest of the packaging ecosystem uses, as mentioned before).
For example, `foo.bar.Baz` could refer to any of the following things:

- Constant `Baz` through `foo.bar`
- Class `Baz` through `foo.bar`
- Module `Baz` under `foo.bar`
- Any (or multiple) of the above, but impossible to statically derive
- Any of the above, as a re-export

And the same is true for `foo.bar:Baz`.
I think the TL;DR of my position is that Python's dynamic nature makes it almost impossible to actually "validate" anything here, whether you use `:` or `.` as the final delimiter -- the value of `:` for the attribute delimiter doesn't come from additional validation power, but instead from consistency with other Python tools and the intent it communicates 🙂
> Could you elaborate a bit more on what you mean by "discover valid"?
Sure. Part of my job is in populating this data, and discovery of paths like `foo.bar.Baz` is its own complexity, but is something I've been able to do with reasonable consistency. The `:` qualifier would make the data more useful for sure, but I am not confident I would be able to support it reliably. So I don't want to be in a position where I'm generating bad data for the rest of the world.
Now, if the `:` is optional, it could certainly be the case that I primarily output `foo.bar.Baz` and users come and correct me over here. No doubt they'll come correct me for the dot-separated strings too, but I guess I'm concerned about the expectation, as I expect to fail if `:` is expected 🙂
Thanks for explaining!
I understand your position, and I'm also very sympathetic to not adding more work to an already manual process 🙂. I think I might have given you the wrong impression about how difficult this is by talking about pathological cases: in practice, 99.9% of paths that you discover as `foo.bar.Baz` can be automatically rewritten as `foo.bar:Baz`, even when `Baz` is something like a module itself.
That being said, I recognize that even that may be a dealbreaker for you. Ultimately, any amount of path/API information will be beneficial here, so I'll stop hammering on `:` and just say that this is all good, and I'm looking forward to any variant of it coming into existence 🙂
Alright, if you're comfortable with it, I'd prefer to keep the spec location-based (and slightly easier 🙂) for v1, with just a dot-separated path to describe the location of a vulnerable object.
I still need to get some time to go find the source for `__module__` and `__qualname__`, but I think we're basically in agreement on what to do 🙂