Dependencies are not created using the scan_package pipeline
Using https://files.pythonhosted.org/packages/ce/21/41a0028f6d610987c0839250357c1a00f351790b8a448c2eb323caa719ac/celery-5.2.7.tar.gz as input.
With the scan_codebase and root_filesystems pipelines, 88 dependencies are created.
With the scan_package pipeline, no dependencies are created.
Note that the "Datafile resource" files requires.txt and PKG-INFO, which are used to create those dependencies, exist as codebase resources in the scan_package pipeline as well.
Looks like the issue is in scancode-toolkit. The scancode scan created by the pipeline does not have any dependencies in the top-level dependency field. I've downloaded and extracted celery-5.2.7.tar.gz and scanned it with scancode just to make sure and I got the same result.
I've re-run the scans with the latest SCIO and toolkit:
scan_codebase 117 dependencies:
- 42 dependencies with datasource_id="pypi_editable_egg_pkginfo" datafile_resource="PKG-INFO"
- 74 dependencies with datasource_id="pip_requirements" datafile_resource="requirements/extras/*.txt"
scan_package 42 dependencies:
- 42 dependencies with datasource_id="pypi_editable_egg_pkginfo" datafile_resource="PKG-INFO"
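A breakdown like the one above can be derived from a scan's JSON output by grouping the top-level dependencies. The following is only a minimal sketch: it assumes a top-level "dependencies" list whose entries carry "datasource_id" and "datafile_path" keys (the field names may differ between SCTK and SCIO exports), and the sample data and file paths are hypothetical.

```python
from collections import Counter


def count_dependencies(scan):
    """Count top-level dependencies per (datasource_id, datafile_path) pair."""
    counts = Counter()
    for dep in scan.get("dependencies", []):
        # Assumed field names; adjust if the export uses other keys.
        key = (dep.get("datasource_id"), dep.get("datafile_path"))
        counts[key] += 1
    return counts


# Hypothetical, heavily trimmed scan result used only for illustration.
sample_scan = {
    "dependencies": [
        {
            "purl": "pkg:pypi/pytz",
            "datasource_id": "pypi_editable_egg_pkginfo",
            "datafile_path": "celery-5.2.7/PKG-INFO",
        },
        {
            "purl": "pkg:pypi/kombu",
            "datasource_id": "pypi_editable_egg_pkginfo",
            "datafile_path": "celery-5.2.7/PKG-INFO",
        },
        {
            "purl": "pkg:pypi/sphinx-celery",
            "datasource_id": "pip_requirements",
            "datafile_path": "celery-5.2.7/requirements/extras/example.txt",
        },
    ]
}

# most_common() orders by count, so no sorting of the keys is needed.
for (datasource_id, datafile_path), count in count_dependencies(sample_scan).most_common():
    print(f"{count} dependencies with datasource_id={datasource_id!r} "
          f"datafile_path={datafile_path!r}")
```

To run this against a real scan, replace sample_scan with the result of json.load() on the scan's JSON file.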
@JonoYang @pombredanne before closing this one, could you confirm that this is the expected behavior for both pipelines?
@tdruez
The addition of the 74 dependencies with datasource_id="pip_requirements" datafile_resource="requirements/extras/*.txt" is a new development for me. I feel like this should also show up in the scan_package output.
@AyanSinhaMahapatra
When we scan https://files.pythonhosted.org/packages/ce/21/41a0028f6d610987c0839250357c1a00f351790b8a448c2eb323caa719ac/celery-5.2.7.tar.gz with the latest version of scancode-toolkit, we only see dependencies with datasource_id="pypi_editable_egg_pkginfo" datafile_resource="PKG-INFO". Are dependencies with datasource_id="pip_requirements" datafile_resource="requirements/extras/*.txt" something we don't report in a normal scan, or is this a bug with package scanning on sctk?
@tdruez @JonoYang This is indeed a bug in SCTK, where things seem to happen differently from what happens in SCIO. I'm looking into this more (what we see in scan_package is essentially the SCTK scan imported, while in scan_codebase the scans are run via the API and packages are assembled later).
In SCTK, the PKG-INFO at root is being ignored in favor of the PKG-INFO in the .egg-info directory; this seems to have been implemented in https://github.com/nexB/scancode-toolkit/commit/d6b68d297c9b8c007cf92d9fb9e9df4acecd802a and https://github.com/nexB/scancode-toolkit/issues/3083. What I'm looking into:
- Why we are not adding dependencies from the requirements/*.txt files, although we should. Maybe there should also be a check to report a dependency listed in two places only once.
- There are duplicate dependencies between the PKG-INFO and the dependency files; for example, pytz>=2021.3 is present in both, and it is counted twice in the scan_codebase case. But there are also ones which are only present in the requirements files, like sphinx-celery. In SCTK we are not returning dependencies from all other files in requirements.
- In SCIO, we are creating packages multiple times for pkg:pypi/celery@5.2.7, I think once from celery.egg-info/PKG-INFO and once from PKG-INFO at root.
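The "report a dependency listed in two places only once" check mentioned above could look roughly like the sketch below. This is one possible approach and not how SCTK or SCIO currently behave: it assumes dependencies are plain dicts with "purl", "extracted_requirement" and "datafile_path" keys, the function name is hypothetical, and the requirements file path in the sample is made up.

```python
def merge_duplicate_dependencies(dependencies):
    """Report each (purl, extracted_requirement) once, keeping every datafile path."""
    merged = {}
    for dep in dependencies:
        key = (dep["purl"], dep.get("extracted_requirement"))
        entry = merged.setdefault(
            key,
            {
                "purl": dep["purl"],
                "extracted_requirement": dep.get("extracted_requirement"),
                "datafile_paths": [],
            },
        )
        path = dep.get("datafile_path")
        if path and path not in entry["datafile_paths"]:
            entry["datafile_paths"].append(path)
    # Dicts preserve insertion order, so first-seen order is kept.
    return list(merged.values())


# Illustrative input: pytz>=2021.3 appears in both places, sphinx-celery in one.
# The version specifier for sphinx-celery is omitted because it is not known here.
sample_dependencies = [
    {
        "purl": "pkg:pypi/pytz",
        "extracted_requirement": ">=2021.3",
        "datafile_path": "celery-5.2.7/PKG-INFO",
    },
    {
        "purl": "pkg:pypi/pytz",
        "extracted_requirement": ">=2021.3",
        "datafile_path": "celery-5.2.7/requirements/extras/example.txt",
    },
    {
        "purl": "pkg:pypi/sphinx-celery",
        "datafile_path": "celery-5.2.7/requirements/extras/example.txt",
    },
]

for dep in merge_duplicate_dependencies(sample_dependencies):
    print(dep["purl"], dep["extracted_requirement"], dep["datafile_paths"])
```

Keying on both the purl and the requirement string keeps two different constraints on the same package as separate entries; whether those should be merged as well is a separate design decision.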
On a side note, the following package attributes are not visible/populated in the package view when I run a scan in SCIO:
- datasource_id in the Other tab is not populated but should be
- datafile_paths does not seem to be a package attribute, but we have this in SCTK (it is there for dependencies though, and we also populate it correctly there)
Ah, I've actually reported this separately before: it is the same issue, tracked in SCTK in https://github.com/nexB/scancode-toolkit/issues/3045