
Dependencies are not created using the scan_package pipeline

Open tdruez opened this issue 3 years ago • 1 comments

Using https://files.pythonhosted.org/packages/ce/21/41a0028f6d610987c0839250357c1a00f351790b8a448c2eb323caa719ac/celery-5.2.7.tar.gz as input.

With the scan_codebase and root_filesystems pipelines, 88 dependencies are created. With the scan_package pipeline, no dependencies are created.

Note that the "Datafile resource" files requires.txt and PKG-INFO, which are used to create those dependencies, also exist as codebase resources in the scan_package pipeline.

tdruez avatar Sep 01 '22 08:09 tdruez

Looks like the issue is in scancode-toolkit. The scancode scan created by the pipeline does not have any dependencies in the top-level dependency field. I've downloaded and extracted celery-5.2.7.tar.gz and scanned it with scancode just to make sure and I got the same result.
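
A quick way to repeat this check on a saved scan is to look at the top-level list directly. This is only a minimal sketch: it assumes the scan was saved as JSON and that the top-level key is named `dependencies`; the helper name and the trimmed sample data are hypothetical.

```python
import json


def top_level_dependencies(scan_json: str) -> list:
    """Return the top-level dependency entries of a JSON scan (empty list if absent)."""
    scan = json.loads(scan_json)
    # Assumption: dependencies live under a top-level "dependencies" key.
    return scan.get("dependencies") or []


# Hypothetical, heavily trimmed scan output with no dependencies reported.
sample_scan = json.dumps(
    {"packages": [{"name": "celery", "version": "5.2.7"}], "dependencies": []}
)
print(len(top_level_dependencies(sample_scan)))
```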

JonoYang avatar Sep 01 '22 17:09 JonoYang

I've re-run the scans with the latest SCIO and toolkit:

scan_codebase 117 dependencies:

  • 42 dependencies with datasource_id="pypi_editable_egg_pkginfo" datafile_resource="PKG-INFO"
  • 74 dependencies with datasource_id="pip_requirements" datafile_resource="requirements/extras/*.txt"

scan_package 42 dependencies:

  • 42 dependencies with datasource_id="pypi_editable_egg_pkginfo" datafile_resource="PKG-INFO"
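
For reference, a breakdown like the ones above can be produced by grouping dependency records on their datasource and datafile. The snippet below is a rough sketch, assuming each dependency is available as a dict with `datasource_id` and `datafile_resource` keys; the helper name, the sample records, and the individual requirements file names are made up for illustration.

```python
from collections import Counter


def count_by_source(dependencies):
    """Count dependencies per (datasource_id, datafile_resource) pair."""
    return Counter(
        (dep.get("datasource_id"), dep.get("datafile_resource"))
        for dep in dependencies
    )


# Hypothetical sample records, not real scan output.
sample = [
    {"purl": "pkg:pypi/pytz", "datasource_id": "pypi_editable_egg_pkginfo",
     "datafile_resource": "PKG-INFO"},
    {"purl": "pkg:pypi/pytz", "datasource_id": "pip_requirements",
     "datafile_resource": "requirements/default.txt"},
    {"purl": "pkg:pypi/sphinx-celery", "datasource_id": "pip_requirements",
     "datafile_resource": "requirements/docs.txt"},
]

for (datasource_id, datafile), total in count_by_source(sample).items():
    print(f"{total} dependencies with datasource_id={datasource_id!r} "
          f"datafile_resource={datafile!r}")
```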

@JonoYang @pombredanne before closing this one, could you confirm that this is the expected behavior for both pipelines?

tdruez avatar Jul 05 '23 09:07 tdruez

@tdruez

The addition of the 74 dependencies with datasource_id="pip_requirements" datafile_resource="requirements/extras/*.txt" is a new development for me. I feel like this should also show up in the scan_package output.

@AyanSinhaMahapatra

When we scan https://files.pythonhosted.org/packages/ce/21/41a0028f6d610987c0839250357c1a00f351790b8a448c2eb323caa719ac/celery-5.2.7.tar.gz with the latest version of scancode-toolkit, we only see dependencies with datasource_id="pypi_editable_egg_pkginfo" datafile_resource="PKG-INFO". Are dependencies with datasource_id="pip_requirements" datafile_resource="requirements/extras/*.txt" something we don't report in a normal scan, or is this a bug with package scanning on sctk?

JonoYang avatar Jul 06 '23 01:07 JonoYang

@tdruez @JonoYang This is indeed a bug in SCTK, where things seem to happen differently from what happens in SCIO. I'm looking into this more. (What we see in scan_package is essentially the SCTK scan imported; in scan_codebase the scans are run via the API and the packages are assembled later.)

In SCTK, the PKG-INFO at root is being ignored in favor of the PKG-INFO in the .egg-info directory; this seems to have been implemented in https://github.com/nexB/scancode-toolkit/commit/d6b68d297c9b8c007cf92d9fb9e9df4acecd802a and https://github.com/nexB/scancode-toolkit/issues/3083. What I'm looking into:

  1. Why we are not adding dependencies from the requirements/*.txt files, when we should. Maybe there should also be a check so that a dependency listed in two places is reported only once.
  • There are duplicate dependencies between the PKG-INFO and the requirements files; for example, pytz>=2021.3 is present in both and is counted twice in the scan_codebase case.
  • But there are also ones that are only present in the requirements files, like sphinx-celery.

In SCTK we are not returning dependencies from any of the other files in requirements.

  2. In SCIO, we are creating packages multiple times for pkg:pypi/celery@5.2.7, I think once from celery.egg-info/PKG-INFO and once from PKG-INFO at root.
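
The "report only once" check mentioned in the first point could look roughly like the sketch below. It is not how SCTK or SCIO assemble dependencies: it assumes each dependency is a dict with `purl`, `extracted_requirement`, and `datafile_path` keys, and the helper name, sample records, and requirements file names are hypothetical.

```python
def merge_duplicates(dependencies):
    """Report each (purl, extracted_requirement) once, keeping every datafile that listed it."""
    merged = {}
    for dep in dependencies:
        key = (dep["purl"], dep.get("extracted_requirement"))
        entry = merged.setdefault(
            key,
            {
                "purl": dep["purl"],
                "extracted_requirement": dep.get("extracted_requirement"),
                "datafile_paths": [],
            },
        )
        path = dep.get("datafile_path")
        # Record each datafile only once per dependency.
        if path and path not in entry["datafile_paths"]:
            entry["datafile_paths"].append(path)
    return list(merged.values())


# Hypothetical records: pytz is listed in two files, sphinx-celery in one.
sample_deps = [
    {"purl": "pkg:pypi/pytz", "extracted_requirement": ">=2021.3",
     "datafile_path": "PKG-INFO"},
    {"purl": "pkg:pypi/pytz", "extracted_requirement": ">=2021.3",
     "datafile_path": "requirements/default.txt"},
    {"purl": "pkg:pypi/sphinx-celery", "extracted_requirement": None,
     "datafile_path": "requirements/docs.txt"},
]

for dep in merge_duplicates(sample_deps):
    print(dep["purl"], dep["extracted_requirement"], dep["datafile_paths"])
```

Keying on both the purl and the extracted requirement keeps entries apart when two files pin the same package differently; whether those should be merged as well is a separate design question.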

On a side note, the following package attributes are not visible/populated in the package view when I run a scan in SCIO:

  • datasource_id in the Other tab is not populated but should be
  • datafile_paths does not seem to be a package attribute, but we have this in SCTK (it is there for dependencies though, and we also populate it correctly there)

AyanSinhaMahapatra avatar Jul 06 '23 11:07 AyanSinhaMahapatra

Ah, I've actually reported this separately before; it is the same issue and is tracked in SCTK in https://github.com/nexB/scancode-toolkit/issues/3045

AyanSinhaMahapatra avatar Jul 06 '23 12:07 AyanSinhaMahapatra