
PGA downloading a higher number of siva files than expected

Open gomesfernanda opened this issue 7 years ago • 2 comments

I want to download all the siva files in PGA for repositories that contain "Jupyter Notebook" code.

To find out how many there are, I ran: $ pga list --lang "Jupyter Notebook" -f csv

After examining the CSV output, I counted 2,606 repos and 3,767 siva files corresponding to them.
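A count like this can be double-checked with a short script. The following is only a sketch: it assumes the CSV has a header with a `URL` column and a `SIVA_FILENAMES` column holding a comma-separated list of siva file names, which may not match the real `pga list` output, so adjust the column names to whatever the header actually says. The `SAMPLE` data is made up for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical sample; real data would come from the saved `pga list` output.
SAMPLE = (
    "URL,SIVA_FILENAMES\n"
    'https://example.com/a,"aaa.siva,bbb.siva"\n'
    "https://example.com/b,ccc.siva\n"
)


def count_siva_files(csv_text, url_col="URL", siva_col="SIVA_FILENAMES"):
    """Return (repo count, distinct siva file count, siva files per repo).

    The column names are assumptions, not confirmed against pga itself.
    """
    per_repo = Counter()
    distinct = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        # One field may list several siva files separated by commas.
        names = [n.strip() for n in row[siva_col].split(",") if n.strip()]
        per_repo[row[url_col]] = len(names)
        distinct.update(names)
    return len(per_repo), len(distinct), per_repo


repos, sivas, per_repo = count_siva_files(SAMPLE)
# per_repo.most_common(1) points at the repository with the most siva files.
```

Sorting the per-repository counts is a quick way to spot a single repository that accounts for an unexpectedly large share of the files.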

To download the siva files, I ran $ pga get --lang "Jupyter Notebook" -v

And the response that I got was:

DEBU[0004] local copy is outdated or non existent
1 / 6349 [>----------------------------------------------------------]   0.02% 40m59s

This means it was downloading 6,349 files, and I have no idea why. Can somebody help me with this?

Oct 08 '18 16:10 gomesfernanda

I was investigating this today and found out that there are 3,295 siva files for the repo https://github.com/google/skia-buildbot.

So it was my mistake: pga get IS downloading the exact number of siva files. However, I'm intrigued by this extreme number of siva files for one repo. Is it normal?

Oct 15 '18 11:10 gomesfernanda

@gomesfernanda When you clone the repository with the standard refspecs, you get something like this:

$ git clone git@github.com:google/skia-buildbot.git
Cloning into 'skia-buildbot'...
remote: Enumerating objects: 3543, done.
remote: Counting objects: 100% (3543/3543), done.
remote: Compressing objects: 100% (2652/2652), done.
remote: Total 108260 (delta 1931), reused 1807 (delta 598), pack-reused 104717
Receiving objects: 100% (108260/108260), 51.61 MiB | 398.00 KiB/s, done.
Resolving deltas: 100% (77333/77333), done.

It contains just a few branches and only 4 root commits:

$ git rev-list --all --remotes --max-parents=0 | wc -l
4

But if you fetch with the same refspec that Borges used to fetch that repository:

$ git checkout origin/master

$ git fetch origin +refs/*:refs/*
remote: Enumerating objects: 60611, done.
remote: Counting objects: 100% (60611/60611), done.
remote: Compressing objects: 100% (13044/13044), done.
Receiving objects:  53% (82299/155281), 60.64 MiB | 1.14 MiB/s
[...]

$ git rev-list --all --remotes --max-parents=0 | wc -l
5734

That means that, right now, that repository is spread across 5,734 different siva files, one per root commit.
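The number comes from `--max-parents=0`, which keeps only commits that have no parents (root commits). As a rough illustration of what is being counted, the sketch below does the same thing by hand on the text that `git rev-list --all --parents` prints, where each line is a commit hash followed by the hashes of its parents. The short hashes in `SAMPLE_REV_LIST` are made up.

```python
# Hypothetical output of `git rev-list --all --parents`:
# each line is "<commit> <parent> <parent> ...".
SAMPLE_REV_LIST = """\
c3 c2
c2 c1
c1
g1
"""


def root_commits(rev_list_parents_output):
    """Return the commits that are listed without any parent."""
    roots = []
    for line in rev_list_parents_output.splitlines():
        fields = line.split()
        # A line with a single hash is a commit with zero parents.
        if len(fields) == 1:
            roots.append(fields[0])
    return roots


roots = root_commits(SAMPLE_REV_LIST)
```

In the sample, `c1` starts the main history and `g1` is an unrelated commit with no history behind it, so both count as roots. A repository with thousands of such unrelated histories produces a correspondingly large count.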

This is because they are using Gerrit. Gerrit generates a new orphan branch for each "pull request".
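The two fetches differ because a default clone uses the refspec `+refs/heads/*:refs/remotes/origin/*`, while `+refs/*:refs/*` copies every ref, including the `refs/changes/...` namespace where Gerrit keeps its changes. The sketch below is a simplified model of how a refspec with a single `*` maps a remote ref to a local one; it is not git's implementation, and the ref names used with it are hypothetical.

```python
def map_ref(refspec, ref):
    """Map a remote ref through a refspec, or return None if it does not match.

    Simplified model: handles an optional leading '+' and at most one '*'.
    """
    spec = refspec[1:] if refspec.startswith("+") else refspec
    src, dst = spec.split(":", 1)
    if "*" not in src:
        return dst if ref == src else None
    prefix, suffix = src.split("*", 1)
    if len(ref) < len(prefix) + len(suffix):
        return None
    if not (ref.startswith(prefix) and ref.endswith(suffix)):
        return None
    # Whatever the '*' matched is substituted into the destination pattern.
    matched = ref[len(prefix):len(ref) - len(suffix)]
    return dst.replace("*", matched, 1)


DEFAULT_CLONE = "+refs/heads/*:refs/remotes/origin/*"
FETCH_ALL = "+refs/*:refs/*"

# Hypothetical Gerrit change ref.
change_ref = "refs/changes/01/1/1"
skipped = map_ref(DEFAULT_CLONE, change_ref)   # not fetched by a plain clone
fetched = map_ref(FETCH_ALL, change_ref)       # fetched under the same name
```

Under this model the change ref is ignored by the default clone refspec and picked up by `+refs/*:refs/*`, which is consistent with the jump in root commits after the second fetch.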

Oct 16 '18 07:10 ajnavarro