Index v2 repositories
Requires #117
- [x] create batch job that uses pga-create container to index repositories
- [x] run indexing on the repositories
- [ ] add job description to charts repository
- [ ] update documentation
The indexing process crashed after some hours (or minutes when using more than 8 threads) with an OOMKilled state. After some more debugging and help from @rporres we found that it was not the indexing process consuming a high amount of memory, but the kernel running out of memory and killing the process that consumed the most. The problem was that by default the root filesystem is memory backed and /tmp is part of it; most probably it crashed when two or more big repositories had to be processed and copied to the temporary directory at the same time.
Now the process runs in a batch job using a local node directory as /tmp. This decreases memory pressure, and hopefully the indexing process won't be killed. The side effect is that indexing is slower.
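The /tmp redirection can be sketched in a few lines. This is a hypothetical illustration, not the actual job configuration: `scratch-tmp` stands in for whatever node-local, disk-backed directory the batch job mounts, and the point is that tools honoring `TMPDIR` will write their temporary copies there instead of onto the memory-backed root filesystem.

```python
import os
import tempfile

# Hypothetical sketch of the /tmp fix: point temp-file machinery at a
# disk-backed directory instead of the memory-backed root filesystem.
# "scratch-tmp" is a stand-in for the node-local directory the job mounts.
scratch = os.path.join(os.getcwd(), "scratch-tmp")
os.makedirs(scratch, exist_ok=True)
os.environ["TMPDIR"] = scratch
tempfile.tempdir = None          # force tempfile to re-read TMPDIR

fd, path = tempfile.mkstemp()    # created under the scratch directory now
os.close(fd)
print(path.startswith(scratch))  # temp files no longer consume RAM
os.remove(path)
```

In the actual job the same effect comes from mounting a node-local volume at /tmp (or exporting `TMPDIR`) in the pod spec, which is why indexing got slower: disk I/O replaced tmpfs writes.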
The current process uses 16 threads and around 27 GB of RAM, far from the pod's 64 GB limit.
Fingers crossed! cc/ @src-d/machine-learning
Indexing has been running since last Tuesday (May 14th) on 5 nodes. Repositories are split into batches of 5000 each. Each node processes 10 batches, except for the first one (which has no limit) and the last one, which processes 13 batches. In total there are 53 batches. Here's the number of lines in each batch index:
```
/pga/data # wc -l index.csv.*
    4830 index.csv.0
    4818 index.csv.1
    4995 index.csv.10
    4990 index.csv.11
    4991 index.csv.12
    4989 index.csv.13
    4991 index.csv.14
    1944 index.csv.15
    4989 index.csv.16
    2193 index.csv.17
    4983 index.csv.2
    4974 index.csv.3
    4522 index.csv.4
    4974 index.csv.5
    4957 index.csv.6
    2590 index.csv.7
    4985 index.csv.8
    4991 index.csv.9
    4995 index.csv.batch.10
    4990 index.csv.batch.11
    4991 index.csv.batch.12
    3241 index.csv.batch.13
    4991 index.csv.batch.14
    1993 index.csv.batch.15
    4989 index.csv.batch.16
    4990 index.csv.batch.17
    4982 index.csv.batch.18
    4987 index.csv.batch.19
    4977 index.csv.batch.20
    3738 index.csv.batch.21
    4991 index.csv.batch.22
    4440 index.csv.batch.23
    1345 index.csv.batch.24
    4986 index.csv.batch.30
    4986 index.csv.batch.31
    4992 index.csv.batch.32
    4984 index.csv.batch.33
    4983 index.csv.batch.34
    4133 index.csv.batch.35
    4985 index.csv.batch.40
    4983 index.csv.batch.41
    4990 index.csv.batch.42
    4986 index.csv.batch.43
    4979 index.csv.batch.44
  199333 total
```
Files `index.csv.<number>` are generated by the first node. It has no batch limit and serves as a backup in case it manages to index all batches while one of the other nodes fails. This means there are duplicates in the listing above. The real number of processed repositories is 167072.
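The deduplicated count can be reproduced by collecting the distinct repository identifiers across every index file. This is a sketch under an assumed layout (first CSV column is the repository id, one header line per file); the function name and the tiny synthetic demo are illustrative, not part of the real pipeline.

```python
import csv
import glob
import os
import tempfile

def unique_repositories(paths):
    """Count distinct repository ids across all index files.

    Assumes the repository id is the first CSV column; backup files
    written by the first node may repeat ids from other nodes' batches.
    """
    seen = set()
    for p in paths:
        with open(p, newline="") as f:
            for row in csv.reader(f):
                if row:
                    seen.add(row[0])
    return len(seen)

# Tiny demo with two synthetic, overlapping batch indexes.
d = tempfile.mkdtemp()
for name, ids in [("index.csv.0", ["a", "b"]), ("index.csv.batch.10", ["b", "c"])]:
    with open(os.path.join(d, name), "w") as f:
        f.write("\n".join(ids) + "\n")
print(unique_repositories(glob.glob(os.path.join(d, "index.csv*"))))  # 3
```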
The first node appears to have finished (10-19), and the pods indexing batches 20 to 53 are stuck in the middle of their lists. The jobs are checked at least twice a day because some repositories get stuck processing, which prevents the next batch from starting. Those jobs are killed so the next batch can run. Kills are done with the ABRT signal, so the stack trace is sent to the logs and can be read in Kibana. Here are some of the repositories that got stuck, for further analysis:
```
0169ecfe-a1fe-016c-5291-aa7e7315bc9f e83c5163316f89bfbde7d9ab23ca2e25604af290
0169ed0d-a4e4-ff10-6a56-642cf365fe3c 6091827530d6dd43479d6709fb6e9f745c11e900
0169ecf6-3d49-93ed-9e3f-c691f01eef8e 99c545ceef1cd080a0dce87d12649db770f78754 10 git://github.com/deadpixi/libtmt.git Mon May 20 11:03:34 UTC 2019
0169ecfe-6ef6-ef0c-4709-17377ff371fb 678b0b89572768b21d8b74360d55b75b233799c4 20 git://github.com/gentoo/eudev.git Mon May 20 12:20:53 UTC 2019
0169ecd8-512a-2a17-458c-2bd367848fb8 e83c5163316f89bfbde7d9ab23ca2e25604af290 20 git://github.com/Microsoft/git.git Mon May 20 12:20:53 UTC 2019
0169ecd9-354b-8de2-9c53-a7592c114e51 3b56a9af51519d2e77e05efa672a13e6be2e9ebc 30 git://github.com/MrAlex94/Waterfox.git Mon May 20 12:25:39 UTC 2019
0169ed06-9a85-d120-23c6-a64976a97d7d 781c48087175615674b38b31fcc0aae17f0651b6 30 git://github.com/mozilla/mozilla-central.git Mon May 20 12:25:39 UTC 2019
0169ecd8-512a-2a17-458c-2bd367848fb8 e83c5163316f89bfbde7d9ab23ca2e25604af290 30 git://github.com/Microsoft/git.git Mon May 20 12:25:39 UTC 2019
```
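The manual unblocking step (kill with ABRT so a trace lands in the logs) can be illustrated with a toy process. Go programs dump all goroutine stacks when they die on SIGABRT, which is why that signal was chosen; the sleeping child below is just a stand-in for a stuck indexing worker.

```python
import signal
import subprocess
import time

# Stand-in for a stuck worker: a process that would otherwise run forever.
child = subprocess.Popen(["sleep", "3600"])
time.sleep(0.2)                    # give it a moment to start

# Send SIGABRT, the same signal used to kill the stuck indexing jobs.
# A Go binary would print its goroutine stack traces to stderr here.
child.send_signal(signal.SIGABRT)
child.wait()
print(child.returncode)            # negative signal number on POSIX
```

Inside a pod the equivalent would be something like `kill -ABRT <pid>` against the indexing process, after which the trace shows up in the container logs (and from there in Kibana).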
There are also some repositories that hit some kind of error and could not be processed, as can be seen in the index lists. If a whole batch had been processed correctly, its index would count 5001 lines (5000 repos plus a header). These errors have not been analyzed yet.
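The shortfall per batch follows directly from that 5001-line expectation. A quick sketch, using a few counts copied from the listing above; note a low count can also mean the batch is still mid-run or was cut short by a stuck repository, not only that individual repos errored.

```python
# A fully processed batch index should have 5001 lines:
# 5000 repositories plus one header line.
EXPECTED = 5001

# A few line counts copied from the wc -l listing above.
counts = {
    "index.csv.batch.13": 3241,
    "index.csv.batch.24": 1345,
    "index.csv.batch.44": 4979,
}

# Shortfall = repositories missing, failed, or not yet processed.
shortfall = {name: EXPECTED - lines for name, lines in counts.items()}
for name in sorted(shortfall):
    print(f"{name}: {shortfall[name]} repositories short of a full batch")
```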
@jfontan I guess that this is done.
Long overdue, but we still have to add documentation on the indexing process.