
Filtering bots from OSCI ranking

Open vlad-isayko opened this issue 5 years ago • 1 comments

The goal is to improve our existing OSCI code that ranks companies by number of commits. At present, a large number of commits appear to be made by automated processes associated with GitHub accounts that have a company (commercial organization) email domain. These skew any ranking of companies based on commits, which is precisely why our OSCI ranking is based on number of contributors rather than number of commits.

For example, when we look at the OSCI commit-based company counts to the end of June 2020, we see:

| OrgName | Commits |
| --- | --- |
| Microsoft | 640009 |
| GitHub | 519108 |
| Renovateapp | 472705 |
| Google | 379847 |
| Red Hat | 331087 |
| Travis CI | 195377 |
| Intel | 150613 |
| IBM | 131510 |
| Exoplatform | 125844 |
| Odoo | 113452 |
| Pyup | 82118 |

However, Renovateapp, Travis CI, Exoplatform and Pyup do not feature highly in our OSCI contributor-based company ranking. In fact, Renovateapp has only 4 active contributors, Travis CI has 67, Exoplatform has 41, and Pyup has 4.
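To make the mismatch concrete, here is a rough sketch that divides the commit counts above by the active contributor counts. It assumes both sets of figures cover the same period, which may not be exactly true, so treat the ratios as indicative only.

```python
# Commit counts and active-contributor counts quoted in this issue.
# Assumption: both cover the same period (to end of June 2020).
commits = {
    "Renovateapp": 472705,
    "Travis CI": 195377,
    "Exoplatform": 125844,
    "Pyup": 82118,
}
contributors = {
    "Renovateapp": 4,
    "Travis CI": 67,
    "Exoplatform": 41,
    "Pyup": 4,
}

# Average number of commits attributed to each active contributor.
commits_per_contributor = {
    org: commits[org] / contributors[org] for org in commits
}

# Print organizations from the highest ratio to the lowest.
for org, ratio in sorted(
    commits_per_contributor.items(), key=lambda kv: kv[1], reverse=True
):
    print(f"{org}: {ratio:,.0f} commits per active contributor")
```

Ratios in the thousands or tens of thousands of commits per contributor are hard to explain by human activity alone.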

When we dig deeper into this, we see:

These are the top commit authors for Pyup:

| Company | AuthorName | Commits |
| --- | --- | --- |
| Pyup | pyup-bot | 349717 |
| Pyup | pyup.io bot | 10146 |
| Pyup | pyup.io vuln bot | 22 |
| Pyup | pyup.io bot (via Travis CI) | 1 |

As you can see, all of them are bots. The picture is the same for Renovateapp:

| Company | AuthorName | Commits |
| --- | --- | --- |
| Renovateapp | Renovate Bot | 2348935 |
| Renovateapp | WhiteSource Renovate | 65148 |
| Renovateapp | Renovate Bot (via Travis CI) | 358 |
| Renovateapp | renovate-bot | 63 |
| Renovateapp | Rhys Arkins | 3 |

Travis CI (top 10 authors by commits):

| Company | AuthorName | Commits |
| --- | --- | --- |
| Travis CI | Deployment Bot (from Travis CI) | 426727 |
| Travis CI | Travis CI | 92799 |
| Travis CI | travis-ci | 11824 |
| Travis CI | TravisCI | 9511 |
| Travis CI | Travis | 8128 |
| Travis CI | Deployment Bot (Travis) | 7723 |
| Travis CI | Deployment Bot | 1917 |
| Travis CI | raveit65 | 1322 |
| Travis CI | Piotr Milcarz | 1317 |
| Travis CI | Travis Build Bot (from Travis CI) | 1015 |

Most of these commits come from bots.

We would like a way to filter out these automated processes/bot commits, so that we could more accurately generate a ranking of companies based on commits.

One obvious way is to simply have a 'blacklist' of GitHub accounts / email addresses, but perhaps something more sophisticated could be devised, based on 'unhuman' levels of activity.
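As a starting point for the 'unhuman' activity idea, here is a minimal sketch that flags an author whose busiest single day exceeds a threshold. The threshold value and the function names are hypothetical, the sample timestamps are made up for illustration, and this is not existing OSCI code.

```python
from collections import Counter
from datetime import datetime

# Hypothetical threshold: more commits than this on any one calendar day
# is treated as 'unhuman'. The value would need tuning on real data.
MAX_HUMAN_COMMITS_PER_DAY = 200


def peak_daily_commits(timestamps):
    """Return the largest number of commits made on a single calendar day."""
    per_day = Counter(ts.date() for ts in timestamps)
    return max(per_day.values(), default=0)


def looks_automated(timestamps, threshold=MAX_HUMAN_COMMITS_PER_DAY):
    """Flag an author whose busiest day exceeds the threshold."""
    return peak_daily_commits(timestamps) > threshold


# Made-up sample data: one commit per minute for a whole day (1440 commits)
# versus one commit per day over five days.
bot_like = [
    datetime(2020, 6, 1, hour, minute)
    for hour in range(24)
    for minute in range(60)
]
human_like = [datetime(2020, 6, day, 10, 0) for day in range(1, 6)]

print(looks_automated(bot_like))
print(looks_automated(human_like))
```

A single-day peak is only one possible signal. Commits spread evenly around the clock, or sustained over months without a break, could be checked in a similar way.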

At the moment, we use a domain <-> company match list, which filters the companies in the top list that we produce. Perhaps the bot problem can be solved by creating a similar list that filters out bots.
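A list-based filter could look roughly like the sketch below. The name list, the regular expression, and the function names are all hypothetical, and treating names such as "WhiteSource Renovate" as bots is an assumption based on the tables in this issue. The rows are the Pyup and Renovateapp author counts from above.

```python
import re
from collections import defaultdict

# Hypothetical exact-match list (lowercased). Treating these names as bots
# is an assumption based on the author tables in this issue.
KNOWN_BOT_NAMES = {
    "whitesource renovate",
    "travis ci",
    "travis-ci",
    "travisci",
    "travis",
}

# Hypothetical pattern: any author name containing the standalone word "bot".
BOT_WORD = re.compile(r"\bbot\b", re.IGNORECASE)


def is_bot(author_name):
    """Return True if the author name is on the list or matches the pattern."""
    name = author_name.strip().lower()
    return name in KNOWN_BOT_NAMES or bool(BOT_WORD.search(name))


def commits_by_company(rows, exclude_bots=True):
    """Sum commits per company, optionally skipping authors flagged as bots."""
    totals = defaultdict(int)
    for company, author, commits in rows:
        if exclude_bots and is_bot(author):
            continue
        totals[company] += commits
    return dict(totals)


# (company, author name, commits) rows taken from the tables in this issue.
rows = [
    ("Pyup", "pyup-bot", 349717),
    ("Pyup", "pyup.io bot", 10146),
    ("Pyup", "pyup.io vuln bot", 22),
    ("Pyup", "pyup.io bot (via Travis CI)", 1),
    ("Renovateapp", "Renovate Bot", 2348935),
    ("Renovateapp", "WhiteSource Renovate", 65148),
    ("Renovateapp", "Renovate Bot (via Travis CI)", 358),
    ("Renovateapp", "renovate-bot", 63),
    ("Renovateapp", "Rhys Arkins", 3),
]

print(commits_by_company(rows, exclude_bots=False))
print(commits_by_company(rows))
```

With the filter applied, Pyup drops out of these rows entirely and Renovateapp is left with only the commits from its one human author.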

vlad-isayko, Sep 14 '20 11:09

It would be interesting to analyse what those bots actually do: are they contributing anything useful, or is it just deployment logs or the like?

patrickstephens2, Sep 27 '21 13:09