Filtering bots from OSCI ranking
The goal is to improve our existing OSCI code, which ranks companies by number of commits. At present, a large number of commits appear to be made by automated processes associated with GitHub accounts that have a company (commercial organization) email domain. These skew any commit-based ranking of companies, which is precisely why our OSCI ranking is based on number of contributors rather than number of commits.
For example, when we look at the OSCI commit-based company counts to the end of June 2020, we see:
| OrgName | Commits |
|---|---|
| Microsoft | 640009 |
| GitHub | 519108 |
| Renovateapp | 472705 |
| | 379847 |
| Red Hat | 331087 |
| Travis CI | 195377 |
| Intel | 150613 |
| IBM | 131510 |
| Exoplatform | 125844 |
| Odoo | 113452 |
| Pyup | 82118 |
However, Renovateapp, Travis CI, Exoplatform and Pyup do not feature highly in our OSCI contributor-based company ranking. In fact, Renovateapp has only 4 active contributors, Travis CI has 67, Exoplatform has 41 and Pyup has 4.
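The mismatch can be made concrete by dividing each company's commit count by its active-contributor count. The snippet below is only an illustrative sketch using the figures quoted above; the function name `commits_per_contributor` is made up for this example and is not part of the OSCI code.

```python
# Commit totals (to end of June 2020) and active-contributor counts quoted above.
commits = {"Renovateapp": 472705, "Travis CI": 195377, "Exoplatform": 125844, "Pyup": 82118}
contributors = {"Renovateapp": 4, "Travis CI": 67, "Exoplatform": 41, "Pyup": 4}


def commits_per_contributor(commits, contributors):
    """Return {company: commits / active contributors}, highest ratio first."""
    ratios = {
        name: commits[name] / contributors[name]
        for name in commits
        if contributors.get(name)  # skip companies with no known contributors
    }
    return dict(sorted(ratios.items(), key=lambda item: item[1], reverse=True))


ratios = commits_per_contributor(commits, contributors)
for company, ratio in ratios.items():
    print(f"{company}: {ratio:,.0f} commits per active contributor")
```

A ratio in the tens or hundreds of thousands is hard to explain by human activity alone, which is what motivates looking at the individual authors.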
When we dig deeper, the reason becomes clear. These are the top commit authors for Pyup:
| Company | AuthorName | Commits |
|---|---|---|
| Pyup | pyup-bot | 349717 |
| Pyup | pyup.io bot | 10146 |
| Pyup | pyup.io vuln bot | 22 |
| Pyup | pyup.io bot (via Travis CI) | 1 |
As you can see, all of them are bots. The picture is the same for Renovateapp:
| Company | AuthorName | Commits |
|---|---|---|
| Renovateapp | Renovate Bot | 2348935 |
| Renovateapp | WhiteSource Renovate | 65148 |
| Renovateapp | Renovate Bot (via Travis CI) | 358 |
| Renovateapp | renovate-bot | 63 |
| Renovateapp | Rhys Arkins | 3 |
Travis CI (top 10 authors by commits):
| Company | AuthorName | Commits |
|---|---|---|
| Travis CI | Deployment Bot (from Travis CI) | 426727 |
| Travis CI | Travis CI | 92799 |
| Travis CI | travis-ci | 11824 |
| Travis CI | TravisCI | 9511 |
| Travis CI | Travis | 8128 |
| Travis CI | Deployment Bot (Travis) | 7723 |
| Travis CI | Deployment Bot | 1917 |
| Travis CI | raveit65 | 1322 |
| Travis CI | Piotr Milcarz | 1317 |
| Travis CI | Travis Build Bot (from Travis CI) | 1015 |
The majority of these commits come from bots.
We would like a way to filter out these automated processes/bot commits, so that we could more accurately generate a ranking of companies based on commits.
One obvious way is to simply have a 'blacklist' of GitHub accounts / email addresses, but perhaps something more sophisticated could be devised, based on 'unhuman' levels of activity.
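One possible reading of "'unhuman' levels of activity" is an author whose busiest single day exceeds what a person could plausibly commit. The sketch below assumes commits are available as `(author, date)` pairs; the threshold `MAX_HUMAN_COMMITS_PER_DAY` and the function name are hypothetical and would need tuning against real data.

```python
from collections import Counter
from datetime import date

# Hypothetical threshold: no value is given in the text; tune against real data.
MAX_HUMAN_COMMITS_PER_DAY = 100


def flag_unhuman_authors(commits, max_per_day=MAX_HUMAN_COMMITS_PER_DAY):
    """Return the set of authors whose busiest single day exceeds max_per_day.

    commits: iterable of (author, commit_date) pairs.
    """
    per_author_day = Counter((author, day) for author, day in commits)
    return {author for (author, _day), n in per_author_day.items() if n > max_per_day}


# Made-up sample: one author with 500 commits in a day, one with 3.
sample = [("pyup-bot", date(2020, 6, 1))] * 500 + [("Rhys Arkins", date(2020, 6, 1))] * 3
flagged = flag_unhuman_authors(sample)
print(flagged)
```

A rate-based rule like this would catch bots whose names give no hint, but it could also flag humans who push large scripted or squashed batches, so it probably works best as a way to propose candidates for review rather than as a final filter.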
At the moment, we use a domain <-> company match list, which filters the companies that appear in the ranking we produce. Perhaps the bot problem can be solved by creating a similar list that filters out bots.
It would also be interesting to analyse what those bots actually do: are they contributing anything useful, or is it just deployment logs and the like?
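As a rough sketch of what such a bot list might look like, the example below matches author names against a few patterns and re-aggregates commits per company without the matches. The patterns are drawn from the author names in the tables above; the list, `is_bot` and `human_commits_by_company` are assumptions made for illustration, not existing OSCI code.

```python
import re
from collections import defaultdict

# Illustrative patterns based on the author names in the tables above.
# This list is an assumption, not an existing OSCI artifact.
BOT_NAME_PATTERNS = [
    r"\bbot\b",                # "Renovate Bot", "pyup-bot", "Deployment Bot (Travis)"
    r"^travis([ -]?ci)?$",     # "Travis CI", "travis-ci", "TravisCI", "Travis"
    r"^whitesource renovate$",
]
_BOT_RE = re.compile("|".join(BOT_NAME_PATTERNS), re.IGNORECASE)


def is_bot(author_name):
    """Return True if the author name matches any pattern in the bot list."""
    return bool(_BOT_RE.search(author_name.strip()))


def human_commits_by_company(rows):
    """Sum commits per company, skipping authors that match the bot list.

    rows: iterable of (company, author_name, commits) tuples.
    """
    totals = defaultdict(int)
    for company, author, commits in rows:
        if not is_bot(author):
            totals[company] += commits
    return dict(totals)


# A few rows copied from the tables above.
rows = [
    ("Renovateapp", "Renovate Bot", 2348935),
    ("Renovateapp", "Rhys Arkins", 3),
    ("Travis CI", "TravisCI", 9511),
    ("Travis CI", "raveit65", 1322),
    ("Travis CI", "Piotr Milcarz", 1317),
    ("Pyup", "pyup-bot", 349717),
]
print(human_commits_by_company(rows))
```

Name patterns are easy to maintain alongside the domain list, but they can misfire on a human whose name happens to contain "bot" as a separate word, so matching on email addresses or account IDs may turn out to be more reliable.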