Add Github issues and PRs scanner
In this PR I tried to build some basic tooling around scanning data sources. Currently the data is only collected, not persisted anywhere; persisting it is something I still plan to do in this PR.
Although I decided to use the old approach, with configuration in a YAML file, it should be treated only as a temporary solution until we have a database.
Why didn't I use the well-known PyGithub? It's LGPL.
@kaxil @sharanf @Humbedooh @michalslowikowski00 happy to get your opinion.
As suggested on the dev list, I introduced the concepts of `DataSource` and `DataType`. For now these can be configured in a YAML configuration file:
```yaml
data_sources:
  - name: github_kibble
    class: kibble.data_sources.github.GithubDataSource
    config:
      repo_owner: apache
      repo_name: kibble
    enabled_data_types:
      - pr_issues
```
This form allows users to specify any external data source, as long as the class path points to an importable object.
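As a rough illustration of that (this is not the code from this PR; `load_data_sources` and the constructor arguments are hypothetical, and PyYAML is assumed), the class path can be resolved with a plain dynamic import:

```python
# Hypothetical sketch: turn the YAML config into data source objects by
# importing whatever class the `class` path points to.
import importlib

import yaml


def load_data_sources(config_path: str) -> list:
    with open(config_path) as f:
        config = yaml.safe_load(f)

    sources = []
    for entry in config.get("data_sources", []):
        module_path, class_name = entry["class"].rsplit(".", 1)
        cls = getattr(importlib.import_module(module_path), class_name)
        sources.append(
            cls(
                name=entry["name"],
                enabled_data_types=entry.get("enabled_data_types", []),
                **entry.get("config", {}),
            )
        )
    return sources
```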
The role of `DataSource` is to provide authentication methods for the external service it represents. A `DataType` represents a single type of information we can get from this source; in the case of this PR that is GitHub issues (which also include PRs). The role of a `DataType` is to define:
- how to process the raw data from the external source and how to persist it in the database (to be done)
- how to read the data from the database, including aggregation, filters, etc.
In general, this is the rough idea I have in mind:

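To sketch the concept in code: below is a minimal, hypothetical illustration of the two base classes. The method names and signatures are assumptions for illustration, not necessarily the interface in this PR.

```python
# Illustrative sketch only: class layout, names and signatures are assumptions.
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable


class DataSource(ABC):
    """An account/organization within an external service; knows how to
    authenticate against that service."""

    def __init__(self, name: str, enabled_data_types: Iterable[str], **config: Any):
        self.name = name
        self.enabled_data_types = list(enabled_data_types)
        self.config = config

    @abstractmethod
    def get_session(self) -> Any:
        """Return an authenticated client/session for the external service."""


class DataType(ABC):
    """A single type of information obtainable from a DataSource,
    e.g. GitHub issues (which also include PRs)."""

    def __init__(self, data_source: DataSource):
        self.data_source = data_source

    @abstractmethod
    def fetch_and_persist(self) -> None:
        """Process raw data from the external source and persist it (to be done)."""

    @abstractmethod
    def read(self, **filters: Any) -> Iterable[Dict[str, Any]]:
        """Read the data back from the database, with aggregation, filters, etc."""
```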
@turbaszek Thanks for working on this. My initial thought is that this looks a lot more granular than what we have in place now - which is good, as we have sometimes missed being able to get to the right level of granularity. For GitHub the data types seem fairly organised and can pretty much already be allocated - how do you see this working, for example, for our project mailing lists? Would each list then be a datasource and the conversations the datatype?
> how do you see this working, for example, for our project mailing lists? Would each list then be a datasource and the conversations the datatype?
That's a very good question @sharanf!
I would lean towards what you've written. A data source does not represent only an "external service" entity but an "account/organization within an external service". So, in the case of mailing lists, each Apache project would require configuring its own data source.
For example:
```yaml
data_sources:
  - name: asf_mails_kibble
    class: kibble.data_sources.pony_mail.PonyMailDataSource
    config:
      project_name: kibble
    enabled_data_types:
      - mails
  - name: asf_mails_kafka
    class: kibble.data_sources.pony_mail.PonyMailDataSource
    config:
      project_name: kafka
    enabled_data_types:
      - mails
  - name: asf_mails_pulsar
    class: kibble.data_sources.pony_mail.PonyMailDataSource
    config:
      project_name: pulsar
    enabled_data_types:
      - mails
```
While there's a bit of duplication in the configuration, it allows more granularity. In the case of the ASF the config will be big and repetitive, but for smaller Kibble deployments it would be smaller, and the extra configuration may be an advantage.
This additional granularity is also useful for sources that need authorisation. In such cases we may want to store the credentials in a different way or use different auth methods.
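For example (purely a hypothetical sketch, assuming `requests` and a per-source environment variable for the token), authorisation could stay encapsulated in each data source:

```python
# Hypothetical sketch: each data source decides how to authenticate itself.
import os

import requests


class GithubDataSource:  # simplified illustration, not the class from this PR
    def __init__(self, name: str, repo_owner: str, repo_name: str, **kwargs: object):
        self.name = name
        self.repo_owner = repo_owner
        self.repo_name = repo_name

    def get_session(self) -> requests.Session:
        session = requests.Session()
        # Token looked up per data source, e.g. KIBBLE_GITHUB_KIBBLE_TOKEN
        token = os.environ.get(f"KIBBLE_{self.name.upper()}_TOKEN")
        if token:
            session.headers["Authorization"] = f"token {token}"
        return session
```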
@kaxil @Humbedooh @sharanf please let me know if we should proceed and merge (once I fix the tests). I would like to keep this moving.
@turbaszek From my side I am happy to keep things moving, so I have no problem with starting to merge your new code once the tests are fixed.