Add Github issues and PRs scanner
In this PR I tried to build some basic tooling around scanning data sources. Currently the data is only collected, not persisted anywhere; persisting it is something I still plan to do in this PR.
Although I decided to use the old approach, with configuration in a YAML file, it should be treated only as a temporary solution until we have a database.
Why didn't I use the well-known PyGithub? It's LGPL.
@kaxil @sharanf @Humbedooh @michalslowikowski00 happy to get your opinion.
As suggested on the dev list, I introduced the concepts of `DataSource` and `DataType`. For now these can be configured in a YAML configuration file:
```yaml
data_sources:
  - name: github_kibble
    class: kibble.data_sources.github.GithubDataSource
    config:
      repo_owner: apache
      repo_name: kibble
    enabled_data_types:
      - pr_issues
```
This form allows users to specify any external data source, as long as the class path points to an importable object.
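As a rough illustration of that (this is not the code from this PR; `load_data_sources` and the constructor arguments are hypothetical, and PyYAML is assumed), the class path can be resolved with a plain dynamic import:

```python
# Hypothetical sketch: turn the YAML config into data source objects by
# importing whatever class the `class` path points to.
import importlib

import yaml


def load_data_sources(config_path: str) -> list:
    with open(config_path) as f:
        config = yaml.safe_load(f)

    sources = []
    for entry in config.get("data_sources", []):
        module_path, class_name = entry["class"].rsplit(".", 1)
        cls = getattr(importlib.import_module(module_path), class_name)
        sources.append(
            cls(
                name=entry["name"],
                enabled_data_types=entry.get("enabled_data_types", []),
                **entry.get("config", {}),
            )
        )
    return sources
```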
The role of `DataSource` is to provide authentication methods for the external service it represents. A `DataType` represents a single type of information we can get from this source; in the case of this PR that is GitHub issues (which also include PRs). The role of a `DataType` is to define:
- how to process the raw data from the external source and how to persist it in the database (to be done)
- how to read the data from the database, including aggregation, filters, etc.
In general, this is the rough idea I have in mind:

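To sketch the concept in code: below is a minimal, hypothetical illustration of the two base classes. The method names and signatures are assumptions for illustration, not necessarily the interface in this PR.

```python
# Illustrative sketch only: class layout, names and signatures are assumptions.
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable


class DataSource(ABC):
    """An account/organization within an external service; knows how to
    authenticate against that service."""

    def __init__(self, name: str, enabled_data_types: Iterable[str], **config: Any):
        self.name = name
        self.enabled_data_types = list(enabled_data_types)
        self.config = config

    @abstractmethod
    def get_session(self) -> Any:
        """Return an authenticated client/session for the external service."""


class DataType(ABC):
    """A single type of information obtainable from a DataSource,
    e.g. GitHub issues (which also include PRs)."""

    def __init__(self, data_source: DataSource):
        self.data_source = data_source

    @abstractmethod
    def fetch_and_persist(self) -> None:
        """Process raw data from the external source and persist it (to be done)."""

    @abstractmethod
    def read(self, **filters: Any) -> Iterable[Dict[str, Any]]:
        """Read the data back from the database, with aggregation, filters, etc."""
```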
@turbaszek Thanks for working on this. My initial thought is that this looks a lot more granular than what we have in place now - which is good, as we have sometimes missed being able to get to the right level of granularity. For GitHub the data types seem fairly organised and can pretty much already be allocated - how do you see this working, for example, for our project mailing lists? Would each list then be a datasource and the conversations the datatype?
> how do you see this working, for example, for our project mailing lists? Would each list then be a datasource and the conversations the datatype?
That's a very good question @sharanf!
I would lean towards what you've written. A data source does not represent only an "external service" entity but an "account/organization within an external service". So, in the case of mailing lists, each Apache project would require configuring its own data source.
For example:
```yaml
data_sources:
  - name: asf_mails_kibble
    class: kibble.data_sources.pony_mail.PonyMailDataSource
    config:
      project_name: kibble
    enabled_data_types:
      - mails
  - name: asf_mails_kafka
    class: kibble.data_sources.pony_mail.PonyMailDataSource
    config:
      project_name: kafka
    enabled_data_types:
      - mails
  - name: asf_mails_pulsar
    class: kibble.data_sources.pony_mail.PonyMailDataSource
    config:
      project_name: pulsar
    enabled_data_types:
      - mails
```
While there's a bit of duplication in the configuration, it allows more granularity. In the case of the ASF the config will be big and repetitive, but for smaller Kibble deployments it would be smaller, and the extra configuration may be an advantage.
This additional granularity is also useful for sources that need authorisation. In such cases we may want to store the credentials in a different way or use different auth methods.
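For example (purely a hypothetical sketch, assuming `requests` and a per-source environment variable for the token), authorisation could stay encapsulated in each data source:

```python
# Hypothetical sketch: each data source decides how to authenticate itself.
import os

import requests


class GithubDataSource:  # simplified illustration, not the class from this PR
    def __init__(self, name: str, repo_owner: str, repo_name: str, **kwargs: object):
        self.name = name
        self.repo_owner = repo_owner
        self.repo_name = repo_name

    def get_session(self) -> requests.Session:
        session = requests.Session()
        # Token looked up per data source, e.g. KIBBLE_GITHUB_KIBBLE_TOKEN
        token = os.environ.get(f"KIBBLE_{self.name.upper()}_TOKEN")
        if token:
            session.headers["Authorization"] = f"token {token}"
        return session
```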
@kaxil @Humbedooh @sharanf please let me know if we should proceed and merge (once I fix the tests). I would like to keep this moving.
@turbaszek From my side I am happy to keep things moving, so I have no problem with starting to merge your new code once the tests are fixed.