No easy way to link contributors table to repo_id
When working on some metrics, we found that there is no nice way to filter the augur_data.contributors table with repo_id via an SQL WHERE clause since they are not connected in any straightforward way.
Adding repo_id to be included in the contributors table seems like the most straight forward solution, however making it available through any auxiliary table would also suffice.
Making repo_id available in the contributors table would allow for SQL statements such as
SELECT
repo.repo_id,
repo.repo_group_id,
contributors.cntrb_location as contributor_loc,
count(*) as loc_count
FROM
contributors,
repo
GROUP BY
repo.repo_id,
repo.repo_group_id,
contributor_loc
LIMIT 1000
and
SELECT
repo.repo_id,
repo.repo_group_id,
contributors.cntrb_company as company,
count(*) as company_count
FROM
contributors,
repo
GROUP BY
repo.repo_id,
repo.repo_group_id,
company
LIMIT 1000
to filter the data further via a WHERE clause.
@drewdahlquist the reason that there is no repo_id column in the contributors table is because there is not a 1:1 relationship between a repository and a contributor. For example, if I were listed in the contributors table, and I contribute to both chaoss/augur and rails/rails, then what should the repo_id associated with my record be? Both are valid, but choosing one obscures that I contributed to the other. Doing it this way, we would need to have X copies of each record, one for each repo the contributor has contributed to, which is a lot of extra storage we don't want to use. So, I think your idea to use an auxiliary mapping table is a better approach, and we could then correctly model the M:N relationship of contributors and repos.
In the interim, if you're looking to get the data for which contributors made commits to a specific repo, I suggest using this query:
SELECT DISTINCT
contributors.cntrb_id
FROM
augur_data.contributors
JOIN commits ON commits.cmt_author_email = contributors.cntrb_canonical
WHERE
commits.repo_id = :repo_id
which will match the commits in the commits table to contributors in the contributors table by their canonical email address (the one that Augur has chosen to be their "main" email address, if they are detected to have more than 1), filtering by :repo_id, and then returns the IDs to be used somewhere else (perhaps this is in a subquery, or a different API call). Let me know if I misunderstood your issue, and thanks for opening it!
Can I work on this issue
@MohitPatni0731 : I think the way to do it is to build a table with a repo_id -- contributor_id column map. It would serve as a many to many bridge table.
@MohitPatni0731 : Let us know if you are working on this issue and would like to have a brief convo about addressing it.