augur icon indicating copy to clipboard operation
augur copied to clipboard

No easy way to link contributors table to repo_id

Open drewdahlquist opened this issue 5 years ago • 4 comments

When working on some metrics, we found that there is no nice way to filter the augur_data.contributors table with repo_id via an SQL WHERE clause since they are not connected in any straightforward way.

Adding repo_id to be included in the contributors table seems like the most straight forward solution, however making it available through any auxiliary table would also suffice.

Making repo_id available in the contributors table would allow for SQL statements such as

SELECT 
    repo.repo_id,
    repo.repo_group_id,
    contributors.cntrb_location as contributor_loc,
    count(*) as loc_count
FROM
    contributors,
    repo
GROUP BY
    repo.repo_id,
    repo.repo_group_id,
    contributor_loc
LIMIT 1000

and

SELECT
    repo.repo_id,
    repo.repo_group_id,
    contributors.cntrb_company as company,
    count(*) as company_count
FROM
    contributors,
    repo
GROUP BY
    repo.repo_id,
    repo.repo_group_id,
    company
LIMIT 1000

to filter the data further via a WHERE clause.

drewdahlquist avatar Dec 18 '20 21:12 drewdahlquist

@drewdahlquist the reason that there is no repo_id column in the contributors table is because there is not a 1:1 relationship between a repository and a contributor. For example, if I were listed in the contributors table, and I contribute to both chaoss/augur and rails/rails, then what should the repo_id associated with my record be? Both are valid, but choosing one obscures that I contributed to the other. Doing it this way, we would need to have X copies of each record, one for each repo the contributor has contributed to, which is a lot of extra storage we don't want to use. So, I think your idea to use an auxiliary mapping table is a better approach, and we could then correctly model the M:N relationship of contributors and repos.

In the interim, if you're looking to get the data for which contributors made commits to a specific repo, I suggest using this query:

SELECT DISTINCT
	contributors.cntrb_id 
FROM
	augur_data.contributors
	JOIN commits ON commits.cmt_author_email = contributors.cntrb_canonical
WHERE
	commits.repo_id = :repo_id

which will match the commits in the commits table to contributors in the contributors table by their canonical email address (the one that Augur has chosen to be their "main" email address, if they are detected to have more than 1), filtering by :repo_id, and then returns the IDs to be used somewhere else (perhaps this is in a subquery, or a different API call). Let me know if I misunderstood your issue, and thanks for opening it!

ccarterlandis avatar Dec 19 '20 21:12 ccarterlandis

Can I work on this issue

MohitPatni0731 avatar Dec 21 '20 12:12 MohitPatni0731

@MohitPatni0731 : I think the way to do it is to build a table with a repo_id -- contributor_id column map. It would serve as a many to many bridge table.

sgoggins avatar Mar 02 '21 13:03 sgoggins

@MohitPatni0731 : Let us know if you are working on this issue and would like to have a brief convo about addressing it.

sgoggins avatar Mar 20 '21 13:03 sgoggins