GitHub Diffs

Open herbiebradley opened this issue 3 years ago • 5 comments

A scraper for GitHub diffs. It takes a JSONL file as input, containing for each commit the hash, the commit message, and the repository name as a string.
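
For readers unfamiliar with this input format, here is a minimal sketch of reading such a JSONL file and turning each record into a diff URL. The field names `commit`, `message`, and `repo_name` are assumptions made for illustration and may not match the keys used in this PR. The sketch relies on GitHub serving a plain-text diff when `.diff` is appended to a commit URL; whether the scraper in this PR fetches diffs that way is not stated here.

```python
import io
import json


def commit_diff_url(repo_name, commit_hash):
    """Build the URL of the plain-text diff for one commit on GitHub."""
    return f"https://github.com/{repo_name}/commit/{commit_hash}.diff"


def iter_commits(lines):
    """Yield (repo_name, commit_hash, message) for each non-empty JSONL line.

    The keys below are hypothetical; adjust them to the real schema.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        record = json.loads(line)
        yield record["repo_name"], record["commit"], record["message"]


# An in-memory stand-in for the JSONL file, with one made-up commit.
sample = io.StringIO(
    '{"commit": "abc123", "message": "Fix typo", "repo_name": "octocat/Hello-World"}\n'
)

urls = [commit_diff_url(repo, sha) for repo, sha, _ in iter_commits(sample)]
print(urls)
```

Any iterable of lines works in place of `sample`, including an open file handle, so the input is processed one commit at a time rather than loaded whole.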

This uses PyArrow via dask to save to parquet, which makes it easily parallelisable and gives low memory usage.

See #31

herbiebradley avatar Oct 07 '22 12:10 herbiebradley

Make sure to end your file with a newline.

LouisCastricato avatar Oct 07 '22 12:10 LouisCastricato

LGTM, @ncoop57 can you check?

LouisCastricato avatar Oct 07 '22 14:10 LouisCastricato

Looks good @herbiebradley. The only thing needed is a minimal test that runs against a dummy parquet file using pytest: https://docs.pytest.org/en/7.1.x/getting-started.html. We want to make sure we don't have bugs. Also, could you enable maintainer edits for the PR so that I can quickly modify something if needed? https://github.blog/2016-09-07-improving-collaboration-with-forks/

ncoop57 avatar Oct 09 '22 14:10 ncoop57

@herbiebradley @reshinthadithyan This is looking pretty solid. Could you add a quick test so that I can merge?

ncoop57 avatar Nov 09 '22 17:11 ncoop57

@reshinthadithyan you might want to add your scripts to this branch before we merge?

herbiebradley avatar Nov 09 '22 18:11 herbiebradley