GitHub Diffs
A scraper for GitHub diffs. It takes a JSONL file in which each line describes one commit: its hash, its commit message, and its repository name as a string.
This uses PyArrow via Dask to save to Parquet, which makes it easy to parallelise and keeps memory usage low.
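For reviewers, here is a rough sketch of the input format described above, using only the standard library. The key names (`commit`, `message`, `repo_name`) and the helpers `read_commits` and `diff_url` are hypothetical; they are not taken from the PR. The `.diff` suffix on a commit URL is a GitHub feature, but whether the scraper fetches diffs that way is an assumption.

```python
import json
from typing import Iterable, Iterator, TypedDict


class CommitRecord(TypedDict):
    # Hypothetical key names; the actual JSONL keys may differ.
    commit: str      # commit hash
    message: str     # commit message
    repo_name: str   # repository name as "owner/repo"


def read_commits(lines: Iterable[str]) -> Iterator[CommitRecord]:
    """Yield one commit record per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        row = json.loads(line)
        yield CommitRecord(
            commit=row["commit"],
            message=row["message"],
            repo_name=row["repo_name"],
        )


def diff_url(record: CommitRecord) -> str:
    """Build the URL of the plain-text diff for a commit (assumed approach)."""
    return f"https://github.com/{record['repo_name']}/commit/{record['commit']}.diff"


if __name__ == "__main__":
    sample = [
        '{"commit": "abc123", "message": "Fix typo", "repo_name": "octocat/Hello-World"}\n',
    ]
    for record in read_commits(sample):
        print(diff_url(record))
```

Reading line by line keeps only one record in memory at a time, which fits the low-memory goal mentioned above. The Dask and Parquet side is left out here.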
See #31
Make sure to end your file with a newline.
LGTM, @ncoop57 can you check?
Looks good, @herbiebradley. The only thing needed is a minimal test, run with pytest, that uses a dummy parquet file: https://docs.pytest.org/en/7.1.x/getting-started.html. We want to make sure we don't have bugs. Also, could you enable maintainer edits for the PR, so that I can modify something quickly if I need to? https://github.blog/2016-09-07-improving-collaboration-with-forks/
@herbiebradley @reshinthadithyan This is looking pretty solid. Could you add a quick test so that I can merge?
@reshinthadithyan you might want to add your scripts to this branch before we merge?