GitHub Diffs

Open herbiebradley opened this issue 3 years ago • 7 comments

Description

Dataset is on BigQuery as a table of commit hashes and messages.

Procedure

From commit hash and message, produce dict containing:

  • Raw files before changes
  • Commit message
  • Diff file

For each commit, this requires downloading the files as they are after the change and applying the reverse patch to recover the files as they were before it.
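
The reverse-patch step can be sketched in plain Python. This is only an illustration, not the code from the gist: `reverse_apply` is a hypothetical helper, it assumes a unified diff covering a single file, and it ignores the "\ No newline at end of file" marker. In practice a tool such as `patch -R` would do the same job.

```python
import re

# Matches a unified-diff hunk header such as "@@ -3,7 +3,7 @@".
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def reverse_apply(after_lines, diff_text):
    """Rebuild the pre-commit file from the post-commit file and a unified diff.

    Hypothetical helper: assumes the diff touches exactly one file.
    """
    before = []
    pos = 0  # index of the next unread line in after_lines
    in_hunk = False
    for line in diff_text.splitlines():
        header = HUNK_HEADER.match(line)
        if header:
            start = int(header.group(3))
            count = int(header.group(4)) if header.group(4) is not None else 1
            # With a zero-length range the start names the line before the hunk.
            hunk_pos = start if count == 0 else start - 1
            before.extend(after_lines[pos:hunk_pos])  # untouched lines between hunks
            pos = hunk_pos
            in_hunk = True
        elif not in_hunk or line.startswith("\\"):
            continue  # file headers, or the "no newline" marker (ignored here)
        elif line.startswith("+"):
            pos += 1  # added by the commit, so absent from the before file
        elif line.startswith("-"):
            before.append(line[1:] + "\n")  # removed by the commit, so restore it
        else:
            before.append(after_lines[pos])  # context line, unchanged
            pos += 1
    before.extend(after_lines[pos:])  # everything after the last hunk
    return before


# Toy data loosely modelled on a version bump in setup.py.
after_file = ["setup(\n", "  name = 'x',\n", "  version = '0.26.3',\n", "  license='MIT',\n"]
diff = "\n".join([
    "--- a/setup.py",
    "+++ b/setup.py",
    "@@ -1,4 +1,4 @@",
    " setup(",
    "   name = 'x',",
    "-  version = '0.26.1',",
    "+  version = '0.26.3',",
    "   license='MIT',",
])
before_file = reverse_apply(after_file, diff)
```

Here `before_file` holds the same lines as `after_file` except that the version line reads `0.26.1` again.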

We also need to decide on a suitable length threshold to filter on: most or all of the before file has to fit in the context window, which significantly restricts how many lines a file can have.

Minimal working example here: https://gist.github.com/herbiebradley/b08d2e13775384fe4b5353e831dac43a

  • [x] Minimal working example
  • [x] Decide on length threshold
  • [x] parquet output
  • [x] Inherit from dataset.py base classes
  • [x] Parallel processing
  • [ ] Bitbucket modifications - see #5

Example

Give an example of the columns and data:

before_file: ['from setuptools import setup, find_packages\n', '\n', 'setup(\n', ... ]
commit_message: Change version
diff: [{'addition_count': 1, 'deletion_count': 1, 'hunks': [[[3, 7], [3, 7], '', ' setup(', " name = 'denoising-diffusion-pytorch',", ' packages = find_packages(),', "- version = '0.26.1',", "+ version = '0.26.3',", " license='MIT',", " description = 'Denoising Diffusion Probabilistic " "Models - Pytorch',", " author = 'Phil Wang',"]], 'patch_info': <PatchInfo: diff --git a/setup.py b/setup.py>, 'src_file': 'a/setup.py', 'tgt_file': 'b/setup.py'}]

herbiebradley avatar Sep 27 '22 23:09 herbiebradley

What will be the filtering criteria for repositories we're going to index for scraping diffs?

>10 GitHub stars
>2 commits
Must have a liberal license
Exclude forks

cc @ncoop57, @herbiebradley
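
As a rough sketch, the four criteria above could be expressed as a single predicate. The `RepoInfo` fields and the contents of `PERMISSIVE_LICENSES` are assumptions made for illustration; the thread only says "liberal license" and does not name specific licenses or a metadata schema.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed set of licenses counted as "liberal"; the thread does not list any.
PERMISSIVE_LICENSES = {"mit", "apache-2.0", "bsd-2-clause", "bsd-3-clause"}


@dataclass
class RepoInfo:
    """Hypothetical repository metadata record."""
    name: str
    stars: int
    commits: int
    license: Optional[str]
    is_fork: bool


def keep_repo(repo: RepoInfo) -> bool:
    """Return True if the repository passes all four proposed criteria."""
    return (
        repo.stars > 10
        and repo.commits > 2
        and (repo.license or "").lower() in PERMISSIVE_LICENSES
        and not repo.is_fork
    )


candidates = [
    RepoInfo("popular-lib", stars=250, commits=40, license="MIT", is_fork=False),
    RepoInfo("popular-fork", stars=250, commits=40, license="MIT", is_fork=True),
    RepoInfo("tiny-repo", stars=3, commits=40, license="MIT", is_fork=False),
    RepoInfo("unlicensed", stars=250, commits=40, license=None, is_fork=False),
]
kept = [repo.name for repo in candidates if keep_repo(repo)]
```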

reshinthadithyan avatar Sep 28 '22 01:09 reshinthadithyan

Yes, these seem like sensible criteria, I think that should be everything we need.

herbiebradley avatar Sep 28 '22 10:09 herbiebradley

By length criteria, do you mean the length of commit_message? If so, the table has a commit message column, so we can query with length constraints.

reshinthadithyan avatar Sep 28 '22 12:09 reshinthadithyan

I meant the length of the combined data, but after checking with Louis we decided this doesn't need to be filtered, because the constraint is highly variable and model-dependent.

So the criteria you mention above should be fine on their own.

herbiebradley avatar Sep 28 '22 13:09 herbiebradley

Updated to remove Python-specific stuff, to allow for scraping all languages.

herbiebradley avatar Sep 28 '22 19:09 herbiebradley

We also need to include only diffs that modify files, not ones that delete files or create new ones. We should also filter out unhelpful commit messages, such as ones with fewer than a few words.

ncoop57 avatar Sep 28 '22 21:09 ncoop57

We also need to include only diffs that modify files, not ones that delete files or create new ones. We should also filter out unhelpful commit messages, such as ones with fewer than a few words.

Discussed this with Joel and we think that at least diffs which create files could be useful at some point in the future and potentially those which delete files too - not necessarily for ELM replication but for training refactoring models. Since this dataset could be used on several possible projects, I think it will help long term to not remove these from the scrape.

Filtering out unhelpful commit messages seems good, but I can think of some scenarios where short commit messages are helpful, so we need to decide carefully how to do that.
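
One possible starting point, purely as a sketch: combine a low word-count threshold with a small list of generic messages, so that short but specific messages such as "Change version" survive. The `GENERIC_MESSAGES` set and the `min_words` default are placeholders for illustration, not values agreed in this thread.

```python
# Placeholder list of messages that carry no information on their own.
GENERIC_MESSAGES = {"update", "fix", "wip", "changes", "minor changes", "initial commit", "."}


def is_informative(message: str, min_words: int = 2) -> bool:
    """Heuristic check: drop generic or near-empty commit messages."""
    text = message.strip().lower()
    if text in GENERIC_MESSAGES:
        return False
    # Keep anything with at least min_words whitespace-separated words.
    return len(text.split()) >= min_words


messages = ["Change version", "wip", "Update", "", "Fix off-by-one in hunk parser"]
kept_messages = [m for m in messages if is_informative(m)]
```

With these placeholder settings, `kept_messages` contains "Change version" and "Fix off-by-one in hunk parser".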

herbiebradley avatar Sep 28 '22 22:09 herbiebradley