Bitbucket diffs

Open aaronrmm opened this issue 3 years ago • 11 comments

Bitbucket has an API for public repos

Dataset URL - None

Does the dataset exist in a scraped format? No (searched using Google, Papers with Code, and Kaggle).

Description

Bitbucket is far less popular than GitHub for open source git repos, but it does host them and provides an API for querying and filtering them. Because Bitbucket has no stars as GitHub does, we would have to approximate popularity with the number of watchers or the number of contributors. Repositories can also be filtered by language. They do not appear to be filterable by license.

Procedure

  1. Approximate the value of a Bitbucket dataset by pulling metrics on its open source repos. Using the Bitbucket API, pull the following information:
  • number of public repositories
  • distribution of watchers per repository
  • distribution of contributors per repository
  • number of commits per repository
  2. With the above information, determine a good metric for how repositories should be prioritized. Sort the repo list by this metric.
  3. Start pulling commit diffs from the highest-priority repos. Docs
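
As a rough illustration of step 2, here is a minimal sketch of one possible priority metric. It assumes each repo has already been summarized as a dict with `watchers`, `contributors`, and `commits` counts. The key names, the weights, and the log scaling are all placeholders for discussion, not something settled in this issue.

```python
import math
from typing import Dict, List

# Hypothetical weights: nothing in this issue fixes these values.
WEIGHTS = {"watchers": 1.0, "contributors": 2.0, "commits": 0.5}


def priority_score(repo: Dict) -> float:
    """Sum log-scaled counts so a single huge value does not dominate."""
    return sum(
        weight * math.log1p(repo.get(key, 0))
        for key, weight in WEIGHTS.items()
    )


def prioritize(repos: List[Dict]) -> List[Dict]:
    """Return repos sorted from highest to lowest priority score."""
    return sorted(repos, key=priority_score, reverse=True)
```

Calling `prioritize(repos)` on the collected metrics would give the order in which to start pulling diffs. The weights would need tuning once the real distributions from step 1 are known.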

aaronrmm avatar Sep 14 '22 15:09 aaronrmm

Great idea!

LouisCastricato avatar Sep 15 '22 13:09 LouisCastricato

This is annoying: https://community.developer.atlassian.com/t/cant-filter-public-repos-with-bitbucket-api/61919

aaronrmm avatar Sep 24 '22 19:09 aaronrmm

Any ideas on how to fix this?

LouisCastricato avatar Sep 24 '22 19:09 LouisCastricato

I can pull it all down unfiltered. Or all repos since a certain timestamp. Otherwise, no.
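
For reference, a minimal sketch of what "all repos since a certain timestamp" could look like. It assumes the public listing endpoint `https://api.bitbucket.org/2.0/repositories` accepts an `after` query parameter and returns pages with `values` and `next` keys. The function names and the `max_pages` guard are made up for illustration, and this is not the script actually being run.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Dict, Iterator, Optional

# Assumed endpoint for listing public repositories.
API_ROOT = "https://api.bitbucket.org/2.0/repositories"


def fetch_json(url: str) -> Dict:
    """Fetch one page of results and decode it as JSON."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def iter_public_repos(
    after: Optional[str] = None,
    fetch: Callable[[str], Dict] = fetch_json,
    max_pages: Optional[int] = None,
) -> Iterator[Dict]:
    """Yield repo records page by page, following each page's `next` link."""
    url: Optional[str] = API_ROOT
    if after:
        # `after` is assumed to be an ISO-8601 timestamp.
        url += "?" + urllib.parse.urlencode({"after": after})
    pages = 0
    while url and (max_pages is None or pages < max_pages):
        page = fetch(url)
        yield from page.get("values", [])
        url = page.get("next")  # missing on the last page
        pages += 1
```

Passing a custom `fetch` makes it possible to add throttling or retries, or to test the pagination logic without touching the network.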

aaronrmm avatar Sep 24 '22 22:09 aaronrmm

...which is what I'm currently doing.

aaronrmm avatar Sep 25 '22 00:09 aaronrmm

Are there any existing indexes we could use? @aaronrmm

ncoop57 avatar Sep 25 '22 23:09 ncoop57

If you can get the commit message and hash then it should be simple to adapt the code at #31 to fit Bitbucket.

herbiebradley avatar Sep 27 '22 23:09 herbiebradley

I can get the message, hash, diff, author, date, patch, and parent commit, which I think is everything needed. I am currently grabbing all the commit hashes for all the repos.
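
A small sketch of how those fields might be flattened into one record per commit. It assumes the commit JSON uses the keys `hash`, `message`, `date`, `author.raw`, and `parents[].hash`. The `CommitRecord` class and `to_commit_record` helper are hypothetical names, and the diff or patch text would be fetched separately and attached afterwards.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CommitRecord:
    commit_hash: str
    message: str
    author: str
    date: str
    parents: List[str]


def to_commit_record(raw: Dict) -> CommitRecord:
    """Map one raw commit dict (assumed key names) to a flat record."""
    return CommitRecord(
        commit_hash=raw["hash"],
        message=raw.get("message", ""),
        author=raw.get("author", {}).get("raw", ""),
        date=raw.get("date", ""),
        parents=[parent["hash"] for parent in raw.get("parents", [])],
    )
```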

aaronrmm avatar Sep 29 '22 16:09 aaronrmm

Hi everyone. In terms of code from repos, I just managed to get the list of all public repositories from Bitbucket through their API. The API is rate-limited (1000 calls per hour), so I used Kaggle to create multiple notebooks (different IPs) to fetch it, and I finally made progress on this. To summarize: I got 1261420 repos from Bitbucket that we can download. I attached a sample of the data here; the full dataset can be found at https://drive.google.com/file/d/13QsJRhhpL64m3jhsH4up0CBtxDIalO-A/view?usp=sharing. The data includes these fields for each repo: ['type', 'full_name', 'links', 'name', 'slug', 'description', 'scm', 'website', 'owner', 'workspace', 'is_private', 'project', 'fork_policy', 'created_on', 'updated_on', 'size', 'language', 'has_issues', 'has_wiki', 'uuid', 'mainbranch', 'override_settings', 'parent']. We can filter on size, language, and so on. I wrote a script to download all the repos, but we need to discuss the server and storage for this data. If we can also download this data from GitLab, alongside the GitHub data we already have, it will be a great resource. cc @ncoop57
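
To make the filtering idea concrete, here is a minimal sketch over records that have the `is_private`, `language`, and `size` fields listed above. It assumes `size` is in bytes and that language names are compared case-insensitively. The function names and thresholds are illustrative only, and this is not the download script mentioned above.

```python
from typing import Dict, Iterable, Iterator, Optional, Set

# Rate limit quoted above: 1000 API calls per hour.
RATE_LIMIT_PER_HOUR = 1000


def seconds_between_calls(limit_per_hour: int = RATE_LIMIT_PER_HOUR) -> float:
    """Minimum delay between calls to stay under the hourly limit."""
    return 3600.0 / limit_per_hour


def filter_repos(
    repos: Iterable[Dict],
    languages: Optional[Set[str]] = None,
    max_size_bytes: Optional[int] = None,
) -> Iterator[Dict]:
    """Yield public repos matching the language and size constraints."""
    for repo in repos:
        if repo.get("is_private"):
            continue
        language = (repo.get("language") or "").lower()
        if languages is not None and language not in languages:
            continue
        # Assumption: `size` is reported in bytes; missing values count as 0.
        if max_size_bytes is not None and (repo.get("size") or 0) > max_size_bytes:
            continue
        yield repo
```

For example, `filter_repos(repos, languages={"python"}, max_size_bytes=50_000_000)` would keep only public Python repos under roughly 50 MB, which could help with the storage discussion.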

PhungVanDuy avatar Oct 02 '22 06:10 PhungVanDuy

epic @PhungVanDuy!!! though it might be worth opening up a completely separate issue with this info since this issue I think is specifically only for diffs.

ncoop57 avatar Oct 04 '22 19:10 ncoop57

Thank you for the suggestion. I just created a new issue: https://github.com/CarperAI/Code-Pile/issues/34

PhungVanDuy avatar Oct 05 '22 01:10 PhungVanDuy