Bitbucket diffs
Bitbucket has an API for public repos
Dataset URL - None
Does the dataset exist in a scraped format? No (searched Google, Papers with Code, and Kaggle).
Description
Bitbucket is far less popular than GitHub for open source Git repos, but it does host them and provides an API for querying and filtering them. Because Bitbucket has no stars as GitHub does, we would have to approximate popularity with the number of watchers or the number of contributors. Repositories can also be filtered by language. They do not appear to be filterable by license.
Procedure
- Approximate the value of a Bitbucket dataset by pulling metrics on open source repositories. Using the Bitbucket API, pull the following information:
  - number of public repositories
  - distribution of watchers per repository
  - distribution of contributors per repository
  - number of commits per repository
- With the above information, determine a good metric for how repositories should be prioritized. Sort the repo list with this metric.
- Start pulling commit diffs from the highest priority repos. Docs
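The prioritization step above could be sketched roughly as follows. This is only an illustration, not a settled metric: the `RepoStats` class, the `priority` function, and the weights are all hypothetical, and the log scaling is an assumption chosen so that one very large repository does not dominate the ranking.

```python
import math
from dataclasses import dataclass


@dataclass
class RepoStats:
    """Per-repository metrics gathered from the API (hypothetical container)."""
    full_name: str
    watchers: int
    contributors: int
    commits: int


def priority(repo, w_watchers=1.0, w_contributors=1.0, w_commits=0.5):
    """Weighted sum of log-scaled metrics; the weights are placeholders to tune."""
    return (
        w_watchers * math.log1p(repo.watchers)
        + w_contributors * math.log1p(repo.contributors)
        + w_commits * math.log1p(repo.commits)
    )


def prioritize(repos):
    """Return the repositories sorted from highest to lowest priority."""
    return sorted(repos, key=priority, reverse=True)
```

Once the watcher and contributor distributions are known, the weights (or the whole formula) can be replaced with whatever separates useful repositories from noise.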
Great idea!
This is annoying: https://community.developer.atlassian.com/t/cant-filter-public-repos-with-bitbucket-api/61919
Any ideas on how to fix this?
I can pull it all down unfiltered. Or all repos since a certain timestamp. Otherwise, no.
...which is what I'm currently doing.
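For reference, pulling everything down since a timestamp might look like the sketch below. It assumes the public listing endpoint `https://api.bitbucket.org/2.0/repositories` accepts an `after` ISO-8601 timestamp and returns pages with `values` and `next` keys; please check that against the current API docs. The function names and the `fetch` and `max_pages` parameters are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint for listing public repositories.
API_ROOT = "https://api.bitbucket.org/2.0/repositories"


def fetch_json(url):
    """Fetch one page of results and decode it as JSON."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def iter_public_repos(after=None, fetch=fetch_json, max_pages=None):
    """Yield repository records page by page, following each page's `next` link.

    `after` is an ISO-8601 timestamp string; `fetch` can be swapped out for
    testing or for adding throttling and retries.
    """
    url = API_ROOT
    if after is not None:
        url += "?" + urllib.parse.urlencode({"after": after})
    pages = 0
    while url:
        page = fetch(url)
        yield from page.get("values", [])
        pages += 1
        if max_pages is not None and pages >= max_pages:
            break
        url = page.get("next")  # missing on the last page, which ends the loop
```

Passing `max_pages=1` is a cheap way to inspect a single page before committing to a full crawl.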
Are there any existing indexes we could use? @aaronrmm
If you can get the commit message and hash then it should be simple to adapt the code at #31 to fit Bitbucket.
I can get the message, hash, diff, author, date, patch, and parent commit, which I think is everything needed. I am currently grabbing all the commit hashes for all the repos.
Hi everyone. In terms of code from repos, I just managed to get all public repositories from Bitbucket through their API. The API is rate-limited (1000 calls per hour), so I used Kaggle to create multiple notebooks (different IPs) to get the data, and I finally made progress on this. To summarize, I got 1261420 repos from Bitbucket that we can download. I attached a sample of the data here; the full dataset can be found at https://drive.google.com/file/d/13QsJRhhpL64m3jhsH4up0CBtxDIalO-A/view?usp=sharing. The data includes these fields for each repo: ['type', 'full_name', 'links', 'name', 'slug', 'description', 'scm', 'website', 'owner', 'workspace', 'is_private', 'project', 'fork_policy', 'created_on', 'updated_on', 'size', 'language', 'has_issues', 'has_wiki', 'uuid', 'mainbranch', 'override_settings', 'parent']. We can filter on size, language, and so on. I wrote a script to download all the repos; we need to discuss the server and storage for this data. If we can manage to download this data from GitLab as well, alongside the GitHub data that we already have, it will be a great resource. cc @ncoop57
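A filter over that metadata could look something like the sketch below. It assumes the records are loaded as a list of dicts with the `size`, `language`, and `is_private` fields listed above, that `size` is a number, and that `language` is a string (possibly empty). The `filter_repos` name and its parameters are hypothetical.

```python
def filter_repos(repos, languages=None, min_size=0, max_size=None):
    """Keep public repos whose size and language fall within the given bounds.

    `languages` is an optional collection of language names, matched
    case-insensitively; `min_size` and `max_size` use the same unit as
    the `size` field in the metadata.
    """
    wanted = {lang.lower() for lang in languages} if languages else None
    kept = []
    for repo in repos:
        if repo.get("is_private"):
            continue
        size = repo.get("size") or 0
        if size < min_size:
            continue
        if max_size is not None and size > max_size:
            continue
        language = (repo.get("language") or "").lower()
        if wanted is not None and language not in wanted:
            continue
        kept.append(repo)
    return kept
```

For example, `filter_repos(repos, languages=["python"], min_size=1)` would drop empty repositories and anything not tagged as Python.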
epic @PhungVanDuy!!! though it might be worth opening up a completely separate issue with this info since this issue I think is specifically only for diffs.
Thank you for the suggestion. I just created a new issue: https://github.com/CarperAI/Code-Pile/issues/34