Code-Pile
This repository contains all the code for collecting code from GitHub at large scale.
Discourse crawler
This PR contains BigQuery queries and GitHub GraphQL API scraping code for pre-2015 issues and comments. The code is not easily reusable and does not have easily pluggable configurations. This works and...
## Programming & Computing Sub-Reddits Dataset URL - [awesome list of programming subreddits](https://github.com/iCHAIT/awesome-subreddits) [Code Pile Spreadsheet](https://docs.google.com/spreadsheets/d/1OrOnv-Cv1wRq0jNk4AegHiMtLk88YQDz5b1TP-o5SE8/edit#gid=1020850625) [Another list of programming subreddits](https://github.com/learn-anything/reddit#linux) Thanks to @ncoop57! Does the dataset exist in a...
Follow work in the data documentation space, such as https://arxiv.org/abs/1803.09010 and https://arxiv.org/abs/2201.07311. We will base our documentation on the template from Hugging Face: https://github.com/huggingface/datasets/blob/main/templates/README.md
gitlab
## Title Dataset URL - [here](https://gitlab.com) Does the dataset exist in a scraped format? No URL if Yes - [here]() ## Description GitLab, like GitHub, but not in BigQuery...
A scraper for GitHub diffs. It takes a JSONL file that contains, for each commit, the hash, the commit message, and the repository name as a string. This uses PyArrow via `dask` to save to...
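The JSONL input described above can be illustrated with a minimal sketch. The field names `commit`, `message`, and `repo_name`, and the helper names `iter_commits` and `diff_url`, are assumptions made for illustration and are not taken from the PR. The sketch only parses records and builds the public GitHub URL that serves a commit as a plain diff; it does not perform any network requests or reproduce the PR's storage step.

```python
import json
from typing import Dict, Iterator


def iter_commits(jsonl_text: str) -> Iterator[Dict[str, str]]:
    """Yield one commit record per non-empty JSONL line."""
    for line in jsonl_text.splitlines():
        line = line.strip()
        if line:
            yield json.loads(line)


def diff_url(record: Dict[str, str]) -> str:
    """Build the URL of the plain-text diff for a commit on github.com.

    The keys 'repo_name' and 'commit' are hypothetical field names.
    """
    return f"https://github.com/{record['repo_name']}/commit/{record['commit']}.diff"


# Two made-up records standing in for the real input file.
sample = "\n".join(
    [
        json.dumps({"commit": "abc123", "message": "Fix typo", "repo_name": "octocat/Hello-World"}),
        json.dumps({"commit": "def456", "message": "Add tests", "repo_name": "octocat/Spoon-Knife"}),
    ]
)

urls = [diff_url(record) for record in iter_commits(sample)]
for url in urls:
    print(url)
```

A real scraper would fetch each URL with rate limiting and error handling, which this sketch leaves out.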
Adds crawler for tutorialspoint.com
We should follow a process similar to the BigScience workshop's dataset processing. It includes many tools that are ready for us to use, such as data deduplication, both exact match...
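As a rough illustration of the exact-match deduplication mentioned above, the following sketch hashes each document and keeps only the first occurrence of each hash. It is not the BigScience tooling; the function name `exact_dedup` and the sample documents are hypothetical.

```python
import hashlib
from typing import Iterable, List


def exact_dedup(docs: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each byte-identical document."""
    seen = set()
    unique = []
    for doc in docs:
        # Hash the UTF-8 bytes so that large documents are compared by digest.
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


docs = ["print('a')", "print('b')", "print('a')"]
deduped = exact_dedup(docs)
print(deduped)
```

Exact matching only removes byte-identical copies; documents that differ in whitespace or a single character are kept, which is why near-duplicate detection is usually treated as a separate step.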
## Title Dataset URL - [LinusTechTips](https://linustechtips.com/forum/20-programming/) Does the dataset exist in a scraped format? No ## Description This is a well-known programming forum. A quick scan shows more than 10,000 topics from...
## Mailing Lists Dataset URL - * [git](https://git-scm.com/community) * [python](https://www.python.org/community/lists/) * [mailing list archives](https://www.mail-archive.com/) Does the dataset exist in a scraped format? No ## Description In general, (almost) every...