Code-Pile

This repository contains all the code for collecting code from GitHub at large scale.

Results 23 Code-Pile issues

Discourse crawler

This PR contains BigQuery queries and GitHub GraphQL API scraping code for pre-2015 issues and comments. The code is not easily reusable and does not have pluggable configurations. This works and...

## Programming & Computing Sub-Reddits Dataset URL - [awesome list of programming subreddits](https://github.com/iCHAIT/awesome-subreddits) [Code Pile Spreadsheet](https://docs.google.com/spreadsheets/d/1OrOnv-Cv1wRq0jNk4AegHiMtLk88YQDz5b1TP-o5SE8/edit#gid=1020850625) [Another list of programming subreddits](https://github.com/learn-anything/reddit#linux) Thanks to @ncoop57! Does the dataset exist in a...

dataset-request

Follow work in the data documentation space, such as https://arxiv.org/abs/1803.09010 and https://arxiv.org/abs/2201.07311. We will base our documentation on the template from Hugging Face: https://github.com/huggingface/datasets/blob/main/templates/README.md

## Title Dataset URL - [here](https://gitlab.com) Does the dataset exist in a scraped format? No URL if Yes - [here]() ## Description GitLab, like GitHub, but not in BigQuery...

dataset-request

A scraper for GitHub diffs. It takes a JSONL file containing, for each commit, the hash, commit message, and repository name as a string. This uses PyArrow via `dask` to save to...
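As a rough illustration of the input side of such a scraper, the sketch below reads JSONL commit records and builds a diff URL for each one. It is not the PR's implementation: the field names `repo_name` and `commit` and the helper `commit_diff_urls` are hypothetical, it relies on GitHub serving a plain-text diff when `.diff` is appended to a commit URL, and it leaves out the downloading and the PyArrow/`dask` output step entirely.

```python
import json


def commit_diff_urls(jsonl_text):
    """Yield (repo, sha, diff_url) for each commit record in a JSONL string.

    Assumes each line is a JSON object with the hypothetical keys
    "repo_name" (e.g. "owner/project") and "commit" (the commit hash).
    """
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        repo = record["repo_name"]  # assumed field name
        sha = record["commit"]      # assumed field name
        # Appending ".diff" to a commit URL returns the diff as plain text.
        yield repo, sha, f"https://github.com/{repo}/commit/{sha}.diff"
```

Each yielded URL could then be fetched with any HTTP client, subject to GitHub's rate limits.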

Adds crawler for tutorialspoint.com

We should follow a process similar to the BigScience workshop's dataset processing. They provide many tools that are ready for us to use, such as data deduplication, both exact match...
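To make the exact-match case concrete, here is a minimal sketch of deduplication by content hash. It is an illustration only and not the BigScience tooling: the function name `dedup_exact` is hypothetical, and it assumes the documents fit in memory as Python strings.

```python
import hashlib


def dedup_exact(documents):
    """Return documents with exact duplicates removed, keeping first occurrences."""
    seen = set()
    unique = []
    for doc in documents:
        # Hash the full text so only byte-identical documents collide.
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```

Storing digests rather than full texts keeps the `seen` set small. Documents that differ only in whitespace or formatting are not caught by this approach.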

## Title Dataset URL - [LinusTechTips](https://linustechtips.com/forum/20-programming/) Does the dataset exist in a scraped format? No ## Description This is a well-known programming forum. A quick scan shows more than 10,000 topics from...

dataset-request

## Mailing Lists Dataset URL - * [git](https://git-scm.com/community) * [python](https://www.python.org/community/lists/) * [mailing list archives](https://www.mail-archive.com/) Does the dataset exist in a scraped format? No ## Description In general. (Almost) every...

dataset-request