Programming & Computing Sub-Reddits
Dataset URLs:
- awesome list of programming subreddits
- Code Pile Spreadsheet
- Another list of programming subreddits

Thanks to @ncoop57!
Does the dataset exist in a scraped format?
No, we need to format them into a dialogue format.
Description
Obtain data from the Pushshift Reddit dumps using wget/HTTP requests for the years 2009-2022 and filter for programming-related subreddits.
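The download step could be sketched as follows. The monthly dump naming scheme on `files.pushshift.io` is an assumption here (the file extension changed over the years, so treat the URL pattern as illustrative, not authoritative):

```python
import subprocess

BASE = "https://files.pushshift.io/reddit"

def dump_urls(start_year, end_year):
    """Yield candidate URLs for monthly comment and submission dumps.

    NOTE: the real dumps switched extensions over time (.bz2, .xz, .zst);
    .zst is assumed here purely for illustration.
    """
    for year in range(start_year, end_year + 1):
        for month in range(1, 13):
            yield f"{BASE}/comments/RC_{year}-{month:02d}.zst"
            yield f"{BASE}/submissions/RS_{year}-{month:02d}.zst"

def download_all(start_year, end_year, dest="dumps"):
    # wget -c resumes partial downloads; -P sets the target directory
    for url in dump_urls(start_year, end_year):
        subprocess.run(["wget", "-c", "-P", dest, url], check=False)

urls = list(dump_urls(2009, 2009))
```

A year's worth of dumps is 24 files (12 months of comments plus 12 of submissions), so the script is easy to sanity-check before pointing it at the full range.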
Procedure
- [x] Obtain data from Pushshift Reddit for the years 2006-2022. We probably need to write a script that issues wget requests for the data dumps.
- [x] Store data dump on a GCP Bucket.
- [ ] Create three tables (`authors`, `submissions`, and `comments`) in BigQuery from the GCP Buckets.
- [ ] Merge posts with reply chains and author metadata (specifically bio)
- [ ] (Optionally) Filter for long dialogue chains following OPT
- [x] Process Reddit threads (posts and replies) into a conversational form using this script
- [x] Filter for programming subreddits in the list of subreddits. Then we process non-programming subreddits and programming subreddits separately.
- [x] Process into the output format: `{"text": string, "meta": obj}`
- [ ] Run MinHash deduplication
- [x] Run the `lm_format` script
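The thread-to-dialogue step above could be sketched like this. It assumes Pushshift's `author`/`body` field names and a simplified context-numbering convention, so treat both as assumptions rather than the actual pipeline code:

```python
def chain_to_record(chain, subreddit, thread_id):
    """Turn a root-to-leaf comment chain into one training record.

    `chain` is a list of dicts with (at least) 'author' and 'body' keys;
    the field names mirror Pushshift's comment schema, but are
    assumptions here. The last element is treated as the response.
    """
    last = chain[-1]
    parts = []
    # everything before the last turn becomes context, oldest first;
    # context/0 is assumed to label the oldest turn
    for i, c in enumerate(chain[:-1]):
        parts.append(f"[context/{i}]:\n{c['body']}")
    parts.append(f"[Response]:\n{last['body']}")
    return {
        "text": "\n".join(parts),
        "meta": {
            "context_author": chain[-2]["author"] if len(chain) > 1 else None,
            "response_author": last["author"],
            "subreddit": subreddit,
            "thread_id": thread_id,
        },
    }

record = chain_to_record(
    [{"author": "a", "body": "What does L2L mean?"},
     {"author": "b", "body": "Learning to learn."}],
    subreddit="MachineLearning", thread_id="5h6yvl")
```

Keeping the speaker metadata in `meta` rather than inlining it into `text` makes it easy to filter by author or subreddit later without re-parsing.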
Final Data Format (inside `text`)
```
[Context]:
"Learning to learn", using deep learning to design the architecture of another deep network: https://arxiv.org/abs/1606.04474
[Response]:
using deep learning with SGD to design the learning algorithms of another deep network *
Extra Contexts:
[context/2]:
Could someone there post a summary of the insightful moments.
[context/1]:
Basically L2L is the new deep learning.
[context/0]:
What's "L2L" mean?
Other features:
[context_author]:
goodside
[response_author]:
NetOrBrain
[subreddit]:
MachineLearning
[thread_id]:
5h6yvl
```
Cheesy#0202 (Me): Working with jesse#7865 and Eleuther to obtain the Pushshift Reddit data.
To find relevant reddit communities, we can look at awesome lists: https://github.com/learn-anything/reddit#linux
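Extracting subreddit names from such an awesome list could be sketched with a regex over the README's markdown links (the inline sample list here is a made-up stand-in, not the real file):

```python
import re

# stand-in for a downloaded awesome-list README (illustrative only)
AWESOME_LIST_MD = """
- [r/linux](https://www.reddit.com/r/linux)
- [r/programming](https://www.reddit.com/r/programming)
"""

def extract_subreddits(markdown):
    """Pull unique subreddit names out of reddit.com/r/... links."""
    return sorted(set(re.findall(r"reddit\.com/r/([A-Za-z0-9_]+)", markdown)))

subs = extract_subreddits(AWESOME_LIST_MD)
```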
Currently obtaining data, pogchamp!
@taisazero please add this information to the issue description:
Tests
Include a `dummy_dataset.parquet` file to test your code against. This dummy dataset should include the columns for the data and metadata associated with the dataset (which will later be converted into the final format for language-model consumption), along with one or more example rows that let you verify your code collects the data correctly. Alongside this file, include the unit test that evaluates your code against the dummy dataset.
Give an example of the columns and data:
| col1 | col2 | .... |
|---|---|---|
| row1 | row1 | .... |
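A minimal version of such a unit test could look like this. The real test would load `dummy_dataset.parquet` (e.g. with `pandas.read_parquet`); inline rows are used here to keep the sketch dependency-free, and the column names are assumptions:

```python
# Stand-in for dummy_dataset.parquet; the real test would do
# `pandas.read_parquet("dummy_dataset.parquet")` instead.
DUMMY_ROWS = [
    {"body": "hello world", "author": "a",
     "subreddit": "programming", "thread_id": "t1"},
]

def to_lm_format(row):
    """Convert one dataset row into the {"text", "meta"} output format."""
    return {
        "text": row["body"],
        "meta": {k: row[k] for k in ("author", "subreddit", "thread_id")},
    }

def test_dummy_dataset():
    for row in DUMMY_ROWS:
        rec = to_lm_format(row)
        assert set(rec) == {"text", "meta"}
        assert rec["meta"]["subreddit"] == "programming"

test_dummy_dataset()
```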
Filtering Resources
Toxic Content
Toxic Subreddits from (Gehman et al., 2020)
Also looked into DialoGPT's excluded subreddits, but the list was empty: DialoGPT subreddit blocklist
Low-Quality Content
A comment is discarded if any of the following hold (Roller et al., 2021):
- The author is a known bot.
- It comes from a known non-English subreddit.
- The comment is marked as removed/deleted.
- It is longer than 2048 characters and does not contain spaces.
- It is longer than 128 BPE tokens.
- It is shorter than 5 characters.
- It contains a URL.
- It starts with a non-ASCII character.
- It is nested deeper than depth 7 in the thread.
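The heuristics above could be sketched as a single predicate. The bot and non-English subreddit sets are placeholders, and the BPE token count is passed in rather than computed (the real pipeline would use an actual BPE tokenizer, e.g. GPT-2's):

```python
import re

KNOWN_BOTS = {"AutoModerator"}        # placeholder: maintained separately
NON_ENGLISH_SUBS = {"de", "france"}   # placeholder: maintained separately

def is_low_quality(comment, depth, n_bpe_tokens):
    """Apply the filtering heuristics listed above (Roller et al., 2021).

    `n_bpe_tokens` must come from a real BPE tokenizer; it is passed in
    here to keep the sketch dependency-free.
    """
    body = comment["body"]
    return (
        comment["author"] in KNOWN_BOTS
        or comment["subreddit"].lower() in NON_ENGLISH_SUBS
        or body in ("[removed]", "[deleted]")
        or (len(body) > 2048 and " " not in body)
        or n_bpe_tokens > 128
        or len(body) < 5
        or re.search(r"https?://", body) is not None
        or (len(body) > 0 and not body[0].isascii())
        or depth > 7
    )

ok = not is_low_quality(
    {"body": "This compiles fine for me.", "author": "u1",
     "subreddit": "programming"},
    depth=2, n_bpe_tokens=6)
```

Keeping each rule as one boolean clause makes it easy to log which rule fired, which helps when tuning the thresholds against a sample of threads.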