Programming & Computing Sub-Reddits
Dataset URLs:
- awesome list of programming subreddits
- Code Pile Spreadsheet
- Another list of programming subreddits

Thanks to @ncoop57!
Does the dataset exist in a scraped format?
No, we need to format them into a dialogue format.
Description
Obtain data from the Pushshift Reddit dumps using wget/HTTP requests for the years 2009-2022 and filter for programming-related subreddits.
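The download step could be sketched as follows. The monthly dump naming scheme on `files.pushshift.io` is an assumption here (the file extension changed over the years, so treat the URL pattern as illustrative, not authoritative):

```python
import subprocess

BASE = "https://files.pushshift.io/reddit"

def dump_urls(start_year, end_year):
    """Yield candidate URLs for monthly comment and submission dumps.

    NOTE: the real dumps switched extensions over time (.bz2, .xz, .zst);
    .zst is assumed here purely for illustration.
    """
    for year in range(start_year, end_year + 1):
        for month in range(1, 13):
            yield f"{BASE}/comments/RC_{year}-{month:02d}.zst"
            yield f"{BASE}/submissions/RS_{year}-{month:02d}.zst"

def download_all(start_year, end_year, dest="dumps"):
    # wget -c resumes partial downloads; -P sets the target directory
    for url in dump_urls(start_year, end_year):
        subprocess.run(["wget", "-c", "-P", dest, url], check=False)

urls = list(dump_urls(2009, 2009))
```

A year's worth of dumps is 24 files (12 months of comments plus 12 of submissions), so the script is easy to sanity-check before pointing it at the full range.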
Procedure
- [x] Obtain data from Pushshift Reddit for the years 2006-2022. We probably need to write a script that issues wget requests for the data dumps.
- [x] Store data dump on a GCP Bucket.
- [ ] Create three tables (`authors`, `submissions`, and `comments`) in BigQuery from the GCP Buckets.
- [ ] Merge posts with reply chains and author metadata (specifically bio)
- [ ] (Optionally) Filter for long dialogue chains following OPT
- [x] Process Reddit threads (posts and replies) into a conversational form using this script
- [x] Filter for programming subreddits in the list of subreddits. Then we process non-programming subreddits and programming subreddits separately.
- [x] Process into the output format: `{"text": string, "meta": obj}`
- [ ] Run MinHash deduplication
- [x] Run the `lm_format` script
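The thread-to-dialogue step above could be sketched like this. It assumes Pushshift's `author`/`body` field names and a simplified context-numbering convention, so treat both as assumptions rather than the actual pipeline code:

```python
def chain_to_record(chain, subreddit, thread_id):
    """Turn a root-to-leaf comment chain into one training record.

    `chain` is a list of dicts with (at least) 'author' and 'body' keys;
    the field names mirror Pushshift's comment schema, but are
    assumptions here. The last element is treated as the response.
    """
    last = chain[-1]
    parts = []
    # everything before the last turn becomes context, oldest first;
    # context/0 is assumed to label the oldest turn
    for i, c in enumerate(chain[:-1]):
        parts.append(f"[context/{i}]:\n{c['body']}")
    parts.append(f"[Response]:\n{last['body']}")
    return {
        "text": "\n".join(parts),
        "meta": {
            "context_author": chain[-2]["author"] if len(chain) > 1 else None,
            "response_author": last["author"],
            "subreddit": subreddit,
            "thread_id": thread_id,
        },
    }

record = chain_to_record(
    [{"author": "a", "body": "What does L2L mean?"},
     {"author": "b", "body": "Learning to learn."}],
    subreddit="MachineLearning", thread_id="5h6yvl")
```

Keeping the speaker metadata in `meta` rather than inlining it into `text` makes it easy to filter by author or subreddit later without re-parsing.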
Final Data Format (inside `text`)
```
[Context]:
"Learning to learn", using deep learning to design the architecture of another deep network: https://arxiv.org/abs/1606.04474
[Response]:
using deep learning with SGD to design the learning algorithms of another deep network *
Extra Contexts:
[context/2]:
Could someone there post a summary of the insightful moments.
[context/1]:
Basically L2L is the new deep learning.
[context/0]:
What's "L2L" mean?
Other features:
[context_author]:
goodside
[response_author]:
NetOrBrain
[subreddit]:
MachineLearning
[thread_id]:
5h6yvl
```
Cheesy#0202 (Me): Working with jesse#7865 and Eleuther to obtain the Pushshift Reddit data.
To find relevant reddit communities, we can look at awesome lists: https://github.com/learn-anything/reddit#linux
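Extracting subreddit names from such an awesome list could be sketched with a regex over the README's markdown links (the inline sample list here is a made-up stand-in, not the real file):

```python
import re

# stand-in for a downloaded awesome-list README (illustrative only)
AWESOME_LIST_MD = """
- [r/linux](https://www.reddit.com/r/linux)
- [r/programming](https://www.reddit.com/r/programming)
"""

def extract_subreddits(markdown):
    """Pull unique subreddit names out of reddit.com/r/... links."""
    return sorted(set(re.findall(r"reddit\.com/r/([A-Za-z0-9_]+)", markdown)))

subs = extract_subreddits(AWESOME_LIST_MD)
```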
Currently obtaining data, pogchamp!
@taisazero please add this information to the issue description:
Tests
Include a `dummy_dataset.parquet` file to test your code against. This dummy dataset should include the columns for the data and metadata associated with the dataset (which will later be converted into the final format for language-model consumption), along with one or more example rows that let you verify your code collects the data correctly. Alongside this file, include the unit test that evaluates your code against the dummy dataset.
Give an example of the columns and data:
| col1 | col2 | .... |
|---|---|---|
| row1 | row1 | .... |
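A minimal version of such a unit test could look like this. The real test would load `dummy_dataset.parquet` (e.g. with `pandas.read_parquet`); inline rows are used here to keep the sketch dependency-free, and the column names are assumptions:

```python
# Stand-in for dummy_dataset.parquet; the real test would do
# `pandas.read_parquet("dummy_dataset.parquet")` instead.
DUMMY_ROWS = [
    {"body": "hello world", "author": "a",
     "subreddit": "programming", "thread_id": "t1"},
]

def to_lm_format(row):
    """Convert one dataset row into the {"text", "meta"} output format."""
    return {
        "text": row["body"],
        "meta": {k: row[k] for k in ("author", "subreddit", "thread_id")},
    }

def test_dummy_dataset():
    for row in DUMMY_ROWS:
        rec = to_lm_format(row)
        assert set(rec) == {"text", "meta"}
        assert rec["meta"]["subreddit"] == "programming"

test_dummy_dataset()
```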
Filtering Resources
Toxic Content
Toxic Subreddits from (Gehman et al., 2020)
Also looked into DialoGPT's excluded subreddits, but the list was empty: DialoGPT subreddit blocklist
Low-Quality Content
A comment is discarded if any of the following hold (Roller et al., 2021):
- The author is a known bot.
- It comes from a known non-English subreddit.
- The comment is marked as removed/deleted.
- It is longer than 2048 characters and does not contain spaces.
- It is longer than 128 BPE tokens.
- It is shorter than 5 characters.
- It contains a URL.
- It starts with a non-ASCII character.
- It is nested deeper than depth 7 in the thread.
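The heuristics above could be sketched as a single predicate. The bot and non-English subreddit sets are placeholders, and the BPE token count is passed in rather than computed (the real pipeline would use an actual BPE tokenizer, e.g. GPT-2's):

```python
import re

KNOWN_BOTS = {"AutoModerator"}        # placeholder: maintained separately
NON_ENGLISH_SUBS = {"de", "france"}   # placeholder: maintained separately

def is_low_quality(comment, depth, n_bpe_tokens):
    """Apply the filtering heuristics listed above (Roller et al., 2021).

    `n_bpe_tokens` must come from a real BPE tokenizer; it is passed in
    here to keep the sketch dependency-free.
    """
    body = comment["body"]
    return (
        comment["author"] in KNOWN_BOTS
        or comment["subreddit"].lower() in NON_ENGLISH_SUBS
        or body in ("[removed]", "[deleted]")
        or (len(body) > 2048 and " " not in body)
        or n_bpe_tokens > 128
        or len(body) < 5
        or re.search(r"https?://", body) is not None
        or (len(body) > 0 and not body[0].isascii())
        or depth > 7
    )

ok = not is_low_quality(
    {"body": "This compiles fine for me.", "author": "u1",
     "subreddit": "programming"},
    depth=2, n_bpe_tokens=6)
```

Keeping each rule as one boolean clause makes it easy to log which rule fired, which helps when tuning the thresholds against a sample of threads.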