Reddit

Open taisazero opened this issue 3 years ago • 5 comments

Programming & Computing Sub-Reddits

Dataset URL - awesome list of programming subreddits, Code Pile Spreadsheet, another list of programming subreddits. Thanks to @ncoop57!

Does the dataset exist in a scraped format?

No, we need to format them into a dialogue format.

Description

Obtain Pushshift Reddit data from 2009-2022 using wget/HTTP requests, then filter for programming-related subreddits.

Procedure

  • [x] Obtain data from Pushshift Reddit from the years 2006-2022. We probably need to write a script that issues wget requests for the data dumps.
  • [x] Store data dump on a GCP Bucket.
  • [ ] Create 3 tables (authors, submissions, and comments) in BigQuery from the GCP Buckets.
  • [ ] Merge posts with reply chains and author metadata (specifically bio)
  • [ ] (Optionally) Filter for long dialogue chains following OPT
  • [x] Process Reddit threads (posts and replies) into a conversational form using this script
  • [x] Filter for programming subreddits in the list of subreddits. Then we process non-programming subreddits and programming subreddits separately.
  • [x] Process into output format {"text": string, "meta": obj}
  • [ ] Run dedup Min-Hash
  • [x] Run lm_format script
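
The Min-Hash dedup step in the list above could look roughly like the sketch below. It is an illustration only, not the project's dedup script: the shingle size, number of hash functions, similarity threshold, and all function names are assumptions, and it compares records pairwise rather than using locality-sensitive hashing.

```python
import hashlib
import re

NUM_HASHES = 64      # assumed number of hash functions per signature
SHINGLE_SIZE = 3     # assumed word n-gram size
THRESHOLD = 0.8      # assumed similarity cutoff for "near duplicate"


def shingles(text, k=SHINGLE_SIZE):
    """Return the set of lowercase word k-grams in text."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def _hash(shingle, seed):
    """Deterministic 64-bit hash of a shingle, varied by seed."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text, num_hashes=NUM_HASHES):
    """For each seed, keep the minimum hash over all shingles."""
    sh = shingles(text)
    return [min(_hash(s, seed) for s in sh) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(records, threshold=THRESHOLD):
    """Keep the first record of each near-duplicate group (O(n^2) comparison)."""
    kept, signatures = [], []
    for record in records:
        sig = minhash_signature(record["text"])
        if any(estimated_jaccard(sig, seen) >= threshold for seen in signatures):
            continue
        kept.append(record)
        signatures.append(sig)
    return kept
```

At the scale of the full Reddit dumps, the pairwise loop in `deduplicate` would likely need to be replaced by LSH banding so that only candidate pairs are compared.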

Final Data Format inside text

[Context]:
	"Learning to learn", using deep learning to design the architecture of another deep network: https://arxiv.org/abs/1606.04474
[Response]:
	using deep learning with SGD to design the learning algorithms of another deep network   *

Extra Contexts:
	[context/2]:
		Could someone there post a summary of the insightful moments.
	[context/1]:
		Basically L2L is the new deep learning.
	[context/0]:
		What's "L2L" mean?

Other features:
	[context_author]:
		goodside
	[response_author]:
		NetOrBrain
	[subreddit]:
		MachineLearning
	[thread_id]:
		5h6yvl
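
As a rough sketch of how a thread could be rendered into the layout above and wrapped in the {"text": string, "meta": obj} output format, something like the following might work. The function names are hypothetical, and the numbering of extra contexts (listed from the highest index down to 0) is inferred from the example, not confirmed by the processing script.

```python
import json


def indent(text, level):
    """Prefix every line of text with `level` tab characters."""
    return "\n".join("\t" * level + line for line in text.splitlines())


def render_text(context, response, extra_contexts, features):
    """Build the text field; extra_contexts is given in display order."""
    lines = [
        "[Context]:", indent(context, 1),
        "[Response]:", indent(response, 1),
        "",
        "Extra Contexts:",
    ]
    n = len(extra_contexts)
    for i, ctx in enumerate(extra_contexts):
        # Assumption: labels count down, so the last item is context/0.
        lines.append(f"\t[context/{n - 1 - i}]:")
        lines.append(indent(ctx, 2))
    lines += ["", "Other features:"]
    for key, value in features.items():
        lines.append(f"\t[{key}]:")
        lines.append(indent(str(value), 2))
    return "\n".join(lines)


def to_record(text, meta):
    """Wrap rendered text and metadata in the output format."""
    return {"text": text, "meta": meta}


features = {
    "context_author": "goodside",
    "response_author": "NetOrBrain",
    "subreddit": "MachineLearning",
    "thread_id": "5h6yvl",
}
text = render_text(
    "Basically L2L is the new deep learning.",
    "What's \"L2L\" mean?",
    ["Could someone there post a summary of the insightful moments."],
    features,
)
line = json.dumps(to_record(text, features))  # one JSON object per line
```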

taisazero avatar Sep 15 '22 03:09 taisazero

Cheesy#0202 (Me) Working with jesse#7865 and Eleuther to obtain Pushshift Reddit data.

taisazero avatar Sep 16 '22 00:09 taisazero

To find relevant reddit communities, we can look at awesome lists: https://github.com/learn-anything/reddit#linux

ncoop57 avatar Sep 18 '22 15:09 ncoop57

Currently obtaining dayta pogchamp!

taisazero avatar Sep 19 '22 22:09 taisazero

@taisazero please add this information to the issue description:

Tests

Include a dummy_dataset.parquet file to test your code against. This dummy_dataset should include the columns for the data and metadata associated with the dataset, which will then be converted into the final format for language model consumption, along with an example row or rows that you can verify your code correctly collects. In addition to this file, include the unit test that evaluates your code against this dummy_dataset.

Give an example of the columns and data:

col1 col2 ....
row1 row1 ....
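
For illustration only, a unit test of this kind might be sketched as below. The column names (context, response, subreddit, thread_id) and the `row_to_record` helper are hypothetical, the rows are held in memory instead of being read from dummy_dataset.parquet, and the sample values are taken from the example in the issue description.

```python
import unittest

# Hypothetical columns for the dummy dataset; one example row.
DUMMY_ROWS = [
    {
        "context": "Basically L2L is the new deep learning.",
        "response": "What's \"L2L\" mean?",
        "subreddit": "MachineLearning",
        "thread_id": "5h6yvl",
    },
]


def row_to_record(row):
    """Convert one dataset row into the {"text", "meta"} output format."""
    text = f"[Context]:\n\t{row['context']}\n[Response]:\n\t{row['response']}"
    meta = {key: row[key] for key in ("subreddit", "thread_id")}
    return {"text": text, "meta": meta}


class RowToRecordTest(unittest.TestCase):
    def test_record_has_text_and_meta(self):
        record = row_to_record(DUMMY_ROWS[0])
        self.assertEqual(set(record), {"text", "meta"})

    def test_metadata_is_preserved(self):
        record = row_to_record(DUMMY_ROWS[0])
        self.assertEqual(record["meta"]["subreddit"], "MachineLearning")
        self.assertEqual(record["meta"]["thread_id"], "5h6yvl")
```

The tests can be run with `python -m unittest` from the directory containing the file.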

ncoop57 avatar Sep 26 '22 00:09 ncoop57

Filtering Resources

Toxic Content

Toxic Subreddits from (Gehman et al., 2020)

Also looked into DialoGPT's excluded subreddits, but the list was empty: DialoGPT Subreddit blocklist

Low-Quality Content

  1. The author is a known bot.
  2. It comes from a known non-English subreddit.
  3. The comment is marked as removed/deleted.
  4. It is longer than 2048 characters and does not contain spaces.
  5. It is longer than 128 BPE tokens.
  6. It is shorter than 5 characters.
  7. It contains a URL.
  8. It starts with a non-ASCII character.
  9. It is further than depth 7 in the thread.

From (Roller et al., 2021)
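
The nine heuristics above could be sketched as a single check like the one below. This is only an illustration: the comment field names (author, subreddit, body, depth) are assumptions, the bot and subreddit lists must be supplied by the caller, and the default token counter splits on whitespace as a stand-in because a real BPE tokenizer is not included here.

```python
import re

REMOVED_MARKERS = {"[removed]", "[deleted]"}
URL_PATTERN = re.compile(r"https?://|www\.", re.IGNORECASE)


def whitespace_token_count(text):
    """Stand-in only: this is NOT a BPE tokenizer."""
    return len(text.split())


def low_quality_rule(comment, known_bots=frozenset(),
                     non_english_subreddits=frozenset(),
                     count_tokens=whitespace_token_count):
    """Return the number of the first rule the comment violates, or None."""
    body = comment["body"]
    if comment["author"] in known_bots:
        return 1
    if comment["subreddit"] in non_english_subreddits:
        return 2
    if body.strip() in REMOVED_MARKERS:
        return 3
    if len(body) > 2048 and " " not in body:
        return 4
    if count_tokens(body) > 128:
        return 5
    if len(body) < 5:
        return 6
    if URL_PATTERN.search(body):
        return 7
    if body and not body[0].isascii():
        return 8
    if comment["depth"] > 7:
        return 9
    return None


comment = {"author": "goodside", "subreddit": "MachineLearning",
           "body": "Basically L2L is the new deep learning.", "depth": 1}
keep = low_quality_rule(comment) is None
```

Returning the rule number instead of a plain boolean makes it easy to count how many comments each heuristic removes. Passing a real tokenizer's counting function as `count_tokens` would bring rule 5 in line with the BPE-based definition.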

taisazero avatar Oct 30 '22 21:10 taisazero