openwebtext
Open clone of OpenAI's unreleased WebText dataset scraper. This version uses pushshift.io files instead of the API for speed.
Bumps [pillow](https://github.com/python-pillow/Pillow) from 5.4.1 to 9.0.1. Release notes Sourced from pillow's releases. 9.0.1 https://pillow.readthedocs.io/en/stable/releasenotes/9.0.1.html Changes In show_file, use os.remove to remove temporary images. CVE-2022-24303 #6010 [@radarhere, @hugovk] Restrict builtins within...
As seen in the picture, the pre-filtered URLs can no longer be accessed. Can someone take a look at it? Thanks!
Bumps [urllib3](https://github.com/urllib3/urllib3) from 1.25.6 to 1.26.5. Release notes Sourced from urllib3's releases. 1.26.5 :warning: IMPORTANT: urllib3 v2.0 will drop support for Python 2: Read more in the v2.0 Roadmap Fixed...
Bumps [lxml](https://github.com/lxml/lxml) from 4.3.1 to 4.6.3. Changelog Sourced from lxml's changelog. 4.6.3 (2021-03-21) Bugs fixed A vulnerability (CVE-2021-28957) was discovered in the HTML Cleaner by Kevin Chung, which allowed JavaScript...
Bumps [pygments](https://github.com/pygments/pygments) from 2.4.2 to 2.7.4. Release notes Sourced from pygments's releases. 2.7.4 Updated lexers: Apache configurations: Improve handling of malformed tags (#1656) CSS: Add support for variables (#1633, #1666)...
Bumps [pyyaml](https://github.com/yaml/pyyaml) from 5.1.2 to 5.4. Changelog Sourced from pyyaml's changelog. 5.4 (2021-01-19) yaml/pyyaml#407 -- Build modernization, remove distutils, fix metadata, build wheels, CI to GHA yaml/pyyaml#472 -- Fix for...
Hello, per the readme, downloads are processed a month at a time. Is there an estimate of the average size of data scraped in these chunks? As well as an...
Hi, I downloaded the pre-filtered URL list from [here](https://mega.nz/#F!EZZD0YwJ!9_PlEQzdMVLaNdKv_ICNVQ), and then tried to extract the text with `download.py` as per the readme ```bash python download.py url_dumps_deduped/RS_2018-07.xz.deduped.txt \ --n_procs 40 \...
``` (base) user@desktop:/data/openwebtext$ python fetch_urls.py Downloaded RS_2012-12.bz2 Downloaded RS_v2_2008-06.xz Downloaded RS_2012-05.bz2 Downloaded RS_v2_2009-09.xz Downloaded RS_v2_2007-11.xz Downloaded RS_v2_2010-05.xz Downloaded RS_2012-01.bz2 Downloaded RS_2020-04.zst Downloaded RS_v2_2006-06.xz Downloaded RS_v2_2010-02.xz Downloaded RS_v2_2006-02.xz Downloaded RS_2012-09.bz2 Downloaded...
This is more of a question than an issue. I noticed that in my scrape there is a large number of spurious results like: `Sorry, we just need to...
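One possible way to drop such interstitial pages after scraping is a simple prefix check. The sketch below is not part of the repository: `SPURIOUS_PREFIXES`, `is_spurious`, `filter_pages`, and the `max_chars` threshold are hypothetical names and values chosen for illustration, and the only marker phrase used is the visible (truncated) prefix quoted in the report above.

```python
# Hypothetical marker list: only the visible prefix from the report above is
# used, since the full phrase is truncated there.
SPURIOUS_PREFIXES = ("Sorry, we just need to",)


def is_spurious(text, prefixes=SPURIOUS_PREFIXES, max_chars=500):
    """Return True if scraped text looks like an interstitial page.

    A page is flagged only if it is short (at most max_chars characters,
    an assumed threshold) AND starts with one of the known prefixes.
    """
    stripped = text.strip()
    return len(stripped) <= max_chars and stripped.startswith(prefixes)


def filter_pages(pages):
    """Keep only the pages that are not flagged as spurious."""
    return [page for page in pages if not is_spurious(page)]
```

Requiring both a short length and a known prefix keeps the check conservative: a long article that happens to open with the same words is kept.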