openwebtext

Open clone of OpenAI's unreleased WebText dataset scraper. This version uses pushshift.io files instead of the API for speed.
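
As a rough illustration of what working from dump files rather than the API can look like, here is a minimal sketch that streams URLs out of one compressed dump. It assumes each dump is newline-delimited JSON with `url` and `score` fields. The names `iter_urls`, `OPENERS`, and `min_score` are hypothetical and not taken from this repository. Only the `.bz2` and `.xz` files are handled, since the standard library can open both; `.zst` dumps need a third-party decompressor and are left out.

```python
import bz2
import json
import lzma
from pathlib import Path

# Map file suffix to a standard-library opener (assumption: dumps are
# compressed text files containing one JSON object per line).
OPENERS = {".bz2": bz2.open, ".xz": lzma.open}


def iter_urls(dump_path, min_score=0):
    """Yield the "url" field of each record whose score is >= min_score."""
    path = Path(dump_path)
    opener = OPENERS.get(path.suffix)
    if opener is None:
        raise ValueError(f"unsupported compression: {path.suffix}")
    with opener(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            url = record.get("url")
            if url and record.get("score", 0) >= min_score:
                yield url
```

Because the function is a generator, a whole month of submissions can be filtered without loading the file into memory, for example `for url in iter_urls("RS_2012-12.bz2", min_score=3): ...`.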

Results 20 openwebtext issues

Bumps [pillow](https://github.com/python-pillow/Pillow) from 5.4.1 to 9.0.1. Release notes Sourced from pillow's releases. 9.0.1 https://pillow.readthedocs.io/en/stable/releasenotes/9.0.1.html Changes In show_file, use os.remove to remove temporary images. CVE-2022-24303 #6010 [@radarhere, @hugovk] Restrict builtins within...

dependencies

As seen in the picture, the pre-filtered URLs can no longer be accessed. Can someone take a look at it? ![图片20220217093739](https://user-images.githubusercontent.com/10181676/154387764-3520aa1e-4136-45c7-b697-9dd21d3a82e9.png) Thanks!

Bumps [urllib3](https://github.com/urllib3/urllib3) from 1.25.6 to 1.26.5. Release notes Sourced from urllib3's releases. 1.26.5 :warning: IMPORTANT: urllib3 v2.0 will drop support for Python 2: Read more in the v2.0 Roadmap Fixed...

dependencies

Bumps [lxml](https://github.com/lxml/lxml) from 4.3.1 to 4.6.3. Changelog Sourced from lxml's changelog. 4.6.3 (2021-03-21) Bugs fixed A vulnerability (CVE-2021-28957) was discovered in the HTML Cleaner by Kevin Chung, which allowed JavaScript...

dependencies

Bumps [pygments](https://github.com/pygments/pygments) from 2.4.2 to 2.7.4. Release notes Sourced from pygments's releases. 2.7.4 Updated lexers: Apache configurations: Improve handling of malformed tags (#1656) CSS: Add support for variables (#1633, #1666)...

dependencies

Bumps [pyyaml](https://github.com/yaml/pyyaml) from 5.1.2 to 5.4. Changelog Sourced from pyyaml's changelog. 5.4 (2021-01-19) yaml/pyyaml#407 -- Build modernization, remove distutils, fix metadata, build wheels, CI to GHA yaml/pyyaml#472 -- Fix for...

dependencies

Hello, per the readme, downloads are processed a month at a time. Is there an estimate of the average size of data scraped in these chunks? As well as an...

Hi, I downloaded the pre-filtered URL list from [here](https://mega.nz/#F!EZZD0YwJ!9_PlEQzdMVLaNdKv_ICNVQ), and then tried to extract the text with `download.py` as per the readme ```bash python download.py url_dumps_deduped/RS_2018-07.xz.deduped.txt \ --n_procs 40 \...

``` (base) user@desktop:/data/openwebtext$ python fetch_urls.py Downloaded RS_2012-12.bz2 Downloaded RS_v2_2008-06.xz Downloaded RS_2012-05.bz2 Downloaded RS_v2_2009-09.xz Downloaded RS_v2_2007-11.xz Downloaded RS_v2_2010-05.xz Downloaded RS_2012-01.bz2 Downloaded RS_2020-04.zst Downloaded RS_v2_2006-06.xz Downloaded RS_v2_2010-02.xz Downloaded RS_v2_2006-02.xz Downloaded RS_2012-09.bz2 Downloaded...

This is more of a question than an issue: I noticed that in my scrape there are a large number of spurious results like: `Sorry, we just need to...