openwebtext issues

Bump pillow from 5.4.1 to 9.0.1

Bumps [pillow](https://github.com/python-pillow/Pillow) from 5.4.1 to 9.0.1. Release notes Sourced from pillow's releases. 9.0.1 https://pillow.readthedocs.io/en/stable/releasenotes/9.0.1.html Changes In show_file, use os.remove to remove temporary images. CVE-2022-24303 #6010 [@radarhere, @hugovk] Restrict builtins within...

dependabot[bot]

dependencies

pre-filtered URLs can no longer be accessed

As seen in the picture, the pre-filtered URLs can no longer be accessed. Can someone take a look at it? ![Image 20220217093739](https://user-images.githubusercontent.com/10181676/154387764-3520aa1e-4136-45c7-b697-9dd21d3a82e9.png) Thanks!

sunhmy

Bump urllib3 from 1.25.6 to 1.26.5

Bumps [urllib3](https://github.com/urllib3/urllib3) from 1.25.6 to 1.26.5. Release notes Sourced from urllib3's releases. 1.26.5 :warning: IMPORTANT: urllib3 v2.0 will drop support for Python 2: Read more in the v2.0 Roadmap Fixed...

dependabot[bot]

dependencies

Bump lxml from 4.3.1 to 4.6.3

Bumps [lxml](https://github.com/lxml/lxml) from 4.3.1 to 4.6.3. Changelog Sourced from lxml's changelog. 4.6.3 (2021-03-21) Bugs fixed A vulnerability (CVE-2021-28957) was discovered in the HTML Cleaner by Kevin Chung, which allowed JavaScript...

dependabot[bot]

dependencies

Bump pygments from 2.4.2 to 2.7.4

Bumps [pygments](https://github.com/pygments/pygments) from 2.4.2 to 2.7.4. Release notes Sourced from pygments's releases. 2.7.4 Updated lexers: Apache configurations: Improve handling of malformed tags (#1656) CSS: Add support for variables (#1633, #1666)...

dependabot[bot]

dependencies

Bump pyyaml from 5.1.2 to 5.4

Bumps [pyyaml](https://github.com/yaml/pyyaml) from 5.1.2 to 5.4. Changelog Sourced from pyyaml's changelog. 5.4 (2021-01-19) yaml/pyyaml#407 -- Build modernization, remove distutils, fix metadata, build wheels, CI to GHA yaml/pyyaml#472 -- Fix for...

dependabot[bot]

dependencies

Estimated disk space usage of scraped data?

1

Hello, per the readme, downloads are processed a month at a time. Is there an estimate of the average size of data scraped in these chunks? As well as an...

dnola

Error with get_state in download.py

1

Hi, I downloaded the pre-filtered URL list from [here](https://mega.nz/#F!EZZD0YwJ!9_PlEQzdMVLaNdKv_ICNVQ), and then tried to extract the text with `download.py` as per the readme:

```bash
python download.py url_dumps_deduped/RS_2018-07.xz.deduped.txt \
    --n_procs 40 \
    ...
```

JohnGiorgi

pycurl error: transfer closed with X bytes remaining to read

```
(base) user@desktop:/data/openwebtext$ python fetch_urls.py
Downloaded RS_2012-12.bz2
Downloaded RS_v2_2008-06.xz
Downloaded RS_2012-05.bz2
Downloaded RS_v2_2009-09.xz
Downloaded RS_v2_2007-11.xz
Downloaded RS_v2_2010-05.xz
Downloaded RS_2012-01.bz2
Downloaded RS_2020-04.zst
Downloaded RS_v2_2006-06.xz
Downloaded RS_v2_2010-02.xz
Downloaded RS_v2_2006-02.xz
Downloaded RS_2012-09.bz2
Downloaded...
```

drfinkus

Filtering extracted results

2

This is more of a question than an issue: I noticed that my scrape contains a large number of spurious results like: `Sorry, we just need to...

Jack000
