corpuscrawler issues

Add Wikipedia crawler ? (300+ languages)

5

A [quick search](https://github.com/google/corpuscrawler/search?q=wikipedia) shows that CorpusCrawler does not crawl or use Wikipedia. I don't know Python, but it seems feasible, either from scratch on the Wikipedia API (1) or using...

hugolpz

Undefined names

% __flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics__
```
./corpuscrawler/Lib/corpuscrawler/crawl_mi.py:62:39: F821 undefined name 'sitemap'
        if pubdate is None: pubdate = sitemap[url]
                                      ^
./corpuscrawler/Lib/corpuscrawler/crawl_kab.py:53:48: F821 undefined name 'url'
        assert doc.status == 200,...
```

cclauss
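
The F821 report above can be illustrated with a minimal sketch. It is not the project's code: `resolve_pubdate_broken` and `resolve_pubdate_fixed` are hypothetical names, and passing the mapping as a parameter is only one possible fix, assumed here for illustration. The point is that an undefined name fails only when the branch that uses it actually runs, which is why a static check finds it before a crawl does.

```python
def resolve_pubdate_broken(pubdate, url):
    # 'sitemap' is not defined anywhere in scope, so this line raises
    # NameError, but only when pubdate is None and the branch executes.
    if pubdate is None:
        pubdate = sitemap[url]  # noqa: F821 (intentional, for illustration)
    return pubdate


def resolve_pubdate_fixed(pubdate, url, sitemap):
    # Hypothetical fix: the caller passes the url -> date mapping explicitly.
    if pubdate is None:
        pubdate = sitemap[url]
    return pubdate


if __name__ == "__main__":
    print(resolve_pubdate_broken("2019-01-01", "http://example.org/a"))
    try:
        resolve_pubdate_broken(None, "http://example.org/a")
    except NameError as err:
        print("NameError:", err)
    print(resolve_pubdate_fixed(None, "http://example.org/a",
                                {"http://example.org/a": "2020-05-05"}))
```

The broken variant works as long as every page carries its own publication date, so the defect can stay hidden until a page without one is crawled.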

No module named 'corpuscrawler' error

2

The script doesn't run with Python 3. It shows this error: ![1234](https://user-images.githubusercontent.com/65889104/111897581-ba20fd00-8a46-11eb-8f9f-953e46ea9ac3.PNG) To solve this, I tried changing this: ![12345](https://user-images.githubusercontent.com/65889104/111897597-cd33cd00-8a46-11eb-8f78-c5be60f0676b.PNG) to: `checker.parse(robots_txt)`, as it is already decoded in Python 3...

Aayush-hub
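
The bytes-versus-text problem described in this issue can be sketched as follows. This is an assumption-laden illustration, not the project's actual code: `build_robots_checker` is a hypothetical helper, and UTF-8 is an assumed encoding. It relies only on the standard library's `urllib.robotparser.RobotFileParser`, whose `parse` method expects an iterable of text lines, so the sketch decodes bytes if needed and then splits the content into lines.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt):
    """Return a RobotFileParser for robots.txt content given as bytes or str."""
    # Cached or freshly fetched content may arrive as bytes; decode it first.
    # UTF-8 with replacement is an assumption made for this sketch.
    if isinstance(robots_txt, bytes):
        robots_txt = robots_txt.decode("utf-8", errors="replace")
    checker = RobotFileParser()
    # parse() takes an iterable of lines, not one long string.
    checker.parse(robots_txt.splitlines())
    return checker


if __name__ == "__main__":
    example = b"User-agent: *\nDisallow: /private/\n"
    checker = build_robots_checker(example)
    print(checker.can_fetch("*", "http://example.org/private/page"))
    print(checker.can_fetch("*", "http://example.org/public/page"))
```

Accepting both types keeps the same call site usable whether the caller hands over raw response bytes or already-decoded text.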

Improve readme documentation on how to provide a new crawler

5

[This /CONTRIBUTING.md](https://github.com/google/corpuscrawler/blob/master/CONTRIBUTING.md) is a License Agreement / Code of Conduct to sign. As far as I can see, this very valuable project has no actual tutorial. I don't have Python...

hugolpz

Use corpora from Universal Dependencies

The [Universal Dependencies project](https://github.com/UniversalDependencies) has corpora in a set of languages; consider incorporating them.

brawer

Use available corpora for opensubtitles (63 languages)

3

### Research
* J. Tiedemann, 2016, Finding Alternative Translations in a Large Corpus of Movie Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)...

hugolpz

Shorten project structure

3

Related to #80. Suggestion: move the core code up so it is more visible, and keep the crawlers in their own folder.
- [ ] Reorganize project structure from...

hugolpz

Define crawlers' output format

Related to #80. This is the core documentation that would help open-source contributors.

hugolpz

harfbuzz-testing-wikipedia

1

Hi Sascha, Nice work! Here's the output of what roozbeh did for HarfBuzz testing by extracting Wikipedia: https://github.com/behdad/harfbuzz-testing-wikipedia Don't know if it's of much use. That one included all talk...

behdad

Does not run in python3.7 or python 2.7

1

```
$ python2 --version
Python 2.7.16+
$ python3 --version
Python 3.7.2+
```
```
$ python3 ./corpuscrawler --language tzh --output output-tzh/
Cache-Hit: http://listen.bible.is/robots.txt
Traceback (most recent call last):
  File "./corpuscrawler", line...
```

ftyers
