corpuscrawler

Crawler for linguistic corpora

Results 18 corpuscrawler issues

A [quick search](https://github.com/google/corpuscrawler/search?q=wikipedia) shows that CorpusCrawler does not crawl or use Wikipedia. I don't know Python, but it seems feasible, either from scratch on the Wikipedia API (1) or using...

% __flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics__
```
./corpuscrawler/Lib/corpuscrawler/crawl_mi.py:62:39: F821 undefined name 'sitemap'
        if pubdate is None: pubdate = sitemap[url]
                                      ^
./corpuscrawler/Lib/corpuscrawler/crawl_kab.py:53:48: F821 undefined name 'url'
    assert doc.status == 200,...
```

The script doesn't run with Python 3. It shows this error: ![1234](https://user-images.githubusercontent.com/65889104/111897581-ba20fd00-8a46-11eb-8f9f-953e46ea9ac3.PNG) To solve this, I tried changing this: ![12345](https://user-images.githubusercontent.com/65889104/111897597-cd33cd00-8a46-11eb-8f78-c5be60f0676b.PNG) to: `checker.parse(robots_txt)`, as it is already decoded in Python 3...
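For readers hitting the same error, here is a minimal sketch of one way to handle robots.txt content on Python 3. It assumes `checker` is a `urllib.robotparser.RobotFileParser`, which the screenshots above do not confirm; the helper name `build_robots_checker` and the sample rules are hypothetical. One thing worth noting is that `RobotFileParser.parse` expects an iterable of lines, so a single string probably needs `splitlines()` first.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt):
    """Build a RobotFileParser from robots.txt content (bytes or str).

    Hypothetical helper for illustration; not taken from corpuscrawler.
    """
    # Cached or fetched content may arrive as bytes; decode it only then.
    if isinstance(robots_txt, bytes):
        robots_txt = robots_txt.decode("utf-8", errors="replace")
    checker = RobotFileParser()
    # parse() iterates over its argument line by line, so pass a list of
    # lines rather than one long string.
    checker.parse(robots_txt.splitlines())
    return checker


# Sample rules made up for this example.
checker = build_robots_checker(b"User-agent: *\nDisallow: /private/\n")
private_ok = checker.can_fetch("*", "http://example.org/private/page")
public_ok = checker.can_fetch("*", "http://example.org/public/page")
print(private_ok, public_ok)
```

Accepting both bytes and str keeps the same call site usable whether or not the content has already been decoded.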

[This /CONTRIBUTING.md](https://github.com/google/corpuscrawler/blob/master/CONTRIBUTING.md) is a License Agreement / Code of Conduct to sign. As far as I can see, this very worthwhile project has no actual tutorial. I don't have Python...

The [Universal Dependencies project](https://github.com/UniversalDependencies) has corpora in a set of languages; consider incorporating them.

### Research
* J. Tiedemann, 2016, Finding Alternative Translations in a Large Corpus of Movie Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)...

Related to #80. Suggestion: mainly, move the core code up so it is more visible. The crawlers are kept in their own folder.
- [ ] Reorganize project structure from...

Related to #80. This is the core documentation which can help open source contributions.

Hi Sascha, Nice work! Here's the output of what roozbeh did for HarfBuzz testing by extracting Wikipedia: https://github.com/behdad/harfbuzz-testing-wikipedia Don't know if it's of much use. That one included all talk...

```
$ python2 --version
Python 2.7.16+
$ python3 --version
Python 3.7.2+
```
```
$ python3 ./corpuscrawler --language tzh --output output-tzh/
Cache-Hit: http://listen.bible.is/robots.txt
Traceback (most recent call last):
  File "./corpuscrawler", line...
```