corpuscrawler
Crawler for linguistic corpora
A [quick search](https://github.com/google/corpuscrawler/search?q=wikipedia) shows that CorpusCrawler does not crawl or use Wikipedia. I don't know Python, but it seems feasible, either from scratch on the Wikipedia API (1) or using...
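To make the "from scratch on the Wikipedia API" idea concrete, here is a minimal sketch of what such a crawler step might look like. It only builds a MediaWiki API query URL and pulls plain text out of an already-decoded JSON response; it does no network access. The function names `wikipedia_extract_url` and `extract_texts` are hypothetical and are not part of CorpusCrawler, and the sample response is hand-written for illustration, assuming the default JSON layout of `action=query` with `prop=extracts`.

```python
from urllib.parse import urlencode


def wikipedia_extract_url(lang, title):
    """Build a MediaWiki API URL asking for the plain text of one page.

    Hypothetical helper, not part of CorpusCrawler.
    """
    params = {
        "action": "query",
        "prop": "extracts",   # plain-text extracts (TextExtracts extension)
        "explaintext": 1,     # return text instead of HTML
        "format": "json",
        "titles": title,
    }
    return "https://%s.wikipedia.org/w/api.php?%s" % (lang, urlencode(params))


def extract_texts(response):
    """Collect the 'extract' field of every page in a decoded API response."""
    pages = response.get("query", {}).get("pages", {})
    return [page["extract"] for page in pages.values() if "extract" in page]


# Hand-written sample response, for illustration only.
sample = {"query": {"pages": {"1": {"title": "Aotearoa", "extract": "Kia ora."}}}}
url = wikipedia_extract_url("mi", "Aotearoa")
texts = extract_texts(sample)
```

A real crawler would still need to fetch the URL, respect rate limits, and walk the list of all pages for a given language edition.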
% __flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics__
```
./corpuscrawler/Lib/corpuscrawler/crawl_mi.py:62:39: F821 undefined name 'sitemap'
        if pubdate is None: pubdate = sitemap[url]
                                      ^
./corpuscrawler/Lib/corpuscrawler/crawl_kab.py:53:48: F821 undefined name 'url'
        assert doc.status == 200,...
```
The script doesn't run with Python 3; it shows an error. To solve this, I tried changing the call to `checker.parse(robots_txt)`, as it is already decoded in Python 3...
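For context on the bytes-versus-text issue described above, here is a minimal sketch of a robots.txt check that accepts either form. It is not the project's actual code: `make_robots_checker` is a hypothetical helper, and the robots.txt content and `example.org` URLs are made up. One detail worth noting is that the standard library's `RobotFileParser.parse` expects an iterable of lines, so the sketch splits the decoded text before passing it in.

```python
from urllib.robotparser import RobotFileParser


def make_robots_checker(robots_txt):
    """Build a RobotFileParser from robots.txt content given as bytes or str.

    Hypothetical helper, not part of CorpusCrawler.
    """
    if isinstance(robots_txt, bytes):
        # Content fetched over HTTP may arrive as bytes; decode it first.
        robots_txt = robots_txt.decode("utf-8", errors="replace")
    checker = RobotFileParser()
    # parse() takes a sequence of lines, not one long string.
    checker.parse(robots_txt.splitlines())
    return checker


# Made-up robots.txt content, passed as bytes.
checker = make_robots_checker(b"User-agent: *\nDisallow: /private/\n")
blocked = checker.can_fetch("*", "http://example.org/private/page")
allowed = checker.can_fetch("*", "http://example.org/public/page")
```

Normalizing to text in one place like this keeps the rest of the code the same whether the cache returns bytes or an already-decoded string.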
[This /CONTRIBUTING.md](https://github.com/google/corpuscrawler/blob/master/CONTRIBUTING.md) is a License Agreement / Code of Conduct to sign. As far as I can see, this very cherishable project has no actual tutorial. I don't have Python...
The [Universal Dependencies project](https://github.com/UniversalDependencies) has corpora in a set of languages; consider incorporating them.
### Research
* J. Tiedemann, 2016, Finding Alternative Translations in a Large Corpus of Movie Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)...
Related to #80. Suggestion: mainly, move the core code up so it is more visible. The crawlers are kept in their own folder.
- [ ] Reorganize project structure from...
Related to #80. This is the core documentation which can help open source contributions.
Hi Sascha, Nice work! Here's the output of what roozbeh did for HarfBuzz testing by extracting Wikipedia: https://github.com/behdad/harfbuzz-testing-wikipedia Don't know if it's of much use. That one included all talk...
```
$ python2 --version
Python 2.7.16+
$ python3 --version
Python 3.7.2+
```
```
$ python3 ./corpuscrawler --language tzh --output output-tzh/
Cache-Hit: http://listen.bible.is/robots.txt
Traceback (most recent call last):
  File "./corpuscrawler", line...
```