wikiextractor

A tool for extracting plain text from Wikipedia dumps

Results 150 wikiextractor issues

Hi, when I ran the command `python -m wikiextractor.WikiExtractor`, the message "if self.quitting: raise BdbQuit" appeared after the pages had been processed. How can I solve this problem? Thanks a lot!

While using this program I got an error. The following import was not working: `from .extract import Extractor, ignoreTag, define_template, acceptedNamespaces` So I changed it to `from extract import Extractor,...
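The report above is truncated and does not quote the error text, so the following is only a sketch under an assumption: that the failure was Python's usual "attempted relative import" error, which appears when a module inside a package is executed as a plain script. The package name `demo_pkg` and its files are hypothetical stand-ins, not part of wikiextractor. The sketch builds a throwaway package and runs the same module both ways:

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def run_both_ways():
    """Run one module of a throwaway package as a script and via -m."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg = Path(tmp) / "demo_pkg"  # hypothetical package, not wikiextractor
        pkg.mkdir()
        (pkg / "__init__.py").write_text("")
        (pkg / "extract.py").write_text("NAME = 'extract'\n")
        # main.py uses the same style of relative import as the quoted line.
        (pkg / "main.py").write_text("from .extract import NAME\nprint(NAME)\n")

        # Make the temporary directory importable for the child processes.
        env = {**os.environ, "PYTHONPATH": tmp}

        # 1) Executed as a script: the module has no parent package.
        as_script = subprocess.run(
            [sys.executable, str(pkg / "main.py")],
            capture_output=True, text=True, cwd=tmp, env=env,
        )
        # 2) Executed as a module: the parent package is known.
        as_module = subprocess.run(
            [sys.executable, "-m", "demo_pkg.main"],
            capture_output=True, text=True, cwd=tmp, env=env,
        )
        return as_script, as_module


if __name__ == "__main__":
    script_result, module_result = run_both_ways()
    print("script exit code:", script_result.returncode)
    print("module exit code:", module_result.returncode)
```

If that assumption holds, invoking the tool as a module (the `python -m wikiextractor.WikiExtractor` form quoted in another report in this list) may avoid the need to edit the import by hand.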

I have noticed that content of bullets (and probably notes, but I didn't check thoroughly) is missing from the output. For example, the `Aerosol effects` section in https://en.wikipedia.org/wiki/Albedo results in...

I tried using WikiExtractor on the abstract dump, but it didn't extract any files. It did run for some time, but it only created one folder (AA) and concluded by extracting...

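I ran into some problems while extracting article abstracts. The details are as follows: ![image](https://user-images.githubusercontent.com/47466844/140681349-e79333dd-7799-4b54-b211-0a96b8af5fe3.png) As a result, no abstracts were extracted, and when I checked abstract_template its content was empty. Could you please provide some technical support?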

Note in the example below, in paragraph 2, sentence 1, how "its founding president was Luigi Vittorio Bertarelli." was not captured correctly. Instead, it was truncated to "its founding president...

INFO: Loaded 734526 templates in 4151.8s
INFO: Starting page extraction from
Traceback (most recent call last):
  File "Anaconda\Scripts\wikiextractor-script.py", line 33, in
    sys.exit(load_entry_point('wikiextractor==3.0.6', 'console_scripts', 'wikiextractor')())
  File "Anaconda\lib\site-packages\wikiextractor-3.0.6-py3.9.egg\wikiextractor\WikiExtractor.py", line 638, in main...

There is an error in extractPage, which is fixed by this PR.

@attardi @gojomo @xiaoling Thanks for your work! I want to extract the corpus in the format: title, title's description, all mentions/anchors, and their positions and associated entities. So, could I...