Masahiro Suzuki

Results: 4 comments by Masahiro Suzuki

Hello, I also faced this problem, and I found a (temporary?) workaround. When handling lists, the code does not add their text: https://github.com/attardi/wikiextractor/blob/master/wikiextractor/extract.py#L238-L264 So changing `continue` in L264 to `page.append(line.lstrip('*#;'))`...
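A minimal sketch of the change described above, assuming `line` holds a wiki-markup list line as seen in wikiextractor's `extract.py`: instead of skipping the line with `continue`, strip the leading list markers and keep the text.

```python
# Hedged sketch, not the actual wikiextractor code path.
# A wiki list line starts with markers such as '*', '#', or ';'.
line = "**nested bullet text"

# Original behavior at L264: `continue` -> the list text is dropped.
# Proposed change: keep the text with the leading markers removed.
text = line.lstrip('*#;')  # strips any run of leading '*', '#', ';' chars
print(text)
```

`str.lstrip('*#;')` removes every leading character that belongs to the given set, so it handles nested lists (`**`, `##`) in one call.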

Thank you for your comment and for sharing the issue. I had not noticed this ipadic issue. Not only tokenization but also vocab.txt (the vocabulary-building process) would have the problem,...

As you mentioned, it seems that subword tokenization based on long-unit words (長単位) would work better than using `ipadic` or `unidic(_lite)`. I think it would be better to create a...

Thank you for sharing! I will check it in detail.