Masahiro Suzuki

Results: 4 comments by Masahiro Suzuki

Hello, I also faced this problem, and I found a (temporary?) workaround. When handling lists, the code does not add their text: https://github.com/attardi/wikiextractor/blob/master/wikiextractor/extract.py#L238-L264 So changing `continue` in L264 to `page.append(line.lstrip('*#;'))`...
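A minimal sketch of the change described above, assuming `line` holds a wiki-markup list line as seen in wikiextractor's `extract.py`: instead of skipping the line with `continue`, strip the leading list markers and keep the text.

```python
# Hedged sketch, not the actual wikiextractor code path.
# A wiki list line starts with markers such as '*', '#', or ';'.
line = "**nested bullet text"

# Original behavior at L264: `continue` -> the list text is dropped.
# Proposed change: keep the text with the leading markers removed.
text = line.lstrip('*#;')  # strips any run of leading '*', '#', ';' chars
print(text)
```

`str.lstrip('*#;')` removes every leading character that belongs to the given set, so it handles nested lists (`**`, `##`) in one call.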

Thank you for your comment and for sharing the issue. I had not noticed this ipadic issue. Not only tokenization but also vocab.txt (the vocabulary-building process) would have the problem,...

As you mentioned, it seems that subword tokenization based on long-unit words (長単位) would work better than using `ipadic` or `unidic(_lite)`. I think it would be better to create a...

Thank you for sharing! I will check it in detail.