Guoao Wei
> The only part that might be tricky is that for some functions we need to pass more than one parameter (see for instance pca in test_indexes). That can be...
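To make that concrete, here is a minimal, hypothetical sketch (the functions and cases are only illustrative, not the repo's actual test code) of one way a parametrized index-preservation test could pass extra arguments for functions that need more than one parameter:

```python
import pandas as pd
import pytest

s = pd.Series(["first doc", "second doc"], index=[10, 11])

# Each case: (test id, function under test, extra positional args after the Series).
cases = [
    ("upper", lambda series: series.str.upper(), ()),
    ("slice", lambda series, stop: series.str.slice(0, stop), (3,)),  # needs an extra parameter
]


@pytest.mark.parametrize("name, func, extra_args", cases)
def test_index_is_preserved(name, func, extra_args):
    result = func(s, *extra_args)
    assert result.index.equals(s.index)
```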
See #130 for test handling `nan`, and #157 for new HeroTypes.
I would prefer to start with adding Chinese support for the preprocessing module. The most common Chinese NLP tools right now are probably [jieba](https://github.com/fxsjy/jieba), [HanLP](https://github.com/hankcs/HanLP), and [pkuseg](https://github.com/lancopku/PKUSeg-python). Also, spaCy has...
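To illustrate, here is a minimal sketch (assuming jieba as the segmenter; the name `tokenize_zh` is only illustrative, not a proposed API) of how jieba could back a Series-level Chinese tokenizer:

```python
import jieba
import pandas as pd


def tokenize_zh(s: pd.Series) -> pd.Series:
    """Segment each Chinese document into a list of tokens with jieba."""
    return s.apply(jieba.lcut)


s = pd.Series(["今天天气很好", "自然语言处理很有趣"])
print(tokenize_zh(s))  # each row becomes a list of segmented words
```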
@jbesomi I'm also confused about the difference. [This page](https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html) describes word segmentation as a step prior to tokenization, but in practice we treat them as almost equivalent. `preprocessing_zh.py` sounds good, would...
Hi @jbesomi. You are correct. I read about [this issue](https://github.com/explosion/spaCy/issues/4695); it seems `zh_core_web_sm` does include a word segmenter, trained on the OntoNotes dataset with gold segmentation. Also, as I mentioned...
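For reference, a quick way to try that segmenter (assuming the model has been installed with `python -m spacy download zh_core_web_sm`) would be something like:

```python
import spacy

# Load spaCy's small Chinese model and segment a sentence.
nlp = spacy.load("zh_core_web_sm")
doc = nlp("今天天气很好")
print([token.text for token in doc])  # word-segmented tokens
```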
Hi, @jbesomi. You've made a good point. spaCy's Chinese model was originally released from [howl-anderson/Chinese_models_for_SpaCy](https://github.com/howl-anderson/Chinese_models_for_SpaCy); however, there hasn't been any info about its performance compared to other tools....
I just found in [https://spacy.io/usage/models#chinese](https://spacy.io/usage/models#chinese) that spaCy's Chinese model is a custom `pkuseg` model trained on OntoNotes 5.0. That sounds good, but I'll still go with `jieba` first and see if we...
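For completeness, a standalone `pkuseg` check (a minimal sketch using its default model; results may differ from the custom model spaCy ships) could look like:

```python
import pkuseg

# pkuseg with its default (mixed-domain) model.
seg = pkuseg.pkuseg()
print(seg.cut("今天天气很好"))  # list of segmented words
```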
For Asian languages (Chinese, Japanese...), word segmentation is an essential step in preprocessing. We usually remove non-textual characters from the corpus so that documents look like naturally written text, and then segment the...
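A minimal sketch of that order (the regex and helper names below are only illustrative, not a proposed API):

```python
import jieba
import pandas as pd


def remove_non_textual(s: pd.Series) -> pd.Series:
    """Keep CJK characters, CJK punctuation, basic Latin letters/digits and spaces."""
    pattern = r"[^\u4e00-\u9fff\u3000-\u303fA-Za-z0-9，。！？、；： ]"
    return s.str.replace(pattern, "", regex=True)


def segment(s: pd.Series) -> pd.Series:
    """Segment each cleaned document into a list of words with jieba."""
    return s.apply(jieba.lcut)


s = pd.Series(["今天天气很好😊###", "自然语言处理！！"])
print(segment(remove_non_textual(s)))
```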
I found a problem with using a global language setting. Some functions cannot be applied to Asian languages, e.g. `remove_diacritics`, `stem`. Also, `remove_punctuation` is integrated into `remove_stopwords` after `tokenization`. When...
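To illustrate the second point, here is a hedged sketch (not texthero's implementation; the stopword list is an arbitrary subset) of how punctuation tokens can be dropped in the same pass as stopwords once the text is tokenized:

```python
import string

import pandas as pd

chinese_stopwords = {"的", "了", "很"}                 # illustrative subset only
punctuation = set(string.punctuation) | {"，", "。", "！", "？"}
stop_tokens = chinese_stopwords | punctuation


def remove_stopwords(s: pd.Series) -> pd.Series:
    """Drop stopword and punctuation tokens from already-tokenized documents."""
    return s.apply(lambda tokens: [t for t in tokens if t not in stop_tokens])


s_tokenized = pd.Series([["今天", "天气", "很", "好", "。"]])
print(remove_stopwords(s_tokenized))  # [['今天', '天气', '好']]
```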
> Hey @AlfredWGA !
>
> Apologize, what do you mean by "integrated"? (Also, remove_punctuation is _integrated_ into remove_stopwords after tokenization)
>
> I agree.
>
> To probably understand...