Anurag Singh
@avinsit123 How about using word-level iNLTK embeddings and then XGBoost to classify the tokens?
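A minimal sketch of that suggestion, assuming iNLTK's public `get_embedding_vectors(text, language)` returns one vector per token and that the stacked vectors are fed to `xgboost.XGBClassifier`. The embedder is stubbed here so the data-preparation step is runnable on its own; `fake_embed` and the toy labels are purely illustrative.

```python
def build_token_dataset(sentences, labels, embed):
    """Flatten per-token embeddings and labels into X, y for a classifier.

    sentences: list of token lists; labels: parallel list of label lists;
    embed: callable mapping a token list to a list of vectors
    (e.g. a wrapper around inltk's get_embedding_vectors).
    """
    X, y = [], []
    for tokens, tags in zip(sentences, labels):
        vectors = embed(tokens)
        assert len(vectors) == len(tags), "one label per token"
        X.extend(vectors)
        y.extend(tags)
    return X, y


# Stub standing in for the real iNLTK embedder (an assumption, not its API):
def fake_embed(tokens):
    return [[float(len(t)), float(ord(t[0]) % 7)] for t in tokens]


X, y = build_token_dataset([["मेरा", "नाम"]], [[0, 1]], fake_embed)
# X and y could now go straight into xgboost.XGBClassifier().fit(X, y)
```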
I am running out of memory while creating a TextLMDataBunch with only 100K articles and a 32K vocabulary. How much memory is required to create the data for the language model?
Thank you for the information. The issue was that a single file had over 350K characters, which could not be tokenized, numericalized, and loaded into main memory all at once...
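A minimal sketch of the workaround described above: split an oversized document into bounded pieces before handing it to the data-bunch step, so tokenization never sees 350K characters at once. The 50K chunk size is an arbitrary assumption; each chunk would become one row of the DataFrame passed to fastai's `TextLMDataBunch.from_df`.

```python
def split_into_chunks(text, max_chars=50_000):
    """Return consecutive slices of text, each at most max_chars long."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


# A 350K-character document becomes seven 50K-character rows.
chunks = split_into_chunks("a" * 350_000)
```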
I have completed it for Urdu, and here is the [link](https://github.com/anuragshas/nlp-for-urdu). Resources for the Kashmiri language are very scarce, some of them are paid, and there are e-paper websites that only have images. I...
You are welcome. I am really happy that I will be able to raise my first PR on GitHub. After going through the code, I guess I will have...
@goru001 Here is the link to [MaithiliWikiArticles](https://drive.google.com/open?id=15-Yy5Zfr7GIKEN0-d7kWMRGRhYamd6vH). I have been busy searching for a job; I will create a PR for the Urdu LM as soon as I get free.
@goru001 I have uploaded the Urdu model using the instructions mentioned [here](https://github.com/goru001/inltk/issues/2#issuecomment-478350926). Please let me know what changes I should make to create the PR.
The __pycache__ and .idea folders are already present in the repo; shouldn't those be removed?
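If those folders were committed by mistake, the usual fix is `git rm -r --cached __pycache__ .idea` to untrack them, plus entries in `.gitignore` so they stay out of future commits. A minimal `.gitignore` fragment for that (a sketch, assuming a standard Python/PyCharm setup):

```
__pycache__/
.idea/
```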
@ankur220693 I am actually short of data for working on Maithili. The model that I had created was overfitting, so I had to put it on hold. If you can...
For Kashmiri there is not enough publicly available data to work with; check the OSCAR corpus or the Wikipedia dump to see if any data is available. The last time I had scraped it was...
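One quick way to do the Wikipedia check suggested above is to look at the dump for the language's wiki. A minimal sketch, assuming the standard Wikimedia dumps URL layout (`<lang>wiki-latest-pages-articles.xml.bz2`); a HEAD request on the resulting URL reveals the dump's size before downloading anything. Checking OSCAR would instead go through its own per-language listings.

```python
def wikipedia_dump_url(lang):
    """Build the standard Wikimedia dumps path for a wiki's article dump.

    lang is the wiki's language code, e.g. "ks" for Kashmiri or "ur" for Urdu.
    """
    return (f"https://dumps.wikimedia.org/{lang}wiki/latest/"
            f"{lang}wiki-latest-pages-articles.xml.bz2")


ks_dump = wikipedia_dump_url("ks")
```

A tiny Kashmiri dump would confirm there is little usable article text there.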