Alex Nguyen
1/ For Vietnamese, a preprocessing step for syllables is probably needed before feeding them into the wordpiece tokenizer (BERT's), because there are often several ways to place the tone marks...
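To make the preprocessing idea concrete, here is a minimal sketch of one small part of it: collapsing canonically equivalent Unicode sequences into a single composed form before tokenization. It uses only Python's standard `unicodedata` module; the helper name `normalize_syllables` is made up for illustration, and this is an assumption about what such a step could look like, not the pipeline discussed in the thread.

```python
import unicodedata

def normalize_syllables(text: str) -> str:
    """Map canonically equivalent character sequences to one composed (NFC) form."""
    return unicodedata.normalize("NFC", text)

# The same word typed two ways: precomposed, and as a base letter plus combining marks.
composed = "Vi\u1ec7t"            # 'ệ' as a single code point
decomposed = "Vie\u0302\u0323t"   # 'e' + combining circumflex + combining dot below

print(composed == decomposed)                       # False: different code points
print(normalize_syllables(decomposed) == composed)  # True after normalization
```

Note that NFC only unifies encodings of the same characters. It does not change where the tone mark sits (for example `hoá` vs `hóa`), so a dedicated normalizer would still be needed for that.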
> ```
> >>> from underthesea import text_normalize
> >>> text_normalize('Ðảm baỏ chất lựơng phòng thí nghịêm hoá học')
> 'Đảm bảo chất lượng phòng thí nghiệm hóa học'
> ```
> ...
> How did the results turn out?

More efficient and more robust. More efficient: it reduces the number of tokens per sentence, which speeds up training ... More robust: it tolerates...
I have the same interest and would like to ask how you filter "low quality content" with an n-gram language model. How do you define "good" vs "bad" data? And in...
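For readers unfamiliar with the idea behind the question, here is a toy sketch of perplexity-based filtering: train an n-gram model on text considered "good", then keep documents whose perplexity under that model is below a threshold. Everything here is an assumption for illustration (a character bigram model with add-one smoothing, a three-line reference corpus, a threshold of 25); it is not the method used by any particular pipeline, which would rely on a far larger model and a tuned cutoff.

```python
import math
from collections import Counter

def train_bigram(corpus):
    """Count character contexts and bigrams over the reference ("good") text."""
    contexts, bigrams = Counter(), Counter()
    for line in corpus:
        chars = ["<s>"] + list(line.lower())
        contexts.update(chars[:-1])            # every symbol that has a successor
        bigrams.update(zip(chars, chars[1:]))
    vocab = {c for line in corpus for c in line.lower()}
    return contexts, bigrams, len(vocab) + 1   # +1 slot for unseen characters

def perplexity(text, model):
    """Per-character perplexity of text under the add-one smoothed bigram model."""
    contexts, bigrams, v = model
    chars = ["<s>"] + list(text.lower())
    log_prob = 0.0
    for prev, cur in zip(chars, chars[1:]):
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + v)
        log_prob += math.log(p)
    return math.exp(-log_prob / max(len(chars) - 1, 1))

def filter_docs(docs, model, threshold):
    """Keep documents that look similar enough to the reference text."""
    return [d for d in docs if perplexity(d, model) < threshold]

reference = [
    "the quick brown fox jumps over the lazy dog",
    "language models assign probabilities to text",
    "clean text looks like the reference corpus",
]
model = train_bigram(reference)
docs = ["the lazy dog", "zzxq qqjx xkzv vvqz"]
kept = filter_docs(docs, model, threshold=25.0)   # hypothetical cutoff
print(kept)
```

In this framing, "good" simply means "resembles the reference corpus", so the choice of reference text and threshold is what actually defines quality.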
> Hi @tiendung -- Can you try running with `-release` (`./build/codon run -release gpu.py`)?

It works! Thank you.
Can you explain further? I still don't get what they mean and which value should be used: xx or aa / bb?
+1 for RWKV
> Hi @tiendung, are you trying to run the pipeline on non-English languages ("my" and "gu")?
>
> If your goal is to create the English cc slice of the...
I got the same error when running the commands below. I think @danielpclark sums it up here: https://github.com/togethercomputer/RedPajama-Data/issues/23#issuecomment-1520547829

```sh
make lang=en dl_lm
python -m cc_net -l en
```
```We then run the same cc-net pipeline on warc_wikipedia.warc, which produces warc_wikipedia.warc.wet.``` I have the same problem finding the source code to convert WARC to WET. Please assist.