Alex Nguyen

Results: 15 comments by Alex Nguyen

1/ For Vietnamese, we probably need a step that preprocesses the syllables before feeding them to the WordPiece tokenizer (BERT's), because there are often several ways to place the tone marks...
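A preprocessing step like the one described above might be sketched as follows. This is a minimal illustration, not the commenter's method: it assumes the step consists of Unicode NFC composition followed by a per-syllable lookup, and both the `normalize_syllables` helper and the three-entry `TONE_PLACEMENT` table are hypothetical.

```python
import unicodedata

# Hypothetical, deliberately tiny table mapping old-style tone placement to
# new-style placement; a real table would need to cover far more syllables.
_RAW_TONE_PLACEMENT = {"hoá": "hóa", "hoà": "hòa", "thuỷ": "thủy"}

# Normalize keys and values too, so lookups do not depend on how this
# source file happens to encode the accented characters.
TONE_PLACEMENT = {
    unicodedata.normalize("NFC", old): unicodedata.normalize("NFC", new)
    for old, new in _RAW_TONE_PLACEMENT.items()
}


def normalize_syllables(text: str) -> str:
    """Compose characters (NFC), then unify tone placement per syllable.

    Limitations of this sketch: the lookup is case-sensitive, runs of
    whitespace collapse to single spaces, and punctuation attached to a
    syllable prevents a match.
    """
    text = unicodedata.normalize("NFC", text)
    return " ".join(TONE_PLACEMENT.get(s, s) for s in text.split())


# Input typed with combining marks and old-style tone placement.
print(normalize_syllables("hoa\u0301 ho\u0323c"))
```

Running this before tokenization means that visually identical syllables reach WordPiece as identical code point sequences, so they are not split into different subword pieces.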

> ```
> >>> from underthesea import text_normalize
> >>> text_normalize('Ðảm baỏ chất lựơng phòng thí nghịêm hoá học')
> 'Đảm bảo chất lượng phòng thí nghiệm hóa học'
> ```
> ...

> How did the results turn out?

More efficient and more robust. More efficient: it reduces the number of tokens per sentence, which speeds up training ... More robust: it withstands...

I have the same interest and would like to ask how you filter "low quality content" with an n-gram language model. How do you define "good" vs. "bad" data? And in...
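One common reading of this kind of filtering, used by pipelines such as cc_net, is to score each document by its perplexity under a language model trained on a clean reference corpus, treating low perplexity as "closer to the reference". The sketch below only illustrates that idea: it uses a toy add-one-smoothed bigram model instead of a real n-gram toolkit, and the function names and sample data are hypothetical.

```python
import math
from collections import Counter


def train_bigram(reference_lines):
    """Count bigrams over whitespace tokens of a clean reference corpus."""
    contexts, bigrams, vocab = Counter(), Counter(), {"</s>"}
    for line in reference_lines:
        toks = ["<s>"] + line.lower().split() + ["</s>"]
        vocab.update(toks[1:])               # tokens that can be predicted
        contexts.update(toks[:-1])           # tokens that act as context
        bigrams.update(zip(toks, toks[1:]))
    return contexts, bigrams, len(vocab)


def perplexity(text, model):
    """Per-token perplexity of `text` under the add-one-smoothed model."""
    contexts, bigrams, vocab_size = model
    toks = ["<s>"] + text.lower().split() + ["</s>"]
    log_prob = 0.0
    for prev, cur in zip(toks, toks[1:]):
        # Add-one smoothing keeps unseen bigrams at a small non-zero probability.
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(toks) - 1))


reference = ["the cat sat on the mat", "the dog sat on the rug"]
model = train_bigram(reference)

# Rank candidate documents from most to least reference-like.
docs = ["zx qv wk zx", "the cat sat on the rug"]
for doc in sorted(docs, key=lambda d: perplexity(d, model)):
    print(round(perplexity(doc, model), 2), doc)
```

Under this reading, "good" and "bad" are relative to the chosen reference corpus: a cutoff (or percentile buckets) on the perplexity score decides what is kept, and that cutoff has to be tuned on the data at hand.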

> Hi @tiendung -- Can you try running with `-release` (`./build/codon run -release gpu.py`)?

It works! Thank you.

Can you explain further? I still don't get what they mean and which value should be used: xx, or aa / bb?

> Hi @tiendung, are you trying to run the pipeline on non-English languages ("my" and "gu")?
>
> If your goal is to create the English cc slice of the...

I got the same error when running the commands below. I think @danielpclark sums it up here: https://github.com/togethercomputer/RedPajama-Data/issues/23#issuecomment-1520547829

```sh
make lang=en dl_lm
python -m cc_net -l en
```

```We then run the same cc-net pipeline on warc_wikipedia.warc, which produces warc_wikipedia.warc.wet.``` I have the same problem finding the source code that converts WARC to WET. Please assist.