Alex Nguyen comments

Results: 15 comments of Alex Nguyen

Improve tokenizer

1/ For Vietnamese, a preprocessing step that normalizes syllables is probably needed before the text goes into the (BERT) wordpiece tokenizer, because there are often several ways to place the tone marks...
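A minimal sketch of what such a preprocessing step could look like, assuming it is enough to (a) apply Unicode NFC normalization so that base letters and combining marks become single code points, and (b) rewrite a few syllables to one tone-mark placement. The `TONE_PLACEMENT` table and the `normalize_syllables` function are hypothetical names for illustration only; the table is deliberately tiny and not the method used in the thread.

```python
import unicodedata

# Hypothetical, deliberately tiny table mapping one tone-mark placement
# to another. A real table would have to cover every vowel cluster and tone.
_RAW_TONE_PLACEMENT = {
    "hoá": "hóa",
    "hoà": "hòa",
    "thuỷ": "thủy",
}

# Store keys and values in NFC so lookups match NFC-normalized input.
TONE_PLACEMENT = {
    unicodedata.normalize("NFC", old): unicodedata.normalize("NFC", new)
    for old, new in _RAW_TONE_PLACEMENT.items()
}


def normalize_syllables(text: str) -> str:
    """Normalize Vietnamese syllables before subword tokenization (sketch)."""
    # Step 1: NFC merges "letter + combining mark" sequences into one code point.
    text = unicodedata.normalize("NFC", text)
    # Step 2: rewrite known syllables to a single tone-mark placement.
    # Only exact, lower-case, space-separated syllables are handled here.
    return " ".join(TONE_PLACEMENT.get(tok, tok) for tok in text.split(" "))
```

The output of `normalize_syllables` would then be passed to the wordpiece tokenizer, so that spelling variants of the same syllable map to the same subword sequence.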

Improve tokenizer

> ```
> >>> from underthesea import text_normalize
> >>> text_normalize('Ðảm baỏ chất lựơng phòng thí nghịêm hoá học')
> 'Đảm bảo chất lượng phòng thí nghiệm hóa học'
> ```
> ...

Improve tokenizer

> How did the results turn out?

More efficient and more robust. More efficient: it reduces the number of tokens per sentence, which speeds up training ... More robust: it withstands...

Release of data pre-processing code?

I have the same interest and would like to ask how you filter "low quality content" with an n-gram language model. How do you define "good" vs "bad" data? And in...

[help] GPU error

> Hi @tiendung -- Can you try running with `-release` (`./build/codon run -release gpu.py`)?

It works! Thank you.

Access/train to use the embeddings

Can you explain further? I still don't get what they mean and which value should be used: xx or aa / bb?

Suggestion: RWKV Language Model

+1 for RWKV

Got error while running `python -m cc_net -l my -l gu`

> Hi @tiendung, are you trying to run the pipeline on non-English languages ("my" and "gu")?
>
> If your goal is to create the English cc slice of the...

Got error while running `python -m cc_net -l my -l gu`

I got the same error when running the commands below. I think @danielpclark sums it up here: https://github.com/togethercomputer/RedPajama-Data/issues/23#issuecomment-1520547829

```sh
make lang=en dl_lm
python -m cc_net -l en
```

Questions about the quality classifier in common crawl

```We then run the same cc-net pipeline on warc_wikipedia.warc, which produces warc_wikipedia.warc.wet.``` I have the same problem finding the source code that converts WARC to WET. Please assist.