Hao Wang

Results: 4 comments by Hao Wang

Any update on this problem? I've run into the same issue.

> Have you tried the solution proposed by @lukas-blecher to use a pre-tokenizer?
>
> I believe this issue is related to this one: #645

Yes, I've used a pre-tokenizer....

Sorry, we have only released the pre-trained model so far. You can find the description of the dataset engine in Section 3.1 of our paper.

The compiling pipeline is complicated and not ready to be open-sourced. I could provide some demo data and scripts that query an LLM for explanations, as I got some free...