LuoKaiGSW

Results: 16 comments by LuoKaiGSW

> Hey @am-bean, thanks for chasing this down! We should be loading with use_fast=False eventually -- do you know if this is happening automatically, or do you have to manually...

> Extremely hacky, but I managed to work around this by passing my own byte_decoder as part of the tokenizer. > > e.g. for Llama 3, you can brute force...
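For reference, a minimal sketch of that kind of workaround, assuming the tokenizer follows the standard GPT-2 byte-to-unicode convention; the model id and the helper below are illustrative, not the exact code from the quoted comment:

```python
# Illustrative sketch only: attach a byte_decoder to a fast tokenizer
# (e.g. Llama 3) by rebuilding the standard GPT-2 byte <-> unicode table.
from transformers import AutoTokenizer


def gpt2_bytes_to_unicode():
    """Standard GPT-2 mapping from raw byte values to printable unicode chars."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\xa1"), ord("\xac") + 1))
          + list(range(ord("\xae"), ord("\xff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")  # assumed model id
# byte_decoder maps the placeholder unicode characters back to raw byte values
tokenizer.byte_decoder = {ch: b for b, ch in gpt2_bytes_to_unicode().items()}
```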

> sentencepiece Hey @yonitjio, I don't understand why installing `sentencepiece` would solve this problem. According to the code, it seems like it would still go to the branch that...

> I mean the original issue with Mistral can already be solved by installing `sentencepiece`. > If you don't install `sentencepiece`, the tokenizer will fall back to the fast tokenizer which doesn't...
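A quick way to check which branch you actually end up on, as a sketch that assumes the fallback behaviour described in the quoted comment (the model id below is only an example):

```python
# Illustrative check: request the slow tokenizer explicitly. Per the quoted
# comment, without the `sentencepiece` package transformers falls back to the
# fast tokenizer, which lacks the sentencepiece-backed attributes.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1", use_fast=False)
print(type(tok).__name__)  # e.g. "LlamaTokenizer" (slow) vs. "LlamaTokenizerFast"
print(tok.is_fast)         # False only if the slow tokenizer was actually loaded
```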

I have set up the environment with CUDA 11.8, and it's working; perhaps you can refer to [this](https://github.com/sgl-project/sglang/issues/284). However, I have encountered an issue with [garbled output](https://github.com/sgl-project/sglang/issues/316) during use, and...

> @LuoKaiGSW Could you please provide more details about this error, such as the GPU, NVIDIA driver version, and which packages cause this error? Thank you for your reply. I...

> In their paper the ratio is 4:1, with the 3.5 and 4 data tuned together, and ShareGPT_Vicuna_unfiltered gives no hint of which entries are 3.5 and which are 4. ShareGPT_Vicuna_unfiltered also contains a lot of data that cannot be used directly, e.g. conversations must start with a human turn, some conversations are too long, and so on. Isn't 50K entries a bit little for fine-tuning? From the paper, the author seems to have used this [dataset](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/blob/main/ShareGPT_V3_unfiltered_cleaned_split_no_imsorry.json), but when I looked at it, it has about 90K entries, though that is after splitting; without splitting it should be around 50K. So does the reported count refer to the unsplit number?

> Have you tried the agentlm the author open-sourced? How well does it work? I trained a model with the data-construction method described in the paper and tested it; the results are not very stable.

> Hey! I suppose you are using `python` and can't see what's inside your tokenizer! #1542 should help you with this 🤗 Thank you for your reply, but I didn't...

> You cannot see any attributes because both `__repr__` and `__str__` are not implemented. So, is it impossible to read this mapping relationship from the fast tokenizer?
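One way to dig the mapping out anyway, as a sketch: the Rust-backed objects don't print their contents, but the backend tokenizer can be serialized to JSON and inspected from Python (the "gpt2" checkpoint here is only an example, and the exact JSON layout depends on the tokenizer's model type):

```python
# Illustrative sketch: dump the fast tokenizer's Rust backend to JSON and
# read the vocab (token string -> id) from the serialized state.
import json
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # any fast tokenizer
state = json.loads(tok.backend_tokenizer.to_str())   # full serialized state
vocab = state["model"]["vocab"]                      # token string -> id mapping
print(list(vocab.items())[:5])
```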