OSError: failed to fill whole buffer
When I initialize draftretriever.Reader, I get this error.
python3 gen_model_answer_rest.py
loading the datastore ...
Traceback (most recent call last):
File "/mnt/gefei/REST/llm_judge/gen_model_answer_rest.py", line 493, in
Hi, I've not encountered this error before. I wonder if you've fully built the datastore without any interruptions.
I checked the datastore build and found a segmentation fault:
python3 get_datastore_chat.py
Namespace(model_path='/mnt/tianlian/deployment/llm_task_flows/model_original/hugging_face_finetune/Qwen2.5-14B-Instruct', large_datastore=False)
number of samples: 68623
100%|████████| 68623/68623 [04:13<00:00, 271.09it/s]
[1] 32657 segmentation fault (core dumped)  python3 get_datastore_chat.py
When I limit the dataset to 100 samples it works, but with 2500 samples it crashes:
python3 get_datastore_chat.py
Namespace(model_path='/mnt/tianlian/deployment/llm_task_flows/model_original/hugging_face_finetune/Qwen2.5-14B-Instruct', large_datastore=False)
number of samples: 100
100%|████████| 100/100 [00:00<00:00, 342.49it/s]

python3 get_datastore_chat.py
Namespace(model_path='/mnt/tianlian/deployment/llm_task_flows/model_original/hugging_face_finetune/Qwen2.5-14B-Instruct', large_datastore=False)
number of samples: 2500
100%|████████| 2500/2500 [00:08<00:00, 307.74it/s]
[1] 56767 segmentation fault (core dumped)  python3 get_datastore_chat.py
Could this be related to how my image was created?
Hi,
I suppose it's because the vocab size of Qwen2.5 is 151936, which exceeds the range of the u16 type I manually set in DraftRetriever.
To fix the issue, you may change this line in the writer from self.index_file.write_u16::<LittleEndian>(item as u16)?; to self.index_file.write_u32::<LittleEndian>(item as u32)?;
Besides, change these two lines in the Reader from for i in (0..data_u8.len()).step_by(2) { let int = LittleEndian::read_u16(&data_u8[i..i+2]) as i32; to for i in (0..data_u8.len()).step_by(4) { let int = LittleEndian::read_u32(&data_u8[i..i+4]) as i32;
Hope these changes may fix the bug. If you have any further questions, please feel free to contact me.
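To see why the u16 cast corrupts Qwen2.5 token ids, here is a minimal Python sketch. The bitmask mimics what Rust's item as u16 does (keeping only the low 16 bits); the constants come from the thread above, and the specific wrapped value is just illustrative:

```python
# Qwen2.5's vocabulary has 151936 entries, but `item as u16` in the
# original writer stores only the low 16 bits of each token id.
VOCAB_SIZE = 151936
U16_MAX = 0xFFFF  # 65535

max_token_id = VOCAB_SIZE - 1      # 151935
truncated = max_token_id & 0xFFFF  # what the u16 cast writes to disk

# Any token id above 65535 silently wraps, so the datastore is corrupted
# and the Reader later fails with "failed to fill whole buffer".
assert max_token_id > U16_MAX
assert truncated != max_token_id
print(max_token_id, "->", truncated)  # 151935 -> 20863
```

Widening the on-disk encoding to u32 (and reading 4 bytes per token in the Reader) removes the wrap, at the cost of doubling the datastore size.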
Thanks! I have fixed the bug!
Hi, when I try to use deepseek-coder-6.7b-base to construct the datastore for code-related tasks, I also encounter the segmentation fault. However, the vocab size of deepseek-coder-6.7b-base is 32000, which is smaller than 65535. How can I resolve this problem? Thank you!
Hi,
I assume the issue lies in the following code:
writer = draftretriever.Writer(
    index_file_path=datastore_path,
    max_chunk_len=512 * 1024 * 1024,
    vocab_size=tokenizer.vocab_size,
)
Here, tokenizer.vocab_size is 32,000 for deepseek-coder-6.7b-base. However, the actual vocabulary size is 32,000 plus the number of added tokens, which totals 32,021.
I've changed the code from vocab_size=tokenizer.vocab_size to vocab_size=tokenizer.vocab_size + len(tokenizer.get_added_vocab()). Sorry for the bug.
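The off-by-added-tokens arithmetic can be checked with a short Python sketch. StubTokenizer is a hypothetical stand-in (the real tokenizer comes from transformers.AutoTokenizer) that mirrors the two attributes the fix relies on:

```python
# Stand-in for the deepseek-coder-6.7b-base tokenizer (hypothetical class,
# not the real transformers object). tokenizer.vocab_size excludes added
# tokens, while get_added_vocab() returns the 21 extras with ids >= 32000.
class StubTokenizer:
    vocab_size = 32000

    def get_added_vocab(self):
        return {f"<added_{i}>": 32000 + i for i in range(21)}

tokenizer = StubTokenizer()

# Old (buggy): added-token ids 32000..32020 fall outside the declared range.
old_vocab = tokenizer.vocab_size
# Fixed: include added tokens so every token id fits in the declared range.
new_vocab = tokenizer.vocab_size + len(tokenizer.get_added_vocab())

assert old_vocab == 32000
assert new_vocab == 32021
```

Any token id at or above the declared vocab_size can corrupt the index the same way the u16 overflow did, which is why a model whose base vocab is well under 65535 can still segfault.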