OSError: failed to fill whole buffer
When I initialize draftretriever.Reader, I get this error.
python3 gen_model_answer_rest.py
loading the datastore ...
Traceback (most recent call last):
File "/mnt/gefei/REST/llm_judge/gen_model_answer_rest.py", line 493, in
Hi, I've not encountered this error before. I wonder if you've fully built the datastore without any interruptions.
I checked the datastore build and found a segmentation fault:
python3 get_datastore_chat.py
Namespace(model_path='/mnt/tianlian/deployment/llm_task_flows/model_original/hugging_face_finetune/Qwen2.5-14B-Instruct', large_datastore=False)
number of samples: 68623
100%|████████| 68623/68623 [04:13<00:00, 271.09it/s]
[1] 32657 segmentation fault (core dumped)  python3 get_datastore_chat.py
When I limit the dataset to 100 samples it works, but with 2500 samples it crashes:
python3 get_datastore_chat.py
Namespace(model_path='/mnt/tianlian/deployment/llm_task_flows/model_original/hugging_face_finetune/Qwen2.5-14B-Instruct', large_datastore=False)
number of samples: 100
100%|████████| 100/100 [00:00<00:00, 342.49it/s]

python3 get_datastore_chat.py
Namespace(model_path='/mnt/tianlian/deployment/llm_task_flows/model_original/hugging_face_finetune/Qwen2.5-14B-Instruct', large_datastore=False)
number of samples: 2500
100%|████████| 2500/2500 [00:08<00:00, 307.74it/s]
[1] 56767 segmentation fault (core dumped)  python3 get_datastore_chat.py
Could this be related to how my image was created?
Hi,
I suppose it's because the vocab size of Qwen2.5 is 151936, which exceeds the range of the u16 type I manually set in DraftRetriever.
To fix the issue, you may change this line in the writer from self.index_file.write_u16::<LittleEndian>(item as u16)?; to self.index_file.write_u32::<LittleEndian>(item as u32)?;
Besides, change these two lines in the Reader from for i in (0..data_u8.len()).step_by(2) { let int = LittleEndian::read_u16(&data_u8[i..i+2]) as i32; to for i in (0..data_u8.len()).step_by(4) { let int = LittleEndian::read_u32(&data_u8[i..i+4]) as i32;
Hope these changes may fix the bug. If you have any further questions, please feel free to contact me.
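To see why the u16 cast corrupts Qwen2.5 token ids, here is a minimal Python sketch. The bitmask mimics what Rust's item as u16 does (keeping only the low 16 bits); the constants come from the thread above, and the specific wrapped value is just illustrative:

```python
# Qwen2.5's vocabulary has 151936 entries, but `item as u16` in the
# original writer stores only the low 16 bits of each token id.
VOCAB_SIZE = 151936
U16_MAX = 0xFFFF  # 65535

max_token_id = VOCAB_SIZE - 1      # 151935
truncated = max_token_id & 0xFFFF  # what the u16 cast writes to disk

# Any token id above 65535 silently wraps, so the datastore is corrupted
# and the Reader later fails with "failed to fill whole buffer".
assert max_token_id > U16_MAX
assert truncated != max_token_id
print(max_token_id, "->", truncated)  # 151935 -> 20863
```

Widening the on-disk encoding to u32 (and reading 4 bytes per token in the Reader) removes the wrap, at the cost of doubling the datastore size.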
Thanks! I have fixed the bug!
Hi, when I try to use deepseek-coder-6.7b-base to construct the datastore for code-related tasks, I also encounter the segmentation fault. However, the vocab size of deepseek-coder-6.7b-base is 32000, which is smaller than 65535. How can I resolve this problem? Thank you!
Hi,
I assume the issue lies in the following code:
writer = draftretriever.Writer(
    index_file_path=datastore_path,
    max_chunk_len=512 * 1024 * 1024,
    vocab_size=tokenizer.vocab_size,
)
Here, tokenizer.vocab_size is 32,000 for deepseek-coder-6.7b-base. However, the actual vocabulary size is 32,000 plus the number of added tokens, which totals 32,021.
I've changed the code from vocab_size=tokenizer.vocab_size to vocab_size=tokenizer.vocab_size + len(tokenizer.get_added_vocab()). Sorry for the bug.
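The off-by-added-tokens arithmetic can be checked with a short Python sketch. StubTokenizer is a hypothetical stand-in (the real tokenizer comes from transformers.AutoTokenizer) that mirrors the two attributes the fix relies on:

```python
# Stand-in for the deepseek-coder-6.7b-base tokenizer (hypothetical class,
# not the real transformers object). tokenizer.vocab_size excludes added
# tokens, while get_added_vocab() returns the 21 extras with ids >= 32000.
class StubTokenizer:
    vocab_size = 32000

    def get_added_vocab(self):
        return {f"<added_{i}>": 32000 + i for i in range(21)}

tokenizer = StubTokenizer()

# Old (buggy): added-token ids 32000..32020 fall outside the declared range.
old_vocab = tokenizer.vocab_size
# Fixed: include added tokens so every token id fits in the declared range.
new_vocab = tokenizer.vocab_size + len(tokenizer.get_added_vocab())

assert old_vocab == 32000
assert new_vocab == 32021
```

Any token id at or above the declared vocab_size can corrupt the index the same way the u16 overflow did, which is why a model whose base vocab is well under 65535 can still segfault.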