Llama-3 8B is not supported
When I run:
RAYON_NUM_THREADS=6 CUDA_VISIBLE_DEVICES=0 python3 -m rest.inference.cli --datastore-path datastore/datastore_chat_small.idx --base-model meta-llama/Meta-Llama-3-8B-Instruct
I get:
RAYON_NUM_THREADS=6 CUDA_VISIBLE_DEVICES=0 python3 -m rest.inference.cli --datastore-path datastore/datastore_chat_small.idx --base-model meta-llama/Meta-Llama-3-8B-Instruct
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:05<00:00, 1.47s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
USER: hey
ASSISTANT: Traceback (most recent call last):
  ...
  File "/home/liranringel/REST/rest/model/modeling_llama_kv.py", line 594, in forward
    key_states = past_key_value[0].cat(key_states, dim=2)
  File "/home/liranringel/REST/rest/model/kv_cache.py", line 66, in cat
    dst.copy_(tensor)
RuntimeError: The size of tensor a (32) must match the size of tensor b (8) at non-singleton dimension 1
Have you encountered the problem of segmentation fault (core dumped) when using Llama-3-8B and running python3 get_datastore_chat.py --model-path Meta-Llama-3-8B-Instruct?
Hi, modeling_llama_kv.py is adapted from an older version of the Transformers library for Llama-2, and the changes are marked with [MODIFIED] (only a few lines of code). For Llama-3, you may adapt modeling_llama.py from the latest Transformers library to ensure the correct configs are honored (e.g., grouped-query attention).
Hi,
RuntimeError: The size of tensor a (32) must match the size of tensor b (8) at non-singleton dimension 1
As for this issue, it's caused by grouped-query attention: Llama-3-8B has 32 query heads but only 8 key/value heads, while the Llama-2-era KV cache assumes the two counts are equal.
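For illustration, here is a minimal numpy sketch of the head-expansion step that grouped-query attention needs (recent versions of Transformers' modeling_llama.py implement the same idea in a helper called repeat_kv; the tensor shapes below are illustrative, the head counts are the real Llama-3-8B values):

```python
import numpy as np

# Llama-3-8B: 32 query heads share 8 key/value heads, so cached key/value
# tensors have 8 heads, not 32. A Llama-2-era cache allocated for 32 heads
# produces exactly the "tensor a (32) ... tensor b (8)" copy error above.
NUM_ATTENTION_HEADS = 32
NUM_KEY_VALUE_HEADS = 8
N_REP = NUM_ATTENTION_HEADS // NUM_KEY_VALUE_HEADS  # 4

def repeat_kv(kv: np.ndarray, n_rep: int) -> np.ndarray:
    """Expand KV heads to match query heads (same idea as Transformers' repeat_kv)."""
    batch, num_kv_heads, seq_len, head_dim = kv.shape
    if n_rep == 1:
        return kv
    # Broadcast each KV head n_rep times, then fold the repeat axis into the head axis.
    expanded = np.broadcast_to(
        kv[:, :, None, :, :], (batch, num_kv_heads, n_rep, seq_len, head_dim)
    )
    return expanded.reshape(batch, num_kv_heads * n_rep, seq_len, head_dim)

key_states = np.zeros((1, NUM_KEY_VALUE_HEADS, 5, 128))  # 8 KV heads from the model
print(repeat_kv(key_states, N_REP).shape)  # (1, 32, 5, 128)
```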
segmentation fault (core dumped) when using Llama-3-8B
As for this issue, it's caused by the large vocabulary size of Llama-3 (128,256 tokens), which exceeds the range of a u16 (maximum 65,535).
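To illustrate the overflow (a sketch, not the datastore's actual storage code): any Llama-3 token ID above 65,535 silently wraps when stored as a u16, producing wrong IDs and out-of-range indexing that can crash native code.

```python
import numpy as np

# Llama-3's vocabulary has 128,256 tokens, but u16 only covers 0..65535.
VOCAB_SIZE_LLAMA3 = 128256
U16_MAX = np.iinfo(np.uint16).max  # 65535

token_id = 128000  # a valid Llama-3 token ID
# Casting to u16 wraps modulo 65536: 128000 - 65536 = 62464, a wrong token ID.
wrapped = np.array([token_id], dtype=np.int64).astype(np.uint16)[0]
print(int(wrapped))  # 62464

# u32 covers the full vocabulary, so IDs round-trip correctly.
safe = np.uint32(token_id)
print(int(safe))  # 128000
```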
Both issues are fixed in the llama3 branch, thanks to Chinmaya Andukuri.
@zhenyuhe00 thanks!