Embeddings Provider has typo in model configuration params
Before submitting your bug report
- [X] I believe this is a bug. I'll try to join the Continue Discord for questions
- [X] I'm not able to find an open issue that reports the same bug
- [X] I've seen the troubleshooting guide on the Continue Docs
Relevant environment info
- OS: Windows 10
- Continue: 0.8.14
- IDE: VS Code 1.87.0
Description
I have the same Ollama model configured both as the embeddingsProvider and as a chat model. In the Ollama logs I can see that, whenever the roles switch, the same model is unloaded and redeployed with a different configuration. One difference I could easily identify (probably due to a typo) is the BOS token:
llm_load_print_meta: BOS token = 32013 '<｜begin▁of▁sentence｜>'
vs.
llm_load_print_meta: BOS token = 32013 '<｜begin▁of▁sentence｜>''
As you can see, there is one single quote too many at the end.
Model: deepseek-coder:6.7b
Config.json
{
  "models": [
    {
      "title": "GPT-4 Vision (Free Trial)",
      "provider": "free-trial",
      "model": "gpt-4-vision-preview"
    },
    {
      "title": "GPT-3.5-Turbo (Free Trial)",
      "provider": "free-trial",
      "model": "gpt-3.5-turbo"
    },
    {
      "title": "Gemini Pro (Free Trial)",
      "provider": "free-trial",
      "model": "gemini-pro"
    },
    {
      "title": "Codellama 70b (Free Trial)",
      "provider": "free-trial",
      "model": "codellama-70b"
    },
    {
      "model": "deepseek-coder:6.7b-instruct-q4_K_M",
      "title": "deepseek-coder:6.7b-instruct-q4_K_M",
      "completionOptions": {},
      "apiBase": "http://localhost:11434",
      "provider": "ollama"
    }
  ],
  "slashCommands": [
    {
      "name": "edit",
      "description": "Edit selected code"
    },
    {
      "name": "comment",
      "description": "Write comments for the selected code"
    },
    {
      "name": "share",
      "description": "Export this session as markdown"
    }
  ],
  "customCommands": [
    {
      "name": "test",
      "prompt": "Write a comprehensive set of unit tests for the selected code. It should setup, run tests that check for correctness including important edge cases, and teardown. Ensure that the tests are complete and sophisticated. Give the tests just as chat output, don't edit any file.",
      "description": "Write unit tests for highlighted code"
    }
  ],
  "contextProviders": [
    {
      "name": "open",
      "params": {}
    },
    {
      "name": "codebase",
      "params": {
        "nRetrieve": 25,
        "nFinal": 5,
        "useReranking": true
      }
    }
  ],
  "embeddingsProvider": {
    "provider": "ollama",
    "model": "deepseek-coder:6.7b-instruct-q4_K_M",
    "apiBase": "http://localhost:11434"
  }
}
To reproduce
Use the same Ollama model as the embeddings provider and as a chat model, then perform a query that uses embeddings, e.g. @Codebase.
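For reference, a stripped-down config.json that should reproduce it (the same model name in both places; any Ollama model should show the same reload behavior):

{
  "models": [
    {
      "title": "deepseek-coder:6.7b-instruct-q4_K_M",
      "provider": "ollama",
      "model": "deepseek-coder:6.7b-instruct-q4_K_M",
      "apiBase": "http://localhost:11434"
    }
  ],
  "embeddingsProvider": {
    "provider": "ollama",
    "model": "deepseek-coder:6.7b-instruct-q4_K_M",
    "apiBase": "http://localhost:11434"
  }
}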
Log output
Below is the log of the same model being loaded twice; note the differences between the two loads:
[GIN] 2024/03/06 - 12:59:27 | 200 | 2m5s | 127.0.0.1 | POST "/api/chat"
time=2024-03-06T12:59:48.815+01:00 level=INFO source=routes.go:78 msg="changing loaded model"
time=2024-03-06T12:59:49.757+01:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-06T12:59:49.760+01:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 7.5"
time=2024-03-06T12:59:49.760+01:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-06T12:59:49.762+01:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 7.5"
time=2024-03-06T12:59:49.762+01:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
loading library C:\Users\pawerner\AppData\Local\Temp\ollama664810684\cuda_v11.3\ext_server.dll
time=2024-03-06T12:59:49.768+01:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: C:\\Users\\pawerner\\AppData\\Local\\Temp\\ollama664810684\\cuda_v11.3\\ext_server.dll"
time=2024-03-06T12:59:49.768+01:00 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server"
llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from C:\Users\pawerner\.ollama\models\blobs\sha256-8de39949f334605a7b8d7167723c9ccc926e506f38405f6d00bdf3df12e8dcf9 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = deepseek-ai
llama_model_loader: - kv 2: llama.context_length u32 = 16384
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 11008
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 100000.000000
llama_model_loader: - kv 11: llama.rope.scaling.type str = linear
llama_model_loader: - kv 12: llama.rope.scaling.factor f32 = 4.000000
llama_model_loader: - kv 13: general.file_type u32 = 15
llama_model_loader: - kv 14: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,32256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 16: tokenizer.ggml.scores arr[f32,32256] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,32256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 18: tokenizer.ggml.merges arr[str,31757] = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "h e...
llama_model_loader: - kv 19: tokenizer.ggml.bos_token_id u32 = 32013
llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 32021
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 32014
llama_model_loader: - kv 22: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 23: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 24: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
llm_load_vocab: mismatch in special tokens definition ( 243/32256 vs 256/32256 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 32256
llm_load_print_meta: n_merges = 31757
llm_load_print_meta: n_ctx_train = 16384
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 4096
llm_load_print_meta: n_embd_v_gqa = 4096
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 11008
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 100000.0
llm_load_print_meta: freq_scale_train = 0.25
llm_load_print_meta: n_yarn_orig_ctx = 16384
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 6.74 B
llm_load_print_meta: model size = 3.80 GiB (4.84 BPW)
llm_load_print_meta: general.name = deepseek-ai
llm_load_print_meta: BOS token = 32013 '<｜begin▁of▁sentence｜>'
llm_load_print_meta: EOS token = 32021 '<|EOT|>'
llm_load_print_meta: PAD token = 32014 '<｜end▁of▁sentence｜>'
llm_load_print_meta: LF token = 30 '?'
llm_load_tensors: ggml ctx size = 0.22 MiB
llm_load_tensors: offloading 6 repeating layers to GPU
llm_load_tensors: offloaded 6/33 layers to GPU
llm_load_tensors: CPU buffer size = 3892.62 MiB
llm_load_tensors: CUDA0 buffer size = 727.62 MiB
..................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: freq_base = 100000.0
llama_new_context_with_model: freq_scale = 0.25
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA T1200 Laptop GPU, compute capability 7.5, VMM: yes
llama_kv_cache_init: CUDA_Host KV buffer size = 832.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 192.00 MiB
llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
llama_new_context_with_model: CUDA_Host input buffer size = 13.02 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 164.01 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 168.00 MiB
llama_new_context_with_model: graph splits (measure): 5
time=2024-03-06T12:59:51.675+01:00 level=INFO source=dyn_ext_server.go:161 msg="Starting llama main loop"
[GIN] 2024/03/06 - 12:59:53 | 200 | 4.2588429s | 127.0.0.1 | POST "/api/embeddings"
time=2024-03-06T12:59:55.566+01:00 level=INFO source=routes.go:78 msg="changing loaded model"
time=2024-03-06T12:59:56.304+01:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-06T12:59:56.308+01:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 7.5"
time=2024-03-06T12:59:56.322+01:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-06T12:59:56.325+01:00 level=INFO source=gpu.go:146 msg="CUDA Compute Capability detected: 7.5"
time=2024-03-06T12:59:56.325+01:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
loading library C:\Users\pawerner\AppData\Local\Temp\ollama664810684\cuda_v11.3\ext_server.dll
time=2024-03-06T12:59:56.333+01:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: C:\\Users\\pawerner\\AppData\\Local\\Temp\\ollama664810684\\cuda_v11.3\\ext_server.dll"
time=2024-03-06T12:59:56.346+01:00 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server"
llama_model_loader: loaded meta data with 26 key-value pairs and 291 tensors from C:\Users\pawerner\.ollama\models\blobs\sha256-8de39949f334605a7b8d7167723c9ccc926e506f38405f6d00bdf3df12e8dcf9 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = deepseek-ai
llama_model_loader: - kv 2: llama.context_length u32 = 16384
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 11008
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 100000.000000
llama_model_loader: - kv 11: llama.rope.scaling.type str = linear
llama_model_loader: - kv 12: llama.rope.scaling.factor f32 = 4.000000
llama_model_loader: - kv 13: general.file_type u32 = 15
llama_model_loader: - kv 14: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,32256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 16: tokenizer.ggml.scores arr[f32,32256] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,32256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 18: tokenizer.ggml.merges arr[str,31757] = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "h e...
llama_model_loader: - kv 19: tokenizer.ggml.bos_token_id u32 = 32013
llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 32021
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 32014
llama_model_loader: - kv 22: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 23: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 24: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
llm_load_vocab: mismatch in special tokens definition ( 243/32256 vs 256/32256 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 32256
llm_load_print_meta: n_merges = 31757
llm_load_print_meta: n_ctx_train = 16384
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 4096
llm_load_print_meta: n_embd_v_gqa = 4096
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 11008
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 100000.0
llm_load_print_meta: freq_scale_train = 0.25
llm_load_print_meta: n_yarn_orig_ctx = 16384
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 6.74 B
llm_load_print_meta: model size = 3.80 GiB (4.84 BPW)
llm_load_print_meta: general.name = deepseek-ai
llm_load_print_meta: BOS token = 32013 '<｜begin▁of▁sentence｜>''
llm_load_print_meta: EOS token = 32021 '<|EOT|>'
llm_load_print_meta: PAD token = 32014 '<｜end▁of▁sentence｜>'
llm_load_print_meta: LF token = 30 '?'
llm_load_tensors: ggml ctx size = 0.22 MiB
llm_load_tensors: offloading 4 repeating layers to GPU
llm_load_tensors: offloaded 4/33 layers to GPU
llm_load_tensors: CPU buffer size = 3892.62 MiB
llm_load_tensors: CUDA0 buffer size = 495.22 MiB
..................................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: freq_base = 100000.0
llama_new_context_with_model: freq_scale = 0.25
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA T1200 Laptop GPU, compute capability 7.5, VMM: yes
llama_kv_cache_init: CUDA_Host KV buffer size = 1792.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 256.00 MiB
llama_new_context_with_model: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model: CUDA_Host input buffer size = 17.04 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 296.02 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 296.00 MiB
llama_new_context_with_model: graph splits (measure): 5
time=2024-03-06T12:59:58.240+01:00 level=INFO source=dyn_ext_server.go:161 msg="Starting llama main loop"
[GIN] 2024/03/06 - 13:02:12 | 200 | 2m17s | 127.0.0.1 | POST "/api/chat"
I found other differences as well, but maybe those are intentional (it would be nice if I could change them; see the sketch below):
llama_new_context_with_model: n_ctx = 2048
vs.
llama_new_context_with_model: n_ctx = 4096
plus some differing cache and buffer sizes.
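If I read the Continue docs correctly, the context length for the chat side can be set per model via contextLength, which for the ollama provider should end up as Ollama's num_ctx. A sketch of that model entry, under that assumption:

{
  "models": [
    {
      "title": "deepseek-coder:6.7b-instruct-q4_K_M",
      "provider": "ollama",
      "model": "deepseek-coder:6.7b-instruct-q4_K_M",
      "apiBase": "http://localhost:11434",
      "contextLength": 4096
    }
  ]
}

As far as I can tell there is no equivalent setting on embeddingsProvider, so the two loads cannot currently be made identical from the config alone.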
To sum up, this issue is about two things: 1. the typo bug, and 2. a request for a fix/feature that makes it possible to use the same model for embeddings and chat via Ollama without it being reloaded each time.
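As far as I understand Ollama, a model gets reloaded whenever the requested options differ from the instance that is already loaded, so the feature request essentially amounts to Continue sending matching options on both endpoints. A sketch of an /api/embeddings request body carrying the same num_ctx as the chat request (the values here are my assumption, not what Continue actually sends today):

{
  "model": "deepseek-coder:6.7b-instruct-q4_K_M",
  "prompt": "example chunk from the codebase index",
  "options": {
    "num_ctx": 4096
  }
}

If both the /api/chat and /api/embeddings calls carried identical options like this, Ollama should be able to keep the single loaded instance and skip the redeploy.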