
Add embedding mode with arg flag. Currently working

StrikingLoo opened this pull request 3 years ago • 14 comments

Hi everyone, I took a stab at adding an embedding mode, where we print the sentence embedding for the input instead of generating more tokens. If I only add the compute-and-print inside llama_eval itself, it works. But for some reason, after adding the boolean flag (even if I create a second function with the same preamble that only prints the embeddings at the end), it stops working. Could anyone take a look and tell me where I went wrong or what I am missing, so that we can add this capability?

Thank you!

StrikingLoo avatar Mar 19 '23 06:03 StrikingLoo

It's true that I am [returning without freeing], but we never reach that line. I forgot to uncomment it for the PR, but before we get there, the program fails on the ggml_graph_compute(ctx0, &gf); line.

We never reach that other part.

On Sun, Mar 19, 2023 at 8:09 PM, taher commented on this pull request, in main.cpp (https://github.com/ggerganov/llama.cpp/pull/282#discussion_r1141579379):

@@ -936,12 +943,27 @@ int main(int argc, char ** argv) {
             printf(ANSI_COLOR_YELLOW);
         }
+        if (params.embedding){
+            printf("got right before second call.\n");
+            const int64_t t_start_us = ggml_time_us(); //HERE
+            if (!llama_eval(model, params.n_threads, n_past, embd, logits, mem_per_token, true)) {
+                fprintf(stderr, "Failed to predict\n");
+                return 1;
+            }
+            //ggml_free(model.ctx);
+            if (params.use_color) {
+                printf(ANSI_COLOR_RESET);
+            }
+            return 0;

Looks like you're returning without freeing.

ggml_free(model.ctx);


StrikingLoo avatar Mar 20 '23 05:03 StrikingLoo

The input text's embeddings.

On Tue, Mar 21, 2023 at 9:30 AM, taher commented on this pull request, in main.cpp (https://github.com/ggerganov/llama.cpp/pull/282#discussion_r1143682426):

@@ -936,12 +943,27 @@ int main(int argc, char ** argv) {
             printf(ANSI_COLOR_YELLOW);
         }
+        if (params.embedding){
+            printf("got right before second call.\n");
+            const int64_t t_start_us = ggml_time_us(); //HERE
+            if (!llama_eval(model, params.n_threads, n_past, embd, logits, mem_per_token, true)) {

Are you trying to capture input text's embeddings or predicted text embeddings?


StrikingLoo avatar Mar 21 '23 16:03 StrikingLoo

Do the input tokens need to go through many layers to obtain their embeddings? My guess is that the input embeddings should be obtainable earlier, so that it isn't necessary to loop through all the layers and also compute the logits.

Perhaps we should return as soon as the input tokens are projected into the embedding space, since that step already gives us the vector representation we need?

nullhook avatar Mar 22 '23 00:03 nullhook

The embeddings should be the output of the last attention layer, corresponding to the last token in the input. Say I input "I am a dog": the transformer ends up mapping each token to an embedding (each token attending to every other). I want the embedding that the last attention layer assigns to the last token. Not the logits themselves, but certainly not just the word embeddings from the beginning, before any forward pass through the attention layers.
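Concretely, if the output of the final layer for the N input tokens is a buffer of N rows of n_embd floats, what I want is just the last row. A minimal sketch of that selection (an illustrative helper, not the code in this PR):

#include <cstddef>
#include <vector>

// Extract the embedding of the last input token from the final layer's
// output, stored row-major as n_tokens rows of n_embd floats.
std::vector<float> last_token_embedding(const std::vector<float> & hidden,
                                        int n_tokens, int n_embd) {
    const float * last_row = hidden.data() + (std::size_t)(n_tokens - 1) * n_embd;
    return std::vector<float>(last_row, last_row + n_embd);
}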

StrikingLoo avatar Mar 22 '23 01:03 StrikingLoo

So the input embeddings obtained at the starting layer are static and not contextual? And for the embeddings to include context for each input token in the sequence, the input has to go through all the layers?

nullhook avatar Mar 22 '23 02:03 nullhook

Sorry, I can't tell whether those are assertions or questions. My understanding is that we want the whole input to go through all the attention layers, and then we take the representation for the last token. That is what this PR is supposed to be doing. Right now I think it is working, though I would say we should test it more before merging (so far it doesn't break and it gives an output of the correct shape, but I would test further).

StrikingLoo avatar Mar 22 '23 02:03 StrikingLoo

Okay, I think this is working well. I tested it with multiple inputs and the embeddings make a lot of sense. Just to clarify, what this is doing is:

  • Take the input (a sentence, etc.)
  • Feed it through all the attention layers. In the end we have N embeddings of size n_embd.
  • Print the embedding corresponding to the last token in the input.
  • Stop execution.

Right now it works without errors both in embedding mode and non-embedding mode. The embeddings look coherent. More exhaustive testing could be done by someone with access to better compute, but I tried 8 inputs, some semantically related and some not, and the cosine similarities between their embeddings made sense. I made some sentences end with the exact same last word so that, if we were merely outputting word embeddings with no forward pass, the similarity would be maximal; it was not, so this looks good.

Do let me know if you think other changes are needed before merging, and thank you for your help in getting this to work!

StrikingLoo avatar Mar 22 '23 02:03 StrikingLoo

Okay, merged with master again and moved everything to llama.cpp. I addressed both changes: everywhere it makes sense, the boolean now matches the command-line argument (the memory-size test run at the beginning needs it hardcoded to false), and I removed the spurious time check.

StrikingLoo avatar Mar 22 '23 06:03 StrikingLoo

How are you testing the correctness of the embeddings?

nullhook avatar Mar 22 '23 13:03 nullhook

I tried multiple inputs and judged by the cosine similarities between semantically similar and dissimilar sentences. I am more than open to other tests if anyone can think of them.

Example pair-wise correlations

Napoleonic_France -- cats_are_cute:      -0.027480708515286424
Napoleonic_France -- I_love_dogs:        -0.31697662922849257
Napoleonic_France -- I_love_cats:        -0.301660192174504
Napoleonic_France -- Victorian_England:   0.8798725429541867
cats_are_cute -- Napoleonic_France:      -0.027480708515286424
cats_are_cute -- I_love_dogs:             0.576915180966497
cats_are_cute -- I_love_cats:             0.6115158942552829
cats_are_cute -- Victorian_England:      -0.01785591030506376
I_love_dogs -- Napoleonic_France:        -0.31697662922849257
I_love_dogs -- cats_are_cute:             0.576915180966497
I_love_dogs -- I_love_cats:               0.9332809579375094
I_love_dogs -- Victorian_England:        -0.3006025355059021
I_love_cats -- Napoleonic_France:        -0.301660192174504
I_love_cats -- cats_are_cute:             0.6115158942552829
I_love_cats -- I_love_dogs:               0.9332809579375094
I_love_cats -- Victorian_England:        -0.2838594074158372
Victorian_England -- Napoleonic_France:   0.8798725429541867
Victorian_England -- cats_are_cute:      -0.01785591030506376
Victorian_England -- I_love_dogs:        -0.3006025355059021
Victorian_England -- I_love_cats:        -0.2838594074158372

This also made me think it may be desirable to normalize the embeddings (to norm 1, not the normalization layer itself). I was doing it in post-processing. Do you think it would be good to add this? I'm not sure how to do it in ggml, but we would just need to divide each embedding by its norm.
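Just so we're talking about the same thing, the post-processing I mean is simply scaling by the L2 norm; a minimal sketch (plain C++, not ggml):

#include <cmath>
#include <vector>

// Scale an embedding to unit L2 norm ("norm 1"), i.e. divide by its length.
void normalize(std::vector<float> & v) {
    double norm = 0.0;
    for (float x : v) norm += (double) x * x;
    norm = std::sqrt(norm);
    if (norm > 0.0) {
        for (float & x : v) x = (float) (x / norm);
    }
}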

StrikingLoo avatar Mar 22 '23 16:03 StrikingLoo

I'm curious whether normalizing reduces the dimensionality of the embedding space. If the consumer's goal is only to compute cosine similarity from the embeddings, then I think it makes sense to normalize. However, I suggest leaving the embeddings as raw vectors and letting the consumer decide whether or not to normalize them. Others may have a different opinion on this.

nullhook avatar Mar 22 '23 18:03 nullhook

I agree, normalizing would be lossy. I would assume that e.g. the GPT-3 API gives you normalized embeddings, but it's easy enough to normalize on the consumer's end, so I wouldn't sweat it.

I will look into the changes ggerganov suggested + there is a new merge conflict to solve.

StrikingLoo avatar Mar 22 '23 18:03 StrikingLoo

It would be helpful for this PR to add, in the example folder, an example that generates input embeddings, normalizes them, and computes cosine similarity.
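Roughly, the post-processing part of such an example could be a sketch like the one below; how the embeddings themselves get produced depends on how this PR ends up exposing them, so that part is left out here (see also the normalize() sketch above):

#include <cmath>
#include <cstddef>
#include <vector>

// Cosine similarity between two embeddings. On vectors already normalized
// to unit length this reduces to a plain dot product.
double cosine_similarity(const std::vector<float> & a, const std::vector<float> & b) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) {
        dot += (double) a[i] * b[i];
        na  += (double) a[i] * a[i];
        nb  += (double) b[i] * b[i];
    }
    return dot / (std::sqrt(na) * std::sqrt(nb) + 1e-12);
}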

nullhook avatar Mar 22 '23 21:03 nullhook

Hi ggerganov, thank you for the instructions. I did steps 1, 2, and 4, but I am not sure what we want the final program to look like for step 3.

"Keep llama_eval_internal() as original and only change to copy the embeddings into the new context buffer if it is not empty (i.e. embeddings parameter was set to true during init). You will need to store the ggml tensor at the end in a separate variable so you can access it after the ggml_graph_compute() call, similar to the logits. No need for an alternative ggml_graph_compute() call as you have proposed"

Do I still keep the behavior where we check and, if embeddings == true, only run until we have the embeddings and then return? If I do, I can't leave llama_eval_internal() identical. Plus, there is that test run we do at the beginning (something about memory cost) that requires me to pass false for the embeddings, even if we will keep them later.
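Is the idea roughly the following sketch? (The variable names here are made up and this is only my reading of step 3, not actual code.)

// inside llama_eval_internal(), graph construction unchanged ...
struct ggml_tensor * embeddings_tensor = cur; // remember the tensor holding the final hidden states

ggml_graph_compute(ctx0, &gf);

// after compute: if embeddings were enabled at init time, the context's
// embedding buffer is non-empty, so copy the last token's row into it,
// the same way the logits are copied out
if (!lctx.embedding.empty()) {
    memcpy(lctx.embedding.data(),
           (float *) embeddings_tensor->data + (size_t)(N - 1) * n_embd,
           n_embd * sizeof(float));
}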

The model load / config is already as specified.

StrikingLoo avatar Mar 23 '23 00:03 StrikingLoo

@StrikingLoo Just pushed a change. Give it a try and let me know if it works

ggerganov avatar Mar 23 '23 20:03 ggerganov

This looks perfect to me. Both generation and embeddings work without errors. Just to make sure: would we be merging like this, or should I add the display_embeddings code under "// TODO: print / use the embeddings"?

Edit: adding this here instead of posting multiple comments. Eventually I would like a version where either (a) if we pass the embedding argument, nothing except the embeddings is printed to stdout (so the command can be easily piped), or (b) if we pass --embedding --output-path 'path.emb' or some such, the embeddings are stored in a file. But I'm okay with the program not having that by default and users adding it themselves. I just think many people will prefer to use the program as a black box without reading the code. Maybe the API mode you mentioned before already plans to add something like this.

StrikingLoo avatar Mar 24 '23 00:03 StrikingLoo

This will become a separate example program called embedding. For now I will merge it like this, because I want to start refactoring; later we will move it into a separate program.

ggerganov avatar Mar 24 '23 15:03 ggerganov

@ggerganov is this the correct command?

./embedding -m models/7B/ggml-model-q4_0.bin -p "ciao" -n 512 

It seems it's not using the prompt passed with -p. In fact, I do not see in the output the log line from embedding.cpp that should print it:

main: seed = 1679853514
llama_model_load: loading model from 'models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: type    = 1
llama_model_load: ggml ctx size = 4273.34 MB
llama_model_load: mem required  = 6065.34 MB (+ 1026.00 MB per state)
llama_model_load: loading model part 1/1 from 'models/7B/ggml-model-q4_0.bin'
llama_model_load: .................................... done
llama_model_load: model size =  4017.27 MB / num tensors = 291
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
2.177604 -1.095253
...
04 0.534440 0.732717 0.781988 -1.836264 -0.860989 -0.564879 0.084990 0.838598 1.210304 -0.441369 -1.963783 2.096257 

llama_print_timings:        load time =  1124.05 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time =   464.57 ms /     3 tokens (  154.86 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings:       total time =  2215.86 ms

loretoparisi avatar Mar 26 '23 17:03 loretoparisi