ipex Llama.cpp server fails with Phi3 models
Hi,
I've been trying to serve several Phi3 models with the Llama.cpp server created by ipex-llm's init-llama-cpp script.
When I serve with this version, I run into two problems:
- The server doesn't stop on the `<|end|>` token; it just keeps generating text, so I have to set the `-n` parameter to make it stop somewhere (see the sketch below).
- I also get inconsistent output, mostly gibberish.
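For reference, this is roughly how I start the server; the model path and flag values below are placeholders rather than my exact setup:

```bash
# Binaries are the ones symlinked by ipex-llm's init-llama-cpp.
source /opt/intel/oneapi/setvars.sh
export SYCL_CACHE_PERSISTENT=1

# Model path is a placeholder. Without the -n cap, generation never
# terminates on its own.
./llama-server \
  -m ./Phi-3-medium-4k-instruct-Q4_K_M.gguf \
  -ngl 99 -c 4096 \
  --host 0.0.0.0 --port 8080 \
  -n 256
```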
Running the same inference on the upstream Llama.cpp server with the SYCL backend works fine.
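For comparison, the upstream build that behaves correctly for me looks roughly like this (flags follow llama.cpp's SYCL docs; treat the exact values as illustrative):

```bash
# Upstream llama.cpp built with the SYCL backend (icx/icpx from oneAPI).
source /opt/intel/oneapi/setvars.sh
cmake -B build -DGGML_SYCL=ON \
  -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release

# Same placeholder model; here the output is coherent and stops normally.
./build/bin/llama-server -m ./Phi-3-medium-4k-instruct-Q4_K_M.gguf \
  -ngl 99 -c 4096 --port 8081
```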
The failure occurs with Phi3-medium (Q6, Q4), Phi3-mini, and Phi3.5-mini alike.
I have the latest versions of everything (driver, ipex-llm), and I'm using an Arc A770 GPU.
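For completeness, this is the kind of request I send to both servers; the prompt follows the Phi3 chat template, and the contents are just an example:

```bash
curl http://localhost:8080/completion -d '{
  "prompt": "<|user|>\nWhat is the capital of France?<|end|>\n<|assistant|>\n",
  "n_predict": 128
}'
```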
Thanks!
Hi @hvico, we are trying to reproduce your issue. Could you please share more details of what is displayed on the serving side?
Hi @hvico, could you please also share the exact command you ran so that we can try to reproduce it?