Llama server prediction
Hi, when I use cnv mode it correctly gives the exact answer.
But when I use llama server mode and hit the endpoint with a question, it returns unwanted text along with the correct answer (n_predict is set to 20); it keeps generating words until the 20-token limit is reached.
1. Why doesn't it behave like cnv mode?
2. Does that mean server mode can't be used?
3. Is there any solution for this?
Thank you
https://github.com/microsoft/BitNet/blob/main/run_inference_server.py#L41 At the moment it seems cnv mode is unsupported by the server; the pull request notes this limitation: https://github.com/microsoft/BitNet/pull/204#issue-3008871631
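In the meantime, one possible workaround is to pass stop strings in the completion request so generation ends at the answer instead of running until the n_predict limit. Here is a minimal sketch, assuming the server behind run_inference_server.py exposes a llama.cpp-style `/completion` endpoint on localhost:8080; the host, port, prompt template, and stop strings are all assumptions and depend on how the server was launched and on your model's chat format:

```python
import requests

# Assumed host/port; adjust to match how you started the server.
SERVER_URL = "http://127.0.0.1:8080/completion"

payload = {
    # Hand-rolled instruct-style prompt, since the server doesn't apply
    # the cnv chat template for you. The exact format is model-dependent.
    "prompt": "User: What is the capital of France?\nAssistant:",
    "n_predict": 64,                       # upper bound only
    "stop": ["User:", "\nUser", "</s>"],   # assumed stop markers; cut off before the model continues the dialogue
    "temperature": 0.0,
}

response = requests.post(SERVER_URL, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["content"].strip())
```

This doesn't replace proper cnv support, but the stop strings usually end generation before the unwanted continuation appears.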
I too would like this feature added, if possible.