Llama server prediction
Hi, when I use cnv mode it correctly gives the exact answer.
But when I use llama server mode and hit the endpoint with a question, it returns unwanted text along with the correct answer (n_predict is set to 20); it keeps generating words until the 20-token limit is reached.
1. Why doesn't it behave like cnv mode?
2. Does that mean server mode can't be used?
3. Is there any solution for this?
Thank you
https://github.com/microsoft/BitNet/blob/main/run_inference_server.py#L41 At the moment it seems cnv mode is unsupported by the server; the pull request notes this limitation: https://github.com/microsoft/BitNet/pull/204#issue-3008871631
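In the meantime, one possible workaround is to pass stop strings in the completion request so generation ends at the answer instead of running until the n_predict limit. Here is a minimal sketch, assuming the server behind run_inference_server.py exposes a llama.cpp-style `/completion` endpoint on localhost:8080; the host, port, prompt template, and stop strings are all assumptions and depend on how the server was launched and on your model's chat format:

```python
import requests

# Assumed host/port; adjust to match how you started the server.
SERVER_URL = "http://127.0.0.1:8080/completion"

payload = {
    # Hand-rolled instruct-style prompt, since the server doesn't apply
    # the cnv chat template for you. The exact format is model-dependent.
    "prompt": "User: What is the capital of France?\nAssistant:",
    "n_predict": 64,                       # upper bound only
    "stop": ["User:", "\nUser", "</s>"],   # assumed stop markers; cut off before the model continues the dialogue
    "temperature": 0.0,
}

response = requests.post(SERVER_URL, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["content"].strip())
```

This doesn't replace proper cnv support, but the stop strings usually end generation before the unwanted continuation appears.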
I too would like this feature added, if possible.