
Add run_inference_server.py for Running llama.cpp Built-in Server

Open Benjamin-Wegener opened this issue 9 months ago • 4 comments

This pull request introduces a new script, run_inference_server.py, which leverages llama.cpp's built-in server for more convenient and efficient inference. The script is designed to start the server with various configurable parameters, making it easier to run and manage inference tasks.

Key Features:

Server Integration: Utilizes llama.cpp's built-in server for running inference.
Configurable Parameters: Allows users to specify model path, prompt, number of tokens to predict, threads, context size, temperature, host, and port.
Cross-Platform Compatibility: Supports both Windows and Unix-based systems.
Signal Handling: Gracefully shuts down the server on receiving a SIGINT signal (Ctrl+C).

Changes Made:

New Script: Added run_inference_server.py to the repository.
CMakeLists.txt Update: Enabled the compilation of the server in CMakeLists.txt.
Command-Line Arguments: Implemented argument parsing for various configuration options.
Signal Handling: Added a signal handler to shut down the server gracefully (a rough sketch of the launcher follows this list).
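
To give a feel for the structure, here is a minimal, hypothetical sketch of such a launcher. It is not the exact contents of run_inference_server.py: the server binary location (build/bin/llama-server), the subset of options shown, and the default values are all assumptions for illustration.

```python
#!/usr/bin/env python3
# Hypothetical sketch of a launcher for llama.cpp's built-in server.
# Paths, flags, and defaults are illustrative, not the real script.
import argparse
import os
import signal
import subprocess
import sys

def main():
    parser = argparse.ArgumentParser(description="Start the llama.cpp server")
    parser.add_argument("-m", "--model", required=True, help="Path to the GGUF model file")
    parser.add_argument("-t", "--threads", type=int, default=2, help="Number of CPU threads")
    parser.add_argument("-c", "--ctx-size", type=int, default=2048, help="Context size")
    parser.add_argument("--temperature", type=float, default=0.8, help="Sampling temperature")
    parser.add_argument("--host", default="127.0.0.1", help="Host to bind the server to")
    parser.add_argument("--port", type=int, default=8080, help="Port to listen on")
    args = parser.parse_args()

    # Pick the platform-specific server binary built by llama.cpp
    # (the exact build path is an assumption here).
    binary = "llama-server.exe" if os.name == "nt" else "llama-server"
    server_path = os.path.join("build", "bin", binary)

    cmd = [
        server_path,
        "-m", args.model,
        "-t", str(args.threads),
        "-c", str(args.ctx_size),
        "--temp", str(args.temperature),
        "--host", args.host,
        "--port", str(args.port),
    ]
    proc = subprocess.Popen(cmd)

    # Forward Ctrl+C (SIGINT) to the server so it shuts down gracefully.
    def shutdown(signum, frame):
        proc.terminate()
        proc.wait()
        sys.exit(0)

    signal.signal(signal.SIGINT, shutdown)
    proc.wait()

if __name__ == "__main__":
    main()
```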

Usage:

python run_inference_server.py --model path/to/model --prompt "Your prompt here" --n-predict 4096 --threads 2 --ctx-size 2048 --temperature 0.8 --host 127.0.0.1 --port 8080
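
Once the server is running, prompts can be submitted over HTTP. As a rough illustration (assuming the stock llama.cpp server API with its /completion endpoint on the host and port chosen above, not behavior specific to this script):

```python
# Rough illustration of querying the running server; assumes the standard
# llama.cpp /completion endpoint at the host/port passed to the launcher.
import json
import urllib.request

payload = {
    "prompt": "Your prompt here",
    "n_predict": 128,
    "temperature": 0.8,
}

req = urllib.request.Request(
    "http://127.0.0.1:8080/completion",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    result = json.loads(resp.read().decode("utf-8"))

# The generated text is returned in the "content" field.
print(result["content"])
```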

This enhancement aims to simplify the process of running inference tasks and improve the overall user experience.

Note:

The -cnv flag has been removed, as it is not supported by the server; for chat-style interaction, see the sketch after these notes.
Ensure that the model path and other parameters are correctly specified before running the script.
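
Since the server does not take the -cnv flag, chat-style turns can instead go through the server's OpenAI-compatible chat endpoint. This is a sketch under the assumption that the bundled llama.cpp build exposes /v1/chat/completions, as upstream llama-server does:

```python
# Sketch of chat-style usage without -cnv; assumes the server build exposes
# llama.cpp's OpenAI-compatible /v1/chat/completions endpoint.
import json
import urllib.request

payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Your prompt here"},
    ],
    "temperature": 0.8,
}

req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    result = json.loads(resp.read().decode("utf-8"))

# The assistant reply follows the usual OpenAI-style response shape.
print(result["choices"][0]["message"]["content"])
```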

Benjamin-Wegener avatar Apr 21 '25 17:04 Benjamin-Wegener

I am thankful for this. It works very nicely. Thank you, Benjamin. Thank you very much.

gnusupport avatar Apr 26 '25 10:04 gnusupport

@microsoft-github-policy-service agree

Benjamin-Wegener avatar May 01 '25 04:05 Benjamin-Wegener

@microsoft-github-policy-service agree

Benjamin-Wegener avatar May 02 '25 13:05 Benjamin-Wegener

Hi, when I use cnv mode, it correctly gives the exact answer.

But when I use llama server mode and hit the endpoint with a question, it returns unwanted text along with the correct answer (n_predict is set to 20); it keeps generating words until the 20-token limit is reached.

1. Why does it not behave like cnv mode?
2. Does this mean server mode can't be used?
3. Is there any solution for this?

Thank you

sakithajayasinghe avatar Jun 12 '25 01:06 sakithajayasinghe