QUESTION: multi-threaded generation
Hi! How can I enable multi-threaded generation? Is there a flag like xinference --threads 100, i.e. --threads N / -t N to set the number of threads to use during generation?
Hi!
You can set it using the n_threads parameter. Please note that multithreading can only be applied to models running on the GGML backend.
By default, the number of threads is set to half of your CPU count: max(multiprocessing.cpu_count() // 2, 1).
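As a quick sanity check, the default described above can be computed directly with the standard library (this just evaluates the same expression, it does not query Xinference):

```python
import multiprocessing

# Default n_threads: half the logical CPU count, but never below 1.
default_threads = max(multiprocessing.cpu_count() // 2, 1)
print(default_threads)
```

On a single-core machine this still yields 1, so generation always gets at least one thread.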
Here's an example:
from xinference.client import RESTfulClient
client = RESTfulClient("http://127.0.0.1:9997")
model_uid = client.launch_model(
    model_name="baichuan",
    model_format="ggmlv3",
    size_in_billions=7,
    n_threads=4,
)
model = client.get_model(model_uid)
print(model.generate("What is the largest animal in the world?", generate_config={"max_tokens": 128}))