QUESTION: multi-threaded generation
Hi! How can I enable multi-threaded generation? Is there a flag like xinference --threads 100, i.e. --threads N / -t N to set the number of threads to use during generation?
Hi!
You can set it using the n_threads parameter. Please note that multithreading can only be applied to models running on the GGML backend.
By default, the number of threads is set to half of your CPU count: max(multiprocessing.cpu_count() // 2, 1).
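As a quick sanity check, the default described above can be computed directly with the standard library (this just evaluates the same expression, it does not query Xinference):

```python
import multiprocessing

# Default n_threads: half the logical CPU count, but never below 1.
default_threads = max(multiprocessing.cpu_count() // 2, 1)
print(default_threads)
```

On a single-core machine this still yields 1, so generation always gets at least one thread.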
Here's an example:
from xinference.client import RESTfulClient
client = RESTfulClient("http://127.0.0.1:9997")
model_uid = client.launch_model(
    model_name="baichuan",
    model_format="ggmlv3",
    size_in_billions=7,
    n_threads=4,
)
model = client.get_model(model_uid)
print(model.generate("What is the largest animal in the world?", generate_config={"max_tokens": 128}))