Kokoro-FastAPI

Support more than 1 stream at the same time.

sipvoip opened this issue 1 year ago • 17 comments

I noticed that when Kokoro is running, it does not use all of the GPU. However, the latency gets very bad if I send two requests simultaneously. Is there a way to optimize it to support more than one stream simultaneously?

P.S. This is a fantastic project! How do we give back? I don't see a donation link on the readme.

sipvoip avatar Feb 03 '25 13:02 sipvoip

Yup, I refactored in an earlier build towards that, but a few other changes took precedence before it could be optimized to take advantage of it. Should be able to configure it back on in the next version or so.

😅 And thanks! Will add one in. I did have a BuyMeACoffee link somewhere in there, but will set up the proper GitHub sponsor file.

remsky avatar Feb 04 '25 07:02 remsky

This is awesome! I was just trying to solve the concurrency issue. ChatGPT was recommending batching requests and a few other solutions... noticed CPU and GPUs were not that used up and was really trying to figure this out! Thx for your work on this, @remsky !!!

Here is the test script I was using for concurrent requests:

import asyncio
import aiohttp
import time

async def send_request(session, index, delay, results):
    """Send a request with a staggered delay and measure response time."""
    await asyncio.sleep(delay * index)
    payload = {
        "model": "kokoro",
        "input": "Testing my friend is a great thing to do!",
        "voice": "af_nova",
        "response_format": "mp3",
        "speed": 1,
        "stream": True,
        "return_download_link": False
    }
    start_time = time.time()
    async with session.post("http://localhost:8880/v1/audio/speech", json=payload) as response:
        try:
            response_text = await response.text()
        except UnicodeDecodeError:
            response_text = await response.read()  # Read raw bytes if decoding fails
    end_time = time.time()
    elapsed_time = end_time - start_time
    results[index] = elapsed_time
    print(f"Request {index+1}: {elapsed_time:.2f} sec")

async def main(num_requests, delay):
    """Run multiple concurrent requests and measure time statistics."""
    async with aiohttp.ClientSession() as session:
        results = [None] * num_requests  # Store times per request
        tasks = [
            send_request(session, i, delay, results)
            for i in range(num_requests)
        ]
        await asyncio.gather(*tasks)

    # Compute statistics
    completed_times = [t for t in results if t is not None]
    if completed_times:
        avg_time = sum(completed_times) / len(completed_times)
        min_time = min(completed_times)
        max_time = max(completed_times)
        print(f"\nTotal time for {num_requests} requests: {sum(completed_times):.2f} sec")
        print(f"Avg time per request: {avg_time:.2f} sec")
        print(f"Min time: {min_time:.2f} sec | Max time: {max_time:.2f} sec")
    else:
        print("No requests completed.")

if __name__ == "__main__":
    num_requests = 150  # Number of requests
    delay = 0.1  # Stagger delay in seconds
    start_time = time.time()
    asyncio.run(main(num_requests, delay))
    total_duration = time.time() - start_time
    print(f"\nTotal script execution time: {total_duration:.2f} sec")

bluestarforever avatar Feb 08 '25 06:02 bluestarforever
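
One note on the script above: because it reads the entire response body, the reported times measure total generation time rather than time-to-first-audio, which is usually what matters for streaming latency. A rough variant (same endpoint and payload as above; purely a sketch, not part of the original script) that times the first streamed chunk instead:

import asyncio
import time
import aiohttp

async def time_to_first_chunk():
    """Measure how long the server takes to deliver the first audio chunk."""
    payload = {
        "model": "kokoro",
        "input": "Testing my friend is a great thing to do!",
        "voice": "af_nova",
        "response_format": "mp3",
        "speed": 1,
        "stream": True,
    }
    start = time.time()
    async with aiohttp.ClientSession() as session:
        async with session.post("http://localhost:8880/v1/audio/speech", json=payload) as response:
            async for chunk in response.content.iter_chunked(4096):
                print(f"First audio chunk after {time.time() - start:.2f} sec ({len(chunk)} bytes)")
                break

if __name__ == "__main__":
    asyncio.run(time_to_first_chunk())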

Adding multiple model instances and better concurrency handling here, though still in progress and adding some Locust/load testing before it'll find its way onto main.

Feel free to take a look or test, but will be aiming to get it onto a release shortly

https://github.com/remsky/Kokoro-FastAPI/tree/nightly

remsky avatar Feb 10 '25 06:02 remsky

Awesome! Very nice work & look forward to it 👍

bluestarforever avatar Feb 10 '25 06:02 bluestarforever

Finally getting a chance to check out the nightly! Thank you! Do you happen to know the proper way to run simultaneous instances on this one? That way I can test it for you. Have a good day!!!

bluestarforever avatar Feb 18 '25 23:02 bluestarforever

Uh, how does one test nightlies? Any attempt to clone a nightly fails (see below), even though I can get to it easily enough with any browser.

[screenshots of the failed clone attempts]

RBEmerson970 avatar Feb 19 '25 14:02 RBEmerson970

Just clone the repo normally and switch to the "nightly" branch.

PushLimits avatar Feb 20 '25 05:02 PushLimits

OK, got the clone part. I'll put that in an area away from the "production" clone. EDIT: DISREGARD: How do I switch to the nightly branch? Assume either "docker run" (preferred) or "docker build".

ADDED: See this grab of the GitHub Desktop tool - the branch selector's on the top line.

[screenshot of GitHub Desktop showing the branch selector]

RBEmerson970 avatar Feb 20 '25 14:02 RBEmerson970

OK, I went through the "build" workflow (vs. "run") and wound up with a container which works, although it shows no ports. That is, if I browse to http://localhost:8880/web/ I get a working FastKoko page that speaks. Watching the log with PowerShell, I see the expected results and I hear speech. The "production" container is definitely stopped (i.e., not running). But...

What have I really built? That is, how do I confirm I've really built the current nightly as of 11:00 EST, 20 Feb 2025?

FWIW VERSION contains v0.2.1

RBEmerson970 avatar Feb 20 '25 16:02 RBEmerson970

My apologies for the topic drift. Please follow the "nightly" thread here: #191

RBEmerson970 avatar Feb 20 '25 20:02 RBEmerson970

Let us know if you need any help on this as I know you are busy!

bluestarforever avatar Mar 02 '25 18:03 bluestarforever

I tried testing the nightly, but couldn't figure out how to properly set what is perhaps "max_concurrent_models"... I attempted to insert it into the payload:

payload = {
    "model": "kokoro",
    "input": "Testing my friend is a great thing to do!",
    "voice": "af_nova",
    "response_format": "mp3",
    "speed": 1,
    "stream": True,
    "return_download_link": False,
    "max_concurrent_models": 20
}

But it didn't seem to have any effect. It didn't seem to run multiple models (or instances), and there actually seemed to be a lot of GPU capacity left over. Also, it would be interesting to allow it to run as many concurrent models as possible up to a certain limit of GPU utilization (not sure if that is easy to do).

EDIT: I figured out a little bit more of what's going on here. I modified the Python file itself and the config file, then restarted the API, and I find that the models load into VRAM, but for some reason the generation time is still basically the same as when it's running the default number of models. I didn't try changing the other config options in the Python config file yet, though.

bluestarforever avatar Mar 03 '25 20:03 bluestarforever

True. This feature is urgently required.

CodePothunter avatar Mar 07 '25 03:03 CodePothunter

I've solved this problem by managing multiple GPU instances (as well as running multiple models on a single GPU). I also created a queue for managing concurrent requests.

Total Requests: 100
Successful Requests: 100
Failed Requests: 0
Success Rate: 100.00%
Test Duration: 4.94 seconds
QPS: 20.23
Average Latency: 897.08 ms
P95 Latency: 986.18 ms
P99 Latency: 989.64 ms
Audio Throughput: 0.00 MB/s
Max Concurrent Requests: 20

CodePothunter avatar Mar 07 '25 07:03 CodePothunter
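
For illustration only, a minimal sketch of the queue-plus-multiple-instances approach described above. This is not CodePothunter's actual code: load_model() and model.synthesize() are hypothetical placeholders for whatever the real engine exposes, and the GPU/instance counts are made up.

import asyncio

# Hypothetical placeholder -- NOT the real Kokoro-FastAPI API.
def load_model(device: str):
    """Load one model instance onto the given device (placeholder)."""
    raise NotImplementedError

async def worker(model, queue: asyncio.Queue):
    """Pull (text, future) jobs off the shared queue and run them on this instance."""
    while True:
        text, fut = await queue.get()
        try:
            # Run the (presumably blocking) synthesis call off the event loop.
            audio = await asyncio.to_thread(model.synthesize, text)
            fut.set_result(audio)
        except Exception as exc:
            fut.set_exception(exc)
        finally:
            queue.task_done()

async def start_pool(gpu_ids=(0, 1), instances_per_gpu=2):
    """Spread several model instances across GPUs, all fed by one shared queue."""
    queue: asyncio.Queue = asyncio.Queue()
    for gpu in gpu_ids:
        for _ in range(instances_per_gpu):
            model = load_model(f"cuda:{gpu}")
            asyncio.create_task(worker(model, queue))
    return queue

async def synthesize(queue: asyncio.Queue, text: str):
    """Enqueue a request and wait for whichever instance picks it up."""
    fut = asyncio.get_running_loop().create_future()
    await queue.put((text, fut))
    return await fut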

Excellent work! Do you happen to have a brief step-by-step for reproducing your work using the nightly? Thank you!

bluestarforever avatar Mar 07 '25 16:03 bluestarforever

For now, can't you just set the uvicorn workers to >1? It would spawn, for example, 3 worker processes running the FastAPI app with the model duplicated 3 times, which would allow up to 3 concurrent connections.

You can set it in the docker/scripts/entrypoint.sh

https://www.uvicorn.org/deployment/

richardr1126 avatar Mar 10 '25 03:03 richardr1126
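
For reference, uvicorn's worker count can also be set from Python if the server is launched that way rather than from the shell script. A minimal sketch (the "api.src.main:app" import path is illustrative and may not match the repo's actual module; uvicorn only honors workers > 1 when the app is given as an import string):

import uvicorn

if __name__ == "__main__":
    uvicorn.run(
        "api.src.main:app",  # illustrative import path; adjust to the actual app module
        host="0.0.0.0",
        port=8880,
        workers=3,           # one FastAPI process (and model copy) per worker
    )

Note that each worker process loads its own copy of the model, so VRAM usage scales with the worker count.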

Let us know if we can help get this feature "live" my friend!!!

bluestarforever avatar Apr 12 '25 19:04 bluestarforever