Memory leak and performance issue
Describe the bug
The memory usage is 1.8GB after generating the first sentence.
But as more sentences are generated, the memory usage keeps growing steadily.
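For anyone trying to confirm this, a minimal way to watch the growth is to log the process's peak RSS between generations. This is a sketch, not Kokoro-specific: `synthesize` here is a stand-in for whatever call you make into Kokoro-FastAPI or the Kokoro library.

```python
import gc
import resource  # Unix-only; on Windows, psutil would be needed instead


def rss_kb() -> int:
    """Peak resident set size of this process (kilobytes on Linux, bytes on macOS)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def profile_generations(synthesize, sentences):
    """Call synthesize() once per sentence and record peak RSS after each call.

    A steadily increasing series suggests a leak; a flat one suggests
    memory is being reclaimed between calls.
    """
    samples = []
    for text in sentences:
        synthesize(text)   # stand-in for the actual Kokoro call
        gc.collect()       # rule out merely-unreachable Python objects
        samples.append(rss_kb())
    return samples
```

Note that `ru_maxrss` is a high-water mark, so it never decreases; that is fine for spotting growth, but a flat line only proves the peak stopped climbing.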
Branch / Deployment: master branch as of 2025-03-27.
Operating System
macOS; the server was started via start-gpu_mac.sh.
I can also replicate this on Linux (CPU) in a Docker container.
It also appears that the problem may lie in how Kokoro-FastAPI uses the Kokoro library, or in the Kokoro library itself. I am doing some more investigation to confirm where the leak is.
I've hammered Fast Koko with lots of text under Win 11 and never seen a hint of memory leak.
Unfortunately the machine's down with epic overheating problems, so no further tests until next week at the earliest (ARGH!!!).
Ok so I can replicate the issue with kokoro itself. I have submitted an issue: https://github.com/hexgrad/kokoro/issues/152
@fireblade2534 Nice to know it's somebody else's problem for a change. ;D
I generated a super long audio file (1 hour 9 seconds) using Kokoro-FastAPI on my Mac Mini M4, MPS GPU accelerated.
As the memory usage goes up steadily, the generation speed goes down steadily.
If we can fix the memory leak, I believe Kokoro's performance will improve a lot.
FWIW, I think this is a Mac-specific issue. I've repeatedly generated long MP3s which didn't take much time (on the order of a couple of minutes) with texts of ~56K characters, and they ran without a problem. This is on an i9 with an RTX 4090, under Docker on Win 11.
The one problem with long (>~12 minutes) texts is the readback on Fast Koko goes from the full text to reading back only the opening line of subsequent chunks. Note this is as of about a week ago. Unfortunately the system's out for service, so I can't repeat the tests at the moment.
I can replicate the memory leak issue on Ubuntu latest.
What OS do you run it on? I have done all my testing on Ubuntu Linux.
Er, please refer to my post. :)
Update:
On my Mac Mini M4, the issue only occurs when I use the torch MPS backend. If I use the CPU backend, the issue does not occur.
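For anyone who wants to pin the backend while the MPS leak is open, a small device-selection guard works. This is a generic sketch, not Kokoro-FastAPI's actual config mechanism: `prefer_mps` is a hypothetical flag, and the import fallback keeps it runnable even without torch installed.

```python
def pick_device(prefer_mps: bool = False) -> str:
    """Return the torch device string to run the model on.

    Defaults to CPU so the MPS-specific leak is avoided; pass
    prefer_mps=True to opt back in once the leak is fixed.
    """
    try:
        import torch
    except ImportError:
        return "cpu"  # no torch at all: CPU is the only option
    if prefer_mps and torch.backends.mps.is_available():
        return "mps"
    return "cpu"
```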
I seem to have found a functioning workaround for this until the memory leak gets solved/patched through an update.
The trick is to run Kokoro-FastAPI in Docker, following the guide on Open WebUI's "docs/getting started" page in the "Text-to-Speech" section:
https://docs.openwebui.com/tutorials/text-to-speech/Kokoro-FastAPI-integration/
Once the container is installed, stop it and then run this CLI command:
docker update --memory="4g" --memory-swap="5g" <container_id/name>
This hard-limits the Kokoro container to 4 GB of RAM and 1 GB of swap (5g = 4 GB RAM + 1 GB swap). This is very useful for any container that might have a runaway memory leak.
Now Kokoro and the lovely af_sarah voice is running stable on my setup and stays confined within those limits. Without those limits Kokoro could easily swallow 20-30 GB of RAM over time.
I'm using the CPU version of Kokoro-FastAPI, in case that matters: docker run -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-cpu
To monitor and double check the Kokoro container's resource usage in realtime, use this CLI: docker stats <container_id/name>
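If you'd rather bake the limit in from the first start instead of running `docker update` afterwards, the same caps can be passed directly to `docker run` (image name taken from the post above; adjust the numbers to taste):

```shell
# Hard-cap the container at 4 GB RAM + 1 GB swap from the first start
docker run -d -p 8880:8880 \
  --memory="4g" --memory-swap="5g" \
  --restart always \
  ghcr.io/remsky/kokoro-fastapi-cpu
```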
isn't that just going to cause it to choke at 4gb?
It has worked for weeks without crashing under those memory constraints. Kokoro uses about 1.5 to 3 GB of the 4 GB limit. The memory limit doesn't seem to do anything other than stop the memory leak from spinning out of control 😊
I'm guessing that without the memory limit, Kokoro just keeps piling those TTS audio files into RAM indefinitely, but when limited to 4 GB, the Kokoro container is forced to purge previous audio files from its RAM when rendering new ones.
If you encounter any issues with very long replies/audio files, I guess you can adjust the RAM limit to 6 or 8 GB; it doesn't have to be 4. But 4 seems to run without issues, leaving more space for the bigger models and multitasking.
Just to be safe I run all containers with this setting applied from CLI:
docker update --restart always $(docker ps -q)
Specs: M4 Max 128 GB, macOS Sequoia, Ollama, Open WebUI & Docker Desktop
Ah, ok. I'm actually using Kokoro directly as a backbone for a different software context (ASR + permutative voice model generation). I just found this thread about the Kokoro memory leak on Google, but yeah, it's a problem in Kokoro's KPipeline. Kokoro can actually run continuously under a 1 GB VRAM footprint if you manage the memory on each call. Personally, I wouldn't rely on Kokoro's memory management logic (or lack thereof).
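To illustrate the "manage the memory on each call" point, here is a hedged sketch of a per-call wrapper that drops references and flushes the backend cache after every synthesis. `pipeline` and its call signature are stand-ins, not KPipeline's real API (with KPipeline you'd iterate its generator and collect audio chunks), and the cache-emptying calls are guarded so the sketch runs even without torch.

```python
import gc


def synthesize_bounded(pipeline, text: str) -> bytes:
    """Run one synthesis call, then aggressively release memory.

    `pipeline` is any callable returning audio bytes; the idea is to keep
    only a plain copy of the result and give everything else back.
    """
    audio = pipeline(text)
    result = bytes(audio)  # copy out what we actually keep...
    del audio              # ...and drop the original reference
    gc.collect()           # reclaim unreachable Python objects now
    try:
        import torch
        if torch.backends.mps.is_available():
            torch.mps.empty_cache()    # release cached MPS allocator blocks
        elif torch.cuda.is_available():
            torch.cuda.empty_cache()   # release cached CUDA allocator blocks
    except ImportError:
        pass  # pure-CPU / no-torch setup: nothing to flush
    return result
```

Whether this fully bounds the footprint depends on where the leak actually lives; if KPipeline itself retains references internally, no amount of caller-side cleanup will reclaim them.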