Creating a child process during Tool Call causes Worker Shutdown
Bug Description
My environment: Windows 11, WSL2 (Ubuntu flavor)
I have a long-running operation that I have tested extensively, and it works. It is implemented as a subclass of multiprocessing.Process, and its main loop runs indefinitely without error when tested on its own.
I want to run this when a tool call is triggered, supplying the input parameter. I can trigger the tool call, the parameter is correct, and the child process starts... but then, after a few seconds, I get "worker.py:578 - shutting down worker" in the log. I don't see any errors at all, and I have debugging enabled both for this subprocess and for LiveKit. Nothing looks suspicious.
To start the subprocess, I create an mp.Queue(), pass it to the child process's __init__ (a subclass of mp.Process), and call p.start(). I then enter an async polling loop that checks the queue for updates, using a non-blocking queue.get() and asyncio.sleep(). The child process is using the forkserver start method - I can confirm.
I have encapsulated all of the above in an async method which I run using asyncio.create_task(). This is the ONLY statement in my tool call handler. The task creates the subprocess (tested and working on its own), starts it, and then monitors the queue indefinitely.
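The structure is roughly the following (a simplified sketch; LongOperation, run_and_monitor, and the "done" update protocol are placeholders, not my actual code):

```python
import asyncio
import multiprocessing as mp
import queue

mp.set_start_method("forkserver", force=True)  # selected once at startup


class LongOperation(mp.Process):
    """Stand-in for the long-running operation; reports progress via a queue."""

    def __init__(self, q, param):
        super().__init__()
        self.q = q
        self.param = param

    def run(self):
        # ... the real main loop does the actual work here,
        # pushing status updates onto self.q as it goes ...
        self.q.put("done")


async def run_and_monitor(param):
    q = mp.Queue()
    proc = LongOperation(q, param)
    proc.start()
    while True:
        try:
            update = q.get_nowait()  # non-blocking read of the queue
        except queue.Empty:
            await asyncio.sleep(0.1)  # yield back to the event loop
            continue
        # ... update the UI and a state variable with `update` ...
        if update == "done":
            break
    proc.join()


# The tool call handler contains only:
#     asyncio.create_task(run_and_monitor(param))
```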
My question is this: why is the worker getting shut down? Aside from this perhaps not being best practice, what is actually causing the worker shutdown?
Expected Behavior
The tool call returns right away, but the process I created continues running in the background, updating the UI and a state variable until it is done. What happens instead is that the worker shuts down after a few seconds (at arbitrary points in the child process's execution) and takes the child process down with it.
Reproduction Steps
I cannot even pin down the exact cause, so I don't have reproduction steps.
Operating System
Windows 11, WSL2
Models Used
Groq -> Cerebras -> Groq
Package Versions
livekit=1.0.22
livekit-agents=1.2.8
livekit-api=1.0.5
Session/Room/Call IDs
No response
Proposed Solution
I just need to understand why the worker is shutting down and how to avoid it. I am completely at a loss as to why this is occurring.
Additional Context
Hoping something will jump out at you as the cause. If needed, I can provide some code to help diagnose.
Screenshots and Recordings
No response
Hi, thank you for the detailed report!
> The child process is using the forkserver start method - I can confirm.
For this, I am wondering: how are you using forkserver on Windows? It may help to see the code for the long-running operation for further investigation.
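(For context: forkserver is a Unix-only start method, so it should exist inside WSL2's Linux environment but not on native Windows. A quick check of which methods your interpreter actually offers:)

```python
import multiprocessing as mp

print(mp.get_all_start_methods())
# Linux / WSL2:    ['fork', 'spawn', 'forkserver']
# Native Windows:  ['spawn']
```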
I also wonder if upgrading to the latest version of livekit-agents and using AgentServer in lieu of Worker might remedy this.
Can you share a minimal example for reproducing this?
I figured it out. The cause is actually hilarious, but it may be valuable for others to learn from my pain...
I was using the dev server, with live reloading... I'm sure you know where this is going now.
Apparently, if you remove/create/edit files from a child process (at least in the way I was doing it), the file watcher believes that you, the user, have modified a source file, and it triggers a reload, which restarts the worker and kills the child process along with it. I'm not 100% sure whether it was the open()/append or the os.remove() that triggered it, but I will narrow it down soon.
This explains perfectly:
- why I had no errors at all;
- why Copilot, Claude, etc. could not find the cause: there wasn't one, since there was no issue in the code;
- why I was able to mitigate the issue when I commented out all the code that touched the file system.
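In hindsight, the minimal reproduction is just file churn from a child process inside the watched project tree while running under the dev reloader (a sketch; the file name and timings are arbitrary):

```python
import multiprocessing as mp
import os
import time


def file_churn():
    # Append to, then delete, a file inside the watched project directory.
    for i in range(5):
        with open("scratch.md", "a") as f:
            f.write(f"tick {i}\n")
        time.sleep(1.0)
    os.remove("scratch.md")


# Started from a tool call handler, e.g.:
#     p = mp.Process(target=file_churn)
#     p.start()
# Within a few seconds the watcher sees the change and reloads the worker.
```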
No wonder I was losing my mind over this and going in circles - I was chasing ghosts that weren't there... welp.
Edit: To be clear, I was writing content to an .md file for my own purposes, not modifying any code or actual resources. But I guess the watcher does not know this...
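For anyone else who hits this: the simple workaround, assuming the watcher only monitors the project tree, is to write any files the child process produces somewhere outside of it, e.g. the system temp directory (the path and file name below are illustrative). Running the agent with the start subcommand instead of dev should also avoid hot reloading entirely.

```python
import os
import tempfile

# Write output outside the project tree so the dev server's file
# watcher never sees the change and never triggers a reload.
notes_path = os.path.join(tempfile.gettempdir(), "agent_notes.md")

with open(notes_path, "a", encoding="utf-8") as f:
    f.write("status update...\n")
```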