Microphone / wave-recorder button does not work, no voice chat as described
Describe the bug
I have been using Gradio to build chat apps. I recently found Chainlit and was immediately attracted to it: the chat UI looks very professional and it is simple to pick up. However, I noticed a major bug: the microphone / voice chat does not work. The mic is disabled by default; when I enabled it and clicked the mic (wave-recorder) button, nothing happened. The browser never showed the "allow access to microphone?" prompt, and the app never connected to the mic. In fact, the process seems to hang forever (see the attached pics). I consider it a major bug that a framework specialized in chat cannot talk to or connect to the microphone. This feature has been available for quite a while in frameworks like Gradio, which are less specialized in chat. I noticed this problem was reported a while ago (e.g. https://github.com/Chainlit/chainlit/issues/626) but has not been addressed. I have tried the different solutions suggested by users, e.g. deploying over https instead of http, and nothing has worked so far. Please treat this as the major bug it is and address it. Thank you!
Expected behavior
- simple ways to turn the mic on and off
- the mic / wave-recorder button works
- voice chat is supported in general (both the user and the AI can talk)
Screenshots
Hi @bigmw,
Can you share the code you currently have in your @cl.on_audio_chunk and @cl.on_audio_end function decorators? Voice chat is supported in Chainlit, but it gives you full control over how you handle the audio chunks. This lets you pass the audio to OpenAI's Realtime API, send it to a Whisper model for transcription, etc.
They have an example for setting up Chainlit with realtime audio in their Cookbook, and it worked well for me with some small modifications to fit my use case. Are you able to try that code in your app?
Aidan, here is the code from my @cl.on_audio_chunk and @cl.on_audio_end function decorators. As you suggested, I also checked the realtime audio example in their Cookbook, as well as the Quivr Chatbot Example. Let me know if you spot any problem here. Thank you!
import os
import speech_recognition as sr
from io import BytesIO

import chainlit as cl
from chainlit.element import Element


@cl.on_audio_chunk
async def on_audio_chunk(chunk: cl.InputAudioChunk):
    if chunk.isStart:
        buffer = BytesIO()
        # This is required for whisper to recognize the file type
        buffer.name = f"input_audio.{chunk.mimeType.split('/')[1]}"
        # Initialize the session for a new audio stream
        cl.user_session.set("audio_buffer", buffer)
        cl.user_session.set("audio_mime_type", chunk.mimeType)

    # Write the chunks to a buffer and transcribe the whole audio at the end
    cl.user_session.get("audio_buffer").write(chunk.data)


@cl.on_audio_end
async def on_audio_end(elements: list[Element]):
    # Get the audio buffer from the session
    audio_buffer: BytesIO = cl.user_session.get("audio_buffer")
    audio_buffer.seek(0)  # Move the file pointer to the beginning
    audio_file = audio_buffer.read()
    audio_mime_type: str = cl.user_session.get("audio_mime_type")

    input_audio_el = cl.Audio(
        mime=audio_mime_type, content=audio_file, name=audio_buffer.name
    )
    await cl.Message(
        author="You",
        type="user_message",
        content="",
        elements=[input_audio_el, *elements],
    ).send()

    recognizer = sr.Recognizer()
    with sr.AudioFile(audio_buffer.name) as source:
        audio_data = recognizer.record(source)
    try:
        text = recognizer.recognize_google(audio_data)
    except sr.UnknownValueError:
        await cl.Message(content="Sorry, I couldn't understand that audio.").send()
        return
    except sr.RequestError:
        await cl.Message(content="Could not request results, please try again.").send()
        return

    # on_message is the app's regular text handler (not shown in this snippet)
    msg = cl.Message(author="You", content=text, elements=elements)
    await on_message(message=msg)
Hi @bigmw,
Your code looks a lot like the Quivr Chatbot Example you linked, but unfortunately this method of handling audio was changed in Chainlit v2.0.0 to add support for realtime conversations such as OpenAI's Realtime API.
You can still get this code to work, but you would need to follow the migration guide from the prerelease notes. The reason you are seeing the permanently spinning icon is that you do not have a function decorated with @cl.on_audio_start, which is required to begin an audio conversation. You could then run your own voice activity detection (VAD) in on_audio_chunk(), and when the user stops talking, run the code you currently have in on_audio_end() from within on_audio_chunk(). Additionally, you would need to remove the elements input argument from on_audio_end(), as that is no longer present post-v2.0.0.
Alternatively, you could use a Chainlit version pre-v2.0.0 and your code would potentially work.
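For reference, here is a rough sketch of the post-v2.0.0 hook layout I'm describing (structure only; VAD, transcription, and error handling are omitted, and the names are simplified), so treat it as an illustration rather than a drop-in implementation:

```python
import chainlit as cl


@cl.on_audio_start
async def on_audio_start():
    # Returning True tells the UI the audio connection is accepted;
    # without this hook the mic button just spins.
    cl.user_session.set("audio_chunks", [])
    return True


@cl.on_audio_chunk
async def on_audio_chunk(chunk: cl.InputAudioChunk):
    # Collect raw chunks; your own VAD would decide here when the user
    # has stopped talking and then trigger transcription / a response.
    audio_chunks = cl.user_session.get("audio_chunks")
    audio_chunks.append(chunk.data)


@cl.on_audio_end
async def on_audio_end():
    # Note: no `elements` argument any more post-v2.0.0.
    pass
```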
Hi Aidan @AidanShipperley, thanks for the input and the detailed suggestions. They make a lot of sense. However, I still see the same problem after updating my script accordingly.
I also followed the second Multi-Modality example in the Chainlit documentation, which includes both text-to-speech and speech-to-text and is very similar to my app. Note that the example was released/updated recently, after the Chainlit 2.0 release, and is consistent with what you suggested. I still got the same problem.
For demo purposes, I simplified my app script and removed the LLM calls and the STT/TTS parts, as shown below. I do see the app prompt me for mic access now, but otherwise it is still the same: the mic does not work and the connection attempt (spinning icon) lasts forever. In the demo script below, I inserted a few "await cl.Message().send()" lines for debugging. They show that on_audio_start() does run, but on_audio_chunk() never does; in fact it never even starts, because the first cl.Message().send() line in on_audio_chunk() never fires. I hope this gives you a better idea of the bug. Note that both on_audio_start() and on_audio_chunk() are copied from the official openai-whisper example, except that process_audio() is not called, to keep the demo simple. The problem/bug can be replicated by running the demo app. Let me know if you have further thoughts/suggestions. Thank you!
import io
import os
import wave
import numpy as np
import audioop
import chainlit as cl

# Define a threshold for detecting silence and a timeout for ending a turn
SILENCE_THRESHOLD = 3500  # Adjust based on your audio level (e.g., lower for quieter audio)
SILENCE_TIMEOUT = 1300.0  # Milliseconds of silence to consider the turn finished


@cl.on_chat_start
async def start_chat():
    msg0 = "Hello! How can I help you?"
    await cl.Message(content=msg0).send()


@cl.on_audio_start
async def on_audio_start():
    cl.user_session.set("silent_duration_ms", 0)
    cl.user_session.set("is_speaking", False)
    cl.user_session.set("audio_chunks", [])
    # await cl.Message(content="audio starts.").send()
    return True


@cl.on_audio_chunk
async def on_audio_chunk(chunk: cl.InputAudioChunk):
    # await cl.Message(content="On audio chunk now").send()
    audio_chunks = cl.user_session.get("audio_chunks")

    if audio_chunks is not None:
        await cl.Message(content="adding audio chunk..").send()
        audio_chunk = np.frombuffer(chunk.data, dtype=np.int16)
        audio_chunks.append(audio_chunk)
        cl.user_session.set("audio_chunks", audio_chunks)

    # If this is the first chunk, initialize timers and state
    if chunk.isStart:
        await cl.Message(content="first audio chunk..").send()
        cl.user_session.set("last_elapsed_time", chunk.elapsedTime)
        cl.user_session.set("is_speaking", True)
        return

    audio_chunks = cl.user_session.get("audio_chunks")
    last_elapsed_time = cl.user_session.get("last_elapsed_time")
    silent_duration_ms = cl.user_session.get("silent_duration_ms")
    is_speaking = cl.user_session.get("is_speaking")

    # Calculate the time difference between this chunk and the previous one
    time_diff_ms = chunk.elapsedTime - last_elapsed_time
    cl.user_session.set("last_elapsed_time", chunk.elapsedTime)

    # Compute the RMS (root mean square) energy of the audio chunk
    audio_energy = audioop.rms(chunk.data, 2)  # Assumes 16-bit audio (2 bytes per sample)

    if audio_energy < SILENCE_THRESHOLD:
        # Audio is considered silent
        silent_duration_ms += time_diff_ms
        cl.user_session.set("silent_duration_ms", silent_duration_ms)
        if silent_duration_ms >= SILENCE_TIMEOUT and is_speaking:
            cl.user_session.set("is_speaking", False)
            # await process_audio()
            await cl.Message(content="This is an audio response.").send()
    else:
        # Audio is not silent, reset silence timer and mark as speaking
        cl.user_session.set("silent_duration_ms", 0)
        if not is_speaking:
            cl.user_session.set("is_speaking", True)


# @cl.on_audio_end
# async def on_audio_end():
#     pass


@cl.on_message
async def on_message(message: cl.Message):
    await cl.Message(content="This is a response.").send()
I have not directly tested your code yet; first, could you give me a few things so I can help narrow down where this is happening?

- Could you share your `.chainlit/config.toml` file? Just to ensure that you've set your sample rate to `24000` and everything else is in order.
- Can you share your OS, what browser you are using, and what the browser's version is?
- I happened to, just by pure chance, be testing my own audio code, and I noticed that Chainlit's current implementation of the realtime assistant doesn't seem to work in Firefox: a custom sample rate is set for one AudioContext (or another node supplying the microphone stream) while the microphone data comes in at the device's default rate. Edge and Chrome often handle this discrepancy automatically by resampling or allowing inter-context connections, but Firefox enforces stricter rules.
- Could you try your code in another browser, just in case?
- After you click the audio button, do any errors print out in either your terminal or in the web browser's developer console (right click -> Inspect -> click the `Console` tab at the top)?
- Instead of sending messages to the chat for debugging, which are quite slow to send (compared to how fast `on_audio_chunk()` will be called), can you try print statements instead and see which functions get called? I usually add print statements at the top of each function; see the sketch after this list.
- Was there a reason you commented out `on_audio_end()`? You may need all three functions defined for it to work, but this is just a guess.
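For instance, throwaway instrumentation along these lines (nothing but prints, using the same hook names as your script) is usually enough to see which hooks actually fire:

```python
import chainlit as cl


@cl.on_audio_start
async def on_audio_start():
    print("on_audio_start called", flush=True)
    return True


@cl.on_audio_chunk
async def on_audio_chunk(chunk: cl.InputAudioChunk):
    print(f"on_audio_chunk called, isStart={chunk.isStart}, bytes={len(chunk.data)}", flush=True)


@cl.on_audio_end
async def on_audio_end():
    print("on_audio_end called", flush=True)
```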
I think with these we can narrow down where the issue is arising from.
Hi @bigmw, can you please update us? Did you solve your problem? If so, how? Sharing your solution might help us all 😄
All the best.
I have the same problem, but with a slight difference: I can run my Chainlit app correctly on my notebook PC, but I cannot run it on my mobile phone. The AI app is https://github.com/monuminu/AOAI_Samples/tree/main/realtime-assistant-support. Waiting for a solution. Thanks.
Same problem here. I have followed the up-to-date documentation and it still hangs, even if I just put a `pass` in the methods.
https://docs.chainlit.io/api-reference/lifecycle-hooks/on-audio-chunk
Also, the cookbook example for audio has been removed...
bump!
bump, too
It does seem like this issue is related to the sample rate difference between the device and what Chainlit expects, as @AidanShipperley hinted above. At first glance, this looks like something Chainlit needs to handle on the client side.
This only seems to be an issue in Firefox. I tested in Chrome (desktop and mobile) and Safari (desktop macOS) and it works fine. On Firefox it fails on both desktop and mobile.
Here is the console output:
Connecting AudioNodes from AudioContexts with different sample-rate is currently not supported. [index.mjs:217:4079](https://127.0.0.1:52087/libs/react-client/dist/index.mjs)
Uncaught (in promise) DOMException: AudioContext.createMediaStreamSource: Connecting AudioNodes from AudioContexts with different sample-rate is currently not supported.
begin index.mjs:217
pe index.mjs:308
emit index.mjs:136
emitEvent socket.js:498
onevent socket.js:485
onpacket socket.js:455
emit index.mjs:136
ondecoded manager.js:204
promise callback*Dae< websocket-constructor.browser.js:5
ondecoded manager.js:203
emit index.mjs:136
add index.js:146
ondata manager.js:190
emit index.mjs:136
onPacket socket.js:341
emit index.mjs:136
onPacket transport.js:98
onData transport.js:90
onmessage websocket.js:68
[index.mjs:10:3795](https://127.0.0.1:52087/libs/react-client/dist/index.mjs)
@cl.on_audio_start
async def on_audio_start():
    cl.user_session.set("silent_duration_ms", 0)
    cl.user_session.set("is_speaking", False)
    cl.user_session.set("audio_chunks", [])
    # await cl.Message(content="audio starts.").send()
    return True
It was already explained very clearly above: you need to add the on_audio_start method. That fixed it.
Please add some more documentation on this functionality... 😭
- I don't see any listing at all for `on_audio_start()` today in the API reference
- The Advanced Features > Multi-Modality page just mentions `on_audio_chunk` being required, and doesn't really give any implementation guidance
- The "migration guide" referenced above is buried in the `2.0rc0` release note (not even the actual 2.0 release)
It took me ages to figure out why my audio button was spinning forever, as I was starting from quite a heavily modified internal version of the samples that I guess must've been created for v1.
I have the same sample rate issue in Firefox (but works fine in Edge, Chrome, and Safari). A crude kludge that works for me is to inject the following JavaScript (specified in .chainlit/config.toml):
const RealContext = window.AudioContext || window.webkitAudioContext;
window.AudioContext = function (opts = {}) {
  delete opts.sampleRate;
  return new RealContext(opts);
};
Then the microphone works fine in Firefox as well, but at the cost of losing control over the sampling rate. Does anyone know how to send the sampling rate from the frontend back to the Python Chainlit backend (or, more generally, how to send any information from the browser frontend to the backend programmatically)?
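I haven't verified this end to end, but one possibility is to register a custom endpoint on Chainlit's underlying FastAPI app (recent versions expose it as `chainlit.server.app`, if I recall correctly) and have the injected script POST the actual rate there, e.g. `fetch("/audio-sample-rate", {method: "POST", ...})` with the value of `new RealContext().sampleRate`. A rough sketch of the backend side, where the endpoint name and the module-level store are my own invention, and mapping the value to a specific user session would need extra plumbing:

```python
from fastapi import Request
from chainlit.server import app  # assumes Chainlit exposes its FastAPI app here


# Hypothetical module-level store; only reasonable for a single-user/local setup.
REPORTED_SAMPLE_RATES: dict[str, int] = {}


@app.post("/audio-sample-rate")  # hypothetical endpoint, called by the injected JS
async def report_sample_rate(request: Request):
    payload = await request.json()
    # Expecting something like {"client_id": "...", "sample_rate": 48000} from the frontend.
    REPORTED_SAMPLE_RATES[payload.get("client_id", "default")] = int(payload["sample_rate"])
    return {"ok": True}
```

In `on_audio_chunk()` you could then look the reported rate up and resample with something like `audioop.ratecv()` if it differs from what you expect, but again, this is just a sketch of one possible route.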
This issue is stale because it has been open for 14 days with no activity.
This issue was closed because it has been inactive for 7 days since being marked as stale.
My case: the server is hosted on a local network.
Solved for me: I used an SSL certificate generator to create a cert.pem and a key.pem file, added them to the .env file:
CHAINLIT_SSL_CERT=C:\Users.....\cert.pem
CHAINLIT_SSL_KEY=C:\Users.....\key.pem
and then opened the app at https://<server ip>:<server port>.