feat: Speech-to-Text Voice Input (via Whisper)
Description
This PR adds voice input functionality via Whisper Speech-to-Text models.
A microphone button is added to the InputToolbar component next to the image/file input button (see screenshots below). After clicking the button, voice input starts and the user's microphone audio begins to be processed by the backend.
This PR bundles the quantized whisper-tiny.en model and was tested on a mid-2015 base-model MacBook Pro using CPU-only inference.
More Details
The final implementation I settled on after a fair amount of testing processes fixed windows of audio (~1.5 seconds right now) but keeps extending the window if the speaker is still talking at the end of the current one (we use silero-vad to detect this). This produced the best results, since it accumulates speech and gives whisper much longer context into what the user is saying. I'm sure the parameters here could be fine-tuned based on more real-world usage.
The frontend will show initial results while the user is still speaking and will update them as whisper is able to process the rest of the user's speech. After the user is done speaking, it will 'commit' that text to the frontend and the audio buffer will be cleared for the next speech segment.
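To make that flow concrete, here is a rough sketch of the windowing logic; every name in it (runVad, runWhisper, emitInterim, emitCommit, concatChunks) is illustrative rather than the actual code in this PR:

```ts
// Illustrative sketch of the fixed-window + VAD-extension loop described above.
type Vad = (chunk: Float32Array) => Promise<boolean>; // true if still speaking
type Stt = (audio: Float32Array) => Promise<string>;  // whisper transcription

async function processWindow(
  buffer: Float32Array[],           // audio accumulated so far
  chunk: Float32Array,              // the ~1.5 s window that just finished
  runVad: Vad,
  runWhisper: Stt,
  emitInterim: (text: string) => void,
  emitCommit: (text: string) => void,
): Promise<Float32Array[]> {
  buffer.push(chunk);

  // Transcribe everything accumulated so far and surface it as an interim result.
  const joined = concatChunks(buffer);
  emitInterim(await runWhisper(joined));

  // If silero-vad says the speaker is still talking at the end of this window,
  // keep accumulating so whisper gets longer context on the next pass...
  if (await runVad(chunk)) {
    return buffer;
  }

  // ...otherwise commit the text and clear the buffer for the next segment.
  emitCommit(await runWhisper(joined));
  return [];
}

function concatChunks(chunks: Float32Array[]): Float32Array {
  const out = new Float32Array(chunks.reduce((n, c) => n + c.length, 0));
  let offset = 0;
  for (const c of chunks) {
    out.set(c, offset);
    offset += c.length;
  }
  return out;
}
```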
Still Todo
- allow users to specify different whisper models
- add gpu support
Checklist
- [x] The base branch of this PR is `dev`, rather than `main`
- [ ] The relevant docs, if any, have been updated or created
Screenshots
Voice input active shown by new yellow-orange gradient
New voice input button in InputToolbar
Testing
Pull and debug this branch; there is no specific configuration option to enable this feature. For now, only the tiny.en model is used and packaged with Continue. Once the dev environment is up, simply start a new chat and click the small microphone button in the toolbar to start voice input. You may be prompted by your system to allow microphone access – once granted, you should be able to speak and see your speech transcribed live into the chat box.
In the future, support may be added for the various whisper models or other STT providers.
Sorry, accidentally had the main branch selected when I hit publish
I'm having some trouble getting any results from whisper when running in debug mode—logs constantly look like this. Anything else I should be trying here?
#2051 fixed a debounce issue with InputToolbar that might be the culprit (audio input is set to be disabled when the user focuses away / onBlur)
From that PR:
```ts
// Excerpt from within the component — state plus a debounced setter
const [shouldHideToolbar, setShouldHideToolbar] = useState(false);

// Debounce visibility changes so rapid focus/blur events don't flicker the toolbar
const debouncedShouldHideToolbar = debounce((value) => {
  setShouldHideToolbar(value);
}, 200);

useEffect(() => {
  if (editor) {
    const handleFocus = () => {
      debouncedShouldHideToolbar(false);
    };
    const handleBlur = () => {
      debouncedShouldHideToolbar(true);
    };
    editor.on('focus', handleFocus);
    editor.on('blur', handleBlur);
    // Clean up listeners when the editor changes or the component unmounts
    return () => {
      editor.off('focus', handleFocus);
      editor.off('blur', handleBlur);
    };
  }
}, [editor]);
```
I'll pull that fix to local and do some testing
@sestinj okay, I think you might be encountering that bug I alluded to (where it kicks on then immediately off) because that fix (#2051) was not in place.
It's intermittent, but I can get it to reproduce when I don't have that PR in place. I just merged in the latest dev branch, and in my testing it now works more consistently when you click the microphone. Let me know if you're still having issues.
Other updates:
- Added better debug output during the startup process
- On Windows you must specify your device's name; we now have a `setupInputDevice` function that pulls that info out of ffmpeg and sets the device name appropriately (tested with output from my Win 10 machine)
- Users can specify custom local whisper models (we are using the @xenova/transformers.js `AutomaticSpeechRecognitionPipeline`); definitely needs testing
- Added config options under `voiceInput.inputDevice` and `voiceInput.inputFormat` so users can specify ffmpeg parameters manually (example below); depending on your system you can run, OSX specific: `ffmpeg -list_devices true -f avfoundation -i dummy` (for Windows use `-f dshow`, Linux `-f alsa`, etc.)
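For illustration, the new options would sit in config.json roughly like this (the device/format values are just examples; the exact shape may differ from the final implementation):

```json
{
  "voiceInput": {
    "inputDevice": "MacBook Pro Microphone",
    "inputFormat": "avfoundation"
  }
}
```

And since custom local models go through @xenova/transformers.js, loading one looks roughly like the following (the model id is only an example):

```ts
import { pipeline } from "@xenova/transformers";

// Returns an AutomaticSpeechRecognitionPipeline; the model id is an example, not a requirement
const transcriber = await pipeline(
  "automatic-speech-recognition",
  "Xenova/whisper-tiny.en",
);

const audio = new Float32Array(16000); // placeholder: 1 s of 16 kHz mono audio
const result = await transcriber(audio);
console.log(result); // => { text: "..." }
```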
I think this was the problem
```
[0] Microsoft Teams Audio
[1] MacBook Pro Microphone
[2] Nate’s iPhone (2) Microphone

extensionHostProcess.js:155
Found FFmpeg Audio Input device: "[0] Microsoft Teams Audio" for Input format: avfoundation
```
Had to delete not just Microsoft Teams, but the driver that it installed lol
Ok! This thing is awesome. Is it intentional to require the user to press enter themselves after the audio is stopped? I'm not super familiar with what's standard for STT UI, but I can see reasons for both directions.
Other feedback is that I would like to make this experimental for now, or even just an option in the "ui" section of config.json.
Had to delete not just Microsoft Teams, but the driver that it installed lol
😂 sounds about right
Is it intentional to require the user to press enter themselves after the audio is stopped?
Yeah, with the much lower-latency models they auto-submit and auto-interrupt, but for right now I think going with "press Enter when you're finished" is the most familiar option for Continue's users. It might be cool to also try a push-to-talk type UX in some form.
Down the road I'd love to experiment with real-time speech interactivity with codebases, but for a v1 this is probably good for now.
would like to make this experimental for now
Sweet, yeah that definitely makes sense till we've got some more real world usability testing. I can wire that up when I get some time tomorrow/Friday. Yeah I think having a UI toggle would be good to raise awareness & get people trying it out
Came across this and figured I'd give a bit of feedback based on my experience working with whisper and transcription stuff over the last year-ish. First, thanks for doing this -- voice is so powerful and I can't wait til this lands in continue.
Local transcription can be tricky -- I think I saw you added gpu support, which will help performance, but generally I can see a world in which users just want to use their openai api key to do remote transcription with the whisper api. In a lot of cases, uploading the audio to openai and letting them do the transcription in the cloud ends up being faster than doing it locally, even on a gpu with a small whisper model.
Things get much trickier with streaming transcription. What I've observed is that you have the same performance problems as with local inference generally, but streaming also degrades transcript accuracy. There is some wiggle room here -- if you chunk the audio 5 seconds at a time, you'll probably get better results than with 1-2 second chunks, but waiting for the initial 5 seconds to load up feels weird, and it's more expensive to do the local inference on large audio chunks.
Personally, in a code editor context, I'd prefer to wait for a little loader icon for the transcript to come back from the openai api and have very high accuracy, rather than have it stream the output locally. On top of that, in most cases, especially if you have fast internet, the whole thing will probably be faster than doing local inference to transcribe.
Again, thanks for all the amazing work here.
Whisper is not a streaming model, using it for streaming is basically a hack, but there are other ASR implementations that work much better for streaming tasks. See for example https://github.com/k2-fsa/sherpa
It performs at least an order of magnitude better than Whisper and could even work in real time in the browser with WASM: https://huggingface.co/spaces/k2-fsa/web-assembly-asr-sherpa-ncnn-en
Other option: https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/intro.html
Personally, in a code editor context, I'd prefer to wait for a little loader icon for the transcript to come back from the openai api and have very high accuracy, rather than have it stream the output locally. On top of that, in most cases, especially if you have fast internet, the whole thing will probably be faster than doing local inference to transcribe.
For sure, when I implemented it I definitely had the API in mind as well, as an option for less cost-conscious users – I think with Continue in particular, though, being the open source option (vs copilot, cursor, etc), it tends to lend itself more towards users looking for local options or a hybrid approach. Keeping this in mind, the UI, config, etc. work was all done so that plugging in the OpenAI API or any other STT model down the road should be pretty seamless.
Whisper is not a streaming model, using it for streaming is basically a hack, but there are other ASR implementations that work much better for streaming tasks. See for example https://github.com/k2-fsa/sherpa
Interesting, I hadn't heard of sherpa; it looks like there are a few other models as well – @daniel-dona did you give this PR's whisper implementation a try in Continue?
@sestinj - thanks for getting that config wired up; had some last minute stuff come up last week.
@mkummer225 I've tested this a number of times, I think it's working fairly nicely, and yet I've found myself hesitating to merge. I think the intuition for that, now that I've fully explicitly reasoned about it, is the following:
- This is a feature that will probably be used by 1% or less of users on a day-to-day basis
- It adds almost 30 MB to the package
- This adds another bit of maintenance to the build process
- The fully-local implementation is going to be prone to native module-related bugs and worse performance
My conclusion is that we'll want to wait for support for a non-local implementation before merging. I'd consider that because it wouldn't require adding more size to the bundle and would be less of a maintenance burden, but would still be great for users that want the feature!
@sestinj no problem. To address a few of the issues, you could just not include the weights and use the manual path functionality; i.e., the user would have to enable it via experimental and then also provide a path to their weights.
Regardless, I understand – thanks for closing the loop; thought it might be a cool feature within the IDE.
I personally don't see using Whisper via the API for pseudo real-time STT as viable (cost-wise), but others might.
I appreciate the understanding. It does currently seem like it would be unwise to merge this much without the plan to take it out of experimental. We want to be quite careful about the product surface area that we add, and even if experimental, we have to fully accept that maintenance.
I think that for now I'd like to close this, as I'm not going to have the chance to add to it myself. If somebody comes along and is willing to do a bit of extra work to get the API or another reliable method working, then we might reconsider. I'll leave it open for the next few days just to make sure.
@mkummer225 nice work with this PR. Beyond just being a nice to have feature, this also feels like quite a helpful bit of accessibility
@sestinj I'd potentially be willing to put in some or all of the work to get this over the line, as this is a feature I'd really want to use, especially after trying voice with copilot. But it'd be worth really clarifying what you as maintainers want from this.
What would a plan to take this out of experimental need? When you say "do a bit of extra work to get the API or another reliable method working," do you have any idea specifically what you might want? Would it be just using the API first, then adding another option to point to a local instance of Whisper or what not later? Or would you expect both bits of functionality to be in this from the get go?
Would I be right in thinking the main bits of work here are at least:
- Providing a way to let users specify whether they use a local Whisper or the Whisper API in the config
- Providing a way to manually specify the path to the weights
- Adding the ability to use the Whisper API to the existing code
What else would need to be done? If additional tasks can be provided, and maybe some slight guidance through reviews, I can hopefully help with this.
@callum-gander thanks for offering to step up here! I would want to see the following in order to feel comfortable merging:
- There is an interface for swappable Speech-to-Text providers (rough sketch after this list)
- This interface is implemented by the Whisper API provider (it is my assumption that the Whisper API is, if we had to choose one, the best choice, but I am open to suggestions)
- It is configurable as follows:

```json
"experimental": {
  "speechToText": {
    "apiKey": "<API_KEY>"
  }
}
```
- Notice above that the swappable providers are not configurable, intentionally, as there is only one currently supported (I do not wish to take on more for now)
- No native binaries are involved (failure to load is not isolated, may cause full-on crash)
- Minimal code footprint other than a self-contained folder with interface and implementations
- No build modifications
- Because good work has been done on the local whisper provider, I think that at a minimum we should link to its existence in the history of this PR, but the required assets should go
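A minimal sketch of what that provider interface and the Whisper API implementation could look like (names and shape are illustrative, not prescribed here):

```ts
// Illustrative sketch only -- interface and class names are not prescribed.
interface SpeechToTextProvider {
  /** Transcribe a recorded audio clip (e.g. WAV bytes) and return the text. */
  transcribe(audio: Uint8Array): Promise<string>;
}

// The single provider supported initially: OpenAI's hosted Whisper API.
class WhisperApiProvider implements SpeechToTextProvider {
  constructor(private readonly apiKey: string) {}

  async transcribe(audio: Uint8Array): Promise<string> {
    const form = new FormData();
    form.append("file", new Blob([audio], { type: "audio/wav" }), "audio.wav");
    form.append("model", "whisper-1");

    const res = await fetch("https://api.openai.com/v1/audio/transcriptions", {
      method: "POST",
      headers: { Authorization: `Bearer ${this.apiKey}` },
      body: form,
    });
    if (!res.ok) {
      throw new Error(`Speech-to-text request failed: ${res.status}`);
    }
    const json = (await res.json()) as { text: string };
    return json.text;
  }
}
```

The config above would then only need to construct this one provider from `experimental.speechToText.apiKey`.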
The general point I'm getting at with all of these is that I hope for this feature to require the absolute minimum of maintenance (it is a very cool contribution and will make some folks quite happy; however, it is quite unlikely to be prioritized by us for any amount of time in the near-to-mid term).
Let me know what you think
I can help here. I have prototype code from my projects that can stream from the web speech api and also the hosted openai whisper.
In my experience, auto-sending text after speaking is often annoying, as the STT makes mistakes with homonyms all the time and I like to correct them manually before sending. LLMs can often understand the wrong text anyway, but this will decrease accuracy.
I have a configurable delay for sending text after speaking. I usually like it at 1-2 seconds. However, I find I just turn it off most of the time and send manually.
Nate I know you are skeptical of the web speech api's quality cross platform, but I'm pretty sure it's going to be fine. Since continue is hosted in essentially a uniform chrome, it might even be the same code everywhere. (Or is the JetBrains browser significantly different?)
Using hosted Whisper works surprisingly well even though whisper isn't designed for streaming. However, the implementation I have is much more complicated, requiring WebRTC and a Python intermediary which I would have to port to TypeScript.
It's probably best to skip the old whisper API and go straight to the new openai realtime api, as models that handle audio tokens directly, rather than going through a speech-to-text model first, are much better at not getting fooled by homonyms and the like. I plan on using the realtime api for my apps that do speech-to-speech with tool use, but haven't had a chance to start using it yet.
```html
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Speak to Me</title>
  </head>
  <body>
    <button id="record" aria-label="record">🎙️</button>
    <input id="inputText" style="width: 30em" />
    <script>
      const inputText = document.getElementById("inputText");
      const recordButton = document.getElementById('record');

      // Prefer the standard constructor where available, falling back to the webkit-prefixed one
      const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
      const recognition = new SpeechRecognition();
      recognition.continuous = true;
      recognition.interimResults = true;

      recognition.onstart = function () {
        console.log("Voice recognition started.");
      };
      recognition.onerror = function (event) {
        console.error("Error occurred: " + event.error);
      };
      recognition.onend = function () {
        console.log("Voice recognition ended.");
      };
      recognition.onresult = function (event) {
        // Concatenate every result segment so continuous/interim results accumulate
        let transcript = "";
        for (let i = 0; i < event.results.length; i++) {
          transcript += event.results[i][0].transcript;
        }
        inputText.value = transcript;
      };

      let recognizing = false;
      function record_or_stop_audio() {
        if (recognizing) {
          recognition.stop();
          recordButton.textContent = "🎙️";
          recordButton.setAttribute("aria-label", "record");
          recognizing = false;
          return;
        }
        recognition.start();
        recordButton.textContent = "🛑";
        recordButton.setAttribute("aria-label", "stop");
        recognizing = true;
      }

      recordButton.addEventListener('click', () => {
        record_or_stop_audio();
      });
    </script>
  </body>
</html>
```
No build modifications
@sestinj w.r.t build modifications, this implementation uses ffmpeg (a binary) that requires build-time modifications in several places in order to ensure we can gain microphone access.
This was how I was able to get voice input in VS Code. There is no access to any of the web speech APIs. Other implementations seem to set up a local server and use websockets to pipe microphone input from a separate web browser tab.
The only other extension that allows for truly native + non-hacky voice input is Microsoft's own, for use with their Copilot extension. They do make a package available, but I was unable to get it to install on an OSX machine (it looks like it may require installation on a Windows machine?). They use .NET in C++ to gain microphone access.
As far as I'm aware, this PR is the only other native voice input implementation besides MS's.
@fzzzy yeah, @mkummer225 is right as shown by the issue he links to, the Web Speech API isn't available to vscode extensions.
I can also confirm, after investigating a bit, that @mkummer225 is right: there doesn't seem to be any way around adding some build-time modifications, as already done in this PR. To repeat him, short of setting up a local server (which feels like overkill just to send the audio) or fiddling around with node-speech to see if I can get it working on my OSX machine, it seems like at minimum this would require some build-time modifications to get working. Which of these three options is preferable to investigate further in light of this?
Another question, which @fzzzy points out, is whether we want to (a) just use the standard Whisper API here and potentially deal with transcription mistakes on things like homonyms, (b) use the standard Whisper API plus an LLM query to clean up the text, which would add latency and cost, or (c) use the newer realtime API, which would involve additional overhead and headaches around all the WebSockets bits. I personally lean towards the first option; not sure what you all think.
Apologies for the delay and for asking more questions; I have limited time and don't want to commit to a specific way of doing something if it's not going to be the one that you ultimately want.
@callum-gander Ok, that's good to know about the web speech api not being available. Too bad.
I already have working code for the whisper api backend. Give me a few days to make it into a standalone example. I think some of it will have to be converted to typescript, but it shouldn't be too hard.
I do think long term the realtime api is going to work best. I will begin prototyping that next week. It should be easy going.
I'd vote for the Realtime API, and ffmpeg if we really have to add to the build—how certain are we that there's no possible VS Code API or NodeJS library that can accomplish this without modifying the build?
The realtime API will not require ffmpeg as the webrtc audio samples are passed directly to openai for processing. The old whisper API will still require ffmpeg for converting from raw samples out of webrtc into mp3 files to pass to the old whisper api. My python code also uses ffmpeg for this purpose.
At this point, using the realtime API will be both the least amount of work and the most advanced option, so I'm just going to start on a prototype of that on Monday. At some point, multiple speech-to-text methods can be supported.
Hi, just came around here :-)
BTW: none of the Whisper implementations I have seen rely on mp3; all of them use 16 kHz PCM mono (WAV).
- I'm not a fan of the OpenAI Realtime API: it costs money, and I would prefer a Continue configuration with maximum privacy. I even set up my own OpenAI endpoints and infrastructure.
- You mention sherpa. You already use Transformers.js, right? Did you know there are also sherpa-onnx and faster-whisper-onnx? Did you know that Transformers.js is built on top of ONNX?
If you updated Transformers.js in your codebase from v2 to v3, you would get WebGPU support out of the box.
Here is a realtime-whisper-webgpu example using Transformers.js@v3: https://huggingface.co/spaces/Xenova/realtime-whisper-webgpu ...it even works on Android mobile with Chrome. On iPhone in Safari you need some experimental configuration.
There is also this one, coded in Rust and compiled to WASM with WebGPU support, so it works in the browser: https://github.com/FL33TW00D/whisper-turbo (online demo: https://whisper-turbo.com/ )
I'm doing quite exhaustive research on speech-to-text, and sherpa is not good because it supports only English and Chinese. The base Whisper model is good only for English, but the main advantage of Whisper models is that there are plenty of fine-tuned models for each language, and that is how you get that extra accuracy.
So I would recommend: okay, go with the commercial Realtime API for users who want to spend money and don't care about privacy, but also think about the others. Whisper has a very wide variety of language models. The problem is that there are so many forks trying to work around the hardcoded 30-second processing loop in the base Whisper implementation that it gets confusing. I recommend using Whisper models, but doing more research on a good real-time streaming Whisper implementation.
Also, I don't like it when developers write something in a programming language and then call external OS processes like FFMPEG. It always reminds me of those "script kiddies" :-D ...wouldn't there be something better? I only checked the code briefly and don't remember why FFMPEG was even there. I have JavaScript examples with whisper for iOS as well as Windows, and they take microphone input from the browser and perform the audio conversion right there without FFMPEG.
@sestinj this seems to be going a bit off topic without much outside contribution – maybe we should close this for now? If I get some time down the road I can revisit and incorporate the API functionality
If the community would like to keep discussing, we could open a discussion thread to garner preferences, use cases, etc
I see several people calling out that ffmpeg isn't required, but I see no counterexamples; I've done hours of research at this point for this PR and haven't found a method for getting microphone input within VS Code's extension environment that didn't require an external binary of some form to hook into OS-level microphone access.
Microsoft themselves compile and provide an external binary for microphone access in Copilot (as mentioned in my previous post).
I don't like it when developers write something in a programming language and then call external OS processes like FFMPEG ... I have JavaScript examples with whisper for iOS as well as Windows, and they take microphone input from the browser and perform the audio conversion right there without FFMPEG
Electron, or more specifically VS Code extensions, have very specific sandbox constraints – as reiterated many times, browser-level microphone access is revoked in extensions; this was the first thing I tried while developing this PR. I'd much rather have clean API calls – trust me, it would have made developing this PR to where it is much simpler.
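For context, capturing the microphone from the extension host ends up looking roughly like spawning ffmpeg as a child process. The flags below are an illustrative macOS example, not the exact invocation in this PR (the `setupInputDevice` step mentioned earlier is what resolves the right device per platform):

```ts
import { spawn } from "node:child_process";

// Illustrative macOS example: ":0" selects the first avfoundation audio device.
const ffmpeg = spawn("ffmpeg", [
  "-f", "avfoundation", // "dshow" on Windows, "alsa" on Linux
  "-i", ":0",
  "-ar", "16000",       // 16 kHz sample rate
  "-ac", "1",           // mono
  "-f", "s16le",        // raw 16-bit PCM
  "pipe:1",             // stream to stdout
]);

ffmpeg.stdout.on("data", (pcmChunk: Buffer) => {
  // feed pcmChunk into the transcription pipeline
});
```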
Ah, okay, I did not know about that. I have also been doing quite intensive research for the last few weeks, digging into various Whisper forks.
One idea: the OpenAI API has support for their Whisper and Realtime API, etc.
What if Continue just took advantage of this API?
I plan to work on a realtime Whisper implementation, taking the best working parts of existing work, putting them together, and then putting it on GitHub. My plan is to have streaming and also WebSocket support with an OpenAI-compatible API.
In that case someone could install that solution, or there could be a serverless endpoint with a subscription focused on supporting various languages, and Continue could just plug into it via the OpenAI API. Or you could always install it on localhost.
So there would be some decoupling of components.
LobeChat is using Whisper through OpenAI for STT and also offers Edge and browser support; it might be worth checking out their implementation: https://github.com/lobehub/lobe-chat
I just tried a different approach: "manage a UI-independent handle to the continue.dev text field (via AT-SPI) and add the text via an external application". This way the STT application can be entirely separate (with all the API and model support) while at the same time inserting into ANY application and widget (talk with continue.dev, talk with the editor, talk with the terminal, etc.).
Here is the Python code:
```python
import pyatspi
import random


def process_object(obj):
    try:
        if not obj:
            return False
        component = obj.queryComponent()
        if not component:
            return False
        if obj.role not in [pyatspi.ROLE_TEXT, pyatspi.ROLE_ENTRY, pyatspi.ROLE_DOCUMENT_TEXT]:
            print(f"\t{obj.name} {obj.role}: wrong role")
            return False
        # if not component.contains(x, y, pyatspi.DESKTOP_COORDS): return False
        try:
            # Probe: raises if the object exposes no editable-text interface
            obj.queryEditableText()
            print(f"\t{obj.name} {obj.role}: editable")
            return True
        except Exception:
            print(f"\t{obj.name} {obj.role}: not-editable")
            return False
    except Exception:
        print("error")
        return False


if __name__ == "__main__":
    registry = pyatspi.Registry
    desktop = registry.getDesktop(0)
    for app in desktop:
        if not app:
            continue
        # if app.name != "gedit": continue
        if app.name != "code":
            continue
        print("\n" + app.name)
        # Find the first editable text widget in the application's accessibility tree
        obj = pyatspi.findDescendant(app, lambda o: process_object(o))
        if obj:
            old = obj.queryText().getText(0, -1)
            print(f"old content: {old[0:80]}")
            editable = obj.queryEditableText()
            new = "hello world" + str(random.randint(1000, 9999))
            editable.setTextContents(new)
            print(f"text '{new}' inserted into external application '{app.name}'")
```
```
sudo apt install python3-pyatspi
python3 -m venv --system-site-packages venv
```

Requires accessibility support enabled in order for VS Code to provide AT-SPI information (VS Code settings):

```json
"editor.accessibilitySupport": "on",
"accessibility.signalOptions.volume": 0
```
Unfortunately, the continue.dev text input doesn't have the text role (ROLE_INVALID). All the standard search fields and editor panels have ROLE_ENTRY but cannot be edited.
Ideas
- Get a text field into continue.dev which can be externally edited.
- Make a standalone VS Code extension which can at least retrieve text-input handles within VS Code (i.e. not using AT-SPI).
- Just store the transcribed text in the clipboard and have the user manually paste it (LobeChat background service?).
Closing this for now – when I get some time I'll make the API connections