[Feature Request] Slow Transcription - Whisper API support?
First off: I love the work you've put into this app and into trying to make it as user-friendly as possible. Thank you!
Unfortunately, transcribing is super slow for me (practically unusable): I have to wait more than a minute to transcribe even one second of recorded audio, even with 'Whisper small'.
I'm on a ~2015 laptop running Windows 11 (8 GB RAM, Intel(R) Core(TM) m3-5Y30 CPU at 0.9 GHz).
Would you consider implementing support for transcription powered by the OpenAI Whisper API, by asking the user for an API key? That should solve this issue.
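For context, the client side of such a feature is fairly small. Here is a hedged sketch (Python, standard library only) of building a request against OpenAI's public `/v1/audio/transcriptions` endpoint with the `whisper-1` model; the helper name and multipart details are illustrative, not Handy code:

```python
import urllib.request

OPENAI_URL = "https://api.openai.com/v1/audio/transcriptions"

def build_transcription_request(audio_bytes: bytes, api_key: str,
                                model: str = "whisper-1") -> urllib.request.Request:
    """Assemble the multipart/form-data POST that OpenAI's transcription
    endpoint expects. Sending it with urllib.request.urlopen() returns
    JSON with a "text" field containing the transcript."""
    boundary = "----handy-form-boundary"
    # Two form fields: "model" (the model name) and "file" (the audio data).
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="file"; filename="recording.wav"\r\n'
        "Content-Type: audio/wav\r\n\r\n"
    ).encode("utf-8")
    body = head + audio_bytes + f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        OPENAI_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
        method="POST",
    )
```

The app would only need a settings field for the key and a toggle between local and remote transcription; latency then depends on the network rather than the local CPU.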
Hey, thank you for the kind words, and I'm really sorry it's not running well on your computer. I did recently remove some BLAS support, which may make CPU inference slower than before. Perhaps I can add it back, but I do need to test it. Let me see if I can find a machine with a lower-power CPU and no GPU to see how it does. I'll also go through and see if there are any other CPU optimizations that can be made.
Right now I am not planning to add support for a remote Whisper API, as the app is meant for local use only. As a result, there are effectively system requirements.
I've had another request for API support in the past, and I will think about how best to solve the hardware problem. For now, though, the application will remain local-only. In particular, I'd rather not depend on specific API providers, especially ones that charge.
Perhaps there is a way for people running Handy to spare some of their idle compute to do transcriptions for others who don't have the hardware. This is obviously much more complex and would have to be carefully crafted and opt-in. I'm not confident this is the way forward, but it's something I'm pondering.
Another thought is a computer donation program or something similar. Surely there are many older computers that are still quite capable sitting around for one reason or another. Perhaps it's possible to get them moving and have them be an upgrade for someone else.
Do you mind explaining why you're so strict about local-only use?
I understand not being eager to support non-open AI companies, but what if you made it non-obvious, or clearly optional-looking, to enable? That way, people whose hardware meets the requirements are unlikely to even look for it.
It will be interesting to see the BLAS support come back. I'm not familiar with it, but given the size of the delay, I fear it won't be nearly enough.
I think it would add massively to the value of this app: you could take tiny, old, cheap machines and run them mostly voice-controlled, in combination with a privacy-respecting, open-source AI chat UI like Jan. More power to anyone taking a step away from being caged in company-controlled AI UIs.
It's just not something I'm particularly interested in supporting for this application right now. Handy doesn't have to do everything for everyone. At some point I have to draw a line in the sand for what the boundaries of the app are. Every issue and PR that comes up is me figuring out what they are. For now it is designed to be local-only. Not just local-first. Why? I guess that's just what I want from the app. I don't have much more explanation than that. I like the constraint and wish it to continue. I'm not on a holy war against APIs or anything; I use many of them and even host a Whisper API myself (https://geppetto.app), but I like that the project doesn't use them. I think the market is saturated with projects that do, and I just want something different.
I partially like the constraint because it means I can't hide from the fact that not every machine can run it well. And I want to feel and know that. I want to push and understand the models and the acceleration to make it so that it can. There is room for improvement. I would even love to train a model that runs well on CPU. I really believe this technology should be ubiquitous, and I want to understand the edges of what that means. Contemplate them.
Anyone is free to fork it and make this work, and I will gladly support that fork and direct people to it; if there is enough support, I may even merge it, provided the UX is good. There are plenty of alternatives that already support APIs. One example: https://github.com/epicenter-so/epicenter/tree/main/apps/whispering
I suspect BLAS won't be enough either. And I do agree with you as well, but it's out of scope for what I want to support with Handy at the moment.
With all of this said, I am still open to the idea, especially if the community as a whole would like it. I need more support, feedback, and arguments to be comfortable supporting it. I'm happy for this issue to be a discussion about adding API support, and about the best way to do it for user experience while keeping the app FOSS.
I do think people with lower-end hardware should have access to transcription as well, and it's something I'll keep thinking about regardless.
I may be happy to add a sponsored endpoint (and maybe geppetto will be it), if it means the user experience can be just flipping a switch. I don't really want anyone to have to paste in an API key. I don't think this technology should be limited to people with the requisite hardware, nor do I think people should have to pay for it.
I only write this much because I do care, and care a lot. I don't know the best way forward. I'm playing with ideas and possibilities and stumbling through them. And at the same time I want to make a wonderful user experience. I want to understand more about the space, and the limitations.
Thanks for your very thoughtful reply. It really shines through how passionate you are about designing a solid product!
I'm glad to hear that you are potentially open to exploring it later or merging branches that support it, and thanks so much for the Epicenter Whispering alternative. I will try it out soon.
When you talk about a saturated market, though: I searched both via ChatGPT-assisted searches (a long thread) and directly through GitHub projects. From what I could find, everything except your project either didn't actually work in a user-friendly way or was poorly maintained or abandoned.
Perhaps you did a lot of research into other solutions before you started; are there any others I could try, off the top of your head, besides Epicenter?
I haven't used a ton of them, but here are a few.
There may be others too, but for Windows the options are more limited than I realized. For macOS there are many.
Thank you!
I reopened #18, as there might be a fast implementation that runs on CPU; it may be a good replacement and fix the slow inference for you. I will need to play around and test it on a variety of hardware.
Okay, one more note: you could maybe try Moonshine as well:
https://github.com/moonshine-ai/moonshine
The benefit of supporting an API is that it lets us separate the front end from the back end (for example, using NVIDIA's Parakeet rather than Whisper, since it's much faster).
And yeah, there aren't that many Windows front ends. I'm playing with whisper-writer / parakeet-writer right now.
@burntcouscous would you mind trying 0.5.0 with Parakeet to see if it's good enough for you?
I know it may not be; just curious. API support may come in the future, but I can't make any promises at the moment.
Any recommended software for Linux systems?
Whispering is one option which supports cloud models: https://github.com/epicenter-os/epicenter/tree/main/apps/whispering
Epicenter is the company that makes this, and they also sponsor Handy's development and some of the libraries I've put out for local transcription (big shout-out to @braden-w). transcription-rs will likely get API support at some point, and when it does, there's a chance Handy will too. But I have yet to decide on the UX.
I may just host Parakeet/Whisper for free for anyone using Handy. No API key required.
@cjpais Thank you so much! We love Handy and are proud sponsors of its development 🫶
I just returned from two weeks of back-to-back calls, so Whispering is currently undergoing an overhaul. I have several fixes to implement regarding models and Linux support within the next week 😅, but yes, Whispering supports various APIs!
@cjpais, let me know if you would like to discuss hosting Parakeet/Whisper! Happy to sponsor server costs; we also have some spare API credits on other providers like Replicate that we can probably put to use :)
Thank you for the thoughtful discussion. I redirected my request for Soniox to Whispering.
I still use Handy and enjoy it a lot with parakeet-tdt-0_6b-v3. Still, from my testing, I believe Soniox is a stronger model. Anyway, thanks for clearly stating your goals for Handy!