Support for Azure Cognitive Services Speech SDK
Expected Behavior
Spring AI should support Speech-to-Text functionality via the Azure Cognitive Services Speech SDK.
Current Behavior
Spring AI supports OpenAI Text-to-Speech (TTS).
Context
We are building an application that converts speech to text, where the audio input can come from a website or a mobile device. Referring to the cognitive-services-speech-sdk.
Hi @radhakrishna67. Spring AI provides APIs for Audio use cases, including OpenAI implementations for text-to-speech (https://docs.spring.io/spring-ai/reference/api/audio/speech/openai-speech.html) and speech-to-text (https://docs.spring.io/spring-ai/reference/api/audio/transcriptions/openai-transcriptions.html).
Does that solve your problem?
@ThomasVitale I forgot to mention Azure Cognitive Services Speech and Google Speech-to-Text support.
@ThomasVitale does Spring AI support the Azure and Google services that @radhakrishna67 mentioned?
Spring AI has support for LLM-based text-to-speech (OpenAI) and speech-to-text (OpenAI and Azure OpenAI). To complete the picture, I think it would make sense to have a feature request to implement support for text-to-speech for Azure OpenAI as well.
Azure provides speech services other than Azure OpenAI. Those are not supported by Spring AI. @radhakrishna67 what is the name of the specific service/API you'd like to integrate your app with? A link to the service documentation would also help, thanks.
About Google, same question: what's the specific service/API you're interested in?
Spring AI provides integrations with Google Vertex (which is about to be removed from the project since Google deprecated the service) and Google Gemini. As far as I know, Gemini itself doesn't provide speech-related capabilities.
@ThomasVitale: The PaLM2 models are being deprecated by Google, and the recommendation is to switch to Gemini models: https://ai.google.dev/palm_docs/palm
PaLM2 support is to be removed from Spring AI.
Please see two examples of transcribing audio and video data with multimodality, as supported directly by Vertex with Gemini models: https://github.com/ddobrin/gemini-workshop-for-spring-ai-java-developers/blob/main/src/main/java/gemini/workshop/MultimodalAudioExample.java
https://github.com/ddobrin/gemini-workshop-for-spring-ai-java-developers/blob/main/src/main/java/gemini/workshop/MultimodalVideoExample.java
I've renamed the issue to be a request to support the Cognitive Services Speech SDK, which has a different feature set than that offered by OpenAI transcription models such as Whisper.
The original request mentions "audio input from website/mobile device". There is a big difference here: the Azure Speech SDK appears to have Android and iOS support, which is not the domain of Spring AI.
While we currently have base AI model support for transcription, we do not have a portable service abstraction, such as a TranscriptionModel and a possibly related TranscriptionClient. This issue is related in that regard: https://github.com/spring-projects/spring-ai/issues/1478
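To make the idea concrete, here is a minimal sketch of what such a portable abstraction could look like, loosely following the Model/Prompt/Response style used elsewhere in Spring AI. All names here (TranscriptionModel, TranscriptionPrompt, TranscriptionResponse, EchoTranscriptionModel) are hypothetical illustrations, not the actual Spring AI API:

```java
import java.util.Map;

// Hypothetical sketch: all type names below are assumptions, not real Spring AI classes.

/** Carries the audio to transcribe plus provider-agnostic options (e.g. language). */
record TranscriptionPrompt(byte[] audio, Map<String, Object> options) {}

/** The transcribed text returned by a provider implementation. */
record TranscriptionResponse(String text) {}

/** Portable abstraction each provider (OpenAI, Azure Speech, Google) would implement. */
interface TranscriptionModel {
    TranscriptionResponse call(TranscriptionPrompt prompt);
}

/** Trivial stand-in implementation, only to show the call shape. */
class EchoTranscriptionModel implements TranscriptionModel {
    @Override
    public TranscriptionResponse call(TranscriptionPrompt prompt) {
        // A real implementation would invoke the provider's SDK or REST API here.
        return new TranscriptionResponse(
                "transcribed " + prompt.audio().length + " bytes");
    }
}
```

With an abstraction like this, application code depends only on TranscriptionModel, and the Azure Speech SDK (or Google Speech-to-Text) would plug in as one more implementation alongside the existing OpenAI one.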
I asked ChatGPT to compare Whisper and the Speech SDK to understand the differences better. Here is a summary of the output.
⚖️ Comparison Summary
| Feature | Azure Speech SDK | OpenAI Whisper Model via Azure AI Services |
|---|---|---|
| Real-time transcription | ✅ Supported | ❌ Not supported |
| Batch transcription | ✅ Supported | ✅ Supported |
| Language support | 🌐 100+ languages and dialects | 🗣️ Multiple languages (output in English) |
| Customization | 🎯 Custom models supported | ❌ Not customizable |
| Speaker diarization | ✅ Supported | ✅ Supported (via Azure AI Speech) |
| Translation | ✅ Real-time speech translation | ✅ Speech-to-English only |
| Integration | 🛠️ SDKs for many platforms/languages | 🌐 REST API via Azure AI/OpenAI Services |
📝 Choosing the Right Tool
Use Azure Cognitive Services Speech SDK if:
- You need real-time transcription or translation.
- You want broad language support across 100+ languages and dialects.
- Your app requires custom speech models tailored to your domain.
- You're building interactive voice-enabled applications or services.
Use OpenAI Whisper Model via Azure if:
- You're processing large batches of pre-recorded audio.
- You need robust transcription that handles accents, noise, and informal speech.
- Your main goal is to transcribe or translate audio into English.
- You prefer using Azure AI for asynchronous speech-to-text workloads.