Support for Azure Cognitive Services Speech SDK
Expected Behavior
Spring AI should support Speech-to-Text functionality via the Azure Cognitive Services Speech SDK.
Current Behavior
Spring AI supports OpenAI Text-to-Speech (TTS).
Context
We are building an application that converts speech to text, where the audio input can come from a website or a mobile device. Referring to the cognitive-services-speech-sdk.
Hi @radhakrishna67. Spring AI provides APIs for Audio use cases, including OpenAI implementations for text-to-speech (https://docs.spring.io/spring-ai/reference/api/audio/speech/openai-speech.html) and speech-to-text (https://docs.spring.io/spring-ai/reference/api/audio/transcriptions/openai-transcriptions.html).
Does that solve your problem?
@ThomasVitale I forgot to mention Azure Cognitive Services Speech and Google Speech-to-Text support.
@ThomasVitale does Spring AI support the Azure and Google services that @radhakrishna67 mentioned?
Spring AI has support for LLM-based text-to-speech (OpenAI) and speech-to-text (OpenAI and Azure OpenAI). To complete the picture, I think it would make sense to have a feature request to implement support for text-to-speech for Azure OpenAI as well.
Azure provides speech services other than Azure OpenAI. Those are not supported by Spring AI. @radhakrishna67 what is the name of the specific service/API you'd like to integrate your app with? A link to the service documentation would also help, thanks.
About Google, same question: what's the specific service/API you're interested in?
Spring AI provides integrations with Google Vertex (which is about to be removed from the project since Google deprecated the service) and Google Gemini. As far as I know, Gemini itself doesn't provide speech-related capabilities.
@ThomasVitale: The PaLM2 models are being deprecated by Google, and the recommendation is to switch to Gemini models: https://ai.google.dev/palm_docs/palm
PaLM2 support is to be removed from Spring AI.
Please see two examples of transcribing audio and video data with multimodality, as supported directly by Vertex with Gemini models: https://github.com/ddobrin/gemini-workshop-for-spring-ai-java-developers/blob/main/src/main/java/gemini/workshop/MultimodalAudioExample.java
https://github.com/ddobrin/gemini-workshop-for-spring-ai-java-developers/blob/main/src/main/java/gemini/workshop/MultimodalVideoExample.java
I've renamed the issue to be a request to support the Cognitive Services Speech SDK, which has a different feature set than that offered by OpenAI transcription models such as Whisper.
The original request mentions "audio input from website/mobile device". There is a big difference here: the Azure Speech SDK appears to have Android and iOS support, which is not the domain of Spring AI.
While we currently have base AI model support for transcription, we do not have a portable service abstraction, such as a TranscriptionModel and a possibly related TranscriptionClient. This issue is related in that regard: https://github.com/spring-projects/spring-ai/issues/1478
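To make the idea concrete, here is a minimal sketch of what such a portable abstraction could look like, loosely following the Model/Prompt/Response style used elsewhere in Spring AI. All names here (TranscriptionModel, TranscriptionPrompt, TranscriptionResponse, EchoTranscriptionModel) are hypothetical illustrations, not the actual Spring AI API:

```java
import java.util.Map;

// Hypothetical sketch: all type names below are assumptions, not real Spring AI classes.

/** Carries the audio to transcribe plus provider-agnostic options (e.g. language). */
record TranscriptionPrompt(byte[] audio, Map<String, Object> options) {}

/** The transcribed text returned by a provider implementation. */
record TranscriptionResponse(String text) {}

/** Portable abstraction each provider (OpenAI, Azure Speech, Google) would implement. */
interface TranscriptionModel {
    TranscriptionResponse call(TranscriptionPrompt prompt);
}

/** Trivial stand-in implementation, only to show the call shape. */
class EchoTranscriptionModel implements TranscriptionModel {
    @Override
    public TranscriptionResponse call(TranscriptionPrompt prompt) {
        // A real implementation would invoke the provider's SDK or REST API here.
        return new TranscriptionResponse(
                "transcribed " + prompt.audio().length + " bytes");
    }
}
```

With an abstraction like this, application code depends only on TranscriptionModel, and the Azure Speech SDK (or Google Speech-to-Text) would plug in as one more implementation alongside the existing OpenAI one.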
I asked ChatGPT to compare Whisper and the Speech SDK to understand the differences better. Here is a summary of the output.
⚖️ Comparison Summary
| Feature | Azure Speech SDK | OpenAI Whisper Model via Azure AI Services |
|---|---|---|
| Real-time transcription | ✅ Supported | ❌ Not supported |
| Batch transcription | ✅ Supported | ✅ Supported |
| Language support | 🌐 100+ languages and dialects | 🗣️ Multiple languages (output in English) |
| Customization | 🎯 Custom models supported | ❌ Not customizable |
| Speaker diarization | ✅ Supported | ✅ Supported (via Azure AI Speech) |
| Translation | ✅ Real-time speech translation | ✅ Speech-to-English only |
| Integration | 🛠️ SDKs for many platforms/languages | 🌐 REST API via Azure AI/OpenAI Services |
📝 Choosing the Right Tool
Use Azure Cognitive Services Speech SDK if:
- You need real-time transcription or translation.
- You want broad language support across 100+ languages and dialects.
- Your app requires custom speech models tailored to your domain.
- You're building interactive voice-enabled applications or services.
Use OpenAI Whisper Model via Azure if:
- You're processing large batches of pre-recorded audio.
- You need robust transcription that handles accents, noise, and informal speech.
- Your main goal is to transcribe or translate audio into English.
- You prefer using Azure AI for asynchronous speech-to-text workloads.