WebSocket text-to-speech with input streaming - always returns MP3 format
Hi there! No matter what output_format I pass, MP3 is always returned. This happens only with WebSockets (input streaming).
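from elevenlabs import Voice, VoiceSettings
from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key="YOUR_API_KEY")  # placeholder key


def text_stream():
    # Generator input makes generate() use the WebSocket (input streaming) path
    yield "Hello my friend "
    yield "How are you today? "


audio_stream = client.generate(
    text=text_stream(),
    stream=True,
    model="eleven_multilingual_v2",
    output_format="pcm_24000",
    # Also tried passing the format as a header; it made no difference
    request_options={
        "additional_headers": {
            "output_format": "pcm_24000"
        }
    },
    optimize_streaming_latency=3,
    voice=Voice(
        voice_id='yDVKFZyiAaNeYZvcliQG',
        settings=VoiceSettings(stability=0.32, similarity_boost=1.0, style=1.0, use_speaker_boost=True)
    )
)

for chunk in audio_stream:
    print(chunk)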
Example of response (looks like mp3):
b'\xff\xfb\x90\xc4\x00\x00\x10\xf9\x9b#\xac0\xc6\x89\xec\xb1(t\xf4\x8d\xb6\x00\x00\x00\x10\x84\x92mF\xa0j\x0e\x8fe\xa3@\x81\xa0\x88(\xa9\x00`\x1f\x11\xc9\xe6fk\xc9\x93&@\x82\x1e\xed\x81"\xc9\xa6}\xdd\xde\xc4DD\x18B\x13\xbb\xbb\xbd\xf6A\x08\x88\x88\x88\xbb\xbb<\x9awq\x11\x11\x11\x11d\xc9\xdd\xdd\xefh\x88 \x84D]\xdd\xdd\xdd\xd9\x08\x88\x88\x88\x8b\xbb\xbb\xbb\xbb\x88\x88\x88\x8c\x8b\xbb\xbb\xdf\xf4\xc4""\x08G\xdfv\xf7\xb6....
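For anyone who wants to confirm this on their own output, here is a minimal sketch of a header check. The function name `looks_like_mp3` is hypothetical and not part of the SDK. It only tests whether a chunk starts with an ID3 tag or an MPEG audio frame sync, so treat it as a heuristic: raw PCM samples can occasionally begin with the same bytes.

```python
def looks_like_mp3(chunk: bytes) -> bool:
    """Heuristic: True if the chunk starts with an ID3 tag or an MPEG audio frame header."""
    if chunk[:3] == b"ID3":
        return True
    if len(chunk) < 2:
        return False
    # An MPEG audio frame starts with 11 set bits (the frame sync).
    has_sync = chunk[0] == 0xFF and (chunk[1] & 0xE0) == 0xE0
    # The layer field value 0b00 is reserved, so a real frame never uses it.
    layer_bits = (chunk[1] >> 1) & 0x03
    return has_sync and layer_bits != 0


# First bytes of the chunk printed above.
print(looks_like_mp3(b"\xff\xfb\x90\xc4"))  # True
```

Running it on the first chunk of the stream should be enough to tell whether the requested PCM format was honored.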
Yeah, I was having this issue and I think I solved it. Go to your site-packages directory and find the realtime_tts.py file. This file defines the RealtimeTextToSpeechClient class and its convert_realtime() method, which is what gets called when you use client.generate() with stream set to True and a generator function as the text, as in your code. For some reason, generate() does not pass the output_format param to convert_realtime(). So find line 89 in realtime_tts.py and add the output format you want to the end of the URL, like so: "wss://api.elevenlabs.io/", f"v1/text-to-speech/{jsonable_encoder(voice_id)}/stream-input?model_id={model_id}&output_format=pcm_24000"
That should do it!
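To make the shape of the patched URL easier to see, here is a rough standalone sketch. `build_stream_input_url` is a hypothetical helper, not the SDK's code; it just assembles the same base URL and path as the workaround, with `output_format` as a parameter instead of a hard-coded value.

```python
from urllib.parse import quote, urlencode

BASE_URL = "wss://api.elevenlabs.io/"  # base URL quoted in the workaround above


def build_stream_input_url(voice_id: str, model_id: str, output_format: str) -> str:
    """Hypothetical helper: build the stream-input URL with output_format included."""
    query = urlencode({"model_id": model_id, "output_format": output_format})
    return f"{BASE_URL}v1/text-to-speech/{quote(voice_id, safe='')}/stream-input?{query}"


print(build_stream_input_url("yDVKFZyiAaNeYZvcliQG", "eleven_multilingual_v2", "pcm_24000"))
```

The point is only that the format has to end up in the query string of the WebSocket URL; the header in the original snippet does not affect it.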
@skinnynpale @ryandonahue1 appreciate the patience here, the SDK has been fixed in v1.5.0. Please let us know if you run into any more issues!
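Once the stream really is PCM, the chunks are headerless samples, so most players will not open them directly. Below is a minimal sketch that wraps them in a WAV container using the standard library. It assumes that pcm_24000 means 16-bit mono samples at 24 kHz; please check that against the API docs. `pcm_chunks_to_wav` is a hypothetical helper name, not an SDK function.

```python
import io
import wave
from typing import Iterable


def pcm_chunks_to_wav(chunks: Iterable[bytes], sample_rate: int = 24000) -> bytes:
    """Wrap raw PCM chunks in a WAV container (assumes 16-bit mono samples)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)  # assumption: mono
        wav_file.setsampwidth(2)  # assumption: 16-bit samples
        wav_file.setframerate(sample_rate)
        for chunk in chunks:
            wav_file.writeframes(chunk)
    return buffer.getvalue()
```

With the code from the original report, something like `open("out.wav", "wb").write(pcm_chunks_to_wav(audio_stream))` should give a playable file if those assumptions hold.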