Missing x-timestamps-path Response Header on dev/captioned_speech Endpoint
Describe the bug
In the v0.2.4 release, the response from the dev/captioned_speech endpoint does not include the x-timestamps-path header, even when the request body explicitly sets "return_timestamps": true. Previously, the header was returned whenever timestamps were requested.
Screenshots or console output
The following screenshots compare the response headers from v0.2.2 and v0.2.4 for the same request. In v0.2.2 the x-timestamps-path header is present; in v0.2.4 it is missing.
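For anyone who wants to reproduce the comparison without screenshots, here is a minimal sketch. It assumes the server listens on http://localhost:8880 and that the request fields below are accepted by the endpoint. `has_timestamps_header` and `fetch_headers` are hypothetical helper names, not part of the project, and the sketch uses only the standard library so it runs without extra packages.

```python
import json
import urllib.request

HEADER = "x-timestamps-path"


def has_timestamps_header(headers):
    """Return True if the header name is present, ignoring case."""
    return any(name.lower() == HEADER for name in headers)


def fetch_headers(base_url="http://localhost:8880"):
    """POST to dev/captioned_speech and return the response headers as a dict.

    The base URL and the request fields are assumptions for illustration.
    """
    body = json.dumps({
        "model": "kokoro",
        "input": "Hello world!",
        "voice": "af_bella",
        "response_format": "mp3",
        "return_timestamps": True,
    }).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/dev/captioned_speech",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return dict(response.headers.items())
```

Calling `has_timestamps_header(fetch_headers())` against a container of each version should show which one still returns the header.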
Branch / Deployment used
Docker container running the image pulled directly from ghcr.io/remsky/kokoro-fastapi-cpu:v0.2.4
Operating System
Docker Desktop v4.42.1 on Mac mini 4 (macOS 15.5 (24F74))
Additional context
n/a
Hi, I'm experiencing the same issue.
Branch / Deployment used
Pulled from ghcr.io/remsky/kokoro-fastapi-cpu:v0.2.4
Operating System
Docker Desktop 28.2.2, build e6534b4, Windows 11 Home, Version 10.0.26100 Build 26100
@opiuman @tongshen-yong Right now the timestamps are returned in the JSON response body, either in chunks (streaming) or all at once (non-streaming):
Taken from readme.md:
Streaming:
import requests
import base64
import json

response = requests.post(
    "http://localhost:8880/dev/captioned_speech",
    json={
        "model": "kokoro",
        "input": "Hello world!",
        "voice": "af_bella",
        "speed": 1.0,
        "response_format": "mp3",
        "stream": True,
    },
    stream=True,
)

# Open the output file in a context manager so it is closed when streaming ends
with open("output.mp3", "wb") as f:
    for chunk in response.iter_lines(decode_unicode=True):
        if chunk:
            chunk_json = json.loads(chunk)
            # Decode the base64-encoded audio to bytes
            chunk_audio = base64.b64decode(chunk_json["audio"].encode("utf-8"))
            # Append this audio chunk to the output file
            f.write(chunk_audio)
            # Print word-level timestamps for this chunk
            print(chunk_json["timestamps"])
Non-streaming:
import requests
import base64
import json

response = requests.post(
    "http://localhost:8880/dev/captioned_speech",
    json={
        "model": "kokoro",
        "input": "Hello world!",
        "voice": "af_bella",
        "speed": 1.0,
        "response_format": "mp3",
        "stream": False,
    },
    stream=False,
)

with open("output.mp3", "wb") as f:
    audio_json = json.loads(response.content)
    # Decode the base64-encoded audio to bytes
    chunk_audio = base64.b64decode(audio_json["audio"].encode("utf-8"))
    # Write the complete audio to the output file
    f.write(chunk_audio)
    # Print word-level timestamps
    print(audio_json["timestamps"])
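Both variants decode the payload the same way, so the shared step can be factored into one helper. This is only a sketch: `decode_captioned_payload` is a hypothetical name, and treating a missing "timestamps" key as an empty list is an assumption rather than documented behaviour. The "audio" and "timestamps" field names come from the README examples above.

```python
import base64
import json


def decode_captioned_payload(payload):
    """Split one JSON payload into (audio_bytes, timestamps).

    `payload` is either one line of the streamed response or the whole
    non-streamed body, given as str or bytes.
    """
    data = json.loads(payload)
    # The "audio" field holds base64-encoded audio data
    audio = base64.b64decode(data["audio"])
    # Assumption: a missing or null "timestamps" field means no timestamps
    timestamps = data.get("timestamps") or []
    return audio, timestamps
```

In the streaming case it would be called once per non-empty line from `response.iter_lines()`, and in the non-streaming case once on `response.content`.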
same here
@Makooooooooo see my response above
I am having the same issue.
same here
It is still the same