Missing x-timestamps-path Response Header on dev/captioned_speech Endpoint

Open opiuman opened this issue 6 months ago • 7 comments

Describe the bug: In the v0.2.4 release, the response from the dev/captioned_speech endpoint does not include the x-timestamps-path header, even when the request body explicitly sets "return_timestamps": true. Previously, the endpoint returned this header when timestamps were requested.

Screenshots or console output: The following screenshots compare the response headers from v0.2.2 and v0.2.4 for the same request. In v0.2.2 the x-timestamps-path header is present; in v0.2.4 it is missing.

[Image: response headers from v0.2.2 and v0.2.4]

Branch / Deployment used: Docker container running the image pulled directly from ghcr.io/remsky/kokoro-fastapi-cpu:v0.2.4

Operating System: Docker Desktop v4.42.1 on Mac mini 4 (macOS 15.5 (24F74))

Additional context: n/a

opiuman Jul 01 '25 05:07

Hi, I'm experiencing the same issue.

Branch / Deployment used: Pulled from ghcr.io/remsky/kokoro-fastapi-cpu:v0.2.4

Operating System: Docker Desktop 28.2.2, build e6534b4, Windows 11 Home, Version 10.0.26100 Build 26100

tongshen-yong Jul 03 '25 17:07

@opiuman @tongshen-yong Right now the timestamps are returned in the JSON response body rather than via a header, either in chunks (streaming) or all at once (non-streaming):

Taken from readme.md:

Streaming:

import requests
import base64
import json

response = requests.post(
    "http://localhost:8880/dev/captioned_speech",
    json={
        "model": "kokoro",
        "input": "Hello world!",
        "voice": "af_bella",
        "speed": 1.0,
        "response_format": "mp3",
        "stream": True,
    },
    stream=True
)

with open("output.mp3", "wb") as f:
    for chunk in response.iter_lines(decode_unicode=True):
        if chunk:
            chunk_json = json.loads(chunk)

            # Decode the base64-encoded audio into bytes
            chunk_audio = base64.b64decode(chunk_json["audio"].encode("utf-8"))

            # Append this chunk's audio to the output file
            f.write(chunk_audio)

            # Print word-level timestamps for this chunk
            print(chunk_json["timestamps"])

Non Streaming:

import requests
import base64
import json

response = requests.post(
    "http://localhost:8880/dev/captioned_speech",
    json={
        "model": "kokoro",
        "input": "Hello world!",
        "voice": "af_bella",
        "speed": 1.0,
        "response_format": "mp3",
        "stream": False,
    },
    stream=False
)

audio_json = json.loads(response.content)

# Decode the base64-encoded audio into bytes
audio_bytes = base64.b64decode(audio_json["audio"].encode("utf-8"))

# Write the complete audio to the output file
with open("output.mp3", "wb") as f:
    f.write(audio_bytes)

# Print word-level timestamps
print(audio_json["timestamps"])
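For anyone who relied on the x-timestamps-path header to fetch all timestamps in one go, the streamed chunks can be merged on the client instead. The following is only a sketch, not part of the project: `collect_captioned_chunks` is a hypothetical helper name, and it assumes each non-empty line is a JSON object with a base64 "audio" field and a "timestamps" list, as in the examples above. It treats the timestamp entries as opaque JSON values, and it assumes "timestamps" may be missing or null on some chunks.

```python
import base64
import json
from typing import Any, Iterable, List, Tuple


def collect_captioned_chunks(lines: Iterable[str]) -> Tuple[bytes, List[Any]]:
    """Merge newline-delimited JSON chunks into one audio buffer and one timestamp list.

    Hypothetical helper: assumes each chunk has a base64 "audio" field and,
    optionally, a "timestamps" list.
    """
    audio = bytearray()
    timestamps: List[Any] = []
    for line in lines:
        if not line:
            continue  # skip blank lines between chunks
        chunk = json.loads(line)
        audio.extend(base64.b64decode(chunk["audio"]))
        # Assumption: "timestamps" can be absent or null, so fall back to an empty list
        timestamps.extend(chunk.get("timestamps") or [])
    return bytes(audio), timestamps
```

With the streaming request above, this could be called as `audio, timestamps = collect_captioned_chunks(response.iter_lines(decode_unicode=True))`, after which the audio can be written to a file and the timestamps dumped with `json.dump` to a sidecar file if one is needed.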

fireblade2534 Jul 04 '25 22:07

same here

Makooooooooo Jul 07 '25 00:07

@Makooooooooo see my response above

fireblade2534 Jul 10 '25 18:07

I am having the same issue.

kattapug Jul 16 '25 14:07

same here

antikilahdjs Oct 01 '25 13:10

It is still the same

hayttle Oct 03 '25 19:10