
Documenting audio length limitations for OpenAI Whisper API.

Open xmoiduts opened this issue 1 year ago • 9 comments

When using the OpenAI Whisper API option to transcribe long audio (about 1 hour), Buzz reports:

"Failed (Maximum content size limit (26214400) exceeded (2629xxxx) bytes read) {..."

According to the OpenAI community, the Whisper API has a 25 MB file limit, while Buzz converts input files to PCM WAV at a 16,000 Hz sample rate. This means audio longer than roughly 800 seconds will fail (I tried 850 seconds and hit the error).
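
For reference, that ~800-second figure falls straight out of the byte budget. A minimal sketch of the arithmetic, assuming Buzz uploads 16-bit mono PCM WAV (header overhead ignored):

```python
# 25 MiB cap from the error message, divided by the PCM byte rate.
MAX_BYTES = 26_214_400               # "Maximum content size limit (26214400)"
BYTES_PER_SECOND = 16_000 * 2        # 16 kHz * 2 bytes/sample, mono (assumed)

print(MAX_BYTES / BYTES_PER_SECOND)  # 819.2 seconds, i.e. about 13.5 minutes
```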

I recommend:

  1. Documenting the length limit; or,
  2. To accommodate longer audio, re-encoding the input media to a higher-compression format (rather than PCM) for OpenAI Whisper API jobs when the input audio is already at a low bitrate.

xmoiduts avatar Feb 18 '24 16:02 xmoiduts

GPT polite, short version: Hello, I encountered an issue with Buzz while processing a 13 MB .m4a file from a 27-minute video. It failed, but my custom Python script succeeded. This suggests a potential issue with Buzz. Could it be looked into? Thank you.

~~long, human, uneducated version: hi, for a 27 min video, I extracted the .m4a, which is 13 MB. I fed it into Buzz and it failed as above. But when I used MY OWN Python script, it worked. There is something wrong with Buzz then. Needs improvement. thanks~~

ccchan234 avatar Feb 20 '24 20:02 ccchan234

I am not sure what's wrong with the part of Buzz's code that transcribes,

so I'll just paste mine (the relevant part) here.

I am no programmer; I got this from ChatGPT Plus.

```python
import os
import tkinter as tk
from tkinter import ttk, filedialog, messagebox

import openai  # legacy (pre-1.0) OpenAI SDK; openai.Audio.transcribe was removed in 1.x


class AudioTranscriberApp:
    def __init__(self, master):
        self.master = master
        master.title("Audio Transcriber")

        self.label = tk.Label(master, text="Choose an audio file to transcribe")
        self.label.pack()

        # Language selection
        self.language_label = tk.Label(master, text="Select Language:")
        self.language_label.pack()

        self.language = tk.StringVar()
        self.language_combobox = ttk.Combobox(master, textvariable=self.language)
        # Include the most common languages and then the UN + G7 + Korean languages
        self.language_combobox['values'] = (
            'English (en)', 'Chinese (zh)', 'Japanese (ja)', 'Korean (ko)',  # Most common
            'French (fr)', 'Russian (ru)', 'Spanish (es)', 'Arabic (ar)',  # UN languages
            'German (de)', 'Italian (it)'  # G7 languages
        )
        self.language_combobox['state'] = 'readonly'  # Prevent user from typing a value
        self.language_combobox.set('English (en)')  # Set default value
        self.language_combobox.pack()

        self.transcribe_button = tk.Button(master, text="Choose File", command=self.transcribe_audio)
        self.transcribe_button.pack()

        self.close_button = tk.Button(master, text="Close", command=master.quit)
        self.close_button.pack()

    def generate_new_filename(self, path):
        # Helper body was not included in the original paste; minimal
        # reconstruction that avoids overwriting an existing file.
        base, ext = os.path.splitext(path)
        counter = 1
        while os.path.exists(path):
            path = f"{base}_{counter}{ext}"
            counter += 1
        return path

    def transcribe_audio(self):
        audio_file_path = filedialog.askopenfilename(
            filetypes=[("Audio Files", "*.mp3 *.wav *.m4a"), ("All Files", "*.*")]
        )

        if not audio_file_path:
            # If no file is selected, do nothing
            return

        # Extract the language code from the selection
        language_code = self.language.get().split(' ')[-1].strip('()')

        base_filename = os.path.splitext(audio_file_path)[0]
        srt_file_path = self.generate_new_filename(f"{base_filename}.srt")
        txt_file_path = self.generate_new_filename(f"{base_filename}.txt")

        try:
            # Open the audio file in binary mode and request transcription in the selected language
            with open(audio_file_path, "rb") as audio_file:
                transcript_response = openai.Audio.transcribe(
                    file=audio_file,
                    model="whisper-1",
                    response_format="srt",
                    language=language_code
                )

            # With response_format="srt" the response is the SRT text itself
            transcription_text = (
                transcript_response['choices'][0]['text']
                if 'choices' in transcript_response
                else transcript_response
            )

            # Save the SRT transcription
            with open(srt_file_path, 'w') as srt_file:
                srt_file.write(transcription_text)
            print("Transcription (SRT format) saved to:", srt_file_path)

            # Convert SRT to plain text and save
            with open(srt_file_path, 'r') as srt_file, open(txt_file_path, 'w') as txt_file:
                for line in srt_file:
                    if line.strip().isdigit() or line.strip() == '' or '-->' in line:
                        continue
                    txt_file.write(line)
            print("Plain text saved to:", txt_file_path)

            messagebox.showinfo("Success", "Transcription completed successfully!\nFiles have been saved.")

        except Exception as e:
            messagebox.showerror("Error", f"An error occurred: {e}")


if __name__ == "__main__":
    root = tk.Tk()
    app = AudioTranscriberApp(root)
    root.mainloop()
```

ccchan234 avatar Feb 21 '24 07:02 ccchan234

> When using the OpenAI Whisper API option to transcribe long audio (about 1 hour), Buzz reports:
>
> "Failed (Maximum content size limit (26214400) exceeded (2629xxxx) bytes read) {..."

For an educational video, 3:30 of video needs about 1.7 MB as .m4a (smaller than .mp3).

So a video of about 49 minutes could in theory be processed by my script; not too bad.
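
As a sanity check on that estimate, a hypothetical back-of-envelope, assuming the 1.7 MB per 3.5 minutes rate holds across the whole file:

```python
# How many minutes of .m4a fit under the 25 MB API cap at this bitrate?
rate_mb_per_min = 1.7 / 3.5     # ~0.49 MB per minute
print(25 / rate_mb_per_min)     # ~51 minutes; ~49 with a little safety margin
```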

But I did see someone with a script that cuts the audio into parts and then submits them individually.

I'll borrow code from that script later if I have time.

ccchan234 avatar Feb 21 '24 07:02 ccchan234

Thanks for the report here. I fixed this issue months ago (https://github.com/chidiwilliams/buzz/pull/652), but I haven't had time to release a new version recently. I'll try to do so this coming week and update this thread. Thanks!

chidiwilliams avatar Feb 24 '24 00:02 chidiwilliams

> Thanks for the report here. I fixed this issue months ago (#652), but I haven't had time to release a new version recently. I'll try to do so this coming week and update this thread. Thanks!

hey, on a website someone showed how to split the file, grab the SRT and TXT, THEN join them together.

I'm a novice programmer using ChatGPT; I could copy that approach.

Indeed, I was still writing it yesterday.

Do you think you can incorporate it? Or show me the function, and I'll use ChatGPT to implement it for you?

I don't want to re-invent the wheel.

That author's YouTube video on this: https://www.youtube.com/watch?v=-FtsoKryhPY&t=638s

The lab on Colab: https://colab.research.google.com/github/ywchiu/largitdata/blob/master/code/Course_221.ipynb#scrollTo=XfeeQGUQQOwx

The page on GitHub: https://github.com/ywchiu/largitdata/blob/master/code/Course_221.ipynb

Essentially, split using pydub. (There were some Chinese comments; I asked GPT to translate the comments and NOT touch the code. Please verify against the code linked above, as GPT sometimes malfunctions.)

```python
#@title Split YouTube Video
from pydub import AudioSegment

#@markdown ### Length of the segment to split (in milliseconds):
segment_length = 1000000  #@param {type:"integer"}

# Load the MP3 audio file (filename is defined in an earlier Colab cell)
sound = AudioSegment.from_file(f'{filename}.mp3', format='mp3')

sound_track = []

# Split the audio file into multiple files
for i, chunk in enumerate(sound[::segment_length]):
    # Set the filename for the split file
    chunk.export(f'output_{i}.mp3', format='mp3')
    audio_file = open(f'output_{i}.mp3', "rb")
    sound_track.append(audio_file)
```

He uses Jupyter/Colab, so the code is in blocks. His idea is to split the file at a fixed 1,000,000 ms (very important, otherwise it's hard to recalculate the SRT times), then submit the chunks to the Whisper API in a loop. Then, for each SRT file, make sure the maximum timestamp is <= the chunk's duration.

Code:

```python
max_time = pysrt.SubRipTime(seconds=1000)
for sub in subtitles1:
    sub.start = sub.start if sub.start < max_time else max_time
    sub.end = sub.end if sub.end < max_time else max_time
```

Then join the SRT files (and the TXT files; that part is easy for TXT). The timestamps in each file need to be recalculated according to the chunk's position in the loop.

Code:

```python
shift_time = pysrt.SubRipTime(seconds=1000)
for sub in subtitles2:
    sub.start = sub.start + shift_time
    sub.end = sub.end + shift_time
```
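
Putting the clamp and shift together, a minimal merge sketch, assuming fixed 1000-second chunks and one `output_{i}.srt` per chunk (the file names and `num_chunks` are hypothetical stand-ins):

```python
import pysrt

CHUNK_SECONDS = 1000  # must match the 1,000,000 ms segment_length above

merged = pysrt.SubRipFile()
for i in range(num_chunks):  # num_chunks = number of transcribed pieces
    subs = pysrt.open(f'output_{i}.srt')
    max_time = pysrt.SubRipTime(seconds=CHUNK_SECONDS)
    for sub in subs:
        # Clamp stray timestamps to the chunk length...
        sub.start = min(sub.start, max_time)
        sub.end = min(sub.end, max_time)
    # ...then shift by the chunk's position so times are absolute again
    subs.shift(seconds=i * CHUNK_SECONDS)
    merged.extend(subs)

merged.clean_indexes()  # renumber the subtitles 1..N after concatenation
merged.save('merged.srt', encoding='utf-8')
```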

this will make BUZZ much more useful!

ccchan234 avatar Feb 24 '24 03:02 ccchan234

I need this function badly right now, so I think I'll try to implement it as a simple Python script this weekend.

If you are interested, please just let me know; I guess you use Python too.

Then you tell me the part that deals with the processing, and I could implement it there.

PS: I am no programmer; I can code mostly thanks to some college courses and ChatGPT, so please don't expect me to do it perfectly alone.

Thanks

ccchan234 avatar Feb 24 '24 03:02 ccchan234

PS: I tried to do the splitting of .m4a with pydub and ffmpeg, but I failed; .mp3 was much easier last night. (The above Colab also uses .mp3; it's larger in size but seems easier to work with than .m4a.)

So for now I'll try again with .mp3.
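
For what it's worth, pydub can usually read .m4a fine; it's the export step that trips people up, because ffmpeg has no muxer literally named "m4a". A minimal sketch, assuming ffmpeg is on PATH (the "ipod" muxer writes .m4a):

```python
from pydub import AudioSegment

# Reading .m4a works as long as ffmpeg can decode it (format is auto-detected)
sound = AudioSegment.from_file("input.m4a")

# Exporting: use the "ipod" (or "mp4") muxer with the AAC codec for .m4a
sound[:1000000].export("output_0.m4a", format="ipod", codec="aac")
```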

ccchan234 avatar Feb 24 '24 03:02 ccchan234

> Thanks for the report here. I fixed this issue months ago (#652), but I haven't had time to release a new version recently. I'll try to do so this coming week and update this thread. Thanks!

hi, I see you are implementing it too? https://app.codecov.io/gh/chidiwilliams/buzz/pull/652/blob/buzz/transcriber.py

OK, I'll wait for your reply at the weekend. I hope both SRT and TXT work well. Thank you.


Here is the chunking part from that PR:

```python
# If the file is larger than 25MB, split into chunks
# and transcribe each chunk separately
num_chunks = math.ceil(total_size / max_chunk_size)
chunk_duration = duration_secs / num_chunks

segments = []

for i in range(num_chunks):
    chunk_start = i * chunk_duration
    chunk_end = min((i + 1) * chunk_duration, duration_secs)
```

ccchan234 avatar Feb 24 '24 03:02 ccchan234

> Thanks for the report here. I fixed this issue months ago (#652), but I haven't had time to release a new version recently. I'll try to do so this coming week and update this thread. Thanks!

hi, for those who want a temporary solution, the above Colab script is a good start.

It helps to split and merge the video, one piece at a time.

It would be useful while we wait for the update.

Buzz is good in that it can handle many files at once (though not recursively, and there's no need to, as people likely won't do that; recursively processing media files would be very CPU-demanding).

Thanks

ccchan234 avatar Feb 26 '24 06:02 ccchan234

Custom API endpoint support was added in 1.0.0. See the list of currently known custom API endpoints to use: https://github.com/chidiwilliams/buzz/discussions/827

An alternative solution is to run a custom Whisper transcription back-end on a cloud platform. GPU support is also available for local models.
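
For anyone going the self-hosted route, this is the usual shape of calling an OpenAI-compatible endpoint with the official openai-python package (1.x); the base URL below is a hypothetical placeholder, not a Buzz setting:

```python
from openai import OpenAI

# Point the client at a self-hosted, OpenAI-compatible Whisper back-end.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with open("audio.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

print(transcript.text)
```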

raivisdejus avatar Jul 11 '24 05:07 raivisdejus