Documenting audio length limitations for OpenAI Whisper API.
When using the OpenAI Whisper API option to transcribe a long audio file (about 1 hour), Buzz reports:
"Failed (Maximum content size limit (26214400) exceeded (2629xxxx) bytes read) {..."
According to the OpenAI community, the Whisper API has a 25 MB file limit, while Buzz converts input files to PCM at a 16,000 Hz sample rate. This means audio exceeding roughly 800 seconds will fail (I tried 850 seconds and hit the error; see the sketch after the recommendations below for the arithmetic).
I recommend:
- Documenting the length limit; or,
- To support longer audio, re-encoding the input media to a higher-compression format for OpenAI Whisper API jobs when the input audio is already at a low bitrate.
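For reference, here is where the ~800-second figure comes from, as a minimal sketch (assuming 16-bit mono PCM, which is consistent with the byte count in the error message):

```python
# Back-of-the-envelope duration limit for 16-bit mono PCM at 16 kHz
# against the 25 MiB cap reported in the error message
SAMPLE_RATE = 16000          # Hz
BYTES_PER_SAMPLE = 2         # 16-bit mono PCM (assumed)
MAX_CONTENT_SIZE = 26214400  # bytes (25 MiB), from the error message

bytes_per_second = SAMPLE_RATE * BYTES_PER_SAMPLE
max_seconds = MAX_CONTENT_SIZE / bytes_per_second
print(f"~{max_seconds:.0f} seconds")  # prints "~819 seconds", matching the ~800 s observed
```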
GPT polite, short version: Hello, I encountered an issue with Buzz while processing a 13 MB .m4a file from a 27-minute video. It failed, but my custom Python script succeeded. This suggests a potential issue with Buzz. Could it be looked into? Thank you.
~~Long, human, uneducated version: hi, for a 27-minute video, I extracted the .m4a, which is 13 MB. I fed it into Buzz and it failed as above, but when I used MY OWN Python script, it worked. So something is wrong with Buzz; it needs improvement. Thanks~~
I'm not sure what's wrong with the part of Buzz's code that transcribes,
so I'll just paste the relevant part of mine here.
I'm no programmer; I got this from ChatGPT Plus.
```python
import os
import tkinter as tk
from tkinter import ttk, filedialog, messagebox

import openai  # openai<1.0; reads OPENAI_API_KEY from the environment


class AudioTranscriberApp:
    def __init__(self, master):
        self.master = master
        master.title("Audio Transcriber")

        self.label = tk.Label(master, text="Choose an audio file to transcribe")
        self.label.pack()

        # Language selection
        self.language_label = tk.Label(master, text="Select Language:")
        self.language_label.pack()
        self.language = tk.StringVar()
        self.language_combobox = ttk.Combobox(master, textvariable=self.language)
        # Include the most common languages and then the UN + G7 + Korean languages
        self.language_combobox['values'] = (
            'English (en)', 'Chinese (zh)', 'Japanese (ja)', 'Korean (ko)',  # Most common
            'French (fr)', 'Russian (ru)', 'Spanish (es)', 'Arabic (ar)',    # UN languages
            'German (de)', 'Italian (it)',                                   # G7 languages
        )
        self.language_combobox['state'] = 'readonly'  # Prevent user from typing a value
        self.language_combobox.set('English (en)')    # Set default value
        self.language_combobox.pack()

        self.transcribe_button = tk.Button(master, text="Choose File", command=self.transcribe_audio)
        self.transcribe_button.pack()
        self.close_button = tk.Button(master, text="Close", command=master.quit)
        self.close_button.pack()

    def generate_new_filename(self, path):
        # Minimal assumed implementation (this helper was missing from the paste):
        # avoid overwriting an existing file by appending a counter.
        base, ext = os.path.splitext(path)
        counter = 1
        while os.path.exists(path):
            path = f"{base} ({counter}){ext}"
            counter += 1
        return path

    def transcribe_audio(self):
        audio_file_path = filedialog.askopenfilename(
            filetypes=[("Audio Files", "*.mp3 *.wav *.m4a"), ("All Files", "*.*")]
        )
        if not audio_file_path:
            # If no file is selected, do nothing
            return
        # Extract the language code from the selection, e.g. 'English (en)' -> 'en'
        language_code = self.language.get().split(' ')[-1].strip('()')
        base_filename = os.path.splitext(audio_file_path)[0]
        srt_file_path = self.generate_new_filename(f"{base_filename}.srt")
        txt_file_path = self.generate_new_filename(f"{base_filename}.txt")
        try:
            # Open the audio file in binary mode and request transcription
            # in the selected language
            with open(audio_file_path, "rb") as audio_file:
                transcript_response = openai.Audio.transcribe(
                    file=audio_file,
                    model="whisper-1",
                    response_format="srt",
                    language=language_code,
                )
            # With response_format="srt" the response is the SRT text itself
            transcription_text = (
                transcript_response['choices'][0]['text']
                if 'choices' in transcript_response
                else transcript_response
            )
            # Save the SRT transcription
            with open(srt_file_path, 'w') as srt_file:
                srt_file.write(transcription_text)
            print("Transcription (SRT format) saved to:", srt_file_path)
            # Convert SRT to plain text: drop cue numbers, timestamps, and blank lines
            with open(srt_file_path, 'r') as srt_file, open(txt_file_path, 'w') as txt_file:
                for line in srt_file:
                    if line.strip().isdigit() or line.strip() == '' or '-->' in line:
                        continue
                    txt_file.write(line)
            print("Plain text saved to:", txt_file_path)
            messagebox.showinfo("Success", "Transcription completed successfully!\nFiles have been saved.")
        except Exception as e:
            messagebox.showerror("Error", f"An error occurred: {e}")
```
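For completeness, a minimal way to run the class above (not part of the original paste; it assumes the API key is available via the `OPENAI_API_KEY` environment variable, which the openai library reads automatically):

```python
import tkinter as tk

if __name__ == "__main__":
    root = tk.Tk()
    app = AudioTranscriberApp(root)
    root.mainloop()
```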
For an educational video, 3:30 of video needs about 1.7 MB as .m4a (smaller than .mp3),
so a video of about 49 minutes could in theory be processed by my script, which isn't too bad.
But I did see someone with a script that cuts the audio into parts and then submits them individually.
I'll borrow code from that script later if I have time.
Thanks for the report here. I fixed this issue months ago (https://github.com/chidiwilliams/buzz/pull/652), but I haven't had time to release a new version recently. I'll try to do so this coming week and update this thread. Thanks!
Hey, someone on a website showed how to split the file, grab the SRT and TXT, THEN join them together.
I'm a naive programmer using ChatGPT, so I could copy that.
In fact, I was still writing it yesterday.
Do you think you can incorporate it? Or show me the relevant function, and I'll use ChatGPT to implement it for you.
I don't want to re-invent the wheel.
The author's YouTube video on this: https://www.youtube.com/watch?v=-FtsoKryhPY&t=638s
The lab on Colab: https://colab.research.google.com/github/ywchiu/largitdata/blob/master/code/Course_221.ipynb#scrollTo=XfeeQGUQQOwx
The page on GitHub: https://github.com/ywchiu/largitdata/blob/master/code/Course_221.ipynb
Essentially, it splits using pydub. (There were some Chinese comments; I asked GPT to translate the comments and NOT touch the code. Please verify against the code in the links above, as GPT sometimes malfunctions.)

```python
#@title Split YouTube Video
from pydub import AudioSegment

#@markdown ### Length of the segment to split (in milliseconds):
segment_length = 1000000  #@param {type:"integer"}

# Load the MP3 audio file
sound = AudioSegment.from_file(f'{filename}.mp3', format='mp3')

sound_track = []

# Split the audio file into multiple files
for i, chunk in enumerate(sound[::segment_length]):
    # Set the filename for the split file
    chunk.export(f'output_{i}.mp3', format='mp3')
    audio_file = open(f'output_{i}.mp3', "rb")
    sound_track.append(audio_file)
```
He uses Jupyter/Colab, so the code is in blocks. His idea is to split the file into fixed chunks of 1,000,000 ms (very important; otherwise it's hard to calculate the SRT times), then submit each chunk to the Whisper API in a loop. Then, for each SRT file, make sure the maximum timestamp is <= the chunk duration.
Code:

```python
max_time = pysrt.SubRipTime(seconds=1000)
for sub in subtitles1:
    sub.start = sub.start if sub.start < max_time else max_time
    sub.end = sub.end if sub.end < max_time else max_time
```
Then join the SRT files (and the TXT files; that part is easy for TXT). The timestamps in each file need to be recalculated according to the file's position in the loop.
Code:

```python
shift_time = pysrt.SubRipTime(seconds=1000)
for sub in subtitles2:
    sub.start = sub.start + shift_time
    sub.end = sub.end + shift_time
```
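Putting those pieces together, a minimal end-to-end sketch of the split-transcribe-merge idea might look like the following. This is my own assembly, not the notebook's exact code: `transcribe_chunk` is a hypothetical placeholder for the Whisper API call that writes one SRT file per chunk, and the shift is scaled by the chunk index so every chunk after the first lands at the right offset:

```python
import pysrt
from pydub import AudioSegment

SEGMENT_MS = 1000000  # 1,000 seconds per chunk, matching the notebook

sound = AudioSegment.from_file("input.mp3", format="mp3")
merged = pysrt.SubRipFile()

for i, chunk in enumerate(sound[::SEGMENT_MS]):
    chunk.export(f"output_{i}.mp3", format="mp3")
    # transcribe_chunk() stands in for the Whisper API call that
    # writes an SRT file for this chunk (not shown here)
    transcribe_chunk(f"output_{i}.mp3", f"output_{i}.srt")

    subs = pysrt.open(f"output_{i}.srt")
    max_time = pysrt.SubRipTime(seconds=SEGMENT_MS // 1000)
    for sub in subs:
        # Clamp stray timestamps to the chunk length...
        sub.start = min(sub.start, max_time)
        sub.end = min(sub.end, max_time)
    # ...then shift by this chunk's offset within the full recording
    subs.shift(seconds=i * (SEGMENT_MS // 1000))
    merged.extend(subs)

merged.clean_indexes()  # renumber the cues 1, 2, 3, ...
merged.save("merged.srt", encoding="utf-8")
```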
This would make Buzz much more useful!
I need such a function badly right now, so I think I'll try to implement it in a simple Python script this weekend.
If you're interested, please just let me know; I guess you use Python too.
Then show me the part that deals with the processing, and I could implement it there.
PS: I'm no programmer; I can code mostly thanks to some college courses and ChatGPT, so please don't expect me to do it perfectly alone.
Thanks
PS: I tried to do the splitting with .m4a using pydub and ffmpeg, but I failed; .mp3 was much easier last night. (The Colab above also uses .mp3; it's larger in size but seems easier than .m4a.)
So I'll try again with .mp3 for now.
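(For what it's worth, pydub can usually read .m4a as well when ffmpeg is on the PATH; the catch is that .m4a uses the MP4 container, so the format name to pass is "mp4". A minimal sketch, assuming a file named `input.m4a`:)

```python
from pydub import AudioSegment

# .m4a files use the MP4 container, so pass format="mp4";
# this requires ffmpeg to be available on the PATH
sound = AudioSegment.from_file("input.m4a", format="mp4")
sound[:1000000].export("output_0.mp3", format="mp3")  # export the first 1,000,000 ms as MP3
```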
Hi, I see you're implementing it as well? https://app.codecov.io/gh/chidiwilliams/buzz/pull/652/blob/buzz/transcriber.py
OK, I'll wait for your reply at the weekend. I hope both SRT and TXT work well. Thank you.
An excerpt from the PR's changes to `buzz/transcriber.py`:

```python
# If the file is larger than 25MB, split into chunks
# and transcribe each chunk separately
num_chunks = math.ceil(total_size / max_chunk_size)
chunk_duration = duration_secs / num_chunks

segments = []

for i in range(num_chunks):
    chunk_start = i * chunk_duration
    chunk_end = min((i + 1) * chunk_duration, duration_secs)
```
Hi, for those who want a temporary solution, the Colab script above is a good start.
It helps to split the audio and merge the results, one chunk at a time.
It should be useful while we wait for the update.
Buzz is good in that it can handle many files at once (though not recursively, and there's no need to, since recursively processing media files would be very CPU-demanding).
Thanks
Custom API endpoint support was added in 1.0.0.
See the list of currently known custom API endpoints to use: https://github.com/chidiwilliams/buzz/discussions/827
An alternate solution is to run a custom Whisper transcription back-end on some cloud platform. GPU support is also available for local models.
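As an illustration of what talking to an OpenAI-compatible custom endpoint looks like from the client side (a hypothetical server at `http://localhost:8000/v1`; Buzz's own endpoint setting lives in the app's configuration rather than in code like this):

```python
# openai>=1.0 client pointed at a hypothetical OpenAI-compatible server
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-for-local")

with open("audio.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=f)
print(transcript.text)
```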