Bug: DataFrame creation fails due to mismatched array lengths in step5_splitforsub.py
Hi, below is an issue report I generated with AI. Thanks to the developers.
Title
Bug: DataFrame creation fails due to mismatched array lengths in step5_splitforsub.py
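Description
During subtitle splitting, the program fails while creating a DataFrame in step5_splitforsub.py. The error indicates that the source-text and translated-text arrays have different lengths.
Error message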
2025-01-21 16:23:13.174 Uncaught app exception
Traceback (most recent call last):
File "/opt/anaconda3/envs/videolingo/lib/python3.10/site-packages/streamlit/runtime/scriptrunner/exec_code.py", line 88, in exec_func_with_error_handling
result = func()
File "/opt/anaconda3/envs/videolingo/lib/python3.10/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 590, in code_to_exec
exec(code, module.__dict__)
File "/Users/XXX/NAS-Home/Document/Code/Github-开源项目/VideoLingo/st.py", line 123, in <module>
main()
File "/Users/XXX/NAS-Home/Document/Code/Github-开源项目/VideoLingo/st.py", line 119, in main
text_processing_section()
File "/Users/XXX/NAS-Home/Document/Code/Github-开源项目/VideoLingo/st.py", line 33, in text_processing_section
process_text()
File "/Users/XXX/NAS-Home/Document/Code/Github-开源项目/VideoLingo/st.py", line 57, in process_text
step5_splitforsub.split_for_sub_main()
File "/Users/XXX/NAS-Home/Document/Code/Github-开源项目/VideoLingo/core/step5_splitforsub.py", line 131, in split_for_sub_main
pd.DataFrame({'Source': src, 'Translation': remerged}).to_excel(OUTPUT_REMERGED_FILE, index=False)
File "/opt/anaconda3/envs/videolingo/lib/python3.10/site-packages/pandas/core/frame.py", line 778, in __init__
mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
File "/opt/anaconda3/envs/videolingo/lib/python3.10/site-packages/pandas/core/internals/construction.py", line 503, in dict_to_mgr
return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
File "/opt/anaconda3/envs/videolingo/lib/python3.10/site-packages/pandas/core/internals/construction.py", line 114, in arrays_to_mgr
index = _extract_index(arrays)
File "/opt/anaconda3/envs/videolingo/lib/python3.10/site-packages/pandas/core/internals/construction.py", line 677, in _extract_index
raise ValueError("All arrays must be of the same length")
ValueError: All arrays must be of the same length
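Steps to reproduce
- Start VideoLingo
- Upload a video and complete the translation
- The error occurs during the subtitle processing stage
Environment
- Python version: 3.10
- Operating system: macOS
- VideoLingo version: [your version number]
Suggested fix
Add an array-length check before creating the DataFrame to make sure src and remerged have the same length. Consider also adding a data-consistency check in the split_align_subs function.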
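A minimal sketch of the length check suggested above, assuming src and remerged are plain Python lists as in the traceback. The helper name check_same_length and the sample data are hypothetical and not part of VideoLingo; the idea is only to fail with a message that names both lengths before pd.DataFrame is called.

```python
def check_same_length(src, remerged):
    """Raise a descriptive error if the two columns differ in length.

    Hypothetical helper: it would be called just before building the
    DataFrame, so the failure names both lengths instead of surfacing
    as pandas' generic "All arrays must be of the same length".
    """
    if len(src) != len(remerged):
        raise ValueError(
            f"Source has {len(src)} rows but Translation has {len(remerged)}; "
            "refusing to build the DataFrame"
        )


# Made-up data standing in for the real src / remerged lists.
src = ["line one", "line two", "line three"]
remerged = ["translated one", "translated two"]

try:
    check_same_length(src, remerged)
except ValueError as exc:
    message = str(exc)
    print(message)
```

A check like this does not repair the mismatch; it only makes the failure easier to diagnose.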
Related files
- core/step5_splitforsub.py
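I'm not a developer of this project, but I ran into the same problem and looked into it a bit. The splitting is delegated to a large language model, and some models are not very stable and produce bad splits. Take this sentence: can be seen after the rapid consumption of around 1,200 milligrams of caffeine. Its length exceeds 75 bytes, so it needs to be split, and the split the model suggested was ["can be seen after the rapid consumption of around 1,200 milligrams of caffeine", "."]. It split off only the period, so split_src and split_trans both grew, and remerged ended up with fewer entries than they have. On the next attempt, the model cut off the final "e" of "caffeine". There are only 3 attempts in total, and every one of them produced a bad split, for example cutting off a single character, so the sentence still needed splitting the next time. As a result, the number of remerged entries can never catch up with split_src, because split_src keeps growing. My solution is to add a check inside def process(i) that filters out these degenerate splits: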
def process(i):
    split_src = split_sentence(src_lines[i], num_parts=2).strip()
    src_parts, tr_parts, tr_remerged = align_subs(src_lines[i], tr_lines[i], split_src)
    # Accept the split only if every source part is longer than one character;
    # otherwise leave line i unsplit.
    if all(len(split_src_part) > 1 for split_src_part in src_parts):
        src_lines[i] = src_parts
        tr_lines[i] = tr_parts
        remerged_tr_lines[i] = tr_remerged
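This way, a split is applied only when every part is longer than one character. Otherwise the sentence is left untouched until the 3 attempts are used up; it ends up slightly longer than 75, and that is accepted. I haven't forked this project on GitHub, so I won't submit a PR. The author may well have a better fix, so treat this as a workaround for reference only.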
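The condition above can also be tried in isolation on the example from this thread. This is a sketch under assumptions: is_usable_split is a hypothetical name, and stripping whitespace before measuring each part is my own addition rather than part of the workaround.

```python
def is_usable_split(parts):
    """Return True only if every part is longer than one character.

    Hypothetical helper mirroring the check in process(i); whitespace is
    stripped first so that a part like " ." is also rejected.
    """
    return all(len(part.strip()) > 1 for part in parts)


# The degenerate split from the thread: only the period was cut off.
bad = [
    "can be seen after the rapid consumption of around 1,200 milligrams of caffeine",
    ".",
]
# A made-up split of the same sentence with two real halves.
good = [
    "can be seen after the rapid consumption",
    "of around 1,200 milligrams of caffeine.",
]

print(is_usable_split(bad))
print(is_usable_split(good))
```

A threshold of one character only catches the cases seen here (a lone period or a single letter); a slightly longer but still useless fragment would pass.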
Alternatively, increasing the 3 here can also solve it:
for attempt in range(3):  # use a fixed 3 retries