VideoLingo Bug: DataFrame creation fails due to mismatched array lengths in step5

Hello, the following is an issue report I generated with AI. Thanks to the developers.

Title

Bug: DataFrame creation fails due to mismatched array lengths in step5_splitforsub.py

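Description

While splitting subtitles, the program raises an error when creating a DataFrame in step5_splitforsub.py, reporting that the source-text and translated-text arrays have different lengths.

Error message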

2025-01-21 16:23:13.174 Uncaught app exception
Traceback (most recent call last):
File "/opt/anaconda3/envs/videolingo/lib/python3.10/site-packages/streamlit/runtime/scriptrunner/exec_code.py", line 88, in exec_func_with_error_handling
result = func()
File "/opt/anaconda3/envs/videolingo/lib/python3.10/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 590, in code_to_exec
exec(code, module.__dict__)
File "/Users/XXX/NAS-Home/Document/Code/Github-开源项目/VideoLingo/st.py", line 123, in <module>
main()
File "/Users/XXX/NAS-Home/Document/Code/Github-开源项目/VideoLingo/st.py", line 119, in main
text_processing_section()
File "/Users/XXX/NAS-Home/Document/Code/Github-开源项目/VideoLingo/st.py", line 33, in text_processing_section
process_text()
File "/Users/XXX/NAS-Home/Document/Code/Github-开源项目/VideoLingo/st.py", line 57, in process_text
step5_splitforsub.split_for_sub_main()
File "/Users/XXX/NAS-Home/Document/Code/Github-开源项目/VideoLingo/core/step5_splitforsub.py", line 131, in split_for_sub_main
pd.DataFrame({'Source': src, 'Translation': remerged}).to_excel(OUTPUT_REMERGED_FILE, index=False)
File "/opt/anaconda3/envs/videolingo/lib/python3.10/site-packages/pandas/core/frame.py", line 778, in __init__
mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
File "/opt/anaconda3/envs/videolingo/lib/python3.10/site-packages/pandas/core/internals/construction.py", line 503, in dict_to_mgr
return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
File "/opt/anaconda3/envs/videolingo/lib/python3.10/site-packages/pandas/core/internals/construction.py", line 114, in arrays_to_mgr
index = _extract_index(arrays)
File "/opt/anaconda3/envs/videolingo/lib/python3.10/site-packages/pandas/core/internals/construction.py", line 677, in _extract_index
raise ValueError("All arrays must be of the same length")
ValueError: All arrays must be of the same length

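Steps to reproduce

1. Start VideoLingo
2. Upload a video and complete the translation
3. The error occurs during the subtitle processing stage

Environment

Python version: 3.10
OS: macOS
VideoLingo version: [your version number]

Suggested fix

I suggest adding an array-length check before creating the DataFrame to ensure that src and remerged have the same length. Consider adding a data-consistency validation in the split_align_subs function.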
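
A minimal sketch of such a length check, in plain Python so it runs without pandas. The helper name check_same_length is hypothetical and not part of VideoLingo; the idea is only to fail early with a message that names each column's length, instead of hitting pandas' generic "All arrays must be of the same length".

```python
def check_same_length(columns):
    """Raise ValueError listing every column's length if they are not all equal.

    `columns` is a dict of name -> list, the same shape that would be
    passed to pd.DataFrame(...). Hypothetical helper, not VideoLingo code.
    """
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ before DataFrame creation: {lengths}")
    return columns


# Example: three source lines but only two remerged translations
src = ["line one", "line two", "line three"]
remerged = ["translation one", "translation two"]
try:
    check_same_length({"Source": src, "Translation": remerged})
except ValueError as err:
    print(err)
```

A check like this would not fix the underlying mismatch, but it would make the failure point at the data rather than at pandas internals.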

Related files

core/step5_splitforsub.py

Jan 21 '25 08:01 leonlin21

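I'm not a developer on this project, but I ran into the same problem and looked into it a bit. The splitting is delegated to a large language model, and some models are not very stable and make mistakes when splitting. Take this sentence: can be seen after the rapid consumption of around 1,200 milligrams of caffeine. It is longer than 75 bytes, so it needs to be split, and the split the LLM suggested was ["can be seen after the rapid consumption of around 1,200 milligrams of caffeine", "."], i.e. it split off only the period. So split_src and split_trans both grew, and remerged ended up with fewer entries than they have. On the next attempt the LLM then cut off the "e" of caffeine.

The upshot is that there are only 3 attempts in total, but every one of them produces a bad split (for example, cutting off a single letter), so the sentence still needs splitting the next time around. The number of entries in remerged therefore never catches up with split_src, because split_src keeps growing. My fix is to add a check in def process(i) that rules out the cases where the split goes haywire: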

    def process(i):
        # Ask the LLM to split line i into two parts
        split_src = split_sentence(src_lines[i], num_parts=2).strip()
        src_parts, tr_parts, tr_remerged = align_subs(src_lines[i], tr_lines[i], split_src)
        # Accept the split only if every source part is longer than one character
        if all(len(split_src_part) > 1 for split_src_part in src_parts):
            src_lines[i] = src_parts
            tr_lines[i] = tr_parts
        remerged_tr_lines[i] = tr_remerged

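This way, a split is applied only when every resulting part is longer than 1 character; otherwise the sentence is left untouched until the 3 attempts are used up. The sentence ends up slightly longer than 75, and that's that. I haven't forked this project on GitHub, so I won't submit a PR; the author may well have a better fix anyway. Treat this as a workaround, for reference only.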
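
To see the check in isolation, here is a small standalone sketch using the example sentence from above. The function name is_usable_split is made up for illustration, and stripping whitespace before measuring is my own addition; the check in process(i) compares the raw length.

```python
def is_usable_split(parts):
    """True only if every part is longer than one character (after stripping)."""
    return all(len(part.strip()) > 1 for part in parts)


sentence = "can be seen after the rapid consumption of around 1,200 milligrams of caffeine."

# The two bad splits described above: only the period, then only the "e"
only_period = [sentence[:-1], "."]
only_letter = ["can be seen after the rapid consumption of around 1,200 milligrams of caffein", "e"]
# A split that actually shortens the line
reasonable = ["can be seen after the rapid consumption", "of around 1,200 milligrams of caffeine."]

for parts in (only_period, only_letter, reasonable):
    print(is_usable_split(parts), [len(p) for p in parts])
```

The first two splits are rejected and the third is accepted, which is the behaviour the if statement in process(i) is after.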

Feb 09 '25 00:02 GavinTao1219

Alternatively, increasing the 3 here also fixes it:

for attempt in range(3):  # uses a fixed 3 retries
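
A rough sketch of what a configurable retry count could look like. Everything here is a stand-in: MAX_SPLIT_ATTEMPTS, split_until_fit and the toy splitter are hypothetical names, and the real code calls an LLM rather than cutting at a space.

```python
MAX_LEN = 75            # length limit mentioned in this thread
MAX_SPLIT_ATTEMPTS = 5  # hypothetical name; the quoted loop hard-codes 3


def split_at_middle_space(line):
    """Toy splitter (stand-in for the LLM): cut at the last space at or before the midpoint."""
    cut = line.rfind(" ", 0, len(line) // 2 + 1)
    if cut <= 0:
        return [line]  # nothing sensible to cut, leave the line alone
    return [line[:cut], line[cut + 1:]]


def split_until_fit(lines, split_once, max_len=MAX_LEN, max_attempts=MAX_SPLIT_ATTEMPTS):
    """Re-split over-long lines until all of them fit or the attempts run out."""
    for _ in range(max_attempts):
        if all(len(line) <= max_len for line in lines):
            break
        next_lines = []
        for line in lines:
            next_lines.extend(split_once(line) if len(line) > max_len else [line])
        lines = next_lines
    return lines


long_line = " ".join(["word"] * 40)  # 199 characters
print([len(p) for p in split_until_fit([long_line], split_at_middle_space)])
```

More attempts only help when each attempt makes real progress; a split that cuts off a single character uses up an attempt without shortening the line much.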

Feb 09 '25 00:02 GavinTao1219