Bug: DataFrame creation fails due to mismatched array lengths in step5_splitforsub.py

Open leonlin21 opened this issue 1 year ago • 2 comments

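Hello, the issue report below was generated with AI. Thank you to the developers.

Title

Bug: DataFrame creation fails due to mismatched array lengths in step5_splitforsub.py

Problem description

During subtitle splitting, the program raises an error while creating a DataFrame in step5_splitforsub.py, reporting that the source-text and translated-text arrays have different lengths.

Error message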

2025-01-21 16:23:13.174 Uncaught app exception
Traceback (most recent call last):
File "/opt/anaconda3/envs/videolingo/lib/python3.10/site-packages/streamlit/runtime/scriptrunner/exec_code.py", line 88, in exec_func_with_error_handling
result = func()
File "/opt/anaconda3/envs/videolingo/lib/python3.10/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 590, in code_to_exec
exec(code, module.__dict__)
File "/Users/XXX/NAS-Home/Document/Code/Github-开源项目/VideoLingo/st.py", line 123, in <module>
main()
File "/Users/XXX/NAS-Home/Document/Code/Github-开源项目/VideoLingo/st.py", line 119, in main
text_processing_section()
File "/Users/XXX/NAS-Home/Document/Code/Github-开源项目/VideoLingo/st.py", line 33, in text_processing_section
process_text()
File "/Users/XXX/NAS-Home/Document/Code/Github-开源项目/VideoLingo/st.py", line 57, in process_text
step5_splitforsub.split_for_sub_main()
File "/Users/XXX/NAS-Home/Document/Code/Github-开源项目/VideoLingo/core/step5_splitforsub.py", line 131, in split_for_sub_main
pd.DataFrame({'Source': src, 'Translation': remerged}).to_excel(OUTPUT_REMERGED_FILE, index=False)
File "/opt/anaconda3/envs/videolingo/lib/python3.10/site-packages/pandas/core/frame.py", line 778, in __init__
mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
File "/opt/anaconda3/envs/videolingo/lib/python3.10/site-packages/pandas/core/internals/construction.py", line 503, in dict_to_mgr
return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
File "/opt/anaconda3/envs/videolingo/lib/python3.10/site-packages/pandas/core/internals/construction.py", line 114, in arrays_to_mgr
index = _extract_index(arrays)
File "/opt/anaconda3/envs/videolingo/lib/python3.10/site-packages/pandas/core/internals/construction.py", line 677, in _extract_index
raise ValueError("All arrays must be of the same length")
ValueError: All arrays must be of the same length

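Steps to reproduce

  1. Start VideoLingo
  2. Upload a video and complete the translation
  3. The error occurs at the subtitle processing stage

Environment

  • Python version: 3.10
  • OS: macOS
  • VideoLingo version: [your version number]

Suggested fix

Add an array-length check before creating the DataFrame to ensure that the src and remerged arrays have the same length. Consider adding a data-consistency check in the split_align_subs function.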
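
A minimal sketch of such a length check, written as a standalone helper. The name assert_same_length and the sample data are hypothetical and not part of VideoLingo; the idea is only to fail early with a message that shows each column's length, instead of the generic pandas error.

```python
def assert_same_length(**columns):
    """Raise a descriptive ValueError if the named sequences differ in length.

    Hypothetical helper for illustration; not part of VideoLingo.
    """
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return lengths


# Made-up data standing in for `src` and `remerged`.
src = ["line one", "line two", "line three"]
remerged = ["translated one", "translated two"]

try:
    assert_same_length(Source=src, Translation=remerged)
    message = ""
except ValueError as err:
    message = str(err)

print(message)
```

Called just before the pd.DataFrame({...}) line, a check like this would not fix the mismatch, but it would make the failing lengths visible in the log.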

Related files

  • core/step5_splitforsub.py

leonlin21 avatar Jan 21 '25 08:01 leonlin21

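I'm not a developer on this project, but I ran into the same problem and looked into it a bit. The splitting is delegated to a large language model, and some models are not very stable, so the split can go wrong. Take this sentence: can be seen after the rapid consumption of around 1,200 milligrams of caffeine. It is longer than 75 bytes, so it needs to be split, but the split the LLM suggested was ["can be seen after the rapid consumption of around 1,200 milligrams of caffeine", "."], which cuts off only the period. As a result, split_src and split_trans both grow, and remerged ends up with fewer entries than they have. On the next attempt, the LLM cut off the "e" of "caffeine".

So there are only 3 attempts in total, and each one splits badly, for example by cutting off a single letter, which means the sentence has to be split again next time. The number of entries in remerged can therefore never catch up with split_src, because split_src keeps growing. My fix is to add a check inside def process(i) that filters out these misbehaving splits: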

    def process(i):
        # Request a 2-part split of line i, then align the translation with it
        split_src = split_sentence(src_lines[i], num_parts=2).strip()
        src_parts, tr_parts, tr_remerged = align_subs(src_lines[i], tr_lines[i], split_src)
        # Accept the split only if every source part is longer than 1 character
        if all(len(split_src_part) > 1 for split_src_part in src_parts):
            src_lines[i] = src_parts
            tr_lines[i] = tr_parts
        remerged_tr_lines[i] = tr_remerged

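This way, a split is applied only when every resulting part is longer than 1 character; otherwise the sentence is left as is until the 3 attempts are used up. The sentence ends up slightly longer than 75, and that's that. I haven't forked this project on GitHub, so I won't submit a PR; the author may well have a better fix. Treat this as a workaround, for reference only.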
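
To show the guard in isolation, here is a small self-contained sketch using the caffeine sentence from above. The function accept_split and its min_part_len parameter are hypothetical names for illustration, not VideoLingo code, and the strip() call is an extra assumption so that whitespace-only fragments are also rejected.

```python
def accept_split(original, parts, min_part_len=2):
    """Return `parts` if every piece has at least `min_part_len` characters,
    otherwise keep the line unsplit.

    Hypothetical helper for illustration; not part of VideoLingo.
    """
    if all(len(part.strip()) >= min_part_len for part in parts):
        return parts
    return [original]


sentence = "can be seen after the rapid consumption of around 1,200 milligrams of caffeine."

# Degenerate split: only the trailing period was cut off.
period_only = [sentence[:-1], "."]
# Degenerate split: a single letter was cut off.
letter_only = ["can be seen after the rapid consumption of around 1,200 milligrams of caffein", "e"]
# A reasonable two-part split.
reasonable = ["can be seen after the rapid consumption", "of around 1,200 milligrams of caffeine."]

print(accept_split(sentence, period_only))
print(accept_split(sentence, letter_only))
print(accept_split(sentence, reasonable))
```

The first two calls return the sentence unsplit, and only the last one returns two parts, so a rejected split no longer adds entries to the source and translation lists.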

GavinTao1219 avatar Feb 09 '25 00:02 GavinTao1219

Alternatively, increasing the 3 here can also solve it:

for attempt in range(3):  # use a fixed 3 retries

GavinTao1219 avatar Feb 09 '25 00:02 GavinTao1219