[🐛BUG] Problem when using the BART model with wmt16-en-de.
Describe the bug: When using the BART model with the wmt16-en-de dataset, I get an error saying the src and tgt lengths are inconsistent, but after checking the dataset I found that the files have the same number of lines.
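As a sanity check, the line counts of the source and target files can be compared directly, and whitespace-only lines (which some preprocessing steps silently drop, desynchronizing the pair) can be flagged. This is a minimal sketch; the file paths in the usage comment are assumptions, not the actual dataset layout.

```python
from pathlib import Path

def inspect_parallel(src_path, tgt_path):
    """Count lines and blank lines in a parallel corpus pair."""
    def stats(path):
        lines = Path(path).read_text(encoding="utf-8").splitlines()
        blanks = sum(1 for line in lines if not line.strip())
        return len(lines), blanks

    n_src, blank_src = stats(src_path)
    n_tgt, blank_tgt = stats(tgt_path)
    print(f"src: {n_src} lines ({blank_src} blank), "
          f"tgt: {n_tgt} lines ({blank_tgt} blank)")
    # Files match only if line counts agree and neither side has blanks
    # that a tokenizer or loader might discard.
    return n_src == n_tgt and blank_src == 0 and blank_tgt == 0

# Usage (paths are hypothetical, adjust to your dataset directory):
# inspect_parallel("dataset/wmt16-en-de/train.src",
#                  "dataset/wmt16-en-de/train.tgt")
```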
To Reproduce (cmd: run_textbox.py --model=BART --model_path=facebook/bart-base --dataset=wmt16-en-de --src_lang=en_XX --tgt_lang=de_DE)
Log
General Hyper Parameters:
gpu_id: 0 use_gpu: True device: cuda seed: 2020 reproducibility: True cmd: run_textbox.py --model=BART --model_path=facebook/bart-base --dataset=wmt16-en-de --src_lang=en_XX --tgt_lang=de_DE filename: BART-wmt16-en-de-2023-May-10_01-27-13 saved_dir: saved/ state: INFO wandb: offline
Training Hyper Parameters:
do_train: True do_valid: True optimizer: adamw adafactor_kwargs: {'lr': 0.001, 'scale_parameter': False, 'relative_step': False, 'warmup_init': False} optimizer_kwargs: {} valid_steps: 1 valid_strategy: epoch stopping_steps: 2 epochs: 50 learning_rate: 3e-05 train_batch_size: 4 grad_clip: 0.1 accumulation_steps: 48 disable_tqdm: False resume_training: True
Evaluation Hyper Parameters:
do_test: True lower_evaluation: True multiref_strategy: max bleu_max_ngrams: 4 bleu_type: sacrebleu smoothing_function: 0 corpus_bleu: False rouge_max_ngrams: 2 rouge_type: files2rouge meteor_type: pycocoevalcap chrf_type: m-popovic distinct_max_ngrams: 4 inter_distinct: True unique_max_ngrams: 4 self_bleu_max_ngrams: 4 tgt_lang: de_DE metrics: ['bleu'] eval_batch_size: 8 corpus_meteor: True
Model Hyper Parameters:
model: BART model_name: bart model_path: facebook/bart-base config_kwargs: {} tokenizer_kwargs: {'src_lang': 'en_XX', 'tgt_lang': 'de_DE'} generation_kwargs: {'num_beams': 5, 'no_repeat_ngram_size': 3, 'early_stopping': True} efficient_kwargs: {} efficient_methods: [] efficient_unfreeze_model: False label_smoothing: 0.1
Dataset Hyper Parameters:
dataset: wmt16-en-de data_path: dataset/wmt16-en-de src_lang: en_XX tgt_lang: de_DE src_len: 1024 tgt_len: 1024 truncate: tail prefix_prompt: translate English to Germany: metrics_for_best_model: ['bleu']
Unrecognized Hyper Parameters:
tokenizer_add_tokens: [] load_type: from_pretrained find_unused_parameters: False
================================================================================
10 May 01:27 INFO Pretrain type: pretrain disabled
Traceback (most recent call last):
File "run_textbox.py", line 12, in
Please refer to this solution: https://github.com/RUCAIBox/TextBox/issues/346#issuecomment-1520385763