How can I avoid repeated tokens during translation?
I am trying to translate some texts, and sometimes I get really unexpected results.
For example, I try to translate this text:
text = """ самописное по\nдобрый день, просьба добавить в исключение файл (прикреплен). возможности изменить самописное по нет."""
and it gives me:
самописное по
добрый день, просьба добавить в исключение файл (прикреплен). ---> I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, but I'm sorry, and I'm sorry, and I'm sorry, but I'm sorry, but I'm sorry, and I'm sorry to be so sorry
возможности изменить самописное по нет. ---> I'm not sure I can change the self-publishing.
Code for reproducing it:
from nltk.tokenize.punkt import PunktSentenceTokenizer
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline
sent_tokenizer = PunktSentenceTokenizer()
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
translation_pipeline = pipeline(
    "translation",
    model=model,
    tokenizer=tokenizer,
    src_lang="rus_Cyrl",
    tgt_lang="eng_Latn",
    max_length=5000,
)
for sent in sent_tokenizer.tokenize(text):
    print(sent, ' ---> ', translation_pipeline(sent)[0]['translation_text'])
Python version: 3.8.13, transformers: 4.21.1
@ldevyataykina Read https://huggingface.co/blog/how-to-generate and try changing num_beams, no_repeat_ngram_size, and the other parameters described in that article. For example:
translation_pipeline = pipeline(
    "translation",
    model=model,
    tokenizer=tokenizer,
    src_lang="rus_Cyrl",
    tgt_lang="eng_Latn",
    max_length=512,
    num_beams=5,
)
for sent in sent_tokenizer.tokenize(text):
    print(sent, ' ---> ', translation_pipeline(sent)[0]['translation_text'])
Output:
самописное по
добрый день, просьба добавить в исключение файл (прикреплен). ---> self-published on good day, please add the file to the exclusion (attached).
возможности изменить самописное по нет. ---> I don't have the ability to change the self-publishing.
P.S. Don't set max_length to more than 512 tokens.
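If beam search alone does not stop the loop for some sentence, the other parameter mentioned above, no_repeat_ngram_size, can be passed as well. Below is a minimal sketch of that idea, not a tuned recipe: the value 3 is an illustrative assumption, and the names GENERATION_KWARGS, build_pipeline, and translate_sentences are hypothetical helpers introduced here. It relies on the translation pipeline forwarding extra keyword arguments to generate() at call time.

```python
# Generation settings that commonly suppress looping output.
# Assumption: these values are illustrative and not tuned for NLLB.
GENERATION_KWARGS = {
    "max_length": 512,          # stay within the limit recommended above
    "num_beams": 5,             # beam search, as in the example above
    "no_repeat_ngram_size": 3,  # forbid any 3-gram from appearing twice
}


def build_pipeline():
    """Build the same NLLB pipeline as in this thread (downloads the model)."""
    # Imported here so the helpers below can be used without transformers.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

    name = "facebook/nllb-200-distilled-600M"
    return pipeline(
        "translation",
        model=AutoModelForSeq2SeqLM.from_pretrained(name),
        tokenizer=AutoTokenizer.from_pretrained(name),
        src_lang="rus_Cyrl",
        tgt_lang="eng_Latn",
    )


def translate_sentences(translate, sentences, **generation_kwargs):
    """Translate each sentence, forwarding generation options on every call.

    `translate` is any callable with the pipeline's call convention: it
    returns a list of dicts containing a "translation_text" key.
    """
    results = []
    for sent in sentences:
        output = translate(sent, **generation_kwargs)
        results.append((sent, output[0]["translation_text"]))
    return results
```

With the sent_tokenizer and text from the question, it could be used as translate_sentences(build_pipeline(), sent_tokenizer.tokenize(text), **GENERATION_KWARGS). Keep in mind that no_repeat_ngram_size also blocks legitimate repetitions (for example a name that occurs several times), so it is worth checking the output on longer sentences.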