How to keep parts of the input text that are already in the target language?

Open lena-kru opened this issue 3 years ago • 0 comments

I have some texts that I need to translate, and some of them contain more than one language. I have noticed that parts of the input that are already in the target language are sometimes missing from the output.

Here is a minimal example:

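from nltk.tokenize import PunktSentenceTokenizer
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

text = """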
false positive sage update
hallo nochmal zusammen, 

erneut eine exe datei die als virus erkannt wird und in quarantäne kommt.

operating system:
microsoft windows server 2019_x005F
"""

sent_tokenizer = PunktSentenceTokenizer()
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

translation_pipeline = pipeline(
    "translation",
    model=model,
    tokenizer=tokenizer,
    src_lang="deu_Latn",
    tgt_lang="eng_Latn",
    max_length=5000,
)
for sent in sent_tokenizer.tokenize(text):
    print(sent, ' ---> ', translation_pipeline(sent)[0]['translation_text'])

And here is the output I get:

false positive sage update
hallo nochmal zusammen, 

erneut eine exe datei die als virus erkannt wird und in quarantäne kommt.  --->  Hello again, again an exe file that's been recognized as a virus and is being quarantined.
operating system:
microsoft windows server 2019_x005F  --->

It works fine when I have Russian text with English parts, though. Is it possible to keep all of the text in the output even when a single text mixes several languages?
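
One possible workaround, sketched here under assumptions rather than as a confirmed fix: split the text into lines instead of Punkt sentences, pass through any line that already looks like the target language, and fall back to the original line whenever the model returns an empty string. The helper name `translate_mixed` and both callables are hypothetical; the language check in particular has to come from somewhere else (for example a language-identification library), since nothing in the code above provides one.

```python
from typing import Callable, List


def translate_mixed(
    text: str,
    translate: Callable[[str], str],
    is_target_lang: Callable[[str], bool],
) -> str:
    """Translate line by line, keeping lines that need no translation.

    `translate` and `is_target_lang` are supplied by the caller; both are
    assumptions of this sketch, not part of transformers or NLTK.
    """
    out: List[str] = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or is_target_lang(stripped):
            # Blank lines and lines already in the target language stay as-is.
            out.append(line)
            continue
        translated = translate(stripped)
        # If the model returns nothing, keep the original line instead of losing it.
        out.append(translated if translated.strip() else line)
    return "\n".join(out)
```

With the pipeline from the example, `translate` could be `lambda s: translation_pipeline(s)[0]["translation_text"]`. Working per line also avoids merging unpunctuated lines in different languages into one "sentence", which may be what happens with the sentence tokenizer here.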

Aug 09 '22 15:08 lena-kru