fairseq
How to preserve parts of the input text that are already in the target language?
I have some texts that I need to translate, and some of them contain more than one language. I have noticed that parts of the input that are already in the target language are sometimes missing from the output.
Here is an example:
from nltk.tokenize import PunktSentenceTokenizer
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

text = """
false positive sage update
hallo nochmal zusammen,
erneut eine exe datei die als virus erkannt wird und in quarantäne kommt.
operating system:
microsoft windows server 2019_x005F
"""

sent_tokenizer = PunktSentenceTokenizer()
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
translation_pipeline = pipeline(
    "translation",
    model=model,
    tokenizer=tokenizer,
    src_lang="deu_Latn",
    tgt_lang="eng_Latn",
    max_length=5000,
)

# Translate sentence by sentence and print "source ---> translation"
for sent in sent_tokenizer.tokenize(text):
    print(sent, ' ---> ', translation_pipeline(sent)[0]['translation_text'])
And here is what I get:
false positive sage update
hallo nochmal zusammen,
erneut eine exe datei die als virus erkannt wird und in quarantäne kommt. ---> Hello again, again an exe file that's been recognized as a virus and is being quarantined.
operating system:
microsoft windows server 2019_x005F --->
It works fine when I have Russian text with English parts, though. So is it possible to keep all of the text even when a single input mixes several languages?
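One workaround that might be worth trying is sketched below. It is an assumption on my side, not a documented fairseq or NLLB feature. The idea is to go line by line, pass through lines that already look like the target language, and fall back to the source line whenever the model returns an empty string. The function name `translate_keeping_target_lang` and both callables (`translate`, `is_target_lang`) are hypothetical. The language check in particular has to come from somewhere else, for example a language identification library.

```python
from typing import Callable, List


def translate_keeping_target_lang(
    text: str,
    translate: Callable[[str], str],
    is_target_lang: Callable[[str], bool],
) -> List[str]:
    """Translate line by line, keeping lines already in the target language.

    `translate` and `is_target_lang` are hypothetical callables supplied by
    the caller; nothing here depends on a specific model or library.
    """
    results = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if is_target_lang(line):
            results.append(line)  # already in the target language: keep verbatim
            continue
        translated = translate(line).strip()
        # Fall back to the source line if the model returns nothing
        results.append(translated or line)
    return results
```

With the pipeline from the example above, `translate` could be something like `lambda s: translation_pipeline(s)[0]['translation_text']`. Splitting by line instead of by sentence is a deliberate choice here, so that a line such as "operating system:" is checked on its own and not glued to its neighbours.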