bertalign

trilingual alignment

Open oushiei120 opened this issue 11 months ago • 3 comments

If I want to attempt a trilingual alignment of a literary work, which is more efficient: taking one of the already segmented texts from an existing aligned bilingual corpus and aligning the third language against it, or aligning all three languages from scratch?

oushiei120 avatar Mar 11 '25 04:03 oushiei120

I think it's much easier to align each pair of languages first, then merge the alignments using a graph search algorithm such as connected components.

For example, with one source text and two target texts, one of the alignments might look like:

1, 2 -> 1, 2, 3

while the other is:

1 -> 1
2 -> 2, 3

With connected components, you can find that the minimum alignment unit should be 1, 2 -> 1, 2, 3.
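To make the idea concrete, here is a minimal sketch of the merge step using a small union-find. It is not code from the bertalign repository: the function name `merge_alignments`, the bead format `(src_ids, tgt_ids)`, and the labels `"src"`, `"a"`, `"b"` are all assumptions made for illustration. Every sentence is a graph node, every bead links its sentences together, and each connected component becomes one trilingual unit.

```python
from collections import defaultdict


def merge_alignments(src_to_a, src_to_b):
    """Merge two bilingual alignments that share the same source text.

    Each alignment is a list of beads (src_ids, tgt_ids),
    e.g. ([1, 2], [1, 2, 3]). Returns a list of trilingual units
    (src_ids, a_ids, b_ids), one per connected component.
    """
    parent = {}

    def find(x):
        # Register unseen nodes as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    # Link all sentences that appear in the same bead.
    for label, beads in (("a", src_to_a), ("b", src_to_b)):
        for src_ids, tgt_ids in beads:
            nodes = [("src", i) for i in src_ids] + [(label, j) for j in tgt_ids]
            for node in nodes:
                union(nodes[0], node)

    # Collect the sentences of each component, split by text.
    groups = defaultdict(lambda: {"src": [], "a": [], "b": []})
    for node in list(parent):
        text, idx = node
        groups[find(node)][text].append(idx)

    units = [
        (sorted(g["src"]), sorted(g["a"]), sorted(g["b"]))
        for g in groups.values()
    ]
    units.sort()  # order units by their source sentence ids
    return units


# The example from this comment: one source text, two target texts.
src_to_a = [([1, 2], [1, 2, 3])]
src_to_b = [([1], [1]), ([2], [2, 3])]
print(merge_alignments(src_to_a, src_to_b))
# [([1, 2], [1, 2, 3], [1, 2, 3])]
```

Because the first alignment ties source sentences 1 and 2 together, the two finer beads of the second alignment end up in the same component, so the merged unit covers source 1, 2 and sentences 1, 2, 3 in both targets.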

bfsujason avatar Mar 11 '25 06:03 bfsujason

I think this is a good topic. I recently tried to modify your repository with Windsurf, but the results were not good. Many one-to-two bilingual corpora have already been extensively proofread, so how to keep adding other translations on top of them is a problem worth solving.

oushiei120 avatar Mar 11 '25 06:03 oushiei120

Is Windsurf an AI-powered IDE? That is a good idea. You can ask the AI how to align more than two languages using a graph algorithm. I think that will solve your problem.

bfsujason avatar Mar 11 '25 07:03 bfsujason