Question about fine-tuning NLLB for Mi'kmaq language (Indigenous Canadian language)
Hello NLLB/fairseq team,
I'm interested in exploring how to fine-tune the NLLB model to support the Mi'kmaq language (also spelled Mi'gmaq), an Eastern Algonquian language spoken by approximately 11,000 people in Eastern Canada and parts of the Northeastern US.
My husband is Mi'kmaq and a native speaker of the language. We're curious about:
- What would be required to fine-tune NLLB for Mi'kmaq?
- Is there an existing tutorial or guide for adding new languages?
- What kind and amount of parallel data would be needed?
- Are there resources like the NLLB-Seed dataset mentioned in other issues that we could translate to Mi'kmaq?
I noticed that other Indigenous languages have been added through community efforts, and we'd like to contribute similarly to help preserve and promote the Mi'kmaq language in digital spaces.
Thank you for any guidance you can provide!
Best regards, -BSTdev
Hi!
- What would be required to fine-tune NLLB for Mi'kmaq?
The main requirements are parallel training data and compute.
As training data, you need at least a few thousand to tens of thousands of translated sentences and phrases; the more, the better. Single-word translations (e.g. from dictionaries) also help, but on their own they are not sufficient.
For training the model (fine-tuning NLLB-200-600M), you need at least one GPU with about 16 GB of memory. You can get one through Google Colab (a monthly subscription of about $10).
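To make the requirements concrete, here is a rough sketch of a minimal fine-tuning loop with the Hugging Face transformers library. It is not official NLLB code: the toy data and hyperparameters are placeholders, and the "mic_Latn" code is made up and has to be registered with the tokenizer first (see the sketch under the next answer).

```python
# Rough sketch of a minimal fine-tuning loop for NLLB-200-600M with the
# Hugging Face transformers library. The toy data, hyperparameters, and the
# "mic_Latn" code are placeholders; the new code must first be registered
# with the tokenizer.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# In practice: thousands of (English, Mi'kmaq) pairs loaded from your corpus.
pairs = [("Hello!", "Kwe'!"), ("Thank you.", "Wela'lin.")]

tokenizer.src_lang = "eng_Latn"
tokenizer.tgt_lang = "mic_Latn"  # placeholder code, not in stock NLLB-200
model.train()
for epoch in range(3):
    for eng, mic in pairs:
        # text_target tokenizes the Mi'kmaq side as the labels
        batch = tokenizer(eng, text_target=mic, return_tensors="pt").to(model.device)
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

In a real run you would also batch the data, shuffle it, and save checkpoints, but the core loop is this simple.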
- Is there an existing tutorial or guide for adding new languages?
There is this unofficial tutorial. It is somewhat outdated by now, but still informative.
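In case it helps while the tutorial is outdated: with recent transformers versions, the key step of registering a new language code can be sketched roughly as below. The "mic_Latn" code is made up, and initializing its embedding from an existing language is just a common heuristic, not an official recipe; older transformers versions need the workaround described in the tutorial instead.

```python
# Sketch of registering a new language code with the NLLB tokenizer and
# resizing the model accordingly (recent transformers versions; older ones
# need the lang_code_to_id workaround from the tutorial).
import torch
from transformers import AutoModelForSeq2SeqLM, NllbTokenizer

model_name = "facebook/nllb-200-distilled-600M"
tokenizer = NllbTokenizer.from_pretrained(model_name)
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["mic_Latn"]},  # made-up code for Mi'kmaq
    replace_additional_special_tokens=False,      # keep the existing 200 codes
)

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model.resize_token_embeddings(len(tokenizer))

# Common heuristic: initialize the new language token from an existing one
# instead of a random vector, so fine-tuning has a sensible starting point.
with torch.no_grad():
    emb = model.get_input_embeddings().weight
    new_id = tokenizer.convert_tokens_to_ids("mic_Latn")
    emb[new_id] = emb[tokenizer.convert_tokens_to_ids("eng_Latn")].clone()
```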
- What kind and amount of parallel data would be needed?
Any parallel data you can obtain would be useful; the more data, and the more diverse its domains, the better.
- Are there resources like the NLLB-Seed dataset mentioned in other issues that we could translate to Mi'kmaq?
FLORES and NLLB-Seed would be good candidates to translate (their extension to new languages is now managed by https://oldi.org). Another massively multilingual training dataset is SMOL by Google.
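For example, the English source sentences of FLORES can be exported for human translation in a few lines. This is a sketch assuming the "facebook/flores" dataset on the Hugging Face Hub; split and field names may differ in other mirrors.

```python
# Sketch: dump the English FLORES-200 sentences to a file so translators
# can produce the Mi'kmaq side. Assumes the "facebook/flores" dataset on
# the Hugging Face Hub.
from datasets import load_dataset

flores = load_dataset("facebook/flores", "eng_Latn", trust_remote_code=True)
with open("flores_eng_for_translation.txt", "w", encoding="utf-8") as f:
    for split in ("dev", "devtest"):
        for row in flores[split]:
            f.write(row["sentence"] + "\n")
```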
Hello everyone,
I have a similar use case. I already know the unofficial tutorial (it's great), but it is almost two years old now, and I wonder whether there is a more current one (maybe from the NLLB community)?
@MaxWenzel I could probably write a new one, with updated dependencies and better resistance to catastrophic forgetting of other languages.
What else would you like to get covered in the new tutorial?
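For context, one simple technique for that is replay: mixing examples from already-supported language pairs into the fine-tuning stream, so the old translation directions keep receiving gradient updates. A minimal sketch of the idea (the language codes and the mixing ratio are arbitrary placeholders):

```python
# Minimal sketch of "replay" against catastrophic forgetting: interleave
# examples of an already-supported pair with the new-language examples.
import random

def mixed_examples(new_pairs, replay_pairs, replay_ratio=0.25):
    """Yield (src_lang, tgt_lang, src_text, tgt_text) training examples,
    sampling replay data with probability replay_ratio."""
    while True:
        if random.random() < replay_ratio:
            src, tgt = random.choice(replay_pairs)
            yield ("eng_Latn", "fra_Latn", src, tgt)  # an existing direction
        else:
            src, tgt = random.choice(new_pairs)
            yield ("eng_Latn", "mic_Latn", src, tgt)  # the new direction
```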
Hi @avidale , that sounds great. My scenario is the same as in your old tutorial: I want to add a new language, which I think is also one of the main use cases. Up-to-date dependency versions would of course be great.
I'm also curious and would be very happy to see an updated tutorial!
@avidale, hello! Have you written an updated tutorial yet? :)