Notebook translation
Hi @mishig25 @lewtun
I am currently translating/adapting the course notebooks (e.g. using CamemBERT instead of BERT). I think I'll be done by the end of the month. I plan to put them here: https://github.com/huggingface/notebooks/tree/main/course. I was thinking of creating a "fr" directory to put all the translations in (like in https://github.com/huggingface/notebooks/tree/main/transformers_doc). I have two questions:
- should I leave the English files as they are now or put them in an "en" folder? The second option would be cleaner but the problem is that we have to modify all the links to the notebooks in the course to change the path and add an "en" in the path. Does this seem feasible to you?
- for the French edition, would it be feasible to create a button that would link to the French notebooks? This would be added next to the one that links to the English notebooks (to give the reader the choice between the two languages but also because some elements of the course are only available in English and the equivalent does not exist in French)
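For illustration only, the link change raised in the first question could be sketched as below. This is a minimal sketch, not the course's actual tooling: the function name `add_language_segment`, the "/course/" marker, and the example file path `chapter1/section3.ipynb` are all assumptions made for the example.

```python
def add_language_segment(url: str, lang: str = "en", marker: str = "/course/") -> str:
    """Insert a language folder right after the first `marker` found in `url`.

    Hypothetical helper: the marker and folder layout are assumptions.
    """
    head, sep, tail = url.partition(marker)
    if not sep:
        # The URL does not contain the marker, so leave it untouched.
        return url
    if tail.split("/", 1)[0] == lang:
        # The language folder is already present, so do not add it twice.
        return url
    return f"{head}{sep}{lang}/{tail}"


# Hypothetical notebook path under the repository mentioned above.
old_url = "https://github.com/huggingface/notebooks/tree/main/course/chapter1/section3.ipynb"
new_url = add_language_segment(old_url, "en")
```

The check on the first path segment keeps the rewrite idempotent, so running it twice over the same links would not produce "en/en". It is deliberately naive: any URL containing "/course/" would be rewritten, so it should only be applied to notebook links.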
Hi @lbourdois! Amazing idea as always
- I think we can create new folders "en" & "fr", and just change the URLs in the notebook components here.
- For both fr & en notebooks, we can put options under the Colab buttons like this in the transformers docs here (instead of Mixed, PyTorch, TensorFlow, it will be English, Français).
@lbourdois please let me know what you think about the suggestions, especially number 2 👍
Sounds great @mishig25 :)
Let me know if you need some notebooks already translated for testing
@lbourdois happy to go with whatever option you prefer 😊
if you have some notebooks for testing, we can totally start with that!
Hey @lbourdois using French pretrained models is an awesome proposal 🔥 !
The best way to do this would be the following:
- Adapt the French MDX files directly in course/fr/ and open PRs on this repo with the changes
- Generate the French notebooks by running the steps outlined here
- Copy all the generated notebooks into the notebooks/course/fr folder that @mishig25 suggested.
Step (2) currently has some hard-coded logic for the English notebooks, but I'll refactor it to be suitable for any language
Hey @lbourdois I've now implemented the following to help you get started:
- Generated French notebooks from the MDX files and placed them in the notebooks/course/fr folder
- Updated all the URLs as @mishig25 suggested to point to these new notebooks
From here, my suggestion would be to open PRs on the course repo with the code changes. Once they're merged, we can then generate all the notebooks by running:
python utils/generate_notebooks.py --output_dir nbs
and copy-paste the output in the notebooks repo with a PR. Let me know if you need any more help getting started!
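As a rough sketch of the copy step, assuming the generated notebooks sit in a local `nbs` directory and the notebooks repo is cloned somewhere nearby, something like the following could replace the manual copy-paste. The helper name `copy_notebooks` and both paths are hypothetical, not part of the course repo.

```python
import shutil
from pathlib import Path


def copy_notebooks(src_dir, dst_dir):
    """Copy every .ipynb file under src_dir into dst_dir, keeping subfolders.

    Hypothetical helper; returns the list of destination paths it wrote.
    """
    src = Path(src_dir)
    dst = Path(dst_dir)
    copied = []
    for notebook in sorted(src.rglob("*.ipynb")):
        target = dst / notebook.relative_to(src)
        # Create chapter subfolders on the destination side as needed.
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(notebook, target)
        copied.append(target)
    return copied
```

For example, `copy_notebooks("nbs", "../notebooks/course/fr")` would mirror the generated files into a local clone of the notebooks repo, where `../notebooks` is an assumed clone location.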
Hi @lewtun
You are even more ambitious than I am in wanting to translate both the notebooks and the code in the course.
I was thinking of leaving the code in English in the course and only translating the notebooks. This way the reader could choose between the English notebook (to run the notebook while following the course) and the French notebook (which is an adaptation proposal using a particular French model among x existing ones). The choice would be made using the button that @mishig25 showed in his post.
Why was I going to leave the code in English in the course?
- Because it takes less time. Modifying the code means rewriting the explanatory text as well (usually the text that comments on the output, saying "you get this, it's great" or "you get this, it's not great, let's try to improve things by doing x").
- Changing the text means that the English and French texts would no longer be perfectly aligned. That is annoying for the last idea I was going to propose once the notebooks were translated (I promise there will be more afterwards), which I'm spoiling now: create a multilingual dataset of parallel sentences from the (finished) course translations. This dataset would of course be uploaded to the 🤗 Datasets library.
- And simply because sometimes there is no perfect equivalent content in French for a subject in the course. For example, chapter 6/6 (https://huggingface.co/course/chapter6/6?fw=pt) deals with WordPiece tokenization, which is BERT's tokenization. However, no French model uses this tokenization. CamemBERT, BARThez and FrALBERT use SentencePiece tokenization. FlauBERT uses FastBPE. PAGnol, Cédille, and all French GPT-2s use BPE. This is the most restrictive point from my point of view, and it made me think of leaving the course code in English and proposing a notebook in French that uses a very similar but not identical idea (CamemBERT's SentencePiece).
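To make the last point concrete, here is a toy sketch of one visible difference between the two schemes: WordPiece marks a continuation piece with a "##" prefix, while SentencePiece marks the start of a word with "▁" (U+2581). The token lists are hand-written for illustration and are not the output of BERT or CamemBERT, and both helper functions are hypothetical.

```python
def join_wordpiece(tokens):
    """Rebuild text from WordPiece-style tokens ('##' marks a continuation piece)."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            # Glue the continuation piece onto the previous word.
            words[-1] += token[2:]
        else:
            words.append(token)
    return " ".join(words)


def join_sentencepiece(tokens):
    """Rebuild text from SentencePiece-style tokens ('\u2581' stands for a preceding space)."""
    return "".join(tokens).replace("\u2581", " ").strip()


# Hand-written examples, NOT produced by any real tokenizer.
wordpiece_tokens = ["token", "##ization", "is", "fun"]
sentencepiece_tokens = ["\u2581token", "ization", "\u2581is", "\u2581fun"]

wordpiece_text = join_wordpiece(wordpiece_tokens)
sentencepiece_text = join_sentencepiece(sentencepiece_tokens)
```

Both lists decode to the same sentence, but the explanatory text of chapter 6/6 is written around the "##" convention, so it would not carry over unchanged to a French notebook built on a SentencePiece model.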
But now I am hesitating between translating everything (notebooks and course) into French and translating only the notebooks. Maybe start with the notebooks, since we have to translate them in any case, and then see whether we also have to translate the code in the course. It's less optimised to do it that way, but it leaves more time for reflection.
Hey @lbourdois thanks for the additional context and details about the challenges with keeping the English and French versions aligned!
I agree with your proposal to first focus on having the notebooks in French - I'll revert my change to the notebook generation script so that we don't overwrite your new changes accidentally.
Can I close this issue now that the French notebooks are available?
Maybe it serves as a reminder to @mishig25 regarding point 2 of your message https://github.com/huggingface/course/issues/309#issuecomment-1243523395.