Fine-tune cross-lingual translator for text2text generation
Fine-tune a cross-lingual translator for text2text generation tasks (e.g. question generation, question answering, summarization) to demonstrate cross-lingual alignment, zero-shot generation, and related capabilities.
For example, can we demonstrate question generation or question answering using the existing API? If not, what needs to get fixed?
https://github.com/artitw/text2text#training--finetuning
I would be working on this.
Awesome. For question generation, one approach to get started is to use the SQuAD dataset, and pre-process it into context + answer -> question. Likewise, for question answering pre-process it into context + question -> answer. This could then be used for the fine-tuning.
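A minimal sketch of that pre-processing, assuming the SQuAD v1.1 JSON layout (`data → paragraphs → qas → answers`); the exact source/target string format would still need to match whatever the fine-tuning API expects:

```python
def squad_to_pairs(squad_dict, task="qg"):
    """Flatten a SQuAD-style dict into (source, target) pairs:
    question generation (qg): context + answer   -> question
    question answering  (qa): context + question -> answer"""
    pairs = []
    for article in squad_dict["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                answer = qa["answers"][0]["text"]
                if task == "qg":
                    pairs.append((f"{context} {answer}", qa["question"]))
                else:
                    pairs.append((f"{context} {qa['question']}", answer))
    return pairs

# Tiny in-memory sample in SQuAD v1.1 layout
sample = {"data": [{"paragraphs": [{
    "context": "Paris is the capital of France.",
    "qas": [{"question": "What is the capital of France?",
             "answers": [{"text": "Paris"}]}]}]}]}
print(squad_to_pairs(sample, "qg")[0])
```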
Here is the link to the colab for the analysis of the question answering and question generation API --> https://colab.research.google.com/drive/1WzO_TP9Nn98AeKmicCYaNTRWezAp9OLt?usp=sharing
Thanks very much for sharing the notebook. I can recommend two things to try:
- Use the existing pre-trained question-answering model to evaluate the performance of the question generation model.
- Use the text2text fine-tuning API to see if we can get question generation to work on a pre-trained translator. Although the documentation provides an example for translating, there's nothing stopping us from using it for question generation. Depending on the results, we can dig deeper to understand how to develop the model further.
Thanks Art.
Feedback: After evaluating the performance of the question generation model with the question-answering model, my conclusion is that it is quite accurate at a high level. I have documented it here: https://colab.research.google.com/drive/1WzO_TP9Nn98AeKmicCYaNTRWezAp9OLt#scrollTo=BtnFEHUlProe&line=1&uniqifier=1
Blocker: When using the text2text fine-tuning API to get a pre-trained translator, I run out of space. I have experienced this on both AWS and colab (I get a 'no space left on device' error message). I would appreciate any help I can get. I have attached screenshots here.

- Would you be able to report the test set accuracy so that we can establish a benchmark? This would be useful for researchers as a better way to measure question generation performance.
- It looks like you are using the default model, which takes up a lot of space and memory. Try using a smaller model with the setting
t2t.Transformer.PRETRAINED_TRANSLATOR = "facebook/m2m100_418M"
This was tested on Google's colab environment.
@artitw oh Okay. Got it
Thanks Art. The question generation API actually works on a pre-trained translator. I was able to demonstrate it here. https://colab.research.google.com/drive/1WzO_TP9Nn98AeKmicCYaNTRWezAp9OLt?usp=sharing
The next step is for me to report the test set accuracy so that we can establish a benchmark.
Reviewed the notebook. It looks like the fine-tuning was not performed on question generation data; rather, it was done using the example for translation. Could you try the following format? I updated the API in the repo to avoid confusion with the [SEP] token.
result = t2t.Handler(["I will go to school today to take my math exam. [SEP] school [TGT] Where will you go to take your math exam?"],
                     src_lang="en",
                     tgt_lang="en",
                     num_epochs=10,
                     save_directory="model_dir"
                     ).fit()
Oh I see
Hi @artitw I have gone back to do the work again and the question generation API actually works on a pre-trained translator. Here https://colab.research.google.com/drive/1WzO_TP9Nn98AeKmicCYaNTRWezAp9OLt?usp=sharing
What strategy do you recommend for benchmarking the test set accuracy?
For benchmarking, we can start with lower casing the text and then calculating the exact match accuracy.
For finetuning a pretrained translator, we would have to use the translate (not question generation) API to generate the finetuned results.
In addition to exact match accuracy, it would be good to calculate average BLEU scores over the answers as well. For reference, see https://en.wikipedia.org/wiki/BLEU
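The two metrics suggested above could be sketched as follows. The exact-match function lower-cases before comparing, as suggested; the second function computes only clipped unigram precision, the BLEU-1 building block (in practice, `nltk.translate.bleu_score` would be used for full BLEU-1 to BLEU-4 with the brevity penalty):

```python
from collections import Counter

def exact_match(preds, refs):
    """Lower-cased exact-match accuracy over paired predictions/references."""
    hits = sum(p.strip().lower() == r.strip().lower() for p, r in zip(preds, refs))
    return hits / len(refs)

def unigram_precision(pred, ref):
    """Clipped unigram precision: each predicted token is credited at most
    as many times as it appears in the reference."""
    pred_counts = Counter(pred.lower().split())
    ref_counts = Counter(ref.lower().split())
    clipped = sum(min(c, ref_counts[w]) for w, c in pred_counts.items())
    return clipped / max(1, sum(pred_counts.values()))

preds = ["paris", "the eiffel tower"]
refs = ["Paris", "eiffel tower"]
print(exact_match(preds, refs))  # 0.5
print(unigram_precision("the eiffel tower", "eiffel tower"))
```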
Hi @artitw, after training with about 33 data points, the pre-trained translator is still just translating the payload. Do you suggest I train with even more data? Here is my result: https://colab.research.google.com/drive/1vJ5U_UNFxeu92VVyhAhxKSur_BZSJSIJ?usp=sharing
Thanks for sharing the notebook. It looks like the right direction, but I would expect it to need much more training (>10k examples). I would also recommend saving the intermediate results in Google Drive so that you can pick up where you left off without starting over.
@artitw oh okay, got it
Hi @artitw, I would like to continue from where John stopped.
great, I've assigned you to this issue. Please review what John has done and let us know of any questions here.
Noted. I have reviewed John's work and played with the notebooks he reported. It seems that my assignments are the following, in order:
- Get sufficient (> 10k) training data.
- Report exact match accuracy.
- Report average BLEU scores for the answers.
Am I right? Do you have any suggestions on getting training data?
Thank you.
What you describe sounds like the right track. I would recommend starting with the English SQuAD [1] dataset and then use XQuAD [2] after that is somewhat working.
[1] https://rajpurkar.github.io/SQuAD-explorer/ [2] https://github.com/deepmind/xquad
Hi @artitw,
After trying different options that did not work out, I opted for Amazon Sagemaker.
- I loaded the datasets (JSON) to AWS S3
- I dockerized the fine-tuning script and pushed the image to AWS ECR
- I then created a job on SageMaker using the Docker image as a custom algorithm
The job has been running for some hours, taking the SQuAD [1] dataset as input. I will keep you updated. I could not get access to an HPC cluster, so I followed this approach. Please let me know what you think.
Hi @lere01 What you suggest seems interesting. I would recommend using a small dataset to test your setup before running any heavy jobs.
Hi @artitw,
I used a small dataset to test my setup as you suggested and it worked fine. But the larger dataset took too long to run. I set the job to run for 5 days and even that time frame was not enough.
However, you can see some sort of proof of concept at https://colab.research.google.com/drive/1Vvem1DqNJZQej4t2qAIkZN0DyCdUY_sM#scrollTo=RXf2UrMvSc25.
- I used 50 rows from the training set to fine-tune
- Then performed the translation task (answering) on 50 rows of the dev set
- I calculated the BLEU score using the NLTK implementation and reported BLEU-1 to BLEU-4
- 84% of the answers generated by the model were a perfect match for the references
This was just to show that the whole process works. I would like your suggestion on how to proceed.
Hi @lere01,
Thanks for sharing your work and the summary. It looks like a good start. The main issue I can see is that the notebook you shared uses the Answerer model, not the translator you fine-tuned. We would have to perform predictions using the translator model, because we are using it for an unintended purpose.
Hi @artitw
Hope you have had a good day. Two things.
1. Before going far, I want to let you know that I am fine tuning using
t2t.Handler([f"{CONTEXT} [TGT] {QUESTION}"],
            src_lang="en",
            tgt_lang="en",
            num_epochs=10,
            save_directory="model_dir"
            ).fit()
AND NOT
t2t.Handler([f"{CONTEXT} [SEP] {ANSWER} [TGT] {QUESTION}"],
            src_lang="en",
            tgt_lang="en",
            num_epochs=10,
            save_directory="model_dir"
            ).fit()
Am I on the right track?
2. I dug into the codebase and figured out a way to use the GPU.
By editing the Translator and doing this:
import text2text as t2t
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

class Translator(t2t.Transformer):
    def __init__(self, **kwargs):
        pretrained_translator = self.__class__.PRETRAINED_TRANSLATOR
        torch_device = "cuda" if torch.cuda.is_available() else "cpu"
        self.__class__.model = AutoModelForSeq2SeqLM.from_pretrained(pretrained_translator).to(torch_device)
        self.__class__.tokenizer = AutoTokenizer.from_pretrained(pretrained_translator)
What do you think?
The second approach should work, as we want to generate questions that correspond to a context and an answer.
Nice find. I am referencing your pull request here: https://github.com/artitw/text2text/pull/31
Hi @artitw,
The dataset we are using for fine-tuning has multiple questions attached to each context. Do you think that this might be affecting the algorithm's learning, as opposed to one question per context?
Yes, I would suggest that the context be concatenated with the answer for each target question. This would ensure that each unique question is mapped to a unique input to the model.
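That suggestion could be sketched as follows (a minimal illustration; the `records` layout and field names here are hypothetical, but the output strings follow the `[SEP]`/`[TGT]` format used earlier in this thread):

```python
def build_training_lines(records):
    """One training line per question: 'context [SEP] answer [TGT] question'.
    Concatenating the answer to the context makes each input unique even
    when several questions share the same context."""
    lines = []
    for rec in records:
        context = rec["context"]
        for q, a in zip(rec["questions"], rec["answers"]):
            lines.append(f"{context} [SEP] {a} [TGT] {q}")
    return lines

# One context with two attached questions, as in SQuAD
records = [{"context": "Paris is the capital of France.",
            "questions": ["What is the capital of France?",
                          "Which country is Paris in?"],
            "answers": ["Paris", "France"]}]
for line in build_training_lines(records):
    print(line)
```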