Fine-tune cross-lingual translator for text2text generation
Fine-tune a cross-lingual translator for text2text generation tasks (e.g. question generation, question answering, summarization) to demonstrate cross-lingual alignment, zero-shot generation, and related capabilities.
For example, can we demonstrate question generation or question answering using the existing API? If not, what needs to get fixed?
https://github.com/artitw/text2text#training--finetuning
I would be working on this.
Awesome. For question generation, one approach to get started is to use the SQuAD dataset, and pre-process it into context + answer -> question. Likewise, for question answering pre-process it into context + question -> answer. This could then be used for the fine-tuning.
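A minimal sketch of that pre-processing, assuming the SQuAD v1.1 JSON layout (`data → paragraphs → qas → answers`); the exact source/target string format would still need to match whatever the fine-tuning API expects:

```python
def squad_to_pairs(squad_dict, task="qg"):
    """Flatten a SQuAD-style dict into (source, target) pairs:
    question generation (qg): context + answer   -> question
    question answering  (qa): context + question -> answer"""
    pairs = []
    for article in squad_dict["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                answer = qa["answers"][0]["text"]
                if task == "qg":
                    pairs.append((f"{context} {answer}", qa["question"]))
                else:
                    pairs.append((f"{context} {qa['question']}", answer))
    return pairs

# Tiny in-memory sample in SQuAD v1.1 layout
sample = {"data": [{"paragraphs": [{
    "context": "Paris is the capital of France.",
    "qas": [{"question": "What is the capital of France?",
             "answers": [{"text": "Paris"}]}]}]}]}
print(squad_to_pairs(sample, "qg")[0])
```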
Here is the link to the colab for the analysis of the question answering and question generation API --> https://colab.research.google.com/drive/1WzO_TP9Nn98AeKmicCYaNTRWezAp9OLt?usp=sharing
Thanks very much for sharing the notebook. I can recommend two things to try:
- Use the existing pre-trained question-answering model to evaluate the performance of the question generation model.
- Use the text2text fine-tuning API to see if we can get question generation to work on a pre-trained translator. Although the documentation provides an example for translating, there's nothing stopping us from using it for question generation. Depending on the results, we can dig deeper to understand how to develop the model further.
Thanks Art.
Feedback: After evaluating the performance of the question generation model with the question-answering model, my conclusion is that it is quite accurate at a high level. I have documented it here: https://colab.research.google.com/drive/1WzO_TP9Nn98AeKmicCYaNTRWezAp9OLt#scrollTo=BtnFEHUlProe&line=1&uniqifier=1
Blocker: When using the text2text fine-tuning API to get a pre-trained translator, I run out of space. I have experienced this on both AWS and colab (I get a 'no space left on device' error message). I would appreciate any help I can get. I have attached screenshots here.

- Would you be able to report the test set accuracy so that we can establish a benchmark? This would be useful for researchers as a better way to measure question generation performance.
- It looks like you are using the default model, which takes up a lot of space and memory. Try using a smaller model with the setting
t2t.Transformer.PRETRAINED_TRANSLATOR = "facebook/m2m100_418M"
This was tested on Google's colab environment.
@artitw oh Okay. Got it
Thanks Art. The question generation API actually works on a pre-trained translator. I was able to demonstrate it here. https://colab.research.google.com/drive/1WzO_TP9Nn98AeKmicCYaNTRWezAp9OLt?usp=sharing
The next step is for me to report the test set accuracy so that we can establish a benchmark.
Reviewed the notebook. It looks like the fine-tuning was not performed on question generation data; rather, it was done using the example for translation. Could you try the following format? I updated the API in the repo to avoid confusion with the [SEP] token.
result = t2t.Handler(["I will go to school today to take my math exam. [SEP] school [TGT] Where will you go to take your math exam?"],
                     src_lang="en",
                     tgt_lang="en",
                     num_epochs=10,
                     save_directory="model_dir"
                     ).fit()
Oh I see
Hi @artitw I have gone back to do the work again and the question generation API actually works on a pre-trained translator. Here https://colab.research.google.com/drive/1WzO_TP9Nn98AeKmicCYaNTRWezAp9OLt?usp=sharing
What strategy do you recommend for benchmarking the test set accuracy?
For benchmarking, we can start with lower casing the text and then calculating the exact match accuracy.
For finetuning a pretrained translator, we would have to use the translate (not question generation) API to generate the finetuned results.
In addition to exact match accuracy, it would be good to calculate average BLEU scores over the answers as well. For reference, see https://en.wikipedia.org/wiki/BLEU
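The two metrics suggested above could be sketched as follows. The exact-match function lower-cases before comparing, as suggested; the second function computes only clipped unigram precision, the BLEU-1 building block (in practice, `nltk.translate.bleu_score` would be used for full BLEU-1 to BLEU-4 with the brevity penalty):

```python
from collections import Counter

def exact_match(preds, refs):
    """Lower-cased exact-match accuracy over paired predictions/references."""
    hits = sum(p.strip().lower() == r.strip().lower() for p, r in zip(preds, refs))
    return hits / len(refs)

def unigram_precision(pred, ref):
    """Clipped unigram precision: each predicted token is credited at most
    as many times as it appears in the reference."""
    pred_counts = Counter(pred.lower().split())
    ref_counts = Counter(ref.lower().split())
    clipped = sum(min(c, ref_counts[w]) for w, c in pred_counts.items())
    return clipped / max(1, sum(pred_counts.values()))

preds = ["paris", "the eiffel tower"]
refs = ["Paris", "eiffel tower"]
print(exact_match(preds, refs))  # 0.5
print(unigram_precision("the eiffel tower", "eiffel tower"))
```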
Hi @artitw, after training with about 33 data points, the pre-trained translator is still just translating the payload. Do you suggest I train with even more data? Here is my result: https://colab.research.google.com/drive/1vJ5U_UNFxeu92VVyhAhxKSur_BZSJSIJ?usp=sharing
Thanks for sharing the notebook. It looks like the right direction, but I would expect it to need much more training (>10k examples). I would also recommend saving the intermediate results in Google Drive so that you can pick up where you left off without starting over.
@artitw oh okay, got it
Hi @artitw, I would like to continue from where John stopped.
great, I've assigned you to this issue. Please review what John has done and let us know of any questions here.
Noted. I have reviewed John's work and played with the notebooks he reported. It seems that my assignments are the following, in order:
- Get sufficient (> 10k) training data.
- Report exact match accuracy.
- Report average BLEU scores for the answers.
Am I right? Do you have any suggestions on getting training data?
Thank you.
What you describe sounds like the right track. I would recommend starting with the English SQuAD [1] dataset and then use XQuAD [2] after that is somewhat working.
[1] https://rajpurkar.github.io/SQuAD-explorer/ [2] https://github.com/deepmind/xquad
Hi @artitw,
After trying different options that did not work out, I opted for Amazon Sagemaker.
- I loaded the datasets (JSON) to AWS S3
- I dockerized the fine-tuning script and pushed the image to AWS ECR
- I then created a job on SageMaker using the Docker image as a custom algorithm
The job has been running for some hours, taking the SQuAD [1] dataset as input. I will keep you updated. I could not get access to an HPC cluster, so I followed this approach. Please let me know what you think.
Hi @lere01 What you suggest seems interesting. I would recommend using a small dataset to test your setup before running any heavy jobs.
Hi @artitw,
I used a small dataset to test my setup as you suggested and it worked fine. But the larger dataset took too long to run. I set the job to run for 5 days and even that time frame was not enough.
However, you can see some sort of proof of concept at https://colab.research.google.com/drive/1Vvem1DqNJZQej4t2qAIkZN0DyCdUY_sM#scrollTo=RXf2UrMvSc25.
- I used 50 rows from the training set to fine-tune
- Then performed the translation task (answering) on 50 rows of the dev set
- I calculated the BLEU score using the NLTK implementation and reported BLEU-1 to BLEU-4
- 84% of the answers generated by the model were a perfect match for the references
This was just to show that the whole process works. I would like your suggestion on how to proceed.
Hi @lere01,
Thanks for sharing your work and the summary. It looks like a good start. The main issue I can see is that the notebook you shared uses the Answerer model, not the translator you fine-tuned. We would have to perform predictions using the translator model, because we are using it for an unintended purpose.
Hi @artitw
Hope you have had a good day. Two things.
1. Before going far, I want to let you know that I am fine tuning using
t2t.Handler([f"{CONTEXT} [TGT] {QUESTION}"],
            src_lang="en",
            tgt_lang="en",
            num_epochs=10,
            save_directory="model_dir"
            ).fit()
AND NOT
t2t.Handler([f"{CONTEXT} [SEP] {ANSWER} [TGT] {QUESTION}"],
            src_lang="en",
            tgt_lang="en",
            num_epochs=10,
            save_directory="model_dir"
            ).fit()
Am I on the right track?
2. I dug into the codebase and figured out a way to use the GPU.
By editing the Translator and doing this:
import text2text as t2t
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

class Translator(t2t.Transformer):
    def __init__(self, **kwargs):
        pretrained_translator = self.__class__.PRETRAINED_TRANSLATOR
        torch_device = "cuda" if torch.cuda.is_available() else "cpu"
        self.__class__.model = AutoModelForSeq2SeqLM.from_pretrained(pretrained_translator).to(torch_device)
        self.__class__.tokenizer = AutoTokenizer.from_pretrained(pretrained_translator)
What do you think?
The second approach should work, as we want to generate questions that correspond to a context and an answer.
Nice find. I am referencing your pull request here: https://github.com/artitw/text2text/pull/31
Hi @artitw,
The dataset we are using for fine-tuning has multiple questions attached to each context. Do you think that this might be affecting the algorithm's learning, as opposed to one question per context?
Yes, I would suggest that the context be concatenated with the answer for each target question. This would ensure that each unique question is mapped to a unique input to the model.
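That suggestion could be sketched as follows (a minimal illustration; the `records` layout and field names here are hypothetical, but the output strings follow the `[SEP]`/`[TGT]` format used earlier in this thread):

```python
def build_training_lines(records):
    """One training line per question: 'context [SEP] answer [TGT] question'.
    Concatenating the answer to the context makes each input unique even
    when several questions share the same context."""
    lines = []
    for rec in records:
        context = rec["context"]
        for q, a in zip(rec["questions"], rec["answers"]):
            lines.append(f"{context} [SEP] {a} [TGT] {q}")
    return lines

# One context with two attached questions, as in SQuAD
records = [{"context": "Paris is the capital of France.",
            "questions": ["What is the capital of France?",
                          "Which country is Paris in?"],
            "answers": ["Paris", "France"]}]
for line in build_training_lines(records):
    print(line)
```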