Can't reproduce the results for GLUE CoLA
My steps:
git clone https://github.com/microsoft/LoRA.git
cd LoRA
pip install -e .
cd examples/NLU
pip install -e .
Change export num_gpus=8 to export num_gpus=1 in roberta_large_cola.sh
Then CUDA_VISIBLE_DEVICES=0 bash roberta_large_cola.sh
Running on a single A100
Using:
- datasets 2.6.1
- python 3.9.13
- PyTorch 1.13.0+cu117
During training, eval_matthews_correlation is stuck at 0 for all epochs. I actually had the same issue on the current transformers version, and decreasing the learning rate plus removing warmup helped me get okay-ish numbers during training, but nothing as good as the reported 0.68. (An eval_matthews_correlation of exactly 0 usually means the classifier is predicting a single class for every example; see the sketch below.)
Do you have an idea of what I could be doing wrong?
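For context, the Matthews correlation is exactly 0 whenever the predictions carry no information about the labels, which in practice usually means the classifier has collapsed onto a single class (I believe the GLUE metric in datasets just wraps sklearn's matthews_corrcoef). A minimal sketch with made-up labels:

# Illustration only: a classifier that always predicts one class gets MCC = 0
from sklearn.metrics import matthews_corrcoef

labels = [1, 0, 1, 1, 0, 1, 0, 1]        # hypothetical CoLA labels (1 = acceptable)
preds = [1] * len(labels)                # collapsed model: always predicts "acceptable"
print(matthews_corrcoef(labels, preds))  # 0.0 (sklearn returns 0 when MCC is undefined)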
Update: using the settings below (compared with the original roberta_large_cola.sh, I changed per_device_train_batch_size 4 → 8, learning_rate 3e-4 → 2e-5, warmup_ratio 0.06 → 0.0, and weight_decay 0.1 → 0.0):
export num_gpus=1
export CUBLAS_WORKSPACE_CONFIG=":16:8" # https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility
export PYTHONHASHSEED=0
export output_dir="./roberta_cola_custom_sh"
python -m torch.distributed.launch --nproc_per_node=$num_gpus \
examples/text-classification/run_glue.py \
--model_name_or_path roberta-large \
--task_name cola \
--do_train \
--do_eval \
--max_seq_length 128 \
--per_device_train_batch_size 8 \
--learning_rate 2e-5 \
--num_train_epochs 20 \
--output_dir $output_dir/model \
--logging_steps 10 \
--logging_dir $output_dir/log \
--evaluation_strategy epoch \
--save_strategy epoch \
--warmup_ratio 0.0 \
--apply_lora \
--lora_r 8 \
--lora_alpha 16 \
--seed 0 \
--weight_decay 0.0
trains just fine: I no longer get eval_matthews_correlation = 0 during training.
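For anyone else chasing run-to-run reproducibility: as far as I know, CUBLAS_WORKSPACE_CONFIG is also the variable PyTorch requires when deterministic CUDA algorithms are enabled. A minimal sketch of the usual torch-side settings (illustrative only; run_glue.py handles seeding via --seed, so this is a general recipe rather than a claim about that script):

# Sketch of a reproducible PyTorch setup (not what run_glue.py necessarily does)
import os, random
import numpy as np
import torch

os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":16:8")  # deterministic cuBLAS workspaces
random.seed(0)
np.random.seed(0)
torch.manual_seed(0)                      # seeds CPU and all CUDA devices
torch.backends.cudnn.benchmark = False    # disable non-deterministic kernel autotuning
torch.use_deterministic_algorithms(True)  # raise if a non-deterministic op is hit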
@fxmarty Asking because I can't get GLUE results as good as in the paper: if you have run other GLUE tasks as well, did you have to apply similar changes to the remaining tasks?
Also, do I understand correctly that, accounting for the number of GPUs, you reduced the total batch size from 4 * 8 = 32 to 8?
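For reference, with the HF Trainer the effective batch size is, as far as I understand, per_device_train_batch_size x number of GPUs x gradient_accumulation_steps. A quick sketch of the arithmetic:

# Effective batch size arithmetic (HF Trainer, one process per GPU)
def effective_batch_size(per_device: int, n_gpus: int, grad_accum: int = 1) -> int:
    return per_device * n_gpus * grad_accum

print(effective_batch_size(4, 8))     # original script: 4 per device x 8 GPUs = 32
print(effective_batch_size(8, 1))     # the single-GPU run above: 8
print(effective_batch_size(8, 1, 4))  # one way to get back to 32 on one GPU: gradient accumulation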
I am facing a similar problem: when I set num_gpus=2 and add gradient_accumulation_steps=4 (which keeps the total batch size at 32), the average over 5 random seeds on CoLA with roberta-large + LoRA is 67.0. These results follow the convention that "the result for each run is taken from the best epoch".
Does anyone know the solution? I assume that setting per_device_train_batch_size = 4 on a single GPU is equivalent to a total batch size of 4, which is the paper's setting, but I am still getting matthews_correlation = 0 during evaluation.