AMP autocast not invoked with CUDA 11.8 build of PyTorch
System Info
- PyTorch 2.1 + CUDA 11.8
- transformers 4.36.2
- accelerate 0.26.0
Who can help?
@pacman100, @muellerzr
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
In `Trainer.autocast_smart_context_manager()`, only CPU AMP is supported; the CUDA autocast wrapper is managed by Accelerate when training starts. This design works with PyTorch 2.1 built with CUDA 12.1, but not with the CUDA 11.8 build.
Expected behavior
CUDA AMP works with torch 2.1 + CUDA 11.8.
My simple fix is as follows:
- add `force_cuda_amp` to `TrainingArguments` to flag the code to enable CUDA AMP autocast
- derive `Trainer.autocast_smart_context_manager()` to return a CUDA AMP context if `force_cuda_amp` is flagged
A more systematic solution (though still more of a hack) is to detect the CUDA version when `Trainer` is initialized, and enable this flag automatically if CUDA is < 12.
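A minimal sketch of what that could look like. The helper names (`should_force_cuda_amp`, `autocast_context`) and the flag are hypothetical; only `torch.autocast(device_type="cuda", ...)` is the real public API, and the version heuristic is just the one proposed above:

```python
import contextlib
from typing import Optional


def should_force_cuda_amp(cuda_version: Optional[str]) -> bool:
    """Heuristic from the issue: force the CUDA autocast context when
    PyTorch was built against CUDA older than 12. `cuda_version` is the
    string that `torch.version.cuda` reports, e.g. "11.8" or "12.1"."""
    if cuda_version is None:  # CPU-only build of PyTorch
        return False
    return int(cuda_version.split(".")[0]) < 12


def autocast_context(force_cuda_amp: bool, dtype=None):
    """Return the context manager a Trainer would enter around forward().
    torch is imported lazily so this sketch stays importable without it."""
    if force_cuda_amp:
        import torch  # assumed available on the training machine

        return torch.autocast(device_type="cuda", dtype=dtype)
    return contextlib.nullcontext()
```

A derived `Trainer` could then call `should_force_cuda_amp(torch.version.cuda)` once at init time and route `autocast_smart_context_manager()` through `autocast_context`.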
~~Edit: my fix resulted in NaN so no fix yet~~
Edit 2: my fix actually worked. The NaN problem came from the hidden `_fast_init` flag of `from_pretrained`, which left some new modules improperly initialized.
cc @pacman100 @muellerzr
Gentle ping @muellerzr @pacman100
Another ping @pacman100 @muellerzr
Hello, Zach will be looking into this.
Hi @haixpham, terribly sorry for the delay. On further investigation we weren't using the Accelerator's context manager. Can you try again using `pip install git+https://github.com/huggingface/transformers@muellerzr-fix-autocast`? This should help with that :)
@muellerzr This should fix it. However, the unfixed `Trainer.autocast_smart_context_manager()` still worked fine in bf16 with PyTorch 2.1 / CUDA 12.1. I looked into the Accelerate wrapper: the `forward` call is wrapped in the AMP context in `accelerator.prepare()`, but this wrapper didn't work with PyTorch 2.1 / CUDA 11.8.
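For context, the wrapping Accelerate applies follows roughly this pattern: `prepare()` swaps the model's `forward` for a version that enters the autocast context first. This is a simplified sketch, not Accelerate's actual implementation; the dummy model and recording context manager exist only so the wrapper can be exercised without torch installed:

```python
import contextlib
import functools


def wrap_forward_with_autocast(model, autocast_ctx_factory=contextlib.nullcontext):
    """Replace model.forward with a wrapper that runs the original call
    inside a mixed-precision context, mimicking accelerator.prepare()."""
    original_forward = model.forward

    @functools.wraps(original_forward)
    def forward(*args, **kwargs):
        with autocast_ctx_factory():
            return original_forward(*args, **kwargs)

    model.forward = forward
    return model


# Tiny stand-in model and context manager to show the wrapper firing
class DummyModel:
    def forward(self, x):
        return x * 2


entered = []


@contextlib.contextmanager
def recording_ctx():
    entered.append(True)  # records that the "autocast" context was entered
    yield


model = wrap_forward_with_autocast(DummyModel(), recording_ctx)
```

If the bug was real, the equivalent of `entered` staying empty on the CUDA 11.8 build is what the crash in `backward()` would suggest.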
@haixpham can you describe how you could tell it "wasn't working"? I'm now investigating the current code in CUDA 11.8 with torch 2.1, and I'm finding the right wrapper being called.
How my args are set up:

```python
training_args = TrainingArguments(
    output_dir="results/sequence_classification",  # Where weights are stored
    learning_rate=2e-5,  # The learning rate during training
    adam_beta1=.9,
    adam_beta2=.95,
    weight_decay=0.0,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    per_device_train_batch_size=16,  # Number of samples per batch during training
    max_steps=200,  # How many iterations through the dataloaders should be done
    fp16=True,
)
```
More specifically, the wrapper around the model's `forward()` is called in the correct precision.
> can you describe how you could tell it "wasn't working"?
The code crashed during `accelerator.backward(loss)` in `Trainer.training_step()`. I followed the code path (of transformers 4.36.2 and accelerate 0.26.0) and saw the call wrapped, but autocast wasn't invoked correctly. Explicitly forcing autocast in `Trainer.autocast_smart_context_manager()` solved the problem, and I didn't look further into it.
Thanks! Let me see if I can reproduce
I can't reproduce this on torch==2.1.0, accelerate==0.26.0, transformers@main @haixpham. What are your training args setup like?
Here's the script I ran without issue:
```python
# End-to-end script running the Hugging Face Trainer
# for multiple choice. Based on the Tasks documentation
# originally from: https://hf.co/docs/transformers/tasks/multiple_choice
from dataclasses import dataclass
from typing import Optional, Union

import evaluate
import numpy as np
import torch
from accelerate import PartialState
from datasets import load_dataset
from transformers import AutoModelForMultipleChoice, AutoTokenizer, Trainer, TrainingArguments
from transformers.tokenization_utils_base import PaddingStrategy, PreTrainedTokenizerBase

# Constants
model_name = "bert-base-uncased"
dataset_name = "swag"
metric = "accuracy"

# Load dataset
print(f"Downloading dataset ({dataset_name})")
dataset = load_dataset(dataset_name, "regular", split="train[:8%]")
dataset = dataset.train_test_split(test_size=0.2)

# Tokenize the dataset
tokenizer = AutoTokenizer.from_pretrained(model_name)
ending_names = ["ending0", "ending1", "ending2", "ending3"]


def tokenize_function(examples):
    first_sentences = [[context] * 4 for context in examples["sent1"]]
    question_headers = examples["sent2"]
    second_sentences = [
        [f"{header} {examples[end][i]}" for end in ending_names] for i, header in enumerate(question_headers)
    ]
    first_sentences = sum(first_sentences, [])
    second_sentences = sum(second_sentences, [])
    tokenized_examples = tokenizer(first_sentences, second_sentences, truncation=True)
    return {k: [v[i : i + 4] for i in range(0, len(v), 4)] for k, v in tokenized_examples.items()}


print(f"Tokenizing dataset for {model_name}...")
tokenized_dataset = dataset.map(tokenize_function, batched=True)


# Create our own data collator class and use it
@dataclass
class DataCollatorForMultipleChoice:
    """
    Data collator that will dynamically pad the inputs for multiple choice received.
    """

    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features):
        label_name = "label" if "label" in features[0].keys() else "labels"
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]["input_ids"])
        flattened_features = [
            [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
        ]
        flattened_features = sum(flattened_features, [])
        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        batch["labels"] = torch.tensor(labels, dtype=torch.int64)
        return batch


data_collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)

# Handle computation of our metrics
print(f"Loading metric ({metric})...")
accuracy = evaluate.load(metric)


def compute_metrics(evaluation_preds):
    predictions, labels = evaluation_preds
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)


print(f"Instantiating model ({model_name})...")
model = AutoModelForMultipleChoice.from_pretrained(model_name)

# Define the hyperparameters in the TrainingArguments
print("Creating training arguments (weights are stored at `results/multiple_choice`)...")
training_args = TrainingArguments(
    output_dir="results/multiple_choice",  # Where weights are stored
    learning_rate=5e-5,  # The learning rate during training
    per_device_train_batch_size=16,  # Number of samples per batch during training
    per_device_eval_batch_size=16,  # Number of samples per batch during evaluation
    num_train_epochs=2,  # How many iterations through the dataloaders should be done
    weight_decay=0.01,  # Regularization penalization
    evaluation_strategy="epoch",  # How often metrics on the evaluation dataset should be computed
    fp16=True,  # Whether to use 16-bit precision (mixed precision) instead of 32-bit. Generally faster on T4's
)

# Create the `Trainer`, passing in the model and arguments,
# the datasets to train on, how the data should be collated,
# and the method for computing our metrics
print("Creating `Trainer`...")
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Initiate training
print("Training...")
trainer.train()
```
I trained a standard `T5ForConditionalGeneration` model for QA, nothing out of the ordinary. The problem only appeared with PyTorch 2.1 built with CUDA 11.8 (it worked fine with PyTorch 2.1 built with CUDA 12.1).
A full reproducer is very appreciated here, as I just did more or less what you just said, and I am using PyTorch 2.1 built on CUDA 11.8
I will provide sample code tomorrow maybe when I get to office
Thank you so much @haixpham :)
Please note I reported the problem with transformers 4.36.2 while you tested with main. Something may have changed in Trainer that made the problem go away.
Just reran with that transformers version, no issues there either. (But good call!)
@muellerzr I set up a conda environment from scratch with torch 2.1 + CUDA 11.8 / transformers 4.36.2. To my surprise, the problem (an autocasting exception in `autograd.backward`) did not surface anymore. Was it a problem particular to the machine I ran on before? I'm not sure. Anyway, the problem is gone now (without the fix you committed).
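For anyone trying to reproduce this later, a clean environment matching the versions reported in this thread can be recreated along these lines (the `cu118` index URL is PyTorch's standard wheel index for CUDA 11.8 builds; the environment name is arbitrary):

```shell
# Fresh environment with PyTorch 2.1 built against CUDA 11.8
conda create -n amp-repro python=3.10 -y
conda activate amp-repro
pip install torch==2.1.0 --index-url https://download.pytorch.org/whl/cu118
pip install transformers==4.36.2 accelerate==0.26.0

# Sanity check: the second value should print "11.8"
python -c "import torch; print(torch.__version__, torch.version.cuda)"
```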
Sorry for the trouble. I'm closing this issue now.
P.S. Another issue I raised the same day, #28510, has not been fixed yet: it concerns using the DeepSpeed context in `PreTrainedModel` when a DeepSpeed config is provided to `TrainingArguments`.
I'll give it a glance today, thanks for your patience and cooperation @haixpham 🤗