AMP autocast not invoked with CUDA 11.8 build of PyTorch
System Info
- PyTorch 2.1 + CUDA 11.8
- transformers 4.36.2
- accelerate 0.26.0
Who can help?
@pacman100, @muellerzr
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
In `Trainer.autocast_smart_context_manager()`, only CPU AMP is supported; the CUDA autocast wrapper is managed by Accelerate when training starts. This design works with PyTorch 2.1 built with CUDA 12.1, but not with the CUDA 11.8 build.
Expected behavior
CUDA AMP works with torch 2.1 + CUDA 11.8.
My simple fix is as follows:
- add `force_cuda_amp` to `TrainingArguments` to flag the code to enable CUDA AMP autocast
- derive `Trainer.autocast_smart_context_manager()` to return a CUDA AMP context if `force_cuda_amp` is flagged
A more systematic solution (though still more of a hack) is to detect the CUDA version when `Trainer` is initialized, and enable this flag automatically if CUDA is < 12.
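A minimal sketch of what that could look like. The helper names (`should_force_cuda_amp`, `autocast_context`) and the flag are hypothetical; only `torch.autocast(device_type="cuda", ...)` is the real public API, and the version heuristic is just the one proposed above:

```python
import contextlib
from typing import Optional


def should_force_cuda_amp(cuda_version: Optional[str]) -> bool:
    """Heuristic from the issue: force the CUDA autocast context when
    PyTorch was built against CUDA older than 12. `cuda_version` is the
    string that `torch.version.cuda` reports, e.g. "11.8" or "12.1"."""
    if cuda_version is None:  # CPU-only build of PyTorch
        return False
    return int(cuda_version.split(".")[0]) < 12


def autocast_context(force_cuda_amp: bool, dtype=None):
    """Return the context manager a Trainer would enter around forward().
    torch is imported lazily so this sketch stays importable without it."""
    if force_cuda_amp:
        import torch  # assumed available on the training machine

        return torch.autocast(device_type="cuda", dtype=dtype)
    return contextlib.nullcontext()
```

A derived `Trainer` could then call `should_force_cuda_amp(torch.version.cuda)` once at init time and route `autocast_smart_context_manager()` through `autocast_context`.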
~~Edit: my fix resulted in NaN so no fix yet~~
Edit 2: my fix actually worked. The NaN problem came from the hidden `_fast_init` flag of `from_pretrained`, which left some new modules improperly initialized.
cc @pacman100 @muellerzr
Gentle ping @muellerzr @pacman100
Another ping @pacman100 @muellerzr
Hello, Zach will be looking into this.
Hi @haixpham, terribly sorry for the delay. On further investigation we weren't using the Accelerator's context manager. Can you try again using `pip install git+https://github.com/huggingface/transformers@muellerzr-fix-autocast`? This should help with that :)
@muellerzr This should fix it. However, the unfixed `Trainer.autocast_smart_context_manager()` still worked fine in bf16 with PyTorch 2.1 / CUDA 12.1. I looked into the Accelerate wrapper: the `forward` call is wrapped in the AMP context in `accelerator.prepare()`, but this wrapper didn't work with PyTorch 2.1 / CUDA 11.8.
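For context, the wrapping Accelerate applies follows roughly this pattern: `prepare()` swaps the model's `forward` for a version that enters the autocast context first. This is a simplified sketch, not Accelerate's actual implementation; the dummy model and recording context manager exist only so the wrapper can be exercised without torch installed:

```python
import contextlib
import functools


def wrap_forward_with_autocast(model, autocast_ctx_factory=contextlib.nullcontext):
    """Replace model.forward with a wrapper that runs the original call
    inside a mixed-precision context, mimicking accelerator.prepare()."""
    original_forward = model.forward

    @functools.wraps(original_forward)
    def forward(*args, **kwargs):
        with autocast_ctx_factory():
            return original_forward(*args, **kwargs)

    model.forward = forward
    return model


# Tiny stand-in model and context manager to show the wrapper firing
class DummyModel:
    def forward(self, x):
        return x * 2


entered = []


@contextlib.contextmanager
def recording_ctx():
    entered.append(True)  # records that the "autocast" context was entered
    yield


model = wrap_forward_with_autocast(DummyModel(), recording_ctx)
```

If the bug was real, the equivalent of `entered` staying empty on the CUDA 11.8 build is what the crash in `backward()` would suggest.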
@haixpham can you describe how you could tell it "wasn't working"? I'm now investigating the current code in CUDA 11.8 with torch 2.1, and I'm finding the right wrapper being called.
How my args are set up:

```python
training_args = TrainingArguments(
    output_dir="results/sequence_classification",  # Where weights are stored
    learning_rate=2e-5,  # The learning rate during training
    adam_beta1=.9,
    adam_beta2=.95,
    weight_decay=0.0,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    per_device_train_batch_size=16,  # Number of samples per batch during training
    max_steps=200,  # How many iterations through the dataloaders should be done
    fp16=True,
)
```
More specifically, the wrapper around the model's `forward()` is called in the correct precision.
> can you describe how you could tell it "wasn't working"?
The code crashed during `accelerator.backward(loss)` in `Trainer.training_step()`. I followed the code path (of transformers 4.36.2 and accelerate 0.26.0) and saw the call wrapped, but autocast wasn't invoked correctly. Explicitly forcing autocast in `Trainer.autocast_smart_context_manager()` solved the problem, and I didn't look further into it.
Thanks! Let me see if I can reproduce
I can't reproduce this on torch==2.1.0, accelerate==0.26.0, transformers@main @haixpham. What are your training args setup like?
Here's the script I ran without issue:
```python
# End-to-end script running the Hugging Face Trainer
# for multiple choice. Based on the Tasks documentation
# originally from: https://hf.co/docs/transformers/tasks/multiple_choice
from dataclasses import dataclass
from typing import Optional, Union

import evaluate
import numpy as np
import torch
from accelerate import PartialState
from datasets import load_dataset
from transformers import AutoModelForMultipleChoice, AutoTokenizer, Trainer, TrainingArguments
from transformers.tokenization_utils_base import PaddingStrategy, PreTrainedTokenizerBase

# Constants
model_name = "bert-base-uncased"
dataset_name = "swag"
metric = "accuracy"

# Load dataset
print(f"Downloading dataset ({dataset_name})")
dataset = load_dataset(dataset_name, "regular", split="train[:8%]")
dataset = dataset.train_test_split(test_size=0.2)

# Tokenize the dataset
tokenizer = AutoTokenizer.from_pretrained(model_name)
ending_names = ["ending0", "ending1", "ending2", "ending3"]


def tokenize_function(examples):
    first_sentences = [[context] * 4 for context in examples["sent1"]]
    question_headers = examples["sent2"]
    second_sentences = [
        [f"{header} {examples[end][i]}" for end in ending_names] for i, header in enumerate(question_headers)
    ]
    first_sentences = sum(first_sentences, [])
    second_sentences = sum(second_sentences, [])
    tokenized_examples = tokenizer(first_sentences, second_sentences, truncation=True)
    return {k: [v[i : i + 4] for i in range(0, len(v), 4)] for k, v in tokenized_examples.items()}


print(f"Tokenizing dataset for {model_name}...")
tokenized_dataset = dataset.map(tokenize_function, batched=True)


# Create our own data collator class and use it
@dataclass
class DataCollatorForMultipleChoice:
    """
    Data collator that will dynamically pad the inputs for multiple choice received.
    """

    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features):
        label_name = "label" if "label" in features[0].keys() else "labels"
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]["input_ids"])
        flattened_features = [
            [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
        ]
        flattened_features = sum(flattened_features, [])
        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        batch["labels"] = torch.tensor(labels, dtype=torch.int64)
        return batch


data_collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)

# Handle computation of our metrics
print(f"Loading metric ({metric})...")
accuracy = evaluate.load(metric)


def compute_metrics(evaluation_preds):
    predictions, labels = evaluation_preds
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)


print(f"Instantiating model ({model_name})...")
model = AutoModelForMultipleChoice.from_pretrained(model_name)

# Define the hyperparameters in the TrainingArguments
print("Creating training arguments (weights are stored at `results/multiple_choice`)...")
training_args = TrainingArguments(
    output_dir="results/multiple_choice",  # Where weights are stored
    learning_rate=5e-5,  # The learning rate during training
    per_device_train_batch_size=16,  # Number of samples per batch during training
    per_device_eval_batch_size=16,  # Number of samples per batch during evaluation
    num_train_epochs=2,  # How many iterations through the dataloaders should be done
    weight_decay=0.01,  # Regularization penalization
    evaluation_strategy="epoch",  # How often metrics on the evaluation dataset should be computed
    fp16=True,  # Whether to use 16-bit precision (mixed precision) instead of 32-bit. Generally faster on T4's
)

# Create the `Trainer`, passing in the model and arguments,
# the datasets to train on, how the data should be collated,
# and the method for computing our metrics
print("Creating `Trainer`...")
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Initiate training
print("Training...")
trainer.train()
```
I trained a standard `T5ForConditionalGeneration` model for QA, nothing out of the ordinary. The problem only appeared with PyTorch 2.1 built with CUDA 11.8 (it worked fine with PyTorch 2.1 built with CUDA 12.1).
A full reproducer is very appreciated here, as I just did more or less what you just said, and I am using PyTorch 2.1 built on CUDA 11.8
I will provide sample code tomorrow maybe when I get to office
Thank you so much @haixpham :)
Please note I reported the problem with transformers 4.36.2 while you tested with main. Something may have changed in Trainer that made the problem go away.
Just reran with that transformers version, no issues there either. (But good call!)
@muellerzr I set up a conda environment from scratch with torch 2.1 + CUDA 11.8 / transformers 4.36.2. To my surprise, the problem (an autocasting exception in `autograd.backward`) did not surface anymore. Was it a problem particular to the machine I ran on before? I'm not sure. Anyway, the problem is gone now (without the fix you committed).
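For anyone trying to reproduce this later, a clean environment matching the versions reported in this thread can be recreated along these lines (the `cu118` index URL is PyTorch's standard wheel index for CUDA 11.8 builds; the environment name is arbitrary):

```shell
# Fresh environment with PyTorch 2.1 built against CUDA 11.8
conda create -n amp-repro python=3.10 -y
conda activate amp-repro
pip install torch==2.1.0 --index-url https://download.pytorch.org/whl/cu118
pip install transformers==4.36.2 accelerate==0.26.0

# Sanity check: the second value should print "11.8"
python -c "import torch; print(torch.__version__, torch.version.cuda)"
```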
Sorry for the trouble. I'm closing this issue now.
P.S. Another issue I raised the same day, #28510, has not been fixed yet: it concerns using the DeepSpeed context in `PreTrainedModel` when a DeepSpeed config is provided to `TrainingArguments`.
I'll give it a glance today, thanks for your patience and cooperation @haixpham 🤗