T5 Multi-GPU FSDP evaluation loop raises RuntimeError when predict_with_generate is True
System Info
Transformers version: 4.27.0-dev
Python version: 3.8.12
Who can help?
@gante
Information
- [X] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
After #21604 I tried optimizing my code a bit more. I read about DeepSpeed and FSDP and decided to try FSDP since it seemed simpler.
Here's a link to the new code: https://pastebin.com/n9Su4AiL
torchrun train_model.py --dataset_path ./data/HF_HE_AR_Dataset.json --tokenizer_path ./T5Tokenizer/ --max_length=128 --batch_size=4 --logging_steps 10 --save_steps 1000 --model google/t5-v1_1-large --validation_path ./data/dev.json --test_path ./data/test.json --weight_decay 0.0
When the code reaches the number of logging steps I defined (10 here), it crashes, and the final error is:
RuntimeError: The tensor has a non-zero number of elements, but its data is not allocated yet. Caffe2 uses a lazy allocation, so you will need to call mutable_data() or raw_mutable_data() to actually allocate memory.
The full traceback can be found here: https://pastebin.com/ucZ021EQ
It happens with and without fsdp_transformer_layer_cls_to_wrap, with every FSDP option (with and without auto_wrap, and with both shard_grad_op and full_shard), and with and without fp16=True, whenever predict_with_generate=True.
With predict_with_generate=False it works fine.
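For reference, the setup boils down to roughly the following (a trimmed sketch rather than my full script; the dataset objects are placeholders and the evaluation cadence is only illustrative, the complete code is in the pastebin link above):

```python
# Trimmed sketch of the relevant configuration; datasets are placeholders.
from transformers import (
    AutoTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    T5ForConditionalGeneration,
)

tokenizer = AutoTokenizer.from_pretrained("./T5Tokenizer/")
model = T5ForConditionalGeneration.from_pretrained("google/t5-v1_1-large")

args = Seq2SeqTrainingArguments(
    output_dir="./outputs",
    per_device_train_batch_size=4,
    logging_steps=10,
    save_steps=1000,
    weight_decay=0.0,
    fsdp="full_shard auto_wrap",
    fsdp_transformer_layer_cls_to_wrap="T5Block",
    predict_with_generate=True,   # evaluation crashes only when this is True
    evaluation_strategy="steps",  # illustrative: evaluation runs during training
    eval_steps=10,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # placeholder: my tokenized train split
    eval_dataset=eval_dataset,    # placeholder: my tokenized dev split
    tokenizer=tokenizer,
)
trainer.train()
```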
Expected behavior
Running FSDP with predict_with_generate successfully.
Hey @eyalmazuz 👋
Looking at the exception, it does not look like a generate error, but rather a pytorch/trainer-related issue (it fails in the embedding layer). I'm not very knowledgeable there, so I'm tagging @sgugger for a comment.
BTW, without a short reproducible script, our ability to help is limited :)
Hi @gante
I created a repository with all the code here: https://github.com/eyalmazuz/T5-Translation
I think I uploaded everything needed. It is also possible to use the validation file for training; the problem still persists.
As I mentioned at the end of the issue, it only happens when predict_with_generate=True, so I assumed the issue lies in how outputs are generated as part of the evaluation loop, as opposed to regular prediction.
cc @pacman100
Hello, this isn't supported with FSDP, as mentioned here: https://huggingface.co/docs/accelerate/usage_guides/fsdp#a-few-caveats-to-be-aware-of
This feature is incompatible with --predict_with_generate in the run_translation.py script of 🤗 Transformers library.
@eyalmazuz, the reason is that transformers' generate bypasses the FSDP module's forward and directly calls the internal model's encoder, which isn't wrapped in an FSDP unit. Because of this, the parameters required for the forward pass aren't gathered, which leads to the error you see above.
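To make this concrete, here is a minimal sketch (assuming a distributed launch, e.g. via torchrun, and a small placeholder checkpoint) of how a direct call into the underlying encoder skips the parameter gathering that FSDP's forward performs:

```python
# Minimal sketch; run with a distributed launcher, e.g. torchrun --nproc_per_node=2 repro.py
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoTokenizer, T5ForConditionalGeneration

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())

tokenizer = AutoTokenizer.from_pretrained("t5-small")  # placeholder checkpoint
model = T5ForConditionalGeneration.from_pretrained("t5-small").cuda()
fsdp_model = FSDP(model)  # only the outer module becomes an FSDP unit

batch = tokenizer("translate English to German: hello", return_tensors="pt").to("cuda")

# OK: FSDP's forward all-gathers the sharded parameters before running the wrapped model.
fsdp_model(input_ids=batch["input_ids"], labels=batch["input_ids"])

# Not OK: generate() reaches into the wrapped model and calls the raw encoder,
# whose parameters are still sharded, which is what surfaces the RuntimeError above.
fsdp_model.module.get_encoder()(input_ids=batch["input_ids"])
```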
Related PyTorch issue on making generate work with FSDP (some hack is required): https://github.com/pytorch/pytorch/issues/82461
Even if one manually wraps the encoder and decoder in separate FSDP units, it will still produce errors, because shared parameters must be part of the same FSDP unit; that constraint would be broken, since the shared embedding layers of the encoder and decoder would end up in separate FSDP units: https://github.com/pytorch/pytorch/issues/79605
The hacky approach proposed in the above issue with the PyTorch team is currently the only way to get generate to work with FSDP.
@pacman100 thank you for your reply. If I understood https://github.com/pytorch/pytorch/issues/82461 correctly, the issue occurs because FSDP wraps the entire T5 model but not its submodules, so calling forward on T5 works, but calling T5.encoder directly does not, since the encoder specifically isn't wrapped in FSDP.
But isn't adding auto_wrap to the FSDP params supposed to recursively wrap all layers in FSDP and thus solve the issue?
As the documentation says:
To automatically recursively wrap layers with FSDP using default_auto_wrap_policy,
add --fsdp "full_shard auto_wrap" or --fsdp "shard_grad_op auto_wrap" to the command line arguments.
Or is it only wrapping T5Block in this case?
I changed the seq2seq_trainer file and added a small dummy forward pass before model.generate, as mentioned in https://github.com/huggingface/accelerate/issues/570:
# Dummy forward pass through the full (FSDP-wrapped) model so its sharded
# parameters are gathered before generate touches the encoder directly.
model_inputs = self.tokenizer(
    "في يوم", text_target="ביום שני, מדענים מבית הספר", max_length=10, return_tensors='pt', truncation=True
)
outputs = self.model(**model_inputs)
gen_kwargs["synced_gpus"] = True
generated_tokens = self.model.generate(
    generation_inputs,
    **gen_kwargs,
)
Is synced_gpus=True needed?
It works without it, but I'll keep it anyway.
@eyalmazuz, the transformer auto wrap policy only wraps T5Block modules in nested FSDP units.
The encoder, decoder, lm_head and shared modules are part of the global FSDP unit, and this matters because the shared embedding layers need to be part of the same FSDP unit, in this case the global one.
If one puts the encoder and decoder modules in different nested FSDP units, the shared embedding weights are no longer in the same FSDP unit, which leads to the other error mentioned in the comments above.
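For illustration, here is a rough sketch (assuming PyTorch's transformer_auto_wrap_policy, which as far as I know is what the Trainer's transformer-based auto wrap builds on; inspecting the wrapping still needs a distributed launch, and t5-small is only a placeholder checkpoint) of how the nesting ends up:

```python
# Rough sketch; requires an initialized process group (e.g. launch with torchrun).
import functools
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import T5ForConditionalGeneration
from transformers.models.t5.modeling_t5 import T5Block

dist.init_process_group("nccl")

model = T5ForConditionalGeneration.from_pretrained("t5-small")  # placeholder checkpoint
policy = functools.partial(transformer_auto_wrap_policy, transformer_layer_cls={T5Block})
fsdp_model = FSDP(model, auto_wrap_policy=policy)

# Printing the wrapped model shows that only the T5Block layers become nested
# FSDP units; encoder, decoder, lm_head and the shared embedding stay in the
# root FSDP unit, which keeps the tied embedding weights inside a single unit.
print(fsdp_model)
```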
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.