T5 Multi-GPU FSDP evaluation loop raises RuntimeError when predict_with_generate is True
System Info
Transformers version: 4.27.0-dev
Python version: 3.8.12
Who can help?
@gante
Information
- [X] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
After #21604 I tried optimizing my code a bit more. I read about DeepSpeed and FSDP and decided to try FSDP since it seemed simpler.
Here's a link to the new code: https://pastebin.com/n9Su4AiL
torchrun train_model.py --dataset_path ./data/HF_HE_AR_Dataset.json --tokenizer_path ./T5Tokenizer/ --max_length=128 --batch_size=4 --logging_steps 10 --save_steps 1000 --model google/t5-v1_1-large --validation_path ./data/dev.json --test_path ./data/test.json --weight_decay 0.0
When the code reaches the number of logging steps I defined (10 here), it crashes, and the final error is:
RuntimeError: The tensor has a non-zero number of elements, but its data is not allocated yet. Caffe2 uses a lazy allocation, so you will need to call mutable_data() or raw_mutable_data() to actually allocate memory.
The full traceback can be found here: https://pastebin.com/ucZ021EQ
It happens with and without fsdp_transformer_layer_cls_to_wrap, with every FSDP option (with and without auto_wrap, and with both shard_grad_op and full_shard), and with and without fp16=True, whenever predict_with_generate=True.
With predict_with_generate=False it works fine.
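For reference, the setup boils down to roughly the following (a trimmed sketch rather than my full script; the dataset objects are placeholders and the evaluation cadence is only illustrative, the complete code is in the pastebin link above):

```python
# Trimmed sketch of the relevant configuration; datasets are placeholders.
from transformers import (
    AutoTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    T5ForConditionalGeneration,
)

tokenizer = AutoTokenizer.from_pretrained("./T5Tokenizer/")
model = T5ForConditionalGeneration.from_pretrained("google/t5-v1_1-large")

args = Seq2SeqTrainingArguments(
    output_dir="./outputs",
    per_device_train_batch_size=4,
    logging_steps=10,
    save_steps=1000,
    weight_decay=0.0,
    fsdp="full_shard auto_wrap",
    fsdp_transformer_layer_cls_to_wrap="T5Block",
    predict_with_generate=True,   # evaluation crashes only when this is True
    evaluation_strategy="steps",  # illustrative: evaluation runs during training
    eval_steps=10,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # placeholder: my tokenized train split
    eval_dataset=eval_dataset,    # placeholder: my tokenized dev split
    tokenizer=tokenizer,
)
trainer.train()
```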
Expected behavior
Running FSDP with predict_with_generate successfully.
Hey @eyalmazuz 👋
Looking at the exception, it does not look like a generate error, but rather a pytorch/trainer-related issue (it fails in the embedding layer). I'm not very knowledgeable there, so I'm tagging @sgugger for a comment.
BTW, without a short reproducible script, our ability to help is limited :)
Hi @gante
I created a repository with all the code here: https://github.com/eyalmazuz/T5-Translation
I think I uploaded everything needed. It is also possible to use the validation file for training; the problem still persists.
As I mentioned at the end of the issue, it only happens when predict_with_generate=True, so I assumed the issue lies in how outputs are generated as part of the evaluation loop, as opposed to regular prediction.
cc @pacman100
Hello, this isn't supported with FSDP, as mentioned here: https://huggingface.co/docs/accelerate/usage_guides/fsdp#a-few-caveats-to-be-aware-of
This feature is incompatible with --predict_with_generate in the run_translation.py script of 🤗 Transformers library.
@eyalmazuz, the reason is that transformers' generate bypasses the FSDP module's forward and directly calls the internal model's encoder, which isn't wrapped in an FSDP unit. Because of this, the parameters required for the forward pass aren't gathered, which leads to the error you see above.
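To make this concrete, here is a minimal sketch (assuming a distributed launch, e.g. via torchrun, and a small placeholder checkpoint) of how a direct call into the underlying encoder skips the parameter gathering that FSDP's forward performs:

```python
# Minimal sketch; run with a distributed launcher, e.g. torchrun --nproc_per_node=2 repro.py
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoTokenizer, T5ForConditionalGeneration

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())

tokenizer = AutoTokenizer.from_pretrained("t5-small")  # placeholder checkpoint
model = T5ForConditionalGeneration.from_pretrained("t5-small").cuda()
fsdp_model = FSDP(model)  # only the outer module becomes an FSDP unit

batch = tokenizer("translate English to German: hello", return_tensors="pt").to("cuda")

# OK: FSDP's forward all-gathers the sharded parameters before running the wrapped model.
fsdp_model(input_ids=batch["input_ids"], labels=batch["input_ids"])

# Not OK: generate() reaches into the wrapped model and calls the raw encoder,
# whose parameters are still sharded, which is what surfaces the RuntimeError above.
fsdp_model.module.get_encoder()(input_ids=batch["input_ids"])
```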
Related PyTorch issue on making generate work with FSDP (some hack is required): https://github.com/pytorch/pytorch/issues/82461
Even if one manually wraps the encoder and decoder in separate FSDP units, it will still produce errors, because shared parameters must be part of the same FSDP unit; that constraint would be broken, since the shared embedding layers of the encoder and decoder would end up in separate FSDP units: https://github.com/pytorch/pytorch/issues/79605
The hacky approach proposed in the above issue with the PyTorch team is currently the only way to get generate to work with FSDP.
@pacman100 thank you for your reply. If I understood https://github.com/pytorch/pytorch/issues/82461 correctly, the issue occurs because FSDP wraps the entire T5 model but not its submodules, so calling forward on T5 works, but calling T5.encoder directly does not, since the encoder specifically isn't wrapped in FSDP.
But isn't adding auto_wrap to the FSDP params supposed to recursively wrap all layers in FSDP and thus solve the issue?
As the documentation says:
To automatically recursively wrap layers with FSDP using default_auto_wrap_policy,
add --fsdp "full_shard auto_wrap" or --fsdp "shard_grad_op auto_wrap" to the command line arguments.
Or is it only wrapping T5Block in this case?
I changed the seq2seq_trainer file and added a small dummy forward pass before model.generate, as mentioned in https://github.com/huggingface/accelerate/issues/570:
# Dummy forward pass through the full (FSDP-wrapped) model so its sharded
# parameters are gathered before generate touches the encoder directly.
model_inputs = self.tokenizer(
    "في يوم", text_target="ביום שני, מדענים מבית הספר", max_length=10, return_tensors='pt', truncation=True
)
outputs = self.model(**model_inputs)
gen_kwargs["synced_gpus"] = True
generated_tokens = self.model.generate(
    generation_inputs,
    **gen_kwargs,
)
Is synced_gpus=True needed?
It works without it, but I'll keep it anyway.
@eyalmazuz, the transformer auto wrap policy only wraps T5Block modules in nested FSDP units.
The encoder, decoder, lm_head and shared modules are part of the global FSDP unit, and this matters because the shared embedding layers need to be part of the same FSDP unit, in this case the global one.
If one puts the encoder and decoder modules in different nested FSDP units, the shared embedding weights are no longer in the same FSDP unit, which leads to the other error mentioned in the comments above.
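For illustration, here is a rough sketch (assuming PyTorch's transformer_auto_wrap_policy, which as far as I know is what the Trainer's transformer-based auto wrap builds on; inspecting the wrapping still needs a distributed launch, and t5-small is only a placeholder checkpoint) of how the nesting ends up:

```python
# Rough sketch; requires an initialized process group (e.g. launch with torchrun).
import functools
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import T5ForConditionalGeneration
from transformers.models.t5.modeling_t5 import T5Block

dist.init_process_group("nccl")

model = T5ForConditionalGeneration.from_pretrained("t5-small")  # placeholder checkpoint
policy = functools.partial(transformer_auto_wrap_policy, transformer_layer_cls={T5Block})
fsdp_model = FSDP(model, auto_wrap_policy=policy)

# Printing the wrapped model shows that only the T5Block layers become nested
# FSDP units; encoder, decoder, lm_head and the shared embedding stay in the
# root FSDP unit, which keeps the tied embedding weights inside a single unit.
print(fsdp_model)
```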
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.