[QUESTION] Splitting big models over multiple GPUs
When specifying the number of GPUs during inference, is it only used for data parallelism, or is the model loaded piece-wise across multiple GPUs when it is bigger than a single GPU's memory? For example, I'd like to use XCOMET-XXL and our cluster has many 12GB GPUs.
At first I thought that parts of the model would be loaded onto all GPUs, e.g.:
comet-score -s data/xcomet_ennl.src -t data/xcomet_ennl_T1.tgt --gpus 5 --model "Unbabel/XCOMET-XL"
However, I'm getting a GPU OOM error on the first GPU:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 26.00 MiB. GPU 0 has a total capacity of 10.75 GiB of which 11.62 MiB is free. ...
- Is it correct that in the above setting the model is loaded in full on each of the 5 GPUs?
- Is there a way to split the model over multiple GPUs?
Thank you!
- unbabel-comet 2.2.1
- pytorch-lightning 2.2.0.post0
- torch 2.2.1
same question here
Last time I checked, this was not very easy to do with pytorch-lightning.
We actually used a custom FSDP implementation (without pytorch-lightning) to train these larger models. I have to double-check whether the newer pytorch-lightning versions support FSDP better than the version we currently use (2.2.0.post0).
But the short answer is: model parallelism is not something we support in the current codebase.
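For readers who want to experiment on their own, here is a minimal, illustrative sketch of sharding a large encoder with plain PyTorch FSDP, in the spirit of the custom implementation mentioned above. It is not the COMET codebase's implementation; the xlm-roberta-large checkpoint (a stand-in for the much larger XCOMET encoders), the wrapping policy, and the launch command are assumptions chosen for illustration.

```python
# Minimal sketch: shard one large encoder across all local GPUs with FSDP.
# NOT the COMET implementation; model name and wrap policy are illustrative.
# Launch with: torchrun --nproc_per_node=5 fsdp_sketch.py
import functools
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy
from transformers import AutoModel, AutoTokenizer


def main():
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # xlm-roberta-large stands in for the much bigger XCOMET encoder here.
    model = AutoModel.from_pretrained("xlm-roberta-large")
    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")

    # Shard parameters across all ranks; each GPU holds only a slice.
    wrap_policy = functools.partial(size_based_auto_wrap_policy, min_num_params=int(1e7))
    model = FSDP(model, auto_wrap_policy=wrap_policy, device_id=local_rank)
    model.eval()

    batch = tokenizer(["Hello world"], return_tensors="pt").to("cuda")
    with torch.no_grad():
        out = model(**batch)
    if dist.get_rank() == 0:
        print(out.last_hidden_state.shape)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Inference-only sharding of the full comet-score pipeline would still require changes in the codebase itself, as noted above.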
An idea here: CTranslate2 just integrated tensor parallelism. It also supports XLM-RoBERTa, so I'm wondering if we could adapt the converter a bit so that we could run the model within CT2, which is very fast. How different is it from XLM-RoBERTa at inference?
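To make the CT2 idea concrete, here is a rough sketch of the usual CTranslate2 encoder workflow, using xlm-roberta-large as a stand-in for the much larger XCOMET encoders (an assumption for illustration). The exact converter options and the new tensor-parallel launch flags should be double-checked against the current CT2 documentation.

```python
# Rough sketch of the CTranslate2 encoder workflow (not an XCOMET recipe).
#
# 1) Convert the Hugging Face checkpoint to CT2 format:
#      ct2-transformers-converter --model xlm-roberta-large --output_dir xlmr_large_ct2
#
# 2) Run the converted encoder:
import ctranslate2
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("xlm-roberta-large")
encoder = ctranslate2.Encoder("xlmr_large_ct2", device="cuda")

# forward_batch takes batches of token ids and returns the encoder states.
ids = tokenizer(["Hello world!"]).input_ids
output = encoder.forward_batch(ids)
print(output.last_hidden_state.shape)
```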
Does it support XLM-R XL? The architecture also differs from XLM-R.
It seems like they actually improved the documentation a lot: https://lightning.ai/docs/pytorch/stable/advanced/model_parallel/fsdp.html
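For reference, a minimal sketch of what the FSDPStrategy from those docs looks like when plugged into a generic LightningModule. The module below is only a placeholder for a COMET-style regression model and is not wired into comet-score; all names and hyperparameters are illustrative assumptions.

```python
# Minimal sketch of pytorch-lightning's FSDPStrategy with a placeholder
# encoder+regression-head module (not the actual COMET model classes).
import pytorch_lightning as pl
import torch
from pytorch_lightning.strategies import FSDPStrategy
from transformers import AutoModel


class EncoderRegression(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("xlm-roberta-large")
        self.head = torch.nn.Linear(self.encoder.config.hidden_size, 1)

    def training_step(self, batch, batch_idx):
        # batch["inputs"] is assumed to be a dict of tokenized tensors.
        hidden = self.encoder(**batch["inputs"]).last_hidden_state[:, 0]
        return torch.nn.functional.mse_loss(
            self.head(hidden).squeeze(-1), batch["score"]
        )

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=1e-5)


trainer = pl.Trainer(
    accelerator="gpu",
    devices=5,                # shard the model across 5 GPUs
    strategy=FSDPStrategy(),  # parameters are split, not replicated
    precision="bf16-mixed",
)
# trainer.fit(EncoderRegression(), train_dataloaders=...)  # dataloader omitted
```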
> Does it support XLM-R XL? The architecture also differs from XLM-R.
We can adapt it if we have a detailed description somewhere. cc @minhthuc2502