tensor_parallel
Automatically split your PyTorch models on multiple GPUs for training & inference
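For context, everything below revolves around the library's single entry point. A minimal usage sketch, assuming the `tp.tensor_parallel(model, device_ids)` signature shown in the project README (the checkpoint name is illustrative):

```python
import transformers
import tensor_parallel as tp

# Load any HuggingFace model on CPU first...
tokenizer = transformers.AutoTokenizer.from_pretrained("facebook/opt-13b")
model = transformers.AutoModelForCausalLM.from_pretrained("facebook/opt-13b")

# ...then split its weights across two GPUs. forward(), backward()
# and .generate() afterwards run tensor-parallel on both devices.
model = tp.tensor_parallel(model, ["cuda:0", "cuda:1"])
```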
Hi, if my model is multimodal and its `generate` call actually takes different arguments, like this:

```python
generation_output = model_tp.generate(
    pixel_values=pixel_values,
    input_ids=input_ids,
    attention_mask=attention_mask,
    **generation_config,
)
```

it doesn't work. How to...
```
[0] NCCL INFO cudaDriverVersion 11040
[0] NCCL INFO Bootstrap : Using eth0:10.84.253.70
[0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
  File "/usr/local/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context...
```
The inference speed of naive model parallel is much better than tensor parallel. Setup: Llama-30b on 2080Ti 22G x4.

- Naive: 31.64s
- 4-way TP, main branch: 177.78s
- 4-way TP, llama branch: ...
This is roughly a duplicate of the llama config; it adds support for Mixtral models.
It works fine on v1.3.2; however,

```
RuntimeError: Trying to shard a model containing 'meta' parameters. Please set `sharded=False` during model creation and call `.apply_sharding()` only after dispatch
```

occurs...
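The error message itself spells out a two-step workaround. A minimal sketch of that sequence, assuming `sharded` is a keyword argument of `tp.tensor_parallel` and that `.apply_sharding()` is the method the message refers to (the checkpoint name and the dispatch step in between are illustrative):

```python
import transformers
import tensor_parallel as tp

model = transformers.AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder model

# Build the wrapper without sharding, so 'meta' parameters are
# tolerated at construction time.
model_tp = tp.tensor_parallel(model, sharded=False)

# ...dispatch/materialize any 'meta' weights onto real devices here...

# Only then split the parameters across GPUs, as the message suggests.
model_tp.apply_sharding()
```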
I'm trying to use the huggingface Trainer after applying tensor_parallel to the Llama2 7b model, by calling

```python
model = tp.tensor_parallel(model)
```

but I'm getting the following error:

```
ValueError:...
```
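A minimal reproduction sketch of the setup described, assuming the standard `transformers` Trainer API (the checkpoint name and training arguments are illustrative):

```python
import tensor_parallel as tp
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = tp.tensor_parallel(model)  # shard across all visible GPUs

# Handing the wrapped model to Trainer is where the ValueError appears.
trainer = Trainer(model=model, args=TrainingArguments(output_dir="out"))
```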
What should I do if I want to use tensor_parallel with a GPTQ-quantized model ([Llama-2-7b-Chat-GPTQ](https://huggingface.co/4bit/Llama-2-7b-Chat-GPTQ), for example) for inference on 2 or more GPUs? Currently, I am using AutoGPTQ to...
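A sketch of the combination being asked about, assuming AutoGPTQ's `AutoGPTQForCausalLM.from_quantized` loader; whether tensor_parallel can actually shard GPTQ's packed quantized layers is exactly the open question here:

```python
import tensor_parallel as tp
from auto_gptq import AutoGPTQForCausalLM

# Load the quantized checkpoint with AutoGPTQ, keeping it on CPU so
# that the sharding step decides the final device placement.
model = AutoGPTQForCausalLM.from_quantized(
    "4bit/Llama-2-7b-Chat-GPTQ", device="cpu"
)

# Attempt to split it across two GPUs; the default sharding rules may
# not recognize GPTQ's packed int4 linear layers.
model_tp = tp.tensor_parallel(model, ["cuda:0", "cuda:1"])
```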
What amazing work! However, when I tried to run inference with the kosmos model from Hugging Face, there was an error:

```
NotImplementedError: A model class needs to define a `prepare_inputs_for_generation` method in...
```
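The missing method named in the error is part of the `transformers` generation API. A sketch of one possible workaround, attaching a minimal implementation to a loaded `model` instance; the method body below is an assumption and would need to match the model's actual `forward()` signature, which for kosmos likely includes image inputs:

```python
import types

# Hypothetical patch for a loaded `model` that lacks the method.
def prepare_inputs_for_generation(self, input_ids, attention_mask=None, **kwargs):
    # Return the keyword arguments the generation loop should pass
    # to forward() at each decoding step.
    return {"input_ids": input_ids, "attention_mask": attention_mask}

model.prepare_inputs_for_generation = types.MethodType(
    prepare_inputs_for_generation, model
)
```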
I used tensor_parallel to finetune the qwen model with LoRA in a tensor-parallel way. However, it cannot save the model at the end. Can you provide any help? Thanks.
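One pattern from the project README that may apply here: the `save_tensor_parallel` context manager gathers the shards back into an ordinary state dict before saving (assuming that helper exists in the installed version; the file name is illustrative):

```python
import torch
import tensor_parallel as tp

# Inside the context, state_dict() returns non-sharded tensors, so
# the checkpoint can later be reloaded without tensor_parallel.
with tp.save_tensor_parallel(model_tp):
    torch.save(model_tp.state_dict(), "qwen_lora_finetuned.pt")
```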