
[BUG] I have been trying to run DeepSpeed on a 32 GB Tesla V100 GPU

Open AbhayGoyal opened this issue 2 years ago • 11 comments

Describe the bug I have been trying to run DeepSpeed on a 32 GB Tesla V100 GPU, but it still does not work. I tried parallelizing it over 4 GPUs as well, and it shows me a SIGKILL.

To Reproduce Here is the code I ran

import os
import deepspeed
import torch
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))
generator = pipeline('text-generation', model='EleutherAI/gpt-neo-2.7B')

generator.model = deepspeed.init_inference(generator.model, mp_size=world_size, dtype=torch.float, replace_with_kernel_inject=True)

string = generator("DeepSpeed is", do_sample=True, min_length=50)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)

Docker context Are you using a specific docker image that you can share?

Additional context Add any other context about the problem here.

AbhayGoyal avatar May 05 '23 21:05 AbhayGoyal

@AbhayGoyal you need to specify the device in pipeline. If you don't do this, the tokenizer will be on the CPU and the model will be on the GPU, resulting in the following error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

Here is an updated version of your script that should work:

import os
import deepspeed
import torch
from transformers import pipeline

local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))
device = torch.device(f"cuda:{local_rank}")
generator = pipeline("text-generation", model="EleutherAI/gpt-neo-2.7B", device=device)

generator.model = deepspeed.init_inference(
    generator.model,
    mp_size=world_size,
    dtype=torch.float,
    replace_with_kernel_inject=True,
)

string = generator("DeepSpeed is", do_sample=True, min_length=50)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)

mrwyattii avatar May 05 '23 23:05 mrwyattii

Thanks for the reply. I understand what you are saying and will make the changes. But will this also fix the memory problem I am facing?

AbhayGoyal avatar May 05 '23 23:05 AbhayGoyal

I tried the solution you gave. It still gives me the exact same error

AbhayGoyal avatar May 06 '23 03:05 AbhayGoyal

@AbhayGoyal can you please share the error message you are seeing? Is it an Out Of Memory error?

mrwyattii avatar May 08 '23 17:05 mrwyattii

Actually it turns out that if I run it on just 1 GPU, it works well. Let me send the code here

AbhayGoyal avatar May 08 '23 23:05 AbhayGoyal

https://github.com/microsoft/DeepSpeedExamples/blob/8e4ec02c1545f7bd87d3bfe5daaafa5a5f1fe6a6/inference/huggingface/text-generation/inference-test.py

AbhayGoyal avatar May 09 '23 00:05 AbhayGoyal

What are the exact command line arguments you are using to launch the script? If you can run on a single GPU, it should run on multiple GPUs as well. Please ensure you are using --ds_inference and --use_kernel when you run this script!
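
For example, a launch command with both of those flags might look like the following (a sketch using the GPT-Neo model from the original report; flag spellings follow the comments in this thread and may differ between script versions):

deepspeed inference-test.py --name EleutherAI/gpt-neo-2.7B --ds_inference --use_kernel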

mrwyattii avatar May 09 '23 22:05 mrwyattii

I don't think that is the case. I also did not explicitly mention the number of GPUs to be used. Here is the command I used

deepspeed inference_test.py --name EleutherAI/gpt-neo-2.7B --batch_size 10

AbhayGoyal avatar May 09 '23 23:05 AbhayGoyal

@AbhayGoyal I was facing the same issue on a V100. In my case, the process crashed with SIGKILL when I ran out of system RAM. The reason is that the model is first loaded on the CPU and then moved to the GPU by DeepSpeed. So if you run the script with more than one GPU, DeepSpeed loads multiple instances of the model, which may exhaust system memory. Can you check the amount of RAM (system RAM, not GPU RAM) available? You should run the inference script and then monitor the RAM using "free -s2 -g".
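
To make that concrete, here is a rough back-of-envelope check (just a sketch, not part of the DeepSpeed API; it assumes fp32 weights as in the original script and that psutil is installed):

# Rough estimate of peak host (CPU) RAM: each DeepSpeed rank first loads a full
# copy of the model into system memory before it is moved onto its GPU.
import psutil

NUM_PARAMS = 2.7e9        # EleutherAI/gpt-neo-2.7B
BYTES_PER_PARAM = 4       # fp32, matching dtype=torch.float in the original script
num_ranks = 4             # one process per GPU passed to the deepspeed launcher

peak_host_gb = NUM_PARAMS * BYTES_PER_PARAM * num_ranks / 1024**3
avail_gb = psutil.virtual_memory().available / 1024**3

print(f"Estimated peak host RAM: ~{peak_host_gb:.0f} GB, currently available: ~{avail_gb:.0f} GB")
if peak_host_gb > avail_gb:
    print("Expect the OS OOM killer to SIGKILL the ranks; use fewer GPUs or meta-tensor loading.")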

karandua2016 avatar May 19 '23 07:05 karandua2016

Thanks. You are correct. I did that. So instead of using multiple GPUs, I just used 1, just to make things simpler.

AbhayGoyal avatar May 19 '23 12:05 AbhayGoyal

Hi all, I'm facing the same issue here. Was wondering whether anyone has any ideas what might be causing this.

I'm trying to run inference on a model that needs a minimum of 2 A100 GPUs, using

/opt/conda/bin/deepspeed /root/DeepSpeedExamples/inference/huggingface/text-generation/inference-test.py --num_gpus 2 --name huggyllama/llama-65b

and am getting the SIGKILL error:

[2023-06-15 15:32:36,151] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 18390
[2023-06-15 15:32:43,064] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 18391

even though, in theory, the model should fit on 2 A100 GPUs and generate results using DeepSpeed.

KMFODA avatar Jun 15 '23 15:06 KMFODA

Same issue on 8 * A100, mark.

abmybgx avatar Aug 15 '23 07:08 abmybgx

Hi, I have encountered the same error on 8*H800 GPUs. Is there any solution for this?

zzkcaesar avatar Aug 25 '23 06:08 zzkcaesar

Same error with 4*RTXA5000 GPU.

egesko avatar Aug 26 '23 11:08 egesko

Hi All, we have recently made some updates that affect this issue. Please install the latest DeepSpeed and use the latest scripts from https://github.com/microsoft/DeepSpeedExamples/blob/master/inference/huggingface/text-generation/inference-test.py

You can now load models using meta tensors to avoid using all the system memory and causing these errors. This works for most models when using Auto Tensor Parallelism (i.e., when not using --use_kernel) and it works for GPT-NEO, BLOOM, OPT, and GPT-J models when using kernel injection (i.e., when using --use_kernel):

deepspeed --num_gpus 2 inference-test.py --model huggyllama/llama-65b --use_meta_tensor

mrwyattii avatar Sep 20 '23 16:09 mrwyattii