
[BUG] I have been trying to run DeepSpeed on a 32 GB Tesla V100 GPU

Open AbhayGoyal opened this issue 2 years ago • 11 comments

Describe the bug I have been trying to run DeepSpeed on a 32 GB Tesla V100 GPU, but it still does not work. I tried parallelizing it over 4 GPUs as well, and it shows me a SIGKILL.

To Reproduce Here is the code I ran

import os
import deepspeed
import torch
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))
generator = pipeline('text-generation', model='EleutherAI/gpt-neo-2.7B')

generator.model = deepspeed.init_inference(generator.model, mp_size=world_size, dtype=torch.float, replace_with_kernel_inject=True)

string = generator("DeepSpeed is", do_sample=True, min_length=50)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)

Docker context Are you using a specific docker image that you can share?

Additional context Add any other context about the problem here.

AbhayGoyal avatar May 05 '23 21:05 AbhayGoyal

@AbhayGoyal you need to specify the device in pipeline. If you don't do this, the tokenizer will be on the CPU and the model will be on the GPU, resulting in the following error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

Here is an updated version of your script that should work:

import os
import deepspeed
import torch
from transformers import pipeline

local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))
device = torch.device(f"cuda:{local_rank}")
generator = pipeline("text-generation", model="EleutherAI/gpt-neo-2.7B", device=device)

generator.model = deepspeed.init_inference(
    generator.model,
    mp_size=world_size,
    dtype=torch.float,
    replace_with_kernel_inject=True,
)

string = generator("DeepSpeed is", do_sample=True, min_length=50)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)

mrwyattii avatar May 05 '23 23:05 mrwyattii

Thanks for the reply. I understand what you are saying and will make the changes. But will this also fix the memory problem I am facing?

AbhayGoyal avatar May 05 '23 23:05 AbhayGoyal

I tried the solution you gave. It still gives me the exact same error

AbhayGoyal avatar May 06 '23 03:05 AbhayGoyal

@AbhayGoyal can you please share the error message you are seeing? Is it an Out Of Memory error?

mrwyattii avatar May 08 '23 17:05 mrwyattii

Actually it turns out that if I run it on just 1 GPU, it works well. Let me send the code here

AbhayGoyal avatar May 08 '23 23:05 AbhayGoyal

https://github.com/microsoft/DeepSpeedExamples/blob/8e4ec02c1545f7bd87d3bfe5daaafa5a5f1fe6a6/inference/huggingface/text-generation/inference-test.py

AbhayGoyal avatar May 09 '23 00:05 AbhayGoyal

What are the exact command line arguments you are using to launch the script? If you can run on a single GPU, it should run on multiple GPUs as well. Please ensure you are using --ds_inference and --use_kernel when you run this script!
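
For example, a launch command with both of those flags might look like the following (a sketch using the GPT-Neo model from the original report; flag spellings follow the comments in this thread and may differ between script versions):

deepspeed inference-test.py --name EleutherAI/gpt-neo-2.7B --ds_inference --use_kernel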

mrwyattii avatar May 09 '23 22:05 mrwyattii

I don't think that is the case. I also did not explicitly mention the number of GPUs to be used. Here is the command I used

deepspeed inference_test.py --name EleutherAI/gpt-neo-2.7B --batch_size 10

AbhayGoyal avatar May 09 '23 23:05 AbhayGoyal

@AbhayGoyal I was facing the same issue on a V100. In my case, the process crashed with SIGKILL when I ran out of system RAM. The reason is that the model is first loaded on the CPU and then moved to the GPU by DeepSpeed. So if you run the script with more than one GPU, DeepSpeed loads multiple instances of the model, which may exhaust system memory. Can you check the amount of RAM (system RAM, not GPU RAM) available? You should run the inference script and then monitor the RAM using "free -s2 -g".
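
To make that concrete, here is a rough back-of-envelope check (just a sketch, not part of the DeepSpeed API; it assumes fp32 weights as in the original script and that psutil is installed):

# Rough estimate of peak host (CPU) RAM: each DeepSpeed rank first loads a full
# copy of the model into system memory before it is moved onto its GPU.
import psutil

NUM_PARAMS = 2.7e9        # EleutherAI/gpt-neo-2.7B
BYTES_PER_PARAM = 4       # fp32, matching dtype=torch.float in the original script
num_ranks = 4             # one process per GPU passed to the deepspeed launcher

peak_host_gb = NUM_PARAMS * BYTES_PER_PARAM * num_ranks / 1024**3
avail_gb = psutil.virtual_memory().available / 1024**3

print(f"Estimated peak host RAM: ~{peak_host_gb:.0f} GB, currently available: ~{avail_gb:.0f} GB")
if peak_host_gb > avail_gb:
    print("Expect the OS OOM killer to SIGKILL the ranks; use fewer GPUs or meta-tensor loading.")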

karandua2016 avatar May 19 '23 07:05 karandua2016

Thanks. You are correct. I did that. So instead of using multiple GPUs, I just used 1, just to make things simpler.

AbhayGoyal avatar May 19 '23 12:05 AbhayGoyal

Hi all, I'm facing the same issue here. Was wondering whether anyone has any ideas what might be causing this.

I'm trying to run inference on a model that needs a minimum of 2 A100 GPUs, using

/opt/conda/bin/deepspeed /root/DeepSpeedExamples/inference/huggingface/text-generation/inference-test.py --num_gpus 2 --name huggyllama/llama-65b

and am getting the SIGKILL error:

[2023-06-15 15:32:36,151] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 18390
[2023-06-15 15:32:43,064] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 18391

even though, in theory, the model should fit on 2 A100 GPUs and generate results using DeepSpeed.

KMFODA avatar Jun 15 '23 15:06 KMFODA

Same issue on 8 * A100, mark.

abmybgx avatar Aug 15 '23 07:08 abmybgx

Hi, I have encountered the same error on 8*H800 GPUs. Is there any solution for this?

zzkcaesar avatar Aug 25 '23 06:08 zzkcaesar

Same error with 4*RTXA5000 GPU.

egesko avatar Aug 26 '23 11:08 egesko

Hi All, we have recently made some updates that affect this issue. Please install the latest DeepSpeed and use the latest scripts from https://github.com/microsoft/DeepSpeedExamples/blob/master/inference/huggingface/text-generation/inference-test.py

You can now load models using meta tensors to avoid using all the system memory and causing these errors. This works for most models when using Auto Tensor Parallelism (i.e., when not using --use_kernel) and it works for GPT-NEO, BLOOM, OPT, and GPT-J models when using kernel injection (i.e., when using --use_kernel):

deepspeed --num_gpus 2 inference-test.py --model huggyllama/llama-65b --use_meta_tensor

mrwyattii avatar Sep 20 '23 16:09 mrwyattii