
Fix llama meta tensor loading, model tensor parallelism inference

Open zeyugao opened this issue 2 years ago • 14 comments

LLaMA uses a non-conventional LayerNorm (RMSNorm), which is not handled when loading with meta tensors, so loading fails with NotImplementedError: Cannot copy out of meta tensor; no data!
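
For context, the error itself is just what happens whenever data is copied out of a meta tensor that was never materialized, e.g. this minimal sketch (not LLaMA-specific code):

import torch

# A meta tensor carries shape and dtype but no storage, so there is nothing to copy.
w = torch.empty(4096, device="meta")
try:
    w.to("cpu")
except NotImplementedError as e:
    print(e)  # Cannot copy out of meta tensor; no data!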

zeyugao avatar May 25 '23 14:05 zeyugao

@microsoft-github-policy-service agree

zeyugao avatar May 30 '23 16:05 zeyugao

The second commit should fix https://github.com/microsoft/DeepSpeed/issues/3452

zeyugao avatar May 30 '23 16:05 zeyugao

I can still reproduce the error in https://github.com/microsoft/DeepSpeed/issues/3452#issuecomment-1536912461 with this PR.

Yard1 avatar May 31 '23 20:05 Yard1

I can still reproduce the error in #3452 (comment) with this PR.

I made some modifications to your code so it runs on my machine. I am also using V100s. It works fine with mp_size=2 or 4. Can you try this code?

if True:
    import sys
    import os
    # New deepspeed path
    sys.path.insert(0, '/Code/DeepSpeed')
    import torch
    import deepspeed
    from transformers import LlamaForCausalLM, LlamaTokenizer
    import argparse

# here
deepspeed.init_distributed()

local_rank = int(os.environ.get('LOCAL_RANK', 0))
world_size = int(os.environ.get('WORLD_SIZE', 1))

print(f'local_rank: {local_rank}, world_size: {world_size}')

tokenizer = LlamaTokenizer.from_pretrained('./llama_7B_hf/')
model = LlamaForCausalLM.from_pretrained('./llama_7B_hf/')
model = deepspeed.init_inference(
    model,
    mp_size=world_size,
    dtype=torch.half,
    replace_with_kernel_inject=True
)

batch = tokenizer(
    "The primary use of LLaMA is research on large language models, including",
    return_tensors="pt",
    add_special_tokens=False
)
# here
batch = {k: v.cuda(local_rank) for k, v in batch.items()}
generated = model.generate(batch["input_ids"], max_length=100)
print(tokenizer.decode(generated[0]))

It yields

The primary use of LLaMA is research on large language models, including the BERT model.

\subsection{Learning Language Models}

LLaMA is a tool for training large language models. It is designed to be used with the BERT model, but it can also be used with other large language models.

(Edit: Some issues still exist when mp_size>2 or when padding='right')

zeyugao avatar Jun 01 '23 02:06 zeyugao

@zeyugao Thank you, it works! I must have messed something up with the installation.

Yard1 avatar Jun 01 '23 18:06 Yard1

Hi, I am trying to use your PR to run LLaMA-65B. How should I do this? Directly using LlamaForCausalLM.from_pretrained and launching with deepspeed --num_gpus 8 consumes a lot of RAM, yet meta tensors are not supported for LLaMA.

lyy1994 avatar Jun 08 '23 11:06 lyy1994

@lyy1994 You can refer to https://github.com/microsoft/DeepSpeedExamples/blob/master/inference/huggingface/text-generation/inference-test.py#L24 for how to use meta tensors; check how that example uses the variable use_meta_tensor. I'm not sure whether it is compatible with kernel injection, though.
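
Roughly, the meta tensor path in that example boils down to something like this sketch (model_path, world_size, and checkpoints_json are placeholders, not the example's exact variable names):

import torch
import deepspeed
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained(model_path)
# Build the model on the meta device: no real weights are materialized yet.
with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
    model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.float16)

# init_inference then loads the real weights from the shards listed in a
# checkpoints json and partitions them across mp_size ranks.
model = deepspeed.init_inference(
    model,
    mp_size=world_size,
    dtype=torch.float16,
    base_dir=model_path,
    checkpoint=checkpoints_json,
)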

zeyugao avatar Jun 08 '23 14:06 zeyugao

@lyy1994 You can refer to https://github.com/microsoft/DeepSpeedExamples/blob/master/inference/huggingface/text-generation/inference-test.py#L24 for how to use meta tensors; check how that example uses the variable use_meta_tensor. I'm not sure whether it is compatible with kernel injection, though.

Thanks for your suggestions! I tried the code you pointed to, and it raises the following error:

AssertionError: Meta tensors are not supported for this model currently.

lyy1994 avatar Jun 08 '23 14:06 lyy1994

@lyy1994 You can refer to https://github.com/microsoft/DeepSpeedExamples/blob/master/inference/huggingface/text-generation/inference-test.py#L24 for how to use meta tensors; check how that example uses the variable use_meta_tensor. I'm not sure whether it is compatible with kernel injection, though.

Thanks for your suggestions! I tried the code you pointed to, and it raises the following error:

AssertionError: Meta tensors are not supported for this model currently.

Sorry, I made a mistake in running this script before. The command I used is:

deepspeed --num_gpus 2 inference-test.py --checkpoint_path ./LLaMA-7B-Official --batch_size 2 --name ./LLaMA-7B-Official --use_meta_tensor

and it still gives the following error, even when using your PR:

NotImplementedError: Cannot copy out of meta tensor; no data!

lyy1994 avatar Jun 08 '23 15:06 lyy1994

(Edit: Some issues still exist when mp_size>2 or when padding='right')

I tried running this with mp_size 4, and it worked for me. As far as I can tell, this fixes https://github.com/microsoft/DeepSpeed/issues/3628 as intended. Great work!

@RezaYazdaniAminabadi Could you look at merging this? The gated MLP intermediate weight sharding is broken in 0.9.3 / master, and this fixes it (the fused weight needs a strided copy like the QKV GEMM weight, since the two weight matrices are glued into one).
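
For intuition, a toy sketch (not DeepSpeed's actual code) of why a plain split is wrong for the fused gate/up weight under tensor parallelism, assuming for the toy that the two projections are stacked along the output dimension:

import torch

gate = torch.arange(0, 8.).reshape(4, 2)   # toy gate_proj weight (4 output rows)
up   = torch.arange(8, 16.).reshape(4, 2)  # toy up_proj weight
fused = torch.cat([gate, up], dim=0)       # fused intermediate weight, rows = [gate; up]

# Naive TP split over rows: rank 0 gets all of gate, rank 1 gets all of up -- wrong.
naive = fused.chunk(2, dim=0)

# Strided split (like the fused QKV weight): split gate and up separately,
# then re-fuse the matching halves on each rank.
strided = [torch.cat(parts, dim=0)
           for parts in zip(gate.chunk(2, dim=0), up.chunk(2, dim=0))]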

davidthomas426 avatar Jun 08 '23 18:06 davidthomas426

I think this PR does not completely fix LLaMA inference, and the remaining problem does not involve tensor parallelism. During testing, I found that when the input context exceeds a certain length (e.g., longer than 768, 1024, or 1536 tokens), the kernel-injected LLaMA produces incorrect results. In particular, while the model correctly predicts the first output tokens (with do_sample=False) and matches the model without kernel injection, subsequent tokens are completely wrong. This problem persists even when tensor parallelism is disabled (tp_size=1 or without specifying the relevant parameters).

The exact cause remains uncertain; possibilities include KV cache errors or loss of computation precision. The fact that shorter inputs generate accurate results complicates the issue, and since I lack debugging experience with CUDA kernels, I don't have much idea of how to debug it.
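
For illustration, the comparison amounts to something like this sketch (hf_model, ds_model, and long_input_ids are placeholder names, not code from the PR):

ref = hf_model.generate(long_input_ids, max_new_tokens=32, do_sample=False)
out = ds_model.generate(long_input_ids, max_new_tokens=32, do_sample=False)
# With long prompts, only the first generated token(s) match; later tokens diverge.
print(ref[0, long_input_ids.shape[1]:].tolist())
print(out[0, long_input_ids.shape[1]:].tolist())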

zeyugao avatar Jun 10 '23 03:06 zeyugao

@lyy1994 I don't observe this issue with my PR (without kernel injection enabled). Did you install the fixed deepspeed correctly? You can use PYTHONPATH to force Python to use the fixed code. For example, I use the following command:

PYTHONPATH=/Code/DeepSpeed deepspeed --include=localhost:1,2 inference-test.py --name /mnt/data/llama_7B_hf --checkpoint_path /mnt/data/llama_7B_hf --ds_inference --use_meta_tensor

zeyugao avatar Jun 12 '23 08:06 zeyugao

I think it is because of the max_out_tokens parameter in init_inference. Enlarge it when needed, so I think this PR is complete.
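
For example, something like the following (a sketch; the right value depends on your prompt length plus max_new_tokens):

model = deepspeed.init_inference(
    model,
    mp_size=world_size,
    dtype=torch.half,
    replace_with_kernel_inject=True,
    max_out_tokens=2048,  # the inference workspace/KV cache is sized for this many total tokens
)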

Also, I added some commits that fix the meta tensor loading when kernel injection is enabled.

zeyugao avatar Jun 12 '23 10:06 zeyugao

I tried it out using vicuna-13b and 8xT4. With one query (bs=1) it looks good, but when I try to send a batch it crashes (sometimes).

thies1006 avatar Jun 14 '23 10:06 thies1006

Any updates on this PR? I saw #3788 fixed the gated MLP, but it still did not work with meta tensors.

chhzh123 avatar Jun 26 '23 21:06 chhzh123

@zeyugao Hi, I am trying to use your PR to run inference for LLaMA-7B and 65B. It worked well for LLaMA-7B; however, when I use LLaMA-65B, I get this error: "KeyError: 'model.layers.53.mlp.gate_proj.weight'".

I have checked the checkpoint files, and the key 'model.layers.53.mlp.gate_proj.weight' does exist.

Have you encountered similar issues?

Here is my code snippet:

import os
import torch
import deepspeed
from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM

world_size = int(os.getenv('WORLD_SIZE', '1'))
# dtype, repo_root (the checkpoint dir) and checkpoints_json are set earlier in my script (omitted here)

model_name = 'huggyllama/llama-65b'
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
config = AutoConfig.from_pretrained(model_name)

with deepspeed.OnDevice(dtype=dtype, device="meta"):
    model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.float16)

model = model.eval()
model = deepspeed.init_inference(
    model,
    mp_size=world_size,
    base_dir=repo_root,
    dtype=torch.float16,
    checkpoint=checkpoints_json,
    replace_with_kernel_inject=True,
)

ganyk avatar Jul 22 '23 15:07 ganyk

@ganyk I have encountered this problem before. In brief, the root of the problem surfaces during tensor loading here: https://github.com/microsoft/DeepSpeed/blob/23a11a39510e2aefb48236e3d2672a7dcbfc42a3/deepspeed/module_inject/containers/llama.py#L101-L102

In this function, the loader tries to put both mlp.up_proj.weight and mlp.gate_proj.weight into inter_w for every layer. In practice, this means trying to load mlp.gate_proj.weight from the same safetensors (or pickle) file where mlp.up_proj.weight resides. However, for the llama 30B and 65B models saved by the huggingface script, these two tensors are not stored in the same file for some layers (pytorch_model-00010-of-00014.bin vs pytorch_model-00009-of-00014.bin), which leads to the KeyError.

pytorch_model.bin.index.json:

    "model.layers.53.input_layernorm.weight": "pytorch_model-00010-of-00014.bin",
    "model.layers.53.mlp.down_proj.weight": "pytorch_model-00009-of-00014.bin",
    "model.layers.53.mlp.gate_proj.weight": "pytorch_model-00009-of-00014.bin",
    "model.layers.53.mlp.up_proj.weight": "pytorch_model-00010-of-00014.bin",
    "model.layers.53.post_attention_layernorm.weight": "pytorch_model-00010-of-00014.bin",
    "model.layers.53.self_attn.k_proj.weight": "pytorch_model-00009-of-00014.bin",
    "model.layers.53.self_attn.o_proj.weight": "pytorch_model-00009-of-00014.bin",
    "model.layers.53.self_attn.q_proj.weight": "pytorch_model-00009-of-00014.bin",
    "model.layers.53.self_attn.rotary_emb.inv_freq": "pytorch_model-00009-of-00014.bin",
    "model.layers.53.self_attn.v_proj.weight": "pytorch_model-00009-of-00014.bin",

I don't have a deep enough understanding of these lower-level load operations, so I don't have a good way to initialize only part of the parameters, i.e. copying only mlp.up_proj.weight into inter_w first and then filling in the other part of inter_w once the corresponding mlp.gate_proj.weight is found.

My previous solution was to repackage the model files so that every layer's parameters are guaranteed to appear in the same file. However, I can no longer find that script. I would appreciate it if you could write one and share it; if I have time in the next few days, I might try to rewrite it.

zeyugao avatar Jul 22 '23 15:07 zeyugao

@ganyk Here is a sample repack script. It appears to be correct for 7B (I only checked the keys in the output files; I did not run inference with them).

The first argument is the original model directory and the second is the output directory. After making a backup, you can move the files from the output directory into the original model directory to replace the corresponding files. Note that this loads the full model into memory, so you need enough RAM for that. Also, this handles pickle (.bin) checkpoints, not safetensors.

import re
import argparse
import json
import pathlib
import torch
from collections import defaultdict

parser = argparse.ArgumentParser()
parser.add_argument('hf_model')
parser.add_argument('output_dir')

args = parser.parse_args()
hf_model = pathlib.Path(args.hf_model)
output_dir = pathlib.Path(args.output_dir)

layer_pattern = re.compile(r'model\.layers\.(\d+)\.')

with open(hf_model / 'pytorch_model.bin.index.json', 'r') as f:
    pytorch_model_index = json.load(f)

# Split the index into per-layer weights and everything else (embeddings, final norm, lm_head).
layer_weight = defaultdict(list)
extra_weight = []
files = set()

for weight_name, file in pytorch_model_index['weight_map'].items():
    print(weight_name, file)

    layer_match = layer_pattern.search(weight_name)
    if layer_match:
        layer = int(layer_match.group(1))
        layer_weight[layer].append([weight_name, file])
    else:
        extra_weight.append([weight_name, file])
    files.add(file)

files = list(files)
layer_weight = dict(layer_weight)

file_content = {}


def get_or_load(file):
    if file in file_content:
        return file_content[file]
    else:
        print('loading', file)
        file_content[file] = torch.load(hf_model / file)
        return file_content[file]


new_weight_map = {}
new_file_content = defaultdict(dict)


def add_to_file(file, weight_name, weight):
    new_file_content[file][weight_name] = weight
    new_weight_map[weight_name] = file


# Non-layer weights (embeddings, final norm, lm_head) all go into the first file.
for idx, (weight_name, file) in enumerate(extra_weight):
    add_to_file(files[0], weight_name, get_or_load(file)[weight_name])

# Assign every weight of a layer to the same output file (round-robin over the original
# file names), so tensors that are fused at load time (e.g. gate_proj and up_proj)
# always end up in one shard.
for layer, weights in layer_weight.items():
    for weight_name, file in weights:
        add_to_file(files[layer % len(files)], weight_name, get_or_load(file)[weight_name])

pytorch_model_index['weight_map'] = new_weight_map
with open(output_dir / 'pytorch_model.bin.index.json', 'w') as f:
    json.dump(pytorch_model_index, f)

for file, content in new_file_content.items():
    torch.save(content, output_dir / file)
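
For example, assuming the script above is saved as repack_llama.py (the name is arbitrary) and the output directory already exists:

python repack_llama.py /path/to/llama-65b-hf /path/to/llama-65b-repacked

Note that the script only rewrites the .bin shards and pytorch_model.bin.index.json; the tokenizer and config files stay in the original directory.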

zeyugao avatar Jul 23 '23 02:07 zeyugao

@zeyugao Thanks for the repack script, I have successfully run inference for llama-65B after repacking the checkpoint files.

ganyk avatar Jul 23 '23 09:07 ganyk

@RezaYazdaniAminabadi @jeffra @awan-10 Could you please take a look at this PR? https://github.com/microsoft/DeepSpeed/pull/3914 and https://github.com/microsoft/DeepSpeed/pull/3788 are doing something similar. This PR also adds meta tensor support for both AutoTP and kernel injection, provided the model is repacked properly as described in https://github.com/microsoft/DeepSpeed/pull/3608#issuecomment-1646727688

zeyugao avatar Jul 23 '23 09:07 zeyugao

@zeyugao Hey, thank you for this PR. However, when I use this PR, the result is different from the huggingface pipeline-parallel result. The code is below:

#####deepspeed meta tensor#####
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
from transformers import LlamaForCausalLM, LlamaTokenizer
import transformers
import os
import torch.distributed as dist
import io
import json
from pathlib import Path

local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))

deepspeed.init_distributed()
rank = dist.get_rank()
print(f"========Current rank:{rank} || local rank:{local_rank} || world size:{world_size}==========")

model_path = 'xx/xx/llama65b'

config = AutoConfig.from_pretrained(model_path)
checkpoints_json = model_path + "checkpoints.json"
if rank == 0:
    with io.open(checkpoints_json, "w", encoding="utf-8") as f:
        file_list = [str(entry) for entry in Path(model_path).rglob("*.[bp][it][n]") if entry.is_file()]
        data = {"type": "BLOOM", "checkpoints": file_list, "version": 1.0}
        json.dump(data, f)
dist.barrier()

tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
    model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.float16)

model = model.eval()
ds_model = deepspeed.init_inference(
    model,
    mp_size=world_size,
    dtype=torch.float16,
    base_dir=model_path,
    replace_with_kernel_inject=True,
    checkpoint=checkpoints_json,
)

prompt = "Where is Hawaii?"

encoding = tokenizer(prompt, return_tensors="pt")
generation_config = transformers.GenerationConfig(
    temperature=0.0,
    top_k=20,
    repetition_penalty=1.2,
)

input_ids = encoding["input_ids"].to(model.device)
result = ds_model.generate(
    input_ids=input_ids,
    generation_config=generation_config,
    return_dict_in_generate=False,
    output_scores=False,
    max_new_tokens=512,
)

output = tokenizer.decode(result[0][len(input_ids[0]):])
if rank == 0:
    print(output)

#####huggingface inference#####
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.models.llama import LlamaTokenizer
import transformers
import os
import torch.distributed as dist

model_path = 'xx/xx/llama65b'

tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, device_map='auto')

prompt = "Where is Hawaii?"
encoding = tokenizer(prompt, return_tensors="pt")
generation_config = transformers.GenerationConfig(
    temperature=0.0,
    top_k=20,
    repetition_penalty=1.2,
)

input_ids = encoding["input_ids"].to(model.device)
with torch.inference_mode():
    result = model.generate(
        input_ids=input_ids,
        generation_config=generation_config,
        return_dict_in_generate=False,
        output_scores=False,
        max_new_tokens=512,
    )

output = tokenizer.decode(result[0][len(input_ids[0]):])

print(output)
######################

The two results are different, but when I test llama-7b without meta tensors, the results are the same. Any clues?

xs1997zju avatar Jul 31 '23 06:07 xs1997zju

@zeyugao The result from loading the model with AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, device_map='auto') and running inference is different from the result when using this PR.

xs1997zju avatar Jul 31 '23 06:07 xs1997zju

When I test llama-7b without the meta tensor loading method, the huggingface inference result and the deepspeed tensor-parallel result are the same. So I think there might be something to check in this PR.

xs1997zju avatar Jul 31 '23 07:07 xs1997zju

@xs1997zju Same issue here. @zeyugao, any idea what might be causing these differences?

ganyk avatar Jul 31 '23 08:07 ganyk

@ganyk @xs1997zju I don't have much of an idea for now. In my testing, I collected five results using different methods.

65B models with mp_size=4

If I set up the script correctly, both the meta tensor path and kernel injection work as expected individually. However, when I combine them, the output changes. Even then, the model doesn't produce garbage; on the contrary, the output is fairly meaningful, which suggests the model's weights are, overall, being loaded correctly. Some residual weights might be influencing the outcome.

Full outputs:

huggingface

Hawaiian Islands Map
What are the popular places to visit in Hawaii?
Facts about Hawaii
Major Cities of Hawaii
Flag of Hawaii State
Airports in Hawaii
Latitude and Longitude of Hawaii
Tourist Attractions in Hawaii
Infographic on Hawaii
The state of Hawaii, located at 21.5° N latitude and 160° W longitude, comprises a group of islands situated in the central Pacific Ocean. The archipelago consists of eight major islands namely Oahu (the most populous), Maui, Kaua'I, Moloka'I, Lana'I, Ni'ihau, Kaula and Lehua. These islands were formed by volcanic activity that occurred millions of years ago. They lie along an underwater mountain range known as the Hawaiian-Emperor seamount chain. This chain extends from the island of Hawaii up to the Kamchatka Peninsula in Russia. It was created due to movement of tectonic plates over a hot spot beneath the Earth’s surface.
Official Name: State of Hawaii
Capital City: Honolulu
Largest city: Hilo
Area Ranking: 43rd largest US state
Population Ranking: 8th least populated US state
Demonym: Hawaiians or Hawai'i residents
Highest point: Mauna Kea - 13796 feet above sea level
Admission to Union: August 21st, 1959 (50th)
Number of counties: Five Counties
Governor: David Ige
Senators: Mazie Hirono & Brian Schatz
Congressional Delegation: Tulsi Gabbard, Colleen Hanabusa, Mark Takai, Donna Mercado Kim, Sam Graves, John Moolenaar, Tim Walz, Rick Larsen, Suzan DelBene, Denny Heck, Derek Kilmer, Adam Smith, Jaime Herrera Beutler, Dan Newhouse, Dave Reichert, Peter DeFazio, Kurt Schrader, Greg Walden, Earl Blumenauer, Suzanne Bonamici, Mike Simpson, Raul Labrador, Russ Fulcher, Jim McDermott, Pramila Jay

no meta tensor with kernel injected

Hawaiian Islands Map
What are the popular places to visit in Hawaii?
Facts about Hawaii
Major Cities of Hawaii
Flag of Hawaii State
Hawaii Lat Long Map
Where is Honolulu, HI
Airports in Hawaii
University Of Hawai'i At Manoa Campus Map
Description: The map showing location of Hawaii state on USA map. Disclaimer
The 50th and most recent US state admitted into the Union, Hawaii was annexed by the United States as a territory in 1897 after being ruled for decades by various European powers. It became a state officially in August 21st, 1959.
It consists of eight major islands (Niihau, Kauai, Oahu, Maui, Molokai, Lanai, Kahoolawe, and the Big Island) that stretch over 36 million acres of land area. Its capital city is Honolulu which also serves as its largest city with more than half of the population living there. Other important cities include Pearl City, Waipahu, Mililani Town, East Honolulu, Ewa Gentry, Kapolei, Kihei, Makakilo, Schofield Barracks, Wailuku, Aiea, Halawa Heights, Nanakuli, and many others.
Geographically speaking, it lies between latitudes 18°4′ N and 28°27′ N and longitudes 154°40′W and 178°22′W. It has a total coastline length of 750 miles or 1,210 kilometers.
Aside from the main island chain, it also includes several smaller islands such as Kaula, Lehua, Nihoa, Necker, Gardner Pinnacles, Maro Reef, French Frigate Shoals, Lisianski Island, Laysan Island, Southeast Island, Pearl & Hermes Reef, Midway Atoll, Kure Atoll, Ocean Island, and many other small atolls and reefs.
In terms of climate, it experiences tropical weather conditions all year round due to its proximity to the equator. Temperatures range from lows of around 68

meta tensor with kernel injected

Hawaii is a state of the United States. It has 5 counties and it's capital is Honolulu.

no meta tensor without kernel injected

Hawaiian Islands Map
What are the popular places to visit in Hawaii?
Facts about Hawaii
Major Cities of Hawaii
Flag of Hawaii State
Hawaii Lat Long Map
Where is Honolulu, HI
Airports in Hawaii
University Of Hawai'i At Manoa Campus Map
Description: The map showing location of Hawaii state on USA map. Disclaimer
The 50th and most recent US state admitted into the Union, Hawaii was annexed by the United States as a territory in 1897 after being ruled for decades by various European powers. It became a state officially in August 21st, 1959.
It consists of eight major islands (Niihau, Kauai, Oahu, Maui, Molokai, Lanai, Kahoolawe, and the Big Island) that stretch over 36 million acres of land area. Its capital city is Honolulu which also serves as its largest city with more than half of the population living there. Other important cities include Pearl City, Waipahu, Mililani Town, East Honolulu, Ewa Gentry, Kapolei, Kihei, Makakilo, Schofield Barracks, Wailuku, Aiea, Halawa Heights, Nanakuli, and many others.
Geographically speaking, it lies between latitudes 18°4′ N and 28°27′ N and longitudes 154°40′W and 178°22′W. It has a total coastline length of 750 miles or 1,210 kilometers.
Aside from the main island chain, it also includes several smaller islands such as Kaula, Lehua, Nihoa, Necker, Gardner Pinnacles, Maro Reef, French Frigate Shoals, Lisianski Island, Laysan Island, Southeast Island, Pearl & Hermes Reef, Midway Atoll, Kure Atoll, Ocean Island, and many other small atolls and reefs.
In terms of climate, it experiences tropical weather conditions all year round due to its proximity to the equator. Temperatures range from lows of around 68

meta tensor without kernel injected

Hawaiian Islands Map
What are the popular places to visit in Hawaii?
Facts about Hawaii
Major Cities of Hawaii
Flag of Hawaii State
Hawaii Lat Long Map
Where is Honolulu, HI
Airports in Hawaii
University Of Hawai'i At Manoa Campus Map
Description: The map showing location of Hawaii state on USA map. Disclaimer
The 50th and most recent US state admitted into the Union, Hawaii was annexed by the United States as a territory in 1897 after being ruled for decades by various European powers. It became a state officially in August 21st, 1959.
It consists of eight major islands (Niihau, Kauai, Oahu, Maui, Molokai, Lanai, Kahoolawe, and the Big Island) that stretch over 36 million acres of land area. Its capital city is Honolulu which also serves as its largest city with more than half of the population living there. Other important cities include Pearl City, Waipahu, Mililani Town, East Honolulu, Ewa Gentry, Kapolei, Kihei, Makakilo, Schofield Barracks, Wailuku, Aiea, Halawa Heights, Nanakuli, and many others.
Geographically speaking, it lies between latitudes 18°4′ N and 28°27′ N and longitudes 154°40′W and 178°22′W. It has a total coastline length of 750 miles or 1,210 kilometers.
Aside from the main island chain, it also includes several smaller islands such as Kaula, Lehua, Nihoa, Necker, Gardner Pinnacles, Maro Reef, French Frigate Shoals, Lisianski Island, Laysan Island, Southeast Island, Pearl & Hermes Reef, Midway Atoll, Kure Atoll, Ocean Island, and many other small atolls and reefs.
In terms of climate, it experiences tropical weather conditions all year round due to its proximity to the equator. Temperatures range from lows of around 68

zeyugao avatar Aug 01 '23 01:08 zeyugao

Thanks for the reply and experiments. But the "no meta tensor without kernel injected" result seems weird; it should be the same as huggingface. Have you tried llama-7b? For llama-7b, those two results are the same.

xs1997zju avatar Aug 01 '23 02:08 xs1997zju

@xs1997zju In "no meta tensor without kernel injected", the auto tensor parallel is enabled. Meanwhile the huggingface method is a bare pipeline.

Also, in the first few lines the result from deepspeed aligns with huggingface; the divergence may be due to precision differences between the implementations.

zeyugao avatar Aug 01 '23 02:08 zeyugao

Anyway, for now I just use mp=4 without meta tensors to load the 65B model. I have to load the model into CPU RAM on every rank and then call deepspeed.init_inference, and the final result is the same as huggingface. My machine does not have enough RAM to load 8 copies of the model, one per rank, QAQ.
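
For reference, a minimal sketch of that non-meta path (model_path is a placeholder; the kernel injection flag matches my earlier script):

import torch
import deepspeed
from transformers import AutoModelForCausalLM

# Every rank materializes the full fp16 model in host RAM first (no meta tensors).
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16)
model = deepspeed.init_inference(
    model,
    mp_size=4,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)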

xs1997zju avatar Aug 01 '23 05:08 xs1997zju

So I think the problem is in the meta tensor loading method; the llama model adaptation has some issue.

xs1997zju avatar Aug 01 '23 05:08 xs1997zju

With transformers version 4.28.1, using meta tensors and no kernel injection, the result is the same as huggingface.

xs1997zju avatar Aug 02 '23 09:08 xs1997zju