[BUG] deepspeed_stage_3 with pytorch_lightning consumes huge CPU memory at initialization, growing with the number of GPUs
Describe the bug
When deepspeed_stage_3 is used with pytorch_lightning, initialization consumes a huge amount of CPU memory:
the amount of CPU memory used = gpu_number * 2 * model_size
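As a rough sanity check of that formula, assuming the ~7B-parameter bloomz-7b1-mt checkpoint in fp16 (about 2 bytes per parameter, so roughly 14 GB per full copy -- an illustrative approximation, not a measured value):

```python
# Rough estimate of the reported formula: cpu_mem = gpu_number * 2 * model_size.
# Assumes ~7.1e9 parameters stored in fp16 (2 bytes each) -- an illustrative
# approximation for bloomz-7b1-mt, not a measurement.
model_size_gb = 7.1e9 * 2 / 1e9  # ~14 GB of fp16 weights per full copy

for gpu_number in (1, 2, 4, 8):
    cpu_mem_gb = gpu_number * 2 * model_size_gb
    print(f"{gpu_number} GPU(s) -> ~{cpu_mem_gb:.0f} GB of CPU RAM at startup")
```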
The pytorch-lightning code that was used:

```python
trainer = Trainer(
    max_epochs=1,
    devices=args.num_devices,
    precision=16,
    strategy="deepspeed_stage_3",
    accelerator='gpu',
    num_nodes=args.num_nodes,
    limit_val_batches=0,
    # add plugins
    plugins=plugins,
    # add logger and profiler
    logger=lighting_logger,
    profiler=profiler,
    # add callbacks
    callbacks=callbacks,
    # disable the built-in progress bar
    enable_progress_bar=False,
)

logger.info("start train #########################")
trainer.fit(model, data_module)
```
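One way to quantify the growth per process is to log the resident set size around trainer construction; a minimal sketch using psutil (the `log_cpu_rss` helper and its placement are illustrative additions, not part of the training script):

```python
import os

import psutil


def log_cpu_rss(tag: str) -> None:
    """Print the resident set size of the current process in GB."""
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1e9
    rank = os.environ.get("LOCAL_RANK", "?")
    print(f"[rank {rank}] {tag}: RSS = {rss_gb:.1f} GB")


# Hypothetical placement around the snippet above:
log_cpu_rss("before Trainer construction")
# trainer = Trainer(..., strategy="deepspeed_stage_3", ...)
log_cpu_rss("after Trainer construction")
# trainer.fit(model, data_module)  # or log from an on_train_start callback
```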
Hi @linyubupa, could you describe more details about reproducing this issue? Especially how you measured cpu memory used and model_size
> Hi @linyubupa, could you describe more details about reproducing this issue? Especially how you measured cpu memory used and model_size
""" #this is the code that i used: """ ` def main_worker(gpu, args): # prepare for ant environment huggingface.set_env(gpu) display_envs()
# batch related
train_batch_size = 2
gradient_accumulation_steps = 1
gradient_checkpointing = False
# train related
learning_rate = 1e-5
eval_batch_size = 2
eval_steps = 500
save_steps = 20000
num_train_epochs = 1
# log related
logging_steps = 10
disable_tqdm = False
log_level = 'passive' #'debug'
random.seed(42)
output_dir = "/mntnlp/yumu/bloom176/v1/"
# model_name_or_path = '/mntnlp/yumu/bloomz176/models--bigscience--bloomz-mt/snapshots/4f161c36a766d2328a63a26fe4970024a532e945'
model_name_or_path='/mntnlp/yumu/bloomz/models--bigscience--bloomz-7b1-mt/snapshots/13e9b1a39fe86c8024fe15667d063aa8a3e32460'
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, torch_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, torch_dtype=torch.float16)
tokenizer.pad_token = tokenizer.eos_token
model.resize_token_embeddings(len(tokenizer))
tokenizer.pad_token_id = tokenizer.eos_token_id
model.config.end_token_id = tokenizer.eos_token_id
model.config.pad_token_id = model.config.eos_token_id
model.config.use_cache = False
# Set up the datasets
user_config = UserConfig()
# table_name="apmktalgo_dev.targeting_graph_data_text_ly_1_glm_predict_100"
train_table = "XXXXXXX"
test_table = "XXXXXX"
model_name_or_path = '/mntnlp/yumu/bloomz/models--bigscience--bloomz-7b1-mt/snapshots/13e9b1a39fe86c8024fe15667d063aa8a3e32460'
max_seq_length = 100
aistudio_reader_num_processes = 2
fields = ['idx', 'text']
train_dataset = ODPSataset(train_table, fields, user_config, max_seq_length, model_name_or_path, aistudio_reader_num_processes)
dev_dataset = ODPSataset(test_table, fields, user_config, max_seq_length, model_name_or_path, aistudio_reader_num_processes)
# Set up the metric
rouge = evaluate.load("rouge")
def compute_metrics(eval_preds):
labels_ids = eval_preds.label_ids
pred_ids = eval_preds.predictions
pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
result = rouge.compute(predictions=pred_str, references=label_str)
return result
# Create a preprocessing function to extract out the proper logits from the model output
def preprocess_logits_for_metrics(logits, labels):
if isinstance(logits, tuple):
logits = logits[0]
return logits.argmax(dim=-1)
# Prepare the trainer and start training
training_args = TrainingArguments(
output_dir=output_dir,
evaluation_strategy="steps",
eval_accumulation_steps=1,
learning_rate=learning_rate,
per_device_train_batch_size=train_batch_size,
per_device_eval_batch_size=eval_batch_size,
gradient_checkpointing=gradient_checkpointing,
half_precision_backend='auto',
fp16=True,
adam_beta1=0.9,
adam_beta2=0.95,
gradient_accumulation_steps=gradient_accumulation_steps,
num_train_epochs=num_train_epochs,
warmup_steps=2,
eval_steps=eval_steps,
save_steps=save_steps,
load_best_model_at_end=True,
logging_steps=logging_steps,
log_level=log_level,
disable_tqdm=disable_tqdm,
save_total_limit=2,
prediction_loss_only=True,
deepspeed="./new_ds_config.json",
remove_unused_columns=False,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=dev_dataset,
compute_metrics=compute_metrics,
data_collator=default_data_collator,
preprocess_logits_for_metrics=preprocess_logits_for_metrics,
)
trainer.add_callback(ClearCacheCallback(1000))
trainer.train()
#detail_evaluate(trainer)
#trainer.evaluate()
trainer.save_model(output_dir)
def main():
# import deepspeed
gpu_count = torch.cuda.device_count()
# main_worker(gpu_count,sys.argv)
mp.spawn(main_worker, args=(sys.argv,), nprocs=gpu_count, join=True)
`
I think the most likely cause of this is:

`mp.spawn(main_worker, args=(sys.argv,), nprocs=gpu_count, join=True)`
And this is the DeepSpeed config:

```json
{
  "train_batch_size": "auto",
  "fp16": {
    "enabled": true,
    "min_loss_scale": 1,
    "opt_level": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "contiguous_gradients": true
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 1e-05,
      "betas": [0.9, 0.95],
      "eps": 1e-08
    }
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 1e-05,
      "warmup_num_steps": "auto"
    }
  }
}
```
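For comparison, ZeRO-3 can partition parameters while the model is being constructed via `deepspeed.zero.Init`, so no single rank has to materialize the full model first. A minimal sketch, assuming a launch under the deepspeed launcher; the tiny `torch.nn` model is a stand-in for the real checkpoint, not part of the setup above:

```python
import deepspeed
import torch

ds_config = {
    "train_batch_size": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 3},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
}

# Parameters created inside zero.Init are partitioned across ranks as they are
# allocated, so no rank ever holds a full unpartitioned copy in host memory.
with deepspeed.zero.Init(config_dict_or_path=ds_config):
    model = torch.nn.Sequential(
        torch.nn.Linear(4096, 4096),
        torch.nn.GELU(),
        torch.nn.Linear(4096, 4096),
    )

# Run with e.g. `deepspeed --num_gpus=2 this_script.py`
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```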
> Hi @linyubupa, could you describe more details about reproducing this issue? Especially how you measured cpu memory used and model_size
I measured CPU memory with AIStudio tools, which have been verified to be accurate.

Hi @linyubupa, I am not able to run your code. Do you have a smaller (but complete) test setup for reproducing this?
Your suspicion about mp.spawn makes sense; see https://github.com/pytorch/pytorch/issues/38645.
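As a toy illustration of why host memory scales with nprocs under mp.spawn (the ~1 GB buffer is a stand-in for the model; every spawned process allocates its own private copy):

```python
import os
import resource

import torch
import torch.multiprocessing as mp


def worker(rank: int, numel: int) -> None:
    # Every spawned worker is an independent process: the buffer below is a
    # private copy, so total host memory grows linearly with nprocs.
    buf = torch.ones(numel, dtype=torch.float16)  # ~1 GB for 5e8 fp16 elements
    peak_mb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024  # Linux reports KB
    print(f"rank {rank} (pid {os.getpid()}): peak RSS ~{peak_mb:.0f} MB")


if __name__ == "__main__":
    mp.spawn(worker, args=(500_000_000,), nprocs=4, join=True)
```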
Could you also try measuring CPU usage with `deepspeed.runtime.utils.see_memory_usage`?
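For example, a minimal sketch of wrapping the `from_pretrained` call in `main_worker` with `see_memory_usage` (the Hub model id here stands in for the local snapshot path used above):

```python
import torch
from deepspeed.runtime.utils import see_memory_usage
from transformers import AutoModelForCausalLM

see_memory_usage("before from_pretrained", force=True)  # prints CPU and GPU memory stats
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloomz-7b1-mt", torch_dtype=torch.float16
)
see_memory_usage("after from_pretrained", force=True)
```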
Hi @linyubupa. Closed for now. Feel free to reopen if there is any update.