linyubupa
The following PyTorch Lightning code was used:

```python
trainer = Trainer(
    max_epochs=1,
    devices=args.num_devices,
    precision=16,
    strategy="deepspeed_stage_3",
    accelerator='gpu',
    num_nodes=args.num_nodes,
    limit_val_batches=0,
    # add plugins
    plugins=plugins,
    # add logger and profiler
    logger=lighting_logger,
    profiler=profiler,
    # add callbacks
    callbacks=callbacks,
    # disable the built-in progress bar
    enable_progress_bar=False
)
```
...
The amount of CPU memory used = gpu_number * 2 * model_size. For example, a 20 GB model on 8 GPUs would consume roughly 8 * 2 * 20 GB = 320 GB of host RAM.
> Hi @linyubupa, could you describe more details about reproducing this issue? Especially how you measured _cpu memory used_ and _model_size_

This is the code that I used:

...
I think the main cause of this result is `mp.spawn(main_worker, args=(sys.argv,), nprocs=gpu_count, join=True)`.
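A minimal sketch of the suspected mechanism (the `main_worker` body and layer sizes below are made up for illustration): `mp.spawn` launches one independent process per GPU, and each process builds its own full copy of the model in host RAM before DeepSpeed stage 3 has a chance to shard it, so the CPU footprint scales with the number of GPUs.

```python
import torch
import torch.multiprocessing as mp

def main_worker(rank, world_size):
    # Every one of the `world_size` spawned processes executes this line,
    # so the full model is materialized `world_size` times in CPU memory.
    model = torch.nn.Sequential(
        *[torch.nn.Linear(4096, 4096) for _ in range(8)]
    )
    print(f"rank {rank}: built a full model copy on the CPU")

if __name__ == "__main__":
    gpu_count = torch.cuda.device_count() or 2  # fall back to 2 processes on a CPU-only box
    mp.spawn(main_worker, args=(gpu_count,), nprocs=gpu_count, join=True)
```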
And this is the DeepSpeed config:

```json
{
  "train_batch_size": "auto",
  "fp16": {
    "enabled": true,
    "min_loss_scale": 1,
    "opt_level": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "contiguous_gradients": true,
    ...
```
> Hi @linyubupa, could you describe more details about reproducing this issue? Especially how you measured _cpu memory used_ and _model_size_

I measured CPU memory by using aistudio tools, which...
If you have multiple GPUs, the CPU memory cost = 2 * model_size * gpu_number.
> ```python
> def configure_sharded_model(self):
> ```

Sorry for the late reply. I build the model in `configure_sharded_model`, but it still consumes a large amount of CPU memory.
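For reference, a minimal sketch of building the model inside `configure_sharded_model` (the module class and layer sizes here are made up): with `strategy="deepspeed_stage_3"`, Lightning calls this hook after the strategy is set up, so parameters created inside it can be partitioned across ranks as they are instantiated instead of being fully materialized on every process's CPU first.

```python
import torch
import pytorch_lightning as pl

class ShardedModule(pl.LightningModule):
    def __init__(self, hidden_size=4096, num_layers=8):
        super().__init__()
        # Only store configuration here; do not allocate the large model yet.
        self.hidden_size = hidden_size
        self.num_layers = num_layers

    def configure_sharded_model(self):
        # Invoked once the DeepSpeed strategy is initialized; layers created
        # here are sharded as they are built rather than replicated per rank.
        self.model = torch.nn.Sequential(
            *[torch.nn.Linear(self.hidden_size, self.hidden_size)
              for _ in range(self.num_layers)]
        )

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-4)
```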
I solved this by using the `deepspeed` launcher with the transformers Trainer: https://huggingface.co/docs/transformers/main_classes/deepspeed

```bash
deepspeed --num_gpus 8 --num_nodes 2 --hostfile hostfile --master_addr hostname1 --master_port=9901 \
    your_program.py --deepspeed ds_config.json
```
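A minimal sketch of what `your_program.py` might look like (the tiny model and random dataset are placeholders, and `--output_dir` would also need to be passed on the command line): when launched through the `deepspeed` command above, `--deepspeed ds_config.json` is parsed into `TrainingArguments`, and the HF `Trainer` initializes DeepSpeed itself, so the model is sharded without every process holding a full CPU copy.

```python
# your_program.py -- hypothetical skeleton for the launch command above.
import torch
from transformers import HfArgumentParser, Trainer, TrainingArguments

class RandomDataset(torch.utils.data.Dataset):
    # Placeholder dataset; replace with real data.
    def __len__(self):
        return 512

    def __getitem__(self, idx):
        x = torch.randn(4096)
        return {"x": x, "labels": x}

class TinyModel(torch.nn.Module):
    # Placeholder model; replace with the real one.
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4096, 4096)

    def forward(self, x, labels=None):
        out = self.linear(x)
        loss = torch.nn.functional.mse_loss(out, labels)
        return {"loss": loss, "logits": out}

if __name__ == "__main__":
    # --deepspeed ds_config.json (and --output_dir) are consumed by
    # TrainingArguments, which hands the config to DeepSpeed.
    parser = HfArgumentParser(TrainingArguments)
    (training_args,) = parser.parse_args_into_dataclasses()
    trainer = Trainer(model=TinyModel(),
                      args=training_args,
                      train_dataset=RandomDataset())
    trainer.train()
```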