
[BUG] The example script in the auto tensor parallelism doc only works for models that fit in 1 GPU

Open · brevity2021 opened this issue on Apr 18, 2023 · 0 comments

Describe the bug
Running the example script from the auto tensor parallelism doc only works when the model fits on a single GPU.

For example, I was using a g5.12xlarge instance (4× A10G GPUs, 24 GB each), where opt-6.7b fits on a single GPU. Running the example command to enable tensor parallelism succeeds:

deepspeed --num_gpus <num_gpus> DeepSpeedExamples/inference/huggingface/text-generation/inference-test.py --name <model> --batch_size <batch_size> --test_performance --ds_inference
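For reference, my understanding is that the script boils down to something like the following (a minimal sketch, assuming the init_inference API as of DeepSpeed 0.9.x; the model name and world size are just the values from my runs):

```python
# Minimal sketch of enabling automatic tensor parallelism. No kernel
# injection and no injection policy are given, so the auto-TP path
# should be taken.
import os
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-6.7b"
world_size = int(os.getenv("WORLD_SIZE", "4"))

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

model = deepspeed.init_inference(
    model,
    mp_size=world_size,                # tensor-parallel degree
    dtype=torch.float16,
    replace_with_kernel_inject=False,  # keep kernel injection disabled
)
```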

But when trying opt-13b with the same command, it fails without any meaningful error message. Command:

deepspeed --num_gpus 4 DeepSpeedExamples/inference/huggingface/text-generation/inference-test.py --name facebook/opt-13b --batch_size 1 --test_performance --ds_inference --dtype=float16

Error message:

[2023-04-18 20:05:01,196] [WARNING] [runner.py:190:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-18 20:05:01,239] [INFO] [runner.py:540:main] cmd = /home/xx/venv/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None DeepSpeedExamples/inference/huggingface/text-generation/inference-test.py --name facebook/opt-13b --batch_size 1 --test_performance --ds_inference --dtype=float16
[2023-04-18 20:05:03,235] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-04-18 20:05:03,235] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-04-18 20:05:03,235] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-04-18 20:05:03,235] [INFO] [launch.py:247:main] dist_world_size=4
[2023-04-18 20:05:03,235] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2023-04-18 20:05:05,406] [INFO] [utils.py:785:see_memory_usage] before init
[2023-04-18 20:05:05,406] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB
[2023-04-18 20:05:05,407] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 5.44 GB, percent = 2.9%
[2023-04-18 20:06:17,381] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 492250
[2023-04-18 20:06:19,676] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 492251
[2023-04-18 20:06:21,929] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 492252
[2023-04-18 20:06:21,929] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 492253
[2023-04-18 20:06:24,261] [ERROR] [launch.py:434:sigkill_handler] ['/home/xx/venv/bin/python', '-u', 'DeepSpeedExamples/inference/huggingface/text-generation/inference-test.py', '--local_rank=3', '--name', 'facebook/opt-13b', '--batch_size', '1', '--test_performance', '--ds_inference', '--dtype=float16'] exits with return code = -9

For what it's worth, return code -9 means the subprocesses were killed with SIGKILL, which I suspect is the host OOM killer firing while every rank loads the full fp16 model into CPU memory. I also tried the checkpoint-loading path for models that cannot fit into one GPU, but checkpoint loading has its own issue (reported here).
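In case it helps, this is the kind of meta-tensor checkpoint load I was attempting (a sketch, not a confirmed working path; the checkpoints.json file and its contents are assumptions based on the DeepSpeed examples):

```python
# Sketch of loading a model too large for one GPU via meta tensors, so no
# rank materializes the full weights in host memory. "checkpoints.json" is
# a hypothetical DeepSpeed checkpoint-description file.
import torch
import deepspeed
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("facebook/opt-13b")

# Construct the model on the meta device: no real memory is allocated yet.
with deepspeed.OnDevice(dtype=torch.float16, device="meta"):
    model = AutoModelForCausalLM.from_config(config)

# init_inference should then load only this rank's shard of the weights.
model = deepspeed.init_inference(
    model,
    mp_size=4,
    dtype=torch.float16,
    replace_with_kernel_inject=False,
    checkpoint="checkpoints.json",
)
```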

So it looks like there is no way to use automatic tensor parallelism for a model whose parameters cannot fit in 1 GPU, if the model does not have an injection kernel implemented. This seems to contradict the description in this document: "DeepSpeed now supports automatic tensor parallelism for HuggingFace models by default as long as kernel injection is not enabled and an injection policy is not provided."
