yukiyee
> Hi, what's your batch size on each GPU? The micro-batch size is the unit of data passed through at a time when using pipeline parallelism.
>
> If your batch size is more than 1, ...
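To make the quoted batch/micro-batch relation concrete (a sketch with illustrative numbers, not ColossalAI's actual variable names): with pipeline parallelism, each per-GPU batch is split into micro-batches that move through the pipeline stages one at a time.

```shell
# Illustrative numbers only: a per-GPU batch of 8 split into
# micro-batches of 2 yields 4 micro-batches per pipeline pass.
BATCH_SIZE_PER_GPU=8
MICRO_BATCH_SIZE=2
NUM_MICRO_BATCHES=$((BATCH_SIZE_PER_GPU / MICRO_BATCH_SIZE))
echo "$NUM_MICRO_BATCHES"   # prints 4
```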
Me too!! `RuntimeError: Failed to replace input_layernorm of type LlamaRMSNorm with FusedRMSNorm with the exception: No module named 'fused_layer_norm_cuda'. Please check your model configuration or sharding policy, you can...`
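For anyone hitting the same error: `fused_layer_norm_cuda` is the compiled CUDA extension shipped with NVIDIA Apex. A quick diagnostic sketch (not part of ColossalAI) to check whether your environment actually has it before enabling the fused kernel:

```shell
# Try importing apex's compiled extension; if the import fails,
# the fused RMSNorm/LayerNorm path cannot be used and apex
# likely needs to be reinstalled with its CUDA extensions
# (the --cuda_ext build option).
if python -c "import fused_layer_norm_cuda" 2>/dev/null; then
    echo "fused_layer_norm_cuda available"
else
    echo "fused_layer_norm_cuda missing"
fi
```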
And I saw this note in `examples/language/llama2/scripts/benchmark_70B/3d.sh`:

```
# TODO: fix this
echo "3D parallel for LLaMA-2 is not ready yet"
```

Does it mean, even if I deployed...
> And I saw this note in `examples/language/llama2/scripts/benchmark_70B/3d.sh`:
>
> ```
> # TODO: fix this
> echo "3D parallel for LLaMA-2 is not ready yet"
> ```
> ...
> ps -ef | grep
>
> then kill -9 the daemon bound to the port

I have done this before; however, it doesn't work. And I'm sure the...
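If the `ps -ef | grep` plus `kill -9` route is what's wanted, one way to script the cleanup (a hypothetical sketch; `xx.py` is the placeholder script name used later in this thread):

```shell
# Kill any stale training processes left behind by a crashed
# run. The bracket trick '[x]x.py' keeps grep from matching
# its own process in the ps output.
for PID in $(ps -ef | grep '[x]x.py' | awk '{print $2}'); do
    echo "killing stale process $PID"
    kill -9 "$PID"
done
echo "cleanup done"
```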
Finally, I solved the problem as below. First, using `python xx.py` instead of `colossalai run --nproc_per_node 8 xx.py` works well. So the start command is:

```shell
srun -p...
```