EasyParallelLibrary
Easy Parallel Library (EPL) is a general and efficient deep learning framework for distributed model training.
I ran into a problem when using the EPL data parallel model. The number of workers is set to 3, and each worker has its own TFRecord data input and model save_dir....
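For reference, EPL data parallelism is enabled by initializing EPL and setting a replicate strategy before building the model, as shown in the library's README; a minimal sketch (the model-building call is a placeholder for the existing per-worker code):

```python
import epl

# Initialize EPL and replicate the model across workers:
# replicate(1) puts one full model replica on each GPU, so three
# workers yield three data-parallel replicas.
epl.init()
epl.set_default_strategy(epl.replicate(1))

build_model_and_train()  # placeholder: the existing per-worker model code
```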
After I added the code below to my worker function, I got a TypeError:

```python
epl_config = epl.Config({
    "gradient_checkpoint.type": "auto",
    "zero.level": "v1",
    "amp.level": "O1",
    "amp.loss_scale": 128
})
epl.init(epl_config)
epl.set_default_strategy(epl.replicate(1))
```

error...
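To narrow down which entry triggers the TypeError, one option is to construct the config with each entry in isolation before calling `epl.init`; a hedged sketch (assuming `epl.Config` can be constructed repeatedly without side effects):

```python
import epl

# Build a config from each entry separately to see which one raises.
for overrides in (
    {"gradient_checkpoint.type": "auto"},
    {"zero.level": "v1"},
    {"amp.level": "O1", "amp.loss_scale": 128},
):
    try:
        epl.Config(overrides)
        print("OK:", overrides)
    except TypeError as exc:
        print("TypeError for", overrides, "->", exc)
```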
Does DistributedDense only support splitting by columns? If I want to implement the Megatron-LM approach, splitting by columns first and then by rows, how should I do that?
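For background on the Megatron-LM scheme being asked about: the MLP splits the first weight matrix by columns and the second by rows, so each shard computes its partial result with no communication and only the final outputs are summed across shards. A small numpy sketch of the arithmetic (illustrative only, not the EPL API):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU, as used in Megatron-LM
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))    # [batch, hidden]
A = rng.standard_normal((8, 16))   # first MLP weight, split by columns
B = rng.standard_normal((16, 8))   # second MLP weight, split by rows

# Reference: unsharded forward pass.
full = gelu(X @ A) @ B

# Two-way tensor parallelism: column-split A, row-split B.
shards = []
for i in range(2):
    A_i = A[:, i * 8:(i + 1) * 8]       # column shard of A
    B_i = B[i * 8:(i + 1) * 8, :]       # matching row shard of B
    shards.append(gelu(X @ A_i) @ B_i)  # fully local per shard
out = sum(shards)                       # one all-reduce in a real setup

assert np.allclose(full, out)
```

The column-then-row pairing works because GeLU is elementwise, so it commutes with the column partition of `X @ A`, and the row partition of `B` turns the second matmul into a sum of per-shard products.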
**Environment:** a container built from the nvcr.io/nvidia/tensorflow:21.12-tf1-py3 image
**Code:** FastNN/resnet/resnet_split.py
**Launch commands:**
Server 1: TF_CONFIG='{"cluster":{"worker":["172.20.21.181:55375","172.20.21.189:55376"]},"task":{"type":"worker","index":0}}' bash scripts/train_split.sh
Server 2: TF_CONFIG='{"cluster":{"worker":["172.20.21.181:55375","172.20.21.189:55376"]},"task":{"type":"worker","index":1}}' bash scripts/train_split.sh
Output on server 1: [screenshot of server 1 log]
Output on server 2: [screenshot of server 2 log]
Server 1 prints the "still waiting" message only twice and then stops, which suggests it has received server 2's reply, but execution does not continue. **Additional note:** BERT runs distributed training fine in the same environment, so the servers can connect to each other and train distributedly. Is this a problem with how I launch it, or does the code need to be modified?
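As a quick sanity check before debugging the hang, each worker can print how it parses TF_CONFIG; both processes should report the same worker list and differ only in task index. A minimal sketch:

```python
import json
import os

# Print how this process sees the cluster; the two workers must agree on
# the worker list and differ only in their task index.
tf_config = json.loads(os.environ["TF_CONFIG"])
print("workers:", tf_config["cluster"]["worker"])
print("this task:", tf_config["task"]["type"], tf_config["task"]["index"])
```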
Single machine, single GPU. Launch command:
TF_CONFIG='{"cluster":{"worker":["127.0.0.1:49119"]},"task":{"type":"worker","index":0}}' CUDA_VISIBLE_DEVICES=0 bash ./scripts/train_dp.sh
[screenshot of single-GPU run]
Single machine, two GPUs. Launch command:
TF_CONFIG='{"cluster":{"worker":["127.0.0.1:49119"]},"task":{"type":"worker","index":0}}' CUDA_VISIBLE_DEVICES=0,1 bash ./scripts/train_dp.sh
[screenshot of two-GPU run]
I modified the code slightly: removed the last_step limit and set the dataset to repeat=10; rename the attached .txt to .py and it runs. [resnet_dp.txt](https://github.com/alibaba/EasyParallelLibrary/files/12671704/resnet_dp.txt)
How should I interpret the result? Did each GPU run 10 steps separately?
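For context: with data parallelism, each replica runs its own training loop over its share of the input, so if the input pipeline is not sharded, every replica would traverse the repeat=10 dataset in full. A minimal tf.data sketch of per-replica sharding (the replica count and index are placeholders):

```python
import tensorflow as tf

NUM_REPLICAS = 2   # placeholder: total number of data-parallel replicas
REPLICA_ID = 0     # placeholder: this replica's index

# Each replica sees only every NUM_REPLICAS-th record, so the replicas
# jointly cover the dataset instead of each reading all of it.
dataset = tf.data.Dataset.range(100)
dataset = dataset.shard(NUM_REPLICAS, REPLICA_ID)
dataset = dataset.repeat(10).batch(8)
```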
I ran a 2-node 2-GPU experiment with two containers and got the error below; I would appreciate help resolving it.
### Environment:
A container built from nvcr.io/nvidia/tensorflow:21.12-tf1-py3
### Script:
The FastNN resnet script
### Launch commands
```
TF_CONFIG='{"cluster":{"worker":["192.168.83.228:6666","192.168.83.228:6667"]},"task":{"type":"worker","index":0}}' bash scripts/train_dp.sh
TF_CONFIG='{"cluster":{"worker":["192.168.83.228:6666","192.168.83.228:6667"]},"task":{"type":"worker","index":0}}' bash scripts/train_dp.sh
```
### Error
```
2023-08-31 01:40:46.786721: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at nccl_communicator.cc:116 : Internal:...
```
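When an NCCL communicator fails to initialize across containers, a common first step is to turn on NCCL's own logging and pin the network interface it should use. A minimal sketch placed at the top of the training script, using standard NCCL environment variables (`eth0` is a placeholder for the container's actual interface name):

```python
import os

# Ask NCCL to print initialization details and pin the NIC it should use.
# These must be set before the NCCL communicator is created.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"  # placeholder interface name
```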
Hi EPL team, when I use the epl library to train with the following code:
```python
import os
import numpy as np
from concurrent.futures import ThreadPoolExecutor
from PIL import Image
import tensorflow...
```