cheng cheng comments

Results: 38 comments of cheng cheng

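Swin graph 3D parallelism: error when acc grad is enabled

I added this Check early last year, during the refactor that removed Logical Graph. In theory it should never be triggered. Let me look at the code and recall the logic 😂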

Swin graph 3D parallelism: error when acc grad is enabled

Depeng, which machine did you run this test on? @Ldpe2G Kingsoft (金山) or Leinao (类脑)?

Swin graph 3D parallelism: error when acc grad is enabled

Please verify whether this branch, [dev_cc_fix_task_edge](https://github.com/Oneflow-Inc/oneflow/tree/dev_cc_fix_task_edge), fixes the problem above. The cause of the bug: in swin, a Variable Op is followed by a B2P boxing, and that boxing inserts a zero boxing task node. However, in `NaiveB2PSubTskGphBuilder`, the task edge built for B2P does not add the lbi. Because the earlier cases...
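For readers less familiar with B2P boxing: converting a broadcast (B) tensor into partial-sum (P) pieces only requires that the per-rank pieces sum elementwise back to the original value. The sketch below assumes the scheme that the "zero boxing" wording suggests, where one rank keeps the value and the others hold zeros. It is plain Python for illustration; `broadcast_to_partial_sum` and `reduce_partial_sum` are hypothetical names, not OneFlow APIs.

```python
from typing import List


def broadcast_to_partial_sum(value: List[float], num_ranks: int) -> List[List[float]]:
    """Split a broadcast tensor into per-rank partial-sum pieces.

    Assumption for illustration: rank 0 keeps the value and every
    other rank receives a zero tensor of the same shape.
    """
    if num_ranks < 1:
        raise ValueError("num_ranks must be at least 1")
    zeros = [0.0] * len(value)
    return [list(value) if rank == 0 else list(zeros) for rank in range(num_ranks)]


def reduce_partial_sum(pieces: List[List[float]]) -> List[float]:
    """Recover the logical tensor by summing the pieces elementwise."""
    return [sum(column) for column in zip(*pieces)]


pieces = broadcast_to_partial_sum([1.0, 2.0, 3.0], num_ranks=4)
# Summing the partial pieces must give back the broadcast value.
assert reduce_partial_sum(pieces) == [1.0, 2.0, 3.0]
```

In this picture the zero-filled pieces correspond to the extra zero boxing task nodes, which is why their edges also need to carry the tensor's identifier.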

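Swin graph 3D parallelism: error when acc grad is enabled

I do have one question, though: B2P is a fairly uncommon boxing. With 3-D parallelism enabled for swin, is it expected that a Variable is followed by a B2P consumer? @Ldpe2G @strint @leaves-zwx @L1aoXingyu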

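Swin graph 3D parallelism: error when acc grad is enabled

> Is it true that `num_accumulation_steps` cannot exceed the number of pipeline stages of the network? 8-GPU 3D parallelism is split into two stages.

No, quite the opposite: the number of grad acc steps should be at least 2x the number of stages.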
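The guideline in this reply (grad acc steps at least 2x the pipeline stage count) can be written as a small pre-flight check. This is a minimal sketch; `meets_grad_acc_guideline` and its parameters are hypothetical names, not part of OneFlow or LiBai, and the check encodes a recommendation rather than a hard requirement.

```python
def meets_grad_acc_guideline(num_accumulation_steps: int, num_pipeline_stages: int) -> bool:
    """Return True if grad acc steps are at least 2x the pipeline stage count."""
    if num_accumulation_steps < 1 or num_pipeline_stages < 1:
        raise ValueError("both arguments must be positive integers")
    return num_accumulation_steps >= 2 * num_pipeline_stages


# The quoted setup: 8 GPUs with 3D parallelism, split into two pipeline stages.
for steps in (2, 4, 8):
    status = "meets" if meets_grad_acc_guideline(steps, 2) else "is below"
    print(f"num_accumulation_steps={steps} {status} the 2x-stages guideline")
```

A value below the guideline may still run; the check only flags it as smaller than what the reply recommends.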

Swin graph 3D parallelism: error when acc grad is enabled

> ### This configuration runs
> ```python
> train.train_micro_batch_size = 8
> train.num_accumulation_steps = 2
> train.test_micro_batch_size = 16
> ```
>
> ### This configuration does not raise an error, but it hangs: a few GPUs sit at 100% utilization and the rest at 0%
> ```python
> train.train_micro_batch_size = 32...
> ```

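Swin graph 3D parallelism: error when acc grad is enabled

> > This configuration does not raise an error, but it hangs: a few GPUs sit at 100% utilization and the rest at 0% > > Does 8-GPU 3D parallelism mean 2+2+2? This symptom is very likely caused by an inconsistent NCCL launch order. Is the nccl_use_compute_stream option enabled? It is enabled by default in libai. @L1aoXingyu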

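Swin graph 3D parallelism: error when acc grad is enabled

I seem to remember this bug showing up before... Last time it hit the 1024 upper limit, which turned asynchronous execution into synchronous execution and caused a deadlock @leaves-zwx 😂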

Swin graph 3D parallelism: error when acc grad is enabled

Does it run if you reduce the number of Transformer layers?

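Models with BatchNorm2d raise an error when amp and grad acc are enabled

> The amp algorithm only marks moving mean and moving var as no cast (that is, no cast op is inserted to convert them to half)

Which other ops are no cast? Is no cast the same kind of distinction as black? @hjchen2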