Yongning Xu
### libai bert & gpt correctness verification

- https://github.com/Oneflow-Inc/oneflow/commit/bc2c2cba7bd831deb999104d6704562309081203
- https://github.com/Oneflow-Inc/libai/commit/e9ca4087cb35b3ad268534ee60456db689e36063
- oneflow-25 & oneflow-28
- 1n1g
  - `LibAI_bert_nl24_nah16_hs1024_FP16_actrue_mp1_pp1_mb32_gb128_1n1g`
- 1n4g
  - `LibAI_bert_nl24_nah16_hs1024_FP16_actrue_mp2_pp1_mb32_gb256_1n4g`
  - `LibAI_gpt2_nl24_nah16_hs1024_FP16_actrue_mp2_pp1_mb8_gb64_1n4g`

**The single-node loss curves essentially coincide, so correctness is confirmed.**

- 2n4g `LibAI_bert_nl24_nah16_hs1024_FP16_actrue_mp2_pp2_mb64_gb512_2n4g` fails with an error:

```
F20221026 03:20:24.009537...
```
https://github.com/Oneflow-Inc/oneflow/commit/684b0a43b5cb9ca5e698a32c04a0cb90e0340f12 fails to compile: https://github.com/Oneflow-Inc/oneflow/actions/runs/3286674212

```
ninja: build stopped: subcommand failed.
Error: {"ID":"0f3c12d8ef0a3ba46d79b1edd45ecf156129ab222dfae70cdc2bfaa937b55462","Running":false,"ExitCode":1,"ProcessConfig":{"tty":false,"entrypoint":"bash","arguments":["-l","/home/ci-user/runners/release/_work/oneflow/oneflow/ci/manylinux/build-gcc7.sh"],"privileged":false},"OpenStdin":false,"OpenStderr":true,"OpenStdout":true,"CanRemove":false,"ContainerID":"415a31872fb6d3610161033266276f17e23a7258abc57a2e18ed6353f33c1541","DetachKeys":"","Pid":2822682}
```

@daquexian @lixinqi
- https://github.com/Oneflow-Inc/oneflow/pull/9108/commits/fa49459c99f2df912f68b8c7eabcad7bca40388b
- https://github.com/Oneflow-Inc/libai/commit/6273c06b15f5499d881d45da4ec93218ba34b6f6
- oneflow-25 & oneflow-28
- t5 3D-parallel case `t5_nl12_nah12_hs768_fp16_actrue_mp2_pp2_mb8_gb128_2n4g`
- With `export ONEFLOW_LAZY_COMPILE_MODE=rank_per_iter` the run fails with an error ([log](https://oneflow-test.oss-cn-beijing.aliyuncs.com/rank_task_graph/LibAI_t5_nl12_nah12_hs768_FP16_actrue_mp2_pp2_mb8_gb128_2n4g_20221010_035336173629119/output.log)):

```
F20221010 03:57:00.982542 1494408 task_graph.cpp:1204] Check failed: src->parallel_desc_sym() == dst->parallel_desc_sym()
*** Check failure...
```
### t5 single-node 4-GPU test

- Machine: oneflow-25, single node with 4 GPUs
- oneflow master [https://github.com/Oneflow-Inc/oneflow/commit/93d19f3be52632cccc875c8e46011eced14249a0](https://github.com/Oneflow-Inc/oneflow/commit/93d19f3be52632cccc875c8e46011eced14249a0)
- libai main [https://github.com/Oneflow-Inc/libai/commit/e9ca4087cb35b3ad268534ee60456db689e36063](https://github.com/Oneflow-Inc/libai/commit/e9ca4087cb35b3ad268534ee60456db689e36063)
- Case: `t5_nl12_nah12_hs768_FP16_actrue_mp2_pp1_mb32_gb512_1n4g` zero_stage=2
- libai: [4089](https://oneflow-test.oss-cn-beijing.aliyuncs.com/libai_vs_megatron/1020_t5/25/libai_zero/LibAI_t5_nl12_nah12_hs768_FP16_actrue_mp2_pp1_mb32_gb512_1n4g_20221020_161944511159180/config.yaml) MiB / [85.83](https://oneflow-test.oss-cn-beijing.aliyuncs.com/libai_vs_megatron/1020_t5/25/libai_zero/LibAI_t5_nl12_nah12_hs768_FP16_actrue_mp2_pp1_mb32_gb512_1n4g_20221020_161944511159180/output.log) samples/s. Logs: `oss://oneflow-test/libai_vs_megatron/1020_t5/25/libai_zero/`
- Megatron-deepspeed: 4725 MiB / [82.7](https://oneflow-test.oss-cn-beijing.aliyuncs.com/libai_vs_megatron/1020_t5/25/megatron_deepspeed_zero/Megatron-Deepspeed_t5_nl12_nah12_hs768_FP16_actrue_mp2_pp1_mb16_gb512_1n4g_20221020_155410248399843.log) samples/s

### t5...
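- I just ran the 2-node, 4-GPU-per-node tests for dp4_mp2_pp1 and dp2_mp4_pp1 separately.
  - dp4_mp2_pp1: throughput looks normal.
  - dp2_mp4_pp1: this is the configuration provided by IDEA. It runs very slowly: the first iteration had not finished after 15 minutes, so I stopped waiting.

Below are the comparison results for the dp4_mp2_pp1 configuration. The libai numbers are from today's run; the Megatron numbers are taken from an earlier comment. The parameters of the two models are aligned, but the datasets differ. @xiezipeng-ML, could you please explain this part?

### projects/T5 single-node 4-GPU test

- Machine: oneflow-28, single node with 4 GPUs
- oneflow master https://github.com/Oneflow-Inc/oneflow/commit/f97f09f1d9a8668c972a12f66d77aaa19b164635
- libai test_t5_time https://github.com/Oneflow-Inc/libai/commit/0002b6637c92e19728cd26830494fa33ab68efc1
- Comparison:
  - libai: `mt5_pretrain.py` **`mb16_gb256`** `dp2_mp2_pp1` `zero_stage=2`...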
Single GPU, mb4_gb32: [libai_nsys](https://oneflow-test.oss-cn-beijing.aliyuncs.com/mt5_test/1022/nsys/libai_1n1g/1n1g.qdrep) [megatron_nsys](https://oneflow-test.oss-cn-beijing.aliyuncs.com/mt5_test/1022/nsys/megatron_1n1g/Megatron-Deepspeed_t5_nl12_nah12_hs768_FP16_actrue_mp1_pp1_mb4_gb32_1n1g_20221022_141240468908372.qdrep) @chengtbf
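
The case names used throughout this thread (for example `..._mp1_pp1_mb4_gb32_1n1g`) pack the run configuration into underscore-separated fields. Below is a minimal sketch for pulling those fields apart and deriving the data-parallel size and accumulation steps. It rests on assumptions: `mp`/`pp` are read as model/pipeline parallel sizes, `mb`/`gb` as micro/global batch sizes, and `XnYg` as X nodes with Y GPUs each; the function names `parse_case_name` and `derive_parallelism` are hypothetical, and the relation global batch = micro batch × data-parallel size × accumulation steps is the usual convention, not something confirmed by the logs.

```python
import re

# Assumed field prefixes, inferred from the case names in this thread.
_FIELD = re.compile(r"^(nl|nah|hs|mp|pp|mb|gb)(\d+)$")
# Assumed topology token: "<nodes>n<gpus per node>g", e.g. "2n4g".
_TOPO = re.compile(r"^(\d+)n(\d+)g$")


def parse_case_name(name):
    """Extract the numeric fields and the node/GPU topology from a case name."""
    cfg = {}
    for token in name.split("_"):
        field = _FIELD.match(token)
        if field:
            cfg[field.group(1)] = int(field.group(2))
            continue
        topo = _TOPO.match(token)
        if topo:
            cfg["nodes"] = int(topo.group(1))
            cfg["gpus_per_node"] = int(topo.group(2))
    return cfg


def derive_parallelism(cfg):
    """Derive world size, data-parallel size and accumulation steps (assumed convention)."""
    world_size = cfg["nodes"] * cfg["gpus_per_node"]
    dp, rem = divmod(world_size, cfg["mp"] * cfg["pp"])
    if rem:
        raise ValueError("world size is not divisible by mp * pp")
    acc_steps, rem = divmod(cfg["gb"], cfg["mb"] * dp)
    if rem:
        raise ValueError("global batch is not divisible by mb * dp")
    return {"world_size": world_size, "dp": dp, "acc_steps": acc_steps}


if __name__ == "__main__":
    name = "LibAI_bert_nl24_nah16_hs1024_FP16_actrue_mp2_pp1_mb32_gb256_1n4g"
    cfg = parse_case_name(name)
    print(cfg)
    print(derive_parallelism(cfg))
```

Under these assumptions a `2n4g` run with `mp2_pp1` comes out as dp4, which matches the `dp4_mp2_pp1` label used for the 8-GPU runs in this thread.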
### Throughput tests with SBP_INFER_RULE_TAG=2 and auto-parallelism

- Machines: oneflow-25 and oneflow-28, 2 nodes with 8 GPUs in total
- oneflow master https://github.com/Oneflow-Inc/oneflow/commit/f97f09f1d9a8668c972a12f66d77aaa19b164635
- libai main https://github.com/Oneflow-Inc/libai/commit/e9ca4087cb35b3ad268534ee60456db689e36063
- libai throughput numbers: `mt5_pretrain.py` `mb16_gb512` `dp4_mp2_pp1` `zero_stage=2`
  - **export SBP_INFER_RULE_TAG=2**: [11663](https://oneflow-test.oss-cn-beijing.aliyuncs.com/mt5_test/SBP_INFER_RULE_TAG%3D2/config.yaml) MiB / [61.44](https://oneflow-test.oss-cn-beijing.aliyuncs.com/mt5_test/SBP_INFER_RULE_TAG%3D2/output.log) samples/s. Throughput is on par with the earlier numbers.
  -...
### Auto-parallelism 2n4g test

- Machines: oneflow-25 and oneflow-28, 2 nodes with 8 GPUs in total
- oneflow feat-auto_parallel-ZeRO branch https://github.com/Oneflow-Inc/oneflow/pull/9288/commits/54771bc917aa1b7509e758b7d5c1344ce00e7246. With this branch, compilation plus auto-parallelism takes half an hour, which is indeed faster.
- libai main https://github.com/Oneflow-Inc/libai/commit/e9ca4087cb35b3ad268534ee60456db689e36063
- To avoid OOM, I reduced the batch size and ran one comparison: `mb4_gb128` `dp4_mp2_pp1` `zero_stage=2`
  - libai: [9915](https://oneflow-test.oss-cn-beijing.aliyuncs.com/mt5_test/auto/2n4g_mb4/libai/25/config.yaml) MiB / [60.16](https://oneflow-test.oss-cn-beijing.aliyuncs.com/mt5_test/auto/2n4g_mb4/libai/25/output.log) samples/s
  - megatron:...
### debug_reshape_sbp_signature branch

- https://github.com/Oneflow-Inc/oneflow/commit/4b04b25f521ab2d7727235347c057e3aa584350b
- 2n4g mb16_gb512
  - libai: [11525](https://oneflow-test.oss-cn-beijing.aliyuncs.com/mt5_test/debug_reshape_sbp_signature/4b04b25/config.yaml) MiB / [116.95](https://oneflow-test.oss-cn-beijing.aliyuncs.com/mt5_test/debug_reshape_sbp_signature/4b04b25/output.log) samples/s
  - megatron: 4783 MiB / [164.3](https://oneflow-test.oss-cn-beijing.aliyuncs.com/libai_vs_megatron/1019_t5/megatron_deepspeed_zero/oneflow-28/Megatron-Deepspeed_t5_nl12_nah12_hs768_FP16_actrue_mp2_pp1_mb16_gb512_2n4g_20221019_173310339738103.log) samples/s

### refactor-GetSbpSignature branch

- https://github.com/Oneflow-Inc/oneflow/pull/9304/commits/195b0ea149c77374737751356b97f6bf2da240ff
- 2n4g mb4_gb128
  - Auto-parallelism off: [5283](https://oneflow-test.oss-cn-beijing.aliyuncs.com/mt5_test/refactor-GetSbpSignature/195b0ea/config.yaml) MiB / [71.42](https://oneflow-test.oss-cn-beijing.aliyuncs.com/mt5_test/refactor-GetSbpSignature/195b0ea/output.log) samples/s
  -...
**My mistake: the Megatron numbers in the tests above were measured with ZeRO turned off**, so I re-ran Megatron with ZeRO enabled and collected the current comparison results below.

### Tests with ZeRO enabled

- oneflow debug_reshape_sbp_signature branch https://github.com/Oneflow-Inc/oneflow/commit/4b04b25f521ab2d7727235347c057e3aa584350b
  - `export SBP_INFER_RULE_TAG=2`
- libai main https://github.com/Oneflow-Inc/libai/commit/e9ca4087cb35b3ad268534ee60456db689e36063
- 2n4g mb16_gb512
  - libai: [11525](https://oneflow-test.oss-cn-beijing.aliyuncs.com/mt5_test/debug_reshape_sbp_signature/4b04b25/config.yaml) MiB / [116.95](https://oneflow-test.oss-cn-beijing.aliyuncs.com/mt5_test/debug_reshape_sbp_signature/4b04b25/output.log) samples/s [nsys](https://oneflow-test.oss-cn-beijing.aliyuncs.com/mt5_test/debug_reshape_sbp_signature/4b04b25/nsys/25/2n4g.qdrep) [profiler_nsys](https://oneflow-test.oss-cn-beijing.aliyuncs.com/mt5_test/debug_reshape_sbp_signature/4b04b25/profiler_nsys/2n4g.qdrep) log: `oss://oneflow-test/mt5_test/debug_reshape_sbp_signature/4b04b25/log_path/log/`
  - megatron: 3653...