init multidevice cuda graph
graph is multidevice, though the perf impact isn't huge yet (tested hlb_cifar10 on 2 GPUs). need to enqueue the transfers into the graph as well to get the speed
cudagraph just works!
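For reference, a tiny illustrative sketch of the idea (all names made up, not tinygrad internals or the CUDA API): kernel launches from every device get recorded once into a replayable graph, and the remaining work mentioned above is to record the device-to-device transfers into that same graph instead of issuing them from Python every step.

```python
# Illustrative only: made-up names, no real CUDA calls.
class FakeMultiGraph:
  def __init__(self): self.nodes = []
  def add_kernel(self, device, name): self.nodes.append(("kernel", device, name))
  def add_copy(self, src, dst, buf): self.nodes.append(("copy", f"{src}->{dst}", buf))
  def replay(self):
    # a real backend would launch one instantiated CUDA graph per device here,
    # instead of dispatching every kernel/copy from Python each step
    return len(self.nodes)

g = FakeMultiGraph()
for dev in ("CUDA:0", "CUDA:1", "CUDA:2"): g.add_kernel(dev, "fwd_bwd_step")
g.add_copy("CUDA:1", "CUDA:0", "grads1")  # the transfers that still need to go into the graph
g.add_copy("CUDA:2", "CUDA:0", "grads2")
g.add_kernel("CUDA:0", "grad_reduce")
print(g.replay(), "nodes replayed with one call")
```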
cifar on 3x 4090:
NCCL_SHM_USE_CUDA_MEMCPY=1 CUDA=1 HALF=1 STEPS=350 BS=768 GPUS=3 TARGET_EVAL_ACC_PCT=93.5 python3 examples/hlb_cifar10.py
shuffling training dataset in 1073.18 ms (epoch=0)
0 9541.00 ms run, 9538.54 ms python, 2.46 ms CUDA * 3, 1197.00 loss, 0.000043 LR, 0.47 GB used, 105.76 GFLOPS, 1009.09 GOPS
1 512.38 ms run, 511.03 ms python, 1.35 ms CUDA * 3, 1196.00 loss, 0.000085 LR, 3.42 GB used, 1965.79 GFLOPS, 1007.23 GOPS
2 45.76 ms run, 8.65 ms python, 37.11 ms CUDA * 3, 1174.00 loss, 0.000128 LR, 3.42 GB used, 22008.81 GFLOPS, 1007.23 GOPS
3 44.22 ms run, 6.89 ms python, 37.33 ms CUDA * 3, 1160.00 loss, 0.000171 LR, 3.42 GB used, 22778.29 GFLOPS, 1007.23 GOPS
4 44.16 ms run, 6.94 ms python, 37.22 ms CUDA * 3, 1163.00 loss, 0.000214 LR, 3.42 GB used, 22810.04 GFLOPS, 1007.23 GOPS
5 44.05 ms run, 6.76 ms python, 37.29 ms CUDA * 3, 1156.00 loss, 0.000256 LR, 3.42 GB used, 22865.37 GFLOPS, 1007.23 GOPS
6 43.93 ms run, 6.71 ms python, 37.22 ms CUDA * 3, 1136.00 loss, 0.000299 LR, 3.42 GB used, 22927.88 GFLOPS, 1007.23 GOPS
7 43.95 ms run, 6.96 ms python, 37.00 ms CUDA * 3, 1106.00 loss, 0.000341 LR, 3.42 GB used, 22916.00 GFLOPS, 1007.23 GOPS
8 42.63 ms run, 6.75 ms python, 35.89 ms CUDA * 3, 1088.00 loss, 0.000384 LR, 3.42 GB used, 23626.07 GFLOPS, 1007.23 GOPS
9 41.54 ms run, 6.75 ms python, 34.79 ms CUDA * 3, 1072.00 loss, 0.000427 LR, 3.42 GB used, 24247.69 GFLOPS, 1007.23 GOPS
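Side note on reading these numbers (my assumption about the columns, not something pulled from the script): the GFLOPS column looks like the fixed per-step op count divided by wall-clock run time, which is why it climbs as the step time drops.

```python
# e.g. step 9 above: 1007.23 GOPS over a 41.54 ms step
gops, run_ms = 1007.23, 41.54
print(gops / (run_ms / 1000))  # ~24248 GFLOPS, matching the logged 24247.69
```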
@geohot it would be good if you could run this on 6 GPUs as well to test. Also, maybe we can remove LoadOps.SYNC? Isn't it better to control sync in the runtime itself (like HSA does, and now CUDA does in transfer)?
tiny17 is down right now, will test in a few hours when it's back.
Removing sync should be fine; the runtime should know when a cross-device buffer is used, right? Though what about OpenCL or platforms that don't have fine-grained syncing, where you have to do a global sync?
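To make the "runtime knows when a cross-device buffer is used" idea concrete, here is a minimal pure-Python sketch under that assumption (made-up names, no real CUDA calls): track the last writer of each buffer and insert a wait only when a kernel on another device reads it, instead of emitting an explicit LoadOps.SYNC; backends without fine-grained sync could fall back to a global sync in the same hook.

```python
# Illustrative only: not tinygrad code; "sync" stands in for an event wait
# (or a global sync on backends that can't do better).
class Buffer:
  def __init__(self, device): self.device, self.last_writer = device, None

def run_kernel(device, reads, writes, syncs):
  for b in reads:
    if b.last_writer is not None and b.last_writer != device:
      syncs.append((b.last_writer, device))  # implicit cross-device sync
      b.last_writer = device                 # later reads on this device are free
  for b in writes: b.last_writer = device

syncs = []
g0, g1 = Buffer("CUDA:0"), Buffer("CUDA:1")
run_kernel("CUDA:0", reads=[], writes=[g0], syncs=syncs)        # backward on GPU 0
run_kernel("CUDA:1", reads=[], writes=[g1], syncs=syncs)        # backward on GPU 1
run_kernel("CUDA:0", reads=[g0, g1], writes=[g0], syncs=syncs)  # reduce: one sync needed
print(syncs)  # [('CUDA:1', 'CUDA:0')]
```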
batman@tiny17:~/tinygrad$ NCCL_SHM_USE_CUDA_MEMCPY=1 CUDA=1 HALF=1 STEPS=350 BS=1536 GPUS=6 TARGET_EVAL_ACC_PCT=93.5 python3 examples/hlb_cifar10.py
shuffling training dataset in 924.58 ms (epoch=0)
0 22472.73 ms run, 22468.05 ms python, 4.68 ms CUDA * 6, 1197.00 loss, 0.000043 LR, 0.56 GB used, 89.73 GFLOPS, 2016.42 GOPS
1 1177.17 ms run, 1174.25 ms python, 2.92 ms CUDA * 6, 1195.00 loss, 0.000085 LR, 6.45 GB used, 1711.36 GFLOPS, 2014.56 GOPS
2 57.99 ms run, 23.16 ms python, 34.83 ms CUDA * 6, 1178.00 loss, 0.000128 LR, 6.46 GB used, 34739.17 GFLOPS, 2014.56 GOPS
3 54.02 ms run, 19.59 ms python, 34.42 ms CUDA * 6, 1163.00 loss, 0.000171 LR, 6.46 GB used, 37293.52 GFLOPS, 2014.56 GOPS
4 54.15 ms run, 19.86 ms python, 34.28 ms CUDA * 6, 1161.00 loss, 0.000214 LR, 6.46 GB used, 37206.37 GFLOPS, 2014.56 GOPS
5 53.95 ms run, 19.72 ms python, 34.23 ms CUDA * 6, 1156.00 loss, 0.000256 LR, 6.46 GB used, 37344.25 GFLOPS, 2014.56 GOPS
6 53.89 ms run, 19.87 ms python, 34.02 ms CUDA * 6, 1138.00 loss, 0.000299 LR, 6.46 GB used, 37382.56 GFLOPS, 2014.56 GOPS
7 52.90 ms run, 19.83 ms python, 33.07 ms CUDA * 6, 1114.00 loss, 0.000341 LR, 6.46 GB used, 38080.26 GFLOPS, 2014.56 GOPS
8 52.50 ms run, 19.78 ms python, 32.72 ms CUDA * 6, 1088.00 loss, 0.000384 LR, 6.46 GB used, 38370.74 GFLOPS, 2014.56 GOPS
9 52.45 ms run, 19.66 ms python, 32.79 ms CUDA * 6, 1074.00 loss, 0.000427 LR, 6.46 GB used, 38409.27 GFLOPS, 2014.56 GOPS
10 52.51 ms run, 19.75 ms python, 32.76 ms CUDA * 6, 1052.00 loss, 0.000469 LR, 6.46 GB used, 38366.61 GFLOPS, 2014.56 GOPS
11 52.59 ms run, 19.66 ms python, 32.93 ms CUDA * 6, 1033.00 loss, 0.000512 LR, 6.46 GB used, 38309.25 GFLOPS, 2014.56 GOPS
Not bad!
Will need faster Python for BEAM.
batman@tiny17:~/tinygrad$ BEAM=2 NCCL_SHM_USE_CUDA_MEMCPY=1 CUDA=1 HALF=1 STEPS=350 BS=1536 GPUS=6 TARGET_EVAL_ACC_PCT=93.5 python3 examples/hlb_cifar10.py
shuffling training dataset in 1453.00 ms (epoch=0)
0 36859.14 ms run, 36853.59 ms python, 5.54 ms CUDA * 6, 1197.00 loss, 0.000043 LR, 0.56 GB used, 54.83 GFLOPS, 2020.84 GOPS
1 1874.02 ms run, 1870.64 ms python, 3.38 ms CUDA * 6, 1195.00 loss, 0.000085 LR, 6.45 GB used, 1077.36 GFLOPS, 2018.99 GOPS
2 36.17 ms run, 23.79 ms python, 12.39 ms CUDA * 6, 1178.00 loss, 0.000128 LR, 6.46 GB used, 55814.30 GFLOPS, 2018.99 GOPS
3 34.00 ms run, 20.64 ms python, 13.36 ms CUDA * 6, 1163.00 loss, 0.000171 LR, 6.46 GB used, 59382.80 GFLOPS, 2018.99 GOPS
4 33.83 ms run, 20.62 ms python, 13.21 ms CUDA * 6, 1161.00 loss, 0.000214 LR, 6.46 GB used, 59672.46 GFLOPS, 2018.99 GOPS
5 34.46 ms run, 21.03 ms python, 13.43 ms CUDA * 6, 1156.00 loss, 0.000256 LR, 6.46 GB used, 58596.79 GFLOPS, 2018.99 GOPS
6 34.05 ms run, 20.72 ms python, 13.33 ms CUDA * 6, 1138.00 loss, 0.000299 LR, 6.46 GB used, 59301.07 GFLOPS, 2018.99 GOPS
7 33.90 ms run, 20.61 ms python, 13.30 ms CUDA * 6, 1114.00 loss, 0.000341 LR, 6.46 GB used, 59550.21 GFLOPS, 2018.99 GOPS
8 34.07 ms run, 20.57 ms python, 13.49 ms CUDA * 6, 1088.00 loss, 0.000384 LR, 6.46 GB used, 59268.52 GFLOPS, 2018.99 GOPS
9 33.77 ms run, 20.72 ms python, 13.05 ms CUDA * 6, 1074.00 loss, 0.000427 LR, 6.46 GB used, 59792.49 GFLOPS, 2018.99 GOPS
batman@tiny17:~/tinygrad$ BEAM=4 BENCHMARK=10 BS=768 GPUS=6 MODEL=resnet python3 examples/mlperf/model_train.py
training resnet
Training on ['CUDA:0', 'CUDA:1', 'CUDA:2', 'CUDA:3', 'CUDA:4', 'CUDA:5']
training with batch size 768 for 41 epochs
0 85994.66 ms run, 85907.01 ms python, 70.07 ms fetch data, 17.59 ms CUDA * 6, 7.01 loss, 0.00 acc, 0.000382 LR, 1.96 GB used, 221.72 GFLOPS
1 10078.31 ms run, 9939.01 ms python, 94.80 ms fetch data, 44.50 ms CUDA * 6, 6.97 loss, 0.00 acc, 0.000764 LR, 113.21 GB used, 1891.90 GFLOPS
2 491.22 ms run, 255.31 ms python, 71.17 ms fetch data, 164.73 ms CUDA * 6, 7.02 loss, 0.00 acc, 0.001147 LR, 113.21 GB used, 38816.08 GFLOPS
3 502.95 ms run, 238.99 ms python, 96.57 ms fetch data, 167.39 ms CUDA * 6, 7.02 loss, 0.00 acc, 0.001529 LR, 113.21 GB used, 37910.92 GFLOPS
4 496.84 ms run, 235.32 ms python, 99.08 ms fetch data, 162.43 ms CUDA * 6, 7.01 loss, 0.00 acc, 0.001911 LR, 113.21 GB used, 38376.80 GFLOPS
5 497.84 ms run, 236.81 ms python, 98.95 ms fetch data, 162.07 ms CUDA * 6, 7.04 loss, 0.00 acc, 0.002293 LR, 113.21 GB used, 38300.13 GFLOPS
6 499.71 ms run, 240.46 ms python, 102.60 ms fetch data, 156.65 ms CUDA * 6, 7.01 loss, 0.00 acc, 0.002675 LR, 113.21 GB used, 38156.29 GFLOPS
7 473.76 ms run, 244.13 ms python, 75.74 ms fetch data, 153.89 ms CUDA * 6, 7.05 loss, 0.00 acc, 0.003058 LR, 113.21 GB used, 40246.31 GFLOPS
8 474.19 ms run, 242.38 ms python, 74.85 ms fetch data, 156.97 ms CUDA * 6, 7.03 loss, 0.00 acc, 0.003440 LR, 113.21 GB used, 40209.80 GFLOPS
9 472.85 ms run, 239.58 ms python, 76.45 ms fetch data, 156.83 ms CUDA * 6, 7.03 loss, 0.00 acc, 0.003822 LR, 113.21 GB used, 40323.91 GFLOPS
Estimated training time: 9h27m
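Rough sanity check on that estimate (my own back-of-the-envelope, assuming ImageNet-1k's 1,281,167 training images and the ~0.50 s steady-state step time above; the script may compute it differently):

```python
images, bs, epochs = 1_281_167, 768, 41
step_s = 0.50                            # roughly the steady-state step time above
steps_per_epoch = images // bs           # 1668
print(steps_per_epoch * epochs * step_s / 3600)  # ~9.5 h, close to the reported 9h27m
```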
Btw, does NCCL_SHM_USE_CUDA_MEMCPY=1 matter? We're not using NCCL, are we?
Changes
Name                            Lines  Diff  Tokens/Line  Diff
------------------------------  -----  ----  -----------  ----
tinygrad/features/jit.py          140    +1         15.3  -0.0
tinygrad/runtime/graph/cuda.py     78   +14         17.8  +0.6

total lines changed: +15
Yeah, NCCL_SHM_USE_CUDA_MEMCPY seems to be useless. I will take a look at the Python time; it looks really high in resnet. And we are at 6499 lines.