DLRM benchmark test
DLRM benchmark test scripts.
About the following options:
export CUDA_DEVICE_MAX_CONNECTIONS=32
export ONEFLOW_EP_CUDA_STREAM_FLAGS=1
export ONEFLOW_RAW_READER_PREFETCHING_QUEUE_DEPTH=512
export ONEFLOW_RAW_READER_NUM_WORKERS=1
export LD_PRELOAD=/usr/lib64/libjemalloc.so.1
numactl --interleave=all \
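For repeating the ON/OFF comparison, the options above could be toggled from a small launcher instead of being edited by hand. The following is only a sketch: `build_launch` and the `train.py` command in the usage note are hypothetical names, not part of the benchmark scripts, and it assumes that "ON" means all five environment variables plus the `numactl` prefix are applied together.

```python
import os

# Option values copied from the list above.
TUNING_ENV = {
    "CUDA_DEVICE_MAX_CONNECTIONS": "32",
    "ONEFLOW_EP_CUDA_STREAM_FLAGS": "1",
    "ONEFLOW_RAW_READER_PREFETCHING_QUEUE_DEPTH": "512",
    "ONEFLOW_RAW_READER_NUM_WORKERS": "1",
    "LD_PRELOAD": "/usr/lib64/libjemalloc.so.1",
}
NUMACTL_PREFIX = ["numactl", "--interleave=all"]


def build_launch(train_cmd, enabled, base_env=None):
    """Return (argv, env) for one benchmark run with the options ON or OFF.

    Hypothetical helper: train_cmd is whatever command starts the benchmark.
    """
    env = dict(os.environ if base_env is None else base_env)
    if enabled:
        env.update(TUNING_ENV)
        argv = NUMACTL_PREFIX + list(train_cmd)
    else:
        # Make sure an OFF run does not inherit the options from the shell.
        for name in TUNING_ENV:
            env.pop(name, None)
        argv = list(train_cmd)
    return argv, env
```

A run could then be started with `argv, env = build_launch(["python3", "train.py"], enabled=True)` followed by `subprocess.run(argv, env=env, check=True)`, where `train.py` stands in for the real training entry point.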
A set of experiments was run with the options on and off. Each entry below is the average latency (ms) over 74000 iterations:
| ON | OFF |
|---|---|
| 1.41855692 | 1.44409019 |
| 1.42942288 | 1.43027312 |
| 1.42626776 | 1.43327031 |
| 1.43100398 | 1.43726633 |
| 1.43247646 | 1.43108837 |
| 1.43085669 | 1.4360571 |
| 1.4250376 | 1.43052549 |
| 1.4246417 | 1.44208097 |
| 1.42638928 | 1.43673026 |
| 1.43390266 | 1.43774178 |
| 1.42238418 | 1.43597748 |
| 1.43701162 | 1.43563187 |
| 1.42529816 | 1.43994857 |
| 1.42365005 | 1.43631018 |
| 1.43174504 | 1.43489774 |
| 1.42973357 | 1.43393828 |
| 1.4347752 | |
| 1.43040477 | |
Summary statistics (ms):
| | ON | OFF |
|---|---|---|
| mean | 1.4285 | 1.4360 |
| max | 1.4370 | 1.4441 |
| min | 1.4186 | 1.4303 |
| std | 0.0048 | 0.0039 |
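With all of the options enabled, the mean latency improves by only about 8 us (1.4285 ms vs. 1.4360 ms). The gain is negligible, so these options are not kept for now.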
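The summary statistics can be recomputed from the raw per-run latencies with a few lines of Python. This is a minimal sketch that assumes the `std` row is the sample standard deviation (n-1 denominator), which is what `statistics.stdev` computes. The helper name `summarize` is illustrative only.

```python
from statistics import mean, stdev

# Per-run average latency in ms, copied from the results table.
on = [
    1.41855692, 1.42942288, 1.42626776, 1.43100398, 1.43247646, 1.43085669,
    1.4250376, 1.4246417, 1.42638928, 1.43390266, 1.42238418, 1.43701162,
    1.42529816, 1.42365005, 1.43174504, 1.42973357, 1.4347752, 1.43040477,
]
off = [
    1.44409019, 1.43027312, 1.43327031, 1.43726633, 1.43108837, 1.4360571,
    1.43052549, 1.44208097, 1.43673026, 1.43774178, 1.43597748, 1.43563187,
    1.43994857, 1.43631018, 1.43489774, 1.43393828,
]


def summarize(samples):
    """Return mean, max, min and sample standard deviation of the runs."""
    return {
        "mean": mean(samples),
        "max": max(samples),
        "min": min(samples),
        "std": stdev(samples),  # sample std (n-1), an assumption about the table
    }


stats_on = summarize(on)
stats_off = summarize(off)

# Difference of the means, converted from ms to us.
delta_us = (stats_off["mean"] - stats_on["mean"]) * 1000.0

for name, stats in (("ON", stats_on), ("OFF", stats_off)):
    print(name, {key: round(value, 4) for key, value in stats.items()})
print(f"mean improvement with options ON: {delta_us:.1f} us")
```

The two groups have different sizes (18 runs with the options on, 16 with them off), so each list is summarized separately and only the means are compared.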