TensorRT issue on RTX5090
Hi, I downloaded the latest KataGo v1.15.3. OpenCL works fine, and I have the latest NVIDIA driver. But when I try to use TensorRT, it fails with the error below:
Running quick initial benchmark at 16 threads!
2025-03-29 06:54:25+0800: nnRandSeed0 = 5406967903729009098
2025-03-29 06:54:25+0800: After dedups: nnModelFile0 = ..\katago_weights\b28c512nbt.bin.gz useFP16 auto useNHWC auto
2025-03-29 06:54:25+0800: Initializing neural net buffer to be size 19 * 19 exactly
2025-03-29 06:54:27+0800: TensorRT backend thread 0: Found GPU NVIDIA GeForce RTX 5090 memory 34190458880 compute capability major 12 minor 0
2025-03-29 06:54:27+0800: TensorRT backend thread 0: Initializing (may take a long time)
2025-03-29 06:54:35+0800: Creating new timing cache
2025-03-29 06:54:35+0800: TensorRT backend: 2: [helpers.h::nvinfer1::smVerHex2Dig::694] Error Code 2: Internal Error (Assertion major >= 0 && major < 10 failed. )
I tried both the trt8.6.1-cuda12.1 and trt10.2.0-cuda12.5 builds and I'm not sure why it fails. I also tried the older katago v1.14.0-trt8.6.1-cuda12.1, which doesn't work either.
Please help look into this. Maybe it is something related to my setup.
Can confirm that TensorRT doesn't work on 5090, except I get a different error:
2025-03-29 10:49:53+0200: Loading model and initializing benchmark...
2025-03-29 10:49:53+0200: Testing with default positions for board size: 19
2025-03-29 10:49:53+0200: nnRandSeed0 = 13233354023296052901
2025-03-29 10:49:53+0200: After dedups: nnModelFile0 = kata1-b28c512nbt-s8326494464-d4628051565.bin.gz useFP16 auto useNHWC auto
2025-03-29 10:49:53+0200: Initializing neural net buffer to be size 19 * 19 exactly
2025-03-29 10:49:55+0200: TensorRT backend thread 0: Found GPU NVIDIA GeForce RTX 5090 memory 34190458880 compute capability major 12 minor 0
2025-03-29 10:49:55+0200: TensorRT backend thread 0: Initializing (may take a long time)
2025-03-29 10:49:56+0200: Creating new timing cache
2025-03-29 10:49:56+0200: TensorRT backend: [convBaseBuilder.cpp::nvinfer1::builder::CaskConvBaseBuilder<class nvinfer1::rt::task::CaskConvolutionRunner,-2147483639>::addGenericTactic::1672] Error Code 2: Internal Error (Assertion genericShader != nullptr failed. )
FWIW I just got done testing a 5090 on runpod and it seemed to work just fine using TRT 10.9.0.34-1 and cuda12.8.1, the container was based on Ubuntu 22.04 and I had cloned the master branch as of a few minutes before this comment.
2025-03-29 22:11:10+0000: TensorRT backend thread 0: Found GPU NVIDIA GeForce RTX 5090 memory 33680457728 compute capability major 12 minor 0
2025-03-29 22:11:10+0000: TensorRT backend thread 0: Initializing (may take a long time)
2025-03-29 22:11:11+0000: Using existing timing cache at /root/.katago/trtcache/trt-100900_gpu-35714f42_tune-7ff0dbc1faaa_exact19x19_batch32_fp16
2025-03-29 22:11:15+0000: TensorRT backend thread 0: Model version 14 useFP16 = true
2025-03-29 22:11:15+0000: TensorRT backend thread 0: Model name: kata1-b18c384nbt-s9996604416-d4316597426
Possible numbers of threads to test: 1, 2, 3, 4, 5, 6, 8, 10, 12, 16, 20, 24, 32,
numSearchThreads = 5: 10 / 10 positions, visits/s = 1019.24 nnEvals/s = 867.48 nnBatches/s = 348.41 avgBatchSize = 2.49 (7.9 secs)
numSearchThreads = 12: 10 / 10 positions, visits/s = 2315.08 nnEvals/s = 1970.47 nnBatches/s = 333.17 avgBatchSize = 5.91 (3.5 secs)
numSearchThreads = 10: 10 / 10 positions, visits/s = 1886.52 nnEvals/s = 1626.09 nnBatches/s = 328.58 avgBatchSize = 4.95 (4.3 secs)
numSearchThreads = 20: 10 / 10 positions, visits/s = 3425.14 nnEvals/s = 3011.77 nnBatches/s = 308.25 avgBatchSize = 9.77 (2.4 secs)
numSearchThreads = 16: 10 / 10 positions, visits/s = 2900.09 nnEvals/s = 2486.50 nnBatches/s = 316.33 avgBatchSize = 7.86 (2.8 secs)
numSearchThreads = 24: 10 / 10 positions, visits/s = 4010.84 nnEvals/s = 3544.75 nnBatches/s = 301.62 avgBatchSize = 11.75 (2.0 secs)
numSearchThreads = 32: 10 / 10 positions, visits/s = 4767.85 nnEvals/s = 4317.85 nnBatches/s = 281.04 avgBatchSize = 15.36 (1.7 secs)
...
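For anyone wanting to reproduce this, here is a rough sketch of such a Linux source build against TensorRT 10.9 / CUDA 12.8. These are not necessarily the exact commands used above; -DUSE_BACKEND=TENSORRT is KataGo's documented backend switch, but any extra hints needed to point CMake at a non-default TensorRT/CUDA location are omitted, so check cpp/CMakeLists.txt and the official compile instructions for the authoritative flags:

git clone https://github.com/lightvector/KataGo.git
cd KataGo/cpp
# USE_BACKEND=TENSORRT selects the TensorRT backend; add path hints for your TRT/CUDA install if needed.
cmake . -DUSE_BACKEND=TENSORRT -DCMAKE_BUILD_TYPE=Release
make -j"$(nproc)"
# Quick sanity check, with a placeholder network name as in the benchmark commands elsewhere in this thread.
./katago benchmark -model <NEURALNET>.bin.gz -config configs/gtp_example.cfg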
The official TensorRT backend build you provide is not supported on NVIDIA RTX 50-series graphics cards. It needs to be compiled against TensorRT 10.9 or newer and CUDA 12.8 or newer to work properly. We recommend either waiting for an updated release within the next few weeks to resolve this issue, or compiling a custom build specifically for this hardware configuration. @lightvector
Thank you for your suggestion. I will try the version you mentioned above. Hopefully it will run smoothly on Win11 too.
I tried building katago from source on Windows 11 with Visual Studio 2022, using TensorRT 10.9 and CUDA 12.8 Update 1. I built zlib with:
git clone https://github.com/microsoft/vcpkg.git
cd .\vcpkg\
.\bootstrap-vcpkg.bat
.\vcpkg.exe install zlib:x64-windows
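For reference, the configure step that generates the solution might look roughly like this. Treat it as a sketch rather than the exact command used: the generator name is the standard Visual Studio 2022 one, the vcpkg and KataGo paths are placeholders, and you may also need to point CMake at your CUDA/TensorRT installs (check cpp/CMakeLists.txt for the exact variable names):

cd C:\path\to\KataGo\cpp
REM Placeholder paths; USE_BACKEND=TENSORRT selects the TensorRT backend.
cmake . -G "Visual Studio 17 2022" -A x64 -DUSE_BACKEND=TENSORRT -DCMAKE_TOOLCHAIN_FILE=C:\path\to\vcpkg\scripts\buildsystems\vcpkg.cmake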
After generating the MSVC 2022 solution, opening it, and building, I get some warnings:
Warning C4267: '=': conversion from 'size_t' to '_Ty' (with _Ty=int), possible loss of data, in C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.40.33807\include\numeric, line 37
Warning C4715: 'ModelParser::buildActivationLayer': not all control paths return a value, in C:\Users\myusername\Source\Repos\KataGo\cpp\neuralnet\trtbackend.cpp, line 844
Warning C4551: function call missing argument list, in C:\Users\myusername\Source\Repos\KataGo\cpp\tests\testnnevalcanary.cpp, line 609
Warning C4551: function call missing argument list, in C:\Users\myusername\Source\Repos\KataGo\cpp\tests\testnnevalcanary.cpp, line 729
Warning LNK4098: defaultlib 'LIBCMT' conflicts with use of other libs; use /NODEFAULTLIB:library, in C:\Users\myusername\Source\Repos\KataGo\bin\LINK, line 1
Compiling the Release version and running it fails: it just immediately quits with no output. The Debug version, compiled from the same project, works without issues.
EDIT: Fixed the Release build by removing old DLLs from the KataGo folder. I also linked against cudart.lib (instead of cudart_static.lib) to get rid of the linker warning.
Can you release a new fixed version? The current one doesn't seem to support NVIDIA RTX 50-series graphics cards.
Hi @tterava, I encountered the same issue as you did with win11 + msvc22 + trt 10.9 + cuda 12.8.1, except that in my case both the Release and Debug versions quit silently. Could you explain a bit more how you fixed the issue, like which DLLs should be removed?
It would also be great if you could share the compiled exe.
Thanks a lot!
@gty929 If you're only interested in self play, then you only need zlib1.dll in your Katago folder. If you changed Katago to also use cudart.lib instead of cudart_static.lib (which was what CMake linked to), then you need to go to project properties (Project Properties -> Linker -> Input -> Additional Dependencies) in Visual Studio and change it. I don't think that's necessary though.
Hi, thanks for the reply, but I cannot find zlib1.dll in the folder, only libz.dll. It also needs libcrypto-3-x64.dll, libssl-3-x64.dll, and libzip.dll to run, and it still quits silently as @gty929 mentioned.
Do you have a packaged build you could share to solve this?
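In case it helps with tracking down the silent exit: a generic way to see which DLLs the exe actually expects (this is a standard MSVC tool, not a KataGo-specific step) is to run dumpbin from a Visual Studio Developer Command Prompt:

REM Lists the DLLs katago.exe links against; a missing one is a common reason an exe won't start.
dumpbin /DEPENDENTS katago.exe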
I also built with TRT 10.9. The Debug version runs, but analysis is too slow (800/s on a 5080, with GPU power draw only around 80 W), and everything I built except Debug cannot be started. Is there a correct build someone can share, or any idea where the problem might be?
For those with RTX 5080/5090 GPUs, is TensorRT working now with a recent default KataGo release (such as 1.16.3)? If so, what performance do you get?
The latest version is now operational, with performance meeting expectations (though TRT10 may require further optimization).
Performance varies by position, but I would say it's somewhere between 5k and 10k positions per second, using 138 threads (the best I found). It very rarely goes below 5k/s.
Cool, thanks! Would you mind testing "katago.exe benchmark -model <NEURALNET>.bin.gz -config gtp_custom.cfg" with the latest b28 model? What results do you get?
On my 4090 (at 100% power), using katago-v1.16.3-trt10.9.0-cuda12.8-windows-x64, I got this:
numSearchThreads = 5: 10 / 10 positions, visits/s = 575.53 nnEvals/s = 483.03 nnBatches/s = 194.01 avgBatchSize = 2.49 (14.0 secs) (EloDiff baseline)
numSearchThreads = 10: 10 / 10 positions, visits/s = 1102.73 nnEvals/s = 947.07 nnBatches/s = 191.95 avgBatchSize = 4.93 (7.3 secs) (EloDiff +230)
numSearchThreads = 12: 10 / 10 positions, visits/s = 1277.31 nnEvals/s = 1091.44 nnBatches/s = 184.61 avgBatchSize = 5.91 (6.3 secs) (EloDiff +281)
numSearchThreads = 16: 10 / 10 positions, visits/s = 1435.37 nnEvals/s = 1233.47 nnBatches/s = 157.47 avgBatchSize = 7.83 (5.7 secs) (EloDiff +318)
numSearchThreads = 20: 10 / 10 positions, visits/s = 1711.66 nnEvals/s = 1457.63 nnBatches/s = 150.03 avgBatchSize = 9.72 (4.8 secs) (EloDiff +379)
numSearchThreads = 24: 10 / 10 positions, visits/s = 1793.10 nnEvals/s = 1567.40 nnBatches/s = 135.46 avgBatchSize = 11.57 (4.6 secs) (EloDiff +391)
numSearchThreads = 32: 10 / 10 positions, visits/s = 2152.22 nnEvals/s = 1919.07 nnBatches/s = 124.90 avgBatchSize = 15.36 (3.8 secs) (EloDiff +451)
numSearchThreads = 40: 10 / 10 positions, visits/s = 2347.47 nnEvals/s = 2157.87 nnBatches/s = 112.36 avgBatchSize = 19.20 (3.6 secs) (EloDiff +476)
numSearchThreads = 64: 10 / 10 positions, visits/s = 2767.55 nnEvals/s = 2657.12 nnBatches/s = 88.22 avgBatchSize = 30.12 (3.1 secs) (EloDiff +515)
numSearchThreads = 80: 10 / 10 positions, visits/s = 3002.41 nnEvals/s = 2921.86 nnBatches/s = 76.76 avgBatchSize = 38.06 (2.9 secs) (EloDiff +533)
numSearchThreads = 96: 10 / 10 positions, visits/s = 3087.37 nnEvals/s = 3012.53 nnBatches/s = 63.70 avgBatchSize = 47.30 (2.9 secs) (EloDiff +529)
numSearchThreads = 128: 10 / 10 positions, visits/s = 2976.62 nnEvals/s = 2947.39 nnBatches/s = 43.52 avgBatchSize = 67.72 (3.1 secs) (EloDiff +481)
The proposed optimum is 80 threads.
Here you go. I used the following command to skip most of the low thread counts and to extend the test a little for more accurate results:
katago benchmark -config gtp_ponder.cfg -model kata1-b28c512nbt-s9584861952-d4960414494.bin.gz -t 32,64,80,96,128,138,144,160 -v 20000
2025-07-11 13:11:32+0300: Running with following config:
allowResignation = false
conservativePass = false
lagBuffer = 0.0
logAllGTPCommunication = false
logDir = gtp_logs
logSearchInfo = true
logSearchInfoForChosenMove = false
logToStderr = true
maxTimePondering = 60.0
nnCacheSizePowerOfTwo = 26
numSearchThreads = 138
ponderingEnabled = true
resignConsecTurns = 3
resignThreshold = -0.90
rules = japanese
searchFactorAfterOnePass = 0.50
searchFactorAfterTwoPass = 0.30
searchFactorWhenWinning = 0.30
searchFactorWhenWinningThreshold = 0.95
2025-07-11 13:11:32+0300: Loading model and initializing benchmark...
2025-07-11 13:11:32+0300: Testing with default positions for board size: 19
2025-07-11 13:11:32+0300: nnRandSeed0 = 8956017151185888576
2025-07-11 13:11:32+0300: After dedups: nnModelFile0 = kata1-b28c512nbt-s9584861952-d4960414494.bin.gz useFP16 auto useNHWC auto
2025-07-11 13:11:32+0300: Initializing neural net buffer to be size 19 * 19 exactly
2025-07-11 13:11:34+0300: TensorRT backend thread 0: Found GPU NVIDIA GeForce RTX 5090 memory 34190458880 compute capability major 12 minor 0
2025-07-11 13:11:34+0300: TensorRT backend thread 0: Initializing (may take a long time)
2025-07-11 13:11:36+0300: Creating new timing cache (usingFP16=true 19x19 maxBatchSizeLimit=160)
2025-07-11 13:12:43+0300: Saved new timing cache to C:\Users\\Desktop\katago/KataGoData/trtcache/trt-101200_gpu-35714f42_tune-63eb38c37b0b_exact19x19_batch160_fp16
2025-07-11 13:12:46+0300: TensorRT backend thread 0: Model version 15 useFP16 = true
2025-07-11 13:12:46+0300: TensorRT backend thread 0: Model name: kata1-b28c512nbt-s9584861952-d4960414494
2025-07-11 13:12:46+0300: Loaded config gtp_ponder.cfg
2025-07-11 13:12:46+0300: Loaded model kata1-b28c512nbt-s9584861952-d4960414494.bin.gz
Testing using 20000 visits.
Your GTP config is currently set to trtUseFP16 = auto
Your GTP config is currently set to use numSearchThreads = 138
Testing different numbers of threads (board size 19x19):
numSearchThreads = 32: 10 / 10 positions, visits/s = 2866.60 nnEvals/s = 1838.94 nnBatches/s = 115.16 avgBatchSize = 15.97 (69.9 secs) (EloDiff baseline)
numSearchThreads = 64: 10 / 10 positions, visits/s = 4116.44 nnEvals/s = 2511.61 nnBatches/s = 70.24 avgBatchSize = 35.76 (48.7 secs) (EloDiff +114)
numSearchThreads = 80: 10 / 10 positions, visits/s = 5117.78 nnEvals/s = 3107.10 nnBatches/s = 62.22 avgBatchSize = 49.94 (39.2 secs) (EloDiff +191)
numSearchThreads = 96: 10 / 10 positions, visits/s = 5044.99 nnEvals/s = 3109.02 nnBatches/s = 47.15 avgBatchSize = 65.94 (39.8 secs) (EloDiff +174)
numSearchThreads = 128: 10 / 10 positions, visits/s = 5499.19 nnEvals/s = 3426.42 nnBatches/s = 36.02 avgBatchSize = 95.13 (36.6 secs) (EloDiff +190)
numSearchThreads = 138: 10 / 10 positions, visits/s = 5610.33 nnEvals/s = 3644.22 nnBatches/s = 34.72 avgBatchSize = 104.96 (35.9 secs) (EloDiff +193)
numSearchThreads = 144: 10 / 10 positions, visits/s = 5801.26 nnEvals/s = 3635.59 nnBatches/s = 32.86 avgBatchSize = 110.63 (34.7 secs) (EloDiff +204)
numSearchThreads = 160: 10 / 10 positions, visits/s = 5642.06 nnEvals/s = 3627.64 nnBatches/s = 29.37 avgBatchSize = 123.53 (35.7 secs) (EloDiff +182)
Based on some test data, each speed doubling gains perhaps ~250 Elo by searching deeper.
Based on some test data, each thread costs perhaps 7 Elo if using 800 visits, and 2 Elo if using 5000 visits (by making MCTS worse).
So APPROXIMATELY based on this benchmark, if you intend to do a 5 second search:
numSearchThreads = 32: (baseline)
numSearchThreads = 64: +114 Elo
numSearchThreads = 80: +191 Elo
numSearchThreads = 96: +174 Elo
numSearchThreads = 128: +190 Elo
numSearchThreads = 138: +193 Elo
numSearchThreads = 144: +204 Elo (recommended)
numSearchThreads = 160: +182 Elo
OK, so I ran with -v 20000 and I get this (on a 4090, vs the 5090 for you):
Your GTP config is currently set to use numSearchThreads = 40
Testing different numbers of threads (board size 19x19):
numSearchThreads = 56: 10 / 10 positions, visits/s = 3974.51 nnEvals/s = 2464.90 nnBatches/s = 88.29 avgBatchSize = 27.92 (50.5 secs) (EloDiff baseline)
numSearchThreads = 60: 10 / 10 positions, visits/s = 3717.27 nnEvals/s = 2422.21 nnBatches/s = 80.94 avgBatchSize = 29.93 (54.0 secs) (EloDiff -30)
numSearchThreads = 64: 10 / 10 positions, visits/s = 3735.42 nnEvals/s = 2431.47 nnBatches/s = 76.01 avgBatchSize = 31.99 (53.7 secs) (EloDiff -31)
numSearchThreads = 68: 10 / 10 positions, visits/s = 4088.60 nnEvals/s = 2543.62 nnBatches/s = 74.39 avgBatchSize = 34.19 (49.1 secs) (EloDiff +2)
numSearchThreads = 72: 10 / 10 positions, visits/s = 4209.38 nnEvals/s = 2650.41 nnBatches/s = 72.58 avgBatchSize = 36.52 (47.7 secs) (EloDiff +11)
numSearchThreads = 76: 10 / 10 positions, visits/s = 4141.35 nnEvals/s = 2676.58 nnBatches/s = 68.02 avgBatchSize = 39.35 (48.5 secs) (EloDiff +1)
numSearchThreads = 80: 10 / 10 positions, visits/s = 3952.12 nnEvals/s = 2514.63 nnBatches/s = 60.00 avgBatchSize = 41.91 (50.8 secs) (EloDiff -21)
Based on some test data, each speed doubling gains perhaps ~250 Elo by searching deeper.
Based on some test data, each thread costs perhaps 7 Elo if using 800 visits, and 2 Elo if using 5000 visits (by making MCTS worse).
So APPROXIMATELY based on this benchmark, if you intend to do a 5 second search:
numSearchThreads = 56: (baseline)
numSearchThreads = 60: -30 Elo
numSearchThreads = 64: -31 Elo
numSearchThreads = 68: +2 Elo
numSearchThreads = 72: +11 Elo (recommended)
numSearchThreads = 76: +1 Elo
numSearchThreads = 80: -21 Elo
So basically it's optimal at 72 threads on the 4090 and 144 threads on the 5090. I get about 4200 visits/s and 2650 nnEvals/s, while you get about 5800 visits/s and 3640 nnEvals/s, so the 5090 is roughly 35-40% faster than the 4090 (with 2x as many threads). Thanks a lot for the test!