john icon indicating copy to clipboard operation
john copied to clipboard

POCL tests sometimes auto-tune to LWS much higher than manual maximum, then segfault

Open frank-dittrich opened this issue 4 years ago • 0 comments

Unfortunately, not reproducible reliably, but sometimes I get (Debug build, because I wanted to analyze the core files)

$ ./john --test --format=opencl --devices=2 --verbosity=6 ; echo $?
initUnicode(UNICODE, RAW/RAW)
RAW -> RAW -> RAW
NOTE: This is a debug build, speed will be lower than normal
[...]
0: No OpenCL devices were found on platform #2: CL_DEVICE_NOT_FOUND (-1)
Benchmarking: KeePass-opencl [SHA256 AES/Twofish/ChaCha OpenCL]... Loaded 9 hashes with 9 different salts to test db from test vectors
TestDB LWS=7 GWS=49 (7 blocks) PASS, 
Test mask: ?a?a?l?u?d?d?s
Calculating best GWS for LWS=8; max. 100 ms single kernel invocation.
Raw speed figures including buffer transfers:
Tuning to length 7
xfer: 4.554 us, init: 86.611 us, loop: 60x1.374 ms, final: 650.639 us, xfer: 509 ns
gws:       256     3075 c/s    18450000 rounds/s   83.233 ms per crypt_all()!
Tuning to length 7
xfer: 7.764 us, init: 84.833 us, loop: 60x2.607 ms, final: 2.573 ms, xfer: 866 ns
gws:       512     3217 c/s    19302000 rounds/s  159.123 ms per crypt_all()+
Tuning to length 7
xfer: 15.631 us, init: 135.513 us, loop: 60x5.397 ms, final: 2.721 ms, xfer: 1.048 us
gws:      1024     3134 c/s    18804000 rounds/s  326.697 ms per crypt_all()
Tuning to length 7
xfer: 30.317 us, init: 247.395 us, loop: 60x10.525 ms, final: 4.768 ms, xfer: 1.181 us
gws:      2048     3217 c/s    19302000 rounds/s  636.570 ms per crypt_all()
Tuning to length 7
xfer: 50.398 us, init: 479.622 us, loop: 60x21.356 ms, final: 9.467 ms, xfer: 2.582 us
gws:      4096     3171 c/s    19026000 rounds/s     1.291 s per crypt_all()
Tuning to length 7
xfer: 105.530 us, init: 941.412 us, loop: 60x41.374 ms, final: 19.767 ms, xfer: 3.365 us
gws:      8192     3272 c/s    19632000 rounds/s     2.503 s per crypt_all()+
Tuning to length 7
xfer: 180.491 us, init: 1.737 ms, loop: 60x91.639 ms, final: 39.405 ms, xfer: 7.042 us
gws:     16384     2957 c/s    17742000 rounds/s     5.539 s per crypt_all()
Tuning to length 7
xfer: 374.219 us, init: 3.867 ms, loop: 60x176.684 ms (exceeds 100 ms)
Tuning to length 7
xfer: 51.386 us, init: 578.297 us, loop: 60x21.946 ms, final: 10.774 ms, xfer: 16.653 us
gws:      4096     3083 c/s    18498000 rounds/s     1.328 s per crypt_all()-
Calculating best LWS for GWS=8192
Tuning to length 7
Testing LWS=8 GWS=8192 ... 222.740 ms+
Testing LWS=16 GWS=8192 ... 225.743 ms
Testing LWS=32 GWS=8192 ... 227.079 ms
Testing LWS=64 GWS=8192 ... 228.159 ms
Testing LWS=128 GWS=8192 ... 230.313 ms
Testing LWS=256 GWS=8192 ... 230.493 ms
Testing LWS=512 GWS=8192 ... 229.185 ms
Testing LWS=1024 GWS=8192 ... 229.409 ms
Testing LWS=2048 GWS=8192 ...Segmentation fault (core dumped)
139

When I try to force LWS=2048 GWS=8192, I get

$ LWS=2048 GWS=8192 ./john --test --format=keepass-opencl --verbosity=6
initUnicode(UNICODE, RAW/RAW)
RAW -> RAW -> RAW
NOTE: This is a debug build, speed will be lower than normal
0: No OpenCL devices were found on platform #2: CL_DEVICE_NOT_FOUND (-1)
Device 1: Intel(R) UHD Graphics 620 [0x5917]
Benchmarking: KeePass-opencl [SHA256 AES/Twofish/ChaCha OpenCL]... Loaded 9 hashes with 9 different salts to test db from test vectors
TestDB LWS=256 GWS=8192 (32 blocks) PASS, 
Test mask: ?a?a?l?u?d?d?s
TestDB LWS=256 GWS=8192 (32 blocks) DONE
Speed for cost 1 (iteration count) of 50000 and 6000, cost 2 (version) of 1 and 2, cost 3 (algorithm [0=AES 1=TwoFish 2=ChaCha]) of 0
Raw:	974 c/s real, 975 c/s virtual

OTOH, I also had a successful run where john ended up with GWS=16384:

$ ./john --test --format=keepass-opencl --verbosity=6
initUnicode(UNICODE, RAW/RAW)
RAW -> RAW -> RAW
NOTE: This is a debug build, speed will be lower than normal
0: No OpenCL devices were found on platform #2: CL_DEVICE_NOT_FOUND (-1)
Device 1: Intel(R) UHD Graphics 620 [0x5917]
Benchmarking: KeePass-opencl [SHA256 AES/Twofish/ChaCha OpenCL]... Loaded 9 hashes with 9 different salts to test db from test vectors
TestDB LWS=7 GWS=49 (7 blocks) PASS, 
Test mask: ?a?a?l?u?d?d?s
Calculating best GWS for LWS=8; max. 100 ms single kernel invocation.
Raw speed figures including buffer transfers:
Tuning to length 7
xfer: 5.250 us, init: 231 us, loop: 60x1.428 ms, final: 2.443 ms, xfer: 256 ns
gws:       256     2896 c/s    17376000 rounds/s   88.395 ms per crypt_all()!
Tuning to length 7
xfer: 7.666 us, init: 312 us, loop: 60x1.684 ms, final: 2.787 ms, xfer: 285 ns
gws:       512     4915 c/s    29490000 rounds/s  104.152 ms per crypt_all()+
Tuning to length 7
xfer: 9.666 us, init: 614.166 us, loop: 60x3.403 ms, final: 3.637 ms, xfer: 841 ns
gws:      1024     4912 c/s    29472000 rounds/s  208.446 ms per crypt_all()
Tuning to length 7
xfer: 20.416 us, init: 1.004 ms, loop: 60x7.593 ms, final: 7.968 ms, xfer: 986 ns
gws:      2048     4407 c/s    26442000 rounds/s  464.619 ms per crypt_all()
Tuning to length 7
xfer: 35.833 us, init: 1.638 ms, loop: 60x13.320 ms, final: 13.726 ms, xfer: 1.964 us
gws:      4096     5028 c/s    30168000 rounds/s  814.627 ms per crypt_all()+
Tuning to length 7
xfer: 69.166 us, init: 2.762 ms, loop: 60x26.559 ms, final: 25.441 ms, xfer: 3.107 us
gws:      8192     5051 c/s    30306000 rounds/s     1.621 s per crypt_all()
Tuning to length 7
xfer: 178.833 us, init: 4.624 ms, loop: 60x51.397 ms, final: 48.698 ms, xfer: 5.526 us
gws:     16384     5222 c/s    31332000 rounds/s     3.137 s per crypt_all()+
Tuning to length 7
xfer: 318.583 us, init: 7.998 ms, loop: 60x102.373 ms (exceeds 100 ms)
Tuning to length 7
xfer: 72.083 us, init: 2.606 ms, loop: 60x27.202 ms, final: 25.821 ms, xfer: 3.229 us
gws:      8192     4933 c/s    29598000 rounds/s     1.660 s per crypt_all()-
Calculating best LWS for GWS=16384
Tuning to length 7
Testing LWS=8 GWS=16384 ... 215.680 ms+
Testing LWS=16 GWS=16384 ... 215.938 ms
Testing LWS=32 GWS=16384 ... 218.990 ms
Testing LWS=64 GWS=16384 ... 218.721 ms
Testing LWS=128 GWS=16384 ... 214.170 ms+
Testing LWS=256 GWS=16384 ... 215.401 ms
Calculating best GWS for LWS=128; max. 200 ms single kernel invocation.
Raw speed figures including buffer transfers:
Tuning to length 7
xfer: 7 us, init: 236.916 us, loop: 60x1.595 ms, final: 2.731 ms, xfer: 243 ns
gws:       384     3891 c/s    23346000 rounds/s   98.685 ms per crypt_all()!
Tuning to length 7
xfer: 8.165 us, init: 414.416 us, loop: 60x2.102 ms, final: 3.468 ms, xfer: 374 ns
gws:       768     5905 c/s    35430000 rounds/s  130.056 ms per crypt_all()+
Tuning to length 7
xfer: 23.416 us, init: 790.583 us, loop: 60x6.810 ms, final: 6.677 ms, xfer: 813 ns
gws:      1536     3691 c/s    22146000 rounds/s  416.142 ms per crypt_all()
Tuning to length 7
xfer: 23.583 us, init: 1.301 ms, loop: 60x11.358 ms, final: 11.109 ms, xfer: 1.351 us
gws:      3072     4426 c/s    26556000 rounds/s  693.960 ms per crypt_all()
Tuning to length 7
xfer: 53.666 us, init: 2.146 ms, loop: 60x20.132 ms, final: 19.714 ms, xfer: 2.444 us
gws:      6144     4995 c/s    29970000 rounds/s     1.229 s per crypt_all()
Tuning to length 7
xfer: 111.333 us, init: 3.616 ms, loop: 60x40.716 ms, final: 38.433 ms, xfer: 5.263 us
gws:     12288     4944 c/s    29664000 rounds/s     2.485 s per crypt_all()
Tuning to length 7
xfer: 252.166 us, init: 6.197 ms, loop: 60x81.287 ms, final: 77.137 ms, xfer: 7.621 us
gws:     24576     4954 c/s    29724000 rounds/s     4.960 s per crypt_all()
Tuning to length 7
xfer: 512.166 us, init: 11.878 ms, loop: 60x160.554 ms, final: 151.274 ms, xfer: 51.811 us
gws:     49152     5017 c/s    30102000 rounds/s     9.796 s per crypt_all()
Tuning to length 7
xfer: 852.666 us, init: 23.343 ms, loop: 60x322.813 ms (exceeds 200 ms)
Tuning to length 7
xfer: 6.750 us, init: 276.166 us, loop: 60x1.691 ms, final: 2.578 ms, xfer: 290 ns
gws:       384     3679 c/s    22074000 rounds/s  104.366 ms per crypt_all()-
TestDB LWS=128 GWS=768 (6 blocks) DONE
Speed for cost 1 (iteration count) of 50000 and 6000, cost 2 (version) of 1 and 2, cost 3 (algorithm [0=AES 1=TwoFish 2=ChaCha]) of 0
Raw:	1055 c/s real, 1034 c/s virtual

Another run ended up with completely different GWS:

$ ./john --test --format=keepass-opencl --devices=2 --verbosity=6          
initUnicode(UNICODE, RAW/RAW)
RAW -> RAW -> RAW
NOTE: This is a debug build, speed will be lower than normal
0: No OpenCL devices were found on platform #2: CL_DEVICE_NOT_FOUND (-1)
Device 2: pthread-Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
Benchmarking: KeePass-opencl [SHA256 AES/Twofish/ChaCha OpenCL]... Loaded 9 hashes with 9 different salts to test db from test vectors
TestDB LWS=7 GWS=49 (7 blocks) PASS, 
Test mask: ?a?a?l?u?d?d?s
Calculating best GWS for LWS=8; max. 100 ms single kernel invocation.
Raw speed figures including buffer transfers:
Tuning to length 7
xfer: 3.482 us, init: 58.219 us, loop: 60x981.186 us, final: 1.366 ms, xfer: 524 ns
gws:       256     4245 c/s    25470000 rounds/s   60.299 ms per crypt_all()!
Tuning to length 7
xfer: 22.490 us, init: 72.043 us, loop: 60x1.998 ms, final: 1.147 ms, xfer: 640 ns
gws:       512     4226 c/s    25356000 rounds/s  121.139 ms per crypt_all()
Tuning to length 7
xfer: 12.065 us, init: 125.282 us, loop: 60x4.742 ms, final: 2.064 ms, xfer: 923 ns
gws:      1024     3570 c/s    21420000 rounds/s  286.756 ms per crypt_all()
Tuning to length 7
xfer: 24.508 us, init: 240.059 us, loop: 60x9.197 ms, final: 4.177 ms, xfer: 1.766 us
gws:      2048     3681 c/s    22086000 rounds/s  556.284 ms per crypt_all()
Tuning to length 7
xfer: 45.811 us, init: 405.103 us, loop: 60x17.860 ms, final: 9.064 ms, xfer: 5.447 us
gws:      4096     3788 c/s    22728000 rounds/s     1.081 s per crypt_all()
Tuning to length 7
xfer: 159.839 us, init: 808.868 us, loop: 60x35.681 ms, final: 16.044 ms, xfer: 3.024 us
gws:      8192     3796 c/s    22776000 rounds/s     2.157 s per crypt_all()
Tuning to length 7
xfer: 161.293 us, init: 1.681 ms, loop: 60x70.593 ms, final: 31.806 ms, xfer: 11.220 us
gws:     16384     3837 c/s    23022000 rounds/s     4.269 s per crypt_all()
Tuning to length 7
xfer: 401.729 us, init: 3.691 ms, loop: 60x142.534 ms (exceeds 100 ms)
Tuning to length 7
xfer: 4.399 us, init: 136.545 us, loop: 60x532.961 us, final: 298.040 us, xfer: 477 ns
gws:       128     3948 c/s    23688000 rounds/s   32.417 ms per crypt_all()-
Calculating best LWS for GWS=256
Tuning to length 7
Testing LWS=8 GWS=256 ... 243.652 ms+
Testing LWS=16 GWS=256 ... 241.340 ms+
Testing LWS=32 GWS=256 ... 246.731 ms
Testing LWS=64 GWS=256 ... 400.366 ms
Testing LWS=128 GWS=256 ... 560.459 ms

Why these these different GWS values?
What can I do to figure out what's going on? Why is autotuning trying LWS / GWS combinations which are not even used when requested on the command line?

frank-dittrich avatar May 08 '21 18:05 frank-dittrich