--optimize-level 3 slowing things down
Yesterday I filed a bug prematurely, but I knew something was not quite right. The code below takes >30% longer with --optimize-level 3 than without it.
% scheme --script div-and-mod.ss
(time (for-each (lambda (...) ...) ...))
no collections
6.400004545s elapsed cpu time
6.400545484s elapsed real time
0 bytes allocated
Total 45,149,700,000,000
% echo '(compile-file "div-and-mod.ss")' | scheme -q
compiling div-and-mod.ss with output to div-and-mod.so
% scheme --script div-and-mod.so
(time (for-each (lambda (...) ...) ...))
no collections
6.117455272s elapsed cpu time
6.117724356s elapsed real time
0 bytes allocated
Total 45,149,700,000,000
% echo '(compile-file "div-and-mod.ss")' | scheme -q --optimize-level 3
compiling div-and-mod.ss with output to div-and-mod.so
% scheme --script div-and-mod.so
(time (for-each (lambda (...) ...) ...))
no collections
8.324555132s elapsed cpu time
8.329026569s elapsed real time
0 bytes allocated
Total 45,149,700,000,000
(set! total 0)
(let ([l (iota 300000000)])
(time
(for-each
(lambda (x)
(let-values ([(quo rem) (fxdiv-and-mod x 1000)])
(set! total (+ total quo rem))))
l)))
(printf "Total ~:d~n" total)
% scheme --version
10.1.0
% uname -a
Linux welwitschia 6.1.0-27-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.115-1 (2024-11-01) x86_64 GNU/Linux
This works as expected (i.e., optimize level 3 is faster) on tarm64osx. Someone else will have to check the x64 linux builds.
i've got same timings approx as @mml. level 3 is slower
$ cat test.scm
(set! total 0)
(let ([l (iota 300000000)])
(time
(for-each
(lambda (x)
(let-values ([(quo rem) (fxdiv-and-mod x 1000)])
(set! total (+ total quo rem))))
l)))
(printf "Total ~:d~n" total)
$ echo '(compile-file "test.scm")' | chez -q
compiling test.scm with output to test.so
$ chez --script test.so
(time (for-each (lambda (...) ...) ...))
no collections
6.137851039s elapsed cpu time
6.142206869s elapsed real time
0 bytes allocated
Total 45,149,700,000,000
$ echo '(compile-file "test.scm")' | chez -q --optimize-level 3
compiling test.scm with output to test.so
$ chez --optimize-level 3 --script test.so
(time (for-each (lambda (...) ...) ...))
no collections
8.453322405s elapsed cpu time
8.458856063s elapsed real time
0 bytes allocated
Total 45,149,700,000,000
$ uname -a
Linux rocket 6.12.4-arch1-1 #1 SMP PREEMPT_DYNAMIC Mon, 09 Dec 2024 14:31:57 +0000 x86_64 GNU/Linux
$ cat /proc/cpuinfo | grep 'model name' | uniq
model name : Intel(R) Core(TM) i5-7500 CPU @ 3.40GHz
$ chez --version
10.1.0
I see the slowdown on arm64osx and a6le.
It looks like the issue is that the cp0 inline handler for fxdiv-and-mod does not (yet) anticipate how np-expand-primitives will handle fxdiv and fxmod. Since 1000 is not a power of two, we end up turning one out-of-line call into two such calls. If the test program had used 1024 instead of 1000, then the optimize-level 3 version would likely run faster.
We can cut a lot of (unmeasured) overhead from the test by replacing the for-each over the output of iota with a do loop:
(let ()
(define (add-quo-rem total x)
(let-values ([(quo rem) (fxdiv-and-mod x 1000)])
(+ total quo rem)))
(define total
(time
(do ([x 0 (fx+ x 1)]
[total 0 (add-quo-rem total x)])
((fx= x 300000000) total))))
(printf "Total ~:d~n" total))
Looks like the fix was shipped in 10.2.0. Thanks!