ChezScheme icon indicating copy to clipboard operation
ChezScheme copied to clipboard

--optimize-level 3 slowing things down

Open mml opened this issue 1 year ago • 3 comments

Yesterday I filed a bug prematurely, but I knew something was not quite right. The code below takes >30% longer with --optimize-level 3 than without it.

% scheme --script div-and-mod.ss
(time (for-each (lambda (...) ...) ...))
    no collections
    6.400004545s elapsed cpu time
    6.400545484s elapsed real time
    0 bytes allocated
Total 45,149,700,000,000
% echo '(compile-file "div-and-mod.ss")' | scheme -q
compiling div-and-mod.ss with output to div-and-mod.so
% scheme --script div-and-mod.so
(time (for-each (lambda (...) ...) ...))
    no collections
    6.117455272s elapsed cpu time
    6.117724356s elapsed real time
    0 bytes allocated
Total 45,149,700,000,000
% echo '(compile-file "div-and-mod.ss")' | scheme -q --optimize-level 3
compiling div-and-mod.ss with output to div-and-mod.so
% scheme --script div-and-mod.so
(time (for-each (lambda (...) ...) ...))
    no collections
    8.324555132s elapsed cpu time
    8.329026569s elapsed real time
    0 bytes allocated
Total 45,149,700,000,000
(set! total 0)
(let ([l (iota 300000000)])
  (time
    (for-each
      (lambda (x)
        (let-values ([(quo rem) (fxdiv-and-mod x 1000)])
          (set! total (+ total quo rem))))
      l)))
(printf "Total ~:d~n" total)
% scheme --version
10.1.0
% uname -a
Linux welwitschia 6.1.0-27-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.115-1 (2024-11-01) x86_64 GNU/Linux

mml avatar Dec 13 '24 01:12 mml

This works as expected (i.e., optimize level 3 is faster) on tarm64osx. Someone else will have to check the x64 linux builds.

jltaylor-us avatar Dec 13 '24 03:12 jltaylor-us

i've got same timings approx as @mml. level 3 is slower

$ cat test.scm 
(set! total 0)

(let ([l (iota 300000000)])
  (time
    (for-each
      (lambda (x)
        (let-values ([(quo rem) (fxdiv-and-mod x 1000)])
          (set! total (+ total quo rem))))
      l)))

(printf "Total ~:d~n" total)
$ echo '(compile-file "test.scm")' | chez -q
compiling test.scm with output to test.so
$ chez --script test.so
(time (for-each (lambda (...) ...) ...))
    no collections
    6.137851039s elapsed cpu time
    6.142206869s elapsed real time
    0 bytes allocated
Total 45,149,700,000,000
$ echo '(compile-file "test.scm")' | chez -q --optimize-level 3
compiling test.scm with output to test.so
$ chez --optimize-level 3 --script test.so
(time (for-each (lambda (...) ...) ...))
    no collections
    8.453322405s elapsed cpu time
    8.458856063s elapsed real time
    0 bytes allocated
Total 45,149,700,000,000
$ uname -a
Linux rocket 6.12.4-arch1-1 #1 SMP PREEMPT_DYNAMIC Mon, 09 Dec 2024 14:31:57 +0000 x86_64 GNU/Linux
$ cat /proc/cpuinfo | grep 'model name' | uniq
model name	: Intel(R) Core(TM) i5-7500 CPU @ 3.40GHz
$ chez --version
10.1.0

mbakhterev avatar Dec 13 '24 07:12 mbakhterev

I see the slowdown on arm64osx and a6le.

It looks like the issue is that the cp0 inline handler for fxdiv-and-mod does not (yet) anticipate how np-expand-primitives will handle fxdiv and fxmod. Since 1000 is not a power of two, we end up turning one out-of-line call into two such calls. If the test program had used 1024 instead of 1000, then the optimize-level 3 version would likely run faster.

We can cut a lot of (unmeasured) overhead from the test by replacing the for-each over the output of iota with a do loop:

(let ()
  (define (add-quo-rem total x)
    (let-values ([(quo rem) (fxdiv-and-mod x 1000)])
      (+ total quo rem)))
  (define total
    (time
     (do ([x 0 (fx+ x 1)]
          [total 0 (add-quo-rem total x)])
         ((fx= x 300000000) total))))
  (printf "Total ~:d~n" total))

owaddell avatar Dec 13 '24 13:12 owaddell

Looks like the fix was shipped in 10.2.0. Thanks!

mml avatar Nov 03 '25 16:11 mml