Small source difference causes a suspicious performance drop
Test Cases
Steps to Reproduce
Hi, I ran the attached two cases (good.wasm and bad.wasm) in Wasmtime and WasmEdge (AOT) and collected their execution times (measured with the time tool).
# commands to collect the execution time under Wasmtime
wasmtime compile bad.wasm -o bad.cwasm
time wasmtime run --allow-precompiled bad.cwasm
# commands to collect the execution time under WasmEdge
wasmedgec bad.wasm bad-wasmedge-aot.wasm
time wasmedge bad-wasmedge-aot.wasm
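Presumably the same steps apply to good.wasm, with analogous file names:
# Wasmtime
wasmtime compile good.wasm -o good.cwasm
time wasmtime run --allow-precompiled good.cwasm
# WasmEdge
wasmedgec good.wasm good-wasmedge-aot.wasm
time wasmedge good-wasmedge-aot.wasm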
Expected Results & Actual Results
For good.wasm, the execution times under the two runtimes are as follows:
- Wasmtime: 1.17s
- WasmEdge: 0.59s
For bad.wasm, the execution times under the two runtimes are as follows:
- Wasmtime: 6.01s
- WasmEdge: 0.59s
The difference between the two cases is shown below: one operand of an i32.add changes from i32.const 0 to local.get 1, which brings a 5x slowdown on Wasmtime but has no effect on WasmEdge.
➜ cases diff good.wat bad.wat
62c62
< i32.const 0
---
> local.get 1
;; good.wat
i32.const 0
i32.const 1
i32.add
;; bad.wat
local.get 1
i32.const 1
i32.add
At first I thought the slowdown was caused by the difference itself, because the good case uses a constant while the bad case uses a local variable, which might need to be fetched from memory. So I did a small experiment: repeatedly (2000000000 times) perform an addition whose operand is either a constant or a local variable, and measure each execution time. They turn out to be almost the same, 1.09s vs. 1.10s, so I think the slowdown has some other cause.
;; experiment case
(module
  (type (;0;) (func))
  (type (;1;) (func (param i32)))
  (import "wasi_snapshot_preview1" "proc_exit" (func (;0;) (type 1)))
  (func (;1;) (type 0)
    (local i32 i32)
    loop
      ;; i32.const 0     ;; swap with the next line for the constant variant
      local.get 1        ;; local 1 is never written, so it is always 0
      i32.const 1
      i32.add
      local.get 0
      i32.add            ;; local 0 += local 1 + 1
      local.tee 0
      i32.const 2000000000
      i32.ne
      br_if 0            ;; loop until local 0 reaches 2000000000
    end
    i32.const 0
    call 0               ;; proc_exit(0)
    unreachable)
  (memory (;0;) 8192 8192)
  (export "_start" (func 1))
  (export "memory" (memory 0)))
Profiling Information
I used the perf tool to profile the execution and found that the hotspot is the loop containing the small difference, so I think the difference changes some compilation strategy, which may cause the slowdown.
Samples: 23K of event 'cycles', Event count (approx.): 21853956752
Overhead  Command          Shared Object      Symbol
  99.87%  wasmtime         jitted-93855-1.so  [.] wasm[0]::function[2]
   0.02%  tokio-runtime-w  [kernel.kallsyms]  [k] __mod_memcg_lruvec_state
   0.02%  wasmtime         ld-2.31.so         [.] _dl_relocate_object
   0.02%  wasmtime         [kernel.kallsyms]  [k] __do_fault
   0.01%  wasmtime         [kernel.kallsyms]  [k] pmd_page_vaddr
# the commands to profile
perf record -k mono ~/wasmtime/target/release/wasmtime run --profile jitdump --allow-precompiled bad.cwasm
perf inject --jit --input perf.data --output perf.jit.data
perf report --input perf.jit.data --no-children
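To drill down to the per-instruction breakdown inside the hot function (as in the loop listings further below), something like perf's annotate mode should work:
perf annotate --input perf.jit.data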
Versions and Environment
- Wasmtime version or commit: d0cf46a098d97bab9
- Operating system: Linux ringzzz-OptiPlex-7070 5.15.0-97-generic
- Architecture: Intel(R) Core(TM) i5-9500T CPU @ 2.20GHz
Locally, on an Intel(R) Core(TM) i9-14900K, I get 0.9s on bad.wasm and 0.74s on good.wasm.
Profiling good.wasm, the hot loop according to perf is:
        │ d5:┌─→mov  (%rbx),%r10d
        │    │  add  $0x1,%r13d
        │    ├──test %r13d,%r13d
 100.00 │    └──jne  d5
For bad.wasm it's:
        │ c5:┌─→mov  (%rbx),%r10d
        │    │  mov  (%rsp),%rcx
   0.09 │    │  lea  0x1(%rcx),%r10d
  55.32 │    │  test %r10d,%r10d
        │    │↓ je   e5
   0.06 │    │  mov  %r10,%rcx
   0.03 │    │  mov  %rcx,(%rsp)
  44.51 │    └──jmp  c5
        │ e5:   xor  %edi,%edi
Note that the bad loop round-trips a value through a stack slot ((%rsp)) on every iteration, while the good loop keeps everything in registers.
I believe WasmEdge uses an LLVM-based backend, so it probably unrolls this loop or does something similar.
I found that there is no difference between the binaries generated by WasmEdge before and after the change, so I think the block containing the small change is optimized away (eliminated) during code generation in LLVM.
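For example, in the experiment module above, local 1 is never written, so it is provably always 0, and an LLVM-style optimizer can fold the bad variant's loop body back into the good one. A hypothetical sketch of that rewrite (not actual WasmEdge output):
loop
  ;; (local.get 1) (i32.const 1) i32.add folds to (i32.const 1),
  ;; since local 1 is provably always 0
  i32.const 1
  local.get 0
  i32.add
  local.tee 0
  i32.const 2000000000
  i32.ne
  br_if 0
end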
So maybe the slowdown in Wasmtime is caused by the newly generated instructions in the loop, but I am still surprised that such a small change in the machine code can have a more than 5x performance impact...
My guess is you're hitting micro-architectural limits or something like that: various performance cliffs in the CPU where, once you fall off the happy path, it's both difficult to explain why and difficult to understand the effects. That's just my best guess, though.