Small source difference causes a suspicious performance drop
Test Cases
Steps to Reproduce
Hi, I ran the attached two cases (good.wasm and bad.wasm) in Wasmtime and WasmEdge (AOT) and collected their execution times (measured with the time tool).
# commands to collect the execution time under Wasmtime
wasmtime compile bad.wasm -o bad.cwasm
time wasmtime run --allow-precompiled bad.cwasm
# commands to collect the execution time under WasmEdge
wasmedgec bad.wasm bad-wasmedge-aot.wasm
time wasmedge bad-wasmedge-aot.wasm
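Presumably the same steps apply to good.wasm, with analogous file names:
# Wasmtime
wasmtime compile good.wasm -o good.cwasm
time wasmtime run --allow-precompiled good.cwasm
# WasmEdge
wasmedgec good.wasm good-wasmedge-aot.wasm
time wasmedge good-wasmedge-aot.wasm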
Expected Results & Actual Results
For good.wasm, the execution times under the two runtimes are as follows:
- Wasmtime: 1.17s
- WasmEdge: 0.59s
For bad.wasm, the execution times under the two runtimes are as follows:
- Wasmtime: 6.01s
- WasmEdge: 0.59s
The difference between the two cases is shown below: one operand of an i32.add changes from i32.const 0 to local.get 1, which brings a 5x slowdown on Wasmtime but has no effect on WasmEdge.
➜ cases diff good.wat bad.wat
62c62
< i32.const 0
---
> local.get 1
;; good.wat
i32.const 0
i32.const 1
i32.add
;; bad.wat
local.get 1
i32.const 1
i32.add
At first I thought the slowdown was caused by the difference itself, because the good case uses a constant while the bad case uses a local variable, which might need to be fetched from memory. So I did a small experiment: repeatedly (2000000000 times) perform an addition whose operand is either a constant or a local variable, and measure each execution time. They turn out to be almost the same, 1.09s vs. 1.10s, so I think the slowdown has some other cause.
;; experiment case
(module
  (type (;0;) (func))
  (type (;1;) (func (param i32)))
  (import "wasi_snapshot_preview1" "proc_exit" (func (;0;) (type 1)))
  (func (;1;) (type 0)
    (local i32 i32)
    loop
      ;; i32.const 0     ;; swap with the next line for the constant variant
      local.get 1        ;; local 1 is never written, so it is always 0
      i32.const 1
      i32.add
      local.get 0
      i32.add            ;; local 0 += local 1 + 1
      local.tee 0
      i32.const 2000000000
      i32.ne
      br_if 0            ;; loop until local 0 reaches 2000000000
    end
    i32.const 0
    call 0               ;; proc_exit(0)
    unreachable)
  (memory (;0;) 8192 8192)
  (export "_start" (func 1))
  (export "memory" (memory 0)))
Profiling Information
I used the perf tool to profile the execution and found that the hotspot is the loop containing the small difference, so I think the difference changes some compilation strategy, which may cause the slowdown.
Samples: 23K of event 'cycles', Event count (approx.): 21853956752
Overhead  Command          Shared Object      Symbol
  99.87%  wasmtime         jitted-93855-1.so  [.] wasm[0]::function[2]
   0.02%  tokio-runtime-w  [kernel.kallsyms]  [k] __mod_memcg_lruvec_state
   0.02%  wasmtime         ld-2.31.so         [.] _dl_relocate_object
   0.02%  wasmtime         [kernel.kallsyms]  [k] __do_fault
   0.01%  wasmtime         [kernel.kallsyms]  [k] pmd_page_vaddr
# the commands to profile
perf record -k mono ~/wasmtime/target/release/wasmtime run --profile jitdump --allow-precompiled bad.cwasm
perf inject --jit --input perf.data --output perf.jit.data
perf report --input perf.jit.data --no-children
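To drill down to the per-instruction breakdown inside the hot function (as in the loop listings further below), something like perf's annotate mode should work:
perf annotate --input perf.jit.data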
Versions and Environment
- Wasmtime version or commit: d0cf46a098d97bab9
- Operating system: Linux ringzzz-OptiPlex-7070 5.15.0-97-generic
- Architecture: Intel(R) Core(TM) i5-9500T CPU @ 2.20GHz
Locally, on an Intel(R) Core(TM) i9-14900K, I get 0.9s on bad.wasm and 0.74s on good.wasm.
Profiling good.wasm, the hot loop according to perf is:
        │ d5:┌─→mov  (%rbx),%r10d
        │    │  add  $0x1,%r13d
        │    ├──test %r13d,%r13d
 100.00 │    └──jne  d5
For bad.wasm it's:
        │ c5:┌─→mov  (%rbx),%r10d
        │    │  mov  (%rsp),%rcx
   0.09 │    │  lea  0x1(%rcx),%r10d
  55.32 │    │  test %r10d,%r10d
        │    │↓ je   e5
   0.06 │    │  mov  %r10,%rcx
   0.03 │    │  mov  %rcx,(%rsp)
  44.51 │    └──jmp  c5
        │ e5:   xor  %edi,%edi
Note that the bad loop round-trips a value through a stack slot ((%rsp)) on every iteration, while the good loop keeps everything in registers.
I believe WasmEdge uses an LLVM-based backend, so it probably unrolls this loop or does something similar.
I found that there is no difference between the binaries generated by WasmEdge before and after the change, so I think the block containing the small change is optimized away (eliminated) during code generation in LLVM.
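For example, in the experiment module above, local 1 is never written, so it is provably always 0, and an LLVM-style optimizer can fold the bad variant's loop body back into the good one. A hypothetical sketch of that rewrite (not actual WasmEdge output):
loop
  ;; (local.get 1) (i32.const 1) i32.add folds to (i32.const 1),
  ;; since local 1 is provably always 0
  i32.const 1
  local.get 0
  i32.add
  local.tee 0
  i32.const 2000000000
  i32.ne
  br_if 0
end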
So maybe the slowdown in Wasmtime is caused by the newly generated instructions in the loop, but I am still surprised that such a small change in the machine code can have a more than 5x performance impact...
My guess is you're hitting micro-architectural limits or something like that: various performance cliffs in the CPU where, once you fall off the happy path, it's both difficult to explain why and difficult to understand the effects. That's just my best guess, though.