`@[inline]` annotation is ignored when specializing function with fixed type class argument
Prerequisites
Please put an X between the brackets as you perform the following steps:
- [x] Check that your issue is not already filed: https://github.com/leanprover/lean4/issues
- [x] Reduce the issue to a minimal, self-contained, reproducible test case. Avoid dependencies to Mathlib or Batteries.
- [x] Test your test case against the latest nightly release, for example on https://live.lean-lang.org/#project=lean-nightly (You can also use the settings there to switch to “Lean nightly”)
Description
The following clumsy benchmark tests two variants of a pair of functions reinsertAux and expand_go (named after the hash map operations they are inspired by). The difference is that the first version passes a function into reinsertAux as a type class, while the second version passes in the function directly.
class ToUInt64 (α : Type u) where
toUInt64 : α → UInt64
instance : ToUInt64 UInt32 where
toUInt64 := UInt32.toUInt64
@[inline] def reinsertAux₁ {α : Type u} [ToUInt64 α] (acc : List UInt64) (a : α) : List UInt64 :=
ToUInt64.toUInt64 a :: acc
def expand₁_go {α : Type u} [ToUInt64 α] (a : List α) : List UInt64 :=
a.foldl reinsertAux₁ []
@[noinline] def expand₁' {α : Type u} [ToUInt64 α] (a : List α) : IO (List UInt64) := do
return expand₁_go a
@[inline] def reinsertAux₂ {α : Type u} (f : α → UInt64) (acc : List UInt64) (a : α) : List UInt64 :=
f a :: acc
def expand₂_go {α : Type u} [ToUInt64 α] (a : List α) : List UInt64 :=
a.foldl (reinsertAux₂ ToUInt64.toUInt64) []
@[noinline] def expand₂' {α : Type u} [ToUInt64 α] (a : List α) : IO (List UInt64) := do
return expand₂_go a
def benchmarkSize : Nat := 50000000
@[noinline] def mkNumbers : IO (List UInt32) := do
let mut a : List UInt32 := []
for i in [:benchmarkSize] do
a := i.toUInt32 :: a
return a
@[noinline] def runBenchmark₁ : IO PUnit := do
for _ in [:5] do
let n ← mkNumbers
let _ ← timeit "expand₁" $ expand₁' n
@[noinline] def runBenchmark₂ : IO PUnit := do
for _ in [:5] do
let n ← mkNumbers
let _ ← timeit "expand₂" $ expand₂' n
def main (_ : List String) : IO UInt32 := do
runBenchmark₁
runBenchmark₂
return 0
It is important for performance that reinsertAux is inlined into expand_go, because only after the inlining it is possible to see that the memory from the input of expand_go can be reused in the constructor application in reinsertAux. Inspecting the IR, we indeed see that the IR for both expand₁_go and expand₂_go gets this right. In fact, the IR for these two functions returned by trace.compiler.ir.reset_reuse is exactly identical. It looks like this:
def List.foldl._at.expand₁_go._spec_1._rarg (x_1 : obj) (x_2 : obj) (x_3 : obj) : obj :=
case x_3 : obj of
List.nil →
ret x_2
List.cons →
let x_4 : obj := proj[0] x_3;
let x_5 : obj := proj[1] x_3;
let x_9 : obj := reset[2] x_3;
let x_6 : u64 := app x_1 x_4;
let x_7 : obj := reuse x_9 in ctor_1[List.cons] x_6 x_2;
let x_8 : obj := List.foldl._at.expand₁_go._spec_1._rarg x_1 x_7 x_5;
ret x_8
def List.foldl._at.expand₁_go._spec_1 (x_1 : ◾) : obj :=
let x_2 : obj := pap List.foldl._at.expand₁_go._spec_1._rarg;
ret x_2
def expand₁_go._rarg (x_1 : obj) (x_2 : obj) : obj :=
let x_3 : obj := ctor_0[List.nil];
let x_4 : obj := List.foldl._at.expand₁_go._spec_1._rarg x_1 x_3 x_2;
ret x_4
However, this IR isn't actually used due to specialization. The specialized IR is visible at the two runBenchmark functions. Inspecting the reset_reuse IR this time, we get
def reinsertAux₁._at.runBenchmark₁._spec_3 (x_1 : obj) (x_2 : u32) : obj :=
let x_3 : u64 := UInt32.toUInt64 x_2;
let x_4 : obj := ctor_1[List.cons] x_3 x_1;
ret x_4
def List.foldl._at.runBenchmark₁._spec_4 (x_1 : obj) (x_2 : obj) : obj :=
case x_2 : obj of
List.nil →
ret x_1
List.cons →
let x_3 : obj := proj[0] x_2;
let x_4 : obj := proj[1] x_2;
let x_5 : obj := reinsertAux₁._at.runBenchmark₁._spec_3 x_1 x_3;
let x_6 : obj := List.foldl._at.runBenchmark₁._spec_4 x_5 x_4;
ret x_6
for the first version and
def List.foldl._at.runBenchmark₂._spec_3 (x_1 : obj) (x_2 : obj) : obj :=
case x_2 : obj of
List.nil →
ret x_1
List.cons →
let x_3 : obj := proj[0] x_2;
let x_4 : obj := proj[1] x_2;
let x_8 : obj := reset[2] x_2;
let x_5 : u64 := UInt32.toUInt64 x_3;
let x_6 : obj := reuse x_8 in ctor_1[List.cons] x_5 x_1;
let x_7 : obj := List.foldl._at.runBenchmark₂._spec_3 x_6 x_4;
ret x_7
for the second version. Here we see that in the first version reinsertAux was not inlined, and so there is no reset/reuse pair for the memory cells of the lists, and this is clearly measurable when we run the code: on my machine, I get
expand₁ 609ms
expand₁ 475ms
expand₁ 519ms
expand₁ 467ms
expand₁ 525ms
expand₂ 309ms
expand₂ 335ms
expand₂ 304ms
expand₂ 337ms
expand₂ 307ms
Context
During benchmarking I noticed that insert performance for the Lean hash map is about 10% better than for the Batteries hash map. This issue is the result of analyzing that observation. The Lean hash map looks like the second version, while the Batteries hash map looks like the first version.
Steps to Reproduce
- Run the above code
Expected behavior: Both versions should respect the @[inline] attribute on reinsertAux during specialization.
Actual behavior: Passing in the function via a type class leads to the @[inline] attribute being ignored.
Versions
4.9.0-nightly-2024-05-16
Linux markus-z16 6.8.9-300.fc40.x86_64 #1 SMP PREEMPT_DYNAMIC Thu May 2 18:59:06 UTC 2024 x86_64 GNU/Linux
Additional Information
[Additional information, configuration or data that might be necessary to reproduce the issue]
Impact
Add :+1: to issues you consider important. If others are impacted by this issue, please ask them to add :+1: to it.