fsharp icon indicating copy to clipboard operation
fsharp copied to clipboard

Better integral range lowering: `start..finish`, `start..step..finish`

Open brianrourkeboll opened this issue 2 years ago • 28 comments

Description

This subsumes #16577 (and #13573).

Fixes #938. Fixes #9548.

Before

The compiler currently only optimizes for-loops over integral ranges when:

  1. The type is int/int32.
  2. The step is a compile-time constant 1 or -1.

for-loops over ranges for other integral types (int64, uint32, etc.), or when the step is not 1 or -1, use a relatively slow, IEnumerable<_>-based implementation.

List and array comprehension expressions initialized by an integral range also use the slow IEnumerable<_>-based implementation, and, in the case of arrays, incur additional unnecessary array allocations and copying.

After

With this PR, the compiler now lowers for-loops for all built-in integral types with any arbitrary step down to fast integral while-loops.

This optimization is in turn applied to the lowering of computed list collection expressions where an integral range is used to initialize the list, as in [start..finish] and [start..step..finish]; a similar technique is used for arrays in [|start..finish|] and [|start..step..finish|], where the total size of the array is computed from the start, step, and finish, at compile-time if possible. Such initialization expressions are now significantly faster and allocate significantly less, especially for arrays and for smaller lists.

  • A for-loop over an integral range is about 5–6 times as fast and no longer allocates at all
  • Array initialization from an integral range is about 2.5–8 times as fast and allocates less than half as much
  • List initialization from an integral range is about 1.25 times as fast and allocates a bit less (and not at all for empty ranges)

The following are the approximate high-level transformations applied to for-loops and list and array comprehensions over integral ranges.

The basic idea is to precompute the iteration count like this:

start..finish ($step = 1$)

let count = if finish < start then 0 else unsigned (widen finish - widen start) + 1
// We "widen" finish and start to the next-biggest integral type, since the count
// won't fit in the original type if finish = MaxValue and start = MinValue.

start..step..finish

if step = 0 then
    ignore ((.. ..) start step finish) // Throws the appropriate localized exception at runtime.

let count =
    if 0 < step then
        if finish < start then 0 else unsigned (widen finish - widen start) / unsigned (widen step) + 1
    else // step < 0
        if start < finish then 0 else unsigned (widen start - widen finish) / (unsigned (widen ~~~step) + 1) + 1
        // We use unsigned (widen ~~~step) + 1 to get the step's absolute value, since step might be the minimum value
        // for the given numeric type, and the neg instruction won't do what we want on the minimum value
        // of a two's complement number.

This lets us handle any potential overflows just once, instead of in each iteration of the loop. The loop then becomes simply:

let mutable loopVar = start
let mutable i = 0

while i < count do
    …
    loopVar <- loopVar + step
    i <- i + 1

Additional optimizations are applied for various scenarios:

  • If start, step, and finish are constants, we can compute the count at build-time.
  • If step is constant, or if the integral type is unsigned, we can simplify the code we emit for computing the count at runtime: we potentially don't need to emit code to check whether $0 < step$, find $|step|$, etc.
  • Etc.

Here is an example of the change in emitted IL: https://gist.github.com/brianrourkeboll/db2738f838847d2968affc43b61b21c0

Sample diff for `for n in start..finish do` for non-`int32` type
let ``for n in start..finish do`` (start: int64) finish =
    let xs = ResizeArray ()
    for n in start..finish do
        xs.Add n
    List.ofSeq xs
@@ -310,63 +573,68 @@
                                .custom instance void [FSharp.Core]Microsoft.FSharp.Core.CompilationArgumentCountsAttribute::.ctor(int32[]) = (
                                        01 00 02 00 00 00 01 00 00 00 01 00 00 00 00 00
                                )
-                               // Method begins at RVA 0x2180
+                               // Method begins at RVA 0x2240
                                // Header size: 12
-                               // Code size: 74 (0x4a)
-                               .maxstack 5
+                               // Code size: 61 (0x3d)
+                               .maxstack 4
                                .locals init (
                                        [0] class [System.Collections]System.Collections.Generic.List`1<int64>,
-                                       [1] class [System.Runtime]System.Collections.Generic.IEnumerable`1<int64>,
-                                       [2] class [System.Runtime]System.Collections.Generic.IEnumerator`1<int64>,
-                                       [3] class [System.Runtime]System.IDisposable
+                                       [1] uint64,
+                                       [2] uint64,
+                                       [3] int64
                                )

                                IL_0000: newobj instance void class [System.Collections]System.Collections.Generic.List`1<int64>::.ctor()
                                IL_0005: stloc.0
-                               IL_0006: ldarg.0
-                               IL_0007: ldc.i4.1
-                               IL_0008: conv.i8
-                               IL_0009: ldarg.1
-                               IL_000a: call class [System.Runtime]System.Collections.Generic.IEnumerable`1<int64> [FSharp.Core]Microsoft.FSharp.Core.Operators/OperatorIntrinsics::RangeInt64(int64, int64, int64)
-                               IL_000f: stloc.1
-                               IL_0010: ldloc.1
-                               IL_0011: callvirt instance class [System.Runtime]System.Collections.Generic.IEnumerator`1<!0> class [System.Runtime]System.Collections.Generic.IEnumerable`1<int64>::GetEnumerator()
-                               IL_0016: stloc.2
-                               .try
-                               {
-                                       IL_0017: br.s IL_0025
-                                       // loop start (head: IL_0025)
-                                               IL_0019: ldloc.0
-                                               IL_001a: ldloc.2
-                                               IL_001b: callvirt instance !0 class [System.Runtime]System.Collections.Generic.IEnumerator`1<int64>::get_Current()
-                                               IL_0020: callvirt instance void class [System.Collections]System.Collections.Generic.List`1<int64>::Add(!0)
-
-                                               IL_0025: ldloc.2
-                                               IL_0026: callvirt instance bool [System.Runtime]System.Collections.IEnumerator::MoveNext()
-                                               IL_002b: brtrue.s IL_0019
-                                       // end loop
-
-                                       IL_002d: leave.s IL_0041
-                               } // end .try
-                               finally
-                               {
-                                       IL_002f: ldloc.2
-                                       IL_0030: isinst [System.Runtime]System.IDisposable
-                                       IL_0035: stloc.3
-                                       IL_0036: ldloc.3
-                                       IL_0037: brfalse.s IL_0040
-
-                                       IL_0039: ldloc.3
-                                       IL_003a: callvirt instance void [System.Runtime]System.IDisposable::Dispose()
-                                       IL_003f: endfinally
-
-                                       IL_0040: endfinally
-                               } // end handler
+                               IL_0006: ldarg.1
+                               IL_0007: ldarg.0
+                               IL_0008: bge.s IL_000f
+
+                               IL_000a: ldc.i4.0
+                               IL_000b: conv.i8
+                               IL_000c: nop
+                               IL_000d: br.s IL_0018
+
+                               IL_000f: ldarg.1
+                               IL_0010: conv.i8
+                               IL_0011: ldarg.0
+                               IL_0012: conv.i8
+                               IL_0013: sub
+                               IL_0014: ldc.i4.1
+                               IL_0015: conv.i8
+                               IL_0016: add.ovf.un
+                               IL_0017: nop
+
+                               IL_0018: stloc.1
+                               IL_0019: ldc.i4.0
+                               IL_001a: conv.i8
+                               IL_001b: stloc.2
+                               IL_001c: ldarg.0
+                               IL_001d: stloc.3
+                               IL_001e: br.s IL_0030
+                               // loop start (head: IL_0030)
+                                       IL_0020: ldloc.0
+                                       IL_0021: ldloc.3
+                                       IL_0022: callvirt instance void class [System.Collections]System.Collections.Generic.List`1<int64>::Add(!0)
+                                       IL_0027: ldloc.3
+                                       IL_0028: ldc.i4.1
+                                       IL_0029: add
+                                       IL_002a: stloc.3
+                                       IL_002b: ldloc.2
+                                       IL_002c: ldc.i4.1
+                                       IL_002d: conv.i8
+                                       IL_002e: add
+                                       IL_002f: stloc.2
+
+                                       IL_0030: ldloc.2
+                                       IL_0031: ldloc.1
+                                       IL_0032: blt.un.s IL_0020
+                               // end loop

-                               IL_0041: ldloc.0
-                               IL_0042: tail.
-                               IL_0044: call class [FSharp.Core]Microsoft.FSharp.Collections.FSharpList`1<!!0> [FSharp.Core]Microsoft.FSharp.Collections.SeqModule::ToList<int64>(class [System.Runtime]System.Collections.Generic.IEnumerable`1<!!0>)
-                               IL_0049: ret
+                               IL_0034: ldloc.0
+                               IL_0035: tail.
+                               IL_0037: call class [FSharp.Core]Microsoft.FSharp.Collections.FSharpList`1<!!0> [FSharp.Core]Microsoft.FSharp.Collections.SeqModule::ToList<int64>(class [System.Runtime]System.Collections.Generic.IEnumerable`1<!!0>)
+                               IL_003c: ret
                        } // end of method ForLoop::'for n in start..finish do'
Sample diff for `[start..finish]`
let ``[start..finish]`` (start: int64) finish = [start..finish]
@@ -516,20 +1006,68 @@
                                .custom instance void [FSharp.Core]Microsoft.FSharp.Core.CompilationArgumentCountsAttribute::.ctor(int32[]) = (
                                        01 00 02 00 00 00 01 00 00 00 01 00 00 00 00 00
                                )
-                               // Method begins at RVA 0x2280
-                               // Header size: 1
-                               // Code size: 22 (0x16)
-                               .maxstack 8
-
-                               IL_0000: ldarg.0
-                               IL_0001: ldc.i4.1
-                               IL_0002: conv.i8
-                               IL_0003: ldarg.1
-                               IL_0004: call class [System.Runtime]System.Collections.Generic.IEnumerable`1<int64> [FSharp.Core]Microsoft.FSharp.Core.Operators/OperatorIntrinsics::RangeInt64(int64, int64, int64)
-                               IL_0009: call class [System.Runtime]System.Collections.Generic.IEnumerable`1<!!0> [FSharp.Core]Microsoft.FSharp.Core.Operators::CreateSequence<int64>(class [System.Runtime]System.Collections.Generic.IEnumerable`1<!!0>)
-                               IL_000e: tail.
-                               IL_0010: call class [FSharp.Core]Microsoft.FSharp.Collections.FSharpList`1<!!0> [FSharp.Core]Microsoft.FSharp.Collections.SeqModule::ToList<int64>(class [System.Runtime]System.Collections.Generic.IEnumerable`1<!!0>)
-                               IL_0015: ret
+                               // Method begins at RVA 0x23dc
+                               // Header size: 12
+                               // Code size: 58 (0x3a)
+                               .maxstack 4
+                               .locals init (
+                                       [0] uint64,
+                                       [1] valuetype [FSharp.Core]Microsoft.FSharp.Core.CompilerServices.ListCollector`1<int64>,
+                                       [2] uint64,
+                                       [3] int64
+                               )
+
+                               IL_0000: nop
+                               IL_0001: ldarg.1
+                               IL_0002: ldarg.0
+                               IL_0003: bge.s IL_000a
+
+                               IL_0005: ldc.i4.0
+                               IL_0006: conv.i8
+                               IL_0007: nop
+                               IL_0008: br.s IL_0013
+
+                               IL_000a: ldarg.1
+                               IL_000b: conv.i8
+                               IL_000c: ldarg.0
+                               IL_000d: conv.i8
+                               IL_000e: sub
+                               IL_000f: ldc.i4.1
+                               IL_0010: conv.i8
+                               IL_0011: add.ovf.un
+                               IL_0012: nop
+
+                               IL_0013: stloc.0
+                               IL_0014: ldc.i4.0
+                               IL_0015: conv.i8
+                               IL_0016: stloc.2
+                               IL_0017: ldarg.0
+                               IL_0018: stloc.3
+                               IL_0019: br.s IL_002e
+                               // loop start (head: IL_002e)
+                                       IL_001b: ldloca.s 1
+                                       IL_001d: ldloc.3
+                                       IL_001e: call instance void valuetype [FSharp.Core]Microsoft.FSharp.Core.CompilerServices.ListCollector`1<int64>::Add(!0)
+                                       IL_0023: nop
+                                       IL_0024: ldloc.3
+                                       IL_0025: ldc.i4.1
+                                       IL_0026: conv.i8
+                                       IL_0027: add
+                                       IL_0028: stloc.3
+                                       IL_0029: ldloc.2
+                                       IL_002a: ldc.i4.1
+                                       IL_002b: conv.i8
+                                       IL_002c: add
+                                       IL_002d: stloc.2
+
+                                       IL_002e: ldloc.2
+                                       IL_002f: ldloc.0
+                                       IL_0030: blt.un.s IL_001b
+                               // end loop
+
+                               IL_0032: ldloca.s 1
+                               IL_0034: call instance class [FSharp.Core]Microsoft.FSharp.Collections.FSharpList`1<!0> valuetype [FSharp.Core]Microsoft.FSharp.Core.CompilerServices.ListCollector`1<int64>::Close()
+                               IL_0039: ret
                        } // end of method List::'[start..finish]'
Sample diff for `[|start..finish|]`
let ``[|start..finish|]`` (start: int64) finish = [|start..finish|]
@@ -456,20 +778,80 @@
                                .custom instance void [FSharp.Core]Microsoft.FSharp.Core.CompilationArgumentCountsAttribute::.ctor(int32[]) = (
                                        01 00 02 00 00 00 01 00 00 00 01 00 00 00 00 00
                                )
-                               // Method begins at RVA 0x2250
-                               // Header size: 1
-                               // Code size: 22 (0x16)
-                               .maxstack 8
-
-                               IL_0000: ldarg.0
-                               IL_0001: ldc.i4.1
-                               IL_0002: conv.i8
-                               IL_0003: ldarg.1
-                               IL_0004: call class [System.Runtime]System.Collections.Generic.IEnumerable`1<int64> [FSharp.Core]Microsoft.FSharp.Core.Operators/OperatorIntrinsics::RangeInt64(int64, int64, int64)
-                               IL_0009: call class [System.Runtime]System.Collections.Generic.IEnumerable`1<!!0> [FSharp.Core]Microsoft.FSharp.Core.Operators::CreateSequence<int64>(class [System.Runtime]System.Collections.Generic.IEnumerable`1<!!0>)
-                               IL_000e: tail.
-                               IL_0010: call !!0[] [FSharp.Core]Microsoft.FSharp.Collections.SeqModule::ToArray<int64>(class [System.Runtime]System.Collections.Generic.IEnumerable`1<!!0>)
-                               IL_0015: ret
+                               // Method begins at RVA 0x2308
+                               // Header size: 12
+                               // Code size: 67 (0x43)
+                               .maxstack 5
+                               .locals init (
+                                       [0] uint64,
+                                       [1] int64[],
+                                       [2] uint64,
+                                       [3] int64
+                               )
+
+                               IL_0000: nop
+                               IL_0001: ldarg.1
+                               IL_0002: ldarg.0
+                               IL_0003: bge.s IL_000a
+
+                               IL_0005: ldc.i4.0
+                               IL_0006: conv.i8
+                               IL_0007: nop
+                               IL_0008: br.s IL_0013
+
+                               IL_000a: ldarg.1
+                               IL_000b: conv.i8
+                               IL_000c: ldarg.0
+                               IL_000d: conv.i8
+                               IL_000e: sub
+                               IL_000f: ldc.i4.1
+                               IL_0010: conv.i8
+                               IL_0011: add.ovf.un
+                               IL_0012: nop
+
+                               IL_0013: stloc.0
+                               IL_0014: ldloc.0
+                               IL_0015: ldc.i4.1
+                               IL_0016: conv.i8
+                               IL_0017: bge.un.s IL_001f
+
+                               IL_0019: call !!0[] [System.Runtime]System.Array::Empty<int64>()
+                               IL_001e: ret
+
+                               IL_001f: ldloc.0
+                               IL_0020: conv.ovf.i.un
+                               IL_0021: newarr [System.Runtime]System.Int64
+                               IL_0026: stloc.1
+                               IL_0027: ldc.i4.0
+                               IL_0028: conv.i8
+                               IL_0029: stloc.2
+                               IL_002a: ldarg.0
+                               IL_002b: stloc.3
+                               IL_002c: br.s IL_003d
+                               // loop start (head: IL_003d)
+                                       IL_002e: ldloc.1
+                                       IL_002f: ldloc.2
+                                       IL_0030: conv.ovf.i.un
+                                       IL_0031: ldloc.3
+                                       IL_0032: stelem.i8
+                                       IL_0033: ldloc.3
+                                       IL_0034: ldc.i4.1
+                                       IL_0035: conv.i8
+                                       IL_0036: add
+                                       IL_0037: stloc.3
+                                       IL_0038: ldloc.2
+                                       IL_0039: ldc.i4.1
+                                       IL_003a: conv.i8
+                                       IL_003b: add
+                                       IL_003c: stloc.2
+
+                                       IL_003d: ldloc.2
+                                       IL_003e: ldloc.0
+                                       IL_003f: blt.un.s IL_002e
+                               // end loop
+
+                               IL_0041: ldloc.1
+                               IL_0042: ret
                        } // end of method Array::'[|start..finish|]'

for-loops

for loopVar in start..finish do …

let count = if finish < start then 0 else unsigned (finish - start) + 1
let mutable loopVar = start
let mutable i = 0

while i < count do
    …
    loopVar <- loopVar + 1
    i <- i + 1

for loopVar in start..step..finish do …

if step = 0 then
    // Call the range operator so that it throws the appropriate localized exception.
    ignore ((.. ..) start step finish)

let count =
    if 0 < step then
        if finish < start then 0 else unsigned (finish - start) / unsigned step + 1
    else // step < 0
        if start < finish then 0 else unsigned (start - finish) / (unsigned ~~~step + 1) + 1

let mutable loopVar = start
let mutable i = 0

while i < count do
    …
    loopVar <- loopVar + step
    i <- i + 1

Lists

[start..finish]

let count = if finish < start then 0 else unsigned (finish - start) + 1
let mutable collector = ListCollector ()
let mutable loopVar = start
let mutable i = 0

while i < count do
    collector.Add loopVar
    loopVar <- loopVar + 1
    i <- i + 1

collector.Close ()

[start..step..finish]

if step = 0 then
    // Call the range operator so that it throws the appropriate localized exception.
    ignore ((.. ..) start step finish)

let count =
    if 0 < step then
        if finish < start then 0 else unsigned (finish - start) / unsigned step + 1
    else // step < 0
        if start < finish then 0 else unsigned (start - finish) / (unsigned ~~~step + 1) + 1

let mutable loopVar = start
let mutable i = 0

while i < count do
    collector.Add loopVar
    loopVar <- loopVar + step
    i <- i + 1

collector.Close ()

Arrays

[|start..finish|]

let count = if finish < start then 0 else unsigned (finish - start) + 1

if count < 1 then
    [||]
else
    let array = (# "newarr !0" type ('T) count : 'T array #)
    let mutable loopVar = start
    let mutable i = 0

    while i < count do
        array[i] <- loopVar
        loopVar <- loopVar + 1
        i <- i + 1

    array

[|start..step..finish|]

if step = 0 then
    // Call the range operator so that it throws the appropriate localized exception.
    ignore ((.. ..) start step finish)

let count =
    if 0 < step then
        if finish < start then 0 else unsigned (finish - start) / unsigned step + 1
    else // step < 0
        if start < finish then 0 else unsigned (start - finish) / (unsigned ~~~step + 1) + 1

if count < 1 then
    [||]
else
    let array = (# "newarr !0" type ('T) count : 'T array #)
    let mutable loopVar = start
    let mutable i = 0

    while i < count do
        array[i] <- loopVar
        loopVar <- loopVar + step
        i <- i + 1

    array

Benchmarks

Source:

  • https://github.com/brianrourkeboll/for-loop-lowering-benchmarks
  • https://github.com/brianrourkeboll/collection-range-lowering-benchmarks

for-loops

| Categories                                                    | Mean           | Ratio | Gen0   | Allocated | Alloc Ratio |
|-------------------------------------------------------------- |---------------:|------:|-------:|----------:|------------:|
| Int32,1,10..1                                                 |      0.4267 ns |  1.00 |      - |         - |          NA |
| Int32,1,10..1                                                 |      0.4258 ns |  1.00 |      - |         - |          NA |
|                                                               |                |       |        |           |             |
| Int32,2,1..256                                                |     60.2826 ns |  1.00 |      - |         - |          NA |
| Int32,2,1..256                                                |     60.3195 ns |  1.00 |      - |         - |          NA |
|                                                               |                |       |        |           |             |
| Int32,3,start..finish (start=1,finish=65536)                  | 13,769.2041 ns |  1.00 |      - |         - |          NA |
| Int32,3,start..finish (start=1,finish=65536)                  | 13,768.2217 ns |  1.00 |      - |         - |          NA |
|                                                               |                |       |        |           |             |
| Int32,4,1..2..256                                             |    202.9774 ns |  1.00 | 0.0076 |      96 B |        1.00 |
| Int32,4,1..2..256                                             |     39.8180 ns |  0.20 |      - |         - |        0.00 |
|                                                               |                |       |        |           |             |
| Int32,5,start..step..finish (start=1,step=2,finish=65536)     |      45.362 us |  1.00 |      - |      96 B |        1.00 |
| Int32,5,start..step..finish (start=1,step=2,finish=65536)     |       8.650 us |  0.19 |      - |         - |        0.00 |
|                                                               |                |       |        |           |             |
| Int64,1,10L..1L                                               |     13.6396 ns |  1.00 | 0.0096 |     120 B |        1.00 |
| Int64,1,10L..1L                                               |      0.4290 ns |  0.03 |      - |         - |        0.00 |
|                                                               |                |       |        |           |             |
| Int64,2,1L..256L                                              |    374.5056 ns |  1.00 | 0.0076 |      96 B |        1.00 |
| Int64,2,1L..256L                                              |     73.7102 ns |  0.20 |      - |         - |        0.00 |
|                                                               |                |       |        |           |             |
| Int64,3,start..finish (start=1L,finish=65536L)                | 97,152.8207 ns |  1.00 |      - |      96 B |        1.00 |
| Int64,3,start..finish (start=1L,finish=65536L)                | 17,318.2889 ns |  0.18 |      - |         - |        0.00 |
|                                                               |                |       |        |           |             |
| Int64,4,1L..2L..256L                                          |    207.1656 ns |  1.00 | 0.0095 |     120 B |        1.00 |
| Int64,4,1L..2L..256L                                          |     40.1491 ns |  0.19 |      - |         - |        0.00 |
|                                                               |                |       |        |           |             |
| Int64,5,start..step..finish (start=1L,step=2L,finish=65536L)  | 45,688.0579 ns |  1.00 |      - |     120 B |        1.00 |
| Int64,5,start..step..finish (start=1L,step=2L,finish=65536L)  |  8,655.6187 ns |  0.19 |      - |         - |        0.00 |
|                                                               |                |       |        |           |             |
| UInt32,1,10u..1u                                              |     12.8097 ns |  1.00 | 0.0076 |      96 B |        1.00 |
| UInt32,1,10u..1u                                              |      0.4222 ns |  0.03 |      - |         - |        0.00 |
|                                                               |                |       |        |           |             |
| UInt32,2,1u..256u                                             |    367.8889 ns |  1.00 | 0.0062 |      80 B |        1.00 |
| UInt32,2,1u..256u                                             |     73.1184 ns |  0.20 |      - |         - |        0.00 |
|                                                               |                |       |        |           |             |
| UInt32,3,start..finish (start=1u,finish=65536u)               | 94,149.9528 ns |  1.00 |      - |      80 B |        1.00 |
| UInt32,3,start..finish (start=1u,finish=65536u)               | 17,276.5017 ns |  0.18 |      - |         - |        0.00 |
|                                                               |                |       |        |           |             |
| UInt32,4,1u..2u..256u                                         |    200.7767 ns |  1.00 | 0.0076 |      96 B |        1.00 |
| UInt32,4,1u..2u..256u                                         |     39.3779 ns |  0.20 |      - |         - |        0.00 |
|                                                               |                |       |        |           |             |
| UInt32,5,start..step..finish (start=1u,step=2u,finish=65536u) | 45,443.6056 ns |  1.00 |      - |      96 B |        1.00 |
| UInt32,5,start..step..finish (start=1u,step=2u,finish=65536u) |  8,647.0885 ns |  0.19 |      - |         - |        0.00 |

Lists & arrays

| Categories                                                    | Mean            | Ratio | Gen0     | Gen1     | Gen2     | Allocated | Alloc Ratio |
|-------------------------------------------------------------- |----------------:|------:|---------:|---------:|---------:|----------:|------------:|
| Array,1,[|10..1|]                                             |      20.5415 ns |  1.00 |   0.0076 |        - |        - |      96 B |        1.00 |
| Array,1,[|10..1|]                                             |       0.4331 ns |  0.02 |        - |        - |        - |         - |        0.00 |
|                                                               |                 |       |          |          |          |           |             |
| Array,2,[|1..10|]                                             |      59.1133 ns |  1.00 |   0.0261 |        - |        - |     328 B |        1.00 |
| Array,2,[|1..10|]                                             |       7.3448 ns |  0.12 |   0.0051 |        - |        - |      64 B |        0.20 |
|                                                               |                 |       |          |          |          |           |             |
| Array,3,[|1..256|]                                            |     434.6095 ns |  1.00 |   0.1817 |        - |        - |    2280 B |        1.00 |
| Array,3,[|1..256|]                                            |     120.8985 ns |  0.28 |   0.0834 |        - |        - |    1048 B |        0.46 |
|                                                               |                 |       |          |          |          |           |             |
| Array,4,[|start..finish|] (start=1,finish=65536)              | 219,164.6816 ns |  1.00 | 124.7559 | 124.7559 | 124.7559 |  524754 B |        1.00 |
| Array,4,[|start..finish|] (start=1,finish=65536)              |  91,762.7987 ns |  0.42 |  83.2520 |  83.2520 |  83.2520 |  262196 B |        0.50 |
|                                                               |                 |       |          |          |          |           |             |
| Array,5,[|1..2..256|]                                         |     356.9520 ns |  1.00 |   0.0992 |        - |        - |    1248 B |        1.00 |
| Array,5,[|1..2..256|]                                         |      66.0176 ns |  0.18 |   0.0427 |        - |        - |     536 B |        0.43 |
|                                                               |                 |       |          |          |          |           |             |
| Array,6,[|start..step..finish|] (start=1,step=2,finish=65536) | 113,568.0729 ns |  1.00 |  41.6260 |  41.6260 |  41.6260 |  262574 B |        1.00 |
| Array,6,[|start..step..finish|] (start=1,step=2,finish=65536) |  46,214.4967 ns |  0.41 |  41.6260 |  41.6260 |  41.6260 |  131110 B |        0.50 |
|                                                               |                 |       |          |          |          |           |             |
| List,1,[10..1]                                                |      17.4362 ns |  1.00 |   0.0076 |        - |        - |      96 B |        1.00 |
| List,1,[10..1]                                                |       1.0708 ns |  0.06 |        - |        - |        - |         - |        0.00 |
|                                                               |                 |       |          |          |          |           |             |
| List,2,[1..10]                                                |      55.0514 ns |  1.00 |   0.0318 |        - |        - |     400 B |        1.00 |
| List,2,[1..10]                                                |      33.4564 ns |  0.61 |   0.0255 |        - |        - |     320 B |        0.80 |
|                                                               |                 |       |          |          |          |           |             |
| List,3,[1..256]                                               |     957.4592 ns |  1.00 |   0.6590 |   0.0191 |        - |    8272 B |        1.00 |
| List,3,[1..256]                                               |     766.2994 ns |  0.80 |   0.6523 |   0.0191 |        - |    8192 B |        0.99 |
|                                                               |                 |       |          |          |          |           |             |
| List,4,[start..finish] (start=1,finish=65536)                 | 467,288.4144 ns |  1.00 | 166.9922 | 154.2969 |        - | 2097232 B |        1.00 |
| List,4,[start..finish] (start=1,finish=65536)                 | 431,082.7161 ns |  0.93 | 166.9922 | 154.7852 |        - | 2097152 B |        1.00 |
|                                                               |                 |       |          |          |          |           |             |
| List,5,[1..2..256]                                            |     598.9907 ns |  1.00 |   0.3338 |   0.0048 |        - |    4192 B |        1.00 |
| List,5,[1..2..256]                                            |     395.3087 ns |  0.66 |   0.3262 |   0.0048 |        - |    4096 B |        0.98 |
|                                                               |                 |       |          |          |          |           |             |
| List,6,[start..step..finish] (start=1,step=2,finish=65536)    | 194,246.7494 ns |  1.00 |  83.4961 |  71.7773 |        - | 1048672 B |        1.00 |
| List,6,[start..step..finish] (start=1,step=2,finish=65536)    | 153,653.6115 ns |  0.79 |  83.4961 |  71.5332 |        - | 1048576 B |        1.00 |

Checklist

  • [x] Test cases added.
    • I have added emitted IL tests for int32 list and array comprehension initializations under FSharp.Compiler.UnitTests, but I also see emitted IL tests under FSharp.Compiler.ComponentTests — would the latter be a better spot? Either way, I could add the same for all other supported integral types (and all combinations of for-loop, list, array, and constant/runtime args), but they would add tens to hundreds of thousands of lines to this PR. Let me know if I should do so, or if I should add at least one copy for, say, an unsigned type — I could even just check in the IL for these tests, although that would be a couple hundred thousand lines... Some of the existing baselines have however been updated and do show examples of these changes applied to unsigned ranges.

      All existing tests involving ranges or comprehensions of ranges, like these and these, pass.

  • [x] Performance benchmarks added in case of performance changes.
    • I have included some basic illustrative benchmarks in this description, but I'm willing to run more if desired.
  • [x] Release notes entry updated.

Design/implementation notes

  • In theory, these optimizations could subsume the existing optimization that applies only to for-loops over int32 ranges with constant 1 or -1 steps. The performance should be comparable, if not identical (and is, in a quick spot-check). I have not replaced or removed the existing transformation in this PR.

    Alternatively, the existing TOp.IntegerForLoop construct could have been extended to support arbitrary steps and integer types, although that might (?) have compatibility implications. Compare #13573.

  • The emitted IL looks reasonably good to me, but additional optimizations could be made in certain cases:

    • If we knew that $start = 0$ and $step = 1$, we could consolidate $idxVar$ and $loopVar$ and halve the number of increment operations.
    • If we knew that $|finish - start| \leq maxValue(ty)$, we wouldn't need to widen the $count$ variable.
    • Try to ensure we emit loop patterns that the JIT knows how to optimize.
    • Try to emit patterns that will let the JIT omit bounds checks when initializing arrays (if we know that $count \leq 2^{32}$).
    • Etc.

    The returns begin to diminish relatively quickly relative to the complexity they add to the compiler, though.

  • I'm emitting IL instructions for arithmetic, comparison, etc., rather than their library equivalents. That's mainly because the call to LowerComputedCollections.LowerComputedListOrArrayExpr is made in IlxGen.fs, which happens after type-checking and optimization, which means that calls to library-level operators won't work. If we extended the TOp.IntegerForLoop construct instead, though, perhaps we could defer the lower-level codegen to IlxGen.fs.

  • Rather than directly emitting loops for collection initialization, we could instead just emit code to compute the count, and then call Array.init/List.init as in #16577. This wouldn't really save on code size, though, since closure types would end up being emitted instead. If we really wanted to minimize code size while keeping the performance boost, we could add some kind of new library method for fast collection initialization from ranges and just emit calls to that. I'm not exactly sure where that would go, though, or how it would be exposed.

brianrourkeboll avatar Feb 05 '24 02:02 brianrourkeboll

:heavy_exclamation_mark: Release notes required


:white_check_mark: Found changes and release notes in following paths:

Change path Release notes path Description
src/Compiler docs/release-notes/.FSharp.Compiler.Service/8.0.300.md
LanguageFeatures.fsi docs/release-notes/.Language/preview.md

github-actions[bot] avatar Feb 05 '24 02:02 github-actions[bot]

/azp run

psfinaki avatar Feb 06 '24 12:02 psfinaki

Azure Pipelines successfully started running 2 pipeline(s).

azure-pipelines[bot] avatar Feb 06 '24 12:02 azure-pipelines[bot]

Ah. I see what the problem is. I'll put this back in draft mode until it's fixed.

brianrourkeboll avatar Feb 07 '24 23:02 brianrourkeboll

Looks very promising, thanks for the amazing work here so far :)

psfinaki avatar Feb 14 '24 12:02 psfinaki

@psfinaki

Looks very promising, thanks for the amazing work here so far :)

Thanks! I think I've worked out most of the kinks — there were many, and the feedback loop for this particular change is rather slow...

CI green 🔜...

brianrourkeboll avatar Feb 14 '24 18:02 brianrourkeboll

Hmm, any idea what about 32-bit net472 FSI — which is causing the one remaining test failure when invoked on tests/fsharp/core/control/test.fsx — would cause it to behave differently from everything else, including 32-bit net472 FSC, which appears to emit the correct, expected IL when used to compile that same file targeting x86?

(Or maybe there's some more obvious explanation for the behavior[^1] that I'm missing?)

[^1]: for n in … do … [|0..n-1|] is resulting in empty arrays.

brianrourkeboll avatar Feb 15 '24 00:02 brianrourkeboll

Serious question - how much do we have to care about 32-bit anything these days?

baronfel avatar Feb 15 '24 00:02 baronfel

I was missing one widening conversion (conv.i8, etc.) after ldc.i4.1 in one spot ¯\(ツ)

There's also (separately) a thing with native ints that I need to fix.

Edit: I believe I have fixed them.

brianrourkeboll avatar Feb 15 '24 17:02 brianrourkeboll

@psfinaki

Could I ask you to share the bench code here for future reference? You can either attach them as a gist or add them to the CompiledCodeBenchmarks/MicroPerf.

Sure:

  • https://github.com/brianrourkeboll/for-loop-lowering-benchmarks
  • https://github.com/brianrourkeboll/collection-range-lowering-benchmarks

There's probably a more elegant way to do it, but I just added

<DotnetFscCompilerPath>D:\src\fsharp\artifacts\bin\fsc\Release\net8.0\fsc.dll</DotnetFscCompilerPath>

to the "new" project to point to my locally-built compiler. The "old" project is built with the shipped compiler and then used as the baseline.

brianrourkeboll avatar Feb 19 '24 14:02 brianrourkeboll

Alright, thanks for sharing the code - now just don't forget to never delete it :D I'll give this another round tomorrow, stay tuned!

psfinaki avatar Feb 19 '24 18:02 psfinaki

@psfinaki

Alright, thanks for sharing the code - now just don't forget to never delete it :D

Lol, I'm happy to add something to the benchmarks project here if you want. I'm not sure how you've been using the MicroPerf project, but I wonder if it makes sense to try to set it up something like this —

let config =
    DefaultConfig.Instance
        .AddJob(Job.Default.WithId("Shipped").AsBaseline())
        .AddJob(Job.Default.WithId("Local").WithCustomBuildConfiguration "LocalBuild")

— where the version built with the shipped compiler is used as the baseline, and the LocalBuild configuration is configured to be built with the local compiler (or maybe that project is already being built with the local compiler by default, I haven't checked, but you get the idea).

brianrourkeboll avatar Feb 19 '24 21:02 brianrourkeboll

@psfinaki

gathering mental energy for reviewing the actual code

Oh, and I'm happy to try to explain (and document, if needed) anything you have questions about.

brianrourkeboll avatar Feb 20 '24 00:02 brianrourkeboll

Lol, I'm happy to add something to the benchmarks project here if you want.

Nah I didn't mean it seriously :D However...

I'm not sure how you've been using the MicroPerf project, but I wonder if it makes sense to try to set it up something like this

This would be actually really cool I think. As of now, here, I just have two repos on my machine, one with main and another with my branch, copypaste benchmarks here and there, run them at the same time and realize how lame it is 🤦

psfinaki avatar Feb 20 '24 09:02 psfinaki

I agree that we should probably add a feature flag for this, maybe even switching between the existing and new optimizations (if this is easy to do).

No, not that for sure. We don't do that for non-experimental stuff. People will start relying on it. It being under language version is what we usually do.

vzarytovskii avatar Feb 20 '24 10:02 vzarytovskii

@vzarytovskii

we should put it under a language version flag

Under preview?

brianrourkeboll avatar Feb 20 '24 15:02 brianrourkeboll

@vzarytovskii

we should put it under a language version flag

Under preview?

Yes, it will be promoted to F# 9 before the release.

vzarytovskii avatar Feb 20 '24 15:02 vzarytovskii

@vzarytovskii

we should put it under a language version flag

Under preview?

Yes, it will be promoted to F# 9 before the release.

Putting it behind preview does mean that the existing IL baselines that were affected by the change (because they happened to have [|start..finish|], etc., in them, like https://github.com/dotnet/fsharp/pull/16650#discussion_r1494535165) will not be updated in this PR, but would instead need to be updated whenever the feature was brought out of preview. Is that what we want?

brianrourkeboll avatar Feb 20 '24 15:02 brianrourkeboll

@vzarytovskii

we should put it under a language version flag

Under preview?

Yes, it will be promoted to F# 9 before the release.

Putting it behind preview does mean that the existing IL baselines that were affected by the change (because they happened to have [|start..finish|], etc., in them, like #16650 (comment)) will not be updated in this PR, but would instead need to be updated whenever the feature was brought out of preview. Is that what we want?

There are three choices here:

  1. Run them with preview version, and when we will be bumping version of language, we will change preview -> 9.
  2. Not change them and run with "default" language version, and we will update when we run it.
  3. Have one for version 8 and one for preview (9 in the future). - it's the hardest one, since a bunch of "duplicated" test cases should be added and baselines will need to be generated for them.

I usually do 1., but this is a risk, because by doing that we are essentially excluding tests until we bump the version. If all (or most) of the baselines are external (i.e. .bsl files), we can do 2., since it will be easy to update them when we bump lang version.

vzarytovskii avatar Feb 20 '24 16:02 vzarytovskii

@psfinaki

have you noticed some other optimization opportunities in this space while working on this?

Sure, there are many — although I have not thought much at all about what the most worthwhile ones would be :)

  • Treat list comprehensions over ranges that are consumed directly in for-loops, e.g., for … in [start..finish] do …, as if they were for … in start..finish do … — i.e., don't first build the list and then iterate over it. This is a relatively common phenomenon (see, e.g., #9548).
  • Apply the range initialization optimizations in this PR to [for n in start..finish -> f n] and [|for n in start..finish -> f n|].
  • Apply similar optimizations for float, float32, and decimal.
  • Lower [for x in xs -> f x] to List.map f xs when xs is a list.
  • Lower [|for x in xs -> f x|] to Array.map f xs when xs is an array.
  • For other uses of for in a list or array comprehension, use the appropriate fast iteration method when the collection being iterated over is a list or an array (https://github.com/dotnet/fsharp/issues/9548#issuecomment-649044107).
  • Specialize the iteration method for functions like let f xs = for x in xs do ignore (x ** x) (https://github.com/dotnet/fsharp/issues/9548#issuecomment-648935832) once the type is known.
  • https://github.com/fsharp/fslang-suggestions/issues/1350
  • ...

brianrourkeboll avatar Feb 20 '24 19:02 brianrourkeboll

Ugh, at some point after e44dc371e219424056b7218efef2304190847aa8 (probably e45dd55d6c51a783730a56b24ccfac947986b589) I broke MinValue..MaxValue and MaxValue..-1..MinValue... I'll add tests for those — at least for Byte.MinValue..Byte.MaxValue, SByte.MinValue..SByte.MaxValue, and SByte.MaxValue..-1y..SByte.MinValue, so the tests still run in a reasonable amount of time — and fix it.

Relatedly, I forgot to mention that the "convert to unsigned or widen" technique won't work for the specific ranges[^1]

$0..2^{64}$

$-2^{63}..2^{63}-1$

$2^{63}-1..-1..-2^{63}$

because $finish - start = 2^{64}$ (or $start - finish = 2^{64}$) in those cases, and, since there's nothing to widen $2^{64}$ to, adding $1$ to it to reach the full count would mean overflow and an exception.

I have no doubt that that's acceptable for list/array initialization, since the initialization would always eventually use up all available memory and fail anyway, but in theory someone could write this "valid" loop —

for n in UInt64.MinValue..UInt64.MaxValue do
    …

— even though it would probably take decades to hundreds of years to terminate on normal present-day machines...

I wonder if it's OK to accept that one limitation?

Spelled out, the limitation is that $0..2^{64}$ would immediately raise an overflow exception, rather than the current behavior of eventually raising one (for array initialization), eventually using up all memory (for list initialization), or serving as a poor man's infinite loop (for a for-loop).

If that limitation is not acceptable, then we can emit different (slightly slower) code for int64/uint64/nativeint/unativeint ranges that keeps track of $current$ and $next$ and checks for overflow that way.

[^1]: For which there are no existing tests, for obvious reasons. I'm not really sure exactly what a useful test for that would look like, other than perhaps one with the emitted IL.

brianrourkeboll avatar Feb 21 '24 02:02 brianrourkeboll

All right, I fixed the MinValue..MaxValue/MaxValue..-1..MinValue problem[^1] and added tests for it.

[^1]: With the one caveat for ranges of size exactly 264 + 1, and I guess technically also nativeint or unativeint ranges of size 232 + 1 on x86. See https://github.com/dotnet/fsharp/blob/13a2d4621b71f310b56d739e61bc38a887740f91/tests/FSharp.Core.UnitTests/FSharp.Core/PrimTypes.fs#L1110-L1152

brianrourkeboll avatar Feb 21 '24 17:02 brianrourkeboll

Thanks for sharing the ideas, Brian :)

I will take another look tomorrow then.

psfinaki avatar Feb 21 '24 18:02 psfinaki

Well generally speaking, if we are talking about failing anyway, failing faster is better.

psfinaki avatar Feb 22 '24 15:02 psfinaki

Well generally speaking, if we are talking about failing anyway, failing faster is better.

That is not as easy, unfortunately, we can't afford failing code which will be compiling well otherwise and maybe failing in runtime later (since we don't have control flow analysis for such cases). So, we can't just start producing errors (even warnings sometimes), for code which was compiling before.

vzarytovskii avatar Feb 22 '24 16:02 vzarytovskii

Well generally speaking, if we are talking about failing anyway, failing faster is better.

That is not as easy, unfortunately, we can't afford failing code which will be compiling well otherwise and maybe failing in runtime later (since we don't have control flow analysis for such cases). So, we can't just start producing errors (even warnings sometimes), for code which was compiling before.

The code here as it is now will never cause any new errors at build-time. The only new error is an (immediate) overflow exception at runtime in these specific scenarios:

for n in UInt64.MinValue..UInt64.MaxValue do …
for n in Int64.MinValue..Int64.MaxValue do …
for n in Int64.MaxValue.. -1L .. Int64.MinValue do …
for n in UIntPtr.MinValue..UIntPtr.MaxValue do …
for n in IntPtr.MinValue..IntPtr.MaxValue do …
for n in IntPtr.MaxValue.. -1n .. IntPtr.MinValue do …

On a 64-bit machine, the size or iteration count of these ranges is $2^{64} + 1$, i.e., $18,446,744,073,709,551,616$. While such code is technically valid, it is effectively indistinguishable from an infinite loop on a present-day machine. Any code that uses such a loop written since F# 1.0 (or whenever the range operator was added) is still looping and will be long after we're dead 🙃.

If we feel nonetheless that we cannot change the behavior in that scenario — which I would understand — then we will need to emit different code (at least for 64-bit and native int types) than what I currently have in this PR.

Theoretically, someone could have a loop like for n in UIntPtr.MinValue..UIntPtr.MaxValue do … running on a 32-bit machine doing useful work, although it would be entirely dependent on running on a 32-bit machine forever, since running such code on a 64-bit machine would have drastically different behavior. We could address this specific case (for x86) just by using a uint64 for the count variable here.

Below are what I see as some of our options.

  1. Keep the code in this PR as-is.

    • Pros
      • All ranges behave as before, but faster, except for loops over ranges with the exact size $2^{64} + 1$ (which, using the current library implementation of (..) or (.. ..), would take hundreds or thousands of years to terminate...).
      • Unified code gen.
    • Cons
      • A loop over a range of size $2^{64} + 1$ would now immediately raise an overflow exception at runtime. We'd probably need to document this behavior.
  2. Just always emit code that does an overflow check on every iteration by keeping track of $current$ and $next$, similar to the existing library implementations.

    • Pros
      • All ranges behave as before.
    • Cons
      • We'd still need to precompute $count$ for array initialization.
      • Slower loops than the count-based code, based on some exploratory benchmarks.
  3. For 64-bit (or potentially 64-bit) types, emit code that checks whether the size of the range would be $2^{64} + 1$, and, if so, switches over to a special loop just for that case; if it wouldn't overflow, or if the type isn't or couldn't be 64-bit, use the count-based loop technique from this PR.

    • Pros
      • All ranges behave as before.
      • Probably still pretty fast.
    • Cons
      • More complex code gen.
      • More code gen.
  4. Punt on this question by only applying this PR's optimizations to 8, 16, and 32-bit types.

For what it's worth, Don said the following in https://github.com/dotnet/fsharp/issues/938#issuecomment-231077004:

Perhaps we could just sacrifice semantics for striding loops near the maxint condition - though whatever we do parity with C# is really needed.

However, given the F# loop

for i in UInt64.MinValue..UInt64.MaxValue do …

the C# "equivalent" —

for (var i = ulong.MinValue; i <= ulong.MaxValue; i++) { … }

— is actually an infinite loop, since the guard condition will continue to hold when 1 is added to ulong.MaxValue and it overflows.

But I don't think we can take that behavior generally, because existing, reasonable, working F# code like

for i in Byte.MinValue..Byte.MaxValue do …

would suddenly become an infinite loop.

To achieve the same semantics as F# .. in C# at this edge case, even for something like byte or int that could terminate in a reasonable amount of time, you'd really need a loop like

var i = int.MinValue;
for (; i != int.MaxValue; i++)
{
    M(i);
}
M(i);

brianrourkeboll avatar Feb 22 '24 16:02 brianrourkeboll

/azp run

psfinaki avatar Feb 23 '24 15:02 psfinaki

Azure Pipelines successfully started running 2 pipeline(s).

azure-pipelines[bot] avatar Feb 23 '24 15:02 azure-pipelines[bot]

Maybe we could emit something like the following for 64-bit types and nativeint/unativeint (sbyte/byte used in the example to make it easy to run in FSI):

NoOvf.fsx

open System

let f bod start step finish =
    // We only need this for 64-bit types.
    // For smaller types we can just widen to uint64.
    // Using sbyte/byte here as a runnable example.

    if step = 0y then
        ignore ((.. ..) start step finish)

    let mutable count = 0uy
    let mutable wouldOvf = false

    if 0y < step then
        if start <= finish then
            let pseudoCount = byte (finish - start) / byte step
            if pseudoCount = Byte.MaxValue then
                wouldOvf <- true
            else
                count <- pseudoCount + 1uy
    else // step < 0
        if finish <= start then
            let pseudoCount = byte (start - finish) / (byte ~~~step + 1uy)
            if pseudoCount = Byte.MaxValue then
                wouldOvf <- true
            else
                count <- pseudoCount + 1uy

    if not wouldOvf then
        let mutable loopVar = start
        let mutable i = 0uy

        while i < count do
            bod i loopVar
            loopVar <- loopVar + step
            i <- i + 1uy
    else
        // Count won't fit in the type,
        // so we basically emit a do-while loop
        // where the guard detects when we _have_ overflowed
        // and wrapped around.
        let mutable loopVar = start
        let mutable i = 0uy

        bod i loopVar
        loopVar <- loopVar + step
        i <- i + 1uy

        while i <> 0uy do
            bod i loopVar
            loopVar <- loopVar + step
            i <- i + 1uy

//f (fun i l -> printfn $"{i}, {l}") -128y 1y 127y
f (fun i l -> printfn $"{i}, {l}") 127y -1y -128y

The main downsides would be some added complexity in the compiler code and bigger code size for 64-bit types when count couldn't be determined at build time:

  • The little "count-injection" trick in the buildLoop: (Count -> ((Idx -> Elem -> Body) -> Loop) -> Expr) parameter would need some complexifying, since we have an unutterable count in this case, and we would still want an overflow exception to be raised when someone tried to initialize an array from a range with count $2^{64} + 1$.
  • We'd want to make sure that we handled this appropriately for native ints on both x86 and x64 at runtime, which probably means emitting additional runtime conditions that check sizeof<nativeint>, etc.

brianrourkeboll avatar Feb 26 '24 12:02 brianrourkeboll

The code here as it is now will never cause any new errors at build-time. The only new error is an (immediate) overflow exception at runtime in these specific scenarios:

I have no issues with that. The only ones I'm concerned with is if we start checking for overflows in compile time, and fail to compile.

vzarytovskii avatar Feb 26 '24 16:02 vzarytovskii