fsharp icon indicating copy to clipboard operation
fsharp copied to clipboard

SIMD vectorization of Array.sum<int>, etc

Open Thorium opened this issue 9 months ago • 27 comments

Description

Specific overloads (float, float32, int, int64) of Seq.sum, ~~Seq.average,~~ Array.sum ~~and Array.average~~ to take advantage of vectorization in System.Linq.Enumerable module.

This is potentially a naive first try to solve #16230 by the spirit of @T-Gro comment https://github.com/dotnet/fsharp/issues/16230#issuecomment-2826895557

Checklist

  • [ ] Test cases added
  • [x] Performance benchmarks added in case of performance changes
  • [x] Release notes entry updated:

    Please make sure to add an entry with short succinct description of the change as well as link to this pull request to the respective release notes file, if applicable.

Thorium avatar Apr 26 '25 12:04 Thorium

:heavy_exclamation_mark: Release notes required


:white_check_mark: Found changes and release notes in following paths:

Change path Release notes path Description
src/FSharp.Core docs/release-notes/.FSharp.Core/10.0.100.md

github-actions[bot] avatar Apr 26 '25 12:04 github-actions[bot]

It looks like the error case is now different for average over an empty collection. If it is the same exception type, and only a different message, we could justify the break.

(we should not evaluate a seq before calling into Average, since this would be breaking in a different way)

T-Gro avatar Apr 26 '25 20:04 T-Gro

I did some initial tests to see if this makes sense at all or not. I used the current benchmarks\CompiledCodeBenchmarks\MicroPerf\MicroPerf.fsproj with adding following tests:

    [<Benchmark>]
    member x.ArraySum() = // Array sum
        array
        |> Array.sum 
        |> ignore

    [<Benchmark>]
    member x.ArrayAverage() = // There has an extra map, because average needs float
        array
        |> Array.map float
        |> Array.average 
        |> ignore

    [<Benchmark>]
    member x.ArraySeqSum() = // Seq sum by using array as base data
        array
        |> Seq.sum 
        |> ignore

    [<Benchmark>]
    member x.ListSeqSum() = // Seq sum by using list as base data
        list
        |> Seq.sum 
        |> ignore

And here are the results with current main:


BenchmarkDotNet v0.13.10, Windows 11 (10.0.26100.3915)
13th Gen Intel Core i9-13900H, 1 CPU, 20 logical and 14 physical cores
.NET SDK 9.0.203
  [Host]     : .NET 9.0.4 (9.0.425.16305), X64 RyuJIT AVX2 DEBUG
  Job-ZVNJYC : .NET 9.0.4 (9.0.425.16305), X64 RyuJIT AVX2
  Job-NFLFBL : .NET 9.0.4 (9.0.425.16305), X64 RyuJIT AVX2
  Job-ABSOZP : .NET 9.0.4 (9.0.425.16305), X64 RyuJIT AVX2

Server=True  

Method Job Arguments IterationCount LaunchCount WarmupCount Length Mean Error StdDev Ratio RatioSD Gen0 Allocated Alloc Ratio
ArraySum Job-ZVNJYC /p:BUILDING_USING_DOTNET=true Default Default Default 1000 206.4 ns 1.13 ns 1.06 ns 1.00 0.00 - - NA
ArraySum Job-NFLFBL /p:BUILDING_USING_DOTNET=true Default Default Default 1000 205.9 ns 0.50 ns 0.39 ns 1.00 0.01 - - NA
ArraySum Job-ABSOZP Default 2 2 1 1000 207.8 ns 9.28 ns 1.44 ns 1.01 0.01 - - NA
ArrayAverage Job-ZVNJYC /p:BUILDING_USING_DOTNET=true Default Default Default 1000 1,538.6 ns 35.62 ns 105.04 ns 1.00 0.00 0.1736 8024 B 1.00
ArrayAverage Job-NFLFBL /p:BUILDING_USING_DOTNET=true Default Default Default 1000 1,471.2 ns 29.23 ns 53.45 ns 0.94 0.07 0.1736 8024 B 1.00
ArrayAverage Job-ABSOZP Default 2 2 1 1000 1,495.0 ns 759.33 ns 117.51 ns 0.92 0.07 0.1736 8024 B 1.00
ArraySeqSum Job-ZVNJYC /p:BUILDING_USING_DOTNET=true Default Default Default 1000 500.6 ns 3.69 ns 3.46 ns 1.00 0.00 - 32 B 1.00
ArraySeqSum Job-NFLFBL /p:BUILDING_USING_DOTNET=true Default Default Default 1000 496.4 ns 1.80 ns 1.40 ns 0.99 0.01 - 32 B 1.00
ArraySeqSum Job-ABSOZP Default 2 2 1 1000 498.9 ns 34.99 ns 5.41 ns 1.00 0.01 - 32 B 1.00
ListSeqSum Job-ZVNJYC /p:BUILDING_USING_DOTNET=true Default Default Default 1000 1,234.6 ns 3.12 ns 2.91 ns 1.00 0.00 - 40 B 1.00
ListSeqSum Job-NFLFBL /p:BUILDING_USING_DOTNET=true Default Default Default 1000 1,243.7 ns 19.42 ns 17.21 ns 1.01 0.01 - 40 B 1.00
ListSeqSum Job-ABSOZP Default 2 2 1 1000 1,251.6 ns 55.34 ns 8.56 ns 1.01 0.01 0.0019 40 B 1.00
ArraySum Job-ZVNJYC /p:BUILDING_USING_DOTNET=true Default Default Default 10000 1,949.4 ns 8.32 ns 6.95 ns 1.00 0.00 - - NA
ArraySum Job-NFLFBL /p:BUILDING_USING_DOTNET=true Default Default Default 10000 1,956.5 ns 24.32 ns 22.75 ns 1.00 0.01 - - NA
ArraySum Job-ABSOZP Default 2 2 1 10000 1,966.3 ns 128.42 ns 19.87 ns 1.01 0.01 - - NA
ArrayAverage Job-ZVNJYC /p:BUILDING_USING_DOTNET=true Default Default Default 10000 13,892.3 ns 415.41 ns 1,211.78 ns 1.00 0.00 1.8616 80024 B 1.00
ArrayAverage Job-NFLFBL /p:BUILDING_USING_DOTNET=true Default Default Default 10000 15,129.6 ns 242.87 ns 227.18 ns 1.09 0.13 1.9379 80024 B 1.00
ArrayAverage Job-ABSOZP Default 2 2 1 10000 12,991.8 ns 8,355.23 ns 1,292.98 ns 1.00 0.26 1.8616 80024 B 1.00
ArraySeqSum Job-ZVNJYC /p:BUILDING_USING_DOTNET=true Default Default Default 10000 4,889.1 ns 48.55 ns 43.04 ns 1.00 0.00 - 32 B 1.00
ArraySeqSum Job-NFLFBL /p:BUILDING_USING_DOTNET=true Default Default Default 10000 4,828.3 ns 45.71 ns 38.17 ns 0.99 0.01 - 32 B 1.00
ArraySeqSum Job-ABSOZP Default 2 2 1 10000 4,833.0 ns 122.61 ns 18.97 ns 0.98 0.00 - 32 B 1.00
ListSeqSum Job-ZVNJYC /p:BUILDING_USING_DOTNET=true Default Default Default 10000 14,042.3 ns 280.58 ns 483.98 ns 1.00 0.00 - 40 B 1.00
ListSeqSum Job-NFLFBL /p:BUILDING_USING_DOTNET=true Default Default Default 10000 13,767.9 ns 240.03 ns 420.40 ns 0.98 0.04 - 40 B 1.00
ListSeqSum Job-ABSOZP Default 2 2 1 10000 13,537.7 ns 4,046.83 ns 626.25 ns 0.95 0.07 - 40 B 1.00

Here are the results with this PR:

Method Job Arguments IterationCount LaunchCount WarmupCount Length Mean Error StdDev Median Ratio RatioSD Gen0 Allocated Alloc Ratio
ArraySum Job-WQFBJH /p:BUILDING_USING_DOTNET=true Default Default Default 1000 60.66 ns 0.489 ns 0.434 ns 60.59 ns 1.00 0.00 - - NA
ArraySum Job-ZZOCDZ /p:BUILDING_USING_DOTNET=true Default Default Default 1000 207.89 ns 3.431 ns 3.209 ns 206.96 ns 3.43 0.05 - - NA
ArraySum Job-BMQAWT Default 2 2 1 1000 60.18 ns 3.505 ns 0.542 ns 60.19 ns 0.99 0.02 - - NA
ArrayAverage Job-WQFBJH /p:BUILDING_USING_DOTNET=true Default Default Default 1000 1,693.99 ns 29.205 ns 28.683 ns 1,694.44 ns 1.00 0.00 0.1736 8024 B 1.00
ArrayAverage Job-ZZOCDZ /p:BUILDING_USING_DOTNET=true Default Default Default 1000 1,443.74 ns 47.701 ns 140.649 ns 1,446.84 ns 0.87 0.04 0.1736 8024 B 1.00
ArrayAverage Job-BMQAWT Default 2 2 1 1000 1,486.78 ns 226.790 ns 35.096 ns 1,493.10 ns 0.89 0.02 0.1736 8024 B 1.00
ArraySeqSum Job-WQFBJH /p:BUILDING_USING_DOTNET=true Default Default Default 1000 60.23 ns 0.426 ns 0.378 ns 60.11 ns 1.00 0.00 - - NA
ArraySeqSum Job-ZZOCDZ /p:BUILDING_USING_DOTNET=true Default Default Default 1000 497.31 ns 4.195 ns 3.719 ns 496.69 ns 8.26 0.09 - 32 B NA
ArraySeqSum Job-BMQAWT Default 2 2 1 1000 61.30 ns 1.343 ns 0.208 ns 61.31 ns 1.02 0.00 - - NA
ListSeqSum Job-WQFBJH /p:BUILDING_USING_DOTNET=true Default Default Default 1000 1,154.49 ns 9.687 ns 9.061 ns 1,155.70 ns 1.00 0.00 - 40 B 1.00
ListSeqSum Job-ZZOCDZ /p:BUILDING_USING_DOTNET=true Default Default Default 1000 1,228.62 ns 5.630 ns 4.395 ns 1,228.27 ns 1.06 0.01 - 40 B 1.00
ListSeqSum Job-BMQAWT Default 2 2 1 1000 1,164.41 ns 66.398 ns 10.275 ns 1,159.92 ns 1.01 0.02 0.0019 40 B 1.00
ArraySum Job-WQFBJH /p:BUILDING_USING_DOTNET=true Default Default Default 10000 546.81 ns 1.656 ns 1.549 ns 547.20 ns 1.00 0.00 - - NA
ArraySum Job-ZZOCDZ /p:BUILDING_USING_DOTNET=true Default Default Default 10000 1,961.44 ns 25.704 ns 24.044 ns 1,961.63 ns 3.59 0.05 - - NA
ArraySum Job-BMQAWT Default 2 2 1 10000 541.97 ns 33.331 ns 5.158 ns 541.34 ns 0.99 0.01 - - NA
ArrayAverage Job-WQFBJH /p:BUILDING_USING_DOTNET=true Default Default Default 10000 16,012.49 ns 316.673 ns 721.226 ns 16,238.87 ns 1.00 0.00 1.8463 80024 B 1.00
ArrayAverage Job-ZZOCDZ /p:BUILDING_USING_DOTNET=true Default Default Default 10000 13,651.54 ns 310.627 ns 865.903 ns 13,843.05 ns 0.84 0.08 1.8768 80024 B 1.00
ArrayAverage Job-BMQAWT Default 2 2 1 10000 17,271.21 ns 3,620.263 ns 560.240 ns 17,365.04 ns 1.10 0.04 1.8921 80024 B 1.00
ArraySeqSum Job-WQFBJH /p:BUILDING_USING_DOTNET=true Default Default Default 10000 554.40 ns 3.815 ns 3.569 ns 554.73 ns 1.00 0.00 - - NA
ArraySeqSum Job-ZZOCDZ /p:BUILDING_USING_DOTNET=true Default Default Default 10000 4,813.90 ns 39.532 ns 33.011 ns 4,801.26 ns 8.69 0.09 - 32 B NA
ArraySeqSum Job-BMQAWT Default 2 2 1 10000 551.93 ns 12.280 ns 1.900 ns 552.36 ns 1.00 0.00 - - NA
ListSeqSum Job-WQFBJH /p:BUILDING_USING_DOTNET=true Default Default Default 10000 14,523.24 ns 768.122 ns 2,240.648 ns 13,893.55 ns 1.00 0.00 - 40 B 1.00
ListSeqSum Job-ZZOCDZ /p:BUILDING_USING_DOTNET=true Default Default Default 10000 15,181.03 ns 458.748 ns 1,338.189 ns 14,985.06 ns 1.07 0.18 - 40 B 1.00
ListSeqSum Job-BMQAWT Default 2 2 1 10000 13,985.62 ns 10,680.237 ns 1,652.779 ns 13,530.39 ns 1.00 0.17 - 40 B 1.00

Edit: I used Net 9 but the FSharp.Core.dll netstandard2.0 version in both, which was probably a mistake because Spans are truely efficient on netstandard2.1 only (?)

Thorium avatar Apr 26 '25 20:04 Thorium

The IL for Enumerable is very complicated, and is heavily relying on the JIT to simplify it. In Release mode, I observe Enumerable to take half the time; in Debug mode Enumerable takes about 4x the time.

Hmm, I'm thinking if default FSI is compiled with debug mode and people use F# as a scripting language, then performance optimizations like this could be marked as #if !DEBUG. However, it would be even weirder if an error message were different between debug and release.

I did run the same performance tests with FSharp.Core.dll netstandard2.1 version

Main branch:

Method Job Arguments IterationCount LaunchCount WarmupCount Length Mean Error StdDev Ratio RatioSD Gen0 Allocated Alloc Ratio
ArraySum Job-WCJSKA /p:BUILDING_USING_DOTNET=true Default Default Default 1000 207.2 ns 2.34 ns 2.19 ns 1.00 0.00 - - NA
ArraySum Job-ZZTGPQ /p:BUILDING_USING_DOTNET=true Default Default Default 1000 204.2 ns 3.47 ns 2.90 ns 0.99 0.02 - - NA
ArraySum Job-EYKYAG Default 2 2 1 1000 206.8 ns 22.17 ns 3.43 ns 1.01 0.02 - - NA
ArraySeqSum Job-WCJSKA /p:BUILDING_USING_DOTNET=true Default Default Default 1000 490.6 ns 2.15 ns 1.80 ns 1.00 0.00 - 32 B 1.00
ArraySeqSum Job-ZZTGPQ /p:BUILDING_USING_DOTNET=true Default Default Default 1000 486.8 ns 2.41 ns 2.13 ns 0.99 0.01 - 32 B 1.00
ArraySeqSum Job-EYKYAG Default 2 2 1 1000 488.8 ns 74.19 ns 11.48 ns 1.00 0.02 - 32 B 1.00
ListSeqSum Job-WCJSKA /p:BUILDING_USING_DOTNET=true Default Default Default 1000 1,214.5 ns 2.83 ns 2.51 ns 1.00 0.00 - 40 B 1.00
ListSeqSum Job-ZZTGPQ /p:BUILDING_USING_DOTNET=true Default Default Default 1000 1,241.1 ns 19.15 ns 17.91 ns 1.02 0.01 - 40 B 1.00
ListSeqSum Job-EYKYAG Default 2 2 1 1000 1,216.6 ns 34.26 ns 5.30 ns 1.00 0.00 0.0019 40 B 1.00
ArraySum Job-WCJSKA /p:BUILDING_USING_DOTNET=true Default Default Default 10000 1,905.3 ns 8.41 ns 7.02 ns 1.00 0.00 - - NA
ArraySum Job-ZZTGPQ /p:BUILDING_USING_DOTNET=true Default Default Default 10000 1,924.9 ns 12.14 ns 11.36 ns 1.01 0.01 - - NA
ArraySum Job-EYKYAG Default 2 2 1 10000 1,933.3 ns 95.32 ns 14.75 ns 1.01 0.01 - - NA
ArraySeqSum Job-WCJSKA /p:BUILDING_USING_DOTNET=true Default Default Default 10000 4,710.8 ns 18.93 ns 15.80 ns 1.00 0.00 - 32 B 1.00
ArraySeqSum Job-ZZTGPQ /p:BUILDING_USING_DOTNET=true Default Default Default 10000 4,760.8 ns 17.52 ns 15.53 ns 1.01 0.00 - 32 B 1.00
ArraySeqSum Job-EYKYAG Default 2 2 1 10000 4,776.5 ns 345.62 ns 53.49 ns 1.01 0.01 - 32 B 1.00
ListSeqSum Job-WCJSKA /p:BUILDING_USING_DOTNET=true Default Default Default 10000 14,616.1 ns 355.10 ns 1,030.22 ns 1.00 0.00 - 40 B 1.00
ListSeqSum Job-ZZTGPQ /p:BUILDING_USING_DOTNET=true Default Default Default 10000 13,809.1 ns 270.69 ns 300.87 ns 0.94 0.08 - 40 B 1.00
ListSeqSum Job-EYKYAG Default 2 2 1 10000 15,396.3 ns 3,747.35 ns 579.91 ns 1.05 0.14 - 40 B 1.00

This PR

Method Job Arguments IterationCount LaunchCount WarmupCount Length Mean Error StdDev Median Ratio RatioSD Gen0 Allocated Alloc Ratio
ArraySum Job-YRWBVV /p:BUILDING_USING_DOTNET=true Default Default Default 1000 59.19 ns 0.550 ns 0.514 ns 59.01 ns 1.00 0.00 - - NA
ArraySum Job-YQEOCP /p:BUILDING_USING_DOTNET=true Default Default Default 1000 206.10 ns 0.720 ns 0.674 ns 206.08 ns 3.48 0.03 - - NA
ArraySum Job-SEOIAS Default 2 2 1 1000 59.65 ns 4.451 ns 0.689 ns 59.64 ns 1.01 0.02 - - NA
ArraySeqSum Job-YRWBVV /p:BUILDING_USING_DOTNET=true Default Default Default 1000 59.88 ns 0.326 ns 0.305 ns 59.84 ns 1.00 0.00 - - NA
ArraySeqSum Job-YQEOCP /p:BUILDING_USING_DOTNET=true Default Default Default 1000 496.87 ns 6.676 ns 6.245 ns 493.92 ns 8.30 0.10 - 32 B NA
ArraySeqSum Job-SEOIAS Default 2 2 1 1000 60.61 ns 3.319 ns 0.514 ns 60.63 ns 1.01 0.01 - - NA
ListSeqSum Job-YRWBVV /p:BUILDING_USING_DOTNET=true Default Default Default 1000 1,163.60 ns 18.969 ns 17.743 ns 1,162.85 ns 1.00 0.00 - 40 B 1.00
ListSeqSum Job-YQEOCP /p:BUILDING_USING_DOTNET=true Default Default Default 1000 1,230.86 ns 6.546 ns 6.123 ns 1,233.38 ns 1.06 0.02 - 40 B 1.00
ListSeqSum Job-SEOIAS Default 2 2 1 1000 1,156.84 ns 66.957 ns 10.362 ns 1,153.00 ns 0.99 0.02 0.0019 40 B 1.00
ArraySum Job-YRWBVV /p:BUILDING_USING_DOTNET=true Default Default Default 10000 553.52 ns 6.542 ns 6.119 ns 554.30 ns 1.00 0.00 - - NA
ArraySum Job-YQEOCP /p:BUILDING_USING_DOTNET=true Default Default Default 10000 1,944.70 ns 16.488 ns 15.423 ns 1,944.27 ns 3.51 0.05 - - NA
ArraySum Job-SEOIAS Default 2 2 1 10000 547.22 ns 53.119 ns 8.220 ns 543.16 ns 1.00 0.02 - - NA
ArraySeqSum Job-YRWBVV /p:BUILDING_USING_DOTNET=true Default Default Default 10000 549.69 ns 2.934 ns 2.600 ns 549.51 ns 1.00 0.00 - - NA
ArraySeqSum Job-YQEOCP /p:BUILDING_USING_DOTNET=true Default Default Default 10000 4,843.34 ns 64.337 ns 60.181 ns 4,826.96 ns 8.82 0.12 - 32 B NA
ArraySeqSum Job-SEOIAS Default 2 2 1 10000 551.21 ns 42.723 ns 6.611 ns 550.96 ns 1.00 0.02 - - NA
ListSeqSum Job-YRWBVV /p:BUILDING_USING_DOTNET=true Default Default Default 10000 13,349.78 ns 349.754 ns 992.195 ns 12,981.18 ns 1.00 0.00 - 40 B 1.00
ListSeqSum Job-YQEOCP /p:BUILDING_USING_DOTNET=true Default Default Default 10000 14,501.57 ns 311.609 ns 894.065 ns 14,248.72 ns 1.09 0.10 - 40 B 1.00
ListSeqSum Job-SEOIAS Default 2 2 1 10000 12,715.10 ns 817.355 ns 126.487 ns 12,726.11 ns 0.99 0.03 - 40 B 1.00

By quick look of this, the Sum seems to be improved but the Average not really. Should I revert the Average change?

Edit: Average removed

Thorium avatar Apr 27 '25 09:04 Thorium

1/As per your benchmark, the cost for average might be dominated by doing the casts and additional array allocations. Better try without the mapping step for fully prove that.

2/The idea of using if !DEBUG - if this is placed in the added code, it would react on conditional defines of when we build FSharp.Core (.e.g when packing together a new .NET SDK), not based on user code. The shipped FSharp.Core is always RELEASE, even if user code is in DEBUG.

T-Gro avatar Apr 28 '25 07:04 T-Gro

What are the numbers when running on the full clr? 4.6-4.8? Enumerable probably has different implementation there.

vzarytovskii avatar Apr 28 '25 08:04 vzarytovskii

Average will now throw different exception, so it's a breaking change at this point.

vzarytovskii avatar Apr 28 '25 08:04 vzarytovskii

Average will now throw different exception, so it's a breaking change at this point.

@Thorium: This does not mean it is a showstopper. If the change is good from all other perspective, such break can be accepted and documented (especially if it will be e.g. just a subtype from the same exception type).

T-Gro avatar Apr 28 '25 08:04 T-Gro

Better try without the mapping step for fully prove that.

I did check this, and the results were worse for Enumerable.Sum Main - current

Method Job Arguments IterationCount LaunchCount WarmupCount Length Mean Error StdDev Ratio RatioSD Allocated Alloc Ratio
ArrayAverage Job-XSEKSP /p:BUILDING_USING_DOTNET=true Default Default Default 1000 291.8 ns 5.86 ns 8.21 ns 1.00 0.00 - NA
ArrayAverage Job-BKIONA /p:BUILDING_USING_DOTNET=true Default Default Default 1000 276.6 ns 1.25 ns 1.11 ns 0.94 0.02 - NA
ArrayAverage Job-LYJOOZ Default 2 2 1 1000 277.2 ns 4.73 ns 0.73 ns 0.93 0.02 - NA
ArrayAverage Job-XSEKSP /p:BUILDING_USING_DOTNET=true Default Default Default 10000 2,669.5 ns 12.65 ns 11.21 ns 1.00 0.00 - NA
ArrayAverage Job-BKIONA /p:BUILDING_USING_DOTNET=true Default Default Default 10000 2,661.2 ns 4.14 ns 3.87 ns 1.00 0.00 - NA
ArrayAverage Job-LYJOOZ Default 2 2 1 10000 2,677.6 ns 139.82 ns 21.64 ns 1.00 0.01 - NA
------------- ----------- ------------------------------ --------------- ------------ ------------ ------- -----------: ----------: ---------: ------: --------: ----------: ------------:

This PR - with Enumerable.Sum

Method Job Arguments IterationCount LaunchCount WarmupCount Length Mean Error StdDev Ratio Allocated Alloc Ratio
ArrayAverage Job-GSXKPB /p:BUILDING_USING_DOTNET=true Default Default Default 1000 503.6 ns 9.32 ns 8.71 ns 1.00 - NA
ArrayAverage Job-LFPVKH /p:BUILDING_USING_DOTNET=true Default Default Default 1000 275.4 ns 0.45 ns 0.37 ns 0.55 - NA
ArrayAverage Job-SPCSFF Default 2 2 1 1000 494.2 ns 12.91 ns 2.00 ns 0.98 - NA
ArrayAverage Job-GSXKPB /p:BUILDING_USING_DOTNET=true Default Default Default 10000 5,277.4 ns 8.47 ns 7.07 ns 1.00 - NA
ArrayAverage Job-LFPVKH /p:BUILDING_USING_DOTNET=true Default Default Default 10000 2,661.1 ns 4.92 ns 4.11 ns 0.50 - NA
ArrayAverage Job-SPCSFF Default 2 2 1 10000 5,280.3 ns 83.03 ns 12.85 ns 1.00 - NA

Thorium avatar Apr 28 '25 08:04 Thorium

Because this is just a forwarding, behaviour difference is easy to explore already in FSI:

[| 1.; Double.NaN |] |> Array.sum;;
//val it: float = nan
[| 1.; Double.NaN |] |> System.Linq.Enumerable.Sum;;
//val it: float = nan

Seems to be the same with edge-cases like Double.MaxValue and infinity.

Thorium avatar Apr 28 '25 09:04 Thorium

What are the numbers when running on the full clr? 4.6-4.8? Enumerable probably has different implementation there.

I don't know how to easily run MicroPerf.fsproj in .NET, but I can run array |> Array.sum vs array |> System.Linq.Enumerable.Sum in .NET 4.8.

Benchmarks.LinqEnumerableSum: .NET Framework 4.8(Runtime=.NET Framework 4.8) [Length=10000]
Runtime = .NET Framework 4.8.1 (4.8.9310.0), X64 RyuJIT VectorSize=256; GC = Concurrent Workstation
Mean = 29.934 us, StdErr = 0.069 us (0.23%), N = 15, StdDev = 0.266 us
Min = 29.483 us, Q1 = 29.784 us, Median = 29.985 us, Q3 = 30.094 us, Max = 30.387 us
IQR = 0.310 us, LowerFence = 29.320 us, UpperFence = 30.558 us
ConfidenceInterval = [29.649 us; 30.219 us] (CI 99.9%), Margin = 0.285 us (0.95% of Mean)
Skewness = -0.15, Kurtosis = 1.96, MValue = 2
-------------------- Histogram --------------------
[29.341 us ; 30.529 us) | @@@@@@@@@@@@@@@
---------------------------------------------------

// * Summary *

BenchmarkDotNet v0.13.12, Windows 11 (10.0.26100.3915) 13th Gen Intel Core i9-13900H, 1 CPU, 20 logical and 14 physical cores [Host] : .NET Framework 4.8.1 (4.8.9310.0), X64 LegacyJIT VectorSize=256 DEBUG .NET Framework 4.8 : .NET Framework 4.8.1 (4.8.9310.0), X64 RyuJIT VectorSize=256

Job=.NET Framework 4.8 Runtime=.NET Framework 4.8

Method Length Mean Error StdDev Ratio RatioSD Gen0 Allocated Alloc Ratio
FSharpArraySum 1000 456.7 ns 8.08 ns 7.56 ns 1.00 0.00 - - NA
LinqEnumerableSum 1000 3,033.3 ns 51.61 ns 48.28 ns 6.64 0.07 0.0038 32 B NA
FSharpArraySum 10000 4,480.0 ns 55.90 ns 52.29 ns 1.00 0.00 - - NA
LinqEnumerableSum 10000 29,933.7 ns 284.86 ns 266.46 ns 6.68 0.10 - 32 B NA

Not exactly the results I was hoping for. Any ideas?

Thorium avatar Apr 28 '25 09:04 Thorium

Framework-dependent runtime switch would solve this (and since it is static, I would hope JIT would eliminate such branching), but is not something we are doing in FSharp.Core.

On Desktop:

  • Only forward Seq calls to Enumerable, if anything (or maybe just keep as it is )

On Core:

  • Forward Seq.sum and Array.sum. Exclude average due to reasons above.

It would be great if we could make use of https://learn.microsoft.com/en-us/dotnet/api/system.numerics.tensors.tensorprimitives.sum?view=net-9.0-pp#system-numerics-tensors-tensorprimitives-sum-1(system-readonlyspan((-0))) for some of those implementations. Right now, any such support depending on Tensors will have to be done as a separate lib outside of FSharp.Core .

T-Gro avatar Apr 28 '25 09:04 T-Gro

Not exactly the results I was hoping for. Any ideas?

Yeah, that's what I suspected will happen :( Bcl in full framework doesn't have those functions vectorised.

And the issue with fslib is that it ships as netstandard, so there's no way to tailor separate implementations for netfx and netcore.

vzarytovskii avatar Apr 28 '25 09:04 vzarytovskii

The IEnumerable.Sum of NET 9 checks the support for vectors via Vector<T>.IsSupported and Vector.IsHardwareAccelerated. But I assume there is not even a type of System.Numerics.Vector<T> in .NET Framework 4.8. / netstandard2.0. Would there be a fast way to check something like Type.GetType("System.Numerics.Vector") <> null to check in static optimization conditionals?

On Desktop:

What do you mean by "On Desktop" and "On Core" ?

I can see that in FSharp.Core.fsproj there is <TargetFrameworks Condition="'$(Configuration)' != 'Proto'">netstandard2.0;netstandard2.1</TargetFrameworks> so there would be possible to do: <DefineConstants Condition=" '$(TargetFramework)' == 'netstandard2.1' >$(DefineConstants);NETSTANDARD21</DefineConstants> And then use that, however, this would add one more parallel code-path for long-term maintenance.

Thorium avatar Apr 28 '25 10:04 Thorium

The IEnumerable.Sum of NET 9 checks the support for vectors via Vector<T>.IsSupported and Vector.IsHardwareAccelerated.

But I assume there is not even a type of System.Numerics.Vector<T> in .NET Framework 4.8. / netstandard2.0.

Would there be a fast way to check something like Type.GetType("System.Numerics.Vector") <> null to check in static optimization conditionals?

Reflection can likely neglect many optimizations (needs proving tho).

On Desktop:

What do you mean by "On Desktop" and "On Core" ?

Desktop is usually referred when full framework is used, and core is when coreclr runtime and bcl are used.

I can see that in FSharp.Core.fsproj there is <TargetFrameworks Condition="'$(Configuration)' != 'Proto'">netstandard2.0;netstandard2.1</TargetFrameworks> so there would be possible to do:

<DefineConstants Condition=" '$(TargetFramework)' == 'netstandard2.1' >$(DefineConstants);NETSTANDARD21</DefineConstants>

And then use that, however, this would add one more parallel code-path for long-term maintenance.

These should already be defined. However they will be defined on both net48 and net9.0

vzarytovskii avatar Apr 28 '25 10:04 vzarytovskii

I meant a statically stored flag based on something like:

open System.Runtime.InteropServices
let runtimeDescription = RuntimeInformation.FrameworkDescription
let hasLinqAcceleration = runtimeDescription > "..."  // string-based logic here

And switching at runtime (so cannot be part of statically optimized when clauses in FSharp.Core, since those worked with inlining at compile-time.

e.g. top level function: For unsupported types (anything but numerics), it would immediately call the existing code. For statically optimized numeric primitives, it would then do the decision based on this static flag, and pick between existing code and Enumerable.Sum.

T-Gro avatar Apr 28 '25 11:04 T-Gro

And switching at runtime

Runtime check added. Compared to this https://github.com/dotnet/fsharp/pull/18509#issuecomment-2833350985 ...this PR Array.sum still beats the current main over 50%:

This PR:

Method Job Arguments IterationCount LaunchCount WarmupCount Length Mean Error StdDev Median Ratio RatioSD Gen0 Allocated Alloc Ratio
ArraySum Job-HPWEPX /p:BUILDING_USING_DOTNET=true Default Default Default 1000 71.20 ns 0.395 ns 0.330 ns 71.28 ns 1.00 0.00 - - NA
ArraySum Job-FURVPH /p:BUILDING_USING_DOTNET=true Default Default Default 1000 203.16 ns 1.536 ns 1.437 ns 202.73 ns 2.85 0.03 - - NA
ArraySum Job-RIJJDM Default 2 2 1 1000 69.23 ns 3.499 ns 0.541 ns 69.03 ns 0.97 0.00 - - NA
ArraySeqSum Job-HPWEPX /p:BUILDING_USING_DOTNET=true Default Default Default 1000 72.44 ns 1.457 ns 1.842 ns 71.38 ns 1.00 0.00 - - NA
ArraySeqSum Job-FURVPH /p:BUILDING_USING_DOTNET=true Default Default Default 1000 497.44 ns 6.606 ns 5.856 ns 495.07 ns 6.79 0.22 - 32 B NA
ArraySeqSum Job-RIJJDM Default 2 2 1 1000 70.06 ns 2.019 ns 0.312 ns 70.09 ns 0.94 0.01 - - NA
ListSeqSum Job-HPWEPX /p:BUILDING_USING_DOTNET=true Default Default Default 1000 1,143.31 ns 7.123 ns 6.663 ns 1,140.53 ns 1.00 0.00 - 40 B 1.00
ListSeqSum Job-FURVPH /p:BUILDING_USING_DOTNET=true Default Default Default 1000 1,227.64 ns 13.709 ns 12.823 ns 1,222.18 ns 1.07 0.01 - 40 B 1.00
ListSeqSum Job-RIJJDM Default 2 2 1 1000 1,153.34 ns 41.507 ns 6.423 ns 1,155.69 ns 1.00 0.01 0.0019 40 B 1.00
ArraySum Job-HPWEPX /p:BUILDING_USING_DOTNET=true Default Default Default 10000 559.23 ns 6.984 ns 6.191 ns 559.52 ns 1.00 0.00 - - NA
ArraySum Job-FURVPH /p:BUILDING_USING_DOTNET=true Default Default Default 10000 1,939.43 ns 19.240 ns 17.997 ns 1,932.29 ns 3.47 0.04 - - NA
ArraySum Job-RIJJDM Default 2 2 1 10000 550.02 ns 50.072 ns 7.749 ns 549.63 ns 0.99 0.01 - - NA
ArraySeqSum Job-HPWEPX /p:BUILDING_USING_DOTNET=true Default Default Default 10000 568.95 ns 10.940 ns 13.836 ns 564.94 ns 1.00 0.00 - - NA
ArraySeqSum Job-FURVPH /p:BUILDING_USING_DOTNET=true Default Default Default 10000 4,784.67 ns 60.194 ns 56.306 ns 4,776.55 ns 8.41 0.23 - 32 B NA
ArraySeqSum Job-RIJJDM Default 2 2 1 10000 561.27 ns 85.016 ns 13.156 ns 557.93 ns 0.96 0.03 - - NA
ListSeqSum Job-HPWEPX /p:BUILDING_USING_DOTNET=true Default Default Default 10000 14,808.61 ns 704.739 ns 2,077.937 ns 14,333.68 ns 1.00 0.00 - 40 B 1.00
ListSeqSum Job-FURVPH /p:BUILDING_USING_DOTNET=true Default Default Default 10000 16,225.89 ns 609.042 ns 1,795.773 ns 16,355.29 ns 1.11 0.18 - 40 B 1.00
ListSeqSum Job-RIJJDM Default 2 2 1 10000 15,371.93 ns 14,341.671 ns 2,219.390 ns 15,512.85 ns 1.00 0.23 - 40 B 1.00

Thorium avatar Apr 28 '25 13:04 Thorium

After Sum is solved, I guess it's time for SumBy to do the same, because SumBy is the more commonly used in F#. There is System.Linq.Enumerable.Sum overload that takes a tuple of IEnumerable and a projection.

Thorium avatar Apr 28 '25 13:04 Thorium

After Sum is solved, I guess it's time for SumBy to do the same, because SumBy is the more commonly used in F#. There is System.Linq.Enumerable.Sum overload that takes a tuple of IEnumerable and a projection.

Is it vectorized as well? Hard to imagine how - since with a projection, the data to sum by is not stored next to each other.

Is it doing copies first? I am curious.

T-Gro avatar Apr 28 '25 14:04 T-Gro

Oh, you are right, it isn't. That's sad, makes this very marginal feature especially when tools like FSharpLint prefer sumBy over (map >> sum)

Thorium avatar Apr 28 '25 15:04 Thorium

However they will be defined on both net48 and net9.0

.NET Framework is not .NET Standard 2.1 compatible, thus .NET Standard 2.1 version shouldn't need runtime check. I expect the "supported" frameworks (aka. where things should be as fast as possible) are only:

  • .NET Framework Full (4.62-...) with .NET Standard 2.0
  • dotnet 8/9 (both .NET Standard 2.0 and .NET Standard 2,1 should work here)

Thorium avatar Apr 28 '25 16:04 Thorium

However they will be defined on both net48 and net9.0

.NET Framework is not .NET Standard 2.1 compatible, thus .NET Standard 2.1 version shouldn't need runtime check.

I expect the "supported" frameworks (aka. where things should be as fast as possible) are only:

  • .NET Framework Full (4.62-...) with .NET Standard 2.0

  • dotnet 8/9 (both .NET Standard 2.0 and .NET Standard 2,1 should work here)

Will all netstandard2.1 implementations be vectorized? If yes, then it's safe enough (and also it should be defined already).

vzarytovskii avatar Apr 28 '25 16:04 vzarytovskii

However they will be defined on both net48 and net9.0

.NET Framework is not .NET Standard 2.1 compatible, thus .NET Standard 2.1 version shouldn't need runtime check. I expect the "supported" frameworks (aka. where things should be as fast as possible) are only:

  • .NET Framework Full (4.62-...) with .NET Standard 2.0
  • dotnet 8/9 (both .NET Standard 2.0 and .NET Standard 2,1 should work here)

@Thorium : I think the discussion about the future deployments of FSharp.Core derailed the discussion. It is a very valid point, but not one which should prevent this PR from happening.

(It's not that a different strategy for shipping FSharp.Core for netCurrent is dismissed as not being allowed, it is rather work which did not happen yet.)

I would like this optimization to go into NET10 as is, are you still interested in finalizing this PR and getting it in?

T-Gro avatar Jun 03 '25 08:06 T-Gro

I would like this optimization to go into NET10 as is, are you still interested in finalizing this PR and getting it in?

Sure. Is there anything that should be added still? I think all the existing conversation items were already resolved?

Thorium avatar Jun 03 '25 08:06 Thorium

Tests were failing, otherwise it was good. The results have expired since, let me get fresh results.

T-Gro avatar Jun 03 '25 11:06 T-Gro

@Thorium :

As I am revisiting open PRs before NET10 closes, I again came across this one. The remaining concern is the one raised by Tanner, that is how much worse is a non-desktop implementation that is not HW accelerated.

The check for HW acceleration happens within Enumerable.Sum itself and is correct to be there, that is fine.

If you are still interested in pursuing that, I think that this last remaining concern could be lifted by running a test on: .net9 target, but with DOTNET_EnableHWIntrinsic=0 enviroment variable. That will hit the "new" code path of Enumerable.Sum, however not use HW acceleration. The goal is not to be drastically worse than existing F# Array.sum.

If this holds, I would be happy to merge it in 👍 .

As far as the runtime check in FSharp.Core goes, it is not up to this PR to restructure it - it shall use the already implemented mechanism. Multitargeting of FSharp.Core (easy) and assessing the full downstream impact (hard) shall be taken separately.

T-Gro avatar Aug 14 '25 12:08 T-Gro

Hi,

The remaining concern is the one raised by Tanner, that is how much worse is a non-desktop implementation that is not HW accelerated.

This case was already tested before earlier in this thread: https://github.com/dotnet/fsharp/pull/18509#issuecomment-2834590096 ...and the answer was: almost 10x worse.

Thorium avatar Nov 02 '25 09:11 Thorium