SIMD vectorization of Array.sum<int>, etc
Description
Specific overloads (float, float32, int, int64) of Seq.sum, ~~Seq.average,~~ Array.sum ~~and Array.average~~ to take advantage of vectorization in System.Linq.Enumerable module.
This is potentially a naive first try to solve #16230 by the spirit of @T-Gro comment https://github.com/dotnet/fsharp/issues/16230#issuecomment-2826895557
Checklist
- [ ] Test cases added
- [x] Performance benchmarks added in case of performance changes
- [x] Release notes entry updated:
Please make sure to add an entry with short succinct description of the change as well as link to this pull request to the respective release notes file, if applicable.
:heavy_exclamation_mark: Release notes required
:white_check_mark: Found changes and release notes in following paths:
Change path Release notes path Description src/FSharp.Coredocs/release-notes/.FSharp.Core/10.0.100.md
It looks like the error case is now different for average over an empty collection. If it is the same exception type, and only a different message, we could justify the break.
(we should not evaluate a seq before calling into Average, since this would be breaking in a different way)
I did some initial tests to see if this makes sense at all or not. I used the current benchmarks\CompiledCodeBenchmarks\MicroPerf\MicroPerf.fsproj with adding following tests:
[<Benchmark>]
member x.ArraySum() = // Array sum
array
|> Array.sum
|> ignore
[<Benchmark>]
member x.ArrayAverage() = // There has an extra map, because average needs float
array
|> Array.map float
|> Array.average
|> ignore
[<Benchmark>]
member x.ArraySeqSum() = // Seq sum by using array as base data
array
|> Seq.sum
|> ignore
[<Benchmark>]
member x.ListSeqSum() = // Seq sum by using list as base data
list
|> Seq.sum
|> ignore
And here are the results with current main:
BenchmarkDotNet v0.13.10, Windows 11 (10.0.26100.3915)
13th Gen Intel Core i9-13900H, 1 CPU, 20 logical and 14 physical cores
.NET SDK 9.0.203
[Host] : .NET 9.0.4 (9.0.425.16305), X64 RyuJIT AVX2 DEBUG
Job-ZVNJYC : .NET 9.0.4 (9.0.425.16305), X64 RyuJIT AVX2
Job-NFLFBL : .NET 9.0.4 (9.0.425.16305), X64 RyuJIT AVX2
Job-ABSOZP : .NET 9.0.4 (9.0.425.16305), X64 RyuJIT AVX2
Server=True
| Method | Job | Arguments | IterationCount | LaunchCount | WarmupCount | Length | Mean | Error | StdDev | Ratio | RatioSD | Gen0 | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ArraySum | Job-ZVNJYC | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 1000 | 206.4 ns | 1.13 ns | 1.06 ns | 1.00 | 0.00 | - | - | NA |
| ArraySum | Job-NFLFBL | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 1000 | 205.9 ns | 0.50 ns | 0.39 ns | 1.00 | 0.01 | - | - | NA |
| ArraySum | Job-ABSOZP | Default | 2 | 2 | 1 | 1000 | 207.8 ns | 9.28 ns | 1.44 ns | 1.01 | 0.01 | - | - | NA |
| ArrayAverage | Job-ZVNJYC | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 1000 | 1,538.6 ns | 35.62 ns | 105.04 ns | 1.00 | 0.00 | 0.1736 | 8024 B | 1.00 |
| ArrayAverage | Job-NFLFBL | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 1000 | 1,471.2 ns | 29.23 ns | 53.45 ns | 0.94 | 0.07 | 0.1736 | 8024 B | 1.00 |
| ArrayAverage | Job-ABSOZP | Default | 2 | 2 | 1 | 1000 | 1,495.0 ns | 759.33 ns | 117.51 ns | 0.92 | 0.07 | 0.1736 | 8024 B | 1.00 |
| ArraySeqSum | Job-ZVNJYC | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 1000 | 500.6 ns | 3.69 ns | 3.46 ns | 1.00 | 0.00 | - | 32 B | 1.00 |
| ArraySeqSum | Job-NFLFBL | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 1000 | 496.4 ns | 1.80 ns | 1.40 ns | 0.99 | 0.01 | - | 32 B | 1.00 |
| ArraySeqSum | Job-ABSOZP | Default | 2 | 2 | 1 | 1000 | 498.9 ns | 34.99 ns | 5.41 ns | 1.00 | 0.01 | - | 32 B | 1.00 |
| ListSeqSum | Job-ZVNJYC | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 1000 | 1,234.6 ns | 3.12 ns | 2.91 ns | 1.00 | 0.00 | - | 40 B | 1.00 |
| ListSeqSum | Job-NFLFBL | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 1000 | 1,243.7 ns | 19.42 ns | 17.21 ns | 1.01 | 0.01 | - | 40 B | 1.00 |
| ListSeqSum | Job-ABSOZP | Default | 2 | 2 | 1 | 1000 | 1,251.6 ns | 55.34 ns | 8.56 ns | 1.01 | 0.01 | 0.0019 | 40 B | 1.00 |
| ArraySum | Job-ZVNJYC | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 10000 | 1,949.4 ns | 8.32 ns | 6.95 ns | 1.00 | 0.00 | - | - | NA |
| ArraySum | Job-NFLFBL | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 10000 | 1,956.5 ns | 24.32 ns | 22.75 ns | 1.00 | 0.01 | - | - | NA |
| ArraySum | Job-ABSOZP | Default | 2 | 2 | 1 | 10000 | 1,966.3 ns | 128.42 ns | 19.87 ns | 1.01 | 0.01 | - | - | NA |
| ArrayAverage | Job-ZVNJYC | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 10000 | 13,892.3 ns | 415.41 ns | 1,211.78 ns | 1.00 | 0.00 | 1.8616 | 80024 B | 1.00 |
| ArrayAverage | Job-NFLFBL | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 10000 | 15,129.6 ns | 242.87 ns | 227.18 ns | 1.09 | 0.13 | 1.9379 | 80024 B | 1.00 |
| ArrayAverage | Job-ABSOZP | Default | 2 | 2 | 1 | 10000 | 12,991.8 ns | 8,355.23 ns | 1,292.98 ns | 1.00 | 0.26 | 1.8616 | 80024 B | 1.00 |
| ArraySeqSum | Job-ZVNJYC | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 10000 | 4,889.1 ns | 48.55 ns | 43.04 ns | 1.00 | 0.00 | - | 32 B | 1.00 |
| ArraySeqSum | Job-NFLFBL | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 10000 | 4,828.3 ns | 45.71 ns | 38.17 ns | 0.99 | 0.01 | - | 32 B | 1.00 |
| ArraySeqSum | Job-ABSOZP | Default | 2 | 2 | 1 | 10000 | 4,833.0 ns | 122.61 ns | 18.97 ns | 0.98 | 0.00 | - | 32 B | 1.00 |
| ListSeqSum | Job-ZVNJYC | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 10000 | 14,042.3 ns | 280.58 ns | 483.98 ns | 1.00 | 0.00 | - | 40 B | 1.00 |
| ListSeqSum | Job-NFLFBL | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 10000 | 13,767.9 ns | 240.03 ns | 420.40 ns | 0.98 | 0.04 | - | 40 B | 1.00 |
| ListSeqSum | Job-ABSOZP | Default | 2 | 2 | 1 | 10000 | 13,537.7 ns | 4,046.83 ns | 626.25 ns | 0.95 | 0.07 | - | 40 B | 1.00 |
Here are the results with this PR:
| Method | Job | Arguments | IterationCount | LaunchCount | WarmupCount | Length | Mean | Error | StdDev | Median | Ratio | RatioSD | Gen0 | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ArraySum | Job-WQFBJH | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 1000 | 60.66 ns | 0.489 ns | 0.434 ns | 60.59 ns | 1.00 | 0.00 | - | - | NA |
| ArraySum | Job-ZZOCDZ | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 1000 | 207.89 ns | 3.431 ns | 3.209 ns | 206.96 ns | 3.43 | 0.05 | - | - | NA |
| ArraySum | Job-BMQAWT | Default | 2 | 2 | 1 | 1000 | 60.18 ns | 3.505 ns | 0.542 ns | 60.19 ns | 0.99 | 0.02 | - | - | NA |
| ArrayAverage | Job-WQFBJH | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 1000 | 1,693.99 ns | 29.205 ns | 28.683 ns | 1,694.44 ns | 1.00 | 0.00 | 0.1736 | 8024 B | 1.00 |
| ArrayAverage | Job-ZZOCDZ | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 1000 | 1,443.74 ns | 47.701 ns | 140.649 ns | 1,446.84 ns | 0.87 | 0.04 | 0.1736 | 8024 B | 1.00 |
| ArrayAverage | Job-BMQAWT | Default | 2 | 2 | 1 | 1000 | 1,486.78 ns | 226.790 ns | 35.096 ns | 1,493.10 ns | 0.89 | 0.02 | 0.1736 | 8024 B | 1.00 |
| ArraySeqSum | Job-WQFBJH | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 1000 | 60.23 ns | 0.426 ns | 0.378 ns | 60.11 ns | 1.00 | 0.00 | - | - | NA |
| ArraySeqSum | Job-ZZOCDZ | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 1000 | 497.31 ns | 4.195 ns | 3.719 ns | 496.69 ns | 8.26 | 0.09 | - | 32 B | NA |
| ArraySeqSum | Job-BMQAWT | Default | 2 | 2 | 1 | 1000 | 61.30 ns | 1.343 ns | 0.208 ns | 61.31 ns | 1.02 | 0.00 | - | - | NA |
| ListSeqSum | Job-WQFBJH | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 1000 | 1,154.49 ns | 9.687 ns | 9.061 ns | 1,155.70 ns | 1.00 | 0.00 | - | 40 B | 1.00 |
| ListSeqSum | Job-ZZOCDZ | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 1000 | 1,228.62 ns | 5.630 ns | 4.395 ns | 1,228.27 ns | 1.06 | 0.01 | - | 40 B | 1.00 |
| ListSeqSum | Job-BMQAWT | Default | 2 | 2 | 1 | 1000 | 1,164.41 ns | 66.398 ns | 10.275 ns | 1,159.92 ns | 1.01 | 0.02 | 0.0019 | 40 B | 1.00 |
| ArraySum | Job-WQFBJH | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 10000 | 546.81 ns | 1.656 ns | 1.549 ns | 547.20 ns | 1.00 | 0.00 | - | - | NA |
| ArraySum | Job-ZZOCDZ | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 10000 | 1,961.44 ns | 25.704 ns | 24.044 ns | 1,961.63 ns | 3.59 | 0.05 | - | - | NA |
| ArraySum | Job-BMQAWT | Default | 2 | 2 | 1 | 10000 | 541.97 ns | 33.331 ns | 5.158 ns | 541.34 ns | 0.99 | 0.01 | - | - | NA |
| ArrayAverage | Job-WQFBJH | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 10000 | 16,012.49 ns | 316.673 ns | 721.226 ns | 16,238.87 ns | 1.00 | 0.00 | 1.8463 | 80024 B | 1.00 |
| ArrayAverage | Job-ZZOCDZ | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 10000 | 13,651.54 ns | 310.627 ns | 865.903 ns | 13,843.05 ns | 0.84 | 0.08 | 1.8768 | 80024 B | 1.00 |
| ArrayAverage | Job-BMQAWT | Default | 2 | 2 | 1 | 10000 | 17,271.21 ns | 3,620.263 ns | 560.240 ns | 17,365.04 ns | 1.10 | 0.04 | 1.8921 | 80024 B | 1.00 |
| ArraySeqSum | Job-WQFBJH | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 10000 | 554.40 ns | 3.815 ns | 3.569 ns | 554.73 ns | 1.00 | 0.00 | - | - | NA |
| ArraySeqSum | Job-ZZOCDZ | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 10000 | 4,813.90 ns | 39.532 ns | 33.011 ns | 4,801.26 ns | 8.69 | 0.09 | - | 32 B | NA |
| ArraySeqSum | Job-BMQAWT | Default | 2 | 2 | 1 | 10000 | 551.93 ns | 12.280 ns | 1.900 ns | 552.36 ns | 1.00 | 0.00 | - | - | NA |
| ListSeqSum | Job-WQFBJH | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 10000 | 14,523.24 ns | 768.122 ns | 2,240.648 ns | 13,893.55 ns | 1.00 | 0.00 | - | 40 B | 1.00 |
| ListSeqSum | Job-ZZOCDZ | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 10000 | 15,181.03 ns | 458.748 ns | 1,338.189 ns | 14,985.06 ns | 1.07 | 0.18 | - | 40 B | 1.00 |
| ListSeqSum | Job-BMQAWT | Default | 2 | 2 | 1 | 10000 | 13,985.62 ns | 10,680.237 ns | 1,652.779 ns | 13,530.39 ns | 1.00 | 0.17 | - | 40 B | 1.00 |
Edit: I used Net 9 but the FSharp.Core.dll netstandard2.0 version in both, which was probably a mistake because Spans are truely efficient on netstandard2.1 only (?)
The IL for Enumerable is very complicated, and is heavily relying on the JIT to simplify it. In Release mode, I observe Enumerable to take half the time; in Debug mode Enumerable takes about 4x the time.
Hmm, I'm thinking if default FSI is compiled with debug mode and people use F# as a scripting language, then performance optimizations like this could be marked as #if !DEBUG. However, it would be even weirder if an error message were different between debug and release.
I did run the same performance tests with FSharp.Core.dll netstandard2.1 version
Main branch:
| Method | Job | Arguments | IterationCount | LaunchCount | WarmupCount | Length | Mean | Error | StdDev | Ratio | RatioSD | Gen0 | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ArraySum | Job-WCJSKA | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 1000 | 207.2 ns | 2.34 ns | 2.19 ns | 1.00 | 0.00 | - | - | NA |
| ArraySum | Job-ZZTGPQ | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 1000 | 204.2 ns | 3.47 ns | 2.90 ns | 0.99 | 0.02 | - | - | NA |
| ArraySum | Job-EYKYAG | Default | 2 | 2 | 1 | 1000 | 206.8 ns | 22.17 ns | 3.43 ns | 1.01 | 0.02 | - | - | NA |
| ArraySeqSum | Job-WCJSKA | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 1000 | 490.6 ns | 2.15 ns | 1.80 ns | 1.00 | 0.00 | - | 32 B | 1.00 |
| ArraySeqSum | Job-ZZTGPQ | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 1000 | 486.8 ns | 2.41 ns | 2.13 ns | 0.99 | 0.01 | - | 32 B | 1.00 |
| ArraySeqSum | Job-EYKYAG | Default | 2 | 2 | 1 | 1000 | 488.8 ns | 74.19 ns | 11.48 ns | 1.00 | 0.02 | - | 32 B | 1.00 |
| ListSeqSum | Job-WCJSKA | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 1000 | 1,214.5 ns | 2.83 ns | 2.51 ns | 1.00 | 0.00 | - | 40 B | 1.00 |
| ListSeqSum | Job-ZZTGPQ | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 1000 | 1,241.1 ns | 19.15 ns | 17.91 ns | 1.02 | 0.01 | - | 40 B | 1.00 |
| ListSeqSum | Job-EYKYAG | Default | 2 | 2 | 1 | 1000 | 1,216.6 ns | 34.26 ns | 5.30 ns | 1.00 | 0.00 | 0.0019 | 40 B | 1.00 |
| ArraySum | Job-WCJSKA | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 10000 | 1,905.3 ns | 8.41 ns | 7.02 ns | 1.00 | 0.00 | - | - | NA |
| ArraySum | Job-ZZTGPQ | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 10000 | 1,924.9 ns | 12.14 ns | 11.36 ns | 1.01 | 0.01 | - | - | NA |
| ArraySum | Job-EYKYAG | Default | 2 | 2 | 1 | 10000 | 1,933.3 ns | 95.32 ns | 14.75 ns | 1.01 | 0.01 | - | - | NA |
| ArraySeqSum | Job-WCJSKA | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 10000 | 4,710.8 ns | 18.93 ns | 15.80 ns | 1.00 | 0.00 | - | 32 B | 1.00 |
| ArraySeqSum | Job-ZZTGPQ | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 10000 | 4,760.8 ns | 17.52 ns | 15.53 ns | 1.01 | 0.00 | - | 32 B | 1.00 |
| ArraySeqSum | Job-EYKYAG | Default | 2 | 2 | 1 | 10000 | 4,776.5 ns | 345.62 ns | 53.49 ns | 1.01 | 0.01 | - | 32 B | 1.00 |
| ListSeqSum | Job-WCJSKA | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 10000 | 14,616.1 ns | 355.10 ns | 1,030.22 ns | 1.00 | 0.00 | - | 40 B | 1.00 |
| ListSeqSum | Job-ZZTGPQ | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 10000 | 13,809.1 ns | 270.69 ns | 300.87 ns | 0.94 | 0.08 | - | 40 B | 1.00 |
| ListSeqSum | Job-EYKYAG | Default | 2 | 2 | 1 | 10000 | 15,396.3 ns | 3,747.35 ns | 579.91 ns | 1.05 | 0.14 | - | 40 B | 1.00 |
This PR
| Method | Job | Arguments | IterationCount | LaunchCount | WarmupCount | Length | Mean | Error | StdDev | Median | Ratio | RatioSD | Gen0 | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ArraySum | Job-YRWBVV | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 1000 | 59.19 ns | 0.550 ns | 0.514 ns | 59.01 ns | 1.00 | 0.00 | - | - | NA |
| ArraySum | Job-YQEOCP | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 1000 | 206.10 ns | 0.720 ns | 0.674 ns | 206.08 ns | 3.48 | 0.03 | - | - | NA |
| ArraySum | Job-SEOIAS | Default | 2 | 2 | 1 | 1000 | 59.65 ns | 4.451 ns | 0.689 ns | 59.64 ns | 1.01 | 0.02 | - | - | NA |
| ArraySeqSum | Job-YRWBVV | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 1000 | 59.88 ns | 0.326 ns | 0.305 ns | 59.84 ns | 1.00 | 0.00 | - | - | NA |
| ArraySeqSum | Job-YQEOCP | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 1000 | 496.87 ns | 6.676 ns | 6.245 ns | 493.92 ns | 8.30 | 0.10 | - | 32 B | NA |
| ArraySeqSum | Job-SEOIAS | Default | 2 | 2 | 1 | 1000 | 60.61 ns | 3.319 ns | 0.514 ns | 60.63 ns | 1.01 | 0.01 | - | - | NA |
| ListSeqSum | Job-YRWBVV | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 1000 | 1,163.60 ns | 18.969 ns | 17.743 ns | 1,162.85 ns | 1.00 | 0.00 | - | 40 B | 1.00 |
| ListSeqSum | Job-YQEOCP | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 1000 | 1,230.86 ns | 6.546 ns | 6.123 ns | 1,233.38 ns | 1.06 | 0.02 | - | 40 B | 1.00 |
| ListSeqSum | Job-SEOIAS | Default | 2 | 2 | 1 | 1000 | 1,156.84 ns | 66.957 ns | 10.362 ns | 1,153.00 ns | 0.99 | 0.02 | 0.0019 | 40 B | 1.00 |
| ArraySum | Job-YRWBVV | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 10000 | 553.52 ns | 6.542 ns | 6.119 ns | 554.30 ns | 1.00 | 0.00 | - | - | NA |
| ArraySum | Job-YQEOCP | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 10000 | 1,944.70 ns | 16.488 ns | 15.423 ns | 1,944.27 ns | 3.51 | 0.05 | - | - | NA |
| ArraySum | Job-SEOIAS | Default | 2 | 2 | 1 | 10000 | 547.22 ns | 53.119 ns | 8.220 ns | 543.16 ns | 1.00 | 0.02 | - | - | NA |
| ArraySeqSum | Job-YRWBVV | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 10000 | 549.69 ns | 2.934 ns | 2.600 ns | 549.51 ns | 1.00 | 0.00 | - | - | NA |
| ArraySeqSum | Job-YQEOCP | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 10000 | 4,843.34 ns | 64.337 ns | 60.181 ns | 4,826.96 ns | 8.82 | 0.12 | - | 32 B | NA |
| ArraySeqSum | Job-SEOIAS | Default | 2 | 2 | 1 | 10000 | 551.21 ns | 42.723 ns | 6.611 ns | 550.96 ns | 1.00 | 0.02 | - | - | NA |
| ListSeqSum | Job-YRWBVV | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 10000 | 13,349.78 ns | 349.754 ns | 992.195 ns | 12,981.18 ns | 1.00 | 0.00 | - | 40 B | 1.00 |
| ListSeqSum | Job-YQEOCP | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 10000 | 14,501.57 ns | 311.609 ns | 894.065 ns | 14,248.72 ns | 1.09 | 0.10 | - | 40 B | 1.00 |
| ListSeqSum | Job-SEOIAS | Default | 2 | 2 | 1 | 10000 | 12,715.10 ns | 817.355 ns | 126.487 ns | 12,726.11 ns | 0.99 | 0.03 | - | 40 B | 1.00 |
By quick look of this, the Sum seems to be improved but the Average not really. Should I revert the Average change?
Edit: Average removed
1/As per your benchmark, the cost for average might be dominated by doing the casts and additional array allocations. Better try without the mapping step for fully prove that.
2/The idea of using if !DEBUG - if this is placed in the added code, it would react on conditional defines of when we build FSharp.Core (.e.g when packing together a new .NET SDK), not based on user code. The shipped FSharp.Core is always RELEASE, even if user code is in DEBUG.
What are the numbers when running on the full clr? 4.6-4.8? Enumerable probably has different implementation there.
Average will now throw different exception, so it's a breaking change at this point.
Average will now throw different exception, so it's a breaking change at this point.
@Thorium: This does not mean it is a showstopper. If the change is good from all other perspective, such break can be accepted and documented (especially if it will be e.g. just a subtype from the same exception type).
Better try without the mapping step for fully prove that.
I did check this, and the results were worse for Enumerable.Sum Main - current
| Method | Job | Arguments | IterationCount | LaunchCount | WarmupCount | Length | Mean | Error | StdDev | Ratio | RatioSD | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ArrayAverage | Job-XSEKSP | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 1000 | 291.8 ns | 5.86 ns | 8.21 ns | 1.00 | 0.00 | - | NA |
| ArrayAverage | Job-BKIONA | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 1000 | 276.6 ns | 1.25 ns | 1.11 ns | 0.94 | 0.02 | - | NA |
| ArrayAverage | Job-LYJOOZ | Default | 2 | 2 | 1 | 1000 | 277.2 ns | 4.73 ns | 0.73 ns | 0.93 | 0.02 | - | NA |
| ArrayAverage | Job-XSEKSP | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 10000 | 2,669.5 ns | 12.65 ns | 11.21 ns | 1.00 | 0.00 | - | NA |
| ArrayAverage | Job-BKIONA | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 10000 | 2,661.2 ns | 4.14 ns | 3.87 ns | 1.00 | 0.00 | - | NA |
| ArrayAverage | Job-LYJOOZ | Default | 2 | 2 | 1 | 10000 | 2,677.6 ns | 139.82 ns | 21.64 ns | 1.00 | 0.01 | - | NA |
| ------------- | ----------- | ------------------------------ | --------------- | ------------ | ------------ | ------- | -----------: | ----------: | ---------: | ------: | --------: | ----------: | ------------: |
This PR - with Enumerable.Sum
| Method | Job | Arguments | IterationCount | LaunchCount | WarmupCount | Length | Mean | Error | StdDev | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ArrayAverage | Job-GSXKPB | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 1000 | 503.6 ns | 9.32 ns | 8.71 ns | 1.00 | - | NA |
| ArrayAverage | Job-LFPVKH | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 1000 | 275.4 ns | 0.45 ns | 0.37 ns | 0.55 | - | NA |
| ArrayAverage | Job-SPCSFF | Default | 2 | 2 | 1 | 1000 | 494.2 ns | 12.91 ns | 2.00 ns | 0.98 | - | NA |
| ArrayAverage | Job-GSXKPB | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 10000 | 5,277.4 ns | 8.47 ns | 7.07 ns | 1.00 | - | NA |
| ArrayAverage | Job-LFPVKH | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 10000 | 2,661.1 ns | 4.92 ns | 4.11 ns | 0.50 | - | NA |
| ArrayAverage | Job-SPCSFF | Default | 2 | 2 | 1 | 10000 | 5,280.3 ns | 83.03 ns | 12.85 ns | 1.00 | - | NA |
Because this is just a forwarding, behaviour difference is easy to explore already in FSI:
[| 1.; Double.NaN |] |> Array.sum;;
//val it: float = nan
[| 1.; Double.NaN |] |> System.Linq.Enumerable.Sum;;
//val it: float = nan
Seems to be the same with edge-cases like Double.MaxValue and infinity.
What are the numbers when running on the full clr? 4.6-4.8? Enumerable probably has different implementation there.
I don't know how to easily run MicroPerf.fsproj in .NET, but I can run
array |> Array.sum vs array |> System.Linq.Enumerable.Sum in .NET 4.8.
Benchmarks.LinqEnumerableSum: .NET Framework 4.8(Runtime=.NET Framework 4.8) [Length=10000]
Runtime = .NET Framework 4.8.1 (4.8.9310.0), X64 RyuJIT VectorSize=256; GC = Concurrent Workstation
Mean = 29.934 us, StdErr = 0.069 us (0.23%), N = 15, StdDev = 0.266 us
Min = 29.483 us, Q1 = 29.784 us, Median = 29.985 us, Q3 = 30.094 us, Max = 30.387 us
IQR = 0.310 us, LowerFence = 29.320 us, UpperFence = 30.558 us
ConfidenceInterval = [29.649 us; 30.219 us] (CI 99.9%), Margin = 0.285 us (0.95% of Mean)
Skewness = -0.15, Kurtosis = 1.96, MValue = 2
-------------------- Histogram --------------------
[29.341 us ; 30.529 us) | @@@@@@@@@@@@@@@
---------------------------------------------------
// * Summary *
BenchmarkDotNet v0.13.12, Windows 11 (10.0.26100.3915) 13th Gen Intel Core i9-13900H, 1 CPU, 20 logical and 14 physical cores [Host] : .NET Framework 4.8.1 (4.8.9310.0), X64 LegacyJIT VectorSize=256 DEBUG .NET Framework 4.8 : .NET Framework 4.8.1 (4.8.9310.0), X64 RyuJIT VectorSize=256
Job=.NET Framework 4.8 Runtime=.NET Framework 4.8
| Method | Length | Mean | Error | StdDev | Ratio | RatioSD | Gen0 | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|
| FSharpArraySum | 1000 | 456.7 ns | 8.08 ns | 7.56 ns | 1.00 | 0.00 | - | - | NA |
| LinqEnumerableSum | 1000 | 3,033.3 ns | 51.61 ns | 48.28 ns | 6.64 | 0.07 | 0.0038 | 32 B | NA |
| FSharpArraySum | 10000 | 4,480.0 ns | 55.90 ns | 52.29 ns | 1.00 | 0.00 | - | - | NA |
| LinqEnumerableSum | 10000 | 29,933.7 ns | 284.86 ns | 266.46 ns | 6.68 | 0.10 | - | 32 B | NA |
Not exactly the results I was hoping for. Any ideas?
Framework-dependent runtime switch would solve this (and since it is static, I would hope JIT would eliminate such branching), but is not something we are doing in FSharp.Core.
On Desktop:
- Only forward Seq calls to Enumerable, if anything (or maybe just keep as it is )
On Core:
- Forward Seq.sum and Array.sum. Exclude average due to reasons above.
It would be great if we could make use of https://learn.microsoft.com/en-us/dotnet/api/system.numerics.tensors.tensorprimitives.sum?view=net-9.0-pp#system-numerics-tensors-tensorprimitives-sum-1(system-readonlyspan((-0))) for some of those implementations. Right now, any such support depending on Tensors will have to be done as a separate lib outside of FSharp.Core .
Not exactly the results I was hoping for. Any ideas?
Yeah, that's what I suspected will happen :( Bcl in full framework doesn't have those functions vectorised.
And the issue with fslib is that it ships as netstandard, so there's no way to tailor separate implementations for netfx and netcore.
The IEnumerable.Sum of NET 9 checks the support for vectors via Vector<T>.IsSupported and Vector.IsHardwareAccelerated.
But I assume there is not even a type of System.Numerics.Vector<T> in .NET Framework 4.8. / netstandard2.0.
Would there be a fast way to check something like Type.GetType("System.Numerics.Vector") <> null to check in static optimization conditionals?
On Desktop:
What do you mean by "On Desktop" and "On Core" ?
I can see that in FSharp.Core.fsproj there is <TargetFrameworks Condition="'$(Configuration)' != 'Proto'">netstandard2.0;netstandard2.1</TargetFrameworks> so there would be possible to do:
<DefineConstants Condition=" '$(TargetFramework)' == 'netstandard2.1' >$(DefineConstants);NETSTANDARD21</DefineConstants>
And then use that, however, this would add one more parallel code-path for long-term maintenance.
The
IEnumerable.Sumof NET 9 checks the support for vectors viaVector<T>.IsSupportedandVector.IsHardwareAccelerated.But I assume there is not even a type of
System.Numerics.Vector<T>in .NET Framework 4.8. / netstandard2.0.Would there be a fast way to check something like
Type.GetType("System.Numerics.Vector") <> nullto check in static optimization conditionals?
Reflection can likely neglect many optimizations (needs proving tho).
On Desktop:
What do you mean by "On Desktop" and "On Core" ?
Desktop is usually referred when full framework is used, and core is when coreclr runtime and bcl are used.
I can see that in FSharp.Core.fsproj there is
<TargetFrameworks Condition="'$(Configuration)' != 'Proto'">netstandard2.0;netstandard2.1</TargetFrameworks>so there would be possible to do:
<DefineConstants Condition=" '$(TargetFramework)' == 'netstandard2.1' >$(DefineConstants);NETSTANDARD21</DefineConstants>And then use that, however, this would add one more parallel code-path for long-term maintenance.
These should already be defined. However they will be defined on both net48 and net9.0
I meant a statically stored flag based on something like:
open System.Runtime.InteropServices
let runtimeDescription = RuntimeInformation.FrameworkDescription
let hasLinqAcceleration = runtimeDescription > "..." // string-based logic here
And switching at runtime (so cannot be part of statically optimized when clauses in FSharp.Core, since those worked with inlining at compile-time.
e.g. top level function:
For unsupported types (anything but numerics), it would immediately call the existing code.
For statically optimized numeric primitives, it would then do the decision based on this static flag, and pick between existing code and Enumerable.Sum.
And switching at runtime
Runtime check added. Compared to this https://github.com/dotnet/fsharp/pull/18509#issuecomment-2833350985 ...this PR Array.sum still beats the current main over 50%:
This PR:
| Method | Job | Arguments | IterationCount | LaunchCount | WarmupCount | Length | Mean | Error | StdDev | Median | Ratio | RatioSD | Gen0 | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ArraySum | Job-HPWEPX | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 1000 | 71.20 ns | 0.395 ns | 0.330 ns | 71.28 ns | 1.00 | 0.00 | - | - | NA |
| ArraySum | Job-FURVPH | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 1000 | 203.16 ns | 1.536 ns | 1.437 ns | 202.73 ns | 2.85 | 0.03 | - | - | NA |
| ArraySum | Job-RIJJDM | Default | 2 | 2 | 1 | 1000 | 69.23 ns | 3.499 ns | 0.541 ns | 69.03 ns | 0.97 | 0.00 | - | - | NA |
| ArraySeqSum | Job-HPWEPX | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 1000 | 72.44 ns | 1.457 ns | 1.842 ns | 71.38 ns | 1.00 | 0.00 | - | - | NA |
| ArraySeqSum | Job-FURVPH | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 1000 | 497.44 ns | 6.606 ns | 5.856 ns | 495.07 ns | 6.79 | 0.22 | - | 32 B | NA |
| ArraySeqSum | Job-RIJJDM | Default | 2 | 2 | 1 | 1000 | 70.06 ns | 2.019 ns | 0.312 ns | 70.09 ns | 0.94 | 0.01 | - | - | NA |
| ListSeqSum | Job-HPWEPX | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 1000 | 1,143.31 ns | 7.123 ns | 6.663 ns | 1,140.53 ns | 1.00 | 0.00 | - | 40 B | 1.00 |
| ListSeqSum | Job-FURVPH | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 1000 | 1,227.64 ns | 13.709 ns | 12.823 ns | 1,222.18 ns | 1.07 | 0.01 | - | 40 B | 1.00 |
| ListSeqSum | Job-RIJJDM | Default | 2 | 2 | 1 | 1000 | 1,153.34 ns | 41.507 ns | 6.423 ns | 1,155.69 ns | 1.00 | 0.01 | 0.0019 | 40 B | 1.00 |
| ArraySum | Job-HPWEPX | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 10000 | 559.23 ns | 6.984 ns | 6.191 ns | 559.52 ns | 1.00 | 0.00 | - | - | NA |
| ArraySum | Job-FURVPH | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 10000 | 1,939.43 ns | 19.240 ns | 17.997 ns | 1,932.29 ns | 3.47 | 0.04 | - | - | NA |
| ArraySum | Job-RIJJDM | Default | 2 | 2 | 1 | 10000 | 550.02 ns | 50.072 ns | 7.749 ns | 549.63 ns | 0.99 | 0.01 | - | - | NA |
| ArraySeqSum | Job-HPWEPX | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 10000 | 568.95 ns | 10.940 ns | 13.836 ns | 564.94 ns | 1.00 | 0.00 | - | - | NA |
| ArraySeqSum | Job-FURVPH | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 10000 | 4,784.67 ns | 60.194 ns | 56.306 ns | 4,776.55 ns | 8.41 | 0.23 | - | 32 B | NA |
| ArraySeqSum | Job-RIJJDM | Default | 2 | 2 | 1 | 10000 | 561.27 ns | 85.016 ns | 13.156 ns | 557.93 ns | 0.96 | 0.03 | - | - | NA |
| ListSeqSum | Job-HPWEPX | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 10000 | 14,808.61 ns | 704.739 ns | 2,077.937 ns | 14,333.68 ns | 1.00 | 0.00 | - | 40 B | 1.00 |
| ListSeqSum | Job-FURVPH | /p:BUILDING_USING_DOTNET=true | Default | Default | Default | 10000 | 16,225.89 ns | 609.042 ns | 1,795.773 ns | 16,355.29 ns | 1.11 | 0.18 | - | 40 B | 1.00 |
| ListSeqSum | Job-RIJJDM | Default | 2 | 2 | 1 | 10000 | 15,371.93 ns | 14,341.671 ns | 2,219.390 ns | 15,512.85 ns | 1.00 | 0.23 | - | 40 B | 1.00 |
After Sum is solved, I guess it's time for SumBy to do the same, because SumBy is the more commonly used in F#. There is System.Linq.Enumerable.Sum overload that takes a tuple of IEnumerable and a projection.
After
Sumis solved, I guess it's time forSumByto do the same, because SumBy is the more commonly used in F#. There is System.Linq.Enumerable.Sum overload that takes a tuple of IEnumerable and a projection.
Is it vectorized as well? Hard to imagine how - since with a projection, the data to sum by is not stored next to each other.
Is it doing copies first? I am curious.
Oh, you are right, it isn't. That's sad, makes this very marginal feature especially when tools like FSharpLint prefer sumBy over (map >> sum)
However they will be defined on both net48 and net9.0
.NET Framework is not .NET Standard 2.1 compatible, thus .NET Standard 2.1 version shouldn't need runtime check. I expect the "supported" frameworks (aka. where things should be as fast as possible) are only:
- .NET Framework Full (4.62-...) with .NET Standard 2.0
- dotnet 8/9 (both .NET Standard 2.0 and .NET Standard 2,1 should work here)
However they will be defined on both net48 and net9.0
.NET Framework is not .NET Standard 2.1 compatible, thus .NET Standard 2.1 version shouldn't need runtime check.
I expect the "supported" frameworks (aka. where things should be as fast as possible) are only:
.NET Framework Full (4.62-...) with .NET Standard 2.0
dotnet 8/9 (both .NET Standard 2.0 and .NET Standard 2,1 should work here)
Will all netstandard2.1 implementations be vectorized? If yes, then it's safe enough (and also it should be defined already).
However they will be defined on both net48 and net9.0
.NET Framework is not .NET Standard 2.1 compatible, thus .NET Standard 2.1 version shouldn't need runtime check. I expect the "supported" frameworks (aka. where things should be as fast as possible) are only:
- .NET Framework Full (4.62-...) with .NET Standard 2.0
- dotnet 8/9 (both .NET Standard 2.0 and .NET Standard 2,1 should work here)
@Thorium : I think the discussion about the future deployments of FSharp.Core derailed the discussion. It is a very valid point, but not one which should prevent this PR from happening.
(It's not that a different strategy for shipping FSharp.Core for netCurrent is dismissed as not being allowed, it is rather work which did not happen yet.)
I would like this optimization to go into NET10 as is, are you still interested in finalizing this PR and getting it in?
I would like this optimization to go into NET10 as is, are you still interested in finalizing this PR and getting it in?
Sure. Is there anything that should be added still? I think all the existing conversation items were already resolved?
Tests were failing, otherwise it was good. The results have expired since, let me get fresh results.
@Thorium :
As I am revisiting open PRs before NET10 closes, I again came across this one. The remaining concern is the one raised by Tanner, that is how much worse is a non-desktop implementation that is not HW accelerated.
The check for HW acceleration happens within Enumerable.Sum itself and is correct to be there, that is fine.
If you are still interested in pursuing that, I think that this last remaining concern could be lifted by running a test on:
.net9 target, but with DOTNET_EnableHWIntrinsic=0 enviroment variable. That will hit the "new" code path of Enumerable.Sum, however not use HW acceleration.
The goal is not to be drastically worse than existing F# Array.sum.
If this holds, I would be happy to merge it in 👍 .
As far as the runtime check in FSharp.Core goes, it is not up to this PR to restructure it - it shall use the already implemented mechanism.
Multitargeting of FSharp.Core (easy) and assessing the full downstream impact (hard) shall be taken separately.
Hi,
The remaining concern is the one raised by Tanner, that is how much worse is a non-desktop implementation that is not HW accelerated.
This case was already tested before earlier in this thread: https://github.com/dotnet/fsharp/pull/18509#issuecomment-2834590096 ...and the answer was: almost 10x worse.