FSharpPerformance icon indicating copy to clipboard operation
FSharpPerformance copied to clipboard

Some WIP's and ideas for SIMD

Open delneg opened this issue 3 years ago • 7 comments

Hi, I've studied version10 of Topological Sort a bit and I see that there's some operations which are done once per loop, and I had an idea that they can be parallelized. Parallelization via Tasks can bring a lot of overhead, but probably one could use SIMD in this case for primitive operations.

I've added an example of Edge.BatchAdd, and some comments - although I can't get it fully working but I wanted to share the general idea

P.S. Also, .NET 7.0 is needed because they added ShiftRight / ShiftLeft only in this new version

delneg avatar May 29 '22 08:05 delneg

This is super interesting! I'll need to look over this and see what the performance difference will be. Thank you for the ideas!

matthewcrews avatar May 31 '22 16:05 matthewcrews

Also, added some micro-optimizations here https://github.com/delneg/FSharpPerformance/tree/wip/optimize-v10

|     Method |     Mean |    Error |   StdDev |   Median |   Gen 0 | Allocated |
|----------- |---------:|---------:|---------:|---------:|--------:|----------:|
| Version_10 | 750.6 μs | 14.75 μs | 29.45 μs | 739.2 μs | 28.3203 |    234 KB |
| Version_11 | 724.6 μs | 10.36 μs |  9.69 μs | 721.7 μs | 24.4141 |    203 KB |

It's a little bit faster for me & 31kb less memory hungry

delneg avatar May 31 '22 23:05 delneg

Interesting, on my machine it's slower. I have an AMD 3900x which is a Zen 2. Are you running on an Intel?

// * Summary *

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.22000
AMD Ryzen 9 3900X, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.4.22252.9
  [Host]     : .NET 7.0.0 (7.0.22.22904), X64 RyuJIT DEBUG
  DefaultJob : .NET 7.0.0 (7.0.22.22904), X64 RyuJIT
Method Mean Error StdDev Gen 0 Allocated
Version_10 522.7 us 7.37 us 6.90 us 28.3203 234 KB
Version_11 662.1 us 12.82 us 14.77 us 24.4141 203 KB

It is interesting to see the performance difference between NET6 and NET7. Version_10 is ~50us faster with NET7

matthewcrews avatar Jun 01 '22 03:06 matthewcrews

Yep, I'm testing on an x64 macOS

BenchmarkDotNet=v0.13.1, OS=macOS Big Sur 11.4
Intel Core i9-9980HK CPU 2.40GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.100-preview.5.22267.11
  [Host]     : .NET 7.0.0 (7.0.22.26611), X64 RyuJIT DEBUG
  DefaultJob : .NET 7.0.0 (7.0.22.26611), X64 RyuJIT

|     Method |     Mean |    Error |   StdDev |   Gen 0 | Allocated |
|----------- |---------:|---------:|---------:|--------:|----------:|
| Version_10 | 845.0 us | 15.81 us | 15.52 us | 28.3203 |    234 KB |
| Version_11 | 771.5 us | 11.80 us |  9.85 us | 24.4141 |    203 KB |

I've also tried it on an M1 Mac for interesting results:

BenchmarkDotNet=v0.13.1, OS=macOS Monterey 12.1
Apple M1, 1 CPU, 8 logical and 8 physical cores
.NET SDK=7.0.100-preview.4.22252.9
  [Host]     : .NET 7.0.0 (7.0.22.22904), Arm64 RyuJIT DEBUG
  DefaultJob : .NET 7.0.0 (7.0.22.22904), Arm64 RyuJIT


|     Method |     Mean |   Error |  StdDev |   Gen 0 | Allocated |
|----------- |---------:|--------:|--------:|--------:|----------:|
| Version_10 | 475.9 us | 0.98 us | 0.92 us | 38.0859 |    234 KB |
| Version_11 | 475.4 us | 1.92 us | 1.80 us | 32.7148 |    203 KB |

Both with .NET 7, both after commenting out value option ValuesCount caching (last commit just pushed) - commented it out because it was actually making it slower.

So, maybe what I did was actually a No-Op after all

delneg avatar Jun 01 '22 08:06 delneg

Super interesting. This is why I've been tempted to build an Intel machine along with my AMD. I knew these performance differences were possible but that seems pretty dramatic. I wonder what the root cause is for the performance difference between our different CPUs.

matthewcrews avatar Jun 01 '22 13:06 matthewcrews

Well, arm64 is very different from x64 for sure and got many improvements in .net7 so no wonder Also I'm testing on mac and not on windows so it may also affect the performance due to differences in .net runtime performance Anyway, for your use case of <500us it's safe to say that you could buy an M1 Mac and have your one million graphs sorted in time :)

P.S. I'd say it's tempting to have x86_64, ARM64, and RISC-V platforms for testing, not only amd and intel Also maybe get your hands onto some used cloud-grade cpu's (like the ones amazon uses or smth like that) if you're running your workloads in the cloud

delneg avatar Jun 01 '22 13:06 delneg

It seems that I've managed to make it faster by making Graph a Struct and passing it by inref<Graph>

BenchmarkDotNet=v0.13.1, OS=macOS Big Sur 11.4 (20F71) [Darwin 20.5.0]
Intel Core i9-9980HK CPU 2.40GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.100-preview.5.22267.11
  [Host]     : .NET 7.0.0 (7.0.22.26611), X64 RyuJIT DEBUG
  DefaultJob : .NET 7.0.0 (7.0.22.26611), X64 RyuJIT


|     Method |     Mean |    Error |   StdDev |   Gen 0 |  Gen 1 | Allocated |
|----------- |---------:|---------:|---------:|--------:|-------:|----------:|
| Version_10 | 808.6 us | 10.51 us |  9.83 us | 28.3203 |      - |    234 KB |
| Version_11 | 699.5 us | 10.69 us | 10.00 us | 21.4844 | 0.9766 |    180 KB |
BenchmarkDotNet=v0.13.1, OS=macOS Monterey 12.1 (21C52) [Darwin 21.2.0]
Apple M1, 1 CPU, 8 logical and 8 physical cores
.NET SDK=7.0.100-preview.4.22252.9
  [Host]     : .NET 7.0.0 (7.0.22.22904), Arm64 RyuJIT DEBUG
  DefaultJob : .NET 7.0.0 (7.0.22.22904), Arm64 RyuJIT


|     Method |     Mean |   Error |  StdDev |   Gen 0 | Allocated |
|----------- |---------:|--------:|--------:|--------:|----------:|
| Version_10 | 477.9 us | 0.76 us | 0.67 us | 38.0859 |    234 KB |
| Version_11 | 467.0 us | 0.95 us | 0.89 us | 29.2969 |    180 KB |

Update: By moving all the EdgeTracker code to static inline members & passing around remainingEdges array with Span directly, I was able to get this (M1 here because it's more consistent)

BenchmarkDotNet=v0.13.1, OS=macOS Monterey 12.1 (21C52) [Darwin 21.2.0]
Apple M1, 1 CPU, 8 logical and 8 physical cores
.NET SDK=7.0.100-preview.4.22252.9
  [Host]     : .NET 7.0.0 (7.0.22.22904), Arm64 RyuJIT DEBUG
  DefaultJob : .NET 7.0.0 (7.0.22.22904), Arm64 RyuJIT


|     Method |     Mean |   Error |  StdDev |   Gen 0 | Allocated |
|----------- |---------:|--------:|--------:|--------:|----------:|
| Version_10 | 477.6 us | 0.78 us | 0.61 us | 38.0859 |    234 KB |
| Version_11 | 448.7 us | 1.05 us | 0.93 us | 29.2969 |    180 KB |

delneg avatar Jun 01 '22 15:06 delneg