Some WIP's and ideas for SIMD
Hi, I've studied version10 of Topological Sort a bit and I see that there's some operations which are done once per loop, and I had an idea that they can be parallelized. Parallelization via Tasks can bring a lot of overhead, but probably one could use SIMD in this case for primitive operations.
I've added an example of Edge.BatchAdd, and some comments - although I can't get it fully working but I wanted to share the general idea
P.S. Also, .NET 7.0 is needed because they added ShiftRight / ShiftLeft only in this new version
This is super interesting! I'll need to look over this and see what the performance difference will be. Thank you for the ideas!
Also, added some micro-optimizations here https://github.com/delneg/FSharpPerformance/tree/wip/optimize-v10
| Method | Mean | Error | StdDev | Median | Gen 0 | Allocated |
|----------- |---------:|---------:|---------:|---------:|--------:|----------:|
| Version_10 | 750.6 μs | 14.75 μs | 29.45 μs | 739.2 μs | 28.3203 | 234 KB |
| Version_11 | 724.6 μs | 10.36 μs | 9.69 μs | 721.7 μs | 24.4141 | 203 KB |
It's a little bit faster for me & 31kb less memory hungry
Interesting, on my machine it's slower. I have an AMD 3900x which is a Zen 2. Are you running on an Intel?
// * Summary *
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.22000
AMD Ryzen 9 3900X, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.4.22252.9
[Host] : .NET 7.0.0 (7.0.22.22904), X64 RyuJIT DEBUG
DefaultJob : .NET 7.0.0 (7.0.22.22904), X64 RyuJIT
| Method | Mean | Error | StdDev | Gen 0 | Allocated |
|---|---|---|---|---|---|
| Version_10 | 522.7 us | 7.37 us | 6.90 us | 28.3203 | 234 KB |
| Version_11 | 662.1 us | 12.82 us | 14.77 us | 24.4141 | 203 KB |
It is interesting to see the performance difference between NET6 and NET7. Version_10 is ~50us faster with NET7
Yep, I'm testing on an x64 macOS
BenchmarkDotNet=v0.13.1, OS=macOS Big Sur 11.4
Intel Core i9-9980HK CPU 2.40GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.100-preview.5.22267.11
[Host] : .NET 7.0.0 (7.0.22.26611), X64 RyuJIT DEBUG
DefaultJob : .NET 7.0.0 (7.0.22.26611), X64 RyuJIT
| Method | Mean | Error | StdDev | Gen 0 | Allocated |
|----------- |---------:|---------:|---------:|--------:|----------:|
| Version_10 | 845.0 us | 15.81 us | 15.52 us | 28.3203 | 234 KB |
| Version_11 | 771.5 us | 11.80 us | 9.85 us | 24.4141 | 203 KB |
I've also tried it on an M1 Mac for interesting results:
BenchmarkDotNet=v0.13.1, OS=macOS Monterey 12.1
Apple M1, 1 CPU, 8 logical and 8 physical cores
.NET SDK=7.0.100-preview.4.22252.9
[Host] : .NET 7.0.0 (7.0.22.22904), Arm64 RyuJIT DEBUG
DefaultJob : .NET 7.0.0 (7.0.22.22904), Arm64 RyuJIT
| Method | Mean | Error | StdDev | Gen 0 | Allocated |
|----------- |---------:|--------:|--------:|--------:|----------:|
| Version_10 | 475.9 us | 0.98 us | 0.92 us | 38.0859 | 234 KB |
| Version_11 | 475.4 us | 1.92 us | 1.80 us | 32.7148 | 203 KB |
Both with .NET 7, both after commenting out value option ValuesCount caching (last commit just pushed) - commented it out because it was actually making it slower.
So, maybe what I did was actually a No-Op after all
Super interesting. This is why I've been tempted to build an Intel machine along with my AMD. I knew these performance differences were possible but that seems pretty dramatic. I wonder what the root cause is for the performance difference between our different CPUs.
Well, arm64 is very different from x64 for sure and got many improvements in .net7 so no wonder Also I'm testing on mac and not on windows so it may also affect the performance due to differences in .net runtime performance Anyway, for your use case of <500us it's safe to say that you could buy an M1 Mac and have your one million graphs sorted in time :)
P.S. I'd say it's tempting to have x86_64, ARM64, and RISC-V platforms for testing, not only amd and intel Also maybe get your hands onto some used cloud-grade cpu's (like the ones amazon uses or smth like that) if you're running your workloads in the cloud
It seems that I've managed to make it faster by making Graph a Struct and passing it by inref<Graph>
BenchmarkDotNet=v0.13.1, OS=macOS Big Sur 11.4 (20F71) [Darwin 20.5.0]
Intel Core i9-9980HK CPU 2.40GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.100-preview.5.22267.11
[Host] : .NET 7.0.0 (7.0.22.26611), X64 RyuJIT DEBUG
DefaultJob : .NET 7.0.0 (7.0.22.26611), X64 RyuJIT
| Method | Mean | Error | StdDev | Gen 0 | Gen 1 | Allocated |
|----------- |---------:|---------:|---------:|--------:|-------:|----------:|
| Version_10 | 808.6 us | 10.51 us | 9.83 us | 28.3203 | - | 234 KB |
| Version_11 | 699.5 us | 10.69 us | 10.00 us | 21.4844 | 0.9766 | 180 KB |
BenchmarkDotNet=v0.13.1, OS=macOS Monterey 12.1 (21C52) [Darwin 21.2.0]
Apple M1, 1 CPU, 8 logical and 8 physical cores
.NET SDK=7.0.100-preview.4.22252.9
[Host] : .NET 7.0.0 (7.0.22.22904), Arm64 RyuJIT DEBUG
DefaultJob : .NET 7.0.0 (7.0.22.22904), Arm64 RyuJIT
| Method | Mean | Error | StdDev | Gen 0 | Allocated |
|----------- |---------:|--------:|--------:|--------:|----------:|
| Version_10 | 477.9 us | 0.76 us | 0.67 us | 38.0859 | 234 KB |
| Version_11 | 467.0 us | 0.95 us | 0.89 us | 29.2969 | 180 KB |
Update: By moving all the EdgeTracker code to static inline members & passing around remainingEdges array with Span directly, I was able to get this (M1 here because it's more consistent)
BenchmarkDotNet=v0.13.1, OS=macOS Monterey 12.1 (21C52) [Darwin 21.2.0]
Apple M1, 1 CPU, 8 logical and 8 physical cores
.NET SDK=7.0.100-preview.4.22252.9
[Host] : .NET 7.0.0 (7.0.22.22904), Arm64 RyuJIT DEBUG
DefaultJob : .NET 7.0.0 (7.0.22.22904), Arm64 RyuJIT
| Method | Mean | Error | StdDev | Gen 0 | Allocated |
|----------- |---------:|--------:|--------:|--------:|----------:|
| Version_10 | 477.6 us | 0.78 us | 0.61 us | 38.0859 | 234 KB |
| Version_11 | 448.7 us | 1.05 us | 0.93 us | 29.2969 | 180 KB |