Fix Histogram synchronization issue
Changes
- Use the same "locking" mechanism for the `Update` and `TakeSnapshot` methods
- Remove `HistogramBuckets.LockObject`
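To illustrate the idea behind the change, here is a minimal sketch (not the actual SDK code; the class and member names are illustrative) of an `Interlocked`-based "lock" shared by `Update` and `TakeSnapshot`, so a snapshot never observes a torn update:

```csharp
using System;
using System.Threading;

// Illustrative sketch of an Interlocked-based "lock": a 0/1 flag guards the
// histogram's running state, and both the hot-path Update and the
// collect-path TakeSnapshot acquire the same flag.
public class HistogramSketch
{
    private int isCriticalSectionOccupied; // 0 = free, 1 = taken
    private long runningCount;
    private double runningSum;

    public long SnapshotCount;
    public double SnapshotSum;

    public void Update(double value)
    {
        this.AcquireLock();
        this.runningCount++;
        this.runningSum += value;
        this.ReleaseLock();
    }

    public void TakeSnapshot()
    {
        this.AcquireLock();
        this.SnapshotCount = this.runningCount;
        this.SnapshotSum = this.runningSum;
        this.runningCount = 0;
        this.runningSum = 0;
        this.ReleaseLock();
    }

    private void AcquireLock()
    {
        // Spin until we transition the flag from 0 (free) to 1 (taken).
        var spinWait = default(SpinWait);
        while (Interlocked.Exchange(ref this.isCriticalSectionOccupied, 1) != 0)
        {
            spinWait.SpinOnce();
        }
    }

    private void ReleaseLock()
    {
        Interlocked.Exchange(ref this.isCriticalSectionOccupied, 0);
    }
}
```

Because both methods compete for the same flag, there is no longer a need for a separate `HistogramBuckets.LockObject`.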
Stress Test Results (high contention) with Prometheus Exporter scraping interval = 10s
Multiple threads updating the same Histogram MetricPoint
```text
C:\opentelemetry-dotnet\test\OpenTelemetry.Tests.Stress.Metrics>dotnet run --framework net6.0 --configuration Release
Running (concurrency = 8, prometheusEndpoint = http://localhost:9184/metrics/), press <Esc> to stop...
2022-08-05T00:14:26.5444500Z Loops: 838,148,771, Loops/Second: 6,927,676, CPU Cycles/Loop: 574, RunwayTime (Seconds): 120
2022-08-05T00:14:26.7499001Z Loops: 839,576,993, Loops/Second: 6,954,500, CPU Cycles/Loop: 555, RunwayTime (Seconds): 120
2022-08-05T00:14:26.9569740Z Loops: 841,011,261, Loops/Second: 6,939,252, CPU Cycles/Loop: 504, RunwayTime (Seconds): 121
2022-08-05T00:14:27.1648655Z Loops: 842,455,627, Loops/Second: 6,956,855, CPU Cycles/Loop: 504, RunwayTime (Seconds): 121
2022-08-05T00:14:27.3728648Z Loops: 843,925,397, Loops/Second: 7,081,836, CPU Cycles/Loop: 560, RunwayTime (Seconds): 121
2022-08-05T00:14:27.5806669Z Loops: 845,361,965, Loops/Second: 6,926,754, CPU Cycles/Loop: 497, RunwayTime (Seconds): 121
2022-08-05T00:14:27.7877784Z Loops: 846,817,179, Loops/Second: 7,039,306, CPU Cycles/Loop: 483, RunwayTime (Seconds): 121
```
Stress Test Results (low contention) with Prometheus Exporter scraping interval = 10s
Multiple threads updating random Histogram MetricPoints
```text
Running (concurrency = 8, prometheusEndpoint = http://localhost:9184/metrics/), press <Esc> to stop...
2022-08-05T00:00:58.8000307Z Loops: 2,810,996,143, Loops/Second: 22,274,636, CPU Cycles/Loop: 1,035, RunwayTime (Seconds): 126
2022-08-05T00:00:59.0289061Z Loops: 2,817,588,665, Loops/Second: 22,209,231, CPU Cycles/Loop: 1,031, RunwayTime (Seconds): 126
2022-08-05T00:00:59.2534955Z Loops: 2,822,548,480, Loops/Second: 22,216,776, CPU Cycles/Loop: 1,040, RunwayTime (Seconds): 126
2022-08-05T00:00:59.4750764Z Loops: 2,827,539,322, Loops/Second: 22,264,727, CPU Cycles/Loop: 1,031, RunwayTime (Seconds): 126
2022-08-05T00:00:59.7044293Z Loops: 2,832,168,493, Loops/Second: 21,668,380, CPU Cycles/Loop: 1,030, RunwayTime (Seconds): 127
2022-08-05T00:00:59.9354118Z Loops: 2,837,591,015, Loops/Second: 22,042,236, CPU Cycles/Loop: 1,026, RunwayTime (Seconds): 127
2022-08-05T00:01:00.1581187Z Loops: 2,842,384,330, Loops/Second: 21,681,888, CPU Cycles/Loop: 1,027, RunwayTime (Seconds): 127
```
Benchmark Results
```text
// * Summary *

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.22000
Intel Core i7-9700 CPU 3.00GHz, 1 CPU, 8 logical and 8 physical cores
.NET SDK=6.0.302
  [Host]     : .NET 6.0.7 (6.0.722.32202), X64 RyuJIT
  DefaultJob : .NET 6.0.7 (6.0.722.32202), X64 RyuJIT
```
| Method | BoundCount | Mean | Error | StdDev | Median |
|---|---|---|---|---|---|
| HistogramHotPath | 10 | 41.87 ns | 0.847 ns | 1.101 ns | 42.11 ns |
| HistogramWith1LabelHotPath | 10 | 91.11 ns | 1.836 ns | 3.402 ns | 92.75 ns |
| HistogramWith3LabelsHotPath | 10 | 194.98 ns | 3.920 ns | 6.217 ns | 196.77 ns |
| HistogramWith5LabelsHotPath | 10 | 291.10 ns | 5.825 ns | 10.798 ns | 292.59 ns |
| HistogramWith7LabelsHotPath | 10 | 355.66 ns | 7.119 ns | 15.324 ns | 356.80 ns |
| HistogramHotPath | 20 | 47.52 ns | 0.962 ns | 1.029 ns | 47.54 ns |
| HistogramWith1LabelHotPath | 20 | 98.94 ns | 2.003 ns | 3.907 ns | 99.71 ns |
| HistogramWith3LabelsHotPath | 20 | 197.79 ns | 3.957 ns | 7.134 ns | 198.93 ns |
| HistogramWith5LabelsHotPath | 20 | 300.20 ns | 6.010 ns | 12.545 ns | 299.06 ns |
| HistogramWith7LabelsHotPath | 20 | 360.02 ns | 7.161 ns | 13.624 ns | 363.94 ns |
| HistogramHotPath | 50 | 54.60 ns | 1.095 ns | 1.705 ns | 55.09 ns |
| HistogramWith1LabelHotPath | 50 | 101.19 ns | 2.014 ns | 3.831 ns | 101.43 ns |
| HistogramWith3LabelsHotPath | 50 | 215.94 ns | 4.332 ns | 8.347 ns | 217.93 ns |
| HistogramWith5LabelsHotPath | 50 | 309.55 ns | 4.752 ns | 4.212 ns | 309.56 ns |
| HistogramWith7LabelsHotPath | 50 | 372.97 ns | 7.288 ns | 6.086 ns | 373.39 ns |
| HistogramHotPath | 100 | 65.78 ns | 1.342 ns | 2.205 ns | 66.74 ns |
| HistogramWith1LabelHotPath | 100 | 114.77 ns | 2.308 ns | 4.446 ns | 115.07 ns |
| HistogramWith3LabelsHotPath | 100 | 231.84 ns | 4.622 ns | 11.681 ns | 233.23 ns |
| HistogramWith5LabelsHotPath | 100 | 323.52 ns | 6.354 ns | 8.482 ns | 325.70 ns |
| HistogramWith7LabelsHotPath | 100 | 410.18 ns | 8.138 ns | 15.680 ns | 409.59 ns |
Codecov Report
Merging #3534 (0372ff6) into main (ad55478) will increase coverage by 0.01%. The diff coverage is 86.20%.
```text
@@            Coverage Diff             @@
##             main    #3534      +/-   ##
==========================================
+ Coverage   87.26%   87.28%   +0.01%
==========================================
  Files         277      277
  Lines        9931     9938       +7
==========================================
+ Hits         8666     8674       +8
+ Misses       1265     1264       -1
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| src/OpenTelemetry/Metrics/HistogramBuckets.cs | 100.00% <ø> (ø) | |
| src/OpenTelemetry/Metrics/MetricPoint.cs | 83.10% <86.20%> (-1.90%) | :arrow_down: |
| ...ter.ZPages/Implementation/ZPagesActivityTracker.cs | 97.14% <0.00%> (-2.86%) | :arrow_down: |
| ...Telemetry/Internal/SelfDiagnosticsEventListener.cs | 96.87% <0.00%> (-0.79%) | :arrow_down: |
| ...nTelemetry/Internal/OpenTelemetrySdkEventSource.cs | 77.50% <0.00%> (+0.83%) | :arrow_up: |
| ...ZPages/Implementation/ZPagesExporterEventSource.cs | 62.50% <0.00%> (+6.25%) | :arrow_up: |
| ...emetry.Api/Internal/OpenTelemetryApiEventSource.cs | 82.35% <0.00%> (+8.82%) | :arrow_up: |
| ...tation/OpenTelemetryProtocolExporterEventSource.cs | 85.00% <0.00%> (+10.00%) | :arrow_up: |
I ran benchmark and stress tests for `Interlocked.Exchange` vs `Interlocked.CompareExchange`. The performance is comparable in both the benchmarks and the stress tests. In fact, I ran the benchmark multiple times only to find that sometimes `CompareExchange` comes out on top and other times it's `Exchange`. The difference in time taken between the two methods is ~0.02 ns.
Expand for code:

```csharp
using System.Threading;

using BenchmarkDotNet.Attributes;

namespace Benchmarks.Metrics
{
    [MemoryDiagnoser]
    public class TestBenchmarks
    {
        private int isLocked = 1;

        [Benchmark]
        public void Exchange()
        {
            var temp = Interlocked.Exchange(ref this.isLocked, 1) == 0;
        }

        [Benchmark]
        public void CompareExchange()
        {
            var temp = Interlocked.CompareExchange(ref this.isLocked, 1, 0) == 0;
        }
    }
}
```
| Method | Mean | Error | StdDev | Allocated |
|---|---|---|---|---|
| Exchange | 4.351 ns | 0.0465 ns | 0.0435 ns | - |
| CompareExchange | 4.330 ns | 0.0663 ns | 0.0620 ns | - |
I have updated the code to go back to using `Interlocked.Exchange` anyway, just to have fewer things to review 😄
Beware that single threaded, zero contention lock perf benchmarks can (and probably will) give very different results vs. a test with multiple threads and some contention. Even if threads are never explicitly waiting for another to release the lock, down at the hardware level the cache line backing the critical section memory has to be moved between the caches connected to different cores. When the cores are continuously renegotiating which one has exclusive access to the memory that creates large additional latencies on those Interlocked instructions.
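To make that point concrete, here is a rough sketch of a contended micro-benchmark (illustrative names, not part of the repo): several threads hammer the same `Interlocked`-guarded counter, so the cache line backing the flag continuously bounces between cores, unlike the single-threaded BenchmarkDotNet numbers above:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Illustrative sketch: multiple threads contend for one Interlocked-based
// flag. The cache line holding isLocked is transferred between cores on
// every acquire/release, which is exactly the latency source described above.
public static class ContentionSketch
{
    private static int isLocked;   // 0 = free, 1 = taken
    private static long counter;   // shared state guarded by the flag

    public static long Run(int threadCount, int iterationsPerThread)
    {
        var tasks = new Task[threadCount];
        for (int t = 0; t < threadCount; t++)
        {
            tasks[t] = Task.Run(() =>
            {
                for (int i = 0; i < iterationsPerThread; i++)
                {
                    // Busy-wait until the flag transitions from 0 to 1 for us.
                    while (Interlocked.Exchange(ref isLocked, 1) != 0)
                    {
                    }

                    counter++; // critical section

                    Interlocked.Exchange(ref isLocked, 0); // release
                }
            });
        }

        Task.WaitAll(tasks);
        return counter;
    }
}
```

Timing `Run` with varying `threadCount` (and comparing against one thread) would surface the contention penalty that a zero-contention benchmark hides.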
Check the PRs for the other two synchronization techniques:
- #3547
- #3550
Performance comparison of the different synchronization techniques
High contention scenario:
| Synchronization technique | Average Loops/second | Average CPU cycles/loop |
|---|---|---|
| Interlocked | 6.97 M | 525 |
| Classic lock | 5.16 M | 3117 |
| SpinLock | 5.14 M | 775 |
Low contention scenario:
| Synchronization technique | Average Loops/second | Average CPU cycles/loop |
|---|---|---|
| Interlocked | 22.05 M | 1031 |
| Classic lock | 19.92 M | 1127 |
| SpinLock | 18.54 M | 1160 |
The Interlocked-based "lock" outperforms the classic lock and SpinLock in the high-contention scenario, where multiple threads try to update the same MetricPoint in parallel. It achieved ~35% more loops/sec than the other two techniques. It also has significantly lower CPU cycles/loop: ~83% fewer than the classic lock and ~23% fewer than SpinLock.
In the low-contention scenario there is still some perf gain for the Interlocked-based "lock", but it is not as pronounced.
The benchmark results for the Interlocked-based "lock" are comparable to those of the classic lock. The benchmark numbers for SpinLock are lower than those of both the Interlocked-based "lock" and the classic lock.
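For reference, the three techniques compared above differ only in how they guard the critical section. A side-by-side sketch (illustrative only; the real MetricPoint code lives in this PR and the two linked ones):

```csharp
using System;
using System.Threading;

// Side-by-side sketch of the three synchronization techniques compared
// above, each guarding the same trivial critical section.
public class SyncTechniques
{
    private readonly object lockObject = new object();
    private SpinLock spinLock = new SpinLock(enableThreadOwnerTracking: false);
    private int interlockedFlag; // 0 = free, 1 = taken

    public long Count;

    // Classic lock: Monitor-based, may block the thread in the kernel.
    public void WithClassicLock()
    {
        lock (this.lockObject)
        {
            this.Count++;
        }
    }

    // SpinLock: busy-waits in user mode, with framework-managed back-off.
    public void WithSpinLock()
    {
        bool taken = false;
        try
        {
            this.spinLock.Enter(ref taken);
            this.Count++;
        }
        finally
        {
            if (taken)
            {
                this.spinLock.Exit();
            }
        }
    }

    // Interlocked-based "lock": a raw 0/1 flag flipped atomically.
    public void WithInterlockedFlag()
    {
        while (Interlocked.Exchange(ref this.interlockedFlag, 1) != 0)
        {
        }

        this.Count++;
        Interlocked.Exchange(ref this.interlockedFlag, 0);
    }
}
```

All three provide mutual exclusion; the stress-test tables above show how differently they behave once many threads contend for the same MetricPoint.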