
Fix Histogram synchronization issue

Open utpilla opened this issue 3 years ago • 4 comments

Changes

  • Use the same "locking" mechanism for Update and TakeSnapshot methods
  • Remove HistogramBuckets.LockObject
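A rough sketch of the Interlocked-based "locking" approach described above (hypothetical simplified code, not the actual `MetricPoint` implementation — type and member names are illustrative only):

```csharp
using System.Threading;

// Simplified sketch (hypothetical; not the actual MetricPoint code).
// Update and TakeSnapshot spin on the *same* flag, so a snapshot can never
// observe a half-written count/sum pair.
public class HistogramSketch
{
    private int isCriticalSectionOccupied; // 0 = free, 1 = taken
    private long runningCount;
    private double runningSum;

    public void Update(double value)
    {
        // Spin until we atomically flip the flag from 0 to 1.
        while (Interlocked.Exchange(ref this.isCriticalSectionOccupied, 1) != 0)
        {
            Thread.SpinWait(1);
        }

        this.runningCount++;
        this.runningSum += value;

        // Release: Volatile.Write publishes the flag reset to other cores.
        Volatile.Write(ref this.isCriticalSectionOccupied, 0);
    }

    public (long Count, double Sum) TakeSnapshot()
    {
        // Same acquisition protocol as Update.
        while (Interlocked.Exchange(ref this.isCriticalSectionOccupied, 1) != 0)
        {
            Thread.SpinWait(1);
        }

        var snapshot = (this.runningCount, this.runningSum);
        Volatile.Write(ref this.isCriticalSectionOccupied, 0);
        return snapshot;
    }
}
```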

Stress Test Results (high contention) with Prometheus Exporter scraping interval = 10s

Multiple threads updating the same Histogram MetricPoint

C:\opentelemetry-dotnet\test\OpenTelemetry.Tests.Stress.Metrics>dotnet run --framework net6.0 --configuration Release
Running (concurrency = 8, prometheusEndpoint = http://localhost:9184/metrics/), press <Esc> to stop...
2022-08-05T00:14:26.5444500Z Loops: 838,148,771, Loops/Second: 6,927,676, CPU Cycles/Loop: 574, RunwayTime (Seconds): 120
2022-08-05T00:14:26.7499001Z Loops: 839,576,993, Loops/Second: 6,954,500, CPU Cycles/Loop: 555, RunwayTime (Seconds): 120
2022-08-05T00:14:26.9569740Z Loops: 841,011,261, Loops/Second: 6,939,252, CPU Cycles/Loop: 504, RunwayTime (Seconds): 121
2022-08-05T00:14:27.1648655Z Loops: 842,455,627, Loops/Second: 6,956,855, CPU Cycles/Loop: 504, RunwayTime (Seconds): 121
2022-08-05T00:14:27.3728648Z Loops: 843,925,397, Loops/Second: 7,081,836, CPU Cycles/Loop: 560, RunwayTime (Seconds): 121
2022-08-05T00:14:27.5806669Z Loops: 845,361,965, Loops/Second: 6,926,754, CPU Cycles/Loop: 497, RunwayTime (Seconds): 121
2022-08-05T00:14:27.7877784Z Loops: 846,817,179, Loops/Second: 7,039,306, CPU Cycles/Loop: 483, RunwayTime (Seconds): 121

Stress Test Results (low contention) with Prometheus Exporter scraping interval = 10s

Multiple threads updating random Histogram MetricPoints

Running (concurrency = 8, prometheusEndpoint = http://localhost:9184/metrics/), press <Esc> to stop...
2022-08-05T00:00:58.8000307Z Loops: 2,810,996,143, Loops/Second: 22,274,636, CPU Cycles/Loop: 1,035, RunwayTime (Seconds): 126
2022-08-05T00:00:59.0289061Z Loops: 2,817,588,665, Loops/Second: 22,209,231, CPU Cycles/Loop: 1,031, RunwayTime (Seconds): 126
2022-08-05T00:00:59.2534955Z Loops: 2,822,548,480, Loops/Second: 22,216,776, CPU Cycles/Loop: 1,040, RunwayTime (Seconds): 126
2022-08-05T00:00:59.4750764Z Loops: 2,827,539,322, Loops/Second: 22,264,727, CPU Cycles/Loop: 1,031, RunwayTime (Seconds): 126
2022-08-05T00:00:59.7044293Z Loops: 2,832,168,493, Loops/Second: 21,668,380, CPU Cycles/Loop: 1,030, RunwayTime (Seconds): 127
2022-08-05T00:00:59.9354118Z Loops: 2,837,591,015, Loops/Second: 22,042,236, CPU Cycles/Loop: 1,026, RunwayTime (Seconds): 127
2022-08-05T00:01:00.1581187Z Loops: 2,842,384,330, Loops/Second: 21,681,888, CPU Cycles/Loop: 1,027, RunwayTime (Seconds): 127

Benchmark Results

// * Summary *

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.22000
Intel Core i7-9700 CPU 3.00GHz, 1 CPU, 8 logical and 8 physical cores
.NET SDK=6.0.302
  [Host]     : .NET 6.0.7 (6.0.722.32202), X64 RyuJIT
  DefaultJob : .NET 6.0.7 (6.0.722.32202), X64 RyuJIT

| Method | BoundCount | Mean | Error | StdDev | Median |
|---|---|---|---|---|---|
| HistogramHotPath | 10 | 41.87 ns | 0.847 ns | 1.101 ns | 42.11 ns |
| HistogramWith1LabelHotPath | 10 | 91.11 ns | 1.836 ns | 3.402 ns | 92.75 ns |
| HistogramWith3LabelsHotPath | 10 | 194.98 ns | 3.920 ns | 6.217 ns | 196.77 ns |
| HistogramWith5LabelsHotPath | 10 | 291.10 ns | 5.825 ns | 10.798 ns | 292.59 ns |
| HistogramWith7LabelsHotPath | 10 | 355.66 ns | 7.119 ns | 15.324 ns | 356.80 ns |
| HistogramHotPath | 20 | 47.52 ns | 0.962 ns | 1.029 ns | 47.54 ns |
| HistogramWith1LabelHotPath | 20 | 98.94 ns | 2.003 ns | 3.907 ns | 99.71 ns |
| HistogramWith3LabelsHotPath | 20 | 197.79 ns | 3.957 ns | 7.134 ns | 198.93 ns |
| HistogramWith5LabelsHotPath | 20 | 300.20 ns | 6.010 ns | 12.545 ns | 299.06 ns |
| HistogramWith7LabelsHotPath | 20 | 360.02 ns | 7.161 ns | 13.624 ns | 363.94 ns |
| HistogramHotPath | 50 | 54.60 ns | 1.095 ns | 1.705 ns | 55.09 ns |
| HistogramWith1LabelHotPath | 50 | 101.19 ns | 2.014 ns | 3.831 ns | 101.43 ns |
| HistogramWith3LabelsHotPath | 50 | 215.94 ns | 4.332 ns | 8.347 ns | 217.93 ns |
| HistogramWith5LabelsHotPath | 50 | 309.55 ns | 4.752 ns | 4.212 ns | 309.56 ns |
| HistogramWith7LabelsHotPath | 50 | 372.97 ns | 7.288 ns | 6.086 ns | 373.39 ns |
| HistogramHotPath | 100 | 65.78 ns | 1.342 ns | 2.205 ns | 66.74 ns |
| HistogramWith1LabelHotPath | 100 | 114.77 ns | 2.308 ns | 4.446 ns | 115.07 ns |
| HistogramWith3LabelsHotPath | 100 | 231.84 ns | 4.622 ns | 11.681 ns | 233.23 ns |
| HistogramWith5LabelsHotPath | 100 | 323.52 ns | 6.354 ns | 8.482 ns | 325.70 ns |
| HistogramWith7LabelsHotPath | 100 | 410.18 ns | 8.138 ns | 15.680 ns | 409.59 ns |

utpilla avatar Aug 03 '22 23:08 utpilla

Codecov Report

Merging #3534 (0372ff6) into main (ad55478) will increase coverage by 0.01%. The diff coverage is 86.20%.


@@            Coverage Diff             @@
##             main    #3534      +/-   ##
==========================================
+ Coverage   87.26%   87.28%   +0.01%     
==========================================
  Files         277      277              
  Lines        9931     9938       +7     
==========================================
+ Hits         8666     8674       +8     
+ Misses       1265     1264       -1     
| Impacted Files | Coverage Δ | |
|---|---|---|
| src/OpenTelemetry/Metrics/HistogramBuckets.cs | 100.00% <ø> (ø) | |
| src/OpenTelemetry/Metrics/MetricPoint.cs | 83.10% <86.20%> (-1.90%) | ⬇️ |
| ...ter.ZPages/Implementation/ZPagesActivityTracker.cs | 97.14% <0.00%> (-2.86%) | ⬇️ |
| ...Telemetry/Internal/SelfDiagnosticsEventListener.cs | 96.87% <0.00%> (-0.79%) | ⬇️ |
| ...nTelemetry/Internal/OpenTelemetrySdkEventSource.cs | 77.50% <0.00%> (+0.83%) | ⬆️ |
| ...ZPages/Implementation/ZPagesExporterEventSource.cs | 62.50% <0.00%> (+6.25%) | ⬆️ |
| ...emetry.Api/Internal/OpenTelemetryApiEventSource.cs | 82.35% <0.00%> (+8.82%) | ⬆️ |
| ...tation/OpenTelemetryProtocolExporterEventSource.cs | 85.00% <0.00%> (+10.00%) | ⬆️ |

codecov[bot] avatar Aug 03 '22 23:08 codecov[bot]

I ran Benchmark and Stress tests for Interlocked.Exchange vs Interlocked.CompareExchange. The performance is comparable in both the Benchmarks and the Stress tests. In fact, I ran the benchmark multiple times only to find that sometimes CompareExchange comes out on top and other times Exchange does. The difference in mean time between the two methods is ~0.02 ns.

using System.Threading;
using BenchmarkDotNet.Attributes;

namespace Benchmarks.Metrics
{
    [MemoryDiagnoser]
    public class TestBenchmarks
    {
        private int isLocked = 1;

        [Benchmark]
        public void Exchange()
        {
            // Unconditionally writes 1 and checks the previous value.
            var temp = Interlocked.Exchange(ref this.isLocked, 1) == 0;
        }

        [Benchmark]
        public void CompareExchange()
        {
            // Writes 1 only if the current value is 0, then checks the previous value.
            var temp = Interlocked.CompareExchange(ref this.isLocked, 1, 0) == 0;
        }
    }
}
| Method | Mean | Error | StdDev | Allocated |
|---|---|---|---|---|
| Exchange | 4.351 ns | 0.0465 ns | 0.0435 ns | - |
| CompareExchange | 4.330 ns | 0.0663 ns | 0.0620 ns | - |

I have updated the code to go back to using Interlocked.Exchange anyway, just to have fewer things to review 😄

utpilla avatar Aug 04 '22 22:08 utpilla

Beware that single threaded, zero contention lock perf benchmarks can (and probably will) give very different results vs. a test with multiple threads and some contention. Even if threads are never explicitly waiting for another to release the lock, down at the hardware level the cache line backing the critical section memory has to be moved between the caches connected to different cores. When the cores are continuously renegotiating which one has exclusive access to the memory that creates large additional latencies on those Interlocked instructions.
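A minimal multi-threaded version of that kind of test (a hypothetical editor's sketch, not the repo's stress test) would have several threads acquire and release the same flag, so the cache line backing it bounces between cores:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch of a contended lock test. All threads hammer the same
// Interlocked flag, so each Exchange pays the cross-core cache-line transfer
// latency described above; compare the loop rate against a single thread.
public static class ContendedLockDemo
{
    private static int sharedFlag;   // 0 = free, 1 = taken
    private static long totalLoops;

    public static long Run(int threadCount, TimeSpan duration)
    {
        using var cts = new CancellationTokenSource(duration);
        var tasks = new Task[threadCount];
        for (int i = 0; i < threadCount; i++)
        {
            tasks[i] = Task.Run(() =>
            {
                long local = 0;
                while (!cts.IsCancellationRequested)
                {
                    // Acquire/release the shared flag; every thread contends
                    // on the one cache line backing sharedFlag.
                    while (Interlocked.Exchange(ref sharedFlag, 1) != 0) { }
                    Volatile.Write(ref sharedFlag, 0);
                    local++;
                }
                Interlocked.Add(ref totalLoops, local);
            });
        }

        Task.WaitAll(tasks);
        return totalLoops;
    }
}
```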

noahfalk avatar Aug 04 '22 23:08 noahfalk

Check the PRs for the other two synchronization techniques:

  • #3547
  • #3550

Performance comparison of the different synchronization techniques

High contention scenario:

| Synchronization technique | Average Loops/second | Average CPU cycles/loop |
|---|---|---|
| Interlocked | 6.97 M | 525 |
| Classic lock | 5.16 M | 3117 |
| SpinLock | 5.14 M | 775 |

Low contention scenario:

| Synchronization technique | Average Loops/second | Average CPU cycles/loop |
|---|---|---|
| Interlocked | 22.05 M | 1031 |
| Classic lock | 19.92 M | 1127 |
| SpinLock | 18.54 M | 1160 |

The Interlocked-based "lock" outperforms the classic lock and SpinLock in the high-contention scenario, where multiple threads try to update the same MetricPoint in parallel: it achieved ~35% more loops/sec than the other two techniques, and it also used significantly fewer CPU cycles/loop (~83% fewer than the classic lock and ~23% fewer than SpinLock).

In the low-contention scenario, the Interlocked-based "lock" still shows some perf gain, but it's not as large.

The benchmark results for the Interlocked-based "lock" are comparable to those of the classic lock. The benchmark numbers for SpinLock are lower than those of both the Interlocked-based "lock" and the classic lock.
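For reference, the three techniques compared above differ only in how the critical section is guarded. A condensed side-by-side sketch (hypothetical names; the real code lives in `MetricPoint`):

```csharp
using System.Threading;

// Hypothetical condensed sketch of the three guarding strategies compared in
// the tables above; each method protects the same trivial critical section.
public class GuardStrategies
{
    private readonly object lockObject = new object();
    private SpinLock spinLock = new SpinLock(enableThreadOwnerTracking: false);
    private int interlockedFlag; // 0 = free, 1 = taken
    private long counter;

    public long Counter => Volatile.Read(ref this.counter);

    public void WithClassicLock()
    {
        // Monitor-based lock: blocks (parks the thread) under contention.
        lock (this.lockObject)
        {
            this.counter++;
        }
    }

    public void WithSpinLock()
    {
        // Struct-based SpinLock: busy-waits instead of blocking.
        bool taken = false;
        try
        {
            this.spinLock.Enter(ref taken);
            this.counter++;
        }
        finally
        {
            if (taken) this.spinLock.Exit();
        }
    }

    public void WithInterlockedFlag()
    {
        // Interlocked-based "lock": spin on an int flag, release via
        // Volatile.Write so the reset is visible to other cores.
        while (Interlocked.Exchange(ref this.interlockedFlag, 1) != 0) { }
        this.counter++;
        Volatile.Write(ref this.interlockedFlag, 0);
    }
}
```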

utpilla avatar Aug 05 '22 02:08 utpilla