
Fix Histogram synchronization issue

Open utpilla opened this issue 3 years ago • 4 comments

Changes

  • Use the same "locking" mechanism for Update and TakeSnapshot methods
  • Remove HistogramBuckets.LockObject
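A rough sketch of the Interlocked-based "locking" approach described above (hypothetical simplified code, not the actual `MetricPoint` implementation — type and member names are illustrative only):

```csharp
using System.Threading;

// Simplified sketch (hypothetical; not the actual MetricPoint code).
// Update and TakeSnapshot spin on the *same* flag, so a snapshot can never
// observe a half-written count/sum pair.
public class HistogramSketch
{
    private int isCriticalSectionOccupied; // 0 = free, 1 = taken
    private long runningCount;
    private double runningSum;

    public void Update(double value)
    {
        // Spin until we atomically flip the flag from 0 to 1.
        while (Interlocked.Exchange(ref this.isCriticalSectionOccupied, 1) != 0)
        {
            Thread.SpinWait(1);
        }

        this.runningCount++;
        this.runningSum += value;

        // Release: Volatile.Write publishes the flag reset to other cores.
        Volatile.Write(ref this.isCriticalSectionOccupied, 0);
    }

    public (long Count, double Sum) TakeSnapshot()
    {
        // Same acquisition protocol as Update.
        while (Interlocked.Exchange(ref this.isCriticalSectionOccupied, 1) != 0)
        {
            Thread.SpinWait(1);
        }

        var snapshot = (this.runningCount, this.runningSum);
        Volatile.Write(ref this.isCriticalSectionOccupied, 0);
        return snapshot;
    }
}
```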

Stress Test Results (high contention) with Prometheus Exporter scraping interval = 10s

Multiple threads updating the same Histogram MetricPoint

C:\opentelemetry-dotnet\test\OpenTelemetry.Tests.Stress.Metrics>dotnet run --framework net6.0 --configuration Release
Running (concurrency = 8, prometheusEndpoint = http://localhost:9184/metrics/), press <Esc> to stop...
2022-08-05T00:14:26.5444500Z Loops: 838,148,771, Loops/Second: 6,927,676, CPU Cycles/Loop: 574, RunwayTime (Seconds): 120
2022-08-05T00:14:26.7499001Z Loops: 839,576,993, Loops/Second: 6,954,500, CPU Cycles/Loop: 555, RunwayTime (Seconds): 120
2022-08-05T00:14:26.9569740Z Loops: 841,011,261, Loops/Second: 6,939,252, CPU Cycles/Loop: 504, RunwayTime (Seconds): 121
2022-08-05T00:14:27.1648655Z Loops: 842,455,627, Loops/Second: 6,956,855, CPU Cycles/Loop: 504, RunwayTime (Seconds): 121
2022-08-05T00:14:27.3728648Z Loops: 843,925,397, Loops/Second: 7,081,836, CPU Cycles/Loop: 560, RunwayTime (Seconds): 121
2022-08-05T00:14:27.5806669Z Loops: 845,361,965, Loops/Second: 6,926,754, CPU Cycles/Loop: 497, RunwayTime (Seconds): 121
2022-08-05T00:14:27.7877784Z Loops: 846,817,179, Loops/Second: 7,039,306, CPU Cycles/Loop: 483, RunwayTime (Seconds): 121

Stress Test Results (low contention) with Prometheus Exporter scraping interval = 10s

Multiple threads updating random Histogram MetricPoints

Running (concurrency = 8, prometheusEndpoint = http://localhost:9184/metrics/), press <Esc> to stop...
2022-08-05T00:00:58.8000307Z Loops: 2,810,996,143, Loops/Second: 22,274,636, CPU Cycles/Loop: 1,035, RunwayTime (Seconds): 126
2022-08-05T00:00:59.0289061Z Loops: 2,817,588,665, Loops/Second: 22,209,231, CPU Cycles/Loop: 1,031, RunwayTime (Seconds): 126
2022-08-05T00:00:59.2534955Z Loops: 2,822,548,480, Loops/Second: 22,216,776, CPU Cycles/Loop: 1,040, RunwayTime (Seconds): 126
2022-08-05T00:00:59.4750764Z Loops: 2,827,539,322, Loops/Second: 22,264,727, CPU Cycles/Loop: 1,031, RunwayTime (Seconds): 126
2022-08-05T00:00:59.7044293Z Loops: 2,832,168,493, Loops/Second: 21,668,380, CPU Cycles/Loop: 1,030, RunwayTime (Seconds): 127
2022-08-05T00:00:59.9354118Z Loops: 2,837,591,015, Loops/Second: 22,042,236, CPU Cycles/Loop: 1,026, RunwayTime (Seconds): 127
2022-08-05T00:01:00.1581187Z Loops: 2,842,384,330, Loops/Second: 21,681,888, CPU Cycles/Loop: 1,027, RunwayTime (Seconds): 127

Benchmark Results

// * Summary *

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.22000
Intel Core i7-9700 CPU 3.00GHz, 1 CPU, 8 logical and 8 physical cores
.NET SDK=6.0.302
  [Host]     : .NET 6.0.7 (6.0.722.32202), X64 RyuJIT
  DefaultJob : .NET 6.0.7 (6.0.722.32202), X64 RyuJIT

| Method | BoundCount | Mean | Error | StdDev | Median |
|---|---|---|---|---|---|
| HistogramHotPath | 10 | 41.87 ns | 0.847 ns | 1.101 ns | 42.11 ns |
| HistogramWith1LabelHotPath | 10 | 91.11 ns | 1.836 ns | 3.402 ns | 92.75 ns |
| HistogramWith3LabelsHotPath | 10 | 194.98 ns | 3.920 ns | 6.217 ns | 196.77 ns |
| HistogramWith5LabelsHotPath | 10 | 291.10 ns | 5.825 ns | 10.798 ns | 292.59 ns |
| HistogramWith7LabelsHotPath | 10 | 355.66 ns | 7.119 ns | 15.324 ns | 356.80 ns |
| HistogramHotPath | 20 | 47.52 ns | 0.962 ns | 1.029 ns | 47.54 ns |
| HistogramWith1LabelHotPath | 20 | 98.94 ns | 2.003 ns | 3.907 ns | 99.71 ns |
| HistogramWith3LabelsHotPath | 20 | 197.79 ns | 3.957 ns | 7.134 ns | 198.93 ns |
| HistogramWith5LabelsHotPath | 20 | 300.20 ns | 6.010 ns | 12.545 ns | 299.06 ns |
| HistogramWith7LabelsHotPath | 20 | 360.02 ns | 7.161 ns | 13.624 ns | 363.94 ns |
| HistogramHotPath | 50 | 54.60 ns | 1.095 ns | 1.705 ns | 55.09 ns |
| HistogramWith1LabelHotPath | 50 | 101.19 ns | 2.014 ns | 3.831 ns | 101.43 ns |
| HistogramWith3LabelsHotPath | 50 | 215.94 ns | 4.332 ns | 8.347 ns | 217.93 ns |
| HistogramWith5LabelsHotPath | 50 | 309.55 ns | 4.752 ns | 4.212 ns | 309.56 ns |
| HistogramWith7LabelsHotPath | 50 | 372.97 ns | 7.288 ns | 6.086 ns | 373.39 ns |
| HistogramHotPath | 100 | 65.78 ns | 1.342 ns | 2.205 ns | 66.74 ns |
| HistogramWith1LabelHotPath | 100 | 114.77 ns | 2.308 ns | 4.446 ns | 115.07 ns |
| HistogramWith3LabelsHotPath | 100 | 231.84 ns | 4.622 ns | 11.681 ns | 233.23 ns |
| HistogramWith5LabelsHotPath | 100 | 323.52 ns | 6.354 ns | 8.482 ns | 325.70 ns |
| HistogramWith7LabelsHotPath | 100 | 410.18 ns | 8.138 ns | 15.680 ns | 409.59 ns |

utpilla avatar Aug 03 '22 23:08 utpilla

Codecov Report

Merging #3534 (0372ff6) into main (ad55478) will increase coverage by 0.01%. The diff coverage is 86.20%.


@@            Coverage Diff             @@
##             main    #3534      +/-   ##
==========================================
+ Coverage   87.26%   87.28%   +0.01%     
==========================================
  Files         277      277              
  Lines        9931     9938       +7     
==========================================
+ Hits         8666     8674       +8     
+ Misses       1265     1264       -1     
| Impacted Files | Coverage Δ | |
|---|---|---|
| src/OpenTelemetry/Metrics/HistogramBuckets.cs | 100.00% <ø> (ø) | |
| src/OpenTelemetry/Metrics/MetricPoint.cs | 83.10% <86.20%> (-1.90%) | ⬇️ |
| ...ter.ZPages/Implementation/ZPagesActivityTracker.cs | 97.14% <0.00%> (-2.86%) | ⬇️ |
| ...Telemetry/Internal/SelfDiagnosticsEventListener.cs | 96.87% <0.00%> (-0.79%) | ⬇️ |
| ...nTelemetry/Internal/OpenTelemetrySdkEventSource.cs | 77.50% <0.00%> (+0.83%) | ⬆️ |
| ...ZPages/Implementation/ZPagesExporterEventSource.cs | 62.50% <0.00%> (+6.25%) | ⬆️ |
| ...emetry.Api/Internal/OpenTelemetryApiEventSource.cs | 82.35% <0.00%> (+8.82%) | ⬆️ |
| ...tation/OpenTelemetryProtocolExporterEventSource.cs | 85.00% <0.00%> (+10.00%) | ⬆️ |

codecov[bot] avatar Aug 03 '22 23:08 codecov[bot]

I ran Benchmark and Stress tests for Interlocked.Exchange vs Interlocked.CompareExchange. The performance is comparable in both the Benchmarks and the Stress tests. In fact, I ran the benchmark multiple times only to find that sometimes CompareExchange comes out on top and other times Exchange does. The difference in mean time between the two methods is ~0.02 ns.

using System.Threading;
using BenchmarkDotNet.Attributes;

namespace Benchmarks.Metrics
{
    [MemoryDiagnoser]
    public class TestBenchmarks
    {
        private int isLocked = 1;

        [Benchmark]
        public void Exchange()
        {
            // Unconditionally writes 1 and checks the previous value.
            var temp = Interlocked.Exchange(ref this.isLocked, 1) == 0;
        }

        [Benchmark]
        public void CompareExchange()
        {
            // Writes 1 only if the current value is 0, then checks the previous value.
            var temp = Interlocked.CompareExchange(ref this.isLocked, 1, 0) == 0;
        }
    }
}
| Method | Mean | Error | StdDev | Allocated |
|---|---|---|---|---|
| Exchange | 4.351 ns | 0.0465 ns | 0.0435 ns | - |
| CompareExchange | 4.330 ns | 0.0663 ns | 0.0620 ns | - |

I have updated the code to go back to using Interlocked.Exchange anyway, just to have fewer things to review 😄

utpilla avatar Aug 04 '22 22:08 utpilla

Beware that single threaded, zero contention lock perf benchmarks can (and probably will) give very different results vs. a test with multiple threads and some contention. Even if threads are never explicitly waiting for another to release the lock, down at the hardware level the cache line backing the critical section memory has to be moved between the caches connected to different cores. When the cores are continuously renegotiating which one has exclusive access to the memory that creates large additional latencies on those Interlocked instructions.
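A minimal multi-threaded version of that kind of test (a hypothetical editor's sketch, not the repo's stress test) would have several threads acquire and release the same flag, so the cache line backing it bounces between cores:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch of a contended lock test. All threads hammer the same
// Interlocked flag, so each Exchange pays the cross-core cache-line transfer
// latency described above; compare the loop rate against a single thread.
public static class ContendedLockDemo
{
    private static int sharedFlag;   // 0 = free, 1 = taken
    private static long totalLoops;

    public static long Run(int threadCount, TimeSpan duration)
    {
        using var cts = new CancellationTokenSource(duration);
        var tasks = new Task[threadCount];
        for (int i = 0; i < threadCount; i++)
        {
            tasks[i] = Task.Run(() =>
            {
                long local = 0;
                while (!cts.IsCancellationRequested)
                {
                    // Acquire/release the shared flag; every thread contends
                    // on the one cache line backing sharedFlag.
                    while (Interlocked.Exchange(ref sharedFlag, 1) != 0) { }
                    Volatile.Write(ref sharedFlag, 0);
                    local++;
                }
                Interlocked.Add(ref totalLoops, local);
            });
        }

        Task.WaitAll(tasks);
        return totalLoops;
    }
}
```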

noahfalk avatar Aug 04 '22 23:08 noahfalk

Check the PRs for the other two synchronization techniques:

  • #3547
  • #3550

Performance comparison of the different synchronization techniques

High contention scenario:

| Synchronization technique | Average Loops/second | Average CPU cycles/loop |
|---|---|---|
| Interlocked | 6.97 M | 525 |
| Classic lock | 5.16 M | 3117 |
| SpinLock | 5.14 M | 775 |

Low contention scenario:

| Synchronization technique | Average Loops/second | Average CPU cycles/loop |
|---|---|---|
| Interlocked | 22.05 M | 1031 |
| Classic lock | 19.92 M | 1127 |
| SpinLock | 18.54 M | 1160 |

The Interlocked-based "lock" outperforms the classic lock and SpinLock in the high-contention scenario, where multiple threads try to update the same MetricPoint in parallel: it achieved ~35% more loops/sec than the other two techniques, and it also used significantly fewer CPU cycles/loop (~83% fewer than the classic lock and ~23% fewer than SpinLock).

In the low-contention scenario, the Interlocked-based "lock" still shows some perf gain, but it's not as large.

The benchmark results for the Interlocked-based "lock" are comparable to those of the classic lock. The benchmark numbers for SpinLock are lower than those of both the Interlocked-based "lock" and the classic lock.
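For reference, the three techniques compared above differ only in how the critical section is guarded. A condensed side-by-side sketch (hypothetical names; the real code lives in `MetricPoint`):

```csharp
using System.Threading;

// Hypothetical condensed sketch of the three guarding strategies compared in
// the tables above; each method protects the same trivial critical section.
public class GuardStrategies
{
    private readonly object lockObject = new object();
    private SpinLock spinLock = new SpinLock(enableThreadOwnerTracking: false);
    private int interlockedFlag; // 0 = free, 1 = taken
    private long counter;

    public long Counter => Volatile.Read(ref this.counter);

    public void WithClassicLock()
    {
        // Monitor-based lock: blocks (parks the thread) under contention.
        lock (this.lockObject)
        {
            this.counter++;
        }
    }

    public void WithSpinLock()
    {
        // Struct-based SpinLock: busy-waits instead of blocking.
        bool taken = false;
        try
        {
            this.spinLock.Enter(ref taken);
            this.counter++;
        }
        finally
        {
            if (taken) this.spinLock.Exit();
        }
    }

    public void WithInterlockedFlag()
    {
        // Interlocked-based "lock": spin on an int flag, release via
        // Volatile.Write so the reset is visible to other cores.
        while (Interlocked.Exchange(ref this.interlockedFlag, 1) != 0) { }
        this.counter++;
        Volatile.Write(ref this.interlockedFlag, 0);
    }
}
```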

utpilla avatar Aug 05 '22 02:08 utpilla