Scalable counter
Hi. I would like to propose a new implementation of Counter with two goals:
- it scales under multi-threading
- its single-threaded performance stays roughly the same as without any synchronisation
Here are some performance measurements:
4 threads
$ ./benchmarks --benchmark_filter='BM_(Counter|Gauge)'
Run on (4 X 3400 MHz CPU s)
2018-10-23 16:02:33
--------------------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------------------
BM_Counter_Increment 3 ns 3 ns 262704325
BM_Counter/ConcurrentIncrement/threads:4 1 ns 3 ns 207105724
BM_Counter_Collect 264 ns 264 ns 2599009
BM_Gauge_Increment 14 ns 14 ns 50683315
BM_Gauge/ConcurrentIncrement/threads:4 85 ns 339 ns 2513412
BM_Gauge_Decrement 14 ns 14 ns 51979167
BM_Gauge_SetToCurrentTime 18 ns 18 ns 39151309
BM_Gauge_Collect 6 ns 6 ns 116305932
16 threads
Run on (16 X 3600 MHz CPU s)
2018-10-23 16:08:53
---------------------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------------------
BM_Counter_Increment 3 ns 3 ns 278594326
BM_Counter/ConcurrentIncrement/threads:16 0 ns 3 ns 259262288
BM_Counter_Collect 195 ns 195 ns 3618464
BM_Gauge_Increment 14 ns 14 ns 50904293
BM_Gauge/ConcurrentIncrement/threads:16 92 ns 1471 ns 480976
BM_Gauge_Decrement 14 ns 14 ns 52015530
BM_Gauge_SetToCurrentTime 15 ns 15 ns 45556675
BM_Gauge_Collect 6 ns 6 ns 125183114
All BM_Counter.* measurements use the new implementation; the BM_Gauge.* measurements use the old one. In the following I will compare these two implementations.
Comparing single-threaded performance, BM_Counter_Increment is roughly 4 times faster than BM_Gauge_Increment. I also added a baseline, BM_Counter_IncrementBaseline, which is a single-threaded implementation with no synchronisation at all. That unsynchronised baseline is only roughly 2 times faster than the new implementation:
Run on (4 X 3400 MHz CPU s)
2018-10-23 18:14:19
--------------------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------------------
BM_Counter_IncrementBaseline 2 ns 2 ns 571664704
BM_Counter_Increment 3 ns 3 ns 259162781
Comparing BM_Counter/ConcurrentIncrement with BM_Gauge/ConcurrentIncrement shows how well the new implementation scales with multiple threads: on a 4-core machine it is roughly 100 times faster, and on a 16-core machine roughly 500 times faster.
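For reference, the concurrent numbers above come from Google Benchmark's multi-threaded mode. The actual benchmark code is not shown in this thread, so the following is only a sketch of how such a benchmark is typically registered; the fixture name matches the output above, but the counter member and its default construction are assumptions.

```cpp
#include <benchmark/benchmark.h>
#include <prometheus/counter.h>

// Sketch only: not the benchmark code from this PR.
class BM_Counter : public benchmark::Fixture {
 protected:
  prometheus::Counter counter_;  // assumed to be default-constructible here
};

BENCHMARK_DEFINE_F(BM_Counter, ConcurrentIncrement)(benchmark::State& state) {
  // Every benchmark thread increments the same counter instance.
  for (auto _ : state) {
    counter_.Increment();
  }
}

// ->Threads(N) produces the "BM_Counter/ConcurrentIncrement/threads:N" rows.
BENCHMARK_REGISTER_F(BM_Counter, ConcurrentIncrement)->Threads(4)->Threads(16);
```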
The work is shifted to the Collect function, which now takes longer. I think this is acceptable, because collecting/scraping the counters should not happen at high frequency. The Collect function can be made faster by reducing the size of the array in counter.h. The size of that array determines the maximum number of concurrent threads and could be made a template parameter.
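To make that trade-off concrete, here is a rough sketch of why Collect scales with the array size; the function name and the template parameter are my assumptions, not the PR code.

```cpp
#include <array>
#include <atomic>
#include <cstddef>

// Sketch only: Collect has to walk every per-thread slot, so its cost grows
// linearly with MaxThreads. Making MaxThreads a template parameter would let
// users trade the supported thread count against Collect cost.
template <std::size_t MaxThreads = 256>
double SumPerThreadSlots(const std::array<std::atomic<double>, MaxThreads>& slots) {
  double sum = 0.0;
  for (const auto& slot : slots) {
    sum += slot.load(std::memory_order_relaxed);
  }
  return sum;
}
```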
What do you think?
Regarding the implementation:
- I intentionally implemented almost everything as inline functions in the header, because it made the benchmark faster (I can provide numbers if needed)
- The per-thread counters are of type CacheLine to avoid false sharing between threads; this effect is also measurable. The magic number 128 used for the alignment could be made a (template) parameter (128 bytes is the cache-line size of PowerPC).
- The only point worth mentioning, which makes this implementation slightly less convenient than the previous one, is that the per-thread counters live in a fixed-size array. I tried a std::vector, but that does not work without further synchronisation, which would make the whole change pointless. Essentially this means that one Counter instantiation can only be used by 256 different threads over the lifetime of the Counter object. The 256 could be made a parameter, but in general I do not think this is a problem, because I expect most systems to use something like a thread pool, so threads are reused instead of created all the time (a rough sketch of this layout follows after this list).
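Putting the two bullets above together, this is roughly the layout I have in mind; names such as CacheLine's member, kMaxThreads and ThreadSlot are illustrative assumptions, not the actual PR code.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

// Sketch only, not the PR code.
constexpr std::size_t kMaxThreads = 256;  // the fixed-size array discussed above

// 128-byte alignment gives every slot its own cache line (PowerPC line size),
// which avoids false sharing between threads.
struct alignas(128) CacheLine {
  std::atomic<double> value{0.0};
};

using Slots = std::array<CacheLine, kMaxThreads>;

// Each thread claims one slot index the first time it increments. The index is
// thread_local, so later increments only pay for the run-time array indexing.
inline std::size_t ThreadSlot() {
  static std::atomic<std::size_t> next_slot{0};
  thread_local const std::size_t slot = next_slot.fetch_add(1);
  assert(slot < kMaxThreads);  // more than kMaxThreads distinct threads is an error
  return slot;
}

inline void Increment(Slots& slots, double v = 1.0) {
  // Only the owning thread ever writes its slot, so a relaxed load/store pair
  // is enough; no atomic read-modify-write is needed here.
  auto& value = slots[ThreadSlot()].value;
  value.store(value.load(std::memory_order_relaxed) + v, std::memory_order_relaxed);
}
```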
It bothered me a little that the trivial implementation is still a factor of 2 faster than this one. It turns out the reason is purely the per-thread counter array: the index into the array is only known at run time, and this indexing causes the performance loss. I still think we should stick with the array, because other implementations would involve something like explicit thread registration by the user. So the array is a good compromise, and the performance is nevertheless great.
Sorry, it took a while until I found some time to look into this PR. I have two major comments:
- AFAIR some platforms (mostly mobile) have problems with thread_local storage. I therefore would not make this Counter implementation the default one. Instead I could think of making this code available as a standalone ConcurrentCounter implementation.
- I don't like my daemons to crash only because a thread limit was reached. Instead I'd propose code like in c1874369ce8822dd31a6e0eb5223c67c6bf4e24c. I use std::hash to hash the thread-id into a bucket-id. Chances that two busy threads end up in the same bucket exist, but the number of threads is unlimited. If you find the probability of hash collisions unacceptable, I could think of making the bucket-id retrieval function a policy or such. On my MacBook, replacing thread_local with always retrieving std::this_thread::get_id() was within measuring tolerance.
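A minimal sketch of the bucket-id lookup described above; the bucket count and the function name are assumptions, the actual code is in the commit referenced above.

```cpp
#include <cstddef>
#include <functional>
#include <thread>

// Sketch only: hash the current thread's id into a fixed number of buckets.
// Two busy threads may share a bucket, but the number of threads is unbounded.
constexpr std::size_t kBuckets = 128;  // assumed bucket count

inline std::size_t BucketId() {
  return std::hash<std::thread::id>{}(std::this_thread::get_id()) % kBuckets;
}
```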
Thanks, Gregor
I just realized that code with collisions won't work with:
// Non-atomic read-modify-write: two threads hashed to the same bucket can lose updates.
const double new_value{c.v.load(std::memory_order_relaxed) + v};
c.v.store(new_value, std::memory_order_relaxed);
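One way to make the per-bucket update collision-safe would be an atomic read-modify-write, for example a compare-and-swap loop. The following is only a sketch of that idea, not code from the PR or the referenced commit.

```cpp
#include <atomic>

// Sketch only: a CAS loop replaces the racy load/store pair above, so two
// threads that hash to the same bucket cannot overwrite each other's update.
inline void AddToBucket(std::atomic<double>& bucket, double v) {
  double current = bucket.load(std::memory_order_relaxed);
  while (!bucket.compare_exchange_weak(current, current + v,
                                       std::memory_order_relaxed)) {
    // On failure, current is reloaded with the latest value; just retry.
  }
}
```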