Scalable counter
Hi. I would like to propose a new implementation of Counter with two goals:
- it scales under multi-threading
- its single-threaded performance stays roughly the same as without any synchronisation
Here are some performance measurements:
4 threads
$ ./benchmarks --benchmark_filter='BM_(Counter|Gauge)'
Run on (4 X 3400 MHz CPU s)
2018-10-23 16:02:33
--------------------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------------------
BM_Counter_Increment 3 ns 3 ns 262704325
BM_Counter/ConcurrentIncrement/threads:4 1 ns 3 ns 207105724
BM_Counter_Collect 264 ns 264 ns 2599009
BM_Gauge_Increment 14 ns 14 ns 50683315
BM_Gauge/ConcurrentIncrement/threads:4 85 ns 339 ns 2513412
BM_Gauge_Decrement 14 ns 14 ns 51979167
BM_Gauge_SetToCurrentTime 18 ns 18 ns 39151309
BM_Gauge_Collect 6 ns 6 ns 116305932
16 threads
Run on (16 X 3600 MHz CPU s)
2018-10-23 16:08:53
---------------------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------------------
BM_Counter_Increment 3 ns 3 ns 278594326
BM_Counter/ConcurrentIncrement/threads:16 0 ns 3 ns 259262288
BM_Counter_Collect 195 ns 195 ns 3618464
BM_Gauge_Increment 14 ns 14 ns 50904293
BM_Gauge/ConcurrentIncrement/threads:16 92 ns 1471 ns 480976
BM_Gauge_Decrement 14 ns 14 ns 52015530
BM_Gauge_SetToCurrentTime 15 ns 15 ns 45556675
BM_Gauge_Collect 6 ns 6 ns 125183114
All BM_Counter.* measurements use the new implementation; the BM_Gauge.* measurements use the old one. In the following I will compare these two implementations.
Comparing single-threaded performance, BM_Counter_Increment is roughly 4 times faster than BM_Gauge_Increment. I also added a baseline, BM_Counter_IncrementBaseline, which is a single-threaded implementation with no synchronisation at all. That unsynchronised baseline is only roughly 2 times faster than the new implementation:
Run on (4 X 3400 MHz CPU s)
2018-10-23 18:14:19
--------------------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------------------
BM_Counter_IncrementBaseline 2 ns 2 ns 571664704
BM_Counter_Increment 3 ns 3 ns 259162781
Comparing BM_Counter/ConcurrentIncrement with BM_Gauge/ConcurrentIncrement shows how well the new implementation scales with multiple threads: on a 4-core machine it is roughly 100 times faster, and on a 16-core machine roughly 500 times faster.
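For reference, the concurrent numbers above come from Google Benchmark's multi-threaded mode. The actual benchmark code is not shown in this thread, so the following is only a sketch of how such a benchmark is typically registered; the fixture name matches the output above, but the counter member and its default construction are assumptions.

```cpp
#include <benchmark/benchmark.h>
#include <prometheus/counter.h>

// Sketch only: not the benchmark code from this PR.
class BM_Counter : public benchmark::Fixture {
 protected:
  prometheus::Counter counter_;  // assumed to be default-constructible here
};

BENCHMARK_DEFINE_F(BM_Counter, ConcurrentIncrement)(benchmark::State& state) {
  // Every benchmark thread increments the same counter instance.
  for (auto _ : state) {
    counter_.Increment();
  }
}

// ->Threads(N) produces the "BM_Counter/ConcurrentIncrement/threads:N" rows.
BENCHMARK_REGISTER_F(BM_Counter, ConcurrentIncrement)->Threads(4)->Threads(16);
```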
The work is shifted to the Collect function, which now takes longer. I think this is acceptable, because collecting/scraping the counters should not happen at high frequency. The Collect function can be made faster by reducing the size of the array in counter.h. The size of that array determines the maximum number of concurrent threads and could be made a template parameter.
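To make that trade-off concrete, here is a rough sketch of why Collect scales with the array size; the function name and the template parameter are my assumptions, not the PR code.

```cpp
#include <array>
#include <atomic>
#include <cstddef>

// Sketch only: Collect has to walk every per-thread slot, so its cost grows
// linearly with MaxThreads. Making MaxThreads a template parameter would let
// users trade the supported thread count against Collect cost.
template <std::size_t MaxThreads = 256>
double SumPerThreadSlots(const std::array<std::atomic<double>, MaxThreads>& slots) {
  double sum = 0.0;
  for (const auto& slot : slots) {
    sum += slot.load(std::memory_order_relaxed);
  }
  return sum;
}
```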
What do you think?
Regarding the implementation:
- I intentionally implemented almost everything as inline functions in the header, because it made the benchmark faster (I can provide numbers if needed)
- The per-thread counters are of type CacheLine to avoid false sharing between threads; this effect is also measurable. The magic number 128 used for the alignment could be made a (template) parameter (128 bytes is the cache-line size of PowerPC).
- The only point worth mentioning, which makes this implementation slightly less convenient than the previous one, is that the per-thread counters live in a fixed-size array. I tried a std::vector, but that does not work without further synchronisation, which would make the whole change pointless. Essentially this means that one Counter instantiation can only be used by 256 different threads over the lifetime of the Counter object. The 256 could be made a parameter, but in general I do not think this is a problem, because I expect most systems to use something like a thread pool, so threads are reused instead of created all the time (a rough sketch of this layout follows after this list).
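Putting the two bullets above together, this is roughly the layout I have in mind; names such as CacheLine's member, kMaxThreads and ThreadSlot are illustrative assumptions, not the actual PR code.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

// Sketch only, not the PR code.
constexpr std::size_t kMaxThreads = 256;  // the fixed-size array discussed above

// 128-byte alignment gives every slot its own cache line (PowerPC line size),
// which avoids false sharing between threads.
struct alignas(128) CacheLine {
  std::atomic<double> value{0.0};
};

using Slots = std::array<CacheLine, kMaxThreads>;

// Each thread claims one slot index the first time it increments. The index is
// thread_local, so later increments only pay for the run-time array indexing.
inline std::size_t ThreadSlot() {
  static std::atomic<std::size_t> next_slot{0};
  thread_local const std::size_t slot = next_slot.fetch_add(1);
  assert(slot < kMaxThreads);  // more than kMaxThreads distinct threads is an error
  return slot;
}

inline void Increment(Slots& slots, double v = 1.0) {
  // Only the owning thread ever writes its slot, so a relaxed load/store pair
  // is enough; no atomic read-modify-write is needed here.
  auto& value = slots[ThreadSlot()].value;
  value.store(value.load(std::memory_order_relaxed) + v, std::memory_order_relaxed);
}
```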
It bothered me a little that the trivial implementation is still a factor of 2 faster than this one. It turns out the reason is purely the per-thread counter array: the index into the array is only known at run time, and this indexing causes the performance loss. I still think we should stick with the array, because other implementations would involve something like explicit thread registration by the user. So the array is a good compromise, and the performance is nevertheless great.
Sorry, it took a while until I found some time to look into this PR. I have two major comments:
- AFAIR some platforms (mostly mobile) have problems with thread_local storage. I therefore would not make this Counter implementation the default one. Instead I could think of making this code available as a standalone ConcurrentCounter implementation.
- I don't like my daemons to crash only because a thread limit was reached. Instead I'd propose code like in c1874369ce8822dd31a6e0eb5223c67c6bf4e24c. I use std::hash to hash the thread-id into a bucket-id. Chances that two busy threads end up in the same bucket exist, but the number of threads is unlimited. If you find the probability of hash collisions unacceptable, I could think of making the bucket-id retrieval function a policy or such. On my MacBook, replacing thread_local with always retrieving std::this_thread::get_id() was within measuring tolerance.
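A minimal sketch of the bucket-id lookup described above; the bucket count and the function name are assumptions, the actual code is in the commit referenced above.

```cpp
#include <cstddef>
#include <functional>
#include <thread>

// Sketch only: hash the current thread's id into a fixed number of buckets.
// Two busy threads may share a bucket, but the number of threads is unbounded.
constexpr std::size_t kBuckets = 128;  // assumed bucket count

inline std::size_t BucketId() {
  return std::hash<std::thread::id>{}(std::this_thread::get_id()) % kBuckets;
}
```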
Thanks, Gregor
I just realized that code with collisions won't work with:
// Non-atomic read-modify-write: two threads hashed to the same bucket can lose updates.
const double new_value{c.v.load(std::memory_order_relaxed) + v};
c.v.store(new_value, std::memory_order_relaxed);
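One way to make the per-bucket update collision-safe would be an atomic read-modify-write, for example a compare-and-swap loop. The following is only a sketch of that idea, not code from the PR or the referenced commit.

```cpp
#include <atomic>

// Sketch only: a CAS loop replaces the racy load/store pair above, so two
// threads that hash to the same bucket cannot overwrite each other's update.
inline void AddToBucket(std::atomic<double>& bucket, double v) {
  double current = bucket.load(std::memory_order_relaxed);
  while (!bucket.compare_exchange_weak(current, current + v,
                                       std::memory_order_relaxed)) {
    // On failure, current is reloaded with the latest value; just retry.
  }
}
```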