Add efficient C++20 solution for computing log
A best-practice-style PR that shows off shiny new C++20 toys.
Visit the preview URL for this PR (for commit 121cf73aa9aab8d64ea808ad0f6025e9725eeaa4):
https://cp-algorithms--preview-911-i6to3sti.web.app
(expires 2022-08-09T07:25:07.618992295Z)
Hi, thanks for the suggestion! However, I'd like to understand whether this provides a more efficient solution than the one already written in the article. Could you maybe benchmark them against each other for repeated calls?
Also, I think before C++20 it was also possible to do this in constant time with __builtin_clz; it'd be nice to mention it in the article too and maybe also benchmark it against the current computation method...
Hey @adamant-pwn, it all boils down to memory access patterns. In a benchmark there is a tight loop, so caching the computation will look efficient because there's no work to do other than reading from memory.
An L1 access is 1-2 clock cycles, and IIRC main memory is around 60-80 clock cycles.
In the real world, reading from a cold cache will be a lot slower than just executing two instructions (lzcnt and then sub).
As you correctly pointed out, std::bit_width is just sugar-coated __builtin_clz; both result in the same instructions.
I wrote a benchmark out of curiosity just to test my theory, and the computed version is faster even when there's "good cache locality".
https://quick-bench.com/q/kXvGQ9Jeyscb6yhWcUjCwOlUghI
In the best-case scenario, both are equally fast:
https://quick-bench.com/q/j8fNME-7M1KuekZqyaU1rEVOPbA
(NOTE: I don't see the lzcnt instruction being used above because there's no -march=skylake etc.; adding it would make things even faster.)
This makes it an efficient solution in all cases.
It seems that __builtin_clz(1) - __builtin_clz(X) is a slightly faster way to go. And it also works before C++20.
Could you add some info on __builtin_clz and __builtin_clzll as well into the text, and link to the benchmarks?
@adamant-pwn I actually looked at the assembly of both. Here it is:
The only difference is that the C++20 version has a branch to deal with X = 0, since __builtin_clz(0) is undefined.
Apart from that, both should have the same performance, as they compile to basically the same instructions: bsr on old CPUs and lzcnt on new ones.
Thanks, could you please add some brief notes about it in the article, so the readers will be aware of these things as well?
@madhur4127 hi! Would it be a problem?
@adamant-pwn sorry for the delay. The C++20 solution using std::bit_width handles the case of 0, so it won't have UB. I'm not sure whether you want me to describe that __builtin_clz(0) is UB.
Yeah, basically I want to include an overview of the possible ways to perform this, including __builtin_clz, with some clarifications on possible caveats and links to the benchmarks we've run here. Of course it would be good to mention the UB on 0 as well.
Don't you think it will derail the sparse table article to go in depth on how to compute the log?
Maybe we should simply remove the log array, just use log2_floor(), and add the C++20 and __builtin_clz implementations.
Hm... I would probably keep the solution with an array as well, because it is easier to memorize and reproduce if you've forgotten specific function names and don't have any reference at hand. I don't think it would derail the sparse table article too much, as computing this is an integral part of the sparse table.
Okay cool, I added the benchmark and a non-C++20 solution.
Visit the preview URL for this PR (for commit 2bbb52aeba926e47868ffb9763f63402797c7a7a):
https://cp-algorithms--preview-911-sutbkjjc.web.app
(expires 2022-10-24T16:16:36.749738144Z)
Thanks! But I don't think the linked benchmark shows what it's supposed to show? It compares bit_width and the builtin function, not bit_width and the lg array.
Besides... Do we have a benchmark specifically for array access? That is, if we treat initialization of the array as pre-computation and remove it from benchmarked part, thus only benchmarking access of the array vs bit_width?
Urggh, you're right. I pushed the wrong link. Here's the right one: https://quick-bench.com/q/Zghbdj_TEkmw4XG2nqOpD3tsJ8U
Yeah, in the benchmark the pre-computation time is excluded because it's outside of the benchmarking loop.
Ok, let's use it like this! Thanks for the contribution!