Add efficient C++20 solution for computing log
A best-practice-style PR that shows off shiny new C++20 toys.
Visit the preview URL for this PR (for commit 121cf73aa9aab8d64ea808ad0f6025e9725eeaa4):
https://cp-algorithms--preview-911-i6to3sti.web.app
(expires 2022-08-09T07:25:07.618992295Z)
Hi, thanks for the suggestion! However, I'd like to understand whether this provides a more efficient solution than the one already written in the article. Could you maybe benchmark them against each other for repeated calls?
Also, I think before C++20 it was also possible to do this in constant time with __builtin_clz; it'd be nice to mention it in the article too and maybe also benchmark it against the current computation method...
Hey @adamant-pwn, it all boils down to memory access patterns. In a benchmark there is a tight loop, so caching the computation will look efficient because there's no work to do other than reading from memory.
An L1 access is 1-2 clock cycles, and IIRC main memory is around 60-80 clock cycles.
In the real world, reading from a cold cache will be a lot slower than just executing two instructions (lzcnt and then sub).
As you correctly pointed out, std::bit_width is just sugar-coated __builtin_clz; both result in the same instructions.
I wrote a benchmark out of curiosity just to test my theory, and the computed version is faster even when there's "good cache locality".
https://quick-bench.com/q/kXvGQ9Jeyscb6yhWcUjCwOlUghI
In the best-case scenario, both are equally fast:
https://quick-bench.com/q/j8fNME-7M1KuekZqyaU1rEVOPbA
(NOTE: I don't see the lzcnt instruction being used above because there's no -march=skylake etc.; adding it would make things even faster.)
This makes it an efficient solution in all cases.
It seems that __builtin_clz(1) - __builtin_clz(X) is a slightly faster way to go. And it also works before C++20.
Could you add some info on __builtin_clz and __builtin_clzll as well into the text, and link to the benchmarks?
@adamant-pwn I actually looked at the assembly of both. Here it is:
The only difference is that the C++20 version has a branch to deal with X = 0, since __builtin_clz(0) is undefined.
Apart from that, both should have the same performance, as they compile to basically the same instructions: bsr on old CPUs and lzcnt on new ones.
Thanks, could you please add some brief notes about it in the article, so the readers will be aware of these things as well?
@madhur4127 hi! Would it be a problem?
@adamant-pwn sorry for the delay. The C++20 solution using std::bit_width handles the case of 0, so it won't have UB. I'm not sure whether you want me to describe that __builtin_clz(0) is UB.
Yeah, basically I want to include an overview of the possible ways to perform this, including __builtin_clz, with some clarifications on possible caveats and links to the benchmarks we've run here. Of course it would be good to mention the UB on 0 as well.
Don't you think it will derail the sparse table article to go in depth on how to compute the log?
Maybe we should simply remove the log array, just use log2_floor(), and add the C++20 and __builtin_clz implementations.
Hm... I would probably keep the solution with an array as well, because it is easier to memorize and reproduce if you've forgotten specific function names and don't have any reference at hand. I don't think it would derail the sparse table article too much, as computing this is an integral part of the sparse table.
Okay cool, I added the benchmark and a non-C++20 solution.
Visit the preview URL for this PR (for commit 2bbb52aeba926e47868ffb9763f63402797c7a7a):
https://cp-algorithms--preview-911-sutbkjjc.web.app
(expires 2022-10-24T16:16:36.749738144Z)
Thanks! But I don't think the linked benchmark shows what it's supposed to show? It compares bit_width and the builtin function, not bit_width and the lg array.
Besides... Do we have a benchmark specifically for array access? That is, if we treat initialization of the array as pre-computation and remove it from benchmarked part, thus only benchmarking access of the array vs bit_width?
Urggh, you're right. I pushed the wrong link. Here's the right one: https://quick-bench.com/q/Zghbdj_TEkmw4XG2nqOpD3tsJ8U
Yeah, in the benchmark the pre-computation time is excluded because it's outside of the benchmarking loop.
Ok, let's use it like this! Thanks for the contribution!