PR about 3 issues
Hello @max0x7ba,
I have three issues regarding atomic_queue.
First, I could not find this in the documentation, but I assume that the queues requiring atomic<T>::is_always_lock_free (AtomicQueue and AtomicQueueB) must be used with caution because they use a NIL sentinel (T{}): T{} must not be pushed into these queues, as it represents the empty state. I think this should be documented clearly.
The second one relates to replacing the modulo (index % size_) with a bit mask (index & (size_ - 1)) when computing the index, which you already mentioned in another issue and already applied to the queues receiving the size parameter at run time. I can apply this modification to the statically sized queues with a PR.
The last one is about the branch prediction annotations missing from the try_ versions of push and pop. I think the checks for a FULL/EMPTY queue could be annotated UNLIKELY.
If these three points make sense to you, I can prepare a documentation PR for the first issue and separate PRs for the other two.
Hello @max0x7ba,
I have three issues regarding atomic_queue.
First, I could not find this in the documentation, but I assume that the queues requiring atomic<T>::is_always_lock_free (AtomicQueue and AtomicQueueB) must be used with caution because they use a NIL sentinel (T{}): T{} must not be pushed into these queues, as it represents the empty state. I think this should be documented clearly.
May I advise reading the laconic documentation in full instead of searching it poorly?
https://github.com/max0x7ba/atomic_queue/blob/84a2cf09d20b7b1bc17e09e7ec0547caed238c8f/README.md?plain=1#L134
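For illustration, a minimal sketch of the documented constraint, assuming the AtomicQueue API described in the README (the NIL template argument defaults to T{} and marks an empty slot):

// Minimal sketch of the NIL-sentinel constraint, assuming the AtomicQueue API
// documented in the README; NIL defaults to T{}, here 0u, and marks an empty slot.
#include <atomic_queue/atomic_queue.h>

int main() {
    atomic_queue::AtomicQueue<unsigned, 1024> q;
    q.push(1u);     // OK: 1u != NIL.
    // q.push(0u);  // Wrong: 0u is the NIL sentinel; push asserts on this.
    unsigned element;
    return q.try_pop(element) && element == 1u ? 0 : 1;
}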
The second one relates to replacing the modulo (index % size_) with a bit mask (index & (size_ - 1)) when computing the index, which you already mentioned in another issue and already applied to the queues receiving the size parameter at run time. I can apply this modification to the statically sized queues with a PR.
You may be pleasantly surprised to learn that compilers do such code transformations for you because size_ is a compile-time constant. It is one of the strength reduction transformations; see https://en.wikipedia.org/wiki/Strength_reduction
The documentation mentions that too. https://github.com/max0x7ba/atomic_queue/blob/84a2cf09d20b7b1bc17e09e7ec0547caed238c8f/README.md?plain=1#L158
Have you inspected assembly output and found evidence to the contrary?
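One quick way to gather that evidence is a standalone snippet on https://gcc.godbolt.org/ (hypothetical function names, for illustration only); with gcc -O2 on x86-64, both functions compile to the same single movzx instruction:

// Strength reduction sketch: with an unsigned operand and a power-of-2
// compile-time constant, the modulo becomes the equivalent bit mask.
unsigned mod_pow2(unsigned index) {
    return index % 65536u;   // gcc -O2, x86-64: movzx eax, di
}

unsigned mask_pow2(unsigned index) {
    return index & 65535u;   // identical instruction: movzx eax, di
}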
The last one is about the branch prediction annotations missing from the try_ versions of push and pop. I think the checks for a FULL/EMPTY queue could be annotated UNLIKELY.
This hypothesis requires empirical data to confirm or refute: benchmark numbers before and after this change of yours.
There are no branch prediction hints in there because they didn't show any measurable improvement in the benchmarks.
Thanks a lot for responding to my poorly researched issue. The NIL sentinel is clearly noted in the documentation and guarded by the assert in the push function. Really, sorry for that one. For the size modulo, I should have known/considered that the compiler optimizes a modulo by a power of 2.
I actually studied your code in detail together with the moodycamel concurrent queue, but somehow rushed to contact you about the above issues. Anyway, thanks for taking the time to respond to my questions even though most are already answered in the code or documentation. Now I am working on a benchmark to see whether the UNLIKELY annotation makes any difference to the performance of the try_ functions. I will share my results soon.
Thanks and best regards
I should have known/considered that the compiler optimizes a modulo by a power of 2.
C++ compilers are optimizing compilers, which means they transform your code.
The only accurate and complete source of information about what actually happens under the hood is the generated machine code.
Anything else is just guessing what a compiler might do, conditional on the language standard, compiler options, architecture, platform ABI, etc.
Diagnosing the generated machine code is the only way to make accurate and useful prescriptions at all times.
The compiler-generated assembly output is not the full story, though.
The full story is the assembly output along with the particular addresses/offsets, padding and instruction encodings. Oftentimes, compilers cannot do ideal register allocation or instruction selection/scheduling, which results in longer encodings and padding, and hence in more L1i cache misses, than otherwise necessary.
In other words, C++ compilers cannot always produce 0-overhead machine code compared to platonically ideal hand-written assembly, and that only gets worse when functions get inlined. Register allocation and instruction selection/scheduling are hard computer-science problems that prevent compilers from generating ideal machine code.
Such issues become apparent and obvious when looking at the machine code, whereas the compiler's assembly output doesn't carry this information because it is just another textual source to be fed into a (non-optimizing) assembler.
If you are serious about getting the best possible performance from your C++ code, or just want a complete and accurate understanding of what some C++ code compiles into, objdump is the right tool for that. That's what the «compile to binary» option does for you at https://gcc.godbolt.org/, and it is something you can and should do on your own in a fully automated, 0-friction fashion.
Example invocation:
$ cd ~/src/atomic_queue
$ make -j8
...
$ nm -C --defined-only build/release/gcc/benchmarks.o | egrep 'throughput_consumer.+RetryDecorator'
...
0000000000005340 t void (anonymous namespace)::throughput_consumer<atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> > >(unsigned int, atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> >*, long long*, std::atomic<unsigned int>*, unsigned long long*, atomic_queue::Barrier*)
...
$ cxx_symbol='void (anonymous namespace)::throughput_consumer<atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> > >(unsigned int, atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> >*, long long*, std::atomic<unsigned int>*, unsigned long long*, atomic_queue::Barrier*)'
$ obj=$(realpath build/release/gcc/benchmarks.o)
$ objdump --disassembler-color=on -Mintel -C --disassemble="$cxx_symbol" "$obj" | egrep -v -e '^[[:space:]]*$|^Disassembly of section |^/[^:]+:[[:space:]]+file format'
That function is the benchmark consumer doing try_pop on the power-of-2-sized AtomicQueue you mention.
Which outputs this colourful and most informative disassembly:
0000000000005340 <void (anonymous namespace)::throughput_consumer<atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> > >(unsigned int, atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> >*, long long*, std::atomic<unsigned int>*, unsigned long long*, atomic_queue::Barrier*)>:
5340: f3 0f 1e fa endbr64
5344: 55 push rbp
5345: 49 89 d2 mov r10,rdx
5348: 53 push rbx
5349: 49 89 cb mov r11,rcx
534c: 89 fb mov ebx,edi
534e: f0 41 ff 01 lock inc DWORD PTR [r9]
5352: eb 0e jmp 5362 <void (anonymous namespace)::throughput_consumer<atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> > >(unsigned int, atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> >*, long long*, std::atomic<unsigned int>*, unsigned long long*, atomic_queue::Barrier*)+0x22>
5354: 66 66 2e 0f 1f 84 00 data16 cs nop WORD PTR [rax+rax*1+0x0]
535b: 00 00 00 00
535f: 90 nop
5360: f3 90 pause
5362: 41 8b 01 mov eax,DWORD PTR [r9]
5365: 85 c0 test eax,eax
5367: 75 f7 jne 5360 <void (anonymous namespace)::throughput_consumer<atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> > >(unsigned int, atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> >*, long long*, std::atomic<unsigned int>*, unsigned long long*, atomic_queue::Barrier*)+0x20>
5369: ff c3 inc ebx
536b: 45 31 c9 xor r9d,r9d
536e: 48 8d 56 40 lea rdx,[rsi+0x40]
5372: 48 8d be 80 00 00 00 lea rdi,[rsi+0x80]
5379: 31 c9 xor ecx,ecx
537b: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
5380: 8b 02 mov eax,DWORD PTR [rdx]
5382: 8b 2e mov ebp,DWORD PTR [rsi]
5384: 29 c5 sub ebp,eax
5386: 85 ed test ebp,ebp
5388: 7e 56 jle 53e0 <void (anonymous namespace)::throughput_consumer<atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> > >(unsigned int, atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> >*, long long*, std::atomic<unsigned int>*, unsigned long long*, atomic_queue::Barrier*)+0xa0>
538a: 8d 68 01 lea ebp,[rax+0x1]
538d: f0 0f b1 2a lock cmpxchg DWORD PTR [rdx],ebp
5391: 75 ef jne 5382 <void (anonymous namespace)::throughput_consumer<atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> > >(unsigned int, atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> >*, long long*, std::atomic<unsigned int>*, unsigned long long*, atomic_queue::Barrier*)+0x42>
5393: 0f b7 c0 movzx eax,ax
5396: 89 c5 mov ebp,eax
5398: c1 ed 04 shr ebp,0x4
539b: 31 c5 xor ebp,eax
539d: 83 e5 0f and ebp,0xf
53a0: 31 e8 xor eax,ebp
53a2: c1 e5 04 shl ebp,0x4
53a5: 31 e8 xor eax,ebp
53a7: 48 8d 04 87 lea rax,[rdi+rax*4]
53ab: 89 cd mov ebp,ecx
53ad: 87 28 xchg DWORD PTR [rax],ebp
53af: 85 ed test ebp,ebp
53b1: 74 0d je 53c0 <void (anonymous namespace)::throughput_consumer<atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> > >(unsigned int, atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> >*, long long*, std::atomic<unsigned int>*, unsigned long long*, atomic_queue::Barrier*)+0x80>
53b3: 39 eb cmp ebx,ebp
53b5: 74 39 je 53f0 <void (anonymous namespace)::throughput_consumer<atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> > >(unsigned int, atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> >*, long long*, std::atomic<unsigned int>*, unsigned long long*, atomic_queue::Barrier*)+0xb0>
53b7: 89 ed mov ebp,ebp
53b9: 49 01 e9 add r9,rbp
53bc: eb c2 jmp 5380 <void (anonymous namespace)::throughput_consumer<atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> > >(unsigned int, atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> >*, long long*, std::atomic<unsigned int>*, unsigned long long*, atomic_queue::Barrier*)+0x40>
53be: 66 90 xchg ax,ax
53c0: f3 90 pause
53c2: 8b 28 mov ebp,DWORD PTR [rax]
53c4: 85 ed test ebp,ebp
53c6: 75 e3 jne 53ab <void (anonymous namespace)::throughput_consumer<atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> > >(unsigned int, atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> >*, long long*, std::atomic<unsigned int>*, unsigned long long*, atomic_queue::Barrier*)+0x6b>
53c8: f3 90 pause
53ca: 8b 28 mov ebp,DWORD PTR [rax]
53cc: 85 ed test ebp,ebp
53ce: 74 f0 je 53c0 <void (anonymous namespace)::throughput_consumer<atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> > >(unsigned int, atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> >*, long long*, std::atomic<unsigned int>*, unsigned long long*, atomic_queue::Barrier*)+0x80>
53d0: eb d9 jmp 53ab <void (anonymous namespace)::throughput_consumer<atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> > >(unsigned int, atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> >*, long long*, std::atomic<unsigned int>*, unsigned long long*, atomic_queue::Barrier*)+0x6b>
53d2: 66 66 2e 0f 1f 84 00 data16 cs nop WORD PTR [rax+rax*1+0x0]
53d9: 00 00 00 00
53dd: 0f 1f 00 nop DWORD PTR [rax]
53e0: f3 90 pause
53e2: eb 9c jmp 5380 <void (anonymous namespace)::throughput_consumer<atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> > >(unsigned int, atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> >*, long long*, std::atomic<unsigned int>*, unsigned long long*, atomic_queue::Barrier*)+0x40>
53e4: 66 66 2e 0f 1f 84 00 data16 cs nop WORD PTR [rax+rax*1+0x0]
53eb: 00 00 00 00
53ef: 90 nop
53f0: 0f 31 rdtsc
53f2: 48 c1 e2 20 shl rdx,0x20
53f6: 48 09 d0 or rax,rdx
53f9: f0 41 ff 0b lock dec DWORD PTR [r11]
53fd: 75 03 jne 5402 <void (anonymous namespace)::throughput_consumer<atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> > >(unsigned int, atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> >*, long long*, std::atomic<unsigned int>*, unsigned long long*, atomic_queue::Barrier*)+0xc2>
53ff: 49 89 00 mov QWORD PTR [r8],rax
5402: 5b pop rbx
5403: 4d 89 0a mov QWORD PTR [r10],r9
5406: 5d pop rbp
5407: c3 ret
In the above, index % size_ compiled into:
5393: 0f b7 c0 movzx eax,ax
index % size_ got transformed into index & (size_ - 1). size_ - 1 is 65535, or 0xffff in hex.
What a lucky value for the 2nd argument of the and instruction, the optimizing compiler probably thinks to itself, and it transforms index & 0xffff further into the even cheaper movzx eax,ax.
The last one is about the branch prediction annotations missing from the try_ versions of push and pop. I think the checks for a FULL/EMPTY queue could be annotated UNLIKELY.
try_push is ideal for when the queue may be full. When that's rather unlikely, push is the best choice.
try_pop is ideal for when the queue may be empty. When that's rather unlikely, pop is the best choice.
In the throughput benchmark, an empty queue is unlikely and a full queue is likely.
In the latency benchmark, the queue is either empty or has 1 element in it.
The ideal branch prediction hint in try_ versions depends on the particular usage with a particular workload.
This information isn't available inside the scope of a try_ function, so a branch prediction hint in there would have a positive effect on some use cases and a negative effect on others.
This information is much more likely to be available at higher-level scopes, and that is where a branch prediction hint can have its maximum strictly-positive impact. In other words, placing a branch prediction hint on the return value of try_push or try_pop is best, whereas placing a wrong prediction hint inside these functions prevents/overrides a more informed, correct branch prediction hint at a higher level from propagating down into the callees and achieving its full positive impact.
Not placing a branch prediction hint inside try_push or try_pop avoids a premature optimization which can have a negative impact in 50% of use cases. No branch prediction hint in there leaves the door open for branch hint information to propagate from higher levels, and that, in fact, is the best possible optimization at the try_push or try_pop level.
That's the logic between the lines in there.
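For illustration, a sketch of what that higher-level hint might look like (hypothetical names consume and process; __builtin_expect as understood by GCC and Clang):

#include <atomic_queue/atomic_queue.h>

void process(unsigned element); // hypothetical handler, defined elsewhere

// Hypothetical consumer loop, for illustration only: this caller knows that in
// its workload an empty queue is rare, so the branch prediction hint goes on
// the return value of try_pop rather than inside try_pop itself.
void consume(atomic_queue::AtomicQueue<unsigned, 1024>& q, bool const& running) {
    while (running) {
        unsigned element;
        if (__builtin_expect(q.try_pop(element), 1)) // likely: rarely empty here
            process(element);
        // else: unlikely path, the queue happened to be empty; just retry.
    }
}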
If you think someone took the effort to place specific branch prediction hints everywhere but somehow overlooked them here, think twice -- it may be you overlooking the bigger picture behind the trees. And in 20-40 years it is going to be you saying these same words to younger people, if software engineering is something you love doing, I promise.