atomic_queue icon indicating copy to clipboard operation
atomic_queue copied to clipboard

PR about 3 issues

Open BarisAlbayrakIEEE opened this issue 1 month ago • 4 comments

Hello @max0x7ba ,

I have three issues about atomic_queue.

Firstly, i could not find in the documentation, but i assume that the queues with atomic<T>::is_always_lock_free (Atomic_Queue and Atomic_QueueB) shall be used with caution such that these queues use a NIL sentinel (T{}). T{} shall not be pushed to these queues as it represents the empty state. I think this should be documented clearly.

The 2nd one is related to replacing the modulo (index % size_) with a bit mask (index & (size_ - 1)) during the index determination which you already mentioned in another issue and already applied to the queues recieving the size parameter. I can apply this modification to the statically sized queues with a PR.

The last one is about the branch prediction annotations missed within the try_ versions of push and pop. i think, the inspections for FULL/EMPTY queue could be annotated UNLIKELY.

If these three points make sense to you, I can prepare a documentation PR for the 1st issue and two PRs for the remaining two.

BarisAlbayrakIEEE avatar Dec 10 '25 08:12 BarisAlbayrakIEEE

Hello @max0x7ba ,

I have three issues about atomic_queue.

Firstly, i could not find in the documentation, but i assume that the queues with atomic::is_always_lock_free (Atomic_Queue and Atomic_QueueB) shall be used with caution such that these queues use a NIL sentinel (T{}). T{} shall not be pushed to these queues as it represents the empty state. I think this should be documented clearly.

May I advise reading the laconic documentation in full instead of searching it poorly?

https://github.com/max0x7ba/atomic_queue/blob/84a2cf09d20b7b1bc17e09e7ec0547caed238c8f/README.md?plain=1#L134

The 2nd one is related to replacing the modulo (index % size_) with a bit mask (index & (size_ - 1)) during the index determination which you already mentioned in another issue and already applied to the queues recieving the size parameter. I can apply this modification to the statically sized queues with a PR.

You may be pleasantly surprised to learn that compilers do such code transformations for you because size_ is a compile-time constant. It is one of strength reduction transformations, see https://en.wikipedia.org/wiki/Strength_reduction

The documentation mentions that too. https://github.com/max0x7ba/atomic_queue/blob/84a2cf09d20b7b1bc17e09e7ec0547caed238c8f/README.md?plain=1#L158

Have you inspected assembly output and found evidence to the contrary?

The last one is about the branch prediction annotations missed within the try_ versions of push and pop. i think, the inspections for FULL/EMPTY queue could be annotated UNLIKELY.

This hypothesis requires empirical data to confirm or deny: benchmark numbers before and after this change of yours.

There are no branch prediction hints in there because they didn't show any measurable improvements of the benchmarks.

max0x7ba avatar Dec 11 '25 02:12 max0x7ba

Thanks alot for returning my poor issue. The NIL sentinel is clearly noted in the documentation and guarded by the assert in the push function. Really, sorry for that one. For the size modulo, i should have known/considered if the compiler optimizes for modulo 2.

I actually studied your code hardly in detail together with the moodycamel concurrent queue, but somehow rushed to contact to you for the above issues. Anyway, thanks for sharing your time to response to my questions although most are already answered in the code or documentation. Now, i am working on the benchmark whether the UNLIKELY annotation makes any difference on the performance of the try functions. I will share my results soon.

thanks and best regards

BarisAlbayrakIEEE avatar Dec 11 '25 10:12 BarisAlbayrakIEEE

i should have known/considered if the compiler optimizes for modulo 2.

C++ compilers are optimizing compilers, which means they transform your code.

The only accurate and complete source of information about what actually happens under the hood is the generated machine code.

Anything else is just guessing what a compiler might do, conditional on the language standard, compiler options, architecture, platform ABI, etc..

Diagnosing the generated machine code is the only way to make accurate and useful prescriptions at all times.


The compiler generated assembly output is not the full story, though.

The full story is the assembly output along with particular addresses/offsets, padding and instruction bytecodes. Oftentimes, compilers cannot do ideal register allocation or instruction selection/scheduling, which results in longer bytecodes and padding, which results in more L1i cache misses, than otherwise necessary.

In other words, not always C++ compilers can produce 0-overhead machine code compared to platonically ideal hand-written assembly, and that gets only worse when functions gets inlined. Register allocation and instruction selection/scheduling are hard computer-science problems preventing compilers from generating ideal machine code.

Such issues become apparent and obvious when looking at the machine code. Whereas the compiler assembly output doesn't have this information because it is just another text source code to be fed into a (non-optimizing) assembler.

If you are serious about getting the best possible performance from your C++ code, or just want to have complete and accurate understanding of what some C++ code compiles into, objdump is the right tool for that. That's what «compile to binary» option does for you at https://gcc.godbolt.org/ , which you can and should do on your own in fully automated fashion and 0-friction.


Example invocation:

$ cd ~/src/atomic_queue

$ make -j8
...

$ nm -C --defined-only build/release/gcc/benchmarks.o | egrep 'throughput_consumer.+RetryDecorator'
...
0000000000005340 t void (anonymous namespace)::throughput_consumer<atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> > >(unsigned int, atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> >*, long long*, std::atomic<unsigned int>*, unsigned long long*, atomic_queue::Barrier*)
...

$ cxx_symbol='void (anonymous namespace)::throughput_consumer<atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> > >(unsigned int, atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> >*, long long*, std::atomic<unsigned int>*, unsigned long long*, atomic_queue::Barrier*)'

$ obj=$(realpath build/release/gcc/benchmarks.o)

$ objdump --disassembler-color=on -Mintel -C --disassemble="$cxx_symbol" "$obj" | egrep -v -e '^[[:space:]]*$|^Disassembly of section |^/[^:]+:[[:space:]]+file format'

That function is the benchmark consumer doing try_pop of a power-2-sized AtomicQueue you mention.

Which outputs colourful and most informative:

0000000000005340 <void (anonymous namespace)::throughput_consumer<atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> > >(unsigned int, atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> >*, long long*, std::atomic<unsigned int>*, unsigned long long*, atomic_queue::Barrier*)>:
    5340:	f3 0f 1e fa          	endbr64
    5344:	55                   	push   rbp
    5345:	49 89 d2             	mov    r10,rdx
    5348:	53                   	push   rbx
    5349:	49 89 cb             	mov    r11,rcx
    534c:	89 fb                	mov    ebx,edi
    534e:	f0 41 ff 01          	lock inc DWORD PTR [r9]
    5352:	eb 0e                	jmp    5362 <void (anonymous namespace)::throughput_consumer<atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> > >(unsigned int, atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> >*, long long*, std::atomic<unsigned int>*, unsigned long long*, atomic_queue::Barrier*)+0x22>
    5354:	66 66 2e 0f 1f 84 00 	data16 cs nop WORD PTR [rax+rax*1+0x0]
    535b:	00 00 00 00 
    535f:	90                   	nop
    5360:	f3 90                	pause
    5362:	41 8b 01             	mov    eax,DWORD PTR [r9]
    5365:	85 c0                	test   eax,eax
    5367:	75 f7                	jne    5360 <void (anonymous namespace)::throughput_consumer<atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> > >(unsigned int, atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> >*, long long*, std::atomic<unsigned int>*, unsigned long long*, atomic_queue::Barrier*)+0x20>
    5369:	ff c3                	inc    ebx
    536b:	45 31 c9             	xor    r9d,r9d
    536e:	48 8d 56 40          	lea    rdx,[rsi+0x40]
    5372:	48 8d be 80 00 00 00 	lea    rdi,[rsi+0x80]
    5379:	31 c9                	xor    ecx,ecx
    537b:	0f 1f 44 00 00       	nop    DWORD PTR [rax+rax*1+0x0]
    5380:	8b 02                	mov    eax,DWORD PTR [rdx]
    5382:	8b 2e                	mov    ebp,DWORD PTR [rsi]
    5384:	29 c5                	sub    ebp,eax
    5386:	85 ed                	test   ebp,ebp
    5388:	7e 56                	jle    53e0 <void (anonymous namespace)::throughput_consumer<atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> > >(unsigned int, atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> >*, long long*, std::atomic<unsigned int>*, unsigned long long*, atomic_queue::Barrier*)+0xa0>
    538a:	8d 68 01             	lea    ebp,[rax+0x1]
    538d:	f0 0f b1 2a          	lock cmpxchg DWORD PTR [rdx],ebp
    5391:	75 ef                	jne    5382 <void (anonymous namespace)::throughput_consumer<atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> > >(unsigned int, atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> >*, long long*, std::atomic<unsigned int>*, unsigned long long*, atomic_queue::Barrier*)+0x42>
    5393:	0f b7 c0             	movzx  eax,ax
    5396:	89 c5                	mov    ebp,eax
    5398:	c1 ed 04             	shr    ebp,0x4
    539b:	31 c5                	xor    ebp,eax
    539d:	83 e5 0f             	and    ebp,0xf
    53a0:	31 e8                	xor    eax,ebp
    53a2:	c1 e5 04             	shl    ebp,0x4
    53a5:	31 e8                	xor    eax,ebp
    53a7:	48 8d 04 87          	lea    rax,[rdi+rax*4]
    53ab:	89 cd                	mov    ebp,ecx
    53ad:	87 28                	xchg   DWORD PTR [rax],ebp
    53af:	85 ed                	test   ebp,ebp
    53b1:	74 0d                	je     53c0 <void (anonymous namespace)::throughput_consumer<atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> > >(unsigned int, atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> >*, long long*, std::atomic<unsigned int>*, unsigned long long*, atomic_queue::Barrier*)+0x80>
    53b3:	39 eb                	cmp    ebx,ebp
    53b5:	74 39                	je     53f0 <void (anonymous namespace)::throughput_consumer<atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> > >(unsigned int, atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> >*, long long*, std::atomic<unsigned int>*, unsigned long long*, atomic_queue::Barrier*)+0xb0>
    53b7:	89 ed                	mov    ebp,ebp
    53b9:	49 01 e9             	add    r9,rbp
    53bc:	eb c2                	jmp    5380 <void (anonymous namespace)::throughput_consumer<atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> > >(unsigned int, atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> >*, long long*, std::atomic<unsigned int>*, unsigned long long*, atomic_queue::Barrier*)+0x40>
    53be:	66 90                	xchg   ax,ax
    53c0:	f3 90                	pause
    53c2:	8b 28                	mov    ebp,DWORD PTR [rax]
    53c4:	85 ed                	test   ebp,ebp
    53c6:	75 e3                	jne    53ab <void (anonymous namespace)::throughput_consumer<atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> > >(unsigned int, atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> >*, long long*, std::atomic<unsigned int>*, unsigned long long*, atomic_queue::Barrier*)+0x6b>
    53c8:	f3 90                	pause
    53ca:	8b 28                	mov    ebp,DWORD PTR [rax]
    53cc:	85 ed                	test   ebp,ebp
    53ce:	74 f0                	je     53c0 <void (anonymous namespace)::throughput_consumer<atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> > >(unsigned int, atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> >*, long long*, std::atomic<unsigned int>*, unsigned long long*, atomic_queue::Barrier*)+0x80>
    53d0:	eb d9                	jmp    53ab <void (anonymous namespace)::throughput_consumer<atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> > >(unsigned int, atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> >*, long long*, std::atomic<unsigned int>*, unsigned long long*, atomic_queue::Barrier*)+0x6b>
    53d2:	66 66 2e 0f 1f 84 00 	data16 cs nop WORD PTR [rax+rax*1+0x0]
    53d9:	00 00 00 00 
    53dd:	0f 1f 00             	nop    DWORD PTR [rax]
    53e0:	f3 90                	pause
    53e2:	eb 9c                	jmp    5380 <void (anonymous namespace)::throughput_consumer<atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> > >(unsigned int, atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> >*, long long*, std::atomic<unsigned int>*, unsigned long long*, atomic_queue::Barrier*)+0x40>
    53e4:	66 66 2e 0f 1f 84 00 	data16 cs nop WORD PTR [rax+rax*1+0x0]
    53eb:	00 00 00 00 
    53ef:	90                   	nop
    53f0:	0f 31                	rdtsc
    53f2:	48 c1 e2 20          	shl    rdx,0x20
    53f6:	48 09 d0             	or     rax,rdx
    53f9:	f0 41 ff 0b          	lock dec DWORD PTR [r11]
    53fd:	75 03                	jne    5402 <void (anonymous namespace)::throughput_consumer<atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> > >(unsigned int, atomic_queue::RetryDecorator<atomic_queue::AtomicQueue<unsigned int, 65536u, 0u, true, true, false, false> >*, long long*, std::atomic<unsigned int>*, unsigned long long*, atomic_queue::Barrier*)+0xc2>
    53ff:	49 89 00             	mov    QWORD PTR [r8],rax
    5402:	5b                   	pop    rbx
    5403:	4d 89 0a             	mov    QWORD PTR [r10],r9
    5406:	5d                   	pop    rbp
    5407:	c3                   	ret

In the above, index % size_ compiled into:

 5393:	0f b7 c0             	movzx  eax,ax

index % size_ got transformed into index & (size_ - 1). size_ - 1 is 65535, or 0xffff in hex.

What a lucky value of the 2nd argument of and instruction, the optimizing compiler probably thinks to itself, and transforms index & 0xffff further into even cheaper movzx eax, ax.

max0x7ba avatar Dec 12 '25 02:12 max0x7ba

The last one is about the branch prediction annotations missed within the try_ versions of push and pop. i think, the inspections for FULL/EMPTY queue could be annotated UNLIKELY.

try_push is ideal for when the queue may be full. When that's rather unlikely, push is the best choice.

try_pop is ideal for when the queue may be empty. When that's rather unlikely, pop is the best choice.

In the throughput benchmark, empty queue is unlikely, full queue is likely.

In the latency benchmark, the queue is either empty or has 1 element in it.

The ideal branch prediction hint in try_ versions depends on the particular usage with a particular workload.

This information isn't available inside the scope of try_ function, so that a branch prediction hint in there would have a positive effect on some use cases and negative effect on others.

This information is much more likely to be available at higher-level scopes, and this is where a branch prediction hint can have its maximum strictly-positive impact. In other words, using a branch prediction hint on the return value of try_push or try_pop is the best; whereas placing a wrong prediction hint in these functions prevents/overrides a more informed correct branch prediction hint at a higher level from propagating down into callees and achieving its full positive impact.

Not placing a branch prediction inside try_push or try_pop is avoiding a premature optimization which can have negative impact in 50% of use-cases. No branch prediction in there leaves the door open for branch hint information to propagate from higher levels, and that, in fact, is the best possible optimization at try_push or try_pop level.

That's the logic between the lines in there.

If you think someone took an effort to place specific branch prediction hints everywhere, but somehow overlooked here, think twice -- it may be you overlooking the bigger picture behind the trees. And in 20-40 years that's going to be you telling these same words to younger people, if software engineering is something you love doing, I promise.

max0x7ba avatar Dec 12 '25 06:12 max0x7ba