
Dequeue sometimes skips a prefix of a sub-queue

Open fhackett opened this issue 2 months ago • 13 comments

This might be related to 1 or 2 other open issues, but I can't prove it's the same thing. I read the README very carefully, and I want to check whether I understood correctly. If so, I might have a reproducer for an issue others have reported.

Key question(s): is this a bug? If so, how can I help you diagnose it further, and if not, what is expected so I can revise my model?

For context, I'm part of a research project that chose your concurrent queue (among some other systems) to evaluate our new analysis tool. The tool fuzz-tests target code with both randomized call sequences and randomized scheduler behavior (via rr chaos mode, https://rr-project.org/). We record the timebox within which each operation happened, as well as what went into / out of the data structure.

Given these detailed logs, we use a model checker to compare them against a relatively complete TLA+ model of the collection-of-subqueues design your data structure uses. Most of the time our model agrees with the documented non-linearizable behavior (pseudo-randomly picking which subqueue to dequeue from, switching subqueues at arbitrary points). Note that we use the model checker to explore every interleaving of operations that are reported as overlapping, and every interpretation of operations that are ambiguous.

Sometimes, our validation process fails in a scenario like this (each operation happens in a distinct period of real time, all multiplexed onto one single CPU core):

  • Thread 1 enqueues [1, 2, 3, 4, 5]
  • Thread 2 dequeues [3, 4, 5]
  • Thread 3 dequeues [1, 2]

The real scenarios are more complex, but ultimately there is no possible interpretation in which the data structure returns elements in per-producer enqueue order. The skip-forward pattern above is common: subsequences keep their internal order (we never see [2, 1], for instance), but the subsequences come out in the wrong order (assuming I understand your docs).

Because we use rr, we have full record-replay debugging of these bug scenarios, and I can answer very specific questions if I know what to look for in the code. I am also willing to make edits to my testing setup, share data, etc.

The testing rig is public here: https://github.com/DistCompiler/pgo/tree/main/omnilink. You're welcome to check it out. The code calling your data structure is in concurrentqueue/workload. The build system can recompile it against any version of your code on demand. Note that the whole thing is "research grade" software, in that it is robust but not well documented, and in flux as we approach a paper submission deadline.

More details:

  • I am calling token-less bulk enqueue from multiple threads (but I am tracking which thread did what; your data structure uses an implicit per-thread token in this case right?)
  • my dequeue operation, again from any thread, is token-less bulk dequeue
  • to make fuzzing easier, I deliberately configured a small block size of 4
  • each producer enqueues a characteristic range of integers, so we can tell them apart easily

fhackett avatar Nov 12 '25 07:11 fhackett

Thank you for the detailed write-up (and for choosing to fuzz this project!).

Whether this is a bug or not hinges on how the order of calls to dequeue_bulk across threads is established. You say they're done sequentially in a non-overlapping way, but are there any memory ordering fences between the threads? (I understand that you're simulating this all on one actual thread but you're also modelling the interplay of low-level atomic operations as if multiple threads were actually involved.)

Does the following ever occur without another thread dequeuing [1, 2]?

  • Thread 1 enqueues [1, 2, 3, 4, 5]
  • Thread 2 dequeues [3, 4, 5]

In other words, does the issue depend on the presence of a 3rd thread invoking dequeue_bulk too? If it only reproduces with three operations and not two, that would suggest that there is an ordering issue such that the non-overlapping calls are not completely ordered as expected with respect to each other. On the other hand, if this reproduces with just two operations, that suggests this is a bug in the queue itself.

cameron314 avatar Nov 12 '25 13:11 cameron314

Another question: does it reproduce with producer tokens for the enqueue side? The implicit mapping of producers to sub-queues when tokens aren't used relies on several very specific features which may not be emulated correctly.

cameron314 avatar Nov 12 '25 13:11 cameron314

Thanks for your quick response.

To clarify, we're not modelling any weak memory effects here. If we're seeing those effects, it's surprising and interesting to me, because I fully expected that not to be measurable by our analysis. It's a known weakness of the approach that we would not detect anything like that.

From my understanding, we're running a normal single core execution, with the exception that scheduler time slices are randomized. No CPU operations are being emulated. The randomization is based on interfering with syscalls. Here's the blog post introducing the mode we're using, in case I misunderstood it somehow https://robert.ocallahan.org/2016/02/introducing-rr-chaos-mode.html. (rr is very interesting in its own right - didn't know about it till we started using it in our project)

On the other hand, do mention if all the above is not relevant to what you were saying. Maybe you mean that the effect we're seeing is also considered valid on a true single core CPU?

Regardless, I also ran some more experiments to double check what you asked.

I ran some batch runs with only 2 threads (to rule out the 3-thread scenario), and they all validate. Interesting: it is something to do with the 3rd thread. But, since I'm unsure whether you mean the behavior is intended on a true single-core execution, I'm not sure whether this confirms the supposed bug, discounts it, or points to something else entirely.

I also double checked, and the issue is unchanged by using a producer token directly.

Bonus thought: here is our timing operation, called before and after every data structure interaction.

// Requires <atomic> and <cstdint>; x86-specific (rdtsc).
static uint64_t get_timestamp_now() {
    // Thread-local is ok, per link below.
    // This might work better from a C library, since static initialization
    // happens inside the function (and we can be called from a C library).
    static thread_local std::atomic<int> _dummy_var_for_mfence{0}; // brace-init: atomic's copy ctor is deleted
    unsigned int lo, hi;
    _dummy_var_for_mfence.exchange(1); // full barrier on x86: prevent CPU reorderings
    // potentially faster than std::atomic_thread_fence https://stackoverflow.com/questions/48316830/why-does-this-stdatomic-thread-fence-work
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    _dummy_var_for_mfence.exchange(1); // full barrier on x86: prevent CPU reorderings
    return (((uint64_t)(hi & ~(1u << 31)) << 32) | lo); // 1u avoids signed-overflow UB in 1<<31
}

We wrote this to fence timestamp ops correctly even in actual multicore executions, so regardless of how rr does or does not work, maybe this acts as a sufficient fence? (I am the lesser of 2 authors of this function - I can check details with the true expert if you have questions)

fhackett avatar Nov 12 '25 21:11 fhackett

Thanks. I think I misunderstood what you're modelling versus what rr is doing.

Let me frame my question another way -- how do you know that it's the queue dequeuing out of order and not the operations being printed out of order? Is it related to these timestamps?

cameron314 avatar Nov 12 '25 22:11 cameron314

Ah I see - yes, it's the timestamps. As long as the timestamp-measuring code I pasted above is robust and non-reorderable, the operations are boxed in by those timestamps. Neither timestamp is "on" the operation, but the operation can't happen before its starting timestamp, nor can it happen after its ending timestamp. We write all these out in per-thread logs with no runtime effort to synchronize them (beyond taking care that the timeboxes accurately enclose when the op happened).

Given this pile of timeboxed logs, we're using the model checker to reconstruct a hypothetical order of events based on "the operation happened sometime between its pair of timestamps". So, it's not that I'm seeing wrong orders printed, but rather our solver cannot reconstruct any reasonable* order that could have been printed. Basically anything else could be wrong, but the timestamping accuracy is decently battle tested across data structures by now.

*reasonable depends on a model we custom made for this data structure that may be inaccurate

fhackett avatar Nov 13 '25 00:11 fhackett

Thanks. Can you extract the four timestamp values for the two dequeue operations?

cameron314 avatar Nov 13 '25 00:11 cameron314

I'm assuming that if you model operations like this the same way you never see out of order values?

uint64_t operation() {
    static std::atomic<uint64_t> a(0);
    return a.fetch_add(1, std::memory_order_acq_rel);
}

cameron314 avatar Nov 13 '25 00:11 cameron314

I'm assuming that if you model operations like this the same way you never see out of order values?

uint64_t operation() {
    static std::atomic<uint64_t> a(0);
    return a.fetch_add(1, std::memory_order_acq_rel);
}

That was a really good simple question, we had never explicitly checked, only speculated at length. I did, and the code you suggested behaves totally-ordered across 500k operations under our normal fuzzing conditions.

I just pushed the full test case under atomic/. It's basically the same as what you wrote, same memory order, just with the plumbing attached.

fhackett avatar Nov 13 '25 01:11 fhackett

Well, now I'm leaning towards "bug" :-)

But this will be difficult to debug. Are you able to extract variables from the rr recordings? I'd be interested in the state in ImplicitProducer during the two dequeue operations.

cameron314 avatar Nov 13 '25 02:11 cameron314

The rr recordings have a full gdb interface actually, where I can single step through operations based on their log timestamps. I've only got it to work reliably on the command line so far, though.

Our paper deadline is coming up fast, so I will have to get creative with how I use my time, and I won't be able to do any big follow-ups for the next couple of days. I figure now we have some rough "we should examine this further" consensus, putting us in the same position as another bug we found late. We know it's weird but don't know why it happens yet. (not all our bugs are like this, just these 2 late finds)

For now, here's a bunch more info (and sort of the timestamps you asked for): a diagram of one of the concrete counterexamples. Note that this is not "real" time, since the uncompressed 64 bit timestamps cause eye pain. Instead, I compress all distinct timestamps so they are all 1 apart from one another. I have the actual timestamps, but not in the diagram.

Here are the timeboxes, columns are threads, time flows upwards (overlaps are solved in any permutation that "works", in this case none): [image]

And here is the only scenario the solver could find, stopping part way through those actions, again time flows from bottom to top: [image]

These are in the context of a computed-so-far queue state. After EnqueueWithToken(elements = <<57>>, producer = 4) it was solved to look like this: [image]. The computed state is mostly trustworthy. It's the longest interpretation it could solve, so if there is a bad assumption that works for a long time, it might report that instead of a shorter path that points to what went wrong.

The concrete issue is, for this example, that neither of 2 candidate operations can proceed:

  • one wants to dequeue 57, despite 51 still being there (can't schedule the op that dequeues 57 yet, that one is measured as in the future relative to this op)
  • another wants to dequeue 51 (ok so far) but then jumps to 33, 32, 32 in the middle of queue 2, skipping elements. Jumping to 37, 34, 35 etc would be ok, but it doesn't do that. Again, the dequeue that consumes 37, 34, etc is measurably in the future.
  • anything else that looks plausible wasn't considered for timing reasons, because it will require these 2 above to have happened already.

Lastly, this is not a secret toolchain and you can use it yourself if you like. Pictures are from a viewer I wrote in JavaFX (odd library choice I know, it lets me load other JVM libs we have). I can teach you how to run the viewer, the fuzzer, or both given a decently powerful dev machine w/ Linux and Nix (Linux due to rr requirements, Nix to handle all the systems we have to build and fuzz w/ multiple versions).

fhackett avatar Nov 13 '25 04:11 fhackett

Oh, and I forgot to say: this is all based on the TLA+ modeling tooling. This file shows how we think your data structure works (well, the parts we're testing): https://github.com/DistCompiler/pgo/blob/main/omnilink/concurrentqueue/ConcurrentQueue.tla ... the syntax can be tricky, sorry about that.

And, while I will be strapped for time, I'll obviously still respond to follow-ups. I just won't be able to run more big experiments for a couple days.

fhackett avatar Nov 13 '25 05:11 fhackett

Quick update: we finished our submission, and I've been working on this off and on for a while since then. During said work, I both have and have not made progress. It's interesting.

First, I found some problems with our model, which was still demanding too much atomicity from the queue (we found counterexamples where our model was clearly in the wrong, not the queue). So, I rewrote the model to consider atomic sub-actions: for any bulk enqueue or dequeue, it's now ok if the behavior can be explained by some interleaving of individual pushes / pops on producer-specific subqueues. Of course, they still have to match the API's surface-level claims about producer-local ordering and call/return times.

As a result, validation still fails, but a bit more interestingly. There are cases where it looks like an element is being dequeued twice (across 2 competing threads), as well as cases where it's less clear what happened, but the implied producer-specific queue order is messed up somehow.

I'm still working on analysis / visualization to understand these scenarios better (and better present them to you), as well as to understand what to look for in the implementation. I've started reading the code, but as one might expect, it is no joke to understand from just reading it. I will ask questions when I've seen enough examples / thought about it enough.

fhackett avatar Dec 05 '25 02:12 fhackett

Thanks for the update. I haven't had time to really dig in (and am currently sick) so I'm glad you're unblocked. I can answer questions about the code.

cameron314 avatar Dec 05 '25 16:12 cameron314