
False sharing?

crsaracco opened this issue 5 years ago • 4 comments

From a quick-ish glance at RingBuffer, it looks like head and tail could possibly be falsely shared.

However, I tried to improve it by padding the atomics so they'd be on different cache lines, wrote a threaded benchmark, and failed to produce something faster than the current implementation (in fact, my version was usually slightly slower in benchmarks).

So, I guess this issue is either:

  • How is the current implementation getting around false sharing?
  • Or, if you haven't explicitly thought about false sharing before, perhaps look into it and see if you can make ringbuf even faster :)
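
For reference, the padding attempt described above might look roughly like this in Rust (a hypothetical sketch, not ringbuf's actual code; the wrapper type, field names, and the 64-byte line size are all assumptions — cache line size varies by CPU):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Over-align each index so it occupies its own cache line
// (64 bytes is typical on x86-64, but not guaranteed everywhere).
#[repr(align(64))]
struct CacheAligned(AtomicUsize);

struct PaddedRingIndices {
    head: CacheAligned, // advanced by the consumer
    tail: CacheAligned, // advanced by the producer
}

fn main() {
    let idx = PaddedRingIndices {
        head: CacheAligned(AtomicUsize::new(0)),
        tail: CacheAligned(AtomicUsize::new(0)),
    };
    idx.tail.0.store(8, Ordering::Release);
    let len = idx.tail.0.load(Ordering::Acquire) - idx.head.0.load(Ordering::Acquire);
    assert_eq!(len, 8);
    // Each field starts on a 64-byte boundary, so head and tail
    // can never share a cache line.
    assert_eq!(std::mem::align_of::<CacheAligned>(), 64);
    assert!(std::mem::size_of::<PaddedRingIndices>() >= 128);
}
```

This is essentially what crossbeam's `CachePadded` type does in a more portable way.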

crsaracco avatar Jul 30 '20 00:07 crsaracco

Thank you for pointing out this possible issue. To be honest, I haven't considered it.

But is the current implementation really subject to false sharing? Let me slightly rewrite the code sample from Wikipedia to roughly demonstrate what the current ring buffer logic looks like:

struct foo {
    int x; /* plays the role of head (advanced by the consumer) */
    int y; /* plays the role of tail (advanced by the producer) */
};

static struct foo f;

/* The two following functions are running concurrently: */

void consume(void)
{
    for (int i = 0; i < 1000000; ++i) {
        if (f.x < f.y) {
            f.x += 1;
        }
    }
}

void produce(void)
{
    for (int i = 0; i < 1000000; ++i) {
        if (f.y - f.x < 100) {
            f.y += 1;
        }
    }
}

Both the consumer and the producer read both head and tail (but each writes only one of them). I'm not sure this sharing is false, because both head and tail have to be synchronized between cores anyway, but maybe I missed something.

agerasev avatar Jul 30 '20 11:07 agerasev

Okay, so I did some more reading on this, and the usual trick is to have a "cached" head and tail: [1] [2]. I missed this detail when I tried it yesterday.

Basically, you can reduce the cache line contention by having the consumer update their knowledge of the producer's position less frequently (only when it thinks the queue might be empty does it double-check/update). Same idea for the producer.
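
The caching trick for the consumer side can be sketched roughly like this (hypothetical Rust; the names and structure are mine, not taken from either linked implementation, and the real code would also publish head atomically for the producer):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Single-consumer sketch: keep a local snapshot of the producer's tail
// and only re-read the shared atomic when the snapshot looks empty.
struct Consumer<'a> {
    head: usize,              // only this thread advances head
    cached_tail: usize,       // local snapshot of the producer's tail
    shared_tail: &'a AtomicUsize,
}

impl<'a> Consumer<'a> {
    fn try_advance(&mut self) -> bool {
        if self.head == self.cached_tail {
            // Looks empty: refresh the snapshot from the shared atomic.
            self.cached_tail = self.shared_tail.load(Ordering::Acquire);
            if self.head == self.cached_tail {
                return false; // really empty
            }
        }
        self.head += 1; // consume one slot without touching the atomic
        true
    }
}

fn main() {
    let tail = AtomicUsize::new(3); // producer has published 3 items
    let mut c = Consumer { head: 0, cached_tail: 0, shared_tail: &tail };
    let mut consumed = 0;
    while c.try_advance() {
        consumed += 1;
    }
    // All three items are consumed, but the shared tail was only
    // loaded twice (once on the first call, once to confirm empty).
    assert_eq!(consumed, 3);
}
```

The point is that the atomic load (and the cache traffic it causes) happens once per "batch" instead of once per element.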

I'll play around with this a little bit more and see if I can get something faster in benchmarks. It does increase the complexity of the code a little bit, so you can decide if you want to actually use it (assuming I'm successful). :)

crsaracco avatar Jul 30 '20 12:07 crsaracco

@crsaracco, @mgeier, thank you for your interest! And sorry for such a late answer.

I agree that false sharing may be one of the issues. I've tried to fix it.

But the idea of updating the atomic indices less often while staying transparent to the user seems to have a drawback: you cannot decide when to actually synchronize the indices. For example, suppose you synchronize only when the buffer is full or empty. In that case, what if the consumer thread needs to get just one byte, but has to wait until the buffer is full?

I believe the better approach is to synchronize the buffer state on every action (as ringbuf does); if you don't need such frequent synchronization, you may accumulate data in some local buffer and push it all at once when needed. Another possible approach is to remove all implicit synchronization on push/pop and require the user to flush the buffer state explicitly (maybe using some kind of guard that flushes the buffer when it gets dropped).
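
The guard idea might look roughly like this (a hypothetical Rust sketch of a design direction, not an API that ringbuf provides; all names here are invented for illustration):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Pushes only bump a local counter; the shared tail is published
// in a single atomic store when the guard is dropped.
struct PushGuard<'a> {
    local_tail: usize,
    shared_tail: &'a AtomicUsize,
}

impl<'a> PushGuard<'a> {
    fn push_one(&mut self) {
        self.local_tail += 1; // no atomic traffic per push
    }
}

impl<'a> Drop for PushGuard<'a> {
    fn drop(&mut self) {
        // Publish everything at once when the guard goes out of scope.
        self.shared_tail.store(self.local_tail, Ordering::Release);
    }
}

fn main() {
    let tail = AtomicUsize::new(0);
    {
        let mut guard = PushGuard { local_tail: 0, shared_tail: &tail };
        for _ in 0..5 {
            guard.push_one();
        }
        // Nothing is visible to the consumer yet.
        assert_eq!(tail.load(Ordering::Acquire), 0);
    } // guard dropped here: one store publishes all five pushes
    assert_eq!(tail.load(Ordering::Acquire), 5);
}
```

This makes the synchronization point explicit and user-controlled, at the cost of a less transparent API.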

Anyway thank you for your assistance!

agerasev avatar May 14 '21 16:05 agerasev

Please note that caching the atomic indices is not a solution to the false sharing problem (though it might reduce the number of accesses to the atomic variables and thereby somewhat work around the problem).

The proper solution to false sharing is to make sure that the two indices don't share a cache line.

> But the idea of updating the atomic indices less often while staying transparent to the user seems to have a drawback: you cannot decide when to actually synchronize the indices. For example, suppose you synchronize only when the buffer is full or empty. In that case, what if the consumer thread needs to get just one byte, but has to wait until the buffer is full?

I don't understand. I don't think that's what should happen.

The atomic indices should never be updated too late. The idea is that if an index is advanced by a lot, all further increments will not have to touch the atomic variable at all until it's all caught up.

I'm not sure whether it's actually good to cache both indices or if caching only one of them is better, see https://github.com/mgeier/rtrb/pull/48. But caching one (if it's the right one) definitely has a measurable effect.

The other options you mentioned are of course also possible.

mgeier avatar May 19 '21 18:05 mgeier