
skew the scheduler towards locality and throughput

Open HippoBaro opened this issue 4 years ago • 5 comments

This PR changes how the scheduler behaves within a task queue to favor throughput and locality.

This is a trick from the Go scheduler to get most of the benefits of a LIFO queue while preserving the fairness of a FIFO queue. It eliminates the (potentially large) scheduling latency that otherwise arises from appending newly scheduled runnables to the end of the task queue.

The change works as follows (a minimal sketch follows the list):

  • Scheduled tasks go into a "priority slot" instead of the queue. If the slot is already occupied, the new task replaces the previous one, and the displaced task goes to the back of the queue.
  • When the scheduler retrieves the next task, it takes the task from the priority slot, if any, instead of the queue.
  • The LocalExecutor maintains an internal counter that is incremented each time we run a task from the priority slot instead of the queue. When this counter reaches a constant (256 as of this commit), the task in the priority slot is pushed to the back of the queue, and we pop the next task as normal.
  • Each time we switch task queues, the priority slot is emptied so that we pick from the FIFO queue the next time we visit it.
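
A minimal sketch of the mechanism in Rust (the `TaskQueue`/`Task` types, field names, and method names here are hypothetical stand-ins for illustration, not glommio's actual internals):

```rust
use std::collections::VecDeque;

// `Task` stands in for glommio's internal runnable type; the layout is
// hypothetical and used only for illustration.
pub struct Task(pub u64);

pub struct TaskQueue {
    pub queue: VecDeque<Task>,
    pub priority_slot: Option<Task>,
    // Consecutive runs served from the priority slot since the last
    // visit to the FIFO queue.
    pub slot_runs: u32,
}

// Matches the constant described in the PR (256 as of this commit).
const MAX_SLOT_RUNS: u32 = 256;

impl TaskQueue {
    // A newly woken task displaces the slot's current occupant, which is
    // demoted to the back of the FIFO queue.
    pub fn schedule(&mut self, task: Task) {
        if let Some(old) = self.priority_slot.replace(task) {
            self.queue.push_back(old);
        }
    }

    // Serve from the priority slot while the fairness budget lasts; once
    // it is exhausted, demote the slot's task and fall back to FIFO order.
    pub fn next_task(&mut self) -> Option<Task> {
        if self.slot_runs < MAX_SLOT_RUNS {
            if let Some(task) = self.priority_slot.take() {
                self.slot_runs += 1;
                return Some(task);
            }
        } else if let Some(task) = self.priority_slot.take() {
            self.queue.push_back(task);
        }
        self.slot_runs = 0;
        self.queue.pop_front()
    }

    // Called when the executor moves on to another task queue: empty the
    // slot so the next visit starts from the FIFO queue.
    pub fn on_queue_switch(&mut self) {
        if let Some(task) = self.priority_slot.take() {
            self.queue.push_back(task);
        }
        self.slot_runs = 0;
    }
}
```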

I added a new benchmark that simulates a worst-case scenario for the scheduler. These are the results on glommio master:

time to complete for 100 tasks: 40.715854ms (407.158µs per task)
time to complete for 1000 tasks: 397.54535ms (397.545µs per task)
time to complete for 10000 tasks: 3.81689191s (381.689µs per task)

And after I apply the changes in this PR:

time to complete for 100 tasks: 836.719µs (8.367µs per task)
time to complete for 1000 tasks: 1.014916ms (1.014µs per task)
time to complete for 10000 tasks: 2.421971ms (242ns per task)

I also get a 40% improvement when running the glommio SPSC buffered shared channel benchmark!

Such wow!

HippoBaro avatar Sep 27 '21 01:09 HippoBaro

Note that this PR breaks two Glommio unit tests, which I removed. Those tests weren't checking correctness; they asserted the order in which the reactor runs tasks within a common task queue. Since we make no guarantees about that order, I feel fine removing them.

@glommer although I think this is correct, I'd really like it if you could take a good look at this. :bow:

HippoBaro avatar Sep 27 '21 01:09 HippoBaro

I like the idea! Some small comments.

duarten avatar Sep 27 '21 15:09 duarten

I have a few problems with this strategy:

First, please validate my understanding: what you are proposing we do here is prioritize explicit wakes within the same task queue by essentially allowing temporary reordering of those tasks that are chained together by the explicit wake.

That's fine, but:

  • Explicit wakes outside the reactor are rare, and I have no reason to believe they are more important than wakes triggered by io_uring, for example. Most explicit wakes are in channels. Why is a channel wake more important than I/O completion?
  • On that point, this only covers local channels. In most interesting applications, the channels that matter a lot are cross-executor.

So I think this risks unfairness for questionable gains. This could be made more interesting by somehow handling foreign wakes in a similar manner, and/or identifying I/O patterns down in the reactor that could also benefit from similar treatment.

glommer avatar Sep 28 '21 12:09 glommer

> First, please validate my understanding: what you are proposing we do here is prioritize explicit wakes within the same task queue by essentially allowing temporary reordering of those tasks that are chained together by the explicit wake.

Not quite; this change makes it so that the most recently woken task has scheduling priority. That task may have been woken locally or by the reactor; it doesn't matter. The reactor wakes tasks using a plain Waker object, so my changes apply there too.

The same is true for tasks awakened remotely. A remote executor wakes a foreign task by posting a message on a communication channel. Upon receiving the message, the foreign executor wakes the associated task using a plain Waker, so that task gets scheduling priority as well.
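
To illustrate why the wake source doesn't matter, here is a hedged sketch reusing the hypothetical `TaskQueue`/`Task` types from above (the `SlotWaker` type is invented for illustration): whether the wake comes from an io_uring completion, a cross-executor message, or a local channel, it funnels through a standard `Waker`, and therefore through the same `schedule` path and priority slot.

```rust
use std::sync::{Arc, Mutex};
use std::task::{Wake, Waker};

// Hypothetical glue type: ties a task id to its queue so a wake lands in
// the priority slot, regardless of who triggered it.
struct SlotWaker {
    queue: Arc<Mutex<TaskQueue>>,
    task_id: u64,
}

impl Wake for SlotWaker {
    fn wake(self: Arc<Self>) {
        // Every wake source ends up here: reactor, remote executor, or
        // local channel. All of them hit the same priority slot.
        self.queue.lock().unwrap().schedule(Task(self.task_id));
    }
}

fn waker_for(queue: Arc<Mutex<TaskQueue>>, task_id: u64) -> Waker {
    Waker::from(Arc::new(SlotWaker { queue, task_id }))
}
```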

Now, that being said, there is a big caveat here: the reactor works in batches. We drain the I/O rings and wake many tasks in bulk, and the same goes for shared channels, which we consume eagerly, waking many tasks at once. My change doesn't work well with this pattern because the LIFO buffer I introduced has a capacity of one.

But the same is valid for user code that wakes many tasks at once (for instance, a condvar with notify_all). In that case, only the last woken task gets priority; the others are pushed to the back of the queue as usual. A concrete illustration follows.
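
Here is a small runnable illustration of that caveat, again using the hypothetical `TaskQueue`/`Task` sketch from above: a batch wake of three tasks leaves only the last one in the priority slot.

```rust
use std::collections::VecDeque;

// A batch wake (the reactor draining a ring, or a condvar notify_all)
// leaves only the *last* woken task in the capacity-1 priority slot.
fn main() {
    let mut tq = TaskQueue {
        queue: VecDeque::new(),
        priority_slot: None,
        slot_runs: 0,
    };
    tq.schedule(Task(1)); // slot = 1
    tq.schedule(Task(2)); // 1 demoted to the FIFO queue, slot = 2
    tq.schedule(Task(3)); // 2 demoted, slot = 3
    assert_eq!(tq.next_task().unwrap().0, 3); // the slot wins...
    assert_eq!(tq.next_task().unwrap().0, 1); // ...then plain FIFO order
    assert_eq!(tq.next_task().unwrap().0, 2);
}
```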

HippoBaro avatar Sep 28 '21 13:09 HippoBaro

@HippoBaro - Hey, can you post any/all of your benchmarks together in one block, so they're easy to spot when people come across this thread? I will make other comments, but so far I like this as an idea if we can mitigate some of the downsides (which it seems like you are doing). Good stuff!

bryandmc avatar Oct 02 '21 22:10 bryandmc