HttpRemoteTaskRunnerV2
Description
Still a draft.
I've seen on the giant lock in HttpRemoteTaskRunner cause severe performance degradation under heavy load(200-500ms per acquisition with 1000s of activeTasks can slow down the startPendingTasks loop in TaskQueue). This leads to scheduling delays, which leads to more lag, which auto-scales more tasks, ..., etc. The runner also has a few (un)documented races abundant in the code. This overhead also slows down query tasks under load (e.g. MSQE and others) which utilize the scheduler for execution.
I'm attempting a rewrite of this class to optimize for throughput and safety.
Apart from the performance improvements/bug fixes, this will also include some new features:
- Simpler code. The old task runner had old, legacy ZK references dangling around as well as a pretty complicated scheduling loop.
- Task priority-based scheduling onto workers (e.g. prioritize realtime tasks for scheduling), follows a simple priority-based fair-queueing policy.
I would ultimately like to make this the default HttpRemoteTaskRunner and have it run in all tests/production clusters, etc. as I think that would help catch more bugs/issues.
Performance Testing
Test results thus far have shown ~100-300ms speed up per task runner operation (add(), etc.). Over 1000s of tasks, this amounts to minutes of delay saved.
Release note
Key changed/added classes in this PR
-
MyFoo -
OurBar -
TheirBaz
This PR has:
- [ ] been self-reviewed.
- [ ] using the concurrency checklist (Remove this item if the PR doesn't have any relation to concurrency.)
- [ ] added documentation for new or modified features or behaviors.
- [ ] a release note entry in the PR description.
- [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
- [ ] added or updated version, license, or notice information in licenses.yaml
- [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
- [ ] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
- [x] added integration tests.
- [x] been tested in a test Druid cluster.
@kfaraz any thoughts here?
@gianm @clintropolis could I get your thoughts here?
Bumping this @kfaraz 🙃.