druid icon indicating copy to clipboard operation
druid copied to clipboard

HttpRemoteTaskRunnerV2

Open jtuglu1 opened this issue 3 months ago • 3 comments

Description

Still a draft.

I've seen on the giant lock in HttpRemoteTaskRunner cause severe performance degradation under heavy load(200-500ms per acquisition with 1000s of activeTasks can slow down the startPendingTasks loop in TaskQueue). This leads to scheduling delays, which leads to more lag, which auto-scales more tasks, ..., etc. The runner also has a few (un)documented races abundant in the code. This overhead also slows down query tasks under load (e.g. MSQE and others) which utilize the scheduler for execution.

I'm attempting a rewrite of this class to optimize for throughput and safety.

Apart from the performance improvements/bug fixes, this will also include some new features:

  • Simpler code. The old task runner had old, legacy ZK references dangling around as well as a pretty complicated scheduling loop.
  • Task priority-based scheduling onto workers (e.g. prioritize realtime tasks for scheduling), follows a simple priority-based fair-queueing policy.

I would ultimately like to make this the default HttpRemoteTaskRunner and have it run in all tests/production clusters, etc. as I think that would help catch more bugs/issues.

Performance Testing

Test results thus far have shown ~100-300ms speed up per task runner operation (add(), etc.). Over 1000s of tasks, this amounts to minutes of delay saved.

Release note


Key changed/added classes in this PR
  • MyFoo
  • OurBar
  • TheirBaz

This PR has:

  • [ ] been self-reviewed.
    • [ ] using the concurrency checklist (Remove this item if the PR doesn't have any relation to concurrency.)
  • [ ] added documentation for new or modified features or behaviors.
  • [ ] a release note entry in the PR description.
  • [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • [ ] added or updated version, license, or notice information in licenses.yaml
  • [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • [ ] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • [x] added integration tests.
  • [x] been tested in a test Druid cluster.

jtuglu1 avatar Nov 08 '25 08:11 jtuglu1

@kfaraz any thoughts here?

jtuglu1 avatar Dec 08 '25 16:12 jtuglu1

@gianm @clintropolis could I get your thoughts here?

jtuglu1 avatar Dec 08 '25 18:12 jtuglu1

Bumping this @kfaraz 🙃.

jtuglu1 avatar Dec 15 '25 19:12 jtuglu1