A New Druid MiddleManager Resource Scheduling Model Based on K8s (MOK)
Fixes https://github.com/apache/druid/issues/10824.
Description
Please read https://github.com/apache/druid/issues/10824 for details.
This PR adds a new extension named druid-kubernetes-middlemanager-extensions under extensions-contrib, so it does not touch Druid core. It has already been tested on a DEV Druid cluster running on K8s, covering index_kafka, index, index_parallel, and compact tasks.
With this extension, a single MiddleManager pod with 2 cores and 2Gi of memory can control dozens or even hundreds of peon pods, so there is no need to reserve a large amount of resources for the MiddleManager in advance.
In addition, different kinds of tasks can now use different configurations, including CPU and memory resources.
As for autoscaling: a MiddleManager autoscaler is probably unnecessary in this scenario, because the resources the MiddleManager occupies are small enough. Combined with #10524, Druid gains the (tested) ability to create peon pods and automatically scale the number of pods!
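To make the model concrete, here is a minimal, illustrative sketch of launching a peon as a pod with per-task-type resources. This is not the extension's actual code: the fabric8 kubernetes-client, the image name, the namespace, and the sizing rules are all assumptions made for the example.

```java
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.api.model.PodBuilder;
import io.fabric8.kubernetes.api.model.Quantity;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;

public class PeonPodLauncher
{
  public Pod launchPeon(String taskId, String taskType)
  {
    // Hypothetical per-task-type sizing; a real runner would read this
    // from runtime properties instead of hard-coding it.
    final String cpu = "index_kafka".equals(taskType) ? "1" : "2";
    final String memory = "index_kafka".equals(taskType) ? "2Gi" : "4Gi";

    final Pod peon = new PodBuilder()
        .withNewMetadata()
          // Pod names must be RFC 1123 compliant, hence the lowercasing.
          .withName("peon-" + taskId.toLowerCase().replace('_', '-'))
        .endMetadata()
        .withNewSpec()
          .addNewContainer()
            .withName("peon")
            .withImage("apache/druid:dev")                          // placeholder image
            .withArgs("internal", "peon", "/druid/task/" + taskId)  // illustrative peon args
            .withNewResources()
              .addToRequests("cpu", new Quantity(cpu))
              .addToRequests("memory", new Quantity(memory))
              .addToLimits("cpu", new Quantity(cpu))
              .addToLimits("memory", new Quantity(memory))
            .endResources()
          .endContainer()
          .withRestartPolicy("Never")
        .endSpec()
        .build();

    try (KubernetesClient client = new DefaultKubernetesClient()) {
      return client.pods().inNamespace("druid").create(peon);
    }
  }
}
```

The key point is that CPU and memory are chosen per task at launch time, instead of being reserved up front on a MiddleManager machine.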

This PR has:
- [x] been self-reviewed.
- [ ] using the concurrency checklist (Remove this item if the PR doesn't have any relation to concurrency.)
- [ ] added documentation for new or modified features or behaviors.
- [x] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
- [x] added or updated version, license, or notice information in licenses.yaml
- [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
- [x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
- [x] added integration tests.
- [x] been tested in a test Druid cluster.
All CI jobs pass now, except https://travis-ci.com/github/apache/druid/jobs/484742979, which failed due to a network issue; a retry should succeed.
Will add documentation for this feature soon to explain how to enable it.
Modified the current K8s-related IT job (76). This job now:
- Builds a Druid cluster on K8s.
- Has no ZooKeeper dependency.
- Has the MiddleManager launch peon pods to ingest data.
CI passes as well, which means this PR is ready for review!
Users can also run these commands in their local environment to try this new feature:
```bash
export CONFIG_FILE='k8s_run_config_file.json'
export IT_TEST='-Dit.test=ITNestedQueryPushDownTest'
export POD_NAME=int-test
export POD_NAMESPACE=default
export BUILD_DRUID_CLUSTER=true
export MAVEN_SKIP="-Pskip-static-checks -Ddruid.console.skip=true -Dmaven.javadoc.skip=true"
mvn -B clean install -q -ff -Pskip-static-checks -Ddruid.console.skip=true -Dmaven.javadoc.skip=true -Pskip-tests -T1C
mvn verify -pl integration-tests -P int-tests-config-file ${IT_TEST} ${MAVEN_SKIP} -Dpod.name=${POD_NAME} -Dpod.namespace=${POD_NAMESPACE} -Dbuild.druid.cluster=${BUILD_DRUID_CLUSTER}
```
We'd love to be able to run peons as k8s pods.
It does kind of raise the question of why we would want middlemanagers at all rather than just letting the overlord create k8s pods... But if it's a lot easier to build this way, then that's reasonable.
Hi @glasser
Letting the overlord create K8s pods would be a huge change, touching the API, the task scheduling model, and so on.
Letting the MiddleManager create peon pods and control their lifecycle is much easier. Maybe we can let the overlord create pods directly and disable the MM in the future.
> Letting the overlord create K8s pods would be a huge change, touching the API, the task scheduling model, and so on.
That's too bad, since one of the original goals of the TaskRunner interface is that it could be used to allow the Overlord to launch tasks directly as YARN applications. (At the time, YARN was more popular than K8S 🙂)
I guess it didn't live up to this goal, if you found it easier to have the MMs launch K8S pods.
Anyway, I'm not a K8S expert, but this seems like a very interesting idea.
Hi @gianm Thanks for your attention. Maybe I didn't make my point clear.
In my opinion, I just follow the current workflow: overlord -> middlemanager -> peon.
The difference between ForkingTaskRunner and the new K8sForkingTaskRunner (something like ThreadingTaskRunner for CliIndexer) is that the MiddleManager launches the peon as a pod in Kubernetes rather than as a child process on the same machine.
Because the workflow stays the same, we don't need to consider potential API, task lifecycle control, or other changes. That's why I believe it's easier to have the MMs launch K8s pods :)
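To make the parallel concrete, here is a rough sketch of the run() path; it is not the PR's actual code. The TaskRunner.run(Task) contract stays the same as ForkingTaskRunner's, but the child process becomes a pod. The fabric8 client, the namespace, and the pod-naming scheme are assumptions, and a complete runner must also implement the rest of the TaskRunner interface (shutdown, log streaming, and so on).

```java
import java.util.concurrent.TimeUnit;

import com.google.common.util.concurrent.ListenableFuture;
import com.google.common.util.concurrent.ListeningExecutorService;
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClient;
import org.apache.druid.indexer.TaskStatus;
import org.apache.druid.indexing.common.task.Task;

public class K8sForkingTaskRunnerSketch
{
  private final KubernetesClient client;
  private final ListeningExecutorService exec;

  public K8sForkingTaskRunnerSketch(KubernetesClient client, ListeningExecutorService exec)
  {
    this.client = client;
    this.exec = exec;
  }

  public ListenableFuture<TaskStatus> run(final Task task)
  {
    return exec.submit(() -> {
      // ForkingTaskRunner does ProcessBuilder.start() + Process.waitFor() here.
      // The K8s variant creates a pod (creation omitted; see the launcher
      // sketch earlier in this thread) and waits for a terminal phase.
      final String podName = "peon-" + task.getId().toLowerCase().replace('_', '-');
      final Pod done = client.pods().inNamespace("druid").withName(podName).waitUntilCondition(
          pod -> pod != null && pod.getStatus() != null
                 && ("Succeeded".equals(pod.getStatus().getPhase())
                     || "Failed".equals(pod.getStatus().getPhase())),
          4, TimeUnit.HOURS
      );
      // Map the pod's terminal phase back to a Druid TaskStatus.
      return "Succeeded".equals(done.getStatus().getPhase())
             ? TaskStatus.success(task.getId())
             : TaskStatus.failure(task.getId(), "Peon pod ended in phase " + done.getStatus().getPhase());
    });
  }
}
```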
I think it's possible, and a valid use case, to use ForkingTaskRunner/RemoteTaskRunner/K8sForkingTaskRunner directly on the overlord. It's just a configuration change on the overlord, unless the K8sForkingTaskRunner depends on something specific from the MiddleManager.
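For illustration, here is how that configuration hook could look: a hypothetical module registers the runner factory under `druid.indexer.runner.type`, the same PolyBind mechanism CliOverlord uses for the built-in `local` and `remote` options. `K8sForkingTaskRunnerFactory` and the option name are assumptions, not existing code.

```java
import java.util.Collections;
import java.util.List;

import com.fasterxml.jackson.databind.Module;
import com.google.inject.Binder;
import com.google.inject.Key;
import org.apache.druid.guice.PolyBind;
import org.apache.druid.indexing.overlord.TaskRunnerFactory;
import org.apache.druid.initialization.DruidModule;

public class K8sForkingTaskRunnerModule implements DruidModule
{
  @Override
  public List<? extends Module> getJacksonModules()
  {
    return Collections.emptyList();
  }

  @Override
  public void configure(Binder binder)
  {
    // Once registered, `druid.indexer.runner.type=k8sForking` would select this
    // runner, whether the property is set for the MiddleManager or the Overlord.
    PolyBind.optionBinder(binder, Key.get(TaskRunnerFactory.class))
            .addBinding("k8sForking")
            .to(K8sForkingTaskRunnerFactory.class); // hypothetical factory class
  }
}
```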
@zhangyue19921010 : Can you add some user docs on how to use the new runner and the configurations required to get started ?
Sure, will add docs ASAP.
@zhangyue19921010 are there any plans to merge this PR to master soon?
Same issue here. Looking forward to this PR getting merged.
This is awesome, but I see that the parallel index task now uses shuffle, which relies on the MiddleManager for more than just launching and managing tasks.
I don't know enough about the new feature, but I saw there is a deep storage implementation for shuffle. Will this work without the MM? If so, I am happy to take a stab at using this PR as a base and trying to remove the MM entirely in a K8s world.
@zhangyue19921010 are you still working on this PR? Happy to help with the reviewing if needed.
@abhishekagarwal87
I am happy to pick this up. I have reviewed the code and there are a few things missing:
- Let's make this work with K8s Jobs, not pods, and let K8s manage lifecycles rather than the MiddleManager (see the sketch after this comment). Also simplify the configuration; from a user perspective I am thinking of a single option to enable the feature: `druid.peon.k8s.tasks=true`.
- Make the task itself push the reports.json; right now it seems to be ignored.
- There is no checkpointing for K8s tasks. We need the tasks themselves to be able to push checkpoints and recover from them.
Just to note, making this patch behave the same way Druid does today requires core changes; it can no longer be isolated to just an extension. If that seems like a good plan, I am happy to finish up this work and try to get it into Druid. Let me know your thoughts.
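A minimal sketch of the "K8s Jobs, not pods" idea from the list above, again assuming the fabric8 client; the names, namespace, and image are illustrative. With a Job, Kubernetes owns retries and garbage collection instead of the MiddleManager tracking pod lifecycles.

```java
import io.fabric8.kubernetes.api.model.batch.v1.Job;
import io.fabric8.kubernetes.api.model.batch.v1.JobBuilder;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;

public class PeonJobLauncher
{
  public Job launch(String taskId)
  {
    final Job job = new JobBuilder()
        .withNewMetadata()
          .withName("peon-" + taskId.toLowerCase().replace('_', '-'))
        .endMetadata()
        .withNewSpec()
          .withBackoffLimit(0)               // don't blindly retry a failed task
          .withTtlSecondsAfterFinished(3600) // let K8s clean up finished peons
          .withNewTemplate()
            .withNewSpec()
              .addNewContainer()
                .withName("peon")
                .withImage("apache/druid:dev") // placeholder image
                .withArgs("internal", "peon", "/druid/task/" + taskId)
              .endContainer()
              .withRestartPolicy("Never")
            .endSpec()
          .endTemplate()
        .endSpec()
        .build();

    try (KubernetesClient client = new DefaultKubernetesClient()) {
      return client.batch().v1().jobs().inNamespace("druid").create(job);
    }
  }
}
```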
Thank you for taking this up @churromorales. It would be nice to keep this as an extension. You can certainly add extension points in the core if you need them, or some code can go into the core itself. In any case, that's not a blocker for starting the implementation.
@abhishekagarwal87 sorry it took a while: https://github.com/apache/druid/pull/13156
I have tested this on our Druid clusters for various task types: ingestion, parallel indexing, compaction, etc. I used the operator to deploy and removed the MiddleManager from the deployment spec altogether. The code is totally different from this patch and should be feature-complete with respect to how Druid works today with the current task runner implementations.
Let me know what you think.
@churromorales - This is really great. I will take a look at your PR very soon.
Closing this PR since the feature was merged in another PR.