Change default segment loading to http
Description
We have observed more stability with http segment loading than curator segment loading in production clusters. For example, we have observed that problems with zookeeper can lead to the inability to query realtime data.
This PR has:
- [X ] added documentation for new or modified features or behaviors.
- [X ] been tested in a test Druid cluster.
Is HTTP based loading ready for prime time? I am curious about any at-scale testing that has been done to verify HTTP based loading is performing as expected. Also, whether all major functional issues with it are fixed before we make it the default. I see at least one open bug right now.
@samarthjain +1 on fixing https://github.com/apache/druid/pull/11717. If I'm not mistaken, that issue was first observed when multiple load rules were changed across different tiers, so hopefully that makes the bug less likely to run into(?)
fwiw, I have seen http segment loading work without issue in many production environments (and actually have seen many problems related to curator loading)
This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the [email protected] list. Thank you for your contributions.
I am not sure if http is ready for prime time, the problem with http arises when jetty http server runs low on threads.
This issue is no longer marked as stale.
ZK segment loading is broken right now. As of ~2 years ago, a PR was merged that breaks the order of segment loading and dropping via ZK, such that the assignment can enter into deadlocks when a cluster is mostly full. This wasn't widely an issue (personally, I only learned about it ~6 months ago) because the largest clusters (at least that I'm aware of) have all been using http segment assignment.
https://github.com/apache/druid/pull/11717 has been merged. While it is and was a bug, it was a corner case that we've only seen in development environments and never actually saw it in a production environment. Every cluster I touch, I move from ZK assignment to HTTP assignment because my experience is that HTTP assignment is more stable. I'm +1 on this directionally, but the PR does need the tests fixed as Kashif suggested before it can be approved.
Also, anyone else have this problem with http loading where Coordinator somehow cached the old Historical's IP addresses?
We saw this often in our Kubernetes deployments.
@didip , please create an issue for the http loading discrepancies if you have been facing them recently.
I am not sure if http is ready for prime time, the problem with http arises when jetty http server runs low on threads.
@didip
Isn't this addressed with a combination of:
- using async IO on historical side to avoid holding threads while work is being done
- the explicit recommendation of setting jetty thread count above aggregate count of connections from query servers (broker) to avoid queries consuming all jetty threads
Or are you saying there is a risk of exhaustion on the outgoing side from the coordinator?
closing now that https://github.com/apache/druid/pull/13092 has been merged