Change default segment loading to http

Open Caroline1000 opened this issue 4 years ago • 9 comments

Description

We have observed more stability with http segment loading than with curator segment loading in production clusters. For example, we have seen problems with ZooKeeper make realtime data impossible to query.

This PR has:

  • [x] added documentation for new or modified features or behaviors.
  • [x] been tested in a test Druid cluster.

Caroline1000 avatar Sep 29 '21 23:09 Caroline1000

Is HTTP based loading ready for prime time? I am curious about any at-scale testing that has been done to verify HTTP based loading is performing as expected. Also, whether all major functional issues with it are fixed before we make it the default. I see at least one open bug right now.

samarthjain avatar Nov 03 '21 23:11 samarthjain

@samarthjain +1 on fixing https://github.com/apache/druid/pull/11717. If I'm not mistaken, that issue was first observed when multiple load rules were changed across different tiers, so hopefully that makes the bug less likely to run into(?)

fwiw, I have seen http segment loading work without issue in many production environments (and actually have seen many problems related to curator loading)

Caroline1000 avatar Jan 21 '22 23:01 Caroline1000

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the [email protected] list. Thank you for your contributions.

stale[bot] avatar Apr 16 '22 12:04 stale[bot]

I am not sure if http is ready for prime time. The problem with http arises when the Jetty http server runs low on threads.

didip avatar Jun 18 '22 14:06 didip

This issue is no longer marked as stale.

stale[bot] avatar Jun 18 '22 14:06 stale[bot]

ZK segment loading is broken right now. About 2 years ago, a PR was merged that broke the order of segment loading and dropping via ZK, such that the assignment can deadlock when a cluster is mostly full. This wasn't a widespread issue (personally, I only learned about it ~6 months ago) because the largest clusters (at least that I'm aware of) have all been using http segment assignment.

https://github.com/apache/druid/pull/11717 has been merged. While it is and was a bug, it was a corner case that we've only seen in development environments, never in a production environment. Every cluster I touch, I move from ZK assignment to HTTP assignment because my experience is that HTTP assignment is more stable. I'm +1 on this directionally, but the PR does need the tests fixed as Kashif suggested before it can be approved.

imply-cheddar avatar Jun 21 '22 22:06 imply-cheddar

Also, has anyone else had this problem with http loading, where the Coordinator somehow caches the old Historicals' IP addresses?

We saw this often in our Kubernetes deployments.

didip avatar Jun 23 '22 14:06 didip

@didip, please create an issue for the http loading discrepancies if you have been facing them recently.

kfaraz avatar Jun 24 '22 06:06 kfaraz

> I am not sure if http is ready for prime time. The problem with http arises when the Jetty http server runs low on threads.

@didip

Isn't this addressed with a combination of:

  • using async IO on the historical side to avoid holding threads while work is being done
  • the explicit recommendation to set the Jetty thread count above the aggregate number of connections from the query servers (brokers), so that queries cannot consume all Jetty threads

Or are you saying there is a risk of exhaustion on the outgoing side from the coordinator?
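
For what it's worth, the second bullet above can be sketched as a quick sizing check. This is only an illustration: the function name and the headroom of 10 threads are assumptions, not Druid defaults. The inputs stand in for each broker's `druid.broker.http.numConnections`, and the result is a lower bound for `druid.server.http.numThreads` on a historical.

```python
def min_historical_http_threads(broker_connections, headroom=10):
    """Return a lower bound for the Jetty thread count on one historical.

    broker_connections: per-broker connection limits, one entry per broker.
    headroom: extra threads kept free for non-query traffic such as
              segment load/drop requests (assumed value, tune per cluster).
    """
    if headroom < 0 or any(c < 0 for c in broker_connections):
        raise ValueError("connection counts and headroom must be non-negative")
    # Every broker can open up to its limit against the same historical,
    # so the worst case is the sum across all brokers.
    return sum(broker_connections) + headroom


# Hypothetical cluster: three brokers, each allowed 20 connections.
print(min_historical_http_threads([20, 20, 20]))  # 70
```

If the configured thread count is below this bound, queries alone could in principle occupy every Jetty thread, which is the situation the recommendation is meant to rule out.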

capistrant avatar Aug 08 '22 20:08 capistrant

closing now that https://github.com/apache/druid/pull/13092 has been merged

Caroline1000 avatar Dec 02 '22 18:12 Caroline1000