Make balanced shards allocator timebound

Open imRishN opened this issue 1 year ago • 18 comments

Description

This PR time-bounds the reroute so that it finishes within a specified timeout. This lets URGENT priority tasks run instead of waiting in the queue behind a long-running reroute.

For instance, the time taken by rebalance:

% cat elasticsearch.log | grep "to compute"
[2024-08-21T03:44:42,385][WARN ][o.o.c.s.MasterService    ] [a64d304b34ad798faae32932ab14d605] took [7.8m], which is over [10s], to compute cluster state update for [cluster_reroute(reroute after starting shards)]
[2024-08-21T03:52:45,982][WARN ][o.o.c.s.MasterService    ] [a64d304b34ad798faae32932ab14d605] took [7.7m], which is over [10s], to compute cluster state update for [cluster_reroute(reroute after starting shards)]

Hot threads on the master node:

curl localhost:9200/_nodes/<node id>/hot_threads?interval=5s
   
   99.9% (4.9s out of 5s) cpu usage by thread 'opensearch[<>][clusterManagerService#updateTask][T#1]'
     9/10 snapshots sharing following 20 elements
       app//org.opensearch.cluster.routing.allocation.allocator.LocalShardsBalancer.balanceByWeights(LocalShardsBalancer.java:440)
       app//org.opensearch.cluster.routing.allocation.allocator.LocalShardsBalancer.balance(LocalShardsBalancer.java:204)
       app//org.opensearch.cluster.routing.allocation.allocator.BalancedShardsAllocator.allocate(BalancedShardsAllocator.java:324)
       app//org.opensearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:576)
       app//org.opensearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:538)
       app//org.opensearch.node.Node$$Lambda$2639/0x0000001800a9ec70.apply(Unknown Source)
       app//org.opensearch.cluster.routing.BatchedRerouteService$1.execute(BatchedRerouteService.java:136)
       app//org.opensearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:67)
       app//org.opensearch.cluster.service.MasterService.executeTasks(MasterService.java:882)
       app//org.opensearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:434)
       app//org.opensearch.cluster.service.MasterService.runTasks(MasterService.java:301)
       app//org.opensearch.cluster.service.MasterService$Batcher.run(MasterService.java:212)
       app//org.opensearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:209)
       app//org.opensearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:247)
       app//org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:863)
       app//org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:283)
       app//org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:246)
       [email protected]/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
       [email protected]/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
       [email protected]/java.lang.Thread.run(Thread.java:840)
     unique snapshot
       app//org.opensearch.cluster.routing.allocation.allocator.LocalShardsBalancer.balanceByWeights(LocalShardsBalancer.java:470)
       app//org.opensearch.cluster.routing.allocation.allocator.LocalShardsBalancer.balance(LocalShardsBalancer.java:204)
       app//org.opensearch.cluster.routing.allocation.allocator.BalancedShardsAllocator.allocate(BalancedShardsAllocator.java:324)
       app//org.opensearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:576)
       app//org.opensearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:538)
       app//org.opensearch.node.Node$$Lambda$2639/0x0000001800a9ec70.apply(Unknown Source)
       app//org.opensearch.cluster.routing.BatchedRerouteService$1.execute(BatchedRerouteService.java:136)
       app//org.opensearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:67)
       app//org.opensearch.cluster.service.MasterService.executeTasks(MasterService.java:882)
       app//org.opensearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:434)
       app//org.opensearch.cluster.service.MasterService.runTasks(MasterService.java:301)
       app//org.opensearch.cluster.service.MasterService$Batcher.run(MasterService.java:212)
       app//org.opensearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:209)
       app//org.opensearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:247)
       app//org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:863)
       app//org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:283)
       app//org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:246)
       [email protected]/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
       [email protected]/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
       [email protected]/java.lang.Thread.run(Thread.java:840)
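
The time-bounding idea described above can be illustrated with a minimal sketch. This is not the code from this PR: `TimeBoundBalancer`, its constructor arguments, and the `Deque<Runnable>` work queue are hypothetical names chosen for illustration. It only shows the general pattern of checking an elapsed-time budget before each unit of balancing work and leaving the rest for a later round.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;

/** Hypothetical sketch of a time-bounded work loop; not the OpenSearch implementation. */
final class TimeBoundBalancer {
    private final long timeoutNanos;
    private final LongSupplier nanoClock; // injected so tests can use a fake clock

    TimeBoundBalancer(long timeoutNanos, LongSupplier nanoClock) {
        this.timeoutNanos = timeoutNanos;
        this.nanoClock = nanoClock;
    }

    /**
     * Runs queued work items until the queue is empty or the time budget is spent.
     * Items that were not reached stay in the queue for a later round.
     *
     * @return the number of items processed in this round
     */
    int run(Deque<Runnable> pending) {
        final long start = nanoClock.getAsLong();
        int processed = 0;
        while (!pending.isEmpty()) {
            // Check the budget before each item so one round cannot overrun
            // by more than the cost of a single item.
            if (nanoClock.getAsLong() - start >= timeoutNanos) {
                break;
            }
            pending.poll().run();
            processed++;
        }
        return processed;
    }
}
```

In production code the clock would typically be `System::nanoTime`. The budget is checked between items rather than by interrupting the thread, so the state is never left half-updated by a single move.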

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • [X] Functionality includes testing.
  • [ ] API changes companion pull request created, if applicable.
  • [x] Public documentation issue/PR created, if applicable. - https://github.com/opensearch-project/documentation-website/issues/8086

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.

imRishN avatar Aug 14 '24 05:08 imRishN

:x: Gradle check result for 0e9151cd99c04ed50b41952865ff1fce6dea484b: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Aug 14 '24 06:08 github-actions[bot]

:x: Gradle check result for 7fe10d7b27c4579cb9de3ba1edd311511aaad122: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Aug 15 '24 17:08 github-actions[bot]

:x: Gradle check result for 8f558d7a71386edf5fb2e1ae1adfe64093b4b30f: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Aug 16 '24 17:08 github-actions[bot]

:x: Gradle check result for d591b26d90254bc6fce082e33f4e4eaece68c559: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Aug 20 '24 08:08 github-actions[bot]

:x: Gradle check result for 9ab5a29001a3b0e9535945224abf3e14022cd8ee: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Aug 20 '24 08:08 github-actions[bot]

:x: Gradle check result for 31f02cf8304cd15f5ba171efd121155d3e69290f: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Aug 21 '24 06:08 github-actions[bot]

:x: Gradle check result for 5eaaa60773de2f75e246cfafaf7b2a78e80db18a: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Aug 21 '24 07:08 github-actions[bot]

:x: Gradle check result for e426ffb4d1ecda616450649311d279ad99633233: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Aug 21 '24 07:08 github-actions[bot]

:x: Gradle check result for c67486af8e1a94432dd531c1d466541d52373e81: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Aug 21 '24 09:08 github-actions[bot]

:x: Gradle check result for 4e5bcc80b1633e9734a8a85fee02bd691734ebbb: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Aug 21 '24 10:08 github-actions[bot]

:x: Gradle check result for c094e95a5ce3cbb776b149609298a6eb9143f96e: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Aug 21 '24 10:08 github-actions[bot]

:x: Gradle check result for 825d796a5988c82e57975cf6d148ad6797c72e7f: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Aug 21 '24 21:08 github-actions[bot]

:x: Gradle check result for c072f27b0ee429e85205361776077388640a2958: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Aug 22 '24 05:08 github-actions[bot]

:x: Gradle check result for 5d2cd14b6ac76d83ff9838b4833d6ab22311c6dd: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Aug 27 '24 08:08 github-actions[bot]

:white_check_mark: Gradle check result for 87a5cfc1fa729b257042857261bf8a8dd7742507: SUCCESS

github-actions[bot] avatar Aug 27 '24 08:08 github-actions[bot]

Codecov Report

Attention: Patch coverage is 79.48718% with 8 lines in your changes missing coverage. Please review.

Project coverage is 71.93%. Comparing base (46a269e) to head (75145b7). Report is 22 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| .../allocation/allocator/BalancedShardsAllocator.java | 76.47% | 3 Missing and 1 partial :warning: |
| ...ting/allocation/allocator/LocalShardsBalancer.java | 78.94% | 0 Missing and 4 partials :warning: |

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #15239      +/-   ##
============================================
+ Coverage     71.88%   71.93%   +0.05%     
- Complexity    63242    63263      +21     
============================================
  Files          5224     5224              
  Lines        296137   296174      +37     
  Branches      42777    42785       +8     
============================================
+ Hits         212881   213067     +186     
+ Misses        65784    65613     -171     
- Partials      17472    17494      +22     

:umbrella: View full report in Codecov by Sentry.

codecov[bot] avatar Aug 27 '24 08:08 codecov[bot]

:x: Gradle check result for fc9a8ff5e411b8135849362e61dd636de44ad9fa: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Aug 27 '24 09:08 github-actions[bot]

:x: Gradle check result for 9a101b9e4e382d3d010e4ae81f5cfaf8cd4f1ef3: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Aug 27 '24 11:08 github-actions[bot]

:x: Gradle check result for 72a10b410c174614a06b8a36acc5dbbfc9a7e707: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions[bot] avatar Aug 28 '24 15:08 github-actions[bot]

:white_check_mark: Gradle check result for 3ba16156366408c25e7e24d1607eda93140108e0: SUCCESS

github-actions[bot] avatar Aug 29 '24 10:08 github-actions[bot]

:grey_exclamation: Gradle check result for c18d44c6736d49ef1b11d9a6913ea34595ee5348: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

github-actions[bot] avatar Aug 29 '24 10:08 github-actions[bot]

Do we need an explicit reroute?

Usually, if any work was attempted in the current round of reroute (allocating an unassigned shard, moving a shard, or rebalancing a shard), a reroute will eventually be triggered. If no work was done (no unassigned shards allocated, no shards moved, no shards rebalanced), a subsequent reroute would most probably be wasteful as well. There could still be edge cases where a subsequent reroute might help. I opened an issue to track this: https://github.com/opensearch-project/OpenSearch/issues/14945
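
The reasoning above can be written down as a small sketch. `RerouteRound` and its fields are hypothetical and only restate the rule of thumb from this comment; they are not part of the allocator's API, and the edge cases tracked in the linked issue are deliberately not modelled.

```java
/** Hypothetical summary of one reroute round; names are illustrative only. */
final class RerouteRound {
    final int unassignedAllocated;
    final int shardsMoved;
    final int shardsRebalanced;

    RerouteRound(int unassignedAllocated, int shardsMoved, int shardsRebalanced) {
        this.unassignedAllocated = unassignedAllocated;
        this.shardsMoved = shardsMoved;
        this.shardsRebalanced = shardsRebalanced;
    }

    /** True if the round attempted any allocation, move, or rebalance. */
    boolean attemptedWork() {
        return unassignedAllocated + shardsMoved + shardsRebalanced > 0;
    }

    /**
     * Rule of thumb from the comment: when work was attempted, a later reroute
     * is expected to follow on its own, so no explicit one is needed; when no
     * work was done, another reroute would most probably be wasteful. Either
     * way this sketch never asks for an explicit reroute.
     */
    boolean needsExplicitReroute() {
        return false;
    }
}
```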

imRishN avatar Aug 29 '24 11:08 imRishN

:white_check_mark: Gradle check result for 75145b7a20c6748af275b13d3df8f540a9bf4ed8: SUCCESS

github-actions[bot] avatar Aug 29 '24 12:08 github-actions[bot]