storm icon indicating copy to clipboard operation
storm copied to clipboard

Ability to Force Worker Spreading Across Hosts

Open JessicaLHartog opened this issue 9 years ago • 4 comments

Sometimes the approach of "let Storm do its own thing" when it comes to scheduling worker slots can lead to interesting scenarios:

(1) A topology may wish to enable running agents in the JVM which listen on predefined ports (such as JDWP debugging or JMX remote). In such a scenario if the port is statically defined somewhere, then two workers on the same host will stall the topology. For example, if we pass -Dcom.sun.management.jmxremote.port=9111 as one of the java childopts, we will see an exception like the one below when trying to bring up the second worker:

Error: Exception thrown by the agent : java.rmi.server.ExportException: Port already in use: 9111; nested exception is: 
    java.net.BindException: Address already in use

(2) Two worker processes on the same machine for the same topology can lead to inconsistent behavior. In a very resource-hungry topology predictability about how workers are scheduled across various hosts may yield better results when it comes to verifying behavior. For example, suppose a topology writes to disk a lot. Having two workers on the same host will increase the I/O on the worker, and can cause an inability to write heartbeat messages, bringing the worker offline. When these workers are rescheduled, if they are scheduled onto two new hosts, then the topology will likely not cause a crashing of the worker.

(3) Tuning the configuration of a topology for various elements like the number of required workers as well as the number of executors per topology component can be made difficult if the first time a topology is submitted two workers are scheduled on the same host, and after some tweaking of configuration options the workers are spread across more hosts (and vice versa).

To make these behaviors (and others) more predictable it would be nice if there were an option like topology.mesos.scheduler that can define a TopologyScheduler. The purpose of the TopologyScheduler would be to define worker spreading across hosts. Some options that may be useful are:

  • Default - the current "let Storm do its own thing" behavior
  • OneWorkerPerHost - worker slots for the same topology may not use offers from the same host
  • MinimumNumberOfHosts - all worker slots are scheduled on the same host (or on as few hosts as possible), attempting to maximize the number of worker slots on each host

Additional options would also then be possible should a need for them arise.

This also relates to Issue 158: enhance scheduling to act more predictably

JessicaLHartog avatar Aug 05 '16 23:08 JessicaLHartog

@JessicaLHartog +1 for topology.mesos.scheduler - Thats the whole reason behind creating Scheduling Framework described in https://github.com/mesos/storm/pull/93.

I would like to propose IsolationScheduler along with your multifarious schedulers - such a scheduler would dedicate set of hosts for a particular topology while also ensuring no other worker gets scheduled on to those dedicated hosts. This would be helpful for debugging nefarious topologies.

dsKarthick avatar Aug 07 '16 05:08 dsKarthick

@dsKarthick : nice, "multifarious", I didn't know that word! ✋ ✋ (high five). And then following it up with "nefarious". "(multi/ne)farious". I wonder what other words end with "farious"... ahh: http://wordinfo.info/unit/3613/ip:1/il:F

But back on topic: yes, totally, we should have that (IsolationScheduler) too.

erikdw avatar Aug 07 '16 07:08 erikdw

@erikdw I have a problem. 2

1 1

the last picture says "Failed to detect a master: Failed to parse data of unknown label 'json.info'" .please tell me how to solve the problem.

bigdata-user avatar Aug 16 '16 07:08 bigdata-user

Again, please stop posting these messages in random pull requests and issues. Just file a new issue.

  • Erik

On Aug 16, 2016, at 12:38 AM, zhanghangchina [email protected] wrote:

@erikdw I have a problem.

the last picture says "Failed to detect a master: Failed to parse data of unknown label 'json.info'" .please tell me how to solve the problem.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

erikdw avatar Aug 16 '16 08:08 erikdw