
Running on multiple hosts

Open · kandu009 opened this issue Nov 25 '14 · 13 comments

I am able to run a simple example in all 3 modes (online, offline, hybrid) on a single host.

But I would like to extend this to multiple hosts, i.e., run online mode on host1, offline mode on host2, and then run hybrid mode on host3, which uses the results of host1 and host2.

Can someone please help me in this regard?

kandu009 avatar Nov 25 '14 05:11 kandu009

Sorry, I'm not sure I understand. If you're running each of those on a single node, that means that you've got Storm and Hadoop working in local mode, yeah? How are you persisting your data?

The next step is to get distributed Storm and Hadoop clusters set up. Once you've done that, you should be able to submit jobs to them, just like you do now.

Summingbird won't handle cluster management or configuration. It helps you generate jobs that you can submit to those clusters from the same Scala file. The tutorials that help you get those systems set up still apply.
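
Roughly, what I mean is a single platform-generic job definition. Here's a sketch along the lines of the word count example from the Summingbird README, not your exact code:

```scala
import com.twitter.summingbird._

// One job definition; the platform type parameter P is filled in later with
// Storm (realtime), Scalding (batch), or the in-memory test platform.
def toWords(sentence: String): List[String] =
  sentence.toLowerCase.split("\\s+").toList

def wordCount[P <: Platform[P]](
  source: Producer[P, String],
  store: P#Store[String, Long]) =
  source.flatMap { sentence =>
    toWords(sentence).map(_ -> 1L)
  }.sumByKey(store)
```

Each platform then gets a small runner in the same project that plans this job against a concrete source and store, and those runners are what you submit to the clusters with the usual storm jar / hadoop jar commands.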

Hope that helps!

sritchie avatar Nov 25 '14 05:11 sritchie

I am using the example provided here https://github.com/upio/summingbird-hybrid-example for running in online/offline/hybrid modes. I have a set of nodes with Storm and Hadoop set up on each of them, but I am somewhat unclear on how to proceed with running the above example on multiple hosts. I do agree that Summingbird doesn't handle cluster management, but I am not quite clear on how to submit the jobs to the Storm or Hadoop clusters running on different hosts. If you could help me with this, it would be of great help!

kandu009 avatar Nov 25 '14 06:11 kandu009

Build a fat jar with all your dependencies, then deploy it on the Hadoop and Storm clusters using the hadoop jar .. and storm jar .. commands respectively. Both the Storm and Hadoop jobs will write to some store (Redis, HBase, etc.). Use the Summingbird client to talk to these stores; it will take care of merging the realtime and batch key/value pairs.
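
Very roughly, the read side could look something like the sketch below. The stores here are placeholders, and the exact ClientStore factory and implicits can differ between Summingbird versions, so check the summingbird-client docs rather than taking this as working code:

```scala
import com.twitter.storehaus.ReadableStore
import com.twitter.summingbird.batch.{ BatchID, Batcher }
import com.twitter.summingbird.store.ClientStore

object WordCountClient {
  // Must match the batcher used by the batch (Scalding/Hadoop) job.
  implicit val batcher: Batcher = Batcher.ofHours(1)

  // Placeholders for illustration only: in a real setup these would be the
  // Redis/HBase/etc. stores that the Hadoop and Storm jobs write into.
  val offlineStore: ReadableStore[String, (BatchID, Long)] = ReadableStore.empty
  val onlineStore: ReadableStore[(String, BatchID), Long] = ReadableStore.empty

  // ClientStore merges the batch and realtime values for a key at read time;
  // the last argument is how many realtime batches to keep around.
  val merged = ClientStore(offlineStore, onlineStore, 3)
}
```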

does it make sense?

MansurAshraf avatar Nov 25 '14 06:11 MansurAshraf

@MansurAshraf Yeah, that sounds good. But I have a question: where do we provide the configuration giving the hostnames on which Storm and Scalding are running? I mean, how does Summingbird know that it has to contact Storm or Scalding running on a different host?

kandu009 avatar Nov 26 '14 15:11 kandu009

Summingbird doesn't actually know anything about that, really; the Storm/Hadoop launchers handle it. Summingbird is a pure library in that regard: when those launchers run it, it supplies the configs/code to run.

ianoc avatar Nov 26 '14 17:11 ianoc

So, your best bet is to check out some Storm and Hadoop tutorials on how to get a cluster running. Then, instead of building the native job the tutorial mentions, build the Summingbird jar.

sritchie avatar Nov 26 '14 17:11 sritchie

@ianoc Yeah, correct. I get that part. But my question was: where or how do we pass that config information from Summingbird to the Storm or Hadoop clusters running on some other host?

@sritchie I already have Storm and Hadoop clusters set up, but I am looking for a way to pass the configuration from Summingbird to one of these.

To be more precise, is there any configuration file or anything like that which Summingbird expects in order to know the details of the host on which the Nimbus or JobTracker runs?

kandu009 avatar Nov 26 '14 17:11 kandu009

Basically, the storm and hadoop launcher commands encode this information into the classpath; that's how it gets passed around.

When you do hadoop jar mySummingbirdJar.jar, the hadoop command has put a core-site.xml on the classpath, which encodes the JobTracker information into the JobConf for Summingbird/Scalding to use.

The storm command operates in the same way.

Both commands have means to point at other clusters instead; e.g. in Storm it's along the lines of storm jar -c nimbus.host=nimbus.mycompany.com

ianoc avatar Nov 26 '14 17:11 ianoc

@ianoc I was just trying to use SB with a Storm cluster in non-local mode, and I am facing an issue while following the instructions you mentioned above.

#1: After I submit the jar in Storm and run the SB client, it throws the error "java.lang.RuntimeException: Topology with name summingbird_SummingbirdExample already exists on cluster" => line 71 (https://searchcode.com/codesearch/view/4412310/)
#2: If I don't submit a jar using the storm jar command, then the SB client fails with the error "java.lang.RuntimeException: Must submit topologies using the 'storm' client script so that StormSubmitter knows which jar to upload.". This is fair enough, as it is not finding the 'storm.jar' property set. => line 130 (https://searchcode.com/codesearch/view/4412310/)
#3: If I try changing the name of the topology that I submit using storm jar, SB still throws the #2 error, as it is again unable to locate the corresponding jar to submit.

Am I missing something here? Can someone please help me with this?

kandu009 avatar Jan 16 '15 20:01 kandu009

It sounds like you're already running it; did you run a kill command on the topology between launches?

ianoc avatar Jan 16 '15 21:01 ianoc

@ianoc Yes, I did kill the topology. I think the problem here is that if the topology is not submitted using storm jar, SB says you should submit the jar. On the other hand, if I submit the jar using 'storm jar' and then run SB, it says the topology already exists.

kandu009 avatar Jan 16 '15 21:01 kandu009

The storm jar command both copies the jar and starts the topology. It operates similarly to the hadoop jar command.

ianoc avatar Jan 17 '15 08:01 ianoc

@ianoc I get that now. Thanks for letting me know.

However, when I try this, I notice something weird: after the topology is uploaded and Storm does the assignment to its workers, I see the following three things in the logs, but no assignment is actually made to the workers, i.e., the workers aren't starting.

  1. Launching worker with assignment ....
  2. Downloading code for storm id summingbird_SummingbirdExample-****
  3. Finished downloading code for storm id summingbird_SummingbirdExample-****

These three occur in a loop that never ends, and my workers never start. Any idea how to solve this? Any help is highly appreciated.

kandu009 avatar Jan 20 '15 18:01 kandu009