Issue launching cluster for AMPLab tutorial
I'm following the recent AMPLab tutorial using my own AWS account. The cluster launch finishes with the error "ERROR: Cluster health check failed for spark_ec2". I'd be grateful for pointers on how to solve it, or insight into what the error message means. Note that I added the "-w" and "-z" flags to the launch command to avoid timeout and instance-availability errors. I've pasted the stdout lines that look like warnings or errors below; please also take a look at the full stdout/stderr log here: https://gist.github.com/feffgroup/74a8c2789e582ada5150
bash-3.2$ ./spark-ec2 -i ~/aws/jorissen-account/jorissen/jorissen-us-east.pem -k jorissen-us-east --copy launch amplab-training -w 300 -z us-east-1c
Setting up security groups...
Searching for existing cluster amplab-training...
Latest Spark AMI: ami-19474270
Launching instances...
Launched 5 slaves in us-east-1c, regid = r-f775e51a
Launched master in us-east-1c, regid = r-2e74e4c3
Waiting for instances to start up...
Waiting 300 more seconds...
Copying SSH key /Users/jorissen/aws/jorissen-account/jorissen/jorissen-us-east.pem to master...
ssh: connect to host ec2-54-152-126-49.compute-1.amazonaws.com port 22: Connection refused
Error connecting to host Command 'ssh -t -o StrictHostKeyChecking=no -i /Users/jorissen/aws/jorissen-account/jorissen/jorissen-us-east.pem [email protected] 'mkdir -p ~/.ssh'' returned non-zero exit status 255, sleeping 30
[...]
Initializing ganglia
rmdir: failed to remove `/var/lib/ganglia/rrds': Not a directory
ln: creating symbolic link `/var/lib/ganglia/rrds': File exists
Connection to ec2-54-152-37-237.compute-1.amazonaws.com closed.
[...]
Setting up mesos
Pseudo-terminal will not be allocated because stdin is not a terminal.
[...]
Setting up training
[...]
Connection to ec2-54-152-86-73.compute-1.amazonaws.com closed.
Shutting down GANGLIA gmond: [FAILED]
Starting GANGLIA gmond: [ OK ]
Connection to ec2-54-152-104-184.compute-1.amazonaws.com closed.
ln: creating symbolic link `/var/lib/ganglia/conf/default.json': File exists
Shutting down GANGLIA gmetad: [FAILED]
Starting GANGLIA gmetad: [ OK ]
Stopping httpd: [FAILED]
Starting httpd: [ OK ]
Connection to ec2-54-152-126-49.compute-1.amazonaws.com closed.
Done!
Waiting for cluster to start...
Exception in opening the url http://ec2-54-152-126-49.compute-1.amazonaws.com:8080/json
ec2-54-152-105-0.compute-1.amazonaws.com: stopping org.apache.spark.deploy.worker.Worker
[...]
ERROR: Cluster health check failed for spark_ec2
bash-3.2$
Thanks very much.
(edited by OP to improve legibility)
Running into the same issue with both east and west region AMIs.
Copying SSH key /home/anukool/Downloads/sparkwest1.pem to master...
ssh: connect to host ec2-54-67-93-194.us-west-1.compute.amazonaws.com port 22: Connection refused
Error connecting to host Command 'ssh -t -o StrictHostKeyChecking=no -i /home/anukool/Downloads/sparkwest1.pem [email protected] 'mkdir -p ~/.ssh'' returned non-zero exit status 255, sleeping 30
ssh: connect to host ec2-54-67-93-194.us-west-1.compute.amazonaws.com port 22: Connection refused
Getting this exact same error as well...
I am using Git on a Windows 7 machine. The file permissions on the .pem file are -rw-r--r--.
I am getting the following error while connecting to the cluster:
Copying SSH key sparkstream.pem to master...
ssh: connect to host ec2-52-21-237-149.compute-1.amazonaws.com port 22: Bad file number
Error connecting to host Command 'ssh -t -o StrictHostKeyChecking=no -i sparkstream.pem [email protected] 'mkdir -p ~/.ssh'' returned non-zero exit status 255, sleeping 30
ssh: connect to host ec2-52-21-237-149.compute-1.amazonaws.com port 22: Bad file number
Error connecting to host Command 'ssh -t -o StrictHostKeyChecking=no -i sparkstream.pem [email protected] 'mkdir -p ~/.ssh'' returned non-zero exit status 255, sleeping 30
ssh: connect to host ec2-52-21-237-149.compute-1.amazonaws.com port 22: Bad file number
Error connecting to host Command 'ssh -t -o StrictHostKeyChecking=no -i sparkstream.pem [email protected] 'mkdir -p ~/.ssh'' returned non-zero exit status 255, sleeping 30
ssh: connect to host ec2-52-21-237-149.compute-1.amazonaws.com port 22: Bad file number
Traceback (most recent call last):
  File "./spark_ec2.py", line 925, in
I was getting this error as well. It turned out that I was able to manually SSH into the master using its IP address, which you can get from the AWS dashboard (EC2 -> Instances). Instead of "ssh -i yourkey.pem root@hostname", you would do "ssh -i yourkey.pem root@ipaddress". Once I did that, the host must have been automatically added to my known hosts list, and I was then able to do a --resume on the setup, as in the sketch below.
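For concreteness, here is a rough sketch of that workaround; the key file, key pair name, IP address, and cluster name are placeholders from this thread, so substitute the values from your own launch:

# First SSH to the master by IP (taken from the EC2 dashboard) rather
# than by hostname; accepting the prompt records the host in
# ~/.ssh/known_hosts. yourkey.pem and the IP are placeholders.
ssh -i yourkey.pem [email protected]
# Then re-run setup on the already-launched cluster instead of relaunching:
./spark-ec2 -i yourkey.pem -k yourkey --resume launch amplab-training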
You may need to increase the timeout in the function wait_for_spark_cluster: replace time.sleep(5) with time.sleep(120).
This error occurs when the Spark cluster has not started yet; you may need to give the cluster more time to come up.
This works for me.
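If it helps, a one-liner to apply that edit from the shell (this assumes spark_ec2.py is in your current directory; note it rewrites every occurrence of time.sleep(5) in the file, so compare against the .bak copy before relaunching):

# Bump the polling sleep from 5s to 120s, keeping a backup of the script.
sed -i.bak 's/time\.sleep(5)/time.sleep(120)/' spark_ec2.py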
Which Spark version are you using, @alphago-au?