Issue launching cluster for AMPLab tutorial
I'm following the recent AMPLab tutorial using my own AWS account. The cluster launch finishes with the error "ERROR: Cluster health check failed for spark_ec2". I'd be grateful for pointers on how to solve it, or insight into what the error message means. Note that I added the "-w" and "-z" flags to the launch command to avoid timeout and instance-availability errors. I've pasted the stdout lines that look like warnings or errors below; please also take a look at the full stdout/stderr log here: https://gist.github.com/feffgroup/74a8c2789e582ada5150
bash-3.2$ ./spark-ec2 -i ~/aws/jorissen-account/jorissen/jorissen-us-east.pem -k jorissen-us-east --copy launch amplab-training -w 300 -z us-east-1c
Setting up security groups...
Searching for existing cluster amplab-training...
Latest Spark AMI: ami-19474270
Launching instances...
Launched 5 slaves in us-east-1c, regid = r-f775e51a
Launched master in us-east-1c, regid = r-2e74e4c3
Waiting for instances to start up...
Waiting 300 more seconds...
Copying SSH key /Users/jorissen/aws/jorissen-account/jorissen/jorissen-us-east.pem to master...
ssh: connect to host ec2-54-152-126-49.compute-1.amazonaws.com port 22: Connection refused
Error connecting to host Command 'ssh -t -o StrictHostKeyChecking=no -i /Users/jorissen/aws/jorissen-account/jorissen/jorissen-us-east.pem [email protected] 'mkdir -p ~/.ssh'' returned non-zero exit status 255, sleeping 30
[...]
Initializing ganglia
rmdir: failed to remove `/var/lib/ganglia/rrds': Not a directory
ln: creating symbolic link `/var/lib/ganglia/rrds': File exists
Connection to ec2-54-152-37-237.compute-1.amazonaws.com closed.
[...]
Setting up mesos
Pseudo-terminal will not be allocated because stdin is not a terminal.
[...]
Setting up training
[...]
Connection to ec2-54-152-86-73.compute-1.amazonaws.com closed.
Shutting down GANGLIA gmond: [FAILED]
Starting GANGLIA gmond: [ OK ]
Connection to ec2-54-152-104-184.compute-1.amazonaws.com closed.
ln: creating symbolic link `/var/lib/ganglia/conf/default.json': File exists
Shutting down GANGLIA gmetad: [FAILED]
Starting GANGLIA gmetad: [ OK ]
Stopping httpd: [FAILED]
Starting httpd: [ OK ]
Connection to ec2-54-152-126-49.compute-1.amazonaws.com closed.
Done!
Waiting for cluster to start...
Exception in opening the url http://ec2-54-152-126-49.compute-1.amazonaws.com:8080/json
ec2-54-152-105-0.compute-1.amazonaws.com: stopping org.apache.spark.deploy.worker.Worker
[...]
ERROR: Cluster health check failed for spark_ec2
bash-3.2$
Thanks very much.
(edited by OP to improve legibility)
Running into the same issue with both east and west region AMIs.
Copying SSH key /home/anukool/Downloads/sparkwest1.pem to master...
ssh: connect to host ec2-54-67-93-194.us-west-1.compute.amazonaws.com port 22: Connection refused
Error connecting to host Command 'ssh -t -o StrictHostKeyChecking=no -i /home/anukool/Downloads/sparkwest1.pem [email protected] 'mkdir -p ~/.ssh'' returned non-zero exit status 255, sleeping 30
ssh: connect to host ec2-54-67-93-194.us-west-1.compute.amazonaws.com port 22: Connection refused
Getting this exact same error as well...
I am using Git on a Windows 7 machine. The file permissions on the .pem file are -rw-r--r--.
I am getting the following error while connecting to the cluster:
Copying SSH key sparkstream.pem to master...
ssh: connect to host ec2-52-21-237-149.compute-1.amazonaws.com port 22: Bad file number
Error connecting to host Command 'ssh -t -o StrictHostKeyChecking=no -i sparkstream.pem [email protected] 'mkdir -p ~/.ssh'' returned non-zero exit status 255, sleeping 30
ssh: connect to host ec2-52-21-237-149.compute-1.amazonaws.com port 22: Bad file number
Error connecting to host Command 'ssh -t -o StrictHostKeyChecking=no -i sparkstream.pem [email protected] 'mkdir -p ~/.ssh'' returned non-zero exit status 255, sleeping 30
ssh: connect to host ec2-52-21-237-149.compute-1.amazonaws.com port 22: Bad file number
Error connecting to host Command 'ssh -t -o StrictHostKeyChecking=no -i sparkstream.pem [email protected] 'mkdir -p ~/.ssh'' returned non-zero exit status 255, sleeping 30
ssh: connect to host ec2-52-21-237-149.compute-1.amazonaws.com port 22: Bad file number
Traceback (most recent call last):
  File "./spark_ec2.py", line 925, in
I was getting this error as well. It turned out that I was able to manually SSH into the master using its IP address, which you can get from the AWS dashboard (EC2 -> Instances). Instead of "ssh -i yourkey.pem root@hostname", you would do "ssh -i yourkey.pem root@ipaddress". Once I did that, the host must have been automatically added to my known hosts list, and I was then able to do a --resume on the setup, as in the sketch below.
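For concreteness, here is a rough sketch of that workaround; the key file, key pair name, IP address, and cluster name are placeholders from this thread, so substitute the values from your own launch:

# First SSH to the master by IP (taken from the EC2 dashboard) rather
# than by hostname; accepting the prompt records the host in
# ~/.ssh/known_hosts. yourkey.pem and the IP are placeholders.
ssh -i yourkey.pem [email protected]
# Then re-run setup on the already-launched cluster instead of relaunching:
./spark-ec2 -i yourkey.pem -k yourkey --resume launch amplab-training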
You may need to increase the timeout in the function wait_for_spark_cluster: replace time.sleep(5) with time.sleep(120).
This error occurs when the Spark cluster has not started yet; you may need to give the cluster more time to come up.
This works for me.
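If it helps, a one-liner to apply that edit from the shell (this assumes spark_ec2.py is in your current directory; note it rewrites every occurrence of time.sleep(5) in the file, so compare against the .bak copy before relaunching):

# Bump the polling sleep from 5s to 120s, keeping a backup of the script.
sed -i.bak 's/time\.sleep(5)/time.sleep(120)/' spark_ec2.py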
Which Spark version are you using, @alphago-au?