ClusterScaler kills workflow on common errors like timeouts/connection close
Related to #1699. About once a day I get an error like this:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/toil/provisioners/clusterScaler.py", line 296, in check
    scalerThread.join(timeout=0)
  File "/usr/local/lib/python2.7/dist-packages/bd2k/util/threading.py", line 51, in run
    self.tryRun()
  File "/usr/local/lib/python2.7/dist-packages/toil/provisioners/clusterScaler.py", line 383, in tryRun
    self.totalNodes = len(self.scaler.leader.provisioner.getProvisionedWorkers(self.preemptable))
  File "/usr/local/lib/python2.7/dist-packages/toil/provisioners/aws/awsProvisioner.py", line 295, in getProvisionedWorkers
    entireCluster = self._getNodesInCluster(ctx=self.ctx, clusterName=self.clusterName, both=True)
  File "/usr/local/lib/python2.7/dist-packages/toil/provisioners/aws/awsProvisioner.py", line 724, in _getNodesInCluster
    'instance-state-name': 'running'})
  File "/usr/local/lib/python2.7/dist-packages/boto/ec2/connection.py", line 622, in get_only_instances
    next_token=next_token)
  File "/usr/local/lib/python2.7/dist-packages/boto/ec2/connection.py", line 681, in get_all_reservations
    [('item', Reservation)], verb='POST')
  File "/usr/local/lib/python2.7/dist-packages/boto/connection.py", line 1171, in get_list
    body = response.read()
  File "/usr/local/lib/python2.7/dist-packages/boto/connection.py", line 410, in read
    self._cached_response = http_client.HTTPResponse.read(self)
  File "/usr/lib/python2.7/httplib.py", line 578, in read
    return self._read_chunked(amt)
  File "/usr/lib/python2.7/httplib.py", line 636, in _read_chunked
    value.append(self._safe_read(chunk_left))
  File "/usr/lib/python2.7/httplib.py", line 693, in _safe_read
    chunk = self.fp.read(min(amt, MAXAMOUNT))
  File "/usr/lib/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
  File "/usr/lib/python2.7/ssl.py", line 341, in recv
    return self.read(buflen)
  File "/usr/lib/python2.7/ssl.py", line 260, in read
    return self._sslobj.read(len)
SSLError: The read operation timed out
This kills the scaler thread and shuts down the workflow.
It also leaves all the cluster's instances running after the failure, something I hadn't noticed in the earlier issue.
We could add more retries, but I don't think a scaler failure ever really needs to bring a workflow down. The autoscaler is basically a best-effort feature: if it works, great, but if it stops working for a few minutes, it's not a big deal. IMO a better idea is to wrap the entire body of the scaler loop in a try/except block that logs exceptions but otherwise ignores them, so a failure in one pass through the loop doesn't take everything down (see the sketch below).
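For illustration, here is a minimal sketch of that idea. The names (ScalerThread, self.scaler.leader.provisioner, getProvisionedWorkers) are taken from the traceback above, but the loop body and the interval parameter are hypothetical, so treat this as an assumption about the shape of the real code rather than a patch against it:

```python
import logging
import time

log = logging.getLogger(__name__)


class ScalerThread(object):
    def __init__(self, scaler, preemptable, interval=60):
        self.scaler = scaler
        self.preemptable = preemptable
        self.interval = interval  # hypothetical: seconds between scaling passes
        self.stop = False
        self.totalNodes = 0

    def tryRun(self):
        while not self.stop:
            try:
                # One scaling pass. Any exception raised here (e.g. an
                # SSLError or connection reset from boto) is transient from
                # the workflow's point of view.
                provisioner = self.scaler.leader.provisioner
                self.totalNodes = len(
                    provisioner.getProvisionedWorkers(self.preemptable))
                # ... rest of the scaling logic ...
            except Exception:
                # Log the failure and carry on; the next pass will retry.
                log.exception("Scaling pass failed; ignoring and retrying "
                              "on the next iteration.")
            time.sleep(self.interval)
```

With this structure a transient SSLError only costs one scaling pass: the thread stays alive, the workflow keeps running, and the cluster still gets torn down normally at the end.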
Issue is synchronized with Jira Story TOIL-185.