ClusterScaler kills workflow on common errors like timeouts/connection close
Related to #1699. About once a day I get an error like this:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/toil/provisioners/clusterScaler.py", line 296, in check
    scalerThread.join(timeout=0)
  File "/usr/local/lib/python2.7/dist-packages/bd2k/util/threading.py", line 51, in run
    self.tryRun()
  File "/usr/local/lib/python2.7/dist-packages/toil/provisioners/clusterScaler.py", line 383, in tryRun
    self.totalNodes = len(self.scaler.leader.provisioner.getProvisionedWorkers(self.preemptable))
  File "/usr/local/lib/python2.7/dist-packages/toil/provisioners/aws/awsProvisioner.py", line 295, in getProvisionedWorkers
    entireCluster = self._getNodesInCluster(ctx=self.ctx, clusterName=self.clusterName, both=True)
  File "/usr/local/lib/python2.7/dist-packages/toil/provisioners/aws/awsProvisioner.py", line 724, in _getNodesInCluster
    'instance-state-name': 'running'})
  File "/usr/local/lib/python2.7/dist-packages/boto/ec2/connection.py", line 622, in get_only_instances
    next_token=next_token)
  File "/usr/local/lib/python2.7/dist-packages/boto/ec2/connection.py", line 681, in get_all_reservations
    [('item', Reservation)], verb='POST')
  File "/usr/local/lib/python2.7/dist-packages/boto/connection.py", line 1171, in get_list
    body = response.read()
  File "/usr/local/lib/python2.7/dist-packages/boto/connection.py", line 410, in read
    self._cached_response = http_client.HTTPResponse.read(self)
  File "/usr/lib/python2.7/httplib.py", line 578, in read
    return self._read_chunked(amt)
  File "/usr/lib/python2.7/httplib.py", line 636, in _read_chunked
    value.append(self._safe_read(chunk_left))
  File "/usr/lib/python2.7/httplib.py", line 693, in _safe_read
    chunk = self.fp.read(min(amt, MAXAMOUNT))
  File "/usr/lib/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
  File "/usr/lib/python2.7/ssl.py", line 341, in recv
    return self.read(buflen)
  File "/usr/lib/python2.7/ssl.py", line 260, in read
    return self._sslobj.read(len)
SSLError: The read operation timed out
This kills the scaler thread and shuts down the workflow.
It also leaves all the cluster's instances running after the failure, something I hadn't noticed in the earlier issue.
We could add more retries, but I don't think a scaler failure ever really needs to bring a workflow down. The autoscaler is basically a best-effort feature: if it works, great, but if it stops working for a few minutes, it's not a big deal. IMO a better idea is to wrap the entire body of the scaler loop in a try/except block that logs exceptions but otherwise ignores them, so a failure in one pass through the loop doesn't take everything down (see the sketch below).
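For illustration, here is a minimal sketch of that idea. The names (ScalerThread, self.scaler.leader.provisioner, getProvisionedWorkers) are taken from the traceback above, but the loop body and the interval parameter are hypothetical, so treat this as an assumption about the shape of the real code rather than a patch against it:

```python
import logging
import time

log = logging.getLogger(__name__)


class ScalerThread(object):
    def __init__(self, scaler, preemptable, interval=60):
        self.scaler = scaler
        self.preemptable = preemptable
        self.interval = interval  # hypothetical: seconds between scaling passes
        self.stop = False
        self.totalNodes = 0

    def tryRun(self):
        while not self.stop:
            try:
                # One scaling pass. Any exception raised here (e.g. an
                # SSLError or connection reset from boto) is transient from
                # the workflow's point of view.
                provisioner = self.scaler.leader.provisioner
                self.totalNodes = len(
                    provisioner.getProvisionedWorkers(self.preemptable))
                # ... rest of the scaling logic ...
            except Exception:
                # Log the failure and carry on; the next pass will retry.
                log.exception("Scaling pass failed; ignoring and retrying "
                              "on the next iteration.")
            time.sleep(self.interval)
```

With this structure a transient SSLError only costs one scaling pass: the thread stays alive, the workflow keeps running, and the cluster still gets torn down normally at the end.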
Issue is synchronized with Jira Story TOIL-185.