[ISSUE] w.clusters.restart_and_wait(cluster_id) raises an exception when cluster is in Terminated state
Description
w.clusters.restart_and_wait(cluster_id) raises an exception when the cluster is already in the Terminated state (see the stack trace below).
Reproduction
- terminate a cluster via the UI or SDK
- call w.clusters.restart_and_wait(cluster_id)
At this point, .restart_and_wait calls .restart, which in turn calls the /api/2.0/clusters/restart API, and the exception is thrown by the API itself. See the stack trace:
File "[cut]/.venv/lib/python3.10/site-packages/databricks/sdk/service/compute.py", line 4603, in restart_and_wait
return self.restart(cluster_id=cluster_id, restart_user=restart_user).result(timeout=timeout)
File "[cut]/.venv/lib/python3.10/site-packages/databricks/sdk/service/compute.py", line 4595, in restart
self._api.do('POST', '/api/2.0/clusters/restart', body=body, headers=headers)
File "[cut]/.venv/lib/python3.10/site-packages/databricks/sdk/core.py", line 1110, in do
return retryable(self._perform)(method,
File "[cut]/.venv/lib/python3.10/site-packages/databricks/sdk/retries.py", line 47, in wrapper
raise err
File "[cut]/.venv/lib/python3.10/site-packages/databricks/sdk/retries.py", line 29, in wrapper
return func(*args, **kwargs)
File "[cut]/.venv/lib/python3.10/site-packages/databricks/sdk/core.py", line 1202, in _perform
Expected behavior
If the cluster is terminated (or in any other non-running state), the SDK (or the API?) should simply start it; only when the cluster is already running should it be restarted. So far I have implemented this workaround logic, which seems to do the job:
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute

def ensure_cluster_has_restarted(w: WorkspaceClient, cluster_id: str):
    state = compute.State
    info = w.clusters.get(cluster_id)
    if info.state == state.RUNNING:
        # only restart when the cluster is actually running
        return w.clusters.restart_and_wait(cluster_id)
    # otherwise just make sure the cluster ends up running
    return w.clusters.ensure_cluster_is_running(cluster_id)
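For example, the helper can be used as a drop-in replacement wherever restart_and_wait was called before (the cluster id below is a placeholder):

w = WorkspaceClient()
ensure_cluster_has_restarted(w, "0123-456789-abcdefgh")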
Is it a regression? No, it happens on all SDK versions.
Thanks for raising this. We'll discuss with the backend team whether we can change the Restart API to start a terminated cluster.