[ISSUE] w.clusters.restart_and_wait(cluster_id) raises an exception when cluster is in Terminated state
Description
w.clusters.restart_and_wait(cluster_id) raises an exception when the cluster is already in the Terminated state (see the stack trace below).
Reproduction
- terminate a cluster via the UI or SDK
- call w.clusters.restart_and_wait(cluster_id)
At this point, .restart_and_wait calls .restart, which in turn calls the /api/2.0/clusters/restart API, and the exception is thrown by the API itself. See the stack trace:
File "[cut]/.venv/lib/python3.10/site-packages/databricks/sdk/service/compute.py", line 4603, in restart_and_wait
return self.restart(cluster_id=cluster_id, restart_user=restart_user).result(timeout=timeout)
File "[cut]/.venv/lib/python3.10/site-packages/databricks/sdk/service/compute.py", line 4595, in restart
self._api.do('POST', '/api/2.0/clusters/restart', body=body, headers=headers)
File "[cut]/.venv/lib/python3.10/site-packages/databricks/sdk/core.py", line 1110, in do
return retryable(self._perform)(method,
File "[cut]/.venv/lib/python3.10/site-packages/databricks/sdk/retries.py", line 47, in wrapper
raise err
File "[cut]/.venv/lib/python3.10/site-packages/databricks/sdk/retries.py", line 29, in wrapper
return func(*args, **kwargs)
File "[cut]/.venv/lib/python3.10/site-packages/databricks/sdk/core.py", line 1202, in _perform
Expected behavior
If the cluster is terminated (or in any other non-running state), the SDK (or the API?) should simply start it; only when the cluster is already running should it be restarted. So far I have implemented this workaround logic, which seems to do the job:
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute

def ensure_cluster_has_restarted(w: WorkspaceClient, cluster_id: str):
    state = compute.State
    info = w.clusters.get(cluster_id)
    if info.state == state.RUNNING:
        # only restart when the cluster is actually running
        return w.clusters.restart_and_wait(cluster_id)
    # otherwise just make sure the cluster ends up running
    return w.clusters.ensure_cluster_is_running(cluster_id)
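For example, the helper can be used as a drop-in replacement wherever restart_and_wait was called before (the cluster id below is a placeholder):

w = WorkspaceClient()
ensure_cluster_has_restarted(w, "0123-456789-abcdefgh")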
Is it a regression? No, it happens on all SDK versions.
Thanks for raising this. We'll discuss with the backend team whether we can change the Restart API to start a terminated cluster.