[BUG]: UCX Assessment tasks are failing
Is there an existing issue for this?
- [X] I have searched the existing issues
Current Behavior
After installing UCX in our Azure Databricks workspace, the assessment job has been failing consistently. At first only the crawl_tables task failed, due to a Spark driver error; in subsequent runs more tasks started to fail. The task failures occur around 1-2 hours after the job starts.
Here are some of the error messages from the failed tasks:
crawl_tables: "The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached. at com.databricks.spark.chauffeur.Chauffeur.onDriverStateChange(Chauffeur.scala:1478)"
crawl_groups: "com.databricks.backend.common.rpc.DriverStoppedException: Driver down cause: driver state change (exit code: 137)"
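For reference, exit code 137 usually means the driver process was killed with SIGKILL (137 = 128 + 9), which on Linux is most often the kernel's OOM killer. A quick sanity check of that decoding:

```python
import signal

exit_code = 137
# POSIX shells report a death-by-signal as 128 + the signal number.
sig = signal.Signals(exit_code - 128)
print(sig.name)  # SIGKILL — on Linux, typically delivered by the OOM killer
```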
Expected Behavior
In another workspace, the assessment job ran successfully. We applied the same configuration to both workspaces when installing UCX.
Steps To Reproduce
No response
Cloud
Azure
Operating System
Linux
Version
latest via Databricks CLI
Relevant log output
The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached.
at com.databricks.spark.chauffeur.Chauffeur.onDriverStateChange(Chauffeur.scala:1478)
at com.databricks.spark.chauffeur.Chauffeur.$anonfun$driverStateOpt$1(Chauffeur.scala:187)
at com.databricks.spark.chauffeur.Chauffeur.$anonfun$driverStateOpt$1$adapted(Chauffeur.scala:187)
at com.databricks.spark.chauffeur.DriverDaemonMonitorImpl.$anonfun$goToStopped$4(DriverDaemonMonitorImpl.scala:251)
at com.databricks.spark.chauffeur.DriverDaemonMonitorImpl.$anonfun$goToStopped$4$adapted(DriverDaemonMonitorImpl.scala:251)
at scala.collection.immutable.List.foreach(List.scala:431)
at com.databricks.spark.chauffeur.DriverDaemonMonitorImpl.goToStopped(DriverDaemonMonitorImpl.scala:251)
at com.databricks.spark.chauffeur.DriverDaemonMonitorImpl.monitorDriver(DriverDaemonMonitorImpl.scala:406)
at com.databricks.spark.chauffeur.DriverDaemonMonitorImpl.$anonfun$job$1(DriverDaemonMonitorImpl.scala:100)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.logging.UsageLogging.$anonfun$recordOperation$1(UsageLogging.scala:532)
at com.databricks.logging.UsageLogging.executeThunkAndCaptureResultTags$1(UsageLogging.scala:636)
at com.databricks.logging.UsageLogging.$anonfun$recordOperationWithResultTags$4(UsageLogging.scala:654)
at com.databricks.logging.AttributionContextTracing.$anonfun$withAttributionContext$1(AttributionContextTracing.scala:48)
at com.databricks.logging.AttributionContext$.$anonfun$withValue$1(AttributionContext.scala:253)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.logging.AttributionContext$.withValue(AttributionContext.scala:249)
at com.databricks.logging.AttributionContextTracing.withAttributionContext(AttributionContextTracing.scala:46)
at com.databricks.logging.AttributionContextTracing.withAttributionContext$(AttributionContextTracing.scala:43)
at com.databricks.threading.SingletonJob$SingletonJobImpl.withAttributionContext(SingletonJob.scala:432)
at com.databricks.logging.AttributionContextTracing.withAttributionTags(AttributionContextTracing.scala:95)
at com.databricks.logging.AttributionContextTracing.withAttributionTags$(AttributionContextTracing.scala:76)
at com.databricks.threading.SingletonJob$SingletonJobImpl.withAttributionTags(SingletonJob.scala:432)
at com.databricks.logging.UsageLogging.recordOperationWithResultTags(UsageLogging.scala:631)
at com.databricks.logging.UsageLogging.recordOperationWithResultTags$(UsageLogging.scala:541)
at com.databricks.threading.SingletonJob$SingletonJobImpl.recordOperationWithResultTags(SingletonJob.scala:432)
at com.databricks.logging.UsageLogging.recordOperation(UsageLogging.scala:533)
at com.databricks.logging.UsageLogging.recordOperation$(UsageLogging.scala:501)
at com.databricks.threading.SingletonJob$SingletonJobImpl.recordOperation(SingletonJob.scala:432)
at com.databricks.threading.SingletonJob$SingletonJobImpl$SingletonRun.$anonfun$run$4(SingletonJob.scala:491)
at com.databricks.logging.AttributionContextTracing.$anonfun$withAttributionContext$1(AttributionContextTracing.scala:48)
at com.databricks.logging.AttributionContext$.$anonfun$withValue$1(AttributionContext.scala:253)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.logging.AttributionContext$.withValue(AttributionContext.scala:249)
at com.databricks.logging.AttributionContextTracing.withAttributionContext(AttributionContextTracing.scala:46)
at com.databricks.logging.AttributionContextTracing.withAttributionContext$(AttributionContextTracing.scala:43)
at com.databricks.threading.SingletonJob$SingletonJobImpl.withAttributionContext(SingletonJob.scala:432)
at com.databricks.threading.SingletonJob$SingletonJobImpl$SingletonRun.$anonfun$run$3(SingletonJob.scala:491)
at scala.util.Try$.apply(Try.scala:213)
at com.databricks.threading.SingletonJob$SingletonJobImpl$SingletonRun.$anonfun$run$1(SingletonJob.scala:490)
at com.databricks.util.UntrustedUtils$.tryLog(UntrustedUtils.scala:109)
at com.databricks.threading.SingletonJob$SingletonJobImpl$SingletonRun.run(SingletonJob.scala:484)
at com.databricks.threading.InstrumentedExecutorService$$anon$1.$anonfun$run$3(InstrumentedExecutorService.scala:144)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.logging.AttributionContextTracing.$anonfun$withAttributionContext$1(AttributionContextTracing.scala:48)
at com.databricks.logging.AttributionContext$.$anonfun$withValue$1(AttributionContext.scala:253)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.logging.AttributionContext$.withValue(AttributionContext.scala:249)
at com.databricks.logging.AttributionContextTracing.withAttributionContext(AttributionContextTracing.scala:46)
at com.databricks.logging.AttributionContextTracing.withAttributionContext$(AttributionContextTracing.scala:43)
at com.databricks.threading.InstrumentedExecutorService$$anon$1.withAttributionContext(InstrumentedExecutorService.scala:137)
at com.databricks.threading.InstrumentedExecutorService$$anon$1.$anonfun$run$2(InstrumentedExecutorService.scala:142)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.instrumentation.QueuedThreadPoolInstrumenter.trackActiveThreads(QueuedThreadPoolInstrumenter.scala:110)
at com.databricks.instrumentation.QueuedThreadPoolInstrumenter.trackActiveThreads$(QueuedThreadPoolInstrumenter.scala:107)
at com.databricks.threading.InstrumentedExecutorService.trackActiveThreads(InstrumentedExecutorService.scala:40)
at com.databricks.threading.InstrumentedExecutorService$$anon$1.$anonfun$run$1(InstrumentedExecutorService.scala:141)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.context.integrity.IntegrityCheckContext$ThreadLocalStorage$.withValue(IntegrityCheckContext.scala:73)
at com.databricks.threading.InstrumentedExecutorService$$anon$1.run(InstrumentedExecutorService.scala:140)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
@tunayokumus: This is odd. Are there differences in network configuration between the workspaces? If so, which ones? Also, within the same workspace, could you compare the UCX job cluster configuration with that of a (job) cluster that does not fail after two hours?
Finally, does the error still persist today?
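One way to make that comparison concrete is to export both cluster specs (e.g. with the Databricks CLI's `databricks clusters get`) and diff the resulting JSON. A minimal sketch; the two dicts below are illustrative excerpts standing in for the exported specs, not real output:

```python
import json

def diff_specs(a: dict, b: dict) -> dict:
    """Return the keys whose values differ between two cluster specs."""
    keys = set(a) | set(b)
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}

# Illustrative excerpts of two exported cluster specs (made-up values).
ucx_cluster = {"node_type_id": "Standard_D4s_v3", "driver_node_type_id": "Standard_D4s_v3",
               "num_workers": 2, "spark_version": "14.3.x-scala2.12"}
healthy_cluster = {"node_type_id": "Standard_D4s_v3", "driver_node_type_id": "Standard_D8s_v3",
                   "num_workers": 2, "spark_version": "14.3.x-scala2.12"}

# Only the fields that differ are printed — here, the driver node type.
print(json.dumps(diff_specs(ucx_cluster, healthy_cluster), indent=2))
```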
@tunayokumus A few things to check on the failing job:
- Check driver utilization in the Metrics tab while the assessment job is running. It likely failed due to OOM; consider increasing the cluster size (especially the driver) and rerunning.
- If it fails again, you could limit the crawl to a few databases (via the installation question "comma separated list of databases to migrate"). This reduces the scan area, so the job should finish sooner.
- Let us know if the issue still persists.
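To narrow the scan as suggested above, the installer prompt expects a plain comma-separated string; a trivial sketch using made-up database names:

```python
# Hypothetical: answer the installer's "comma separated list of databases to
# migrate" prompt with a small subset to shrink the assessment's scan area.
all_databases = ["sales", "finance", "hr", "staging", "scratch"]  # illustrative names

subset = all_databases[:2]     # start with a couple of databases
answer = ",".join(subset)      # the format the prompt expects
print(answer)                  # sales,finance
```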