[BUG]: UCX Assessment tasks are failing
Is there an existing issue for this?
- [X] I have searched the existing issues
Current Behavior
After installing UCX in our Azure Databricks workspace, the assessment job has been failing consistently. At first only the crawl_tables task failed, due to a Spark driver error; in subsequent runs more tasks started to fail. The task failures occur around 1-2 hours after the job starts.
Here are some of the error messages from the failed tasks:
crawl_tables: "The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached. at com.databricks.spark.chauffeur.Chauffeur.onDriverStateChange(Chauffeur.scala:1478)"
crawl_groups: "com.databricks.backend.common.rpc.DriverStoppedException: Driver down cause: driver state change (exit code: 137)"
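For reference, exit code 137 usually means the driver process was killed with SIGKILL (137 = 128 + 9), which on Linux is most often the kernel's OOM killer. A quick sanity check of that decoding:

```python
import signal

exit_code = 137
# POSIX shells report a death-by-signal as 128 + the signal number.
sig = signal.Signals(exit_code - 128)
print(sig.name)  # SIGKILL — on Linux, typically delivered by the OOM killer
```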
Expected Behavior
In another workspace, the assessment job ran successfully. We applied the same configuration to both workspaces when installing UCX.
Steps To Reproduce
No response
Cloud
Azure
Operating System
Linux
Version
latest via Databricks CLI
Relevant log output
The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached.
at com.databricks.spark.chauffeur.Chauffeur.onDriverStateChange(Chauffeur.scala:1478)
at com.databricks.spark.chauffeur.Chauffeur.$anonfun$driverStateOpt$1(Chauffeur.scala:187)
at com.databricks.spark.chauffeur.Chauffeur.$anonfun$driverStateOpt$1$adapted(Chauffeur.scala:187)
at com.databricks.spark.chauffeur.DriverDaemonMonitorImpl.$anonfun$goToStopped$4(DriverDaemonMonitorImpl.scala:251)
at com.databricks.spark.chauffeur.DriverDaemonMonitorImpl.$anonfun$goToStopped$4$adapted(DriverDaemonMonitorImpl.scala:251)
at scala.collection.immutable.List.foreach(List.scala:431)
at com.databricks.spark.chauffeur.DriverDaemonMonitorImpl.goToStopped(DriverDaemonMonitorImpl.scala:251)
at com.databricks.spark.chauffeur.DriverDaemonMonitorImpl.monitorDriver(DriverDaemonMonitorImpl.scala:406)
at com.databricks.spark.chauffeur.DriverDaemonMonitorImpl.$anonfun$job$1(DriverDaemonMonitorImpl.scala:100)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.logging.UsageLogging.$anonfun$recordOperation$1(UsageLogging.scala:532)
at com.databricks.logging.UsageLogging.executeThunkAndCaptureResultTags$1(UsageLogging.scala:636)
at com.databricks.logging.UsageLogging.$anonfun$recordOperationWithResultTags$4(UsageLogging.scala:654)
at com.databricks.logging.AttributionContextTracing.$anonfun$withAttributionContext$1(AttributionContextTracing.scala:48)
at com.databricks.logging.AttributionContext$.$anonfun$withValue$1(AttributionContext.scala:253)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.logging.AttributionContext$.withValue(AttributionContext.scala:249)
at com.databricks.logging.AttributionContextTracing.withAttributionContext(AttributionContextTracing.scala:46)
at com.databricks.logging.AttributionContextTracing.withAttributionContext$(AttributionContextTracing.scala:43)
at com.databricks.threading.SingletonJob$SingletonJobImpl.withAttributionContext(SingletonJob.scala:432)
at com.databricks.logging.AttributionContextTracing.withAttributionTags(AttributionContextTracing.scala:95)
at com.databricks.logging.AttributionContextTracing.withAttributionTags$(AttributionContextTracing.scala:76)
at com.databricks.threading.SingletonJob$SingletonJobImpl.withAttributionTags(SingletonJob.scala:432)
at com.databricks.logging.UsageLogging.recordOperationWithResultTags(UsageLogging.scala:631)
at com.databricks.logging.UsageLogging.recordOperationWithResultTags$(UsageLogging.scala:541)
at com.databricks.threading.SingletonJob$SingletonJobImpl.recordOperationWithResultTags(SingletonJob.scala:432)
at com.databricks.logging.UsageLogging.recordOperation(UsageLogging.scala:533)
at com.databricks.logging.UsageLogging.recordOperation$(UsageLogging.scala:501)
at com.databricks.threading.SingletonJob$SingletonJobImpl.recordOperation(SingletonJob.scala:432)
at com.databricks.threading.SingletonJob$SingletonJobImpl$SingletonRun.$anonfun$run$4(SingletonJob.scala:491)
at com.databricks.logging.AttributionContextTracing.$anonfun$withAttributionContext$1(AttributionContextTracing.scala:48)
at com.databricks.logging.AttributionContext$.$anonfun$withValue$1(AttributionContext.scala:253)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.logging.AttributionContext$.withValue(AttributionContext.scala:249)
at com.databricks.logging.AttributionContextTracing.withAttributionContext(AttributionContextTracing.scala:46)
at com.databricks.logging.AttributionContextTracing.withAttributionContext$(AttributionContextTracing.scala:43)
at com.databricks.threading.SingletonJob$SingletonJobImpl.withAttributionContext(SingletonJob.scala:432)
at com.databricks.threading.SingletonJob$SingletonJobImpl$SingletonRun.$anonfun$run$3(SingletonJob.scala:491)
at scala.util.Try$.apply(Try.scala:213)
at com.databricks.threading.SingletonJob$SingletonJobImpl$SingletonRun.$anonfun$run$1(SingletonJob.scala:490)
at com.databricks.util.UntrustedUtils$.tryLog(UntrustedUtils.scala:109)
at com.databricks.threading.SingletonJob$SingletonJobImpl$SingletonRun.run(SingletonJob.scala:484)
at com.databricks.threading.InstrumentedExecutorService$$anon$1.$anonfun$run$3(InstrumentedExecutorService.scala:144)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.logging.AttributionContextTracing.$anonfun$withAttributionContext$1(AttributionContextTracing.scala:48)
at com.databricks.logging.AttributionContext$.$anonfun$withValue$1(AttributionContext.scala:253)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.logging.AttributionContext$.withValue(AttributionContext.scala:249)
at com.databricks.logging.AttributionContextTracing.withAttributionContext(AttributionContextTracing.scala:46)
at com.databricks.logging.AttributionContextTracing.withAttributionContext$(AttributionContextTracing.scala:43)
at com.databricks.threading.InstrumentedExecutorService$$anon$1.withAttributionContext(InstrumentedExecutorService.scala:137)
at com.databricks.threading.InstrumentedExecutorService$$anon$1.$anonfun$run$2(InstrumentedExecutorService.scala:142)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.instrumentation.QueuedThreadPoolInstrumenter.trackActiveThreads(QueuedThreadPoolInstrumenter.scala:110)
at com.databricks.instrumentation.QueuedThreadPoolInstrumenter.trackActiveThreads$(QueuedThreadPoolInstrumenter.scala:107)
at com.databricks.threading.InstrumentedExecutorService.trackActiveThreads(InstrumentedExecutorService.scala:40)
at com.databricks.threading.InstrumentedExecutorService$$anon$1.$anonfun$run$1(InstrumentedExecutorService.scala:141)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.context.integrity.IntegrityCheckContext$ThreadLocalStorage$.withValue(IntegrityCheckContext.scala:73)
at com.databricks.threading.InstrumentedExecutorService$$anon$1.run(InstrumentedExecutorService.scala:140)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
@tunayokumus: This is odd. Are there differences in network configuration between the workspaces? If so, which ones? Also, within the same workspace, could you compare the UCX job cluster configuration with that of a (job) cluster that does not fail after two hours?
Finally, does the error still persist today?
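One way to make that comparison concrete is to export both cluster specs (e.g. with the Databricks CLI's `databricks clusters get`) and diff the resulting JSON. A minimal sketch; the two dicts below are illustrative excerpts standing in for the exported specs, not real output:

```python
import json

def diff_specs(a: dict, b: dict) -> dict:
    """Return the keys whose values differ between two cluster specs."""
    keys = set(a) | set(b)
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}

# Illustrative excerpts of two exported cluster specs (made-up values).
ucx_cluster = {"node_type_id": "Standard_D4s_v3", "driver_node_type_id": "Standard_D4s_v3",
               "num_workers": 2, "spark_version": "14.3.x-scala2.12"}
healthy_cluster = {"node_type_id": "Standard_D4s_v3", "driver_node_type_id": "Standard_D8s_v3",
                   "num_workers": 2, "spark_version": "14.3.x-scala2.12"}

# Only the fields that differ are printed — here, the driver node type.
print(json.dumps(diff_specs(ucx_cluster, healthy_cluster), indent=2))
```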
@tunayokumus A few things to check on the failing job:
- Check driver utilization in the Metrics tab while the assessment job is running. It likely failed due to OOM; consider increasing the cluster size (especially the driver) and rerunning.
- If it fails again, you could limit the crawl to a few databases (via the installation question "comma separated list of databases to migrate"). This reduces the scan area, so the job should finish sooner.
- Let us know if the issue still persists.
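To narrow the scan as suggested above, the installer prompt expects a plain comma-separated string; a trivial sketch using made-up database names:

```python
# Hypothetical: answer the installer's "comma separated list of databases to
# migrate" prompt with a small subset to shrink the assessment's scan area.
all_databases = ["sales", "finance", "hr", "staging", "scratch"]  # illustrative names

subset = all_databases[:2]     # start with a couple of databases
answer = ",".join(subset)      # the format the prompt expects
print(answer)                  # sales,finance
```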