
LightGBMRegressor.fit times out when run via databricks-connect.

Open aidin-zadeh opened this issue 3 years ago • 3 comments

Describe the bug LightGBMRegressor.fit gets stuck (times out) when running via databricks-connect.

To Reproduce Fit an instance of LightGBMRegressor via databricks-connect.

Expected behavior The fit method completes successfully.

Info (please complete the following information):

  • SynapseML Version: 2.12
  • SynapseML Coordinate: com.microsoft.azure:synapseml_2.12:0.9.5
  • SynapseML Repository: https://mmlspark.azureedge.net/maven
  • Spark Version: 3.2.1
  • Scala Version: 2.12
  • Databricks Version: 10.4 LTS
  • Databricks-connect Version: 10.4.6

Stacktrace

2022-07-20T17:22:38.480+0000: [GC (Allocation Failure) [PSYoungGen: 524800K->21556K(611840K)] 524800K->21580K(2010112K), 0.0129807 secs] [Times: user=0.10 sys=0.00, real=0.01 secs] 
2022-07-20T17:22:38.550+0000: [GC (Metadata GC Threshold) [PSYoungGen: 57016K->10494K(611840K)] 57040K->10526K(2010112K), 0.0049380 secs] [Times: user=0.08 sys=0.01, real=0.00 secs] 
2022-07-20T17:22:38.555+0000: [Full GC (Metadata GC Threshold) [PSYoungGen: 10494K->0K(611840K)] [ParOldGen: 32K->9830K(636928K)] 10526K->9830K(1248768K), [Metaspace: 20515K->20515K(22528K)], 0.0180416 secs] [Times: user=0.19 sys=0.01, real=0.02 secs] 
2022-07-20T17:22:39.567+0000: [GC (Metadata GC Threshold) [PSYoungGen: 263461K->12901K(611840K)] 273292K->22740K(1248768K), 0.0047544 secs] [Times: user=0.04 sys=0.00, real=0.00 secs] 
2022-07-20T17:22:39.572+0000: [Full GC (Metadata GC Threshold) [PSYoungGen: 12901K->0K(611840K)] [ParOldGen: 9838K->20825K(970240K)] 22740K->20825K(1582080K), [Metaspace: 34020K->34020K(36864K)], 0.0519531 secs] [Times: user=1.10 sys=0.04, real=0.05 secs] 
2022-07-20T17:25:53.338+0000: [GC (Metadata GC Threshold) [PSYoungGen: 523080K->29927K(773120K)] 543906K->50761K(1743360K), 0.0166177 secs] [Times: user=0.13 sys=0.01, real=0.01 secs] 
2022-07-20T17:25:53.355+0000: [Full GC (Metadata GC Threshold) [PSYoungGen: 29927K->0K(773120K)] [ParOldGen: 20833K->41112K(1268736K)] 50761K->41112K(2041856K), [Metaspace: 56284K->56213K(59392K)], 0.0775757 secs] [Times: user=1.39 sys=0.10, real=0.08 secs] 
2022-07-20T17:26:22.914+0000: [GC (Allocation Failure) [PSYoungGen: 742912K->24067K(947200K)] 784024K->65195K(2215936K), 0.0142385 secs] [Times: user=0.23 sys=0.02, real=0.02 secs] 
2022-07-20T17:26:23.235+0000: [GC (Metadata GC Threshold) [PSYoungGen: 475303K->33782K(1094144K)] 516431K->85437K(2362880K), 0.0151588 secs] [Times: user=0.34 sys=0.09, real=0.02 secs] 
2022-07-20T17:26:23.250+0000: [Full GC (Metadata GC Threshold) [PSYoungGen: 33782K->0K(1094144K)] [ParOldGen: 51655K->68913K(1595904K)] 85437K->68913K(2690048K), [Metaspace: 81386K->72073K(100352K)], 0.0750877 secs] [Times: user=0.63 sys=0.20, real=0.07 secs] 
2022-07-20T17:26:24.641+0000: [GC (Allocation Failure) [PSYoungGen: 1060352K->36634K(1105920K)] 1129265K->105556K(2701824K), 0.0215145 secs] [Times: user=0.34 sys=0.11, real=0.02 secs] 
2022-07-20T17:26:28.141+0000: [GC (Allocation Failure) [PSYoungGen: 1096986K->49144K(1699840K)] 1165908K->156002K(3295744K), 0.0545540 secs] [Times: user=0.42 sys=0.22, real=0.05 secs] 
2022-07-20T17:26:30.381+0000: [GC (Metadata GC Threshold) [PSYoungGen: 1145563K->73186K(1723904K)] 1252421K->184820K(3319808K), 0.0279235 secs] [Times: user=0.30 sys=0.22, real=0.03 secs] 
2022-07-20T17:26:30.409+0000: [Full GC (Metadata GC Threshold) [PSYoungGen: 73186K->0K(1723904K)] [ParOldGen: 111634K->131221K(1966080K)] 184820K->131221K(3689984K), [Metaspace: 135096K->112661K(165888K)], 0.1516168 secs] [Times: user=1.30 sys=0.34, real=0.15 secs] 
2022-07-20T17:26:31.571+0000: [GC (Allocation Failure) [PSYoungGen: 1650688K->60794K(2267648K)] 1781909K->192024K(4233728K), 0.0495676 secs] [Times: user=0.12 sys=0.13, real=0.05 secs] 
Received exception on call, retrying: java.net.ConnectException: Connection timed out (Connection timed out)
[the line above appears 78 times in a row in the original log]

Additional context I am using DBR version 10.4 LTS and want to use the SynapseML package with databricks-connect. I installed SynapseML and attached it to the cluster, and to be able to import it with databricks-connect I also added SynapseML to jar.packages. I can install and import it successfully. However, when I try to run the fit method, it times out on the worker side. Please see the error above. Would you have any idea how to solve this issue? Thanks

AB#1889707

aidin-zadeh, Jul 20 '22 21:07

Hi @aidin-zadeh, could you do a quick sanity check to see whether the error above happens only when using databricks-connect, or also when running on Databricks directly? If the same error happens on Databricks, we should be able to dig deeper on our side; in that case, could you provide more details about your sample code snippet and where you store the data? Thanks!

serena-ruan, Jul 26 '22 09:07

Hey @serena-ruan, I'm able to import and successfully run LightGBMRegressor within the Databricks workspace via a Databricks notebook. The issue only happens when running from databricks-connect.

aidin-zadeh, Jul 26 '22 14:07

This looks like a network issue with databricks-connect, though. Could you check whether this happens only for LightGBMRegressor, or also for a standard SparkML model, say LogisticRegression?

serena-ruan, Jul 27 '22 09:07
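
The comparison suggested in the last comment could be run roughly as follows. This is only a sketch, not code from the thread: the helper names `timed_fit` and `compare_estimators`, the column names `features` and `label`, and the small iteration counts are all assumptions made for illustration. It also assumes a databricks-connect session is already configured and that `pyspark` and `synapseml` are importable on the client.

```python
import time


def timed_fit(estimator, df):
    """Fit `estimator` on `df` and return (succeeded, elapsed_seconds, error)."""
    start = time.monotonic()
    try:
        estimator.fit(df)
        return True, time.monotonic() - start, None
    except Exception as exc:  # record the failure so the other estimator still runs
        return False, time.monotonic() - start, exc


def compare_estimators(train_df):
    """Fit a built-in SparkML model and LightGBMRegressor on the same DataFrame.

    Assumes (hypothetically) that `train_df` has a vector column "features"
    and a numeric 0/1 column "label".
    """
    # Imported lazily so the helper above can be used without Spark installed.
    from pyspark.ml.classification import LogisticRegression
    from synapseml.lightgbm import LightGBMRegressor

    candidates = {
        "LogisticRegression": LogisticRegression(
            featuresCol="features", labelCol="label", maxIter=5
        ),
        "LightGBMRegressor": LightGBMRegressor(
            featuresCol="features", labelCol="label", numIterations=5
        ),
    }
    return {name: timed_fit(est, train_df) for name, est in candidates.items()}
```

Calling `compare_estimators(train_df)` from the databricks-connect client returns one `(succeeded, elapsed_seconds, error)` tuple per estimator. Note that `timed_fit` does not enforce a timeout of its own, so a fit that hangs, as in the report above, will block until the underlying call gives up. If LogisticRegression finishes and LightGBMRegressor does not, that would suggest the problem is specific to LightGBM rather than to databricks-connect in general.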