spark-excel

[BUG] Cannot read files into dataframe in Databricks 11.3 LTS Runtime 3.3.0 Spark

Open james-miles-ccy opened this issue 3 years ago • 21 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Current Behavior

When running the v2 Excel PySpark code below on Databricks 11.3 LTS Runtime:

df = spark.read.format("excel") \
    .option("header", True) \
    .option("inferSchema", True) \
    .load(fr"{folderpath}//.xlsx")
display(df)

I receive the following error upon attempting to display or use the resulting dataframe:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 101) (10.94.235.131 executor 1): java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.v2.FilePartitionReaderFactory.options()Lorg/apache/spark/sql/catalyst/FileSourceOptions;

Expected Behavior

The resulting Dataframe should display correctly.

Steps To Reproduce

Set the folderpath variable to a location containing Excel files, and run the Python code below in the latest Databricks runtime:

df = spark.read.format("excel") \
    .option("header", True) \
    .option("inferSchema", True) \
    .load(fr"{folderpath}//.xlsx")
display(df)

Environment

- Spark version: 3.3.0
- Spark-Excel version: 0.18.5
- OS: Windows 10
- Cluster environment

Anything else?

No response

james-miles-ccy avatar Nov 16 '22 13:11 james-miles-ccy

Hey @james-miles-ccy, the Spark-Excel version should consist of the Spark version and the version of Spark-Excel itself. You were only specifying the version of Spark-Excel. Can you check you were using 3.3.1_0.18.5?
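
A minimal sketch of how the combined version string is put together. It assumes the artifact is published under the Maven group `com.crealytics` (an assumption, not stated in this thread); the helper name `spark_excel_coordinate` is hypothetical.

```python
def spark_excel_coordinate(spark_version: str, excel_version: str,
                           scala_version: str = "2.12",
                           group: str = "com.crealytics") -> str:
    """Build a Maven coordinate whose version part is <spark>_<spark-excel>."""
    # group is an assumption; the version format follows the comment above.
    return f"{group}:spark-excel_{scala_version}:{spark_version}_{excel_version}"


# The combination asked about in this comment.
print(spark_excel_coordinate("3.3.1", "0.18.5"))
```

The point is only that the version field has two parts: the Spark version the library was built against, then the Spark-Excel release.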

nightscape avatar Nov 16 '22 13:11 nightscape

Yes I am using 3.3.1_0.18.5

james-miles-ccy avatar Nov 16 '22 13:11 james-miles-ccy

Can you check the same thing with a local or other non-Databricks Spark 3.3.0? We already had the case once where Databricks used a slightly different and not fully API-compatible version of Spark in their Runtime than the officially published one.

nightscape avatar Nov 16 '22 16:11 nightscape

I have installed PySpark and spark-excel locally. The V1 format works fine and generates DataFrames on Spark 3.3.1, but using a path for multiple files (i.e. the V2 format) causes cells to hang and never complete. I am using the same spark-excel version as stated above.

james-miles-ccy avatar Nov 22 '22 15:11 james-miles-ccy

Is it the same error/issue as on DataBricks?

nightscape avatar Nov 22 '22 16:11 nightscape

No, in Databricks you receive the error listed in my original comment, whereas running locally results in execution that never completes.

FYI, this is only an issue for v2, v1 works in both Databricks and local.
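
For readers unfamiliar with the two APIs, the sketch below shows how the reader could be selected. It assumes the v1 data source is registered under the long name `com.crealytics.spark.excel`, while v2 uses the short name `excel` as in the original post. `excel_format` and `read_excel` are hypothetical helper names, and `spark` is an existing SparkSession.

```python
V1_FORMAT = "com.crealytics.spark.excel"  # assumption: v1 data source name
V2_FORMAT = "excel"                       # v2 short name, as in the original post


def excel_format(use_v2: bool) -> str:
    """Return the data source name for the requested API version."""
    return V2_FORMAT if use_v2 else V1_FORMAT


def read_excel(spark, path: str, use_v2: bool = True):
    """Read Excel data from `path` with the chosen API version."""
    return (
        spark.read.format(excel_format(use_v2))
        .option("header", True)
        .option("inferSchema", True)
        .load(path)
    )
```

Switching `use_v2` to `False` is a quick way to confirm whether a failure is specific to the v2 code path, as reported here.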

james-miles-ccy avatar Nov 24 '22 15:11 james-miles-ccy

I am facing the same issue with V2 (Spark version: 3.3.0, Spark-Excel: 3.3.1_0.18.5). V1 works, but not completely: input_file_name() returns an empty string.

snehawankhade avatar Nov 30 '22 20:11 snehawankhade

input_file_name is only supported in v2. Unfortunately, I didn't have time to look into the original issue.

nightscape avatar Dec 01 '22 10:12 nightscape

Hey @nightscape. This got mentioned in our implementation as well

I think I've traced the issue down to Databricks using a patched Spark runtime in the 11.x runtimes (and the 12.0 beta runtime), which includes a change from the master branch of Spark that isn't in the 3.3 support branch.

I'm looking into this further at the moment and I'll shout if I find anything
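
To make the error message easier to read: the text after `AbstractMethodError:` is a JVM method descriptor naming the class, the method, and its return type. The JVM raises this error when code calls a method that is abstract in the class loaded at runtime. That is consistent with the runtime's `FilePartitionReaderFactory` declaring an `options()` method that the spark-excel build does not implement. The sketch below splits the descriptor into its parts. The function name `parse_abstract_method_error` is hypothetical, and it only handles no-argument methods returning an object type, as in this error.

```python
import re

# Matches "<class>.<method>()L<return/type>;" for a no-argument method
# that returns an object type.
_DESCRIPTOR = re.compile(
    r"(?P<cls>[\w.$]+)\.(?P<method>\w+)\(\)L(?P<ret>[\w/$]+);"
)


def parse_abstract_method_error(message: str) -> dict:
    """Extract class, method, and return type from an AbstractMethodError message."""
    match = _DESCRIPTOR.search(message)
    if match is None:
        raise ValueError("no no-argument method descriptor found in message")
    return {
        "class": match.group("cls"),
        "method": match.group("method"),
        # Descriptors use "/" as the package separator; convert to dots.
        "returns": match.group("ret").replace("/", "."),
    }


msg = ("java.lang.AbstractMethodError: "
       "org.apache.spark.sql.execution.datasources.v2."
       "FilePartitionReaderFactory.options()"
       "Lorg/apache/spark/sql/catalyst/FileSourceOptions;")
print(parse_abstract_method_error(msg))
```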

dazfuller avatar Dec 11 '22 15:12 dazfuller

Just to add an update. I've been talking with Databricks and there's a fix coming which will resolve this in the 11.x and 12.x runtimes. It should hopefully be coming in January.

dazfuller avatar Dec 23 '22 18:12 dazfuller

@dazfuller thanks a lot for pushing this forward and keeping us updated here!! We had a similar issue before, so I guess Databricks breaking compatibility with the Open Source Spark version is something we have to keep an eye on...

nightscape avatar Dec 24 '22 01:12 nightscape

Hi All, FYI looks like this has all been resolved by Databricks on 12.1 runtime!

james-miles-ccy avatar Apr 13 '23 14:04 james-miles-ccy

This also happens with:

- spark-excel_2.12-3.3.1_0.18.7 + Spark 3.5.0 (Azure Databricks 15.4 LTS)
- spark-excel_2.12-3.3.3_0.20.3 + Spark 3.4.1 (Azure Databricks 13.3 LTS)
- spark-excel_2.12-3.3.3_0.20.3 + Spark 3.5.0 (Azure Databricks 15.4 LTS)

minnieshi avatar Dec 17 '24 14:12 minnieshi

@minnieshi do you get the exact same error as in the first post?

java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.v2.FilePartitionReaderFactory.options()Lorg/apache/spark/sql/catalyst/FileSourceOptions;

nightscape avatar Dec 18 '24 11:12 nightscape

Yes. The same error. @nightscape So, I instead used a combination of lower versions and wrote here in the hope that higher versions could be used. I have now copied the error below:

AbstractMethodError: org.apache.spark.sql.execution.datasources.v2.FilePartitionReaderFactory.options()Lorg/apache/spark/sql/catalyst/FileSourceOptions;

minnieshi avatar Dec 18 '24 17:12 minnieshi

@minnieshi, I think I am having a similar issue (https://github.com/nightscape/spark-excel/issues/926). Do you know which version I could use to make it run on Databricks 14.3 and Spark 3.5.0?

mmicu avatar Jan 27 '25 16:01 mmicu

I tried every combination in the version matrix, and I could not get it to run on Spark 3.5.0.

Kind regards,
Min

minnieshi avatar Jan 27 '25 21:01 minnieshi

@minnieshi, thanks for the quick feedback. Do you know what exactly causes the issue?

I ask because I have checked the JAR that contains that file and the corresponding class, and everything looks fine and properly defined.

mmicu avatar Jan 27 '25 21:01 mmicu

@mmicu can you access the JAR files on Databricks? My best guess is that Databricks (again) made some non-binary-compatible changes in their version of Spark.
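
One way to act on this, sketched under assumptions: a JAR is a ZIP archive, so Python's standard `zipfile` module can check whether a given class file is present. The helper name `jar_contains_class` is hypothetical, and the JAR location on a cluster is not given in this thread. Finding the class only shows that it exists, not that its method signatures match the version spark-excel was compiled against.

```python
import zipfile


def jar_contains_class(jar_path: str, class_name: str) -> bool:
    """Return True if the JAR has a .class entry for the fully qualified class name."""
    # Class files are stored with "/" as the package separator.
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()
```

With a JDK available, `javap -p -cp <jar> <class>` prints the methods a class declares, which is what would need to be compared between the Databricks JAR and the open-source Spark JAR.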

nightscape avatar Jan 28 '25 09:01 nightscape

@nightscape, yes, I should have access. I could try to get some information from the cluster and its JAR if you need.

mmicu avatar Jan 28 '25 10:01 mmicu

I reintroduced dedicated builds for Spark 3.4.1. Can someone try them and see if they work and fix the issue on Azure Databricks 13.3 LTS?

nightscape avatar Mar 14 '25 11:03 nightscape