[BUG] Cannot read files into a DataFrame on Databricks 11.3 LTS runtime (Spark 3.3.0)
Is there an existing issue for this?
- [X] I have searched the existing issues
Current Behavior
When running the v2 spark-excel PySpark code below on the Databricks 11.3 LTS runtime:
df = (
    spark.read.format("excel")
    .option("header", True)
    .option("inferSchema", True)
    .load(fr"{folderpath}//.xlsx")
)
display(df)
I receive the following error upon attempting to display or use the resulting dataframe:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 101) (10.94.235.131 executor 1): java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.v2.FilePartitionReaderFactory.options()Lorg/apache/spark/sql/catalyst/FileSourceOptions;
Expected Behavior
The resulting DataFrame should display correctly.
Steps To Reproduce
Set the folderpath variable to a location containing Excel files, then run the Python code below on the latest Databricks runtime:
df = (
    spark.read.format("excel")
    .option("header", True)
    .option("inferSchema", True)
    .load(fr"{folderpath}//.xlsx")
)
display(df)
Environment
- Spark version: 3.3.0
- Spark-Excel version: 0.18.5
- OS: Windows 10
- Cluster environment
Anything else?
No response
Hey @james-miles-ccy, the Spark-Excel version should consist of the Spark version and the version of Spark-Excel itself.
You were only specifying the version of Spark-Excel. Can you check that you were using 3.3.1_0.18.5?
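To illustrate the naming convention described above: a spark-excel artifact version such as `3.3.1_0.18.5` encodes both the Spark version it was built against (before the underscore) and the spark-excel release itself (after it). The helper below is purely illustrative — it is not part of spark-excel — and just shows how the two halves relate:

```python
# Illustrative helper (not part of spark-excel): split an artifact version
# like "3.3.1_0.18.5" into the Spark version it was built against and the
# spark-excel release itself.
def split_spark_excel_version(version: str) -> tuple[str, str]:
    spark_version, _, excel_version = version.partition("_")
    return spark_version, excel_version

spark_ver, excel_ver = split_spark_excel_version("3.3.1_0.18.5")
print(spark_ver)   # "3.3.1"  -> should match the cluster's Spark version
print(excel_ver)   # "0.18.5" -> the spark-excel release
```

The practical point: when installing the library on a cluster, the part before the underscore must match the runtime's Spark version, otherwise binary-incompatibility errors like the one reported here become likely.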
Yes I am using 3.3.1_0.18.5
Can you check the same thing with a local or other non-Databricks Spark 3.3.0? We already had the case once where Databricks used a slightly different and not fully API-compatible version of Spark in their Runtime than the officially published one.
I have installed PySpark/spark-excel locally, and the v1 format works fine and generates DataFrames on Spark 3.3.1, but using a path for multiple files (i.e., the v2 format) causes issues where cells hang and never complete. I am using the same spark-excel version as stated above.
Is it the same error/issue as on Databricks?
No, on Databricks you get the error listed in my original comment, whereas locally execution hangs and never completes.
FYI, this is only an issue for v2; v1 works both on Databricks and locally.
I am facing the same issue with v2 (Spark version: 3.3.0, spark-excel: 3.3.1_0.18.5). v1 works, but not completely: input_file_name() returns an empty string.
input_file_name is only supported in v2. Unfortunately, I didn't have time to look into the original issue.
Hey @nightscape. This got mentioned in our implementation as well
I think I've traced the issue down to Databricks using a patched Spark runtime in the 11.x runtimes (and the 12.0 beta runtime), which includes a change from the master branch of Spark that isn't in the 3.3 support branch.
I'm looking into this further at the moment and I'll shout if I find anything.
Just to add an update: I've been talking with Databricks and there's a fix coming which will resolve this in the 11.x and 12.x runtimes. It should hopefully land in January.
@dazfuller thanks a lot for pushing this forward and keeping us updated here! We had a similar issue before, so I guess Databricks breaking compatibility with the open-source Spark version is something we have to keep an eye on...
Hi all, FYI, it looks like this has all been resolved by Databricks in the 12.1 runtime!
This happens with the following combinations too:
- spark-excel_2.12-3.3.1_0.18.7 + Spark 3.5.0 (Azure Databricks 15.4 LTS)
- spark-excel_2.12-3.3.3_0.20.3 + Spark 3.4.1 (Azure Databricks 13.3 LTS)
- spark-excel_2.12-3.3.3_0.20.3 + Spark 3.5.0 (Azure Databricks 15.4 LTS)
@minnieshi do you get the exact same error as in the first post?
java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.v2.FilePartitionReaderFactory.options()Lorg/apache/spark/sql/catalyst/FileSourceOptions;
Yes, the same error, @nightscape. I instead used a combination of lower versions, and wrote here in the hope that the higher versions could be made to work. I have now copied the error below:
AbstractMethodError: org.apache.spark.sql.execution.datasources.v2.FilePartitionReaderFactory.options()Lorg/apache/spark/sql/catalyst/FileSourceOptions;
@minnieshi, I think I am having a similar issue. Do you know which version I could use to make it run for Databricks 14.3 and Spark 3.5.0?
I tried all the matrix combinations; I could not get it to run on Spark 3.5.0.
Kind regards Min
@minnieshi, thanks for the quick feedback. Do you know what exactly causes the issue?
I have checked the JAR which contains that file and the corresponding class, and everything looks fine and properly defined.
@mmicu can you access the JAR files on Databricks? My best guess is that Databricks (again) made some non-binary-compatible changes in their version of Spark.
@nightscape, yes, I should have access. I could try to get some information from the cluster and its JAR if you need.
I reintroduced dedicated builds for Spark 3.4.1. Can someone try them and see if they fix the issue on Azure Databricks 13.3 LTS?