Spark 2.4.0 and Scala 2.12 support
Hello,
Do you plan on supporting Spark 2.4.0 and Scala 2.12? Or is there a way to use it natively in Spark 2.4.0?
It looks like there have been no releases since November 2016: https://mvnrepository.com/artifact/com.databricks/spark-redshift
Thank you.
Any answer?
As I understand from their readme, they will not provide any free updates anymore:
To ensure the best experience for our customers, we have decided to inline this connector directly in Databricks Runtime. The latest version of Databricks Runtime (3.0+) includes an advanced version of the RedShift connector for Spark that features both performance improvements (full query pushdown) as well as security improvements (automatic encryption). For more information, refer to the Databricks documentation. As a result, we will no longer be making releases separately from Databricks Runtime.
We migrated to Spark 2.4.0 in my team, and the spark-redshift package is still usable with Scala 2.11.
I'm having issues migrating to Spark 2.4 and using this library. Do you know what version of spark-avro you're using?
For spark-avro, make sure not to use the package delivered by com.databricks anymore, but the one from org.apache.spark.
If you are using build.sbt:
scalaVersion := "2.11.12"
val sparkVersion = "2.4.0"
val sparkAvroVersion = "2.4.0"
val redshiftVersion = "3.0.0-preview1"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion,
"org.apache.spark" %% "spark-sql" % sparkVersion,
...
"org.apache.spark" %% "spark-avro" % sparkAvroVersion,
"com.databricks" %% "spark-redshift" % redshiftVersion
)
If you are using pom.xml:
<scala.version>2.11</scala.version>
<spark.version>2.4.0</spark.version>
...
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-redshift_${scala.version}</artifactId>
<version>3.0.0-preview1</version>
<exclusions>
<exclusion>
<groupId>com.databricks</groupId>
<artifactId>spark-avro_${scala.version}</artifactId>
</exclusion>
</exclusions>
</dependency>
...
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-avro_${scala.version}</artifactId>
<version>${spark.version}</version>
</dependency>
...
<dependency>
<groupId>org.apache.avro</groupId>
<artifactId>avro</artifactId>
<version>1.7.7</version>
<scope>provided</scope>
</dependency>
Please also read these instructions in the Spark 2.4.0 documentation: https://spark.apache.org/docs/latest/sql-data-sources-avro.html#compatibility-with-databricks-spark-avro
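The compatibility section linked above describes a legacy flag that maps the old com.databricks.spark.avro format name to Spark's built-in Avro data source. A minimal sketch of using it from a Spark session (the flag name is from those docs; the S3 path is just a placeholder, and I believe the flag defaults to true in 2.4, so setting it explicitly is mostly for clarity):

```scala
// Make the old Databricks format name resolve to the built-in Avro source.
spark.conf.set("spark.sql.legacy.replaceDatabricksSparkAvro.enabled", "true")

// With the flag on, both format names should hit the same implementation:
val viaOldName = spark.read.format("com.databricks.spark.avro").load("s3://some-bucket/data.avro")
val viaNewName = spark.read.format("avro").load("s3://some-bucket/data.avro")
```

This matters for spark-redshift because it loads Avro files internally; if it asks for the Databricks format name, the flag lets the built-in module answer instead.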
Also, don't even think about using Scala 2.12 with spark-redshift_2.11; it's broken (not maintained since November 2016).
https://github.com/databricks/spark-redshift/releases
Hi @mycaule, I am trying to do this on Amazon EMR with Spark 2.4. I invoke my Spark shell using spark-shell --jars RedshiftJDBC42-1.2.10.1009.jar --packages com.databricks:spark-redshift_2.11:3.0.0-preview1,org.apache.spark:spark-avro_2.11:2.4.0
And then I try to read via:
val url = "jdbc:redshift://cluster-link?user=username&password=password"
val queryFinal = "select count(*) as cnt from table1"
val df = spark.read.format("com.databricks.spark.redshift")
  .option("url", url)
  .option("tempdir", "s3n://temp-bucket/")
  .option("query", queryFinal)
  .option("forward_spark_s3_credentials", "true")
  .load()
  .cache
This is not working for me and gives the same exception as pointed out by you in the starting thread. Can you tell me if I am doing something wrong?
Can you paste the exception? Maybe it's discussed in another thread. I didn't mention any above.
We've started a community edition of spark-redshift which works with spark2.4. Feel free to try it out! If you do, it'd be very helpful to receive your feedback. Any pull requests are very very welcome too.
https://github.com/spark-redshift-community/spark-redshift
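For anyone wanting to try it from a build file rather than the GitHub sources: the community fork publishes artifacts under its own group ID. A sketch of the sbt dependency, assuming the io.github.spark-redshift-community coordinates and a 4.x version (check the repository's releases page for the actual latest version and supported Scala versions):

```scala
// Hypothetical coordinates for the community fork; verify the version
// against https://github.com/spark-redshift-community/spark-redshift/releases
libraryDependencies += "io.github.spark-redshift-community" %% "spark-redshift" % "4.0.1"
```

The format name used in spark.read.format may also differ from the original com.databricks.spark.redshift in the fork, so consult its README.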
This did the trick for me. I was able to read data from Redshift but not write to it; I was getting spark-avro issues. This edition of spark-redshift resolved the issue. Thanks!
Cheers @lucagiovagnoli!
It seems that this library does not support writing columns in JSON/array format to Redshift with the SUPER type, right?