
Spark 2.4.0 and Scala 2.12 support

Open mycaule opened this issue 7 years ago • 10 comments

Hello,

Do you plan on supporting Spark 2.4.0 and Scala 2.12? Or is there a way to use it natively in Spark 2.4.0?

It looks like there have been no releases since November 2016. https://mvnrepository.com/artifact/com.databricks/spark-redshift

Thank you.

mycaule avatar Dec 28 '18 01:12 mycaule

Any answer?

JoanMartin avatar Mar 31 '19 15:03 JoanMartin

As I understood from their readme, they will no longer provide free updates:

To ensure the best experience for our customers, we have decided to inline this connector directly in Databricks Runtime. The latest version of Databricks Runtime (3.0+) includes an advanced version of the RedShift connector for Spark that features both performance improvements (full query pushdown) as well as security improvements (automatic encryption). For more information, refer to the Databricks documentation. As a result, we will no longer be making releases separately from Databricks Runtime.

My team migrated to Spark 2.4.0, and the spark-redshift package is still usable with Scala 2.11.

mycaule avatar Apr 01 '19 08:04 mycaule

I'm having issues migrating to Spark 2.4 while using this library. Do you know which version of spark-avro you're using?

sgpietz-handy avatar Apr 02 '19 15:04 sgpietz-handy

For spark-avro, make sure you no longer use the package delivered by com.databricks, but the one from org.apache.spark.

If you are using build.sbt,

scalaVersion := "2.11.12"

val sparkVersion = "2.4.0"
val sparkAvroVersion = "2.4.0"
val redshiftVersion = "3.0.0-preview1"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  ...
  "org.apache.spark" %% "spark-avro" % sparkAvroVersion,
  "com.databricks" %% "spark-redshift" % redshiftVersion
)

If you are using pom.xml,

		<scala.version>2.11</scala.version>
		<spark.version>2.4.0</spark.version>
...
			<dependency>
				<groupId>com.databricks</groupId>
				<artifactId>spark-redshift_${scala.version}</artifactId>
				<version>3.0.0-preview1</version>
				<exclusions>
					<exclusion>
						<groupId>com.databricks</groupId>
						<artifactId>spark-avro_${scala.version}</artifactId>
					</exclusion>
				</exclusions>
			</dependency>
...
			<dependency>
				<groupId>org.apache.spark</groupId>
				<artifactId>spark-avro_${scala.version}</artifactId>
				<version>${spark.version}</version>
			</dependency>
...
			<dependency>
				<groupId>org.apache.avro</groupId>
				<artifactId>avro</artifactId>
				<version>1.7.7</version>
				<scope>provided</scope>
			</dependency>

Please read these instructions on Spark 2.4.0 documentation too: https://spark.apache.org/docs/latest/sql-data-sources-avro.html#compatibility-with-databricks-spark-avro
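As a sketch of what those docs describe: since Spark 2.4, you can ask Spark to map the legacy com.databricks.spark.avro data source name to the built-in Avro implementation via a legacy configuration flag, so old code keeps working while you depend only on org.apache.spark:spark-avro. This is a config fragment to run inside an existing spark-shell or SparkSession; verify the flag name against your Spark version's documentation.

```scala
// Map the old com.databricks.spark.avro source name to the built-in
// Avro data source (per the Spark 2.4 compatibility note).
spark.conf.set("spark.sql.legacy.replaceDatabricksSparkAvro.enabled", "true")

// After this, code still written against the old package name resolves
// to the built-in implementation, e.g.:
// df.write.format("com.databricks.spark.avro").save("/tmp/out")
```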

Also, don't try to use Scala 2.12 with spark-redshift_2.11; it won't work, since the library has not been maintained since Nov 2016. https://github.com/databricks/spark-redshift/releases

mycaule avatar Apr 02 '19 17:04 mycaule

Hi @mycaule, I am trying to do this on Amazon EMR with Spark 2.4. I invoke my Spark shell with:

spark-shell --jars RedshiftJDBC42-1.2.10.1009.jar --packages com.databricks:spark-redshift_2.11:3.0.0-preview1,org.apache.spark:spark-avro_2.11:2.4.0

and then try to read via:

val url = "jdbc:redshift://cluster-link?user=username&password=password"
val queryFinal = "select count(*) as cnt from table1"
val df = spark.read.format("com.databricks.spark.redshift")
  .option("url", url)
  .option("tempdir", "s3n://temp-bucket/")
  .option("query", queryFinal)
  .option("forward_spark_s3_credentials", "true")
  .load()
  .cache

This is not working for me and gives the same exception you pointed out at the start of the thread. Can you tell me if I am doing something wrong?

kostajaitachi avatar Apr 18 '19 05:04 kostajaitachi

Can you paste the exception? Maybe it's discussed in another thread; I didn't mention any above.

mycaule avatar Apr 19 '19 09:04 mycaule

We've started a community edition of spark-redshift which works with Spark 2.4. Feel free to try it out! If you do, it'd be very helpful to receive your feedback. Pull requests are very welcome too.

https://github.com/spark-redshift-community/spark-redshift
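For anyone trying the community fork, a minimal read sketch. Note that the data source name changes from com.databricks.spark.redshift to the community package name; the artifact coordinates, version, and connection details below are illustrative, so check the repository's README for the current ones.

```scala
// Launch with the community artifact (illustrative version):
// spark-shell --jars RedshiftJDBC42-1.2.10.1009.jar \
//   --packages io.github.spark-redshift-community:spark-redshift_2.11:4.0.1

// Read a query result through the community connector; the format name
// is the fork's package rather than com.databricks.spark.redshift.
val df = spark.read
  .format("io.github.spark_redshift_community.spark.redshift")
  .option("url", "jdbc:redshift://host:5439/db?user=user&password=pass")
  .option("query", "select count(*) as cnt from table1")
  .option("tempdir", "s3a://temp-bucket/")
  .option("forward_spark_s3_credentials", "true")
  .load()
```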

lucagiovagnoli avatar Jul 02 '19 19:07 lucagiovagnoli

We've started a community edition of spark-redshift which works with Spark 2.4. Feel free to try it out! If you do, it'd be very helpful to receive your feedback. Pull requests are very welcome too.

https://github.com/spark-redshift-community/spark-redshift

This did the trick for me. I was able to read data from Redshift but not write to it; I was getting spark-avro issues. This edition of spark-redshift resolved them. Thanks!

sheryy-abhi avatar Jul 10 '20 20:07 sheryy-abhi

Cheers @lucagiovagnoli!

StephenDenham-districtm avatar Jan 05 '21 19:01 StephenDenham-districtm

We've started a community edition of spark-redshift which works with Spark 2.4. Feel free to try it out! If you do, it'd be very helpful to receive your feedback. Pull requests are very welcome too.

https://github.com/spark-redshift-community/spark-redshift

It seems that this library does not support writing JSON/array columns to Redshift as the SUPER type, right?

vannguyende avatar Jul 03 '24 10:07 vannguyende