frameless Usage of "private" Spark APIs

Frameless is depending on portions of Spark for which there is no binary compatibility commitment. For example, Frameless uses StaticInvoke, which is part of the org.apache.spark.sql.catalyst.expressions.objects package. If you look at the (bountiful) mima exclusions in Spark, the entire org.apache.spark.sql.catalyst package is not checked for binary compatibility.

I don't really consider this a bug with frameless, but I wanted to at least raise it as a concern as it recently bit us at work.

backstory for those who care

At work we use Databricks runtime 3.5. Databricks claims that this runtime uses Spark 2.2. However, we ran into a bewildering issue with a binary incompatibility between Frameless and the runtime Spark version (related to org.apache.spark.sql.catalyst.expressions.objects.StaticInvoke). After quite a bit of investigation, we realized that the Databricks runtime doesn't actually include Spark 2.2 proper, but a private fork of it that has some incompatible changes. It has a backported change from Spark 2.3 that is incompatible with Spark 2.2 (and the version of Frameless that is built against Spark 2.2). We can work around this particular issue by moving to Spark 2.3 and the Databricks 4.0 runtime, but it's tough to know what other incompatibilities could be lurking in the private forks, and I could envision other people running into similar issues (especially if they can't move to Spark 2.3).

May 31 '18 17:05 ceedubs

Thank @ceedubs, I always worried a bit about this. Databricks and other fork's runtimes will be an issue and that's at least something we have to document.

I have not extensively analyzed if we absolutely need to do this, but I am afraid that some of the encoding work we have require some APIs that are not exposed as public by core Spark.

Jun 01 '18 19:06 imarios

I had no idea about these mima exclusions, it's really unfortunate... Anything actionable on our side?

Jun 02 '18 12:06 OlivierBlanvillain

@OlivierBlanvillain If there were ways to reduce dependencies on sections of code under these exclusions, that would be great. Short of that, the actionable item might be to just warn about this in the README. I'd be willing to contribute this documentation, but it might be a little while before I get to it.

Jun 04 '18 23:06 ceedubs

Another anecdote: Technically speaking, I believe EMR Spark is also a fork as they've backported changes before (but much less often), so it's entirely possible that some bincompat issue can happen there as well.

Jun 06 '18 00:06 longcao

@ceedubs @longcao these are great "warnings" to add. I will create a PR with the edits. If @ceedubs get to this faster even better! :)

Jun 08 '18 00:06 imarios