[Databricks] Allow registering Sedona functions with a custom prefix
Hi,
I’m trying to use Apache Sedona in Databricks, alongside the new built-in Databricks spatial SQL functions.
Right now, Sedona registers SQL ST functions such as st_distance, st_point, etc. This causes name collisions with some of Databricks' native spatial functions.
It would be very useful if Sedona supported registering all SQL functions with a custom prefix, for example:
st_distance -> sedona_st_distance
st_point -> sedona_st_point
This way, Sedona functions can coexist with Databricks functions in the same notebook without conflicts.
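Mixed usage in SQL could then look something like this (hypothetical, assuming the prefix feature existed):

```python
# Hypothetical: plain st_* names resolve to Databricks' native functions,
# while the sedona_-prefixed names resolve to Sedona's implementations.
spark.sql("""
    SELECT
        st_distance(st_point(0, 0), st_point(3, 4)) AS db_dist,
        sedona_st_distance(
            sedona_st_point(0, 0), sedona_st_point(3, 4)
        ) AS sedona_dist
""").show()
```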
Is this currently possible, or could it be added as a feature in a future release?
Thank you!
Does this cause Sedona to stop working on newer Databricks runtime versions?
Currently we register functions by their simple class name:

```scala
val functionName = classTag.runtimeClass.getSimpleName
```
We could add it, @jiayuasu. What do you think about that?
Correction: there are no name collisions. I had misinterpreted the testing scenario earlier. See more details below.
When using the Spark DataFrame API, this coexistence works well because Sedona and Databricks spatial functions belong to different namespaces.
spark.conf.set("spark.databricks.geo.st.enabled", "true")
from pyspark.databricks.sql import functions as dbf
from sedona.spark import SedonaContext
from sedona.spark.sql import st_functions as sdstf
from sedona.spark.sql import st_constructors as sdcnf
from sedona.spark.sql import st_predicates as sdpcf
sedona = SedonaContext.create(spark)
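For example, both implementations can then be used side by side on the same DataFrame (the dbf.st_point / dbf.st_distance names are my assumption about the private-preview module; the Sedona column functions come from the imports above):

```python
# A minimal coexistence check: compute the same distance with both libraries.
df = sedona.createDataFrame([(0.0, 0.0, 3.0, 4.0)], ["x1", "y1", "x2", "y2"])

df.select(
    # Databricks native functions (assumed names in the preview module)
    dbf.st_distance(
        dbf.st_point("x1", "y1"), dbf.st_point("x2", "y2")
    ).alias("db_dist"),
    # Sedona functions from their own Python namespaces
    sdstf.ST_Distance(
        sdcnf.ST_Point("x1", "y1"), sdcnf.ST_Point("x2", "y2")
    ).alias("sedona_dist"),
).show()
```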
However, in Spark SQL, both libraries (Sedona and Databricks Spatial SQL) define functions with the same names, and Spark SQL does not support namespaces for function resolution, so I wondered which implementation Spark would use internally. In DBR 16.4 LTS, with Sedona 1.8.0 and the Private Preview of Databricks Spatial SQL, Spark SQL prioritizes Sedona's functions over the native Databricks Spatial SQL ones. I'm awaiting the release of Databricks Runtime 17 LTS to test the same scenario.
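One way to check which implementation a plain name resolves to is to describe the function in the current session:

```python
# Shows the class currently bound to the name st_distance. In the scenario
# above (Sedona 1.8.0 on DBR 16.4 LTS), this reports Sedona's expression class.
spark.sql("DESCRIBE FUNCTION EXTENDED st_distance").show(truncate=False)
```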
In the meantime, I also discovered that the Carto team provides an Analytics Toolbox for Databricks in which all Sedona spatial functions are registered with a custom prefix, such as "sedona_" (see the Carto Analytics Toolbox for Databricks). They are using an older Sedona version (1.5.1).
When using Spark SQL, this approach offers much greater flexibility: most of Databricks' native ST_ functions are implemented on top of Photon, which gives them strong performance, and for functions that do not yet exist in Databricks Spatial SQL, or are not yet Photonized, we can fall back to Sedona functions via the "sedona_" prefix.
I guess they use Sedona's internal catalog class and resolve the function name conflicts themselves. I think this is functionality we could add, as it doesn't require a lot of effort.
I think the problem is really at the SQL level. My understanding is that when Apache Sedona is installed on top of Databricks, the Apache Sedona ST expressions that share names with the ones provided by Databricks override the Databricks ones in the function registry. Then, when function resolution happens in the SQL context, the Apache Sedona expressions get chosen over the Databricks ones, which can cause failures in queries.
I’ve just realized that Apache Sedona recommends, in a Snowflake setup, always using the schema name SEDONA when calling Sedona functions to avoid conflicts with Snowflake’s built-in functions. It would be great if Sedona also provided a similar approach in a Databricks setup to avoid conflicts with Databricks’ built-in functions. Since Sedona and Databricks ST functions complement each other, this would give users even more flexibility.
Implementing a FunctionCatalog seems to be the way to go for scoping Sedona ST functions in a database instead of registering them in the global namespace. However, it is not easy to adapt the current implementation of Sedona ST functions to the UnboundFunction/BoundFunction interface required by FunctionCatalog.
Prefixing the function names with a customizable prefix seems to be a viable short-term solution. The prefix can be configured by setting a Spark configuration, such as spark.sedona.udf.prefix. I have done something similar before to make Apache Sedona and GeoMesa Spark SQL co-exist: https://www.geomesa.org/documentation/stable/user/spark/sparksql.html#using-geomesa-sparksql-with-apache-sedona.
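For illustration, usage could then look roughly like this (spark.sedona.udf.prefix is only a proposed option here, not something Sedona currently supports):

```python
from sedona.spark import SedonaContext

# Proposed, not yet implemented: make Sedona register every SQL function
# under a configurable prefix before the context is created.
spark.conf.set("spark.sedona.udf.prefix", "sedona_")
sedona = SedonaContext.create(spark)

# Databricks' native st_distance and the prefixed Sedona variant could then
# coexist in the same SQL statement without colliding.
sedona.sql(
    "SELECT sedona_ST_Distance(sedona_ST_Point(0, 0), sedona_ST_Point(3, 4))"
).show()
```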
Thanks for the great discussion here. I agree that adding a prefix could be possible. Since the next release, 1.8.1, is a minor release, we will fix the other DBR bug first. This will be done in a few days.
There are a few other things going on with the Spark 4.1 geometry/geography data types. We will address the prefix issue and the Spark 4.1 issues together in the next major release, Sedona 1.9.0.