Geostats Functions in Spark Connect
I don't think the stats functions are compatible with Spark Connect today. I tried this in Spark 3.5:
(python) ➜ python git:(graphframes-0.9.0) ✗ export SPARK_REMOTE=local
(python) ➜ python git:(graphframes-0.9.0) ✗ pytest -v tests/stats
and every test that wasn't skipped (the skips are for checkpointing) failed with this kind of _jvm error:
self = <pyspark.sql.connect.session.SparkSession object at 0x16fd17df0>, name = '_jvm'
def __getattr__(self, name: str) -> Any:
if name in ["_jsc", "_jconf", "_jvm", "_jsparkSession"]:
> raise PySparkAttributeError(
error_class="JVM_ATTRIBUTE_NOT_SUPPORTED", message_parameters={"attr_name": name}
E pyspark.errors.exceptions.base.PySparkAttributeError: [JVM_ATTRIBUTE_NOT_SUPPORTED] Attribute `_jvm` is not supported in Spark Connect as it depends on the JVM. If you need to use this attribute, do not use Spark Connect when creating your session.
../../../../.local/share/virtualenvs/python-GYLC1Bm8/lib/python3.10/site-packages/pyspark/sql/connect/session.py:692: PySparkAttributeError
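The root cause is that a Spark Connect client has no co-located JVM, so any Python wrapper that reaches for `spark._jvm` raises the `JVM_ATTRIBUTE_NOT_SUPPORTED` error above. A minimal sketch of one way a wrapper could branch between the two worlds without ever touching `_jvm` (the module-path heuristic and the `jvm_or_sql` helper are assumptions for illustration, not an official PySpark or Sedona API):

```python
def is_connect_session(spark) -> bool:
    """Return True if `spark` looks like a Spark Connect session.

    Heuristic (unofficial): Connect sessions are defined under the
    pyspark.sql.connect package, classic sessions under pyspark.sql.
    Checking the class's module avoids touching `_jvm`, which raises
    PySparkAttributeError under Connect.
    """
    return type(spark).__module__.startswith("pyspark.sql.connect")


def jvm_or_sql(spark) -> str:
    """Hypothetical dispatch helper: report which code path a geostats
    wrapper should take for the given session."""
    if is_connect_session(spark):
        return "sql"  # build Column/SQL expressions; no client-side JVM
    return "jvm"      # classic session: spark._jvm is available
```

The broader refactor would then route the "sql" branch through server-registered SQL functions instead of direct JVM calls.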
Hi @james-willis
I would like to tackle this issue and make the Geostats functions compatible with Spark Connect.
I will focus on refactoring the existing _jvm calls to use Spark Connect-compatible APIs. My local master branch is up to date, and I am working on the feature/spark-connect-geostats-2103 branch.
Please let me know if you have any specific guidance.
I don't have experience making these kinds of changes. I think the ST function Python methods may already implement something like this.
Part of me wants to deprecate these and point folks to the SQL functions instead, since those already work in Spark Connect. I know that might be controversial.
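For context, the SQL route sidesteps the client-side JVM entirely: only an expression string (or a Column built from one) is shipped to the server, so it behaves the same on classic and Connect sessions. A sketch under the assumption that the geostats function is registered under a SQL name on the server (the `ST_DBSCAN` name, its arguments, and the `sql_call` helper are illustrative, not the actual Sedona API):

```python
def sql_call(func_name: str, *args: str) -> str:
    """Build a SQL expression string for a server-registered function.

    Because only this string crosses the wire (e.g. via
    df.selectExpr(...)), no `_jvm` access is needed on the client.
    """
    return f"{func_name}({', '.join(args)})"


# Illustrative usage (function name and arguments are assumptions):
# df.selectExpr("*", sql_call("ST_DBSCAN", "geometry", "1.0", "4") + " AS cluster")
```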
Thanks for the valuable hint, James! I am already diving into the ST function implementations to understand their approach. I have also noted your thoughts on deprecation and will consider them as I explore the best path to Spark Connect compatibility. Will share updates soon.