How to deserialize a PySpark bundle in Python without the dependency on a SparkContext?
I have a "bundle.zip" serialized from PySpark. I want to deserialize this pipeline object in my Python application without needing a Spark context. According to the documentation, this is possible:
Execute Spark ML Pipelines without the dependency on the spark context, distributed data frames, and costly execution plans
http://mleap-docs.combust.ml/mleap-runtime/
But in the code and the documentation I cannot find an example. Can you point me to the right Python function?
Thanks.
To deserialize a bundle to a Spark pipeline, you can do:
from pyspark.ml import PipelineModel
sparkPipelineLr = PipelineModel.deserializeFromBundle("jar:file:/tmp/airbnb_demo.lr.zip")
If you're looking for MLeap scoring (using the mleap-runtime Maven dependency), that is supported at the moment from the JVM only, e.g. from Scala.
I need to import an object from the MLeap library to get deserializeFromBundle:
from pyspark.ml import PipelineModel
# Importing SimpleSparkSerializer monkey-patches deserializeFromBundle onto PipelineModel.
from mleap.pyspark.spark_support import SimpleSparkSerializer

pipeline_object = PipelineModel.deserializeFromBundle("jar:file:{}".format(pipeline_object_path))
But if I execute this, I get the message:
{AttributeError}Cannot load _jvm from SparkContext. Is SparkContext initialized?
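The only way I found to get past this error is to initialize a Spark session first, since deserializeFromBundle reaches into SparkContext._jvm. A minimal sketch (the bundle path is a placeholder, and I'm assuming the MLeap Spark jars are on the classpath):

from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel
from mleap.pyspark.spark_support import SimpleSparkSerializer  # patches PipelineModel

# deserializeFromBundle goes through SparkContext._jvm, so a live Spark
# session (and hence a SparkContext) has to exist before the call.
spark = SparkSession.builder.appName("mleap-deserialize").getOrCreate()

# Placeholder path to the serialized bundle.
pipeline_object = PipelineModel.deserializeFromBundle("jar:file:/tmp/bundle.zip")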
But that defeats the purpose: I chose MLeap so that I could run my pipeline object without the Spark context. To me, this statement from the documentation is misleading:
Execute Spark ML Pipelines without the dependency on the spark context, distributed data frames, and costly execution plans
http://mleap-docs.combust.ml/mleap-runtime/
The above statement only applies to Scala. If you use Scala/Spark, I see the added value of MLeap, but if you use PySpark, I don't really see the added value of MLeap, because you could just use the built-in pipeline from PySpark. Am I right?
Thanks for your response.
Vincent
The one important thing I would like to mention is that you can serialize your pipeline with PySpark and then use mleap-runtime in Scala to execute it, achieving low latencies by not requiring the Spark context.
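For example, the PySpark side of that workflow looks roughly like this (a minimal sketch; the toy StringIndexer pipeline, data, and output path are just placeholders, and it assumes the MLeap Spark jars are on the classpath):

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
from mleap.pyspark.spark_support import SimpleSparkSerializer  # patches serializeToBundle

spark = SparkSession.builder.appName("mleap-serialize").getOrCreate()

# Toy DataFrame standing in for real training data.
df = spark.createDataFrame([("a",), ("b",), ("a",)], ["category"])

pipeline = Pipeline(stages=[StringIndexer(inputCol="category", outputCol="category_index")])
model = pipeline.fit(df)

# serializeToBundle is added by the SimpleSparkSerializer import; the second
# argument is a DataFrame used to infer the bundle's schema.
model.serializeToBundle("jar:file:/tmp/pipeline.zip", model.transform(df))

The resulting bundle can then be loaded by mleap-runtime on the JVM and scored without any Spark context.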
My question is this: when the pipeline is serialized and one stage uses a custom transformer, which transformer is serialized, the Spark one or the MLeap one? In my case it's the Spark one. In that case, upon deserialization, how does MLeap know that it should use the MLeap one instead?