
How to deserialize a PySpark bundle in Python without the dependency on a SparkContext?

Open · vincentclaes opened this issue 6 years ago · 4 comments

I have a "bundle.zip" serialized from PySpark. I want to deserialize this pipeline object in my Python application without needing a Spark context. According to the documentation this is possible:

Execute Spark ML Pipelines without the dependency on the spark context, distributed data frames, and costly execution plans http://mleap-docs.combust.ml/mleap-runtime/

But in the code and the documentation I cannot find an example. Can you point me to the right Python function?

Thanks.

vincentclaes · Nov 26 '19, 08:11

To deserialize a bundle to a Spark pipeline you can do:

from pyspark.ml import PipelineModel
sparkPipelineLr = PipelineModel.deserializeFromBundle("jar:file:/tmp/airbnb_demo.lr.zip")

If you're looking for MLeap scoring (using the mleap-runtime Maven dependency), that is supported at the moment from the JVM only, e.g. from Scala.

ancasarb · Nov 26 '19, 22:11

To deserialize a bundle to a Spark pipeline you can do

from pyspark.ml import PipelineModel
sparkPipelineLr = PipelineModel.deserializeFromBundle("jar:file:/tmp/airbnb_demo.lr.zip")

If you're looking for MLeap scoring (using the mleap-runtime Maven dependency), that is supported at the moment from the JVM only, e.g. from Scala.

I need to import an object from the MLeap library to find deserializeFromBundle:

from pyspark.ml import PipelineModel
from mleap.pyspark.spark_support import SimpleSparkSerializer  # importing this patches deserializeFromBundle onto PipelineModel
pipeline_object = PipelineModel.deserializeFromBundle("jar:file:{}".format(pipeline_object_path))

But if I execute this I get the message:

{AttributeError}Cannot load _jvm from SparkContext. Is SparkContext initialized?
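To illustrate where this error comes from: the patched deserializeFromBundle delegates to the JVM through SparkContext._jvm, which only exists once a SparkContext has been started. A stdlib-only sketch with stub classes (not MLeap's actual code) reproduces the failure pattern:

```python
# Stub mimicking the check that fails: no active SparkContext means
# there is no _jvm gateway to delegate the deserialization to.
class SparkContextStub:
    _active = None   # no context has been started
    _jvm = None      # JVM gateway only exists after start-up


def deserialize_from_bundle(path):
    # Hypothetical stand-in for the patched PipelineModel method.
    if SparkContextStub._active is None or SparkContextStub._jvm is None:
        raise AttributeError(
            "Cannot load _jvm from SparkContext. Is SparkContext initialized?")
    return "pipeline"


try:
    deserialize_from_bundle("jar:file:/tmp/bundle.zip")
except AttributeError as e:
    err = str(e)

print(err)  # the same message Vincent saw
```

So the deserializeFromBundle route always presumes a running SparkContext; it does not give you context-free scoring from Python.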

I chose MLeap so that I could run my pipeline object without the Spark context. To me, this statement from the documentation is misleading:

Execute Spark ML Pipelines without the dependency on the spark context, distributed data frames, and costly execution plans http://mleap-docs.combust.ml/mleap-runtime/

The above statement only applies to Scala. If you use Scala/Spark I see the added value of MLeap, but if you use PySpark I don't really see the added value, because you could use the built-in pipeline from PySpark. Am I right?

Thanks for your response.

Vincent

vincentclaes · Nov 27 '19, 07:11

One important thing I would like to mention is that you can serialize your pipeline with PySpark and then use the mleap-runtime in Scala to execute it, achieving low latencies by not requiring the Spark context.

ancasarb · Jan 20 '20, 19:01

The one important thing I would like to mention is that you could serialize your pipeline with pyspark and you can use the mleap-runtime in scala to execute the pipeline, in order to achieve the low latencies by not requiring the Spark context.

My question is this: when the pipeline is serialized and one stage uses a custom transformer, which transformer is serialized, the Spark one or the MLeap one? In my case it's the Spark one. In which case, upon deserialization, how does MLeap know that it should use the MLeap one instead?

femibyte · Jan 24 '20, 04:01
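For context on the question above: a common way runtimes handle this (and a conceptual stdlib-only sketch, not MLeap's actual code) is to record an op name per stage in the bundle and resolve that name against the loading side's own registry, so the original Spark class never needs to exist at scoring time. A custom transformer then needs an implementation registered under the same op name on both sides:

```python
# Hypothetical op registry: each side registers its own implementation
# under a shared op name; deserialization looks up by name, not by class.
REGISTRY = {}


def register_op(name):
    def wrap(cls):
        REGISTRY[name] = cls
        return cls
    return wrap


@register_op("my_custom_transformer")  # name is the cross-side contract
class MyCustomTransformer:
    def transform(self, value):
        return value * 2


def load_stage(serialized):
    # The serialized stage carries only the op name and its parameters.
    return REGISTRY[serialized["op"]]()


stage = load_stage({"op": "my_custom_transformer"})
print(stage.transform(21))  # 42
```

Under that scheme, "which one is serialized" matters less than whether the op name written at serialization time has a registered counterpart in the runtime doing the loading.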