
Schema for converting a PySpark Dataframe into a LeapFrame

Open femibyte opened this issue 7 years ago • 2 comments

I'm creating a schema from a PySpark DataFrame that can be used to build a corresponding LeapFrame. My question is, how do I handle non-scalar types, i.e. arrays?

My schema looks like this:

df.printSchema()
root
 |-- scalar_1: string (nullable = true)
 |-- scalar_2: double (nullable = true)
 |-- vector_1: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- vector_2: array (nullable = true)
 |    |-- element: date (containsNull = true)

If the PySpark DataFrame consisted purely of scalar columns, I would do the following:

schema = [{"name": field.simpleString().split(":")[0],
           "type": field.simpleString().split(":")[1]}
          for field in df.schema]

Any suggestions are welcome.
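One way to extend the scalar-only approach is to branch on the Spark type string instead of blindly splitting on `:`. A minimal sketch, assuming MLeap encodes list fields as `{"type": "list", "base": ...}` (check this against your MLeap version); with a live DataFrame the `(name, type)` pairs would come from `field.name` and `field.dataType.simpleString()`:

```python
def mleap_field(name, type_str):
    """Map a Spark simpleString type to an MLeap-style schema entry.

    Assumes MLeap represents array columns as a nested
    {"type": "list", "base": <element type>} object (hypothetical;
    verify against your MLeap version's leap frame JSON format).
    """
    if type_str.startswith("array<") and type_str.endswith(">"):
        # "array<string>" -> {"type": "list", "base": "string"}
        element = type_str[len("array<"):-1]
        return {"name": name, "type": {"type": "list", "base": element}}
    # Scalar columns pass through unchanged.
    return {"name": name, "type": type_str}


# Pairs as they would come from df.schema:
# (field.name, field.dataType.simpleString())
fields = [
    ("scalar_1", "string"),
    ("scalar_2", "double"),
    ("vector_1", "array<string>"),
    ("vector_2", "array<date>"),
]
schema = [mleap_field(name, t) for name, t in fields]
```

This avoids `simpleString().split(":")`, which would also mangle nested types like `array<date>`.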

femibyte avatar Dec 28 '18 11:12 femibyte

Hey @femibyte, we have converters for this, e.g. https://github.com/combust/mleap/blob/e177e5816250ff0dfae7f82e84317f288a107f31/mleap-spark-base/src/main/scala/ml/combust/mleap/spark/SparkSupport.scala#L63, that would do this for you. This is currently Scala-only, but I'll update the issue once I've gotten it to work in Python; you shouldn't have to do this conversion manually.

ancasarb avatar Mar 06 '19 11:03 ancasarb

Is this updated for Python?

bhrigs avatar Aug 31 '20 22:08 bhrigs