Schema for converting a PySpark DataFrame into a LeapFrame
I'm creating a schema from a PySpark DataFrame that can be used to build a corresponding LeapFrame. My question is: how do I handle non-scalar types such as arrays?
My schema looks like this:
df.printSchema()
root
|-- scalar_1: string (nullable = true)
|-- scalar_2: double (nullable = true)
|-- vector_1: array (nullable = true)
| |-- element: string (containsNull = true)
|-- vector_2: array (nullable = true)
| |-- element: date (containsNull = true)
If the PySpark DataFrame consisted purely of scalar columns, I would do the following:
schema = [{"name": field.name,
           "type": field.dataType.simpleString()}
          for field in df.schema]
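For the array columns, one idea is to branch on each field's dataType and emit a nested type descriptor instead of a flat type string. Here is a sketch; note the {"type": "list", "base": ...} shape is just a guess at an encoding a LeapFrame builder could consume, not a documented MLeap format:

from pyspark.sql.types import ArrayType

def field_to_dict(field):
    # Scalar columns keep the flat {"name", "type"} shape from above.
    if not isinstance(field.dataType, ArrayType):
        return {"name": field.name, "type": field.dataType.simpleString()}
    # Array columns get a nested descriptor built from the element type.
    # NOTE: "list"/"base" is an assumed encoding, not necessarily what
    # MLeap's LeapFrame JSON actually expects.
    return {"name": field.name,
            "type": {"type": "list",
                     "base": field.dataType.elementType.simpleString()}}

schema = [field_to_dict(field) for field in df.schema]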
Any suggestions are welcome.
Hey @femibyte, we have converters for this, e.g. https://github.com/combust/mleap/blob/e177e5816250ff0dfae7f82e84317f288a107f31/mleap-spark-base/src/main/scala/ml/combust/mleap/spark/SparkSupport.scala#L63, which would do this for you. They're currently Scala-only, but I'll update the issue once I've gotten them working in Python; you shouldn't have to do this conversion manually.
Is this updated for Python?