Easier withColumn method

Open MrPowers opened this issue 5 years ago • 3 comments

Great work on this lib! It's a great way to write Spark code!

As discussed here and in the docs, withColumn requires a full schema when a column is added.

Here's the example in the docs:

case class CityBedsOther(city: String, bedrooms: Int, other: List[String])

cityBeds.
   withColumn[CityBedsOther](lit(List("a","b","c"))).
   show(1).run()

Couldn't we just assume that the schema stays the same for the existing columns and only supply the schema for the column that's being added?

cityBeds.
   withColumn[List[String]](lit(List("a","b","c"))).
   show(1).run()

I think this'd be a lot more user friendly. I'm often dealing with schemas that have tons of columns, and I add lots of columns with withColumn. Let me know your thoughts!

MrPowers avatar Oct 26 '20 16:10 MrPowers

Hey @MrPowers, sorry I missed this comment. I hear what you are saying. I definitely see this being easier, but unfortunately, this is nearly impossible to do. When you add a new column to CityBeds, you are essentially defining a new class, and unless you have already defined it, that class does not exist. That's why you have to define a new class and pass it as a type parameter to withColumn.

imarios avatar Dec 09 '20 05:12 imarios

@MrPowers I think I have a better answer for you. Say you have a TypedDataset[X], where case class X(i: Int, j: Int), and you want to add an extra column holding the sum of i and j.

val x: TypedDataset[X] = ???
val xNew: TypedDataset[(X, Int)] = x.select(x.asCol, x('i)+x('j))

As you can see, your schema changed from X to Tuple2[X, Int]. This way, you define a new column without losing the structure of X.
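
For readers without a Spark session at hand, here is a minimal plain-Scala sketch of the same idea on a local Seq. It is only an analogy, not frameless code: the object name, the withSum helper, and the sample values are hypothetical and exist purely for illustration.

```scala
// Plain-Scala analogy (no Spark or frameless required): pairing each record
// with a derived value adds a "column" while keeping the type X intact.
object TupleColumnSketch {
  final case class X(i: Int, j: Int)

  // Hypothetical helper that mirrors x.select(x.asCol, x('i) + x('j))
  // on an in-memory collection instead of a TypedDataset.
  def withSum(xs: Seq[X]): Seq[(X, Int)] =
    xs.map(x => (x, x.i + x.j))

  // Made-up sample rows.
  val sample: Seq[X] = Seq(X(1, 2), X(3, 4))

  // The element type is (X, Int), the local counterpart of TypedDataset[(X, Int)].
  val result: Seq[(X, Int)] = withSum(sample)
}
```

The original record stays available as the first element of each pair, so no new case class has to be declared for the extra value.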

imarios avatar Dec 11 '20 04:12 imarios

@imarios - Thanks for the detailed responses.

I started brainstorming the idea of typed columns, see here, and think this idea might be useful for frameless as well.

Columns are untyped and that's a big reason why Spark is so type unsafe. Typed columns can help us catch a lot more errors at compile time. When we run df("some_date") it returns an untyped column, but it should really return a DateColumn.

We could add a withTypedColumn function that'd take two arguments, a string and a typed column. We could infer the case class for the resulting Dataset (from the starting case class and the typed column) to build the new case class under the hood without making the user specify it manually. It could make this a lot more usable. Let me know your thoughts!
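
To make the idea a bit more concrete, here is a rough plain-Scala sketch on a local Seq. Everything in it is hypothetical: TypedColumn, DateColumn, year, and this withTypedColumn are not part of Spark or frameless. It also assumes the column name lives inside the typed column and that the result is a tuple (A, B) rather than a generated case class, since deriving a new case class would need macros or a generic-programming library.

```scala
import java.time.LocalDate

object TypedColumnSketch {
  // Hypothetical typed column: a name plus a function from a row A to a value B.
  final case class TypedColumn[A, B](name: String, eval: A => B)

  // A date column is a typed column whose values are LocalDate.
  type DateColumn[A] = TypedColumn[A, LocalDate]

  // Accepts date columns only; passing an Int column is a compile-time error.
  def year[A](col: DateColumn[A]): TypedColumn[A, Int] =
    TypedColumn[A, Int](s"year(${col.name})", (a: A) => col.eval(a).getYear)

  // Hypothetical withTypedColumn: the result type (A, B) is inferred from the
  // row type and the column type, so the caller never spells it out.
  def withTypedColumn[A, B](rows: Seq[A], col: TypedColumn[A, B]): Seq[(A, B)] =
    rows.map(a => (a, col.eval(a)))

  // Made-up row type and sample data.
  final case class Event(id: Int, someDate: LocalDate)

  val someDate: DateColumn[Event] =
    TypedColumn[Event, LocalDate]("some_date", (e: Event) => e.someDate)

  val events: Seq[Event] = Seq(Event(1, LocalDate.of(2020, 10, 26)))

  val withYear: Seq[(Event, Int)] = withTypedColumn(events, year(someDate))
}
```

Here a call such as year applied to an Int column would be rejected by the compiler, which is the kind of error an untyped column only surfaces at runtime.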

MrPowers avatar Feb 06 '21 03:02 MrPowers