Allow selection of nested members of StructType
Any pointers to how this could be implemented? If I find the time, I would give it a try.
I experimented with that already quite some time ago, so I might not remember the problems I had with it in detail. But in general I would expect it to look more or less like this:
Somewhere in iskra:

```scala
//> using scala "3.2.1"

import scala.language.implicitConversions

trait UntypedColumn

trait DataType
trait StringType extends DataType
trait StructType[Schema <: Tuple] extends DataType

type Label = String & Singleton

class Column[+T <: DataType](val untyped: UntypedColumn)

object Column:
  // We should prefer `given Conversion` over `implicit def`,
  // but that would require importing `scala.language.implicitConversions` at use site
  // rather than at definition site, or enabling the `-language:implicitConversions`
  // option globally by the library's users, which is probably not what we would like.
  // given [T <: Tuple](using cv: ColumnView[T]): Conversion[Column[StructType[T]], cv.Out] = cv.view(_)
  implicit def structColumnAsSchemaView[T <: Tuple](column: Column[StructType[T]])(using cv: ColumnView[T]): cv.Out =
    cv.view(column)

@annotation.showAsInfix
class :=[L <: Label, T <: DataType](untyped: UntypedColumn) extends Column[T](untyped)

trait SchemaView extends Selectable:
  def selectDynamic(name: String): Any = ???

trait ColumnView[T <: Tuple]:
  type Out <: SchemaView
  def view(column: Column[StructType[T]]): Out

object ColumnView:
  // hard-coded example
  given ColumnView[("first" := StringType, "last" := StringType)] with
    type Out = SchemaView {
      def first: "first" := StringType
      def last: "last" := StringType
    }
    def view(column: Column[StructType[("first" := StringType, "last" := StringType)]]): Out =
      new SchemaView:
        def first: "first" := StringType = ???
        def last: "last" := StringType = ???

  // Generic macro implementation for any tuple type.
  // The macro should probably be easily extractable from `schemaViewExpr`.
  // transparent inline given [T <: Tuple]: ColumnView[T] = ${ /* ... */ }
```
At use site:

```scala
case class Name(first: String, last: String)

@main def test =
  val fullName: Column[StructType[("first" := StringType, "last" := StringType)]] = ???
  val firstName = fullName.first
```

The example is self-contained and has proper IDE support in Metals (showing exact types on hover, code completions, etc.). This should still work when we replace the hard-coded given instance of `ColumnView` with the generic one.
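For reference, here is a minimal standalone illustration (independent of iskra; all names are made up) of the `Selectable` mechanism the example relies on: a structural refinement gives each field a precise static type, while the actual lookup is routed through `selectDynamic` at runtime:

```scala
// Sketch of Scala 3's Selectable mechanism with hypothetical names:
// the refinement type plays the same role as `Out` in `ColumnView`.
class Record(values: Map[String, Any]) extends Selectable:
  def selectDynamic(name: String): Any = values(name)

type PersonRecord = Record { def first: String; def last: String }

@main def selectableDemo =
  val p = Record(Map("first" -> "Ada", "last" -> "Lovelace")).asInstanceOf[PersonRecord]
  val f: String = p.first // compiles thanks to the refinement; runs via selectDynamic
```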
@prolativ does that still apply to the iskra-next branch?
And would the `.*` selection be similar, or would it need a different approach?
Thank you!
Actually `.*` did work for some time, but keeping it alive slowed me down too much when I was experimenting with other stuff, so I removed support for it, and later I didn't have enough time to revive it. But making it work properly would require taking a few aspects into consideration:

- Should it return a tuple of columns or some other (custom) type representing multiple columns? This question is not specific to `.*` but concerns the general syntax of `.select` and similar methods. I'm still experimenting to find the best solution, taking into account things like code readability, source similarity to the original Spark API, speed of compilation, IDE friendliness, etc. Some of the problems that tuples have are (a sketch illustrating the first two follows this list):
  - Auto-tupling is going to be phased out in the future, so one would have to write `df.select(($.x, $.y))` with double parentheses instead of just `df.select($.x, $.y)`
  - Type constraints like `T <: Tuple` don't give us any guarantees about the exact type of each element of the tuple
  - Switching between `T` and `Tuple1[T]` is inconvenient and introduces more corner cases to handle
  - Scala 2 has a limit on the length of tuples (although I'm not sure yet if we're going to support Scala 2 in some form)
- Should it be possible only to write `df.select(($.x, $.y) ++ $.z.*)` (easier to implement), or should we also support `df.select($.x, $.y, $.z.*)` (more like traditional Spark syntax)?
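To make the first two tuple problems concrete, here is a minimal sketch (not the iskra API; all names are made up) of a tuple-typed `select`, showing both the double-parentheses requirement and the lack of per-element guarantees:

```scala
// Hypothetical sketch: `select` takes a single tuple parameter.
class Col[N <: String & Singleton](val name: N)

class DF:
  // `T <: Tuple` accepts any tuple, so nothing forces the elements to be columns
  def select[T <: Tuple](columns: T): T = columns

@main def tupleSelectDemo =
  val df = DF()
  val x = Col("x")
  val y = Col("y")
  val both = df.select((x, y))  // explicit Tuple2: the double parentheses are required
  // df.select(x, y)            // would rely on auto-tupling, which is being phased out
  val oops = df.select((x, 42)) // compiles too: `T <: Tuple` can't rule this out
```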
The main purpose of the iskra-next branch was to:

1. Distinguish between different kinds of columns that can or cannot be used in a specific context when the data type of a column isn't enough to tell, e.g. to prevent things like `df.groupBy($.string).agg(sum(sum($.ints)))` (trying to sum a sum would cause a runtime error); a sketch of this idea follows below
2. Experiment with possible support for Scala 2

(2) seems to require more effort than I originally expected, so I might try to extract the changes needed for (1), get them merged into the main branch, and leave further exploration of (2) for the more distant future.
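For illustration, a minimal sketch of what (1) could look like (simplified, hypothetical types rather than iskra-next's actual ones), using a phantom "kind" parameter so that an already aggregated column cannot be aggregated again:

```scala
// Hypothetical sketch of idea (1): a phantom kind prevents sum(sum(...)).
sealed trait Kind
sealed trait Plain extends Kind
sealed trait Aggregated extends Kind

class Col[K <: Kind, A](val name: String)

// `sum` only accepts plain columns and marks its result as aggregated,
// so nesting aggregations is rejected at compile time, not at runtime
def sum(c: Col[Plain, Int]): Col[Aggregated, Int] =
  Col[Aggregated, Int](s"sum(${c.name})")

@main def kindsDemo =
  val ints = Col[Plain, Int]("ints")
  val total = sum(ints) // fine
  // sum(sum(ints))     // does not compile: Col[Aggregated, Int] is not Col[Plain, Int]
```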
Besides what was mentioned above, I don't think there's much that would change with regard to `.*` in the new model of iskra-next.
> Actually `.*` did work for some time, but keeping it alive slowed me down too much when I was experimenting with other stuff, so I removed support for it, and later I didn't have enough time to revive it. But making it work properly would require taking a few aspects into consideration:
>
> - Should it return a tuple of columns or some other (custom) type representing multiple columns? This question is not specific to `.*` but concerns the general syntax of `.select` and similar methods. I'm still experimenting to find the best solution, taking into account things like code readability, source similarity to the original Spark API, speed of compilation, IDE friendliness, etc. Some of the problems that tuples have are:
>   - Auto-tupling is going to be phased out in the future, so one would have to write `df.select(($.x, $.y))` with double parentheses instead of just `df.select($.x, $.y)`
>   - Type constraints like `T <: Tuple` don't give us any guarantees about the exact type of each element of the tuple
I haven't done much Scala 3 development yet, but would an opaque type that wraps a tuple be an option?
> - Switching between `T` and `Tuple1[T]` is inconvenient and introduces more corner cases to handle

One thing that came to my mind here is the Magnet Pattern. I'm not sure if it has fallen out of favour, since I haven't heard about it for quite some time, but the original motivation for introducing it (making different things become something of the same kind) seems similar; a rough sketch follows below.
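For context, a minimal sketch of the classic magnet pattern (all names hypothetical), where heterogeneous arguments are implicitly converted into a single "magnet" type so that one method signature can accept them all:

```scala
import scala.language.implicitConversions

// Hypothetical sketch of the magnet pattern: one `select` signature,
// with implicit conversions turning different argument shapes into one type.
class Col(val name: String)

trait SelectionMagnet:
  def columns: Seq[Col]

object SelectionMagnet:
  given Conversion[Col, SelectionMagnet] with
    def apply(c: Col): SelectionMagnet = new SelectionMagnet:
      def columns = Seq(c)

  given Conversion[(Col, Col), SelectionMagnet] with
    def apply(t: (Col, Col)): SelectionMagnet = new SelectionMagnet:
      def columns = Seq(t._1, t._2)

def select(magnet: SelectionMagnet): Seq[Col] = magnet.columns

@main def magnetDemo =
  val x = Col("x")
  val y = Col("y")
  select(x)      // a single column is adapted via one conversion
  select((x, y)) // a pair of columns via another
```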
> - Scala 2 has a limit on the length of tuples (although I'm not sure yet if we're going to support Scala 2 in some form)

Imo Iskra should continue to explore how nice an API one can get with Scala 3 features, without being burdened by cross-compatibility. There are other options for Scala 2, e.g. Frameless and Doric.
> - Should it be possible only to write `df.select(($.x, $.y) ++ $.z.*)` (easier to implement), or should we also support `df.select($.x, $.y, $.z.*)` (more like traditional Spark syntax)?
If there's any possibility to implement it, I would say the version with just `,` is largely superior.
With auto-tupling removed, the only options would be to explicitly write the tuples, or use varargs which will probably lose the necessary type information, right?
> I haven't done much Scala 3 development yet, but would an opaque type that wraps a tuple be an option?
I don't think opaque types would be of any help here, unfortunately. The main advantages of tuples come from the fact that the compiler actually sees them as tuples, not as some other type.
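A quick illustration of why (a sketch with made-up names): outside its defining scope, an opaque alias no longer exposes any of the tuple operations:

```scala
// Hypothetical sketch: an opaque wrapper hides the tuple-ness of its contents.
object Wrapped:
  opaque type Cols[T <: Tuple] = T
  def apply[T <: Tuple](t: T): Cols[T] = t

@main def opaqueDemo =
  val c = Wrapped((1, "a"))
  // c ++ (2, "b")  // does not compile: `++` is defined for Tuple, but `Cols` is opaque
  // val (i, s) = c // does not compile: pattern matching needs the tuple structure
```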
> There are other options for Scala 2, e.g. Frameless and Doric.
Both of these libraries take a different approach than iskra, and I think we could do it better, but the question is how much effort that would require.
> With auto-tupling removed, the only options would be to explicitly write the tuples, or use varargs which will probably lose the necessary type information, right?
Inlined varargs seem like an interesting alternative, but some more experiments would be necessary to assess the advantages and drawbacks of this approach.
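For the record, here is a rough sketch of the inlined-varargs idea (hypothetical names; the macro has to live in a separate compilation unit from its call sites): a `transparent inline` method can hand its varargs to a macro, which still sees each argument's precise type and can, for example, rebuild them as a tuple:

```scala
import scala.quoted.*

object InlineVarargs:
  // Hypothetical sketch: the macro receives the varargs as individual Exprs,
  // each carrying its precise static type.
  transparent inline def select(inline columns: Any*): Tuple =
    ${ selectImpl('columns) }

  private def selectImpl(columns: Expr[Seq[Any]])(using Quotes): Expr[Tuple] =
    columns match
      case Varargs(args) =>
        // rebuilds e.g. select(1, "a") as (1, "a"): (Int, String),
        // so the element types are not erased to Any
        Expr.ofTupleFromSeq(args)
      case _ =>
        quotes.reflect.report.errorAndAbort("Explicit varargs expected")

// In a separate compilation unit (macros cannot be used where they are defined):
// val t = InlineVarargs.select(col1, col2) // t's type reflects both arguments' types
```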