Allow selection of nested members of StructType
Any pointers to how this could be implemented? If I find the time, I would give it a try.
I experimented with that already quite some time ago, so I might not remember the problems I had with it in detail. But in general I would expect it to look more or less like this:
Somewhere in iskra:

```scala
//> using scala "3.2.1"

import scala.language.implicitConversions

trait UntypedColumn

trait DataType
trait StringType extends DataType
trait StructType[Schema <: Tuple] extends DataType

type Label = String & Singleton

class Column[+T <: DataType](val untyped: UntypedColumn)

object Column:
  // We should prefer `given Conversion` over `implicit def`,
  // but that would require importing `scala.language.implicitConversions` at use site
  // rather than at definition site, or enabling the `-language:implicitConversions`
  // option globally by the library's users, which is probably not what we would like.
  // given [T <: Tuple](using cv: ColumnView[T]): Conversion[Column[StructType[T]], cv.Out] = cv.view(_)
  implicit def structColumnAsSchemaView[T <: Tuple](column: Column[StructType[T]])(using cv: ColumnView[T]): cv.Out =
    cv.view(column)

@annotation.showAsInfix
class :=[L <: Label, T <: DataType](untyped: UntypedColumn) extends Column[T](untyped)

trait SchemaView extends Selectable:
  def selectDynamic(name: String): Any = ???

trait ColumnView[T <: Tuple]:
  type Out <: SchemaView
  def view(column: Column[StructType[T]]): Out

object ColumnView:
  // hard-coded example
  given ColumnView[("first" := StringType, "last" := StringType)] with
    type Out = SchemaView {
      def first: "first" := StringType
      def last: "last" := StringType
    }
    def view(column: Column[StructType[("first" := StringType, "last" := StringType)]]): Out =
      new SchemaView:
        def first: "first" := StringType = ???
        def last: "last" := StringType = ???

  // Generic macro implementation for any tuple type.
  // The macro should probably be easily extractable from `schemaViewExpr`.
  // transparent inline given [T <: Tuple]: ColumnView[T] = ${ /* ... */ }
```
At use site:

```scala
case class Name(first: String, last: String)

@main def test =
  val fullName: Column[StructType[("first" := StringType, "last" := StringType)]] = ???
  val firstName = fullName.first
```

The example is self-contained and has proper IDE support in Metals (showing exact types on hover, code completions, etc.). This should still work when we replace the hard-coded given instance of `ColumnView` with the generic one.
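For reference, here is a minimal standalone illustration (independent of iskra; all names are made up) of the `Selectable` mechanism the example relies on: a structural refinement gives each field a precise static type, while the actual lookup is routed through `selectDynamic` at runtime:

```scala
// Sketch of Scala 3's Selectable mechanism with hypothetical names:
// the refinement type plays the same role as `Out` in `ColumnView`.
class Record(values: Map[String, Any]) extends Selectable:
  def selectDynamic(name: String): Any = values(name)

type PersonRecord = Record { def first: String; def last: String }

@main def selectableDemo =
  val p = Record(Map("first" -> "Ada", "last" -> "Lovelace")).asInstanceOf[PersonRecord]
  val f: String = p.first // compiles thanks to the refinement; runs via selectDynamic
```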
@prolativ does that still apply to the iskra-next branch?
And would the `.*` selection be similar, or would it need a different approach?
Thank you!
Actually `.*` did work for some time, but keeping it alive slowed me down too much when I was experimenting with other stuff, so I removed support for it, and later I didn't have enough time to revive it. But making it work properly would require taking a few aspects into consideration:

- Should it return a tuple of columns or some other (custom) type representing multiple columns? This question is not specific to `.*` but concerns the general syntax of `.select` and similar methods. I'm still experimenting to find the best solution, taking into account things like code readability, source similarity to the original Spark API, speed of compilation, IDE friendliness, etc. Some of the problems that tuples have are (a sketch illustrating the first two follows this list):
  - Auto-tupling is going to be phased out in the future, so one would have to write `df.select(($.x, $.y))` with double parentheses instead of just `df.select($.x, $.y)`
  - Type constraints like `T <: Tuple` don't give us any guarantees about the exact type of each element of the tuple
  - Switching between `T` and `Tuple1[T]` is inconvenient and introduces more corner cases to handle
  - Scala 2 has a limit on the length of tuples (although I'm not sure yet if we're going to support Scala 2 in some form)
- Should it be possible only to write `df.select(($.x, $.y) ++ $.z.*)` (easier to implement), or should we also support `df.select($.x, $.y, $.z.*)` (more like traditional Spark syntax)?
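To make the first two tuple problems concrete, here is a minimal sketch (not the iskra API; all names are made up) of a tuple-typed `select`, showing both the double-parentheses requirement and the lack of per-element guarantees:

```scala
// Hypothetical sketch: `select` takes a single tuple parameter.
class Col[N <: String & Singleton](val name: N)

class DF:
  // `T <: Tuple` accepts any tuple, so nothing forces the elements to be columns
  def select[T <: Tuple](columns: T): T = columns

@main def tupleSelectDemo =
  val df = DF()
  val x = Col("x")
  val y = Col("y")
  val both = df.select((x, y))  // explicit Tuple2: the double parentheses are required
  // df.select(x, y)            // would rely on auto-tupling, which is being phased out
  val oops = df.select((x, 42)) // compiles too: `T <: Tuple` can't rule this out
```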
The main purpose of the iskra-next branch was to:

1. Distinguish between different kinds of columns that can or cannot be used in a specific context when the data type of a column isn't enough to tell, e.g. to prevent things like `df.groupBy($.string).agg(sum(sum($.ints)))` (trying to sum a sum would cause a runtime error); a sketch of this idea follows below
2. Experiment with possible support for Scala 2

(2) seems to require more effort than I originally expected, so I might try to extract the changes needed for (1), get them merged into the main branch, and leave further exploration of (2) for the more distant future.
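For illustration, a minimal sketch of what (1) could look like (simplified, hypothetical types rather than iskra-next's actual ones), using a phantom "kind" parameter so that an already aggregated column cannot be aggregated again:

```scala
// Hypothetical sketch of idea (1): a phantom kind prevents sum(sum(...)).
sealed trait Kind
sealed trait Plain extends Kind
sealed trait Aggregated extends Kind

class Col[K <: Kind, A](val name: String)

// `sum` only accepts plain columns and marks its result as aggregated,
// so nesting aggregations is rejected at compile time, not at runtime
def sum(c: Col[Plain, Int]): Col[Aggregated, Int] =
  Col[Aggregated, Int](s"sum(${c.name})")

@main def kindsDemo =
  val ints = Col[Plain, Int]("ints")
  val total = sum(ints) // fine
  // sum(sum(ints))     // does not compile: Col[Aggregated, Int] is not Col[Plain, Int]
```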
Besides what was mentioned above, I don't think there's much that would change with regard to `.*` in the new model of iskra-next.
> Actually `.*` did work for some time, but keeping it alive slowed me down too much when I was experimenting with other stuff, so I removed support for it, and later I didn't have enough time to revive it. But making it work properly would require taking a few aspects into consideration:
>
> - Should it return a tuple of columns or some other (custom) type representing multiple columns? This question is not specific to `.*` but concerns the general syntax of `.select` and similar methods. I'm still experimenting to find the best solution, taking into account things like code readability, source similarity to the original Spark API, speed of compilation, IDE friendliness, etc. Some of the problems that tuples have are:
>   - Auto-tupling is going to be phased out in the future, so one would have to write `df.select(($.x, $.y))` with double parentheses instead of just `df.select($.x, $.y)`
>   - Type constraints like `T <: Tuple` don't give us any guarantees about the exact type of each element of the tuple
I haven't done much Scala 3 development yet, but would an opaque type that wraps a tuple be an option?
> - Switching between `T` and `Tuple1[T]` is inconvenient and introduces more corner cases to handle

One thing that came to my mind here is the Magnet Pattern. I'm not sure if it has fallen out of favour, since I haven't heard about it for quite some time, but the original motivation for introducing it (making different things become something of the same kind) seems similar; a rough sketch follows below.
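For context, a minimal sketch of the classic magnet pattern (all names hypothetical), where heterogeneous arguments are implicitly converted into a single "magnet" type so that one method signature can accept them all:

```scala
import scala.language.implicitConversions

// Hypothetical sketch of the magnet pattern: one `select` signature,
// with implicit conversions turning different argument shapes into one type.
class Col(val name: String)

trait SelectionMagnet:
  def columns: Seq[Col]

object SelectionMagnet:
  given Conversion[Col, SelectionMagnet] with
    def apply(c: Col): SelectionMagnet = new SelectionMagnet:
      def columns = Seq(c)

  given Conversion[(Col, Col), SelectionMagnet] with
    def apply(t: (Col, Col)): SelectionMagnet = new SelectionMagnet:
      def columns = Seq(t._1, t._2)

def select(magnet: SelectionMagnet): Seq[Col] = magnet.columns

@main def magnetDemo =
  val x = Col("x")
  val y = Col("y")
  select(x)      // a single column is adapted via one conversion
  select((x, y)) // a pair of columns via another
```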
> - Scala 2 has a limit on the length of tuples (although I'm not sure yet if we're going to support Scala 2 in some form)

Imo Iskra should continue to explore how nice an API one can get with Scala 3 features, without being burdened by cross-compatibility. There are other options for Scala 2, e.g. Frameless and Doric.
> - Should it be possible only to write `df.select(($.x, $.y) ++ $.z.*)` (easier to implement), or should we also support `df.select($.x, $.y, $.z.*)` (more like traditional Spark syntax)?
If there's any possibility to implement it, I would say the version with just `,` is largely superior.
With auto-tupling removed, the only options would be to explicitly write the tuples, or use varargs which will probably lose the necessary type information, right?
> I haven't done much Scala 3 development yet, but would an opaque type that wraps a tuple be an option?
I don't think opaque types would be of any help here, unfortunately. The main advantages of tuples come from the fact that the compiler actually sees them as tuples, not as some other type.
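A quick illustration of why (a sketch with made-up names): outside its defining scope, an opaque alias no longer exposes any of the tuple operations:

```scala
// Hypothetical sketch: an opaque wrapper hides the tuple-ness of its contents.
object Wrapped:
  opaque type Cols[T <: Tuple] = T
  def apply[T <: Tuple](t: T): Cols[T] = t

@main def opaqueDemo =
  val c = Wrapped((1, "a"))
  // c ++ (2, "b")  // does not compile: `++` is defined for Tuple, but `Cols` is opaque
  // val (i, s) = c // does not compile: pattern matching needs the tuple structure
```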
> There are other options for Scala 2, e.g. Frameless and Doric.
Both of these libraries take a different approach than iskra, and I think we could do it better, but the question is how much effort that would require.
> With auto-tupling removed, the only options would be to explicitly write the tuples, or use varargs which will probably lose the necessary type information, right?
Inlined varargs seem like an interesting alternative, but some more experiments would be necessary to assess the advantages and drawbacks of this approach.
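For the record, here is a rough sketch of the inlined-varargs idea (hypothetical names; the macro has to live in a separate compilation unit from its call sites): a `transparent inline` method can hand its varargs to a macro, which still sees each argument's precise type and can, for example, rebuild them as a tuple:

```scala
import scala.quoted.*

object InlineVarargs:
  // Hypothetical sketch: the macro receives the varargs as individual Exprs,
  // each carrying its precise static type.
  transparent inline def select(inline columns: Any*): Tuple =
    ${ selectImpl('columns) }

  private def selectImpl(columns: Expr[Seq[Any]])(using Quotes): Expr[Tuple] =
    columns match
      case Varargs(args) =>
        // rebuilds e.g. select(1, "a") as (1, "a"): (Int, String),
        // so the element types are not erased to Any
        Expr.ofTupleFromSeq(args)
      case _ =>
        quotes.reflect.report.errorAndAbort("Explicit varargs expected")

// In a separate compilation unit (macros cannot be used where they are defined):
// val t = InlineVarargs.select(col1, col2) // t's type reflects both arguments' types
```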