frameless icon indicating copy to clipboard operation
frameless copied to clipboard

Optional aggregation columns shortcut the computation to an empty dataset

Open imarios opened this issue 7 years ago • 1 comments

so it seems that if during an aggregation the first aggregate returns null, then it doesn't even care about the next results, it just returns nothing. Going to look if this is a Spark bug and not a Frameless one. Nope, the issue seems to be on our side of the fence

case class X[A,B](a: A, b: B)
val t = TypedDataset.create(X[Option[Int],Long](None, 0)::Nil)
t.show().run()
+----+---+
|   a|  b|
+----+---+
|null|  0|
+----+---+

scala> t.agg(first(t('a)), sum(t('b))).collect().run()
res: Seq[(Option[Int], Long)] = WrappedArray()

scala> t.agg(sum(t('b)), first(t('a))).collect().run()
res: Seq[(Long, Option[Int])] = WrappedArray((0,None))

Test that fails randomly due to this issue (NonAggregateFunctionsTests.scala).

 test("Empty vararg tests") {
    import frameless.functions.aggregate._
    def prop[A : TypedEncoder, B: TypedEncoder](data: Vector[X2[A, B]]) = {
      val ds = TypedDataset.create(data)
      val frameless = ds.select(ds('a), concat(), ds('b), concatWs(":")).collect().run().toVector
      val framelessAggr = ds.agg(first(ds('a)), concat(), concatWs("x"), litAggr(2)).collect().run().toVector
      val scala = data.map(x => (x.a, "", x.b, ""))
      val scalaAggr = if (data.nonEmpty) Vector((data.head.a, "", "", 2)) else Vector.empty
      (frameless ?= scala).&&(framelessAggr ?= scalaAggr)
    }

    check(forAll(prop[Long, Long] _))
    check(forAll(prop[Option[Vector[Boolean]], Long] _))
  }

imarios avatar Jan 30 '18 17:01 imarios

@OlivierBlanvillain, this was the reason master was failing. The doesn't doesn't always break only when it happens to generate an empty Option.

imarios avatar Jan 30 '18 17:01 imarios