
Missing Dataset methods

Open OlivierBlanvillain opened this issue 8 years ago • 3 comments

Here is an exhaustive status of the API implemented by frameless.TypedDataset compared to Spark's Dataset. We are getting pretty close to 100% API coverage :smile:

Won't fix:

  • [ ] Dataset<T> alias(String alias) (inherently unsafe)
  • [ ] Dataset<Row> withColumnRenamed(String existingName, String newName) (inherently unsafe)
  • [ ] void createGlobalTempView(String viewName) (inherently unsafe)
  • [ ] void createOrReplaceTempView(String viewName) (inherently unsafe)
  • [ ] void createTempView(String viewName) (inherently unsafe)
  • [ ] void registerTempTable(String tableName) (inherently unsafe)
  • [ ] Dataset<T> where(String conditionExpr) (use select instead)
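Each of these takes a raw string that is only parsed and resolved at runtime, which is exactly what frameless is designed to avoid. A minimal sketch of the contrast, in spark-shell style and assuming a local SparkSession (the Person example is illustrative, not from this issue):

```scala
import org.apache.spark.sql.SparkSession
import frameless.TypedDataset

case class Person(name: String, age: Int)

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

val ds = Seq(Person("Ada", 36)).toDS()

// Vanilla Spark: a misspelled column name only fails once the plan is analyzed.
ds.where("agee = 36") // throws AnalysisException at runtime

// frameless: the column reference is checked by the compiler.
val tds = TypedDataset.create(ds)
tds.filter(tds('age) === 36) // tds('agee) would not compile
```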

TODO:

  • [ ] <K> KeyValueGroupedDataset<K,T> groupByKey(scala.Function1<T,K> func, Encoder<K> evidence3)
  • [ ] DataFrameNaFunctions na()
  • [ ] DataFrameStatFunctions stat()
  • [ ] Dataset<T> dropDuplicates(String col1, String... cols)
  • [ ] Dataset<Row> describe(String... cols)
  • [ ] DataStreamWriter<T> writeStream() (see #232)
  • [ ] Dataset<T> withWatermark(String eventTime, String delayThreshold) (see #232)
  • [x] RelationalGroupedDataset cube(Column... cols) (WIP #246)
  • [x] RelationalGroupedDataset rollup(String col1, String... cols) (WIP #246)

Done:

  • [x] Dataset<T> sort(String sortCol, String... sortCols) (#248)
  • [x] Dataset<T> sortWithinPartitions(String sortCol, String... sortCols) (#248)
  • [x] Dataset<T> repartition(int numPartitions, Column... partitionExprs)
  • [x] Dataset<Row> drop(String... colNames) (#209)
  • [x] Dataset<Row> join(Dataset<?> right, Column joinExprs, String joinType)
  • [x] <U> Dataset<scala.Tuple2<T,U>> joinWith(Dataset<U> other, Column condition, String joinType)
  • [x] Dataset<Row> crossJoin(Dataset<?> right)
  • [x] Dataset<Row> agg(Column expr, Column... exprs)
  • [x] Column apply(String colName)
  • [x] <U> Dataset<U> as(Encoder<U> evidence2)
  • [x] Dataset<T> cache()
  • [x] Dataset<T> coalesce(int numPartitions)
  • [x] Column col(String colName)
  • [x] Object collect()
  • [x] long count()
  • [x] Dataset<T> distinct()
  • [x] Dataset<T> except(Dataset<T> other)
  • [x] void explain(boolean extended)
  • [x] <A,B> Dataset<Row> explode(String inputColumn, String outputColumn, scala.Function1<A,TraversableOnce<B>> f)
  • [x] Dataset<T> filter(Column condition)
  • [x] Dataset<T> filter(scala.Function1<T,Object> func)
  • [x] T first() (as firstOption)
  • [x] <U> Dataset<U> flatMap(scala.Function1<T,TraversableOnce<U>> func, Encoder<U> evidence8)
  • [x] void foreach(ForeachFunction<T> func)
  • [x] void foreachPartition(scala.Function1<Iterator<T>,scala.runtime.BoxedUnit> f)
  • [x] RelationalGroupedDataset groupBy(String col1, String... cols)
  • [x] Dataset<T> intersect(Dataset<T> other)
  • [x] Dataset<T> limit(int n)
  • [x] <U> Dataset<U> map(scala.Function1<T,U> func, Encoder<U> evidence6)
  • [x] <U> Dataset<U> mapPartitions(MapPartitionsFunction<T,U> f, Encoder<U> encoder)
  • [x] Dataset<T> persist(StorageLevel newLevel)
  • [x] void printSchema()
  • [x] RDD<T> rdd()
  • [x] T reduce(scala.Function2<T,T,T> func) (as reduceOption)
  • [x] Dataset<T> repartition(int numPartitions)
  • [x] Dataset<T> sample(boolean withReplacement, double fraction, long seed)
  • [x] Dataset<Row> select(String col, String... cols)
  • [x] void show(int numRows, boolean truncate)
  • [x] Object take(int n)
  • [x] Dataset<Row> toDF()
  • [x] String toString()
  • [x] <U> Dataset<U> transform(scala.Function1<Dataset<T>,Dataset<U>> t)
  • [x] Dataset<T> union(Dataset<T> other)
  • [x] Dataset<T> unpersist(boolean blocking)
  • [x] Dataset<Row> withColumn(String colName, Column col)
  • [x] Dataset<T> orderBy(String sortCol, String... sortCols)
  • [x] String[] columns()
  • [x] org.apache.spark.sql.execution.QueryExecution queryExecution()
  • [x] StructType schema()
  • [x] SparkSession sparkSession()
  • [x] SQLContext sqlContext()
  • [x] Dataset<T> checkpoint(boolean eager)
  • [x] String[] inputFiles()
  • [x] boolean isLocal()
  • [x] boolean isStreaming()
  • [x] Dataset<T>[] randomSplit(double[] weights, long seed)
  • [x] StorageLevel storageLevel()
  • [x] Dataset<String> toJSON()
  • [x] java.util.Iterator<T> toLocalIterator()
  • [x] DataFrameWriter<T> write()

OlivierBlanvillain avatar Aug 08 '17 09:08 OlivierBlanvillain

<A,B> Dataset<Row> explode(String inputColumn, String outputColumn, scala.Function1<A,TraversableOnce<B>> f) does not work for Map-typed columns, while vanilla Spark supports it.
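For reference, here is a minimal illustration of the Map case working in vanilla Spark, via functions.explode rather than the deprecated Dataset method (assumes a local SparkSession; the data is made up):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

val ds = Seq(("a", Map(1 -> "x", 2 -> "y"))).toDS()

// Exploding a Map column yields one row per entry,
// splitting each entry into `key` and `value` columns.
ds.select($"_1", explode($"_2")).show()
// +---+---+-----+
// | _1|key|value|
// +---+---+-----+
// |  a|  1|    x|
// |  a|  2|    y|
// +---+---+-----+
```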

snadorp avatar Jul 25 '18 12:07 snadorp

Yes, I was not able to fit Map because its type signature has two holes compared to one for all the others. I think we could add an overloaded method just for Map, as sketched below.
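To make the kind mismatch concrete, here is a rough sketch of the signatures involved (hypothetical shapes to illustrate the point, not frameless's actual code):

```scala
import frameless.{TypedColumn, TypedDataset}

// Hypothetical signatures, only to show the type-hole mismatch.
trait ExplodeSketch[T] {
  // The existing explode abstracts over a container with a single
  // type hole, F[_], e.g. List[A] or Vector[A].
  def explode[A, F[X] <: TraversableOnce[X]](
      column: TypedColumn[T, F[A]]): TypedDataset[(T, A)]

  // Map[K, V] has two type holes, so it cannot be unified with F[_];
  // a dedicated overload would be needed, e.g.:
  def explodeMap[K, V](
      column: TypedColumn[T, Map[K, V]]): TypedDataset[(T, K, V)]
}
```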

imarios avatar Jul 25 '18 15:07 imarios

writeStream can be marked as done here.

etspaceman avatar Jul 14 '23 20:07 etspaceman