Missing Dataset methods
Here is an exhaustive status of the API implemented by frameless.TypedDataset compared to Spark's Dataset. We are getting pretty close to 100% API coverage :smile:
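For context, a minimal sketch of the typed API this checklist refers to. It assumes an implicit SparkSession in scope and a frameless version with symbol-based column selection; the Apartment case class and its fields are illustrative, not part of frameless:

```scala
import org.apache.spark.sql.SparkSession
import frameless.TypedDataset

case class Apartment(city: String, surface: Int)

implicit val spark: SparkSession =
  SparkSession.builder().master("local[*]").appName("frameless-sketch").getOrCreate()

// A TypedDataset wraps a Dataset[Apartment]; columns are referenced by symbol
// and resolved against the case class at compile time.
val apartments: TypedDataset[Apartment] =
  TypedDataset.create(Seq(Apartment("Paris", 50), Apartment("Lyon", 45)))

// Typed counterparts of select/filter from the "Done" list below.
val surfaces = apartments.select(apartments('surface))
val inParis  = apartments.filter(apartments('city) === "Paris")

// Actions return a Job and only execute when run() is called.
inParis.show().run()
```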
Won't fix (see the sketch after this list for why these are rejected):
- [ ] Dataset<T> alias(String alias) (inherently unsafe)
- [ ] Dataset<Row> withColumnRenamed(String existingName, String newName) (inherently unsafe)
- [ ] void createGlobalTempView(String viewName) (inherently unsafe)
- [ ] void createOrReplaceTempView(String viewName) (inherently unsafe)
- [ ] void createTempView(String viewName) (inherently unsafe)
- [ ] void registerTempTable(String tableName) (inherently unsafe)
- [ ] Dataset<T> where(String conditionExpr) (use select instead)
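These methods take column names or SQL fragments as plain strings, so mistakes surface only at runtime. A short sketch of the contrast, reusing the illustrative Apartment dataset from the intro (the misspelled column name is deliberate):

```scala
// Vanilla Spark on the underlying DataFrame: a misspelled column compiles fine
// and only fails at runtime with an AnalysisException.
apartments.dataset.toDF().where("cty = 'Paris'")

// frameless: a misspelled column is a compile-time error, so there is nothing
// safe to expose for the string-based variants listed above.
// apartments.filter(apartments('cty) === "Paris")   // does not compile
apartments.filter(apartments('city) === "Paris")     // compiles, typed
```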
TODO:
- [ ] <K> KeyValueGroupedDataset<K,T> groupByKey(scala.Function1<T,K> func, Encoder<K> evidence3)
- [ ] DataFrameNaFunctions na()
- [ ] DataFrameStatFunctions stat()
- [ ] Dataset<T> dropDuplicates(String col1, String... cols)
- [ ] Dataset<Row> describe(String... cols)
- [ ] DataStreamWriter<T> writeStream() (see #232)
- [ ] Dataset<T> withWatermark(String eventTime, String delayThreshold) (see #232)
- [x] RelationalGroupedDataset cube(Column... cols) (WIP #246)
- [x] RelationalGroupedDataset rollup(String col1, String... cols) (WIP #246)
Done:
- [x] Dataset<T> sort(String sortCol, String... sortCols) (#248)
- [x] Dataset<T> sortWithinPartitions(String sortCol, String... sortCols) (#248)
- [x] Dataset<T> repartition(int numPartitions, Column... partitionExprs)
- [x] Dataset<Row> drop(String... colNames) (#209)
- [x] Dataset<Row> join(Dataset<?> right, Column joinExprs, String joinType)
- [x] <U> Dataset<scala.Tuple2<T,U>> joinWith(Dataset<U> other, Column condition, String joinType)
- [x] Dataset<Row> crossJoin(Dataset<?> right)
- [x] Dataset<Row> agg(Column expr, Column... exprs)
- [x] Column apply(String colName)
- [x] <U> Dataset<U> as(Encoder<U> evidence2)
- [x] Dataset<T> cache()
- [x] Dataset<T> coalesce(int numPartitions)
- [x] Column col(String colName)
- [x] Object collect()
- [x] long count()
- [x] Dataset<T> distinct()
- [x] Dataset<T> except(Dataset<T> other)
- [x] void explain(boolean extended)
- [x] <A,B> Dataset<Row> explode(String inputColumn, String outputColumn, scala.Function1<A,TraversableOnce<B>> f)
- [x] Dataset<T> filter(Column condition)
- [x] Dataset<T> filter(scala.Function1<T,Object> func)
- [x] T first() (as firstOption)
- [x] <U> Dataset<U> flatMap(scala.Function1<T,TraversableOnce<U>> func, Encoder<U> evidence8)
- [x] void foreach(ForeachFunction<T> func)
- [x] void foreachPartition(scala.Function1<Iterator<T>,scala.runtime.BoxedUnit> f)
- [x] RelationalGroupedDataset groupBy(String col1, String... cols)
- [x] Dataset<T> intersect(Dataset<T> other)
- [x] Dataset<T> limit(int n)
- [x] <U> Dataset<U> map(scala.Function1<T,U> func, Encoder<U> evidence6)
- [x] <U> Dataset<U> mapPartitions(MapPartitionsFunction<T,U> f, Encoder<U> encoder)
- [x] Dataset<T> persist(StorageLevel newLevel)
- [x] void printSchema()
- [x] RDD<T> rdd()
- [x] T reduce(scala.Function2<T,T,T> func) (as reduceOption)
- [x] Dataset<T> repartition(int numPartitions)
- [x] Dataset<T> sample(boolean withReplacement, double fraction, long seed)
- [x] Dataset<Row> select(String col, String... cols)
- [x] void show(int numRows, boolean truncate)
- [x] Object take(int n)
- [x] Dataset<Row> toDF()
- [x] String toString()
- [x] <U> Dataset<U> transform(scala.Function1<Dataset<T>,Dataset<U>> t)
- [x] Dataset<T> union(Dataset<T> other)
- [x] Dataset<T> unpersist(boolean blocking)
- [x] Dataset<Row> withColumn(String colName, Column col)
- [x] Dataset<T> orderBy(String sortCol, String... sortCols)
- [x] String[] columns()
- [x] org.apache.spark.sql.execution.QueryExecution queryExecution()
- [x] StructType schema()
- [x] SparkSession sparkSession()
- [x] SQLContext sqlContext()
- [x] Dataset<T> checkpoint(boolean eager)
- [x] String[] inputFiles()
- [x] boolean isLocal()
- [x] boolean isStreaming()
- [x] Dataset<T>[] randomSplit(double[] weights, long seed)
- [x] StorageLevel storageLevel()
- [x] Dataset<String> toJSON()
- [x] java.util.Iterator<T> toLocalIterator()
- [x] DataFrameWriter<T> write()
<A,B> Dataset<Row> explode(String inputColumn, String outputColumn, scala.Function1<A,TraversableOnce<B>> f) is not working for Map-typed columns, while vanilla Spark supports it.
Yes, I was not able to fit Map in because its type signature has two holes, compared to one for all the others. I think we can have an overloaded method just for Map.
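For reference, a small sketch of where the extra "hole" comes from: in vanilla Spark, exploding a Map column yields two output columns (key and value), while exploding an array yields one, so a Map-specific overload would need two output type parameters. The DataFrames below are illustrative only and reuse the spark session from the sketch in the intro:

```scala
import org.apache.spark.sql.functions.explode

// Vanilla Spark: explode on a Map column produces two columns, key and value.
val mapped = spark.createDataFrame(Seq((1, Map("a" -> 10, "b" -> 20)))).toDF("id", "m")
mapped.select(mapped("id"), explode(mapped("m"))).show()
// resulting columns: id, key, value

// Exploding an array produces a single column, matching the one-hole signature
// of the existing explode(String, String)(f: A => TraversableOnce[B]).
val arrayed = spark.createDataFrame(Seq((1, Seq(10, 20)))).toDF("id", "xs")
arrayed.select(arrayed("id"), explode(arrayed("xs"))).show()
// resulting columns: id, col
```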
writeStream can be marked as done here.