
Epic: Complete ROW Format (Missing features)

yjshen opened this issue on Feb 18 '22

Goal: a complete row implementation, fully used in pipeline-breaking operators where possible.

Summary (TL;DR): the key focus of this work is to speed up fundamentally row-oriented operations such as hash table lookups and tuple comparisons (e.g. #2427).

Background

DataFusion, like many Arrow-based systems, is a classic "vectorized computation engine", which works quite well for many common operations. The following paper gives a good treatment of the tradeoffs between vectorized and JIT-compiled query plans: https://db.in.tum.de/~kersten/vectorization_vs_compilation.pdf?lang=de

As mentioned in the paper, there are some fundamentally "row-oriented" operations in a database that are not typically amenable to vectorization. The classics are hash table updates in joins and hash aggregates, and tuple comparisons in sorts.

When operating with a row-based format, the per-tuple type dispatch overhead becomes significant, so such operations are typically implemented using just-in-time (JIT) compilation or other unsafe mechanisms to minimize that overhead.
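
To make that dispatch cost concrete, here is a minimal sketch (the `Value` enum is a hypothetical stand-in for `ScalarValue`, not DataFusion code): comparing two rows field by field forces a type `match` on every field of every row, and that inner branch is exactly what schema-specialized or JIT-compiled accessors remove.

```rust
use std::cmp::Ordering;

// Hypothetical, simplified value type for illustration -- not DataFusion's
// actual DataType/ScalarValue definitions.
enum Value {
    Int64(i64),
    Float64(f64),
    Utf8(String),
}

/// Naive row comparison: every field of every row pays a type-dispatch branch
/// in the inner `match`.
fn compare_rows(a: &[Value], b: &[Value]) -> Ordering {
    for (x, y) in a.iter().zip(b.iter()) {
        let ord = match (x, y) {
            (Value::Int64(l), Value::Int64(r)) => l.cmp(r),
            (Value::Float64(l), Value::Float64(r)) => {
                l.partial_cmp(r).unwrap_or(Ordering::Equal)
            }
            (Value::Utf8(l), Value::Utf8(r)) => l.cmp(r),
            _ => panic!("rows do not match the same schema"),
        };
        if ord != Ordering::Equal {
            return ord;
        }
    }
    Ordering::Equal
}
```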

@yjshen added initial support for JIT'ing in https://github.com/apache/arrow-datafusion/pull/1849 and it currently lives in https://github.com/apache/arrow-datafusion/tree/master/datafusion/jit. He also added partial support for aggregates in https://github.com/apache/arrow-datafusion/pull/2375

This ticket tracks the remaining work to fully support row formats, including JIT'ing.

Getters and setters

  • [X] #1891
  • [ ] 1. Support all types that ScalarValue supports
    • [ ] 1.1 all basic types
      • [ ] 1.1.1 Decimal
      • [ ] 1.1.2 Timestamp
      • [ ] 1.1.3 Date
      • [ ] 1.1.4 Interval
      • [ ] 1.1.5 Null
    • [ ] 1.2 composite types: List / Struct
  • [ ] 2. Make varlena offset + length a type parameter for reader and writer, for space efficiency (see the sketch after this list)
  • [ ] 3. Assert against the schema before getting a value; Date64 is an example.
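
As a hedged sketch of item 2 above (illustrative names only, not DataFusion APIs), the varlena offset/length width can be made a type parameter of the row writer, so short rows can store offsets and lengths as `u16` or `u32` instead of a fixed wider width:

```rust
/// Width of the varlena offset/length entries, chosen per row layout.
trait OffsetSize: Copy {
    fn from_usize(v: usize) -> Self;
    fn write_to(self, buf: &mut Vec<u8>);
}

impl OffsetSize for u16 {
    fn from_usize(v: usize) -> Self { v as u16 }
    fn write_to(self, buf: &mut Vec<u8>) { buf.extend_from_slice(&self.to_le_bytes()); }
}

impl OffsetSize for u32 {
    fn from_usize(v: usize) -> Self { v as u32 }
    fn write_to(self, buf: &mut Vec<u8>) { buf.extend_from_slice(&self.to_le_bytes()); }
}

/// Hypothetical, simplified row writer: a fixed-width section of
/// (offset, length) pairs followed by the varlena payload.
struct RowWriter<O: OffsetSize> {
    fixed: Vec<u8>,
    varlena: Vec<u8>,
    _marker: std::marker::PhantomData<O>,
}

impl<O: OffsetSize> RowWriter<O> {
    fn new() -> Self {
        Self { fixed: Vec::new(), varlena: Vec::new(), _marker: std::marker::PhantomData }
    }

    /// Append a varlena value; only the narrow offset/length go in the fixed section.
    fn append_bytes(&mut self, value: &[u8]) {
        O::from_usize(self.varlena.len()).write_to(&mut self.fixed);
        O::from_usize(value.len()).write_to(&mut self.fixed);
        self.varlena.extend_from_slice(value);
    }
}
```

A `RowWriter::<u16>` would then be chosen when the row is known to be short, and `RowWriter::<u32>` otherwise.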

Formats

  • [ ] 1. basics: #2188
  • [ ] 2. Compact: write once, never update, Eq comparable
    • [ ] 2.1 support for all types
  • [ ] 3. WordAligned: update-heavy on cells
    • [ ] 3.1 support for all basic types
    • [ ] 3.2 Store varlena values out-of-place in memory, and inline/de-inline them when serializing/deserializing
  • [ ] 4. RawComparable: best-effort comparison based on raw bytes (a sketch of 4.1/4.2 follows this list)
    • [ ] 4.1 null-inline
    • [ ] 4.2 float bytes comparable
    • [ ] 4.3 comparator with best-effort &[u8] comparison, interleaved with field-by-field comparison for varlena fields
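
For reference, items 4.1 and 4.2 typically boil down to the standard order-preserving byte encodings sketched below (illustrative function names, not DataFusion APIs): a one-byte null prefix keeps nulls inline and sorting first, and a bit-flip on floats makes their big-endian bytes compare in numeric order, so the encoded key can be compared as raw `&[u8]`.

```rust
/// Null-inline (4.1): a one-byte prefix makes NULL sort before any value.
fn encode_nullable_i64(value: Option<i64>, out: &mut Vec<u8>) {
    match value {
        None => out.push(0),
        Some(v) => {
            out.push(1);
            // Flip the sign bit so negative values sort before positive ones
            // under unsigned byte comparison, then write big-endian.
            out.extend_from_slice(&((v as u64) ^ (1 << 63)).to_be_bytes());
        }
    }
}

/// Float bytes comparable (4.2): flip all bits for negatives, flip only the
/// sign bit for non-negatives, so byte order equals numeric order.
fn encode_f64(value: f64, out: &mut Vec<u8>) {
    let bits = value.to_bits();
    let encoded = if bits & (1 << 63) != 0 { !bits } else { bits ^ (1 << 63) };
    out.extend_from_slice(&encoded.to_be_bytes());
}
```

Variable-length fields are the exception: their raw bytes are not directly comparable without a length-aware encoding, which is why item 4.3 interleaves best-effort byte comparison with field-by-field varlena comparison.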

Hook into execution (mainly the pipeline-breakers)

  • [ ] Sort
    • [ ] https://github.com/apache/arrow-datafusion/issues/2427
    • [ ] RawComparable as SortByKey
    • [ ] payload #2146
  • [ ] Aggregate
    • [X] basics: Compact row as GroupByKey, WordAligned row as aggregation buffer #2375
    • [ ] Unify GroupByRowRow and GroupByRow: https://github.com/apache/arrow-datafusion/issues/2723
    • [ ] GroupBy: complete row-based accumulator support.
  • [ ] Join
    • [ ] Compact row as HashJoinKey (see the sketch after this list)
    • [ ] RawComparable row as MergeJoinKey
    • [ ] Compact row as Join payloads
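
As a rough illustration of why a compact row helps as a GroupByKey / HashJoinKey (a simplified sketch, not DataFusion internals; `encode_key_row` is a hypothetical stand-in for a compact-row writer): once the key columns are serialized into one contiguous byte row, the hash table can key directly on bytes, so hashing and equality checks avoid per-field `ScalarValue` dispatch.

```rust
use std::collections::HashMap;

/// Hypothetical compact-row encoding of an (i64, utf8) key.
fn encode_key_row(id: i64, name: &str) -> Vec<u8> {
    let mut row = Vec::with_capacity(8 + name.len());
    row.extend_from_slice(&id.to_le_bytes());
    row.extend_from_slice(name.as_bytes());
    row
}

fn main() {
    // Group state keyed by the raw row bytes, e.g. COUNT(*) per group.
    let mut groups: HashMap<Vec<u8>, u64> = HashMap::new();
    for (id, name) in [(1_i64, "a"), (2, "b"), (1, "a")] {
        let key = encode_key_row(id, name);
        *groups.entry(key).or_insert(0) += 1;
    }
    assert_eq!(groups.len(), 2);
}
```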

Cleanups

  • [ ] Getter / setter / accessor consolidation, DRY

JIT

  • [ ] basics: JIT the tuple field get/set with the schema, avoiding a branch for each field in each row. (Try to fix in #1849; a non-JIT sketch of the same idea follows this list)
  • [ ] TBD
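
The idea behind the JIT item above can be illustrated without a JIT at all (a hedged sketch, not the datafusion-jit implementation; `FieldKind` and the writer names are hypothetical): resolve the type dispatch once per schema into a table of per-field functions, so the per-row loop no longer branches on each field's data type. An actual JIT goes further and emits straight-line native code for the whole row.

```rust
#[derive(Clone, Copy)]
enum FieldKind {
    Int64,
    Bool,
}

type FieldWriter = fn(&mut Vec<u8>, &[u8]);

// Per-type writers: copy a fixed-width value from the source into the row buffer.
fn write_i64(out: &mut Vec<u8>, src: &[u8]) {
    out.extend_from_slice(&src[..8]);
}

fn write_bool(out: &mut Vec<u8>, src: &[u8]) {
    out.push(src[0]);
}

/// Resolve the type dispatch once per schema...
fn compile_writers(schema: &[FieldKind]) -> Vec<FieldWriter> {
    schema
        .iter()
        .map(|kind| match kind {
            FieldKind::Int64 => write_i64 as FieldWriter,
            FieldKind::Bool => write_bool as FieldWriter,
        })
        .collect()
}

/// ...so that the per-row loop no longer branches on each field's type.
fn write_row(writers: &[FieldWriter], fields: &[&[u8]], out: &mut Vec<u8>) {
    for (write, field) in writers.iter().zip(fields.iter()) {
        write(out, *field);
    }
}
```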

yjshen · Feb 18 '22

@iajoiner -- here is the main ticket that is tracking the row format progress. I think there are many PRs and other docs linked from here.

alamb · Jul 21 '22

I think we have chosen to focus on the Arrow row format instead, and we removed the DataFusion row format in https://github.com/apache/arrow-datafusion/pull/6968

alamb · Sep 05 '23