TypedTables.jl icon indicating copy to clipboard operation
TypedTables.jl copied to clipboard

Support appending arbitrary container

Open tkf opened this issue 6 years ago • 9 comments

This PR implements append!(::Table, source) for arbitrary source table types:

julia> append!(Table(a=[1], b=[2]), (a=[3], b=[4]))
Table with 2 columns and 2 rows:
     a  b
   ┌─────
 1 │ 1  2
 2 │ 3  4

julia> append!(Table(a=[1], b=[2]), [(a=3, b=4)])
Table with 2 columns and 2 rows:
     a  b
   ┌─────
 1 │ 1  2
 2 │ 3  4

julia> append!(Table(a=[1], b=[2]), DataFrame(a=[3], b=[4]))
Table with 2 columns and 2 rows:
     a  b
   ┌─────
 1 │ 1  2
 2 │ 3  4

julia> append!(Table(a=[1], b=[2]), StructVector((a=[3], b=[4])))
Table with 2 columns and 2 rows:
     a  b
   ┌─────
 1 │ 1  2
 2 │ 3  4

tkf avatar Dec 15 '19 18:12 tkf

Codecov Report

Merging #57 into master will increase coverage by 0.12%. The diff coverage is 72.72%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #57      +/-   ##
==========================================
+ Coverage   67.06%   67.19%   +0.12%     
==========================================
  Files           6        6              
  Lines         498      509      +11     
==========================================
+ Hits          334      342       +8     
- Misses        164      167       +3
Impacted Files Coverage Δ
src/Table.jl 75.67% <72.72%> (-0.33%) :arrow_down:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update fc1247a...6c09ca2. Read the comment docs.

codecov-io avatar Dec 15 '19 18:12 codecov-io

That's an interesting point. I think I agree. But a pain point is that there is no easy way to query if iterate(table) is row-wise or not. We can use Tables.rows(table) === table although I'm not sure if the compiler can always const-prop this (also, it'd be bad if Tables.rows is not O(1)). Anyway, maybe something like this?:

isrowiterator(t) =
    Tables.istable(t) && Tables.rowaccess(t) && Tables.rows(t) === t

function Base.append!(t::Table, t2)
    if isrowiterator(t2) && Tables.columnaccess(t2)
        return append_columnaccess!(t, t2)
    end
    return append_rows!(t, t2)
end

append_row!(t::Table, rows) =  # `rows` is a generic iterator
    mapfoldl(_asnamedtuple(NamedTuple{columnnames(t)}), push!, rows; init = t)

It's nice that now it naturally supports iterators with unknown eltype.

(BTW, it would be nice to add something like isrowiterator to Tables.jl interface if we do this here.)

With this, I think append!(Table(a=[1], b=[2]), Tables.rows((a=[3], b=[4]))) should still work and it's efficient as it uses append_columnaccess! path.

tkf avatar Dec 15 '19 23:12 tkf

append!(Table(a=[1], b=[2]), Tables.rows((a=[3], b=[4]))) should still work and it's efficient as it uses append_columnaccess! path

This already works. But we need https://github.com/JuliaData/Tables.jl/pull/126 for the optimization (column-oriented memory access).

tkf avatar Dec 16 '19 00:12 tkf

I'm not sure I quite understand what append_columnaccess! means in:

 if isrowiterator(t2) && Tables.columnaccess(t2)
    return append_columnaccess!(t, t2)
end

is it hcat? or still vcat, but column-by-column?

Also note that:

Tables.istable(t) && Tables.rowaccess(t) && Tables.rows(t) === t

doesn't return true for all valid row-iterators. For example, Tables.istable(t) returns false for a Generator of NamedTuples, even though it's a perfectly valid "row table", and works fine when you call Tables.rows on it.

quinnj avatar Dec 16 '19 17:12 quinnj

still vcat, but column-by-column?

Yes, vcat done column-by-column as an optimization.

Also note that:

Tables.istable(t) && Tables.rowaccess(t) && Tables.rows(t) === t

doesn't return true for all valid row-iterators.

As append_columnaccess! is just an optimization, it is OK for isrowiterator to have false negative.

For example, Tables.istable(t) returns false for a Generator of NamedTuples, even though it's a perfectly valid "row table", and works fine when you call Tables.rows on it.

They hit append_rows! branch so it's fine.

tkf avatar Dec 16 '19 17:12 tkf

Yes, vcat done column-by-column as an optimization.

So why not just always call Tables.columns and append the result columns one by one anyway? Any time you're switching based on rowaccess/columnaccess, that's a bit of a code-smell w/ the Tables.jl API; the point is to use Tables.rows or Tables.columns as is more appropriate for your use-case. For example, that's what DataFrames does.

quinnj avatar Dec 16 '19 19:12 quinnj

I thought it's nice to avoid allocating unnecessary objects. If performance is not a concern, I think the cleanest approach here would be to add things row-by-row as there is already push!:

https://github.com/JuliaData/TypedTables.jl/blob/018a2d14d4448f3dd3f3e63321fc3bc09a77ae2b/src/Table.jl#L218

tkf avatar Dec 16 '19 20:12 tkf

Then I would just check Tables.columnaccess(t) and use the column-based path if so, otherwise fallback to the row-based approach. The default is Tables.columnaccess(x) = false, so it will only be true if a table has explicitly provided support for Tables.columns.

quinnj avatar Dec 16 '19 20:12 quinnj

I initially had something like that. But then @andyferris pointed out that it might be better to strictly require that src in append!(::TypedTable.Table, src) to be (semantically) an iterator over rows https://github.com/JuliaData/TypedTables.jl/pull/57#pullrequestreview-332289808. Essentially this boils down to that Tables.istable(x) does not imply anything about how iterate(x) works (or not work). (Just to be clear, I'm not saying this is bad. I think Tables.jl has a slick API that demonstrates how much you can do with a flexible trait-based system in Julia.) By the way, this is why I said Tabels.isrowiterator might be a good addition to Tables.jl interface.

tkf avatar Dec 16 '19 21:12 tkf