Thoughts re common Table operations
Hi there,
Not sure if this is the right place, but I've been thinking about the table operations that I typically use.
Seems the Tables interface allows for this in a straightforward way.
I was going to just implement it and release the package, but thought I'd run the idea by people here first to coordinate efforts.
I've boiled them down to the operations in the code snippet below, aiming for:
- No ambiguity. It should be obvious from the name what the operation does. E.g., `select` selects columns by convention, but if I've been away from my code for a while I have to relearn this. I'd prefer `selectcols` (and `selectrows` instead of `filter`).
- Safety. Mutating operations should be visibly clear, and unsafe operations made explicit (as per the previous point).
- Minimality. There shouldn't be 2 functions that do the same thing. E.g., some systems have both `mutate` and `transform`, which I think creates clutter in the API.
So here's what I have in mind for tables, views and split-apply-combine operations. Suggestions most welcome.
Cheers
Tables:
newtable = SomeTableType(table) # Convert table to SomeTableType
val = table[i, colname] # Get
table[i, colname] = val # Set
newtable = appendrows(table, rows)
newtable = appendcols(table, newcolname => somevector...)
newtable = appendcols(table, newcolname => func(row)...)
newtable = deleterows(table, rows)
newtable = deletecols(table, cols)
table = mutatecol!(table, colname::Symbol => func)
table = sortrows!(table, by)
table = sortrows!(table, colnames, rev)
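To make the table operations above concrete, here is a minimal sketch using only Base Julia. It assumes a column table is a NamedTuple of equal-length vectors (a stand-in for SomeTableType), that the function passed to mutatecol! maps an old value to a new value, and that sortrows! takes a single column name plus a `rev` keyword. All of these are illustrative choices, not an existing package's API.

```julia
# Sketch only: a column table is modelled as a NamedTuple of equal-length vectors.
nrows(t::NamedTuple) = length(first(t))

# Non-mutating operations return a new table; columns are copied so that
# later in-place edits to the input do not leak into the result.
appendrows(t, rows) = map(vcat, t, rows)   # rows: same column names, same order
appendcols(t, newcols::Pair{Symbol}...) =
    merge(map(copy, t), NamedTuple{map(first, newcols)}(map(last, newcols)))
deleterows(t, rows) = map(col -> col[setdiff(1:nrows(t), rows)], t)
deletecols(t, cols) = map(copy, NamedTuple{filter(k -> !(k in cols), keys(t))}(t))

# Mutating operations carry a `!` and modify the column vectors in place.
function mutatecol!(t, spec::Pair{Symbol})
    colname, func = spec
    col = t[colname]
    col .= func.(col)          # assumption: func maps an old value to a new value
    return t
end

function sortrows!(t, by::Symbol; rev::Bool = false)
    perm = sortperm(t[by]; rev = rev)   # computed before any column is reordered
    for col in t
        col .= col[perm]
    end
    return t
end

table = (id = [1, 2, 3], score = [0.5, 0.9, 0.1])
bigger = appendcols(table, :passed => table.score .> 0.4)   # table is unchanged
sortrows!(table, :score)                                    # reorders table in place
```

Copying in the non-mutating functions costs memory, but it keeps the "safety" goal honest: only functions ending in `!` can change a table you already hold.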
Views:
view = selectcols(table, colnames)
view = selectrows(table, rowindices)
view = selectrows(table, func(row))
val = view[i, colname] # Get
view[i, colname] = val # Set. Raise an error if the view function returns false on the resulting row.
unsafe_set!(view, i, colname, val) # Changes the value and does not raise an error.
newtable = SomeTableType(view) # Convert view to SomeTableType
view = mutatecol!(view, colname::Symbol => func) # Raise an error if the view function returns false on any of the resulting rows.
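The checked versus unchecked set on a row view could look roughly like this. It is a sketch under assumptions: `RowView` and `getrow` are hypothetical names, the table is again a NamedTuple of vectors, and the view records its matching row indices eagerly when it is created.

```julia
# Hypothetical row-filter view over a column table (NamedTuple of vectors).
struct RowView{T<:NamedTuple,F}
    parent::T
    pred::F              # row -> Bool; defines membership in the view
    rows::Vector{Int}    # parent row indices that satisfied pred at creation
end

getrow(t::NamedTuple, i) = map(col -> col[i], t)

selectrows(t::NamedTuple, pred) =
    RowView(t, pred, [i for i in 1:length(first(t)) if pred(getrow(t, i))])

Base.getindex(v::RowView, i::Integer, col::Symbol) = v.parent[col][v.rows[i]]

# Checked set: reject a write that would push the row out of the view.
function Base.setindex!(v::RowView, val, i::Integer, col::Symbol)
    candidate = merge(getrow(v.parent, v.rows[i]), NamedTuple{(col,)}((val,)))
    v.pred(candidate) ||
        throw(ArgumentError("row would no longer satisfy the view predicate"))
    v.parent[col][v.rows[i]] = val
    return v
end

# Unchecked set: always writes; the view may then hold a row that fails pred.
function unsafe_set!(v::RowView, i::Integer, col::Symbol, val)
    v.parent[col][v.rows[i]] = val
    return v
end

table = (id = [1, 2, 3], score = [0.5, 0.9, 0.1])
v = selectrows(table, r -> r.score > 0.4)
v[1, :score] = 0.7               # fine: 0.7 still passes the predicate
unsafe_set!(v, 1, :score, 0.0)   # allowed, but row 1 no longer belongs in v
```

Writes go through to the parent table, so the check runs against the row as it would look after the write, before anything is modified.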
split-apply-combine:
grptbl = groupby(data, colnames...)
grptbl = groupby(data, rowfunc)
for grp in grptbl # grp is a view
    for r in rows(grp)
        # do something here
    end
end
reducedtbl = some_empty_table
for grp in grptbl
    push!(reducedtbl, (col1=sum(grp[:col3]), col2=mean(grp[:col4])))
end
val = groupdefinition(grp) # (colname1=val1, colname2=val2,...) if grp was defined by colnames; or func(grp[1, :]) if grp was defined by a row function
grp = group(grptbl, groupdef) # Useful for groups accessed via definition.
grp = group(grptbl, groupidx) # Useful for accessing groups by index and for iterating over groups
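A rough sketch of grouping by column names and reducing by explicit iteration, using only Base and the Statistics stdlib. `GroupedTable` is a hypothetical type, the table is a NamedTuple of vectors, and each group is returned as a NamedTuple of SubArrays so that it acts as a view into the parent. The row-function form of groupby and groupdefinition(grp) are left out to keep it short.

```julia
using Statistics  # stdlib, for mean

# Sketch only: a column table is a NamedTuple of equal-length vectors.
struct GroupedTable{T<:NamedTuple}
    parent::T
    defs::Vector{Any}            # group definitions, in order of first appearance
    rows::Vector{Vector{Int}}    # rows[g] holds the parent row indices of group g
end

function groupby(t::NamedTuple, colnames::Symbol...)
    defs, rows, index = Any[], Vector{Int}[], Dict{Any,Int}()
    for i in 1:length(first(t))
        def = NamedTuple{colnames}(map(c -> t[c][i], colnames))
        g = get!(index, def) do
            push!(defs, def)
            push!(rows, Int[])
            length(defs)         # index of the newly created group
        end
        push!(rows[g], i)
    end
    return GroupedTable(t, defs, rows)
end

# A group is a view: its columns are SubArrays into the parent's columns.
group(g::GroupedTable, idx::Integer) = map(col -> view(col, g.rows[idx]), g.parent)
group(g::GroupedTable, def::NamedTuple) = group(g, findfirst(isequal(def), g.defs))

Base.length(g::GroupedTable) = length(g.rows)
Base.iterate(g::GroupedTable, i = 1) = i > length(g) ? nothing : (group(g, i), i + 1)

data = (team = ["a", "b", "a", "b"], col3 = [1, 2, 3, 4], col4 = [10.0, 20.0, 30.0, 40.0])
grptbl = groupby(data, :team)

reducedtbl = NamedTuple[]
for grp in grptbl
    push!(reducedtbl, (col1 = sum(grp.col3), col2 = mean(grp.col4)))
end
```

Here reducedtbl ends up as a vector of NamedTuples, which is itself a valid row table under the Tables.jl interface.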
For constructing reduced tables, DataFrames has an interface similar to:
reducedtbl = reduceby(table, colnames, :col1 => (sum, :col3), :col2 => (mean, :col4)) # Short version of the above, though less flexible (cannot operate on multiple columns at once)
But I prefer the version that explicitly iterates over the groups because it adheres to minimality and is more flexible (construction of the new columns can use arbitrary functions of the input view).
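For comparison, the short form can be written as a thin layer over per-group iteration, which is part of why it feels redundant to me. This is only a sketch: `reduceby` here is a hypothetical helper over a NamedTuple-of-vectors table, not DataFrames' actual API, and each spec has the shape newname => (func, sourcecol).

```julia
using Statistics  # stdlib, for mean

function reduceby(t::NamedTuple, colnames, specs::Pair{Symbol}...)
    keycols = Tuple(colnames)
    order, groups = Any[], Dict{Any,Vector{Int}}()
    for i in 1:length(first(t))
        key = map(c -> t[c][i], keycols)
        rows = get!(groups, key) do
            push!(order, key)    # remember first-appearance order of groups
            Int[]
        end
        push!(rows, i)
    end
    return map(order) do key
        rows = groups[key]
        # Each spec is newname => (func, sourcecol); func sees only that group's values.
        vals = map(spec -> last(spec)[1](view(t[last(spec)[2]], rows)), specs)
        merge(NamedTuple{keycols}(key), NamedTuple{map(first, specs)}(vals))
    end
end

data = (team = ["a", "b", "a", "b"], col3 = [1, 2, 3, 4], col4 = [10.0, 20.0, 30.0, 40.0])
reduced = reduceby(data, (:team,), :col1 => (sum, :col3), :col2 => (mean, :col4))
```

The limitation mentioned above is visible in the spec shape: each function receives exactly one source column, whereas the explicit loop can combine any columns of the group.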
See previous discussion at https://discourse.julialang.org/t/common-api-for-tabular-data-backends/21546.
Thanks. That's a long thread that didn't conclude anything. Since the set of operations that are useful will differ for different people/use cases/style preferences, it's hard to see any agreement for a single unified API. Nor does it seem necessary.
Base and Tables.jl between them seem to provide the elements required to construct a set of operations, so perhaps it's best to let a number of query-like packages built on Base and Tables.jl emerge organically, which the community will undoubtedly pare down by voting with their feet.
I'm inclined to just put something out there with a view to having it improved or replaced by popular vote.