Define a query data structure for tech.ml.dataset

Open ezmiller opened this issue 4 years ago • 0 comments

We want to define a data structure specification for a query that can become canonical within tech.ml.dataset. Because a query expressed as data is introspectable, query-related functions can inspect it and make smarter decisions about how to run it.

A concrete example of a function that could use this query definition is tech.v3.dataset.base/filter-column, whose signature is currently (dataset colname predicate) -> dataset. predicate can be a value or an instance of IFn. Because a function is opaque, filter-column can only apply it to each value in turn. If filter-column took a query specification instead, it could decide how to execute the filter and choose the most efficient path for the data: for example, binary search for ordered data, or the new column index structure for unordered data. This is the behavior we want to unlock with this change.
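To make the motivation concrete, here is a minimal sketch of how a range described as data lets the executor pick a strategy. It uses plain Clojure vectors rather than real columns. The names lower-bound and range-indexes, the plain :start/:stop keys, and the sorted? flag are all hypothetical and exist only for illustration; none of this is the tech.ml.dataset API.

```clojure
;; Hypothetical sketch: plain vectors stand in for dataset columns.

(defn lower-bound
  "Index of the first element of sorted vector `v` that is >= `x`,
  or (count v) if there is none. Uses binary search."
  [v x]
  (loop [lo 0
         hi (count v)]
    (if (< lo hi)
      (let [mid (quot (+ lo hi) 2)]
        (if (neg? (compare (nth v mid) x))
          (recur (inc mid) hi)
          (recur lo mid)))
      lo)))

(defn range-indexes
  "Indexes i for which start <= (nth v i) < stop.
  `sorted?` is an assumption supplied by the caller; a real column
  would have to expose this through metadata or an index structure."
  [v {:keys [start stop]} sorted?]
  (if sorted?
    ;; Ordered data: two binary searches bound a contiguous index range.
    (vec (range (lower-bound v start) (lower-bound v stop)))
    ;; Unordered data: fall back to a linear scan.
    (vec (keep-indexed (fn [i x]
                         (when (and (>= (compare x start) 0)
                                    (neg? (compare x stop)))
                           i))
                       v))))

(range-indexes [1 3 5 7 9] {:start 3 :stop 8} true)   ;; => [1 2 3]
(range-indexes [9 1 7 3 5] {:start 3 :stop 8} false)  ;; => [2 3 4]
```

With an opaque predicate only the linear-scan branch would be available, since the bounds could not be read back out of the function.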

@cnuernber laid out a draft of what this might look like in a PR (see here). In it there are two simple query types: :any-of and :range. The filter data structures are maps that include a special key :filter-type and then other keys as needed based on the type of filter.

#:tech.v3.dataset{:filter-type :any-of
                  :values (set item-seq)}

#:tech.v3.dataset{:filter-type :range
                  :start start
                  :start-inclusive? start-inclusive?
                  :comparator comparator
                  :datatype op-dtype
                  :stop stop
                  :stop-inclusive? stop-inclusive?}
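As a rough sketch of how a consumer might interpret these maps, the snippet below dispatches on :tech.v3.dataset/filter-type with a multimethod and compiles each map into an ordinary predicate. The name query->predicate is hypothetical, the :datatype key is ignored, and falling back to compare when no :comparator is given is an assumption made here; the draft PR may handle all of these differently.

```clojure
;; Hypothetical sketch: query->predicate is not part of tech.ml.dataset.

(defmulti query->predicate
  "Turns a filter map into a one-argument predicate."
  :tech.v3.dataset/filter-type)

(defmethod query->predicate :any-of
  [{:tech.v3.dataset/keys [values]}]
  ;; `values` is expected to be a set, as in the draft structure.
  (fn [x] (contains? values x)))

(defmethod query->predicate :range
  [{:tech.v3.dataset/keys [start start-inclusive?
                           stop stop-inclusive?
                           comparator]}]
  ;; Assumption: default to clojure.core/compare without a :comparator.
  (let [cmp (or comparator compare)]
    (fn [x]
      (let [lo (cmp x start)
            hi (cmp x stop)]
        (and (if start-inclusive? (>= lo 0) (pos? lo))
             (if stop-inclusive? (<= hi 0) (neg? hi)))))))

(def any-of-query
  #:tech.v3.dataset{:filter-type :any-of
                    :values #{2 4}})

(def range-query
  #:tech.v3.dataset{:filter-type :range
                    :start 2
                    :start-inclusive? true
                    :stop 4
                    :stop-inclusive? false})

(filterv (query->predicate any-of-query) [1 2 3 4 5]) ;; => [2 4]
(filterv (query->predicate range-query) [1 2 3 4 5])  ;; => [2 3]
```

Compiling to a predicate is only the naive fallback. The point of keeping the query as data is that another method could read the same keys and route to binary search or an index lookup instead.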

It also bears mentioning that this data structure resembles the signature of the tech.v3.dataset.column-index-structure.select-from-index function, which can be used to query a column's index structure. That function takes a mode, which at the moment is either :slice or :pick, and then a hash map of key-value pairs specifying the query for that mode (see here). Whatever data structure we end up creating, we could change select-from-index to take that query structure. This would be a case of the query data structure becoming universal among TMD functions.

Another thing to keep in mind is that @ribelo and @genmeblog are working on "lifting" tech.ml.dataset column functions into tablecloth in this PR, which may be worth considering. I'm not sure yet whether the work being done there could influence how we define the query data structure here, or vice versa.

ezmiller • Aug 10 '21 17:08