Parsing the j argument requires 'computing on the language'. But the expression will not be evaluated the way data.table evaluates it: the datatableinterface object doesn't actually contain any data, only references to on-disk data, so we can't evaluate the j expression within the list that contains it. Some example code:
parser <- function(jsub, parent_frame, table_columns) {
  result <- NULL
  if (is.call(jsub)) {
    print(paste0("method '", as.character(jsub[[1]]), "' is called:"))
    result <- vector("list", length(jsub))  # pre-allocate one slot per element
    result[[1]] <- as.character(jsub[[1]])
    # process the arguments, if any
    if (length(jsub) > 1) {
      for (pos in 2:length(jsub)) {
        if (is.call(jsub[[pos]])) {
          # nested call: recurse into it
          result[[pos]] <- parser(jsub[[pos]], parent_frame, table_columns)
        } else {
          is_symbol <- typeof(jsub[[pos]]) == "symbol"
          name <- as.character(jsub[[pos]])
          print(paste0("argument '", name, "', exists in parent frame: ",
            exists(name, where = parent_frame), " is symbol: ", is_symbol,
            if (is_symbol) paste(" is column:", name %in% table_columns) else ""))
          result[[pos]] <- name
        }
      }
    }
    print(paste0("end method '", as.character(jsub[[1]]), "'"))
  }
  result
}

parse <- function(j, table_columns) {
  # note: this shadows base::parse, which is fine for this demo
  parser(substitute(j), parent.frame(), table_columns)
}
# call the parser with (simulated) known columns Z and E and a j-expression
parse(.(A = 5, B = 3 * C(7), C = f(r * E), D = g(2 * Q)), c("Z", "E"))
#> [1] "method '.' is called:"
#> [1] "argument '5', exists in parent frame: FALSE is symbol: FALSE"
#> [1] "method '*' is called:"
#> [1] "argument '3', exists in parent frame: FALSE is symbol: FALSE"
#> [1] "method 'C' is called:"
#> [1] "argument '7', exists in parent frame: FALSE is symbol: FALSE"
#> [1] "end method 'C'"
#> [1] "end method '*'"
#> [1] "method 'f' is called:"
#> [1] "method '*' is called:"
#> [1] "argument 'r', exists in parent frame: FALSE is symbol: TRUE is column: FALSE"
#> [1] "argument 'E', exists in parent frame: FALSE is symbol: TRUE is column: TRUE"
#> [1] "end method '*'"
#> [1] "end method 'f'"
#> [1] "method 'g' is called:"
#> [1] "method '*' is called:"
#> [1] "argument '2', exists in parent frame: FALSE is symbol: FALSE"
#> [1] "argument 'Q', exists in parent frame: FALSE is symbol: TRUE is column: FALSE"
#> [1] "end method '*'"
#> [1] "end method 'g'"
#> [1] "end method '.'"
So with a similar method we can determine all the calls made in the j argument and decide whether a column needs to be loaded.
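To make that concrete, here is a small sketch (not part of the code above; the function name required_columns is purely illustrative) of how the columns that need loading could be collected by walking the j expression directly:

```r
# Sketch: recursively collect the column names referenced anywhere in a
# j expression. Symbols matching 'table_columns' are the columns that
# would have to be loaded from disk.
required_columns <- function(expr, table_columns) {
  if (is.symbol(expr)) {
    name <- as.character(expr)
    return(if (name %in% table_columns) name else character(0))
  }
  if (is.call(expr)) {
    # skip the function name in position 1, recurse into the arguments
    return(unique(unlist(lapply(as.list(expr)[-1],
                                required_columns, table_columns))))
  }
  character(0)  # literals (numbers, strings) reference no columns
}

required_columns(quote(.(A = 5, C = f(r * E), D = g(2 * Q))), c("Z", "E"))
#> [1] "E"
```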
Note that this means we can't reuse the code from the data.table package: the operations to be executed after parsing the j expression are completely different from those in data.table.
After parsing j, a job list needs to be constructed that describes the read jobs for the backend (the remote_table implementation). For known methods that operate element-wise (e.g. the basic operators <, * and +), only small partial reads are required; for unknown methods, whole columns need to be imported.
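As a rough illustration of such a job list, assuming hypothetical names (make_read_job and the job structure are not part of fsttable, just a sketch of the idea):

```r
# A small whitelist of methods known to operate element-wise.
elementwise_ops <- c("+", "-", "*", "/", "<", ">", "<=", ">=", "==")

# Build a read job for one column. Element-wise methods allow reading
# only the requested row range; any unknown method forces a full read.
make_read_job <- function(column, method, nrow_total, from = 1, to = nrow_total) {
  if (method %in% elementwise_ops) {
    # element-wise: reading only the requested rows gives correct results
    list(column = column, from = from, to = to)
  } else {
    # unknown method (e.g. median): the whole column is needed
    list(column = column, from = 1, to = nrow_total)
  }
}

make_read_job("E", "*", nrow_total = 1e6, from = 1, to = 10)       # partial read
make_read_job("E", "median", nrow_total = 1e6, from = 1, to = 10)  # full read
```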
Mark, I see you've added some code to parse calls, but I was thinking I'd start working towards a simple working implementation of j from the other end. This minimal implementation would: accept j as a list, determine the new column order and names, and organize the expressions to be passed to the parser. That would at least allow subsetting and renaming of the fsttable columns, and handle assignment of fixed vectors to new columns (e.g. ft[, .(X = 2)]). How does that sound?
Hi @martinblostein, thanks, that sounds great!
The largest difference from the data.table parser is that data.table evaluates the variables within the list environment of the table itself. We can't do that, because there are no actual variables in memory when the interface is first used. Instead, we have a list of column names that have to be matched against the expression. When valid column names are found, the data can be retrieved before the expression is evaluated.
For specific methods (like operators) we can also retrieve just a subset of the stored data. That would be enough for printing purposes, where only the head and tail of the table are required. Only when more data is needed (e.g. to calculate a median) does the whole column have to be retrieved from disk.
Is that what you had in mind?
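A minimal sketch of that idea, using fst::read_fst (which supports partial reads via its columns, from and to arguments); the helper eval_j_head and its interface are assumptions for illustration, not fsttable code:

```r
library(fst)

# simulated on-disk table with columns E and Z
df <- data.frame(E = 1:100, Z = 101:200)
path <- tempfile(fileext = ".fst")
write_fst(df, path)

eval_j_head <- function(jsub, path, table_columns, n = 6) {
  # match symbols in the expression against the known column names
  cols <- intersect(all.vars(jsub), table_columns)
  # partial read: only the matched columns, only the first n rows
  data <- read_fst(path, columns = cols, from = 1, to = n)
  eval(jsub, envir = data, enclos = parent.frame())
}

eval_j_head(quote(E * 2), path, c("Z", "E"))
#> [1]  2  4  6  8 10 12
```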
Yes, that's exactly what I envision. What I'm going to implement is just the framework for handling the j argument, before any evaluation. However, this will be a good more forward in functionality, because without worrying about any parsing or delayed evaluation using data from the fstfile, it will allow:
ft <- fst_table with columns a & b
ft[, .(a)]
ft[, .(b, a)]
ft[, .(X = b, Y = b)]
etc.
These changes could be made by simply updating the proxy table state. The case of a fixed value (ft[, .(X = a, Z = 5)]) is a bit more interesting: the new data would have to be stored in some way in the proxy table. But it's still a separate issue from the delayed read/evaluation of data on disk.
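A toy sketch of that proxy-state update for subsetting and renaming; the proxy structure (a plain list) and the name select_j are hypothetical, not the actual fsttable implementation:

```r
# Apply a simple j selection (renaming and reordering only) by
# rewriting the proxy state, without touching any on-disk data.
select_j <- function(proxy, j_list) {
  # j_list: a named list mapping new names to old column symbols,
  # e.g. list(X = quote(b), Y = quote(a))
  old <- unname(vapply(j_list, as.character, character(1)))
  stopifnot(all(old %in% proxy$colnames))
  proxy$colnames     <- old           # new column order (on-disk names)
  proxy$colnames_out <- names(j_list) # new (possibly renamed) names
  proxy
}

proxy <- list(path = "table.fst", colnames = c("a", "b"))
select_j(proxy, list(X = quote(b), Y = quote(a)))
```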
Yes, great! And list elements like Z = 5 could be the next development step, I think.
Basically, for the implementation of lazy evaluation of general expressions, we have to store:
- the syntax tree for each (virtual) column.
- the known functions used in the expression. Known functions (like operators) are detected and can be stored for lazy evaluation and effective subsetting. Unknown functions (e.g. ft[, .(Z = myfunction(X))]) can't be stored for later use, because running them at a later time could give different results (for example due to variables from the global environment that are changed after the expression is captured).
- the variables used in the expression that exist in the local environment (like the 5 in your example).
As you say, all of these objects have to be stored in the proxy table to be able to use them when needed. Nice work!
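The three kinds of objects listed above could be captured in a structure along these lines; all field and function names here are illustrative assumptions, not the fsttable design:

```r
# Capture everything needed to lazily evaluate one virtual column later.
virtual_column <- function(jsub, table_columns, env) {
  vars <- all.vars(jsub)
  list(
    expr        = jsub,                            # the syntax tree
    columns     = intersect(vars, table_columns),  # on-disk columns used
    # a tiny whitelist of known (element-wise) methods found in the call
    known_calls = intersect(all.names(jsub), c("+", "-", "*", "/")),
    # capture local variables now, so later changes to the environment
    # cannot alter the result
    locals      = mget(setdiff(vars, table_columns), envir = env,
                       ifnotfound = list(NULL))
  )
}

n <- 5
vc <- virtual_column(quote(E * n), c("Z", "E"), environment())
vc$columns  # "E"
vc$locals   # list(n = 5)
```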