drifter
calculate_distance() possible improvement...
Hi Przemek,
For data.frames with more than 200k rows, there is a significant opportunity to speed up this function, which is the core of calculate_covariate_fit().
Instead of using rank() here:
calculate_distance <- function(variable_old, variable_new, bins = 20) {
  if ("factor" %in% class(variable_old)) {
    after_cuts <- c(variable_old, variable_new)
  } else {
    after_cuts <- cut(rank(c(variable_old, variable_new)), bins)
  }
it would be much faster to use frank() from the data.table package:
calculate_distance <- function(variable_old, variable_new, bins = 20) {
  if ("factor" %in% class(variable_old)) {
    after_cuts <- c(variable_old, variable_new)
  } else {
    # frank() is data.table's fast drop-in replacement for rank()
    after_cuts <- cut(data.table::frank(c(variable_old, variable_new)), bins)
  }
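One way to check that the swap is safe is to confirm that both functions return the same ranks and then time them. The following is only a minimal sketch, assuming data.table is installed; the simulated vectors, their sizes, and the seed are made up for illustration and are not taken from the comparison data mentioned in this issue.

```r
library(data.table)

set.seed(1)
# Hypothetical inputs standing in for variable_old / variable_new
variable_old <- rnorm(5e5)
variable_new <- rnorm(5e5, mean = 0.2)
x <- c(variable_old, variable_new)

# Both functions default to ties.method = "average",
# so the resulting ranks should match
r_base <- rank(x)
r_fast <- frank(x)
stopifnot(isTRUE(all.equal(r_base, r_fast)))

# The binning built on top of the ranks is therefore unchanged
bins <- 20
cuts_base <- cut(r_base, bins)
cuts_fast <- cut(r_fast, bins)
stopifnot(identical(cuts_base, cuts_fast))

# Rough timing comparison; the numbers depend on the machine
system.time(rank(x))
system.time(frank(x))
```

Timings are deliberately not quoted here, since they vary with hardware and input size.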
After that step there is another calculation based on table() that can also be sped up significantly with data.table. If you are willing to add a data.table dependency to drifter, I will open a PR with these changes. With them, I was able to go from several minutes of computation on my comparison data.frame to barely 30 seconds.
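To illustrate the kind of replacement meant for the table() step, here is a rough sketch. It does not reproduce the package code: it assumes the step counts observations per bin for each sample, and names such as sample_id, counts_base and counts_dt are hypothetical.

```r
library(data.table)

set.seed(1)
# Hypothetical inputs standing in for variable_old / variable_new
variable_old <- rnorm(1e5)
variable_new <- rnorm(1e5, mean = 0.2)
bins <- 20

after_cuts <- cut(frank(c(variable_old, variable_new)), bins)
# Hypothetical label marking which sample each observation came from
sample_id <- rep(c("old", "new"),
                 c(length(variable_old), length(variable_new)))

# Base R: contingency table of bin by sample
counts_base <- table(after_cuts, sample_id)

# data.table: grouped count using .N
dt <- data.table(bin = after_cuts, sample_id = sample_id)
counts_dt <- dt[, .N, by = .(bin, sample_id)]

# Both approaches should agree on every non-empty cell
m <- unclass(counts_base)
chk <- counts_dt[, .(bin = as.character(bin), sample_id, N)]
stopifnot(all(m[cbind(chk$bin, chk$sample_id)] == chk$N))
stopifnot(sum(counts_dt$N) == length(after_cuts))
```

The grouped count returns the result in long format (one row per non-empty bin and sample), so any code that expects the wide table() layout would need to be adapted.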
Thanks, Carlos.