Speed up `dplyr` implementation
Why is it so slow currently? Hmm.
Notes for myself for later:
- remove NA
valuevalues - memory, speed - remove NA columns - memory
- don't rebuild missing lon/lats after summarizing - memory, speed
- use
summariseinstead ofdoand (probably) no dots, instead a singlew- speed
There might be errors that crop up from removing lat/lon from the data.frame. I vaguely recall either writing or reading code that references the lat/lon in the data.frame instead of cmip5data. Not saying that it's a bad idea but something to be aware of.
Thanks. I'm going to try these all out in a branch before messing with master.
Well, this is interesting. To test dplyr speedup approaches, I happened to use two very different test files: one that is sparse and thus favors dplyr over array (in terms of both memory required and speed), and the other dense (should favor array).
Test file 1: ph_Omon_IPSL-CM5A-MR_historical_r1i1p1_185001-194912.nc (1850-1859 only)
| Implementation | Object size (MB) | makeAnnualStat time (s) |
|---|---|---|
| Current dplyr | 112 | 214 |
| dplyr + remove NAs | 7.3 | 9.0 |
dplyr + remove NAs + summarise |
7.3 | 0.6 |
| Current array | 69 | 125 |
For these sparse data, there are enormous gains (in both memory used and speed) to be had by using dplyr when it's been optimized to remove NAs and use summarise. Note that array is slow and memory-inefficient with this file.
Test file 2: nbp_Lmon_HadGEM2-ES_historical_r3i1p1_195912-198411.nc
| Implementation | Object size (MB) | makeAnnualStat time (s) |
|---|---|---|
| Current dplyr | 287 | 562 |
| dplyr + remove NAs | 109 | 217 |
dplyr + remove NAs + summarise |
109 | 14.4 |
| Current array | 64 | 8.0 |
For these dense data, dplyr starts out orders of magnitude slower than array. Optimization produces very large gains, although it can't match array in either speed or memory efficiency.
Note I didn't try
- remove NA columns - memory
as this will require fairly significant changes throughout the codebase. The others, by contrast, are localized and easy.
dplyr::summarise won't pass dots (...) on to its summary functions, and so the makeXxxxStat functions use a do construct to do this, instead of the standard summarise. It turns out this is considerably slower than just using summarise--like 15x lower! (See table above.)
This is such an enormous speed difference that I'd like to propose changing the RCMIP5 stat functions: either changing their behavior slightly, or their syntax. In the first option, I could extract the first dots element (typically the weight) and pass that in via summarize, ignoring anything else in dots. So no change to syntax, but we'd tell people that if they need to pass in multiple arguments to a custom function, they should use array. Awkward, but this will be rare.
Option two would be to change makeAnnualStat, makeMonthlyStat, and makeZStat to follow the makeGlobalStat syntax with a single explicit weight parameter, but no dots. So they'd all have the form:
makeXxxStat(x, w=NULL, verbose=FALSE, FUN=mean, (etc) )
This changes the syntax, and might break existing scripts, but eliminates any ambiguity. Thoughts @cahartin @ktoddbrown ?
Hi @ktoddbrown @cahartin FYI - I'm planning on doing some more work on RCMIP5 later this month, probably the week after next. I'd like to deal with issues blocking v1.2, especially #143 #141 #129 .
Anyway, just a heads-up that you'll see some activity...hopefully leading to a 1.2 release!
@bpbond, that's great! And I will get on issue #129. I vaguely remember processing all the data.
@bpbond Sadly I don't think I'll have time to work on this this month otherwise I would be delighted to push through these issues with you. FYI I know that i have something for the regridding test unit sitting on my local folder but I don't think it's operational yet so I haven't committed it to the repo. November is looking pretty quiet for me so maybe I can devote some time then.
@ktoddbrown I've updated the testing and dplyr code for a 1.2 release to CRAN. This (above thread) is important but given time constraints I'm inclined to put it off for now, i.e. for a 1.3 release. OK?