RCMIP5 Speed up `dplyr` implementation

Why is it so slow currently? Hmm.

Jul 10 '15 14:07 bpbond

Notes for myself for later:

remove NA value values - memory, speed
remove NA columns - memory
don't rebuild missing lon/lats after summarizing - memory, speed
use summarise instead of do and (probably) no dots, instead a single w - speed

Jul 10 '15 16:07 bpbond

There might be errors that crop up from removing lat/lon from the data.frame. I vaguely recall either writing or reading code that references the lat/lon in the data.frame instead of cmip5data. Not saying that it's a bad idea but something to be aware of.

Jul 10 '15 19:07 ktoddbrown

Thanks. I'm going to try these all out in a branch before messing with master.

Jul 10 '15 19:07 bpbond

Well, this is interesting. To test dplyr speedup approaches, I happened to use two very different test files: one that is sparse and thus favors dplyr over array (in terms of both memory required and speed), and the other dense (should favor array).

Test file 1: ph_Omon_IPSL-CM5A-MR_historical_r1i1p1_185001-194912.nc (1850-1859 only)

Implementation	Object size (MB)	makeAnnualStat time (s)
Current dplyr	112	214
dplyr + remove NAs	7.3	9.0
dplyr + remove NAs + `summarise`	7.3	0.6
Current array	69	125

For these sparse data, there are enormous gains (in both memory used and speed) to be had by using dplyr when it's been optimized to remove NAs and use summarise. Note that array is slow and memory-inefficient with this file.

Test file 2: nbp_Lmon_HadGEM2-ES_historical_r3i1p1_195912-198411.nc

Implementation	Object size (MB)	makeAnnualStat time (s)
Current dplyr	287	562
dplyr + remove NAs	109	217
dplyr + remove NAs + `summarise`	109	14.4
Current array	64	8.0

For these dense data, dplyr starts out orders of magnitude slower than array. Optimization produces very large gains, although it can't match array in either speed or memory efficiency.

Jul 11 '15 11:07 bpbond

Note I didn't try

remove NA columns - memory

as this will require fairly significant changes throughout the codebase. The others, by contrast, are localized and easy.

Jul 11 '15 11:07 bpbond

dplyr::summarise won't pass dots (...) on to its summary functions, and so the makeXxxxStat functions use a do construct to do this, instead of the standard summarise. It turns out this is considerably slower than just using summarise--like 15x lower! (See table above.)

This is such an enormous speed difference that I'd like to propose changing the RCMIP5 stat functions: either changing their behavior slightly, or their syntax. In the first option, I could extract the first dots element (typically the weight) and pass that in via summarize, ignoring anything else in dots. So no change to syntax, but we'd tell people that if they need to pass in multiple arguments to a custom function, they should use array. Awkward, but this will be rare.

Option two would be to change makeAnnualStat, makeMonthlyStat, and makeZStat to follow the makeGlobalStat syntax with a single explicit weight parameter, but no dots. So they'd all have the form:

makeXxxStat(x, w=NULL, verbose=FALSE, FUN=mean, (etc) )

This changes the syntax, and might break existing scripts, but eliminates any ambiguity. Thoughts @cahartin @ktoddbrown ?

Jul 14 '15 12:07 bpbond

Hi @ktoddbrown @cahartin FYI - I'm planning on doing some more work on RCMIP5 later this month, probably the week after next. I'd like to deal with issues blocking v1.2, especially #143 #141 #129 .

Anyway, just a heads-up that you'll see some activity...hopefully leading to a 1.2 release!

Oct 09 '15 18:10 bpbond

@bpbond, that's great! And I will get on issue #129. I vaguely remember processing all the data.

Oct 09 '15 18:10 cahartin

@bpbond Sadly I don't think I'll have time to work on this this month otherwise I would be delighted to push through these issues with you. FYI I know that i have something for the regridding test unit sitting on my local folder but I don't think it's operational yet so I haven't committed it to the repo. November is looking pretty quiet for me so maybe I can devote some time then.

Oct 12 '15 16:10 ktoddbrown

@ktoddbrown I've updated the testing and dplyr code for a 1.2 release to CRAN. This (above thread) is important but given time constraints I'm inclined to put it off for now, i.e. for a 1.3 release. OK?

Jul 29 '16 19:07 bpbond