RCMIP5 icon indicating copy to clipboard operation
RCMIP5 copied to clipboard

Speed up `dplyr` implementation

Open bpbond opened this issue 10 years ago • 10 comments

Why is it so slow currently? Hmm.

bpbond avatar Jul 10 '15 14:07 bpbond

Notes for myself for later:

  • remove NA value values - memory, speed
  • remove NA columns - memory
  • don't rebuild missing lon/lats after summarizing - memory, speed
  • use summarise instead of do and (probably) no dots, instead a single w - speed

bpbond avatar Jul 10 '15 16:07 bpbond

There might be errors that crop up from removing lat/lon from the data.frame. I vaguely recall either writing or reading code that references the lat/lon in the data.frame instead of cmip5data. Not saying that it's a bad idea but something to be aware of.

ktoddbrown avatar Jul 10 '15 19:07 ktoddbrown

Thanks. I'm going to try these all out in a branch before messing with master.

bpbond avatar Jul 10 '15 19:07 bpbond

Well, this is interesting. To test dplyr speedup approaches, I happened to use two very different test files: one that is sparse and thus favors dplyr over array (in terms of both memory required and speed), and the other dense (should favor array).

Test file 1: ph_Omon_IPSL-CM5A-MR_historical_r1i1p1_185001-194912.nc (1850-1859 only)

Implementation Object size (MB) makeAnnualStat time (s)
Current dplyr 112 214
dplyr + remove NAs 7.3 9.0
dplyr + remove NAs + summarise 7.3 0.6
Current array 69 125

For these sparse data, there are enormous gains (in both memory used and speed) to be had by using dplyr when it's been optimized to remove NAs and use summarise. Note that array is slow and memory-inefficient with this file.

Test file 2: nbp_Lmon_HadGEM2-ES_historical_r3i1p1_195912-198411.nc

Implementation Object size (MB) makeAnnualStat time (s)
Current dplyr 287 562
dplyr + remove NAs 109 217
dplyr + remove NAs + summarise 109 14.4
Current array 64 8.0

For these dense data, dplyr starts out orders of magnitude slower than array. Optimization produces very large gains, although it can't match array in either speed or memory efficiency.

bpbond avatar Jul 11 '15 11:07 bpbond

Note I didn't try

  • remove NA columns - memory

as this will require fairly significant changes throughout the codebase. The others, by contrast, are localized and easy.

bpbond avatar Jul 11 '15 11:07 bpbond

dplyr::summarise won't pass dots (...) on to its summary functions, and so the makeXxxxStat functions use a do construct to do this, instead of the standard summarise. It turns out this is considerably slower than just using summarise--like 15x lower! (See table above.)

This is such an enormous speed difference that I'd like to propose changing the RCMIP5 stat functions: either changing their behavior slightly, or their syntax. In the first option, I could extract the first dots element (typically the weight) and pass that in via summarize, ignoring anything else in dots. So no change to syntax, but we'd tell people that if they need to pass in multiple arguments to a custom function, they should use array. Awkward, but this will be rare.

Option two would be to change makeAnnualStat, makeMonthlyStat, and makeZStat to follow the makeGlobalStat syntax with a single explicit weight parameter, but no dots. So they'd all have the form:

makeXxxStat(x, w=NULL, verbose=FALSE, FUN=mean, (etc) )

This changes the syntax, and might break existing scripts, but eliminates any ambiguity. Thoughts @cahartin @ktoddbrown ?

bpbond avatar Jul 14 '15 12:07 bpbond

Hi @ktoddbrown @cahartin FYI - I'm planning on doing some more work on RCMIP5 later this month, probably the week after next. I'd like to deal with issues blocking v1.2, especially #143 #141 #129 .

Anyway, just a heads-up that you'll see some activity...hopefully leading to a 1.2 release!

bpbond avatar Oct 09 '15 18:10 bpbond

@bpbond, that's great! And I will get on issue #129. I vaguely remember processing all the data.

cahartin avatar Oct 09 '15 18:10 cahartin

@bpbond Sadly I don't think I'll have time to work on this this month otherwise I would be delighted to push through these issues with you. FYI I know that i have something for the regridding test unit sitting on my local folder but I don't think it's operational yet so I haven't committed it to the repo. November is looking pretty quiet for me so maybe I can devote some time then.

ktoddbrown avatar Oct 12 '15 16:10 ktoddbrown

@ktoddbrown I've updated the testing and dplyr code for a 1.2 release to CRAN. This (above thread) is important but given time constraints I'm inclined to put it off for now, i.e. for a 1.3 release. OK?

bpbond avatar Jul 29 '16 19:07 bpbond