improve negative binomial overdispersion estimation

Open willtownes opened this issue 5 years ago • 3 comments

The glmGamPoi package has some cool strategies for estimating the negative binomial overdispersion parameter(s). Try to use those ideas to improve accuracy, numerical stability, and speed of nb estimation in glmpca.

willtownes avatar Jun 19 '20 17:06 willtownes

Hi Will,

I just saw this issue. If you are interested, we could see if it is possible to call glmGamPoi::overdispersion_mle() from within glmpca. I recently worked on a similar project with DESeq2 (https://github.com/mikelove/DESeq2/pull/24). I tried to design glmGamPoi in a way that should make it easy to integrate with other tools.

You are also at the Bioc Conference this week, right? If you are interested, we could also have a virtual chat on that platform.

Best, Constantin

const-ae avatar Jul 28 '20 12:07 const-ae

Notes from a recent Skype call:

  • glmGamPoi requires data to be dense (either in-memory matrix/array or DelayedArray on disk)
  • current glmpca estimator of nb_theta is probably inaccurate
  • overall it's definitely worth using glmGamPoi as a subroutine for glmpca.

Constantin will add some features to glmGamPoi to facilitate integration:

  • flag to allow optimizer to only do a single iteration instead of running to completion
  • flag to allow a global dispersion across all genes instead of each gene having its own dispersion
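
To make the difference between a global and a gene-specific dispersion concrete, here is a minimal base R sketch on simulated counts. It is not glmGamPoi's algorithm: it uses a plain one-dimensional optimize() on the NB log-likelihood with no Cox-Reid adjustment, and the names nb_loglik and mle_theta as well as the simulation settings are made up for illustration.

```r
set.seed(1)
n_genes <- 20
n_cells <- 200
mu <- runif(n_genes, 1, 20)  # one true mean per gene (illustrative values)
true_theta <- 0.5            # overdispersion: var = mu + theta * mu^2

# Genes in rows, cells in columns; the matrix is filled column by column,
# so repeating mu once per cell lines each row up with its own mean.
Y <- matrix(rnbinom(n_genes * n_cells, mu = rep(mu, n_cells), size = 1 / true_theta),
            nrow = n_genes)

# NB log-likelihood as a function of the overdispersion theta (size = 1 / theta)
nb_loglik <- function(theta, y, mu) {
  sum(dnbinom(y, mu = mu, size = 1 / theta, log = TRUE))
}

# 1-D maximum likelihood estimate of theta for fixed means
mle_theta <- function(y, mu) {
  optimize(nb_loglik, interval = c(1e-4, 100), y = y, mu = mu, maximum = TRUE)$maximum
}

mu_hat <- rowMeans(Y)

# Gene-specific dispersion: one estimate per row
per_gene <- vapply(seq_len(n_genes),
                   function(g) mle_theta(Y[g, ], mu_hat[g]),
                   numeric(1))

# Global dispersion: a single estimate pooled over all genes
global <- mle_theta(c(Y), rep(mu_hat, n_cells))
```

In this toy setting the pooled estimate uses every count at once, so it is typically much less noisy than the individual per-gene estimates.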

Will is to do the following in the glmpca package (glmGamPoi will become a "Suggests" entry in DESCRIPTION and be handled very similarly to the DelayedArray conditional dependency):

  • conditional import of glmGamPoi only if fam=(nb,nb2) and NO minibatches (data=dense array or DelayedArray)
  • if fam is not nb,nb2 then the user doesn't need to have glmGamPoi installed
  • If fam=nb,nb2 and we are using minibatches (i.e., the data are sparse), we can't rely on glmGamPoi (at least not in the present implementation). We need to either force a fixed nb_theta prespecified by the user (see #25) or use the current estimator, which may not be very accurate and is only supported for fam=nb (global theta), not nb2 (gene-specific theta).
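
The decision logic above could be sketched roughly as follows. This is only an illustration of the plan, not code from glmpca: the helper choose_nb_theta_strategy, its arguments, and its return labels are hypothetical, and stopping with an error when glmGamPoi is missing is an assumption (the notes only say the import is conditional). The one real call is requireNamespace(), the usual way to check for a "Suggests" package.

```r
# Hypothetical helper (not part of glmpca): decide how nb_theta would be obtained.
choose_nb_theta_strategy <- function(fam, minibatch = FALSE, nb_theta = NULL,
                                     have_glmgampoi = requireNamespace("glmGamPoi",
                                                                       quietly = TRUE)) {
  # Families other than nb/nb2 have no overdispersion, so glmGamPoi is never touched
  # (have_glmgampoi is evaluated lazily and is not checked on this path).
  if (!fam %in% c("nb", "nb2")) {
    return("not_needed")
  }
  # Dense data without minibatches: delegate estimation to glmGamPoi.
  if (!minibatch) {
    if (!have_glmgampoi) {
      # Assumption: fail with a clear message rather than silently falling back.
      stop("fam='", fam, "' without minibatches needs the glmGamPoi package; ",
           "please install it.")
    }
    return("glmGamPoi")
  }
  # Minibatches (sparse data): glmGamPoi cannot be used.
  if (!is.null(nb_theta)) {
    return("fixed")    # user-supplied nb_theta, held fixed
  }
  if (fam == "nb") {
    return("builtin")  # current glmpca estimator, global theta only
  }
  stop("fam='nb2' with minibatches requires a user-supplied nb_theta.")
}
```

For example, choose_nb_theta_strategy("nb", minibatch = TRUE) returns "builtin", while the same call with fam = "nb2" and no nb_theta stops with an error.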

willtownes avatar Aug 13 '20 17:08 willtownes

Hey, thanks for the summary.

> Constantin will add some features to glmGamPoi to facilitate integration:

> flag to allow optimizer to only do a single iteration instead of running to completion

https://github.com/const-ae/glmGamPoi/commit/67489298ce8833ac767261230a7ed322010973a6#diff-a72abeb11c4db14c99c6483f1666e3b5L160

> flag to allow a global dispersion across all genes instead of each gene having its own dispersion

https://github.com/const-ae/glmGamPoi/commit/043019c242df1156687b4e98cab39d18d63af766#diff-a72abeb11c4db14c99c6483f1666e3b5R83

Had a productive morning today and added the global overdispersion estimation routines, which return a single estimate if overdispersion_mle() is called with a matrix Y. The result is basically equal to calling overdispersion_mle(c(Y)); however, it can be faster and, more importantly, it also works for HDF5-backed matrices.

Happy to hear if those two features work for you as expected :)

const-ae avatar Aug 14 '20 13:08 const-ae