Bootstrap
PR Checklist
- [x] Referenced issue is linked
- [x] If you've fixed a bug or added code that should be tested, add tests!
- [x] Documentation in
docsis updated
Description of changes Addresses issue #497. This PR introduces convenience functionality for using (random sampling with replacement) bootstrapping towards error estimation of distances between perturbations.
For the user, no breaking changes are introduced. The changes visible are
- additional parameters
bootstrap: bool = False,n_bootstrap: int = 100,bootstrap_random_state: int = 0in the methodsDistance.one_sidedandDistance.pairwise - an additional method
Distance.bootstrap, a bootsrapping analogon toDistance.__call__
Technical details
Additional context: Examples
__call__ vs bootstrap
import pertpy as pt
import scanpy as sc
import matplotlib.pyplot as plt
from seaborn import clustermap
adata = pt.dt.distance_example()
dist = pt.tl.Distance("euclidean")
print(dist(X, Y))
print(dist.bootstrap(X, Y, n_bootstrap=100))
2.640588
MeanVar(mean=3.3500843, variance=0.06426364)
onesided
import pertpy as pt
import scanpy as sc
import matplotlib.pyplot as plt
from seaborn import clustermap
adata = pt.dt.distance_example()
dist = pt.tl.Distance("euclidean")
dists_one = dist.onesided_distances(adata, groupby="perturbation", selected_group="control")
print(dists_one.head())
dists_one, vars_one = dist.onesided_distances(
adata, groupby="perturbation", selected_group="control", bootstrap=True
)
print(dists_one.head())
print(vars_one.head())
perturbation
p-INTERGENIC393453 1.823859
p-sgELF1-2 1.992385
p-INTERGENIC216151 1.856206
p-INTERGENIC1144056 1.999334
p-sgELF1-5 2.132977
perturbation
p-INTERGENIC393453 2.194509
p-sgELF1-2 2.335306
p-INTERGENIC216151 2.192096
p-INTERGENIC1144056 2.310208
p-sgELF1-5 2.421595
perturbation
p-INTERGENIC393453 0.133328
p-sgELF1-2 0.150482
p-INTERGENIC216151 0.136785
p-INTERGENIC1144056 0.124505
p-sgELF1-5 0.158694
pairwise
import pertpy as pt
import scanpy as sc
import matplotlib.pyplot as plt
from seaborn import clustermap
adata = pt.dt.distance_example()
dist = pt.tl.Distance("euclidean")
dists = dist.pairwise(adata, groupby="perturbation")
means, vars = dist.pairwise(adata, groupby="perturbation", bootstrap=True)
clustermap(dists, robust=True, figsize=(10, 10))
clustermap(means, robust=True, figsize=(10, 10))
clustermap(vars, robust=True, figsize=(10, 10))
There are some tests failing, solving this together with maybe some more overhauling - opened draft PR with examples to raise some suggestions, especially with examples, of what our ideas could materialize to.
Slightly bugged that the bootstrapping mean is typically larger than the single-computed distance, not 100% sure about reason and validity right now
Yeah, smaller test datasets are also fine. We have issues with some distances failing on smaller test datasets so we had to sadly keep them a bit on the larger side.
I'm fixing the scgen test now. The 3.11 pre-release fails due to something numba related.
Yeah, smaller test datasets are also fine. We have issues with some distances failing on smaller test datasets so we had to sadly keep them a bit on the larger side.
I'm fixing the scgen test now. The 3.11 pre-release fails due to something numba related.
(3 Failures caused by these changes right now I believe, to be fixed.)