pertpy Bootstrap

PR Checklist

[x] Referenced issue is linked
[x] If you've fixed a bug or added code that should be tested, add tests!
[x] Documentation in docs is updated

Description of changes Addresses issue #497. This PR introduces convenience functionality for using (random sampling with replacement) bootstrapping towards error estimation of distances between perturbations.

For the user, no breaking changes are introduced. The changes visible are

additional parameters bootstrap: bool = False, n_bootstrap: int = 100, bootstrap_random_state: int = 0 in the methods Distance.one_sided and Distance.pairwise
an additional method Distance.bootstrap, a bootsrapping analogon to Distance.__call__

Technical details

Additional context: Examples

__call__ vs bootstrap

import pertpy as pt
import scanpy as sc
import matplotlib.pyplot as plt
from seaborn import clustermap

adata = pt.dt.distance_example()

dist = pt.tl.Distance("euclidean")

print(dist(X, Y))
print(dist.bootstrap(X, Y, n_bootstrap=100))

2.640588
MeanVar(mean=3.3500843, variance=0.06426364)

onesided

import pertpy as pt
import scanpy as sc
import matplotlib.pyplot as plt
from seaborn import clustermap

adata = pt.dt.distance_example()

dist = pt.tl.Distance("euclidean")

dists_one = dist.onesided_distances(adata, groupby="perturbation", selected_group="control")
print(dists_one.head())

dists_one, vars_one = dist.onesided_distances(
    adata, groupby="perturbation", selected_group="control", bootstrap=True
)
print(dists_one.head())
print(vars_one.head())

perturbation
p-INTERGENIC393453     1.823859
p-sgELF1-2             1.992385
p-INTERGENIC216151     1.856206
p-INTERGENIC1144056    1.999334
p-sgELF1-5             2.132977

perturbation
p-INTERGENIC393453     2.194509
p-sgELF1-2             2.335306
p-INTERGENIC216151     2.192096
p-INTERGENIC1144056    2.310208
p-sgELF1-5             2.421595

perturbation
p-INTERGENIC393453     0.133328
p-sgELF1-2             0.150482
p-INTERGENIC216151     0.136785
p-INTERGENIC1144056    0.124505
p-sgELF1-5             0.158694

pairwise

import pertpy as pt
import scanpy as sc
import matplotlib.pyplot as plt
from seaborn import clustermap

adata = pt.dt.distance_example()

dist = pt.tl.Distance("euclidean")

dists = dist.pairwise(adata, groupby="perturbation")

means, vars = dist.pairwise(adata, groupby="perturbation", bootstrap=True)

clustermap(dists, robust=True, figsize=(10, 10))
clustermap(means, robust=True, figsize=(10, 10))
clustermap(vars, robust=True, figsize=(10, 10))

Jan 14 '24 15:01 eroell

There are some tests failing, solving this together with maybe some more overhauling - opened draft PR with examples to raise some suggestions, especially with examples, of what our ideas could materialize to.

Slightly bugged that the bootstrapping mean is typically larger than the single-computed distance, not 100% sure about reason and validity right now

Jan 14 '24 15:01 eroell

Yeah, smaller test datasets are also fine. We have issues with some distances failing on smaller test datasets so we had to sadly keep them a bit on the larger side.

I'm fixing the scgen test now. The 3.11 pre-release fails due to something numba related.

Jan 14 '24 15:01 Zethson

Yeah, smaller test datasets are also fine. We have issues with some distances failing on smaller test datasets so we had to sadly keep them a bit on the larger side.

I'm fixing the scgen test now. The 3.11 pre-release fails due to something numba related.

(3 Failures caused by these changes right now I believe, to be fixed.)

Jan 14 '24 15:01 eroell