scDblFinder error when using aggregateFeatures and knownDoublets
Dear developers,
I'm working with multiplexed (CMO) scATAC-seq data (each 10X run contains 6 samples), so cells carrying more than one hashtag give me a set of known doublets. When running scDblFinder on this data, I wanted to pass these cells as knownDoublets and aggregate features, as recommended in the vignette. However, this combination of parameters does not work and throws an error (see below). After some debugging, I think the cause is that the dataset is split into known doublets (sce.dbl) and the remaining cells (sce) before aggregation, so the row names of the two subsets no longer match.
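To illustrate the suspected mechanism, here is a minimal sketch in base R only. It does not use scDblFinder's internals: the toy matrix, the `groups` vector, and the use of `rowsum()` as a stand-in for feature aggregation are all assumptions made for illustration.

```r
# Toy count matrix: 6 features x 4 cells (hypothetical data, base R only).
set.seed(1)
counts <- matrix(rpois(24, 5), nrow = 6,
                 dimnames = list(paste0("peak", 1:6), paste0("cell", 1:4)))
is_known_doublet <- c(FALSE, FALSE, FALSE, TRUE)

# Split BEFORE aggregating (the order suspected in this report).
main <- counts[, !is_known_doublet, drop = FALSE]
dbl  <- counts[, is_known_doublet, drop = FALSE]

# Aggregate the features of the main subset only; its rows get new names.
# rowsum() is just a stand-in for the real aggregation step.
groups  <- rep(c("agg1", "agg2"), each = 3)
main_ag <- rowsum(main, group = groups)

# Selecting the aggregated feature names in the un-aggregated subset fails,
# because "agg1"/"agg2" are not row names of `dbl`.
sel_features <- rownames(main_ag)
res <- tryCatch(dbl[sel_features, , drop = FALSE],
                error = function(e) conditionMessage(e))
print(res)

# Aggregating first and splitting afterwards keeps the row names consistent.
counts_ag <- rowsum(counts, group = groups)
dbl_ok    <- counts_ag[sel_features, is_known_doublet, drop = FALSE]
```

If the package follows the same order of operations, aggregating before the split (or aggregating both subsets with the same feature grouping) would avoid the mismatch.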
MRE -- Minimal example to reproduce the bug
scDblFinder(
  sce = sce,
  dims = 50,
  aggregateFeatures = TRUE,
  knownDoublets = (sce$ident == doublet_sample),
  knownUse = "discard"
)
Traceback
6: stop(sprintf(fmt, msg))
5: SummarizedExperiment:::.SummarizedExperiment.charbound(subset,
names, fmt)
4: .convert_subset_index(i, rownames(x))
3: sce.dbl[sel_features, ]
2: sce.dbl[sel_features, ]
1: scDblFinder::scDblFinder(sce = sce, dims = 50, aggregateFeatures = TRUE,
       knownDoublets = sce$ident == doublet_sample, knownUse = knownUse)
Session info
R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 11 (bullseye)
Matrix products: default
attached base packages:
[1] grid stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] qs_0.25.5 furrr_0.3.1 future_1.32.0
[4] intrinsicDimension_1.2.0 yaImpute_1.0-33 glmGamPoi_1.12.1
[7] Palo_1.1 here_1.0.1 ComplexHeatmap_2.16.0
[10] pheatmap_1.0.12 ggpp_0.5.2 BSgenome.Mmusculus.UCSC.mm10_1.4.3
[13] BSgenome_1.67.4 rtracklayer_1.59.1 Biostrings_2.67.2
[16] XVector_0.39.0 tarchetypes_0.7.6 scuttle_1.10.1
[19] Signac_1.10.0 scDblFinder_1.14.0 SingleCellExperiment_1.22.0
[22] SummarizedExperiment_1.29.1 Biobase_2.59.0 GenomicRanges_1.51.4
[25] GenomeInfoDb_1.35.17 IRanges_2.33.1 S4Vectors_0.38.1
[28] BiocGenerics_0.45.3 MatrixGenerics_1.12.2 matrixStats_1.0.0
[31] targets_1.1.3 SeuratObject_4.1.3 Seurat_4.3.0
[34] lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0
[37] dplyr_1.1.2 purrr_1.0.1 readr_2.1.4
[40] tidyr_1.3.0 tibble_3.2.1 ggplot2_3.4.2
[43] tidyverse_2.0.0
Hi, thanks for reporting this. Until I fix it, what you can do is run the aggregation separately, e.g. this should reproduce what you're trying to do:
sce.ag <- aggregateFeatures(sce, k=50)
sce.ag <- scDblFinder(sce.ag, processing="normFeatures",
                      knownDoublets = (sce.ag$ident == doublet_sample))
sce$scDblFinder.score <- sce.ag$scDblFinder.score
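The last line above copies the scores by position, which relies on both objects having the same cells in the same order. If that is in doubt, a name-based match is a safer variant. The sketch below uses plain named vectors with hypothetical cell names to show the idea; applying it to the real objects (last comment) assumes that both have column names.

```r
# Hypothetical per-cell scores from the aggregated object, in shuffled order.
scores_ag <- c(cellC = 0.9, cellA = 0.1, cellB = 0.4)
# Cell order of the original (non-aggregated) object.
cells <- c("cellA", "cellB", "cellC")

# Align by cell name rather than by position.
idx <- match(cells, names(scores_ag))
stopifnot(!anyNA(idx))  # every original cell must have a score
scores <- unname(scores_ag[idx])

# With the real objects, assuming both have column names:
# idx <- match(colnames(sce), colnames(sce.ag))
# sce$scDblFinder.score <- sce.ag$scDblFinder.score[idx]
```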
As a side note, I'm now also exploring using a high k (e.g. 500) together with the normal processing, e.g.:
sce.ag <- aggregateFeatures(sce, k=500)
sce.ag <- scDblFinder(sce.ag)
Although I haven't yet tested systematically whether that's better...
Thanks a lot for the quick answer and the workaround!