scDblFinder error when using aggregateFeatures and knownDoublets

Dear developers,

I'm working with multiplexed (CMO) scATAC-seq data (each 10X run contains 6 samples), so overlapping hashtags give me a set of known doublets. When running scDblFinder on this data, I wanted to pass these doublets as knownDoublets and aggregate features, as recommended in the vignette. However, this combination of parameters does not work and throws an error (see below). After some debugging, I suspect the cause is that the dataset is split into known doublets (sce.dbl) and the remaining cells (sce) before aggregation, which leads to a mismatch of row names between the two subsets.
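To illustrate the kind of mismatch I mean, here is a minimal base-R sketch. It is only an analogy and not the scDblFinder internals: the toy matrix, the feature grouping, and the use of rowsum() as a stand-in for the aggregation step are all assumptions made for illustration.

```r
# Toy count matrix: 4 features (peaks) x 3 cells. All names are hypothetical.
counts <- matrix(1:12, nrow = 4,
                 dimnames = list(paste0("peak", 1:4), paste0("cell", 1:3)))
known <- c(FALSE, FALSE, TRUE)  # pretend cell3 is a known doublet

# Split BEFORE aggregation (the suspected order of operations).
dbl  <- counts[, known, drop = FALSE]
rest <- counts[, !known, drop = FALSE]

# Stand-in for feature aggregation: sum the rows within each feature group.
groups  <- c("feat1", "feat1", "feat2", "feat2")
rest_ag <- rowsum(rest, group = groups)

# The selected features are now named after the aggregated rows...
sel_features <- rownames(rest_ag)

# ...but the known-doublet subset still has the original row names,
# so subsetting it by the aggregated names fails.
res <- tryCatch(dbl[sel_features, , drop = FALSE],
                error = function(e) conditionMessage(e))
print(res)

# Aggregating first and splitting afterwards keeps the row names consistent.
counts_ag <- rowsum(counts, group = groups)
dbl_ag    <- counts_ag[, known, drop = FALSE]
ok        <- dbl_ag[sel_features, , drop = FALSE]
print(ok)
```

In this toy case the first subsetting attempt errors because the aggregated feature names do not exist in the un-aggregated subset, while the second succeeds.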

MRE -- Minimal example to reproduce the bug

scDblFinder(
  sce = sce,
  dims = 50,
  aggregateFeatures = TRUE,
  knownDoublets = (sce$ident == doublet_sample), 
  knownUse = "discard"
)

Traceback

6: stop(sprintf(fmt, msg))
5: SummarizedExperiment:::.SummarizedExperiment.charbound(subset, 
       names, fmt)
4: .convert_subset_index(i, rownames(x))
3: sce.dbl[sel_features, ]
2: sce.dbl[sel_features, ]
1: scDblFinder::scDblFinder(sce = sce, dims = 50, aggregateFeatures = TRUE, 
       knownDoublets = sce$ident == doublet_sample, knownUse = knownUse)

Session info

R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 11 (bullseye)

Matrix products: default

attached base packages:
[1] grid      stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] qs_0.25.5                          furrr_0.3.1                        future_1.32.0                     
 [4] intrinsicDimension_1.2.0           yaImpute_1.0-33                    glmGamPoi_1.12.1                  
 [7] Palo_1.1                           here_1.0.1                         ComplexHeatmap_2.16.0             
[10] pheatmap_1.0.12                    ggpp_0.5.2                         BSgenome.Mmusculus.UCSC.mm10_1.4.3
[13] BSgenome_1.67.4                    rtracklayer_1.59.1                 Biostrings_2.67.2                 
[16] XVector_0.39.0                     tarchetypes_0.7.6                  scuttle_1.10.1                    
[19] Signac_1.10.0                      scDblFinder_1.14.0                 SingleCellExperiment_1.22.0       
[22] SummarizedExperiment_1.29.1        Biobase_2.59.0                     GenomicRanges_1.51.4              
[25] GenomeInfoDb_1.35.17               IRanges_2.33.1                     S4Vectors_0.38.1                  
[28] BiocGenerics_0.45.3                MatrixGenerics_1.12.2              matrixStats_1.0.0                 
[31] targets_1.1.3                      SeuratObject_4.1.3                 Seurat_4.3.0                      
[34] lubridate_1.9.2                    forcats_1.0.0                      stringr_1.5.0                     
[37] dplyr_1.1.2                        purrr_1.0.1                        readr_2.1.4                       
[40] tidyr_1.3.0                        tibble_3.2.1                       ggplot2_3.4.2                     
[43] tidyverse_2.0.0

Jul 19 '23 15:07 dottercp

Hi, thanks for reporting this. Until I fix it, what you can do is run the aggregation separately, e.g. this should reproduce what you're trying to do:

sce.ag <- aggregateFeatures(sce, k=50)
sce.ag <- scDblFinder(sce.ag, processing="normFeatures",
                      knownDoublets = (sce.ag$ident == doublet_sample))
sce$scDblFinder.score <- sce.ag$scDblFinder.score

As a note, I'm also now exploring running it with a high k (e.g. 500) and the normal processing, e.g.:

sce.ag <- aggregateFeatures(sce, k=500)
sce.ag <- scDblFinder(sce.ag)

Although I haven't yet tested systematically whether it's better...

Jul 20 '23 07:07 plger

Thanks a lot for the quick answer and the workaround!

Jul 20 '23 11:07 dottercp