Methods for splitting SingleCellExperiment objects
Is there scope to define splitColData and splitRowData methods for the SingleCellExperiment class?
I am working with a rather large SingleCellExperiment object and I often find myself needing to split the object into a list of smaller objects for pre-processing based on either the column or row data.
This can obviously be done with the following:
# Split by column data (result stored separately so sce is not overwritten)
var <- colData(sce)$variable
by.col <- lapply(var, function(x) sce[, colData(sce)$variable == x])
# Split by row data
var <- rowData(sce)$variable
by.row <- lapply(var, function(x) sce[rowData(sce)$variable == x, ])
However, I've found this approach to be slower than using a for-loop with pre-allocation (e.g. similar to the code already in the splitAltExps function):
splitColData <- function(x, f) {
    i <- split(seq_along(f), f)
    v <- vector(mode = "list", length = length(i))
    names(v) <- names(i)
    for (n in names(i)) {
        v[[n]] <- x[, i[[n]]]
    }
    return(v)
}
If there is a need for these methods I can submit a pull-request? If not, it would be super helpful if you could advise what is the most robust and efficient method for splitting SCE objects. Thank you.
> However, I've found this approach to be slower than using a for-loop with pre-allocation (e.g. similar to the code already in the splitAltExps function):
Well, yes, that's because you're looping over every element of var rather than its unique levels.
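For illustration, a minimal sketch of the lapply() version restricted to the unique levels. The toy object, the column name `variable` and the name `by.col` are made up for this example and are not part of any package:

```r
library(SingleCellExperiment)

# Toy object (hypothetical): 10 genes x 6 cells with a 'variable' column.
counts <- matrix(rpois(60, lambda = 5), nrow = 10, ncol = 6)
sce <- SingleCellExperiment(assays = list(counts = counts))
colData(sce)$variable <- c("a", "b", "a", "c", "b", "a")

# Iterate over the unique levels only, so each subset is built once
# instead of once per cell.
lev <- unique(colData(sce)$variable)
by.col <- lapply(lev, function(x) sce[, colData(sce)$variable == x])
names(by.col) <- lev
```

With this change the number of subsetting operations equals the number of levels rather than the number of columns, which is likely where most of the time difference comes from.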
> If there is a need for these methods I can submit a pull-request?
Possibly, but this would likely go to the SummarizedExperiment repository rather than this one. Any such methods should benefit all SE subclasses; there isn't any reason that they would be useful only for SCEs.
Tagging @mtmorgan: does this functionality already exist in SE? S4Vectors::split() kind of works, but it's hard to remember that it splits by row instead of by column in an SE. (Also, I just noticed that SCE doesn't implement extractROWS properly; need to fix.)
bc220cab41b7112347dda5e094ebb2a9c987fb23 fixes the split() issue, so a hypothetical splitByRow() would be as easy as:
split(sce, rowData(sce)$variable)