read_10x_mtx with different prefix
Hi scanpy team,
I am not sure if I just missed it, but there does not seem to be a way to specify a different filename for .mtx files. For instance, assume I have multiple .mtx files in a folder (sample1.matrix.mtx, sample2.matrix.mtx, ...) with corresponding genes and barcodes files (sample1.genes.tsv, sample1.barcodes.tsv, ...). It would be useful to be able to specify the matrix/genes/barcodes filenames and/or a prefix for the files.
Thanks in advance, Alex
Hi! That function is for reading the files output by cellranger’s mex option. Your files have been renamed by someone in a way we can’t predict, and you should just adapt the little code needed to read them yourself:
https://github.com/theislab/scanpy/blob/e6e08e51d63c78581bb9c86fe6e302b80baef623/scanpy/readwrite.py#L324-L341
Took me 3 minutes:
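from pathlib import Path

import anndata as ad
import pandas as pd
import scanpy as sc

path = Path('.')  # folder holding the renamed files; adjust as needed

samples = []
for sample in range(1, 10):
    prefix = f'sample{sample}'
    # .mtx stores genes x cells, so transpose to cells x genes
    s = sc.read(path / f'{prefix}.matrix.mtx').T
    genes = pd.read_csv(path / f'{prefix}.genes.tsv', header=None, sep='\t')
    s.var_names = genes[0].values
    s.var['gene_symbols'] = genes[1].values
    s.obs_names = pd.read_csv(path / f'{prefix}.barcodes.tsv', header=None)[0].values
    samples.append(s)
# label/index_unique record the sample of origin and keep barcodes unique
adata = ad.concat(samples, label='sample', index_unique='-')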
Hi @flying-sheep ,
Thanks, this is what I did myself, too. I just thought it would be cool to have the function be more general in case someone gets non-standard data (especially for people who are new to scanpy or Python).
Anyways, love the package, great work!
Thanks for the praise!
If there was a way to generalize this function, we could do it. As is, I don’t see any, other than letting the user specify all three file names. Is that what you want?
I thought this would be useful. I recently got a few datasets that were renamed and/or in a different folder structure and I thought it would be good if one could specify that. Something like
def read(folder, mtx_file=None, features_file=None, ...):
    if mtx_file is not None:
        # Load mtx file
    else:
        # Fall back to load from folder
Again, thank you so much!
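The fallback logic proposed above could be sketched roughly as follows. This is only an illustration, not scanpy's API: `resolve_mex_paths` and `MexPaths` are hypothetical names, and the default file names are the ones Cell Ranger writes (gzipped `matrix.mtx.gz`/`features.tsv.gz`/`barcodes.tsv.gz` in v3+, plain `matrix.mtx`/`genes.tsv`/`barcodes.tsv` in v2). The sketch only resolves which files to read; the actual loading would stay as in the snippet earlier in the thread.

```python
from pathlib import Path
from typing import NamedTuple, Optional


class MexPaths(NamedTuple):
    """Hypothetical container for the three files of one MEX dataset."""
    matrix: Path
    features: Path
    barcodes: Path


def resolve_mex_paths(
    folder,
    mtx_file: Optional[str] = None,
    features_file: Optional[str] = None,
    barcodes_file: Optional[str] = None,
) -> MexPaths:
    """Use explicit file names when given, else fall back to Cell Ranger defaults."""
    folder = Path(folder)

    def pick(explicit, defaults):
        # An explicit name wins; otherwise try the default names in order.
        candidates = [explicit] if explicit is not None else defaults
        for name in candidates:
            if (folder / name).is_file():
                return folder / name
        raise FileNotFoundError(f"none of {candidates} found in {folder}")

    return MexPaths(
        matrix=pick(mtx_file, ["matrix.mtx.gz", "matrix.mtx"]),
        features=pick(features_file, ["features.tsv.gz", "genes.tsv"]),
        barcodes=pick(barcodes_file, ["barcodes.tsv.gz", "barcodes.tsv"]),
    )
```

With renamed files this would be called as `resolve_mex_paths(folder, "sample1.matrix.mtx", "sample1.genes.tsv", "sample1.barcodes.tsv")`, while `resolve_mex_paths(folder)` keeps the current behaviour for unmodified Cell Ranger output.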
OK, let’s do this.
We would also be highly interested in this feature. Sometimes datasets on GEO follow exactly the pattern Alex described earlier. The small improvement on scanpy's side would allow us to read such data faster :-)
Are there plans to incorporate this feature? Would be very helpful to specify the filenames to read in -- a lot of GEO data doesn't follow the expected format.
Hey, sorry for being slow here.
Upon looking into this again, it is the case that read_10x_mtx has to make strong assumptions about the files being generated by Cell Ranger. This is also reflected in the filenames this software outputs.
Is there a widely used processing pipeline which does not adhere to this file naming? If yes, scanpy should indeed be able to deal with this; if no, custom workflows would actually be dealt with more reliably by a small custom reading script, as suggested by @flying-sheep above.
Is there a widely used processing pipeline which does not adhere to this file naming?
STARsolo generates Cell Ranger-compatible output, and when multiple multi-mapper resolution strategies are enabled, it writes several .mtx.gz matrix files with different names.
e.g: STARsolo ... --soloMultiMappers Unique EM PropUnique Rescue Uniform yields:
barcodes.tsv.gz
features.tsv.gz
matrix.mtx.gz
UniqueAndMult-EM.mtx.gz
UniqueAndMult-PropUnique.mtx.gz
UniqueAndMult-Rescue.mtx.gz
UniqueAndMult-Uniform.mtx.gz
Each of these *.mtx.gz files matches the same format as matrix.mtx.gz and can be read in the same way. (They all share the *.tsv.gz files).
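To illustrate how the matrices share one set of annotation files, here is a minimal standard-library sketch. It assumes the layout listed above (gzipped files, matrices stored as features x barcodes in Matrix Market coordinate format); `read_starsolo_matrices` and its helpers are hypothetical names, and a real workflow would more likely parse the matrices with a library such as `scipy.io.mmread` and wrap them in AnnData objects.

```python
import gzip
from pathlib import Path


def read_lines(path):
    """Return the non-empty text lines of a plain or gzip-compressed file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return [line.rstrip("\n") for line in handle if line.strip()]


def read_mtx_entries(path):
    """Parse a Matrix Market coordinate file into (shape, {(row, col): value})."""
    # Lines starting with '%' are the banner and comments.
    lines = [line for line in read_lines(path) if not line.startswith("%")]
    n_rows, n_cols, _n_entries = (int(x) for x in lines[0].split())
    entries = {}
    for line in lines[1:]:
        row, col, value = line.split()[:3]
        # Indices are 1-based on disk; store them 0-based.
        entries[(int(row) - 1, int(col) - 1)] = float(value)
    return (n_rows, n_cols), entries


def read_starsolo_matrices(folder):
    """Read every *.mtx.gz in `folder` against the shared barcodes/features files."""
    folder = Path(folder)
    barcodes = read_lines(folder / "barcodes.tsv.gz")
    # First column of features.tsv.gz is taken as the feature ID.
    features = [line.split("\t")[0] for line in read_lines(folder / "features.tsv.gz")]
    matrices = {}
    for mtx_path in sorted(folder.glob("*.mtx.gz")):
        shape, entries = read_mtx_entries(mtx_path)
        if shape != (len(features), len(barcodes)):
            raise ValueError(f"{mtx_path.name}: shape {shape} does not match annotations")
        # Key each matrix by its file name without the .mtx.gz suffix.
        matrices[mtx_path.name[: -len(".mtx.gz")]] = entries
    return features, barcodes, matrices
```

The annotation files are read once and every matrix is checked against them, which is the part a reader with a configurable matrix file name would need to get right.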
A 3-parameter version of the read_10x_mtx() function would be my vote as the most flexible option.
Linking partially related issue #1860 to connect