
read_10x_mtx with different prefix

Open alexblaessle opened this issue 6 years ago • 10 comments

Hi scanpy team,

I am not sure if I just missed it, but there does not seem to be a way to specify a different filename for .mtx files. For instance, I have a folder with multiple .mtx files (sample1.matrix.mtx, sample2.matrix.mtx, ...) and the corresponding sample1.genes.tsv and sample1.barcodes.tsv. It would be useful to be able to specify either the matrix/genes/barcodes filenames or a prefix/suffix for the files.

Thanks in advance, Alex

alexblaessle avatar Oct 21 '19 14:10 alexblaessle

Hi! That function is for reading the files output by cellranger’s mex option. Your files have been renamed by someone in a way we can’t predict, and you should just adapt the little code needed to read them yourself:

https://github.com/theislab/scanpy/blob/e6e08e51d63c78581bb9c86fe6e302b80baef623/scanpy/readwrite.py#L324-L341

Took me 3 minutes:

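from pathlib import Path
import anndata as ad
import pandas as pd
import scanpy as sc

path = Path('data')  # placeholder: folder holding the renamed files
samples = []
for i in range(1, 10):
    prefix = f'sample{i}'  # matches sample1.matrix.mtx, sample1.genes.tsv, ...
    # the .mtx file stores genes x cells, so transpose to cells x genes
    s = sc.read(path / f'{prefix}.matrix.mtx').T
    genes = pd.read_csv(path / f'{prefix}.genes.tsv', header=None, sep='\t')
    s.var_names = genes[0].values
    s.var['gene_symbols'] = genes[1].values
    s.obs_names = pd.read_csv(path / f'{prefix}.barcodes.tsv', header=None)[0].values
    samples.append(s)
# anndata.concat takes the list directly; label/index_unique keep samples apart
adata = ad.concat(samples, label='sample', index_unique='-', merge='same')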

flying-sheep avatar Oct 23 '19 13:10 flying-sheep

Hi @flying-sheep ,

Thanks, this is what I did myself, too. I just thought it would be cool to have the function be more general in case someone gets non-standard data (especially for people who are new to scanpy or Python).

Anyways, love the package, great work!

alexblaessle avatar Nov 04 '19 09:11 alexblaessle

Thanks for the praise!

If there was a way to generalize this function, we could do it. As is, I don’t see any, other than letting the user specify all three file names. Is that what you want?

flying-sheep avatar Nov 04 '19 12:11 flying-sheep

I thought this would be useful. I recently got a few datasets that were renamed and/or in a different folder structure and I thought it would be good if one could specify that. Something like

def read(folder, mtx_file=None, features_file=None, **kwargs):
    if mtx_file is not None:
        ...  # Load the explicitly named mtx file
    else:
        ...  # Fall back to loading from folder
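
To make the "explicit name, else fall back" idea concrete, here is a minimal sketch of just the file lookup, with no parsing. The helper name `resolve_mex_paths`, its `prefix` argument, and the role keywords are hypothetical and not part of scanpy; the default names are the ones Cell Ranger writes (gzipped `features.tsv.gz` in newer versions, plain `genes.tsv` in older ones).

```python
from pathlib import Path

# Default file names per role, in lookup order (newer Cell Ranger layout first).
DEFAULT_NAMES = {
    "matrix": ["matrix.mtx.gz", "matrix.mtx"],
    "features": ["features.tsv.gz", "genes.tsv"],
    "barcodes": ["barcodes.tsv.gz", "barcodes.tsv"],
}


def resolve_mex_paths(folder, prefix="", **explicit):
    """Hypothetical helper: map each role to an existing file in `folder`.

    `explicit` may hold matrix=..., features=..., barcodes=... file names.
    Roles that are not given fall back to `prefix` + a default name.
    """
    folder = Path(folder)
    unknown = set(explicit) - set(DEFAULT_NAMES)
    if unknown:
        raise TypeError(f"unknown roles: {sorted(unknown)}")

    paths = {}
    for role, defaults in DEFAULT_NAMES.items():
        if explicit.get(role) is not None:
            candidates = [explicit[role]]  # the user-supplied name wins
        else:
            candidates = [prefix + name for name in defaults]
        found = [folder / name for name in candidates if (folder / name).is_file()]
        if not found:
            raise FileNotFoundError(f"no {role} file among {candidates} in {folder}")
        paths[role] = found[0]
    return paths
```

With the layout from the first post, `resolve_mex_paths(folder, prefix="sample1.")` would pick up `sample1.matrix.mtx`, `sample1.genes.tsv` and `sample1.barcodes.tsv`, and the resulting paths could then be handed to whatever reading code is used.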

Again, thank you so much!

alexblaessle avatar Nov 08 '19 06:11 alexblaessle

OK, let’s do this.

flying-sheep avatar Nov 08 '19 11:11 flying-sheep

We would also be highly interested in this feature. Sometimes datasets on GEO follow exactly the pattern Alex described earlier. The small improvement on scanpy's side would allow us to read such data faster :-)

jenzopr avatar Jan 24 '20 09:01 jenzopr

Are there plans to incorporate this feature? Would be very helpful to specify the filenames to read in -- a lot of GEO data doesn't follow the expected format.

annashch-insitro avatar May 19 '22 15:05 annashch-insitro

Hey, sorry for being slow here.

Upon looking into this again, it is the case that read_10x_mtx has to make strong assumptions about the files being generated by Cell Ranger. This is also reflected in the file names that software outputs.

Is there a widely used processing pipeline which does not adhere to this file naming? If yes, scanpy should indeed be able to deal with this; if no, custom workflows would be handled more reliably by a small custom reading script, as suggested by @flying-sheep above.

eroell avatar Oct 12 '23 09:10 eroell

Is there a widely used processing pipeline which does not adhere to this file naming?

STARsolo generates Cell Ranger-compatible output, and when multiple multi-mapper resolution strategies are enabled, it writes multiple *.mtx.gz files with different names.

E.g., STARsolo ... --soloMultiMappers Unique EM PropUnique Rescue Uniform yields:

barcodes.tsv.gz
features.tsv.gz
matrix.mtx.gz
UniqueAndMult-EM.mtx.gz
UniqueAndMult-PropUnique.mtx.gz
UniqueAndMult-Rescue.mtx.gz
UniqueAndMult-Uniform.mtx.gz

Each of these *.mtx.gz files matches the same format as matrix.mtx.gz and can be read in the same way. (They all share the *.tsv.gz files).

A 3-parameter version of the read_10x_mtx() function would be my vote as the most flexible option.

fwip avatar Mar 17 '24 16:03 fwip

Linking partially related issue #1860 to connect

eroell avatar Mar 21 '24 15:03 eroell