
List of files / multiple files as input

Open SergejN opened this issue 5 years ago • 5 comments

Dear maintainers,

is it possible to add an option to specify a list of input files instead of a single file? I work with the axolotl genome and have quite a few long reads. Therefore, I have two options:

    1. either I zcat the input files into a single huge fastq file, which is a bit wasteful given the amount of data, OR
    2. I zcat the input files and pipe the data to winnowmap.

However, since the genome is too large, minimap2 has to split the index. Therefore, if I pipe the data, winnowmap ends up mapping the reads only to the first 5 scaffolds, which are included in the first index chunk. The other scaffolds are processed afterwards as well, but by then there is no more data in the pipe. It would be nice to be able to specify multiple input files, all of which can be read multiple times if necessary.

I also tried creating the index first by setting -d scaffolds.mmi, and then running winnowmap, but in this case I get a segmentation fault.

thanks!

SergejN avatar Nov 22 '20 12:11 SergejN

You can run the mapper as:

winnowmap -W repetitive_k15.txt -ax map-ont ref.fa ont1.fq.gz ont2.fq.gz ont3.fq.gz  ...

Will this resolve your issue?

cjain7 avatar Nov 22 '20 13:11 cjain7

BTW, you can also tweak the size of the chunk that is processed at a time (assuming you can tolerate more memory usage) using the -I parameter.

See https://lh3.github.io/minimap2/minimap2.html

cjain7 avatar Nov 22 '20 13:11 cjain7

> You can run the mapper as:
>
> winnowmap -W repetitive_k15.txt -ax map-ont ref.fa ont1.fq.gz ont2.fq.gz ont3.fq.gz  ...
>
> Will this resolve your issue?

In theory, yes, but it's also super inconvenient to specify the names of 137 files on the command line.

> BTW, you can also tweak the size of chunk that is processed at a time (assuming you can tolerate more memory-usage) using -I parameter.
>
> See https://lh3.github.io/minimap2/minimap2.html

Yes, I saw this parameter, but I had the impression that minimap2 cannot process sequences longer than 4G. I now see that this was incorrect: the limit only applies to a single sequence within the dataset, not to the total length of the sequences. I will give it a try and set -I to the whole genome size (32Gb). Thanks!
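A rough sketch of what that attempt could look like. The 40G value and the read file names are placeholders (assumptions, not values from this thread), chosen only so that the batch size exceeds the 32 Gb genome and the index is not split. The snippet only prints the command so it can be checked before running.

```bash
# Sketch only: 40G and the file names below are illustrative placeholders.
batch_size="40G"   # assumed to be larger than the total reference length

# Assemble the command line; -I sets the number of reference bases per index batch.
cmd="winnowmap -W repetitive_k15.txt -I $batch_size -ax map-ont ref.fa ont1.fq.gz ont2.fq.gz"

# Dry run: print the command for inspection instead of executing it.
echo "$cmd"
```

A larger -I value means the whole index is held in memory at once, so this only helps on a machine with enough RAM.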

SergejN avatar Nov 22 '20 17:11 SergejN

You might be able to do

winnowmap -W repetitive_k15.txt -ax map-ont ref.fa $(ls -1 *.fq.gz | tr '\n' ' ')

Not tested. This assumes all FASTQ files are desired and have the extension .fq.gz.

jelber2 avatar Nov 23 '20 09:11 jelber2

Yes, sure. This will also work, unless you have to specify so many files that the command line becomes too long (2MB on my system, so quite a few file names):

winnowmap -W repetitive_k15.txt -ax map-ont ref.fa $(find . -name "*.fq.gz" | grep -v 'whatever_you_want_to_exclude' | tr '\n' ' ')

But I wanted to propose a more elegant way. Of course, I can also put the file names into a text file and then run (assuming there are no spaces or other weird characters)

winnowmap -W repetitive_k15.txt -ax map-ont ref.fa $(cat filelist | tr '\n' ' ')
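If spaces in file names are a concern, one possible variant is to read the list into a bash array, so that each line stays a single argument. This is a minimal sketch, assuming bash 4 or newer (for mapfile) and one path per line; the temporary list and the file names in it are made up for the demo, and the winnowmap command is only printed, not run.

```bash
#!/usr/bin/env bash
# Sketch only: the list file and the names in it are hypothetical demo data.
list=$(mktemp)
printf '%s\n' 'run 1/ont1.fq.gz' 'ont2.fq.gz' > "$list"

# Read one path per line into an array; spaces inside a path are preserved.
# (Paths containing newlines are not handled by this approach.)
mapfile -t reads < "$list"

# Rough size of the file arguments versus the system limit on command-line length.
arg_bytes=$(printf '%s\0' "${reads[@]}" | wc -c)
echo "files: ${#reads[@]}, argument bytes: ${arg_bytes}, ARG_MAX: $(getconf ARG_MAX)"

# Dry run: print the command. Remove the leading 'echo' to actually run it.
echo winnowmap -W repetitive_k15.txt -ax map-ont ref.fa "${reads[@]}"

rm -f "$list"
```

Quoting "${reads[@]}" is what keeps each path as one word; the unquoted $(cat filelist) form splits on every space.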

SergejN avatar Nov 23 '20 19:11 SergejN