Demultiplexing single-end reads with combinatorial dual indexes

Open racng opened this issue 2 years ago • 3 comments

cutadapt 4.2 with Python 3.10.9

I tried to demultiplex Nanopore single-end reads with combinatorial dual indexes, but it did not work: Cutadapt created a single file literally named {name1}-{name2}.fastq.gz. The command was:

cutadapt -g file:barcodes_i5.fasta -a file:barcodes_i7.fasta -o {name1}-{name2}.fastq.gz partial_tcr.fastq

# barcodes_i5.fasta
>S502
AATGATACGGCGACCACCGAGATCTACACCTCTCTAT

# barcodes_i7.fasta
>N701
TCGCCTTAATCTCGTATGCCGTCTTCTGCTTG

On the other hand, this seems to work:

cutadapt -g file:barcodes_pairs.fasta -o {name}.fastq.gz partial_tcr.fastq

It generated the file A1.fastq.gz.

# barcodes_pairs.fasta
>A1
AATGATACGGCGACCACCGAGATCTACACCTCTCTAT...TCGCCTTAATCTCGTATGCCGTCTTCTGCTTG

racng avatar Mar 05 '23 01:03 racng

Excuse my ignorance, but are the i5 and i7 indices also used in Nanopore sequencing? I thought these were Illumina indices.

The {name1} and {name2} template variables only work for paired-end data. name1 refers to the adapter found on R1 and name2 refers to the adapter found on R2.

For demultiplexing single-end reads based on the occurrence of a 5' and a 3' adapter, Cutadapt currently offers linked adapters. A problem at the moment is that you need to list all possible adapter combinations "by hand"; see https://github.com/marcelm/cutadapt/issues/625#issuecomment-1145196377. Your barcodes.fasta file would start like this:

>S502_N701
AATGATACGGCGACCACCGAGATCTACACCTCTCTAT...TCGCCTTAATCTCGTATGCCGTCTTCTGCTTG

Then you can use -a file:barcodes.fasta -o {name}.fastq.gz input.fastq.gz.

Because having to list out all combinations is annoying and also inefficient, I plan to improve this (see #633), but I have not had the time to do so yet.

marcelm avatar Mar 06 '23 10:03 marcelm

Thanks for your reply! Yes, we performed Nanopore sequencing on a library originally prepared for Illumina sequencing.

After some thought, I think it may actually be a good thing to specify all the adapter combinations that we expect. I ended up writing a script to generate all possible adapter combinations. Still, listing every combination does seem inefficient if Cutadapt performs an alignment for each linked adapter.

It would be more efficient to look for all the unique 3' and 5' adapters separately, as you plan to do, and then demultiplex based on the combinations found. When you implement the --link-adapters feature, it would be great if it could take a table of expected barcode pairs, in case the best-matching pair does not actually exist in the sample.
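
For anyone landing here with the same problem, below is a minimal sketch of the kind of script mentioned above. It is not part of Cutadapt: the function names parse_fasta, linked_adapters and format_fasta, and the optional expected_pairs argument, are made up for illustration. It assumes simple FASTA input and, as a convenience, skips lines starting with "#" like the file-name labels in the snippets in this thread. Names are joined with "_" and sequences with "...", matching the linked-adapter example given earlier.

```python
from itertools import product


def parse_fasta(lines):
    """Parse FASTA lines into a list of (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and '#' label lines (an assumption)
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # keep only the first word of the header as the adapter name
            name, chunks = line[1:].split()[0], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def linked_adapters(five_prime, three_prime, expected_pairs=None):
    """Yield (name, linked_sequence) for each 5'/3' adapter combination.

    expected_pairs is an optional set of (name5, name3) tuples; when given,
    combinations not in the set are left out.
    """
    for (n5, s5), (n3, s3) in product(five_prime, three_prime):
        if expected_pairs is not None and (n5, n3) not in expected_pairs:
            continue
        yield f"{n5}_{n3}", f"{s5}...{s3}"


def format_fasta(records):
    """Render (name, sequence) tuples as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)
```

To use it, pass the lines of barcodes_i5.fasta and barcodes_i7.fasta (for example, open file objects) to parse_fasta, write the output of format_fasta(linked_adapters(i5, i7)) to barcodes.fasta, and give that file to Cutadapt with -a file:barcodes.fasta -o {name}.fastq.gz as described above. Passing expected_pairs keeps the file limited to the pairs actually present in the library, which also reduces the number of linked adapters Cutadapt has to consider.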

racng avatar Mar 06 '23 19:03 racng