Trimming adapter sequence with Ns

Open omarwagih opened this issue 4 years ago • 8 comments

One of the FASTQ files I'm processing was prepared with the NEXTflex™ Small RNA-Seq Kit, which uses an adapter sequence with Ns in it:

NNNNTGGAATTCTCGGGTGCCAAGG

I tried passing this to fastp via --adapter_sequence, but I get the error:

ERROR: the adapter <adapter_sequence> can only have bases in {A, T, C, G}

I also tried trimming the non-N version of the adapter and then trimming 4 bases off the tail, but fastp seems to trim the tail first and the adapter second, so this doesn't work:

--adapter_sequence=TGGAATTCTCGGGTGCCAAGG --trim_tail1 4
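To make the order-of-operations problem concrete, here is a minimal Python sketch of the two orders. It is not fastp's implementation: it assumes an exact, full-length adapter match, and the insert, the 4 random bases, and the function names are made up for illustration.

```python
def trim_adapter_then_tail(read: str, adapter: str, extra: int) -> str:
    """Desired order: cut at the adapter first, then drop `extra` more bases."""
    pos = read.find(adapter)
    if pos == -1:
        return read  # no adapter found: leave the read untouched
    return read[:max(pos - extra, 0)]


def trim_tail_then_adapter(read: str, adapter: str, extra: int) -> str:
    """Order described above: fixed tail trim first, adapter search second."""
    if extra:
        read = read[:len(read) - extra]
    pos = read.find(adapter)
    return read if pos == -1 else read[:pos]


insert = "ACGTACGTACGTACGTACGTAC"       # hypothetical small-RNA insert
adapter = "TGGAATTCTCGGGTGCCAAGG"        # non-N part of the adapter
read = insert + "GATC" + adapter + "AACT"  # "GATC" stands in for the 4 Ns

print(trim_adapter_then_tail(read, adapter, 4))  # insert only
print(trim_tail_then_adapter(read, adapter, 4))  # insert plus the 4 random bases
```

In the second order the 4 bases are taken from the far end of the read, so the random bases next to the insert survive.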

Is there any way of processing this fastq file using fastp?

Thanks!

omarwagih avatar Apr 14 '21 15:04 omarwagih

I'd also be very interested in having this feature, e.g. --adapter_sequence_r2 NNNNNNNNNNAGATCGGAAGAGCGTCGTGTAGGGAAAGA should remove an extra 10 bases. Are there any plans to implement this? BTW, Trim Galore seems to support Ns in the adapters.
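As a rough illustration of what the requested behaviour could look like, the following Python sketch treats N in the adapter as a wildcard that matches any base. It is an assumption about the desired semantics, not how fastp or Trim Galore work internally; the function names and the example read are hypothetical, and only full-length adapter occurrences are considered.

```python
def find_adapter_with_n(read: str, adapter: str, max_mismatches: int = 0) -> int:
    """Return the first index where `adapter` matches `read`, with N matching
    any base; return -1 if there is no full-length match."""
    for start in range(len(read) - len(adapter) + 1):
        mismatches = 0
        for a, r in zip(adapter, read[start:start + len(adapter)]):
            if a != "N" and a != r:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many mismatches at this offset
        else:
            return start  # inner loop finished without exceeding the limit
    return -1


def trim_adapter_with_n(read: str, adapter: str, max_mismatches: int = 0) -> str:
    """Remove the adapter (including its N positions) and everything after it."""
    pos = find_adapter_with_n(read, adapter, max_mismatches)
    return read if pos == -1 else read[:pos]


adapter = "N" * 10 + "AGATCGGAAGAGCGTCGTGTAGGGAAAGA"
insert = "ACGTACGTACGTACGTACGT"   # hypothetical insert
bridge = "TTTTTTTTTT"             # hypothetical 10 unknown bases
read = insert + bridge + "AGATCGGAAGAGCGTCGTGTAGGGAAAGA"

print(trim_adapter_with_n(read, adapter))
```

Because the Ns are part of the match, the 10 extra bases are removed together with the adapter in a single step.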

Thanks!

riederd avatar Oct 20 '21 07:10 riederd

I will consider implementing this feature.

Could you let me know what the NNNNNNNNNN is designed for? UMIs, or barcodes in single-cell sequencing?

sfchen avatar Oct 20 '21 08:10 sfchen

Here is my use case: Notes from the Zymo-Seq RiboFree® Total RNA Library Kit:

The Zymo-Seq RiboFree® Total RNA Library Kit employs a low-complexity bridge to ligate the Illumina® P7 adapter sequence to the library inserts. This sequence can extend up to 10 nucleotides. QC analysis software (e.g., FastQC) may raise flags such as “Per base sequence content” at the beginning of Read 2 due to this low-complexity bridge sequence.

I hope this answers your question.

Thanks

riederd avatar Oct 20 '21 08:10 riederd

In my case it’s just a 3’ sequencing adapter

omarwagih avatar Oct 20 '21 09:10 omarwagih

To add to what @riederd posted, here's additional info from the Zymo-Seq RiboFree suggestion:

If desired, these 10 nucleotides can be removed in addition to adapter trimming. An example using Trim Galore! for such trimming is as below:

trim_galore --paired --clip_R2 10 \
-a NNNNNNNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
-a2 AGATCGGAAGAGCGTCGTGTAGGGAAAGA \
sample.R1.fastq.gz \
sample.R2.fastq.gz

Related to this, it would be great if fastp allowed for trimming an additional n bp after adapter trimming, similar to the functionality that the trim_galore command provides above.

Currently, it looks like fastp --trim_front and --trim_tail are steps 2 and 3 (respectively) in the order of operations, which come well before the adapter trimming occurs.

In the current fastp configuration, this means if one wanted to trim the adapters and then trim an additional n bp from the trimmed reads, the user would have to initiate a second round of fastp.
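The two-round workaround described above could be scripted roughly as follows. This is only a sketch for the single-end case: the helper name and file names are hypothetical, and it only builds the two command lines. Note that the second pass trims every read by the same amount, including reads in which no adapter was found, and that it re-applies fastp's default filters unless those are disabled.

```python
import shlex


def two_pass_fastp_commands(fq_in: str, fq_out: str, adapter: str,
                            extra_tail: int,
                            tmp: str = "adapter_trimmed.fastq.gz"):
    """Build argv lists for two fastp runs: adapter trimming, then tail trimming."""
    first = ["fastp", "--in1", fq_in, "--out1", tmp,
             "--adapter_sequence", adapter]
    second = ["fastp", "--in1", tmp, "--out1", fq_out,
              "--disable_adapter_trimming",       # adapters were removed in pass 1
              "--trim_tail1", str(extra_tail)]    # drop the extra bases afterwards
    return [first, second]


for cmd in two_pass_fastp_commands("sample.fastq.gz", "sample.trimmed.fastq.gz",
                                   "TGGAATTCTCGGGTGCCAAGG", 4):
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` to actually execute the step.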

kubu4 avatar Feb 24 '22 14:02 kubu4

This use case comes up for other kits as well, such as trimming the Adaptase tail produced by xGen library prep kits, for which the manufacturer notes: "Illumina adapter trimming must be performed before ... Adaptase tail trimming"

blostein avatar Dec 21 '23 02:12 blostein

This use case came up for us too.

Our case is similar to the one mentioned by @blostein. Our sequencing partner's documentation says:

Indexed adapter sequences

The full-length adapter sequences are below. The underlined text indicates the location of the index sequences, which are 8 bp for CDI and 10 bp for UDI. These sequences represent the adapter sequences following completion of the indexing PCR step.

Index 1 (i7) adapters
5′-GATCGGAAGAGCACACGTCTGAACTCCAGTCACXXXXXXXX(XX)ATCTCGTATGCCGTCTTCTGCTTG-3′

Index 2 (i5) adapters
5′-AATGATACGGCGACCACCGAGATCTACACYYYYYYYY(YY)ACACTCTTTCCCTACACGACGCTCTTCCGATCT-3′

This leaves us puzzled as to what the ambiguous X and Y characters are.

Being able to declare wildcards, or N subsequences of known length (NNNNN+), would immediately solve our problem, since we cannot:

  • treat the known sequences to the left and right of the unknown characters as 4 separate adapters, since trimming would still leave the reads peppered with the unknown X+ and Y+ subsequences.
  • concatenate the known left and right flanks of each adapter, leaving out the unknown X+ or Y+ part, to form 2 joined adapters, since these would never match during trimming.
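A known-length wildcard of the kind described here can be sketched in Python with a regular expression. This only illustrates the idea and is not a fastp feature: the flanks are copied from the i7 adapter above, while the function names, the example insert, and the example indexes are hypothetical, and matching is exact with no mismatch tolerance.

```python
import re

# Known flanks of the i7 adapter quoted above; the index sits between them.
I7_LEFT = "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
I7_RIGHT = "ATCTCGTATGCCGTCTTCTGCTTG"


def adapter_regex(left: str, right: str,
                  min_index: int = 8, max_index: int = 10) -> re.Pattern:
    """Compile left flank + 8 to 10 arbitrary bases + right flank."""
    return re.compile(f"{left}[ACGTN]{{{min_index},{max_index}}}{right}")


def trim_indexed_adapter(read: str, pattern: re.Pattern) -> str:
    """Cut the read at the start of the first full adapter match, if any."""
    match = pattern.search(read)
    return read if match is None else read[:match.start()]


pattern = adapter_regex(I7_LEFT, I7_RIGHT)
insert = "TTGACCTTGACCTTGACCTTGACC"                      # hypothetical insert
read_cdi = insert + I7_LEFT + "ACGTACGT" + I7_RIGHT      # 8 bp index
read_udi = insert + I7_LEFT + "ACGTACGTAC" + I7_RIGHT    # 10 bp index

print(trim_indexed_adapter(read_cdi, pattern))
print(trim_indexed_adapter(read_udi, pattern))
```

The `{8,10}` quantifier covers both the 8 bp CDI and the 10 bp UDI index lengths with a single pattern, so the index never has to be spelled out.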

a1ultima avatar Jan 24 '24 18:01 a1ultima