cutadapt U in adapter sequence does not match U in input sequence

I downloaded the SILVA LSU database to use as a reference, and I'm trying to cut it at the location of the 3' primer I am using in my study. However, for some reason cutadapt only finds a small number of very short matches:

cutadapt -a 'CGAAGUUUCCCUCA' -e 0.2 silva.fasta.gz >/dev/null
This is cutadapt 2.0 with Python 3.6.7
Command line parameters: -a CGAAGUUUCCCUCA -e 0.2 silva.fasta.gz
Processing reads on 1 core in single-end mode ...
[         8=-] 00:00:26      198,843 reads   @      131 us/read;    0.46 M reads/minute  
Finished in 26.02 s (131 us/read; 0.46 M reads/minute).

=== Summary ===

Total reads processed:                 198,843
Reads with adapters:                     2,258 (1.1%)
Reads written (passing filters):       198,843 (100.0%)

Total basepairs processed:   573,641,964 bp
Total written (filtered):    573,632,517 bp (100.0%)

=== Adapter 1 ===

Sequence: CGAAGTTTCCCTCA; Type: regular 3'; Length: 14; Trimmed: 2258 times.

No. of allowed errors:
0-4 bp: 0; 5-9 bp: 1; 10-14 bp: 2

Bases preceding removed adapters:
  A: 31.5%
  C: 34.7%
  G: 33.8%
  T: 0.0%
  none/other: 0.1%

Overview of removed sequences
length	count	expect	max.err	error counts
3	812	3106.9	0	812
4	398	776.7	0	163 235
5	869	194.2	1	72 797
6	179	48.5	1	0 179

grep finds a a lot more than that, without any error tolerance or partial matching:

$ zcat silva.fasta.gz | grep -c 'CGAAGUUUCCCUCA'
16585

What's going on? Many of these sequences are 3000+ bp, and the adapter, when it occurs, should be at around 1000, so I am hoping to cut off ~2000 bp from each sequence. Is cutadapt not looking so far from the 3' end?

I have the same issue in cutadapt 1.18 with Python 2.7.12, and installing via apt and conda.

Mar 17 '19 21:03 brendanf

Hi, Cutadapt looks for the adapter in the entire sequence, so that is not the issue.

The problem appears to be with the U characters in the input sequences. Cutadapt converts all U characters in the adapter sequence to T automatically, but it does not do the same for the sequences in the input file. This problem hasn’t come up before because Cutadapt is usually used to trim sequencing reads, which of course never contain uracils.

As a workaround, you can add any IUPAC wildcard character to your adapter sequence. For example, you could use -a CGAAGUUUCCCUCAN instead. This will force the sequences to be pre-processed in a different way, and then the U to T conversion is done for both the adapter and the input sequence. (The conversion is only done internally; the trimmed sequences will have the original letters.)

It would be nice if this would work, but I need to consider whether there could be any negative consequences before I fix it. Thanks for bringing this up!

Mar 17 '19 22:03 marcelm

Thanks for the quick reply, and the workaround works great!

I guess another option would be to convert U to T in the input file manually with something like sed '/>/!y/U/T/'.

On 3/17/19 23:28, Marcel Martin wrote:

Hi, Cutadapt looks for the adapter in the entire sequence, so that is not the issue.

The problem appears to be with the |U| characters in the input sequences. Cutadapt converts all |U| characters in the adapter sequence to |T| automatically, but it does not do the same for the sequences in the input file. This problem hasn’t come up before because Cutadapt is usually used to trim sequencing reads, which of course never contain uracils.

As a workaround, you can add any IUPAC wildcard character to your adapter sequence. For example, you could use |-a CGAAGUUUCCCUCAN| instead. This will force the sequences to be pre-processed in a different way, and then the |U| to |T| conversion /is/ done for both the adapter and the input sequence. (The conversion is only done internally; the trimmed sequences will have the original letters.)

It would be nice if this would work, but I need to consider whether there could be any negative consequences before I fix it. Thanks for bringing this up!

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/marcelm/cutadapt/issues/368#issuecomment-473721478, or mute the thread https://github.com/notifications/unsubscribe-auth/AQDkwQF2ppxHCTRbSg5afvzzw44iinPLks5vXsGBgaJpZM4b4qJ4.

Mar 17 '19 22:03 brendanf