U in adapter sequence does not match U in input sequence
I downloaded the SILVA LSU database to use as a reference, and I'm trying to cut it at the location of the 3' primer I am using in my study. However, for some reason cutadapt only finds a small number of very short matches:
cutadapt -a 'CGAAGUUUCCCUCA' -e 0.2 silva.fasta.gz >/dev/null
This is cutadapt 2.0 with Python 3.6.7
Command line parameters: -a CGAAGUUUCCCUCA -e 0.2 silva.fasta.gz
Processing reads on 1 core in single-end mode ...
[ 8=-] 00:00:26 198,843 reads @ 131 us/read; 0.46 M reads/minute
Finished in 26.02 s (131 us/read; 0.46 M reads/minute).
=== Summary ===
Total reads processed: 198,843
Reads with adapters: 2,258 (1.1%)
Reads written (passing filters): 198,843 (100.0%)
Total basepairs processed: 573,641,964 bp
Total written (filtered): 573,632,517 bp (100.0%)
=== Adapter 1 ===
Sequence: CGAAGTTTCCCTCA; Type: regular 3'; Length: 14; Trimmed: 2258 times.
No. of allowed errors:
0-4 bp: 0; 5-9 bp: 1; 10-14 bp: 2
Bases preceding removed adapters:
A: 31.5%
C: 34.7%
G: 33.8%
T: 0.0%
none/other: 0.1%
Overview of removed sequences
length count expect max.err error counts
3 812 3106.9 0 812
4 398 776.7 0 163 235
5 869 194.2 1 72 797
6 179 48.5 1 0 179
grep finds a a lot more than that, without any error tolerance or partial matching:
$ zcat silva.fasta.gz | grep -c 'CGAAGUUUCCCUCA'
16585
What's going on? Many of these sequences are 3000+ bp, and the adapter, when it occurs, should be at around 1000, so I am hoping to cut off ~2000 bp from each sequence. Is cutadapt not looking so far from the 3' end?
I have the same issue in cutadapt 1.18 with Python 2.7.12, and installing via apt and conda.
Hi, Cutadapt looks for the adapter in the entire sequence, so that is not the issue.
The problem appears to be with the U characters in the input sequences. Cutadapt converts all U characters in the adapter sequence to T automatically, but it does not do the same for the sequences in the input file. This problem hasn’t come up before because Cutadapt is usually used to trim sequencing reads, which of course never contain uracils.
As a workaround, you can add any IUPAC wildcard character to your adapter sequence. For example, you could use -a CGAAGUUUCCCUCAN instead. This will force the sequences to be pre-processed in a different way, and then the U to T conversion is done for both the adapter and the input sequence. (The conversion is only done internally; the trimmed sequences will have the original letters.)
It would be nice if this would work, but I need to consider whether there could be any negative consequences before I fix it. Thanks for bringing this up!
Thanks for the quick reply, and the workaround works great!
I guess another option would be to convert U to T in the input file
manually with something like sed '/>/!y/U/T/'.
On 3/17/19 23:28, Marcel Martin wrote:
Hi, Cutadapt looks for the adapter in the entire sequence, so that is not the issue.
The problem appears to be with the |U| characters in the input sequences. Cutadapt converts all |U| characters in the adapter sequence to |T| automatically, but it does not do the same for the sequences in the input file. This problem hasn’t come up before because Cutadapt is usually used to trim sequencing reads, which of course never contain uracils.
As a workaround, you can add any IUPAC wildcard character to your adapter sequence. For example, you could use |-a CGAAGUUUCCCUCAN| instead. This will force the sequences to be pre-processed in a different way, and then the |U| to |T| conversion /is/ done for both the adapter and the input sequence. (The conversion is only done internally; the trimmed sequences will have the original letters.)
It would be nice if this would work, but I need to consider whether there could be any negative consequences before I fix it. Thanks for bringing this up!
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/marcelm/cutadapt/issues/368#issuecomment-473721478, or mute the thread https://github.com/notifications/unsubscribe-auth/AQDkwQF2ppxHCTRbSg5afvzzw44iinPLks5vXsGBgaJpZM4b4qJ4.