Limiting the search to the first n characters
Hi there,
I was wondering if there is a way to limit the adapter search to the first n characters (or the last n characters) of each sequence. This would be particularly useful when demultiplexing: with a considerable number of barcodes to match, one of them often matches somewhere in the middle of the read. Since many sequencing experiments return data with a known structure, the demultiplexing information can be expected within the first n characters, so limiting the search to those characters would be both more precise and quicker.
Barcodes in data I have worked with were usually directly at the 5' end, and Cutadapt offers anchored 5' adapters (https://cutadapt.readthedocs.io/en/stable/guide.html#anchored-5adapters) for these. If you need to be a bit more flexible, you could use a non-internal adapter (https://cutadapt.readthedocs.io/en/stable/guide.html#non-internal) with a couple of N characters at the beginning and set the minimum overlap such that only full occurrences are allowed.
For example, if you have barcode ACGTACGT (length 8) and you want to allow up to 5 bases preceding it:
cutadapt -g 'XN{5}ACGTACGT;min_overlap=8' ...
Allowing a large number of N bases like this is a bit slower than it could be. If that turns out to be a bottleneck for you, please let me know and I can look into optimizing it.
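To illustrate what the adapter specification above is meant to accept, here is a minimal Python sketch. It is not Cutadapt's actual algorithm: it uses exact matching only (no error tolerance), and the function name `find_barcode_near_start` is hypothetical. It reports the barcode only when it occurs in full and is preceded by at most `max_prefix` bases.

```python
def find_barcode_near_start(read, barcode, max_prefix=5):
    """Return the offset of `barcode` in `read` if it occurs in full with
    at most `max_prefix` bases before it, else None.

    Illustrative only: exact matching, no mismatches or indels.
    """
    # The barcode may start at offsets 0..max_prefix, so it must lie
    # entirely within the first max_prefix + len(barcode) characters.
    window = read[: max_prefix + len(barcode)]
    offset = window.find(barcode)
    return offset if offset != -1 else None


if __name__ == "__main__":
    barcode = "ACGTACGT"
    print(find_barcode_near_start("ACGTACGTTTTT", barcode))       # 0
    print(find_barcode_near_start("GGGACGTACGTTTTT", barcode))    # 3
    print(find_barcode_near_start("GGGGGGGACGTACGTTT", barcode))  # None
```

In the last call the barcode starts at offset 7, which is more than 5 bases into the read, so it is not reported.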
Thanks, Marcel, for your quick response. I'll give it a try. I agree that in Illumina datasets the adapter usually starts within the first bases. I have often used the anchoring option and it has worked well. Recently, however, I have come across two fairly common cases in which there is some uncertainty about where the barcode is:
- Illumina datasets in which a variable number of Ns has been added before the barcodes to increase within-cycle variability and hence quality. I have found cases in which the number of Ns is not exactly as predicted, so I thought that looking for the actual adapter, but only within the first 20 bp, would be quicker and more precise.
- Oxford Nanopore datasets, in which the starting point of the sequence varies quite a bit and the higher error rate makes it more likely that the adapter is found in the wrong place.
Hopefully -g XN{4} will solve both my issues (I didn't know about the X option and had never used it). I will let you know how it goes, if you are interested.
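The idea behind both cases, searching for several barcodes only within the first 20 bp while tolerating a few errors, could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not how Cutadapt works internally: the helper names `hamming` and `assign_barcode` are hypothetical, and only substitutions are counted, whereas real Nanopore errors also include insertions and deletions.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def assign_barcode(read, barcodes, window=20, max_mismatches=1):
    """Find the best barcode lying entirely within the first `window` bases.

    `barcodes` maps a name to its sequence. Returns a tuple
    (name, offset, mismatches), or None if no barcode fits.
    Illustrative only: substitutions are counted, indels are not handled.
    """
    best = None
    for name, barcode in barcodes.items():
        # Last start offset at which the barcode still ends inside the window.
        last_start = min(window, len(read)) - len(barcode)
        for start in range(last_start + 1):
            mismatches = hamming(read[start:start + len(barcode)], barcode)
            if mismatches > max_mismatches:
                continue
            # Prefer fewer mismatches, then the earliest offset.
            candidate = (mismatches, start, name)
            if best is None or candidate < best:
                best = candidate
    if best is None:
        return None
    mismatches, start, name = best
    return name, start, mismatches


if __name__ == "__main__":
    barcodes = {"bc1": "ACGTACGT", "bc2": "TTGGCCAA"}
    # bc1 near the start; a bc2 occurrence far downstream is ignored.
    print(assign_barcode("NNNACGTACGTGGGGGGGGGGGGGGGTTGGCCAA", barcodes))
    # bc1 with one substitution, two bases in.
    print(assign_barcode("NNACGTTCGTAAAA", barcodes))
    # Barcode only beyond the first 20 bases: no assignment.
    print(assign_barcode("GGGGGGGGGGGGGGGGGGGGGGGGGACGTACGT", barcodes))
```

Restricting the window keeps a barcode-like stretch in the middle of the read from being picked up, at the cost of missing barcodes that start later than expected.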
Best Regards