Output cut offsets in separate file
Hi, I would like to use cutadapt in our pipeline, which uses a database to store data and requires traceability/reconstruction of the various steps, storing cutoff points for the adapter clipping along with the original sequences. To do that it would be practical to have an option to also produce a tab separated file holding the read identifier and the four cutoff offsets (two for the forward and two for the reverse reads in pairs where the adapter is found). Offsets for the right end should be given from the end of the sequence (i.e. number of bases cutted from the right end as opposed to the offset from the beginning of the read). Would that be feasible? Where should I look in the source tree to implement such a feature?
Can you use --info-file for this? See http://cutadapt.readthedocs.io/en/stable/guide.html#format-of-the-info-file
There are two issues with the info file: since it contains the sequence and quality information it becomes extremely expensive in terms of i/o to use it, then it seems to hold information on the adapter match but not on the actual clipping of the sequence (is a right or left clip being performed?). What I need is practically a description of the actual output in terms of cutoff points referred to the input.
From what I understand from the code so far it seems not to be a trivial operation to obtain such information: I should write a filter to output the values and attach it just before the last filter to ensure only unfiltered sequences are considered (this won't be supported in case of demultiplexing since it has it's own filter). In order for the filter to get the cutoffs, two attributes could be added to the Sequence class (cutoffHead=0, cutoffTail=0), then all modifiers should be made aware of such attributes and update them on the returned sequence (adding in the values from the input sequence of the modifier). Am I on the right track or is there a better way of implementing it?
The info file splits each read into three parts: The part before the adapter, the match itself and the part after it, so you could actually deduce whether the part before or after was removed.
I was initially thinking that it would be sufficient to make the info file format customizable. Would that help?
I believe implementing a new option would be more or less like adding another info file output format, so you could just duplicate the code that is responsible for writing the info file and adjust it.
The offsets for where the adapter was found in a read are stored in the .match attribute of the Sequence class. The modifiers should not need to be changed.
Note two complications (in addition to demultiplexing):
- Cutadapt can search for a set of adapters, so in general the information of which adapter was found would possibly be needed (the current info file contains this)
- There is the
--timesoption, which allows to search each read multiple times for a set of adapters, so you could have multiple cutoffs (I believe in the current info file, this would result in multiple output lines). This needs to be handled, at least by throwing an error message.
Another thought, also because I’m somewhat reluctant to add support for an ad-hoc file format: Since you say you need the original sequence: Would it work to have an option to add the information to the header of each read in the output FASTQ? This could be combined with --no-trim, so you would just have a FASTQ file annotated with adapter locations.
I’ll be busy for a few days doing other things. I hope I can get back to you after the weekend.
I do not get how I can deduce which part was removed just by looking at the info file splits, besides this still means reading/writing sequences twice, which is quite an IO burden.
Customizability of the info file could help, provided the cutoff points could be added to it so that I could get the required format from the customized info file directly. The only showstopper I see is that the info file gets written for all the sequences since the filter comes before any actual filtering while I would like to have such info only for reads actually written in the "main" output file, thus excluding discarded ones. Implementing a different filter would allow putting it just before the final output is written by the last_filter
Although at the time being I am likely going to apply only the adapterCutter modifier, there are also other modifiers cutting bases from the input sequence, they should also contribute to the final cutoffs. If initially going only for adapter matches, the alternative could then be to add up cutoffs for each match, this should give the requested output since the match is always relative to the previous trimmed sequence, right? Then I can compute the actual number of removed bases from the given end (the cutoff value I need).
I don't think I will need to know which adapter(s) actually matched, just how many bases have been removed from each end of the original input read. Adding up values for each match would also cope with the multiple matches possibility.
I understand your reluctance, even though it would be an independent option keeping the tab separated style of the info file. The problem with information added to the header is that such info needs to be parsed afterwards and the header can be (almost) freely modified and is subject to change, I need a stable solution, independent of possible changes in the input data headers. Also altering them seems not an attractive choice to me. I actually need both outputs (trimmed sequences for further processing and cutoffs for storing into the database) since I already load the original input data into the database before processing with cutadapt.
P.S: there seem to be also the case of linked adapters, which complicates things a bit. I'll see what I can figure out in the next days, thanks for your support meanwhile.
An update on the status: on the cutoffs branch of my fork I have implemented the option --cutoffs-file which only works for adapter cuts (no other modifiers are affected/considered), does not support linked matches and outputs information only for unfiltered sequences (the filter is added just before the last_filter). I also included a couple of tests on paired reads (the only ones we are supporting in our pipeline at the moment)
It’s been a while, but I’m just going through old issues to see what to do about them. I still think some kind of customizable info file format is still the way to go to solve this, perhaps similar to what the recently added --rename option allows, but there has been little demand so far. I’ll leave the issue open for now.
The relevant commit in the branch is here: https://github.com/a1an77/cutadapt/commit/c11f16a2195e225edc9456f8bd53ce56a38b16e2
In short:
- The file needs to contain one record for each read in the output FASTA/FASTQ (not for all input, as
--info-filewould) - It needs to work with paired-end data (
--info-filedoesn’t) - Each record should include the length of the removed prefix and the length of the removed suffix (for both R1 and R2)