
Optimizing for multiple cutadapt calls / guidance on complex regex

Open jtrivedi7 opened this issue 2 years ago • 5 comments

Hi,

Our team is using combinatorial sequencing with multi-step barcode appends, so our read-analysis pipeline pipes together multiple cutadapt calls. That means cutadapt's per-read processing loop runs once for every call, which is suboptimal for us. Below is a sample shell script that we are using:

cutadapt -m 24 --nextseq-trim=15 -e 0.15 \
--pair-filter=first \
--cores=$core --report=minimal --trim-n \
-g file:$ligation_barcode \
--rename="{id}_{r1.adapter_name}" \
--discard-untrimmed \
--interleaved \
$input_folder/$sample.R1.fastq.gz $input_folder/$sample.R2.fastq.gz 2>> $output_folder/report_cutadapt.txt | \
  cutadapt -e 0.15 \
--pair-filter=any \
--cores=$core --report=minimal --trim-n \
--rename='{id}_{r1.adapter_name}_{r1.match_sequence}' \
-g file:$N_RT \
--discard-untrimmed \
--interleaved \
-o $output_folder/$sample.tagged.R1.fastq.gz -p $output_folder/$sample.tagged.R2.fastq.gz - >> $output_folder/report_cutadapt.txt 2>&1

As you can see, we first call cutadapt to trim the 'ligation' barcode (adapter 1) and append it to the read ID in the FASTQ file, then take the output of that step, trim the 'RT' barcode (adapter 2), and write the final output to an output folder. We want to do away with this intermediate FASTQ and get the final output from a single cutadapt call. I could write a somewhat more complex adapter expression, but we want to control the error tolerance for the two barcodes individually, not in aggregate: e.g. if the maximum error rate on adapters 1 and 2 is 0.15, each found sequence needs to be capped at 0.15 on its own, not at 0.3 cumulatively.

We usually run these on multiple cores, if that is important.

jtrivedi7 avatar Mar 07 '23 20:03 jtrivedi7

I’m not sure I follow. Can you please clarify where the loop is and which intermediate FASTQ you mean? In the script you have shown, there’s only the intermediate interleaved FASTQ piped from one process to the other.

marcelm avatar Mar 07 '23 20:03 marcelm

Thanks for the prompt reply! Yes, I meant that intermediate FASTQ file. 'Loop' may be the wrong term here. Mainly, I wanted to understand whether there is a way to do away with the piping and have cutadapt search for both sets of adapters in a single call.

jtrivedi7 avatar Mar 07 '23 21:03 jtrivedi7

Thanks, I understand now. The piped back-to-back cutadapt calls are probably the most realistic alternative at the moment. The command-line interface is currently just not expressive enough to allow specifying processing steps in an arbitrary order. In particular, there’s currently no way to have some filters use --pair-filter=any and some use --pair-filter=first. Also, the --rename option can only be used once and it will only replace placeholders based on the most recent adapter match in case there are multiple.

There are partial solutions:

  • --times=2 runs two rounds of adapter trimming.
  • Instead of searching for adapter1 and then for adapter2, you could search for their combined sequence (you’d need to provide all possible combinations and this only works if they are indeed adjacent in the read).
  • The error rate can be set individually per adapter using adapter-specific search parameters (the `;e=...` notation in the adapter specification).
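Combining the second and third points could look roughly like this (a dry-run sketch that only prints the command; LIG and RT are hypothetical placeholder barcodes, and the `;e=...` per-adapter search-parameter notation is assumed here to apply to each component of a linked adapter):

```shell
#!/bin/sh
# Dry-run sketch: build a linked-adapter spec in which each component carries
# its own error cap, so errors are not pooled across the two barcodes.
# LIG and RT are made-up placeholder sequences; a real run would enumerate
# every ligation x RT combination (e.g. in a FASTA file).
LIG="ACGTACGTAC"
RT="TTGCATGCAA"
spec="lig1_rt1=${LIG};e=0.15...${RT};e=0.15"
# Print the command instead of executing it.
echo "cutadapt -g \"${spec}\" --discard-untrimmed --interleaved in.R1.fastq.gz in.R2.fastq.gz"
```

With many barcode combinations, the specs would go into a FASTA file passed via `-g file:...` rather than onto the command line.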

This is another case for an API that would allow one to write a little bit of code to run such arbitrary pipelines. Time for me to think about this again.

That said, the piping solution isn’t that bad either: Writing and reading FASTQ is a little bit of overhead, but the dnaio library that Cutadapt uses for this is really fast, and considering that the intermediate FASTQ isn’t even compressed, the overhead should be quite small.

marcelm avatar Mar 07 '23 21:03 marcelm

Props for the detailed answer! I thought about your second point; the adapter sequences are anchored, so they are indeed adjacent. However, two things concern me: 1) the error rate needs to apply to each adapter individually, not cumulatively; 2) do I risk making the time complexity worse by searching a single file for all possible combinations of the two adapters (as in going from O(n1+n2) to O(n1*n2))?

I will also reconsider my search for an alternative, given your assurance that the piping isn't so bad after all.

jtrivedi7 avatar Mar 07 '23 22:03 jtrivedi7

Indeed, I cannot see how you could ensure that the error rates are separate for the two types of adapters.

Regarding runtime, this could indeed get worse, but not if the tips for speeding up trimming many adapters apply (in particular, the adapters would have to be anchored).

You can measure what the overhead of producing the intermediate FASTQ is by inserting a cutadapt --interleaved -o - between the two processes or by running a similar "empty" cutadapt invocation separately (use uncompressed input and output and no threads). You can then compare the runtime per read to what the full commands take. Given your command-line, I’d guess the extra parsing/writing takes up less than 10% of the time.
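That measurement could be sketched as follows (`reads.interleaved.fastq` is a placeholder for your uncompressed interleaved input; the script only prints the command if cutadapt is not on PATH):

```shell
#!/bin/sh
# Sketch: an "empty" cutadapt invocation with no adapters, one core, and
# uncompressed output, so the runtime is almost pure FASTQ parsing + writing.
passthrough="cutadapt --cores=1 --interleaved -o /dev/null reads.interleaved.fastq"
if command -v cutadapt >/dev/null 2>&1; then
  time $passthrough
else
  echo "would run: time $passthrough"
fi
```

Dividing that wall-clock time by the runtime of the full two-call pipeline gives the fraction spent on the intermediate FASTQ.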

marcelm avatar Mar 07 '23 22:03 marcelm