Low number of good pairs after filtering
Hi,
I am trying to get duplex sequencing up and running, but I get a very low number of 'good pairs' after filtering and, consequently, a very low number of called duplex reads. For example:
Total reads: 28801102
Read pairs (n): 10901142
Paired (%): 75
Good pairs: 1151139
Good pairs (%): 4
BAM Duplex reads: 1002726
Percentage of original reads (%): 3.48
Mapped: 94%
So in this example, I am left with only 4% of the original reads.
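As a rough sanity check on how these percentages relate to the raw counts, here is a minimal sketch. It assumes (this is my guess, not something the tools document) that "Paired (%)" counts two reads per pair against total reads, and that "Good pairs (%)" is good pairs divided by total reads rather than by read pairs. The variable names are only illustrative.

```python
# Counts copied from the summary above.
total_reads = 28801102
read_pairs = 10901142
good_pairs = 1151139
duplex_reads = 1002726


def pct(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


# Assumption: each pair accounts for two of the original reads.
paired_pct = pct(2 * read_pairs, total_reads)
# Assumption: the reported "Good pairs (%)" is relative to total reads.
good_pct_of_total = pct(good_pairs, total_reads)
# Alternative view: fraction of candidate pairs that survive filtering.
good_pct_of_pairs = pct(good_pairs, read_pairs)
duplex_pct = pct(duplex_reads, total_reads)

print("Paired (%):", paired_pct)
print("Good pairs (% of total reads):", good_pct_of_total)
print("Good pairs (% of read pairs):", good_pct_of_pairs)
print("Duplex reads (% of total reads):", duplex_pct)
```

Under those assumptions the values land close to the reported 75%, 4% and 3.48%. Measured against the candidate pairs instead of total reads, roughly one pair in ten passes the filter.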
I am following the recommended basic usage:
duplex_tools pairs_from_summary $output_dir/sequencing_summary.txt $output_dir
duplex_tools filter_pairs $output_dir/pair_ids.txt $output_dir
nanopore_guppy guppy_basecaller_duplex \
--input_path $input_dir \
-r --save_path $duplex_dir \
--device auto \
--config $model \
--duplex_pairing_mode from_pair_list \
--duplex_pairing_file $output_dir/pair_ids_filtered.txt \
--align_ref $ref \
--bam_out
Questions:
Why do I get so few good pairs and, subsequently, so few good reads? 4% is a bit useless. Should I skip the filtering step and run the second guppy run with pair_ids.txt instead?
Lastly, the duplex basecalling could benefit from simplification. Dorado usage looks good, but I am getting errors, so it's not working at the moment. It would be great if guppy could be simplified!
Hi @myxotheles,
Apologies for the late reply. We're phasing out duplex-tools in favour of the batteries-included duplex workflow in dorado.
Sorry to hear you're running into issues. It would be very helpful to know which errors you are getting with dorado, as that is the method we currently recommend.
Just a couple of sanity checks for the run and dataset:
- Was the flow cell a high-duplex flow cell?
- What is the read length of the sample?
- Is the sample native human or something else?
- For the basecalling, were both the pass and fail reads included in the input directory?
Lastly, which tool do the summary metrics you're reporting come from?