Help!!! Different results by using extendReads or not
Dear deeptools developer, Thanks for developing such a great tool! I recently ran a Cut & Tag experiment(2 X 150bp) and want to use the plots from deeptools to show the signal distribution of different histone markers on TSS regions. However, I got different results when I used "extendReads" or not in bamCoverage function with BPM normalization. Please see the plot shown below:

The plot showed that the male sample (the "m" in the sample name) has a higher reads intensity than female sample (the "f" in the sample name) when I used extendReads, but the direction is opposite when I deactivated extendReads. I thought it was possibly caused by the quantification issue that the mate reads in properly paired reads were quantified twice. So I included "--samFlagInclude 66" in bamCoverage to quantify the mate reads only once with extendReads activated. However, the result showed similar as the one activating extendReads without samFlagInclude parameter (see below):

I am wondering why "extendReads" caused so much different results? This is important since it would results in different interpretation for biology.
deeptools version: 3.5.1 python version: 3.9.13
Here are the codes:
###with extendReads activated
bamCoverage --bam
###with extendReads deactivated
bamCoverage --bam
###with samFlagInclude 66
bamCoverage --bam
###computeMatrix codes
computeMatrix reference-point --referencePoint TSS -p 20 -b 2500 -a 2500 -R
I look forward to your reply! Many thanks in advance!!
Joe Wang
Is it necessary to use extendReads parameter for CUT& Tag paired-end sequencing data?
Hi @icanwinwyz I am also curious about this because when using --normalizeUsing RPKM we also observe a huge difference in the coverage plots:
Green is where --extendReads is switched on.
Red is where --extendReads is switched off.
Blue is ChipHub's published data for the same plant sample
And in both cases (red/green) the same --binSize 10 and all other parameters are the same, and --normalizeUsing RPKM is specified other settings are set to default and the input reads are Paired End.
So far I have traced how --extendReads impacts deeptools bamcoverage as follows:
- bamCoverage: the main() block from L143 https://github.com/deeptools/deepTools/blob/master/deeptools/bamCoverage.py#L143
Has this:
wr = writeBedGraph.WriteBedGraph([args.bam],
binLength=args.binSize,
stepSize=args.binSize,
region=args.region,
blackListFileName=args.blackListFileName,
numberOfProcessors=args.numberOfProcessors,
extendReads=args.extendReads,
minMappingQuality=args.minMappingQuality,
ignoreDuplicates=args.ignoreDuplicates,
center_read=args.centerReads,
zerosToNans=args.skipNonCoveredRegions,
samFlag_include=args.samFlagInclude,
samFlag_exclude=args.samFlagExclude,
minFragmentLength=args.minFragmentLength,
maxFragmentLength=args.maxFragmentLength,
chrsToSkip=args.ignoreForNormalization,
verbose=args.verbose,
)
Which is always triggered in all IF/ELSE conditions (i.e. no matter which normalization you use).
- writeBedGraph.py L30: https://github.com/deeptools/deepTools/blob/83f192947606c62028717f0e0824dc23a07536dc/deeptools/writeBedGraph.py#L30
has a class: class WriteBedGraph(cr.CountReadsPerBin)
which uses: import deeptools.countReadsPerBin as cr
- Looking at: countReadsPerBin.py: L30 https://github.com/deeptools/deepTools/blob/83f192947606c62028717f0e0824dc23a07536dc/deeptools/countReadsPerBin.py#L31C7-L31C23
We can see how args.extendReads is coming into play:
And we have the documentation of it here: https://deeptools.readthedocs.io/en/latest/_modules/deeptools/writeBedGraph.html
So what's happening is that by using --extendReads the number of reads that are counted per bin will be higher, since now R1 and R2 will be "filled" along the full length of their component fragment.
Quoting documentation: https://deeptools.readthedocs.io/en/latest/_modules/deeptools/writeBedGraph.html
"extendReads : bool, int
Whether coverage should be computed for the extended read length (i.e. the region covered by the two mates or the regions expected to be covered by single-reads). If the value is ‘int’, then then this is interpreted as the fragment length to extend reads that are not paired. For Illumina reads, usual values are around 300. This value can be determined using the peak caller MACS2 or can be approximated by the fragment lengths computed when preparing the library for sequencing. If the value is of the variable is true and not value is given, the fragment size is sampled from the library but only if the library is paired-end. Default: False "