deepTools icon indicating copy to clipboard operation
deepTools copied to clipboard

CountReadsPerBin using bedFile does not return counts in the expected order

Open alexbarrera opened this issue 4 years ago • 4 comments

Hi team,

When running CountReadsPerBin using a BED file to define the regions, the resulting numpy array doesn't seem to be sorted according to the entries in the BED file. I ran the same code multiple times (with and without multiple processors) and the returned array of counts is always different (however, the total number of counts using.sum() remains constant, suggesting it is just in a different order). Since the output is just a 2D numpy array, there is no way to know which BED entry correspond to which row of the returning array. I understand it might be hard to ensure the order of the returning array when using multiple threads/processors, but at a minimum it is probably a good idea to fully disclose this behavior in the CountReadsPerBin function. This seems to also be the case in multiBamSummary as reported in #848. I ended up following the workaround of using the raw counts and BED coordinates as suggested in that issue, but again it would be nice to disclose this.

Here is the code I ran where I noticed that 2 different iterations of the exact same call reached 2 different count arrays:

from deeptools.countReadsPerBin import CountReadsPerBin as cr
import numpy as np
bamList = [
    "KS134.rep1.masked.dups_marked.bam",
    "KS134.rep2.masked.dups_marked.bam",
    "KS134.rep3.masked.dups_marked.bam",
    "KS134.rep4.masked.dups_marked.bam",
    "KS134.rep5.masked.dups_marked.bam",
    "KS134.rep6.masked.dups_marked.bam"
]

bedFile = "target_regions.bed"

counts = cr(bamFilesList= bamList, numberOfProcessors=4, ignoreDuplicates=True,  bedFile=bedFile).run()
print(counts[:10])

counts = cr(bamFilesList= bamList, numberOfProcessors=4, ignoreDuplicates=True,  bedFile=bedFile).run()
print(counts[:10])

which produces the following outputs:

array([[27., 30., 29., 28.,  6.,  4.],
       [27., 30., 29., 28.,  6.,  4.],
       [28., 31., 29., 29.,  5.,  4.],
       [ 9., 14., 21., 11.,  2.,  4.],
       [ 9., 14., 20., 10.,  2.,  4.],
       [ 9., 14., 20., 10.,  2.,  5.],
       [ 6.,  4., 11.,  9.,  1.,  3.],
       [ 3.,  2., 11.,  7.,  1.,  2.],
       [ 3.,  2., 11.,  8.,  1.,  2.],
       [ 3.,  4.,  5.,  3.,  4.,  1.]])

array([[ 5.,  9.,  6.,  5.,  1.,  5.],
       [ 4., 10.,  6.,  6.,  1.,  4.],
       [ 4.,  6.,  7.,  5.,  2.,  2.],
       [ 4.,  5.,  7.,  5.,  2.,  2.],
       [ 8.,  2.,  9.,  5.,  4.,  0.],
       [10.,  5., 11.,  6.,  4.,  0.],
       [ 6.,  9., 10.,  9.,  4.,  1.],
       [ 5.,  8.,  6.,  9.,  4.,  1.],
       [ 7.,  9.,  9., 10.,  5.,  1.],
       [11., 10.,  7.,  8.,  6.,  2.]])

Python and deeptools versions:

$ python --version
Python 3.7.9

$ bamCoverage --version
bamCoverage 3.5.1

Thanks for this wonderful package by the way, truly appreciated! Alex

alexbarrera avatar Sep 22 '21 01:09 alexbarrera

Hi @alexbarrera:

I have the same problem as you. Can you tell the details about 'using the raw counts and BED coordinates'?

Thanks!

qingyun039 avatar Nov 17 '21 02:11 qingyun039

Hi @qingyun039, I ended up using the multiBamSummary BED-file command with the --outRawCounts argument, which produces a table with the coordinates of the elements in the BED file followed by the counts for each sample (BAM).

alexbarrera avatar Nov 17 '21 02:11 alexbarrera

It would indeed be super nice to have the option to name the rows of multiBamSummary BED-file --outRawCounts From the docs, the last help line states:

--transcript_id_designator TRANSCRIPT_ID_DESIGNATOR Each region has an ID (e.g., ACTB) assigned to it, which for BED files is either column 4 (if it exists) or the interval bounds. [...]

an option like --id_designator ID_DESIGNATOR (not transcript-centric), where the designator column is used to "index" the multiBamSummary BED-file --outRawCounts output file. The output file already includes the coordinates, as mentioned above, so I'm guessing this is possible.

In fact, now I don't know what --transcript_id_designator is normally used for, if not to name rows. Now I'm thinking I'm way off base and just using multiBamSummary wrong.

Anyway, this feature would be quite useful in 'sanity-checking' my work.

Echoing Alex, thank you for Deeptools, it's a joy to use and your work is greatly appreciated. Julien

julienrichardalbert avatar Jul 20 '22 15:07 julienrichardalbert