Entire chromosome outputs as high signal
Entire chr29 as High Signal Region
I have created mappability file for chr29 (smallest chromosome for quicker analysis) with kmer 100 and 150, and used final uint output file of combine_umaps
However, it spits out entire chromosome as high signal region, not sure if this is because of low number of inputs I have provided (5 inputs).
Blacklist was installed using https://anaconda.org/bioconda/encode-blacklist
$ Blacklist chr29
chr29 0 43051100 High Signal Region
Directory structure below
$ tree
.
├── input
│ ├── Ss1_final_chr29.bam
│ ├── Ss1_final_chr29.bam.bai
│ ├── Ss2_final_chr29.bam
│ ├── Ss2_final_chr29.bam.bai
│ ├── Ss3_final_chr29.bam
│ ├── Ss3_final_chr29.bam.bai
│ ├── Ss4_final_chr29.bam
│ ├── Ss4_final_chr29.bam.bai
│ ├── Ss5_final_chr29.bam
│ └── Ss5_final_chr29.bam.bai
└── mappability
└── chr29.uint8.unique
Contents of uint file # combined kmer 100 and 150
$ od -t x1 mappability/chr29.uint8.unique | head
0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0000140 00 00 00 00 00 00 00 96 96 96 96 96 96 96 96 96
0000160 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
*
0000220 96 96 96 96 96 96 96 96 96 64 64 64 64 64 64 64
0000240 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64
*
0000740 64 64 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0000760 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Please could you suggest why Blacklist output is the entire chromosome. Thanks.
Update: Blacklist with 23 inputs, doesn't help.
Tried with 23 inputs, and still gives me the entire chromosome as high signal region. Please could you suggest why Blacklist output is the entire chromosome. Thanks.
I think you probably need more than 100 input tracks for a quality blacklist. We used 1256, 287, 443, and 490 for human, mouse, fly, and worm.
Thanks. I might be able to get another 8 inputs from our collaborators, so total 30+ inputs, hopefully should help. Is there a way we could restrict the size of bins to overcome this reporting of entire chr as high signal? Or do you think trying with another chromosome(s) would help? thanks
It may be that there is very low signal in this chromosome? If you uncomment like 442 you can see what the thresholds are defined as and that might give some insight into why you are getting a whole chromosome like this.
Thanks. I can see on IGV tracks that many regions have high signal in input files, so assuming not low signal issue.
Do you think changing int uniqueLength from default 36 to read length (i.e. 150) and recompile the source will do the trick? I had tested with 100 150 kmers and didn't go with lower numbers for mappability, just wondering if that is causing the issue.
int uniqueLength = 36; //This is arbitraty and defines how long a read needs
// to be to be considered unique
// Should be set to something actually calculated in the uint8 files
for(int i = 0; i <= mappability.size() - binSize; i+=binOverlap) {
uniqueCntr = 0;
for(int j = 0; j < binSize; j++) {
if(mappability[i+j] > 0 && mappability[i+j] <= uniqueLength) {
uniqueCntr++;
}
thanks for your help.
This might be the issue - based on my comment it looks like the length needs to have been included in the unit8 files. Change that to 100 and give it a shot.
Thanks. Yessss, changing int uniqueLength to 100 worked!!!