Blacklist icon indicating copy to clipboard operation
Blacklist copied to clipboard

Entire chromosome outputs as high signal

Open Pooran-Dewari opened this issue 3 years ago • 7 comments

Entire chr29 as High Signal Region

I have created mappability file for chr29 (smallest chromosome for quicker analysis) with kmer 100 and 150, and used final uint output file of combine_umaps However, it spits out entire chromosome as high signal region, not sure if this is because of low number of inputs I have provided (5 inputs).

Blacklist was installed using https://anaconda.org/bioconda/encode-blacklist

$ Blacklist chr29
chr29   0       43051100        High Signal Region

Directory structure below

$ tree
.
├── input
│   ├── Ss1_final_chr29.bam
│   ├── Ss1_final_chr29.bam.bai
│   ├── Ss2_final_chr29.bam
│   ├── Ss2_final_chr29.bam.bai
│   ├── Ss3_final_chr29.bam
│   ├── Ss3_final_chr29.bam.bai
│   ├── Ss4_final_chr29.bam
│   ├── Ss4_final_chr29.bam.bai
│   ├── Ss5_final_chr29.bam
│   └── Ss5_final_chr29.bam.bai
└── mappability
    └── chr29.uint8.unique

Contents of uint file # combined kmer 100 and 150

$ od -t x1 mappability/chr29.uint8.unique | head
0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0000140 00 00 00 00 00 00 00 96 96 96 96 96 96 96 96 96
0000160 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
*
0000220 96 96 96 96 96 96 96 96 96 64 64 64 64 64 64 64
0000240 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64
*
0000740 64 64 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0000760 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Please could you suggest why Blacklist output is the entire chromosome. Thanks.

Pooran-Dewari avatar Nov 19 '22 09:11 Pooran-Dewari

Update: Blacklist with 23 inputs, doesn't help.

Tried with 23 inputs, and still gives me the entire chromosome as high signal region. Please could you suggest why Blacklist output is the entire chromosome. Thanks.

Pooran-Dewari avatar Nov 21 '22 14:11 Pooran-Dewari

I think you probably need more than 100 input tracks for a quality blacklist. We used 1256, 287, 443, and 490 for human, mouse, fly, and worm.

aboyle avatar Nov 29 '22 14:11 aboyle

Thanks. I might be able to get another 8 inputs from our collaborators, so total 30+ inputs, hopefully should help. Is there a way we could restrict the size of bins to overcome this reporting of entire chr as high signal? Or do you think trying with another chromosome(s) would help? thanks

Pooran-Dewari avatar Nov 30 '22 13:11 Pooran-Dewari

It may be that there is very low signal in this chromosome? If you uncomment like 442 you can see what the thresholds are defined as and that might give some insight into why you are getting a whole chromosome like this.

aboyle avatar Nov 30 '22 13:11 aboyle

Thanks. I can see on IGV tracks that many regions have high signal in input files, so assuming not low signal issue. Do you think changing int uniqueLength from default 36 to read length (i.e. 150) and recompile the source will do the trick? I had tested with 100 150 kmers and didn't go with lower numbers for mappability, just wondering if that is causing the issue.

int uniqueLength = 36; //This is arbitraty and defines how long a read needs
				// to be to be considered unique
				// Should be set to something actually calculated in the uint8 files

	for(int i = 0; i <= mappability.size() - binSize; i+=binOverlap) {
		uniqueCntr = 0;
		for(int j = 0; j < binSize; j++) {
			if(mappability[i+j] > 0 && mappability[i+j] <= uniqueLength) {
				uniqueCntr++;
			}

thanks for your help.

Pooran-Dewari avatar Nov 30 '22 14:11 Pooran-Dewari

This might be the issue - based on my comment it looks like the length needs to have been included in the unit8 files. Change that to 100 and give it a shot.

aboyle avatar Nov 30 '22 14:11 aboyle

Thanks. Yessss, changing int uniqueLength to 100 worked!!!

Pooran-Dewari avatar Nov 30 '22 14:11 Pooran-Dewari