
whether to use deduplicated bam

Open coolbubu opened this issue 7 years ago • 14 comments

When I ran MSIsensor, I found that the results differ substantially between the deduplicated BAM and the non-deduplicated BAM. Which one should be used?

not_dedeplicated.bam

Total_Number_of_Sites	Number_of_Somatic_Sites	%
9739	1501	15.41

dedeplicated.bam

Total_Number_of_Sites	Number_of_Somatic_Sites	%
8798	122	1.39

coolbubu avatar Jan 03 '19 02:01 coolbubu

what is the data coverage? WGS or WES or targeted sequencing?

liangkaiye avatar Jan 03 '19 02:01 liangkaiye

It is WES. The mean coverage is 180 and the dup_ratio is 54.88%.

coolbubu avatar Jan 03 '19 03:01 coolbubu

How did you remove duplicates? The duplication ratio looks very high.

Beifang avatar Jan 07 '19 08:01 Beifang

I have noticed the same behavior and now routinely run msisensor on deduplicated BAMs (obtained using samtools view -F 1024). The results are then much closer to the MSI status obtained by an orthogonal method.
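For context, -F 1024 tells samtools view to exclude every read whose FLAG has bit 0x400 (PCR or optical duplicate) set, as defined in the SAM specification. The bit is only set if a duplicate-marking tool has been run on the BAM beforehand. Below is a minimal Python sketch of that bit test; the function names and the sample FLAG values are illustrative assumptions, not part of samtools or MSIsensor.

```python
# SAM FLAG bit for "PCR or optical duplicate" (0x400 == 1024), per the SAM spec.
DUPLICATE_FLAG = 0x400


def is_duplicate(flag: int) -> bool:
    """Return True if the duplicate bit is set in a SAM FLAG value."""
    return bool(flag & DUPLICATE_FLAG)


def keep_non_duplicates(flags):
    """Mimic the effect of `-F 1024`: keep only FLAGs without the duplicate bit."""
    return [f for f in flags if not is_duplicate(f)]


# Hypothetical FLAG values: 99 and 147 are a typical properly paired read pair;
# 1123 and 1171 are the same two FLAGs with the duplicate bit (1024) added.
example_flags = [99, 147, 1123, 1171]
kept = keep_non_duplicates(example_flags)
```

Here `kept` contains only the FLAGs of reads that were not marked as duplicates, which is the set of reads that ends up in the filtered BAM.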

micknudsen avatar Feb 04 '19 09:02 micknudsen

You should use the deduplicated BAMs. In the end, you can get the correct results only by using the data that you think is the cleanest.

ZhaoDanOnGitHub avatar Mar 01 '19 15:03 ZhaoDanOnGitHub

I am wondering whether a BAM with marked duplicates is sufficient, or whether I have to export the deduplicated reads to a separate BAM. Thanks!

guodudou avatar Apr 11 '19 14:04 guodudou

Marking duplicates is not sufficient. There is often a notable difference between using a BAM file with duplicates marked and with duplicates removed.

micknudsen avatar Apr 12 '19 06:04 micknudsen

Thank you very much for the quick response! In addition, there is a closed issue where people suggested using coverage normalization. I find that the score changes only slightly, but that is enough to classify samples with scores around the 3.5% cutoff differently. Do you have any suggestions? Many thanks!

guodudou avatar May 13 '19 15:05 guodudou

We suggest: MSI_H: msiscore >= 10%; MSI_L: 3.5% <= msiscore < 10%; MSS: msiscore < 3.5%.
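As a rough illustration, the suggested thresholds can be written as a small Python function. This is only a sketch: the function names are hypothetical, and it assumes the % column is Number_of_Somatic_Sites / Total_Number_of_Sites * 100, which matches the numbers posted at the top of this thread.

```python
def msiscore(somatic_sites: int, total_sites: int) -> float:
    """Percentage of somatic sites (assumed definition of the % column)."""
    return 100.0 * somatic_sites / total_sites


def classify_msi(score_percent: float) -> str:
    """Apply the thresholds suggested in this thread to an msiscore in percent."""
    if score_percent >= 10.0:
        return "MSI_H"
    if score_percent >= 3.5:
        return "MSI_L"
    return "MSS"


# Counts from the original post: non-deduplicated vs deduplicated BAM.
not_dedup_score = msiscore(1501, 9739)
dedup_score = msiscore(122, 8798)
not_dedup_call = classify_msi(not_dedup_score)
dedup_call = classify_msi(dedup_score)
```

With the counts from the original post, the non-deduplicated BAM lands in MSI_H and the deduplicated BAM in MSS, which shows how strongly duplicate handling can change the call.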

Beifang avatar May 14 '19 03:05 Beifang

Thank you very much for the great information! Do you suggest coverage normalization for normal and tumor samples? Thanks!

guodudou avatar May 14 '19 15:05 guodudou

We didn't normalize the TCGA UCEC data (msiscore cutoff: 3.5%) in the original version of MSIsensor. You can test with and without the normalization option. We suggest using this option when normal and tumor coverage are very different.

Beifang avatar May 15 '19 03:05 Beifang

Thank you very much! Can you please specify how you implement coverage normalization and/or how normalization affects the length distribution / MSI calling? This is very important to me because my samples are classified as MSI_H with normalization and as MSI_L without it. Thanks!

guodudou avatar Sep 13 '19 18:09 guodudou

A difference in sequencing depth between tumor tissue and normal tissue affects the judgment of whether a site is stable. Therefore, we normalize the read distributions so that their areas are of the same magnitude. Specifically, we compare the sequencing depth of the normal and tumor tissue at the site and correct the data from the sample with the smaller depth:
number of supporting reads after normalization = number of supporting reads * (max / min),
where max is the total number of supporting reads at the site in the tissue with the larger depth, and min is the total number of supporting reads at the site in the tissue with the smaller depth.
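The description above could be sketched in Python as follows. This is an interpretation of the explanation in this comment, not MSIsensor's actual code; the function name, the list-based representation of the per-length read counts, and the example numbers are all assumptions made for illustration.

```python
def normalize_site(normal_counts, tumor_counts):
    """Scale the shallower sample's per-length read counts by max / min.

    Each argument is a list of supporting-read counts, one entry per repeat
    length observed at a single microsatellite site.
    """
    normal_total = sum(normal_counts)
    tumor_total = sum(tumor_counts)
    # Nothing to scale if either sample has no supporting reads at this site.
    if normal_total == 0 or tumor_total == 0:
        return list(normal_counts), list(tumor_counts)
    if normal_total < tumor_total:
        factor = tumor_total / normal_total  # max / min
        return [c * factor for c in normal_counts], list(tumor_counts)
    factor = normal_total / tumor_total  # max / min
    return list(normal_counts), [c * factor for c in tumor_counts]


# Hypothetical site: 100 supporting reads in normal, 50 in tumor.
normal = [0, 10, 80, 10]
tumor = [5, 10, 30, 5]
norm_normal, norm_tumor = normalize_site(normal, tumor)
```

In this example the tumor counts are multiplied by 100 / 50 = 2, so both distributions cover the same total area while the shape of the tumor distribution is unchanged.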

ZhaoDanOnGitHub avatar Sep 15 '19 11:09 ZhaoDanOnGitHub

Thank you very much, this is very clear! I plan to extract the coverages of tumor and normal samples at all possible MS loci that are qualified for MSI calling, then see whether I need to adopt "coverage normalization". Do you have a suggestion about what range of coverage difference between normal and tumor is good for using "coverage normalization"? Thanks!

guodudou avatar Sep 16 '19 15:09 guodudou