methphaser icon indicating copy to clipboard operation
methphaser copied to clipboard

Duplicated and missing entries in .methphased.vcf

Open liuy0421 opened this issue 2 years ago • 6 comments

When I look at the entries in the output .methphased.vcf, there seems to be complete duplicate rows (not just that one variant is methphased into different haplotypes and therefore split into multiple entires with different PS tags, but entirely duplicate entires). This is not a big issue because complete duplicates can be easily dropped - but what could be causing this? Is this intentional?

There also seems to be variants in the original .vcf file that's missing from the output .methphased.vcf, including some that was phased in the original .vcf. Is that intentional?

The original .vcf was filtered to only contain those on autosomes, and the input .bam files are filtered to only contain primary alignments but were not filtered to only keep those that map to autosomes.

Thank you so much!

liuy0421 avatar Jun 29 '23 19:06 liuy0421

Should be a bug, could you please show me the vcf file? I have never used MethPhaser on mice samples so a little example could help.

Fu-Yilei avatar Jun 29 '23 19:06 Fu-Yilei

It's not public data - I'll have to ask if it could be shared. I may get back to you on this at a later date. Sorry I can't be more helpful now.

Although one thing I encountered from the current implementation is that, since I filtered the .vcf file for only those on chromosome 1-19 (mouse autosomes), phase_block_df (defined here) defaults the dtype for the chr column to be int instead of str, and the subsequent comparison fails as int(chromosome) == str(chromosome) returns False. As a result all chromosomes were being skipped. Adding dtype={'chr' : 'str' } to the pd.read_csv() call in defining phase_block_df fixed the issue. A similar issue is also in methphasing.

liuy0421 avatar Jun 29 '23 19:06 liuy0421

I think that was because most human autosomes can be parsed as int but I can fix that, thanks for pointing out! Btw what is your reference genome headers like? Maybe that was different from human autosomes and could be the issue, not sure.

Fu-Yilei avatar Jun 29 '23 19:06 Fu-Yilei

We use mm39 as reference.

Now I think about it - this line here reads in info from the reference (and the next line gets the chromosome column). If there are non-autosome entries in the reference, there will be non-numerics in their names, and the chromosome column dtype will be str. In this case, if the .vcf file is filtered for autosomes and the chromosome column dtype gets defaulted to int, the int to str comparison fails.

liuy0421 avatar Jun 29 '23 19:06 liuy0421

I will download mm39 and take a look, thanks!

Fu-Yilei avatar Jun 29 '23 19:06 Fu-Yilei

Yeah mm39 starts with something like NC_000067.7 and HG38 starts with chr22. I will modify the dataframe reader, thanks for pointing that out

Fu-Yilei avatar Jun 30 '23 20:06 Fu-Yilei