Smudgeplot predicts tetraploid
Hi Kamil,
We are working with a weevil species (Pseudaplemonus limonii) for which we had no prior information on genome size or ploidy. Based on ancestry, we initially estimated a genome size of 600 Mb–1 Gb and assumed it to be diploid.
We generated PacBio HiFi reads from ultra-low input libraries, which involved an amplification step (due to the small amount of DNA available from a single individual). I used meryl (k=17) to generate k-mers and ran smudgeplot, which suggested a tetraploid genome.
Genomescope from HiFi reads:
smudgeplot from HiFi reads:
I also performed smudgeplot analysis using Illumina 150 PE short reads from TPase libraries (~25x coverage), which also involved a PCR amplification step. However, I read that Illumina reads from such libraries may not be ideal for smudgeplot. The results from this are as follows:
Genomescope from illumina 150PE:
Smudgeplot from illumina 150PE:
For assembly, I used HiFiasm with the HiFi reads and HiC, which produced two haplotypes, each with a genome length of approximately 330 Mb. Both haplotypes show 98.9% BUSCO completeness (C:98.9% [S:96.3%, D:2.6%], F:0.4%, M:0.7%, n:1367, E:4.1%) compared to the insecta_odb10 lineage.
Given these results, the smudgeplot from the PacBio reads appears more reliable to me—but am I overlooking something? I’d appreciate your insights!
Hi Arya,
What genome size do you expect for this weevil?
I have a suspicion this will actually be... surprise, surprise... a diploid. I will walk you through my thought process and then suggest a way to check my suspicion.
- The HiFi data is definitely the much better dataset to look at, so I will ignore the Illumina dataset. The 1n coverage is definitely 114x (the k-mer spectrum peak is rather small, but you can see the AB smudge, and that is impossible to ignore).
- So, with the 1n coverage set, we can fit a model: either the tetraploid one (the one you posted), predicting 146Mbp, or a diploid one that would give you a ~290Mbp haploid genome size estimate (ignore heterozygosity for a second). There is very little heterozygosity, so no matter what, my INTUITION is that the haplotypes would get largely collapsed during assembly.
- Let's look at the actual assembly: the two haplotypes separate (contrary to my INTUITION), 300Mbp each. OK, if the tetraploid model were right and your genome were AABB (small reminder: in this model the estimated haploid size of A or B is just 150Mbp), then to get 300Mbp in both haplotypes, they would have to be AB and A'B' respectively. But in that case, I would really expect nearly all BUSCOs to be duplicated (which does not seem to be the case). The alternative diploid model would predict a ~300Mbp haploid genome size, which would approximately match what you have assembled.
- So, what's up with that AABB smudge? It might be that you are dealing with a genome similar to strawberry (https://github.com/KamilSJaron/smudgeplot/wiki/tutorial-strawberry): low heterozygosity, lots of recent duplications (but those DO NOT include BUSCOs for some reason). In the strawberry case, with increased k, we saw a dropout of tetraploid k-mers while the diploid smudge retained its intensity. So you can run smudgeplot again using, say, k=31 or 51, and if you see that the proportion of the diploid smudge is greater, it's a good sign that you are in fact dealing with a diploid. It would be quite interesting if you could then sort out what the tetraploid smudge actually is.
- I would advise you to use FastK and a newer version of smudgeplot; it's really a lot more efficient and faster.
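The arithmetic behind the two candidate models above can be sketched in a few lines of Python. This is only an illustration, not how GenomeScope or smudgeplot actually fit their models: it assumes the haploid size is simply the total k-mer mass divided by (ploidy × 1n coverage), and the function and variable names are made up for this example. The only inputs taken from the thread are the 114x 1n coverage and the 146Mbp tetraploid estimate.

```python
# Sketch of the genome-size arithmetic for the two competing ploidy models.
# Assumption: haploid size = total k-mer mass / (ploidy * 1n coverage).
# All names below are hypothetical and chosen for illustration only.

ONE_N_COVERAGE = 114          # 1n k-mer coverage read off the HiFi smudgeplot
TETRAPLOID_HAPLOID_MBP = 146  # haploid size reported by the tetraploid fit


def haploid_size_mbp(kmer_mass, ploidy, one_n_coverage):
    """Haploid genome size (Mbp) implied by a fixed k-mer mass at a given ploidy."""
    return kmer_mass / (ploidy * one_n_coverage)


# Back out the total k-mer mass (in millions of k-mer occurrences)
# that the tetraploid estimate implies.
kmer_mass = TETRAPLOID_HAPLOID_MBP * 4 * ONE_N_COVERAGE

# The same data under each model: halving the ploidy doubles the haploid size.
tetraploid_mbp = haploid_size_mbp(kmer_mass, 4, ONE_N_COVERAGE)
diploid_mbp = haploid_size_mbp(kmer_mass, 2, ONE_N_COVERAGE)

print(f"tetraploid model: {tetraploid_mbp:.0f} Mbp per haploid genome")
print(f"diploid model:    {diploid_mbp:.0f} Mbp per haploid genome")
```

With the 1n coverage held fixed, the diploid model gives exactly twice the tetraploid estimate, which is where the ~290Mbp figure comes from and why it lands close to the size of each assembled haplotype.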
Hello Kamil,
Thank you for your detailed response, and apologies for the delayed reply—I was on vacation. I made the changes you suggested, and the GenomeScope results look promising. I used k=51 and l=118, and now the genome size is estimated to be around 294Mb with 0.082% heterozygosity!
However, I’m still a bit unclear about the smudge plot. Could you help me better understand the AAB smudge?
Thank you so much!
Is there a way to convince you to rerun it with the newer version? I am sure the plot would look a lot better.
The thing is, the smudgeplot and GenomeScope 1n coverages need to be consistent. Which of them do you think is right?
I think that for working with these kinds of biodiverse genomes you could use more in-depth k-mer training. If you would like to, you can sign up for a k-mer workshop we run in June at Sanger: https://coursesandconferences.wellcomeconnectingscience.org/event/k-mer-workshop-for-biodiversity-genomics-20250601/
Hi Kamil,
Thanks for the suggestion. I am getting this error when I try to use the newer version.
Error in strsplit(.peak_sizes[, "structure"], "") : non-character argument
Calls: plot_expected_haplotype_structure -> strsplit
Execution halted
Looks similar to this issue: https://github.com/KamilSJaron/smudgeplot/issues/193
Ugh, sorry about that! This is a bit confusing; given the coverage, I would not have thought a lack of smudges would be a problem for plotting. Are you sure you used all the data to make it?
Also, if you post your .smu file here (you will have to zip it), I would be happy to plot it for you (but I am pretty sure there will be some other problem too; there should be some smudges to plot).
Thank you so much! I did use all the data (or at least I think I did 😅). Here are the .smu files for k=31 and k=51:
ultralow_HiFi_weevil_kmersK51_L10_text.smu.gz
ultralow_HiFi_weevil_kmersK31_L10_text.smu.gz
I'm curious to see what you find!
weevil_k51_smudgeplot_log10_py.pdf
We still get a plotting error with the new version at k=31, but k=51 worked. Will try to figure out how to handle the error.
Hi @aryadevias, looking at the plot Sam generated, I am pretty sure something went wrong with making the k-mer database. That can't be the same full dataset as the one plotted in those GenomeScope plots or the previous smudgeplot run; it looks like a small fraction of the data...
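One quick way to sanity-check whether a .smu file holds the full dataset is to tally how many k-mer pairs it contains and how high the coverages go. The sketch below is hedged: it assumes the `_text.smu` file is plain text with three whitespace-separated integer columns (coverage of one k-mer, coverage of its pair, and the number of pairs with that combination). That layout is an assumption and should be checked against your smudgeplot version. `summarise_smu` is a hypothetical helper, not part of smudgeplot.

```python
import gzip


def summarise_smu(path):
    """Return (rows, total k-mer pairs, largest coverage sum) for a text .smu file.

    Assumes three whitespace-separated integer columns per line:
    coverage A, coverage B, number of k-mer pairs (layout is an assumption).
    """
    # Transparently handle gzipped files such as *_text.smu.gz
    opener = gzip.open if str(path).endswith(".gz") else open
    rows = 0
    total_pairs = 0
    max_cov_sum = 0
    with opener(path, "rt") as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 3:
                continue  # skip blank or malformed lines
            cov_a, cov_b, count = (int(value) for value in fields[:3])
            rows += 1
            total_pairs += count
            max_cov_sum = max(max_cov_sum, cov_a + cov_b)
    return rows, total_pairs, max_cov_sum
```

For example, calling `summarise_smu("ultralow_HiFi_weevil_kmersK51_L10_text.smu.gz")` and comparing the totals with those from the earlier k=17 run would show whether the new k-mer database was built from only a fraction of the reads. If the 1n coverage really is around 114x, there should also be plenty of pairs whose coverage sum is near 228.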