Issue with cluster step in HapHiC – "Some fragments are missing / redundant" and fewer clusters than expected
Dear HapHiC developers,
I am trying to use haphic cluster to perform clustering on my genome assembly. My organism has 20 chromosomes, and I used the following command:
“~/soft/HapHiC/haphic cluster --threads 80 genome.fasta HiC.filtered.bam 20 --max_inflation 50” However, the process repeatedly reports the following message for many inflation values:
Some fragments are missing / redundant, result of inflation XX.X will NOT be output At the end, I also get the message:
“The maximum number of clusters (19) is even less than the expected number of chromosomes (20). You could try higher inflation.”
Interestingly, when I run the hap1 step, it finishes successfully.
I would really appreciate your help in understanding:
What causes the “fragments are missing / redundant” message?
How can I modify the parameters to get the expected 20 clusters?
Is there any recommended strategy for choosing a good inflation value in such cases?
Thank you very much for developing HapHiC and for your support!
Could you please share a screenshot showing the first few lines of your BAM file? You could use this command to view them:
samtools view HiC.filtered.bam | less -S
I noticed all reference IDs start with "h2", so are you performing scaffolding separately for the two haplotypes? I have a few questions for clarification:
-
Is the haploid chromosome number (n) 20? (meaning 2n=40?)
-
Are the two haplotypes of similar size? Sometimes hifiasm might produce imbalanced phasing results.
-
If both above points check out, another possibility could be that your assembly's contiguity is very high, causing the default
--Nx 80parameter to potentially miss some chromosomes. You might want to try rerunninghaphic pipelinewith--Nxset to95or higher and--min_group_lenset to1.
Thank you for your answer!
Regarding the haploid chromosome number (n): Yes, my species has 2n = 40, so the haploid chromosome number (n) is 20.
Regarding the similarity in size of the two haplotypes: I used Hifiasm for assembly, generating two haplotypes (hap1 and hap2). Their sizes are quite similar, both around 500MB. I performed scaffolding separately for the two haplotypes, and hap1 can be successfully mounted without any issues.
I will try the new parameter settings you provided (--Nx 80haphic and --Nx95--min_group_len1) to see if that improves the situation.
Thank you for your suggestion. With the new parameters, I successfully located the 20 chromosomes. However, I noticed that some chromosomes show terminal signal issues, and I suspect there might be haplotype collapse. Therefore, I plan to concatenate the two haplotypes and perform the scaffolding together.
The lack of Hi-C signals at these chromosomal termini may be due to repetitive sequences near the telomeres. Because Hi-C reads can map to multiple genomic locations (i.e., multi-mapping), these alignments are typically filtered out by filter_bam's default MAPQ≥1 threshold. Therefore, scaffolding concatenated haplotypes would not resolve this issue—in fact, they might further reduce the signal.
I understand what you mean. I apologize for asking again, but is it better to combine the two haplotypes and perform the scaffolding together? This way, we can identify potential errors in the assembly of the unknown haplotype. I plan to try merging the two hap genome.fasta files and then perform scaffolding together in the next step.
If the overall heterozygosity is sufficiently high (meaning that a significant number of Hi-C links can be retained under MAPQ ≥ 1 filtering when aligning to concatenated haplotypes) and these heterozygous sites are relatively evenly distributed across the genome, then I would say "yes". However, for species with very low heterozygosity, it might often only be possible to scaffold each haplotype separately. Regardless of the specific strategy, I would always recommend concatenating the two haplotypes for final visualization to check for any phasing errors.
Hello, after merging the haplotypes and performing scaffolding, I found that two chromosomes exhibited haplotype collapse. Since my goal is to assemble a haplotype-resolved T2T genome, I encountered issues during gap filling — the collapsed homologous regions could not be filled using ONT data. Therefore, I’m considering copying the sequence from the collapsed region and pasting it onto the other haplotype to facilitate gap filling. However, I currently have no idea how to implement this approach and would appreciate any guidance.
Alternatively, can I directly use the sequence from the homologous contig to fill the gap? Would this approach be feasible?
I used HiFi data to calculate the coverage at the collapsed region. How can I precisely locate the start position of this collapsed region? The approximate location can be determined from the figure below.
For copying collapsed regions, here's the general workflow:
-
Treat copying collapsed regions as the final scaffolding task - Complete all other Juicebox adjustments first before handling the collapsed regions.
-
Identify regions to be copied:
- Extract the target sequence
- Determine copy number (typically one copy for diploids, one or more for polyploids)
- Identify insertion points (collapsing often breaks contigs in another haplotype - note between which two contigs to insert)
Note: Handling differs between primary assemblies (*.p_ctg) and unitigs (p_utg):
- For primary assemblies: Collapsed regions may be internal to a contig
- For unitigs: Collapsing typically affects the entire unitig (may only require copying the whole unitig)
-
File modifications:
- Update these files based on step 2's findings:
-
asm.fa(the contig-level assembly input into haphic; use01.cluster/corrected_asm.faif assembly correction was performed) -
*.review.assembly -
*.liftover.agp
-
- Important: Fully understand each file's format and relationships before editing
- Finally, rerun
juicer postto generate the final FASTA and AGP files
- Update these files based on step 2's findings:
I used HiFi data to calculate the coverage at the collapsed region. How can I precisely locate the start position of this collapsed region? The approximate location can be determined from the figure below.
I haven't tried copying collapsed regions on hap*.p_ctg yet (since we primarily work with polyploids, these operations are mostly performed on p_utg, where we only need to copy the entire unitig). Relying solely on Hi-C contact maps to identify collapsed regions might not be precise enough in your case. If we follow your proposed approach, several issues would need consideration: (1) Whether HiFi read alignments require MAPQ filtering, (2) What bin size to use for depth statistics, and (3) The persistent accuracy challenges.
I have some thoughts that might be worth exploring. The hap*.p_ctg are theoretically obtained by spanning the unitig paths on the assembly graph, so we could potentially first map the unitigs back to hap*.p_ctg. Then by analyzing the Hi-C contact map, p_utg alignments to hap*.p_ctg, along with p_utg lengths, depths, and even the assembly graph structure in the GFA file, we might identify which specific unitig(s) experienced collapse, and finally duplicate those unitig(s) to their corresponding positions.
In theory, this could provide more precise identification of collapse locations. However, it's important to note that for both p_ctg and p_utg, hifiasm output sequences always contain some redundant overlapping sequences at the ends (typically within 30 kb for each end, with this information recorded in the assembly graph), so copying unitigs would always include some extra sequence.
Thank you for your response. We attempted to align p_ctg.fa to p_utg.fa using minimap2 minimap2 -x asm5 . We found that two unitigs aligned to the same contig, but the unitig corresponding to the collapsed region (which we preliminarily identified based on Juicer curation and read coverage) is quite large. Therefore, it doesn't appear to be a case of a completely collapsed unitig.
Below is the approximate location we initially identified based on Juicer curation.
Based on the trimming, we determined the range:
scaffold_40 10675528 11935528
Below is the approximate location we initially identified based on reads mapping.
minimap2 -ax map-hifi -t 80 out_JBAT.FINAL.fa hifi.fastq.gz | samtools sort -@ 80 -o aln.sorted.bam
samtools view -@ 80 -b -q 60 aln.sorted.bam > aln.filtered.sorted.bam
samtools index -@ 80 aln.filtered.sorted.bam
mosdepth -t 80 -c scaffold_40 -b 500 chr40 aln.sorted.bam
Filtered for regions where the coverage on both sides of the peak is greater than 100:
scaffold_40 10995000 11851000
If the entire unitig didn't collapse, I think it might be because one end of this region doesn't have reads that can connect to the other haplotype. This might be why hifiasm didn't correctly build the dual assembly, eventually leading to phasing errors.
But from your unitig alignment results, we can at least guess that one end of the collapsed region might be around position 11963739. This is very close to the position you found using juicebox (11935528).
I also thought of another solution - directly aligning the two large contigs from scaffold_39 to scaffold_40. This should help determine the exact coordinates of the collapsed region.
What's strange is that the collapsed region you identified using HiFi depth is quite different - maybe the depth at the ends isn't very accurate. Another odd thing is that according to your contact map, scaffold_40 should be less than 20 MB, but the x-axis of your plot goes beyond 2*10^7.
I think I’ve already solved the problem. Regarding the issue shown in the figure: the collapsed region identified using HiFi depth is quite different. It's not just a matter of differing depth at the ends — data from another chromosome was mistakenly included. This was my oversight, which led to the region shown in the figure being larger than 20 MB.
I aligned the two homologous chromosomes. Based on the alignment data, I identified the collapsed region as 10,645,675–11,947,310.
Subsequently, I copied the collapsed region and modified the
asm.fa, *.review.assembly, and *.liftover.agp files to reconstruct the scaffold_39. After that, I performed gap filling, and the two gaps introduced by the duplication were correctly filled.
In addition, I realigned the adjusted FASTA file and generated the Hi-C heatmap. The duplicated regions showed clearly filtered signals on the map, which I believe is expected.
It seems this approach worked without issues. could you please check if there are any flaws in the overall procedure? Thank you for your help. You are truly very kind and helpful.
The results look great. Thanks for sharing your research, I've gained some new knowledge from this experience too.