BUSCO and LAI decrease after polish
Question or Expected behavior: Hello, I used NextPolish with Illumina reads to polish my PacBio assembly contigs (assembled with Canu), but BUSCO and LAI both decreased after the polishing step. I ran several tests, and all of them seem to lower the BUSCO and LAI scores. I am curious why this happens. Do you have any insight into this?
NextPolish version: 1.3.1. Below is my run script:
```bash
#!/bin/bash
# Set input and parameters
round=1
threads=30
read1=$2
read2=$3
input=$1
for ((i=1; i<=${round}; i++)); do
    # step 1
    # index the genome file and do alignment
    bwa index ${input}
    bwa mem -t ${threads} ${input} ${read1} ${read2} | samtools view --threads 6 -F 0x4 -b - | samtools fixmate -m --threads 6 - - | samtools sort -m 2g --threads 6 - | samtools markdup --threads 6 -r - sgs.sort.bam
    # index bam and genome files
    samtools index -@ ${threads} sgs.sort.bam
    samtools faidx ${input}
    # polish genome file
    python /store/whzhang/tools/NextPolish_1.3.1/NextPolish/lib/nextpolish1.py -g ${input} -t 1 -p ${threads} -s sgs.sort.bam > genome.polishtemp.fa
    input=genome.polishtemp.fa
    # step 2
    # index genome file and do alignment
    bwa index ${input}
    bwa mem -t ${threads} ${input} ${read1} ${read2} | samtools view --threads 6 -F 0x4 -b - | samtools fixmate -m --threads 6 - - | samtools sort -m 2g --threads 6 - | samtools markdup --threads 6 -r - sgs.sort.bam
    # index bam and genome files
    samtools index -@ ${threads} sgs.sort.bam
    samtools faidx ${input}
    # polish genome file
    python /store/whzhang/tools/NextPolish_1.3.1/NextPolish/lib/nextpolish1.py -g ${input} -t 2 -p ${threads} -s sgs.sort.bam > genome.nextpolish.fa
    input=genome.nextpolish.fa
done
```
Additional context (Optional): I tried several combinations of polishing methods and measured BUSCO and LAI for each:
| Assembly | Contig N50 (Mb) | BUSCO (%) | LAI |
|---|---|---|---|
| Raw contig | 5.77 | 98.80 | 23.29 |
| Raw + Arrow (1 round) | 5.77 | 98.60 | 23.68 |
| Raw + Arrow + NextPolish (4 round) | 5.77 | 98.70 | 23.12 |
| Raw + NextPolish (1 round) | 5.77 | 98.70 | 23.25 |
| Raw + Arrow + NextPolish (1 round) | 5.77 | 98.70 | 23.15 |
My PacBio data is ~160x and my Illumina short-read data is ~60x. As the table shows, every polishing step lowers BUSCO, and every run that includes NextPolish also lowers LAI. Polishing with Arrow and the PacBio subreads seems to lower BUSCO the most.
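To make the size of these changes easier to see, here is a minimal sketch that prints each polished assembly's BUSCO and LAI change relative to the raw contigs. The values are copied from the table; the `results` variable and the `score_deltas` function are names made up for this illustration, not part of NextPolish.

```shell
#!/bin/bash
# Values copied from the table: label | BUSCO (%) | LAI.
results='Raw contig|98.80|23.29
Raw + Arrow (1 round)|98.60|23.68
Raw + Arrow + NextPolish (4 round)|98.70|23.12
Raw + NextPolish (1 round)|98.70|23.25
Raw + Arrow + NextPolish (1 round)|98.70|23.15'

# Hypothetical helper: print the change in BUSCO and LAI for every row
# after the first, using the first row (raw contigs) as the baseline.
score_deltas() {
    awk -F'|' '
        NR == 1 { busco0 = $2; lai0 = $3; next }
        { printf "%s\t%+.2f\t%+.2f\n", $1, $2 - busco0, $3 - lai0 }
    '
}

printf '%s\n' "$results" | score_deltas
```

The output columns are the label, the BUSCO change in percentage points, and the LAI change.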
Was your genome assembled from HiFi reads? If so, there is no need to polish it with subreads, but you can polish it with the HiFi reads. Also, because these results are very similar, the difference is probably due to a few gene differences caused by random mapping, so you can ignore it. Of course, you can call homozygous SNPs to evaluate global accuracy.
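One way to act on the homozygous SNP suggestion is sketched below. It assumes `bcftools` is installed and that the Illumina reads have been mapped back to the polished assembly as in the run script; the function names `count_hom_snps` and `qv_from_errors` are hypothetical, and the numbers in the example call are placeholders rather than results from this assembly. The idea is that homozygous differences between the reads and the assembly are likely assembly errors, so their count over the assessed bases gives a rough Phred-scaled quality value.

```shell
#!/bin/bash
# Hypothetical helper: count homozygous SNP calls in reads mapped to the assembly.
# With a single sample and variant-only output, these are effectively 1/1 calls.
# Usage: count_hom_snps genome.nextpolish.fa sgs.sort.bam
count_hom_snps() {
    local genome=$1 bam=$2
    bcftools mpileup -Ou -f "$genome" "$bam" \
        | bcftools call -mv -Ou \
        | bcftools view -H -v snps -g hom \
        | wc -l
}

# Hypothetical helper: Phred-scaled quality from an error count and the number
# of assessed bases, QV = -10 * log10(errors / bases).
qv_from_errors() {
    awk -v e="$1" -v b="$2" 'BEGIN {
        if (e == 0) { print "inf"; exit }
        printf "%.2f\n", -10 * log(e / b) / log(10)
    }'
}

# Placeholder numbers: 1,000 homozygous SNPs over 500 Mb of assessed sequence.
qv_from_errors 1000 500000000
```

Running the same count on the raw and polished assemblies gives a comparison that does not depend on a handful of BUSCO genes. This count covers SNPs only; dropping `-v snps` would include homozygous indels as well.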