
Advice. Use only chromosomes as reference or the entire assembly?

Open V-JJ opened this issue 1 year ago • 1 comments

Hi!

I'm trying to use ntJoin to scaffold an input PacBio CLR genome assembly, using a chromosome-level assembly of a closely related species as a reference.

  1. Which is better: using the entire assembly, or only the "chromosomes", since they comprise more than 90% of the genome size?

  2. A different question.

  • First, I tried to run ntJoin with no_cut=True. This run yielded an assembly twice as large as expected.
  • Then I tried no_cut=False, which greatly improved the result: only 94% of the target assembly was assigned to the reference assembly, and the ntJoin assembly size matched the known size of the input genome quite well.

Thanks in advance! Thanks for the software!

V-JJ avatar Sep 21 '24 13:09 V-JJ

Hi @V-JJ,

  1. When deciding on what to supply for the reference, either option is totally fine (and probably depends a bit upon what you're hoping to achieve) - but assuming that the sequences other than the chromosomes are 'unassigned', it is generally safe to keep those in. Likely, they won't contribute much to the scaffolding but shouldn't be too detrimental.

  2. When running no_cut=True, an inflated genome size can be due to a larger number of N's being introduced. This can be offset by supplying the G parameter, which puts a maximum size on the introduced gaps. I have a longer explanation of how a large number of N's can be introduced, and why I implemented the G feature, in this previous issue: https://github.com/bcgsc/ntJoin/issues/115#issuecomment-2313102451

     As you probably know, the difference is just that no_cut=True will not break any of your existing contigs, whereas no_cut=False will make breaks in your input contigs to fit them to the reference. Which mode makes the most sense depends on how closely related the reference is, and on your knowledge of the similarity of the genomes.
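One quick way to check whether introduced N's explain an inflated assembly size is to compare the total length of the scaffolded FASTA with its length excluding N's. The following is a minimal sketch, not part of ntJoin: the function name `assembly_size_stats` and the example records are hypothetical, and it assumes a plain, uncompressed FASTA read line by line.

```python
def assembly_size_stats(fasta_lines):
    """Return (total_bases, n_bases, non_n_bases) for FASTA-formatted lines."""
    total = 0
    n_bases = 0
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip header lines and blank lines
        total += len(line)
        # Count gap characters in either case
        n_bases += line.count("N") + line.count("n")
    return total, n_bases, total - n_bases


# Hypothetical in-memory example; an open file handle works the same way
example = [">scaffold1", "ACGTNNNNAC", "GTNN", ">scaffold2", "acgtnn"]
total, n_bases, non_n = assembly_size_stats(example)
print(total, n_bases, non_n)  # prints "20 8 12"
```

If the non-N length is close to the expected genome size while the total length is much larger, the inflation most likely comes from gaps rather than duplicated sequence.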

I hope that helps - thank you for your interest in ntJoin! Lauren

lcoombe avatar Sep 23 '24 15:09 lcoombe

Hi @lcoombe !

Thanks for the clear and detailed explanation. The two species diverged ~5 Mya, and the BUSCO scores are quite similar when comparing no_cut=True vs no_cut=False, although slightly higher with no_cut=True.

Thanks, Vadim

V-JJ avatar Oct 07 '24 08:10 V-JJ