
Running - make error

Open desmodus1984 opened this issue 3 months ago • 10 comments

Hi, I just installed ntJoin and I want to scaffold an assembly we got from the sequencing company with one published before.

I ran this command, and I got this error:

ntJoin assembly target=purged.fasta target_weight=1 reference='Lepwed.HiC-500k-ren.fasta' reference_weights='2' t=4

make: *** No rule to make target 'assembly'. Stop.

Did I miss something? I added the target and its weight, and the reference and its weight, as suggested, and I was planning on using 4 threads. For some reason I got this error.

What might be the issue? I ran ntJoin before on a different workstation/college/city and it worked well.

Thanks;

desmodus1984 avatar Oct 23 '25 20:10 desmodus1984

Hi @desmodus1984,

Looks like you have a typo in your command - the command should be ntJoin assemble, not assembly.

Thank you for your interest in ntJoin!
Lauren

lcoombe avatar Oct 24 '25 23:10 lcoombe

Morning, I wanted to share my results and perhaps get your suggestion on what to do next. I corrected my mistake and used "assemble" instead of "assembly", and I was absolutely astonished by the results, which are great but not free of issues.

DNAZoo sequenced and scaffolded with HiC the genome of one of the species I am studying. The original file has 400k sequences. After I filtered out those smaller than 25k, the number dropped to 131. For the purpose of scaffolding, I filtered out those <500k bp to retain potential pseudo-chromosomes, and I got 17! which is the number of A chromosomes described for the species. So, I used this 17-seq file to scaffold my genome built with PacBio sequences, which has 397 sequences. After scaffolding, the all.scaffolds file has 438 sequences, but the assigned scaffolds file has 17! Perfect - at least from the big picture.

I have two sister species, and I am interested in comparing the genomes/genes, so next I wanted to do a quick check, and I used BUSCO to see any changes in the scores. It would be unfortunate/frustrating to miss genes. Sadly, scaffolding helped in one aspect but added errors in another.
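For readers who want to reproduce the length filtering described above (dropping sequences below 25k, then below 500k bp), here is a minimal Python sketch. It is not the tool that was actually used in this thread; the helper names `read_fasta` and `filter_by_length` are made up for illustration, and the toy records only stand in for a real assembly.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file-like object."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def filter_by_length(records, min_len):
    """Keep only the records whose sequence is at least min_len bases long."""
    return [(h, s) for h, s in records if len(s) >= min_len]

# Toy input; a real run would open the assembly file and use e.g. min_len=500_000.
toy = io.StringIO(">a\nACGT\nAC\n>b\nAC\n>c desc\nACGTACGT\n")
kept = filter_by_length(read_fasta(toy), min_len=6)
print([h for h, _ in kept])
```

The threshold is passed in as a parameter so the same function covers both cutoffs mentioned in the post.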

The original assembly had very high BUSCO scores: C:99.1%[S:97.8%,D:1.3%],F:0.4%,M:0.5%,n:13727,E:2.8%
Sadly, the BUSCO scores for the full scaffolded assembly were lower: C:98.1%[S:96.9%,D:1.3%],F:0.3%,M:1.5%,n:13727,E:2.9%
and the scores for the assigned one were even lower (expected since it's only a portion of the original): C:95.3%[S:94.1%,D:1.2%],F:0.3%,M:4.3%,n:13727,E:2.8%

I would like to retain as many genes as possible for comparing among the sister species, and it is sad that for example, 124 single-copy complete orthologs are gone between the original compared to the all.scaffolds. My next step is annotation and I wanted to get your opinion on ways to recover those lost complete single orthologs; it would be great to recover all those genes present in the original and lost by scaffolding, but I would focus mostly on the single-copy ones.
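The figure of 124 lost single-copy orthologs can be checked from the BUSCO summary strings quoted above. The sketch below is only an illustration: `parse_busco` is a hypothetical helper, and because BUSCO reports percentages rounded to one decimal, the derived counts are approximate.

```python
import re

def parse_busco(summary):
    """Parse a BUSCO one-line summary into percentages plus the total BUSCO count n."""
    scores = {key: float(val) for key, val in re.findall(r"([CSDFME]):([\d.]+)%", summary)}
    scores["n"] = int(re.search(r"n:(\d+)", summary).group(1))
    return scores

original = parse_busco("C:99.1%[S:97.8%,D:1.3%],F:0.4%,M:0.5%,n:13727,E:2.8%")
scaffolded = parse_busco("C:98.1%[S:96.9%,D:1.3%],F:0.3%,M:1.5%,n:13727,E:2.9%")

# Approximate counts, derived from the rounded percentages.
lost_single = round((original["S"] - scaffolded["S"]) / 100 * original["n"])
gained_missing = round((scaffolded["M"] - original["M"]) / 100 * original["n"])
print(lost_single, gained_missing)
```

For exact numbers, the per-BUSCO tables that BUSCO writes for each run are a better source than the rounded summary line.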

Thank you very much;

desmodus1984 avatar Oct 30 '25 16:10 desmodus1984

Hi @desmodus1984,

Glad that you’re getting some promising results! My suggestion would be to try a run specifying no_cut=True. This option means that ntJoin will not make any cuts to your input sequences. When this option is not specified (no_cut=False, which is the default), ntJoin will cut your input scaffolds at putative misassemblies - essentially fitting the structure of your input assembly to the reference. Sometimes this can be undesired, and I wonder if in some cases it could impact the BUSCOs. If you are not cutting any sequences, just scaffolding them, there should theoretically not be a negative impact on the BUSCOs. That being said, in my experience I have seen cases where the BUSCOs decrease only after scaffolding, which I think is an issue on the BUSCO side.

Hope that helps!

lcoombe avatar Nov 01 '25 17:11 lcoombe

Hi @lcoombe I did as you suggested with no_cut=True and the scaffolding didn't improve. For example, I am trying to identify A and B chromosomes, and since the B chromosomes are very small, I am focusing on the A chromosomes based on the karyotype, 2n=17. With your suggestion, the number of assigned scaffolds is good, 17, but the total length is too large: 3,000,732,934 bp, while the genome size based on the k-mer profile is 2.35 Gb. Maybe no_cut is not a good fit, and I should try some genome polishing instead. Do you know of any genome quality tool that assesses expansion/contraction besides Inspector? I computed the k-mer quality score using Merqury, and the ntJoin-scaffolded assembly had a higher score than the original assembly.
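To put the expansion in numbers, the two sizes quoted above can be compared directly. This is plain arithmetic on the figures from this post, nothing more; the variable names are just for illustration.

```python
scaffold_total_bp = 3_000_732_934   # total length of the assigned scaffolds (from this post)
kmer_estimate_bp = 2_350_000_000    # genome size from the k-mer profile, 2.35 Gb (from this post)

excess_bp = scaffold_total_bp - kmer_estimate_bp
ratio = scaffold_total_bp / kmer_estimate_bp
print(f"excess: {excess_bp:,} bp, ratio: {ratio:.2f}x")
```

Since scaffolding adds gap bases (runs of N) rather than new sequence, some of that excess may come from the gaps themselves, which is worth checking before concluding that sequence was duplicated.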

desmodus1984 avatar Nov 03 '25 19:11 desmodus1984

Hi @lcoombe I was further comparing the original assembly with the scaffolded one, and I did some checks with gfastats, and the results were not very good. The original assembly has 5 gaps, with the average, max, and min length all being 23 bp; the scaffolded one has 596 gaps, with average, max, and min gap lengths of 76,801.53 bp, 4,476,416 bp, and 20 bp, respectively. As you can see, one gap alone is over 4 Mb!
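For anyone who wants to sanity-check gap statistics like these without gfastats, here is a rough Python sketch. It assumes a gap is any run of N (or n) characters, which may not match gfastats' exact definition, and the helper names `gap_lengths` and `gap_summary` are hypothetical.

```python
import re
from statistics import mean

def gap_lengths(seq):
    """Return the length of every run of N/n characters in one sequence."""
    return [m.end() - m.start() for m in re.finditer(r"[Nn]+", seq)]

def gap_summary(seqs):
    """Summarize gap count, mean, max, and min length over several sequences."""
    lengths = [g for s in seqs for g in gap_lengths(s)]
    if not lengths:
        return {"count": 0}
    return {"count": len(lengths), "mean": mean(lengths),
            "max": max(lengths), "min": min(lengths)}

# Toy scaffolds: one with a 20 bp gap and a 5 bp gap, one with no gaps.
toy_scaffolds = ["ACGT" + "N" * 20 + "ACGT" + "N" * 5 + "A", "ACGTACGT"]
print(gap_summary(toy_scaffolds))
```

Summing the gap lengths the same way would show how much of the total scaffold length is made of N bases.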

@emilyyzhangg Do you think that ntLink can help?

Thanks

desmodus1984 avatar Nov 05 '25 16:11 desmodus1984

If you want to limit the length of the gaps, you can set the G parameter. I’d recommend trying that if you are finding the reconstruction/genome size to be larger than expected when using no_cut=True.

Since ntJoin is a scaffolder, yes, it will introduce gaps between joined sequences. It will attempt to detect and resolve overlapping sequences that are being joined, but if the algorithm detects that two pieces to be joined have a genomic distance between them, it will introduce a gap.

Currently, ntLink will fill gaps between joined scaffolds, but it does so within its scaffolding functionality, as it uses input long reads to fill the gaps. So it cannot currently be used to fill gaps independently (that could be something we introduce in the future). Depending on the sequencing reads you have available, you could consider Sealer (https://github.com/bcgsc/abyss/tree/master/Sealer) or Cobbler (https://github.com/bcgsc/RAILS).

lcoombe avatar Nov 05 '25 19:11 lcoombe

Hi @lcoombe Thanks for the information. I tried scaffolding my assembly using different combinations, and I was surprised that using 4 related species gave the highest BUSCO score, 99.3%, which is higher than the original genome (99.1%), and I didn't use the no_cut option. Meanwhile, the version scaffolded with the genome assembled by a different group (same species, with HiC) had a BUSCO score of 92%, which surprised me a lot.

I wanted to mention something weird: there is something going on with the output. After running ntJoin, I run a BUSCO check with the output assigned.fa file, and BUSCO fails with an error:

2025-11-14 11:42:28 ERROR: The following job failed with the error code 1: stats.sh format=2 in=/data/common/juanpablo.aguilar/sps-scaff/4X/LW2.0.20k.fasta.k32.w500.n1.all.scaffolds.fa n=10
2025-11-14 11:42:28 ERROR: Error message: java.lang.OutOfMemoryError: Java heap space
    at java.base/java.util.Arrays.copyOf(Arrays.java:3537)
    at java.base/java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:246)
    at java.base/java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:752)
    at java.base/java.lang.StringBuilder.append(StringBuilder.java:233)
    at java.base/java.io.BufferedReader.readLine(BufferedReader.java:380)
    at java.base/java.io.BufferedReader.readLine(BufferedReader.java:400)
    at fileIO.FileFormat.getFirstOctet(FileFormat.java:508)
    at fileIO.FileFormat.testInterleavedAndQuality(FileFormat.java:493)
    at fileIO.FileFormat.testFormat(FileFormat.java:402)
    at fileIO.FileFormat.<init>(FileFormat.java:222)
    at fileIO.FileFormat.testInput(FileFormat.java:164)
    at fileIO.FileFormat.testInput(FileFormat.java:146)
    at jgi.AssemblyStats2.process(AssemblyStats2.java:259)
    at jgi.AssemblyStats2.main(AssemblyStats2.java:39)
Exception in thread "main" java.lang.NullPointerException: Cannot invoke "fileIO.FileFormat.samOrBam()" because "ff" is null
    at jgi.AssemblyStats2.process(AssemblyStats2.java:266)
    at jgi.AssemblyStats2.main(AssemblyStats2.java:39)

2025-11-14 11:42:42 ERROR: Job failed with error code 1
2025-11-14 11:42:42 ERROR: BUSCO analysis failed!

I figured out that if I just clean the headers with seqkit (seqkit seq -i), then BUSCO runs.

Any reason why this problem happens?
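For reference, the header clean mentioned above (seqkit seq -i) keeps only the sequence ID, that is, the header text up to the first whitespace. A rough Python equivalent is sketched below; `simplify_headers` is a hypothetical helper and the toy header is made up, so treat it as an illustration of the idea rather than a drop-in replacement for seqkit.

```python
def simplify_headers(lines):
    """Yield FASTA lines with each header cut down to its first whitespace-delimited token."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            tokens = line[1:].split(None, 1)
            yield ">" + (tokens[0] if tokens else "")
        else:
            yield line

# Toy example with a made-up long header.
toy_fasta = [">scaffold1 some long description text\n", "ACGTACGT\n", ">scaffold2\n", "GGCC\n"]
for cleaned in simplify_headers(toy_fasta):
    print(cleaned)
```

Comparing the headers before and after cleaning (for example, their maximum length) might help narrow down what BUSCO's stats step is choking on.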

Also, perhaps I missed some details about running Sealer: I tried it and it has been running for days. I am eager to get a gap-free assembly, so I tried it on one assembly, and this is the command:

abyss-sealer -b50G -k64 -k80 -k96 -k112 -k128 -o Pseudo-gfill -S final_assembly.fasta Hlep-PacBio.fasta

The reads are PacBio HiFi, long, with an average size of 15 kbp, and according to gfastats, there are 33 gaps, with an average length of 90.67 bp, a max length of 100 bp, and a min of 23 bp. Is it normal for Sealer to be running for days without finishing?

scyld.localdomain:
                                                                      Req'd  Req'd       Elap
Job ID                  Username Queue Jobname         SessID NDS TSK Memory Time      S Time
1023889.scyld.localdom  jp.a     batch Pseudo-Gap-fill 10473  1   64  --     250:00:00 R 157:43:20

I ran ntLink, and it finished in less than 24 hours, while Sealer is still running.

Hope you can help me troubleshoot this.

Thanks;

desmodus1984 avatar Nov 18 '25 17:11 desmodus1984

Hi @desmodus1984,

In terms of running BUSCO, you’d have to ask the developers why it appears that certain header namings fail. I have a recollection that in the past we did rename our input fasta files to shorter names, because it can fail with long headers, for example, but I’m not sure if that is the issue here. From the error message, it looks like the process ran out of memory. Glad you found a workaround!

Sealer is a gap-filler to be used with short reads, not long reads. Of the two suggestions I had above, Sealer is if you are using short reads for gap-filling, and Cobbler is if you use long reads. Note that if you use Cobbler I would recommend using GoldPolish to polish the filled gaps.

lcoombe avatar Nov 22 '25 18:11 lcoombe

Hi @lcoombe I wanted to ask you a "silly question". When I run ntJoin, it produces 3 output files: the *all.scaffolds, the *assigned.scaffolds, and the *unassigned.scaffolds; I looked through the website and the paper and couldn't find a good/short explanation. I am writing a draft of the manuscript, and it would be good to have an explanation, as I am struggling to explain them myself. Related to this, I am using the assigned scaffolds files for QC. I have used two chromosome-scale reference assemblies, and with the cut option, the scaffold number/gap number/BUSCO complete are 19/600/98.2% using the same species, and 18/1559/88.2% using a different species. The stats with the no_cut option are better: 17/308/98.6% using the same-species assembly, and 17/330/92.1% using a different species.

I got curious about using various assemblies and I tried 2 sets: 2X (the same species + sister species, ~200 contigs) and 3X (same species, sister species, and the distant one), with no_cut. Sadly, the results weren't that great: both had the same scaffold number, but 2X had 274 gaps and a BUSCO complete of 95.7%, while 3X had fewer gaps (233) and a lower BUSCO complete of 85.5%.

You mentioned in the GoldPolish issue that ntLink could be used to fill gaps, and my original assembly was just created with hifiasm, so I wondered whether ntLinking it would make it look better, despite the short HiFi reads (mean ~5 kb). Happily, all the stats improved despite the short reads: higher scaffold N50, lower contig/scaffold number, and the same BUSCO scores. Then, I tried to scaffold this ntLinked one, since it looked better than the original. With individual references, scaffolding is okay:

| Assembly Metric | LW2.0 | LW2.gf | LW2.LW.HiC | LW2.gf.LW.HiC | LW2.Nsch | LW2.gf.Nsch |
|---|---|---|---|---|---|---|
| Total length | 2.4 Gb | 2.40 Gb | 2.38 Gb | 2.37 Gb | 2.22 Gb | 2.17 Gb |
| Scaffold number | 397 | 276 | 17 | 16 | 17 | 17 |
| Scaffold N50 | 14.7 Mb | 22.13 Mb | 150.51 Mb | 174.17 Mb | 153.21 Mb | 148.5 Mb |
| Scaffold L50 | 48 | 34 | 7 | 6 | 7 | 7 |
| Contig N50 | 14.7 Mb | 22.13 Mb | 14.73 Mb | 21.88 Mb | 13.65 Mb | 20.63 Mb |
| Contig L50 | 48 | 34 | 48 | 34 | 48 | 32 |
| Gap number | 5 | 5 | 308 | 199 | 330 | 211 |
| BUSCO complete | 99.1% | 99.1% | 98.6% | 98.5% | 92.1% | 91.7% |
| BUSCO missing | 0.5% | 0.5% | 1.1% | 1.2% | 7.6% | 8.0% |

Incredibly, though, using various references led to worse results despite better big-picture structural metrics:

| Assembly Metric | LW2.HiC-HL1 | LW2.gf.HiC-HL1 | LW2-3X | LW2.gf.3X |
|---|---|---|---|---|
| Total length | 2.27 Gb | 854.9 Mb | 2.07 Gb | 1.51 Gb |
| Scaffold number | 17 | 41 | 17 | 49 |
| Scaffold N50 | 62.06 Mb | 32.47 Mb | 40.93 Mb | 44.26 Mb |
| Scaffold L50 | 12 | 10 | 17 | 11 |
| Contig N50 | 14.73 Mb | 20.17 Mb | 13.89 Mb | 18.43 Mb |
| Contig L50 | 46 | 15 | 44 | 25 |
| Gap number | 274 | 74 | 233 | 128 |
| BUSCO complete | 95.7% | 38.0% | 85.5% | 63.5% |
| BUSCO missing | 4.0% | 61.7% | 14.2% | 36.1% |

Could you speculate why, when ntJoining the ntLinked assembly with these reference sets, the BUSCO scores were so bad when I used the no_cut option? I was really excited to get better results with the ntLinked assembly; I am worried about gene completeness since I want to compare genomes next.

I tried ntLinking the ntJoined genomes and I saw no improvement.

Thanks

desmodus1984 avatar Nov 30 '25 01:11 desmodus1984

Hi @desmodus1984,

The final file that I suggest you use for downstream analysis is "*all.scaffolds.fa", which contains both the scaffolds produced by ntJoin and the sequences that were not included in the scaffolds. The "assigned" fasta file only contains the scaffolds, and the "unassigned" only contains the sequences not included in a scaffold, thus "all" is the concatenation of these. We consider these "assigned" and "unassigned" files intermediates, which is why they aren't included here: https://github.com/bcgsc/ntJoin?tab=readme-ov-file#output-files
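As a quick way to see the relationship described above in practice, the sequence IDs of the three files can be compared. This is a minimal sketch with made-up IDs; `fasta_ids` and `is_concatenation` are hypothetical helpers, and real use would read the header lines from the actual output files.

```python
def fasta_ids(lines):
    """Collect the ID (first token of each header) from FASTA lines."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.append(tokens[0])
    return ids

def is_concatenation(all_ids, assigned_ids, unassigned_ids):
    """True if 'all' holds exactly the assigned plus unassigned sequence IDs."""
    return sorted(all_ids) == sorted(assigned_ids + unassigned_ids)

# Toy headers standing in for the three ntJoin output files (IDs are made up).
assigned = fasta_ids([">scafA", "ACGT", ">scafB", "GGCC"])
unassigned = fasta_ids([">ctg7", "TTAA"])
everything = fasta_ids([">scafA", "ACGT", ">scafB", "GGCC", ">ctg7", "TTAA"])
print(is_concatenation(everything, assigned, unassigned))
```

Under that relationship, the first run reported earlier in this thread (438 sequences in all.scaffolds, 17 assigned) would be expected to have 438 - 17 = 421 unassigned sequences, though that figure was not reported here.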

I can't quite follow the different runs in the tables that you have included, so it is hard for me to give specific comments, but I can comment in general about combining ntLink and ntJoin. ntJoin is dependent on the reference assembly used, while ntLink is dependent on the input reads. Because most of the time the reference used is chromosome-scale or close to it, ntJoin would be expected to generate more contiguous sequences than ntLink. If your reference genome is less contiguous (a draft assembly), using these tools together makes a lot of sense.

lcoombe avatar Dec 01 '25 23:12 lcoombe

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your interest in ntJoin!

github-actions[bot] avatar Feb 04 '26 02:02 github-actions[bot]