Process `hog_big (127)` terminated with an error exit status (130)

Open diekei opened this issue 11 months ago • 10 comments

Hello, I'm running FastOMA on an LSF cluster and the run failed during the hog_big step. Is this a memory issue? Could you help me identify and resolve the problem? Many thanks for your help! Let me know if you need anything else.

Best, A

This is my command: nextflow run FastOMA.nf -profile lsf --input_folder analysis/input/ --output_folder analysis/output/ --omamer_db ../omadb/LUCA.h5

And the log file

Completed at    : 2025-02-16T00:52:29.718378116Z
Duration        : 3h 18m 46s
Processes       : 533 (success), 107 (failed)
Output in       : analysis/output/
Oops .. something went wrong
WARN: Killing running tasks (2)

executor >  lsf (642)
[3a/f30c6f] check_input (1)                | 1 of 1 ✔
[71/280573] omamer_run (ECoc.fa)           | 204 of 204, failed: 102, retries: 102 ✔
[d4/4c8dd8] infer_roothogs (1)             | 1 of 1 ✔
[38/12c038] batch_roothogs (1)             | 1 of 1 ✔
[30/98a315] hog_big (1)                    | 164 of 166, failed: 5, retries: 4
[2b/0bf746] hog_rest (248)                 | 269 of 269 ✔
[-        ] collect_subhogs                -
[-        ] ext…airwise_ortholog_relations -
[-        ] fastoma_report                 -
ERROR ~ Error executing process > 'hog_big (127)'

Caused by:
  Process `hog_big (127)` terminated with an error exit status (130)


Command executed:

  fastoma-infer-subhogs  --input-rhog-folder /lustre/scratch126/tol/teams/***/projects/***/oma/fastoma/work/38/12c038d88d07cdbf0eea4b772ee57b/rhogs_big/67                                 --species-tree species_tree_checked.nwk                                --output-pickles pickle_hogs                                --parallel                                 -vv                                --msa-filter-method col-row-threshold                                --gap-ratio-row 0.3                                --gap-ratio-col 0.5                                --number-of-samples-per-hog 5

Command exit status:
  130

Command output:
  (empty)

Command error:
    |   \-BRAKERAUGP00000013294.1||PAtr||1055031872
    |
  --|   /-ENSNNIP00005009898.1||PCog||1060020096
    |  |
    |  |      /-ENSVMFP00000003441.1||DPar||1100004969
    |  |     |
     \-|   /-|   /-ENSXWUP00005018438.1||GSpi||1063017559
       |  |  |  |
       |  |   \-|      /-ENSXWUP00005018942.1||GSpi||1063017855
       |  |     |   /-|
        \-|      \-|   \-ENSXWUP00005019025.1||GSpi||1063017911
          |        |
          |         \-ENSKLCP00000005280.1||MPro||1053005806
          |
          |   /-BRAKERRJPP00005037521.1||LLun||1036040893
           \-|
              \-BRAKERKZPP00000023888.1||PSpi||1013010758
  2025-02-16 00:28:03 INFO     At least one subhog is split. here is the full labeled genetree:
  
                                                        /-HOG_E0851697_sub10557
                       /S, 0.0, {'HOG_E0851697_sub10557'}
                      |                                 \-HOG_E0851697_sub10557
                      |
  -D, 0.2857142857142857                                                         /-HOG_E0851697_sub10557
                      |                                                         |
                      |                                                         |            /-HOG_E0851697_sub10551
                      |                                                         |           |
                       \S, 0.0, {'HOG_E0851697_sub10557', 'HOG_E0851697_sub10551'}     /S, 0.0     /-HOG_E0851697_sub10551
                                                                                |     |     |     |
                                                                                |     |      \D, 0.5           /-HOG_E0851697_sub10551
                                                                                |     |           |      /D, 1.0
                                                                                 \S, 0.0           \S, 0.0     \-HOG_E0851697_sub10551
                                                                                      |                 |
                                                                                      |                  \-HOG_E0851697_sub10551
                                                                                      |
                                                                                      |      /-HOG_E0851697_sub10557
                                                                                       \S, 0.0
                                                                                             \-HOG_E0851697_sub10557
  2025-02-16 00:28:03 INFO     Representaives of HOG_E0851697_sub10557 are split among 2 candidate subtrees.
  2025-02-16 00:28:03 DEBUG    Subhog paths of represenatatives for <HOG:HOG_E0851697_sub10557,size=7,tax=n79>
  2025-02-16 00:28:03 DEBUG    ---Partition 0 -----
  2025-02-16 00:28:03 DEBUG    BRAKERRJPP00005002533.1||LLun||1036002530: <HOG:HOG_E0851697_sub10541,size=1,tax=n107> --> <HOG:HOG_E0851697_sub10019,size=1,tax=LLun>
  2025-02-16 00:28:03 DEBUG    BRAKERAUGP00000013294.1||PAtr||1055031872: <HOG:HOG_E0851697_sub10023,size=1,tax=PAtr>
  2025-02-16 00:28:03 DEBUG    ---Partition 1 -----
  2025-02-16 00:28:03 DEBUG    BRAKERKZPP00000023888.1||PSpi||1013010758: <HOG:HOG_E0851697_sub10542,size=3,tax=n107> --> <HOG:HOG_E0851697_sub10519,size=2,tax=n238> --> <HOG:HOG_E0851697_sub10018,size=1,tax=PSpi>
  2025-02-16 00:28:03 DEBUG    BRAKERRJPP00005037521.1||LLun||1036040893: <HOG:HOG_E0851697_sub10542,size=3,tax=n107> --> <HOG:HOG_E0851697_sub10020,size=1,tax=LLun>
  2025-02-16 00:28:03 DEBUG    ENSNNIP00005009898.1||PCog||1060020096: <HOG:HOG_E0851697_sub10542,size=3,tax=n107> --> <HOG:HOG_E0851697_sub10519,size=2,tax=n238> --> <HOG:HOG_E0851697_sub10024,size=1,tax=PCog>
  2025-02-16 00:28:03 INFO     Splitting <HOG:HOG_E0851697_sub10557,size=7,tax=n79> into 2 subhogs: [{<HOG:HOG_E0851697_sub10023,size=1,tax=PAtr>, <HOG:HOG_E0851697_sub10541,size=1,tax=n107>}, {<HOG:HOG_E0851697_sub10542,size=3,tax=n107>}]
  2025-02-16 00:28:03 DEBUG    checking for rootHOG id E0851697 future object is done for node n37
  2025-02-16 00:28:03 DEBUG    checking for rootHOG id E0851697 future object is done for node n66

Work dir:
  /lustre/scratch126/tol/teams/***/projects/***/oma/fastoma/work/f6/28f6d2b9f03d7c8f1013dc7ea14e5d

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

 -- Check '.nextflow.log' file for details

diekei avatar Feb 16 '25 01:02 diekei

Hi @diekei

Thanks for using FastOMA. Could you please share the files .command.log, .command.err, and species_tree_checked.nwk from /lustre/scratch126/tol/teams/jaron/projects/coleoptera_alg/oma/fastoma/work/f6/28f6d2b9f03d7c8f1013dc7ea14e5d?

Also, if you could share the FASTA file inside /lustre/scratch126/tol/teams/jaron/projects/coleoptera_alg/oma/fastoma/work/38/12c038d88d07cdbf0eea4b772ee57b/rhogs_big/67, I'll try to reproduce the error.

To check whether a memory limit was hit under Slurm, we have instructions here. I'm not sure about Nextflow on LSF, but I can check. What is your memory limit? Is it less than 64 GB?

Best, Sina

sinamajidian avatar Feb 16 '25 02:02 sinamajidian

Hello @sinamajidian, thank you for the swift response. These are the job requirements that I submitted:

bsub -J cec_fastoma -M150000 -R"select[mem>150000] rusage[mem=150000] span[hosts=1]" -n 15 -q week -o cec_fastoma_%J.log -e cec_fastoma_%J.err \
nextflow run FastOMA.nf -profile lsf --input_folder analysis/input/ --output_folder analysis/output/ --omamer_db ../omadb/LUCA.h5

So the memory limit was 150 GB. And here are the requested files:

662188.zip

Thank you!

diekei avatar Feb 16 '25 03:02 diekei

Thanks for sharing the files. According to .command.log, the LSF job for this task had a memory limit of about 50 GB and was killed when it reached that limit:

TERM_MEMLIMIT: job killed after reaching LSF memory usage limit.
Exited with exit code 130.
Resource usage summary:
    CPU time :                                   5212.00 sec.
    Max Memory :                                 49536 MB
    Average Memory :                             26079.13 MB
    Total Requested Memory :                     49152.00 MB
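
For anyone who wants to scan many .command.log files for the same failure, here is a minimal Python sketch that pulls the memory figures out of an LSF summary like the one above. The function name `parse_lsf_summary` and the field handling are my own illustration, not part of FastOMA or LSF, and it assumes the summary uses exactly the labels shown here.

```python
import re


def parse_lsf_summary(log_text):
    """Extract memory figures (in MB) and the TERM_MEMLIMIT flag from LSF job output."""

    def field_mb(label):
        # Matches lines such as "Max Memory :      49536 MB".
        match = re.search(rf"{re.escape(label)}\s*:\s*([\d.]+)\s*MB", log_text)
        return float(match.group(1)) if match else None

    return {
        "killed_by_memlimit": "TERM_MEMLIMIT" in log_text,
        "max_mb": field_mb("Max Memory"),
        "avg_mb": field_mb("Average Memory"),
        "requested_mb": field_mb("Total Requested Memory"),
    }


# The summary text quoted in this thread.
example = """\
TERM_MEMLIMIT: job killed after reaching LSF memory usage limit.
Exited with exit code 130.
Resource usage summary:
    CPU time :                                   5212.00 sec.
    Max Memory :                                 49536 MB
    Average Memory :                             26079.13 MB
    Total Requested Memory :                     49152.00 MB
"""

summary = parse_lsf_summary(example)
print(summary)
```

A task whose `max_mb` is at or above `requested_mb` and that carries the TERM_MEMLIMIT marker is a likely candidate for a larger memory request.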

But this gene family has only 500 genes, so it shouldn't use a huge amount of memory. I checked the proteomes, and several of the proteins contain long amino-acid repeats. I wonder whether these are real proteins or a gene annotation issue.

>BRAKERTKWP00000024469.1||CAsp||1054025376 BRAKERTKWP00000024469.1
MLKTDSDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDGDG...
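
A rough way to find such records in a proteome is to measure how much of each sequence is made up of its two most common residues. The Python sketch below does that. The helper names, the 0.8 threshold, and the toy sequences are all assumptions for illustration; this is not how FastOMA or OMArk assess sequences, and a dedicated low-complexity tool would be more rigorous.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def top_residue_fraction(sequence, k=2):
    """Fraction of the sequence covered by its k most common residues."""
    if not sequence:
        return 0.0
    counts = Counter(sequence)
    return sum(n for _, n in counts.most_common(k)) / len(sequence)


def flag_low_complexity(records, threshold=0.8, k=2):
    """Return IDs of records dominated by k residues (threshold is a hypothetical cutoff)."""
    return [
        header.split()[0]
        for header, seq in records
        if top_residue_fraction(seq, k) >= threshold
    ]


# Toy input: one repeat-dominated record and one ordinary-looking record.
toy_fasta = [
    ">toy_repeat made-up example",
    "MLKTDS" + "DG" * 50,
    ">toy_normal made-up example",
    "MKVLAAGIVGLLLAQPSWAETRKHYSDEQ",
]

flagged = flag_low_complexity(read_fasta(toy_fasta))
print(flagged)
```

Records flagged this way are only candidates for a closer look; biased composition alone does not prove an annotation error.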

(FastOMA adds ||CAsp|| to the FASTA record to keep track of species names.) The MAFFT multiple sequence aligner has difficulty aligning such sequences. I think you are working with arthropods, right? One way to evaluate the quality of eukaryotic proteomes is OMArk. Most of the genes in this family are from the species CAsp. Was it annotated in-house?

It seems that most of the proteins in this FastOMA family are mapped to the bacterial clade p__Myxococcota. You can look in the hogmap folder of your FastOMA output (out/hogmap) and check each protein. I did it online for this gene family, which resulted in this. We can imagine a few scenarios here: there is significant HGT, the FastOMA reference gene family has an issue, or the input proteomes are of low quality.

My run is still ongoing, now at internal node 66 (in species_tree_checked.nwk, the point where your run failed). I'll update you if I find anything new.

Best, Sina

sinamajidian avatar Feb 16 '25 14:02 sinamajidian

Hi @sinamajidian, thanks for looking into this. Yes, I noticed that the error file says the max memory was 50 GB, but what I don't understand is why there is a discrepancy, given that I clearly requested a memory limit of 150 GB when I submitted the job. (Sorry, this is really my first time running a Nextflow job.)

I also noticed that strange-looking protein. Yes, I am working with Coleoptera genomes, and the specific example you mentioned is Crioceris asparagi. I ran OMArk before this analysis, and here is the result:

COMPLETENESS ASSESSMENT
------------
The clade used was: Endopterygota
Number of conserved HOGs: 4879

Results on conserved HOGs:
Single: 4001 (82.00%)
Duplicated: 531 (10.88%)
Duplicated, Unexpected: 528 (10.82%)
Duplicated, Expected: 3 (0.06%)
Missing: 347 (7.11%)


CONSISTENCY ASSESSMENT
-------------------------
Number of proteins in the whole proteome: 26673

Consistent lineage placements
Total Consistent: 14053 (52.69%)
Consistent, partial hits: 4268 (16.00%)
Consistent, fragmented: 1419 (5.32%)

Inconsistent lineage placements
Total Inconsistent: 3137 (11.76%)
Inconsistent, partial hits: 1972 (7.39%)
Inconsistent, fragmented: 317 (1.19%)

Contaminants
Total Contaminants: 0 (0.00%)
Contaminants, partial hits: 0 (0.00%)
Contaminants, fragmented: 0 (0.00%)

Unknown
Total Unknown: 9483 (35.55%)


SPECIES COMPOSITION
-------------------
Detected species

Main species
Clade: Cucujiformia
Number of associated query proteins: 17190 (64.45%)

Re: "input proteomes are of low quality": what are the criteria that make a proteome low quality?

The annotation source: https://ftp.ensembl.org/pub/rapid-release/species/Crioceris_asparagi/GCA_958507055.1/braker/geneset/2023_10/

Thank you!! Best, A

diekei avatar Feb 16 '25 14:02 diekei

One update about the memory requirement: FastOMA finished this gene family in around one hour using 67.1 GB of RAM (captured with /bin/time -o s.log):

user= 23969 system= 263.24 elapsed= 3340.04 CPU= 725% MemMax= 67126832
.
.
2025-02-16 09:57:57 INFO     All subHOGs for the rootHOG E0851697 as OrthoXML format is written in pickle_hogs/file_E0851697.pickle
finished  67
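
To put that number next to the failed job, here is a small Python sketch of the arithmetic. It assumes MemMax is reported in kilobytes (the usual convention for time's maximum resident set size) and that the LSF figure from the failed job is in megabytes as printed; if either unit assumption is off, the ratio shifts slightly but the conclusion does not.

```python
# Assumption: MemMax from /bin/time is in kilobytes, and the LSF
# "Total Requested Memory" value is in megabytes as printed in the job summary.
mem_max_kb = 67_126_832        # MemMax= value from the successful rerun
lsf_requested_mb = 49_152.00   # requested memory of the failed hog_big task

mem_max_mb = mem_max_kb / 1000
mem_max_gb = mem_max_kb / 1_000_000
ratio = mem_max_mb / lsf_requested_mb

print(f"peak usage: {mem_max_gb:.1f} GB")
print(f"peak usage relative to the failed job's request: {ratio:.2f}x")
```

Under these assumptions the family needs noticeably more memory than the failed task was allowed, which is consistent with the TERM_MEMLIMIT kill.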

Also, the proportion of unknown proteins in the OMArk results looks quite high to me; you may want to check some cases here or in the OMArk paper:

Similarly, unknown proteins may be sequences without close homologs or annotation errors. Thus, not all proteins classified as inconsistent or unknown are necessarily errors, but an unusually high proportion may indicate a systematic error in the annotation.

About LSF: unfortunately, I haven't used it myself. It seems that an LSF profile needs to be added to the FastOMA nextflow.config; see this:

... it divides the requested memory by the number of requested cpus
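
To make the quoted sentence concrete, here is a small Python sketch of the arithmetic it describes. The function `lsf_memory_request` and its `per_job_limit` switch are hypothetical names of mine that only mirror the quoted behaviour; this is not Nextflow's actual code. As far as I know, the Nextflow documentation describes an `executor.perJobMemLimit` option for clusters that use per-job limits, but please check the docs for your version.

```python
def lsf_memory_request(total_mb, cpus, per_job_limit=False):
    """Memory figure (MB) passed to LSF under per-core or per-job accounting.

    Illustrative only: per-core mode divides the task's memory by its CPU count,
    per-job mode passes the full amount through unchanged.
    """
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    return total_mb if per_job_limit else total_mb // cpus


# Illustrative numbers: a task asking for 150000 MB on 15 CPUs.
per_core = lsf_memory_request(150_000, 15)
per_job = lsf_memory_request(150_000, 15, per_job_limit=True)
print(per_core, per_job)
```

If the cluster actually enforces the limit per job while the submitted value was divided per core, a task can be killed well below the memory it was meant to have.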

sinamajidian avatar Feb 16 '25 15:02 sinamajidian

Hi @sinamajidian, thank you for your response and for letting me know about the progress. I see, so it's definitely a memory issue.

Since FastOMA automatically determines the number of threads and the memory based on the number of proteomes and gene families (am I right?), I wonder if it somehow failed to estimate them. As you said, 'this gene family has only 500 genes and it shouldn't use huge amount of memory'.

So I added this to the config file:

process {
  withName: 'hog_big' {
    memory = '100.GB'
    cpus = 10
  }
}

and then resumed the analysis with -resume, and the job completed successfully. Only extract_pairwise_ortholog_relations did not run:

executor >  lsf (6)
[f5/074c60] check_input (1)                | 1 of 1, cached: 1 ✔
[a2/f10cfb] omamer_run (HObl.fa)           | 102 of 102, cached: 102 ✔
[74/7f8cfe] infer_roothogs (1)             | 1 of 1, cached: 1 ✔
[c5/ecff8e] batch_roothogs (1)             | 1 of 1, cached: 1 ✔
[ee/8a0173] hog_big (2)                    | 158 of 158, cached: 155 ✔
[76/70ab72] hog_rest (194)                 | 270 of 270, cached: 269 ✔
[4d/98843c] collect_subhogs (1)            | 1 of 1 ✔
[-        ] ext…airwise_ortholog_relations -
[52/da090a] fastoma_report (1)             | 1 of 1 ✔

I guess this is because the dataset is large (102) and by default the pairwise relations are computed only if it has <=25? And should this be solvable by just adding an extra flag when submitting the run?

Thank you!

Best wishes, Arif

diekei avatar Feb 16 '25 17:02 diekei

Cool! I'm glad you solved it. That's right: FastOMA/Nextflow allocates memory and CPUs automatically. And yes, you can use --force_pairwise_ortholog_generation for that.

All the best, Sina

sinamajidian avatar Feb 16 '25 18:02 sinamajidian

Thank you for your help.

I have a follow-up question about the FastOMA_HOGs.orthoxml result. Should I expect all genes and their splice variants to be present in the FastOMA_HOGs.orthoxml file? For example, in one species I found a gene with three splice variants, none of which is in the file, and this caused a problem when I fed the file to EdgeHOG as input.

Best, Arif

diekei avatar Feb 25 '25 09:02 diekei

Hi @diekei ,

I had a chat with Sina about this issue, and we realized that this happens because FastOMA excludes sequences that are too short. The gene we identified in GLat was only 18 nucleotides long. We cannot reliably place such short sequences in any gene family.

We will discuss whether it makes sense to include those genes in the FastOMA_HOGs.orthoxml file nevertheless. For now, I think it is easier to ensure that EdgeHOG doesn't rely on every gene being included in the orthoxml. I'll keep you posted on the progress in the other GitHub issue.
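
In the meantime, a quick way to see which input records might be dropped for being too short is to list them before running downstream tools. The Python sketch below does this for a FASTA file. The helper names and the `min_length` value are placeholders of mine; FastOMA's actual cutoff is not stated in this thread, so set the parameter to whatever the tool really uses.

```python
def iter_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def short_records(records, min_length):
    """Return (record_id, length) for sequences shorter than min_length.

    min_length is a hypothetical threshold, not FastOMA's documented cutoff.
    """
    return [
        (header.split()[0], len(seq))
        for header, seq in records
        if len(seq) < min_length
    ]


# Toy input with made-up identifiers; the first sequence spans two lines.
toy_fasta = [
    ">geneA.1 made-up short record",
    "MKV",
    "LAA",
    ">geneB.1 made-up longer record",
    "MKVLAAGIVGLLLAQPSWAETRKHYSDEQ",
]

too_short = short_records(iter_fasta(toy_fasta), min_length=10)
print(too_short)
```

With a real proteome, `iter_fasta(open(path))` would work the same way, and the resulting IDs can be compared against the genes present in the orthoxml.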

alpae avatar Feb 25 '25 10:02 alpae

Hi @alpae, I just read your reply here after posting my comment on the other GitHub issue page.

Ah, that makes sense! Thank you so much!

diekei avatar Feb 25 '25 10:02 diekei