
running checkm2 on a large number of genomes

Open sherlyn99 opened this issue 2 years ago • 3 comments

Hi, I have ~1 million isolate genomes and I want to run checkm2 to assess their completeness and contamination. I came across #67 and was wondering: what is a good way to pass a large number of genomes into checkm2 predict?

I am currently doing:

```
checkm2 predict \
    -t 30 \
    -i $(cat <filelist.txt>) \
    -o <output_directory> \
    --database_path <path_to_uniref100.KO.1.dmnd> \
    --remove_intermediates
```

  1. Is there a limit to the number of files in filelist.txt?
  2. Is there a better way to do this other than passing a list of file paths? (A batching sketch follows below.)
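
For reference, one way to batch the list before it ever reaches the shell's command line is sketched below; the batch size of 1000 and the output-directory naming are illustrative, not a recommendation:

```
# Split the list into files of 1000 paths each: batch_aa, batch_ab, ...
split -l 1000 filelist.txt batch_

# One checkm2 run per batch keeps each command line (and each job) small
for batch in batch_*; do
    checkm2 predict \
        -t 30 \
        -i $(cat "$batch") \
        -o results_"${batch}" \
        --database_path <path_to_uniref100.KO.1.dmnd> \
        --remove_intermediates
done
```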

Thank you so much!

sherlyn99 avatar Sep 15 '23 19:09 sherlyn99

Hi,

Sorry for the late reply. I don't think there should, in principle, be any limit on how many inputs you can pass to --input, though Python's argparse may have practical limitations, as does whatever OS you're using (e.g. on Linux, the kernel caps the total length of a command line, so a $(cat filelist.txt) expansion can fail with "Argument list too long").
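
As a quick sanity check (a sketch, assuming a typical Linux shell), you can compare the size of the file list against the kernel's argument limit:

```
# $(cat filelist.txt) is expanded by the shell into the command line itself,
# so the total size is bounded by the kernel's argument limit:
getconf ARG_MAX     # maximum combined length of argv + environment, in bytes
wc -c filelist.txt  # if this approaches ARG_MAX, checkm2 never even starts:
                    # the shell fails with "Argument list too long"
```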

Passing in a list of files from a txt seems fine. I have .tar archive input on a future features list for CheckM2, but some sections of the code need to be rewritten to avoid tarbombs.
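
(A minimal illustration of the tarbomb concern, using a hypothetical genomes.tar: list the archive members without extracting, and flag absolute or parent-traversing paths.)

```
# Hypothetical archive; entries starting with "/" or containing ".."
# would extract outside the target directory (a "tarbomb")
tar -tf genomes.tar | grep -E '^/|(^|/)\.\.(/|$)' && echo "unsafe paths found"
```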

Please let me know if you encounter issues with the workflow; I've never run CheckM2 on that many genomes, so it would be good to know whether it can handle it.

chklovski avatar Oct 03 '23 03:10 chklovski

Thank you so much for getting back to me! I am writing to provide an update:

I have been running a job array of ~500 jobs, each covering 2,750 genomes. However, I frequently encounter out-of-memory errors: I am currently giving each job 200 GB of memory and 48 hours, and some jobs still fail with an out-of-memory error (exit status 125). Do you have any suggestions for how much memory I should request per job so that the whole array runs smoothly? Thank you!
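
For context, the setup looks roughly like the sketch below (assuming a SLURM scheduler; the chunk naming, memory, time, and thread values are illustrative):

```
#!/bin/bash
#SBATCH --array=0-499        # ~500 tasks, one per chunk of the file list
#SBATCH --mem=200G           # the value that is currently not always enough
#SBATCH --time=48:00:00
#SBATCH --cpus-per-task=30

# Chunks assumed pre-made with: split -d -a 3 -l 2750 filelist.txt chunk_
CHUNK=$(printf 'chunk_%03d' "$SLURM_ARRAY_TASK_ID")

checkm2 predict \
    -t 30 \
    -i $(cat "$CHUNK") \
    -o results_"$SLURM_ARRAY_TASK_ID" \
    --database_path <path_to_uniref100.KO.1.dmnd> \
    --remove_intermediates
```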

sherlyn99 avatar Oct 23 '23 06:10 sherlyn99