spacedust Passing large numbers of files to createsetdb

I would like to run spacedust on a plasmid database. This database has ~60k individual files, each representing a separate plasmid "genome". However, when I pass the following command to spacedust:

$spacedust createsetdb /individual_faa/*.faa SpacedustDB tmp --threads 18

bash: /shared/software/bin/spacedust: Argument list too long

Bash reports that the argument list is too long. I have tried a number of workarounds, such as passing an environment variable that contains all the file names, but to no avail.
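For anyone hitting the same error, here is a minimal sketch for checking how close a glob expansion comes to the limit. On Linux the limit applies to the arguments and the environment together, and each individual string is separately capped at 128 KiB, which is likely why packing all names into one environment variable does not help. The function name `arg_bytes` is made up for this example, and the result is only a rough estimate: it counts characters plus one terminating byte per argument and ignores the environment and per-argument pointer overhead.

```shell
# Rough estimate of the bytes a list of arguments occupies on a command line.
# Each argument costs its length plus one terminating NUL byte.
# Note: ${#arg} counts characters, so multi-byte file names are undercounted.
arg_bytes() {
    total=0
    for arg in "$@"; do
        total=$(( total + ${#arg} + 1 ))
    done
    echo "$total"
}

# The system-wide limit for arguments plus environment, in bytes.
limit=$(getconf ARG_MAX)
echo "ARG_MAX on this system: ${limit} bytes"

# Calling a shell function does not go through execve(), so the glob can be
# expanded here even when it is too long to hand to an external program.
# Example with the path from the post above:
#   arg_bytes /individual_faa/*.faa
```

If the number printed by `arg_bytes` is anywhere near `ARG_MAX`, the external command will fail no matter how the glob is written.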

It would be useful if, instead of a file glob (*), spacedust createsetdb could take a single input file listing the path to each .faa file needed for db creation. Alternatively, I could create databases in batches and combine them, but I am not sure whether that is supported. Finally, if you have any other suggestions I would be forever grateful.

In terms of the total number of proteins, these plasmid "genomes" should be quite similar to the 9000 genomes you ran in the spacedust paper, since plasmids are much smaller. So I think it should be computationally manageable; the only trouble is getting all the files in :-)

My Environment

Linux
Using the statically compiled spacedust executable for the AVX2 instruction set

Oct 30 '24 03:10 SDmetagenomics

Hi! Did you reach a solution for your issue with the large number of files?

Dec 14 '24 12:12 Fazel-AVB

Hi Fazel,

So we do not yet have a full solution, but we have identified the core problem. On our Linux system (and many others, unless compiled with special parameters) there is a byte limit on the size of a terminal command (2 MB of total text). When the spacedust command is passed using a file glob (*), the in-memory size of the fully expanded command text exceeds 2 MB. We have implemented the following workarounds for now:

Running spacedust within the data folder so that the relative path to the data does not contain any extra directories (e.g. *.faa rather than /spacedust_input/*.faa). This significantly reduces the number of bytes the command takes up, as the path no longer repeats the directory for every file we want to analyze (e.g. /spacedust_input/1.faa, /spacedust_input/2.faa).
Renaming the input files to the shortest names possible (e.g. 1.faa, 2.faa) and linking these new names to the originals in a lookup table. This also seems to help mitigate the problem.
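The second workaround could be sketched roughly as below. Note that this is a variation, not what we actually ran: it creates short numeric symlinks instead of renaming the files, on the assumption (untested) that spacedust follows symlinks. The function name `shorten_names` and the lookup table layout (short name, tab, original path) are made up for the example, and file names containing newlines are not handled.

```shell
# Sketch: give every .faa file a short numeric symlink and record the mapping.
# Usage: shorten_names SRC_DIR LINK_DIR MAP_FILE   (all three are hypothetical)
shorten_names() {
    abs_src=$(cd "$1" && pwd) || return 1   # absolute path so links stay valid
    link_dir=$2
    map=$3
    mkdir -p "$link_dir" || return 1
    : > "$map"                               # start with an empty lookup table
    i=0
    # find streams the paths one per line, so no long argument list is built.
    find "$abs_src" -maxdepth 1 -type f -name '*.faa' | sort |
    while IFS= read -r f; do
        i=$((i + 1))
        ln -s "$f" "$link_dir/$i.faa"
        printf '%s\t%s\n' "$i.faa" "$f" >> "$map"
    done
}

# Example (directory names are placeholders):
#   shorten_names /spacedust_input short_names name_map.tsv
```

Start with an empty link directory, since `ln -s` refuses to overwrite existing links. The lookup table can later be used to map set names in the results back to the original plasmid files.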

However, as we are using plasmid sequences that are much shorter than genomes, and we have the resources, we are planning to scale this up further to hundreds of thousands of sequences. One potential solution (if possible in your code framework) would be an option to provide a single input file that contains the complete path to every file you want to include in the analysis. Do you think this would be possible to implement?

Spencer

Dec 16 '24 21:12 SDmetagenomics

Hi Spencer,

Thank you for the helpful tips. I checked the limits on my Linux system with ulimit -a, and the stack size (kbytes, -s) was 8192 (i.e., 8 MB). I then increased it to 16 MB with ulimit -s 16384 and ran $spacedust createsetdb ./*.faa SpacedustDB tmp inside the input directory. This worked for me on the command line; however, I haven't yet checked it when submitted as a job file. I hope it helps.
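For context on why this works: on Linux the space allowed for command-line arguments plus environment is derived from the soft stack limit (one quarter of it), so an 8 MB stack gives the 2 MB limit Spencer mentioned. A small sketch to see both numbers side by side; `show_limits` is a made-up helper name, and the values printed will differ between systems.

```shell
# Print the soft stack limit and the resulting argument-size limit.
show_limits() {
    printf 'stack limit (KiB): %s\n' "$(ulimit -s)"
    printf 'ARG_MAX (bytes):   %s\n' "$(getconf ARG_MAX)"
}

show_limits

# Raise the soft stack limit inside a subshell only, so the change does not
# leak into the rest of the session. This fails if the hard limit is lower.
( ulimit -s 16384 2>/dev/null && show_limits ) ||
    echo "could not raise the stack limit here"
```

The new limit only applies to the shell where `ulimit -s` was run and its children, so in a job file the `ulimit` line would presumably have to be part of the job script itself, and the scheduler may enforce its own hard limit.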

Fazel

Dec 17 '24 15:12 Fazel-AVB

Hi, I have updated the workflow. It is now also possible to pass a directory or a .tsv file with the list of paths to the desired files, for example:

$spacedust createsetdb /individual_faa SpacedustDB tmp --file-include ".faa$" --threads 18

or

$spacedust createsetdb path_to_faa.tsv SpacedustDB tmp --threads 18
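One way to build such a list file without ever expanding a glob on a command line is sketched below. It assumes the list format is one path per line, which is not spelled out above, so please check it against the spacedust documentation. The helper name `make_path_list` is made up for the example.

```shell
# Sketch: write the absolute path of every .faa file in a directory to a
# list file, one path per line (assumed format, see note above).
# Usage: make_path_list INPUT_DIR LIST_FILE
make_path_list() {
    abs_dir=$(cd "$1" && pwd) || return 1
    # find streams its results, so the number of files is not limited by
    # the maximum command-line length.
    find "$abs_dir" -maxdepth 1 -type f -name '*.faa' | sort > "$2"
}

# Example with the names used in the commands above:
#   make_path_list /individual_faa path_to_faa.tsv
#   spacedust createsetdb path_to_faa.tsv SpacedustDB tmp --threads 18
```

Absolute paths are used so the list stays valid regardless of the directory spacedust is started from.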

You can download the new pre-compiled repository.

Dec 17 '24 16:12 RuoshiZhang

Fantastic, thank you all. I will give it a test in the coming days.

Spencer

Dec 17 '24 16:12 SDmetagenomics

Hi Ruoshi,

I also ran into the same problem, and the provided fix still doesn't work because

cmd.execProgram(program.c_str(), par.filenames);

still uses a system call to run the generated tmp bash script, which in turn uses a system call to run the mmseqs command. So the subsequent steps fail with the same error:

"E2BIG (Argument list too long)"

Thanks, Caner

Jan 08 '25 16:01 canerbagci