No datafile could be found for /Database/FoldSeekDB/PDB100_member_to_set!

Open neptuneyt opened this issue 3 months ago • 1 comments

Dear developer, thank you for this amazing work. Due to firewall restrictions, I cannot download the databases directly with the commands provided by the software. I therefore downloaded and extracted the files manually via a VPN (https://foldseek.steineggerlab.workers.dev/pdb100.tar.gz), and the md5sum check passed. However, when I ran clustersearch with my faa_queryDB against it, I got the error "No datafile could be found for /Database/FoldSeekDB/PDB100_member_to_set!", as shown below. It would also be a great pity if such an excellent piece of work became unavailable because of network issues. Could you please provide an alternative download link for users in China, for example one hosted on https://zenodo.org/ (this platform offers 50GB of free storage; data exceeding 50GB may need to be split into smaller parts)? In any case, thank you again for developing such outstanding software, and I look forward to your reply.

PDB database

(foldseek) [yut@io02 FoldSeekDB]$ ls -F PDB100*
PDB100             PDB100_h         PDB100_seq_ca.0@      PDB100_seq_h.index    PDB100_seq_ss.index
PDB100_ca          PDB100_h.dbtype  PDB100_seq_ca.1       PDB100_seq.index      PDB100_seq_taxonomy@
PDB100_ca.dbtype   PDB100_h.index   PDB100_seq_ca.dbtype  PDB100_seq.lookup@    PDB100.source
PDB100_ca.index    PDB100.index     PDB100_seq_ca.index   PDB100_seq_mapping@   PDB100_ss
PDB100_clu         PDB100.lookup    PDB100_seq.dbtype     PDB100_seq.source@    PDB100_ss.dbtype
PDB100_clu.dbtype  PDB100_mapping   PDB100_seq_h.0@       PDB100_seq_ss.0@      PDB100_ss.index
PDB100_clu.index   PDB100_seq.0@    PDB100_seq_h.1        PDB100_seq_ss.1       PDB100_taxonomy
PDB100.dbtype      PDB100_seq.1     PDB100_seq_h.dbtype   PDB100_seq_ss.dbtype  PDB100.version

error log

[spacedust]$ spacedust clustersearch queryDB /Database/FoldSeekDB/PDB100 spacedust_result.tsv tmpFolder

MMseqs Version:                         2.e56c505
Substitution matrix                     aa:blosum62.out,nucl:nucleotide.out
Add backtrace                           true
Alignment mode                          2
Alignment mode                          0
Allow wrapped scoring                   false
E-value threshold                       10
Seq. id. threshold                      0
Min alignment length                    30
Seq. id. mode                           0
Alternative alignments                  0
Coverage threshold                      0.8
Coverage mode                           2
Max sequence length                     65535
Compositional bias                      1
Compositional bias                      1
Max reject                              2147483647
Max accept                              2147483647
Include identical seq. id.              false
Preload mode                            0
Pseudo count a                          substitution:1.100,context:1.400
Pseudo count b                          substitution:4.100,context:5.800
Score bias                              0
Realign hits                            false
Realign score bias                      -0.2
Realign max seqs                        2147483647
Correlation score weight                0
Gap open cost                           aa:11,nucl:5
Gap extension cost                      aa:1,nucl:2
Zdrop                                   40
Threads                                 128
Compressed                              0
Verbosity                               3
Seed substitution matrix                aa:VTML80.out,nucl:nucleotide.out
Sensitivity                             5.7
k-mer length                            0
k-score                                 seq:2147483647,prof:2147483647
Alphabet size                           aa:21,nucl:5
Max results per query                   300
Split database                          0
Split mode                              2
Split memory limit                      0
Diagonal scoring                        true
Exact k-mer matching                    0
Mask residues                           1
Mask residues probability               0.9
Mask lower case residues                0
Minimum diagonal score                  15
Selected taxa
Spaced k-mers                           1
Spaced k-mer pattern
Local temporary path
Rescore mode                            0
Remove hits by seq. id. and coverage    false
Sort results                            0
Mask profile                            1
Profile E-value threshold               0.001
Global sequence weighting               false
Allow deletions                         false
Filter MSA                              1
Use filter only at N seqs               0
Maximum seq. id. threshold              0.9
Minimum seq. id.                        0.0
Minimum score per column                -20
Minimum coverage                        0
Select N most diverse seqs              1000
Pseudo count mode                       0
Gap pseudo count                        10
Min codons in orf                       30
Max codons in length                    32734
Max orf gaps                            2147483647
Contig start mode                       2
Contig end mode                         2
Orf start mode                          1
Forward frames                          1,2,3
Reverse frames                          1,2,3
Translation table                       1
Translate orf                           0
Use all table starts                    false
Offset of numeric ids                   0
Create lookup                           0
Add orf stop                            false
Overlap between sequences               0
Sequence split mode                     1
Header split mode                       0
Chain overlapping alignments            0
Merge query                             1
Search type                             0
Search iterations                       1
Start sensitivity                       4
Search steps                            1
Exhaustive search mode                  false
Filter results during exhaustive search 0
Strand selection                        1
LCA search mode                         false
Disk space limit                        0
MPI runner
Force restart with latest tmp           false
Remove temporary files                  false
Use simple best hit                     true
Include sub-optimal hits with factor    0
Alpha                                   1
Aggregation mode                        0
Filter self match                       false
Multihit P-value cutoff                 0.01
Clustering and Ordering P-value cutoff  0.01
Maximum gene gaps                       3
Minimal cluster size                    2
Cluster weighting factor                false
Database output                         true
Cluster search against profiles         false
Cluster Search Mode                     0
Path to Foldseek                        /Software/Miniconda3/envs/spacedust/bin/foldseek

besthitbyset queryDB /Database/FoldSeekDB/PDB100 tmpFolder/966050721878520555/result_prefixed tmpFolder/966050721878520555/aggregate --simple-best-hit 1 --suboptimal-hits 0 --threads 128 --compressed 0 -v 3

No datafile could be found for /Database/FoldSeekDB/PDB100_member_to_set!
Error: aggregate best hit failed

Nov 01 '25 09:11 neptuneyt

Hi! PDB100 is a Foldseek DB, not a Spacedust setDB. A setDB contains additional metadata files, such as _member_to_set, that the workflow needs. Unfortunately, PDB100 cannot be converted to a Spacedust setDB because it does not contain any genomic position information.
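As a quick way to tell the two kinds of database apart before launching a search, the sketch below tests whether a database prefix has the `_member_to_set` file named in the error message. The helper name `is_setdb`, the `TARGET_DB` variable, and the assumption that a split database might store the data as `_member_to_set.0` are illustrative and not part of Spacedust.

```shell
# Hypothetical helper: succeeds if the DB prefix looks like a Spacedust setDB.
is_setdb() {
    db_prefix="$1"
    # The error message names <prefix>_member_to_set. Accepting a split
    # data file <prefix>_member_to_set.0 as well is an assumption.
    [ -e "${db_prefix}_member_to_set" ] || [ -e "${db_prefix}_member_to_set.0" ]
}

# Path taken from the issue; override TARGET_DB for your own setup.
TARGET_DB="${TARGET_DB:-/Database/FoldSeekDB/PDB100}"
if is_setdb "$TARGET_DB"; then
    echo "$TARGET_DB looks like a setDB"
else
    echo "$TARGET_DB has no _member_to_set file; build a setDB with spacedust createsetdb first"
fi
```

For the directory listing in the issue, this check takes the second branch, since no `PDB100_member_to_set` file is present.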

If you create the DB via foldseek from Prodigal-like input (which contains genomic positions) together with ProstT5, you can convert the Foldseek DB to a Spacedust setDB as follows:

Create the Foldseek DB and Spacedust setDB

path/to/foldseek createdb genome1.faa [...genomeN.faa] DB --prostt5-model weights
spacedust createsetdb DB setDB tmpFolder
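The two steps can be wrapped in a small script so that they always run in order. This is only a sketch: the function name `build_setdb`, the `FOLDSEEK`, `WEIGHTS`, and `RUN` variables, and the example file names are placeholders, and the commands themselves are copied from the reply above. By default it is a dry run that prints each command instead of executing it.

```shell
# Placeholder paths (assumptions); adjust to your environment.
FOLDSEEK="${FOLDSEEK:-path/to/foldseek}"
WEIGHTS="${WEIGHTS:-weights}"
RUN="${RUN:-echo}"   # default: dry run that only prints each command

build_setdb() {
    # usage: build_setdb <foldseekDB> <setDB> <tmpDir> <genome1.faa> [...genomeN.faa]
    db="$1"; setdb="$2"; tmp="$3"; shift 3
    # Step 1: create the Foldseek DB from the Prodigal-like .faa files.
    $RUN "$FOLDSEEK" createdb "$@" "$db" --prostt5-model "$WEIGHTS" &&
    # Step 2: convert the Foldseek DB into a Spacedust setDB.
    $RUN spacedust createsetdb "$db" "$setdb" "$tmp"
}

build_setdb targetDB targetSetDB tmpFolder genome1.faa genome2.faa
```

Changing the script's default from `echo` to an empty string (`RUN="${RUN-}"`) executes the commands instead of printing them. The resulting setDB, rather than PDB100, would then be the target passed to `spacedust clustersearch`.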

Nov 04 '25 08:11 RuoshiZhang