Program hangs when clustering 1.9 million genomes

Open sherlyn99 opened this issue 9 months ago • 2 comments

Hi, I am running clust-greedy on 1.9 million genomes, and the program hangs after one day with no output.

Is this a known issue, and is there any troubleshooting I can do? Thank you!

Best, Sherlyn


Version: v.2.3.0

Command: clust-greedy -t 64 -k 16 -s 18000 --threshold 0.01 -F KSSD --no-save --list -i filepath.tsv -o clusters_greedy.txt

Log (last bit of it):

+ clust-greedy -t 64 -k 16 -s 18000 --threshold 0.01 -F KSSD --no-save --list -i filepaths/filepath.tsv -o clusters_greedy.txt
-----set the sketch function: KSSD
-----set the thread number 64
-----set kmerSize: 16
-----set sketchSize:  18000
-----set threshold:  0
	===the genome number for clustering is: 1965687
	===the genome number below the minimum genome length threshold is: 7
	===the total genome number is: 1965694
	===the totalSize is: 8077946919959
	===the maxSize is: 18446744073295954413
	===the minSize is: 112044
	===the averageSize is: 4109477
the kmerSize 16 is too small for the maximum genome size of 18446744073295954413
replace the kmerSize to the: 39 for reducing the random collision of kmers
-----the max recommand distance threshold is: 0.233462
-----sketch by file!
-----the kmerSize is: 39
-----the thread number is: 64
-----the threshold is: 0.01
-----use the Mash distance (fixed-sketch-size), the sketchSize is: 18000
-----input fileList, sketch by file
---finished sketching: 0 genomes
---finished sketching: 10000 genomes
---finished sketching: 20000 genomes
---finished sketching: 30000 genomes
---finished sketching: 40000 genomes
---finished sketching: 50000 genomes
---finished sketching: 60000 genomes
---finished sketching: 70000 genomes
---finished sketching: 80000 genomes
---finished sketching: 90000 genomes
---finished sketching: 100000 genomes
---finished sketching: 110000 genomes
---finished sketching: 120000 genomes
---finished sketching: 130000 genomes
---finished sketching: 140000 genomes
---finished sketching: 150000 genomes
---finished sketching: 160000 genomes
---finished sketching: 170000 genomes
---finished sketching: 180000 genomes
---finished sketching: 190000 genomes
---finished sketching: 200000 genomes
---finished sketching: 210000 genomes
---finished sketching: 220000 genomes
---finished sketching: 230000 genomes
---finished sketching: 240000 genomes
---finished sketching: 250000 genomes
---finished sketching: 260000 genomes
---finished sketching: 270000 genomes
---finished sketching: 280000 genomes
---finished sketching: 290000 genomes
---finished sketching: 300000 genomes
---finished sketching: 310000 genomes
---finished sketching: 320000 genomes
---finished sketching: 330000 genomes
---finished sketching: 340000 genomes
---finished sketching: 350000 genomes
---finished sketching: 360000 genomes
---finished sketching: 370000 genomes
---finished sketching: 380000 genomes
---finished sketching: 390000 genomes
---finished sketching: 400000 genomes
---finished sketching: 410000 genomes
---finished sketching: 420000 genomes
---finished sketching: 430000 genomes
---finished sketching: 440000 genomes
---finished sketching: 450000 genomes
---finished sketching: 460000 genomes
---finished sketching: 470000 genomes
---finished sketching: 480000 genomes
---finished sketching: 490000 genomes
---finished sketching: 500000 genomes
---finished sketching: 510000 genomes
---finished sketching: 520000 genomes
---finished sketching: 530000 genomes
---finished sketching: 540000 genomes
---finished sketching: 550000 genomes
---finished sketching: 560000 genomes
---finished sketching: 570000 genomes
---finished sketching: 580000 genomes
---finished sketching: 590000 genomes
---finished sketching: 600000 genomes
---finished sketching: 610000 genomes
---finished sketching: 620000 genomes
---finished sketching: 630000 genomes
---finished sketching: 640000 genomes
---finished sketching: 650000 genomes
---finished sketching: 660000 genomes
---finished sketching: 670000 genomes
---finished sketching: 680000 genomes
---finished sketching: 690000 genomes
---finished sketching: 700000 genomes
---finished sketching: 710000 genomes
---finished sketching: 720000 genomes
---finished sketching: 730000 genomes
---finished sketching: 740000 genomes
---finished sketching: 750000 genomes
---finished sketching: 760000 genomes
---finished sketching: 770000 genomes
---finished sketching: 780000 genomes
---finished sketching: 790000 genomes
---finished sketching: 800000 genomes
---finished sketching: 810000 genomes
---finished sketching: 820000 genomes
---finished sketching: 830000 genomes
---finished sketching: 840000 genomes
---finished sketching: 850000 genomes
---finished sketching: 860000 genomes
---finished sketching: 870000 genomes
---finished sketching: 880000 genomes
---finished sketching: 890000 genomes
---finished sketching: 900000 genomes
---finished sketching: 910000 genomes
---finished sketching: 920000 genomes
---finished sketching: 930000 genomes
---finished sketching: 940000 genomes
---finished sketching: 950000 genomes
---finished sketching: 960000 genomes
---finished sketching: 970000 genomes
---finished sketching: 980000 genomes
---finished sketching: 990000 genomes
---finished sketching: 1000000 genomes
---finished sketching: 1010000 genomes
---finished sketching: 1020000 genomes
---finished sketching: 1030000 genomes
---finished sketching: 1040000 genomes
---finished sketching: 1050000 genomes
---finished sketching: 1060000 genomes
---finished sketching: 1070000 genomes
---finished sketching: 1080000 genomes
---finished sketching: 1090000 genomes
---finished sketching: 1100000 genomes
---finished sketching: 1110000 genomes
---finished sketching: 1120000 genomes
---finished sketching: 1130000 genomes
---finished sketching: 1140000 genomes
---finished sketching: 1150000 genomes
---finished sketching: 1160000 genomes
---finished sketching: 1170000 genomes
---finished sketching: 1180000 genomes
---finished sketching: 1190000 genomes
---finished sketching: 1200000 genomes
---finished sketching: 1210000 genomes
---finished sketching: 1220000 genomes
---finished sketching: 1230000 genomes
---finished sketching: 1240000 genomes
---finished sketching: 1250000 genomes
---finished sketching: 1260000 genomes
---finished sketching: 1270000 genomes
---finished sketching: 1280000 genomes
---finished sketching: 1290000 genomes
---finished sketching: 1300000 genomes
---finished sketching: 1310000 genomes
---finished sketching: 1320000 genomes
---finished sketching: 1330000 genomes
---finished sketching: 1340000 genomes
---finished sketching: 1350000 genomes
---finished sketching: 1360000 genomes
---finished sketching: 1370000 genomes
---finished sketching: 1380000 genomes
---finished sketching: 1390000 genomes
---finished sketching: 1400000 genomes
---finished sketching: 1410000 genomes
---finished sketching: 1420000 genomes
---finished sketching: 1430000 genomes
---finished sketching: 1440000 genomes
---finished sketching: 1450000 genomes
---finished sketching: 1460000 genomes
---finished sketching: 1470000 genomes
---finished sketching: 1480000 genomes
---finished sketching: 1490000 genomes
---finished sketching: 1500000 genomes
---finished sketching: 1510000 genomes
---finished sketching: 1520000 genomes
---finished sketching: 1530000 genomes
---finished sketching: 1540000 genomes
---finished sketching: 1550000 genomes
---finished sketching: 1560000 genomes
---finished sketching: 1570000 genomes
---finished sketching: 1580000 genomes
---finished sketching: 1590000 genomes
---finished sketching: 1600000 genomes
---finished sketching: 1610000 genomes
---finished sketching: 1620000 genomes
---finished sketching: 1630000 genomes
---finished sketching: 1640000 genomes
---finished sketching: 1650000 genomes
---finished sketching: 1660000 genomes
---finished sketching: 1670000 genomes
---finished sketching: 1680000 genomes
---finished sketching: 1690000 genomes
---finished sketching: 1700000 genomes
---finished sketching: 1710000 genomes
---finished sketching: 1720000 genomes
---finished sketching: 1730000 genomes
---finished sketching: 1740000 genomes
---finished sketching: 1750000 genomes
---finished sketching: 1760000 genomes
---finished sketching: 1770000 genomes
---finished sketching: 1780000 genomes
---finished sketching: 1790000 genomes
---finished sketching: 1800000 genomes
---finished sketching: 1810000 genomes
---finished sketching: 1820000 genomes
---finished sketching: 1830000 genomes
---finished sketching: 1840000 genomes
---finished sketching: 1850000 genomes
---finished sketching: 1860000 genomes
---finished sketching: 1870000 genomes
---finished sketching: 1880000 genomes
---finished sketching: 1890000 genomes
---finished sketching: 1900000 genomes
---finished sketching: 1910000 genomes
---finished sketching: 1920000 genomes
---finished sketching: 1930000 genomes
---finished sketching: 1940000 genomes
---finished sketching: 1950000 genomes
---finished sketching: 1960000 genomes

Apr 17 '25 21:04 sherlyn99

Apologies for the delayed response. To my knowledge, we have tested several datasets at the million-scale level. Have you had a chance to try the MinHash algorithm? I believe the MinHash-based implementation has undergone more comprehensive testing. Please feel free to let me know if you continue to encounter any issues. By the way, we will also take a closer look at what happened with the KSSD-based implementation.

Best, Zekun

Apr 22 '25 15:04 ZekunYin

Hi Sherlyn, I noticed that your program is slow and has not produced any output after one day.

Based on your program log, I think there are two primary factors contributing to this issue:

  1. Sketch size is too large. Your current sketch size is set to 18,000, which significantly increases the computational cost. For reference, this is 18× the default sketch size (1,000), meaning each pairwise distance computation takes roughly 18× as long. While increasing the sketch size can improve the accuracy of distance estimates and clustering results, the accuracy improves sub-linearly, whereas the runtime cost increases linearly. Therefore, I recommend starting with a smaller sketch size; the default value of 1,000 is usually a good balance between performance and accuracy.

  2. The distance threshold significantly affects clustering efficiency. The distance threshold plays a critical role in the behavior of the clust-greedy algorithm. A higher threshold allows more genomes to be grouped into a single cluster, resulting in fewer clusters and fewer representative genomes. Consequently, the remaining genomes are compared only against this reduced set of representatives, which lowers the overall number of pairwise distance computations. In general, a larger threshold can significantly reduce computation time. However, this parameter also has a substantial impact on the clustering outcome: it may lead to coarser clusters and reduced resolution. It is therefore important to choose the threshold carefully, balancing computational efficiency against the desired clustering granularity.
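
The link between the threshold and the number of distance computations can be illustrated with a toy model. The following is a minimal sketch of greedy incremental clustering on 1-D points, using the absolute difference as a stand-in for genome distance. It is not RabbitTClust's implementation; the function `greedy_cluster` and the random data are hypothetical and only meant to show the trend.

```python
import random


def greedy_cluster(points, threshold):
    """Toy greedy incremental clustering on 1-D points (illustrative only).

    Each point is compared against the current representatives in creation
    order. It joins the first representative within `threshold`; otherwise
    it becomes a new representative.
    Returns (representatives, number_of_distance_computations).
    """
    representatives = []
    comparisons = 0
    for p in points:
        for r in representatives:
            comparisons += 1
            if abs(p - r) <= threshold:
                break  # assigned to an existing cluster
        else:
            representatives.append(p)  # no representative was close enough
    return representatives, comparisons


if __name__ == "__main__":
    random.seed(0)
    points = [random.random() for _ in range(2000)]  # hypothetical data
    for threshold in (0.001, 0.01, 0.05):
        reps, comparisons = greedy_cluster(points, threshold)
        print(f"threshold={threshold}: {len(reps)} representatives, "
              f"{comparisons} comparisons")
```

In this toy setting, a larger threshold leaves fewer representatives, so each new point is checked against a shorter list. Real genome distances will not behave exactly like uniformly spread 1-D points, so the numbers should be read as a trend rather than a prediction.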

Recommendation: To resolve the performance bottleneck, I recommend initially reducing the sketch size to the default (1,000), and then fine-tuning the distance threshold based on your specific accuracy and granularity requirements.
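
The sketch-size trade-off behind this recommendation can be put into rough numbers. The snippet below is a back-of-the-envelope sketch under two stated assumptions: per-pair comparison cost grows linearly with sketch size, and the standard error of the similarity estimate shrinks as 1/sqrt(sketch size), as it does for MinHash-style estimators. KSSD sketches may not follow this exactly, and the helper `relative_cost_and_error` is hypothetical.

```python
import math


def relative_cost_and_error(sketch_size, baseline=1000):
    """Compare a sketch size against a baseline (illustrative only).

    Assumptions (not taken from the RabbitTClust code):
      * per-pair comparison cost is proportional to sketch size;
      * standard error of the estimate is proportional to 1/sqrt(sketch size).
    Returns (cost_ratio, error_ratio) relative to the baseline.
    """
    cost_ratio = sketch_size / baseline
    error_ratio = math.sqrt(baseline / sketch_size)
    return cost_ratio, error_ratio


if __name__ == "__main__":
    cost, err = relative_cost_and_error(18000)
    # prints "cost x18.0, standard error x0.24"
    print(f"cost x{cost:.1f}, standard error x{err:.2f}")
```

Under these assumptions, going from 1,000 to 18,000 costs about 18× more per comparison while shrinking the estimation error by only about 4.2×, which is why starting from the default and increasing the sketch size only if needed is a reasonable approach.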

Best, Xiaoming

Apr 23 '25 07:04 XiaomingXu1995