🐌 ➡️ 🐎 Optimize DBSCAN clustering method for genome binning

Open evanroyrees opened this issue 3 years ago • 0 comments

According to the scikit-learn clustering docs, DBSCAN's memory consumption can be reduced:

Memory consumption for large sample sizes

This implementation is by default not memory efficient because it constructs a full pairwise similarity matrix in the case where kd-trees or ball-trees cannot be used (e.g., with sparse matrices). This matrix will consume n² floats. A couple of mechanisms for getting around this are:

Use OPTICS clustering in conjunction with the extract_dbscan method. OPTICS clustering also calculates the full pairwise matrix, but only keeps one row in memory at a time (memory complexity n).

A sparse radius neighborhood graph (where missing entries are presumed to be out of eps) can be precomputed in a memory-efficient way and dbscan can be run over this with metric='precomputed'. See sklearn.neighbors.NearestNeighbors.radius_neighbors_graph.
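
A minimal sketch of both mechanisms, to make the suggestion concrete. It is not Autometa's implementation: the synthetic two-blob array `X` stands in for contig embeddings, and the `eps` and `min_samples` values are assumptions chosen only for illustration. The docs mention an `extract_dbscan` method; this sketch assumes a scikit-learn version that exposes the equivalent as the function `cluster_optics_dbscan`.

```python
import numpy as np
from sklearn.cluster import DBSCAN, OPTICS, cluster_optics_dbscan
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for contig embeddings (assumption): two tight 2-D blobs.
rng = np.random.default_rng(0)
X = np.vstack(
    [
        rng.normal(loc=0.0, scale=0.1, size=(50, 2)),
        rng.normal(loc=5.0, scale=0.1, size=(50, 2)),
    ]
)
eps = 0.5  # illustrative value, not a tuned Autometa parameter
min_samples = 5  # illustrative value

# Mechanism 2: precompute a sparse radius-neighborhood graph.
# Only pairs within eps are stored; missing entries are treated as "farther than eps".
neighbors = NearestNeighbors(radius=eps).fit(X)
graph = neighbors.radius_neighbors_graph(X, mode="distance", sort_results=True)
sparse_labels = DBSCAN(
    eps=eps, min_samples=min_samples, metric="precomputed"
).fit_predict(graph)

# Mechanism 1: run OPTICS, then extract a DBSCAN-like labeling at the same eps.
# max_eps bounds the neighborhood search; the extraction eps must not exceed it.
optics = OPTICS(min_samples=min_samples, max_eps=eps).fit(X)
optics_labels = cluster_optics_dbscan(
    reachability=optics.reachability_,
    core_distances=optics.core_distances_,
    ordering=optics.ordering_,
    eps=eps,
)

print("clusters (sparse graph):", len(set(sparse_labels) - {-1}))
print("clusters (OPTICS):", len(set(optics_labels) - {-1}))
```

The sparse-graph route keeps DBSCAN itself unchanged and only swaps the input, so it is probably the smaller change to an existing pipeline. The OPTICS route is usually slower, but one fit can be reused to extract labelings at several `eps` values up to `max_eps`.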

evanroyrees avatar Aug 07 '22 19:08 evanroyrees