🐌 ➡️ 🐎 Optimize DBSCAN clustering method for genome binning
From the scikit-learn clustering docs, DBSCAN's memory consumption can be optimized:
Memory consumption for large sample sizes
This implementation is by default not memory efficient because it constructs a full pairwise similarity matrix in the case where kd-trees or ball-trees cannot be used (e.g., with sparse matrices). This matrix will consume n² floats. A couple of mechanisms for getting around this are:
- Use OPTICS clustering in conjunction with the extract_dbscan method. OPTICS clustering also calculates the full pairwise matrix, but only keeps one row in memory at a time (memory complexity n); a sketch of this approach follows below.
- A sparse radius neighborhood graph (where missing entries are presumed to be out of eps) can be precomputed in a memory-efficient way and dbscan can be run over this with metric='precomputed'. See sklearn.neighbors.NearestNeighbors.radius_neighbors_graph; a second sketch below shows this approach.
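A minimal sketch of the first mechanism, assuming a released scikit-learn (>= 0.21), where the extract_dbscan method mentioned in the quoted docs is exposed as the cluster_optics_dbscan function. The feature matrix X and the eps/min_samples values are placeholders, not the binner's real inputs:

```python
# Sketch only: DBSCAN-style labels extracted from an OPTICS fit.
# OPTICS avoids holding the full n x n pairwise matrix in memory at once.
import numpy as np
from sklearn.cluster import OPTICS, cluster_optics_dbscan

X = np.random.rand(10_000, 64)  # placeholder for the real contig feature matrix
eps = 0.5                       # placeholder clustering radius

optics = OPTICS(min_samples=5).fit(X)

# Extract the DBSCAN-equivalent clustering at the chosen eps
labels = cluster_optics_dbscan(
    reachability=optics.reachability_,
    core_distances=optics.core_distances_,
    ordering=optics.ordering_,
    eps=eps,
)
```

A side benefit of this route is that one OPTICS fit can be reused to extract clusterings at several eps values, which is convenient when tuning the binning radius.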
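And a sketch of the second mechanism under the same placeholder assumptions: the radius neighborhood graph is built sparsely, so DBSCAN never materializes a dense n x n distance matrix.

```python
# Sketch only: DBSCAN over a precomputed sparse radius neighborhood graph.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(10_000, 64)  # placeholder for the real contig feature matrix
eps = 0.5                       # placeholder clustering radius

# Distances are stored only for pairs within eps; missing entries are treated
# as "farther than eps" when DBSCAN runs with metric='precomputed'.
neighbors = NearestNeighbors(radius=eps).fit(X)
distance_graph = neighbors.radius_neighbors_graph(X, mode="distance")

labels = DBSCAN(eps=eps, metric="precomputed").fit_predict(distance_graph)
```

The graph is built with mode="distance" rather than the default connectivity mode so that DBSCAN sees the actual pairwise distances within eps.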