Memory usage - coords waste
Dear developers,
Description
In my code, I'm using sparse to handle large data (> 10 GB). I noticed that the sparse library uses more memory than I expected. Comparing a 2D matrix against scipy.sparse, I realized that sparse uses a significantly larger amount of memory than scipy.sparse. Below you can find the memory consumption of the small example code included at the bottom (obtained with the memory_profiler library):
```
Line #    Mem usage    Increment   Line Contents
================================================
     7     99.5 MiB     99.5 MiB   @profile
     8                             def check_conv(N1, N2, N3):
     9    226.4 MiB    126.9 MiB       A = sp.random(N1, N2, density=0.12, format="coo")
    10
    11    447.0 MiB    220.6 MiB       B = sparse.COO.from_scipy_sparse(A)
    12    636.3 MiB    189.3 MiB       return B.reshape((N3, N2, N2))
```
We see that sparse.COO uses 220 MiB while scipy.sparse uses only 127 MiB.
Investigating the memory usage inside sparse.COO, I found a large amount of memory used by the following lines:
```
   246    415.6 MiB    126.1 MiB        self.coords = self.coords.astype(np.intp, copy=False)
```
and
```
   276    510.1 MiB     94.3 MiB        self._sort_indices()
```
If I comment out line 246 in the file sparse/_coo/core.py, the memory usage is significantly smaller:
```
Line #    Mem usage    Increment   Line Contents
================================================
     7     99.2 MiB     99.2 MiB   @profile
     8                             def check_conv(N1, N2, N3):
     9    226.3 MiB    127.1 MiB       A = sp.random(N1, N2, density=0.12, format="coo")
    10
    11    383.8 MiB    157.5 MiB       B = sparse.COO.from_scipy_sparse(A)
    12    573.1 MiB    189.2 MiB       return B.reshape((N3, N2, N2))
```
A gain of around 60 MiB. My question is: why does line 246 in sparse/_coo/core.py seem to copy the memory even though copy=False is passed, and how can I avoid it?
Also, is there a way to avoid the sorting of indices at line 276 when converting the matrix from scipy.sparse?
Example Code
```python
from __future__ import division

import numpy as np
import scipy.sparse as sp
import sparse
from memory_profiler import profile


@profile
def check_conv(N1, N2, N3):
    A = sp.random(N1, N2, density=0.12, format="coo")
    B = sparse.COO.from_scipy_sparse(A)
    return B.reshape((N3, N2, N2))


check_conv(453264, 152, 2982)
```
Hello, you can pass the sorted=True flag to avoid the sorting of contents, and has_duplicates=False to avoid deduplication. Beware that there will be issues if you do this with coordinates that aren't actually sorted or that contain duplicates.
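For illustration, here is a minimal sketch of how those flags can be used when building the COO by hand from a scipy matrix, assuming the scipy matrix has first been put into canonical (sorted, duplicate-free) form; the variable names are made up for the example:

```python
import numpy as np
import scipy.sparse as sp
import sparse

A = sp.random(453264, 152, density=0.12, format="coo")
A.sum_duplicates()  # canonicalize: sorts indices and merges duplicates

# Stack row/col into the (ndim, nnz) layout that sparse.COO expects, then
# tell the constructor the coordinates are already sorted and unique so it
# can skip its own sorting and deduplication passes.
coords = np.vstack((A.row, A.col))
B = sparse.COO(coords, A.data, shape=A.shape, has_duplicates=False, sorted=True)
```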
Also, the higher memory usage is due to the format. We use COO, which usually has lower compression efficiency than CSR, except for hypersparse arrays.
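As a rough illustration of the format difference (a sketch using scipy alone, not the sparse library): COO stores one full index array per dimension, while CSR compresses the row indices into a short indptr array, so for a matrix with far more nonzeros than rows CSR needs roughly one index array less:

```python
import scipy.sparse as sp

A = sp.random(453264, 152, density=0.12, format="coo")
C = A.tocsr()

# COO: row indices + column indices + values
coo_bytes = A.row.nbytes + A.col.nbytes + A.data.nbytes
# CSR: row pointers (nrows + 1 entries) + column indices + values
csr_bytes = C.indptr.nbytes + C.indices.nbytes + C.data.nbytes
print(f"COO: {coo_bytes / 2**20:.1f} MiB, CSR: {csr_bytes / 2**20:.1f} MiB")
```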
I think the main factor here is that np.intp typically upcasts 32-bit ints to 64-bit ints (np.intp is pointer-sized, so 64 bits on most platforms). @mbarbry, if you run your code example and check the dtypes of the coordinate arrays I think you'll see A.row.dtype is going to be dtype('int32') whereas B.coords.dtype is dtype('int64'). This is related to #249.
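This also explains the copy=False question: numpy's ndarray.astype only avoids a copy when no conversion is needed, so casting int32 coordinates to a 64-bit dtype must allocate a new array regardless of the flag. A small standalone demonstration (the names are illustrative):

```python
import numpy as np

a = np.arange(10, dtype=np.int32)

# Different dtype: copy=False is only permission to skip the copy,
# so a new 64-bit array is allocated anyway.
b = a.astype(np.intp, copy=False)
print(np.shares_memory(a, b))  # False

# Same dtype: no conversion is needed, so no copy is made.
c = a.astype(np.int32, copy=False)
print(np.shares_memory(a, c))  # True
```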
Thank you for your answers. What @daletovar describes seems to be the issue. So from what I read in #249, there is no actual fix for such situations?
I don't know exactly what kind of bugs occurred when using other dtypes to store coordinates (@hameerabbasi might be able to answer this), but you could perhaps try commenting out the conversion. Depending on what you're trying to do, the GCXS format could be useful. You would have to clone from GitHub to use it.
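In case it helps, a minimal sketch of converting to GCXS, assuming a checkout of sparse that already includes the format (the exact API may differ):

```python
import scipy.sparse as sp
import sparse

A = sp.random(453264, 152, density=0.12, format="coo")
B = sparse.COO.from_scipy_sparse(A)

# GCXS is a generalized CSR/CSC: it compresses the coordinates along the
# chosen axes instead of storing one full index row per dimension.
G = sparse.GCXS.from_coo(B, compressed_axes=(0,))
```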
Yes, it was complex. We had overflows, and lots of them, in different places. 🤷‍♂️ I gave up at some point and moved to np.int64.
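To illustrate the kind of overflow involved (a hypothetical example, not the library's actual code): operations that linearize multi-dimensional coordinates into a flat index can exceed the int32 range even when every individual coordinate fits in it:

```python
import numpy as np

# Hypothetical large 2D shape; each coordinate fits easily in int32.
shape = (100_000, 100_000)
row = np.array([99_999], dtype=np.int32)
col = np.array([99_999], dtype=np.int32)

# Row-major flattening: row * ncols + col. The true value is
# 9_999_999_999 + 99_999, which exceeds 2**31 - 1, so int32
# arithmetic silently wraps around to a wrong result.
flat = row * np.int32(shape[1]) + col
print(flat[0], 99_999 * 100_000 + 99_999)
```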