Error using db.from_delayed with sparse arrays
  File "python3.11/site-packages/dask/bag/core.py", line 1881, in reify
    if len(seq) and isinstance(seq[0], Iterator):
    ^^^^^^^^^^^^^^^^^
  File "python3.11/site-packages/scipy/sparse/_base.py", line 425, in __len__
    raise TypeError("sparse array length is ambiguous; use getnnz()"
Minimal Complete Verifiable Example:
import dask.bag as db
import numpy as np
from dask import delayed
from scipy.sparse import csr_array


def add(x, y):
    return x + y


@delayed
def create_sparse_array_delayed():
    return csr_array(np.random.random((10, 10)))


@delayed
def create_array_delayed():
    return np.random.random((10, 10))


# works with sparse arrays when created from sequence
db.from_sequence(
    [csr_array(np.random.random((10, 10))), csr_array(np.random.random((10, 10)))]
).fold(add).compute()

# works with numpy arrays
db.from_delayed([create_array_delayed(), create_array_delayed()]).fold(add).compute()

# fails
db.from_delayed([create_sparse_array_delayed(), create_sparse_array_delayed()]).fold(add).compute()
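A possible workaround, assuming each delayed object is meant to contribute one sparse array as a single bag element: have the delayed function return a list containing the array, so the partition is an ordinary sequence and `len()` is never called on the sparse array itself. The name `create_sparse_partition` is hypothetical, and the synchronous scheduler is used here only to keep the sketch runnable as a plain script.

```python
import dask.bag as db
import numpy as np
from dask import delayed
from scipy.sparse import csr_array


def add(x, y):
    return x + y


@delayed
def create_sparse_partition():
    # Hypothetical variant of create_sparse_array_delayed: the sparse array is
    # wrapped in a list, so the partition has a well-defined length.
    return [csr_array(np.random.random((10, 10)))]


bag = db.from_delayed([create_sparse_partition(), create_sparse_partition()])

# Synchronous scheduler: an assumption made to avoid multiprocessing in a script.
total = bag.fold(add).compute(scheduler="synchronous")
print(type(total).__name__, total.shape)
```

This gives the bag the same layout as the `from_sequence` call above (partitions that are lists of sparse arrays), which is the case that already works.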
Environment:
- Dask version: 2024.12.0
- Python version: 3.11
- Operating System: Mac
- Install: pip
Thanks for your report. Any advice on how we can make this work without adding scipy as a dependency for bags?
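One scipy-free direction, sketched under assumptions: guard the `len()` call with a `TypeError` check instead of testing for sparse types. The helper below is loosely modeled on the `reify` line shown in the traceback; `safe_reify` and `AmbiguousLength` are hypothetical names, the rest of the function body is a guess, and this is not necessarily what any actual fix does. It only avoids the failing `len()` call and says nothing about how later bag operations treat such an object.

```python
from collections.abc import Iterator


class AmbiguousLength:
    """Hypothetical stand-in for a sparse array: len() raises TypeError."""

    def __len__(self):
        raise TypeError("sparse array length is ambiguous; use getnnz()")


def safe_reify(seq):
    """Sketch of a reify-like helper that tolerates objects without a usable len()."""
    if isinstance(seq, Iterator):
        seq = list(seq)
    try:
        non_empty = len(seq) > 0
    except TypeError:
        # Length is undefined or ambiguous: return the object unchanged.
        return seq
    if non_empty and isinstance(seq[0], Iterator):
        # Materialize nested iterators into lists.
        seq = [list(item) for item in seq]
    return seq


obj = AmbiguousLength()
print(safe_reify(obj) is obj)
print(safe_reify([iter([1]), iter([2, 3])]))
```

Because the check is duck-typed, it needs no import of scipy and would also cover other objects whose `__len__` raises `TypeError`.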
FYI I've created a fix for this bug here: https://github.com/dask/dask/pull/12103
cc: @phofl @joshua-gould