Allow grouping noisy arrays using tolerances
I find grouping data can be a little more intuitive when using tolerances. This can be done for example like this:
def groupby_isclose_argsort(arr, atol=0, rtol=0.1):
"""
Return a group idx of a noisy array.
TODO: argsort not available in dask.
Examples
--------
reps = 2
y = np.array([72, 72, 100, 100, 300, 300, 500, 500])
y = np.stack(reps*[y], 0)
noise = lambda y : 1 + 0.1 * (np.random.rand(*y.shape) - 0.5)
y = y * noise(y)
groupby_isclose_argsort(y)
array([[0, 0, 1, 1, 2, 2, 3, 3],
[0, 0, 1, 1, 2, 2, 3, 3]], dtype=int32)
"""
# Sort values to make sure values are monotonically increasing:
a = arr.ravel()
i = np.argsort(a)
i_rev = np.empty(i.shape, dtype=np.intp, like=a)
i_rev[i] = np.arange(len(a), like=a)
a = a[i]
# Calculate a monotonically increasing index that increase when the
# difference between current and previous value changes:
b = np.roll(a, 1)
b[0] = a[0]
by = np.cumsum(~np.isclose(a, b, atol=atol, rtol=rtol))
# tolerance = atol + rtol * b
# by = np.cumsum(np.abs(a - b) > tolerance)
return by[i_rev].reshape(arr.shape)
So on a dataset the api would look something like this ds.groupby("arr", atol=0, rtol=0.1).
Looks interesting. Usually I do something similar by specifying bins which this package now supports. Would that work for you? You do have to know the bins though...
You can always "factorize" arr yourself and then pass it to groupby directly...
PS: it's nice to see you here. please file issues for any bugs you find.
Yeah, it's the knowing the bins beforehand that I don't like. :) It gets difficult figuring out each group visually when you start getting values that are 100x-1000x larger than the smallest value and if the values (noise staying the same) changes between each test it gets tiresome redoing the bins each time. This is also a dask-focused project and that implies a little bit that visually inspecting data might be slow and cumbersome. But maybe you have a way that I haven't thought of?
I will likely factorize myself if I don't manage to convince you, but I still have hope!
There's https://github.com/deepcharles/ruptures that deals with this issue. A lot of python for loops along the arrays though, I think that can be improved though. There are C-implementations for some of the algorithms however.
It'd be good to add this in the documentation under "Labeling Groups"
Now I think https://flox.readthedocs.io/en/latest/user-stories.html would be a good place. "Overlapping Groups" also illustrates a trick.
Here's something interesting that seems relevant: https://www.cs.ucr.edu/~eamonn/MatrixProfile.html, https://pypi.org/project/matrixprofile-ts/
Here's a version using sklearns meanshift:
import numpy as np
import matplotlib.pyplot as plt
import flox
def _add_noise(
arrs: tuple[np.ndarray, ...], mult: bool = True, shuffle: bool = False
) -> tuple[np.ndarray, ...]:
noise = lambda y: 1 + 0.025 * (np.random.rand(*y.shape) - 0.5)
out = ()
for a in arrs:
b = a * noise(a) if mult else a + noise(a)
if shuffle:
np.random.shuffle(a)
out += (b,)
return out
def _cluster_meanshift(
arrs: tuple[np.ndarray, ...], bandwith: tuple[float, ...]
) -> tuple[np.ndarray, ...]:
from sklearn.cluster import MeanShift
labels = ()
for a, b in zip(arrs, bandwith):
ms = MeanShift(bandwidth=b, bin_seeding=True)
ms.fit(a.reshape(-1, 1))
labels += (ms.labels_,)
return labels
def _combine_cluster(labels: tuple[np.ndarray, ...]) -> np.ndarray:
_, out = np.unique(labels, return_inverse=1, axis=1)
return out
def plot_clusters(x, ys: tuple[np.ndarray, ...], bys: tuple[np.ndarray, ...]):
fig, axs = plt.subplots(len(ys), 1, sharex=True, layout="constrained")
if hasattr(axs, "__iter__"):
axs = axs
else:
axs = [axs]
axs[-1].set_xlabel("x")
for i, (ax, y, by) in enumerate(zip(axs, ys, bys)):
for g in np.unique(by):
idx = by == g
ax.scatter(x[idx], y[idx])
ax.set_title(f"ys[{i}]")
ax.set_ylabel(f"ys[{i}]")
return fig, axs
reps = 8
x = np.tile(np.array([3000, 3000, 3000, 3000, 5000, 5000, 5000, 5000]), reps)
y = np.tile(np.array([75, 75, 100, 100, 300, 300, 500, 500]), reps)
z = np.tile(np.array([1, 2, 1, 2, 1, 2, 1, 2]), reps)
w = np.tile(np.array([10, 10, 10, 10, 10, 10, 10, 10]), reps)
time = np.arange(*y.shape) * 0.1
x, y, z, w = _add_noise((x, y, z, w))
xl, yl, zl = _cluster_meanshift((x, y, z), bandwith=(500, 15, 0.1))
plot_clusters(time, (x, y, z), (xl, yl, zl))
g = _combine_cluster((xl, zl))
# print(g)
plot_clusters(time, (w,), (g,))
results, groups = flox.groupby_reduce(w, g, func="max")
print(results)
Some other meanshift alternatives; https://github.com/jenniferjang/meanshiftpp - article https://github.com/abhisheka456/GridShift - article