flox Allow grouping noisy arrays using tolerances

I find grouping data can be a little more intuitive when using tolerances. This can be done for example like this:

def groupby_isclose_argsort(arr, atol=0, rtol=0.1):
    """
    Return a group idx of a noisy array.

    TODO: argsort not available in dask.

    Examples
    --------
    reps = 2
    y = np.array([72, 72, 100, 100, 300, 300, 500, 500])
    y = np.stack(reps*[y], 0)
    noise = lambda y : 1 + 0.1 * (np.random.rand(*y.shape) - 0.5)
    y = y * noise(y)
    groupby_isclose_argsort(y)
    array([[0, 0, 1, 1, 2, 2, 3, 3],
           [0, 0, 1, 1, 2, 2, 3, 3]], dtype=int32)
    """
    # Sort values to make sure values are monotonically increasing:
    a = arr.ravel()
    i = np.argsort(a)
    i_rev = np.empty(i.shape, dtype=np.intp, like=a)
    i_rev[i] = np.arange(len(a), like=a)
    a = a[i]

    # Calculate a monotonically increasing index that increase when the
    # difference between current and previous value changes:
    b = np.roll(a, 1)
    b[0] = a[0]
    by = np.cumsum(~np.isclose(a, b, atol=atol, rtol=rtol))
    # tolerance = atol + rtol * b
    # by = np.cumsum(np.abs(a - b) > tolerance)

    return by[i_rev].reshape(arr.shape)

So on a dataset the api would look something like this ds.groupby("arr", atol=0, rtol=0.1).

Aug 28 '21 11:08 Illviljan

Looks interesting. Usually I do something similar by specifying bins which this package now supports. Would that work for you? You do have to know the bins though...

You can always "factorize" arr yourself and then pass it to groupby directly...

PS: it's nice to see you here. please file issues for any bugs you find.

Aug 28 '21 19:08 dcherian

Yeah, it's the knowing the bins beforehand that I don't like. :) It gets difficult figuring out each group visually when you start getting values that are 100x-1000x larger than the smallest value and if the values (noise staying the same) changes between each test it gets tiresome redoing the bins each time. This is also a dask-focused project and that implies a little bit that visually inspecting data might be slow and cumbersome. But maybe you have a way that I haven't thought of?

I will likely factorize myself if I don't manage to convince you, but I still have hope!

Aug 28 '21 21:08 Illviljan

There's https://github.com/deepcharles/ruptures that deals with this issue. A lot of python for loops along the arrays though, I think that can be improved though. There are C-implementations for some of the algorithms however.

Feb 17 '22 06:02 Illviljan

It'd be good to add this in the documentation under "Labeling Groups"

Oct 11 '22 22:10 dcherian

Now I think https://flox.readthedocs.io/en/latest/user-stories.html would be a good place. "Overlapping Groups" also illustrates a trick.

Nov 29 '22 03:11 dcherian

Here's something interesting that seems relevant: https://www.cs.ucr.edu/~eamonn/MatrixProfile.html, https://pypi.org/project/matrixprofile-ts/

Nov 28 '23 17:11 dcherian

Here's a version using sklearns meanshift:

import numpy as np
import matplotlib.pyplot as plt
import flox

def _add_noise(
    arrs: tuple[np.ndarray, ...], mult: bool = True, shuffle: bool = False
) -> tuple[np.ndarray, ...]:
    noise = lambda y: 1 + 0.025 * (np.random.rand(*y.shape) - 0.5)

    out = ()
    for a in arrs:
        b = a * noise(a) if mult else a + noise(a)
        if shuffle:
            np.random.shuffle(a)
        out += (b,)

    return out


def _cluster_meanshift(
    arrs: tuple[np.ndarray, ...], bandwith: tuple[float, ...]
) -> tuple[np.ndarray, ...]:
    from sklearn.cluster import MeanShift

    labels = ()
    for a, b in zip(arrs, bandwith):
        ms = MeanShift(bandwidth=b, bin_seeding=True)
        ms.fit(a.reshape(-1, 1))
        labels += (ms.labels_,)

    return labels


def _combine_cluster(labels: tuple[np.ndarray, ...]) -> np.ndarray:
    _, out = np.unique(labels, return_inverse=1, axis=1)
    return out


def plot_clusters(x, ys: tuple[np.ndarray, ...], bys: tuple[np.ndarray, ...]):
    fig, axs = plt.subplots(len(ys), 1, sharex=True, layout="constrained")
    if hasattr(axs, "__iter__"):
        axs = axs
    else:
        axs = [axs]

    axs[-1].set_xlabel("x")
    for i, (ax, y, by) in enumerate(zip(axs, ys, bys)):
        for g in np.unique(by):
            idx = by == g

            ax.scatter(x[idx], y[idx])
            ax.set_title(f"ys[{i}]")
            ax.set_ylabel(f"ys[{i}]")

    return fig, axs


reps = 8
x = np.tile(np.array([3000, 3000, 3000, 3000, 5000, 5000, 5000, 5000]), reps)
y = np.tile(np.array([75, 75, 100, 100, 300, 300, 500, 500]), reps)
z = np.tile(np.array([1, 2, 1, 2, 1, 2, 1, 2]), reps)
w = np.tile(np.array([10, 10, 10, 10, 10, 10, 10, 10]), reps)

time = np.arange(*y.shape) * 0.1

x, y, z, w = _add_noise((x, y, z, w))

xl, yl, zl = _cluster_meanshift((x, y, z), bandwith=(500, 15, 0.1))
plot_clusters(time, (x, y, z), (xl, yl, zl))
g = _combine_cluster((xl, zl))
# print(g)

plot_clusters(time, (w,), (g,))
results, groups = flox.groupby_reduce(w, g, func="max")
print(results)

Some other meanshift alternatives; https://github.com/jenniferjang/meanshiftpp - article https://github.com/abhisheka456/GridShift - article

Dec 03 '23 23:12 Illviljan