expan icon indicating copy to clipboard operation
expan copied to clipboard

Refactor binning

Open gbordyugov opened this issue 8 years ago • 5 comments

  1. Do we really need classes for binning? Do we ever use inheritance from the base Binning class? I've got a feeling that a bunch of functions would be enough for all our purposes.

  2. Labelling: Instead of passing format_str to Binning.labels, user should pass a function that would return the appropriate label for given category, smth like labels = bins.labels(x, lambda i: 'label " + str(i)) thus a) getting rid of not very flexible labelling code and b) giving the user the flexibility of choosing any labelling scheme.

  3. Better greedy sorting (for instance, min-heap based).

gbordyugov avatar Jul 20 '17 10:07 gbordyugov

Example of heap-based binning:

from heapq import heapify, heappop, heappush

def doTheMagic(items, nbins):
    categories = []
    for item in set(items):
        pair = (sum([i == item for i in items]), [item])
        categories.append(pair)

    heapify(categories)

    while len(categories) > nbins:
        smallest       = heappop(categories)
        secondSmallest = heappop(categories)
        count0, items0 =       smallest[0],       smallest[1]
        count1, items1 = secondSmallest[0], secondSmallest[1]
        heappush(categories, (count0 + count1, items0 + items1))

    return categories

doTheMagic([1, 1, 2, 2, 3, 3, 5], 4)

gbordyugov avatar Jul 20 '17 13:07 gbordyugov

Agree completely, we need to prioritize these changes so that we can fix sga() functionality ASAP.

mkolarek avatar Jul 21 '17 09:07 mkolarek

I've constructed an example where your greedy algorithm won't work. ;)

Given items=[10, 10, 20, 20], nbins=2, the algorithm will compute the following steps: -> [10, 10, 20, 20] -> [20, 20, 20] -> [40, 20]

where the optimal binning is [30, 30].

shansfolder avatar Jul 24 '17 11:07 shansfolder

Nice catch! However, the existing algorithm would fail here too.

Another reason to rethink it.

gbordyugov avatar Jul 24 '17 15:07 gbordyugov

A nice paper on categorical binning: http://www.aaai.org/ocs/index.php/IJCAI/IJCAI-09/paper/viewFile/625/705

gbordyugov avatar Aug 01 '17 09:08 gbordyugov