Refactor binning
- Do we really need classes for binning? Do we ever use inheritance from the base `Binning` class? I've got a feeling that a bunch of functions would be enough for all our purposes.
- Labelling: instead of passing `format_str` to `Binning.labels`, the user should pass a function that returns the appropriate label for a given category, something like `labels = bins.labels(x, lambda i: 'label ' + str(i))`, thus a) getting rid of the not very flexible labelling code and b) giving the user the flexibility to choose any labelling scheme.
- Better greedy sorting (for instance, min-heap based).
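To make the labelling point concrete, here is a minimal sketch written as a plain function rather than a method. The name `make_labels`, its parameters, and the assumption that bins are identified by integer indices are all hypothetical. It is not the current `Binning.labels` signature.

```python
def make_labels(bin_indices, label_fn=str):
    """Map each bin index to a label produced by a user-supplied callable.

    Hypothetical helper: assumes bins are identified by integer indices.
    """
    return [label_fn(i) for i in bin_indices]


# The caller picks the labelling scheme instead of passing a format string.
labels = make_labels([0, 1, 2], lambda i: 'label ' + str(i))
```

With a callable, any scheme (ranges, names looked up in a dict, localized strings) works without touching the binning code.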
Example of heap-based binning:
    from collections import Counter
    from heapq import heapify, heappop, heappush

    def doTheMagic(items, nbins):
        # One heap entry per distinct value: (count, [values in this bin]).
        categories = [(count, [item]) for item, count in Counter(items).items()]
        heapify(categories)
        # Greedily merge the two smallest bins until only nbins remain.
        while len(categories) > nbins:
            count0, items0 = heappop(categories)
            count1, items1 = heappop(categories)
            heappush(categories, (count0 + count1, items0 + items1))
        return categories

    doTheMagic([1, 1, 2, 2, 3, 3, 5], 4)
Agree completely, we need to prioritize these changes so that we can fix sga() functionality ASAP.
I've constructed an example where your greedy algorithm won't work. ;)
Given categories with counts [10, 10, 20, 20] and nbins=2, the algorithm will go through the following steps:
-> [10, 10, 20, 20]
-> [20, 20, 20]
-> [40, 20]
whereas the optimal binning is [30, 30].
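The counterexample can be reproduced with a short script. This is a sketch that works directly on category counts. `greedy_merge` and `best_split` are hypothetical names, and "optimal" is assumed here to mean the split whose largest bin is as small as possible.

```python
from heapq import heapify, heappop, heappush
from itertools import product


def greedy_merge(counts, nbins):
    """Repeatedly merge the two smallest bins until nbins remain."""
    heap = list(counts)
    heapify(heap)
    while len(heap) > nbins:
        heappush(heap, heappop(heap) + heappop(heap))
    return sorted(heap)


def best_split(counts, nbins):
    """Exhaustive search for the assignment with the smallest largest bin.

    Exponential in len(counts); only meant for tiny sanity checks.
    """
    best = None
    for assignment in product(range(nbins), repeat=len(counts)):
        totals = [0] * nbins
        for count, b in zip(counts, assignment):
            totals[b] += count
        if best is None or max(totals) < max(best):
            best = totals
    return sorted(best)


counts = [10, 10, 20, 20]
print(greedy_merge(counts, 2))  # [20, 40]
print(best_split(counts, 2))    # [30, 30]
```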
Nice catch! However, the existing algorithm would fail here too.
Another reason to rethink it.
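One possible direction, sketched under assumptions: instead of merging the two smallest bins, place categories largest-first into the currently lightest bin. This is the classic longest-processing-time heuristic for number partitioning, a swapped-in technique rather than the existing code or anything proposed above. `lpt_binning` and its `(count, label)` input format are hypothetical. It is still a heuristic and is not guaranteed to be optimal in general.

```python
from heapq import heapify, heappop, heappush


def lpt_binning(weighted_items, nbins):
    """Assign (count, label) pairs to nbins bins, largest count first.

    Each item goes into the bin with the smallest running total,
    tracked with a min-heap of (total, bin_index, labels).
    """
    heap = [(0, i, []) for i in range(nbins)]
    heapify(heap)
    for count, label in sorted(weighted_items, key=lambda p: p[0], reverse=True):
        total, index, labels = heappop(heap)
        heappush(heap, (total + count, index, labels + [label]))
    return [(total, labels) for total, _, labels in sorted(heap)]


bins = lpt_binning([(10, 'a'), (10, 'b'), (20, 'c'), (20, 'd')], 2)
```

On the counterexample above, both bins end up with a total of 30.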
A nice paper on categorical binning: http://www.aaai.org/ocs/index.php/IJCAI/IJCAI-09/paper/viewFile/625/705