python-fp-growth
python-fp-growth copied to clipboard
incorrect results when the number of distinct items is large
When the number of distinct items is larger than the max itemset size this library gives incorrect results.
I found this in two different ways: 1.) comparing against the brute-force method of computing all the subsets and 2.) checking that the apriori principle is satisfied by the results (i.e. the support of a superset should never be larger than the support of any of its subsets).
Here's a test:
#all combinations of length-4 from 20 different items
database = [frozenset(t) for t in itertools.combinations(range(20), r=4)]
fitemlists_w_sup = fp_growth.find_frequent_itemsets(database, minimum_support=2,
include_support=True)
fitemlists_w_sup = list(fitemlists_w_sup) # I will reuse the generator
#brute force way calculate support
counts = []
for transaction in database:
psets = powerset(transaction)
psets = [pset for pset in psets if len(pset) > 0] # remove empty set
psets = map(frozenset, psets)
counts.extend(psets)
counts = collections.Counter(counts)
#check that the brute-force method agrees with the fp-growth library
fpd = dict([(frozenset(x[0]), x[1]) for x in fitemlists_w_sup])
for fitemlist_w_sup in fitemlists_w_sup:
if fitemlist_w_sup[1] != counts[frozenset(fitemlist_w_sup[0])]:
print 'fp support is wrong. fp support: {0}, true support: {1}'\
.format(fitemlist_w_sup, counts[frozenset(fitemlist_w_sup[0])])
#check that the apriori principle is satisfied
for subpset in powerset(fitemlist_w_sup[0]):
if len(subpset) > 0:
if fpd[frozenset(subpset)] < fpd[frozenset(fitemlist_w_sup[0])]:
print 'apriori?, subset and support: {0} {1}, superset and support: {2}'\
.format(subpset, fpd[frozenset(subpset)],fitemlist_w_sup)
For reference, here's how I calculated powerset:
def powerset(iterable):
xs = list(iterable)
return itertools.chain.from_iterable(
itertools.combinations(xs, n) for n in range(max_size + 1))