Added ability to cache the matrix across queries in which `master` is constant
Hi @Bergvca,
This PR addresses @asnorthrup's issue: see here.
Within this PR, the end-user is now able to:
- execute multiple separate queries on the same `master` dataset while the `duplicates` dataset changes;
- take advantage of caching to boost the performance of such queries (if `master` is small enough).
```python
from string_grouper import StringGrouper

...  # set pandas.Series master, df1, and df2
sg = StringGrouper(master, enable_cache=True)      # build corpus
matches1 = sg.match_strings(None, duplicates=df1)  # first query: compares df1 against master while caching master's matrices
matches2 = sg.match_strings(None, duplicates=df2)  # subsequent queries: compare df2 against master using the cache
```
Caution: caching is likely to provide a performance benefit only when the input `master` dataset is "small enough". The challenge would be in determining exactly how small `master` should be for caching to become beneficial.
Awesome, thanks for taking some time to implement this, @ParticularMiner. Can't wait to test it out and report back what I see. I am especially curious about the size issue. I guess I hadn't considered the fact that this could have a performance hit, since the 'other way' creates the entire embedding anyway. I need to wrap my head around that after re-reading your code changes. I'm sure there are performance enhancements built into the way the code was originally structured for when size > RAM, but I hadn't thought to investigate that. Thanks for pointing it out.
You're very welcome, @asnorthrup!
Indeed, we'd be very grateful if you would report your benchmarks related to this code. That would help us decide if these changes are worth merging or not.
@asnorthrup
The 'other way' does not, in general, embed the entire master-matrix in RAM. Only the entire corpus is stored. To boost performance, the master series is partitioned in such a way that each partition can be transformed into a matrix small enough to fit into the CPU cache, where the processing is performed.
The partitions are processed one after the other, meaning that each partition's matrix replaces the former matrix in RAM (and CPU cache), keeping overall memory usage small. As a result, any time an end-user submits a new query (even with the same master series), each partition's matrix always needs to be rebuilt from the corpus.
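As a rough, purely illustrative sketch of that partitioned processing (the toy data, partition size, and vectorizer settings here are my own stand-ins, not the library's actual internals):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy master series and a corpus-wide vectorizer, standing in for the internals.
master = pd.Series(['foo inc', 'foo incorporated', 'bar llc', 'baz ltd'])
vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(3, 3)).fit(master)

partition_size = 2
for start in range(0, len(master), partition_size):
    # tf-idf matrix for this partition only
    block = vectorizer.transform(master.iloc[start:start + partition_size])
    # ... match `block` against the duplicates' matrix here ...
    # `block` is overwritten on the next iteration, keeping RAM usage small.
```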
However, with the 'new way' (provided in this PR), we'd be storing each partition's matrix, as soon as it is computed, in a Python dict which I called `cache` (not to be confused with the CPU cache). So with each new query involving the same master series, the partitions' matrices do not need to be recomputed. Instead they are looked up from the Python dict (`cache`). We are, of course, assuming that a Python dict lookup is faster than rebuilding a matrix from the corpus.
Since `cache` takes up significant space in RAM, it is possible for it to degrade performance. But the extent to which performance is enhanced/degraded by `cache` needs to be investigated experimentally.
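A minimal sketch of that memoization pattern, reusing the toy `master` and vectorizer settings from the previous snippet (again my own illustration, not the PR's actual code):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

class CachedCorpus:
    """Memoize per-partition tf-idf matrices in a plain dict ("cache")."""

    def __init__(self, corpus):
        self._vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(3, 3)).fit(corpus)
        self._cache = {}  # partition key -> tf-idf matrix

    def matrix_for(self, key, strings):
        if key not in self._cache:  # first query: compute and store the matrix
            self._cache[key] = self._vectorizer.transform(strings)
        return self._cache[key]  # subsequent queries: a plain dict lookup

cc = CachedCorpus(master)
m1 = cc.matrix_for('master', master)  # computed from the corpus and cached
m2 = cc.matrix_for('master', master)  # retrieved straight from the dict
```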
To the best of my understanding, I tried to see what kind of difference 'rebuilding a matrix' vs. 'dict lookup' made. See if you agree with how I used the Grouper in the new scenario.
But the outcome seems fairly consistent: with my small duplicates size (5 records in my test) vs. a varying master size, I couldn't get an appreciable difference between the two versions. The cached version generally ran faster, but only on the order of 0.1 seconds, even on a master with sizes near ~1M records.
Unless there is something about the tests that I'm not accounting for, I would say that the update really doesn't give enough of a performance increase to warrant making more base functions. It doesn't hurt, but for simplicity, I'm not sure I'd worry about it for my use cases (see attachments: test_cache code.txt, Test_Cache_results.txt).
Thank you very much @asnorthrup for your code!
It's very good to test against different master series. But I'm beginning to wonder if the changes I introduced in this PR improve functionality at the cost of confusing the end-user.
I'll try to explain further below:
1. In the inner loop of your code, you
   - build the corpus (using `sg = StringGrouper(master, ...)`), then
   - submit a single query on that corpus (using `sg.match_strings(None, duplicates, ...)`), and
   - repeat the above two steps with `enable_cache=True`.

   This procedure is unlikely to demonstrate the impact of caching because you are submitting only one query in each case (where Case 1 means "no caching", and Case 2 means "caching").

2. You need to submit at least two queries in each case. The impact on performance can only be seen from the 2nd, 3rd, ... queries onwards (see the sketch after this list).

3. This is because building a corpus alone does not create any tf-idf matrices, but submitting a query on the corpus does. So in both cases above, the first query computes fresh tf-idf matrices from the corpus. But in Case 2, the first query also saves the tf-idf matrix of `master` in the "cache".

4. So for any subsequent queries on the same corpus:
   - in Case 1: the same tf-idf matrices are computed again from the corpus even though `master` does not change;
   - in Case 2: the tf-idf matrix for `master` is not computed again, but is instead retrieved from `cache` in order to save time.
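To make that concrete, here is a minimal timing sketch of the intended procedure (assuming `master` and `duplicates` are pandas Series as in the snippet at the top; this is my own illustration, not @asnorthrup's actual test script):

```python
import time

from string_grouper import StringGrouper

for enable_cache in (False, True):  # Case 1 ("no caching"), then Case 2 ("caching")
    sg = StringGrouper(master, enable_cache=enable_cache)  # builds the corpus only
    sg.match_strings(None, duplicates=duplicates)  # 1st query: always computes fresh matrices
    start = time.perf_counter()
    sg.match_strings(None, duplicates=duplicates)  # 2nd query: hits the cache only in Case 2
    print(f'enable_cache={enable_cache}: {time.perf_counter() - start:.3f} s')
```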
I hope this explanation helps.
Good explanation. I had it in my mind that building `master` was the 'first query', but that was just a quick assumption, not a case of a 'confused end-user'; more just something I could have read into better.
Anyway, I've updated the tests. They show exactly what you intended: a dramatic speed-up. Even when the original tf-idf matrix took 3+ seconds to build with 700k+ names, the speed-up is dramatic.
With my new understanding and test results, I think this is worth considering for inclusion, since the performance increase would be significant in some cases.
Let's say I wanted to find duplicates, then maybe 'stem' the duplicates, then apply some other transformation; the speed increase against `master` would be significant in the 'stem' and 'transform' queries: 5-6 seconds vs. 0.1 seconds for the larger customer lists (see the sketch below).
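For illustration, such a workflow might look like the following, where `customers` is a hypothetical pandas Series of names and `stem` and `transform` are hypothetical transformation functions:

```python
sg = StringGrouper(master, enable_cache=True)  # build corpus once
raw = sg.match_strings(None, duplicates=customers)  # first query: ~5-6 s (builds master's matrices)
stemmed = sg.match_strings(None, duplicates=customers.map(stem))  # ~0.1 s: master served from cache
other = sg.match_strings(None, duplicates=customers.map(transform))  # ~0.1 s: cache hit again
```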
@asnorthrup
That's exciting! Thank you very much for taking the trouble to benchmark this. Hopefully these changes will soon appear in a new version.
Hi all! Thanks again for your work @ParticularMiner! I will try to add it to a new version soon once I have some time.