Added ability to cache the matrix across queries in which `master` is constant
Hi @Bergvca,
This PR addresses @asnorthrup's issue: see here.
Within this PR, the end-user is now able to:
- execute multiple separate queries on the same `master` dataset while the `duplicates` dataset changes;
- take advantage of caching to boost the performance of such queries (if `master` is small enough).
```python
from string_grouper import StringGrouper

...  # set pandas.Series master, df1, and df2
sg = StringGrouper(master, enable_cache=True)      # build corpus
matches1 = sg.match_strings(None, duplicates=df1)  # first query: compares df1 against master while caching master's matrices
matches2 = sg.match_strings(None, duplicates=df2)  # subsequent queries: compare df2 against master using the cache
```
Caution: caching is likely to provide a performance benefit only when the input `master` dataset is "small enough". The challenge would be in determining exactly how small `master` should be for caching to become beneficial.
Awesome, thanks for taking some time to implement this, @ParticularMiner. Can't wait to test it out and report back what I see. I am especially curious about the size issue. I guess I hadn't considered the fact that this could have a performance hit, since the 'other way' creates the entire embedding anyway. I need to wrap my head around that after re-reading your code changes. I'm sure there are performance enhancements built into the way the code was originally structured for when size > RAM, but I hadn't thought to investigate that. Thanks for pointing it out.
You're very welcome, @asnorthrup!
Indeed, we'd be very grateful if you would report your benchmarks related to this code. That would help us decide if these changes are worth merging or not.
@asnorthrup
The 'other way' does not, in general, embed the entire master-matrix in RAM. Only the entire corpus is stored. To boost performance, the master series is partitioned in such a way that each partition can be transformed into a matrix small enough to fit into the CPU cache, where the processing is performed.
The partitions are processed one after the other, meaning that each partition's matrix replaces the former matrix in RAM (and CPU cache), keeping overall memory usage small. As a result, any time an end-user submits a new query (even with the same master series), each partition's matrix always needs to be rebuilt from the corpus.
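As a rough, purely illustrative sketch of that partitioned processing (the toy data, partition size, and vectorizer settings here are my own stand-ins, not the library's actual internals):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy master series and a corpus-wide vectorizer, standing in for the internals.
master = pd.Series(['foo inc', 'foo incorporated', 'bar llc', 'baz ltd'])
vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(3, 3)).fit(master)

partition_size = 2
for start in range(0, len(master), partition_size):
    # tf-idf matrix for this partition only
    block = vectorizer.transform(master.iloc[start:start + partition_size])
    # ... match `block` against the duplicates' matrix here ...
    # `block` is overwritten on the next iteration, keeping RAM usage small.
```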
However, with the 'new way' (provided in this PR), we'd be storing each partition's matrix, as soon as it is computed, in a Python dict which I called `cache` (not to be confused with the CPU cache). So with each new query involving the same master series, the partitions' matrices do not need to be recomputed. Instead they are looked up from the Python dict (`cache`). We are, of course, assuming that a Python dict lookup is faster than rebuilding a matrix from the corpus.
Since `cache` takes up significant space in RAM, it is possible for it to degrade performance. But the extent to which performance is enhanced/degraded by `cache` needs to be investigated experimentally.
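A minimal sketch of that memoization pattern, reusing the toy `master` and vectorizer settings from the previous snippet (again my own illustration, not the PR's actual code):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

class CachedCorpus:
    """Memoize per-partition tf-idf matrices in a plain dict ("cache")."""

    def __init__(self, corpus):
        self._vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(3, 3)).fit(corpus)
        self._cache = {}  # partition key -> tf-idf matrix

    def matrix_for(self, key, strings):
        if key not in self._cache:  # first query: compute and store the matrix
            self._cache[key] = self._vectorizer.transform(strings)
        return self._cache[key]  # subsequent queries: a plain dict lookup

cc = CachedCorpus(master)
m1 = cc.matrix_for('master', master)  # computed from the corpus and cached
m2 = cc.matrix_for('master', master)  # retrieved straight from the dict
```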
To the best of my understanding, I tried to see what kind of difference 'rebuilding a matrix' vs. 'dict lookup' made. See if you agree with how I used the Grouper in the new scenario.
But the outcome seems fairly consistent: with my small duplicates size (5 records in my test) vs. a varying master size, I couldn't get an appreciable difference between the two versions. The cached version generally ran faster, but only on the order of 0.1 seconds, even on a master with sizes near ~1M records.
Unless there is something about the tests that I'm not accounting for, I would say that the update really doesn't give enough of a performance increase to warrant making more base functions. It doesn't hurt, but for simplicity, I'm not sure I'd worry about it for my use cases (see attachments: test_cache code.txt, Test_Cache_results.txt).
Thank you very much @asnorthrup for your code!
It's very good to test against different master series. But I'm beginning to wonder if the changes I introduced in this PR improve functionality at the cost of confusing the end-user.
I'll try to explain further below:
1. In the inner loop of your code, you
   - build the corpus (using `sg = StringGrouper(master, ...)`), then
   - submit a single query on that corpus (using `sg.match_strings(None, duplicates, ...)`), and
   - repeat the above two steps with `enable_cache=True`.

   This procedure is unlikely to demonstrate the impact of caching because you are submitting only one query in each case (where Case 1 means "no caching", and Case 2 means "caching").

2. You need to submit at least two queries in each case. The impact on performance can only be seen from the 2nd, 3rd, ... queries onwards (see the sketch after this list).

3. This is because building a corpus alone does not create any tf-idf matrices, but submitting a query on the corpus does. So in both cases above, the first query computes fresh tf-idf matrices from the corpus. But in Case 2, the first query also saves the tf-idf matrix of `master` in the "cache".

4. So for any subsequent queries on the same corpus:
   - in Case 1: the same tf-idf matrices are computed again from the corpus even though `master` does not change;
   - in Case 2: the tf-idf matrix for `master` is not computed again, but is instead retrieved from `cache` in order to save time.
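To make that concrete, here is a minimal timing sketch of the intended procedure (assuming `master` and `duplicates` are pandas Series as in the snippet at the top; this is my own illustration, not @asnorthrup's actual test script):

```python
import time

from string_grouper import StringGrouper

for enable_cache in (False, True):  # Case 1 ("no caching"), then Case 2 ("caching")
    sg = StringGrouper(master, enable_cache=enable_cache)  # builds the corpus only
    sg.match_strings(None, duplicates=duplicates)  # 1st query: always computes fresh matrices
    start = time.perf_counter()
    sg.match_strings(None, duplicates=duplicates)  # 2nd query: hits the cache only in Case 2
    print(f'enable_cache={enable_cache}: {time.perf_counter() - start:.3f} s')
```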
I hope this explanation helps.
Good explanation. I had it in my mind that building `master` was the 'first query', but that was just a quick assumption, not a case of a 'confused end-user'; more just something I could have read into better.
Anyway, I've updated the tests. They show exactly what you intended: a dramatic speed-up. Even when the original tf-idf matrix took 3+ seconds to build with 700k+ names, the speed-up is dramatic.
With my new understanding and test results, I think this is worth considering for inclusion, since the performance increase would be significant in some cases.
Let's say I wanted to find duplicates, then maybe 'stem' the duplicates, then apply some other transformation; the speed increase against `master` would be significant in the 'stem' and 'transform' queries: 5-6 seconds vs. 0.1 seconds for the larger customer lists (see the sketch below).
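For illustration, such a workflow might look like the following, where `customers` is a hypothetical pandas Series of names and `stem` and `transform` are hypothetical transformation functions:

```python
sg = StringGrouper(master, enable_cache=True)  # build corpus once
raw = sg.match_strings(None, duplicates=customers)  # first query: ~5-6 s (builds master's matrices)
stemmed = sg.match_strings(None, duplicates=customers.map(stem))  # ~0.1 s: master served from cache
other = sg.match_strings(None, duplicates=customers.map(transform))  # ~0.1 s: cache hit again
```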
@asnorthrup
That's exciting! Thank you very much for taking the trouble to benchmark this. Hopefully these changes will soon appear in a new version.
Hi all! Thanks again for your work @ParticularMiner! I will try to add it to a new version soon once I have some time.