Very simple, fast test of hashing performance for moderately sized files
I noticed the conversation about hashing in one of the other issues. Perhaps this would be useful?
It creates a directory tree of 1000 files, each 256 KB in size, and then simply hashes each one. Since the files are created fresh on every run, any distortion from warm/cold disc caches should at least be consistent between runs. The whole test only takes 1-2 seconds for me.
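In case it helps to see the shape of it, something along these lines (a rough sketch only -- the file layout, the use of MD5, and the timing code are illustrative assumptions, not necessarily what the attached test does):

```python
# Sketch of the benchmark idea: create a fresh tree of 1000 files of 256 KB
# each, then hash them all sequentially and report the elapsed time.
# The layout, hash function and helper names below are assumptions.
import hashlib
import os
import tempfile
import time

NUM_FILES = 1000
FILE_SIZE = 256 * 1024  # 256 KB per file


def create_tree(root):
    """Create NUM_FILES fresh files spread over a few subdirectories."""
    paths = []
    for i in range(NUM_FILES):
        subdir = os.path.join(root, "dir{}".format(i % 10))
        os.makedirs(subdir, exist_ok=True)
        path = os.path.join(subdir, "file{}.bin".format(i))
        with open(path, "wb") as f:
            f.write(os.urandom(FILE_SIZE))
        paths.append(path)
    return paths


def hash_all(paths):
    """Hash every file sequentially, one full read per file."""
    for path in paths:
        with open(path, "rb") as f:
            hashlib.md5(f.read()).hexdigest()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        paths = create_tree(root)
        start = time.perf_counter()
        hash_all(paths)
        elapsed = time.perf_counter() - start
        print("hashed {} files in {:.2f}s".format(len(paths), elapsed))
```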
Current coverage is 88.86% (diff: 100%)

    @@            master     #241   diff @@
    ==========================================
      Files            1        1
      Lines         1040      997    -43
      Methods          0        0
      Messages         0        0
      Branches       166      158     -8
    ==========================================
    - Hits           930      886    -44
    - Misses          82       83     +1
      Partials        28       28

Powered by Codecov. Last update 1e7b28a...7c8f2d1
Thanks, I think it's a good idea to have some sort of standard benchmark. I suppose instead of creating a new file, this could be part of the existing performancetests.py test case?
Alas (?), the discussion in #239 suggests that the slow hashing issue appears to be related to concurrent clcache instances, at least for @akleber's setup -- so I'm not sure the test code as it stands reproduces that issue.
In any case, I very much agree that some sort of performance test for this functionality would be good -- though it's not clear to me which scenarios to benchmark.
I started this with a theory that we could compute hashes in parallel using concurrent.futures if we had multiple CPUs (some doing I/O, some doing hashing). My tests here showed that it actually made things worse by quite a considerable margin.
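Roughly, the parallel variant looked like this (a sketch of the approach only; the executor type, worker count, and the `hash_file` helper are illustrative assumptions, not the exact code I ran):

```python
# Sketch of the parallel-hashing idea tried above; hash_file() is a
# hypothetical helper that reads and MD5-hashes a single file.
import concurrent.futures
import hashlib


def hash_file(path):
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


def hash_sequential(paths):
    """Baseline: hash every file one after another."""
    return [hash_file(p) for p in paths]


def hash_parallel(paths, workers=4):
    """Hash files concurrently; threads are used here since hashlib
    releases the GIL for large buffers and the work is partly I/O-bound."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(hash_file, paths))
```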
Indeed, it matches @akleber's observation that concurrent hashing of files is substantially slower than sequential hashing.
Maybe this is another argument in favor of some sort of server process which acts as the sole instance that computes (and potentially caches) the hashes sequentially.
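Such a server could, for instance, memoize hashes keyed on file metadata -- purely a sketch under assumptions (the key choice and class shape below are made up for illustration):

```python
# Sketch of a server-side hash cache keyed on (path, mtime, size) so a
# changed file gets re-hashed; the key choice is an assumption, not the
# actual clcache design.
import hashlib
import os


class HashServerCache:
    def __init__(self):
        self._cache = {}

    def get_hash(self, path):
        st = os.stat(path)
        key = (path, st.st_mtime_ns, st.st_size)
        if key not in self._cache:
            with open(path, "rb") as f:
                self._cache[key] = hashlib.md5(f.read()).hexdigest()
        return self._cache[key]
```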
I think a performance test to check how fast cache hits and cache misses are (both concurrently and sequentially) would be a nice thing to have, but that should probably go into performancetests.py.