Very simple, fast test of hashing performance for moderately sized files
I noticed the conversation about hashing in one of the other issues. Perhaps this would be useful?
It creates a directory tree of 1000 files, each 256 KB in size, and then simply hashes each one. Since the files are created fresh on every run, any distortion from warm/cold disc caches should at least be consistent between runs. The whole test only takes 1-2 seconds for me.
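In case it helps to see the shape of it, something along these lines (a rough sketch only -- the file layout, the use of MD5, and the timing code are illustrative assumptions, not necessarily what the attached test does):

```python
# Sketch of the benchmark idea: create a fresh tree of 1000 files of 256 KB
# each, then hash them all sequentially and report the elapsed time.
# The layout, hash function and helper names below are assumptions.
import hashlib
import os
import tempfile
import time

NUM_FILES = 1000
FILE_SIZE = 256 * 1024  # 256 KB per file


def create_tree(root):
    """Create NUM_FILES fresh files spread over a few subdirectories."""
    paths = []
    for i in range(NUM_FILES):
        subdir = os.path.join(root, "dir{}".format(i % 10))
        os.makedirs(subdir, exist_ok=True)
        path = os.path.join(subdir, "file{}.bin".format(i))
        with open(path, "wb") as f:
            f.write(os.urandom(FILE_SIZE))
        paths.append(path)
    return paths


def hash_all(paths):
    """Hash every file sequentially, one full read per file."""
    for path in paths:
        with open(path, "rb") as f:
            hashlib.md5(f.read()).hexdigest()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        paths = create_tree(root)
        start = time.perf_counter()
        hash_all(paths)
        elapsed = time.perf_counter() - start
        print("hashed {} files in {:.2f}s".format(len(paths), elapsed))
```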
Current coverage is 88.86% (diff: 100%)

    @@            master     #241   diff @@
    ==========================================
      Files            1        1
      Lines         1040      997    -43
      Methods          0        0
      Messages         0        0
      Branches       166      158     -8
    ==========================================
    - Hits           930      886    -44
    - Misses          82       83     +1
      Partials        28       28

Powered by Codecov. Last update 1e7b28a...7c8f2d1
Thanks, I think it's a good idea to have some sort of standard benchmark. I suppose instead of creating a new file, this could be part of the existing performancetests.py test case?
Alas (?), the discussion in #239 suggests that the slow hashing issue appears to be related to concurrent clcache instances, at least for @akleber's setup -- so I'm not sure the test code as it stands reproduces that issue.
In any case, I very much agree that some sort of performance test for this functionality would be good -- though it's not clear to me which scenarios to benchmark.
I started this with a theory that we could compute hashes in parallel using concurrent.futures if we had multiple CPUs (some doing I/O, some doing hashing). My tests here showed that it actually made things worse by quite a considerable margin.
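Roughly, the parallel variant looked like this (a sketch of the approach only; the executor type, worker count, and the `hash_file` helper are illustrative assumptions, not the exact code I ran):

```python
# Sketch of the parallel-hashing idea tried above; hash_file() is a
# hypothetical helper that reads and MD5-hashes a single file.
import concurrent.futures
import hashlib


def hash_file(path):
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


def hash_sequential(paths):
    """Baseline: hash every file one after another."""
    return [hash_file(p) for p in paths]


def hash_parallel(paths, workers=4):
    """Hash files concurrently; threads are used here since hashlib
    releases the GIL for large buffers and the work is partly I/O-bound."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(hash_file, paths))
```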
Indeed, it matches @akleber's observation that concurrent hashing of files is substantially slower than sequential hashing.
Maybe this is another argument in favor of some sort of server process which acts as the sole instance that computes (and potentially caches) the hashes sequentially.
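Such a server could, for instance, memoize hashes keyed on file metadata -- purely a sketch under assumptions (the key choice and class shape below are made up for illustration):

```python
# Sketch of a server-side hash cache keyed on (path, mtime, size) so a
# changed file gets re-hashed; the key choice is an assumption, not the
# actual clcache design.
import hashlib
import os


class HashServerCache:
    def __init__(self):
        self._cache = {}

    def get_hash(self, path):
        st = os.stat(path)
        key = (path, st.st_mtime_ns, st.st_size)
        if key not in self._cache:
            with open(path, "rb") as f:
                self._cache[key] = hashlib.md5(f.read()).hexdigest()
        return self._cache[key]
```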
I think a performance test to check how fast cache hits and cache misses are (both concurrently and sequentially) would be a nice thing to have, but that should probably go into performancetests.py.