Nick Crews
### What happens? Thanks for this great library. It was amazingly easy to get it set up and start using it on some parquet files. Please excuse me if this...
Looking at how arg and kwarg hashing is done now, it is a little different from how the stdlib does it with functools.cache (see https://github.com/python/cpython/blob/3.10/Lib/functools.py#L448). I think this builtin method...
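For context on the stdlib side of that comparison, here is a minimal sketch of how `functools.lru_cache` keys its cache on the call pattern. It only illustrates CPython's behavior; the function `square` and the `calls` list are hypothetical names made up for this example, and nothing here reflects the hashing code the issue is actually about.

```python
from functools import lru_cache

calls = []  # records every time the wrapped function body really runs


@lru_cache(maxsize=None)
def square(x):
    calls.append(x)
    return x * x


square(3)    # first call: cache miss, body runs
square(3)    # same positional pattern: cache hit, body skipped
square(x=3)  # keyword pattern: CPython builds a different key, so another miss

info = square.cache_info()
print(info.hits, info.misses, calls)
```

The point of the sketch is that the stdlib treats `square(3)` and `square(x=3)` as distinct calls, because the key is built from the positional tuple plus the keyword items rather than from the bound arguments.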
I compute the pairwise scores for some data, and pass these scores to clustering. If my scores contain any 0s and if connected_components requires filtering, then we go into an...
See each commit individually; nothing functional changes. EDIT: there still shouldn't be any functional changes, though this has gotten much more substantial.
From https://github.com/dedupeio/dedupe/issues/1045#issuecomment-1149052541 Why do we have the distinction between Static and non-Static classes? Is it to prevent re-training an already trained model? I don't think this needs to be enforced...
Currently, you choose whether or not to use index predicates by passing the `index_predicates` flag in `prepare_training()`. This has some drawbacks - Indexing happens regardless, in a previous step. Slow....
See https://github.com/dedupeio/dedupe/pull/1053/commits/cc609bb427d3b4db86b3d00d20f39adfc7a1eb0a There I already removed the internals that make it look like we actually use these params. Now we just need to decide what to do. Add a warning...
Decision coming from #1032
Instead of keeping the benchmark/integration test datasets in the benchmarks/ directory, what if we included them in the dedupe/ directory, so they are actually part of...