RapidFuzz icon indicating copy to clipboard operation
RapidFuzz copied to clipboard

Add support for Smith Waterman algorithm

Open maxbachmann opened this issue 4 years ago • 4 comments

The Smith Waterman algorithm is a commonly used metric to compare strings. It would be useful to add it to RapidFuzz.

maxbachmann avatar Jan 02 '22 15:01 maxbachmann

Hi, will this algorithm be added recently?

hongduosun avatar Aug 22 '22 13:08 hongduosun

I am not sure yet. I think I could add a simple implementation in the close future. When matching long sequences you would probably want to use a more optimized implementation like: https://github.com/jeffdaily/parasail (has python bindings) I do not think I will have the time to write an implementation that is even close to this level of optimization.

maxbachmann avatar Aug 22 '22 17:08 maxbachmann

Got it, thanks for the great work!

hongduosun avatar Aug 23 '22 02:08 hongduosun

After looking at the paraseil python bindings I found them way to hard to use. E.g.

Be careful using the attributes of the Result object - especially on Result instances constructed on the fly. For example, calling parasail.sw_trace("asdf", "asdf", 11, 1, parasail.blosum62).cigar.seq returns a numpy.ndarray that wraps a pointer to memory that is invalid because the Cigar is deallocated before the seq statement. You can avoid this problem by assigning Result instances to variables as in the example above.

Is not really acceptable in my opinion. So I decided to add at least a simple implementation for people who just want to have a properly working implementation.

maxbachmann avatar Aug 23 '22 13:08 maxbachmann