Is Euclidean distance not supported?

Open bwlim opened this issue 11 years ago • 4 comments

I'm very happy to see an open source vector database! Simbase is great for me, thanks :D

I have a question (or maybe a new feature request). The supported similarity (score) functions are "cosinesq" and "jensenshannon". The cosine similarity function does not take vector magnitude into account, but in my application vector magnitude is meaningful for similar-vector search. I would like a similarity function based on Euclidean distance to be supported as well :D Could you give me some guidance? Thanks for your great vector DB :D
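To make the magnitude point concrete, here is a minimal Java sketch comparing plain cosine similarity with Euclidean distance on two dense vectors that point in the same direction but differ in length. The class and method names (`MagnitudeDemo`, `cosine`, `euclidean`) are made up for illustration and are not part of Simbase; the sketch also assumes non-zero vectors of equal length.

```java
// Illustrative only: not Simbase code. Assumes equal-length, non-zero vectors.
final class MagnitudeDemo {

    // Cosine similarity: depends only on the angle between a and b.
    static double cosine(float[] a, float[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int k = 0; k < a.length; k++) {
            dot += (double) a[k] * b[k];
            normA += (double) a[k] * a[k];
            normB += (double) b[k] * b[k];
        }
        return dot / Math.sqrt(normA * normB);
    }

    // Euclidean distance: grows as the magnitudes of a and b diverge.
    static double euclidean(float[] a, float[] b) {
        double sum = 0.0;
        for (int k = 0; k < a.length; k++) {
            double d = (double) a[k] - b[k];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }
}
```

For `{1, 2}` and `{10, 20}`, `cosine` reports the vectors as maximally similar, while `euclidean` reports a clearly non-zero distance, which is the behaviour the request is after.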

bwlim avatar Jun 26 '14 06:06 bwlim

Thanks for your interest in our project.

It is possible to support Euclidean distance. Please take a look at the "score" package:

  • https://github.com/guokr/simbase/blob/master/src/main/java/com/guokr/simbase/score/

The implementation of a score function has two suites of APIs: one for dense vector sets and the other for sparse vector sets. The rest of the API consists of event hooks.

If you could implement this feature, that would be highly appreciated. Otherwise we can take it on, but it would not be done until late next week.

Thanks again.

mountain avatar Jun 26 '14 09:06 mountain

Here is a quick implementation, without verification or tests. Please check changeset 099ecf1 and help us review it. If there are no problems, I will close the issue tomorrow.

And @bwlim please give us feedback on this issue. Thanks!

mountain avatar Jun 26 '14 15:06 mountain

Supporting Manhattan distance also seems very good, thanks!

However, I couldn't fully understand the integer vector score function, because I haven't read through the Simbase code yet:

    @Override
    public float score(String srcVKey, int srcId, int[] source, int srclen, String tgtVKey, int tgtId, int[] target,
            int tgtlen) {
    

I'm still in the planning phase of a new service, so I cannot test the Simbase code right now. I don't have a working system or test data yet (this is a hobby project with my wife :D). I will test Simbase later. Sorry!

bwlim avatar Jun 30 '14 02:06 bwlim

Hi @bwlim,

The integer vector API is for sparse vectors. Sparsity is very common in high-dimensional data, and in that scenario a dense storage format is very inefficient, so we introduced a sparse storage format.

For example, with a 1024-dimensional base, the two formats look like this:

  • dense storage format: cmp1, cmp2, ..., cmp1024
  • sparse storage format: idx1, cmp1, idx2, cmp2, ... (where cmpi is a non-zero component, and idxi is the index of that component)
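
As a rough illustration of how a Euclidean distance could be computed directly on the sparse format described above, here is a minimal Java sketch. It is not the code from the changeset mentioned earlier. It assumes that `srclen` and `tgtlen` count the used `int` slots of each array (two per non-zero component), that the index/component pairs are sorted by ascending index, and that components are integers; none of these assumptions is confirmed in this thread. The class name `SparseEuclidean` and its methods are hypothetical.

```java
// Illustrative only: not the Simbase implementation.
// Layout assumed: idx1, cmp1, idx2, cmp2, ... with indices in ascending order.
final class SparseEuclidean {

    // Sum of squared component differences; missing indices count as zero.
    static long squaredDistance(int[] source, int srclen, int[] target, int tgtlen) {
        long sum = 0;
        int i = 0, j = 0;
        while (i < srclen && j < tgtlen) {
            int srcIdx = source[i];
            int tgtIdx = target[j];
            if (srcIdx == tgtIdx) {
                // Both vectors have a non-zero component at this index.
                long d = (long) source[i + 1] - target[j + 1];
                sum += d * d;
                i += 2;
                j += 2;
            } else if (srcIdx < tgtIdx) {
                // Only the source has this index; the target component is zero.
                long v = source[i + 1];
                sum += v * v;
                i += 2;
            } else {
                // Only the target has this index; the source component is zero.
                long v = target[j + 1];
                sum += v * v;
                j += 2;
            }
        }
        // Leftover components on either side are compared against zero.
        for (; i < srclen; i += 2) {
            long v = source[i + 1];
            sum += v * v;
        }
        for (; j < tgtlen; j += 2) {
            long v = target[j + 1];
            sum += v * v;
        }
        return sum;
    }

    static double distance(int[] source, int srclen, int[] target, int tgtlen) {
        return Math.sqrt(squaredDistance(source, srclen, target, tgtlen));
    }
}
```

The merge-style walk touches only the stored non-zero components, so the cost depends on the number of non-zero entries rather than on the full dimensionality of the base. If the pairs were not sorted by index, a lookup map would be needed instead.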

mountain avatar Jun 30 '14 03:06 mountain