Is Euclidean distance not supported?
I'm very happy to see an open source vector database! Simbase is great for me, thanks :D
I have a question (or maybe a new feature request). The supported similarity (score) functions are "cosinesq" and "jensenshannon". The cosine similarity function does not take vector magnitude into account, but in my application vector magnitude is meaningful for similar-vector search. I would like a similarity function based on Euclidean distance to be supported as well :D Could you give me some guidance? Thanks for your great vector DB :D
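To illustrate the point about magnitude, here is a minimal standalone sketch. It is not simbase code: the class `MagnitudeDemo` and its methods are hypothetical names made up for this example. It compares two vectors that point in the same direction but differ in length by a factor of 10.

```java
// Hypothetical demo class, not part of simbase.
class MagnitudeDemo {

    // Cosine similarity: depends only on the angle between a and b.
    public static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Euclidean distance: grows with the difference in magnitude.
    public static double euclidean(float[] a, float[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f};
        float[] b = {10f, 20f, 30f}; // same direction as a, 10x the length
        System.out.println("cosine    = " + cosine(a, b));
        System.out.println("euclidean = " + euclidean(a, b));
    }
}
```

The cosine similarity of the two vectors is 1 up to rounding, so a cosine-based score treats them as identical, while the Euclidean distance is clearly non-zero.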
Thanks for your interest in our project.
It is possible to support Euclidean distance. Please take a look at the "score" package:
- https://github.com/guokr/simbase/blob/master/src/main/java/com/guokr/simbase/score/
There are two suites of APIs in the implementation of a score function: one for dense vector sets and one for sparse vector sets. The rest of the API consists of event hooks.
If you could implement this feature yourself, that would be very welcome. Otherwise we can take it on, but it would not be ready until late next week.
Thanks again.
Here is a quick implementation, without verification or tests. Please check changeset 099ecf1 and help us review it. If there are no problems, I will close the issue tomorrow.
And @bwlim please give us feedback on this issue. Thanks!
Supporting Manhattan distance also seems very good, thanks!
However, I couldn't fully understand the integer vector score function, because I haven't read and understood the simbase code in full:

    @Override
    public float score(String srcVKey, int srcId, int[] source, int srclen,
                       String tgtVKey, int tgtId, int[] target, int tgtlen) {

I'm just in the phase of planning a new service, so I cannot test the simbase code right now. I don't have a working system or test data yet (this is a hobby project with my wife :D). I will test simbase later. Sorry!
Hi, @bwlim ,
The integer vector API is for sparse vectors. Sparsity is very common in high-dimensional data. In that scenario the dense storage format is very inefficient, so we introduced a sparse storage format.
For example, with a 1024-dimensional base, the two formats look like this:
- dense storage format: cmp1, cmp2, ..., cmp1024
- sparse storage format: idx1, cmp1, idx2, cmp2, ... (where cmpi is a non-zero component and idxi is the index of that component)
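
The two formats above can be sketched as follows. This is only an illustration, not the code from the changeset: the class `SparseFormatDemo` and its methods are hypothetical. It also relies on assumptions not confirmed in this thread: indices are 0-based, pairs are sorted by ascending index, and `srclen` / `tgtlen` count the used `int` slots (two per non-zero component).

```java
import java.util.Arrays;

// Hypothetical demo class, not part of simbase.
class SparseFormatDemo {

    // Dense -> sparse: keep only non-zero components as (index, component) pairs.
    // Assumption: 0-based indices, emitted in ascending order.
    public static int[] toSparse(int[] dense) {
        int nonZero = 0;
        for (int c : dense) {
            if (c != 0) nonZero++;
        }
        int[] sparse = new int[nonZero * 2];
        int pos = 0;
        for (int i = 0; i < dense.length; i++) {
            if (dense[i] != 0) {
                sparse[pos++] = i;        // idx
                sparse[pos++] = dense[i]; // cmp
            }
        }
        return sparse;
    }

    // Euclidean distance computed directly on two sparse arrays by walking
    // both index lists in step. Assumption: srclen and tgtlen are the number
    // of used int slots in each array.
    public static double euclidean(int[] source, int srclen, int[] target, int tgtlen) {
        long sum = 0;
        int i = 0, j = 0;
        while (i < srclen && j < tgtlen) {
            int srcIdx = source[i], tgtIdx = target[j];
            if (srcIdx == tgtIdx) {           // both vectors have this component
                long d = (long) source[i + 1] - target[j + 1];
                sum += d * d;
                i += 2;
                j += 2;
            } else if (srcIdx < tgtIdx) {     // only source has it, target is 0
                long c = source[i + 1];
                sum += c * c;
                i += 2;
            } else {                          // only target has it, source is 0
                long c = target[j + 1];
                sum += c * c;
                j += 2;
            }
        }
        for (; i < srclen; i += 2) {          // leftover source components
            long c = source[i + 1];
            sum += c * c;
        }
        for (; j < tgtlen; j += 2) {          // leftover target components
            long c = target[j + 1];
            sum += c * c;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        int[] a = toSparse(new int[] {0, 3, 0, 0, 4});
        int[] b = toSparse(new int[] {1, 3, 0, 0, 0});
        System.out.println(Arrays.toString(a));
        System.out.println(Arrays.toString(b));
        System.out.println(euclidean(a, a.length, b, b.length));
    }
}
```

Only the non-zero components are visited, so the cost depends on the number of stored pairs rather than on the full dimension of the base.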