feature_engine icon indicating copy to clipboard operation
feature_engine copied to clipboard

Features/distances

Open VascoSch92 opened this issue 2 years ago • 15 comments

Just a first sketch.

Let me know what do you think :-)

VascoSch92 avatar Sep 12 '23 15:09 VascoSch92

Is it possible to calculate Haversine distance using sklearn? It is quite fast and well optimized, reimplementing it seems like a not so good idea.

P.S. it could be quite interesting to add more measures of distance, for example Ruler distance

glevv avatar Sep 15 '23 21:09 glevv

Is it possible to calculate Haversine distance using sklearn? It is quite fast and well optimized, reimplementing it seems like a not so good idea

Yes it is possible to compute the Haversine distance with sklearn. I was also thinking to use an apply and the Haversine distance method of Sklearn.

The question is: is it faster than vectorisation?

But I'm happy to change if it faster or If there is a faster method than mine ;-)

VascoSch92 avatar Sep 15 '23 21:09 VascoSch92

Is it possible to calculate Haversine distance using sklearn? It is quite fast and well optimized, reimplementing it seems like a not so good idea

Yes of course I know that. The question is: Can you vectorise it? it is faster than vectorisation?

I'm not sure I understand the question. Scikit-learn implementation is vectorized by default

glevv avatar Sep 15 '23 21:09 glevv

Is it possible to calculate Haversine distance using sklearn? It is quite fast and well optimized, reimplementing it seems like a not so good idea

Yes of course I know that. The question is: Can you vectorise it? it is faster than vectorisation?

I'm not sure I understand the question. Scikit-learn implementation is vectorized by default

I think the issue with the sklearn implementation is that it does a cartesian product between X and Y and yields a matrix.

We only need a pairwise calculation between X and Y that yields a vector.

kylegilde avatar Sep 16 '23 15:09 kylegilde

np.diag(haversine_distances(X, Y) * R) would give you the vector you want

glevv avatar Sep 16 '23 21:09 glevv

haversine_distances

I know that it is a simple way to code this, but from a time complexity perspective, it's not a great idea to use quadratic complexity when only linear complexity is needed.

kylegilde avatar Sep 16 '23 22:09 kylegilde

haversine_distances

I know that it is a simple way to code this, but from a time complexity perspective, it's not a great idea to use quadratic complexity when only linear complexity is needed.

Yea, you are right, this way it will be better

glevv avatar Sep 17 '23 13:09 glevv

Is it possible to calculate Haversine distance using sklearn? It is quite fast and well optimized, reimplementing it seems like a not so good idea.

P.S. it could be quite interesting to add more measures of distance, for example Ruler distance

Hey @glevv thanks for the suggestion.

If I understood this blog correctly, it has 3 computations: euclidean, harvesine (the one we are trying to implement here) and a more complicated one that has a smaller error (vincenty's formula). Is this correct?

I'd suggest we stick to harvesine in this PR, and see if we create an issue to expand the class later with the Vincenty's. Is this formula commonly used? do we really need an error as small as 0.5mm for geo coordinates?

solegalli avatar Sep 18 '23 09:09 solegalli

Is it possible to calculate Haversine distance using sklearn? It is quite fast and well optimized, reimplementing it seems like a not so good idea. P.S. it could be quite interesting to add more measures of distance, for example Ruler distance

Hey @glevv thanks for the suggestion.

If I understood this blog correctly, it has 3 computations: euclidean, harvesine (the one we are trying to implement here) and a more complicated one that has a smaller error (vincenty's formula). Is this correct?

They are all measures of distance between two points on ellipsoid. There were no Vincenty formula, but it's quite heavy to compute. In this particular blog post they talked about two simpler and faster formulas (Cheap Ruler and FCC equation) but with higher error.

I'd suggest we stick to harvesine in this PR, and see if we create an issue to expand the class later with the Vincenty's. Is this formula commonly used? do we really need an error as small as 0.5mm for geo coordinates?

Ye, let's go with haversine only, not sure about Vincenty tho

glevv avatar Sep 18 '23 12:09 glevv

Hey @solegalli Sorry if I disappeared. I had a lot to do with work and life. I will try to give a look at this pull request next week ,-)

VascoSch92 avatar Feb 03 '24 17:02 VascoSch92

No Problem at all @VascoSch92 . Same here.

I am doing some big changes to the correlation transformers, I think we could release a new version when i got those finished, hopefully during February.

It would be great if we can squeeze this transformer in that release 2. If you find the time, we look forward to your contribution :)

solegalli avatar Feb 05 '24 08:02 solegalli

No Problem at all @VascoSch92 . Same here.

I am doing some big changes to the correlation transformers, I think we could release a new version when i got those finished, hopefully during February.

It would be great if we can squeeze this transformer in that release 2. If you find the time, we look forward to your contribution :)

Hey @solegalli :-) is it time to give another try to this transformer? What do you think?

VascoSch92 avatar Apr 22 '24 10:04 VascoSch92

Sure! Contributions are welcome any time :)

solegalli avatar Apr 22 '24 10:04 solegalli

ok perfect. I will work on it.

VascoSch92 avatar Apr 22 '24 10:04 VascoSch92

Hey @solegalli finally I have something.

I still need some guidance for some point:

  • From which classes should I extend? I'm extending from BaseNumericalTransformer, FitFromDictMixin and GetFeatureNamesOutMixin but I don't know if is a good idea
  • I have a fit method also if I'm not using it. Should I have it anyway or can I delete it?

VascoSch92 avatar May 09 '24 21:05 VascoSch92