Cost-based index selection
Fixes #3234
This PR introduces a cost-based index selection algorithm that replaces the previous index selection strategies.
I intentionally did not squash my commits in order to make the review process easier. A significant part of the diff consists of methods being moved to other classes and classes being moved to other packages. I tried to isolate these into separate commits wherever possible so you can focus on the functional differences.
At the heart of the PR are two classes, IndexSelectivityEstimator and CostBasedIndexSelector. While the latter is an adapted version of the approximate index selection algorithm we featured earlier, the former is in charge of providing the actual selectivity estimations for given Conditions. Although its estimation capabilities are very limited as of now, it essentially defines an interface that a sophisticated estimation process (using histograms, see #2096) would have to implement.
The estimated selectivity of a condition is always represented by a value 0% ≤ x ≤ 100%, where 100% means that the condition matches all elements in the index. Starting here, we can provide some baseline estimations, for example:
- an EQ condition on a unique index has a selectivity of 0%, because either 1 or 0 results are returned, no matter how large the index is.
- therefore, a NEQ condition on the same index has a selectivity of 100%.
- an EQ condition on a non-unique index is hard to reason about. In the literature, a default selectivity of 10% is often used.
- for conditions like LEQ, the best guess we can take is 50%.

For composite conditions, we can leverage the baseline estimations of all components and perform union and intersection operations, each assuming statistical independence. Since we do not have any insights into the data distribution, this is currently the best we can do.
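A minimal sketch of how such baseline estimations combine under the independence assumption (class and method names here are illustrative, not the actual PR API):

```java
// Illustrative sketch, not the actual PR API: combining baseline
// selectivities under the assumption of statistical independence.
public final class SelectivityMath {

    // AND of two independent conditions: intersection of the match sets.
    static double and(double a, double b) {
        return a * b;
    }

    // OR of two independent conditions: inclusion-exclusion on the match sets.
    static double or(double a, double b) {
        return a + b - a * b;
    }

    public static void main(String[] args) {
        double eqOnNonUnique = 0.10; // default guess for EQ on a non-unique index
        double leq = 0.50;           // best guess for range conditions like LEQ

        // key1 == v1 AND key2 <= v2 -> 0.10 * 0.50 = 0.05 (5% of the index)
        System.out.println(and(eqOnNonUnique, leq));

        // key1 == v1 OR key2 <= v2 -> 0.10 + 0.50 - 0.05 = 0.55
        System.out.println(or(eqOnNonUnique, leq));
    }
}
```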
In the end, the CostBasedIndexSelector compiles a group of indexes based on the total cost of execution, which includes the estimated execution times for:
- the index lookup,
- the subsequent in-memory filtering of all conditions which were not answered by an index query,
- the final ordering, if not supported by the first queried index.

As with the selectivity estimation, these estimations have to be taken with a grain of salt, especially because ordering takes a non-linear amount of time and our only indicator of the result size is measured in percent, not in total elements.
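To illustrate how these three terms could be combined, here is a rough sketch (made-up names and formulas, not the PR's actual classes; in particular, the ordering term is only a stand-in for some super-linear function of the unknown result size):

```java
// Illustrative cost model, not the PR's actual implementation.
// selectivity is the estimated fraction of the index matched by the query.
public final class CostSketch {

    static double totalCost(double indexLookupCost,
                            double selectivity,
                            int unansweredConditions,
                            boolean orderCoveredByIndex) {
        // In-memory filtering: one check per remaining condition on each result.
        double filterCost = selectivity * unansweredConditions;

        // Ordering is super-linear (~ n log n), but since the result size is
        // only known as a fraction, any concrete formula is a guess.
        double orderCost = orderCoveredByIndex
                ? 0.0
                : selectivity * Math.log1p(selectivity);

        return indexLookupCost + filterCost + orderCost;
    }
}
```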
As can be seen in the tests, the choices taken by the CostBasedIndexSelector are always reasonable and beat our original index selection in cases which resemble the example from #3234.
Thank you for contributing to JanusGraph!
In order to streamline the review of the contribution we ask you to ensure the following steps have been taken:
For all changes:
- [x] Is there an issue associated with this PR? Is it referenced in the commit message?
- [x] Does your PR body contain #xyz where xyz is the issue number you are trying to resolve?
- [x] Has your PR been rebased against the latest commit within the target branch (typically master)?
- [ ] Is your initial contribution a single, squashed commit?
For code changes:
- [x] Have you written and/or updated unit tests to verify your changes?
- [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
- [ ] If applicable, have you updated the LICENSE.txt file, including the main LICENSE.txt file in the root of this repository?
- [ ] If applicable, have you updated the NOTICE.txt file, including the main NOTICE.txt file found in the root of this repository?
For documentation related changes:
- [x] Have you ensured that format looks appropriate for the output in which it is rendered?
@rngcntr Thank you for the great work! I didn't have a chance to review your code yet but wanted to ask several questions.
I see that the methods in IndexSelectivityEstimator.java are static, meaning the user can't override the estimator's logic. Do you know what would be the best approach to override the index selection process or modify the estimator logic?
It's not a blocker for us, but I noticed that we at Mapped use very similar logic to the one provided in IndexSelectivityEstimator, except that instead of relying only on operations (EQ, IN and others) we also rely on a user-specified "precision score" for the properties used. For example, we expect the maximum precision score for any unique property, a smaller but still very high precision score for properties like "name" / "serialNumber", a low precision score for properties like "type", and the lowest precision score for any other property.
Based on the properties' precision scores + operations, we calculate selectivity for different possible filtration combinations (using a simple DFS) and select a single winner.
I just wonder if there is a chance to make something similar in JanusGraph so that we don't keep that selection logic separately. Maybe something like assigning "precision" or "selectivity" to properties when they are created and then reusing that information. Probably it's another topic, but I decided to leave it here in case anyone has some thoughts about it.
Hi @porunov! What you describe as "precision score" is an estimation for the selectivity of a property. In my opening comment, I essentially refer to the same thing when talking about histograms. To me, it sounds like two approaches to solve the same problem: We would like to know how selective a given property is. One way to get that information is to keep track of the index using histograms, but as a neat shortcut, we could also rely on the user to provide this information. Hence, I consider your suggestion a step towards a definitive solution and I will consider implementing user provided insights.
Your other comment is also concerned with user configuration. I implemented the methods to be static in this first draft because I considered them so fundamentally applicable that I didn't see the need to override them. However, it shouldn't be too much of an effort to extract the logic into an interface and a hot-swappable implementation.
> One way to get that information is to keep track of the index using histograms, but as a neat shortcut, we could also rely on the user to provide this information. Hence, I consider your suggestion a step towards a definitive solution and I will consider implementing user provided insights.
Thank you! I think user-provided selectivity would be a good feature because users will be able to estimate their selectivity using any technique they want. Histograms are a great approach, but I assume they require more effort to be implemented.
It's fascinating that you started the first step toward a real non-trivial query optimizer! I haven't looked into the code yet, but I think I got a rough idea. One quick question though:
Which index(es) will this new optimizer pick up, when we have has(key1, value1).has(key2) query and there is a composite index for key1 and a mixed index for key2 which resembles the example in your original issue (https://github.com/JanusGraph/janusgraph/issues/3234)? Does it always select the composite index only?
@li-boxuan
> Which index(es) will this new optimizer pick up, when we have `has(key1, value1).has(key2)` query and there is a composite index for `key1` and a mixed index for `key2` which resembles the example in your original issue (#3234)? Does it always select the composite index only?
Yes. The reasoning is: Without any prior knowledge, there are fewer results expected to match key1 == value1 than there are for key2 != null. The assumption that I made here (and that is open to debate) is that it will be faster to load the few matching records from the first condition and check the second condition on each one manually than it is to load the possibly very long list of matches from the second condition and build an intersection.
I am very aware that there exist cases where the latter choice would be more efficient. However, as long as there is no way for the optimizer to distinguish such cases, the rule of thumb is to go with the execution that is expected to perform better on average.
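To put rough numbers on that trade-off (all values made up for illustration): on a data set of 1,000,000 elements, suppose key1 == value1 matches 0.1% (1,000 elements) and key2 != null matches 50% (500,000 elements). Loading 1,000 records and checking the second condition on each costs on the order of a few thousand operations, whereas loading both result sets to intersect them costs on the order of 500,000, so the first plan wins unless the per-record check is extremely expensive.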
@porunov I did not have the time to apply your suggestions just yet, especially as I am currently more concerned with the index lifecycle issues. But I did not forget about it for sure ;)
> Without any prior knowledge, there are fewer results expected to match key1 == value1 than there are for key2 != null. The assumption that I made here (and that is open to debate) is that it will be faster to load the few matching records from the first condition and check the second condition on each one manually than it is to load the possibly very long list of matches from the second condition and build an intersection.
That makes sense. The only concern I have is: right now, the default behavior is to join the results from both indices, but if the user knows that they only need the first index, they can rewrite their query as g.V().has("key1", "value1").identity().has("key2") so that JanusGraph will only pick up the first index. With this PR, however, even if users would like to explicitly use two indices, there is no way to achieve that, no?
A follow-up question of mine is, what's the current default behavior of g.V().has("key1", "val1").has("key2", "val2") where each key hits an index respectively?
In your example, the new algorithm would take the index that is unique (if any) or just choose one of them otherwise.
Regarding your other concern, I have to admit that I haven't thought about the case that users would want to query two indexes explicitly. Have you tested your example with identity()? I tried to run it the way you proposed, but both indexes are still queried according to profile().
EDIT: The easiest way I found to avoid using an index is to use limit(Integer.MAX_VALUE) instead of identity().
> I tried to run it the way you proposed, but both indexes are still queried according to profile().
You need to run it with g.withoutStrategies(IdentityRemovalStrategy).
@li-boxuan Do you have any suggestions on how to force the usage of more than one index from a user perspective? I'm thinking of something like a ForceUseAllIndicesStrategy which is disabled by default and can be enabled for certain queries. However, a downside would be that the behavior applies to the entire query.
@rngcntr ForceUseAllIndicesStrategy sounds reasonable. Alternatively, is it possible for users to plug in their own index selector (like the old naive one)? I don't know which is easier to implement and would make more sense to users.
Forcing users to write their own index selector just to fix the performance of a single query sounds unreasonable to me. Therefore, I'd opt for the strategy and I also think that this one would be quite easy to implement.
I removed the milestone because at the moment, I don't have enough time to work on this. I'm looking forward to getting back to this PR, but for now I don't want it to block the v1.0 release.
@rngcntr I happened to think of this PR today again, and now I have a very different view. To recap, we wanted to introduce the cost-based index selection, and at the same time, we didn't like to introduce performance regression.
I think the best way of doing so is to let users provide a query hint. Amazon Neptune, for example, uses g.withSideEffect(hint, value) for query hint. So if the user wants to force the optimizer to use a certain index, they could do something like this:
g.withSideEffect("forceIndex", ["indexForKey1"]).V().has("key1", "val1").has("key2", "val2")
g.withSideEffect("forceIndex", ["indexForKey1", "indexForKey2"]).V().has("key1", "val1").has("key2", "val2")
This would be more user-friendly and probably also easier to implement than introducing a ForceUseAllIndicesStrategy. Thoughts?
Hi @li-boxuan!
That sounds like a nice and simple approach to me because side effects support the intention of using this feature on a per-query basis instead of potentially configuring a strategy globally and influencing the performance of all other queries as well. So far, I have not implemented any side-effect features in JanusGraph, but I can imagine it being rather easy.
Instead of supplying a configuration via withSideEffect, I think we can use with to supply the configuration to the traversal source instead of the traversal. I can't tell if this is as easy to implement just yet, but it looks cleaner not to carry the side effect through every single traverser.
Yeah there's some nuance between with() and withSideEffect() which I am not familiar with. Not sure why Neptune uses withSideEffect() rather than with() but I agree with() sounds more intuitive.
@li-boxuan I finally found some time to work on this feature again. In addition to forcing the usage of a specified index, I also added the option to suppress the usage of certain indexes. As you already imagined, the usage is straightforward:
g.with("hint.index-selection.suppress-index-usage", "myPoorlyPerformingIndex").V()...
g.with("hint.index-selection.force-index-usage", "myAbsolutelyAwesomeIndex").V()...
I think the classes introduced for TraversalHints will come in handy once hints become used more widely. However, I hesitated to dive into auto-generating docs before having heard some opinions.
Thank you @rngcntr for this great feature. I wonder if you had a chance to think about the user-provided selectivity approach and, if so, do you envision it being provided via the hints mechanism?
As an example, I could think about our use-case at Mapped.
We have multiple different indices touching the same properties, but we want to use the most selective one only.
For example, we have properties: String id, String name, String type. Now we have 4 composite indices:
1. on `id` only.
2. on `name` only.
3. on `name` & `type`.
4. on `type` only.
id is 100% selective (returns 0 or 1 elements).
name is 80% selective (usually returns a small amount of elements).
type is 10% selective (usually returns many elements).
For a query g.V().has("name", "foo").has("id", "123").has("type", "bar") we prefer using only the index 1. without using other two indices.
For a query g.V().has("type", "foo").has("id", "123") we would like to use index 1. again.
For a query g.V().has("type", "foo").has("name", "foo") we would like to use index 3..
Currently we have custom logic which knows about property selectivity and basically re-arranges the Gremlin steps in order from the highest-selectivity property to the lowest-selectivity property and attaches .limit(Integer.MAX_VALUE) at the place where it guarantees that only a single index will be executed for this query. The approach works for us, but it's definitely an ugly hack and a native JanusGraph approach would be much more preferable.
I.e., our logic basically changes the above queries into:
g.V().has("id", "123").limit(Integer.MAX_VALUE).has("name", "foo").has("type", "bar")
g.V().has("id", "123").limit(Integer.MAX_VALUE).has("type", "bar")
g.V().has("name", "foo").limit(Integer.MAX_VALUE).has("type", "bar")
I didn't review this PR yet, but was wondering if it would be possible to somehow reuse either this hints mechanism or some other mechanism to achieve similar behavior without using that logic (rearranging steps and adding limit into the query).
> TraversalHints will come in handy once hints become used more widely

> was wondering if it would be possible to somehow reuse either this hints mechanism
That's a very neat use-case for hints and the effort to implement it looks reasonable. The only thing I'll have to figure out is what would be the cleanest way to provide the tuple (indexName, providedSelectivity) as a hint compared to just one value.
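One hypothetical shape for such a hint (the key name here is invented for illustration; only the suppress/force keys above actually exist in this PR) would be to encode the pair in the hint value:

g.with("hint.index-selection.selectivity", "myIndex=0.1").V()...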
Another idea that crossed my mind: How about making every hint also a config option such that hints can also be provided on an instance level instead of the traversal level? Ultimately, we could move towards an implementation that automatically accepts MASKABLE configuration options as hints to apply e.g. USE_MULTIQUERY on a traversal level as well. That, however, is non-trivial for now, since there are lots of MASKABLE configuration options that go way beyond what can be altered for single traversals, such as DB_CACHE_SIZE.
@porunov One more remark: I'm used to referring to selectivity the other way around, such that 10% selective means an index query will return 10% of the entire data set on average. In your wording, that would be 90% selective. I tend towards calling that 10% selectivity in the hints because this notation is used for internal calculations anyway. For two independent index queries with selectivities 10% and 20%, it is easy to estimate that the overall result size will be 10% * 20% = 2% of the data set. If you know any arguments why the other notation would be preferable, that's open for discussion.
> The only thing I'll have to figure out is what would be the cleanest way to provide the tuple (indexName, providedSelectivity) as a hint compared to just one value.
Yeah, but notice I was referring to properties and not to indices because if we refer to indices it will be applicable to composite indices only. We can’t define selectivity of a mixed index because different combinations of properties may be used for a mixed index.
> Another idea that crossed my mind: How about making every hint also a config option such that hints can also be provided on an instance level instead of the traversal level? Ultimately, we could move towards an implementation that automatically accepts MASKABLE configuration options as hints to apply e.g. USE_MULTIQUERY on a traversal level as well. That, however, is non-trivial for now, since there are lots of MASKABLE configuration options that go way beyond what can be altered for single traversals, such as DB_CACHE_SIZE.
This is a good approach, I think. We could potentially use LOCAL configuration options for hints. I think that providing selectivity for each traversal could be cumbersome in cases where the selectivity of a property doesn't change (in case we have too many properties), but on the other hand it's a handy solution in case the selectivity of a property may change from traversal to traversal. I think providing both options is a good solution: in case the selectivity of a property doesn't change from traversal to traversal, the user will prefer the config option; otherwise, the user will prefer providing selectivity for each traversal.
> @porunov One more remark: I'm used to referring to selectivity the other way around, such that 10% selective means an index query will return 10% of the entire data set on average. In your wording, that would be 90% selective. I tend towards calling that 10% selectivity in the hints because this notation is used for internal calculations anyway. For two independent index queries with selectivities 10% and 20%, it is easy to estimate that the overall result size will be 10% * 20% = 2% of the data set. If you know any arguments why the other notation would be preferable, that's open for discussion.
I totally agree with your selectivity definition. I used to use "precision" for properties (where 100% means 0% selectivity) and accidentally referred to selectivity as precision, but it should be the other way around, as you mentioned. Moreover, I know that "selectivity" is the more popular term, thus I also think using the selectivity approach is better than precision.
> How about making every hint also a config option such that hints can also be provided on an instance level instead of the traversal level?
I am not sure what that's gonna look like. On the instance level, I suppose users could say "distinct value of my index is few/many", or "the data distribution is very skewed", but regarding selectivity, it could only be meaningful on the query level, e.g. "selectivity of predicate (x == 1) is 0.1".
> 10% selective means an index query will return 10% of the entire data set on average
That's correct. I personally don't like that jargon because people could easily think it the other way around. That being said, it's widely used in textbooks, academics, and industry, so I guess we should stick to it.
> Yeah, but notice I was referring to properties and not to indices because if we refer to indices it will be applicable to composite indices only. We can’t define selectivity of a mixed index because different combinations of properties may be used for a mixed index.
Agreed. But even properties alone would not be enough, as the selectivity also depends on the predicate.
> I am not sure what that's gonna look like. On the instance level, I suppose users could say "distinct value of my index is few/many", or "the data distribution is very skewed", but regarding selectivity, it could only be meaningful on the query level, e.g. "selectivity of predicate (x == 1) is 0.1".
A use case for instance level configuration is our scenario at G DATA, which I pictured in #3234:
- We use an index which is specifically designed for one use case but has disadvantages in most other use cases.
- We deploy JanusGraph instances for specific purposes, e.g. one for automated traversals and one for manual queries by the user.
- Instance level hints would now enable us to allow the usage of an index on one instance but suppress its usage on another instance.
- By doing so, we gain the power to decide which indexes a traversal written by our customer should use without having to access the customer's source code and somehow modify the traversal to include certain hints.
> Agreed. But even properties alone would not be enough, as the selectivity also depends on the predicate.
It's true, but I suspect predicate selectivities can be calculated (i.e. they don't necessarily need to be provided by users).
For example, .in(1,2,3) has 3 times the selectivity of .eq(1). In case the user defines the name property's selectivity as 10%, then the selectivity of .in(1,2,3) is 30%. In case the name selectivity is 50%, then that in selectivity will be 100%, because 150% doesn't make sense.
I could be wrong, not sure. Of course, there are some predicates which are harder to grasp, like P.gt or some full-text search predicates, but I think some kind of base calculation could be applied to those predicates and we can improve them over time.
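A minimal sketch of such a base calculation (names and the predicate set are illustrative, not JanusGraph API), scaling the user-provided EQ selectivity and capping it at 100% as described above:

```java
// Illustrative sketch, not JanusGraph API: deriving a predicate's
// selectivity from a user-provided per-property EQ selectivity.
public final class PredicateSelectivity {

    static double estimate(double eqSelectivity, String predicate, int operandCount) {
        switch (predicate) {
            case "eq":
                return eqSelectivity;
            case "within": // e.g. .in(1,2,3): operandCount times the EQ selectivity
                return Math.min(1.0, eqSelectivity * operandCount);
            case "gt":     // range and full-text predicates are harder to grasp;
            case "lte":    // fall back to the 50% baseline guess for now
                return 0.5;
            default:
                return 1.0; // no information: assume everything matches
        }
    }

    public static void main(String[] args) {
        // name EQ selectivity 10% -> .in(1,2,3) estimates 30%
        System.out.println(estimate(0.10, "within", 3)); // 0.30
        // name EQ selectivity 50% -> the same IN caps at 100%
        System.out.println(estimate(0.50, "within", 3)); // 1.0
    }
}
```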
@li-boxuan I should point out that I'm currently looking into user-defined selectivities, as @porunov requested earlier. When reviewing, please keep in mind that things are subject to change.