janusgraph icon indicating copy to clipboard operation
janusgraph copied to clipboard

Drop optimization

Open ntisseyre opened this issue 1 year ago • 3 comments

Created a new MultiVertexQuery drop API that efficiently drops many vertices.

ntisseyre avatar Apr 29 '24 17:04 ntisseyre

Local benchmark execution:

Benchmark                        (isMultiDrop)  (verticesAmount)  Mode  Cnt    Score    Error  Units
MultiDropBenchmark.dropVertices           true              5000  avgt    5   71.137 ±  2.007  ms/op
MultiDropBenchmark.dropVertices          false              5000  avgt    5  900.047 ± 53.045  ms/op

Conclusion: locally deleting vertices via multiQuery performed >12 times faster than via Gremlin API.

porunov avatar Apr 29 '24 20:04 porunov

I would prefer adding a simple unit test also which tests the new API tx.multiQuery(vertices).drop() for better test coverage. Otherwise this PR looks good to me. After this PR is merged I think we could also overwrite TinkerPop's DropStep to leverage this new optimization in Gremlin queries without the need to using multuQuery directly (i.e. similar as we did with JanusGraphHasStep as it is also a filter step).

I'm not sure it will work the same way on individual DropStep, since the benefit of multi-query that it constructs a query once and applies to all vertices. For individual DropStep, it constructs a query every time for each individual vertex to fetch relations (system + standard), and that takes so much time.

ntisseyre avatar Apr 29 '24 21:04 ntisseyre

I would prefer adding a simple unit test also which tests the new API tx.multiQuery(vertices).drop() for better test coverage. Otherwise this PR looks good to me. After this PR is merged I think we could also overwrite TinkerPop's DropStep to leverage this new optimization in Gremlin queries without the need to using multuQuery directly (i.e. similar as we did with JanusGraphHasStep as it is also a filter step).

I'm not sure it will work the same way on individual DropStep, since the benefit of multi-query that it constructs a query once and applies to all vertices. For individual DropStep, it constructs a query every time for each individual vertex to fetch relations (system + standard), and that takes so much time.

Yes, the default DropStep processes vertices one by one, but our multiquariable custom JanusGraph steps which overwrite TinkerPop steps extend the interface with additional methods to "pre-register vertices which will be processed later". Thus, the overwritten version of the DropStep will simply execute multiQuery for a batch of vertices and the vertices which are already processed will be skipped until a new batch is registered. We use this technique for many custom steps (JanusGraphHasStep, JanusGraphPropertiesStep, JanusGraphVertexStep, etc.). After this PR is merged I can implement JanusGraphDropStep as it is quite easy to do if we already have the necessary multiQuery API and I already done that for multiple steps. So, I think it shouldn't be a problem unless I am missing something.

porunov avatar Apr 29 '24 21:04 porunov