
Improve memory footprint of on-heap indexes

Open · shollander opened this issue on Nov 01, 2017 · 5 comments

I would like to use off-heap persistence together with on-heap indexes in order to minimize memory usage while providing quick queries. The problem I am encountering is that the current architecture of on-heap indexes prevents any memory savings for this setup. This is because the on-heap indexes actually store (a reference to) the underlying object, so no memory savings are realized by using off-heap or on-disk storage.

It would be nice if only the index were stored in memory, and it would then look up the underlying data upon retrieval. This would tie in nicely with a primary-key-indexed collection (#112), where the index would only need to store a primary key lookup on-heap. It would then go to the disk storage to look up the underlying object on demand.
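
For illustration, a minimal sketch of the setup being described here, i.e. off-heap persistence combined with an ordinary on-heap index. The `Order` entity, its `ORDER_ID` / `CUSTOMER` attributes, and the class name are hypothetical, introduced only for this example:

```java
import com.googlecode.cqengine.ConcurrentIndexedCollection;
import com.googlecode.cqengine.IndexedCollection;
import com.googlecode.cqengine.attribute.SimpleAttribute;
import com.googlecode.cqengine.index.navigable.NavigableIndex;
import com.googlecode.cqengine.persistence.offheap.OffHeapPersistence;
import com.googlecode.cqengine.query.option.QueryOptions;

public class FootprintExample {

    // Hypothetical entity, used only for illustration.
    static class Order {
        final int orderId;
        final String customer;
        Order(int orderId, String customer) { this.orderId = orderId; this.customer = customer; }
    }

    static final SimpleAttribute<Order, Integer> ORDER_ID =
            new SimpleAttribute<Order, Integer>("orderId") {
                public Integer getValue(Order order, QueryOptions queryOptions) { return order.orderId; }
            };

    static final SimpleAttribute<Order, String> CUSTOMER =
            new SimpleAttribute<Order, String>("customer") {
                public String getValue(Order order, QueryOptions queryOptions) { return order.customer; }
            };

    public static void main(String[] args) {
        // The objects themselves are serialized off-heap, keyed on the primary key attribute...
        IndexedCollection<Order> orders = new ConcurrentIndexedCollection<>(
                OffHeapPersistence.onPrimaryKey(ORDER_ID));

        // ...but this on-heap index stores references to the deserialized Order objects,
        // so every indexed Order is also retained on the Java heap - the overhead
        // described in this issue.
        orders.addIndex(NavigableIndex.onAttribute(CUSTOMER));
    }
}
```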

shollander commented on Nov 01, 2017

There are 3 types of indexes in this respect:

  • disk
  • off-heap (these are stored in the process' memory, i.e. in RAM, but external to the Java heap)
  • on-heap (these are stored on the Java heap)

The disk indexes and the off-heap indexes both actually do the kind of lookup you described. They do not store object references. They work by storing the primary key of the indexed objects inside the index instead. So accessing objects via these indexes involves a 2-step lookup: access the indexes to find the PK of the object, then fetch the actual serialized object from a different (clustered) index on the PK.

On the other hand, the on-heap indexes store object references directly (as you described), so accessing objects via these indexes is a 1-step lookup. If you combine on-heap indexes with off-heap or disk indexes, you can run into the issue you observed: objects are retained on the heap by the on-heap indexes.

However, one of the main benefits of on-heap indexes is precisely that they support these 1-step lookups. This is kinda the killer feature of on-heap indexes: objects can be located in fewer hops, AND the objects are already stored in deserialized form on the heap. So they provide fast query performance.

So maybe you could consider the following options instead?

  • Try enabling the "index" merge strategy. The overhead of deserialization can often be a bottleneck when working with disk or off-heap indexes or persistence. If you enable the "index" merge strategy, CQEngine can avoid deserializing "candidate" objects during query processing - provided you have indexes on all of the fields mentioned in your queries.
  • If you want to locate indexes in RAM (for performance) AND you want to avoid storing objects on-heap, you should be able to combine disk persistence with off-heap indexes. Using off-heap indexes will not cause your objects to be retained in RAM (only the PKs would be retained in RAM). Effectively you could store your objects on disk, but your indexes in RAM if you wish. (A rough sketch of this combination follows below.)
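
For concreteness, a rough sketch of both options combined: disk persistence for the objects, an off-heap index, and the "index" merge strategy enabled at query time. The generic helper method and its parameters are hypothetical, meant to be called with attributes like the `ORDER_ID` / `CUSTOMER` ones sketched in the opening comment:

```java
import static com.googlecode.cqengine.query.QueryFactory.enableFlags;
import static com.googlecode.cqengine.query.QueryFactory.equal;
import static com.googlecode.cqengine.query.QueryFactory.queryOptions;

import java.io.File;

import com.googlecode.cqengine.ConcurrentIndexedCollection;
import com.googlecode.cqengine.IndexedCollection;
import com.googlecode.cqengine.attribute.SimpleAttribute;
import com.googlecode.cqengine.index.offheap.OffHeapIndex;
import com.googlecode.cqengine.persistence.disk.DiskPersistence;
import com.googlecode.cqengine.query.option.EngineFlags;
import com.googlecode.cqengine.resultset.ResultSet;

public class DiskObjectsOffHeapIndexExample {

    // Generic over the entity type O: pass in the (hypothetical) primary key attribute
    // and the attribute to be queried.
    static <O> void example(SimpleAttribute<O, Integer> primaryKey,
                            SimpleAttribute<O, String> queriedField,
                            String valueToFind) {

        // Objects are serialized to disk, keyed on the primary key attribute.
        IndexedCollection<O> collection = new ConcurrentIndexedCollection<>(
                DiskPersistence.onPrimaryKeyInFile(primaryKey, new File("data.dat")));

        // This index lives in RAM but outside the Java heap, and stores primary keys
        // rather than object references, so indexed objects are not retained on the heap.
        collection.addIndex(OffHeapIndex.onAttribute(queriedField));

        // The "index" merge strategy asks the engine to answer the query from indexes
        // where possible, avoiding deserialization of candidate objects - provided
        // there are indexes on all fields mentioned in the query.
        try (ResultSet<O> results = collection.retrieve(
                equal(queriedField, valueToFind),
                queryOptions(enableFlags(EngineFlags.PREFER_INDEX_MERGE_STRATEGY)))) {
            System.out.println(results.size());
        }
    }
}
```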

Let me know how you get on with this. You might need to experiment with these options to find the ones that give the best performance.

npgall commented on Nov 14, 2017

I need to keep indexes in RAM for performance, but want to avoid storing the objects themselves in RAM to reduce the memory footprint. We deal with millions of objects and keeping them all in memory is not practical.

The second option that you gave (using off-heap indexes together with disk storage) would be the best fit. (Although I would prefer on-heap to off-heap, to let Java manage the memory.) Unfortunately, this does not work for us, since we need to use string starts-with (SW) and string ends-with (EW) indexes. While the off-heap index does provide a SW index, it does not support EW queries.
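
For context, a small sketch of the index choices this implies. The helper methods and their parameters are hypothetical; the point is that startsWith() can be served by an off-heap index (per the comment above), while endsWith() currently falls back to an on-heap index such as ReversedRadixTreeIndex, which retains object references on the heap:

```java
import static com.googlecode.cqengine.query.QueryFactory.endsWith;
import static com.googlecode.cqengine.query.QueryFactory.startsWith;

import com.googlecode.cqengine.IndexedCollection;
import com.googlecode.cqengine.attribute.Attribute;
import com.googlecode.cqengine.index.offheap.OffHeapIndex;
import com.googlecode.cqengine.index.radixreversed.ReversedRadixTreeIndex;
import com.googlecode.cqengine.query.Query;

public class StringIndexChoices {

    // One-time setup, assuming the collection is configured with disk or off-heap
    // persistence keyed on a primary key, as discussed in the comments above.
    static <O> void addStringIndexes(IndexedCollection<O> collection, Attribute<O, String> attr) {
        // Accelerates startsWith() without holding objects on the heap.
        collection.addIndex(OffHeapIndex.onAttribute(attr));
        // Needed for endsWith(), but this index is on-heap and stores object references,
        // so the indexed objects stay on the Java heap - the trade-off discussed here.
        collection.addIndex(ReversedRadixTreeIndex.onAttribute(attr));
    }

    static <O> Query<O> prefixQuery(Attribute<O, String> attr, String prefix) {
        return startsWith(attr, prefix);
    }

    static <O> Query<O> suffixQuery(Attribute<O, String> attr, String suffix) {
        return endsWith(attr, suffix);
    }
}
```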

I understand that you gain efficiency with the one-step lookup of on-heap indexes. However, this benefit is not always worth the cost. Perhaps there could be an option for on-heap indexes to store only the primary key as well.

Other than this issue I have been very happy with CQEngine. It has solved our problem of providing fast queries for our data and greatly improved the performance of our application.

As a workaround for this issue, we are currently using a custom approach: we create our own key-value object where the key is the indexed field value and the value is the primary key. Our queries are therefore a manual two-step process. We first look up the primary key using the indexed collection (which resides in memory), then take that primary key and look up the actual object in our on-disk storage. For disk storage we are using H2 MVStore with a Kryo serializer, which I have found to be much faster than CQEngine's SQLite disk storage solution.

Our custom solution does limit some of our queries, since we have to look up each indexed field separately. That's why I was hoping CQEngine could natively support something like this.
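
To make the workaround concrete, here is a rough sketch of that two-step pattern. The entity, attribute, and file names are hypothetical, and the Kryo serialization itself is elided (a Kryo-backed MVStore serializer is sketched further down the thread); this is not the poster's actual code:

```java
import static com.googlecode.cqengine.query.QueryFactory.endsWith;

import com.googlecode.cqengine.ConcurrentIndexedCollection;
import com.googlecode.cqengine.IndexedCollection;
import com.googlecode.cqengine.attribute.SimpleAttribute;
import com.googlecode.cqengine.index.radixreversed.ReversedRadixTreeIndex;
import com.googlecode.cqengine.query.option.QueryOptions;
import com.googlecode.cqengine.resultset.ResultSet;
import org.h2.mvstore.MVMap;
import org.h2.mvstore.MVStore;

public class TwoStepLookupSketch {

    // Hypothetical on-heap record: the indexed field value plus the primary key it maps to.
    static class EmailKey {
        final String email;
        final long customerId;
        EmailKey(String email, long customerId) { this.email = email; this.customerId = customerId; }
    }

    static final SimpleAttribute<EmailKey, String> EMAIL =
            new SimpleAttribute<EmailKey, String>("email") {
                public String getValue(EmailKey key, QueryOptions queryOptions) { return key.email; }
            };

    public static void main(String[] args) throws Exception {
        // Step 1 structure: a purely on-heap collection of lightweight key records,
        // with an index that supports endsWith() queries.
        IndexedCollection<EmailKey> keys = new ConcurrentIndexedCollection<>();
        keys.addIndex(ReversedRadixTreeIndex.onAttribute(EMAIL));
        keys.add(new EmailKey("alice@example.com", 42L));

        // Step 2 structure: the full objects live on disk in an MVStore map keyed by
        // primary key (values would be Kryo-serialized bytes; serialization elided here).
        try (MVStore store = new MVStore.Builder().fileName("customers.mv.db").open()) {
            MVMap<Long, byte[]> customersByPk = store.openMap("customers");
            customersByPk.put(42L, new byte[0]); // placeholder for the serialized customer

            // Query: resolve primary keys on-heap first, then fetch the serialized
            // objects from disk by primary key.
            try (ResultSet<EmailKey> hits = keys.retrieve(endsWith(EMAIL, "example.com"))) {
                for (EmailKey hit : hits) {
                    byte[] serialized = customersByPk.get(hit.customerId);
                    System.out.println(hit.customerId + " -> " + serialized.length + " bytes");
                }
            }
        }
    }
}
```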

shollander commented on Nov 15, 2017

That's a nice workaround actually.

However, I'd like to get to the bottom of why you see better performance retrieving the serialized objects from H2 MVStore than from CQEngine/SQLite. Both solutions use Kryo for serialization.

I'll keep this issue open as a reminder to run some additional benchmarks on CQEngine's disk persistence with SQLite, to see if there's a bottleneck.

npgall commented on Nov 16, 2017

Yes, using H2 MVStore was definitely significantly faster. Note that MVStore does not offer a built-in Kryo serializer; it has its own custom serializer implementation, which I found to be too slow, so I implemented my own Kryo serializer using MVStore's serializer interface.
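
For anyone looking for a starting point, a rough sketch of what such a serializer could look like, written against the DataType interface of the H2 1.4.x MVStore that was current at the time. The class name and details are illustrative, not the poster's actual implementation:

```java
import java.nio.ByteBuffer;

import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;

import org.h2.mvstore.DataUtils;
import org.h2.mvstore.WriteBuffer;
import org.h2.mvstore.type.DataType;

// Rough sketch of a Kryo-backed value serializer for MVStore (H2 1.4.x DataType interface).
public class KryoDataType implements DataType {

    // Kryo instances are not thread-safe, so keep one per thread.
    private final ThreadLocal<Kryo> kryoPerThread = ThreadLocal.withInitial(() -> {
        Kryo kryo = new Kryo();
        kryo.setRegistrationRequired(false); // accept arbitrary classes in this sketch
        return kryo;
    });

    @Override
    public int compare(Object a, Object b) {
        // Only needed when the type is used for keys; this sketch is for values.
        throw new UnsupportedOperationException("value serializer only");
    }

    @Override
    public int getMemory(Object obj) {
        // Rough per-entry estimate, used by MVStore for cache-size accounting.
        return 128;
    }

    @Override
    public void write(WriteBuffer buff, Object obj) {
        // Serialize with Kryo into a growable buffer, then length-prefix the bytes.
        Output output = new Output(1024, -1);
        kryoPerThread.get().writeClassAndObject(output, obj);
        byte[] bytes = output.toBytes();
        buff.putVarInt(bytes.length).put(bytes);
    }

    @Override
    public void write(WriteBuffer buff, Object[] obj, int len, boolean key) {
        for (int i = 0; i < len; i++) {
            write(buff, obj[i]);
        }
    }

    @Override
    public Object read(ByteBuffer buff) {
        // Read the length prefix, then hand the raw bytes to Kryo.
        int length = DataUtils.readVarInt(buff);
        byte[] bytes = new byte[length];
        buff.get(bytes);
        return kryoPerThread.get().readClassAndObject(new Input(bytes));
    }

    @Override
    public void read(ByteBuffer buff, Object[] obj, int len, boolean key) {
        for (int i = 0; i < len; i++) {
            obj[i] = read(buff);
        }
    }
}
```

It could then be plugged in when opening a map, e.g. with something like `store.openMap("orders", new MVMap.Builder<Long, Order>().valueType(new KryoDataType()))`.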

shollander commented on Nov 17, 2017

@shollander - So the combination of your Kryo serializer and MVStore was the best-performing solution for you? Can you provide code snippets?

archenroot commented on Dec 27, 2018