Add IO prefetch to HNSW graph crawl?

Open · mikemccand opened this issue 6 months ago · 8 comments

In the cold case (KNN vectors don't all fit in RAM), it should help to add prefetching to our HNSW graph crawl?

When we pop the best node off the work queue, we could ask for all of its neighbor nodes' vectors to be prefetched before we loop through them evaluating their distances and inserting them into the work queue. We could maybe go further and do so for the top N nodes in the queue. Modern SSDs are typically highly concurrent (maybe 1000s of concurrent requests before they saturate?), so we should take advantage of that.
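For example, a minimal sketch of the idea (hypothetical, not current Lucene code; I'm assuming `IndexInput.prefetch(offset, length)` as the advisory hook, a flat fixed-size vector file behind `vectorData`, and the existing `RandomVectorScorer` / `NeighborQueue` shapes):

```java
import java.io.IOException;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.util.hnsw.NeighborQueue;
import org.apache.lucene.util.hnsw.RandomVectorScorer;

class PrefetchingNeighborScorer {
  // Hypothetical sketch: hint every neighbor's vector first, then score.
  static void scoreNeighbors(
      IndexInput vectorData,   // slice over the raw vector data
      int[] neighbors,         // neighbor ordinals of the just-popped node
      int count,
      long vectorByteSize,     // bytes per vector (fixed-size encoding assumed)
      RandomVectorScorer scorer,
      NeighborQueue candidates)
      throws IOException {
    // Pass 1: advisory prefetch for each neighbor's vector; on mmap this
    // should become madvise(MADV_WILLNEED), kicking off async readahead.
    for (int i = 0; i < count; i++) {
      vectorData.prefetch((long) neighbors[i] * vectorByteSize, vectorByteSize);
    }
    // Pass 2: score; ideally the pages have arrived (or are arriving) by now.
    for (int i = 0; i < count; i++) {
      candidates.add(neighbors[i], scorer.score(neighbors[i]));
    }
  }
}
```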

Lucene's prefetch API is very similar to async IO, though maybe not quite as efficient (the CPU core will sometimes sit more idle with prefetch, since Lucene isn't executing immediately as each WILL_NEED IO request finishes), but it should still be a big win for HNSW versus not prefetching?

Maybe with virtual threads we could eventually build a "real" async IO solution, but that's a bigger thing.

Or maybe we are already prefetching today in our latest KnnVectorsFormat Codec default and I missed it?

mikemccand · Oct 03 '25

So, before we had bulk scoring, I tried this and it actually harmed performance :(. But, I was doing something silly I think (https://github.com/apache/lucene/pull/13586)

It seems to me we should be able to smartly prefetch particular sets of ordinals and bulk score them for a much faster cold start for HNSW. That, combined with the bipartite ordering, might make HNSW work much better in restricted-memory environments.

One downside to how Lucene does prefetching is that the call to adjust the OS advice (madvise) is synchronous. During a hot-path query, this just isn't free.

Related to async (direct) IO: https://github.com/apache/lucene/pull/15224

benwtrent · Oct 03 '25

> So, before we had bulk scoring, I tried this and it actually harmed performance :(. But, I was doing something silly I think (#13586)

Oooh thanks for the pointer @benwtrent. Do you know what silly thing was happening in that benchy? From the PR it looks like there was confusion about the results (a common occurrence when benchmarking!! sigh ¯\_(ツ)_/¯), but no smoking gun found?

This comment from @jpountz is also interesting:

> Right now prefetch() has logic to disable itself if data is consistently already in the page cache. Thinking out loud, if your benchmark shows that prefetch() hurts more than it helps when the data is often but not always in the page cache, we could look into disabling madvise in that case too and only enable it when data is rarely already in the page cache.

At Amazon we are confused by profiling of one of our PKLookup-heavy indexing runs, which seems to show non-trivial time spent in mincore (the OS-level function that reports which pages are resident for a virtual address range), invoked from MemorySegment.isLoaded, which is (I think?) the method we use to turn off prefetch when things seem already hot enough. Unfortunately, mincore is O(N) (N = number of pages), and requires allocating a temporary byte[] array where it sets the status of each page. But I think N should generally be super tiny in how Lucene prefetches? Typically 1 or maybe 2 pages? We prefetch a terms block (ish) to look up a term; what else?
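For reference, roughly the shape of the gate I mean (a simplified sketch, not Lucene's actual code; adviseWillNeed is a hypothetical stand-in for the native madvise downcall):

```java
import java.lang.foreign.MemorySegment;

class PrefetchGate {
  // Simplified sketch of the "already hot?" gate being discussed.
  // MemorySegment.isLoaded() is what ends up calling mincore(2) on Linux:
  // O(number of pages), plus a temporary per-page status array.
  static void maybePrefetch(MemorySegment segment, long offset, long length) {
    MemorySegment window = segment.asSlice(offset, length);
    if (!window.isLoaded()) {     // <-- mincore() happens here
      adviseWillNeed(window);     // hypothetical stand-in for the native
                                  // madvise(MADV_WILLNEED) call
    }
  }

  static void adviseWillNeed(MemorySegment window) {
    // Placeholder: in real code this would be a native downcall to madvise.
  }
}
```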

We (Amazon) then re-ran with a hack to turn off prefetch and saw some gains (50 minutes faster on an 11.5-hour indexing run), not as much as we had expected given the scary profiler output showing lots of time in mincore (so maybe there be ghosts a-lurking; then again, it is async-profiler). This was just one internal datapoint; we haven't separately tried to make a repeatable benchy, etc.

Still, maybe we should somehow reduce this prefetch overhead for hot indices.

> One downside to how Lucene does prefetching is that the call to adjust the OS advice (madvise) is synchronous. During a hot-path query, this just isn't free.

I can see the cost from isLoaded() (above)... but madvise(MADV_WILLNEED) isn't supposed to do anything costly? OK, so I asked Claude to dig into the Linux kernel sources to explain what this actually does, based on the snippet it pulled out (hopefully no hallucinations!). It looks like it triggers async readahead (hmm, does that mean it's a no-op if we had previously set MADV_RANDOM?). And it drops/re-acquires locks and file references along the way (sounds like that could take non-trivial time, maybe). Also, TIL you can MADV_WILLNEED swap-backed memory (virtual address space not backed by a memory-mapped file) too.

Anyway, maybe a better long-term solution for Lucene is "real" async IO (like you are doing in #15224)? N virtual threads reading concurrently, and efficiently acting on the bytes as they arrive... even in the hot case this can be a concurrency win, since Lucene would naturally be expressing its code as data-driven concurrency, so the JVM could in theory use multiple real threads to process each byte[] concurrently when all bytes are hot. HNSW graph crawl seems embarrassingly concurrent in this sense.
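To make the shape concrete, a toy sketch (readVector and similarity are hypothetical stand-ins, not Lucene APIs):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.lucene.util.hnsw.NeighborQueue;

abstract class VirtualThreadNeighborScorer {
  // Toy sketch of the "real" async IO idea: one virtual thread per neighbor
  // read, so a blocking read parks the virtual thread and many IOs stay in
  // flight at once.
  void scoreNeighbors(float[] query, int[] neighbors, NeighborQueue candidates)
      throws Exception {
    try (ExecutorService vthreads = Executors.newVirtualThreadPerTaskExecutor()) {
      List<Future<float[]>> pending = new ArrayList<>(neighbors.length);
      for (int ord : neighbors) {
        pending.add(vthreads.submit(() -> readVector(ord))); // IO kicked off
      }
      for (int i = 0; i < neighbors.length; i++) {
        float[] vec = pending.get(i).get();   // completes as bytes arrive
        candidates.add(neighbors[i], similarity(query, vec));
      }
    }
  }

  abstract float[] readVector(int ord) throws IOException; // hypothetical blocking read
  abstract float similarity(float[] a, float[] b);         // hypothetical distance fn
}
```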

mikemccand · Oct 21 '25

I wonder if it would be better to push this into RandomVectorScorer.bulkScore? The necessary prefetch calls would happen at roughly the same point the bulkScore call happens. The scorer could then use virtual threads to perform the reads concurrently.
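Shape-wise, maybe something like this decorator (hypothetical; the bulkScore(int[] nodes, float[] scores, int count) signature here is assumed for illustration, not quoted from Lucene, and virtual-thread reads would slot into the same place):

```java
import java.io.IOException;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.util.hnsw.RandomVectorScorer;

class PrefetchingBulkScorer {
  // Hypothetical decorator: issue the prefetch hints right where bulkScore
  // happens, so the hints and the reads stay in the same layer.
  private final IndexInput vectorData;
  private final long vectorByteSize;
  private final RandomVectorScorer delegate;

  PrefetchingBulkScorer(IndexInput vectorData, long vectorByteSize,
                        RandomVectorScorer delegate) {
    this.vectorData = vectorData;
    this.vectorByteSize = vectorByteSize;
    this.delegate = delegate;
  }

  void bulkScore(int[] nodes, float[] scores, int count) throws IOException {
    for (int i = 0; i < count; i++) {   // hint everything first...
      vectorData.prefetch((long) nodes[i] * vectorByteSize, vectorByteSize);
    }
    for (int i = 0; i < count; i++) {   // ...then score
      scores[i] = delegate.score(nodes[i]);
    }
  }
}
```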

Deciding whether or not you should use virtual threads for the reads is still quite tricky. The API layering also makes it difficult for callers to indicate if they think that the vector data for graph traversal should be resident.

mccullocht · Oct 30 '25

@mccullocht I agree, the abstraction needs to be lower down. The thing doing the scoring might be low enough to know HOW it's accessing the vectors.

Virtual threads for OS madvise "will need" calls seem like too much.

But they are absolutely critical for other formats that don't use memory segments.

benwtrent · Oct 30 '25

Yeah, virtual threads for madvise are not great -- I think at that point you'd prefer to copy the vector back onto the heap and let the virtual threads parallelize any page faults that might be happening. For non-mmap cases it may make sense to use virtual threads to parallelize in any case, since each read is a syscall that may result in IO. It's a little awkward to use the underlying Directory implementation as a hint.

I may try this. Not worth worrying about better APIs for this case unless there is some proven value.

mccullocht · Oct 30 '25

I haven't worked with virtual threads for async IO myself, so this is more for my understanding... I read that virtual threads get pinned to the carrier platform thread when used with the Foreign Function & Memory (FFM) API.

Since MMapDirectory on Java 21+ uses MemorySegment (part of the FFM API), does this mean we cannot (should not) use virtual threads for mmap file access use cases?

Is the thinking here that we continue to use mmap access for the ANN graph search (fetching neighbor vector values for similarity checks), and independently leverage DirectIO with virtual threads for tasks like rescoring on vector similarity?

vigyasharma · Oct 31 '25

> I read that virtual threads get pinned to the carrier platform thread when used with the Foreign Function & Memory (FFM) API.

Yeah, I don't think we should use virtual threads for mmap things. Though I do think this pinning issue has been fixed (I recollect seeing a JEP about fixing this recently...)

> Is the thinking here that we continue to use mmap access for the ANN graph search (fetching neighbor vector values for similarity checks), and independently leverage DirectIO with virtual threads for tasks like rescoring on vector similarity?

My thinking is that whoever creates the scorer (e.g. the format) has the appropriate context available to know what to do.

benwtrent · Oct 31 '25