ruby YJIT: Prepare for freeing unused code pages

This PR implements rb_yjit_mark_unused to free code pages and outlines what I think needs to be implemented as a first simple version of code GC, which only collects code for GCed ISEQs.

How code pages can be freed

Today, I played with mprotect and madvise, looking at the RSS. My observation has been:

Linux:
- madvise
  - MADV_DONTNEED frees physical memory immediately.
  - MADV_FREE is available in newer platforms and useful for lazily doing it. It's probably better in terms of production performance, but it seems not friendly to monitoring/benchmarking memory usages; the RSS is likely to remain high when there're no competing memory usages. One idea might be to use MADV_DONTNEED on dev builds and MADV_FREE for release builds.
  - The argument address needs to be page-aligned. Otherwise it gives an "Invalid argument" error.
- mprotect PROT_NONE doesn't free physical memory.
macOS:
- madvise
  - MADV_DONTNEED and/or MADV_FREE doesn't free physical memory immediately.
  - This platform also has MADV_FREE_REUSABLE unlike Linux, but it doesn't free memory immediately either.
  - off-topic: Unlike Linux, a non-page-aligned address can be used as the argument.
- mprotect PROT_NONE ~frees physical memory~ . The argument needs to be page-aligned.
  - edit: While it does seem to reduce RSS, re-enabling PROT_READ allows reading the same content. So it might not be actually freed. This needs further investigation.

It's implemented as rb_yjit_mark_unused in this PR.

Tracking unused code

Given the above information, you always need to give a page-aligned address to those system calls. One simple way to achieve it is to only free code pages when an entire page is marked unused.

Because we always start a basic block with a 64-byte address (cb.align_pos(64)), I think we can assume that a single cacheline may not have code for different ISEQs. If that's true, it should be enough to just track the in-use status per cacheline when we're going to free all code of dead ISEQs at once.

On Linux, a page is 4KiB and a cacheline is typically 64 bytes, thus 64 cachelines for a single page, which can be represented by a 64-bit bitmap. So 128KiB would be needed for the bitmap of a 64MiB code block. A page is 16KiB on M1 MacBook Pro.

I think our first step could be to set 1 on compile_with_regs and set 0 on rb_yjit_iseq_free in the bitmap. Then, when all bits for a single page become 0 (all the 64 bits of a cacheline on Linux), we can call rb_yjit_mark_unused to free it.

This part is extremely WIP as of writing this. Feel free to take over this if this idea makes sense and you want to move this forward when I'm offline.

Out of scope in this PR: Inline and outlined code regions

We also discussed the idea to mix inline and outlined code regions. This PR doesn't work on that project, but I believe those projects are kind of independent and could be worked on concurrently. I assume it can be done before or after this change with relative ease.

cc: @maximecb @XrXr

Sep 20 '22 08:09 k0kubun

One idea might be to use MADV_DONTNEED on dev builds and MADV_FREE for release builds.

I think in general I would avoid having two different implementation paths when we don't have data to back up the idea that having these two separate behaviors yields perf improvements. It's best to minimize complexity and avoid unforeseen strange bugs.

Given the above information, you always need to give a page-aligned address to those system calls. One simple way to achieve it is to only free code pages when an entire page is marked unused.

I agree with that 👍 It might also mean that there's an advantage in having smaller code pages wrt memory usage. We will need to benchmark it.

Because we always start a basic block with a 64-byte address (cb.align_pos(64)), I think we can assume that a single cacheline may not have code for different ISEQs.

This is unfortunately not quite true. We set the alignment to 64 bytes in gen_entry_prologue... Which is an old "optimization" that I put in place but that may not actually improve perf 😅 However, when we do JIT-to-JIT ISEQ-to-ISEQ calls we pay no attention to alignment.

I think your idea of "painting" pages incrementally when generating code and when freeing ISEQs is fairly clever. What I was originally thinking is that we would have a bitmap that is essentially one bit per page. We would zero out all the bits and then iterate through all the live ISEQs to mark the live ISEQs. We have access to all of the YJIT blocks for all the ISEQs so we can map from ISEQ->blocks->address ranges->pages. Maybe we could start with that? Unless you can think of an incremental marking scheme that can work with arbitrary address ranges for ISEQs.

We also discussed the idea to mix inline and outlined code regions. This PR doesn't work on that project, but I believe those projects are kind of independent and could be worked on concurrently. I assume it can be done before or after this change with relative ease.

We may want to think about that fairly early? At least convince ourselves that it will work with our design.

We'll need to split each code page into an inline and outlined half so that the outlined code is always close to the inlined code.

I think that in terms of marking, we basically don't care about outlined code. We can assume that the code in the outlined half belongs to the same ISEQs that are in the inlined half. Therefore, if it's safe to discard the inlined code, it should be safe to discard the outlined code.

Sep 20 '22 15:09 maximecb

I think in general I would avoid having two different implementation paths when we don't have data to back up the idea that having these two separate behaviors yields perf improvements. It's best to minimize complexity and avoid unforeseen strange bugs.

Makes sense. I prefer just using MADV_DONTNEED for Linux then. MADV_FREE doesn't exist in older environments, so it would force us to maintain two paths.

However, when we do JIT-to-JIT ISEQ-to-ISEQ calls we pay no attention to alignment.

Got it. It was a too optimistic assumption 🙂 If we currently pack different ISEQs in such calls, adding alignment there may have an unwanted impact. Let's avoid designs that need this assumption.

What I was originally thinking is that we would have a bitmap that is essentially one bit per page. We would zero out all the bits and then iterate through all the live ISEQs to mark the live ISEQs. We have access to all of the YJIT blocks for all the ISEQs so we can map from ISEQ->blocks->address ranges->pages. Maybe we could start with that?

That sounds like the simplest solution we could have now. I think we need to newly introduce a list of ISEQs that have been JITed in the past, but given that we're targeting only a few thousands of ISEQs, it's probably fine in terms of both time and space complexity.

Unless you can think of an incremental marking scheme that can work with arbitrary address ranges for ISEQs.

I do think of an incremental marking scheme that works without the alignment assumption though. If we have [u8; N] or [u16; N] where each element maintains the number of ISEQs that overlap a single page, we could do it incrementally. Because each page requires only 1 or 2 bytes, it requires an even smaller size for maintaining it than my first idea.

For this to work, whenever we write new code for an ISEQ, we should be able to know the ISEQ's current set of address ranges and its new one in order to increment only elements for newly modified pages. Being able to know all the code addresses of an ISEQ is needed for a non-incremental design as well, so virtually we have no blocker for taking this design.

Anyway, I'm planning on implementing your non-incremental design first because it's simpler and I think it's better to have a working first version as early as possible and incrementally improve it.

At least convince ourselves that it will work with our design.

It's kind of related to the "only free code pages when an entire page is marked unused" idea. No matter what layout we choose for each page or over multiple pages, we'll need to be able to identify whether a particular page(s) is in use or not. We can make the page usage tracking either fully isolated or tightly integrated with something to manage a code page, but even for the latter case, the core logic of the memory page management would stay the same when the layout is changed.

I'm not saying that we should do that early though. I'm rather suggesting that we could work on them at the same time especially when we have multiple people working on this project. I just prioritized solving the primary goal anyway because I have the above belief.

Sep 21 '22 08:09 k0kubun

That sounds like the simplest solution we could have now. I think we need to newly introduce a list of ISEQs that have been JITed in the past, but given that we're targeting only a few thousands of ISEQs, it's probably fine in terms of both time and space complexity.

IIRC SFR is on the order of 10K to 20K ISEQs.

Does CRuby have a straightforward way of iterating through all the ISEQs? If so, we could use that. If not, then I guess we have to add a list/set of JITted ISEQs as you suggest.

For a given ISEQ, we can check if there exist block versions in YJIT: https://github.com/ruby/ruby/blob/master/yjit/src/core.rs#L686

And this function gets called when marking an ISEQ: https://github.com/ruby/ruby/blob/master/yjit/src/core.rs#L559

I do think of an incremental marking scheme that works without the alignment assumption though. If we have [u8; N] or [u16; N] where each element maintains the number of ISEQs that overlap a single page, we could do it incrementally. Because each page requires only 1 or 2 bytes, it requires an even smaller size for maintaining it than my first idea.

IIUC what you are suggesting here is using one integer per page representing the number of ISEQs that overlap on that page? If so then I guess this would be reference counting. Presumably that would work. We might still need to reconstruct the refcount from time to time if we ever decide to throw away all the code because we've reached the limit though.

Anyway, I'm planning on implementing your non-incremental design first because it's simpler and I think it's better to have a working first version as early as possible and incrementally improve it.

Alrighty 👍 :)

No matter what layout we choose for each page or over multiple pages, we'll need to be able to identify whether a particular page(s) is in use or not.

One challenging point is that we don't really track address ranges for outlined code blocks right now. If we split each page between inline and outlined halves, we can guarantee that all of the outlined code for an ISEQ will live in the same page as its inline code. Without that design, then we need to track the outlined address ranges. Maybe not the hardest thing, but something to think about. We do also want to solve the problem of making inline and outlined code closer together (within the range of conditional branches on arm).

I'm rather suggesting that we could work on them at the same time especially when we have multiple people working on this project. I just prioritized solving the primary goal anyway because I have the above belief.

I apologize because I do realize that it's hard to coordinate on this project async and there are a lot of different elements to keep track of.

Sep 22 '22 20:09 maximecb

IIRC SFR is on the order of 10K to 20K ISEQs.

:memo:

Does CRuby have a straightforward way of iterating through all the ISEQs?

I currently don't know of such a thing. Iterating through all objects might work if every ISeq has a Ruby object, but I'm not sure if it's efficient enough. At least MJIT has been doing that by itself for years and nobody has pointed out an alternative.

If so then I guess this would be reference counting. We might still need to reconstruct the refcount from time to time if we ever decide to throw away all the code because we've reached the limit though.

Exactly.

One challenging point is that we don't really track address ranges for outlined code blocks right now. If we split each page between inline and outlined halves, we can guarantee that all of the outlined code for an ISEQ will live in the same page as its inline code. Without that design, then we need to track the outlined address ranges.

For that to be useful, we'd need to assume only memory pages used by inline code are used by outlined code, i.e. outlined code shouldn't grow faster than inline code, or otherwise we'll need extra tracking. It would probably hold if you make each page a half-and-half layout, but once you start thinking about optimizing it like 80% inline and 20% outlined, again, you'll need something.

However, it makes sense to keep thinking about the relationship between those projects to potentially simplify the whole design.

I apologize because I do realize that it's hard to coordinate on this project async and there are a lot of different elements to keep track of.

Thank you for bearing with me being in a different timezone. Let me know if there's anything I can help with to improve our async communication.

Sep 26 '22 01:09 k0kubun

Does CRuby have a straightforward way of iterating through all the ISEQs?

rb_objspace_each_objects combined with rb_obj_is_iseq. See rb_iseq_trace_set_all/trace_set_i in iseq.c

Sep 26 '22 03:09 jeremyevans

Ah right, I've completely forgotten that we need to do that thing for making every insn a trace_ insn. It's probably efficient enough to be used for it too then. Thank you!

Sep 26 '22 04:09 k0kubun

I pushed my latest code at https://github.com/ruby/ruby/pull/6406/commits/08c83fb89941d4338e42090c6464fdbbbd83262b. Before marking this ready, it worked for a simple synthetic workload locally. I plan to work on including YJIT entry code in inline code addresses and using rb_objspace_each_objects as Jeremy suggested.

While working on this, I realized it's probably useful to change the code layout first because:

If we assert inline code size >= outlined code size, you could skip tracking outlined code addresses.
This implementation triggers code GC for each CodeBlock separately, but it makes more sense to GC both at once. It could be more naturally implemented if you work on that first.

So, yeah, I'll work on that first 😅

Sep 26 '22 07:09 k0kubun

For that to be useful, we'd need to assume only memory pages used by inline code are used by outlined code, i.e. outlined code shouldn't grow faster than inline code, or otherwise we'll need extra tracking. It would probably hold if you make each page a half-and-half layout, but once you start thinking about optimizing it like 80% inline and 20% outlined, again, you'll need something.

Well, I was thinking, we split the pages half and half. There is a current page that we're generating code into, and when we fill either the inline or outlined half, we switch to the next page, to maintain the invariant that we're always putting the outlined code corresponding with some inline code in the same page. It has the potential to be somewhat wasteful but it should work?

Before marking this ready, it worked for a simple synthetic workload locally. I plan to work on including YJIT entry code in inline code addresses and using rb_objspace_each_objects as Jeremy suggested.

Nice. I will read through the code that you wrote so far. I'm going to be conservative in merging code GC because I feel it's very important that we get this piece right. We should test this on railsbench and other larger benchmarks in yjit-bench as well.

While working on this, I realized it's probably useful to change the code layout first because: If we assert inline code size >= outlined code size, you could skip tracking outlined code addresses. This implementation triggers code GC for each CodeBlock separately, but it makes more sense to GC both at once. It could be more naturally implemented if you work on that first.

In my mind it's also, if we're going to modify the design of our code block system, we might as well tackle this problem. Ever since we started talking about arm support months ago, I've been wanting us to improve the way we do code layout.

Sep 26 '22 20:09 maximecb

I wrote a few comments but I want to say I think you did a good job and your current implementation makes a lot of sense.

I think the main missing piece is the inline/outlined page split, and it would be interesting to have some numbers in terms of how much memory can be freed on railsbench, erubi_rails after initialization but before the first benchmark iteration (and what percentage of the total allocated that is).

Sep 26 '22 21:09 maximecb

Well, I was thinking, we split the pages half and half. There is a current page that we're generating code into, and when we fill either the inline or outlined half, we switch to the next page, to maintain the invariant that we're always putting the outlined code corresponding with some inline code in the same page. It has the potential to be somewhat wasteful but it should work?

What I meant is that, if we currently don't know any addresses of outlined code, when we fill the outlined half and only outlined code of an ISEQ goes to the next page, there will be no way to detect the ISEQ is using the next page. So, for that case, I thought tracking the outlined address ranges (or outlined code pages) is needed while you seem to assume it's not needed with your half-and-half idea. Managing the relationship between an ISEQ and code pages hopefully doesn't require a lot of space, so it's probably not a big deal though.

I'm going to be conservative in merging code GC because I feel it's very important that we get this piece right. We should test this on railsbench and other larger benchmarks in yjit-bench as well.

Agreed :+1:

Sep 27 '22 01:09 k0kubun

So, for that case, I thought tracking the outlined address ranges (or outlined code pages) is needed while you seem to assume it's not needed with your half-and-half idea.

Yeah I think we either need to go with half and half, or we need to track where the outlined code for an iseq lives. Either way though, we want the outlined code to be close to the inline code it's associated with to avoid jumping too far.

Assuming we don't go with the half and half strategy:

If we assume that we have a current inline page that we generate inline code into and a current outlined page that we generate outlined code into... When we fill the current inline or outlined page, we need to move to the next page. We need some kind of mechanism to allocate the next page. In order to keep the in/out code close to each other, we would ideally want the next pages to be allocated to be close to each other.

The problem here is IMO that we're going to run into fragmentation issues once we start to GC code. I guess this may or may not be a problem in practice. If we keep a list of free pages that is sorted by increasing address, it might work out OK because ARM CPUs can jump +/-1 1MiB. That's actually pretty far, so as long as the next inline/outlined pages are within 1MiB (256 x 4KiB pages), we can still do a one-instruction jump.

Managing the relationship between an ISEQ and code pages hopefully doesn't require a lot of space, so it's probably not a big deal though.

It's a bit annoying to do but it is doable. Would have to keep track of all the CodePtr targets of instructions that get passed to the assembler I think. That or record every call to get_side_exit and also scan the dst_addrs field of the Branch struct for stubs and the entry_exit field of the Block struct 🤔

Sep 27 '22 20:09 maximecb

I've been experimenting with a few designs for the inline/outlined page split. While still WIP, I filed https://github.com/ruby/ruby/pull/6460 for sharing the current design.

Sep 28 '22 08:09 k0kubun

so long and thanks for all the fish! :)

Oct 26 '22 08:10 yasarcelep

Makes sense. I prefer just using MADV_DONTNEED for Linux then. MADV_FREE doesn't exist in older environments, so it would force us to maintain two paths.

It's worth noting that golang also moved away from MADV_FREE because it makes memory usage harder to reason about. See: https://github.com/golang/go/issues/42330.

Nov 08 '22 20:11 igorwwwwwwwwwwwwwwwwwwww

I have created https://bugs.ruby-lang.org/issues/19122 to mirror that Go issue and propose that Ruby also stops using MADV_FREE.

Nov 11 '22 16:11 smcgivern

Yeah, I mean, we decided to use only MADV_DONTNEED in this PR, so we're good.

Nov 11 '22 20:11 k0kubun

Sorry, I should have clarified: that issue is for the remaining usage of MADV_FREE in fiber_pool_stack_free.

Nov 12 '22 06:11 smcgivern