
device: Allow buffer memory growth to be limited at run time

Open gitlankford opened this issue 2 years ago • 13 comments

The unbounded memory growth allowed by the default PreallocatedBuffersPerPool setting causes processes to be oom-killed on low-memory devices. This occurs even when a soft limit is set with GOMEMLIMIT. Specifically, running Tailscale on a Linux device (OpenWrt, MIPS, 128 MB RAM) will exhaust all memory and be oom-killed under heavy load. Allowing this value to be overridden, as is already done in the iOS build, would make it possible to cap memory expansion and prevent the oom-kill.

See the Tailscale issue thread for further info: https://github.com/tailscale/tailscale/issues/7272

gitlankford — Feb 26 '23

Thank you for your patience.

I'm sure there are other ways to solve this issue (specific low-mem builds, etc). This method follows the prior art of the iOS build.

gitlankford — Feb 26 '23

Not sure it's a good idea to add random knobs like this for third parties to twiddle and trip over. I wonder if there's some heuristic that could be used instead, one that would always work and scale dynamically? ratelimiter.c in the kernel does this, for example, with something pretty kludgy, but it does work:

int wg_ratelimiter_init(void)
{
        mutex_lock(&init_lock);
        if (++init_refcnt != 1)
                goto out;

        entry_cache = KMEM_CACHE(ratelimiter_entry, 0);
        if (!entry_cache)
                goto err;

        /* xt_hashlimit.c uses a slightly different algorithm for ratelimiting,
         * but what it shares in common is that it uses a massive hashtable. So,
         * we borrow their wisdom about good table sizes on different systems
         * dependent on RAM. This calculation here comes from there.
         */
        table_size = (totalram_pages() > (1U << 30) / PAGE_SIZE) ? 8192 :
                max_t(unsigned long, 16, roundup_pow_of_two(
                        (totalram_pages() << PAGE_SHIFT) /
                        (1U << 14) / sizeof(struct hlist_head)));
        max_entries = table_size * 8;

        table_v4 = kvcalloc(table_size, sizeof(*table_v4), GFP_KERNEL);
        if (unlikely(!table_v4))
                goto err_kmemcache;

#if IS_ENABLED(CONFIG_IPV6)
        table_v6 = kvcalloc(table_size, sizeof(*table_v6), GFP_KERNEL);
        if (unlikely(!table_v6)) {
                kvfree(table_v4);
                goto err_kmemcache;
        }
#endif

        /* ... rest of wg_ratelimiter_init() elided in this excerpt ... */

zx2c4 — Mar 03 '23

Thanks @zx2c4 - I am happy to experiment with that sort of dynamic scaling, but I'm less confident I can do it to your standards given my limited Go experience. Perhaps that could be a future workstream?

In my experiments with different values, I found that when PreallocatedBuffersPerPool was too high, the GC would work overtime until the oom-kill, and when it was "just right" the GC would only run during the forced GC cycle every two minutes (an indirect measurement of the memory situation).

The other thing to note is that the router I'm running this on is dedicated to running this VPN, so the memory usage is otherwise extremely stable. That is probably the most common setup for devices of this size (also: no swap).

gitlankford — Mar 03 '23

Any news on this? It looks like someone else hit this issue as well over here: https://github.com/qdm12/gluetun/issues/2036

qdm12 — May 10 '24

Any ETA on when this will be merged, or when the other solution mentioned by @zx2c4 will be implemented? This increased our memory consumption for 10.5k peers up to 18 GB, whereas after setting the buffer size to 4096 (as is also done in the Android version) it went down to 2 GB.

shaheerem — Dec 05 '24

Hitting this as well. Is there a mitigation? Has anyone used compile flags to set that value at build time?

james-lawrence — Jul 04 '25

@zx2c4 this is actually quite a significant performance issue, not just in memory but in CPU usage, due to the GC cycles that highly concurrent code triggers. Essentially, without the limit the code generates giant spikes in allocations, causing heavy GC activity. I suggest the default be changed to 4096 for everything currently set to 0, at a minimum.

The files below are pprof profiles, renamed to .txt to get past GitHub's silly file filter. Current = 0:

profile.pprof.txt

[pprof screenshot]

Updated to 4096:

[pprof screenshot]

mem.pprof.txt

cpu.pprof.txt

james-lawrence — Jul 04 '25

For what it's worth, I agree knobs are annoying and a heuristic would be better, but leaving this as-is is painful. This is a very simple change that would let developers tune usage until a heuristic is implemented.

james-lawrence — Jul 04 '25

I think @bradfitz was looking at some major refactoring that would make this more natural?

zx2c4 — Jul 04 '25

@zx2c4 which is fine, but this is over two years old. Shrug. A fix this minor is better than letting the library torch any highly concurrent application with no generally applicable way for developers to resolve it. We're not talking peanuts; we're talking orders of magnitude in performance difference.

james-lawrence — Jul 04 '25

To be clear, from my perspective this is an easy resolution now that makes the problem runtime/compile-time adjustable by developers, with no real impact on any future refactoring, versus an unknown future date for a "major refactor". As much as I respect @bradfitz as a developer, a bird in the hand is worth two in the bush.

james-lawrence — Jul 04 '25

For reference, numbers on a 32-core / 124 GB machine:

PreallocatedBuffersPerPool = 0: RAM > 20 GB; cores maxed out.

PreallocatedBuffersPerPool = 4096: RAM < 2 GB; cores < 30% utilization.

There was no observable change in the application's network throughput (though I assume that's mostly due to a bottleneck in the application itself).

james-lawrence — Jul 04 '25

Not sure it's a good idea to add random knobs like this for third parties to twiddle and trip over.

It should be noted that these are already variables on the iOS platform. Making them variables across all platforms may result in better consistency.

https://github.com/WireGuard/wireguard-go/blob/f333402bd9cbe0f3eeb02507bd14e23d7d639280/device/queueconstants_ios.go#L13-L19

ArcticLampyrid — Nov 11 '25