binaryen Questions Related to the "Abnormal" Memory Usage of Binaryen

Hi developers, I am incorporating Binaryen as third-party library into my fuzzing framework. In short, Binaryen is used to parse input Wasm module, do some mutation and generation work and emit a new testcase back in this case.

Everything went fine until I deployed it on a cloud server. The memory usage of my fuzzing framework kept going on and eventually kernel's OOM Killer killed my fuzzing process.

It really took me a lot of effort to figure out the reason: during the mutation or generation stage, our work will generate block/loop names for BinaryenBlock()/BinaryenLoop() C API, which will further call IString::interned to update some static variables: https://github.com/WebAssembly/binaryen/blob/6453fd55a312779c2f0d9451d325646522a85470/src/support/istring.h#L36

The problem is that fuzzing may generate different block/loop names for different input Wasm module, and these names become useless when we have finished the mutation/generation stage for this specific module, but the corresponding block/loop names are not erased from these static variables accordingly! Eventually, it results in excessive memory occupation for maintaining these meaningless names.

So, I wonder if there is any way to erase elements inside these static/global variables on a per-module basis, so that the overall memory usage is affordable.

Jan 25 '24 13:01 mobsceneZ

Atm Binaryen is optimized for short-running processes, like calling wasm-opt from the commandline to optimize a module. After the optimization the process ends, which is how toolchains (Emscripten, J2CL, Kotlin, etc.) use Binaryen. The specific issue is that we intern strings for speed, and we never remove them (until the process ends). That is simple and efficient, but if you have a very long running process then it will accumulate memory.

Perhaps we could add an option to not intern strings. That would just need a replacement file for istring.h. It would be slower, but avoid this overhead. However, if you can restart your fuzzer process periodically that would be simpler.

Jan 25 '24 16:01 kripken

We similarly intern types and never delete them, but we do have a function called destroyAllTypesForTestingPurposesOnly that deletes all the interned types. It's wildly unsafe to use, but I could see it being useful to call between independent fuzzer iterations. Perhaps we could have a similar function for the interned strings?

Jan 25 '24 17:01 tlively

Atm Binaryen is optimized for short-running processes, like calling wasm-opt from the commandline to optimize a module. After the optimization the process ends, which is how toolchains (Emscripten, J2CL, Kotlin, etc.) use Binaryen. The specific issue is that we intern strings for speed, and we never remove them (until the process ends). That is simple and efficient, but if you have a very long running process then it will accumulate memory.

Perhaps we could add an option to not intern strings. That would just need a replacement file for istring.h. It would be slower, but avoid this overhead. However, if you can restart your fuzzer process periodically that would be simpler.

Hey kripken, I think it may not be necessary to implement a new option that support not interning strings. Instead, like @tlively said in the comment, I wonder if it's acceptable to export a C API to clear up these interned types or strings which could make things easier. Besides, I'm afraid that restarting fuzzer process may not be a good idea for the effectiveness of fuzzing. Another option for me is to maintain a block/loop name pool where each name can be reused across modules/functions, which as I presume could effectively reduce the memory usage. But it will depend on whether "exporting a C API" option is acceptable for Binaryen community :).

Jan 26 '24 03:01 mobsceneZ

We similarly intern types and never delete them, but we do have a function called destroyAllTypesForTestingPurposesOnly that deletes all the interned types. It's wildly unsafe to use, but I could see it being useful to call between independent fuzzer iterations. Perhaps we could have a similar function for the interned strings?

Yes, functions like destroyAllTypesForTestingPurposesOnly are exactly what I need. I'm curious about why this function is "wildly unsafe to use", is is related to the thread safety?

Jan 26 '24 03:01 mobsceneZ

I'm not opposed to such an API.

Jan 26 '24 03:01 kripken

We similarly intern types and never delete them, but we do have a function called destroyAllTypesForTestingPurposesOnly that deletes all the interned types. It's wildly unsafe to use, but I could see it being useful to call between independent fuzzer iterations. Perhaps we could have a similar function for the interned strings?

Yes, functions like destroyAllTypesForTestingPurposesOnly are exactly what I need. I'm curious about why this function is "wildly unsafe to use", is is related to the thread safety?

Yeah, it's not safe for any thread to be doing anything with types while another thread is calling this function. But more generally, any types from before the call cannot be used after the call, which is a lifetime issue that code using types normally wouldn't need to think about.

Jan 26 '24 16:01 tlively

How about to use arena or bump allocator for interner? I guess generated names could be quite long and continuously grow in "converge" mode, which may caused to OOM, especially for wasm32 binaryen builds. It would be nice to be able to do a reset for interner state.

Jan 26 '24 20:01 MaxGraey

We similarly intern types and never delete them, but we do have a function called destroyAllTypesForTestingPurposesOnly that deletes all the interned types. It's wildly unsafe to use, but I could see it being useful to call between independent fuzzer iterations. Perhaps we could have a similar function for the interned strings?

Yes, functions like destroyAllTypesForTestingPurposesOnly are exactly what I need. I'm curious about why this function is "wildly unsafe to use", is is related to the thread safety?

Yeah, it's not safe for any thread to be doing anything with types while another thread is calling this function. But more generally, any types from before the call cannot be used after the call, which is a lifetime issue that code using types normally wouldn't need to think about.

You're right, if there are multiple threads then things could get messy. I do not use multiple threads in my fuzzing framework, so a clean up function is perfectly suitable for me. If that's ok for you to have this kind of C API, I will try adding one and issue a pull request ;).

Jan 27 '24 05:01 mobsceneZ

How about to use arena or bump allocator for interner? I guess generated names could be quite long and continuously grow in "converge" mode, which may caused to OOM especially on for wasm32 binaryen builds. It would be nice to be able to do a reset for interner state.

Hey MaxGraey, I think for this case arena allocator is a pretty nice choice, but I'm not an expert in designing such memory allocators so it may depends on developers.

Jan 27 '24 06:01 mobsceneZ

@mobsceneZ, sounds good! I look forward to the PR.

Jan 29 '24 05:01 tlively

A bump allocator is separately a good idea here. We can use the existing MixedArena which gives exactly what we want. A separate PR with that would also be welcome here.

Jan 29 '24 18:01 kripken