Scaling issues
This is an offshoot of @kubasz's original issue about scaling (https://github.com/faasm/faabric/issues/156), to capture the WAVM- and Faasm-specific ones.
Some solutions are implemented in these forks:
- https://github.com/auto-ndp/faasm
- https://github.com/auto-ndp/WAVM
Faasm-specific:
- File descriptors seem to be used a lot, and it's easy to run out of them at high throughput. It would be good to mention raising the system limit in the docs.
- To support full migration of functions, we need to include wasm globals along with the snapshots. These can be serialised to a vector of primitives (ints), and there are usually not very many per wasm module.
- There are a lot of (potentially unnecessary) `std::string` -> `std::vector<uint8_t>` conversions when interfacing with Protobuf objects. These can sometimes be avoided by pre-allocating the Protobuf fields, which matters most on the hot paths of function execution. #707
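On the file descriptor point: besides documenting `ulimit -n`, a process can raise its own soft limit on open files up to the hard limit at startup. This is a minimal sketch using the POSIX `getrlimit`/`setrlimit` calls; the helper name `raiseFdLimitToHardMax` is hypothetical and not part of Faasm.

```cpp
#include <sys/resource.h>
#include <cassert>
#include <cstdio>

// Hypothetical helper: raise this process's soft RLIMIT_NOFILE up to its hard
// limit. Returns true on success and stores the new soft limit in newSoft.
// An unprivileged process can only raise the soft limit as far as the hard
// limit; raising the hard limit itself needs privileges or system config.
bool raiseFdLimitToHardMax(rlim_t& newSoft) {
    struct rlimit lim {};
    if (getrlimit(RLIMIT_NOFILE, &lim) != 0) {
        std::perror("getrlimit");
        return false;
    }
    lim.rlim_cur = lim.rlim_max;
    if (setrlimit(RLIMIT_NOFILE, &lim) != 0) {
        // e.g. EPERM on Linux if the hard limit exceeds fs.nr_open
        std::perror("setrlimit");
        return false;
    }
    newSoft = lim.rlim_cur;
    return true;
}
```

Calling this once early in the worker's `main` would cover the process itself, while the system-wide and hard limits still have to be raised outside the process, which is why mentioning them in the docs is still worthwhile.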
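For the migration point, a sketch of serialising wasm globals to a vector of primitives. It assumes a hypothetical `GlobalValue` type covering only the MVP numeric types (i32, i64, f32, f64) and assumes each global's type can be recovered from the module on restore, so only the raw bits are stored. This is not WAVM's or Faasm's actual representation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <variant>
#include <vector>

// Hypothetical value type for a wasm global (v128 and reference types ignored).
using GlobalValue = std::variant<int32_t, int64_t, float, double>;

// Pack each global's bit pattern into one uint64_t. Types are not stored.
std::vector<uint64_t> serialiseGlobals(const std::vector<GlobalValue>& globals) {
    std::vector<uint64_t> out;
    out.reserve(globals.size());
    for (const auto& g : globals) {
        uint64_t bits = 0;
        std::visit([&bits](auto v) { std::memcpy(&bits, &v, sizeof(v)); }, g);
        out.push_back(bits);
    }
    return out;
}

// Rebuild the values, using the types in templateGlobals (e.g. the globals of
// a freshly instantiated module) to reinterpret the stored bits.
std::vector<GlobalValue> deserialiseGlobals(const std::vector<uint64_t>& bits,
                                            const std::vector<GlobalValue>& templateGlobals) {
    assert(bits.size() == templateGlobals.size());
    std::vector<GlobalValue> out;
    out.reserve(bits.size());
    for (std::size_t i = 0; i < bits.size(); ++i) {
        std::visit([&](auto v) {
            using T = decltype(v);
            T restored{};
            std::memcpy(&restored, &bits[i], sizeof(T));
            out.emplace_back(restored);
        }, templateGlobals[i]);
    }
    return out;
}
```

Since there are usually only a handful of globals per module, the resulting vector is small enough to ship alongside the memory snapshot.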
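On the Protobuf conversions: in the C++ Protobuf API, `bytes` fields are `std::string`, and generated code exposes a `mutable_<field>()` accessor. The sketch below uses a hypothetical `FakeMessage` struct that mimics that accessor (no real Protobuf dependency) to contrast building a vector and converting it with sizing the field once and writing into it directly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for a generated Protobuf message with a `bytes` field.
struct FakeMessage {
    std::string outputdata;
    std::string* mutable_outputdata() { return &outputdata; }
};

// Copy-heavy pattern: fill a vector, then convert it into the string field.
void setOutputViaVector(FakeMessage& msg, const uint8_t* src, std::size_t n) {
    std::vector<uint8_t> buf(src, src + n);                           // copy #1
    *msg.mutable_outputdata() = std::string(buf.begin(), buf.end());  // copy #2
}

// Pre-allocating pattern: size the field once and write straight into it.
void setOutputInPlace(FakeMessage& msg, const uint8_t* src, std::size_t n) {
    std::string* field = msg.mutable_outputdata();
    field->resize(n);
    if (n > 0) {
        std::memcpy(field->data(), src, n);  // single copy
    }
}

// Reading: view the field's bytes instead of copying them into a vector.
const uint8_t* outputBytes(const FakeMessage& msg) {
    return reinterpret_cast<const uint8_t*>(msg.outputdata.data());
}
```

Both setters produce the same field contents; the second avoids the intermediate vector, which is the saving that matters on hot paths.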
WAVM-specific:
- Because WAVM needs to allocate virtual memory to cover the whole wasm address space (to avoid bounds checks), we are effectively limited to ~2048 module instances before hitting the x86_64 128TiB userspace limit. Properly cleaning up dead Executors in Faabric, as mentioned in https://github.com/faasm/faabric/issues/156, would help with this.
- The virtual memory consumption of each module can be reduced by shrinking the table size from 64GB to 64MB (or something suitable). Note, though, that if this size is set too small, we will start overwriting random regions of memory, as the reservation will no longer enforce the bounds expected by wasm.
- WAVM serialisation seems to do a lot of string copying, which is particularly noticeable when loading wasm files.
- Allowing transparent huge pages with `madvise` can really help with some of the `memcpy` performance within WAVM.
- Locks on cloning modules are exclusive and can cause a bottleneck. However, cloning is read-only on the source module, so these can be changed to shared locks to reduce contention.
- It's possible to add new `cloneFrom` calls that reuse the existing memory and table objects' mappings instead of `mmap`ing fresh copies.
- This reduces process VMM lock contention in the kernel, but in my experience the contention is still significant for functions that run for <30ms.
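As a rough back-of-envelope check of the ~2048 instance figure above: x86_64 with 4-level paging gives a 47-bit (128 TiB) user address space. Writing $R$ (a symbol introduced here) for the virtual address space reserved per module instance:

$$
N_{\max} \approx \frac{128\ \text{TiB}}{R} = \frac{2^{47}\ \text{B}}{R},
\qquad
N_{\max} \approx 2048 = 2^{11}
\;\Rightarrow\;
R \approx 2^{36}\ \text{B} = 64\ \text{GiB}.
$$

This lines up with the 64GB table reservation mentioned above, which is why shrinking that reservation should raise the instance ceiling, though the exact new limit depends on whatever else WAVM reserves per instance.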
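For the transparent huge pages point, this is a minimal sketch of the Linux `mmap` + `madvise(MADV_HUGEPAGE)` pattern on an anonymous mapping. The helper name `mapWithHugePageHint` is hypothetical, and this is not where or how WAVM actually allocates its memories.

```cpp
#include <sys/mman.h>
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: map an anonymous private region and hint that it should
// be backed by transparent huge pages. Returns nullptr if mmap fails.
// MADV_HUGEPAGE is only a hint; it matters when THP is in "madvise" mode
// (/sys/kernel/mm/transparent_hugepage/enabled), and huge pages can only
// cover 2MiB-aligned sub-ranges of the mapping.
void* mapWithHugePageHint(std::size_t bytes) {
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        return nullptr;
    }
#ifdef MADV_HUGEPAGE
    // Ignore failure (e.g. EINVAL on kernels built without THP): the mapping
    // is still usable with normal pages.
    (void)madvise(p, bytes, MADV_HUGEPAGE);
#endif
    return p;
}
```

Large `memcpy`s into such a region (e.g. restoring a snapshot) touch fewer pages, which is where the benefit is expected to come from.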
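For the cloning lock point, the general idea can be sketched with `std::shared_mutex` (C++17): clones only read the source, so they take a `std::shared_lock` and can run concurrently, while anything that mutates the module still takes an exclusive `std::unique_lock`. The `Module` class here is hypothetical and stands in for WAVM's real module/compartment types, whose locking is not shown.

```cpp
#include <cassert>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical module type guarded by a reader/writer lock.
class Module {
public:
    explicit Module(std::string code) : code_(std::move(code)) {}

    // Read-only on *this, so many threads may clone at the same time.
    Module clone() const {
        std::shared_lock<std::shared_mutex> lock(mutex_);
        return Module(code_);
    }

    // Mutations still take the exclusive lock, so they never race a clone.
    void update(std::string code) {
        std::unique_lock<std::shared_mutex> lock(mutex_);
        code_ = std::move(code);
    }

    std::string code() const {
        std::shared_lock<std::shared_mutex> lock(mutex_);
        return code_;
    }

private:
    mutable std::shared_mutex mutex_;
    std::string code_;
};
```

The trade-off is that a writer may wait longer while readers hold the lock, which seems acceptable here since clones vastly outnumber updates on a loaded module.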