Document and improve wasmer performance for contracts
Motivation
Currently, we use an embedded wasmi (compiled into the runtime as wasm) to execute contracts. We want to switch to the wasmer singlepass backend. However, in order to make an informed decision we need to document the status quo.
We have two benchmark approaches to assess the performance of contract execution:
Synthetic Benchmarks
This is a suite of runtime benchmarks that is used to assign a weight value to every host function available to contracts and to each supported wasm instruction. These benchmarks are defined in this file. You can have a look at #10266 to find out how to run these benchmarks with the benchbot and what the current diff to wasmi is.
Insights from those benchmarks:
- Host functions are roughly double the weight on wasmer (see for example the `seal_caller` benchmark)
- There is a 50µs static overhead (see the `call` benchmark)
- Instruction weights are much lower (about 1/8)
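Taken together, these numbers suggest a rough break-even model: wasmer's ~50µs static overhead has to be amortized by its ~8x cheaper instructions. A back-of-the-envelope sketch; note that the per-instruction time for wasmi used here is an assumed placeholder, not a measured value:

```rust
// Break-even model from the insights above: wasmer adds ~50µs static
// overhead per call, but executes instructions at ~1/8 of wasmi's cost.
// `wasmi_ns` (time per interpreted instruction) is a hypothetical value.
fn break_even_instructions(static_overhead_ns: f64, wasmi_ns: f64) -> f64 {
    // wasmi:  t = n * wasmi_ns
    // wasmer: t = static_overhead_ns + n * wasmi_ns / 8
    // Equal when n * wasmi_ns * (1 - 1/8) = static_overhead_ns.
    static_overhead_ns / (wasmi_ns * (1.0 - 1.0 / 8.0))
}

fn main() {
    // Assuming ~5ns per interpreted instruction (placeholder):
    let n = break_even_instructions(50_000.0, 5.0);
    println!("wasmer would pay off after ~{n:.0} executed instructions");
}
```

Under these made-up per-instruction costs a contract would need to execute on the order of 10⁴ instructions per call before wasmer wins, which hints at why small contracts may not benefit.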
End to end benchmarks
The end-to-end benchmarks are also runtime benchmarks, but flagged as extra so they are not part of the emitted weight file. They are just for manual execution and inspection.
Instead of using procedurally generated fixtures like the other benchmarks, they use real-world contracts written in ink! or solang. For now there are only two: benchmarking a transfer on an ink! ERC20 contract and on a solang ERC20 contract.
You can look at #10268 to see how to run these benchmarks. You will also find results for running them with wasmi and wasmer.
Looking at those benchmark results, we can see that wasmer is 50% slower on the ink! example and 78% slower on the solang example.
Conclusion
Instruction performance of wasmer is better than wasmi's (including wasmer's compilation overhead). However, in an end-to-end benchmark wasmi is still faster. We need to find out why this is and resolve the issue if possible.
Goal
The goal of this issue is to find out why those end-to-end benchmarks report no results in favor of wasmer. If the reason is some resolvable inefficiency in either wasmer or the wasmer integration, it should be fixed and the new results should be reported here. Those fixes should then be upstreamed to substrate and wasmer.
If the conclusion is that we cannot outperform interpretation for typical smart contract workloads, this is also a valid result. This needs to be backed up with additional end-to-end benchmarks, though. Maybe there are some workloads where it outperforms interpretation (loop-heavy contracts) but not all. A solution where we offer two execution engines might be something to think about then.
Steps
- [ ] Change the bench bot so that cargo features can be supplied (allows running benchmarks for wasmer without changing any code)
- [ ] Compare the time it takes to set up an instance vs. invoking it on wasmer vs. wasmi. We want to find out the static overhead (not depending on contract size) of creating an instance on wasmer vs. wasmi.
Intermediate results so far.
tl;dr: wasmer is ~twice as slow at instantiating modules and ~twice as fast at executing calls when compared to wasmi.
The Setup
I used the tracing library to instrument and measure timings of various parts of Substrate during transaction dispatch, such as instance creation, execution, parameter marshalling, etc. See https://github.com/paritytech/substrate/commit/ca2b249f171a2a5b9d7d2f536c0f3de9ef993095
I used the flipper example as a testbed and isolated a single call to Flipper::flip to get an idea of how wasmi and wasmer compare. Of course, proper comparison should aggregate data over several calls, but gathered data was distinct enough to analyze as is.
I disabled the embedded executor, so both wasmi and wasmer were executed via the sandbox API from the Substrate client. That way the control flow is as close as possible between the two backends and we can directly compare the results.
I executed the node using the CLI invocation below, but for wasmi execution I hacked `client/executor/wasmtime/src/host.rs` like this:
```diff
 pub fn new(allocator: FreeingBumpHeapAllocator) -> Self {
     HostState {
         sandbox_store: SandboxStore(Some(Box::new(sandbox::Store::new(
-            sandbox::SandboxBackend::TryWasmer,
+            sandbox::SandboxBackend::Wasmi,
         )))),
```

```sh
cargo run --release --features=wasmer-sandbox -- --dev --execution=wasm --no-hardware-benchmarks -lsandbox=trace -lwasmer=debug -lexec=debug --tracing-targets=wasmer=trace,sandbox=trace,inst=trace
```
I also tried to measure the whole lifetime of `sp_sandbox::SandboxInstance`, but that required using tracing inside of a runtime. I tried using `sp-sandbox` there but failed to get any output, even with the `with-tracing` feature enabled.
The Results
These two files are part of the node log during dispatch of the `Flipper::flip` transaction. Important lines are surrounded by blank lines. Best viewed side by side.
https://gist.github.com/0x7CFE/b7db97b07620aca2fa860c2296604cd0#file-wasmi2-txt https://gist.github.com/0x7CFE/b7db97b07620aca2fa860c2296604cd0#file-wasmer2-log
Note that the ERROR log level was chosen just for visibility. It does not affect the test in any other way.
Also note that tracing spans are nested, so if span A contains span B, B's log line appears before A's.
Result Interpretation
Here I show notable lines from the files above with my commentary. See the whole log for the full picture. To simplify reading I removed irrelevant info and formatted the time data with spaces (1234 → 1 234). Wasmi goes first, wasmer next.
During the module initialization phase we clearly see that wasmi is way ahead of wasmer, being almost twice as fast. This is not surprising at all, since wasmer needs to perform the computationally expensive task of compiling wasm to native machine code, whereas wasmi just transpiles the instructions into a format better suited for immediate execution:
```
16:41:54.106 sc_tracing: sc_executor_common::sandbox::wasmi_backend: instantiate, time: 635 291, id: 13511348637925378, values: len=26132
16:32:39.108 sc_tracing: sc_executor_common::sandbox::wasmer_backend: instantiate, time: 1 889 049, id: 13511348637925378, values: len=26132

16:41:54.106 sc_tracing: sc_executor_common::sandbox: instantiate, time: 661 333, id: 20266748078981121, values: len=26132
16:32:39.108 sc_tracing: sc_executor_common::sandbox: instantiate, time: 1 924 451, id: 20266748078981121, values: len=26132
```
However, once the modules are initialized the situation is very much reversed. Here's an example of handling a host function:
```
16:41:54.106 sc_tracing: sc_executor_common::sandbox::wasmi_backend: SandboxContext::invoke, time: 12 338, id: 524669906344476676, values: index=226
16:32:39.108 sc_tracing: sc_executor_common::sandbox::wasmer_backend: SandboxContext::invoke, time: 7 079, id: 524669906344476676, values: index=226

16:41:54.106 sc_tracing: sc_executor_common::sandbox::wasmi_backend: invoke_index, time: 40 192, id: 524669906344476675, values: index=0, args="RuntimeArgs([I32(142220)])"
16:32:39.108 sc_tracing: sc_executor_common::sandbox::wasmer_backend: wasmer::Function closure, time: 18 164, id: 524669906344476675, values: index=226
```
The topmost invoke (near the end of the log file) is also very much in favor of wasmer:
```
16:41:54.109 sc_tracing: sc_executor_common::sandbox::wasmi_backend: invoke, time: 2 803 069, id: 15763148451610626, values: state=1020704, export_name="call", args="[]"
16:32:39.109 sc_tracing: sc_executor_common::sandbox::wasmer_backend: invoke, time: 1 455 830, id: 15763148451610626, values: _state=1020704, export_name="call", args="[]"

16:41:54.109 sc_tracing: sc_executor_common::sandbox: invoke, time: 2 815 407, id: 24770347706351617, values: state=1020704, export_name="call", args="[]"
16:32:39.109 sc_tracing: sc_executor_common::sandbox: invoke, time: 1 464 897, id: 24770347706351617, values: state=1020704, export_name="call", args="[]"
```
Again, I should note that these measurements are based on the execution of a single transaction, so please take them with a grain of salt. Still, it contains a few dozen invocations and, I think, shows pretty convincingly that wasmer is good at executing stuff but really falls behind during the instantiation phase.
So, what gives? Unfortunately, our current setup is pretty straightforward: we instantiate the module right before making the call:
```rust
// Instantiate the instance from the instrumented module code and invoke the contract entrypoint.
let mut runtime = Runtime::new(ext, input_data, memory);
let result = sp_sandbox::default_executor::Instance::new(&code, &imports, &mut runtime)
    .and_then(|mut instance| instance.invoke(function.identifier(), &[], &mut runtime));
```
So, even though wasmer is able to execute code fast, it makes no practical difference in the current setup. This is especially visible in the smart contract scenario, where the code is small and often contains only a tiny number of instructions.
Thoughts and Ideas
- Research why wasmer init takes so long. Is it really because of compilation or maybe something else?
- Think about instance caching. If we were able to cache instantiated wasmer modules we would be able to execute them lightning fast. We already have an instrumentation cache; maybe we can cache wasmer instances as well?
- Concurrent execution? Theoretically we could run both backends concurrently and use whichever finishes first.
- Chunked execution/compiling akin to https://github.com/paritytech/wasmi/pull/262? If wasmer compiled only what it actually needs, the module would be instantiated faster.
- Maybe wasmer is slow at its own initialization and not at module instantiation? Could we preload or warm up its internals somehow?
- On-demand JIT compiling per function/module? Or even something weirder, like the heterogeneous approach in the HotSpot JVM (interpreter + compiler).
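The instance-caching idea can be sketched as a map from code hash to compiled module, so the expensive compilation runs only once per contract. All names here (`CompiledModule`, `ModuleCache`, `compile`) are hypothetical stand-ins for wasmer's `Module` and the surrounding integration, not real APIs:

```rust
use std::collections::HashMap;

// Hypothetical sketch of module caching: `CompiledModule` stands in for a
// compiled wasmer `Module`, and `compile` for the expensive singlepass step.
struct CompiledModule {
    native_code: Vec<u8>, // placeholder for the generated machine code
}

fn compile(wasm_blob: &[u8]) -> CompiledModule {
    // Stand-in for wasmer's millisecond-range compilation.
    CompiledModule { native_code: wasm_blob.to_vec() }
}

struct ModuleCache {
    modules: HashMap<u64, CompiledModule>,
    compilations: usize, // how often we actually paid the compilation cost
}

impl ModuleCache {
    fn new() -> Self {
        ModuleCache { modules: HashMap::new(), compilations: 0 }
    }

    // Return the cached module for `code_hash`, compiling only on first use.
    fn get_or_compile(&mut self, code_hash: u64, wasm_blob: &[u8]) -> &CompiledModule {
        if !self.modules.contains_key(&code_hash) {
            self.compilations += 1;
            self.modules.insert(code_hash, compile(wasm_blob));
        }
        &self.modules[&code_hash]
    }
}

fn main() {
    let mut cache = ModuleCache::new();
    // Repeated calls to the same contract hit the cache after the first one.
    for _ in 0..7 {
        cache.get_or_compile(42, b"\0asm");
    }
    println!("compilations: {}", cache.compilations);
}
```

This deliberately ignores the determinism and weight-accounting concerns raised elsewhere in this thread; it only illustrates the mechanics.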
Further Work
- Proper setup with aggregation of several calls.
- End-to-end measuring
CC @dvdplm, @athei
> I tried using `sp-sandbox` there but failed to get any output, even with the `with-tracing` feature enabled.
This makes me a bit sad. Is there an issue in substrate for this? I believe others have struggled with this as well in the past (cc @Imod7 perhaps?).
> isolated a single call to Flipper::flip to get an idea of how wasmi and wasmer compare.
I don't know how representative this call is, but imo it'd be good to repeat the benchmark with a few other function calls just to ensure that the overall conclusion is not overly skewed by the specific code in flip.
> Research why wasmer init takes so long. Is it really because of compilation or maybe something else?
This seems like the most pressing question. Is it easy to measure the actual compilation time vs compiler startup?
> I don't know how representative this call is, but imo it'd be good to repeat the benchmark with a few other function calls just to ensure that the overall conclusion is not overly skewed by the specific code in flip.

Well, it's the simplest contract call I can imagine, so I essentially wanted to draw a baseline here. It could easily happen that for the simplest of calls wasmi is faster because it exits early. However, what we see is that even on such a trivial method wasmer still executes faster than wasmi. Also, don't get me wrong, even such a tiny method makes (`cat wasmer2.txt | grep SandboxContext::invoke | wc -l`) 77 calls to `SandboxContext::invoke` (mostly index 226)! So it's definitely not a no-op.
On the other hand, we surely need to measure something like ERC20 for a real-world scenario, but I'm expecting it to be mostly the same: slower init, faster execution.
> This seems like the most pressing question. Is it easy to measure the actual compilation time vs compiler startup?
That's a question on its own.
P.S.: I've added some more entries to the thoughts and ideas section.
This suggests that we can rule out our theory that it is due to slow execution of host functions by wasmer. It is all about the startup, which was the most likely scenario anyway. Transformation into machine code takes time. I still would like to see this repeated with an ink! ERC20 transfer, though. Also, having this somehow integrated into substrate or some other repo so it is easily repeatable would be very nice. We don't want all your work to go to waste and then have to do it all over again the next time we need it. I think it would be nice to have this as some kind of regression test (doesn't need to be completely automated).
> Research why wasmer init takes so long. Is it really because of compilation or maybe something else?

I think that this is the next question that you should investigate. I guess this would mean profiling wasmer itself. All other thoughts about cutting down compile times are useless if we don't know whether that is actually the bottleneck.
> So, what gives? Unfortunately, our current setup is pretty straightforward: we instantiate the module right before making the call:

There is no other point in time at which we could possibly instantiate the module. We don't know which code to instantiate until someone makes a call (very different from the runtime itself). We can't do some basic LRU cache of modules, if that is what you are thinking about: the cache can be gamed (warm/cold can be different for different validators). Also, good luck putting weights on that.
> Think about instance caching. If we were able to cache instantiated wasmer modules we would be able to execute them lightning fast. We already have an instrumentation cache; maybe we can cache wasmer instances as well?

Caching machine code is problematic because it is platform dependent and the cache lives within the runtime. If we could introduce an IR to wasmer singlepass that is trivially translatable to all supported machines, this could be a target for caching. Probably some basic register machine bytecode. SSA is not possible because there is no O(1) transformation from wasm. On the other hand, not having an IR is kind of the point of a single-pass compiler, which reduces compilation time by leaving out this step. But having two O(1) transformations could be worth it for the sake of supporting caching.
> Chunked execution/compiling akin to https://github.com/paritytech/wasmi/pull/262? If wasmer compiled only what it actually needs, the module would be instantiated faster.

Having a dynamic recompiler that only compiles functions on entry (or even smaller chunks) could cut down on compile times if that is really what makes it expensive. The context switch between interpretation and compilation could pessimise performance, though. You would still pass in the whole module. I think this is essentially the same as your HotSpot idea.
Ok, so I've made some measurements that I'm now confident of.
Please note that I redid the tests using manual measurement, since I discovered that for some reason tracing introduced significant overhead, especially in the hot section of the code (`invoke_index`) where I tried to separately measure the time it takes to prepare, invoke and finalize the call. Overall, tracing output was quite unpredictable and not very reliable (with the exception of the `#[instrument]` macro), so I decided to ditch it.
Long story short: the performance is comparable, with the exception of module instantiation which is considerably slower on Wasmer. Everything else is more or less on par.
The tests were performed on the following branch: https://github.com/paritytech/substrate/commits/dk-tracing-executor.
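The manual measurement can be as simple as wrapping the interesting calls with `std::time::Instant` and printing the elapsed time; a minimal sketch of such a helper (the workload in `main` is just a placeholder):

```rust
use std::time::Instant;

// Minimal manual-timing helper: runs the closure, prints how long it took,
// and passes the closure's result through unchanged.
fn timed<T>(label: &str, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let result = f();
    println!("{label}: {:?}", start.elapsed());
    result
}

fn main() {
    // Placeholder workload standing in for e.g. module instantiation.
    let sum = timed("instantiate", || (0u64..1_000_000).sum::<u64>());
    println!("sum = {sum}");
}
```

Unlike `tracing`, this adds essentially no machinery around the hot path, at the cost of losing structured span nesting.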
Here are links to recorded Wasmi and Wasmer sessions where I did:
- `deploy` ERC20 contract with 10 initial units owned by ALICE
- `transfer` 1 unit from ALICE to BOB
- (everything else that https://polkadot.js.org/apps/#/contracts does during the session)
Here are the data grepped from the logs:
Calls to `deploy`
| wasmi | wasmer |
|---|---|
| 16.937696ms | 16.041995ms |
Calls to `instantiate`
| wasmi | wasmer |
|---|---|
| 892.447µs | 3.376242ms |
| 376.753µs | 1.215131ms |
| 858.383µs | 3.442349ms |
| 835.121µs | 3.695362ms |
| 813.823µs | 1.323876ms |
| 258.299µs | 1.348852ms |
| 276.585µs | 3.898192ms |
Calls to `call`
| wasmi | wasmer |
|---|---|
| 1.540576ms | 1.436548ms |
| 27.965549ms | 21.100177ms |
| 21.856403ms | 12.622645ms |
| 20.340723ms | 22.064872ms |
| 1.158249ms | 1.497089ms |
| 1.22862ms | 1.136338ms |
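To put a single number on the `instantiate` gap, we can average the columns of the table above (values converted to µs); this is only a rough summary over a small sample:

```rust
// Mean instantiate times taken directly from the table above, in µs.
fn instantiate_ratio() -> f64 {
    let wasmi = [892.447, 376.753, 858.383, 835.121, 813.823, 258.299, 276.585];
    let wasmer = [3376.242, 1215.131, 3442.349, 3695.362, 1323.876, 1348.852, 3898.192];
    let mean = |xs: &[f64]| xs.iter().sum::<f64>() / xs.len() as f64;
    mean(&wasmer) / mean(&wasmi)
}

fn main() {
    // On this sample, wasmer instantiation is roughly 4x slower than wasmi.
    println!("instantiate: wasmer/wasmi ≈ {:.1}x", instantiate_ratio());
}
```

So while `deploy` and `call` are roughly on par, instantiation in this particular session is around 4x slower on wasmer.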
All things combined make me think that the bottleneck is not in the execution engine but in the way we dispatch wasm calls. JIT shines when it's able to fully utilize modern superscalar hardware, especially speculative execution and cache prefetching. In our case that's impossible because execution keeps jumping back and forth between the substrate core and wasm. Of course, this is just my speculation, but I can't imagine other reasons why the interpreter and the compiler show similar results.
Interestingly, Wasmer is generally faster when doing `invoke_index`, but it's still not enough to compensate for the huge time loss during module instantiation.
If only we could cache instantiated modules, then we'd probably be able to squeeze some performance out of Wasmer. Still, I don't think it's practical, given that the performance win is marginal.
P.S.: I did another test session where I removed all logs from `invoke_index`, which is a pretty hot spot in the code, but the results remained mostly the same.
CC @dvdplm @athei
> All things combined make me think that the bottleneck is not in the execution engine but in the way we dispatch wasm calls.

I don't know how you come to this conclusion, assuming that by "wasm calls" you mean imported function calls. The values you measured for `invoke_index` even contradict this (wasmer isn't slower there). When looking at `instantiate` I see that wasmi pulls ahead, which is easily explained by the fact that `instantiate` (just does some basic initialization) executes much less of the contract code than `call` (which does an actual transfer). The ratio of executed code to compiled code is very much in favor of wasmi for `instantiate`.
> JIT shines when it's able to fully utilize modern superscalar hardware, especially speculative execution and cache prefetching. In our case that's impossible because execution keeps jumping back and forth between the substrate core and wasm.

Again, this doesn't necessarily follow from the observed numbers. Compilation performance could be the culprit.
> Of course, this is just my speculation, but I can't imagine other reasons why the interpreter and the compiler show similar results.

From what I am seeing here, I think the most likely scenario is the obvious one: faster execution of the compiled code cannot recuperate the big up-front cost of code compilation. This is especially true if just a small part is executed (compare `instantiate` vs. `call`), because wasmer always compiles the whole module.
> If only we could cache instantiated modules, then we'd probably be able to squeeze some performance out of Wasmer. Still, I don't think it's practical, given that the performance win is marginal.

If we assume that compilation is the culprit, then caching would make all the difference.
> Thoughts and Ideas

You asked me to comment on them. I will do this here. But in general I think this issue should only be about assessing the status quo. All of those ideas are massive undertakings and shouldn't be tackled on a hunch. First, we need to know what is going on. Otherwise we could be optimizing at the completely wrong end. We need to verify our assumptions.
- Yes, absolutely, please find that out. This is very useful information and it would be awesome if we could get our hands on it.
- Discussed this in my previous comment.
- Having concurrency at the runtime level is currently not supported by substrate. Also, it is very difficult to assign weights to that if we just "let things run". We can't do something without knowing what its cost is. But in general, having some way to use the interpreter and the compiler for different use cases could make sense. Maybe just offload that to the contract authors to assess performance and select the one they want. That would make it deterministic. But I consider the other ideas much more promising.
- Discussed this in my previous comment.
- It would be awesome if you could find that out. Then we could use that API to warm it up within the client and keep it in memory. It could be related to memory allocation. I know this is a big topic for the wasmtime executor: we don't re-zero all the memory when we recycle instances, but use memory mapping operations to zero it in constant time. However, we can't use the same approach here, as we cannot cache instances because those are connected to a fixed module. In contracts we don't know the modules a priori as we do for the runtime. So in case we go down this route it needs to be done differently, like caching only the linear memory and not the instances. However, linear memory is very small for contracts, so it is unlikely to be the culprit. This all assumes that it is not compilation which is causing this.
- I consider this to be basically the same thing as 4. We can't make it as clever as HotSpot because it all needs to be deterministic. We need a mapping from compiled code in bytes to weight, and a worst-case estimation for this mapping.
> From what I am seeing here, I think the most likely scenario is the obvious one: faster execution of the compiled code cannot recuperate the big up-front cost of code compilation. This is especially true if just a small part is executed (compare `instantiate` vs. `call`), because wasmer always compiles the whole module.
Yes, you are absolutely right. Maybe I failed to express my point, but my conclusions are mostly the same.
> I don't know how you come to this conclusion, assuming that by "wasm calls" you mean imported function calls. The values you measured for `invoke_index` even contradict this (wasmer isn't slower there). When looking at `instantiate` I see that wasmi pulls ahead, which is easily explained by the fact that `instantiate` (just does some basic initialization) executes much less of the contract code than `call` (which does an actual transfer). The ratio of executed code to compiled code is very much in favor of wasmi for `instantiate`.

My understanding is that compiled code should be an order of magnitude faster, which is definitely not the case, as we see in the measurements. Yes, `invoke_index` is generally faster when executed by wasmer, but it's not that much faster. So I thought maybe something is preventing wasmer from executing at full speed. Something like the need to perform host function calls. We even read and write sandbox memory using host functions.
> My understanding is that compiled code should be an order of magnitude faster, which is definitely not the case, as we see in the measurements.

Correct me if I am wrong: your measurements always measure compilation + execution. This means that even if the execution is much faster, it could still be offset by some very slow compilation. So you can't conclude from that that the execution isn't a magnitude faster. For all we know, it could be. As a matter of fact, it is suggested by the synthetic instruction benchmarks I posted in the top post.
> Yes, `invoke_index` is generally faster when executed by wasmer, but it's not that much faster.

The same speed is completely fine there, assuming that the rest of the code executes much faster. Assuming that imported function calls are roughly the same, that should still yield much better performance. But we can't observe any speedup, and I still think that compilation costs are the most likely explanation: it is the most obvious one and doesn't contradict any measurement.
> So I thought maybe something is preventing wasmer from executing at full speed. Something like the need to perform host function calls. We even read and write sandbox memory using host functions.

I don't get why you think it should be that. I mean, it could be, but compilation seems more likely. You could test this assumption by writing two contract calls where one does many more host function calls and the other many more arithmetic instructions, stack accesses, etc.
> Correct me if I am wrong: your measurements always measure compilation + execution. This means that even if the execution is much faster, it could still be offset by some very slow compilation. So you can't conclude from that that the execution isn't a magnitude faster.

I can be wrong too, but my current understanding is that wasmer does the full compilation during module creation, which accepts the wasm bytecode and a Store object. It then does sanity checks and the actual compilation in `wasmer::sys::module::Module::from_binary_unchecked`. Theoretically something could be done during the first invocation in a JIT manner, but I doubt that's the case with Singlepass.
So, effectively, everything related to native code generation is done during module creation. Everything else deals with an already compiled native module.
I have done some quick checks and indeed, the data backs up my theory:
| module creation | module imports | module instantiation | total |
|---|---|---|---|
| 3.025098ms | 34.248µs | 85.161µs | 3.144507ms |
| 1.724811ms | 19.01µs | 40.881µs | 1.784702ms |
| 3.437267ms | 43.125µs | 95.834µs | 3.576226ms |
| 3.639528ms | 38.441µs | 87.024µs | 3.764993ms |
| 3.662501ms | 51.664µs | 92.641µs | 3.806806ms |
| 2.62689ms | 31.589µs | 80.124µs | 2.738603ms |
| 1.768464ms | 19.011µs | 46.197µs | 1.833672ms |
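Summing the columns of this table confirms where the time goes; a quick check with the numbers copied from the table (in ms):

```rust
// Per-session totals from the table above (milliseconds).
fn module_creation_share() -> f64 {
    let creation = [3.025098, 1.724811, 3.437267, 3.639528, 3.662501, 2.62689, 1.768464];
    let total = [3.144507, 1.784702, 3.576226, 3.764993, 3.806806, 2.738603, 1.833672];
    creation.iter().sum::<f64>() / total.iter().sum::<f64>()
}

fn main() {
    // Roughly 96% of the whole instantiation pipeline is module creation,
    // i.e. the compilation step.
    println!("module creation share: {:.1}%", module_creation_share() * 100.0);
}
```

In other words, import resolution and the actual instantiation are noise compared to compilation.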
Please see the `instantiate` function. First, it creates a wasmer module by passing the wasm bytecode to it, then it populates import tables to tie modules together and expose host functions. Finally, it instantiates the module and then creates our `BackendInstance`.
Note that the majority of the time is spent during module creation, where the compilation takes place. Also note that this procedure is pure, in the sense that it depends on nothing but the wasm blob. If only we could cache it, then all subsequent module instantiations would happen in the microsecond range. Even if we cached the compiled module for one block only, we would already be able to shave off some time. Even in my test scenario the module gets compiled 7 times.
Okay, I got confused because I didn't understand what is shown in the tables. We cleared that up over Element:
- `deploy`: calling the `deploy` exported function of a contract.
- `instantiate`: calling the `instantiate` function of the wasmer backend (compiles the code and creates the instance); no contract code is executed.
- `call`: calling the `call` exported function of a contract; does an ERC20 transfer.
Given this information, it is indeed curious why, even with all the code compiled, wasmer doesn't outperform wasmi by a huge margin. Compilation is already factored out.
Wasmer looks really bad here: it takes a long time to create the module and then doesn't perform faster at execution.
But why would it be the host function calls? You measured them as roughly equal in time. I guess the A/B test I mentioned would be the next step: Comparing a call with many vs. few host function calls.
> But why would it be the host function calls? You measured them as roughly equal in time. I guess the A/B test I mentioned would be the next step: Comparing a call with many vs. few host function calls.

I made some more tests.
The `deploy` call takes ≈17ms. I added together the time of all `invoke_index` calls that happen during the deploy; they sum to 13 420.98µs ≈ 13.4ms. So the absolute majority of the time is spent in those 1600 host function calls.
In such a scenario, even if wasmer performed 1000 times faster than wasmi, it could only affect the remaining ≈3.5ms. Even in the ideal case the deployment would take no less than 13.4ms + ε.
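This is essentially Amdahl's law: the time spent inside host function calls is not touched by a faster execution engine, so it bounds the achievable speedup. With the numbers above:

```rust
// Amdahl-style bound for `deploy`: ~13.4ms of host function calls cannot be
// sped up by the execution engine, so even an infinitely fast engine only
// removes the remaining part of the ~16.9ms total.
fn max_deploy_speedup() -> f64 {
    let deploy_ms = 16.937696; // `deploy` time from the table above
    let host_calls_ms = 13.42098; // summed `invoke_index` time during deploy
    deploy_ms / host_calls_ms
}

fn main() {
    println!("max possible deploy speedup: {:.2}x", max_deploy_speedup());
}
```

So no execution engine, however fast, can make this `deploy` more than ~1.26x faster without also reducing the host-call cost itself.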
On the other hand, for other calls it's not that bad.
Overall, I think that host function overhead explains why wasmer does not show a 10x speedup in all tests. Yet that alone cannot explain why wasmer is sometimes slower, even when we factor out the compilation.
1600 host function calls for a single deploy of an ERC20 contract? That seems too much. Or is it a batch of calls?
> Overall, I think that host function overhead explains why wasmer does not show a 10x speedup in all tests. Yet that alone cannot explain why wasmer is sometimes slower, even when we factor out the compilation.

Agreed. It does explain why it isn't able to outperform by a huge margin, but not why it is slower. Unless the host functions were much slower on wasmer. Maybe the slowness isn't correctly captured in your measurements because the context switch makes the following non-host-function instructions slower (cold caches due to a lot of code being shuffled around). I think you already mentioned this theory somewhere.
> 1600 host function calls for a single deploy of an ERC20 contract? That seems too much. Or is it a batch of calls?

Unfortunately not. It's for a single deployment. I also think that's a bit too much, but I don't know the contract internals. The wasmi log shows the same story, so it's definitely not Wasmer's fault.
Also, the `call` that takes 22.064872ms in the table above makes even more: 2244 host function calls that take 8.4ms in total. So it's not specific to `deploy` alone.
> Maybe the slowness isn't correctly captured in your measurements

Yes, I think that is the case, not to mention that the massive log output introduces delays of its own. I think we now need to aggregate data over many invocations, with and without host function calls, to say for sure.
> Unfortunately not. It's for a single deployment. I also think that's a bit too much, but I don't know the contract internals. The wasmi log shows the same story, so it's definitely not Wasmer's fault.
I assume it is the gas metering then. It generates one host function call for every basic block. I think what we can learn from that is that adding disk caching to wasmer (with an IR language) + implementing https://github.com/paritytech/wasm-instrument/issues/11 could make wasmer competitive.
> I assume it is the gas metering then. It generates one host function call for every basic block. I think what we can learn from that is that adding disk caching to wasmer (with an IR language) + implementing https://github.com/paritytech/wasm-instrument/issues/11 could make wasmer competitive.
To confirm that assumption @0x7CFE can disable gas metering and do one more measurement.
Also, to avoid the influence of debug logs, you can measure the total time of the execution and the time to init the executor. The difference is the time of call execution.
> To confirm that assumption @0x7CFE can disable gas metering and do one more measurement.
I no longer work on this.
Yes I also thought of this. Redoing this with disabled gas metering would be interesting to confirm the theory.
No one is working on this anymore. It was de-prioritized because we are going with in-runtime wasmi for now, which will get some performance updates soon.
With this pr here: https://github.com/paritytech/substrate/pull/12501 we can probably close this?
Yea. I don't think wasmer is happening any time soon.