fix(workflow): Fix multiple context leaks in `reuseV8Context` executor
What was changed
- Fix multiple context leaks and bugs in the
reuseV8Contextexecutor:- …by reassigning a new object to an existing shared global variable:
globalThis.console = { ...globalThis.console, wfId: workflowInfo().workflowId }; - …by modifying one of Node's built in global objects:
globalThis.Number.a = workflowInfo().workflowId; - …by deleting a previously set global variable:
globalThis.a = 1; await sleep(1) ; delete globalThis.a ; await sleep(1) ; globalThis.a = (globalThis.a || 0) + 1; /* globalThis.a is 2 rather than 1 */
- …by reassigning a new object to an existing shared global variable:
I'm a bit confused about what is the ultimate goal and why this is needed. If we want to eliminate all side channels this is not going to work, e.g., you could have timing side-channels, and the v8 VM is not designed to eliminate side-channels. If the goal is that other untrusted code is running in the same reused v8 context and attacking other users by modifying the context, I'm not sure we are protecting all the primordial classes (a String, Number,...) either, and in any case, that's a very hard problem...
Is this just to help prevent some mistakes, nothing to do with security?
Is this just to help prevent some mistakes, nothing to do with security?
No, that's really not a security thing (that's simply impossible AFAIK with Node's vm). And we don't have a goal of preventing all possible escapes out of the VM.
The reuseV8Context is a performance optimization (roughly 66% reduction on RAM usage for cached workflows, and up to 50% on CPU usage). The trick is that instead of creating a new vm context for each workflow (which is slow and allocates a large number of objects), we use a single vm context, swapping global variables in and out before and after executing each activation.
However, there's been some reports of users doing uncommon but still legitimate and reasonable things in their workflow code, that would result in context from one workflow execution leaking into another workflow execution. Depending on use cases, that may result in non-deterministic behaviors, memory leaks, or some other oddities.
For example, one user recently reported that they were mocking functions in the console object inside their workflow code to redirect messages to the workflow log. They were doing that because their workflow was calling some outside library, not meant specifically for Temporal Workflows, which would print things its own logging through console.log. Agreed, that's not the best way of doing this, but still something I would have expected to be supported by our sandbox. Their code looked something like this:
async function myWorkflow() {
const workflowId = workflowInfo().workflowId;
globalThis.console = {
...console,
log: (message) => {
log(`${workflowId}: message`)
}
}
// ...
The problem here is that previously, when leaving a workflow's execution context, the Reusable VM's would not touch any global variable that existed right after start. In this case, that means that it would leave the console object in place, even though it was no longer the original, pristine console object! That would result in them seeing log messages with an incorrect workflowId.
Once you know it, it's quite easy to fix their code to avoid this issue (they did), but it's also an opportunity for us to make our engine more reliable.
There's also one test, "Shared global state is frozen", that existed before this PR that was actually not testing what we thought it was testing. That is, the test was asserting attempting to add a custom property to some global object from Node's default vm context would be rejected.
However, it was doing so by adding something to the setTimetout function, but it turns out that's insufficient as the globalThis.setTimeout is overridden by the SDK itself, resulting in it being present in the globals object exposed outside of the vm, where global variables that would not been modified inside the vm would not appear in the globals object (seems to be some kind of internal optimization done by Node/V8).
This PR fixes that bug; the test is now asserting the proper behavior against the Array global function instead.
The code changed significantly since the first round of reviews.
I'm closing this PR and opening a new one.