🐛 BUG: signal #11: Segmentation Fault after a few minutes of usage
Which Cloudflare product(s) does this pertain to?
Workers Runtime
What version(s) of the tool(s) are you using?
3.28.1
What version of Node are you using?
20.9.0
What operating system and version are you using?
Mac Darwin Kernel Version 23.0.0
Describe the Bug
Observed behavior
✘ [ERROR] *** Received signal #11: Segmentation fault: 11
stack:
╭──────────────────────────────────────────────────────────────────────────────────────────────────╮
│ [b] open a browser, [d] open Devtools, [l] turn off local mode, [c] clear console, [x] to exit │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
I get ^^ after a few minutes of running. After this, wrangler dev is effectively dead but oddly not 'crashed'; I have to kill it by hitting 'x' and restart.
Command I use to start:
wrangler dev --port 8787 --var GIT_HASH:$(git rev-parse HEAD) ./src/index.ts
Expected behavior
No segfault, and if there is a segfault it should probably come with a stack trace.
It would also be easier for me if a segfault actually crashed the process, so I could have a watchdog that restarts it. Crashing but lingering in a 'zombie' state is much worse than exiting, ideally with a non-zero error code.
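A minimal watchdog sketch, assuming wrangler dev were to exit with a non-zero code on a crash (which it currently does not); the port and entry point simply mirror the command above:

// Hypothetical watchdog: respawn `wrangler dev` whenever it exits abnormally.
// Only useful once a segfault actually terminates the process with a non-zero code.
import { spawn } from "node:child_process";

function startWranglerDev(): void {
  const child = spawn("npx", ["wrangler", "dev", "--port", "8787", "./src/index.ts"], {
    stdio: "inherit",
  });
  child.on("exit", (code, signal) => {
    if (code !== 0) {
      console.error(`wrangler dev died (code=${code}, signal=${signal}); restarting in 1s...`);
      setTimeout(startWranglerDev, 1_000);
    }
  });
}

startWranglerDev();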
Steps to reproduce
- Start with wrangler dev
- Several minutes of happiness as everything appears to work fine
- After a few minutes (not sure of the exact timing, but somewhere after roughly 5 minutes of use) I get the segfault
Not sure, but this is possibly related to bug #4562.
Likely this is just a bug I am only now able to see because #4562 has been resolved.
I am using a DurableObject and WebSockets for most of my application communication.
If I constantly ping the worker over the WebSocket so that it never hibernates, then I do not see this issue (similar to #4562).
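A minimal sketch of that kind of client-side keep-alive, assuming a /ws endpoint and a 10-second interval (both placeholders, not my actual code):

// Keep-alive sketch: ping often enough that the Durable Object holding the socket
// never hibernates. Endpoint path and interval are placeholders.
const ws = new WebSocket("ws://localhost:8787/ws");
ws.addEventListener("open", () => {
  const timer = setInterval(() => ws.send(JSON.stringify({ type: "ping" })), 10_000);
  ws.addEventListener("close", () => clearInterval(timer));
});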
Since this appears to be a deep flaw in hibernation that is difficult to resolve, I respectfully request, as a quick fix, a flag to disable hibernation until that feature is more stable, or possibly disabling hibernation altogether in 'dev' until the issue is resolved.
It feels like the team working on workers is a bit over-stretched with the resources it has available, and the turn-around time on what I would consider major bugs is too slow. Perhaps it is better to reduce the capabilities of 'wrangler dev' until the dev team working on it can catch up with the bugs.
Please don't take the above as a ding on the hard work of the team. I think wrangler and the Cloudflare products are awesome, but at some point, if the quality of the dev experience is too poor, it forces developers using this product to seek alternatives out of sheer necessity to get work done. I have enough bugs in my own code; I don't have time to devote to bugs in the tools I use to get work done. :)
Please provide a link to a minimal reproduction
No response
Please provide any relevant error logs
No response
Hi @matthewjosephtaylor, thanks for creating this separate issue! Problems like these are hard to debug, reproduce, and fix, and we thank you for your patience and willingness to work with us to resolve it.
The segfault obviously isn't providing much info, and there's no stack trace. Could you please provide the log file that this instance of wrangler says it's writing to? It will contain debug log messages which aren't shown in your terminal, there might be a stack trace in there, and it will have more info about the internal state of wrangler's servers.
It's not clear yet whether hibernation is actually the issue. In the meantime, you can patch wrangler so that it passes this flag to miniflare, to validate whether hibernation really is causing your issue (a standalone sketch of what the flag does follows the steps below):
- Create a file patches/wrangler+3.28.1.patch at the root of your repo with these contents:
diff --git a/node_modules/wrangler/wrangler-dist/cli.js b/node_modules/wrangler/wrangler-dist/cli.js
index 310df39..104ef04 100644
--- a/node_modules/wrangler/wrangler-dist/cli.js
+++ b/node_modules/wrangler/wrangler-dist/cli.js
@@ -128096,7 +128096,7 @@ function buildMiniflareBindingOptions(config) {
durableObjects: Object.fromEntries(
internalObjects.map(({ class_name }) => [
class_name,
- { className: class_name, scriptName: getName(config) }
+ { className: class_name, scriptName: getName(config), unsafePreventEviction: true }
])
),
// Use this worker instead of the user worker if the pathname is
@@ -128304,7 +128304,8 @@ async function buildMiniflareOptions(log2, config, proxyToUserWorkerAuthenticati
...sitesOptions
},
externalDurableObjectWorker
- ]
+ ],
+
};
return { options: options25, internalObjects };
}
- Run: npx patch-package
- Use wrangler dev and your repro instructions as usual
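For reference, unsafePreventEviction is the per-object option Miniflare accepts in its durableObjects config (the same shape the patch above produces). A rough standalone sketch of the equivalent setting, where the bundle path, binding name, and class name are placeholders:

// Rough equivalent of the patched wrangler behaviour, configured via Miniflare directly.
// The bundle path, binding name, and class name below are placeholders, not taken from the repro.
import { Miniflare } from "miniflare";

const mf = new Miniflare({
  modules: true,
  scriptPath: "./dist/index.js",
  durableObjects: {
    MY_OBJECT: {
      className: "MyDurableObject",
      unsafePreventEviction: true, // keep the object resident instead of letting it hibernate
    },
  },
});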
@RamIdeas here is what I find in the wrangler log inside ~/.wrangler/logs. The first two log messages are from my application's own logging. From the logs it appears my perception was off and the segfault happens much sooner than 5 minutes after the last interaction, more like a little over a minute.
--- 2024-02-13T23:49:43.493Z info
app-message [ 'dataLink:find' ]
---
--- 2024-02-13T23:49:43.493Z info
dataLinkFindListener: chat-message-1707868183441-fe513a67-16aa-49ad-91a0-24476c97c590 function-call-result
---
--- 2024-02-13T23:50:59.962Z error
✘ [ERROR] *** Received signal #11: Segmentation fault: 11
stack:
---
I'll give the patch a go and see if this helps and will report after I've tried it out.
Ran patch:
$ npx patch-package
patch-package 8.0.0
Applying patches...
wrangler@3.28.1 ✔
I now see this upon running wrangler dev:
⎔ Starting local server...
✘ [ERROR] Error reloading local server: MiniflareCoreError [ERR_DIFFERENT_PREVENT_EVICTION]: Multiple unsafe prevent eviction values defined for Durable Object "AppStateDurableObject" in "core:user:ai-worker": true and undefined
at getDurableObjectClassNames
(/Users/mtaylor/checkouts/MJT/app/ai-worker/node_modules/miniflare/dist/src/index.js:7785:17)
at #assembleConfig
(/Users/mtaylor/checkouts/MJT/app/ai-worker/node_modules/miniflare/dist/src/index.js:8265:37)
at async #assembleAndUpdateConfig
(/Users/mtaylor/checkouts/MJT/app/ai-worker/node_modules/miniflare/dist/src/index.js:8434:20)
at async Mutex.runWith
(/Users/mtaylor/checkouts/MJT/app/ai-worker/node_modules/miniflare/dist/src/index.js:3405:16)
at async #waitForReady
(/Users/mtaylor/checkouts/MJT/app/ai-worker/node_modules/miniflare/dist/src/index.js:8530:5)
at async #onBundleUpdate
(/Users/mtaylor/checkouts/MJT/app/ai-worker/node_modules/wrangler/wrangler-dist/cli.js:128360:20)
at async Mutex.runWith
(/Users/mtaylor/checkouts/MJT/app/ai-worker/node_modules/miniflare/dist/src/index.js:3405:16) {
code: 'ERR_DIFFERENT_PREVENT_EVICTION',
cause: undefined
}
and my application is effectively dead, as the WebSocket connection hangs forever waiting to connect.
Here is the snippet from the wrangler log file I'm comfortable sharing publicly:
- GIT_HASH: "(hidden)"
- ./src/index.ts: "(hidden)"
---
--- 2024-02-14T00:03:54.388Z log
⎔ Starting local server...
---
--- 2024-02-14T00:03:54.399Z debug
[InspectorProxyWorker] handleProxyControllerIncomingMessage {"type":"reloadStart"}
---
--- 2024-02-14T00:03:54.403Z error
✘ [ERROR] Error reloading local server: MiniflareCoreError [ERR_DIFFERENT_PREVENT_EVICTION]: Multiple unsafe prevent eviction values defined for Durable Object "AppStateDurableObject" in "core:user:ai-worker": true and undefined
at getDurableObjectClassNames
(/Users/mtaylor/checkouts/MJT/app/ai-worker/node_modules/miniflare/dist/src/index.js:7785:17)
at #assembleConfig
(/Users/mtaylor/checkouts/MJT/app/ai-worker/node_modules/miniflare/dist/src/index.js:8265:37)
at async #assembleAndUpdateConfig
(/Users/mtaylor/checkouts/MJT/app/ai-worker/node_modules/miniflare/dist/src/index.js:8434:20)
at async Mutex.runWith
(/Users/mtaylor/checkouts/MJT/app/ai-worker/node_modules/miniflare/dist/src/index.js:3405:16)
at async #waitForReady
(/Users/mtaylor/checkouts/MJT/app/ai-worker/node_modules/miniflare/dist/src/index.js:8530:5)
at async #onBundleUpdate
(/Users/mtaylor/checkouts/MJT/app/ai-worker/node_modules/wrangler/wrangler-dist/cli.js:128360:20)
at async Mutex.runWith
(/Users/mtaylor/checkouts/MJT/app/ai-worker/node_modules/miniflare/dist/src/index.js:3405:16) {
code: 'ERR_DIFFERENT_PREVENT_EVICTION',
cause: undefined
}
---
Hopefully that sheds some light on the issue.
I will say that I've written a fairly complex application making use of D1 and DurableObjects to save state. Perhaps the fact that it failed to load my app at all is a sign I'm outside the norm of what most people are doing. I'll note that I've been running/developing this application for several months now, and as far as I can tell I am not facing any issues in production.
Hi @matthewjosephtaylor can you provide the full log file please? The snippet you provided didn't have enough context to helpfully guide debugging.
Also, I think the patch I gave for wrangler wasn't enough. Can you try this patch of miniflare instead? If you use it, you shouldn't need the wrangler patch:
filepath: patches/miniflare+3.20240129.1.patch
diff --git a/node_modules/miniflare/dist/src/index.js b/node_modules/miniflare/dist/src/index.js
index 34b7521..2474c43 100644
--- a/node_modules/miniflare/dist/src/index.js
+++ b/node_modules/miniflare/dist/src/index.js
@@ -6506,7 +6506,7 @@ Ensure ${stringName} doesn't include unbundled \`import\`s.`
// path when persisting to the file-system. `-` is invalid in
// JavaScript class names, but safe on filesystems (incl. Windows).
uniqueKey: unsafeUniqueKey ?? `${options.name ?? ""}-${className}`,
- preventEviction: unsafePreventEviction
+ preventEviction: true
};
}
),
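(Side note, and an assumption about your local setup rather than anything from this thread: if you want patch-package to re-apply the patch automatically after every install, its documented pattern is adding "postinstall": "patch-package" to the scripts section of package.json.)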
@RamIdeas the patch worked.
After removing the previous wrangler patch, I ran:
$ npx patch-package
patch-package 8.0.0
Applying patches...
miniflare@3.20240129.1 ✔
I think this 'fixes' the problem. I'm not experiencing any segfaults or other issues after applying the patch.
I realize this isn't a 'real fix' but I can confidently report back that I'll be rocking this patch until the real fix is implemented. THANK YOU for getting that patch to me. #lifechanging
Unfortunately I'm not comfortable sharing the full log, especially in a public forum. If you want to let me know what in particular you are looking for (perhaps provide a series of greps?) I can root around and see what I can pull out.
I will run with this patch today and report back if there are any additional issues/gotchas I discover after longer-term use.
I would of course like to help you finalize the fix for this. So if there are further patches you want me to run to test potential fixes, I'd be onboard with trying them out.
Follow up after running the patch for a few days:
- No more issues with segfaults, or related issues.
- I discovered that I rely on DO hibernation to clean up some ephemeral storage, since I did not expect the DO to remain alive for more than 30s. So I feel like a 'turn off hibernation' flag might prove useful as a sort of stress test of my application, as I discovered a potential application bug while in this 'no hibernation' mode (see the sketch below).
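To illustrate that last point, a simplified sketch (names invented, not my actual application code) of the at-risk pattern: in-memory scratch state inside a Durable Object that is implicitly discarded when the object hibernates or is evicted, but lives on while eviction is prevented.

// Hypothetical illustration only; class and field names are invented.
// In-memory state like this silently disappears when the DO hibernates or is evicted,
// but sticks around (and can keep growing) while eviction is prevented.
export class ExampleDurableObject {
  private ephemeral = new Map<string, number>(); // implicit "cleanup" relied on eviction

  constructor(private state: DurableObjectState, private env: unknown) {}

  async fetch(request: Request): Promise<Response> {
    this.ephemeral.set(crypto.randomUUID(), Date.now()); // per-request scratch data
    return new Response("ok");
  }
}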
Some of us have also been watching this issue over on the workerd side: https://github.com/cloudflare/workerd/issues/1422
The thinking over there is that it has something to do with hibernation (the same conclusion y'all came to over here).
Will be giving this patch a go. I've heard some complaints about WebSockets disconnecting in production, but I've yet to confirm whether it's something about their internet connection or this issue.
EDIT: Yep, disabling eviction works.
I believe this has been fixed as of https://github.com/cloudflare/workerd/issues/1422#issuecomment-2075680936, so feel free to close the issue.
Confirming that the latest wrangler has fixed this error for me. @RamIdeas, thanks for all of your hard work!
Glad to hear it @matthewjosephtaylor! And thanks again for your patience!