InvalidWorkerCreation: Edge functions cannot handle concurrent requests

Open nathanaeng opened this issue 1 year ago • 19 comments

Bug report

  • [X] I confirm this is a bug with Supabase, not with my own application.
  • [X] I confirm I have searched the Docs, GitHub Discussions, and Discord.

Describe the bug

Making concurrent requests to a Supabase edge function will result in InvalidWorkerCreation errors or 502 errors.

To Reproduce

Steps to reproduce the behavior, please provide code snippets or a repository:

  1. Using the Supabase CLI, create a new function with supabase functions new test_concurrency. Here is an example of a function I have (I realize the createClient is not used):
import "jsr:@supabase/functions-js/edge-runtime.d.ts"
import { createClient } from 'jsr:@supabase/supabase-js@2';

console.log("Hello from Functions!")

Deno.serve(async (req) => {
  const supabaseClient = createClient(
    Deno.env.get('SUPABASE_URL') ?? '',
    Deno.env.get('SUPABASE_SERVICE_ROLE_KEY') ?? '',
  );
  const { name } = await req.json()
  const data = {
    message: `Hello ${name}!`,
  }

  return new Response(
    JSON.stringify(data),
    { headers: { "Content-Type": "application/json" } },
  )
})
  2. Run supabase functions serve

  3. In a new terminal tab, execute this bash script which sends 200 concurrent requests, replacing SERVICE_ROLE_KEY with your service role key:

#!/bin/bash
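# Fire 200 concurrent requests at the function:
#   -P0 lets xargs run as many curl processes in parallel as possible
#   -n1 -I{} makes each invocation send exactly one request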

seq 1 200 | xargs -n1 -P0 -I{} curl -L -X POST 'http://localhost:54321/functions/v1/test_concurrency' -H 'Authorization: Bearer SERVICE_ROLE_KEY' --data '{"name":"Example"}'
  4. Notice how it will successfully execute the function for the first 100 or so requests before erroring in the supabase functions serve tab:
InvalidWorkerCreation: worker did not respond in time
    at async UserWorker.create (ext:sb_user_workers/user_workers.js:145:15)
    at async Object.handler (file:///root/index.ts:154:22)
    at async respond (ext:sb_core_main_js/js/http.js:163:14) {
  name: "InvalidWorkerCreation"
}

with the following error message on the tab that executes the test script:

{"code":"BOOT_ERROR","message":"Worker failed to boot (please check logs)"}

Expected behavior

I would expect the edge function to be able to handle concurrent requests to this degree.

System information

  • OS: macOS, M3 Max
  • Version of Supabase CLI: 1.192.5, using supabase-edge-runtime-1.58.2 (compatible with Deno v1.45.2)
  • Version of Node.js: 18

Additional context

From my understanding, edge functions can be used to serve API routes, and in a production application it is perfectly reasonable for 200 users to hit the same endpoint at the same time. This example uses an edge function with minimal computation. If you add database reads, a text embedding call using Supabase.ai's gte-small model, and a database write, it can handle even fewer concurrent requests (around 40 in my testing). I first noticed this issue because I wanted to generate text embeddings for seed data consisting of only 40 users (triggered on inserts to a table), and it failed for every user.

I'm not entirely sure how edge functions work; maybe a worker is being reused to handle multiple requests and then a CPU limit or something similar is hit, resulting in failures. But I thought the whole idea of edge functions was to scale up with requests, and a mere 200 requests is nothing.

At first I thought this could be a problem with local Supabase running in Docker, but I confirmed it also occurs on a remote, Supabase-hosted project, where I get 502 errors after the first 50-100 requests or so.

nathanaeng avatar Sep 15 '24 01:09 nathanaeng

I have encountered a similar issue when trying to call an edge function multiple times concurrently. In my case, making a lot of calls resulted in InvalidWorkerCreation errors or 502 errors. It seems that the scaling ability of edge functions might be limited, and this significantly impacts performance when concurrent requests spike.

I feel like other serverless platforms can handle concurrent requests with ease, yet edge functions can't even handle 50? Is Supabase not equipped to handle more than 50 concurrent requests? It seems as if the edge runtime is attempting to create a worker for every single request rather than queuing, or using some other mechanism to resolve concurrency at scale.

ethan-dinh avatar Sep 15 '24 03:09 ethan-dinh

Hello @nathanaeng and @ethan-dinh

Given the function code and bash script you posted in the description, and assuming you're using the default edge runtime policy settings in supabase/cli, I can explain why the edge runtime shows such low request throughput.

The edge runtime has three main scheduling policies for workers (per_worker, per_request, oneshot), and for developers' convenience, supabase/cli defaults to the one policy that hosted Supabase Edge Functions does not use: the oneshot policy.

Unlike the other policies, the oneshot policy does not reuse workers; it creates a new worker for each request and forwards the request to it, even when requests target the same service path. supabase/cli chose this as the default because developers can change the source code at any time, and a fresh worker guarantees that the next request reflects the changed code. For the same reason it is highly inefficient, so it is not used in production (or by hosted Supabase Edge Functions).

If you change the policy, I think you'll probably get a different result.

Using your code, I was able to reproduce your issue exactly on the oneshot policy locally, but I was also able to confirm that the per_worker policy is not affected by this issue.
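
If you are running the edge runtime binary directly (for example when self-hosting), the policy can be passed at startup. A minimal sketch, assuming the start subcommand and --main-service flag from the edge-runtime README; the service path is illustrative:

# run the runtime with worker reuse instead of oneshot
edge-runtime start --main-service ./examples/main --policy per_worker

In the self-hosted docker-compose setup, the equivalent would be adding the flag to the functions container's command, e.g. command: ["start", "--main-service", "/home/deno/functions/main", "--policy", "per_worker"].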

Of course, my experience doesn't guarantee that you won't have the same issue with Supabase Edge Functions.

Today, I came across a Reddit post discussing this same topic, and it seemed that the author was also experiencing these issues with Supabase Edge Functions.

My expectation is that the per_worker policy should handle these loads well, but it looks like it sometimes fails to properly forward heavy request traffic to the workers and just gives up. (Forgive me, I have very limited visibility into Edge Functions because I am not a member of the Supabase team.)

I have opened PR-382 to better handle this situation. Once it is merged, more specific request scheduling can be implemented on top of the per_worker policy, which I believe will mitigate these issues.

I will put this on my watchlist and will let you guys know if there are any updates on this issue in the future.

Have a great day!

nyannyacha avatar Sep 15 '24 13:09 nyannyacha

Thanks for the detailed response! Yep, I have looked into the per_worker policy, and while it might work fine for the simple edge function I provided above, it was failing for a more complex edge function that performs a read, a text embedding, and a write. I can't recall how many concurrent requests it was able to handle; it might have been a bit more than oneshot, but it was still underwhelming, unfortunately. Additionally, I was able to replicate this error on my remote (Supabase-hosted) project, which makes me think it's not just a local hosting issue. Thanks for helping though!

nathanaeng avatar Sep 15 '24 15:09 nathanaeng

Hello @nyannyacha, thanks for your detailed response. As someone who self-hosts edge functions separately (not together with the Supabase docker compose), where should I go about changing the policies you mentioned? I suspect it is in the main function's index.ts with forceCreate = true or false, but I am not sure, and I am still getting those 502 errors after 30-50 concurrent requests even with the forceCreate = false option. Can you help me figure out some other configurations in the main function that I can use to optimize scaling performance? I am running multiple replicas in my K8s deployment, but the replicas still cannot pass the load test because the edge runtime containers stop responding to requests and return 502 with the above error after a few concurrent requests.
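
For reference, this is the shape of main-service handler I mean; a sketch based on the examples in the edge-runtime repo, so the option names may differ between versions, and the paths here are illustrative:

// main/index.ts (sketch; EdgeRuntime is a global provided by the runtime,
// declared here only so the sketch is self-contained)
declare const EdgeRuntime: {
  userWorkers: {
    create(opts: Record<string, unknown>): Promise<{ fetch(req: Request): Promise<Response> }>;
  };
};

Deno.serve(async (req: Request) => {
  const url = new URL(req.url);
  // map /test_concurrency -> /home/deno/functions/test_concurrency
  const servicePath = `/home/deno/functions/${url.pathname.split("/")[1]}`;

  try {
    const worker = await EdgeRuntime.userWorkers.create({
      servicePath,
      memoryLimitMb: 150,
      workerTimeoutMs: 60_000, // wall clock limit per worker
      noModuleCache: false,
      envVars: [],
      forceCreate: false, // false = reuse an existing worker for this service path
    });
    return await worker.fetch(req);
  } catch (e) {
    return new Response(
      JSON.stringify({ code: "BOOT_ERROR", message: String(e) }),
      { status: 502, headers: { "Content-Type": "application/json" } },
    );
  }
});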

thurahtetaung avatar Sep 17 '24 08:09 thurahtetaung

Can confirm I'm experiencing the same thing: I'm working on a data-intensive edge function and running into this, along with some requests hanging locally. I also have a low-bandwidth, bottom-of-the-line MacBook for what it's worth, but I can confirm it reproduces on my end.

codingiswhyicry avatar Oct 15 '24 13:10 codingiswhyicry

I can also confirm that I'm experiencing this. I'm seeing it even with a small project; after a couple of requests, my colleagues and I get: Can't reach database server at aws-0-eu-central-1.pooler.supabase.com:5432

TSM540 avatar Oct 17 '24 20:10 TSM540

Is it possible for a single V8 isolate to handle multiple requests concurrently?

I created a small test with a function that sleeps for 25 seconds, under a wall clock time limit of 30 seconds. Additionally, I have forceCreate set to false, --max-parallelism 1, and --policy per_worker.

I then have a unit test that makes two concurrent requests to the function, sketched below.
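
A sketch of the repro (the URL and port are illustrative; the handler uses Deno.serve like the function earlier in this thread):

// sleep-test/index.ts: does 25 s of "work" (wall clock limit is 30 s)
Deno.serve(async () => {
  await new Promise((resolve) => setTimeout(resolve, 25_000));
  return new Response("done");
});

// test.ts: fire two requests at once against a runtime started with
// --policy per_worker --max-parallelism 1
const responses = await Promise.all([
  fetch("http://localhost:9000/sleep-test"),
  fetch("http://localhost:9000/sleep-test"),
]);
console.log(responses.map((r) => r.status)); // the second fails as described below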

The first request boots up a worker, while the second fails with the error message: failed to start service sleep-test: InvalidWorkerCreation: worker did not respond in time

If this is not expected behavior, I can provide more details about the test.

Another question, assuming CPU time limits are set to 0: if a worker receives a request with less than 1 second left on the wall clock timer, and the request takes longer than 1 second, will the request be retried on a different isolate, or will it always fail with some 500?

kyle-okami avatar Jan 28 '25 15:01 kyle-okami

Hello @kyle-okami

The first request boots up a worker, while the second fails with the error message: failed to start service sleep-test: InvalidWorkerCreation: worker did not respond in time

As you expected, if --max-parallelism is set to 1, the number of workers handling the service is fixed at 1. Nevertheless, this error message appears because a special limiter is in place to prevent the edge runtime from waiting indefinitely for a worker's response.

It is exposed as --request-wait-timeout, and the default value is 10000 milliseconds (10 seconds).

If you want to allow the edge runtime to wait for a worker that takes more than 10 seconds to respond, this flag must also be raised above 10 seconds.
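
For example, a sketch using the flags discussed in this thread (the service path is illustrative):

# wait up to 30 s for a worker to pick up a request
edge-runtime start --main-service ./examples/main \
  --policy per_worker --max-parallelism 1 \
  --request-wait-timeout 30000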

Another question, assuming CPU time limits are set to 0: if a worker receives a request with less than 1 second left on the wall clock timer, and the request takes longer than 1 second, will the request be retried on a different isolate, or will it always fail with some 500?

Firstly, if the CPU time limit is set to 0, it means that there is no limit. And even if the remaining wall clock time is less than 1 second, there is no problem processing the request.

This is because when a worker passes half of the wall clock limit you specified, it changes into a kind of retired state. In this state, the worker cannot receive new requests, and a newly created worker starts accepting requests instead. (Therefore, even if max-parallelism is 1, there are points in time when more than one worker exists.) For example, with a 30-second wall clock limit, a worker stops accepting new requests after 15 seconds while its replacement takes over.

I hope my answer has clarified things for you. Finally, I believe it is better to open a new issue if you have further questions.

Thank you!

nyannyacha avatar Jan 28 '25 21:01 nyannyacha

thank you @nyannyacha

kyle-okami avatar Jan 29 '25 00:01 kyle-okami

Are there any updates on this? I'm evaluating Supabase right now for my project, and this seems like a huge red flag if I'm going to need to migrate off Supabase to handle a larger volume of requests.

ninjz avatar Mar 11 '25 05:03 ninjz

Are there any updates on this? I'm evaluating Supabase right now for my project, and this seems like a huge red flag if I'm going to need to migrate off Supabase to handle a larger volume of requests.

For multiple reasons I recommend using Cloudflare Workers instead of Supabase Edge Functions. They are fast and cheap, have a big free tier, and scale automatically. You can use them as a reverse API proxy to hide your Supabase project URL from the Supabase JS client, or to run your business logic by calling the DB. Much better for production use cases.

LaszloDev avatar Mar 11 '25 09:03 LaszloDev

@nyannyacha do we have a way to change that while the instance is up? Currently my test suite needs it, but I still want live reload during dev. Having to shut down Supabase and restart just to change this is a pain.

riderx avatar May 25 '25 20:05 riderx

@riderx What environment are you using it in? There is a scheduling policy called oneshot that makes it possible to get new source changes every time you call the function.

nyannyacha avatar May 26 '25 00:05 nyannyacha

🔕 This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Aug 20 '25 02:08 github-actions[bot]

It would be good to get a maintainer's view on this. We keep getting "I'm not in the Supabase team" from maintainers, which is pretty frustrating. It would be good if Supabase stated in their public-facing user documentation that no Supabase employees support edge-runtime, so that if there is a bug, the answer is "sux4u".

AntonOfTheWoods avatar Aug 20 '25 03:08 AntonOfTheWoods

@AntonOfTheWoods @nyannyacha is a maintainer of Edge Runtime and part of the official Supabase team (he probably was not at the time he wrote that comment :)).

Have you tried the suggestions proposed above? I believe this is an issue you'd only experience in the self-hosted version, and it can be resolved by setting the configuration options suggested above.

laktek avatar Sep 08 '25 21:09 laktek

@laktek it's not really just self-hosted. All my environments run locally for dev and CI. In CI it's easy to set the policy to per_worker, and all tests, including the ones that run background tasks, will pass.

The real problem is when we work locally on a feature that uses background tasks. We cannot use “oneshot”, as it will fail and differs from the prod use case, and “per_worker” is very inefficient to work with because we have to reload the worker ourselves each time.

I would love another policy called “watch” that watches for file changes and forces the worker to reload, at minimum on file change, or better, waits until the worker is stale and then reloads; this would make a local setup much more efficient. I'd be just as happy with a CLI command that does this instead of a policy.

Right now it's hard to explain to my team that supabase start works, mostly, but not for certain cases, because we run oneshot to get a “live reload” environment.

I would prefer to tell them to run “supabase livereload” or “watch”; then it's clear.

riderx avatar Sep 09 '25 07:09 riderx

“per_worker” is very inefficient to work with because we have to reload the worker ourselves each time.

@riderx I think that's no longer the case. With this PR, auto-reload should work in the CLI for the per_worker policy as well. This was released in CLI version 2.41.0. If you're using the latest stable version of the CLI, it should work. Does it not work for you?

laktek avatar Sep 09 '25 08:09 laktek

🔕 This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Nov 09 '25 02:11 github-actions[bot]

The issue happens when running:

deno cache --reload index.ts

This command causes the error.

To fix it, reinstall Deno completely and do not run the cache command afterwards. In other words, after reinstalling Deno, do not run:

deno cache --reload index.ts

Ge6ben avatar Dec 09 '25 13:12 Ge6ben

@laktek yes, I can confirm it mostly works now; only new files or new edge functions require a stop and start.

riderx avatar Dec 11 '25 00:12 riderx