Wait for a function to become healthy in scale-up event
Signed-off-by: Alex Ellis (OpenFaaS Ltd) [email protected]
Description
Wait for a function to become healthy in scale-up event
Motivation and Context
Prior to this change, after scaling a function up and returning the API call, a function may still not be ready to serve traffic. This resulted in HTTP errors, for a percentage of the time, especially if the task was deleted instead of being just paused.
Pausing was instant, but during re-creation the function needs some time to start up.
This change puts a health check into the hot path for the scale event. It is blocking, so scaling up will have some additional latency, but will return with a ready endpoint much more of the time than previously.
This approach means that faasd doesn't have to run a set of exec or HTTP healthchecks continually, and use CPU for each of them, even when a function is idle.
How Has This Been Tested?
Tested with the nodeinfo function, by killing the task and then invoking the function. Prior to this, the function may give an error code some of the time.
If you want to test this patch, follow these instructions:
https://github.com/openfaas/faasd/blob/master/docs/PATCHES.md
Then deploy your function, kill its task, and then invoke it. Let's say you deployed bot as a function
faas-cli deploy -f bot.yml
# View the task
sudo ctr -n openfaas-fn task ls
# Delete the task by killing the process
sudo ctr -n openfaas-fn task rm bot -f
# Invoke the function to cause a scale up (and the new code that waits until the function is healthy)
time curl -i http://127.0.0.1:8080/function/bot
If you're using an openfaas template, it already implements a health endpoint, there's no need to add anything to your code.
Types of changes
- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
Checklist:
Commits:
- [x] I've read the CONTRIBUTION guide
- [x] My commit message has a body and describe how this was tested and why it is required.
- [x] I have signed-off my commits with
git commit -sfor the Developer Certificate of Origin (DCO)
Code:
- [x] My code follows the code style of this project.
- [x] I have added tests to cover my changes.
Docs:
- [ ] My change requires a change to the documentation.
- [ ] I have updated the documentation accordingly.
Happy Path
Check currently running tasks:
$ sudo ctr -n openfaas-fn task ls
TASK PID STATUS
nodeinfo 3914 RUNNING
env 4056 RUNNING
derek-sha256 4452 RUNNING
Lets rm env:
sudo ctr -n openfaas-fn task rm env -f
WARN[0000] task env exit with non-zero exit code 137
Check its gone:
$ sudo ctr -n openfaas-fn task ls
TASK PID STATUS
derek-sha256 4452 RUNNING
nodeinfo 3914 RUNNING
Call the env function:
time curl -i http://127.0.0.1:8080/function/env
HTTP/1.1 200 OK
Content-Length: 425
Content-Type: text/plain; charset=utf-8
Date: Wed, 03 Nov 2021 19:15:18 GMT
X-Call-Id: a5bb5d32-8dc7-4e1d-87b7-f7f0d3121954
X-Duration-Seconds: 0.001052
X-Start-Time: 1635966918789614976
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
fprocess=env
HOME=/home/app
Http_X_Forwarded_For=10.62.0.1:50736
Http_X_Forwarded_Host=127.0.0.1:8080
Http_X_Start_Time=1635966918789614976
Http_User_Agent=curl/7.68.0
Http_Accept=*/*
Http_Accept_Encoding=gzip
Http_X_Call_Id=a5bb5d32-8dc7-4e1d-87b7-f7f0d3121954
Http_Method=GET
Http_ContentLength=0
Http_Content_Length=0
Http_Path=/
Http_Host=10.62.0.15:8080
real 0m0.259s
user 0m0.012s
sys 0m0.001s
Has the task been reinstated?
$ sudo ctr -n openfaas-fn task ls
TASK PID STATUS
nodeinfo 3914 RUNNING
env 39201 RUNNING
derek-sha256 4452 RUNNING
Load test
Create a script:
#!/bin/bash
N=10
echo Attempting scale up event ${N} times...
for ((i=1; i<=$N; i++ ));
do
ctr -n openfaas-fn task rm env -f > /dev/null 2>&1
sleep 1
STATUS=$(curl -is http://127.0.0.1:8080/function/env | head -n 1|cut -d$' ' -f2)
PID=$(ctr -n openfaas-fn task ls | grep 'env' | awk '{print $2}')
echo "PID:${PID} - STATUS:${STATUS}"
done
Run the script over a small set echoing each time to demonstrate efficacy:
$ ./loop.sh
Attempting scale up event 10 times...
PID:52761 - STATUS:200
PID:52947 - STATUS:200
PID:53128 - STATUS:200
PID:53297 - STATUS:200
PID:53469 - STATUS:200
PID:53640 - STATUS:200
PID:53812 - STATUS:200
PID:53984 - STATUS:200
PID:54150 - STATUS:200
PID:54323 - STATUS:200
Edit script to only echo when a non-200 return code is encountered:
#!/bin/bash
N=1000
echo Attempting scale up event ${N} times...
for ((i=1; i<=$N; i++ ));
do
ctr -n openfaas-fn task rm env -f > /dev/null 2>&1
sleep 1
STATUS=$(curl -is http://127.0.0.1:8080/function/env | head -n 1|cut -d$' ' -f2)
PID=$(ctr -n openfaas-fn task ls | grep 'env' | awk '{print $2}')
if [ "${STATUS}" != "200" ]
then
echo "PID:${PID} - STATUS:${STATUS}"
fi
done
echo "Test Completed"
Run the script (add time as a sanity check):
$ time ./loop.sh
Attempting scale up event 1000 times...
PID:59226 - STATUS:500
PID: - STATUS:404
PID: - STATUS:404
PID: - STATUS:
PID: - STATUS:404
PID: - STATUS:
PID:191771 - STATUS:500
PID:213436 - STATUS:500
Test Completed
real 25m35.304s
user 0m30.736s
sys 0m28.744s
Rerun the N=1000 script with the iteration number in the output:
time ./loop.sh
Attempting scale up event 1000 times...
PID:251329 - STATUS:500 - Iteration:138
PID: - STATUS:404 - Iteration:293
PID: - STATUS:404 - Iteration:695
PID:364081 - STATUS:500 - Iteration:786
Test Completed
real 28m52.654s
user 0m29.693s
sys 0m27.052s
What a great testing method 💪
Perhaps there's something going wrong here that I need to dig into.
How does 1000x test run compare when you use the original binary?
Edit: Further investigation meant a rerun was required. Rerun results provided here.
Putting the 0.14.3 binary back in place and rerunning the N=1000 test:
$ time ./loop.sh
Attempting scale up event 1000 times...
PID:458207 - STATUS:500 - Iteration:32
PID: - STATUS: - Iteration:252
PID: - STATUS: - Iteration:344
PID: - STATUS:404 - Iteration:361
PID: - STATUS: - Iteration:540
Test Completed
real 32m15.716s
user 0m30.298s
sys 0m28.397s
I was trying to avoid failing but... failed :-P
Here's observation on faasd version: 0.14.3 commit: ea62c1b12dd1ae1de794e6dd260351cfbd1a6759 https://asciinema.org/a/J6KJL3KZgFmFO1TL0pBTQg0SU
Here's repeat on build from wait-for-healthy branch
https://asciinema.org/a/FyxgzsEiOUVh82PCPDlk39I1w
Unfortunately regardless of how long I wait after faasd starts up, the first hit on env function produces the same outcome (fail) and any later one works just fine
root@debian:~/faasd# curl http://localhost:8080/function/env
Can't reach service for: env.
root@debian:~/faasd# curl http://localhost:8080/function/env
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
fprocess=env
HOME=/home/app
[...]
This is why I hacked it by adding script that after faasd starts will tickle every registered function with GET request. Ugly but works...
Following the steps of @tmiklas I've found that that error corresponds to a HTTP/500.
So far I've only found Can't reach service for: here:
https://github.com/openfaas/faas-provider/blob/bc0f26cae70c1b3ba2a595cea8bfb8c04925d5c6/proxy/proxy.go#L168-L168
Which is introduced in this change, see:
https://github.com/openfaas/faasd/pull/213/files#diff-d4774f198ccba12d22912607ba38ceac90b936cc5f0198415df45e22629a3770R18
However, I cannot currently see a path to that particular error message.
I ought to mention that the same happens for me using 0.14.3:
root@derek:~# faasd version
__ _
/ _| __ _ __ _ ___ __| |
| |_ / _` |/ _` / __|/ _` |
| _| (_| | (_| \__ \ (_| |
|_| \__,_|\__,_|___/\__,_|
faasd version: 0.14.3 commit: ea62c1b12dd1ae1de794e6dd260351cfbd1a6759
root@derek:~# killall -9 faasd; systemctl restart faasd
root@derek:~# ctr -n openfaas-fn task ls
TASK PID STATUS
nodeinfo 3914 RUNNING
env 635600 RUNNING
derek-sha256 4452 RUNNING
root@derek:~# systemctl status faasd
● faasd.service - faasd
Loaded: loaded (/lib/systemd/system/faasd.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2021-11-05 07:32:01 UTC; 18s ago
Main PID: 635793 (faasd)
Tasks: 8 (limit: 9513)
Memory: 22.0M (limit: 500.0M)
CGroup: /system.slice/faasd.service
└─635793 /usr/local/bin/faasd up
Nov 05 07:32:07 derek faasd[635793]: 2021/11/05 07:32:07 Resolver: "localhost"="127.0.0.1"
Nov 05 07:32:07 derek faasd[635793]: 2021/11/05 07:32:07 Resolver: "faasd-provider"="10.62.0.1"
Nov 05 07:32:07 derek faasd[635793]: 2021/11/05 07:32:07 Resolver: "nats"="10.62.12.21"
Nov 05 07:32:07 derek faasd[635793]: 2021/11/05 07:32:07 Resolver: "queue-worker"="10.62.12.22"
Nov 05 07:32:07 derek faasd[635793]: 2021/11/05 07:32:07 Resolver: "basic-auth-plugin"="10.62.12.23"
Nov 05 07:32:07 derek faasd[635793]: 2021/11/05 07:32:07 Resolver: "prometheus"="10.62.12.24"
Nov 05 07:32:07 derek faasd[635793]: 2021/11/05 07:32:07 Resolver: "gateway"="10.62.12.25"
Nov 05 07:32:07 derek faasd[635793]: 2021/11/05 07:32:07 Proxy from: 127.0.0.1:9090, to: prometheus:9090 (10.62.12.24)
Nov 05 07:32:07 derek faasd[635793]: 2021/11/05 07:32:07 faasd: waiting for SIGTERM or SIGINT
Nov 05 07:32:07 derek faasd[635793]: 2021/11/05 07:32:07 Proxy from: 0.0.0.0:8080, to: gateway:8080 (10.62.12.25)
root@derek:~# curl -i http://localhost:8080/function/env
HTTP/1.1 500 Internal Server Error
Content-Length: 30
Content-Type: text/plain; charset=utf-8
Date: Fri, 05 Nov 2021 07:32:34 GMT
X-Call-Id: 0fed9735-c9e5-497d-9202-b4182535b055
X-Content-Type-Options: nosniff
X-Start-Time: 1636097554054763230
Can't reach service for: env.
and the logs:
Nov 05 07:32:07 derek faasd[635793]: 2021/11/05 07:32:07 faasd: waiting for SIGTERM or SIGINT
Nov 05 07:32:07 derek faasd[635793]: 2021/11/05 07:32:07 Proxy from: 0.0.0.0:8080, to: gateway:8080 (10.62.12.25)
Nov 05 07:32:08 derek openfaas:basic-auth-plugin[636244]: 2021/11/05 07:32:08 Validated request 200.
Nov 05 07:32:08 derek openfaas:gateway[636474]: 2021/11/05 07:32:08 error with upstream request to: /system/function/env, Get "http://faasd-provider:8081/system/function/env?namespace=openfaas-fn": dial tcp 10.62.0.1:8081: connect: connection refused
Nov 05 07:32:08 derek openfaas:gateway[636474]: 2021/11/05 07:32:08 Forwarded [GET] to /system/function/env?namespace=openfaas-fn - [502] - 0.002867s seconds
Nov 05 07:32:10 derek openfaas:basic-auth-plugin[636244]: 2021/11/05 07:32:10 Validated request 200.
Nov 05 07:32:10 derek openfaas:gateway[636474]: 2021/11/05 07:32:10 error with upstream request to: /system/functions, Get "http://faasd-provider:8081/system/functions?namespace=openfaas-fn": dial tcp 10.62.0.1:8081: connect: connection refused
Nov 05 07:32:10 derek openfaas:gateway[636474]: 2021/11/05 07:32:10 Forwarded [GET] to /system/functions?namespace=openfaas-fn - [502] - 0.002327s seconds
Nov 05 07:32:10 derek openfaas:gateway[636474]: 2021/11/05 07:32:10 List functions responded with code 502, body:
Nov 05 07:32:11 derek openfaas:basic-auth-plugin[636244]: 2021/11/05 07:32:11 Validated request 200.
Nov 05 07:32:11 derek openfaas:gateway[636474]: 2021/11/05 07:32:11 error with upstream request to: /system/function/env, Get "http://faasd-provider:8081/system/function/env?namespace=openfaas-fn": dial tcp 10.62.0.1:8081: connect: connection refused
Nov 05 07:32:11 derek openfaas:gateway[636474]: 2021/11/05 07:32:11 Forwarded [GET] to /system/function/env?namespace=openfaas-fn - [502] - 0.002428s seconds
Nov 05 07:32:11 derek systemd[1]: faasd-provider.service: Scheduled restart job, restart counter is at 3.
Nov 05 07:32:11 derek systemd[1]: Stopped faasd-provider.
Nov 05 07:32:11 derek systemd[1]: Started faasd-provider.
Nov 05 07:32:11 derek faasd[636580]: 2021/11/05 07:32:11 faasd-provider starting.. Service Timeout: 1m0s
Nov 05 07:32:11 derek faasd[636580]: faasd version: 0.14.3 commit: ea62c1b12dd1ae1de794e6dd260351cfbd1a6759
Nov 05 07:32:11 derek faasd[636580]: 2021/11/05 07:32:11 Writing network config...
Nov 05 07:32:11 derek faasd[636580]: 2021/11/05 07:32:11 Listening on TCP port: 8081
I can also trigger the HTTP/404 in this way, again on 0.14.3:
killall -9 faasd; systemctl restart faasd
root@derek:~# ctr -n openfaas-fn task ls
TASK PID STATUS
nodeinfo 3914 RUNNING
env 637515 STOPPED
derek-sha256 4452 RUNNING
root@derek:~# systemctl status faasd
● faasd.service - faasd
Loaded: loaded (/lib/systemd/system/faasd.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2021-11-05 07:40:36 UTC; 4s ago
Main PID: 638443 (faasd)
Tasks: 7 (limit: 9513)
Memory: 18.9M (limit: 500.0M)
CGroup: /system.slice/faasd.service
└─638443 /usr/local/bin/faasd up
Nov 05 07:40:38 derek faasd[638443]: Starting: nats
Nov 05 07:40:38 derek faasd[638443]: 2021/11/05 07:40:38 Created container: nats
Nov 05 07:40:39 derek faasd[638443]: 2021/11/05 07:40:39 nats has IP: 10.62.12.38
Nov 05 07:40:39 derek faasd[638443]: 2021/11/05 07:40:39 Task: nats Container: nats
Nov 05 07:40:39 derek faasd[638443]: Starting: prometheus
Nov 05 07:40:39 derek faasd[638443]: 2021/11/05 07:40:39 Created container: prometheus
Nov 05 07:40:40 derek faasd[638443]: 2021/11/05 07:40:40 prometheus has IP: 10.62.12.39
Nov 05 07:40:40 derek faasd[638443]: 2021/11/05 07:40:40 Task: prometheus Container: prometheus
Nov 05 07:40:40 derek faasd[638443]: Starting: gateway
Nov 05 07:40:40 derek faasd[638443]: 2021/11/05 07:40:40 Created container: gateway
root@derek:~# curl -i http://localhost:8080/function/env
HTTP/1.1 404 Not Found
Date: Fri, 05 Nov 2021 07:40:42 GMT
Content-Length: 168
Content-Type: text/plain; charset=utf-8
error finding function env.openfaas-fn: Get "http://faasd-provider:8081/system/function/env?namespace=openfaas-fn": dial tcp 10.62.0.1:8081: connect: connection
And logs:
Nov 05 07:40:41 derek faasd[638443]: 2021/11/05 07:40:41 queue-worker has IP: 10.62.12.41
Nov 05 07:40:41 derek faasd[638443]: 2021/11/05 07:40:41 Task: queue-worker Container: queue-worker
Nov 05 07:40:41 derek openfaas:queue-worker[639127]: Loading basic authentication credentials
Nov 05 07:40:41 derek openfaas:queue-worker[639127]: Starting queue-worker. Version: 0.12.2 Git Commit: ada9a31504c99b57c3ae6fb7559d7684988e5683
Nov 05 07:40:41 derek openfaas:queue-worker[639127]: Connect: nats://nats:4222
Nov 05 07:40:41 derek openfaas:queue-worker[639127]: Subscribing to: faas-request at nats://nats:4222
Nov 05 07:40:41 derek openfaas:queue-worker[639127]: Wait for 5m5s
Nov 05 07:40:41 derek openfaas:queue-worker[639127]: Listening on [faas-request], clientID=[faas-worker-queue-worker], qgroup=[faas] maxInFlight=[1]
Nov 05 07:40:41 derek openfaas:nats[638761]: [1] 2021/11/05 07:40:41.779531 [INF] STREAM: Channel "faas-request" has been created
Nov 05 07:40:41 derek faasd[638443]: 2021/11/05 07:40:41 Supervisor init done in: 5.108281179s
Nov 05 07:40:41 derek faasd[638443]: 2021/11/05 07:40:41 Looking up IP for: "gateway"
Nov 05 07:40:41 derek faasd[638443]: 2021/11/05 07:40:41 Resolver rebuilding map
Nov 05 07:40:41 derek faasd[638443]: 2021/11/05 07:40:41 Resolver: "localhost"="127.0.0.1"
Nov 05 07:40:41 derek faasd[638443]: 2021/11/05 07:40:41 Resolver: "faasd-provider"="10.62.0.1"
Nov 05 07:40:41 derek faasd[638443]: 2021/11/05 07:40:41 Resolver: "basic-auth-plugin"="10.62.12.37"
Nov 05 07:40:41 derek faasd[638443]: 2021/11/05 07:40:41 Resolver: "nats"="10.62.12.38"
Nov 05 07:40:41 derek faasd[638443]: 2021/11/05 07:40:41 Resolver: "prometheus"="10.62.12.39"
Nov 05 07:40:41 derek faasd[638443]: 2021/11/05 07:40:41 Resolver: "gateway"="10.62.12.40"
Nov 05 07:40:41 derek faasd[638443]: 2021/11/05 07:40:41 Resolver: "queue-worker"="10.62.12.41"
Nov 05 07:40:41 derek faasd[638443]: 2021/11/05 07:40:41 Looking up IP for: "prometheus"
Nov 05 07:40:41 derek faasd[638443]: 2021/11/05 07:40:41 Proxy from: 127.0.0.1:9090, to: prometheus:9090 (10.62.12.39)
Nov 05 07:40:41 derek faasd[638443]: 2021/11/05 07:40:41 faasd: waiting for SIGTERM or SIGINT
Nov 05 07:40:42 derek faasd[638443]: 2021/11/05 07:40:42 Proxy from: 0.0.0.0:8080, to: gateway:8080 (10.62.12.40)
Nov 05 07:40:42 derek systemd-networkd[608]: veth76b7beab: Gained IPv6LL
Nov 05 07:40:42 derek openfaas:gateway[639002]: 2021/11/05 07:40:42 http://faasd-provider:8081/system/function/env?namespace=openfaas-fn Get "http://faasd-provider:8081/system/function/env?namespace=openfaas-fn": dial tcp 10.62.0.1:8081: connect: connection refused
Nov 05 07:40:42 derek openfaas:gateway[639002]: 2021/11/05 07:40:42 GetReplicas [env.openfaas-fn] took: 0.002563s
Nov 05 07:40:42 derek openfaas:gateway[639002]: 2021/11/05 07:40:42 Scaling: error finding function env.openfaas-fn: Get "http://faasd-provider:8081/system/function/env?namespace=openfaas-fn": dial tcp 10.62.0.1:8081: connect: connection refused
Nov 05 07:40:45 derek openfaas:gateway[639002]: 2021/11/05 07:40:45 Get "http://faasd-provider:8081/system/namespaces": dial tcp 10.62.0.1:8081: connect: connection refused
Nov 05 07:40:45 derek openfaas:gateway[639002]: 2021/11/05 07:40:45 Get "http://faasd-provider:8081/system/functions?namespace=openfaas-fn": dial tcp 10.62.0.1:8081: connect: connection refused
cc @welteki
The solution that we're exploring with probing in the gateway, may also help with this issue in faasd.
https://github.com/openfaas/faas/issues/1721