Trigger Alert `CoreDNSErrorsHigh` with OpenShift 4.11.9
Issue Description
I have deployed 3scale apimanager with 3scale operator 2.12. When these are huge number of connections, OpenShift would trigger Alert CoreDNSErrorsHigh .
There was also huge number of error logs, known as SERVFAIL, with CoreDNS, like below:
[INFO] 10.20.10.100:42782 - 766 "A IN backend-listener. udp 34 false 512" SERVFAIL qr,aa,rd,ra 34 0.000114929s
[INFO] 10.20.10.100:32785 - 54091 "A IN backend-listener. udp 34 false 512" SERVFAIL qr,aa,rd,ra 34 0.000045799s
[INFO] 10.20.10.100:38736 - 38959 "A IN backend-listener. udp 34 false 512" SERVFAIL qr,aa,rd,ra 34 0.000495619s
[INFO] 10.20.10.100:53221 - 55209 "A IN backend-listener. udp 34 false 512" SERVFAIL qr,aa,rd,ra 34 0.000111221s
[INFO] 10.20.10.100:43456 - 55627 "A IN backend-listener. udp 34 false 512" SERVFAIL qr,aa,rd,ra 34 0.000087746s
[INFO] 10.20.10.100:53734 - 57461 "A IN backend-listener. udp 34 false 512" SERVFAIL qr,aa,rd,ra 34 0.000093702s
[INFO] 10.20.10.100:56550 - 34310 "A IN backend-listener. udp 34 false 512" SERVFAIL qr,aa,rd,ra 34 0.000131092s
[INFO] 10.20.10.100:40476 - 52329 "A IN backend-listener. udp 34 false 512" SERVFAIL qr,aa,rd,ra 34 0.000077204s
[INFO] 10.20.10.100:48084 - 49744 "A IN backend-listener. udp 34 false 512" SERVFAIL qr,aa,rd,ra 34 0.000089607s
[INFO] 10.20.10.100:43484 - 50657 "A IN backend-listener. udp 34 false 512" SERVFAIL qr,aa,rd,ra 34 0.000053069s
Version
OpenResty Version
sh-4.4$ resty -V
resty 0.27
nginx version: openresty/1.19.3.1
built with OpenSSL 1.1.1k FIPS 25 Mar 2021
TLS SNI support enabled
configure arguments: --prefix=/usr/local/openresty/nginx --with-cc-opt='-O2 -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -DNGX_LUA_ABORT_AT_PANIC -I/usr/local/openresty/zlib/include -I/usr/local/openresty/pcre/include -I/builddir/build/BUILD/openresty-72c2c44c0a02d07851998457c3367a72cdbad0ed/openresty-1.19.3.1/../apicast-nginx-module-v0.4/' --add-module=../ngx_devel_kit-0.3.1 --add-module=../echo-nginx-module-0.62 --add-module=../xss-nginx-module-0.06 --add-module=../ngx_coolkit-0.2 --add-module=../set-misc-nginx-module-0.32 --add-module=../form-input-nginx-module-0.12 --add-module=../encrypted-session-nginx-module-0.08 --add-module=../srcache-nginx-module-0.32 --add-module=../ngx_lua-0.10.19 --add-module=../ngx_lua_upstream-0.07 --add-module=../headers-more-nginx-module-0.33 --add-module=../array-var-nginx-module-0.05 --add-module=../memc-nginx-module-0.19 --add-module=../redis2-nginx-module-0.15 --add-module=../redis-nginx-module-0.3.7 --add-module=../ngx_stream_lua-0.0.9 --with-ld-opt='-Wl,-rpath,/usr/local/openresty/luajit/lib -Wl,-z,relro -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld -L/usr/local/openresty/zlib/lib -L/usr/local/openresty/pcre/lib -Wl,-rpath,/usr/local/openresty/zlib/lib:/usr/local/openresty/pcre/lib -L/builddir/build/BUILD/openresty-72c2c44c0a02d07851998457c3367a72cdbad0ed/openresty-1.19.3.1/../apicast-nginx-module-v0.4/' --with-pcre-jit --with-stream --with-stream_ssl_module --with-stream_ssl_preread_module --with-http_v2_module --without-mail_pop3_module --without-mail_imap_module --without-mail_smtp_module --with-http_stub_status_module --with-http_realip_module --with-http_addition_module --with-http_auth_request_module --with-http_secure_link_module --with-http_random_index_module --with-http_gzip_static_module --with-http_sub_module --with-http_dav_module --with-http_flv_module --with-http_mp4_module --with-http_gunzip_module --with-threads --add-module=/builddir/build/BUILD/openresty-72c2c44c0a02d07851998457c3367a72cdbad0ed/openresty-1.19.3.1/../apicast-nginx-module-v0.4 --add-dynamic-module=/builddir/build/BUILD/openresty-72c2c44c0a02d07851998457c3367a72cdbad0ed/openresty-1.19.3.1/../nginx-opentracing-v0.3.0/opentracing --with-stream --with-http_ssl_module
API Cast Gateway Image Version
[user@localhost ~]$ oc image info registry.redhat.io/3scale-amp2/apicast-gateway-rhel8@sha256:e6970547e76e9671a11e45ebe55c1ccf8d3833e22ce232adb2fe963b0f0263cb
Name: registry.redhat.io/3scale-amp2/apicast-gateway-rhel8@sha256:e6970547e76e9671a11e45ebe55c1ccf8d3833e22ce232adb2fe963b0f0263cb
Media Type: application/vnd.docker.distribution.manifest.v2+json
Created: 147d ago
Image Size: 114.5MB in 3 layers
Layers: 79.43MB sha256:1b3417e31a5e0e64f861e121d4efed3152e75aaa85026cd784cd0070e063daa3
1.796kB sha256:809fe483e88523e7021d76b001a552856f216430023bdc0aeff8fce8df385535
35.03MB sha256:949460b6bc6cc93018087e3896c839f232ec361c13b612f5e5dab922f060d3bf
OS: linux
Arch: amd64
Entrypoint: container-entrypoint
Command: scripts/run
Working Dir: /opt/app-root
User: 1001
Exposes Ports: 8080/tcp
Environment: PATH=/opt/app-root/src/bin:/opt/app-root/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
container=oci
AUTO_UPDATE_INTERVAL=0
HOME=/opt/app-root/src
BUILD_PATH=/remote-source/app
LD_LIBRARY_PATH=:/opt/app-root/lib
LUA_CPATH=./?.so;/usr/lib64/lua/5.1/?.so;/usr/lib64/lua/5.1/loadall.so;/usr/local/lib64/lua/5.1/?.so
LUA_PATH=/usr/lib64/lua/5.1/?.lua;/usr/local/share/lua/5.1/?.lua;/usr/local/share/lua/5.1/*/?.lua;
Labels: architecture=x86_64
build-date=2022-09-13T21:46:03.758871
com.redhat.build-host=cpt-1005.osbs.prod.upshift.rdu2.redhat.com
com.redhat.component=3scale-amp-apicast-gateway-container
com.redhat.license_terms=https://www.redhat.com/agreements
description=APIcast is not a standalone API gateway therefore it needs connection to the 3scale API management platform. The container includes OpenResty and uses LuaRocks to install dependencies (rocks are installed in the application folder).
distribution-scope=public
io.k8s.description=APIcast is not a standalone API gateway therefore it needs connection to the 3scale API management platform. The container includes OpenResty and uses LuaRocks to install dependencies (rocks are installed in the application folder).
io.k8s.display-name=3scale API gateway (APIcast)
io.openshift.expose-services=8080:apicast
io.openshift.tags=integration, nginx, lua, openresty, api, gateway, 3scale, rhamp
[email protected]
name=3scale-amp2/apicast-gateway-rhel8
release=74
summary=3scale's API gateway (APIcast) is an OpenResty application which consists of two parts: Nginx configuration and Lua files.
url=https://access.redhat.com/containers/#/registry.access.redhat.com/3scale-amp2/apicast-gateway-rhel8/images/1.21.0-74
vcs-ref=3e532f5dc893e3ed1d3d374d3f266e9704b358b0
vcs-type=git
vendor=Red Hat, Inc.
version=1.21.0
Steps to Reproduce
- Change Environment
APICAST_LOG_LEVELtodebug - Watch out the logs of API Cast, there would show DNS resolve procedure, like below:
First time to request:
2023/02/08 08:43:02 [debug] 22#22: *5 resolver.lua:350: lookup(): resolver query: backend-listener, requestID=A
2023/02/08 08:43:02 [debug] 22#22: *5 resolver.lua:321: search_dns(): resolver query: backend-listener search: query: backend-listener., requestID=A
2023/02/08 08:43:02 [debug] 22#22: *5 cache.lua:122: fetch_answers(): resolver cache miss backend-listener., requestID=A
2023/02/08 08:43:02 [debug] 22#22: *5 cache.lua:188: get(): resolver cache miss: backend-listener., requestID=A
2023/02/08 08:43:02 [debug] 22#22: *5 dns_client.lua:75: query(): resolver query: backend-listener. nameserver: 10.10.0.10:53, requestID=A
2023/02/08 08:43:02 [debug] 22#22: *5 dns_client.lua:75: query(): resolver query: backend-listener. nameserver: 10.10.0.10:53, requestID=A
2023/02/08 08:43:02 [debug] 22#22: *5 resolver.lua:321: search_dns(): resolver query: backend-listener search: 3scale.svc.cluster.local query: backend-listener.3scale.svc.cluster.local, requestID=A
2023/02/08 08:43:02 [debug] 22#22: *5 cache.lua:122: fetch_answers(): resolver cache miss backend-listener.3scale.svc.cluster.local, requestID=A
2023/02/08 08:43:02 [debug] 22#22: *5 cache.lua:188: get(): resolver cache miss: backend-listener.3scale.svc.cluster.local, requestID=A
2023/02/08 08:43:02 [debug] 22#22: *5 dns_client.lua:75: query(): resolver query: backend-listener.3scale.svc.cluster.local nameserver: 10.10.0.10:53, requestID=A
2023/02/08 08:43:02 [debug] 22#22: *5 cache.lua:74: store(): resolver cache write backend-listener.3scale.svc.cluster.local with TLL 5, requestID=A
2023/02/08 08:43:02 [debug] 22#22: *5 resolver.lua:370: lookup(): resolver query: backend-listener finished with 1 answers, requestID=A
2023/02/08 08:43:02 [debug] 22#22: *5 resolver.lua:410: get_servers(): query for backend-listener finished with 1 answers, requestID=A
2023/02/08 08:43:02 [debug] 22#22: *5 resolver.lua:350: lookup(): resolver query: 10.10.10.100, requestID=A
2023/02/08 08:43:02 [debug] 22#22: *5 resolver.lua:355: lookup(): host is ip address: 10.10.10.100, requestID=A
2023/02/08 08:43:02 [debug] 22#22: *5 resolver.lua:370: lookup(): resolver query: 10.10.10.100 finished with 1 answers, requestID=A
2023/02/08 08:43:02 [debug] 22#22: *5 resolver.lua:410: get_servers(): query for 10.10.10.100 finished with 1 answers, requestID=A
2023/02/08 08:43:02 [debug] 22#22: *5 http.lua:50: connect(): connected to ip:10.10.10.100 host: 10.10.10.100 port: 3000 ok: 1 err: nil, requestID=A
2023/02/08 08:43:02 [debug] 22#22: *5 proxy.lua:26: new(): connection to backend-listener:3000 established, reused times: 0, requestID=A
Second time to request:
2023/02/08 08:43:15 [debug] 22#22: *15 resolver.lua:350: lookup(): resolver query: backend-listener, requestID=B
2023/02/08 08:43:15 [debug] 22#22: *15 resolver.lua:321: search_dns(): resolver query: backend-listener search: query: backend-listener., requestID=B
2023/02/08 08:43:15 [debug] 22#22: *15 cache.lua:122: fetch_answers(): resolver cache miss backend-listener., requestID=B
2023/02/08 08:43:15 [debug] 22#22: *15 cache.lua:188: get(): resolver cache miss: backend-listener., requestID=B
2023/02/08 08:43:15 [debug] 22#22: *15 dns_client.lua:75: query(): resolver query: backend-listener. nameserver: 10.10.0.10:53, requestID=B
2023/02/08 08:43:15 [debug] 22#22: *15 dns_client.lua:75: query(): resolver query: backend-listener. nameserver: 10.10.0.10:53, requestID=B
2023/02/08 08:43:15 [debug] 22#22: *15 resolver.lua:321: search_dns(): resolver query: backend-listener search: 3scale.svc.cluster.local query: backend-listener.3scale.svc.cluster.local, requestID=B
2023/02/08 08:43:15 [debug] 22#22: *15 cache.lua:115: fetch_answers(): resolver cache read backend-listener.3scale.svc.cluster.local 1 entries, requestID=B
2023/02/08 08:43:15 [debug] 22#22: *15 cache.lua:190: get(): resolver cache hit: backend-listener.3scale.svc.cluster.local 10.10.10.100, requestID=B
2023/02/08 08:43:15 [debug] 22#22: *15 resolver.lua:370: lookup(): resolver query: backend-listener finished with 1 answers, requestID=B
2023/02/08 08:43:15 [debug] 22#22: *15 resolver.lua:410: get_servers(): query for backend-listener finished with 1 answers, requestID=B
2023/02/08 08:43:15 [debug] 22#22: *15 resolver.lua:350: lookup(): resolver query: 10.10.10.100, requestID=B
2023/02/08 08:43:15 [debug] 22#22: *15 resolver.lua:355: lookup(): host is ip address: 10.10.10.100, requestID=B
2023/02/08 08:43:15 [debug] 22#22: *15 resolver.lua:370: lookup(): resolver query: 10.10.10.100 finished with 1 answers, requestID=B
2023/02/08 08:43:15 [debug] 22#22: *15 resolver.lua:410: get_servers(): query for 10.10.10.100 finished with 1 answers, requestID=B
2023/02/08 08:43:15 [debug] 22#22: *15 http.lua:50: connect(): connected to ip:10.10.10.100 host: 10.10.10.100 port: 3000 ok: 1 err: nil, requestID=B
2023/02/08 08:43:15 [debug] 22#22: *15 proxy.lua:26: new(): connection to backend-listener:3000 established, reused times: 2, requestID=B
Current Result
Current DNS resolve produce for API Cast, like below:
First Time to query backend-listener
- Query
backend-listener
resolver.lua:350: lookup(): resolver query: backend-listener, requestID=A
- Query
backend-listener., and gotSERVFAILforbackend-listener.
resolver.lua:321: search_dns(): resolver query: backend-listener search: query: backend-listener., requestID=A
cache.lua:122: fetch_answers(): resolver cache miss backend-listener., requestID=A
dns_client.lua:75: query(): resolver query: backend-listener. nameserver: 10.10.0.10:53, requestID=A
- Query
backend-listener.3scale.svc.cluster.local.Found 10.10.10.100 and write cache tobackend-listener.3scale.svc.cluster.local., notbackend-listener
resolver.lua:321: search_dns(): resolver query: backend-listener search: 3scale.svc.cluster.local query: backend-listener.3scale.svc.cluster.local, requestID=A
cache.lua:122: fetch_answers(): resolver cache miss backend-listener.3scale.svc.cluster.local, requestID=A
dns_client.lua:75: query(): resolver query: backend-listener.3scale.svc.cluster.local nameserver: 10.10.0.10:53, requestID=A
cache.lua:74: store(): resolver cache write backend-listener.3scale.svc.cluster.local with TLL 5, requestID=A
- Found and connction established
resolver.lua:370: lookup(): resolver query: backend-listener finished with 1 answers, requestID=A
http.lua:50: connect(): connected to ip:10.10.10.100 host: 10.10.10.100 port: 3000 ok: 1 err: nil, requestID=A
proxy.lua:26: new(): connection to backend-listener:3000 established, reused times: 0, requestID=A
Second Time to query backend-listener
- Query
backend-listener. I think cache shold be hit at this step.
resolver.lua:350: lookup(): resolver query: backend-listener, requestID=B
- Query
backend-listener., still GotSERVFAILforbackend-listener.
resolver.lua:321: search_dns(): resolver query: backend-listener search: query: backend-listener., requestID=B
cache.lua:122: fetch_answers(): resolver cache miss backend-listener., requestID=B
dns_client.lua:75: query(): resolver query: backend-listener. nameserver: 10.10.0.10:53, requestID=B
- Query
backend-listener.3scale.svc.cluster.local., cache hits forbackend-listener.3scale.svc.cluster.local.
resolver.lua:321: search_dns(): resolver query: backend-listener search: 3scale.svc.cluster.local query: backend-listener.3scale.svc.cluster.local, requestID=B
cache.lua:115: fetch_answers(): resolver cache read backend-listener.3scale.svc.cluster.local 1 entries, requestID=B
- Found and connction established
resolver.lua:370: lookup(): resolver query: backend-listener finished with 1 answers, requestID=B
http.lua:50: connect(): connected to ip:10.10.10.100 host: 10.10.10.100 port: 3000 ok: 1 err: nil, requestID=B
proxy.lua:26: new(): connection to backend-listener:3000 established, reused times: 2, requestID=B
Expected Result
Second Time to query backend-listener, the cache should be hit at first step to avoid querying CoreDNS.
Maybe DNS cache save and check, should be move out.
https://github.com/3scale/APIcast/blame/master/gateway/src/resty/resolver.lua#L306
Thanks for looking at this and for the great description and detailed doc.
I looked at it and indeed it seems an issue.
We are currently working on other tasks with higher priorities. But I have created a Jira issue for this issue. https://issues.redhat.com/browse/THREESCALE-9301 so it can be added to the backlog for future sprints
Fixed in #1500