Performance issue with keydb
Hello. I launched a dockerized instance of KeyDB from the image eqalpha/keydb:x86_64_v6.0.16 and configured active-replica yes.
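For reference, active replication pairs the two nodes symmetrically; host-a's side is in the full config below, and host-b mirrors it. A sketch of the relevant lines on each host:

# on host-a
active-replica yes
replicaof host-b 6379

# on host-b (mirror of the above)
active-replica yes
replicaof host-a 6379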
Then I copied the AOF file over and started the instance with this docker-compose file:
version: '3.4'
services:
  keydb:
    image: eqalpha/keydb:x86_64_v6.0.16
    container_name: keydb
    restart: unless-stopped
    security_opt:
      - seccomp:unconfined
    network_mode: host
    volumes:
      - /db/keydb/:/data/
      - type: bind
        source: ./keydb.conf
        target: /etc/keydb/keydb.conf
    logging:
      driver: "json-file"
      options:
        max-file: "5"
        max-size: 10m
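The container is brought up and checked the usual way (docker-compose v1 syntax; docker compose works the same):

docker-compose up -d
docker logs -f keydb   # watch startup and AOF loading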
keydb.conf is:
bind 0.0.0.0
protected-mode no
port 6379
tcp-backlog 511
timeout 0
tcp-keepalive 300
supervised no
pidfile /var/run/keydb_6379.pid
loglevel notice
databases 16
always-show-logo yes
save ""
save ""
save ""
stop-writes-on-bgsave-error yes
rdbcompression yes
rdbchecksum yes
dbfilename dump.rdb
dir /data
replica-serve-stale-data yes
replica-read-only yes
repl-diskless-sync no
repl-diskless-sync-delay 5
repl-disable-tcp-nodelay no
replica-priority 100
lazyfree-lazy-eviction no
lazyfree-lazy-expire no
lazyfree-lazy-server-del no
replica-lazy-flush no
appendonly yes
appendfilename "appendonly.aof"
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 25
auto-aof-rewrite-min-size 7gb
aof-load-truncated yes
aof-use-rdb-preamble yes
lua-time-limit 5000
slowlog-log-slower-than 10000
slowlog-max-len 128
latency-monitor-threshold 0
notify-keyspace-events ""
hash-max-ziplist-entries 512
hash-max-ziplist-value 64
list-max-ziplist-size -2
list-compress-depth 0
set-max-intset-entries 512
zset-max-ziplist-entries 128
zset-max-ziplist-value 64
hll-sparse-max-bytes 3000
stream-node-max-bytes 4096
stream-node-max-entries 100
activerehashing yes
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit replica 1024mb 1024mb 0
client-output-buffer-limit pubsub 128mb 64mb 60
hz 10
dynamic-hz yes
aof-rewrite-incremental-fsync yes
rdb-save-incremental-fsync yes
server-threads 4
rename-command FLUSHDB ""
rename-command FLUSHALL ""
requirepass "**"
masterauth "**"
active-replica yes
replicaof host-b 6379
replica-announce-ip host-a-ip
replica-announce-port 6379
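To confirm the pair is healthy after startup, replication state can be checked from either side (a sketch; ** stands for the real password, as everywhere above):

keydb-cli -h host-a -a ** info replication
keydb-cli -h host-b -a ** info replication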
It starts without any issues, and there are now about 2.5 million keys. Replication is also fine. All ~400 clients are connected to host-a; host-b is for manual standby. But the latency of GET operations is very poor:
keydb-benchmark -h `hostname -f` -a ** -t get -n 1000
...
95.30% <= 2141 milliseconds
...
100.00% <= 3040 milliseconds
50.08 requests per second
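A longer run with explicit concurrency can be used to rule out warm-up effects (a sketch; -n, -c and -q are the standard keydb-benchmark flags):

keydb-benchmark -h `hostname -f` -a ** -t get -n 100000 -c 50 -q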
I decided to enable the software watchdog with config set watchdog-period 500 and got these records in the log:
1:signal-handler (1621551890)
--- WATCHDOG TIMER EXPIRED ---
EIP:
/lib/x86_64-linux-gnu/libc.so.6(syscall+0x19) [0x7f8f642e2959]
Backtrace:
keydb-server 0.0.0.0:6379(logStackTrace(ucontext_t*)+0x6b) [0x556bd5ca592b]
keydb-server 0.0.0.0:6379(watchdogSignalHandler(int, siginfo_t*, void*)+0x1d) [0x556bd5ca59cd]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x128a0) [0x7f8f645ca8a0]
/lib/x86_64-linux-gnu/libc.so.6(syscall+0x19) [0x7f8f642e2959]
keydb-server 0.0.0.0:6379(fastlock_sleep+0xa4) [0x556bd5d00624]
keydb-server 0.0.0.0:6379(+0x110399) [0x556bd5d06399]
keydb-server 0.0.0.0:6379(aeProcessEvents+0x2a7) [0x556bd5c43e97]
keydb-server 0.0.0.0:6379(aeMain+0x45) [0x556bd5c442a5]
keydb-server 0.0.0.0:6379(workerThreadMain(void*)+0x74) [0x556bd5c4ac34]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f8f645bf6db]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f8f642e8a3f]
1:signal-handler (1621551890) --------
(The identical backtrace repeats a few seconds later, at timestamp 1621551893.)
Both servers are bare metal with 20 CPUs, 128 GB RAM, and a 10 Gbps network; the OS is CentOS Linux release 7.9.2009 (Core).
I set somaxconn with sysctl -w net.core.somaxconn=1024 and disabled transparent hugepages with echo never | tee /sys/kernel/mm/transparent_hugepage/enabled.
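To make these kernel settings persist across reboots, something like the following can be used (a sketch; the file names are my choice, the locations are the conventional ones on CentOS 7):

# /etc/sysctl.d/99-keydb.conf
net.core.somaxconn = 1024

# appended to /etc/rc.local (must be executable)
echo never > /sys/kernel/mm/transparent_hugepage/enabled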
top shows keydb-server consuming 90-140% CPU.
The first lines of perf top:
Samples: 57K of event 'cycles:ppp', 4000 Hz, Event count (approx.): 26274076884 lost: 0/0 drop: 0/0
Overhead Shared Object Symbol
37.75% keydb-server [.] 0x0000000000077297
15.24% keydb-server [.] 0x0000000000050cf9
7.92% keydb-server [.] 0x0000000000110388
1.36% keydb-server [.] 0x0000000000077294
1.14% keydb-server [.] 0x000000000011038d
I have no idea how to reproduce this; can you help me find out what I am doing wrong?
I have been having the same issue recently: the CPU load on one of the clusters is insanely high, and it is almost impossible to run anything on it.
Meanwhile, the dump.rdb file stops updating because of it.
Hi Akosyrev & Lubard
Thank you for contacting EQAlpha. We appreciate you reaching out to us.
- For starters, you can raise
server-threads 7
See: https://docs.keydb.dev/blog/2019/10/28/blog-post
However, since you have 20 CPUs (and, I assume, 20 physical cores), you can increase it to as many as
server-threads 20
- Uncomment
server-thread-affinity true
to pin worker threads to cores and optimize CPU usage (see the combined snippet after this list).
- I also see some configuration parameters that add extra work; you could try disabling them for a potential performance improvement.
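Putting the first two suggestions together, the relevant keydb.conf lines would be (a sketch; 20 threads assumes 20 physical cores, as noted above):

server-threads 20
server-thread-affinity true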