QAT_Engine icon indicating copy to clipboard operation
QAT_Engine copied to clipboard

sleep(1) in function qat_engine_init() probably affect Nginx connection accept when reloading Nginx.

Open fuchencong opened this issue 3 years ago • 4 comments

1. Version

root@chaitin-safeline:/QAT/build# nginx -V
Tengine version: Tengine/2.3.2
nginx version: nginx/1.17.3
built by gcc 7.3.0 (Ubuntu 7.3.0-27ubuntu1~18.04)
built with OpenSSL 1.1.1c  28 May 2019
TLS SNI support enabled`
OpenSSL 1.1.1c
QAT_ENGINE_VERSION: 0.5.42
QAT_DRIVER_VERSION 1.7.l.4.6.0-00025

2. Issue

In our production environment, I observed a problem: some http requests take more than 1s to process when reloading Nginx(nginx -s reload).

After some trouble-shooting, we think the issue is probably due to sleep(1) in function qat_engine_init(). The bug process is as follows:

  1. When reload Nginx, Nginx will spawn new worker process to accept new connection
  2. Stop old worker process gracefully: the old worker process will stop accept new connection, and just handle the remaining connections
  3. After fork new worker, new worker will execute qat_engine_init() by engine_init_child_at_fork_handler
  4. qat_engine_init() may execute sleep(1), which will block new worker to accept TCP connection, even though the TCP connection has been established in kernel

By strace new worker, I found the nanosleep(1s) syscall, we belived the syscall is producted by qat-engine library(I can't reproduce the issue under gdb, so not able to get userspace backtrace).

16:00:50.865468 openat(AT_FDCWD, "/dev/qat_adf_ctl", O_RDWR) = 19 <0.000008>
16:00:50.865500 ioctl(19, _IOC(_IOC_WRITE, 0x61, 0x5, 0xc), 0x7ffdd907f4b4) = 0 <0.000008>
16:00:50.865530 close(19)               = 0 <0.000007>
16:00:50.865560 rt_sigprocmask(SIG_BLOCK, [USR1], NULL, 8) = 0 <0.000007>
16:00:50.865657 clone(child_stack=0x7f0dbd0acf70, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f0dbd0ad9d0, tls=0x7f0dbd0ad700, child_tidptr=0x7f0dbd0ad9d0) = 21338 <0.000110>

16:00:50.865821 nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffdd907f9d0) = 0 <1.000084>

16:00:51.865975 getpid()                = 21337 <0.000021>
16:00:51.866076 write(3, "2022/12/28 16:00:50 1472172449  "..., 79) = 79 <0.000024>
16:00:51.866132 write(3, "2022/12/28 16:00:50 1472172449  "..., 76) = 76 <0.000009>
16:00:51.866171 prlimit64(0, RLIMIT_NOFILE, {rlim_cur=329270, rlim_max=329270}, NULL) = 0 <0.000009>

By disalbe qat-engine in openssl, the issue is not seen.

3. So...

I think sleep(1) is not an appropriate solution to wait for initialization completion. Would someone please to double check the issue I mentioned?

fuchencong avatar Dec 28 '22 10:12 fuchencong

@fuchencong We will check and revert back on the issue

Yogaraj-Alamenda avatar Jan 05 '23 07:01 Yogaraj-Alamenda

@Yogaraj-Alamenda Any update on the issue?

fuchencong avatar Jan 28 '23 07:01 fuchencong

@fuchencong Apologies for the delay. The sleep was added for the polling thread creation to complete. I will try changing to cond wait to avoid time delay.

Yogaraj-Alamenda avatar Feb 08 '23 12:02 Yogaraj-Alamenda

@Yogaraj-Alamenda Thank you for you reply. Totally agree with you. Looking forward to the new version.

fuchencong avatar Feb 09 '23 02:02 fuchencong