sleep(1) in function qat_engine_init() probably affect Nginx connection accept when reloading Nginx.
1. Version
root@chaitin-safeline:/QAT/build# nginx -V
Tengine version: Tengine/2.3.2
nginx version: nginx/1.17.3
built by gcc 7.3.0 (Ubuntu 7.3.0-27ubuntu1~18.04)
built with OpenSSL 1.1.1c 28 May 2019
TLS SNI support enabled`
OpenSSL 1.1.1c
QAT_ENGINE_VERSION: 0.5.42
QAT_DRIVER_VERSION 1.7.l.4.6.0-00025
2. Issue
In our production environment, I observed a problem: some http requests take more than 1s to process when reloading Nginx(nginx -s reload).
After some trouble-shooting, we think the issue is probably due to sleep(1) in function qat_engine_init(). The bug process is as follows:
- When reload Nginx, Nginx will spawn new worker process to accept new connection
- Stop old worker process gracefully: the old worker process will stop accept new connection, and just handle the remaining connections
- After fork new worker, new worker will execute
qat_engine_init()byengine_init_child_at_fork_handler -
qat_engine_init()may executesleep(1), which will block new worker to accept TCP connection, even though the TCP connection has been established in kernel
By strace new worker, I found the nanosleep(1s) syscall, we belived the syscall is producted by qat-engine library(I can't reproduce the issue under gdb, so not able to get userspace backtrace).
16:00:50.865468 openat(AT_FDCWD, "/dev/qat_adf_ctl", O_RDWR) = 19 <0.000008>
16:00:50.865500 ioctl(19, _IOC(_IOC_WRITE, 0x61, 0x5, 0xc), 0x7ffdd907f4b4) = 0 <0.000008>
16:00:50.865530 close(19) = 0 <0.000007>
16:00:50.865560 rt_sigprocmask(SIG_BLOCK, [USR1], NULL, 8) = 0 <0.000007>
16:00:50.865657 clone(child_stack=0x7f0dbd0acf70, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f0dbd0ad9d0, tls=0x7f0dbd0ad700, child_tidptr=0x7f0dbd0ad9d0) = 21338 <0.000110>
16:00:50.865821 nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffdd907f9d0) = 0 <1.000084>
16:00:51.865975 getpid() = 21337 <0.000021>
16:00:51.866076 write(3, "2022/12/28 16:00:50 1472172449 "..., 79) = 79 <0.000024>
16:00:51.866132 write(3, "2022/12/28 16:00:50 1472172449 "..., 76) = 76 <0.000009>
16:00:51.866171 prlimit64(0, RLIMIT_NOFILE, {rlim_cur=329270, rlim_max=329270}, NULL) = 0 <0.000009>
By disalbe qat-engine in openssl, the issue is not seen.
3. So...
I think sleep(1) is not an appropriate solution to wait for initialization completion. Would someone please to double check the issue I mentioned?
@fuchencong We will check and revert back on the issue
@Yogaraj-Alamenda Any update on the issue?
@fuchencong Apologies for the delay. The sleep was added for the polling thread creation to complete. I will try changing to cond wait to avoid time delay.
@Yogaraj-Alamenda Thank you for you reply. Totally agree with you. Looking forward to the new version.