Unhide does not terminate
Hello,
On RHEL8, I have the problem that unhide-linux sys does not terminate and keeps running for days. The observed behavior does not affect every RHEL8 instance, and currently, I have difficulties identifying which instances are affected.
Unhide is stuck in a loop and keeps checking the PIDs. It repeatedly runs checkps(tmppid, checks), where it checks if the currently_to_be_checked_PID is also listed in the output of ps --no-header -p %i o pid.
Sometimes this succeeds, but if the PID does not exist, it will check for threads using: ps --no-header -eL o lwp. This check takes a lot of time since on my test system there are 4K PIDs returned, which it checks one by one to see if the thread is found.
This command is being executed over and over again without really finishing, thus blocking unhide from completing successfully.
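The two-stage lookup described above can be sketched as follows. This is only an illustration, not unhide's actual code: the function names `id_in_ps_output` and `id_visible` are hypothetical, and the ps output is passed in as text instead of being read from a pipe.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Return true if `id` appears among the decimal numbers in `ps_output`
 * (for example the text printed by "ps --no-header -eL o lwp").
 * This is a linear scan, so its cost grows with the number of lines. */
bool id_in_ps_output(const char *ps_output, long id)
{
    const char *p = ps_output;
    char *end;

    while (*p != '\0') {
        long value = strtol(p, &end, 10);
        if (end == p) {          /* no number here: skip one character */
            p++;
            continue;
        }
        if (value == id)
            return true;
        p = end;
    }
    return false;
}

/* Hypothetical two-stage check: look in the per-PID listing first and
 * fall back to the full thread listing only when the PID is absent. */
bool id_visible(const char *pid_listing, const char *lwp_listing, long id)
{
    if (id_in_ps_output(pid_listing, id))
        return true;
    return id_in_ps_output(lwp_listing, id);
}
```

With roughly 4K entries in the thread listing, every fallback scans up to 4K numbers, and that is repeated for each PID that the first stage does not find.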
Do you have any suggestions on how to fix this problem?
Hello Hu6li,
Sorry for the delay. I don't know why, but GitHub didn't send me a notification email for your issue.
There are 3 main factors that influence the duration of an unhide run:
- maximum number of PIDs on the system,
- current number of threads,
- processor load.
Can you tell me which version of unhide you are using? It seems to be v20130526 on RHEL8. The latest version is v20220611. The sysinfo and readdir tests are the only ones that use "ps --no-header -p %i o pid", and neither of them is called by the "sys" test in the latest version. "sysinfo" was removed from the "sys" test list because it generates too many false positives on modern hardware/DE. Moreover, the latest version has a "-u" option that tries to prevent buffering of the piped ps output.
What do you really mean by "does not terminate"? Does it really never terminate, or does it run for hours/days? I've never seen "unhide" fail to end, and there has never been a report to that effect until now.
On my PC, with an almost idle processor (1% CPU load), about 1000 PIDs, and a maximum of 4 million PIDs, unhide sys already takes more than 3 minutes to finish. The "sys" test executes nine elementary tests, each of which loops over the 4M possible PIDs. Each time a PID is found, checkps() is called.
What are your maxpid value and CPU load?
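For reference, a minimal sketch of how the maxpid value can be read on Linux. It assumes that maxpid corresponds to the kernel setting exposed in /proc/sys/kernel/pid_max; the helper names `read_long_from_file` and `read_pid_max` are hypothetical and not taken from the unhide sources.

```c
#include <stdio.h>

/* Read a single decimal value from `path`; returns -1 on failure. */
long read_long_from_file(const char *path)
{
    FILE *f = fopen(path, "r");
    long value = -1;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* Assumption: the PID ceiling is the value in /proc/sys/kernel/pid_max. */
long read_pid_max(void)
{
    return read_long_from_file("/proc/sys/kernel/pid_max");
}
```

The path is a parameter of the first helper so that it can be exercised against an ordinary file as well as against /proc.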
Usually I run the command "./unhide-linux -vou reverse quick" which is far shorter. It is only slightly less effective, and there may be marginally more false negatives than with in-depth analysis commands. In case of suspicion, I run the command "./unhide-linux -vmou reverse procall quick brute procfs sys" which lasts a very long time.
If you need more info or detail, don't hesitate to write back.
Regards, Patrick.
Hi Patrick, Many thanks for your detailed answer. No worries about the delay, as you see I wasn't that fast either so please also excuse my late response.
We use version 20210124. We are now checking whether dependent programs allow us to update unhide.
Sorry about the ambiguity in "does not terminate". As you expected, it keeps running for days/weeks. As mentioned, the problem only occurs on some systems, and on the affected ones I have never witnessed it terminating. Unhide just keeps spawning "ps --no-header -eL o lwp" processes.
On one instance the process has been running since the 6th of December. When I list all child processes of unhide, I see new "ps --no-header -eL o lwp" processes popping up (and finishing) on a regular basis.
Maxpid: 4M, CPU load: 37%
We will now try updating and adjusting the params according to your suggestions. Thanks! Regards, Jens
Hello Jens,
And what is the number of threads on your systems?
I found a small bug where the command "ps --no-header -p %i o pid" is run unnecessarily for each process (not for threads), in addition to the command "ps --no-header -eL o lwp". This doesn't matter, as "ps --no-header -p %i o pid" is very fast. Removing it does not significantly decrease the overall execution time.
In the previous message, I mentioned the -u option for removing output buffering from piped commands. I've seen that I haven't applied this modification in all possible places in the source code.
I'll try to fix these two points quickly.
In the meantime, I've just pushed and released the version correcting issue #11, which has been waiting on my system since May :S
Keep me posted.
Hi Patrick,
Currently the system has 6111 threads running.
Awesome, thanks for your feedback. I tried running unhide with the "-vou reverse quick" options, which helped a bit. Unhide finished after 15 hours.
I'll then wait for your next release. Thanks for your efforts and quick responses!
If I find any more indicators as to why unhide takes so long to run, I'll let you know.
Cheers, Jens
Hi Patrick,
Over the past few days, I explored some options and considered generating a single "ps --no-header" output before iterating over all possible PIDs. This approach aimed to reduce the number of "ps" executions. However, it seems impractical on systems with rapid thread execution changes.
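The single-listing idea can be sketched as follows: parse one ps output into a sorted array and answer each lookup with a binary search. This is only an illustration under assumptions. The type `id_snapshot` and the `snapshot_*` functions are hypothetical names, the listing is passed in as text, and the snapshot goes stale as soon as threads start or exit, which is the drawback mentioned above.

```c
#include <stdbool.h>
#include <stdlib.h>

/* IDs parsed from one ps listing (hypothetical type). */
struct id_snapshot {
    long  *ids;
    size_t count;
};

static int compare_long(const void *a, const void *b)
{
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

/* Parse every decimal number in `listing` into a sorted array.
 * Returns false on allocation failure. */
bool snapshot_build(struct id_snapshot *snap, const char *listing)
{
    size_t capacity = 64;
    const char *p = listing;
    char *end;

    snap->count = 0;
    snap->ids = malloc(capacity * sizeof *snap->ids);
    if (snap->ids == NULL)
        return false;

    while (*p != '\0') {
        long value = strtol(p, &end, 10);
        if (end == p) {              /* no number here: skip one character */
            p++;
            continue;
        }
        if (snap->count == capacity) {
            long *grown = realloc(snap->ids, 2 * capacity * sizeof *grown);
            if (grown == NULL) {
                free(snap->ids);
                snap->ids = NULL;
                snap->count = 0;
                return false;
            }
            snap->ids = grown;
            capacity *= 2;
        }
        snap->ids[snap->count++] = value;
        p = end;
    }
    qsort(snap->ids, snap->count, sizeof *snap->ids, compare_long);
    return true;
}

/* Binary search in the sorted snapshot. */
bool snapshot_contains(const struct id_snapshot *snap, long id)
{
    return bsearch(&id, snap->ids, snap->count,
                   sizeof *snap->ids, compare_long) != NULL;
}

void snapshot_free(struct id_snapshot *snap)
{
    free(snap->ids);
    snap->ids = NULL;
    snap->count = 0;
}
```

Each lookup then costs a binary search instead of a new ps process, at the price of possibly reporting a thread that was created after the snapshot as hidden.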
Currently, there’s a loop for each PID that runs "ps --no-header -p PID -o pid", and if the PID doesn’t exist, a second loop executes "ps --no-header -eL -o lwp".
I haven’t tested this yet, but it might be possible to avoid chaining these loops by using "ps --no-header -L -o lwp,pid -p PID". This command would return all LWPs and PID related to the process, and if no thread exists, the only number returned would be the PID itself.
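A minimal sketch of how the output of such a command could be parsed, assuming each line holds two numbers in the order "LWP PID". Like the command itself, this is untested against real ps output, and the function name `id_in_lwp_pid_listing` is hypothetical.

```c
#include <stdbool.h>
#include <stdio.h>

/* Scan a listing made of "LWP PID" pairs and report whether `id`
 * appears as either number. `line_count`, if not NULL, receives the
 * number of pairs that were parsed (0 means ps printed nothing). */
bool id_in_lwp_pid_listing(const char *listing, long id, int *line_count)
{
    const char *p = listing;
    long lwp, pid;
    int consumed;
    int lines = 0;
    bool found = false;

    /* %n stores how many characters were consumed so far, which lets
     * the loop advance to the next pair. */
    while (sscanf(p, "%ld %ld%n", &lwp, &pid, &consumed) == 2) {
        lines++;
        if (lwp == id || pid == id)
            found = true;
        p += consumed;
    }
    if (line_count != NULL)
        *line_count = lines;
    return found;
}
```

An empty listing (zero pairs) would then mean that ps does not see the PID at all, while a single pair would correspond to a process without additional threads.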
Regards, Jens
Hi Jens,
Yes, the two loops should be mutually exclusive. I have already fixed this bug in my local version.
In unhide-linux.c, line 137:
if (PS_PROC == (checks & PS_PROC))
should read:
if (PS_PROC == (checks & PS_PROC) && 0 == (checks & PS_THREAD))
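The effect of the extra condition can be shown in isolation. In this sketch the flag values are hypothetical placeholders (the real definitions are in the unhide sources), and the two helper functions exist only to compare the old and new tests.

```c
#include <stdbool.h>

/* Hypothetical flag values, chosen only so that the two bits differ. */
enum { PS_PROC = 0x1, PS_THREAD = 0x2 };

/* Old test: the per-PID ps runs whenever PS_PROC is set. */
bool runs_proc_check_before(int checks)
{
    return PS_PROC == (checks & PS_PROC);
}

/* New test: the per-PID ps runs only when PS_PROC is set
 * and PS_THREAD is not, so the two checks never both run. */
bool runs_proc_check_after(int checks)
{
    return PS_PROC == (checks & PS_PROC) && 0 == (checks & PS_THREAD);
}
```

With both flags set, the old test returns true and the new one returns false, so only the thread listing is consulted in that case.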
Not sure if test duration will be significantly shorter.
I really need to push through all the final corrections :S. I'll try to do it before the end of the month.
Cheers, Patrick.