"host" command (on specific IPs) blocking the PHP workers indefinitely
Description
On rare occasions, all FastCGI workers would freeze, making the website unresponsive. The only solution was to kill and restart the FastCGI processes. There were no errors in the logs. I switched to php-fpm to get more logging, but the same issue surfaced again, also without any additional errors logged.
However, I was now able to see (thanks to the status reported by systemctl) that all workers were executing a host -W 1 xxx.xxx.xxx.xxx command during a spam bot attack. The solution was the same: restarting the workers.
In Subs.php I found this:
// If we can't access nslookup/host, PHP 4.1.x might just crash.
if (@version_compare(PHP_VERSION, '4.2.0') == -1)
	$host = false;
// Try the Linux host command, perhaps?
if (!isset($host) && (strpos(strtolower(PHP_OS), 'win') === false || strpos(strtolower(PHP_OS), 'darwin') !== false) && mt_rand(0, 1) == 1)
{
	if (!isset($modSettings['host_to_dis']))
		$test = @shell_exec('host -W 1 ' . @escapeshellarg($ip));
	else
		$test = @shell_exec('host ' . @escapeshellarg($ip));
...
It seems that on my Ubuntu 20.04.6 LTS, the host command does not obey the -W 1 timeout; it just hangs. The same goes for nslookup -timeout=1 ...
However, for the few IPs where the host command hangs, gethostbyaddr took 10 seconds to reply, which is much better than freezing the worker. For now I have just set $host = false; before the big if above.
I could put the timeout 1s command in front of host, but Subs.php expects "not found" in the output. On a timeout, the output would be empty.
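To illustrate the empty-output case, here is a minimal sketch of how the result of host could be classified so that a timeout (empty output) is treated the same as "not found". The function name classify_host_output is hypothetical and not part of SMF, the pattern for the "domain name pointer" line is my own and not the one Subs.php uses, and the commented usage line assumes GNU coreutils timeout is installed.

```php
<?php
// Hypothetical helper, not part of SMF: classify the raw output of `host`.
// Returns '' when no name is known (empty output, e.g. after a timeout,
// or a "not found" answer), otherwise the resolved host name.
function classify_host_output($output)
{
	$output = trim((string) $output);

	// Empty output (timeout, missing binary) is handled like "not found".
	if ($output === '' || strpos($output, 'not found') !== false)
		return '';

	// Typical answer: "8.8.8.8.in-addr.arpa domain name pointer dns.google."
	if (preg_match('~\spointer\s+(\S+?)\.?$~m', $output, $match) === 1)
		return $match[1];

	// Anything else is unrecognised; do not guess a host name.
	return '';
}

// Possible usage (assumes GNU coreutils `timeout` is available):
// $host = classify_host_output(@shell_exec('timeout 1s host -W 1 ' . escapeshellarg($ip)));
```

With this shape, an external timeout cannot leave the caller waiting for a "not found" string that never arrives.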
Steps to reproduce
- I would prefer not to list offending IPs.
Environment (complete as necessary)
- Version/Git revision: 2.0.19
- Database Type: mysql
- Database Version: not important
- PHP Version: 7.4.3
Additional information/references
This repository is for SMF 2.1 and above, not for SMF 2.0.x.
Additionally, your issue appears to be something wrong with your host system rather than SMF.
I suggest that you try asking for help in the official SMF 2.0.x Support board on the Simple Machines website.
Thank you for responding,
Same call exists in SMF 2.1.3 / Subs.php, which is also not expecting an empty output:
$exists = function_exists('shell_exec');
// Try the Linux host command, perhaps?
if ($exists && !isset($host) && (strpos(strtolower(PHP_OS), 'win') === false || strpos(strtolower(PHP_OS), 'darwin') !== false) && mt_rand(0, 1) == 1)
{
	if (!isset($modSettings['host_to_dis']))
		$test = @shell_exec('host -W 1 ' . @escapeshellarg($ip));
	else
		$test = @shell_exec('host ' . @escapeshellarg($ip));
	// Did host say it didn't find anything?
	if (strpos($test, 'not found') !== false)
		$host = '';
The bug I'm seeing is that SMF is spawning external commands without ensuring they exit, but hoping they just exit :)
In SMF 2.1 I see one comment mentioning: Lookup an IP; try shell_exec first because we can do a timeout on it. But I can't seem to find how this timeout is done, it still relies on host (or nslookup) to obey their own timeout, instead of imposing it externally.
What happens if you try to run the command manually?
host -W 1 8.8.8.8
What version of host do you have? host -V
I don't have a 20.04 install, only 22.04 installs, but they work fine.
Now that I was able to identify the problems come from the host command, I made more progress in debugging it.
My hosting provider injects three more DNS servers into /etc/resolv.conf. The first entry, 127.0.0.1, is DJ Bernstein's dnscache. With only 127.0.0.1 (dnscache), host -W 1... behaves correctly, finishing in 1 second for the problematic IPs. However, if I allow the three additional DNS servers the hosting provider injects, host still works for Google's DNS and most IPs, but not for some of the IPs used by the spam bots. It just hangs.
I have already reported the issue to the hosting provider and asked whether I can disable just the DNS part of their network management daemon, but they will take some time to respond. Meanwhile I could of course replace /etc/resolv.conf and leave only the dnscache entry, which works.
$ host -V
host 9.16.1-Ubuntu
I have also tried host on a few other systems (Fedora, Ubuntu); they all seem to work fine with different host versions and different DNS servers.
I could also use the timeout command in front of host, but this too is relying on an external command to obey the timeout.
Leaving aside the problems with host, SMF can be made more robust, which is the reason for this bug report. In principle, if you want a robust system, you can't assume everything external always works properly. On Linux ( https://blog.dubbelboer.com/2012/08/24/execute-with-timeout.html ) there are PHP ways to stop waiting: if the spawned executable doesn't return anything on stdout / stderr within a given time, you can assume it is not working. On Windows it seems to be different (judging by some posts on stackoverflow), but I did not check further.
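As a rough sketch of that idea (my own illustration, not the code from the linked post and not anything in SMF), the timeout can be imposed from PHP itself with proc_open and stream_select. The function name run_with_deadline is hypothetical. The sketch assumes a Linux host and PHP 7.4 or later, because it passes the command to proc_open as an array so that no intermediate shell is spawned and the kill signal reaches the process itself.

```php
<?php
// Hypothetical helper, not part of SMF: run a command and stop waiting
// after $timeout seconds. Returns the captured stdout, or null if the
// command could not be started or did not finish before the deadline.
function run_with_deadline(array $command, $timeout)
{
	$spec = array(
		0 => array('file', '/dev/null', 'r'),
		1 => array('pipe', 'w'),
		2 => array('file', '/dev/null', 'w'),
	);

	// Array form (PHP 7.4+) executes the program directly, without sh -c.
	$process = proc_open($command, $spec, $pipes);
	if (!is_resource($process))
		return null;

	stream_set_blocking($pipes[1], false);
	$deadline = microtime(true) + $timeout;
	$output = '';
	$finished = false;

	while (($remaining = $deadline - microtime(true)) > 0)
	{
		$read = array($pipes[1]);
		$write = null;
		$except = null;
		$seconds = (int) $remaining;
		$micro = (int) (($remaining - $seconds) * 1000000);

		// Wait for output, but never longer than the time we have left.
		$ready = stream_select($read, $write, $except, $seconds, $micro);
		if ($ready === false)
			break;
		if ($ready === 0)
			continue;

		$chunk = fread($pipes[1], 8192);
		if ($chunk !== false && $chunk !== '')
			$output .= $chunk;
		elseif (feof($pipes[1]))
		{
			// The child closed stdout: treat it as done.
			$finished = true;
			break;
		}
	}

	fclose($pipes[1]);

	// Deadline passed (or select failed): kill the child with SIGKILL (9).
	if (!$finished)
		proc_terminate($process, 9);
	proc_close($process);

	return $finished ? $output : null;
}

// Possible usage: $test = run_with_deadline(array('host', '-W', '1', $ip), 2);
```

The point of the design is that the waiting side owns the deadline, so a host binary that ignores -W can no longer hold a worker forever. A null result would then be handled like an empty or "not found" answer.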
I've experienced the issue with 2.1.2 back in October/November 2022. My problem was a (completely) unreachable DNS Server in the rotation, which then quickly ate up all php-fpm workers with forever-blocking host processes. Changing the php-fpm settings somewhat improved the problem, but identifying and removing the dead DNS server made the problem go away entirely.
Has this issue been confirmed in 2.1?
@sbulen The comment just before yours states that the issue happens with 2.1.2 - is that an acceptable answer for your question "has this issue been confirmed in 2.1" ?
The comment before mine indicates that an improper server config was the root cause:
identifying and removing the dead DNS server made the problem go away entirely.
Where we can, we obviously work to make things more robust, but there is not a lot we can do if the process is literally hanging at the OS level.
I find the label "Not enough details" to be unfair. I explained what happens and gave two possible improvements. What other details are missing?
but there is not a lot we can do if the process is literally hanging at the OS level.
I repeat:
timeout 1s host [...]
Is this a lot, compared to host -W 1 [...] ?