"End of system activity file unexpected" during sadf
Here're the steps we did to hit this error:
- /bin/bash -c nohup sar 3 -o /history/log/sar_log > /history/log/sar_log.stdout 2> /history/log/sar_log.stderr &
- /usr/bin/killall -SIGINT sar
- /bin/bash -c sadf -j /history/log/sar_log -- -A > /history/log/sar_log.json 2> /history/log/sar_log_sadf.stderr
We are using:
sysstat version 11.2.0
(C) Sebastien Godard (sysstat
Step 3) above hits the "End of system activity file unexpected" error and we have checked that our /history partition has space. The binary sar file can be found here:https://drive.google.com/file/d/1hlYOSwPyqv3uI-BJrrIazaXvZavGXV0d/view?usp=sharing
Please let us know if anything else may be useful to diagnose the problem and whether this is a known bug somewhere.
Hi, I couldn't reproduce the problem. You may have sent the SIGINT signal to sar at the very moment when the data collector (sadc) was saving its data, preventing it from completing its task.
Note: Please send me your binary sar file to sysstat at orange dot fr (I don't have access to Google drive).
Hi there,
Thanks a lot for looking into this! Sorry that I forgot to grant permission on the above sar binary file,it should be fixed now. I also sent a copy of the broken binary sar_log to the email address you provided. Is the sadc collector not writing data to disk in an atomic fashion? What's the proper termination signal we should send in this case then?
Thanks, Aaron
Sorry for my previous answer. In fact it can't be because of the SIGINT signal interrupting sadc since this signal is trapped by sadc. Wrt your problem, your datafile is really corrupted. Did you have a look at its contents? It is correct until 05:11:27. Then it contains only zero values. How did you create it? Did you run it in a VM? The sar command seems to no longer have access to any /proc files... IMO this is not a sysstat problem. Excerpt from sar -f sar_log:
05:11:21 CPU %user %nice %system %iowait %steal %idle
05:11:24 all 3.10 0.00 2.94 0.04 0.00 93.92
05:11:27 all 4.08 0.00 2.92 0.00 0.00 93.00
01:00:00 all 0.00 0.00 0.00 0.00 0.00 0.00
01:00:00 all 0.00 0.00 0.00 0.00 0.00 0.00
01:00:00 all 0.00 0.00 0.00 0.00 0.00 0.00
01:00:00 all 0.00 0.00 0.00 0.00 0.00 0.00
<...>
Thanks for the reply! We create it by ssh into the machine and run the command i posted above, and it's running on Hardware.
We just got another sar binary: https://drive.google.com/file/d/1tIvaVSqEOgol5BokQQUCI9Jq6uzXOqI0/view?usp=sharing
Here's the part where it starts to become crazy from your sar -f command:
04:38:29 PM all 28.26 0.00 28.68 0.00 0.00 43.06
04:38:32 PM all 30.28 0.00 27.77 0.00 0.00 41.95
04:38:35 PM all 30.26 0.00 27.80 0.00 0.00 41.93
04:38:38 PM all 32.96 0.00 22957790449557.53 46301417919140992.00 65654386172.62 473510156758.40
12:00:00 AM all 0.00 0.00 0.00 0.00 0.00 0.00
12:00:00 AM all 0.00 0.00 6186900389888000.00 1288490188800.00 3865470566400.00 128849018880000.00
12:00:00 AM all 0.00 0.00 6202362272153600.00 1288490188800.00 9019431321600.00 128849018880000.00
12:00:00 AM all 0.00 0.00 6186900389888000.00 1288490188800.00 4294967296000.00 128849018880000.00
12:00:00 AM all 0.00 0.00 6193342840832000.00 1288490188800.00 6442450944000.00 128849018880000.00
Which file under /proc do we get the cpuinfo from? Maybe we can have a separate thread to ensure we have access to it while running our sar command...
Another interesting observation is that the time in the /history/log/sar_log.stdout (from the sar command stdout) and sar -f sar_log doesn't match. And the stdout of the sar command seems pretty legit with no errors. The /history/log/sar_log.stdout:
Linux 4.13.0-16-generic (qfsd-2) 02/19/2018 _x86_64_ (48 CPU)
06:57:51 PM CPU %user %nice %system %iowait %steal %idle
06:57:54 PM all 1.28 0.00 0.60 0.12 0.00 98.01
......
07:04:42 PM all 0.02 0.00 0.01 0.00 0.00 99.97
Average: all 12.48 0.00 4.58 0.01 0.00 82.93
The output from sar -f sar_log:
04:38:32 PM all 30.28 0.00 27.77 0.00 0.00 41.95
04:38:35 PM all 30.26 0.00 27.80 0.00 0.00 41.93
04:38:38 PM all 32.96 0.00 22957790449557.53 46301417919140992.00 65654386172.62 473510156758.40
12:00:00 AM all 0.00 0.00 0.00 0.00 0.00 0.00
12:00:00 AM all 0.00 0.00 6186900389888000.00 1288490188800.00 3865470566400.00 128849018880000.00
12:00:00 AM all 0.00 0.00 6202362272153600.00 1288490188800.00 9019431321600.00 128849018880000.00
Recently I also hit the same issue on my system and it seems critical as we are unable to analyze the performance statistics.
When I tried extracting one of the sa content, it initially displayed the stats from 12:00:01 to 12:24:11
12:00:01 AM CPU %user %nice %system %iowait %steal %idle 12:00:11 AM all 42.37 0.00 20.32 0.03 0.00 37.28 12:00:21 AM all 16.77 0.00 15.13 1.02 0.00 67.08 ... ... 12:24:11 AM all 2.25 0.00 1.67 0.06 0.00 96.02
However after ~24 mins, the timer seems to clock again from 12:00:00 with absolutely zero statistics and this continues for few secs logging huge redundant statistics to the sa file eventually ending with the message "End of system activity file unexpected":
12:00:00 AM all 0.00 0.00 0.00 0.00 0.00 0.00 … … 12:00:48 AM all 0.00 0.00 0.00 0.00 0.00 0.00 “End of system activity file unexpected”.
It seems like sysstat has corrupted the file due to some unknown reason.
I see this defect is still open. Do we have plan for fix. What can be done to avoid this issue if it occurs more frequently?
FYI, I am running sysstat 10.1.5 version on centos7.4 OS.
Will upgrading systat version helps?
I cannot promise that upgrading will fix the problem since I have been unable to reproduce it and to give any proper explanations. How did you configure sysstat / do you use it? Do you use cron/systemd to launch sadc ? Do you use it inside a VM or on real hardware ? Do you use it via ssh? etc.
Hi,
We hit this issue several times. I am not able to reproduce but here are some hints.
It obviously happens here: https://github.com/sysstat/sysstat/blob/6886152fb3af82376318c35eda416c3ce611121d/sa_common.c#L1404
This method being called at several places, it is difficult for me to say where exactly it happens.
In my case, it occurs after power-cycling and normally, sar should print a line about LINUX RESTART. Something like:
[...]
09:40:01 AM all 2.45 0.00 0.33 0.00 0.00 97.22
09:50:01 AM all 2.44 0.00 0.33 0.00 0.00 97.22
10:00:01 AM all 2.41 0.00 0.33 0.00 0.00 97.25
10:10:01 AM all 2.46 0.00 0.34 0.00 0.00 97.20
10:20:01 AM all 2.45 0.00 0.33 0.00 0.00 97.22
10:30:01 AM all 2.46 0.00 0.34 0.00 0.00 97.19
Average: all 2.43 0.00 0.33 0.00 0.00 97.23
11:46:35 AM LINUX RESTART
10:50:01 AM CPU %user %nice %system %iowait %steal %idle
11:00:01 AM all 0.24 0.00 0.02 0.00 0.00 99.73
11:10:01 AM all 0.24 0.00 0.02 0.00 0.00 99.73
11:20:01 AM all 0.24 0.00 0.02 0.00 0.00 99.73
11:30:01 AM all 0.25 0.00 0.02 0.00 0.00 99.73
Average: all 0.24 0.00 0.02 0.00 0.00 99.73
11:30:35 AM LINUX RESTART
11:30:57 AM CPU %user %nice %system %iowait %steal %idle
11:40:02 AM all 0.27 0.00 0.05 0.00 0.00 99.67
Average: all 0.27 0.00 0.05 0.00 0.00 99.67
This is not the case so I would assume something happens at this moment. It is not exactly clear why it happens because I cannot reproduce but I see two potential troubles:
- A restart a the moment the file read by sar is written. It gets corrupted or not properly formatted for sar, thus the test
if (n < size)fails and sar exits. - it may be related to the fact that sometimes when the machine is restarted, its time is brought back to 01.01.1970 (I also have some mechanism that changes the time under some conditions).
I played with watch sar and date, switching very often the date in the past, the future, etc. without observing any problem, which seems to exclude 2.. On the other hand, I cannot observe anything with restarting the machine several times either.
Restarting the service helps. The unit contains only the execution of /usr/lib64/sa/sa1 --boot.
Hope this can give some hints.
@aquemy Please also send me your saXX binary data file so that I can analyze it (send it to sysstat at orange dot fr).
@akhishukla23 FYI I have been reported that same problem ("End of system activity file unexpected") on CentOS 7.7 using sysstat 10.1.5. Altough I'm pretty sure that this issue is related to a lack of disk space, it seems to have been fixed by disabling a cron job calling /usr/lib64/sa/sa1 60 2 at 23:59 daily.
Maybe this can help.
@sysstat Earlier you'd asked if the log was created in a vm. What is the significance of that? I am now seeing that all of the sar logs that I create with vms are reporting "End of system activity file unexpected" where all of the logs on physical machines which are collected the same way are working fine.
@akhishukla23 FYI I have been reported that same problem ("End of system activity file unexpected") on CentOS 7.7 using sysstat 10.1.5. Altough I'm pretty sure that this issue is related to a lack of disk space, it seems to have been fixed by disabling a cron job calling
/usr/lib64/sa/sa1 60 2at 23:59 daily. Maybe this can help.
I am using centos 7.7 with sysstat 10.1.5 and I see this issue on all virtual machines constantly. Is there an issue I can track somewhere please?
@sysstat I am using CentOS Linux release 7.6 with sysstat sysstat-10.1.5-18.el7.x86_64
yum -y install sysstat
The following changes were made to this /etc/cron.d/sysstat
# Run system activity accounting tool every 10 minutes
* * * * * root /usr/lib64/sa/sa1 1 60
# 0 * * * * root /usr/lib64/sa/sa1 600 6 &
# Generate a daily summary of process accounting at 23:53
#53 23 * * * root /usr/lib64/sa/sa2 -A
and saxx file is created in the /var/log/sa/201912/ path.
But, Problems arise when executing.
sar -f /var/log/sa/201912/saxx
Linux 3.10.0-957.5.1.el7.x86_64 (my-server) 12/14/2019 _x86_64_ (8 CPU)
End of system activity file unexpected
An error occurs as you move on to the next day. Do you know what the problem is?
@sysstat I'd love to know if there is any active investigation regarding the issue where sar data files are unreadable for some metrics (e.g -n ETCP) on Virtual Machines.
@gleventhal Sorry, but as I have already told before, I have never been able to reproduce the problem on my machine. Moreover I don't' use VM (and I really don't have the time to set up such an environment). What is a bit surprising is that the problem seems to occur on CentOS environments using a (probably modified) sysstat version 10.1.5. Of course I cannot tell you to change your distro, but maybe you could upgrade your sysstat version to the latest stable one (by removing the version provided by CentOS and installing e.g. sysstat version 12.2.1 that you can download from my web site at http://sebastien.godard.pagesperso-orange.fr/download.html) and check if the problem is still there.
@sysstat For details about how to reproduce the problem, see this link.:https://bugzilla.redhat.com/show_bug.cgi?id=1669932
@tong-da-wei Use -f flag with sadc to prevent data corruption. See also issue #275.
@sysstat I am using CentOS Linux release 7.6 with sysstat sysstat-10.1.5-18.el7.x86_64
yum -y install sysstatThe following changes were made to this /etc/cron.d/sysstat
# Run system activity accounting tool every 10 minutes * * * * * root /usr/lib64/sa/sa1 1 60 # 0 * * * * root /usr/lib64/sa/sa1 600 6 & # Generate a daily summary of process accounting at 23:53 #53 23 * * * root /usr/lib64/sa/sa2 -Aand saxx file is created in the /var/log/sa/201912/ path.
But, Problems arise when executing.
sar -f /var/log/sa/201912/saxx Linux 3.10.0-957.5.1.el7.x86_64 (my-server) 12/14/2019 _x86_64_ (8 CPU) End of system activity file unexpectedAn error occurs as you move on to the next day. Do you know what the problem is?
I have the same problem,Can you help answer me @sysstat
@mrzhangxing Here are all the possible reasons which may explain your problem:
- sadc, the data collector, was unable to write all its data to the system activity data file. This is probably because there was no space left on the device where the data file is located. Make sure there is enough free space on the device.
- Several instances of sadc are trying to update the same data file, especially around midnight when making a file rotation. Although you can modify your crontab by hand, a new flag (
--rotate) has been added to sa1 shell script in recent sysstat versions (12.5.1 and later) to handle this gracefully. - Last this issue can also be triggered when the system is rebooted forcibly without data in cache being written to disk. To prevent this you can use switch
-fwith sadc to make sure data are written to disk immediately (but with a slight performance penalty).
I am also having this problem on Centos 7.9, with sysstat version 10.1.5. At first I had the following in crontab:
* * * * * root /usr/lib64/sa/sa1 -L 1 60 &
Then I had the issue of End of system activity file unexpected.
So I changed the crontab to the following based on answer 2 above:
* 1-22 * * * root /usr/lib64/sa/sa1 -L 1 60 &
0-58 23 * * * root /usr/lib64/sa/sa1 -L 1 60 &
59 23 * * * root /usr/lib64/sa/sa1 -L 1 120 &
1-59 0 * * * root /usr/lib64/sa/sa1 -L 1 60 &
Still had the same issue.
Solved it by uninstalling sysstat package, downloading the latest code and building the latest version, which worked.