sysstat icon indicating copy to clipboard operation
sysstat copied to clipboard

"End of system activity file unexpected" during sadf

Open tabletenniser opened this issue 7 years ago • 20 comments

Here're the steps we did to hit this error:

  1. /bin/bash -c nohup sar 3 -o /history/log/sar_log > /history/log/sar_log.stdout 2> /history/log/sar_log.stderr &
  2. /usr/bin/killall -SIGINT sar
  3. /bin/bash -c sadf -j /history/log/sar_log -- -A > /history/log/sar_log.json 2> /history/log/sar_log_sadf.stderr

We are using: sysstat version 11.2.0 (C) Sebastien Godard (sysstat orange.fr)

Step 3) above hits the "End of system activity file unexpected" error and we have checked that our /history partition has space. The binary sar file can be found here:https://drive.google.com/file/d/1hlYOSwPyqv3uI-BJrrIazaXvZavGXV0d/view?usp=sharing

Please let us know if anything else may be useful to diagnose the problem and whether this is a known bug somewhere.

tabletenniser avatar Feb 16 '18 18:02 tabletenniser

Hi, I couldn't reproduce the problem. You may have sent the SIGINT signal to sar at the very moment when the data collector (sadc) was saving its data, preventing it from completing its task.

Note: Please send me your binary sar file to sysstat at orange dot fr (I don't have access to Google drive).

sysstat avatar Feb 17 '18 16:02 sysstat

Hi there,

Thanks a lot for looking into this! Sorry that I forgot to grant permission on the above sar binary file,it should be fixed now. I also sent a copy of the broken binary sar_log to the email address you provided. Is the sadc collector not writing data to disk in an atomic fashion? What's the proper termination signal we should send in this case then?

Thanks, Aaron

tabletenniser avatar Feb 17 '18 23:02 tabletenniser

Sorry for my previous answer. In fact it can't be because of the SIGINT signal interrupting sadc since this signal is trapped by sadc. Wrt your problem, your datafile is really corrupted. Did you have a look at its contents? It is correct until 05:11:27. Then it contains only zero values. How did you create it? Did you run it in a VM? The sar command seems to no longer have access to any /proc files... IMO this is not a sysstat problem. Excerpt from sar -f sar_log:

05:11:21        CPU     %user     %nice   %system   %iowait    %steal     %idle
05:11:24        all      3.10      0.00      2.94      0.04      0.00     93.92
05:11:27        all      4.08      0.00      2.92      0.00      0.00     93.00
01:00:00        all      0.00      0.00      0.00      0.00      0.00      0.00
01:00:00        all      0.00      0.00      0.00      0.00      0.00      0.00
01:00:00        all      0.00      0.00      0.00      0.00      0.00      0.00
01:00:00        all      0.00      0.00      0.00      0.00      0.00      0.00
<...>

sysstat avatar Feb 18 '18 13:02 sysstat

Thanks for the reply! We create it by ssh into the machine and run the command i posted above, and it's running on Hardware.

We just got another sar binary: https://drive.google.com/file/d/1tIvaVSqEOgol5BokQQUCI9Jq6uzXOqI0/view?usp=sharing

Here's the part where it starts to become crazy from your sar -f command:

04:38:29 PM     all     28.26      0.00     28.68      0.00      0.00     43.06
04:38:32 PM     all     30.28      0.00     27.77      0.00      0.00     41.95
04:38:35 PM     all     30.26      0.00     27.80      0.00      0.00     41.93
04:38:38 PM     all     32.96      0.00 22957790449557.53 46301417919140992.00 65654386172.62 473510156758.40
12:00:00 AM     all      0.00      0.00      0.00      0.00      0.00      0.00
12:00:00 AM     all      0.00      0.00 6186900389888000.00 1288490188800.00 3865470566400.00 128849018880000.00
12:00:00 AM     all      0.00      0.00 6202362272153600.00 1288490188800.00 9019431321600.00 128849018880000.00
12:00:00 AM     all      0.00      0.00 6186900389888000.00 1288490188800.00 4294967296000.00 128849018880000.00
12:00:00 AM     all      0.00      0.00 6193342840832000.00 1288490188800.00 6442450944000.00 128849018880000.00

Which file under /proc do we get the cpuinfo from? Maybe we can have a separate thread to ensure we have access to it while running our sar command...

tabletenniser avatar Feb 20 '18 19:02 tabletenniser

Another interesting observation is that the time in the /history/log/sar_log.stdout (from the sar command stdout) and sar -f sar_log doesn't match. And the stdout of the sar command seems pretty legit with no errors. The /history/log/sar_log.stdout:

Linux 4.13.0-16-generic (qfsd-2) 	02/19/2018 	_x86_64_	(48 CPU)

06:57:51 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
06:57:54 PM     all      1.28      0.00      0.60      0.12      0.00     98.01
......
07:04:42 PM     all      0.02      0.00      0.01      0.00      0.00     99.97

Average:        all     12.48      0.00      4.58      0.01      0.00     82.93

The output from sar -f sar_log:

04:38:32 PM     all     30.28      0.00     27.77      0.00      0.00     41.95
04:38:35 PM     all     30.26      0.00     27.80      0.00      0.00     41.93
04:38:38 PM     all     32.96      0.00 22957790449557.53 46301417919140992.00 65654386172.62 473510156758.40
12:00:00 AM     all      0.00      0.00      0.00      0.00      0.00      0.00
12:00:00 AM     all      0.00      0.00 6186900389888000.00 1288490188800.00 3865470566400.00 128849018880000.00
12:00:00 AM     all      0.00      0.00 6202362272153600.00 1288490188800.00 9019431321600.00 128849018880000.00

tabletenniser avatar Feb 20 '18 19:02 tabletenniser

Recently I also hit the same issue on my system and it seems critical as we are unable to analyze the performance statistics.

When I tried extracting one of the sa content, it initially displayed the stats from 12:00:01 to 12:24:11

12:00:01 AM CPU %user %nice %system %iowait %steal %idle 12:00:11 AM all 42.37 0.00 20.32 0.03 0.00 37.28 12:00:21 AM all 16.77 0.00 15.13 1.02 0.00 67.08 ... ... 12:24:11 AM all 2.25 0.00 1.67 0.06 0.00 96.02

However after ~24 mins, the timer seems to clock again from 12:00:00 with absolutely zero statistics and this continues for few secs logging huge redundant statistics to the sa file eventually ending with the message "End of system activity file unexpected":

12:00:00 AM all 0.00 0.00 0.00 0.00 0.00 0.00 … … 12:00:48 AM all 0.00 0.00 0.00 0.00 0.00 0.00 “End of system activity file unexpected”.

It seems like sysstat has corrupted the file due to some unknown reason.

I see this defect is still open. Do we have plan for fix. What can be done to avoid this issue if it occurs more frequently?

FYI, I am running sysstat 10.1.5 version on centos7.4 OS.

Will upgrading systat version helps?

akhishukla23 avatar Aug 09 '18 05:08 akhishukla23

I cannot promise that upgrading will fix the problem since I have been unable to reproduce it and to give any proper explanations. How did you configure sysstat / do you use it? Do you use cron/systemd to launch sadc ? Do you use it inside a VM or on real hardware ? Do you use it via ssh? etc.

sysstat avatar Aug 09 '18 06:08 sysstat

Hi,

We hit this issue several times. I am not able to reproduce but here are some hints.

It obviously happens here: https://github.com/sysstat/sysstat/blob/6886152fb3af82376318c35eda416c3ce611121d/sa_common.c#L1404

This method being called at several places, it is difficult for me to say where exactly it happens.

In my case, it occurs after power-cycling and normally, sar should print a line about LINUX RESTART. Something like:

[...]
09:40:01 AM     all      2.45      0.00      0.33      0.00      0.00     97.22
09:50:01 AM     all      2.44      0.00      0.33      0.00      0.00     97.22
10:00:01 AM     all      2.41      0.00      0.33      0.00      0.00     97.25
10:10:01 AM     all      2.46      0.00      0.34      0.00      0.00     97.20
10:20:01 AM     all      2.45      0.00      0.33      0.00      0.00     97.22
10:30:01 AM     all      2.46      0.00      0.34      0.00      0.00     97.19
Average:        all      2.43      0.00      0.33      0.00      0.00     97.23

11:46:35 AM       LINUX RESTART

10:50:01 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
11:00:01 AM     all      0.24      0.00      0.02      0.00      0.00     99.73
11:10:01 AM     all      0.24      0.00      0.02      0.00      0.00     99.73
11:20:01 AM     all      0.24      0.00      0.02      0.00      0.00     99.73
11:30:01 AM     all      0.25      0.00      0.02      0.00      0.00     99.73
Average:        all      0.24      0.00      0.02      0.00      0.00     99.73

11:30:35 AM       LINUX RESTART

11:30:57 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
11:40:02 AM     all      0.27      0.00      0.05      0.00      0.00     99.67
Average:        all      0.27      0.00      0.05      0.00      0.00     99.67

This is not the case so I would assume something happens at this moment. It is not exactly clear why it happens because I cannot reproduce but I see two potential troubles:

  1. A restart a the moment the file read by sar is written. It gets corrupted or not properly formatted for sar, thus the test if (n < size) fails and sar exits.
  2. it may be related to the fact that sometimes when the machine is restarted, its time is brought back to 01.01.1970 (I also have some mechanism that changes the time under some conditions).

I played with watch sar and date, switching very often the date in the past, the future, etc. without observing any problem, which seems to exclude 2.. On the other hand, I cannot observe anything with restarting the machine several times either.

Restarting the service helps. The unit contains only the execution of /usr/lib64/sa/sa1 --boot.

Hope this can give some hints.

aquemy avatar Jan 07 '19 08:01 aquemy

@aquemy Please also send me your saXX binary data file so that I can analyze it (send it to sysstat at orange dot fr).

sysstat avatar Jan 11 '19 15:01 sysstat

@akhishukla23 FYI I have been reported that same problem ("End of system activity file unexpected") on CentOS 7.7 using sysstat 10.1.5. Altough I'm pretty sure that this issue is related to a lack of disk space, it seems to have been fixed by disabling a cron job calling /usr/lib64/sa/sa1 60 2 at 23:59 daily. Maybe this can help.

sysstat avatar Nov 25 '19 07:11 sysstat

@sysstat Earlier you'd asked if the log was created in a vm. What is the significance of that? I am now seeing that all of the sar logs that I create with vms are reporting "End of system activity file unexpected" where all of the logs on physical machines which are collected the same way are working fine.

gleventhal avatar Nov 25 '19 14:11 gleventhal

@akhishukla23 FYI I have been reported that same problem ("End of system activity file unexpected") on CentOS 7.7 using sysstat 10.1.5. Altough I'm pretty sure that this issue is related to a lack of disk space, it seems to have been fixed by disabling a cron job calling /usr/lib64/sa/sa1 60 2 at 23:59 daily. Maybe this can help.

I am using centos 7.7 with sysstat 10.1.5 and I see this issue on all virtual machines constantly. Is there an issue I can track somewhere please?

gleventhal avatar Nov 25 '19 14:11 gleventhal

@sysstat I am using CentOS Linux release 7.6 with sysstat sysstat-10.1.5-18.el7.x86_64

yum -y install sysstat

The following changes were made to this /etc/cron.d/sysstat

# Run system activity accounting tool every 10 minutes
* * * * * root /usr/lib64/sa/sa1 1 60
# 0 * * * * root /usr/lib64/sa/sa1 600 6 &
# Generate a daily summary of process accounting at 23:53
#53 23 * * * root /usr/lib64/sa/sa2 -A

and saxx file is created in the /var/log/sa/201912/ path.

But, Problems arise when executing.

sar -f /var/log/sa/201912/saxx 
Linux 3.10.0-957.5.1.el7.x86_64 (my-server)        12/14/2019      _x86_64_        (8 CPU)
End of system activity file unexpected

An error occurs as you move on to the next day. Do you know what the problem is?

JakeKyung avatar Dec 19 '19 07:12 JakeKyung

@sysstat I'd love to know if there is any active investigation regarding the issue where sar data files are unreadable for some metrics (e.g -n ETCP) on Virtual Machines.

gleventhal avatar Jan 02 '20 14:01 gleventhal

@gleventhal Sorry, but as I have already told before, I have never been able to reproduce the problem on my machine. Moreover I don't' use VM (and I really don't have the time to set up such an environment). What is a bit surprising is that the problem seems to occur on CentOS environments using a (probably modified) sysstat version 10.1.5. Of course I cannot tell you to change your distro, but maybe you could upgrade your sysstat version to the latest stable one (by removing the version provided by CentOS and installing e.g. sysstat version 12.2.1 that you can download from my web site at http://sebastien.godard.pagesperso-orange.fr/download.html) and check if the problem is still there.

sysstat avatar Jan 03 '20 07:01 sysstat

@sysstat For details about how to reproduce the problem, see this link.:https://bugzilla.redhat.com/show_bug.cgi?id=1669932

AnHeBridgeNorth avatar Jul 26 '21 06:07 AnHeBridgeNorth

@tong-da-wei Use -f flag with sadc to prevent data corruption. See also issue #275.

sysstat avatar Aug 10 '21 07:08 sysstat

@sysstat I am using CentOS Linux release 7.6 with sysstat sysstat-10.1.5-18.el7.x86_64

yum -y install sysstat

The following changes were made to this /etc/cron.d/sysstat

# Run system activity accounting tool every 10 minutes
* * * * * root /usr/lib64/sa/sa1 1 60
# 0 * * * * root /usr/lib64/sa/sa1 600 6 &
# Generate a daily summary of process accounting at 23:53
#53 23 * * * root /usr/lib64/sa/sa2 -A

and saxx file is created in the /var/log/sa/201912/ path.

But, Problems arise when executing.

sar -f /var/log/sa/201912/saxx 
Linux 3.10.0-957.5.1.el7.x86_64 (my-server)        12/14/2019      _x86_64_        (8 CPU)
End of system activity file unexpected

An error occurs as you move on to the next day. Do you know what the problem is?


I have the same problem,Can you help answer me @sysstat

mrzhangxing avatar Sep 09 '21 03:09 mrzhangxing

@mrzhangxing Here are all the possible reasons which may explain your problem:

  1. sadc, the data collector, was unable to write all its data to the system activity data file. This is probably because there was no space left on the device where the data file is located. Make sure there is enough free space on the device.
  2. Several instances of sadc are trying to update the same data file, especially around midnight when making a file rotation. Although you can modify your crontab by hand, a new flag ( --rotate ) has been added to sa1 shell script in recent sysstat versions (12.5.1 and later) to handle this gracefully.
  3. Last this issue can also be triggered when the system is rebooted forcibly without data in cache being written to disk. To prevent this you can use switch -f with sadc to make sure data are written to disk immediately (but with a slight performance penalty).

sysstat avatar Sep 10 '21 15:09 sysstat

I am also having this problem on Centos 7.9, with sysstat version 10.1.5. At first I had the following in crontab:

* * * * * root /usr/lib64/sa/sa1 -L 1 60 &

Then I had the issue of End of system activity file unexpected. So I changed the crontab to the following based on answer 2 above:

* 1-22 * * * root /usr/lib64/sa/sa1 -L 1 60 &
0-58 23 * * * root /usr/lib64/sa/sa1 -L 1 60 &
59 23 * * * root /usr/lib64/sa/sa1 -L 1 120 &
1-59 0 * * * root /usr/lib64/sa/sa1 -L 1 60 &

Still had the same issue.

Solved it by uninstalling sysstat package, downloading the latest code and building the latest version, which worked.

peregineme avatar Dec 14 '23 08:12 peregineme