DAOS-17443 gurt: Always retain first .old log file
- On log rotation, the very first '.old' file is now saved as '.oldest'
Steps for the author:
- [ ] Commit message follows the guidelines.
- [ ] Appropriate Features or Test-tag pragmas were used.
- [ ] Appropriate Functional Test Stages were run.
- [ ] At least two positive code reviews including at least one code owner from each category referenced in the PR.
- [ ] Testing is complete. If necessary, forced-landing label added and a reason added in a comment.
After all prior steps are complete:
- [ ] Gatekeeper requested (daos-gatekeeper added as a reviewer).
Ticket title is 'Always retain the first daos*.log.old logfile'
Status is 'In Progress'
Labels: 'lrz,lrz_track,scrubbed_2.8,usability'
https://daosio.atlassian.net/browse/DAOS-17443
Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16314/3/execution/node/1427/log
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16314/3/execution/node/1382/log
I don't really mind either way, but what is the incentive for keeping the first one and not the others? Are we assuming that by the time we start rotating logs there are already so many errors that only the first log is interesting? Why don't we append a timestamp to the filename and always use that to save the old log (instead of having .old, .oldest, etc.)?
This request is based on real-world usage on different clusters. What we observe is that once we hit a bad error, the logs tend to fill up quickly with repeated messages, so the current and previous log files contain similar information. In practice, only the first log ends up being useful for seeing the errors that led to the flood.
As for timestamps: we don't want to save every log rotation, only [current], [current - 1], and [first].
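To make the retention policy above concrete, here is a minimal sketch of a rotation step that keeps only [current], [current - 1], and [first]. It is an illustration under stated assumptions, not the gurt implementation: `rotate_log` is a hypothetical helper, and the `.old` / `.first` suffixes follow the naming discussed in this PR.

```c
#include <stdio.h>
#include <unistd.h>

#define SKETCH_PATH_MAX 4096

/*
 * Hypothetical helper (not the gurt code): rotate log_path so that only
 * the current log, the previous log (.old), and the very first rotated
 * log (.first) are retained. Returns 0 on success, -1 on failure.
 */
int rotate_log(const char *log_path)
{
	char old_path[SKETCH_PATH_MAX];
	char first_path[SKETCH_PATH_MAX];
	int  n;

	n = snprintf(old_path, sizeof(old_path), "%s.old", log_path);
	if (n < 0 || (size_t)n >= sizeof(old_path))
		return -1;
	n = snprintf(first_path, sizeof(first_path), "%s.first", log_path);
	if (n < 0 || (size_t)n >= sizeof(first_path))
		return -1;

	/*
	 * A .old file with no .first beside it is the very first rotated
	 * log: move it aside so later rotations cannot overwrite it.
	 */
	if (access(old_path, F_OK) == 0 && access(first_path, F_OK) != 0) {
		if (rename(old_path, first_path) != 0)
			return -1;
	}

	/* rename() replaces any existing .old with the current log. */
	return rename(log_path, old_path) == 0 ? 0 : -1;
}
```

With this sketch, the first rotation produces only `.old`, the second moves that file to `.first` and writes a new `.old`, and every later rotation overwrites `.old` while leaving `.first` untouched.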
That's fine, but IMO it looks a bit clunky to have .oldest and .old naming. If I look at other services running on my Linux box, I don't see anybody doing that kind of thing :)
Maybe rename .oldest to something like .first or .initial?
Renamed to .first.
We should get a review from a test engineer to verify this is not going to break CI. Specifically:
- .first is also collected with the logs
- 3 logs can still be collected without running out of space, since we now have an extra log (this is the case when the logs do overflow).
ftest archives all `*log*`. Space can be a concern, and I don't think there are hard mechanisms in place to prevent that. Logs are stored in `/var/tmp` (not sure how much space we have on all the various clusters). Luckily, when archived, any files larger than 1M are compressed.
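As a quick sanity check of the collection question, the sketch below tests file names against the `*log*` glob using POSIX `fnmatch`. It only illustrates shell-style glob semantics: ftest's actual matching code is not shown in this thread, `matches_log_glob` is a hypothetical helper, and `daos_server.log` is an assumed example file name.

```c
#include <fnmatch.h>

/*
 * Hypothetical helper: returns 1 if name matches the "*log*" glob
 * mentioned above, 0 otherwise. Uses plain shell-style matching
 * (no FNM_PERIOD or FNM_PATHNAME flags).
 */
int matches_log_glob(const char *name)
{
	return fnmatch("*log*", name, 0) == 0;
}
```

Under these assumptions, `daos_server.log`, `daos_server.log.old`, and `daos_server.log.first` all match, so a `.first` file would be picked up by the same pattern as the other two.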
I talked with Phil and we think it's safe to move forward with this. Worst case, if we notice issues in the future, we can adjust the log rollover size per test or globally as needed.
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16314/5/testReport/