DAOS-17443 gurt: Always retain first .old log file

Open frostedcmos opened this issue 9 months ago • 9 comments

  • Upon log rotation, the very first '.old' file is now saved as '.oldest'

Steps for the author:

  • [ ] Commit message follows the guidelines.
  • [ ] Appropriate Features or Test-tag pragmas were used.
  • [ ] Appropriate Functional Test Stages were run.
  • [ ] At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • [ ] Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • [ ] Gatekeeper requested (daos-gatekeeper added as a reviewer).

frostedcmos avatar Apr 25 '25 18:04 frostedcmos

Ticket title is 'Always retain the first daos*.log.old logfile'. Status is 'In Progress'. Labels: 'lrz,lrz_track,scrubbed_2.8,usability'. https://daosio.atlassian.net/browse/DAOS-17443

github-actions[bot] avatar Apr 25 '25 18:04 github-actions[bot]

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16314/3/execution/node/1427/log

daosbuild3 avatar May 23 '25 19:05 daosbuild3

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16314/3/execution/node/1382/log

daosbuild3 avatar May 23 '25 20:05 daosbuild3

I don't really mind either way, but what is the incentive for keeping the first one and not the other ones? Are we just assuming that by the time we start rotating logs there are already so many errors that only the first log is interesting? Why don't we add a timestamp at the end of the filename and always use that to save the old log (instead of having old, oldest, etc.)?

This request was based on real-world usage on different clusters. What we observe is that once we hit some bad error, the logs tend to fill up rather quickly with repeated messages, so the current and previous log files contain similar info. In practice, only the first log ends up being useful for seeing the errors that led to the repeated flood.

As for timestamps: we don't want to save every log rotation, only [current], [current - 1], and [first].

frostedcmos avatar Jun 25 '25 19:06 frostedcmos

That's fine, but IMO the .oldest and .old naming looks a bit clunky; looking at other services running on my Linux box, I don't see any of them doing that kind of thing :)

soumagne avatar Jun 25 '25 21:06 soumagne

maybe rename oldest to something like .first or .initial ?

mchaarawi avatar Jun 25 '25 21:06 mchaarawi

renamed to .first
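
The retention policy discussed in this thread (keep [current], [current - 1], and [first]) can be sketched roughly as below. This is only an illustrative Python sketch, not the gurt C code from this PR: the function name `rotate_log`, its parameters, and the exact moment at which the `.old` file is promoted to `.first` are assumptions made for the example.

```python
import os


def rotate_log(path, first_suffix=".first", old_suffix=".old"):
    """Rotate `path`, keeping the current, previous, and first rotated logs.

    Hypothetical helper for illustration only; the real gurt logic may differ.
    """
    old = path + old_suffix
    first = path + first_suffix

    if os.path.exists(old):
        if not os.path.exists(first):
            # The very first rotated file is preserved and never overwritten.
            os.replace(old, first)
        else:
            # Any later '.old' file is discarded on the next rotation.
            os.remove(old)

    if os.path.exists(path):
        # The current log becomes the previous log.
        os.replace(path, old)

    # Start a fresh, empty current log.
    open(path, "w").close()
```

With this sketch, at most three files exist per log at any time, which is the property the timestamp-per-rotation alternative would not give.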

frostedcmos avatar Jun 26 '25 21:06 frostedcmos

We should get a review from a test engineer to verify this is not going to break CI, specifically that:

  1. .first is also collected along with the other logs
  2. all 3 logs can still be collected and we do not run out of space now that there is an extra log (this applies when the logs do overflow).

ftest archives all *log*. Space can be a concern, and I don't think there are hard mechanisms in place to prevent that. Logs are stored in /var/tmp (not sure how much space we have on all the various clusters). Luckily, when archived, any files larger than 1M are compressed.
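
As a quick sanity check of the point above, a shell-style `*log*` pattern does match a name ending in `.first`. The sketch below is not ftest code: the helper names, the example filenames, and the reading of "1M" as 1 MiB are all assumptions made for illustration.

```python
import fnmatch

ARCHIVE_PATTERN = "*log*"            # pattern quoted in the comment above
COMPRESS_THRESHOLD = 1024 * 1024     # assumption: "1M" means 1 MiB


def is_archived(filename):
    """Return True if a filename would match the archive pattern."""
    return fnmatch.fnmatch(filename, ARCHIVE_PATTERN)


def needs_compression(size_bytes):
    """Return True if a file is larger than the assumed threshold."""
    return size_bytes > COMPRESS_THRESHOLD


# Hypothetical filenames, used only to exercise the pattern.
for name in ("daos_server.log", "daos_server.log.old", "daos_server.log.first"):
    print(name, is_archived(name))
```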

daltonbohning avatar Jun 27 '25 18:06 daltonbohning

I talked with Phil and we think it's safe to move forward with this. Worst case, if we notice issues in the future, we can adjust the log rollover size per test or globally as needed.

daltonbohning avatar Jun 27 '25 19:06 daltonbohning

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16314/5/testReport/

daosbuild3 avatar Jul 02 '25 15:07 daosbuild3