mtail icon indicating copy to clipboard operation
mtail copied to clipboard

Better clearer documentation on using hidden string variables

Open ferringb opened this issue 6 years ago • 6 comments

This request is primarily to work around sequential loglines that don't include request IDs that allow matching the log lines together- but are atomically written so that it's correct to interpret previous logline matches as relating to the following.

Consider this OOMK (timestamp removed from the logs, but it's your run of the mill kernel OOMK):

Task in /docker/0c028dbe2997cc43d97045ac005d774dca19f1f40a8268dc4354a1c2189ff32a killed as a result of limit of /docker/0c028dbe2997cc43d97045ac005d774dca19f1f40a8268dc4354a1c2189ff32a
memory: usage 1048576kB, limit 1048576kB, failcnt 164913
memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
kmem: usage 63384kB, limit 9007199254740988kB, failcnt 0
Memory cgroup stats for /docker/0c028dbe2997cc43d97045ac005d774dca19f1f40a8268dc4354a1c2189ff32a: cache:0KB rss:985192KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0$
[ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
[204624] 1000 204624 203812 59171 164 5 0 0 celery
[164121] 1000 164121 2112 186 10 3 0 0 timeout
[164122] 1000 164122 408939 18884 80 5 0 0 phantomjs
[164127] 1000 164127 2112 201 10 3 0 0 timeout
[164129] 1000 164129 5683321 166314 1141 25 0 0 phantomjs
[164133] 1000 164133 2112 201 10 4 0 0 timeout
[164134] 1000 164134 716373 14010 101 6 0 0 phantomjs
[165525] 1000 165525 2112 205 10 3 0 0 timeout
[165526] 1000 165526 345234 9226 55 4 0 0 phantomjs
[165685] 1000 165685 2112 204 10 3 0 0 timeout
[165686] 1000 165686 325226 9047 47 5 0 0 phantomjs
Memory cgroup out of memory: Kill process 164129 (phantomjs) score 638 or sacrifice child
Killed process 164129 (phantomjs) total-vm:22733284kB, anon-rss:637428kB, file-rss:29216kB, shmem-rss:0kB

The last log line tells me which binary got shot, but the preceeding /docker/.+ tells me which cgroup that was- which if you're running multiple containers with differing restrictions but the same binary, makes reporting that metric painful because the loglines have no linkage one can rely on to join it back.

Ultimately, I would want to synthesis a metric like thus from the above:

kernel_cgroup_OOMK_total{cgroup="/docker/0c028dbe2997cc43d97045ac005d774dca19f1f40a8268dc4354a1c2189ff32a", binary="phantomjs"} 1

But this is not possible with mtail as of rc30 because there is no linkage I can use to merge the labels. Best I can do is have two disparate metrics and try to fuse it at the promql level, and that will result in false positives at best.

What I'm proposing is roughly thus for parsing that data:

hidden string cgroup_oomk
/Task in [^ ]+ killed as a result of limit of (?P<cgroup>[^ ]+)/ {
  cgroup_oomk = $cgroup
}
counter kernel_oomk by cgroup, binary
/Memory cgroup out of memory: Kill process \d+ \((?P<binary>[^\)]+)\)/ {
  kernel_oomk[$cgroup_oomk][$binary]++
/

Offhand, I realize end users could misuse this- if they write such a rule while assuming the loglines are atomic (for the blocker they're tracing), it would be subtle in it's failure. Same angle, if mtail is interrupted in it's processing and doesn't catch the initial match that sets the value- it would result in cgroup_oomk="" which is annoying, but isn't the end of the world.

ferringb avatar Jun 06 '19 21:06 ferringb

I'm surprised that you can't make a string hidden variable at the moment already. I'll look into this...

On Fri, 7 Jun 2019 at 05:40, ferringb [email protected] wrote:

This request is primarily to work around sequential loglines that don't include request IDs that allow matching the log lines together- but are atomically written so that it's correct to interpret previous logline matches as relating to the following.

Consider this OOMK (timestamp removed from the logs, but it's your run of the mill kernel OOMK):

Task in /docker/0c028dbe2997cc43d97045ac005d774dca19f1f40a8268dc4354a1c2189ff32a killed as a result of limit of /docker/0c028dbe2997cc43d97045ac005d774dca19f1f40a8268dc4354a1c2189ff32a memory: usage 1048576kB, limit 1048576kB, failcnt 164913 memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0 kmem: usage 63384kB, limit 9007199254740988kB, failcnt 0 Memory cgroup stats for /docker/0c028dbe2997cc43d97045ac005d774dca19f1f40a8268dc4354a1c2189ff32a: cache:0KB rss:985192KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0$ [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name [204624] 1000 204624 203812 59171 164 5 0 0 celery [164121] 1000 164121 2112 186 10 3 0 0 timeout [164122] 1000 164122 408939 18884 80 5 0 0 phantomjs [164127] 1000 164127 2112 201 10 3 0 0 timeout [164129] 1000 164129 5683321 166314 1141 25 0 0 phantomjs [164133] 1000 164133 2112 201 10 4 0 0 timeout [164134] 1000 164134 716373 14010 101 6 0 0 phantomjs [165525] 1000 165525 2112 205 10 3 0 0 timeout [165526] 1000 165526 345234 9226 55 4 0 0 phantomjs [165685] 1000 165685 2112 204 10 3 0 0 timeout [165686] 1000 165686 325226 9047 47 5 0 0 phantomjs Memory cgroup out of memory: Kill process 164129 (phantomjs) score 638 or sacrifice child Killed process 164129 (phantomjs) total-vm:22733284kB, anon-rss:637428kB, file-rss:29216kB, shmem-rss:0kB

The last log line tells me which binary got shot, but the preceeding /docker/.+ tells me which cgroup that was- which if you're running multiple containers with differing restrictions but the same binary, makes reporting that metric painful because the loglines have no linkage one can rely on to join it back.

Ultimately, I would want to synthesis a metric like thus from the above:

kernel_cgroup_OOMK_total{cgroup="/docker/0c028dbe2997cc43d97045ac005d774dca19f1f40a8268dc4354a1c2189ff32a", binary="phantomjs"} 1

But this is not possible with mtail as of rc30 because there is no linkage I can use to merge the labels. Best I can do is have two disparate cgroups and try to fuse it at the promql level, and that will result in false positives at best.

What I'm proposing is roughly thus for parsing that data:

hidden string cgroup_oomk /Task in [^ ]+ killed as a result of limit of (?<?P[^ ]+)/ { cgroup_oomk = $cgroup } counter kernel_oomk by cgroup, binary /Memory cgroup out of memory: Kill process \d+ (?P[^)]+)/ { kernel_oomk[$cgroup_oomk][$binary]++ /

Offhand, I realize end users could misuse this- if they write such a rule while assuming the loglines are atomic (for the blocker they're tracing), it would be subtle in it's failure. Same angle, if mtail is interrupted in it's processing and doesn't catch the initial match that sets the value- it would result in cgroup_oomk="" which is annoying, but isn't the end of the world.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/google/mtail/issues/234?email_source=notifications&email_token=AAXFX65SVORSRU6MZKLIU3LPZF73HA5CNFSM4HVKK2J2YY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4GYEBQ2Q, or mute the thread https://github.com/notifications/unsubscribe-auth/AAXFX6YBUBSUH3MOOCRH4UTPZF73HANCNFSM4HVKK2JQ .

jaqx0r avatar Jun 13 '19 01:06 jaqx0r

The variable type is not string, it's text, (see examples/mysql_slowqueries.mtail). Also, the variable should not be prefixed by $. So this should work:

hidden text cgroup_oomk
counter kernel_oomk by cgroup, binary

/Task in [^ ]+ killed as a result of limit of (?P<cgroup>[^ ]+)/ {
  cgroup_oomk = $cgroup
}

/Memory cgroup out of memory: Kill process \d+ \((?P<binary>[^\)]+)\)/ {
  kernel_oomk[cgroup_oomk][$binary]++
}

BTW, It would be nice to document this use of variable in the documentaiton... i have been looking for it for a while.

finkr avatar Sep 04 '19 13:09 finkr

It would be nice to document this use of variable in the documentaiton.. --@finkr

I don't understand, you mean the use of 'hidden', if so, that's here: https://github.com/google/mtail/blob/master/docs/state.md

jnovack avatar Sep 09 '19 19:09 jnovack

I suppose this issue can be closed.

@jnovack, This issue and the page state.md document how to handle state across multi-lines.

I should open a distinct issue about using hidden text as variables for labels. For exemple in in examples/mysql_slowqueries.mtail :

query_time[query_type, host, service, user] += tmp_query_time

finkr avatar Sep 10 '19 14:09 finkr

Some documentation would be nice. As far as I can tell the metric type "text" is not documented anywhere. I was glad to find the mysql example as I needed the same thing.

gsstark avatar Oct 31 '19 15:10 gsstark

Yess. The documentation is not that clear. An respective example can be found on the examples path: https://github.com/google/mtail/blob/master/examples/stringy.mtail Extending this with the hidden attribute would also be an additional idea.

anthraxn8b avatar Nov 14 '20 15:11 anthraxn8b