Times are not represented as Time in the VM so lose precision when compared
Mtail records timestamps as Go time.Time values, that have nanosecond resolution. However, when accessing a time acquired using timestamp() from a program, e.g., when subtracting two timestamps or assigning a timestamp to a gauge, then the timestamps are converted to integers using Time.Unix(). This means that the resolution when comparing timestamps is 1s. This can be ok when monitoring longer sessions such as in the rsync example, where the durations are tens of seconds or more. But if the durations are in single digit seconds, the systematic error is significant. Many applications have sub-second transactions or sessions, and these will all be rounded to 0.
Here is a toy demonstration that monitors the duration between emitted log lines for the mtail log itself.
# demo.mtail
counter lines_total
# For why the 0 and negative buckets are needed, see:
# https://github.com/google/mtail/issues/226#issuecomment-524862192
histogram delta_t buckets -1e9, 0, 0.0005, 0.001, 0.01, 0.1, 1, 10, 100
gauge last
def glog {
/(?P<level>[IWECF])(?P<timestamp>\d{4} \d{2}:\d{2}:\d{2}\.\d{6}) +/
/(?P<thread>\d+) (?P<src_file>[^:]+):(?P<src_line>\d+)\] / {
strptime($timestamp, "0102 15:04:05")
next
}
}
@glog {
/$/ {
lines_total > 0 {
delta_t = (timestamp() - last)
}
last = timestamp()
lines_total++
}
}
Example input:
I0829 22:08:41.220692 1288 main.go:100] mtail version invalid ... go version go1.12.7 go arch amd64 go os windows
I0829 22:08:41.221691 1288 main.go:101] Commandline: ["C:\\mtail.exe" "-progs" "demo.mtail" "-logs" "foo.log" "-logtostderr"]
I0829 22:08:41.224689 1288 log_watcher.go:248] No abspath in watched list, added new one for X
I0829 22:08:41.224689 1288 checker.go:202] declaration of capture group reference `src_file' at demo.mtail:7:2-59 appears to be unused
I0829 22:08:41.225714 1288 checker.go:202] declaration of capture group reference `src_line' at demo.mtail:7:2-59 appears to be unused
I0829 22:08:41.225714 1288 checker.go:202] declaration of capture group reference `src_line' at demo.mtail:7:2-59 appears to be unused
I0829 22:08:41.225714 1288 checker.go:202] declaration of capture group reference `thread' at demo.mtail:7:2-59 appears to be unused
I0829 22:08:41.225714 1288 checker.go:202] declaration of capture group reference `thread' at demo.mtail:7:2-59 appears to be unused
I0829 22:08:41.225714 1288 checker.go:202] declaration of capture group reference `src_file' at demo.mtail:7:2-59 appears to be unused
I0829 22:08:41.225714 1288 checker.go:202] declaration of capture group reference `level' at demo.mtail:6:2-68 appears to be unused
I0829 22:08:41.225714 1288 checker.go:202] declaration of capture group reference `level' at demo.mtail:6:2-68 appears to be unused
I0829 22:08:41.226713 1288 loader.go:229] Loaded program demo.mtail
I0829 22:08:41.226713 1288 log_watcher.go:248] No abspath in watched list, added new one for X
W0829 22:08:41.227713 1288 mtail.go:101] No matches for pattern "foo.log"
I0829 22:08:41.227713 1288 store.go:136] Starting metric store expiry loop every 1h0m0s
I0829 22:08:41.227713 1288 tail.go:435] Starting log handle expiry loop every 1h0m0s
I0829 22:08:41.227713 1288 mtail.go:321] Listening on [::]:3903
I0829 22:08:48.037569 1288 log_watcher.go:253] Found this processor in watched list
I0829 22:08:48.038569 1288 log_watcher.go:248] No abspath in watched list, added new one for C:\foo.log
I0829 22:08:48.044581 1288 tail.go:289] Tailing C:\foo.log
I0829 22:09:21.933622 1288 mtail.go:349] Received SIGTERM, exiting...
I0829 22:09:21.934635 1288 mtail.go:363] Shutdown requested.
I0829 22:09:21.935667 1288 log_watcher.go:218] Shutting down log watcher.
I0829 22:09:21.935667 1288 loader.go:381] Shutting down loader.
I0829 22:09:21.936655 1288 mtail.go:386] END OF LINE
Example Prometheus metrics:
delta_t_bucket{prog="demo.mtail",le="0"} 22
delta_t_bucket{prog="demo.mtail",le="0.0005"} 22
delta_t_bucket{prog="demo.mtail",le="0.001"} 22
delta_t_bucket{prog="demo.mtail",le="0.01"} 22
delta_t_bucket{prog="demo.mtail",le="0.1"} 22
delta_t_bucket{prog="demo.mtail",le="1"} 22
delta_t_bucket{prog="demo.mtail",le="10"} 23
delta_t_bucket{prog="demo.mtail",le="100"} 24
delta_t_bucket{prog="demo.mtail",le="+Inf"} 24
delta_t_sum{prog="demo.mtail"} 40
delta_t_count{prog="demo.mtail"} 24
Here we see that all durations were counted as 0, except for the 7s and 33s pauses, that add up to the sum of 40s.
I tested and changed the few places in vm.go that use Unix seconds to use Unix nanoseconds instead. This of course works, but is hardly backwards compatible as timestamp() returns a billion times larger values. Perhaps changing to Unix seconds as a float64 would be better, as that wouldn't change the unit and would give microsecond resolution.
As a reference, this is what I would expect:
delta_t_bucket{prog="demo.mtail",le="0"} 12
delta_t_bucket{prog="demo.mtail",le="0.0005"} 12
delta_t_bucket{prog="demo.mtail",le="0.001"} 17
delta_t_bucket{prog="demo.mtail",le="0.01"} 22
delta_t_bucket{prog="demo.mtail",le="0.1"} 22
delta_t_bucket{prog="demo.mtail",le="1"} 22
delta_t_bucket{prog="demo.mtail",le="10"} 23
delta_t_bucket{prog="demo.mtail",le="100"} 24
delta_t_bucket{prog="demo.mtail",le="+Inf"} 24
delta_t_sum{prog="demo.mtail"} 40.715963
delta_t_count{prog="demo.mtail"} 24
It's terrible, isn't it! Thanks for the bug report, Time should be a first class type in the language so that this is done correctly.
On Fri, 30 Aug 2019 at 22:00, antong [email protected] wrote:
As a reference, this is what I would expect:
delta_t_bucket{prog="demo.mtail",le="0"} 12 delta_t_bucket{prog="demo.mtail",le="0.0005"} 12 delta_t_bucket{prog="demo.mtail",le="0.001"} 17 delta_t_bucket{prog="demo.mtail",le="0.01"} 22 delta_t_bucket{prog="demo.mtail",le="0.1"} 22 delta_t_bucket{prog="demo.mtail",le="1"} 22 delta_t_bucket{prog="demo.mtail",le="10"} 23 delta_t_bucket{prog="demo.mtail",le="100"} 24 delta_t_bucket{prog="demo.mtail",le="+Inf"} 24 delta_t_sum{prog="demo.mtail"} 40.715963 delta_t_count{prog="demo.mtail"} 24
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/google/mtail/issues/263?email_source=notifications&email_token=AAXFX65VON535CKUXOYIR53QHEDWZA5CNFSM4ISMQGJKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD5ROJLQ#issuecomment-526574766, or mute the thread https://github.com/notifications/unsubscribe-auth/AAXFX63N3R2KPRNBVHZTHITQHEDWZANCNFSM4ISMQGJA .
I want to make a Time type internally in the VM to solve this issue.
bump to this issue
Any news on that issue ?
I'm working on it, but it's taking some time.
On Wed, 19 Oct 2022 at 23:03, Stéphane Cazeaux @.***> wrote:
Any news on that issue ?
— Reply to this email directly, view it on GitHub https://github.com/google/mtail/issues/263#issuecomment-1283901213, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAXFX65JY5GSQSJ4I7DRADDWD7PPHANCNFSM4ISMQGJA . You are receiving this because you commented.Message ID: @.***>