mtail icon indicating copy to clipboard operation
mtail copied to clipboard

Times are not represented as Time in the VM so lose precision when compared

Open antong opened this issue 6 years ago • 6 comments

Mtail records timestamps as Go time.Time values, that have nanosecond resolution. However, when accessing a time acquired using timestamp() from a program, e.g., when subtracting two timestamps or assigning a timestamp to a gauge, then the timestamps are converted to integers using Time.Unix(). This means that the resolution when comparing timestamps is 1s. This can be ok when monitoring longer sessions such as in the rsync example, where the durations are tens of seconds or more. But if the durations are in single digit seconds, the systematic error is significant. Many applications have sub-second transactions or sessions, and these will all be rounded to 0.

Here is a toy demonstration that monitors the duration between emitted log lines for the mtail log itself.

# demo.mtail
counter lines_total

# For why the 0 and negative buckets are needed, see:
# https://github.com/google/mtail/issues/226#issuecomment-524862192
histogram delta_t buckets -1e9, 0, 0.0005, 0.001, 0.01, 0.1, 1, 10, 100

gauge last

def glog {
        /(?P<level>[IWECF])(?P<timestamp>\d{4} \d{2}:\d{2}:\d{2}\.\d{6}) +/
        /(?P<thread>\d+) (?P<src_file>[^:]+):(?P<src_line>\d+)\] / {
                strptime($timestamp, "0102 15:04:05")
                next
        }
}

@glog {

/$/ {
        lines_total > 0 {
                delta_t = (timestamp() - last)
        }
        last = timestamp()
        lines_total++
}

}

Example input:

I0829 22:08:41.220692    1288 main.go:100] mtail version invalid ... go version go1.12.7 go arch amd64 go os windows
I0829 22:08:41.221691    1288 main.go:101] Commandline: ["C:\\mtail.exe" "-progs" "demo.mtail" "-logs" "foo.log" "-logtostderr"]
I0829 22:08:41.224689    1288 log_watcher.go:248] No abspath in watched list, added new one for X
I0829 22:08:41.224689    1288 checker.go:202] declaration of capture group reference `src_file' at demo.mtail:7:2-59 appears to be unused
I0829 22:08:41.225714    1288 checker.go:202] declaration of capture group reference `src_line' at demo.mtail:7:2-59 appears to be unused
I0829 22:08:41.225714    1288 checker.go:202] declaration of capture group reference `src_line' at demo.mtail:7:2-59 appears to be unused
I0829 22:08:41.225714    1288 checker.go:202] declaration of capture group reference `thread' at demo.mtail:7:2-59 appears to be unused
I0829 22:08:41.225714    1288 checker.go:202] declaration of capture group reference `thread' at demo.mtail:7:2-59 appears to be unused
I0829 22:08:41.225714    1288 checker.go:202] declaration of capture group reference `src_file' at demo.mtail:7:2-59 appears to be unused
I0829 22:08:41.225714    1288 checker.go:202] declaration of capture group reference `level' at demo.mtail:6:2-68 appears to be unused
I0829 22:08:41.225714    1288 checker.go:202] declaration of capture group reference `level' at demo.mtail:6:2-68 appears to be unused
I0829 22:08:41.226713    1288 loader.go:229] Loaded program demo.mtail
I0829 22:08:41.226713    1288 log_watcher.go:248] No abspath in watched list, added new one for X
W0829 22:08:41.227713    1288 mtail.go:101] No matches for pattern "foo.log"
I0829 22:08:41.227713    1288 store.go:136] Starting metric store expiry loop every 1h0m0s
I0829 22:08:41.227713    1288 tail.go:435] Starting log handle expiry loop every 1h0m0s
I0829 22:08:41.227713    1288 mtail.go:321] Listening on [::]:3903
I0829 22:08:48.037569    1288 log_watcher.go:253] Found this processor in watched list
I0829 22:08:48.038569    1288 log_watcher.go:248] No abspath in watched list, added new one for C:\foo.log
I0829 22:08:48.044581    1288 tail.go:289] Tailing C:\foo.log
I0829 22:09:21.933622    1288 mtail.go:349] Received SIGTERM, exiting...
I0829 22:09:21.934635    1288 mtail.go:363] Shutdown requested.
I0829 22:09:21.935667    1288 log_watcher.go:218] Shutting down log watcher.
I0829 22:09:21.935667    1288 loader.go:381] Shutting down loader.
I0829 22:09:21.936655    1288 mtail.go:386] END OF LINE

Example Prometheus metrics:

delta_t_bucket{prog="demo.mtail",le="0"} 22
delta_t_bucket{prog="demo.mtail",le="0.0005"} 22
delta_t_bucket{prog="demo.mtail",le="0.001"} 22
delta_t_bucket{prog="demo.mtail",le="0.01"} 22
delta_t_bucket{prog="demo.mtail",le="0.1"} 22
delta_t_bucket{prog="demo.mtail",le="1"} 22
delta_t_bucket{prog="demo.mtail",le="10"} 23
delta_t_bucket{prog="demo.mtail",le="100"} 24
delta_t_bucket{prog="demo.mtail",le="+Inf"} 24
delta_t_sum{prog="demo.mtail"} 40
delta_t_count{prog="demo.mtail"} 24

Here we see that all durations were counted as 0, except for the 7s and 33s pauses, that add up to the sum of 40s.

I tested and changed the few places in vm.go that use Unix seconds to use Unix nanoseconds instead. This of course works, but is hardly backwards compatible as timestamp() returns a billion times larger values. Perhaps changing to Unix seconds as a float64 would be better, as that wouldn't change the unit and would give microsecond resolution.

antong avatar Aug 30 '19 11:08 antong

As a reference, this is what I would expect:

delta_t_bucket{prog="demo.mtail",le="0"} 12
delta_t_bucket{prog="demo.mtail",le="0.0005"} 12
delta_t_bucket{prog="demo.mtail",le="0.001"} 17
delta_t_bucket{prog="demo.mtail",le="0.01"} 22
delta_t_bucket{prog="demo.mtail",le="0.1"} 22
delta_t_bucket{prog="demo.mtail",le="1"} 22
delta_t_bucket{prog="demo.mtail",le="10"} 23
delta_t_bucket{prog="demo.mtail",le="100"} 24
delta_t_bucket{prog="demo.mtail",le="+Inf"} 24
delta_t_sum{prog="demo.mtail"} 40.715963
delta_t_count{prog="demo.mtail"} 24

antong avatar Aug 30 '19 12:08 antong

It's terrible, isn't it! Thanks for the bug report, Time should be a first class type in the language so that this is done correctly.

On Fri, 30 Aug 2019 at 22:00, antong [email protected] wrote:

As a reference, this is what I would expect:

delta_t_bucket{prog="demo.mtail",le="0"} 12 delta_t_bucket{prog="demo.mtail",le="0.0005"} 12 delta_t_bucket{prog="demo.mtail",le="0.001"} 17 delta_t_bucket{prog="demo.mtail",le="0.01"} 22 delta_t_bucket{prog="demo.mtail",le="0.1"} 22 delta_t_bucket{prog="demo.mtail",le="1"} 22 delta_t_bucket{prog="demo.mtail",le="10"} 23 delta_t_bucket{prog="demo.mtail",le="100"} 24 delta_t_bucket{prog="demo.mtail",le="+Inf"} 24 delta_t_sum{prog="demo.mtail"} 40.715963 delta_t_count{prog="demo.mtail"} 24

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/google/mtail/issues/263?email_source=notifications&email_token=AAXFX65VON535CKUXOYIR53QHEDWZA5CNFSM4ISMQGJKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD5ROJLQ#issuecomment-526574766, or mute the thread https://github.com/notifications/unsubscribe-auth/AAXFX63N3R2KPRNBVHZTHITQHEDWZANCNFSM4ISMQGJA .

jaqx0r avatar Aug 30 '19 12:08 jaqx0r

I want to make a Time type internally in the VM to solve this issue.

jaqx0r avatar Mar 20 '21 06:03 jaqx0r

bump to this issue

dmitry-ee avatar May 13 '22 17:05 dmitry-ee

Any news on that issue ?

cazeaux avatar Oct 19 '22 12:10 cazeaux

I'm working on it, but it's taking some time.

On Wed, 19 Oct 2022 at 23:03, Stéphane Cazeaux @.***> wrote:

Any news on that issue ?

— Reply to this email directly, view it on GitHub https://github.com/google/mtail/issues/263#issuecomment-1283901213, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAXFX65JY5GSQSJ4I7DRADDWD7PPHANCNFSM4ISMQGJA . You are receiving this because you commented.Message ID: @.***>

jaqx0r avatar Oct 20 '22 02:10 jaqx0r