Telemetry events contain negative values
Describe the bug
Looking at replication:medic:to:success and replication:medic:from:success events, I'm seeing occasional negative values for min.
To Reproduce
{
"_id": "telemetry-2022-12-20-user-guid",
"type": "telemetry",
"metrics": {
"replication:medic:to:success": {
"max": 31881,
"min": -13551,
"sum": 18330,
"count": 2,
"sumsqr": 1200027762
},
},
}
Expected behavior I would not expect anything to take negative time.
Environment
- App: core
- Version:
4.0.1-4.0.1.3593855120
Additional context Add any other context about the problem here. What have you tried? Is there a workaround?
All these events have negative values for min
enketo:tasks:monthly_follow_up:add:render
enketo:contacts:monthly_follow_up:add:user_edit_time
enketo:contacts:under_5_screening:add:save
replication:medic:from:ms-since-last-replicated-date
replication:meta:sync:failure
analytics:targets:load
enketo:contacts:over_5_screening:add:render
client-date-offset
enketo:contacts:under_5_screening:add:user_edit_time
rules-engine:update-emissions
rules-engine:targets
replication:medic:to:failure
replication:medic:to:ms-since-last-replicated-date
rules-engine:tasks:all-contacts
rules-engine:tasks:some-contacts:queued
enketo:contacts:over_5_screening:add:user_edit_time
enketo:contacts:mute:add:save
enketo:contacts:over_5_screening:add:save
enketo:tasks:monthly_follow_up:add:save
enketo:contacts:under_5_screening:add:render
rules-engine:tasks:some-contacts
tasks:refresh
rules-engine:tasks:all-contacts:queued
replication:medic:from:failure
Since so many events have negative mins, I'll change the issue title to reflect that.
@kennsippell
How common are these telemetry docs? Are they restricted to a certain user or a contained group of users?
| metric | frequency of negative values |
|---|---|
| client-date-offset | 26 |
| enketo:*:render | 25 |
| replication:medic:from:failure | 28 |
| replication:medic:from:ms-since-last-replicated-date | 33 |
| replication:medic:from:success | 4 |
| replication:medic:to:failure | 15 |
| replication:medic:to:ms-since-last-replicated-date | 31 |
| replication:medic:to:success | 16 |
| replication:meta:sync:failure | 6 |
| replication:meta:sync:success | 3 |
| rules-engine:ensureTaskFreshness:cancel | 12 |
| rules-engine:initialize | 11 |
| rules-engine:targets | 5 |
| rules-engine:tasks:all-contacts | 15 |
| rules-engine:tasks:all-contacts:queued | 6 |
| rules-engine:tasks:some-contacts | 62 |
| rules-engine:tasks:some-contacts:queued | 1 |
| rules-engine:update-emissions | 30 |
| enketo:*:save | 28 |
| tasks:load | 2 |
| tasks:refresh | 18 |
| enketo:*:user_edit_time | 56 |
For a project with ~400 users it happened about 430 times in the lifespan of the project. About 175 unique users had negative values. Average user who had negative values had it 2.4 times. Max for a single user was 48. Sorry I'm unsure of any commonalities for this set of 175 users.
I was just after an order of frequency. I'd say that 430 times for 175 users is very high. This implies that it's likely caused by a bug in our code (telemetry code?) and not one user with a constantly re-adjusting device time.
Thanks for sharing the detailed frequency.
So far, the only way I have been able to reproduce this is by setting my device clock to the past. It's unlikely that 175 users have done that. We need to keep researching.
All these telemetry records are using the JavaScript's Date object, that can be affected by system and user clock adjustments. As part of the Apdex work, we have been changing this to use Performance.now that is based on a monotonic clock which current time never decreases and which isn't subject to these adjustments.
In terms of the telemetry code:
- The value passed to the telemetry isn't affected to when it's stored
- Those records are fetch by a map function then the stats are calculated by couch in a mapReduce. The value doesn't seem affected through these.
- I checked all these keys, and they are calculating the value in the correct way, I mean: end - start.
I tried to query some DBs but my user is limited: SQL Error [42501]: ERROR: permission denied for relation useview_telemetry_metrics
Next is:
- Change all these to use
Performance.now - Keep researching how this happened. Maybe ask for help to get some data from Postgres
@kennsippell
- Do you remember what project you extracted the data from?
- Have you seen it in any other production instance?
Sorry, don't know what project I pulled the above from and I don't see any identifying marks like a formname or something....
But I picked two projects and checked if is happening. For moh_mali_chw and I see 1437 occurrences. And for Muso Mali I see 2,616.
I only ran select COUNT(*) from useview_telemetry_metrics where min < 0; this time, so not checking by metric or recency.
Hi @1yuv Can you please enable access to a copy of moh_mali_chw and Muso Mali? I want to run some SQL queries and inspect the telemetry and feedback docs in more detail.
Hi @latin-panda , you should already have the access to these databases.
Continuation from this status.
Short story: We didn't find a good explanation for this behavior in production instances. However, we have migrated how we record the performance metrics to use the monotonic clock. This work is tracked here. We are leaving this ticket open and will look at it again in the future.
Long story: The code looks good, as noted in the previous status. We checked the data, and it is happening in many instances in CHT v3.15 to v4.4, so it's not a new issue.
- Occurrences: This is happening every month with 2 or 3 different users, commonly the first week or last week of the month, but on different dates (i.e. it's not always the 3rd of X month. It varies)
- Production instances (these are the ones we have access in RDBMS):
- moh_mali_chw has 1,448
- moh_zanzibar has 767
- pih_malawi has 660
- musoapp has 2,616
- First negative value recorded (at least in the RDBMS)
- muso mali: 2022-12-20
- moh mali: 2019-07-01
- Last negative value recorded
- muso mali: 2024-03-02
- moh mali: 2024-02-04
- Users: total users since first recorded
- muso mali: 318
- moh mali: 186
- Production instances (these are the ones we have access in RDBMS):
- Values variation
- There is a wide range of values and no pattern identified; for example, some
minvalues in milliseconds are -8, -60, -240, -10272, -18215, -73602702666, -422774528280, etc. - It is worth mentioning that there are users recording telemetry in the future; telemetry dates as of
2024-03-22. Some of these are captured by theclient-date-offsetwhen they are online at the moment of checking the their device date:- muso mali: 6 users
- moh mali: 23 users
- There is a wide range of values and no pattern identified; for example, some
In a single day, there can be several telemetry entries for one user, but only 1 of those has a negative value; that same entry on the same date for the same user might not have a negative value. For example, observe the image below; the entry for enketo:tasks:<_form>:add:user_edit_time has a negative value on that same date, and another telemetry about tasks and user edit time, but a different form has a positive value.
We observed this in all the telemetry entries Kenn highlighted here, more often in telemetry about replication.
These stats are calculated by Pouch (/_stats) by using a reduce to pick the min or max value (similar to: 'min': values.reduce(function(a, b) { return Math.min(a, b.min) }),). It doesn't seem that the error is in how Pouch calculates this. Something else that we still don't know nor are able to reproduce is making the negative values.