cht-core Telemetry events contain negative values

Describe the bug Looking at replication:medic:to:success and replication:medic:from:success events, I'm seeing occasional negative values for min.

To Reproduce

{
  "_id": "telemetry-2022-12-20-user-guid",
  "type": "telemetry",
  "metrics": {
    "replication:medic:to:success": {
      "max": 31881,
      "min": -13551,
      "sum": 18330,
      "count": 2,
      "sumsqr": 1200027762
    },
  },
}

Expected behavior I would not expect anything to take negative time.

Environment

App: core
Version: 4.0.1-4.0.1.3593855120

Additional context Add any other context about the problem here. What have you tried? Is there a workaround?

Feb 20 '23 22:02 kennsippell

All these events have negative values for min

enketo:tasks:monthly_follow_up:add:render
enketo:contacts:monthly_follow_up:add:user_edit_time
enketo:contacts:under_5_screening:add:save
replication:medic:from:ms-since-last-replicated-date
replication:meta:sync:failure
analytics:targets:load
enketo:contacts:over_5_screening:add:render
client-date-offset
enketo:contacts:under_5_screening:add:user_edit_time
rules-engine:update-emissions
rules-engine:targets
replication:medic:to:failure
replication:medic:to:ms-since-last-replicated-date
rules-engine:tasks:all-contacts
rules-engine:tasks:some-contacts:queued
enketo:contacts:over_5_screening:add:user_edit_time
enketo:contacts:mute:add:save
enketo:contacts:over_5_screening:add:save
enketo:tasks:monthly_follow_up:add:save
enketo:contacts:under_5_screening:add:render
rules-engine:tasks:some-contacts
tasks:refresh
rules-engine:tasks:all-contacts:queued
replication:medic:from:failure

Feb 21 '23 09:02 kennsippell

Since so many events have negative mins, I'll change the issue title to reflect that.

Feb 21 '23 11:02 dianabarsan

@kennsippell

How common are these telemetry docs? Are they restricted to a certain user or a contained group of users?

Feb 21 '23 12:02 dianabarsan

metric frequency of negative values

client-date-offset 26

enketo:*:render 25

replication:medic:from:failure 28

replication:medic:from:ms-since-last-replicated-date 33

replication:medic:from:success 4

replication:medic:to:failure 15

replication:medic:to:ms-since-last-replicated-date 31

replication:medic:to:success 16

replication:meta:sync:failure 6

replication:meta:sync:success 3

rules-engine:ensureTaskFreshness:cancel 12

rules-engine:initialize 11

rules-engine:targets 5

rules-engine:tasks:all-contacts 15

rules-engine:tasks:all-contacts:queued 6

rules-engine:tasks:some-contacts 62

rules-engine:tasks:some-contacts:queued 1

rules-engine:update-emissions 30

enketo:*:save 28

tasks:load 2

tasks:refresh 18

enketo:*:user_edit_time 56

metric	frequency of negative values
client-date-offset	26
enketo:*:render	25
replication:medic:from:failure	28
replication:medic:from:ms-since-last-replicated-date	33
replication:medic:from:success	4
replication:medic:to:failure	15
replication:medic:to:ms-since-last-replicated-date	31
replication:medic:to:success	16
replication:meta:sync:failure	6
replication:meta:sync:success	3
rules-engine:ensureTaskFreshness:cancel	12
rules-engine:initialize	11
rules-engine:targets	5
rules-engine:tasks:all-contacts	15
rules-engine:tasks:all-contacts:queued	6
rules-engine:tasks:some-contacts	62
rules-engine:tasks:some-contacts:queued	1
rules-engine:update-emissions	30
enketo:*:save	28
tasks:load	2
tasks:refresh	18
enketo:*:user_edit_time	56

For a project with ~400 users it happened about 430 times in the lifespan of the project. About 175 unique users had negative values. Average user who had negative values had it 2.4 times. Max for a single user was 48. Sorry I'm unsure of any commonalities for this set of 175 users.

Feb 21 '23 23:02 kennsippell

I was just after an order of frequency. I'd say that 430 times for 175 users is very high. This implies that it's likely caused by a bug in our code (telemetry code?) and not one user with a constantly re-adjusting device time.

Thanks for sharing the detailed frequency.

Feb 22 '23 05:02 dianabarsan

So far, the only way I have been able to reproduce this is by setting my device clock to the past. It's unlikely that 175 users have done that. We need to keep researching.

All these telemetry records are using the JavaScript's Date object, that can be affected by system and user clock adjustments. As part of the Apdex work, we have been changing this to use Performance.now that is based on a monotonic clock which current time never decreases and which isn't subject to these adjustments.

In terms of the telemetry code:

The value passed to the telemetry isn't affected to when it's stored
Those records are fetch by a map function then the stats are calculated by couch in a mapReduce. The value doesn't seem affected through these.
I checked all these keys, and they are calculating the value in the correct way, I mean: end - start.

I tried to query some DBs but my user is limited: SQL Error [42501]: ERROR: permission denied for relation useview_telemetry_metrics

Next is:

Change all these to use Performance.now
Keep researching how this happened. Maybe ask for help to get some data from Postgres

@kennsippell

Do you remember what project you extracted the data from?
Have you seen it in any other production instance?

Feb 21 '24 11:02 latin-panda

Sorry, don't know what project I pulled the above from and I don't see any identifying marks like a formname or something....

But I picked two projects and checked if is happening. For moh_mali_chw and I see 1437 occurrences. And for Muso Mali I see 2,616.

I only ran select COUNT(*) from useview_telemetry_metrics where min < 0; this time, so not checking by metric or recency.

Feb 21 '24 15:02 kennsippell

Hi @1yuv Can you please enable access to a copy of moh_mali_chw and Muso Mali? I want to run some SQL queries and inspect the telemetry and feedback docs in more detail.

Mar 12 '24 10:03 latin-panda

Hi @latin-panda , you should already have the access to these databases.

Mar 12 '24 13:03 1yuv

Continuation from this status.

Short story: We didn't find a good explanation for this behavior in production instances. However, we have migrated how we record the performance metrics to use the monotonic clock. This work is tracked here. We are leaving this ticket open and will look at it again in the future.

Long story: The code looks good, as noted in the previous status. We checked the data, and it is happening in many instances in CHT v3.15 to v4.4, so it's not a new issue.

Occurrences: This is happening every month with 2 or 3 different users, commonly the first week or last week of the month, but on different dates (i.e. it's not always the 3rd of X month. It varies)
- Production instances (these are the ones we have access in RDBMS):
  - moh_mali_chw has 1,448
  - moh_zanzibar has 767
  - pih_malawi has 660
  - musoapp has 2,616
- First negative value recorded (at least in the RDBMS)
  - muso mali: 2022-12-20
  - moh mali: 2019-07-01
- Last negative value recorded
  - muso mali: 2024-03-02
  - moh mali: 2024-02-04
- Users: total users since first recorded
  - muso mali: 318
  - moh mali: 186
Values variation
- There is a wide range of values and no pattern identified; for example, some min values in milliseconds are -8, -60, -240, -10272, -18215, -73602702666, -422774528280, etc.
- It is worth mentioning that there are users recording telemetry in the future; telemetry dates as of 2024-03-22. Some of these are captured by the client-date-offset when they are online at the moment of checking the their device date:
  - muso mali: 6 users
  - moh mali: 23 users

In a single day, there can be several telemetry entries for one user, but only 1 of those has a negative value; that same entry on the same date for the same user might not have a negative value. For example, observe the image below; the entry for enketo:tasks:<_form>:add:user_edit_time has a negative value on that same date, and another telemetry about tasks and user edit time, but a different form has a positive value. We observed this in all the telemetry entries Kenn highlighted here, more often in telemetry about replication.

These stats are calculated by Pouch (/_stats) by using a reduce to pick the min or max value (similar to: 'min': values.reduce(function(a, b) { return Math.min(a, b.min) }),). It doesn't seem that the error is in how Pouch calculates this. Something else that we still don't know nor are able to reproduce is making the negative values.

Mar 18 '24 08:03 latin-panda