cht-core icon indicating copy to clipboard operation
cht-core copied to clipboard

Telemetry events contain negative values

Open kennsippell opened this issue 3 years ago • 10 comments

Describe the bug Looking at replication:medic:to:success and replication:medic:from:success events, I'm seeing occasional negative values for min.

To Reproduce

{
  "_id": "telemetry-2022-12-20-user-guid",
  "type": "telemetry",
  "metrics": {
    "replication:medic:to:success": {
      "max": 31881,
      "min": -13551,
      "sum": 18330,
      "count": 2,
      "sumsqr": 1200027762
    },
  },
}

Expected behavior I would not expect anything to take negative time.

Environment

  • App: core
  • Version: 4.0.1-4.0.1.3593855120

Additional context Add any other context about the problem here. What have you tried? Is there a workaround?

kennsippell avatar Feb 20 '23 22:02 kennsippell

All these events have negative values for min

enketo:tasks:monthly_follow_up:add:render
enketo:contacts:monthly_follow_up:add:user_edit_time
enketo:contacts:under_5_screening:add:save
replication:medic:from:ms-since-last-replicated-date
replication:meta:sync:failure
analytics:targets:load
enketo:contacts:over_5_screening:add:render
client-date-offset
enketo:contacts:under_5_screening:add:user_edit_time
rules-engine:update-emissions
rules-engine:targets
replication:medic:to:failure
replication:medic:to:ms-since-last-replicated-date
rules-engine:tasks:all-contacts
rules-engine:tasks:some-contacts:queued
enketo:contacts:over_5_screening:add:user_edit_time
enketo:contacts:mute:add:save
enketo:contacts:over_5_screening:add:save
enketo:tasks:monthly_follow_up:add:save
enketo:contacts:under_5_screening:add:render
rules-engine:tasks:some-contacts
tasks:refresh
rules-engine:tasks:all-contacts:queued
replication:medic:from:failure

kennsippell avatar Feb 21 '23 09:02 kennsippell

Since so many events have negative mins, I'll change the issue title to reflect that.

dianabarsan avatar Feb 21 '23 11:02 dianabarsan

@kennsippell

How common are these telemetry docs? Are they restricted to a certain user or a contained group of users?

dianabarsan avatar Feb 21 '23 12:02 dianabarsan

metric frequency of negative values
client-date-offset 26
enketo:*:render 25
replication:medic:from:failure 28
replication:medic:from:ms-since-last-replicated-date 33
replication:medic:from:success 4
replication:medic:to:failure 15
replication:medic:to:ms-since-last-replicated-date 31
replication:medic:to:success 16
replication:meta:sync:failure 6
replication:meta:sync:success 3
rules-engine:ensureTaskFreshness:cancel 12
rules-engine:initialize 11
rules-engine:targets 5
rules-engine:tasks:all-contacts 15
rules-engine:tasks:all-contacts:queued 6
rules-engine:tasks:some-contacts 62
rules-engine:tasks:some-contacts:queued 1
rules-engine:update-emissions 30
enketo:*:save 28
tasks:load 2
tasks:refresh 18
enketo:*:user_edit_time 56

For a project with ~400 users it happened about 430 times in the lifespan of the project. About 175 unique users had negative values. Average user who had negative values had it 2.4 times. Max for a single user was 48. Sorry I'm unsure of any commonalities for this set of 175 users.

kennsippell avatar Feb 21 '23 23:02 kennsippell

I was just after an order of frequency. I'd say that 430 times for 175 users is very high. This implies that it's likely caused by a bug in our code (telemetry code?) and not one user with a constantly re-adjusting device time.

Thanks for sharing the detailed frequency.

dianabarsan avatar Feb 22 '23 05:02 dianabarsan

So far, the only way I have been able to reproduce this is by setting my device clock to the past. It's unlikely that 175 users have done that. We need to keep researching.

All these telemetry records are using the JavaScript's Date object, that can be affected by system and user clock adjustments. As part of the Apdex work, we have been changing this to use Performance.now that is based on a monotonic clock which current time never decreases and which isn't subject to these adjustments.

In terms of the telemetry code:

I tried to query some DBs but my user is limited: SQL Error [42501]: ERROR: permission denied for relation useview_telemetry_metrics

Next is:

  • Change all these to use Performance.now
  • Keep researching how this happened. Maybe ask for help to get some data from Postgres

@kennsippell

  1. Do you remember what project you extracted the data from?
  2. Have you seen it in any other production instance?

latin-panda avatar Feb 21 '24 11:02 latin-panda

Sorry, don't know what project I pulled the above from and I don't see any identifying marks like a formname or something....

But I picked two projects and checked if is happening. For moh_mali_chw and I see 1437 occurrences. And for Muso Mali I see 2,616.

I only ran select COUNT(*) from useview_telemetry_metrics where min < 0; this time, so not checking by metric or recency.

kennsippell avatar Feb 21 '24 15:02 kennsippell

Hi @1yuv Can you please enable access to a copy of moh_mali_chw and Muso Mali? I want to run some SQL queries and inspect the telemetry and feedback docs in more detail.

latin-panda avatar Mar 12 '24 10:03 latin-panda

Hi @latin-panda , you should already have the access to these databases.

1yuv avatar Mar 12 '24 13:03 1yuv

Continuation from this status.

Short story: We didn't find a good explanation for this behavior in production instances. However, we have migrated how we record the performance metrics to use the monotonic clock. This work is tracked here. We are leaving this ticket open and will look at it again in the future.

Long story: The code looks good, as noted in the previous status. We checked the data, and it is happening in many instances in CHT v3.15 to v4.4, so it's not a new issue.

  • Occurrences: This is happening every month with 2 or 3 different users, commonly the first week or last week of the month, but on different dates (i.e. it's not always the 3rd of X month. It varies)
    • Production instances (these are the ones we have access in RDBMS):
      • moh_mali_chw has 1,448
      • moh_zanzibar has 767
      • pih_malawi has 660
      • musoapp has 2,616
    • First negative value recorded (at least in the RDBMS)
      • muso mali: 2022-12-20
      • moh mali: 2019-07-01
    • Last negative value recorded
      • muso mali: 2024-03-02
      • moh mali: 2024-02-04
    • Users: total users since first recorded
      • muso mali: 318
      • moh mali: 186
  • Values variation
    • There is a wide range of values and no pattern identified; for example, some min values in milliseconds are -8, -60, -240, -10272, -18215, -73602702666, -422774528280, etc.
    • It is worth mentioning that there are users recording telemetry in the future; telemetry dates as of 2024-03-22. Some of these are captured by the client-date-offset when they are online at the moment of checking the their device date:
      • muso mali: 6 users
      • moh mali: 23 users

In a single day, there can be several telemetry entries for one user, but only 1 of those has a negative value; that same entry on the same date for the same user might not have a negative value. For example, observe the image below; the entry for enketo:tasks:<_form>:add:user_edit_time has a negative value on that same date, and another telemetry about tasks and user edit time, but a different form has a positive value. We observed this in all the telemetry entries Kenn highlighted here, more often in telemetry about replication.

These stats are calculated by Pouch (/_stats) by using a reduce to pick the min or max value (similar to: 'min': values.reduce(function(a, b) { return Math.min(a, b.min) }),). It doesn't seem that the error is in how Pouch calculates this. Something else that we still don't know nor are able to reproduce is making the negative values.

latin-panda avatar Mar 18 '24 08:03 latin-panda