fleet Policy failure count mismatch on host details page

Fleet version: 4.64.2

💥 Actual behavior

On the host's details page, the Policies tab and the Issues counter both show 59 policy failures, but the warning message on the tab says the device is failing only 9. The device is truly failing only 9 policies.

🧑‍💻 Steps to reproduce

TODO
TODO

🕯️ More info (optional)

N/A

Mar 12 '25 18:03 rebeccaui

Linked to Unthread ticket:

Weird display on Failed policies count #4974

Mar 12 '25 18:03 Sampfluger88

I can reproduce this by manually updating my host_issues table, which is what drives the badge on the tab and the tooltip. The "This device is failing xxx policies" note is driven from the policies list retrieved from the host details API, which seems more up-to-date. The host_issues table is updated hourly by default.

Apr 03 '25 19:04 sgress454

Hey team! Please add your planning poker estimate with Zenhub @dantecatalfamo @jacobshandling @sgress454

Apr 23 '25 18:04 sharon-fdm

I'll go ahead and source all 3 of these counts from the host.policies array per Scott's breakdown above
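For illustration only, here is a minimal Go sketch of deriving the failing count from a host's policies list instead of from host_issues. The real change would live in the frontend; the `hostPolicy` type, its fields, and the `"pass"`/`"fail"` response values are assumptions made for this example, not Fleet's actual types.

```go
package main

import "fmt"

// hostPolicy is a hypothetical, trimmed-down shape of one entry in host.policies.
type hostPolicy struct {
	Name     string
	Response string // assumed values: "pass", "fail", or "" (no result yet)
}

// countFailing returns how many policies in the list report a failing response.
func countFailing(policies []hostPolicy) int {
	n := 0
	for _, p := range policies {
		if p.Response == "fail" {
			n++
		}
	}
	return n
}

func main() {
	policies := []hostPolicy{
		{Name: "Disk encryption enabled", Response: "fail"},
		{Name: "Firewall enabled", Response: "pass"},
		{Name: "Screen lock configured", Response: ""},
	}
	fmt.Println(countFailing(policies)) // prints 1
}
```

Using one list as the source for the tab badge, the tooltip, and the warning note means the three numbers cannot disagree with each other.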

May 13 '25 19:05 jacobshandling

The above UI-only solution brings up a related issue. A host's "total issues count" and its "critical vulnerabilities count" both still reference the slow-to-update host_issues table values. Sourcing only the "Failing policies" count from the more recently updated data therefore leaves a discrepancy between these numbers, since the total issues count should be the sum of the other two.
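A small Go sketch of that arithmetic, under stated assumptions: the `issueCounts` type and its methods are hypothetical, and the critical vulnerabilities value of 2 is made up for the example (only 59 and 9 come from this report).

```go
package main

import "fmt"

// issueCounts is a hypothetical stand-in for the per-host numbers shown in the UI.
type issueCounts struct {
	FailingPolicies         int
	CriticalVulnerabilities int
	Total                   int
}

// consistent reports whether Total equals the sum of its two components.
func (c issueCounts) consistent() bool {
	return c.Total == c.FailingPolicies+c.CriticalVulnerabilities
}

// withFailingPolicies returns a copy with a fresh failing-policies count
// and a Total recomputed from both components.
func (c issueCounts) withFailingPolicies(fresh int) issueCounts {
	c.FailingPolicies = fresh
	c.Total = c.FailingPolicies + c.CriticalVulnerabilities
	return c
}

func main() {
	// Stale values as they might sit in host_issues (the 2 is illustrative).
	stale := issueCounts{FailingPolicies: 59, CriticalVulnerabilities: 2, Total: 61}

	// UI-only fix: swap in the fresh failing count, leave Total untouched.
	uiOnly := stale
	uiOnly.FailingPolicies = 9
	fmt.Println(uiOnly.consistent()) // prints false: 61 != 9 + 2

	// Recomputing Total together with the fresh count keeps the sum intact.
	fixed := stale.withFailingPolicies(9)
	fmt.Println(fixed.consistent(), fixed.Total) // prints true 11
}
```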

Because of this, it seems like this should be a backend fix, where a GET of a host's details calculates the host's updated values, returns them, and writes them to the DB.

Note discrepancy here:

May 13 '25 22:05 jacobshandling

This makes sense--if we don't already have the information we need on the front end, then we need to get it from the API. I don't know how costly that is, but if we were only doing it once an hour it seems likely that it's not something we want to do at great frequency. We should consider gating this behind an API param.

May 13 '25 23:05 sgress454

But potentially not as costly to just calculate for one host at a time as needed per request to hosts/:id right? I'd think the once/hour is probably to account for doing it at the scale of all hosts

May 13 '25 23:05 jacobshandling

> But potentially not as costly to just calculate for one host at a time as needed per request to hosts/:id right? I'd think the once/hour is probably to account for doing it at the scale of all hosts

We're talking about running UpdateHostIssuesFailingPolicies() when this API is hit, right? Having a GET request have any side effects like this is playing with fire -- the expectation is that these can be scripted and/or hammered (even if that expectation isn't completely fair). So if that's what we're talking about, we want to be very careful about it. It would be good to, for example, first check the updated_at of the row in host_issues so that we only do this max once per minute.
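The updated_at check could look roughly like the Go sketch below. Everything here is a hypothetical in-memory stand-in: `hostIssuesRow`, `refreshIfStale`, and `simulateRequests` are not Fleet code, and the `recompute` callback only plays the role of the UpdateHostIssuesFailingPolicies() call.

```go
package main

import (
	"fmt"
	"time"
)

// minRefreshInterval encodes the "max once per minute" limit suggested above.
const minRefreshInterval = time.Minute

// hostIssuesRow is a hypothetical stand-in for one host's row in host_issues.
type hostIssuesRow struct {
	FailingPoliciesCount int
	UpdatedAt            time.Time
}

// refreshIfStale recomputes the count only when the row is at least
// minRefreshInterval old, and reports whether it did any work.
func refreshIfStale(row *hostIssuesRow, now time.Time, recompute func() int) bool {
	if now.Sub(row.UpdatedAt) < minRefreshInterval {
		return false // refreshed recently: serve the stored value as-is
	}
	row.FailingPoliciesCount = recompute()
	row.UpdatedAt = now
	return true
}

// simulateRequests replays GET requests arriving at the given second offsets
// against a row last updated an hour earlier, and counts the recomputations.
func simulateRequests(offsetsSec []int) (refreshes int, row hostIssuesRow) {
	start := time.Date(2025, 5, 14, 0, 0, 0, 0, time.UTC)
	row = hostIssuesRow{FailingPoliciesCount: 59, UpdatedAt: start.Add(-time.Hour)}
	for _, s := range offsetsSec {
		now := start.Add(time.Duration(s) * time.Second)
		if refreshIfStale(&row, now, func() int { return 9 }) {
			refreshes++
		}
	}
	return refreshes, row
}

func main() {
	// Six requests within about a minute trigger only two recomputations.
	refreshes, row := simulateRequests([]int{0, 1, 30, 59, 60, 61})
	fmt.Println(refreshes, row.FailingPoliciesCount) // prints 2 9
}
```

With this guard, a script hammering the endpoint costs at most one recalculation per host per minute. In a real server the check and the write would also need to be safe under concurrent requests, which this sketch does not attempt.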

May 14 '25 01:05 sgress454

> first check the updated_at of the row in host_issues so that we only do this max once per minute

I see, thanks for the great insight. Will add this to the PR.

May 14 '25 19:05 jacobshandling

Mismatched count, gone, Fleet's truth shines like morning sun. Clear as glass city.

Jun 14 '25 19:06 fleet-release