Metric for Complete Workflow/Activity Failure
Is your feature request related to a problem? Please describe.
I'd like to be able to clearly understand how many Workflows suffered a complete failure after exhausting all retries. (See Additional Context section).
Describe the solution you'd like
A metric representing the failure of a workflow/activity after any and all retries have been exhausted.
Describe alternatives you've considered
N/A?
Additional context
Sometimes our Temporal service goes down, and during the outage various metrics report "failures" (temporal_workflow_failed, temporal_activity_execution_failure, etc.). Because there are client-side retries, it would be good to know when the "final" failure has occurred, i.e. all client-side retries have been exhausted and the Workflow or Activity has actually failed and won't be re-attempted.
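To make the distinction concrete, here is a minimal sketch in plain Python of the semantics I have in mind. It does not use any Temporal SDK API; `run_with_retries`, `attempt_failed`, and `terminal_failed` are hypothetical names made up for illustration, and the retry loop is only a stand-in for whatever retry mechanism is actually in play.

```python
from collections import Counter


def run_with_retries(task, max_attempts, metrics):
    """Run task up to max_attempts times (hypothetical helper, not a Temporal API).

    Every failed attempt increments "attempt_failed". Only the last failed
    attempt, after which nothing is retried, increments "terminal_failed".
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            metrics["attempt_failed"] += 1
            if attempt == max_attempts:
                # Retries are exhausted: this is the failure I want to count.
                metrics["terminal_failed"] += 1
                raise


metrics = Counter()


def always_down():
    # Simulates a task that fails on every attempt during an outage.
    raise ConnectionError("service unavailable")


calls = {"n": 0}


def flaky():
    # Simulates a task that fails twice and then succeeds on the third attempt.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("service unavailable")
    return "ok"


try:
    run_with_retries(always_down, 3, metrics)
except ConnectionError:
    pass  # permanently failed after 3 attempts

result = run_with_retries(flaky, 3, metrics)
```

In this sketch the two tasks produce five failed attempts in total but only one terminal failure. The second number is the one that would let us assess the real impact of an outage.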
If my feature request doesn't make sense, here is our larger scenario for context: we want to assess the impact of a service outage by knowing how many Activities or Workflows actually, permanently failed (i.e. all forms of retries were exhausted while the service was down, and they never got to run again). How can we do this?