
Update Grafana User dashboard

davidmirror-ops opened this pull request · 1 comment

Closes #5670

Why are the changes needed?

As a user, I want to install the Flyte-provided Grafana dashboards and get metrics out of the box.

What changes were proposed in this pull request?

How was this patch tested?

Setup process

Screenshots

Check all the applicable boxes

  • [ ] I updated the documentation accordingly.
  • [ ] All new and existing tests passed.
  • [ ] All commits are signed-off.

Related PRs

Docs link

davidmirror-ops avatar Aug 28 '24 21:08 davidmirror-ops

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 36.31%. Comparing base (30d3314) to head (81ef23c). Report is 260 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5703      +/-   ##
==========================================
+ Coverage   35.90%   36.31%   +0.40%     
==========================================
  Files        1301     1305       +4     
  Lines      109419   110019     +600     
==========================================
+ Hits        39287    39949     +662     
+ Misses      66035    65914     -121     
- Partials     4097     4156      +59     
| Flag | Coverage Δ |
| --- | --- |
| unittests-datacatalog | 51.37% <ø> (ø) |
| unittests-flyteadmin | 55.62% <ø> (+1.88%) :arrow_up: |
| unittests-flytecopilot | 12.17% <ø> (ø) |
| unittests-flytectl | 62.21% <ø> (-0.07%) :arrow_down: |
| unittests-flyteidl | 7.12% <ø> (+0.03%) :arrow_up: |
| unittests-flyteplugins | 53.35% <ø> (+0.03%) :arrow_up: |
| unittests-flytepropeller | 41.89% <ø> (+0.13%) :arrow_up: |
| unittests-flytestdlib | 55.37% <ø> (+0.09%) :arrow_up: |

Flags with carried forward coverage won't be shown.


codecov[bot] avatar Aug 28 '24 21:08 codecov[bot]

Thanks for working on these updates!

This is already looking a lot better - success metrics are working well, though we still see a few small issues on the imported dashboard:

  • failed workflows aren't shown even though we observed failures in this time period. I believe the stats for these workflows are being recorded against failure_duration_unlabeled_ms_count instead of failure_duration_ms_count
  • failed workflow execution time is showing on the order of weeks; the panel likely needs a unit of milliseconds instead of seconds
  • the three panels in the task stats group all show the error "many-to-many matching not allowed: matching labels must be unique on one side"
[Screenshot 2024-09-13 at 10:38:08]
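For context on the last bullet: PromQL raises "many-to-many matching not allowed" when the labels used for a binary-operation join aren't unique on at least one side. The usual fix is to restrict the join key with `on(...)` and declare the side that may have multiple series with `group_left`/`group_right`. A hypothetical sketch follows; the metric name `task_exec_count` and label names are illustrative, not the dashboard's actual queries:

```promql
# Illustrative only -- not the dashboard's actual queries.
# Attach pod labels (exposed by kube-state-metrics as label_* on
# kube_pod_labels) to a per-pod metric. on(pod) restricts the join key,
# group_left says the left side may have many series per pod, and
# max by (...) deduplicates kube_pod_labels (e.g. across KSM replicas)
# so the right side is unique on the matching label.
rate(task_exec_count[5m])
  * on (pod) group_left (label_task_name)
  max by (pod, label_task_name) (kube_pod_labels)
```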

charliemoriarty avatar Sep 13 '24 10:09 charliemoriarty

@charliemoriarty thanks so much for reporting. I'm still exploring the Failed Workflows panel: it uses the flyte:propeller:all:workflow:event_recording:failure_duration_ms_count metric, and it works when I test it standalone, but for some reason it's not displayed correctly. I'll also look into why you're getting that error on Task stats; it works in my flyte-binary environment:

[screenshot of the Task stats panels rendering correctly]

I forgot to mention that with kube-state-metrics v2.0.0 or higher, you have to enable the required labels for this to work. If you're using the Helm chart, it's a matter of upgrading with a values file that includes:

kube-state-metrics:
  metricLabelsAllowlist:
    - pods=[node-id,workflow-name,task-name]
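The nesting under a top-level kube-state-metrics: key suggests it is deployed as a subchart of a parent chart (an assumption on my part; with the standalone kube-state-metrics chart the allowlist would sit at the top level). A hedged sketch of applying the override, where the release name, namespace, and chart reference are placeholders:

```shell
# Write a values override enabling the pod labels the dashboards join on.
# (Same content as the snippet above.)
cat > ksm-values.yaml <<'EOF'
kube-state-metrics:
  metricLabelsAllowlist:
    - pods=[node-id,workflow-name,task-name]
EOF

# Then upgrade the parent release with the override (names are placeholders):
# helm upgrade my-monitoring prometheus-community/kube-prometheus-stack \
#   --namespace monitoring -f ksm-values.yaml
```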

Once I have updates on this, I'll push them here for you to test.

Thanks!

davidmirror-ops avatar Sep 13 '24 20:09 davidmirror-ops

@charliemoriarty I'm curious how this dashboard looks in your environment after recent changes.

davidmirror-ops avatar Sep 24 '24 18:09 davidmirror-ops

Thanks for updating this. It looks good to me, though I haven't reviewed it particularly carefully. The updates I made were mostly focused on the admin and propeller dashboards, so don't worry about stepping on my changes.

Tom-Newton avatar Sep 24 '24 19:09 Tom-Newton

> @charliemoriarty I'm curious how this dashboard looks in your environment after recent changes.

@davidmirror-ops Thanks again for the work on this! Apologies for the very late reply; somehow I missed the notification for the thread at the time 🤦 I can confirm that these changes look a lot better, and that failure visualisation is working as expected. The updates to the documentation are really helpful as well; I think we were missing the metricLabelsAllowlist, which is why the task visualisations weren't working as expected. Hopefully this helps others get set up out of the box!

charliemoriarty avatar Dec 12 '24 16:12 charliemoriarty