Update Grafana User dashboard
Closes #5670
## Why are the changes needed?
As a user, I want to install the Flyte-provided Grafana dashboards and get metrics.
## What changes were proposed in this pull request?
## How was this patch tested?
### Setup process
### Screenshots
## Check all the applicable boxes
- [ ] I updated the documentation accordingly.
- [ ] All new and existing tests passed.
- [ ] All commits are signed-off.
## Related PRs
## Docs link
## Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 36.31%. Comparing base (`30d3314`) to head (`81ef23c`). Report is 260 commits behind head on master.
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##           master    #5703      +/-  ##
==========================================
+ Coverage   35.90%   36.31%    +0.40%
==========================================
  Files        1301     1305        +4
  Lines      109419   110019      +600
==========================================
+ Hits        39287    39949      +662
+ Misses      66035    65914      -121
- Partials     4097     4156       +59
```
| Flag | Coverage Δ | |
|---|---|---|
| unittests-datacatalog | 51.37% <ø> (ø) | |
| unittests-flyteadmin | 55.62% <ø> (+1.88%) | :arrow_up: |
| unittests-flytecopilot | 12.17% <ø> (ø) | |
| unittests-flytectl | 62.21% <ø> (-0.07%) | :arrow_down: |
| unittests-flyteidl | 7.12% <ø> (+0.03%) | :arrow_up: |
| unittests-flyteplugins | 53.35% <ø> (+0.03%) | :arrow_up: |
| unittests-flytepropeller | 41.89% <ø> (+0.13%) | :arrow_up: |
| unittests-flytestdlib | 55.37% <ø> (+0.09%) | :arrow_up: |
Flags with carried forward coverage won't be shown.
Thanks for working on these updates!
This is already looking a lot better - success metrics are working well, though we still see a few small issues on the imported dashboard:
- failed workflows aren't shown even though we did observe failures in this time period. I believe the stats for these workflows are being recorded against `failure_duration_unlabeled_ms_count` instead of `failure_duration_ms_count`
- failed workflow execution time is showing in the order of weeks - this might need a unit of `milliseconds` instead of `seconds`
- the three panels in the task stats group all show an error: "many-to-many matching not allowed: matching labels must be unique on one side"
@charliemoriarty thanks so much for reporting.
I'm still exploring the Failed Workflows panel: it uses the `flyte:propeller:all:workflow:event_recording:failure_duration_ms_count` metric, and it works when I test it standalone, but for some reason it is not displayed correctly.
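For reference, the panel query is essentially of this shape (simplified here for illustration; the grouping labels are assumptions on my part, not necessarily the panel's exact ones):

```promql
# Simplified sketch of a failed-workflows query over the dashboard window;
# "project" and "domain" are assumed grouping labels, not confirmed panel labels.
sum by (project, domain) (
  rate(flyte:propeller:all:workflow:event_recording:failure_duration_ms_count[5m])
)
```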
Also, I'll look into why you're getting that error on Task stats; it works in my flyte-binary environment.
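For what it's worth, that error usually appears when a query joins two vectors whose matching labels aren't unique on either side. A minimal sketch of the usual fix, assuming a join against `kube_pod_labels` (metric and label names here are illustrative, not the exact panel queries):

```promql
# Illustrative only: joining a per-pod metric with pod labels from kube-state-metrics.
# Aggregating each side so "pod" is unique, then using group_left to carry the
# task-name label across, avoids "many-to-many matching not allowed".
sum by (pod) (container_memory_working_set_bytes)
  * on (pod) group_left (label_task_name)
    max by (pod, label_task_name) (kube_pod_labels)
```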
I forgot to mention that with kube-state-metrics v2.0.0 or higher, you have to enable the required labels for this to work. If you're using the Helm chart, it's a matter of upgrading with a values file that includes:
```yaml
kube-state-metrics:
  metricLabelsAllowlist:
    - pods=[node-id,workflow-name,task-name]
```
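Once that's applied, a quick way to confirm the labels are being exported is a query like the one below; kube-state-metrics exposes allow-listed pod labels as `label_*` labels on `kube_pod_labels` (the specific label name assumes the `workflow-name` entry from the allowlist above):

```promql
# Sanity check: allow-listed pod labels appear as label_* on kube_pod_labels
# (dashes in the Kubernetes label name become underscores).
kube_pod_labels{label_workflow_name!=""}
```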
Once I have updates on this, I'll push them here for you to test.
Thanks!
@charliemoriarty I'm curious how this dashboard looks in your environment after recent changes.
Thanks for updating this. It looks good to me, but I haven't looked particularly carefully. The updates I made were mostly focused on the admin and propeller dashboards, so don't worry about stepping on my changes.
@davidmirror-ops Thanks again for the work on this! Apologies for the very, very late reply - somehow I missed the notification for the thread at the time 🤦 I can confirm that these changes look a lot better, and that failure visualisation is working as expected. The updates to the documentation are really helpful as well - I think we were missing the `metricLabelsAllowlist`, which is why the task visualisations weren't working as expected. Hopefully this helps others get set up out of the box!