Metrics Dashboard for ML Services
Is your feature request related to a problem? Please describe.
As a system administrator, I would like to be able to monitor the overall security, health, and performance of my installed mediation layer and its services.
Describe the solution you'd like
A single dashboard that shows both individual services and aggregate system information, covering transport-level metrics such as request rates and error rates as well as lower-level metrics such as CPU usage. The dashboard should offer interactions for fine-tuning which metrics are shown.
The dashboard should be as implementation-agnostic as possible, so administrators can plug in any data tracker solution they choose and have its output shown on the dashboard.
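To make the "plug in any data tracker" idea concrete, here is a minimal sketch of what such a seam could look like. Every name in it (`ServiceMetrics`, `MetricsSource`, `MetricsAggregator`) is hypothetical and not part of the mediation layer; it only assumes that each tracker can report request and error counts per service, which the dashboard then sums into a system-wide view.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical snapshot of transport-level metrics for one service. */
final class ServiceMetrics {
    final String serviceId;
    final long requestCount;
    final long errorCount;

    ServiceMetrics(String serviceId, long requestCount, long errorCount) {
        this.serviceId = serviceId;
        this.requestCount = requestCount;
        this.errorCount = errorCount;
    }

    /** Fraction of requests that failed; 0.0 when no requests were seen. */
    double errorRate() {
        return requestCount == 0 ? 0.0 : (double) errorCount / requestCount;
    }
}

/** Hypothetical plug-in point: each data tracker would provide one implementation. */
interface MetricsSource {
    List<ServiceMetrics> collect();
}

final class MetricsAggregator {
    /** Sums the per-service counters into a single system-wide snapshot. */
    static ServiceMetrics aggregate(MetricsSource source) {
        long requests = 0;
        long errors = 0;
        for (ServiceMetrics m : source.collect()) {
            requests += m.requestCount;
            errors += m.errorCount;
        }
        return new ServiceMetrics("system", requests, errors);
    }

    public static void main(String[] args) {
        // Stand-in source with fixed numbers; a real one would query a tracker.
        MetricsSource fixed = () -> Arrays.asList(
                new ServiceMetrics("gateway", 900, 9),
                new ServiceMetrics("discovery", 100, 1));
        ServiceMetrics total = aggregate(fixed);
        System.out.println(total.requestCount + " requests, error rate " + total.errorRate());
    }
}
```

Because the dashboard would depend only on the `MetricsSource` interface in this sketch, swapping one tracker for another would not touch the aggregation or display code.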
Minimum Viable Product
- Create a new optional "Metrics Service" that is linked from the Gateway's homepage
- Retrieve HTTP metrics for each endpoint in each APIML service (core and onboarded) using Netflix Hystrix
- Aggregate Hystrix data using Netflix Turbine
- Display HTTP metrics using the pre-built Hystrix and Turbine dashboards
- Create proof-of-concept system metrics collection using a specific system metrics collection service. Collect system metrics for the core APIML services, broken down on a per-service basis.
- Require standard Gateway authentication to access new endpoints (includes "Metrics Service" endpoints as well as Hystrix and Turbine endpoints)
- Run sanity tests for the "Metrics Service" on Marist
- (From UX proposal) A high-level overview of the metrics dashboard with a quick written FAQ/tutorial
- (From UX proposal) A prominent link to the appropriate GitHub location prompting users to give feedback or comment on desired metrics-related features. Alternatively or additionally, a link to a survey to get richer, lower-friction feedback from users.
- (Stretch goal) Display system metrics using a pre-built UI integration
- (Stretch goal) Allow static configuration to enable/disable specific routes for metrics collection
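The static route configuration in the last stretch goal could be sketched roughly as follows. This is an illustration under assumptions, not a proposed format: the class `MetricsRouteConfig`, the prefix-matching rule, and the example paths are all hypothetical, and a real implementation would presumably read the rules from the service's configuration file rather than build them in code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical static configuration: which routes have metrics collected. */
final class MetricsRouteConfig {
    private final Map<String, Boolean> rules = new LinkedHashMap<>();
    private final boolean defaultEnabled;

    MetricsRouteConfig(boolean defaultEnabled) {
        this.defaultEnabled = defaultEnabled;
    }

    /** Registers a rule for every path starting with the given prefix. */
    MetricsRouteConfig rule(String routePrefix, boolean enabled) {
        rules.put(routePrefix, enabled);
        return this;
    }

    /**
     * The longest matching prefix wins; paths with no matching rule use the
     * default. This is plain string-prefix matching, not path-segment aware.
     */
    boolean isCollected(String path) {
        String best = null;
        for (String prefix : rules.keySet()) {
            if (path.startsWith(prefix) && (best == null || prefix.length() > best.length())) {
                best = prefix;
            }
        }
        return best == null ? defaultEnabled : rules.get(best);
    }

    public static void main(String[] args) {
        // Example paths are made up for illustration.
        MetricsRouteConfig config = new MetricsRouteConfig(true)
                .rule("/api/v1/internal", false)
                .rule("/api/v1/internal/health", true);
        System.out.println(config.isCollected("/api/v1/catalog"));
        System.out.println(config.isCollected("/api/v1/internal/debug"));
        System.out.println(config.isCollected("/api/v1/internal/health"));
    }
}
```

Letting the longest prefix win keeps the rules order-independent, so an administrator could disable a whole subtree and still re-enable one route inside it.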
Extensions to MVP
- Complete static configuration
- Make metrics collection configuration dynamic, to allow administrators to change metrics collection without downtime
- Generalize the proof-of-concept system metrics collection to allow users to integrate any system metrics collection service they desire
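One way to change metrics collection without downtime is to keep the active configuration in an immutable snapshot and swap it atomically. The sketch below assumes collection is toggled per service ID; the class `DynamicMetricsConfig` and its methods are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical runtime-updatable configuration for metrics collection. */
final class DynamicMetricsConfig {
    // Readers always see one complete, immutable snapshot.
    private final AtomicReference<Set<String>> disabledServices =
            new AtomicReference<>(Collections.<String>emptySet());

    /** Replaces the whole configuration in one atomic step. */
    void update(Set<String> newDisabledServices) {
        disabledServices.set(
                Collections.unmodifiableSet(new HashSet<>(newDisabledServices)));
    }

    /** True unless the service is currently listed as disabled. */
    boolean isCollected(String serviceId) {
        return !disabledServices.get().contains(serviceId);
    }

    public static void main(String[] args) {
        DynamicMetricsConfig config = new DynamicMetricsConfig();
        System.out.println(config.isCollected("gateway"));
        // An administrator disables collection for one service at runtime.
        config.update(new HashSet<>(Arrays.asList("gateway")));
        System.out.println(config.isCollected("gateway"));
    }
}
```

Request-handling threads only ever call `isCollected`, so an update never blocks them and they never observe a half-applied configuration.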
Describe alternatives you've considered
One alternative is to not show metrics information at all. Another is a partial solution that offers less customization, integration, and information.
Additional context
The Hystrix and Turbine dashboards can be used as a Spring Cloud solution for transport-level information.
More detailed proposal is here. UX proposal from @IlyaKreynin here.