stackstorm statsd metrics per action, by action_ref
Hey all!
Over here we've got a lot of production workflows running through StackStorm, but the built-in statsd metrics are rather thin. From the documentation, I see we get an overall count of executions per action name, but nothing broken down by the status of those executions.
This means we lack success/failure rate metrics to readily consume/visualize/act on across any of our deployments.
I am not sure if this was intentional, perhaps to limit statsd resource usage or to avoid contention from the metric count exploding (obviously, for 1,000 actions you'd have potentially ~4-5k metrics just around statuses).
I would like to propose adding an OPTIONAL configuration setting under [metrics] that would tell the application to generate status-per-action metrics at the user's behest, defaulting to the current baseline metrics. This would allow users to either adopt the newer, more granular statsd metrics OR continue with the default behavior.
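To make that concrete, the option could look something like this in st2.conf (the option name `emit_per_action_status_metrics` is my invention for illustration, not an existing setting; the other [metrics] keys are the standard ones):

```ini
# /etc/st2/st2.conf
[metrics]
driver = statsd
host = 127.0.0.1
port = 8125
# Proposed (hypothetical) option: emit per-action, per-status
# counters when true; False preserves today's behavior.
emit_per_action_status_metrics = False
```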
I believe the additions would live within this section of the liveaction status update(s): https://github.com/StackStorm/st2/blob/master/st2common/st2common/util/action_db.py#L207-L312
Please let me know if you have any concerns about adding this here or not. I can analyze/add/etc.; I just want to elicit some feedback/thoughts first. THANK YOU!!!
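For the curious, the kind of counter key I have in mind would look something like the sketch below. The key format and the dot-sanitization rule are my own assumptions, not existing st2 code; sanitizing matters because statsd treats dots as hierarchy separators and an action_ref like `core.local` already contains one.

```python
# Sketch of a per-action, per-status counter key -- the naming
# scheme here is an assumption, not what st2 actually emits.

def status_metric_key(action_ref: str, status: str) -> str:
    """Build a statsd counter key like
    st2.action.<pack>_<action>.status.<status>.
    Dots in action_ref are replaced with underscores so they
    don't create extra levels in the statsd hierarchy."""
    safe_ref = action_ref.replace(".", "_")
    return "st2.action.%s.status.%s" % (safe_ref, status)

print(status_metric_key("core.local", "succeeded"))
# -> st2.action.core_local.status.succeeded
```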
We did something a bit more creative to get execution status: we created a cronjob on the workers that runs a shell script using local st2 commands to count executions by the statuses we care about, then echoes the results to the local statsd service for our Prometheus cluster to scrape.
cat /opt/stackstorm/action_execution_metric_gauge.sh
CURRENT_DELAYED_COUNT=$(st2 execution list -l --status delayed -n 1000 -j | jq '.[].id'| tr -d '""' | wc -l)
CURRENT_REQUESTED_COUNT=$(st2 execution list -l --status requested -n 1000 -j | jq '.[].id'| tr -d '""' | wc -l)
CURRENT_RUNNING_COUNT=$(st2 execution list -l --status running -n 1000 -j | jq '.[].id'| tr -d '""' | wc -l)
CURRENT_PENDING_COUNT=$(st2 execution list -l --status pending -n 1000 -j | jq '.[].id'| tr -d '""' | wc -l)
echo "st2.action.execution.current.delayed:$CURRENT_DELAYED_COUNT|g" | nc -u -w1 127.0.0.1 9125
echo "st2.action.execution.current.requested:$CURRENT_REQUESTED_COUNT|g" | nc -u -w1 127.0.0.1 9125
echo "st2.action.execution.current.running:$CURRENT_RUNNING_COUNT|g" | nc -u -w1 127.0.0.1 9125
echo "st2.action.execution.current.pending:$CURRENT_PENDING_COUNT|g" | nc -u -w1 127.0.0.1 9125
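The same polled data could also be split per action for the gauges the original post asks about. A small sketch, assuming the JSON from `st2 execution list -j` carries `status` and `action.ref` fields (adjust the field names if your output differs):

```python
import json
from collections import Counter

def per_action_status_gauges(executions_json: str) -> list:
    """Turn `st2 execution list -j` output into statsd gauge lines,
    one per (action_ref, status) pair. Dots in the action ref are
    replaced so they don't add levels to the statsd hierarchy."""
    counts = Counter()
    for ex in json.loads(executions_json):
        ref = ex["action"]["ref"].replace(".", "_")
        counts[(ref, ex["status"])] += 1
    return [
        "st2.action.%s.current.%s:%d|g" % (ref, status, n)
        for (ref, status), n in sorted(counts.items())
    ]

# Illustrative input shaped like `st2 execution list -j` output:
sample = json.dumps([
    {"action": {"ref": "core.local"}, "status": "running"},
    {"action": {"ref": "core.local"}, "status": "running"},
    {"action": {"ref": "examples.flow"}, "status": "delayed"},
])
for line in per_action_status_gauges(sample):
    print(line)
# -> st2.action.core_local.current.running:2|g
# -> st2.action.examples_flow.current.delayed:1|g
```

Each emitted line can be piped to statsd the same way as in the script above, e.g. `| nc -u -w1 127.0.0.1 9125`.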
This is actually a nice, simple solution I hadn't initially thought of! We had started on our side by creating a sensor on every instance that runs daily to do essentially what you have above, but that means we lack "realtime" status metrics.
I've tried adding runtime success/failure/etc. metrics for this, but that obviously explodes our workflows with things we don't really want just to get closer-to-runtime data.
Obviously this all generates added code to maintain, plus knowledge of the systems and configuration, that I'd love to see "baked in" to StackStorm. I'll hold off on any change(s) related to this issue, but that cronjob/script is definitely lighter weight than what we have in our sensor haha