
st2 performance issues

Open · chain312 opened this issue 2 years ago • 7 comments

SUMMARY

I deployed st2 on Kubernetes, and now all workflow actions run slowly. I tried increasing the number of replicas to speed up execution, but it doesn't seem to help.

STACKSTORM VERSION

st2 3.8.0, on Python 3.8.10

OS, environment, install method

Kubernetes

Steps to reproduce the problem

The number of pods for each of my microservices in k8s is as follows

  1. st2sensor: 4 pods. Only the Kafka trigger is in use; Kafka has 4 partitions, so I run 4 sensor pods (see the partitioning sketch after this list).
  2. st2actionrunner: 30 pods
  3. st2workflowengine: 30 pods
  4. st2rulesengine: 30 pods
  5. st2scheduler: 20 pods
  6. st2notifier: 20 pods
  7. st2garbagecollector: 1 pod, with the following garbage-collection settings (TTL values are in days):
     [garbagecollector]
     action_executions_ttl = 3
     action_executions_output_ttl = 3
     trigger_instances_ttl = 3
     traces_ttl = 3
     rule_enforcements_ttl = 3
     workflow_executions_ttl = 3
     task_executions_ttl = 3
     tokens_ttl = 3
  8. st2client: 1 pod
  9. st2auth: 2 pods
  10. st2api: 2 pods
  11. st2stream: 2 pods
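For readers wondering how multiple sensor pods split work: StackStorm supports sensor partitioning via st2.conf, so each sensor node runs only its assigned sensors. A hedged sketch based on the sensor-partitioning docs (verify option names against your st2 version; the node name is illustrative):

```ini
# st2.conf on one of the four sensor pods; each pod gets a unique node name.
[sensorcontainer]
sensor_node_name = sensor-node-1
# KV-based partitioning: each node's sensor list is read from the datastore
# key "<sensor_node_name>.sensor_partition" (comma-separated sensor refs).
partition_provider = name:kv
```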

Expected Results

[Seven screenshots attached in the original issue.]

Actual Results

I've looked at Mongo's slow queries before and added compound indexes to speed them up. Now I can see that some indexes in Mongo are never used, so the next step is probably to remove the unused ones. According to the monitoring, the rule-matching rate seems too low. Is it necessary to increase the number of pods for st2rulesengine? Do you have any good suggestions?
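For anyone following along: unused indexes can be spotted with MongoDB's $indexStats aggregation stage. A minimal pymongo sketch, assuming a local Mongo and the default MongoEngine collection naming (adjust the URI and collection name for your deployment):

```python
# Minimal sketch: print per-index usage counters for an st2 collection.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # adjust for your deployment
db = client["st2"]

# MongoEngine typically stores ActionExecutionDB docs in "action_execution_d_b".
for stats in db["action_execution_d_b"].aggregate([{"$indexStats": {}}]):
    ops = stats["accesses"]["ops"]  # times the index was used since server start
    print(f"{stats['name']}: {ops} ops")
```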

Thanks!

chain312 avatar Jul 27 '23 02:07 chain312

StackStorm indeed has a lot of performance bottlenecks, and while we have an official K8s HA-focused deployment, it's important to keep in mind that the platform was designed and created before the K8s mainstream era. Since then, the st2 core hasn't been optimized or profiled for a setup with this many pods; the development effort is missing in this area.

With that said, latency matters for every st2 component, including the backends. K8s adds its own latency, and with a large number of pods you'll likely get a lot of network chatter, a higher error rate, and retries, and you may simply be overdeploying. My guess is that 20-30 pods per microservice could degrade the system rather than help. I'd try experimenting with the pod counts for each component.
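As a hedged starting point (key names assume the stackstorm-ha Helm chart layout; verify against your chart version), scaling back could look something like:

```yaml
# Hypothetical values.yaml override with more modest replica counts.
# Key names assume the stackstorm-ha chart; check your chart version.
st2actionrunner:
  replicas: 10
st2workflowengine:
  replicas: 4
st2rulesengine:
  replicas: 4
st2scheduler:
  replicas: 2
st2notifier:
  replicas: 2
```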

The best HA results I've seen so far came from a simple dual-VM setup with lots of cores (CPU-optimized). That StackStorm cluster connects to dedicated MongoDB, RabbitMQ and Redis clusters (RAM-optimized, also VM-based), with proper DBA practices, buffering, caching, monitoring and kernel tuning. Aka the old way :/

arm4b avatar Jul 27 '23 20:07 arm4b

Are your DBs/backends in K8s too? How does the setup look in there? If K8s is absolutely necessary, maybe it's worth trying a hybrid approach: keep the st2 pods in K8s and the DB/backends on dedicated VM-based clusters. Just as an experimental deployment to see if that makes any difference, if your architecture requirements allow you to go outside of K8s.

arm4b avatar Jul 27 '23 20:07 arm4b

> Are your DBs/backends in K8s too? How does the setup look in there? If K8s is absolutely necessary, maybe it's worth trying a hybrid approach: keep the st2 pods in K8s and the DB/backends on dedicated VM-based clusters.

I have deployed RabbitMQ, Mongo and Redis in Docker outside of k8s, but they are all single-node, not replicated. Should I deploy this middleware directly on physical servers next?

chain312 avatar Jul 28 '23 01:07 chain312

> The best HA results I've seen so far came from a simple dual-VM setup with lots of cores (CPU-optimized). That StackStorm cluster connects to dedicated MongoDB, RabbitMQ and Redis clusters (RAM-optimized, also VM-based), with proper DBA practices, buffering, caching, monitoring and kernel tuning.

Do you have any performance numbers for the setup you deployed, such as action execution rate, workflow throughput, or similar metrics?

chain312 avatar Jul 28 '23 06:07 chain312

The workflow engine has a tooz coordination lock that causes diminishing speed improvements once you run more than a few engines.
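For illustration only (not st2's actual code), here is roughly what such a tooz lock looks like; the backend URL, member id and lock name are made up:

```python
# Toy sketch: concurrent workflow engines serialize on a shared tooz lock,
# which caps the benefit of adding more engine replicas.
from tooz import coordination

# Backend URL and member id are illustrative.
coordinator = coordination.get_coordinator("redis://localhost:6379", b"wf-engine-1")
coordinator.start()

lock = coordinator.get_lock(b"workflow-execution-12345")
with lock:  # only one engine at a time gets past this point
    pass    # handle the workflow execution state transition here
coordinator.stop()
```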

guzzijones avatar Jul 28 '23 17:07 guzzijones

> The workflow engine has a tooz coordination lock that causes diminishing speed improvements once you run more than a few engines.

Is there any plan in the project to improve the tooz lock?

chain312 avatar Sep 06 '23 12:09 chain312

> The workflow engine has a tooz coordination lock that causes diminishing speed improvements once you run more than a few engines.

@guzzijones @arm4b I took a look at the code, and it seems like the st2scheduler service fetches items from the st2.action_execution_scheduling_queue_item_db collection. Why isn't this data stored in the message queue (MQ) instead?
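As a toy illustration of that pattern (made-up collection and field names, not st2's actual schema): a DB-backed queue lets the scheduler hold items with execution delays and claim them atomically, which is harder to express as a plain MQ consume:

```python
# Toy sketch of a DB-backed scheduling queue with atomic claiming.
# Collection and field names are illustrative, not st2's real schema.
import datetime

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["st2"]
queue = db["scheduling_queue_demo"]

def claim_next_item(worker_id):
    """Atomically claim the oldest unclaimed item whose delay has expired."""
    now = datetime.datetime.utcnow()
    return queue.find_one_and_update(
        {"handling": None, "scheduled_after": {"$lte": now}},  # unclaimed & due
        {"$set": {"handling": worker_id, "claimed_at": now}},  # mark as claimed
        sort=[("scheduled_after", 1)],  # oldest first
    )
```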

chain312 avatar Mar 01 '24 11:03 chain312