State machine in Kubernetes cluster
I want to run a Spring State Machine in an AKS cluster with multiple pods running the same application instance. The state machine is long-running and might take over 12 hours to complete. The state machine configuration looks like the one below, with an internal transition that fires every 15 minutes and triggers a state change if the required conditions are met in its action.
I am using the state machine service along with the JPA persister.
How do I ensure that the long-running state machine resumes on another pod if any one of the pods crashes or restarts?
transitions
    .withExternal()
        .source(S1).target(S2).event(S1_to_S2)
        .and()
    .withInternal()
        .timer(Duration.ofMinutes(15))
        .action(a -> checkAndTriggerEvent());
First of all, let me say that I have no affiliation with Spring State Machine or its development, so please take what I write with a grain of salt.
Spring State Machine is great for many use cases, but I am not sure it fits yours. As far as I know, Spring State Machine has no concept of a unit of work through which several state machine instances could coordinate to fulfil your scenario, so that one instance picks up and continues the work of another instance (in your case, a k8s pod) that has terminated.
That level of cluster coordination is provided by other frameworks like Akka / Pekko.
I explained in one of my blog posts how Cluster Sharding and a state machine work together in Pekko:
Cluster Sharding with State Machine
Maybe this solution fits your requirements better. I hope this helps.
It is about fault tolerance.
Your internal state change should be idempotent in case of an unexpected (not graceful) shutdown. That part is entirely your responsibility, because it is business logic.
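To make the idempotency point concrete, here is a minimal sketch in plain Java. `IdempotentTickGuard`, `windowKey` and `runOnce` are hypothetical names, not part of Spring State Machine, and the in-memory set stands in for a database table with a unique key (an assumption; in a real setup the marker and the business change would be written in the same transaction). The idea is that a timer tick repeated after a crash and restore is recognized and skipped.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not part of Spring State Machine.
final class IdempotentTickGuard {
    // Stand-in for a database table with a unique constraint on the key.
    private final Set<String> processed = ConcurrentHashMap.newKeySet();
    private final long intervalMillis;

    IdempotentTickGuard(Duration interval) {
        this.intervalMillis = interval.toMillis();
    }

    // Stable key for the timer window that contains 'now'.
    String windowKey(String machineId, Instant now) {
        return machineId + ":" + (now.toEpochMilli() / intervalMillis);
    }

    // Runs 'work' at most once per machine and window. Returns true if it ran.
    boolean runOnce(String machineId, Instant now, Runnable work) {
        String key = windowKey(machineId, now);
        if (!processed.add(key)) {
            return false; // this window was already handled, e.g. before a crash
        }
        try {
            work.run();
            return true;
        } catch (RuntimeException e) {
            processed.remove(key); // allow a retry if the work itself failed
            throw e;
        }
    }
}
```

Inside the timer action you would then call something like `guard.runOnce(machineId, Instant.now(), () -> checkAndTriggerEvent())`, where `guard` and `machineId` are whatever your application uses to identify the machine.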
The other part, catching the termination event and re-creating a new instance (in this case the state machine bean), is a matter of clustering and fault-tolerance design.
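One common way to detect that an instance is gone and let another pod take over is a lease with an expiry that the owner keeps renewing. The following is only a sketch of that idea in plain Java, not something Spring State Machine provides: `MachineLeases`, `Lease` and `tryAcquire` are hypothetical names, and the in-memory map stands in for a shared store such as a database row updated with a conditional write.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical lease registry; a real one would live in a shared store.
final class MachineLeases {
    // Which pod currently holds a machine, and until when.
    static final class Lease {
        final String owner;
        final Instant expiresAt;

        Lease(String owner, Instant expiresAt) {
            this.owner = owner;
            this.expiresAt = expiresAt;
        }
    }

    private final Map<String, Lease> leases = new ConcurrentHashMap<>();
    private final Duration ttl;

    MachineLeases(Duration ttl) {
        this.ttl = ttl;
    }

    // Acquires a free or expired lease, or renews our own.
    // Returns true if 'podId' holds the lease afterwards.
    boolean tryAcquire(String machineId, String podId, Instant now) {
        Lease result = leases.compute(machineId, (id, current) -> {
            boolean free = current == null || !current.expiresAt.isAfter(now);
            boolean mine = current != null && current.owner.equals(podId);
            return (free || mine) ? new Lease(podId, now.plus(ttl)) : current;
        });
        return result.owner.equals(podId);
    }
}
```

Each pod would call `tryAcquire` periodically for the machines it runs (to renew) and for the machines it finds in the persistence store (to adopt the ones whose owner stopped renewing). A pod that wins the lease then restores the machine from the persisted state and starts it.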
Spring State Machine has a ZooKeeper extension for this: https://github.com/spring-projects/spring-statemachine/blob/main/spring-statemachine-zookeeper/src/main/java/org/springframework/statemachine/zookeeper/ZookeeperStateMachineEnsemble.java
I haven't tried it, but ZooKeeper is a long-established solution for this. Persistence is also important for recovering the previous state, but I believe you have already covered that with JPA persistence.
However, this is not only about re-creating the instance but also about internal communication between instances, which depends on how the instances are implemented. Sometimes it can become a huge bottleneck, so please test for it.
I hope the extension matches your situation, but if not, you can also look at other clustering solutions such as Play Framework actors.
https://www.playframework.com/
It is a different kind of tool, but it is good for fault tolerance and offers an internal communication channel via an event bus. There is no direct Spring State Machine integration, but you can put a state machine into an actor instance.
Another approach: if you don't need a cluster of Spring State Machines at the global level of the application but only within the pod replica set, you can just check the POD_UID environment variable and initialize (or restore) the state while the application bootstraps.
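A rough sketch of that bootstrap check in plain Java follows. Note the assumptions: Kubernetes does not set `POD_UID` by itself, so it has to be injected into the container, for example through the Downward API (`fieldRef` on `metadata.uid`). `PodBootstrap` and its methods are hypothetical names, and the `owners` map (machine ID to the UID of the pod that last ran it) stands in for a column you would store next to the persisted state. As far as I know, a container restart keeps the pod UID while a replacement pod gets a new one, so this check alone only covers restarts within the same pod.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

// Hypothetical bootstrap helper, not part of Spring State Machine.
final class PodBootstrap {
    // Reads the pod UID from the given environment (assumed to be injected as POD_UID).
    static Optional<String> currentPodUid(Map<String, String> env) {
        String uid = env.get("POD_UID");
        return (uid == null || uid.isEmpty()) ? Optional.empty() : Optional.of(uid);
    }

    // Returns, in sorted order, the machine IDs whose recorded owner is this pod.
    // 'owners' maps machine ID -> UID of the pod that last ran the machine.
    static List<String> machinesToRestore(Map<String, String> owners, String podUid) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> entry : new TreeMap<>(owners).entrySet()) {
            if (podUid.equals(entry.getValue())) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

At startup you would call `PodBootstrap.currentPodUid(System.getenv())`, look up the machines recorded for that UID, and restore each of them from the JPA store; an empty list means the pod is fresh and only needs to initialize new machines.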
Play Framework also uses Akka / Pekko under the hood, so you can also solve this problem with Play Framework, but if you want bare-bones access to the underlying framework, I advise you to check the Akka / Pekko solution I mentioned above.