metafora
m_etcd: Make ClaimTTL and Lost interval documented and easily configurable
Description
m_etcd.ClaimTTL is critical to ensuring a task is being executed exactly once in a Metafora cluster.
Currently the claim TTL is 120s and the claim is refreshed every 90s. If a refresh fails and the task is lost, it has at most 30s (120s TTL minus the 90s refresh interval) to exit before the exactly-once guarantee is lost and the task becomes eligible for simultaneous execution within the cluster.
Claim TTL and refresh interval should be configurable because:
- They are critical to Metafora's correct operation.
- Acceptable values vary by handler and task.
Solution: Configurable TTL, documented refresh calculation
- Make TTL configurable on m_etcd.EtcdCoordinator instead of via a global.
- Document refresh calculation from taskmgr.go and/or make it configurable
Future Improvements
- The coordinator could inform the task handler when the claim will expire via Stop() or metadata on a statemachine.Message. This would allow a handler to detect that simultaneous execution may have occurred and choose to roll back a transaction, not flush data, avoid checkpointing, etc., if possible.
- Claim TTL / refresh interval may be more appropriate to define per task, since a safe interval depends on how long a task handler takes to exit. Ideally a handler would checkpoint at intervals shorter than the claim TTL, so if the claim expires it can simply skip its final checkpoint, since another node may be executing that task.