Orquesta throws a Python exception if core.ask is run in parallel with another task
SUMMARY
Orquesta can't handle pausing for core.ask while a parallel task is running
STACKSTORM VERSION
st2 3.1.0, on Python 2.7.5
OS, environment, install method
RHEL 7.6, script install
Steps to reproduce the problem
Create a workflow that starts two tasks in parallel, one of which is a core.ask inquiry
Run the workflow and answer the inquiry while the parallel task is running
Use st2 execution get to check the execution status
Example Orquesta workflow: https://gist.github.com/markcrobinson/1618c90d2c2b66f0ae46a64c25b132d6
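The gist above is the authoritative reproduction. For readers who don't want to follow the link, a minimal sketch of the same shape (task and parameter names here are illustrative and may differ from the actual gist) looks like:

```yaml
version: 1.0

description: >
    Hypothetical repro sketch - a core.ask inquiry running in parallel
    with a long-running task.

tasks:
  start:
    action: core.echo message="starting"
    next:
      # Fan out: the inquiry and the sleep run concurrently.
      - do: do_inquiry, snooze
  do_inquiry:
    action: core.ask
    input:
      schema:
        type: object
        properties:
          answer:
            type: string
            required: true
  snooze:
    # Long enough that the inquiry can be answered while this is running.
    action: core.local cmd="sleep 60"
```

Answering the inquiry while snooze is still running is what triggers the pause/resume race described below.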
Expected Results
The workflow allows you to answer the inquiry and then completes successfully
Actual Results
A Python error such as:
$ st2 execution get 5e43e5da103007aab0b8bcf2
id: 5e43e5da103007aab0b8bcf2
action.ref: testing.test-parallel-pause
parameters: None
status: failed (16s elapsed)
start_timestamp: Wed, 12 Feb 2020 11:47:38 UTC
end_timestamp: Wed, 12 Feb 2020 11:47:54 UTC
result:
error: Unable to process event "workflow_paused" for workflow in "pausing" status.
traceback: " File "/opt/stackstorm/st2/lib/python2.7/site-packages/st2actions/container/base.py", line 193, in _do_pause
(status, result, context) = runner.pause()
File "/opt/stackstorm/st2/lib/python2.7/site-packages/orquesta_runner/orquesta_runner.py", line 136, in pause
wf_ex_db = wf_svc.request_pause(self.execution)
File "/opt/stackstorm/st2/lib/python2.7/site-packages/retrying.py", line 49, in wrapped_f
return Retrying(*dargs, **dkw).call(f, *args, **kw)
File "/opt/stackstorm/st2/lib/python2.7/site-packages/retrying.py", line 206, in call
return attempt.get(self._wrap_exception)
File "/opt/stackstorm/st2/lib/python2.7/site-packages/retrying.py", line 247, in get
six.reraise(self.value[0], self.value[1], self.value[2])
File "/opt/stackstorm/st2/lib/python2.7/site-packages/retrying.py", line 200, in call
attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
File "/opt/stackstorm/st2/lib/python2.7/site-packages/retrying.py", line 49, in wrapped_f
return Retrying(*dargs, **dkw).call(f, *args, **kw)
File "/opt/stackstorm/st2/lib/python2.7/site-packages/retrying.py", line 206, in call
return attempt.get(self._wrap_exception)
File "/opt/stackstorm/st2/lib/python2.7/site-packages/retrying.py", line 247, in get
six.reraise(self.value[0], self.value[1], self.value[2])
File "/opt/stackstorm/st2/lib/python2.7/site-packages/retrying.py", line 200, in call
attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
File "/opt/stackstorm/st2/lib/python2.7/site-packages/st2common/services/workflows.py", line 312, in request_pause
conductor.request_workflow_status(statuses.PAUSED)
File "/opt/stackstorm/st2/lib/python2.7/site-packages/orquesta/conducting.py", line 363, in request_workflow_status
raise exc.InvalidWorkflowStatusTransition(current_status, wf_ex_event.name)
"
+--------------------------+-------------------------+------------+------------+-------------------------------+
| id | status | task | action | start_timestamp |
+--------------------------+-------------------------+------------+------------+-------------------------------+
| 5e43e5db103007aaf8abdee5 | succeeded (0s elapsed) | start | core.echo | Wed, 12 Feb 2020 11:47:39 UTC |
| 5e43e5dc103007aaf8abdee8 | succeeded (14s elapsed) | do_inquiry | core.ask | Wed, 12 Feb 2020 11:47:40 UTC |
| 5e43e5dc103007aaf8abdeeb | running (17s elapsed) | snooze | core.local | Wed, 12 Feb 2020 11:47:40 UTC |
+--------------------------+-------------------------+------------+------------+-------------------------------+
This exception is cleared once the parallel task is done, but the execution stays in paused state:
$ st2 execution get 5e43e5da103007aab0b8bcf2
id: 5e43e5da103007aab0b8bcf2
action.ref: testing.test-parallel-pause
parameters: None
status: paused
start_timestamp: Wed, 12 Feb 2020 11:47:38 UTC
end_timestamp: Wed, 12 Feb 2020 11:47:54 UTC
result:
output: null
+--------------------------+-------------------------+------------+------------+-------------------------------+
| id | status | task | action | start_timestamp |
+--------------------------+-------------------------+------------+------------+-------------------------------+
| 5e43e5db103007aaf8abdee5 | succeeded (0s elapsed) | start | core.echo | Wed, 12 Feb 2020 11:47:39 UTC |
| 5e43e5dc103007aaf8abdee8 | succeeded (14s elapsed) | do_inquiry | core.ask | Wed, 12 Feb 2020 11:47:40 UTC |
| 5e43e5dc103007aaf8abdeeb | succeeded (60s elapsed) | snooze | core.local | Wed, 12 Feb 2020 11:47:40 UTC |
+--------------------------+-------------------------+------------+------------+-------------------------------+
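For context, the InvalidWorkflowStatusTransition at the bottom of the traceback comes from a status state machine that rejects events not valid for the current status. The following is a simplified, hypothetical sketch of that kind of guard (not Orquesta's actual implementation, which lives in orquesta/conducting.py); the transition table is invented for illustration and deliberately omits the pausing -> paused edge to mirror the rejected transition in the traceback:

```python
class InvalidWorkflowStatusTransition(Exception):
    def __init__(self, status, event):
        super().__init__(
            'Unable to process event "{}" for workflow in "{}" status.'.format(event, status)
        )


# Hypothetical transition table for illustration only.
TRANSITIONS = {
    ("running", "workflow_pausing"): "pausing",
    ("running", "workflow_paused"): "paused",
    # No ("pausing", "workflow_paused") entry, mirroring the transition
    # rejected in the traceback above.
}


def request_workflow_status(current_status, event):
    # Look up the requested transition; reject anything not whitelisted.
    try:
        return TRANSITIONS[(current_status, event)]
    except KeyError:
        raise InvalidWorkflowStatusTransition(current_status, event)
```

The race appears to be that answering the inquiry requests a pause while the workflow is already pausing for the inquiry, so the second request hits an invalid transition.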
Thanks for the detailed bug report and workflow example.
Just to confirm, you're not running Orquesta in an HA environment? E.g., it's a single server with a single Orquesta service running?
cc @m4dcoder
Nope, it's a single box
I have just retested this with a fresh install of StackStorm 3.2 on RHEL 7.
The behaviour is now better in that the errors are handled gracefully, but the underlying issue is still present.
Status with 3.2:
- There is no longer a python traceback printed when doing st2 execution get (after responding to the inquiry)
- The workflow status sits at "pausing" after the inquiry has been answered, until the sleep finishes
- Once the sleep is done, its task changes to the timeout state (!)
- Once that happens, the main workflow changes to state failed
$ st2 run -a testing.test-parallel-pause
To get the results, execute:
st2 execution get 5eb136603e3753679fff04fa
To view output in real-time, execute:
st2 execution tail 5eb136603e3753679fff04fa
$ st2 execution get 5eb136603e3753679fff04fa
id: 5eb136603e3753679fff04fa
action.ref: testing.test-parallel-pause
parameters: None
status: running (3s elapsed)
start_timestamp: Tue, 05 May 2020 09:48:16 UTC
end_timestamp:
result:
output: null
+--------------------------+------------------------+------------+------------+------------------------+
| id | status | task | action | start_timestamp |
+--------------------------+------------------------+------------+------------+------------------------+
| 5eb136611c7af71f5a03a21e | succeeded (0s elapsed) | start | core.echo | Tue, 05 May 2020 |
| | | | | 09:48:17 UTC |
| 5eb136611c7af71f5a03a221 | pending | do_inquiry | core.ask | Tue, 05 May 2020 |
| | | | | 09:48:17 UTC |
| 5eb136611c7af71f5a03a224 | running (2s elapsed) | snooze | core.local | Tue, 05 May 2020 |
| | | | | 09:48:17 UTC |
+--------------------------+------------------------+------------+------------+------------------------+
$ st2 inquiry respond 5eb136611c7af71f5a03a221
answer: blah
Response accepted for inquiry 5eb136611c7af71f5a03a221.
$ st2 execution get 5eb136603e3753679fff04fa
id: 5eb136603e3753679fff04fa
action.ref: testing.test-parallel-pause
parameters: None
status: pausing
start_timestamp: Tue, 05 May 2020 09:48:16 UTC
end_timestamp:
result:
output: null
+--------------------------+-------------------------+------------+------------+------------------------+
| id | status | task | action | start_timestamp |
+--------------------------+-------------------------+------------+------------+------------------------+
| 5eb136611c7af71f5a03a21e | succeeded (0s elapsed) | start | core.echo | Tue, 05 May 2020 |
| | | | | 09:48:17 UTC |
| 5eb136611c7af71f5a03a221 | succeeded (16s elapsed) | do_inquiry | core.ask | Tue, 05 May 2020 |
| | | | | 09:48:17 UTC |
| 5eb136611c7af71f5a03a224 | running (18s elapsed) | snooze | core.local | Tue, 05 May 2020 |
| | | | | 09:48:17 UTC |
+--------------------------+-------------------------+------------+------------+------------------------+
$ st2 execution get 5eb136603e3753679fff04fa
id: 5eb136603e3753679fff04fa
action.ref: testing.test-parallel-pause
parameters: None
status: failed (62s elapsed)
start_timestamp: Tue, 05 May 2020 09:48:16 UTC
end_timestamp: Tue, 05 May 2020 09:49:18 UTC
result:
errors: []
output:
context:
answer: blah
awake: true
parent:
pack: testing
user: stanley
st2:
action_execution_id: 5eb136603e3753679fff04fa
api_url: http://127.0.0.1:9101/v1
pack: testing
user: stanley
workflow_execution_id: 5eb13661cc0f7052c3a8742e
+--------------------------+-------------------------+------------+------------+------------------------+
| id | status | task | action | start_timestamp |
+--------------------------+-------------------------+------------+------------+------------------------+
| 5eb136611c7af71f5a03a21e | succeeded (0s elapsed) | start | core.echo | Tue, 05 May 2020 |
| | | | | 09:48:17 UTC |
| 5eb136611c7af71f5a03a221 | succeeded (16s elapsed) | do_inquiry | core.ask | Tue, 05 May 2020 |
| | | | | 09:48:17 UTC |
| 5eb136611c7af71f5a03a224 | timeout (61s elapsed) | snooze | core.local | Tue, 05 May 2020 |
| | | | | 09:48:17 UTC |
+--------------------------+-------------------------+------------+------------+------------------------+
I have hit what looks like the same issue, except I have a join downstream of the parallel tasks which never gets hit. The workflow remains in the 'paused' state indefinitely (well, for at least 10 minutes).
StackStorm 3.4, installed with the bash one-liner, on Ubuntu 18.04.
I believe we encounter this issue on st2 3.5.0, on Python 3.6.8, on CentOS 7, when running a core.ask action using the with-items method. It happens both in a subworkflow, where the parent workflow stays paused indefinitely, and as a standalone workflow, where the workflow gets stuck resuming.
Sometimes the workflow continues without issues, but at least half the time it gets stuck in a paused state. We tested this behavior with 3 parallel core.ask actions.
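For reference, the with-items variant described here can be sketched as follows (a hypothetical minimal example, not our production workflow; names and the input schema are illustrative):

```yaml
version: 1.0

input:
  - questions

tasks:
  ask_all:
    # Three items in ctx(questions) yields three concurrent core.ask
    # inquiries; roughly half the runs get stuck in a paused state.
    with:
      items: q in <% ctx(questions) %>
    action: core.ask
    input:
      schema:
        type: object
        properties:
          answer:
            type: string
            required: true
```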