
Orquesta throws a Python exception if core.ask is run in parallel with another task

Open markcrobinson opened this issue 6 years ago • 5 comments

SUMMARY

Orquesta can't handle pausing for core.ask while a parallel task is running

STACKSTORM VERSION

st2 3.1.0, on Python 2.7.5

OS, environment, install method

RHEL 7.6, script install

Steps to reproduce the problem

1. Create a workflow that starts two tasks in parallel, one of which is a core.ask inquiry
2. Run the workflow and answer the inquiry while the parallel task is running
3. Use st2 execution get after answering the inquiry, but before the parallel task has completed

Example Orquesta workflow: https://gist.github.com/markcrobinson/1618c90d2c2b66f0ae46a64c25b132d6
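
For reference, a minimal Orquesta workflow along these lines would look roughly like the sketch below. This is an illustrative reconstruction based on the task names and actions visible in the execution output further down (start/core.echo, do_inquiry/core.ask, snooze/core.local), not the exact contents of the gist:

```yaml
version: 1.0

description: Sketch reproducing the parallel core.ask pause issue (assumed, not the exact gist)

tasks:
  start:
    action: core.echo message="starting"
    next:
      - when: <% succeeded() %>
        do:
          # Both tasks run in parallel from here
          - do_inquiry
          - snooze

  # Pauses the workflow until someone responds via `st2 inquiry respond`
  do_inquiry:
    action: core.ask

  # Long-running task in parallel with the inquiry; the bug surfaces when
  # the inquiry is answered while this task is still running
  snooze:
    action: core.local cmd="sleep 60"
```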

Expected Results

The workflow allows you to answer the inquiry and then completes successfully

Actual Results

A Python error such as:

$  st2 execution get 5e43e5da103007aab0b8bcf2
id: 5e43e5da103007aab0b8bcf2
action.ref: testing.test-parallel-pause
parameters: None
status: failed (16s elapsed)
start_timestamp: Wed, 12 Feb 2020 11:47:38 UTC
end_timestamp: Wed, 12 Feb 2020 11:47:54 UTC
result:
  error: Unable to process event "workflow_paused" for workflow in "pausing" status.
  traceback: "  File "/opt/stackstorm/st2/lib/python2.7/site-packages/st2actions/container/base.py", line 193, in _do_pause
    (status, result, context) = runner.pause()
  File "/opt/stackstorm/st2/lib/python2.7/site-packages/orquesta_runner/orquesta_runner.py", line 136, in pause
    wf_ex_db = wf_svc.request_pause(self.execution)
  File "/opt/stackstorm/st2/lib/python2.7/site-packages/retrying.py", line 49, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/opt/stackstorm/st2/lib/python2.7/site-packages/retrying.py", line 206, in call
    return attempt.get(self._wrap_exception)
  File "/opt/stackstorm/st2/lib/python2.7/site-packages/retrying.py", line 247, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/opt/stackstorm/st2/lib/python2.7/site-packages/retrying.py", line 200, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/opt/stackstorm/st2/lib/python2.7/site-packages/retrying.py", line 49, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/opt/stackstorm/st2/lib/python2.7/site-packages/retrying.py", line 206, in call
    return attempt.get(self._wrap_exception)
  File "/opt/stackstorm/st2/lib/python2.7/site-packages/retrying.py", line 247, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/opt/stackstorm/st2/lib/python2.7/site-packages/retrying.py", line 200, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/opt/stackstorm/st2/lib/python2.7/site-packages/st2common/services/workflows.py", line 312, in request_pause
    conductor.request_workflow_status(statuses.PAUSED)
  File "/opt/stackstorm/st2/lib/python2.7/site-packages/orquesta/conducting.py", line 363, in request_workflow_status
    raise exc.InvalidWorkflowStatusTransition(current_status, wf_ex_event.name)
"
+--------------------------+-------------------------+------------+------------+-------------------------------+
| id                       | status                  | task       | action     | start_timestamp               |
+--------------------------+-------------------------+------------+------------+-------------------------------+
| 5e43e5db103007aaf8abdee5 | succeeded (0s elapsed)  | start      | core.echo  | Wed, 12 Feb 2020 11:47:39 UTC |
| 5e43e5dc103007aaf8abdee8 | succeeded (14s elapsed) | do_inquiry | core.ask   | Wed, 12 Feb 2020 11:47:40 UTC |
| 5e43e5dc103007aaf8abdeeb | running (17s elapsed)   | snooze     | core.local | Wed, 12 Feb 2020 11:47:40 UTC |
+--------------------------+-------------------------+------------+------------+-------------------------------+

This exception is cleared once the parallel task is done, but the execution stays in a paused state:

$  st2 execution get 5e43e5da103007aab0b8bcf2
id: 5e43e5da103007aab0b8bcf2
action.ref: testing.test-parallel-pause
parameters: None
status: paused
start_timestamp: Wed, 12 Feb 2020 11:47:38 UTC
end_timestamp: Wed, 12 Feb 2020 11:47:54 UTC
result:
  output: null
+--------------------------+-------------------------+------------+------------+-------------------------------+
| id                       | status                  | task       | action     | start_timestamp               |
+--------------------------+-------------------------+------------+------------+-------------------------------+
| 5e43e5db103007aaf8abdee5 | succeeded (0s elapsed)  | start      | core.echo  | Wed, 12 Feb 2020 11:47:39 UTC |
| 5e43e5dc103007aaf8abdee8 | succeeded (14s elapsed) | do_inquiry | core.ask   | Wed, 12 Feb 2020 11:47:40 UTC |
| 5e43e5dc103007aaf8abdeeb | succeeded (60s elapsed) | snooze     | core.local | Wed, 12 Feb 2020 11:47:40 UTC |
+--------------------------+-------------------------+------------+------------+-------------------------------+

markcrobinson avatar Feb 12 '20 11:02 markcrobinson

Thanks for the detailed bug report and workflow example.

Just to confirm, you're not running Orquesta in an HA environment? E.g., it's a single server with a single Orquesta service running?

cc @m4dcoder

arm4b avatar Feb 12 '20 15:02 arm4b

Nope, it's a single box

markcrobinson avatar Feb 12 '20 15:02 markcrobinson

I have just retested this with a fresh install of StackStorm 3.2 on RHEL 7. The behaviour is now better in that the errors are handled gracefully, but the underlying issue is still present.
Status with 3.2:
Status with 3.2:

  • There is no longer a Python traceback printed when running st2 execution get (after responding to the inquiry)
  • The workflow status sits at "pausing" after the inquiry has been answered, until the sleep finishes
  • Once the sleep is done, its task changes to state timeout(!)
  • Once that happens, the main workflow changes to state failed
$ st2 run -a testing.test-parallel-pause
To get the results, execute:
 st2 execution get 5eb136603e3753679fff04fa

To view output in real-time, execute:
 st2 execution tail 5eb136603e3753679fff04fa
$  st2 execution get 5eb136603e3753679fff04fa
id: 5eb136603e3753679fff04fa
action.ref: testing.test-parallel-pause
parameters: None
status: running (3s elapsed)
start_timestamp: Tue, 05 May 2020 09:48:16 UTC
end_timestamp:
result:
  output: null
+--------------------------+------------------------+------------+------------+------------------------+
| id                       | status                 | task       | action     | start_timestamp        |
+--------------------------+------------------------+------------+------------+------------------------+
| 5eb136611c7af71f5a03a21e | succeeded (0s elapsed) | start      | core.echo  | Tue, 05 May 2020       |
|                          |                        |            |            | 09:48:17 UTC           |
| 5eb136611c7af71f5a03a221 | pending                | do_inquiry | core.ask   | Tue, 05 May 2020       |
|                          |                        |            |            | 09:48:17 UTC           |
| 5eb136611c7af71f5a03a224 | running (2s elapsed)   | snooze     | core.local | Tue, 05 May 2020       |
|                          |                        |            |            | 09:48:17 UTC           |
+--------------------------+------------------------+------------+------------+------------------------+
$ st2 inquiry respond 5eb136611c7af71f5a03a221
answer: blah

Response accepted for inquiry 5eb136611c7af71f5a03a221.
$  st2 execution get 5eb136603e3753679fff04fa
id: 5eb136603e3753679fff04fa
action.ref: testing.test-parallel-pause
parameters: None
status: pausing
start_timestamp: Tue, 05 May 2020 09:48:16 UTC
end_timestamp:
result:
  output: null
+--------------------------+-------------------------+------------+------------+------------------------+
| id                       | status                  | task       | action     | start_timestamp        |
+--------------------------+-------------------------+------------+------------+------------------------+
| 5eb136611c7af71f5a03a21e | succeeded (0s elapsed)  | start      | core.echo  | Tue, 05 May 2020       |
|                          |                         |            |            | 09:48:17 UTC           |
| 5eb136611c7af71f5a03a221 | succeeded (16s elapsed) | do_inquiry | core.ask   | Tue, 05 May 2020       |
|                          |                         |            |            | 09:48:17 UTC           |
| 5eb136611c7af71f5a03a224 | running (18s elapsed)   | snooze     | core.local | Tue, 05 May 2020       |
|                          |                         |            |            | 09:48:17 UTC           |
+--------------------------+-------------------------+------------+------------+------------------------+
$  st2 execution get 5eb136603e3753679fff04fa
id: 5eb136603e3753679fff04fa
action.ref: testing.test-parallel-pause
parameters: None
status: failed (62s elapsed)
start_timestamp: Tue, 05 May 2020 09:48:16 UTC
end_timestamp: Tue, 05 May 2020 09:49:18 UTC
result:
  errors: []
  output:
    context:
      answer: blah
      awake: true
      parent:
        pack: testing
        user: stanley
      st2:
        action_execution_id: 5eb136603e3753679fff04fa
        api_url: http://127.0.0.1:9101/v1
        pack: testing
        user: stanley
        workflow_execution_id: 5eb13661cc0f7052c3a8742e
+--------------------------+-------------------------+------------+------------+------------------------+
| id                       | status                  | task       | action     | start_timestamp        |
+--------------------------+-------------------------+------------+------------+------------------------+
| 5eb136611c7af71f5a03a21e | succeeded (0s elapsed)  | start      | core.echo  | Tue, 05 May 2020       |
|                          |                         |            |            | 09:48:17 UTC           |
| 5eb136611c7af71f5a03a221 | succeeded (16s elapsed) | do_inquiry | core.ask   | Tue, 05 May 2020       |
|                          |                         |            |            | 09:48:17 UTC           |
| 5eb136611c7af71f5a03a224 | timeout (61s elapsed)   | snooze     | core.local | Tue, 05 May 2020       |
|                          |                         |            |            | 09:48:17 UTC           |
+--------------------------+-------------------------+------------+------------+------------------------+

markcrobinson avatar May 05 '20 10:05 markcrobinson

I have hit what looks like the same issue, except I have a join downstream of the parallel tasks which never gets hit. The workflow remains in the 'paused' state indefinitely (or at least for 10 minutes).

StackStorm 3.4, installed with the bash one-liner on Ubuntu 18.04.

leigh-thg avatar Apr 23 '21 08:04 leigh-thg

I believe we have encountered this issue on st2 3.5.0, on Python 3.6.8, on CentOS 7.

It happens when running a core.ask action using the with-items method, both in a subworkflow, where the parent workflow stays paused indefinitely, and as a standalone workflow, where the workflow gets stuck resuming.

Sometimes the workflow continues without issues, but at least half the time it gets stuck in a paused state. We tested this behavior with 3 parallel core.ask actions.
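
For reference, a with-items variant of the pattern described above might look like the following sketch. The task name, input variable, and question list are illustrative assumptions, not our exact workflow:

```yaml
version: 1.0

input:
  - questions: ["q1", "q2", "q3"]

tasks:
  # Fires one core.ask inquiry per item, so three inquiries run in
  # parallel; each answered inquiry triggers its own pause/resume cycle,
  # which is where the workflow intermittently gets stuck
  ask_all:
    with:
      items: <% ctx().questions %>
    action: core.ask
```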

lnguyendl avatar Dec 07 '21 10:12 lnguyendl