openwhisk Activations Fail After 5 Min Timeout Occurs

Environment details:

Adobe I/O deployment
Not sure of the OS version but I can try to find out if required

Steps to reproduce the issue:

Run action that takes longer than the max time alotted, aka 5 minutes (300,000ms)
Run action that should succeed and take less time than the max
Watch second action fail

Provide the expected results and outputs:

{
    "namespace": "xxxx",
    "name": "dispatcher",
    "version": "0.0.2",
    "subject": "xxxx",
    "activationId": "50f29217fa9a4d6db29217fa9a7d6d54",
    "start": 1634755228463,
    "end": 1634755528502,
    "duration": 300039,
    "statusCode": 1,
    "response": {
        "status": "application error",
        "statusCode": 0,
        "success": false,
        "result": {
            "error": "The action exceeded its time limits of 300000 milliseconds."
        }
    },
    "logs": [],
    "annotations": [
        {
            "key": "path",
            "value": "xxxx/graphql-reference/dispatcher"
        },
        {
            "key": "waitTime",
            "value": 10
        },
        {
            "key": "kind",
            "value": "nodejs:14"
        },
        {
            "key": "timeout",
            "value": true
        },
        {
            "key": "limits",
            "value": {
                "concurrency": 200,
                "logs": 10,
                "memory": 256,
                "timeout": 300000
            }
        }
    ],
    "publish": false
}

followed by a successful action (as long as it takes less than 5 mins)

Provide the actual results and outputs:

{
    "namespace": "xxxx",
    "name": "dispatcher",
    "version": "0.0.2",
    "subject": "xxxx",
    "activationId": "50f29217fa9a4d6db29217fa9a7d6d54",
    "start": 1634755228463,
    "end": 1634755528502,
    "duration": 300039,
    "statusCode": 1,
    "response": {
        "status": "application error",
        "statusCode": 0,
        "success": false,
        "result": {
            "error": "The action exceeded its time limits of 300000 milliseconds."
        }
    },
    "logs": [],
    "annotations": [
        {
            "key": "path",
            "value": "xxxx/graphql-reference/dispatcher"
        },
        {
            "key": "waitTime",
            "value": 10
        },
        {
            "key": "kind",
            "value": "nodejs:14"
        },
        {
            "key": "timeout",
            "value": true
        },
        {
            "key": "limits",
            "value": {
                "concurrency": 200,
                "logs": 10,
                "memory": 256,
                "timeout": 300000
            }
        }
    ],
    "publish": false
}

followed by failures every time

{
    "namespace": "xxxx",
    "name": "dispatcher",
    "version": "0.0.1",
    "subject": "xxxx",
    "activationId": "4163948263e84268a3948263e8b2682e",
    "start": 1634765247645,
    "end": 1634765257663,
    "duration": 10018,
    "statusCode": 2,
    "response": {
        "status": "action developer error",
        "statusCode": 0,
        "success": false,
        "result": {
            "error": "The action did not produce a valid response and exited unexpectedly."
        }
    },
    "logs": [],
    "annotations": [
        {
            "key": "path",
            "value": "xxxx/graphql-reference/dispatcher"
        },
        {
            "key": "waitTime",
            "value": 2
        },
        {
            "key": "kind",
            "value": "nodejs:14"
        },
        {
            "key": "timeout",
            "value": false
        },
        {
            "key": "limits",
            "value": {
                "concurrency": 200,
                "logs": 10,
                "memory": 256,
                "timeout": 60000
            }
        }
    ],
    "publish": false
}

Additional information you deem important:

If 1 action fails after 5 minutes, the subsequent actions shouldn't continue to fail. I guess I don't know for sure that this is related to OpenWhisk itself or with Adobe's infrastructure, but I would think it would work the same way in any cloud setting. If I'm wrong, you can close this ticket.

Oct 20 '21 21:10 ecooperman

It's hard to diagnose this without understanding more about the function's code. If the action is somehow reusing the same container despite the failure then it's possible that what you're saying is true - I see no init time annotation so these are all warm starts. The error from your last function activation is classified as action developer error so it crashed - does it run cleaning on a cold start?

Oct 29 '21 14:10 rabbah

@rabbah thanks for responding! I worked with the Adobe I/O support team and got the root cause of the issue, which appears to be a few different things as follows:

dcfcd0681d804dafbcd0681d80edafb6 : (Pod eviction) [Pod ephemeral local storage usage exceeds the total limit of containers 500Mi.]

4163948263e84268a3948263e8b2682e : (the node process may have crashed, or the network was slow for some time)

50f29217fa9a4d6db29217fa9a7d6d54 : (Concurrent activation failed and container was terminated) [Activation f27f18a7347146f0bf18a7347116f003 - {"error":"An error has occurred: Cannot load module '': ReferenceError: config is not defined"} ]

To give a little context, all our application does it to make JS calls (using the Axios library), to a commerce system. The commerce system returns data, and we then transform it into what our CMS expects, and return it to the CMS. So the application effectively works as a middleware/ESL layer between a CMS and Commerce system in order to integrate them in a way that makes them feel like a single website. That's all the code does in every possible instance.

So I think this is probably not related to anything that's core to OpenWhisk, but more that we're not handling some specific cases that I'm not really sure how to handle. To answer your question, we're not intentional reusing any containers. I'm new to this, so maybe I'm doing that by accident? Is there some way I can force each activation to use a new container or something?

Since we saw this error, I've at least added a timeout to our Axios calls (30s), with the hopes that at least we won't have any long running calls after that. I'm just not sure exactly what might be happening and what I can do to prevent it. My understanding here is that each activation is independent of all other activations, and so I'm a little confused as to where to even start. That being said, I'm trying to escalate the issue with Adobe to try and get some support from them at least. If you have any suggestions though, I'd love to hear them! For the pod storage, I'm really confused. My code is transient, and after each activation, the container should be destroyed and space reclaimed. Is there anything specific I need to do in order to get storage to clean up?

Oct 29 '21 14:10 ecooperman