A daemon pod that is triggered through templateRef keeps running after the workflow ends
Pre-requisites
- [X] I have double-checked my configuration
- [X] I can confirm the issue exists when I tested with :latest
- [ ] I'd like to contribute the fix myself (see contributing guide)
What happened/what you expected to happen?
The workflow's basic structure: the workflow consists of two steps. The first is a (Redis) server pod configured as a daemon pod; the second is a consumer pod that gets the IP address of the daemon pod and does something with it.

What I expect to happen: when the workflow ends, the daemon pod stops as well.

What actually happens: if I call the daemon pod via a "regular" template, the daemon pod stops as expected when the workflow ends. But if I call the daemon pod via templateRef, the daemon pod does not stop when the workflow ends and keeps running.
Version
3.4.8, 3.5.0.rc1
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: daemon-bug
spec:
  entrypoint: engine
  templates:
    - name: engine
      steps:
        - - name: set-daemon-srv
            templateRef:
              name: general-demo-templates
              template: daemon-template
        - - name: use-daemon-srv
            template: use-daemon-srv
            arguments:
              parameters:
                - name: srv-ip
                  value: "{{steps.set-daemon-srv.ip}}"
    - name: use-daemon-srv
      inputs:
        parameters:
          - name: srv-ip
      container:
        image: docker/whalesay
        command: [ sh, -c ]
        args: [ "echo server ip is: {{inputs.parameters.srv-ip}}" ]
---
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: general-demo-templates
spec:
  templates:
    - name: daemon-template
      daemon: true
      retryStrategy:
        limit: 3
      container:
        image: redis
        command: [ docker-entrypoint.sh ]
        args: [ "redis-server" ]
        readinessProbe:
          exec:
            command:
              - redis-cli
              - ping
Logs from the workflow controller
time="2023-09-03T13:32:02.297Z" level=info msg="Processing workflow" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:02.301Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=daemon-bug-kx56t
time="2023-09-03T13:32:02.301Z" level=info msg="Node became daemoned" namespace=argo nodeId=daemon-bug-kx56t-4049627799 workflow=daemon-bug-kx56t
time="2023-09-03T13:32:02.301Z" level=info msg="node changed" namespace=argo new.message= new.phase=Running new.progress=0/1 nodeID=daemon-bug-kx56t-4049627799 old.message= old.phase=Pending old.progress=0/1 workflow=daemon-bug-kx56t
time="2023-09-03T13:32:02.302Z" level=info msg="node daemon-bug-kx56t-2849824988 phase Running -> Succeeded" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:02.302Z" level=info msg="node daemon-bug-kx56t-2849824988 finished: 2023-09-03 13:32:02.302944725 +0000 UTC" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:02.302Z" level=info msg="Step group node daemon-bug-kx56t-1751689535 successful" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:02.302Z" level=info msg="node daemon-bug-kx56t-1751689535 phase Running -> Succeeded" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:02.303Z" level=info msg="node daemon-bug-kx56t-1751689535 finished: 2023-09-03 13:32:02.303003172 +0000 UTC" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:02.303Z" level=info msg="StepGroup node daemon-bug-kx56t-745282658 initialized Running" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:02.304Z" level=info msg="SG Outbound nodes of daemon-bug-kx56t-2849824988 are [daemon-bug-kx56t-4049627799]" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:02.304Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:02.304Z" level=info msg="Pod node daemon-bug-kx56t-285748714 initialized Pending" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:02.329Z" level=info msg="Created pod: daemon-bug-kx56t[1].use-daemon-srv (daemon-bug-kx56t-use-daemon-srv-285748714)" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:02.329Z" level=info msg="Workflow step group node daemon-bug-kx56t-745282658 not yet completed" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:02.329Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:02.329Z" level=info msg=reconcileAgentPod namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:02.341Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=10229907 workflow=daemon-bug-kx56t
time="2023-09-03T13:32:12.329Z" level=info msg="Processing workflow" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:12.338Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=daemon-bug-kx56t
time="2023-09-03T13:32:12.338Z" level=warning msg="workflow uses legacy/insecure pod patch, see https://argoproj.github.io/argo-workflows/workflow-rbac/" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:12.339Z" level=info msg="node changed" namespace=argo new.message= new.phase=Succeeded new.progress=0/1 nodeID=daemon-bug-kx56t-285748714 old.message= old.phase=Pending old.progress=0/1 workflow=daemon-bug-kx56t
time="2023-09-03T13:32:12.339Z" level=info msg="node unchanged" namespace=argo nodeID=daemon-bug-kx56t-4049627799 workflow=daemon-bug-kx56t
time="2023-09-03T13:32:12.340Z" level=info msg="SG Outbound nodes of daemon-bug-kx56t-2849824988 are [daemon-bug-kx56t-4049627799]" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:12.341Z" level=info msg="Step group node daemon-bug-kx56t-745282658 successful" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:12.342Z" level=info msg="node daemon-bug-kx56t-745282658 phase Running -> Succeeded" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:12.342Z" level=info msg="node daemon-bug-kx56t-745282658 finished: 2023-09-03 13:32:12.342955747 +0000 UTC" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:12.343Z" level=info msg="Outbound nodes of daemon-bug-kx56t-285748714 is [daemon-bug-kx56t-285748714]" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:12.343Z" level=info msg="Outbound nodes of daemon-bug-kx56t is [daemon-bug-kx56t-285748714]" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:12.343Z" level=info msg="node daemon-bug-kx56t phase Running -> Succeeded" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:12.343Z" level=info msg="node daemon-bug-kx56t finished: 2023-09-03 13:32:12.343117518 +0000 UTC" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:12.344Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:12.344Z" level=info msg=reconcileAgentPod namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:12.344Z" level=info msg="Updated phase Running -> Succeeded" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:12.344Z" level=info msg="Marking workflow completed" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:12.344Z" level=info msg="Marking workflow as pending archiving" namespace=argo workflow=daemon-bug-kx56t
time="2023-09-03T13:32:12.349Z" level=info msg="cleaning up pod" action=deletePod key=argo/daemon-bug-kx56t-1340600742-agent/deletePod
time="2023-09-03T13:32:12.349Z" level=info msg="cleaning up pod" action=terminateContainers key=argo/daemon-bug-kx56t--4049627799/terminateContainers
time="2023-09-03T13:32:12.358Z" level=info msg="Workflow update successful" namespace=argo phase=Succeeded resourceVersion=10229990 workflow=daemon-bug-kx56t
time="2023-09-03T13:32:12.364Z" level=info msg="archiving workflow" namespace=argo uid=4bf4eadb-7668-4d52-982a-bc015f678798 workflow=daemon-bug-kx56t
time="2023-09-03T13:32:12.369Z" level=info msg="cleaning up pod" action=labelPodCompleted key=argo/daemon-bug-kx56t-daemon-template-4049627799/labelPodCompleted
Logs from your workflow's wait container
time="2023-09-03T13:31:54.298Z" level=info msg="Starting Workflow Executor" version=v3.5.0-rc1
time="2023-09-03T13:31:54.300Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2023-09-03T13:31:54.300Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=argo podName=daemon-bug-kx56t-daemon-template-4049627799 template="{\"name\":\"daemon-template\",\"inputs\":{},\"outputs\":{},\"metadata\":{},\"daemon\":true,\"container\":{\"name\":\"\",\"image\":\"344864304324.dkr.ecr.us-east-1.amazonaws.com/nss_argo:redis_org\",\"command\":[\"docker-entrypoint.sh\"],\"args\":[\"redis-server\"],\"resources\":{},\"readinessProbe\":{\"exec\":{\"command\":[\"redis-cli\",\"ping\"]}}},\"archiveLocation\":{\"archiveLogs\":true,\"s3\":{\"endpoint\":\"s3.amazonaws.com\",\"bucket\":\"mobileye-nss-temp\",\"region\":\"us-east-1\",\"insecure\":true,\"key\":\"argo-logs/daemon-bug-kx56t/daemon-bug-kx56t-daemon-template-4049627799\"}},\"retryStrategy\":{\"limit\":3}}" version="&Version{Version:v3.5.0-rc1,BuildDate:2023-08-15T18:44:17Z,GitCommit:6cd3e810e9c953e0a78070f41ab8f7d799fe51d3,GitTag:v3.5.0-rc1,GitTreeState:clean,GoVersion:go1.20.7,Compiler:gc,Platform:linux/amd64,}"
time="2023-09-03T13:31:54.301Z" level=info msg="Starting deadline monitor"
time="2023-09-03T13:32:06.324Z" level=info msg="No output artifacts"
time="2023-09-03T13:32:06.324Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: argo-logs/daemon-bug-kx56t/daemon-bug-kx56t-use-daemon-srv-285748714/main.log"
time="2023-09-03T13:32:06.324Z" level=info msg="Creating minio client using IAM role"
time="2023-09-03T13:32:06.324Z" level=info msg="Saving file to s3" bucket=mobileye-nss-temp endpoint=s3.amazonaws.com key=argo-logs/daemon-bug-kx56t/daemon-bug-kx56t-use-daemon-srv-285748714/main.log path=/tmp/argo/outputs/logs/main.log
time="2023-09-03T13:32:06.491Z" level=info msg="Save artifact" artifactName=main-logs duration=166.673472ms error="<nil>" key=argo-logs/daemon-bug-kx56t/daemon-bug-kx56t-use-daemon-srv-285748714/main.log
time="2023-09-03T13:32:06.491Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2023-09-03T13:32:06.491Z" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2023-09-03T13:32:06.505Z" level=warning msg="failed to patch task set, falling back to legacy/insecure pod patch, see https://argoproj.github.io/argo-workflows/workflow-rbac/" error="workflowtaskresults.argoproj.io is forbidden: User \"system:serviceaccount:argo:sandbox-s3\" cannot create resource \"workflowtaskresults\" in API group \"argoproj.io\" in the namespace \"argo\""
time="2023-09-03T13:32:06.535Z" level=info msg="Alloc=8376 TotalAlloc=17335 Sys=43389 NumGC=5 Goroutines=12"
time="2023-09-03T13:32:06.535Z" level=info msg="stopping progress monitor (context done)" error="context canceled"
Is there any workaround for this? I've set a large value for ttlStrategy.secondsAfterFailure on the workflow to keep failed pods around longer, but this causes the daemon pods to run for that length of time as well, which is becoming quite costly.
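For context, the ttlStrategy mentioned here looks roughly like the sketch below (the exact value is an assumption; secondsAfterFailure only delays deletion of the Workflow object, so the unreaped daemon pods linger with it):

```yaml
spec:
  ttlStrategy:
    # Example value: keep failed workflows (and, due to this bug,
    # their daemon pods) around for one day before deletion.
    secondsAfterFailure: 86400
```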
For anyone else struggling with this: I realised that I hadn't set a podGC.strategy on my root workflow template, which references the others. Now I've set it to OnWorkflowCompletion and the daemon pods are being removed as expected.
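Concretely, the workaround amounts to adding a podGC stanza to the root workflow spec. A minimal sketch against the daemon-bug reproduction above (only the podGC block is new; the strategy value comes from the comment):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: daemon-bug
spec:
  entrypoint: engine
  # Delete all pods, including leaked daemon pods, once the workflow completes.
  podGC:
    strategy: OnWorkflowCompletion
  templates:
    # ... templates unchanged from the reproduction above ...
```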
#12533 duplicated this, but included some code analysis that might be useful:
I looked into the logs of the workflow-controller and found some clues: the template name is not correctly set for the terminateContainers action.
time="2024-01-17T09:32:09.931Z" level=info msg="cleaning up pod" action=terminateContainers key=default/daemon-nginx-s5ggl--452970571/terminateContainers
I then dug into the code and located the function responsible. It seems the podName "default/daemon-nginx-s5ggl--452970571" generated by util.GeneratePodName is missing the templateName part, which in this case is nil. I'm not sure whether this bug is related to initializeNode, in which the TemplateName is set from the orgTmpl; for a step referencing a workflowTemplate, that value is nil. I wrote a small workflow that reproduces the bug. Everything works fine if I use a template directly instead of referencing a workflowTemplate.
Any thoughts on how to fix this? Maybe resolve the template to get the actual podName at runtime, like here?
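To make the double hyphen in that log key concrete, here is a simplified, hypothetical sketch (not the actual Argo util.GeneratePodName implementation; the function body is an illustrative assumption) of how an unresolved, empty templateName yields a name like daemon-nginx-s5ggl--452970571 that no longer matches the running pod, so cleanup targets a pod that does not exist:

```go
package main

import "fmt"

// generatePodName is a simplified stand-in for Argo's util.GeneratePodName:
// it joins the workflow name, template name, and node ID with hyphens.
func generatePodName(workflowName, templateName, nodeID string) string {
	return fmt.Sprintf("%s-%s-%s", workflowName, templateName, nodeID)
}

func main() {
	// templateRef case: the node's TemplateName was never resolved,
	// so the middle segment is empty and a double hyphen appears.
	fmt.Println(generatePodName("daemon-nginx-s5ggl", "", "452970571"))
	// → daemon-nginx-s5ggl--452970571

	// Direct-template case: the name matches the actual pod.
	fmt.Println(generatePodName("daemon-nginx-s5ggl", "daemon-nginx", "452970571"))
	// → daemon-nginx-s5ggl-daemon-nginx-452970571
}
```

This matches the malformed key seen both in #12533 and in the controller logs above (key=argo/daemon-bug-kx56t--4049627799/terminateContainers), which is why the terminateContainers action silently fails to stop the daemon.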
Is the workaround still working with version 3.5.6? I can confirm that we are still affected even after trying the workaround.