Application reconciliation drops to zero (v2.10.6)
Checklist:
- [x] I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
- [ ] I've included steps to reproduce the bug.
- [x] I've pasted the output of `argocd version`.
**Describe the bug**
We have experienced issues recently where the application reconciliation count drops to zero; initially this recovered after a restart of the application controller.
**To Reproduce**
We observed high CPU in both the controller and repository servers. We are running approximately 1250 applications across 11 clusters; the majority of applications are synced from a monorepo.
In an attempt to resolve the issue, we first upgraded to 2.10.6. To try and reduce the CPU utilisation, we found a large number of `msg="Requesting app refresh caused by object update"` logs and referred to the docs to implement ignoreResourceUpdates.
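For context, ignoreResourceUpdates is configured via the argocd-cm ConfigMap. A minimal sketch of the shape of that change (the keys below mirror the docs' status-only example, not our exact production config):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
data:
  # Enable the feature if your version doesn't enable it by default.
  resource.ignoreResourceUpdatesEnabled: "true"
  # Don't trigger app refreshes for status-only changes on any resource.
  resource.customizations.ignoreResourceUpdates.all: |
    jsonPointers:
      - /status
```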
We amended the controller config to increase the status and operation processors and repo server timeout (ref: https://argo-cd.readthedocs.io/en/stable/operator-manual/high_availability/#argocd-application-controller)
controller.status.processors: "50"
controller.operation.processors: "25"
controller.repo.server.timeout.seconds: "120"
We also discovered a lot of `Failed to watch` warnings in the logs; referring to a previous issue, we removed redundant CRDs.
Whilst these changes have had a positive impact on the controller CPU and the number of reconciliations (applied ~12:20), we still observed issues later in the day (~17:50) with reconciliation dropping to zero; a controller restart resolved the issue:
We are seeing the issue more frequently now, and the main symptom appears to be in the repository server, where we see goroutines become stuck:
Traces of the repository server show that the latency of GenerateManifest (repository.RepoServerService) has increased significantly, from millisecond responses to, in extreme cases, 26 minutes.
In more recent incidents, it has required a restart of the repo server to recover the service.
Do you have any suggestions?
**Version**
argocd: v2.3.4+ac8b7df.dirty
BuildDate: 2022-05-22T11:51:06Z
GitCommit: ac8b7df9467ffcc0920b826c62c4b603a7bfed24
GitTreeState: dirty
GoVersion: go1.18.2
Compiler: gc
Platform: darwin/arm64
argocd-server: v2.10.6+d504d2b
Hm.... lots going on. It does sound like the repo server is getting bogged down. How are metrics looking on the repo server side? Hitting CPU throttling maybe?
Hi! Are you using a CMP plugin?
@crenshaw-dev we don't have limits set on the repo server. We've updated the config so the repo server runs on its own nodes, as we do see it consuming 100% of the node CPU. The below shows the behaviour during an incident where goroutine usage on one repo server increases and reconciliations drop to zero. A restart of the repo server at ~6pm resolved it.
@jsolana we're not using CMP.
> we do see it consuming 100% of the node CPU
Does this mean that the repo server is likely facing CPU throttling during the event you screenshotted above?
@crenshaw-dev it may be briefly, yes. We observed the issue again a couple of times yesterday, and we see CPU spike on all 3 repo server pods:
On one of the pods goroutine usage remains high:
This is the CPU profile of the 3 nodes running the repo server pods:
In this instance, no manual action was taken and the service recovered (no automated pod restarts either). In previous instances of the issue, restarting just the repo server pod resolved it.
Hello! I've been working with @paulparke on this. We've managed to capture goroutine profiles from the /debug/pprof endpoint during an incident. Upon analysis, it seems that the majority of goroutines (in this particular case, 150) were waiting to acquire the lock (https://github.com/argoproj/argo-cd/blob/v2.10.6/reposerver/repository/lock.go#L52). One example:
goroutine 2369328420 [sync.Mutex.Lock, 46 minutes]:
sync.runtime_SemacquireMutex(0x7fed7642fe30?, 0x18?, 0x8?)
/usr/local/go/src/runtime/sema.go:77 +0x25
sync.(*Mutex).lockSlow(0xc000d177d8)
/usr/local/go/src/sync/mutex.go:171 +0x15d
sync.(*Mutex).Lock(0x3614d00?)
/usr/local/go/src/sync/mutex.go:90 +0x32
github.com/argoproj/argo-cd/v2/reposerver/repository.(*repositoryLock).Lock(0xc000528560, {0xc000b10c40, 0x36}, {0xc007300030, 0x28}, 0x1, 0xc007978e68)
/go/src/github.com/argoproj/argo-cd/reposerver/repository/lock.go:52 +0x22f
github.com/argoproj/argo-cd/v2/reposerver/repository.(*Service).runRepoOperation(0xc000b16640, {0x50a8638, 0xc006d60810}, {0xc00b062d38, 0x6}, 0xc0021bc240, 0xc002a30700, 0x0, 0xc0079791c8, 0xc007979208, ...)
/go/src/github.com/argoproj/argo-cd/reposerver/repository/repository.go:378 +0xbe8
github.com/argoproj/argo-cd/v2/reposerver/repository.(*Service).GenerateManifest(0xc000b16640, {0x50a8638?, 0xc006d60810?}, 0x5068e90?)
/go/src/github.com/argoproj/argo-cd/reposerver/repository/repository.go:561 +0x45e
github.com/argoproj/argo-cd/v2/reposerver/apiclient._RepoServerService_GenerateManifest_Handler.func1({0x50a8638, 0xc006d60810}, {0x3c853a0?, 0xc008f13a20})
/go/src/github.com/argoproj/argo-cd/reposerver/apiclient/repository.pb.go:2625 +0x72
github.com/argoproj/argo-cd/v2/reposerver.NewServer.ErrorSanitizerUnaryServerInterceptor.func3({0x50a8638, 0xc006d60720}, {0x3c853a0, 0xc008f13a20}, 0xc006d607e0?, 0xc0013a0a50)
/go/src/github.com/argoproj/argo-cd/util/grpc/sanitizer.go:25 +0x7d
github.com/argoproj/argo-cd/v2/reposerver.NewServer.ChainUnaryServer.func5.1({0x50a8638?, 0xc006d60720?}, {0x3c853a0?, 0xc008f13a20?})
/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:48 +0x45
github.com/argoproj/argo-cd/v2/reposerver.NewServer.PanicLoggerUnaryServerInterceptor.func2({0x50a8638?, 0xc006d60720?}, {0x3c853a0?, 0xc008f13a20?}, 0xc000c105c8?, 0xc000c105d8?)
/go/src/github.com/argoproj/argo-cd/util/grpc/grpc.go:31 +0x8c
github.com/argoproj/argo-cd/v2/reposerver.NewServer.ChainUnaryServer.func5.1({0x50a8638?, 0xc006d60720?}, {0x3c853a0?, 0xc008f13a20?})
/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:48 +0x45
github.com/grpc-ecosystem/go-grpc-prometheus.init.(*ServerMetrics).UnaryServerInterceptor.func3({0x50a8638, 0xc006d60720}, {0x3c853a0, 0xc008f13a20}, 0x0?, 0xc0022aa200)
/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/server_metrics.go:107 +0x7d
github.com/argoproj/argo-cd/v2/reposerver.NewServer.ChainUnaryServer.func5.1({0x50a8638?, 0xc006d60720?}, {0x3c853a0?, 0xc008f13a20?})
/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:48 +0x45
github.com/grpc-ecosystem/go-grpc-middleware/logging/logrus.UnaryServerInterceptor.func1({0x50a8638, 0xc006d60630}, {0x3c853a0, 0xc008f13a20}, 0xc004bfabc0, 0xc0022aa240)
/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/logging/logrus/server_interceptors.go:31 +0xf6
github.com/argoproj/argo-cd/v2/reposerver.NewServer.ChainUnaryServer.func5.1({0x50a8638?, 0xc006d60630?}, {0x3c853a0?, 0xc008f13a20?})
/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:48 +0x45
go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.UnaryServerInterceptor.func1({0x50a8638, 0xc006d60510}, {0x3c853a0, 0xc008f13a20}, 0xc004bfabc0, 0xc0022aa280)
/go/pkg/mod/go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/[email protected]/interceptor.go:344 +0x52b
github.com/argoproj/argo-cd/v2/reposerver.NewServer.ChainUnaryServer.func5({0x50a8638, 0xc006d60510}, {0x3c853a0, 0xc008f13a20}, 0xc004bfabc0, 0x3658640?)
/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:53 +0x135
github.com/argoproj/argo-cd/v2/reposerver/apiclient._RepoServerService_GenerateManifest_Handler({0x3c27b80?, 0xc000b16640}, {0x50a8638, 0xc006d60510}, 0xc00da00a80, 0xc000b478c0)
/go/src/github.com/argoproj/argo-cd/reposerver/apiclient/repository.pb.go:2627 +0x135
google.golang.org/grpc.(*Server).processUnaryRPC(0xc00034ed20, {0x50a8638, 0xc006d60450}, {0x50b8f60, 0xc011d6b040}, 0xc0021bc120, 0xc00117e3f0, 0x738c820, 0x0)
/go/pkg/mod/google.golang.org/[email protected]/server.go:1343 +0xe03
google.golang.org/grpc.(*Server).handleStream(0xc00034ed20, {0x50b8f60, 0xc011d6b040}, 0xc0021bc120)
/go/pkg/mod/google.golang.org/[email protected]/server.go:1737 +0xc4c
google.golang.org/grpc.(*Server).serveStreams.func1.1()
/go/pkg/mod/google.golang.org/[email protected]/server.go:986 +0x86
created by google.golang.org/grpc.(*Server).serveStreams.func1 in goroutine 2369327544
/go/pkg/mod/google.golang.org/[email protected]/server.go:997 +0x145
However, I've identified one goroutine which held the lock and was executing a command: `git fetch origin --tags --force --prune`.
goroutine 2364632315 [chan receive, 45 minutes]:
github.com/argoproj/pkg/exec.RunCommandExt(0xc008f12160, {0x14f46b0400, 0x0, {0xf, 0x1}, 0x0, 0x0})
/go/pkg/mod/github.com/argoproj/[email protected]/exec/exec.go:139 +0xd7d
github.com/argoproj/argo-cd/v2/util/exec.RunWithExecRunOpts(0xc008f12160, {0x0?, {0x0?, 0x0?}, 0xe0?, 0x8b?})
/go/src/github.com/argoproj/argo-cd/util/exec/exec.go:59 +0x6d1
github.com/argoproj/argo-cd/v2/util/git.(*nativeGitClient).runCmdOutput(0xc005f6c380, 0xc008f12160, {0xd0?, 0xe7?})
/go/src/github.com/argoproj/argo-cd/util/git/client.go:754 +0x5f5
github.com/argoproj/argo-cd/v2/util/git.(*nativeGitClient).runCredentialedCmd(0xc005f6c380, {0xc0011ee7d0, 0x5, 0x5})
/go/src/github.com/argoproj/argo-cd/util/git/client.go:709 +0x405
github.com/argoproj/argo-cd/v2/util/git.(*nativeGitClient).fetch(0xc006f98de0?, {0x0?, 0x0?})
/go/src/github.com/argoproj/argo-cd/util/git/client.go:350 +0x192
github.com/argoproj/argo-cd/v2/util/git.(*nativeGitClient).Fetch(0xc005f6c380, {0x0?, 0x0?})
/go/src/github.com/argoproj/argo-cd/util/git/client.go:362 +0x9b
github.com/argoproj/argo-cd/v2/reposerver/repository.checkoutRevision({0x50c16e0, 0xc005f6c380}, {0xc0082cb8c0, 0x28}, 0xc0?)
/go/src/github.com/argoproj/argo-cd/reposerver/repository/repository.go:2411 +0xa9
github.com/argoproj/argo-cd/v2/reposerver/repository.(*Service).checkoutRevision(0xc000b16640, {0x50c16e0, 0xc005f6c380}, {0xc0082cb8c0, 0x28}, 0x1?)
/go/src/github.com/argoproj/argo-cd/reposerver/repository/repository.go:2401 +0x6b
github.com/argoproj/argo-cd/v2/reposerver/repository.(*Service).runRepoOperation.func2()
/go/src/github.com/argoproj/argo-cd/reposerver/repository/repository.go:379 +0x2f
github.com/argoproj/argo-cd/v2/reposerver/repository.(*repositoryLock).Lock(0xc000528560, {0xc000b10c40, 0x36}, {0xc0082cb8c0, 0x28}, 0x1, 0xc0011eee68)
/go/src/github.com/argoproj/argo-cd/reposerver/repository/lock.go:55 +0x2eb
github.com/argoproj/argo-cd/v2/reposerver/repository.(*Service).runRepoOperation(0xc000b16640, {0x50a8638, 0xc00b0916e0}, {0xc0076a3d38, 0x6}, 0xc0064c97a0, 0xc005f6c000, 0x0, 0xc0011ef1c8, 0xc0011ef208, ...)
/go/src/github.com/argoproj/argo-cd/reposerver/repository/repository.go:378 +0xbe8
github.com/argoproj/argo-cd/v2/reposerver/repository.(*Service).GenerateManifest(0xc000b16640, {0x50a8638?, 0xc00b0916e0?}, 0x5068e90?)
/go/src/github.com/argoproj/argo-cd/reposerver/repository/repository.go:561 +0x45e
github.com/argoproj/argo-cd/v2/reposerver/apiclient._RepoServerService_GenerateManifest_Handler.func1({0x50a8638, 0xc00b0916e0}, {0x3c853a0?, 0xc0015d2b00})
/go/src/github.com/argoproj/argo-cd/reposerver/apiclient/repository.pb.go:2625 +0x72
github.com/argoproj/argo-cd/v2/reposerver.NewServer.ErrorSanitizerUnaryServerInterceptor.func3({0x50a8638, 0xc00b0915f0}, {0x3c853a0, 0xc0015d2b00}, 0xc00b0916b0?, 0xc001826978)
/go/src/github.com/argoproj/argo-cd/util/grpc/sanitizer.go:25 +0x7d
github.com/argoproj/argo-cd/v2/reposerver.NewServer.ChainUnaryServer.func5.1({0x50a8638?, 0xc00b0915f0?}, {0x3c853a0?, 0xc0015d2b00?})
/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:48 +0x45
github.com/argoproj/argo-cd/v2/reposerver.NewServer.PanicLoggerUnaryServerInterceptor.func2({0x50a8638?, 0xc00b0915f0?}, {0x3c853a0?, 0xc0015d2b00?}, 0xc000c105c8?, 0xc000c105d8?)
/go/src/github.com/argoproj/argo-cd/util/grpc/grpc.go:31 +0x8c
github.com/argoproj/argo-cd/v2/reposerver.NewServer.ChainUnaryServer.func5.1({0x50a8638?, 0xc00b0915f0?}, {0x3c853a0?, 0xc0015d2b00?})
/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:48 +0x45
github.com/grpc-ecosystem/go-grpc-prometheus.init.(*ServerMetrics).UnaryServerInterceptor.func3({0x50a8638, 0xc00b0915f0}, {0x3c853a0, 0xc0015d2b00}, 0x0?, 0xc00ca65400)
/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/server_metrics.go:107 +0x7d
github.com/argoproj/argo-cd/v2/reposerver.NewServer.ChainUnaryServer.func5.1({0x50a8638?, 0xc00b0915f0?}, {0x3c853a0?, 0xc0015d2b00?})
/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:48 +0x45
github.com/grpc-ecosystem/go-grpc-middleware/logging/logrus.UnaryServerInterceptor.func1({0x50a8638, 0xc00b091500}, {0x3c853a0, 0xc0015d2b00}, 0xc005500860, 0xc00ca65440)
/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/logging/logrus/server_interceptors.go:31 +0xf6
github.com/argoproj/argo-cd/v2/reposerver.NewServer.ChainUnaryServer.func5.1({0x50a8638?, 0xc00b091500?}, {0x3c853a0?, 0xc0015d2b00?})
/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:48 +0x45
go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.UnaryServerInterceptor.func1({0x50a8638, 0xc008d9c690}, {0x3c853a0, 0xc0015d2b00}, 0xc005500860, 0xc00ca65480)
/go/pkg/mod/go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/[email protected]/interceptor.go:344 +0x52b
github.com/argoproj/argo-cd/v2/reposerver.NewServer.ChainUnaryServer.func5({0x50a8638, 0xc008d9c690}, {0x3c853a0, 0xc0015d2b00}, 0xc005500860, 0x3658640?)
/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:53 +0x135
github.com/argoproj/argo-cd/v2/reposerver/apiclient._RepoServerService_GenerateManifest_Handler({0x3c27b80?, 0xc000b16640}, {0x50a8638, 0xc008d9c690}, 0xc000a96b00, 0xc000b478c0)
/go/src/github.com/argoproj/argo-cd/reposerver/apiclient/repository.pb.go:2627 +0x135
google.golang.org/grpc.(*Server).processUnaryRPC(0xc00034ed20, {0x50a8638, 0xc008d9c5d0}, {0x50b8f60, 0xc0021c8d00}, 0xc0064c9680, 0xc00117e3f0, 0x738c820, 0x0)
/go/pkg/mod/google.golang.org/[email protected]/server.go:1343 +0xe03
google.golang.org/grpc.(*Server).handleStream(0xc00034ed20, {0x50b8f60, 0xc0021c8d00}, 0xc0064c9680)
/go/pkg/mod/google.golang.org/[email protected]/server.go:1737 +0xc4c
google.golang.org/grpc.(*Server).serveStreams.func1.1()
/go/pkg/mod/google.golang.org/[email protected]/server.go:986 +0x86
created by google.golang.org/grpc.(*Server).serveStreams.func1 in goroutine 2364632314
/go/pkg/mod/google.golang.org/[email protected]/server.go:997 +0x145
After a bit more digging, I can see that we are indeed waiting for the git command to exit (https://github.com/argoproj/argo-cd/blob/v2.10.6/util/git/client.go#L746-L754). I'm guessing that either the git command doesn't respect the SIGTERM signal, or it exited before we started waiting for it to finish (if that's even possible).
Either way, because the goroutine which holds the lock on the repo & revision is waiting for the git command to finish, no other goroutines/threads can proceed (which is why we see reconciliation drop to zero).
While 45 minutes is an extremely long time for our git fetch operation to take, I'd also expect the command to be aborted after some configurable timeout - but I'm failing to find that option (we thought it might be the reconcile timeout, but that has a default of 3 minutes...).
Since there is no timeout being set (it's using the default timeout of 0), I don't think the signal ever gets sent to kill the process if the git command hangs.
EDIT: Never mind, I see the stack trace shows it is stuck waiting for the command to finish.
Yeah, I believe we're using the default 90s timeout, as we're not setting the environment variable ARGOCD_EXEC_TIMEOUT (ref). So the SIGTERM signal has presumably been sent, but git fetch still takes ages to terminate.
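To make the failure mode concrete, here is a minimal Go sketch of this soft-timeout pattern, modelled on the `chan receive` frame in the trace above (illustrative names and structure, not the actual argoproj/pkg code):

```go
package main

import (
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

// runWithSoftTimeout mimics the shape of the timeout handling we believe
// is involved: on timeout only a SIGTERM is sent, and the goroutine then
// goes straight back to blocking on the done channel.
func runWithSoftTimeout(timeout time.Duration, name string, args ...string) error {
	cmd := exec.Command(name, args...)
	if err := cmd.Start(); err != nil {
		return err
	}

	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	select {
	case err := <-done:
		return err
	case <-time.After(timeout):
		// Politely ask the process to stop; there is no SIGKILL escalation.
		_ = cmd.Process.Signal(syscall.SIGTERM)
		// If the child ignores SIGTERM, this receive blocks indefinitely,
		// matching the goroutine state we captured: [chan receive, 45 minutes].
		<-done
		return fmt.Errorf("`%s` failed timeout after %v", name, timeout)
	}
}

func main() {
	// A child that ignores SIGTERM keeps this call blocked well past
	// the 2-second "timeout".
	err := runWithSoftTimeout(2*time.Second, "sh", "-c", "trap '' TERM; sleep 10")
	fmt.Println(err)
}
```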
Interestingly, I've found a similar issue where git fetch wasn't respecting an interrupt signal. Instead, after some time, a kill signal was sent as a fallback.
https://github.com/golang/go/issues/38887 / https://go-review.googlesource.com/c/build/+/325771
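For what it's worth, modern Go can express that interrupt-then-kill fallback directly. A hedged sketch (not Argo CD's current behaviour) using exec.Cmd's Cancel and WaitDelay fields from Go 1.20+:

```go
package main

import (
	"context"
	"os/exec"
	"syscall"
	"time"
)

// runWithHardTimeout sends SIGTERM when the context deadline expires and
// escalates to SIGKILL if the process hasn't exited after a grace period,
// mirroring the fallback discussed in golang/go#38887. Sketch only.
func runWithHardTimeout(timeout, grace time.Duration, name string, args ...string) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	cmd := exec.CommandContext(ctx, name, args...)
	// On context expiry, try a polite SIGTERM first...
	cmd.Cancel = func() error {
		return cmd.Process.Signal(syscall.SIGTERM)
	}
	// ...then let the runtime SIGKILL the process if it is still running
	// after the grace period (Cmd.WaitDelay requires Go 1.20+).
	cmd.WaitDelay = grace

	return cmd.Run()
}

func main() {
	// A stuck fetch would be bounded at timeout+grace instead of hanging.
	_ = runWithHardTimeout(90*time.Second, 10*time.Second,
		"git", "fetch", "origin", "--tags", "--force", "--prune")
}
```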
We're still experiencing this issue quite frequently. When the issue recovers without any intervention (i.e. without restarting repo server pods), we do see a log entry with `failed timeout after 1m30s`; however, the grpc.time_ms is always ~2hrs 10mins (7.940276e+06 ms ≈ 132 minutes):
time="2024-05-03T05:34:31Z" level=error msg="finished unary call with code Internal" error="rpc error: code = Internal desc = unable to checkout git repo [[email protected]](mailto:[email protected]):ORG/repo.git with revision 5b0302498db5fc706b20f6247fc9ffb40563696f: failed to initialize repository resources: rpc error: code = Internal desc = Failed to fetch default: `git fetch origin --tags --force --prune` failed timeout after 1m30s" grpc.code=Internal grpc.method=UpdateRevisionForPaths grpc.service=repository.RepoServerService grpc.start_time="2024-05-03T03:22:11Z" grpc.time_ms=7.940276e+06 span.kind=server system=grpc
Without being able to find out the underlying reason for the git fetch delay, does anyone have any suggestions as to how we could force it to terminate?
Closing this issue: thanks to @xamroc's comments in https://github.com/argoproj/argo-cd/issues/18011, it does appear to be related to enabling the network policy agent in the AWS CNI. It's unclear yet why switching network policy engines caused issues, given the NetworkPolicy resources did not change, but closing as it is not an ArgoCD issue.
I've been experiencing this issue for a long time while using Cilium as my CNI, and I don't have any network policies. Does anyone have any other advice on how to solve it?