
Application reconciliation drops to zero (v2.10.6)

Open paulparke opened this issue 2 years ago • 11 comments

Checklist:

  • [x] I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • [ ] I've included steps to reproduce the bug.
  • [x] I've pasted the output of argocd version.

Describe the bug

We have experienced issues recently where the application reconciliation count drops to zero; it initially recovered after a restart of the application controller.

[screenshot]

To Reproduce

We observed high CPU in both the controller and repository servers. We are running approximately 1,250 applications across 11 clusters; the majority of applications are synced from a monorepo.

In an attempt to resolve the issue, we first upgraded to 2.10.6. While trying to reduce the CPU utilisation, we found a large number of msg="Requesting app refresh caused by object update" logs and referred to the docs to implement ignoreResourceUpdates.
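For anyone following along, the pattern for this in the docs looks roughly like the below (a sketch; the exact customizations depend on which fields are churning in your clusters):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  # Ignore status-only changes on all resource kinds when deciding
  # whether an object update should trigger an app refresh
  resource.customizations.ignoreResourceUpdates.all: |
    jsonPointers:
      - /status
```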

We amended the controller config to increase the status and operation processors and the repo server timeout (ref: https://argo-cd.readthedocs.io/en/stable/operator-manual/high_availability/#argocd-application-controller):

  controller.status.processors: "50"
  controller.operation.processors: "25"
  controller.repo.server.timeout.seconds: "120"

We also discovered a lot of Failed to watch warnings in the logs; referring to a previous issue, we removed the redundant CRDs.

Whilst these changes have had a positive impact on the controller CPU and the number of reconciliations (applied ~12:20), we still observed issues later in the day (~17:50) with reconciliation dropping to zero; a controller restart resolved the issue:

[screenshot]

We are now seeing the issue more frequently, and the main symptom appears to be in the repository server, where we see goroutines become stuck:

[screenshot]

Traces of the repository server show that the duration of repository.RepoServerService/GenerateManifest calls is increasing significantly, from millisecond responses to, in extreme cases, 26 minutes:

[screenshots]

In more recent incidents, it has required a restart of the repo server to recover the service.

Do you have any suggestions?

Version

argocd: v2.3.4+ac8b7df.dirty
  BuildDate: 2022-05-22T11:51:06Z
  GitCommit: ac8b7df9467ffcc0920b826c62c4b603a7bfed24
  GitTreeState: dirty
  GoVersion: go1.18.2
  Compiler: gc
  Platform: darwin/arm64
argocd-server: v2.10.6+d504d2b

paulparke avatar Apr 12 '24 11:04 paulparke

Hm.... lots going on. It does sound like the repo server is getting bogged down. How are metrics looking on the repo server side? Hitting CPU throttling maybe?

crenshaw-dev avatar Apr 12 '24 14:04 crenshaw-dev

Hi! Are you using a CMP plugin?

jsolana avatar Apr 13 '24 17:04 jsolana

@crenshaw-dev we don't have limits set on the repo server. We've updated the config so the repo server runs on its own nodes, as we do see it consuming 100% of the node CPU. The below shows the behaviour during an incident where goroutine usage on one repo server increases and reconciliations drop to zero; a restart of the repo server at ~6pm resolved it. [screenshots]

paulparke avatar Apr 15 '24 07:04 paulparke

@jsolana we're not using CMP.

paulparke avatar Apr 15 '24 07:04 paulparke

we do see it consuming 100% of the node CPU

Does this mean that the repo server is likely facing CPU throttling during the event you screenshot above?

crenshaw-dev avatar Apr 17 '24 01:04 crenshaw-dev

@crenshaw-dev it may be, briefly, yes. We observed the issue again a couple of times yesterday, and we see a CPU spike on all 3 repo server pods:

[screenshot]

On one of the pods, goroutine usage remains high:

[screenshot]

This is the CPU profile of the 3 nodes running the repo server pods: [screenshot]

In this instance, no manual action was taken and the service recovered (no automated pod restarts either). In previous instances of the issue, restarting just the repo server pods resolved it.

paulparke avatar Apr 17 '24 08:04 paulparke

Hello! I've been working with @paulparke on this. We've managed to capture goroutine profiles from the /debug/pprof endpoint during an incident. Upon analysis, it seems that the majority of goroutines (150, in this particular case) were waiting to acquire the lock (https://github.com/argoproj/argo-cd/blob/v2.10.6/reposerver/repository/lock.go#L52). One example:

goroutine 2369328420 [sync.Mutex.Lock, 46 minutes]:
sync.runtime_SemacquireMutex(0x7fed7642fe30?, 0x18?, 0x8?)
	/usr/local/go/src/runtime/sema.go:77 +0x25
sync.(*Mutex).lockSlow(0xc000d177d8)
	/usr/local/go/src/sync/mutex.go:171 +0x15d
sync.(*Mutex).Lock(0x3614d00?)
	/usr/local/go/src/sync/mutex.go:90 +0x32
github.com/argoproj/argo-cd/v2/reposerver/repository.(*repositoryLock).Lock(0xc000528560, {0xc000b10c40, 0x36}, {0xc007300030, 0x28}, 0x1, 0xc007978e68)
	/go/src/github.com/argoproj/argo-cd/reposerver/repository/lock.go:52 +0x22f
github.com/argoproj/argo-cd/v2/reposerver/repository.(*Service).runRepoOperation(0xc000b16640, {0x50a8638, 0xc006d60810}, {0xc00b062d38, 0x6}, 0xc0021bc240, 0xc002a30700, 0x0, 0xc0079791c8, 0xc007979208, ...)
	/go/src/github.com/argoproj/argo-cd/reposerver/repository/repository.go:378 +0xbe8
github.com/argoproj/argo-cd/v2/reposerver/repository.(*Service).GenerateManifest(0xc000b16640, {0x50a8638?, 0xc006d60810?}, 0x5068e90?)
	/go/src/github.com/argoproj/argo-cd/reposerver/repository/repository.go:561 +0x45e
github.com/argoproj/argo-cd/v2/reposerver/apiclient._RepoServerService_GenerateManifest_Handler.func1({0x50a8638, 0xc006d60810}, {0x3c853a0?, 0xc008f13a20})
	/go/src/github.com/argoproj/argo-cd/reposerver/apiclient/repository.pb.go:2625 +0x72
github.com/argoproj/argo-cd/v2/reposerver.NewServer.ErrorSanitizerUnaryServerInterceptor.func3({0x50a8638, 0xc006d60720}, {0x3c853a0, 0xc008f13a20}, 0xc006d607e0?, 0xc0013a0a50)
	/go/src/github.com/argoproj/argo-cd/util/grpc/sanitizer.go:25 +0x7d
github.com/argoproj/argo-cd/v2/reposerver.NewServer.ChainUnaryServer.func5.1({0x50a8638?, 0xc006d60720?}, {0x3c853a0?, 0xc008f13a20?})
	/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:48 +0x45
github.com/argoproj/argo-cd/v2/reposerver.NewServer.PanicLoggerUnaryServerInterceptor.func2({0x50a8638?, 0xc006d60720?}, {0x3c853a0?, 0xc008f13a20?}, 0xc000c105c8?, 0xc000c105d8?)
	/go/src/github.com/argoproj/argo-cd/util/grpc/grpc.go:31 +0x8c
github.com/argoproj/argo-cd/v2/reposerver.NewServer.ChainUnaryServer.func5.1({0x50a8638?, 0xc006d60720?}, {0x3c853a0?, 0xc008f13a20?})
	/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:48 +0x45
github.com/grpc-ecosystem/go-grpc-prometheus.init.(*ServerMetrics).UnaryServerInterceptor.func3({0x50a8638, 0xc006d60720}, {0x3c853a0, 0xc008f13a20}, 0x0?, 0xc0022aa200)
	/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/server_metrics.go:107 +0x7d
github.com/argoproj/argo-cd/v2/reposerver.NewServer.ChainUnaryServer.func5.1({0x50a8638?, 0xc006d60720?}, {0x3c853a0?, 0xc008f13a20?})
	/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:48 +0x45
github.com/grpc-ecosystem/go-grpc-middleware/logging/logrus.UnaryServerInterceptor.func1({0x50a8638, 0xc006d60630}, {0x3c853a0, 0xc008f13a20}, 0xc004bfabc0, 0xc0022aa240)
	/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/logging/logrus/server_interceptors.go:31 +0xf6
github.com/argoproj/argo-cd/v2/reposerver.NewServer.ChainUnaryServer.func5.1({0x50a8638?, 0xc006d60630?}, {0x3c853a0?, 0xc008f13a20?})
	/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:48 +0x45
go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.UnaryServerInterceptor.func1({0x50a8638, 0xc006d60510}, {0x3c853a0, 0xc008f13a20}, 0xc004bfabc0, 0xc0022aa280)
	/go/pkg/mod/go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/[email protected]/interceptor.go:344 +0x52b
github.com/argoproj/argo-cd/v2/reposerver.NewServer.ChainUnaryServer.func5({0x50a8638, 0xc006d60510}, {0x3c853a0, 0xc008f13a20}, 0xc004bfabc0, 0x3658640?)
	/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:53 +0x135
github.com/argoproj/argo-cd/v2/reposerver/apiclient._RepoServerService_GenerateManifest_Handler({0x3c27b80?, 0xc000b16640}, {0x50a8638, 0xc006d60510}, 0xc00da00a80, 0xc000b478c0)
	/go/src/github.com/argoproj/argo-cd/reposerver/apiclient/repository.pb.go:2627 +0x135
google.golang.org/grpc.(*Server).processUnaryRPC(0xc00034ed20, {0x50a8638, 0xc006d60450}, {0x50b8f60, 0xc011d6b040}, 0xc0021bc120, 0xc00117e3f0, 0x738c820, 0x0)
	/go/pkg/mod/google.golang.org/[email protected]/server.go:1343 +0xe03
google.golang.org/grpc.(*Server).handleStream(0xc00034ed20, {0x50b8f60, 0xc011d6b040}, 0xc0021bc120)
	/go/pkg/mod/google.golang.org/[email protected]/server.go:1737 +0xc4c
google.golang.org/grpc.(*Server).serveStreams.func1.1()
	/go/pkg/mod/google.golang.org/[email protected]/server.go:986 +0x86
created by google.golang.org/grpc.(*Server).serveStreams.func1 in goroutine 2369327544
	/go/pkg/mod/google.golang.org/[email protected]/server.go:997 +0x145

However, I identified one goroutine which held the lock and was executing a command: git fetch origin --tags --force --prune.

goroutine 2364632315 [chan receive, 45 minutes]:
github.com/argoproj/pkg/exec.RunCommandExt(0xc008f12160, {0x14f46b0400, 0x0, {0xf, 0x1}, 0x0, 0x0})
	/go/pkg/mod/github.com/argoproj/[email protected]/exec/exec.go:139 +0xd7d
github.com/argoproj/argo-cd/v2/util/exec.RunWithExecRunOpts(0xc008f12160, {0x0?, {0x0?, 0x0?}, 0xe0?, 0x8b?})
	/go/src/github.com/argoproj/argo-cd/util/exec/exec.go:59 +0x6d1
github.com/argoproj/argo-cd/v2/util/git.(*nativeGitClient).runCmdOutput(0xc005f6c380, 0xc008f12160, {0xd0?, 0xe7?})
	/go/src/github.com/argoproj/argo-cd/util/git/client.go:754 +0x5f5
github.com/argoproj/argo-cd/v2/util/git.(*nativeGitClient).runCredentialedCmd(0xc005f6c380, {0xc0011ee7d0, 0x5, 0x5})
	/go/src/github.com/argoproj/argo-cd/util/git/client.go:709 +0x405
github.com/argoproj/argo-cd/v2/util/git.(*nativeGitClient).fetch(0xc006f98de0?, {0x0?, 0x0?})
	/go/src/github.com/argoproj/argo-cd/util/git/client.go:350 +0x192
github.com/argoproj/argo-cd/v2/util/git.(*nativeGitClient).Fetch(0xc005f6c380, {0x0?, 0x0?})
	/go/src/github.com/argoproj/argo-cd/util/git/client.go:362 +0x9b
github.com/argoproj/argo-cd/v2/reposerver/repository.checkoutRevision({0x50c16e0, 0xc005f6c380}, {0xc0082cb8c0, 0x28}, 0xc0?)
	/go/src/github.com/argoproj/argo-cd/reposerver/repository/repository.go:2411 +0xa9
github.com/argoproj/argo-cd/v2/reposerver/repository.(*Service).checkoutRevision(0xc000b16640, {0x50c16e0, 0xc005f6c380}, {0xc0082cb8c0, 0x28}, 0x1?)
	/go/src/github.com/argoproj/argo-cd/reposerver/repository/repository.go:2401 +0x6b
github.com/argoproj/argo-cd/v2/reposerver/repository.(*Service).runRepoOperation.func2()
	/go/src/github.com/argoproj/argo-cd/reposerver/repository/repository.go:379 +0x2f
github.com/argoproj/argo-cd/v2/reposerver/repository.(*repositoryLock).Lock(0xc000528560, {0xc000b10c40, 0x36}, {0xc0082cb8c0, 0x28}, 0x1, 0xc0011eee68)
	/go/src/github.com/argoproj/argo-cd/reposerver/repository/lock.go:55 +0x2eb
github.com/argoproj/argo-cd/v2/reposerver/repository.(*Service).runRepoOperation(0xc000b16640, {0x50a8638, 0xc00b0916e0}, {0xc0076a3d38, 0x6}, 0xc0064c97a0, 0xc005f6c000, 0x0, 0xc0011ef1c8, 0xc0011ef208, ...)
	/go/src/github.com/argoproj/argo-cd/reposerver/repository/repository.go:378 +0xbe8
github.com/argoproj/argo-cd/v2/reposerver/repository.(*Service).GenerateManifest(0xc000b16640, {0x50a8638?, 0xc00b0916e0?}, 0x5068e90?)
	/go/src/github.com/argoproj/argo-cd/reposerver/repository/repository.go:561 +0x45e
github.com/argoproj/argo-cd/v2/reposerver/apiclient._RepoServerService_GenerateManifest_Handler.func1({0x50a8638, 0xc00b0916e0}, {0x3c853a0?, 0xc0015d2b00})
	/go/src/github.com/argoproj/argo-cd/reposerver/apiclient/repository.pb.go:2625 +0x72
github.com/argoproj/argo-cd/v2/reposerver.NewServer.ErrorSanitizerUnaryServerInterceptor.func3({0x50a8638, 0xc00b0915f0}, {0x3c853a0, 0xc0015d2b00}, 0xc00b0916b0?, 0xc001826978)
	/go/src/github.com/argoproj/argo-cd/util/grpc/sanitizer.go:25 +0x7d
github.com/argoproj/argo-cd/v2/reposerver.NewServer.ChainUnaryServer.func5.1({0x50a8638?, 0xc00b0915f0?}, {0x3c853a0?, 0xc0015d2b00?})
	/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:48 +0x45
github.com/argoproj/argo-cd/v2/reposerver.NewServer.PanicLoggerUnaryServerInterceptor.func2({0x50a8638?, 0xc00b0915f0?}, {0x3c853a0?, 0xc0015d2b00?}, 0xc000c105c8?, 0xc000c105d8?)
	/go/src/github.com/argoproj/argo-cd/util/grpc/grpc.go:31 +0x8c
github.com/argoproj/argo-cd/v2/reposerver.NewServer.ChainUnaryServer.func5.1({0x50a8638?, 0xc00b0915f0?}, {0x3c853a0?, 0xc0015d2b00?})
	/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:48 +0x45
github.com/grpc-ecosystem/go-grpc-prometheus.init.(*ServerMetrics).UnaryServerInterceptor.func3({0x50a8638, 0xc00b0915f0}, {0x3c853a0, 0xc0015d2b00}, 0x0?, 0xc00ca65400)
	/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/server_metrics.go:107 +0x7d
github.com/argoproj/argo-cd/v2/reposerver.NewServer.ChainUnaryServer.func5.1({0x50a8638?, 0xc00b0915f0?}, {0x3c853a0?, 0xc0015d2b00?})
	/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:48 +0x45
github.com/grpc-ecosystem/go-grpc-middleware/logging/logrus.UnaryServerInterceptor.func1({0x50a8638, 0xc00b091500}, {0x3c853a0, 0xc0015d2b00}, 0xc005500860, 0xc00ca65440)
	/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/logging/logrus/server_interceptors.go:31 +0xf6
github.com/argoproj/argo-cd/v2/reposerver.NewServer.ChainUnaryServer.func5.1({0x50a8638?, 0xc00b091500?}, {0x3c853a0?, 0xc0015d2b00?})
	/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:48 +0x45
go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.UnaryServerInterceptor.func1({0x50a8638, 0xc008d9c690}, {0x3c853a0, 0xc0015d2b00}, 0xc005500860, 0xc00ca65480)
	/go/pkg/mod/go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/[email protected]/interceptor.go:344 +0x52b
github.com/argoproj/argo-cd/v2/reposerver.NewServer.ChainUnaryServer.func5({0x50a8638, 0xc008d9c690}, {0x3c853a0, 0xc0015d2b00}, 0xc005500860, 0x3658640?)
	/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:53 +0x135
github.com/argoproj/argo-cd/v2/reposerver/apiclient._RepoServerService_GenerateManifest_Handler({0x3c27b80?, 0xc000b16640}, {0x50a8638, 0xc008d9c690}, 0xc000a96b00, 0xc000b478c0)
	/go/src/github.com/argoproj/argo-cd/reposerver/apiclient/repository.pb.go:2627 +0x135
google.golang.org/grpc.(*Server).processUnaryRPC(0xc00034ed20, {0x50a8638, 0xc008d9c5d0}, {0x50b8f60, 0xc0021c8d00}, 0xc0064c9680, 0xc00117e3f0, 0x738c820, 0x0)
	/go/pkg/mod/google.golang.org/[email protected]/server.go:1343 +0xe03
google.golang.org/grpc.(*Server).handleStream(0xc00034ed20, {0x50b8f60, 0xc0021c8d00}, 0xc0064c9680)
	/go/pkg/mod/google.golang.org/[email protected]/server.go:1737 +0xc4c
google.golang.org/grpc.(*Server).serveStreams.func1.1()
	/go/pkg/mod/google.golang.org/[email protected]/server.go:986 +0x86
created by google.golang.org/grpc.(*Server).serveStreams.func1 in goroutine 2364632314
	/go/pkg/mod/google.golang.org/[email protected]/server.go:997 +0x145

After a bit more digging, I could see that we are indeed waiting for the git command to exit (https://github.com/argoproj/argo-cd/blob/v2.10.6/util/git/client.go#L746-L754). I'm guessing that either the git command doesn't respect the SIGTERM signal, or it exited before we started waiting for it to finish (if that's even possible).

Either way, because the goroutine which holds the lock on the repo & revision is waiting for the git command to finish, no other goroutines can make progress on that repository (resulting in the reconciliation count dropping to zero).

While 45 minutes is an extremely long time for our git fetch operation to take, I'd also expect the command to be aborted after some configurable timeout - but I'm failing to find that option (we thought it might be the reconcile timeout, but that has a default of 3 minutes...).

caveman280 avatar Apr 25 '24 13:04 caveman280

Since there is no timeout being set (it's using the default timeout of 0), I don't think the signal ever gets sent to kill the process if the git command hangs.

EDIT: Nevermind, I see the stack trace showing it is stuck on waiting for the command to finish

rumstead avatar Apr 25 '24 14:04 rumstead

Yeah, I believe that we're using the default, 90s timeout as we're not setting the environment variable ARGOCD_EXEC_TIMEOUT (ref). So I believe that the SIGTERM signal has been sent but git fetch still takes ages to terminate.
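If we did want to raise it, I believe it's just an env var on the argocd-repo-server container, something like this (an untested sketch):

```yaml
# argocd-repo-server Deployment, container spec (sketch)
env:
  - name: ARGOCD_EXEC_TIMEOUT
    value: "5m"  # Go duration string; defaults to 90s when unset
```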

caveman280 avatar Apr 29 '24 09:04 caveman280

Interestingly, I've found a similar issue where git fetch wasn't respecting an interrupt signal. Instead, after some time, a kill signal was sent as a fallback.

https://github.com/golang/go/issues/38887 / https://go-review.googlesource.com/c/build/+/325771

caveman280 avatar Apr 29 '24 10:04 caveman280

We're still experiencing this issue quite frequently. When it recovers without us restarting the repo server pods, we see a log entry with failed timeout after 1m30s; however, the grpc.time_ms is always ~2hrs 10 mins:

time="2024-05-03T05:34:31Z" level=error msg="finished unary call with code Internal" error="rpc error: code = Internal desc = unable to checkout git repo [email protected]:ORG/repo.git with revision 5b0302498db5fc706b20f6247fc9ffb40563696f: failed to initialize repository resources: rpc error: code = Internal desc = Failed to fetch default: `git fetch origin --tags --force --prune` failed timeout after 1m30s" grpc.code=Internal grpc.method=UpdateRevisionForPaths grpc.service=repository.RepoServerService grpc.start_time="2024-05-03T03:22:11Z" grpc.time_ms=7.940276e+06 span.kind=server system=grpc

Without being able to find out the underlying reason for the git fetch delay, does anyone have any suggestions as to how we could force it to terminate?

paulparke avatar May 03 '24 10:05 paulparke

Closing this issue: thanks to @xamroc's comments in https://github.com/argoproj/argo-cd/issues/18011, it appears to be related to enabling the network policy agent in the AWS CNI. It's unclear why switching network policy engines caused issues, given that the NetworkPolicy resources did not change, but closing as it is not an Argo CD issue.

paulparke avatar May 24 '24 09:05 paulparke

I've been experiencing this issue for a long time while using Cilium as my CNI, and I don't have any network policies. Does anyone have any other advice on how to solve it?

johnniee avatar May 27 '24 20:05 johnniee