
High CPU consumption on control plane churn with large config files

Open corrieb opened this issue 5 years ago • 5 comments

This is intended as a writeup of some profiling work I did yesterday on dataplaneapi 2.1, after we observed high CPU consumption in our testing. It shouldn't necessarily be considered a bug, but I think there are some significant efficiency enhancements that could be made to improve scalability.

In the test scenario, we use dataplaneapi to make updates to a configuration file that is 1000-5000 lines long, owing to a significant number of backend and frontend definitions. Updates are made serially from a single service that queues them up. While making the updates, dataplaneapi consumed pretty much an entire vCPU of a 2 vCPU VM.

After some testing with pprof, it became clear that the high CPU consumption comes down to the amount of garbage being created by the Parser code in the haproxytech/config-parser package. The vast majority of CPU time is the garbage collector trying to keep up with all of the objects it needs to collect.
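For reference, the profile was gathered the standard Go way; a minimal sketch of wiring pprof into a Go binary looks roughly like this (the port is arbitrary and this is not dataplaneapi's actual instrumentation, just the usual pattern):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Expose profiling endpoints on a side port, then capture 30s of CPU data with:
	//   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
	// The resulting graph makes GC and allocation hot spots easy to see.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	select {} // stand-in for the real service's work
}
```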

Having dug into the code and done some further profiling, I think I understand why. Every time a request is sent to dataplaneapi, it reads the entire haproxy.cfg file. Each line of that file is processed by ProcessLine https://github.com/haproxytech/config-parser/blob/2dc92f0d9a6078abb6f0a6f8b5772af82f3af7ad/parser.go#L330, which iterates over every active Parser until it finds one that can process the line. Each Parser is a stateful object containing a number of stateful sub-parsers that know how to handle the clauses of the given statement. As an example, the backend parser https://github.com/haproxytech/config-parser/blob/2dc92f0d9a6078abb6f0a6f8b5772af82f3af7ad/section-parsers.go#L224 eagerly creates something like 50 stateful objects in order to handle every possible clause of a backend.
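To illustrate the shape of that loop and the eager allocation (a simplified sketch, not the actual config-parser code; the names below are illustrative):

```go
package sketch

// lineParser is an illustrative stand-in for the stateful parsers described above.
type lineParser interface {
	// Parse returns an error when this parser does not recognise the line,
	// so the caller falls through to the next candidate.
	Parse(line string) error
}

// processLine mirrors the pattern in parser.go: try every active parser in
// turn until one accepts the line.
func processLine(line string, active []lineParser) {
	for _, p := range active {
		if err := p.Parse(line); err == nil {
			return
		}
	}
}

// newBackendParsers mirrors the pattern in section-parsers.go: every time a
// "backend" section is encountered, ~50 stateful sub-parser objects are
// allocated up front, whether or not their clauses appear in that backend.
func newBackendParsers() []lineParser {
	parsers := make([]lineParser, 0, 50)
	// parsers = append(parsers, &Mode{}, &Balance{}, &OptionPgsqlCheck{}, ...)
	return parsers
}
```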

Processing a haproxy.cfg with 500 backend definitions ends up going around that active-parsers loop 140,000 times and creates a new Backend Parser, with all of its stateful sub-parsers, 500 times. Defining a new backend with 3 servers means 4 requests to dataplaneapi, each of which reads and re-parses the config file from scratch. The garbage collector has to work hard to keep up with all of this.
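A quick back-of-envelope using the figures above, assuming the per-read cost stays roughly constant across those 4 serial requests:

```go
package main

import "fmt"

func main() {
	const (
		loopIterationsPerRead = 140_000 // active-parser loop iterations per full read of haproxy.cfg
		backendParsersPerRead = 500     // backend parsers created per read, each with ~50 sub-parsers
		requests              = 4       // 1 backend + 3 servers, applied serially
	)
	fmt.Println("parser-loop iterations:", loopIterationsPerRead*requests)           // 560000
	fmt.Println("backend parser objects:", backendParsersPerRead*requests)           // 2000
	fmt.Println("sub-parser objects (~50 each):", backendParsersPerRead*requests*50) // ~100000
}
```

All of those objects become garbage as soon as each request completes, which is exactly the work the GC is then left to do.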

There's nothing fundamentally wrong with the code, and if all we ever expected to see was on the order of 10-50 backends, this would be fine. We do, however, expect to reach a higher scale than this, and we have to consider the impact of the control plane's CPU load on the HAProxy instance co-located with it.

As a workaround, we're going to try putting dataplaneapi in a cgroup and limiting its CPU consumption, but I feel there are caching or refactoring steps that could be added here to greatly improve the efficiency of the code at scale.
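As one illustration of the kind of caching I mean (purely a sketch of the idea, not a proposed patch; Parser and parseFile are stand-ins for the real config-parser types): re-parse only when haproxy.cfg actually changes on disk, so a burst of serial API calls reuses a single parse.

```go
package sketch

import (
	"os"
	"sync"
	"time"
)

// Parser is a stand-in for the parsed representation of haproxy.cfg;
// parseFile is a stand-in for the expensive full ProcessLine pass.
type Parser struct{ /* parsed sections ... */ }

func parseFile(path string) (*Parser, error) { return &Parser{}, nil }

// cachedConfig remembers the last parse and reuses it while the file on
// disk looks unchanged (same mtime and size).
type cachedConfig struct {
	mu      sync.Mutex
	modTime time.Time
	size    int64
	parsed  *Parser
}

func (c *cachedConfig) get(path string) (*Parser, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	info, err := os.Stat(path)
	if err != nil {
		return nil, err
	}
	if c.parsed != nil && info.ModTime().Equal(c.modTime) && info.Size() == c.size {
		return c.parsed, nil // unchanged: skip the re-parse entirely
	}
	p, err := parseFile(path)
	if err != nil {
		return nil, err
	}
	c.parsed, c.modTime, c.size = p, info.ModTime(), info.Size()
	return p, nil
}
```

Writes made through the API would of course have to invalidate or update the cache, but even a naive check like this would collapse the 4 re-parses in the example above into 1.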

corrieb · Aug 05 '20 18:08

(pprof CPU profile graphs attached: profile001, profile002)

corrieb · Aug 05 '20 18:08

Note that this will eventually cause an OOM error (see part of a stacktrace here where dataplaneapi blew up).

Aug 02 13:05:20 localhost haproxy[14818]: fatal error: runtime: out of memory
Aug 02 13:05:20 localhost haproxy[14818]: runtime stack:
Aug 02 13:05:20 localhost haproxy[14818]: runtime.throw(0xda18cd, 0x16)
Aug 02 13:05:20 localhost haproxy[14818]: /usr/local/go/src/runtime/panic.go:1114 +0x72
Aug 02 13:05:20 localhost haproxy[14818]: runtime.sysMap(0xc054000000, 0x4000000, 0x1567c98)
Aug 02 13:05:20 localhost haproxy[14818]: /usr/local/go/src/runtime/mem_linux.go:169 +0xc5
Aug 02 13:05:20 localhost haproxy[14818]: runtime.(*mheap).sysAlloc(0x1553340, 0x400000, 0xc004be79b0, 0xc00d485000)
Aug 02 13:05:20 localhost haproxy[14818]: /usr/local/go/src/runtime/malloc.go:715 +0x1cd
Aug 02 13:05:20 localhost haproxy[14818]: runtime.(*mheap).grow(0x1553340, 0x1, 0x0)
Aug 02 13:05:20 localhost haproxy[14818]: /usr/local/go/src/runtime/mheap.go:1286 +0x11c
Aug 02 13:05:20 localhost haproxy[14818]: runtime.(*mheap).allocSpan(0x1553340, 0x1, 0x800, 0x1567ca8, 0x1fe)
Aug 02 13:05:20 localhost haproxy[14818]: /usr/local/go/src/runtime/mheap.go:1124 +0x6a0
Aug 02 13:05:20 localhost haproxy[14818]: runtime.(*mheap).alloc.func1()
Aug 02 13:05:20 localhost haproxy[14818]: /usr/local/go/src/runtime/mheap.go:871 +0x64
Aug 02 13:05:20 localhost haproxy[14818]: runtime.systemstack(0x0)
Aug 02 13:05:20 localhost haproxy[14818]: /usr/local/go/src/runtime/asm_amd64.s:370 +0x66
Aug 02 13:05:20 localhost haproxy[14818]: runtime.mstart()
Aug 02 13:05:20 localhost haproxy[14818]: /usr/local/go/src/runtime/proc.go:1041
Aug 02 13:05:20 localhost haproxy[14818]: goroutine 1 [running]:
Aug 02 13:05:20 localhost haproxy[14818]: runtime.systemstack_switch()
Aug 02 13:05:20 localhost haproxy[14818]: /usr/local/go/src/runtime/asm_amd64.s:330 fp=0xc000edec30 sp=0xc000edec28 pc=0x460e20
Aug 02 13:05:20 localhost haproxy[14818]: runtime.(*mheap).alloc(0x1553340, 0x1, 0xc053fe0108, 0x8)
Aug 02 13:05:20 localhost haproxy[14818]: /usr/local/go/src/runtime/mheap.go:865 +0x81 fp=0xc000edec80 sp=0xc000edec30 pc=0x425461
Aug 02 13:05:20 localhost haproxy[14818]: runtime.(*mcentral).grow(0x1563998, 0x0)
Aug 02 13:05:20 localhost haproxy[14818]: /usr/local/go/src/runtime/mcentral.go:255 +0x79 fp=0xc000edecc0 sp=0xc000edec80 pc=0x4179e9
Aug 02 13:05:20 localhost haproxy[14818]: runtime.(*mcentral).cacheSpan(0x1563998, 0x40c444)
Aug 02 13:05:20 localhost haproxy[14818]: /usr/local/go/src/runtime/mcentral.go:106 +0x2bc fp=0xc000eded08 sp=0xc000edecc0 pc=0x41751c
Aug 02 13:05:20 localhost haproxy[14818]: runtime.(*mcache).refill(0x7f48b5f977d0, 0x8)
Aug 02 13:05:20 localhost haproxy[14818]: /usr/local/go/src/runtime/mcache.go:138 +0x85 fp=0xc000eded28 sp=0xc000eded08 pc=0x417005
Aug 02 13:05:20 localhost haproxy[14818]: runtime.(*mcache).nextFree(0x7f48b5f977d0, 0x151f608, 0xc00026dc00, 0x7f48b5f977d0, 0xc000000180)
Aug 02 13:05:20 localhost haproxy[14818]: /usr/local/go/src/runtime/malloc.go:868 +0x87 fp=0xc000eded60 sp=0xc000eded28 pc=0x40bd97
Aug 02 13:05:20 localhost haproxy[14818]: runtime.mallocgc(0x30, 0xd04b60, 0xc053fe7901, 0x13)

Aug 02 13:05:20 localhost haproxy[14818]: runtime.newobject(0xd04b60, 0xc043ae0a40)
Aug 02 13:05:20 localhost haproxy[14818]: /usr/local/go/src/runtime/malloc.go:1165 +0x38 fp=0xc000edee30 sp=0xc000edee00 pc=0x40cac8
Aug 02 13:05:20 localhost haproxy[14818]: github.com/haproxytech/config-parser/v2/parsers.(*OptionPgsqlCheck).Parse(0xc053d4d500, 0xc0449720c9, 0x5d, 0xc053ffeec0, 0x2, 0x2, 0xc053f0f700, 0x8, 0x8, 0x0, ...)
Aug 02 13:05:20 localhost haproxy[14818]: /go/pkg/mod/github.com/haproxytech/config-parser/[email protected]/parsers/option-pgsql-check.go:55 +0x87 fp=0xc000edee50 sp=0xc000edee30 pc=0x9b8db7
Aug 02 13:05:20 localhost haproxy[14818]: github.com/haproxytech/config-parser/v2.(*Parser).ProcessLine(0xc021939f00, 0xc0449720c9, 0x5d, 0xc053ffeec0, 0x2, 0x2, 0xc053f0f700, 0x8, 0x8, 0x0, ...)

You can see from the trace that the memory allocation in ProcessLine is the culprit. The biggest problem is that dataplaneapi will never come back up with the haproxy.cfg file in the state that it's in; it just keeps crashing while trying to load it on startup. There really is no easy remediation at that point.

corrieb · Aug 06 '20 00:08

Hi @corrieb, thank you for the very detailed and interesting report. We will look into it, and we are already aware of some of the points you brought up here.

mjuraga · Aug 06 '20 10:08

We have a very large configuration file as well: one frontend, 2300+ ACLs, and 1100+ backends, 25K+ lines in total including comments and configuration options.

We have not tested it with the dataplane API yet, but given @corrieb's analysis, I am afraid we will run into the same problem.

aintbrokedontfix · Mar 27 '21 12:03

Hi @corrieb, there have been some changes optimizing config-parser performance in the meantime. Can you try running your tests with the latest version to see if it addresses your issues?

mjuraga · Jul 27 '21 14:07