
stick-tables missing data since 3.2.x with many peers and traffic

MarcWort opened this issue 3 months ago • 18 comments

Detailed Description of the Problem

Hi, we are running 50 loadbalancers in two datacenters (with a latency of ~15ms between them) sharing four stick-tables in a single peer group and serving many thousand reqs per second.

Since upgrading from version 3.1.8 to 3.2.1, we have noticed that many instances sometimes can no longer see any table entries other than their own.

We can see this because we use the hostname as part of the key:

http-request track-sc4 hostname,concat(|,req.fe_name) table table1
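Every node should therefore end up with (at least) one entry per peer hostname, so a quick way to see how many peers a node currently knows about is to count the distinct hostnames in its local table. The sed expression below is only an illustration, not our exact monitoring:

echo "show table table1" | socat stdio /var/run/haproxy.stat \
  | sed -n 's/.*key=\([^|]*\)|.*/\1/p' | sort -u | wc -l
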
Image

In the top panel you can see that it sometimes works again for a few hours.

Here you can see that it is still broken on the newest version 3.2.6: Image

The status with show peers seems fine:

echo "show peers scrubbing" | socat stdio /var/run/haproxy.stat | grep "addr="
  0x556688f73de0: id=n255027(remote,active) addr=192.168.255.27:10000 app_state=RUNNING learn_state=NOTASSIGNED last_status=ESTA last_hdshk=1h37m
  0x556688f72d00: id=n255026(remote,active) addr=192.168.255.26:10000 app_state=RUNNING learn_state=NOTASSIGNED last_status=ESTA last_hdshk=1h37m
  0x556688f71c20: id=n255025(remote,active) addr=192.168.255.25:10000 app_state=RUNNING learn_state=NOTASSIGNED last_status=ESTA last_hdshk=1h37m

The state of two peers (in the same datacenter) looks like this:

#node a:
          0x555af86299a0 local_id=1 remote_id=1 flags=0x0 remote_data=0x400600
              last_acked=467 last_pushed=1758 last_get=467 teaching_origin=0 update=1758
              table:0x555ad2ff70f0 id=table1 update=1778 localupdate=1758 commitupdate=1758 refcnt=1


#node b:
          0x5566afe57eb0 local_id=1 remote_id=1 flags=0x0 remote_data=0x400600
              last_acked=1758 last_pushed=467 last_get=1758 teaching_origin=0 update=467
              table:0x5566899fb9c0 id=table1 update=517 localupdate=467 commitupdate=467 refcnt=1
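
These counters come from the per-table lines of "show peers"; to compare nodes I just pull those lines on each node (the grep pattern is only illustrative):

echo "show peers" | socat stdio /var/run/haproxy.stat | grep -E 'last_acked=|id=table1'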

Expected Behavior

Stick-tables should be kept (roughly) synchronized across all the nodes.

Steps to Reproduce the Behavior

On my laptop I reproduced it on 3.2.1 with this minimal config and four instances (launched as sketched right after the config):

global
    log stdout format raw daemon
    stats socket ipv4@127.0.0.1:9001 level admin
    stats timeout 2m

defaults
    mode http
    log global
    timeout client 5s
    timeout server 5s
    timeout connect 5s
    option httpslog

frontend fe_http
    bind 127.0.0.1:8001

    http-request track-sc0 req.hdr(host) table mycluster/table1
    default_backend be_app

backend be_app
    http-request deny deny_status 200

peers mycluster
  peer loadbalancer1 127.0.0.1:10001
  peer loadbalancer2 127.0.0.1:10002
  peer loadbalancer3 127.0.0.1:10003
  peer loadbalancer4 127.0.0.1:10004
  table table1 type string len 64 size 16k expire 10m store http_req_cnt
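
The four instances are started roughly like this (a sketch: it assumes per-instance copies generated with sed, where only the 800x frontend port and 900x stats port differ, and uses -L to set the local peer name; haproxy.cfg is the file shown above):

for i in 1 2 3 4; do
    # generate one config per instance, then start it as the matching peer
    sed -e "s/:8001/:800$i/" -e "s/:9001/:900$i/" haproxy.cfg > haproxy$i.cfg
    haproxy -f haproxy$i.cfg -L loadbalancer$i &
done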

while simulating many requests:

oha -n 50000 -c 4000 -p 200 "http://localhost:8001"
oha -n 50000 -c 4000 -p 200 "http://localhost:8002"
oha -n 50000 -c 4000 -p 200 "http://localhost:8003"
oha -n 50000 -c 4000 -p 200 "http://localhost:8004"

which results in a weird state:

echo "show table mycluster/table1" | socat stdio tcp4-connect:127.0.0.1:9001 && \
echo "show table mycluster/table1" | socat stdio tcp4-connect:127.0.0.1:9002 && \
echo "show table mycluster/table1" | socat stdio tcp4-connect:127.0.0.1:9003 && \
echo "show table mycluster/table1" | socat stdio tcp4-connect:127.0.0.1:9004

# table: mycluster/table1, type: string, size:16384, used:5
0x72473c040638: key=localhost:8004 use=0 exp=598371 shard=0 http_req_cnt=500000
0x724710048e78: key=localhost:8002 use=0 exp=597804 shard=0 http_req_cnt=500000
0x5e35d21bc718: key=localhost:8003 use=0 exp=598117 shard=0 http_req_cnt=500000
0x7247040585f8: key=localhost:8001 use=0 exp=598619 shard=0 http_req_cnt=499999
0x724708269ca8: key=localhost:8001 use=0 exp=581638 shard=0 http_req_cnt=1

# table: mycluster/table1, type: string, size:16384, used:4
0x7a7fb8040428: key=localhost:8004 use=0 exp=598361 shard=0 http_req_cnt=500000
0x7a7f9c08d888: key=localhost:8002 use=0 exp=597789 shard=0 http_req_cnt=500000
0x7a7fb804ede8: key=localhost:8003 use=0 exp=598109 shard=0 http_req_cnt=500000
0x7a7fa0040948: key=localhost:8001 use=0 exp=598622 shard=0 http_req_cnt=499999

# table: mycluster/table1, type: string, size:16384, used:4
0x5af2f39c4858: key=localhost:8004 use=0 exp=598354 shard=0 http_req_cnt=500000
0x7180dc042a28: key=localhost:8002 use=0 exp=597789 shard=0 http_req_cnt=500000
0x7180ac18cb18: key=localhost:8003 use=0 exp=598100 shard=0 http_req_cnt=500000

# table: mycluster/table1, type: string, size:16384, used:4
0x7cfe04228c28: key=localhost:8004 use=0 exp=598337 shard=0 http_req_cnt=500000
0x5ce57e243d98: key=localhost:8002 use=0 exp=597781 shard=0 http_req_cnt=500000
0x5ce57e2398b8: key=localhost:8003 use=0 exp=598092 shard=0 http_req_cnt=500000

Notice that localhost:8001 somehow appears twice on node 1 and is missing from nodes 3 and 4. Even after one more request to http://localhost:8001 it does not update. After restarting node 1, it has learned the state and continues to update the other nodes again.

On 3.1.8 there is no problem.

Do you have any idea what may have caused this?

There is always a group of at least 5 nodes rebooting at the same time. I could imagine a split-brain situation or a race condition between the first sync and the first request, but I haven't figured it out yet.

Do you have an idea how to solve the issue?

We are currently thinking of creating subgroups with fewer peers. Could I ask you what a "good" number of hosts per peer group would be?
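
What we have in mind is roughly this (hostnames and addresses here are made up; each datacenter gets its own smaller peers section, and the frontends in that datacenter track into their local group's table):

peers dc1_cluster
  peer lb-dc1-01 192.168.255.11:10000
  peer lb-dc1-02 192.168.255.12:10000
  table table1 type string len 64 size 1m expire 10m store http_req_cnt

peers dc2_cluster
  peer lb-dc2-01 192.168.255.21:10000
  peer lb-dc2-02 192.168.255.22:10000
  table table1 type string len 64 size 1m expire 10m store http_req_cnt

with the track-sc rules pointing at dc1_cluster/table1 or dc2_cluster/table1 respectively. The obvious downside is that the groups then no longer share data, hence the question about a reasonable group size.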

What is your configuration?

The configuration is quite big, but I am happy to share any relevant configuration snippet if you want.

Maybe relevant:
tune.stick-counters 12

Since 3.2 we also set:
cpu-policy group-by-cluster

Output of haproxy -vv

HAProxy version 3.2.6-81568b2 2025/10/03 - https://haproxy.org/
Status: long-term supported branch - will stop receiving fixes around Q2 2030.
Known bugs: http://www.haproxy.org/bugs/bugs-3.2.6.html
Running on: Linux 6.1.0-40-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.153-1 (2025-09-20) x86_64
Build options : 
  TARGET  = linux-glibc
  CC      = cc
  CFLAGS  = -O2 -g -march=x86-64-v3 -mtune=znver1 -fwrapv
  OPTIONS = USE_LIBCRYPT=1 USE_OPENSSL_AWSLC=1 USE_LUA=1 USE_ZLIB= USE_SLZ=1 USE_NS= USE_PROMEX=1 USE_PCRE= USE_PCRE_JIT= USE_PCRE2=1 USE_PCRE2_JIT=
  DEBUG   = -DDEBUG_GLITCHES=1

Feature list : -51DEGREES +ACCEPT4 +BACKTRACE -CLOSEFROM +CPU_AFFINITY +CRYPT_H -DEVICEATLAS +DL -ENGINE +EPOLL -EVPORTS +GETADDRINFO -KQUEUE -LIBATOMIC +LIBCRYPT +LINUX_CAP +LINUX_SPLICE +LINUX_TPROXY +LUA +MATH -MEMORY_PROFILING +NETFILTER -NS -OBSOLETE_LINKER +OPENSSL +OPENSSL_AWSLC -OPENSSL_WOLFSSL -OT -PCRE +PCRE2 -PCRE2_JIT -PCRE_JIT +POLL +PRCTL -PROCCTL +PROMEX -PTHREAD_EMULATION -QUIC -QUIC_OPENSSL_COMPAT +RT +SLZ +SSL -STATIC_PCRE -STATIC_PCRE2 +TFO +THREAD +THREAD_DUMP +TPROXY -WURFL -ZLIB +ACME

Default settings :
  bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with multi-threading support (MAX_TGROUPS=32, MAX_THREADS=1024, default=4).
Built with SSL library version : OpenSSL 1.1.1 (compatible; AWS-LC 1.62.0)
Running on SSL library version : AWS-LC 1.62.0
SSL library supports TLS extensions : yes
SSL library supports SNI : yes
SSL library FIPS mode : no
SSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
Built with Lua version : Lua 5.3.6
Built with the Prometheus exporter as a service
Built with libslz for stateless compression.
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with PCRE2 version : 10.42 2022-12-11
PCRE2 library supports JIT : no (USE_PCRE2_JIT not set)
Encrypted password support via crypt(3): yes
Built with gcc compiler version 12.2.0

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
         h2 : mode=HTTP  side=FE|BE  mux=H2    flags=HTX|HOL_RISK|NO_UPG
  <default> : mode=HTTP  side=FE|BE  mux=H1    flags=HTX
         h1 : mode=HTTP  side=FE|BE  mux=H1    flags=HTX|NO_UPG
       fcgi : mode=HTTP  side=BE     mux=FCGI  flags=HTX|HOL_RISK|NO_UPG
  <default> : mode=SPOP  side=BE     mux=SPOP  flags=HOL_RISK|NO_UPG
       spop : mode=SPOP  side=BE     mux=SPOP  flags=HOL_RISK|NO_UPG
  <default> : mode=TCP   side=FE|BE  mux=PASS  flags=
       none : mode=TCP   side=FE|BE  mux=PASS  flags=NO_UPG

Available services : prometheus-exporter
Available filters :
	[BWLIM] bwlim-in
	[BWLIM] bwlim-out
	[CACHE] cache
	[COMP] compression
	[FCGI] fcgi-app
	[SPOE] spoe
	[TRACE] trace


uname -a:
Linux server1 6.1.0-40-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.153-1 (2025-09-20) x86_64 GNU/Linux

Last Outputs and Backtraces


Additional Information

I tried reverting b3ad7b6371e10cee4e0c23f80191e62d1da7415f ("MEDIUM: peers: Give up if we fail to take locks in hot path") on one production loadbalancer and in my local test setup, but did not see any difference.
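
The revert was done the obvious way on top of the release tag and rebuilt (a sketch; the USE_* options from the -vv output below are omitted here):

git checkout v3.2.6
git revert b3ad7b6371e10cee4e0c23f80191e62d1da7415f
make -j"$(nproc)" TARGET=linux-glibc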

I can try some stuff like reverting some patches on prod, if you are somewhat sure it doesn't crash immediately :)

cc @idl0r as my colleague :)

MarcWort avatar Oct 16 '25 15:10 MarcWort

Additional information:

I bisected between good f2b97918e80b2f4df1da751a44fe6e323c6e4b9e ("[RELEASE] Released version 3.1.0") and bad 81568b2d10d89be28e0147c8f25ca41b2f9dde20 ("[RELEASE] Released version 3.2.6"),

and found that the first bad commit is:

388539faa35cd11019d81619cce86ac34a554d33 MEDIUM: stick-tables: defer adding updates to a tasklet
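
The bisect itself was driven the standard way, rebuilding and re-running the four-instance repro at each step (sketch; build options trimmed):

git bisect start
git bisect bad 81568b2d10d89be28e0147c8f25ca41b2f9dde20
git bisect good f2b97918e80b2f4df1da751a44fe6e323c6e4b9e
# at each step: make -j"$(nproc)" TARGET=linux-glibc, run the repro,
# then "git bisect good" or "git bisect bad" depending on whether entries go missing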

Here is the "show peers" output from my test setup:

peers-node-1.txt

peers-node-2.txt

peers-node-3.txt

peers-node-4.txt

MarcWort avatar Oct 17 '25 09:10 MarcWort

Thanks a lot for the detailed report and the bisecting work, @MarcWort! I'll have a look at it soon.

cognet avatar Oct 17 '25 11:10 cognet

Hi @MarcWort !

So this is a bit of a shot in the dark, as I've been unable to reproduce it locally even with the help of your configuration example, but could you try the attached patch and see if it changes anything? It should apply cleanly on 3.2.

Thank you!

0001-WIP-only-set-is_local-if-append-succeeded.txt

cognet avatar Oct 22 '25 13:10 cognet

Hi Olivier @cognet,

Unfortunately no. I applied the patch and still see entries that do not update in my local test setup.

I've been unable to reproduce it locally,

To make the test more reproducible, I stop one haproxy instance (SIGINT via Ctrl-C in the shell) at an arbitrary point in time and then restart it. After e.g. loadbalancer4 is started again, its local table keeps updating, but the other instances no longer receive its updates.
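
In script form it is roughly this, reusing the illustrative haproxy4.cfg / -L naming from the launch sketch in my first post:

# stop instance 4 at an arbitrary moment while the others keep serving traffic
pkill -INT -f 'haproxy -f haproxy4.cfg'
sleep 2
# start it again; from this point on the other nodes stop seeing its updates
haproxy -f haproxy4.cfg -L loadbalancer4 &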

Output of

watch -n 0.1 'echo "show table mycluster/table1" | socat stdio tcp4-connect:127.0.0.1:9001 && \
echo "show table mycluster/table1" | socat stdio tcp4-connect:127.0.0.1:9002 && \
echo "show table mycluster/table1" | socat stdio tcp4-connect:127.0.0.1:9003 && \
echo "show table mycluster/table1" | socat stdio tcp4-connect:127.0.0.1:9004'

https://github.com/user-attachments/assets/d2d6a5a2-60e3-4427-bd54-270209a165f0

tbh I hope there are not two bugs we are fighting...

For completeness, I downgraded our production to 3.1.9 and now see stick-table updates from all the other LBs again:

Image

Also, we should maybe ignore the double entry from the first post, because I haven't seen it again; it seems really rare or somehow triggered by something else.

0x7247040585f8: key=localhost:8001 use=0 exp=598619 shard=0 http_req_cnt=499999
0x724708269ca8: key=localhost:8001 use=0 exp=581638 shard=0 http_req_cnt=1

MarcWort avatar Oct 22 '25 15:10 MarcWort

Regarding the double entry: it's not possible to store one in a stick-table, as entries are indexed by a unique key, so I guess what happened in your case is that the entry was purged and re-created during the "show table" dump, and was thus met twice. This can indeed happen but should be super rare, because you have to be very lucky for the dump to yield between the two occurrences.

wtarreau avatar Oct 22 '25 16:10 wtarreau

@cognet I can now reproduce it on latest 3.2:

# table: peers/cluster, type: string, size:2048000, used:4
0x78365004df18: key=0:2210 use=0 exp=592585 shard=0 http_req_cnt=1000000
0x78362c047da8: key=0:2213 use=0 exp=592581 shard=0 http_req_cnt=997840
0x78363804fc58: key=0:2211 use=0 exp=592581 shard=0 http_req_cnt=996746
0x56dd33f863a8: key=0:2212 use=0 exp=592596 shard=0 http_req_cnt=272039

It doesn't happen with 3.3-dev, however. I'll give you my config if you're interested. In the meantime I'm going to bisect to see what fixed it in 3.3.

wtarreau avatar Oct 23 '25 14:10 wtarreau

I'm also getting the duplicate. I don't know what to think about it since it's technically impossible:

0x5fbdf2644208: key=0:2210 use=0 exp=597029 shard=0 http_req_cnt=1999999
0x70dcf0036ae8: key=0:2213 use=0 exp=597051 shard=0 http_req_cnt=1999999
0x5fbdf2512948: key=0:2213 use=0 exp=588312 shard=0 http_req_cnt=108
0x70dd00039f78: key=0:2211 use=0 exp=597042 shard=0 http_req_cnt=2000000
0x70dd0c03b0b8: key=0:2212 use=0 exp=597042 shard=0 http_req_cnt=2000000

That's totally puzzling :-/

wtarreau avatar Oct 23 '25 14:10 wtarreau

@wtarreau yes I'd be very interested in your reproducer, I did some test with 3.2, but it is true most of my tests were with 3.3

cognet avatar Oct 23 '25 14:10 cognet

I could bisect it down to this commit, in 3.2.5-16:

commit c43ecfd408e15c82df2c8cf6acf88197a5b92fed (HEAD)
Author: Willy Tarreau <[email protected]>
Date:   Tue Sep 9 17:51:39 2025 +0200

    MEDIUM: peers: don't even try to process updates under contention
    
    Recent fix 2421c3769a ("BUG/MEDIUM: peers: don't fail twice to grab the
    update lock") improved the situation a lot for peers under locking
    contention but still not enough for situations with many peers and
    many entries to expire fast. It's indeed still possible to trigger
    warnings at end of injection sessions for 16 peers at 100k req/s each
    doing 10 random track-sc when process_table_expire() runs and holds the
    update lock if compiled with a high value of STKTABLE_MAX_UPDATES_AT_ONCE
    (1000). Better just not insist in this case and postpone the update.

I suspect that when we exit that loop, for one reason or another the entry is marked as pushed while it was not. It's worth noting that on some rare occasions I could still see a loss with all other versions (e.g. 999999 entries instead of 1000000) that I cannot explain, so I've only been focusing on blatant losses. Let's just have a call, it will be easier than writing our lives here.
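
Side note for anyone trying to reproduce the contention: the batch size mentioned in the commit message is a compile-time constant, so testing with a large value just means overriding the define at build time, e.g. (assuming the usual DEFINE hook of the Makefile; other build options omitted):

make -j"$(nproc)" TARGET=linux-glibc DEFINE="-DSTKTABLE_MAX_UPDATES_AT_ONCE=1000"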

wtarreau avatar Oct 23 '25 14:10 wtarreau

It doesn't happen with 3.3-dev however. I'll give you my config if you're interested. In the mean time I'm going to bisect to see what fixed it in 3.3.

Thanks for looking into this. Unfortunately I can still reproduce it on 3.3-dev10 (aa259f5b4). Sure, I would like to see how you reproduce this behavior.

Here is mine, which is 100% reproducible: after restarting just one instance under light load, none of the other instances updates that instance's entry anymore:

https://github.com/user-attachments/assets/b2209b9d-a00a-4d9e-aca3-13aa1b4ac10e

I could bisect it to that commit in 3.2.5-16:

I don't quite understand this. As you can see in the second screenshot of my very first post, we had this problem in production on all versions of the 3.2 series.

MarcWort avatar Oct 23 '25 15:10 MarcWort

@MarcWort I think the confusion comes from the fact that we're seeing multiple issues :)

cognet avatar Oct 23 '25 15:10 cognet

I agree. Initially, while trying to reproduce it, seeing missing entries made me think I was reproducing your issue (i.e. until proven otherwise it would match). But we figured out that this was different: a very special corner case of the reconnection sequence of a peer that has no older local process. Such a peer waits 5s for an old process to connect before connecting to the other peers, which causes some locally learned entries to be replaced by entries from other peers, hence some counts being lost. We've even figured out a way to improve this. We've also found what caused the double entry; the bug has been there from the very beginning, in 2010. It does indeed index different keys, it's just that the data can differ after the \0 that ends the string. We have a fix for this. And now, with all that in mind, we have to restart the analysis from scratch for what you're describing (the missing entry), which we now know is not caused by any of these.

wtarreau avatar Oct 24 '25 03:10 wtarreau

Hi @MarcWort,

Sorry for the very long wait time. Thanks to @capflam, I believe we finally have a fix, which I just pushed to 3.3 (while forgetting to mention the issue). So commit c26bcfc1e3ae0193e2589a15c96ffd44b2e0f770 should (hopefully) fix the problem, and should apply as-is to 3.2, if you want to test it.

cognet avatar Nov 07 '25 15:11 cognet

Of course you waited for me to tag 3.2.8 to find this :-) No problem, it will be in 3.2.9 then.

wtarreau avatar Nov 07 '25 15:11 wtarreau

Blame @capflam !

cognet avatar Nov 07 '25 15:11 cognet

Hi @cognet, Thank you for your awesome work. This patch does indeed fix the issue in my local test setup.

I'm going to apply it to 3.2.8, deploy it on our production setup, and report back soon.
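
Concretely the plan is nothing more than this (file names are illustrative; build options trimmed to the ones visible in the -vv output above, SSL library paths omitted):

# fix.patch = commit c26bcfc1e3ae exported from a 3.3 tree with "git format-patch -1"
tar xf haproxy-3.2.8.tar.gz && cd haproxy-3.2.8
patch -p1 < ../fix.patch
make -j"$(nproc)" TARGET=linux-glibc USE_OPENSSL_AWSLC=1 USE_LUA=1 USE_SLZ=1 USE_PROMEX=1 USE_PCRE2=1 USE_LIBCRYPT=1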

Should this issue then be closed?

MarcWort avatar Nov 10 '25 09:11 MarcWort

Hi @MarcWort,

Thanks for testing! Let's keep the issue open until the patch is backported to 3.2.

cognet avatar Nov 10 '25 10:11 cognet

Hi, just to keep you updated, it works fine on our production servers with 3.2.8 and the patch.

Thanks again!

MarcWort avatar Nov 12 '25 09:11 MarcWort