Unhandled Exception in Connection Loop: RuntimeError: ('xids do not match, expected %r received %r', 28, 27)

Open diranged opened this issue 11 years ago • 6 comments

Two days ago we upgraded our servers from Kazoo 1.3.1 to Kazoo 2.0. Since then we have seen a 10x increase in the number of connection failures, and recovery from those failures is far worse too. It looks like a new exception is being raised that we did not see before.

        "stack_trace": [
            "Traceback (most recent call last):", 
            "  File '/mnt/.venv/lib/python2.7/site-packages/kazoo/protocol/connection.py', line 531, in _connect_attempt", 
            "    response = self._read_socket(read_timeout)", 
            "  File '/mnt/.venv/lib/python2.7/site-packages/kazoo/protocol/connection.py', line 407, in _read_socket", 
            "    return self._read_response(header, buffer, offset)", 
            "  File '/mnt/.venv/lib/python2.7/site-packages/kazoo/protocol/connection.py', line 338, in _read_response", 
            "    'received %r', xid, header.xid)", 
            "RuntimeError: ('xids do not match, expected %r received %r', 28, 27)"

This is being caught by kazoo.client, which then reports the "Unhandled exception in connection loop" error. I'm completely stumped: I can't seem to replicate this in my own dev environment, and it only happens sporadically in production.
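For anyone unfamiliar with the check behind this error, here is a rough sketch of how a ZooKeeper client can pair replies with in-flight requests by xid. It is an illustration only, not Kazoo's actual implementation: the `PendingRequests` class and its method names are made up, and only the reserved xid values come from the ZooKeeper wire protocol. Note that in the trace above the format string and its arguments appear as a tuple because they were passed to RuntimeError separately instead of being interpolated; the sketch formats the message instead.

```python
from collections import deque

# Reserved xids in the ZooKeeper wire protocol. Replies carrying these
# are not tied to a queued request.
WATCH_XID = -1
PING_XID = -2
AUTH_XID = -4


class PendingRequests:
    """Toy model of a client's in-flight request queue (hypothetical)."""

    def __init__(self):
        self._xid = 0
        self._pending = deque()

    def send(self, name):
        # Each outgoing request gets the next xid and is queued in order.
        self._xid += 1
        self._pending.append((self._xid, name))
        return self._xid

    def receive(self, header_xid):
        # Watch events, ping replies and auth replies bypass the queue.
        if header_xid in (WATCH_XID, PING_XID, AUTH_XID):
            return None
        # Replies are expected in the order the requests were sent, so the
        # oldest pending request must carry the same xid as the reply.
        expected_xid, name = self._pending.popleft()
        if header_xid != expected_xid:
            raise RuntimeError(
                'xids do not match, expected %r received %r'
                % (expected_xid, header_xid))
        return name
```

In this model, a reply whose xid differs from the oldest queued request (for example a reply that belongs to an earlier request, as with "expected 28 received 27") trips the check. Whether that is what happens after a reconnect in Kazoo 2.0 is not established here.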

diranged avatar Oct 23 '14 18:10 diranged

I can reproduce this failure in our staging and production environments, where we have zookeeper clusters, when the client is disconnected from NodeA and reconnects to NodeB. It seems to happen every time. I'm still working to reproduce this in a smaller dev environment.

diranged avatar Oct 23 '14 20:10 diranged

@bbangert I noticed you did quite a bit of code reworking for Kazoo 2 regarding connection handling. Can you comment on this?

diranged avatar Oct 23 '14 21:10 diranged

We are still seeing this happen occasionally and are downgrading all of our servers to Kazoo 1.3.1 until this is resolved.

diranged avatar Nov 06 '14 20:11 diranged

@diranged I don't know offhand whether the latest 2 dev has a fix for this. If you can reproduce it on staging, maybe you can test it there?

bbangert avatar Nov 06 '14 21:11 bbangert

@diranged curious, are you using authentication (i.e., add_auth calls)? The reason I ask is https://github.com/python-zk/kazoo/commit/15b7632fba6cbb2f31bda95eb4ca4ad327c04919. Not sure if it's related.

Also, what's running on your servers? I do see this from time to time on 3.5 (i.e., ZK out of trunk from a couple of months ago + patches).

rgs1 avatar Nov 24 '14 18:11 rgs1

bump @diranged any updates per above?

Is it still reproducible?

jeffwidman avatar Jul 24 '17 04:07 jeffwidman