
Stale Websocket Channel Message

Open Arctek opened this issue 6 years ago • 27 comments

Hi,

I'm having an issue where the websocket channel is returning an old message.

When I check the publisher endpoint manually, i.e. via curl -v --request GET -H "Accept: text/json" http://188.166.94.193/pub/.....

It tells me that the message has been updated recently; however, nothing has come through on the subscription, and it still returns the old message.

Restarting nginx or deleting the channel seems to fix the issue, but it comes back fairly quickly. From what I can see, some channels work while others don't, even though the data is in the same format.

I'm using nchan built from master, so everything is up to date.

Arctek avatar Apr 03 '19 13:04 Arctek

I don't understand the problem you're having. Show me your config and I might be able to help.

slact avatar Apr 16 '19 21:04 slact

It's pretty long - https://gist.github.com/Arctek/fdfc0e0ed2bac86b24b807601361a260

The channels I am having issues with have JSON data pushed into them fairly regularly, i.e. every few seconds.

Some of them seem to get stuck after a while and only return an older message, e.g. wss://bamboorelay.com/sub/main/orders/ZRX-WETH

There are some lines in the error log occasionally too - not sure if they are related:

2019/04/17 07:26:29 [error] 11421#11421: REDISTORE: invalid message redis reply
2019/04/17 07:26:29 [error] 11421#11421: REDISTORE: invalid message or message absent after get_msg_from_key

Arctek avatar Apr 17 '19 05:04 Arctek

Those error log lines are what I'm interested in. How often does this happen? Would you be able to catch it in a redis-cli monitor session?

slact avatar Apr 18 '19 17:04 slact

redis.log

I've captured data covering the same time that the message showed up in the log file; it shows up maybe every couple of minutes.

Though I'm not sure it's to do with Redis - I've checked, and Redis has the latest copy of the message (timestamped 1555644444), but connecting to the channel I see an older message stamped 1555643718.

Some of the channels seem to get pretty reliably stuck that way, but not others, for some reason.

The messages in the logs are (dates converted to timestamps):

1555643612 [error] 28808#28808: REDISTORE: invalid message redis reply
1555643612 [error] 28808#28808: REDISTORE: invalid message or message absent after get_msg_from_key
1555643612 [error] 28808#28808: REDISTORE: invalid message redis reply
1555643612 [error] 28808#28808: REDISTORE: invalid message or message absent after get_msg_from_key
1555643726 [error] 28808#28808: REDISTORE: invalid message redis reply
1555643726 [error] 28808#28808: REDISTORE: invalid message or message absent after get_msg_from_key
1555643858 [error] 28808#28808: REDISTORE: invalid message redis reply
1555643858 [error] 28808#28808: REDISTORE: invalid message or message absent after get_msg_from_key
1555643903 [error] 28808#28808: REDISTORE: invalid message redis reply
1555643903 [error] 28808#28808: REDISTORE: invalid message or message absent after get_msg_from_key

I could only find one in the cli dump (1555643858), but that is for a different channel than the one I am having issues with. That said, it could be another issue too.

There are a fair few channels, though: more than 15k channel keys in Redis, although most aren't in use at the same time.

Arctek avatar Apr 19 '19 03:04 Arctek

Your problem looks a lot like mine (#517). For me, it's very visible (smaller set-up): after around 24h of uptime, pre-existing/used channels stop publishing newer messages. When a client joins such a broken channel and I publish a message to it while monitoring the keys inside Redis, I see the channel key being created, but it disappears again within 5 seconds of publishing, without the message being passed through to the still-subscribed clients.

The message is still there, though, but the channel no longer exists. Eventually the message expires without ever having been published to the connected clients. Only a restart seems to solve this behaviour for me on these broken channels.

Maybe you see the same happening in your situation?

sg-jorrit avatar Apr 19 '19 09:04 sg-jorrit

What I need to fix this is the error from the Redis Lua script that produces those lines in the Nginx log. If you can't catch it in real time, I will make a debug version to capture it.

slact avatar Apr 20 '19 18:04 slact

Is there anything in particular I should be looking for in the redis-cli monitor output? There's a lot being pushed through. I could try setting up a separate VPS and just pushing data into the broken channel, but I'm not sure the problem will keep happening there, as on my local machine it doesn't appear to get stuck. Otherwise I can compile and run a debug version on production, if that will help with tracking this down.

Arctek avatar Apr 22 '19 05:04 Arctek

@Arctek please give the debug-522 branch a try. It's got some debugging code that will help me track this down. Let me know what shows up in your error log after you run it and encounter the issue.

slact avatar Apr 26 '19 19:04 slact

The dbg calls in the Lua script you updated are commented out?

That said, I did compile and run it in debug mode; you can see below where it happens on the ZRX-WETH channel.

I think the issue is probably related to:

  • Channels are set to retain a single message only
  • I have multiple data sources trying to update each channel
  • It looks like these publishes can occur at the same time, and that concurrency is what is causing the issue (see the config sketch below)
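
For context, here is a minimal sketch of the kind of setup described above. It is reconstructed from this thread (the location regex and channel id come from the debug log below), not copied from the actual gist, and the Redis address and buffer setting are assumptions.

# Hypothetical sketch, not the production config from the gist.
upstream redis_backend {
    # Single Redis instance assumed.
    nchan_redis_server redis://127.0.0.1:6379;
}

server {
    listen 80;

    # Several data sources POST JSON here every few seconds, possibly concurrently.
    location ~ "/pub/([A-Za-z0-9\-\.\/]+)" {
        nchan_publisher;
        nchan_channel_id /$1;
        nchan_redis_pass redis_backend;
        # Each channel retains only a single message, as described above.
        nchan_message_buffer_length 1;
    }

    # Websocket subscribers connect here.
    location ~ "/sub/([A-Za-z0-9\-\.\/]+)" {
        nchan_subscriber websocket;
        nchan_channel_id /$1;
        nchan_redis_pass redis_backend;
    }
}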

error.log.gz

2019/04/27 06:14:54 [debug] 23965#23965: *7608 write new buf t:1 f:0 000055AE8472F878, pos 000055AE8472F878, size: 140 file: 0, size: 0
2019/04/27 06:14:54 [debug] 23965#23965: *7608 http write filter: l:0 f:0 s:140
2019/04/27 06:14:54 [debug] 23965#23965: *7608 http output filter "/pub/main/orders/MANA-WETH?"
2019/04/27 06:14:54 [debug] 23965#23965: *7608 http copy filter: "/pub/main/orders/MANA-WETH?"
2019/04/27 06:14:54 [debug] 23965#23965: *7608 http postpone filter "/pub/main/orders/MANA-WETH?" 000055AE8472F818
2019/04/27 06:14:54 [debug] 23965#23965: *7608 http brotli filter
2019/04/27 06:14:54 [debug] 23965#23965: *7608 write old buf t:1 f:0 000055AE8472F878, pos 000055AE8472F878, size: 140 file: 0, size: 0
2019/04/27 06:14:54 [debug] 23965#23965: *7608 write new buf t:0 f:0 000055AE8368DB60, pos 000055AE8368DB60, size: 91 file: 0, size: 0
2019/04/27 06:14:54 [debug] 23965#23965: *7608 http write filter: l:1 f:1 s:231
2019/04/27 06:14:54 [debug] 23965#23965: *7608 http write filter limit 524288
2019/04/27 06:14:54 [debug] 23965#23965: *7608 writev: 231 of 231
2019/04/27 06:14:54 [debug] 23965#23965: *7608 http write filter 0000000000000000
2019/04/27 06:14:54 [debug] 23965#23965: *7608 http copy filter: 0 "/pub/main/orders/MANA-WETH?"
2019/04/27 06:14:54 [debug] 23965#23965: *7608 http finalize request: 0, "/pub/main/orders/MANA-WETH?" a:1, c:1
2019/04/27 06:14:54 [debug] 23965#23965: *7608 event timer add: 65: 5000:7842224
2019/04/27 06:14:54 [debug] 23965#23965: *7608 http lingering close handler
2019/04/27 06:14:54 [debug] 23965#23965: *7608 recv: eof:0, avail:1
2019/04/27 06:14:54 [debug] 23965#23965: *7608 recv: fd:65 -1 of 4096
2019/04/27 06:14:54 [debug] 23965#23965: *7608 recv() not ready (11: Resource temporarily unavailable)
2019/04/27 06:14:54 [debug] 23965#23965: *7608 lingering read: -2
2019/04/27 06:14:54 [debug] 23965#23965: *7608 event timer: 65, old: 7842224, new: 7842224
2019/04/27 06:14:54 [debug] 23965#23965: timer delta: 4
2019/04/27 06:14:54 [debug] 23965#23965: worker cycle
2019/04/27 06:14:54 [debug] 23965#23965: epoll timer: 528
2019/04/27 06:14:54 [debug] 23965#23965: epoll: fd:6 ev:0001 d:000055AE846058F8
2019/04/27 06:14:54 [debug] 23965#23965: accept on unix:/var/run/nginx.sock, ready: 1
2019/04/27 06:14:54 [debug] 23965#23965: posix_memalign: 000055AE847327D0:256 @16
2019/04/27 06:14:54 [debug] 23965#23965: posix_memalign: 000055AE845954B0:256 @16
2019/04/27 06:14:54 [debug] 23965#23965: *7609 accept: unix: fd:64
2019/04/27 06:14:54 [debug] 23965#23965: *7609 event timer add: 64: 60000:7897388
2019/04/27 06:14:54 [debug] 23965#23965: *7609 reusable connection: 1
2019/04/27 06:14:54 [debug] 23965#23965: *7609 epoll add event: fd:64 op:1 ev:80002001
2019/04/27 06:14:54 [debug] 23965#23965: accept() not ready (11: Resource temporarily unavailable)
2019/04/27 06:14:54 [debug] 23965#23965: timer delta: 164
2019/04/27 06:14:54 [debug] 23965#23965: worker cycle
2019/04/27 06:14:54 [debug] 23965#23965: epoll timer: 364
2019/04/27 06:14:54 [debug] 23965#23965: epoll: fd:6 ev:0001 d:000055AE846058F8
2019/04/27 06:14:54 [debug] 23965#23965: accept on unix:/var/run/nginx.sock, ready: 1
2019/04/27 06:14:54 [debug] 23965#23965: posix_memalign: 000055AE846E92B0:256 @16
2019/04/27 06:14:54 [debug] 23965#23965: posix_memalign: 000055AE847680C0:256 @16
2019/04/27 06:14:54 [debug] 23965#23965: *7610 accept: unix: fd:66
2019/04/27 06:14:54 [debug] 23965#23965: *7610 event timer add: 66: 60000:7897388
2019/04/27 06:14:54 [debug] 23965#23965: *7610 reusable connection: 1
2019/04/27 06:14:54 [debug] 23965#23965: *7610 epoll add event: fd:66 op:1 ev:80002001
2019/04/27 06:14:54 [debug] 23965#23965: accept() not ready (11: Resource temporarily unavailable)
2019/04/27 06:14:54 [debug] 23965#23965: timer delta: 0
2019/04/27 06:14:54 [debug] 23965#23965: worker cycle
2019/04/27 06:14:54 [debug] 23965#23965: epoll timer: 364
2019/04/27 06:14:54 [debug] 23965#23965: epoll: fd:65 ev:2011 d:000055AE84606780
2019/04/27 06:14:54 [debug] 23965#23965: epoll_wait() error on fd:65 ev:2011
2019/04/27 06:14:54 [debug] 23965#23965: *7608 http lingering close handler
2019/04/27 06:14:54 [debug] 23965#23965: *7608 recv: eof:1, avail:1
2019/04/27 06:14:54 [debug] 23965#23965: *7608 recv: fd:65 0 of 4096
2019/04/27 06:14:54 [debug] 23965#23965: *7608 lingering read: 0
2019/04/27 06:14:54 [debug] 23965#23965: *7608 http request count:1 blk:0
2019/04/27 06:14:54 [debug] 23965#23965: *7608 http close request
2019/04/27 06:14:54 [debug] 23965#23965: *7608 http log handler
2019/04/27 06:14:54 [debug] 23965#23965: *7608 free: 000055AE8474B300
2019/04/27 06:14:54 [debug] 23965#23965: *7608 free: 000055AE84703DD0
2019/04/27 06:14:54 [debug] 23965#23965: *7608 free: 000055AE846EB270, unused: 3
2019/04/27 06:14:54 [debug] 23965#23965: *7608 free: 000055AE8472F240, unused: 2184
2019/04/27 06:14:54 [debug] 23965#23965: *7608 close http connection: 65
2019/04/27 06:14:54 [debug] 23965#23965: *7608 event timer del: 65: 7842224
2019/04/27 06:14:54 [debug] 23965#23965: *7608 reusable connection: 0
2019/04/27 06:14:54 [debug] 23965#23965: *7608 free: 000055AE8496CC60
2019/04/27 06:14:54 [debug] 23965#23965: *7608 free: 000055AE847584C0, unused: 0
2019/04/27 06:14:54 [debug] 23965#23965: *7608 free: 000055AE84A04D00, unused: 8
2019/04/27 06:14:54 [debug] 23965#23965: timer delta: 0
2019/04/27 06:14:54 [debug] 23965#23965: worker cycle
2019/04/27 06:14:54 [debug] 23965#23965: epoll timer: 364
2019/04/27 06:14:55 [debug] 23965#23965: epoll: fd:64 ev:0001 d:000055AE84605EC9
2019/04/27 06:14:55 [debug] 23965#23965: *7609 http wait request handler
2019/04/27 06:14:55 [debug] 23965#23965: *7609 malloc: 000055AE846EB270:4096
2019/04/27 06:14:55 [debug] 23965#23965: *7609 recv: eof:0, avail:1
2019/04/27 06:14:55 [debug] 23965#23965: *7609 recv: fd:64 4096 of 4096
2019/04/27 06:14:55 [debug] 23965#23965: *7609 reusable connection: 0
2019/04/27 06:14:55 [debug] 23965#23965: *7609 posix_memalign: 000055AE8496CC60:4096 @16
2019/04/27 06:14:55 [debug] 23965#23965: *7609 http process request line
2019/04/27 06:14:55 [debug] 23965#23965: *7609 http request line: "POST /pub/main/orders/ZRX-WETH HTTP/1.1"
2019/04/27 06:14:55 [debug] 23965#23965: *7609 http uri: "/pub/main/orders/ZRX-WETH"
2019/04/27 06:14:55 [debug] 23965#23965: *7609 http args: ""
2019/04/27 06:14:55 [debug] 23965#23965: *7609 http exten: ""
2019/04/27 06:14:55 [debug] 23965#23965: *7609 posix_memalign: 000055AE8472F240:4096 @16
2019/04/27 06:14:55 [debug] 23965#23965: *7609 http process request header line
2019/04/27 06:14:55 [debug] 23965#23965: *7609 http header: "Content-Type: application/json"
2019/04/27 06:14:55 [debug] 23965#23965: *7609 http header: "Accept: text/json"
2019/04/27 06:14:55 [debug] 23965#23965: *7609 http header: "host: unix"
2019/04/27 06:14:55 [debug] 23965#23965: *7609 http header: "content-length: 20538"
2019/04/27 06:14:55 [debug] 23965#23965: *7609 http header: "Connection: close"
2019/04/27 06:14:55 [debug] 23965#23965: *7609 http header done
2019/04/27 06:14:55 [debug] 23965#23965: *7609 event timer del: 64: 7897388
2019/04/27 06:14:55 [debug] 23965#23965: *7609 rewrite phase: 0
2019/04/27 06:14:55 [debug] 23965#23965: *7609 test location: ~ "/pub/([A-Za-z0-9\-\.\/]+)"
2019/04/27 06:14:55 [debug] 23965#23965: *7609 using configuration "/pub/([A-Za-z0-9\-\.\/]+)"
2019/04/27 06:14:55 [debug] 23965#23965: *7609 http cl:20538 max:134217728
2019/04/27 06:14:55 [debug] 23965#23965: *7609 rewrite phase: 2
2019/04/27 06:14:55 [debug] 23965#23965: *7609 post rewrite phase: 3
2019/04/27 06:14:55 [debug] 23965#23965: *7609 generic phase: 4
2019/04/27 06:14:55 [debug] 23965#23965: *7609 generic phase: 5
2019/04/27 06:14:55 [debug] 23965#23965: *7609 access phase: 6
2019/04/27 06:14:55 [debug] 23965#23965: *7609 access phase: 7
2019/04/27 06:14:55 [debug] 23965#23965: *7609 access phase: 8
2019/04/27 06:14:55 [debug] 23965#23965: *7609 post access phase: 9
2019/04/27 06:14:55 [debug] 23965#23965: *7609 generic phase: 10
2019/04/27 06:14:55 [debug] 23965#23965: *7609 generic phase: 11
2019/04/27 06:14:55 [debug] 23965#23965: *7609 http client request body preread 3948
2019/04/27 06:14:55 [debug] 23965#23965: *7609 http request body content length filter
2019/04/27 06:14:55 [debug] 23965#23965: *7609 http body new buf t:1 f:0 000055AE846EB304, pos 000055AE846EB304, size: 3948 file: 0, size: 0
2019/04/27 06:14:55 [debug] 23965#23965: *7609 malloc: 000055AE84A2DB10:20538
2019/04/27 06:14:55 [debug] 23965#23965: *7609 http read client request body
2019/04/27 06:14:55 [debug] 23965#23965: *7609 recv: eof:0, avail:1
2019/04/27 06:14:55 [debug] 23965#23965: *7609 recv: fd:64 16590 of 16590
2019/04/27 06:14:55 [debug] 23965#23965: *7609 http client request body recv 16590
2019/04/27 06:14:55 [debug] 23965#23965: *7609 http body new buf t:1 f:0 000055AE84A2DB10, pos 000055AE84A2DB10, size: 16590 file: 0, size: 0
2019/04/27 06:14:55 [debug] 23965#23965: *7609 http client request body rest 0
2019/04/27 06:14:55 [debug] 23965#23965: *7609 http script capture: "main/orders/ZRX-WETH"
2019/04/27 06:14:55 [debug] 23965#23965: *7609 malloc: 000055AE8495A230:20539
2019/04/27 06:14:55 [debug] 23965#23965: malloc: 000055AE84778F50:16
2019/04/27 06:14:55 [debug] 23965#23965: *7609 http cleanup add: 000055AE8472F7B0
2019/04/27 06:14:55 [debug] 23965#23965: malloc: 000055AE8471E6C0:104
2019/04/27 06:14:55 [debug] 23965#23965: epoll add event: fd:14 op:3 ev:80002005
2019/04/27 06:14:55 [debug] 23965#23965: *7609 http finalize request: -4, "/pub/main/orders/ZRX-WETH?" a:1, c:2
2019/04/27 06:14:55 [debug] 23965#23965: *7609 http request count:2 blk:0
2019/04/27 06:14:55 [debug] 23965#23965: timer delta: 40
2019/04/27 06:14:55 [debug] 23965#23965: worker cycle
2019/04/27 06:14:55 [debug] 23965#23965: epoll timer: 324
2019/04/27 06:14:55 [debug] 23965#23965: epoll: fd:66 ev:0001 d:000055AE84606971
2019/04/27 06:14:55 [debug] 23965#23965: *7610 http wait request handler
2019/04/27 06:14:55 [debug] 23965#23965: *7610 malloc: 000055AE84A85B40:4096
2019/04/27 06:14:55 [debug] 23965#23965: *7610 recv: eof:0, avail:1
2019/04/27 06:14:55 [debug] 23965#23965: *7610 recv: fd:66 4096 of 4096
2019/04/27 06:14:55 [debug] 23965#23965: *7610 reusable connection: 0
2019/04/27 06:14:55 [debug] 23965#23965: *7610 posix_memalign: 000055AE847BFCD0:4096 @16
2019/04/27 06:14:55 [debug] 23965#23965: *7610 http process request line
2019/04/27 06:14:55 [debug] 23965#23965: *7610 http request line: "POST /pub/main/orders/ZRX-WETH HTTP/1.1"
2019/04/27 06:14:55 [debug] 23965#23965: *7610 http uri: "/pub/main/orders/ZRX-WETH"
2019/04/27 06:14:55 [debug] 23965#23965: *7610 http args: ""
2019/04/27 06:14:55 [debug] 23965#23965: *7610 http exten: ""
2019/04/27 06:14:55 [debug] 23965#23965: *7610 posix_memalign: 000055AE847C0CE0:4096 @16
2019/04/27 06:14:55 [debug] 23965#23965: *7610 http process request header line
2019/04/27 06:14:55 [debug] 23965#23965: *7610 http header: "Content-Type: application/json"
2019/04/27 06:14:55 [debug] 23965#23965: *7610 http header: "Accept: text/json"
2019/04/27 06:14:55 [debug] 23965#23965: *7610 http header: "host: unix"
2019/04/27 06:14:55 [debug] 23965#23965: *7610 http header: "content-length: 20538"
2019/04/27 06:14:55 [debug] 23965#23965: *7610 http header: "Connection: close"
2019/04/27 06:14:55 [debug] 23965#23965: *7610 http header done
2019/04/27 06:14:55 [debug] 23965#23965: *7610 event timer del: 66: 7897388
2019/04/27 06:14:55 [debug] 23965#23965: *7610 rewrite phase: 0
2019/04/27 06:14:55 [debug] 23965#23965: *7610 test location: ~ "/pub/([A-Za-z0-9\-\.\/]+)"
2019/04/27 06:14:55 [debug] 23965#23965: *7610 using configuration "/pub/([A-Za-z0-9\-\.\/]+)"
2019/04/27 06:14:55 [debug] 23965#23965: *7610 http cl:20538 max:134217728
2019/04/27 06:14:55 [debug] 23965#23965: *7610 rewrite phase: 2
2019/04/27 06:14:55 [debug] 23965#23965: *7610 post rewrite phase: 3
2019/04/27 06:14:55 [debug] 23965#23965: *7610 generic phase: 4
2019/04/27 06:14:55 [debug] 23965#23965: *7610 generic phase: 5
2019/04/27 06:14:55 [debug] 23965#23965: *7610 access phase: 6
2019/04/27 06:14:55 [debug] 23965#23965: *7610 access phase: 7
2019/04/27 06:14:55 [debug] 23965#23965: *7610 access phase: 8
2019/04/27 06:14:55 [debug] 23965#23965: *7610 post access phase: 9
2019/04/27 06:14:55 [debug] 23965#23965: *7610 generic phase: 10
2019/04/27 06:14:55 [debug] 23965#23965: *7610 generic phase: 11
2019/04/27 06:14:55 [debug] 23965#23965: *7610 http client request body preread 3948
2019/04/27 06:14:55 [debug] 23965#23965: *7610 http request body content length filter
2019/04/27 06:14:55 [debug] 23965#23965: *7610 http body new buf t:1 f:0 000055AE84A85BD4, pos 000055AE84A85BD4, size: 3948 file: 0, size: 0
2019/04/27 06:14:55 [debug] 23965#23965: *7610 malloc: 000055AE8493A510:20538
2019/04/27 06:14:55 [debug] 23965#23965: *7610 http read client request body
2019/04/27 06:14:55 [debug] 23965#23965: *7610 recv: eof:0, avail:1
2019/04/27 06:14:55 [debug] 23965#23965: *7610 recv: fd:66 16590 of 16590
2019/04/27 06:14:55 [debug] 23965#23965: *7610 http client request body recv 16590
2019/04/27 06:14:55 [debug] 23965#23965: *7610 http body new buf t:1 f:0 000055AE8493A510, pos 000055AE8493A510, size: 16590 file: 0, size: 0
2019/04/27 06:14:55 [debug] 23965#23965: *7610 http client request body rest 0
2019/04/27 06:14:55 [debug] 23965#23965: *7610 http script capture: "main/orders/ZRX-WETH"
2019/04/27 06:14:55 [debug] 23965#23965: *7610 malloc: 000055AE84A096F0:20539
2019/04/27 06:14:55 [debug] 23965#23965: malloc: 000055AE846DD5F0:16
2019/04/27 06:14:55 [debug] 23965#23965: *7610 http cleanup add: 000055AE847C1250
2019/04/27 06:14:55 [debug] 23965#23965: malloc: 000055AE84703960:104
2019/04/27 06:14:55 [debug] 23965#23965: *7610 http finalize request: -4, "/pub/main/orders/ZRX-WETH?" a:1, c:2
2019/04/27 06:14:55 [debug] 23965#23965: *7610 http request count:2 blk:0
2019/04/27 06:14:55 [debug] 23965#23965: epoll: fd:14 ev:0004 d:000055AE84605BE0
2019/04/27 06:14:55 [debug] 23965#23965: epoll del event: fd:14 op:3 ev:80002001
2019/04/27 06:14:55 [debug] 23965#23965: timer delta: 0
2019/04/27 06:14:55 [debug] 23965#23965: worker cycle
2019/04/27 06:14:55 [debug] 23965#23965: epoll timer: 324
2019/04/27 06:14:55 [debug] 23965#23965: epoll: fd:14 ev:0001 d:000055AE84605BE0
2019/04/27 06:14:55 [debug] 23965#23965: *7609 HTTP/1.1 201 Created
Server: nginx
Date: Sat, 27 Apr 2019 06:14:55 GMT
Content-Type: text/json
Content-Length: 88
Connection: close

2019/04/27 06:14:55 [debug] 23965#23965: *7609 write new buf t:1 f:0 000055AE8472F878, pos 000055AE8472F878, size: 140 file: 0, size: 0
2019/04/27 06:14:55 [debug] 23965#23965: *7609 http write filter: l:0 f:0 s:140
2019/04/27 06:14:55 [debug] 23965#23965: *7609 http output filter "/pub/main/orders/ZRX-WETH?"
2019/04/27 06:14:55 [debug] 23965#23965: *7609 http copy filter: "/pub/main/orders/ZRX-WETH?"
2019/04/27 06:14:55 [debug] 23965#23965: *7609 http postpone filter "/pub/main/orders/ZRX-WETH?" 000055AE8472F818
2019/04/27 06:14:55 [debug] 23965#23965: *7609 http brotli filter
2019/04/27 06:14:55 [debug] 23965#23965: *7609 write old buf t:1 f:0 000055AE8472F878, pos 000055AE8472F878, size: 140 file: 0, size: 0
2019/04/27 06:14:55 [debug] 23965#23965: *7609 write new buf t:0 f:0 000055AE8368DB60, pos 000055AE8368DB60, size: 88 file: 0, size: 0
2019/04/27 06:14:55 [debug] 23965#23965: *7609 http write filter: l:1 f:1 s:228
2019/04/27 06:14:55 [debug] 23965#23965: *7609 http write filter limit 524288
2019/04/27 06:14:55 [debug] 23965#23965: *7609 writev: 228 of 228
2019/04/27 06:14:55 [debug] 23965#23965: *7609 http write filter 0000000000000000
2019/04/27 06:14:55 [debug] 23965#23965: *7609 http copy filter: 0 "/pub/main/orders/ZRX-WETH?"
2019/04/27 06:14:55 [debug] 23965#23965: *7609 http finalize request: 0, "/pub/main/orders/ZRX-WETH?" a:1, c:1
2019/04/27 06:14:55 [debug] 23965#23965: *7609 event timer add: 64: 5000:7842428
2019/04/27 06:14:55 [debug] 23965#23965: *7609 http lingering close handler
2019/04/27 06:14:55 [debug] 23965#23965: *7609 recv: eof:0, avail:1
2019/04/27 06:14:55 [debug] 23965#23965: *7609 recv: fd:64 -1 of 4096
2019/04/27 06:14:55 [debug] 23965#23965: *7609 recv() not ready (11: Resource temporarily unavailable)
2019/04/27 06:14:55 [debug] 23965#23965: *7609 lingering read: -2
2019/04/27 06:14:55 [debug] 23965#23965: *7609 event timer: 64, old: 7842428, new: 7842428
2019/04/27 06:14:55 [debug] 23965#23965: *7610 HTTP/1.1 201 Created
Server: nginx
Date: Sat, 27 Apr 2019 06:14:55 GMT
Content-Type: text/json
Content-Length: 88
Connection: close

2019/04/27 06:14:55 [debug] 23965#23965: *7610 write new buf t:1 f:0 000055AE847C1318, pos 000055AE847C1318, size: 140 file: 0, size: 0
2019/04/27 06:14:55 [debug] 23965#23965: *7610 http write filter: l:0 f:0 s:140
2019/04/27 06:14:55 [debug] 23965#23965: *7610 http output filter "/pub/main/orders/ZRX-WETH?"
2019/04/27 06:14:55 [debug] 23965#23965: *7610 http copy filter: "/pub/main/orders/ZRX-WETH?"
2019/04/27 06:14:55 [debug] 23965#23965: *7610 http postpone filter "/pub/main/orders/ZRX-WETH?" 000055AE847C12B8
2019/04/27 06:14:55 [debug] 23965#23965: *7610 http brotli filter
2019/04/27 06:14:55 [debug] 23965#23965: *7610 write old buf t:1 f:0 000055AE847C1318, pos 000055AE847C1318, size: 140 file: 0, size: 0
2019/04/27 06:14:55 [debug] 23965#23965: *7610 write new buf t:0 f:0 000055AE8368DB60, pos 000055AE8368DB60, size: 88 file: 0, size: 0
2019/04/27 06:14:55 [debug] 23965#23965: *7610 http write filter: l:1 f:1 s:228
2019/04/27 06:14:55 [debug] 23965#23965: *7610 http write filter limit 524288
2019/04/27 06:14:55 [debug] 23965#23965: *7610 writev: 228 of 228
2019/04/27 06:14:55 [debug] 23965#23965: *7610 http write filter 0000000000000000
2019/04/27 06:14:55 [debug] 23965#23965: *7610 http copy filter: 0 "/pub/main/orders/ZRX-WETH?"
2019/04/27 06:14:55 [debug] 23965#23965: *7610 http finalize request: 0, "/pub/main/orders/ZRX-WETH?" a:1, c:1
2019/04/27 06:14:55 [debug] 23965#23965: *7610 event timer add: 66: 5000:7842428
2019/04/27 06:14:55 [debug] 23965#23965: *7610 http lingering close handler
2019/04/27 06:14:55 [debug] 23965#23965: *7610 recv: eof:0, avail:1
2019/04/27 06:14:55 [debug] 23965#23965: *7610 recv: fd:66 -1 of 4096
2019/04/27 06:14:55 [debug] 23965#23965: *7610 recv() not ready (11: Resource temporarily unavailable)
2019/04/27 06:14:55 [debug] 23965#23965: *7610 lingering read: -2
2019/04/27 06:14:55 [debug] 23965#23965: *7610 event timer: 66, old: 7842428, new: 7842428
2019/04/27 06:14:55 [debug] 23965#23965: epoll: fd:3 ev:0001 d:000055AE84605DD0
2019/04/27 06:14:55 [debug] 23965#23965: REDISTORE: Get message from msgkey {channel:/main/orders/ZRX-WETH}:msg:1556345695:0
2019/04/27 06:14:55 [debug] 23965#23965: malloc: 000055AE8491BCE0:125
2019/04/27 06:14:55 [debug] 23965#23965: epoll add event: fd:14 op:3 ev:80002005
2019/04/27 06:14:55 [debug] 23965#23965: REDISTORE: Get message from msgkey {channel:/main/orders/ZRX-WETH}:msg:1556345695:1
2019/04/27 06:14:55 [debug] 23965#23965: malloc: 000055AE8478E190:125
2019/04/27 06:14:55 [debug] 23965#23965: timer delta: 0
2019/04/27 06:14:55 [debug] 23965#23965: worker cycle
2019/04/27 06:14:55 [debug] 23965#23965: epoll timer: 324
2019/04/27 06:14:55 [debug] 23965#23965: epoll: fd:14 ev:0004 d:000055AE84605BE0
2019/04/27 06:14:55 [debug] 23965#23965: epoll del event: fd:14 op:3 ev:80002001
2019/04/27 06:14:55 [debug] 23965#23965: timer delta: 0
2019/04/27 06:14:55 [debug] 23965#23965: worker cycle
2019/04/27 06:14:55 [debug] 23965#23965: epoll timer: 324
2019/04/27 06:14:55 [debug] 23965#23965: epoll: fd:14 ev:0001 d:000055AE84605BE0
2019/04/27 06:14:55 [debug] 23965#23965: REDISTORE: get_msg_from_msgkey_callback
2019/04/27 06:14:55 [error] 23965#23965: REDISTORE: invalid message redis reply
2019/04/27 06:14:55 [error] 23965#23965: REDISTORE: invalid message or message absent after get_msg_from_key
2019/04/27 06:14:55 [debug] 23965#23965: REDISTORE: get_msg_from_msgkey_callback
2019/04/27 06:14:55 [debug] 23965#23965: timer delta: 0
2019/04/27 06:14:55 [debug] 23965#23965: worker cycle
2019/04/27 06:14:55 [debug] 23965#23965: epoll timer: 324
2019/04/27 06:14:55 [debug] 23965#23965: epoll: fd:64 ev:2011 d:000055AE84605EC9
2019/04/27 06:14:55 [debug] 23965#23965: epoll_wait() error on fd:64 ev:2011
2019/04/27 06:14:55 [debug] 23965#23965: *7609 http lingering close handler
2019/04/27 06:14:55 [debug] 23965#23965: *7609 recv: eof:1, avail:1
2019/04/27 06:14:55 [debug] 23965#23965: *7609 recv: fd:64 0 of 4096
2019/04/27 06:14:55 [debug] 23965#23965: *7609 lingering read: 0
2019/04/27 06:14:55 [debug] 23965#23965: *7609 http request count:1 blk:0
2019/04/27 06:14:55 [debug] 23965#23965: *7609 http close request
2019/04/27 06:14:55 [debug] 23965#23965: *7609 http log handler
2019/04/27 06:14:55 [debug] 23965#23965: *7609 free: 000055AE8495A230
2019/04/27 06:14:55 [debug] 23965#23965: *7609 free: 000055AE84A2DB10
2019/04/27 06:14:55 [debug] 23965#23965: *7609 free: 000055AE8496CC60, unused: 4
2019/04/27 06:14:55 [debug] 23965#23965: *7609 free: 000055AE8472F240, unused: 2184
2019/04/27 06:14:55 [debug] 23965#23965: *7609 close http connection: 64
2019/04/27 06:14:55 [debug] 23965#23965: *7609 event timer del: 64: 7842428
2019/04/27 06:14:55 [debug] 23965#23965: *7609 reusable connection: 0
2019/04/27 06:14:55 [debug] 23965#23965: *7609 free: 000055AE846EB270
2019/04/27 06:14:55 [debug] 23965#23965: *7609 free: 000055AE847327D0, unused: 0
2019/04/27 06:14:55 [debug] 23965#23965: *7609 free: 000055AE845954B0, unused: 8
2019/04/27 06:14:55 [debug] 23965#23965: timer delta: 16
2019/04/27 06:14:55 [debug] 23965#23965: worker cycle
2019/04/27 06:14:55 [debug] 23965#23965: epoll timer: 308
2019/04/27 06:14:55 [debug] 23965#23965: epoll: fd:66 ev:2011 d:000055AE84606971
2019/04/27 06:14:55 [debug] 23965#23965: epoll_wait() error on fd:66 ev:2011
2019/04/27 06:14:55 [debug] 23965#23965: *7610 http lingering close handler
2019/04/27 06:14:55 [debug] 23965#23965: *7610 recv: eof:1, avail:1
2019/04/27 06:14:55 [debug] 23965#23965: *7610 recv: fd:66 0 of 4096
2019/04/27 06:14:55 [debug] 23965#23965: *7610 lingering read: 0
2019/04/27 06:14:55 [debug] 23965#23965: *7610 http request count:1 blk:0
2019/04/27 06:14:55 [debug] 23965#23965: *7610 http close request
2019/04/27 06:14:55 [debug] 23965#23965: *7610 http log handler
2019/04/27 06:14:55 [debug] 23965#23965: *7610 free: 000055AE84A096F0
2019/04/27 06:14:55 [debug] 23965#23965: *7610 free: 000055AE8493A510
2019/04/27 06:14:55 [debug] 23965#23965: *7610 free: 000055AE847BFCD0, unused: 4
2019/04/27 06:14:55 [debug] 23965#23965: *7610 free: 000055AE847C0CE0, unused: 2184
2019/04/27 06:14:55 [debug] 23965#23965: *7610 close http connection: 66
2019/04/27 06:14:55 [debug] 23965#23965: *7610 event timer del: 66: 7842428
2019/04/27 06:14:55 [debug] 23965#23965: *7610 reusable connection: 0
2019/04/27 06:14:55 [debug] 23965#23965: *7610 free: 000055AE84A85B40
2019/04/27 06:14:55 [debug] 23965#23965: *7610 free: 000055AE846E92B0, unused: 0
2019/04/27 06:14:55 [debug] 23965#23965: *7610 free: 000055AE847680C0, unused: 8
2019/04/27 06:14:55 [debug] 23965#23965: timer delta: 0
2019/04/27 06:14:55 [debug] 23965#23965: worker cycle
2019/04/27 06:14:55 [debug] 23965#23965: epoll timer: 308
2019/04/27 06:14:55 [debug] 23965#23965: epoll: fd:6 ev:0001 d:000055AE846058F8
2019/04/27 06:14:55 [debug] 23965#23965: accept on unix:/var/run/nginx.sock, ready: 1
2019/04/27 06:14:55 [debug] 23965#23965: posix_memalign: 000055AE847680C0:256 @16
2019/04/27 06:14:55 [debug] 23965#23965: posix_memalign: 000055AE846E92B0:256 @16
2019/04/27 06:14:55 [debug] 23965#23965: *7611 accept: unix: fd:64
2019/04/27 06:14:55 [debug] 23965#23965: *7611 event timer add: 64: 60000:7897528
2019/04/27 06:14:55 [debug] 23965#23965: *7611 reusable connection: 1
2019/04/27 06:14:55 [debug] 23965#23965: *7611 epoll add event: fd:64 op:1 ev:80002001
2019/04/27 06:14:55 [debug] 23965#23965: accept() not ready (11: Resource temporarily unavailable)
2019/04/27 06:14:55 [debug] 23965#23965: timer delta: 84
2019/04/27 06:14:55 [debug] 23965#23965: worker cycle
2019/04/27 06:14:55 [debug] 23965#23965: epoll timer: 224
2019/04/27 06:14:55 [debug] 23965#23965: epoll: fd:64 ev:0001 d:000055AE84606970
2019/04/27 06:14:55 [debug] 23965#23965: *7611 http wait request handler
2019/04/27 06:14:55 [debug] 23965#23965: *7611 malloc: 000055AE8496CC60:4096
2019/04/27 06:14:55 [debug] 23965#23965: *7611 recv: eof:0, avail:1
2019/04/27 06:14:55 [debug] 23965#23965: *7611 recv: fd:64 1095 of 4096
2019/04/27 06:14:55 [debug] 23965#23965: *7611 reusable connection: 0
2019/04/27 06:14:55 [debug] 23965#23965: *7611 posix_memalign: 000055AE846EB270:4096 @16
2019/04/27 06:14:55 [debug] 23965#23965: *7611 http process request line
2019/04/27 06:14:55 [debug] 23965#23965: *7611 http request line: "POST /pub/main/zrxSra/orderUpdate HTTP/1.1"
2019/04/27 06:14:55 [debug] 23965#23965: *7611 http uri: "/pub/main/zrxSra/orderUpdate"
2019/04/27 06:14:55 [debug] 23965#23965: *7611 http args: ""
2019/04/27 06:14:55 [debug] 23965#23965: *7611 http exten: ""
2019/04/27 06:14:55 [debug] 23965#23965: *7611 posix_memalign: 000055AE8472F240:4096 @16
2019/04/27 06:14:55 [debug] 23965#23965: *7611 http process request header line
2019/04/27 06:14:55 [debug] 23965#23965: *7611 http header: "Content-Type: application/json"
2019/04/27 06:14:55 [debug] 23965#23965: *7611 http header: "Accept: text/json"
2019/04/27 06:14:55 [debug] 23965#23965: *7611 http header: "host: unix"
2019/04/27 06:14:55 [debug] 23965#23965: *7611 http header: "content-length: 946"
2019/04/27 06:14:55 [debug] 23965#23965: *7611 http header: "Connection: close"
2019/04/27 06:14:55 [debug] 23965#23965: *7611 http header done
2019/04/27 06:14:55 [debug] 23965#23965: *7611 event timer del: 64: 7897528
2019/04/27 06:14:55 [debug] 23965#23965: *7611 rewrite phase: 0
2019/04/27 06:14:55 [debug] 23965#23965: *7611 test location: ~ "/pub/([A-Za-z0-9\-\.\/]+)"
2019/04/27 06:14:55 [debug] 23965#23965: *7611 using configuration "/pub/([A-Za-z0-9\-\.\/]+)"
2019/04/27 06:14:55 [debug] 23965#23965: *7611 http cl:946 max:134217728
2019/04/27 06:14:55 [debug] 23965#23965: *7611 rewrite phase: 2
2019/04/27 06:14:55 [debug] 23965#23965: *7611 post rewrite phase: 3
2019/04/27 06:14:55 [debug] 23965#23965: *7611 generic phase: 4
2019/04/27 06:14:55 [debug] 23965#23965: *7611 generic phase: 5
2019/04/27 06:14:55 [debug] 23965#23965: *7611 access phase: 6
2019/04/27 06:14:55 [debug] 23965#23965: *7611 access phase: 7
2019/04/27 06:14:55 [debug] 23965#23965: *7611 access phase: 8
2019/04/27 06:14:55 [debug] 23965#23965: *7611 post access phase: 9
2019/04/27 06:14:55 [debug] 23965#23965: *7611 generic phase: 10
2019/04/27 06:14:55 [debug] 23965#23965: *7611 generic phase: 11
2019/04/27 06:14:55 [debug] 23965#23965: *7611 http client request body preread 946
2019/04/27 06:14:55 [debug] 23965#23965: *7611 http request body content length filter
2019/04/27 06:14:55 [debug] 23965#23965: *7611 http body new buf t:1 f:0 000055AE8496CCF5, pos 000055AE8496CCF5, size: 946 file: 0, size: 0
2019/04/27 06:14:55 [debug] 23965#23965: *7611 http script capture: "main/zrxSra/orderUpdate"
2019/04/27 06:14:55 [debug] 23965#23965: malloc: 000055AE846905D0:16
2019/04/27 06:14:55 [debug] 23965#23965: *7611 http cleanup add: 000055AE846EC218
2019/04/27 06:14:55 [debug] 23965#23965: malloc: 000055AE846F4700:104
2019/04/27 06:14:55 [debug] 23965#23965: epoll add event: fd:14 op:3 ev:80002005
2019/04/27 06:14:55 [debug] 23965#23965: *7611 http finalize request: -4, "/pub/main/zrxSra/orderUpdate?" a:1, c:2
2019/04/27 06:14:55 [debug] 23965#23965: *7611 http request count:2 blk:0
2019/04/27 06:14:55 [debug] 23965#23965: timer delta: 12
2019/04/27 06:14:55 [debug] 23965#23965: worker cycle
2019/04/27 06:14:55 [debug] 23965#23965: epoll timer: 212
2019/04/27 06:14:55 [debug] 23965#23965: epoll: fd:14 ev:0004 d:000055AE84605BE0
2019/04/27 06:14:55 [debug] 23965#23965: epoll del event: fd:14 op:3 ev:80002001
2019/04/27 06:14:55 [debug] 23965#23965: timer delta: 4
2019/04/27 06:14:55 [debug] 23965#23965: worker cycle
2019/04/27 06:14:55 [debug] 23965#23965: epoll timer: 208
2019/04/27 06:14:55 [debug] 23965#23965: epoll: fd:14 ev:0001 d:000055AE84605BE0
2019/04/27 06:14:55 [debug] 23965#23965: *7611 HTTP/1.1 201 Created
Server: nginx
Date: Sat, 27 Apr 2019 06:14:55 GMT
Content-Type: text/json
Content-Length: 91
Connection: close

Arctek avatar Apr 27 '19 06:04 Arctek

@Arctek: Ok, this is going to take a few more rounds. Please rebuild from the debug-522 branch and report the error log. Don't set it to debug mode; it doesn't give me any useful information.

slact avatar Apr 29 '19 01:04 slact

No problem, just ran it now and got:

2019/04/29 02:06:13 [error] 6439#6439: REDISTORE: invalid message redis reply
2019/04/29 02:06:13 [error] 6439#6439: nchan: Redis node 127.0.0.1:6379 REDIS_REPLY_ARRAY: 9
2019/04/29 02:06:13 [error] 6439#6439: nchan: Redis node 127.0.0.1:6379 REDIS_REPLY_INTEGER: -2
2019/04/29 02:06:13 [error] 6439#6439: nchan: Redis node 127.0.0.1:6379 REDIS_REPLY_NIL: nil
2019/04/29 02:06:13 [error] 6439#6439: nchan: Redis node 127.0.0.1:6379 REDIS_REPLY_NIL: nil
2019/04/29 02:06:13 [error] 6439#6439: nchan: Redis node 127.0.0.1:6379 REDIS_REPLY_INTEGER: 0
2019/04/29 02:06:13 [error] 6439#6439: nchan: Redis node 127.0.0.1:6379 REDIS_REPLY_INTEGER: 0
2019/04/29 02:06:13 [error] 6439#6439: nchan: Redis node 127.0.0.1:6379 REDIS_REPLY_STRING: 
2019/04/29 02:06:13 [error] 6439#6439: nchan: Redis node 127.0.0.1:6379 REDIS_REPLY_STRING: 
2019/04/29 02:06:13 [error] 6439#6439: nchan: Redis node 127.0.0.1:6379 REDIS_REPLY_STRING: 
2019/04/29 02:06:13 [error] 6439#6439: nchan: Redis node 127.0.0.1:6379 REDIS_REPLY_INTEGER: 0
2019/04/29 02:06:13 [error] 6439#6439: REDISTORE: reply id(time) isn't an int or str
2019/04/29 02:06:13 [error] 6439#6439: REDISTORE: reply id(tag) isn't an int or str
2019/04/29 02:06:13 [error] 6439#6439: REDISTORE: invalid message or message absent after get_msg_from_key

Redis doesn't have any memory limits set either, so there shouldn't be any cache-eviction issues on that side.

Arctek avatar Apr 29 '19 02:04 Arctek

one more round -- pull, rebuild, and show me the error log

slact avatar Apr 30 '19 01:04 slact

Hi,

Looks to be the same

2019/04/30 02:00:50 [error] 8916#8916: REDISTORE: invalid message redis reply
2019/04/30 02:00:50 [error] 8916#8916: nchan: Redis node 127.0.0.1:6379 REDIS_REPLY_ARRAY: 9
2019/04/30 02:00:50 [error] 8916#8916: nchan: Redis node 127.0.0.1:6379 REDIS_REPLY_INTEGER: -2
2019/04/30 02:00:50 [error] 8916#8916: nchan: Redis node 127.0.0.1:6379 REDIS_REPLY_NIL: nil
2019/04/30 02:00:50 [error] 8916#8916: nchan: Redis node 127.0.0.1:6379 REDIS_REPLY_NIL: nil
2019/04/30 02:00:50 [error] 8916#8916: nchan: Redis node 127.0.0.1:6379 REDIS_REPLY_INTEGER: 0
2019/04/30 02:00:50 [error] 8916#8916: nchan: Redis node 127.0.0.1:6379 REDIS_REPLY_INTEGER: 0
2019/04/30 02:00:50 [error] 8916#8916: nchan: Redis node 127.0.0.1:6379 REDIS_REPLY_STRING: 
2019/04/30 02:00:50 [error] 8916#8916: nchan: Redis node 127.0.0.1:6379 REDIS_REPLY_STRING: 
2019/04/30 02:00:50 [error] 8916#8916: nchan: Redis node 127.0.0.1:6379 REDIS_REPLY_STRING: 
2019/04/30 02:00:50 [error] 8916#8916: nchan: Redis node 127.0.0.1:6379 REDIS_REPLY_INTEGER: 0
2019/04/30 02:00:50 [error] 8916#8916: REDISTORE: reply id(time) isn't an int or str
2019/04/30 02:00:50 [error] 8916#8916: REDISTORE: reply id(tag) isn't an int or str
2019/04/30 02:00:50 [error] 8916#8916: REDISTORE: invalid message or message absent after get_msg_from_key
2019/04/30 02:00:50 [error] 8916#8916: REDISTORE: invalid message redis reply
2019/04/30 02:00:50 [error] 8916#8916: nchan: Redis node 127.0.0.1:6379 REDIS_REPLY_ARRAY: 9
2019/04/30 02:00:50 [error] 8916#8916: nchan: Redis node 127.0.0.1:6379 REDIS_REPLY_INTEGER: -2
2019/04/30 02:00:50 [error] 8916#8916: nchan: Redis node 127.0.0.1:6379 REDIS_REPLY_NIL: nil
2019/04/30 02:00:50 [error] 8916#8916: nchan: Redis node 127.0.0.1:6379 REDIS_REPLY_NIL: nil
2019/04/30 02:00:50 [error] 8916#8916: nchan: Redis node 127.0.0.1:6379 REDIS_REPLY_INTEGER: 0
2019/04/30 02:00:50 [error] 8916#8916: nchan: Redis node 127.0.0.1:6379 REDIS_REPLY_INTEGER: 0
2019/04/30 02:00:50 [error] 8916#8916: nchan: Redis node 127.0.0.1:6379 REDIS_REPLY_STRING: 
2019/04/30 02:00:50 [error] 8916#8916: nchan: Redis node 127.0.0.1:6379 REDIS_REPLY_STRING: 
2019/04/30 02:00:50 [error] 8916#8916: nchan: Redis node 127.0.0.1:6379 REDIS_REPLY_STRING: 
2019/04/30 02:00:50 [error] 8916#8916: nchan: Redis node 127.0.0.1:6379 REDIS_REPLY_INTEGER: 0
2019/04/30 02:00:50 [error] 8916#8916: REDISTORE: reply id(time) isn't an int or str
2019/04/30 02:00:50 [error] 8916#8916: REDISTORE: reply id(tag) isn't an int or str
2019/04/30 02:00:50 [error] 8916#8916: REDISTORE: invalid message or message absent after get_msg_from_key

Arctek avatar Apr 30 '19 02:04 Arctek

Alright folks, I think I've got it.

The issue occurs when a Redis-backed channel has subscribers that can't keep up with the message publishing rate, and fall behind by more than the max number of buffered messages. The fix for this is in fix-522.
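
For anyone hitting this before upgrading, the failure mode described above suggests a possible mitigation, if the application can tolerate it: retain more than one message per channel so a subscriber that briefly falls behind the publish rate still finds the messages it missed. A minimal sketch, assuming a publisher location like the one seen in the debug log earlier in this thread:

location ~ "/pub/([A-Za-z0-9\-\.\/]+)" {
    nchan_publisher;
    nchan_channel_id /$1;
    nchan_redis_pass redis_backend;
    # Keep a deeper per-channel buffer so slow subscribers have room to catch up.
    nchan_message_buffer_length 10;
    # How long unconsumed messages are retained before they expire.
    nchan_message_timeout 30s;
}

This only reduces how often a subscriber can fall behind by more than the buffered history; it does not replace the fix itself.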

slact avatar May 09 '19 10:05 slact

Looks a lot better, thanks!

I'm still seeing those error messages in the log - should they still be present with the fix?

Arctek avatar May 09 '19 23:05 Arctek

When can we expect a release incorporating this fix? It would be very nice if I could rebuild my Docker images with an official release tag. :)

arpiecodes avatar May 26 '19 19:05 arpiecodes

The fix is great - although it's still occurring under some conditions - just a whole lot less often than before. So I'm still having to restart nginx about once a day.

Arctek avatar May 27 '19 03:05 Arctek

The fix is great - although it's still occurring under some conditions

Then it's not fixed yet. Are these new conditions reproducible?

slact avatar May 29 '19 01:05 slact

No, it's just happening less often now with the same message in the error logs.

But I haven't been able to work out what is causing it, as it seems to affect channels at random now.

Not sure how to best track it down - previously it was happening pretty reliably on the same channels each time. Any ideas on how I can help?

Arctek avatar May 29 '19 01:05 Arctek

It looks like the individual client's websocket connection might freeze, but upon disconnecting and then reconnecting to the channel the client receives messages again (until it stops again).

Whereas before your last change, disconnecting/reconnecting didn't help and nginx had to be restarted.

Arctek avatar May 29 '19 06:05 Arctek

We went into production with v1.2.6. Initially it seemed to be working fine, but it's still 'hanging' sometimes on particular channels on particular nodes.

The situation is as follows: we have two nchan servers in production, and in this particular case I've just encountered, one of them stops publishing to certain client channels. If I force that client to go to the second server and join the exact same channel, publishing to that client works correctly. When the client is subsequently put back on the 'broken' server, it again does not receive any new messages on this channel. Other, unrelated client channels seem to publish fine. As soon as nchan on the 'broken' server is reloaded, it magically starts working again and all is fine.

The nchan servers are both linked to the same Redis master (actually two: one master and one slave, managed by Sentinel). Redis seems to be completely fine (no failovers or log entries have occurred since the initial service start), and neither server displays any weird behaviour. Receiving messages from clients (sub) also works fine without any problems, so the WebSocket itself seems to be absolutely fine.
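
For reference, the shared-Redis topology described here amounts to both nginx instances pointing their nchan locations at the same Redis upstream. A rough sketch (the hostname is a placeholder and the Sentinel configuration is omitted):

# Identical on both nchan servers; redis-master.internal is a placeholder.
upstream nchan_redis {
    nchan_redis_server redis://redis-master.internal:6379;
}

server {
    listen 80;

    location ~ "/sub/([A-Za-z0-9\-\.\/]+)" {
        nchan_subscriber websocket;
        nchan_channel_id /$1;
        # Both servers share channel state through the same Redis master, so a
        # message published via either server should reach subscribers on both.
        nchan_redis_pass nchan_redis;
    }
}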

'broken' server nchan_stub_status:
total published messages: 98
stored messages: 0
shared memory used: 20K
shared memory limit: 1048576K
channels: 18
subscribers: 5
redis pending commands: 0
redis connected servers: 32
total interprocess alerts received: 744722
interprocess alerts in transit: 8
interprocess queued alerts: 0
total interprocess send delay: 0
total interprocess receive delay: 0
nchan version: 1.2.6

'working' second server nchan_stub_status:
total published messages: 16
stored messages: 0
shared memory used: 20K
shared memory limit: 1048576K
channels: 12
subscribers: 4
redis pending commands: 0
redis connected servers: 32
total interprocess alerts received: 109943
interprocess alerts in transit: 0
interprocess queued alerts: 0
total interprocess send delay: 0
total interprocess receive delay: 0
nchan version: 1.2.6

I guess it happens more often the more traffic a server gets to process.

Each server has 16 workers, so a total of 16 connections per Redis instance. There are no errors to be found in any of the logs, except an incidental 'SUB:WEBSOCKET:ws_recv NOT OK' when receiving a payload and a 'Connection closed prematurely', but I also see those on the other, 'working' server (and sometimes we're under heavy port-scanning activity, so they could be false alarms).

For now we will probably have to reload the queue servers every day as well to mitigate the issue.

Let me know what would help in your effort to completely resolve the issue. :-)

sg-jorrit avatar Jul 22 '19 10:07 sg-jorrit

I'm associated with the same deployment as @sg-jorrit and have tried reading a little of the code.

Just looking at https://github.com/slact/nchan/commit/f1a25dfba4d45bfa7cd311ad27554583d7dadf42, I think it is remarkable that spool_fetch_msg() is called with weaker preconditions than at the existing call sites: the spool's msg_status is not ensured to be MSG_INVALID. This might cause fetchmsg_ev_handler() (after the stack overflow protector has triggered) to do nothing, or cause other problems.

Note that the stack overflow protector triggers if there are more than 20 calls in a millisecond (or as long as nginx does not update its cached time), not only if there are actually recursive calls.

Apart from the stack overflow threat, the recursive spool_fetch_msg() calls make it harder to ensure multiple channels are served fairly under high load.

jilles-sg avatar Jul 22 '19 13:07 jilles-sg

Hi @sg-jorrit @jilles-sg, did you solve this issue? Is the workaround to reload nginx periodically?

xmglin avatar Aug 27 '19 02:08 xmglin

If you are only using the one Redis server, then you can set nchan_redis_storage_mode to backup instead.

This is what we have done, and it has been running fine for 4 months now.
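
For reference, the change amounts to a single directive on the nchan locations that use Redis. A minimal sketch, assuming a publisher location like the one matched in the debug log earlier in the thread (not the actual production config):

location ~ "/pub/([A-Za-z0-9\-\.\/]+)" {
    nchan_publisher;
    nchan_channel_id /$1;
    nchan_redis_pass redis_backend;
    # 'backup' serves channels from nginx's own shared memory and uses Redis only
    # to back up messages, instead of the default 'distributed' mode where channel
    # state is coordinated through Redis.
    nchan_redis_storage_mode backup;
}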

Arctek avatar Aug 27 '19 03:08 Arctek

Thanks for your suggestion, @Arctek. Yes, currently my deployment has only one Redis instance. I will try it immediately.

xmglin avatar Aug 27 '19 03:08 xmglin

@jilles-sg: your observations are as astute and accurate as my response is late. I'm not too proud of that fix (f1a25df), and I will address it in a future release. Actually, the whole message fetching system is getting a rewrite, which will also address your concerns about fairness. Nchan is going to get much, much better in the coming months.

slact avatar Nov 28 '19 06:11 slact

Small update from our side. As traffic increased over time, the bug has become unbearable. Even a daily restart of the nchan instances no longer helps prevent it from happening. We already tried changing the Redis mode to nostore, but this doesn't seem to help. For now we are going to disconnect the nchan nodes from Redis and publish each message to every instance separately.
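
For completeness, running nchan without Redis just means dropping the Redis directives (nchan_redis_pass / nchan_use_redis) from the pub/sub locations, so each instance keeps its own in-memory channel store. A rough sketch; with this setup the application has to POST every message to each instance's pub endpoint itself:

# Per-instance, in-memory only: no Redis directives, so channel state lives
# solely in this nginx instance's shared memory.
location ~ "/pub/([A-Za-z0-9\-\.\/]+)" {
    nchan_publisher;
    nchan_channel_id /$1;
    nchan_message_buffer_length 10;
}

location ~ "/sub/([A-Za-z0-9\-\.\/]+)" {
    nchan_subscriber websocket;
    nchan_channel_id /$1;
}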

sg-jorrit avatar Dec 06 '19 14:12 sg-jorrit