XHR Polling misbehaves under (very) poor network conditions

Open exFalso opened this issue 10 years ago • 0 comments

A bit of context: We started using a SockJS based protocol in production a couple of weeks ago. After a couple of days of monitoring we noticed some clients exhibiting weird behaviour in rare cases: it seemed as if newly connected clients started in the middle of the application-layer protocol rather than with a regular handshake. After a bit of debugging we realized that the cause is that sometimes the XHR polling client sends a delayed (> 5 second delay) poll request, which is treated as a new connection. If in the meantime an /xhr_send also happens then on the server this is treated as the first message of a newly connected client. The server side SockJS is our own implementation so I checked with the reference sockjs-node implementation and could reproduce the same issue.

Steps to reproduce: I'll use the example echo service from the https://github.com/sockjs/sockjs-node README. The poor network condition is simulated with a port-forwarding proxy.

Start server on port 9999
Start port forward on localhost from 9998 to 9999
$ curl -X POST localhost:9998/echo/000/000/xhr open frame
$ curl -X POST localhost:9998/echo/000/000/xhr poll
$ curl localhost:9998/echo/000/000/xhr_send --data '["Hello"]' -H content-type:text/plain poll returns, server is waiting 5 seconds for next poll before close
Send SIGSTOP to the proxy. The port will stay open but data sent will be buffered by the OS
$ curl -X POST localhost:9998/echo/000/000/xhr poll, does not reach server
$ curl localhost:9998/echo/000/000/xhr_send --data '["World!"]' -H content-type:text/plain hangs
Wait 5 seconds so that server drops connection
Send SIGCONT to port forward. This will cause the /xhr and /xhr_send requests to be delivered (hopefully in this order). The /xhr request will open a new connection with the same session_id (return code 200) and the /xhr_send will send the message (return code 204), most probably causing havoc in the application layer protocol.

Solution proposal 1: Move the "open connection" functionality to a new endpoint, e.g. /xhr_open. If a rogue /xhr request arrives the server can simply disregard it. I think this is The Right Way to solve this, however it is obviously not backwards-compatible. Maybe keep the the original behaviour around as deprecated for a few versions? Solution proposal 2: Keep around a set of recently-closed-session_ids on the server. New connections opened with such a session_id should be rejected. This has the benefit of being backwards compatible, however it adds additional complexity to the server and I reckon it also makes testing more cumbersome, as they may rely on being able to instantly open a new connection with the same session_id.

Feb 05 '16 13:02 exFalso