socketcluster ping requests are sent even when traffic is ongoing

socketcluster would send ping requests (and expect responses in time) even when traffic is ongoing on a connection. In general, ping keep-alive packets have two purposes:

to check whether the peer is still there, e.g. to check that he has not dropped / crashed / disconnected from the network
to keep the state in stateful firewalls

so, it does not make sense to send keep-alive packets when other traffic is being transmitted over the connection. It only makes sense when the connection is otherwise idle. From https://en.wikipedia.org/wiki/Keepalive:

Typically TCP Keepalives are sent every 45 or 60 seconds on an idle TCP connection

On top of being useless, the behavior can cause disconnects due to bogus ping timeouts when heavy traffic is being transmitted - the ping packets could get delayed beyond pingTimeout if lots of data is being sent/received.

Jul 15 '19 16:07 vasild

The reason why it always sends ping/pong (instead of waiting for the connection to go idle first) is because it makes the code simpler and more consistent and sending pings/pongs a few times a minute is cheap.

Heavy traffic should not cause ping/pongs to be missed if you have a safe margin between pingTimeout and pingInterval. It's possible that a very large message from a client (e.g. several tens or hundreds of megabytes in a single message) could cause a pingTimeout but in that case, it indicates that the connection was saturated so a timeout is the appropriate response to protect the server from having to spend more time servicing that (potentially malicious) socket.

Also, as mentioned in the other issue, the client relies on the absence of a ping to detect a lost connection; otherwise to detect disconnection on both sides, we'd have to have pings and pongs going in both directions which is more expensive and creates more edge cases (server ping <-> client pong, client ping <-> server pong).

Jul 15 '19 19:07 jondubois

I agree that sending ping regularly is easier to implement and the code is simpler, but I think that in this particular case the effects of it go beyond "design decision" and into the "bug" category. Just consider this - the connection would be dropped due to a ping timeout (aka "the other side has gone away") while traffic is ongoing.

It's possible that a very large message from a client (e.g. several tens or hundreds of megabytes in a single message) could cause a pingTimeout but in that case, it indicates that the connection was saturated so a timeout is the appropriate response to protect the server from having to spend more time servicing that (potentially malicious) socket.

A 20 MiB message is sufficient to trigger the timeout with the default pingTimeout of 20 seconds if the transfer speed is 1 MiB/sec. In other words, it is not possible to transfer messages larger than 20 MiB. Just consider any other library, protocol or application that transfers data over a network - ftp, http, apache, openssl, filezilla, curl, wget, younameit - none of those exhibit such a behavior.

The correct behavior is to just keep going if data is being sent or received through the connection.

Jul 17 '19 08:07 vasild

In the above case the connection is dropped with an obscure message: Error [BadConnectionError]: Event '...' was aborted due to a bad connection

Jul 17 '19 09:07 vasild

I agree with @vasild that it seems more like a bug than a design decision.

We could keep sending pings all the time, because it's easy and cheap, and simply add more validations before declaring a "bad connection", verifying that we are not in a middle of a transfer, and if we are simply wait for the next ping (or something like that)..

Aug 07 '19 15:08 iMoses

@iMoses You can allow multiple failed attempts without disconnecting. For example, if you set pingTimeout to 30000 (30 seconds) and pingInterval to 9000 then you can miss 2 pings without disconnection. It will only disconnect if you miss 3 full consecutive pings.

Aug 07 '19 16:08 jondubois

it is not possible to transfer messages larger than 20 MiB

It is possible but not recommended due to reasons beyond the pingTimeout. SocketCluster is optimized for handling JSON; objects larger than 20MB can take a while to parse and because JSON parsing is typically a synchronous operation in Node.js, it's often not a good idea to send such large objects in a single message.

That said, if necessary, it's possible to increase the pingTimeout to allow messages bigger than 20MB so it's not a hard limit.

Synchronous operations which freeze the event loop for too long should also cause pingTimeout. When it comes to healthchecks; if you need to check that a process is healthy, it's not enough to detect that it's running, you also need to check that it's responsive (e.g. not stuck in an infinite loop). The ping/pong should be thought of as a healthcheck for the connection; it's not only about connectivity.

Aug 07 '19 23:08 jondubois

@jondubois,

Network throughput can be highly volatile. The above example of 20MiB message is if the transfer speed is 1 MiB/sec - I am sure you know that this can vary a lot, either it could be a slower connection (e.g. on mobile or whatever, just slower), or it could in general be faster but occasionally have reduced throughput, maybe because being overloaded or for other reasons. For example one 20MiB message would transfer for the same time as one 2MiB, if 10 such smaller messages are transferred at the same time. If the app transfers 100 messages "concurrently", then we go to 200KiB limit per message on a stable 1MiB/sec connection.

If we have to set hard limits with pingTimeout and pingInterval, then we practically declare "it will not be possible to transfer a message if it takes longer than X seconds" whereas X will not adapt to network conditions. Safely choosing such a X to account for all throughput fluctuations means to set it so high, that the ping timeout is practically disabled.

Aug 08 '19 05:08 vasild

For front end use cases, it's not unreasonable to timeout clients if a single overly large packet is taking too long (e.g. > 20 seconds) to transfer across the network because that would result in a UX issue anyway.

For backend usage of SC, the case is stronger. The problem there in terms of implementation is that most popular client WebSocket APIs (which SC depends on) do not have an event for when individual TCP packets are received; you either get the whole message (based on the WebSocket frame) or you get nothing. This makes implementation difficult. Alternative implementations are more error prone.

It may be possible to find a middle ground and allow the user of SC to set an option to extend ping whenever a whole message is received (that way either pings or messages will prevent timeout) - This is what socket.io does. There is a small performance penalty for doing this but it's not too bad.

Aug 08 '19 06:08 jondubois