20-second delay
Hi everybody,
you may have already seen some issues from me, so you might know that we run a real-time trivia app. Our app is getting a lot of attention at the moment and the first problems are arising.
Our trivia app publishes a JSON object to all connected users. It's just a question and 3 answers. This works great; we currently have 2,500 users playing live. From time to time we receive mails from people complaining that they "did not receive a question".
My first question (let's assume they did not disconnect and every watcher is working fine): is it possible for a publish not to be delivered to a user? This is a serious question.
The second and much bigger issue: currently, the user has 10 seconds to pick an answer. After 10 seconds, the user can't emit to the server (blocked inside the app). But the user could choose their answer at 9.99 seconds. On the server (worker.js) we've added another 3,500 ms of "latency" buffer, meaning we still accept a user's answer if it arrives up to 13.5 seconds after publishing.
Today we had around 2,500 live users playing (we will probably have 50,000 users in about 30 days). We received angry mails because answers hadn't been accepted. After checking our logs, we realized that the latest message arrived 28 seconds late.
So 2,500 users chose an answer 1-10 seconds after publishing. How is it possible that the last message arrived 28 seconds after emitting? Is this common? Are we doing something wrong?
We've seen that 99% of messages are checked against Redis in time, but around 1% of the answers were processed after 15 seconds, some at 18, and the longest took, as already mentioned, 28 seconds.
So every message arriving more than 13,500 ms after the publish event is declined.
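The cutoff logic described above can be sketched as a pure timestamp check. This is a minimal sketch, not our actual worker.js code; the function and constant names are my own:

```javascript
// Answer-window check: a question is published with a server timestamp,
// and an answer is accepted only if it arrives within the window
// (10 s for the player plus a 3,500 ms latency buffer = 13,500 ms).
const ANSWER_WINDOW_MS = 10000;
const LATENCY_BUFFER_MS = 3500;

function isAnswerAccepted(publishedAtMs, receivedAtMs) {
  const elapsed = receivedAtMs - publishedAtMs;
  return elapsed >= 0 && elapsed <= ANSWER_WINDOW_MS + LATENCY_BUFFER_MS;
}

// An answer arriving 9.99 s after publishing is accepted:
console.log(isAnswerAccepted(0, 9990));  // true
// Exactly at the 13,500 ms cutoff is still accepted:
console.log(isAnswerAccepted(0, 13500)); // true
// A 28 s straggler is declined:
console.log(isAnswerAccepted(0, 28000)); // false
```

The key point is that the decision is based on the server-side publish timestamp, so client clock skew doesn't matter; only the transport delay does.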
What should we do? We're pretty lost on this. I've seen issues suggesting it might be a good idea to put a setTimeout of around 1-500 ms on the PUBLISH_OUT middleware, but in our testing, every message published from the server arrived nearly in real time. Speaking for myself, I've never had an issue with delayed answers when playing or testing, but the logs are real and so is the issue.
Any ideas, advices, thoughts?
Some more facts I forgot to mention:

- Installed server/client version: 12.0.0
- Server: 1 single server currently, fully dedicated (Redis also runs on its own fully dedicated server)
- AMD Ryzen 7 1700X Octa-Core "Summit Ridge" (Zen), 64 GB DDR4 RAM, 500 GB SSD
14 workers, 2 brokers, all on the same machine. There is no code inside broker.js except what the CLI generated via "socketcluster create app". We're using sc-uws.
I should also mention that nginx runs on the same server as SocketCluster, since we're using SSL and simply proxy_pass everything from nginx to Node. We've heavily tuned sysctl.conf and limits.conf to the maximum for a Linux distro, based on hundreds of tutorials from the net.
The clients connect on port 443 via an https domain, and nginx passes everything on to the Node/SocketCluster process. I don't know whether this could also be a reason for the delayed arrivals.
We have a guy upstairs whose app is experiencing this exact same thing. I've pointed him to this issue. So far I haven't seen anything in our app, but who knows. We have a lot of traffic, but much of it is editing things in the DB, where response time isn't important.
> My first question (let's assume they did not disconnect and every watcher is working fine): is it possible for a publish not to be delivered to a user? This is a serious question.
If you're running SC on a single machine, the only realistic case where this can happen is if one of the processes crashes (e.g. because of an uncaught error) but that should be pretty obvious from the logs.
> The second and much bigger issue: currently, the user has 10 seconds to pick an answer...
Maybe the user's socket was disconnected for a few seconds and their answer was sent after a delay (when the connection came back).
Maybe your server is overloaded since you are running multiple programs on it; what is the CPU usage on that machine (e.g. with the top command)?
0.2% CPU; the server is only running nginx and SocketCluster (via "forever"). MySQL, Redis, etc. run on their own dedicated servers.
How is Redis and MySQL performance?
MySQL is not called when answers come in, only Redis, which runs on a dedicated 8-core Intel i7 with an SSD and 64 GB DDR4 RAM. We've run millions of requests against Redis in under a second; Redis is extremely performant.
I'm having the same problem and I've spent days working on it. I do not, by any means, have the same traffic as the OP. I'm still developing the application that uses SC. However, I've been trying to isolate the problem by measuring the time it takes for the client's request to reach the server.
My method of tracking latency is not super accurate, but it's accurate enough for me to determine that the problem does not exist in any code that I've written. My server is also not very powerful, and it doesn't need to be, for multiple reasons:
- It's still under development.
- At any time, the most users that we have using the application is 5.
- SC is being used instead of a regular REST API.
- Its traffic isn't anything close to that of a chat application, or a game where state needs to be shared across multiple clients. It's just a simple REST API replacement.
Regardless of these factors, the responsiveness and performance of client/server communication has not been consistent. My investigation has shown that the lag I'm seeing can occur in one or both of these situations:
- It takes a long time (seconds) for the client's request to reach the event handler in the worker.
- It takes a long time (seconds) for the server's response to the client's message to reach the client.
The performance problems are intermittent. The server, under normal load, hasn't once gone over 15% of its overall CPU capacity. RAM consumption usually hovers around 17% and doesn't have any wild fluctuations. The SC-run application is behind an NGINX reverse proxy, and that's been updated, reconfigured, and tested to see if it was responsible for the performance issues; at this point NGINX has been eliminated as the cause of the intermittent slowdowns.
This application has run on SC 9.3.1, 11.1.0, 11.3.1, and now the latest 13.1.3. Only recently has fixing this performance irregularity become a priority for me, so I've been trying to track down the problem since 11.3.1.
The Server's configuration is as follows:
- 1 vCPU (Intel Sandy Bridge)
- 3.75 GB of RAM
- DNS is being handled through Cloud Flare.
- Server is running on Google Cloud in Google Compute Engine.
Typical ping time is 70 ms to the server from client computers that have experienced the slow down. Apache Benchmark hitting the /health-check route on the server shows the following.
```
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:      415   934  243.8    901   2136
Processing:    77   190  115.7    144    619
Waiting:       77   155   71.5    134    496
Total:        575  1124  240.3   1063   2235
```
Lastly, I'm measuring the execution times for all server operations as part of my investigation, they are not responsible for the slow downs in SC traffic.
@MatthewAry it looks like your HTTP requests are slow as well. Is it also slow when you have no users on it? Could it be that your cloud provider is throttling your bandwidth? It looks like you're using a small machine.
What happens if you run a plain Node.js HTTP server next to your SC instance and try to connect to it? Is it also slow? What WebSocket wsEngine are you using? What version? How does it behave if you use ws as the wsEngine?
@Hirbod When you say that CPU usage is 0.2%, is that at the peak? I suspect that for your use case, you would have very low CPU usage and then a significant peak every 10 seconds or so (when all the answers are sent from clients at almost the same time).
Watching "top" live, I've never seen the CPU load go over 1%. Memory shows 61 GB free of 64 GB. Using "speedometer" to watch the network interface, output is around 1 Mbit and input about 1 Mbit.
The server is basically sleeping and has a lot of headroom.
It's not the cloud provider. I've seen it degrade on ws, sc-uws, and uws. I haven't tried running a plain Node.js HTTP server next to it. Yes, it's a small machine, but what it's doing isn't too major either. Google's infrastructure ties instance bandwidth to the number of vCPUs the instance has (we have 1).
> The egress throughput cap is dependent on the number of vCPUs that a virtual machine instance has. Each vCPU has a 2 Gbps egress cap for peak performance. Each additional vCPU increases the network cap, up to a theoretical maximum of 16 Gbps for each virtual machine. For example, a virtual machine instance with 4 vCPUs has an intra-zone network throughput cap of 2 Gbps * 4 = 8 Gbps. A virtual machine instance with 8 vCPUs has an intra-zone network throughput cap of 2 Gbps * 8 = 16 Gbps.
>
> Specific workloads and benchmarks are regularly measured:
>
> - A single stream performance test can achieve a maximum of 8.5 Gbps on average.
> - A multistream performance test can achieve a maximum of 15 Gbps on average.
On the same server, we are serving static files for the front end of the application using NGINX; its performance as tested with Apache Benchmark is as follows.
```
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:      302   932  338.9    856   5595
Processing:    74   161   88.1    126    399
Waiting:       74   126   51.0    104    321
Total:        412  1093  344.9    999   5793
```
Keep in mind that while doing my benchmarks, I'm hitting the server with 100 concurrent connections at a time.
I'm currently using sc-uws, version 10.148.0 according to package-lock.json.
@Hirbod are you guys on Google? AWS? nginx?
Trying to find the common denominator.
@happilymarrieddad Nope, self-hosted on Hetzner. https://www.hetzner.com/dedicated-rootserver/matrix-ax?country=gb
We're using an AX60-SSD. I was starting to think that AMD sucks and we should switch to an Intel Xeon E3...
@Hirbod Many cloud providers throttle CPU and bandwidth on smaller instances; sometimes it doesn't take much to trigger it. Not sure that's the problem but worth trying with a different instance to see if it makes a difference.
@jondubois It's a dedicated instance (not a shared vCPU) with guaranteed memory and computational power. Before we were on a shared instance where that could be the case, but we moved to a dedicated instance to eliminate the possibility even though our resource consumption was not high enough to justify the move.
In any case, @happilymarrieddad works downstairs from me and he's got me trying out one more thing with NGINX. I'll let you know how that goes.
Please share your findings, @MatthewAry, if they help. I also adjusted rlimit_nofile, set "tcp_nodelay" to "off", and added "proxy_buffering off;" and "proxy_request_buffering off;". Tomorrow we'll see how it goes.
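For reference, a WebSocket-aware nginx location block with those buffering settings might look roughly like this. This is only a sketch; the upstream name, port, domain, and certificate paths are placeholders, and the "tcp_nodelay off" experiment from above is deliberately left out since it's not a general recommendation:

```nginx
upstream socketcluster {
    server 127.0.0.1:8000;  # placeholder: your SC port
}

server {
    listen 443 ssl;
    server_name example.com;  # placeholder

    ssl_certificate     /etc/nginx/ssl/fullchain.pem;  # placeholder
    ssl_certificate_key /etc/nginx/ssl/privkey.pem;    # placeholder

    location / {
        proxy_pass http://socketcluster;
        proxy_http_version 1.1;

        # Required for the WebSocket upgrade handshake
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;

        # The settings under test in this thread
        proxy_buffering off;
        proxy_request_buffering off;
    }
}
```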
Hahaha. @happilymarrieddad is hoping that my rlimit_nofile value is not high enough. So that's what I'm testing.
Nope, that didn't fix it.
Epic fail... dang it
> In any case, @happilymarrieddad works downstairs
Haha good teamwork ;p
@Hirbod I just did a test on Amazon EC2 with 16K concurrent sockets subscribed to the same channel on an 8 core machine (7 workers, 1 broker). SocketCluster version 13.1.5 (using sc-uws as the wsEngine).
I published a message to the channel once every 10 seconds:
- CPU peaked at 30% (for a fraction of a second) and immediately went back down to near 0%; it was 0% throughout most of the test.
- All 16K copies of the message were received in each round (I watched the channel on each socket and added up the messages).
- Latency varied from 10 ms to 600 ms max (but I think a lot of this was because of all the logging I was doing on the client). I added the timestamp to the channel message just before publishing, and when the socket received the message, I compared it with the current time and logged the result.
The machine which I used to simulate the clients had 32 cores and was in the same region (US) as the server machine but in a different availability zone.
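The measurement approach described above can be sketched as two small helpers, one for each side of the channel. This is illustrative only; in the real test the message would go through `exchange.publish(...)` and a subscribed socket, but the timing logic is the same:

```javascript
// Publisher side: attach a send timestamp to each channel message
// just before publishing.
function makeTimedMessage(payload, nowMs = Date.now()) {
  return { ...payload, sentAt: nowMs };
}

// Subscriber side: compare the embedded timestamp with the local clock.
// (Assumes reasonably synchronized clocks, or sender and receiver
// running in the same place.)
function measureLatencyMs(message, nowMs = Date.now()) {
  return nowMs - message.sentAt;
}

// Illustration with fixed timestamps instead of real sockets:
const msg = makeTimedMessage({ question: 'Q1' }, 1000);
console.log(measureLatencyMs(msg, 1600)); // 600 (ms), the worst case seen in the test
```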
You may want to double-check that you're not accidentally publishing more messages than intended; for example, if you publish from inside an event handler that is accidentally attached multiple times.
Did you do all of this behind an nginx proxy using SSL? I would love to know how that performs. I'm sure there are no multiply-attached event handlers; this has been tested very thoroughly inside our app (we had that issue once but fixed it).
@jondubois One more thing to mention: we're using "forever", 14 workers, and 2 brokers on a single machine, and we're also doing a lot of logging. Every message received on the server is stored in Redis as soon as it has been processed.
I should have mentioned that we're using our NGINX reverse proxy to SSL the connection to the application too.
It sounds like this issue could be related to nginx proxy_buffering; see https://stackoverflow.com/questions/9612983/socket-io-slow-response-after-using-nginx
Well, I can try turning proxy buffering off entirely. @happilymarrieddad has his on and he hasn't seen any of the problems that I've been having.
Correction: @happilymarrieddad doesn't have proxy buffering on. I've turned off proxy buffering on my machines and I'm in the process of collecting data to see if the issue is resolved.
@jondubois are you using SC without nginx? Would you recommend dropping nginx as a proxy and having Node.js listen directly on port 443? Could this help performance? From my understanding this should be avoided and connections should always be proxied through nginx. Should we use a dedicated server for proxying connections instead of running nginx on the same machine?
@Hirbod Have you observed the issue since you turned off proxy buffering?
We had so many other issues with our app today that I didn't have a chance to check it deeply enough, but from a quick look it seems to be working so far.
We have a dedicated load balancer (nginx), and everything else in the network runs on its own respective instance. We haven't noticed any of these problems. We are also using SCC, if that makes a difference: 1 scc-state server and 2 scc-broker servers.