Exceptions when benchmarking: ERROR NIO1SocketServerGroup

Open jan0sch opened this issue 6 years ago • 4 comments

Hi,

While benchmarking an http4s service implementation I ran into some issues. Occasionally the service failed, producing the following kinds of exceptions in the logs.

First kind

After this exception the service continued to reply, but the number of errors kept ramping up.

ERROR NIO1SocketServerGroup - Error handling client channel. Closing.
java.util.concurrent.RejectedExecutionException: This SelectorLoop is closed.
        at org.http4s.blaze.channel.nio1.SelectorLoop.enqueueTask(SelectorLoop.scala:118)
        at org.http4s.blaze.channel.nio1.SelectorLoop.initChannel(SelectorLoop.scala:139)
        at org.http4s.blaze.channel.nio1.NIO1SocketServerGroup.org$http4s$blaze$channel$nio1$NIO1SocketServerGroup$$handleClientChannel(NIO1SocketServerGroup.scala:290)
        at org.http4s.blaze.channel.nio1.NIO1SocketServerGroup$SocketAcceptor.acceptNewConnections(NIO1SocketServerGroup.scala:148)
        at org.http4s.blaze.channel.nio1.NIO1SocketServerGroup$SocketAcceptor.opsReady(NIO1SocketServerGroup.scala:119)
        at org.http4s.blaze.channel.nio1.SelectorLoop.processKeys(SelectorLoop.scala:200)
        at org.http4s.blaze.channel.nio1.SelectorLoop.org$http4s$blaze$channel$nio1$SelectorLoop$$runLoop(SelectorLoop.scala:171)
        at org.http4s.blaze.channel.nio1.SelectorLoop$$anon$1.run(SelectorLoop.scala:68)
        at java.base/java.lang.Thread.run(Thread.java:834)

Second kind

After this exception the service stopped responding.

ERROR SelectorLoop - Unhandled exception in selector loop
java.io.IOException: Connection reset by peer
        at java.base/sun.nio.ch.FileDispatcherImpl.close0(Native Method)
        at java.base/sun.nio.ch.SocketDispatcher.close(SocketDispatcher.java:55)
        at java.base/sun.nio.ch.SocketChannelImpl.kill(SocketChannelImpl.java:907)
        at java.base/sun.nio.ch.SelectorImpl.processDeregisterQueue(SelectorImpl.java:267)
        at java.base/sun.nio.ch.KQueueSelectorImpl.doSelect(KQueueSelectorImpl.java:116)
        at java.base/sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:124)
        at java.base/sun.nio.ch.SelectorImpl.select(SelectorImpl.java:141)
        at org.http4s.blaze.channel.nio1.SelectorLoop.org$http4s$blaze$channel$nio1$SelectorLoop$$runLoop(SelectorLoop.scala:163)
        at org.http4s.blaze.channel.nio1.SelectorLoop$$anon$1.run(SelectorLoop.scala:68)
        at java.base/java.lang.Thread.run(Thread.java:834)
ERROR NIO1SocketServerGroup - Listening socket(/0.0.0.0:53248) closed forcibly.
java.nio.channels.ShutdownChannelGroupException: null
        at org.http4s.blaze.channel.nio1.SelectorLoop.killSelector(SelectorLoop.scala:225)
        at org.http4s.blaze.channel.nio1.SelectorLoop.org$http4s$blaze$channel$nio1$SelectorLoop$$runLoop(SelectorLoop.scala:186)
        at org.http4s.blaze.channel.nio1.SelectorLoop$$anon$1.run(SelectorLoop.scala:68)
        at java.base/java.lang.Thread.run(Thread.java:834)
ERROR NIO1HeadStage - Abnormal NIO1HeadStage termination
java.nio.channels.ShutdownChannelGroupException: null
        at org.http4s.blaze.channel.nio1.SelectorLoop.killSelector(SelectorLoop.scala:225)
        at org.http4s.blaze.channel.nio1.SelectorLoop.org$http4s$blaze$channel$nio1$SelectorLoop$$runLoop(SelectorLoop.scala:186)
        at org.http4s.blaze.channel.nio1.SelectorLoop$$anon$1.run(SelectorLoop.scala:68)
        at java.base/java.lang.Thread.run(Thread.java:834)

These errors are hard to reproduce. In general, I observed that once a service started erroring, it continued to produce errors for as long as the JVM was left running.
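One possible reading of the first trace is that a selector loop which has already shut down keeps being handed new connections and rejects each of them. The following is a minimal Java sketch of that pattern only. It is not blaze's code: `ToyLoop`, `countRejected`, the pool size of 4, and the round-robin hand-off are all hypothetical and are not confirmed anywhere in this thread.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.RejectedExecutionException;

// Toy model only: NOT blaze's SelectorLoop, just the failure pattern the traces suggest.
public class ClosedLoopSketch {

    /** Hypothetical stand-in for a selector loop that owns a task queue. */
    static final class ToyLoop {
        private final Queue<Runnable> tasks = new ArrayDeque<>();
        private boolean closed = false;

        /** A closed loop refuses new work, reusing the message seen in the first trace. */
        synchronized void enqueueTask(Runnable task) {
            if (closed) {
                throw new RejectedExecutionException("This SelectorLoop is closed.");
            }
            tasks.add(task);
        }

        /** Simulates the loop dying, e.g. after an unhandled exception inside select(). */
        synchronized void kill() {
            closed = true;
            tasks.clear();
        }
    }

    /**
     * Hands `connections` new connections to the pool round-robin (an assumption)
     * and returns how many of them were rejected by a closed loop.
     */
    static int countRejected(ToyLoop[] pool, int connections) {
        int rejected = 0;
        for (int i = 0; i < connections; i++) {
            ToyLoop loop = pool[i % pool.length];
            try {
                loop.enqueueTask(() -> { });
            } catch (RejectedExecutionException e) {
                rejected++;
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        ToyLoop[] pool = { new ToyLoop(), new ToyLoop(), new ToyLoop(), new ToyLoop() };
        System.out.println("rejected before failure: " + countRejected(pool, 100)); // 0
        pool[1].kill(); // one loop dies and is never replaced
        System.out.println("rejected after failure: " + countRejected(pool, 100));  // 25
    }
}
```

In this toy model a single dead loop keeps failing a fixed share of new connections until the process is restarted, which would be consistent with errors continuing for as long as the JVM stays up. Whether blaze actually behaves this way would need to be checked against its source.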

System environment

The code for the service can be found in the following repository: https://github.com/jan0sch/pfhais

It is located within the pure folder. The configuration files for the jmeter benchmarks can be found in the jmeter folder.

Service workstation

  • CPU : Core i5-9600K, 6 Cores, 3.7 GHz
  • RAM : 32 GB
  • HDD : 2x Samsung SSD 860 PRO 512GB, SATA
  • OS : FreeBSD 12 (HT disabled)
  • JDK : 11.0.4+11-2
  • DB : PostgreSQL 11.3

Client workstation

  • CPU : AMD Ryzen Threadripper 2950X
  • RAM : 32 GB
  • HDD : 2x Samsung SSD 970 PRO 512GB, M.2
  • OS : FreeBSD 12 (HT disabled)
  • JDK : 11.0.4+11-2

Apache JMeter 5.1.1 was used to run the benchmark.

jan0sch, Sep 10 '19 14:09

@jan0sch @rossabaker I also have this issue with the exact same error message.

The error only occurs when using WebSockets and NIO1. Switching to NIO2 resolves the issue.

CharlesAHunt, May 04 '21 01:05

Unfortunately, the NIO2 server is deprecated in blaze. Performance was worse by all measures, and we didn't backport the CVE fix.

rossabaker, May 06 '21 15:05

As of 0.23.10 I guess there is no fix for this issue, is that right? What is the preferred workaround? Would you recommend switching to a different server implementation?

sergiojoker11, Feb 14 '22 11:02

The ember server has been the new default for some time now. I have not tried to reproduce the bug with ember, but so far I have had no trouble with it (I use it in production, although only under light to moderate load).

jan0sch, Feb 14 '22 12:02