bad_record_mac error as a cause of service rollout restart
Hi, one of our services that uses a Bouncy Castle dependency constantly falls into a rollout-restart loop (the same service without Bouncy Castle works normally). What can be seen in the logs is:
redis.clients.jedis.exceptions.JedisConnectionException: org.bouncycastle.tls.TlsFatalAlert: bad_record_mac(20)
	at redis.clients.jedis.util.RedisInputStream.ensureFill(RedisInputStream.java:208)
	at redis.clients.jedis.util.RedisInputStream.readByte(RedisInputStream.java:46)
	at redis.clients.jedis.Protocol.process(Protocol.java:126)
(...)
Caused by: javax.crypto.AEADBadTagException: Error finalising cipher data: mac check in GCM failed
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499)
	at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:480)
	at org.bouncycastle.jcajce.provider.ClassUtil.throwBadTagException(Unknown Source)
	at org.bouncycastle.jcajce.provider.BaseCipher.engineDoFinal(Unknown Source)
	at org.bouncycastle.jcajce.provider.BaseCipher.engineDoFinal(Unknown Source)
	at java.base/javax.crypto.Cipher.doFinal(Cipher.java:2417)
	at org.bouncycastle.tls.crypto.impl.jcajce.JceAEADCipherImpl.doFinal(JceAEADCipherImpl.java:135)
The current Bouncy Castle version is 1.0.12.3. The issue can be seen in only one of our environments; in the others the service operates normally. I noticed that in the affected environment the traffic is much higher than in the others. I've read somewhere that: "bad_record_mac could mean some sort of packet corruption on-the-wire, but it's not common in practice (...) it happens with a probability of one in tens of millions". Taking this into consideration, is it possible that heavy traffic is the cause of those bad_record_mac(20) errors, which in turn cause the rollout restarts? If so, what could be a possible solution? Is there a safe 'threshold' for the throughput to avoid these problems? Thanks in advance for your help.
Up, can someone help with this, please?
Are you saying that there are multiple instances of some service using the BC providers and only one of these instances is having the issue?
Not exactly. We have multiple environments, but we experience the issue in only one of them. Each environment runs multiple instances of the server, and in the affected environment all instances with BC installed fall into rollout restart.
If the service is under especially high load, I would guess a race condition is involved before suspecting packet corruption (and of course it would be helpful to rule out hardware issues by confirming the issue is independent of particular machine(s)).
I don't have any specific theory in mind as we haven't received any similar reports.
I see this is a redis/jedis service. Is there connection pooling (try disabling for diagnostic purposes) or other multiplexing involved? I'd be looking for sharing of endpoints, where multiple threads are reading from or writing to the same SSLSocket or SSLEngine.
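One cheap way to look for that kind of sharing is to tag each endpoint with the threads that touch it. The following is a minimal, hypothetical diagnostic sketch (plain JDK, not part of Jedis or BCJSSE); `record` would be called from every code path that reads from or writes to the socket/engine:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical diagnostic helper: remember which threads have used which
// socket/engine instance, to flag unsynchronized sharing of an endpoint.
public class EndpointUsageTracker {
    private final Map<Integer, Set<String>> threadsPerEndpoint = new ConcurrentHashMap<>();

    /** Call from every code path that reads from or writes to the endpoint. */
    public void record(Object endpoint) {
        record(endpoint, Thread.currentThread().getName());
    }

    public void record(Object endpoint, String threadName) {
        threadsPerEndpoint
                .computeIfAbsent(System.identityHashCode(endpoint),
                        k -> ConcurrentHashMap.newKeySet())
                .add(threadName);
    }

    /** True if any single endpoint instance was used from more than one thread. */
    public boolean sharedAcrossThreads() {
        return threadsPerEndpoint.values().stream().anyMatch(names -> names.size() > 1);
    }
}
```

A positive result does not prove a race, but it narrows down where to add locking or per-thread connections.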
Which TLS version(s) are being negotiated? What is the Java version and the OS? Are the problematic connections all TLS server-side or does it also act as a client making outbound connections?
Please update to 1.0.13 if possible, as it is easier to deal with the latest version.
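For reference, in a Gradle build the FIPS pair discussed in this thread would be pinned with coordinates along these lines (artifact IDs assumed from the jar names appearing later in the stack traces):

```groovy
implementation 'org.bouncycastle:bc-fips:1.0.2.3'
implementation 'org.bouncycastle:bctls-fips:1.0.13'
```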
Thank you for your answer! TLS version: our default is 1.2. OS: distroless Java 17 image [gcr.io/distroless/java17-debian11]. We use these TLS connections to connect to our database and/or event hubs.
@Niegi, I have the same issue with Redis (AWS ElastiCache). Have you found any solution/workaround? @peterdettman, we use the Lettuce client for Redis, Spring Boot 2.6.4, Java 17 (JRE 17.0.2), OS: Alpine Linux 3.4.3
2022-07-07 17:13:30.033 ERROR 1 --- [Container#0-162] c.c.e.d.services.DataIngestionService : [manipulateEvent]
org.springframework.data.redis.RedisSystemException: Unknown redis exception; nested exception is io.netty.handler.codec.DecoderException: javax.net.ssl.SSLException: org.bouncycastle.tls.TlsFatalAlert: bad_record_mac(20)
	at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:223) ~[spring-data-redis-2.6.4.jar:2.6.4]
	at org.springframework.data.redis.connection.lettuce.LettuceConnection.lambda$doInvoke$4(LettuceConnection.java:920) ~[spring-data-redis-2.6.4.jar:2.6.4]
Caused by: javax.crypto.AEADBadTagException: Error finalising cipher data: mac check in GCM failed
	at org.bouncycastle.tls.crypto.impl.jcajce.JceAEADCipherImpl.doFinal(JceAEADCipherImpl.java:135) ~[bctls-fips-1.0.12.2.jar:1.0.12.2]
Thank you.
I have an update.
I've written a tester that simply loads AWS Redis with 10 threads, and the issue is reproduced after a few hours.
See the tester code: https://github.com/oridool/redistest
I also tried upgrading to bc-fips 1.0.2.3 and bctls-fips 1.0.13, and I still get the same error.
When it happens, it happens together for almost all threads at the same time.
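For context, the shape of such a tester is roughly the following (a hypothetical simplification, not the actual code from the repo linked above; `redisOperation` stands in for the real client call, e.g. an HINCRBY):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Minimal load-generator skeleton: N threads hammer a supplied operation
// in a loop and count failures (where bad_record_mac would surface).
public class LoadHammer {
    public static long run(int threads, int iterationsPerThread, Runnable redisOperation) {
        AtomicLong errors = new AtomicLong();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < iterationsPerThread; i++) {
                    try {
                        redisOperation.run(); // the real tester issues a Redis command here
                    } catch (RuntimeException e) {
                        errors.incrementAndGet(); // count failed operations
                    }
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return errors.get();
    }
}
```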
Here is the call stack (with masked addresses):
2022-08-18 11:04:50.184 WARN 1 --- [pool-1-thread-4] org.redistest.RedistestApplication : Exception while performing redis command, curErrorCount=1
org.springframework.data.redis.RedisSystemException: Unknown redis exception; nested exception is io.netty.handler.codec.DecoderException: javax.net.ssl.SSLException: org.bouncycastle.tls.TlsFatalAlert: bad_record_mac(20)
at org.springframework.data.redis.FallbackExceptionTranslationStrategy.getFallback(FallbackExceptionTranslationStrategy.java:53) ~[spring-data-redis-2.6.4.jar:2.6.4]
at org.springframework.data.redis.FallbackExceptionTranslationStrategy.translate(FallbackExceptionTranslationStrategy.java:43) ~[spring-data-redis-2.6.4.jar:2.6.4]
at org.springframework.data.redis.connection.lettuce.LettuceConnection.convertLettuceAccessException(LettuceConnection.java:272) ~[spring-data-redis-2.6.4.jar:2.6.4]
at org.springframework.data.redis.connection.lettuce.LettuceConnection.await(LettuceConnection.java:1063) ~[spring-data-redis-2.6.4.jar:2.6.4]
at org.springframework.data.redis.connection.lettuce.LettuceConnection.lambda$doInvoke$4(LettuceConnection.java:920) ~[spring-data-redis-2.6.4.jar:2.6.4]
at org.springframework.data.redis.connection.lettuce.LettuceInvoker$Synchronizer.invoke(LettuceInvoker.java:665) ~[spring-data-redis-2.6.4.jar:2.6.4]
at org.springframework.data.redis.connection.lettuce.LettuceInvoker.just(LettuceInvoker.java:125) ~[spring-data-redis-2.6.4.jar:2.6.4]
at org.springframework.data.redis.connection.lettuce.LettuceHashCommands.hIncrBy(LettuceHashCommands.java:193) ~[spring-data-redis-2.6.4.jar:2.6.4]
at org.springframework.data.redis.connection.DefaultedRedisConnection.hIncrBy(DefaultedRedisConnection.java:1380) ~[spring-data-redis-2.6.4.jar:2.6.4]
at org.springframework.data.redis.core.DefaultHashOperations.lambda$increment$2(DefaultHashOperations.java:81) ~[spring-data-redis-2.6.4.jar:2.6.4]
at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:223) ~[spring-data-redis-2.6.4.jar:2.6.4]
at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:190) ~[spring-data-redis-2.6.4.jar:2.6.4]
at org.springframework.data.redis.core.AbstractOperations.execute(AbstractOperations.java:97) ~[spring-data-redis-2.6.4.jar:2.6.4]
at org.springframework.data.redis.core.DefaultHashOperations.increment(DefaultHashOperations.java:81) ~[spring-data-redis-2.6.4.jar:2.6.4]
at org.redistest.RedistestApplication.lambda$main$0(RedistestApplication.java:76) ~[classes/:na]
at java.base/java.util.concurrent.FutureTask.run(Unknown Source) ~[na:na]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[na:na]
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[na:na]
at java.base/java.lang.Thread.run(Unknown Source) ~[na:na]
Caused by: io.netty.handler.codec.DecoderException: javax.net.ssl.SSLException: org.bouncycastle.tls.TlsFatalAlert: bad_record_mac(20)
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:480) ~[netty-codec-4.1.77.Final.jar:4.1.77.Final]
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:279) ~[netty-codec-4.1.77.Final.jar:4.1.77.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[netty-transport-4.1.77.Final.jar:4.1.77.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[netty-transport-4.1.77.Final.jar:4.1.77.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[netty-transport-4.1.77.Final.jar:4.1.77.Final]
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[netty-transport-4.1.77.Final.jar:4.1.77.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[netty-transport-4.1.77.Final.jar:4.1.77.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[netty-transport-4.1.77.Final.jar:4.1.77.Final]
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[netty-transport-4.1.77.Final.jar:4.1.77.Final]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) ~[netty-transport-4.1.77.Final.jar:4.1.77.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:722) ~[netty-transport-4.1.77.Final.jar:4.1.77.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658) ~[netty-transport-4.1.77.Final.jar:4.1.77.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584) ~[netty-transport-4.1.77.Final.jar:4.1.77.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496) ~[netty-transport-4.1.77.Final.jar:4.1.77.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:995) ~[netty-common-4.1.77.Final.jar:4.1.77.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.77.Final.jar:4.1.77.Final]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.77.Final.jar:4.1.77.Final]
... 1 common frames omitted
Caused by: javax.net.ssl.SSLException: org.bouncycastle.tls.TlsFatalAlert: bad_record_mac(20)
at org.bouncycastle.jsse.provider.ProvSSLEngine.unwrap(ProvSSLEngine.java:508) ~[bctls-fips-1.0.13.jar:1.0.13]
at java.base/javax.net.ssl.SSLEngine.unwrap(Unknown Source) ~[na:na]
at io.netty.handler.ssl.SslHandler$SslEngineType$3.unwrap(SslHandler.java:295) ~[netty-handler-4.1.77.Final.jar:4.1.77.Final]
at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1342) ~[netty-handler-4.1.77.Final.jar:4.1.77.Final]
at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1235) ~[netty-handler-4.1.77.Final.jar:4.1.77.Final]
at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1284) ~[netty-handler-4.1.77.Final.jar:4.1.77.Final]
at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:510) ~[netty-codec-4.1.77.Final.jar:4.1.77.Final]
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:449) ~[netty-codec-4.1.77.Final.jar:4.1.77.Final]
... 17 common frames omitted
Caused by: org.bouncycastle.tls.TlsFatalAlert: bad_record_mac(20)
at org.bouncycastle.tls.crypto.impl.TlsAEADCipher.decodeCiphertext(TlsAEADCipher.java:293) ~[bctls-fips-1.0.13.jar:1.0.13]
at org.bouncycastle.tls.RecordStream.decodeAndVerify(RecordStream.java:253) ~[bctls-fips-1.0.13.jar:1.0.13]
at org.bouncycastle.tls.RecordStream.readFullRecord(RecordStream.java:204) ~[bctls-fips-1.0.13.jar:1.0.13]
at org.bouncycastle.tls.TlsProtocol.safeReadFullRecord(TlsProtocol.java:903) ~[bctls-fips-1.0.13.jar:1.0.13]
at org.bouncycastle.tls.TlsProtocol.offerInput(TlsProtocol.java:1308) ~[bctls-fips-1.0.13.jar:1.0.13]
at org.bouncycastle.jsse.provider.ProvSSLEngine.unwrap(ProvSSLEngine.java:464) ~[bctls-fips-1.0.13.jar:1.0.13]
... 24 common frames omitted
Caused by: java.lang.IllegalStateException:
at org.bouncycastle.tls.crypto.impl.jcajce.Exceptions.illegalStateException(Exceptions.java:10) ~[bctls-fips-1.0.13.jar:1.0.13]
at org.bouncycastle.tls.crypto.impl.jcajce.JceAEADCipherImpl.doFinal(JceAEADCipherImpl.java:139) ~[bctls-fips-1.0.13.jar:1.0.13]
at org.bouncycastle.tls.crypto.impl.TlsAEADCipher.decodeCiphertext(TlsAEADCipher.java:288) ~[bctls-fips-1.0.13.jar:1.0.13]
... 29 common frames omitted
Caused by: javax.crypto.AEADBadTagException: Error finalising cipher data: mac check in GCM failed
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[na:na]
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source) ~[na:na]
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source) ~[na:na]
at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Unknown Source) ~[na:na]
at java.base/java.lang.reflect.Constructor.newInstance(Unknown Source) ~[na:na]
at org.bouncycastle.jcajce.provider.ClassUtil.throwBadTagException(Unknown Source) ~[bc-fips-1.0.2.3.jar:1.0.2.3]
at org.bouncycastle.jcajce.provider.BaseCipher.engineDoFinal(Unknown Source) ~[bc-fips-1.0.2.3.jar:1.0.2.3]
at org.bouncycastle.jcajce.provider.BaseCipher.engineDoFinal(Unknown Source) ~[bc-fips-1.0.2.3.jar:1.0.2.3]
at java.base/javax.crypto.Cipher.doFinal(Unknown Source) ~[na:na]
at org.bouncycastle.tls.crypto.impl.jcajce.JceAEADCipherImpl.doFinal(JceAEADCipherImpl.java:135) ~[bctls-fips-1.0.13.jar:1.0.13]
... 30 common frames omitted
2022-08-18 11:04:50.244 INFO 1 --- [xecutorLoop-1-1] i.l.core.protocol.ConnectionWatchdog : Reconnecting, last destination was master.***-redis.***.euw1.cache.amazonaws.com/172.31.**.***:6379
2022-08-18 11:04:50.256 INFO 1 --- [ioEventLoop-6-2] o.b.jsse.provider.PropertyUtils : Found string system property [java.home]: /opt/zulu17.32.14-sa-jre17.0.2-linux_musl_x64
2022-08-18 11:04:50.648 INFO 1 --- [ioEventLoop-6-2] i.l.core.protocol.ReconnectionHandler : Reconnected to master.***-redis.***.euw1.cache.amazonaws.com/<unresolved>:6379
@oridool We're still trying to figure it out. If we find anything, I'll let you know.
Thanks. If it helps, I reproduced the error with a single thread as well. See the readme in the GitHub tester repository I mentioned above.
@oridool It might provide useful information to try the reproducer with the regular BC library (bcprov + bctls jars). I would be interested in results from both the latest release (1.71) as well as the latest beta (1.72b15 - available from https://downloads.bouncycastle.org/betas/). In particular the beta version features a rewritten SSLEngine implementation (which will eventually find its way into the FIPS version), so if the issue is with our SSLEngine there's some chance that it resolves the issue.
@peterdettman , I was trying to reproduce the issue with bcprov+bctls 1.71 and even 1.69, but with no success: it doesn't occur when using these jars. I didn't change anything in my code or in the environment except the BC dependencies (and the java.security file, as described below). I used these dependencies:
implementation 'org.bouncycastle:bcprov-jdk15to18:1.71'
implementation 'org.bouncycastle:bctls-jdk15to18:1.71'
What can be the reason? Is bc-fips/bctls-fips using exactly the same code as bcprov/bctls?
If it helps, here are the changes I made to my java.security file:
For fips version:
security.provider.1=org.bouncycastle.jcajce.provider.BouncyCastleFipsProvider
security.provider.2=org.bouncycastle.jsse.provider.BouncyCastleJsseProvider fips:org.bouncycastle.jcajce.provider.BouncyCastleFipsProvider
security.provider.3=SUN
security.provider.4=SunEC
security.provider.5=SunJSSE
security.provider.6=SunJCE
security.provider.7=SunJGSS
security.provider.8=SunSASL
security.provider.9=XMLDSig
security.provider.10=SunPCSC
security.provider.11=JdkLDAP
security.provider.12=JdkSASL
security.provider.13=SunPKCS11
When application starts, I print the providers and get:
2022-09-03 12:43:01.468 WARN 37444 --- [ main] org.redistest.RedistestApplication :
PROVIDER[1] : BCFIPS --> BouncyCastle Security Provider (FIPS edition) v1.0.2.3
PROVIDER[2] : BCJSSE --> Bouncy Castle JSSE Provider Version 1.0.13
PROVIDER[3] : SUN --> SUN (DSA key/parameter generation; DSA signing; SHA-1, MD5 digests; SecureRandom; X.509 certificates; PKCS12, JKS & DKS keystores; PKIX CertPathValidator; PKIX CertPathBuilder; LDAP, Collection CertStores, JavaPolicy Policy; JavaLoginConfig Configuration)
PROVIDER[4] : SunEC --> Sun Elliptic Curve provider
PROVIDER[5] : SunJSSE --> Sun JSSE provider(PKCS12, SunX509/PKIX key/trust factories, SSLv3/TLSv1/TLSv1.1/TLSv1.2/TLSv1.3/DTLSv1.0/DTLSv1.2)
PROVIDER[6] : SunJCE --> SunJCE Provider (implements RSA, DES, Triple DES, AES, Blowfish, ARCFOUR, RC2, PBE, Diffie-Hellman, HMAC, ChaCha20)
PROVIDER[7] : SunJGSS --> Sun (Kerberos v5, SPNEGO)
PROVIDER[8] : SunSASL --> Sun SASL provider(implements client mechanisms for: DIGEST-MD5, EXTERNAL, PLAIN, CRAM-MD5, NTLM; server mechanisms for: DIGEST-MD5, CRAM-MD5, NTLM)
PROVIDER[9] : XMLDSig --> XMLDSig (DOM XMLSignatureFactory; DOM KeyInfoFactory; C14N 1.0, C14N 1.1, Exclusive C14N, Base64, Enveloped, XPath, XPath2, XSLT TransformServices)
PROVIDER[10] : SunPCSC --> Sun PC/SC provider
PROVIDER[11] : JdkLDAP --> JdkLDAP Provider (implements LDAP CertStore)
PROVIDER[12] : JdkSASL --> JDK SASL provider(implements client and server mechanisms for GSSAPI)
PROVIDER[13] : SunPKCS11 --> Unconfigured and unusable PKCS11 provider
For the non-fips version, I modified the first 2 providers to:
security.provider.1=org.bouncycastle.jce.provider.BouncyCastleProvider
security.provider.2=org.bouncycastle.jsse.provider.BouncyCastleJsseProvider
And get in my application logs:
2022-09-03 12:37:50.343 WARN 17604 --- [ main] org.redistest.RedistestApplication :
PROVIDER[1] : BC --> BouncyCastle Security Provider v1.71
PROVIDER[2] : BCJSSE --> Bouncy Castle JSSE Provider Version 1.0.13
PROVIDER[3] : SUN --> SUN (DSA key/parameter generation; DSA signing; SHA-1, MD5 digests; SecureRandom; X.509 certificates; PKCS12, JKS & DKS keystores; PKIX CertPathValidator; PKIX CertPathBuilder; LDAP, Collection CertStores, JavaPolicy Policy; JavaLoginConfig Configuration)
...
I was trying to reproduce the issue with bcprov+bctls 1.71 and even 1.69 but with no success. It doesn't occur when using these jars.
@oridool That's interesting information. Could you please also test bcprov+bctls 1.72b15 from the betas link above, so we can be sure the SSLEngine changes won't regress this.
bc-fips+bctls-fips is not exactly the same JSSE code; there are some relatively small changes to adapt to using the FIPS crypto. Taken together with this:
When it happens, it happens together for almost all threads at the same time.
...it suggests some kind of issue with either the FIPS AES/GCM implementation itself or perhaps in the thin layer of bctls-fips that uses it. A race condition of some kind still seems likely. We will review the codepath into the FIPS jar and look for what could be a possible cause.
Could you clarify which TLS version your reproducer is using? I am interested specifically in whether there is a difference between TLS 1.2 and TLS 1.3, as there is a special nonce mechanism for TLS GCM only used by the FIPS version in TLS 1.2.
@peterdettman , I've run my test with the 1.72b15 version. It runs OK with no errors. Version confirmation from my logs:
2022-09-07 09:50:16.603 WARN 1 --- [ main] org.redistest.RedistestApplication :
PROVIDER[1] : BC --> BouncyCastle Security Provider v1.72b
PROVIDER[2] : BCJSSE --> Bouncy Castle JSSE Provider Version 1.0.14
...
I am using TLS version 1.2. To make sure of that, I initialized the Redis client (Lettuce) as follows:
LettuceClientConfiguration.LettuceClientConfigurationBuilder builder = // ....
builder.clientOptions(ClientOptions.builder().autoReconnect(true)
.sslOptions(SslOptions.builder().protocols("TLSv1.2").build())
// ...
.build());
Unfortunately, I cannot switch to TLS v1.3 or v1.1 to test those, because I get an exception on startup. For v1.1 I get:
Caused by: java.lang.IllegalStateException: No usable protocols enabled
at org.bouncycastle.jsse.provider.ProvSSLContextSpi.getActiveProtocolVersions(ProvSSLContextSpi.java:710) ~[bctls-jdk18on-172b15.jar:1.71.99.0]
For v1.3 I get another exception, probably because the server doesn't support it:
@oridool TLS 1.1 doesn't support GCM cipher suites. For TLS 1.3, you need different cipher suite names, e.g. "TLS_AES_128_GCM_SHA256", so that might be the issue. Maybe post the exception you're getting.
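For reference, restricting a connection to TLS 1.3 with such a suite looks like this at the plain JSSE level (a generic sketch against the JDK's default SSLContext, not Lettuce-specific; BCJSSE should accept the same standard names):

```java
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;

public class Tls13Config {
    // Build an engine restricted to TLS 1.3 with a TLS 1.3-specific suite name.
    public static SSLEngine restrictedEngine() {
        try {
            SSLEngine engine = SSLContext.getDefault().createSSLEngine();
            engine.setEnabledProtocols(new String[] { "TLSv1.3" });
            engine.setEnabledCipherSuites(new String[] { "TLS_AES_128_GCM_SHA256" });
            return engine;
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Note that the pre-TLS-1.3 names like TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 are not valid for a TLS 1.3 handshake, which is one common reason such a connection fails to start.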
@peterdettman , For TLS 1.3 I don't get any clear error.
Here it is:
TLS 1.3 exception
org.springframework.data.redis.RedisConnectionFailureException: Unable to connect to Redis; nested exception is org.springframework.data.redis.connection.PoolException: Could not get a resource from the pool; nested exception is io.lettuce.core.RedisConnectionException: Unable to connect to master.xxx.cache.amazonaws.com/<unresolved>:6379
at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory$ExceptionTranslatingConnectionProvider.translateException(LettuceConnectionFactory.java:1689) ~[spring-data-redis-2.7.2.jar:2.7.2]
at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory$ExceptionTranslatingConnectionProvider.getConnection(LettuceConnectionFactory.java:1597) ~[spring-data-redis-2.7.2.jar:2.7.2]
at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory$SharedConnection.getNativeConnection(LettuceConnectionFactory.java:1383) ~[spring-data-redis-2.7.2.jar:2.7.2]
at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory$SharedConnection.getConnection(LettuceConnectionFactory.java:1366) ~[spring-data-redis-2.7.2.jar:2.7.2]
at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.getSharedConnection(LettuceConnectionFactory.java:1093) ~[spring-data-redis-2.7.2.jar:2.7.2]
at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.getConnection(LettuceConnectionFactory.java:421) ~[spring-data-redis-2.7.2.jar:2.7.2]
at org.springframework.data.redis.core.RedisConnectionUtils.fetchConnection(RedisConnectionUtils.java:193) ~[spring-data-redis-2.7.2.jar:2.7.2]
at org.springframework.data.redis.core.RedisConnectionUtils.doGetConnection(RedisConnectionUtils.java:144) ~[spring-data-redis-2.7.2.jar:2.7.2]
at org.springframework.data.redis.core.RedisConnectionUtils.getConnection(RedisConnectionUtils.java:105) ~[spring-data-redis-2.7.2.jar:2.7.2]
at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:211) ~[spring-data-redis-2.7.2.jar:2.7.2]
at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:191) ~[spring-data-redis-2.7.2.jar:2.7.2]
at org.springframework.data.redis.core.AbstractOperations.execute(AbstractOperations.java:97) ~[spring-data-redis-2.7.2.jar:2.7.2]
at org.springframework.data.redis.core.DefaultHashOperations.increment(DefaultHashOperations.java:81) ~[spring-data-redis-2.7.2.jar:2.7.2]
at org.redistest.RedistestApplication.increment(RedistestApplication.java:129) ~[main/:na]
at org.redistest.RedistestApplication.lambda$main$0(RedistestApplication.java:87) ~[main/:na]
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[na:na]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[na:na]
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[na:na]
at java.base/java.lang.Thread.run(Thread.java:833) ~[na:na]
Caused by: org.springframework.data.redis.connection.PoolException: Could not get a resource from the pool; nested exception is io.lettuce.core.RedisConnectionException: Unable to connect to master.xxx.cache.amazonaws.com/<unresolved>:6379
at org.springframework.data.redis.connection.lettuce.LettucePoolingConnectionProvider.getConnection(LettucePoolingConnectionProvider.java:109) ~[spring-data-redis-2.7.2.jar:2.7.2]
at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory$ExceptionTranslatingConnectionProvider.getConnection(LettuceConnectionFactory.java:1595) ~[spring-data-redis-2.7.2.jar:2.7.2]
... 17 common frames omitted
Caused by: io.lettuce.core.RedisConnectionException: Unable to connect to master.xxx.cache.amazonaws.com/<unresolved>:6379
at io.lettuce.core.RedisConnectionException.create(RedisConnectionException.java:78) ~[lettuce-core-6.1.9.RELEASE.jar:6.1.9.RELEASE]
at io.lettuce.core.RedisConnectionException.create(RedisConnectionException.java:56) ~[lettuce-core-6.1.9.RELEASE.jar:6.1.9.RELEASE]
at io.lettuce.core.AbstractRedisClient.getConnection(AbstractRedisClient.java:330) ~[lettuce-core-6.1.9.RELEASE.jar:6.1.9.RELEASE]
at io.lettuce.core.RedisClient.connect(RedisClient.java:216) ~[lettuce-core-6.1.9.RELEASE.jar:6.1.9.RELEASE]
at org.springframework.data.redis.connection.lettuce.StandaloneConnectionProvider.lambda$getConnection$1(StandaloneConnectionProvider.java:115) ~[spring-data-redis-2.7.2.jar:2.7.2]
at java.base/java.util.Optional.orElseGet(Optional.java:364) ~[na:na]
at org.springframework.data.redis.connection.lettuce.StandaloneConnectionProvider.getConnection(StandaloneConnectionProvider.java:115) ~[spring-data-redis-2.7.2.jar:2.7.2]
at org.springframework.data.redis.connection.lettuce.LettucePoolingConnectionProvider.lambda$null$0(LettucePoolingConnectionProvider.java:97) ~[spring-data-redis-2.7.2.jar:2.7.2]
at io.lettuce.core.support.ConnectionPoolSupport$RedisPooledObjectFactory.create(ConnectionPoolSupport.java:211) ~[lettuce-core-6.1.9.RELEASE.jar:6.1.9.RELEASE]
at io.lettuce.core.support.ConnectionPoolSupport$RedisPooledObjectFactory.create(ConnectionPoolSupport.java:201) ~[lettuce-core-6.1.9.RELEASE.jar:6.1.9.RELEASE]
at org.apache.commons.pool2.BasePooledObjectFactory.makeObject(BasePooledObjectFactory.java:70) ~[commons-pool2-2.11.1.jar:2.11.1]
at org.apache.commons.pool2.impl.GenericObjectPool.create(GenericObjectPool.java:571) ~[commons-pool2-2.11.1.jar:2.11.1]
at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:298) ~[commons-pool2-2.11.1.jar:2.11.1]
at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:223) ~[commons-pool2-2.11.1.jar:2.11.1]
at io.lettuce.core.support.ConnectionPoolSupport$1.borrowObject(ConnectionPoolSupport.java:122) ~[lettuce-core-6.1.9.RELEASE.jar:6.1.9.RELEASE]
at io.lettuce.core.support.ConnectionPoolSupport$1.borrowObject(ConnectionPoolSupport.java:117) ~[lettuce-core-6.1.9.RELEASE.jar:6.1.9.RELEASE]
at org.springframework.data.redis.connection.lettuce.LettucePoolingConnectionProvider.getConnection(LettucePoolingConnectionProvider.java:103) ~[spring-data-redis-2.7.2.jar:2.7.2]
... 18 common frames omitted
Caused by: java.nio.channels.ClosedChannelException: null
at io.netty.handler.ssl.SslHandler.channelInactive(SslHandler.java:1065) ~[netty-handler-4.1.79.Final.jar:4.1.79.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262) ~[netty-transport-4.1.79.Final.jar:4.1.79.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248) ~[netty-transport-4.1.79.Final.jar:4.1.79.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241) ~[netty-transport-4.1.79.Final.jar:4.1.79.Final]
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405) ~[netty-transport-4.1.79.Final.jar:4.1.79.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262) ~[netty-transport-4.1.79.Final.jar:4.1.79.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248) ~[netty-transport-4.1.79.Final.jar:4.1.79.Final]
at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901) ~[netty-transport-4.1.79.Final.jar:4.1.79.Final]
at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:813) ~[netty-transport-4.1.79.Final.jar:4.1.79.Final]
at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174) ~[netty-common-4.1.79.Final.jar:4.1.79.Final]
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167) ~[netty-common-4.1.79.Final.jar:4.1.79.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470) ~[netty-common-4.1.79.Final.jar:4.1.79.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500) ~[netty-transport-4.1.79.Final.jar:4.1.79.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[netty-common-4.1.79.Final.jar:4.1.79.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.79.Final.jar:4.1.79.Final]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.79.Final.jar:4.1.79.Final]
... 1 common frames omitted
How can we proceed from here? Should I turn on some bc-fips logging when using TLS 1.2? Do you want to provide a potential fix in a new build/jar for me to try?
Thanks.
How can we proceed from here?
Can you disable GCM cipher suites in TLS 1.2 and try that please (e.g. configure your client so that only TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256 is supported, or some other non-GCM suite).
Do you want to provide a potential fix by a new build/jar and try it ?
I'm just hoping to isolate the issue to GCM; then we can investigate and, yes, provide you a new jar once we have a potential fix.
@peterdettman ,
Can you disable GCM cipher suites in TLS 1.2 and try that please (e.g. configure your client so that only TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256 is supported, or some other non-GCM suite)
Can you provide more details on how I can do that? Is it something I need to configure in Bouncy Castle? Is it a system property? Some other method?
e.g. by setting the jdk.tls.client.cipherSuites system property:
-Djdk.tls.client.cipherSuites="TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256"
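The same restriction can also be forced per-engine in code, independent of system properties, which helps confirm what the provider actually honours (a generic JSSE sketch against the default SSLContext, not Lettuce-specific):

```java
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;

public class CbcOnly {
    // Force a single non-GCM suite on an engine, bypassing system properties.
    public static SSLEngine cbcOnlyEngine() {
        try {
            SSLEngine engine = SSLContext.getDefault().createSSLEngine();
            engine.setEnabledProtocols(new String[] { "TLSv1.2" });
            engine.setEnabledCipherSuites(
                    new String[] { "TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256" });
            return engine;
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With only one suite offered in the ClientHello, a compliant server can either pick that suite or fail the handshake; it cannot negotiate GCM.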
@peterdettman ,
I've just noticed that even after setting the jdk.tls.client.cipherSuites property as you requested, my test application creates the TLS connection with the GCM cipher suite, not the CBC one.
I've enabled debug logs for org.bouncycastle, and as far as I understand, the server changes the cipher suite to TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256.
Is there any way to enforce it?
2022-09-08 13:38:59.973 DEBUG 1 --- [ioEventLoop-6-1] o.b.jsse.provider.PropertyUtils : Boolean system property [org.bouncycastle.jsse.server.dh.disableDefaultSuites] defaulted to: false
2022-09-08 13:38:59.999 INFO 1 --- [ioEventLoop-6-1] o.b.jsse.provider.PropertyUtils : Found boolean security property [keystore.type.compat]: true
2022-09-08 13:39:00.001 INFO 1 --- [ioEventLoop-6-1] o.b.jsse.provider.PropertyUtils : Found string system property [java.home]: /opt/zulu17.32.14-sa-jre17.0.2-linux_musl_x64
2022-09-08 13:39:00.017 INFO 1 --- [ioEventLoop-6-1] o.b.j.p.ProvTrustManagerFactorySpi : Initializing default trust store from path: /opt/zulu17.32.14-sa-jre17.0.2-linux_musl_x64/lib/security/cacerts
2022-09-08 13:39:00.911 DEBUG 1 --- [ioEventLoop-6-1] o.b.jsse.provider.PropertyUtils : Boolean system property [com.sun.net.ssl.checkRevocation] defaulted to: false
2022-09-08 13:39:00.911 DEBUG 1 --- [ioEventLoop-6-1] o.b.jsse.provider.PropertyUtils : Boolean system property [org.bouncycastle.jsse.trustManager.checkEKU] defaulted to: true
2022-09-08 13:39:00.959 DEBUG 1 --- [ioEventLoop-6-1] o.b.jsse.provider.PropertyUtils : Integer system property [javax.net.ssl.sessionCacheSize] defaulted to: 20480
2022-09-08 13:39:00.961 DEBUG 1 --- [ioEventLoop-6-1] o.b.jsse.provider.PropertyUtils : Boolean system property [org.bouncycastle.jsse.ec.disableChar2] defaulted to: false
2022-09-08 13:39:00.962 DEBUG 1 --- [ioEventLoop-6-1] o.b.jsse.provider.PropertyUtils : Boolean system property [org.bouncycastle.ec.disable_f2m] defaulted to: false
2022-09-08 13:39:00.962 DEBUG 1 --- [ioEventLoop-6-1] o.b.jsse.provider.PropertyUtils : Boolean system property [jsse.enableFFDHE] defaulted to: true
2022-09-08 13:39:01.534 DEBUG 1 --- [ioEventLoop-6-1] o.b.jsse.provider.PropertyUtils : Boolean system property [org.bouncycastle.jsse.client.dh.disableDefaultSuites] defaulted to: false
2022-09-08 13:39:01.541 INFO 1 --- [ioEventLoop-6-1] o.b.jsse.provider.PropertyUtils : Found string system property [jdk.tls.client.cipherSuites]: TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256
2022-09-08 13:39:01.541 DEBUG 1 --- [ioEventLoop-6-1] o.b.jsse.provider.PropertyUtils : Boolean system property [org.bouncycastle.jsse.server.dh.disableDefaultSuites] defaulted to: false
2022-09-08 13:39:01.543 DEBUG 1 --- [ioEventLoop-6-1] o.b.jsse.provider.PropertyUtils : Boolean system property [org.bouncycastle.jsse.ec.disableChar2] defaulted to: false
2022-09-08 13:39:01.544 DEBUG 1 --- [ioEventLoop-6-1] o.b.jsse.provider.PropertyUtils : Boolean system property [org.bouncycastle.ec.disable_f2m] defaulted to: false
2022-09-08 13:39:01.544 DEBUG 1 --- [ioEventLoop-6-1] o.b.jsse.provider.PropertyUtils : Boolean system property [jsse.enableFFDHE] defaulted to: true
2022-09-08 13:39:01.546 DEBUG 1 --- [ioEventLoop-6-1] o.b.jsse.provider.PropertyUtils : Boolean system property [org.bouncycastle.jsse.client.dh.disableDefaultSuites] defaulted to: false
2022-09-08 13:39:01.547 INFO 1 --- [ioEventLoop-6-1] o.b.jsse.provider.PropertyUtils : Found string system property [jdk.tls.client.cipherSuites]: TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256
2022-09-08 13:39:01.547 DEBUG 1 --- [ioEventLoop-6-1] o.b.jsse.provider.PropertyUtils : Boolean system property [org.bouncycastle.jsse.server.dh.disableDefaultSuites] defaulted to: false
2022-09-08 13:39:01.549 INFO 1 --- [ioEventLoop-6-1] o.b.jsse.provider.PropertyUtils : Found string system property [java.home]: /opt/zulu17.32.14-sa-jre17.0.2-linux_musl_x64
2022-09-08 13:39:01.549 INFO 1 --- [ioEventLoop-6-1] o.b.j.p.ProvTrustManagerFactorySpi : Initializing default trust store from path: /opt/zulu17.32.14-sa-jre17.0.2-linux_musl_x64/lib/security/cacerts
2022-09-08 13:39:01.899 DEBUG 1 --- [ioEventLoop-6-1] o.b.jsse.provider.PropertyUtils : Boolean system property [org.bouncycastle.jsse.ec.disableChar2] defaulted to: false
2022-09-08 13:39:01.903 DEBUG 1 --- [ioEventLoop-6-1] o.b.jsse.provider.PropertyUtils : Boolean system property [org.bouncycastle.ec.disable_f2m] defaulted to: false
2022-09-08 13:39:01.903 DEBUG 1 --- [ioEventLoop-6-1] o.b.jsse.provider.PropertyUtils : Boolean system property [jsse.enableFFDHE] defaulted to: true
2022-09-08 13:39:01.989 DEBUG 1 --- [ioEventLoop-6-1] o.b.jsse.provider.PropertyUtils : Boolean system property [jdk.tls.client.enableCAExtension] defaulted to: false
2022-09-08 13:39:01.989 DEBUG 1 --- [ioEventLoop-6-1] o.b.jsse.provider.PropertyUtils : Boolean system property [org.bouncycastle.jsse.client.enableSessionResumption] defaulted to: true
2022-09-08 13:39:01.990 DEBUG 1 --- [ioEventLoop-6-1] o.b.jsse.provider.PropertyUtils : Boolean system property [jdk.tls.client.enableStatusRequestExtension] defaulted to: true
2022-09-08 13:39:01.990 DEBUG 1 --- [ioEventLoop-6-1] o.b.jsse.provider.PropertyUtils : Boolean system property [org.bouncycastle.jsse.client.enableTrustedCAKeysExtension] defaulted to: false
2022-09-08 13:39:01.990 DEBUG 1 --- [ioEventLoop-6-1] o.b.jsse.provider.PropertyUtils : Boolean system property [jsse.enableSNIExtension] defaulted to: true
2022-09-08 13:39:02.115 DEBUG 1 --- [ioEventLoop-6-1] o.b.jsse.provider.ProvTlsClient : Client notified of selected protocol version: TLSv1.2
2022-09-08 13:39:02.116 DEBUG 1 --- [ioEventLoop-6-1] o.b.jsse.provider.ProvTlsClient : Server specified new session: 45c5467ce4acfa1f5e3953afa6e6d4cd6125dbde41cc316fe7226d789c14cab7
2022-09-08 13:39:02.116 DEBUG 1 --- [ioEventLoop-6-1] o.b.jsse.provider.ProvTlsClient : Client notified of selected cipher suite: TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
2022-09-08 13:39:02.117 DEBUG 1 --- [ioEventLoop-6-1] o.b.jsse.provider.ProvTlsClient : Server accepted SNI?: true
For a TLS server, it's jdk.tls.server.cipherSuites. You could try that instead; as long as the suite is also in the list the client offers, it should work.
I would assume your client software has its own configuration for TLS cipher suites somewhere, you'd have to check the docs for it.
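Where the client library exposes the underlying SSLEngine or SSLSocket, the suites can also be pinned directly through the standard JSSE API. A minimal sketch against the default JSSE provider (the same calls apply with BCJSSE installed):

```java
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;

public class PinSuite {
    public static void main(String[] args) throws Exception {
        SSLContext ctx = SSLContext.getDefault();
        SSLEngine engine = ctx.createSSLEngine();

        // Restrict the engine to TLS 1.2 and a single non-GCM suite.
        engine.setEnabledProtocols(new String[] { "TLSv1.2" });
        engine.setEnabledCipherSuites(
                new String[] { "TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256" });

        System.out.println(String.join(",", engine.getEnabledCipherSuites()));
    }
}
```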
@peterdettman ,
I was able to force the cipher suite TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256 by doing the following in Redis client builder:
.sslOptions(SslOptions.builder().cipherSuites("TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256").protocols(tlsVersion).build())
I can confirm that the issue does not occur when running with the CBC cipher. Your theory of the issue being related to GCM is correct. How do you want to proceed?
@oridool It's not clear to me that your tests with the non-FIPS BC providers would actually be using GCM cipher suites, since in that case our client offers CHACHA20_POLY1305 suites at higher priority. Could you please try the non-FIPS version with an explicit cipher suite configured e.g. TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256.
We'd also like to rule out any issues stemming from the java version (17 IIUC), so if it's possible, please test the FIPS version with Java 11. While VM bugs aren't all that common, we do occasionally hit them, so this check would be helpful.
@peterdettman , After doing the tests as you requested, here are my findings:
The error occurs sporadically only when running on Java 17 and using BC FIPS with the GCM cipher suite (TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256).
The error doesn't occur when:
- Running Java 17 with BC FIPS and a CBC cipher suite (TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256)
- Running Java 17 with non-FIPS BC v1.71 and GCM cipher suite
- Running Java 11 (11.0.16.1) with BC FIPS and GCM cipher suite
Hi @peterdettman, any conclusions from the latest findings?
@oridool
- A JVM (or JIT) error with 17 is still in the cards. Still, it seems a little weird that such an error takes an hour of heavy load to trigger. It will probably turn out to be more of a "difference that makes a difference" than a bug, but we'll see.
- The error is very suggestive of a race condition, but you are reporting that it still occurs single-threaded, so it would have to be something more exotic, like a race between the user thread and the finalizer thread.
- Other users have previously reported memory usage problems that ultimately stem from high memory usage by AES/GCM (and sometimes resulting finalizer queue issues), and the 1-2 hour trigger aligns with that, although there's nothing obvious in the exception related to that.
Practically speaking, the next step should be to simplify the reproducer to something that just hammers AES/GCM decryption (maybe encryption also needs to be included), basically by lifting the code out of TLS.
@peterdettman, I don't know how to do that, since I assume it requires deeper knowledge of the BC FIPS library. Can you or someone else try to isolate this? I've already shared my tester application (https://github.com/oridool/redistest); perhaps it can be used as a starting point.
@oridool Yes, I intend to write the low-level/direct reproducer myself.
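For reference, a low-level loop of the shape described above might look like the following. This is only a sketch against the standard JCA Cipher API using the JDK's default provider; the actual reproducer would register the BC FIPS provider and run far more iterations, likely under concurrent load:

```java
import java.util.Arrays;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

public class GcmHammer {
    public static void main(String[] args) throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        SecretKey key = kg.generateKey();
        SecureRandom random = new SecureRandom();
        byte[] plaintext = new byte[1024];

        // Repeatedly encrypt/decrypt with fresh nonces; a MAC failure here
        // would surface as AEADBadTagException, matching the TLS stack trace.
        for (int i = 0; i < 10_000; i++) {
            byte[] iv = new byte[12];
            random.nextBytes(iv);
            GCMParameterSpec spec = new GCMParameterSpec(128, iv);

            Cipher enc = Cipher.getInstance("AES/GCM/NoPadding");
            enc.init(Cipher.ENCRYPT_MODE, key, spec);
            byte[] ct = enc.doFinal(plaintext);

            Cipher dec = Cipher.getInstance("AES/GCM/NoPadding");
            dec.init(Cipher.DECRYPT_MODE, key, spec);
            byte[] pt = dec.doFinal(ct);
            if (!Arrays.equals(plaintext, pt)) {
                throw new IllegalStateException("roundtrip mismatch at " + i);
            }
        }
        System.out.println("OK");
    }
}
```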
This sounds an awful lot like a recent bug report we're currently investigating with 1.0.2.3. 1.0.2.3 wasn't tested with Java 17 and it's not certified for it. For now, the best advice I can give is that 1.0.2.3 should not be used with Java 17.
We believe we have a fixed version, though. If you would like to test it and confirm the problem goes away, email us at [email protected] and I will organise access.
Hi @dghgit / @peterdettman, I can confirm that the latest snapshot bc-fips jar I've tested indeed resolves the issue :). Well done!!!
Phew and thanks for the help in testing. For other readers the issue is BC-FJA 1.0.2.3 and how it interacts with Java 17. There is now a non-certified patch release of BC-FJA 1.0.2.4 which fixes the problem available at:
https://www.bouncycastle.org/latest_releases.html#1.0.2.4-NONCERT
For people concerned about the fact the release is not certified, keep in mind BC-FJA 1.0.2.3 is not certified on Java 17 either.