
[Bug] Messages seem broken when an SSL exception occurs in the producer

Open pqab opened this issue 2 years ago • 13 comments

Search before asking

  • [X] I searched in the issues and found nothing similar.

Version

2.10.5

Minimal reproduce step

  1. Create the topic
bin/pulsar-admin tenants create tenant1
bin/pulsar-admin namespaces create tenant1/namespace1
bin/pulsar-admin namespaces set-persistence --bookkeeper-ack-quorum 2 --bookkeeper-ensemble 3 --bookkeeper-write-quorum 3 --ml-mark-delete-max-rate 0 tenant1/namespace1
bin/pulsar-admin namespaces set-retention tenant1/namespace1 --size -1 --time 3d
bin/pulsar-admin namespaces set-message-ttl tenant1/namespace1 --messageTTL 604800
bin/pulsar-admin topics create-partitioned-topic tenant1/namespace1/topic1 -p 3
  2. Produce large payloads in batches from the perf tool with TLS
bin/pulsar-perf produce persistent://tenant1/namespace1/topic1 -mk autoIncrement -bb 5242880 -r 5000 -s 5242 -bm 1000 -threads 30 --auth-plugin org.apache.pulsar.client.impl.auth.AuthenticationTls --auth-params '{"tlsCertFile":"conf/user.cer","tlsKeyFile":"conf/user.key.pem"}'
  3. Stop the producer once it has produced around 1 million messages

  4. Wait until all the messages go to the BookKeeper backlog

  5. Start a consumer to consume all the messages with TLS

bin/pulsar-perf consume persistent://tenant1/namespace1/topic1 --auth-plugin org.apache.pulsar.client.impl.auth.AuthenticationTls --auth-params '{"tlsCertFile":"conf/user.cer","tlsKeyFile":"conf/user.key.pem"}' -sp Earliest -ss sub1
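Not part of the original report, but a hedged sketch of read-back commands to confirm the setup before re-running the steps above, assuming a running cluster and the same pulsar-admin tool:

```shell
# Read back the namespace policies set in step 1
bin/pulsar-admin namespaces get-persistence tenant1/namespace1
bin/pulsar-admin namespaces get-retention tenant1/namespace1
bin/pulsar-admin namespaces get-message-ttl tenant1/namespace1
# Confirm the partitioned topic exists
bin/pulsar-admin topics list-partitioned-topics tenant1/namespace1
```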

What did you expect to see?

The consumer should be able to consume all produced messages.

What did you see instead?

The consumer stopped receiving messages partway through, and errors like the following appeared in the broker logs:

2024-01-19T14:05:39,899+0000 [BookKeeperClientWorker-OrderedExecutor-4-0] ERROR org.apache.bookkeeper.proto.checksum.DigestManager - Mac mismatch for ledger-id: 852, entry-id: 35932
2024-01-19T14:05:39,902+0000 [BookKeeperClientWorker-OrderedExecutor-4-0] ERROR org.apache.bookkeeper.proto.checksum.DigestManager - Mac mismatch for ledger-id: 852, entry-id: 35932
2024-01-19T14:05:39,916+0000 [BookKeeperClientWorker-OrderedExecutor-4-0] ERROR org.apache.bookkeeper.proto.checksum.DigestManager - Mac mismatch for ledger-id: 852, entry-id: 35932
2024-01-19T14:05:39,916+0000 [BookKeeperClientWorker-OrderedExecutor-4-0] ERROR org.apache.bookkeeper.client.PendingReadOp - Read of ledger entry failed: L852 E35899-E35998, Sent to [100.87.157.209:3181, 100.111.147.236:3181, 100.96.184.253:3181], Heard from [100.87.157.209:3181, 100.111.147.236:3181, 100.96.184.253:3181] : bitset = {0, 1, 2}, Error = 'Entry digest does not match'. First unread entry is (35973, rc = 0)
2024-01-19T14:05:39,916+0000 [broker-topic-workers-OrderedExecutor-15-0] ERROR org.apache.pulsar.broker.service.persistent.PersistentDispatcherSingleActiveConsumer - [persistent://tenant1/namespace1/topic1-0 / sub1-Consumer{subscription=PersistentSubscription{topic=persistent://tenant1/namespace1/topic1-0, name=sub1}, consumerId=0, consumerName=383fd, address=/100.96.184.253:50090}] Error reading entries at 852:35899 : Entry digest does not match - Retrying to read in 15.0 seconds

Anything else?

This only seems to happen when there is an SSL exception in the middle of producing, like:

2024-01-19T13:39:13,450+0000 [pulsar-client-io-12-1] WARN  org.apache.pulsar.client.impl.ClientCnx - Got exception io.netty.handler.codec.DecoderException: io.netty.handler.ssl.ReferenceCountedOpenSslEngine$OpenSslException: error:100003fc:SSL routines:OPENSSL_internal:SSLV3_ALERT_BAD_RECORD_MAC
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:499)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
	at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:800)
	at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:499)
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:397)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: io.netty.handler.ssl.ReferenceCountedOpenSslEngine$OpenSslException: error:100003fc:SSL routines:OPENSSL_internal:SSLV3_ALERT_BAD_RECORD_MAC
	at io.netty.handler.ssl.ReferenceCountedOpenSslEngine.newSSLExceptionForError(ReferenceCountedOpenSslEngine.java:1377)
	at io.netty.handler.ssl.ReferenceCountedOpenSslEngine.shutdownWithError(ReferenceCountedOpenSslEngine.java:1089)
	at io.netty.handler.ssl.ReferenceCountedOpenSslEngine.sslReadErrorResult(ReferenceCountedOpenSslEngine.java:1399)
	at io.netty.handler.ssl.ReferenceCountedOpenSslEngine.unwrap(ReferenceCountedOpenSslEngine.java:1325)
	at io.netty.handler.ssl.ReferenceCountedOpenSslEngine.unwrap(ReferenceCountedOpenSslEngine.java:1426)
	at io.netty.handler.ssl.ReferenceCountedOpenSslEngine.unwrap(ReferenceCountedOpenSslEngine.java:1469)
	at io.netty.handler.ssl.SslHandler$SslEngineType$1.unwrap(SslHandler.java:223)
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1353)
	at io.netty.handler.ssl.SslHandler.decodeNonJdkCompatible(SslHandler.java:1257)
	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1297)
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:529)
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:468)
	... 15 more

or

2024-01-19T14:01:02,532+0000 [pulsar-client-io-6-1] WARN  org.apache.pulsar.client.impl.ClientCnx - Got exception io.netty.handler.codec.DecoderException: io.netty.handler.ssl.ReferenceCountedOpenSslEngine$OpenSslException: error:10000438:SSL routines:OPENSSL_internal:TLSV1_ALERT_INTERNAL_ERROR
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:499)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
	at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:800)
	at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:499)
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:397)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: io.netty.handler.ssl.ReferenceCountedOpenSslEngine$OpenSslException: error:10000438:SSL routines:OPENSSL_internal:TLSV1_ALERT_INTERNAL_ERROR
	at io.netty.handler.ssl.ReferenceCountedOpenSslEngine.newSSLExceptionForError(ReferenceCountedOpenSslEngine.java:1377)
	at io.netty.handler.ssl.ReferenceCountedOpenSslEngine.shutdownWithError(ReferenceCountedOpenSslEngine.java:1089)
	at io.netty.handler.ssl.ReferenceCountedOpenSslEngine.sslReadErrorResult(ReferenceCountedOpenSslEngine.java:1399)
	at io.netty.handler.ssl.ReferenceCountedOpenSslEngine.unwrap(ReferenceCountedOpenSslEngine.java:1325)
	at io.netty.handler.ssl.ReferenceCountedOpenSslEngine.unwrap(ReferenceCountedOpenSslEngine.java:1426)
	at io.netty.handler.ssl.ReferenceCountedOpenSslEngine.unwrap(ReferenceCountedOpenSslEngine.java:1469)
	at io.netty.handler.ssl.SslHandler$SslEngineType$1.unwrap(SslHandler.java:223)
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1353)
	at io.netty.handler.ssl.SslHandler.decodeNonJdkCompatible(SslHandler.java:1257)
	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1297)
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:529)
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:468)
	... 15 more

Are you willing to submit a PR?

  • [ ] I'm willing to submit a PR!

pqab avatar Jan 19 '24 14:01 pqab

Not an answer to this problem, but I noticed --ml-mark-delete-max-rate 0 in your persistence parameters. Could you share the reason for using that? It would be interesting to know if there's an underlying requirement that could be covered with a better solution in Pulsar. UPDATE: I now noticed that it's also in the examples: https://pulsar.apache.org/docs/next/admin-api-namespaces/#set-persistence-policies .

lhotari avatar Jan 19 '24 20:01 lhotari

-bb 5242880

Does the problem reproduce with lower sizes? Let's say 4000000 or 5000000.

lhotari avatar Jan 19 '24 20:01 lhotari

Which OS and platform is this (uname -a)? Is this Linux x86_64? Which Java version (java -version)?

lhotari avatar Jan 19 '24 20:01 lhotari

Not an answer to this problem, but I noticed --ml-mark-delete-max-rate 0 in your persistence parameters. Could you share the reason for using that? It would be interesting to know if there's an underlying requirement that could be covered with a better solution in Pulsar. UPDATE: I now noticed that it's also in the examples: https://pulsar.apache.org/docs/next/admin-api-namespaces/#set-persistence-policies .

Yes, I believe we were just following the examples when creating the topics.

pqab avatar Jan 20 '24 01:01 pqab

Yes, it is Linux x86_64; uname -a and java -version output:

Linux 4.20.11-1.el7.elrepo.x86_64 #1 SMP Wed Feb 20 09:00:53 EST 2019 x86_64 x86_64 x86_64 GNU/Linux

openjdk version "11.0.12" 2021-07-20 LTS
OpenJDK Runtime Environment 18.9 (build 11.0.12+7-LTS)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.12+7-LTS, mixed mode, sharing)

pqab avatar Jan 20 '24 01:01 pqab

-bb 5242880

Does the problem reproduce with lower sizes? Let's say 4000000 or 5000000.

I was trying to re-create the topic to clean the environment before re-running the test, but somehow the topic couldn't be deleted, and I kept seeing errors in the broker log like:

2024-01-20T01:07:32,897+0000 [pulsar-web-39-8] ERROR org.apache.pulsar.broker.admin.impl.PersistentTopicsBase - [superuser] Failed to delete topic persistent://tenant1/namespace1/topic1-partition-0
org.apache.bookkeeper.mledger.ManagedLedgerException$MetadataNotFoundException: Managed ledger not found

If I try to delete the topic, it says Topic not found, but if I try to re-create the same topic, it says Topic already exists.

I tried to create another new topic and publish with -bb 3242880, but the polling rate was very slow and the consumer stopped receiving new messages at some point. The server seems to be very busy now; we will need some time to check why the topic couldn't be deleted first.
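Not stated in the thread, but worth noting: partitioned topics are deleted with the partitioned-topic variant of the admin command, and mixing the plain and partitioned variants can produce exactly this "Topic not found" / "Topic already exists" mismatch. A hedged sketch, assuming a running cluster:

```shell
# Delete the partitioned-topic metadata along with all partitions
# (--force disconnects any active producers/consumers first)
bin/pulsar-admin topics delete-partitioned-topic tenant1/namespace1/topic1 --force
# Verify the metadata is gone before re-creating the topic
bin/pulsar-admin topics list-partitioned-topics tenant1/namespace1
```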

pqab avatar Jan 20 '24 01:01 pqab

Version

2.10.5

Do you have a specific reason to use 2.10.x? It is very close to end of support; according to our release policy, we should already have dropped support (the decision is pending a mailing list discussion). I'd recommend using and testing the latest LTS release, which is 3.0.2 at the moment.

In any case, it would be helpful to test with 3.0.2 to see if the problem can be reproduced there.

lhotari avatar Jan 20 '24 09:01 lhotari

Version

2.10.5

Do you have a specific reason to use 2.10.x? It is very close to end of support; according to our release policy, we should already have dropped support (the decision is pending a mailing list discussion). I'd recommend using and testing the latest LTS release, which is 3.0.2 at the moment.

In any case, it would be helpful to test with 3.0.2 to see if the problem can be reproduced there.

It's our current production version, and we found this issue when running a load test against our application.

Sure, thank you. We will upgrade to 3.0.2 and run the same test again to see if we still get the same issues. I will update the results here afterwards.

pqab avatar Jan 21 '24 03:01 pqab

After we upgraded to 3.1.2, we are still having the same issue.

It looks similar to https://github.com/apache/pulsar/issues/21421, but we are already using bookkeeperUseV2WireProtocol=true. Would you have any idea? Could it be something related to SSL?

For now we have another workaround: setting managedLedgerDigestType=MAC to bypass it. So far it looks fine; we don't see the error and consumption doesn't stop.
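For reference, the workaround amounts to this broker.conf fragment (a sketch; other settings are assumed at their defaults — the default digest type is CRC32C, and MAC switches the managed ledger to an HMAC-based digest):

```properties
# broker.conf — workaround sketch, not a recommended permanent setting
# Digest type used when the broker writes entries to BookKeeper.
managedLedgerDigestType=MAC
# Already enabled in this environment when the issue reproduced.
bookkeeperUseV2WireProtocol=true
```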

pqab avatar Jan 23 '24 07:01 pqab

After we upgraded to 3.1.2, we are still having the same issue.

It looks similar to #21421, but we are already using bookkeeperUseV2WireProtocol=true. Would you have any idea? Could it be something related to SSL?

For now we have another workaround: setting managedLedgerDigestType=MAC to bypass it. So far it looks fine; we don't see the error and consumption doesn't stop.

This is surprising, since I have been assuming that #21421 impacts the case where the libcirce-checksum.so library cannot be loaded and BookKeeper falls back to JVM-based CRC32 calculations with the com.scurrilous.circe.checksum.Java9IntHash class.

lhotari avatar Jan 29 '24 20:01 lhotari

Just wondering if you might be impacted by #21684? Do you happen to have haProxyProtocolEnabled=true?

lhotari avatar Jan 29 '24 20:01 lhotari

It's possible that this is a bug in BookKeeper's checksum calculation. I'll follow up later after investigating. Thanks for the good issue reports!

lhotari avatar Jan 29 '24 23:01 lhotari

Just wondering if you might be impacted by #21684? Do you happen to have haProxyProtocolEnabled=true?

we are using haProxyProtocolEnabled=false

pqab avatar Jan 30 '24 04:01 pqab

This is possibly related to #22601 / #22810, if it isn't fixed by Pulsar 3.0.5 / 3.2.3, which include BookKeeper 4.16.5. @pqab Do you have a chance to test with Pulsar 3.0.5?

lhotari avatar May 31 '24 13:05 lhotari

I believe @semistone is testing with Pulsar 3.2.2, and this error is gone

pqab avatar May 31 '24 13:05 pqab

I believe @semistone is testing with Pulsar 3.2.2, and this error is gone

Sounds good. By the way, it would be worth testing Pulsar 3.2.3, since there have been over 100 commits since 3.2.2. Pulsar 3.2.3 release notes: https://github.com/apache/pulsar/releases/tag/v3.2.3 . Pulsar 3.2.3 also includes BookKeeper 4.16.5 with important fixes.

lhotari avatar May 31 '24 13:05 lhotari

I think so. We will test with the latest version and see if all similar issues are gone.

pqab avatar May 31 '24 13:05 pqab

We tested with the 3.2.3 broker and client, and the issue is gone now.

pqab avatar Jun 07 '24 08:06 pqab