
pubsub: Many acks/modacks could cause other acks/modacks to be delayed

Open jameshartig opened this issue 1 year ago • 35 comments

Client

PubSub

Environment

CentOS on GCE (t2d-standard-4)

Go Environment

NumGoroutines=40
MaxExtension=15s
MaxOutstandingMessages=40000
MaxOutstandingBytes=-1
Synchronous=false
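
For reference, a minimal sketch of how those settings are applied on our side (project ID, subscription ID, and the handler body are placeholders):

package main

import (
	"context"
	"log"
	"time"

	"cloud.google.com/go/pubsub"
)

func main() {
	ctx := context.Background()
	client, err := pubsub.NewClient(ctx, "my-project") // placeholder project ID
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	sub := client.Subscription("my-subscription") // placeholder subscription ID
	sub.ReceiveSettings.NumGoroutines = 40
	sub.ReceiveSettings.MaxExtension = 15 * time.Second
	sub.ReceiveSettings.MaxOutstandingMessages = 40000
	sub.ReceiveSettings.MaxOutstandingBytes = -1 // disable byte-based flow control
	sub.ReceiveSettings.Synchronous = false

	if err := sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
		// Application-specific processing happens here before the ack.
		m.Ack()
	}); err != nil {
		log.Fatal(err)
	}
}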

Expected behavior

I expect to see 0 in subscription/expired_ack_deadlines_count assuming that my AckCount matches my PullCount.

Actual behavior

We periodically see a huge rate of expired acks, as high as 15k/s. We are currently acknowledging 20k messages per second across 2 GCE instances (~10k/s per instance) and pulling 20k messages per second across those instances as well, so I would not expect to be seeing any expired acks.

I don't know the actual distribution of messages across the 40 goroutines, but if some of them are receiving most of the messages, it's possible for ackIDBatchSize to be exceeded. When it is exceeded, sendModAck and sendAck both loop internally until all of the ackIDs have been sent. We don't have visibility into the distribution of time it takes to acknowledge 2500 ackIDs, but we can see from the GCP console that the Acknowledge method has a 95th percentile latency of over 100ms. Separately, we are calling ModifyAckDeadline (40ms at the 95th percentile) with 16k IDs per second, which needs 7 calls per instance and could take more than 250ms.

Either of those would end up delaying the other, since there's only a single sender goroutine, which could be contributing to our expired-acks issue.
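
Roughly, the mechanism is the following (an illustrative sketch, not the library's actual code; sendSerially and send are made-up names): the sender walks the pending IDs in fixed-size chunks and issues each RPC serially, so one slow Acknowledge or ModifyAckDeadline call delays everything queued behind it.

package example

import "context"

// Illustrative sketch of serial chunked sending; names are simplified and do
// not mirror the library's internals.
const ackIDBatchSize = 2500

func sendSerially(ctx context.Context, ackIDs []string, send func(context.Context, []string) error) error {
	for len(ackIDs) > 0 {
		n := ackIDBatchSize
		if len(ackIDs) < n {
			n = len(ackIDs)
		}
		// Each chunk is a blocking RPC; every chunk behind it waits.
		if err := send(ctx, ackIDs[:n]); err != nil {
			return err
		}
		ackIDs = ackIDs[n:]
	}
	return nil
}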

Additionally, since we aren't using exactly-once-delivery, there's no way for us to measure how long it took from when we called Ack to when the request was sent to Pub/Sub. One way to fix that would be for the *AckResult returned from AckWithResult to only become Ready once the message is actually sent, even if you're not using exactly-once-delivery.
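
For illustration, this is the kind of measurement we'd like to be able to make (a sketch only; today, without exactly-once delivery, the result is ready immediately, so the logged duration would not reflect when the RPC was actually sent):

package subscriber

import (
	"context"
	"log"
	"time"

	"cloud.google.com/go/pubsub"
)

// handler sketches the proposed measurement: if the AckResult only became
// ready once the Acknowledge RPC was actually sent, this would expose the
// client-side ack-to-send delay.
func handler(ctx context.Context, msg *pubsub.Message) {
	start := time.Now()
	res := msg.AckWithResult()
	go func() {
		if _, err := res.Get(ctx); err == nil {
			log.Printf("ack flushed after %v", time.Since(start))
		}
	}()
}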

Screenshots

[screenshots]

Something else that's interesting is that the latencies shown in the GCP Console do not match our application-level metrics (which measure from start of Receive callback to Ack/Nack function call) at all: image vs image

This is what led us to investigate if there was some sort of delay between when we Ack a message and when the underlying Acknowledge is sent by the client.

ModAckCount image (the reason for the increase at 19:30 UTC is that we increased MaxOutstandingMessages from 25000 to 40000)

Finally, the increase in expired acks happened after a sharp decrease in the StreamingPull response for which I have no explanation unless some change was made on Pub/Sub's side. It's not clear if this might mean that there's a higher concentration of messages in individual goroutines. image

Additional context

We don't have any visibility into the 99th percentile modack extension being used, which would have been helpful in debugging.

jameshartig avatar Apr 08 '24 21:04 jameshartig

I have some high level thoughts before doing a thorough investigation.

How large are your messages on average? StreamingPull streams support up to 10 MB/s per stream, so having 40 goroutines (the equivalent of 40 streams open) might be overkill. If you suspect network bottleneck issues, reducing this could help.

the increase in expired acks happened after a sharp decrease in the StreamingPull response for which I have no explanation unless some change was made on Pub/Sub's side. It's not clear if this might mean that there's a higher concentration of messages in individual goroutines

It's possible the decrease comes from streams restarting, as streams are periodically terminated, but you would need to open a ticket to provide project ID / subscription ID for me to look into this.

Additionally, since we aren't using exactly-once-delivery, there's no way for us to measure how long it took from when we called Ack to when the request was sent to Pub/Sub

We currently have a beta release of the library that includes OpenTelemetry tracing that would potentially provide insight into this. Given your very high throughput, you would have to significantly reduce the sampling rate to make sure you don't hit the tracing quota if using Cloud Trace. This might not be an issue if you use your own backend. Instructions to enable tracing live here for now.
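
For reference, a rough sketch of a heavily down-sampled tracer provider with the Cloud Trace exporter (the project ID and the 0.1% ratio are placeholders; enabling tracing on the Pub/Sub client itself follows the linked instructions for the preview build):

package main

import (
	"context"
	"log"

	texporter "github.com/GoogleCloudPlatform/opentelemetry-operations-go/exporter/trace"
	"go.opentelemetry.io/otel"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// Export spans to Cloud Trace; the project ID is a placeholder.
	exp, err := texporter.New(texporter.WithProjectID("my-project"))
	if err != nil {
		log.Fatal(err)
	}

	// Sample roughly 0.1% of traces so a high-throughput subscriber stays
	// under the Cloud Trace quota; tune the ratio as needed.
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exp),
		sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.001))),
	)
	defer func() { _ = tp.Shutdown(ctx) }()
	otel.SetTracerProvider(tp)

	// The Pub/Sub client must also be created with tracing enabled, per the
	// instructions linked above for the preview build.
}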

hongalex avatar Apr 08 '24 23:04 hongalex

How large are your messages on average? StreamingPull streams support up to 10 MB/s per stream, so having 40 goroutines (the equivalent of 40 streams open) might be overkill. If you suspect network bottleneck issues, reducing this could help.

Mean is 768 B and the 95th percentile is 1 KB. 40 was derived from 10 x NumCPU. We will try lowering that.

We currently have a beta release of the library that includes OpenTelemetry tracing that would potentially provide insight into this.

We will try that out and report back.

jameshartig avatar Apr 09 '24 03:04 jameshartig

@hongalex Here's a trace where we called the Ack after 1.7s but then the ack wasn't started for another 6s and then the ack itself took 8s: image

Here's one where the ack took 12s to start (maybe because the modack took 10s to end?): image

Another where the ack took 10s to start and I'm not sure why: image

There's already a lot of these cases and we only have the sampling set to record a few per minute.

This trace is a bit confusing since we set the MinExtensionPeriod=30s (to work around this issue) but I see 2 modacks within 100ms of each other. I'm guessing that's because the iterator doesn't check the expiration when it adds to pendingModAcks and just modacks everything that isn't expired: image

jameshartig avatar Apr 09 '24 14:04 jameshartig

@hongalex Also for added context, our handler batches messages in-memory and uploads batches to GCS in chunks of ~25k. So we are acking ~25k messages at once which might exacerbate this problem.

jameshartig avatar Apr 09 '24 15:04 jameshartig

Thanks for the screenshots, this helps a lot with understanding the problem. Are these values already including the change to lower NumGoroutines?

So I think the delayed acks are a result of what you wrote in the first comment: when batching large groups of ackIDs, any delay will result in a lot of further delay for messages later in the batch. Could you confirm what message batch size you're seeing? This can be seen on the "modack start" and "ack end" events' attributes.

This trace is a bit confusing since we set the MinExtensionPeriod=30s (to work around this issue) but I see 2 modacks within 100ms of each other. I'm guessing that's because the iterator doesn't check the expiration when it adds to pendingModAcks and just modacks everything that isn't expired:

That's correct. The first modack is the receipt modack that happens regardless, and subsequent modacks happen on a regular schedule.

hongalex avatar Apr 09 '24 16:04 hongalex

Are these values already including the change to lower NumGoroutines?

Yes, it was lowered to 20. I can lower it further if you think that would help?

Could you confirm what message batch size you're seeing? This can be seen on the "modack start" and "ack end" events' attributes.

image

The batch size for the ack was 2500, but the batch size for the modack was only 2, since the StreamingPull response only contained 2 messages, I assume. I'm not sure why the modack took 44 seconds to complete though; that's longer than the Acknowledge deadline. We might need traces into the grpc library to understand if there's maybe contention on the number of streams or connections, since the number of StreamingPull responses and modacks is back up to their normal levels. image

jameshartig avatar Apr 09 '24 16:04 jameshartig

Interesting, so maybe the bottleneck isn't the number of streams opened but the number of modacks sent. If the incoming batches are small, then a large number of modack RPCs will be sent relative to acks.

Can you explain how your Mod Ack Count graph is being populated?

hongalex avatar Apr 09 '24 17:04 hongalex

We might need traces into the grpc library to understand if there's maybe contention on the number of streams or connections, since the number of StreamingPull responses and modacks is back up to their normal levels.

I'm not sure why there aren't any grpc traces. We aren't setting WithTelemetryDisabled and from what I can tell from the code it looks like by default it calls addOpenTelemetryStatsHandler which uses otelgrpc.NewClientHandler(). Since we are only allowing pubsub-related spans (or children of those spans) in the sampler, I wonder if the context is getting lost somehow when the modack or ack happens.

Can you explain how your Mod Ack Count graph is being populated?

Both of those graphs are from the OpenCensus data from the pubsub client. We have this function:

var registerPubSubOpenCensusViewsOnce sync.Once // guards one-time registration

func registerPubSubOpenCensusViews() {
	registerPubSubOpenCensusViewsOnce.Do(func() {
		// pubsub.DefaultPublishViews and pubsub.DefaultSubscribeViews are the
		// client library's built-in OpenCensus views; lmetric is our internal
		// metrics wrapper.
		for _, v := range pubsub.DefaultPublishViews {
			lmetric.RegisterOpenCensusView("", v)
		}
		for _, v := range pubsub.DefaultSubscribeViews {
			lmetric.RegisterOpenCensusView("", v)
		}
	})
}

This registers each of those views with our own lmetric package, which exposes them to Prometheus; Prometheus is the source of those graphs.
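
For anyone reading along without an internal metrics wrapper, a roughly equivalent setup using the stock OpenCensus Prometheus exporter would look something like this (the namespace and port are arbitrary):

package main

import (
	"log"
	"net/http"

	"contrib.go.opencensus.io/exporter/prometheus"
	"go.opencensus.io/stats/view"

	"cloud.google.com/go/pubsub"
)

func main() {
	// Register the client library's publish/subscribe views (AckCount,
	// ModAckCount, etc.) so their data is collected.
	if err := view.Register(pubsub.DefaultPublishViews...); err != nil {
		log.Fatal(err)
	}
	if err := view.Register(pubsub.DefaultSubscribeViews...); err != nil {
		log.Fatal(err)
	}

	// Expose the registered views on /metrics for Prometheus to scrape.
	pe, err := prometheus.NewExporter(prometheus.Options{Namespace: "pubsub_client"})
	if err != nil {
		log.Fatal(err)
	}
	http.Handle("/metrics", pe)
	log.Fatal(http.ListenAndServe(":9090", nil))
}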

jameshartig avatar Apr 09 '24 17:04 jameshartig

I'm not sure why there aren't any grpc traces. We aren't setting WithTelemetryDisabled and from what I can tell from the code it looks like by default it calls addOpenTelemetryStatsHandler which uses otelgrpc.NewClientHandler()

So the generated gRPC traces are children of batch ack/modack spans that live in their own trace, outside of the message trace spans you've been screenshotting. The reason is a bit complicated but it has to do with how parent/child relationships work in trace views. You should be able to see spans named <subscription ID> ack and <subscription ID> modack, where the gRPC traces should exist below that. However, if your sampling rate is too low, it's possible these are not being sampled.

StreamingPull traces should also be generated, but they're not very useful, since they only represent the duration of the stream being open and not more specific information.

hongalex avatar Apr 09 '24 17:04 hongalex

So the generated gRPC traces are children of batch ack/modack spans that live in their own trace, outside of the message trace spans you've been screenshotting.

Ah, the reason our sampler was not catching these is that I thought they'd have semconv.MessagingSystemKey.String("pubsub") set on them, but the batch spans don't have any of these generic attributes set on them. Would they be able to get semconv.MessagingSystemKey.String("pubsub") and semconv.MessagingDestinationName(sub) set on them?

I'll make some changes to the sampler to include these other spans.
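
For reference, the sampler will end up looking roughly like this sketch (our real one also rate-limits, which is omitted): keep children of already-sampled spans, keep spans carrying messaging.system=pubsub, and, until the batch spans gain those attributes, match the "<subscription ID> ack"/"modack" span names.

package sampler

import (
	"strings"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
	"go.opentelemetry.io/otel/trace"
)

// pubsubOnlySampler keeps Pub/Sub-related spans and their children and drops
// everything else (sketch only).
type pubsubOnlySampler struct{}

func (pubsubOnlySampler) ShouldSample(p sdktrace.SamplingParameters) sdktrace.SamplingResult {
	psc := trace.SpanContextFromContext(p.ParentContext)
	decision := sdktrace.Drop

	switch {
	case psc.IsSampled():
		// Children of sampled spans stay in the same trace.
		decision = sdktrace.RecordAndSample
	case strings.HasSuffix(p.Name, " ack") || strings.HasSuffix(p.Name, " modack"):
		// Batch ack/modack spans currently lack messaging.* attributes,
		// so match them by span name.
		decision = sdktrace.RecordAndSample
	default:
		for _, a := range p.Attributes {
			if a.Key == semconv.MessagingSystemKey && a.Value.AsString() == "pubsub" {
				decision = sdktrace.RecordAndSample
				break
			}
		}
	}
	return sdktrace.SamplingResult{Decision: decision, Tracestate: psc.TraceState()}
}

func (pubsubOnlySampler) Description() string { return "pubsub-only sampler (sketch)" }

It would then be installed with sdktrace.WithSampler(pubsubOnlySampler{}) on the tracer provider.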

jameshartig avatar Apr 09 '24 18:04 jameshartig

Would they be able to get semconv.MessagingSystemKey.String("pubsub") and semconv.MessagingDestinationName(sub) set on them?

That's good feedback, let me get these added.

hongalex avatar Apr 09 '24 19:04 hongalex

I fixed our sampler to include the grpc traces and, interestingly, all of the spans >20s are from ModifyAckDeadline: image

Unfortunately the spans generated by grpc aren't very helpful: image

jameshartig avatar Apr 09 '24 20:04 jameshartig

I think the next thing is to figure out whether Pub/Sub is seeing the same high modack latencies on the server side. I suspect this isn't a bug with the server, but rather a throttled network or, perhaps less likely, a bug with the client library; still, it wouldn't hurt to check if you can open a support ticket.

hongalex avatar Apr 10 '24 01:04 hongalex

@hongalex The case number is 50592102.

Also, there are a few spans I found where the gRPC call is significantly delayed: image The following attributes were on that span:

messaging.batch.message_count	1
messaging.gcp_pubsub.ack_deadline_seconds	30
messaging.gcp_pubsub.is_receipt_modack	true

Then there are some like this: image where the first gRPC call returned status_code: 14 (UNAVAILABLE) and then it retried. That one also had:

messaging.batch.message_count	1
messaging.gcp_pubsub.ack_deadline_seconds	30
messaging.gcp_pubsub.is_receipt_modack	true

I'm not sure if that indicates some kind of client-side throttling is happening.

jameshartig avatar Apr 10 '24 15:04 jameshartig

So I understand that my colleagues are investigating the bug with the message IDs you provided. In the meantime, I'm still trying to rule out what's going on from the client library perspective.

From your latest message (and also earlier messages), it is a bit concerning that the batch is comprised of only a single message (ack ID). I was able to confirm that Pub/Sub should return, on a best-effort basis, as many messages as are available in a StreamingPull response.

Some open questions I have

  1. Why is the modack latency significantly higher than ack latency?
  2. Why are you receiving batches of 1 or 2 messages from the stream?

I plan to create some kind of reproduction with the numbers you gave me to see if I can repro the behavior above.

hongalex avatar Apr 12 '24 00:04 hongalex

Some more investigations

Why are you receiving batches of 1 or 2 messages from the stream?

So I realized this might be happening because of how flow control works. It seems that maybe one or two messages are acked at a time, which allows a few more messages to come through the stream (each of which triggers a modack RPC). I still need to do more investigation to see if this can be mitigated.

hongalex avatar Apr 12 '24 02:04 hongalex

It's hard for me to connect a subscribe span to a modack/ack span but it's easy to go the other way because there's a link. Is there anything we can do to make it easier to navigate both ways (subscribe -> modack/ack and modack/ack -> subscribe) with links?

So I realized this might be happening because of how flow control works.

I'm not sure if this is the reason because we are typically always acking in batches of ~25000 which should translate to ~9 Acknowledge calls containing 2500 ids and the final one which has the leftover. The GCP Console's trace explorer isn't flexible enough to let me graph the messaging.batch.message_count field, but when I looked at a 4-minute window there were 795 acknowledge spans, and over 300 of them had exactly 2500 IDs. I couldn't find a single one with fewer than 10 IDs.

I think the reason we're getting batches of 1-2 messages is that we're publishing in 1-2 message batches. My naive understanding of the server architecture is that the streaming subscribe batches mirror the publish batches when our subscribers are keeping up. We have 2 subscribe VMs, but we have >50 publishing VMs, each publishing messages coming from end-user HTTP requests, which typically only contain a few messages.

We also have the following settings on the publishing side:

DelayThreshold = time.Millisecond
CountThreshold = pubsub.MaxPublishRequestCount
Timeout = 15 * time.Second
BufferedByteLimit = 100 * pubsub.MaxPublishRequestBytes
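
For reference, these map onto the topic's PublishSettings roughly as follows (project and topic IDs are placeholders):

package main

import (
	"context"
	"log"
	"time"

	"cloud.google.com/go/pubsub"
)

func main() {
	ctx := context.Background()
	client, err := pubsub.NewClient(ctx, "my-project") // placeholder project ID
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	topic := client.Topic("my-topic") // placeholder topic ID
	topic.PublishSettings.DelayThreshold = time.Millisecond
	topic.PublishSettings.CountThreshold = pubsub.MaxPublishRequestCount
	topic.PublishSettings.Timeout = 15 * time.Second
	topic.PublishSettings.BufferedByteLimit = 100 * pubsub.MaxPublishRequestBytes

	// topic.Publish calls are then batched according to these settings.
}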

I don't have the tracing beta version on the publishing side, since I can't put beta code in that application, so I can't tell you exactly what the batches look like there. Depending on the region, the servers are publishing between 150 and 250 messages per second.

One thing we could try is raising the DelayThreshold to 10ms on the publishing side which should allow more messages to be bundled and see what effect that has on the number of StreamingPull responses on the subscriber side. That would meaningfully increase our end-user request latency, since publishes could now take up to 10ms, so I'm not sure how easily I could get that change in.

Also I noticed an odd pattern several times with the ModAckDeadline spans:

[screenshots of the ModAckDeadline span latencies]

It looks like the latencies start to increase until they plateau at 60 seconds (the client-side timeout) and then recover later. It's almost as if something on the server starts to respond more and more slowly and then either crashes or recovers somehow.

jameshartig avatar Apr 12 '24 03:04 jameshartig

I'm not sure if this is the reason because we are typically always acking in batches of ~25000 which should translate to ~9 Acknowledge calls containing 2500 ids and the final one which has the leftover

Hm, that's fair. I had contrived a toy example where processing time was randomized between 5 and 9 seconds, which resulted in acks completing randomly within that interval and less batching.

One thing we could try is raising the DelayThreshold to 10ms on the publishing side which should allow more messages to be bundled and see what effect that has on the number of StreamingPull responses on the subscriber side

I would hold off on making publisher side changes for now. I was recently told that Pub/Sub servers will try and deliver as many messages as possible in a single StreamingPull response, even if the messages are published in small batches. It used to be that publish batches were more tightly coupled with subscribe batches, but that was before server side flow control was introduced.

I'll try reproducing with a large number of smaller publishers to see if that changes anything on my end, but aside from that I wasn't able to get my modack latency high.

Edit: forgot to respond

It's hard for me to connect a subscribe span to a modack/ack span but it's easy to go the other way because there's a link. Is there anything we can do to make it easier to navigate both ways (subscribe -> modack/ack and modack/ack -> subscribe) with links?

The OpenTelemetry spec only recently started allowing links to be added after span creation, so this bidirectional linking wasn't previously possible. I'm looking into adding it now.

hongalex avatar Apr 12 '24 06:04 hongalex

@jameshartig apologies for the delay here. Would you be willing to try this fix that batches the initial receipt modacks? In cases where you're receiving a large number of small batched StreamingPull responses, this should reduce the number of modack RPCs being issued and hopefully lower the amount of network overhead those RPCs induce. Note, this branch is separate from the OpenTelemetry preview and builds upon the 1.37.1 release of the library.

go get cloud.google.com/go/pubsub@pubsub-batch-receipts

hongalex avatar May 02 '24 21:05 hongalex

@hongalex

I'm willing to test it, but I think modacks need to be in their own goroutine, because even with your proposed change they could still be held up by a long-running ack. If we had the acknowledgement deadline at 10 seconds, we'd have only 10 seconds to send the receipt modack, so if a large ack took 10+ seconds we would miss all of those receipt modacks. Even a separate goroutine isn't a perfect solution, because we've already found modack calls that take 60+ seconds.

However, if you think the underlying delay is a network/grpc contention issue and not something on the server side, then this could help.
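
To make the concern concrete, here is a rough sketch of the separation I have in mind (illustrative only, not the library's actual structure; the channel and function names are made up): receipt modacks get their own sender goroutine so a slow Acknowledge RPC cannot delay deadline extensions.

package example

import "context"

// sender drains acks and modacks on independent goroutines so a slow
// Acknowledge call cannot hold up receipt modacks (sketch only).
type sender struct {
	ackCh    chan []string // batches of ack IDs to acknowledge
	modAckCh chan []string // batches of ack IDs to extend
}

func (s *sender) run(ctx context.Context, sendAck, sendModAck func(context.Context, []string) error) {
	go func() {
		for ids := range s.ackCh {
			_ = sendAck(ctx, ids) // may take many seconds for large batches
		}
	}()
	go func() {
		for ids := range s.modAckCh {
			_ = sendModAck(ctx, ids) // deadline extensions keep flowing regardless
		}
	}()
}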

jameshartig avatar May 03 '24 15:05 jameshartig

@hongalex The ModAckCountView just counts total modacks, not the number of requests, so I can't use that metric to see the new count of modack requests. However, if I look at our GCP project metrics, I do see a sharp decline in ModifyAckDeadline API calls: image

Update: supporting the contention theory, there was a significant drop in CPU after deploying that branch: image

jameshartig avatar May 03 '24 16:05 jameshartig

Yeah so the biggest pain point I saw was the number of receipt modacks that were being issued since they aren't batched.

The reason I came back to this from a client library perspective is because I was working with someone on the backend side and couldn't see the Ack/Modacks actually reach the server, hinting there was network/cpu contention.

I'm curious to see if the fix holds stable. Your previous two graphs are great insights (thanks for sharing those) but let me know if anything changes. I'll start working on getting this fix merged.

Also, batching of receipts only happens for subscriptions without exactly-once delivery. I don't think you're using that, but I thought I'd let you know anyway.

hongalex avatar May 03 '24 18:05 hongalex

To best test it under load, I stopped our subscribers for a few minutes to let ~5 million messages accumulate in the backlog, then started the subscribers again. image It's definitely better than before. The 99th percentile for ack latency (on the Google side) is 1.5 min, the 95th is 1.25 min, and the 50th is 30 seconds. But that's still different from our internal metrics for the same period, which were 8 seconds, 4 seconds, and 3 seconds respectively, so I believe there are still some delayed acks happening. Since this build doesn't include tracing, I couldn't dig into them specifically.

jameshartig avatar May 03 '24 18:05 jameshartig

The reason I came back to this from a client library perspective is because I was working with someone on the backend side and couldn't see the Ack/Modacks actually reach the server, hinting there was network/cpu contention.

That makes a lot of sense. Do you have any thoughts on whether it's a Linux kernel TCP contention issue, a Go scheduling issue, or something in the Go grpc library? I'm not familiar enough with grpc to know if there's potentially a max-outstanding-in-flight limit or something similar that we might have been hitting.

jameshartig avatar May 03 '24 18:05 jameshartig

To be honest, I'm not confident where the bottleneck is. I suspect it's a gRPC issue, but I don't have the data to back that up.

I remember a while back, I had suggested increasing the WithGRPCConnectionPool option. What value is that set at now?

hongalex avatar May 03 '24 22:05 hongalex

I remember a while back, I had suggested increasing the WithGRPCConnectionPool option. What value is that set at now?

I missed that update to the docs, so we just have it at the default, which would be 4 since we're using t2d-standard-4 instances.

I can bump it up to 8 and see how that affects the ack latencies.
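
For anyone following along, the knob in question is a regular client option (project ID is a placeholder):

package main

import (
	"context"
	"log"

	"cloud.google.com/go/pubsub"
	"google.golang.org/api/option"
)

func main() {
	ctx := context.Background()
	// Open 8 gRPC connections instead of the default, which tracks the
	// machine's CPU count (4 on a t2d-standard-4, per the discussion above).
	client, err := pubsub.NewClient(ctx, "my-project", option.WithGRPCConnectionPool(8))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()
}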

jameshartig avatar May 06 '24 15:05 jameshartig

I raised the pool to 8, but the service was actually handling 50% more messages per second than earlier. image The latencies on Google's side were better: the 99th percentile for ack latency was 0.8 min, the 95th was 0.68 min, and the 50th was 25 seconds.

Raising the pool to 16: image Pretty much the same as 8. The 99th percentile for ack latency was 0.8 min, the 95th was 0.7 min, and the 50th was 30 seconds.

Though these numbers are slightly better than the default (4 connections), I'm not sure why grpc would be the bottleneck, since each connection should support 100 streams and, with your new batching changes, there is at most one modack/ack/receipt RPC in flight, so it shouldn't be more than a handful of streams at any given time.

I might need to wait until your batching is merged and combined with the OTel tracing changes so I can do more digging with tracing. There's still an issue where we might ack 25k messages at a time and if it takes 10 seconds (for example) to ack 2500 messages then we'll end up spending 100 seconds acking those 25k messages since it's still a single goroutine serially acking the messages.

There was a spike in latency (according to the Google API Dashboard) for the Acknowledge call right before and during the spikes in expired acks: image

That indicates each call is taking at most ~200ms; however, we are acking ~13k messages per second, which means we need to be able to make 5-6 Acknowledge calls per second. If we cannot make Acknowledge calls as fast as we are acking messages, the sender can fall behind.

jameshartig avatar May 07 '24 16:05 jameshartig

There's still an issue where we might ack 25k messages at a time and if it takes 10 seconds (for example) to ack 2500 messages then we'll end up spending 100 seconds acking those 25k messages since it's still a single goroutine serially acking the messages.

Yeah, so I'm inclined to make all of the ack/modack RPCs concurrent (within a single invocation of sendAck and sendModAck). That should theoretically improve the latency you are seeing here. However, a 10-second ack/modack RPC is a bit strange, which is why I want to investigate that more.
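
Sketched below is roughly what that could look like (illustrative only, not the actual change; it assumes errgroup for bounded concurrency and a made-up send helper):

package example

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// sendConcurrently issues the fixed-size chunks of a large ack/modack batch in
// parallel instead of serially, capping the number of in-flight RPCs
// (illustrative sketch only).
func sendConcurrently(ctx context.Context, ackIDs []string, send func(context.Context, []string) error) error {
	const chunkSize = 2500
	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(10) // cap in-flight RPCs per invocation

	for len(ackIDs) > 0 {
		n := chunkSize
		if len(ackIDs) < n {
			n = len(ackIDs)
		}
		ids := ackIDs[:n]
		ackIDs = ackIDs[n:]
		g.Go(func() error {
			// A 25k-message ack now costs a few overlapping round trips
			// rather than ten serial ones.
			return send(ctx, ids)
		})
	}
	return g.Wait()
}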

hongalex avatar May 10 '24 17:05 hongalex

Let me know what's next on my end for debugging. Happy to help. We might need the tracing changes though which means maybe we need to wait until your batching branch is merged?

jameshartig avatar May 10 '24 20:05 jameshartig

@jameshartig Sorry for the delay. Can you try pulling from go get cloud.google.com/go/pubsub@pubsub-batch-receipts again? I added the change that makes the ack/modack RPCs concurrent (called in their own goroutine), so this should improve performance somewhat.

hongalex avatar May 14 '24 03:05 hongalex