
status code for unclean connection termination

Open benjaminp opened this issue 4 years ago • 14 comments

When the TCP connection underlying a netty server connection is abruptly closed by the client, grpc-java goes through all active call streams and closes them as UNAVAILABLE: https://github.com/grpc/grpc-java/blob/6518d7bd6dc496c89ff320e06c73424f26af364a/netty/src/main/java/io/grpc/netty/NettyServerHandler.java#L613-L625 A problem we've seen with this arrangement is that there is no way to distinguish between UNAVAILABLE call failures caused by "real" server unavailability and those caused by the client uncleanly closing its connections (probably due to abrupt process death). What do you think about changing the status in this situation to CANCELLED?
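
(For context, the status in question is the one reported to server stream tracers and stats, not one sent on the wire. Below is a minimal sketch, with hypothetical names and wiring not taken from the issue, of where that status becomes visible.)

```java
import io.grpc.Metadata;
import io.grpc.ServerStreamTracer;
import io.grpc.Status;

// Hypothetical tracer that logs the status grpc-java records when a stream closes;
// after an abrupt TCP close it currently observes UNAVAILABLE.
final class CloseStatusTracerFactory extends ServerStreamTracer.Factory {
  @Override
  public ServerStreamTracer newServerStreamTracer(String fullMethodName, Metadata headers) {
    return new ServerStreamTracer() {
      @Override
      public void streamClosed(Status status) {
        // For streams torn down by NettyServerHandler after an unclean connection
        // termination, status.getCode() is UNAVAILABLE here.
        System.out.println(fullMethodName + " closed with " + status.getCode());
      }
    };
  }
}

// Hypothetical wiring:
// Server server = ServerBuilder.forPort(8080)
//     .addStreamTracerFactory(new CloseStatusTracerFactory())
//     .addService(new MyServiceImpl())
//     .build()
//     .start();
```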

benjaminp avatar Nov 09 '21 19:11 benjaminp

Personally I'm in favor of such a change. The conversation has come up several times (mostly as "don't use UNAVAILABLE for broken streams"), with the most recent probably in early 2019. But it's never really gone anywhere useful. There are two issues: 1) this is not an easy technical topic and 2) it is made more difficult by compatibility.

In any case, what is the status code preventing you from doing? Non-idempotency has been the issue in the past. But if streaming RPCs are involved here, is the problem something similar to exponential backoff? In our designs, we've tended not to consider a stream as failing after it has received at least one response, and we would immediately re-create it. If the RPC fails without receiving a response then that is a failure and we do exponential backoff.
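
(A minimal sketch of the reconnect policy described above, with a hypothetical `openStreamAndCountResponses()` helper standing in for the actual streaming call: re-create the stream immediately if it delivered at least one response, otherwise back off exponentially.)

```java
import java.util.concurrent.TimeUnit;

// Illustrative re-connect loop for a long-lived server-streaming RPC.
final class StreamReconnector {
  private static final long INITIAL_BACKOFF_MS = 1_000;
  private static final long MAX_BACKOFF_MS = 120_000;

  void run() throws InterruptedException {
    long backoffMs = INITIAL_BACKOFF_MS;
    while (!Thread.currentThread().isInterrupted()) {
      long responses = openStreamAndCountResponses();
      if (responses > 0) {
        backoffMs = INITIAL_BACKOFF_MS;           // made progress: reconnect immediately
      } else {
        TimeUnit.MILLISECONDS.sleep(backoffMs);    // no response: treat as a failure
        backoffMs = Math.min(backoffMs * 2, MAX_BACKOFF_MS);
      }
    }
  }

  // Hypothetical helper: runs the RPC until it ends and returns the response count.
  long openStreamAndCountResponses() {
    return 0;
  }
}
```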

ejona86 avatar Nov 12 '21 16:11 ejona86

Thank you for your response.

The reason I want to distinguish this case with a non-UNAVAILABLE status code is alerting. A busy, dying client can cause hundreds of UNAVAILABLE errors to be logged. That may page an operator, but it's not the server's "fault".

benjaminp avatar Nov 12 '21 16:11 benjaminp

How is this not the server's fault? (At least, in a different way than being unable to contact the server is the server's fault.) A connection dying like that means either the network killed the TCP connection or the server closed it because of MAX_CONNECTION_AGE_GRACE or the like.
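
(For reference, a sketch of the server-side settings mentioned here; the port and durations are illustrative, not recommendations. Once a connection exceeds the maximum age plus the grace period, the server forcibly closes it and any RPCs still running on it are terminated.)

```java
import io.grpc.netty.NettyServerBuilder;
import java.util.concurrent.TimeUnit;

class ServerAgeConfig {
  // Illustrative values: after 30 minutes the connection is asked to go away,
  // and after a further 5-minute grace period any remaining RPCs are killed.
  static NettyServerBuilder configure(int port) {
    return NettyServerBuilder.forPort(port)
        .maxConnectionAge(30, TimeUnit.MINUTES)
        .maxConnectionAgeGrace(5, TimeUnit.MINUTES);
  }
}
```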

ejona86 avatar Nov 12 '21 17:11 ejona86

The case I saw is that the client is killed by the Linux OOM killer. The kernel then closes all the client's TCP connections, leading the server to invoke the code in the OP on all the open streams and thus log UNAVAILABLE as their status. (The server's StreamObserver.onError method appears to be called with a CANCELLED: client cancelled throwable for the affected streams, which makes the UNAVAILABLE stat logging extra confusing.)
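
(A sketch of the inconsistency being described, assuming a client-streaming or bidi handler; the request type is a placeholder. The application-level observer sees CANCELLED while the stream tracer/stats record UNAVAILABLE.)

```java
import io.grpc.Status;
import io.grpc.stub.StreamObserver;

// Illustrative server-side request observer. When the client's connection dies
// abruptly, onError receives a CANCELLED StatusRuntimeException ("client cancelled"),
// even though stats record UNAVAILABLE for the same stream.
final class RequestObserver implements StreamObserver<String> {
  @Override
  public void onNext(String request) {
    // handle the request
  }

  @Override
  public void onError(Throwable t) {
    Status status = Status.fromThrowable(t);
    System.out.println("application saw: " + status.getCode());  // CANCELLED
  }

  @Override
  public void onCompleted() {
    // client half-closed normally
  }
}
```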

benjaminp avatar Nov 12 '21 17:11 benjaminp

Oh, I see. Interesting. In that case the status only applies to stats. The correct behavior isn't immediately obvious, but this is something that can be played around with much more easily than the client-observed status. (Although it may be strange that we purposefully have them see different statuses.)

ejona86 avatar Nov 12 '21 17:11 ejona86

Discussed this with other languages. Notes:

  • The various languages may not actually be handling this case the same way. We actually doubt they are consistent. We began wondering if Go is even reporting RPCs killed this way at all
  • In general, matching the status code the client sees is strongly preferable. Yes, there are times where the client and server would disagree, but this is a case they would actually agree
  • There was grumbling that there is a single set of status codes shared between the application and library; if they were split, this wouldn't be a problem. Nothing can be done here, but the perspective was that UNAVAILABLE here isn't the problem, but instead the shared set of codes (which may influence how to address it)
  • While the current stat may be arbitrarily defined, we can't easily redefine it to another arbitrary definition. Someone may need the current definition (just as you need a new one) and making a change like this can have fallout
  • If a strong need were demonstrated here (it would probably require multiple users), then we'd probably model this as a new stat with new semantics. We could either have all library-generated status codes be CANCELLED (potentially allowing some to be INTERNAL as well), or have the stat be just the server's status code and not report it in other cases

ejona86 avatar Nov 12 '21 18:11 ejona86

* In general, matching the status code the client sees is strongly preferable. Yes, there are times where the client and server would disagree, but this is a case they would actually agree

I'm confused by this point. The server is incapable of sending the client a status code because the connection is gone. How can the client and server status codes be said to match? (In my situation, the client process is beyond status codes; it's dead.)

* There was grumbling that there is a single set of status codes shared between the application and library; if they were split, this wouldn't be a problem. Nothing can be done here, but the perspective was that UNAVAILABLE here isn't the problem, but instead the shared set of codes (which may influence how to address it)

Even forgetting the application, I believe the problem is inconsistent use of status codes within the gRPC library. Sure, if there was an UNAVAILABLE_BECAUSE_CONNECTION_ABRUPTLY_CLOSED status code, it would solve my problem. But absent that, I believe the best approximation today is CANCELLED.

To conclude why this status code should be changed to CANCELLED:

  • This particular status code is only seen by stream tracers, so blast radius is limited.
  • The error that all application StreamObserver.onError callbacks will receive in the situation under consideration is a StatusRuntimeException with code CANCELLED. Eliminating the inconsistency between the stream tracer and stubs is valuable. Changing the status code the stream tracer sees is less likely to see compatibility problems than changing the error sent to stubs.
  • It resolves the alerting problem of being unable to distinguish "normal" UNAVAILABLE statuses from UNAVAILABLE statuses caused completely by the client, which cannot be avoided by better server programming or operations.

benjaminp avatar Nov 23 '21 02:11 benjaminp

The server is incapable of sending the client a status code because the connection is gone. How can the client and server status codes be said to match?

But we know how the client will react. The client will use status code UNAVAILABLE.

This particular status code is only seen by stream tracers, so blast radius is limited.

Except that this data is published via stats to monitoring tools. And those monitoring tools are tuned to the current stats.

The error that all application StreamObserver.onError callbacks will receive in the situation under consideration is a StatusRuntimeException with code CANCELLED. Eliminating the inconsistency between the stream tracer and stubs is valuable.

This is looking at the problem backward. We didn't expose the lower-level cause to services because a service really shouldn't be changing its logic due to the type of cancellation. But it was explicitly decided to share the additional details to stats, because that is for monitoring and won't impact behavior.

It resolves the alerting problem of being unable to distinguish "normal" UNAVAILABLE statuses from UNAVAILABLE statuses caused completely by the client

I don't disagree with this, but it can break other users. I'll also mention that technically a server can respond with CANCELLED; the collision of status codes is service-specific (although there may be widely used patterns).

ejona86 avatar Nov 24 '21 20:11 ejona86

The server is incapable of sending the client a status code because the connection is gone. How can the client and server status codes be said to match?

But we know how the client will react. The client will use status code UNAVAILABLE.

Do you know what situations this codepath will happen where both the client and the server are alive? (Middleboxes that decide to send TCP resets to each side?)

This particular status code is only seen by stream tracers, so blast radius is limited.

Except that this data is published via stats to monitoring tools. And those monitoring tools are tuned to the current stats.

There are so few bits of information provided by this stat that we should be able to reason about the possible effects. The application has alerting on one of:

  1. CANCELLED and UNAVAILABLE
  2. CANCELLED only
  3. UNAVAILABLE only

In case 1, the proposed change simply moves one kind of status to another. Case 2 seems very unlikely to be useful, since CANCELLED can arise from any number of client conditions that the server has no control over. This change would just add one more. Case 3 is the case I care about. Here, a client error is moved out from UNAVAILABLE to CANCELLED, improving the classification.

The error that all application StreamObserver.onError callbacks will receive in the situation under consideration is a StatusRuntimeException with code CANCELLED. Eliminating the inconsistency between the stream tracer and stubs is valuable.

This is looking at the problem backward. We didn't expose the lower-level cause to services because a service really shouldn't be changing its logic due to the type of cancellation. But it was explicitly decided to share the additional details to stats, because that is for monitoring and won't impact behavior.

If the client closes an HTTP/2 stream with a particular status, the server stub will see the status the client closed with, right? It won't be forced to CANCELLED?

It resolves the alerting problem of being unable to distinguish "normal" UNAVAILABLE statuses from UNAVAILABLE statuses caused completely by the client

I don't disagree with this, but it can break other users. I'll also mention that technically a server can respond with CANCELLED; the collision of status codes is service-specific (although there may be widely used patterns).

What is service-specific about it? The server library code of gRPC uses CANCELLED in some situations and UNAVAILABLE in some situations. This issue is about which status to use when the OP code runs. That's independent of statuses the application uses, though it behooves the application to use statuses in a way consistent with gRPC internals.

Is the upshot that this behavior can only be changed by introducing a new stat? (presumably along with StreamTracer API extensions that allow the new and old stats to be emitted)

benjaminp avatar Nov 30 '21 03:11 benjaminp

Do you know what situations this codepath will happen where both the client and the server are alive? (Middleboxes that decide to send TCP resets to each side?)

TCP-level LBs (which include common cloud hosting and things like k8s) and NATs will do stuff like this. Sometimes a firewall will also trigger in the middle of a connection. Dealing with this is the main reason for keepalive (to either proactively prevent it with activity or to notice that it has happened).
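
(A sketch of client-side keepalive configuration along these lines; the values are illustrative, and the server must permit this cadence, e.g. via permitKeepAliveTime / permitKeepAliveWithoutCalls on the server builder, or it will treat the pings as abusive.)

```java
import io.grpc.ManagedChannel;
import io.grpc.netty.NettyChannelBuilder;
import java.util.concurrent.TimeUnit;

class KeepaliveConfig {
  // Illustrative keepalive settings so connections silently dropped by
  // LBs/NATs/firewalls are noticed (or kept warm) instead of failing later RPCs.
  static ManagedChannel channel(String target) {
    return NettyChannelBuilder.forTarget(target)
        .keepAliveTime(30, TimeUnit.SECONDS)     // send a PING after this much inactivity
        .keepAliveTimeout(10, TimeUnit.SECONDS)  // consider the connection dead if no ACK
        .keepAliveWithoutCalls(true)             // keep pinging even with no active RPCs
        .build();
  }
}
```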

If the client closes an HTTP/2 stream with a particular status, the server stub will see the status the client closed with, right? It won't be forced to CANCELLED?

A client does not "close an HTTP/2 stream with a particular status." It can only cancel. The server stub will only see a cancellation event with zero bits of other information. The only way for an RPC to end without the server specifying the status code is via cancellation.
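
(To make the asymmetry concrete: on the client, ending an unwanted RPC means cancelling it, e.g. via ClientCall.cancel(...) or by cancelling a Context the call runs under. The sketch below, with a hypothetical helper name, shows the server's side of that: it can observe the cancellation, but no status code comes with it.)

```java
import io.grpc.stub.ServerCallStreamObserver;
import io.grpc.stub.StreamObserver;

// Sketch of the server-side view of a client cancellation: the service can
// register a callback, but the callback carries only the fact of cancellation.
final class CancellationWatch {
  static <Resp> void watchForCancellation(StreamObserver<Resp> responseObserver) {
    if (responseObserver instanceof ServerCallStreamObserver) {
      ((ServerCallStreamObserver<Resp>) responseObserver).setOnCancelHandler(
          () -> System.out.println("client cancelled; no status code is available"));
    }
  }
}
```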

It resolves the alerting problem of being unable to distinguish "normal" UNAVAILABLE statuses from UNAVAILABLE statuses caused completely by the client

What is service-specific about it?

The request does not distinguish between two UNAVAILABLE statuses. Instead, it moves a case that uses UNAVAILABLE to CANCELLED. That works because your service implementation is using UNAVAILABLE but not using CANCELLED. Some other service could technically be different.

The issue is about which status to use when the OP code runs.

Yes and no. The change to gRPC would not depend on the status the application uses. But it is only a solution because of the status codes used by the application.

Is the upshot that this behavior can only be changed by introducing a new stat? (presumably along with StreamTracer API extensions that allow the new and old stats to be emitted)

Yes, that is what I was saying earlier, based on the cross-language discussion. And yes, we'd implement it by another method on StreamTracer.

The only other approach I see is to change the actual status code used by the client in this situation (e.g., maybe via configuration on the Channel to avoid breaking existing users, and similarly on the server side). But that would take serious effort and has no guarantee of success.
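
(Purely to illustrate the "new stat" direction discussed above; this is not an actual or proposed grpc-java API. The idea is to report library-generated terminations separately from application-set statuses so monitoring can tell them apart.)

```java
import io.grpc.Status;

// Hypothetical shape of a tracer extension with an additional stat.
abstract class HypotheticalServerStreamTracer {
  /** Existing semantics: the status recorded for stats today. */
  public void streamClosed(Status status) {}

  /** Hypothetical new stat: only a status the server application actually set. */
  public void streamClosedByApplication(Status status) {}

  /** Hypothetical new stat: library-generated terminations (e.g. broken connections). */
  public void streamClosedByTransport(Status status) {}
}
```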

ejona86 avatar Dec 09 '21 15:12 ejona86

How about adding an option to NettyServerBuilder to configure what error code is logged in this case?

benjaminp avatar Dec 09 '21 22:12 benjaminp

@benjaminp, this is a cross-language problem. Adding an option to NettyServerBuilder would put pressure on all languages to support it. That means we really need to have cross-language alignment before we make such a change.

ejona86 avatar Dec 23 '21 16:12 ejona86

The problem is mostly in ServerStreamTracer, which I believe is an API specific to grpc-java. E.g., https://github.com/grpc/grpc-java/issues/8684#issuecomment-967340827 says that grpc-go doesn't even trace a status for the unclean connection termination condition.

benjaminp avatar Jan 03 '22 20:01 benjaminp

ServerStreamTracer is used to gather stats, and those stats are cross-language in definition. Also, we may be seeing other languages introduce a similar tracer API (internally). While there may be bugs (which should be fixed), the expected behavior of the languages is clear. If we changed the behavior, we would be introducing what would then be considered a bug, which is obviously the wrong direction.

I'm quite sympathetic to the issue you face. But you're the only one to have raised this issue and there isn't a quick-and-easy fix. So it really has to have a low priority.

ejona86 avatar Feb 15 '22 17:02 ejona86