Rework gorouter error classifiers and retry logic
Is this a security vulnerability?
No.
Issue
The current classifiers are misleading because they seem to represent relations between different sets of errors which should be completely distinct and, to some extent, abuse the error type.
Affected Versions
All.
Context
These two changes are closely related which is why I opened this as one issue. If we feel like this is not a good idea I can split it into two separate issues.
Suggested Changes
There are two (relatively) independent changes that I would like to discuss.
Make All Classifier Groups Distinct
Just because the groups happen to share some errors / classifiers they shouldn't depend on each other. This is a possible source of confusion and has caused mistakes in the past.
diff --git a/proxy/fails/classifier_group.go b/proxy/fails/classifier_group.go
index d0a1da5..b790c01 100644
--- a/proxy/fails/classifier_group.go
+++ b/proxy/fails/classifier_group.go
@@ -1,41 +1,57 @@
package fails
type ClassifierGroup []Classifier
// RetriableClassifiers include backend errors that are safe to retry
//
// Backend errors are only safe to retry if we can be certain that they have
// occurred before any http request data has been sent from gorouter to the
// backend application.
//
// Otherwise, there’s risk of a mutating non-idempotent request (e.g. send
// payment) being silently retried without the client knowing.
var RetriableClassifiers = ClassifierGroup{
Dial,
AttemptedTLSWithNonTLSBackend,
HostnameMismatch,
RemoteFailedCertCheck,
RemoteHandshakeFailure,
RemoteHandshakeTimeout,
UntrustedCert,
ExpiredOrNotYetValidCertFailure,
IdempotentRequestEOF,
IncompleteRequest,
}
var FailableClassifiers = ClassifierGroup{
- RetriableClassifiers,
+ Dial,
+ AttemptedTLSWithNonTLSBackend,
+ HostnameMismatch,
+ RemoteFailedCertCheck,
+ RemoteHandshakeFailure,
+ RemoteHandshakeTimeout,
+ UntrustedCert,
+ ExpiredOrNotYetValidCertFailure,
ConnectionResetOnRead,
}
-var PrunableClassifiers = RetriableClassifiers
+var PrunableClassifiers = ClassifierGroup{
+ Dial,
+ AttemptedTLSWithNonTLSBackend,
+ HostnameMismatch,
+ RemoteFailedCertCheck,
+ RemoteHandshakeFailure,
+ RemoteHandshakeTimeout,
+ UntrustedCert,
+ ExpiredOrNotYetValidCertFailure,
+}
// Classify returns true on errors that are retryable
func (cg ClassifierGroup) Classify(err error) bool {
for _, classifier := range cg {
if classifier.Classify(err) {
return true
}
}
return false
}
This change includes a minor tweak to the groups as well: IdempotentRequestEOF and IncompleteRequest are no longer part of the FailableClassifiers and PruneableClassifiers because those two errors are only "annotated" versions of an underlying error. Their sole purpose is to be able to match them using the RetriableClassifiers because we checked some pre-condition that allows us to retry the wrapped error even though we usually wouldn't be able to retry in that case.
Get Rid of "annotated" Errors
We initially introduced IdempotentRequestEOF and IncompleteRequest because we needed a way to tell the classifiers that those errors are retry-able without unconditionally retrying all of the errors that might get wrapped inside them. This included an additional check which is done in isRetriable.
The main issue is that we wrap errors without providing details that are particularly relevant in the sense that they enrich the error message. They purely exist because we need to pass a value of type error to the classifier groups. Instead I propose to split the logic for performing retries:
diff --git a/proxy/round_tripper/proxy_round_tripper.go b/proxy/round_tripper/proxy_round_tripper.go
index 5d2c9d0..09fd093 100644
--- a/proxy/round_tripper/proxy_round_tripper.go
+++ b/proxy/round_tripper/proxy_round_tripper.go
@@ -440,24 +439,23 @@ func isIdempotent(request *http.Request) bool {
return false
}
-func (rt *roundTripper) isRetriable(request *http.Request, err error, trace *requestTracer) (bool, error) {
+func (rt *roundTripper) isRetriable(request *http.Request, err error, trace *requestTracer) bool {
// if the context has been cancelled we do not perform further retries
if request.Context().Err() != nil {
- return false, fmt.Errorf("%w (%w)", request.Context().Err(), err)
+ return false
}
// io.EOF errors are considered safe to retry for certain requests
// Replace the error here to track this state when classifying later.
if err == io.EOF && isIdempotent(request) {
- err = fails.IdempotentRequestEOFError
+ return true
}
// We can retry for sure if we never obtained a connection
// since there is no way any data was transmitted. If headers could not
// be written in full, the request should also be safe to retry.
if !trace.GotConn() || !trace.WroteHeaders() {
- err = fmt.Errorf("%w (%w)", fails.IncompleteRequestError, err)
+ return true
}
- retriable := rt.retriableClassifier.Classify(err)
- return retriable, err
+ return rt.retriableClassifier.Classify(err)
}
and completely remove IdempotentRequestEOF and IncompleteRequest. This way we get the benefits of our additional retries either from the checks that allow us to retry in special cases or from the classifiers while not tampering with the errors that are passed around and even displayed to the user (and I'm pretty sure that end-users would be confused if they see incomplete request (EOF) in response to their request).
Next Steps
- [ ] Collect Feedback and discuss potential issues
- [ ] Propose a PR
- [ ] Review & Merge
- [ ] Release
ping: @geofffranks @domdom82 @ameowlia
On first read through this seems pretty reasonable and thorough. I like the notion of splitting the classifiers out explicitly to reduce confusion and accidental side effects of changing any one of these individually.
Good idea. I'd say we tackle both issues in one go:
- Make prunable errors distinct from retryable errors. Only prune errors that make future requests impossible (i.e. TLS cert mismatch)
- Remove types for IdempotentEOF and IncompleteRequest errors and replace them with simple boolean checks. For this we have to keep in mind: If we remove the types we can only use
isRetriableto actually determine if an error is retriable, not theClassifyfunction of theRetriableClassifiersgroup. So far, this is not a problem because they are used only in one place (round tripper). But we won't be able to take the error anywhere else and then callClassifyon it do determine if it was / is retriable, for example in other handlers that want to add logs or logic. That's the only downside I see, but it is a rather theoretical one. - As for the access log, we are losing a little bit of information, where the
x-cf-routererrorfield would change fromEOF (via idempotent request)andincompleteRequest (EOF)to justEOFBut we might be able to fix that later on with a fix to access log.
Make prunable errors distinct from retryable errors. Only prune errors that make future requests impossible (i.e. TLS cert mismatch)
I would clarify this also include dial and handshake timeouts in in this. There could be cases where future requests would eventually be possible once the condition causing a timeout goes away, but we shouldn't keep the backend in the pool while its known to time out. Otherwise a lot of requests will incur extra latency trying it before retrying requests. If the backend eventually recovers, route-emitter will re-add it.
Sorry that his has been so quiet for the last months. I've split out the simple change into a dedicated PR that can probably be merged more easily and addresses the first issue: cloudfoundry/gorouter#355.
The combined changes are still in cloudfoundry/gorouter#349 but since I need to figure out the testing and the change is more involved I think it makes sense to separate them. I will look into it in the coming days and (hopefully) provide a full PR soon.
ah, sorry. gh auto-closed this when i merged :D
ah, sorry. gh auto-closed this when i merged :D
My fault, I tried to prevent it by changing the commit trailer to Partially-Resolves instead of Resolves but GitHub is pretty liberal with those trailers :smile:
Hi @domdom82 & @geofffranks , Any further activity on this issue needed?
There is still one PR from me open which I unfortunately never got around to finish up :/
So it's about half-way done 😄