Differentiate retries for NOT_OPEN and TIMED_OUT transport exceptions
The underlying thrift Ruby library raises TransportException under a variety of circumstances:
From thrift's socket.rb:
raise TransportException.new(TransportException::NOT_OPEN, "Connection timeout to #{@desc}")
raise TransportException.new(TransportException::NOT_OPEN, "Could not connect to #{@desc}: #{e}")
raise TransportException.new(TransportException::TIMED_OUT, "Socket: Timed out writing #{str.length} bytes to #{@desc}")
raise TransportException.new(TransportException::NOT_OPEN, e.message)
raise TransportException.new(TransportException::TIMED_OUT, "Socket: Timed out reading #{sz} bytes from #{@desc}")
raise TransportException.new(TransportException::NOT_OPEN, e.message)
raise TransportException.new(TransportException::UNKNOWN, "Socket: Could not read #{sz} bytes from #{@desc}")
The type of transport exception is a property of the exception instance (NOT_OPEN, TIMED_OUT, or UNKNOWN). Currently thrift_client provides the ability to retry only at the level of an exception class (e.g. TransportException). For non-idempotent requests, TIMED_OUT and UNKNOWN are not safe to retry, but NOT_OPEN should be safe. An ideal solution would be to have the Ruby thrift library itself raise subclasses of TransportException. In the interim, does this seem like a reasonable retry strategy to add to thrift_client? If so, I'm happy to do the implementation but would like some input on a good interface for configuring ThriftClient.
Some ideas I had:
- Add a new top-level option that lets users give a block to determine whether the exception should be retried
- Don't add any new options, but have thrift_client instead convert
Thrift::TransportExceptiontoThriftClient::TransportException::NotOpen,ThriftClient::TransportException::TimedOut, andThriftClient::TransportException::Unknownall of which subclass Thrift::TransportException, preserving existing behavior.
@klingerf and @sprsquish, I know this library isn't something that you work on day-to-day, but if you have a moment I'd love to hear your take on this issue.
I think the second option might be your best bet here as it maintains backward compatibility.
Wouldn't the block option also preserve compatibility? I think that would be a better API in general, and it would be less likely to break if people are directly comparing exception classes with ==.
What does the API look like in that case?
Current options (from the gem docs):
:exception_classes: Which exceptions to catch and retry when sending a request. Defaults to [IOError, Thrift::Exception, Thrift::ApplicationException, Thrift::TransportException, NoServersAvailable]
exception_class_overrides: For specifying children of classes in exception_classes for which you don't want to retry or reconnect.
Ways to allow a block for retry behavior and maintain backwards compatability:
- accept either an array or a block for
exception_classes(let's call this option A) - add a new optional parameter, name TBD, like
retry_exception_if(option B)
In order to use this new syntax to retry NOT_OPEN TransportException and raise TIMED_OUT TransportException, I would do something like:
Option A.
@client = ThriftClient.new(
MyCustomService::Client,
servers,
:transport_wrapper => Thrift::FramedTransport,
:timeout => 0.500,
:connect_timeout => 0.50,
:exception_classes => lambda do |ex|
# whatever logic you want here
[IOError, Thrift::ApplicationException, Thrift::TransportException, NoServersAvailable].any? {|ea| ex.is_a?(ea) } ||
( ex.is_a?(TransportException) && ex.type == NOT_OPEN )
end,
:raise => true,
)
Option B.
@client = ThriftClient.new(
MyCustomService::Client,
servers,
:transport_wrapper => Thrift::FramedTransport,
:timeout => 0.500,
:connect_timeout => 0.50,
:exception_classes => [IOError, Thrift::ApplicationException, Thrift::TransportException, NoServersAvailable], # have to provide so defaults with Thrift::Transport aren't used
:retry_exception_if => lambda do |ex|
( ex.is_a?(TransportException) && ex.type == NOT_OPEN )
end
)
I'm very flexible on the actual key name retry_exception_if in option B.
I have a slight preference for A, just because I think exception_classes + exception_class_overrides + retry_exception_if is a more confusing interface for users to consider, but I can always be convinced otherwise.
I think A is a bit confusing, I'd expect something called "exception_classes" to contain a list of classes. How about option B where, if the block is provided then exception_classes is ignored? That would provide backward compatibility.
Given that you want to change the default behavior though, you'd have to add the new exception classes rather than using the block. Maybe a hybrid approach: do the exception unwrapping, add the new exceptions to the default list, and add an option to provide a block that overrides exception_classes completely.
@sprsquish good point--providing something other than a list of classes as exception_classes is weird. Let me first try pursuing this from another angle to see if the Thrift project is receptive to the idea of adding Thrift::TransportException::NotOpen and Thrift::TransportException::TimedOut to the underlying library.
See https://issues.apache.org/jira/browse/THRIFT-2403.
Please consider stale idle connections in the group of safe to retry. I have the following scenario:
- long-running unicorn process talks to long-running thrift server API
- the unicorn process keeps the socket connection open and reuses it for future calls as they happen
- thrift server is restarted
- unicorn process tries to use the old stale connection, gets an error and incorrectly considers that server dead/error
Here the client should check for a remotely closed idle connection and treat it like NOT_OPEN.