Thrift health checker support
Commit Message: Introduce a thrift health checker so thrift only cluster could reflect the health-ness more correctly.
It will be a custom health checker. It sends a thrift request with configed method_name and expect a success response. The thrift server can respond an exception to cause an immediate health check failure.
An example setting for custom_health_check as a Thrift health checker is shown below:
custom_health_check:
name: envoy.health_checkers.thrift
typed_config:
"@type": type.googleapis.com/envoy.extensions.health_checkers.thrift.v3.Thrift
method_name: "ping",
transport: FRAMED
protocol: BINARY
Risk Level: low Testing: unit test Docs Changes: docs/root/configuration/upstream/health_checkers/thrift.rst [Optional Fixes #Issue] #22555
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @markdroth
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).
The basic flow is verified by unit tests. While I worked on testing corner cases, could I have some initial feedback?
cc @zuercher @rgs1 @fishcakez @PeterL328
Some code decisions:
(a) UpstreamRequest is not reusable since HC (Health Checker) creates ClientConnection directly but UpstreamRequest uses ConnectionPool. Therefore we implement a simple thrift client.
(b) Extract ThriftHealthChecker::Client for the sake of testing and possible future reuse.
(c) Inject ClientFactory so we can split the test to ThriftHealthChecker::Client and
mock it to test ThriftHealthChecker
(d) Introduce passthrough_decoder_event_handler for DIY propose
(e) Now we don't have retry semantic response.
@markdroth heads up that the reason we don't use not_in to exclude AUTO and TWITTER is a bug
for foreign enum which triggers the compile error
bazel-out/k8-dbg/bin/external/envoy_api/envoy/extensions/health_checkers/thrift/v3/thrift.pb.validate.cc:62:17: error: unknown type name 'TransportType'; did you mean 'filters::network::thrift_proxy::v3::TransportType'?
const std::set<TransportType> _Thrift_Transport_NotInLookup = {
^~~~~~~~~~~~~
filters::network::thrift_proxy::v3::TransportType
Code coverage for source/extensions/health_checkers/thrift is lower than limit of 96.6 (89.3)
more boundary cases will help
/retest
Retrying Azure Pipelines: Retried failed jobs in: envoy-presubmit
With some discussion with @fishcakez, we should use fixed sequence id 0.
/retest
Retrying Azure Pipelines: Retried failed jobs in: envoy-presubmit
@markdroth the implantation review is converging. would you have a chance to review api part? thanks.
@fishcakez noticed that we don't specify transport and protocol in thrift.proto to be required since protoc-gen doesn't support (validate.rules).enum.required.
Fortunately we throw for AUTO_PROTOCOL and AUTO_TRANSPORT, which is 0 and the default value as well, which means that we accidentally make them required. https://github.com/envoyproxy/envoy/blob/1fabd6decb8de8ec97d9d632c60ffb648a4c917a/api/envoy/extensions/filters/network/thrift_proxy/v3/thrift_proxy.proto#L33
Another feedback from him was that we might able to honor ThriftProtocolOptions when we have a good precedence design, so transport/protocol in thrift.proto could be optional, but we could discuss it when we have a need later.
@markdroth can you PTAL at the API changes? Assigning @zuercher as code-owner. /assign @zuercher
Thanks Adi. Leave a note that we've verified thrift HC in our environment.
@envoyproxy/api-shepherds any chance to move this forward? Thanks.
Hello @htuch This PR has @.rgs1 and @.fishcakez reviewed the api but need one more stamp from @envoyproxy/api-shepherds We conduct several bike-shedding as we can see in the comment so we believe it's good to go. We also verified the the implementation in our environment as well. Could you help to move this forward as we have you several thrift proto reviews? Thank you.
/retest
Retrying Azure Pipelines: Check envoy-presubmit isn't fully completed, but will still attempt retrying. Retried failed jobs in: envoy-presubmit
/retest
Retrying Azure Pipelines: Retried failed jobs in: envoy-presubmit
Code coverage for source/common/config is lower than limit of 96.5 (96.0)
This is irrelevant since I didn't touch the code and looks like it happens on top of the tree.