feat: add hbone_idle_timeout field to MeshConfig API
Add a configurable idle timeout for HBONE connections between proxies and ztunnel, to address stale connection reuse when pod IPs are recycled.
This is particularly critical in environments with aggressive IP address reuse, such as AWS EKS with VPC CNI (default 30s cooldown period). Without an explicit idle timeout, Envoy defaults to 1 hour, causing proxies to reuse stale connections from connection pools when target pod IPs are recycled, resulting in 503 errors and upstream reset failures.
The new hbone_idle_timeout field in MeshConfig allows operators to configure the idle timeout appropriately for their environment. For AWS VPC CNI, a value of 15 seconds is recommended.
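To make the reasoning behind the 15-second recommendation concrete, here is a minimal sketch of the timing argument. It is a deliberately simplified model, not Envoy's or Istio's actual logic: the function name `can_reuse_stale_connection` is hypothetical, and the only inputs are the two durations quoted in this description (the 1-hour Envoy default and the 30-second VPC CNI cooldown).

```python
from datetime import timedelta

# Values quoted in the PR description; treat them as assumptions for your cluster.
ENVOY_DEFAULT_IDLE_TIMEOUT = timedelta(hours=1)
VPC_CNI_IP_COOLDOWN = timedelta(seconds=30)


def can_reuse_stale_connection(idle_timeout: timedelta, ip_cooldown: timedelta) -> bool:
    """Return True if an idle pooled connection can outlive the IP cooldown.

    Simplified worst case (hypothetical model): the connection goes idle at the
    moment the target pod is deleted (t=0). The pod's IP may be handed to a new
    pod at t=ip_cooldown, and the proxy closes the idle connection at
    t=idle_timeout. If the connection is still pooled when the IP becomes
    assignable again, it may be reused against the wrong workload.
    """
    return idle_timeout >= ip_cooldown


# With the 1-hour default, the pooled connection outlives the 30s cooldown.
print(can_reuse_stale_connection(ENVOY_DEFAULT_IDLE_TIMEOUT, VPC_CNI_IP_COOLDOWN))

# With the recommended 15s, the connection is closed before the IP can be reused.
print(can_reuse_stale_connection(timedelta(seconds=15), VPC_CNI_IP_COOLDOWN))
```

Under this model, any idle timeout strictly below the CNI cooldown closes the window, and 15 seconds leaves a margin of half the default cooldown.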
See: https://github.com/istio/istio/pull/58389
😊 Welcome @dcoppa! This is either your first contribution to the Istio api repo, or it's been a while since you've been here.
Hi @dcoppa. Thanks for your PR.
I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the ok-to-test label.
/ok-to-test
GitHub is being weird for me, so here's my review comment:
Isn't this idle timeout just for Envoy? ztunnel doesn't respect it, right? We should be clear about that in the comment (which gets turned into docs).
Also, should this kind of setting be part of ProxyConfig, so that different Envoys can have different values?
I believe the current placement in MeshConfig is more appropriate because the underlying issue is infrastructure-wide: IP address recycling in the AWS VPC CNI affects all workloads equally, and the 30-second cooldown period is applied cluster-wide. As a result, there is no clear justification for giving different workloads distinct HBONE idle timeouts. This choice is also consistent with the existing connect_timeout, which already resides in MeshConfig and represents a similar connection-level timeout.
As for the documentation, I tried to make the comment clearer following your advice. Is it better now?
Just got back from holiday break and this looks pretty good to me. I'll approve after a rebase
Could someone else please review this?