feat: add hbone_idle_timeout field to MeshConfig API

Open dcoppa opened this issue 1 month ago • 5 comments

Add configurable idle timeout for HBONE connections between proxies and ztunnel to address stale connection reuse when pod IPs are recycled.

This is particularly critical in environments with aggressive IP address reuse, such as AWS EKS with VPC CNI (default 30s cooldown period). Without an explicit idle timeout, Envoy defaults to 1 hour, causing proxies to reuse stale connections from connection pools when target pod IPs are recycled, resulting in 503 errors and upstream reset failures.

The new hbone_idle_timeout field in MeshConfig allows operators to configure the idle timeout appropriately for their environment. For AWS VPC CNI, a value of 15 seconds is recommended.
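The timing argument can be sketched as a small check. This is only an illustrative model, not Istio or Envoy code: the helper `stale_reuse_possible` and the constant names are hypothetical, and it assumes the simplification that a pooled connection idle since the old pod went away can still be reused whenever the idle timeout is at least as long as the IP cooldown. The three durations are the ones quoted in this description.

```python
from datetime import timedelta

# Durations quoted in the PR description.
ENVOY_DEFAULT_IDLE_TIMEOUT = timedelta(hours=1)         # Envoy default when nothing is set
VPC_CNI_IP_COOLDOWN = timedelta(seconds=30)             # AWS VPC CNI default cooldown
RECOMMENDED_HBONE_IDLE_TIMEOUT = timedelta(seconds=15)  # value suggested for AWS VPC CNI


def stale_reuse_possible(idle_timeout: timedelta, ip_cooldown: timedelta) -> bool:
    """Hypothetical helper (simplified model, not an Istio API).

    Returns True when an idle pooled connection can outlive the IP cooldown,
    i.e. it may still be in the pool after the IP was handed to another pod.
    """
    return idle_timeout >= ip_cooldown


if __name__ == "__main__":
    for label, timeout in [
        ("Envoy default", ENVOY_DEFAULT_IDLE_TIMEOUT),
        ("recommended", RECOMMENDED_HBONE_IDLE_TIMEOUT),
    ]:
        risky = stale_reuse_possible(timeout, VPC_CNI_IP_COOLDOWN)
        print(f"{label}: idle timeout {timeout}, stale reuse possible: {risky}")
```

Under this model, any idle timeout shorter than the cooldown closes the pooled connection before the IP can be reassigned, which is why a value below 30 seconds is suggested for the default VPC CNI setting.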

See: https://github.com/istio/istio/pull/58389

dcoppa, Dec 05 '25 12:12

😊 Welcome @dcoppa! This is either your first contribution to the Istio api repo, or it's been a while since you've been here.

You can learn more about the Istio working groups, Code of Conduct, and contribution guidelines by referring to Contributing to Istio.

Thanks for contributing!

Courtesy of your friendly welcome wagon.

istio-policy-bot, Dec 05 '25 12:12

Hi @dcoppa. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

istio-testing, Dec 05 '25 12:12

/ok-to-test

ilrudie, Dec 05 '25 13:12

GitHub is being weird for me, so here's my review comment:

Isn't this idle timeout just for Envoy? Ztunnel doesn't respect it, right? We should be clear about that in the comment (which gets turned into docs).

Also, should this kind of setting be a part of proxy config so that different Envoys can have different values?

keithmattix, Dec 11 '25 14:12

> GitHub is being weird for me, so here's my review comment:
>
> Isn't this idle timeout just for Envoy? Ztunnel doesn't respect it, right? We should be clear about that in the comment (which gets turned into docs).
>
> Also, should this kind of setting be a part of proxy config so that different Envoys can have different values?

I believe the current placement in MeshConfig is more appropriate because the underlying issue is infrastructure-wide: IP address recycling in the AWS VPC CNI affects all workloads equally, and the 30-second cooldown period is applied cluster-wide. As a result, there is no clear justification for giving different workloads distinct HBONE idle timeouts. This choice is also consistent with the existing connect_timeout, which already resides in MeshConfig and represents a similar connection-level timeout.

As for the documentation, I tried to make the comment clearer following your advice. Is it better now?

dcoppa, Dec 11 '25 15:12

Just got back from holiday break and this looks pretty good to me. I'll approve after a rebase.

keithmattix, Jan 05 '26 20:01

Could someone else please review this?

dcoppa, Jan 08 '26 19:01