Diego::Client targets active bbs instance with id from locket
- A short explanation of the proposed change:
Get the bosh instance id for the currently-active bbs process from locket, so that Diego::Client can use an instance-specific bosh-dns URL instead of one that resolves to all IPs in the instance group (i.e. fd0458bf-71f9-4df2-9ab1-5fb523a92e56.bbs.service.cf.internal, not bbs.service.cf.internal).
- An explanation of the use cases your change solves
Mitigates the risk of "Runner is unavailable" errors during a cf push when the client is unable to reach a bbs instance.
bbs operates as a cluster. At any one time one instance is 'active', and able to serve responses to API calls, while others are inactive. An active instance holds the bbs lock from locket.
At present, whenever the Diego::Client needs to communicate with bbs it hits the URL passed in from config.diego.bbs.url. The default value passed in for this from the capi-release is https://bbs.service.cf.internal:8889. This domain is defined by cf-deployment as a bosh-dns alias for the diego-api instance group with query q-s4 (i.e. resolving to all instances, whether healthy or unhealthy, in a random order).
When all bbs instances are reachable, you either hit the active instance first, or you hit an inactive one (connection refused) and the http client tries the next IP in the list. But if the first IP in the list is unreachable, the request hangs until it times out (by default after 10s) and then fails.
The Diego::Client makes three attempts to reach the active bbs instance before giving up and raising an error. Say you've got two availability zones with one diego-api instance in each, and a network issue makes one unreachable. In such a case, every API call currently has a 1/8 chance that the first IP in the list is the unreachable one three times in a row, which will cause cf push and other commands to fail (see the quick check below).
Note:
- We considered switching to a `q-s3` bosh-dns query (healthy instances only) but decided against this. If some process other than `bbs` were failing on the active bbs instance, it would be impossible to reach the Diego API (DNS wouldn't resolve to the active instance and no leadership election would be triggered).
- `locket` is colocated with `bbs` on the `diego-api` instance. Retries are handled automatically by gRPC (and with a shorter 5-second timeout). A sketch of the proposed lookup follows this list.
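To make the proposal concrete, here's a minimal sketch of the kind of lookup described above. It is illustrative rather than the actual code in this PR's diff, and it assumes (rather than confirms) that the Ruby classes generated from locket.proto live under the Models namespace, that the bbs lock is stored under the key 'bbs', and that the lock's owner field carries the bosh instance id:

```ruby
# Illustrative sketch only - not the actual code in this PR's diff.
# Assumptions: the classes generated from locket.proto live under the Models
# namespace, the bbs lock is stored under the key 'bbs', and the lock's owner
# field holds the bosh instance id of the active bbs.
class ActiveBbsResolver
  BBS_LOCK_KEY = 'bbs'.freeze # assumed lock key

  # locket_stub: an already-configured gRPC stub for the Locket service
  # (e.g. Models::Locket::Stub built with the mutual-TLS credentials Locket requires).
  def initialize(locket_stub)
    @locket_stub = locket_stub
  end

  # Builds an instance-specific bosh-dns URL such as
  # https://fd0458bf-71f9-4df2-9ab1-5fb523a92e56.bbs.service.cf.internal:8889
  def active_bbs_url
    resource = @locket_stub.fetch(Models::FetchRequest.new(key: BBS_LOCK_KEY)).resource
    "https://#{resource.owner}.bbs.service.cf.internal:8889"
  end
end
```

Diego::Client would then point its requests at that URL instead of the group-wide bbs.service.cf.internal alias.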
- Links to any other associated PRs
cloudfoundry/capi-release#275 (should be merged before this PR)
- [x] I have reviewed the contributing guide
- [x] I have viewed, signed, and submitted the Contributor License Agreement
- [x] I have made this pull request to the `main` branch
- [x] I have run all the unit tests using `bundle exec rake`
- [x] I have run CF Acceptance Tests
This holds up quite nicely with an HA setup, even when all VMs in a particular availability zone are isolated to simulate an outage. I've run the majority of the cf-acceptance-tests specs, and all pass in that scenario. Running without the changes in the same scenario, the very first cf push fails with this:
```json
{
  "created_at": "2022-10-11T09:47:51Z",
  "errors": [
    {
      "code": 170015,
      "detail": "Runner is unavailable: Process Guid: eb07d1a8-b37b-440a-b3fe-166d877c99b8-35a31446-15a1-4813-b489-89a208e5c179: execution expired",
      "title": "CF-RunnerUnavailable"
    }
  ],
  "guid": "a42634d2-2b91-4030-8e57-0d89e37070cb",
  "links": {
    "self": {
      "href": "https://redacted.sapcloud.io/v3/jobs/a42634d2-2b91-4030-8e57-0d89e37070cb"
    },
    "space": {
      "href": "https://redacted.sapcloud.io/v3/spaces/013822bf-4c0d-40e6-8fa7-1622d46ae59a"
    }
  },
  "operation": "space.apply_manifest",
  "state": "FAILED",
  "updated_at": "2022-10-11T09:48:22Z",
  "warnings": []
}
```
Locket seems like an implementation detail of Diego; I'd hesitate to use it for service discovery from another component without thinking deeply about it.
Could this be solved by going through the list of IPs returned instead of picking one randomly? Presumably the successful IP could be cached in memory and used until Diego goes through another leader election?
> Locket seems like an implementation detail of Diego; I'd hesitate to use it for service discovery from another component without thinking deeply about it.
Definitely a valid concern. At the same time, it's worth weighing that against the fact that, aside from Diego itself, there are four other components packaged in cf-deployment with clients that talk to Locket: the routing-api, tps, policy-server-asg-syncer and cc_deployment_updater. The latter is part of cloud_controller_ng and has had a dependency on locket since CAPI 1.63.0, in 2018.
The risk is lowered a bit more by the fact that we're doing this with quite a modest tweak to a pre-existing wrapper around a gRPC client whose Ruby code has been generated from a proto file. That provides some insurance against things changing.
Given the above, and the relatively superficial nature of the dependency I've added, I feel on balance it's unlikely that future changes in the way Diego uses Locket are going to be drastic enough to risk serious disruption for us (or that the CC's usage would deter such refactoring).
> Could this be solved by going through the list of IPs returned instead of picking one randomly? Presumably the successful IP could be cached in memory and used until Diego goes through another leader election?
Currently the CC uses the httpclient gem, which handles DNS resolution itself. The alternative would, I guess, be for Diego::Client to use another library to do an 'explicit' lookup against bosh-dns, loop over the list of IPs returned, try an HTTP POST against each one while rescuing the failures, and then cache and reuse the IP of the active server. This would work, though I think having the client do DNS lookups like this and continue to discover the server by trial and error (albeit with caching) isn't ideal.
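For comparison, here's a rough sketch of what that alternative might look like. The class name, caching policy, and error handling are all illustrative, TLS hostname verification against a bare IP is glossed over, and this is not what the PR implements:

```ruby
# Rough sketch of the look-up-the-IPs-ourselves alternative. Everything here
# (class name, caching policy, error handling) is illustrative, and TLS
# hostname verification against a bare IP is glossed over.
require 'resolv'
require 'httpclient'

class BbsEndpointCache
  BBS_HOST = 'bbs.service.cf.internal'.freeze
  BBS_PORT = 8889

  def initialize
    @client = HTTPClient.new
    @active_ip = nil # remembered until a request to it fails
  end

  def post(path, body)
    # Try the cached IP first, then everything bosh-dns currently returns.
    candidates = ([@active_ip] + Resolv.getaddresses(BBS_HOST)).compact.uniq
    candidates.each do |ip|
      begin
        response = @client.post("https://#{ip}:#{BBS_PORT}#{path}", body)
        @active_ip = ip # cache until it stops working
        return response
      rescue HTTPClient::ConnectTimeoutError, SystemCallError
        @active_ip = nil if ip == @active_ip
      end
    end
    raise 'no reachable bbs instance'
  end
end
```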
It'd be good to hear from others, as the fact that I've just finished working on a different solution biases my opinion somewhat!
@mkocher
@sethboyles If you get time, it'd be great to get your thoughts on this!
Also threw together a couple of diagrams. The first one shows what can happen today if there's an outage and the first IP in the randomly-ordered list that bbs.service.cf.internal resolves to is, by chance, an unreachable diego-api instance three times in a row:

The second one shows how this would work with the change I'm proposing:

> At the same time, it's worth weighing that against the fact that, aside from Diego itself, there are four other components packaged in cf-deployment with clients that talk to Locket
Both TPS and the deployment-updater are a part of CAPI, but as far as I can tell both of them use Locket to register their own locks, not query information about Diego.
I don't have much of an opinion yet on whether using Locket to query information about Diego is a pattern we should avoid, as I know little about Locket and how Diego uses it. Perhaps we should reach out to the Diego team for their thoughts?
Yep, I'll ask. @ameowlia, @geofffranks, @jrussett - I wonder if any of you, or your other Diego colleagues, would have an opinion on the proposed use of Locket here?
This seems like a great proposal to me. The data is authoritative about which VM the active BBS is. There's precedent for this in the cfdot locks command (it pulls data from locket to display the active BBS and Auctioneer VMs).
Wait, hold on: it feels very wrong to have a service running on the platform where a client has to inspect which instances of the service are healthy and then target them directly.
bbs.service.cf.internal should not return instances that do not actually work, which is what it sounds like currently happens.
My first thought was that bosh-dns healthiness could be used to ensure the "unhealthy" VMs aren't actually returned by bbs.service.cf.internal, and that does work great, right up until you co-locate the bbs job on a VM with other jobs; at that point the whole VM is unhealthy for bosh-dns query purposes, which is probably bad.
Using process-level healthiness does "solve" the problem with s3 queries mentioned above: you could just switch to s0, and in the case where another job on the active bbs VM is failing, they would all be in an "unhealthy" state, so you'd basically fall back to the behavior we have now. But again, people do co-locate jobs differently than is done in cf-deployment, so until we add job-level healthiness to bosh-dns-aliases, this is probably a bad idea.
So the other option would be to put this healthiness logic in the bbs itself, and then when a non-active VM got a request, it could proxy it to a healthy one or maybe it could just return a 302 and the clients would follow that to the proper VM? That would mean clients don't need to do any special logic and can just query bbs.service.cf.internal.
The origin of the problem is that the Ruby HTTP client used for talking to BBS is bad at handling unreachable (i.e. isolated from a network point of view) endpoints.
This is not a general problem; when you use cURL to retrieve data from bbs.service.cf.internal and there is an unreachable VM, the request to this VM times out very fast and the next IP address is used. So in general this is something that a client could handle.
I don't think that Diego/BBS is doing something wrong; rejecting connections on an inactive VM seems like a reasonable approach and works well - the issue we are seeing is only related to unreachable VMs.
In general I see three different possibilities to solve this:
1. Fix/enhance the currently used Ruby HTTP client.
2. Identify and use a different Ruby HTTP client that behaves more like cURL.
3. Use an alternative approach to identify the currently active BBS instance to circumvent issues with unreachable VMs.

On each of these:
1. We tested a monkey patch (see [1]) of the `TCPSocket` class, which gained a `connect_timeout` parameter with Ruby 3.0. Setting this to a lower value than the `HTTPClient` connect timeout (default: 10 seconds) results in behavior similar to that for a refused connection, i.e. the next IP address is tried implicitly. We didn't dare to apply this monkey patch as it would affect all TCP sockets created in Ruby, and we are also not sure whether a contribution to the `httpclient` gem would be feasible.
2. We have not gone down this road so far.
3. That's the purpose of this PR. It can be seen as a workaround for the deficient Ruby HTTP client, but it prevents the issues we are seeing with unreachable VMs.
[1]

```ruby
module DefaultConnectTimeout
  DEFAULT_CONNECT_TIMEOUT = 2 # seconds

  def initialize(remote_host, remote_port, local_host = nil, local_port = nil, connect_timeout: nil)
    super(remote_host, remote_port, local_host, local_port, connect_timeout: connect_timeout || DEFAULT_CONNECT_TIMEOUT)
  end
end

class TCPSocket
  prepend DefaultConnectTimeout
end
```
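For what it's worth, the connect_timeout keyword the patch relies on is part of TCPSocket.new itself from Ruby 3.0 onwards, so an individual call site can opt in without patching the class globally; roughly like this (host and port are just the bbs values from earlier in the thread, and the patch above presumably exists because httpclient's own socket creation doesn't pass this keyword):

```ruby
require 'socket'

# On Ruby >= 3.0, TCPSocket.new accepts a connect_timeout: keyword, so a single
# call can give up on an unreachable address after ~2 seconds instead of hanging
# for the client-level timeout. Host/port here are purely illustrative.
begin
  socket = TCPSocket.new('bbs.service.cf.internal', 8889, connect_timeout: 2)
  socket.close
rescue SystemCallError, IOError, SocketError => e
  puts "connect failed quickly: #{e.class}"
end
```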
Thanks for those details, @philippthun. That does make it seem less like a Diego problem than I was suggesting, and I'm no longer thinking the problem should be solved elsewhere.
It's unfortunate that the timeout isn't configurable in the HTTP client; that would be an ideal solution.
I think it's worth evaluating other HTTP clients to see if there is support for this before adding a workaround on our side.
httpclient hasn't had a release in 6 years: https://rubygems.org/gems/httpclient
Ok, given the mixed reaction to this, I'm going to close it and we'll take a look at alternative HTTP clients.
Picking this story up again. Just noting that, as far as I can tell, there doesn't seem to be any maintained Ruby HTTP client that, for a domain resolving to multiple IPs, will try the next IP in the list if the first one it tries times out.
So I think we'll just need to resolve the DNS ourselves as @mkocher suggested.