rfcs RFC: add watch API endpoints

Rendered

Experimental PR: concourse/concourse#5802 - adds watching for the ListAllJobs endpoint (this was a pretty big change, but subsequent endpoints won't be as heavy-weight to implement).

Signed-off-by: Aidan Oldershaw [email protected]

Jun 16 '20 16:06 aoldershaw

I think there are two parts to this proposal that can be teased apart:

Notifying-on-change (usually called "Change Data Capture" in database / ETL land). This is where triggers, notify/listen are useful.
Controlling demand from clients.

I think they can be treated somewhat separately. In particular, controlling client demand requires us to provide backpressure. The ATC needs to signal to clients when it is ready to serve an endpoint, rather than relying on hardcoded client settings. That way it can, for example, select a subset of clients that it knows about to signal this to, signal less frequently when under heavy load etc.

Jun 16 '20 18:06 jchesterpivotal

(And if you go down the backpressure route ... RSocket is nifty)

Jun 16 '20 18:06 jchesterpivotal

@jchesterpivotal thanks for the pointer on the term "Change Data Capture". I was able to find some pretty helpful information on the topic - notably, that it's typically more performant to do CDC by parsing the transaction log (seems to be pg_logical_slot_get_changes in Postgres) - though this requires setting "wal_level to logical and max_replication_slots to at least 1" (https://www.postgresql.org/docs/9.4/logicaldecoding-example.html). Not sure what other implications those settings have (though I would guess it would require a fair bit more disk).

The ATC needs to signal to clients when it is ready to serve an endpoint, rather than relying on hardcoded client settings

Could you reiterate what you mean here? Do you mean the client might ask the ATC to start watching events, and the ATC says "not yet, but try again in 15 seconds"? Or do you mean the client is already watching events, but the ATC is overloaded, so the ATC decides not to forward events right away? Or neither of those things?

Jun 17 '20 15:06 aoldershaw

Approximately: that the ATC tells the client when to make another GET. RSocket makes this a little more explicit, in that it allows each end of the connection to signal "I can now receive 10 widgets" or "I am open for 3 widgets any time in the next 30 seconds".

Imagine that Concourse is a pipeline (heh) which manufactures responses to a client request. Right now, work is pushed through that system. A request arrives and is pushed forward through the ATC, to the database, then back through the ATC, back to the client. The problem is that each stage of the pipeline has no way to control how much work it receives and relies on a mix of overprovisioning, upstream load balancing and plain luck to deal with variance in demand. Worse still: a slowdown anywhere in the pipeline causes queues to form further upstream.

The idea of "reactive backpressure" is that the downstream explicitly the upstream that it has capacity to do work. So the database signals the ATC "I can serve 20 queries", the ATC only sends 20 queries. In turn it only serves 20 clients at once.

In manufacturing this is called "pull-based" flow. The best-known example is Kanban, where cards circulate between each pair of stations to govern demand between them. Another example is CONWIP, where this is a fixed supply of cards for the entire manufacturing line, circulating from the beginning to end and back around.

The reason for backpressure is that the easiest place to stabilise load is further upstream from where it's felt. What we're seeing in this case is that the furthest upstream demand is generated by web clients. We control the design of the client, so it would be possible to have it follow ATC instructions on when to send requests. RSocket has some nice ways of supporting that and can run over websockets, but we'd still be better off even with a simple long-polling / SSE method that says "Now!".

Jun 17 '20 17:06 jchesterpivotal

@jchesterpivotal Neat, thanks for taking the time to explain it! The reactive backpressure approach could definitely be a good solution to some of the problems around ListAllJobs that we've experienced in the past. In fact, we implemented concourse/concourse#5429 to prevent too many clients from hitting the endpoint at once, where the ATC would respond with a 503 Service Unavailable if the limit was exceeded, and the client would retry until it was accepted by the ATC. In hindsight, an approach like using RSocket would have probably been a better way to go.

I wonder if you think using RSocket has much merit in conjunction with the proposed ?watch parameter - rather than the ATC just telling the clients "you can ask for new data now", it'll be telling the clients "here's your new data" (which could be on the ATC's own terms). The intention of the proposed design is that there won't be any more burden on the DB whether there's 1 client or 100 clients listening for changes (albeit with an upfront "constant" cost for observing changes, and the cost of the initial query), and there probably won't be (too) much more burden on the ATC either (it's just a matter of checking permissions and forwarding events - though I may be underestimating how costly this will be).

Jun 18 '20 15:06 aoldershaw