Let's rework PAPR's architecture!
Rough problem statement
Right now, our architecture is very much like a waterfall: events on GitHub cause a linear cascade that eventually fires off PAPR to run for those specific events.
This has severe limitations:
- monolithic architecture makes it harder to try out locally and thus harder to contribute
- reliance on multiple linear infrastructures (CI bus, Jenkins, OpenStack) results in a higher fault rate
- no easy way to scale horizontally for HA
Other issues plaguing the current architecture:
- the queue is not easily visible/not public, so it's hard to tell what's going on without manually checking the internal Jenkins queue (the Homu queue is sort of visible, but could be way better)
- no job prioritization, purely FIFO. Ideally, we want to be able to apply a set of rules as to how jobs should be prioritized, e.g. 'auto' and 'try' branch before PRs, PRs with certain labels before others, etc. (see the sketch after this list)
- the combination of GHPRB and Homu is confusing and creates an inconsistent user experience
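As a concrete illustration of the rule-based prioritization mentioned above, here is a minimal sketch; the job fields and the specific rules are hypothetical, not an existing PAPR/Homu API:

```python
# Hypothetical sketch: rule-based prioritization for queued PAPR jobs.
# The Job fields and rules below are illustrative, not existing PAPR code.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Job:
    repo: str
    branch: str                # e.g. "auto", "try", or the PR head ref
    is_pr: bool
    labels: List[str] = field(default_factory=list)
    submitted: float = 0.0     # unix timestamp; FIFO tie-breaker


def priority(job: Job) -> tuple:
    """Lower tuples sort first: auto/try before PRs, urgent PRs before the rest."""
    if job.branch in ("auto", "try"):
        tier = 0
    elif "priority/urgent" in job.labels:
        tier = 1
    else:
        tier = 2
    return (tier, job.submitted)


def next_job(queue: List[Job]) -> Job:
    # Pick the highest-priority (lowest-sorting) job; FIFO within a tier.
    return min(queue, key=priority)
```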
Rough solution
We split up the architecture into multiple small services:
- Homu/PAPR Scheduler (PAPRQ): proposed to run in CentOS CI
- PAPR Workers: need either OpenStack or Docker/Kubernetes. This is a bin-packing problem; can we e.g. reuse Kubernetes jobs? https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/
- PAPR Publishers
The scheduler receives events from GitHub and queues up the jobs. It understands .papr.yml and splits it into individual jobs, each representing a testsuite. This allows e.g. workers that only support containerized workloads to still participate in the pool. It also allows container work to be weighed differently from VM/cluster work.
Workers periodically (with forced polls on events if implementable) query PAPRQ for available jobs. PAPRQ prioritizes jobs by a given set of rules.
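To make the worker side concrete, here is a rough sketch of the polling loop; the PAPRQ endpoints, payloads, and token handling are assumptions for illustration, not a defined API:

```python
# Sketch of a long-running PAPR worker polling PAPRQ for jobs.
# The /jobs/claim and /jobs/<id>/result endpoints are hypothetical.
import os
import time

import requests

PAPRQ_URL = os.environ.get("PAPRQ_URL", "https://paprq.example.com")
TOKEN = os.environ["PAPRQ_TOKEN"]          # the single worker token


def run_testsuite(job):
    # Placeholder: provision a container/VM and run the testsuite here.
    return {"status": "success"}


def main():
    headers = {"Authorization": "Bearer " + TOKEN}
    while True:
        # Advertise what this worker can run so e.g. container-only
        # workers can still participate in the pool.
        resp = requests.post(PAPRQ_URL + "/jobs/claim",
                             json={"capabilities": ["container"]},
                             headers=headers, timeout=30)
        if resp.status_code == 204:        # nothing queued for us
            time.sleep(30)
            continue
        job = resp.json()
        result = run_testsuite(job)
        requests.post("{}/jobs/{}/result".format(PAPRQ_URL, job["id"]),
                      json=result, headers=headers, timeout=30)


if __name__ == "__main__":
    main()
```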
What it would entail
- The largest piece of work will be to enhance (fork?) Homu to also handle PR events and add them to its queue. This naturally resolves the confusing UX, and makes optimizations like https://github.com/servo/homu/pull/54 trivial to implement.
E.g. `@rh-atomic-bot retry` will actually know whether to retry testing the PR, or retry testing the merge.
It also allows for more sophisticated syntax, like `@rh-atomic-bot retry f26-sanity`.
- Teach PAPR to connect to PAPRQ for jobs. This is either a long-running service that polls, or one that is periodically started by an external service (e.g. Jenkins, OCP).
- This can come later: rather than the workers publishing to e.g. S3 themselves, do something similar to what Cockpit does and stream logs and updates back to PAPRQ itself (see the sketch below). This allows us to (1) have publicly visible streaming logs, and (2) keep all the secrets in PAPRQ and only require workers to have a single token.
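A minimal sketch of that streaming idea, assuming a hypothetical PAPRQ log endpoint and worker token; requests turns a generator body into a chunked upload:

```python
# Sketch: stream a running job's log back to PAPRQ instead of publishing
# to S3 from the worker. The endpoint and token are assumptions.
import requests


def follow(path):
    """Yield log chunks as the test run writes them."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(4096)
            if not chunk:
                break          # real code would keep polling until the job ends
            yield chunk


def stream_log(paprq_url, job_id, token, log_path):
    # requests sends a generator body with chunked transfer encoding,
    # so PAPRQ can expose the log publicly while it's still being written.
    requests.post(
        "{}/jobs/{}/log".format(paprq_url, job_id),
        data=follow(log_path),
        headers={"Authorization": "Bearer " + token},
        timeout=300,
    )
```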
Let's finalize this work and split it up amongst team members so that everyone understands how it works, and can help manage it.
Risks
- Contributions blocking on the Servo/Homu team: getting review time is hard
  - Sub-alternative: a sidecar/wrapper for Homu. PAPR intercepts GitHub events and forwards them to Homu as well, but also builds its own state. (Investigate organization-wide GitHub events.)
Transition plan
We can take down per-PR testing while keeping testing on the `auto` branch up.
Alternatives
- Customize Jenkins (integrate better with CentOS CI); what's the relationship with GHPRB there?
- Hop on http://prow.k8s.io/ with Origin
- Rely on Travis more
Other discussion
- Standard test roles vs PAPR: PAPR describes more things and handles tasks like provisioning more declaratively. Could we have stdtest in upstream git?
- Colin: should PAPR run stdtest? Jonathan: the problem is that the tests live in a separate git repo, unless the upstream repo also holds the stdtest definition?
> We split up the architecture into multiple small services:
>
> - Homu/PAPR Scheduler (PAPRQ): proposed to run in CentOS CI
> - PAPR Workers: need either OpenStack or Docker/Kubernetes. This is a bin-packing problem; can we e.g. reuse Kubernetes jobs? https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/
> - PAPR Publishers

👍
This sounds a lot like a message+worker system (e.g. Celery). Homu has a queue built in, but if the requirement is for PAPR itself to handle its own testing while keeping Homu strictly to merge work, reusing an established message+worker system would be close to trivial.
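For illustration, here is a minimal sketch of what the Celery-style approach could look like; the broker URL, module name, and task body are placeholders, not existing PAPR code:

```python
# papr_tasks.py -- sketch of a Celery message+worker setup for PAPR jobs.
# The broker URL, queue layout, and task body are illustrative only.
from celery import Celery

app = Celery("papr", broker="amqp://guest@localhost//")


@app.task(bind=True, max_retries=2)
def run_testsuite(self, repo, sha, testsuite):
    """Run a single testsuite for the given repo/commit and report back."""
    # Provision, run the suite, publish results/status to GitHub...
    return {"repo": repo, "sha": sha, "testsuite": testsuite, "status": "success"}


# The webhook/scheduler side enqueues work with e.g.:
#   run_testsuite.delay("projectatomic/papr", "<sha>", "f26-sanity")
# and workers are started with:
#   celery -A papr_tasks worker --loglevel=info
```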
> The largest piece of work will be to enhance (fork?) Homu to also handle PR events and add them to its queue. This naturally resolves the confusing UX, and makes optimizations like servo/homu#54 trivial to implement.
I'm not against it, other than the need to hold our own Homu fork, since we'd need to rebase and modify it any time we want to pull in new features from upstream. I worry that combining merge gating and distributed testing may be coupling things that should stay separate, but I wouldn't complain if it ends up being easier to code.
Also, as part of this, maybe consider becoming a GitHub App? https://developer.github.com/apps/building-integrations/setting-up-a-new-integration/about-choosing-an-integration-type/
That way PAPR could be installed organization-wide more easily.
So... I think a good strategy here would be to start with the minimal amount of work needed to get into CentOS CI, and then slowly iterate on changing the architecture. Some changes will clearly require a lot of upfront investment, but trying to keep things gradual ensures we can timebox improvements.
Basically, I think the path of least resistance towards migrating to our CCI OCP project is something like:
- set up a Jenkins service with GHPRB, and make sure webhooks are enabled
- change the codebase to start containerized workloads in pods rather than containers, so we would e.g. use `oc exec` instead of `docker exec` for provisioning (see the sketch after this list)
- change the codebase to start VMs inside containers rather than in an OSP instance
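For the second step, a tiny sketch of what the provisioning change could look like; the wrapper function and workload names are made up:

```python
# Sketch: provision through a small wrapper so the only change is whether
# we build an `oc exec` or a `docker exec` command line. Names are made up.
import subprocess


def exec_in_workload(name, cmd, use_pods=True):
    if use_pods:
        argv = ["oc", "exec", name, "--"] + cmd
    else:
        argv = ["docker", "exec", name] + cmd
    return subprocess.run(argv, check=True, capture_output=True, text=True)


# e.g. exec_in_workload("papr-test-pod", ["rpm-ostree", "status"])
```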
Once we've got that down, we can talk about swapping out GHPRB, doing fancier queueing, web UIs, etc. Does that make sense?
I'll try my hand at (1) and (2) today.
I see what you're saying, but the thing I struggle with now is that the current monolithic architecture is hard, at least for me, to understand and hack on. A basic problem is that I just don't have a consistent amount of time to learn it, so it's mostly when things fail that it becomes critical path.
I keep circling back to that queuing approach; it feels orthogonal to the other work. If we stood up a simple service that received the webhooks, did the conversion to Kube jobs, and posted that for anything else to read, it wouldn't conflict with the ocp branch, and it would allow iteration on a more decoupled architecture. Maybe? What still feels fuzzy to me is how result posting works wrt credentials and large log files.
One of my main goals is to make it really easy to get started and contribute. E.g. the instructions in https://github.com/projectatomic/papr/blob/ocp/docs/RUNNING.md should hopefully get anyone up and running within 5 minutes. In the long term, we'll need more people to help maintain this thing if we really want to make it part of our upstream CI story.
> I keep circling back to that queuing approach; it feels orthogonal to the other work. If we stood up a simple service that received the webhooks, did the conversion to Kube jobs, and posted that for anything else to read
Yeah, I definitely would like to get to something like that (though I pictured using custom resources, much like Prow does). Between that and getting a `container:` MVP up and running though, it feels like the latter takes priority? I'll dedicate some time to this this sprint and try to carve out pieces for others to contribute.
One thing I'm still struggling with is whether to eventually support `host:` and `cluster:` as well. Those would clearly still use VM-in-containers, though it depends on whether we want to hide that from users or not. It seems like in a lot of cases, users don't want the sort of flexibility that e.g. ostree needs. PAPR also does a few nice things like setting up SSH keys and injecting fresh rpmmd data.
One note I'd like to make here, for lack of a better place, is that I was recently playing with replacing GHPRB with Prow, and it worked surprisingly well with the PAPR-in-OCP model. Notably, this involves setting `skip_report` to true so that Prow isn't involved at all in updating back to GitHub.
The config I got running with was this:
```yaml
presubmits:
  jlebon/papr-sandbox:
  - name: papr-sandbox-pull
    agent: kubernetes
    branches:
    - master
    always_run: true
    skip_report: true
    # XXX: need to teach trigger to pass down the full comment so PAPR can learn
    # to handle retesting failed testsuites
    #trigger: "(?m)^/retest(\\s+.*)*$"
    #rerun_command: "/retest [testsuite1] [testsuite2] ..."
    spec:
      restartPolicy: Never
      serviceAccountName: papr
      volumes:
      - name: config-mount
        configMap:
          name: papr-config
      containers:
      - name: papr
        image: 172.30.1.1:5000/myproject/papr:latest
        imagePullPolicy: Always
        args:
        - "--debug"
        - "runtest"
        - "--conf"
        - "/etc/papr/config"
        - "--repo"
        - "$(REPO_OWNER)/$(REPO_NAME)"
        - "--pull"
        - "$(PULL_NUMBER)"
        - "--expected-sha1"
        - "$(PULL_PULL_SHA)"
        securityContext:
          runAsUser: 0
        volumeMounts:
        - name: config-mount
          mountPath: /etc/papr
        env:
        - name: GITHUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: github-token
              key: token
              optional: false
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              name: aws-access-key
              key: id
              optional: false
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: aws-access-key
              key: secret
              optional: false
```
Then e.g. `/retest`, `/hold`, etc. just worked (see https://github.com/jlebon/papr-sandbox/pull/1). I'd love to make use of this, though I'm not sure if we'd have to run our own Prow instance, or if the one running at http://deck-ci.svc.ci.openshift.org/ could run pods in CentOS CI -- or whether we should just move away from CentOS CI and run them at ci.openshift.org? IIUC, we should be able to satisfy `kvm: true` there.
Hmmm. I definitely see the advantages of taking GHPRB out of the picture, though my primary focus tends to go to the merge queue aspect - I see it as a lot more important than the per-PR testing.
I have a worry that prow-but-not-tide etc. is going to leave us with gaps and things to glue; although that's definitely true today with GHPRB/papr/homu, so not really a change for better/worse.
In this proposal we'd still be using Homu, correct?
> though my primary focus tends to go to the merge queue aspect
Yeah, that's one missing piece. I was looking at this some more today. To summarize: there is no prioritization of prowjobs; plank just blindly creates pods for any prowjob that doesn't have one already. There's a concurrency field to limit how many pending pods are allowed.
This means that any prioritization should happen at the Kube level. Kube does support pod priorities, which means we could e.g. label merge pods as having higher priority than PR pods so that they are scheduled before them. However, by default, it also does preemption, which means that e.g. it could kick out an already running PR job to schedule a merge job, which I don't think we want. You can disable preemption, but not per-pod, only at the kubelet level.
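For illustration, assigning priorities would look roughly like this with the Python Kubernetes client; the PriorityClass names, image, and namespace are made up, and this sketch doesn't address the preemption caveat above:

```python
# Sketch: run merge-queue pods with a higher PriorityClass than PR pods.
# Assumes PriorityClass objects named "papr-merge" and "papr-pr" already
# exist in the cluster; names, image, and namespace are made up.
from kubernetes import client, config


def make_test_pod(name, image, is_merge_job):
    return client.V1Pod(
        metadata=client.V1ObjectMeta(name=name),
        spec=client.V1PodSpec(
            restart_policy="Never",
            priority_class_name="papr-merge" if is_merge_job else "papr-pr",
            containers=[client.V1Container(name="papr", image=image)],
        ),
    )


config.load_kube_config()
core = client.CoreV1Api()
core.create_namespaced_pod("papr", make_test_pod("auto-run-1", "papr:latest", True))
```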
One idea is to write a custom scheduler for pods coming from prowjobs which would know how to prioritize them (could even do neat things like prioritizing some repos over others), though not sure how much work it'd be (e.g. we might have to implement node selection ourselves if there's no way to reuse the default scheduler logic).
> In this proposal we'd still be using Homu, correct?
I was looking at possibly going all in at this point. Tide seems to offer the same guarantee of never-broken-master as Homu by rerunning tests before merge. IIUC though, the way it does this is very different: instead of pushing a branch like auto, test pods are fed a base SHA and a list of PRs (for batching) to git merge manually before testing. There's a helper utility (clonerefs) in the prow repo that can do this.
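To make that testing model concrete, the merge-before-test step amounts to roughly the following (a sketch of the idea, not clonerefs' actual implementation; paths and SHAs are placeholders):

```python
# Sketch of the merge-before-test step: check out the base SHA, then merge
# the PR head SHAs (one or more, for batching) with plain git merges.
# This is the idea behind clonerefs, not its actual implementation.
import subprocess


def git(repo_dir, *args):
    subprocess.run(["git", "-C", repo_dir] + list(args), check=True)


def prepare_workspace(repo_dir, base_sha, pull_shas):
    git(repo_dir, "checkout", base_sha)
    for sha in pull_shas:
        # A vanilla merge: the tested tree may differ from what the GitHub
        # merge API later produces, which is the concern discussed below.
        git(repo_dir, "merge", "--no-ff", sha)


# e.g. prepare_workspace("/src/papr-sandbox", "<base sha>", ["<pr1 sha>", "<pr2 sha>"])
```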
But at merge time, tide uses the GitHub merge API rather than pushing the merge itself. This does bring back the discussions from https://github.com/servo/homu/pull/57 -- it assumes that whatever merge strategy GitHub uses will be exactly equivalent to what was tested (clonerefs does vanilla git merges). So it's definitely not as explicit as Homu. I'm not sure how hard it'd be to teach tide to adopt Homu's strategy here of just directly pushing what was tested.
There's also no autosquashing feature, though that should be trivial to implement if tide learns to do the push itself.
> However, by default, it also does preemption, which means that e.g. it could kick out an already running PR job to schedule a merge job, which I don't think we want.
Hmm, that's assuming the cluster is full. The plank part may need to be a bit more configurable. Alternatively, we could rely on projects only running "baseline" tests by default in PRs, and have e.g. `/test compose` or `/test all` that we run on particular PRs as needed.
But if the cluster is full (which actually gets into a more interesting question in a cloud world about just how much money we want to spend, since there is no "full") - yes, I'd say that merge jobs should preempt PR jobs, right?