Let's rework PAPR's architecture!
Rough problem statement
Right now, our architecture is very much like a waterfall: events on GitHub cause a linear cascade that eventually fires off PAPR to run for those specific events.
This has severe limitations:
- monolithic architecture makes it harder to try out locally and thus harder to contribute
- reliance on multiple linear infrastructures (CI bus, Jenkins, OpenStack) results in a higher fault rate
- no easy way to scale horizontally for HA
Other issues plaguing the current architecture:
- the queue is not easily visible/not public, so it's hard to tell what's going on without manually checking the internal Jenkins queue (the Homu queue is sort of visible, but could be way better)
- no job prioritization, purely FIFO. Ideally, we want to be able to apply a set of rules as to how jobs should be prioritized, e.g. 'auto' and 'try' branch before PRs, PRs with certain labels before others, etc. (see the sketch after this list)
- the combination of GHPRB and Homu is confusing and creates an inconsistent user experience
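As a concrete illustration of the rule-based prioritization mentioned above, here is a minimal sketch; the job fields and the specific rules are hypothetical, not an existing PAPR/Homu API:

```python
# Hypothetical sketch: rule-based prioritization for queued PAPR jobs.
# The Job fields and rules below are illustrative, not existing PAPR code.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Job:
    repo: str
    branch: str                # e.g. "auto", "try", or the PR head ref
    is_pr: bool
    labels: List[str] = field(default_factory=list)
    submitted: float = 0.0     # unix timestamp; FIFO tie-breaker


def priority(job: Job) -> tuple:
    """Lower tuples sort first: auto/try before PRs, urgent PRs before the rest."""
    if job.branch in ("auto", "try"):
        tier = 0
    elif "priority/urgent" in job.labels:
        tier = 1
    else:
        tier = 2
    return (tier, job.submitted)


def next_job(queue: List[Job]) -> Job:
    # Pick the highest-priority (lowest-sorting) job; FIFO within a tier.
    return min(queue, key=priority)
```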
Rough solution
We split up the architecture into multiple small services:
- Homu/PAPR Scheduler (PAPRQ): proposed to run in CentOS CI
- PAPR Workers: need either OpenStack or Docker/Kubernetes. This is a bin-packing problem; can we e.g. reuse Kubernetes jobs? https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/
- PAPR Publishers
The scheduler receives events from GitHub and queues up the jobs. It understands .papr.yml and splits it into individual jobs, each representing a testsuite. This allows e.g. workers that only support containerized workloads to still participate in the pool. It also allows container work to be weighed differently from VM/cluster work.
Workers periodically (with forced polls on events if implementable) query PAPRQ for available jobs. PAPRQ prioritizes jobs by a given set of rules.
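To make the worker side concrete, here is a rough sketch of the polling loop; the PAPRQ endpoints, payloads, and token handling are assumptions for illustration, not a defined API:

```python
# Sketch of a long-running PAPR worker polling PAPRQ for jobs.
# The /jobs/claim and /jobs/<id>/result endpoints are hypothetical.
import os
import time

import requests

PAPRQ_URL = os.environ.get("PAPRQ_URL", "https://paprq.example.com")
TOKEN = os.environ["PAPRQ_TOKEN"]          # the single worker token


def run_testsuite(job):
    # Placeholder: provision a container/VM and run the testsuite here.
    return {"status": "success"}


def main():
    headers = {"Authorization": "Bearer " + TOKEN}
    while True:
        # Advertise what this worker can run so e.g. container-only
        # workers can still participate in the pool.
        resp = requests.post(PAPRQ_URL + "/jobs/claim",
                             json={"capabilities": ["container"]},
                             headers=headers, timeout=30)
        if resp.status_code == 204:        # nothing queued for us
            time.sleep(30)
            continue
        job = resp.json()
        result = run_testsuite(job)
        requests.post("{}/jobs/{}/result".format(PAPRQ_URL, job["id"]),
                      json=result, headers=headers, timeout=30)


if __name__ == "__main__":
    main()
```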
What it would entail
- The largest piece of work will be to enhance (fork?) Homu to also handle PR events and add them to its queue. This naturally resolves the confusing UX, and makes optimizations like https://github.com/servo/homu/pull/54 trivial to implement.
E.g. `@rh-atomic-bot retry` will actually know whether to retry testing the PR, or retry testing the merge.
It also allows for more sophisticated syntax, like `@rh-atomic-bot retry f26-sanity`.
- Teach PAPR to connect to PAPRQ for jobs. This is either a long-running service that polls, or one that is periodically started by an external service (e.g. Jenkins, OCP).
- This can come later: rather than the workers publishing to e.g. S3 themselves, do something similar to what Cockpit does and stream logs and updates back to PAPRQ itself (see the sketch below). This allows us to (1) have publicly visible streaming logs, and (2) keep all the secrets in PAPRQ and only require workers to have a single token.
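A minimal sketch of that streaming idea, assuming a hypothetical PAPRQ log endpoint and worker token; requests turns a generator body into a chunked upload:

```python
# Sketch: stream a running job's log back to PAPRQ instead of publishing
# to S3 from the worker. The endpoint and token are assumptions.
import requests


def follow(path):
    """Yield log chunks as the test run writes them."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(4096)
            if not chunk:
                break          # real code would keep polling until the job ends
            yield chunk


def stream_log(paprq_url, job_id, token, log_path):
    # requests sends a generator body with chunked transfer encoding,
    # so PAPRQ can expose the log publicly while it's still being written.
    requests.post(
        "{}/jobs/{}/log".format(paprq_url, job_id),
        data=follow(log_path),
        headers={"Authorization": "Bearer " + token},
        timeout=300,
    )
```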
Let's finalize this work and split it up amongst team members so that everyone understands how it works, and can help manage it.
Risks
- Contributions blocking on the Servo/Homu team: getting review time is hard
  - Sub-alternative: a sidecar/wrapper for Homu. PAPR intercepts GitHub events and forwards them to Homu as well, but also builds its own state. (Investigate organization-wide GitHub events.)
Transition plan
We can take down per-PR testing while keeping testing on the `auto` branch up.
Alternatives
- Customize Jenkins (integrate better with CentOS CI); what's the relationship with GHPRB there?
- Hop on http://prow.k8s.io/ with Origin
- Rely on Travis more
Other discussion
- Standard test roles vs PAPR: PAPR describes more things and handles tasks like provisioning more declaratively. Could we have stdtest in upstream git?
- Colin: should PAPR run stdtest? Jonathan: the problem is that the tests live in a separate git repo, unless the upstream repo also holds the stdtest definition?
> We split up the architecture into multiple small services:
>
> - Homu/PAPR Scheduler (PAPRQ): proposed to run in CentOS CI
> - PAPR Workers: need either OpenStack or Docker/Kubernetes. This is a bin-packing problem; can we e.g. reuse Kubernetes jobs? https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/
> - PAPR Publishers

👍
This sounds a lot like a message+worker system (e.g. Celery). Homu has a queue built in, but if the requirement is for PAPR itself to handle its own testing while keeping Homu strictly to merge work, reusing an established message+worker system would be close to trivial.
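For illustration, here is a minimal sketch of what the Celery-style approach could look like; the broker URL, module name, and task body are placeholders, not existing PAPR code:

```python
# papr_tasks.py -- sketch of a Celery message+worker setup for PAPR jobs.
# The broker URL, queue layout, and task body are illustrative only.
from celery import Celery

app = Celery("papr", broker="amqp://guest@localhost//")


@app.task(bind=True, max_retries=2)
def run_testsuite(self, repo, sha, testsuite):
    """Run a single testsuite for the given repo/commit and report back."""
    # Provision, run the suite, publish results/status to GitHub...
    return {"repo": repo, "sha": sha, "testsuite": testsuite, "status": "success"}


# The webhook/scheduler side enqueues work with e.g.:
#   run_testsuite.delay("projectatomic/papr", "<sha>", "f26-sanity")
# and workers are started with:
#   celery -A papr_tasks worker --loglevel=info
```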
> The largest piece of work will be to enhance (fork?) Homu to also handle PR events and add them to its queue. This naturally resolves the confusing UX, and makes optimizations like servo/homu#54 trivial to implement.
I'm not against it, other than the need to hold our own Homu fork, since we'd need to rebase and modify it any time we want to pull in new features from upstream. I worry that combining merge gating and distributed testing may be coupling things that should stay separate, but I wouldn't complain if it ends up being easier to code.
Also, as part of this, maybe consider becoming a GitHub App? https://developer.github.com/apps/building-integrations/setting-up-a-new-integration/about-choosing-an-integration-type/
That way PAPR could be installed organization-wide more easily.
So... I think a good strategy here would be to start with the minimal amount of work needed to get into CentOS CI, and then slowly iterate on changing the architecture. Some changes will clearly require a lot of upfront investment, but trying to keep things gradual ensures we can timebox improvements.
Basically, I think the path of least resistance towards migrating to our CCI OCP project is something like:
- set up a Jenkins service with GHPRB, and make sure webhooks are enabled
- change the codebase to start containerized workloads in pods rather than containers, so we would e.g. use `oc exec` instead of `docker exec` for provisioning (see the sketch after this list)
- change the codebase to start VMs inside containers rather than in an OSP instance
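For the second step, a tiny sketch of what the provisioning change could look like; the wrapper function and workload names are made up:

```python
# Sketch: provision through a small wrapper so the only change is whether
# we build an `oc exec` or a `docker exec` command line. Names are made up.
import subprocess


def exec_in_workload(name, cmd, use_pods=True):
    if use_pods:
        argv = ["oc", "exec", name, "--"] + cmd
    else:
        argv = ["docker", "exec", name] + cmd
    return subprocess.run(argv, check=True, capture_output=True, text=True)


# e.g. exec_in_workload("papr-test-pod", ["rpm-ostree", "status"])
```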
Once we've got that down, we can talk about swapping out GHPRB, doing fancier queueing, web UIs, etc. Does that make sense?
I'll try my hand at (1) and (2) today.
I see what you're saying, but the thing I struggle with now is that the current monolithic architecture is hard, at least for me, to understand and hack on. A basic problem is that I just don't have a consistent amount of time to learn it, so it's mostly when things fail that it becomes critical path.
I keep circling back to that queuing approach; it feels orthogonal to the other work. If we stood up a simple service that received the webhooks, did the conversion to Kube jobs, and posted that for anything else to read, it wouldn't conflict with the ocp branch, and it would allow iteration on a more decoupled architecture. Maybe? What still feels fuzzy to me is how result posting works wrt credentials and large log files.
One of my main goals is to make it really easy to get started and contribute. E.g. the instructions in https://github.com/projectatomic/papr/blob/ocp/docs/RUNNING.md should hopefully get anyone up and running within 5 minutes. In the long term, we'll need more people to help maintain this thing if we really want to make it part of our upstream CI story.
> I keep circling back to that queuing approach; it feels orthogonal to the other work. If we stood up a simple service that received the webhooks, did the conversion to Kube jobs, and posted that for anything else to read
Yeah, I definitely would like to get to something like that (though I pictured using custom resources, much like Prow does). Between that and getting a `container:` MVP up and running though, it feels like the latter takes priority? I'll dedicate some time to this this sprint and try to carve out pieces for others to contribute.
One thing I'm still struggling with is whether to eventually support `host:` and `cluster:` as well. Those would clearly still use VM-in-containers, though it depends on whether we want to hide that from users or not. It seems like in a lot of cases, users don't want the sort of flexibility that e.g. ostree needs. PAPR also does a few nice things like setting up SSH keys and injecting fresh rpmmd data.
One note I'd like to make here, for lack of a better place, is that I was recently playing with replacing GHPRB with Prow, and it worked surprisingly well with the PAPR-in-OCP model. Notably, this involves setting `skip_report` to true so that Prow isn't involved at all in updating back to GitHub.
The config I got running with was this:
```yaml
presubmits:
  jlebon/papr-sandbox:
  - name: papr-sandbox-pull
    agent: kubernetes
    branches:
    - master
    always_run: true
    skip_report: true
    # XXX: need to teach trigger to pass down the full comment so PAPR can learn
    # to handle retesting failed testsuites
    #trigger: "(?m)^/retest(\\s+.*)*$"
    #rerun_command: "/retest [testsuite1] [testsuite2] ..."
    spec:
      restartPolicy: Never
      serviceAccountName: papr
      volumes:
      - name: config-mount
        configMap:
          name: papr-config
      containers:
      - name: papr
        image: 172.30.1.1:5000/myproject/papr:latest
        imagePullPolicy: Always
        args:
        - "--debug"
        - "runtest"
        - "--conf"
        - "/etc/papr/config"
        - "--repo"
        - "$(REPO_OWNER)/$(REPO_NAME)"
        - "--pull"
        - "$(PULL_NUMBER)"
        - "--expected-sha1"
        - "$(PULL_PULL_SHA)"
        securityContext:
          runAsUser: 0
        volumeMounts:
        - name: config-mount
          mountPath: /etc/papr
        env:
        - name: GITHUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: github-token
              key: token
              optional: false
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              name: aws-access-key
              key: id
              optional: false
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: aws-access-key
              key: secret
              optional: false
```
Then e.g. `/retest`, `/hold`, etc. just worked (see https://github.com/jlebon/papr-sandbox/pull/1). I'd love to make use of this, though I'm not sure if we'd have to run our own Prow instance, or if the one running at http://deck-ci.svc.ci.openshift.org/ could run pods in CentOS CI -- or whether we should just move away from CentOS CI and run them at ci.openshift.org? IIUC, we should be able to satisfy `kvm: true` there.
Hmmm. I definitely see the advantages of taking GHPRB out of the picture, though my primary focus tends to go to the merge queue aspect - I see it as a lot more important than the per-PR testing.
I have a worry that prow-but-not-tide etc. is going to leave us with gaps and things to glue; although that's definitely true today with GHPRB/papr/homu, so not really a change for better/worse.
In this proposal we'd still be using Homu, correct?
> though my primary focus tends to go to the merge queue aspect
Yeah, that's one missing piece. I was looking at this some more today. To summarize: there is no prioritization of prowjobs; plank just blindly creates pods for any prowjob that doesn't have one already. There's a concurrency field to limit how many pending pods are allowed.
This means that any prioritization should happen at the Kube level. Kube does support pod priorities, which means we could e.g. label merge pods as having higher priority than PR pods so that they are scheduled before them. However, by default, it also does preemption, which means that e.g. it could kick out an already running PR job to schedule a merge job, which I don't think we want. You can disable preemption, but not per-pod, only at the kubelet level.
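For illustration, assigning priorities would look roughly like this with the Python Kubernetes client; the PriorityClass names, image, and namespace are made up, and this sketch doesn't address the preemption caveat above:

```python
# Sketch: run merge-queue pods with a higher PriorityClass than PR pods.
# Assumes PriorityClass objects named "papr-merge" and "papr-pr" already
# exist in the cluster; names, image, and namespace are made up.
from kubernetes import client, config


def make_test_pod(name, image, is_merge_job):
    return client.V1Pod(
        metadata=client.V1ObjectMeta(name=name),
        spec=client.V1PodSpec(
            restart_policy="Never",
            priority_class_name="papr-merge" if is_merge_job else "papr-pr",
            containers=[client.V1Container(name="papr", image=image)],
        ),
    )


config.load_kube_config()
core = client.CoreV1Api()
core.create_namespaced_pod("papr", make_test_pod("auto-run-1", "papr:latest", True))
```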
One idea is to write a custom scheduler for pods coming from prowjobs which would know how to prioritize them (could even do neat things like prioritizing some repos over others), though not sure how much work it'd be (e.g. we might have to implement node selection ourselves if there's no way to reuse the default scheduler logic).
> In this proposal we'd still be using Homu, correct?
I was looking at possibly going all in at this point. Tide seems to offer the same guarantee of never-broken-master as Homu by rerunning tests before merge. IIUC though, the way it does this is very different: instead of pushing a branch like auto, test pods are fed a base SHA and a list of PRs (for batching) to git merge manually before testing. There's a helper utility (clonerefs) in the prow repo that can do this.
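To make that testing model concrete, the merge-before-test step amounts to roughly the following (a sketch of the idea, not clonerefs' actual implementation; paths and SHAs are placeholders):

```python
# Sketch of the merge-before-test step: check out the base SHA, then merge
# the PR head SHAs (one or more, for batching) with plain git merges.
# This is the idea behind clonerefs, not its actual implementation.
import subprocess


def git(repo_dir, *args):
    subprocess.run(["git", "-C", repo_dir] + list(args), check=True)


def prepare_workspace(repo_dir, base_sha, pull_shas):
    git(repo_dir, "checkout", base_sha)
    for sha in pull_shas:
        # A vanilla merge: the tested tree may differ from what the GitHub
        # merge API later produces, which is the concern discussed below.
        git(repo_dir, "merge", "--no-ff", sha)


# e.g. prepare_workspace("/src/papr-sandbox", "<base sha>", ["<pr1 sha>", "<pr2 sha>"])
```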
But at merge time, tide uses the GitHub merge API rather than pushing the merge itself. This does bring back the discussions from https://github.com/servo/homu/pull/57 -- it assumes that whatever merge strategy GitHub uses will be exactly equivalent to what was tested (clonerefs does vanilla git merges). So it's definitely not as explicit as Homu. I'm not sure how hard it'd be to teach tide to adopt Homu's strategy here of just directly pushing what was tested.
There's also no autosquashing feature, though that should be trivial to implement if tide learns to do the push itself.
> However, by default, it also does preemption, which means that e.g. it could kick out an already running PR job to schedule a merge job, which I don't think we want.
Hmm, that's assuming the cluster is full. The plank part may need to be a bit more configurable. Alternatively, we could rely on projects only running "baseline" tests by default in PRs, and have e.g. `/test compose` or `/test all` that we run on particular PRs as needed.
But if the cluster is full (which actually gets into a more interesting question in a cloud world about just how much money we want to spend, since there is no "full") - yes, I'd say that merge jobs should preempt PR jobs, right?