[Epic] AAW Documentation Updates
Description
Update the documentation for the various repositories under the AAW projects. Specific objectives:
-
Every repository should have at least a basic `README.md` file. In particular:
- What is the purpose of the repository?
- How does it work (high level)? (Diagram if it is complex enough)
- How to set it up/test it?
- How to contribute? Esp. what other repos need to be involved if multiple steps are required to make an update on the test cluster.
-
All of the major deployment processes should be documented.
- what ArgoCD applications are "bootstrapped" from terraform modules?
- what platform components have their own dedicated ArgoCD applications?
- what is the DAG to deploy various components of the system? What is the process to create/upgrade certain system components?
- High level diagram-as-code for these components (as little duplication as possible between diagrams to remove the need to maintain multiple copies of the same thing).
- High level diagram for the patterns/workflows of the three main custom components we develop directly (controllers, mutating web hooks, gatekeeper policies)
-
High level concepts that are not specific to one repository
- networking
- peer authentication policies in Istio (e.g. one service is on-mesh and one service is off-mesh)
- diagram of how Kubernetes network policies work (based on the draw.io diagram from the meeting)
- cluster ingress diagram (esp. where are internal DNS A records created in Terraform)
- CoreDNS workaround to enable service-to-service communication using external URLs.
- A network policy appears to block traffic from non-internal IP addresses, which may require adding an `X-Forwarded-For` HTTP header to indicate the original source IP address. TODO: define more clearly what the issue/workaround is.
- various cluster gateways and their relationships. Also what is the (semantic) purpose of each gateway (i.e. which traffic should enter through which gateways?). Importantly, which DNS A records correspond to which gateways.
- OIDC
- how to register redirect URLs and what the process is to add a new OIDC application
- Kubernetes secret management with terraform
- Importantly, what is currently being done?
- Document ideas for future patterns/refactors (e.g. Bitnami sealed secrets, Vault sealed secrets).
-
Some kind of "entrypoint" for this documentation. E.g. DAaaS repo with links to github pages sites, README aggregator, some kind of "human curated concepts" that group functionalities/features and link them to the various repos/documentation
TODO
General tip: If you include any custom icons in a diagram, include a footnote linking to the source where you got the icon from (see e.g. the README in https://github.com/StatCan/aaw-kubeflow-profiles-controller).
Identify logical groupings of related repos
Idea is to (in one place) identify groups of repositories that are related to each other in some functional way.
Examples:
- [ ] Repos that platform users should be aware of (e.g. contrib-containers, daaas, etc.)
- [ ] Repos related to Kubeflow profile management
- [ ] Repos related to platform container management
README for aaw-argocd-manifests @vexingly
- [ ] #1156
- [ ] How are these manifests deployed into dev/prod (i.e. what is the general workflow to update manifests and make releases into dev/prod)?
- [ ] How are these repos "bootstrapped"? E.g.
- [ ] Which manifests correspond to which repositories? A simple table that maps folder paths to repositories is probably sufficient; we just need a clear mapping of everything that is being deployed. Additional detail can live in each specific repo's README. E.g.
| Manifests Folder | Git Repository |
|---|---|
| /daaas-system/profiles-controller/ | https://github.com/StatCan/aaw-kubeflow-profiles-controller |
| ... | ... |
- [ ] Bonus points: it would be cool to have a diagrams as code diagram(s) here to show what all is deployed out of this repo. There is probably a way to use the Python kubernetes client to programmatically generate the equivalent of `kubectl get argocd -A` and (for example) `kubectl get application -n daaas-system` so that the `.py` file with the diagram source code can dynamically update if new deployments are added.
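A rough sketch of the "bonus points" idea above, to make it concrete. It assumes the ArgoCD `Application` objects live under the `argoproj.io/v1alpha1` API group and uses the kubernetes client's `CustomObjectsApi`; the DOT output format and both function names are illustrative choices, not something this repo already has. Applications without a single `spec.source.repoURL` are drawn from an `unknown` node.

```python
from typing import Iterable


def applications_to_dot(applications: Iterable[dict]) -> str:
    """Render ArgoCD Application objects as a Graphviz DOT digraph.

    Each Application becomes one edge: source repo URL -> application name.
    """
    lines = ["digraph argocd {"]
    for app in sorted(applications, key=lambda a: a["metadata"]["name"]):
        name = app["metadata"]["name"]
        repo = app.get("spec", {}).get("source", {}).get("repoURL", "unknown")
        lines.append(f'  "{repo}" -> "{name}";')
    lines.append("}")
    return "\n".join(lines)


def fetch_applications(namespace: str) -> list:
    """Roughly `kubectl get application -n <namespace>`; needs cluster access."""
    # Third-party package, imported lazily so the renderer works without it.
    from kubernetes import client, config

    config.load_kube_config()
    response = client.CustomObjectsApi().list_namespaced_custom_object(
        group="argoproj.io",
        version="v1alpha1",
        namespace=namespace,
        plural="applications",
    )
    return response["items"]
```

With cluster access, `print(applications_to_dot(fetch_applications("daaas-system")))` would emit DOT text that regenerates as deployments are added or removed.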
README for aaw-blob-csi-injector @Collinbrown95
- [x] #1113
- [ ] #1114
High Level Items:
- [ ] mutating web hook that adds volumes/volume mounts to notebook pods with a given label
- [ ] checks PVCs/pods with certain labels and mounts those volumes to those pods
- [ ] container storage interface - these persistent volumes "mount" blob storage to notebook pods
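To illustrate the high level items above, here is a minimal sketch of the kind of JSON Patch a mutating webhook could return to add volumes and volume mounts to a labelled pod. The label key, the mount path, and the function name are all hypothetical; the real injector's label and logic should be taken from the repo itself.

```python
# Hypothetical label; the label the injector actually checks is not stated here.
INJECT_LABEL = "example.org/inject-blob-volumes"


def build_patch(pod: dict, pvc_names: list) -> list:
    """Return JSON Patch operations that mount each PVC into the first container."""
    labels = pod.get("metadata", {}).get("labels", {})
    if labels.get(INJECT_LABEL) != "true":
        return []  # pod did not opt in, so leave it untouched

    spec = pod["spec"]
    patch = []
    # A JSON Patch "add" to ".../-" needs the target array to exist already.
    if "volumes" not in spec:
        patch.append({"op": "add", "path": "/spec/volumes", "value": []})
    if "volumeMounts" not in spec["containers"][0]:
        patch.append(
            {"op": "add", "path": "/spec/containers/0/volumeMounts", "value": []}
        )
    for name in pvc_names:
        patch.append({
            "op": "add",
            "path": "/spec/volumes/-",
            "value": {"name": name, "persistentVolumeClaim": {"claimName": name}},
        })
        patch.append({
            "op": "add",
            "path": "/spec/containers/0/volumeMounts/-",
            # Hypothetical mount location inside the notebook container.
            "value": {"name": name, "mountPath": f"/mnt/{name}"},
        })
    return patch
```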
README for aaw-contrib-containers @chuckbelisle
- [x] #1115
- [ ] What is the CI workflow and where do images get pushed (and other details relevant for developers)
Other Notes
- https://github.com/StatCan/aaw-contrib-containers/blob/master/.github/workflows/build.yml is a matrix build that builds all image folders in parallel
- only way for users to introduce custom images into the cluster
- process is that user makes PR with requested image and AAW maintainer can accept the PR
- automated container scanning when people do PRs (scan job is https://github.com/StatCan/aaw-contrib-containers/blob/master/.github/workflows/build.yml#L47)
README for aaw-contrib-jupyter-notebooks @rohank07
- [x] #1116
- [x] What is the purpose of this repo?
- [x] How do the artifacts from this repo get mounted to user notebook pods?
- [x] What "customizations" are we currently making (high level)? - can mostly just be links to third party documentation
README for aaw-contrib-r-notebooks @rohank07
- [ ] README for aaw-contrib-r-notebooks
- [ ] What is the purpose of this repo?
- [ ] How do the artifacts from this repo get mounted to user notebook pods?
- [ ] What "customizations" are we currently making (high level)? - can mostly just be links to third party documentation
README for aaw-gatekeeper-constraints and README for gatekeeper-policies @Collinbrown95
- [ ] #1117
Note: we should probably make the bulk of documentation in one of these two repos and make sure there are links between them.
- [ ] What does this repo do and how does it relate to the `gatekeeper-policies` repo?
- [ ] What is the workflow to add a new gatekeeper policy/constraint (i.e. what has to be updated and in which repos)?
- [ ] Any "gotchas" or things to look out for (e.g. possible race condition where the policy has to be created before the constraint, otherwise risk "object does not exist" error).
- [ ] Bonus points: I think a diagram could be helpful here to indicate what components/repos are involved with gatekeeper policies.
Notes on Gatekeeper policies
- Gatekeeper policies create CRDs
- Gatekeeper policies are enforced at the level of the K8s API server.
- https://github.com/StatCan/gatekeeper-policies/blob/master/gatekeeper-opa-sync.yaml this file specifies which k8s resources need to be tracked by Gatekeeper - important: Gatekeeper is opt-in, you have to tell it exactly what K8s resources to watch for.
- E.g. if a new CRD is added, you would need to add that resource to the list in https://github.com/StatCan/gatekeeper-policies/blob/master/gatekeeper-opa-sync.yaml
- Gatekeeper is a K8s-specific framework for Open Policy Agent (e.g. see rego code in https://github.com/StatCan/gatekeeper-policies/blob/master/general/container-allowed-images/template.yaml).
README for aaw-inferenceservices-controller
- [ ] What is the purpose of this repo? In what capacity does the AAW team interact with it (if any)?
- [ ] If any action is required (e.g. updates), how do AAW developers update this repository?
Notes:
- This relates to DNS rewrites for Knative services: you can't use an internal domain when sending requests to Knative services (possibly because there is no pod behind a Knative service?)
- By default, a local traffic policy prevents pod-to-pod communication over external URLs because the source (internal) IP of the requesting pod is lost.
- This repo implements a workaround for the Knative case: requests have to go through the external URL, so a DNS rewrite is needed.
- This modifies the CoreDNS ConfigMap for the cluster.
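As a rough illustration of the notes above, the sketch below builds a CoreDNS `rewrite name` line and inserts it into a Corefile string. The `rewrite name FROM TO` syntax is CoreDNS's own; everything else (function names, host names in the usage, inserting at the top of the first server block) is an assumption and may not match what the controller actually does.

```python
def rewrite_rule(external_host: str, internal_target: str) -> str:
    """Build one CoreDNS `rewrite name` line mapping an external host to an in-cluster name."""
    return f"rewrite name {external_host} {internal_target}"


def add_rewrite(corefile: str, rule: str) -> str:
    """Insert the rule at the top of the first server block, unless already present."""
    if rule in corefile:
        return corefile  # idempotent: re-running does not duplicate the rule
    head, brace, rest = corefile.partition("{")
    if not brace:
        raise ValueError("Corefile has no server block")
    return f"{head}{brace}\n    {rule}{rest}"
```

For example, `add_rewrite(corefile, rewrite_rule("my-service.example.ca", "gateway.istio-system.svc.cluster.local"))` would resolve the (hypothetical) external host to an in-cluster gateway service.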
README for aaw-kubeflow-containers @Jose-Matsuda
https://github.com/StatCan/aaw-kubeflow-containers/pull/348
- [x] What is the purpose of this repository (how is it different from contrib-containers)?
- [x] How does this repo get updated (e.g. how to add a new image)?
- [x] How are the images built (i.e. how does the `docker-bits` folder work and where is the `Makefile` that builds the images)?
Notes:
- This is where all of the user jupyterlab/remote desktop/etc images come from
- This repo is for specific containers that are in the kubeflow drop down of images whereas contrib containers is for one-off ad hoc containers that users need
- All images are derived from "docker-bits" (https://github.com/StatCan/aaw-kubeflow-containers/tree/master/docker-bits)
- `Makefile` specifies how to assemble the docker-bits and build an image out of them (https://github.com/StatCan/aaw-kubeflow-containers/blob/master/Makefile) - used by GitHub action to build and push all of the kubeflow-containers to Azure Container Registry.
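To make the "assemble the docker-bits" idea concrete, here is a small sketch that concatenates Dockerfile fragments in a given order. It assumes assembly is plain concatenation and uses made-up fragment names; the real rules are in the `Makefile` linked above.

```python
from pathlib import Path


def assemble_dockerfile(bits_dir: Path, bit_names: list) -> str:
    """Concatenate the named fragment files, in order, into one Dockerfile text."""
    parts = []
    for name in bit_names:
        text = (bits_dir / name).read_text()
        # Mark where each fragment came from to ease debugging of the result.
        parts.append(f"# --- from {name} ---\n{text.rstrip()}\n")
    return "\n".join(parts)
```

A call such as `assemble_dockerfile(Path("docker-bits"), ["0_base.Dockerfile", "1_tools.Dockerfile"])` (hypothetical file names) returns text that could be written to an output Dockerfile and built.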
README for aaw-kubeflow-manifests
- [ ] Is the current README contents up to date? (The README is already partially completed, so just need to ensure the existing content is still correct).
- [ ] Need an "intro" section indicating what the repo is actually for.
- [ ] How do you update part of the Kubeflow deployment?
Notes:
- new kubeflow deployment
- deployed by a specific ArgoCD ApplicationSet (which is an iterable version of an ArgoCD Application)
- autosync is NOT set up with ArgoCD: you need to go to the ArgoCD web app and click sync manually.
- Any configuration to the kubeflow UI also goes here
README for aaw-kubeflow-pipelines-secret-scanner
-
This repo might not need a documentation update, but it's worth documenting as it might need to be refactored if it needs to be used again.
-
cron job that fetches all of the kubeflow pipeline jobs that have ever been submitted and runs rule-based + entropy checks to see if anything that looks like a password was added
-
pushes results into elasticsearch index and pushes a notification to the slack channel to alert of any possible password leaks
-
might not work any more because network policies have changed.
TODO: make a small ticket on sprint board to fix + re-implement this
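For whoever picks up the re-implementation, a minimal sketch of what "rule-based + entropy checks" could look like. The key-name regex, the 16-character minimum, and the 3.5 bits/char threshold are illustrative assumptions, not the values used by the existing scanner.

```python
import math
import re
from collections import Counter

# Hypothetical rule: parameter names that often hold credentials.
SUSPICIOUS_KEY = re.compile(r"pass(word)?|secret|token", re.IGNORECASE)


def shannon_entropy(value: str) -> float:
    """Bits of entropy per character, computed from character frequencies."""
    if not value:
        return 0.0
    counts = Counter(value)
    total = len(value)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_like_secret(key: str, value: str, threshold: float = 3.5) -> bool:
    """Flag a parameter if its name matches a rule or its value is long and high-entropy."""
    if SUSPICIOUS_KEY.search(key):
        return True
    return len(value) >= 16 and shannon_entropy(value) > threshold
```

The name rule catches obvious cases cheaply, while the entropy check targets random-looking values under innocuous names; both will produce false positives, so results are best treated as alerts to review.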
README for aaw-kubeflow-profiles @chuckbelisle
-
Updated README: https://github.com/StatCan/aaw-kubeflow-profiles
-
[x] Add a README (there currently is no README in this repo)
-
[x] What is the purpose of this repo?
-
[x] How do you update/add/delete a Kubeflow profile?
-
[x] #1143
Notes:
- this is how we create AAW namespaces
- create namespace by creating jsonnet file
- ArgoCD application watches the repository root and deploys everything that ends with jsonnet
- Pattern: labels are added to jsonnet files, argocd deploys / syncs jsonnet files, then specific controllers (e.g. gitea controller) watch for those labels and take actions based on them.
- TODO: adding per-namespace features like Gitea should be label-based rather than function based
README for aaw-kubeflow-profiles-controller @cboin1996
-
[x] How to run aspects of the controller locally:
- [x] blob csi : https://github.com/StatCan/daaas/issues/1118
-
[ ] Blob-CSI feature architecture: https://github.com/StatCan/daaas/issues/1113
-
[x] Readme additions:
- [x] this is the new repo that replaces kubeflow-controller
- [x] each controller is implemented as a sub-command of a main CLI program.
-
[ ] TODO: figure out what documentation should be in here and what should move elsewhere. Should move high level documentation to aaw-argocd-manifests repo.
README for aaw-security-proposal
- [ ] https://github.com/StatCan/aaw-security-proposal/issues/6
- [ ] https://github.com/StatCan/aaw-security-proposal/issues/10 Is this up to date?
- [ ] Include any updates to reflect current architecture: https://github.com/StatCan/aaw-security-proposal/issues/4
README for aaw-network-policies @vexingly
- [x] #1119
- [x] What is the purpose of this repository?
- [x] How are these network policies updated?
- [x] What is the difference between network policies in this repo and policies configured in the `network.go` controller of the `aaw-kubeflow-profiles-controller` repo.
- [x] Bonus Points: A diagram might be helpful for this repo (i.e. which network policies are applied in which namespaces)
Notes:
- All system namespace network policies are configured here
- User network policies are configured in the network.go controller in aaw-kubeflow-profiles-controller
- what actually gets deployed is either dev/prod in https://github.com/StatCan/aaw-network-policies/tree/aaw-dev-cc-00/environments
- in aaw-argocd-manifests there is a /network-policy argocd application that watches the above
- https://github.com/StatCan/aaw-network-policies/blob/aaw-dev-cc-00/environments/aaw-dev-cc-00/kustomization.yaml just deploys the policies in the parent folder with kustomize
README for aaw-prob-notebook-controller @wg102
- [x] Ensure existing content is up to date
- [x] What is the purpose of this repo?
- [x] How to contribute to it?
- [x] What other repos does it relate to/depend on (if any)?
Notes:
- watches for creation of Notebook CRDs and creates authorization policies for Protected B notebooks
- these policies disable some upload/download buttons in the notebooks.
Update README for aaw-profile-state-controller @saffaalvi
- [x] What all is involved with the enforcement of this controller? (E.g. Gatekeeper policies, SAS notebooks, Kubeflow profiles, etc. - how do all of these entities interact?)
- [x] I think a diagram might be helpful in this repository. Especially because (I think) other platform features will follow the pattern used in the SAS notebook feature.
Extra Info:
-
watches for SAS notebooks - if SAS notebook is present, then non-employee cannot be added. If non-employee is in namespace, then user cannot create SAS notebook
-
watches for rolebindings in profile namespace - if any non-employee user is present in namespace role binding, then add label saying this is a non-employee namespace
-
this controller sets labels that Gatekeeper uses to enforce policies
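The interaction described above can be sketched as two small functions: one for the labels the controller would set, and one for the policy-side decision that consumes them. The label keys, the "image name contains sas" heuristic, the request names, and the `is_employee` callback are all hypothetical stand-ins for whatever the controller and Gatekeeper policies actually use.

```python
from typing import Callable, Iterable

# Hypothetical label keys; the real controller's label names are not given here.
SAS_LABEL = "state.example.org/exists-sas-notebook"
NON_EMPLOYEE_LABEL = "state.example.org/exists-non-employee-user"


def compute_labels(
    notebook_images: Iterable[str],
    rolebinding_users: Iterable[str],
    is_employee: Callable[[str], bool],
) -> dict:
    """Controller side: derive namespace labels from notebooks and rolebindings."""
    # Assumption: a SAS notebook can be recognised from its image name.
    has_sas = any("sas" in image.lower() for image in notebook_images)
    has_non_employee = any(not is_employee(user) for user in rolebinding_users)
    return {
        SAS_LABEL: str(has_sas).lower(),
        NON_EMPLOYEE_LABEL: str(has_non_employee).lower(),
    }


def admission_allowed(labels: dict, request: str) -> bool:
    """Policy side: decide using only the namespace labels set by the controller."""
    if request == "create-sas-notebook":
        return labels[NON_EMPLOYEE_LABEL] != "true"
    if request == "add-non-employee":
        return labels[SAS_LABEL] != "true"
    return True
```

Keeping the decision dependent only on labels is what would let other platform features reuse the same controller-plus-policy pattern.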
README for aaw-terraform-repo-management
- [ ] What is the purpose of this repo? What other repos does it relate to?
- [ ] How to update this repo?
Notes:
-
Add repositories to the list in https://github.com/StatCan/aaw-terraform-repo-management/blob/main/image_registry_secrets.tf
-
Configures the listed repositories with the GH secrets they need to push to the registries we use (Artifactory and Azure Container Registry, which are the only two we use)
-
Could use a repo like this to configure the GH provider for many purposes (e.g. configure GH action variables, repo metadata, GH rbac, etc.)
README for aaw-toleration-injector @Collinbrown95
- [ ] What is the purpose of this repo?
- [ ] What other repos does it interact with?
- [x] How to update it?
Notes:
- taints on nodes and tolerations on pods that want to get scheduled onto those nodes
- This is a mutating webhook that modifies pods at the k8s API admission control layer
- E.g. GPU nodes will have a taint, and only in certain cases do we want a toleration that allows a pod to be scheduled on such a node - see https://github.com/StatCan/aaw-toleration-injector/blob/main/mutate.go#L80-L85 for example
- Also differentiate between user nodes and system nodes
- unclassified user pods sit in one node pool, Protected B (prob) user pods sit in another node pool, and system pods sit in a third node pool.
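A small sketch of the taint/toleration idea in the notes above. The toleration fields (`key`, `operator`, `value`, `effect`) are the standard Kubernetes ones; the taint keys, pool names, and function names are hypothetical, and the real selection logic is in `mutate.go`.

```python
# Hypothetical taint keys and pool names; the real ones live in the cluster config.
POOL_KEY = "example.org/node-pool"
GPU_KEY = "example.org/gpu"
POOLS = {"system", "unclassified", "protected-b"}


def toleration(key: str, value: str) -> dict:
    """One Kubernetes toleration matching a NoSchedule taint with this key/value."""
    return {"key": key, "operator": "Equal", "value": value, "effect": "NoSchedule"}


def tolerations_for(pool: str, requests_gpu: bool = False) -> list:
    """Tolerations a webhook could inject so the pod lands on its intended pool."""
    if pool not in POOLS:
        raise ValueError(f"unknown node pool: {pool}")
    result = [toleration(POOL_KEY, pool)]
    if requests_gpu:
        # Only pods that ask for a GPU get to tolerate the GPU taint.
        result.append(toleration(GPU_KEY, "true"))
    return result
```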
Repos Scheduled for Archive
I'm identifying repos that are scheduled for deprecation in the coming weeks/months. We shouldn't document these repos extensively, but it might not be a bad idea to include a short 1-2 sentence summary in the README indicating what it did and what it's deprecated in favour of?
-
- this is going to be archived soon - it's a mutating web hook that is going to be replaced by the blob-csi-injector
- this relates to the boathouse project which will be deprecated in favour of the blob-csi implementation
-
[ ] aaw-kubeflow-controller
- Destined to be deprecated in favour of aaw-kubeflow-profiles-controller but not yet fully deprecated
- vault is still relevant to the setup of MinIO bucket provisioning (also destined for deprecation; it will be replaced by the blob-csi-driver + s3proxy solution)
- [ ] TODO: https://github.com/StatCan/aaw-kubeflow-controller/blob/master/defaults.go can be deprecated and the same functionality placed in https://github.com/StatCan/aaw-kubeflow-profiles-controller.
- [ ] TODO: review all functionality that can be migrated from `aaw-kubeflow-controller` to the new `aaw-kubeflow-profiles-controller`.
-
- kubeflow mlops should be deprecated (hasn't been updated in almost 2 years)
- doesn't work any more (access control permissions have changed)
- basically a proof of concept
- not going to be used/not needed so to be deprecated.
- [ ] TODO: we need to make a ticket to formally deprecate + archive this
-
- watches for rolebindings between users and namespaces
- if user has rolebinding to a namespace, this controller pushes this information as a JSON file to OPA
- this is necessary so that MinIO + OPA are aware of kubernetes RBAC information in order to make allow/deny decisions about which users can access which buckets
- This repo is destined for deprecation once the s3proxy alternative is finished
- setup is documented in readme here https://github.com/StatCan/aaw-argocd-manifests/tree/aaw-dev-cc-00/storage-system/kustomize-gateway
- This specific repo is deprecated in favour of the s3proxy alternative. The new implementation is per-namespace, so the Kubeflow UI already takes care of which users can access which buckets, and the custom logic implemented in this repo's controller is no longer needed.
- [ ] TODO: need to schedule this repo for archive once s3proxy is implemented.
-
relates to various vault repos which are also destined for deprecation.
-
mutating web hook
-
watches for annotations on notebook pods, argo workflows, and one other custom label
-
a vault controller injects a sidecar container into pods with these labels; the sidecar holds the MinIO credentials and rotates them.
-
might be deprecated (flag as maybe deprecated - if not deprecated then we need to update the README to indicate what this repo is doing).
-
uses crane to force image pulls through artifactory which will do security scanning
-
TODO: figure out if we will keep or deprecate in favour of built in artifactory features
-
[ ] aaw-trino (will be removed soon, moved charts into statcan/charts)
- will probably be archived and rolled into aaw-argocd-manifests or similar
- this is set up currently for local development TODO:
- need to scope this out more with Wendy and Rohan
- then document once we have figured it out
- this repo is scheduled for deprecation eventually
Other Stuff
How to navigate cluster-rbac
- [ ] https://github.com/alcideio/rbac-tool scope out this tool for rbac exploration?
Tools we use
- [ ] Have a documentation page indicating all of the various CLI programs we use to navigate the platform (e.g. k9s, ranger, byobu, konstraint, rbac-tool, etc.)