Properly figure out OpenMPI on Kubernetes
Is your feature request related to a problem? Please link issue ticket
The problem is that OpenM++ models take an unacceptable amount of time to run with the current solution.
Describe the solution you'd like
Want users to be able to schedule OpenMPI jobs natively on Kubernetes. This effectively enables users to scale jobs with OpenM++.
Proposed Solution
This repo (https://github.com/everpeace/kube-openmpi#run-kube-openmpi-cluster-as-non-root-user) has an implementation of OpenMPI on Kubernetes. Some prior work on this has already been done here: https://github.com/Collinbrown95/kube-openmpp. In particular:
- Successfully ran a distributed job with the everpeace openmpi solution
- Successfully built an OpenM++ image on top of the everpeace image
- Successfully ran the Oncosim model with the `mpiexec` entrypoint
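The third bullet can be sketched roughly as follows. This is a hedged illustration, not the exact command from the prior work: the hostfile path, model binary name, and rank count are placeholders, and the script only echoes the command so it can be dry-run outside a cluster.

```shell
#!/bin/sh
# Sketch of launching an OpenM++ model across kube-openmpi worker pods.
# HOSTFILE and MODEL_BIN are assumed names for illustration only.

HOSTFILE=/kube-openmpi/generated/hostfile   # assumed path to the generated hostfile
MODEL_BIN=./modelOne                        # placeholder model executable
NP=4                                        # one MPI rank per worker pod

# Build the command first so it can be logged or dry-run before execution.
CMD="mpiexec --hostfile $HOSTFILE -n $NP $MODEL_BIN"
echo "$CMD"
# Uncomment to actually launch once the cluster is up:
# $CMD
```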
TODO
- [x] confirm that we can run OpenMPI job as non-root user
- [ ] test end-to-end that we can run an Oncosim (e.g.) model using the Kubernetes OpenMPI solution
- [ ] Once proof of technology is ready, coordinate with AAW team to figure out how we should integrate this feature into the cluster.
Describe alternatives you've considered
This problem can be temporarily resolved by adding big VMs to the AKS node pool. However, this has a number of drawbacks that make it unsustainable/not user friendly in the long term.
- Users would need to shut down and start up their notebook servers every day to avoid a prohibitively expensive cloud bill (e.g. leaving a pod with 100 vCPUs allocated idle for 24 hours).
- Starting up a notebook server every day to run experiments would be time consuming, wasting several hours per week per researcher (e.g. ~10 minutes for the pod to be scheduled, plus start-up time to log in and set up their environment).
- Possibly additional latency to allocate a large VM to the cluster (it may take longer to add a large, uncommon VM to the cluster than a more common commodity VM).
Additional context
Today, there is a remote desktop instance that has OpenM++ installed that is fine for development purposes. However, when running these models at scale, the size of the underlying node pool VMs is a limitation to how fast these models can run. Specifically, the OpenM++ workloads are CPU bound and highly parallelizable, but the maximum size of a notebook server is currently ~15 vCPUs because the underlying nodes only have 16 vCPUs.
Importantly, the users who require OpenM++ at scale are researchers, so the user interface needs to provide a non-programming interface, and OpenM++ provides this with a graphical user interface out of the box.
OpenMPI support is important for enabling large/complex workflows through OpenM++ (the modelling software used by a few teams; it replaces modgen). While the command line may be used in development, users are generally unlikely to submit jobs that way. They will most likely use the OpenM++ web UI to run model simulations and gather results.
There is an example of using OpenMPI as a non-root user which can likely be used as inspiration for adapting the out of the box scripts that are triggered to launch jobs from the web service.
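One piece of adapting those launch scripts could be a non-root guard, since Open MPI refuses to run as root unless `--allow-run-as-root` is passed. The sketch below is an assumption about how the adapted script might look; the function name is illustrative and not from the repo.

```shell
# check_nonroot: sketch of a guard the adapted launch script could apply
# before invoking mpiexec. Succeeds only for a non-root uid.
check_nonroot() {
  [ "$1" -ne 0 ]
}

# In the real script this would be: check_nonroot "$(id -u)" || exit 1
check_nonroot 1000 && echo "uid 1000 ok: safe to exec mpiexec"
```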
Candidate PRs for adding a compute-optimized node pool (short-term solution):
- PR to terraform-azure-statcan-aaw-environment: add compute-optimized node pools
- PR to terraform-advanced-analytics-workspace-infrastructure: instantiate the node pool in the dev cluster (a similar PR could instantiate it in prod)
- PR to aaw-toleration-injector: update the toleration injector to give oncosim pods a toleration for the `node.statcan.gc.ca/use=cpu-72` taint
Edit: the toleration injector should use a name prefix plus the namespace the pod is scheduled to, e.g. look for a prefix of `cpu-big` in the pod name and then add the appropriate toleration.
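The prefix rule above could be sketched as a simple match. `cpu-big` is taken from the comment itself but is not a settled convention, and a real injector would also restrict the rule to particular namespaces, as suggested.

```shell
# needs_big_cpu <pod-name>: sketch of the injector's proposed rule, which
# matches on a pod-name prefix before adding the toleration for the
# node.statcan.gc.ca/use=cpu-72 taint.
needs_big_cpu() {
  case "$1" in
    cpu-big*) return 0 ;;  # prefix match: add the toleration
    *)        return 1 ;;  # anything else: leave the pod unchanged
  esac
}

needs_big_cpu "cpu-big-oncosim-0" && echo "toleration added"
```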
CC: @chuckbelisle @YannCoderre
@brendangadd can you approve the use of VM size "Standard_F72s_v2"?
@Collinbrown95 Yep, that VM size is fine. Re. implementation, I'll add some comments to the injector PR.
Work for high-CPU node pool moved to https://github.com/StatCan/daaas/issues/1193
OpenM++ now supports elastic resource management to start and stop cloud servers / clusters / nodes / etc. on demand. It invokes a shell script / batch file / executable of your choice to start a resource (server / cluster / node / etc.) when a user wants to run the model, and to stop the resource when the model run queue is empty.
I am not sure, but it may help to resolve this issue with a couple of shell scripts. It is already deployed in the cloud for OncoSimX / HPVMM customers; please take a look if you are interested.
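On AKS, those start/stop hooks might reduce to scaling a node pool. A hedged sketch, assuming the `az aks nodepool scale` CLI command; the resource group, cluster, and pool names are placeholders, and the function echoes the command rather than running it so it can be dry-run:

```shell
#!/bin/sh
# Sketch of start/stop hooks OpenM++'s elastic resource management could call.
set -eu

RG=aaw-dev-rg            # placeholder resource group
CLUSTER=aaw-dev-cluster  # placeholder AKS cluster
POOL=cpu72               # placeholder compute-optimized pool name

scale_pool() {
  # scale_pool <count>: echo the az command instead of executing it.
  echo az aks nodepool scale --resource-group "$RG" \
       --cluster-name "$CLUSTER" --name "$POOL" --node-count "$1"
}

scale_pool 3   # "start" hook: bring up worker nodes before a model run
scale_pool 0   # "stop" hook: release them when the run queue is empty
```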