Alternatives to updating the conda environment on job execution
Currently, on job execution (if dependencies are specified), we clone the base environment, which is just a series of copy and move operations, and then update the cloned environment with the new dependencies. Rough timing (via console.time) shows that the update step takes the longest.
One alternative would be to check the environment file as follows:
- Check that the Python version is 3.7
- Check that only pip-installed dependencies are listed
If both checks pass, we could skip the conda update and just install the dependencies using pip.
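The check above could be sketched roughly as follows. This is a minimal sketch, not existing DeepForge code: the function name `pip_only_requirements` is hypothetical, the environment file is assumed to be already parsed into a dict (for example with `yaml.safe_load`), and it assumes that a bare `pip` entry and a `python` entry pinned to 3.7 are the only conda-level dependencies that still qualify.

```python
import re

# Assumption: the base environment runs Python 3.7, as stated above.
BASE_PYTHON = "3.7"


def pip_only_requirements(env, base_python=BASE_PYTHON):
    """Return the pip requirements if `env` qualifies for the pip-only
    fast path, otherwise None.

    `env` is the parsed environment file (a dict). Hypothetical helper.
    """
    pip_reqs = []
    for dep in env.get("dependencies") or []:
        if isinstance(dep, dict):
            # A mapping entry is only acceptable if it is the `pip:` sub-list.
            if set(dep) != {"pip"}:
                return None
            pip_reqs.extend(dep["pip"] or [])
            continue
        if not isinstance(dep, str):
            return None

        # Conda spec: a package name, optionally followed by a version spec.
        match = re.match(r"^\s*([A-Za-z0-9_.\-]+)\s*(.*)$", dep)
        if match is None:
            return None
        name = match.group(1).lower()
        version = match.group(2).strip().lstrip("=")

        if name == "pip" and not version:
            continue  # pip itself is already in the base environment
        if name == "python" and (
            not version
            or version == base_python
            or version.startswith(base_python + ".")
        ):
            continue  # same Python as the base environment
        # Any other conda dependency needs the full conda update.
        return None
    return pip_reqs
```

A return value of `None` means the regular conda update is still required. The check is deliberately conservative: range specs such as `python>=3.6` are rejected rather than interpreted.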
The following benchmarks show that installing numpy and pandas using pip (about 4.2 s) is significantly faster than waiting for conda to resolve the environment (about 19.7 s).
dependencies:
  - pip:
    - numpy
Example:
(base) umesh@isisdell:~$ time conda run -n deepforge-copy pip install numpy pandas
real 0m4.185s
user 0m2.503s
sys 0m0.348s
(base) umesh@isisdell:~$ time conda run -n deepforge-copy pip uninstall numpy --yes
real 0m0.802s
user 0m0.677s
sys 0m0.125s
(base) umesh@isisdell:~$ time conda env update -n deepforge-copy --file update-file.yml
real 0m19.691s
user 0m17.325s
sys 0m1.232s
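Choosing between the two commands timed above could look something like this. It is only a sketch: `update_command` is a hypothetical helper, and `pip_reqs` is assumed to be the list of pip requirements when the environment file qualifies for the pip-only path, or `None` otherwise.

```python
def update_command(env_name, env_file, pip_reqs=None):
    """Build the command that brings the cloned environment up to date.

    Hypothetical helper: uses the same two commands as the benchmark above.
    """
    if pip_reqs:
        # Fast path: install straight into the cloned environment with pip.
        return ["conda", "run", "-n", env_name, "pip", "install", *pip_reqs]
    # Fallback (also taken when the list is empty): let conda resolve the
    # full environment file.
    return ["conda", "env", "update", "-n", env_name, "--file", env_file]
```

The resulting list can be passed to something like `subprocess.run(cmd, check=True)`. Keeping the fallback as the default means any environment file the check does not recognize behaves exactly as it does today.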
Probably a good idea given the prevalence of pip dependencies. Kinda annoying to introduce a special case optimization like this though :(
https://www.anaconda.com/blog/understanding-and-improving-condas-performance has some ideas on improving conda's performance. Very few apply to us.