Move temporary charliecloud containers outside of process directory
This addresses #5192.
Instead of placing the process container inside the process work directory, it is created in a folder that can be defined via `charliecloud.containerStore`, or by default in `work/charliecloud_store`. This solves the problem of `find` trying to work through the container.
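For reference, the option could be set like this in `nextflow.config` (a sketch based on the option name proposed in this PR; the scratch path is a placeholder):

```groovy
charliecloud {
    enabled        = true
    // proposed in this PR: where temporary per-process containers go;
    // falls back to <workDir>/charliecloud_store when unset
    containerStore = '/scratch/charliecloud_store'
}
```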
There are some issues that remain:

- I do not know if `nextflow clean` should remove `work/charliecloud_store`, but I would assume it does not.
- The whole approach of creating process-specific containers comes with considerable storage overhead, since each process gets a dedicated container, which can easily mean hundreds to thousands of new containers during a single run. There are some workarounds: on modern systems, `ch-run`'s `--write-fake` can be used, which skips process-specific container creation and instead runs containers from the container storage. This is probably the best way. On somewhat modern systems, `useSquash` can be enabled to create `.squashfs` containers, which create only a single file per container instead of a whole directory tree. I think it would be nice to clean up the temporary (process) container after the process finishes, but I am not sure if / how this can be implemented via the charliecloud driver.
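For context, the two workarounds map onto the charliecloud CLI roughly like this (a sketch; the image name is a placeholder):

```console
# modern kernels: run straight from the image storage with an
# overlayfs upper layer, no per-process container copy needed
$ ch-run --write-fake ubuntu -- touch /only-lives-in-the-overlay

# somewhat older systems: pack each image into a single .squashfs
# file instead of a whole directory tree
$ ch-convert ubuntu ubuntu.sqfs
```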
Deploy Preview for nextflow-docs-staging ready!
| Name | Link |
|---|---|
| Latest commit | 416a496f672492293bccedcf6bb045c39a41377a |
| Latest deploy log | https://app.netlify.com/sites/nextflow-docs-staging/deploys/66e2b38c8e584e0007ba164f |
| Deploy Preview | https://deploy-preview-5212--nextflow-docs-staging.netlify.app |
Any comments or thoughts on this PR?
Any chance someone could review this and let me know if anything should be changed or done differently before this can be merged? @pditommaso @bentsherman
It would be nice to have a review @l-modolo or @phue (provided they are still interested in / using charliecloud).
Thanks @phue, I think the review largely confirms what I have outlined in the initial comment: Creating one container per task (as was initially suggested here) is not going to be a useful route for handling containers, especially if squashfs is not an option.
I would like to discuss what a reasonable way to work around this could be.
The problem underlying this issue is that `ch-run` refuses to run images stored in `$CH_IMAGE_STORAGE` with `-w` (writeable), which is in turn required to use `-b` (bind-mount) to bind a path into the container root.
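A sketch of the conflict (image and path names are placeholders):

```console
# binding a host path into the container root needs a writeable image
$ ch-run -w -b "$PWD/work:/work" ./ubuntu -- true

# but ch-run refuses -w for images run by name from $CH_IMAGE_STORAGE,
# because storage images are supposed to stay unmodified
$ ch-run -w ubuntu -- true   # fails
```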
Some solutions I could come up with:

- Use `--write-fake`, run from `$CH_IMAGE_STORAGE`, and offer no other options. This removes support for old kernels, and I would no longer be able to use charliecloud myself. I am obviously not in favour of this. I think it should be the default way of running images, but other ways need to remain available for older systems.
- Work around the charliecloud cache: only use `$NXF_CHARLIECLOUD_CACHEDIR` or `charliecloud.cacheDir` for storage, and either make sure that those are not identical to `$CH_IMAGE_STORAGE`, or unset `$CH_IMAGE_STORAGE` before running images. Unsetting `$CH_IMAGE_STORAGE` probably goes against the idea of charliecloud maintaining a cache of unmodified images.
- Continue using the charliecloud cache (`$CH_IMAGE_STORAGE`) and provide a second cache (`$NXF_CHARLIECLOUD_CACHEDIR` or `charliecloud.cacheDir`) that is only available for that pipeline run. I do not really see an advantage: images would still need to be converted via `ch-convert` to be moved from `$CH_IMAGE_STORAGE` to the run cache, and I imagine that is not much faster than simply re-downloading the image into the secondary cache. It also seems unnecessarily complex to maintain a shadow cache for each pipeline run.
I understand the idea of maintaining the purity of containers in the `$CH_IMAGE_STORAGE` cache, but I think the way nextflow uses containers requires a cache that can be run with `-w` in case `--write-fake` is not an option. I am inclined to think that simply dropping `$CH_IMAGE_STORAGE` from the whole container pulling procedure, relying on `$NXF_CHARLIECLOUD_CACHEDIR` and `charliecloud.cacheDir`, and falling back onto `$workDir/charliecloud` is probably the best procedure. Maybe it would be worth checking whether `$NXF_CHARLIECLOUD_CACHEDIR` is identical to `$CH_IMAGE_STORAGE` and throwing an error to avoid problems with `-w`?
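In shell terms, such a check could look something like this (a hypothetical sketch; the actual check would live in the Groovy code of this PR):

```shell
# Hypothetical guard: refuse a writeable nextflow cache directory that
# is identical to charliecloud's own storage ($CH_IMAGE_STORAGE).
check_cache_dir() {
    cache_dir=$1
    if [ -n "${CH_IMAGE_STORAGE:-}" ] && [ "$cache_dir" = "$CH_IMAGE_STORAGE" ]; then
        echo "error: cache dir must not equal CH_IMAGE_STORAGE" >&2
        return 1
    fi
}
```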
Do you have any thoughts or comments on this @phue ?
I have updated this PR to follow what I have outlined above:
> I am inclined to think that simply dropping `$CH_IMAGE_STORAGE` from the whole container pulling procedure and relying on `$NXF_CHARLIECLOUD_CACHEDIR`, `charliecloud.cacheDir` and falling back onto `$workDir/charliecloud` is probably the best procedure. Maybe it would be worth checking if `$NXF_CHARLIECLOUD_CACHEDIR` is identical to `$CH_IMAGE_STORAGE` and throw an error to avoid problems with `-w`
This greatly simplifies the container procedure for charliecloud. However, nextflow no longer makes proper use of charliecloud's internal storage capabilities. I personally do not consider this a major drawback; it is still possible to use `$CH_IMAGE_STORAGE` for everything that is not nextflow.
The commit messages that end abruptly should include `$CH_IMAGE_STORAGE`, but I used double quotes so it got expanded to nothing; sorry about that.
Just to recap why `-w` was needed in the first place:
Before charliecloud gained support for overlayfs (implemented as `--write-fake`), we had to ensure the task work directory exists within the container to make it work with nextflow. This was done by using `ch-run --write` and `mkdir`.
This wasn't optimal in the first place, because it goes against the idea of immutable containers and completely lacks provenance.
Now overlayfs solves that issue and imo should be the default, but unfortunately its implementation in charliecloud requires a rather recent kernel.
The next best thing to do imo would be something along these lines:

- leave it completely to charliecloud to handle its cache (the cache implementation seems to change every couple of releases anyway)
- let it store "vanilla" images there, and for each nextflow run create a "working copy" of that container using `ch-image convert` and store it under `work/charliecloud`.
- to the latter one, add the mountpoint for `work/`, and cross fingers that it is enough to do that once per container and not repeatedly for each task, i.e. if you bind-mount `work/` it will hopefully also recursively include all subdirectories thereof. Maybe give that a try.
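That suggestion would roughly correspond to (a sketch; image and directory names are placeholders):

```console
# once per image and run: create a writeable working copy under work/
$ ch-convert ubuntu "$PWD/work/charliecloud/ubuntu"

# per task: bind-mount work/; subdirectories come along with the mount
$ ch-run -b "$PWD/work:/work" "$PWD/work/charliecloud/ubuntu" -- true
```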
> Just to recap why `-w` was needed in the first place: Before charliecloud gained support for overlayfs (implemented as `--write-fake`), we had to ensure the task work directory exists within the container to make it work with nextflow. This was done by using `ch-run --write` and `mkdir`. This wasn't optimal in the first place, because it goes against the idea of immutable containers and completely lacks provenance. Now overlayfs solves that issue and imo should be the default, but unfortunately its implementation in charliecloud requires a rather recent kernel.
I agree that --write-fake should be used by default, but I will also be honest that a big motivation for me is having an implementation that works for me, and overlayfs is not an option with my infrastructure.
> leave it completely to charliecloud to handle its cache (the cache implementation anyway seems to change every couple of releases)
This was how the cache was used up until my latest commit. Even with this commit charliecloud will still handle its own cache, but nextflow will use a different cache ;)
> let it store "vanilla" images there, and for each nextflow run create a "working copy" of that container using `ch-image convert` and store it under `work/charliecloud`. to the latter one, add the mountpoint for `work/`, and cross fingers that it is enough to do that once per container and not repeatedly for each task, i.e. if you bind-mount `work/` it will hopefully also recursively include all subdirectories thereof. Maybe give that a try.
I have contemplated this before going with the current approach. The main problem is that this would require some additional parts in the `charliecloudCache`, basically replicating the behaviour of the `imageDownloader` to have an `imageCloner` or something. Since `ch-convert` is not significantly faster than `ch-image pull` (personal observation), the difference between downloading images to somewhere that is not `$CH_IMAGE_STORAGE`, versus downloading them into `$CH_IMAGE_STORAGE` and then converting them from there to that other place, seems rather minor, but comes with a bunch of extras that may cause issues in the future. If there is some advantage that I am missing, I am happy to learn more about this.
Conversion of the container into a single "working copy" cannot easily be done inside the `charliecloudBuilder`, because builders may run in parallel, and I assume it would create a massive mess if several tasks tried to do the same `ch-convert`. Of course checks could be added to avoid this, but I think it would be rather messy.
I should add that the current approach may lead to issue #3964 coming up again.
I don't quite understand this.
Isn't `ch-image pull -s <cacheDir>` (as done in `CharliecloudCache`) exactly the same as doing `CH_IMAGE_STORAGE=<cacheDir> ch-image pull ...`?
And if so, I believe you'll run into https://github.com/nextflow-io/nextflow/issues/4463 again.
> The main problem is that this would require some additional parts in the charliecloudCache, basically replicating behaviour of the imageDownloader to have an imageCloner or something. Since `ch-convert` is not significantly faster than `ch-image pull` (personal observation), the difference between downloading images to somewhere that is not `CH_IMAGE_STORAGE` and downloading them into `CH_IMAGE_STORAGE` and then converting images from `CH_IMAGE_STORAGE` to that other place seems kinda minor, but come with a bunch of extras that may cause issues in the future? If there is some cool advantage that I am missing I am happy to learn more about this.
The main difference is that this will populate a new local cache each time you start a fresh pipeline run. It may not be faster but certainly cheaper because using the global cache at least gives you some deduplication.
> Conversion of the container into a single "working copy" cannot easily be done inside the charliecloudBuilder, because those may run parallel and I assume it would create a massive mess if several tasks try to do the same `ch-convert`. Of course checks could be added to avoid this, but I think it would be kinda messy.
Then extend `CharliecloudCache`, it already implements a mutex to avoid concurrent pulls.
> I don't quite understand this. Isn't `ch-image pull -s <cacheDir>` (as done in `CharliecloudCache`) exactly the same as doing `CH_IMAGE_STORAGE=<cacheDir> ch-image pull ...`?
Yes, it is. One difference is that `ch-run` will refuse to run with `-w` if the container is run by name from `$CH_IMAGE_STORAGE`. I see the issue regarding #4463 and will think about how to handle this.
> Then extend `CharliecloudCache`, it already implements a mutex to avoid concurrent pulls.
This mutex never worked properly and caused problems. See here https://github.com/nextflow-io/nextflow/issues/3566
> The main difference is that this will populate a new local cache each time you start a fresh pipeline run. It may not be faster but certainly cheaper because using the global cache at least gives you some deduplication.
You could use the same cache for different pipeline runs via `$NXF_CHARLIECLOUD_CACHEDIR` when `--write-fake` cannot be used.
So, here is the latest set of changes:

- `charliecloud.writeFake` is `true` by default (!)
- if `charliecloud.writeFake` is `true` and `$CH_IMAGE_STORAGE` is set, images will go into `$CH_IMAGE_STORAGE` and be run from there with `--write-fake` (they will be run by name in that case).
- if `charliecloud.writeFake` is `true` but `$CH_IMAGE_STORAGE` is unset, images will go into `charliecloud.cacheDir`, `$NXF_CHARLIECLOUD_CACHE` or `work/charliecloud` (in order of priority) and will be run by path (since running by name only works if `$CH_IMAGE_STORAGE` is set).
- if `charliecloud.writeFake` is set to `false`, `$CH_IMAGE_STORAGE` will not be used to store images. Images will go into `charliecloud.cacheDir`, `$NXF_CHARLIECLOUD_CACHE` or `work/charliecloud` (in order of priority). If the folder used to store images is the same as `$CH_IMAGE_STORAGE`, an exception is thrown. Images will be run by path.
- if the image is supposed to be run with `-w` (i.e. `readOnlyInputs = false` and `writeFake = false`) and `$CH_IMAGE_STORAGE` is set, an exception is thrown. This is mainly to ensure that users realize this can be a problem; I think it would also be fine not to throw here?
In summary:

- Give preference to `$CH_IMAGE_STORAGE` in the default case of `writeFake`.
- Make some effort to avoid using `$CH_IMAGE_STORAGE` when `writeFake` is `false`.
- Do not use convert, since I still do not see a major advantage over pulling. As far as I understand, charliecloud will maintain its own cache defined by `$CH_IMAGE_STORAGE` even if the image goes somewhere else. nextflow will use `$CH_IMAGE_STORAGE` if it is set and possible (`writeFake = true`), and nextflow will never use `$CH_IMAGE_STORAGE` if `-w` is used, to avoid messing with the charliecloud cache.
I would also like to add something that may not be obvious (at least to me) regarding this
> The main difference is that this will populate a new local cache each time you start a fresh pipeline run. It may not be faster but certainly cheaper because using the global cache at least gives you some deduplication.
The charliecloud build cache appears to be located in `$CH_IMAGE_STORAGE`, but would still be used if the image is pulled somewhere else (https://hpc.github.io/charliecloud/ch-image.html#build-cache). I understand this to mean that setting `$CH_IMAGE_STORAGE` is generally a good idea. I will therefore modify this:
> If the image is supposed to be run with `-w` (i.e. `readOnlyInputs = false` and `writeFake = false`) and `$CH_IMAGE_STORAGE` is set, this will throw an exception.
to test whether the image storage directory is the same as `$CH_IMAGE_STORAGE`, and only throw an exception if that is the case.
Edit: I think I misunderstood, and the cache is in the directory that is passed via `-s`. See also https://github.com/hpc/charliecloud/discussions/1922
To explain my reluctance to add a conversion step, besides the fact that it would be quite some work: it would only add support for cases where someone is not able to use `writeFake` and needs to use `$CH_IMAGE_STORAGE` when running nextflow. If the goal is to have a charliecloud cache for nextflow containers without `writeFake`, this can be defined via `$NXF_CHARLIECLOUD_CACHE`. This writeable storage should probably be used only for nextflow anyway?
Once again, something went sideways when I tried to sign something off. I will try to fix whatever went wrong.