
Move temporary charliecloud containers outside of process directory

Open nschan opened this issue 1 year ago • 2 comments

This addresses #5192. Instead of placing the process container inside the process work directory, it is created in a folder that can be defined via charliecloud.containerStore, or by default in work/charliecloud_store. This solves the problem of find having to traverse the entire container tree inside the work directory.

There are some issues that remain:

  • I do not know if nextflow clean should remove work/charliecloud_store, but I would assume it does not.

  • The whole approach of creating process-specific containers comes with considerable storage overhead: each process gets a dedicated container, which can easily mean hundreds to thousands of new containers during a single run. There are some work-arounds. On modern systems, ch-run's --write-fake can be used, which skips process-specific container creation and instead runs containers directly in the container storage; this is probably the best way. On somewhat modern systems, useSquash can be enabled to create .squashfs containers, which produces only a single file per container instead of a whole directory tree. It would also be nice to clean up the temporary (process) container after the process finishes, but I am not sure if or how this can be implemented via the charliecloud driver.
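For illustration, the behaviour described above might be configured like this (a sketch only; containerStore is the option proposed in this PR, useSquash follows the naming used here, and the path is hypothetical):

```groovy
// nextflow.config -- sketch based on the options discussed in this PR
charliecloud {
    enabled = true
    // proposed here: where per-process containers are created
    // (falls back to work/charliecloud_store when unset)
    containerStore = '/scratch/charliecloud_store'  // hypothetical path
    // optional mitigation for the per-process storage overhead:
    // useSquash = true  // one .squashfs file per container instead of a directory tree
}
```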

nschan avatar Aug 07 '24 14:08 nschan

Deploy Preview for nextflow-docs-staging ready!

Latest commit: 416a496f672492293bccedcf6bb045c39a41377a
Latest deploy log: https://app.netlify.com/sites/nextflow-docs-staging/deploys/66e2b38c8e584e0007ba164f
Deploy Preview: https://deploy-preview-5212--nextflow-docs-staging.netlify.app

netlify[bot] avatar Aug 07 '24 14:08 netlify[bot]

Any comments or thoughts on this PR?

nschan avatar Aug 28 '24 08:08 nschan

Any chance someone could review this and let me know if anything should be changed or done differently before this can be merged? @pditommaso @bentsherman

nschan avatar Sep 04 '24 14:09 nschan

it would be nice to have a review @l-modolo or @phue (provided they are still interested/using charliecloud)

pditommaso avatar Sep 06 '24 14:09 pditommaso

Thanks @phue, I think the review largely confirms what I outlined in the initial comment: creating one container per task (as was initially suggested here) is not a useful route for handling containers, especially if squashfs is not an option. I would like to discuss what a reasonable work-around could be. The underlying problem is that ch-run refuses to run images stored in $CH_IMAGE_STORAGE with -w (writeable), which is in turn required to use -b (bind-mount) to bind a path into the container root. Some solutions I could come up with are:

  • Use --write-fake, run from $CH_IMAGE_STORAGE, and offer no other options. This removes support for old kernels, and I would no longer be able to use charliecloud myself, so I am obviously not in favour of it. I think it should be the default way of running images, but other ways need to remain available for older systems.
  • Work around the charliecloud cache: only use $NXF_CHARLIECLOUD_CACHEDIR or charliecloud.cacheDir for storage, and either make sure those are not identical to $CH_IMAGE_STORAGE or unset $CH_IMAGE_STORAGE before running images. Unsetting $CH_IMAGE_STORAGE probably goes against charliecloud's idea of maintaining a cache of unmodified images.
  • Continue using the charliecloud cache ($CH_IMAGE_STORAGE) and provide a second cache ($NXF_CHARLIECLOUD_CACHEDIR or charliecloud.cacheDir) that is only available for that pipeline run. I do not really see an advantage: images would still need to be converted via ch-convert to move from $CH_IMAGE_STORAGE to the run cache, which is probably not much faster than simply re-downloading the image into the secondary cache. It also seems unnecessarily complex to maintain a shadow cache for each pipeline run.
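To illustrate the underlying -w conflict (a sketch only; the image name and paths are hypothetical, and no exact error output is implied):

```console
$ ch-image pull ubuntu:22.04            # lands in $CH_IMAGE_STORAGE
$ ch-run -w -b "$PWD/work:/work" ubuntu:22.04 -- true
# fails: ch-run refuses -w for images run by name from $CH_IMAGE_STORAGE
$ ch-run --write-fake -b "$PWD/work:/work" ubuntu:22.04 -- true
# works on recent kernels: writes go to an overlayfs layer, the stored image stays pristine
```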

I understand the idea of maintaining the purity of containers in the $CH_IMAGE_STORAGE cache, but for the way nextflow uses containers, a cache that can be run with -w is required whenever --write-fake is not an option. I am inclined to think that simply dropping $CH_IMAGE_STORAGE from the container pulling procedure, relying on $NXF_CHARLIECLOUD_CACHEDIR or charliecloud.cacheDir, and falling back to $workDir/charliecloud is the best procedure. Maybe it would be worth checking whether $NXF_CHARLIECLOUD_CACHEDIR is identical to $CH_IMAGE_STORAGE and throwing an error to avoid problems with -w?

Do you have any thoughts or comments on this @phue ?

nschan avatar Sep 10 '24 09:09 nschan

I have updated this PR to follow what I have outlined above:

I am inclined to think that simply dropping $CH_IMAGE_STORAGE from the container pulling procedure, relying on $NXF_CHARLIECLOUD_CACHEDIR or charliecloud.cacheDir, and falling back to $workDir/charliecloud is the best procedure. Maybe it would be worth checking whether $NXF_CHARLIECLOUD_CACHEDIR is identical to $CH_IMAGE_STORAGE and throwing an error to avoid problems with -w

This greatly simplifies the container procedure for charliecloud. However, nextflow no longer makes proper use of charliecloud's internal storage capabilities. I personally do not consider this a major drawback; it is still possible to use CH_IMAGE_STORAGE for everything that is not nextflow.

The commit messages that end abruptly should include $CH_IMAGE_STORAGE, but I used double quotes, so the variable got expanded to nothing; sorry about that.

nschan avatar Sep 10 '24 13:09 nschan

Just to recap why -w was needed in the first place: Before charliecloud gained support for overlayfs (implemented as --write-fake), we had to ensure the task work directory exists within the container to make it work with nextflow. This was done by using ch-run --write and mkdir. This wasn't optimal in the first place, because it goes against the idea of immutable containers and completely lacks provenance. Now overlayfs solves that issue and imo should be the default, but unfortunately its implementation in charliecloud requires a rather recent kernel.

The next best thing to do imo would be something along these lines:

  • leave it completely to charliecloud to handle its cache (the cache implementation anyway seems to change every couple of releases)
  • let it store "vanilla" images there, and for each nextflow run create a "working copy" of that container using ch-convert, storing it under work/charliecloud.
  • to the latter, add the mountpoint for work/ and cross fingers that doing this once per container is enough, rather than repeatedly for each task, i.e. that bind-mounting work/ will also recursively include all of its subdirectories. Maybe give that a try.
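The flow proposed above might look roughly like this (a sketch only; image name, paths, and the task command are hypothetical):

```console
$ ch-image pull ubuntu:22.04                        # vanilla image, charliecloud-managed cache
$ ch-convert ubuntu:22.04 work/charliecloud/ubuntu  # per-run writable working copy
$ ch-run -w work/charliecloud/ubuntu -- mkdir -p /work   # add the work/ mountpoint once
$ ch-run -b "$PWD/work:/work" work/charliecloud/ubuntu -- <task command>
```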

phue avatar Sep 10 '24 14:09 phue

Just to recap why -w was needed in the first place: Before charliecloud gained support for overlayfs (implemented as --write-fake), we had to ensure the task work directory exists within the container to make it work with nextflow. This was done by using ch-run --write and mkdir. This wasn't optimal in the first place, because it goes against the idea of immutable containers and completely lacks provenance. Now overlayfs solves that issue and imo should be the default, but unfortunately its implementation in charliecloud requires a rather recent kernel.

I agree that --write-fake should be used by default, but I will also be honest that a big motivation for me is having an implementation that works for me, and overlayfs is not an option with my infrastructure.

leave it completely to charliecloud to handle its cache (the cache implementation anyway seems to change every couple of releases)

This was how the cache was used up until my latest commit. Even with this commit, charliecloud will still handle its own cache, but nextflow will use a different one ;)

let it store "vanilla" images there, and for each nextflow run create a "working copy" of that container using ch-convert, storing it under work/charliecloud. To the latter, add the mountpoint for work/ and cross fingers that doing this once per container is enough, rather than repeatedly for each task, i.e. that bind-mounting work/ will also recursively include all of its subdirectories. Maybe give that a try.

I contemplated this before going with the current approach. The main problem is that it would require additional parts in the CharliecloudCache, basically replicating the behaviour of the imageDownloader to have an imageCloner or something. Since ch-convert is not significantly faster than ch-image pull (personal observation), the difference between downloading images somewhere other than CH_IMAGE_STORAGE, versus downloading them into CH_IMAGE_STORAGE and then converting them from there, seems rather minor, but comes with extra machinery that may cause issues in the future. If there is some advantage that I am missing, I am happy to learn more.

Conversion of the container into a single "working copy" cannot easily be done inside the charliecloudBuilder, because builders may run in parallel, and I assume it would create a massive mess if several tasks tried to do the same ch-convert. Of course checks could be added to avoid this, but I think it would be kinda messy.

I should add that the current approach may lead to issue #3964 coming up again.

nschan avatar Sep 10 '24 14:09 nschan

I don't quite understand this. Isn't ch-image pull -s <cacheDir> (as done in CharliecloudCache) exactly the same as doing CH_IMAGE_STORAGE=<cacheDir> ch-image pull ...? And if so, I believe you'll run into https://github.com/nextflow-io/nextflow/issues/4463 again.

The main problem is that it would require additional parts in the CharliecloudCache, basically replicating the behaviour of the imageDownloader to have an imageCloner or something. Since ch-convert is not significantly faster than ch-image pull (personal observation), the difference between downloading images somewhere other than CH_IMAGE_STORAGE, versus downloading them into CH_IMAGE_STORAGE and then converting them from there, seems rather minor, but comes with extra machinery that may cause issues in the future. If there is some advantage that I am missing, I am happy to learn more.

The main difference is that this will populate a new local cache each time you start a fresh pipeline run. It may not be faster but certainly cheaper because using the global cache at least gives you some deduplication.

Conversion of the container into a single "working copy" cannot easily be done inside the charliecloudBuilder, because those may run parallel and I assume it would create a massive mess if several tasks try to do the same ch-convert. Of course checks could be added to avoid this, but I think it would be kinda messy.

Then extend CharliecloudCache, it already implements a mutex to avoid concurrent pulls.

phue avatar Sep 11 '24 09:09 phue

I don't quite understand this. Isn't ch-image pull -s <cacheDir> (as done in CharliecloudCache) exactly the same as doing CH_IMAGE_STORAGE=<cacheDir> ch-image pull ...?

Yes, it is. One difference is that ch-run will refuse to run with -w if the container is run by name from CH_IMAGE_STORAGE. I see the issue regarding #4463 and will think about how to handle this.

Then extend CharliecloudCache, it already implements a mutex to avoid concurrent pulls.

This mutex never worked properly and caused problems. See here https://github.com/nextflow-io/nextflow/issues/3566

The main difference is that this will populate a new local cache each time you start a fresh pipeline run. It may not be faster but certainly cheaper because using the global cache at least gives you some deduplication.

You could use the same cache for different pipeline runs via NXF_CHARLIECLOUD_CACHEDIR when --write-fake is not an option.

nschan avatar Sep 11 '24 09:09 nschan

So, here is the latest set of changes:

  • charliecloud.writeFake is true by default (!)
  • if charliecloud.writeFake is true and $CH_IMAGE_STORAGE is set, images will go into $CH_IMAGE_STORAGE and be run from there with --write-fake (run by name in that case).
  • if charliecloud.writeFake is true but $CH_IMAGE_STORAGE is unset, images will go into charliecloud.cacheDir, $NXF_CHARLIECLOUD_CACHEDIR or work/charliecloud (in order of priority) and will be run by path (since run-by-name only works if $CH_IMAGE_STORAGE is set)
  • if charliecloud.writeFake is set to false, $CH_IMAGE_STORAGE will not be used to store images. Images will go into charliecloud.cacheDir, $NXF_CHARLIECLOUD_CACHEDIR or work/charliecloud (in order of priority). If the folder used to store images is the same as $CH_IMAGE_STORAGE, an exception is thrown. Images will be run by path.
  • If the image is supposed to be run with -w (i.e. readOnlyInputs = false and writeFake = false) and $CH_IMAGE_STORAGE is set, an exception is thrown. This is mainly to ensure that users realize that this can be a problem; I think it would be fine not to throw here?
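The look-up order and the collision check described above could be sketched in shell as follows (a hypothetical helper for illustration only, not the actual Groovy implementation; function names are made up):

```shell
# Resolve the image storage directory in the priority order described above:
# charliecloud.cacheDir > $NXF_CHARLIECLOUD_CACHEDIR > work/charliecloud
resolve_cache_dir() {
    # $1: value of charliecloud.cacheDir from the config (may be empty)
    if [ -n "$1" ]; then
        echo "$1"
    elif [ -n "$NXF_CHARLIECLOUD_CACHEDIR" ]; then
        echo "$NXF_CHARLIECLOUD_CACHEDIR"
    else
        echo "work/charliecloud"
    fi
}

# When writeFake is false, the resolved directory must never collide
# with charliecloud's own storage ($CH_IMAGE_STORAGE)
check_not_image_storage() {
    if [ -n "$CH_IMAGE_STORAGE" ] && [ "$1" = "$CH_IMAGE_STORAGE" ]; then
        echo "cache dir must differ from CH_IMAGE_STORAGE when writeFake is false" >&2
        return 1
    fi
    return 0
}
```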

In summary:

  • Give preference to $CH_IMAGE_STORAGE in the default case (writeFake = true).
  • Make some effort to avoid using $CH_IMAGE_STORAGE when writeFake is false.
  • Do not use convert, since I still see no major advantage over pulling: as far as I understand, charliecloud will maintain its own cache defined by $CH_IMAGE_STORAGE even if the image goes somewhere else. nextflow will use $CH_IMAGE_STORAGE if it is set and usable (writeFake = true), and will never use $CH_IMAGE_STORAGE when -w is used, to avoid messing with the charliecloud cache.

nschan avatar Sep 11 '24 12:09 nschan

I would also like to add something that may not be obvious (at least to me) regarding this

The main difference is that this will populate a new local cache each time you start a fresh pipeline run. It may not be faster but certainly cheaper because using the global cache at least gives you some deduplication.

The charliecloud build cache appears to be located in $CH_IMAGE_STORAGE, but it would still be used even if the image is pulled somewhere else (https://hpc.github.io/charliecloud/ch-image.html#build-cache). I understand this to mean that setting $CH_IMAGE_STORAGE is generally a good idea. I will therefore modify this:

If the image is supposed to be run with -w (i.e. readOnlyInputs = false and writeFake = false) and $CH_IMAGE_STORAGE is set, this will throw an exception.

to test whether the image storage directory is the same as $CH_IMAGE_STORAGE, and only throw an exception in that case.

Edit: I think I misunderstood, and the build cache lives in the directory that is passed via -s. See also https://github.com/hpc/charliecloud/discussions/1922

nschan avatar Sep 11 '24 12:09 nschan

To explain my reluctance to add a conversion step, besides the fact that it would be quite some work: it would only add support for cases where someone cannot use writeFake and needs to use $CH_IMAGE_STORAGE when running nextflow. If the goal is to have a charliecloud cache for nextflow containers without writeFake, it can be defined via NXF_CHARLIECLOUD_CACHEDIR. This writeable storage should probably be used only for nextflow anyway?

nschan avatar Sep 11 '24 17:09 nschan

Once again, something went sideways when I tried to sign off a commit. I will try to solve whatever went wrong.

nschan avatar Sep 12 '24 09:09 nschan