container-device-interface icon indicating copy to clipboard operation
container-device-interface copied to clipboard

Add support for predefined hooks

Open deitch opened this issue 1 year ago • 17 comments

I have been working with the NVIDIA container toolkit which generates a CDI yaml file, including hooks. The hooks all call nvidia-ctk to do 3 things:

  • create symlinks (relative to the root of a container)
  • chmod (container root prepended to target paths)
  • update ldcache (run ldconfig within the chroot of the container)

These are pretty basic standard activities. It seems like it would make sense to have containerEdits within the spec capable of handling these. The spec includes container edit things like mounts, env, deviceNodes, additionalGIDs, and of course hooks. I think that creating symlinks, changing permissions on dir/file, and maybe updating ldcache (I am mixed on that one) are the kinds of capabilities that would fit as first-class citizens within containerEdits.

Thoughts?

deitch avatar May 03 '24 09:05 deitch

I’m working on CDI for Charliecloud, a lightweight, fully-unprivileged container implementation for HPC applications (hpc/charliecloud#1902). (In fact, it looks like @elezar was already helpful in some offline correspondence.) This would also be useful for us.

In fact, regardless of what the standard says, I anticipate not running nvidia-ctk hook to do these things, but rather interpreting such clauses as @deitch suggests and just do the things internally. Such a kludge seems better to me than calling external programs for these “basic standard” tasks, but I’d much rather follow documented behavior.

I would be happy to provide a draft PR for folks to look at.

reidpr avatar May 16 '24 21:05 reidpr

@reidpr we did discuss this in our latest working group meeting (notes here) and will most likely not be implementing this as part of the spec for the time being.

One of the primary concerns is that even for "basic standard" tasks the applicable interfaces may not be applicable. Take a hook that "updates the ldcache" as an example. In the case of the NVIDIA Container Toolkit -- which implements this as the nvidia-ctk hook update-ldcache command (or nvidia-cdi-hook update-ldcache with the change from https://github.com/NVIDIA/nvidia-container-toolkit/pull/474) -- we take this to mean, run ldconfig from the HOST to update /etc/ld.so.cache IN THE CONTAINER using the specified folders. Here we need to make provision for different paths to ldconfig on the host (e.g. on ubuntu the correct path is ldconfig.real since ldconfig is a shell wrapper) as well as containers that don't have an ldcache (e.g. busybox).

If we need to capture this logic in some kind of binary, this still needs to be available on the host and would complicate CDI spec consumption since we would have to distribute this binary somehow.

Our current strategy for how to handle this would be to provide better documentation and examples for what these hooks do so that their behaviour can be replicated across various environments. This does place the burden on the hook vendors to provide this documentation though and cannot necessarily be enforced by the spec.

One question I have is whether there is some middle ground where we don't implement the hooks in the spec, but more strictly prescribe their behaviour? That is to say that as a vendor, NVIDIA would still distribute the nvidia-cdi-hook binary that would include an update-ldcache subcommand, but that the pre-and post conditions for this command are defined by a public spec. As the spec owners, we could even provide a set of test cases to validate this. This should in theory allow hooks to be more interchangeable and allow users to determine whether hooks are applicable given their environments.

elezar avatar May 17 '24 13:05 elezar

One of the primary concerns is that even for "basic standard" tasks the applicable interfaces may not be applicable. Take a hook that "updates the ldcache" as an example. In the case of the NVIDIA Container Toolkit -- which implements this as the nvidia-ctk hook update-ldcache command (or nvidia-cdi-hook update-ldcache with the change from https://github.com/NVIDIA/nvidia-container-toolkit/pull/474) -- we take this to mean, run ldconfig from the HOST to update /etc/ld.so.cache IN THE CONTAINER using the specified folders.

I can see that as a concern. At the same time, I do think it is a narrowly prescribed set of activities. I originally thought 3, now there is a variant on 1, so total of 4.

Would we want to implement the truly standard ones, i.e. chmod and symlinks?

deitch avatar May 17 '24 13:05 deitch

For clarification, the chmod hook that was added to nvidia-ctk was to workaround a crun bug and should be removed from this discussion.

elezar avatar May 17 '24 14:05 elezar

Thank you @elezar.

Take a hook that "updates the ldcache" as an example ... we take this to mean, run ldconfig from the HOST to update /etc/ld.so.cache IN THE CONTAINER .... Here we need to make provision for different paths to ldconfig on the host ...

I see your point. I wasn’t in that discussion obviously, but in my view that type of logic belongs more in the container implementation rather than vendor hooks. However you’ve been thinking about this longer than me.

I have a couple follow-ups.

Can you clarify the reasoning on why NVIDIA (or HPE, or ...) should be figuring out host ldconfig stuff rather than Charliecloud (or Podman, or ...)? What am I missing?

What if the host does not have ldconfig? (E.g., Alpine is based on musl, which doesn’t.)

as well as containers that don't have an ldcache...

Good point.

Do these containers need ldconfig run at all, though?

Is it reasonably common to have containers without an ldconfig executable within the container but still needing an ldcache update?

Our current strategy for how to handle this would be to provide better documentation and examples for what these hooks do .... [I]s some middle ground where we don't implement the hooks in the spec, but more strictly prescribe their behaviour?

I could work with this.

I really like the declarative aspect of CDI. Strictly prescribed hooks would be consistent with that, though (in my view) in kind of a roundabout way.

reidpr avatar May 17 '24 20:05 reidpr

latest working group meeting (notes here)

Are these meetings open to the public? It sounds like something I’d be interested in helping with.

reidpr avatar May 17 '24 20:05 reidpr

Are these meetings open to the public? It sounds like something I’d be interested in helping with.

Yes, the meetings are public. All the required details should be in the linked document. Ideally they are recorded too, but there is sometimes some lag between when these are available on YouTube.

elezar avatar May 21 '24 08:05 elezar

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot] avatar Aug 20 '24 04:08 github-actions[bot]

I will comment to keep it open.

deitch avatar Aug 20 '24 09:08 deitch

I have created #225 as an initial attempt to capture the intent of the "predefined" hooks.

elezar avatar Aug 20 '24 13:08 elezar

Cool, thank you @elezar . What do we think about chmod? Or is that just an artifact going away anyways, as you commented above?

deitch avatar Aug 20 '24 13:08 deitch

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed. To skip these checks, apply the "lifecycle/frozen" label.

github-actions[bot] avatar Nov 19 '24 04:11 github-actions[bot]

Not stale.

deitch avatar Nov 19 '24 05:11 deitch

Cool, thank you @elezar . What do we think about chmod? Or is that just an artifact going away anyways, as you commented above?

The chmod hook was added to our tooling as a workaround for a particular runc bug. I don't think we should formalize this and rather focus on the remaining hooks.

elezar avatar Nov 21 '24 15:11 elezar

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed. To skip these checks, apply the "lifecycle/frozen" label.

github-actions[bot] avatar Feb 20 '25 04:02 github-actions[bot]

Not stale

deitch avatar Feb 20 '25 06:02 deitch

In case it helps someone else, I’ve missed for nearly a year that the two alternatives are nvidia-ctk hook and nvidia-cdi-hook, not ctk in both cases.

reidpr avatar Apr 25 '25 21:04 reidpr

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed. To skip these checks, apply the "lifecycle/frozen" label.

github-actions[bot] avatar Jul 25 '25 04:07 github-actions[bot]

This issue was automatically closed due to inactivity.

github-actions[bot] avatar Aug 25 '25 04:08 github-actions[bot]