bubblewrap Add --overlay and related options

This commit adds --overlay, --tmp-overlay, --ro-overlay, and --overlay-src options to enable bubblewrap to create overlay mounts. These options are only permitted when bubblewrap is not installed setuid.

This is a continuation/partial rewrite of #167, addressing the feedback given there.

Other improvements of note:

Tests!
Characters treated specially by overlayfs are escaped; there are now no (known?) restrictions on the characters that can be in an overlay path.
--tmp-overlay is a new contribution for a use case I frequently have: I want to mount an overlay that is writable but I don't want to persist the writes between runs.

Jan 06 '23 09:01 rhendric

The commit message should have Resolves: #412 or similar, to link it up to the feature request that you're implementing here.

Jan 06 '23 13:01 smcv

The only outstanding thread here should be getting feedback on my reasoning for option ordering and naming. Please let me know what you think and if there's anything I can do to move this forward.

Mar 05 '23 00:03 rhendric

I ran into a race when running two bwrap processes consecutively using the same overlay, and the kernel has CONFIG_OVERLAY_FS_INDEX=y (this is the default on Arch Linux).

On my Arch Linux system, this example:

#!/usr/bin/env sh
set -eu

rm -rf $HOME/bwrap_ofs_cleanup_test
mkdir $HOME/bwrap_ofs_cleanup_test && cd $HOME/bwrap_ofs_cleanup_test
mkdir -p rw wrk

bwrap --unshare-user --unshare-pid --dev-bind / / --overlay-src "$PWD" --overlay "$PWD"/rw "$PWD"/wrk "$PWD" \
  dd if=/dev/zero of=dummy bs=1M count=10 status=none
bwrap --unshare-user --unshare-pid --dev-bind / / --overlay-src "$PWD" --overlay "$PWD"/rw "$PWD"/wrk "$PWD" \
  echo SUCCESS

Fails with this error:

bwrap: Can't make overlay mount on /newroot/home/zealcharm/bwrap_ofs_cleanup_test with options upperdir=/oldroot/home/zealcharm/bwrap_ofs_cleanup_test/rw,workdir=/oldroot/home/zealcharm/bwrap_ofs_cleanup_test/wrk,lowerdir=/oldroot/home/zealcharm/bwrap_ofs_cleanup_test: Device or resource busy

And the error on dmesg is:

overlayfs: upperdir is in-use as upperdir/workdir of another mount, mount with '-o index=off' to override exclusive upperdir protection.

Inserting a short sleep or a sync in between thebwrap fixes the race.

Cause

I have taken a look and I believe that the cause is: Since I'm using --unshare-pid (without --as-pid-1), bwrap launches a reaper process as PID 1. The monitor bwrap process quits immediately after PID 2 (dd in the example above) terminates, but the PID 1 takes a while longer to terminate (it is visible in ps aux), so the overlayfs mount still lingers a little longer.

Trying `--sync-fd`

I tried using --sync-fd (I think that's the right way to use it, apologies if not) but it doesn't seem to fix it, the race still happens:

#!/usr/bin/env bash
set -eu

rm -rf $HOME/bwrap_ofs_cleanup_test
mkdir $HOME/bwrap_ofs_cleanup_test && cd $HOME/bwrap_ofs_cleanup_test
mkdir -p rw wrk

syncfile="$(mktemp)"
exec {syncfd_for_bwrap}<> "$syncfile"
exec {syncfd_for_wait}<> "$syncfile"
rm -f "$syncfile"
flock -x "$syncfd_for_bwrap"
bwrap --unshare-user --unshare-pid --sync-fd "$syncfd_for_bwrap" --dev-bind / / --overlay-src "$PWD" --overlay "$PWD"/rw "$PWD"/wrk "$PWD" \
  dd if=/dev/zero of=dummy bs=1M count=10 status=none
exec {syncfd_for_bwrap}>&-
flock -x "$syncfd_for_wait"

bwrap --unshare-user --unshare-pid --dev-bind / / --overlay-src "$PWD" --overlay "$PWD"/rw "$PWD"/wrk "$PWD" \
  echo SUCCESS

I think the problem is the same, there's nothing that guarantees that the sync FD is closed after all mounts are torn down.

Waiting for PID 1 to terminate

I tried waiting for PID 1 to terminate, and this seems to work:

diff --git a/bubblewrap.c b/bubblewrap.c
index 5e90e5e..e53f04b 100644
--- a/bubblewrap.c
+++ b/bubblewrap.c
@@ -560,6 +560,7 @@ monitor_child (int event_fd, pid_t child_pid, int setup_finished_fd)
             {
               exitc = (int) val - 1;
               report_child_exit_status (exitc, setup_finished_fd);
+              waitpid (child_pid, NULL, 0);
               return exitc;
             }
         }

However, I'm not sure if this is a bulletproof fix, and it is not a proper fix since this will also cause bwrap to wait until all backgrounded processes finish instead of just the main process.

Workaround

For now I'm explicitly disabling the inode index to avoid running into the issue:

diff --git a/bubblewrap.c b/bubblewrap.c
index 5e90e5e..a177c75 100644
--- a/bubblewrap.c
+++ b/bubblewrap.c
@@ -1268,6 +1268,7 @@ setup_newroot (bool unshare_pid,
             if (mkdir (dest, 0755) != 0 && errno != EEXIST)
               die_with_error ("Can't mkdir %s", op->dest);
 
+            strappend (&sb, "index=off,");
             if (op->source != NULL)
               {
                 strappend (&sb, "upperdir=/oldroot");

Apr 10 '23 18:04 joanbm

it is not a proper fix since this will also cause bwrap to wait until all backgrounded processes finish instead of just the main process.

Honestly, this is what I thought bwrap already did. Is there a reason this isn't an improvement?

Maybe it could be enabled by a separate flag? I would imagine that waiting for everything in the sandbox to be well and truly finished is a feature that would be useful in other cases.

Apr 10 '23 20:04 rhendric

TBH I thought the same, but you can see that that's not the case by running bwrap --unshare-all --dev-bind / / sh -c 'sleep 50 &' then ps aux.

I guess it could be useful if you are launching something that daemonizes, but it's going to be trouble for anything that locks a resource like the overlayfs inode index.

So I also think that having flag(s) to wait or kill all processes instead of just for the main process could be useful for those scenarios.

Also, I suppose the waitpid solution above can be implemented without breaking compatibility by doing it only if just PID 1 remains after the main process exits.

Apr 10 '23 20:04 joanbm

doing it only if just PID 1 remains after the main process exits.

FWIW: if pid1 exists (no matter how) everything in the pid namespace will get a SIGKILL. I.e. there is no way pid1 has exited and deamons are still running.

Of course you only have a pid1 when you use unshare-pid.

Apr 11 '23 10:04 rusty-snake

if pid1 exists

You mean if pid 1 exits, I think (unfortunately either word makes sense here, but only one is correct).

Yes, if we are unsharing the pid namespace for the sandbox, when pid 1 within that namespace exits (that's the user-specified process if you used --as-pid1, or bubblewrap's reaper process otherwise), everything else in the sandbox's pid namespace is killed (see pid_namespaces(7)).

However, I don't know whether they are killed synchronously before pid 1 is allowed to exit, or whether pid 1 can exit while the overlayfs resource is still in use...

Maybe [waiting for pid 1 to exit] could be enabled by a separate flag?

If there is any situation in which the reaper process would not exit promptly, then that would make sense as an opt-in thing. Or if the reaper process would always exit promptly in any case, then it might as well be on by default, in all cases where the reaper process gets run. (In situations where the reaper is not run, of course we must not wait for it.)

If this is likely to be a practical problem, it might make sense to have an --overlay-index [on|off] option to overrule the kernel default?

Or, you might find that your use-case is better served by creating a long-running sandbox containing an IPC server, and then poking commands into it, analogous to docker run somecontainer sleep infinity followed by one or more docker exec to poke commands into the container.

Apr 11 '23 19:04 smcv

If there is any situation in which the reaper process would not exit promptly, then that would make sense as an opt-in thing. Or if the reaper process would always exit promptly in any case, then it might as well be on by default, in all cases where the reaper process gets run. (In situations where the reaper is not run, of course we must not wait for it.)

If I understand the relevant terms correctly, we know that the reaper process doesn't exit until all children have terminated (the bwrap --unshare-all --dev-bind / / sh -c 'sleep 50 &' example demonstrates this). So having an opt-in flag for waiting would be beneficial, and it sounds like it would resolve the issue at hand without being coupled to the rest of the --overlay work. I propose tracking that as a separate issue/PR.

If this is likely to be a practical problem, it might make sense to have an --overlay-index [on|off] option to overrule the kernel default?

It might indeed make sense to have this sort of option (as well as for the other options overlayfs exposes—there are a few of them), but I'm reluctant to design and implement them without an actual need. Turning off the index for this issue is more of a workaround for not having --wait-for-pid-1 functionality (or whatever it should be called), at least as I understand it. A user would have similar problems if background processes in the sandbox held any other sort of host resource past the lifetime of the parent sandbox command—I don't think we'd want to tell such users to disable any features their sandboxed programs are using that lock resources, as a long-term solution.

Apr 12 '23 05:04 rhendric

Or, you might find that your use-case is better served by creating a long-running sandbox containing an IPC server, and then poking commands into it

I found the problem while trying to create a sandbox wrapper over an existing command line tool (makepkg, Arch Linux's package builder), with the idea that it works as a drop-in replacement for the original tool. Then, I have some scripts that run makepkg a few times consecutively that I switched to the wrapper, so that's how I ran into the issue. And sure, creating and tearing down the sandbox every time is suboptimal, but it's simple and works fast enough.

Perhaps my specific use case is a bit obscure, but I think the idea of being able to use overlays in drop-in tool wrappers that are used from scripts is something that ought to work.

So having an opt-in flag for waiting would be beneficial, [...]. I propose tracking that as a separate issue/PR.

I agree that creating a new issue/PR is the right approach, ultimately the cause of the problem I ran into is not related to this PR, but rather a more general issue with the exit/cleanup code and locked resources, and the conditions to run into it are pretty specific.

If this is likely to be a practical problem, it might make sense to have an --overlay-index [on|off] option to overrule the kernel default?

It might indeed make sense to have this sort of option (as well as for the other options overlayfs exposes—there are a few of them), but I'm reluctant to design and implement them without an actual need. Turning off the index for this issue is more of a workaround for not having --wait-for-pid-1 functionality (or whatever it should be called), at least as I understand it.

Agreed, I think bwrap should strive for simplicity and turning off the overlayfs index is just a workaround. If someone needs to tune their mounts for some specific use case, this can be done like in https://github.com/containers/bubblewrap/issues/412#issuecomment-812237642.

Apr 28 '23 03:04 joanbm

A few nits but this looks great!

Just from a simplicity perspective I think it would be easier to follow what's going on if the ops were added directly and then processed in setup_newroot by, e.g. prending them to a list and then walk the list.

The other big thing is how a user like flatpak should test for availability of this feature. How is this done with other features?

Sep 22 '23 18:09 swick

@smcv, in https://github.com/containers/bubblewrap/pull/596#issuecomment-1731706374 you said ‘solving its remaining issues’. As far as I'm aware the only outstanding issue here is your decision on whether the design choices we discussed in https://github.com/containers/bubblewrap/pull/547#discussion_r1063392904 are acceptable. Is there anything else for anyone to do? Let me know and I'll be happy to work on it.

Sep 22 '23 19:09 rhendric

FWIW, the order as implemented right now makes a lot of sense when you build the command line for it from a program like flatpak.

Sep 23 '23 22:09 swick

I have a flatpak branch now which is basically doing --overlay-src $path-to-runtime-src --overlay-src $path-to-flatpak-helper-monitor --tmp-overlay /etc and any subsequent attempt to mount something onto existing directories in /etc fails with a permission denied error because the directories are owned by nobody.

Letting bwrap chown the directory doesn't work and fails with EINVAL.

For me --tmp-overlay is not useful. What works here is to either add the file directly in a new upper layer:

--overlay-src $path-to-runtime-src --overlay-src $masking-dir --overlay-src $path-to-flatpak-helper-monitor --tmp-overlay /etc --file 14 /etc/existing-dir/file

or just bind mounting it without the tmp overlay like this:

--overlay-src $path-to-runtime-src --overlay-src $masking-dir --overlay-src $path-to-flatpak-helper-monitor --ro-overlay /etc --bind-data 14 /etc/existing-dir/file

In both cases $masking-dir contains a the file /etc/existing-dir/file with workable uid:gid.

Unfortunately the masking dir layer is on the oldroot. What I want, really is a way to add a layer from the newroot:

--overlay-src $path-to-runtime-src --overlay-src-newroot /run/etc-overlay --overlay-src $path-to-flatpak-helper-monitor --ro-overlay /etc --file 14 /run/etc-overlay/existing-dir/file

This also works for bind mounting:

--overlay-src $path-to-runtime-src --overlay-src-newroot /run/etc-overlay --overlay-src $path-to-flatpak-helper-monitor --ro-overlay /etc --dir /run/etc-overlay/mount-point --bind $some-mount /etc/mount-point

Long story short, I think an option to create an overlayfs layer from a directory from the newroot would be really helpful.

Sep 24 '23 00:09 swick

Per the overlayfs documentation, ‘Changes to the underlying filesystems while part of a mounted overlay filesystem are not allowed.’ Isn't that what you'd be doing with that proposal—mounting an overlay and then writing file to one of the underlying filesystems?

Sep 24 '23 00:09 rhendric

in https://github.com/containers/bubblewrap/pull/596#issuecomment-1731706374 you said ‘solving its remaining issues’. As far as I'm aware the only outstanding issue here is your decision on whether the design choices we discussed in https://github.com/containers/bubblewrap/pull/547#discussion_r1063392904 are acceptable. Is there anything else for anyone to do?

You might be right - sorry, I'm trying to juggle too many projects and I have difficulty keeping track of the status of PRs in each of them. In theory I have co-maintainers, and if they show up and do a review I'm happy to defer to them, but if they don't, then we're limited by my fallible memory.

I'll try to get back to this and do another review pass soon.

Sep 25 '23 11:09 smcv

@rhendric True, one would have to reorder the command line options to avoid that and possibly mount over the masking layer directory.

Sep 28 '23 13:09 swick

The path escaping tests fail for me.

Sep 28 '23 16:09 swick

Added support for adding layers inside the sandbox. The tests show an example of how this could be used and how I intend to use it with flatpak. A few other changes are included as well: the two commits before the existing commit are cleanups that have separate pull requests. The ones after could be squashed down.

https://github.com/swick/bubblewrap/tree/wip/overlayfs

Sep 28 '23 17:09 swick

Re the in-sandbox layers, even if not intending to use them in a way the documentation says is unsafe, I'm pretty reluctant to bolt that on given that it's straightforward to build that layer outside of the sandbox, and it's a footgun if the layer isn't adequately secured against whatever is running in the sandbox. (What if modifying this layer triggers a kernel bug that allows escaping?) This is an intentionally small, high-trust executable and we should be cautious about adding convenience features.

The root issue is basically the same as a problem I see with non-overlay --bind mounts, which is that they also preserve their source permission bits and consequently it is often difficult to mount or edit directories inside them. I don't know why --overlay should merit a special option for dealing with this when --bind doesn't.

It isn't up to me what gets merged in, of course, but I don't endorse this addition.

Re the path escaping tests, can I get, uh, any useful information? A relevant excerpt from build/meson-logs/testlog.txt? Your kernel version?

Sep 28 '23 18:09 rhendric

Building the layer outside of the sandbox is really against the way that flatpak works. When flatpak execs bwrap, it wants all resources to be cleaned up by the kernel when the process goes away. This is only possible if the layer is build in the sandbox.

I also don't understand why you think this option is special. There are serveral ways to create files and directories inside the sandbox already without having to create them outside and then binding them into it.

Regarding security: The sandbox itself doesn't see the layer because a tmpfs is mounted over it. All the options in bwrap can be used to create a sandbox that isn't secure. The kernel lets an unprivileged user mount an overlayfs where any layer could be modified at any time so it must not trigger any kernel bugs when you do so. It might only affect the correctness, not the security.

Sep 28 '23 21:09 swick

re escaping, logs:

+ /var/home/swick/Projects/bubblewrap/build/bwrap --bind / / --tmpfs /tmp --overlay-src 'lower :,\' --overlay 'upper :,\' 'work :,\' /tmp/x sh -c 'cat /tmp/x/a; printf 2 > /tmp/x/a; cat /tmp/x/a'
bwrap: Can't make overlay mount on /newroot/tmp/x with options upperdir=/oldroot/var/tmp/tap-test.HEVhAh/upper \:\,\\,workdir=/oldroot/var/tmp/tap-test.HEVhAh/work \:\,\\,lowerdir=/oldroot/var/tmp/tap-test.HEVhAh/lower \:\,\\: No such file or directory

[ 6487.508801] overlayfs: failed to resolve '/oldroot/var/tmp/tap-test.HEVhAh/upper :': -2

Linux toolbox 6.5.5-300.fc39.x86_64 #1 SMP PREEMPT_DYNAMIC Sat Sep 23 22:53:02 UTC 2023 x86_64 GNU/Linux

Sep 28 '23 21:09 swick

Thanks, I can reproduce the escaping failure now, and I've traced it to a kernel commit between the 6.4 and 6.5 series. I'm investigating workarounds.

I also don't understand why you think this option is special. There are serveral ways to create files and directories inside the sandbox already

Of course, but I don't think there are any ways to bind anything from inside the sandbox to another location in the sandbox. That's novel. If there were already a --bind-inside option that did something similar then perhaps I'd be inclined to figure out how to make this work rather than declare it out of (my) scope for this PR that has been sitting here since January.

The sandbox itself doesn't see the layer because a tmpfs is mounted over it.

Are you proposing that the --overlay-src-inside option does this automatically? Because if not, of course a program in the sandbox could see that layer.

All the options in bwrap can be used to create a sandbox that isn't secure.

Sure, but this would be an option that exposes an opportunity to do a ‘don't do this’ operation to the untrusted process. That's insecure by default, not just potentially insecure if used incorrectly.

It might only affect the correctness, not the security.

When we're talking about kernel space, where the overlayfs driver lives regardless of what unprivileged stuff allows the driver to be engaged, I don't think there's a hard bright line between the two. In theory, nothing should go wrong security-wise, but in practice, kernel bugs exist and can be compounded by letting untrusted processes do arbitrary sketchy things. But even less severe ‘correctness’ problems like allowing the untrusted process to write to a directory mounted with --ro-overlay can be considered a security issue.

I would be a lot more welcoming to a feature that enabled all of bubblewrap's mounts—--bind, --tmpfs, --overlay, etc.—to be mounted to parts of the filesystem that aren't chmodded for writes, which seems to be the initial problem you engineered this --overlay-src-inside solution for. If that feature happens to use overlayfs under the hood, cool—but it should be easy to isolate the metadata layer from the sandboxed process, the same way I isolated the top layer used by --tmp-overlay. That API strikes me as a much better approach than allowing overlays from arbitrary locations inside the sandbox. (I still might not want it rolled up into this PR, though, because I've been waiting patiently for a maintainer to give this the thumbs-up all year, and while I can't do anything to gift the maintainers more time, I sure can keep this PR from inflating in scope and making it more likely I have to wait even longer.)

Sep 29 '23 00:09 rhendric

Alright, let's move the --overlay-src-inside/masking feature to a new issue (created #601).

There are still two commits which I would like you to pull in: c2e7302f63d057fa1d70b12719a11bbe7cd93d21 and 80ec9ca2572f16dc1b0dc1df002742a0008b4dc7. They simplify the way the --overlay-src operations are added and processed.

When the path escaping stuff is fixed and the two commits are pulled the MR could get merge IMO.

Sep 29 '23 11:09 swick

A few comments about the latest changes:

If the kernel changed the escaping this is a kernel regression and should be fixed in the kernel.
Using the version number to determine behavior of the kernel is a no-go. Patches can get backported, vendor kernels pull in random patches, etc. This is not reliable and you must not do it.
If you need a way to choose between the old and new mount api it's probably best to use the new api if it is available and has the features we need.
If we want to support the new mount api we should do it properly: only one syscall actually requires privileges and we can pass fd's to/from the privileged part for this operation and do the rest in the unprivileged parts.

Oct 05 '23 09:10 swick

Are you involved in kernel development? Maybe you'd get taken more seriously than me. I've exchanged a few e-mails with Amir Goldstein, who seems reluctant to fix the regression:

ovl used to do it [handle commas] with the old mount API: 91c77947133f ("ovl: allow filenames with comma")

But it does not do that after the conversion to the new mount API, where generic_parse_monolithic() is used for splitting the parameters. I prefer not to have to override ->parse_monolithic() for this behavior if it works as desired for you when using new libmount.

Libmount seems to have even more characters that can't be safely passed than the new behavior of the mount syscall, but the maintainers of libmount so far haven't responded to my e-mails. (Libmount also does Linux version detection to figure out whether to do various things, so if you want to go all ‘must not do it’ in that direction be my guest.)

The new API doesn't work identically on 6.4 and 6.5 kernels, or else I would have just used it. I am not pleased with this state of affairs either but I think this is the best I can do barring just having bwrap die if a comma is detected. Even if someone convinces the kernel people to fix the regression, there are still all the released 6.5 kernels out there to think of.

As far as passing file descriptors around, that's a bigger job than I wanted to undertake right now, but I agree it's probably the way forward—it's just above my skill level to do confidently, I think.

Oct 05 '23 10:10 rhendric

I am not pleased with this state of affairs either but I think this is the best I can do barring just having bwrap die if a comma is detected

Honestly, if the rules have changed between kernels, we should just reject strings that need escaping (and therefore would potentially have different meanings under different kernels). If we do that, we have the option of allowing them later (after the new semantics are widespread), and it would not be an incompatible change to do so.

Oct 05 '23 10:10 smcv

That sounds good to me. There are a few ways we could get consistent escaping and we should leave a comment in the code explaining in detail why we currently don't support it and what has to happen for us to support it.

I'll try to convince the overlayfs maintainers that this is a regression and should be fixed but if it fails we at least have a plan.

Oct 05 '23 10:10 swick

Honestly, if the rules have changed between kernels, we should just reject strings that need escaping (and therefore would potentially have different meanings under different kernels). If we do that, we have the option of allowing them later (after the new semantics are widespread), and it would not be an incompatible change to do so.

Fair enough. The overlayfs driver handles : (because it's the layer separator) and \ (escape character) specially, and that hasn't changed across kernel versions (to date); if we escape those then , is the only character we need to reject, if we continue to use mount(2). Is that good, or do you want to reject : and \ as well and do no escaping whatsoever?

Oct 06 '23 02:10 rhendric

if we escape those then , is the only character we need to reject

Seems fine.

I don't think it's worth it to argue that the kernel should fix the regression then. If someone else runs into it and makes them fix it then good for us and we can lift the restriction but I don't think I'll spend time on arguing here.

Oct 06 '23 09:10 swick

I'd prefer to reject : and \ until/unless we have clarity on whether the rules are stable.

Oct 06 '23 11:10 smcv

Add --overlay and related options

Cause

Trying --sync-fd

Waiting for PID 1 to terminate

Workaround

Trying `--sync-fd`