[cells] PID not found errors when stopping running executables
Attempting to stop a running executable seems to have the following behavior on my (x86_64 Ubuntu 24.04) system:

- `sh -c <executable>` process is created and has its PID tracked in the `Executables` cache.
- Child `<executable>` process is not tracked.
- Executable stop command is issued, Auraed errors with PID not found.
- Parent shell process is killed/missing; child executable process remains as a zombie.
$ ps aux | grep "tail -f"
root 4095196 0.0 0.0 8320 1792 ? S 09:34 0:00 tail -f /dev/null
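The orphaning described above can be reproduced outside Auraed with a short sketch (hypothetical code, not Auraed's; the `& wait` forces the shell to fork a separate `tail` child, since some shells exec a lone command directly):

```rust
use std::process::Command;
use std::{thread, time::Duration};

// Hypothetical repro of the reported behavior: kill only the tracked
// `sh -c` parent and check whether the untracked `tail` child survives.
fn orphan_survives() -> bool {
    let mut shell = Command::new("sh")
        .args(["-c", "tail -f /dev/null & wait"])
        .spawn()
        .expect("spawn shell");

    thread::sleep(Duration::from_millis(200));

    // Kill only the shell PID, mirroring what the stop path does when
    // it only knows the parent's PID...
    shell.kill().expect("kill shell");
    shell.wait().expect("reap shell");
    thread::sleep(Duration::from_millis(200));

    // ...then check whether the `tail` child is still running,
    // reparented to init/systemd.
    let survived = Command::new("pgrep")
        .args(["-f", "tail -f /dev/null"])
        .output()
        .map(|o| !o.stdout.is_empty())
        .unwrap_or(false);

    // Clean up the orphan so it does not linger after the demo.
    let _ = Command::new("pkill")
        .args(["-f", "tail -f /dev/null"])
        .status();

    survived
}

fn main() {
    println!("tail survived its parent: {}", orphan_survives());
}
```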
Testing

Rust tests, configuring a new remote client for nested auraed:
#[test_helpers_macros::shared_runtime_test]
async fn cells_start_stop_delete() {
    skip_if_not_root!("cells_start_stop_delete");
    skip_if_seccomp!("cells_start_stop_delete");

    let client = common::auraed_client().await;

    // Allocate a cell
    let cell_name = retry!(
        client
            .allocate(
                common::cells::CellServiceAllocateRequestBuilder::new().build()
            )
            .await
    )
    .unwrap()
    .into_inner()
    .cell_name;

    // Start the executable
    let req = common::cells::CellServiceStartRequestBuilder::new()
        .cell_name(cell_name.clone())
        .executable_name("aurae-exe".to_string())
        .build();
    let _ = retry!(client.start(req.clone()).await).unwrap().into_inner();

    // Stop the executable
    let _ = retry!(
        client
            .stop(proto::cells::CellServiceStopRequest {
                cell_name: Some(cell_name.clone()),
                executable_name: "aurae-exe".to_string(),
            })
            .await
    )
    .unwrap();

    // Delete the cell
    let _ = retry!(
        client
            .free(proto::cells::CellServiceFreeRequest {
                cell_name: cell_name.clone()
            })
            .await
    )
    .unwrap();
}
sudo -E cargo test -p auraed --test cell_start_stop_delete -- --include-ignored
[...snip...]
2024-11-07T01:30:08.068934Z INFO start: auraed::cells::cell_service::cell_service: CellService: start() executable=ValidatedExecutable { name: ExecutableName("aurae-exe"), command: "sleep 400", description: "description" } request=ValidatedCellServiceStartRequest { cell_name: None, executable: ValidatedExecutable { name: ExecutableName("aurae-exe"), command: "sleep 400", description: "description" }, uid: None, gid: None }
2024-11-07T01:30:08.069353Z INFO start: auraed::observe::observe_service: Registering channel for pid 1668303 Stdout request=ValidatedCellServiceStartRequest { cell_name: None, executable: ValidatedExecutable { name: ExecutableName("aurae-exe"), command: "sleep 400", description: "description" }, uid: None, gid: None }
2024-11-07T01:30:08.069445Z INFO start: auraed::observe::observe_service: Registering channel for pid 1668303 Stderr request=ValidatedCellServiceStartRequest { cell_name: None, executable: ValidatedExecutable { name: ExecutableName("aurae-exe"), command: "sleep 400", description: "description" }, uid: None, gid: None }
2024-11-07T01:30:08.103119Z INFO stop: auraed::cells::cell_service::cell_service: CellService: stop() executable_name=ExecutableName("aurae-exe") request=ValidatedCellServiceStopRequest { cell_name: None, executable_name: ExecutableName("aurae-exe") }
2024-11-07T01:30:08.103377Z ERROR stop: auraed::cells::cell_service::error: executable 'aurae-exe' failed to stop: No child processes (os error 10) request=ValidatedCellServiceStopRequest { cell_name: None, executable_name: ExecutableName("aurae-exe") }
thread 'cells_start_stop_delete' panicked at auraed/tests/cell_list_must_list_allocated_cells_recursively.rs:172:6:
called `Result::unwrap()` on an `Err` value: Status { code: Internal, message: "executable 'aurae-exe' failed to stop: No child processes (os error 10)", metadata: MetadataMap { headers: {"content-type": "application/grpc", "content-length": "0", "date": "Thu, 07 Nov 2024 01:30:08 GMT"} }, source: None }
Manually with aer and cloud-hypervisor
Install cloud-hypervisor and build guest image/kernel
sudo make /opt/aurae/cloud-hypervisor/cloud-hypervisor
sudo make build-guest-kernel
sudo make prepare-image
Run cloud-hypervisor with the auraed pid1 image
sudo cloud-hypervisor --kernel /var/lib/aurae/vm/kernel/vmlinux.bin \
--disk path=/var/lib/aurae/vm/image/disk.raw \
--cmdline "console=hvc0 root=/dev/vda1 rw" \
--cpus boot=4 \
--memory size=4096M \
--net "tap=tap0,mac=aa:ae:00:00:00:01,id=eth0"
Retrieve zone ID from tap0 (13 in my case):
ip link show tap0
13: tap0: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN mode DEFAULT group default qlen 1000
link/ether 06:66:42:a8:3f:e1 brd ff:ff:ff:ff:ff:ff
Configure aurae client config in ~/.aurae/config:
[system]
socket = "[fe80::2%13]:8080"
Verify cells run:
aer cell allocate sleeper
aer cell start --executable-command "sleep 9000" sleeper sleep-forever
aer cell list
aer cell stop sleeper sleep-forever
aer cell free sleeper
i'm going to see if i can create a cell service level test for this use case.
i have a test that should be testing this scenario, and it works. note though that it's not using nested cells, so i suspect this is where the issue is.
#535 is the current draft PR.
next step is to change it to use nested cells instead and see if it starts failing :)
well well well.
2024-11-13T11:02:09.173289Z ERROR start_in_cell: auraed::cells::cell_service::error: cgroup 'ae-test-aab9ac5e-a042-4f05-a7e1-6a0f1ecf70ec' exists on host, but is not controlled by auraed cell_name=CellName("ae-test-aab9ac5e-a042-4f05-a7e1-6a0f1ecf70ec") request=CellServiceStartRequest { cell_name: None, executable: Some(Executable { name: "ae-exec-f94a9213-518d-40b6-8b66-71a1a67d0f03", command: "tail -f /dev/null", description: "" }), uid: None, gid: None }
11:02:09 [ERROR] failed to start in cell: status: FailedPrecondition, message: "cgroup 'ae-test-aab9ac5e-a042-4f05-a7e1-6a0f1ecf70ec' exists on host, but is not controlled by auraed", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Wed, 13 Nov 2024 11:02:09 GMT", "content-length": "0"} }
error: test failed, to rerun pass `-p auraed --lib`
something very odd is going on with the cell cache. i confirmed that we're inserting into the cache on allocate, but when we try to get the cell back out it isn't there, even though the cgroup exists.
out of time for debugging for now but i'll keep hacking on this later.
confirmed the cell name is a key in the cache at the moment we call self.cache.get, but this call is returning None.
leaving this here as a note to myself:
allocated ae-test-start-stop-in-cell
getting ae-test-start-stop-in-cell from cache
get cell ae-test-start-stop-in-cell
cgroup ae-test-start-stop-in-cell exists
cache size: 1
CellName("ae-test-start-stop-in-cell")
MATCH
2024-11-13T13:22:37.869166Z ERROR start_in_cell: auraed::cells::cell_service::cells::cells: get cell ae-test-start-stop-in-cell: cell not in cache cell_name=CellName("ae-test-start-stop-in-cell") request=CellServiceStartRequest { cell_name: None, executable: Some(Executable { name: "ae-exec-start-stop-in-cell", command: "tail -f /dev/null", description: "" }), uid: None, gid: None }
2024-11-13T13:22:37.869241Z ERROR start_in_cell: auraed::cells::cell_service::error: cgroup 'ae-test-start-stop-in-cell' exists on host, but is not controlled by auraed cell_name=CellName("ae-test-start-stop-in-cell") request=CellServiceStartRequest { cell_name: None, executable: Some(Executable { name: "ae-exec-start-stop-in-cell", command: "tail -f /dev/null", description: "" }), uid: None, gid: None }
i think the issue is somewhere between how we "start in cell" and how we "proxy if needed". i'm debating stripping out a lot of the complexity here as i'm not sure it's necessary.
ok, all of this is a red herring caused by things running in parallel. the actual bug is in the executables cache: when we stop, we're returning an error in a non-error case (by the look of it). looks like maybe a bad merge. i'm on it now :)
i've narrowed this down to Executable::kill. i thought maybe it was because we are calling kill (which waits) and then wait to get the exit status, but replacing the kill with start_kill (which doesn't wait) doesn't resolve the issue.
the error "No child processes (os error 10)" that i'm seeing is coming from the child exiting. however, i can't see why this error is being reported.
I also see this error in CellService::stop on aarch64 and x86_64 using Ubuntu 22.04 or 24.04.
When I run @dmah42's test, it hangs on stop and I see the situation @mccormickt describes: the sh process is gone and the tail process has been taken over by systemd. The test blocks, and when I kill the tail process from another terminal, it continues and the test succeeds.
It seems we get an ECHILD (os error 10) on wait [or on kill, because it includes a wait]. So I used start_kill and ignored the ECHILD, since there is not much we can do, and other projects, like ruwasi, seem to ignore ECHILD errors as well. But that still leaves the problem with the zombie process.

I think not propagating a kill signal to a child process is expected behavior for a shell, so I decided to use a process group to kill all the processes together; it seemed like no big deal to implement, but I used process-wrap for convenience.

Finally, free failed with an ESRCH (os error 3) because it tries to send a kill to a PID that does not exist. (This may be a problem because I did not update the cache correctly.) Ignoring that, the test turns green and there are no zombies left.
It feels a bit like looking for a workaround for something that should have been there before; do you remember when it started failing, or a mechanism that cleaned up leftover processes in the past?
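For illustration, the process-group approach described above can be sketched with just the standard library (using `CommandExt::process_group` and the external `kill` binary instead of the process-wrap crate; this is a hypothetical sketch, not Auraed's implementation):

```rust
use std::os::unix::process::CommandExt;
use std::process::Command;
use std::{thread, time::Duration};

// Spawn the shell in its own process group and kill the whole group,
// so the `tail` grandchild dies together with the shell.
fn kill_whole_group() -> bool {
    let mut shell = Command::new("sh")
        .args(["-c", "tail -f /dev/null & wait"])
        // pgid 0 means "use the child's own PID as the group ID".
        .process_group(0)
        .spawn()
        .expect("spawn shell");
    let pgid = shell.id();

    thread::sleep(Duration::from_millis(200));

    // A negative target signals every process in the group.
    Command::new("kill")
        .args(["-s", "KILL", "--", &format!("-{pgid}")])
        .status()
        .expect("send SIGKILL to group");

    shell.wait().expect("reap shell");
    thread::sleep(Duration::from_millis(200));

    // No surviving `tail` means no orphans/zombies left behind.
    Command::new("pgrep")
        .args(["-f", "tail -f /dev/null"])
        .output()
        .map(|o| o.stdout.is_empty())
        .unwrap_or(false)
}

fn main() {
    println!("group kill left no orphans: {}", kill_whole_group());
}
```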
it was working until we introduced the "do in cell" mechanic, I think, and that's where I think the problem is. we are either not propagating calls correctly or, more likely, we are putting things in two caches, one per cell service (assuming I understand that code, which is not a given), and one is in the wrong cache.
my plan which I haven't had time for was to strip out the "do in cell" and see if it works again, then work out why we needed it.
I think it was to put auraed in the cell like a sidecar/pid 1 analog and then push calls down from the host to the cell. but if auraed is a controller at the host level, I don't think it's necessary to delegate like that.
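The two-caches hypothesis above can be illustrated with a toy model (the `CellService` here is a stand-in, not the real Auraed type):

```rust
use std::collections::HashMap;

// Toy model: the host-level service and the nested ("do in cell")
// service each own a private cache, so an entry inserted into one
// is invisible to the other.
struct CellService {
    cache: HashMap<String, String>,
}

impl CellService {
    fn new() -> Self {
        CellService { cache: HashMap::new() }
    }

    fn allocate(&mut self, name: &str) {
        // In the real daemon, the cgroup would also be created here.
        self.cache.insert(name.to_string(), "allocated".to_string());
    }

    fn has_cell(&self, name: &str) -> bool {
        self.cache.contains_key(name)
    }
}

fn main() {
    let mut host = CellService::new();
    let nested = CellService::new();

    // allocate() lands in the host-level cache...
    host.allocate("ae-test-start-stop-in-cell");

    // ...but if start() is proxied to the nested service, its lookup
    // misses even though the cgroup exists on the host: exactly the
    // "exists on host, but is not controlled by auraed" symptom.
    println!("host sees cell:   {}", host.has_cell("ae-test-start-stop-in-cell"));
    println!("nested sees cell: {}", nested.has_cell("ae-test-start-stop-in-cell"));
}
```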
I wasted some time building an old version of aurae from before do_in_cell was introduced. Although it finally compiled, I found that there was no aer back then, and I had to update buf and so many dependencies that it reminded me of the "Ship of Theseus". However, I was not able to run cells with it 🙈.
In conclusion, I cannot recommend this approach 😅.
oh no. I'm sorry if I led you wrong. I'm sure we could run cells then and that there was aer. but maybe I'm misremembering how early do_in_cell was introduced.
so if it's not that, maybe it's some of the pid1 stuff that came more recently. or maybe it's just always been broken and we somehow never noticed, but that seems unlikely.
my best guess at this point though is the caching (it's always caching). we track the executables we're running in each cell, and i suspect somewhere we're putting the process in the wrong layer of cache.