backend.ai Agent's kernel registry sees stale containers when they fail at docker API

When there are configuration issues such as enabling the experimental VirtioFS in Docker Desktop for Mac which does not support mounting UNIX domain sockets into containers and thus making the socket-relay container to repeat restarting infinitely, the agent observes an immediate failure upon the Docker's container start API.

When this happens, the agent's stat collector tries to access those non-created containers and keeps saying:

There may be other errors when creating containers even after validating all the configurations in our side, such as the network storage mount is gone away for unknown hardware/network failures (sometimes happened in our on-premise customer sites). In that case the failure should be handled gracefully.

Let's properly retire such stale entries.

Jul 15 '22 02:07 achimnol

we can simply retire stale entries by this https://github.com/lablup/backend.ai/pull/617 but I doubt how it happens.

agent collects container_id by appending them into a list and pass the list to StatContext. but how can KeyError occurs? It should occur in agent before it calls collect_container_stat in StatContext. I guess we need to apply asyncio.lock somewhere.

Sep 02 '22 09:09 fregataa

reopen this issue. this error still occurs due to some of the container creation failures are not handled

Feb 02 '23 01:02 fregataa