[Bug]: PostgreSQL startup times out of logs can't be tailed
Module
PostgreSQL
Testcontainers version
1.17.3
Using the latest Testcontainers version?
Yes
Host OS
Linux
Host Arch
x86_64
Docker version
Docker version 20.10.16, build aa7e414
What happened?
I am trying to run our testsuite in a loop to catch a race condition. However I see that in about 1 out of 10 cases the test fails due to timing out when waiting for the PostgreSQL container to start up. The logs suggest that the line it is waiting for is actually printed as expected; checking the code I suspect that the OutputFrame does not contain the full line and therefore the expression does not match. Is this possible?
Relevant log output
2022-08-02 13:26:09,614 ERROR [🐳 [postgres:12]] (pool-146-thread-1) Could not start container: org.testcontainers.containers.ContainerLaunchException: Timed out waiting for log output matching '.*database system is ready to accept connections.*\s'
at org.testcontainers.containers.wait.strategy.LogMessageWaitStrategy.waitUntilReady(LogMessageWaitStrategy.java:49)
at org.testcontainers.containers.wait.strategy.AbstractWaitStrategy.waitUntilReady(AbstractWaitStrategy.java:51)
at org.testcontainers.containers.PostgreSQLContainer.waitUntilContainerStarted(PostgreSQLContainer.java:139)
at org.testcontainers.containers.GenericContainer.tryStart(GenericContainer.java:480)
at org.testcontainers.containers.GenericContainer.lambda$doStart$0(GenericContainer.java:340)
at org.rnorth.ducttape.unreliables.Unreliables.retryUntilSuccess(Unreliables.java:81)
at org.testcontainers.containers.GenericContainer.doStart(GenericContainer.java:338)
at org.testcontainers.containers.GenericContainer.start(GenericContainer.java:326)
at io.hyperfoil.tools.horreum.test.PostgresResource.start(PostgresResource.java:18)
at io.quarkus.test.common.TestResourceManager$TestResourceEntryRunnable.run(TestResourceManager.java:452)
at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
2022-08-02 13:26:09,674 ERROR [🐳 [postgres:12]] (pool-146-thread-1) Log output from the failed container:
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.
The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".
Data page checksums are disabled.
fixing permissions on existing directory /var/lib/postgresql/data ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default time zone ... Etc/UTC
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok
initdb: warning: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.
Success. You can now start the database server using:
pg_ctl -D /var/lib/postgresql/data -l logfile start
waiting for server to start....2022-08-02 11:25:10.351 UTC [47] LOG: starting PostgreSQL 12.8 (Debian 12.8-1.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
2022-08-02 11:25:10.351 UTC [47] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2022-08-02 11:25:10.360 UTC [48] LOG: database system was shut down at 2022-08-02 11:25:10 UTC
2022-08-02 11:25:10.367 UTC [47] LOG: database system is ready to accept connections
done
server started
CREATE DATABASE
/usr/local/bin/docker-entrypoint.sh: ignoring /docker-entrypoint-initdb.d/*
2022-08-02 11:25:10.571 UTC [47] LOG: received fast shutdown request
2022-08-02 11:25:10.571 UTC [47] LOG: aborting any active transactions
waiting for server to shut down....2022-08-02 11:25:10.572 UTC [47] LOG: background worker "logical replication launcher" (PID 54) exited with exit code 1
2022-08-02 11:25:10.572 UTC [49] LOG: shutting down
2022-08-02 11:25:10.576 UTC [47] LOG: database system is shut down
done
server stopped
PostgreSQL init process complete; ready for start up.
2022-08-02 11:25:10.683 UTC [1] LOG: starting PostgreSQL 12.8 (Debian 12.8-1.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
2022-08-02 11:25:10.683 UTC [1] LOG: listening on IPv4 address "0.0.0.0", port 5432
2022-08-02 11:25:10.684 UTC [1] LOG: listening on IPv6 address "::", port 5432
2022-08-02 11:25:10.684 UTC [1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2022-08-02 11:25:10.693 UTC [75] LOG: database system was shut down at 2022-08-02 11:25:10 UTC
2022-08-02 11:25:10.697 UTC [1] LOG: database system is ready to accept connections
Additional Information
No response
Okay, I found that this is probably not due to breaking the output incorrectly but due to docker logs -f ... failing as I had ulimit -n set to default 1024 and I ran out of file descriptors. In that case docker logs -f ... exited with an error but this was not what testcontainers were expecting.
I guess that there's two things to fix: WaitingConsumer.waitUntil should check on OutputFrame.type and if that's END the waiting should end right away. In addition to that, failure to execute the command should throw an exception.
Thanks for raising this issue and doing the investigation @rvansa. Since the log message is an indicator of readiness, we can't end the waiting and assume the container to be ready. But we should look into failing explicitly and communicating the issue to the user instead (unexpected end of log stream or something).
However, looking at your logs, we can actually see the expected log lines, why is that? And note, Testcontainers does not use docker CLI, so we don't call docker logs -f, but instead the equivalent command on the Docker API.
I have found out the problem when this started happening reproducibly every time; I saw the message error getting journald fd: Too many open files when I've placed a breakpoint into the predicate. After rebooting it happens only sporadically again.
Strange enough docker logs <container> itself always worked even though docker logs -f <container> was failing (with the same error I saw in the debugger) so it might be that the error is raised directly in docker daemon, not in the client. Since the logs are printed, is it possible that these are obtained using a different command than the one we see to fail?