cluster smoke test: pg-snapshot-resumption/03-while-storaged-down.td:10:1: error: deadline has elapsed
What version of Materialize are you using?
1f24587750b1fff661f0c4de64774c3aeff0ec06
How did you install Materialize?
Built from source
What is the issue?
bin/mzcompose --find cluster run default fails in pg-snapshot-resumption/03-while-storaged-down.td:10:1: error: deadline has elapsed
It seems that storaged doesn't restart (in time?) and hence the read fails indefinitely.
Relevant log output
cluster-storaged-1 | thread 'tokio-runtime-worker' panicked at 'replication snapshot for source u3 failed: recoverable errors should crash the process', src/storage/src/source/postgres.rs:348:17
cluster-storaged-1 | stack backtrace:
cluster-storaged-1 | 0: rust_begin_unwind
cluster-storaged-1 | at ./rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/std/src/panicking.rs:584:5
cluster-storaged-1 | 1: core::panicking::panic_fmt
cluster-storaged-1 | at ./rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/panicking.rs:142:14
cluster-storaged-1 | 2: mz_storage::source::postgres::postgres_replication_loop_inner::{{closure}}
cluster-storaged-1 | at ./var/lib/buildkite-agent/builds/buildkite-builders-d43b1b5-i-0415c79f141d8d3c5-1/materialize/tests/src/storage/src/source/postgres.rs:348:17
cluster-storaged-1 | 3: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
cluster-storaged-1 | at ./rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/future/mod.rs:91:19
cluster-storaged-1 | 4: mz_storage::source::postgres::postgres_replication_loop::{{closure}}
cluster-storaged-1 | at ./var/lib/buildkite-agent/builds/buildkite-builders-d43b1b5-i-0415c79f141d8d3c5-1/materialize/tests/src/storage/src/source/postgres.rs:307:58
cluster-storaged-1 | 5: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
cluster-storaged-1 | at ./rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/future/mod.rs:91:19
cluster-storaged-1 | 6: tokio::runtime::task::core::CoreStage<T>::poll::{{closure}}
cluster-storaged-1 | at ./cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/runtime/task/core.rs:165:17
cluster-storaged-1 | 7: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
cluster-storaged-1 | at ./cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/loom/std/unsafe_cell.rs:14:9
cluster-storaged-1 | 8: tokio::runtime::task::core::CoreStage<T>::poll
cluster-storaged-1 | at ./cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/runtime/task/core.rs:155:13
cluster-storaged-1 | 9: tokio::runtime::task::harness::poll_future::{{closure}}
cluster-storaged-1 | at ./cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/runtime/task/harness.rs:470:19
cluster-storaged-1 | 10: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
cluster-storaged-1 | at ./rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/panic/unwind_safe.rs:271:9
cluster-storaged-1 | 11: std::panicking::try::do_call
cluster-storaged-1 | at ./rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/std/src/panicking.rs:492:40
cluster-storaged-1 | 12: std::panicking::try
cluster-storaged-1 | at ./rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/std/src/panicking.rs:456:19
cluster-storaged-1 | 13: std::panic::catch_unwind
cluster-storaged-1 | at ./rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/std/src/panic.rs:137:14
cluster-storaged-1 | 14: tokio::runtime::task::harness::poll_future
cluster-storaged-1 | at ./cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/runtime/task/harness.rs:458:18
cluster-storaged-1 | 15: tokio::runtime::task::harness::Harness<T,S>::poll_inner
cluster-storaged-1 | at ./cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/runtime/task/harness.rs:104:27
cluster-storaged-1 | 16: tokio::runtime::task::harness::Harness<T,S>::poll
cluster-storaged-1 | at ./cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/runtime/task/harness.rs:57:15
cluster-storaged-1 | 17: tokio::runtime::task::raw::RawTask::poll
cluster-storaged-1 | at ./cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/runtime/task/raw.rs:84:18
cluster-storaged-1 | 18: tokio::runtime::task::LocalNotified<S>::run
@petrosagg Could you take a look?
another instance here https://buildkite.com/materialize/tests/builds/40641#_
another instance here: https://buildkite.com/materialize/tests/builds/40747#0182fa0b-bec3-49be-a5aa-67305308ceee
Another instance: #14591
And another https://buildkite.com/materialize/tests/builds/40978#01831369-adb0-4a58-bac3-265094535f78
https://buildkite.com/materialize/tests/builds/41028#01831648-d42f-4452-adef-8f90a3137b41
I ran EXPLAIN TIMESTAMP FOR SELECT COUNT(*) FROM t1 (during a failure I observed locally) and the since and uppers were all 0, while the query timestamp was now-ish. Not sure if that's helpful, but it probably is! Would be curious to see what those numbers are for the success case.
This is trivial to reproduce locally (at least on my machine) via:
$ bin/mzcompose --find cluster run pg-snapshot-resumption
You can then do a docker ps to find out the port that environmentd is running on and psql into it via
$ psql -U materialize -h localhost -p <port> materialize
I can then repeatedly do a
materialize=> explain timestamp for select * from t1;
Timestamp
----------------------------------------------------
timestamp: 1662641061143 +
since:[1662639799000] +
upper:[1662639800001] +
has table: false +
+
source materialize.public.mz_source (u3, storage):+
read frontier:[1662639799000] +
write frontier:[1662639800001] +
(1 row)
to see that the query timestamp is advancing but the view/source will just not be queryable at that time. The upper of the source will not advance because the test purposefully kills the storaged process.
I'm wondering why this isn't more flaky; on my machine I have not yet managed to get a successful run.
Potential solutions:
- run the query at a timestamp at which it can succeed. We can't fix the timestamp selection logic because it behaves as it does for strict serializability. Right, @jkosh44?
- change the test to manually run the query at a time that works? Maybe we can use EXPLAIN TIMESTAMP FOR to get a valid timestamp and then run the query with AS OF .... @philip-stoev, what do you think? (A rough sketch of this is below.)
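For illustration, a rough sketch of that pattern at the psql prompt, reusing the frontiers from the EXPLAIN TIMESTAMP output above (the concrete timestamp would have to be extracted at test time, and it needs to be at or beyond the reported since and strictly below the upper for the read to return immediately):

materialize=> EXPLAIN TIMESTAMP FOR SELECT count(*) FROM t1;
materialize=> SELECT count(*) FROM t1 AS OF 1662639800000;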
Yes that sounds correct. We can't change the timestamp selection logic easily without violating strict serializability. We can however change the isolation level of the test to Serializable. That way we are way more likely to find a timestamp that works for t1. We may still see some flakes due to https://github.com/MaterializeInc/materialize/issues/14696, but once that is fixed then it will stop flaking.
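A minimal sketch of what that would look like in a psql session (the actual change would live in the test's setup, which isn't shown here); with serializable isolation the query is free to pick an older timestamp within [since, upper) instead of waiting for the write frontier to catch up to a strictly serializable timestamp:

materialize=> SET transaction_isolation = 'serializable';
SET
materialize=> SELECT count(*) FROM t1;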
Looks like this is still happening, so I'm going to reopen this until #14696 merges.
https://buildkite.com/materialize/tests/builds/41444#01834164-8bec-4af1-9a6e-f2a639c9ffab
After discussing this for a bit with @benesch, I think that there's an additional bug going on causing this to flake, and #14717 should have actually completely fixed this.
t1 should be the only object in the time domain, which means it is the only [since, upper) range that we look at, which means that we should always be able to find a timestamp for reading t1. We should look into this a little more to see if it's actually the same issue or something else.
Also for some reason, it seems to have increased in frequency by a lot within the past couple of days. Maybe we should disable it in the meantime.
We've disabled the test temporarily because it is causing consistent CI failures, blocking people from merging their PRs. We're doing a timeboxed investigation into the root cause now.
So I used the patented @aljoscha approach for debugging this and ran the test locally, i.e.:
$ bin/mzcompose --find cluster run pg-snapshot-resumption
$ psql -U materialize -h localhost -p <port> materialize
Then when looking at the query I see the following:
materialize=> EXPLAIN TIMESTAMP FOR SELECT COUNT(*) FROM t1;
Timestamp
----------------------------------------------------
timestamp: 1663792772872 +
since:[ 0] +
upper:[ 0] +
has table: false +
+
source materialize.public.mz_source (u3, storage):+
read frontier:[ 0] +
write frontier:[ 0] +
(1 row)
The upper seems to be stuck at 0, which is why the source is un-queryable. This confirms to me that the flake is not due to #14696 and there is something else going on.
I think I got to the bottom of this: the problem is in the timing of when the failpoint fires and when we do re-clocking, combined with how we currently don't hold back the reclocked frontier upper while the source frontier upper is still at "zero".
There are two interesting orderings of events:
Ordering 1:
- reclocking manages to mint some mapping 0 -> t, which communicates "as of time t I haven't seen any data yet from the source"
- this makes t1 queryable because there is some non-zero upper (and since) at which it can be queried
- we write an empty update to the collection that advances the upper
- the failpoint fires and storaged panics
- the select count(*) from t1 query returns 0 in 03-while-storaged-down.td because we wrote something to the collection before storaged went down
- -> the test succeeds

Ordering 2:
- the failpoint fires and storaged panics
- we never write anything to the collection nor advance the upper
- the select count(*) from t1 query in 03-while-storaged-down.td never returns because the collection is not yet queryable
- -> the test times out
Context: the test sets a fail_point that will panic the storaged process while trying to ingest from postgres. This failpoint is "unset" before we run 04-verify-data.td.
Some observations:
- we probably shouldn't use a SELECT ... query to figure out whether the failpoint fired successfully
- we might want to have something that was called "real-time recency" in the past, which ensures that the source/table is not queryable while we have not yet caught up with / read everything there was when we started up
- we might want to ensure that the source is not queryable while it is still reading the initial snapshot, which we could do by not minting remappings while the source frontier is at "zero"
Potential next steps:
- immediate: @philip-stoev, can we change the logic that ensures the failpoint fired to something that explicitly checks the logs of storaged and continues once it sees the required failure? How much work would that be? Or do we even want to make it this complicated?
- long term: pick up thinking about real-time recency and when collections should start to become queryable. This will require some thinking before concrete work items can fall out.
@aljoscha parsing the logs was something @philip-stoev and I concluded would be fine!
That sounds a little crazy to me! Can we figure out a more structured way to get the information you need?
Related: https://github.com/MaterializeInc/materialize/issues/14914
@benesch the only mitigating circumstance re. parsing the logs is that you can put a string in the failpoint and then fish for that exact string in the output. So hopefully the entire thing will be immune to any cosmetic changes in the log output.
Ideally we would want some sort of health check to detect that storaged has crashed and report it in an introspection table. However, I do not think we catch transient crashes reliably with the current health check, especially with the process orchestrator.
> @benesch the only mitigating circumstance re. parsing the logs is that you can put a string in the failpoint and then fish for that exact string in the output. So hopefully the entire thing will be immune to any cosmetic changes in the log output.
I'd love to instead add a separate tracing subscriber that emits, say, all events with a target of failpoint to a separate newline-delimited JSON file. That sounds not a whole lot harder to rig up, and a lot less likely to break in the future!
Independent of the test bug, it looks like a bug that we present postgres data as readable before the initial timestamp, and announce with confidence that it is empty. This makes it impossible to distinguish between positive statements we make about the contents of the data ("sync'd up with postgres, and your table is empty") vs passive statements we make out of ignorance ("not yet sync'd up; let's just say it is empty?").
Imo if we don't have an issue filed for this yet, we should file one. It is not exactly the same as https://github.com/MaterializeInc/materialize/issues/2862 (which asks for an AS OF modifier to CREATE SOURCE that would need to be enforced) and closer to https://github.com/MaterializeInc/materialize/issues/11632 (that we need to respect the transaction order / nature of the source).
> Imo if we don't have an issue filed for this yet, we should file one
https://github.com/MaterializeInc/materialize/issues/14914 might be it?