
cluster smoke test: pg-snapshot-resumption/03-while-storaged-down.td:10:1: error: deadline has elapsed

Open antiguru opened this issue 3 years ago • 12 comments

What version of Materialize are you using?

1f24587750b1fff661f0c4de64774c3aeff0ec06

How did you install Materialize?

Built from source

What is the issue?

bin/mzcompose --find cluster run default fails in pg-snapshot-resumption/03-while-storaged-down.td:10:1: error: deadline has elapsed

It seems that storaged doesn't restart (in time?), and hence the read keeps failing until the deadline elapses.

Relevant log output

cluster-storaged-1      | thread 'tokio-runtime-worker' panicked at 'replication snapshot for source u3 failed: recoverable errors should crash the process', src/storage/src/source/postgres.rs:348:17
cluster-storaged-1      | stack backtrace:
cluster-storaged-1      |    0: rust_begin_unwind
cluster-storaged-1      |              at ./rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/std/src/panicking.rs:584:5
cluster-storaged-1      |    1: core::panicking::panic_fmt
cluster-storaged-1      |              at ./rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/panicking.rs:142:14
cluster-storaged-1      |    2: mz_storage::source::postgres::postgres_replication_loop_inner::{{closure}}
cluster-storaged-1      |              at ./var/lib/buildkite-agent/builds/buildkite-builders-d43b1b5-i-0415c79f141d8d3c5-1/materialize/tests/src/storage/src/source/postgres.rs:348:17
cluster-storaged-1      |    3: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
cluster-storaged-1      |              at ./rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/future/mod.rs:91:19
cluster-storaged-1      |    4: mz_storage::source::postgres::postgres_replication_loop::{{closure}}
cluster-storaged-1      |              at ./var/lib/buildkite-agent/builds/buildkite-builders-d43b1b5-i-0415c79f141d8d3c5-1/materialize/tests/src/storage/src/source/postgres.rs:307:58
cluster-storaged-1      |    5: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
cluster-storaged-1      |              at ./rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/future/mod.rs:91:19
cluster-storaged-1      |    6: tokio::runtime::task::core::CoreStage<T>::poll::{{closure}}
cluster-storaged-1      |              at ./cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/runtime/task/core.rs:165:17
cluster-storaged-1      |    7: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
cluster-storaged-1      |              at ./cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/loom/std/unsafe_cell.rs:14:9
cluster-storaged-1      |    8: tokio::runtime::task::core::CoreStage<T>::poll
cluster-storaged-1      |              at ./cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/runtime/task/core.rs:155:13
cluster-storaged-1      |    9: tokio::runtime::task::harness::poll_future::{{closure}}
cluster-storaged-1      |              at ./cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/runtime/task/harness.rs:470:19
cluster-storaged-1      |   10: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
cluster-storaged-1      |              at ./rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/panic/unwind_safe.rs:271:9
cluster-storaged-1      |   11: std::panicking::try::do_call
cluster-storaged-1      |              at ./rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/std/src/panicking.rs:492:40
cluster-storaged-1      |   12: std::panicking::try
cluster-storaged-1      |              at ./rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/std/src/panicking.rs:456:19
cluster-storaged-1      |   13: std::panic::catch_unwind
cluster-storaged-1      |              at ./rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/std/src/panic.rs:137:14
cluster-storaged-1      |   14: tokio::runtime::task::harness::poll_future
cluster-storaged-1      |              at ./cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/runtime/task/harness.rs:458:18
cluster-storaged-1      |   15: tokio::runtime::task::harness::Harness<T,S>::poll_inner
cluster-storaged-1      |              at ./cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/runtime/task/harness.rs:104:27
cluster-storaged-1      |   16: tokio::runtime::task::harness::Harness<T,S>::poll
cluster-storaged-1      |              at ./cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/runtime/task/harness.rs:57:15
cluster-storaged-1      |   17: tokio::runtime::task::raw::RawTask::poll
cluster-storaged-1      |              at ./cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/runtime/task/raw.rs:84:18
cluster-storaged-1      |   18: tokio::runtime::task::LocalNotified<S>::run

antiguru avatar Aug 31 '22 09:08 antiguru

@petrosagg Could you take a look?

aljoscha avatar Aug 31 '22 14:08 aljoscha

another instance here https://buildkite.com/materialize/tests/builds/40641#_

sploiselle avatar Aug 31 '22 17:08 sploiselle

another instance here: https://buildkite.com/materialize/tests/builds/40747#0182fa0b-bec3-49be-a5aa-67305308ceee

jkosh44 avatar Sep 01 '22 17:09 jkosh44

Another instance: #14591

aljoscha avatar Sep 02 '22 14:09 aljoscha

And another https://buildkite.com/materialize/tests/builds/40978#01831369-adb0-4a58-bac3-265094535f78

danhhz avatar Sep 06 '22 16:09 danhhz

https://buildkite.com/materialize/tests/builds/41028#01831648-d42f-4452-adef-8f90a3137b41

madelynnblue avatar Sep 07 '22 05:09 madelynnblue

I ran explain timestamp for SELECT COUNT(*) FROM t1 (during a failure I observed locally) and the since and uppers were all 0, while the query ts was now-ish. Dunno if that's helpful but probably! Would be curious to see what those numbers are for the success case.

madelynnblue avatar Sep 07 '22 05:09 madelynnblue

This is trivial to reproduce locally (at least on my machine) via:

$ bin/mzcompose --find cluster run pg-snapshot-resumption

You can then do a docker ps to find out the port that environmentd is running on and psql into it via:

$ psql -U materialize -h localhost -p <port> materialize

I can then repeatedly do a

materialize=> explain timestamp for select * from t1;
                     Timestamp
----------------------------------------------------
      timestamp: 1662641061143                     +
          since:[1662639799000]                    +
          upper:[1662639800001]                    +
      has table: false                             +
                                                   +
 source materialize.public.mz_source (u3, storage):+
  read frontier:[1662639799000]                    +
 write frontier:[1662639800001]                    +

(1 row)

to see that the query timestamp is advancing but the view/source will just not be queryable at that time. The upper of the source will not advance because the test purposefully kills the storaged process.

I'm wondering why this isn't more flaky; on my machine I have not yet managed to get a successful run.

Potential solutions:

  • run the query at a timestamp at which it can succeed. We can't fix the timestamp selection logic because it behaves as it does for strict serializability. Right, @jkosh44?
  • change the test to manually run the query at a time that works? Maybe we can use EXPLAIN TIMESTAMP FOR to get a valid timestamp and then run the query with AS OF ... (a rough sketch of this idea follows). @philip-stoev, what do you think?
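
A rough sketch of what that could look like, assuming a tokio_postgres client connected to environmentd; the port, the connection string, and the literal timestamp are placeholders for whatever docker ps and the EXPLAIN TIMESTAMP output actually report:

use tokio_postgres::{NoTls, SimpleQueryMessage};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Port 6875 is a placeholder; use whatever port `docker ps` reports for environmentd.
    let (client, connection) = tokio_postgres::connect(
        "host=localhost port=6875 user=materialize dbname=materialize",
        NoTls,
    )
    .await?;
    tokio::spawn(connection);

    // Ask the coordinator which frontiers the query's inputs currently have.
    let explain = client
        .simple_query("EXPLAIN TIMESTAMP FOR SELECT COUNT(*) FROM t1")
        .await?;
    for msg in &explain {
        if let SimpleQueryMessage::Row(row) = msg {
            println!("{}", row.get(0).unwrap_or(""));
        }
    }

    // Suppose we parsed a timestamp inside [since, upper) out of that output;
    // pinning the query to it with AS OF sidesteps the coordinator's own choice.
    let valid_ts: u64 = 1_662_639_799_000; // placeholder value
    let row = client
        .query_one(format!("SELECT COUNT(*) FROM t1 AS OF {valid_ts}").as_str(), &[])
        .await?;
    let count: i64 = row.get(0);
    println!("count = {count}");
    Ok(())
}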

aljoscha avatar Sep 08 '22 12:09 aljoscha

Yes, that sounds correct. We can't change the timestamp selection logic easily without violating strict serializability. We can, however, change the isolation level of the test to Serializable. That way we are much more likely to find a timestamp that works for t1. We may still see some flakes due to https://github.com/MaterializeInc/materialize/issues/14696, but once that is fixed the flaking will stop.

jkosh44 avatar Sep 08 '22 15:09 jkosh44

Looks like this is still happening, so I'm going to reopen this until #14696 merges.

https://buildkite.com/materialize/tests/builds/41444#01834164-8bec-4af1-9a6e-f2a639c9ffab

danhhz avatar Sep 15 '22 14:09 danhhz

After discussing this for a bit with @benesch, I think there's an additional bug going on causing this to flake; #14717 should actually have fixed the original issue completely.

t1 should be the only object in the time domain, which means its [since, upper) range is the only one we look at, so we should always be able to find a timestamp for reading t1. We should look into this a little more to see whether it's actually the same issue or something else.

jkosh44 avatar Sep 20 '22 23:09 jkosh44

Also, for some reason it seems to have increased in frequency by a lot within the past couple of days. Maybe we should disable it in the meantime.

jkosh44 avatar Sep 20 '22 23:09 jkosh44

We've disabled the test temporarily because it is causing consistent CI failures, blocking people from merging their PRs. We're doing a timeboxed investigation into the root cause now.

chaas avatar Sep 21 '22 20:09 chaas

So I used the patented @aljoscha approach for debugging this and ran the test locally, i.e.:

$ bin/mzcompose --find cluster run pg-snapshot-resumption
$ psql -U materialize -h localhost -p <port> materialize

Then when looking at the query I see the following:

materialize=> EXPLAIN TIMESTAMP FOR SELECT COUNT(*) FROM t1;
                     Timestamp                      
----------------------------------------------------
      timestamp: 1663792772872                     +
          since:[            0]                    +
          upper:[            0]                    +
      has table: false                             +
                                                   +
 source materialize.public.mz_source (u3, storage):+
  read frontier:[            0]                    +
 write frontier:[            0]                    +
 
(1 row)

The upper seems to be stuck at 0, which is why the source is un-queryable. This confirms to me that the flake is not due to #14696 and there is something else going on.
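
As a tiny schematic of why those frontiers matter (this is not Materialize code, which uses antichains rather than single u64s): a collection can answer a query at time ts only if since <= ts < upper, so with both frontiers stuck at 0 there is no timestamp the coordinator could pick that would work.

// Schematic only: a query at `ts` is answerable iff the read frontier has not
// passed it and the write frontier has moved beyond it.
fn queryable_at(since: u64, upper: u64, ts: u64) -> bool {
    since <= ts && ts < upper
}

fn main() {
    // The stuck case from the EXPLAIN output above: nothing is queryable.
    assert!(!queryable_at(0, 0, 1_663_792_772_872));
    // The healthy case from the earlier EXPLAIN output in this thread.
    assert!(queryable_at(1_662_639_799_000, 1_662_639_800_001, 1_662_639_799_500));
    println!("frontier checks hold");
}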

jkosh44 avatar Sep 21 '22 20:09 jkosh44

I think I got to the bottom of this: the problem is the timing of when the failpoint fires relative to when we do re-clocking, combined with the fact that we currently don't hold back the reclocked frontier upper while the source frontier upper is still at "zero".

There are two interesting orderings of events.

Ordering in which the test succeeds:

  • reclocking manages to mint some mapping 0 -> t, which communicates "as of time t I haven't seen any data yet from the source"
  • this makes t1 queryable because there is some non-zero upper (and since) at which it can be queried
  • we write an empty update that advances the upper of the collection
  • the failpoint fires and storaged panics
  • the select count(*) from t1 query returns 0 in 03-while-storaged-down.td because we wrote something to the collection before storaged went down
  • -> the test succeeds

Ordering in which the test times out:

  • the failpoint fires and storaged panics
  • we never write anything to the collection nor advance the upper
  • the select count(*) from t1 query in 03-while-storaged-down.td never returns because the collection is not yet queryable
  • -> the test times out

Context: the test sets a fail_point that will panic the storaged process while trying to ingest from postgres. This failpoint is "unset" before we run 04-verify-data.td.
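
For readers unfamiliar with failpoints, here is a minimal sketch of the mechanism using the fail crate; the failpoint name, the injected error text, and the panic message are illustrative, not the exact identifiers used in src/storage.

use anyhow::anyhow;

fn take_snapshot() -> Result<(), anyhow::Error> {
    // With the failpoint active, return early with an injected error instead
    // of reading the postgres snapshot.
    fail::fail_point!("pg_snapshot_failure", |_| {
        Err(anyhow!("injected snapshot failure"))
    });
    // ... normal snapshot logic would go here ...
    Ok(())
}

fn main() {
    // Requires the `fail` crate with its "failpoints" feature enabled.
    fail::cfg("pg_snapshot_failure", "return").unwrap();
    // The replication loop treats snapshot errors as fatal, producing the
    // kind of panic seen in the storaged log above.
    take_snapshot().expect("replication snapshot for source failed");
}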

Some observations:

  • we probably shouldn't use a SELECT ... query to figure out whether the failpoint fired successfully
  • we might want to have something that was called "real-time recency" in the past, which ensures that the source/table is not queryable while we have not yet caught up with / read everything there was when we started up
  • we might want to ensure that the source is not queryable while it is still reading the initial snapshot, which we could do by not minting remappings while the source frontier is at "zero" (a schematic sketch follows this list)
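
A schematic of that last idea, using plain u64 timestamps rather than Materialize's actual remap machinery:

// Only mint a binding `source_upper -> wall_clock` (and thereby let the
// reclocked upper advance) once the source has moved past the minimum frontier.
#[derive(Debug)]
struct Binding {
    source_upper: u64,
    into_time: u64,
}

fn maybe_mint_binding(source_upper: u64, wall_clock: u64) -> Option<Binding> {
    if source_upper == 0 {
        // Still at "zero": the initial snapshot has not been read, so refuse to
        // advance the reclocked upper and keep the collection unreadable rather
        // than confidently empty.
        None
    } else {
        Some(Binding { source_upper, into_time: wall_clock })
    }
}

fn main() {
    assert!(maybe_mint_binding(0, 1_662_641_061_143).is_none());
    println!("{:?}", maybe_mint_binding(42, 1_662_641_061_143));
}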

aljoscha avatar Sep 23 '22 14:09 aljoscha

Potential next steps:

  • immediate: @philip-stoev, can we change the logic that ensures the failpoint fired to something that explicitly checks the logs of storaged and continues once it sees the required failure? How much work would that be? Or do we even want to make it this complicated?
  • long term: pick up thinking about real-time recency and when collections should start to become queryable. This will require some thinking before concrete work items can fall out.

aljoscha avatar Sep 23 '22 15:09 aljoscha

@aljoscha parsing the logs was something @philip-stoev and I concluded would be fine!

guswynn avatar Sep 23 '22 15:09 guswynn

That sounds a little crazy to me! Can we figure out a more structured way to get the information you need?

benesch avatar Sep 23 '22 16:09 benesch

Related: https://github.com/MaterializeInc/materialize/issues/14914

aljoscha avatar Sep 23 '22 16:09 aljoscha

@benesch the only mitigating circumstance re. parsing the logs is that you can put a string in the failpoint and then fish for that exact string in the output. So hopefully the entire thing will be immune to any cosmetic changes in the log output.

Ideally we would want some sort of health check to detect that storaged has crashed and report it in an introspection table. However, I do not think we catch transient crashes reliably with the current health check, especially with the process orchestrator.

philip-stoev avatar Sep 23 '22 16:09 philip-stoev

@benesch the only mitigating circumstance re. parsing the logs is that you can put a string in the failpoint and then fish for that exact string in the output. So hopefully the entire thing will be immune to any cosmetic changes in the log output.

I'd love to instead add a separate tracing subscriber that emits, say, all events with a target of failpoint to a separate newline-delimited JSON file. That sounds not a whole lot harder to rig up, and a lot less likely to break in the future!
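
Something along these lines, as a sketch of that idea (assuming the tracing_subscriber crate with its "json" and "registry" features; the file path and the "failpoint" target name are made up for illustration):

use std::{fs::File, sync::Mutex};
use tracing_subscriber::{filter::Targets, fmt, prelude::*};

fn main() -> std::io::Result<()> {
    let failpoint_log = File::create("failpoints.ndjson")?;

    // A layer that writes only events whose target is "failpoint", as
    // newline-delimited JSON, to its own file.
    let failpoint_layer = fmt::layer()
        .json()
        .with_writer(Mutex::new(failpoint_log))
        .with_filter(Targets::new().with_target("failpoint", tracing::Level::INFO));

    tracing_subscriber::registry()
        .with(fmt::layer()) // the usual human-readable log output
        .with(failpoint_layer)
        .init();

    // Code around a failpoint would then emit something like:
    tracing::info!(target: "failpoint", name = "pg_snapshot_failure", "failpoint fired");
    Ok(())
}

The test could then tail failpoints.ndjson and parse JSON lines instead of grepping the main log output.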

benesch avatar Sep 28 '22 05:09 benesch

Independent of the test bug, it looks like a bug that we present postgres data as readable before the initial timestamp, and announce with confidence that it is empty. This makes it impossible to distinguish between positive statements we make about the contents of the data ("sync'd up with postgres, and your table is empty") vs passive statements we make out of ignorance ("not yet sync'd up; let's just say it is empty?").

Imo if we don't have an issue filed for this yet, we should file one. It is not exactly the same as https://github.com/MaterializeInc/materialize/issues/2862 (which asks for an AS OF modifier to CREATE SOURCE that would need to be enforced) and is closer to https://github.com/MaterializeInc/materialize/issues/11632 (which asks that we respect the transaction order / nature of the source).

frankmcsherry avatar Sep 28 '22 06:09 frankmcsherry

Imo if we don't have an issue filed for this yet, we should file one

https://github.com/MaterializeInc/materialize/issues/14914 might be it?

nmeagan11 avatar Sep 28 '22 14:09 nmeagan11