
cluster smoke test: pg-snapshot-resumption/03-while-storaged-down.td:10:1: error: deadline has elapsed

Open antiguru opened this issue 3 years ago • 12 comments

What version of Materialize are you using?

1f24587750b1fff661f0c4de64774c3aeff0ec06

How did you install Materialize?

Built from source

What is the issue?

bin/mzcompose --find cluster run default fails in pg-snapshot-resumption/03-while-storaged-down.td:10:1: error: deadline has elapsed

It seems that storaged doesn't restart (in time?), and hence the read keeps failing until the deadline elapses.

Relevant log output

cluster-storaged-1      | thread 'tokio-runtime-worker' panicked at 'replication snapshot for source u3 failed: recoverable errors should crash the process', src/storage/src/source/postgres.rs:348:17
cluster-storaged-1      | stack backtrace:
cluster-storaged-1      |    0: rust_begin_unwind
cluster-storaged-1      |              at ./rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/std/src/panicking.rs:584:5
cluster-storaged-1      |    1: core::panicking::panic_fmt
cluster-storaged-1      |              at ./rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/panicking.rs:142:14
cluster-storaged-1      |    2: mz_storage::source::postgres::postgres_replication_loop_inner::{{closure}}
cluster-storaged-1      |              at ./var/lib/buildkite-agent/builds/buildkite-builders-d43b1b5-i-0415c79f141d8d3c5-1/materialize/tests/src/storage/src/source/postgres.rs:348:17
cluster-storaged-1      |    3: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
cluster-storaged-1      |              at ./rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/future/mod.rs:91:19
cluster-storaged-1      |    4: mz_storage::source::postgres::postgres_replication_loop::{{closure}}
cluster-storaged-1      |              at ./var/lib/buildkite-agent/builds/buildkite-builders-d43b1b5-i-0415c79f141d8d3c5-1/materialize/tests/src/storage/src/source/postgres.rs:307:58
cluster-storaged-1      |    5: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
cluster-storaged-1      |              at ./rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/future/mod.rs:91:19
cluster-storaged-1      |    6: tokio::runtime::task::core::CoreStage<T>::poll::{{closure}}
cluster-storaged-1      |              at ./cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/runtime/task/core.rs:165:17
cluster-storaged-1      |    7: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
cluster-storaged-1      |              at ./cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/loom/std/unsafe_cell.rs:14:9
cluster-storaged-1      |    8: tokio::runtime::task::core::CoreStage<T>::poll
cluster-storaged-1      |              at ./cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/runtime/task/core.rs:155:13
cluster-storaged-1      |    9: tokio::runtime::task::harness::poll_future::{{closure}}
cluster-storaged-1      |              at ./cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/runtime/task/harness.rs:470:19
cluster-storaged-1      |   10: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
cluster-storaged-1      |              at ./rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/panic/unwind_safe.rs:271:9
cluster-storaged-1      |   11: std::panicking::try::do_call
cluster-storaged-1      |              at ./rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/std/src/panicking.rs:492:40
cluster-storaged-1      |   12: std::panicking::try
cluster-storaged-1      |              at ./rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/std/src/panicking.rs:456:19
cluster-storaged-1      |   13: std::panic::catch_unwind
cluster-storaged-1      |              at ./rustc/4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/std/src/panic.rs:137:14
cluster-storaged-1      |   14: tokio::runtime::task::harness::poll_future
cluster-storaged-1      |              at ./cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/runtime/task/harness.rs:458:18
cluster-storaged-1      |   15: tokio::runtime::task::harness::Harness<T,S>::poll_inner
cluster-storaged-1      |              at ./cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/runtime/task/harness.rs:104:27
cluster-storaged-1      |   16: tokio::runtime::task::harness::Harness<T,S>::poll
cluster-storaged-1      |              at ./cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/runtime/task/harness.rs:57:15
cluster-storaged-1      |   17: tokio::runtime::task::raw::RawTask::poll
cluster-storaged-1      |              at ./cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/runtime/task/raw.rs:84:18
cluster-storaged-1      |   18: tokio::runtime::task::LocalNotified<S>::run

antiguru avatar Aug 31 '22 09:08 antiguru

@petrosagg Could you take a look?

aljoscha avatar Aug 31 '22 14:08 aljoscha

another instance here https://buildkite.com/materialize/tests/builds/40641#_

sploiselle avatar Aug 31 '22 17:08 sploiselle

another instance here: https://buildkite.com/materialize/tests/builds/40747#0182fa0b-bec3-49be-a5aa-67305308ceee

jkosh44 avatar Sep 01 '22 17:09 jkosh44

Another instance: #14591

aljoscha avatar Sep 02 '22 14:09 aljoscha

And another https://buildkite.com/materialize/tests/builds/40978#01831369-adb0-4a58-bac3-265094535f78

danhhz avatar Sep 06 '22 16:09 danhhz

https://buildkite.com/materialize/tests/builds/41028#01831648-d42f-4452-adef-8f90a3137b41

madelynnblue avatar Sep 07 '22 05:09 madelynnblue

I ran explain timestamp for SELECT COUNT(*) FROM t1 (during a failure I observed locally) and the since and uppers were all 0, while the query ts was now-ish. Dunno if that's helpful but probably! Would be curious to see what those numbers are for the success case.

madelynnblue avatar Sep 07 '22 05:09 madelynnblue

This is trivial to reproduce locally (at least on my machine) via:

$ bin/mzcompose --find cluster run pg-snapshot-resumption

You can then do a docker ps to find out the port that environmentd is running on and psql into it via:

$ psql -U materialize -h localhost -p <port> materialize

I can then repeatedly do a

materialize=> explain timestamp for select * from t1;
                     Timestamp
----------------------------------------------------
      timestamp: 1662641061143                     +
          since:[1662639799000]                    +
          upper:[1662639800001]                    +
      has table: false                             +
                                                   +
 source materialize.public.mz_source (u3, storage):+
  read frontier:[1662639799000]                    +
 write frontier:[1662639800001]                    +

(1 row)

to see that the query timestamp is advancing but the view/source will just not be queryable at that time. The upper of the source will not advance because the test purposefully kills the storaged process.

I'm wondering why this isn't more flaky; on my machine I have not yet managed to get a successful run.

Potential solutions:

  • run the query at a timestamp at which it can succeed. We can't fix the timestamp selection logic because it behaves as it does for strict serializability. Right, @jkosh44?
  • change the test to manually run the query at a time that works? Maybe we can use EXPLAIN TIMESTAMP FOR to get a valid timestamp and then run the query with AS OF ... (a rough sketch of this idea follows). @philip-stoev, what do you think?
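
A rough sketch of what that could look like, assuming a tokio_postgres client connected to environmentd; the port, the connection string, and the literal timestamp are placeholders for whatever docker ps and the EXPLAIN TIMESTAMP output actually report:

use tokio_postgres::{NoTls, SimpleQueryMessage};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Port 6875 is a placeholder; use whatever port `docker ps` reports for environmentd.
    let (client, connection) = tokio_postgres::connect(
        "host=localhost port=6875 user=materialize dbname=materialize",
        NoTls,
    )
    .await?;
    tokio::spawn(connection);

    // Ask the coordinator which frontiers the query's inputs currently have.
    let explain = client
        .simple_query("EXPLAIN TIMESTAMP FOR SELECT COUNT(*) FROM t1")
        .await?;
    for msg in &explain {
        if let SimpleQueryMessage::Row(row) = msg {
            println!("{}", row.get(0).unwrap_or(""));
        }
    }

    // Suppose we parsed a timestamp inside [since, upper) out of that output;
    // pinning the query to it with AS OF sidesteps the coordinator's own choice.
    let valid_ts: u64 = 1_662_639_799_000; // placeholder value
    let row = client
        .query_one(format!("SELECT COUNT(*) FROM t1 AS OF {valid_ts}").as_str(), &[])
        .await?;
    let count: i64 = row.get(0);
    println!("count = {count}");
    Ok(())
}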

aljoscha avatar Sep 08 '22 12:09 aljoscha

Yes, that sounds correct. We can't change the timestamp selection logic easily without violating strict serializability. We can, however, change the isolation level of the test to Serializable. That way we are much more likely to find a timestamp that works for t1. We may still see some flakes due to https://github.com/MaterializeInc/materialize/issues/14696, but once that is fixed the flaking will stop.

jkosh44 avatar Sep 08 '22 15:09 jkosh44

Looks like this is still happening, so I'm going to reopen this until #14696 merges.

https://buildkite.com/materialize/tests/builds/41444#01834164-8bec-4af1-9a6e-f2a639c9ffab

danhhz avatar Sep 15 '22 14:09 danhhz

After discussing this for a bit with @benesch, I think there's an additional bug going on causing this to flake; #14717 should actually have fixed the original issue completely.

t1 should be the only object in the time domain, which means its [since, upper) range is the only one we look at, so we should always be able to find a timestamp for reading t1. We should look into this a little more to see whether it's actually the same issue or something else.

jkosh44 avatar Sep 20 '22 23:09 jkosh44

Also, for some reason it seems to have increased in frequency by a lot within the past couple of days. Maybe we should disable it in the meantime.

jkosh44 avatar Sep 20 '22 23:09 jkosh44

We've disabled the test temporarily because it is causing consistent CI failures, blocking people from merging their PRs. We're doing a timeboxed investigation into the root cause now.

chaas avatar Sep 21 '22 20:09 chaas

So I used the patented @aljoscha approach for debugging this and ran the test locally, i.e.:

$ bin/mzcompose --find cluster run pg-snapshot-resumption
$ psql -U materialize -h localhost -p <port> materialize

Then when looking at the query I see the following:

materialize=> EXPLAIN TIMESTAMP FOR SELECT COUNT(*) FROM t1;
                     Timestamp                      
----------------------------------------------------
      timestamp: 1663792772872                     +
          since:[            0]                    +
          upper:[            0]                    +
      has table: false                             +
                                                   +
 source materialize.public.mz_source (u3, storage):+
  read frontier:[            0]                    +
 write frontier:[            0]                    +
 
(1 row)

The upper seems to be stuck at 0, which is why the source is un-queryable. This confirms to me that the flake is not due to #14696 and there is something else going on.
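
As a tiny schematic of why those frontiers matter (this is not Materialize code, which uses antichains rather than single u64s): a collection can answer a query at time ts only if since <= ts < upper, so with both frontiers stuck at 0 there is no timestamp the coordinator could pick that would work.

// Schematic only: a query at `ts` is answerable iff the read frontier has not
// passed it and the write frontier has moved beyond it.
fn queryable_at(since: u64, upper: u64, ts: u64) -> bool {
    since <= ts && ts < upper
}

fn main() {
    // The stuck case from the EXPLAIN output above: nothing is queryable.
    assert!(!queryable_at(0, 0, 1_663_792_772_872));
    // The healthy case from the earlier EXPLAIN output in this thread.
    assert!(queryable_at(1_662_639_799_000, 1_662_639_800_001, 1_662_639_799_500));
    println!("frontier checks hold");
}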

jkosh44 avatar Sep 21 '22 20:09 jkosh44

I think I got to the bottom of this: the problem is the timing of when the failpoint fires relative to when we do re-clocking, combined with the fact that we currently don't hold back the reclocked frontier upper while the source frontier upper is still at "zero".

There are two interesting orderings of events.

Ordering in which the test succeeds:

  • reclocking manages to mint some mapping 0 -> t, which communicates "as of time t I haven't seen any data yet from the source"
  • this makes t1 queryable because there is some non-zero upper (and since) at which it can be queried
  • we write an empty update that advances the upper of the collection
  • the failpoint fires and storaged panics
  • the select count(*) from t1 query returns 0 in 03-while-storaged-down.td because we wrote something to the collection before storaged went down
  • -> the test succeeds

Ordering in which the test times out:

  • the failpoint fires and storaged panics
  • we never write anything to the collection nor advance the upper
  • the select count(*) from t1 query in 03-while-storaged-down.td never returns because the collection is not yet queryable
  • -> the test times out

Context: the test sets a fail_point that will panic the storaged process while trying to ingest from postgres. This failpoint is "unset" before we run 04-verify-data.td.
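
For readers unfamiliar with failpoints, here is a minimal sketch of the mechanism using the fail crate; the failpoint name, the injected error text, and the panic message are illustrative, not the exact identifiers used in src/storage.

use anyhow::anyhow;

fn take_snapshot() -> Result<(), anyhow::Error> {
    // With the failpoint active, return early with an injected error instead
    // of reading the postgres snapshot.
    fail::fail_point!("pg_snapshot_failure", |_| {
        Err(anyhow!("injected snapshot failure"))
    });
    // ... normal snapshot logic would go here ...
    Ok(())
}

fn main() {
    // Requires the `fail` crate with its "failpoints" feature enabled.
    fail::cfg("pg_snapshot_failure", "return").unwrap();
    // The replication loop treats snapshot errors as fatal, producing the
    // kind of panic seen in the storaged log above.
    take_snapshot().expect("replication snapshot for source failed");
}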

Some observations:

  • we probably shouldn't use a SELECT ... query to figure out whether the failpoint fired successfully
  • we might want to have something that was called "real-time recency" in the past, which ensures that the source/table is not queryable while we have not yet caught up with / read everything there was when we started up
  • we might want to ensure that the source is not queryable while it is still reading the initial snapshot, which we could do by not minting remappings while the source frontier is at "zero" (a schematic sketch follows this list)
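
A schematic of that last idea, using plain u64 timestamps rather than Materialize's actual remap machinery:

// Only mint a binding `source_upper -> wall_clock` (and thereby let the
// reclocked upper advance) once the source has moved past the minimum frontier.
#[derive(Debug)]
struct Binding {
    source_upper: u64,
    into_time: u64,
}

fn maybe_mint_binding(source_upper: u64, wall_clock: u64) -> Option<Binding> {
    if source_upper == 0 {
        // Still at "zero": the initial snapshot has not been read, so refuse to
        // advance the reclocked upper and keep the collection unreadable rather
        // than confidently empty.
        None
    } else {
        Some(Binding { source_upper, into_time: wall_clock })
    }
}

fn main() {
    assert!(maybe_mint_binding(0, 1_662_641_061_143).is_none());
    println!("{:?}", maybe_mint_binding(42, 1_662_641_061_143));
}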

aljoscha avatar Sep 23 '22 14:09 aljoscha

Potential next steps:

  • immediate: @philip-stoev, can we change the logic that ensures the failpoint fired to something that explicitly checks the logs of storaged and continues once it sees the required failure? How much work would that be? Or do we even want to make it this complicated?
  • long term: pick up thinking about real-time recency and when collections should start to become queryable. This will require some thinking before concrete work items can fall out.

aljoscha avatar Sep 23 '22 15:09 aljoscha

@aljoscha parsing the logs was something @philip-stoev and I concluded would be fine!

guswynn avatar Sep 23 '22 15:09 guswynn

That sounds a little crazy to me! Can we figure out a more structured way to get the information you need?

benesch avatar Sep 23 '22 16:09 benesch

Related: https://github.com/MaterializeInc/materialize/issues/14914

aljoscha avatar Sep 23 '22 16:09 aljoscha

@benesch the only mitigating circumstance re. parsing the logs is that you can put a string in the failpoint and then fish for that exact string in the output. So hopefully the entire thing will be immune to any cosmetic changes in the log output.

Ideally we would want some sort of health check to detect that storaged has crashed and report it in an introspection table. However, I do not think we catch transient crashes reliably with the current health check, especially with the process orchestrator.

philip-stoev avatar Sep 23 '22 16:09 philip-stoev

@benesch the only mitigating circumstance re. parsing the logs is that you can put a string in the failpoint and then fish for that exact string in the output. So hopefully the entire thing will be immune to any cosmetic changes in the log output.

I'd love to instead add a separate tracing subscriber that emits, say, all events with a target of failpoint to a separate newline-delimited JSON file. That sounds not a whole lot harder to rig up, and a lot less likely to break in the future!
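
Something along these lines, as a sketch of that idea (assuming the tracing_subscriber crate with its "json" and "registry" features; the file path and the "failpoint" target name are made up for illustration):

use std::{fs::File, sync::Mutex};
use tracing_subscriber::{filter::Targets, fmt, prelude::*};

fn main() -> std::io::Result<()> {
    let failpoint_log = File::create("failpoints.ndjson")?;

    // A layer that writes only events whose target is "failpoint", as
    // newline-delimited JSON, to its own file.
    let failpoint_layer = fmt::layer()
        .json()
        .with_writer(Mutex::new(failpoint_log))
        .with_filter(Targets::new().with_target("failpoint", tracing::Level::INFO));

    tracing_subscriber::registry()
        .with(fmt::layer()) // the usual human-readable log output
        .with(failpoint_layer)
        .init();

    // Code around a failpoint would then emit something like:
    tracing::info!(target: "failpoint", name = "pg_snapshot_failure", "failpoint fired");
    Ok(())
}

The test could then tail failpoints.ndjson and parse JSON lines instead of grepping the main log output.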

benesch avatar Sep 28 '22 05:09 benesch

Independent of the test bug, it looks like a bug that we present postgres data as readable before the initial timestamp, and announce with confidence that it is empty. This makes it impossible to distinguish between positive statements we make about the contents of the data ("sync'd up with postgres, and your table is empty") vs passive statements we make out of ignorance ("not yet sync'd up; let's just say it is empty?").

Imo if we don't have an issue filed for this yet, we should file one. It is not exactly the same as https://github.com/MaterializeInc/materialize/issues/2862 (which asks for an AS OF modifier to CREATE SOURCE that would need to be enforced) and is closer to https://github.com/MaterializeInc/materialize/issues/11632 (which asks that we respect the transaction order / nature of the source).

frankmcsherry avatar Sep 28 '22 06:09 frankmcsherry

Imo if we don't have an issue filed for this yet, we should file one

https://github.com/MaterializeInc/materialize/issues/14914 might be it?

nmeagan11 avatar Sep 28 '22 14:09 nmeagan11