materialize icon indicating copy to clipboard operation
materialize copied to clipboard

Input -> output latency introspection relations only show `persist` source dependencies, failing to capture index dependencies

Open vmarcos opened this issue 3 years ago • 0 comments

What version of Materialize are you using?

v0.26.1-dev (8dbe5bd60)

How did you install Materialize?

Built from source

What is the issue?

Recently, introspection information was deemed necessary to observe the input → output latency for dataflow updates (see #13885 and #13887). The following catalog relations were added:

  • mz_worker_materialization_delays, capturing a per-worker histogram of delays in frontier advancements for a dataflow backing an index or materialized view after a frontier advancement from a persist storage source instantiation is observed in the context of that dataflow.
  • mz_worker_materialization_source_frontiers, capturing the currently known frontier, per worker, of a persist storage source instantiation in the context of a dataflow backing an index or materialized view.
  • mz_materialization_source_frontiers, aggregating the data in the view above across all workers.

Even though the new catalog relations were scoped to try to characterize the differences and delays in frontiers between dataflows backing indexes and materialized views and persist storage sources, they fail to allow for a complete characterization of the desired input → output latency for dataflow updates. This is because not only may a dataflow depend on frontier advancements from persist sources, but also the frontier advancements of other indexes that it depends on. This complete dependency structure is captured in the catalog relation mz_materialization_dependencies, but it is not reflected in the three catalog relations named above.

To observe this effect, the following commands can be executed:

materialize=> CREATE TABLE t1 (a int);
CREATE TABLE
materialize=> CREATE TABLE t2 (a int);
CREATE TABLE
[...]
materialize=> CREATE INDEX i2 ON t2 (a);
CREATE INDEX
materialize=> CREATE MATERIALIZED VIEW mv AS SELECT * FROM t1 JOIN t2 USING (a);
CREATE MATERIALIZED VIEW
materialize=> SELECT global_id, name FROM mz_materializations mats
              JOIN (SELECT id AS global_id, name 
                    FROM mz_indexes 
                    UNION ALL
                    SELECT id AS global_id, name
                    FROM mz_materialized_views)
              USING (global_id);
 global_id | name 
-----------+------
 u4        | i2
 u5        | mv
(2 rows)

materialize=> SELECT id, name FROM mz_tables WHERE id LIKE 'u%';
 id | name 
----+------
 u1 | t1
 u2 | t2
(2 rows)

materialize=> SELECT * FROM mz_materialization_dependencies;
 dataflow | source | worker 
----------+--------+--------
 u4       | u2     |      0
 u5       | u1     |      0
 u5       | u4     |      0
(3 rows)

materialize=> SELECT * FROM mz_materialization_frontiers
              JOIN mz_materialization_source_frontiers
              USING (global_id);
 global_id |     time      | source |     time      
-----------+---------------+--------+---------------
 u4        | 1659697459884 | u2     | 1659697459884
 u5        | 1659697459884 | u1     | 1659697459884
(2 rows)

materialize=> SELECT * FROM mz_worker_materialization_delays;
 global_id | source | worker |  delay_ns  | count 
-----------+--------+--------+------------+-------
 u4        | u2     |      0 |      65536 |   385
 u4        | u2     |      0 |     131072 |   281
 u4        | u2     |      0 |     262144 |    16
 u4        | u2     |      0 |     524288 |     1
 u4        | u2     |      0 |    1048576 |     1
 u4        | u2     |      0 |    2097152 |     1
 u4        | u2     |      0 |      16384 |     9
 u4        | u2     |      0 |      32768 |    59
 u5        | u1     |      0 |   16777216 |    46
 u5        | u1     |      0 |   33554432 |    26
 u5        | u1     |      0 |   67108864 |    41
 u5        | u1     |      0 |  134217728 |    75
 u5        | u1     |      0 |  268435456 |    99
 u5        | u1     |      0 |  536870912 |    86
 u5        | u1     |      0 | 1073741824 |     3
 u5        | u1     |      0 |    1048576 |    24
 u5        | u1     |      0 |    2097152 |   163
 u5        | u1     |      0 |    4194304 |   124
 u5        | u1     |      0 |    8388608 |    54
(19 rows)

Note that above, u5 depends on both u1 and u4. However, since u4 is an index and not served from persist, then this dependency becomes invisible to the catalog relations exposing input → output frontier lags and delays.

This issue is thus a request that we capture these additional dependencies in the three aforementioned catalog relations.

Implementation notes: To keep symmetry with how persist sources are introspected at the COMPUTE layer, we should consider capturing via COMPUTE logging the frontier advancements of indexes imported into a dataflow, as opposed to capturing the frontiers at index exports. The latter might be achieved with an inspect_container operator, similarly to what was implemented for persist sources. We take this approach with storage sources as there may be a difference between the frontier as known at the STORAGE layer and the frontier as perceived by different dataflows at the COMPUTE layer (refer also to issue #13489).

Relevant log output

No response

vmarcos avatar Aug 05 '22 16:08 vmarcos