Input -> output latency introspection relations only show `persist` source dependencies, failing to capture index dependencies
What version of Materialize are you using?
v0.26.1-dev (8dbe5bd60)
How did you install Materialize?
Built from source
What is the issue?
Recently, introspection information was deemed necessary to observe the input → output latency for dataflow updates (see #13885 and #13887). The following catalog relations were added:
-
mz_worker_materialization_delays, capturing a per-worker histogram of delays in frontier advancements for a dataflow backing an index or materialized view after a frontier advancement from apersiststorage source instantiation is observed in the context of that dataflow. -
mz_worker_materialization_source_frontiers, capturing the currently known frontier, per worker, of apersiststorage source instantiation in the context of a dataflow backing an index or materialized view. -
mz_materialization_source_frontiers, aggregating the data in the view above across all workers.
Even though the new catalog relations were scoped to try to characterize the differences and delays in frontiers between dataflows backing indexes and materialized views and persist storage sources, they fail to allow for a complete characterization of the desired input → output latency for dataflow updates. This is because not only may a dataflow depend on frontier advancements from persist sources, but also the frontier advancements of other indexes that it depends on. This complete dependency structure is captured in the catalog relation mz_materialization_dependencies, but it is not reflected in the three catalog relations named above.
To observe this effect, the following commands can be executed:
materialize=> CREATE TABLE t1 (a int);
CREATE TABLE
materialize=> CREATE TABLE t2 (a int);
CREATE TABLE
[...]
materialize=> CREATE INDEX i2 ON t2 (a);
CREATE INDEX
materialize=> CREATE MATERIALIZED VIEW mv AS SELECT * FROM t1 JOIN t2 USING (a);
CREATE MATERIALIZED VIEW
materialize=> SELECT global_id, name FROM mz_materializations mats
JOIN (SELECT id AS global_id, name
FROM mz_indexes
UNION ALL
SELECT id AS global_id, name
FROM mz_materialized_views)
USING (global_id);
global_id | name
-----------+------
u4 | i2
u5 | mv
(2 rows)
materialize=> SELECT id, name FROM mz_tables WHERE id LIKE 'u%';
id | name
----+------
u1 | t1
u2 | t2
(2 rows)
materialize=> SELECT * FROM mz_materialization_dependencies;
dataflow | source | worker
----------+--------+--------
u4 | u2 | 0
u5 | u1 | 0
u5 | u4 | 0
(3 rows)
materialize=> SELECT * FROM mz_materialization_frontiers
JOIN mz_materialization_source_frontiers
USING (global_id);
global_id | time | source | time
-----------+---------------+--------+---------------
u4 | 1659697459884 | u2 | 1659697459884
u5 | 1659697459884 | u1 | 1659697459884
(2 rows)
materialize=> SELECT * FROM mz_worker_materialization_delays;
global_id | source | worker | delay_ns | count
-----------+--------+--------+------------+-------
u4 | u2 | 0 | 65536 | 385
u4 | u2 | 0 | 131072 | 281
u4 | u2 | 0 | 262144 | 16
u4 | u2 | 0 | 524288 | 1
u4 | u2 | 0 | 1048576 | 1
u4 | u2 | 0 | 2097152 | 1
u4 | u2 | 0 | 16384 | 9
u4 | u2 | 0 | 32768 | 59
u5 | u1 | 0 | 16777216 | 46
u5 | u1 | 0 | 33554432 | 26
u5 | u1 | 0 | 67108864 | 41
u5 | u1 | 0 | 134217728 | 75
u5 | u1 | 0 | 268435456 | 99
u5 | u1 | 0 | 536870912 | 86
u5 | u1 | 0 | 1073741824 | 3
u5 | u1 | 0 | 1048576 | 24
u5 | u1 | 0 | 2097152 | 163
u5 | u1 | 0 | 4194304 | 124
u5 | u1 | 0 | 8388608 | 54
(19 rows)
Note that above, u5 depends on both u1 and u4. However, since u4 is an index and not served from persist, then this dependency becomes invisible to the catalog relations exposing input → output frontier lags and delays.
This issue is thus a request that we capture these additional dependencies in the three aforementioned catalog relations.
Implementation notes: To keep symmetry with how persist sources are introspected at the COMPUTE layer, we should consider capturing via COMPUTE logging the frontier advancements of indexes imported into a dataflow, as opposed to capturing the frontiers at index exports. The latter might be achieved with an inspect_container operator, similarly to what was implemented for persist sources. We take this approach with storage sources as there may be a difference between the frontier as known at the STORAGE layer and the frontier as perceived by different dataflows at the COMPUTE layer (refer also to issue #13489).
Relevant log output
No response