marquez
marquez copied to clipboard
Partition runs table for large installations
The runs table and lineage_events table get pretty large. Oftentimes, we only want to search for runs within a given time frame (e.g., the last week). We should consider partitioning these large tables on date so that our queries can take advantage of the time-oriented nature of many requests.
A strategy as described in https://wiki.postgresql.org/wiki/Month_based_partitioning would be useful. Assuming daily partitions (rather than monthly), we could write a function to query runs only within a specific date range with the following function, which queries each partition with the name pattern runs_YYYY_MM_DD
CREATE OR REPLACE FUNCTION latest_runs(start_date text, end_date text, lim integer) RETURNS SETOF runs AS $$
DECLARE
count int := 0;
curr_date date := to_date(end_date, 'YYYY-MM-DD') - interval '1 day';
BEGIN
CREATE TEMP TABLE runs_between (LIKE runs) ON COMMIT DROP;
WHILE count < lim AND curr_date > to_date(start_date, 'YYYY-MM-DD') LOOP
execute 'INSERT INTO runs_between '
|| 'SELECT * FROM runs_' || to_char(curr_date, 'YYYY_MM_DD');
curr_date = curr_date - interval '1 day';
count=COUNT(*) FROM runs_between;
END LOOP;
RETURN QUERY SELECT * FROM runs_between;
DROP TABLE runs_between;
END;
$$ LANGUAGE plpgsql;
-- query the function - it's effectively like querying a single partition as long as that partition contains the
-- number of records we want
SELECT * FROM latest_runs('2022-06-01'::text, '2022-07-13'::text, 10000);