marquez icon indicating copy to clipboard operation
marquez copied to clipboard

Partition runs table for large installations

Open collado-mike opened this issue 3 years ago • 1 comments

The runs table and lineage_events table get pretty large. Oftentimes, we only want to search for runs within a given time frame (e.g., the last week). We should consider partitioning these large tables on date so that our queries can take advantage of the time-oriented nature of many requests.

collado-mike avatar Jul 20 '22 21:07 collado-mike

A strategy as described in https://wiki.postgresql.org/wiki/Month_based_partitioning would be useful. Assuming daily partitions (rather than monthly), we could write a function to query runs only within a specific date range with the following function, which queries each partition with the name pattern runs_YYYY_MM_DD

CREATE OR REPLACE FUNCTION latest_runs(start_date text, end_date text, lim integer) RETURNS SETOF runs AS $$
DECLARE
    count int := 0;
    curr_date date := to_date(end_date, 'YYYY-MM-DD') - interval '1 day';
BEGIN
    CREATE TEMP TABLE runs_between (LIKE runs) ON COMMIT DROP;
    WHILE count < lim AND curr_date > to_date(start_date, 'YYYY-MM-DD') LOOP
        execute 'INSERT INTO runs_between '
                    || 'SELECT * FROM runs_' || to_char(curr_date, 'YYYY_MM_DD');
        curr_date = curr_date - interval '1 day';
        count=COUNT(*) FROM runs_between;
    END LOOP;
    RETURN QUERY SELECT * FROM runs_between;
    DROP TABLE runs_between;
END;
$$ LANGUAGE plpgsql;

-- query the function - it's effectively like querying a single partition as long as that partition contains the
-- number of records we want
SELECT * FROM latest_runs('2022-06-01'::text, '2022-07-13'::text, 10000);

collado-mike avatar Jul 20 '22 21:07 collado-mike