Postgres Pagination Strategy for List Endpoints

Background

The list* endpoints in Marquez v0.1 have (or will have soon) implemented result pagination to bound the response sizes. To ensure Marquez Core's scalability, and to enable a low-latency UI, we should standardize pagination strategy which will scale with Marquez's data volume.

Assumptions

list* endpoint results should always be sorted by the same table column(s)
Clients do not require random access into the list of objects

Goals

Simple to implement within the application and the database
Performance and database load should be constant regardless of cardinality of underlying table

Strategy

Keyset Pagination

The Keyset Pagination approach is simplest method for both the Marquez application and Postgres database to implement. It requires an index on the column used to sort the data. The application should be passed two values: after and n:

n is the number of results to return
after is the value that the next n results should follow in the sort order

There are a few benefits to this approach: Btree indexes can fulfill this query easily with a range scan, regardless of the indexed column cardinality. It also can be combined with filtering in the WHERE clause, which could be desirable client-side. However, the requirement of an after value requires the client to track the last value (or first value if paging backwards) from the previous page and pass it back to the server to fetch another set of results. It also means that the client cannot arbitrarily jump to a page which is not immediately before or after the last page fetched.

screen shot 2019-02-05 at 3 53 31 pm ¹

Alternatives Considered

Limit/Offset

The initial Marquez pagination implementation utilized the LIMIT, OFFSET features in Postgres. This works for rapid testing and development with small data volume, but querying fetching large OFFSETs is increasingly expensive on the database².

screen shot 2019-02-05 at 3 48 40 pm ¹

Internal CTIDs

Postgres tracks an internal ID for each row in a table which corresponds to the physical ordering of the data in the table. For append-only data, the CTID will increase monotonically for new rows and could potentially be used for random access to pages. However, this will cause issues for tables where data is deleted since Postgres will fill the CTID gaps with new rows. At the moment, Marquez Core is actually append-only and never deletes any data, so this could potentially work. However, paginating with CTIDs cannot be combined with filtering³ which will block useful UI feature for filtering on potentially large lists.

Feb 06 '19 00:02 sshah-wework

/cc @hougs

Mar 05 '19 22:03 wslulciuc

Keyset pagination seems like a very reasonable approach. The drawback of not being able to jump to an arbitrary page will hopefully not be too painful as long as we provide a ui that filters on things people care about. I'm trying to imagine situations where people know they want to see the 7th page of results. Anytime I've done that in a ui I'm scanning results, looking for something. I don't imagine a lot would be lost by forcing people to see pages of results in order. I may be oversimplifying the issue but it also doesn't seem impossible or bad for us to eventually provide a "jump to 7th page" functionality by internally tracking ids for the previous pages and not returning result sets until we have reached the page the client wants. Though, it doesn't seem worth building initially.

Mar 13 '19 22:03 hougs

Totally agreed @hougs, I don't think paging to the Nth page of some results make sense. We'll see if it is a feature request, in which case we can rethink this approach.

Mar 21 '19 20:03 sshah-wework