Druid Catalog basics
The Druid catalog provides a collection of metadata "hints" about tables (datasources, input sources, views, etc.) within Druid. This PR provides the foundation: the DB and REST layer, but not yet the integration with the Calcite SQL layer. This is a much-refined version of the earlier catalog PR.
The DB layer extends what is done for other Druid metadata tables. The semantic ("business logic") layer provides the usual CRUD operations on tables. The entire design is pretty standard and follows Druid patterns. The key difference is the rather extreme lengths the implementation takes to ensure each bit is easily testable without mocks. That means many interfaces, each of which can be implemented in multiple ways.
Parts available in this PR include:
- The metadata DB storage layer (in an extension)
- The basic "catalog object model" that describes the properties and columns of catalog tables.
- A basic set of tables: two kinds of datasources (detail and rollup) and three kinds of external tables (inline, local, and HTTP).
- A REST API layer to perform CRUD operations on tables.
- Unit tests
- An integration test of the catalog REST API.
The catalog mechanism is split into two parts.
- The "core" part which describes catalog objects, and which is can model data from a variety of catalog systems.
- The `druid-catalog` extension, which stores data in the Druid metadata database.
This split exists for two reasons:
- Many metadata systems exist: HMS, Amazon Glue, various commercial solutions, etc. We anticipate that some shops may wish to obtain metadata from these other systems, in the same way that some shops get their security information from external systems.
- Druid's database schema evolution system is rather basic. (And, by "basic", we mean "nonexistent.") There is some chance that the remaining development will change the schema, which upgrades cannot support. Users who enable the `druid-catalog` extension now acknowledge that they are using it only for testing, not production, and at their own risk.
Functionality not in this PR, but which will appear in the next one, includes:
- The synchronization mechanism between the Coordinator and Broker.
- SQL table functions to make use of catalog entries.
- Integration of catalog properties to simplify MSQ ingest statements (`INSERT` and `REPLACE`).
- Integration of catalog schema information with `SELECT` queries.
- The remaining set of external table types.
- Views.
This is a great opportunity for reviewers to provide guidance on the basic catalog mechanism before we start building SQL integration on top. Please see the Druid catalog issue for additional details about the goals and design of the catalog.
This PR has:
- [X] been self-reviewed.
- [X] a design document here.
- [ ] added documentation for new or modified features or behaviors. (Not yet: the functionality is not yet user visible.)
- [X] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
- [X] added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
- [X] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
- [X] added integration tests.
- [ ] been tested in a test Druid cluster.
This pull request introduces 8 alerts when merging c958a98df451ce989005867140e5ebe07986ba19 into ebfe1c0c90d86e4d188617fe840dafb2c9b7e5b0 - view on LGTM.com
new alerts:
- 3 for Inconsistent equals and hashCode
- 2 for Spurious Javadoc @param tags
- 1 for Unused format argument
- 1 for Useless null check
- 1 for Polynomial regular expression used on uncontrolled data
This pull request introduces 6 alerts when merging bd7c8bed54cb21b80f156326e10c26fa8c445c03 into 7fa53ff4b321590eb97100c3fe21868badbc0051 - view on LGTM.com
new alerts:
- 2 for Inconsistent equals and hashCode
- 1 for Spurious Javadoc @param tags
- 1 for Unused format argument
- 1 for Useless null check
- 1 for Polynomial regular expression used on uncontrolled data
Build is clean except for one unrelated flaky failure in a Kafka IT. Will rerun later as I'm sure someone will have comments that require code changes and a new build.
Here's how we propose to revise the REST API. Comments or suggestions?
Updated Oct. 31: Revised to use query parameters to specify formats on GET messages.
The goal is to use a standardized resource path while supporting a variety of formats. It appears the only way to achieve that is with a format query parameter.
The common prefix: `/druid/coordinator/v1/catalog`
Configuration-as-Code
CRUD for an entire TableSpec, optionally protected by a version.
POST {prefix}/schemas/{schema}/tables/{table}[?version={n}|overwrite=true|false]
Create or update a TableSpec.
- With no options, the semantics are "create if not exists."
- With a version, the semantics are "update if exists and is at the given version"
- With "overwrite", the semantics are "create or update"
POST {prefix}/tables[?overwrite=true|false]
Possible alternative: pass a `TableMetadata` with the schema name, table name, and version. This form is simpler, but it requires that the client build up the table metadata object. It is also harder to reuse specs across dev and prod: the `TableSpec` is common, but the `TableMetadata` must be rewritten for each usage. This is one of those "we won't know until someone complains" choices.
DELETE {prefix}/schemas/{schema}/tables/{table}
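As a rough sketch, a client might exercise these endpoints as follows. The host, schema/table names, and the `TableSpec` JSON shape are assumptions for illustration, not part of the proposal.

```python
# Hypothetical client calls for the create/update/delete flow above.
# The host, schema/table names, and TableSpec JSON shape are assumptions.
import requests

PREFIX = "http://localhost:8081/druid/coordinator/v1/catalog"
table_url = f"{PREFIX}/schemas/druid/tables/wiki"
spec = {"type": "detail", "columns": [{"name": "page", "sqlType": "VARCHAR"}]}

# Create if not exists (no options).
requests.post(table_url, json=spec)

# Update only if the stored entry is still at version 3 (optimistic locking).
requests.post(table_url, params={"version": 3}, json=spec)

# Create or update unconditionally.
requests.post(table_url, params={"overwrite": "true"}, json=spec)

# Remove the table spec from the catalog.
requests.delete(table_url)
```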
Editing
Edit operations apply a specific transform to an existing table spec. Because the transform is very specific, there is no need for a version as there is little scope for unintentional overwrites. The edit actions are usable from the Druid console or DDL statements, if/when we add them.
POST {prefix}/schemas/{schema}/tables/{table}/edit
The payload is a message that says what to change:
- Hide columns (add to the hidden columns list)
- Unhide columns (remove from the hidden columns list)
- Drop columns (remove items from the columns list)
- Move columns
- Update props (merge updated properties with existing)
- Update columns (merge updated columns with existing)
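For illustration, a couple of edit commands might be posted like this; the command and field names below are assumptions about the payload shape, not the final API:

```python
# Hypothetical edit-command payloads for the actions listed above.
# Command and field names are illustrative assumptions.
import requests

PREFIX = "http://localhost:8081/druid/coordinator/v1/catalog"
edit_url = f"{PREFIX}/schemas/druid/tables/wiki/edit"

# Hide two columns (adds them to the hidden-columns list).
requests.post(edit_url, json={"type": "hideColumns", "columns": ["ssn", "email"]})

# Merge updated properties into the existing table spec.
requests.post(edit_url, json={
    "type": "updateProperties",
    "properties": {"segmentGranularity": "P1D"},
})
```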
Retrieval
GET {prefix}/schemas[?format=name|path|metadata]
Returns one of:
- `name`: The list of schema names (default)
- `path`: The list of all table paths (i.e., `{schema}.{table}` pairs)
- `metadata`: A list of metadata for all tables in all schemas.
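A quick sketch of the three listing formats (host name assumed):

```python
# Hypothetical listing calls using the format query parameter.
import requests

PREFIX = "http://localhost:8081/druid/coordinator/v1/catalog"

names = requests.get(f"{PREFIX}/schemas").json()                                 # default: schema names
paths = requests.get(f"{PREFIX}/schemas", params={"format": "path"}).json()      # schema.table pairs
metadata = requests.get(f"{PREFIX}/schemas", params={"format": "metadata"}).json()
```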
GET {prefix}/schemas/{schema}
Not supported. Reserved to obtain information about the schema itself.
GET {prefix}/schemas/{schema}/tables[?format=name|metadata]
Returns one of:
- `name`: The list of table names within the schema.
- `metadata`: The list of `TableMetadata` for each table within the schema.
GET {prefix}/schemas/{schema}/tables/{table}[?format=spec|metadata|status]
Retrieves information for one table. Formats are:
- `spec`: The `TableSpec` (default)
- `metadata`: The `TableMetadata`
- `status`: The `TableMetadata` without the spec (that is, the name, state, creation date, and update date)
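And a sketch of retrieving a single table in each format (host, schema, and table names assumed):

```python
# Hypothetical retrieval calls for one table in each format.
import requests

PREFIX = "http://localhost:8081/druid/coordinator/v1/catalog"
table_url = f"{PREFIX}/schemas/druid/tables/wiki"

spec = requests.get(table_url).json()                                     # default: the TableSpec
metadata = requests.get(table_url, params={"format": "metadata"}).json()  # full TableMetadata
status = requests.get(table_url, params={"format": "status"}).json()      # metadata minus the spec
```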
Sync
GET /sync/tables
Returns a list of TableSpec objects for bootstrapping a Broker.
Similar to `GET /entry/tables`, but that message returns metadata.
POST /sync/delta
On the broker, receives update (delta) messages from the coordinator. The payload is an object:
- Create/update: table path plus the `TableSpec`
- Delete: table path only
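To make the delta payload concrete, the two event shapes might look roughly like this; the field names are assumptions, not the final wire format:

```python
# Hypothetical delta events pushed from the Coordinator to a Broker.
create_or_update_event = {
    "schema": "druid",
    "table": "wiki",
    "spec": {"type": "detail", "columns": []},  # the full TableSpec accompanies creates/updates
}
delete_event = {
    "schema": "druid",
    "table": "wiki",  # no spec: the entry is being deleted
}
```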
Here's how we propose to revise the REST API. Comments or suggestions?
Use a form of `{topic}/{path}`, where the topic is the kind of action. This resolves the ambiguity of trying to do it the other way, `{path}/{action}`, where there is an ambiguity between paths and actions.

Configuration-as-Code
CRUD for an entire `TableSpec`, optionally protected by a version.
POST /resource/tables/{schema}/{table}[?version={n}|overwrite=true|false]
Create or update a `TableSpec`.
* With no options, the semantics are "create if not exists."
* With a version, the semantics are "update if exists and is at the given version."
* With "overwrite", the semantics are "create or update."
GET /resource/tables/{schema}/{table}
DELETE /resource/tables/{schema}/{table}

Editing
Edit operations apply a specific transform to an existing table spec. Because the transform is very specific, there is no need for a version as there is little scope for unintentional overwrites. The edit actions are usable from the Druid console or DDL statements, if/when we add them.
POST /edit/table/{schema}/{table}
The payload is a message that says what to change:
* Hide columns (add to the hidden columns list)
* Unhide columns (remove from the hidden columns list)
* Drop columns (remove items from the columns list)
* Move columns
* Update props (merge updated properties with existing)
* Update columns (merge updated columns with existing)

Retrieval
GET /names/schemas
Returns a list of the schema names.
GET /names/tables/{schema}
Returns a list of the table names within a schema.
GET /entry/tables
Returns a list of the metadata (`TableSpec` plus update dates, etc.) for all tables.
GET /entry/tables/{schema}
Returns a list of the metadata for tables within a schema.
GET /entry/tables/{schema}/{table}
Returns the metadata for a single table.
Sync
GET /sync/tables
Returns a list of `TableSpec` objects for bootstrapping a Broker. Similar to `GET /entry/tables`, but that message returns metadata.
POST /sync/delta
On the broker, receives update (delta) messages from the coordinator. The payload is an object:
* Create/update: table path plus the `TableSpec`
* Delete: table path only
Thanks for the updated proposal, @paul-rogers. One thing I think would be good is if, in the URI path, when specifying a table within a specific schema, the schema came before the table in the path, as it denotes a hierarchy: tables within a schema, not the other way around. Any way that we can do this? Maybe the schema can be included in the request payload that creates/updates a table, and not allowed to be null? Also, I think it's a bit awkward that the table is specified as a resource in some APIs and as an entry in others; any way to merge the two?
> One thing I think would be good is if in the URI path, when specifying a table within a specific schema, the schema should come before table in the path
As it turns out, the schema does come before the table:
POST /resource/tables/{schema}/{table}[?version={n}|overwrite=true|false]
I suspect the confusing bit is the "/resource/tables" prefix. The thought here was that, at present, the catalog has only tables. Most DBs allow user-defined schemas. We may want to add connections for things like S3, Kafka, etc. So, the thought was we'd have a variety of "resources", each with some naming convention. So, in the future (not now):
POST /resource/schemas/{schema}
POST /resource/connections/{conn}
Etc.
Here, we could simplify: the resource part could be removed:
POST /tables/{schema}/{table}
POST /schemas/{schema}
POST /connections/{conn}
(Everything here is simplified, BTW, there is a common prefix which I'm omitting.)
> Maybe the schema can be included in the request payload
That is one solution: add `POST /tables`. The problem is that the result is asymmetric on get: `GET /tables` might return everything, so one would do `GET /tables/{schema}/{table}`. Plus, the content would differ: to create, we provide a name, but on get, we don't need the name in the response because we already have it.
Then there is the update ambiguity: `POST /tables/{schema}/{name}` says which table we want to update. If the name also appears in the request, then we (and the user) would have to ensure that they match. I suppose we could do update as `POST /tables` with the name in the body...
A final comment is that the present design says the name is the place you store your table spec: it isn't an attribute of the spec. This means I can post the same spec under multiple names: one for dev, another for test, and another for prod. (Since Druid doesn't allow user-defined namespaces, the best we can do is "dev_events", "test_events", and "events" for dev, test, and prod.) If the name were in the spec, then the spec would have to be modified for each use. (And the DB record would store the name twice: once in the key field and again in the spec, resulting in redundancy and another thing to verify on every update.)
The existing and proposed designs allow the same spec format for create, update, and read. They allow the same spec to be posted to dev, test, and prod tables. I think we want to keep each of these features. That said, I'm open to revisions about how we provide those features.
> its a bit awkward that the table is specified as a resource in some apis, and as a entry in others
The difference in "themes" was due to the ambiguity between `/schemas/{schema}/{table}` and `/schemas/{schema}/{operation}` if we do the obvious thing and try to use a common base for both. But if we're OK with `/schemas/{schema}/{operation}` and `/tables/{schema}/{table}/{operation}`, then we can almost, but not quite, combine resources and entries.
For tables:
- `POST /tables/{schema}/{table}[?version={n}|overwrite=true|false]`: add/update a table
- `POST /tables/{schema}/{table}/edit`: "edit" (incrementally update) a table
- `GET /tables/{schema}/{table}`: get the table spec for a table (the same object as for create/update)
- `GET /tables/{schema}/{table}/metadata`: get the table metadata (name, update date, state, spec, etc.)
- `GET /tables`: get metadata for all tables in all schemas
For schemas:
- `POST /schemas/{schema}`: create a schema (not yet supported!)
- `GET /schemas/{schema}`: get metadata for a schema (not yet supported!)
- `GET /schemas/{schema}/names`: get the names of tables within the schema
- `GET /schemas/{schema}/tables`: get the metadata for each table in the schema
In the above, however, there is no good way to get the names of all tables in all schemas: `GET /schemas/names` won't work (ambiguous), and `GET /schemas` won't work (it would imply getting the metadata (contents) for all schemas).
This is the "trying to be too clever" issue that made the original API a bit awkward: we had to do some song and dance to work around ambiguities.
The proposal in the earlier message resolves these issues by first saying what you want to do, then saying what you want to do it on. That way, to get names:
- `GET /names/schemas` says to get all schema names
- `GET /names/tables` says to get all table names in all schemas
- `GET /names/schema/{schema}` says to get all table names in the given schema
(The above is a refinement of the earlier proposal.)
The same pattern is then repeated for metadata, using entries.
Again, I think we need the lists of names, and the lists of contents. We need it for the whole system, for everything in a schema, and for a single table. Again, I'm open to other ways of accomplishing the goals.
See a revision to the earlier post for the updated proposal.
PR has been updated with a large set of changes based on review comments:
- Revised the REST API per the design outlined above.
- Replaced the ad-hoc ways to "edit" a table with a new API based on "edit commands."
- Replaced the sync protocol with one based on "sync events"
- Revised the UTs and ITs to match
This pull request introduces 1 alert when merging 5de07ce9341ac324b8bf1acd1844fa8398e6a8a3 into c875f4bd0402d49c57d56068537b8babc52386d1 - view on LGTM.com
new alerts:
- 1 for Spurious Javadoc @param tags