[SIP-95] Proposal for Catalog Support in SQL Lab
Motivation
The current SQL Lab explorer is not very intuitive: there is no way to browse catalogs, so users can only explore the default catalog.
Proposed Change
Add a new Catalog dropdown to the SQL Lab editor (similar to the existing Schema dropdown).
New or Changed Public Interfaces
This should only affect the SQL Lab query tab. The Catalog dropdown will be similar enough to the Schema explorer. The word "schema" is somewhat overloaded throughout Superset, so it's not entirely clear to me yet whether a new endpoint is necessary.
New dependencies
None
Migration Plan and Compatibility
None
Rejected Alternatives
None
I would love to see this, but there are a few complications. The biggest one is that we need support from the SQLAlchemy dialects, and I'm not sure how many support dynamic catalogs. For example, both Hive and Trino require the catalog to be specified when the connection is created:
- trino://<username>:<password>@<host>:<port>/<catalog>/<schema>
- hive://<host>:<port>/<catalog>
For engines like this the user would have to choose a default catalog when the database is added, and each DB engine spec in Superset would have to know how to replace the catalog in the URL, in case a user chooses one different from the default after the DB has been created.
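To make that concrete, here's a rough, untested illustration (not Superset code; the real hook would live in each DB engine spec, as sketched later in this thread) of rewriting the catalog in a Trino-style URI with SQLAlchemy's URL helpers. The "analytics" catalog name is just an example, and url.set() assumes SQLAlchemy 1.4+:

```python
from sqlalchemy.engine.url import make_url

# Trino encodes catalog (and optionally schema) in the "database" part of the URI.
url = make_url("trino://user:secret@localhost:8080/default_catalog/public")

# Split off the schema (if any) and swap in the catalog chosen by the user.
_, _, schema = url.database.partition("/")
new_database = f"analytics/{schema}" if schema else "analytics"
url = url.set(database=new_database)  # now points at the "analytics" catalog
```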
It's definitely doable, and something we should do. We've been slowly adding support for catalogs:
https://github.com/apache/superset/blob/383313b105b0e82bea0f38cc971630eded5affe0/superset/sql_parse.py#L172
Are you planning to work on this, or are you just proposing it? If you're not going to work on it I can start taking a look.
I was planning to work on this, but of course I’d love help as I’ve never made a contribution to Superset. I was planning on doing a smaller task first (adding a configurable header/footer) since I figured this one might be complex.
Ah, great! From the wording of the SIP I wasn't sure, and people are welcome to propose changes even if they're not going to work on it. I'm happy to help in any way you need, let me know!
hey @betodealmeida, good to see you are still pushing Superset forward!
We originally worked around this by concatenating the catalog with the schema, but the patching got nasty, as we would pass it through to the Trino driver.
I think a good approach for this would be modifying BaseEngineSpec.get_engine to allow passing a catalog. The method would then pass it to get_sqla_engine_with_context, which would call a DB engine specific method to mutate the SQLAlchemy URI so it uses the requested catalog. Something like (pseudocode):
```python
class BaseEngineSpec:
    @classmethod
    def get_engine(
        cls,
        database: Database,
        catalog: Optional[str] = None,
        schema: Optional[str] = None,
        source: Optional[QuerySource] = None,
    ) -> ContextManager[Engine]:
        # get_sqla_engine_with_context needs to call
        # database.engine_spec.apply_catalog_to_sqlalchemy_uri
        return database.get_sqla_engine_with_context(
            catalog=catalog,
            schema=schema,
            source=source,
        )

    @classmethod
    def apply_catalog_to_sqlalchemy_uri(cls, database: Database, catalog: str) -> URL:
        return database.sqlalchemy_uri

    @classmethod
    def get_catalog_names(
        cls,
        database: Database,
        inspector: Inspector,
    ) -> Set[str]:
        return set()
```
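For context, here's a hedged sketch of how the selected catalog could flow from SQL Lab down to the engine, assuming the get_engine signature above (QuerySource.SQL_LAB is Superset's existing enum value for SQL Lab queries; the rest is illustrative, not final code):

```python
# Illustrative only: passing the catalog picked in the new dropdown.
with database.db_engine_spec.get_engine(
    database,
    catalog="analytics",   # value selected in the Catalog dropdown
    schema="public",
    source=QuerySource.SQL_LAB,
) as engine:
    with engine.connect() as connection:
        result = connection.execute("SELECT 1")
```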
Individual DB engine specs like Trino/Hive/Snowflake/BigQuery would implement custom apply_catalog_to_sqlalchemy_uri methods to build the SQLAlchemy URI for a given catalog. They would also implement custom get_catalog_names to list all available catalogs, since there's no standard SQLAlchemy way of doing that, as far as I know.
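For example, a Trino implementation could probably be as simple as running SHOW CATALOGS. A minimal sketch, assuming a SQLAlchemy 1.x-style Engine.execute on the inspector's bind (illustrative, not actual Superset code):

```python
from typing import Set

from sqlalchemy import text
from sqlalchemy.engine.reflection import Inspector


class TrinoEngineSpec(BaseEngineSpec):
    @classmethod
    def get_catalog_names(
        cls,
        database: Database,
        inspector: Inspector,
    ) -> Set[str]:
        # Trino exposes the available catalogs via SHOW CATALOGS.
        return {catalog for (catalog,) in inspector.bind.execute(text("SHOW CATALOGS"))}
```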
Actually, we would need apply_catalog_to_sqlalchemy_uri to also take the schema and return any extra connect_args:
```python
@classmethod
def apply_catalog_to_sqlalchemy_uri(
    cls,
    database: Database,
    catalog: Optional[str],
    schema: Optional[str],
) -> Tuple[URL, Dict[str, Any]]:
    try:
        connect_args = database.get_extra()["engine_params"]["connect_args"]
    except KeyError:
        connect_args = {}
    return database.sqlalchemy_uri, connect_args
```
For Postgres, this would look like this (untested):
```python
@classmethod
def apply_catalog_to_sqlalchemy_uri(
    cls,
    database: Database,
    catalog: Optional[str],
    schema: Optional[str],
) -> Tuple[URL, Dict[str, Any]]:
    url, connect_args = super().apply_catalog_to_sqlalchemy_uri(database, catalog, schema)
    if catalog:
        # for Postgres a catalog is a database
        url = url.set(database=catalog)
    if schema:
        connect_args["options"] = f"-csearch_path={schema}"
    return url, connect_args
```
See the discussion at https://github.com/apache/superset/pull/23356.
Actually, we already have a method to change the schema (adjust_database_uri); we should piggyback on that.
While working on an issue with schemas, I've done some foundational work for supporting catalogs:
- https://github.com/apache/superset/pull/23356
- https://github.com/apache/superset/pull/23385
- https://github.com/apache/superset/pull/23401
Once we have that, I think we would still need the following steps:
1. Implement catalog-level permissions (@dpgaspar can probably help here). We also need to modify the parser to extract the catalog from queries, and have the security manager handle catalogs when checking for access.
2. Implement a get_catalog_names method in the DB engine specs that support catalogs (Presto, Trino, Snowflake, BigQuery, Postgres, ...?). Also update their adjust_engine_params methods to modify the SQLAlchemy URI to use the chosen catalog (see the sketch right after this list).
3. Add a catalog dropdown to SQL Lab, and make sure the selected catalog is being passed to _get_sqla_engine in the database model, and from there passed to adjust_engine_params.
4. Modify the dataset model and add a catalog attribute.
5. Add a catalog dropdown to the dataset creator.
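To illustrate item (2), here is a minimal, untested sketch of what a Postgres-style adjust_engine_params override might look like, modeled on the apply_catalog_to_sqlalchemy_uri example earlier in the thread (the exact signature in Superset may differ):

```python
from typing import Any, Dict, Optional, Tuple

from sqlalchemy.engine.url import URL


class PostgresEngineSpec(BaseEngineSpec):
    @classmethod
    def adjust_engine_params(
        cls,
        uri: URL,
        connect_args: Dict[str, Any],
        catalog: Optional[str] = None,
        schema: Optional[str] = None,
    ) -> Tuple[URL, Dict[str, Any]]:
        if catalog:
            # in Postgres a "catalog" is a database, so swap it in the URI
            uri = uri.set(database=catalog)
        if schema:
            # pass the schema to the driver via search_path
            connect_args["options"] = f"-csearch_path={schema}"
        return uri, connect_args
```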
I'm going to work on (1) and (2) this week, then we can put the SIP to vote and you (or someone else) can work on the UI part, @JELGT2011.
that's awesome news @betodealmeida! I've been a little delayed trying to upgrade our existing instance of superset, but I could definitely do some frontend changes (3), (4?), and (5). I think I have a draft work in progress on it already so that works out.
Numbering this as SIP-95. If you'd like to continue pursuing this, please submit it for a DISCUSS thread on the Superset @dev mailing list, so we can start progressing it toward a vote. Thanks!
It seems like this work is incrementally moving forward, but the SIP itself has still not been brought up for a DISCUSS thread or a VOTE thread on the Apache mailing list. @JELGT2011 / @betodealmeida is there any need or intent to continue with the SIP process here?
It would be huge to get this going for Databricks. I could probably contribute, especially since part of the approach has already been laid out for different engines.
I also support putting this SIP-95 up for a DISCUSS/VOTE thread! @MotherDuckdb has a similar scenario where a single DuckDB instance can connect to multiple databases (see https://duckdb.org/docs/sql/statements/attach.html). It would be great to be able to use the Catalog drop-down to scope the Schema options.
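For reference, a small standalone illustration of that DuckDB behavior (not Superset code; the file names are made up): a single connection can ATTACH several database files, each of which would map naturally to a catalog in the proposed dropdown.

```python
import duckdb

con = duckdb.connect()  # in-memory instance
con.execute("ATTACH 'warehouse.duckdb' AS warehouse")
con.execute("ATTACH 'staging.duckdb' AS staging")

# Each attached database shows up as its own "catalog" with its own schemas.
print(con.execute("SELECT database_name FROM duckdb_databases()").fetchall())
```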
@guenp please feel free to kick off the Discussion thread on the [email protected] mailing list! Let me know if you need help with that, you can ping me here or on Superset Slack
Hi @JELGT2011 / @guenp if anyone wants to move this SIP forward, please put it up for a [DISCUSS] thread on the mailing list. Otherwise, it's at risk of being closed/discarded.
I'll bring it up for a discussion, I don't think we're far away from finishing this.
@betodealmeida regarding your comment in the "[DISCUSS] [SIP-95] Proposal for Catalog Support in SQL Lab" email,
I'd like to increase the scope of SIP-95 by introducing catalogs not only in SQL Lab, but throughout the whole application.
I totally agree, as it's likely a necessary requirement if one wants to explore a SQL Lab result set.
@john-bodley right, I just called that out because the original SIP mentioned only SQL Lab in the title, but I don't see how that would work. 🙃
@betodealmeida I think you just need to close the vote for this with a [RESULT] thread
Done!
It's official, we now have catalog support in Apache Superset. Currently Postgres and Databricks are supported, and I have a PR out for:
- Trino
- Presto
- BigQuery
- Snowflake
Adding support to new DB engine specs should be straightforward, requiring just 3 methods. I suggest using https://github.com/apache/superset/pull/28416 as a reference implementation.
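Based on the discussion above, the skeleton for a new DB engine spec would presumably look something like the following. The method names are my reading of this thread and may not match the final API exactly; the linked PR is the authoritative reference.

```python
from typing import Any, Dict, Optional, Set, Tuple

from sqlalchemy.engine.reflection import Inspector
from sqlalchemy.engine.url import URL


class MyNewEngineSpec(BaseEngineSpec):
    @classmethod
    def get_default_catalog(cls, database: Database) -> Optional[str]:
        # Return the catalog baked into the database's SQLAlchemy URI.
        ...

    @classmethod
    def get_catalog_names(cls, database: Database, inspector: Inspector) -> Set[str]:
        # List all catalogs the current user can see.
        ...

    @classmethod
    def adjust_engine_params(
        cls,
        uri: URL,
        connect_args: Dict[str, Any],
        catalog: Optional[str] = None,
        schema: Optional[str] = None,
    ) -> Tuple[URL, Dict[str, Any]]:
        # Rewrite the URI/connect_args so the connection targets the chosen catalog.
        return uri, connect_args
```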