sql-formatter icon indicating copy to clipboard operation
sql-formatter copied to clipboard

Add DuckDB Dialect Support

Open hughcameron opened this issue 1 year ago • 9 comments

This pull request adds support for the DuckDB SQL dialect to the SQL Formatter library.

Description:

  • Extends the supported dialects to include DuckDB.
  • Updates the documentation (README.md and potentially others) to reflect the addition of DuckDB support.
  • Includes any necessary tests to ensure proper formatting for DuckDB queries.

Benefits:

  • Users working with DuckDB can now leverage the SQL Formatter library for consistent and readable SQL code.
  • Enhances the overall library coverage by including a popular in-memory database.

Testing:

  • A test suite is included to verify accurate formatting for various DuckDB queries.

Please review the changes and provide feedback.

hughcameron avatar Apr 30 '24 07:04 hughcameron

This pull request is automatically built and testable in CodeSandbox.

To see build info of the built libraries, click here or the icon next to each commit SHA.

codesandbox-ci[bot] avatar Apr 30 '24 07:04 codesandbox-ci[bot]

Thanks for the PR. A few quick questions and thoughts:

  • What's the relationship between DuckDB and PostgreSQL? I saw in DuckDB docs that it uses PostgreSQL parser. But does it actually support all of the PostgreSQL syntax? Like, does it support all these operators that you have listed?
  • You probably have listed too many keywords. Best to only include reserved keywords. Otherwise you'll have common field names like id, name, location, type detected as keywords and converted to uppercase when using keywordCase: "upper".
  • Your data types list seems to be missing some basic stuff like INT, CHAR, ARRAY.

I'm pretty busy this week... not sure how much time I have to properly review this.

nene avatar Apr 30 '24 08:04 nene

For bonus points, you can update the wiki with information about DuckDB. That will also make it easier for me to review this. Otherwise I'll have to go and figure out all of this about DuckDB by myself.

nene avatar Apr 30 '24 08:04 nene

I was about to submit a PR also, I will make some comment in your code.

PMassicotte avatar Apr 30 '24 10:04 PMassicotte

https://github.com/PMassicotte/sql-formatter/tree/duckdb-dialect

PMassicotte avatar Apr 30 '24 11:04 PMassicotte

Thanks for the comments above. The errors from the test suite are now down to five:

DuckDBFormatter
    ✕ supports ARRAY[] literals (2 ms)
    ✕ dataTypeCase option does NOT affect ARRAY[] literal case
    ✕ keywordCase option affects ARRAY[] literal case
    ✕ dataTypeCase option affects ARRAY type case (2 ms)
    ✕ supports array slice operator

@nene - I'll look into filling out the wiki 👍

hughcameron avatar May 01 '24 07:05 hughcameron

To fix ARRAY[] tests:

  • add ARRAY to dataTypes and remove it from keywords and functions lists.

To fix the bar[1:] test:

  • remove BAR from list of functions

I would guess the builtin BAR() function is rarely used. Well... at least I failed to find documentation of it, because it's super hard to google as Postgres docs contain loads of foo and bar in example code. I think it's better for the formatter to also support the more common use case of bar as a name used in example code.

PS. Make sure to run yarn pretty. (Looks like that currently also changes the .pre-commit-hooks.yaml file... you can let Prettier to reformat that, or leave it as-is. Either way is fine.)

nene avatar May 02 '24 06:05 nene

The test suite is passing completely now 🎉

I'll collate some notes for the wiki over the next week. Are there any other steps needed before merging?

hughcameron avatar May 03 '24 00:05 hughcameron

Thanks. I don't think there's anything else.

nene avatar May 03 '24 05:05 nene

Hello, We're are working on a project using DuckDB and have a few issues with the standard SQL format (->> replaced with - > > for instance). Do you know when this PR will be merged? Thanks

Zank94 avatar Nov 20 '24 17:11 Zank94

Thanks for the interest. This thing has indeed been sitting here for a while now. Will need to dig back into this to see if there's any reasons why it hasn't been merged yet. I think one of the main reasons was that it seemed really-really similar to PostgreSQL.

So a question to you @Zank94, if you just configure SQL Formatter to treat your SQL as PostgreSQL, would that solve your problems (e.g. the ->> operator is also present in PostgreSQL).

nene avatar Nov 20 '24 18:11 nene

I see, thank you for the advice I will give it a try :+1:

Zank94 avatar Nov 21 '24 08:11 Zank94

I was using the postgres formatter for duckdb queries, but it fails when the query is having <<= which gets formatted to << = Any updates, if this PR is going to be merged anytime soon?

karanpopat avatar Apr 16 '25 11:04 karanpopat

Well, the thing is that this PR has a load of failing tests. Interestingly all failing tests were fixed at one point in eb7d9b1ecfb2adefbd41b563ab68015faa073980, but after that bunch of more changes were added and now it has a total of 92 failing tests. (Don't know why Github says that "All checks have passed".)

I personally have no real knowledge of DuckDB and the documentation of DuckDB seems to be lacking. For example I tried to find information about that <<= operator, but was not able to. Really I couldn't even find information about which basic arithmetic operators are supported. The expressions documentation is kinda short. It does mention the == operator, which however this PR does not include, just as it doesn't include the <<= operator.

Initially I had the false impression that DuckDB is pretty much just PostgreSQL with a few minor differences, I have now come to a conclusion that it's more like DuckDB supports some small subset of PostgreSQL syntax (plus some DuckDB-specific additions like CREATE MACRO).

Might be that this whole PR should be started from scratch. I would personally start with filling out the wiki with information about DuckDB. But because of DuckDB lack of documentation, it seems like an inconvenient task to undertake.

nene avatar Apr 18 '25 08:04 nene

@nene , I have to disagree with you here. I think DuckDB has fantastic documentation, but it has a LOT of functionality above and beyond Postgres, so the documentation can be pretty dense for the uninitiated.

@karanpopat , what the heck is the <<= operator? I use DuckDB every day, but I'm not familiar with that one, and I can't find it anywhere in the docs.

riziles avatar Apr 18 '25 17:04 riziles

@riziles it's an inet extension

https://duckdb.org/docs/stable/extensions/inet.html#-predicate

karanpopat avatar Apr 18 '25 17:04 karanpopat

@riziles It might very well be that the documentation is great and I just don't know how to use it. Like this page which at first glance seems to document differences from PostgreSQL, but it's a pretty short page. I guess it actually tries to document the "important" differences for ordinary users.

I think I now finally found that most of the operators are documented in the Functions section. But not all operators can be found there. For example the ~ operator isn't mentioned under Regular expression functions while it is in fact equivalent of regexp_full_match.

nene avatar Apr 18 '25 17:04 nene

@karanpopat , that seems like a pretty niche requirement. Can't you just use the "denseOperators" flag? https://github.com/sql-formatter-org/sql-formatter/blob/master/docs/denseOperators.md

riziles avatar Apr 18 '25 22:04 riziles

I think the denseOperators flag is really a pretty crappy solution as it forces all operators to be formatted with no spaces around them. I frankly don't know why anyone would ever use it.

One can instead just extends the postgresql formatter configuration with an additional operator. Something like:

import { formatDialect, postgresql } from 'sql-formatter';

const duckdb = {
    ...postgresql,
    tokenizerOptions: {
        ...postgresql.tokenizerOptions
        operators: [...postgresql.tokenizerOptions.operators, '<<=', '>>='],
    }
};

formatDialect('SELECT foo <<= bar FROM tbl', { dialect: duckdb });

nene avatar Apr 19 '25 06:04 nene

I have now digged a bit deeper to the DuckDB syntax and I think I was mislead earlier when I read from the docs that DuckDB uses PostgreSQL parser. I frankly can't find that from the docs any more. I guess that part was removed. Turns out they instead used a (likely heavily modified) fork of PostgreSQL parser and for all I know they might be using a completely custom parser by now.

There are just so-so many differences in syntax, that I think it makes no sense to treat it as a completely different dialect. Some most notable DuckDB-specific syntax I've found so far:

  • Prefix aliases: SELECT foo: 1, bar: 2
  • List literals: SELECT [1, 2, 3]
  • Struct literals: SELECT {foo: 1, bar: 2}
  • List slice operator: SELECT list[1:10]
  • POSITIONAL and ASOF joins.
  • Percentage-based limit: SELECT * FROM tbl LIMIT 10%

nene avatar Apr 19 '25 07:04 nene

It's definitely based on Postgres syntax, i.e. most vanilla Postgres queries would work fine in DuckDB, but it has capabilities far above and beyond for analytics workflows: https://www.theregister.com/2024/08/20/postgresql_duckdb_extension/

riziles avatar Apr 19 '25 19:04 riziles

So, I ended up putting this PR aside and creating a new DuckDB configuration from scratch: #857

I ended up using the functions, keywords and data types lists from this PR. Thanks for that @hughcameron.

Also thanks to everybody else who has provided information about DuckDB in this thread.

nene avatar Apr 20 '25 15:04 nene