Add DuckDB Dialect Support
This pull request adds support for the DuckDB SQL dialect to the SQL Formatter library.
Description:
- Extends the supported dialects to include DuckDB.
- Updates the documentation (README.md and potentially others) to reflect the addition of DuckDB support.
- Includes any necessary tests to ensure proper formatting for DuckDB queries.
Benefits:
- Users working with DuckDB can now leverage the SQL Formatter library for consistent and readable SQL code.
- Enhances the overall library coverage by including a popular in-memory database.
Testing:
- A test suite is included to verify accurate formatting for various DuckDB queries.
Please review the changes and provide feedback.
This pull request is automatically built and testable in CodeSandbox.
To see build info of the built libraries, click here or the icon next to each commit SHA.
Thanks for the PR. A few quick questions and thoughts:
- What's the relationship between DuckDB and PostgreSQL? I saw in DuckDB docs that it uses PostgreSQL parser. But does it actually support all of the PostgreSQL syntax? Like, does it support all these operators that you have listed?
- You probably have listed too many keywords. Best to only include reserved keywords. Otherwise you'll have common field names like
id,name,location,typedetected as keywords and converted to uppercase when usingkeywordCase: "upper". - Your data types list seems to be missing some basic stuff like
INT,CHAR,ARRAY.
I'm pretty busy this week... not sure how much time I have to properly review this.
For bonus points, you can update the wiki with information about DuckDB. That will also make it easier for me to review this. Otherwise I'll have to go and figure out all of this about DuckDB by myself.
I was about to submit a PR also, I will make some comment in your code.
https://github.com/PMassicotte/sql-formatter/tree/duckdb-dialect
Thanks for the comments above. The errors from the test suite are now down to five:
DuckDBFormatter
✕ supports ARRAY[] literals (2 ms)
✕ dataTypeCase option does NOT affect ARRAY[] literal case
✕ keywordCase option affects ARRAY[] literal case
✕ dataTypeCase option affects ARRAY type case (2 ms)
✕ supports array slice operator
@nene - I'll look into filling out the wiki 👍
To fix ARRAY[] tests:
- add
ARRAYtodataTypesand remove it from keywords and functions lists.
To fix the bar[1:] test:
- remove
BARfrom list of functions
I would guess the builtin BAR() function is rarely used. Well... at least I failed to find documentation of it, because it's super hard to google as Postgres docs contain loads of foo and bar in example code. I think it's better for the formatter to also support the more common use case of bar as a name used in example code.
PS. Make sure to run yarn pretty. (Looks like that currently also changes the .pre-commit-hooks.yaml file... you can let Prettier to reformat that, or leave it as-is. Either way is fine.)
The test suite is passing completely now 🎉
I'll collate some notes for the wiki over the next week. Are there any other steps needed before merging?
Thanks. I don't think there's anything else.
Hello, We're are working on a project using DuckDB and have a few issues with the standard SQL format (->> replaced with - > > for instance). Do you know when this PR will be merged? Thanks
Thanks for the interest. This thing has indeed been sitting here for a while now. Will need to dig back into this to see if there's any reasons why it hasn't been merged yet. I think one of the main reasons was that it seemed really-really similar to PostgreSQL.
So a question to you @Zank94, if you just configure SQL Formatter to treat your SQL as PostgreSQL, would that solve your problems (e.g. the ->> operator is also present in PostgreSQL).
I see, thank you for the advice I will give it a try :+1:
I was using the postgres formatter for duckdb queries, but it fails when the query is having <<= which gets formatted to << =
Any updates, if this PR is going to be merged anytime soon?
Well, the thing is that this PR has a load of failing tests. Interestingly all failing tests were fixed at one point in eb7d9b1ecfb2adefbd41b563ab68015faa073980, but after that bunch of more changes were added and now it has a total of 92 failing tests. (Don't know why Github says that "All checks have passed".)
I personally have no real knowledge of DuckDB and the documentation of DuckDB seems to be lacking. For example I tried to find information about that <<= operator, but was not able to. Really I couldn't even find information about which basic arithmetic operators are supported. The expressions documentation is kinda short. It does mention the == operator, which however this PR does not include, just as it doesn't include the <<= operator.
Initially I had the false impression that DuckDB is pretty much just PostgreSQL with a few minor differences, I have now come to a conclusion that it's more like DuckDB supports some small subset of PostgreSQL syntax (plus some DuckDB-specific additions like CREATE MACRO).
Might be that this whole PR should be started from scratch. I would personally start with filling out the wiki with information about DuckDB. But because of DuckDB lack of documentation, it seems like an inconvenient task to undertake.
@nene , I have to disagree with you here. I think DuckDB has fantastic documentation, but it has a LOT of functionality above and beyond Postgres, so the documentation can be pretty dense for the uninitiated.
@karanpopat , what the heck is the <<= operator? I use DuckDB every day, but I'm not familiar with that one, and I can't find it anywhere in the docs.
@riziles it's an inet extension
https://duckdb.org/docs/stable/extensions/inet.html#-predicate
@riziles It might very well be that the documentation is great and I just don't know how to use it. Like this page which at first glance seems to document differences from PostgreSQL, but it's a pretty short page. I guess it actually tries to document the "important" differences for ordinary users.
I think I now finally found that most of the operators are documented in the Functions section. But not all operators can be found there. For example the ~ operator isn't mentioned under Regular expression functions while it is in fact equivalent of regexp_full_match.
@karanpopat , that seems like a pretty niche requirement. Can't you just use the "denseOperators" flag? https://github.com/sql-formatter-org/sql-formatter/blob/master/docs/denseOperators.md
I think the denseOperators flag is really a pretty crappy solution as it forces all operators to be formatted with no spaces around them. I frankly don't know why anyone would ever use it.
One can instead just extends the postgresql formatter configuration with an additional operator. Something like:
import { formatDialect, postgresql } from 'sql-formatter';
const duckdb = {
...postgresql,
tokenizerOptions: {
...postgresql.tokenizerOptions
operators: [...postgresql.tokenizerOptions.operators, '<<=', '>>='],
}
};
formatDialect('SELECT foo <<= bar FROM tbl', { dialect: duckdb });
I have now digged a bit deeper to the DuckDB syntax and I think I was mislead earlier when I read from the docs that DuckDB uses PostgreSQL parser. I frankly can't find that from the docs any more. I guess that part was removed. Turns out they instead used a (likely heavily modified) fork of PostgreSQL parser and for all I know they might be using a completely custom parser by now.
There are just so-so many differences in syntax, that I think it makes no sense to treat it as a completely different dialect. Some most notable DuckDB-specific syntax I've found so far:
- Prefix aliases:
SELECT foo: 1, bar: 2 - List literals:
SELECT [1, 2, 3] - Struct literals:
SELECT {foo: 1, bar: 2} - List slice operator:
SELECT list[1:10] -
POSITIONALandASOFjoins. - Percentage-based limit:
SELECT * FROM tbl LIMIT 10%
It's definitely based on Postgres syntax, i.e. most vanilla Postgres queries would work fine in DuckDB, but it has capabilities far above and beyond for analytics workflows: https://www.theregister.com/2024/08/20/postgresql_duckdb_extension/
So, I ended up putting this PR aside and creating a new DuckDB configuration from scratch: #857
I ended up using the functions, keywords and data types lists from this PR. Thanks for that @hughcameron.
Also thanks to everybody else who has provided information about DuckDB in this thread.