WHERE clause eats custom Keyword but not others
Describe the bug
I'm attempting to define some custom keywords for the dialect I'm using, specifically the Keyword FACET. I'm noticing that when FACET appears after WHERE in my querystring, sqlparse groups FACET into the WHERE's TokenList. However, if another Keyword such as GROUP BY appears between WHERE and FACET, then WHERE, GROUP BY, and FACET all show up in the top level of the parsed tree as expected. I'm trying to understand why this is the case and if I'm doing something wrong.
To Reproduce
import re
import sqlparse
from sqlparse import keywords
from sqlparse.lexer import Lexer
from pprint import pprint
# get the lexer singleton object to configure it
lex = Lexer.get_default_instance()
# Clear the default configurations.
# After this call, reg-exps and keyword dictionaries need to be loaded
# to make the lexer functional again.
lex.clear()
my_regex = (r"FACET\b", sqlparse.tokens.Keyword)
# slice the default SQL_REGEX to inject the custom object
sql_regexes = keywords.SQL_REGEX[:38] + [my_regex] + keywords.SQL_REGEX[38:]
lex.set_SQL_REGEX(sql_regexes)
# pprint([(str(sql_regex[0]), sql_regex[1]) for sql_regex in sql_regexes])
pprint(keywords.KEYWORDS)
# add the default keyword dictionaries
lex.add_keywords(keywords.KEYWORDS_COMMON)
lex.add_keywords(keywords.KEYWORDS_ORACLE)
lex.add_keywords(keywords.KEYWORDS_PLPGSQL)
lex.add_keywords(keywords.KEYWORDS_HQL)
lex.add_keywords(keywords.KEYWORDS_MSACCESS)
lex.add_keywords(keywords.KEYWORDS)
# add a custom keyword dictionary
lex.add_keywords({'FACET': sqlparse.tokens.Keyword})
querystring = "SELECT x, latest(y) FROM (SELECT x, y, z FROM table) WHERE x = 5 FACET x.name"
# no configuration is passed here. The lexer is used as a singleton.
parsed = sqlparse.parse(querystring)
pprint(parsed[0].tokens)
# produces:
#
# [<DML 'SELECT' at 0x104A934C0>,
# <Whitespace ' ' at 0x104A93820>,
# <IdentifierList 'x, lat...' at 0x104D1A450>,
# <Whitespace ' ' at 0x104CEEF80>,
# <Keyword 'FROM' at 0x104CEEFE0>,
# <Whitespace ' ' at 0x104CEF040>,
# <Parenthesis '(SELEC...' at 0x104D19DD0>,
# <Whitespace ' ' at 0x104CEF640>,
# <Where 'WHERE ...' at 0x104D19ED0>]
# However, when querystring is
# querystring = "SELECT x, latest(y) FROM (SELECT x, y, z FROM table) WHERE x = 5 GROUP BY y FACET x.name"
#
# produces:
#
# [<DML 'SELECT' at 0x1002674C0>,
# <Whitespace ' ' at 0x100267820>,
# <IdentifierList 'x, lat...' at 0x1004EE3D0>,
# <Whitespace ' ' at 0x1004C2F80>,
# <Keyword 'FROM' at 0x1004C2FE0>,
# <Whitespace ' ' at 0x1004C3040>,
# <Parenthesis '(SELEC...' at 0x1004EDD50>,
# <Whitespace ' ' at 0x1004C3640>,
# <Where 'WHERE ...' at 0x1004EDE50>,
# <Keyword 'GROUP ...' at 0x1004C39A0>,
# <Whitespace ' ' at 0x1004C3A00>,
# <Identifier 'y' at 0x1004EE2D0>,
# <Whitespace ' ' at 0x1004C3AC0>,
# <Keyword 'FACET' at 0x1004C3B20>,
# <Whitespace ' ' at 0x1004C3B80>,
# <Identifier 'x.name' at 0x1004EDED0>]
Expected behavior I would expect that the latter output would appear regardless of where FACET appears in my query.
Versions (please complete the following information):
- Python: 3.11
- sqlparse: 0.4.4
Additional context Is it possible that I'm simply not understanding something about sqlparse? Is there a reason why WHERE would consume my custom Keyword and not others?