sql icon indicating copy to clipboard operation
sql copied to clipboard

[BUG] Expressions in PPL require whitespace around * characters

Open kroepke opened this issue 3 years ago • 5 comments

What is the bug? If a multiplication expression is written without whitespace around the operator, the lexer mistakenly tokenizes it as ID_LITERAL. This leads to unexpected behavior (and IMHO wrong parsing) wherever multiplication is allowed.

How can one reproduce the bug?

  1. Execute the PPL query search source=opensearch_dashboards_sample_data_logs | stats avg(bytes*1024)
  2. The query fails with org.opensearch.sql.exception.SemanticCheckException: can't resolve Symbol(namespace=FIELD_NAME, name=bytes*1024) in type env
  3. Change the query to search source=opensearch_dashboards_sample_data_logs | stats avg(bytes * 1024) (note the surrounding whitespace)
  4. Run it again, the query succeeds

What is the expected behavior? An expression should parse successfully with or without optional whitespace present, the queries above should behave identically. If certain identifiers can contain wildcards those should be a different token, as they are not allowed everywhere where a plain ID is allowed at.

What is your host/environment?

  • Version 1.3, 2.1
  • Plugins SQL since 1.13.2.0 up until current

Do you have any additional context? The change was introduced in https://github.com/opensearch-project/sql/commit/7dc8699d62180c803fd5147d03331b787360d9a0#diff-fbfc17858912dc1b9fbe756efdc25267d6eb4f0f6e4b9d4fc1500d072f526892 and tests were deliberately changed to include whitespace to work around the lexer issue. I believe this is wrong behavior, which is very difficult to understand for users.

As far as I can tell from the code, the intention was to allow wildcards and other non-numeric and non-alpha characters for index names. If that's the case the grammar should reflect the two different uses and not treat them as the same token type.

Unfortunately with the way things are now it becomes essentially impossible to disambiguate in the lexer.

kroepke avatar Aug 02 '22 18:08 kroepke

Failing test in org.opensearch.sql.ppl.antlr.PPLSyntaxParserTest

  @Test
  public void testLiteralSwallowsExpression() {
    ParseTree tree = new PPLSyntaxParser().parse("source=foo | stats avg(bytes*1024)");
    assertNotEquals(null, tree);
    new OpenSearchPPLParserBaseVisitor<Void>(){
      @Override
      public Void visitPrimaryExpression(OpenSearchPPLParser.PrimaryExpressionContext ctx) {
        if (ctx.start.equals(ctx.stop)) {
          if (ctx.start.getText().equals("bytes*1024")) {
            fail("Expression not tokenized");
          }
        }
        return super.visitPrimaryExpression(ctx);
      }
    }.visit(tree);
  }

kroepke avatar Aug 02 '22 18:08 kroepke

I guess it duplicates or similar to https://github.com/opensearch-project/sql/issues/265

Yury-Fridlyand avatar Aug 03 '22 20:08 Yury-Fridlyand

Looks similar, but it is a different ANTLR grammar. I haven't checked it yet, but I doubt that SQL identifiers would contain * so I venture a guess that the underlying reason will be slightly different.

kroepke avatar Aug 03 '22 20:08 kroepke

Crazily, I was wrong, it is a different grammar file, but the issue is identical:

https://github.com/opensearch-project/sql/commit/0a878aba438a0edeeb946f4c879d9bbd61b4a7da#diff-de4dc8d33192023fd968c0c467b9c242904a600e388bbc49f87e90d511ce5c41R322

SQL identifiers can contain * (among other things that aren't operators) which will probably break a lot of SQL clients' assumptions generating code.

Unless it can be lexically determined where wildcard identifiers are allowed, this will be really tricky to fix while staying compatible. Is the intention to only allow wildcards in index names?

kroepke avatar Aug 04 '22 11:08 kroepke

@kroepke Yeah, I think so. Because OpenSearch allows * used in index expression to express search over multiple indexes which match the pattern. Probably we need separate the index name identifier and field name identifier in ANTLR grammar.

dai-chen avatar Aug 05 '22 21:08 dai-chen