typeql icon indicating copy to clipboard operation
typeql copied to clipboard

Backslash escape character is always stripped

Open flyingsilverfin opened this issue 3 years ago • 4 comments

DO NOT MERGE (see comment)

What is the goal of this PR?

We address various String escaping difficulties by which the user could not get a raw string containing a quote, or a backslash, into TypeDB without also serialising an escape character into the database. This is contrary to the existing database conventions.

We now expect that a user can insert any dataset into TypeDB by applying the following operation to their raw input strings:

escaped_data = raw_data.replace("\\", "\\\\").replace("\"", "\\\"")
insert_query = "insert $x isa person, has data \"" + escaped_data + "\""

This process will replace each \ character with a \\ sequence, and then replace each " with an \" sequence. This will ensure that by the time the query is parsed, the user's raw data is held in memory on the server side.

Example If raw_data above contains hello " world \ , then the escaped_data will contain hello \" world \\ . After the query is submitted to the server and parsed, the data value held for attribute data will be hello " world \ and this is the value that will be persisted into the database.

What are the changes implemented in this PR?

  • TypeQL establishes \ as the official escape character
  • Align behaviour with database convention: console query strings that include attribute values with escapes, remove those escape characters during parsing.
  • When parsing a String attribute or Regex string, any backslash character is replaced its following character
  • When printing out parsed queries, we escape strings and regexes according to the reverse convention: include a backslash before each existing backslash and double quote character.
  • Previously, we broke user expectations set by other databases: the users escaped " and \, but we always kept the escape character after parsing - as a result the user data plus backslashes ended up stored on disk.

Existing database conventions:

Let's split query languages into two domains: console languages, and programmatic languages.

Programmatic languages include parametrised queries, for example in SQL a query can exclude attribute values, and pass them to the server separately to the parametrised query. This means escaping is never needed (and protects against SQL injection attacks) because a String attribute is passed directly to the server as a string. TypeDB does not yet support parametrised queries. As a result, we must always construct query strings that are parsed by the parser in full. Neo4j and MongoDB have their own parametrised input mechanisms that behave similarly.

Console languages are languages whose queries must be inputted entirely as strings. All TypeQL queries are provided as strings, so let's define them as console languages. Console languages all have an escape mechanism to input quoting into string attribute values. SQL and Neo4j have console languages available that use different conventions:

SQL uses the quoting character itself as the escape character:

select 'it''s escaped'

This will be parsed as it's escaped and utilised in the query exection. The escaping ' is removed.

Neo4J uses backslashes:

CREATE (n:Node {name:"hello \" world \\ "}) 
RETURN n.name

n.name
hello " world \

In this case, the parsed value removes the escape character \ in both usages.

We can see that existing console languages use the convention of stripping the escape characters from the input, which TypeQL now matches.

flyingsilverfin avatar Aug 03 '22 10:08 flyingsilverfin

DO NOT MERGE We have decided to park this change since it practically breaks backward compatibility of TypeDB: user's that did string escape previously have databases containing quoted \" value \" inside. However, in the new convention we would store quoted " value " inside.

To avoid forcing users to re-load, we will save merging this until we have to break backward compatibility for other reasons.

Changes required when merging:

  • The export-import should on import detect if importing from an older TypeDB version. If so, on import replace the \" with " to bring behaviour in line with the new parsing.
  • Add BDD scenario to test escaping works as expected end to end:
  Scenario: inserting strings including escapes are stored without escaping
    When typeql insert
      """
      insert
        $x isa name; $x "Normal";
        $y isa name; $y "With quote \" character";
        $z isa name; $z "With backslash \\ character";
      """
    Then transaction commits

    When session opens transaction of type: read
    When get answers of typeql match
      """
      match $x isa name;
      """
    Then uniquely identify answer concepts
      | x                                     |
      | value:name:Normal                     |
      | value:name:With quote " character     |
      | value:name:With backslash \ character |

    When get answers of typeql match
      """
      match $x "Normal" isa name;
      """
    Then answer size is: 1

    When get answers of typeql match
      """
      match $x "With quote \" character" isa name;
      """
    Then answer size is: 1

    When get answers of typeql match
      """
      match $x "With backslash \\ character" isa name;
      """
    Then answer size is: 1

flyingsilverfin avatar Aug 03 '22 11:08 flyingsilverfin

I am writing a hard coded workaround to this issue atm, as shown here https://forum.vaticle.com/t/escape-hell-handling-string-values/166/8

brettforbes avatar Aug 24 '22 11:08 brettforbes

We have further considerations for when this issue comes back: #239 fixed this for now, but have to settle on a behaviour of how to actually insert a newline character (not the two characters "\n").

One proposal is that the parser should parse "\n" pair of raw text characters as one newline character, which is then serialisable to the server. In client-java, we have to use an unformatted-toString() to make sure that we don't introduce new tabs into strings with newlines in them with this solution (alternatively, we can introduce templated queries and serialise attribute values separately). On printing of an attribute value, we would print a newline character as a pair of "\n" characters, making this operation fully reversible. If the user wishes to write a pair of "\n" characters explicitly, they would write "\n". One interesting edge case here is that writing an actual newline (eg. "enter" key) in console leaves another way of getting a newline character into the database - perhaps this one should be banned?

flyingsilverfin avatar Oct 21 '22 14:10 flyingsilverfin

hmm, so your suggestion would work, but currently it works anyway, just using a substitution approach as shown here https://forum.vaticle.com/t/escape-hell-handling-string-values/166/12

It just took me some time to work it out, and its the kind of thing that would be useful in documentation , for the less talented people like myself (i.e. why i posted the fix in the discussion forum)

so certainly this should be way down the list of priorities, and I would need to change that code when you make the change

brettforbes avatar Oct 22 '22 02:10 brettforbes