Proposal: Multiple Values In Escape Sequences

Open SeriousBusiness101 opened this issue 2 years ago • 3 comments

Preferring escape sequences over raw UTF-8 in source is a common coding standard, one reason being security. Directional marks, dingbats, emojis, diacritics, logograms, notations, control characters... shouldn't or can't be printed in source files in many contexts. Currently, successive Unicode codepoints written as escape sequences look like this:

const a = "\u{a1f3b}\u{a1f3c}\u{a1f3d}\u{a1f3e}\u{a1f3f}";

The proposal is to support multiple values in escape sequences with this syntax:

const b = "\u{a1f3b\ a1f3c\ a1f3d\ a1f3e\ a1f3f}";

This is easier and safer to read and write. The backslash acts as the delimiter at the end of each codepoint. This would also apply to #17376 if accepted. See comment https://github.com/ziglang/zig/issues/17376#issuecomment-1745072369
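To make the intended meaning concrete, here is a minimal sketch of how the text between `{` and `}` could be expanded into the same bytes that the current syntax produces. The helper name `expandMultiEscape` is hypothetical and not part of the compiler or of std; the sketch assumes a Zig version whose std provides `std.mem.splitScalar` (0.11 or later), and it leans on `std.fmt.parseInt`, which is more permissive than a real tokenizer would be.

```zig
const std = @import("std");

/// Hypothetical helper: expands the text between `{` and `}` of the proposed
/// escape (for example `a1f3b\ a1f3c`) into UTF-8 bytes written to `out`.
fn expandMultiEscape(body: []const u8, out: []u8) ![]u8 {
    var len: usize = 0;
    var it = std.mem.splitScalar(u8, body, '\\');
    while (it.next()) |raw| {
        // Skip formatting spaces that follow `{` or a delimiter.
        var digits: []const u8 = raw;
        while (digits.len > 0 and digits[0] == ' ') digits = digits[1..];

        // An empty element (for example after a trailing `\`) fails here.
        const cp = try std.fmt.parseInt(u21, digits, 16);

        // utf8Encode rejects surrogate halves and values above 0x10FFFF.
        var buf: [4]u8 = undefined;
        const n = try std.unicode.utf8Encode(cp, &buf);

        if (len + n > out.len) return error.NoSpaceLeft;
        @memcpy(out[len..][0..n], buf[0..n]);
        len += n;
    }
    return out[0..len];
}

test "proposed form expands to the bytes of the current form" {
    var out: [32]u8 = undefined;
    const got = try expandMultiEscape("a1f3b\\ a1f3c\\ a1f3d\\ a1f3e\\ a1f3f", &out);
    try std.testing.expectEqualStrings("\u{a1f3b}\u{a1f3c}\u{a1f3d}\u{a1f3e}\u{a1f3f}", got);
}
```

Running `zig test` on the file compares the expanded bytes with the string `a` written in today's syntax.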

SeriousBusiness101, Oct 03 '23 14:10

To clear up ambiguities:

  • Escape sequences as described in the proposal cannot be multiline - that would be unnecessarily abstruse.
  • The backslash delimiter is used in place of a comma to distinguish it as part of an escape sequence within a literal. To avoid further confusion, a trailing delimiter (which would produce a \} sequence) may be forbidden. See #17584.
  • Spaces after { or after a delimiter should be respected as formatting. Presumably this won't necessitate zig fmt changes. Underscores as visual separators would be good for symmetry with number literals and for highlighting Unicode planes and ranges. https://github.com/ziglang/zig/issues/17376#issuecomment-1745072369
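The rules in this list can be sketched as a small validator over the text between `{` and `}`. Everything here is an assumption made for illustration: the function name `countElements`, the error names, and the exact treatment of underscores are hypothetical, and none of it reflects an existing tokenizer.

```zig
const std = @import("std");

const EscapeError = error{ Multiline, EmptyElement, InvalidCharacter };

/// Hypothetical check of the clarified rules. Returns the number of
/// codepoints listed between `{` and `}`.
fn countElements(body: []const u8) EscapeError!usize {
    var count: usize = 0;
    var digits: usize = 0; // hex digits seen in the current element
    for (body) |c| {
        switch (c) {
            // Rule 1: the escape sequence cannot span lines.
            '\n', '\r' => return error.Multiline,
            // Rule 2: backslash ends an element and must follow digits.
            '\\' => {
                if (digits == 0) return error.EmptyElement;
                count += 1;
                digits = 0;
            },
            // Rule 3: spaces are allowed only before an element's digits.
            ' ' => if (digits != 0) return error.InvalidCharacter,
            // Underscore separators are accepted once a digit has been seen
            // (a sketch: a trailing underscore is not rejected here).
            '_' => if (digits == 0) return error.InvalidCharacter,
            '0'...'9', 'a'...'f', 'A'...'F' => digits += 1,
            else => return error.InvalidCharacter,
        }
    }
    // A trailing delimiter (the `\}` case) or empty braces end up here.
    if (digits == 0) return error.EmptyElement;
    return count + 1;
}

test "clarified rules" {
    try std.testing.expectEqual(@as(usize, 2), try countElements("a1f3b\\ a1f3c"));
    try std.testing.expectError(error.EmptyElement, countElements("a1f3b\\"));
    try std.testing.expectError(error.Multiline, countElements("a1f3b\\\na1f3c"));
}
```

Rejecting the trailing delimiter in the validator keeps `\}` from ever appearing inside the braces, which is the confusion the second bullet wants to avoid.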

This proposal would also make it intuitive to handle Unicode grapheme clusters, ZWJ / VS15 / VS16 emojis, and other needs. The minutiae can be changed around, but I think this is approximately the right way to go about it.
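As an illustration of the grapheme cluster case, here is a ZWJ emoji sequence (man, ZWJ, woman, ZWJ, girl) written in today's syntax, with the proposed spelling shown only as a comment since it is not valid Zig. The constant name `family` is just an example.

```zig
const std = @import("std");

// Today's syntax: five separate escapes for one visible family emoji.
const family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}";

// Proposed spelling (not valid Zig today):
// const family = "\u{1F468\ 200D\ 1F469\ 200D\ 1F467}";

test "one cluster is five codepoints" {
    // Three 4-byte emoji plus two 3-byte zero width joiners.
    try std.testing.expectEqual(@as(usize, 18), family.len);
    try std.testing.expectEqual(@as(usize, 5), try std.unicode.utf8CountCodepoints(family));
}
```

With the proposed form, the single pair of braces would visually group the codepoints that make up one cluster.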

SeriousBusiness101, Dec 22 '23 00:12

Is there precedent for \ meaning multiple elements? I agree that it's an improvement, but just naively looking at this, I think simply delimiting elements via comma , (maybe with an optional space after it) would look even more readable to me personally. (I assume a parser already has to enter a special state to implement this, so I don't think giving special meaning to , in this context would affect performance much, if that was the reason.)

rohlem, Apr 08 '24 16:04

  • There is precedent for \\ denoting ~~multiple~~ sequenced elements in the form of multiline strings.
  • There is precedent for \ delimiting escape sequences within a string literal.
  • There is precedent for , being a character within a string literal.

@rohlem it's more about not introducing regrettable ambiguities into string literals.

SeriousBusiness101, Apr 13 '24 17:04