TextMate lacks means to match a pattern exactly one time

Open msftrncs opened this issue 6 years ago • 5 comments

TextMate grammars seem to lack a way to indicate that a pattern may be matched only once within a context, similar to the regex ? quantifier (zero or one occurrence). Without this, some languages are difficult to scope correctly.

Take PowerShell, for instance, which has no truly reserved keywords. PowerShell supports the if, elseif and else keywords for flow control and, as in many languages, else is allowed only once after an if statement. However, because keywords are not reserved, elseif and else can also be used as command names, and only context distinguishes the two. Within the context of an if statement, elseif and else act as keywords, and else, or anything that is not elseif, terminates the if context. Additionally, a () condition group and a {} statement block are required for if and elseif, and a {} block is required for else (though that matters less, since the if context has effectively ended by then).

# general syntax (not including comments and line breaks)
if (condition) {statement} elseif (condition-elseif) {statement-elseif} else {statement-else}

# demonstration
# `else another-command` is actually another command (native executable or user defined function)
if(1){do-something}else{do-something else}else another-command

To complicate things further, the state in which the grammar reaches this statement is unknown, because there are multiple ways to arrive at it. This rules out a forced exit strategy, whereby the grammar makes a specific rule keep progressing until it is safe to back out of the current stack.

Example

# hashtable using `if` in assignment
@{
    key = if (cond) {statement}
    key2 = if (cond) {statement} else {statement else}
    else = 'const' # else here is valid hashtable literal key name, because `if` context ended above.
    key3 = if (condition) {statement}
    else {another statement} # `if` context was not closed in previous line so it continued to this line
}

(Note that the grammar GitHub uses differentiates the else hash key above only by the presence of an =, not by the context of the if statement.)

I cannot find a way to formulate a TextMate grammar that properly describes this situation. I think this is because begin and match rules lack a property such as applyPatternOnce, which would limit a pattern to matching only once in the current stack scope.

  • else should only be allowed once per if context.
  • () condition should be required but only once per if or elseif subcontext.
  • {} statement block should be required, once per if, elseif or else subcontext, and for if and elseif only after the () condition.
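
To make the three rules above concrete, here is a minimal sketch of the same bookkeeping written as an ordinary state machine in TypeScript. It is neither PowerShell's parser nor anything a TextMate grammar can express today. The names IfState, tokenize and classifyElse are hypothetical, and the tokenizer is deliberately naive (no comments, no strings, and nothing inside brackets is classified).

```typescript
// Hypothetical sketch: not PowerShell's parser and not a TextMate feature.
type IfState = "none" | "needCondition" | "needBlock" | "afterBlock" | "needElseBlock";

// Split into words and bracket characters; whitespace and line breaks are dropped.
function tokenize(src: string): string[] {
  return src.match(/[(){}]|[^\s(){}]+/g) ?? [];
}

// Label each top-level `else` / `elseif` as "keyword" or "command".
function classifyElse(src: string): string[] {
  const labels: string[] = [];
  let state: IfState = "none";
  let depth = 0;   // bracket nesting depth
  let opener = ""; // bracket that opened the current top-level group
  for (const tok of tokenize(src)) {
    if (tok === "(" || tok === "{") {
      if (depth === 0) opener = tok;
      depth++;
      continue;
    }
    if (tok === ")" || tok === "}") {
      if (depth === 0) { state = "none"; continue; } // stray closer ends any context
      depth--;
      if (depth === 0) {
        // A completed top-level group advances the `if` context at most one step.
        if (opener === "(" && state === "needCondition") state = "needBlock";
        else if (opener === "{" && state === "needBlock") state = "afterBlock";
        else state = "none"; // the block after `else` (or anything unexpected) ends the context
      }
      continue;
    }
    if (depth > 0) continue; // nested content is skipped in this sketch
    const word = tok.toLowerCase();
    if (word === "if") {
      state = "needCondition";
    } else if (word === "elseif" || word === "else") {
      const isKeyword = state === "afterBlock";
      labels.push(word + (isKeyword ? ":keyword" : ":command"));
      if (!isKeyword) state = "none";
      else state = word === "else" ? "needElseBlock" : "needCondition";
    } else {
      state = "none"; // any other token terminates the `if` context
    }
  }
  return labels;
}
```

Running classifyElse on the demonstration line labels the first top-level else as a keyword and the second as a command. The point of the sketch is that each transition fires once and then moves on, which is exactly the "match once" behaviour the grammar cannot state.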

Grammar constructed so far (partial file only). If testing, use only empty conditions and empty statement blocks, with no comments: the subsections that handle those are omitted here because they are not relevant to this issue.

{
	"patterns": [
		{
			"comment": "else,elseif: only after if,elseif",
			"begin": "(?i)(?=if[\\s{(,;&|)}])",
			"end": "(?!\\G)",
			"patterns": [
				{
					"include": "#ifStatement"
				}
			]
		}
	],
	"repository": {
		"ifStatement": {
			"comment": "else,elseif: only after if,elseif",
			"begin": "\\G(?i:(if)|(elseif))(?=[\\s{(,;&|)}])",
			"beginCaptures": {
				"1": {
					"name": "keyword.control.if.powershell"
				},
				"2": {
					"name": "keyword.control.if-elseif.powershell"
				}
			},
			"end": "(?=.|$)",
			"applyEndPatternLast": true,
			"patterns": [
				{
					"include": "#advanceToToken"
				},
				{
					"begin": "(?<![)}])(?=\\()",
					"end": "(?=.|$)",
					"applyEndPatternLast": true,
					"patterns": [
						{
							"begin": "\\G\\(",
							"beginCaptures": {
								"0": {
									"name": "punctuation.section.group.begin.powershell"
								}
							},
							"end": "\\)",
							"endCaptures": {
								"0": {
									"name": "punctuation.section.group.end.powershell"
								}
							},
							"name": "meta.if-condition.powershell",
							"patterns": [
								{
									"comment": "`;` not permitted here",
									"match": ";",
									"name": "invalid.source.powershell"
								},
								{
									"include": "#command_mode"
								}
							]
						},
						{
							"begin": "(?<=\\))(?=[\\s#]|<#|`\\s|{)",
							"end": "(?=.|$)",
							"applyEndPatternLast": true,
							"patterns": [
								{
									"include": "#advanceToToken"
								},
								{
									"begin": "(?<!})(?={)",
									"end": "(?=.|$)",
									"applyEndPatternLast": true,
									"patterns": [
										{
											"begin": "\\G\\{",
											"beginCaptures": {
												"0": {
													"name": "punctuation.section.braces.begin.powershell"
												}
											},
											"end": "}",
											"endCaptures": {
												"0": {
													"name": "punctuation.section.braces.end.powershell"
												}
											},
											"name": "meta.statements.if-condition.powershell",
											"patterns": [
												{
													"include": "$self"
												}
											]
										},
										{
											"begin": "(?<=})(?=[\\s#]|<#|`\\s)",
											"end": "(?=.|$)",
											"applyEndPatternLast": true,
											"patterns": [
												{
													"include": "#advanceToToken"
												}
											]
										}
									]
								}
							]
						}
					]
				},
				{
					"begin": "(?i:else)(?=[\\s{(,;&|)}])",
					"beginCaptures": {
						"0": {
							"name": "keyword.control.if-else.powershell"
						}
					},
					"end": "(?=.|$)",
					"applyEndPatternLast": true,
					"patterns": [
						{
							"include": "#advanceToToken"
						},
						{
							"begin": "(?<!}){",
							"beginCaptures": {
								"0": {
									"name": "punctuation.section.braces.begin.powershell"
								}
							},
							"end": "}",
							"endCaptures": {
								"0": {
									"name": "punctuation.section.braces.end.powershell"
								}
							},
							"name": "meta.statements.if-else-condition.powershell",
							"patterns": [
								{
									"include": "$self"
								}
							]
						}
					]
				},
				{
					"begin": "(?i)(?=elseif[\\s{(,;&|)}])",
					"end": "(?!\\G)",
					"patterns": [
						{
							"include": "#ifStatement"
						}
					]
				}
			]
		},
		"advanceToToken": {
			"comment": "consume spaces and comments and line ends until the next token appears",
			"begin": "\\G(?=[\\s#]|<#|`\\s)",
			"end": "(?!\\s)(?!$)",
			"applyEndPatternLast": true,
			"patterns": [
				{
					"comment": "useless escape, and doesn't count as a token",
					"match": "`\\s",
					"name": "invalid.character.escape.powershell"
				},
				{
					"include": "#commentLine"
				},
				{
					"include": "#commentBlock"
				}
			]
		}
	}
}

For reference, the complete grammar: https://github.com/msftrncs/PowerShell.tmLanguage/blob/argumentmode_2ndtry/powershell.tmLanguage.json

msftrncs, Sep 24 '19 03:09

This is very well written, and describes a hard limitation of TextMate grammars. This limitation appears to be solvable via some new TM construct, but IMHO there is an entire class of cases that TM cannot handle, simply because it is a top-down single-pass parser.

There are cases where some bit of information lower in the file ends up influencing a coloring decision made at the beginning of the file. Context-sensitive keywords are a great example of that.

We do not have any plans to expand or diverge from the TextMate grammar implementation, and we try as much as possible to align to TextMate... We plan to solve context sensitive coloring via special purpose semantic coloring API in VS Code.

alexdima, Nov 20 '19 09:11

@alexdima Thanks for the info, especially this part:

"We plan to solve context sensitive coloring via special purpose semantic coloring API in VS Code."

Is there some sort of discussion thread or proposal where I can follow the whereabouts of this feature?

Right now I imagine having a syntax colouring/classification provider as one of the Programmatic Language Features. In fact, that's the first place I looked for a customised syntax classification API as an alternative to the simpler but sometimes lacking TextMate grammars. Such a provider would take a TextDocument (and possibly some other parameters) and output a collection of syntactic scope objects, each containing the scope name (just like in TextMate, so that themes can be easily reused) and the range in the TextDocument it applies to.
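
As a rough illustration of the data shape described above, a sketch in TypeScript might look like the following. ScopeRange, ScopeProvider and commentProvider are hypothetical names made up for this example and are not part of any VS Code API. The toy provider takes plain text instead of a TextDocument so that it runs standalone.

```typescript
// Hypothetical shape; these names are not part of any VS Code API.
interface ScopeRange {
  scope: string; // TextMate-style scope name, so themes could be reused
  line: number;  // zero-based line index
  start: number; // zero-based start column (inclusive)
  end: number;   // end column (exclusive)
}

type ScopeProvider = (documentText: string) => ScopeRange[];

// Toy provider: treats everything from `#` to the end of the line as a comment.
// It ignores strings and block comments, so it is only an illustration.
const commentProvider: ScopeProvider = (documentText) => {
  const ranges: ScopeRange[] = [];
  documentText.split(/\r?\n/).forEach((text, line) => {
    const start = text.indexOf("#");
    if (start !== -1) {
      ranges.push({ scope: "comment.line.powershell", line, start, end: text.length });
    }
  });
  return ranges;
};
```

Because the provider is ordinary code, it could keep whatever state it needs across lines, which is what a declarative grammar makes difficult.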

I think this would cover pretty much any reasonable syntax classification scenario and also fit well with other programmatic language features (the code analyzer used for implementing something like "Go To Definition" or "Find all references" would likely be easily reused for syntax classification).

Alphish, Feb 17 '20 08:02

Never mind, I've found the related issue, which leads to more info on the matter, and it looks promising, both the API itself and how it works in the recent feature preview: https://github.com/microsoft/vscode/issues/86415

It's a good time to be a grammar extension author, I suppose.

Alphish, Feb 19 '20 18:02

TextMate in general is very limited. Are you aware of a way to just use a JavaScript file and write your own tokeniser for a custom syntax highlighting plugin? I am trying to set up the whole thing in one JSON file, but that is pure pain.

Anatoly03, Sep 29 '21 11:09

@Anatoly03 I suggest checking out the Semantic Highlighting Guide along with the Semantic Token Sample.

Based on these, I could successfully write a silly tokeniser (in TypeScript) that coloured "good" words green and "bad" words orange (I even made it so that good and bad words are defined in a file at the top-level of the workspace). So I'd say it's a pretty good option when you don't want to play TextMate grammar declarative golf.
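
For readers who want a feel for what such a tokeniser involves, here is a minimal standalone sketch of the classification step only. It is not the code described above: WordToken and classifyWords are hypothetical names, the word lists are passed in as arrays rather than read from a workspace file, and the VS Code wiring (registering a semantic tokens provider and handing it each token) is omitted so the block runs on its own.

```typescript
// Hypothetical sketch of the classification step; VS Code wiring is omitted.
interface WordToken {
  line: number;   // zero-based line index
  start: number;  // zero-based column where the word begins
  length: number; // number of characters in the word
  type: "good" | "bad";
}

function classifyWords(text: string, good: string[], bad: string[]): WordToken[] {
  const goodWords = good.map((w) => w.toLowerCase());
  const badWords = bad.map((w) => w.toLowerCase());
  const tokens: WordToken[] = [];
  text.split(/\r?\n/).forEach((lineText, line) => {
    const wordPattern = /[A-Za-z]+/g; // fresh regex per line so lastIndex starts at 0
    let m: RegExpExecArray | null;
    while ((m = wordPattern.exec(lineText)) !== null) {
      const word = m[0].toLowerCase();
      if (goodWords.indexOf(word) !== -1) {
        tokens.push({ line, start: m.index, length: m[0].length, type: "good" });
      } else if (badWords.indexOf(word) !== -1) {
        tokens.push({ line, start: m.index, length: m[0].length, type: "bad" });
      }
      // Words on neither list produce no token.
    }
  });
  return tokens;
}
```

Each token carries a line, a start column, a length and a type, which is roughly the information a semantic highlighting provider reports per token. Mapping "good" and "bad" to actual colours would be left to the extension and the theme.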

Alphish, Sep 30 '21 17:09