
Run regexes for TM grammars in native JS for perf

slevithan opened this issue 1 year ago

A couple of years ago in #165506, @fabiospampinato raised the idea of running TextMate grammars in JS using an Oniguruma-to-JS regex transpiler, both for performance and to potentially remove the large Oniguruma dependency. However, the benefit was hypothetical at the time since no regex transpiler written in JS could actually do this, so a Ruby library was used as a proof point. That library wouldn't have worked in practice: it's written in Ruby, it transpiles Onigmo rather than Oniguruma, it wasn't designed to support the way regexes are used in TextMate grammars, and it wasn't robust enough to cover the long tail of grammars that often include complex regexes relying on Oniguruma edge cases.

A library now exists (Oniguruma-To-ES) that solves these problems. It's lightweight and has been used for a while by the Shiki library, with support for the vast majority of TM grammars. Here's Shiki's compatibility list, which checks that its JS and WASM engines output identical highlighting results for Shiki's language samples. The issues with the handful of remaining unsupported grammars are well understood -- they are the result of bugs in the grammars (i.e., inclusion of an invalid Oniguruma regex), bugs in Oniguruma, or use of a few extremely rare features that can be supported in future versions or worked around.
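For a concrete sense of what such transpilation involves, here is a hand-written sketch of the kind of rewriting required for a couple of Oniguruma-only constructs. These translations are illustrative only and are not the actual output of Oniguruma-To-ES:

```javascript
// Hand-written examples of translating Oniguruma-only constructs into
// native JS regexes. Illustrative only -- not Oniguruma-To-ES's output.

// Oniguruma: \h{2}  -- \h (hex digit) has no native JS shorthand.
const hexPair = /[0-9A-Fa-f]{2}/;

// Oniguruma: [[:alpha:]]+  -- POSIX classes aren't supported in JS.
const alphaRun = /[A-Za-z]+/;

console.log(hexPair.test('3f'));          // true
console.log(alphaRun.exec('foo_bar')[0]); // 'foo'
```

The hard part for a real transpiler is the long tail of semantics that have no direct JS equivalent (e.g. differing backreference and case-folding behavior), which is what Oniguruma-To-ES handles.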

Of course, VS Code wants to be a good OSS citizen and not break any grammars. Oniguruma-To-ES (as of v2.0) is up to the challenge, at a deep level. Perhaps, as a starting point, a few grammars that offer better performance could be marked to use JavaScript rather than Oniguruma, and then if everything goes smoothly its use could be expanded to additional grammars.

In a basic benchmark of Shiki's JS vs WASM engine (using precompiled versions of the grammars that had been pre-run through Oniguruma-To-ES using these options), the JS engine performed comparably for many grammars, faster for some, and slower for others. Cases where it was faster (all with identical highlighting results compared to the WASM engine) include:

  • Python: ~8.5x faster.
  • MDC: ~13.5x faster.
  • Markdown: ~3.3x faster.
  • CSS: ~2.5x faster.
  • SCSS: ~3.5x faster.
  • Bash: ~2.6x faster.
  • Kotlin: ~1.2x faster.
  • Perl: ~1.4x faster.
  • PHP: ~1.3x faster.
  • Go: ~1.4x faster.
  • Objective-C: ~1.3x faster.

These times are based on processing the language samples that Shiki provides; e.g. here's the Kotlin sample.

The JS engine with precompiled regexes is not faster than Oniguruma (via WASM) with all grammars, but there are optimization opportunities (this issue includes an example) that might increase the number of cases where it's faster.

Also note that Oniguruma-To-ES is faster than Oniguruma via WASM with some grammars even when transpiling regexes at runtime (without pre-running a grammar's regexes through it). In fact, Shiki doesn't pre-compile when using its standard JS engine. So it's not necessary to have separate grammar files (with an extra build step) to get some of the benefit.
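To illustrate why runtime transpilation can be affordable: each distinct source pattern only needs to be converted once per session, so the cost amortizes as the same grammar rules are reused across lines. A minimal caching sketch (the `transpile` function here is a trivial stand-in, not the library's API):

```javascript
// Sketch: convert each Oniguruma pattern at most once and cache the result.
// `transpile` is a trivial stand-in for a real Oniguruma -> JS conversion
// step; the caching pattern is the point.
const cache = new Map();

function transpile(onigPattern) {
  // Stand-in: pretend the pattern is already valid JS regex syntax.
  return new RegExp(onigPattern);
}

function getRegex(onigPattern) {
  let re = cache.get(onigPattern);
  if (re === undefined) {
    re = transpile(onigPattern);
    cache.set(onigPattern, re);
  }
  return re;
}

const a = getRegex('\\d+');
const b = getRegex('\\d+'); // cache hit: same RegExp object
```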

slevithan avatar Jan 09 '25 02:01 slevithan

I've updated the comment above based on updates to the library and more representative perf testing.

slevithan avatar Jan 15 '25 13:01 slevithan

Thanks, this sounds very promising!

hediet avatar Feb 13 '25 18:02 hediet

This feature request is now a candidate for our backlog. The community has 60 days to upvote the issue. If it receives 20 upvotes we will move it to our backlog. If not, we will close it. To learn more about how we handle feature requests, please see our documentation.

Happy Coding!

:slightly_smiling_face: This feature request received a sufficient number of community upvotes and we moved it to our backlog. To learn more about how we handle feature requests, please see our documentation.

Happy Coding!

With the latest library and TM grammar updates, Shiki's JS regex engine (built on Oniguruma-To-ES) supports 100% (all 222) of Shiki's built-in languages. See: https://shiki.style/references/engine-js-compat

slevithan avatar Aug 02 '25 19:08 slevithan

Is anyone working on this?

I'm currently trying to make a monkey patch to ./vscode/node_modules/vscode-oniguruma/release/main.js for testing.


Update: My monkey patch works well now! See https://gist.github.com/kkocdko/b54dcee692deb67a13ec811fba5282c0


Update: Is this really faster? To rule out mistakes on my end, I tried Shiki's demo here: https://textmate-grammars-themes.netlify.app (repo). In the top bar, switching to the JavaScript engine seems even slower than Oniguruma.

The test input is this . I pressed Enter twice to insert newlines, then Backspace twice, and used F12 to record the performance. The result:

Oniguruma (367ms):

[screenshot: performance recording in Oniguruma mode]

JavaScript (2.60s):

[screenshot: performance recording in JavaScript mode]

The result depends heavily on the highlighted language, of course: more regex patterns mean more match attempts. I see that the demo's findNextMatchSync also uses a simple for loop. With Oniguruma, it can match only once per call and dig into the engine to find which sub-expression matched, but in JS I can't find a way to do this.
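For context, the "simple for loop" approach can be sketched roughly like this (a minimal illustration with hypothetical names, not the actual Shiki or vscode-textmate code): every pattern is tried from the scan position, and the earliest match wins.

```javascript
// Minimal sketch of a findNextMatchSync-style scanner: try every pattern
// from the scan position and keep the earliest match. Names are
// hypothetical; this is not the actual Shiki/vscode-textmate code.
function findNextMatch(patterns, text, startPos) {
  let best = null;
  for (let i = 0; i < patterns.length; i++) {
    // Recompile with the 'g' flag so lastIndex controls the scan start.
    const re = new RegExp(patterns[i].source, 'g');
    re.lastIndex = startPos;
    const m = re.exec(text);
    if (m && (best === null || m.index < best.index)) {
      best = { index: m.index, patternIndex: i, match: m[0] };
    }
  }
  return best; // null when nothing matches after startPos
}

const result = findNextMatch([/\d+/, /[a-z]+/], 'abc 123', 0);
// result: { index: 0, patternIndex: 1, match: 'abc' }
```

The cost grows with the pattern count, which is why grammars with many rules hit this loop hardest.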


Update: it seems that named capturing groups can do this? I'm trying...
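The named-capturing-group idea, sketched out (illustrative only; the group names and combining scheme are mine, and it only works when every pattern can be expressed as a native JS regex): merge all patterns into one alternation so a single exec both finds the earliest match and identifies which original pattern produced it.

```javascript
// Sketch of the named-capturing-group idea: merge all patterns into one
// alternation so a single exec() finds the earliest match and tells us
// which original pattern matched. Illustrative only; names are mine.
const combined = /(?<num>\d+)|(?<word>[a-z]+)|(?<punct>[.,;])/g;

function nextToken(text, startPos) {
  combined.lastIndex = startPos;
  const m = combined.exec(text);
  if (!m) return null;
  // Exactly one named group is defined: the one that matched.
  const kind = Object.keys(m.groups).find((k) => m.groups[k] !== undefined);
  return { kind, text: m[0], index: m.index };
}

const tok = nextToken('abc 123.', 0);
// tok: { kind: 'word', text: 'abc', index: 0 }
```

One caveat: at a given position, ties are broken by alternative order, and any capturing groups inside the original patterns would need renaming to avoid collisions in the combined regex.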


Update: IMO we should prefer tree-sitter instead of fighting with this ugly TextMate.

kkocdko avatar Sep 21 '25 07:09 kkocdko

Probably should look at:

  • https://github.com/shikijs/shiki
  • https://github.com/microsoft/vscode-textmate
  • https://github.com/microsoft/vscode-oniguruma

RedCMD avatar Sep 21 '25 07:09 RedCMD

From the original post:

In a basic benchmark of Shiki's JS vs WASM engine (using precompiled versions of the grammars that had been pre-run through Oniguruma-To-ES using these options), the JS engine performed comparably for many grammars, faster for some, and slower for others. [...] These times are based on processing the language samples that Shiki provides

@kkocdko

Is this really faster? [...] The result depends heavily on the highlighted language, of course: more regex patterns mean more match attempts. I see that the demo's findNextMatchSync also uses a simple for loop. With Oniguruma, it can match only once per call

Native Oniguruma has some significant advantages and native JS has other advantages.

Note that I was using Shiki's precompiled versions of grammars (with Shiki's createJavaScriptRawEngine), and it sounds like you're not doing that here. Shiki's playground site doesn't use the precompiled grammars. But anecdotally, even on the playground page you linked to, I'm seeing faster numbers for JS highlighting than for Oniguruma via the timer that the playground itself reports at the bottom. It depends on the language and sample, though.

Precompiled grammars avoid the need to transpile the regexes of a TM grammar at runtime. That said, there's an open Shiki issue for precompiled grammars (https://github.com/shikijs/shiki/issues/918, not known when I first posted here) that prevents them from working correctly with some languages. It can be fixed but hasn't been prioritized, since the standard createJavaScriptRegexEngine (which transpiles regexes at runtime) already avoids the need to download the large WASM bundle, and avoiding WASM is often the main reason to use Shiki's JS engine. Things are different for VS Code, since presumably it won't actually be able to remove vscode-oniguruma. So perf would presumably be the main benefit, but perf characteristics differ per grammar. Note that interest from VS Code in using precompiled JS grammars could accelerate Shiki support for the issue I linked to.

slevithan avatar Nov 21 '25 18:11 slevithan

@slevithan

Note that I was using Shiki's precompiled versions of grammars (with Shiki's createJavaScriptRawEngine), and it sounds like you're not doing that here.

Yep, but that only matters at engine creation, right? I think after the engine is created it doesn't compile regexes anymore?

even on the playground page you linked to, I'm seeing faster numbers for JS highlighting than for Oniguruma via the timer that the playground self reports at the bottom. It depends on the language and sample, though.

Thanks for the hint, I never noticed that timer before, hah. After some testing, I found that the JS engine surpasses Oniguruma on languages with fewer rules, like CSS, but for complex languages like TS/C++/Rust it is much slower. In the official demo with the prefilled content, Oniguruma is slow at startup, but if you copy-paste the content 32 times, you'll find that the JS engine starts to be slower than Oniguruma.

kkocdko avatar Nov 24 '25 03:11 kkocdko

After some testing, I found that the JS engine surpasses Oniguruma on languages with fewer rules, like CSS, but for complex languages like TS/C++/Rust it is much slower. In the official demo with the prefilled content, Oniguruma is slow at startup, but if you copy-paste the content 32 times, you'll find that the JS engine starts to be slower than Oniguruma.

This sounds like a reason to close this feature request, no?

@slevithan can you verify or refute that claim?

If there is not a significant advantage, it will not be worth it to adopt a different library, as chances are high that it would break syntax highlighting for several languages.

hediet avatar Jan 12 '26 11:01 hediet

There are optimizations not used in the testing above (precompiling the grammars, applying them progressively, etc.), and I mentioned up front that the JS engine would be slower for some grammars, so it should only be enabled for the specific grammars where it is faster. However, since VS Code wouldn't be able to remove Oniguruma altogether, it wouldn't get that additional benefit, and adopting this in a way that is strictly beneficial might introduce more complexity and uncertainty than is justified. So I'll go ahead and close this.

slevithan avatar Jan 12 '26 11:01 slevithan