Run regexes for TM grammars in native JS for perf
A couple of years ago in #165506, @fabiospampinato raised the idea of running TextMate grammars in JS using an Oniguruma-to-JS regex transpiler, both for performance and to potentially remove the large Oniguruma dependency. However, the benefit was hypothetical at the time since no regex transpiler written in JS could actually do this. A Ruby library was used as a proof point, but it wouldn't have worked: it's written in Ruby, transpiles Onigmo rather than Oniguruma, wasn't designed to support the way regexes are used in TextMate grammars, and wasn't robust enough to cover the long tail of grammars that often include complex regexes relying on Oniguruma edge cases.
A library now exists (Oniguruma-To-ES) that solves these problems. It's lightweight and has been used for a while by the Shiki library, with support for the vast majority of TM grammars. Here's Shiki's compatibility list, which checks that its JS and WASM engines output identical highlighting results for Shiki's language samples. The issues with the handful of remaining unsupported grammars are well understood -- they are the result of bugs in the grammars (i.e., inclusion of an invalid Oniguruma regex), bugs in Oniguruma, or use of a few extremely rare features that can be supported in future versions or worked around.
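To make the transpilation concrete, here's a toy illustration (not the library's actual output or API) of what converting an Oniguruma-only construct to native JS regex source looks like, using Oniguruma's `\h` hex-digit shorthand, which JS regexes lack:

```javascript
// Illustrative only: a toy translator for two Oniguruma-only constructs.
// The real Oniguruma-To-ES library handles far more (possessive quantifiers,
// \G anchors, subroutine calls, character class intersection, etc.).
const tokenMap = new Map([
  ['\\h', '[0-9A-Fa-f]'],   // Oniguruma hex-digit shorthand
  ['\\H', '[^0-9A-Fa-f]'],  // its negation
]);

function toyTranspile(onigPattern) {
  // Naive token-by-token replacement; a real transpiler parses the pattern.
  return onigPattern.replace(/\\[hH]/g, (m) => tokenMap.get(m));
}

const jsSource = toyTranspile('#\\h{6}'); // Oniguruma pattern for a hex color
const re = new RegExp(jsSource);
console.log(re.test('#1a2b3c')); // true
```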
Of course, VS Code wants to be a good OSS citizen and not break any grammars. Oniguruma-To-ES (as of v2.0) is up to the challenge, at a deep level. Perhaps, as a starting point, a few grammars that offer better performance could be marked to use JavaScript rather than Oniguruma, and then if everything goes smoothly its use could be expanded to additional grammars.
In a basic benchmark of Shiki's JS vs WASM engine (using precompiled versions of the grammars that had been pre-run through Oniguruma-To-ES using these options), the JS engine performed faster in some cases including the following examples (all with identical highlighting results compared to the WASM engine):
- Python: ~8.5x faster.
- MDC: ~13.5x faster.
- Markdown: ~3.3x faster.
- CSS: ~2.5x faster.
- SCSS: ~3.5x faster.
- Bash: ~2.6x faster.
- Kotlin: ~1.2x faster.
- Perl: ~1.4x faster.
- PHP: ~1.3x faster.
- Go: ~1.4x faster.
- Objective-C: ~1.3x faster.
These times are based on processing the language samples that Shiki provides; e.g. here's the Kotlin sample.
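For context, comparisons like these generally come from a simple wall-clock harness of roughly this shape (a hedged sketch; the two workloads below are stand-ins, not the actual Shiki engines):

```javascript
// A minimal timing harness of the kind such comparisons rely on: run each
// engine over the same sample repeatedly and compare wall-clock time.
function timeIt(label, fn, iterations = 100) {
  const start = performance.now();
  for (let i = 0; i < iterations; i++) fn();
  const ms = performance.now() - start;
  console.log(`${label}: ${ms.toFixed(1)}ms`);
  return ms;
}

// Stand-in workloads for demonstration:
const sample = 'fun main() { println("Hello") }\n'.repeat(1000);
const aMs = timeIt('engine A', () => sample.match(/\w+/g));
const bMs = timeIt('engine B', () => sample.split(/\W+/));
console.log(`ratio: ${(aMs / bMs).toFixed(2)}x`);
```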
The JS engine with precompiled regexes is not faster than Oniguruma (via WASM) with all grammars, but there are optimization opportunities (this issue includes an example) that might increase the number of cases where it's faster.
Also note that Oniguruma-To-ES is faster than Oniguruma via WASM with some grammars even when transpiling regexes at runtime (without pre-running a grammar's regexes through it). In fact, Shiki doesn't pre-compile when using its standard JS engine. So it's not necessary to have separate grammar files (with an extra build step) to get some of the benefit.
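One way a runtime engine can amortize that transpilation cost is to compile each distinct pattern once and cache it, so the cost is paid per pattern rather than per scan. A hypothetical sketch (not Shiki's actual implementation; `transpile` stands in for Oniguruma-To-ES's conversion step):

```javascript
// Hypothetical sketch: transpile and compile each grammar pattern only on
// first use, then serve subsequent lookups from a cache.
function makeCachingCompiler(transpile) {
  const cache = new Map();
  return function compile(onigPattern) {
    let re = cache.get(onigPattern);
    if (!re) {
      re = new RegExp(transpile(onigPattern), 'g');
      cache.set(onigPattern, re);
    }
    return re;
  };
}

// Identity "transpiler" for demonstration:
const compile = makeCachingCompiler((p) => p);
console.log(compile('\\d+') === compile('\\d+')); // true: second call is a cache hit
```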
I've updated the comment above based on updates to the library and more representative perf testing.
Thanks, this sounds very promising!
This feature request is now a candidate for our backlog. The community has 60 days to upvote the issue. If it receives 20 upvotes we will move it to our backlog. If not, we will close it. To learn more about how we handle feature requests, please see our documentation.
Happy Coding!
:slightly_smiling_face: This feature request received a sufficient number of community upvotes and we moved it to our backlog. To learn more about how we handle feature requests, please see our documentation.
Happy Coding!
With the latest library and TM grammar updates, Shiki's JS regex engine (built on Oniguruma-To-ES) supports 100% (all 222) of Shiki's built-in languages. See: https://shiki.style/references/engine-js-compat
Is anyone working on this?
I'm currently trying to monkey-patch ./vscode/node_modules/vscode-oniguruma/release/main.js for testing.
Update: My monkey patch works well now! See https://gist.github.com/kkocdko/b54dcee692deb67a13ec811fba5282c0
Update: Is this really faster? To rule out a mistake on my end, I tried Shiki's demo at https://textmate-grammars-themes.netlify.app (repo); switching to JavaScript mode in the top bar seems even slower than Oniguruma.
The test input is this. I pressed Enter twice to insert newlines, then Backspace twice, and used F12 to record performance. The result:
Oniguruma (367ms):
JavaScript (2.60s):
The result is strongly related to the highlighted language, of course: more regex patterns means more match attempts. I see that the demo's findNextMatchSync also uses a simple for loop. Oniguruma can do a single match per call and dig into the engine to find which sub-expression matched, but in JS I can't find a way to do this.
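For reference, the scan loop being described looks roughly like this (a simplified sketch, not vscode-textmate's actual implementation, which also handles anchors, caching, and capture indices):

```javascript
// Try every rule's regex from `startPos` and return the earliest match plus
// the index of the rule that produced it. Ties go to the earlier rule, and a
// match at startPos can't be beaten, so we can exit early.
function findNextMatch(regexes, text, startPos) {
  let best = null;
  for (let i = 0; i < regexes.length; i++) {
    const re = regexes[i];
    re.lastIndex = startPos; // requires the 'g' or 'y' flag
    const m = re.exec(text);
    if (m && (best === null || m.index < best.start)) {
      best = { ruleIndex: i, start: m.index, match: m[0] };
      if (m.index === startPos) break; // cannot do better than here
    }
  }
  return best;
}

const rules = [/\d+/g, /[a-z]+/g];
console.log(findNextMatch(rules, 'abc 123', 0)); // { ruleIndex: 1, start: 0, match: 'abc' }
```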
Update: it seems that named capturing groups can do this? I'm trying...
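The named-group idea, sketched out (hypothetical, with synthetic group names `r0`, `r1`, ...; real grammars would also need handling for group-name collisions, backreference renumbering, and per-regex flags):

```javascript
// Merge all rule patterns into one alternation, one named group per rule, so
// a single exec call reveals which rule matched and where.
function combineRules(patterns) {
  const source = patterns.map((p, i) => `(?<r${i}>${p})`).join('|');
  const combined = new RegExp(source, 'g');
  return function findNextMatch(text, startPos) {
    combined.lastIndex = startPos;
    const m = combined.exec(text);
    if (!m) return null;
    // The one non-undefined named group identifies the winning rule.
    const name = Object.keys(m.groups).find((k) => m.groups[k] !== undefined);
    return { ruleIndex: Number(name.slice(1)), start: m.index, match: m[0] };
  };
}

const find = combineRules(['\\d+', '[a-z]+']);
console.log(find('abc 123', 0)); // { ruleIndex: 1, start: 0, match: 'abc' }
```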
Update: IMO we should prefer tree-sitter instead of fighting with this ugly TextMate.
Probably should look at:
- https://github.com/shikijs/shiki
- https://github.com/microsoft/vscode-textmate
- https://github.com/microsoft/vscode-oniguruma
From the original post:
> In a basic benchmark of Shiki's JS vs WASM engine (using precompiled versions of the grammars that had been pre-run through Oniguruma-To-ES using these options), the JS engine performed comparably for many grammars, faster for some, and slower for others. [...] These times are based on processing the language samples that Shiki provides
@kkocdko
> Is this really faster? [...] The result is strongly related to the highlighted language, of course: more regex patterns means more match attempts. I see that the demo's `findNextMatchSync` also uses a simple for loop. For Oniguruma, it can do a match only once per call
Native Oniguruma has some significant advantages and native JS has other advantages.
Note that I was using Shiki's precompiled versions of grammars (with Shiki's createJavaScriptRawEngine), and it sounds like you're not doing that here. Shiki's playground site doesn't use the precompiled grammars. But anecdotally, even on the playground page you linked to, I'm seeing faster numbers for JS highlighting than for Oniguruma via the timer that the playground self reports at the bottom. It depends on the language and sample, though.
Precompiled grammars avoid the need to transpile the regexes of a TM grammar at runtime. That said, there's an open Shiki issue for precompiled grammars (https://github.com/shikijs/shiki/issues/918, not known when I first posted here) that prevents them from working correctly with some languages. It can be fixed but hasn't been prioritized since the standard createJavaScriptRegexEngine (which transpiles regexes at runtime) already avoids the need to download the large WASM bundle, and avoiding WASM is often the main reason to use Shiki's JS engine. Things are different for VSCode, since presumably it won't actually be able to remove vscode-oniguruma. So perf would presumably be the main benefit, but perf characteristics are different per grammar. Note that interest from VS Code in using precompiled JS grammars could accelerate Shiki support for the issue I linked to.
@slevithan
> Note that I was using Shiki's precompiled versions of grammars (with Shiki's createJavaScriptRawEngine), and it sounds like you're not doing that here.
Yep, but that only makes a difference at engine creation. I think after the engine is created it doesn't do any more regex compilation?
> even on the playground page you linked to, I'm seeing faster numbers for JS highlighting than for Oniguruma via the timer that the playground self reports at the bottom. It depends on the language and sample, though.
Thanks for your hint, I never noticed this timer before, hah. After some testing, I found that the JS engine surpasses Oniguruma on languages with fewer rules, like CSS, but for complex languages like TS/C++/Rust it is much slower. In the official demo with prefilled content, Oniguruma is slow at startup, but if you paste the content 32 times, you'll find the JS engine becomes slower than Oniguruma.
> After some testing, I found that the JS engine surpasses Oniguruma on languages with fewer rules, like CSS, but for complex languages like TS/C++/Rust it is much slower. In the official demo with prefilled content, Oniguruma is slow at startup, but if you paste the content 32 times, you'll find the JS engine becomes slower than Oniguruma.
This sounds like a reason to close this feature request, no?
@slevithan can you verify or refute that claim?
If there is not a significant advantage, it will not be worth adopting a different library, as chances are high that it would break syntax highlighting for several languages.
There are optimizations not used in the testing above (precompiling the grammars, applying it progressively, etc.), and I mentioned up front that it would be slower for some grammars, so it should only be enabled for specific grammars where it is faster. However, since VS Code wouldn't be able to remove Oniguruma altogether and therefore wouldn't get that additional benefit, I'll go ahead and close this: I think adopting it in a way that is strictly beneficial might introduce more complexity and uncertainty than is justified for VS Code.