highlighter: Change the parsing approach to significantly improve performance
The performance of the current parsing approach can't be improved without changing the whole highlighter code. Because of this the change isn't without risk, but it's definitely worth a try. Please see the following comparison, which was done on the same host with `micro -profile`. Each test was stopped after a complete highlight of an 80x24 view, or aborted due to the known very long and inefficient recursion (DNF). Afterwards the top entry was printed with pprof.
| test file (see list below) | before | after |
|---|---|---|
| 1. | 1.93s 12.26% 12.26% 1.93s 12.26% unicode/utf8.DecodeRune | 10ms 100% 100% 10ms 100% runtime.futex |
| 2. | (DNF) 5.41s 14.64% 14.64% 5.41s 14.64% unicode/utf8.DecodeRune | 10ms 100% 100% 10ms 100% runtime.writeHeapBits.flush |
| 3. | 10ms 20.00% 20.00% 10ms 20.00% github.com/zyedidia/tcell/v2.(*tScreen).SetContent | 20ms 40.00% 40.00% 20ms 40.00% github.com/zyedidia/micro/v2/internal/util.CharacterCount |
| 4. | 10ms 20.00% 20.00% 10ms 20.00% crypto/md5.block | 10ms 25.00% 25.00% 10ms 25.00% gopkg.in/yaml%2ev2.yaml_parser_update_buffer |
| 5. | 10ms 50.00% 50.00% 10ms 50.00% gopkg.in/yaml%2ev2.yaml_parser_scan_plain_scalar | 10ms 33.33% 33.33% 10ms 33.33% runtime.(*consistentHeapStats).acquire |
| 6. | (DNF) 8.79s 27.01% 27.01% 14.53s 44.65% regexp.(*Regexp).tryBacktrack | 10ms 20.00% 20.00% 10ms 20.00% github.com/zyedidia/micro/v2/internal/util.DecodeCharacter |
1. tileset_env_test from #3115 (reduced version)
2. tileset_env_test from #3115
3. sample.md from #2839
4. sample.md from #2839 (with inserted `<script>`)
5. Firefox's new tab page (reduced version)
6. Firefox's new tab page
My available test files produced the same or even more complex highlighting results (e.g. pattern highlighting within regions in HTML files). Most probably the logic isn't in perfect shape yet, but it's definitely feasible as a proof of concept.
Please help to test and improve it with a review. It took a lot of days to get this far and it would be a shame if we didn't get this upstream in any form. :wink:
Fixes #2839 Fixes #3115 Fixes #3487 Closes #3242 Fixes #3765 Fixes #3776
By default Vim stops highlighting at a line length of 3k, but micro now doesn't (care):
- Can confirm it really solves https://github.com/zyedidia/micro/issues/3115. While I tried to test if it solves the issue I noticed nothing undesirable.
- Replaced my current executable with the executable of this PR (+ some other PR which I believe won't influence this PR), so I will say if I find regressions.
There is a problem, maybe related, maybe not. Create a file with 1MB of characters on one line and micro will freeze or slow but will work. It will start to work normally when no characters from that big line are displayed on the screen.
Create a file with 1MB of characters on one line and micro will freeze or slow but will work.
This is caused by https://github.com/zyedidia/micro/blob/master/internal/util/unicode.go#L69, specifically `utf8.DecodeRune`, which takes most of the time. The behavior was more or less the same before.
Create a file with 1MB of characters on one line and micro will freeze or slow but will work.
This is caused by https://github.com/zyedidia/micro/blob/master/internal/util/unicode.go#L69, specifically `utf8.DecodeRune`, which takes most of the time. The behavior was more or less the same before.
Yeah, but I am almost sure not fully. Why? I tried once to assume that my file will be ASCII only, so no Unicode, and I deleted the length checking (https://github.com/dustdfg/micro/tree/unicode_length_unchecked), and performance became better but not by much.
Yeah, but I am almost sure not fully. Why? I tried once to assume that my file will be ASCII only, so no Unicode, and I deleted the length checking (https://github.com/dustdfg/micro/tree/unicode_length_unchecked), and performance became better but not by much.
Yes, you're right. Due to the randomness it finds a lot of "regions" within this long line and starts slicing them one after another. The problem is that this slicing process is based on decoding the characters too, to identify the correct positions of the code points within the byte slice. Currently this isn't well optimized, but I've no good idea how this can be done better.
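For context, here is a minimal sketch (hypothetical, not this PR's actual code) of why slicing a region out of a line by character positions forces yet another decode pass over the bytes:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteOffsetOfRune is a hypothetical helper: to slice a byte-backed line
// at character index n, every character before it has to be decoded first.
func byteOffsetOfRune(line []byte, n int) int {
	off := 0
	for i := 0; i < n && off < len(line); i++ {
		_, size := utf8.DecodeRune(line[off:])
		off += size
	}
	return off
}

func main() {
	line := []byte(`foo="bär" baz='qux'`)
	// Slicing out the region between character 4 and character 9 costs two
	// decode passes; done for many regions in a 1MB line, this adds up fast.
	start, end := byteOffsetOfRune(line, 4), byteOffsetOfRune(line, 9)
	fmt.Printf("%s\n", line[start:end])
}
```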
Yeah, but I am almost sure not fully. Why? I tried once to assume that my file will be ASCII only, so no Unicode, and I deleted the length checking (https://github.com/dustdfg/micro/tree/unicode_length_unchecked), and performance became better but not by much.
Yes, you're right. Due to the randomness it finds a lot of "regions" within this long line and starts slicing them one after another. The problem is that this slicing process is based on decoding the characters too, to identify the correct positions of the code points within the byte slice. Currently this isn't well optimized, but I've no good idea how this can be done better.
Randomness?
Create a file with 1MB of characters on one line and micro will freeze or slow but will work.
Didn't you use something like the following to create the file?
```sh
cat /dev/urandom | tr -dc 'A-Za-z0-9!"#$%&'''()*+,-./:;<=>?@[]^_`{|}~' | head -c 1M > test.sh
```
That was what I used and due to that I will receive some `""` & `''` regions.
Create a file with 1MB of characters on one line and micro will freeze or slow but will work.
Didn't you use something like the following to create the file?
```sh
cat /dev/urandom | tr -dc 'A-Za-z0-9!"#$%&'''()*+,-./:;<=>?@[]^_`{|}~' | head -c 1M > test.sh
```
That was what I used and due to that I will receive some `""` & `''` regions.
Yeah, but I used base64, which (as far as I know) doesn't include any quote characters, + sed (to delete the \n, because micro couldn't do it).
I believe the problem with very long lines is that micro stores line data in memory simply as an array of raw bytes, not an array of unicode runes. I.e. it doesn't decode the line into runes beforehand. So whenever micro needs to know anything about a visual representation of any part of a line, it decodes it on the fly. (So for instance, if this is a very long line and we are closer to the end of it, micro has to decode almost the entire line, and it has to do that every time when redrawing the screen and so on. So things become extremely slow.)
Heck, micro doesn't even know how many characters there are in a line. It has to call util.CharacterCount() every time to find it out.
If we change it to decode and store lines as slices of runes beforehand, it should solve these problems, and generally improve performance. I really don't get why it wasn't done this way from the beginning.
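For illustration, a small sketch (using the stdlib equivalent, not micro's actual util.CharacterCount) of the difference between asking a byte-backed line and a pre-decoded line how many characters it contains:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	data := []byte(`xml ättribute="value"`) // raw bytes, as stored today

	// Byte-backed line: every "how many characters?" question is a full
	// O(n) decode pass over the line (roughly what util.CharacterCount
	// has to do; shown here with the stdlib equivalent).
	fmt.Println(utf8.RuneCount(data))

	// Pre-decoded line: the answer is already there, O(1).
	runes := []rune(string(data))
	fmt.Println(len(runes))
}
```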
Yep, that's a good idea to solve this here as well. Maybe I'll take care of this in the next few days. If someone would like to improve this earlier then give it a go and we will merge it afterwards into this PR. :wink:
If we change it to decode and store lines as slices of runes beforehand, it should solve these problems, and generally improve performance. I really don't get why it wasn't done this way from the beginning.
I see one possible reason: Rune is int32. It would be like using UTF32 for all characters even if our document contains only ASCII
It will be great if you take care of it. But why do it in this PR? It is not really a highlighter issue. It is an issue of efficient representation of a line in the basic data structures (so it is more fundamental than the highlighter issues, and affects us even when syntax highlighting is not used at all).
Initial rough idea of where we might start:
```diff
--- a/internal/buffer/line_array.go
+++ b/internal/buffer/line_array.go
@@ -41,10 +41,16 @@ type searchState struct {
 	done bool
 }
 
+type Character struct {
+	r     rune
+	combc []rune
+}
+
 // A Line contains the data in bytes as well as a highlight state, match
 // and a flag for whether the highlighting needs to be updated
 type Line struct {
 	data []byte
+	char []Character
 	state highlight.State
 	match highlight.LineMatch
```
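To make the idea a bit more concrete, here is a rough sketch (my assumption, with combining-character handling simplified to the Mn category) of the one-time decode step that would fill such a `char` slice from the raw bytes:

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// Character mirrors the struct sketched in the diff above.
type Character struct {
	r     rune
	combc []rune
}

// decodeLine turns the raw bytes of a line into characters once, attaching
// combining marks (approximated here by the Mn category) to their base rune.
func decodeLine(data []byte) []Character {
	var chars []Character
	for len(data) > 0 {
		r, size := utf8.DecodeRune(data)
		data = data[size:]
		if unicode.Is(unicode.Mn, r) && len(chars) > 0 {
			// combining mark: belongs to the previous base character
			last := &chars[len(chars)-1]
			last.combc = append(last.combc, r)
			continue
		}
		chars = append(chars, Character{r: r})
	}
	return chars
}

func main() {
	line := []byte("e\u0301tat") // "état" written with a combining acute accent
	chars := decodeLine(line)
	fmt.Println(len(line), "bytes ->", len(chars), "characters")
}
```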
I see one possible reason: Rune is int32. It would be like using UTF32 for all characters even if our document contains only ASCII
Yes, it costs memory. Isn't it worth it?
I see one possible reason: Rune is int32. It would be like using UTF32 for all characters even if our document contains only ASCII
Yes, it costs memory. Isn't it worth it?
Idk. I can only say that if I were a developer writing a new app, I would prefer to build something on bytes, because they are cheaper (in size), just because it is the most obvious solution.
If we have a 1MB file we would store 4MB in memory; it isn't so bad for most hardware, but I am not sure I would want it. Just as a user I would prefer to have different modes, which isn't so easy for the developer...
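As a rough estimate (a sketch assuming the hypothetical Character struct from the diff above and a 64-bit platform), the per-character overhead can be checked like this:

```go
package main

import (
	"fmt"
	"unsafe"
)

// Character mirrors the struct sketched in the diff above.
type Character struct {
	r     rune   // decoded code point
	combc []rune // trailing combining characters, usually nil
}

func main() {
	// On a 64-bit platform this prints 32: 4 bytes for the rune,
	// 4 bytes of padding and a 24-byte slice header for combc.
	fmt.Println(unsafe.Sizeof(Character{}))
	// A plain []rune would cost 4 bytes per character instead, i.e. roughly
	// 4 MB for a 1 MB ASCII line versus 1 MB of raw bytes.
}
```

So depending on how combining characters are tracked, the overhead can be noticeably more than the 4x of plain UTF-32.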
It will be great if you take care of it. But why do it in this PR? It is not really a highlighter issue. It is an issue of efficient representation of a line in the basic data structures (so it is more fundamental than the highlighter issues, and affects us even when syntax highlighting is not used at all).
Yeah it is a separate issue
Initial rough idea of where we might start: [...]
Storing data in bytes and storing the same data in ints? I can understand replacement, but addition? Why?
Storing data in bytes and storing the same data in ints? I can understand replacement, but addition? Why?
As I said, this is just an initial rough idea. I hope we won't need to keep both representations. OTOH, for example, if we need to keep LineBytes() for compatibility with plugins... (Some of my plugins do use it, I can change them if needed, but I'm not sure about other people's plugins...)
But why do it in this PR? It is not really a highlighter issue.
Yeah it is a separate issue
I agree with both of you that the non-decoded bytes, and thus their repetitive decoding, are a common performance degradation independent of the highlighting, but the feature suffering the most from it is the highlighting...at least from the end user perspective. The impact and result are easier to "feel"/measure with syntax highlighting, and due to the common usage of byte streams this interface is tightly coupled into the highlighting process. A rebase is then most likely. :wink:
Anyway, I will find a solution for that lazy developer reason. :smile:
but the feature suffering the most from it is the highlighting...at least from the end user perspective.
I don't think so. Try it: create a file long.txt with a long line containing 1 million characters, open it (the filetype will be unknown => no highlighting), move to the end of this long line and, for example, just try moving the cursor around. It's gonna be very slow. And if you also, for example, enable softwrap, it will be very slow even at the beginning of the long line.
Just as an example, how many times do we call RuneUnder() every time we refresh the display in displayBuffer()?
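To illustrate (a hypothetical sketch, not micro's actual RuneUnder() code): with a byte-backed line, looking up the character at a given index means decoding everything before it, on every single call:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeAt is a hypothetical stand-in for a RuneUnder()-style lookup. With a
// byte-backed line it is O(index) per call; with a pre-decoded []rune it
// would be a single slice access, runes[index].
func runeAt(line []byte, index int) rune {
	for i := 0; ; i++ {
		r, size := utf8.DecodeRune(line)
		if i == index || size == 0 {
			return r
		}
		line = line[size:]
	}
}

func main() {
	line := []byte("…a very long line…")
	// Called once per cell during a redraw, the decode work grows
	// quadratically towards the end of a long line.
	fmt.Printf("%c\n", runeAt(line, 3))
}
```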
due to the common usage of byte streams this interface is tightly coupled into the highlighting process.
I don't think the highlighting is special here. Replacing byte streams with rune streams, although conceptually simple, is gonna be a huge change, reworking so many parts of code throughout micro (look at all the places where LineBytes() is used, for example). Changes in the highlighter would be just a small part of it, I think.
A rebase is then most likely.
Nothing's wrong with that.
Ok, you convinced me once again. I started yesterday in the evening with a small rework (independent from the highlighting): https://github.com/JoeKar/micro/tree/feature/perf-rune-lines
The branch is still in heavy flux and the result (line handling) is far away from a production-ready state. The start is done and it now needs a lot of refactoring, fixing and adaptation in a lot of different locations.
Can confirm it really solves Micro does not respnoe when edit specific file #3115. While I tried to test if it solves the issue I noticed nothing undesirable.
- Replaced my current executable with the executable of this PR (+ some other PR which I believe won't influence this PR), so I will say if I find regressions.
There is a problem, maybe related, maybe not. Create a file with 1MB of characters on one line and micro will freeze or slow but will work. It will start to work normally when no characters from that big line are displayed on the screen.
I still use it and didn't encounter any problems, so I think @dmaluka is right and the highlighting should be a separate pull request (and issue?)
I was looking at /usr/include/unistd.h in the amd64 version of the libc6-dev package in Debian 11, but there were multi-line comments with text not highlighted as text in a comment, like this:
I was looking at this file when testing a bit: c-comment.c.txt
I think only the line where a multi-line comment is started is highlighted as text in a comment if there is text in it highlighted using regions when not in a comment. The other text in the line is also not highlighted as text in a comment if the region would end in it when not in a comment.
The lines between the first and last line in a character literal are not highlighted as text with an error. The other text in the line where a character literal ends is highlighted as text in a literal if the literal ends in another line.
There are similar bugs but I do not know if they are actually the same so I have not written about them.
@niten94 I can also see this bug (and don't see it without this PR).
[...] so I think @dmaluka is right and the highlighting should be a separate pull request (and issue?)
Yes, this one won't be mixed up with the other idea/improvement of optimizing and storing the line data.
@niten94 I can also see this bug (and don't see it without this PR).
P.S. This is not the most important thing at the moment. I'd suggest to first address the regression reported by @niten94 in https://github.com/zyedidia/micro/pull/3127#issuecomment-1937714385 because maybe it means that the entire approach of this PR is wrong.
Yes, it's a regression which will be fixed in the next few days. Sorry for the inconvenience. AFAIK it sounds again like a logic issue within the nested region detection, which now uses all indices found in a line. But thank you already for the testing, because this is really welcome here.
To make the API more intuitive, I'd suggest to remove `statesOnly` from `Highlight()` (but not from the internal `highlight()` function) and restore the `ReHighlightStates()` function, which would call `highlight()` with `statesOnly=true`.
Side note: it seems we have redundant steps in this loop, which we can avoid by changing `i++` to `i = l + 1`.
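A minimal sketch of what that could look like (assuming an internal highlight() that already takes a statesOnly flag as described above; names and signatures are illustrative, not necessarily micro's exact API):

```go
package highlight

// LineStates stands in for the existing interface the highlighter works on;
// the real definition lives in the highlight package.
type LineStates interface{}

type Highlighter struct{}

// highlight is the internal worker; statesOnly selects whether only the
// per-line states are recomputed or the matches as well (assumed signature).
func (h *Highlighter) highlight(input LineStates, startline int, statesOnly bool) {
	// ... actual parsing work elided ...
}

// Highlight keeps a simple public signature: matches and states.
func (h *Highlighter) Highlight(input LineStates, startline int) {
	h.highlight(input, startline, false)
}

// ReHighlightStates is restored as a thin wrapper that only refreshes states.
func (h *Highlighter) ReHighlightStates(input LineStates, startline int) {
	h.highlight(input, startline, true)
}
```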
I will consider both.
I was looking at this file when testing a bit: c-comment.c.txt
Thanks a lot for providing the example file. This helps to save time and invest it into fixing the bugs. :wink:
The PR handles tab characters with mistakes:
When I start typing the name of a property for an svg element (at the end of the tag) it doesn't highlight it fully, but only partially. It depends on the amount of tabs in the line: the number of not-highlighted symbols is countOfTabs + 1.
One more strange thing about svg highlight (exists in whole PR not only last patch):
File: marker.txt
The new behavior is that it doesn't highlight `marker` and `path` as tag names.
The old behavior (which is also wrong...) is that it highlighted `viewBox` with the wrong color.
The last patch broke multi-line properties in xml, which are, yes, usually not used by people...
File: tmp.txt
This patch behavior:
Previous version of this PR:
Before PR:
The funniest thing is that all three versions have some mistakes, but different ones everywhere...
@dustdfg I will check all of your complaints, and thank you for the example files.
Doesn't look like it's going to leave me alone any time soon. :cry:
- [x] The PR handles tab characters with mistakes:
- [x] One more strange thing about svg highlight (exists in whole PR not only last patch):
- [x] The last patch broke multi-line properties in xml, which are, yes, usually not used by people...
I just used the multi-line one once more and found that the problem doesn't occur with tabs but does occur with spaces in the last patch (3rd problem).
Unfortunately the SVG highlighting is still not perfect:
Look at the `/` at the end of `<path d="M 0 0 L 10 5 L 0 10 z"/>`, which is highlighted as `identifier` instead of `symbol.tag`. This happens due to the fact that in the first round, scanning for `&` and `=`, it will find the subsequent matches within the upcoming string, but the string is parsed in the next round and the current region doesn't know about it. That is then indeed a point for the old approach of scanning/slicing region per region per region ..., which was slow because of parsing the same data over and over again.
I can hear @dmaluka already thinking:
When does this guy stop riding his dead horse? The performance of the highlighting wouldn't be that bad any longer, when the line data approach has changed from `bytes` to (combined) `runes`!
:smile:
To be honest, I'm running out of arguments for this changed highlighting mechanism, because the maintenance is becoming very complex and still not every corner case is 100% supported and thus not perfectly highlighted.
Maybe now is a good point to decide if it's really worth the effort to keep trying to improve this "proof"-of-concept after ever more uncovered corner cases are found.
What do you think? Better to invest the time in the rework of the line_array and everything depending on it?