Add optional Unicode normalization before passing strings to speech or braille
Is your feature request related to a problem? Please describe.
In some cases, text can contain ligature characters that are not covered by a braille table. Likewise, a speech synthesizer can really struggle with these.
An example is the ligature ij (ij), used in Dutch as in ijsbeer (polar bear). The Dutch version of eSpeak is unable to pronounce this word correctly.
The opposite case is á when it is written as two characters, namely the letter a followed by a combining acute accent, rather than as a single precomposed character.
Describe the solution you'd like
For both speech and braille, I propose adding the ability to enable Unicode normalization with the NFKC algorithm (Normalization Form Compatibility Composition). This algorithm ensures that most ligatures are properly decomposed before the text is passed to the synthesizer, while decomposed sequences such as a + combining acute are composed into the single character á, which is much more common than the two-character variant.
Note that while this sounds utterly complex, it basically comes down to adding one line of code:
processed = unicodedata.normalize("NFKC", original)
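A quick sketch of what that one line does with the examples from this issue, using Python's standard unicodedata module:

```python
import unicodedata

# "ijsbeer" written with the single ligature character U+0133 (ij).
original = "\u0133sbeer"

# NFKC = compatibility decomposition (splits the ligature) followed by
# canonical composition (recombines base letters with their diacritics).
processed = unicodedata.normalize("NFKC", original)
print(processed)  # -> "ijsbeer", now seven separate characters

# The decomposed sequence "a" + COMBINING ACUTE ACCENT (U+0301)
# composes into the single precomposed character U+00E1 ("á").
print(unicodedata.normalize("NFKC", "a\u0301") == "\u00e1")  # -> True
```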
Some answers/reactions:
- Unicode normalization on "bœr" does not seem to change anything on my end. How does it improve things for Dutch reading?
- Is the issue only with eSpeak or with many TTS?
- I have not noticed any issue with "œ" character in French TTS. This ligature is also used in French but TTS seem to behave the same no matter if we have the ligature or two separate characters ("oe")
- Regarding characters with diacritics, using one composed character ("á") behaves much better with TTSs than two characters ("á"), at least when reading French.
- I think that such normalization is offered and customizable in at least one TTS; I do not remember which one.
- I do not know the difference between NFKC and NFC; is one normalization form preferable and why?
1. Unicode normalization on "bœr" does not seem to change anything on my end. How does it improve things for Dutch reading?
I'm very sorry. I actually performed tests for ij and somehow assumed it also worked for œ, but it turns out it doesn't. For ij it definitely works: unicodedata.normalize("NFKC", "ijsbeer") returns ijsbeer
I updated the initial description accordingly.
2. Is the issue only with eSpeak or with many TTS?
I can reproduce the issue with ijsbeer with eSpeak, OneCore and Vocalizer Expressive.
3. I have not noticed any issue with "œ" character in French TTS. This ligature is also used in French but TTS seem to behave the same no matter if we have the ligature or two separate characters ("oe")
With the Dutch eSpeak, bœr is pronounced as bor (i.e. it behaves as if only the o were present). Vocalizer and OneCore seem to ignore the ligature completely.
4. Regarding characters with diacritics, using one composed character ("á") behaves much better with TTSs than two characters ("á"), at least when reading French.
Same applies to Dutch.
6. I do not know the difference between NFKC and NFC; is one normalization form preferable and why?
The NFC variant doesn't apply compatibility mappings for ligatures and therefore leaves them untouched. While NFKD does decompose the ligature, it also decomposes á. Therefore NFKC seems to be the only form that really makes sense.
That said, I really want to know why œ is left alone and ij is decomposed correctly.
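For what it's worth, the Unicode character data itself shows why: the ij ligature (U+0133) carries a compatibility decomposition, while œ (U+0153) has no decomposition mapping at all, so NFKC leaves it alone. This can be checked with the standard unicodedata module:

```python
import unicodedata

# NFKC only touches characters that Unicode assigns a decomposition mapping.
print(unicodedata.decomposition("\u0133"))  # ij ligature: '<compat> 0069 006A'
print(unicodedata.decomposition("\u0153"))  # œ: '' (none; treated as a distinct letter)

# Comparing the four forms on the two problem cases from this issue:
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    ij = unicodedata.normalize(form, "\u0133")
    a_acute = unicodedata.normalize(form, "a\u0301")
    print(form, ij, [hex(ord(c)) for c in a_acute])
```

Only NFKC both splits the ligature and composes a + combining acute into a single á, matching the conclusion above.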
This sounds like a good idea. Would it be possible to normalize only once if the user had both speech and braille enabled? While the overhead seems small to me, it wouldn't hurt to try and minimize it.
I think that even when we normalize on the TextInfo level, normalization would happen once for braille and once for speech. That said, I now realize that normalizing might break offset based text infos. Need to check that.
A question: could this normalization also solve the problem of italic or bold Unicode characters that synthesizers can't read?
Do you have examples of this? I'm certainly willing to investigate. In the end, I'm searching for a normalization strategy that works best for anyone.
Do you have examples of this? I'm certainly willing to investigate.
Hi @LeonarddeR, this should be a good resource. If you want, see also this add-on, which uses the Unidecode library.
Yes, 𝒊𝒕𝒂𝒍𝒊𝒄𝒔 is normalized to italics
Thus, such a PR is fundamental, in my opinion ⚡ Seriously, whole keywords, titles and phrases are written with these characters nowadays, in posts on social networks, nicknames in chats, and so on. In the majority of cases, NVDA simply ignores them. Working to fix textInfos offsets could also pave the way for braille emoji translation, which would be a welcome feature directly in core for me (and other users with hearing problems, I think). And I also very much appreciate the ligature fix; ligatures troubled me from time to time during my university studies.
Actually, thanks to very valuable work by @mltony in https://github.com/nvaccess/nvda/pull/16219, I think we can make this work. We can create a new offset converter for normalization and use that to map normalized positions to real raw positions in the text. This way cursor routing and presentation should still work.
I created an offset converter that seems to do this reliably now. We can add this as an optional feature to speech and braille output.
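This is not NVDA's actual converter, but a simplified sketch of the idea: normalize per character and record, for each position in the normalized text, the raw position it came from, so cursor routing can be mapped back. Note that this hypothetical per-character approach deliberately ignores sequences that compose across characters (such as a + combining acute), which a real converter has to handle:

```python
import unicodedata

def nfkc_with_offsets(raw: str):
    """Return the NFKC-normalized text plus a list mapping every
    normalized offset back to the raw offset it originated from."""
    normalized_parts = []
    norm_to_raw = []
    for raw_idx, ch in enumerate(raw):
        norm_ch = unicodedata.normalize("NFKC", ch)
        normalized_parts.append(norm_ch)
        # A single raw character may expand to several normalized ones;
        # all of them route back to the same raw position.
        norm_to_raw.extend([raw_idx] * len(norm_ch))
    return "".join(normalized_parts), norm_to_raw

text, offsets = nfkc_with_offsets("\u0133sbeer")  # "ijsbeer" with the ij ligature
print(text)     # -> "ijsbeer"
print(offsets)  # -> [0, 0, 1, 2, 3, 4, 5]: both "i" and "j" map to the ligature
```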
Why should this be optional?
It should be configurable because there are use cases both with the feature on and off:
- normalization off (as today): makes it easy to detect letter-like characters used in phishing e-mails, either because your synth ignores them or because it reports them differently
- normalization on:
- useful to be able to read, with some TTSs, letters with diacritics written as two separate characters instead of one combined character.
- useful to read text written with special Unicode characters (e.g. Unicode italic or Unicode bold) instead of standard ones, as frequently used for nicknames in some forums
@LeonarddeR you just made all equations in Ms Word accessible. All alphanumeric characters of unicode seem to be read properly now by synthesizers no matter where they appear. This is really great work!
Is it possible to make this also work when using left and right arrow to move character by character? Otherwise it is really difficult to explore equations character by character.
Example document with an equation and example table with all unicode alphanumeric characters: Newton.docx alphanumeric Mathematical symbols.xlsx
@LeonarddeR I really advise having this enabled by default; it will let users read mathematical content right away in documents such as PDFs or MS Word files where alphanumeric characters are used to build equations. People who don't need the normalization can turn it off anyway, but I think the base of users who benefit from this normalization is huge.
Can you please also adjust the user guide to mention that alphanumeric characters are included in this normalization?
cc: @michaelDCurran what do you think?
Also this does not have a feature flag. Why did you add the additional standard (disabled) value? Could this not be just a checkbox?
I also think the normalized character pronunciation is OK; if people want detailed character information, they can use the character information add-on written by @CyrilleB79, for example.
@CyrilleB79
normalization off (as today): makes it easy to detect letter-like characters used in phishing e-mails, either because your synth ignores them or because it reports them differently
I think phishing emails are recognizable even with normalization on. I don't see a real use case for why people would want this setting off, unless they want more information about the character, which they can retrieve via NVDA+dot or by using the character information add-on.
@LeonarddeR you just made all equations in Ms Word accessible. All alphanumeric characters of unicode seem to be read properly now by synthesizers no matter where they appear. This is really great work!
That's a nice side effect really!
Is it possible to make this also work when using left and right arrow to move character by character? Otherwise it is really difficult to explore equations character by character.
This can be done, but it introduces a drawback: you can no longer identify foreign characters with speech when reading character by character. We could make character-by-character movement an additional option, but then it gets really messy in the end.
Also this does not have a feature flag. Why did you add the additional standard (disabled) value? Could this not be just a checkbox?
This definitely is a feature flag internally. The behavior is the same as for the "interrupt speech while scrolling" option, for example.
This can be done, but it introduces a drawback: you can no longer identify foreign characters with speech when reading character by character. We could make character-by-character movement an additional option, but then it gets really messy in the end.
However, with normalization the pronunciation is really different from the current language even when you read foreign characters, so it should be comfortable enough to tell that this is a foreign character. People could then use NVDA+dot or the aforementioned add-on to identify its details. An alternative could be to keep the current character-by-character pronunciation only for the review cursor, but not for the system caret.
This definitely is a feature flag internally. The behavior is the same as for the "interrupt speech while scrolling" option, for example.
Do you mean cancellable speech? That one is well tested and there should actually not be any feature flag on it at all. Reef put that flag on it because there were some issues at the beginning when switching windows, but those were fixed long ago.
However, with normalization the pronunciation is really different from the current language even when you read foreign characters, so it should be comfortable enough to tell that this is a foreign character.
Let's give an example:
𝐏𝐥𝐞𝐚𝐬𝐞 𝐫𝐞𝐚𝐝 𝐭𝐡𝐢𝐬 𝐬𝐞𝐧𝐭𝐞𝐧𝐜𝐞 𝐜𝐡𝐚𝐫𝐚𝐜𝐭𝐞𝐫 𝐛𝐲 𝐜𝐡𝐚𝐫𝐚𝐜𝐭𝐞𝐫
This sentence now reads like a charm with normalization on. However, these are definitely not normal characters. There must still be a way to recognize the characters as they are.
I also think the normalized character pronunciation is OK; if people want detailed character information, they can use the character information add-on written by @CyrilleB79, for example.
Maybe delayed character description could report that a normalization has happened. Just an idea...
Let's give an example: 𝐏𝐥𝐞𝐚𝐬𝐞 𝐫𝐞𝐚𝐝 𝐭𝐡𝐢𝐬 𝐬𝐞𝐧𝐭𝐞𝐧𝐜𝐞 𝐜𝐡𝐚𝐫𝐚𝐜𝐭𝐞𝐫 𝐛𝐲 𝐜𝐡𝐚𝐫𝐚𝐜𝐭𝐞𝐫. This sentence now reads like a charm with normalization on. However, these are definitely not normal characters. There must still be a way to recognize the characters as they are.
That's already possible by pressing NVDA+dot once, twice or three times. When using the character info add-on, the detailed info about the character is even displayed in a browseable window. Actually, it is really of no interest to the user whether these characters are normal or not. The p from "please" looks like a p on the screen; it is indeed a MATHEMATICAL BOLD CAPITAL P, and whether its code point is 0x1d40f or whatever really doesn't matter when reading content on the go. These details about a character are of a technical nature only.
The formatting of the letters is bold; this is the only thing that might matter in some cases, but for this the Unicode name of the character needs to be reported, which is done by the character info add-on. NVDA cannot report Unicode names natively anyway. But this is another issue which could be addressed later on.
So I still think it is OK to apply normalization also when moving character by character. Retrieving the full typographic details of a character can already be done via the other methods mentioned. Actually, I think normalization does not even need an option at all; it should always be enabled. But that's only my opinion.
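As a side note on the Unicode names mentioned above: the standard library can report them, which is presumably what a character information tool builds on:

```python
import unicodedata

# The "P" from the bold example sentence is code point U+1D40F.
bold_p = "\U0001d40f"
print(hex(ord(bold_p)))                       # -> 0x1d40f
print(unicodedata.name(bold_p))               # -> MATHEMATICAL BOLD CAPITAL P
print(unicodedata.normalize("NFKC", bold_p))  # -> P
```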
Actually, it is really of no interest to the user whether these characters are normal or not. The p from "please" looks like a p on the screen; it is indeed a MATHEMATICAL BOLD CAPITAL P, and whether its code point is 0x1d40f or whatever really doesn't matter when reading content on the go. These details about a character are of a technical nature only.
That's not true at all. It matters because the user needs to know that these are not normal characters, e.g. in the following cases:
- Such characters cannot be found like normal ones with NVDA search (NVDA+f3), nor with other searches such as in Notepad.
- If these characters are used in a file name, for example if you copy/paste them (e.g. "𝐏𝐥𝐞𝐚𝐬𝐞", i.e. "Please" written with math bold characters), the file will not appear with the files beginning with a normal "P" but will be sorted in Windows Explorer after the files beginning with "Z".
IMO, it's really important to keep the option configurable. And I would even say that the option should not be enabled by default, at least due to the search use case exposed above.
Ah, this is a good point, thanks. Yes, in this case we should document it properly in the user guide. However, when normalization is on, we should be able to apply it to character-by-character navigation as well. Maybe it should not apply to character-by-character navigation when using the review cursor.
But actually, when I copy mathematical alphanumeric characters into the NVDA find dialog, I can search for them without problems. Maybe this needs to be specified in the user guide as well.
To be clear, the points raised by @CyrilleB79 are valid but are not a problem, so this should not hold back normalization from becoming the default behavior.
- NVDA can search via NVDA+f3 or NVDA+shift+f3 for any Unicode character. It doesn't matter whether it can be entered via the keyboard or not.
- How Windows sorts files is really not NVDA's job.
The file will not appear with the files beginning with a normal "P" but will be sorted in Windows Explorer after the files beginning with "Z".
For this use case, having the real characters spoken when normalization is off, or when using the review cursor to navigate the file name, is comfortable enough in my opinion. In Windows Explorer, files are grouped by letter anyway, so you can collapse or expand groups. File names with such mathematical alphanumeric characters are very, very uncommon, and they are all grouped under "others" anyway.
I would also be inclined to have normalization on by default, for reasons @Adriani90 gives. It is weird seeing this as a Default/enabled/disabled choice in Speech settings, instead of a simple checkbox like most of the other options there. Though I do think the ability to turn it off should be preserved, as CLDR and Delayed Descriptions can be.
As to what @CyrilleB79 said about searching: the problem comes up when the user hears "please", and doesn't know there's anything strange about it. So later he searches for it, only to have it not found, and be confused. There has to be some way for the user to know this text was normalized, unless the user intentionally turns that notification off.
But having it on by default solves the problem for users who don't even know that this kind of text exists, and so have no idea they might want to turn it on. Read @Adriani90's example with the feature turned off: a user who doesn't know that people write with unusual characters that look different but mean the same as normal characters may think these are just weird graphics or symbols, and have no idea that there is supposed to be meaning there.
Some further thoughts:
- We could have, in Document Formatting settings, an option to have "Normalized"/"Out of normalized" announced around strings of such text, when reading it.
- For characters, when moving character by character, we could have an option to either play a short tone or announce "normalized" when reading such characters. For example, it could say "normalized P", and then for the delayed character description, "Mathematical bold capital P".
- Alternatively, it could just announce the descriptive name (half-normalized?): "Mathematical bold capital P", when reviewing character by character. That would tell the user that this is a P, but would also indicate that it is an unusual character, without the user having to do anything extra to find out, just reading by character.
I agree with option 1 or 2, but I disagree with option 3, because this would make exploring such texts or equations really inefficient and too verbose from a UX perspective.
As a lot of discussion is happening in this issue, let's reopen it for now. #16584 can close it again.