Analyze textbooks to find the most commonly used Unicode symbols and split the Unicode files based on the numbers

NSoiffer opened this issue 3 years ago

Maybe 5 years ago I did some analysis of a few subject-area math textbooks (Algebra, PreCalculus, Calculus, etc.). The goal was to determine what the most common symbols in each subject area are, so that a math editor could provide a simplified symbol palette for that subject. There are surprisingly few symbols needed. See my paper for details.

There are now many more open source textbooks for many more subjects, so one can now get better stats. This is useful for MathCAT: it splits the very large Unicode tables into two parts, a small common-symbols file and a much larger "the rest" file. This split avoids potentially significant startup time. MathCAT is interactive and so must respond in well under a tenth of a second, and that includes the first time it is used, when it reads in the various rule files, including the Unicode files for speech and for braille. The current division is based on the above paper... sort of. The code is actually pretty sloppy in this respect. What should be in unicode.yaml and unicode-full.yaml needs to be fixed up based on the stats.

Knowing the most common symbols is also important so that translators know where to focus their efforts.

I have a number of open source books that use MathML. To get the stats, grab the contents of all the mi, mtext, and mo elements. To get a true denominator (the total number of chars used), grabbing the mn contents also makes sense. There might be a few characters in the mn elements beyond digits, particularly since some generators don't follow the MathML spec guidelines and treat things like π as a number (it should be an mi). The contents should then be split apart into individual characters and the stats (probably kept in a hash table) updated. In the end, the results get sorted and the spots representing 99.9%, 99.99%, and (maybe, if there aren't that many chars) 99.999% of all usages are reported.
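For concreteness, here is a minimal sketch of that counting pass in Python. It is not MathCAT code; the directory layout, the *.mml glob, and the assumption that the files use the MathML default namespace are all illustrative:

```python
import sys
from collections import Counter
from pathlib import Path
from xml.etree import ElementTree

MATHML_NS = "http://www.w3.org/1998/Math/MathML"
TOKEN_TAGS = {f"{{{MATHML_NS}}}{tag}" for tag in ("mi", "mo", "mtext", "mn")}

# Tally every character appearing in the token elements of every file.
counts = Counter()
for path in Path(sys.argv[1]).rglob("*.mml"):
    for el in ElementTree.parse(path).getroot().iter():
        if el.tag in TOKEN_TAGS and el.text:
            counts.update(el.text.strip())

# Report symbols in descending frequency with a running coverage column.
total = sum(counts.values())
running = 0
for rank, (ch, n) in enumerate(counts.most_common(), start=1):
    running += n
    print(f"{rank}\tU+{ord(ch):04X}\t{ch}\t{n}\t{running / total:.5%}")
```

Run as, e.g., python count_chars.py path/to/books; the 99.9%/99.99%/99.999% spots can be read off the last column.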

The goal is to keep unicode.yaml somewhat short -- maybe 250 to 350 Unicode symbols -- so that it doesn't take much time to read at startup. So the cutoff point will depend on what the statistics show.
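A hedged sketch of that cutoff selection, reusing the counts Counter from the sketch above; the coverage and size defaults are just the numbers mentioned here, not settled values:

```python
def common_symbols(counts, coverage=0.999, max_size=350):
    """Pick the smallest prefix of the frequency-sorted symbols that
    reaches `coverage` of all usages, capped at `max_size` entries."""
    total = sum(counts.values())
    chosen, running = [], 0
    for ch, n in counts.most_common():
        if running / total >= coverage or len(chosen) >= max_size:
            break
        chosen.append(ch)
        running += n
    return chosen
```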

Some symbols should be added regardless of their counts (see the sketch after this list):

  • the invisible characters U+2061 - U+2064 because canonicalization adds them
  • full alphabets -- if many Roman or Greek letters are in the list, then include all of them, as that keeps like things together and is easier for translators
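A sketch of that post-processing, building on common_symbols above; the Greek-lowercase range and the 10-letter threshold for completing an alphabet are illustrative assumptions:

```python
INVISIBLES = [chr(c) for c in range(0x2061, 0x2065)]   # U+2061..U+2064
GREEK_LOWER = [chr(c) for c in range(0x03B1, 0x03CA)]  # α..ω (includes ς)

def force_includes(chosen, alphabet=GREEK_LOWER, threshold=10):
    """Always add the invisible operators; complete an alphabet when
    enough of its letters already made the cut."""
    result = list(dict.fromkeys(list(chosen) + INVISIBLES))
    if sum(ch in result for ch in alphabet) >= threshold:
        result += [ch for ch in alphabet if ch not in result]
    return result
```

The same completion check would be run for each alphabet of interest (Latin upper/lower, Greek upper/lower, etc.).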

Tying this into the Unicode conversion mentioned in #71, so that the two files get spit out automatically, would be great.
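Purely as an illustration of the two-file split: the real unicode.yaml entry format is MathCAT's and may well differ, so this assumes each entry is a one-key character-to-rule mapping, and unicode-all.yaml is a hypothetical merged source file:

```python
import yaml  # PyYAML

def split_table(entries, common_set):
    """Partition a list of one-key {character: rule} mappings by
    membership in the common-symbol set."""
    common, rest = [], []
    for entry in entries:
        (char,) = entry  # the mapping's single key
        (common if char in common_set else rest).append(entry)
    return common, rest

# "unicode-all.yaml" and the tiny common set are placeholders.
with open("unicode-all.yaml", encoding="utf-8") as f:
    common, rest = split_table(yaml.safe_load(f), {"+", "=", "π"})
for name, part in (("unicode.yaml", common), ("unicode-full.yaml", rest)):
    with open(name, "w", encoding="utf-8") as out:
        yaml.safe_dump(part, out, allow_unicode=True)
```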

NSoiffer · Nov 22 '22 20:11

Hi, Neil - I have some node.js code that can process EPUBs. I'd be happy to work on this. Are you able to share the OERs that you have?

brichwin · Jul 11 '23 14:07

A couple of years ago @davidfarmer sent me a list of open textbooks. I thought they used MathML, but a spot check shows that at least the ones I checked are all MathJax textbooks. You could write some JS that opens a browser on the link, grabs the converted TeX (i.e., the MathML) from the page, and analyzes/stores it.
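A minimal sketch of that scraping step, assuming Playwright for Python and MathJax v3 pages, where the default assistive-MathML output can be read from mjx-assistive-mml elements (MathJax v2 pages would need a different selector):

```python
from playwright.sync_api import sync_playwright

def mathml_from_page(url):
    """Render the page in headless Chromium, then pull the MathML that
    MathJax v3 stores in its assistive-MathML elements."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        mml = page.eval_on_selector_all(
            "mjx-assistive-mml", "els => els.map(e => e.innerHTML)")
        browser.close()
    return mml
```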

Alternatively, you could grab the characters from the TeX, but that's harder because they aren't Unicode chars there. You could instead run a TeX converter on the files directly. Since TeX usage differs from book to book, the safest converter would be to run MathJax locally, but a few mistakes on obscure chars wouldn't affect the results for common chars.
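Since MathJax has no Python binding, a Python version of this step would shell out to a converter. A sketch using LaTeXML's latexmlmath instead (a swap from the MathJax suggestion above; the flag usage should be checked against the installed LaTeXML):

```python
import subprocess

def tex_to_mathml(tex):
    """Convert one TeX expression to presentation MathML;
    '--pmml=-' asks latexmlmath to write the MathML to stdout."""
    return subprocess.run(
        ["latexmlmath", "--pmml=-", tex],
        capture_output=True, text=True, check=True,
    ).stdout
```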

NSoiffer · Jul 11 '23 17:07