Additional titlecasing amendments
The original plan was to lift the regexes directly, but I’d forgotten that Standard Ebooks is a GPL3 codebase and this project is MIT. We can’t copy everything over directly, so the new plan is that I’ll copy over my own contributions, plus anything that other contributors agree to contribute.
At Standard Ebooks we use python-titlecase to format a bunch of stuff throughout our productions (thanks!) but we also have some additional rules and changes to meet our specific needs. These start at [redacted]; the comments as a list give you a good overview:
- Uppercase Roman numerals, but only if they are valid Roman numerals and they are not `MIX` (which is much more likely to be an English word than a Roman numeral) or `DI` which may be an Italian word
- Lowercase `and`, `or` even if preceded by punctuation
- pip_titlecase capitalizes all prepositions preceded by parenthesis; we only want to capitalize ones that aren't the first word of a subtitle
  - OK: From Sergeant Bulmer (of the Detective Police) to Mr. Pendril
  - OK: Three Men in a Boat (To Say Nothing of the Dog)
- Uppercase words preceded by en or em dash
- Lowercase `and`, if it's not the very first word, and not preceded by an em-dash
- Lowercase `the`, if preceded by a dash (like `Puss-in-Boots` or `Jack-in-the-Box`)
- Lowercase "in", if followed by a semicolon (but not words like "inheritance")
- Lowercase `th’`, sometimes used poetically
- Lowercase `o’`
- Uppercase words that begin compound words, like `to-night` (which might appear in poetry)
- Lowercase `from`, `with`, as long as they're not the first word and not preceded by a parenthesis
- ~~Capitalise the first word after an opening quote or italicisation that signifies a work~~ this relies on SE specific markup
- Lowercase `the` if preceded by `vs.`
- Lowercase `de`, `von`, `van`, `le`, `du` as in `Charles de Gaulle`, `Werner von Braun`, etc., and if not the first word and not preceded by an “
- Uppercase word following `Or,`, since it is probably a subtitle
- Uppercase word following `:`, except `or,`, which indicates a kind of subtitle
- Uppercase words after an initial contraction, like `O'Keefe` or `L'Affaire`. But only if there are at least 3 letters after, to prevent catching things like `I'm` or `E're`
- Uppercase letter after `Mc`
- Uppercase first letter after beginning contraction
- Uppercase first letter
- Lowercase `by`
- Lowercase leading `d’`, as in `Marie d’Elle`
- Uppercase `l’` as in `l’Affaire`, but not if it's the first letter
- Uppercase leading `A-` as in `A-Breaking`
- Uppercase some known initialisms
- Lowercase `À` (as in `À La Carte`) unless it's the first word
- Uppercase initialisms
- Uppercase No. as in Number
- Lowercase V. as in versus in a legal case
- Lowercase `mm` (millimeters, as in `50 mm gun`) unless it's followed by a period in which case it's likely `Mm.` (Monsieurs)
- Lowercase `al-` (as in the Arabic definite article) unless it’s the first word
- …and some special cases
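To make the first rule in the list concrete, here is a minimal standalone sketch of the Roman numeral check. It is an illustration written from the description alone, not the Standard Ebooks regex; the names `ROMAN_RE`, `EXCLUDED`, and `uppercase_roman` are hypothetical, and the pattern assumes "valid" means a strictly formed numeral between 1 and 3999.

```python
import re

# Strictly formed Roman numerals from 1 to 3999. The lookahead rejects the
# empty string, which the optional groups would otherwise accept.
ROMAN_RE = re.compile(
    r"^(?=[MDCLXVI])M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$"
)

# Valid numerals that are more likely to be ordinary words (per the list above).
EXCLUDED = {"MIX", "DI"}


def uppercase_roman(word: str) -> str:
    """Return the word uppercased if it is a valid, non-excluded Roman numeral."""
    candidate = word.upper()
    if candidate in EXCLUDED:
        return word
    if ROMAN_RE.match(candidate):
        return candidate
    return word
```

With this sketch, `uppercase_roman("xiv")` gives `"XIV"`, while `"Mix"`, `"Di"`, and malformed strings such as `"iiii"` are returned unchanged. A per-word function like this could probably be adapted to python-titlecase's `callback` argument, but that wiring is left out here.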
Would any of these be things that python-titlecase is interested in? I’d be happy to upstream them as PRs.
Personally, I think these would all be great additions!
Regarding À La Carte, should the la (literally, "the") also be lowercase, giving à la Carte?
Should that be extended to lowercase au (à + le) and aux (à + les) as well? (These are the masculine and plural forms of à la, which is feminine.)
Regarding "Lowercase de, von, van, le, du" -- should this list be extended to des (de + les; du is de + le; de la is written out, all meaning "of the")? And also les and la (the plural and feminine forms of le, meaning "the")? (I realize la is sometimes used as a music note, so its inclusion may cause more false positives than is helpful.)
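A rough sketch of how the particle rule and the proposed extension might be compared side by side. Everything here is hypothetical (`PARTICLES`, `PROPOSED`, `lowercase_particles`), it splits on single spaces only, and it is not how either project implements the rule.

```python
# Particles named in the original rule.
PARTICLES = {"de", "von", "van", "le", "du"}

# Additions suggested in this comment; NOT part of the original rule.
PROPOSED = {"des", "les", "la", "au", "aux"}


def lowercase_particles(title: str, particles=frozenset(PARTICLES)) -> str:
    """Lowercase known particles unless they are the first word of the title."""
    words = title.split(" ")
    result = []
    for index, word in enumerate(words):
        # A word right after an opening quote still carries the quote mark
        # (for example "“De"), so it never matches the set and is left alone.
        if index > 0 and word.lower() in particles:
            result.append(word.lower())
        else:
            result.append(word)
    return " ".join(result)
```

With the default set, `lowercase_particles("Charles De Gaulle")` gives `"Charles de Gaulle"` and `"De Profundis"` is left alone. Passing `PARTICLES | PROPOSED` turns `"Do Re Mi Fa Sol La"` into `"Do Re Mi Fa Sol la"`, which is the kind of false positive mentioned above for the music note.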
These all look reasonable to me, and happy to take a PR (or possibly better several PRs as these look like a large number of rules?).
My biggest ask would be that each new rule comes with a test case or two which demonstrates (and validates) when it is and is not supposed to trigger, and that it's operating correctly. It should be easy to add roughly one phrase per rule to tests.py.
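As a sketch of what "a phrase per rule" could look like, the block below pairs one hypothetical rule with a table of inputs where it should and should not fire. The rule function `uppercase_contraction` and the test class are made up for illustration, and nothing here assumes how the project's tests.py is actually laid out.

```python
import re
import unittest

# Hypothetical rule: uppercase the letter after an initial one-letter
# contraction, but only when at least three letters follow the apostrophe.
CONTRACTION_RE = re.compile(r"^([A-Za-z])(['’])([A-Za-z])([A-Za-z]{2,})$")


def uppercase_contraction(word: str) -> str:
    match = CONTRACTION_RE.match(word)
    if not match:
        return word
    first, apostrophe, initial, rest = match.groups()
    return first.upper() + apostrophe + initial.upper() + rest


class ContractionRuleTest(unittest.TestCase):
    # (input, expected) pairs: cases where the rule fires, then cases where
    # it must leave the word untouched.
    CASES = [
        ("o'keefe", "O'Keefe"),
        ("l’affaire", "L’Affaire"),
        ("I'm", "I'm"),      # only one letter after the apostrophe
        ("e're", "e're"),    # only two letters after the apostrophe
        ("plain", "plain"),  # no contraction at all
    ]

    def test_cases(self):
        for given, expected in self.CASES:
            with self.subTest(given=given):
                self.assertEqual(uppercase_contraction(given), expected)
```

Saved to a file, this can be run with `python -m unittest`. Using `subTest` means that one failing phrase is reported on its own instead of hiding the rest of the table.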
So, something I forgot was that Standard Ebooks’ tooling is GPL3 which isn’t compatible with MIT. That makes the way forwards a little difficult and comes down to a couple of options.
- I could check which of them I added and leave it at that. Potentially I could check in with other contributors to see if they’d be happy having their contributions reused in an MIT codebase. But I’ve checked with one of the bigger contributors and they’re not.
- Alternatively I could leave the list here, but remove the link. Then other people could do a cleanroom implementation of the functionality without reference to a GPL3 codebase.
Sorry about that, it honestly didn’t cross my mind until I sat down to actually implement it.
:/, that's unfortunate. I definitely can't do any kind of license change on this end in good conscience. I'm just a steward really of a project which has had many owners over the years.
I guess the best path forward is to pull over any changes you can, and then leave this open with the link removed as you suggested. It's a great todo list for anyone looking for some simple OSS contributions at least.
OK, I’ll try to get around to this at some point over the next week.
I went through the blame to see who’d contributed which rules, and it turns out that all but two were written by a contributor who would (reasonably, of course) prefer that their code remain GPL-3 rather than MIT. The other two were written by me, but are not useful in the more general context.
So I think I’ve done as much as I can here. I know the original code so I don’t want to attempt a black-box reimplementation as MIT. If anyone else who hasn’t read the GPL3 code wants to take this list as the starting point for python-titlecase improvements then go for it, but otherwise let’s close this issue.
Thanks for the time anyway, and sorry that I hadn’t been more careful about licensing when I proposed this.