Allow overriding of CLDR

Open jtwang opened this issue 9 years ago • 5 comments

This encompasses a few different ideas:

  • Allow overriding of existing values in the CLDR
  • Allow insertion of new elements in CLDR XML
  • Allow addition of xml files for new locales

I don't know how common this is for the world at large, but we require the ability to tweak things such as dates/times and currencies. We don't necessarily want to wait for the Unicode -> CLDR -> Babel chain to catch up, and we don't necessarily want to submit our change requests either. E.g., we want to show 'US' instead of 'U.S.' (CLDR 23.1).

Having the ability to define new elements would be extremely useful, e.g. for new datetime skeletons, or for completely new functionality not supported by the CLDR (we added single-char currency symbols, which is kind of sketchy).

Adding entirely new locale xmls could also come in handy - CLDR 23.1 did not include xmls for en_CH, en_MY, and en_PH, which we needed.

jtwang avatar Feb 10 '16 23:02 jtwang

The way we implemented this feature is as follows:

  • in the cldr/ directory, we created a directory called cldr/custom/
  • this directory mirrors cldr/common/
  • for every override that was required, we'd add the XML file in the appropriate subdirectory, with the appropriate name. The contents would be an XML tree that mirrors the CLDR, containing only the branch whose leaf element we wanted to change.
  • we created a script that would merge our override XML files with the CLDR's XML files, and save them to a new directory cldr/merged/. The import script would read from this directory instead.
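To make the merge step concrete, here is a minimal sketch of how such a script could fold an override tree into a CLDR tree. It is not our actual script: the function names and the choice of `type` and `id` as the attributes that identify an element among its siblings are assumptions for illustration, and real LDML has more distinguishing attributes than these two.

```python
import xml.etree.ElementTree as ET

# Attributes assumed to identify an element among its siblings.
# Real LDML uses more than these two; this is a simplification.
KEY_ATTRS = ("type", "id")


def _key(elem):
    """Identity of an element among its siblings: tag plus key attributes."""
    return (elem.tag, tuple(elem.get(attr) for attr in KEY_ATTRS))


def merge_override(base, override):
    """Recursively merge the `override` element into `base`, in place."""
    existing = {_key(child): child for child in base}
    for o_child in override:
        b_child = existing.get(_key(o_child))
        if b_child is None:
            # Element does not exist in the CLDR tree: insert the branch.
            base.append(o_child)
        elif len(o_child) == 0:
            # Leaf element: overwrite text and copy attributes (incl. ticket).
            b_child.text = o_child.text
            b_child.attrib.update(o_child.attrib)
        else:
            # Inner element: descend and keep merging.
            merge_override(b_child, o_child)
    return base


cldr = ET.fromstring(
    "<ldml><numbers><currencies>"
    '<currency type="PHP">'
    "<displayName>Philippine Peso</displayName><symbol>PHP</symbol>"
    "</currency>"
    "</currencies></numbers></ldml>"
)
custom = ET.fromstring(
    "<ldml><numbers><currencies>"
    '<currency type="PHP"><symbol ticket="INTL-2668">₱</symbol></currency>'
    "</currencies></numbers></ldml>"
)
merged = merge_override(cldr, custom)
```

Only the `symbol` leaf is touched; sibling elements such as `displayName` are left as the CLDR defines them. A real script would also write the result to cldr/merged/ and probably strip the `ticket` attribute before import.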

Interesting design issues:

  • in what order should the overrides and fallbacks be performed? cldr root -> override root -> cldr en -> override en -> cldr en_US -> override en_US?
  • having a common/ directory only worked since we forked. Additionally, this will only work if people decide to build their own Babel package (i.e. it won't work if someone's just downloading the binary distribution). What would be a better approach for Babel to take? Still require building, but read an env variable that points to the dir with overrides?
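For illustration, the interleaved order and the env-variable idea from the two points above could be sketched as follows. The variable name `BABEL_CLDR_OVERRIDES` and both function names are made up for this example; nothing like this exists in Babel.

```python
import os
from pathlib import Path

# Hypothetical environment variable name, chosen only for this sketch.
OVERRIDE_ENV_VAR = "BABEL_CLDR_OVERRIDES"


def override_dir(environ=os.environ):
    """Return the override directory from the environment, or None if unset."""
    value = environ.get(OVERRIDE_ENV_VAR)
    return Path(value) if value else None


def resolution_order(locale_id):
    """List (source, locale) pairs from least to most specific.

    Each CLDR file is followed directly by its override, so a more
    specific CLDR value still wins over a less specific override.
    """
    parts = locale_id.split("_")
    chain = ["root"] + ["_".join(parts[:i]) for i in range(1, len(parts) + 1)]
    order = []
    for name in chain:
        order.append(("cldr", name))
        order.append(("override", name))
    return order
```

With this ordering, `resolution_order("en_US")` yields cldr root, override root, cldr en, override en, cldr en_US, override en_US. Whether an override at `en` should instead beat a stock value at `en_US` is exactly the open design question.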

Here's a snippet of how our setup looks (keep in mind we're still on Babel 1.3):

babel \
    cldr \
        common \
            <the usual>
        custom \
            main \
                cs.xml
                de.xml
                root.xml      
        merged \
            <the usual>

Example custom file contents (our script requires every override be linked to our ticket tracking system):

=== ROOT.XML
<?xml version='1.0' encoding='UTF-8'?>
<ldml>
    <dates>
        <calendars>
            <calendar type="gregorian">
                <dateTimeFormats>
                    <availableFormats>
                        <dateFormatItem id="MMMEEEd" ticket="INTL-2668">EEE, d MMM</dateFormatItem>
                    </availableFormats>
                </dateTimeFormats>
            </calendar>
        </calendars>
    </dates>
    <numbers>
        <currencies>
            <currency type="PHP">
                <symbol ticket="INTL-2668">₱</symbol>
            </currency>
        </currencies>
    </numbers>
</ldml>
=== DE.XML
<?xml version='1.0' encoding='UTF-8'?>
<ldml>
    <localeDisplayNames>
        <languages>
            <language ticket="INTL-2947" type="nb">Norwegisch</language>
        </languages>
        <territories>
            <territory ticket="INTL-1441" type="HK">Hongkong</territory>
        </territories>
    </localeDisplayNames>
    <dates>
        <calendars>
            <calendar type="gregorian">
                <months>
                    <monthContext type="format">
                        <monthWidth type="abbreviated">
                            <month ticket="INTL-3218" type="5">Mai.</month>
                            <month ticket="INTL-3218" type="6">Juni.</month>
                            <month ticket="INTL-3218" type="7">Juli.</month>
                        </monthWidth>
                    </monthContext>
                </months>
                <dateTimeFormats>
                    <availableFormats>
                        <dateFormatItem id="MMMEEEd" ticket="INTL-2668">EEE, d. MMM</dateFormatItem>
                    </availableFormats>
                </dateTimeFormats>
            </calendar>
        </calendars>
    </dates>
</ldml>
=== DE_AT.XML
<?xml version='1.0' encoding='UTF-8'?>
<ldml>
    <dates>
        <calendars>
            <calendar type="gregorian">
                <months>
                    <monthContext type="format">
                        <monthWidth type="abbreviated">
                            <month type="1" ticket='INTL-3218'>Jän.</month>
                        </monthWidth>
                    </monthContext>
                </months>
                <dateTimeFormats>
                    <availableFormats>
                        <dateFormatItem id="MMMEEEd" ticket="INTL-2668">EEE, d. MMM</dateFormatItem>
                    </availableFormats>
                </dateTimeFormats>
            </calendar>
        </calendars>
    </dates>
</ldml>

jtwang avatar Feb 10 '16 23:02 jtwang

Thanks for the extensive overview of how you guys do things!

My first gut feeling is that requiring LDML XMLen for overrides is Not A Good Idea, mostly owing to the overhead required in parsing XML and so on -- for better or for worse, it'd mean moving the CLDR importer into the core library.

Even if our Python-native format is a moving target, I don't think it has had significant overhauls in a long while -- only additions, pretty much, so I think adding an overlay on top of that would be a lighter, neater approach.

That said, perhaps one way to go about this would be to add a patch hook that would allow clients to modify the locale data as it is loaded:

# Proposed hook; register_locale_patch is not an existing Babel API.
@babel.register_locale_patch
def patch_php_currency(locale, data):
    # (Could check the locale's properties here)
    data["currencies"]["PHP"]["symbol"] = "₱"

or similar. This feels like a very non-invasive hook to me, though the onus to keep up with possible changes to the Pythonic locale data format would be on those using the patch system. This would also allow "advanced" users such as your org to perhaps load the actual override/overlay data from XML or MongoDB or whatever, while still allowing less enterprise-y users to ad-hoc patch as required. (As a con, this adds a layer of process-global context, which feels a little unclean... Though I think that might just be acceptable.)
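As a rough sketch of what such a hook might look like internally (purely hypothetical: the registry, `apply_patches`, and the nested `currencies` layout are stand-ins for illustration, not Babel's real API or data format):

```python
import copy

# Process-global registry of patch functions (the "con" mentioned above).
_locale_patches = []


def register_locale_patch(func):
    """Decorator that registers func(locale, data) to run at load time."""
    _locale_patches.append(func)
    return func


def apply_patches(locale, data):
    """Return a patched copy of `data`; the stock data stays untouched."""
    patched = copy.deepcopy(data)
    for patch in _locale_patches:
        patch(locale, patched)
    return patched


@register_locale_patch
def patch_php_currency(locale, data):
    # Illustrative layout only; the real locale data is organized differently.
    data.setdefault("currencies", {}).setdefault("PHP", {})["symbol"] = "₱"


stock = {"currencies": {"PHP": {"symbol": "PHP"}}}
patched = apply_patches("en_PH", stock)
```

Patching a deep copy keeps the stock data pristine, at the cost of one copy per load. Patching in place would be cheaper but would make the order of registration matter even more.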

Just my first 5 EUR cents here. What do you think?

EDIT: For the use case of

Allow addition of xml files for new locales

the data parameter passed to a patch function might be None or {} (whichever feels like the nicer protocol), and an "unknown locale" error would only be raised for a non-fuzzy locale if the final data is falsy.
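A small sketch of that protocol, using the `{}` variant. The loader function, the `stock_data` mapping, and the local `UnknownLocaleError` class are stand-ins invented for this example, not Babel internals:

```python
import copy


class UnknownLocaleError(Exception):
    """Stand-in for the error raised when no data exists for a locale."""


def load_locale_data(locale, stock_data, patches):
    """Load data for `locale`, letting patches create it from scratch.

    `stock_data` maps locale identifiers to data dicts. A locale missing
    from it starts out as an empty dict that patches may fill in.
    """
    data = copy.deepcopy(stock_data.get(locale) or {})
    for patch in patches:
        patch(locale, data)
    if not data:
        # No stock data and no patch supplied any: the locale is unknown.
        raise UnknownLocaleError(locale)
    return data


def add_en_ph(locale, data):
    """Example patch that supplies data for a locale the CLDR lacks."""
    if locale == "en_PH":
        data["currencies"] = {"PHP": {"symbol": "₱"}}


en_ph = load_locale_data("en_PH", {}, [add_en_ph])
```

Starting from `{}` rather than `None` means patch functions never need a `None` check before writing into `data`.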

akx avatar Feb 12 '16 17:02 akx

That was actually more-or-less our approach before we forked Babel and ended up with almost 1000 lines of overrides. D:

Mostly due to new datetime skeletons and currency symbol overrides. Protip: heavily encourage your design team to stick to a small set of (supported) date time display formats.

Anyhow, we decided to take the XML approach for a couple of reasons:

  • it let us take advantage of fallbacks: en_US -> en -> root
  • we had toyed with the idea of packaging up the CLDR XMLs ourselves and having Babel download this .zip. The XML override approach was a natural step in that direction. This idea kind of lost traction though once we realized that Babel was still being maintained, ha.

That being said, requiring users to build their own package is a huge disadvantage and, as you mentioned, the patch approach would still allow us to override everything, so we could keep our XML overrides on top of it. Most users would probably only require a few overrides at the locale level.

In a nutshell, I'm OK with the patch approach. :)

jtwang avatar Feb 12 '16 18:02 jtwang

Writing a utility to convert your patch XMLen to Python patch statements doesn't sound like an impossible approach, really. (As a matter of fact, I was thinking a "generic" patcher using yaml/toml: currencies.PHP.symbol = "₱" or something...)
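A quick sketch of the "generic" patcher idea, assuming the overrides have already been parsed (from YAML, TOML, or anything else) into a flat mapping of dotted paths to values. The function name and the data layout are hypothetical:

```python
def apply_dotted_overrides(data, overrides):
    """Apply {"a.b.c": value} style overrides to a nested dict, in place.

    Missing intermediate dicts are created on the way down. Keys that
    themselves contain dots are not supported by this simple scheme.
    """
    for path, value in overrides.items():
        *parents, leaf = path.split(".")
        node = data
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return data


data = {"currencies": {"PHP": {"symbol": "PHP", "name": "Philippine Peso"}}}
apply_dotted_overrides(
    data,
    {
        "currencies.PHP.symbol": "₱",
        "territories.HK": "Hongkong",
    },
)
```

The same function works for both overriding existing values and inserting new ones, which covers the first two bullets of the original request.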

Maybe now that we have actual support for datetime skellingtons, you wouldn't need 1000 lines of overrides? :)

Also, re the fallbacks -- if the signature of the patch function is locale, data, you can easily inspect the locale's language/whatever spec to see what you need to patch. (And if we incorporate @etanol's patch to fold overrides into normal dicts at load time, we don't need the LocaleDataDict funny business either?)
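For example, a single patch function could emulate the en_US -> en -> root fallback itself by walking the locale's parents. This is only a sketch: it assumes the hook receives a plain identifier string such as "de_AT", and the `OVERRIDES` table with its nested layout is invented for illustration.

```python
import copy

# Hypothetical override table keyed by locale identifier.
OVERRIDES = {
    "root": {"currencies": {"PHP": {"symbol": "₱"}}},
    "de": {"months": {5: "Mai.", 6: "Juni.", 7: "Juli."}},
    "de_AT": {"months": {1: "Jän."}},
}


def fallback_chain(locale_id):
    """Return identifiers from least to most specific, starting at root."""
    parts = locale_id.split("_")
    return ["root"] + ["_".join(parts[:i]) for i in range(1, len(parts) + 1)]


def deep_merge(target, source):
    """Merge `source` into `target` in place, descending into nested dicts."""
    for key, value in source.items():
        if isinstance(value, dict) and isinstance(target.get(key), dict):
            deep_merge(target[key], value)
        else:
            target[key] = copy.deepcopy(value)


def patch_with_fallbacks(locale_id, data):
    """Apply overrides for root, then the language, then the full locale."""
    for name in fallback_chain(locale_id):
        deep_merge(data, OVERRIDES.get(name, {}))


de_at = {"months": {1: "Jan.", 5: "Mai"}}
patch_with_fallbacks("de_AT", de_at)
```

Because more specific entries are merged last, the de_AT value for month 1 wins while the de values for months 5 to 7 are still inherited.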

akx avatar Feb 12 '16 19:02 akx

Any news on this?

I noticed that it is possible to modify a Locale instance to some degree. Is this discouraged? E.g.:

from babel import Locale

locale = Locale("no")
locale.datetime_formats.short_date_format = 'yyyy-MM-dd'

Reading the code, it seems each instance gets a copy of the data, so I assume this will work OK for now as long as I ensure all parts of the application use this modified locale instance. But the API does not exactly invite such overrides.

olejorgenb avatar Sep 28 '23 14:09 olejorgenb