Additional unit tests mostly for German & Dutch names + more degrees
I created 58 unit tests mostly for German & Dutch names as well as international degrees: https://gist.github.com/thomasbachem/c5e2c82479e0c3775e88
All but "von der" are failing right now.
It would be fantastic if my tests could be used to improve nameparser.
Thanks for posting these. At first glance it seems like many of these could be handled by adding new entries to the prefixes constant, e.g.:
>>> from nameparser import HumanName
>>> from nameparser.config import CONSTANTS
>>> CONSTANTS.prefixes.add('tho')
>>> HumanName("Sara tho Wopenreijss")
<HumanName : [
title: ''
first: 'Sara'
middle: ''
last: 'tho Wopenreijss'
suffix: ''
nickname: ''
]>
I can try to take a closer look next week.
Great! I think we just need to be careful when introducing many more prefixes, because there might be names in other countries where e.g. "tho" is a first name as well. If nameparser can handle these situations (e.g. "Tho Bachem" -> Firstname "Tho", Lastname "Bachem") then it's fine. Otherwise we would need to do more research to be sure that this isn't the case.
And abbreviations (e.g. "MBA", "MA", "BA", "BSc" ...) might be written with or without punctuation and whitespace (e.g. "M.B.A.", "M. B. A.", "M.A.", "M. A.", "B.Sc.", "B. Sc.")
I added 11 more tests to the Gist for some more German degrees/titles.
I don't think we can parse things with spaces in them like "M. B. A." or "M. A." as a title because they look the same as initials. Let me know if you can think of a way.
All comparisons are done against the lowercase, period-stripped version of the string, so each constant should match every case and period variation of itself.
The way the nameparser prefers to handle ambiguities like first name="Tho" is to try to be correct most of the time. It seems like "Tho" is equally uncommon as a name and a title so I would probably just let the user of the library use the config.
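To illustrate the comparison described above, here is a minimal sketch. The `normalize` helper and the `DEGREES` set are hypothetical names made up for this example and are not nameparser's actual internals; they only show why one lowercase entry can cover several spellings, and why a spaced form still does not match.

```python
def normalize(piece: str) -> str:
    """Lowercase a name piece and drop periods (hypothetical helper)."""
    return piece.lower().replace(".", "")


# Illustrative entries only, not nameparser's real constants.
DEGREES = {"mba", "ma", "ba", "bsc"}


def is_known_degree(piece: str) -> bool:
    """Check a single piece against the illustrative set after normalizing."""
    return normalize(piece) in DEGREES


for candidate in ["MBA", "M.B.A.", "B.Sc.", "M. B. A."]:
    print(repr(candidate), is_known_degree(candidate))
```

The spaced form "M. B. A." keeps its spaces after normalizing, so it does not match the single entry; handling it would need a separate rule.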
It could also be nice to have language- or domain-specific sets of config that you could use. So like if you picked the Dutch config pack then "Tho" would be a title.
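A rough sketch of what such config packs might look like, using plain sets. Everything here is an assumption for illustration: `BASE_PREFIXES`, `PREFIX_PACKS`, and `prefixes_for` do not exist in nameparser, and the entries are examples rather than a researched list.

```python
# Hypothetical default prefixes shared by every configuration.
BASE_PREFIXES = {"von", "van", "de"}

# Hypothetical per-language additions; entries are illustrative only.
PREFIX_PACKS = {
    "dutch": {"tho", "ter"},
    "german": {"zu"},
}


def prefixes_for(*pack_names: str) -> set:
    """Return the base prefixes plus those of each requested pack."""
    result = set(BASE_PREFIXES)  # copy, so the base set is never mutated
    for name in pack_names:
        result |= PREFIX_PACKS[name]
    return result


print(sorted(prefixes_for()))
print(sorted(prefixes_for("dutch")))
```

With this shape, "tho" is only treated as a prefix when the Dutch pack is selected, so "Tho Bachem" would be unaffected under the defaults.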
Nothing in the suffixes constant will ever be a last name, since suffixes come after last names. Nothing in the prefixes constant can also be a first name or a last name, since prefixes come before last names. I think suffixes that could also be first names might work OK in the suffixes constant, but I'd have to test it.
I took an initial stab at adding some things to the constants to cover those names. I added your gist in tests.py and marked some expectedFailures for things I didn't think would work. I still get
FAILED (failures=18, expected failures=21)
So there are some things that aren't working like I expected. I have another project so it might be next week before I get to take a closer look. Feel free to check out the branch, 18_german_dutch_names
I had the thought when looking at your data that we might be able to assume that strings longer than 2 characters that end in a period are some kind of title/prefix/suffix. I don't think anyone ends parts of their names with a period unless they are an initial. That assumption might cover more of your examples and be a way to recognize a lot of things without adding them all to the constants.
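The heuristic above could be sketched like this. The function name `looks_like_abbreviation` is made up for the example, and this is only the rule as stated (longer than 2 characters, ends in a period), not anything nameparser implements.

```python
def looks_like_abbreviation(piece: str) -> bool:
    """Guess that a piece is a title/prefix/suffix rather than an initial.

    Assumed rule: it ends in a period and is longer than 2 characters,
    so "Dr." qualifies while a plain initial like "M." does not.
    """
    return piece.endswith(".") and len(piece) > 2


for piece in ["Dr.", "med.", "LL.", "M.", "John"]:
    print(piece, looks_like_abbreviation(piece))
```

Note that for "LL. M." only the "LL." piece is caught; the trailing "M." still looks like an initial, so multi-piece degrees would need extra handling.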
Also some of your tests seem to touch on #2, e.g.:
hn = HumanName("LL. M. John Meyer")
self.m(hn.suffix, "LL. M.", hn)
Until we do #2, "LL. M." would be a title because it's positionally in the front.
So since #2 is done? (it's at least closed) ... :grin: ... it would be nice to see more complete internationalization support ... @thomasbachem's tests still mostly fail as of today's master ...
Specifically, German compound title prefixes seem to be problematic. As you've probably seen in the tests, titles in German are more specific than in English, e.g. Dr. med. (Doktor der Medizin, Doctor of Medicine), Dr.-Ing./Dr. ing. (Doktoringenieur), or combinations such as Prof. Dr. (a professor with a doctorate, unlike in English where Prof usually overrides Dr) and sometimes even Dr. Dr.
There are some additional issues (e.g. parsing His Majesty King Felipe VI or Her Majesty Queen Elizabeth II) but I'll look to see if another issue is open or will open a new one to keep things separate :)