Noticing Rare First/Last Names More Likely to be Categorized as Corporations
I was parsing a list of names for my wedding (super helpful!) and noticed that people with rarer names (first name, last name, etc.) are being labeled as corporations (no corporations are invited!). I don't feel super comfortable posting a guest's name on GitHub, but figured I'd still file the issue.
Love the lib!
For example, first names like Este and Eastern European last names like Morowicz were categorized as corporations. All things I can fix by hand of course, but I figure this may be an issue w/ the training data only coming from folks w/ Anglo-American names.
Thanks for filing this @jonrobinson2! Glad to know the library is helpful for personal as well as professional work :)
@fgregg: I'd be interested in brainstorming ways we might source a set of Eastern European names.
Congratulations on your wedding – hope everything goes smoothly!
For future reference, @hancush recommended this paper:
http://www.pdmpassist.org/pdf/GroundTruthDataSetForRomanizedNames.pdf
Some interesting ideas in here for sourcing culturally diverse names.
I just started working with this library, and I also noticed it classifying an individual with a unique name as a corporation. My data set has a flag for individual/corporation that I can pass to the parser, but sometimes my flag is wrong.
To source a set of Eastern European names, or any kind of names, I suggest scraping Wikipedia. For example:
- https://en.wikipedia.org/wiki/List_of_Polish_people
- https://en.wikipedia.org/wiki/Category:Lists_of_German_people
- https://en.wikipedia.org/wiki/List_of_people_from_Ukraine
For training your models, you may want to provide more counterexamples, such as the lists of corporations on Wikipedia.
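A rough sketch of what that scraping could look like, in case it helps. It assumes each list page keeps one entry per bulleted line with the person or company as the first wiki link, which will not hold for every page. The function names, the `person`/`corporation` labels, the User-Agent string, and the sample text are all made up for illustration; none of this is the library's own training tooling, and the output would still need converting to whatever format the library trains on.

```python
import json
import re
import urllib.parse
import urllib.request

# Target of a wiki link: [[Target]] or [[Target|label]].
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:[|#][^\[\]]*)?\]\]")


def fetch_wikitext(page):
    """Fetch raw wikitext for a page through the MediaWiki parse API.

    Not called below, so the example runs offline. The User-Agent value
    is a placeholder; use one that identifies your own script.
    """
    query = urllib.parse.urlencode({
        "action": "parse", "page": page, "prop": "wikitext",
        "format": "json", "formatversion": "2",
    })
    request = urllib.request.Request(
        "https://en.wikipedia.org/w/api.php?" + query,
        headers={"User-Agent": "name-list-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["parse"]["wikitext"]


def names_from_wikitext(wikitext):
    """Return the first linked title on each bulleted line."""
    names = []
    for line in wikitext.splitlines():
        if not line.startswith("*"):
            continue  # only bulleted list entries
        match = WIKILINK.search(line)
        if match is None:
            continue
        title = match.group(1).strip()
        if ":" in title:
            continue  # skip File:, Category:, and similar links
        # Drop a trailing disambiguation suffix such as " (poet)".
        names.append(re.sub(r"\s*\([^)]*\)$", "", title))
    return names


def labeled_examples(person_wikitext, corporation_wikitext):
    """Pair each extracted name with a (hypothetical) class label."""
    people = [(n, "person") for n in names_from_wikitext(person_wikitext)]
    corps = [(n, "corporation") for n in names_from_wikitext(corporation_wikitext)]
    return people + corps


# Made-up sample text in the shape of a list page.
people_sample = """== Science ==
* [[Marie Curie]], physicist and chemist
* [[Jan Kowalski (poet)|Jan Kowalski]], poet
* [[File:Example.jpg|thumb]]
Prose mentioning [[Some Link]] is ignored.
"""
corporations_sample = """* [[CD Projekt]], video game developer
* [[Orlen]], oil refiner
"""

examples = labeled_examples(people_sample, corporations_sample)
```

To run it on a real page, something like `names_from_wikitext(fetch_wikitext("List_of_Polish_people"))` should work, though the results would need a manual pass since list pages are formatted inconsistently.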
I heard DataMade provides data scraping services, so... :)
Very true – good suggestion @az0!