probablepeople: Rare First/Last Names More Likely to Be Categorized as Corporations

I was parsing a list of names for my wedding (super helpful!) and noticed that people with rarer names (first name, last name, etc.) are being labeled as corporations (no corporations are invited!). I don't feel super comfortable posting a guest's name on GitHub, but figured I'd still file the issue.

Love the lib!

Feb 26 '17 16:02 jonrobinson2

For example, first names like Este and Eastern European last names like Morowicz were categorized as corporations. These are all things I can fix by hand, of course, but I figure this may be an issue with the training data only coming from folks with Anglo-American names.
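
For anyone else checking a list by hand, here is a minimal sketch of how the misclassified entries could be collected for review. It assumes `probablepeople.tag` returns a `(components, name_type)` pair and raises an exception when a string cannot be tagged; the helper name `flag_non_person` and the `tag_fn` parameter are hypothetical and exist only so the tagger can be swapped out.

```python
def flag_non_person(names, tag_fn=None):
    """Return (name, predicted_type) pairs for names not tagged as 'Person'."""
    if tag_fn is None:
        # Assumption: probablepeople is installed (pip install probablepeople).
        import probablepeople
        tag_fn = probablepeople.tag

    flagged = []
    for name in names:
        try:
            # Assumption: tag() returns (tagged_components, name_type).
            _, name_type = tag_fn(name)
        except Exception:
            # Tagging can fail on some strings (e.g. repeated labels);
            # keep those for manual review as well.
            name_type = "Unparseable"
        if name_type != "Person":
            flagged.append((name, name_type))
    return flagged
```

Calling `flag_non_person(guest_list)` on a list of strings would then give a short list to correct by hand instead of scanning every row.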

Feb 26 '17 17:02 jonrobinson2

Thanks for filing this @jonrobinson2! Glad to know the library is helpful for personal as well as professional work :)

@fgregg: I'd be interested in brainstorming ways we might source a set of Eastern European names.

Congratulations on your wedding – hope everything goes smoothly!

Feb 28 '17 15:02 jeancochrane

For future reference, @hancush recommended this paper:

http://www.pdmpassist.org/pdf/GroundTruthDataSetForRomanizedNames.pdf

Some interesting ideas in here for sourcing culturally diverse names.

Feb 28 '17 16:02 jeancochrane

I just started working with this library, and I also noticed it classifying an individual with a unique name as a corporation. My data set has a flag for individual/corporation that I can pass to the parser, but sometimes my flag is wrong.
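
A rough sketch of that workflow, in case it helps others: tag with the data set's flag, then compare against the generic model's guess to spot rows where the flag may be wrong. It assumes `probablepeople.tag` accepts a `type` argument of `'person'` or `'company'` and returns `'Person'` or `'Corporation'` as the name type; `tag_with_flag_check` and `tag_fn` are hypothetical names used for illustration.

```python
def tag_with_flag_check(name, is_company, tag_fn=None):
    """Tag `name` using the data set's flag and report possible flag errors.

    Returns (components, disagrees), where `disagrees` is True when the
    generic model's predicted type does not match the flag.
    """
    if tag_fn is None:
        # Assumption: probablepeople is installed (pip install probablepeople).
        import probablepeople
        tag_fn = probablepeople.tag

    # Assumption: type='person' / type='company' selects the specialised model.
    flagged_type = "company" if is_company else "person"
    components, _ = tag_fn(name, type=flagged_type)

    # Second pass without a type hint: the generic model's own guess.
    _, generic_type = tag_fn(name)
    expected = "Corporation" if is_company else "Person"
    return components, generic_type != expected
```

A disagreement does not say which side is wrong; for rare names the generic model is the more likely culprit, so these rows are only candidates for a manual look.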

To source a set of Eastern European names, or any kind of names, I suggest scraping Wikipedia. For example:

https://en.wikipedia.org/wiki/List_of_Polish_people
https://en.wikipedia.org/wiki/Category:Lists_of_German_people
https://en.wikipedia.org/wiki/List_of_people_from_Ukraine

For training your models, you may want to provide more counterexamples, such as lists of corporations from Wikipedia.
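
As a starting point, here is a minimal sketch that pulls the article links from one of those list pages through the MediaWiki API (`action=query` with `prop=links`) rather than scraping HTML. The function names and the User-Agent string are placeholders, and only the first batch of results is read; longer lists would need to follow the API's `continue` token.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def build_links_url(page_title, limit=500):
    """Build a MediaWiki API URL listing article links on `page_title`."""
    params = {
        "action": "query",
        "prop": "links",
        "titles": page_title,
        "plnamespace": 0,  # article namespace only
        "pllimit": limit,
        "format": "json",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def extract_link_titles(payload):
    """Collect link titles from a decoded prop=links response."""
    titles = []
    for page in payload.get("query", {}).get("pages", {}).values():
        for link in page.get("links", []):
            titles.append(link["title"])
    return titles


def fetch_link_titles(page_title):
    """Fetch the first batch of link titles for a list page (network call)."""
    request = urllib.request.Request(
        build_links_url(page_title),
        # Placeholder User-Agent; replace with real contact details.
        headers={"User-Agent": "name-list-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request) as response:
        return extract_link_titles(json.load(response))
```

For instance, `fetch_link_titles("List_of_Polish_people")` would return article titles linked from that page. The result also includes non-person articles (places, institutions), so it needs filtering before it is usable as training data. The same call on a list-of-companies page could supply the corporation counterexamples.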

I heard DataMade provides data scraping services, so... :)

Jun 19 '17 18:06 az0

Very true – good suggestion @az0!

Jun 20 '17 19:06 jeancochrane