probablepeople

Noticing Rare First/Last Names More Likely to be Categorized as Corporations

Open jonrobinson2 opened this issue 9 years ago • 5 comments

Was parsing a list of names for my wedding (super helpful!) and noticed that people with rarer names (first name, last name, etc.) are being labeled as corporations (no corporations are invited!). Don't feel super comfortable posting the name of a guest on GitHub, but figured I'd still file the issue.

Love the lib!

jonrobinson2, Feb 26 '17 16:02

For example, first names like Este and Eastern European last names like Morowicz were categorized as corporations. All things I can fix by hand of course, but I figure this may be an issue w/ the training data only coming from folks w/ Anglo-American names.

jonrobinson2, Feb 26 '17 17:02

Thanks for filing this @jonrobinson2! Glad to know the library is helpful for personal as well as professional work :)

@fgregg: I'd be interested in brainstorming ways we might source a set of Eastern European names.

Congratulations on your wedding – hope everything goes smoothly!

jeancochrane, Feb 28 '17 15:02

For future reference, @hancush recommended this paper:

http://www.pdmpassist.org/pdf/GroundTruthDataSetForRomanizedNames.pdf

Some interesting ideas in here for sourcing culturally diverse names.

jeancochrane, Feb 28 '17 16:02

I just started working with this library, and I also noticed it classifying an individual with a unique name as a corporation. My data set has a flag for individual/corporation that I can pass to the parser, but sometimes my flag is wrong.
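One way to catch cases where the flag and the parser disagree is to tag each name twice, once without a hint and once with the flag, and set aside the disagreements for manual review. The sketch below is only an illustration: `check_flag`, `FLAG_TO_TYPE`, and `EXPECTED_LABEL` are hypothetical names, and the tagger is passed in as an argument. It assumes the tagger behaves like `probablepeople.tag`, i.e. that it accepts an optional `type` keyword (`"person"` or `"company"`) and returns a `(tagged_parts, name_type)` pair; check that against the version you have installed.

```python
from typing import Any, Callable, Tuple

# Hypothetical mapping from a data set's own flag values to the parser's
# type hint. The hint values "person"/"company" are an assumption about
# what the tagger accepts.
FLAG_TO_TYPE = {"individual": "person", "corporation": "company"}

# Assumed labels the tagger returns when it guesses the type on its own.
EXPECTED_LABEL = {"person": "Person", "company": "Corporation"}


def check_flag(name: str, flag: str, tagger: Callable[..., Tuple[Any, str]]):
    """Tag `name` with and without the data set's flag.

    Returns (tagged_with_hint, guessed_label, agrees), where `agrees` is
    True when the unhinted guess matches the flag.
    """
    hint = FLAG_TO_TYPE[flag]

    # First pass: let the tagger guess the type with no hint.
    _, guessed_label = tagger(name)

    # Second pass: tag again, this time trusting the data set's flag.
    tagged_with_hint, _ = tagger(name, type=hint)

    agrees = guessed_label == EXPECTED_LABEL[hint]
    return tagged_with_hint, guessed_label, agrees
```

With the real library this would be called as something like `check_flag(name, flag, probablepeople.tag)`. Rows where `agrees` is `False` are the ones worth checking by hand, since either the flag or the parser's guess is wrong.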

To source a set of Eastern European names, or any kind of names, I suggest scraping Wikipedia. For example:

  • https://en.wikipedia.org/wiki/List_of_Polish_people
  • https://en.wikipedia.org/wiki/Category:Lists_of_German_people
  • https://en.wikipedia.org/wiki/List_of_people_from_Ukraine

For training your models, you may want to provide more counterexamples, such as the lists of corporations on Wikipedia.
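As a rough sketch of what that could look like, the snippet below pulls candidate names out of the wikitext of a list page and pairs them with a label. It assumes the wikitext has already been downloaded some other way, and that each entry is a bulleted line starting with a `[[link]]`; real list pages vary a lot, so the output would still need manual cleanup. The function names and the `"person"` / `"company"` labels are made up for illustration and are not the library's training format.

```python
import re

# Matches the first [[Target]] or [[Target|Label]] link on a bulleted line.
_LINK = re.compile(r"^\*+\s*\[\[([^\]|#]+)(?:\|[^\]]*)?\]\]")

# Matches a trailing disambiguator such as " (painter)".
_DISAMBIGUATOR = re.compile(r"\s*\([^)]*\)$")


def names_from_wikitext(wikitext):
    """Return link targets from bulleted list items, in document order."""
    names = []
    for line in wikitext.splitlines():
        match = _LINK.match(line.strip())
        if match:
            target = match.group(1).strip()
            names.append(_DISAMBIGUATOR.sub("", target))
    return names


def label_examples(person_wikitext, company_wikitext):
    """Pair each extracted name with a hypothetical type label."""
    people = [(name, "person") for name in names_from_wikitext(person_wikitext)]
    companies = [(name, "company") for name in names_from_wikitext(company_wikitext)]
    return people + companies
```

Headings, prose lines, and bullets that do not start with a link are skipped, so the result is a conservative list of candidates to review before turning any of it into labeled training data.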

I heard DataMade provides data scraping services, so... :)

az0, Jun 19 '17 18:06

Very true – good suggestion @az0!

jeancochrane, Jun 20 '17 19:06