Nationality Classification Using Name Embeddings

Junting Ye¹, Shuchu Han⁴, Yifan Hu², Baris Coskun³*, Meizhu Liu², Hong Qin¹, Steven Skiena¹
¹Stony Brook University, ²Yahoo! Research, ³Amazon AI, ⁴NEC Labs America
{juyye, shhan, qin, [email protected], {yifanhu, [email protected], [email protected]

ABSTRACT

Nationality identification unlocks important demographic information, with many applications in biomedical and sociological research. Existing name-based nationality classifiers use name substrings as features and are trained on small, unrepresentative sets of labeled names, typically extracted from Wikipedia. As a result, these methods achieve limited performance and cannot support fine-grained classification.

We exploit the phenomenon of homophily in communication patterns to learn name embeddings, a new representation that encodes gender, ethnicity, and nationality and is readily applicable to building classifiers and other systems. Through our analysis of 57M contact lists from a major Internet company, we design a fine-grained nationality classifier covering 39 groups representing over 90% of the world population. In an evaluation against other published systems over 13 common classes, our F1 score (0.795) is substantially better than that of our closest competitor, Ethnea (0.580).

[Figure: sample names shown with two legends. Ethnicity: Black, White, API, AIAN, 2PRACE, Hispanic. Nationality (Lv1): African, European, CelticEnglish, Greek, Jewish, Muslim, Nordic, EastAsian, SouthAsian, Hispanic.]
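To make the homophily idea concrete, the following is a minimal sketch, not the method used in this work. It assumes that names appearing in the same contact list count as co-occurring, and it substitutes raw co-occurrence count vectors with cosine similarity for a learned embedding. The contact lists, the names in them, and the helper functions `cooccurrence_vectors` and `cosine` are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import sqrt

# Hypothetical toy data: each inner list stands in for one user's contact list.
contact_lists = [
    ["Wei Zhang", "Li Wang", "Jing Chen"],
    ["Li Wang", "Jing Chen", "Yan Liu"],
    ["Wei Zhang", "Yan Liu", "Li Wang"],
    ["Carlos Garcia", "Maria Lopez", "Juan Perez"],
    ["Maria Lopez", "Juan Perez", "Ana Torres"],
    ["Carlos Garcia", "Ana Torres", "Maria Lopez"],
    ["Jing Chen", "Maria Lopez"],  # one cross-group contact list
]


def cooccurrence_vectors(lists):
    """Map each name to a sparse vector counting the names it shares lists with."""
    vectors = defaultdict(Counter)
    for names in lists:
        # Count every unordered pair of distinct names in the same list once.
        for a, b in combinations(set(names), 2):
            vectors[a][b] += 1
            vectors[b][a] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


vectors = cooccurrence_vectors(contact_lists)
same_group = cosine(vectors["Wei Zhang"], vectors["Yan Liu"])
cross_group = cosine(vectors["Wei Zhang"], vectors["Ana Torres"])
print(f"same group:  {same_group:.3f}")
print(f"cross group: {cross_group:.3f}")
```

On this toy data the same-group pair scores higher than the cross-group pair, because the two names share contacts. A learned embedding would replace the raw count vectors with dense vectors, but the signal it draws on is the same co-occurrence structure.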