
The Secret Lives of Names? Name Embeddings from Social Media

Junting Ye and Steven Skiena
Stony Brook University, Stony Brook, NY
[email protected], [email protected]

ABSTRACT

Your name tells a lot about you: your gender, ethnicity and so on. It has been shown that name embeddings are more effective in representing names than traditional substring features. However, our previous name embedding model was trained on private email data and is not publicly accessible. In this paper, we explore learning name embeddings from public Twitter data. We argue that Twitter embeddings have two key advantages: (i) they can and will be publicly released to support the research community; (ii) even with a smaller training corpus, Twitter embeddings achieve performance similar to email embeddings on multiple tasks. As a test case to show the power of name embeddings, we investigate the modeling of lifespans. We find it interesting that adding name embeddings can further improve the performance of models using demographic features, which are traditionally used for lifespan modeling. Through residual analysis, we observe that fine-grained groups (potentially reflecting socioeconomic status) are the latent contributing factors encoded in name embeddings. These were previously hidden to demographic models, and may help to enhance the predictive power of a wide class of research studies.

ACM Reference Format:
Junting Ye and Steven Skiena. 2019. The Secret Lives of Names? Name Embeddings from Social Media. In Proceedings of KDD '19: 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '19). ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION

Your name tells a lot about you. It commonly reveals your gender (male or female) and ethnicity (White, Black, Hispanic, or Asian/Pacific Islander). It can reveal your religion and your country of family origin. It can even inform on your marital status (is it hyphenated?), age (e.g. the generational differences between Fannie and Caitlin), or socioeconomic class (consider Archibald vs. Jethro).

Name embeddings are distributed representations which encode the cultural context of name parts (i.e. first names and last names) in 100-dimension vectors learned through an unsupervised technique. It has been shown that name embeddings are more effective representations than substrings on various tasks [18, 37]. Table 1 presents a representative set of name parts, each with their four nearest neighbors in name embedding space. It is clear that they preserve associations of gender and ethnicity. Unfortunately, previous embeddings were trained on private email data and are not publicly accessible to the research community.

Table 1: Four nearest neighbors of representative names in Twitter embedding space, showing how they preserve gender and ethnicity associations. Notes: Asian (Chinese, Korean, Japanese, Vietnamese), British, European (Spanish, Italian), Middle Eastern (Arabic, Hebrew), North American (African-American, Native American, Contemporary).

Male        1st NN     2nd NN     3rd NN     4th NN
Andy        Pete       Stuart     Craig      Will
Dario       Giovanni   Luigi      Francesco  Claudio
Hilton      Jefferson  Maryellen  Jayme      Brock
Lamar       Ty         Reggie     Jada       Myles
Mohammad    Abdul      Ahmad      Hassan     Ahmed
Rocco       Francesca  Carlo      Giovanni   Luigi

Female      1st NN     2nd NN     3rd NN     4th NN
Adrienne    Aimee      Brittany   April      Kristen
Aisha       Maryam     Fatima     Ayesha     Fatimah
Brianna     Brooke     Kayla      Kaylee     Megan
Chan        Ka         Cherry     Yun        Sha
Cheyenne    Hannah     Kayla      Madison    Kelsey
Gabriella   Isabella   Dario      Cecilia    Paola

In this paper, we propose to learn name embeddings from public Twitter data. Our motivation is that name embeddings perform well because of homophily, i.e. the tendency for people to associate with those similar to themselves. These associations are reflected by communication patterns, which explains why large-scale email networks proved so effective at elucidating them. We argue that homophily in communication is universal, and also exists in social media [4]. Two major properties make Twitter embeddings a better alternative: (i) Twitter name embeddings can and will be released to support the research community; (ii) Twitter embeddings achieve similar performance on gender, ethnicity and nationality identification as Email embeddings, even though the training corpus for Email is two times larger than that for Twitter. We observe that Twitter embeddings have better performance on gender prediction, while Email embeddings achieve higher scores on ethnicity predictions.

A second focus of our work is to demonstrate the predictive power of name embeddings on lifespan modeling, where gender, ethnicity and nationality are all contributing features. Average lifespan is one of the most critical measurements associated with quality of life across different demographic groups. Mortality prediction for individuals from available features is the foundation of the life insurance industry. Here we demonstrate how an individual's most readily available features (names and corresponding embeddings) can be used to improve accuracy over comparable demographic models. It is an amazing testament to the power of homophily that contemporary communication patterns can account for mortality in people born over a century ago.

We summarize our primary contributions in this paper as follows:

• Twitter name embeddings. We explore and evaluate nine versions of Twitter name embeddings (see Table 2). We make several interesting observations via performance comparisons: (i) Mention embeddings outperform Email embeddings and other Twitter embeddings on gender recognition, indicating stronger gender homophily in Twitter mentions. (ii) Follower embeddings work better than Followee, because ordinary users' followers tend to be family members and/or close friends, while there are more celebrities among followees. Performance improves after removing celebrities' names from followee lists. (iii) Aggregated* embeddings perform the best among the nine Twitter versions. They have a similar vocabulary size and achieve comparable performance on gender, ethnicity and nationality classification as Email embeddings. The Twitter name embeddings are shared with the research community (www.name-prism.com).

• Demonstrating the power of name embeddings to improve lifespan modeling. To demonstrate the power of name embeddings, we train a series of models to predict lifespan as a function of five traditional demographic variables (birth year, state, gender, ethnicity and nationality) and name embedding features. We construct 32 (i.e. 2^5) different sets of linear regression models containing specific subsets of demographic variables, with and without Twitter/email name embeddings. Incorporating name embeddings improves the underlying models significantly in all cases (p values smaller than 0.01).

• Uncovering latent factors encoded in name embeddings. Implicit feature models, like name embeddings, do not come with natural explanations of exactly what effective properties they are encoding. However, we can gain insight by identifying the names which contribute most strongly to the final model. By conducting residual analysis, we obtain the most favorable/unfavorable names, i.e. those that increase/decrease predicted lifespan the most through the latent factors in name embeddings. We observe that fine-grained groups are the latent contributing factors encoded in name embeddings. For example, our results show a class-based life-expectancy bias against diminutive names (e.g. Wm, Dan and Guy) as compared to their formal forms (William, Daniel and Guido). In addition, 17 out of the 20 most favorable last names have Jewish origins, which agrees with the existing observation that Jewish people have above-average lifespans [3].

It is important to note that name embeddings encode homophily as features without explicit labels of gender, ethnicity and nationality. Models which discriminate on such criteria are a growing social concern [25]. Name embeddings have the potential to help identify biases, as name embedding-based classifiers [37] are already widely used by over 100 social scientists and economists to study discrimination and homophily [14, 34, 35]. For example, Gornall and Strebulaev find that Asian entrepreneurs received a 6% higher rate of interested replies than White entrepreneurs, after sending 80,000 pitch emails introducing promising but fictitious start-ups to 28,000 venture capitalists [17]. AlShebli et al. study the effect of diversity on scientific impact, as reflected in citations. They find that ethnic diversity has the strongest correlation with scientific impact [5]. Therefore, we believe public and sharable name embeddings will help to enhance the predictive power of a wide class of research studies.

2 RELATED WORK

2.1 Names and Mortality

There have been several previous studies of the impact of names on lifespans. Compared to our work, these have generally been performed on smaller datasets (hundreds or perhaps thousands of individuals), versus the 85 million names in our study. Further, they have generally studied surface features of names as opposed to the latent properties exposed by our name embeddings. In particular, Abel and Kruger [2] observed that several categories of people whose first name began with 'D' appeared to die earlier than those with other names. This effect did not show up in a larger-scale study [31], and an independent study by Pinzur and Smith [26] concludes that first name and life expectancy are not related.

Among athletes, Abel and Kruger [1] observe that having nicknames increases longevity. Shin and Cho [30] report that self-reported stress declines after people legally change their names, demonstrating that there can be genuine physiological effects associated with undesired names. Pena's analysis of SSDI data suggests that people with more frequent names have shorter average and median lifespans.

Nelson and Simmons [24] identify several surprising impacts of names, including that students whose names begin with C or D achieve lower GPAs and attend lower-ranked law schools than do students whose names begin with A or B. Jones et al. [21] find that people disproportionately marry others whose first or last name resembles their own.

2.2 Gender, Nationality and Ethnicity Detection

Nationality and ethnicity are important demographic categorizations of people, standing in as proxies to represent a range of cultural and historical experiences. Names are important markers of cultural diversity, and have often served as the basis of automatic classification for biomedical and sociological research.

In the medical literature, nationality from names has been used as a proxy to reflect genetic differences [9, 12] and public health disparities [10, 28] between groups. Nationality identification is also important in ad targeting, and in academic studies of political campaigns and social media [7, 13]. Name analysis is often the only practical way to gather ethnicity/nationality annotations because of privacy or legal concerns.

Name ethnicity classifiers often make use of characteristic substrings in names as features [6, 13, 33]. Ambekar et al. combine a decision tree and a Hidden Markov Model to conduct hierarchical classification on a taxonomy with 13 leaf classes [6]. Treeratpituk et al. utilize both alphabet and phonetic sequences in names to improve performance [33]. Chang et al. use Bayesian methods to infer the ethnicity of Facebook users with US census data and study the interactions between ethnic groups [13]. Linguistic features from users' tweets also reveal their ethnicities [27]. Other relevant efforts are binary ethnicity classifiers on names, e.g. Hispanic vs. Non-Hispanic [11], Chinese vs. Non-Chinese [15], South Asian vs. Non-South Asian [19].

The ethnicity/nationality classifier NamePrism consists of a 39-class name nationality classifier and a 6-class ethnicity classifier. It uses a Naive Bayes model for training and testing on 74 million names labeled with country of residence. Extensive experiments [37] demonstrate that it achieves better classification performance (F1 score) on names drawn from Wikipedia (0.651) and Email/Twitter (0.795) than the competing HMM classifier [6]. We adopt NamePrism for the ethnicity/nationality classification in our experiments.

Gender classifiers typically classify names according to statistics on the ratio of males to females observed in the U.S. Census. More specifically, we use data from the 1990 U.S. Census to label popular first names by gender. We use these names' labels to approximate those of less common names. In particular, for a given first name f, we find its k nearest neighbors in name embedding space and use a majority vote to decide the gender of f.
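To make the nearest-neighbor voting scheme concrete, the sketch below shows one way it could be implemented. It is a minimal illustration rather than the authors' code: the `embeddings` dictionary, the list of Census-labeled names, the use of cosine similarity, and k = 5 are all assumed inputs.

```python
import numpy as np

def knn_gender(name, embeddings, labeled_names, k=5):
    """Majority vote over the k nearest Census-labeled first names in embedding space.

    embeddings    : dict mapping a lower-cased name part to its 100-d numpy vector
    labeled_names : iterable of (name, gender) pairs labeled from the 1990 Census
    """
    if name not in embeddings:
        return None                          # name part missing from the vocabulary
    query = embeddings[name]
    query = query / np.linalg.norm(query)
    scored = []
    for other, gender in labeled_names:
        if other == name or other not in embeddings:
            continue
        vec = embeddings[other]
        sim = float(np.dot(query, vec) / np.linalg.norm(vec))   # cosine similarity
        scored.append((sim, gender))
    scored.sort(reverse=True)                # most similar labeled names first
    top = [gender for _, gender in scored[:k]]
    return max(set(top), key=top.count)      # majority vote decides the gender
```

Cosine similarity is used here as the notion of nearness; any metric consistent with the embedding space would serve the same purpose.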
3 NAME EMBEDDINGS

Distributed representations are feature encodings where objects are represented by points in an abstract d-dimensional space, such that similar objects are represented by points close in space. Such representations are a fundamental aspect of Deep Learning [16], a recent approach to machine learning which has proven to lead to improved results on many computer vision and natural language processing tasks. Word embeddings are a particularly important type of distributed representation, where each word is denoted by a single point, so that words which play similar roles tend to be represented by nearby points [23].

Inspired by word embeddings, Ye et al. develop name embeddings as a form of distributed representation to capture the semantic meaning of first-name and last-name parts [37]. These representations were trained on the email contact lists of 57 million people. The use of contact lists is motivated by the principle of homophily: that people generally communicate with people similar to themselves [22]. In other words, people disproportionately associate with others of the same gender, ethnicity, nationality, and class.

More formally, the name embedding algorithm tries to maximize the following objective:

    \log \sigma(v_n^\top v_{n'}) + \sum_{i=1}^{k} \mathbb{E}_{n_i \sim D}\left[\log \sigma(-v_n^\top v_{n_i})\right]    (1)

where v_n is the embedding of name part n, and v_{n'} is the embedding of a nearby name part n' that co-occurs with n in the same contact list. n_i is a random sample from the name part distribution D, and σ(x) is the sigmoid function, i.e. σ(x) = 1/(1 + e^{-x}).

In a nutshell, the objective aims to maximize the similarities of nearby name part pairs (the first term) and minimize random pair similarities (the second term, i.e. negative sampling). Therefore, the locality properties of name embeddings reflect underlying similarities between name parts, e.g. gender, ethnicity and nationality.

However, the Email embeddings and NamePrism are corporate property and not shareable [37]. In this section, we discuss how to learn powerful name embeddings from public Twitter data. In particular, we focus on comparing embeddings trained on different user associations from Twitter. We appreciate generous assistance from the NamePrism team in preparing the experiments.

3.1 Learning Name Embeddings from Twitter

We explore the potential of learning name embeddings from Twitter, one of the most popular social media platforms in the world. Its API enables us to access public Tweets and user profiles. In this paper, we are interested in two types of data: (i) Tweets containing user associations, including those with user mentions and retweets; (ii) follower and followee lists of ordinary users (numbers of followers/followees in the range of 50 to 500). We assume that the follower/followee lists of these users tend to encode more homophily signals than those of celebrities or inactive users.

3.1.1 Nine Training Corpora. Nine different Twitter training corpora are prepared to compare the strength of different embeddings. Their definitions are as follows. All names are extracted from Twitter profiles using user IDs. We expect names in the pairs/lists to be statistically similar because of homophily. For convenience of description, let U_r(u_0) = {u_1, u_2, ..., u_n} be the follower list of user u_0, and U_e(u_0) = {u_1, u_2, ..., u_m} be u_0's followee list.

• Retweet: Twitter user pairs extracted from retweets, i.e. (u_0, u_1). u_0 posts the retweet and u_1 is the original Tweet author.
• Mention: Lists of users extracted from Tweets with user mentions, i.e. (u_0, u_1, ..., u_n). u_0 is the user posting the Tweet; u_1 to u_n are the users mentioned in the Tweet.
• Follower: Lists of users who follow user u_0 (i.e. U_r(u_0)).
• Followee: Lists of users whom user u_0 follows (i.e. U_e(u_0)).
• Followee*: We remove celebrities with more than 10,000 followers from followee lists. We assume less homophily between celebrities and fans.
• Friend: Users whom u_0 follows and who also follow u_0 (i.e. U_f(u_0) = U_r(u_0) ∩ U_e(u_0)).
• NonFriend: Users who are either followers or followees of u_0 but not both (i.e. the symmetric difference U_r(u_0) △ U_e(u_0)).
• Aggregated: Aggregation of Retweet, Mention, Follower and Followee. Friend and NonFriend are excluded due to redundancy.
• Aggregated*: An aggregation of Retweet, Mention, Follower and Followee*.
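The objective in Equation (1) is exactly the skip-gram model with negative sampling, so off-the-shelf word2vec tooling can reproduce the training step once the association lists above are written out as "sentences" of name parts. The sketch below uses gensim (4.x API) as a stand-in for the authors' implementation, with the hyper-parameter values reported later in Section 3.1.4; the tab-separated input file and its name are assumptions.

```python
from gensim.models import Word2Vec

def association_lists(path):
    """Yield one association list per line, e.g. the name parts of a tweet's author
    and all mentioned users: 'john<TAB>smith<TAB>maria<TAB>garcia' (assumed format)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").lower().split("\t")
            if len(parts) > 1:
                yield parts

sentences = list(association_lists("mention_lists.tsv"))   # hypothetical corpus file
model = Word2Vec(
    sentences,
    sg=1,             # skip-gram
    negative=10,      # 10 negative samples per pair, the second term of Eq. (1)
    vector_size=100,  # 100-dimensional name embeddings
    window=5,         # context window of 5
    min_count=5,      # drop name parts with fewer than 5 occurrences
    epochs=20,
)
model.wv.save("twitter_name_embeddings.kv")        # reusable KeyedVectors
print(model.wv.most_similar("mohammad", topn=4))   # nearest neighbors, cf. Table 1
```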

Figure 1: Follower count distributions of seed users' followers and followees. We characterize Twitter users with r, the ratio of follower count over followee count. Celebrity: r > 10. Ordinary: r ≤ 10. There are more celebrities among followees. Homophily between fans and celebrities is not as strong as that between families and friends, so Followee* removes the names of celebrities to strengthen homophily among followee lists.

Table 2: Nine training corpora for Twitter name embeddings. Email is the baseline corpus. The corpus size of Followee* is much smaller than Followee, while the vocabulary size does not change much. Aggregated* has similar vocabulary and corpus sizes as Email.

Embedding     Vocab. Size   Corpus Size
Retweet       0.67M         53.61M
Mention       1.19M         174.30M
Follower      1.39M         140.69M
Followee      1.21M         204.40M
Followee*     1.19M         94.20M
Friends       0.77M         60.89M
NonFriends    1.30M         223.25M
Aggregated    3.01M         573.00M
Aggregated*   2.99M         508.13M
Email         4.10M         1140.00M

3.1.2 Data Cleaning. Raw data from Twitter can be noisy. The following rules are used to clean data from the Twitter API: the first two rules filter out low-quality user associations, and the last one normalizes name strings from user profiles.

• Tweets: The Twitter API provides a small sample of real-time public Tweets (https://developer.twitter.com/en/docs/tweets/sample-realtime/overview/GET_statuse_sample). On average, we collect about 3.5M Tweets every day. ~17% (0.6M) are retweets and ~54% (1.9M) contain at least one user mention. The remaining Tweets are filtered out.
• Users: In order to get lists of followers and followees, we choose a random set of Twitter users meeting the following standards as seed users: (i) number of followers in the range [50, 500]; (ii) number of followees in the range [50, 500]; (iii) fewer than 10 posts per day on average. The motivation is to select Twitter users with enough social links who are neither celebrities nor social bots [20].
• Names: Twitter user names can be very noisy, e.g. random strings, misspelled words, emoji and decorations. Therefore, we remove special symbols, punctuation and notations in various languages from names. We also filter out names without separators because it is not certain whether they are first or last names. Uncommon names with fewer than 5 occurrences are also removed.

3.1.3 Followers vs. Followees. After aggregating followers and followees separately, we find that these two user groups are fundamentally different. As shown in Figure 1, we use a simple but effective way to characterize users, measuring the ratio of follower count over followee count (referred to as r). A user is assigned the label celebrity if r is greater than 10, and ordinary otherwise.

As shown on the right of Figure 1, almost half of the followees are celebrities. These celebrities tend to have more than 10,000 followers. We argue that the reason is that Twitter allows one-way relations instead of the reciprocal relations of Facebook. Therefore, ordinary users can follow celebrities they like, as well as their friends and family. As a consequence, these users have more celebrities among their followees and more friends/family among their followers. We argue that homophily among friends/family is stronger than that among celebrities and their fans. We will show in the Experiments section that performance improves after removing the celebrities.
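A possible rendering of these filtering and normalization rules in code is sketched below. The thresholds come from the text; the dictionary field names mimic the Twitter API v1.1 user object, and `account_age_days` is a hypothetical precomputed field, so treat the whole snippet as illustrative.

```python
import re
import unicodedata

def is_seed_user(user):
    """Seed users: 50-500 followers, 50-500 followees, fewer than 10 posts per day."""
    followers = user["followers_count"]
    followees = user["friends_count"]           # Twitter's name for followees
    daily_posts = user["statuses_count"] / max(1, user["account_age_days"])
    return 50 <= followers <= 500 and 50 <= followees <= 500 and daily_posts < 10

def follower_followee_ratio(user):
    """r = follower count / followee count; r > 10 marks a celebrity (Figure 1).
    Note: the Followee* corpus instead drops accounts with over 10,000 followers."""
    return user["followers_count"] / max(1, user["friends_count"])

def normalize_name(raw_name):
    """Strip symbols, punctuation and decorations; keep only names whose first and
    last parts can be separated; return the list of name parts, or None otherwise."""
    text = unicodedata.normalize("NFKC", raw_name).lower()
    text = re.sub(r"[^\w\s]", " ", text)        # drop punctuation, emoji, notations
    parts = text.split()
    return parts if len(parts) >= 2 else None
```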
3.1.4 Hyper Parameters. One of our goals is to compare the performance of Twitter embeddings with the email embeddings learned in [37]. Therefore, the same experimental settings are used: a skip-gram model with negative sampling. Each name part embedding has 100 dimensions. The size of the moving context window is 5, with 10 examples for negative sampling. We learn the embeddings for 20 epochs. Strings with fewer than five occurrences in the corpus are ignored.

3.2 Experiments

3.2.1 Dataset. Two raw datasets have been collected from Twitter for the experiments: (i) 286 million Tweets collected from the real-time stream sample from Jan. 15 to Mar. 21, 2018; (ii) a collection of 922,140 seed users' full lists of followers and followees. Seed users are collected from the real-time Tweet stream. 89 million unique user profiles are gathered to extract the names of the followers and followees. As shown in Table 2, we prepared nine training corpora from this data. Email is the dataset used in [37], collected from 57 million email users.

3.2.2 Performance Comparison. Name embeddings prove extremely useful for various tasks because they encode cultural signals of name parts implicitly in the distributed representations. Among the many latent signals, gender, ethnicity and nationality are major ones that can easily be evaluated. We use ground truth labels from the U.S. Census Bureau to measure whether same-gender and same-ethnicity names sit together in embedding space. 74 million name labels from [37] are used to compare classification performance on a 39-leaf nationality taxonomy. 80% of the labels are used for training and 20% for testing.

Census 1990 contains ground truth labels of 1,219 male and 4,275 female first names. Census 2000 provides the ethnic distribution of 151,671 last names. We use the names that exist in the vocabularies of all embeddings for a fair comparison, resulting in 878 male and 3,479 female names for the gender evaluation. 58,407 White, 2,519 Black, 4,521 API (Asian and Pacific Islander) and 5,346 Hispanic names are collected for ethnicity in the same manner.

Figure 2: Ratio of same-gender names among the top k nearest neighbors (k ∈ [1, 10, 50, 100]) in name embedding spaces. Mention performs the best (avg. on female: 0.94, male: 0.74), reflecting stronger gender homophily in Twitter mentions. Aggregated outperforms Email on average (female: 0.94 vs. 0.91, male: 0.67 vs. 0.59). Random performance is proportional to the ratio of labeled female name count over male.

Table 3: Ratios of same-ethnicity names among nearest neighbors. Aggregated* achieves the highest ratios among all Twitter embeddings and gets comparable performance to Email. Follower outperforms Followee, while Followee* has the same average ratio as Follower, which validates that removing celebrities is effective. (API: Asian and Pacific Islander)

Embedding     White   Black   API    Hisp.   Avg.
Random        0.82    0.04    0.06   0.08    0.25
Retweet       0.92    0.20    0.57   0.64    0.58
Mention       0.93    0.22    0.61   0.71    0.62
Follower      0.94    0.31    0.77   0.86    0.72
Followee      0.92    0.27    0.72   0.81    0.68
Followee*     0.94    0.31    0.77   0.84    0.72
Friends       0.93    0.28    0.74   0.81    0.69
NonFriends    0.92    0.26    0.71   0.82    0.68
Aggregated    0.93    0.32    0.76   0.83    0.71
Aggregated*   0.94    0.33    0.79   0.86    0.73
Email         0.96    0.47    0.83   0.87    0.78

Figure 2 compares performance on gender. Mention embeddings consistently outperform the other embeddings by significant margins on both females and males. This suggests that the gender bias in Twitter "mentions" is much stronger than that in "retweets" and follower/followee relations. In other words, Twitter users are more likely to "@" others of the same gender, who probably share similar interests or opinions. Aggregated* has a similar vocabulary size as Email and achieves better performance than Email for both genders. Followee* gets a slightly smaller ratio than Followee on females (avg. 0.88 vs. 0.91) but is significantly better on males (avg. 0.65 vs. 0.50). Random embeddings mean that each name part is assigned a random name embedding such that the names of each gender are uniformly distributed in embedding space. Given a male name, for example, its nearest neighbors then have almost the same distribution as the overall gender distribution of the label set. Therefore, we expect performance on male names to be lower than on female names, because there are far fewer male name labels (29% vs. 71%). We also use similar random embeddings for the ethnicity evaluation.

Table 3 shows the ratios of same-ethnicity last names among their nearest neighbors. It is interesting to see that Mention gets higher scores than Retweet, indicating more ethnic homophily in mentions. One possible explanation is that users are more likely to mention or raise the attention of their friends, while retweeting or quoting the famous. As we have shown in Figure 1, there are more celebrities among followees, so Followee has lower same-ethnicity ratios than Follower. After removing celebrity names, Followee* performs similarly to Follower. The superior performance of Followee* over Followee on both gender and ethnicity validates the weaker homophily among celebrity-fan pairs and the effectiveness of removing celebrities. Consequently, Aggregated* performs best among all Twitter embeddings, after combining training examples from Followee* instead of Followee. Email gets the highest ratios among all. Black names are harder to classify because they only make up 3.5% of all labels.

To make a fair comparison on nationality performance, we adopt the same classification method, experimental settings and label data as in [37]. Table 4 shows that Aggregated* (Tw.) has similar performance as Email (Em.). For some classes, like Thailand, Baltics and Pakistan, Aggregated* outperforms the Email embeddings. Email performs slightly better than Twitter w.r.t. the count-weighted average F1 score over the 39 leaf classes. We also noticed that performance is highly dependent on the amount of data. For less developed places like Cambodia and countries in Central Asia and the Maghreb, very limited user associations and labels are collected. Therefore, their F1 scores are much below the average performance.
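Both the gender curves in Figure 2 and the ethnicity ratios in Table 3 are instances of the same quantity: the fraction of a labeled name's top-k nearest neighbors that share its label. A compact sketch using gensim KeyedVectors is given below; the label dictionary and the choice to count unlabeled neighbors as mismatches are simplifying assumptions.

```python
def same_label_ratio(vectors, labels, k=10):
    """Average fraction of the top-k nearest neighbors sharing the query's label.

    vectors : gensim KeyedVectors holding the name-part embeddings
    labels  : dict mapping a name part to its label (gender or ethnicity)
    """
    ratios = []
    for name, label in labels.items():
        if name not in vectors:
            continue                                     # skip out-of-vocabulary names
        neighbors = vectors.most_similar(name, topn=k)   # [(neighbor, similarity), ...]
        same = sum(1 for other, _ in neighbors if labels.get(other) == label)
        ratios.append(same / k)
    return sum(ratios) / len(ratios)

# e.g. average same-gender ratio at k = 10:
# print(same_label_ratio(model.wv, census_gender_labels, k=10))
```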
scores than Retweet, indicating more ethnic homophily in mentions. One possible explanation is users are more likely to mention or 4.1 Social Security Death Index Dataset raise attention from their friends while retweeting or quoting from The Social Security Death Index (SSDI) is maintained and dis- the famous ones. As we have shown in Figure 1, there are more tributed by the Social Security Administration to prevent identity celebrities among followees. So Followee has lower same-ethnicity fraud associated with using of deceased individuals. The ratios than Followers. After removing celebrity names, Followee* SSDI has also been employed in hundreds of academic research outperforms similarly as Follower. The superior performances of associated with medical and demographic analysis, such as [8, 32]. Followee* over Followee on both gender and ethnicities validate The research applicability of the SSDI compared to other resources less homophily among celebrity-fan pairs and the effectiveness of has been studied in [29, 36]. removing celebrities. Therefore, Aggregate* performs best among Each record in the SSDI consists of an individual’s full name, their all Twitter embeddings, after combining training examples from date of birth and death, and their social security number (SSN). The Followee* instead of Followee. Email gets highest ratio among all. dataset we studied contains 85,822,194 death records. Our analysis KDD ’19, August 4 – 8, 2019, Anchorage, Alaska USA Junting Ye and Steven Skiena

Table 4: Nationality classification performance (F1 scores) of Email (Em.) and Twitter (Aggregated*, Tw.) embeddings on a 39-leaf nationality taxonomy. The taxonomy has three levels; '*' marks leaf nationalities. Weighted Avg. is the count-weighted average F1 score over the leaf nationalities. Twitter embeddings achieve comparable performance on nationality classification.

Nationality      Name#    Em.    Tw.
CelticEnglish*   3505K    0.73   0.70
Muslim           1475K    0.74   0.73
Jewish*          11K      0.40   0.37
SouthAsian*      2623K    0.89   0.88
African          606K     0.59   0.56
EastAsian        6157K    0.92   0.91
Hispanic         6892K    0.91   0.89
Greek*           259K     0.89   0.87
Nordic           195K     0.73   0.70
—                5371K    0.84   0.81
EastEurope*      65K      0.49   0.49
Nubian*          577K     0.65   0.62
Maghreb*         47K      0.15   0.14
SouthKorea*      68K      0.86   0.83
Malay            2596K    0.86   0.84
Chinese*         2901K    0.93   0.92
Portuguese*      2683K    0.89   0.87
Turkic           78K      0.68   0.66
Pakistanis       179K     0.51   0.50
Philippines*     1137K    0.72   0.69
Persian*         423K     0.66   0.64
Spanish*         3072K    0.85   0.83
Scandinavian     165K     0.70   0.67
Finland*         30K      0.74   0.72
German*          1278K    0.74   0.70
WestAfrican*     315K     0.56   0.54
Baltics*         12K      0.41   0.42
Japan*           65K      0.84   0.78
SouthAfrican*    66K      0.37   0.36
Russian*         121K     0.72   0.72
Arabia*          172K     0.51   0.51
EastAfrican*     225K     0.57   0.53
French*          2674K    0.83   0.80
Indochina        528K     0.90   0.87
SouthSlavs*      68K      0.57   0.54
Italian          1153K    0.75   0.72
Cambodia*        1K       0.16   0.05
Turkey*          75K      0.69   0.68
Sweden*          74K      0.61   0.58
Bangladesh*      78K      0.58   0.56
Vietnam*         502K     0.91   0.89
Thailand*        18K      0.59   0.67
Malaysia*        242K     0.48   0.45
Pakistan*        101K     0.45   0.50
Denmark*         49K      0.66   0.63
CentralAsian*    3K       0.20   0.16
Italy*           825K     0.71   0.68
Romania*         329K     0.66   0.64
Indonesia*       2354K    0.87   0.84
Norway*          42K      0.62   0.59
Myanmar*         7K       0.61   0.58
Weighted Avg.    —        0.81   0.79

Our analysis was performed on the master file of November 30, 2011 (dataset available: http://ssdmf.info/), using a random sample of 2,991,927 records for experiments.

Figure 3 (Top) presents the number of SSDI records by birth year, further broken down by gender. The peak of the distribution was born between 1910 and 1930. Before this peak, women outnumber men in the database, a consequence of more of them surviving to be issued social security cards. Men and women have been represented with equal frequency since approximately 1945. Figure 3 (Bottom) presents the average lifespan of SSDI records by birth year, further broken down by gender. Survivorship biases account for this strange distribution. The earliest records have an average lifespan above 90, reflecting that they had to live long enough to receive identification numbers (i.e. survivorship bias). The average lifespan has decreased almost linearly since 1940, and equally for women as for men. We anticipate that these totals will increase and diverge with time, as the distribution moves beyond the prematurely deceased.

Figure 3: Distributions of SSDI records. Top: the number of records sorted by birth year. Most were born between 1910 and 1930. Bottom: the average lifespan by birth year. Survivorship bias causes unusually long lifespans at the beginning, while the prematurely deceased produce decreasing lifespans at the end of the curves.

4.2 Demographic Features

We extracted/inferred the following discrete demographic features from each SSDI record, of the type traditional for lifespan models. The classifiers for gender, ethnicity and nationality predictions are introduced in the Related Work section.

• Birth year: Birth years are represented by 130 binary features, each corresponding to one birth year.
• State: We infer states using the first three digits of the SSN. In total, 59 possible binary state/territory features are extracted.
• Gender: Gender is inferred with a classifier based on U.S. census data.
• Ethnicity: We use NamePrism to predict ethnicity based on names.
• Nationality: NamePrism is also used to predict nationality based on names.
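In vector form, each record therefore becomes a concatenation of sparse one-hot blocks. The sketch below assembles the first four blocks for a single record; the year range, the SSN-prefix lookup table and the category labels are illustrative assumptions (the 39-class nationality block is analogous and omitted).

```python
import numpy as np

BIRTH_YEARS = list(range(1870, 2000))               # 130 one-hot birth-year slots (assumed range)
STATES = ["state_%02d" % i for i in range(59)]      # 59 state/territory slots (placeholder ids)
GENDERS = ["female", "male"]
ETHNICITIES = ["white", "black", "api", "hispanic", "aian", "2prace"]  # assumed 6-class labels

def one_hot(value, categories):
    vec = np.zeros(len(categories))
    if value in categories:
        vec[categories.index(value)] = 1.0
    return vec

def demographic_vector(record, ssn_prefix_to_state):
    """record: dict with 'birth_year', 'ssn', and name-inferred 'gender', 'ethnicity'.
    ssn_prefix_to_state: mapping from the first three SSN digits to a state id."""
    state = ssn_prefix_to_state.get(record["ssn"][:3])
    return np.concatenate([
        one_hot(record["birth_year"], BIRTH_YEARS),
        one_hot(state, STATES),
        one_hot(record["gender"], GENDERS),
        one_hot(record["ethnicity"], ETHNICITIES),
    ])
```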

4.3 Linear Regression Models

To evaluate the power of name embeddings for predicting lifespan, we build 32 (i.e. 2^5) sets of models using linear regression. Each set is trained on a particular subset of the 5 demographic features described above. The four models of each set are distinguished by whether they use no embedding features (NoEbd), Twitter name embeddings (TwEbd), Email name embeddings (EmEbd), or a randomly shuffled permutation of the Twitter embeddings that adds dimensionality without additional information (ShEbd), as a control.

Let X denote the feature vectors and Y the ground-truth lifespans. y_i denotes the lifespan of the i-th record and ŷ_i is the predicted lifespan using the i-th feature vector x_i. Then |y_i − ŷ_i| is the prediction error (in years). We seek the coefficients w that optimize the following loss function:

    \min_{w} \sum_i (y_i - w x_i)^\top (y_i - w x_i) + \lambda \lVert w \rVert_2    (2)

Here λ is a constant governing the strength of the regularization term, to guard against overfitting. We observed that performance is not sensitive to λ, and it is empirically set to 0.003 for all regression models.

Table 5: Average prediction error (in years) of seven sets of models using different features. The demographic features are: birth year (B), state (S), gender (G), ethnicity (E), nationality (N). Extra features include: no embedding (NoEbd), shuffled embedding (ShEbd), Email embedding (EmEbd), Twitter embedding (TwEbd). Each number (prediction error) is the average of 20 runs. Birth year is the most important feature due to survivorship bias. Using name embeddings improves performance significantly.

B S G E N    NoEbd    ShEbd    EmEbd    TwEbd
- - - - -    13.418   13.423   12.781   12.747
✓ - - - -    8.052    8.049    7.792    7.800
- ✓ - - -    13.373   13.369   12.765   12.742
- - ✓ - -    13.150   13.135   12.768   12.745
- - - ✓ -    13.314   13.309   12.761   12.721
- - - - ✓    13.271   13.268   12.765   12.734
✓ ✓ ✓ ✓ ✓    7.775    7.777    7.739    7.744

4.4 Performance Analysis

We use 90% of the records as training data and the rest as testing data. Table 5 presents the average test error of 20 runs after random divisions of training and testing data. Due to survivorship bias, the most powerful single feature is the birth year, which yields an absolute error of 8.052 years. The strength of the birth year feature separates the models into two groups, with/without birth year (Figure 4).

To understand the effect of embeddings on lifespan models, we visualize the results of the 32 model sets as points in Figure 4. The performance is strikingly linear, and adding name embeddings improves the models under each of the 32 possible variable settings. We also conduct four significance tests using Welch's t-test with the following null hypotheses: (i) the means of NoEbd and ShEbd are the same; (ii) the means of TwEbd and EmEbd are the same; (iii) the means of NoEbd and EmEbd are the same; (iv) the means of NoEbd and TwEbd are the same. The p values of the first two tests under all variable settings are larger than 0.01, and the p values of the last two tests are smaller than 0.01.

The significance test results show that (i) name embeddings further improve the performance of demographic lifespan models; (ii) the improvements come from latent signals encoded in name embeddings instead of the increase in feature dimensions; and (iii) Twitter embeddings have similar performance to Email embeddings on lifespan modeling.

Figure 4: Visualization of prediction errors (in years) of the 32 sets of models (with two zoom-in insets). Using name embeddings (blue and green dots) improves performance significantly (all p values smaller than 0.01 under Welch's t-test). Red dots lie on the line y = x, reflecting that training without embeddings performs similarly to training with shuffled embeddings. Therefore, the gains of name embeddings come from the latent signals captured by the embeddings, instead of the increase in feature dimensions.
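The 32 model sets and the accompanying significance tests can be reproduced with standard tooling; the sketch below uses scikit-learn's Ridge (its alpha standing in for λ) and SciPy's Welch t-test as stand-ins for the authors' implementation. Feature blocks are assumed to be pre-built numpy matrices with one row per record.

```python
from itertools import chain, combinations
import numpy as np
from scipy.stats import ttest_ind
from sklearn.linear_model import Ridge

def all_subsets(block_names):
    """All 2^5 = 32 subsets of the demographic feature blocks."""
    return chain.from_iterable(combinations(block_names, r) for r in range(len(block_names) + 1))

def avg_errors(blocks, embedding, lifespans, alpha=0.003, runs=20, seed=0):
    """Mean absolute error over `runs` random 90/10 train/test splits.

    blocks    : list of (n_records, d) matrices for the chosen demographic features
    embedding : (n_records, 100) name-embedding matrix, or None for the NoEbd model
    """
    columns = [np.ones((len(lifespans), 1))] + list(blocks)   # intercept keeps empty subsets valid
    if embedding is not None:
        columns.append(embedding)
    X = np.hstack(columns)
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(runs):
        idx = rng.permutation(len(lifespans))
        cut = int(0.9 * len(idx))
        train, test = idx[:cut], idx[cut:]
        model = Ridge(alpha=alpha).fit(X[train], lifespans[train])
        errors.append(np.abs(model.predict(X[test]) - lifespans[test]).mean())
    return errors

# Welch's t-test between a set of models without and with Twitter embeddings:
# t, p = ttest_ind(avg_errors(blocks, None, y), avg_errors(blocks, twitter_emb, y), equal_var=False)
```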
4.5 Latent Factors

Name embeddings are powerful at capturing latent properties of class and cultural group dynamics, but the nature of these properties remains hidden within unlabeled dimensions. This makes it difficult to determine exactly which properties they are keying on for a particular model. To provide some insight into how names affect lifespans, we identified the most favorable/unfavorable first and last names through residual analysis.

4.5.1 Residual Analysis. Name embeddings encode various demographic signals, including gender, ethnicity, nationality and other latent signals. In our best lifespan model, we combine the explicit demographic features with name embeddings. Therefore, there are redundant signals in the input features. In order to identify the latent signals encoded in name embeddings, we conduct residual analysis with the following steps: (i) train a linear regression model (referred to as the "demographic model") using demographic features and ground-truth lifespans; (ii) train a linear regression model (referred to as the "residual model") using name embeddings as features and residuals (i.e. prediction errors) as target values.

More formally, the demographic model minimizes the loss function L_d(w_d | X_d, Y_ls), where Y_ls is the ground-truth lifespans and X_d is the demographic feature vectors. Let Ŷ_ls be the prediction made by the demographic model, i.e. Ŷ_ls = w_d · X_d (the intercept term is ignored for brevity). The residual model then minimizes L_r(w_r | X_e, Y_ls − Ŷ_ls), where X_e is the name embedding features. L_d(·|·) and L_r(·|·) have the same form as Equation 2. Finally, we use Ŷ_r = w_r · X_e to compute the gains of name parts. If a name part gets a positive gain (i.e. a favorable name), individuals with this name tend to live longer. Conversely, unfavorable names get negative gains.
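Written out as code, the two-stage procedure is short; the sketch below uses Ridge for both stages and ranks name parts by their gain. Here `demo_X` holds the demographic features, `emb_X` holds the 100-d embedding of the name part being ranked (e.g. the last name of each record), and `name_vectors` maps each candidate name part to its embedding — all variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge

def name_gains(demo_X, emb_X, lifespans, name_vectors, alpha=0.003):
    """Two-stage residual analysis.

    Stage 1 (demographic model): lifespans ~ demo_X
    Stage 2 (residual model):    (lifespans - stage-1 predictions) ~ emb_X
    The gain of a name part is the residual model's output for its embedding.
    """
    demographic = Ridge(alpha=alpha).fit(demo_X, lifespans)
    residuals = lifespans - demographic.predict(demo_X)
    residual_model = Ridge(alpha=alpha).fit(emb_X, residuals)
    return {
        name: float(residual_model.predict(vec.reshape(1, -1))[0])
        for name, vec in name_vectors.items()
    }

# gains = name_gains(demo_X, last_name_emb, y, {n: kv[n] for n in candidate_last_names})
# favorable = sorted(gains.items(), key=lambda item: item[1], reverse=True)[:20]
```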
4.5.2 Diminutive vs. Formal First Names. Among the records with birth years between 1880 and 1910 (less influenced by survivorship bias, see Figure 3), 8 of the 20 most unfavorable first names occurring more than 5000 times are in diminutive form (see Table 6). It is interesting that the gains of the corresponding long-version names are significantly larger. We suspect that the distinction captured here is one of socioeconomic class, because formal names might generally be expected to appear in official documents more often.

More systematically, we test on 155 pairs of diminutive and formal English names from Wikipedia (https://en.wikipedia.org/wiki/Hypocorism). It turns out that 114 pairs (74%) agree with this observation when using email embeddings, namely names in formal forms get larger gains than their diminutives. Similarly, 60% of pairs favor formal names using Twitter embeddings. Under the null hypothesis that there is no bias toward short forms or formal ones, email and Twitter embeddings both get p values smaller than 0.01, and the null hypothesis is rejected.

Table 6: 8 of the 20 most unfavorable first names are in diminutive form. In contrast, their corresponding formal names have larger gains (in years). A systematic study on 155 diminutive/formal name pairs confirms a bias favoring formal names, suggesting two groups representing different socioeconomic classes.

Short   Count   Gain      Formal      Count    Gain
Gust    5429    -1.965    Gustav      8639     -1.353
Wm      6684    -1.623    William     773086   -0.581
Gus     10439   -1.597    Angus       1965     -0.091
Hans    10599   -1.322    Johannes    1153     -1.053
Alex    24520   -1.297    Alexander   34265    -1.210
Dan     12368   -1.296    Daniel      55567    -0.559
Guy     28664   -1.204    Guido       1694     -0.417
Effie   33844   -1.195    Euphemia    753      0.073
Average         -1.437    Average              -0.649

4.5.3 Fine-grained Subgroups. Table 7 shows that, among records with birth years between 1880 and 1910, 17 of the 20 most favorable last names are Ashkenazi Jewish. This phenomenon is interesting because Jewish people have long lived in many countries, so no single ethnicity or nationality feature could capture this group well. However, in name embedding space, similar names have similar representations because of communication homophily. The observation that these Jewish people had longer life expectancy agrees with the observation made by the Institute for Jewish Policy Research [3]. It is also interesting to see that popular Scandinavian last names (https://en.wikipedia.org/wiki/Scandinavian_family_name_etymology) all get positive gains.

Table 7: 17 of the 20 most favorable last names with more than 5000 occurrences have Ashkenazi Jewish origin. The most popular Scandinavian last names all get positive gains (in years). Both populations have longer lifespans than the U.S. average. Name embeddings are able to capture such fine-grained distinctions between groups.

Jewish                        Scandinavian
Name        Count   Gain      Name          Count   Gain
Katz        7143    1.12      Svensson      127     1.23
Bernstein   5122    1.11      Olsson        647     1.16
Shapiro     6156    1.06      Johansson     494     1.14
Solomon     6232    1.03      Persson       507     1.13
Goldman     5502    1.03      Karlsson      129     1.13
Levy        8739    1.01      Nilsson       682     1.01
Feldman     5031    1.01      Larsson       207     0.90
Friedman    8865    0.97      Karlsen       145     0.79
Rosenberg   6448    0.95      Kristiansen   175     0.57
Goldstein   8837    0.95      Andersen      6196    0.43
Cohen       22273   0.92      Christensen   10877   0.43
Stern       5537    0.89      Rasmussen     5813    0.33
Greenberg   6020    0.88      Pedersen      3782    0.33
Goldberg    9306    0.86      Larsen        9488    0.31
Levine      8518    0.86      Hansen        23578   0.29
Rosen       5363    0.85      Nielsen       7192    0.28
Kessler     5240    0.65      Olsen         11080   0.23

5 CONCLUSION

Name embeddings prove to be more effective feature representations of names than traditional substrings. However, the existing Email name embeddings are not publicly accessible. In this paper, we present a new way to learn name embeddings from Twitter. Extensive experimental results show the power of Twitter embeddings on gender, ethnicity and nationality prediction. We release the Twitter name embeddings to support the research community (www.name-prism.com).

We also demonstrate that name embeddings can improve the accuracy of lifespan models. Extrapolating from these results, we believe they can be used to strengthen predictive models for related tasks in the social, economic, and medical sciences. This is particularly true in large-scale but data-poor studies, where the name must serve as a proxy for reported gender, ethnicity, or nationality. The exact nature of the hidden factors implicitly encoded within our name embeddings that provide this predictive power is an exciting open question for further research. We presume that this includes subtle class-based distinctions (e.g. socioeconomic status and fine-grained groups) which are hidden by the coarse categorical variables traditionally observed and recorded.

ACKNOWLEDGMENTS

The authors thank the reviewers for their useful comments. This work was partially supported by NSF grant IIS-1546113. Any conclusions expressed in this material are the authors' and do not necessarily reflect the views, either expressed or implied, of the funding party.

REFERENCES

[1] Ernest L Abel and Michael L Kruger. 2006. Nicknames increase longevity. OMEGA - Journal of Death and Dying 53, 3 (2006), 243–248.
[2] Ernest L Abel and Michael L Kruger. 2009. Athletes, doctors, and lawyers with first names beginning with "D" die sooner. Death Studies 34, 1 (2009), 71–81.
[3] Sarah Abramson, David Graham, and Jonathan Boyd. 2011. Key trends in the British Jewish community: A review of data on poverty, the elderly and children. (2011).
[4] Faiyaz Al Zamal, Wendy Liu, and Derek Ruths. 2012. Homophily and Latent Attribute Inference: Inferring Latent Attributes of Twitter Users from Neighbors. ICWSM 270 (2012).
[5] Bedoor K AlShebli, Talal Rahwan, and Wei Lee Woon. 2018. The preeminence of ethnic diversity in scientific collaboration. Nature Communications 9, 1 (2018), 5163.
[6] Anurag Ambekar, Charles Ward, Jahangir Mohammed, Swapna Male, and Steven Skiena. 2009. Name-ethnicity classification from open sources. In SIGKDD. ACM, 49–58.
[7] Osei Appiah. 2001. Ethnic identification on adolescents' evaluations of advertisements. Journal of Advertising Research 41, 5 (2001), 7–22.
[8] Eric Backlund, Paul D Sorlie, and Norman J Johnson. 1996. The shape of the relationship between income and mortality in the United States: evidence from the National Longitudinal Mortality Study. Annals of Epidemiology 6, 1 (1996), 12–20.
[9] Yambazi Banda, Mark N Kvale, Thomas J Hoffmann, Stephanie E Hesselson, Dilrini Ranatunga, Hua Tang, Chiara Sabatti, Lisa A Croen, Brad P Dispensa, Mary Henderson, et al. 2015. Characterizing race/ethnicity and genetic ancestry for 100,000 subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) cohort. Genetics 200, 4 (2015), 1285–1295.
[10] Donald A Barr. 2014. Health disparities in the United States: Social class, race, ethnicity, and health. JHU Press.
[11] Robert W Buechley. 1976. Generally useful ethnic search system: GUESS. In Annual Meeting of the American Names Society.
[12] Esteban González Burchard, Elad Ziv, Eliseo J Pérez-Stable, and Dean Sheppard. 2003. The importance of race and ethnic background in biomedical research and clinical practice. The New England Journal of Medicine 348, 12 (2003), 1170.
[13] Jonathan Chang, Itamar Rosenn, Lars Backstrom, and Cameron Marlow. 2010. ePluribus: Ethnicity on Social Networks. In ICWSM, Vol. 10. 18–25.
[14] Alexander Cochardt, Stephan Heller, and Vitaly Orlov. 2018. In Military We Trust: The Effect of Managers' Military Background on Mutual Fund Flows. Available at SSRN 3303755 (2018).
[15] Andrew J Coldman, Terry Braun, and Richard P Gallagher. 1988. The classification of ethnic status using name information. Journal of Epidemiology and Community Health 42, 4 (1988), 390–395.
[16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.
[17] Will Gornall and Ilya A Strebulaev. 2018. Gender, Race, and Entrepreneurship: A Randomized Field Experiment on Venture Capitalists and Angels. Available at SSRN 3301982 (2018).
[18] Shuchu Han, Yifan Hu, Steven Skiena, Baris Coskun, Meizhu Liu, Hong Qin, and Jaime Perez. 2017. Generating Look-alike Names For Security Challenges. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security. ACM, 57–67.
[19] Seeromanie Harding, Howard Dews, and Stephen Ludi Simpson. 1999. The potential to identify South Asians using a computerised algorithm to classify names. Population Trends London (1999), 46–49.
[20] Xia Hu, Jiliang Tang, Yanchao Zhang, and Huan Liu. 2013. Social Spammer Detection in Microblogging. In IJCAI, Vol. 13. 2633–2639.
[21] John T Jones, Brett W Pelham, Mauricio Carvallo, and Matthew C Mirenberg. 2004. How do I love thee? Let me count the Js: implicit egotism and interpersonal attraction. Journal of Personality and Social Psychology 87, 5 (2004), 665.
[22] Jure Leskovec and Eric Horvitz. 2008. Planetary-scale views on a large instant-messaging network. In WWW. ACM, 915–924.
[23] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
[24] Leif D Nelson and Joseph P Simmons. 2007. Moniker maladies: when names sabotage success. Psychological Science 18, 12 (2007), 1106–1112.
[25] Cathy O'Neil. 2016. Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. (2016).
[26] Laura Pinzur and Gary Smith. 2009. First names and longevity. Perceptual and Motor Skills 108, 1 (2009), 149–160.
[27] Daniel Preotiuc-Pietro and Lyle H. Ungar. 2018. User-Level Race and Ethnicity Predictors from Twitter Text. In Proceedings of the 27th International Conference on Computational Linguistics, Emily M. Bender, Leon Derczynski, and Pierre Isabelle (Eds.). Association for Computational Linguistics, 1534–1545.
[28] James Quesada, Laurie Kain Hart, and Philippe Bourgois. 2011. Structural vulnerability and health: Latino migrant laborers in the United States. Medical Anthropology 30, 4 (2011), 339–362.
[29] Janet W Rich-Edwards, Karen A Corsano, and Meir J Stampfer. 1994. Test of the national death index and equifax nationwide death search. American Journal of Epidemiology 140, 11 (1994), 1016–1019.
[30] Sang-Chun Shin and Sung-Je Cho. 2014. The Impact of Names upon the Stress and Self-esteem Before and After Renaming. Journal of the Korea Academia-Industrial Cooperation Society 15, 5 (2014), 2662–2670.
[31] Gary Smith. 2012. Do People Whose Names Begin with "D" Really Die Young? Death Studies 36, 2 (2012), 182–189.
[32] Ian M Thompson Jr, Phyllis J Goodman, Catherine M Tangen, Howard L Parnes, Lori M Minasian, Paul A Godley, M Scott Lucia, and Leslie G Ford. 2013. Long-term survival of participants in the prostate cancer prevention trial. New England Journal of Medicine 369, 7 (2013), 603–610.
[33] Pucktada Treeratpituk and C Lee Giles. 2012. Name-ethnicity classification and ethnicity-sensitive name matching. In AAAI.
[34] Michal Vaanunu and Chen Avin. 2018. Homophily and Nationality Assortativity Among the Most Cited Researchers' Social Network. In 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). IEEE, 584–586.
[35] Buvaneshwaran Venugopal. 2017. Homophily, Information Asymmetry and Performance in the Angels Market. (2017).
[36] Brent C Williams, Lucy B Demitrack, and Brant E Fries. 1992. The accuracy of the National Death Index when personal identifiers other than Social Security number are used. American Journal of Public Health 82, 8 (1992), 1145–1147.
[37] Junting Ye, Shuchu Han, Yifan Hu, Baris Coskun, Meizhu Liu, Hong Qin, and Steven Skiena. 2017. Nationality Classification Using Name Embeddings. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. 1897–1906.