Data Mining of Online Datasets for Revealing Lifespan Patterns in Human Population

Michael Fire and Yuval Elovici∗

Telekom Innovation Laboratories at Ben-Gurion University of the Negev Department of Information Systems Engineering, Ben-Gurion University

January 5, 2014

Abstract Online genealogy datasets contain extensive information about millions of people and their past and present family connections. This vast amount of data can assist in identifying various patterns in human population. In this study, we present meth- ods and algorithms which can assist in identifying variations in lifespan distributions of human population in the past centuries, in detecting social and genetic features which correlate with human lifespan, and in constructing predictive models of hu- man lifespan based on various features which can easily be extracted from genealogy datasets. We have evaluated the presented methods and algorithms on a large online geneal- ogy dataset with over a million profiles and over 9 million connections, all of which were collected from the WikiTree website. Our findings indicate that significant but small positive correlations exist between the parents’ lifespan and their children’s lifespan. Additionally, we found slightly higher and significant correlations between the lifespans of spouses. We also discovered a very small positive and significant correlation between longevity and reproductive success in males, and a small and sig- nificant negative correlation between longevity and reproductive success in females. Moreover, our machine learning algorithms presented better than random classifica- tion results in predicting which people who outlive the age of 50 will also outlive the age of 80. We believe that this study will be the first of many studies which utilize the wealth of data on human populations, existing in online genealogy datasets, to better under- stand factors which influence human lifespan. Understanding these factors can assist scientists in providing solutions for successful aging. arXiv:1311.4276v2 [cs.SI] 5 Jan 2014

Keywords. Genealogy Data Mining, Aging, Gerontology, Human Pop- ulation Lifespan, Lifespan Prediction, Date Mining, Machine Learning, WikiTree

1 Introduction

In the last decade, Web 2.0 websites, such as Wikipedia1 and Reddit,2 have become extremely popular and widespread. Web 2.0 websites offer Internet users opportunities

∗Email:{mickyfi,elovici}@bgu.ac.il 1http://en.wikipedia.org 2http://www.reddit.com

1 to connect, collaborate, and share information with each other, creating massive datasets with millions of content items. For example, the free encyclopedia Wikipedia has more than 4.3 million articles and more than 127,000 active users who contribute new content to the site on a regular basis [24]. One type of Web 2.0 site which recently became popular is genealogy websites. Genealogy websites, such as MyHeritage,3 Ancestry,4 WikiTree,5 and ,6 have millions of users [16, 2] that use these websites to create, discover, and share their family history by generating and updating online family trees. These online family trees consist of personal data on family members from the last several centuries, and they give many personal details for each family member, such as the member’s date of birth, date of death, ancestors’ details, and children’s details, among others. The structure and the family members’ personal details that are stored in these genealogy websites create large-scale datasets, which contain billions of entries [10, 2] on human life and death properties. These datasets can be utilized to reveal interesting patterns regarding lifespan changes over the centuries. Additionally, these datasets can also assist in better understanding and identifying characteristics which are correlated with human lifespan changes. For example, these datasets can be explored and utilized to answer the following questions: Does having more children extend one’s lifespan? Does having long-lived ancestors prolong life? Does getting married lengthen one’s lifespan? Answering these types of questions can assist scientists in providing insights and solutions for successful aging. In this study, we present data mining algorithms for analyzing large genealogy datasets in order to examine human population lifespan variations over a substantial length of time (see Section 3.3.1). Moreover, we introduce methods to utilize these types of datasets to identify features which correlate with human lifespan (see Section 3.3.2). Additionally, we also present Machine Learning (ML) algorithms based on features extracted from genealogy datasets, which can assist in predicting if a particular 50-year-old individual will reach the age of 80 (see Section 3.3.4). To test and evaluate our algorithms, we developed a web crawler which crawled and parsed public profiles from the WikiTree website. WikiTree is a free, collaborative family- history website, which contains more than 5 million user-contributed profiles [5] of indi- viduals who have lived in the past centuries, and many of the profiles contain personal details about each individual. Using the collected data from WikiTree, we were able to construct a dataset (referred to as the WikiTree dataset) of over a million public profiles, out of which at least 416,030 profiles were of individuals who were born in the United States (see Section 3.4). By analyzing the WikiTree dataset, we calculated various statistics on variations of population lifespan over the last centuries, including specific statistics on the lifespan variations of the United States population (see Section 3.3 and Figures 3 and 4). As a result of this analysis, we discovered several interesting historical lifespan change patterns (see Section 5); for example, we discovered that the average lifespan of females who were born in the United States and lived beyond the age of ten increased sharply in just a half-century: from 62.66 in 1850 to 72.5 in 1900 (see Figure 3). Using the WikiTree dataset, we constructed a social network directed multigraph which contains over 1.38 million vertices and over 9.19 million links (see Section 3.1 and Table 3). We then analyzed the social network graph and extracted 21 features, such as parents’

3http://en.wikipedia.org 4http://www.ancestry.com 5http://www.wikitree.com 6http://familypedia.wikia.com

2 and grandparents’ ages of death, for each vertex in the graph (see Section 3.2). By using the extracted features and simple linear regression models, we discovered significant correlations with low coefficients of determination between the individuals’ ages of death and the ages of death of their siblings, parents, spouses, and grandparents (see Table 5). We also discovered a slighter higher significant correlation between the individual’s age of death and the age of death of his or her spouse (see Table 5). Additionally, we constructed multiple linear regression models for predicting an individual’s age of death based on various features which were extracted from the individual’s personal details. Our multiple linear models were with high significance and Multiple Adjusted R-squared values up to 0.085 (see Table 6). Our ML classifiers have presented better than random results in predicting which individuals who outlived the age of fifty and passed the age of menopause will also outlive the age of 80 (see Section 4.3). The remainder of the paper is structured as follows: In Section 2 we give a brief overview of previous relevant studies on characteristics which were found to be correlated with human lifespan. In this section, we also introduce several studies which used similar data mining algorithms as this study. Next, in Section 3 we present the methods and algorithms we developed for studying genealogy datasets. In this section, we also describe our constructed WikiTree dataset. Then, in Section 4 we present our algorithm evaluations results on the WikiTree dataset. Lastly, in Section 5 we discuss our results, and we also offer future research directions.

2 Related Work

The factors that influence human lifespan have been thoroughly studied over the past decades [20, 14, 13, 11, 8]. In this section we give a brief overview of recent genealogical studies that are most relevant to this study, pinpointing similar factors. Additionally, we also give a short overview of recent studies in the field of social network analysis and data mining, which used a similar methodology to the one used throughout this study. In recent years, many studies have tried to find correlations between parents’ and childrens’ lifespans, as well as correlations between lifespans of parents and their num- ber of children: In 1998, Westendorp and Kirkwood used a historical dataset, from the British aristocracy, to study the connection between longevity and reproductive success. They discovered that longevity was positively correlated with age at first childbirth, and negatively correlated with number of children. In 2000, Thomas et al. [20] studied the con- nection between longevity and fertility using a statistical dataset of 153 countries. They concluded that “humans who invest heavily in reproduction while young will, on average, pay for this reproductive success with a shortened lifespan.” In 2001, Mitchell et al. [14] used genealogical data of Old Order Amish members to estimate the parent-child correla- tions in lifespan. They also estimated the child age of death as a function of parent age at death. They discovered significant but small correlations between parental and child ages at death. In 2006, McArdle et al. [13] studied the correlation between the number of children and lifespan using genealogical data of 2,015 individuals who were members of an Old Order Amish community. In their study they discovered lifespans of fathers increased linearly with increasing number of children, while lifespans of mothers increased linearly up to 14 children but decreased with each additional child beyond 14. In 2007, Le Bourg [11] presented a thorough review of studies which researched the relationship between fertility and longevity under various conditions. According to Le Bourg, the review results indi-

3 cated that “in natural fertility conditions longevity does not decrease when the number of children increases but, in modern populations, mortality could slightly increase when women have more than ca 5 children.” In 2011, Gögele et al. [8] conducted a compre- hensive genealogical study with a thorough assessment of the heritability of lifespan and longevity in three villages in Italy. Their research, which included studying more than 50,000 individuals across four centuries, discovered “a general low inheritance of human lifespan, but which increases substantially when considering long-living individuals, and a common genetic background of lifespan and reproduction.” Many studies found connections between an excess in mortality and bereavement, also known as the “widow effect.” In 1969, Parkes et al. [18] followed 4,486 widowers at the age of 55 for nine years. Out of these widowers, 213 died during their first six months of bereavement, 40% above the expected rate for married men of the same age. In 1996, Martikainen et al. [12] conducted a large scale study of 1,580,000 married Finnish individuals and also discovered excess mortality among the bereaved. In 2008, Elwert and Christakis [6] studied 373,189 elderly married couples in the United States. They discovered that the death of a spouse from almost all causes increased the mortality of the bereaved partner to varying degrees. In our research we used several regression and ML techniques for lifespan prediction. In order to carry out our work, we mainly used attributes which could be extracted from genealogy datasets in order to construct the genealogy social network and extract features from the network (see Section 3). Similar techniques that involve social network analysis and regression were used by Christakis [4] in researching the spread of obesity, by Altshuler et al. [1] in predicting the individual parameters and social links of smart-phone users, and by Fire et al. [7] in predicting students’ final exam scores.

3 Methods and Experiments

To cope with the challenge of analyzing a huge online genealogy dataset with ten of millions of records on individuals’ personal data and their connections, we first chose to convert the dataset into a social network represented by a directed multigraph where vertices represent people and links represent connections among family members (see Section 3.1). Next, we used the constructed social network graph and extracted various features from each vertex, such as the vertex’s number of children, year of birth, and gender (see Section 3.2). We then used the extracted features to determine various statistics on the population lifespan variations over time (see Section 3.3.1). After that, we used linear regression to find the features that significantly influence human lifespan. We also constructed multi-linear regression models for lifespan prediction (see Section 3.3.2). Lastly, we used ML algorithms to construct classifiers which can predict if a person from the United States who outlives the age of fifty will also reach the age of eighty (see Section 3.3.4). To perform our statistical calculations and to construct our predictive models, we used various datasets that were extracted from a large genealogy dataset. These datasets are defined in Table 1. Additionally, the methods and algorithms we have used throughout this study are summarized in Table 2.

4 Table 1: Dataset Definitions

Dataset Name Description Formal Definition All-Dataset All the vertices with valid Age-of-Death feature values. { | ( ) All–Dataset- All the vertices with valid Age-of-Death feature values which outlived the { | ( ) selected . -Dataset All the vertices born in with valid Age-of-Death feature values. { | ( )

–Dataset- All the vertices born in with valid Age-of-Death feature values { | which outlived the selected . ( ) - Dataset All the vertices with valid Age-of-Death feature values, and Gender feature { | ( ) value equal to male (=1) or to female (=2). - Dataset- All the vertices with valid Age-of-Death feature values which outlived the { |

5 age of , and have Gender feature values equal to male (=1) or female ( ) (=2). --Dataset- All the vertices with valid Age-of-Death feature who outlived the selected { | , were born in , and the Gender feature values are equal ( ) only to one value (): Male (=1) or Female (=2). ( ) -Dataset-< Age> All the vertices which have valid selected feature () values and { | Age-of-Death feature values of at least the selected age (). ( ) --Dataset- All the vertices which have valid selected feature () values, { | Gender feature values which are equal to only one value (): Male ( ) (=1) or Female (=2), and Age-of-Death feature values above or equal to the ( ) selected age (). Married-Dataset All the vertices with valid Age-of-Death feature values, and Spouse-Number { | ( ) of at least 1. No-Missing–Dataset- All the vertices without missing numeric features, and valid Age-of-Death { | feature values which are at least the selected . ( )

Table 2: Method and Algorithm Overview

Algorithm Goal Datasets Method Present lifespan variations over time All-Dataset Calculate the Age-of-Death distribution of all people who were born in each United-States-Dataset quarter of a century between 1650 and 1900. Median and average lifespan All-Dataset-10 Calculate, for each dataset, both the average and median lifespans for calculation over time United-States-Dataset-10 people who were born in each year between 1650 and 1900, and outlived Male-Dataset-10 the age of 10. Female-Dataset-10 Male-United-States-Dataset-10 Female-United-States-Dataset-10

6 Identify inheritance of human -Dataset-10 Construct a simple linear regression between each selected feature in the lifespans extended family features and the Age-of-Death feature. Identify correlation between Married-Dataset Construct a simple linear regression between each one of the Spouse-Age- spouses’ lifespan of-Death features (Min/Max/Average), and the Age-of-Death feature. Identify if longevity is correlated Children-Number-Dataset-50 Construct a simple linear regression between the Children –Number with reproductive success Male-Children-Number-Dataset-50 feature and the Age-of-Death feature on each dataset. Female-Children-Number-Dataset-50 Construct models for predicting No-Missing–Dataset-10 Construct backward stepwise multiple linear regression models for individuals’ lifespans No-Missing–Dataset-50 predicting the Age-of-Death using All-Numeric-Features, the Heritage- Features, and the Nuclear-Family-Features sets. Predict if an individual will outlive United-States–Dataset-50 (only vertices Construct various machine learning classifiers for predicting if an individual the age of 80 in which the birth year was between born in the United States, and outlived the age of 50, will also reach the age 1650 and 1900) of 80.

3.1 Constructing the Genealogy Social Networks We constructed the social network directed multigraph G :=< V, E > from the genealogy dataset in the following manner: First, we assembled the graph vertices set V by adding a new vertex v ∈ V for each profile in the genealogy dataset. We then defined E as the multiset of links in the graph, with each link e ∈ E defined to be a tuple e = (u, v, t, c) ∈ E, where u, v ∈ V ; t is the link type, which can be one of the following values: t ∈ {Spouse, Child, Parent, Sibling}; and c is the creation date of the link. For example, if the genealogy dataset contains the profiles of Queen Elizabeth II and Prince Charles, then the social network graph will contain the following vertices: Elizabeth II, Prince Charles ∈ V , and the following edges: (Elizabeth II, Prince Charles, Parent, 14 November 1948) and (Prince Charles, Elizabeth II, Child, 14 November 1948). In case the genealogy dataset contains link of type t between a public profile u and a private profile, we added to the multigraph a new vertex vprivate to V and added new link e = (u, vprivate, t, ∅) to E.7 Next, for each non-private vertex v ∈ V in the network, we added attributes based on the information extracted from each individual’s profile page, which is represented by v.8 Then, for each vertex v, we calculated a list of features described in the following subsection. Lastly, we removed the birthday and death date data of any vertices in the graph with the following inconsistencies in their data: (a) profiles with a negative age of death, which can occur from reversing the birth and death dates; (b) profiles shown to be over the age of 122, the maximum recorded age [23]; and (c) profiles shown as having children and also an age of less than five.9

3.2 Feature Extraction After constructing the social network graph, we can extract, if possible, three types of features for each vertex: The first type is the vertex general profile features, which include basic information about the vertex, such as birth year, gender, and full name. The second type is the nuclear family features, which include information about the vertex’s children and spouses. The third and last type of features is the extended family features, which include information about the vertex’s parents, siblings, and grandparents. In this study, we extracted a total of 21 features for each vertex v ∈ V . In the remainder of this section, we introduce and give formal definitions for each one of these features.

3.2.1 General Features 1. Full-Name(v) - The full name of v.

2. Birth-Year(v) - The birth year of v.

3. Death-Year(v) - The death year of v.

7During this study, we have utilized private profiles to calculate public profiles’ features, such as Children-Number(u) and Spouse-Number(u) more accurately (see Section 3.2). In many cases, we cannot distinguish if two or more private profiles are in fact represent the same single profile in the genealogy dataset. Nevertheless, we can estimate the number of distinct private profiles by utilizing the private profile single link of type t. Namely, due to the fact that most people have two parents, we can conclude n that n private profiles with single link of type “Child” represent at least 2 distinct profiles. 8In most online genealogy websites, a profile usually contains the following information about each individual: gender, birth and death dates, location of birth, location of death, parents’ names, spouses’ names, siblings’ names, and children’s names. 9The youngest mother on record was a 5-year-old Peruvian girl [15].

7 77.2 28.0 1.4 30.3 0.6 70.1 72.1 103.1 35.5 69.4 81.7 69.2 79.3

70.0 82.1 77.0 1.9 58.0 60.8 34.3

74.3 42.8

80.2

50.4 79.0 42.7 87.3 85.0 77.1

26.0 40.3 74.8 63.5 82.9

76.2

52.9 59.9 38.0 52.0 59.4 87.7 71.7 45.2 76.1 93.9 83.2 75.9 45.9 54.4 30.4 72.3 94.4 68.6 42.0 81.1 89.3 1.0 65.0 2.0 58.5 57.0 30.2 89.0 82.2 85.1 50.0 66.0 88.1 59.4 39.3 44.5 76.8 79.4 1.0 73.6 68.6 55.0 88.1 86.5 71.4 92.1 33.5 30.6 1.0 73.1 4.2 78.2 63.4 0.0 65.3 45.3 36.1 2.8 66.9 81.3 1.0 1.6 49.1 80.1 3.7 86.1 79.7 0.6 85.7 40.4 76.7 46.1 55.4 65.3 93.6 69.3 84.2 65.0 57.0 75.9 83.6 76.4 5.9 82.7 75.9 84.7 2.9 92.7 50.9 87.1 80.0 61.2 74.7 0.1 72.7 80.8 87.0 77.7 26.0 34.5

63.1 48.6

61.6 40.0 75.2

0.2 78.2 27.5 28.2 76.4 73.1 51.2 28.5 99.8 81.6 77.1 85.3 82.7 73.1 99.1 86.4 74.1

59.2 4.0

6.0 63.0 69.4 67.0 24.0 88.3 16.3 54.9 24.0 45.1 95.2 74.9 30.0 1.6 59.9 83.6 16.5 57.5 66.0 22.2 79.0 79.6 61.5 69.7 62.2 89.2 91.5 62.4 90.0 87.0 42.4 75.5 59.8 0.1 36.0 76.3 1.9 45.7 81.4 73.7 52.9 0.0 25.7 13.4 82.7 81.5 75.1 56.9 89.1 70.9 6.2 72.7 46.7 15.9 15.2 71.2 49.8 85.0 61.3 38.9 68.4

63.8 90.0 64.4 74.9 85.3

82.0 87.7 84.2 98.5 40.3 75.5 77.9 5.2 78.6 79.3 63.6 0.2 38.0 85.8 0.1 82.2 79.9 0.2 0.0 76.0 76.0 53.0 94.3 75.3 3.7 76.0 4.8 24.0 56.6 1.3 90.5 0.0 46.9 72.0 88.3 76.0 70.3 17.5 58.1 2.4 55.0 23.0 53.3 80.8 56.4 43.5 0.0 83.1 81.9 0.0 52.0 1.2 83.3 29.6 30.0 53.7 80.9 50.0 83.2 78.6 0.0 0.0 3.4 45.4 80.1 3.7 95.2 70.0 48.2 73.8 0.0 37.6 0.0 66.8 70.1 0.7 32.5 94.3 81.1 42.0 0.0 29.0 91.8 19.4 25.8 38.0 47.4 19.4 84.3 59.1 27.0 68.3 73.2 0.0 38.4 63.2 26.6 18.7 84.0 0.0 14.1 79.4 70.0 17.5 82.7 43.0 55.1 33.7 66.7 25.4 89.1 87.7 1.7 69.5 39.5 0.0 63.6 17.9 40.0 37.4 66.2 2.4 74.9 50.0 0.0 84.3 53.5 75.3 41.0 46.7 78.3 13.8 87.3 49.8 57.8 0.0 57.9 52.8 77.0 75.8 59.4 65.1 33.0 70.4 13.2 27.4 0.8 82.5 51.1 24.0 9.0 1.4 79.3 69.0 50.0 65.0 91.2 21.6 11.7 58.4 75.2 0.0 56.7 1.1 35.0 64.0 44.0 41.0 77.1 66.9 69.9 76.6 93.2 0.9 80.7 0.0 55.4 67.3 19.4 76.6 3.0 72.4 86.8 80.3 69.3 1.5 77.4 90.3 74.2 67.0 4.6 0.0 75.8 83.5 77.2 0.0 68.6 72.1 67.5 0.6 66.0 32.7 45.2 50.7 31.5 42.1 2.7 85.4 62.0 60.2 3.9 81.6 3.8 90.9 93.4 0.0 0.0 1.0 80.3 85.9 14.8 69.2 0.1 68.5 1.4 86.4 71.0 21.0 64.5 83.7 56.5 44.6 72.5 21.1 61.1 1.4 74.0 82.4 85.7 0.0 5.0 68.4 39.1 73.4 72.8 70.3 79.7 62.7 34.8 73.5 28.5 9.3 81.8 16.8 0.0 80.7 72.5 76.2 80.3 59.6 41.4 79.7 3.4 31.4 77.9 56.1 0.5 36.5 46.8 60.0 80.4 91.9 21.0 23.2 89.3 75.0 84.6 71.5 91.1 66.3 66.1 97.5 58.0 90.4 28.2 66.4 71.7 10.5 51.7 20.3 59.2 85.4 42.8 3.8 32.5 81.9 8.3 68.9 33.1 107.9 50.9 86.5 0.2 77.1 79.9 43.2 0.1 30.0 63.9 70.9 45.1 83.6 42.9 49.1 74.0 52.5 82.4 56.7 18.4 68.5 56.4 78.1 57.8 70.6 23.5 27.1 74.6 13.4 84.9 63.2 37.1 50.4 64.7 52.0 57.9 82.6 12.3 48.7 83.6 85.8 104.0 42.8 81.2 71.4 59.0 100.9 96.9 70.6 70.9 55.8 78.4 40.7 80.7 2.5 80.1 90.2 81.9 70.0 39.6 16.0 33.0 89.8 92.8 69.2 75.0 81.5 25.7 0.7 100.1 72.3 0.1 33.0 85.4 84.2 5.030.2 16.7 78.7 67.7 28.3 69.5 80.0 65.7 80.2 57.6 76.7 76.7 82.8 42.5 51.7 44.8 69.3 78.3 76.3 55.2 82.9 101.1 1.3 77.5 63.0 54.7 1.1 37.6 23.3 81.0 71.2 69.3 67.8 27.6 68.0 94.9 52.8 65.4 64.4 93.9 62.5 111.8 32.7 76.946.4 65.1 75.1 76.7 72.0 75.2 51.077.4 30.1 84.6 83.1 41.7 62.5 77.4 87.7 24.2 0.1 73.6 97.3 79.9 48.1 72.2 55.7 45.3 62.6 17.0 73.9 30.6 51.3 83.2 46.3 78.1 0.8 3.7 75.3 37.0 78.2 57.4 73.5 1.5 21.0 61.4 86.3 60.8 78.0 42.8 75.9 76.3 0.0 48.6 101.1 68.0 80.6 46.0 65.5 79.0 20.2 23.5 5.2 3.9 23.6 56.0 30.6 81.3 84.9 29.0 38.7 86.9 96.2 70.2 5.0 74.4 31.0 87.5 77.1 93.9 86.6 0.9 92.6 12.7 73.6 54.2 84.5 74.1 70.0 79.2 64.9 46.0 90.9 36.2 33.2 0.0 62.2 5.0 68.6 48.7 83.8 67.4 51.1 67.4 46.0 24.2 80.3 36.0 83.0 69.4 83.3 60.9 82.5 32.7 7.0 54.1 71.4 29.0 90.8 104.2 78.1 78.2 71.9 70.0 71.8 61.2 86.8 69.0 70.7 19.7 76.2 85.4 18.1 79.0 82.4 71.3 81.1 4.3 83.5 39.7 82.3 88.4 82.6 33.1 3.7 61.7 41.8 0.0 76.1 0.0 46.1 74.1 29.7 32.7 68.4 36.0 66.1 77.022.4 1.1 43.8 86.4 84.6 15.7 17.1 47.2 36.5 56.0 52.8 90.0 86.4 74.1 67.0 66.1 25.2 64.7 73.5 59.7 81.5 77.9 0.0 7.8 47.0 80.9 5.0 70.3 57.4 76.9 50.1 72.3 98.5 0.0 26.5 23.4 53.5 45.0 5.9 26.4 93.2 32.9 4.0 0.1 89.0 88.4 2.3 93.6 29.1 11.7 35.6 4.6 1.0 25.4 85.4 89.4 53.7 78.4 87.1 4.1 86.7 0.0 70.1 85.2 65.0 67.1 21.1 71.0 73.5 63.2 71.9 66.8 36.8 10.0 91.2 0.3 81.8 99.1 29.9 77.8 85.9 60.7 2.7 83.1 30.3 67.9 74.6 0.4 84.6 0.0 39.5 39.9 38.9 87.0 93.4 37.0 61.3 74.9 11.1 71.5 74.5 73.2 47.4 77.0 54.0 84.5 78.0 0.0 80.8 86.9 94.5 63.6 43.0 69.5 25.0 78.6 76.7 70.6 75.3 77.6 27.1 59.7 60.2 3.4 79.0 41.4 39.8 40.8 69.4 73.3 38.9 85.6 76.6 36.2 13.7 58.7 89.8 68.4 18.7 81.0 72.2 81.6 52.7 2.4 67.9 0.9 69.5 83.7 82.4 57.5 73.7 88.4 0.7 38.5 77.0 73.9 29.7 86.8 71.8 1.6 73.5 64.0 27.7 86.4 87.0 70.6 30.3 2.0 52.0 19.6 43.7 74.4 69.0 58.1 66.7 85.6 61.9 79.6 66.1 27.0 78.2 44.6 67.0 94.4 83.1 59.5 33.9 75.8 79.1 62.8 65.9 65.6 14.4 38.9 12.2 71.9 28.7 92.7 79.6 62.5 72.3 56.1 65.5 33.8 0.4 61.5 60.6 0.2 30.3 23.1 84.1 82.1 57.1 3.7 5.3 3.6 62.1 38.0 71.4 53.1 82.8 77.2 86.7 38.3 74.2 28.8 65.5 33.1 61.3 34.9 80.9 87.2 84.6 71.9 76.3 69.7 38.2 30.0 68.6 44.7 69.0 79.2 5.7 64.828.7 35.0 63.6 7.7 44.9 87.5 87.0 68.8 8.0 57.8 81.4 28.2 72.8 67.9 83.2 88.6 74.4 60.7 9.2 60.7 66.9 95.0 9.7 65.8 60.0 31.6 35.3 71.0 87.4 35.7 84.2 80.4 34.5 57.0 73.7 77.0 73.7 86.3 84.6 33.0 84.0 78.6 65.9 84.5 8.6 78.1 10.0 0.0 68.9 77.3 75.4 75.4 70.3 88.2 60.5 18.3 0.0 35.1 88.4 75.0 76.7 1.9 31.8 97.6 48.2 64.8 44.4 0.3 75.0 62.2 55.9 2.7 21.5 42.8 16.0 71.7 66.8 73.8 79.8 81.1 85.1 76.8 72.6 78.7 77.4 74.3 63.3 51.0 74.3 78.4 0.2 71.9 14.7 82.3 53.1 53.0 73.5 5.4 50.6 60.1 36.0 31.5 95.7 22.4 45.0 24.7 77.0 81.7 39.6 83.7 4.9 35.3 91.6 44.3 26.4 63.3 78.2 86.2 80.3 73.9 0.0 79.3 30.9 29.4 61.2 79.6 23.2 47.6 51.0 28.5 0.2 29.6 54.9 82.1 95.9 97.8 73.5 17.0 1.0 93.9 20.2 77.5 35.0 2.5 5.0 68.0 35.9 49.2 74.7 85.2 78.9 76.5 56.5 47.7 26.6 81.3 90.0 42.9 0.2 33.4 84.1 49.7 2.4 0.0 65.0 62.2 44.2 0.1 98.1 66.1 74.6 71.7 66.8 36.7 52.6 55.0 52.8 48.2 71.0 87.8 90.5 90.0 38.7 43.2 76.4 83.9 85.3 38.7 68.4 63.5 2.0 74.9 37.0 78.5 91.7 64.9 58.4 62.8 67.8 68.6 32.0 69.6 63.0 63.5 84.3 76.1 31.2 33.9 89.1 82.5 75.0 79.7 77.3 78.4 72.1 77.3 64.7 49.9 84.1 87.8 48.8 64.0 0.0 60.3 48.1 88.8 73.3 78.6 74.5 70.6 65.7 8.0 75.9 60.8 76.9 27.3 9.0 94.0 60.0 31.0 82.7 61.0 79.6 85.8 5.2 81.4 52.0 76.0 46.1 42.6 21.0 66.0 57.5 66.1 87.3 83.8 91.8 77.8 74.2 48.8 63.3 42.3 39.4 78.9 83.1 76.0 84.3 82.2 47.8 41.5 29.3 74.7 56.1 77.4 84.9 49.3 30.7 91.7 28.4 75.8 84.0 79.8 70.8 77.9 87.5 47.9 49.4 78.6 66.8 85.4 79.3 82.0 69.1 72.7 69.9 2.1 88.2 0.9 62.6 73.1 78.3 92.1 51.5 75.9 80.2 1.2 48.8 89.9 73.0 21.0 1.5 69.1 55.8 39.2 81.9 71.0 36.5 78.5 53.0 86.1 93.0 65.9 34.2 77.9 82.5 50.0 77.4 47.8 71.6 79.2 0.2 79.4 70.1 84.328.2 94.1 11.3 72.1 32.7 28.2 18.9 3.1 73.1 10.0 94.1 28.7 82.4 71.3 73.0 82.3 32.4 47.7 47.4 56.0 31.9 85.3 67.9 70.9 66.1 81.6 65.7 55.7 50.6 43.2 11.3 75.4 93.1 42.2 30.4 71.0 12.3 67.2 59.5 65.9 53.8 37.8 71.8 69.7 68.9 85.8 83.1 72.9 81.2 47.0 70.5 86.5 72.2 20.9 65.3 34.5 84.1 64.1 67.3 64.7 77.7 63.9 63.5 39.0 81.3 37.9 62.964.9 95.3 71.9 49.0 43.6 54.9 100.9 19.1 16.6 70.9 68.5 66.2 73.1 50.2 54.4 30.3 30.1 89.0 74.6 30.0 73.2 73.8 12.4 81.8 64.7 27.0 81.7 31.0 0.2 45.0 26.7 105.0 52.2 67.0 32.7 30.9 69.0 40.0 79.1 77.3 86.0 74.0 1.0 49.9 13.6 42.7 76.8 69.4 30.0 93.4 34.1 94.6 81.7 1.2 86.5 69.4 76.3 2.7 99.3 91.8 92.0 69.6 42.151.7 85.1 51.8 29.6 61.3 61.273.6 69.8 60.7 69.5 2.8 90.9 24.7 88.3 0.1 76.6 81.1 40.8 55.2 0.7 47.2 42.5 36.5 72.4 80.1 56.0 5.2 41.3 78.2 88.4 81.1 91.2 36.6 84.0 76.3 26.0 65.0 61.0 47.8 42.7 65.6 71.0 70.4 47.5 54.0 67.2 45.6 1.3 0.0 31.7 86.2 43.5 68.3 76.6 80.4 71.1 52.4 47.0 42.7 35.0 28.0 50.0 90.1 74.4 94.1 51.7 62.9 101.6 67.7 52.2 4.1 31.0 65.3 2.7 93.7 77.0 42.4 69.8 79.3 44.6 72.0 50.1 16.4 66.3 68.7 87.8 89.3 0.0 68.1 29.7 78.0 72.6 50.0 85.6 83.5 64.7 78.1 76.1 77.7 1.7 77.8 55.9 80.5 0.1 21.9 86.9 68.5 58.7 80.8 0.0 26.8 81.2 79.7 26.1 41.0 90.1 78.0 40.9 50.1 45.5 59.5 81.6 47.0 58.1 94.4 62.9 72.6 49.7 63.2 67.5 0.4 79.3 84.3 67.1 92.4 80.1 91.4 86.3 89.2 55.1 68.0 88.2 77.8 59.7 71.6 67.7 83.0 40.3 4.5 44.3 42.4 35.0 88.7 0.0 1.0 0.1 40.7 0.2 23.5 80.1 72.1 29.3 0.9 88.3 82.6 0.2 68.4 85.6 88.7 40.6 0.6 40.3 7.8 62.1 17.7 48.2 86.3 49.4 84.9 56.8 47.0 88.7 0.0 53.0 80.3 67.8 38.5 38.2 70.0 84.5 0.9 63.0 65.1 79.0 0.0 62.5 65.0 104.2 17.4 88.5 65.4 55.4 43.0 48.9 0.4 80.3 96.2 70.3 73.0 46.4 21.0 77.2 61.4 77.2 82.2 1.0 79.7 69.9 77.2 64.2 83.6 67.0 0.9 49.2 18.8 78.0 84.5 56.0 3.5 75.7 67.3 78.2 74.1 30.0 81.2 82.5 85.5 40.8 16.7 1.0 83.1 1.7 30.1 50.0 83.4 56.6 29.0 93.9 71.4 75.3 74.6 30.9 71.8 70.8 1.0 80.4 54.2 76.6 72.7 77.2 69.4 83.1 52.3 87.5 81.9 70.8 54.6 81.6 88.6 70.0 77.7 16.0 24.2 62.0 70.6 2.4 74.0 79.0 63.4 53.1 90.9 83.4 79.9 88.0 67.4 73.0 89.1 77.6 67.9 83.3 66.8 41.0 77.4 75.1 55.1 74.1 74.5 38.4 74.0 52.3 81.6 73.5 55.9 55.5 64.0 77.1 85.9 20.1 54.7 29.8 60.7 77.8 74.2 71.7 49.1 85.8 34.3 83.5 74.9 44.3 30.0 46.4 68.5 60.0 9.0 93.8 28.7 85.3 86.2 22.6 55.1 27.5 66.7 75.6 89.3 51.4 67.9 59.4 1.0 69.4 28.7 16.0 81.1 88.4 26.9 40.4 67.5 61.9 86.6 83.2 60.5 32.5 0.0 114.2 53.6 47.5 44.9 37.4 74.3 67.1 79.0 80.2 46.0 76.1 73.6 70.5 86.4 67.7 67.3 46.0 1.6 40.3 21.0 64.8 0.0 72.2 30.9 53.1 81.4 69.0 96.8 0.5 29.7 56.7 68.8 76.4 0.9 56.2 92.6 77.2 0.7 41.1 28.1 93.4 67.4 67.8 77.0 46.0 76.3 85.1 57.9 63.4 69.9 1.0 96.9 76.9 21.6 44.2 31.5 82.1 0.2 14.0 82.7 73.5 75.8 66.4 0.0 1.5 0.0 26.3 48.5 79.1 78.8 44.0 50.0 23.2 76.8 44.0 88.2 92.4 52.1 76.6 62.0 68.8 61.4 51.1 71.8 66.9 81.3 63.5 55.9 68.3 66.6 58.9 15.2 90.6 4.6 60.5 50.9 57.8 89.9 62.0 73.0 81.3 1.1 0.2 76.1 31.0 60.1 83.7 38.6 59.8 67.1 17.6 92.1 80.0 79.0 72.7 81.4 47.0 65.7 30.0 64.7 38.0 77.3 39.9 12.8 18.1 65.7 47.6 85.1 79.1 78.3 51.1 2.4 51.1 59.8 86.4 12.1 39.5 38.4 84.6 86.0 81.4 53.2 74.4 78.8 55.9 78.6 93.0 39.2 71.9 25.7 74.7 0.0 72.4 86.5 23.0 92.2 77.9 0.2 61.3 75.0 88.2 81.2 38.9 15.9 0.0 59.4 62.8 97.7 62.8 63.9 0.0 58.4 80.7 22.1 72.6 42.7 7.0 81.4 40.2 56.0 0.8 71.7 83.9 54.3 72.1 72.6 73.0 26.6 79.7 41.8 71.7 35.9 71.5 32.7 66.4 66.2 66.1 0.0 75.1 84.8 98.1 78.1 66.7 60.7 63.1 50.0 58.2 77.1 26.4 67.2 69.9 73.7 61.7 80.6 6.0 73.4 3.0 82.5 76.8 29.7 56.6 49.3 78.4 38.1 93.2 25.4 87.1 91.8 64.1 62.3 46.2 75.3 68.0 44.1 73.5 66.5 80.3 71.7 95.2 81.3 14.6 66.3 93.2 73.4 67.0 66.3 75.2 79.8 28.9 65.5 67.3 51.9 83.3 64.0 93.1 77.9 79.8 80.7 31.8 88.0 3.4 78.4 69.9 74.8 87.0 39.0 31.5 55.9 79.4 33.1 38.5 59.7 83.7 52.7 53.2 83.8 80.0 54.2 71.4 60.5 88.5 90.1 12.5 75.6 67.9 61.2 0.0 86.2 80.4 54.6 78.8 44.3 18.1 47.9 0.7 0.0 68.2 86.5 93.6 83.9 97.2 49.6 70.1 76.4 76.7 2.1 62.9 37.0 71.2 55.1 74.5 79.1 84.2 70.1 0.2 84.8 64.5 75.5 80.7 71.7 86.6 37.9 46.7 21.7 69.3 61.6 48.7 86.6 62.1 60.4 35.3 33.8 65.8 61.6 55.1 7.4 90.1 40.4 88.4 64.0 2.2 76.2 47.0 84.4 40.6 79.0 70.0 34.5 30.3 22.8 68.4 73.662.4 74.6 4.8 40.6 1.8 90.1 0.4 69.2 6.8 26.3 41.0 48.0 6.5 51.1 0.6 86.6 7.9 71.0 82.028.3 74.8 72.8 69.8 86.9 2.2 67.9 55.4 72.1 90.1 43.8 41.4 85.9 81.0 73.0 34.4 80.1 31.5 64.6 36.0 92.6 76.5 90.2 41.5 86.1 34.1 95.274.4 74.5 91.3 64.0 81.7 86.7 54.5 34.0 77.0 30.3 58.5 70.4 78.8 27.4 75.0 28.5 84.8 77.9 59.7 5.4 60.3 0.0 0.6 74.0 74.1 36.2 22.6 0.0 1.7 39.0 83.3 50.5 64.3 64.5 58.5 25.0 86.1 76.9 63.0 83.8 85.6 73.8 84.1 72.0 27.6 88.3 68.7 72.4 86.9 23.6 88.6 68.3 84.1 84.8 77.1 43.9 38.8 40.1 83.1 6.1 78.7 78.4 79.5 66.0 17.9 73.0 76.0 77.0 1.6 52.9 35.5 0.6 85.1 42.0 71.0 80.0 67.0 83.4 72.2 76.2 83.1 19.8 94.3 50.9 68.8 82.0 74.9 71.2 74.7 72.2 53.0 82.1 59.4 74.8 82.0 71.0 69.766.5 69.9 2.6 68.5 82.8 68.6 73.9 66.6 79.6 84.5 78.4 71.2 71.4 48.3 60.0 82.0 79.1 88.7 59.6 14.0 48.9 74.1 90.4 75.9 72.3 2.5 79.1 44.5 1.2 0.0 78.3 54.0 18.3 77.3 81.5 82.5 74.5 37.0 72.0 90.7 81.1 83.7 60.0 57.1 87.2 21.4 4.5 98.4 75.3 50.9 33.8 84.9 82.0 88.5 89.1 90.1 93.3 80.4 87.1 35.3 95.9 33.7 66.7 34.0 23.5 82.2 76.2 99.3 26.7 90.3 66.1 80.0 2.5 66.2 40.0 83.1 69.5 64.1 49.8 63.9 63.7 50.2 41.0 54.7 75.0 77.5 95.0 86.2 11.4 29.9 38.2 78.6 90.2 20.8 71.2 70.6 86.1 53.0 72.6 13.3 71.5 58.0 81.1 35.5 59.0 95.9 58.9 91.7 57.0 76.0 63.0 1.9 0.0 57.2 91.8 49.2 70.0 80.3 22.4 17.5 97.7 73.4 81.9 62.0 0.2 78.3 5.2 83.7 13.0 84.3 60.9 39.2 75.3 0.2 65.0 84.1 86.6 88.0 67.0 79.7 75.8 86.3 75.8 88.9 67.1 37.0 69.9 60.4 68.2 75.2 0.6 67.1 58.6 72.3 83.1 89.8 64.8 69.1 5.3 0.8 70.4 83.0 6.3 86.0 79.8 77.5 74.3 61.1 32.6 94.3 70.1 67.5 86.6 80.8 64.6 87.9 31.1 71.4 92.7 82.4 76.7 79.7 85.9 41.2 83.7 84.7 100.8 44.2 89.6 79.7 50.1 5.0 54.7 63.1 11.3 82.0 94.0 45.6 20.1 26.5 3.5 49.4 50.3 18.9 89.0 0.4 4.9 60.9 91.1 88.1 69.0 71.5 76.1 70.5 44.0 66.9 91.6 33.3 0.8 55.4 77.9 83.8 64.9 17.7 54.9 63.4 75.2 78.1 74.3 8.2 25.8 84.1 95.1 76.7 32.3 83.3 60.9 39.8 61.0 79.6 85.2 87.3 89.2 0.0 52.3 69.5 59.3 98.6 2.7 63.8 58.1 86.3 68.4 27.6 49.2 93.3 0.0 16.3 72.3 23.9 81.4 58.6 88.7 94.6 57.0 40.9 51.8 69.2 48.3 78.9 77.4 38.1 46.0 80.6 82.1 64.9 72.2 73.8 14.3 76.1 48.1 18.5 58.2 92.6 0.2 67.7 80.1 83.2 43.0 82.2 41.6 74.7 36.2 50.8 39.0 17.2 12.8 86.7 61.4 14.0 33.0 1.1 62.6 73.1 28.4 21.9 61.3 77.0 32.9 81.3 70.5 44.2 50.3 84.4 29.2 91.2 88.4 61.0 57.3 8.1 30.0 77.8 74.4 8.0 0.0 16.819.3 9.0 74.7 82.2 31.4 68.3 70.5 79.6 62.6 80.6 60.1 86.9 0.5 77.5 13.0 88.1 0.5 67.5 43.1 76.1 89.4 61.4 28.3 61.7 47.0 77.5 74.0 1.0 39.8 0.7 77.9 23.2 85.6 72.0 24.8 38.4 0.4 55.8 16.7 67.0 32.0 95.1 57.4 72.0 42.5 67.7 71.4 32.3 72.0 83.2 2.7 29.2 41.9 80.7 92.5 6.0 80.2 71.2 78.9 71.4 79.7 74.3 39.4 27.6 40.7 7.7 86.5 21.8 0.2 85.0 1.1 6.0 84.4 62.7 80.5 64.9 72.4 75.6 77.2 66.0 77.3 44.6 73.2 67.0 17.9 66.7 72.4 44.8 42.4 92.3 41.8 84.0 76.3 0.8 79.4 70.4 38.3 73.1 63.0 0.2 0.2 38.5 74.7 47.0 90.3 92.3 72.9 62.2 93.3 83.6 87.3 74.4 71.6 66.3 94.1 68.8 90.8 100.2 3.7 70.7 83.2 44.2 60.9 65.0 91.5 76.0 7.1 65.5 70.9 65.9 64.1 40.5 84.359.7 71.7 0.3 62.8 27.8 51.9 64.4 90.2 23.1 70.0 21.2 32.4 49.8 87.5 32.7 4.8 70.3 79.2 66.5 4.0 47.9 67.2 90.3 79.0 79.6 2.2 13.0 56.2 88.4 82.6 73.9 85.9 73.7 46.2 94.4 81.5 73.6 88.8 8.3 79.9 0.1 92.4 20.0 78.8 0.0 60.7 74.1 42.0 62.7 56.0 83.3 34.1 64.2 35.8 29.0 72.5 69.1 16.5 75.3 4.2 84.9 82.6 30.9 80.7 55.0 95.4 84.4 88.5 82.8 1.8 88.2 79.1 54.0 78.3 75.9 0.0 86.4 37.3 42.3 76.5 76.2 31.6 56.0 79.5 65.3 56.0 64.9 99.6 73.5 81.6 67.4 87.1 64.9 12.9 105.1 64.0 64.0 83.1 86.1 74.1 63.1 90.0 52.5 25.0 75.7 80.0 81.1 63.2 88.4 91.5 88.3 97.7 86.7 68.8 70.6 0.7 64.7 5.4 85.9 71.5 64.2 92.7 89.4 65.0 63.8 37.6 31.5 59.4 95.9 68.4 48.0 79.3 34.6 86.3 80.4 84.9 24.0 81.1 87.3 83.2 52.5 63.1 44.6 3.0 20.1 22.6 86.5 76.4 1.0 93.2 67.0 80.2 73.8 76.5 71.0 26.5 33.0 81.3 78.8 69.0 73.0 57.6 24.0 91.8 58.0 50.7 23.7 101.8 90.7 78.3 72.0 0.0 7.0 53.2 0.7 55.0 76.6 51.0 43.1 32.0 80.3 87.1 84.5 82.3 0.0 71.4 91.5 81.5 57.3 44.2 77.5 70.4 77.9 84.8 90.3 2.0 70.7 81.4 18.6 53.7 69.2 52.0 85.5 87.5 64.0 0.9 86.5 59.8 24.1 49.2 25.0 21.0 67.6 4.2 0.3 0.0 19.0 0.0 7.0 77.0 76.9 83.0 83.8 63.3 91.6 20.3 18.3 62.6 90.1 73.5 0.0 0.2 51.0 65.3 94.0 55.0 53.1 58.1 76.4 18.8 36.2 25.0 20.0 77.3 40.7 88.4 55.7 52.0 19.4 79.0 91.6 58.9 78.0 54.3 87.4 71.4 88.1 39.0 28.2 96.6 42.3 29.7 75.2 82.9 92.3 27.0 71.3 76.9 76.8 39.0 34.7 91.9 29.0 79.8 56.5 95.5 90.7 39.0 0.6 93.6 70.5 1.0 45.0 69.0 14.8 68.6 8.0 94.8 99.3 4.4 38.0 80.2 47.2 4.2 0.8 87.0 32.0 25.7 57.6 46.8 42.7 2.5 29.1 26.1 93.7 92.4 87.0 84.5 72.5 19.0 80.7 84.7 31.0 72.1 36.2 33.3 34.6 58.1 88.2 61.7 96.0 73.4 83.3 66.1 46.2 42.6 30.5 61.3 24.1 81.1 67.1 69.5 91.6 81.4 65.2 25.4 32.7 64.2 59.6 58.3 42.8 55.1 78.271.5 82.1 77.2 19.9 67.0 71.2 42.2 78.2 40.0 69.6 66.7 60.3 82.9 31.7 87.0 82.6 83.1 18.0 0.5 22.0 81.8 21.4 72.8 88.1 78.6 92.5 40.1 77.7 84.6 40.6 63.8 90.0 77.5 66.6 0.2 43.0 71.3 0.1 49.9 87.1 90.6 63.0 83.3 0.0 74.6 85.5 87.5 60.0 76.2 40.5 45.0 3.0 91.8 63.0 75.0 30.1 77.5 70.4 79.5 88.1 79.4 89.9 55.3 72.6 80.3 42.5 17.9 46.0 49.4 35.0 40.7 67.7 80.8 76.4 71.0 73.3 76.4 78.6 58.5 89.6 83.1 87.7 79.4 69.1 55.9 82.3 79.9 89.7 82.2 72.6 70.3 53.1 40.5 79.2 30.1 55.7 52.4 69.5 80.3 50.2 67.0 61.4 65.9 34.0 57.0 33.8 49.0 83.6 42.0 48.0 75.6 84.5 34.0 52.8 90.1 36.0 89.2 87.4 49.2 13.5 1.2 92.0 77.4 65.7 75.3 37.0 79.6 36.0 59.0 59.6 66.9 82.4 85.6 48.2 70.0 67.3 76.2 97.0 79.7 47.5 75.2 2.0 29.6 81.9 76.6 60.1 84.9 64.6 7.0 54.8 87.6 75.0 62.7 61.7 66.5 66.7 33.3 73.4 51.1 74.8 83.1 85.5 77.1 13.0 42.0 84.9 82.2 91.5 18.0 79.4 63.0 38.7 40.8 63.3 68.9

82.1 20.0 96.5 23.0 15.0 34.6 79.5 84.5

57.6 72.7 74.7 78.7 84.5 87.8 76.9 90.1

59.1 79.3 61.3 76.4 90.3 61.3 77.2 61.9 79.1 0.4 1.0 25.0 71.3

72.6 78.1 1.1 68.7 75.6 38.5 18.0 18.3 68.0 76.9 84.8 79.2 78.2 27.9 69.5 90.5 70.2 77.0 50.7 32.0 72.7 63.0

82.5 93.8 48.0 49.4

48.3

59.8

32.0 50.5

86.1 86.4

73.2 64.5 79.8

34.8 73.2 66.7 1.7 84.3 71.1 19.2 75.9 72.8 0.7 66.5 52.4 3.3 74.1 66.4 89.1 31.2 35.6 55.1 87.3 70.1 81.15.0 73.8 57.0 58.9 1.0 81.9 92.5 90.4 92.3 58.9 31.9 80.2 15.3 80.1 44.4 54.3

61.7 82.3

83.6 81.5

41.1 86.9 86.7

80.6

Figure 1: WikiTree’s directed multigraph (subgraph with 10,000 vertices). Each link color represents a connection of a different type: blue is a Parent link, red is a Sibling link, green is a Spouse link, and yellow is a Child link. The color of each vertex defines the vertex’s gender: blue represents male vertices, pink represents female vertices, and gray represents unknown gender. Each vertex label, which is visible by zooming into the graph, contains the vertex lifespan if one exists.

4. Gender(v) - The gender of v converted to an integer, where male is set to 1, female is set to 2, and unknown gender is set to 0.

5. Birth-Country(v) - The country in which v was born.

6. Death-Country(v) - The country in which v died.

7. Age-of-Death(v) - The age of death of v (also referred to as the lifespan of v) which is calculated, if accurate dates are available, by subtracting the birth date of v from the death date of v.

8 3.2.2 Nuclear Family Features 8. Children-Number(v) - the number of children which v had. The formal Children- Number(v) definition is: Children-Number(v) := |{u ∈ V |u ∈ V ∧ ∃(v, u, Child, t) ∈ E}|.

9. Spouse-Number(v) - the number of individuals to which v was married to. The formal Spouse-Number(v) definition is: Spouse-Number(v) := |{u ∈ V |(u ∈ V ∧ ∃(v, u, Spouse, t) ∈ E}|.

10. Min-Spouse-Age-of-Death(v) - the minimum age of death of v’s spouses. The formal Min-Spouse-Age-of-Death(v) definition is:

Min-Spouse-Age-of-Death(v) := min({Age-of-Death(u) 6= ∅|u ∈ V ∧ ∃(v, u, Spouse, t) ∈ E}), where the function min(S) returns the minimum value among set S members, or 0 if S is empty. 11. Max-Spouse-Age-of-Death(v) - the maximum age of death of v’s spouses. The formal Max-Spouse-Age-of-Death(v) definition is:

Max-Spouse-Age-of-Death(v) := max({Age-of-Death(u) 6= ∅|u ∈ V ∧ ∃(v, u, Spouse, t) ∈ E}), where the function max(S) returns the maximum value among set S members, or 0 if S is empty. 12. Avg-Spouse-Age-of-Death(v) - the average age of death of v’s spouses. The formal Avg-Spouse-Age-of-Death(v) definition is:

Avg-Spouse-Age-of-Death(v) := avg({Age-of-Death(u) 6= ∅|u ∈ V ∧ ∃(v, u, Spouse, t) ∈ E}), where the function avg(S) returns the average value of set S members, or 0 if S is empty.

3.2.3 Extended Family Features 13. Father-Age-of-Death(v) - v’s father age of death. The formal Father-Age-of- Death(v) definition is: Father-Age-of-Death(v) := Age-of-Death(Father(v)), where the function Father returns the father vertex of v, if one exists. Namely, F ather(v) := u, where u ∈ V ∧ gender(u) = 1 ∧ ∃(v, u, P arent, t) ∈ E. 14. Mother-Age-of-Death(v) - v’s mother age of death. The formal Mother-Age-of- Death(v) definition is: Mother-Age-of-Death(v) := Age-of-Death(Mother(v)), where the function Mother returns the mother vertex of v, if one exists. Namely, Mother(v) := u, where u ∈ V ∧ gender(u) = 2 ∧ ∃(v, u, P arent, t) ∈ E.

9 15. Paternal-Grandfather-Age-of-Death(v) - v’s paternal grandfather’s age of death, if one exists. The formal Paternal-Grandfather-Age-of-Death(v) definition is:

Paternal-Grandfather-Age-of-Death(v) := Age-of-Death(Father(Father(v))).

16. Maternal-Grandfather-Age-of-Death(v) - v’s maternal grandfather’s age of death, if one exists. The formal Maternal-Grandfather-Age-of-Death(v) definition is:

Maternal-Grandfather-Age-of-Death(v) := Age-of-Death(Father(Mother(v))).

17. Paternal-Grandmother-Age-of-Death(v) - v’s paternal grandmother’s age of death, if one exists. The formal Paternal-Grandmother-Age-of-Death(v) definition is:

Paternal-Grandmother-Age-of-Death(v) := Age-of-Death(Mother(Father(v))).

18. Maternal-Grandmother-Age-of-Death(v) - v’s maternal grandmother’s age of death, if one exists. The formal Maternal-Grandmother-Age-of-Death(v) definition is:

Maternal-Grandmother-Age-of-Death(v) := Age-of-Death(Mother(Mother(v))).

19. Sibling-Number(v) - the number of brothers and sisters v had. The formal Sibling- Number(v) definition is:

Sibling-Number(v) := |{u ∈ V |u ∈ V ∧ ∃(v, u, Sibling, t) ∈ E}|.

20. Max-Sibling-Age-of-Death(v) - the maximum age of death of v’s siblings. The formal Max-Sibling-Age-of-Death(v) definition is:

Max-Sibling-Age-of-Death(v) := max({Age-of-Death(u) 6= ∅|u ∈ V ∧ ∃(v, u, Sibling, t) ∈ E}).

21. Avg-Sibling-Age-of-Death(v) - the average age of death of v’s siblings. The formal Avg-Sibling-Age-of-Death(v) definition is:

Avg-Sibling-Age-of-Death(v) := avg({Age-of-Death(u) 6= ∅|u ∈ V ∧ ∃(v, u, Sibling, t) ∈ E}).

Using the features defined above, we specify the following feature sets, which will later be used to construct our multiple linear regression models and ML classifiers: (a) All- Numeric-Features - a set which contains all the defined-above features that return numeric values, except the Death-Year feature; (b) Heritage-Features - a set which includes all the extended family features, including the Birth-Year and Gender features; and (c) Nuclear- Family-Features - a set which includes all the nuclear family features, including Birth-Year and Gender.

10 3.3 Statistical and Predictive Analysis In this study, we used various algorithms and methods to calculate the variations in human lifespan over the past centuries, to identify which features are correlated with human lifespan and longevity, and to create predictive models which can assist in predicting human lifespan. In the remainder of this subsection, we describe in detail each one of our methods and algorithms.

3.3.1 Lifespan Variations over Time After we had extracted the features for each vertex in the graph, we could utilize these features to calculate the variations in human lifespan over an extended period of time. To perform these calculations, we created two vertices datasets. The first dataset was the All-Dataset, which included all the vertices with valid values of Age-of-Death, while the second dataset was the -Dataset which included only vertices with valid values of Age-of-Death of people who were born in a specific country - in this study, we chose to take a closer look at people born in the United States. We utilized the All-Dataset and the United-States-Dataset to specifically look at the lifespan of people who were born in each quarter of a century between 1650 and 1900. For each quarter of a century on each dataset, we calculated the Age-of-Death distribution of those people born in the chosen quarter. For example, in the second dataset, we had a group of 22,021 people who were born in the United States and lived between 1700 and 1724; we then calculated the percent of the population that died at each age between 0 and 122.10 Additionally, for the All-Dataset-10 and for the United-States-Dataset-10, and for each year from 1650 to 1900, we calculated both the average and median lifespans of the people who were born in each year and outlived the age of 10. We also repeated these average and median calculations for each gender, using the Male-Dataset-10, Female-Dataset-10, Male-United-States-Dataset-10, and Female-United-States-Dataset-10 datasets.

3.3.2 Linear Regression One of the main goals of this study was to identify features which are correlated with lifespan and with longevity. To identify features correlated with an inheritance of human lifespan, we computed for each extended family feature, which was defined in Section 3.2.3, a simple linear regression Y = α+βX, where Y was set to be the Age-of-Death vector, and X was set to be selected feature values. For each feature we chose only vertices from the -Dataset-10, in which both the Age-of-Death value was greater or equal ten11 and the selected feature value existed.12 We then evaluated each simple linear regression by computing the regression’s P-value and R-squared values. To identify if an individual’s lifespan was correlated with the lifespan of his or her spouse(s), we repeated the same process of constructing a simple linear regression between the Age-of-Death feature and the Avg-Spouse-Age-of-Death, Max-Spouse-Age-of-Death,

10122 is the maximum confirmed human lifespan [23]. 11We chose to use a minimum lifespan of 10 to avoid adding infant and child mortality, which might be misreported. 12For features such as Max-Sibling-Age-of-Death and Min-Spouse-Age-of-Death, which involved calcu- lation of minimum, maximum or average, we ignored vertices with missing values, although by definition these features returned a valid value of 0.

11 and Min-Spouse-Age-of-Death features. However, this time we used the Married-Dataset to include only individuals who were married at least once. To identify if longevity is correlated with reproductive success, we repeated the same process of constructing a simple linear regression between the Age-of-Death feature and the Children-Number feature. However, with respect to Westendorp and Kirkwood’s [22] results in mind, we used Children-Number-Dataset-50, Male-Children-Number-Dataset-50, and Female-Children-Number-Dataset-50 datasets, which only contained vertices with age of death of at least 50, namely after menopause.

3.3.3 Multiple Linear Regression In this study, we used backward stepwise multiple linear regression to create models for predicting the Age-of-Death of individuals who had been born by 1900. We constructed these regression models by using the All-Numeric-Features, the Heritage-Features, and the Nuclear-Family-Features sets, which were defined at the end of Section 3.2. For construct- ing our first two models, we only used vertices from the No-Missing–Dataset-10 dataset with valid complete values, including defined gender values, for each selected features set of vertices who outlived the age of 10. Additionally, to prevent bias due to the tendency of people to get married and have children in later stages of life, for the Nuclear-Family- Features set we only used vertices from the No-Missing–Dataset-50 dataset, i.e., those who outlived the age of fifty. We evaluated these multiple linear regression models by calculating the P-value, as well as the Multiple R-squared, Adjusted R-squared, and Residual Standard Error (RSE) values.

3.3.4 Machine Learning Algorithms One of the major drawbacks of using online genealogy datasets is the issue of missing values. In many genealogy datasets not all the profile data is complete; many profiles contain missing values due to nonexistent data or privacy considerations [28]. To overcome the issue of missing values and still gain predictive information from the profiles with nonexistent data, we chose to use Machine Learning algorithms, such as decision trees and Naive-Bayes algorithms, which can deal with missing values. We evaluated various supervised learning algorithms in an attempt to predict which individuals who were born in the United States between 1650 and 1900, and outlived the age of fifty, will also outlive the age of 80. We constructed our classifiers using Weka [9], a popular suite of ML, and the features defined in the United-States–Dataset-50 dataset. We used all numeric features in each dataset, except the Age-of-Death and Death-Year features. Additionally, we also treated unknown gender values as missing values, instead of replacing them with 0 values. Using these datasets as training sets, we used Weka’s OneR, C4.5 (J48) decision tree, K-Nearest-Neighbors (IBk; with K=3,5), Naive-Bayes, RandomForest, and Bagging implementations of the corresponding algorithms. For each of these algorithms, most of the configurable parameters were set to their default values except for the J48 decision tree classifier, in which the pruning option was not enabled. We evaluated each classifier using the 10-folds cross validation method and calculated the True-Positive, False-Positive, F-Measure, and the Area-Under-Curve (AUC) measure. The AUC is a standard way to compare classifier performances [3], in which 0.5 a value represents a random classifier. Additionally, to obtain an indication of the usefulness of the various features, we analyzed their importance using Weka’s information gain attribute selection algorithm.

12 3.4 WikiTree Dataset To test and evaluate our methods and algorithms, we chose to use information collected from the WikiTree website. This is a free and accessible collaborative family history website started by Chris Whitten [25], and it contains more than 5 million profiles [5] of individuals who primarily lived in the past. WikiTree contains many profile pages of people who lived in the previous centuries, and many of the profiles contain the following details about each individual: full name, gender, date of birth, date of death, location of birth, location of death, parents’ profiles, children’s profiles, spouses’ profiles, and siblings’ profiles. Often, in order to maintain the privacy of still-living people, the website limits access to their profile personal details [28]. In order to maintain the integrity of WikiTree profile data, many profiles give reference to the source of the data presented in the profile, and most profiles have a profile manager who has primary responsibility for WikiTree profiles [27]. In addition, to prevent editing of profiles by untrusted users, each WikiTree profile has an independent “Trusted List” of people who can edit and view the profile [26], making the data in many profiles only editable to a limited number of people. To collect profile information from WikiTree, we developed a web crawler which crawled and parsed only public profiles from the website. Using our crawler, we have downloaded and parsed 1,070,189 public profile pages. Using these profiles, we were able to construct a directed multigraph with 9,192,212 links and 1,382,752 vertices, out of which 118,590 vertices represented at least distinct 28,011 private profiles. Moreover, the constructed multigraph contained at least 416,030 vertices represent individuals born in the United States, according to their profile pages. These vertices were connected by 5,168,275 links to other vertices in the multigraph (see Figure 1, and Tables 3 and 4).

Table 3: WikiTree Dataset Statistics

Graph Property All Profiles Vertices of People Born in the United States Downloaded Profile Number 1,070,189 416,030 Vertices Number 1,382,752 416,030 Links Number 9,191,147 5,168,275 Male Vertices Number 553,411 217,387 Female Vertices Number 506,605 196,793 Vertices with Valid 545,993 285,868 Lifespan (Age-Of-Death) Value

4 Results

In the following subsections, we present the results obtained using the algorithms and methods described in Section 3. The results consist of three parts: First, we present the results of calculating lifespan variations over time. Second, we present the results of the simple linear regression and multi-linear regression analysis techniques which were described in Section 3.3.2. Finally, we present the results of the ML algorithms mentioned in Section 3.3.4.

13 Table 4: All-Dataset Number of Profiles Born in Each Year

Decade/Year 0 1 2 3 4 5 6 7 8 9 1650 1177 1243 1186 1213 1414 1218 1169 1311 1071 1814 1660 1192 1280 1330 1343 1645 1400 1368 1539 1372 2136 1670 1346 1626 1568 1552 1917 1469 1334 1660 1504 2324 1680 1500 1660 1489 1567 1840 1576 1573 1665 1431 2437 1690 1484 1516 1544 1624 2012 1658 1586 1734 1602 2919 1700 1657 1936 1804 1885 2173 1814 1765 1862 1630 2579 1710 1665 1916 1701 1753 2172 1876 1769 1854 1734 2739 1720 1797 1870 1921 1915 2378 1919 1853 1980 1800 2757 1730 1809 2013 1914 1987 2558 2037 2018 2188 1911 2997 1740 1939 2093 1970 2094 2643 2130 2006 2184 1960 3114 1750 1917 2195 2103 2190 2591 2411 2201 2264 2177 3118 1760 2303 2490 2349 2361 3019 2330 2345 2422 2275 3181 1770 2248 2399 2323 2471 2992 2500 2282 2428 2192 3297 1780 2358 2476 2448 2571 2923 2659 2617 2668 2629 3637 1790 2603 2901 2780 2901 3268 2831 3057 2990 2859 4007 1800 3086 3070 3068 3199 3653 3214 3159 3252 3085 3922 1810 3275 3489 3188 3091 3580 3359 3168 3302 3248 3903 1820 3399 3445 3459 3485 3987 3601 3630 3813 3651 4250 1830 3900 3920 3989 4051 4408 4034 4150 4168 4094 4511 1840 4156 4280 4516 4321 4657 4355 4256 4408 4469 4724 1850 4246 4117 4230 4197 4461 4417 4527 4578 4603 4868 1860 4418 4352 4072 3999 4183 4000 4286 4393 4320 4770 1870 4126 4164 4265 4191 4433 4333 4296 4114 4351 4387 1880 3874 3644 3852 3721 4020 3717 3741 3683 3697 3942 1890 3554 3577 3591 3479 3529 3330 3375 3249 3222 3357 1900 3060 2708 2786 2729 2713 2640 2530 2603 2579 2476

4.1 Lifespan Variations over Time Results As described in Section 3.3.1, we utilized the All-Dataset and the United-States-Dataset to compute the changes of lifespan over each quarter of a century between 1650 and 1900. Then, we used these same datasets to take a closer look at the people who had been born during this 250-year span. For each quarter of a century on each dataset, we calculated the Age-of-Death distribution of the people who were born in the chosen quarter. The results showing the lifespan variations over time for the All-Dataset are presented in Figure 2a, and for the United-States-Dataset in Figure 2b. We also used the All-Dataset-10 and the United-States-All-Dataset-10 to calculate the average and the median lifespans for each gender, and for both genders, in each year between 1650 and 1900. The results of these calculations are presented in Figures 3 and 4.

14

5% 0.05 5% 0.05 0.04 0.04 0.03 0.03

1650-1675 4% 0.02 1650-1674 4% 0.02 1875-1900 1650-1674 1650-1674 1875-1899 0.01 0.01 1675-1700 1675-1700 0 0 1700-1724 0 50 100 1700-1724 0 50 100 3% 3% 1725-1749 Age-of-Death 1725-1749 Age-of-Death 1750-1774 1750-1774 1775-1799 1775-1799 1800-1824 2% 2% 15 1800-1824

1825-1849 1825-1849

Precentage of the Profiles the of Precentage Precentage of the Profiles the of Precentage 1850-1874 1850-1874 1875-1899 1% 1875-1899 1%

0% 0% 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 110 115 120 Age-of-Death Age-of-Death

(a) All-Dataset Lifespan Variations. (b) United-States-Dataset Lifespan Variations.

Figure 2: Vertex Lifespan Variations, 1650-1900. 75.00 75.00

73.00 73.00

71.00 71.00

69.00 69.00

67.00 67.00 Both Genders Both Genders Male Male

65.00 65.00 Average Lifespan Average Female Lifespan Average Female

63.00 63.00

61.00 61.00

59.00 59.00

57.00 57.00 1650 1675 1700 1725 1750 1775 1800 1825 1850 1875 1900 1650 1675 1700 1725 1750 1775 1800 1825 1850 1875 1900 Year of Birth Year of Birth

16 (a) All-Dataset-10 - Average Lifespan. (b) United-States-Dataset-10 - Average Lifespan.

Figure 3: Average Vertex Lifespans, 1650-1900.

77.00 77.00

75.00 75.00

73.00 73.00

71.00 71.00

69.00 69.00

Both Genders Both Genders 67.00 67.00

Male Male Median Median Lifespan Median Median Lifespan Female Female 65.00 65.00

63.00 63.00

61.00 61.00

59.00 59.00

57.00 57.00 1650 1675 1700 1725 1750 1775 1800 1825 1850 1875 1900 1650 1675 1700 1725 1750 1775 1800 1825 1850 1875 1900 Year of Birth Year of Birth (c) All-Dataset-10 - Median Lifespan. (d) United-States-Dataset-10 - Median Lifespan.

Figure 4: Median Vertex Lifespans, 1650-1900. 4.2 Regression Analysis Results Using R-project software [19], we ran several simple linear and multi-linear regression algorithms based on the features we defined in Section 3.2. From the regression algorithms, we generated and evaluated several prediction models in order to determine the vertices’ Age-of-Death.

4.2.1 Simple Linear Regression Analysis Results As described in Section 3.3.2, we computed several simple linear regression models in order to predict linear correlations between the Age-of-Death and other features. We first identified features correlated with the inheritance of human lifespan by computing a simple regression model for each feature in the Extended Family Features set, in order to predict the Age-of-Death feature. In these calculations, we used only the vertices who outlived the age of 10 and who also had valid existing information for each vertex’s selected feature; i.e., we only used vertices which exist in the -Dataset-10 for each selected feature. The simple regression results revealed that positive small but significant correlations exist between most of the Extended Family Features and the vertices’ lifespans (see Ta- ble 5). These correlations have R-squared values ranging from 0.0015 to 0.05, with a very low P-value of 2·10−16 indicating that the correlation is highly significant, where the high- est R-squared values were obtained for the Avg-Sibling-Age-of-Death (R-squared=0.05) and Max-Sibling-Age-of-Death (R-squared=0.0272) features, and the lowest R-squared values were obtained for the grandparents’ lifespan features (R-squared ranging from 0.0015 to 0.0028). Additionally, we also discovered a small negative correlation between the Sibling-Number feature and the vertices’ lifespan, with a slope of -0.155, R-squared of 0.0021, and P-value of 2 · 10−16. We then repeated the simple linear regression calculation to identify correlations between the vertices’ lifespans and their spouses’ lifespans by using the Avg-Spouse- Age-of-Death, Max-Spouse-Age-of-Death, and Min-Spouse-Age-of-Death features with the Married-Dataset. We discovered that each one of these features demonstrated a significant correlation, with a low P-value of 2 · 10−16 and a maximum R-squared value of 0.0564 (see Table 5). Lastly, to identify if longevity is correlated with reproductive success, we computed simple linear regressions between the Age-of-Death feature and the Children-Number feature on the following datasets: Children-Number-Dataset-50, Male-Children-Number- Dataset-50, and Female-Children-Number-Dataset-50. Using the simple linear regression, we obtained the following correlation results: (a) on the Children-Number-Dataset-50 dataset (n = 375, 938) the regression returned a negative slope of -0.006, with a R-squared of 4.1·10−6 and a P-value of 0.2137; (b) on the Male-Children-Number-Dataset-50 dataset (n = 214, 864) the regression returned a positive slope of 0.044, with a R-squared of 0.0002 and a P-value of 1.64 · 10−11; and (c) on the Female-Children-Number-Dataset-50 dataset (n = 160, 149) the regression returned a negative slope of -0.079, with a R-squared of 0.0006 and a P-value of 2.2 · 10−16.

4.2.2 Multi-Linear Regression Analysis Results To create models which can estimate a vertex age of death based on the vertex’s features, we chose to use the backward stepwise multiple linear regression technique. By combining this technique with the various predefined features sets, we created three multiple regres- sion models which presented Multiple R-squared values of 0.085, 0.042, and 0.025, for

17 Table 5: Simple Linear Regression Results

Feature Sample Y-Intercept ( Slope R-Squared P-value Size (n) ( Father-Age-of-Death 339,039 53.69 0.129 0.0095 Mother-Age-of-Death 291,823 54.97 0.114 0.0098 Paternal-Grandfather- 242,857 58.11 0.064 0.0026 Age-of-Death Maternal-Grandfather- 159,576 57.84 0.067 0.0028 Age-of-Death Paternal-Grandmother- 200,828 59.67 0.043 0.0015 Age-of-Death Maternal- 137,711 58.95 0.052 0.0021 Grandmother-Age-of- Death Sibling-Number 508,520 63.87 -0.155 0.0021 Max-Sibling-Age-of- 309,282 46.40 0.201 0.0272 Death Avg-Sibling-Age-of- 309,282 45.64 0.287 0.0500 Death Min-Spouse-Age-of- 255,248 51.89 0.212 0.0453 Death Max-Spouse-Age-of- 255,248 49.26 0.247 0.0564 Death Avg-Spouse-Age-of- 255,248 50.02 0.238 0.0526 Death

the All-Numeric-Features set, Heritage-Features set, and Nuclear-Family-Features sets re- spectively (see Table 6). We also obtained the following multiple linear regression models for each features set: For the All-Numeric-Features set, we computed the following model using data collected from 59,893 vertices who were born by 1900 and outlived the age of 10: Age-of-Death(v) = 9.813 − 2.194 · Gender(v) + 0.023 · Birth-Year(v) + 0.261 · Children-Number(v) + 1.706 · Spouse-Number(v) + 0.084 · Max-Spouse-Age-of-Death(v) + 0.053 · Father-Age-of-Death + 0.039 · Mother-Age-of-Death(v) + 0.015 · Paternal-Grandfather-Age-of-Death(v) + 0.016 · Maternal-Grandfather-Age-of-Death(v) + 0.011 · Paternal-Grandmother-Age-of-Death(v) + 0.062 · Sibling-Number(v) + −0.1 · Max-Sibling-Age-of-Death(v) + 0.193 · Avg-Sibling-Age-of-Death(v) For the Heritage-Features set, we computed the following model using data collected

18 from 59,893 vertices who were born by 1900 and outlived the age of 10:

Age-of-Death(v) = 18.184 − 2.06 · Gender(v) + 0.021 · Birth-Year(v) + 0.057 · Father-Age-of-Death + 0.041 · Mother-Age-of-Death(v) + 0.013 · Paternal-Grandfather-Age-of-Death(v) + 0.016 · Maternal-Grandfather-Age-of-Death(v) + 0.016 · Paternal-Grandmother-Age-of-Death(v) + −0.136 · Max-Sibling-Age-of-Death(v) + 0.202 · Avg-Sibling-Age-of-Death(v)

For the Nuclear-Family-Features set, we computed the following model using data collected from 349,118 vertices who were born by 1900 and outlived the age of 50:

Age-of-Death(v) = 47.73 + 1.063 · Gender(v) + 0.013 · Birth-Year(v) + −0.034 · Children-Number(v) + −0.22 · Spouse-Number(v) + −0.08 · Min-Spouse-Age-of-Death(v) + 0.098 · Avg-Spouse-Age-of-Death(v)

Table 6: Multiple Linear Regression Results

Features Set All-Numeric Heritage Features Nuclear-Family Features Features Model Attributes Sample Size (n) 59,893 59,893 349,118 Multiple R-Squared 0.085 0.042 0.025 Adjusted R-Squared 0.085 0.042 0.025 P-Value Residual Standard 20.33 20.80 11.95 Error

4.3 Machine Learning Algorithms Results We evaluated various supervised learning algorithms in an attempt to predict which indi- viduals who were born in the United States between 1650 and 1900 and outlived the age of fifty, will also outlive the age of eighty. We constructed our classifiers using the United- States–Dataset-50 dataset, which contained features of 183,494 vertices who outlived the age of fifty, out of which 58,975 vertices outlived the age of 80. To better understand which features were most useful to our classification algorithms, we analyzed the vari- ous features’ importance using Weka’s information gain features selection algorithm. For the United-States–Dataset-50 dataset, the top eight features with the highest rank re- trieved from Weka’s information gain features selection algorithm were: (a) Birth-Year

19 (0.0058), (b) Max-Sibling-Age-of-Death (0.0047), (c) Avg-Sibling-Age-of-Death (0.0023), (d) Max-Spouse-Age-of-Death (0.0019), (e) Gender (0.0018), (f) Avg-Spouse-Age-of-Death (0.0016), (g) Min-Spouse-Age-of-Death (0.0016), (h) Father-Age-of-Death (0.0008), (i) Mother-Age-of-Death (0.0006), and (j) Parental-Grandfather-Age-of-Death (0.0002). At the end of the list, the Maternal-Grandfather-Age-of-Death, the Children-Number, and the Sibling-Number features received an information gain score of 0. On this dataset, the RandomForest classifier received the maximum AUC of 0.632, better than a random classifier with AUC of 0.5, while the maximum True-Positive value of 0.976 was obtained by the Decision-Tree classifier, and the minimum False-Positive of 0.774 was obtained by K-Nearest-Neighbors (K=3) classifier (see Table 7). We used T- tests with a significance of 0.05 to compare the AUC results of the RandomForest and the naive OneR classifiers. According to the T-test result, RandomForest classifier performed better in terms of AUC, than the naive OneR classifier.

Table 7: Machine Learning Classifiers Results

Measures True Positive False-Positive F-Measure AUC Algorithms OneR 0.908 0.895 0.779 0.507 Decision-Tree (J48) 0.976 0.946 0.805 0.587 Naïve-Bayes 0.973 0.949 0.803 0.571 K-Nearest- 0.797 0.774 0.737 0.518 Neighbors (K=3) K-Nearest- 0.846 0.820 0.757 0.522 Neighbors (K=5) RandomForest 0.947 0.858 0.805 0.632 Bagging 0.971 0.917 0.807 0.620

5 Discussion

To our knowledge, this study is the largest study to date which utilizes genealogical datasets to better understand factors that correlate with human lifespan. The algorithms and methods presented throughout this study, which were evaluated on the WikiTree dataset, reveal several interesting patterns and correlations. Firstly, our results of lifespan variations over time, presented in Section 4.1 and in Figure 2, demonstrate how the lifespans of human population changed over the previous centuries. The lifespan graphs presented in Figure 2 show high infant and children death rates as well as local maximum values between the ages of 70 and 80; these results resemble the lifespan graphs presented in Mitchell et al. [14] and by the UK Office of National Statistics [17]. This resemblance supports our assumptions regarding the integrity of the WikiTree dataset, which indeed contains data on human population with largely accurate birth and death dates. However, the infant death rates presented in these graphs are not entirely accurate; according to Wegman [21], in 1900 the infant mortality rates in the United States were about 15%, which is higher than the values presented in our results. We assume that the main reason for this discrepancy was the lack of a uniform, formal

20 definition of “live births,” which was not standardized until 1951 [21]. Therefore, in most of this study we used as a sample set only people who outlived the age of ten. Nevertheless, by analyzing these graphs, we can observe that over time, lifespans increased and fewer people passed away at young ages. Another observation that can be concluded from these graphs is that even in the second half of the seventeenth century, people who outlived the age of ten would likely outlive the age of sixty. Indeed, according to our median lifespan analysis, presented in Figure 3d, the median age in 1650 for people who were born in the United States and outlived the age of ten was 62.46 for males and 62.04 for females. Secondly, our median and average population lifespan calculations, presented in Fig- ures 3 and 4, reveal some interesting patterns. By analyzing the graphs, we can locate several years in which the average lifespans sharply decreased for both males and females. For example, for people who were born in the United States in 1800 and outlived the age of 10, the average lifespans for males and females were 66.39 and 64.45, respectively. However, for people born in the United States ten years later, in 1810, the average lifespan was reduced by around 2 years: males’ lifespans decreased to 64.31, and females’ lifespans decreased to 62.20. An additional and even more interesting reoccurring pattern can be identified in Figure 3d in which, for a specific time period, the median lifespan for males increased while the median life span for females suddenly decreased, or vice versa. For example, from 1650 to 1660 the male median lifespan increased from 62.46, to 66.82 while in the same period of time the female median lifespan decreased from 62.04 to 60.24. Sim- ilar patterns reoccur between 1770 and 1780, only this time the female average lifespan increased from 65.57 to 68.69, while the male average lifespan decreased from 66.87 to 64.79 (see Figure 3b). Another interesting pattern can be found between 1850 and 1900 where in just a half a century the female average lifespan sharply increased from 62.66 to 72.5. We hope to discover underlying reasons for these patterns in our future research. Thirdly, using simple linear regression algorithms, we uncovered small but significant correlations between various features and the Age-of-Death feature which are presented in Table 5. We found small positive significant correlations between the Extended Family Features and the Age-of-Death feature. For all these correlations, R-squared values were small and ranged from 0.0015 to 0.05 with a P-value of 2.2 · 10−16, and these may indicate that lifespan can “run in the family.” However, due to the small R-squared values in our results, we can conclude that the influence of inherited lifespan is limited and, in fact, negligible after more than one generation. Alternately, the observed correlation could be explained due to socioeconomic reasons: ancestors with long lifespan might also indicate a higher socioeconomic status, which can be passed on to their offspring. We also found significant correlations between the Avg-Spouse-Age-of-Death, Max-Spouse-Age-of-Death, and Min-Spouse-Age-of-Death features and the Age-of-Death feature, with a low P-value and a maximum R-squared value 0.0564 (see Table 5). This indicates that correlations between the lifespans of spouses exist, supporting the claims for the existence of the “widow effect.” We hope to confirm this observation in a future study by taking a closer look at the time intervals between the deaths of married couples. Using simple linear regression models, we also identified small significant correlations between longevity and reproductive success. Namely, we discovered negligible negative correlation between females and their number of children (R-squared = 0.0006), and negligible positive correlation between males and their number of children (R-squared = 0.0002). Fourthly, using multiple linear regression models, we were able to construct models which can predict a person’s age of death using various features that were extracted from the WikiTree social network directed multigraph. Our models presented a low P-value of 2.2 · 10−16 with Adjusted R-squared of up to 0.085 (see Table 6), indicating that the

21 extracted features can indeed assist in predicting a person Age-of-Death based on data which was extracted from the WikiTree dataset. However, the relative low Adjusted R- squared values indicate that other external factors are also responsible for influencing an individual’s lifespan. We hope to test these assumptions in future studies by merging WikiTree genealogy datasets with other datasets that contain additional information about individuals’ habits and lifestyles. Fifthly, our machine learning classifiers presented better than random performances, with AUCs up to 0.632 (see Table 7), in identifying which people who were born in the United States and outlived the age of 50 would also outlive the age of 80. These results support our previous claims that the data collected from genealogy datasets can be utilized to predict a person’s lifespan. Additionally, the information gain algorithm results revealed that the Max-Sibling-Age-of-Death, Avg-Sibling-Age-of-Death, and Max- Spouse-Age-of-Death (see Section 4.3) were among the most useful features. These results also indicate that a correlation exists both between spouses’ lifespans and between siblings’ lifespans. In our future studies, we hope to use similar techniques to predict other personal attributes based on data collected from online genealogy datasets. The study presented here is among the first of its kind and offers many future research directions to pursue. One possible research direction is to analyze not only the structured data which appear in the WikiTree profile pages, but also to use Natural Language Pro- cessing (NLP) algorithms to analyze content data which appear in these pages. Another possible research direction is to compare the results presented in this study from the Wik- iTree dataset to other online genealogy datasets, such as FamiLinx,13 which is publicly available and contains information from Geni.com,14 a genealogy-driven social network. Additionally, in future work, we plan to utilize the results obtained through this study on the population of United States and test them on various populations in other countries.

Acknowledgment

The authors would like to thank Carol Teegarden for her editing expertise and helpful advice.

References

[1] Y. Altshuler, N. Aharony, M. Fire, Y. Elovici, and A. S. Pentland. Incremental learning with accuracy prediction of social and individual properties from mobile- phone data. In Privacy, Security, Risk and Trust (PASSAT), 2012 International Conference on and 2012 International Confernece on Social Computing (SocialCom), pages 969–974. IEEE, 2012.

[2] Ancestry.com. Ancestry.com inc. reports q3 2012 financial results. 2013. (last accessed on November 2th, 2013).

[3] A. P. Bradley. The use of the area under the roc curve in the evaluation of machine learning algorithms. Pattern recognition, 30(7):1145–1159, 1997.

[4] N. A. Christakis and J. H. Fowler. The spread of obesity in a large social network over 32 years. New England journal of medicine, 357(4):370–379, 2007.

13http://erlichlab.wi.mit.edu/familinx/data.html 14http://www.geni.com/

22 [5] D. Eastman. Wikitree reaches five million profiles. http://blog.eogn.com/eastmans_online_genealogy/2013/04/ -reaches-five-million-profiles.html, 2013. (last accessed on November 2th, 2013).

[6] F. Elwert and N. A. Christakis. The effect of widowhood on mortality by the causes of death of both spouses. Journal Information, 98(11), 2008.

[7] M. Fire, G. Katz, Y. Elovici, B. Shapira, and L. Rokach. Predicting student exam’s scores by analyzing social network data. In Active Media Technology, pages 584–595. Springer, 2012.

[8] M. Gögele, C. Pattaro, C. Fuchsberger, C. Minelli, P. P. Pramstaller, and M. Wjst. Heritability analysis of life span in a semi-isolated population followed across four centuries reveals the presence of pleiotropy between life span and reproduction. The Journals of Gerontology Series A: Biological Sciences and Medical Sciences, 66(1):26– 37, 2011.

[9] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The weka data mining software: an update. SIGKDD Explor. Newsl., 11:10–18, November 2009.

[10] J. Knowles. Myheritage hits 1 billion profiles and announces new features for historical research. 2012. (last accessed on November 2th, 2013).

[11] É. Le Bourg. Does reproduction decrease longevity in human beings? Ageing research reviews, 6(2):141–149, 2007.

[12] P. Martikainen and T. Valkonen. Mortality after the death of a spouse: rates and causes of death in a large finnish cohort. American Journal of Public Health, 86(8_Pt_1):1087–1093, 1996.

[13] P. F. McArdle, T. I. Pollin, J. R. O’Connell, J. D. Sorkin, R. Agarwala, A. A. Schäffer, E. A. Streeten, T. M. King, A. R. Shuldiner, and B. D. Mitchell. Does having children extend life span? a genealogical study of parity and longevity in the amish. The Jour- nals of Gerontology Series A: Biological Sciences and Medical Sciences, 61(2):190–195, 2006.

[14] B. D. Mitchell, W.-C. Hsueh, T. M. King, T. I. Pollin, J. Sorkin, R. Agarwala, A. A. SchaÈffer, and A. R. Shuldiner. Heritability of life span in the old order amish. American journal of medical genetics, 102(4):346–352, 2001.

[15] N. H. Murdock. Teenage pregnancy. Journal of the National Medical Association, 90(3):135, 1998.

[16] MyHeritage. Myheritage members map. http://www.myheritage.com/ -member-map, 2013. (last accessed on November 2th, 2013).

[17] O. of National Statistics. Mortality in england and wales: Average life span. http: //www.ons.gov.uk/ons/dcp171776_292196.pdf, 2012. (last accessed on Nov. 13th, 2013).

[18] C. M. Parkes, B. Benjamin, and R. G. Fitzgerald. Broken heart: a statistical study of increased mortality among widowers. British Medical Journal, 1(5646):740, 1969.

23 [19] R. D. C. Team et al. R: A language and environment for statistical computing, 2005.

[20] F. Thomas, A. Teriokhin, F. Renaud, T. De Meeûs, and J.-F. Guégan. Human longevity at the cost of reproductive success: evidence from global data. Journal of Evolutionary Biology, 13:409–414, 2000.

[21] M. E. Wegman. Infant mortality in the 20th century, dramatic but uneven progress. The Journal of Nutrition, 131(2):401S–408S, 2001.

[22] R. G. Westendorp and T. B. Kirkwood. Human longevity at the cost of reproductive success. Nature, 396(6713):743–746, 1998.

[23] C. R. WHITNEY. Jeanne calment, world’s elder, dies at 122. New York Times, 1997.

[24] Wikipedia. Wikipedia:statistics. http://en.wikipedia.org/wiki/Wikipedia: Statistics, 2013. (last accessed on November 2th, 2013).

[25] WikiTree. About wikitree. http://www.wikitree.com/wiki/About_WikiTree, 2013. (last accessed on November 2th, 2013).

[26] WikiTree. Trusted list. http://www.wikitree.com/wiki/Trusted_List, 2013. (last accessed on November 2th, 2013).

[27] WikiTree. Wikitree manager. http://www.wikitree.com/wiki/Profile_Manager, 2013. (last accessed on November 2th, 2013).

[28] WikiTree. Wikitree privacy. http://www.wikitree.com/wiki/Privacy, 2013. (last accessed on November 2th, 2013).

24