<<

Pantheon 1.0, a manually verified dataset of globally famous biographies

Authors Amy Zhao Yu1, Shahar Ronen1, Kevin Hu1, Tiffany Lu1, César A. Hidalgo1

Affiliations 1. Macro Connections, MIT Media Lab Corresponding author(s): Amy Zhao Yu ([email protected]), César A. Hidalgo ([email protected])

Abstract

We present the Pantheon 1.0 dataset: a manually verified dataset of individuals that have transcended linguistic, temporal, and geographic boundaries. The Pantheon 1.0 dataset includes the 11,341 biographies present in more than 25 languages in Wikipedia and is enriched with: (i) manually verified demographic information (place and date of birth, gender) (ii) a taxonomy of occupations classifying each biography at three levels of aggregation and (iii) two measures of global popularity including the of languages in which a biography is present in Wikipedia (L), and the Historical Popularity Index (HPI) a metric that combines information on L, time since birth, and page-views (2008- 2013). We compare the Pantheon 1.0 dataset to data from the 2003 book, Human Accomplishments, and also to external measures of accomplishment in individual games and sports: , , Car Racing, and . In all of these cases we find that measures of popularity (L and HPI) correlate highly with individual accomplishment, suggesting that measures of global popularity proxy the historical impact of individuals.

1 Background & Summary

In this paper, we present a dataset of the biographies of globally famous individuals that can be used to study the production and diffusion of the types of human generated information that is expressed in biographical data by linguistic groups, geographic location, and time period. Biographies allow us to capture people that have either produced a creative oeuvre—such as William Shakespeare and Leonardo da Vinci— or those who have contribute to well known historically events, such as George Washington, a key general in the American Revolution, or , a key player in 's 1986 World Cup championship.

The Pantheon 1.0 dataset connects occupations, places of births, and dates, helping us create reproducible quantitative measures of the popularity of biographical records that we can use to proxy historical information. This data, therefore, enables researchers to explore the role of polyglots in the global dissemination of human generated information 1, the gender inequality and biases present in online historical information 2, the occupations associated with the producers of historical information, and the breaks produced by communication technologies in the production and dissemination of information by humans 3, 4.

Previous efforts

Past efforts to quantify historical information include Charles Murray’s Human Accomplishments book, which contributed an inventory of 3,869 significant individuals within the domains of arts and sciences 5; the digitized text study self-branded by its authors as Culturomics 6; efforts focused on structuring Wikipedia data 7 and quantifying the impact of individuals across a more diverse set of occupations 8, 9. Most efforts, however, have looked only at the popularity of individuals in a few languages (predominantly in English) and lack a classification of domains of the contributions that can be used to categorize the areas of historical impact of an individual. This categorization is an essential contribution of our datasets, since without it, it is not possible to study the types of information generated at different time periods and in different geographies.

Table 1 provides a non-exhaustive comparison between various datasets constructed to quantify historical information and the Pantheon 1.0 dataset.

2 Methods

Data Collection Ideally, we would want to quantify historical information by using data that summarizes information produced by people in all languages and that includes all forms of historical information, from biographical data, to the characters created by authors in works of fiction (i.e. Mickey Mouse), and the artifacts and constructions that people create. Since no such dataset exists, we create a simpler dataset focused only on biographical information by using data from Freebase and 277 language editions of Wikipedia. Both Freebase and Wikipedia are open-source, collaborative, multi-lingual knowledge bases freely available online to the general public.

We note that previous efforts have produced structured datasets on biographical records based on Wikipedia 7, 8, but not considering all language editions (using only English), and they have not manually verified time periods and geographies, or introduced a controlled taxonomy of the occupations associated with each biography. While there are certainly considerable limitations to Wikipedia and Freebase, they are currently the largest available domain-independent repositories of collaboratively edited human knowledge, and past research has demonstrated the reliability of these collaborative knowledge bases 10, 11. We note that we also evaluated Wikidata (http://www.wikidata.org/), the repository for structured data associated with Wikimedia projects, as a possible data source. However, at the time of data collection (2012-2013), this initiative was in its first year of development, and had not yet accumulated a database as robust as what was available within Freebase.

Figure 1 summarizes the main components of the workflow used to create the Pantheon dataset. We derive our dataset of historical biographical information from Freebase’s entity knowledge graph (https://developers.google.com/freebase/data) and add metadata from Wikipedia accessible through its API. Freebase organizes information as uniquely identified entities with associated types and properties defined by a structured, but uncontrolled, data ontology. Therefore, to identify globally known biographies, we first determined a list of 2,394,169 individuals through Freebase’s database of all entities classified as Persons. Next, we linked individuals to their English Wikipedia page using their unique Wikipedia article id, and from there we obtained information about additional language editions using the Wikipedia API as of May 2013, narrowing the set to the 997,276 individuals that had a presence in Wikipedia. We later supplemented the data with monthly page view data for all language editions from the Wikipedia data dumps for page views for each individual from Jan. 2008 through Dec. 2013.

The Pantheon 1.0 dataset is restricted to the 11,341 biographies with a presence in more than 25 different languages in Wikipedia (L>25). The choice of the L>25

3 threshold is guided by a combination of criteria, based on the structure of the data and the limits of manual data verification. Figure 2 shows the cumulative distribution of biographies on a semi-log plot, as a function of the number of languages in which each of these biographies has a presence. Most of the 997,276 biographies surveyed have a presence in a few languages, such that the L>25 threshold is a high mark that can help filter the most visible of these biographies. For example, a sampling of the individuals above the L>25 threshold includes globally known individuals such as Charles Darwin, Che Guevara, and Nefertiti. Below the threshold, we find individuals that are locally famous – such as Heather Fargo, who is the former Mayor of Sacramento, . Also, 95% of individuals passing this threshold have an article in at least 6 of the top 10 spoken languages worldwide (Top 10 spoken languages by number of speakers worldwide: Chinese, English, Hindi, Spanish, Russian, Arabic, Portuguese, Bengali, French, Bahasa – see: http://meta.wikimedia.org/wiki/Top_Ten_Wikipedias), demonstrating that the Pantheon dataset has good coverage of non-Western languages.

Taxonomy Design

Since no globally standardized classification system currently exists to classify biographies to occupations, we introduce a new taxonomy connecting biographies to occupations. Following best practices of taxonomy creation from information science 12, we derive a controlled vocabulary from the raw data and design a classification hierarchy allowing three levels of aggregation. For simplicity we call the most disaggregate level of the taxonomy “occupations,” and its aggregation in increasing order of coarseness “industries” and “domains.” We note that we use these terms simply to facilitate the communication of the level of aggregation we are referring to. The design and verification process was led by the authors, with support from a multidisciplinary research team with domain expertise in a wide variety of fields, including economics, computer science, physics, design, history, and geography. Figure 3 shows the entire occupation taxonomy, with detail on the all three levels of the classification hierarchy.

To create this taxonomy, we use raw data on individual occupations from Freebase to create a normalized listing of occupations – for example, we map “Entrepreneur”, “Business magnate”, and “Business development” to the normalized occupation of “Businessperson”. We grouped normalized occupations into a second-tier classification (called industries), and top-level occupations. We associate individuals within the dataset to a single occupation based on the occupation that best encompasses their primary area of occupation. Thus, we explicitly choose to create a taxonomy—a hierarchical classification that maps each biography to a single category—rather than an ontology—a network that connects biographies to multiple categories—for both technical and historical reasons. The assignment of a biography to multiple categories is troublesome because it requires weighing the multiple classifications that are associated to

4 each biography and defining a threshold for when to stop counting. In some cases the weighing is relatively straightforward. For instance, Shaquille O’Neal should be classified as a basketball player first, and then as an actor, given his 22 actor credits on IMDb. But should we also consider O’Neal a singer (he has released several hip-hop albums), a producer, or a director (he directed a single episode of a little known TV series named Cousin Skeeter)? On a similar note, should we classify Angela Merkel as a physicist and Margaret Thatcher as a chemist (as their respective diplomas indicate), although their historical impact definitely comes from their work in politics? Because of the difficulties involved in defining what categories to consider when assigning an individual to multiple categories, we used a more pragmatic approach and assigned each individual to the category that corresponds to his or her claim to fame: thus, we assign O’Neal to the basketball player category, and assign Merkel and Thatcher to the politician category). By normalizing the data to a controlled vocabulary and using a nested classification system, we provide a consistent mapping for individuals to occupations across time, and enable users of this data to perform analysis at several levels of aggregation while avoiding double counting. Yet we understand that we also introduce the limitation of restricting the contribution of polymaths to one singular domain. The challenge of fairly distributing the historical impact of polymaths will be left for future consideration.

In terms of location assignment, we attribute individuals to a place of birth by country, based on current political boundaries. We use present day political boundaries because of the lack of a historical geocoding API to attribute geographic boundaries using latitude, longitude, and time. Birthplaces were obtained by scraping both Freebase and Wikipedia, and further refined by using fuzzy location matching and geocoding within the Yahoo Placemaker (http://developer.yahoo.com/geo/placemaker/) and Google Maps geocoding (https://developers.google.com/maps/documentation/geocoding/) APIs, and by manual verification. The dataset includes the raw data on individual birthplaces, as well as the cleaned country, which is derived using various APIs that allow us to attribute locations by modern-day country boundaries. To map birthplaces to countries, we normalize the raw data from Freebase indicating the city of birth by latitude and longitude using fuzzy location matching available within the geocoding APIs. Using the coordinates obtained through the APIs, individuals are then mapped to countries based on present-day geographic boundaries using the reverse geocoding API available on geonames.org. For example, individuals born in during the era are associated with . Using present-day boundaries allows for a consistent basis for matching individuals to countries, and mitigates the technological limitation of the lack of existing historical geocoding APIs for attributing geographic boundaries using latitude, longitude, and time. Historically, birthplace is a fairly suitable way of associating individuals to countries, however, given the increase of human mobility over time 9 and the net migration gains experienced by developed regions 13, future refinement of the

5 dataset may include consideration for improving the attribution of individuals to the geographies he inhabited across his life.

Visibility Metrics

We introduce metrics of popularity that help us capture the relative visibility of each biography in our dataset. The fame, or visibility, of historical characters is estimated using two measures. The simpler of the two measures, denoted as L, is the number of different Wikipedia language editions that have an article about a historical character. The documentation of an individual in multiple languages is a good first approximation for their global fame because it points to individuals associated with accomplishments or events that have been noted globally. The use of languages as a criterion for inclusion in our dataset helps us differentiate between biographies that are globally famous and locally famous.

We also introduce the Historical Popularity Index (HPI), a more nuanced metric for global historical impact that takes into account the following: the individual’s age in the dataset (A), or the time elapsed since his/her birth, calculated as 2013 minus birthyear; an L* measure that adjusts L by accounting for the concentration of pageviews among different languages (to discount characters with pageviews mostly in a few languages, see Equation 1); the coefficient of variation (CV) in pageviews across time (to discount characters that have short periods of popularity); and the number of non-English Wikipedia pageviews (vNE) to further reduce any English bias. In addition, to dampen the recency bias of the data, HPI is adjusted for individuals known for less than 70 years. Equation 4 provides the full formula for HPI. There we use log based 4 for the age variable in the aggregation to avoid age becoming the dominant factor in HPI (as it would if we would have used natural log).

For each biography i, we define:

Li = Number of different languages editions of Wikipedia for biography i * = Effective number of language editions for biography i * Li = exp(Hi ) (1) where Hi is the entropy in terms of Page Views

�!" �!" �! = − ln (2) ! �!" ! �!" !

and vij = total page views of individual i in language j

Ai = 2013 – Year of Birth

6 CV = Coefficient of variation in page views

σ i CVi = (3) µi

σ i = standard deviation in pageviews across all languages

µi = average monthly pageviews vNE = total pageviews in non-English editions of Wikipedia

Using the above, the Historical Popularity Index (HPI) of an individual, i, is defined as:

∗ �� �� � + �� � + ���� � + �� � − �� �� �� � ≥ �� ��� = ��!� (4) �� � + �� �∗ + ��� � + �� ��� − �� �� − �� � < �� � �

Table 2 shows the ten people with the highest L and HPI, respectively, for a few selected periods. An individual is assigned to a period according to his or her date of birth. Here we see that the most notable biographies for each period are associated primarily with well-known historical characters.

Biases & Limitations

As with all large data collection efforts, Pantheon is coupled with limitations and biases, which should be considered carefully when interpreting the dataset. This dataset should be interpreted narrowly, as a view of historical information that emerges from the multilingual expression of historical figures in Wikipedia as of May 2013. The main biases and limitations of the dataset come from:

1. The use of Wikipedia as a data source. 2. The use of place of birth to assign locations. 3. The use of biographies as proxies for historical information. 4. Other technical limitations.

1. The use of Wikipedia as a data source The data is limited by the set of people who contribute to Wikipedia. Wikipedia editors are not considered to be a representative sample of the world population, but a sample of publicly-minded knowledge specialists that are willing and able to dedicate time and effort to contribute to the online documentation of knowledge. Wikipedia editors have an English Bias, a Western Bias, a gender bias towards males, and they tend to be highly educated and technically inclined. They are also more prevalent among developed countries with Internet access. Wikipedia also has a considerable bias in the inclusion of people from different categories. This bias could be the result of the differences in the notability criteria in Wikipedia for biographies from different domains, or from systematic biases within the Wikipedia community 14. Finally, Wikipedia also has a recency bias, since

7 current events and contemporary individuals typically have greater prominence in the minds of Wikipedia contributors than events from the past 15, 16.

By using data from all Wikipedia language editions we are effectively reducing a bias that would favor information that is locally famous among English speakers. As an example, we note that there is only one American Football Player in the dataset: O.J. Simpson. Certainly, his global notoriety is not purely from his football career, showing that the use of many languages reduces the English bias of the dataset (famous American Football players, such as Peyton Manning, Tom Brady and Joe Montana all have a large presence in the English Wikipedia, but fail to meet the L>25 threshold). In comparison, the dataset contains over 1,000 soccer players – showing that soccer is a sport that is globally popular.

2. The use of place of birth to assign locations Individuals were assigned to geographic locations using their place of birth, based on present-day political boundaries. Country assignments were complemented with geocoding APIs for normalization and manual verification (to correct for errors in API and completeness). Place of birth is one way of assigning a location to an individual that allow us to assign locations in a comprehensive and consistent manner. Yet, there are biases and limitations that need to be considered when using this location assignment method. An important limitation is the inability to account for individuals who became globally known after immigrating to another country. Would Neruda, Picasso or Hemingway be as famous if they had not participated of the Parisian art scene? The place where an individual was born may differ from the place where that individual made his or her more important contributions. In some cases, the contributions are made in a number of different places, and the use of birthplace is unable to capture where the contributions were made. This is particularly true for athletes who migrate to the world’s most competitive leagues, or artists that move to the artistic centers of their time. In this dataset, such individuals are not represented since programmatically geo-coding birthplaces is more consistent than registering the place where each individual made his or her more significant contribution, which can only be found through the unstructured data buried in historical narratives.

3. Limitations in the use of biographies as proxies for historical information

The use of biographies to proxy historical information allows us to connect information with a linguistic group, geographic location, occupation, and time period. Some biographies involve people that produced an oeuvre directly, like Mozart or Michelangelo, but others reflect important historical events. So, biographies help us capture historical information in a broad sense because they are not limited only to the biographies of those who produced an oeuvre, but because they also include individuals

8 who have inspired documentation by having participated in events that punctuate the history of our species.

The use of biographies as proxies for historical information, however, has important shortcomings. Biographies may fail to capture information on works or events where the participation of groups trumps that of individuals. For example, consider collective enterprises where the accomplishments are the results of teams and not isolated individuals. Examples of accomplishments that are likely to get excluded include the works of music bands or orchestras, or the products produced by a firm, where the accolades collected from accomplishments are connected to a firm, or brand, rather than to an individual. Also, biographies are a proxy of historical information that is biased against works and events that did not result in the widespread fame of their main actors or creators. Moreover, the global popularity of biographies is known to be biased towards the languages that are more central in the global network of translations 1, biasing the estimates of historical information derived from biographical data to the information produced by the speakers of the world’s most connected languages.

4. Other Technical Limitations

Other biases and limitations include the volatility of Wikipedia and other online resources, which make the results presented here imperfectly reproducible. For example, the Yahoo Placemaker API, which was used for mapping individuals to countries by birthplace, has been deprecated and is no longer publicly available. Also, Freebase will also be retired as of June 2015, and while there are plans to transfer the data to Wikidata, at the time of writing the future availability of Freebase data is undetermined. Finally, the set of included individuals in the Pantheon 1.0 dataset is static and does not reflect events after early 2013 - as such, individuals who only recently rose to global prominence, including Pope Francis and Narendra Modi, are excluded from this dataset.

Data Records

The Pantheon dataset is publicly available on the Harvard Dataverse Network and can be accessed directly at: https://dataverse.harvard.edu/dataverse/pantheon. The dataset is visualized at http://pantheon.media.mit.edu, a data visualization engine that allows users to dynamically explore the dataset through interactive visualizations.

The data consists of three files – pantheon.tsv, wikilangs.tsv, and pageviews_2008- 2013.tsv (Data Citation 1).

The first file, pantheon.tsv, is a flattened tab-limited table, where each row of the table represents a unique biography. Each row contains the following variable fields:

9

• name – name of the historical character (in English) • en_curid – unique identifier for each individual biography, maps to the pageid from Wikipedia. To map to an individual’s biography in Wikipedia, use the en_curid field as an input parameter to the following URL: http://en.wikipedia.org/?curid=[en_curid]. We use the English curid as the unique identifier in the Pantheon dataset; we confirmed that all biographies with L > 25 as of May 2013 had an entry in the English Wikipedia. • countryCode- ISO 3166-1 alpha2 (based on present-day political boundaries) • countryCode3- ISO 3166-1 alpha3 country code (based on present-day political boundaries) • countryName – commonly accepted name of country • continentName – name of continent • birthyear – birthyear of individual • birthcity – given birthcity of individual • occupation – occupation of the individual • industry – category based on an aggregation of related occupations • domain – category based on an aggregation of related industries • gender – male or female • TotalPageViews – total pageviews across all Wikipedia language editions (January 2008 through December 2013) • L_star – adjusted L (see Appendix for calculation) • numlangs – number of Wikipedia language editions that each biography has a presence in (as of May 2013) • StdPageViews – standard deviation of pageviews across time (January 2008 through December 2013) • PageViewsEnglish – total pageviews in the English Wikipedia (January 2008 through December 2013) • PageViewsNonEnglish – total pageviews in all Wikipedias except English (January 2008 through December 2013) • AverageViews – Average pageviews per language (January 2008 through December 2013) • HPI – Historical Popularity Index (see Equation 4)

The second file, wikilangs.tsv, is a tab-delimited table of all the different Wikipedia language editions that each biography has a presence in. Each row of the table contains the following variables: • en_curid – unique identifier for each individual biography • lang – Wikipedia language code

10 • name – name in the language specified.

To link to the other editions of Wikipedia, use the lang and name parameters in the following URL: http://[lang].wikipedia.org/wiki/[name]

The third file, pageviews_2008-2013.tsv contains the monthly pageview data for each individual, for all the Wikipedia language editions in which they have a presence. Each row of this table includes the following variables: • en_curid – unique identifier for each individual biography • lang – Wikipedia language code • name – English name • numlangs – total number of Wikipedia language editions • countryCode3- ISO 3166-1 alpha3 country code (based on present-day political boundaries) • birthyear – birthyear of individual • birthcity – given birthcity of individual • occupation – occupation of the individual • industry – category based on an aggregation of related occupations • domain – category based on an aggregation of related industries • gender – male or female • 2008-01 through 2013-12 – total pageviews for the given month (denoted by the column header)

Technical Validation

Comparison with Human Accomplishments dataset

We compare the Pantheon 1.0 dataset with the Human Accomplishments (HA) dataset, an independent compilation of 3,869 notable people in the arts and sciences from 800BC to AD 1950 5. Unlike Pantheon, HA is based on printed encyclopedias and not online sources, but like Pantheon, HA values the presence of a biography in resources in multiple languages. Since HA is restricted to the arts and sciences domains, it does not include politicians like Julius Caesar, religious figures like Jesus, racecar drivers like or chess grandmasters like Gary Kasparov. Nevertheless, we find that our data overlaps significantly with the Human Accomplishment data. The Pantheon dataset contains 1,570 (40%) of the entries available in the Human Accomplishment dataset. The HA dataset is more regionally focused than Pantheon, and we find that many of the individuals in the HA dataset are more locally impactful in their respective geographies, and hence, have a presence in fewer languages in Wikipedia. If we lower the threshold of

11 the Pantheon dataset to include biographies existing in 10 or more languages (L ≥ 10) we would find an overlap of 2,878 biographies, or 74% of the HA dataset.

We also compare the assignment of individuals to their respective occupations within the Pantheon 1.0 dataset to the inventories within the HA dataset. We note that the HA dataset is based on five inventories (art, science, literature, philosophy, and music). From these inventories only the science inventory is disaggregated into additional fields (Chemistry, Biology, Mathematics, Technology, Astronomy, Medicine, Earth Sciences, Physics, and Science—for scientists that do not fit in any of these fields). The smaller number of categories in HA vis-à-vis Pantheon (13 versus 88) means that we cannot create a one-to-one mapping between both categorization systems. Nevertheless, we map each of the HA categories to its most appropriate counterpart in the Pantheon taxonomy. For instance, we map the individuals in the “Medicine” field in the HA dataset to the “Medicine” industry from the Pantheon 1.0 taxonomy, the individuals in the “Chemistry” field from HA to the “Chemist” occupation in the Pantheon taxonomy, and the individuals in the “Literature” inventory to the “Language” industry in the Pantheon 1.0 taxonomy. We find that there is an 84% agreement when comparing the assignment of occupation and industries, and a 95% overall agreement between the datasets when we consider the coarser occupations. Some examples of discrepancies involve Vladimir Lenin, who is categorized as a philosopher in HA but as a politician in Pantheon, and the photographer Ansel Adams, who is categorized as a scientist in HA but is categorized as a Photographer in Pantheon (within the Fine Arts industry and the Arts domain).

Moreover, we find a positive and significant correlation between the measures of historical impact advanced in both of these datasets. HA gives individuals a relative score that measures their impact on their respective domain. Figure 4 shows the correlation between the measures of historical impact in both Pantheon and HA. The historical impact measures in Pantheon correlate with the number of language editions in Wikipedia (L) with an R2=18% (p-value<2x10-70) and with the HPI index with an R2=12% (p- value=1.6x10-44). Note that unlike Pantheon, HA may classify an individual into multiple domains, with a different score for each one: e.g., Galileo Galilei is classified as an astronomer (with a score of 100 that puts him as the most influential astronomer of all time) as well a physicist (ranking fifth with a score of 83).

Comparison with External Measures of Accomplishment

Following an approach similar to the one used in Human Accomplishment 5 we also compare measures of individual accomplishments with the Pantheon dataset. Unfortunately, many occupations are not characterized by external metrics of accomplishment that we can associate to individuals, so we restrict our comparison to occupations where measures of individual accomplishment are available – namely,

12 individual sports. The achievements of individual sportsmen and women can be quantitatively expressed through measures such as number of championship titles won or points scored. Here, we focus on Formula-1 drivers, tennis players, swimmers and chess players as independent case studies that we can use to compare our metrics of global visibility.

1. Racecar Drivers

First we examine the subset of the dataset containing the top 56 Formula-1 drivers, according to the number of languages in which they have a presence in Wikipedia. For each of these drivers we created an additional dataset with the number of Grand Prix Wins, Championships Won, Podiums (number of times in the top 3), Starts, and a dummy variable for Killed in Action (dummy variables are variables used as statistical controls that take values of zero and one). These variables are used to construct a statistical model explaining the multilingual presence of each driver within Wikipedia as well as each driver’s Historical Popularity Index. Since Grand Prix Wins, Championships and Podiums are highly collinear—and hence not statistically significant when used together—only Podiums are used in the final model. Since neither L nor HPI can be negative, we link the fame of biographies to the aforementioned variables using an exponential function of the form: � = ��!!!!�!!!!�!!!!

where x1 is the number of podiums, x2 is number of starts, and x3 is an indicator for whether the individual is killed in action.

The first model in Figure 5a explains 54% of the variance in the number of languages in which each Formula-1 driver has a presence in the Wikipedia, showing that for Formula-1 drivers the number of languages in the Wikipedia accurately tracks accomplishments discounted by time. In contrast, when analyzing the same variables with the Historical Popularity Index, we find a model (see Figure 5b) that explains 68% of the variance in the Historical Popularity Index for each Formula-1 driver. The improved fit suggests that the corrections introduced by HPI enhances the L metric and contributes an improved characterization of accomplishment for this sample of individuals.

2. Tennis Players Next, we conduct a similar analysis for Tennis Players. The Tennis player subset focuses on the top 52 Tennis players according to the number of languages in the Wikipedia and augmented by additional data on each individual - the number of weeks he/she spent as number one in the ATP or WTA, the number of wins, the top rank ever obtained, and the player’s gender (Female = 1, Male = 0). We link the

13 fame of biographies for Tennis Players to the aforementioned variables using an exponential function of the form:

� = ��!!!!�!!!!�!!!!�!!!!

where x1 is the number of weeks at the number one, x2 is the number of Grand Slam wins, x3 is highest rank obtained, and x4 is the variable for gender.

For the number of language presences in Wikipedia (L), we construct a model which explains 34% of the variance in the multilingual presence of each of these individuals in the Wikipedia (Figure 6a). This shows that once again, the number of languages in Wikipedia is a good proxy for individual accomplishments. When we considered HPI, we find an improved model that explains 63% of the variation in HPI (Figure 6b). This further supports the use of HPI as an appropriate proxy for accomplishment, since HPI tracks the degree of achievement for tennis players better than L.

3. Swimmers We also perform a similar analysis considering Olympic swimmers born after 1950 (n=19). In this case, the model uses the total number of gold medals and gender. We link the fame of swimmers to these variables using an exponential function of the form:

� = ��!!!!�!!!!

where x1 is the number of gold medals, and x2 indicates gender.

In Figure 7a, the model explains 74% of the variance observed in the total number of languages that a swimmer has a presence in Wikipedia, demonstrating that this measure is a good proxy for measuring accomplishment for swimmers. When we perform the analysis for Historical Popularity Index, we find that the model explains 50% of the variance observed in the HPI for swimmers. Figure 7b shows the second model, which shows that HPI is also an appropriate proxy for quantifying accomplishment for swimmers, although in this case, HPI is not superior to L.

4. Chess Players Finally, we perform another analysis using all of the 30 individuals classified as chess players in the Pantheon dataset. In this case, we use data on each individual’s highest ELO ranking attained, gender, total games played, and percentage of wins, losses, and draws. We link the fame of chess players to these variables using an exponential function of the form: � = ��!!!!�!!!!�!!!!�!!!!�!!!!�!!!!

14

where x1 is the highest ELO ranking attained, x2 indicates gender, x3 is the total games played, x4 is the percentage of wins, x5 is the percentage of losses, and x6 is the percentage of draws.

For the number of language presences in Wikipedia (L), we construct a model that explains 37% of the variance in the multilingual presence of each of these individuals in the Wikipedia (Figure 8a). This further supports using the number of languages in Wikipedia as a proxy for individual accomplishments. Using HPI (Figure 8b), we find a model that explains 53% of the variation in HPI—demonstrating that HPI is an appropriate proxy for accomplishment, with an improved fit for tracking an individual’s achievements.

Discussion We introduced a dataset on historical impact that can be used to study spatial and temporal variations in historical information based on biographies that have a presence in more than 25 language editions of Wikipedia. This manually verified dataset allowed us to link historical works and events to places and time. To distinguish between biographies with different levels of visibility we introduce two measures of historical impact: the number of languages in which an individual has a presence in Wikipedia (L), and the Historical Popularity Index (HPI). We compared the Pantheon dataset by comparing it against the Human Accomplishments dataset and also compared our measures of global fame and visibility using external data on the accomplishments of Formula One racecar drivers, tennis players, swimmers, and chess players. In all these cases we find a good match between L, HPI, and the external measures of accomplishment, demonstrating that the measures developed within Pantheon correlate, for these particular occupations, with historical accomplishments. While these case studies are not exhaustive across all occupations, they show that the measures introduced are effective metrics for characterizing historical information across diverse sets of domains, time, and geography. Consider a Formula One racecar driver. Certainly, for a Formula One racer the number of Grand Prix won, or Championships, would be a better metric of accomplishment than the number of languages in Wikipedia. Yet, since Grand Prix won is a metric that applies only for Formula-1 drivers, it cannot be used for basketball players, swimmers, musicians or scientists. While imperfect, the measures based on the online presence of characters in diverse languages are appropriate proxies accomplishment and provide metrics that we can use to compare individuals from different occupations.

Usage Notes

The Pantheon 1.0 dataset enables quantitative analysis of historical information and already has demonstrated application in testing hypotheses related to the role of

15 polyglots in the global dissemination of information 1, and the extent of online gender inequality and biases 2. For future analysis, this dataset can motivate a number of potential areas of research investigating the dynamics of historical information across temporal and spatial dimensions. For example, the data can be used in connection with other datasets to empirically assess the connections between economic flourishing and historical information, the dynamics of fame across different domains and geographies, and the dynamics of our species’ collective memory.

The data is provided as flat files in tab-separated format, and no additional pre- processing is necessary for users to import the files into a scientific computing environment. A wide variety of software tools for data visualization and numerical analysis can be used to explore the dataset, including MATLAB, R, the SciPy stack, d3, d3plus, etc. In addition, the data includes a number of fields that can be linked to external datasets, such as standardized country and language codes, and unique individual ids from Wikipedia. We emphasize that future results should be interpreted within the narrow context of the dataset documented, and that analyses of the dataset should include consideration of its bias and limitations.

Acknowledgements We are grateful to Defne Gurel, Francine Loza, Daniel Smilkov, and Deepak Jagdish for their contributions and feedback during the data verification and cleaning process. We also want to thank Ethan Zuckerman and Alex Lex for their feedback and comments over the course of this project. This research is supported by funding from the MIT Media Lab Consortia and the Metaknowledge network at the University of .

Author Contributions AZY collected, verified, and compared the data, and wrote the manuscript. SR contributed to data collection, cleaning, and comparisons, and edited the manuscript. KH contributed to data cleaning, and supplemented the dataset with pageviews data. TL contributed to the initial scripts for scraping Freebase and Wikipedia. CAH conceived the study, verified & compared the data, and wrote the manuscript.

References 1. Ronen, S. et al., Links that speak: The global language network and its association with global fame. Proceedings of the National Academy of Sciences 111 (52) (2014). 2. Wagner, C., Garcia, D., Jadidi, M. & Strohmaier, M. It's a Man's Wikipedia? Assessing Gender Inequality in an Online Encyclopedia. arXiv arXiv:1501.06307 [cs.CY] (2015). 3. McLuhan, M., in Understanding Media: The Extensions of Man (1964). 4. Eisenstein, E. L., The printing press as an agent of change (Cambridge University Press, Cambridge, 1979).

16 5. Murray, C., Human Accomplishment (Harper Collins, New York, 2003). 6. Michel, J.-B. et al., Quantitative Analysis of Culture Using Millions of Digitized Books. Science 331 (176), 176-182 (2011). 7. Popescu, A. & Grefenstette, G., Spatiotemporal mapping of Wikipedia concepts. Proceedings of the 10th annual joint conference on Digital libraries, 129-138 (2010). 8. Skiena, S. & Ward, C., Who's Bigger? Where Historical Figures Really Rank (Cambridge University Press, Cambridge, 2013). 9. Schich, M. et al., A network framework of cultural history. Science 345 (6196), 558-562 (2014). 10. Giles, J., Internet encyclopedias go to head. Nature 438, 900-901 (2005). 11. Spinellis, D. & Louridas, P., The collaborative organization of knowledge. Communications of the ACM 51 (8), 68-73 (2008). 12. Hedden, H., Taxonomies and controlled vocabularies best practices for metadata. Journal of Digital Asset Management 6 (5), 279-284 (2010). 13. Abel, G. & Sander, N., Quantifying Global International Migration Flows. Science 343 (6178), 1520-22 (2014). 14. Ford, H., in Critical Point of View: A Wikipedia Reader (Institute of Network Cultures, Amsterdam, 2011), pp. 258-268. 15. Brown, A. R., Wikipedia as a Data Source for Political Scientists: Accuracy and Completeness of Coverage. PS: Political Science and Politics 44 (02), 339-343 (2011). 16. Royal, C. & Kapila, D., What's on Wikipedia, and What's Not.? Social Science Computer Review 27 (1), 138-148 (2009). 17. UNESCO, 2009 UNESCO Framework for Cultural Statistics (UNESCO Institute for Statistics, , 2009).

Data Citation Yu, A.Z., Ronen, S., Hu, K., Lu, T., Hidalgo, C. Harvard Dataverse http://dx.doi.org/10.7910/DVN/28201 (2014).

17 Figures

Figure 1: Pantheon Data Workflow

Raw Data Data Collection Cleaning & Verification - Geolocation cleaning: Use standard geocoding - Freebase: People Entity data - Record matching: Map Freebase APIs (GeoNames, Google Maps, Yahoo Placemaker) dump (people.tsv) people to Wikipedia records across to clean birth place locations all language editions - Wikipedia: language links, raw - Metrics: Pageview data data on individuals - Demographic data: Collect raw data on individuals’ date and - Taxonomy design: Create a controlled vocabulary place of birth, occupation, gender to normalize and classify occupations by domain.

- Data classification: Classify individual biographies by domain.

- Data verification: filter data by language threshold, adjust inaccurate data, fill in missing data

name: Albert Einstein id: /m/0jcx dob: 1879-03-144 place of birth: Ulm gender: Male profession: Professor,Mathematician, Scientist,Peace activist,Author, Pacifist,Philosopher,Writer,, Theoretical Physicist,PhysicistPhysicist wiki_id: 736 numlangs: 166 .....

2,392,169 997,276 11,341

Flow diagram detailing the data collection process for the Pantheon 1.0 (n=11,341). Inset image from pantheon.media.mit.edu.

18 Figure 2: Cumulative Number of Individuals with at least N Wikipedia Language Editions

Heather Fargo

1e+05

Tom Brady

11,340

Édith Piaf Che Guevara Leonardo Da Vinci 1e+03 Jesus Christ

Charles Darwin Barack Obama

Nefertiti

1e+01 Dmitri Mendeleev Björn Borg Marilyn Monroe log(Number of Individuals with at least N Wikipedia language editions)

025 50 100 150 200 Number of Wikipedia language editions

Cumulative distribution of biographies on a semi-log plot, as a function of the number of languages in which each of these biographies has a presence. Individual images from Wikimedia Commons.

19 Figure 3: Domain Taxonomy

Dance (12) Dancer (12) Architect (73) Design (127) Comic Artist (24) Designer (16) Fashion Designer (10) Game Designer (4) Film And Theatre (1374) Actor (1193) Arts (2866) Comedian (4) Film Director (177) Artist (88) Fine Arts (299) Painter (178) Photographer (12) Sculptor (21) Composer (225) Music (1054) Conductor (11) Musician (381) Singer (437) Business (91) Businessperson (79) Producer (12) Business & Law (108) Law (17) Lawyer (17) Explorers (102) Astronaut (32) Exploration (102) Explorer (70) History (48) Historian (48) Critic (5) Language (1000) Journalist (19) Humanities (1329) Linguist (21) Philosophy (281) Writer (955) Philosopher (281) Diplomat (36) Government (2704) Judge (9) Nobleman (116) Politician (2529) Public Worker (14) Institutions (3454) Military (232) Military Personnel (223) Pilot (9) Religion (518) Religious Figure (518) Activism (114) Social Activist (114) Companions (101) Companion (101) Celebrity (21) Chef (2) Public Figure (358) Media Personality (87) Magician (4) Model (30) Pornographic Actor (11) Presenter (19) Outlaws (56) Extremist (34) Mafioso (13) Pirate (9) Computer Science (34) Computer Scientist (34) Engineering (41) Engineer (41) Invention (67) Inventor (67) Math (161) Mathematician (157) Statistician (4) Medicine (143) Physician (143) Science & Technology (1368) Archaeologist (13) Astronomer (83) Natural Sciences (735) Biologist (141) Chemist (220) Geologist (10) Physicist (268) Anthropologist (11) Economist (102) Social Sciences (187) Geographer (14) Political Scientist (7) Psychologist (38) Sociologist (15) Athlete (74) Boxer (14) Chessmaster (30) Cyclist (29) Golfer (2) Gymnast (7) Individual Sports (526) Martial Arts (7) Mountaineer (5) Racecar Driver (104) Skater (9) Skier (17) Snooker (3) Swimmer (20) Sports (1756) Tennis Player (161) Wrestler (44) American Football Player (1) Baseball Player (5) Basketball Player (71) Team Sports (1230) Coach (75) Cricketer (2) Hockey Player (2) Referee (10) Soccer Player (1064)

From left to right: domain (i.e. Sports), industry (i.e. Team Sports) and occupation (i.e. Soccer Player)

20 Figure 4: Comparison Analysis of Human Accomplishments and Pantheon 1.0

name L MurrayIndex name L MurrayIndex Paul Valéry 38 19.00 Pablo Picasso 143 77.32 Arnold Sommerfeld 36 20.36 a. Marie Curie 130 40.78 Soseki Natsume 37 60.08 Max Planck 103 33.23 Frederick Hopkins 34 20.27 Sigmund Freud 134 34.46 4 Giuseppe Peano 40 23.78 Leo Tolstoy 123 40.53 Alexis Carrel 43 35.66 Fyodor Dostoyevsky 113 40.20 Hugo de Vries 31 43.66 Richard Wagner 116 79.55 Giovanni Schiaparelli 44 21.08 Dante Alighieri 137 62.50 James McNeill Whistler 35 19.93 Archimedes 129 51.14 Richard Dedekind 40 23.13 3 Thomas Aquinas 103 39.04

2 log(MurrayIndex) name L MurrayIndex name L MurrayIndex Alfred Jarry 32 3.35 Simone de Beauvoir 85 3.61 Henri Barbusse 35 1.97 1 Pablo Neruda 95 4.35 Heinrich Mann 39 4.25 George Orwell 101 2.52 Jules Bordet 39 1.66 Arthur Conan Doyle 90 2.19 Fritz Pregl 42 2.94 Victor Hugo 123 2.44 Paul Dukas 34 3.96 Blaise Pascal 103 6.03 Alexander Glazunov 36 3.51 Giordano Bruno 83 3.82 Carl Nielsen 35 3.20 0 Jorge Luis Borges 101 2.75 Max Bruch 35 2.02 Thales 84 6.07 Percival Lowell 32 4.39 Benjamin Franklin 88 3.73 3.5 4.0 4.5 5.0 log(numlangs) 2 R = 18%

name HPI MurrayIndex name HPI MurrayIndex William Shockley 23.85 16.35 Henri Matisse 27.39 38.30 Jacques Monod 23.46 15.05 b. Marie Curie 27.60 40.78 Edward Lawrie Tatum 22.70 16.69 Max Planck 27.32 33.23 Alfred Hershey 22.62 13.57 Sigmund Freud 29.63 34.46 George Gabriel Stokes 24.19 20.94 4 Vincent van Gogh 29.74 39.76 Herman Melville 23.93 14.50 Claude Monet 28.50 40.70 Jean-Martin Charcot 24.18 24.95 Édouard Manet 27.26 33.52 Francis Galton 24.41 26.01 Leo Tolstoy 28.21 40.53 Thomas Carlyle 24.17 13.89 Leonardo Fibonacci 27.83 33.71 John Herschel 24.45 26.81 3 Laozi 29.15 68.63

2 log(MurrayIndex) name HPI MurrayIndex name HPI MurrayIndex William Saroyan 23.08 3.54 Simone de Beauvoir 27.08 3.61 Albert Sabin 22.76 4.37 Pablo Neruda 27.23 4.35 1 Witold Gombrowicz 23.13 1.99 George Orwell 27.66 2.52 Erskine Caldwell 22.34 4.01 Benjamin Franklin 28.09 3.73 Evelyn Waugh 22.29 2.90 Seneca the Younger 28.82 4.06 Giulio Natta 23.33 4.08 Anaximenes of Miletus 27.37 3.53 Rafael Alberti 23.15 2.74 Xenophanes 27.21 2.89 Anna Seghers 23.09 2.46 0 Anaximander 28.12 3.87 Georg von Békésy 22.85 3.43 John Calvin 28.18 3.15 E. E. Cummings 22.57 1.88 Thomas More 28.16 3.17 3.0 3.1 3.2 3.3 3.4 log(HPI) 2 R = 12%

(a) Comparison of L (number of language presences in Wikipedia of an individual) with index scores from the Human Accomplishments (HA) dataset, including sample details on individuals matched between the datasets (n= 1,570).

(b) Comparison of HPI (Historical Popularity Index) with the index scores from HA, including sample details on individuals matched between the datasets (n= 1,570).

21 Figure 5: Analysis with Formula 1 Drivers

a. 110

100 Formula 1 Model

Fernando Alonso where and 1 1 2 x =PodiumsMichael Schumachere B = 0.0029 A= 32.96 90 x2=Starts B2= 0.0000006 R = 54% x3=Killed in Action B3= -0.0129

80 50 h

48 70 Kimi Räikkönen 46 Mika Häkkinen Ayrton Senna 44 60 42 40 Languages in Wikipedia (L) 50 Mark Webber Nico RosbergMika Häkkinen 38 Nelson HeikkiGiancarlo Kovalainen Fisichella Jacques Villeneuve Angelo Piquet Sébastien Bourdais Nick HeidfeldDamon Hill Timo GlockRalf Schumacher Jarno JuanTrulli Pablo Montoya 36 40 Adrian Sutil Pedro de la Rosa Nigel Mansell SébastienNelson Angelo Bourdais Piquet AnthonyNarain Karthikeyan Davidson Nelson Piquet Kazuki Nakajima Sébastien Buemi SébastienTakuma Sato Buemi 34 Roland RolandSakonVitantonioKamuiVitalyAlexander PetrovRatzenbergerYamamotoKobayashi Liuzzi Wurz Ratzenberger Vitaly PetrovVitantonio Liuzzi ScottChristijanLucaOlivierRomain HeinzBadoerSpeedEddie Panis Albers Grosjean−Harald Irvine Frentzen Alexander WurzKamui Kobayashi 30 Gilles VilleneuveJody Scheckter 32 Christian KlienJean Alesi Gerhard Berger MarcRobert Gené MicheleDoornbos Alboreto 30 30 35 40 45 50 20 20 30 40 50 60 70 80 90 100 110 Predicted from Podiums, Starts, and Killed in Action b. 26 25 Formula 1 Model Ayrton Senna where and 24 1 1 2 x =Podiums B = 0.0006 A= 16.37 Alain Prost x2=Starts B2= 0.00028 Nelson Piquet 3 3 R = 68% x =Killed in Action B = 0.1031 23 Nigel Mansell

20 22 Damon Hill Gerhard Berger Heinz−Harald Frentzen 19.5 Mika Häkkinen Juan Pablo Montoya 21 Jarno Trulli 19 Ralf Schumacher Jacques Villeneuve Giancarlo Fisichella Rubens Pedro de la Rosa Kimi Räikkönen 20 Barrichello Nick Heidfeld Mark Webber David Coulthard Je Heinz−Harald Frentzen 18.5 Felipe Massa Juan Pablo OlivierMontoya Panis

Jarno Trulli Lewis Hamilton Ralf Schumacher 19 GiancarloPedro de Fisichellala Rosa Nick Heidfeld Sebastian Vettel 18 Heikki Kovalainen Mark Webber Jenson Button Historical Popularity Index (HPI) Index Popularity Historical Felipe Massa Takuma Sato Lewis Hamilton Robert Kubica Sebastian Vettel Alexander Wurz 17.5 18 TakumaHeikki Sato Kovalainen ChristijanRobert Albers Kubica

Marc Gené Timo Glock 17 Nico Rosberg SébastienMarc GenéTimo Bourdais Glock Sébastien Bourdais 17 Nico Rosberg Luca Badoer Anthony DavidsonAdrian Sutil NarainVitantonio Karthikeyan Liuzzi Anthony Davidson Adrian Sutil RobertNelson Doornbos Angelo Piquet 16.5 Narain Karthikeyan ScottKazuki Speed Nakajima Nelson Angelo Piquet 16 Sakon KamuiYamamoto Kobayashi Kazuki Nakajima 16 Sébastien Buemi 16 16.5 17 17.5 18 18.5 19 19.5 20 15 16 18 20 22 24 26 Predicted from Podiums, Starts, and Killed in Action

(a) Analysis of L (number of language presences in Wikipedia of an individual) using Podiums, Starts, and Killed in Action data on Formula 1 drivers (n=56).

(b) Analysis of HPI (Historical Popularity Index) using Podiums, Starts, and Killed in Action data on Formula 1 drivers (n=56).

22 Figure 6: Analysis with Tennis Players

110 a. Model Tennis 100 where and x1=Weeks at Number One B1= -0.0003 A= 44.97 2 x2=Grand Slam B2= 0.0145 x3=Highest Rank B3= -0.0087 R = 34% 4 4 90 x =Gender (Female = 1) B = 0.022 60 Andrew Murray 58 80 56 54 70 52 50

Languages in Wikipedia Stefanie Graf 60 Andrew Murray Kim ClijstersBoris Becker 48 Justine Henin John McEnroe AnaAndy IvanoviMarat Roddick Safin Martina Hingis 46 Caroline Wozniacki Jelena Jankovic 50 Björn Borg Victoria Azarenka John McEnroe 44 Amélie Mauresmo Li Na Monica Seles Juan Martin Del Potro JelenaLleyton JankoviIvan Hewitt Lendl DinaraJuanAmélie Safina Martin Mauresmo Del Potro David Nalbandian Elena Dementieva Petra Kvitová Juan Carlos Ferrero DavidPetraJuan KvitováNalbandianJimmy Carlos Connors Ferrero 42 FrancescaGaël Monfils ArantxaSchiavoneStefan Sánchez Edberg Vicario Gaël Monfils 40 SamanthaNadiaDavidLindsay Petrova StosurFerrerGustavo Davenport Kuerten Arantxa Sánchez Vicario FernandoJo−NikolayWilfriedSvetlana VerdascoJennifer Davydenko ArthurTsonga Kuznetsova Capriati Ashe 40 40 45 50 55 60 40 50 60 70 80 90 100 110 Predicted from Wks@#1, Grand Slam Wins, Highest Rank & Gender

b. 25 24 Tennis Model where and x1=Weeks at Number One B1= 0.00009 A= 16.32 23 2 2 2 x =Grand Slam B = 0.0047 x3=Highest Rank B3= -0.0039 Björn Borg R = 63% x4=Gender (Female = 1) B4= 0.03227 22 Jimmy Connors Martina Navratilova 19 John McEnroe Boris Becker Chris Evert Stefanie Graf 21 Andre Agassi 18.5 Ivan Lendl Arantxa Sánchez Vicario Martina Hingis Roger Federer

20 Stefan Edberg Pete Sampras18 Venus Williams Maria Sharapova Marat Safin Novak Djokovic Rafael Nadal Amélie Mauresm o Monica Seles Justine Henin 17.5 19 Andy Roddick Anna Kournikova Arantxa MartinaSánchez Hingis Vicario Serena Williams Ana Ivanovic Juan Carlos Ferrero Gustavo Kuerten Venus Williams 17 David Nalbandian Lleyton Hewitt 18 Maria SharapovaMarat SafinNovak Djokovic Amélie Mauresmo Justine Henin Kim Clijsters LindsayAndy Davenport Roddick Ana IvanoviJuan Carlos Ferrero Jennifer Capriati 16.5 Jelena Jankovic David Ferrer David NalbandianLleyton Hewitt Elena Dementieva

Historical Popularity Index (HPI) 17 Kim Clijsters Caroline Wozniacki Nikolay Davydenko Jelena JankoviDavid Ferrer Jo−Wilfried Tsonga ElenaCaroline DementievaNikolay Wozniacki Davydenko Jo−Wilfried Tsonga 16 Fernando Verdasco Fernando Verdasco Francesca Schiavone Gaël Monfils 16 NadiaFrancesca PetrovaGaël Schiavone Monfils JuanAndrew Martin DelMurray Potro Andrew Murray Dinara Safina Juan Martin Del Potro Dinara Safina Victoria Azarenka 15.5 Svetlana Kuznetsova SamanthaLi Na Stosur 15 Victoria Azarenka 15 Li Na Petra Kvitová 16 16.5 17 17.5 18 18.5 19 14 14 16 18 20 22 24 Predicted from Wks@#1, Grand Slam Wins, Highest Rank & Gender

(a) Analysis of L (number of language presences in Wikipedia of an individual) using number of weeks at number one in the ATP or WTA, the number of Grand Slam wins, the top rank ever obtained, and gender (Female = 1, Male = 0) data on tennis players (n=52).

(b) Analysis of HPI (Historical Popularity Index) using number of weeks at number one in the ATP or WTA, the number of Grand Slam wins, the top rank ever obtained, and gender (Female = 1, Male = 0) data on tennis players (n=52).

23 Figure 7: Analysis with Swimmers

a. 80 Swimmers MichaelModel PhePhelpslps where and 70 x1=Gold Medals B1= 0.0238 A= 25.04 2 x2=Gender (Female = 1) B2= -0.0094 R = 74%

40 60

Ryan Lochte

50 35

Mark Spitz

Rūta Meilutytė

Languages in Wikipedia (L) 40 Stephanie Rice 30 Kirsty Coventry

Cesar Cielo Filho Alexander Vladimirovich Popov Ruta Meilutyte Stephanie Rice Alain Bernard 30 Kirsty Coventry Rebecca Soni Milorad Cavic Sun Yang Missy Franklin Cesar CieloAlexander Filho VladimirovichMatt Biondi Popov RebeccaAlain Bernard Soni Federica Pellegrini Leisel Jones Pieter van den Hoogenband MiloradFedericaSun LeiselPieteravi Yang MissyIngePellegrini Jones vande Franklin Bruijn den Hoogenband 25 25 30 35 40

20 20 30 40 50 60 70 80 Predicted from Gold Medals and Gender

b. 25 Swimmers Model where and x1=Gold Medals B1= 0.00787 A= 13.94 2 x2=Gender (Female = 1) B2= -0.0746 20 R = 50%

18

Michael Phelps M 17 Matt Biondi Matt Biondi Alexander Vladimirovich Popov Inge de Bruijn Alexander Vladimirovich Popov Inge de Bruijn Ian Thorpe Ian Thorpe 16 Pieter van den Hoogenband Pieter van den Hoogenband Alain Bernard 15 Federica Pellegrini 15 Alain Bernard Cesar Cielo Filho Federica Pellegrini Ryan Lochte

Cesar Cielo Filho Kirsty Coventry Milorad Čavić 14 Ryan Lochte Leisel Jones Stephanie RiceSun Yang Rebecca Soni 13 Kirsty Coventry

Historical Popularity Index (HPI) Milorad Čavić

Rūta Meilutytė 10 Missy Franklin 12 Leisel Jones Stephanie Rice Rebecca Soni Sun Yang 11

Rūta Meilutytė

10 10 11 12 13 14 15 16 17 18

5 5 10 15 20 25 Predicted from Gold Medals and Gender

(a) Analysis of L (number of language presences in Wikipedia of an individual) using the total number of gold medals won and gender data on swimmers born after 1950 (n=19).

(b) Analysis of HPI (Historical Popularity Index) using the total number of gold medals won and gender data on swimmers born after 1950 (n=19).

24 Figure 8: Analysis with Chess Players

a. 80

Model Chess 70 where and 1 1 2 x =highest ELO ranking B = 0.00003 A= 27.6 x2=gender B2= 0.12137 R = 37% x3=total games B3= 0.000028 x4=percentage wins B4= 0.00527 x5=percentage losses B5= -0.00776 x6=percentage draws B6= -0.00158 60 50 48 José Raúl Capablanca 46 50 Emanuel Lasker Wilhelm Steinitz 44 Magnus Carlsen Veselin TopalovMikhail Botvinnik Judit Polgár Boris Spassky Vasily Smyslov 42 Judit Polgár

Languages in Wikipedia 40 40 Paul Morphy Adolf Anderssen 38 Alexander Khalifman Vasyl Ivanchuk 36 Boris Gelfand Ruslan PonomariovPaul Keres Péter Lékó Levon Aronian Alexander Khalifman Teimour RadjabovRustam Kasimdzhanov 30 Ruy López de Segura 34

Alexei Shirov Vasyl Ivanchuk Mikhail Chigorin Michael Adams 32 Paul Keres Péter Lékó Levon Aronian 30 30 32 34 36 38 40 42 44 46 48 50 20 20 30 40 50 60 70 80 Predicted from highest ELO ranking, gender, total games, %wins, %losses, %draws b. 30 28 Chess Model where 1 1 26 2 x =highest ELO ranking B = -0.00007 A= 24.9 x2=gender B2= -0.0502 R = 53% Alexander xAlekhine3=total games B3= 0.000017 José Raúl Capablanca Bobby Fischer 4 4 Adolf Anderssen Emanuel Lasker x =percentage wins B = 0.00442 5 5 24 Mikhail Botvinnik Wilhelm Steinitz x =percentage losses B = -0.00288 Ruy López de 6Segura 6 Anatoly Karpov Paul Keres x =percentagePaul Morphy draws B = -0.00226 Boris Spassky 20 Vasily Smyslov Garry Kasparov 22 Mikhail Chigorin Viswanathan Anand 19 20 Alexander Khalifman Viswanathan Anand 18 Veselin Topalov Alexander Khalifman 18 Veselin Topalov s Michael Adams Michael Adam Vasyl Ivanchuk Alexei Shirov 17 Vasyl Ivanchuk Boris Gelfand PéterJudit Lékó PolgárRuslanRustam Ponomariov Kasimdzhanov Historical Popularity Index (HPI) 16 Boris Gelfand Rustam Kasimdzhanov Magnus Carlsen 16 Péter LékóJudit Polgár Levon Aronian Ruslan Ponomariov 14 Teimour Radjabov

Sergey Karjakin 15 12 Levon Aronian Magnus Carlsen

14 14 15 16 17 18 19 20 10 10 15 20 25 30 Predicted from highest ELO ranking, gender, total games, %wins, %losses, %draws

(a) Analysis of L (number of language presences in Wikipedia of an individual) using highest ELO ranking attained, gender, total games played, and percentage of wins, losses, and draws of chess players (n=30).

(b) Analysis of HPI (Historical Popularity Index) using highest ELO ranking attained, gender, total games played, and percentage of wins, losses, and draws of chess players (n=30).

25 Tables Table 1: Comparison Chart of Quantitative Datasets for Studying Historical Information

Includes domain # of # of Dataset Metric(s) Size Domains Time classification Countries Languages UNESCO economic 52 countries Arts, Yes (7 core 2000- >50 2471 Cultural productivity, – feature Design, domain present Statistics 17 cultural films, 20 Media headings) participation countries – employment survey World Values Values 85,000 Religion, N/A 1981- 87 20

Survey respondents Institutions present (http://www.worldv aluessurvey.org/)

6 Culturomics n-grams 5,195,769 Arts, N/A 1800- not 7 books Politics, 2000 reported Sciences Human Individuals 3,869 Arts & Yes (9 800BC- 30* 6*

Accomplishment and events significant Sciences within 1950 countries languages 5 individuals, Sciences, 4 & regions 1,560 within Arts) significant events 8 Who’s Bigger Entities 843,790 All Yes (based 2686 BC >200* 1 (Individuals, individuals domains on English - 2010 (English) events, known in Wikipedia places, English categories) things) Wikipedia

Popescu & Individuals, 500,896 All Yes (based <600 BC >200* 7 Grefenstette 7 places names domains on English - 2009 known in Wikipedia English categories) Wikipedia

Networked Individuals Multiple All Yes – # 1-2012 >200* 2 – Framework of & Places datasets domains categories (mainly

Cultural History (FB, AKL, per dataset: English, 9 ULAN, AKL: 5, German) WCEN, ULAN: 7, Google FB: 6 Ngrams) Pantheon 1.0 Individuals, 11,341 All Yes (88 4000BC – 194 277 Places. notable domains occupations, 2010 individuals 27 industries, 8 domains) *estimated (exact numbers not reported)

26 Table 2: Top 10 Biographies for each Time Period by Number of Language Editions (L) and Historical Popularity Index (HPI)

Top 10 By L Top 10 by HPI Time Period Name L Name HPI Jesus Christ 214 Aristotle 31.99 Confucius 192 Plato 31.99 Aristotle 152 Jesus Christ 31.90 Qin Shi Huang 144 Socrates 31.65 Plato 142 Alexander the Great 31.58 Before 500 Homer 141 Confucius 31.37 Alexander the Great 138 Julius Caesar 31.12 Socrates 137 Homer 31.11 Archimedes 129 Pythagoras 31.07 Julius Caesar 128 Archimedes 30.99 Muhammad 150 Muhammad 30.65 Genghis Khan 121 Charlemagne 30.48 Charlemagne 116 Genghis Khan 29.74 Saladin 104 Saladin 29.14 Avicenna 102 Avicenna 28.89 500-1199 Li Bai 102 Ali 28.09 Muhammad ibn Musa al-Khwarizmi 100 Li Bai 28.07 Du Fu 97 Francis of Assisi 28.00 Umar 92 Du Fu 27.91 Omar Khayyám 86 Averroes 27.83 Leonardo da Vinci 174 Leonardo da Vinci 31.46 Michelangelo 158 Michelangelo 30.44 Christopher Columbus 153 Christopher Columbus 30.18 Dante Alighieri 137 Dante Alighieri 30.15 Nicolaus Copernicus 128 Martin Luther 30.03 1200-1499 Marco Polo 127 Marco Polo 29.77 Martin Luther 125 Jeanne d'Arc 29.56 Ferdinand Magellan 120 Thomas Aquinas 29.44 Vasco da Gama 119 Niccolò Machiavelli 29.27 Albrecht Dürer 112 Raphael 29.21 Isaac Newton 191 William Shakespeare 30.44 William Shakespeare 163 Isaac Newton 30.29 Galileo Galilei 146 Johann Sebastian Bach 30.17 Johann Sebastian Bach 144 Galileo Galilei 29.96 1500-1749 George Washington 134 Immanuel Kant 29.69 Johann Wolfgang von Goethe 132 René Descartes 29.45 Johann Wolfgang von Miguel de Cervantes 128 Goethe 29.34

27 Immanuel Kant 125 Voltaire 29.27 Rembrandt 125 Blaise Pascal 29.25 Carl Linnaeus 123 Jean-Jacques Rousseau 29.16 Wolfgang Amadeus Mozart 177 Wolfgang Amadeus Mozart 30.51 Ludwig van Beethoven 153 Napoleon Bonaparte 30.33 Karl Marx 148 Ludwig van Beethoven 30.11 Charles Darwin 148 Karl Marx 29.84 1750-1849 Napoleon Bonaparte 145 Charles Darwin 29.20 Abraham Lincoln 131 Friedrich Nietzsche 28.67 Thomas Edison 126 Victor Hugo 28.66 Victor Hugo 123 Richard Wagner 28.57 Leo Tolstoy 123 Claude Monet 28.50 Friedrich Nietzsche 117 Frédéric Chopin 28.47 Adolf Hitler 169 Adolf Hitler 30.58 Huang Xian Fan 167 Albert Einstein 30.21 Albert Einstein 166 Vincent van Gogh 29.74 Mustafa Kemal Atatürk 166 Sigmund Freud 29.63 1850-1899 Vincent van Gogh 155 Pablo Picasso 29.60 Charlie Chaplin 145 Mahatma Gandhi 29.14 Pablo Picasso 143 Joseph Stalin 28.93 Mahatma Gandhi 138 Vladimir Lenin 28.92 Vladimir Lenin 137 Benito Mussolini 28.55 Joseph Stalin 134 Oscar Wilde 28.53 Hebe Camargo 157 Che Guevara 29.16 Marilyn Monroe 143 Martin Luther King, Jr. 28.69 George Bush 143 Elvis Presley 28.62 Lech Wałęsa 135 Salvador Dalí 28.59 Nelson Mandela 133 Marilyn Monroe 28.35 1900-1950 Che Guevara 129 Walt Disney 28.17 Elizabeth II of the 127 Jean-Paul Sartre 28.15 Pope Benedict XVI 123 Jimi Hendrix 27.93 Salvador Dalí 122 Andy Warhol 27.92 Neil Armstrong 121 Mother Teresa 27.86

28