Catching the Red Priest: Using Historical Editions of Encyclopaedia Britannica to Track the Evolution of Reputations

Yen-Fu Luo†, Anna Rumshisky†, Mikhail Gronas∗
†Dept. of Computer Science, University of Massachusetts Lowell, Lowell, MA, USA
∗Dept. of Russian, Dartmouth College, Hanover, NH, USA
{yluo,arum}@cs.uml.edu, [email protected]

Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pages 1–9, Beijing, China, July 30, 2015. © 2015 Association for Computational Linguistics and The Asian Federation of Natural Language Processing

Abstract

In this paper, we investigate the feasibility of using the chronology of changes in historical editions of Encyclopaedia Britannica (EB) to track the changes in the landscape of cultural knowledge, and specifically, the rise and fall in reputations of historical figures. We describe the data-processing pipeline we developed in order to identify the matching articles about historical figures in Wikipedia, the current electronic edition of Encyclopaedia Britannica (edition 15), and several digitized historical editions, namely, editions 3, 9, 11. We evaluate our results on the tasks of article segmentation and cross-edition matching using a manually annotated subset of 1000 articles from each edition. As a case study for the validity of discovered trends, we use the Wikipedia category of 18th century classical composers. We demonstrate that our data-driven method allows us to identify cases where a historical figure's reputation experiences a drastic fall or a dramatic recovery, which would allow scholars to further investigate previously overlooked instances of such change.

1 Introduction

Histories of nations are reflected in their shifting borders. But the histories of things immaterial, yet no less interesting–concepts, ideologies, reputations of historical personalities–are mapless. This paper describes the progress of the Knowledge Evolution Project (KnowEvo), which investigates the possibility of using historical digitized text to track and map long-range historical changes in the conceptual landscape, and specifically, the history of intellectual networks and reputations.

One of the ways to investigate the change in how ideas and personalities are represented is to use mention statistics from books written at different historical periods. Google Ngram Viewer is a tool that plots occurrence statistics using Google Books, the largest online repository of digitized books. But while Google Books in its entirety certainly has quantity, it lacks structure. However, the history of knowledge (or culture) is, to a large extent, the history of structures: hierarchies, taxonomies, domains, subdomains.

In the present project, our goal was to focus on sources that endeavor to capture such structures. One such source is particularly fitting for the task; and it has been in existence at least for the last three centuries, in the form of changing editions of authoritative encyclopedias, and specifically, Encyclopaedia Britannica. Throughout their existence, encyclopedias have claimed to be well-organized (i.e., structured) representations of knowledge and have effectively served as its (obviously imperfect) mirrors. Each edition of Encyclopaedia Britannica reflected a collective editorial decision, based on a scholarly consensus, regarding the importance of each subject that has to be included and the relative volume dedicated to it. As such, it can be thought of as a proxy of sorts for the state of contemporary knowledge. Of course, institutions such as Britannica, their claims to universality notwithstanding, throughout their histories have been necessarily western-centric and reflected the prejudices of their time. A note of caution is therefore in order here: what this data allows us to reconstruct is the evolution of knowledge representation, rather than of the knowledge itself.

In this paper, we investigate the feasibility of using historical Encyclopaedia Britannica editions to develop tools that can be used in scholarship and in pedagogy to illustrate and analyze known historical changes and to facilitate the discovery of overlooked trends and processes. Specifically, we focus on the history of intellectual reputations. We are interested in whether certain categories of people that form an intellectual landscape of a culture can be tracked through time using Britannica's historical editions. We suggest that by measuring changes in the relative importance assigned to a particular figure in successive editions of Britannica we can reconstruct the history of his or her reputation. Thus, each edition can be thought of as a proxy for the contemporary state of knowledge (and reputations in particular), with the history of editions reflecting the history of such states. Continuing previous work (Gronas et al., 2012), we develop a set of tools for cleaning noisy digitally scanned text, identifying articles and subjects, normalizing their mentions across editions, and measuring their relative importance. The data about historical figures and their reputations, based on their representation in different editions of Encyclopaedia Britannica, is available for browsing and visualization through the KnowEvo Facebook of the Past search interface.1

1 http://knowevo.cs.uml.edu

We examine the plausibility of using these tools to track the change in people's reputations in various domains of culture. In the current work, we verify the accuracy of our cross-edition normalization methods and conduct a case study to examine whether the discovered trends are valid. As a case study, we look at the reputations of 18th century classical composers. Many 18th century classical composers were of the utmost importance for western classical music and exerted far-reaching influence during the following centuries. We looked at the reputation changes between the 11th edition (1911) and the 15th edition (1985–2000), thus covering most of the 20th century.

The results of our case study suggest that our methods provide a valid way to examine the trends in the rise and fall of reputations. For example, the case study revealed that in the course of the 20th century, among the major composers, Handel's reputation underwent the biggest change, as he dropped from being the second most important composer (after Johann Sebastian Bach) at the beginning of the century to the fifth position, well behind Gluck and Haydn. Meanwhile, Mozart dethroned Bach, who moved from the first to the second place. Some of the lesser composers (Lotti and Gaensbacher) disappeared from encyclopedia-curated cultural memory altogether, whereas the familiar name of Telemann owes its familiarity to a recent revival. Another notable shift, empirically revealed during the case study, was the change in Vivaldi's legacy. The author of "The Four Seasons", known to his contemporaries as the red priest (due to his hair and profession, respectively), was completely forgotten towards the beginning of the 20th century and then rediscovered, joining the canon in the second part of the century. For a student of musical history these facts are not surprising. However, they have been obtained through an automatic method which can be used in other, less well known areas of cultural history and on a large scale.

2 Related work

Big data analysis of large textual datasets in the humanities has been gaining momentum in recent years, as evidenced by the success of Culturomics (Michel et al., 2011), a method based on n-gram frequency analysis of the Google Books corpus, available via the Google Ngram Viewer. Skiena and Ward (2013) recently applied similar quantitative analysis to empirical cultural history, assessing the relative importance of historical figures by examining Wikipedia people articles. They supplemented word frequency analysis with several Wikipedia-based measures, such as PageRank (Page et al., 1999), page size, the number of page views and page edits. Their approach is complementary to ours: whereas they are interested in the reputations as they exist today, we seek to quantify the dynamics of cultural change, i.e. the historical dimension of reputations, rather than a contemporary snapshot.

A culturomics-like approach applied to large structured datasets (knowledge bases) is advocated in Suchanek and Preda (2014). Our approach is somewhat similar in that the corpus of historical editions of Britannica can be considered a knowledge base, with an important difference being a chronological dimension, absent from such knowledge bases as YAGO or DBpedia. An example of mining a historical corpus for trends using the frequentist approaches to vocabulary shifts as well as normalization to structured sources can be found in the recent work on newspaper and journal historical editions such as Kestemont et al. (2014) and Huet et al. (2013).

Disambiguation of named entities to structured sources such as Wikipedia has been an active area of research in recent years (Bunescu and Pasca, 2006; Cornolti et al., 2013; Cucerzan, 2007; Hoffart et al., 2011; Kulkarni et al., 2009; Liao and Veeramachaneni, 2009; Ratinov et al., 2011). Our approach to diachronic normalization between different editions opens the door to time-specific en-

look for uppercase words at the beginning of a line preceded by an empty line and followed by at least one non-empty line; the first word should be at least two characters long, and excludes frequent