
Towards a quantitative framework for historical disciplines

Barbara McGillivray1, Jon Wilson2, Tobias Blanke3
1The Alan Turing Institute, University of Cambridge, United Kingdom
2Department of History, King's College London, United Kingdom
3Department of Digital Humanities, King's College London, United Kingdom
[email protected], {jon.wilson, tobias.blanke}@kcl.ac.uk

Proceedings of the Workshop on Computational Methods in the Humanities 2018 (COMHUM 2018)

1 Background and motivation

The ever-expanding wealth of digital material that researchers have at their disposal today, coupled with growing computing power, makes the use of quantitative methods in historical disciplines increasingly viable. However, applying existing techniques and tools to historical datasets is not a trivial enterprise (Piotrowski, 2012; McGillivray, 2014). Moreover, scholarly communities react differently to the idea that new research questions and insights can arise from quantitative explorations that could not be made using purely qualitative approaches. Some of them, such as linguistics (Jenset and McGillivray, 2017), have been acquainted with quantitative methods for a longer time. Others, such as history, have seen a growth in quantitative methods on the fringes of the discipline, but have not incorporated them into the mainstream of scholarly practice (Hitchcock, 2013).

Donald E. Knuth is perhaps the most famous godfather of computer science. For him, “[s]cience is knowledge which we understand so well that we can teach it to a computer; and if we don’t fully understand something, it is an art to deal with it. . . . [T]he process of going from an art to a science means that we learn how to automate something” (Knuth, 2007). Computing science is defined by the tension between automating processes using digital means and our inability to do so, because we fail to create fully explicit ways of understanding processes. In this sense, a computational approach to collecting and processing (historical) evidence would be a science if we could learn to automate it. Many features of the past can be understood through automation. Yet the problematic nature of the relationship between sources and reality, and the mutability of categories, mean that it will always rely on a significant degree of human intuition and cannot be fully automated; computational history is an art in Knuth’s terms.

Historical disciplines, i.e., those focusing on the study of the past, possess at least two characteristics which set them apart and require careful consideration in this context: the need to work with closed archives which can only be expanded by working on past records (Mayrhofer, 1980), and the focus on phenomena that change in a complex fashion over time. First, this means historical research is grounded in empirical sources which are stable and fixed (one cannot change the archival record). But these sources are often hard to access and, recording the language and actions of only a small fraction of historical reality at any moment, have a complex relationship to the past being studied. Secondly, the categories through which the past is studied themselves change, making modelling, and the automation of analysis based on a limited number of features in the historical record, a fraught enterprise.

The methodological reflections in this paper are part of an effort to define the possibilities and limits of quantification and automation in historical analysis. Our aim is to assist scholars in taking full advantage of quantification through a rigorous account of the boundaries between science and art in Knuth’s terms. Building on McGillivray et al. (2018), in this contribution we begin with the framework proposed by Jenset and McGillivray (2017) for quantitative historical linguistics and illustrate it with two case studies.

2 A quantitative framework for historical linguistics

Jenset and McGillivray (2017)’s framework is the only general framework available for quantitative historical linguistics. A comparable framework, but more limited in scope, can be found in Köhler (2012). Jenset and McGillivray (2017)’s framework starts from the assumption that linguistic historical reality is lost, and that the aim of quantitative research is to arrive at models of, and claims about, such reality which are quantitatively driven from evidence and lead to consensus among the scholarly community. The scope of application of this framework is delimited to cases where quantifiable evidence (such as n-grams or numerical data) can be gathered from primary sources, typically in the form of corpora, i.e., collections of electronic text created with the purpose of linguistic analysis.

Jenset and McGillivray (2017) define evidence in quantitative historical linguistics as the set of “facts or properties that can be observed, independently accessed, or verified by other researchers” (Jenset and McGillivray, 2017, 39), and thus exclude intuition as inadmissible evidence. Such facts can be pre-theoretical (as the fact that the English word the is among the most frequent ones) or based on some hypotheses or assumptions (as the fact that the class of article in English is among the most frequent ones, which rests on the assumption that the class of articles groups certain words together). Quantitative evidence is “based on numerical or probabilistic [measurement] or inference” (Jenset and McGillivray, 2017, 39), and the quantification should be independently verifiable. Distributional evidence, on the other hand, has the form “x occurs in context y”, where the context can consist of words, classes, phonemes, etc. Annotated corpora, where linguistic (morphological, syntactic, semantic, etc.) information has been encoded in context, are considered sources of distributional evidence for studying phenomena in historical linguistics.

Following Carrier (2012), Jenset and McGillivray (2017, 40) define claims as anything that is not evidence: statements based on evidence or on other claims. The role of claims in the framework concerns their connection with truth, which can be stated in categorical terms (as in “the claim that x belongs to class y is true”) or probabilistic terms (e.g., “x belongs to class y with probability p”). Claims possess a strength proportional to that of the evidence supporting them: all other things being equal, claims supported by extensive evidence are stronger than claims supported by little evidence.

Ultimately, research in historical linguistics aims at making (hopefully strong) claims logically following assumptions shared by the community, other claims, or evidence. A hypothesis originates from previous research, intuition, or logical reasoning, and is “a claim that can be tested empirically, through statistical hypothesis testing on corpus data” (Jenset and McGillivray, 2017, 42). In this context, “model” means a formalized representation of a phenomenon, be it statistical or symbolic (Zuidema and de Boer, 2014). Models (including those deriving from hypotheses tested quantitatively against evidence) are research tools embedding claims or hypotheses, useful in order to produce novel claims and hypotheses in turn via “a continual process of coming to know by manipulating representations” (McCarty, 2004).

Based on these definitions, Jenset and McGillivray (2017) formalize the research process they envisage as part of their framework; see Figure 1. The process starts from the historical linguistic reality, which we assume to be lost for ever. Any research model can only aim at approaching this reality without reaching it completely, and quantitative historical linguistics ultimately will produce models of language that are quantitatively driven from evidence. The rest of the diagram shows how this is achieved. The historical linguistic reality gave rise to a series of primary sources, including documents and other (mainly textual) sources, and these to secondary sources like grammars and dictionaries. Based on the knowledge of the language we gather from these sources, we can draft annotation schemes which specify the rules for adding linguistic information to the corpora and thus obtain annotated corpora. Corpora are the source of quantitative distributional evidence which can be used to test statistical hypotheses, formulated based on our intuition of the language and on knowledge drawn from examples. Such hypotheses can also feed into the creation of linguistic models, which aim to represent the historical linguistic reality.

Figure 1: Research process from the quantitative historical linguistics framework described in Jenset and McGillivray (2017). Figure modified from Figure 2.1 in Jenset and McGillivray (2017, 45).

3 Model-building in history

In contrast with quantitative historical linguistics, the discipline of history possesses an extraordinary variety of idioms to describe itself, and has a much less rigorous analytical vocabulary to describe its method. Yet there are important similarities, which mean that Jenset and McGillivray (2017)’s framework can be translated and modified for use in historical research more generally. First, historians assume that historical reality is lost, and can only be understood through traces left in a variety of archives (including human memory). Second, although historians rarely explicitly talk about constructing models, their practice largely consists of making claims about representations of the past which other disciplines would describe in precisely such terms. From the process they describe as the ‘interpretation’ or ‘analysis’ (Tosh, 2015) of the sources, historians create representations which reduce the vast complexity of historical reality to a few limited, stylised characteristics: Max Weber’s Protestant Ethic, Lewis Namier’s [analysis] of factional interest, or C.A. Bayly’s great uniformity. Third, these representations are used to make hypotheses and claims about change over time of different kinds. These might be about the endurance or rupture of certain key features in a particular sphere of activity, or about the forces responsible for causing a particular event or set of processes, for example.

We have suggested that history is (if implicitly) essentially a model-building enterprise. That allows many of the hypotheses which historians develop to be theoretically amenable to quantification. The use of quantitative methods (in particular the analysis of textual corpora) has increased recently (Guldi and Armitage, 2014). But most historians are reluctant to quantify because they are skeptical about formalising their models, believing that to do so would imply their possessing a degree of categorical rigidity unwarranted by the complexity of the past. We suggest that more explicit reflection on method, and engagement with other fields (such as historical linguistics) which deal with fuzzy categories, would help overcome these obstacles.

What’s more, the use of digital datasets and the application of quantitative techniques to them allows historical claims based on the prevalence of certain features of the past to be empirically tested. Such claims are already central to many forms of historical argumentation; about the importance of particular concepts or practices at specific moments, for example. Of course, such claims need to be precisely related to the structure of the (digitised) archive; as ever, limitations must be recognised. But given the amount of material which can be quickly processed, quantification allows claims previously asserted through little more than the accumulation of anecdotes to be more rigorously validated.

4 Languages of power

The first case study where we apply Jenset and McGillivray (2017)’s framework considers a recent collaboration between Digital Humanities and History at King’s College London (Blanke and Wilson, 2017) to develop a “materialist [reading] of political texts” following Moretti’s ideas of distant reading (Moretti, 2013). The project worked on a corpus of post-1945 UK government White Papers to map connections and similarities in political language from 1945 to 2010. As the corpus is time-indexed, a quantitative analysis traced the changing shape of political language by tracking clusters of terms relating to particular concepts and charting the changing meaning of words. Creating the distributional quantitative evidence involved text pre-processing to create a term-document matrix. Using natural language processing libraries, this was annotated with grammatical information, as well as with a number of dictionaries that reflected facets such as sentiment, ambiguity and so on. These allowed the project to use models for historical texts which not only read the texts themselves but also developed ways of classifying them into time intervals. More advanced techniques were applied to trace changes of meaning in key political concepts across time intervals, using topic models and word embeddings, allowing historiographical and linguistic hypotheses to be tested.

In Jenset and McGillivray (2017)’s terms, these various techniques produced a variety of different kinds of quantitative distributional evidence, which allowed a series of hypotheses to be developed and tested. Intuition, often developed from historical research using non-quantitative techniques, had an important role in framing hypotheses. But quantitative evidence was able to impart greater clarity and specificity to intuitional hypotheses, often closing down multiple possibilities. For example, using our dictionaries demonstrated a major break in the language of White Papers in the mid-1960s, around the election of Harold Wilson’s Labour government. While this made intuitive sense, so would a break in the early 1980s, which we did not find, instead seeing a rupture in the early 1990s.

Combining our chronological analysis with topic modelling and word embeddings allowed us to build a series of models of the predominant concerns and the structure of political language in each epoch. In line with Jenset and McGillivray (2017)’s framework, these models were built by iteratively generating and testing hypotheses. For example, we tested the frequency of different term clusters generated through topic modelling, and the terms whose embedding changed most dramatically between each epoch.

Our process of hypothesis generation and testing always had in mind the commonplace assumptions made by historians using non-quantitative techniques in the field. In many respects, the quantitative distributional evidence produced hypotheses at variance with those scholarly norms. For example, we found White Papers in the period from 1945 to 1964 to be dominated by post-war foreign policy concerns, not the construction of the welfare state; economic language was dominant in the period from 1965 to 1990, not afterwards; and ‘the state’ as a political agent was more important in the later period than before.

Yet, as challenging as they may be to much of the historiography of post-war Britain, the form of these hypotheses is very similar to the form of the claims made in standard historical argumentation; there is no dramatic epistemological leap in the type of knowledge being produced. Although our models were developed using automated techniques, they can be verified qualitatively in the same way as non-quantifiable claims, through quotation and the interpretation of words and phrases in specific contexts.

One important finding is the need to recognise the broad range of different ways in which quantitative analysis can be expressed. It is important, for example, to indicate the absolute frequency of terms in any series as well as their relation to other terms. There is significant work to be done developing ways to visually represent the quantitative features of any corpus of texts.

5 Predicting the Past

The digital humanities generally use computational modelling for exploratory data analysis, making use of advancements in the ability to visualise and interactively explore data in a relatively free fashion. Recently, we have witnessed the emergence of new combinations of exploratory data analysis with statistical evidence for discovered patterns. This is popular in the digital humanities too: Klingenstein et al. (2014), for instance, integrate a historical regression analysis into their data visualisations. Our first example above is an instance of exploratory data analysis, using topic modelling and other tools to provide statistical evidence for underlying trends in the documents, as demonstrated earlier.

Models, however, often have another purpose beyond the exploration of data: they are part of predictive analytics. Abbott (2014) is one of the most famous practitioners in the field. For him, predictive analytics works on “discovering interesting and meaningful patterns in data. It draws from several related disciplines, some of which have been used to discover patterns in data for more than 100 years, including pattern recognition, [statistics], machine learning, artificial intelligence, and data mining” (Abbott, 2014).

It is a common misunderstanding to reduce predictive analytics to attempts to predict the future. It is rather about discovering meaningful relationships in any data. Compared to traditional analytics, predictive analytics is driven by the data under observation rather than primarily by human assumptions about the data. The discipline strives to automate modelling and pattern-finding as far as this is possible. In this sense, it moves away from both exploratory and confirmatory data analysis, as it fully considers how computers would process evidence.

O’Neil and Schutt (2013) introduce the idea of predicting the past, which is used to model the effects of electronic health records (EHR) and to set up new monitoring programs for drugs. For O’Neil and Schutt (2013), these integrated datasets were the foundations of novel research attempts to predict the past. They cite the ‘Observational Medical Outcomes Partnership’ (OMOP) in the US, which investigates how good we are at predicting what we already know about drug performance in health using past datasets. Once OMOP had integrated data from heterogeneous sources, it began to look into predicting the past of old drug cases and how effective their treatments were. “Employing a variety of approaches from the fields of epidemiology, statistics, computer science, and elsewhere, OMOP seeks to answer a critical challenge: what can medical researchers learn from assessing these new health databases, could a single approach be applied to multiple diseases, and could their findings be proven?” (O’Neil and Schutt, 2013). Predicting the past thus tries to understand how “well the current methods do on predicting things we actually already know” (O’Neil and Schutt, 2013).

Such a novel approach to past datasets should be of interest to digital history, which could use it to control decisions on how we organise and divide historical records. An existing example that implies predicting past events by joining historical datasets is the identification of historical spatio-temporal patterns of IED usage by the Provisional Irish Republican Army during ‘The Troubles’, used to attribute ‘historical behaviour of terrorism’ (Tench et al., 2016).

In Blanke (2018), we demonstrate how predicting the past can complement and enhance existing work in the digital humanities that is mainly concentrated on exploring gender issues as they appear in past datasets. Blevins and Mullen (2015) provide an expert introduction into why the digital humanities should be interested in predicting genders. Gender values are often missing from datasets and need to be imputed. Predictive analytics can be seen as a corrective to existing data practices, allowing us to predict the genders in a dataset. In Blanke (2018), we compare a traditional dictionary-based approach with two machine learning strategies: first a classification algorithm is discussed, and then three different rule-based learners are introduced. We demonstrate how these rule-based learners are an effective alternative to the traditional dictionary-based method and partly outperform it.

Blanke (2018) develops predicting the past further and presents its differences from other predictive analytics approaches. We follow all the steps of traditional predictive analytics to prepare a stable and reliable model, paying particular attention to avoiding overfitting, one of the main risks in predictive models. An ‘overfitting’ model is one that models the existing training data too closely, which negatively impacts its ability to generalize to new cases. We perform extensive cross-validations to avoid overfitting. Predicting the past, however, differs significantly from other approaches, as the model is not prepared for the future addition of data but is used to analyse existing data. The aim is to understand which (minimal) set of features makes it likely that observation x includes feature y. In Blanke (2018), we aimed to understand which combination of features makes it likely that a historical person is of gender female, male or unknown. The next step in our methodology is therefore to apply the best performing models to the whole dataset again to analyse what gender determinations exist in the data. Is it, e.g., more likely that vagrants were female in London?

The common approaches to gender prediction in the digital humanities use predefined dictionaries of first names and then match the gender of individuals against this dictionary. This has, firstly, the problem that such dictionaries are heavily dependent on the culture and language they relate to. But this is not the only issue, as dictionary-based approaches secondly also assume that errors are randomly distributed: gender trouble is treated as simply a problem of not recording the right gender in the data. Our predictive analytics approach in Blanke (2018), on the other hand, does not make this assumption in advance and judges gender based on the existing data. This has led in turn to interesting insights into why certain genders remain unknown to the models.

In summary, predicting the past is based, firstly, on going through all traditional predictive analytics steps to form a stable model that reflects the underlying historical evidence closely enough but does not overfit. Secondly, we use this stable model to algorithmically analyse historical evidence to gain insights into how a computer would see the relations of evidence.

6 Conclusion and future work

This comparison leads us to the conclusion that, despite the broad applicability of Jenset and McGillivray (2017)’s framework in both cases, some important differences emerge between historical linguistics and history. We discuss two. First of all, the scope of primary sources and their quantitative representation is broader in history, including not only distributional but also categorical, ordinal, and numerical evidence. History requires careful discernment of which is most appropriate, and how they should be combined.

Secondly, the scope for a purely quantitative approach is less broad: quantitative evidence and models can often only contribute to informing hypotheses and claims which rely on qualitative evidence and methods. Often it seems that quantitative methods are only accepted by historical scholars if the claims developed by automated techniques can also be verified qualitatively, through anecdote, quotation and so on. In many fields quantification can be accepted because it creates results which look similar to those produced by qualitative research. But this approach limits the development of methods that use quantification to do more than simply re-frame qualitative [arguments], and instead make statistical arguments about aggregate behaviour in its own right. In the future, we plan to develop these insights further, in order to build a more comprehensive research framework which integrates qualitative and quantitative approaches.

Acknowledgments

This work was supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1. BM is supported by the Turing award TU/A/000010 (RG88751).

References

Abbott, Dean (2014). Applied Predictive Analytics: Principles and Techniques for the Professional Data Analyst. Hoboken, NJ: Wiley.

Blanke, Tobias (2018). Predicting the past. Digital Humanities Quarterly, 12(2).

Blanke, Tobias and Jon Wilson (2017). Identifying epochs in text archives. In 2017 IEEE International Conference on Big Data (Big Data), pages 2219–2224.

Blevins, Cameron and Lincoln Mullen (2015). Jane, John . . . Leslie? A historical method for algorithmic gender prediction. Digital Humanities Quarterly, 9(3).

Carrier, Richard (2012). Proving History: Bayes’s Theorem and the Quest for the Historical Jesus. Amherst, NY: Prometheus Books.

Guldi, Jo and David Armitage (2014). The History Manifesto. Cambridge: Cambridge University Press.

Hitchcock, Tim (2013). Confronting the digital: Or how academic history writing lost the plot. Cultural and Social History, 10(1):9–23.

Jenset, Gard B. and Barbara McGillivray (2017). Quantitative Historical Linguistics: A Corpus Framework. Oxford: Oxford University Press.

Klingenstein, Sara, Tim Hitchcock, and Simon DeDeo (2014). The civilizing process in London’s Old Bailey. Proceedings of the National Academy of Sciences, 111(26):9419–9424. doi:10.1073/pnas.1405984111.

Knuth, Donald E. (2007). Computer programming as an art. In ACM Turing Award Lectures, page 1974. ACM.

Köhler, Reinhard (2012). Quantitative Syntax Analysis. Berlin: de Gruyter Mouton.

Mayrhofer, Manfred (1980). Zur Gestaltung des etymologischen Wörterbuchs einer “Großcorpus-Sprache”. Wien: Österr. Akademie der Wissenschaften, phil.-hist. Klasse.

McCarty, Willard (2004). Modeling: A study in words and meanings. In Susan Schreibman, Ray Siemens, and John Unsworth, eds., A Companion to Digital Humanities, pages 254–270. Malden, MA: Blackwell.

McGillivray, Barbara (2014). Methods in Latin Computational Linguistics. Leiden: Brill.

McGillivray, Barbara, Giovanni Colavizza, and Tobias Blanke (2018). Towards a quantitative research framework for historical disciplines. In COMHUM 2018: Book of Abstracts for the Workshop on Computational Methods in the Humanities 2018. Lausanne, Switzerland.

Moretti, Franco (2013). Distant Reading. London: Verso.

O’Neil, Cathy and Rachel Schutt (2013). Doing Data Science: Straight Talk from the Frontline. Sebastopol, CA: O’Reilly.

Piotrowski, Michael (2012). Natural Language Processing for Historical Texts. San Rafael, CA: Morgan & Claypool.

Tench, Stephen, Hannah Fry, and Paul Gill (2016). Spatio-temporal patterns of IED usage by the Provisional Irish Republican Army. European Journal of Applied Mathematics, 27(3):377–402.

Tosh, John (2015). The Pursuit of History: Aims, Methods and New Directions in the Study of History. London: Routledge, sixth ed.

Zuidema, Willem and Bart de Boer (2014). Modeling in the language sciences. In Robert J. Podesva and Devyani Sharma, eds., Research Methods in Linguistics, pages 428–445. Cambridge: Cambridge University Press.
