
Towards a quantitative framework for historical disciplines

Barbara McGillivray1, Jon Wilson2, Tobias Blanke3
1The Alan Turing Institute, University of Cambridge, United Kingdom
2Department of History, King's College London, United Kingdom
3Department of Digital Humanities, King's College London, United Kingdom
[email protected], {jon.wilson, tobias.blanke}@kcl.ac.uk

Proceedings of the Workshop on Computational Methods in the Humanities 2018 (COMHUM 2018)

1 Background and motivation

The ever-expanding wealth of digital material that researchers have at their disposal today, coupled with growing computing power, makes the use of quantitative methods in historical disciplines increasingly viable. However, applying existing techniques and tools to historical datasets is not a trivial enterprise (Piotrowski, 2012; McGillivray, 2014). Moreover, scholarly communities react differently to the idea that new research questions and insights can arise from quantitative explorations that could not be made using purely qualitative approaches. Some of them, such as linguistics (Jenset and McGillivray, 2017), have been acquainted with quantitative methods for a longer time. Others, such as history, have seen a growth in quantitative methods on the fringes of the discipline, but have not incorporated them into the mainstream of scholarly practice (Hitchcock, 2013).

Donald E. Knuth is perhaps the most famous godfather of computer science. For him, “[s]cience is knowledge which we understand so well that we can teach it to a computer; and if we don’t fully understand something, it is an art to deal with it. . . . [T]he process of going from an art to a science means that we learn how to automate something” (Knuth, 2007). Computing science is defined by the tension between automating processes using digital means and our inability to do so, because we fail to create fully explicit ways of understanding processes. In this sense, a computational approach to collecting and processing (historical) evidence would be a science if we could learn to automate it. Many features of the past can be understood through automation. Yet the problematic nature of the relationship between sources and reality, and the mutability of categories, mean that it will always rely on a significant degree of human intuition and cannot be fully automated; computational history is an art in Knuth’s terms.

Historical disciplines, i.e., those focusing on the study of the past, possess at least two characteristics which set them apart and require careful consideration in this context: the need to work with closed archives which can only be expanded by working on past records (Mayrhofer, 1980), and the focus on phenomena that change in a complex fashion over time. First, this means historical research is grounded in empirical sources which are stable and fixed (one cannot change the archival record). But these sources are often hard to access and, recording the language and actions of only a small fraction of historical reality at any moment, have a complex relationship to the past being studied. Secondly, the categories through which the past is studied themselves change, making modelling, and the automation of analysis based on a limited number of features in the historical record, a fraught enterprise.

The methodological reflections in this paper are part of an effort to define the possibilities and limits of quantification and automation in historical analysis. Our aim is to assist scholars in taking full advantage of quantification through a rigorous account of the boundaries between science and art in Knuth’s terms. Building on McGillivray et al. (2018), in this contribution we begin with the framework proposed by Jenset and McGillivray (2017) for quantitative historical linguistics and illustrate it with two case studies.

2 A quantitative framework for historical linguistics

Jenset and McGillivray (2017)’s framework is the only general framework available for quantitative historical linguistics. A comparable framework, but more limited in scope, can be found in Köhler (2012). Jenset and McGillivray (2017)’s framework starts from the assumption that linguistic historical reality is lost, and that the aim of quantitative research is to arrive at models of, and claims about, such reality which are quantitatively driven from evidence and lead to consensus among the scholarly community. The scope of application of this framework is delimited to cases where quantifiable evidence (such as n-grams or numerical data) can be gathered from primary sources, typically in the form of corpora, i.e., collections of electronic text created with the purpose of linguistic analysis.

Jenset and McGillivray (2017) define evidence in quantitative historical linguistics as the set of “facts or properties that can be observed, independently accessed, or verified by other researchers” (Jenset and McGillivray, 2017, 39), and thus exclude intuition as inadmissible evidence. Such facts can be pre-theoretical (as the fact that the English word the is among the most frequent ones) or based on some hypotheses or assumptions (as the fact that the class of article in English is among the most frequent ones, which rests on the assumption that the class of articles groups certain words together). Quantitative evidence is “based on numerical or probabilistic [measurement] or inference” (Jenset and McGillivray, 2017, 39), and the quantification should be independently verifiable. Distributional evidence, on the other hand, has the form “x occurs in context y”, where the context can consist of words, classes, phonemes, etc. Annotated corpora, where linguistic (morphological, syntactic, semantic, etc.) information has been encoded in context, are considered sources of distributional evidence for studying phenomena in historical linguistics.

Following Carrier (2012), Jenset and McGillivray (2017, 40) define claims as anything that is not evidence: statements based on evidence or on other claims. The role of claims in the framework concerns their connection with truth, which can be stated in categorical terms (as in “the claim that x belongs to class y is true”) or probabilistic terms (e.g., “x belongs to class y with probability p”). Claims possess a strength proportional to that of the evidence supporting them: all other things being equal, claims supported by extensive evidence are stronger than claims supported by little evidence.

Ultimately, research in historical linguistics aims at making (hopefully strong) claims logically following assumptions shared by the community, other claims, or evidence. A hypothesis originates from previous research, intuition, or logical reasoning, and is “a claim that can be tested empirically, through statistical hypothesis testing on corpus data” (Jenset and McGillivray, 2017, 42). In this context, “model” means a formalized representation of a phenomenon, be it statistical or symbolic (Zuidema and de Boer, 2014). Models (including those deriving from hypotheses tested quantitatively against evidence) are research tools embedding claims or hypotheses, useful in order to produce novel claims and hypotheses in turn via “a continual process of coming to know by manipulating representations” (McCarty, 2004).

Based on these definitions, Jenset and McGillivray (2017) formalize the research process they envisage as part of their framework; see Figure 1. The process starts from the historical linguistic reality, which we assume to be lost for ever. Any research model can only aim at approaching this reality without reaching it completely, and quantitative historical linguistics ultimately will produce models of language that are quantitatively driven from evidence. The rest of the diagram shows how this is achieved. The historical linguistic reality gave rise to a series of primary sources, including documents and other (mainly textual) sources, and these to secondary sources like grammars and dictionaries. Based on the knowledge of the language we gather from these sources, we can draft annotation schemes which specify the rules for adding linguistic information to the corpora and thus obtain annotated corpora. Corpora are the source of quantitative distributional evidence which can be used to test statistical hypotheses, formulated based on our intuition of the language and on knowledge drawn from examples. Such hypotheses can also feed into the creation of linguistic models, which aim to represent the historical linguistic reality.

Figure 1: Research process from the quantitative historical linguistics framework described in Jenset and McGillivray (2017). Figure modified from Figure 2.1 in Jenset and McGillivray (2017, 45).

3 Model-building in history

In contrast with quantitative historical linguistics, the discipline of history possesses an extraordinary variety of idioms to describe itself, and has a much less rigorous analytical vocabulary to describe its method. Yet there are important similarities, which mean that Jenset and McGillivray (2017)’s framework can be translated and modified for use in historical research more generally. First, historians assume that historical reality is lost, and can only be understood through traces left in a variety of archives (including human memory). Second, although historians rarely explicitly talk about constructing models, their practice largely consists of making claims about representations of the past which other disciplines would describe in precisely such terms. From the process they describe as the ‘interpretation’ or ‘analysis’ (Tosh, 2015) of the sources, historians create representations which reduce the vast complexity of historical reality to a few limited, stylised characteristics: Max Weber’s Protestant Ethic, Lewis Namier’s [analysis] of factional interest, or C.A. Bayly’s great uniformity. Third, these representations are used to make hypotheses and claims about change over time of different kinds. These might be about the endurance or rupture of certain key features in a particular sphere of activity, or about the forces responsible for causing a particular event or set of processes, for example.

We have suggested that history is (if implicitly) essentially a model-building enterprise. That allows many of the hypotheses which historians develop to be theoretically amenable to quantification. The use of quantitative methods (in particular the analysis of textual corpora) has increased recently (Guldi and Armitage, 2014). But most historians are reluctant to quantify because they are skeptical about formalising their models, believing that to do so would imply their possessing a degree of categorical rigidity unwarranted by the complexity of the past. We suggest that more explicit reflection on method, and engagement with other fields (such as historical linguistics) which deal with fuzzy categories, would help overcome these obstacles.

What’s more, the use of digital datasets and the application of quantitative techniques to them allows historical claims based on the prevalence of certain features of the past to be empirically tested. Such claims are already central to many forms of historical argumentation; about the importance of particular concepts or practices at specific moments, for example. Of course, such claims need to be precisely related to the structure of the (digitised) archive; as ever, limitations must be recognised. But given the amount of material which can be quickly processed, quantification allows claims previously asserted through little more than the accumulation of anecdotes to be more rigorously validated.

4 Languages of power

The first case study where we apply Jenset and McGillivray (2017)’s framework considers a recent collaboration between Digital Humanities and History at King’s College London (Blanke and Wilson, 2017) to develop a “materialist [reading] of political texts” following Moretti’s ideas of distant reading (Moretti, 2013). The project worked on a corpus of post-1945 UK government White Papers to map connections and similarities in political language from 1945 to 2010. As the corpus is time-indexed, a quantitative analysis traced the changing shape of political language by tracking clusters of terms relating to particular concepts and charting the changing meaning of words. Creating the distributional quantitative evidence involved text pre-processing to create a term-document matrix. Using natural language processing libraries, this was annotated with grammatical information, as well as with a number of dictionaries that reflected facets such as sentiment, ambiguity and so on. These allowed the project to use models for historical texts which not only read the texts themselves but also developed ways of classifying them into time intervals. More advanced techniques were applied to trace changes of meaning in key political concepts across time intervals, using topic models and word embeddings, allowing historiographical and linguistic hypotheses to be tested.

In Jenset and McGillivray (2017)’s terms, these various techniques produced a variety of different kinds of quantitative distributional evidence, which allowed a series of hypotheses to be developed and tested. Intuition, often developed from historical research using non-quantitative techniques, had an important role in framing hypotheses. But quantitative evidence was able to impart greater clarity and specificity to intuitional hypotheses, often closing down multiple possibilities. For example, using our dictionaries demonstrated a major break in the language of White Papers in the mid-1960s, around the election of Harold Wilson’s Labour government. While this made intuitive sense, so would a break in the early 1980s, which we did not find, instead seeing a rupture in the early 1990s.

Combining our chronological analysis with topic modelling and word embeddings allowed us to build a series of models of the predominant concerns and the structure of political language in each epoch. In line with Jenset and McGillivray (2017)’s framework, these models were built by iteratively generating and testing hypotheses. For example, we tested the frequency of different term clusters generated through topic modelling, and the terms whose embedding changed most dramatically between each epoch.

Our process of hypothesis generation and testing always had in mind the commonplace assumptions made by historians using non-quantitative techniques in the field. In many respects, the quantitative distributional evidence produced hypotheses at variance with those scholarly norms. For example, we found White Papers in the period from 1945 to 1964 to be dominated by post-war foreign policy concerns, not the construction of the welfare state; economic language was dominant in the period from 1965 to 1990, not afterwards; and ‘the state’ as a political agent was more important in the later period than before.

Yet, as challenging as they may be to much of the historiography of post-war Britain, the form of these hypotheses is very similar to the form of the claims made in standard historical argumentation; there is no dramatic epistemological leap in the type of knowledge being produced. Although our models were developed using automated techniques, they can be verified qualitatively in the same way as non-quantifiable claims, through quotation and the interpretation of words and phrases in specific contexts.

One important finding is the need to recognise the broad range of different ways in which quantitative analysis can be expressed. It is important, for example, to indicate the absolute frequency of terms in any series as well as their relation to other terms. There is significant work to be done developing ways to visually represent the quantitative features of any corpus of texts.

5 Predicting the Past

The digital humanities generally use computational modelling for exploratory data analysis, making use of advancements in the ability to visualise and interactively explore data in a relatively free fashion. Recently, we have witnessed the emergence of new combinations of exploratory data analysis with statistical evidence for discovered patterns. This is popular in the digital humanities too: Klingenstein et al. (2014), for instance, integrate a historical regression analysis into their data visualisations. Our first example above is an instance of exploratory data analysis, using topic modelling and other tools to provide statistical evidence for underlying trends in the documents, as demonstrated earlier.

Models, however, often have another purpose beyond the exploration of data: they are part of predictive analytics. Abbott (2014) is one of the most famous practitioners in the field. For him, predictive analytics works on “discovering interesting and meaningful patterns in data. It draws from several related disciplines, some of which have been used to discover patterns in data for more than 100 years, including pattern recognition, [statistics], machine learning, artificial intelligence, and data mining” (Abbott, 2014).

It is a common misunderstanding to reduce predictive analytics to attempts to predict the future. It is rather about discovering meaningful relationships in any data. Compared to traditional analytics, predictive analytics is driven by the data under observation rather than primarily by human assumptions about the data. The discipline strives to automate modelling and pattern-finding as far as this is possible. In this sense, it moves away from both exploratory and confirmatory data analysis, as it fully considers how computers would process evidence.

O’Neil and Schutt (2013) introduce the idea of predicting the past, which is used to model the effects of electronic health records (EHR) and to set up new monitoring programs for drugs. For O’Neil and Schutt (2013), these integrated datasets were the foundations of novel research attempts to predict the past. They cite the ‘Observational Medical Outcomes Partnership’ (OMOP) in the US, which investigates how good we are at predicting what we already know about drug performance in health using past datasets. Once OMOP had integrated data from heterogeneous sources, it began to look into predicting the past of old drug cases and how effective their treatments were. “Employing a variety of approaches from the fields of epidemiology, statistics, computer science, and elsewhere, OMOP seeks to answer a critical challenge: what can medical researchers learn from assessing these new health databases, could a single approach be applied to multiple diseases, and could their findings be proven?” (O’Neil and Schutt, 2013). Predicting the past thus tries to understand how “well the current methods do on predicting things we actually already know” (O’Neil and Schutt, 2013).

Such a novel approach to past datasets should be of interest to digital history, which could use it to control decisions on how we organise and divide historical records. An existing example that implies predicting past events by joining historical datasets is the identification of historical spatio-temporal patterns of IED usage by the Provisional Irish Republican Army during ‘The Troubles’, used to attribute ‘historical behaviour of terrorism’ (Tench et al., 2016).

In Blanke (2018), we demonstrate how predicting the past can complement and enhance existing work in the digital humanities that is mainly concentrated on exploring gender issues as they appear in past datasets. Blevins and Mullen (2015) provide an expert introduction into why the digital humanities should be interested in predicting genders. Gender values are often missing from datasets and need to be imputed. Predictive analytics can be seen as a corrective to existing data practices, allowing us to predict the genders in a dataset. In Blanke (2018), we compare a traditional dictionary-based approach with two machine learning strategies: first a classification algorithm is discussed, and then three different rule-based learners are introduced. We demonstrate how these rule-based learners are an effective alternative to the traditional dictionary-based method and partly outperform it.

Blanke (2018) develops predicting the past further and presents its differences from other predictive analytics approaches. We follow all the steps of traditional predictive analytics to prepare a stable and reliable model, paying particular attention to avoiding overfitting, one of the main risks in predictive models. An ‘overfitting’ model is one that models the existing training data too closely, which negatively impacts its ability to generalize to new cases. We perform extensive cross-validations to avoid overfitting. Predicting the past, however, differs significantly from other approaches, as the model is not prepared for the future addition of data but is used to analyse existing data. The aim is to understand which (minimal) set of features makes it likely that observation x includes feature y. In Blanke (2018), we aimed to understand which combination of features makes it likely that a historical person is of gender female, male or unknown. The next step in our methodology is therefore to apply the best performing models to the whole dataset again to analyse what gender determinations exist in the data. Is it, e.g., more likely that vagrants were female in London?

The common approaches to gender prediction in the digital humanities use predefined dictionaries of first names and then match the gender of individuals against this dictionary. This has, firstly, the problem that such dictionaries are heavily dependent on the culture and language they relate to. But this is not the only issue, as dictionary-based approaches secondly also assume that errors are randomly distributed: gender trouble is treated as simply a problem of not recording the right gender in the data. Our predictive analytics approach in Blanke (2018), on the other hand, does not make this assumption in advance and judges gender based on the existing data. This has led in turn to interesting insights into why certain genders remain unknown to the models.

In summary, predicting the past is based, firstly, on going through all traditional predictive analytics steps to form a stable model that reflects the underlying historical evidence closely enough but does not overfit. Secondly, we use this stable model to algorithmically analyse historical evidence to gain insights into how a computer would see the relations of evidence.

6 Conclusion and future work

This comparison leads us to the conclusion that, despite the broad applicability of Jenset and McGillivray (2017)’s framework in both cases, some important differences emerge between historical linguistics and history. We discuss two. First of all, the scope of primary sources and their quantitative representation is broader in history, including not only distributional but also categorical, ordinal, and numerical evidence. History requires careful discernment of which is most appropriate, and how they should be combined.

Secondly, the scope for a purely quantitative approach is less broad: quantitative evidence and models can often only contribute to informing hypotheses and claims which rely on qualitative evidence and methods. Often it seems that quantitative methods are only accepted by historical scholars if the claims developed by automated techniques can also be verified qualitatively, through anecdote, quotation and so on. In many fields quantification can be accepted because it creates results which look similar to those produced by qualitative research. But this approach limits the development of methods that use quantification to do more than simply re-frame qualitative [arguments], and instead make statistical arguments about aggregate behaviour in its own right. In the future, we plan to develop these insights further, in order to build a more comprehensive research framework which integrates qualitative and quantitative approaches.

Acknowledgments

This work was supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1. BM is supported by the Turing award TU/A/000010 (RG88751).

References

Abbott, Dean (2014). Applied Predictive Analytics: Principles and Techniques for the Professional Data Analyst. Hoboken, NJ: Wiley.

Blanke, Tobias (2018). Predicting the past. Digital Humanities Quarterly, 12(2).

Blanke, Tobias and Jon Wilson (2017). Identifying epochs in text archives. In 2017 IEEE International Conference on Big Data (Big Data), pages 2219–2224.

Blevins, Cameron and Lincoln Mullen (2015). Jane, John . . . Leslie? A historical method for algorithmic gender prediction. Digital Humanities Quarterly, 9(3).

Carrier, Richard (2012). Proving History: Bayes’s Theorem and the Quest for the Historical Jesus. Amherst, NY: Prometheus Books.

Guldi, Jo and David Armitage (2014). The History Manifesto. Cambridge: Cambridge University Press.

Hitchcock, Tim (2013). Confronting the digital: Or how academic history writing lost the plot. Cultural and Social History, 10(1):9–23.

Jenset, Gard B. and Barbara McGillivray (2017). Quantitative Historical Linguistics: A Corpus Framework. Oxford: Oxford University Press.

Klingenstein, Sara, Tim Hitchcock, and Simon DeDeo (2014). The civilizing process in London’s Old Bailey. Proceedings of the National Academy of Sciences, 111(26):9419–9424. doi:10.1073/pnas.1405984111.

Knuth, Donald E. (2007). Computer programming as an art. In ACM Turing Award Lectures, page 1974. ACM.

Köhler, Reinhard (2012). Quantitative Syntax Analysis. Berlin: de Gruyter Mouton.

Mayrhofer, Manfred (1980). Zur Gestaltung des etymologischen Wörterbuchs einer “Großcorpus-Sprache”. Wien: Österr. Akademie der Wissenschaften, phil.-hist. Klasse.

McCarty, Willard (2004). Modeling: A study in words and meanings. In Susan Schreibman, Ray Siemens, and John Unsworth, eds., A Companion to Digital Humanities, pages 254–270. Malden, MA: Blackwell.

McGillivray, Barbara (2014). Methods in Latin Computational Linguistics. Leiden: Brill.

McGillivray, Barbara, Giovanni Colavizza, and Tobias Blanke (2018). Towards a quantitative research framework for historical disciplines. In COMHUM 2018: Book of Abstracts for the Workshop on Computational Methods in the Humanities 2018. Lausanne, Switzerland.

Moretti, Franco (2013). Distant Reading. London: Verso.

O’Neil, Cathy and Rachel Schutt (2013). Doing Data Science: Straight Talk from the Frontline. Sebastopol, CA: O’Reilly.

Piotrowski, Michael (2012). Natural Language Processing for Historical Texts. San Rafael, CA: Morgan & Claypool.

Tench, Stephen, Hannah Fry, and Paul Gill (2016). Spatio-temporal patterns of IED usage by the Provisional Irish Republican Army. European Journal of Applied Mathematics, 27(3):377–402.

Tosh, John (2015). The Pursuit of History: Aims, Methods and New Directions in the Study of History. London: Routledge, sixth ed.

Zuidema, Willem and Bart de Boer (2014). Modeling in the language sciences. In Robert J. Podesva and Devyani Sharma, eds., Research Methods in Linguistics, pages 428–445. Cambridge: Cambridge University Press.
