<Web>Scraping Parchment
Háskóli Íslands
Hugvísindasvið
Medieval Icelandic Studies

<Web>Scraping Parchment
Investigating Genre through Network Analysis of the Electronic Manuscript Catalogue Handrit.is

Ritgerð til MA-prófs í Medieval Icelandic Studies
Mathias Blobel
Kt.: 1601865939
Leiðbeinandi: Emily Lethbridge
September 2015

Abstract

The goal of this thesis is to establish a methodology of utilising network analysis to investigate ethnic genre categories in the corpus of Old Icelandic literature. Unlike other such attempts, which focus on the similarities of different texts, it approaches the problem from the perspective of manuscripts. Texts that frequently appear in manuscripts together are assumed to have been thought of as generically similar. Communities in a network of manuscripts connected by the co-occurrence of texts can therefore be used to establish groupings of generic significance. The necessary data is obtained through web scraping of the Handrit.is database. A network is assembled from this and subjected to several community detection algorithms. The result is a system of categories of literature that could arguably represent a glimpse into the medieval mind.

Markmið þessarar loka ritgerðar er að koma á fót aðferðafræði sem nýtir netgreiningu til að rannsaka flokkunarhópa í safni forníslenskra bókmennta. Ólíkt öðrum slíkum tilraunum, sem hafa einblínt á líka þætti innan ólíkra texta, er efnið nálgast frá sjónarhóli handrita. Textar sem oft birtast saman í handritum eru taldir hafa verið álitnir svipaðir. Slíkur samanburður texta myndar þar af leiðandi samfélög innan netkerfis sem hægt er að nota til að gera flokka eða hópa textanna. Nauðsynlegum gögnum er safnað gegnum vefhremmingu af gagnagrunni Handrit.is. Úr þeim gögnum er byggt netkerfi sem á er beitt nokkrum samfélagsgreiningaralgrímum. Útkoman er kerfi bókmenntaflokka sem færa má rök fyrir að gefi innsýn í hugarheim miðalda manna.
Acknowledgments

Like any human endeavour, an MA thesis cannot be created in isolation (or at least it wouldn’t be half as good if it were). I am therefore indebted to the following people (in no particular order): Guðvarður Már Gunnlaugsson, Haukur Þorgeirsson and Örn Hrafnkelsson for answering endless (and probably tedious) questions about Handrit.is (Which is an amazing tool into which has gone a lot of hard work. I hope my criticism of some of its components does not sound too harsh. It isn’t intended to be). Beeke Stegmann for answering questions about Árni’s habit of butchering manuscripts. The organisers and audience of the Aarhus student conference for valuable comments and encouragement and especially Luke for introducing me with my nom de guerre. The organisers and teachers of the VMN and MIS programmes at HÍ for having made this one of the best experiences of my life, academic and otherwise. Emily Lethbridge for agreeing to supervise this rather unorthodox thesis and being genuinely interested in it, as well as for very valuable comments. Stéfania Andersen Áradóttir for translation help. Katie Thorn for going above and beyond in proofreading. My parents for supporting me even through this second Master’s in an even more obscure field. And finally, all of my friends for encouragement, coffee (and other beverages), and diversions. Seriously, thanks guys!

Contents

1. Introduction
2. Genre and Manuscripts - Old Norse texts in their context
3. Handrit.is and Network Analysis
3.1 Handrit.is
3.2 Network Analysis
3.3 Methodology
3.3.1 General Methodology
3.3.2 The Data
3.3.3 Scraping
3.3.4 Parsing and Network Analysis
3.3.5 Problems and Potential Sources of Error
4. Analysis and Results
4.1 Identifying Unusual Manuscripts - Basic network statistics and betweenness centrality
4.2 The General Structure of the Network - Spring-embedded clustering
4.3 Tackling Fragmentation - Reduction by edge-weight
4.4 Identifying Specific Groups - Markov Chain Clustering
4.5 The Time Component - Networks by year
4.6 Synthesis of Analysis Results
5. Conclusion
Bibliography
Software
Appendices
Figures

1. Introduction

The use of digital technology has a (maybe somewhat surprisingly) long tradition in medieval studies.1 In Old Norse studies, however, these projects have been largely confined to the worthy projects of digital editions of texts and manuscripts.2 While some forays have been made, such as digital mapping3 and automatic stemma generation,4 not many projects in Old Norse scholarship use large-scale data sources for digital analysis and visualisation. There are two reasons for this. On the one hand there is always a certain reluctance in literary criticism to embrace quantitative and statistical methods. On the subject of quantitative codicology Gumbert remarks that “there are [those] who are constitutionally unable to handle numbers, and who are physically paralysed at sight of a formula”. He goes on to decry a fear “that quantitativists are trying to take over and to replace good, traditional, humanistic ways of work by their own mechanical activities”.5 Such harsh words are hardly necessary, but quantitative approaches are certainly seen with some scepticism in literary history. The other reason for its slow adoption is that using quantitative methodology requires a sizeable investment of time: not only for the acquisition of the necessary skills, but also for the entering of sufficient data in order to have a dataset of a size that actually allows meaningful interpretation.
Ideally such datasets would be assembled as part of larger projects and be freely available to scholars for interpretation. This is not yet common for Old Norse subjects; where data is available it has not been assembled with quantitative analysis in mind and is therefore rather hard to exploit in that manner.

One such dataset is the Handrit.is catalogue of manuscripts. It contains a large amount of information about individual manuscripts but, since the entries are not linked consistently (and the database does not offer an API, see below), quantitative information cannot be extracted directly. Any such information must therefore be obtained by crawling the database by software means and reconstructing connections from the obtained data. As it is indexing the most important collections of Old Norse manuscripts, the data on Handrit.is is the closest thing to a corpus of Old Icelandic literature available in a digital format.6 Through the technique of web scraping the data contained within it becomes available for a wide variety of analyses.

1 See Unsworth 2012.
2 For example the Medieval Nordic Text Archive (www.menota.org), the Skaldic Project (http://abdn.ac.uk/skaldic/), and the Stories for all Times: The Icelandic Fornaldarsögur project (http://fasnl.ku.dk/).
3 http://sagamap.hi.is/
4 Hall and Parsons 2013.
5 Gumbert 2004, 525.
6 And to some extent Danish.

By utilising network analysis, hidden structures can be uncovered in the dataset. When building a network with individual manuscripts as nodes and connections between them based on the number of texts they share, communities of manuscripts can be identified that contain similar kinds of texts even if they do not all share the same one.
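The construction just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the thesis's implementation: the shelfmarks and text titles are invented toy data, the function names `build_network` and `components` are hypothetical, and plain connected components (optionally after dropping low-weight edges) stand in for the community detection algorithms applied later in the thesis.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: manuscript shelfmark -> set of text titles.
# Shelfmarks and contents are invented for illustration only.
manuscripts = {
    "MS A": {"Saga 1", "Saga 2", "Saga 3"},
    "MS B": {"Saga 2", "Saga 3"},
    "MS C": {"Saga 3", "Law 1"},
    "MS D": {"Law 1", "Law 2"},
    "MS E": {"Psalter"},
}


def build_network(contents):
    """Return {(ms1, ms2): weight}, where weight = number of shared texts."""
    # Invert the mapping: text -> manuscripts that contain it.
    by_text = defaultdict(set)
    for ms, texts in contents.items():
        for text in texts:
            by_text[text].add(ms)
    # Every pair of manuscripts holding the same text gains one unit of weight.
    edges = defaultdict(int)
    for holders in by_text.values():
        for a, b in combinations(sorted(holders), 2):
            edges[(a, b)] += 1
    return dict(edges)


def components(nodes, edges, min_weight=1):
    """Group nodes into connected components, ignoring edges below min_weight.

    A deliberately simple stand-in for real community detection.
    """
    neighbours = {n: set() for n in nodes}
    for (a, b), weight in edges.items():
        if weight >= min_weight:
            neighbours[a].add(b)
            neighbours[b].add(a)
    seen, groups = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbours[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


edges = build_network(manuscripts)
groups = components(manuscripts, edges)
strict_groups = components(manuscripts, edges, min_weight=2)
```

In this toy data "MS A" and "MS D" share no text, yet they fall into the same group through "MS C", which is the sense in which a community can hold similar kinds of texts without every member sharing the same one. Raising `min_weight` keeps only the more strongly connected pairs.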
If one assumes that manuscript compilers in general combined texts that they believed should belong together, these communities should mostly correspond to the categories the compilers had in mind. Since they are clusters of manuscripts, not texts, a single text can be in more than one community. This means that the clusters should be close analogies to the categories of literature in which a manuscript compiler would have thought, even if they themselves would not necessarily have clearly defined and delineated categories. This is what Joseph Harris has called “ethnic genre”.7

This thesis is an attempt to develop and apply a methodology of web scraping and network analysis to investigate these categories in the corpus of medieval Icelandic manuscripts as represented in Handrit.is. Its goal is not so much to propose a new analytical system of genre for Old Icelandic literature. Nor does it aim, like some similar studies, to look at a manuscript or a small group of manuscripts. Rather it tries to extract general trends of genre and collection interests out of as large a part of the medieval Icelandic manuscript corpus as is reconstructable. It tries to meet three goals: showing the usefulness of a web scraping approach on datasets such as the one contained in Handrit.is, even if they weren’t originally compiled with quantitative analysis in mind; developing a methodology of utilising the co-occurrence of texts in manuscripts as indicators of ethnic genre through network analysis; and applying this methodology to data from Handrit.is.

2. Genre and Manuscripts - Old Norse texts in their context

The question of genre is one that implicitly underlies a great deal of the discussion in the study of medieval literature but that is only infrequently discussed explicitly. In Old Norse studies, as in the study of other European medieval literature, the genre categories used by modern scholarship evolved over time without explicit theorising.
Hans Robert Jauss has attempted to develop a theory of the genre of medieval European literature, and especially that in the Romance languages. He understands genre not as one fixed category a work has to fit in but rather a series of traits, which are more or less dominant. The genre of a work is then the most dominant trait, which can also define the work on its own.8 Other productive theoretical approaches to genre theory in medieval studies have been the application of Bakhtinian theory9 and, deriving from that, Even-Zohar’s Polysystem theory.

7 Harris 195.
8 Jauss 1982, 8-82.
9 For an overview of Bakhtinian theory as relating to Old Norse studies see Phelpstead 200, 3–5.

However, in general the