Geo-Semantic Exploration of a Very Large Corpus of Danish Folklore Peter M
Total Page:16
File Type:pdf, Size:1020Kb
TrollFinder: Geo-Semantic Exploration of a Very Large Corpus of Danish Folklore Peter M. Broadwell, Timothy R. Tangherlini U.C.L.A. Box 951537, Los Angeles, CA 90095-1537 [email protected], [email protected] Abstract We propose an integrated environment for the geo-navigation of a very large folklore corpus (>30,000 stories). Researchers of traditional storytelling are largely limited to existing indices for the discovery of stories. These indices rarely include geo-indexing, despite a fundamental premise of folkloristics that stories are closely related to the physical environment. In our approach, we develop a representation of latent semantic connections between stories and project these into a map-based navigation and discovery environment. Our preliminary work is based on the pre-existing corpus indices and a shared-keyword index, coupled to an index of geo-referenced places mentioned in the stories. Combining these allows us to produce heat maps of the relationship between places and a first level approximation of the story topics. The heat maps reveal concentrations of topics in a specific place. A researcher can use these topic concentrations as a method for building and refining research questions. We also allow for spatial querying, an approach that allows a researcher to discover topics that are particularly related to a specific place. Our corpus representation can be extended to include multimodal network representations of the corpus and LDA topic models to allow for additional visualizations of latent corpus topics. Keywords: computational folkloristics, GIS, heat maps, text mining, geospatial search 1. Introduction 2. History of the Problem The study of folklore (folkloristics) has, since its The first systematic folklore theory was labeled “the inception, been predicated on the comparison of multiple historic geographic method” (Krohn, 1926), and was variants of a culturally expressive form both across time based on the comparison of a large number of variants of and across space. Within the broader discipline of a single story or motif across a broad geographic area (in folkloristics, the study of traditional narrative—fairy tale, some studies, global), and over a very wide range of time legend, epic, ballad, etc.—has been the main focus of this (in some studies, millennia). The main goal of this approach. Despite the underlying emphasis on the approach was to find the geographic and temporal origin comparison of story variants in time and in space, the for any given type of folkloric expression known as the realization of a meaningful geo-temporal representation of urform, or original form. Consequently, early folklorists a folklore corpus has largely eluded folklorists. were more interested in mapping where a particular In our work, we present a method for navigating a large folklore variant was collected (Anderson, 1923) than they folklore collection (>30,000 individual stories) that were in exploring the place name referents that are a includes visualizations of the frequency with which common feature of folk expressions. In the maps based on topics, keywords or a series of keywords appear in this early folklore method, time was persistent, so that a relationship to a place name. These frequencies are story attested in the ninth century could appear side-by- visualized as heat maps, and can assist a researcher in side with a story variant collected in the late nineteenth developing hypotheses related to (a) the distribution of century. narrative motifs and allomotifs across a geographic area Carl Wilhelm von Sydow called into question the idea (Thompson, 1951; Dundes, 1964), (b) the emergence of that one could use these types of maps to determine locally determined “tradition dominants” (Eskeröd, 1947), anything other than the fact that a motif or story type had (c) tradition participants’ cognitive maps of an area, be it been attested in a particular place at some time (1943), a a country, a region or a smaller locality (Lynch, 1960), view that was refined by Anna Birgitta Rooth several and (d) the interaction between concepts in the overall years later (1951). Their work effectively put an end to corpus and individual repertoires of individuals the use of maps to “discover” the original forms of folk (Tangherlini, 2010. Our approach to corpus visualization expressions. allows the researcher to move beyond the traditional one Recent folklore scholarship has recognized the power of story-one classifier indexing schemes that are standard for geographic representations of aspects of a folklore corpus most folklore collections, and allows her to discover latent as part of a more nuanced description of the dynamics of patterns based on story semantics. These patterns are folklore. This recognition is tempered with an visualized on top of historically relevant maps. Finally, understanding that these maps show distributions of topics our implementation of spatial querying allows a as related to tradition participants’ conceptualizations of researcher interested in a specific area, or type of area where various phenomena (ghosts, trolls, elves, witches, (e.g. the heath), to develop queries that return latent etc.) exist. In short, geographic representations of a semantic patterns visualized as heat maps as a result of the folklore corpus are excellent tools for understanding the queries. distribution of topics and motifs across an area, and cannot be used to answer questions about the origins of any particular expression. Understanding the distribution of topics and motifs can help researchers develop informants and place names mentioned in stories for each sophisticated, geographically aware research questions of these collections were also scanned, aligned and based on a consideration of latent patterns in the corpus as integrated. This integration of the indices allowed us to a whole. Franco Moretti (2000) labels this type of broad develop a set of metadata for each story, including story approach “distant reading.” source (informant name), story collection place, and Below, we illustrate how heat maps, derived from (a) the places mentioned in the story. indexed representation of a folklore collection and (b) a All of the place names related to stories—places of semantic indexing of that same collection can assist in collection and places mentioned—were subsequently geo- developing a distant reading approach to a folklore referenced, at least to the level of parish (in the late corpus. The maps reveal interesting patterns of nineteenth century, Denmark was organized into amt distribution for closely related topics that can inform a [county], herred [district] and sogn [parish]). The geo- historically conditioned reading of the tradition. Why referencing of these place names was a three-part process. were witches so commonly associated with Grinderslev? First, place names that could be unambiguously matched What may lie behind the close spatial relationship to an existing gazetteer of place names adjusted to align between mentions of cunning folk and ministers? The with late nineteenth century orthographic conventions researcher can subsequently explore these questions on a were automatically assigned the corresponding geo- “close reading” level, by drilling down to the underlying reference. Second, place names that were accompanied stories, thereby combining the benefits of pattern with a topographic reference number from an earlier index discovery across the entire corpus with specific of Danish parish names (Skjelborg, 1967) were assigned interrogations of individual corpus occurrences. to the corresponding parish’s coordinates. Since parishes generally have a radius of approximately five kilometers, 3. Corpus Selection and Preparation this lack of absolute precision does not overly distort the results of visualizations and spatial queries. Finally, place Over the past decade, various folklore archives have names that could not be assigned coordinates by these two begun digitizing their holdings, most notably the Finnish methods were assigned coordinates in a semi-supervised Literary Society and the Dutch Folktale Database process using DDupe (Bilgic et al., 2006). (Venbrux & Meder, 2004). As part of this process, many As part of our meta-data representation of each of the archives have geo-encoded place names associated both stories, we included the topic index assignments from the with the collection of a specific item as well as the place original collections’ topic indices. As with many printed names mentioned in that item. This geo-encoding of story folklore collections, these indices are constrained by the related place names is a critical step in the realization of a limitation that each story can only have one classification. geographically aware “distant reading” approach to Consequently, stories that span multiple topics can easily folklore study. be misclassified, making it difficult for researchers to Our work is based on the largest collection of folklore discover these stories. In earlier work, we have described produced by a single person. This Danish collection, the a multi-modal network-based classification scheme that Evald Tang Kristensen collection, is housed at the Danish allows for the discovery of stories that span multiple Folklore Archives at the Royal Library in Copenhagen, classifications (Abello, Broadwell & Tangherlini, 2012). and comprises approximately 24,000 hand-written In this work, we avoid many of the pitfalls of the one manuscript pages. Tang Kristensen,