Exploring Cities in Crime: Significant Concordance and Co-occurrence in Quantitative Literary Analysis


Janneke Rauscher (1), Leonard Swiezinski (2), Martin Riedl (2), Chris Biemann (2)
(1) Johann Wolfgang Goethe University, Grüneburgplatz 1, 60323 Frankfurt am Main, Germany
(2) FG Language Technology, Dept. of Computer Science, Technische Universität Darmstadt, 64289 Darmstadt, Germany
[email protected], [email protected], {riedl,biem}@cs.tu-darmstadt.de

In: Proceedings of the Second Workshop on Computational Linguistics for Literature, pages 61-71, Atlanta, Georgia, June 14, 2013. © 2013 Association for Computational Linguistics

Abstract

We present CoocViewer, a graphical analysis tool for the purpose of quantitative literary analysis, and demonstrate its use on a corpus of crime novels. The tool displays words and their significant co-occurrences, and contains a new visualization for significant concordances. Contexts of words and co-occurrences can be displayed. After reviewing previous research and current challenges in the newly emerging field of quantitative literary research, we demonstrate how CoocViewer allows comparative research on literary corpora in a project-specific study, and how we can confirm or enhance our hypotheses through quantitative literary analysis.

1 Introduction

Recent years have seen a surge in Digital Humanities research. This area, touching on both the fields of computer science and the humanities, is concerned with making data from the humanities analysable by digitalisation. For this, computational tools such as search, visual analytics, text mining, statistics and natural language processing aid the humanities researcher. On the one hand, software permits processing a larger set of data in order to assess traditional research questions. On the other hand, this gives rise to a transformation of the way research is conducted in the humanities: the possibility of analyzing a much larger amount of data - yet in a quantitative fashion with all its necessary aggregation - opens the path to new research questions, and different methodologies for attaining them.

Although the number of research projects in Digital Humanities is increasing at a fast pace, we still observe a gap between traditional humanities scholars on the one side and computer scientists on the other. While computer science excels at crunching numbers and providing automated processing for large amounts of data, it is hard for the computer scientist to imagine what research questions form the discourse in the humanities. Conversely, humanities scholars have a hard time imagining the possibilities and limitations of computer technology, how automatically generated results ought to be interpreted, and how to operationalize automatic processing in a way that its unavoidable imperfections are more than compensated by the sheer size of analysable material.

This paper resulted from a successful cooperation between a natural language processing (NLP) group and a literary researcher in the field of Digital Humanities. We present the CoocViewer analysis tool for literary and other corpora, which supports new angles in literary research through quantitative analysis.

In Section 2, we describe the CoocViewer tool and review the landscape of previously available tools for our purpose. As a unique characteristic, CoocViewer contains a visualisation of significant concordances, which is especially useful for target terms of high frequency. In Section 3, we map the landscape of previous and current quantitative research in literary analysis, which is still an emerging and somewhat controversial sub-discipline.
A use case for the tool in the context of a specific project is laid out in Section 4, where a few examples illustrate how CoocViewer is used to confirm and generate hypotheses in literary analysis. Section 5 concludes and provides an outlook on further needs in tool support for quantitative literary research.

2 CoocViewer - a Visual Corpus Browser

This section describes our CoocViewer visual corpus browsing tool. After shortly outlining the necessary pre-processing steps, we illustrate and motivate the functionality of the graphical user interface. The tool was specifically designed to aid researchers from the humanities who do not have a background in computational linguistics.

2.1 Related Work

Whereas there exist a number of tools for visualizing co-occurrences, there is, to the best of our knowledge, no tool to visualize positional co-occurrences, or as we also call them, significant concordances. In (Widdows et al., 2002), tools are presented that visualize the meanings of nouns as vector space representations, using LSA (Deerwester et al., 1990), and as graph models using co-occurrences. There is also a range of text-based tools without any quantitative statistics, e.g. Textpresso (Müller et al., 2004), PhraseNet (http://www-958.ibm.com/software/data/cognos/manyeyes/page/Phrase_Net.html) and Walden (http://infomotions.com/sandbox/network-diagrams/bin/walden/). For searching words in context, Luhn (1960) introduced KWIC (Key Word In Context), which allows searching for concordances and is also used in several corpus-linguistic tools, e.g. (Culy and Lyding, 2011), BNCWeb (http://bncweb.lancs.ac.uk/bncwebSignup/user/login.php), Sketch Engine (Kilgarriff et al., 2004), Corpus Workbench (http://cwb.sourceforge.net) and MonoConc (Barlow, 1999). Although several tools for co-occurrence visualization exist (see e.g. the co-occurrences for over 200 languages at LCC, http://corpora.uni-leipzig.de/), they often have different aims and, for example, do not deliver the functionality to filter on different part-of-speech tags.

2.2 Corpus Preprocessing

To make a natural language corpus accessible in the tool, a number of preprocessing steps have to be carried out to produce the contents of CoocViewer's database. These steps consist of a fairly standard natural language processing pipeline, which we describe shortly. After tokenizing, part-of-speech tagging (Schmid, 1994) and indexing the input data by document, sentence and paragraph within the document, we compute significant sentence-wide and paragraph-wide co-occurrences, using the tinyCC tool (http://wortschatz.uni-leipzig.de/~cbiemann/software/TinyCC2.html; Richter et al., 2006). Here, the log-likelihood test (Dunning, 1993) is employed to determine the significance sig(A, B) of the co-occurrence of two tokens A and B. To support the significant concordance view (described in the next section), we have extended the tool to also produce positional significant co-occurrences, where sig(A, B, offset) is computed as the log-likelihood significance of the co-occurrence of A and B at a token distance of offset. Since the significance measure requires the single frequencies of A and B, as well as their joint frequency per positional offset in this setup, this adds considerable overhead during preprocessing. To our knowledge, we are the first to extend the notion of positional co-occurrence beyond direct neighbours, cf. (Richter et al., 2006). We apply a significance threshold of 3.84 (corresponding to a 5% error probability) and a frequency threshold of 2 to keep only 'interesting' pairs.

The outcome of preprocessing is stored in a MySQL database schema similar to the one used by LCC (Biemann et al., 2007). We store sentence- and paragraph-wide co-occurrences and positional co-occurrences in separate database tables, and use one database per corpus. The database tables are indexed accordingly to optimize the queries issued by the CoocViewer tool. Additionally, we map the part-of-speech tags to E (proper names), N (nouns), A (adjectives), V (verbs) and R (all other part-of-speech tags) for a uniform representation across different languages.
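To make the scoring concrete, the following minimal sketch shows how sig(A, B, offset) can be computed with Dunning's log-likelihood test. It is an illustration under our own assumptions, not the authors' tinyCC implementation: the corpus reading, the whitespace tokenization and all names are ours; only the test statistic itself, the significance threshold of 3.84 and the frequency threshold of 2 follow the description above.

```python
import math
from collections import Counter

def llr(k11, k12, k21, k22):
    """Dunning's (1993) log-likelihood ratio for a 2x2 contingency table."""
    def f(*counts):
        total = sum(counts)
        return sum(k * math.log(k / total) for k in counts if k > 0)
    return 2.0 * (f(k11, k12, k21, k22)       # all cells
                  - f(k11 + k12, k21 + k22)   # row sums
                  - f(k11 + k21, k12 + k22))  # column sums

def positional_cooccurrences(sentences, max_offset=5,
                             sig_threshold=3.84, freq_threshold=2):
    """Score (A, B, offset) triples; keep pairs above both thresholds.
    Only positive offsets are counted: sig(A, B, -k) = sig(B, A, +k)."""
    unigrams, pairs, n = Counter(), Counter(), 0
    for sentence in sentences:
        tokens = sentence.split()   # assumes pre-tokenized input
        n += len(tokens)
        unigrams.update(tokens)
        for i, a in enumerate(tokens):
            for off in range(1, min(max_offset, len(tokens) - 1 - i) + 1):
                pairs[(a, tokens[i + off], off)] += 1
    sig = {}
    for (a, b, off), k11 in pairs.items():
        if k11 < freq_threshold:
            continue
        # n (corpus size) approximates the number of observations
        score = llr(k11, unigrams[a] - k11, unigrams[b] - k11,
                    n - unigrams[a] - unigrams[b] + k11)
        if score >= sig_threshold:
            sig[(a, b, off)] = score
    return sig
```

Sentence-wide and paragraph-wide co-occurrences are scored with the same statistic, with joint frequencies counted per sentence or paragraph instead of per positional offset.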
2.3 Graphical User Interface

The graphical user interface (UI) is built with common web technologies such as HTML, CSS and JavaScript. The UI communicates via AJAX with a backend, which utilizes PHP and a MySQL database. This makes the approach flexible regarding the platform: it can run on client computers using XAMPP, a portable package of various web technologies including an Apache web server and a MySQL server, or, alternatively, the tool can operate as a client-server application over a network. In particular, we want to highlight the JavaScript data visualization framework D3 (Bostock et al., 2011), which was used to lay out and draw the graphs. We deliberately designed the tool to match the requirements of literary researchers, who are at times overwhelmed by general-purpose visualisation tools such as Gephi.

The UI is split into three parts: at the top a menu bar, including a search input field and search options; a graph drawing panel; and display options at the bottom of the page. Words can be filtered regarding their part-of-speech tags (as described in Section 2.2). The graph drawing panel visualizes the queried term and its significant co-occurrences resp. concordances, with significance visualized by the thickness of the lines. In Figure 1, showing the concordances for bread, we can directly see words that often occur with bread in context: e.g. bread is often used in combination with butter, cheese and margarine (offset +2), while the different kinds of bread are described by the adjectives at offset -1. To obtain the same information from a normal KWIC view, one would have to count the words at the different offsets by hand to find properties of the term bread. At the bottom, the maximal number of words shown in the graph can be specified; for the concordance display, there is an additional option to specify the maximal offset. The original text (with provenance information) containing the nodes (words) or edges (word pairs) shown in the graph can be retrieved by clicking on a word itself or on the edge connecting two words; it is displayed in a new window within the application (see Figure 2).
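The paper does not publish the database schema or the backend queries, so the following sketch only indicates the kind of lookup the concordance view plausibly issues: the table and column names (pos_cooc, word1, word2, offset, freq, sig) are hypothetical, while the maximal-offset and maximal-word-count options mirror the display options described above. We use sqlite3 merely to keep the example self-contained; CoocViewer itself queries MySQL from PHP.

```python
import sqlite3

# Hypothetical schema standing in for the MySQL tables described above:
#   pos_cooc(word1, word2, offset, freq, sig)
con = sqlite3.connect("crime_novels.db")

def concordance_graph(term, max_offset=3, max_words=20):
    """Fetch the most significant positional co-occurrences of `term`,
    as drawn in the concordance view (edge thickness ~ sig)."""
    return con.execute(
        """SELECT word2, offset, sig
           FROM pos_cooc
           WHERE word1 = ? AND ABS(offset) <= ?
           ORDER BY sig DESC
           LIMIT ?""",
        (term, max_offset, max_words),
    ).fetchall()

for word, offset, sig in concordance_graph("bread"):
    print(f"bread {offset:+d} {word}: sig={sig:.1f}")
```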