Cover photo: 'Vilnius castle tower by night' by Mantas Volungevičius, http://www.flickr.com/photos/112693323@N04/13596235485/. Licensed under Creative Commons Attribution 2.0 Generic; see http://creativecommons.org/licenses/by/2.0/ for full terms.

Cover design: Nils Blomqvist

Proceedings of the Workshop on Innovative Corpus Query and Visualization Tools at NODALIDA 2015

Editors: Gintarė Grigonytė, Simon Clematide, Andrius Utka and Martin Volk

May 11-13, 2015 Vilnius, Lithuania

Published by

Linköping University Electronic Press, Sweden
Linköping Electronic Conference Proceedings #111
ISSN: 1650-3686
eISSN: 1650-3740
NEALT Proceedings Series 25
ISBN: 978-91-7519-035-8

Copyright

The publishers will keep this document online on the Internet – or its possible replacement – from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law, the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

Linköping University Electronic Press
Linköping, Sweden, 2015
Linköping Electronic Conference Proceedings, No. 111
ISSN: 1650-3686
eISSN: 1650-3740
URL: http://www.ep.liu.se/ecp_home/index.en.aspx?issue=111
NEALT Proceedings Series, Vol. 25
ISBN: 978-91-7519-035-8

© The Authors, 2015

Preface

Recent years have seen an increased interest in and availability of many different kinds of corpora. These range from small, but carefully annotated treebanks to large parallel corpora and very large monolingual corpora for big data research.

It remains a challenge to offer flexible and powerful query tools for multilayer annotations of small corpora. When dealing with large corpora, query tools also need to scale in terms of processing speed and reporting through statistical information and visualization options. This becomes evident, for example, when dealing with very large corpora (such as complete Wikipedia corpora) or multi-parallel corpora (such as Europarl or JRC Acquis).

The QueryVis workshop has gathered researchers who develop and evaluate new corpus query and visualization tools for linguistics, language technology and related disciplines. The papers focus on the design of query languages, and on various new visualization options for monolingual and parallel corpora, both for written and spoken language.

We hope that QueryVis will stimulate discussions and trigger new ideas for the workshop participants and any reader of the proceedings. The preparation of the workshop and the reviewing of the submissions has already been an inspiring experience.

All papers were peer-reviewed by three program committee members. We would like to thank all reviewers and contributors for their work and for sharing their thoughts and experiences with us.

Let us all join our forces to make corpus exploration a rewarding, entertaining, and exciting experience which will grant us ever new insights into language and thought.

Zürich, May 4, 2015

Gintarė Grigonytė
Simon Clematide
Andrius Utka
Martin Volk

Program Committee

Janne Bondi Johannessen, University of Oslo
Noah Bubenhofer, University of Zurich
Simon Clematide, University of Zurich
Johannes Graën, University of Zurich
Gintarė Grigonytė, Stockholm University
Miloš Jakubíček, Lexical Computing Ltd.
Andrius Utka, Vytautas Magnus University
Martin Volk, University of Zurich
Robert Östling, Stockholm University

Table of Contents

KoralQuery - A General Corpus Query Protocol ...... 1
Joachim Bingel and Nils Diewald

Reflections and a Proposal for a Query and Reporting Language for Richly Annotated Multiparallel Corpora ...... 6
Simon Clematide

Interactive Visualizations of Corpus Data in ..... 17
Lucia Kocincová, Vít Baisa, Miloš Jakubíček and Vojtěch Kovář

Visualisation in speech corpora: maps and waves in the Glossa system ...... 23
Michał Kosek, Anders Nøklestad, Joel Priestley, Kristin Hagen and Janne Bondi Johannessen

ParaViz: A vizualization tool for crosslinguistic functional comparisons based on a parallel corpus ...... 32
Ruprecht von Waldenfels

KoralQuery – a General Corpus Query Protocol

Joachim Bingel, Nils Diewald
Institut für Deutsche Sprache
Mannheim, Germany
{bingel,diewald}@ids-mannheim.de

Abstract

The task-oriented and format-driven development of corpus query systems has led to the creation of numerous corpus query languages (QLs) that vary strongly in expressiveness and syntax. This is a severe impediment for the interoperability of corpus analysis systems, which lack a common protocol. In this paper, we present KoralQuery, a JSON-LD based general corpus query protocol, aiming to be independent of particular QLs, tasks and corpus formats. In addition to describing the system of types and operations that KoralQuery is built on, we exemplify the representation of corpus queries in the serialized format and illustrate use cases in the KorAP project.

1 Introduction

In the past, several corpus query systems have been developed with the purpose of exploring and providing access to text corpora, often under the assumption of specific linguistic questions that the annotated corpora have been expected to help answer. This task-oriented and format-driven development has led to the creation of several distinct corpus query languages (QLs), including those mentioned in Section 3. Such QLs vary strongly in expressiveness and usability (Frick et al., 2012).

This brings several unpleasant consequences both for researchers and developers. For instance, the researcher who uses a particular system must formulate her queries in no other QL than the one used for this system, which might require additional training prior to the actual research. It might even be the case that certain research questions cannot be answered due to limitations of the QL, while the actual query system and the underlying corpus data could in fact provide results. For developers, the lack of a common protocol prevents interoperability between different query systems, for instance to forward user requests from one system to another, which may have access to additional resources.

In this paper, we present KoralQuery, a general protocol for the representation of requests to corpus query systems independent of a particular query language. KoralQuery provides an extensible system of different linguistic and metalinguistic types and operations, which can be combined to represent queries of great complexity. Several query languages can thus be mapped to a common representation, which lets users of query systems formulate queries in any of the QLs for which such a mapping is implemented (cf. Section 4). Further benefits of KoralQuery include the dynamic definition of virtual corpora and the possibility to simultaneously access several, concurrent layers of annotation on the same primary textual data.

2 Related Work

In former publications, KoralQuery was introduced as a unified serialization format for CQLF[1] (Bański et al., 2014), a companion effort focussing on the identification and theoretical description of corpus query concepts and features.

Another approach to a common query language that is independent of tasks and formats is CQL (Contextual Query Language) (OASIS Standard, 2013), with its XML serialization format XCQL.[2] KoralQuery differs from CQL in focussing on queries of linguistic structures, and separating document and span query concepts (see Section 3).

[1] CQLF is short for Corpus Query Lingua Franca, which is part of the ISO TC37 SC4 Working Group 6 (ISO/WD 24623-1).
[2] Like KoralQuery, XCQL is not meant to be human readable, but to represent query expressions as machine readable tree structures. For various compilers from CQL to XCQL, see http://zing.z3950.org/cql/; last accessed 27 April 2015.
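To make this common representation concrete before Section 3 walks through it, here is a small Python sketch that assembles a KoralQuery-style request as a plain JSON document. The doc_constraint helper is our own illustration and not part of any KorAP component; only the koral:doc structure and its attribute names follow the serialization examples given in this paper.

```python
import json

# Hypothetical helper (our own, not part of KorAP): build a single
# "koral:doc" metadata constraint as used in KoralQuery's
# "collection" attribute (cf. Figure 1 in Section 3.1).
def doc_constraint(key, value, value_type="type:string", match="match:eq"):
    return {
        "@type": "koral:doc",
        "key": key,
        "value": value,
        "type": value_type,
        "match": match,
    }

request = {
    "@context": "http://korap.ids-mannheim.de/ns/koral/0.3/context.jsonld",
    # Restrict the virtual collection to documents published
    # on or after 2005-05-25 ("geq" = greater or equal).
    "collection": doc_constraint("pubDate", "2005-05-25",
                                 value_type="type:date",
                                 match="match:geq"),
    "query": {},
}

print(json.dumps(request, indent=2))
```

Because the request is ordinary JSON, any component of a processing pipeline can inspect or extend it without knowing which query language it originated from.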

3 Query Representation

KoralQuery is serialized to JSON-LD (Sporny et al., 2014), a JSON (Crockford, 2006) based format for Linked Data, which makes it possible for corpus query systems to interoperate by exchanging the common protocol.[3] JSON-LD relies on the definition of object types via the @type keyword, thus informing processing software of the attributes and values that a particular object may hold. As can be seen in the example serializations in this section (see Fig. 1-3), KoralQuery makes use of the @type keyword to declare query object types. Those types fall into different categories that we introduce in the remainder of this section.[4]

While KoralQuery aims to express as many different linguistic and metalinguistic query structures as possible, it currently guarantees to represent types and operations defined in Poliqarp QL (Przepiórkowski et al., 2004), COSMAS II QL (Bodmer, 1996) and ANNIS QL (Rosenfeld, 2010). In addition, the protocol comprises a subset of the elements of CQL (OASIS Standard, 2013).

As JSON-LD objects can reference further namespaces (via the @context attribute), KoralQuery is arbitrarily extensible.

3.1 Document Queries

KoralQuery allows the specification of metadata constraints that act as filters for virtual collections using the collection attribute. Those metadata constraints, so-called collection types, serve a dual purpose: Besides the obvious benefit of allowing users to restrict their search via dynamic sampling to documents that meet specific requirements on metadata such as publication date, authorship or genre, they can be used to control access to texts that the user has no permission to read (cf. Sec. 3.3).

A single metadata constraint is called a basic collection type, and defines a metadata field, a value and a match modifier, for example to negate the constraint. Basic collection types can be combined using boolean operators (AND and OR) to recursively form complex collection types. The result of a collection type is a collection of documents which satisfy the encoded constraint (or combination of constraints), for instance all documents that were published after a certain date or that contain a certain string of characters in their title. Figure 1 illustrates the serialization of a simple virtual collection definition.

    {
      "@context": "http://korap.ids-mannheim.de/ns/koral/0.3/context.jsonld",
      "collection": {
        "@type": "koral:doc",
        "key": "pubDate",
        "value": "2005-05-25",
        "type": "type:date",
        "match": "match:geq"
      },
      "query": {}
    }

Figure 1: KoralQuery serialization for a virtual collection that is restricted to documents with a pubDate of greater or equal than 2005-05-25.

3.2 Span Queries

To find occurrences of particular linguistic structures in corpus data (possibly restricted through the aforementioned document queries), KoralQuery uses the attribute query, under which it registers objects of specific, well-defined types. Those objects, along with their hierarchical organization, represent the linguistic query issued by the user.[5]

The intended generic usability of KoralQuery demands a high degree of flexibility in order to cover as many linguistic phenomena and theories as possible. It must therefore be maximally independent of, and neutral with regard to,

(i) the type and structure of linguistic annotation on the text data,
(ii) the choice of specific tag sets, e.g. for part-of-speech annotations or dependency labels.

KoralQuery achieves this neutrality by instantiating distinct linguistic types as abstract structures which can flexibly address different sources and layers of linguistic annotation at the same time. Linguistic patterns of greater complexity can be defined by using a modular system of nestable types and operations, drawing on various familiar search technologies and formalisms, including concepts from regular expressions, XML tree traversal, boolean search and relational database queries.

The nesting principle of KoralQuery states that objects describing linguistic structures in the corpus data, so-called span types, may be embedded in parental objects to recursively describe complex linguistic structures, thus forming a single-rooted tree.

Span types may be further sub-classified into basic and complex types. Basic span types denote linguistic entities such as words, phrases and sentences that are annotated in the corpus data. The result of such a span type is a text span, which in turn is defined through a start and an end offset with respect to the primary text data. Complex span types define linguistic or result-modifying operations on a set of embedded span types, which thus act as arguments (or operands) of the relation and pass their resulting text spans on to the parent operation.[6] Such operations may express syntactic relations or positional constraints between spans.

Figure 2, for example, represents a span query of two koral:token objects (basic span types) each wrapping a single koral:term object, whose resulting text spans are required to be in a sequence (i.e. follow each other immediately in the order they appear in the list), as formulated by the operation:sequence in the embedding koral:group object (a complex span type).

    {
      "@context": "http://korap.ids-mannheim.de/ns/koral/0.3/context.jsonld",
      "collection": {},
      "query": {
        "@type": "koral:group",
        "operation": "operation:sequence",
        "operands": [{
          "@type": "koral:token",
          "wrap": {
            "@type": "koral:termGroup",
            "relation": "relation:and",
            "operands": [{
              "@type": "koral:term",
              "foundry": "tt",
              "key": "ADJA",
              "layer": "pos",
              "match": "match:eq"
            }, {
              "@type": "koral:term",
              "foundry": "cnx",
              "key": "@PREMOD",
              "layer": "syn",
              "match": "match:eq"
            }]
          }
        }, {
          "@type": "koral:token",
          "wrap": {
            "@type": "koral:term",
            "key": "octopus",
            "layer": "lemma",
            "match": "match:eq"
          }
        }]
      }
    }

Figure 2: KoralQuery serialization for a premodifying adjective followed by the lemma octopus. The dual constraint on the first token (adjective and premodifying) is reflected by the koral:termGroup, which expresses a conjunction of the two koral:term objects. The different values for foundry indicate that different annotation sources are addressed.

Leaf objects of the span query tree structure may either be basic span types or parametric types, containing specific information that is requested for certain span types. They are intended to normalize the usage and representation of similar or equal parameters used across different types. The koral:term objects in Figure 2, which express constraints on their parent koral:token objects, are examples of such parametric types and are used to uniformly access annotation labels from different sources and on different layers. Next to such basic parametric types, KoralQuery provides complex parametric types that encode, for instance, logical operations on other parametric types (see the koral:termGroup in Figure 2).

Note that all of those types are themselves complex structures in that they are composed of a specific set of obligatory and optional attributes that carry corresponding values. Those values, in turn, are also constrained to be of specific data types. They can either be primitives (like string, integer or boolean), parametric KoralQuery types, or controlled values.

3.3 Query Rewrites

Query processors may perform a wide range of different tasks aside from searching. Examples include the modification of queries to restrict access to certain documents, to improve recall (e.g. by introducing synonyms or suggesting query reformulations), or to inject missing query elements (like preferred annotation tools) based on user settings (Bański et al., 2014). Queries may also be analyzed for the most commonly queried structures, for instance to perform query and index optimization or to shed light on which texts and annotations are most popular with the users. In a post-processing step, queries can also be transformed for visualization purposes, for example to illustrate sequences or alternatives in complex query structures.

Using a well-defined and widely adopted serialization format such as JSON makes it easy to perform such tasks, and KoralQuery supports this kind of pre- and post-processors even further by introducing mechanisms to trace query rewrites by using so-called report types that are passed to further processors in the processing pipeline. In this way, query modifications (like the aforementioned rewrites for access restriction and recall improvements) can be made visible and transparent to the user. In this respect, KoralQuery differs from common database query systems, where rewrites are internal and hidden from the user (Huey, 2014).

In Figure 3, the virtual collection of Figure 1 is rewritten by the processor Kustvakt in a way that a further constraint is injected, limiting the virtual collection to all documents with a corpusID of Wikipedia (i.e. excluding all documents from other corpora). This rewrite is documented by the koral:rewrite object (a report type). Documenting rewrites is optional (e.g. the injected operation:and in the example Figure is implicit and was not reported using koral:rewrite).

    {
      "@context": "http://korap.ids-mannheim.de/ns/koral/0.3/context.jsonld",
      "collection": {
        "@type": "koral:docGroup",
        "operation": "operation:and",
        "operands": [{
          "@type": "koral:doc",
          "key": "pubDate",
          "value": "2005-05-25",
          "type": "type:date",
          "match": "match:geq"
        }, {
          "@type": "koral:doc",
          "key": "corpusID",
          "value": "Wikipedia",
          "rewrites": [{
            "@type": "koral:rewrite",
            "src": "Kustvakt",
            "operation": "operation:injection"
          }]
        }]
      },
      "query": {}
    }

Figure 3: Rewritten KoralQuery instance (see Figure 1), with an injected access restriction.

In addition, KoralQuery allows reporting on various processing issues (independent of rewrites, e.g. regarding incompatibilities) by using the errors, warnings, and messages attributes. Report types (in opposition to collection types, span types, and parametric types) do not alter the expected query result.

4 Implementations

KoralQuery is the core protocol used in KorAP[7] (Bański et al., 2013), a corpus analysis platform developed at the Institute for the German Language (IDS). KorAP is designed to handle very large corpora and to be sustainable with regard to future developments in corpus linguistic research. This is ensured through a modular architecture of interoperating software units that are easy to maintain, extend and replace. The interoperability of components in KorAP is certified through the use of KoralQuery for all internal communications.

Koral[8] translates queries from various corpus query languages (as mentioned in Section 3) to corresponding KoralQuery documents. This conversion is a two-stage process, which first parses the input query string using a context-free grammar and the ANTLR framework (Parr and Quong, 1995) before it translates the resulting parse tree to KoralQuery.

Krill[9] is a corpus search engine that expects KoralQuery instances as a request format. To index and retrieve primary data, textual annotations and metadata of documents as formulated by KoralQuery, Krill utilizes Apache Lucene.[10]

Kustvakt is a user and corpus policy management service that accepts KoralQuery requests and rewrites the query as a preprocessor (see Sec. 3.3) before it is passed to the search engine (e.g. Krill). Rewrites of the document query may restrict the requested collection to documents the user is allowed to access, while the span query may be modified by injecting user defined properties.

[3] JSON-LD was chosen to be compatible with LAPPS recommendations from ISO TC37 SC4 WG1-EP, as suggested by Piotr Bański.
[4] The type categories are set in boldface. A detailed definition of types and attributes is provided by the KoralQuery specification (Diewald and Bingel, 2015), which may serve as a reference for implementers of KoralQuery processors.
[5] As the response format is not part of the KoralQuery specification, the result handling is subject to the query engine. It may, for instance, return surrounding text spans or the total number of occurrences.
[6] In addition, the koral:reference type may refer to objects elsewhere in the tree, which provides a mechanism similar to ID/IDREF in XML. This strategy is necessary to support graph-based query structures found in certain query languages.
[7] http://korap.ids-mannheim.de/
[8] http://github.com/KorAP/Koral; Koral is free software, licensed under BSD-2.
[9] http://github.com/KorAP/Krill; Krill is free software, licensed under BSD-2.
[10] http://lucene.apache.org/core/
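A Kustvakt-style access-control rewrite of the kind described in Section 3.3 can be pictured as a small preprocessing step over the JSON request. The following Python function is our own minimal sketch, not Kustvakt code; only the koral:docGroup and koral:rewrite structures follow the serializations shown in the paper, and the function name is a hypothetical illustration.

```python
import copy

def restrict_to_corpus(request, corpus_id, src="Kustvakt"):
    """Sketch of an access-control rewrite: AND-combine the existing
    document constraint with a corpusID restriction and document the
    injection with a koral:rewrite report object (cf. Figure 3)."""
    rewritten = copy.deepcopy(request)  # leave the user's request intact
    restriction = {
        "@type": "koral:doc",
        "key": "corpusID",
        "value": corpus_id,
        "rewrites": [{
            "@type": "koral:rewrite",
            "src": src,
            "operation": "operation:injection",
        }],
    }
    original = rewritten.get("collection")
    if original:  # non-empty constraint present: conjoin with it
        rewritten["collection"] = {
            "@type": "koral:docGroup",
            "operation": "operation:and",
            "operands": [original, restriction],
        }
    else:         # no constraint yet: the restriction stands alone
        rewritten["collection"] = restriction
    return rewritten

request = {
    "collection": {
        "@type": "koral:doc",
        "key": "pubDate",
        "value": "2005-05-25",
        "type": "type:date",
        "match": "match:geq",
    },
    "query": {},
}

rewritten = restrict_to_corpus(request, "Wikipedia")
print(rewritten["collection"]["operation"])  # operation:and
```

Because the rewrite is recorded in a koral:rewrite report object rather than applied silently, downstream processors and end users can see exactly which constraint was injected and by whom.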

5 Summary and Further Work

We have presented KoralQuery, a general protocol for queries to linguistic corpora, which is serialized as JSON-LD. KoralQuery allows for a flexible representation and modification of corpus queries that is independent of pre-defined tag sets or annotation schemes. Those queries pertain to both selection of documents by metadata or content, and text span retrieval by the specification of linguistic patterns. To this end, the protocol defines a set of types and operations which can be nested to express complex linguistic structures. By employing an automatic conversion from several QLs to KoralQuery, corpus engines may allow their users to choose the QL that they are most comfortable with or that is best equipped to answer their research questions.

The KoralQuery specification (Diewald and Bingel, 2015) does not claim to be complete or to cover all possible linguistic types and structures. Amendments to the protocol may follow in future versions or may be implemented by individual projects, which is easily done by supplying an additional JSON-LD @context file that links new concepts to unique identifiers. Extensions that we consider for upcoming versions of KoralQuery include text string queries that are not constrained by token boundaries and more powerful stratification techniques for virtual collections.

Acknowledgements

KoralQuery, as well as the described implementation components, are developed as part of the KorAP project at the Institute for the German Language (IDS)[11] in Mannheim, member of the Leibniz-Gemeinschaft, and supported by the KobRA[12] project, funded by the Federal Ministry of Education and Research (BMBF), Germany. The authors would like to thank their colleagues for their valuable input.

[11] http://ids-mannheim.de/
[12] http://www.kobra.tu-dortmund.de/

References

Piotr Bański, Joachim Bingel, Nils Diewald, Elena Frick, Michael Hanl, Marc Kupietz, Piotr Pezik, Carsten Schnober, and Andreas Witt. 2013. KorAP: the new corpus analysis platform at IDS Mannheim. In Zygmunt Vetulani and Hans Uszkoreit, editors, Human Language Technologies as a Challenge for Computer Science and Linguistics. Proceedings of the 6th Language and Technology Conference, Poznań. Fundacja Uniwersytetu im. A. Mickiewicza.

Piotr Bański, Nils Diewald, Michael Hanl, Marc Kupietz, and Andreas Witt. 2014. Access Control by Query Rewriting: the Case of KorAP. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014), Reykjavik, Iceland, May. European Language Resources Association (ELRA).

Franck Bodmer. 1996. Aspekte der Abfragekomponente von COSMAS II. LDV-INFO, 8:142–155.

Douglas Crockford. 2006. The application/json Media Type for JavaScript Object Notation (JSON). Technical report, IETF, July. http://www.ietf.org/rfc/rfc4627.txt.

Nils Diewald and Joachim Bingel. 2015. KoralQuery 0.3. Technical report, IDS, Mannheim, Germany. Working draft, in preparation, http://KorAP.github.io/Koral, last accessed 27 April 2015.

Elena Frick, Carsten Schnober, and Piotr Bański. 2012. Evaluating query languages for a corpus processing system. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), pages 2286–2294.

Patricia Huey. 2014. Oracle Database, Security Guide, 11g Release 1 (11.1), chapter 7: Using Oracle Virtual Private Database to Control Data Access, pages 233–272. Oracle. http://docs.oracle.com/cd/B28359_01/network.111/b28531.pdf, last accessed 27 April 2015.

OASIS Standard. 2013. searchRetrieve: Part 5. CQL: The Contextual Query Language Version 1.0. http://docs.oasis-open.org/search-ws/searchRetrieve/v1.0/os/part5-cql/searchRetrieve-v1.0-os-part5-cql.html.

Terence J. Parr and Russell W. Quong. 1995. ANTLR: A predicated-LL(k) parser generator. Software: Practice and Experience, 25(7):789–810.

Adam Przepiórkowski, Zygmunt Krynicki, Łukasz Dębowski, Marcin Woliński, Daniel Janus, and Piotr Bański. 2004. A search tool for corpora with positional tagsets and ambiguities. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004), pages 1235–1238. European Language Resources Association (ELRA).

Viktor Rosenfeld. 2010. An implementation of the Annis 2 query language. Technical report, Humboldt-Universität zu Berlin.

Manu Sporny, Dave Longley, Gregg Kellogg, Markus Lanthaler, and Niklas Lindström. 2014. JSON-LD 1.0 – A JSON-based Serialization for Linked Data. Technical report, W3C. W3C Recommendation, http://www.w3.org/TR/json-ld/.

Reflections and a Proposal for a Query and Reporting Language for Richly Annotated Multiparallel Corpora

Simon Clematide
Institute of Computational Linguistics, University of Zurich
[email protected]

Abstract

Large and open multiparallel corpora are a valuable resource for contrastive corpus linguists if the data is annotated and stored in a way that allows precise and flexible ad hoc searches. A linguistic query language should also support computational linguists in automated multilingual data mining. We review a broad range of approaches for linguistic query and reporting languages according to usability criteria such as expressibility, expressiveness, and efficiency. We propose an architecture that tries to strike the right balance to suit practical purposes.

1 Introduction

There is a large amount (millions of sentences) of open multiparallel text data available electronically: resolutions of the General Assembly of the United Nations (Rafalovitch and Dale, 2009), European parliament documents (Koehn, 2005; Hajlaoui et al., 2014), European administration translation memories and law texts (Steinberger et al., 2012; Steinberger et al., 2006), documents from the European Union Bookstore (Skadiņš et al., 2014), and movie subtitles. See Tiedemann (2012) and Steinberger et al. (2014) for an overview.

Automatic part-of-speech tagging and lemmatization of raw text has become standard procedure, and richer linguistic annotations such as morphological analysis, named entity recognition, base chunking, and dependency analysis are possible for many languages. Further, statistical word alignment can be applied to any parallel language resource. If we want to exploit these large, richly annotated resources and flexibly serve the language-related information needs of translators, terminologists and contrastive linguists, an expressive query language for ad hoc search must be provided. Such a query language will also be useful for automated linguistic data mining, a use case of computational linguists. A successful combination of these two different paradigms of linguistic information retrieval (i.e. ad hoc search and precomputed word collocation statistics) has been shown in the case of the query language CQL within the framework of the Sketch Engine (Kilgarriff et al., 2014).

Historically, there are two different strains of linguistic query systems: (a) tools for text corpora such as CQP (Christ, 1994) with KWIC reporting, and (b) treebank tools such as TGrep2 (Rohde, 2005) for searching through deeply nested structures of syntactically annotated sentences. In recent years, we have seen a convergence of these strains: query languages for text corpora have enriched their search operators in order to cope with syntactic constituents, for example introducing the operators within and contain in CQL (Jakubicek et al., 2010) or the constituent search construct in Poliqarp (Janus and Przepiórkowski, 2007). On the other hand, treebanking-style query approaches that were bound to context-free tree structures have evolved into more general query systems for structural linguistic annotations, e.g. ANNIS (Krause and Zeldes, 2014) which allows a richer set of structural relations (multi-layered directed acyclic graphs, including syntactic dependencies or coreference chains across sentences), or the Prague Markup Language Tree Query (PML-TQ) system for multi-layered annotations (Štěpánek and Pajas, 2010), which also covers parallel treebanks.[1]

[1] Unfortunately, it is difficult to access up-to-date information about the query possibilities for alignments of words or syntactic nodes. The documentation, however, describes a general cross-layer, node-identifier-based selector dimension. The parallel Prague Czech-English Dependency Treebank 2.0 (PCEDT 2.0) http://ufal.mff.cuni.cz/pcedt2.0 illustrates the representation of word-aligned dependency trees.
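The KWIC reporting of strain (a) can be illustrated with a toy concordancer. This Python sketch is purely our own illustration; real tools such as CQP operate on indexed corpora with a full query language, not on Python token lists, and the function name is hypothetical.

```python
def kwic(tokens, target, window=3):
    """Toy keyword-in-context search: return each hit together with
    up to `window` tokens of left and right context."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

tokens = "the small octopus hides under the large rock".split()
for left, kw, right in kwic(tokens, "octopus"):
    print(f"{left} [{kw}] {right}")
```

Even this toy version shows why reporting matters: the raw match positions are useless to a linguist until they are rendered with context, and richer systems extend exactly this step with sorting, aggregation and export.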

1.1 Linguistic Information Needs

A linguistic query in a general sense is a set of interrelated constraints about linguistic structures. The following paragraphs introduce the structures we want to represent and query.

Monolingual constraints on the primary level of word tokens (the minimal unit of analysis) deal with inflected word forms, base forms, part-of-speech tags, and morphological categories. Word tokens have a sequential ordering relation (linear precedence). For our case of orthographically well-formed texts, we assume consistent tokenization for all levels of annotation. Giving up this requirement leads to non-trivial ordering problems (Chiarcos et al., 2009). Sentences are sequences of tokens, and documents are sequences of sentences [2]. Documents or sentences typically have metadata associated with them, for instance indicating whether a document is a translation or not.

Each full or partial dependency analysis of a sentence can be represented as a directed and labeled tree graph where each node is a word token, except for the root of the tree, which we assume to be a virtual node. Nested syntactic constituents (or chunks in the case of partial parsing) introduce a dominance relation between syntactic nodes (non-terminals) or primary token nodes (terminals). Dominated nodes also have a linear precedence ordering, the sibling relation.

Cross-lingual constraints are concerned with word alignments and sub-sentential alignments on the chunk level [3]. Directed bilingual word alignments as produced by statistical word alignment tools such as GIZA++ are 1:n (Och and Ney, 2003). Bidirectional alignments are thus relational; in general, we have m:n alignments on the level of words, for example between a German compound and its corresponding multi-word unit in French, unless we apply a symmetrization technique (Tiedemann, 2011, 75ff.) [4].

1.2 Reporting and Visualization

The set of constraints in a query does not exactly determine the content or format of the search results. All flexible linguistic query languages offer means to select the sub-structures and attributes [5] which the user is interested in. This may also include sorting, aggregating or statistical tabulating of the results, as for instance the excellent reporting functions of PML-TQ allow. In our opinion, reporting also includes the user-configurable export of search results, for example as simple comma-separated data for further statistical processing [6], or as hierarchically structured XML serializations.

The graphical visualization of search results aids end users in quickly browsing complex data structures. Visualizations of syntactic structures or frequency distributions of aligned words should be generated on top of specific textual reporting formats. Interactive behavior (collapsing trees, highlighting of aligned nodes) supports a quick interpretation of search results.

The remainder of this paper is structured as follows. Section 2 describes general usability criteria of linguistic query systems. Section 3 discusses interesting linguistic query languages and their main properties. Section 4 introduces general data query languages that are related to linguistic systems. Section 5 discusses evaluation approaches for linguistic query languages. Finally, Section 6 presents our proposal for an efficient linguistic query and reporting system for multiparallel data.

2 Usability Criteria for Linguistic Query Systems

Expressibility: How naturally can users express their information need? Can users apply their linguistic concepts to formulate their query (Jakubicek et al., 2010, 743), or do they have to deal with cumbersome constructs?

Non-experts may profit from a visual or menu-based composition of queries. Gärtner et al. (2013) and Mírovský (2008) describe graphical query solutions for dependency trees. ANNIS (Zeldes et al., 2009) offers a graphical query interface for AQL. Nygaard and Johannessen (2004) built a menu-based visual query composition for parallel treebanks that used TGrep2 as its query execution engine.

[2] In order to keep the description simple, we do not impose more nesting levels in documents.
[3] Sentence alignments are considered as given in the context of multiparallel corpora, although in practical terms it might require a lot of work to achieve a proper and consistent sentence alignment across multiple languages.
[4] Recently, Baisa et al. (2014) applied Dice coefficients to identify aligned lemmas in parallel sentences.
[5] TGrep2 uses backticks to mark the top node of the subtree that is printed as output.
[6] ANNIS provides a practical export format for the WEKA machine learning framework.
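As a purely illustrative sketch of the data model from Section 1.1 (the class and field names below are our own and do not belong to any of the systems discussed), tokens, sentences, and m:n word alignments can be modeled in a few lines of Python:

```python
from dataclasses import dataclass, field

@dataclass
class Token:
    # Primary-level attributes: inflected form, base form, PoS tag.
    form: str
    lemma: str
    pos: str

@dataclass
class Sentence:
    # Linear precedence is given by list order.
    tokens: list

@dataclass
class AlignedPair:
    # m:n word alignment between two sentences as a set of index pairs.
    src: Sentence
    tgt: Sentence
    links: set = field(default_factory=set)

    def aligned_to(self, i):
        """Target token indices aligned to source token i."""
        return sorted(j for (s, j) in self.links if s == i)

# A German compound aligned to a French multi-word unit (the m:n case).
de = Sentence([Token("Mehrwertsteuer", "Mehrwertsteuer", "NOUN")])
fr = Sentence([Token("taxe", "taxe", "NOUN"),
               Token("sur", "sur", "ADP"),
               Token("la", "le", "DET"),
               Token("valeur", "valeur", "NOUN"),
               Token("ajoutée", "ajouter", "VERB")])
pair = AlignedPair(de, fr, {(0, 0), (0, 1), (0, 2), (0, 3), (0, 4)})
print(pair.aligned_to(0))  # -> [0, 1, 2, 3, 4]
```

A real system would of course store such data in indexed form rather than as in-memory objects; the sketch only fixes the vocabulary of the structures to be queried.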

Experts, however, will profit most from text-based queries that allow them to abstract common and recurrent functionality in the form of user-definable macros, variables, or functions.

Expressiveness: Are there inherent limitations in a query language that systematically prevent the formulation of precise search constraints for certain structures? It has been well known since its inception that the fragment of existential first-order logic implemented by the TIGERSearch language does not allow for the search of missing constituents in syntactic graphs (König and Lezius, 2003). Lai and Bird (2010) provide a concise overview of the formal expressiveness of query languages for hierarchical linguistic structures and discuss the fact that transitive closures of immediate dominance or precedence relations formally require the expressiveness of monadic second-order logic. Interestingly, such high expressiveness does not imply inefficient or impractical execution times, as shown by Maryns and Kepser (2009) for context-free treebank structures, if tree automata techniques are used. However, purely logical approaches have not received much attention in practice.

Efficiency: How much processing time and memory is needed for the execution of a query? Answers to this question relate to many different parameters. First, the data size of the corpora matters: dealing with thousands, millions, or billions of sentences makes a big difference. Second, data model complexity matters. Third, query expressiveness and complexity matter. Even if a user is dealing with large datasets, complex data models and complicated queries, there are solutions to produce acceptable response times, for instance by providing a highly parallel computing infrastructure using MapReduce techniques (Schneider, 2013), or by using sophisticated indexing and retrieval techniques (Ghodke and Bird, 2012).

Reporting and exporting: Does the query language or query system offer flexible support for the user to configure the data reported in the search results? The selection of sub-structures is typically deeply integrated in the query syntax. For text concordancing tools, Frick et al. (2012) mention the LINK/ALL operator of COSMAS II, or bracketed expressions in Poliqarp. The statistical reporting functions of the monolingual treebank search tool TIGERSearch [7] rely on named node specifications, and they can only be accessed and configured by graphical user interface interactions. Other query languages such as PML-TQ offer a proper reporting language with a rich set of functions for sorting, aggregating and exporting (e.g. grammar rules).

Visualization: Does the query system offer appealing visualizations of the data or data aggregations? ANNIS3 (Krause and Zeldes, 2014) has an outstanding amount of visualization options.

Availability and accessibility: Is a system bound to specific operating systems? Large datasets typically overstrain personal desktop computers. Web-based services can be hosted on dedicated computing infrastructure, and there is typically no client-side software installation necessary given the rendering capabilities of modern web browsers (e.g. interactive SVG graphics). Open web-based services enable easy sharing of query results via URLs (Pezik, 2011).

3 Families of Linguistic Query Languages

As mentioned above, there are two strains of linguistic query languages. Some specific properties of these languages are discussed next.

3.1 Text Corpus Query Languages

CQP: The language of the IMS Corpus Query Processing Workbench (Hardie, 2012) [8] has a long history (Christ, 1994). From this common ancestor, CQL (Kilgarriff et al., 2004) and Poliqarp were later developed. Right from the beginning, CQP supported annotated word tokens, structural boundaries (sentences, constituents) and sentence-aligned parallel texts. The core of a query consists of regular expressions that specify matching token sequences. These descriptions can refer to the level of word forms, part-of-speech tags or any other positional (= token-bound) attribute. Non-recursive constituents are indirectly available as structural boundaries and can be used to restrict the search space for regular expression matches on the positional level. The constituent segments also allow for attributes which can be queried, for instance syntactic head information.

[7] http://www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/tigersearch.html
[8] http://cwb.sourceforge.net
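CQP's core idea, regular expressions over attribute-annotated tokens, can be illustrated with a toy matcher. The code below is our own sketch for exposition only, not CQP's engine: it matches a greedy, non-backtracking analogue of the pattern [pos="DET"] [pos="ADJ"]* [pos="NOUN"] against a token sequence.

```python
import re

# A token is a dict of positional attributes, e.g. word form and PoS tag.
sentence = [
    {"word": "the", "pos": "DET"},
    {"word": "old", "pos": "ADJ"},
    {"word": "car", "pos": "NOUN"},
]

# Toy analogue of [pos="DET"] [pos="ADJ"]* [pos="NOUN"]:
# each element is (attribute, regex, repeatable?).
pattern = [("pos", "DET", False), ("pos", "ADJ", True), ("pos", "NOUN", False)]

def match_at(tokens, pattern, start):
    """Greedy left-to-right match at position start; no backtracking,
    so this is a simplification of real regex semantics."""
    i = start
    for attr, rx, repeatable in pattern:
        if repeatable:
            while i < len(tokens) and re.fullmatch(rx, tokens[i][attr]):
                i += 1
        else:
            if i < len(tokens) and re.fullmatch(rx, tokens[i][attr]):
                i += 1
            else:
                return None
    return (start, i)  # matched span [start, i)

print(match_at(sentence, pattern, 0))  # -> (0, 3)
```

A production engine additionally evaluates such patterns against inverted indexes over each positional attribute instead of scanning token lists.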

Table 1: Operators of query languages (QL)

Relation               | QL                    | Symbol
-----------------------+-----------------------+-------
Immediate dominance    | TGrep2, fsq, TS, AQL  | >
                       | LPath                 | /
Transitive dominance   | TGrep2                | >>
                       | fsq                   | >+
                       | TS, AQL               | >*
                       | LPath                 | //
Immediate precedence   | TGrep2, fsq, TS, AQL  | .
                       | LPath                 | ->
Transitive precedence  | TGrep2, fsq           | ..
                       | TS, AQL               | .*
                       | LPath                 | -->
Immediate sibling      | TS, AQL               | $
                       | TGrep2                | $.
                       | LPath                 | =>

The main weakness of this query language is the lack of a means to query arbitrary relations between tokens, which would be necessary to properly support the search for dependency relations. Given the fact that dependency labels are bound to words, one could map this information as an attribute on the positional level, for example, attributing the property of being a subject to the head of the subject. An integrated macro and reporting language distinguishes CQP as a powerful and versatile tool.

CQL: The query language behind the commercial corpus query platform Sketch Engine [9] is an extension of CQP (Jakubicek et al., 2010). Support for identifying word matches across parallel corpora is technically implemented via the within operator. For a sentence-aligned parallel corpus (English and German Europarl corpus), a query rooted in the English side might look like:

  [word="car"] within europarl7_de: [word="Auto"]

This finds all occurrences of car in sentences where a parallel sentence containing the word Auto exists. This kind of query, however, does not allow one to explicitly test for word alignment relations. Still, the search patterns on both sides of the within operator can be arbitrarily complex.

3.2 Treebank Query Languages

TGrep2: The efficient treebank query tool TGrep2 is limited to context-free parse trees. Lai and Bird (2004) see its strength in the ability to query for non-inclusion or non-existence of constituents. Their information need Q2, "Find sentences that do not include the word saw", can be expressed succinctly as S !<< saw. Their information need Q5, "Find the first common ancestor of sequences of a noun phrase followed by a verb phrase", leads to a short but intricate query (see Tab. 1 for operators):

  *=p << (NP=n .. (VP=v >> =p !>> (* << =n >> =p)))

3.2.1 Path-based Languages

LPath: Bird et al. (2006) developed this query language as a generally applicable extension of the XPath query language for XML [10]. Syntactic trees as well as XML documents are ordered trees. However, the direct use of XPath for querying linguistic trees is limited by the absence of (a) the horizontal axis of x immediately follows/precedes y, and (b) sibling x immediately follows/precedes sibling y [11]. Q2 from above can be stated as

  /S[not //_[@lex = 'saw']]

Q5 cannot be expressed correctly (Lai and Bird, 2004). A further extension of LPath, called LPath+ (Lai and Bird, 2005), is more expressive and allows for a correct but complex query:

  //_[/_[(NP or (/_[not(=>_)])*/NP[not(=>_)) and => (VP or (/_[not(<=_)])*/VP[not(<=_)])]

This is due to the fact that path-based, variable-free languages cannot easily express equality restrictions. Therefore, the following shorter LPath expression does not have the correct meaning, because each NP (or VP) may refer to different nodes:

  //_[{//NP->VP} and not(//_{//NP->VP})]

DDDQuery: This language is another attempt to extend XPath and to better adapt it for linguistic information needs (Faulstich et al., 2006). Its data model was developed for a multi-layered, linguistically richly annotated representation of historical texts, including transcriptions and aligned translations, which resulted in "non-tree-shaped annotation graphs and multiple annotation hierarchies with conflicting structure". This query language "goes beyond LPath by supporting queries on text spans, on multiple annotation layers, and across aligned texts". The language introduces shared variables for any node set in order to easily express equality restrictions and report the matched nodes as result data.

[9] See Kilgarriff et al. (2014) for a recent description. The NoSketchEngine, the open-source part of the Sketch Engine, is available from http://nlp.fi.muni.cz/trac/noske.
[10] http://www.w3.org/tr/xpath
[11] Note that the transitive closures of these relations are available in XPath.
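The non-inclusion semantics of the TGrep2 query S !<< saw (Q2) can also be paraphrased procedurally. The following toy evaluator over nested tuples is our own illustration of that semantics, not TGrep2's implementation:

```python
# A toy constituency tree: (label, children...) with strings as terminals.
def dominates_word(node, word):
    """True if any terminal below this node equals word."""
    if isinstance(node, str):
        return node == word
    return any(dominates_word(child, word) for child in node[1:])

def q2(trees, word="saw"):
    """Q2: S roots that do NOT dominate the given word (S !<< word)."""
    return [t for t in trees if t[0] == "S" and not dominates_word(t, word)]

t1 = ("S", ("NP", "I"), ("VP", "saw", ("NP", "it")))
t2 = ("S", ("NP", "I"), ("VP", "heard", ("NP", "it")))
print(len(q2([t1, t2])))  # -> 1 (only t2 qualifies)
```

The point of the sketch is that non-existence is trivial once negation ranges over a complete subtree, which is exactly what purely existential query fragments cannot express.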

PML-TQ: This query language is also a path-based approach (Štěpánek and Pajas, 2010). A query consists of a Boolean combination of node selector paths and filters. The language allows recursive sub-queries in selectors which evaluate to node sets. The cardinality of these node sets can be tested by numeric quantifiers. A quantifier of zero tests for the non-existence of nodes; therefore, non-existing nodes can be queried in a natural way. A similar technique of extensionalization of sub-queries into node sets was implemented for the TreeAligner language (Marek et al., 2008).

3.2.2 Logic-based Languages

fsq [12]: The Finite Structure Query language (Kepser, 2003) provides full first-order logic as a query language over syntactic structures of the TIGER data model (Brants et al., 2004). This includes labelled secondary edges between arbitrary nodes and discontiguous children. Therefore, fsq has an outstanding expressiveness. Regular expression support for node labels and response times that are comparable to TIGERSearch make this approach a practical one. Lai and Bird's difficult question Q5 can be expressed as follows in the somewhat inconvenient LISP-like prefix notation for first-order logic of fsq [13]:

  (E a (E n (E v (&
    (cat n NP) (cat v VP) (>+ a v) (.. n v)
    (! (>+ n v)) (! (>+ v n))
    (A b (-> (& (>+ a b) (>+ b n))
             (! (>+ b v))))))))

Compared to the query language of TIGERSearch, there is a lack of special purpose predicates such as the (token) arity of syntactic nodes or precedence or dominance restrictions with numeric distance limits, for example >2,5 expressing an indirect dominance relation with a minimal depth of 2 and a maximum of 5.

MonaSearch [14]: Maryns and Kepser (2009) extended the logical expressiveness of fsq even further to monadic second-order logic. However, its data model is restricted to context-free parse trees. A main application of such an expressive language is automatic consistency checks in human-created treebanks. However, existentially quantified formulas can be used to effectively query matching structures.

TIGERSearch: König and Lezius (2000) introduced this logic-based, syntax graph description language for the TIGER data model. It is a subset of first-order logic, providing only globally existentially quantified variables and limited negation. The language has two layers, namely node constraints and graph constraints. Node constraints are either node descriptions or node (relation) predicates. Node descriptions are Boolean expressions of feature-value constraints with optional variable decorations for referencing the same node several times in a query, for instance #v:[word != "saw"] for a terminal node description, or #np:[cat = ("NP"|"CNP")] for a simple or coordinated noun phrase. Node predicates constrain selected properties of nodes, such as being the root of a tree (root(#s)) or having a certain number of daughter nodes (arity(#CNP,2)). Node relation predicates express the usual structural relations in a user-friendly operator notation, e.g. #s >* #np for a dominance relation. Graph constraints are conjunctions or disjunctions of node constraints. Negation is not allowed on the level of graph constraints, which severely limits the expressiveness. The TIGER language originally specified user-defined macros (templates); however, this part of the language was never implemented.

AQL: The query language of ANNIS is an extension of the TIGERSearch language for multi-level graph-based annotations. It offers operators for labelled dependency relations, inclusion or overlap of token spans, corpus metadata information, and namespaces for annotations of the same type produced by different tools [15]. The operator for dependency relations is an instance of the general operator -> for directed and labelled edges between any two nodes. Such edges can also be used to establish or query alignments between parallel sentences on the level of words or phrases.

TreeAligner: The Stockholm TreeAligner (Lundborg et al., 2007) introduced an operator for querying bilingual alignments between words or phrases of parallel treebanks, freely combinable with monolingual TIGERSearch-style queries.

[12] The Java implementation of fsq also includes a TIGERSearch-like visualization for the matched trees, see http://www.tcl-sfs.uni-tuebingen.de/fsq.
[13] Existential (E) and universal (A) quantification, conjunction (&), negation (!), implication (->).
[14] http://www.tcl-sfs.uni-tuebingen.de/MonaSearch
[15] For instance, for different parsers (Chiarcos et al., 2010).
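To make Q5's information need concrete (the lowest common ancestor of an NP preceding a VP), here is a naive procedural rendering. This is our own sketch of the information need itself, not of the fsq formula or of any of the query languages above, and it approximates linear precedence by pre-order node numbering:

```python
# Toy tree as (label, children...); build node records with parent links.
def build(node, parent=None, nodes=None):
    if nodes is None:
        nodes = []
    nid = len(nodes)
    label = node if isinstance(node, str) else node[0]
    nodes.append({"id": nid, "label": label, "parent": parent})
    if not isinstance(node, str):
        for child in node[1:]:
            build(child, nid, nodes)
    return nodes

def ancestors(nodes, nid):
    """Proper ancestors of a node, nearest first."""
    out, p = [], nodes[nid]["parent"]
    while p is not None:
        out.append(p)
        p = nodes[p]["parent"]
    return out

def q5(nodes):
    """Labels of the first (lowest) common ancestor of an NP
    preceding a VP, where neither node dominates the other."""
    hits = []
    for n in nodes:
        for v in nodes:
            if n["label"] == "NP" and v["label"] == "VP" and n["id"] < v["id"]:
                an, av = ancestors(nodes, n["id"]), ancestors(nodes, v["id"])
                if n["id"] in av or v["id"] in an:
                    continue  # one dominates the other: skip
                common = [a for a in an if a in av]
                if common:
                    hits.append(nodes[common[0]]["label"])
    return hits

tree = ("S", ("NP", "I"), ("VP", "saw", ("NP", "it")))
print(q5(build(tree)))  # -> ['S']
```

The quadratic loop over node pairs mirrors the nested existential quantifiers of the logical formulation; a practical engine would instead use indexes over the dominance and precedence relations.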

To overcome some expressiveness limitations of TIGERSearch, Marek et al. (2008) introduced node sets (node descriptions decorated with variables starting with % instead of #). One might try to express Bird and Lai's Q2, that is, find sentences without saw, in the following ways:

  #s:[cat="S"] >* #w:[word!="saw"]    (1)
  #s:[cat="S"] !>* #w:[word="saw"]    (2)
  #s:[cat="S"] !>* %w:[word="saw"]    (3)

(1) actually matches all cases where a sentence dominates any word other than saw. (2) searches for occurrences of the word saw not dominated by a sentence node. The interpretation of (3) relies on a modified evaluation strategy of the negated dominance if one of the arguments is a node set: only those sentences match where the negated transitive dominance constraint !>* is true for any of the nodes with the word attribute saw.

4 General Data Query Languages

Complex data structures are not a privilege of linguistics, so obviously many general data query languages and data management systems exist. Some of them have been used to represent and query linguistic structures.

XPath/XQuery [16]: Bouma and Kloosterman (2007) used these XML technologies in a straightforward manner for querying and mining syntactically annotated corpora. These query languages are also the basis of Nite QL (Carletta et al., 2005), which is targeted at multimodal annotations.

SQL: The structured query language for relational databases (RDBMS) is a standard technology with highly efficient implementations. RDBMSs have been widely used to represent large amounts of data, e.g. for text concordancing [17].

CYPHER [18]: Distributed NoSQL graph databases and CYPHER as one of the straightforward query languages seem to be a good match for highly interconnected linguistic data (Holzschuher and Peinl, 2013). Pezik (2013) reports some experiments for corpus representation and corpus query with a pure graph database. Banski et al. (2013) integrate a general text retrieval engine with a graph database for their corpus analysis platform.

SPARQL [19]: RDF (Resource Description Framework) triple stores with SPARQL endpoints for querying linked data are fairly standard nowadays. Kouylekov and Oepen (2014) used this technique to represent and query semantic dependencies. However, the queries directly operate on the internal RDF representations and do not meet the criteria of natural expressibility. The authors propose a query-by-example and a template expansion front-end for better usability. Chiarcos (2012) introduces POWLA, a generic formalism to represent multi-layer annotated corpora in RDF and OWL/DL and to query these structures by SPARQL. In order to improve the expressibility, SPARQL macros for AQL operators are defined. Given the expressiveness of SPARQL, this allows one to overcome the query language limitations of AQL or TIGERSearch, which cannot query for missing annotations.

LUCENE [20]: Every information retrieval system has an integrated query language. Powerful text indexing and query engines such as LUCENE can be used to manage large amounts of texts. By treating each sentence as an IR document, Ghodke and Bird (2012) implemented a high-performance treebank query system on top of LUCENE [21].

5 Evaluation Strategies for Linguistic Query Languages

There are essentially two approaches to implement the evaluation of linguistic query languages: either by programming a custom implementation of the execution of the query over a custom implementation of the data management, or by translating the query and the data into a host database system and executing the actual query on the host. Sometimes, these approaches are mixed; for instance, the TreeAligner uses the relational database SQLite for storing and retrieving the primary data of word tokens, but implements a custom in-memory engine for the evaluation of the Boolean algebra of node predicates and node relations.

5.1 Custom Evaluation Engines

Manatee (Rychlý, 2007) is CQL's back-end for textual data management and query evaluation.

[16] http://www.w3.org/XML/Query
[17] http://corpus.byu.edu (Davies, 2005)
[18] http://neo4j.com/developer/cypher-query-language
[19] http://www.w3.org/TR/sparql11-query
[20] http://lucene.apache.org
[21] Their query language, however, does not allow regular expressions over labels, or underspecified node descriptions.

It is language and annotation independent and includes efficient implementations of inverted indexes, word compression, etc. in order to cope with extremely large text corpora. Attributes of primary data can be set-valued and support unification-style attribute comparisons. Another interesting feature of Manatee is its support for dynamic attributes of positional primary data. These are implemented as function calls which can be declared at the level of the corpus configuration, for instance for external lexicon look-up, morphological analysis, or the transformation of tags.

TGrep2, TIGERSearch and fsq are examples of treebank query systems with a fully custom data management and query evaluation engine. Rosenfeld (2010) gives a concise description of the implementation techniques behind TIGERSearch. The corpus import of TIGERSearch includes the construction of many specialized indexes for predicates and attributes. During indexing, statistics on the selectivity of attributes are built, which in turn guide the query execution planner to limit the full evaluation of a query to a subset of syntactic trees. At the stage of corpus indexing, users can provide their own type definitions, that is, short names for subsets of admissible feature values. A definition for genitive or dative case looks as follows:

  gen-dat := "gen","dat";

Although any query involving this case ambiguity can be expressed by a Boolean disjunction, type definitions lead to both more readable and compact queries and also to more efficient processing due to the type-based data model of TIGERSearch.

5.2 Query Translation Approach

LPath and DDDQuery are both XPath-style languages that owe much to the hierarchical data model of XML. However, storing and efficient retrieval of large XML data sets turns out to be a technical challenge in general (Grust et al., 2004). One common solution for high-performance XML retrieval is based on a mapping of the hierarchical document structure into a flat relational format, which in turn allows the use of highly efficient RDBMSs. Both linguistic query languages – LPath and DDDQuery – are translated into SQL queries because their XML data model is physically stored in an RDBMS. The implementation of DDDQuery (Faulstich et al., 2006) is especially interesting for us because it first translates into a first-order logic intermediate representation from which the actual SQL queries are derived.

The development of the relational data model of ANNIS (relANNIS) and the corresponding translation of the ANNIS query language AQL into SQL queries by Rosenfeld (2010) was inspired by the DDDQuery translation. In the next section, we propose a linguistic query language which is similar to AQL but has a simpler data model. Therefore, we expect that our query translation component can be built using the techniques of AQL query evaluation.

6 A Proposal for Querying Richly Annotated Multiparallel Text Corpora

Our data model presupposes the following components: (a) multiparallel corpora with sentence, word and sub-sentential alignments across languages; (b) monolingual linguistic annotations such as PoS tags (preferably the same universal tagset across languages), base forms and morphological information; (c) syntactic annotations in the form of dependency relations and (partial) constituents, allowing the output of different tools for the same kind of analysis (multi-annotation). Multi-tokenization is not required for our data and would impose unnecessary complexity for the query component. However, metadata on the level of corpora, documents, or sentences is needed.

The proposed query language should allow users to flexibly query all aspects of our data model. However, the search space of the query evaluation will be restricted to the context of a monolingual sentence and its corresponding aligned sentences [22]. The concrete query syntax for monolingual search will be based on TIGERSearch. Additionally, we introduce an alignment operator similar to the bilingual one of the TreeAligner. However, in multiparallel queries the alignment operator can be used to constrain alignments between nodes of any pair of languages. From AQL, we reuse the operator for dependency relations, the support for metadata predicates, and explicit namespaces. From CQP, we import the concept of a non-recursive macro language. Such a facility proved to be extremely useful for large scale linguistic mining in the case of Sketch Grammars of the Sketch Engine (Kilgarriff et al., 2004).

[22] Monolingual searches across sentence boundaries as permitted in CQP-style queries will not be possible. However, this search limit does not preclude reporting contextual information from surrounding sentences.
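As a concrete but entirely hypothetical illustration of the translation approach, a conjunction of feature constraints with a TIGERSearch-style type definition such as gen-dat could be compiled to SQL roughly as follows. The node table and its columns are invented for this sketch and do not correspond to any published schema:

```python
# User-provided type definitions: short names for sets of feature values,
# mirroring TIGERSearch's  gen-dat := "gen","dat";
TYPES = {"gen-dat": ["gen", "dat"]}

def condition(attr, value):
    """Expand one feature constraint into SQL; type names become IN lists."""
    if value in TYPES:
        alts = ", ".join("'%s'" % v for v in TYPES[value])
        return "%s IN (%s)" % (attr, alts)
    return "%s = '%s'" % (attr, value)

def translate(constraints):
    """Translate a conjunction of [attr=value] constraints into a query
    against a hypothetical table node(node_id, pos, mcase, ...)."""
    where = " AND ".join(condition(a, v) for a, v in constraints)
    return "SELECT node_id FROM node WHERE " + where

print(translate([("pos", "NN"), ("mcase", "gen-dat")]))
# -> SELECT node_id FROM node WHERE pos = 'NN' AND mcase IN ('gen', 'dat')
```

A real translator would additionally emit joins for dominance, precedence, and alignment relations and would use parameterized queries rather than string interpolation; the sketch only shows how type definitions disappear at translation time.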

[Figure 1: Architecture of our proposed system]

The predicates needed for expressing the constraints of linguistic queries are different from the reporting functions. After the query execution, reporting functions will be applied to the token IDs, for instance the function lemma(#wordid), which renders the base form of a terminal node. Flexible reporting expressions similar to PML-TQ have to be defined and implemented. Graphical visualization is just another post-processing step that renders the output of specialized reporting functions.

RDBMSs are stable and efficient data management platforms, and modern, open source implementations such as PostgreSQL [23] support extensions to cope with acyclic graph structures (e.g. recursive SELECT). Therefore, we decided to host our data on an RDBMS and compile our linguistic query language into SQL. The overall architecture of our system is shown in Fig. 1.

One remaining problem is the inability to search for missing elements. The work presented here is part of a contrastive corpus linguistics project which is interested in differences in the use of articles in English and other languages, especially in cases where one language has an article and the other does not. A direct reimplementation of the TreeAligner approach with node set variables seems problematic since the evaluation of a query in the TreeAligner is implemented by iteratively constructing and manipulating node sets in memory. However, the general idea of an extensionalization of intermediate search results is natural [24]. Indeed, SQL itself offers the set operations UNION, INTERSECT, and EXCEPT to combine the results of different queries. In the next section, we present a proposal for searching missing elements using the result set operation EXCEPT.

6.1 Proposal for Query Result Set Operations

If we carefully separate reporting from querying, we can apply result set operations in order to implement the search for missing structures as filtering. We admit that there will be some computing overhead, but conceptually, filtering is easier for end users than full first-order logic. To illustrate the idea, we informally embed CQP-style macros and TreeAligner constraints into SQL syntax. Bird and Lai's Q2 is easy:

  SELECT #s FROM corpus WHERE #s:[cat="S"]
  EXCEPT
  SELECT #s FROM corpus WHERE #s:[cat="S"] >* [word="saw"]

The information need of Q5 focuses on a triple of an ancestor a, an NP n and a VP v.

  MACRO a_dom_n_and_v($0=#a,$1=#n,$2=#v)
    $0:[] >* $1:[cat="NP"] & $0 >* $2:[cat="VP"] &
    $1 .* $2 & $1 !>* $2 & $2 !>* $1 ;

  SELECT #a,#v,#n FROM corpus
  WHERE a_dom_n_and_v[#a,#n,#v]
  EXCEPT
  SELECT #a,#v,#n FROM corpus
  WHERE a_dom_n_and_v[#x,#n,#v] & #a >* #x

The first select is too general and includes all ancestors. The second selects the ancestors which dominate such an ancestor. The EXCEPT operator (which calculates the set minus) leaves the very ancestor that does not dominate any other.

A bilingual use case is the search for English noun chunks (nc) without an article that are aligned to a German chunk with an article [25]. The information need is the parallel nouns.

  MACRO aligned_nc($0=#c,$1=#n,$2=#c2,$3=#n2)
    $0:[cat="NC"] > $1:[pos="NOUN"] &
    $2:[cat="NC"] > $3:[pos="NOUN"] &
    $1 --en,de $3 ;

  SELECT #n_en,#n_de FROM corpus
  WHERE aligned_nc[#c_en,#n_en,#c_de,#n_de]
    & #c_de > [pos="DET"]
  EXCEPT
  SELECT #n_en,#n_de FROM corpus
  WHERE aligned_nc[#c_en,#n_en,#c_de,#n_de]
    & #c_de > [pos="DET"] & #c_en > [pos="DET"]

[23] http://www.postgresql.org
[24] Sub-selectors in PML-TQ work in a similar way and their quantifiers are cardinality tests on the matched node sets.
[25] We extend the alignment operator A -- B of the TreeAligner with language specifications A --L1,L2 B.

In all examples shown above, the resulting nodes of the combined SELECT statements will be fed into the desired reporting functions.

7 Conclusion

We provided a thorough review of linguistic query languages and their implementation approaches and tried to connect them to our use case of richly annotated multiparallel corpora. The usability of linguistic query systems is determined by their expressibility, expressiveness, and efficiency, their support for flexible reporting and exporting – including output for visualization back-ends – and open availability. We decided to host our highly structured data on an RDBMS and to provide a translation from a logic-based query language into SQL.

An important open question for future work is an empirical assessment of whether our approach can efficiently deal with the huge amount of annotations and textual material of large multiparallel corpora. Furthermore, a performance comparison with approaches based on RDF triple stores and SPARQL (Chiarcos, 2012) is needed.

Acknowledgments

I wish to thank Johannes Graën for fruitful discussions and Rahel Oppliger for proofreading. This research was supported by the Swiss National Science Foundation under grant 105215 146781/1 through the project "Large-scale Annotation and Alignment of Parallel Corpora for the Investigation of Linguistic Variation".

References

Vít Baisa, Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, and Pavel Rychlý. 2014. Bilingual word sketches: the translate button. In Proceedings of EURALEX, pages 505–513.

Piotr Bański, J. Bingel, N. Diewald, E. Frick, M. Hanl, M. Kupietz, P. Pezik, C. Schnober, and A. Witt. 2013. KorAP: the new corpus analysis platform at IDS Mannheim. In Human Language Technologies as a Challenge for Computer Science and Linguistics: Proceedings of the 6th Language and Technology Conference.

Steven Bird, Yi Chen, Susan B. Davidson, Haejoong Lee, and Yifeng Zheng. 2006. Designing and evaluating an XPath dialect for linguistic queries. In Proceedings of the 22nd International Conference on Data Engineering, pages 52–.

Gosse Bouma and Geert Kloosterman. 2007. Mining syntactically annotated corpora with XQuery. In Proceedings of the Linguistic Annotation Workshop, pages 17–24.

Sabine Brants, Stefanie Dipper, Peter Eisenberg, Silvia Hansen-Schirra, Esther König, Wolfgang Lezius, Christian Rohrer, George Smith, and Hans Uszkoreit. 2004. TIGER: Linguistic interpretation of a German corpus. Research on Language and Computation, 2(4):597–620.

Jean Carletta, Stefan Evert, Ulrich Heid, and Jonathan Kilgour. 2005. The NITE XML toolkit: Data model and query language. Language Resources and Evaluation, 39(4):313–334.

Christian Chiarcos, Julia Ritz, and Manfred Stede. 2009. By all these lovely tokens... merging conflicting tokenizations. In Proceedings of the Third Linguistic Annotation Workshop, pages 35–43.

Christian Chiarcos, Kerstin Eckart, and Julia Ritz. 2010. Creating and exploiting a resource of parallel parses. In Proceedings of the Fourth Linguistic Annotation Workshop, pages 166–171.

Christian Chiarcos. 2012. A generic formalism to represent linguistic corpora in RDF and OWL/DL. In Proceedings of LREC 2012, pages 3205–3212.

Oliver Christ. 1994. A modular and flexible architecture for an integrated corpus query system. In Proceedings of COMPLEX'95, 3rd Conference on Computational Lexicography and Text Research, Budapest, Hungary, pages 23–32.

Mark Davies. 2005. The advantage of using relational databases for large corpora: Speed, advanced queries and unlimited annotation. International Journal of Corpus Linguistics, 10(3):307–334.

Lukas Faulstich, Ulf Leser, and Thorsten Vitt. 2006. Implementing a linguistic query language for historic texts. In Current Trends in Database Technology, pages 601–612.

Elena Frick, Carsten Schnober, and Piotr Bański. 2012. Evaluating query languages for a corpus processing system. In Proceedings of LREC 2012, pages 2286–2294.

Markus Gärtner, Gregor Thiele, Wolfgang Seeker, Anders Björkelund, and Jonas Kuhn. 2013. ICARUS – an extensible graphical search tool for dependency treebanks. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60.

Sumukh Ghodke and Steven Bird. 2012. Fangorn: A system for querying very large treebanks. In Proceedings of COLING 2012: Demonstration Papers, pages 175–182.

Torsten Grust, Maurice van Keulen, and Jens Teubner. 2004. Accelerating XPath evaluation in any RDBMS. ACM Transactions on Database Systems, 29(1):91–131.

Najeh Hajlaoui, David Kolovratnik, Jaakko Väyrynen, Ralf Steinberger, and Daniel Varga. 2014. DCEP - digital corpus of the European Parliament. In Proceedings of LREC 2014, pages 3164–3171.

Andrew Hardie. 2012. CQPweb – combining power, flexibility and usability in a corpus analysis tool. International Journal of Corpus Linguistics, 17(3):380–409.

Florian Holzschuher and René Peinl. 2013. Performance of graph query languages: Comparison of Cypher, Gremlin and native access in Neo4J. In Proceedings of the Joint EDBT/ICDT 2013 Workshops, pages 195–204.

Miloš Jakubíček, Adam Kilgarriff, Diana McCarthy, and Pavel Rychlý. 2010. Fast syntactic searching in very large corpora for many languages. In Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation, pages 741–747.

Daniel Janus and Adam Przepiórkowski. 2007. Poliqarp: An open source corpus indexer and search engine with syntactic extensions. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 85–88.

Stephan Kepser. 2003. Finite Structure Query: A tool for querying syntactically annotated corpora. In Proceedings of EACL, pages 179–186.

Adam Kilgarriff, Pavel Rychlý, Pavel Smrž, and David Tugwell. 2004. The Sketch Engine. In Proceedings of the Eleventh EURALEX International Congress, pages 105–116.

Adam Kilgarriff, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. 2014. The Sketch Engine: ten years on. Lexicography, 1(1):7–36.

Catherine Lai and Steven Bird. 2004. Querying and updating treebanks: A critical survey and requirements analysis. In Proceedings of the Australasian Language Technology Workshop 2004, pages 139–146.

Catherine Lai and Steven Bird. 2005. LPath+: A first-order complete language for linguistic tree query. In Proceedings of PACLIC 19, pages 1–12.

Catherine Lai and Steven Bird. 2010. Querying linguistic trees. Journal of Logic, Language and Information, 19(1):53–73.

Joakim Lundborg, Torsten Marek, Maël Mettler, and Martin Volk. 2007. Using the Stockholm TreeAligner. In Proceedings of the 6th Workshop on Treebanks and Linguistic Theories, pages 73–78.

Torsten Marek, Joakim Lundborg, and Martin Volk. 2008. Extending the TIGER query language with universal quantification. In KONVENS 2008: 9. Konferenz zur Verarbeitung natürlicher Sprache, pages 5–17.

Hendrik Maryns and Stephan Kepser. 2009. MonaSearch - a tool for querying linguistic treebanks. In Treebanks and Linguistic Theories 2009, pages 29–40.

Jiří Mírovský. 2008. Netgraph - making searching in treebanks easy. In Proceedings of IJCNLP'08, pages 945–950.

Lars Nygaard and J. B. Johannessen. 2004. Searchtree - a user-friendly treebank search interface. In Proceedings of TLT 2004, Tübingen, pages 183–189.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Philipp Koehn. 2005. Europarl: A parallel corpus for Piotr Pezik. 2011. Providing corpus feedback for statistical . In Machine Transla- translators with the PELCRA search engine for tion Summit, volume 5, pages 79–86. NKJP. In Explorations across languages and cor- Milen Kouylekov and Stephan Oepen. 2014. Semantic pora : PALC 2009,Łod´ z´ Studies in Linguistics, technologies for querying linguistic annotations: An pages 135–144, Frankfurt am Main; New York. Pe- experiment focusing on graph-structured data. In ter Lang. Proc LREC 2014, pages 4331–4336. Piotr Pezik. 2013. Indexed graph Thomas Krause and Amir Zeldes. 2014. Annis3: A databases for querying rich TEI annotation. new architecture for generic corpus query and visu- http://digilab2.let.uniroma1.it/teiconf2013/wp- alization. Digital Scholarship in the Humanities. content/uploads/2013/09/Pezik.pdf. Esther Konig¨ and Wolfgang Lezius. 2000. A descrip- Alexandre Rafalovitch and Robert Dale. 2009. tion language for syntactically annotated corpora. In United nations general assembly resolutions: A six- COLING 2000, July 31 - August 4, 2000, Univer- language parallel corpus. In Proceedings of the MT sitat¨ des Saarlandes, Saarbrucken,¨ Germany, pages Summit, pages 292–299. 1056–1060. Douglas L. T. Rohde. 2005. TGrep2 user manual. Esther Konig¨ and Wolfgang Lezius. 2003. The http://tedlab.mit.edu/ dr/Tgrep2/tgrep2.pdf. TIGER language. A description language for syn- tax graphs. formal definition. Technical report, In- Viktor Rosenfeld. 2010. An Implementation of the An- stitute for Natural Language Processing, University nis 2 Query Language. Student thesis, Humboldt- of Stuttgart. Universitat¨ zu Berlin.

15 Pavel Rychly.´ 2007. Manatee/Bonito – a modular cor- pus manager. In Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2007, pages 65–70. Roman Schneider. 2013. KoGra-DB: Using MapRe- duce for language corpora. In 43. Jahrestagung der Gesellschaft fur¨ Informatik (GI), pages 140–142. Raivis Skadin¸s,ˇ Jorg¨ Tiedemann, Roberts Rozis, and Daiga Deksne. 2014. Billions of parallel words for free: Building and using the EU Bookshop Corpus. In Proc of LREC 2014, pages 1850–1855. Ralf Steinberger, B. Pouliquen, A. Widiger, C. Ignat, T. Erjavec, D. Tufis¸, and D. Varga. 2006. The JRC- acquis: A multilingual aligned parallel corpus with 20+ languages. In Proc of LREC 2006. Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schluter.¨ 2012. DGT- TM: A freely available Translation Memory in 22 languages. In Proc of LREC 2012, pages 454–459. Ralf Steinberger, Mohamed Ebrahim, Alexandros Poulis, Manuel Carrasco-Benitez, Patrick Schluter,¨ Marek Przybyszewski, and Signe Gilbro. 2014. An overview of the European Union’s highly mul- tilingual parallel corpora. Language Resources and Evaluation, 48(4):679–707. Jorg¨ Tiedemann. 2011. Bitext Alignment, volume 4 of Synthesis Lectures on Human Language Technolo- gies. Morgan & Claypool. Jorg¨ Tiedemann. 2012. Parallel data, tools and inter- faces in OPUS. In Proc of LREC 2012, pages 2214– 2218. Amir Zeldes, Anke Ludeling,¨ Julia Ritz, and Chris- tian Chiarcos. 2009. ANNIS: A search tool for multi-layer annotated corpora. In Corpus Linguis- tics 2009. Jan Stˇ epˇ anek´ and Petr Pajas. 2010. Querying diverse treebanks in a uniform way. In Proc LREC 2010, pages 1828–1835.

Interactive Visualizations of Corpus Data in Sketch Engine

Lucia Kocincová†
[email protected]

Vít Baisa‡† and Miloš Jakubíček‡† and Vojtěch Kovář‡†
[email protected]

†NLP Centre, Faculty of Informatics, Masaryk University, Brno, Czech Republic
‡Lexical Computing, Brighton, United Kingdom

Abstract

Automatic analysis of large text corpora produces large amounts of figures as the result of various functions. These provide empirical evidence for a research hypothesis or serve in numerous practical applications of natural language processing. Usually, the results are presented in the form of tables containing raw data to be interpreted by domain experts. This paper describes ongoing work on new visualizations and user interface enhancements in the Sketch Engine corpus management system which aim at easing the interpretation of the data for both novice users and language professionals.

1 Introduction

Analyses of textual data deal with the issue of choosing a suitable representation for their results. While this factor is often neglected and most attention is paid to the performance of the analytic functions (where the problem might simply be seen as a continuation of Aristotle's form vs. matter debate, with matter being absolutely predominant in science), there is no doubt that the representation heavily influences how data are perceived and can significantly help or harm correct understanding of the results (Meirelles, 2013).

This becomes even more pressing where the underlying analytic sample does not possess a uniform distribution—like language, which usually follows Zipf's distribution (Zipf, 1949). Not only "ordinary" language users but sometimes even language experts tend to underestimate the impact of such heavily skewed distributions like the Zipfian one, hence drawing invalid conclusions from the analyses they carry out.

In this paper we describe ongoing work on implementing new visualization options for language data analysis within the Sketch Engine corpus management system (Kilgarriff et al., 2014). Sketch Engine has a large variety of users, ranging from language learners, students and researchers in linguistics, lexicographers, translators and terminologists to data scientists in diverse domains. The presented enhancements to the user interface are implemented with the hope that they will not only provide a more graphically appealing and easier way to understand the data for novice users, but also speed up the daily work carried out by language experts (e.g. lexicographers).

2 Sketch Engine

Sketch Engine is a leading corpus management system useful for discovering how words behave in different contexts. It has a wide range of analytic functions dealing with billion-word corpora (see e.g. (Pomikálek et al., 2012; Jakubíček et al., 2013)). In this paper we focus on the visualization of two core functions that leverage the principles of distributional semantics—word sketches and a distributional thesaurus.

2.1 Word Sketches

A word sketch is a one-page summary of a word's collocational behavior according to particular grammar relations. It is usually computed by evaluating a large number of corpus queries (Jakubíček et al., 2010) performing shallow parsing, or by using some existing parser for this task, so as to retrieve a large number of collocation candidates which are then sorted using a lexicographic association score (see (Rychlý, 2008; Evert, 2005)).

A word sketch is currently presented in a table

Figure 1: Word sketches for lemma strategy.

where each column contains one grammar relation with collocates sorted by their score (see Figure 1).

2.2 Thesaurus

On top of the word sketches, Sketch Engine computes a distributional thesaurus. Normally a distributional thesaurus tackles the issue of finding similar words by asking the following question: given a word, what are the words that occur in the same context? In Sketch Engine this question is answered by finding words that share the same collocates in the same grammar relations in word sketches.

A thesaurus is simply a list of words accompanied with their frequency and a similarity score.

3 Thesaurus Visualization

The thesaurus already provides a visualization of the words in the form of a word cloud (Figure 2), however the score is not represented in the thesaurus in any way, so the potential of the data is not used to its full extent.

The main objective of the presented visualization in Figure 6 is to display all thesaurus attributes (textual string, its frequency and similarity score) of each lemma in a meaningful but also clear way, which was achieved by mapping these two values to multiple attributes. The core of the visualization is a lemma; the respective words are placed around it according to the score value—the higher the score is, the closer a word is to the center.

The score values are normalized into the [0,1] range, therefore a comparison of two different word lists is possible. However, a user evaluates each word list independently, so a fixed score axis could lead to misunderstanding in cases where the score value is relatively low. This fact was taken into account when designing the visualization, so the score axis is adaptive – its boundary values are always taken from the currently selected data, therefore the user can always clearly identify the most similar words of the current lemma.

Score in the visualization is also mapped to the colour of the circle behind the word, which simplifies comparison of two close words. The user can modify the colour range of the score by choosing another appropriate colour from the control panel, so the visual output can be adjusted.

Frequency is mapped to the size of a circle and also to the font size of a word, which can be disabled to avoid confusion when comparing short and long words.

4 Word Sketch Visualization

Word sketches are technically very similar in terms of data types—a thesaurus is a word list with score and frequency, and word sketches are multiple word lists with the same attributes. Therefore, the principles in graphical representations may remain the same—assuming also that a consistent visualization across different system functions makes its perception easier.

Score and frequency values are mapped to the same graphical elements as in the thesaurus, except the colour of circles, because in word sketches distinct colours are used for different grammar relations, as can be perceived from Figure 3. Therefore a score on a different radius around the center is mapped to circle transparency, so the words with the highest scores pop out from the center of the picture.

An example of word sketch visualization can be seen in Figure 4.
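The lexicographic association score of Rychlý (2008) referred to in Section 2.1 is logDice. A minimal sketch of ranking collocation candidates with it (the toy counts below are invented; real word sketches compute frequencies per grammar relation):

```python
import math

def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """logDice association score (Rychly 2008):
    14 + log2(2 * f_xy / (f_x + f_y)).
    Bounded above by 14; lower values mean weaker association."""
    return 14 + math.log2(2 * f_xy / (f_x + f_y))

def sorted_candidates(candidates):
    """Sort (collocate, f_xy, f_x, f_y) tuples by logDice,
    strongest collocations first."""
    return sorted(candidates,
                  key=lambda c: log_dice(c[1], c[2], c[3]),
                  reverse=True)

# Hypothetical counts: (collocate, f_xy, f_x, f_y)
cands = [("exit", 40, 100, 200), ("long-term", 90, 100, 300)]
ranked = sorted_candidates(cands)
```

Unlike significance-based measures, logDice does not grow with corpus size, which makes the scores comparable across corpora and convenient for lexicographic ranking.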

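The score-to-radius mapping described in Section 3 can be sketched as follows (a simplified stand-in for the actual D3 implementation; the radius, size and opacity constants are invented). Scores are normalized over the currently selected data, which is the adaptive axis described above:

```python
import math
import random

def layout_thesaurus(words, max_radius=300.0, min_size=10.0, max_size=40.0):
    """Place (word, score, frequency) triples around the center:
    normalize scores into [0,1] over the selected data, map higher
    scores to smaller radii (closer to the center) and higher
    opacity, and map frequency to circle/font size."""
    scores = [s for _, s, _ in words]
    freqs = [f for _, _, f in words]
    lo, hi = min(scores), max(scores)
    f_lo, f_hi = min(freqs), max(freqs)
    placed = []
    for word, score, freq in words:
        norm = (score - lo) / (hi - lo) if hi > lo else 1.0
        radius = (1.0 - norm) * max_radius       # higher score => closer to center
        angle = random.uniform(0, 2 * math.pi)   # angular position is random
        size_t = (freq - f_lo) / (f_hi - f_lo) if f_hi > f_lo else 1.0
        placed.append({"word": word,
                       "x": radius * math.cos(angle),
                       "y": radius * math.sin(angle),
                       "size": min_size + size_t * (max_size - min_size),
                       "opacity": 0.3 + 0.7 * norm})
    return placed

items = layout_thesaurus([("plan", 0.31, 2000), ("policy", 0.25, 5000),
                          ("approach", 0.19, 1500)])
```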
Figure 2: Thesaurus for lemma strategy.

Figure 3: Mapping of graphical elements.

Figure 4: Word sketches of experiment.

5 Interactivity of Visualizations

Each visualization is interactive, thus the exploration of relationships among words is easier. All interaction controls are grouped in a panel located on the right side of the page. The panel includes as many options as possible; however, the number of options will be reduced after user testing.

The exact values of a word's frequency and score are not left out entirely; they can be retrieved on demand by hovering over a particular word. For an approximate evaluation of values, which is what users mostly do, boundaries with labels are located in a legend. It is automatically updated when the data changes, so the user is always aware of the applied scale.

The interfaces and their parts are described in Figure 5.

6 Implementation

The visualizations introduced in this paper were implemented using JavaScript, jQuery and the D3 library. D3 helps to focus on the graphical output

Figure 5: Visualization interface of thesaurus (top) and word sketches (bottom).

and its transformations and allows more effective development of interactive visualizations (Bostock et al., 2011).

The input for the scripts generating the visual output is a JSON object retrieved from Sketch Engine. Data boundaries and other properties are calculated to set up the scales correctly. An assignment of coordinates for each word is made afterwards according to the score values.

The new position has to satisfy two conditions—the word's bounding box cannot overlap with other bounding boxes in the area, and the circles cannot overlap either. If these conditions are not met, all scales are contracted and the positions are recalculated. If it is not possible to meet the given restrictions, for example when too many words are requested for output, the limitations are dropped, positions are calculated without any restrictions, and it is up to the user to filter the output. This behavior ensures that a visualization is always rendered and the user's requirements for data exploration are not limited.

The exact positions of words in a given score radius are currently random, but mapping of another attribute is possible in the future.

In the visualization of the thesaurus the whole area around the center is covered with words. In the word sketches the whole area is divided into grammar relations; each relation is assigned an arc whose area is calculated as the sum of frequencies of the words that belong to the given relation. The length of an arc represents the average score of the displayed words.

The presented visualizations don't evaluate or modify the words as strings—the algorithms work with them as elements—they are therefore language-independent and can be used with any corpora available in Sketch Engine, as can be seen in Figure 6. Graphics generated from the system can also be downloaded for further use.

[Figure 6, panels (a) Czech corpus and (b) Arabic corpus]

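The placement loop described in Section 6 can be sketched as follows (a simplification of the described behavior: here only the size scale is contracted on a failed attempt rather than all scales, circles stand in for bounding boxes, and the fallback ignores overlaps so that something is always rendered):

```python
import math
import random

def place_words(words, contract=0.9, max_tries=50):
    """Try random angular positions on each word's score radius.
    If two circles would overlap, contract the size scale and
    recalculate all positions; as a last resort, drop the overlap
    restriction and leave filtering to the user."""
    scale = 1.0
    for _ in range(max_tries):
        placed = []
        ok = True
        for word, radius, size in words:
            angle = random.uniform(0, 2 * math.pi)
            x, y = radius * math.cos(angle), radius * math.sin(angle)
            r = scale * size
            if any(math.hypot(x - px, y - py) < r + pr
                   for px, py, pr, _ in placed):
                ok = False
                break
            placed.append((x, y, r, word))
        if ok:
            return placed
        scale *= contract  # contract and recalculate
    # Restrictions dropped: render anyway, without overlap checks.
    return [(radius * math.cos(i), radius * math.sin(i), scale * size, word)
            for i, (word, radius, size) in enumerate(words)]
```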
7 Conclusions and Future Work

In this paper we have presented a number of visualization enhancements that are soon to appear in Sketch Engine (including its open source variant NoSketch Engine). The main added value of these visualizations is that users can immediately see differences between data values, which can lead to faster interpretation of results and faster decisions. We are confident that further development of such

[Figure 6, panel (c) Estonian corpus]

Figure 6: Thesaurus generated from different corpora.

enhancements is necessary to facilitate better understanding of the underlying corpus data, especially in the context of "big data".

Evaluation testing with differently experienced users of Sketch Engine is currently in progress and so far shows that the visualizations are mostly valuable for new and intermediate users. According to the feedback from users we also plan A/B testing to verify new, improved versions of the presented visualizations. In the future, further features of Sketch Engine will be subject to visualization enhancements as well (e.g. corpus metadata overview).

Acknowledgments

This work has been partly supported by the Ministry of Education of CR within the LINDAT-Clarin project LM2010013. The research leading to these results has received funding from the Norwegian Financial Mechanism 2009–2014 and the Ministry of Education, Youth and Sports under Project Contract no. MSMT-28477/2014 within the HaBiT Project 7F14047. The access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the programme "Projects of Large Infrastructure for Research, Development, and Innovations" (LM2010005), is appreciated.

References

Michael Bostock, Vadim Ogievetsky, and Jeffrey Heer. 2011. D3 Data-Driven Documents. IEEE Transactions on Visualization and Computer Graphics, 17(12):2301–2309.

Stefan Evert. 2005. The Statistics of Word Cooccurrences: Word Pairs and Collocations. Ph.D. thesis, Universität Stuttgart.

Miloš Jakubíček, Adam Kilgarriff, Diana McCarthy, and Pavel Rychlý. 2010. Fast Syntactic Searching in Very Large Corpora for Many Languages. In Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation (PACLIC 24), pages 741–747.

Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, and Vít Suchomel. 2013. The TenTen Corpus Family. International Conference on Corpus Linguistics, Lancaster.

Adam Kilgarriff, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. 2014. The Sketch Engine: Ten Years On. Lexicography, 1:7–36.

Isabel Meirelles. 2013. Design for Information: An Introduction to the Histories, Theories, and Best Practices Behind Effective Information Visualizations. Rockport Publishers.

Jan Pomikálek, Pavel Rychlý, and Miloš Jakubíček. 2012. Building a 70 Billion Word Corpus of English from ClueWeb. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 502–506.

Pavel Rychlý. 2008. A Lexicographer-Friendly Association Score. In Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN, pages 6–9.

George Kingsley Zipf. 1949. Human Behavior and the Principle of Least Effort. Addison-Wesley Press.

Visualisation in Speech Corpora: Maps and Waves in the Glossa System

Michał Kosek, Anders Nøklestad, Joel Priestley, Kristin Hagen and Janne Bondi Johannessen
The Text Laboratory, Department of Linguistics and Scandinavian Studies
University of Oslo, Oslo, Norway
{michalkk,noklesta,joeljp,kristiha,jannebj}@iln.uio.no

Abstract

We present the Glossa web-based system for corpus search and results handling, focussing on two modes of visualisation implemented in the system. First, we describe the use of maps to show the geographical distribution of search results and its utility for exploring dialectal variation and discovering new isoglosses. Secondly, we present a functionality for speech visualisation, yielding dynamically generated representations of spectrograms, pitch and formants. The analyses are accompanied by the ability to replay selected parts of the waveform, as well as export and compare maximum, minimum and average values of the parameters for different selections. Among other things, this can be used to explore in more detail the set of spoken variants revealed by the geographical map view.

1 Introduction

The current availability of large corpora presents a new challenge for corpus research: how to process hundreds of millions or even billions of tokens, extract relevant information and transform it into a shape that can be used to explore specific hypotheses about language. In addition, the emergence of extensive collections of audio- and video-recorded speech, accompanied by transcriptions as well as metadata about the speakers such as their age, sex and geographical location, presents us with the challenge of how to make these additional sources of linguistic data readily available to language researchers. To help with these tasks, the Glossa corpus search system has been developed. The present article discusses the visualisation possibilities that Glossa provides for speech corpora, which are particularly useful for research within phonetics, phonology and dialectology.

The most important role of a speech corpus is to give access to data that are otherwise difficult to obtain. Without a corpus, researchers either need to trawl through a large amount of pre-existing speech recordings, listening for words or patterns they are interested in, or elicit such patterns from speakers. The former option is very time-consuming, and the latter requires setting up an experiment, which may introduce a bias, since the experimental setup would be designed with specific research questions in mind. Speech corpora do not necessarily solve the observer's paradox (Labov, 1972). They do, however, provide speech data that are not affected by a specific research question. Furthermore, in some cases participants are not even aware that their speech will inform linguistic research. BigBrother1 is one such corpus.

However, mere access to speech data is not sufficient. With the large amount of material currently available for many languages, techniques for extracting and visualising potentially interesting patterns in the data have become more important than ever. It is important both to have a way to group a large number of search results in a meaningful way, and to have the possibility to "zoom in" and analyse detailed features of the speech. If we take dialect research as an example, the corpus user may be interested in the relation between pronunciation and the place where the speakers live. The first of the presented features of Glossa groups the search results according to the phonetic transcription and geographical location and visualises them on a map. The user may then want to perform a more detailed analysis of pronunciation samples from some specific places shown on the map. Another feature of Glossa may then be used: the sound visualisation, which gives quick access to the spectrogram, waveform, pitch and formant

1 http://www.tekstlab.uio.no/nota/bigbrother/english.html

Figure 1: Advanced search in Glossa. We have searched in Norwegian dialects for a pronoun and the lemma ha separated by 1 or 2 tokens, and are in the process of selecting grammatical features for a third search term.

plots of the speech.

The article is structured as follows. Section 2 introduces general information about Glossa, and in particular, features related to speech corpora. The focus of the rest of the article is on the visualisation possibilities of speech corpora in Glossa: geographical visualisation is presented in section 3, and speech visualisation in section 4. Section 5 describes technical details related to the implementation of these features. Then, section 6 discusses research applications that can benefit from these types of visualisation. Finally, section 7 discusses possible improvements of the features that may make research even more effective.

2 The Glossa Corpus Search System

2.1 Features of Glossa

Glossa is a web application that provides powerful methods for corpus search and result visualisation combined with a strong focus on user-friendliness. It allows a user to search monolingual and multilingual (parallel) text and speech corpora annotated with grammatical analyses or other types of token information. By selecting a set of metadata values (such as the author and publisher of a written text or the location and age of a speaker), the user can limit the search to a certain subcorpus.

There are three alternative search interfaces, ranging from maximum ease of use to maximally powerful queries:

a) a Google-like search box for simple token or phrase searches,

b) a set of text inputs, checkboxes and drop-down menus for more complex, grammatically specified searches (see Figure 1),

c) a search box for queries that are directly passed to the underlying search engine.

Results are presented as KWIC concordances, with the additional possibility to generate frequency distributions for tokens, lemmas or parts of speech. Glossa is easily installed on servers or laptops via Docker (see section 5). Alternatively, the source code can be freely downloaded from GitHub2 under a very permissive open-source licence (MIT).

Out of the box, Glossa comes with support for corpora encoded with the IMS Open Corpus Workbench (Christ, 1994; Evert and Hardie, 2011)3, which supports up to 2.1 billion tokens per corpus. However, Glossa was built from the ground up to be easily extended with support for different search engines and corresponding search and result views, and there is already an optional module for searching corpora on remote servers

2 https://github.com/textlab/glossa
3 http://cwb.sourceforge.net

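For interface (c), the query syntax is that of the underlying search engine, by default the CQP language of the IMS Open Corpus Workbench. As a hedged sketch (the attribute names pos and lemma are assumptions about the corpus annotation), the search from Figure 1, a pronoun and the lemma ha separated by 1 or 2 tokens, might be written as:

```
[pos="pron"] []{1,2} [lemma="ha"]
```

Here `[]{1,2}` matches one or two arbitrary tokens between the two constrained positions.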
using the Federated Content Search protocol defined by the European CLARIN infrastructure4.

Glossa provides a simple admin interface which allows a Corpus Workbench (CWB) corpus to be created by uploading a zip file containing CWB indexes and potentially also tab-separated value files with metadata, as well as audio and video files if applicable. Glossa itself does not provide functionality to create these input resources; however, we are currently working on a corpus processing pipeline for creating corpora from XML or plain text files, including the TEI format5 for written corpora and the ELAN format6 for speech corpora.

It should be noted that Glossa is not the only web-based corpus search system available; some examples of powerful alternatives are CQPweb7, Corpuscle8, Korp9, and SketchEngine10. What sets Glossa apart from these is a unique combination of characteristics:

• the functionality for audio analysis and display of geographical distribution described in this paper,
• support for parallel queries in multilingual corpora,
• a strong focus on ease of use for non-technical users,
• ease of installation (particularly through its Docker distribution),
• extensibility with respect to different search engines and database systems,
• it is freely available without charge.

2.2 Speech Corpora in Glossa

With speech corpora, search results can be linked to audio and video clips that are accompanied by an auto-cue display showing each transcribed utterance as it is spoken. The utterances may have several different transcriptions. For example, in the Nordic Dialect Corpus (Johannessen et al., 2009), they are transcribed into the standard orthography and into a simplified phonetic transcription, which shows how the word was actually pronounced in a particular utterance. The phonetic search feature allows looking for utterances where a word is pronounced in a particular way.

If the speakers in a corpus were recorded at different geographical locations (such as in a dialect corpus) and geographical coordinates are provided for these locations, search results can also be visualised as plots on a geographical map. Furthermore, if audio recordings are available, each search result can be analysed in an interface that implements the most important functionality found in desktop applications for sound analysis such as Praat (Boersma and Weenink, 2001). The rest of this paper will focus on the latter two functionalities: geographical maps and sound analysis.

3 Geographical Visualisation in Glossa

Corpus linguistic investigation commonly draws on analytic and communicative techniques taken from other fields. Dialectologists interested in regional variation have long turned to manually rendered maps to represent linguistic features. Perhaps the earliest example of such work is Der Deutsche Sprachatlas, carried out in the first half of the twentieth century and more recently digitised as a result of the Digitaler Wenker-Atlas (DiWA) project (Schmidt et al., 2001). However, as late as 2005, the lack of automation was still a concern (Labov et al., 2005, 41–42). Work has since been done to automatically render linguistic data in geographical maps, for example the Dynamic Syntactic Atlas of the Dutch dialects (Barbiers and others, 2006, DynaSAND). The following section details one simple approach to achieving the type of digital linguistic topography heralded by Labov.

Geographical location is an essential metadata component of any speech corpus, and can be utilised in search visualisations. The Google Maps Embed API11 has proven useful in extending the functionality of Glossa to this end. While originally incorporated into Glossa as a way of providing a metadata overview for corpus queries, it soon became evident that the spatial distribution of data reveals interesting patterns, particularly for corpora comprising multiple layers of transcription. The Norwegian component of the Nordic Dialect Corpus (Johannessen et al., 2009) is one such example; it is transcribed using a simplified phonetic

4 http://clarin.eu
5 http://www.tei-c.org/index.xml
6 https://tla.mpi.nl/tools/tla-tools/elan
7 https://cqpweb.lancs.ac.uk
8 http://clarino.uib.no/korpuskel/page
9 http://spraakbanken.gu.se/eng/korp-info
10 http://www.sketchengine.co.uk
11 https://developers.google.com/maps/web

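The compilation step described in section 3, grouping phonetic variants with their coordinates into a plottable structure, can be sketched as follows (the result format and the colour-assignment policy are assumptions for illustration; Glossa presumably builds a comparable structure before handing it to the Maps API):

```python
from collections import Counter, defaultdict

def compile_map_data(results, palette):
    """Group search results by recording location, count phonetic
    variants per location, and assign each variant a marker colour.
    `results` is a list of dicts with 'phonetic', 'lat' and 'lon'
    keys (an assumed, simplified result format)."""
    by_location = defaultdict(Counter)
    for hit in results:
        by_location[(hit["lat"], hit["lon"])][hit["phonetic"]] += 1
    variants = sorted({h["phonetic"] for h in results})
    colours = {v: palette[i % len(palette)] for i, v in enumerate(variants)}
    return [{"lat": lat, "lon": lon,
             "variant": variant, "count": count,
             "colour": colours[variant]}
            for (lat, lon), counts in by_location.items()
            for variant, count in counts.items()]

markers = compile_map_data(
    [{"phonetic": "ikkje", "lat": 60.39, "lon": 5.32},
     {"phonetic": "ikke", "lat": 59.91, "lon": 10.75},
     {"phonetic": "ikkje", "lat": 60.39, "lon": 5.32}],
    palette=["yellow", "black"])
```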
Figure 2: Geographical visualisation in Glossa. We can see the geographical distribution of dialectal variations of ikke 'not' in Norway: yellow = velar plosive; black = fricative/affricate, non-nasal.

system as its initial layer12, with an automatically transliterated orthographic transcription providing a subsequent layer. This dual-tiered approach preserves the rich dialectal variation while ensuring searchability. Simple orthographic searches may yield dozens of distinct dialectal variants. Using the Embed API, Glossa takes advantage of this system of transcription by compiling the phonetic variants along with their corresponding geographical coordinates into a data structure and plotting the locations onto a Google map. Colour palettes are provided, giving the user the option of colour coding specified variants. The resulting clustering enables users to easily scan the distribution of linguistic features. For corpora of a sufficient size and geographical distribution, isoglosses are readily visible (see Figure 2). The size of the corpus, in terms of tokens per informant, will determine which linguistic features will be available for examination, less frequent features requiring more data. In the example, there is an average of about 4000 tokens per informant, with 564 informants spread amongst 163 locations.

4 Speech Visualisation in Glossa

4.1 Background

Sound visualisation applications are an important part of the phonetician's toolkit; they provide information that does not depend on the subjective perception of the sound signal and can be presented and referred to in a written form. Among various sound analysis programs, Praat (Boersma and Weenink, 2001) is specifically designed to visualise and extract parameters of speech, and therefore it is a standard tool within phonetics. Among its many features, the most significant ones include visualisation of the waveform and the spectrogram, and pitch and formant analysis.

Speech corpora not only enable easy access to a large amount of spoken utterances, but also allow the user to restrict the search according to a range of variables, based on corpus annotations and metadata. For instance, phonological annotations allow the user to find all words that share a particular pronunciation, part-of-speech tagging can be used to differentiate between some of the homonyms, and metadata may be used to specify the sex and dialect of the speaker.

It would therefore be natural to use Praat to analyse search results from speech corpora. Unfortunately, it is often difficult or even impossible.

12 The transcription standard is described here: http://www.tekstlab.uio.no/nota/scandiasyn/Transkripsjonsrettleiing%20for%20ScanDiaSyn.pdf (in Norwegian). It is a coarse-grained transcription standard based on the Oslo Norwegian pronunciation of the alphabet (Papazian and Helleland, 2005).

Figure 3: Sound visualisation in Glossa. We have searched a Mandarin Chinese corpus for utterances with three consecutive third tones. The pitch plot (black) shows that in the presented search result only the last syllable is pronounced with the actual low third tone. The figure also shows the waveform, spectrogram and formant plots, and the user interface that provides the functionalities described in the article.

Praat operates only on local files saved on the hard drive. Due to copyright restrictions, speech corpora do not commonly allow sound files to be directly downloaded. Most often the search results are only available via streaming and can only be listened to through a web browser.

If download links are available, Praat might be sufficient for research that requires deep analysis of a relatively small number of samples. There are, however, many situations in which the use of Praat is suboptimal. For instance, in the exploratory phase of the research, when one tries to formulate a hypothesis, it is often beneficial to conduct many different searches according to different criteria and perform a quick analysis of the results. In such cases, downloading each result separately and repeatedly switching between separate applications is ineffective.

4.2 Features of the Current Visualisation System

The online speech visualisation in Glossa contains the most important features of Praat and solves the above-mentioned problems. The sound visualisation is accessible through an icon next to each result presenting KWIC concordances. An overview of the user interface is presented in Figure 3.

The following plots are available: the waveform in the upper part, the spectrogram in the lower part, and four formants and pitch overlaid on the spectrogram. The plots can be turned on or off individually according to user preference. Just like in Praat, the sound can be played, and an animated vertical line shows the current playing position in real time.

The user can select a part of the spectrogram that may, for example, correspond to one sound or syllable, listen to that part, and zoom in to see the features of the sound in more detail. As the selection is made, the numerical values of the parameters are displayed: the duration and statistics for the selected period, namely the maximal, minimal and average value of the pitch and formants. The time and frequency of any point on the spectrogram can be accessed via a mouse hover-over. Right-clicking allows the numerical values to be exported to a separate window. The statistics of the selection can also be exported.

The visualisation is fully integrated with the search functionality of Glossa. For instance, in a corpus with speaker metadata, part-of-speech annotation and phonetic transcription, the user may specify a particular word in its orthographic and/or phonetic transcription, its part of speech, and restrict the search to utterances from speakers who meet particular criteria.

5 Technical Details

Even though Praat has batch processing facilities, it does not provide any application programming interface (API) that would make integration into a larger system possible. Therefore we used the Snack Sound Toolkit [13] instead, which contains Tcl/Tk and Python bindings.

The visualisation is generated by a Python daemon using Snack, which communicates with the main Glossa daemon, written in Ruby on Rails, through a simple protocol. Such an architecture is motivated by several factors. Snack does not have Ruby bindings, so calling it directly from the main Glossa daemon is not possible. Moreover, having this functionality in a separate daemon makes Glossa more responsive: other functionalities of Glossa are not blocked during the generation of the spectrograms, nor is it necessary to create a separate process for every request. Additionally, initialisation of Snack and other related code requires several seconds. In our architecture this needs to be done only once, after starting the daemon, and does not affect the time of the generation of the spectrogram.

The speed of the spectrogram generation depends on the size of the segments [14]. The daemon can generate spectrograms for sound chunks of any length. Since Snack needs to load the whole file into memory in order to generate the spectrogram, and the generation of some of the plots (e.g. the formants) may be time-consuming for longer files, we decided to split sound files into one-minute chunks, generate the plots for each of them, and stitch them together in a way that makes the result practically indistinguishable from a plot generated directly from a larger segment. This solution makes visualisation of large sound files possible without using too many resources. That said, if the chunks are too long, the system becomes less usable for the researchers, as they need to spend time looking for the words or sounds they are interested in. In our corpora, the segments are generally not longer than a few dozen seconds. The plot for a typical 10-second segment is generated on our server in less than 4 seconds, which allows high interactivity. If a corpus is divided into larger segments, the user will need to wait a bit longer, but our solution is still faster than downloading the file and opening it in Praat: the plot for a 38-second segment is generated in less than 9 seconds, and for a 94-second segment in less than 21 seconds.

The interactive audio player that allows any part of the displayed sound to be played was written in JavaScript and makes use of the API provided by SoundManager 2 [15]. Our visualisation tool may use Adobe Flash to play sounds, or use a purely JavaScript-based alternative when Flash is not available.

In order to minimise installation problems and support reproducible research (Chamberlain and Schommer, 2014), Glossa is released as a Docker [16] image, which contains all the required dependencies. Such packaging guarantees that it will give exactly the same results on the same data, avoiding problems caused by different software versions. Moreover, although the system makes use of several programming languages and libraries, due to Docker packaging it is cross-platform and installable with just a few clicks.

In order to render the maps, Glossa communicates with the Google Maps Embed API using a JSON object. The colour-coding widget is borrowed from the jQuery API, and shares the same JSON object. The reason for choosing Google's service was simply practicality. At the time of implementation, Google's API was deemed the most well developed and its interface the most familiar to potential users. However, most of the development involved was agnostic with regard to which service was chosen, meaning any future move to another service will require little adaptation. It is worth noting here that using an open-source map service, such as OpenStreetMap, would enable hosting the maps together with the Glossa installation, thus enabling offline viewing.
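The chunk-and-stitch strategy described above can be sketched as follows. This is an assumed reconstruction, not the actual daemon code; `render_chunk` stands in for the Snack-based plotting step.

```python
# Minimal sketch of the chunking strategy: a long sound file is cut into
# one-minute chunks, each chunk is rendered separately, and the rendered
# strips are concatenated into one seamless plot.
CHUNK = 60.0  # chunk length in seconds

def chunk_bounds(duration):
    """Yield the (start, end) times of the one-minute chunks covering a file."""
    start = 0.0
    while start < duration:
        yield (start, min(start + CHUNK, duration))
        start += CHUNK

def render_stitched(duration, render_chunk):
    """render_chunk(start, end) -> image strip; strips are stitched in order."""
    return [render_chunk(s, e) for s, e in chunk_bounds(duration)]

# A 94-second file yields two chunks, 0-60 s and 60-94 s:
strips = render_stitched(94.0, lambda s, e: f"spectrogram[{s:.0f}-{e:.0f}s]")
```

Rendering per chunk keeps the memory needed by Snack bounded, while stitching keeps the user-visible result identical to a single large plot.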

[13] http://www.speech.kth.se/snack
[14] A segment is a piece of transcription synchronised with the audio, with a defined start and end time.
[15] http://www.schillmania.com/projects/soundmanager2
[16] http://www.docker.com

6 Research Applications

As mentioned in subsection 4.1, the visualisation features of Glossa may be useful in the exploratory phase of the research. For example, one may start by searching for a word in the Nordic Dialect Corpus and see the geographical distribution of different pronunciation variants. The phonetic search feature, mentioned in subsection 2.2, may then be used to restrict the search to one of the variants presented by the geographical visualisation. The phonetic transcription used in the corpus is coarse-grained, which means that there may be subtler differences within a particular variant. The sound visualisation allows the user to find out whether there are differences within the chosen pronunciation variant. For instance, one may measure the duration of the vowels or the voice onset time of the consonants. The spectrogram analysis may improve accuracy in differentiating relatively similar phones (such as the alveolar tap and trill), compared to a situation where only audio is available. The tool is designed to work interactively, and one may easily repeat the procedure with different words and different features in order to produce a hypothesis.

The map visualisation has already had a significant impact on the Scandinavian linguistic research community. Researchers from the projects NorDiaSyn and NorDiaCorp have written more than sixty papers on various syntactic phenomena in the North Germanic languages, in which maps from the Nordic Dialect Corpus have played a crucial role. These papers have been published in a new online open-access journal which requires its papers to use empirical data from, inter alia, the Nordic Dialect Corpus and maps generated from it: the Nordic Atlas of Language Structures Journal (Johannessen and Vangsnes, 2014).

Visualisation in speech corpora is also useful for more reliable data gathering. For example, stress is a feature of spoken Mandarin Chinese that has received relatively little attention. One may want to use a Mandarin speech corpus, such as MAID [17], to investigate the patterns of its occurrence. The problem, however, is that stress is not marked in the Chinese writing system, and Hanyu Pinyin, the official phonetic transcription system for Mandarin, only distinguishes unstressed syllables that completely lose their underlying tone. Therefore, without a sound recording it is impossible to distinguish unstressed syllables that still retain their tone from stressed syllables. Another feature of Mandarin, present in the Beijing dialect, is retroflex suffixation. This suffix is often omitted in speech transcription, and therefore an actual recording is again more reliable than transcription alone. But even with a recording, researchers often have no other choice than to rely on their subjective evaluation of whether stress or a suffix is present in a particular syllable. The sound visualisation makes it possible to refer to objective features of the waveform, spectrogram and formants, such as the F3 decrease in the case of retroflex suffixation (Lee, 2005), instead of researchers' subjective perception.

Those who work with specific groups of speakers, for example children or people with pronunciation difficulties, may take advantage of the features of Glossa even if corpora covering their target group are not available. For example, Norwegian children may tend to produce epenthetic vowels in word-initial consonant clusters. Researching this feature requires a control group of adult Norwegian speakers, and this is where data from the Norwegian Speech Corpus [18] or the Nordic Dialect Corpus [19] may be useful. The sound visualisation makes it possible to quickly find out whether adults produce such epenthetic vowels and to measure their duration.

One of the easiest features to analyse in the sound visualisation is the pitch contour, which may give valuable information, especially in the case of tonal languages. For example, one may analyse the 3rd tone sandhi in Mandarin Chinese: the patterns occurring when there are two or more subsequent syllables with the 3rd tone. When the syllables occur within a prosodic foot, all but the last one change to the 2nd tone (Shih, 1997). In other cases the change is not obligatory, but may occur. The pitch plot is a useful tool for verifying whether the tone sandhi actually occurs and for analysing the patterns of its occurrence. The visualisation is even more useful for investigating effects of tone coarticulation: while the tone sandhi is categorical, the coarticulation effect changes the pitch to different degrees, depending on the situation (Zhang and Liu, 2011). Figure 3 shows how one may use Glossa to look for particular tone combinations and to visualise and analyse the pitch contour. Average pitch values over specified periods of time can then be exported, which allows for the investigation of degrees of tone changes in natural speech, depending on the adjacent tonal context.

No formal evaluation of the presented visualisation features has been performed. However, the fact that many researchers use the corpora served by Glossa on a daily basis, and publish research papers based on them, is a sign that they have succeeded in satisfying the user groups. 16,142 individual searches were performed by more than 220 different users between 20 December 2014 and 19 April 2015 (i.e. 135 searches per day on average). The many papers in the Nordic Atlas of Language Structures Journal provide yet more evidence. The quantitative results from Google Scholar are also worth mentioning: 141 scholarly publications refer to "Nordic Dialect Corpus", which is just one of the many corpora using Glossa, and another 24 refer to its Norwegian name "Nordisk dialektkorpus". These are high numbers for a corpus that was only ready for use in 2011.

7 Future Work

The detection of isoglosses discussed above still leaves the job of charting them. One interesting line of future development would be attempting to perform this task automatically. The application of k-means clustering and the computation of convex hulls (Wiedenbeck and La Touche, 2008) would be one such avenue.

The currently available sound visualisation is a multi-purpose tool that provides a wide range of acoustic data about the speech signal. It reduces the time required to visualise the features of the utterances that are of interest. However, corpora should not only give access to specific examples, but also provide useful statistics that allow generalisations to be drawn from the search results. In this case, the efficiency of research would be increased even more if the corpus search tool could directly provide statistics relating to the search results, for example average voice onset time, vowel length or formant values within a vowel. This would, however, require synchronisation of speech at the level of words and/or syllables.

In the speech corpora currently available in Glossa, the sound and the text are synchronised at the utterance level. There are, however, no technical problems with producing corpora that are synchronised at the word level. When such corpora become available, Glossa may be extended with an API that allows algorithms that find and/or measure specific phonetic details to be applied, such as the automatic measurement of voice onset time (Sonderegger and Keshet, 2012) or the detection of retroflex suffixation (Zhang et al., 2014).

8 Conclusion

This paper has discussed the visualisation possibilities of the Glossa corpus search system. The main focus was on the features available for speech corpora: geographical visualisation and speech visualisation. Geographical visualisation makes it possible to display pronunciation variants of the search results on a map and to use colour-coding to cluster them into larger groups. The phonetic search feature allows specific pronunciation variants in the corpus to be found. Each search result in a speech corpus can be visualised in the built-in tool for audio analysis. The user may select part of it and plot its parameters or export their values. These visualisations may be used to explore the data and formulate a research hypothesis, verify the existence of particular phonetic features, and easily analyse various parameters of speech.

Acknowledgements

We would like to thank Professor Christoph Harbsmeier for his suggestions and input on the visualisation of speech, and Pernille Hansen and Anders Vaa for their ideas on the use of the speech features now included in Glossa. Further, we are grateful to those who have provided speech content, and to the transcribers who have facilitated the searchability of our corpora. The two projects NorDiaSyn (financed by the Research Council of Norway) and NorDiaCorp (financed by Nordforsk) were responsible for many of the recordings and the transcriptions used in the present speech corpora.

This work was partly supported by the Research Council of Norway through its Centres of Excellence funding scheme, project number 223265, and partly through the Research Council's infrastructure project CLARINO.

[17] Mandarin Audio Idiolect Dictionary, a dictionary and corpus of Beijing Mandarin: http://www.hf.uio.no/iln/om/organisasjon/tekstlab/prosjekter/maid
[18] http://tekstlab.uio.no/nota/oslo/english
[19] http://tekstlab.uio.no/nota/scandiasyn

References

Sjef Barbiers et al. 2006. Dynamic Syntactic Atlas of the Dutch dialects (DynaSAND). Meertens Institute, Amsterdam. http://www.meertens.knaw.nl/sand.

Paul Boersma and David Weenink. 2001. Praat, a system for doing phonetics by computer. Glot International, 5(9/10):341–345.

Ryan Chamberlain and Jennifer Schommer. 2014. Using Docker to support reproducible research. Technical report, Invenshure, LLC. http://dx.doi.org/10.6084/m9.figshare.1101910.

Oliver Christ. 1994. A modular and flexible architecture for an integrated corpus query system. In Proceedings of the 3rd International Conference on Computational Lexicography (COMPLEX), pages 22–32, Budapest.

Chilin Shih. 1997. Mandarin third tone sandhi and prosodic structure. Studies in Chinese Phonology, 20:81–123.

Morgan Sonderegger and Joseph Keshet. 2012. Automatic measurement of voice onset time using discriminative structured prediction. The Journal of the Acoustical Society of America, 132(6):3965–3979.

Bryce Wiedenbeck and Kit La Touche. 2008. Drawing isoglosses algorithmically. In Class of 2008 Senior Conference on Computational Geometry, page 22.

Jie Zhang and Jiang Liu. 2011. Tone sandhi and tonal coarticulation in Tianjin Chinese. Phonetica, 68:161–191.

Long Zhang, Haifeng Li, Lin Ma, and Jianhua Wang. 2014. Automatic detection and evaluation of Erhua in the Putonghua proficiency test. Chinese Journal of Acoustics, 1:83–96.

Stefan Evert and Andrew Hardie. 2011. Twenty-first century Corpus Workbench: Updating a query architecture for the new millennium. In Proceedings of the Corpus Linguistics 2011 conference, Birmingham. University of Birmingham.

Janne Bondi Johannessen and Øystein Alexander Vangsnes. 2014. Nordic Atlas of Language Structures Journal. Department of Linguistics and Scandinavian Studies, University of Oslo. http://www.tekstlab.uio.no/nals/.

Janne Bondi Johannessen, Joel Priestley, Kristin Hagen, Tor Anders Åfarli, and Øystein Alexander Vangsnes. 2009. The Nordic Dialect Corpus – An advanced research tool. In Proceedings of the 17th Nordic Conference of Computational Linguistics NODALIDA 2009. NEALT Proceedings Series, volume 4.

William Labov, Sharon Ash, and Charles Boberg. 2005. The atlas of North American English: Phonetics, phonology and sound change. Walter de Gruyter.

William Labov. 1972. Sociolinguistic patterns. Number 4 in Conduct and Communication. University of Pennsylvania Press.

Wai-Sum Lee. 2005. A phonetic study of the "er-hua" rimes in Beijing Mandarin. In Ninth European Conference on Speech Communication and Technology.

Eric Papazian and Botolv Helleland. 2005. Norsk talemål. Høyskoleforlaget, Kristiansand.

Jürgen Erich Schmidt, Joachim Herrgen, Tanja Giessler, Alfred Lameli, Alexandra Lenz, Karl-Heinz Müller, Wolfgang Näser, Jost Nickel, Roland Kehrein, Christoph Purschke, et al. 2001. Digitaler Wenker-Atlas. Forschungszentrum Deutscher Sprachatlas, Marburg. http://www.diwa.info.

The ParaViz Tool: Exploring Cross-linguistic Differences in Functional Domains Based on a Parallel Corpus

Ruprecht von Waldenfels
Institute of Polish, Polish Academy of Sciences, Cracow
[email protected]

Abstract

ParaViz is a modular corpus query and analysis tool in development for use with a word-aligned, linguistically annotated multilingual parallel corpus. Representing an addition to classic query-based corpus tools, it allows the user to assess the cross-linguistic variation in the functional domain of items or structures that are defined as cognate or otherwise equivalent by the user. ParaViz provides the user with two perspectives on such data: on the one hand, a close-up perspective with word-aligned corpus examples that are classified and color-coded according to the user's criteria; on the other hand, a bird's-eye perspective with word lists and NeighborNet visualizations that offer an overview of the aggregated differences in use. Together they enable researchers to quickly find and explore convergent and divergent functional patterns of equivalent formants in different languages.

1 Introduction

ParaViz is a modular corpus query and analysis tool in development for use with a word-aligned, linguistically annotated multilingual corpus. It is deployed with ParaSol, a small multilingual parallel corpus primarily geared towards linguistic contrastive and typological research of Slavic (Waldenfels 2011), but may in general be used with any massively parallel word-aligned corpus such as Opus (Tiedemann, 2012) or InterCorp (Čermák and Rosen, 2012). ParaViz functions as a stand-alone component at the moment of writing and is planned to be implemented as a web application.

ParaViz adds a new type of functionality to parallel corpus querying, going beyond what is described in Volk et al. (2014). It supplements and builds on ParaVoz (Meyer, Waldenfels, Zeman 2014), a corpus query interface which allows querying parallel corpora through a traditional web interface on the basis of complex queries involving token sequences across aligned languages, including negative queries on aligned segments.

Section 2 of this paper gives an introduction to functional comparison using parallel texts as implemented in ParaViz. In section 3, I describe the implementation, and conclude in section 4.

2 Functional Comparison Based on Parallel Texts

The function of a linguistic item as understood here includes its semantic, pragmatic, or other characteristics. As a rule, functional characteristics of linguistic elements are harder to describe than their formal characteristics and involve complex analysis of corpus examples. This makes the comparison of such functions particularly difficult, since it presupposes a consistent, comparable analysis of these functions across many languages, a task that quickly becomes very complex with a growing number of languages.

To see this, consider a comparison of the German, Dutch, and English perfect. In all three languages, it is quite straightforward to describe cognate perfect constructions that consist of an auxiliary ('have' or 'be') and a past participle. However, in order to compare the functional profiles of such constructions, we first need to describe the functions of each construction, taking care to do this in a consistent, comparable manner that is indeed relevant for the comparison and does not miss important contrasts. This is a very time-consuming and difficult task for even a medium number of languages.

The fact that aligned parallel corpora involve translationally equivalent texts in many languages can be harnessed to quickly attain insights on functional differences and similarities based on formal definitions alone (see the papers in Cysouw and Wälchli (2007) and Dahl (2014) for related approaches). The basic notion is straightforward: if two items in two languages are often used as translations of each other, we assume that their functional potential is similar. This notion can be used to do quite complex comparisons of functional domains. In application to the above example, the fact that these perfect constructions differ in their function quickly emerges from simply observing that their distribution is very different in the parallel corpus.

For a different, rather lexical example, let us assume users want to compare the functional domain of color terms in different languages. In order to do that, users define which words represent the lexical categories red, blue and yellow across many languages, and the system then compares the use of these words (and thus, the lexical categories) across languages in translationally equivalent expressions. For example, such a comparison would show that German blau, English blue and French bleu are often used in the same segments, but that the distribution of Russian sinij stands apart, since this item denotes a dark hue of blue. Moreover, the German representative of the lexical concept 'blue' is used in contexts where it isn't used in English and French, since it also refers to a state of drunkenness (er ist blau 'he is drunk'), while in English, blue also denotes a melancholic state of mind (as in I'm feeling blue). If these uses are attested in the parallel corpus, the differences in the denotation of hues as well as in non-literal uses are readily apparent in the distribution of these terms in translationally equivalent segments across the languages in question.

In this way, the functions of variables of different types, ranging from grammatical categories such as tense, aspect, or case to lexical categories such as words for the color red or derivational suffixes, can be easily compared. Further applications and a more detailed description of the approach are found in von Waldenfels (2014).

3 The ParaViz system

ParaViz is meant to simplify the type of comparison outlined in section 2 by offering a standardized way to easily conduct such functional comparisons for a wide range of variables. This is done by offering the user a mechanism to define variables in a formal way and outputting the results of a classification of the corpus data based on these definitions. This section describes the stand-alone application as it is functional at the moment of writing; in the future, this system is planned to be implemented as a web service.

ParaViz is used with ParaSol, a small multilingual parallel corpus primarily geared towards linguistic contrastive and typological research [1]. ParaSol focusses on Slavic, but also includes Romance, Germanic, Finno-Ugric, Greek, Armenian and other languages. The word forms in most languages are lemmatized and POS-tagged; a subset of the corpus is word-aligned using UPLUG (Tiedemann, 2003).

3.1 Operationalization of Variables

As a first step in the process, users define the sets of elements which they want to compare across languages. This is done by the operationalization of variables in parameter files in XML format. In such a parameter file, variables are defined as constraints over word-aligned word forms and their annotation. At the moment, the parameter files allow the definition of such variables as regular expressions over tokens, their lemmas and POS tags, as well as over directly adjacent tokens, lemmas and POS tags. The following example defines the suffix classes OST and STVO, both denoting abstract nouns, in two Slavic languages (markup abbreviated to the language codes and patterns):

    ru    ость$      ^N.*
    ...
    sl    ost$       ^S.*

    ru    ство$      ^N.*
    ...
    pl    [cs]two$   ^subst.*

[1] http://www.parasolcorpus.org
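Restored with markup, such a parameter file might look roughly as follows. The element and attribute names here are illustrative assumptions; only the language codes, suffix patterns and POS patterns are given by the example above.

```xml
<!-- Hypothetical layout of a ParaViz parameter file. Each variable lists,
     per language, a regular expression over the token and one over its
     POS tag; the tag patterns follow each language's own tagset. -->
<variables>
  <variable name="OST">
    <language id="ru">
      <token>ость$</token>
      <tag>^N.*</tag>
    </language>
    <language id="sl">
      <token>ost$</token>
      <tag>^S.*</tag>
    </language>
  </variable>
  <variable name="STVO">
    <language id="ru">
      <token>ство$</token>
      <tag>^N.*</tag>
    </language>
    <language id="pl">
      <token>[cs]two$</token>
      <tag>^subst.*</tag>
    </language>
  </variable>
</variables>
```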

Figure 1: A word-aligned corpus sample with color coding according to the user-supplied parameter file.

The user then defines a filter condition for one of the languages, which is taken as the primary language. For example, for the comparison of nominal suffixes above, the user may choose to classify only nouns. In other cases, the user may want to restrict the classification to some list of lemmata; for example, the user may be interested in a particular lexical or grammatical domain. The primary language will usually be the language of the original, but in general, any language may be chosen.

The tokens in the primary language that satisfy the filter condition, as well as their word-aligned equivalents in the other languages, are classified in the next step. This classification assigns a type to each token in question based on the criteria defined in the parameter file.

The system then does a corpus search and classification of the corpus data, which is offered to the user in two ways: first, as random samples of the corpus hits in context; second, in an aggregate form.

3.2 Qualitative Perspective: Color-coded Corpus Samples

As a first result, the user is given random samples of corpus data conforming to the definitions. This enables the user to review and refine the operationalization of his or her parameters. The word-aligned forms in these corpus results are color-coded to reflect the types previously defined in the parameter file; an example output is given in figure 1. This perspective affords a qualitative assessment and allows the user to explore the data.

This part of the output is based on ParaVoz, a modular corpus query interface for CorpusWorkbench [2] (CWB; see Evert & Hardie (2011)) published as open source (Meyer et al. 2014). It is designed as an easy-to-use, easy-to-install and easy-to-maintain flexible corpus interface for a parallel corpus hosted by CWB. ParaVoz (and CWB) uses CQP, a query language also used with a number of other corpus engines such as the NoSketchEngine (Rychlý, 2007) or Poliqarp [3]. ParaVoz does not use the CWB output directly but, having configured CWB to SGML mode, reformats its output into a convenience XML format using regular expressions.

The output module of ParaVoz then uses XSLT to transform the XML result document. Using XML and XSLT at this stage allows rapid adaptation to diverse types of corpus data. In this case, word alignment visualization is realized by linking ids that are encoded as token annotations. These annotations are compared, and if a token in one of the target languages is aligned to the target tokens in the source language, it is shown in bold. Using the same script, each target form is classified according to the user-defined definitions and color-coded accordingly.

[2] http://cwb.sourceforge.net
[3] http://poliqarp.sourceforge.net

Figure 2: Strings representing the word-aligned tokens as classified into types according to the user-supplied parameter file.

Figure 3: Similarity of use of nouns derived with the suffix class OST in multiple versions of the same text in different Slavic languages.

3.3 Aggregate Perspective

In a second perspective, the system outputs visualizations of the aggregate differences in distribution of the variables across different languages. As described above, tokens in the primary language and their equivalents are classified with respect to the user-defined operationalization in terms of word, lemma, tag, and other possible levels, just as is done for the random samples. The examples are thus converted to strings as shown in figure 2. If the strings are seen as a table, each column represents a set of word-aligned word forms, with each word form represented as a single letter reflecting the type it was classified as.

The functional similarity or dissimilarity between the distribution of the variables in multilingual versions of the same text is then computed by determining for each pair of texts the overlap in the occurrences of this variable in aligned tokens, i.e.:

    dist(V_lng1, V_lng2) = 1 − |V_lng1 ∩ V_lng2| / |V_lng1 ∪ V_lng2|

Figure 4: Suffix classes across Slavic: functional similarity.

In other words, the system computes strings of word-aligned corpus positions that are labeled according to the classification in the parameter file and computes the Hamming distance between these strings. This computation is used to arrive at distance matrices describing the similarity or dissimilarity of the distribution of the variable between texts. These matrices are visualized in NeighborNets (Huson and Bryant, 2006), a clustering algorithm that was chosen since it preserves much of the ambiguity we find in this data.

Using different filters and definitions of the features that are being compared, the system can then be used to output different visualizations of the differences in distribution of the variables in question. This concerns (a) the distances of texts/languages to each other with respect to some variable (see fig. 3); (b) the distances of the variables to each other taken as cross-linguistic types; this is calculated for each pair of variables by determining the proportion of examples where both are used in equivalent word forms (see fig. 4); and (c) the distance of the language-specific instantiations of the variables to each other; this allows one to see whether formants of different classes overlap in their domain (see fig. 5). In addition, it outputs documents with word lists of equivalent tokens in different languages and their classification (not shown here).
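The aggregate computation can be sketched as follows. This is an illustrative reconstruction, not ParaViz's code: each language's classified corpus is assumed to be a string with one letter per aligned position ('O' for an OST-type token, 'S' for a STVO-type token, '-' for neither), and the pairwise distance is one minus the overlap ratio defined above.

```python
# Toy sketch of the pairwise distance matrix over classified strings.
from itertools import combinations

def distance(s1, s2, letter):
    """1 minus the Jaccard overlap of the positions where `letter` occurs."""
    v1 = {i for i, c in enumerate(s1) if c == letter}
    v2 = {i for i, c in enumerate(s2) if c == letter}
    return 1 - len(v1 & v2) / len(v1 | v2)

def distance_matrix(strings, letter):
    """strings: mapping language -> classified string; returns pairwise distances."""
    return {
        (l1, l2): distance(strings[l1], strings[l2], letter)
        for l1, l2 in combinations(sorted(strings), 2)
    }

strings = {"ru": "O-O-S", "pl": "O--OS", "sl": "O-O--"}
matrix = distance_matrix(strings, "O")
# ru and sl use the O-type in exactly the same positions {0, 2},
# so their distance is 0.0
```

A matrix like this, computed over full texts, is what is fed to the NeighborNet clustering to produce figures 3-5.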

Figure 5: Functional similarity of Slavic prepositions (language-specific representations; NeighborNet over labels such as PL.VLOC, RU.DO, UK.VACC)

4 Concluding Remarks

I have presented a component that supplements a query-based interface to a word-aligned multilingual parallel corpus with a new comparative corpus evaluation component, which is available offline and is planned to be implemented as a web service. This corpus evaluation component will provide users with the possibility to upload their own parameter files, which provide complex definitions of comparable items in different languages based on their formal characteristics. The corpus is then evaluated with respect to the functional similarity of the items in question. Crucially, the component aims to give both an aggregate and a detailed view of the data, so that the user retains the possibility to interpret the aggregate picture and refine the parametrization as necessary for his or her needs.

Acknowledgments

I gratefully acknowledge funding by the Swiss National Science Foundation, grant 151230, "Convergence and divergence of Slavic from a usage based, parallel corpus driven perspective".

References

František Čermák and Alexandr Rosen. 2012. The case of InterCorp, a multilingual parallel corpus. International Journal of Corpus Linguistics, 13(3):411–427.

Michael Cysouw and Bernhard Wälchli, editors. 2007. Parallel Texts: Using translational equivalents in linguistic typology. Special Issue of STUF 60/2.

Östen Dahl. 2014. The perfect map: Investigating the cross-linguistic distribution of TAME categories in a parallel corpus. In Benedikt Szmrecsanyi and Bernhard Wälchli, editors, Aggregating Dialectology and Typology: Linguistic Variation in Text and Speech, within and across Languages, pages 268–289. De Gruyter Mouton, Berlin, New York.

Stefan Evert and Andrew Hardie. 2011. Twenty-first century Corpus Workbench: Updating a query architecture for the new millennium. In Proceedings of the Corpus Linguistics 2011 Conference, Birmingham, UK. University of Birmingham.

Daniel H. Huson and David Bryant. 2006. Application of phylogenetic networks in evolutionary studies. Mol. Biol. Evol., 23:254–267.

Roland Meyer, Ruprecht von Waldenfels, and Andreas Zeman. 2006–2014. Paravoz – a simple web interface for querying parallel corpora. https://bitbucket.org/rvwfels/paravoz.

Pavel Rychlý. 2007. Manatee/Bonito – a modular corpus manager. In 1st Workshop on Recent Advances in Slavonic Natural Language Processing, pages 65–70, Brno. Masaryk University.

Jörg Tiedemann. 2003. Recycling Translations – Extraction of Lexical Data from Parallel Corpora and their Application in Natural Language Processing. Ph.D. thesis, Uppsala University, Uppsala, Sweden. Anna Sågvall Hein, Åke Viberg (eds): Studia Linguistica Upsaliensia.

Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Mehmet Ugur Dogan, Bente Maegaard, Joseph Mariani, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey. European Language Resources Association (ELRA).

Martin Volk, Johannes Graën, and Elena Callegaro. 2014. Innovations in parallel corpus search tools. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland. European Language Resources Association (ELRA).

Ruprecht von Waldenfels. 2014. Explorations into variation across Slavic: taking a bottom-up approach. In Benedikt Szmrecsanyi and Bernhard Wälchli, editors, Aggregating Dialectology and Typology: Linguistic Variation in Text and Speech, within and across Languages, pages 290–323. De Gruyter Mouton, Berlin, New York.
