Cover photo 'Vilnius castle tower by night' by Mantas Volungevičius http://www.flickr.com/photos/112693323@N04/13596235485/ Licensed under Creative Commons Attribution 2.0 Generic See http://creativecommons.org/licenses/by/2.0/ for full terms
Cover design Nils Blomqvist Proceedings of the Workshop on Innovative Corpus Query and Visualization Tools at NODALIDA 2015
Editors Gintarė Grigonytė, Simon Clematide, Andrius Utka and Martin Volk
May 11-13, 2015 Vilnius, Lithuania
Published by
Linköping University Electronic Press, Sweden Linköping Electronic Conference Proceedings #111 ISSN: 1650-3686 eISSN: 1650-3740 NEALT Proceedings Series 25 ISBN: 978-91-7519-035-8 Copyright
The publishers will keep this document online on the Internet – or its possible replacement –from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law, the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.
Linköping University Electronic Press Linköping, Sweden, 2015 Linköping Electronic Conference Proceedings, No. 111 ISSN: 1650-3686 eISSN: 1650-3740 URL: http://www.ep.liu.se/ecp_home/index.en.aspx?issue=111 NEALT Proceedings Series, Vol. 25 ISBN: 978-91-7519-035-8
© The Authors, 2015 Preface
Recent years have seen an increased interest in and availability of many different kinds of corpora. These range from small, but carefully anno- tated treebanks to large parallel corpora and very large monolingual cor- pora for big data research. It remains a challenge to offer flexible and powerful query tools for multilayer annotations of small corpora. When dealing with large cor- pora, query tools also need to scale in terms of processing speed and re- porting through statistical information and visualization options. This be- comes evident, for example, when dealing with very large corpora (such as complete Wikipedia corpora) or multi-parallel corpora (such as Eu- roparl or JRC Acquis). The QueryVis workshop has gathered researchers who develop and evaluate new corpus query and visualization tools for linguistics, lan- guage technology and related disciplines. The papers focus on the design of query languages, and on various new visualization options for mono- lingual and parallel corpora, both for written and spoken language. We hope that QueryVis will stimulate discussions and trigger new ideas for the workshop participants and any reader of the proceedings. The preparation of the workshop and the reviewing of the submissions has already been an inspiring experience. All papers were peer-reviewed by three program committee mem- bers. We would like to thank all reviewers and contributors for their work and for sharing their thoughts and experiences with us. Let us all join our forces to make corpus exploration a rewarding, en- tertaining, and exciting experience which will grant us ever new insights into language and thought.
May 4, 2015 Gintare˙ Grigonyte˙ Zurich¨ Simon Clematide Andrius Utka Martin Volk
iii Program Committee
Janne Bondi Johannessen University of Oslo Noah Bubenhofer University of Zurich Simon Clematide University of Zurich Johannes Graen¨ University of Zurich Gintare˙ Grigonyte˙ Stockholm University Milosˇ Jakub´ıcekˇ Lexical Computing Ltd. Andrius Utka Vytautas Magnus University Martin Volk University of Zurich Robert Ostling¨ Stockholm University
iv Table of Contents
KoralQuery - A General Corpus Query Protocol ...... 1 Joachim Bingel and Nils Diewald Reflections and a Proposal for a Query and Reporting Language for Richly Annotated Multiparallel Corpora ...... 6 Simon Clematide Interactive Visualizations of Corpus Data in Sketch Engine ..... 17 Lucia Kocincova,´ V´ıt Baisa, Milosˇ Jakub´ıcekˇ and Vojtechˇ Kova´rˇ Visualisation in speech corpora: maps and waves in the Glossa system ...... 23 Michał Kosek, Anders Nøklestad, Joel Priestley, Kristin Hagen and Janne Bondi Johannessen ParaViz: A vizualization tool for crosslinguistic functional comparisons based on a parallel corpus ...... 32 Ruprecht von Waldenfels
v KoralQuery – a General Corpus Query Protocol
Joachim Bingel, Nils Diewald Institut für Deutsche Sprache Mannheim, Germany bingel,[email protected]
Abstract interoperability between different query systems, The task-oriented and format-driven de- for instance to forward user requests from one sys- velopment of corpus query systems has led tem to another, which may have access to addi- to the creation of numerous corpus query tional resources. languages (QLs) that vary strongly in ex- In this paper, we present KoralQuery, a gen- pressiveness and syntax. This is a severe eral protocol for the representation of requests to impediment for the interoperability of cor- corpus query systems independent of a particular pus analysis systems, which lack a com- query language. KoralQuery provides an extensi- mon protocol. In this paper, we present ble system of different linguistic and metalinguis- KoralQuery, a JSON-LD based general tic types and operations, which can be combined corpus query protocol, aiming to be inde- to represent queries of great complexity. Several pendent of particular QLs, tasks and cor- query languages can thus be mapped to a common pus formats. In addition to describing the representation, which lets users of query systems system of types and operations that Koral- formulate queries in any of the QLs for which such Query is built on, we exemplify the rep- a mapping is implemented (cf. Section 4). Further resentation of corpus queries in the serial- benefits of KoralQuery include the dynamic defi- ized format and illustrate use cases in the nition of virtual corpora and the possibility to si- KorAP project. multaneously access several, concurrent layers of annotation on the same primary textual data. 1 Introduction In the past, several corpus query systems have 2 Related Work been developed with the purpose of exploring and In former publications, KoralQuery was intro- providing access to text corpora, often under the duced as a unified serialization format for CQLF1 assumption of specific linguistic questions that the (Banski´ et al., 2014), a companion effort focussing annotated corpora have been expected to help an- on the identification and theoretical description of swer. This task-oriented and format-driven devel- corpus query concepts and features. opment has led to the creation of several distinct Another approach to a common query lan- corpus query languages (QLs), including those guage that is independent of tasks and formats mentioned in Section 3. Such QLs vary strongly is CQL (Contextual Query Language) (OASIS in expressiveness and usability (Frick et al., 2012). Standard, 2013), with its XML serialization for- This brings several unpleasant consequences mat XCQL.2 KoralQuery differs from CQL in fo- both for researchers and developers. For instance, cussing on queries of linguistic structures, and the researcher who uses a particular system must separating document and span query concepts (see formulate her queries in no other QL than the one Section 3). used for this system, which might require addi- 1 tional training prior to the actual research. It might CQLF is short for Corpus Query Lingua Franca, which is part of the ISO TC37 SC4 Working Group 6 (ISO/WD even be the case that certain research questions 24623-1). cannot be answered due to limitations of the QL, 2Like KoralQuery, XCQL is not meant to be human read- while the actual query system and the underlying able, but to represent query expressions as machine readable tree structures. For various compilers from CQL to XCQL, corpus data could in fact provide results. For de- see http://zing.z3950.org/cql/; last accessed 27 April velopers, the lack of a common protocol prevents 2015.
1 3 Query Representation 1 { 2 "@context": "http://korap.ids-mannheim.de/ns/ KoralQuery is serialized to JSON-LD (Sporny et koral/0.3/context.jsonld", al., 2014), a JSON (Crockford, 2006) based for- 3 "collection":{ 4 "@type": "koral:doc", mat for Linked Data, which makes it possible for 5 "key": "pubDate", corpus query systems to interoperate by exchang- 6 "value": "2005-05-25", 3 7 "type": "type:date", ing the common protocol. JSON-LD relies on 8 "match": "match:geq" the definition of object types via the @type key- 9 }, word, thus informing processing software of the 10 "query":{} 11 } attributes and values that a particular object may hold. As can be seen in the example serializations in this section (see Fig. 1-3), KoralQuery makes Figure 1: KoralQuery serialization for a virtual use of the @type keyword to declare query object collection that is restricted to documents with a types. Those types fall into different categories pubDate of greater or equal than 2005-05-25. that we introduce in the remainder of this section.4 While KoralQuery aims to express as many dif- combination of constraints), for instance all doc- ferent linguistic and metalinguistic query struc- uments that were published after a certain date or tures as possible, it currently guarantees to rep- that contain a certain string of characters in their resent types and operations defined in Poliqarp title. Figure 1 illustrates the serialization of a sim- QL (Przepiórkowski et al., 2004), COSMAS II QL ple virtual collection definition. (Bodmer, 1996) and ANNIS QL (Rosenfeld, 2010). In addition, the protocol comprises a subset of the 3.2 Span Queries elements of CQL (OASIS Standard, 2013). To find occurrences of particular linguistic struc- As JSON-LD objects can reference further tures in corpus data (possibly restricted through namespaces (via the @context attribute), Koral- the aforementioned document queries), Koral- Query is arbitrarily extensible. Query uses the attribute query, under which it registers objects of specific, well-defined types. 3.1 Document Queries Those objects, along with their hierarchical orga- KoralQuery allows to specify metadata constraints nization, represent the linguistic query issued by that act as filters for virtual collections using the the user.5 collection attribute. Those metadata constraints, The intended generic usability of KoralQuery so-called collection types, serve a dual purpose: demands a high degree of flexibility in order to Besides the obvious benefit of allowing users to cover as many linguistic phenomena and theories restrict their search via dynamic sampling to docu- as possible. It must therefore be maximally inde- ments that meet specific requirements on metadata pendent of, and neutral with regard to, such as publication date, authorship or genre, they can be used to control access to texts that the user (i) the type and structure of linguistic annotation has no permission to read (cf. Sec. 3.3). on the text data, A single metadata constraint is called a basic (ii) the choice of specific tag sets, e.g. for part- collection type, and defines a metadata field, a of-speech annotations or dependency labels. value and a match modifier, for example to negate KoralQuery achieves this neutrality by instanti- the constraint. Basic collection types can be com- ating distinct linguistic types as abstract structures bined using boolean operators (AND and OR) to which can flexibly address different sources and recursively form . The complex collection types layers of linguistic annotation at the same time. result of a collection type is a collection of doc- Linguistic patterns of greater complexity can be uments which satisfy the encoded constraint (or defined by using a modular system of nestable 3JSON-LD was chosen to be compatible with LAPPS rec- types and operations, drawing on various famil- ommendations from ISO TC37 SC4 WG1-EP, as suggested iar search technologies and formalisms, includ- by Piotr Banski.´ 4The type categories are set in boldface. A detailed def- 5As the response format is not part of the KoralQuery inition of types and attributes is provided by the KoralQuery specification, the result handling is subject to the query en- specification (Diewald and Bingel, 2015), which may serve gine. It may, for instance, return surrounding text spans or as a reference for implementers of KoralQuery processors. the total number of occurrences.
2 ing concepts from regular expressions, XML tree 1 { traversal, boolean search and relational database 2 "@context": "http://korap.ids-mannheim.de/ns/ koral/0.3/context.jsonld", queries. 3 "collection":{}, The nesting principle of KoralQuery states that 4 "query":{ objects describing linguistic structures in the cor- 5 "@type":"koral:group", 6 "operation": "operation:sequence", pus data, so-called span types, may be embedded 7 "operands":[{ in parental objects to recursively describe complex 8 "@type": "koral:token", 9 "wrap":{ linguistic structures, thus forming a single-rooted 10 "@type": "koral:termGroup", tree. 11 "relation": "relation:and", Span types may be further sub-classified into 12 "operands":[{ 13 "@type": "koral:term", basic and complex types. Basic span types denote 14 "foundry": "tt", linguistic entities such as words, phrases and sen- 15 "key": "ADJA", tences that are annotated in the corpus data. The 16 "layer": "pos", 17 "match": "match:eq" result of such a span type is a text span, which in 18 },{ turn is defined through a start and an end offset 19 "@type": "koral:term", with respect to the primary text data. 20 "foundry": "cnx", Complex 21 "key": "@PREMOD", span types define linguistic or result-modifying 22 "layer": "syn", operations on a set of embedded span types, which 23 "match": "match:eq" 24 }] thus act as arguments (or operands) of the relation 25 },{ and pass their resulting text spans on to the parent 26 "@type": "koral:token", operation.6 Such operations may express syntactic 27 "wrap":{ 28 "@type": "koral:term", relations or positional constraints between spans. 29 "key": "octopus", Figure 2, for example, represents a span query 30 "layer": "lemma", 31 "match": "match:eq" of two koral:token objects (basic span types) 32 } each wrapping a single koral:term object, whose 33 }] resulting text spans are required to be in a se- 34 } 35 } quence (i.e. follow each other immediately in the order they appear in the list), as formulated by the operation:sequence in the embedding Figure 2: KoralQuery serialization for a pre- koral:group object (a complex span type). modifying adjective followed by the lemma oc- Leaf objects of the span query tree structure topus. The dual constraint on the first token may either be basic span types or parametric (adjective and premodifying) is reflected by the types, containing specific information that is re- koral:termGroup, which expresses a conjunction quested for certain span types. They are intended of the two koral:term objects. The different val- to normalize the usage and representation of simi- ues for foundry indicate that different annotation lar or equal parameters used across different types. sources are addressed. The koral:term objects in Figure 2, which ex- press constraints on their parent koral:token ob- cific set of obligatory and optional attributes that jects, are examples of such parametric types and carry corresponding values. Those values, in turn, are used to uniformly access annotation labels are also constrained to be of specific data types. from different sources and on different layers. They can either be primitives (like string, integer Next to such basic parametric types, KoralQuery or boolean), parametric KoralQuery types, or con- provides complex parametric types that encode, trolled values. for instance, logical operations on other paramet- ric types (see the koral:termGroup in Figure 2). 3.3 Query Rewrites Note that all of those types are themselves com- Query processors may perform a wide range of plex structures in that they are composed of a spe- different tasks aside of searching. Examples in- 6In addition, the koral:reference type may refer to ob- clude the modification of queries to restrict access jects elsewhere in the tree, which provides a mechanism sim- to certain documents, to improve recall (e.g. by in- ilar to ID/IDREF in XML. This strategy is necessary to sup- port graph-based query structures found in certain query lan- troducing synonyms or suggesting query reformu- guages. lations), or to inject missing query elements (like
3 1 { other corpora). This rewrite is documented by 2 "@context": "http://korap.ids-mannheim.de/ns/ the koral:rewrite object (a report type). Doc- koral/0.3/context.jsonld", 3 "collection":{ umenting rewrites is optional (e.g. the injected 4 "@type": "koral:docGroup", operation:and in the example Figure is implicit 5 "operation": "operation:and", and was not reported using koral:rewrite). 6 "operands":[{ 7 "@type": "koral:doc", In addition, KoralQuery allows to report on var- 8 "key": "pubDate", ious processing issues (independent of rewrites, 9 "value": "2005-05-25", e.g. regarding incompatibilities) by using the 10 "type": "type:date", 11 "match": "match:geq" errors, warnings, and messages attributes. 12 },{ Report types (in opposition to collection types, 13 "@type": "koral:doc", 14 "key": "corpusID", span types, and parametric types) do not alter the 15 "value": "Wikipedia", expected query result. 16 "rewrites":[{ 17 "@type": "koral:rewrite", 18 "src": "Kustvakt", 4 Implementations 19 "operation": "operation:injection" 7 20 }] KoralQuery is the core protocol used in KorAP 21 }] (Banski´ et al., 2013), a corpus analysis platform 22 }, developed at the Institute for the German Lan- 23 "query":{} 24 } guage (IDS). KorAP is designed to handle very large corpora and to be sustainable with regard to future developments in corpus linguistic research. Figure 3: Rewritten KoralQuery instance (see Fig- This is ensured through a modular architecture of ure 1), with an injected access restriction. interoperating software units that are easy to main- tain, extend and replace. The interoperability of preferred annotation tools) based on user settings components in KorAP is certified through the use (Banski´ et al., 2014). Queries may also be ana- of KoralQuery for all internal communications. 8 lyzed for the most commonly queried structures, Koral translates queries from various corpus for instance to perform query and index optimiza- query languages (as mentioned in Section 3) to tion or to shed light on which texts and annota- corresponding KoralQuery documents. This con- tions are most popular with the users. In a post- version is a two-stage process, which first parses processing step, queries can also be transformed the input query string using a context-free gram- for visualization purposes, for example to illus- mar and the ANTLR framework (Parr and Quong, trate sequences or alternatives in complex query 1995) before it translates the resulting parse tree to structures. KoralQuery. 9 Using a well-defined and widely adopted seri- Krill is a corpus search engine that expects Ko- alization format such as JSON makes it easy to ralQuery instances as a request format. To index perform such tasks, and KoralQuery supports this and retrieve primary data, textual annotations and kind of pre- and post-processors even further by metadata of documents as formulated by Koral- 10 introducing mechanisms to trace query rewrites by Query, Krill utilizes Apache Lucene. using so-called report types that are passed to fur- Kustvakt is a user and corpus policy manage- ther processors in the processing pipeline. In this ment service that accepts KoralQuery requests and way, query modifications (like the aforementioned rewrites the query as a preprocessor (see Sec. 3.3) rewrites for access restriction and recall improve- before it is passed to the search engine (e.g. Krill). ments) can be made visible and transparent to the Rewrites of the document query may restrict the user. In this respect, KoralQuery differs from com- requested collection to documents the user is al- mon database query systems, where rewrites are lowed to access, while the span query may be internal and hidden from the user (Huey, 2014). modified by injecting user defined properties.
In Figure 3, the virtual collection of Figure 1 is 7http://korap.ids-mannheim.de/ rewritten by the processor Kustvakt in a way that 8http://github.com/KorAP/Koral; Koral is free soft- a further constraint is injected, limiting the vir- ware, licensed under BSD-2. 9http://github.com/KorAP/Krill; Krill is free soft- tual collection to all documents with a corpusID ware, licensed under BSD-2. of Wikipedia (i.e. excluding all documents from 10http://lucene.apache.org/core/
4 5 Summary and Further Work the 6th Language and Technology Conference, Poz- nan.´ Fundacja Uniwersytetu im. A. Mickiewicza. We have presented KoralQuery, a general proto- col for queries to linguistic corpora, which is se- Piotr Banski,´ Nils Diewald, Michael Hanl, Marc Kupi- etz, and Andreas Witt. 2014. Access Control by rialized as JSON-LD. KoralQuery allows for a Query Rewriting: the Case of KorAP. In Pro- flexible representation and modification of corpus ceedings of the Ninth International Conference on queries that is independent of pre-defined tag sets Language Resources and Evaluation (LREC 2014), or annotation schemes. Those queries pertain to Reykjavik, Iceland, may. European Language Re- sources Association (ELRA). both selection of documents by metadata or con- tent, and text span retrieval by the specification Franck Bodmer. 1996. Aspekte der Abfragekompo- of linguistic patterns. To this end, the protocol nente von COSMAS II. LDV-INFO, 8:142–155. defines a set of types and operations which can Douglas Crockford. 2006. The application/json Media be nested to express complex linguistic structures. Type for JavaScript Object Notation (JSON). Tech- By employing an automatic conversion from sev- nical report, IETF, July. http://www.ietf.org/ rfc/rfc4627.txt. eral QLs to KoralQuery, corpus engines may al- low their users to choose the QL that they are most Nils Diewald and Joachim Bingel. 2015. Koral- comfortable with or that are best equipped to an- Query 0.3. Technical report, IDS, Mannheim, Germany. Working draft, in preparation, http: swer their research questions. //KorAP.github.io/Koral, last accessed 27 April The KoralQuery specification (Diewald and 2015. Bingel, 2015) does not claim to be complete or to cover all possible linguistic types and structures. Elena Frick, Carsten Schnober, and Piotr Banski.´ 2012. Evaluating query languages for a corpus processing Amendments to the protocol may follow in fu- system. In Proceedings of the Eighth International ture versions or may be implemented by individ- Conference on Language Resources and Evaluation ual projects, which is easily done by supplying an (LREC 2012), pages 2286–2294. additional JSON-LD @context file that links new Patricia Huey, 2014. Oracle Database, Security Guide, concepts to unique identifiers. Extensions that we 11g Release 1 (11.1), chapter 7. Using Oracle consider for upcoming versions of KoralQuery in- Virtual Private Database to Control Data Access, clude text string queries that are not constrained by pages 233–272. Oracle. http://docs.oracle. com/cd/B28359_01/network.111/b28531.pdf, last token boundaries and more powerful stratification accessed 27 April 2015. techniques for virtual collections. OASIS Standard. 2013. searchRetrieve: Part Acknowledgements 5. CQL: The Contextual Query Language Version 1.0. http://docs.oasis-open.org/ KoralQuery, as well as the described implemen- search-ws/searchRetrieve/v1.0/os/part5-cql/ tation components, are developed as part of the searchRetrieve-v1.0-os-part5-cql.html. KorAP project at the Institute for the German Terence J. Parr and Russell W. Quong. 1995. ANTLR: Language (IDS)11 in Mannheim, member of the A predicated-LL (k) parser generator. Software: Leibniz-Gemeinschaft, and supported by the Ko- Practice and Experience, 25(7):789–810. bRA12 project, funded by the Federal Ministry of Adam Przepiórkowski, Zygmunt Krynicki, Lukasz De- Education and Research (BMBF), Germany. The bowski, Marcin Wolinski, Daniel Janus, and Piotr authors would like to thank their colleagues for Banski.´ 2004. A search tool for corpora with posi- tional tagsets and ambiguities. In Proceedings of the their valuable input. Fourth International Conference on Language Re- sources and Evaluation (LREC 2004), pages 1235– 1238. European Language Resources Association References (ELRA). Piotr Banski,´ Joachim Bingel, Nils Diewald, Elena Viktor Rosenfeld. 2010. An implementation of the An- Frick, Michael Hanl, Marc Kupietz, Piotr Pezik, nis 2 query language. Technical report, Humboldt- Carsten Schnober, and Andreas Witt. 2013. KorAP: Universität zu Berlin. the new corpus analysis platform at IDS Mannheim. In Zygmunt Vetulani and Hans Uszkoreit, editors, Manu Sporny, Dave Longley, Gregg Kellogg, Markus Human Language Technologies as a Challenge for Lanthaler, and Niklas Lindström. 2014. JSON- Computer Science and Linguistics. Proceedings of LD 1.0 – A JSON-based Serialization for Linked Data. Technical report, W3C. W3C Recommen- 11 http://ids-mannheim.de/ dation, http://www.w3.org/TR/json-ld/. 12http://www.kobra.tu-dortmund.de/
5 Reflections and a Proposal for a Query and Reporting Language for Richly Annotated Multiparallel Corpora
Simon Clematide Institute of Computational Linguistics, University of Zurich [email protected]
Abstract for automated linguistic data mining, a use case Large and open multiparallel corpora are of computational linguists. A successful combi- a valuable resource for contrastive corpus nation of these two different paradigms of linguis- linguists if the data is annotated and stored tic information retrieval (i.e. ad hoc search and in a way that allows precise and flexible precomputed word collocation statistics) has been ad hoc searches. A linguistic query lan- shown in the case of the text corpus query lan- guage should also support computational guage CQL within the framework of the Sketch linguists in automated multilingual data Engine (Kilgarriff et al., 2014). mining. We review a broad range of ap- Historically, there are two different strains of proaches for linguistic query and report- linguistic query systems, (a) corpus linguistics ing languages according to usability crite- tools for text corpora such as CQP (Christ, 1994) ria such as expressibility, expressiveness, with KWIC reporting, and (b) treebank tools such and efficiency. We propose an architecture as TGrep2 (Rohde, 2005) for searching through that tries to strike the right balance to suit deeply nested structures of syntactically anno- practical purposes. tated sentences. In recent years, we have seen a convergence of these strains: query languages 1 Introduction for text corpora have enriched their search opera- There is a large amount (millions of sentences) tors in order to cope with syntactic constituents, of open multiparallel text data available electroni- for example introducing the operators within cally: resolutions of the General Assembly of the and contain in CQL (Jakubicek et al., 2010) United Nations (Rafalovitch and Dale, 2009), Eu- or the constituent search construct in Poliquarp ropean parliament documents (Koehn, 2005; Ha- (Janus and Przepiorkowski,´ 2007). On the other jlaoui et al., 2014), European administration trans- hand, treebanking-style query approaches that lation memories and law texts (Steinberger et al., were bound to context-free tree structures have 2012; Steinberger et al., 2006), documents from evolved into more general query systems for struc- the European Union Bookstore (Skadin¸sˇ et al., tural linguistic annotations, e.g. ANNIS (Krause 2014), and movie subtitles. See Tiedemann (2012) and Zeldes, 2014) which allows a richer set of the and Steinberger et al. (2014) for an overview. structural relations (multi-layered directed acyclic Automatic part-of-speech tagging and lemma- graphs, including syntactic dependencies or coref- tization of raw text has become standard proce- erence chains across sentences), or the Prague dure, and richer linguistic annotations such as Markup Language Tree Query (PML-TQ) system morphological analysis, named entity recognition, for multi-layered annotations (Stˇ epˇ anek´ and Pajas, base chunking, and dependency analysis are pos- 2010), which also covers parallel treebanks.1 sible for many languages. Further, statistical word alignment can be applied to any parallel lan- guage resource. If we want to exploit these large, 1Unfortunately, it is difficult to access up-to-date infor- mation about the query possibilities for alignments of words richly annotated resources and flexibly serve the or syntactic nodes. The documentation, however, describes a language-related information needs of translators, general cross-layer, node-identifier-based selector dimension. terminologists and contrastive linguists, an expres- The parallel Prague Czech-English Dependency Treebank 2.0 (PCEDT 2.0) http://ufal.mff.cuni.cz/pcedt2.0 sive query language for ad hoc search must be pro- illustrates the representation of word-aligned dependency vided. Such a query language will also be useful trees.
6 1.1 Linguistic Information Needs sults. All flexible linguistic query languages offer A linguistic query in a general sense is a set of means to select the sub-structures and attributes 5 interrelated constraints about linguistic structures. which the user is interested in. This may also The following paragraphs introduce the structures include sorting, aggregating or statistical tabulat- we want to represent and query. ing of the results, as for instance the excellent re- Monolingual constraints on the primary level porting functions of PML-TQ allow. In our opin- of word tokens (the minimal unit of analysis) are ion, reporting also includes the user-configurable dealing with inflected word forms, base forms, export of search results, for example as simple part-of-speech tags, and morphological categories. comma-separated data for further statistical pro- 6 Word tokens have a sequential ordering relation cessing , or as hierarchically structured XML se- (linear precedence). For our case of orthograph- rializations. ically well-formed texts, we assume consistent to- The graphical visualization of search results kenization for all levels of annotation. Giving up aids end users in quickly browsing complex data this requirement leads to non-trivial ordering prob- structures. Visualizations of syntactic structures or lems (Chiarcos et al., 2009). Sentences are se- frequency distributions of aligned words should be quences of tokens, and documents are sequences generated on top of specific textual reporting for- of sentences.2 Documents or sentences typically mats. Interactive behavior (collapsing trees, high- have metadata associated with them, for instance lighting of aligned nodes) supports a quick inter- indicating whether a document is a translation or pretation of search results. not. The remainder of this paper is structured as fol- Each full or partial dependency analysis of a lows. Section 2 describes general usability cri- sentence can be represented as a directed and la- teria of linguistic query systems. Section 3 dis- beled tree graph where each node is a word to- cusses interesting linguistic query languages and ken, except for the root of the tree, which we as- their main properties. Section 4 introduces gen- sume to be a virtual node. Nested syntactic con- eral data query languages that are related to lin- stituents (or chunks in the case of partial parsing) guistic systems. Section 5 discusses evaluation ap- introduce a dominance relation between syntac- proaches for linguistic query languages. Finally, tic nodes (non-terminals) or primary token nodes section 6 presents our proposals for an efficient (terminals). Dominated nodes also have a linear linguistic query and reporting system for multipar- precedence ordering, the sibling relation. allel data. Cross-lingual constraints are concerned with word alignments and sub-sentential alignments on 2 Usability Criteria for Linguistic Query the chunk level.3 Directed bilingual word align- Systems ments as produced by statistical word alignment Expressibility How naturally can users express tools such as GIZA++ are 1:n (Och and Ney, their information need? Can users apply their lin- 2003). Bidirectional alignments are thus rela- guistic concepts to formulate their query (Jaku- tional, in general, we have m : n alignments on bicek et al., 2010, 743), or do they have to deal the level of words, for example, between a Ger- with cumbersome constructs? man compound and its corresponding multi-word Non-experts may profit from a visual or menu- unit in French, unless we apply a symmetrization based composition of queries. Gartner¨ et al. technique (Tiedemann, 2011, 75ff.).4 (2013) and M´ırovsky´ (2008) describe graphical 1.2 Reporting and Visualization query solutions for dependency trees. ANNIS The set of constraints in a query does not exactly (Zeldes et al., 2009) offers a graphical query in- determine the content or format of the search re- terface for AQL. Nygaard and Johannessen (2004) built a menu-based visual query composition for 2 In order to keep the description simple we do not impose parallel treebanks that used TGrep2 as its query more nesting levels in documents. 3Sentence alignments are considered as given in the con- execution engine. text of multiparallel corpora, although in practical terms it might require a lot of work to achieve a proper and consistent 5TGrep2 uses backticks to mark the top node of the sub- sentence alignment across multiple languages. tree that is printed as output. 4Recently, Baisa et al. (2014) applied Dice coefficients to 6ANNIS provides a practical export format for the WEKA identify aligned lemmas in parallel sentences. machine learning framework.
7 Experts, however, will profit most from text- bank search tool TIGERSearch7 rely on named based queries that allows to abstract common node specifications, and they can only be accessed and recurrent functionality in the form of user- and configured by graphical user interface inter- definable macros, variables, or functions. actions. Other query languages such as PML-TQ offer a proper reporting language with a rich set Expressiveness Are there inherent limitations in of functions for sorting, aggregating and exporting a query language that systematically prevent the (e.g. grammar rules). formulation of precise search constraints for cer- tain structures? It is well known since its inception Visualization Does the query system offer ap- that the fragment of existential first-order logic im- pealing visualizations of the data or data aggrega- plemented by the TIGERSearch language does not tions? ANNIS3 (Krause and Zeldes, 2014) has an allow for the search of missing constituents in syn- outstanding amount of visualization options. tactic graphs (Konig¨ and Lezius, 2003). Lai and Availability and accessibility Is a system bound Bird (2010) provide a concise overview on the for- to specific operating systems? Large datasets mal expressiveness of query languages for hier- typically overstrain personal desktop computers. archical linguistic structures and discuss the fact Web-based services can be hosted on dedicated that transitive closures of immediate dominance or computing infrastructure, and there is typically no precedence relations formally require the expres- client-side software installation necessary given siveness of monadic second-order logic. Interest- the rendering capabilities of modern web browsers ingly, such a high expressiveness does not imply (e.g. interactive SVG graphics). Open web-based inefficient or impractical execution times as shown services enable easy sharing of query results via by Maryns and Kepser (2009) for context-free URLs (Pezik, 2011). treebank structures – if tree automata techniques are used. However, purely logical approaches have 3 Families of Linguistic Query not received much attention in practice. Languages
Efficiency How much processing time and As mentioned above, there are two strains of lin- memory is needed for the execution of a query? guistic query languages. Some specific properties Answers to this question relate to many different of these languages are discussed next. parameters. First, data size of the corpora matters – dealing with thousands, millions, or billions of 3.1 Text Corpus Query Languages sentences makes a big difference. Second, data CQP The language of the IMS Corpus Query model complexity matters. Third, query expres- Processing Workbench (Hardie, 2012)8 has a long siveness and complexity matters. history (Christ, 1994). From this common ances- Even if a user is dealing with large datasets, tor, CQL (Kilgarriff et al., 2004) and Poliquarp complex data models and complicated queries, were later developed. Right from the beginning, there are solutions to produce acceptable response CQP supported annotated word tokens, structural times. For instance, by providing a highly par- boundaries (sentences, constituents) and sentence- allel computing infrastructure using MapReduce aligned parallel texts. The core of a query con- techniques (Schneider, 2013), or by using sophis- sists of regular expressions that specify matching ticated indexing and retrieval techniques (Ghodke token sequences. These descriptions can refer to and Bird, 2012). the level of word forms, part-of-speech tags or any other positional (=token-bound) attribute. Non- Reporting and exporting Does the query lan- recursive constituents are indirectly available as guage or query system offer flexible support for structural boundaries and can be used to restrict the user to configure the data reported in the search the search space for regular expression matches results? The selection of sub-structures is typi- on the positional level. The constituent segments cally deeply integrated in the query syntax. For also allow for attributes which can be queried, for text concordancing tools, Frick et al. (2012) men- instance syntactic head information. The main tion the LINK/ALL operator of COSMAS II, or 7http://www.ims.uni-stuttgart.de/forschung/ bracketed expressions in Poliquarp. The statisti- ressourcen/werkzeuge/tigersearch.html cal reporting functions of the monolingual tree- 8http://cwb.sourceforge.net
8 Relation QL Symbol expressed succinctly as S !<< saw. Their infor- Immediate TGrep2, fsq, TS, AQL > dominance LPath / mation need Q5 “Find the first common ancestor TGrep2 >> of sequences of a noun phrase followed by a verb Transitive fsq >+ phrase” leads to a short but intricate query (see dominance TS, AQL >* LPath // Tab. 1 for operators): Immediate TGrep2, fsq, TS, AQL . *=p << (NP=n .. (VP=v >> =p !>> precedence LPath -> (* << =n >> =p))) TGrep2, fsq .. Transitive TS, AQL .* precedence LPath --> 3.2.1 Path-based Languages TS, AQL $ Immediate LPath Bird et al. (2006) developed this query TGrep2 $. sibling LPath => language as a generally applicable extension of the XPath query language for XML10. Syntactic Table 1: Operators of query languages (QL) trees as well as XML documents are ordered trees. However, the direct use of XPath for querying lin- guistic trees is limited by the absence of (a) the weakness of this query language is the lack of a horizontal axis of x immediately follows/precedes means to query arbitrary relations between tokens, y, and (b) sibling x immediately follows/precedes which would be necessary to properly support the sibling y.11 Q2 from above can be stated as search for dependency relations. Given the fact /S[not //_[@lex = ’saw’]] that dependency labels are bound to words, one could map this information as an attribute on the Q5 cannot be expressed correctly (Lai and Bird, positional level, for example, attributing the prop- 2004). A further extension of LPath, called erty of being a subject to the head of the subject. LPath+ (Lai and Bird, 2005), is more expressive An integrated macro and reporting language and allows for a correct but complex query: distinguishes CQP as a powerful and versatile tool. //_[/_[(NP or (/_[not(=>_)])*/NP[not(=>_)) and => (VP or (/_[not(<=_)])*/VP[not(<=_)])] CQL The query language behind the commer- This is due to the fact that path-based, variable- cial corpus query platform Sketch Engine9 is an free languages cannot easily express equality re- extension of CQP (Jakubicek et al., 2010). strictions. Therefore, the following shorter LPath Support for identifying word matches across expression does not have the correct meaning be- parallel corpora is technically implemented via the cause each NP (or VP) may refer to different within operator. For a sentence-aligned parallel nodes: corpus (English and German Europarl corpus), a query rooted in the English side might look like: //_[{//NP->VP} and not(//_{//NP->VP})] [word="car"] within europarl7_de: [word="Auto"] DDDQuery This language is another attempt to This finds all occurrences of car in sentences extend XPath and to better adapt it for linguistic where a parallel sentence containing the word information needs (Faulstich et al., 2006). Its data Auto exists. This kind of query, however, does model was developed for a multi-layered, linguis- not allow to explicitly test for word alignment re- tically richly annotated representation of historical lations. Still, the search patterns on both sides of texts, including transcriptions and aligned trans- the within operator can be arbitrarily complex. lations, which resulted in “non-tree-shaped anno- tation graphs and multiple annotation hierarchies 3.2 Treebank Query Languages with conflicting structure”. This query language TGrep2 The efficient treebank query tool “goes beyond LPath by supporting queries on text TGrep2 is limited to context-free parse trees. Lai spans, on multiple annotation layers, and across and Bird (2004) see its strength in the ability to aligned texts”. The language introduces shared query for non-inclusion or non-existence of con- variables for any node set in order to easily ex- stituents. Their information need Q2 “Find sen- press equality restrictions and report the matched tences that do not include the word saw” can be nodes as result data.
9See Kilgarriff et al. (2014) for a recent description. The 10http://www.w3.org/tr/xpath NoSketchEngine, the open-source part of the Sketch Engine, 11Note that the transitive closures of these relations are is available from http://nlp.fi.muni.cz/trac/noske. available in XPath.
9 PML-TQ This query language is also a path- tified formulas can be used to effectively query based approach (Stˇ epˇ anek´ and Pajas, 2010). A matching structures. query consists of a Boolean combination of node selector paths and filters. The language allows re- TIGERSearch Konig¨ and Lezius (2000) intro- cursive sub-queries in selectors which evaluate to duced this logic-based, syntax graph description node sets. The cardinality of these node sets can language for the TIGER data model. It is a subset be tested by numeric quantifiers. A quantifier of of first-order logic, providing only globally exis- zero tests for the non-existence of nodes; there- tentially quantified variables and limited negation. fore, non-existing nodes can be queried in a natu- The language has two layers, namely, node con- ral way. A similar technique of extensionalization straints and graph constraints. of sub-queries into node sets was implemented for Node constraints are either node descriptions or the TreeAligner language (Marek et al., 2008). node (relation) predicates. Node descriptions are Boolean expressions of feature-value constraints 3.2.2 Logic-based Languages with optional variable decorations for referencing fsq12 The Finite Structure Query language the same node several times in a query, for in- (Kepser, 2003) provides full first-order logic as stance #v:[word != "saw"] for a terminal node a query language over syntactic structures of the description, or #np:[cat = ("NP"|"CNP")] for TIGER data model (Brants et al., 2004). This a simple or coordinated noun phrase. Node includes labelled secondary edges between arbi- predicates constrain selected properties of nodes, trary nodes and discontiguous children. Therefore, such as being the root of a tree (root(#s)) fsq has an outstanding expressiveness. Regular or having a certain number of daughter nodes expression support for node labels and response (arity(#CNP,2)). Node relation predicates ex- times that are comparable to TIGERSearch make press the usual structural relations in a user- this approach a practical one. Lai and Bird’s dif- friendly operator notation, e.g. #s >* #np for ficult question Q5 can be expressed as follows in a dominance relation. Graph constraints are the somewhat inconvenient LISP-like prefix nota- conjunctions or disjunctions of node constraints. tion for first-order logic of fsq 13: Negation is not allowed on the level of graph con- (E a (E n (E v (& straints, which severely limits the expressiveness. (cat n NP) (cat v VP) (>+ a v) (.. n v) The TIGER language originally specified user- (! (>+ n v)) (! (>+ v n)) (A b (-> (& (>+ a b) (>+ b n)) defined macros (templates), however, this part of (! (>+ b v)))))))) the language was never implemented.
Compared to the query language of TIGERSearch, AQL The query language of ANNIS is an exten- there is a lack of special purpose predicates such sion of the TIGERSearch language for multi-level as the (token) arity of syntactic nodes or prece- graph-based annotations. It offers operators for la- dence or dominance restrictions with numeric dis- belled dependency relations, inclusion or overlap tance limits, for example, >2,5 expressing a indi- of token spans, corpus metadata information, and rect dominance relation with a minimal depth of 2 namespaces for annotations of the same type pro- and a maximum of 5. duced by different tools15. The operator for depen- MonaSearch14 Maryns and Kepser (2009) ex- dency relations is an instance of the general opera- tended the logical expressiveness of fsq even fur- tor -> for directed and labelled edges between any ther to monadic second-order logic. However, its two nodes. Such edges can also be used to es- data model is restricted to context-free parse trees. tablish or query alignments between parallel sen- A main application of such an expressive lan- tences on the level of words or phrases. guage are automatic consistency checks in human- TreeAligner The Stockholm TreeAligner (Lund- created treebanks. However, existentially quan- borg et al., 2007) introduced an operator for 12The Java implementation of fsq also includes a querying bilingual alignments between words or TIGERSearch-like visualization for the matched trees, see phrases of parallel treebanks, freely combinable http://www.tcl-sfs.uni-tuebingen.de/fsq. 13Existential (E) and universal (A) quantification, conjunc- with monolingual TIGERSearch-style queries. tion (&), negation (!), implication (->). To overcome some expressiveness limitations of 14http://www.tcl-sfs.uni-tuebingen.de/ MonaSearch 15For instance, for different parsers (Chiarcos et al., 2010).
10 TIGERSearch, Marek et al. (2008) introduced SPARQL19 RDF (Resource Description Frame- node sets (node descriptions decorated with vari- work) triple stores with SPARQL endpoints for ables starting with % instead of #). One might try querying linked data are fairly standard nowa- to express Bird and Lai’s Q2, that is, find sentences days. Kouylekov and Oepen (2014) used this tech- without saw, in the following ways: nique to represent and query semantic dependen- #s:[cat="S"] >* #w:[word!="saw"] (1) cies. However, the queries directly operate on the #s:[cat="S"] !>* #w:[word="saw"] (2) internal RDF representations and do not meet the #s:[cat="S"] !>* %w:[word="saw"] (3) criteria of natural expressibility. The authors pro- (1) actually matches all cases where a sentence pose a query-by-example and a template expan- dominates any other word than saw. (2) searches sion front-end for better usability. for occurrences of the word saw not dominated by Chiarcos (2012) introduce POWLA, a generic a sentence node. The interpretation of (3) relies on formalism to represent multi-layer annotated cor- a modified evaluation strategy of the negated dom- pora in RDF and OWL/DL and to query these inance if one of the arguments is a node set: only structures by SPARQL. In order to improve the ex- those sentences match where the negated transi- pressibility, SPARQL macros for AQL operators tive dominance constraint !>* is true for any of are defined. Given the expressiveness of SPARQL, the nodes with the word attribute saw. this allows to overcome the query language lim- itations of AQL or TIGERSearch, which cannot 4 General Data Query Languages query for missing annotations.
Complex data structures are not a privilege of lin- LUCENE20 Every information retrieval system guistics, so obviously many general data query has an integrated query language. Powerful text languages and data management systems exist. indexing and query engines such as LUCENE can Some of them have been used to represent and be used to manage large amounts of texts. By query linguistic structures. treating each sentence as an IR document, Ghodke 16 Bouma and Kloosterman and Bird (2012) implemented a high-performance XPath/XQuery 21 (2007) used these XML technologies in a treebank query system on top of LUCENE. straightforward manner for querying and mining 5 Evaluation Strategies for Linguistic syntactically annotated corpora. These query Query Languages languages are also the basis of Nite QL (Carletta et al., 2005), which is targeted at multimodal There are essentially two approaches to implement annotations. the evaluation of linguistic query languages: either by programming a custom implementation of the SQL The structured query language for rela- execution of the query over a custom implementa- tional databases (RDBMS) is a standard tech- tion of the data management, or by translating the nology with highly efficient implementations. query and the data into a host database system and RDBMSs have been widely used to represent large executing the actual query on the host. amounts of data, e.g. for text concordancing.17 Sometimes, these approaches are mixed; for CYPHER18 Distributed NoSQL graph instance, the TreeAligner uses the relational databases and CYPHER as one of the straight- database SQLite for storing and retrieving the pri- forward query languages seem to be a good mary data of word tokens, but implements a cus- match for highly interconnected linguistic data tom in-memory engine for the evaluation of the (Holzschuher and Peinl, 2013). Pezik (2013) re- Boolean algebra of node predicates and node rela- ports some experiments for corpus representation tions. and corpus query with a pure graph database. Banski et al. (2013) integrate a general text 5.1 Custom Evaluation Engines retrieval engine with a graph database for their Manatee (Rychly,´ 2007) is CQL’s back-end for corpus analysis platform. textual data management and query evaluation. It
16http://www.w3.org/XML/Query 19http://www.w3.org/TR/sparql11-query 17http://corpus.byu.edu (Davies, 2005) 20http://lucene.apache.org 18http://neo4j.com/developer/ 21Their query language, however, does not allow regular cypher-query-language expressions over labels, or underspecified node descriptions.
11 is language and annotation independent and in- first-order logic intermediate representation from cludes efficient implementations of inverted in- which the actual SQL queries are derived. dexes, word compression, etc. in order to cope The development of the relational data model of with extremely large text corpora. Attributes ANNIS (relANNIS) and the corresponding trans- of primary data can be set-valued and support lation of the ANNIS query language AQL into unification-style attribute comparisons. Another SQL queries by Rosenfeld (2010) was inspired by interesting feature of Manatee is its support for dy- the DDDQuery translation. In the next section, we namic attributes of positional primary data. These propose a linguistic query language which is sim- are implemented as function calls which can be ilar to AQL but has a simpler data model. There- declared at the level of the corpus configuration, fore, we expect that our query translation com- for instance, for external lexicon look-up, morpho- ponent can be built using the techniques of AQL logical analysis, or the transformation of tags. query evaluations. TGrep2, TIGERSearch and fsq are examples for treebank query systems with a fully custom data 6 A Proposal for Querying Richly management and query evaluation engine. Rosen- Annotated Multiparallel Text Corpora feld (2010) gives a concise description of the Our data model presupposes the following com- implementation techniques behind TIGERSearch. ponents: (a) multiparallel corpora with sentence, The corpus import of TIGERSearch includes the word and sub-sentential alignments across lan- construction of many specialized indexes for pred- guages; (b) monolingual linguistic annotations icates and attributes. During indexing, statistics on such as PoS tags (preferably the same universal the selectivity of attributes are built, which in turn tagset across languages), base forms and morpho- guide the query execution planner to limit the full logical information; (c) syntactic annotations in evaluation of a query to a subset of syntactic trees. the form of dependency relations and (partial) con- At the stage of corpus indexing, users can provide stituents, allowing the output of different tools their own type definitions, that is, short names for for the same kind of analysis (multi-annotation). subsets of admissible feature values. A definition Multi-tokenization is not required for our data for genitive or dative case looks as follows: and would impose unnecessary complexity for the gen-dat := "gen","dat"; query component. However, metadata on the level Although any query involving this case ambi- of corpora, documents, or sentences is needed. guity can be expressed by a Boolean disjunc- The proposed query language should allow to tion, type definitions lead to both more readable flexibly query all aspects of our data model. How- and compact queries and also to more efficient ever, the search space of the query evaluation will processing due to the type-based data model of be restricted to the context of a monolingual sen- TIGERSearch. tence and its corresponding aligned sentences.22 The concrete query syntax for monolingual search 5.2 Query Translation Approach will be based on TIGERSearch. Additionally, we introduce an alignment operator similar to the LPath and DDDQuery are both Xpath-style lan- bilingual one of the TreeAligner. However, in guages that owe much to the hierarchical data multiparallel queries the alignment operator can model of XML. However, storing and efficient re- be used to constrain alignments between nodes trieval of large XML data sets turns out to be a of any pair of languages. From AQL, we reuse technical challenge in general (Grust et al., 2004). the operator for dependency relations, the sup- One common solution for high-performance XML port for metadata predicates, and explicit names- retrieval is based on a mapping of the hierarchi- paces. From CQP, we import the concept of a non- cal document structure into a flat relational for- recursive macro language. Such a facility proved mat, which in turn allows the use of highly effi- to be extremely useful for large scale linguistic cient RDBMSs. Both linguistic query languages mining in the case of Sketch Grammars of the – LPath and DDDQuery – are translated into SQL Sketch Engine (Kilgarriff et al., 2004). queries because their XML data model is physi- cally stored in an RDBMS. The implementation 22Monolingual searches across sentence boundaries as per- mitted in CQP-style queries will not be possible. However, of DDDQuery (Faulstich et al., 2006) is especially this search limit does not preclude reporting contextual infor- interesting for us because they first translate into a mation from surrounding sentences.
12 24 ral. Indeed, SQL itself offers the set operations UNION, INTERSECT, and EXCEPT to combine