Searching corpora with Glossa

Anders Nøklestad
The Text Laboratory
Department of Linguistics and Scandinavian Studies
University of Oslo

The Glossa corpus search system

• Glossa is a web application for searching corpora
• It is still early in development, but can already be used to search in monolingual written corpora
• The corpora can be located on the same server as Glossa itself, or on other servers within the CLARIN infrastructure

• Many more features are planned for the future, such as
  – Support for multilingual (parallel) corpora
  – Support for speech corpora with audio, video and geographical maps
  – Saving of search results
  – User annotations of search results
  – Various statistics (frequency lists, collocations, metadata distributions, etc.)

Example 1: Search a Norwegian corpus for pronoun + verb constructions

• We will search the Norwegian Corpus of Bokmål for phrases involving the pronoun han “he” followed by a verb
• This will illustrate the different search interfaces offered by Glossa, as well as the use of metadata criteria to filter search results

Simple search: han “he”

22 October 2013, CLARINO consortium meeting

Simple search: han er “he is”

Search results

Generalization 1: Search for all forms of the verb være “be”

Generalization 2: Search for past tense of all verbs…

…with a specified interval between the words

Click on the “Filters” button to restrict the results to corpus texts matching selected metadata values

Advanced search with regular expressions

• Most searches can be specified using the simple or extended views illustrated so far
• However, for maximum flexibility there is a third view that allows the use of regular expressions
• This view gives access to the full query used by the search engine (in this case, the IMS Corpus Workbench)
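The kind of query exposed by this third view can be sketched in Python. The helper below is a minimal illustration, not Glossa's own code: it assembles query strings in the CQP syntax of the IMS Corpus Workbench for the three searches in Example 1. The attribute names `word`, `lemma`, `pos` and `tense`, the tag values `verb` and `pret`, and all function names are assumptions made for this sketch; the attributes actually available depend on how the corpus was annotated.

```python
def token(**constraints: str) -> str:
    """Build one CQP token expression such as [word="han"].

    Each keyword is an attribute name (assumed, corpus-specific) and each
    value is a regular expression matched against that attribute.
    """
    if not constraints:
        return "[]"  # an empty token expression matches any token
    for value in constraints.values():
        if '"' in value:
            raise ValueError("double quotes are not handled in this sketch")
    parts = [f'{attr}="{value}"' for attr, value in constraints.items()]
    return "[" + " & ".join(parts) + "]"


def gap(min_tokens: int, max_tokens: int) -> str:
    """Match between min_tokens and max_tokens arbitrary tokens."""
    if not 0 <= min_tokens <= max_tokens:
        raise ValueError("expected 0 <= min_tokens <= max_tokens")
    return f"[]{{{min_tokens},{max_tokens}}}"


def sequence(*parts: str) -> str:
    """Join token expressions into one query, in order."""
    return " ".join(parts)


# Simple search: the word forms "han er"
han_er = sequence(token(word="han"), token(word="er"))

# Generalization 1: "han" followed by any form of the lemma "være"
han_vaere = sequence(token(word="han"), token(lemma="være"))

# Generalization 2: "han", then 0 to 2 arbitrary tokens, then a past-tense verb
# (pos="verb" and tense="pret" are hypothetical tag names)
han_past = sequence(token(word="han"), gap(0, 2), token(pos="verb", tense="pret"))

for query in (han_er, han_vaere, han_past):
    print(query)
```

Under these assumptions the last query reads `[word="han"] []{0,2} [pos="verb" & tense="pret"]`, which is roughly what the extended view would have to generate for "past tense of all verbs with a specified interval between the words".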

Searching corpora in the CLARIN network

• Glossa can search simultaneously in one or more corpora located on different servers in the CLARIN network
• Currently, only searches for concrete words or phrases are supported, with no grammatical search or metadata filtering; however, as these features become supported by the CLARIN infrastructure, they will be built into Glossa

Example 2: Simultaneous search for beschränkenden (“restrictive”) in three German corpora

• The corpora are located on three different servers, each at a different institution:
  – The C4 Corpus at the Berlin-Brandenburg Academy of Sciences and Humanities
  – The TextGrid Digital Library at the Institute for the German Language in Mannheim
  – The Tübingen Baumbank des Deutschen – Diachrones Corpus at the University of Tübingen
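A search across several servers can be sketched as one request per server with the same query. The example below is only an illustration and does not describe Glossa's implementation: it assumes that each remote corpus is exposed through an SRU endpoint accepting a CQL term query, as in CLARIN's Federated Content Search, and it builds the request URLs without sending them. The endpoint addresses, the corpus keys and the function name are hypothetical.

```python
from urllib.parse import urlencode

# Hypothetical endpoint URLs, used only for illustration.
ENDPOINTS = {
    "corpus-a": "https://corpus-a.example.org/sru",
    "corpus-b": "https://corpus-b.example.org/sru",
    "corpus-c": "https://corpus-c.example.org/sru",
}


def search_url(endpoint: str, term: str, max_records: int = 10) -> str:
    """Build an SRU 1.2 searchRetrieve URL for a word or phrase."""
    if '"' in term:
        raise ValueError("double quotes are not handled in this sketch")
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": f'"{term}"',  # quoted CQL term, so phrases with spaces work
        "maximumRecords": str(max_records),
    }
    # urlencode percent-encodes non-ASCII characters as UTF-8
    return f"{endpoint}?{urlencode(params)}"


# One URL per server, all carrying the same query
urls = {name: search_url(url, "beschränkenden") for name, url in ENDPOINTS.items()}

for name, url in urls.items():
    print(name, url)
```

Because the query is limited to a quoted word or phrase, this sketch reflects the restriction noted above: no grammatical attributes or metadata filters are sent to the remote servers.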

Results from all three corpora are shown in the same table

Conclusions

• Glossa can be used to search both local corpora and corpora that are made available in the CLARIN infrastructure
• Grammatical features and metadata restrictions are supported for local corpora (and hopefully in the future also for other corpora in the CLARIN infrastructure)
• More info here: http://www.tekstlab.uio.no/glossa2/front