The Nordic Dialect Corpus – an Advanced Research Tool
Janne Bondi Johannessen, University of Oslo, Oslo, Norway
Joel Priestley, University of Oslo, Oslo, Norway
Kristin Hagen, University of Oslo, Oslo, Norway
Tor Anders Åfarli, Norwegian Univ. of Science & Tech., Trondheim, Norway
Øystein Alexander Vangsnes, University of Tromsø, Tromsø, Norway

Kristiina Jokinen and Eckhard Bick (Eds.), NODALIDA 2009 Conference Proceedings, pp. 73–80

Abstract

The paper describes the first part of the Nordic Dialect Corpus. This is a tool that combines a number of useful features that together make it a unique and very advanced resource for researchers in many fields of language research. The corpus is web-based and features full audio-visual representation linked to transcripts.

1 Credits

The Nordic Dialect Corpus is the result of close collaboration between the partners in the research networks Scandinavian Dialect Syntax and Nordic Centre of Excellence in Microcomparative Syntax. The researchers in the network have contributed to everything from decision-making to practical work, ranging from methodology to recordings, transcription, and annotation. Some of the corpus (in particular, recordings of informants) has been financed by the national research councils in the individual countries, while the technical development has been financed by the University of Oslo and the Norwegian Research Council, plus the Nordic research funds NOS-HS and NordForsk.

2 Introduction

In this paper, we describe the first, completed part of the Nordic Dialect Corpus. The corpus has a variety of features that combined make it a very advanced tool for language researchers. These features include: linguistic contents (dialects from five closely related languages), annotation (tagging and two types of transcription), search interface (advanced possibilities for combining a large array of search criteria and results presentation in an intuitive and simple interface), many search variables (linguistics-based, informant-based, time-based), multimedia display (linking of sound and video to transcriptions), display of informant details (number of words and other information on informants), and advanced results handling (concordances, collocations, counts and statistics shown in a variety of graphical modes, plus further processing). Finally, and importantly, the corpus is freely available for research on the web.

We give examples of various kinds of searches, of results displays, and of results handling.

3 Why the Nordic Dialect Corpus was developed

The Nordic Dialect Corpus was developed after a need for research material was voiced by members of NORMS (Nordic Centre of Excellence in Micro-comparative Syntax) and the ScanDiaSyn networks.

The overarching goal for these researchers is to study the dialects of the North-Germanic languages, i.e., the Nordic languages spoken in the Nordic countries, as dialects of the same language. The languages are closely related to each other, and three of them are mutually intelligible (Norwegian, Swedish and Danish), as are two others (Faroese and Icelandic). All of them have some mutual intelligibility with each other if we consider written forms.

Studying the dialects only within the confines of each national language was therefore considered to be misguided from a theoretical and principled point of view.

Second, doing research across dialects over such a big area, covering six countries (Denmark, Faroe Islands, Finland, Iceland, Norway, and Sweden), would be almost impossible if each researcher had to get hold of relevant data on their own.

Third, the research in NORMS and ScanDiaSyn focusses on syntax, in which case data of many different kinds were necessary. Questionnaires for specific phenomena were needed (but will not be discussed in this paper), and recordings of spontaneous speech as it is used in ordinary conversations were very important. The latter need is satisfied by the Nordic Dialect Corpus.

4 Description of the Corpus

4.1 Linguistic contents and numbers

The corpus contains dialect data from the national languages Danish, Faroese, Icelandic, Norwegian, and Swedish. It is steadily growing, since new recordings are still being made or planned, while other recordings are in various stages of completion. At the moment, it contains speech data from approximately 170 informants with 466 000 words, unevenly spread between the five countries. Eventually, this will rise to around 600 informants, and the number of words will likely more than double. The numbers for the corpus as of today are given below.

Country          No of informants   No of words
Denmark                         7        19 088
Faroe Islands                   3        16 794
Finland                         0             0
Iceland                         4        10 287
Norway                         45       132 417
Sweden                        125       287 639
Sum                           184       466 225

Table 1: Corpus contents as of 9 January 2009.

Due to differences in the financing of the data collection in the different countries, the data are less uniform than one might ideally have wanted. (Some recordings and transcriptions were done for this corpus, while others already existed, such as most of the Swedish ones, which were generously given to us by the earlier project Swedia 2000.)

Some dialects, such as those from Norway, the Swedish dialect of Oevdalian and the Danish dialect of Western Jutlandic, have two kinds of recordings per informant: one semi-formal interview (informant and project assistant), and one informal conversation between two informants. Some dialects have recordings of both young and old informants, while others are only represented by old ones. Some dialects are represented by both old and new recordings, where old ones are generally around fifty years old. Some dialects have been recorded by audio only, while others have been recorded by both audio and video. All the dialects have recordings of informants belonging to both genders. Most importantly, however, all the recordings represent spontaneous speech.

4.2 Annotation: transcription and tagging

All the dialect data have been transcribed according to at least one transcription standard, and this work has for the most part been done in the individual countries: each dialect has been transcribed in the standard official orthography of that country. (For Norwegian, which has two standard orthographies, Bokmål was chosen since there exist important computational tools for this variant.) In addition, all the Norwegian dialects and some Swedish ones have also been transcribed phonetically.[1] For the Norwegian dialects and the Oevdalian Swedish ones that have two transcriptions, the first transcription to be done was in each case the phonetic one, and the phonetic transcription was then translated to an orthographic transcription via a semi-automatic dialect transliterator developed for the project. The fact that there are two transcriptions for dialects that are very different from the standard national orthography makes it possible to search with both transcriptions in the corpus, and to present search results in both, as illustrated below for the Swedish dialect of Oevdalian:

[Figure 1: image not recoverable. Gloss: it can.1PL we well do if come.1PL on. Translation: 'We can possibly do it if we remember it.']

Figure 1. Two transcriptions for Oevdalian.

[1] The Norwegian phonetic transcription follows that of Papazian and Helleland (2005). The transcription of the Oevdalian dialect follows the Oevdalian orthography (stan-

The Text Laboratory at the University of Oslo has the responsibility for the further technical development, including tagging. The whole corpus will be grammatically tagged with POS and selected morpho-syntactic features, language by language. So far, the Norwegian data have been tagged, while the Swedish data will be tagged soon. Tagging speech data is different from tagging written data. Speech contains disfluencies, interruptions and repetitions, and there are rarely clear clause boundaries (Allwood, Nivre and Ahlsén 1989, Johannessen and Jørgensen 2006). This is usually reflected in the transcription of speech, which generally does not contain clause boundary or sentential markers such as full stops and exclamation marks (Jørgensen 2008, Rosén 2008).

manually repeatedly corrected file. The TreeTagger gained an accuracy of 96.9 %. This tagger has then been used unchanged for the dialect corpus, under the assumption that the speech represented in the dialects and in Oslo is sufficiently similar once it is all transcribed by the same transcription standard. The Swedish tagger is being trained in the same way. A written-language TnT tagger developed by Sofie Johansson Kokkinakis (2003) has been applied to the Swedish dialect transcriptions (their standard orthographic version). The new data will be used as training data for a new Swedish speech TreeTagger.

4.3 Search Interface

The corpus uses Glossa, an advanced search interface and results handling system (Nygaard 2007, Johannessen et al. 2008). The system allows for a large variety of search combinations, making it possible to do very advanced and complex searches, even though the interface is very simple, with pull-down menus and boxes that expand only when prompted by the user. The corpus search system Corpus Work Bench (Christ 1994, Evert 2005) is used: simple corpus queries are translated to regular expressions before querying, something that is invisible to the user.

Several of the features in the search interface and the results display follow suggestions by participants in ScanDiaSyn and NORMS.

Searching for lemmas and parts of words:

For those parts of the corpus that are tagged and lemmatised, it is possible to search for the lemma only. This way we get all inflected forms of one lexeme. This feature is very useful when there is suppletion in the stem of the word. For example, a search for the Norwegian lemma gås ('goose') will give the results gås, gåsa, gjess, gjessene (various combinations of number and