
SLTU-2014, St. Petersburg, Russia, 14-16 May 2014

COMMUNITY-BASED RESOURCE BUILDING AND DATA COLLECTION

Kristiina Jokinen¹ and Graham Wilcock²

¹Institute of Behavioural Sciences
²Department of Modern Languages
University of Helsinki, Finland
{firstname.lastname}@helsinki.fi

ABSTRACT

The paper describes our work on participatory and community-based resource collection for the Sami language. This includes community events where participants wrote new Sami Wikipedia articles and took part in speech data collection by reading aloud Sami Wikipedia articles and discussing freely in group conversations. The aim was to increase the number of Sami Wikipedia articles, and thereby strengthen Wikipedia as a digital resource for the Sami language, and to collect speech data to be used in developing Sami speech components. Such components are intended to be combined with the Sami Wikipedia in order to build a spoken interactive knowledge access system.

Index Terms: language resources development, Wikipedia, Sami language, community-based participatory data collection

1. INTRODUCTION

The new paradigms in communication technology affect language communities in various ways. Social media applications provide novel practices to bring people together, and people’s information needs are being met from online collaboratively edited material such as Wikipedia. As language is a means to provide information, its role in the new situations is prominent, and it needs to adapt itself to digital changes. This includes the usual linguistic changes in areas such as lexicon, morphology and syntax, but it also has an effect on the level of interaction itself. As [8] notes, language has become “a function that is performed digitally”.

This paper discusses aspects of this development, and focuses on the data collection activities for the Northern Sami language in the DigiSami project. The collection of speech and multimodal video data contributes to building Sami language resources for speech and language technology, while supporting the planning and writing of new Sami Wikipedia articles strengthens the digital text resources for the Sami language. The speech data will be used to support development of speech technology components that can be combined with the Sami Wikipedia to build a spoken interactive knowledge access system.

The paper is structured as follows. Section 2 sets the starting point for the present study by introducing the DigiSami project and the WikiTalk application. Section 3 briefly describes the Sami languages, and Section 4 reviews participatory community-based research methods. Section 5 details the data collection events, participant recruitment, and the three types of data collected: text, speech, and multimodal data. Finally, Section 6 provides discussion and concludes with future directions.

2. UNDER-RESOURCED LANGUAGES AND LANGUAGE TECHNOLOGY

Under-resourced languages typically do not have advanced natural language processing tools such as part-of-speech taggers and syntactic parsers, or sophisticated knowledge resources such as Wordnets and semantic web ontologies. A serious consequence of this is that language technology applications that require such tools and resources cannot offer much support for these languages. It is desirable to find ways to start developing the missing resources for as many languages as possible, and many activities have been initiated for this (e.g. the LRE Map [1] for sharing language resources, or making speech corpora freely available for TTS development [18]). Suggestions for overcoming the many challenges facing small languages include using the same technology standards as big languages, using crowdsourcing and connecting with other small language communities [17].

It is also important to consider what kinds of language technology applications can work successfully with minimal NLP tools or without them. One sophisticated knowledge resource that does not depend on advanced NLP tools for a language is Wikipedia (http://www.wikipedia.org/). Many languages that are under-resourced in language technology tools have their own Wikipedia. The reason for this is that Wikipedia pages can be written and edited by ordinary people with no training in NLP, and the underlying software running Wikipedia is made freely available for all languages by the Wikimedia Foundation (http://www.wikimedia.org/). This means that LT applications that use Wikipedia as a primary resource are potentially able to work successfully even with languages for which sophisticated NLP tools are not available.

2.1. WikiTalk

In earlier research [5, 20] we developed WikiTalk, a speech-based dialogue interface to Wikipedia, through which the user can navigate around Wikipedia following links to any desired topics, and listen to chunks of interesting articles being read out aloud on demand. One version of WikiTalk runs on a humanoid robot [2, 6], so that the robot can act as a talkative conversational companion, able to talk in a very well-informed manner about more or less any topic, while following the human partner's changing interests.

WikiTalk is intrinsically multilingual, as it can work with any language if there is a Wikipedia in the language and if there are speech components for the language. Sophisticated NLP tools such as part-of-speech taggers and syntactic parsers can be used if they are available, but they are not required because the WikiTalk application can use the actual sentences of the Wikipedia articles directly.

2.2. The DigiSami Project

The DigiSami project (http://www.helsinki.fi/digisami/) aims to support digital content generation for the Sami languages with the help of language technology, and it focuses especially on the Northern Sami language. It is one of the projects in a joint research framework funded by the Academy of Finland and the Hungarian Academy of Sciences. The goal of the research framework is to increase the visibility and use of small Finno-Ugric languages in the digital world, as well as to apply language technology tools to develop and process digital material.

At the time of writing, the grand total of articles in all Wikipedias (in all languages) is about 31 million. English Wikipedia is the biggest with 4.5 million articles. The Sami Wikipedia (in Northern Sami) was started in 2004 and has between 7500 and 8000 articles. This places it in the middle of all Wikipedias in terms of size, being ranked 134th by number of articles. For comparison, the Finnish Wikipedia has about 350,000 articles and is ranked 20th.

Sami Wikipedia (http://se.wikipedia.org) contains about 400,000 words in total. The average article length is about 500 characters (as Wikipedias grow, their average article length tends to grow as well). No statistics are available about how many articles were created by translating them from some other Wikipedia.

The extension of applications such as WikiTalk to include Northern Sami will require the necessary resources to be developed. In particular, there is a need for more Wikipedia articles as well as for speech data. To address these needs, the project organized six community events at which new Wikipedia articles were written and spoken language material was collected through video recordings.

3. SAMI LANGUAGE CHARACTERISTICS

Sami is a Finno-Ugric language which is related to Finnish and is spoken in the northern parts of Scandinavia, Finland and the Kola Peninsula in Russia. The languages are spread among four countries: Norway, Sweden, Finland and Russia, and the speakers do not necessarily understand each other. A few of the languages have died over the centuries, e.g. the last speaker of Akkala Sami, spoken in Russia, died in 2003.

There are around 30,000 to 40,000 speakers of different Sami languages, most of them in Norway. The languages are: Southern Sami, Ume Sami, Pite Sami, Lule Sami, Northern Sami, Inari Sami, Skolt Sami, Kildin Sami and Ter Sami. Northern Sami is the biggest group with about 20,000 speakers and three main dialects: Torne Sami, Finnmark Sami and Sea Sami. It is also a lingua franca, probably because of the number of its speakers. Our project focuses on Northern Sami, and in this paper, we often use “Sami language” to refer to Northern Sami.

It must be noted that there are more ethnic Sami people than there are speakers of the languages. However, digital information technology has increased interest in the Sami language; for example, Wikitravel (http://wikitravel.org) offers Sami-English translations for tourist purposes. One of the important goals of the DigiSami project is to encourage the Sami people to use their language in digital media, but also to increase its visibility among the languages of the world. It is thus important to revitalize the language by encouraging its use in digital contexts, but also to make it visible and available for those who are interested in the language and culture of the Sami people.

Several of the Sami languages have some language technology tools. For instance, Northern Sami has a text analyzer, wordlists, and translation tools (more information at http://giellatekno.uit.no/), and there is a Northern Sami Wikipedia created under the auspices of the Wikimedia Foundation. However, as mentioned above, many of the Sami Wikipedia articles are short compared with their English or Finnish counterparts, and they contain few links to other articles, although such links are crucial for further navigation and information search among the articles.

4. COMMUNITY-BASED RESOURCE BUILDING

The community effort supported by the DigiSami project to create and write Wikipedia articles is intended to encourage people to develop the Northern Sami Wikipedia further. It is also expected that this kind of community effort will give a boost to the speakers of other Sami languages and other under-resourced language communities to develop their own Wikipedia. Such activity will not only support the Wikipedia encyclopedia, but can also be claimed to benefit language learning and language use. In fact, within the context of multilingual language learning, it has been pointed out [14] how pedagogical methods emphasizing the speaker’s own activity and experience provide good results in children’s language learning.