Text Corpora and the Challenge of Newly Written Languages
Proceedings of the 1st Joint SLTU and CCURL Workshop (SLTU-CCURL 2020), pages 111–120
Language Resources and Evaluation Conference (LREC 2020), Marseille, 11–16 May 2020
© European Language Resources Association (ELRA), licensed under CC-BY-NC

Alice Millour∗, Karën Fort∗†
∗Sorbonne Université / STIH, †Université de Lorraine, CNRS, Inria, LORIA
28, rue Serpente 75006 Paris, France, 54000 Nancy, France
[email protected], [email protected]

Abstract

Text corpora represent the foundation on which most natural language processing systems rely. However, for many languages, collecting or building a text corpus of a sufficient size still remains a complex issue, especially for corpora that are accessible and distributed under a clear license allowing modification (such as annotation) and further resharing. In this paper, we review the sources of text corpora usually called upon to fill the gap in low-resource contexts, and how crowdsourcing has been used to build linguistic resources. Then, we present our own experiments with crowdsourcing text corpora and an analysis of the obstacles we encountered. Although the results obtained in terms of participation are still unsatisfactory, we advocate that the effort towards a greater involvement of the speakers should be pursued, especially when the language of interest is newly written.

Keywords: text corpora, dialectal variants, spelling, crowdsourcing

1. Introduction

Speakers from various linguistic communities are increasingly writing their languages (Outinoff, 2012), and the Internet is increasingly multilingual.¹

Many of these languages are less-resourced: these linguistic productions are not sufficiently documented, while there is an urgent need to provide tools that sustain their digital use. What is more, when a language is not standardized, we need to build adapted resources and tools that embrace its diversity.

Including these linguistic productions in natural language processing (NLP) pipelines hence requires efforts on two complementary fronts: (i) collecting or building resources that represent the use of the language, (ii) developing tools which can cope with the variation mechanisms.

In all cases, the very first resource that is needed for further processing is a text corpus.

After presenting the challenges related to the processing of oral languages when they come to be written, we introduce, in Section 3, the existing multilingual sources that are commonly used to collect corpora, as well as their main shortcomings.

In Section 4, we show how crowdsourcing has been used in the past to involve the members of linguistic communities in collaboratively building resources for their languages. We argue that this method is all the more reasonable when it comes to collecting meaningful data for non-standardized languages.

In Section 5, we present the existing initiatives to crowdsource text corpora. Based on our own experiments and on the result of a survey regarding the digital use of a non-standardized language, we explain why collecting this particular type of resource is challenging.

¹ See, for instance, the reports provided by w3tech such as https://w3techs.com/technologies/history_overview/content_language/ms/y.

2. The Need for Text Corpora

The rise in the use of SMS, online chat and social media in general has created a new space of expression for an increasing number of speakers (see, for instance, the studies on specific languages carried out by Rivron (2012) or Soria et al. (2018)). Although linguistic communities are being threatened all over the world, this represents a valuable opportunity to observe, document and equip with appropriate tools an increasing number of languages. These new spaces of written conversation have been taken over by linguistic communities whose practice had been mainly oral until then (van Esch et al., 2019). When no orthography has been defined for a given language, or when one (or various) conventions exist but are not consistently used by the speakers, spellings may vary from one speaker to another. Indeed, when the spellings are not standardized by an arbitrary convention, speakers may transcribe the language based on how they speak it.

In fact, standardizing spelling is not confined to defining the orthography, as it usually also acts as a unification process of potential linguistic variants towards a sole written form. By contrast, the absence of such a standardization process allows the raw transcription of a multitude of linguistic variants of a given language. Since these variants are transcribed according to the spelling habits and linguistic backgrounds of each speaker, the Internet, especially in its conversational nature, has become a breeding ground for the expression and observation of linguistic diversity.

Situations of spelling variation observed on the Internet are documented in diverse linguistic contexts such as those of:

• The Zapotec and Chatino communities in Oaxaca, Mexico, as detailed by Lillehaugen (2016) in the context of the Voces del Valle program. During this program, speakers were encouraged to write tweets in their languages. They were provided with spelling guidance they were not compelled to follow. As stated by the author: “The result was that, for the most part, the writers were non-systematic in their spelling decisions—but they were writing”.

• Some of the communities speaking Tibetan dialects outside China, which develop a “written form based on the spoken language” independently of the Classical Literary Tibetan (Tournadre, 2014).

• The Eton ethnic group, in Cameroon, about which Rivron (2012) observes that the Internet is the medium of “the extension of a mother tongue outside its habitual context and uses, and the correlated development of its graphic system”.

• Speakers of Javanese dialects who “have their own way of writing down the words they use according to the pronunciation they understand”, regardless of the official spelling. As each dialect develops its own spelling, a dialect that was “originally only recognizable through its oral narratives (pronunciation) is now easily recognizable through the spelling used in social media” (Fauzi and Puspitorini, 2018).

• Communities using Arabizi to transcribe Arabic online: as reported by Tobaili et al. (2019), Arabizi allows multiple mappings between Arabic and Roman alphanumerical characters, and thus makes apparent dialectal variations usually hidden in the traditional writing.

• Communities transliterating Indian dialects with the Roman alphabet without observing systematic conventions for transliteration (Shekhar et al., 2018).

• Regional European languages such as Alsatian, a continuum of Alemannic dialects, for which a great diversity of spellings is reported (Millour and Fort, 2019) even though a flexible spelling system, Orthal (Crévenat-Werner and Zeidler, 2008), has been developed.

In the following, we will refer to these proteiform languages as “multi-variant”. The variation observed is indeed the result of (at least) two simultaneous mechanisms: the dialectal and scriptural variations. Both of these degrees of freedom may be impacted by the usual dimensions for variation (diachronic, diatopic, diastratic, and diamesic).

From an NLP perspective, these linguistic productions represent a challenge. In fact, they push us to deal with the issues that variation processes imply, either because these productions account for most of the written existence of a non-standardized language, or because

[...]

• an insufficient coverage to constitute the basis for further development of linguistic resources. These corpora are unlikely to be balanced in terms of representativeness of the existing practices.

• their nature and license sometimes require operations that result in a loss of information such as the metadata necessary to identify the languages or the structure of the document.

• using them requires additional linguistic resources (to perform language identification, for instance).

In the following, we first present the Wikipedia project, which, with 306 active Wikipedias distributed under Creative Commons licenses, undoubtedly provides the largest freely available multilingual corpus.

Second, we present how the Web can more generally be used as a source of text corpora. We focus on describing how Web crawling has been used to gather corpora for less-resourced languages, and briefly comment on the use of social network-based corpora.

3.1. Wikipedia as a Corpus

Wikipedia is an online collaborative multilingual encyclopedia supported by the WIKIMEDIA FOUNDATION, a non-profit organization.

Along with providing structured information from which lexical semantic resources or ontologies can be derived, Wikipedia is an easily accessible source of text corpora widely used in the NLP community, and from which both well- and less-resourced languages benefit.

Its popularity and its collaborative structure make it the most natural environment to foster collaborative text production. We discuss in this section to what extent Wikipedia represents a valuable source of text corpora for less-resourced and non-standardized languages, in terms of further NLP processing.

After a short introduction on the size and quality of the existing Wikipedias, we describe how the issue of language identification and the purpose of Wikipedia prevent it from being the most appropriate virtual place to host dialectal and scriptural diversity.