Text Segmentation by Language Using Minimum Description Length


Hiroshi Yamaguchi
Graduate School of Information Science and Technology, University of Tokyo

Kumiko Tanaka-Ishii
Faculty and Graduate School of Information Science and Electrical Engineering, Kyushu University

Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 969–978, Jeju, Republic of Korea, 8-14 July 2012. © 2012 Association for Computational Linguistics

Abstract

The problem addressed in this paper is to segment a given multilingual document into segments for each language and then identify the language of each segment. The problem was motivated by an attempt to collect a large amount of linguistic data for non-major languages from the web. The problem is formulated in terms of obtaining the minimum description length of a text, and the proposed solution finds the segments and their languages through dynamic programming. Empirical results demonstrating the potential of this approach are presented for experiments using texts taken from the Universal Declaration of Human Rights and Wikipedia, covering more than 200 languages.

1 Introduction

For the purposes of this paper, a multilingual text means one containing text segments, limited to those longer than a clause, written in different languages. We can often find such texts in linguistic resources collected from the World Wide Web for many non-major languages, which tend to also contain portions of text in a major language. In automatic processing of such multilingual texts, they must first be segmented by language, and the language of each segment must be identified, since many state-of-the-art NLP applications are built by learning a gold standard for one specific language. Moreover, segmentation is useful for other objectives such as collecting linguistic resources for non-major languages and automatically removing portions written in major languages, as noted above. The study reported here was motivated by this objective. The problem addressed in this article is thus to segment a multilingual text by language and identify the language of each segment. In addition, for our objective, the set of target languages consists of not only major languages but also many non-major languages: more than 200 languages in total.

Previous work that directly concerns the problem addressed in this paper is rare. The most similar previous work that we know of comes from two sources and can be summarized as follows. First, (Teahan, 2000) attempted to segment multilingual texts by using text segmentation methods used for non-segmented languages. For this purpose, he used a gold standard of multilingual texts annotated by borders and languages. This segmentation approach is similar to that of word segmentation for non-segmented texts, and he tested it on six different European languages. Although the problem setting is similar to ours, the formulation and solution are different, particularly in that our method uses only a monolingual gold standard, not a multilingual one as in Teahan's study. Second, (Alex, 2005) (Alex et al., 2007) solved the problem of detecting words and phrases in languages other than the principal language of a given text. They used statistical language modeling and heuristics to detect foreign words and tested the case of English embedded in German texts. They also reported that such processing would raise the performance of German parsers. Here again, the problem setting is similar to ours but not exactly the same, since the embedded text portions were assumed to be words. Moreover, the authors only tested for the specific language pair of English embedded in German texts. In contrast, our work considers more than 200 languages, and the portions of embedded text are larger: up to the paragraph level to accommodate the reality of multilingual texts. The extension of our work to address the foreign word detection problem would be an interesting future work.

From a broader view, the problem addressed in this paper is further related to two genres of previous work. The first genre is text segmentation. Our problem can be situated as a sub-problem from the viewpoint of language change. A more common setting in the NLP context is segmentation into semantically coherent text portions, of which a representative method is text tiling as reported by (Hearst, 1997). There could be other possible bases for text segmentation, and our study, in a way, could lead to generalizing the problem. The second genre is classification, and the specific problem of text classification by language has drawn substantial attention (Grefenstette, 1995) (Kruengkrai et al., 2005) (Kikui, 1996). Current state-of-the-art solutions use machine learning methods for languages with abundant supervision, and the performance is usually high enough for practical use. This article concerns that problem together with segmentation but has another particularity in aiming at classification into a substantial number of categories, i.e., more than 200 languages. This means that the amount of training data has to remain small, so the methods to be adopted must take this point into consideration. Among works on text classification into languages, our proposal is based on previous studies using cross-entropy such as (Teahan, 2000) and (Juola, 1997). We explain these works in further detail in §3.

This article presents one way to formulate the segmentation and identification problem as a combinatorial optimization problem; specifically, to find the set of segments and their languages that minimizes the description length of a given multilingual text. In the following, we describe the problem formulation and a solution to the problem, and then discuss the performance of our method.

2 Problem Formulation

In our setting, we assume that a small amount (up to kilobytes) of monolingual plain text sample data is available for every language, e.g., the Universal Declaration of Human Rights, which serves to generate the language model used for language identification. This entails two sub-assumptions.

First, we assume that for all multilingual text, every text portion is written in one of the given languages; there is no input text of an unknown language without learning data. In other words, we use supervised learning. In line with recent trends in unsupervised segmentation, the problem of finding segments without supervision could be solved through approaches such as Bayesian methods; however, we report our result for the supervised setting since we believe that every segment must be labeled by language to undergo further processing.

Second, we cannot assume a large amount of learning data, since our objective requires us to consider segmentation by both major and non-major languages. For most non-major languages, only a limited amount of corpus data is available.¹ This constraint suggests the difficulty of applying certain state-of-the-art machine learning methods requiring a large learning corpus. Hence, our formulation is based on the minimum description length (MDL), which works with relatively small amounts of learning data.

¹ In fact, our first motivation was to collect a certain amount of corpus data for non-major languages from Wikipedia.

In this article, we use the following terms and notations. A multilingual text to be segmented is denoted as $X = x_1, \ldots, x_{|X|}$, where $x_i$ denotes the $i$-th character of $X$ and $|X|$ denotes the text's length. Text segmentation by language refers here to the process of segmenting $X$ by a set of borders $B = [B_1, \ldots, B_{|B|}]$, where $|B|$ denotes the number of borders, and each $B_i$ indicates the location of a language border as an offset number of characters from the beginning. Note that a pair of square brackets indicates a list. Segmentation in this paper is character-based, i.e., a $B_i$ may refer to a position inside a word. The list of segments obtained from $B$ is denoted as $\mathcal{X} = [X_0, \ldots, X_{|B|}]$, where the concatenation of the segments equals $X$. The language of each segment $X_i$ is denoted as $L_i$, where $L_i \in \mathcal{L}$, the set of languages. Finally, $L = [L_0, \ldots, L_{|B|}]$ denotes the sequence of languages corresponding to each segment $X_i$. The elements in each adjacent pair in $L$ must be different.

We formulate the problem of segmenting a multilingual text by language as follows. Given a multilingual text $X$, the segments $\mathcal{X}$ for a list of borders $B$ are obtained with the corresponding languages $L$. Then, the total description length is obtained by calculating each description length of a segment $X_i$ for the language $L_i$:

$$(\hat{\mathcal{X}}, \hat{L}) = \arg\min_{\mathcal{X}, L} \sum_{i=0}^{|B|} dl_{L_i}(X_i). \quad (1)$$

The function $dl_{L_i}(X_i)$ calculates the description length of a text segment $X_i$ through the use of a language model for $L_i$. Note that the actual total description length must also include an additional term, $\log_2 |X|$, giving information on the number of segments (with the maximum to be segmented by each character). Since this term is a common constant for all possible segmentations and the minimization of formula (1) is not affected by this term, we will ignore it.

The model defined by (1) is additive for $X_i$, so the following formula can be applied to search for language $L_i$ given a segment $X_i$:

3 Calculation of Cross-Entropy

The first term of (3), $-\log_2 P_{L_i}(X_i)$, is the cross-entropy of $X_i$ for $L_i$ multiplied by $|X_i|$. Various methods for computing cross-entropy have been proposed, and these can be roughly classified into two types based on different methods of universal coding and the language model.
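
Because the objective in formula (1) is a sum over segments, it can be minimized by dynamic programming over border positions. The following Python code is only a minimal sketch of that idea and not the authors' implementation. It makes two assumptions that are not taken from the text above: the description length of a segment is approximated with an add-one-smoothed character unigram model (a simple stand-in for the cross-entropy estimators the paper discusses), and every segment pays an extra `log2(|X|) + log2(number of languages)` bits for naming one border and one language. The names `char_bits`, `segment`, and `samples`, and the toy training strings, are hypothetical.

```python
import math
from collections import Counter


def char_bits(counts, total, ch, alphabet_size):
    """Bits for one character: -log2 of an add-one-smoothed unigram probability."""
    return -math.log2((counts[ch] + 1) / (total + alphabet_size))


def segment(text, samples):
    """Return (borders, languages) minimizing the summed description length.

    `samples` maps a language name to a small monolingual training string
    (at least one language is expected).
    """
    n = len(text)
    if n == 0:
        return [], []
    alphabet_size = len(set(text).union(*samples.values()))
    # ASSUMPTION (not from the source text): each segment also pays for
    # naming one border position and one language label.
    overhead = math.log2(n) + math.log2(len(samples))
    # prefix[lang][t] = bits needed to encode text[:t] with the model of `lang`.
    prefix = {}
    for lang, sample in samples.items():
        counts, total = Counter(sample), len(sample)
        acc = [0.0]
        for ch in text:
            acc.append(acc[-1] + char_bits(counts, total, ch, alphabet_size))
        prefix[lang] = acc
    # best[t] = minimal description length of text[:t];
    # back[t] = (start offset, language) of the last segment in that optimum.
    best = [0.0] + [math.inf] * n
    back = [None] * (n + 1)
    for t in range(1, n + 1):
        for s in range(t):
            for lang, acc in prefix.items():
                cost = best[s] + overhead + (acc[t] - acc[s])
                if cost < best[t]:
                    best[t], back[t] = cost, (s, lang)
    # Follow the back-pointers from the end of the text to recover the result.
    borders, languages = [], []
    t = n
    while t > 0:
        s, lang = back[t]
        languages.append(lang)
        if s > 0:
            borders.append(s)
        t = s
    return borders[::-1], languages[::-1]


# Toy usage with two hypothetical training samples.
samples = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
    "ja": "わたしはにほんごのがくせいですこれはほんです",
}
english_part = "the cat sat on the mat"
japanese_part = "わたしはがくせいです"
borders, languages = segment(english_part + japanese_part, samples)
print(borders, languages)
```

This sketch runs in time quadratic in the text length times the number of languages, which is fine for short inputs but would need pruning for long documents. Whenever the assumed per-segment overhead is positive, splitting a run of one language into two segments only adds cost under a unigram model, so the requirement that adjacent languages differ is met without an explicit check.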