Net.Lang Towards the Multilingual Cyberspace


MAAYA NETWORK

NET.LANG
TOWARDS THE MULTILINGUAL CYBERSPACE

EDITION AND COORDINATION: LAURENT VANNINI, HERVÉ LE CROSNIER

C&F ÉDITIONS, 2012

Associate website: http://net-lang.net

Other books (in French) from C&F éditions:

Libres savoirs, les biens communs de la connaissance. Collective work coordinated by the VECAM association. May 2011, ISBN 978-2-915825-06-0
Aux sources de l’utopie numérique. De la contre-culture à la cyberculture : Stewart Brand, un homme d’influence. By Fred Turner. April 2012, ISBN 2-915825-10-6
Dans le labyrinthe. Évaluer l’information sur internet. By Alexandre Serres. April 2012, ISBN 978-2-915825-22-0

Complete catalog and online bookstore: http://cfeditions.com

ISBN PDF edition 978-2-915825-24-4
C&F éditions, March 2012
35C rue des Rosiers – 14000 Caen, France
http://cfeditions.com

This book is published under a Creative Commons license: attribution, share alike (http://creativecommons.org/licenses/by-sa/3.0/fr/).

CONTENTS

Credits

FOREWORDS
Irina Bokova, General Director, UNESCO
Abdou Diouf, General Secretary, La Francophonie
José Luis Dicenta, General Secretary, Union Latine
Dwayne Bailey, Research Director, ANLoc
Daniel Prado, Executive Secretary, Maaya Network

PART 1: WHEN TECHNOLOGY MEETS MULTILINGUALISM
Daniel Prado: Language Presence in the Real World and Cyberspace
Michaël Oustinoff: English Won’t Be the Internet’s Lingua Franca
Éric Poncet: Technological Innovation and Language Preservation
Maik Gibson: Preserving the Heritage of Extinct or Endangered Languages
Marcel Diki-Kidiri: Cyberspace and Mother Tongue Education

PART 2: DIGITAL SPACES
Stéphane Bortzmeyer: Multilingualism and the Internet’s Standardisation
Mikami Yoshiki & Shigeaki Kodama: Measuring Linguistic Diversity on the Web
Joseph Mariani: How Language Technologies Support Multilingualism
Vassili Rivron: The Use of Facebook by the Eton of Cameroon
Pann Yu Mon & Madhukara Phatak: Search Engines and Asian Languages
Hervé Le Crosnier: Digital Libraries
Dwayne Bailey: Software Localization: Open Source as a Major Tool for Digital Multilingualism
Mélanie Dulong De Rosnay: Translation and Localization of Creative Commons Licenses

PART 3: DIGITAL MULTILINGUALISM: BUILDING INCLUSIVE SOCIETIES
Viola Krebs & Vicent Climent-Ferrando: Languages, Cyberspace, Migrations
Annelies Braffort & Patrice Dalle: Accessibility in Cyberspace: Sign Languages
Tjeerd de Graaf: How Oral Archives Benefit Endangered Languages
Evgeny Kuzmin: Linguistic Policies to Counter Languages Marginalization
Tunde Adegbola: Multimedia and Signed, Written or Oral Languages
Adel El Zaim: Cyberactivism and Regional Languages in the 2011 Arab Spring
Adama Samassékou: Multilingualism, the Millennium Development Goals, and Cyberspace

PART 4: MULTILINGUALISM ON THE INTERNET: A MULTILATERAL ISSUE
Isabella Pierangeli Borletti: Describing the World: Multilingualism, the Internet, and Human Rights
Stéphane Bortzmeyer: Multilingualism and Internet Governance
Marcel Diki-Kidiri: Ethical Principles Required for an Equitable Language Presence in the Information Society
Stéphane Grumbach: The Internet in China
Michaël Oustinoff: The Economy of Languages
Daniel Prado & Daniel Pimienta: Public Policies for Languages in Cyberspace

CONCLUSION: THE FUTURE SPEAKS, READS AND WRITES IN ALL LANGUAGES
Adama Samassékou, President of Maaya

LAURA KRAFTOWITZ, born in 1982 in Pittsburgh, splits her time between Paris and the Middle East. She is the author of numerous articles about the Israeli-Palestinian conflict, and the forthcoming memoir The End of Abu Jameel Street, regarding her time as a human rights activist in the Gaza Strip.
She is currently completing her Master’s Degree in Political Science and History at the École des Hautes Études en Sciences Sociales in Paris, where she is researching the past century of Israeli and Palestinian binational ideation.

HERVÉ LE CROSNIER is a senior lecturer at the University of Caen Basse-Normandie, where he teaches Internet technologies and digital culture. He is currently working with ISCC, the Institute for Communication Sciences of the CNRS. His research focuses on the impact of the Internet on social and cultural organization, and extending knowledge in the public domain. He is one of the founders of C&F éditions.

JOHN ROSBOTTOM was Principal Lecturer in Computer Science at the University of Portsmouth, UK, until retirement in 2010. He now pursues interests and activities in several areas that were precluded during a busy working life, but continues to enjoy working on computing projects and using his language skills in the translation of technical books and documents.

LAURENT VANNINI, after working as a linguistic diversity activist for the Babels network, a journalist, and a telecommunications consultant, decided to travel the world and return to academe, where he is currently studying in Arts and Languages at EHESS. His work probes “the disappearance of the animal and oblivion”. In addition to his writing on ICTs, he penned the French translation of the book From Counterculture to Cyberculture: Stewart Brand, the Whole Earth Network, and the Rise of Digital Utopianism, by Fred Turner (C&F éditions, 2012).

CREDITS

Maaya Scientific Committee: Adama Samassékou, Daniel Prado, Daniel Pimienta, Marcel Diki-Kidiri, Louis Pouzin, Yoshiki Mikami, Evgeny Kuzmin
Editing and Coordination: Laurent Vannini, Hervé Le Crosnier
Translation: John Rosbottom, Laura Kraftowitz, Laurent Vannini
Graphic Design and Page Layout: Nicolas Taffin, Kathleen Ponsard
With the support of: UNESCO, Communication and Information Sector; La Francophonie; Union Latine; ANLoc, South Africa; IDRC / CRDI, Canada

FOREWORDS

UNESCO works to create the conditions for dialogue among civilisations, cultures and peoples, based upon respect for commonly shared values. It is through this dialogue that the world can achieve global visions of sustainable development encompassing observance of human rights, mutual respect and the alleviation of poverty, all of which are at the heart of UNESCO’s mission and activities. UNESCO’s mission is to contribute to the building of peace, the eradication of poverty, sustainable development and intercultural dialogue through education, the sciences, culture, communication and information. http://unesco.org

PRESERVING AND PROMOTING LINGUISTIC DIVERSITY
BY IRINA BOKOVA, General Director, UNESCO

Languages are essential components of individual and common human heritage. They are the first and foremost vehicle for expressing identity, communicating ideas, attaining educational, economic and political autonomy, and promoting peace and sustainable human development. Languages are important for sharing information and knowledge and for transmitting unique cultural wisdom, including across generations and nations. They form an intrinsic part of the identity of individuals and people, and they are of vital importance to manage the cultural diversity of our world. They open opportunities for dialogue, cooperation and mutual understanding.
In this perspective, a plural and diverse linguistic space can expand the conditions for such dialogue by allowing each and every individual to contribute freely in the languages of their choice.

At the same time, languages are a fragile resource which requires endorsement, revitalization and promotion. Today, a significant number of languages are at risk of disappearing. About 97 per cent of the world’s population speaks about 4 per cent of the world’s languages. Conversely, about 96 per cent of the world’s languages are spoken by only about 3 per cent of people around the world. This means that at least half of the more than 6,000 languages worldwide are losing their speakers. It is estimated that about 90 per cent of the languages may be replaced by dominant languages by the end of the twenty-first century.

As information and communication technologies (ICTs) have become central to all aspects of social, cultural, economic and political life, it is important to ensure that everyone has access to the multilingual internet and can contribute their own content to it. ICTs can be a powerful tool for the safeguarding and promotion of linguistic diversity. In principle, the internet is open to all languages of the world, but only when certain conditions are met, and when the necessary human and financial resources are in place. A multilingual internet is essential for nations, communities and individuals to access, share and use information and resources which are critical for sustainable development and for managing innovation and change.

UNESCO is strongly committed to promoting multilingualism on the internet. A plural linguistic cyberspace allows the wealth of diversity to be put in common. These goals guide the Organization in its work with the Internet Corporation for Assigned Names and Numbers (ICANN).
They are also debated in the various meetings of the Internet Governance Forum (IGF) as well as in the World Summit on the Information Society Forums. UNESCO’s partnership with the International Telecommunication Union (ITU) in the Broadband Commission for Digital Development has also placed emphasis on the need for rich, culturally and linguistically diverse local content and applications as a key target in world leaders’ commitment to broadband inclusion for all. The importance of cultural and linguistic diversity is also echoed in the Recommendation concerning the Promotion and Use of Multilingualism and Universal Access