A Multilingual Lexical Database Application with a Structured Interlingua

Total Page:16

File Type:pdf, Size:1020Kb

A Multilingual Lexical Database Application with a Structured Interlingua SIMuLLDA a Multilingual Lexical Database Application using a Structured Interlingua SIMuLLDA een toepassing van een meertalig lexicaal gegevensbestand met gebruikmaking van een gestructureerde tussentaal (met een samenvatting in het Nederlands) Proefschrift ter verkrijging van de graad van doctor aan de Universiteit Utrecht op het gezag van de Rector Magnificus, Prof. dr. W.H. Gispen, ingevolge het besluit van het College voor Promoties in het openbaar te verdedigen op vrijdag 7 juni 2002 des middags te 4:15 uur door Maarten Janssen geboren op 28 januari 1971 te Nijmegen Promotoren: Prof. dr. H.J. Verkuyl UiL-OTS, Universiteit Utrecht Prof. dr. A. Visser Faculteit Wijsbegeerte, Universiteit Utrecht Contents Preface vii 1 Multilingual Lexical Databases 1 1.1 Multilingual Lexical Databases . 1 1.2 Current Approaches and their Shortcomings . 2 1.2.1 Parallel Wordlists . 2 1.2.2 Hub-and-Spoke Model . 5 1.2.3 WordNet and EuroWordNet . 9 1.2.4 Acquilex et al. 15 1.2.5 Corpus Based Approaches . 19 1.3 Conclusion to Chapter 1 . 21 2 FCA and SIMuLLDA 23 2.1 Formal Concept Analysis . 23 2.1.1 Partial Ordering . 27 2.1.2 Hasse Diagrams . 30 2.2 Connotative Context . 31 2.3 The SIMuLLDA System . 35 2.3.1 Multilinguality . 38 2.3.2 Lexical Gap Filling . 43 2.4 Formal Properties of FCA . 45 2.4.1 FCA and Lattices . 45 2.4.2 Smallest Common Concept . 46 2.4.3 Maximal Filled Sub-Tables . 46 2.4.4 Distributive and Atomic Lattices . 47 2.4.5 Extending Contexts . 48 2.4.6 Models and the Number of Concepts . 49 2.4.7 Partial Ordering on Attributes . 51 2.5 JaLaBA: an Online FCA Tool . 52 2.5.1 Construing Formal Concepts . 53 2.5.2 Drawing Lattices . 56 2.6 Conclusion to Chapter 2 . 58 IV Contents 3 SIMuLLDA Elements 61 3.1 Words and Word-Forms . 61 3.1.1 Word-Form and Lexeme . 62 3.1.2 Morphemes . 68 3.1.3 History, Etymology . 70 3.2 The Language . 71 3.2.1 Language, Dialect and Idiolect . 71 3.2.2 New, Regional, and Infrequent Words . 75 3.3 The Interlingual Meanings . 77 3.3.1 Homonymy, Polysemy and Metonymy . 78 3.3.2 Word-meaning and Denotation . 84 3.3.3 Interlingua, Incommensurability, and Cultural Differ- ences . 88 3.3.4 Word meaning and the Colour of the Word . 94 3.4 The Definitional Attributes . 95 3.4.1 Semes,` Semantic Primitives, and Interpretative Seman- tics . 95 3.4.2 Lexicalisation of Definitional Attributes . 101 3.4.3 Adjusted Attributes . 102 3.4.4 The Value of Dictionary Definitions . 104 3.5 Conclusion to Chapter 3 . 107 4 Field Testing: Some Actual Dictionary Data 111 4.1 Horses and the Like . 111 4.2 Bodies of Water . 115 4.2.1 Hierarchy Problems . 118 4.2.2 Interlingual Alignments . 128 4.2.3 Generating Bilingual Definitions . 142 4.3 Ships and Sails . 143 4.4 Conclusion to Chapter 4 . 150 5 Extending the System 151 5.1 Derivations, Inflections, and Lexical Functions . 151 5.1.1 Meaning ⇔ Text Theory . 153 5.1.2 Lexemes as (Lexical) Functions . 156 5.1.3 Derivation and Inflection in SIMuLLDA . 158 5.2 Labels, Examples, Collocations . 162 5.2.1 Collocations . 162 5.2.2 Corpus Examples and Illustrative Sentences . 165 5.2.3 Register Labels and Group Labels . 167 5.3 Abstract Nouns, Verbs, Adjectives . 170 5.4 Lexical Database, Size, and Word Frequency . 173 5.5 Conclusion to Chapter 5 . 176 Contents V 6 Conclusion and Afterthoughts 177 6.1 Program for Further Research . 181 APPENDICES 183 A Lexical Definitions for Water 183 A.1 ‘Water’ in WordNet 1.6 . 183 A.2 English Definitions . 185 A.3 Dutch Definitions . 187 A.4 Italian Definitions . 190 A.5 German Definitions . 192 A.6 French Definitions . 194 A.7 Russian Definitions . 196 B Abbreviations and Notations 199 B.1 Dictionaries Referred to in this Thesis . 199 B.2 IPA Pronuncation Rules . 201 B.3 Lexical Functions . 202 B.4 Notational Conventions used in this Thesis . 204 Nederlandse Samenvatting 205 Curriculum Vitae 211 Acknowledgements 213 References 215 Index 222 Preface It is commonly accepted that there are about five to six thousand languages. For many pairs of languages hX, Y i, there is no dictionary X → Y or Y → X, there are only dictionaries for the pairs X → English/French/Spanish and dictionaries English/French/Spanish → Y (as well as the other way around). There is a clear need for dictionaries translating between lan- guages without the intervention of a small number of Western European languages with a colonial past. Also from a theoretical point of view, such a need can be defended. The creation of a dictionary of good quality takes a lot of time, and given the fact that 5000-6000 languages yield 25-30 million pairs of languages, it is important to have a database that provides the possibility to translate directly between pairs of languages, even though just a small subset of the millions of pairs will be spoken by sufficiently many speakers to make the enterprise probable. However, sufficiently many will remain to justify a theoretical attempt to think about a database. This thesis highlights some problems that play a role in the creation of such a database, attempts to solve some of them, and tries to show that some other problems cannot be solved. A well-known problem is that words are often hard to match across languages: different words from different languages do not have the same range of meanings, not all words from one language have an equivalent in the other, etc. In this thesis, a sketch will be given of a database in which most of these problems are solved. Crucial in this set-up is the structure of the interlingua, which provides the possibility to relate non-corresponding meanings in a structural way. With the set-up proposed in this thesis, which is called SIMuLLDA (short for Structured Interlingua MultiLingual Lexical Database Application) it is possible to generate a descriptive translation for words in the source lan- guage that lack a direct translation in the target language. This should ease the work of a lexicographer making a dictionary for a new pair of lan- guages. In the first chapter, a global explanation is given of why a set-up of a multilingual lexical database that can generate translations between ar- bitrary pairs of languages is an important asset to have. Furthermore, it VIII Preface is explained what requirements a lexical database has to fulfil in order to meet this demand. This is done by examining a number of existing the- ories, and by showing why they are (currently) not capable of generating translations. The second chapter explains the layout of the alternative lexical database set-up that is proposed in this thesis. As said above the crucial compo- nent of the SIMuLLDA set-up is the structured interlingua. The structure of the interlingua is provided by a logical formalism called Formal Concept Analysis. Therefore, the chapter gives an introduction of the FCA frame- work, after wich an explanation is given of how the FCA framework is applied to dictionaries in the SIMuLLDA set-up. It is also shown that with this set-up, it is possible to generate definitions for words without a ‘trans- lational synonym’. Since FCA is a logical framework, some of the logical properties are discussed, especially those relevant for the functioning of the SIMuLLDA system. Additionally, an on-line application called JaLaBA is introduced. This application can be used to generate the Hasse diagrams that represent the structure yielded by FCA automatically, displaying them 3-dimensionally. In the third chapter, the basic component of the SIMuLLDA set-up are investigated in depth: words, languages, interlingual meanings, and def- initional attributes. There are two motivations for this thorough investi- gation. The first is simply that a good theory should explain the intended interpretation of its basic elements, and especially so in the case of an ab- stract framework like FCA, which does not in itself have any interpretation at all. So if meanings are assigned in the system to words, it is important to define precisely what counts as a word. The second motivation is that in the SIMuLLDA set-up, meanings are interlingual and defined by means of what might be viewed as ‘semantic primitives’. Without restrictions on the interpretation of the primitives (the definitional attributes), the system would make incorrect claims. In the fourth chapter, the system is empirically tested. As said before, the SIMuLLDA system is designed to provide a practical aid for lexicogra- phers. So its practical applicability is of vital importance. The empirical test in chapter four is only a relatively small scale test: to really test the system, it should be applied on a larger scale. But the smaller test in this thesis does show the applicability of the system to an arbitrary semantic domain, and raises a number of important issues. In the fifth chapter, some extensions to the system are discussed. As discussed in the first four chapters, the system mainly applies to concrete nouns, and then also to the semantic definitions of these concrete nouns. To be a fully workable system, SIMuLLDA also needs to model the other com- ponents of dictionaries. So, chapter five discusses the treatment of labels, collocations, inflections, derivations, and illustrative examples, also taking into consideration the applicability of the system to other word classes like Preface IX abstract nouns, verbs, and adjectives. Finally, some aspects of an applica- tion for the system that should be developed in order to use SIMuLLDA are discussed.
Recommended publications
  • 1 Symbols (2286)
    1 Symbols (2286) USV Symbol Macro(s) Description 0009 \textHT <control> 000A \textLF <control> 000D \textCR <control> 0022 ” \textquotedbl QUOTATION MARK 0023 # \texthash NUMBER SIGN \textnumbersign 0024 $ \textdollar DOLLAR SIGN 0025 % \textpercent PERCENT SIGN 0026 & \textampersand AMPERSAND 0027 ’ \textquotesingle APOSTROPHE 0028 ( \textparenleft LEFT PARENTHESIS 0029 ) \textparenright RIGHT PARENTHESIS 002A * \textasteriskcentered ASTERISK 002B + \textMVPlus PLUS SIGN 002C , \textMVComma COMMA 002D - \textMVMinus HYPHEN-MINUS 002E . \textMVPeriod FULL STOP 002F / \textMVDivision SOLIDUS 0030 0 \textMVZero DIGIT ZERO 0031 1 \textMVOne DIGIT ONE 0032 2 \textMVTwo DIGIT TWO 0033 3 \textMVThree DIGIT THREE 0034 4 \textMVFour DIGIT FOUR 0035 5 \textMVFive DIGIT FIVE 0036 6 \textMVSix DIGIT SIX 0037 7 \textMVSeven DIGIT SEVEN 0038 8 \textMVEight DIGIT EIGHT 0039 9 \textMVNine DIGIT NINE 003C < \textless LESS-THAN SIGN 003D = \textequals EQUALS SIGN 003E > \textgreater GREATER-THAN SIGN 0040 @ \textMVAt COMMERCIAL AT 005C \ \textbackslash REVERSE SOLIDUS 005E ^ \textasciicircum CIRCUMFLEX ACCENT 005F _ \textunderscore LOW LINE 0060 ‘ \textasciigrave GRAVE ACCENT 0067 g \textg LATIN SMALL LETTER G 007B { \textbraceleft LEFT CURLY BRACKET 007C | \textbar VERTICAL LINE 007D } \textbraceright RIGHT CURLY BRACKET 007E ~ \textasciitilde TILDE 00A0 \nobreakspace NO-BREAK SPACE 00A1 ¡ \textexclamdown INVERTED EXCLAMATION MARK 00A2 ¢ \textcent CENT SIGN 00A3 £ \textsterling POUND SIGN 00A4 ¤ \textcurrency CURRENCY SIGN 00A5 ¥ \textyen YEN SIGN 00A6
    [Show full text]
  • FUPA) in the UCS Source: Michael Everson and Klaas Ruppel Version: 2.0 Date: 1998-11-02
    Title: Encoding Finno-Ugric Phonetic Alphabet (FUPA) in the UCS Source: Michael Everson and Klaas Ruppel Version: 2.0 Date: 1998-11-02 The Finno-Ugric Phonetic Alphabet (FUPA) 0. Introduction This paper presents a collection of characters, diacritics, and notation marks used in the FUPA scheme. This transcription is and has been used creatively by different scientists in different countries at different times (Lagercrantz probably made the most baroque use of the system); therefore the phonetic or phonematic values of the elements of the FUPA are intentionally left aside here. However, all the users of this scheme have one thing in common: they use the FUPA in a technical way, as described below. We prefer the term ÒFUPAÓ to ÒFUTÓ (Finno- Ugric Transcription) because of its analogy to the IPA (International Phonetic Alphabet). It is not the intention of this paper to encourage exact (or over-exact) phonetic transcription or the extensive use of diacritics. The intention is rather to facilitate the use of the FUPA by the Uralicist community in the context of the Universal Character Set (UCS), by ensuring that a comprehensive analysis of the system leads to the possibility of encoding FUPA texts, past and present, with the UCS. In the exploratory versions of this document, firm advice on how to use FUPA will not be given; when the study is complete, however, advice on which characters in the UCS refer to whch letters used in the FUPA system will be given. An example of this, we believe, will be the advice for the LATIN SMALL LETTER ENG to be used in coded representations of FUPA texts, past and present, for the velar nasal, in preference to the GREEK SMALL LETTER ETA, although a great many printed texts use an eta glyph (Ç) and not an eng glyph (Ë).
    [Show full text]
  • The Brill Typeface User Guide & Complete List of Characters
    The Brill Typeface User Guide & Complete List of Characters Version 2.06, October 31, 2014 Pim Rietbroek Preamble Few typefaces – if any – allow the user to access every Latin character, every IPA character, every diacritic, and to have these combine in a typographically satisfactory manner, in a range of styles (roman, italic, and more); even fewer add full support for Greek, both modern and ancient, with specialised characters that papyrologists and epigraphers need; not to mention coverage of the Slavic languages in the Cyrillic range. The Brill typeface aims to do just that, and to be a tool for all scholars in the humanities; for Brill’s authors and editors; for Brill’s staff and service providers; and finally, for anyone in need of this tool, as long as it is not used for any commercial gain.* There are several fonts in different styles, each of which has the same set of characters as all the others. The Unicode Standard is rigorously adhered to: there is no dependence on the Private Use Area (PUA), as it happens frequently in other fonts with regard to characters carrying rare diacritics or combinations of diacritics. Instead, all alphabetic characters can carry any diacritic or combination of diacritics, even stacked, with automatic correct positioning. This is made possible by the inclusion of all of Unicode’s combining characters and by the application of extensive OpenType Glyph Positioning programming. Credits The Brill fonts are an original design by John Hudson of Tiro Typeworks. Alice Savoie contributed to Brill bold and bold italic. The black-letter (‘Fraktur’) range of characters was made by Karsten Lücke.
    [Show full text]
  • Fraser's Lisu Orthography [L2/07-294 = WG2/N3326]
    Fraser’s Lisu orthography WG2/N3326 L2/07-294 ISO/IEC JTC1/SC2/WG2 Universal Multiple-Octet Coded Character Set Title: Fraser’s Lisu orthography: Research notes toward a Unicode encoding Doc Type: Working Group Document Author: Richard Cook Status: Expert Contribution Action: For consideration by UTC and JTC1/SC2/WG2 Date: 2007-09-15 This document examines questions relating to the encoding of elements of Fraser’s (unicameral) Lisu writing system, weighing the benefits of two possible encoding methods, one involving unification with (bicameral) Latin Script, and the other involving disunification. After examining the repertory and considering it in relation to encoded (Unicode 5.0, and Charts Amendment 3 WG2/N3263 [2007-04-27]) characters and case mappings, unification with Latin Script is proposed as the preferred encoding method. Unification of Fraser with Latin Script requires that only 10 characters (out of ~48) be proposed for future encoding. Suitable BMP code points are available in the Latin Extended-D block: A78E LATIN CAPITAL LETTER TURNED B A78F LATIN CAPITAL LETTER TURNED D A790 LATIN CAPITAL LETTER TURNED G A791 LATIN CAPITAL LETTER TURNED J A792 LATIN CAPITAL LETTER TURNED K A794 LATIN CAPITAL LETTER TURNED P A795 LATIN CAPITAL LETTER TURNED R A796 LATIN CAPITAL LETTER TURNED T A797 LATIN CAPITAL LETTER TURNED U A798 LATIN SMALL LETTER TURNED J Feedback from the technical committees will guide a future encoding proposal. Fraser’s Lisu orthography [L2/07-294 = WG2/N3326] [URL ; L2/07-294 = WG2/N3326] Fraser’s Lisu orthography Research notes toward a Unicode encoding by Richard Cook STEDT, SEI, Unicode Contents ● 0.0: Introduction ● 1.0: Repertory ● 2.0: Fraser and the UCS ❍ 2.1: Encoded v.
    [Show full text]
  • IPA Braille: an Updated Tactile Representation of the International Phonetic Alphabet
    IPA Braille: An Updated Tactile Representation of the International Phonetic Alphabet Print Edition Overview, Tables, and Sample Texts Edited by Robert Englebretson, Ph.D. Produced by CNIB For the International Council on English Braille 2008 List of ICEB Member Countries As Of April, 2008 Australia Canada New Zealand Nigeria South Africa United Kingdom United States ii Table of Contents ACKNOWLEDGEMENTS BY ROBERT ENGLEBRETSON.............................................. V FOREWORD by Fredric K. Schroeder .................................................................. VII ACCLAIM FOR IPA BRAILLE BY MARTHA PAMPERIN.............................................IX 1. INTRODUCTION ............................................................................................. 1 2. THE SYMBOLS OF THE IPA ............................................................................. 5 2.1. CONSONANTS (PULMONIC) .................................................................. 6 2.2. CONSONANTS (NON-PULMONIC) ....................................................... 11 2.3. OTHER SYMBOLS ................................................................................ 13 2.4. VOWELS ............................................................................................. 15 2.5. DIACRITICS ....................................................................................... 17 2.6. SUPRASEGMENTALS ........................................................................... 23 2.7. TONES & WORD ACCENTS..................................................................
    [Show full text]
  • Latin Extended-B Range: 0180–024F
    Latin Extended-B Range: 0180–024F This file contains an excerpt from the character code tables and list of character names for The Unicode Standard, Version 14.0 This file may be changed at any time without notice to reflect errata or other updates to the Unicode Standard. See https://www.unicode.org/errata/ for an up-to-date list of errata. See https://www.unicode.org/charts/ for access to a complete list of the latest character code charts. See https://www.unicode.org/charts/PDF/Unicode-14.0/ for charts showing only the characters added in Unicode 14.0. See https://www.unicode.org/Public/14.0.0/charts/ for a complete archived file of character code charts for Unicode 14.0. Disclaimer These charts are provided as the online reference to the character contents of the Unicode Standard, Version 14.0 but do not provide all the information needed to fully support individual scripts using the Unicode Standard. For a complete understanding of the use of the characters contained in this file, please consult the appropriate sections of The Unicode Standard, Version 14.0, online at https://www.unicode.org/versions/Unicode14.0.0/, as well as Unicode Standard Annexes #9, #11, #14, #15, #24, #29, #31, #34, #38, #41, #42, #44, #45, and #50, the other Unicode Technical Reports and Standards, and the Unicode Character Database, which are available online. See https://www.unicode.org/ucd/ and https://www.unicode.org/reports/ A thorough understanding of the information contained in these additional sources is required for a successful implementation.
    [Show full text]
  • Reference Chart for IPA Typography
    Hagiwara – Reference Chart for IPA Typography Reference Chart for IPA Typography Robert Hagiwara University of Manitoba Version 1.1 October 2002 Overview This table is intended to provide quick and concise reference for people who want to type or refer to IPA characters in either the Summer Institute of Linguistics (SIL) IPA Encore fonts or using Unicode. I refer to this table often in general word processing and in web-based documents associated with my monthly mystery spectrograms (http://www.umanitoba.ca/linguistics/robh/), and believe others might find it useful as well. The SIL fonts are available as freeware from SIL (Summer Institute of Linguistics, 2002). They are True Type fonts and, once installed, can be used like any standard font in computer word processing and similar applications. For word processing SIL encourages the use of a standard keyboard, but provides a table of codes which allow typing in an individual character by using the Alt key (or Option key on Macs). The typist holds down the Alt key and enters (using the number pad) a four-digit code. The Unicode standard is an international character-coding standard that is useful for Generalized Markup Language kinds of applications, such as HyperText Markup Language (HTML) for web pages. Each IPA character (indeed, just about every alphabetic character in just about every language) has been assigned a unique code, which can be referred to as a symbol in an HTML document. The codes may be entered as either a decimal or a hexadecimal number. In HTML, symbol codes start with an ampersand identifier “&” followed by a number sign “#”, and end with a semi-colon “;”.
    [Show full text]
  • View Extract
    Contents Preface .................................................................................................................. vi Getting started........................................................................................................ 1 I Phonemic symbols ................................................. 6 II Vowels ............................................................ 13 2A. Vowels /iː/ and /ɪ/ .......................................................................................... 14 2B. Vowels /e/ and /æ/......................................................................................... 17 2C. Vowels /ʌ/ and /æ/ ........................................................................................ 20 2D. Vowels /ɜː/ and /ʌ/ ........................................................................................ 22 2E. Vowels /ɑː/ and /æ/ ....................................................................................... 25 2F. Vowels /ɒ/ and /ɔː/ ....................................................................................... 28 2G. Vowels /ʊ/ and /uː/ ........................................................................................ 31 2H. Diphthongs .................................................................................................... 36 III Consonants ...................................................... 42 3A. Consonant /r/ and rhoticity ........................................................................... 42 3B. Consonants eth /ð/ and theta /θ/ ..................................................................
    [Show full text]
  • The Unicode Cookbook for Linguists
    The Unicode Cookbook for Linguists Managing writing systems using orthography profiles Steven Moran Michael Cysouw language Translation and Multilingual Natural science press Language Processing 10 Translation and Multilingual Natural Language Processing Editors: Oliver Czulo (Universität Leipzig), Silvia Hansen-Schirra (Johannes Gutenberg-Universität Mainz), Reinhard Rapp (Johannes Gutenberg-Universität Mainz) In this series: 1. Fantinuoli, Claudio & Federico Zanettin (eds.). New directions in corpus-based translation studies. 2. Hansen-Schirra, Silvia & Sambor Grucza (eds.). Eyetracking and Applied Linguistics. 3. Neumann, Stella, Oliver Čulo & Silvia Hansen-Schirra (eds.). Annotation, exploitation and evaluation of parallel corpora: TC3 I. 4. Czulo, Oliver & Silvia Hansen-Schirra (eds.). Crossroads between Contrastive Linguistics, Translation Studies and Machine Translation: TC3 II. 5. Rehm, Georg, Felix Sasaki, Daniel Stein & Andreas Witt (eds.). Language technologies for a multilingual Europe: TC3 III. 6. Menzel, Katrin, Ekaterina Lapshinova-Koltunski & Kerstin Anna Kunz (eds.). New perspectives on cohesion and coherence: Implications for translation. 7. Hansen-Schirra, Silvia, Oliver Czulo & Sascha Hofmann (eds). Empirical modelling of translation and interpreting. 8. Svoboda, Tomáš, Łucja Biel & Krzysztof Łoboda (eds.). Quality aspects in institutional translation. 9. Fox, Wendy. Can integrated titles improve the viewing experience? Investigating the impact of subtitling on the reception and enjoyment of film using eye tracking and questionnaire data. 10. Moran, Steven & Michael Cysouw. The Unicode cookbook for linguists: Managing writing systems using orthography profiles ISSN: 2364-8899 The Unicode Cookbook for Linguists Managing writing systems using orthography profiles Steven Moran Michael Cysouw language science press Steven Moran & Michael Cysouw. 2018. The Unicode Cookbook for Linguists: Managing writing systems using orthography profiles (Translation and Multilingual Natural Language Processing 10).
    [Show full text]
  • Proposal to Encode Additional Dialectology Latin Characters
    Proposal to encode additional dialectology Latin characters Denis Moyogo Jacquerye <[email protected]> 2014-07-28 1. Background The recent addition of the German dialectology symbols to Unicode raises the question of other dialectology transcription systems being encoded. Romance dialectology transcription systems were created in the late 19th century or early 20th, in the same period as German dialectology systems or the International Phonetic Alphabet (IPA). Besides the Böhmer-Ascoli transcription from which the Teuthonista transcription is derived (both used in German dialectology), another major transcription systems is the French Rousselot-Gilliéron system, and other systems have borrowed from these such as the Spanish Revista de Filología española system or the Romanian dialectology system. These transcription systems have been used, adapted or extended in a relatively large quantity of dialectology works (French, Spanish and Portuguese, Italian, and Romanian), some of these are being digitized and would require standardized characters. More recent transcription systems were developped in the second half of the 20th century, based on these earlier transcription systems and the IPA, to accomodate larger second generation dialectology works (European: ALE, Mediteranean: ALM, Romance languages: ALiR, and others), some of these require additional characters to be encoded. The characters being proposed are used mostly in Romance languages dialectology works but not exclusively, they are also used in transcription of other linguistic families present in the areas covered by these works. The characters needed for several other dialectology transcription systems, such as the Swedish, Danish and Norwegian systems would require further work and are out of the scope of this proposal.
    [Show full text]
  • Phonetic Extensions Supplement Range: 1D80–1DBF
    Phonetic Extensions Supplement Range: 1D80–1DBF This file contains an excerpt from the character code tables and list of character names for The Unicode Standard, Version 14.0 This file may be changed at any time without notice to reflect errata or other updates to the Unicode Standard. See https://www.unicode.org/errata/ for an up-to-date list of errata. See https://www.unicode.org/charts/ for access to a complete list of the latest character code charts. See https://www.unicode.org/charts/PDF/Unicode-14.0/ for charts showing only the characters added in Unicode 14.0. See https://www.unicode.org/Public/14.0.0/charts/ for a complete archived file of character code charts for Unicode 14.0. Disclaimer These charts are provided as the online reference to the character contents of the Unicode Standard, Version 14.0 but do not provide all the information needed to fully support individual scripts using the Unicode Standard. For a complete understanding of the use of the characters contained in this file, please consult the appropriate sections of The Unicode Standard, Version 14.0, online at https://www.unicode.org/versions/Unicode14.0.0/, as well as Unicode Standard Annexes #9, #11, #14, #15, #24, #29, #31, #34, #38, #41, #42, #44, #45, and #50, the other Unicode Technical Reports and Standards, and the Unicode Character Database, which are available online. See https://www.unicode.org/ucd/ and https://www.unicode.org/reports/ A thorough understanding of the information contained in these additional sources is required for a successful implementation.
    [Show full text]
  • Considerations in the Use of the Latin Script in Variant Internationalized Top-Level Domains
    Considerations in the use of the Latin script in variant internationalized top-level domains Final report of the ICANN VIP Study Group for the Latin script Executive summary The study group examined all the characters in the Unicode Character Code Chart version 6.1.0 that are associated with the Latin script and valid under the IDNA2008 protocol. It identified several forms of “confusability” that might require careful consideration in the collation of a subset of the broader repertoire for local use. The resolution of such issues is, however, highly dependent on local orthographic conventions. These frequently treat the same characters in different manners. Strings that are confusingly similar in the context of one language may have no such connotations in another. Noting that the Latin script is used by a larger number of separate language communities than is any other single script, attempting to provide a comprehensive overview of the needs of all of them is an unrealistic endeavor. A summary attempt at doing so nonetheless would be culturally insensitive to communities that have yet to join the IDN discussion. The study group therefore finds no basis for the categorical treatment of any code point assigned to an element of the Latin script as being equivalent to any other such code point. Nor does it believe that any such basis exists beyond what is already incorporated in the IDNA2008 protocol. The ICANN TLD application process should not permit requests for multiple Latin strings under the premise that they are variants of each other. Careful scrutiny is required when evaluating proposed TLD labels for confusability but that does not make them variants in the focused sense of the VIP study.
    [Show full text]