Report from ISO TC46 - Annual Meetings
Total Page:16
File Type:pdf, Size:1020Kb
Recommended publications
Computational Development of Lesser Resourced Languages
Computational Development of Lesser Resourced Languages. Martin Hosken, WSTech, SIL International. © 2019, SIL International
Modern Technical Capability: • Grammar checking • Wikipedia • OCR • Localisation • Text to speech • Speech to text • Machine Translation
Digital Language Vitality (Simons and Thomas, 2019): • 0.2% doing well − 43% world population • 78% score nothing! − ~10% population
Climbing from the Bottom: • Language Tag • Linebreaking • Unicode encoding • Locale Information • Font − Character Lists − Sort order • Keyboard − physical • Content − phone
Language Tag: • Unique orthography • Structure: − lng-Scrp-RE-variants − ahk = ahk-Latn-MM − https://ldml.api.sil.org/langtags.json • lng – ISO 639 identifier • Scrp – ISO 15924 • RE – ISO 3166-1 • BCP 47
Language Tags: • Variants − dialect/language − orthography/script − registration/private use • Policy Issues − ISO 639 is linguistic − Language tags are sociolinguistic
Unicode Encoding: • Engineering detail • Almost all scripts in • Find a char − Sequences are good • Implies an orthography • Policy Issues − Use Unicode − Publish Orthography Descriptions
Fonts: • Lots of fonts! • SIL Fonts − Full script coverage • Problems − adding fonts to phones − Noto styling • Policy Issues − Ensure industry support − Encourage free fonts
Keyboards: • Keyman − All platforms − Predictive text − Open Source − IDE • Wider industry − More capable standard − More industry interest
Keyboards: • Policy Issues − Agreed layout (per language; physical & mobile)
Linebreaking: • Unsolved problem • Word frequencies − Integration − open access − Description − same as for predictive text • Resources
Locale Information: • A deep well! • Key terms • Unicode CLDR • Sorting − Industry base data • Dates, Times, etc. -
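The tag structure named in the slides (lng-Scrp-RE-variants, with ahk-Latn-MM as the example) can be illustrated with a short sketch. The Python below splits a tag into language, script, region and variant subtags using the subtag shapes from BCP 47. The function name `split_langtag` is hypothetical, and the pattern is a simplified assumption that ignores extlang, extension, private-use and grandfathered tags. It does not expand a short tag such as ahk into its full form; that mapping would have to come from a data source such as the langtags.json file cited in the slides.

```python
import re

# Simplified subset of BCP 47: language[-Script][-REGION][-variant...]
# Assumption: no extlang, extension, private-use or grandfathered tags.
_TAG = re.compile(
    r"^(?P<lang>[A-Za-z]{2,3})"                      # ISO 639 language code
    r"(?:-(?P<script>[A-Za-z]{4}))?"                 # ISO 15924 script code
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"        # ISO 3166-1 or UN M.49 region
    r"(?P<variants>(?:-(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*)$"
)


def split_langtag(tag):
    """Split a simple language tag into its subtags (hypothetical helper)."""
    match = _TAG.match(tag)
    if match is None:
        raise ValueError(f"not a simple language tag: {tag!r}")
    script = match.group("script")
    region = match.group("region")
    # The variants group looks like "-1996-rozaj"; drop the empty first piece.
    variants = [v.lower() for v in match.group("variants").split("-") if v]
    return {
        "language": match.group("lang").lower(),
        "script": script.title() if script else None,   # conventional casing: Latn
        "region": region.upper() if region else None,   # conventional casing: MM
        "variants": variants,
    }


if __name__ == "__main__":
    print(split_langtag("ahk-Latn-MM"))
    print(split_langtag("ahk"))
```

Casing is normalised only for readability; language tags are compared case-insensitively, so `AHK-latn-mm` yields the same result.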
Technical Reference Manual for the Standardization of Geographical Names United Nations Group of Experts on Geographical Names
ST/ESA/STAT/SER.M/87 Department of Economic and Social Affairs Statistics Division Technical reference manual for the standardization of geographical names United Nations Group of Experts on Geographical Names United Nations New York, 2007 The Department of Economic and Social Affairs of the United Nations Secretariat is a vital interface between global policies in the economic, social and environmental spheres and national action. The Department works in three main interlinked areas: (i) it compiles, generates and analyses a wide range of economic, social and environmental data and information on which Member States of the United Nations draw to review common problems and to take stock of policy options; (ii) it facilitates the negotiations of Member States in many intergovernmental bodies on joint courses of action to address ongoing or emerging global challenges; and (iii) it advises interested Governments on the ways and means of translating policy frameworks developed in United Nations conferences and summits into programmes at the country level and, through technical assistance, helps build national capacities. NOTE The designations employed and the presentation of material in the present publication do not imply the expression of any opinion whatsoever on the part of the Secretariat of the United Nations concerning the legal status of any country, territory, city or area or of its authorities, or concerning the delimitation of its frontiers or boundaries. The term “country” as used in the text of this publication also refers, as appropriate, to territories or areas. Symbols of United Nations documents are composed of capital letters combined with figures. ST/ESA/STAT/SER.M/87 UNITED NATIONS PUBLICATION Sales No. -
Multilingual Content Management and Standards with a View on AI Developments Laurent Romary
Multilingual content management and standards with a view on AI developments. Laurent Romary.
To cite this version: Laurent Romary. Multilingual content management and standards with a view on AI developments. AI4EI - Conference Artificial Intelligence for European Integration, Oct 2020, Turin / Virtual, Italy. hal-02961857
HAL Id: hal-02961857, https://hal.inria.fr/hal-02961857. Submitted on 8 Oct 2020.
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. Distributed under a Creative Commons Attribution 4.0 International License.
Laurent Romary, Research Director, Inria, team ALMAnaCH; ISO TC 37, chair
Language and AI: • Central role of language in the revival of AI (machine-learning based models) • Applications: document management and understanding, chatbots, machine translation • Information sources: public (web, cultural heritage repositories) and private (Siri, Amazon Alexa) linguistic information • European context: cf. Europe's Languages in the Digital Age, META-NET White Paper Series
Variety of linguistic forms: • Spoken, written, chats and forums • Multilingualism, accents, dialects, technical domains, registers, language learners • General notion of language variety • Classifying and referencing the relevant features • Role of standards and standards developing organizations (SDOs)
A concrete example for a start: Large scale corpus; Language model; BERT. Devlin, J., Chang, M. -
Sc22/Wg20 N860
Final Draft for CEN CWA: European Culturally Specific ICT Requirements, 2000-10-31. SC22/WG20 N860. Draft CWA/ESR:2000. Cover page to be supplied.
Table of Contents
Foreword
Introduction
1 Scope
2 References
3 Definitions and abbreviations
4 General
5 Elements for the checklist
5.1 Sub-areas
5.2 Characters
5.3 Use of special characters
5.4 Numbers, monetary amounts, letter written figures
5.5 Date and time
5.6 Telephone numbers and addresses, bank account numbers and personal identification
5.7 Units of measures
5.8 Mathematical symbols
5.9 Icons and symbols, meaning of colours
5.10 Man-machine interface and Culture related political and legal requirements
Annex A (normative)
FOREWORD
The production of this document, which describes European culturally specific requirements on information and communications technologies, was agreed by the CEN/ISSS Workshop European Culturally Specific ICT Requirements (WS-ESR) in the Workshop’s Kick-Off meeting on 1998-11-23. The document has been developed through the collaboration of a number of contributing partners in WS-ESR. WS-ESR representation gathers a wide mix of interests, coming from academia, public administrations, IT-suppliers, and other interested experts. The present CWA (CEN Workshop Agreement) has received the support of representatives of each of these sectors. -
Tags for Identifying Languages
Network Working Group. A. Phillips, Ed., webMethods, Inc.; M. Davis, IBM. Internet-Draft, April 8, 2004. Expires: October 7, 2004.
Tags for Identifying Languages, draft-phillips-langtags-02
Status of this Memo
This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt. The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. This Internet-Draft will expire on October 7, 2004.
Copyright Notice
Copyright (C) The Internet Society (2004). All Rights Reserved.
Abstract
This document describes a language tag for use in cases where it is desired to indicate the language used in an information object, how to register values for use in this language tag, and a construct for matching such language tags, including user defined extensions for private interchange.
Table of Contents
1. Introduction
2. The Language Tag
2.1 Syntax
2.2 Language Tag Sources
2.2.1 Pre-Existing RFC3066 Registrations -
International Standards Related to Librarianship
International standards related to librarianship
1. Holdings records
ISO 20775:2009 Information and documentation. Schema for holdings information
2. Bibliographic processing and data exchange, transliteration
ISO 10754:1996 Information and documentation. Extension of the Cyrillic alphabet coded character set for non-Slavic languages for bibliographic information interchange
ISO 11940:1998 Information and documentation. Transliteration of Thai
ISO 11940-2:2007 Information and documentation. Transliteration of Thai characters into Latin characters. Part 2: Simplified transcription of Thai language
ISO 15919:2001 Information and documentation. Transliteration of Devanagari and related Indic scripts into Latin characters
ISO 15924:2004 Information and documentation. Codes for the representation of names of scripts
ISO 21127:2014 Information and documentation. A reference ontology for the interchange of cultural heritage information
ISO 233:1984 Documentation. Transliteration of Arabic characters into Latin characters
ISO 233-2:1993 Information and documentation. Transliteration of Arabic characters into Latin characters. Part 2: Arabic language. Simplified transliteration
ISO 233-3:1999 Information and documentation. Transliteration of Arabic characters into Latin characters. Part 3: Persian language. Simplified transliteration
ISO 25577:2013 Information and documentation. MarcXchange
ISO 259:1984 Documentation. Transliteration of Hebrew characters into Latin characters
ISO 259-2:1994 Information and documentation. Transliteration of Hebrew characters into Latin characters. Part 2: Simplified transliteration
ISO 3602:1989 Documentation. Romanization of Japanese (kana script)
ISO 5963:1985 Documentation. Methods for examining documents, determining their subjects, and selecting indexing terms
ISO 639-2:1998 Codes for the representation of names of languages. Part 2: Alpha-3 code
ISO 6630:1986 Documentation. Bibliographic control characters
ISO 7098:1991 Information and documentation. -
Addressing — Digital Interchange Models
© The Calendaring and Scheduling Consortium, Inc. 2019 – All rights reserved. CC/WD 19160-6:2019. CalConnect TC VCARD. Addressing - Digital interchange models. Working Draft Standard.
Warning for drafts
This document is not a CalConnect Standard. It is distributed for review and comment, and is subject to change without notice and may not be referred to as a Standard. Recipients of this draft are invited to submit, with their comments, notification of any relevant patent rights of which they are aware and to provide supporting documentation.
The Calendaring and Scheduling Consortium, Inc. 2019. CC/WD 19160-6:2019:2019
© 2019 The Calendaring and Scheduling Consortium, Inc. All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized otherwise in any form or by any means, electronic or mechanical, including photocopying, or posting on the internet or an intranet, without prior written permission. Permission can be requested from the address below.
The Calendaring and Scheduling Consortium, Inc., 4390 Chaffin Lane, McKinleyville, California 95519, United States of America. [email protected] www.calconnect.org
Contents: Foreword -
2021 No. 10, 2021-05-31
BULLETIN OF THE LITHUANIAN STANDARDS BOARD, 2015, No. 5. BULLETIN OF THE LITHUANIAN STANDARDS BOARD, 2021 No. 10, 2021-05-31
CONTENTS
UPDATED SCOPE OF LST TK 36 ENVIRONMENTAL PROTECTION
PUBLISHED LITHUANIAN STANDARDS AND STANDARDIZATION PUBLICATIONS
PUBLISHED CORRIGENDUM TO A NATIONAL STANDARD
PUBLISHED ADOPTED EUROPEAN STANDARDS AND STANDARDIZATION PUBLICATIONS
ADOPTED STANDARDS TRANSLATED INTO LITHUANIAN
WITHDRAWN LITHUANIAN STANDARDS AND STANDARDIZATION PUBLICATIONS
LITHUANIAN STANDARDS AND STANDARDIZATION PUBLICATIONS ANNOUNCED FOR WITHDRAWAL
NATIONAL STANDARDS UNDER REVIEW
EUROPEAN STANDARDS AND STANDARDIZATION PUBLICATIONS UNDER REVIEW -
ISO 11940-2:2007
INTERNATIONAL STANDARD ISO 11940-2, First edition 2007-05-01. Information and documentation - Transliteration of Thai characters into Latin characters - Part 2: Simplified transcription of Thai language. Reference number ISO 11940-2:2007(E). © ISO 2007
iTeh STANDARD PREVIEW (standards.iteh.ai): https://standards.iteh.ai/catalog/standards/sist/f107d633-0a12-4a12-9e97-db1b724a95ea/iso-11940-2-2007
PDF disclaimer
This PDF file may contain embedded typefaces. In accordance with Adobe's licensing policy, this file may be printed or viewed but shall not be edited unless the typefaces which are embedded are licensed to and installed on the computer performing the editing. In downloading this file, parties accept therein the responsibility of not infringing Adobe's licensing policy. The ISO Central Secretariat accepts no liability in this area. Adobe is a trademark of Adobe Systems Incorporated. Details of the software products used to create this PDF file can be found in the General Info relative to the file; the PDF-creation parameters were optimized for printing. Every care has been taken to ensure that the file is suitable for use by ISO member bodies. In the unlikely event that a problem relating to it is found, please inform the Central Secretariat at the address given below.
COPYRIGHT PROTECTED DOCUMENT © ISO 2007 All rights reserved. -
The Tatoeba Translation Challenge--Realistic Data Sets For
The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT. Jörg Tiedemann, University of Helsinki. [email protected] https://github.com/Helsinki-NLP/Tatoeba-Challenge
Abstract
This paper describes the development of a new benchmark for machine translation that provides training and test data for thousands of language pairs covering over 500 languages and tools for creating state-of-the-art translation models from that collection. The main goal is to trigger the development of open translation tools and models with a much broader coverage of the World’s languages. Using the package it is possible to work on realistic low-resource scenarios avoiding artificially reduced setups that are common when demonstrating zero-shot or few-shot learning. For the first time, this package provides a comprehensive collection of diverse data sets ...
... most important point is to get away from artificial setups that only simulate low-resource scenarios or zero-shot translations. A lot of research is tested with multi-parallel data sets and high resource languages using data sets such as WIT3 (Cettolo et al., 2012) or Europarl (Koehn, 2005), simply reducing or taking away one language pair for arguing about the capabilities of learning translation with little or without explicit training data for the language pair in question (see, e.g., Firat et al. (2016a,b); Ha et al. (2016); Lakew et al. (2018)). Such a setup is, however, not realistic and most probably over-estimates the ability of transfer learning, making claims that do not necessarily carry over towards real-world tasks. -
Proposal for Generation Panel for Latin Script Label Generation Ruleset for the Root Zone
Generation Panel for Latin Script Label Generation Ruleset for the Root Zone
Proposal for Generation Panel for Latin Script Label Generation Ruleset for the Root Zone
Table of Contents
1. General Information
1.1 Use of Latin Script characters in domain names
1.2 Target Script for the Proposed Generation Panel
1.2.1 Diacritics
1.3 Countries with significant user communities using Latin script
2. Proposed Initial Composition of the Panel and Relationship with Past Work or Working Groups
3. Work Plan
3.1 Suggested Timeline with Significant Milestones
3.2 Sources for funding travel and logistics
3.3 Need for ICANN provided advisors
4. References
1. General Information
The Latin script or Roman script is a major writing system of the world today, and the most widely used in terms of number of languages and number of speakers, with circa 70% of the world’s readers and writers making use of this script (Wikipedia). Historically, it is derived from the Greek alphabet, as is the Cyrillic script. The Greek alphabet is in turn derived from the Phoenician alphabet, which dates to the mid-11th century BC and is itself based on older scripts. This explains why Latin, Cyrillic and Greek share some letters, which may become relevant to the ruleset in the form of cross-script variants. The Latin alphabet itself originated in Italy in the 7th century BC. The original alphabet contained 21 upper case only letters: A, B, C, D, E, F, Z, H, I, K, L, M, N, O, P, Q, R, S, T, V and X. -
oSIST ISO/DIS 3297:2019, 1 October 2019
SLOVENIAN STANDARD oSIST ISO/DIS 3297:2019, 1 October 2019. Information and documentation - International standard serial number (ISSN). This Slovenian standard is identical to: ISO/DIS 3297:2019. ICS: 01.140.20 Information sciences. 2003-01. Slovenian Institute for Standardization. Reproduction of this standard in whole or in part is not permitted. Languages: en, fr, de.