Textual Representations for Corpus-Based Bilingual Retrieval
Total pages: 16
File type: PDF, size: 1020 KB