A Survey of Orthographic Information in Machine Translation 3
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Word Representation Or Word Embedding in Persian Texts Siamak Sarmady Erfan Rahmani
1 Word representation or word embedding in Persian texts Siamak Sarmady Erfan Rahmani (Abstract) Text processing is one of the sub-branches of natural language processing. Recently, the use of machine learning and neural networks methods has been given greater consideration. For this reason, the representation of words has become very important. This article is about word representation or converting words into vectors in Persian text. In this research GloVe, CBOW and skip-gram methods are updated to produce embedded vectors for Persian words. In order to train a neural networks, Bijankhan corpus, Hamshahri corpus and UPEC corpus have been compound and used. Finally, we have 342,362 words that obtained vectors in all three models for this words. These vectors have many usage for Persian natural language processing. and a bit set in the word index. One-hot vector method 1. INTRODUCTION misses the similarity between strings that are specified by their characters [3]. The Persian language is a Hindi-European language The second method is distributed vectors. The words where the ordering of the words in the sentence is in the same context (neighboring words) may have subject, object, verb (SOV). According to various similar meanings (For example, lift and elevator both structures in Persian language, there are challenges that seem to be accompanied by locks such as up, down, cannot be found in languages such as English [1]. buildings, floors, and stairs). This idea influences the Morphologically the Persian is one of the hardest representation of words using their context. Suppose that languages for the purposes of processing and each word is represented by a v dimensional vector. -
Technical Reference Manual for the Standardization of Geographical Names United Nations Group of Experts on Geographical Names
ST/ESA/STAT/SER.M/87 Department of Economic and Social Affairs Statistics Division Technical reference manual for the standardization of geographical names United Nations Group of Experts on Geographical Names United Nations New York, 2007 The Department of Economic and Social Affairs of the United Nations Secretariat is a vital interface between global policies in the economic, social and environmental spheres and national action. The Department works in three main interlinked areas: (i) it compiles, generates and analyses a wide range of economic, social and environmental data and information on which Member States of the United Nations draw to review common problems and to take stock of policy options; (ii) it facilitates the negotiations of Member States in many intergovernmental bodies on joint courses of action to address ongoing or emerging global challenges; and (iii) it advises interested Governments on the ways and means of translating policy frameworks developed in United Nations conferences and summits into programmes at the country level and, through technical assistance, helps build national capacities. NOTE The designations employed and the presentation of material in the present publication do not imply the expression of any opinion whatsoever on the part of the Secretariat of the United Nations concerning the legal status of any country, territory, city or area or of its authorities, or concerning the delimitation of its frontiers or boundaries. The term “country” as used in the text of this publication also refers, as appropriate, to territories or areas. Symbols of United Nations documents are composed of capital letters combined with figures. ST/ESA/STAT/SER.M/87 UNITED NATIONS PUBLICATION Sales No. -
Methods for Estonian Large Vocabulary Speech Recognition
THESIS ON INFORMATICS AND SYSTEM ENGINEERING C Methods for Estonian Large Vocabulary Speech Recognition TANEL ALUMAE¨ Faculty of Information Technology Department of Informatics TALLINN UNIVERSITY OF TECHNOLOGY Dissertation was accepted for the commencement of the degree of Doctor of Philosophy in Engineering on November 1, 2006. Supervisors: Prof. Emer. Leo Vohandu,˜ Faculty of Information Technology Einar Meister, Ph.D., Institute of Cybernetics at Tallinn University of Technology Opponents: Mikko Kurimo, Dr. Tech., Helsinki University of Technology Heiki-Jaan Kaalep, Ph.D., University of Tartu Commencement: December 5, 2006 Declaration: Hereby I declare that this doctoral thesis, my original investigation and achievement, submitted for the doctoral degree at Tallinn University of Technology has not been submitted for any degree or examination. / Tanel Alumae¨ / Copyright Tanel Alumae,¨ 2006 ISSN 1406-4731 ISBN 9985-59-661-7 ii Contents 1 Introduction 1 1.1 The speech recognition problem . 1 1.2 Language specific aspects of speech recognition . 3 1.3 Related work . 5 1.4 Scope of the thesis . 6 1.5 Outline of the thesis . 7 1.6 Acknowledgements . 7 2 Basic concepts of speech recognition 9 2.1 Probabilistic decoding problem . 10 2.2 Feature extraction . 10 2.2.1 Signal acquisition . 11 2.2.2 Short-term analysis . 11 2.3 Acoustic modelling . 14 2.3.1 Hidden Markov models . 15 2.3.2 Selection of basic units . 20 2.3.3 Clustered context-dependent acoustic units . 20 2.4 Language Modelling . 22 2.4.1 N-gram language models . 23 2.4.2 Language model evaluation . 29 3 Properties of the Estonian language 33 3.1 Phonology . -
Modeling Popularity and Reliability of Sources in Multilingual Wikipedia
information Article Modeling Popularity and Reliability of Sources in Multilingual Wikipedia Włodzimierz Lewoniewski * , Krzysztof W˛ecel and Witold Abramowicz Department of Information Systems, Pozna´nUniversity of Economics and Business, 61-875 Pozna´n,Poland; [email protected] (K.W.); [email protected] (W.A.) * Correspondence: [email protected] Received: 31 March 2020; Accepted: 7 May 2020; Published: 13 May 2020 Abstract: One of the most important factors impacting quality of content in Wikipedia is presence of reliable sources. By following references, readers can verify facts or find more details about described topic. A Wikipedia article can be edited independently in any of over 300 languages, even by anonymous users, therefore information about the same topic may be inconsistent. This also applies to use of references in different language versions of a particular article, so the same statement can have different sources. In this paper we analyzed over 40 million articles from the 55 most developed language versions of Wikipedia to extract information about over 200 million references and find the most popular and reliable sources. We presented 10 models for the assessment of the popularity and reliability of the sources based on analysis of meta information about the references in Wikipedia articles, page views and authors of the articles. Using DBpedia and Wikidata we automatically identified the alignment of the sources to a specific domain. Additionally, we analyzed the changes of popularity and reliability in time and identified growth leaders in each of the considered months. The results can be used for quality improvements of the content in different languages versions of Wikipedia. -
VALTS ERNÇSTREITS (Riga) LIVONIAN ORTHOGRAPHY Introduction the Livonian Language Has Been Extensively Written for About
Linguistica Uralica XLIII 2007 1 VALTS ERNÇSTREITS (Riga) LIVONIAN ORTHOGRAPHY Abstract. This article deals with the development of Livonian written language and the related matters starting from the publication of first Livonian books until present day. In total four different spelling systems have been used in Livonian publications. The first books in Livonian appeared in 1863 using phonetic transcription. In 1880, the Gospel of Matthew was published in Eastern and Western Livonian dialects and used Gothic script and a spelling system similar to old Latvian orthography. In 1920, an East Livonian written standard was established by the simplification of the Finno-Ugric phonetic transcription. Later, elements of Latvian orthography, and after 1931 also West Livonian characteristics, were added. Starting from the 1970s and due to a considerable decrease in the number of Livonian mother tongue speakers in the second half of the 20th century the orthography was modified to be even more phonetic in the interest of those who did not speak the language. Additionally, in the 1930s, a spelling system which was better suited for conveying certain phonetic phenomena than the usual standard was used in two books but did not find any wider usage. Keywords: Livonian, standard language, orthography. Introduction The Livonian language has been extensively written for about 150 years by linguists who have been noting down examples of the language as well as the Livonians themselves when publishing different materials. The prime consideration for both of these groups has been how to best represent the Livonian language in the written form. The area of interest for the linguists has been the written representation of the language with maximal phonetic precision. -
Language Resource Management System for Asian Wordnet Collaboration and Its Web Service Application
Language Resource Management System for Asian WordNet Collaboration and Its Web Service Application Virach Sornlertlamvanich1,2, Thatsanee Charoenporn1,2, Hitoshi Isahara3 1Thai Computational Linguistics Laboratory, NICT Asia Research Center, Thailand 2National Electronics and Computer Technology Center, Pathumthani, Thailand 3National Institute of Information and Communications Technology, Japan E-mail: [email protected], [email protected], [email protected] Abstract This paper presents the language resource management system for the development and dissemination of Asian WordNet (AWN) and its web service application. We develop the platform to establish a network for the cross language WordNet development. Each node of the network is designed for maintaining the WordNet for a language. Via the table that maps between each language WordNet and the Princeton WordNet (PWN), the Asian WordNet is realized to visualize the cross language WordNet between the Asian languages. We propose a language resource management system, called WordNet Management System (WNMS), as a distributed management system that allows the server to perform the cross language WordNet retrieval, including the fundamental web service applications for editing, visualizing and language processing. The WNMS is implemented on a web service protocol therefore each node can be independently maintained, and the service of each language WordNet can be called directly through the web service API. In case of cross language implementation, the synset ID (or synset offset) defined by PWN is used to determined the linkage between the languages. English PWN. Therefore, the original structural 1. Introduction information is inherited to the target WordNet through its sense translation and sense ID. The AWN finally Language resource becomes crucial in the recent needs for cross cultural information exchange. -
Multilingual Ranking of Wikipedia Articles with Quality and Popularity Assessment in Different Topics
computers Article Multilingual Ranking of Wikipedia Articles with Quality and Popularity Assessment in Different Topics Włodzimierz Lewoniewski * , Krzysztof W˛ecel and Witold Abramowicz Department of Information Systems, Pozna´nUniversity of Economics and Business, 61-875 Pozna´n,Poland * Correspondence: [email protected]; Tel.: +48-(61)-639-27-93 Received: 10 May 2019; Accepted: 13 August 2019; Published: 14 August 2019 Abstract: On Wikipedia, articles about various topics can be created and edited independently in each language version. Therefore, the quality of information about the same topic depends on the language. Any interested user can improve an article and that improvement may depend on the popularity of the article. The goal of this study is to show what topics are best represented in different language versions of Wikipedia using results of quality assessment for over 39 million articles in 55 languages. In this paper, we also analyze how popular selected topics are among readers and authors in various languages. We used two approaches to assign articles to various topics. First, we selected 27 main multilingual categories and analyzed all their connections with sub-categories based on information extracted from over 10 million categories in 55 language versions. To classify the articles to one of the 27 main categories, we took into account over 400 million links from articles to over 10 million categories and over 26 million links between categories. In the second approach, we used data from DBpedia and Wikidata. We also showed how the results of the study can be used to build local and global rankings of the Wikipedia content. -
© 2020 Xiaoman Pan CROSS-LINGUAL ENTITY EXTRACTION and LINKING for 300 LANGUAGES
© 2020 Xiaoman Pan CROSS-LINGUAL ENTITY EXTRACTION AND LINKING FOR 300 LANGUAGES BY XIAOMAN PAN DISSERTATION Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate College of the University of Illinois at Urbana-Champaign, 2020 Urbana, Illinois Doctoral Committee: Dr. Heng Ji, Chair Dr. Jiawei Han Dr. Hanghang Tong Dr. Kevin Knight ABSTRACT Information provided in languages which people can understand saves lives in crises. For example, language barrier was one of the main difficulties faced by humanitarian workers responding to the Ebola crisis in 2014. We propose to break language barriers by extracting information (e.g., entities) from a massive variety of languages and ground the information into an existing Knowledge Base (KB) which is accessible to a user in their own language (e.g., a reporter from the World Health Organization who speaks English only). The ambitious goal of this thesis is to develop a Cross-lingual Entity Extraction and Linking framework for 1,000 fine-grained entity types and 300 languages that exist in Wikipedia. Given a document in any of these languages, our framework is able to identify entity name mentions, assign a fine-grained type to each mention, and link it to an English KB if it is linkable. Traditional entity linking methods rely on costly human annotated data to train supervised learning-to-rank models to select the best candidate entity for each mention. In contrast, we propose a novel unsupervised represent-and-compare approach that can accurately capture the semantic meaning representation of each mention, and directly compare its representation with the representation of each candidate entity in the target KB. -
Generating Sense Embeddings for Syntactic and Semantic Analogy for Portuguese
Generating Sense Embeddings for Syntactic and Semantic Analogy for Portuguese Jessica´ Rodrigues da Silva1, Helena de Medeiros Caseli1 1Federal University of Sao˜ Carlos, Department of Computer Science [email protected], [email protected] Abstract. Word embeddings are numerical vectors which can represent words or concepts in a low-dimensional continuous space. These vectors are able to capture useful syntactic and semantic information. The traditional approaches like Word2Vec, GloVe and FastText have a strict drawback: they produce a sin- gle vector representation per word ignoring the fact that ambiguous words can assume different meanings. In this paper we use techniques to generate sense embeddings and present the first experiments carried out for Portuguese. Our experiments show that sense vectors outperform traditional word vectors in syn- tactic and semantic analogy tasks, proving that the language resource generated here can improve the performance of NLP tasks in Portuguese. 1. Introduction Any natural language (Portuguese, English, German, etc.) has ambiguities. Due to ambi- guity, the same word surface form can have two or more different meanings. For exam- ple, the Portuguese word banco can be used to express the financial institution but also the place where we can rest our legs (a seat). Lexical ambiguities, which occur when a word has more than one possible meaning, directly impact tasks at the semantic level and solving them automatically is still a challenge in natural language processing (NLP) applications. One way to do this is through word embeddings. Word embeddings are numerical vectors which can represent words or concepts in a low-dimensional continuous space, reducing the inherent sparsity of traditional vector- space representations [Salton et al. -
TEI and the Documentation of Mixtepec-Mixtec Jack Bowers
Language Documentation and Standards in Digital Humanities: TEI and the documentation of Mixtepec-Mixtec Jack Bowers To cite this version: Jack Bowers. Language Documentation and Standards in Digital Humanities: TEI and the documen- tation of Mixtepec-Mixtec. Computation and Language [cs.CL]. École Pratique des Hauts Études, 2020. English. tel-03131936 HAL Id: tel-03131936 https://tel.archives-ouvertes.fr/tel-03131936 Submitted on 4 Feb 2021 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Préparée à l’École Pratique des Hautes Études Language Documentation and Standards in Digital Humanities: TEI and the documentation of Mixtepec-Mixtec Soutenue par Composition du jury : Jack BOWERS Guillaume, JACQUES le 8 octobre 2020 Directeur de Recherche, CNRS Président Alexis, MICHAUD Chargé de Recherche, CNRS Rapporteur École doctorale n° 472 Tomaž, ERJAVEC Senior Researcher, Jožef Stefan Institute Rapporteur École doctorale de l’École Pratique des Hautes Études Enrique, PALANCAR Directeur de Recherche, CNRS Examinateur Karlheinz, MOERTH Senior Researcher, Austrian Center for Digital Humanities Spécialité and Cultural Heritage Examinateur Linguistique Emmanuel, SCHANG Maître de Conférence, Université D’Orléans Examinateur Benoit, SAGOT Chargé de Recherche, Inria Examinateur Laurent, ROMARY Directeur de recherche, Inria Directeur de thèse 1. -
Creating Language Resources for Under-Resourced Languages: Methodologies, and Experiments with Arabic
Language Resources and Evaluation manuscript No. (will be inserted by the editor) Creating Language Resources for Under-resourced Languages: Methodologies, and experiments with Arabic Mahmoud El-Haj · Udo Kruschwitz · Chris Fox Received: 08/11/2013 / Accepted: 15/07/2014 Abstract Language resources are important for those working on computational methods to analyse and study languages. These resources are needed to help ad- vancing the research in fields such as natural language processing, machine learn- ing, information retrieval and text analysis in general. We describe the creation of useful resources for languages that currently lack them, taking resources for Arabic summarisation as a case study. We illustrate three different paradigms for creating language resources, namely: (1) using crowdsourcing to produce a small resource rapidly and relatively cheaply; (2) translating an existing gold-standard dataset, which is relatively easy but potentially of lower quality; and (3) using manual effort with ap- propriately skilled human participants to create a resource that is more expensive but of high quality. The last of these was used as a test collection for TAC-2011. An evaluation of the resources is also presented. The current paper describes and extends the resource creation activities and evaluations that underpinned experiments and findings that have previously appeared as an LREC workshop paper (El-Haj et al 2010), a student conference paper (El-Haj et al 2011b), and a description of a multilingual summarisation pilot (El-Haj et al 2011c; Giannakopoulos et al 2011). M. El-Haj School of Computing and Communications, Lancaster University, UK Tel.: +44(0)1524 51 0348 E-mail: [email protected] U. -
Mapping Arabic Wikipedia Into the Named Entities Taxonomy
Mapping Arabic Wikipedia into the Named Entities Taxonomy Fahd Alotaibi and Mark Lee School of Computer Science, University of Birmingham, UK {fsa081|m.g.lee}@cs.bham.ac.uk ABSTRACT This paper describes a comprehensive set of experiments conducted in order to classify Arabic Wikipedia articles into predefined sets of Named Entity classes. We tackle using four different classifiers, namely: Naïve Bayes, Multinomial Naïve Bayes, Support Vector Machines, and Stochastic Gradient Descent. We report on several aspects related to classification models in the sense of feature representation, feature set and statistical modelling. The results reported show that, we are able to correctly classify the articles with scores of 90% on Precision, Recall and balanced F-measure. KEYWORDS: Arabic Named Entity, Wikipedia, Arabic Document Classification, Supervised Machine Learning. Proceedings of COLING 2012: Posters, pages 43–52, COLING 2012, Mumbai, December 2012. 43 1 Introduction Relying on supervised machine learning technologies to recognize Named Entities (NE) in the text requires the development of a reasonable volume of data for the training phase. Manually developing a training dataset that goes beyond the news-wire domain is a non-trivial task. Examination of online and freely available resources, such as the Arabic Wikipedia (AW) offers promise because the underlying scheme of AW can be exploited in order to automatically identify NEs in context. To utilize this resource and develop a NEs corpus from AW means two different tasks should be addressed: 1) Identifying the NEs in the context regardless of assigning those into NEs semantic classes. 2) Classifying AW articles into predefined NEs taxonomy.