Multilingual Dictionary Based Construction of Core Vocabulary
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
GRAMMAR of SOLRESOL Or the Universal Language of François SUDRE
GRAMMAR OF SOLRESOL or the Universal Language of François SUDRE by BOLESLAS GAJEWSKI, Professor [M. Vincent GAJEWSKI, professor, d. Paris in 1881, is the father of the author of this Grammar. He was for thirty years the president of the Central committee for the study and advancement of Solresol, a committee founded in Paris in 1869 by Madame SUDRE, widow of the Inventor.] [This edition from taken from: Copyright © 1997, Stephen L. Rice, Last update: Nov. 19, 1997 URL: http://www2.polarnet.com/~srice/solresol/sorsoeng.htm Edits in [brackets], as well as chapter headings and formatting by Doug Bigham, 2005, for LIN 312.] I. Introduction II. General concepts of solresol III. Words of one [and two] syllable[s] IV. Suppression of synonyms V. Reversed meanings VI. Important note VII. Word groups VIII. Classification of ideas: 1º simple notes IX. Classification of ideas: 2º repeated notes X. Genders XI. Numbers XII. Parts of speech XIII. Number of words XIV. Separation of homonyms XV. Verbs XVI. Subjunctive XVII. Passive verbs XVIII. Reflexive verbs XIX. Impersonal verbs XX. Interrogation and negation XXI. Syntax XXII. Fasi, sifa XXIII. Partitive XXIV. Different kinds of writing XXV. Different ways of communicating XXVI. Brief extract from the dictionary I. Introduction In all the business of life, people must understand one another. But how is it possible to understand foreigners, when there are around three thousand different languages spoken on earth? For everyone's sake, to facilitate travel and international relations, and to promote the progress of beneficial science, a language is needed that is easy, shared by all peoples, and capable of serving as a means of interpretation in all countries. -
POS Tagging for Improving Code-Switching Identification In
POS Tagging for Improving Code-Switching Identification in Arabic Mohammed Attia1 Younes Samih2 Ali Elkahky1 Hamdy Mubarak2 Ahmed Abdelali2 Kareem Darwish2 1Google LLC, New York City, USA 2Qatar Computing Research Institute, HBKU Research Complex, Doha, Qatar 1{attia,alielkahky}@google.com 2{ysamih,hmubarak,aabdelali,kdarwish}@hbku.edu.qa Abstract and sociolinguistic reasons. Second, existing the- oretical and computational linguistic models are When speakers code-switch between their na- tive language and a second language or lan- based on monolingual data and cannot adequately guage variant, they follow a syntactic pattern explain or deal with the influx of CS data whether where words and phrases from the embedded spoken or written. language are inserted into the matrix language. CS has been studied for over half a century This paper explores the possibility of utiliz- from different perspectives, including theoretical ing this pattern in improving code-switching linguistics (Muysken, 1995; Parkin, 1974), ap- identification between Modern Standard Ara- plied linguistics (Walsh, 1969; Boztepe, 2003; Se- bic (MSA) and Egyptian Arabic (EA). We try to answer the question of how strong is the tati, 1998), socio-linguistics (Barker, 1972; Heller, POS signal in word-level code-switching iden- 2010), psycho-linguistics (Grosjean, 1989; Prior tification. We build a deep learning model en- and Gollan, 2011; Kecskes, 2006), and more re- riched with linguistic features (including POS cently computational linguistics (Solorio and Liu, tags) that outperforms the state-of-the-art re- 2008a; Çetinoglu˘ et al., 2016; Adel et al., 2013b). sults by 1.9% on the development set and 1.0% In this paper, we investigate the possibility of on the test set. -
SELECTING and CREATING a WORD LIST for ENGLISH LANGUAGE TEACHING by Deny A
Teaching English with Technology, 17(1), 60-72, http://www.tewtjournal.org 60 SELECTING AND CREATING A WORD LIST FOR ENGLISH LANGUAGE TEACHING by Deny A. Kwary and Jurianto Universitas Airlangga Dharmawangsa Dalam Selatan, Surabaya 60286, Indonesia d.a.kwary @ unair.ac.id / [email protected] Abstract Since the introduction of the General Service List (GSL) in 1953, a number of studies have confirmed the significant role of a word list, particularly GSL, in helping ESL students learn English. Given the recent development in technology, several researchers have created word lists, each of them claims to provide a better coverage of a text and a significant role in helping students learn English. This article aims at analyzing the claims made by the existing word lists and proposing a method for selecting words and a creating a word list. The result of this study shows that there are differences in the coverage of the word lists due to the difference in the corpora and the source text analysed. This article also suggests that we should create our own word list, which is both personalized and comprehensive. This means that the word list is not just a list of words. The word list needs to be accompanied with the senses and the patterns of the words, in order to really help ESL students learn English. Keywords: English; GSL; NGSL; vocabulary; word list 1. Introduction A word list has been noted as an essential resource for language teaching, especially for second language teaching. The main purpose for the creation of a word list is to determine the words that the learners need to know, in order to provide more focused learning materials. -
Modeling and Encoding Traditional Wordlists for Machine Applications
Modeling and Encoding Traditional Wordlists for Machine Applications Shakthi Poornima Jeff Good Department of Linguistics Department of Linguistics University at Buffalo University at Buffalo Buffalo, NY USA Buffalo, NY USA [email protected] [email protected] Abstract Clearly, descriptive linguistic resources can be This paper describes work being done on of potential value not just to traditional linguis- the modeling and encoding of a legacy re- tics, but also to computational linguistics. The source, the traditional descriptive wordlist, difficulty, however, is that the kinds of resources in ways that make its data accessible to produced in the course of linguistic description NLP applications. We describe an abstract are typically not easily exploitable in NLP appli- model for traditional wordlist entries and cations. Nevertheless, in the last decade or so, then provide an instantiation of the model it has become widely recognized that the devel- in RDF/XML which makes clear the re- opment of new digital methods for encoding lan- lationship between our wordlist database guage data can, in principle, not only help descrip- and interlingua approaches aimed towards tive linguists to work more effectively but also al- machine translation, and which also al- low them, with relatively little extra effort, to pro- lows for straightforward interoperation duce resources which can be straightforwardly re- with data from full lexicons. purposed for, among other things, NLP (Simons et al., 2004; Farrar and Lewis, 2007). 1 Introduction Despite this, it has proven difficult to create When looking at the relationship between NLP significant electronic descriptive resources due to and linguistics, it is typical to focus on the dif- the complex and specific problems inevitably as- ferent approaches taken with respect to issues sociated with the conversion of legacy data. -
A Pipeline for Computational Historical Linguistics
A Pipeline for Computational Historical Linguistics Lydia Steiner Peter F. Stadler Bioinformatics Group, Interdisciplinary Bioinformatics Group, University of Center for Bioinformatics, University of Leipzig, Interdisciplinary Center for Leipzig ∗ Bioinformatics, University of Leipzig∗; Max-Planck-Institute for Mathematics in the Science; Fraunhofer Institute for Cell Therapy and Immunology; Center for non-coding RNA in Technology and Health, University of Copenhagen; Institute for Theoretical Chemistry, University of Vienna; Santa Fe Institute Michael Cysouw Research unit Quantitative Language Comparison, LMU München ∗∗ There are many parallels between historical linguistics and molecular phylogenetics. In this paper we describe an algorithmic pipeline that mimics, as closely as possible, the traditional workflow of language reconstruction known as the comparative method. The pipeline consists of suitably modified algorithms based on recent research in bioinformatics, that are adapted to the specifics of linguistic data. This approach can alleviate much of the laborious research needed to establish proof of historical relationships between languages. Equally important to our proposal is that each step in the workflow of the comparative method is implemented independently, so language specialists have the possibility to scrutinize intermediate results. We have used our pipeline to investigate two groups of languages, the Tsezic languages from the Caucasus and the Mataco-Guaicuruan languages from South America, based on the lexical data from the Intercontinental Dictionary Series (IDS). The results of these tests show that the current approach is a viable and useful extension to historical linguistic research. 1. Introduction Molecular phylogenetics and historical linguistics are both concerned with the recon- struction of evolutionary histories, the former of biological organisms, the latter of human languages. -
THE IMPACT of USING WORDLISTS in the LANGUAGE CLASSROOM on STUDENTS’ VOCABULARY ACQUISITION Gülçin Coşgun Ozyegin University
International Journal of English Language Teaching Vol.4, No.3, pp.49-66, March 2016 ___Published by European Centre for Research Training and Development UK (www.eajournals.org) THE IMPACT OF USING WORDLISTS IN THE LANGUAGE CLASSROOM ON STUDENTS’ VOCABULARY ACQUISITION Gülçin Coşgun Ozyegin University ABSTRACT: Vocabulary has always been an area of interest for many researchers since words represent “the building block upon which knowledge of the second language can be built” and without them people cannot convey the intended meaning (Abrudan, 2010). Nation (1988) emphasised if teachers want to help their learners to deal with unknown words, it would be better to spend more time on vocabulary learning strategies rather than spending time on individual words. However, Schmitt in Schmitt and McCarthy (1997:200) stated that among vocabulary learning strategies only ‘guessing from context’ and ‘key word method’ have been investigated in depth. Therefore, there is need for more research on vocabulary learning whose pedagogical implications may contribute to the field of second language learning. Considering the above- mentioned issues, vocabulary is a worthwhile field to investigate. Hence, this paper aims at proposing a framework for vocabulary teaching strategy in English as a foreign language context. KEYWORDS: Wordlists, Vocabulary Teaching, Word walls, Vocabulary Learning Strategies INTRODUCTION Vocabulary has always been an area of interest and concern for many researchers and teachers alike since words represent “the building block upon which knowledge of the second language can be built” and without them people cannot convey the intended meaning (Abrudan, 2010). Anderson and Freebody (1981) state that a strong correlation exists between vocabulary and academic achievement. -
Swadesh Lists Are Not Long Enough: Drawing Phonological Generalizations from Limited Data
Language Documentation and Description ISSN 1740-6234 ___________________________________________ This article appears in: Language Documentation and Description, vol 16. Editor: Peter K. Austin Swadesh lists are not long enough: Drawing phonological generalizations from limited data RIKKER DOCKUM & CLAIRE BOWERN Cite this article: Rikker Dockum & Claire Bowern (2018). Swadesh lists are not long enough: Drawing phonological generalizations from limited data. In Peter K. Austin (ed.) Language Documentation and Description, vol 16. London: EL Publishing. pp. 35-54 Link to this article: http://www.elpublishing.org/PID/168 This electronic version first published: August 2019 __________________________________________________ This article is published under a Creative Commons License CC-BY-NC (Attribution-NonCommercial). The licence permits users to use, reproduce, disseminate or display the article provided that the author is attributed as the original creator and that the reuse is restricted to non-commercial purposes i.e. research or educational use. See http://creativecommons.org/licenses/by-nc/4.0/ ______________________________________________________ EL Publishing For more EL Publishing articles and services: Website: http://www.elpublishing.org Submissions: http://www.elpublishing.org/submissions Swadesh lists are not long enough: Drawing phonological generalizations from limited data Rikker Dockum and Claire Bowern Yale University Abstract This paper presents the results of experiments on the minimally sufficient wordlist size for drawing phonological generalizations about languages. Given a limited lexicon for an under-documented language, are conclusions that can be drawn from those data representative of the language as a whole? Linguistics necessarily involves generalizing from limited data, as documentation can never completely capture the full complexity of a linguistic system. We performed a series of sampling experiments on 36 Australian languages in the Chirila database (Bowern 2016) with lexicons ranging from 2,000 to 10,000 items. -
All-Words Word Sense Disambiguation Using Concept Embeddings
All-words Word Sense Disambiguation Using Concept Embeddings Rui Suzuki, Kanako Komiya, Masayuki Asahara, Minoru Sasaki, Hiroyuki Shinnou Ibaraki University, 4-12-1 Nakanarusawa, Hitachi, Ibaraki JAPAN, National Institute for Japanese Language and Linguistics, 10-2 Midoricho, Tachikawa, Tokyo, JAPAN, [email protected], kanako.komiya.nlp, minoru.sasaki.01, hiroyuki.shinnou.0828g @vc.ibaraki.ac.jp, [email protected] Abstract All-words word sense disambiguation (all-words WSD) is the task of identifying the senses of all words in a document. Since the sense of a word depends on the context, such as the surrounding words, similar words are believed to have similar sets of surrounding words. We therefore predict the target word senses by calculating the distances between the surrounding word vectors of the target words and their synonyms using word embeddings. In addition, we introduce the new idea of concept embeddings, constructed from concept tag sequences created from the results of previous prediction steps. We predict the target word senses using the distances between surrounding word vectors constructed from word and concept embeddings, via a bootstrapped iterative process. Experimental results show that these concept embeddings were able to improve the performance of Japanese all-words WSD. Keywords: word sense disambiguation, all-words, unsupervised 1. Introduction many words have the same article numbers. Several words Word sense disambiguation (WSD) involves identifying the can have the same precise article number, even when the senses of words in documents. In particular, the WSD task semantic breaks are considered. where the senses of all the words in a document are disam- biguated is referred to as all-words WSD. -
Northwest Caucasian Languages and Hattic
Kafkasya Calışmaları - Sosyal Bilimler Dergisi / Journal of Caucasian Studies Kasım 2020 / November 2020, Yıl / Vol. 6, № 11 ISSN 2149–9527 E-ISSN 2149-9101 Northwest Caucasian Languages and Hattic Ayla Bozkurt Applebaum* Abstract The relationships among five Northwest Caucasian languages and Hattic were investigated. A list of 193 core vocabulary words was constructed and examined to find look-alike words. Data for Abhkaz, Abaza, Kabardian (East Circassian), Adyghe (West Circassian) and Ubykh drew on the work of Starostin, Chirikba and Kuipers. A sub-set list of 15 look-alike words for Hattic was constructed from Soysal (2003). These lists were formulated as character data for reconstructing the phylogenetic relationships of the languages. The phylogenetic relationships of these languages were investigated by a well-known method, Neighbor Joining, as implemented in PAUP* 4.0. Supporting and dissenting evidence from human genetic population studies and archeological evidence were discussed. This project has produced a provisional set of character data for the Northwest Caucasian languages and, to a limited extent, Hattic. Phylogenetic trees have been generated and displayed to show their general character and the types of differences obtained by alternate methods. This research is a basis for further inquiries into the development of the Caucasian languages. Moreover, it presents an example of the method for contrast queries application in studying the evolution of language families. Keywords: Northwest Caucasian Languages, Hattic, Historical Linguistics, Circassian, Adyghe, Kabardian * Ayla Bozkurt Applebaum, ORCID 0000-0003-4866-4407, E-mail: [email protected] (Received/Gönderim: 15.10.2020; Accepted/Kabul: 28.11.2020) 63 Ayla Bozkurt Applebaum Kuzeybatı Kafkas Dilleri ve Hattice Özet Bu araştırma beş Kuzeybatı Kafkas Dilleri ve Hatik arasındaki ilişkiyi incelemektedir. -
A Web-Based Vocabulary Profiler for Chinese Language Teaching and Research
DA A web-based vocabulary profiler for Chinese language teaching and research Jun Da, Middle Tennessee State University This paper describes the design and development of a web-based Chinese vocabulary profiler that is intended as a handy tool for Chinese language instruction and research. Given a piece of text, the profiler is capable of generating character and vocabulary profiles with frequency and availability information based on a comparison of the set of characters and words found in the text with a selection of other pre-compiled reference lists such as Da’s Modern Chinese character and N-gram frequency lists, the HSK word list and customized vocabulary list provided by a user. 1. Introduction Vocabulary profiling is a measure of the proportions of low and high frequency vocabulary used in a written text. In addition to frequency information, a profiler designed for lexical analysis of texts often provides other information such as the presence/absence of the set of words from the input text in other specialized word lists and their collocation with other words, etc. While it has been used in stylistic studies of written texts (e.g., Graves’s (2004) profiling of Jane Austen’s novels and letters), vocabulary profiling has also been suggested as a useful instrument in second language acquisition research and pedagogy (Laufer and Nation 1995). For example, Morris and Cobb (2004) examined the potential of using vocabulary profile as a predictor of TESL (Teaching English as a Second Language) trainees’ academic performance and reported that vocabulary profile results correlate significantly with their course grades. -
Exploring the Corresponding Words Among the Subgroupings, Revising Swadesh List, Compared with Chinese
International Journal of Language and Linguistics 2019; 7(3): 110-118 http://www.sciencepublishinggroup.com/j/ijll doi: 10.11648/j.ijll.20190703.13 ISSN: 2330-0205 (Print); ISSN: 2330-0221 (Online) Exploring the Corresponding Words Among the Subgroupings, Revising Swadesh List, Compared with Chinese Ke Luo Beijing Institute of Spacecraft Environment Engineering, Beijing, China Email address: To cite this article: Ke Luo. Exploring the Corresponding Words Among the Subgroupings, Revising Swadesh List, Compared with Chinese. International Journal of Language and Linguistics . Vol. 7, No. 3, 2019, pp. 110-118. doi: 10.11648/j.ijll.20190703.13 Received : March 2, 2019; Accepted : April 29, 2019; Published : May 31, 2019 Abstract: In 1647, the Dutch linguist Marcus Zuerius van Boxhorn noted the similarity among certain Asian and European languages and first theorized that they were derived from a primitive common language which he called Scythian (proto-language). For hundreds of years, many scholars have been studying the cognate of the subgroupings of Indo-European languages. This paper with the Swadesh list compares several subgroupings of Indo-European languages and finds out that their cognate correspondence is closer. It is inferred that the Proto-Indo-European was a language with very rich vocabulary and should be contained independently in the subgroupings at the beginning of its dissemination. This paper proposes a revised Swadesh list which can be used to assess language homology degree in high or low, and compares English with Chinese. The result shows that Chinese and English have the same origin. Keywords: Indo-European Languages, Swadesh List, English, Chinese, Cognate conclusion. -
Simple Features for Statistical Word Sense Disambiguation
SENSEVAL-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, Barcelona, Spain, July 2004 Association for Computational Linguistics Simple Features for Statistical Word Sense Disambiguation Abolfazl K. Lamjiri, Osama El Demerdash, Leila Kosseim CLaC Laboratory Department of Computer Science Concordia University, Montreal, Canada fa keigho,osama el,[email protected] Abstract more than four words around the target are In this paper, we describe our experiments on considered (Ide and V´eronis, 1998). While re- statistical word sense disambiguation (WSD) searchers such as Masterman (1961), Gougen- using two systems based on different ap- heim and Michea (1961), agree with this ob- proaches: Na¨ıve Bayes on word tokens and Max- servation (Ide and V´eronis, 1998), our results imum Entropy on local syntactic and seman- demonstrate that this does not generally ap- tic features. In the first approach, we consider ply to all words. A large context window pro- a context window and a sub-window within it vides domain information which increases the around the word to disambiguate. Within the accuracy for some target words such as bank.n, outside window, only content words are con- but not others like different.a or use.v (see Sec- sidered, but within the sub-window, all words tion 3). This confirms Mihalcea's observations are taken into account. Both window sizes are (Mihalcea, 2002). In our system we allow a tuned by the system for each word to disam- larger context window size and for most of the biguate and accuracies of 75% and 67% were re- words such context window is selected by the spectively obtained for coarse and fine grained system.