KLPT – Kurdish Language Processing Toolkit

Total Page:16

File Type:pdf, Size:1020Kb

KLPT – Kurdish Language Processing Toolkit KLPT – Kurdish Language Processing Toolkit Sina Ahmadi Insight Centre for Data Analytics National University of Ireland Galway [email protected] Abstract Therefore, the development of underlying tasks in NLP for a specific language will potentially pave Despite the recent advances in applying the way for further contributions to the field, by ei- language-independent approaches to various ther improving the current tools or further progress natural language processing tasks thanks to in new tasks. For instance, tokenization as a fun- artificial intelligence, some language-specific tools are still essential to process a language in damental task is widely required in many other a viable manner. Kurdish language is a less- applications such as part-of-speech tagging, ma- resourced language with a notable diversity in chine translation and syntactic analysis. Once ad- dialects and scripts and lacks basic language dressed, future researchers can build upon it for processing tools. To address this issue, we in- more advanced tasks or eventually improve it. troduce a language processing toolkit to han- Despite a plethora of performant tools and spe- dle such a diversity in an efficient way. Our toolkit is composed of fundamental compo- cific frameworks for NLP, such as NLTK (Loper nents such as text preprocessing, stemming, and Bird, 2002), Stanza (Qi et al., 2020), Teanga tokenization, lemmatization and transliteration (Ziad et al., 2018) and spaCy2, the progress with and is able to get further extended by future respect to less-resourced languages is often hin- developers. This project is publicly available1. dered by not only the lack of basic tools and re- sources but also the accessibility of the previous 1 Introduction studies under an open-source licence. This is Language technology is an increasingly impor- particularly the case of Kurdish, a less-resourced tant field in our information era which is depen- Indo-European language that is the focus of the dent on our knowledge of the human language current paper. As an example, although the task and computational methods to process it. Un- of spell-checking and stemming for Kurdish have like the latter which undergoes constant progress been addressed by many previous studies, (Jaf with new methods and more efficient techniques and Ramsay, 2014; Salavati and Ahmadi, 2018; being invented, the processability of human lan- Mustafa and Rashid, 2018; Saeed et al., 2018a; guages does not evolve with the same pace. This Hawezi et al., 2019) to mention but a few, none is particularly the case of languages with scarce re- of them provides an implementation of their tool sources and limited grammars, also known as less- under any licence. resourced languages. On the other hand, some previous studies use Various natural language processing (NLP) specific frameworks that are hardly integrable and tasks are of pipeline architecture; that is, to ad- inter-operable. For instance, (Walther and Sagot, dress a specific task, a few other language pro- 2010) and (Walther et al., 2010) describe their cessing tasks may be initially required (Manning efforts in developing a large-scale morphological et al., 2014). With the current advances in the lexicon and a part-of-speech tagger for Kurdish open-source movements, more researchers and in- within the Alexina framework under the LGPL-LR dustrial developers are encouraged to share their licence. Despite the valuable impact of this study knowledge in an open-source manner, accessi- in the field, for example in (Cotterell et al., 2017) ble under certain conditions (Ljungberg, 2000). and (Gökırmak and Tyers, 2017), the tool does not 1https://github.com/sinaahmadi/klpt 2https://github.com/explosion/spaCy 72 Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS), pages 72–84 Virtual Conference, November 19, 2020. c 2020 Association for Computational Linguistics > > S Z Z ì R S Q è G P (a) Consonants a: e: I i: o: u: U 0 (b) Vowels Table 1: A comparison of the Kurdish alphabets. Variations are specified with "/" seem to be widely used in the subsequent projects. Kurdish people that those are in fact different di- As such, projects such as (Jaf and Ramsay, 2014) alects of the Kurdish language (Haig and Matras, and (Ahmadi and Hassani, 2020a) tackle the very 2002; Matras, 2017). In this study, we remain with same topic from scratch. this theory and refer to them as Kurdish dialects. Language-specific toolkits have been previ- It is worth mentioning that despite the linguistic ously designed for various languages, such as similarities of Zazaki, also known as Dimlî, and IceNLP for Icelandic (Loftsson and Rögnvalds- Gorani languages and the popular belief that they son, 2007), VnCoreNLP for Vietnamese (Vu et al., are dialects of Kurdish, studies show that they be- 2018), FudanNLP for Chinese (Qiu et al., 2013), long to the Zaza-Gorani language family which PSI-Toolkit for Polish (Gralinski´ et al., 2013) and is independent from the Kurdish language (Paul, ParsiPardaz for Persian (Sarabi et al., 2013). In the 1998; Jugel, 2014; Ahmadi, 2020c). same vein, in order to facilitate the basic language Kurdish has been historically written in various processing tasks for Kurdish in an organized and scripts, namely Cyrillic, Armenian, Latin and Ara- methodical way and aware of the increasing im- bic among which the latter two are still widely portance of open-source and inter-operable tools in use. Efforts in standardization of the Kurdish for building more efficient systems and get fur- alphabets and orthographies have not succeeded ther advanced in the field, we present KLPT–the to be globally followed by all Kurdish speakers Kurdish language processing toolkit. This toolkit in all regions (Tavadze, 2019; Haig and Matras, is developed in Python and is composed of core 2002; Aydogan˘ , 2012). As such, the Kurmanji di- modules and is extendable by future developers. alect is mostly written in the Latin-based script while the Sorani, Southern Kurdish and Laki are 2 Kurdish Language mostly written in the Arabic-based script. That, Kurdish belongs to the Northwestern branch of the not only scatters readers and speakers to commu- Iranian languages within the Indo-European lan- nicate together, but also creates further challenges guage family which is spoken by 20-30 million in processing the language (Esmaili, 2012; Ah- speakers in the Kurdish regions of Turkey, Iraq, madi, 2019). Table1 provides the Latin-based and Iran and Syria and also, among the Kurdish di- Arabic-based Kurdish alphabets used for all the di- aspora around the world (Ahmadi et al., 2019). alects. The division of Kurdish into Northern Kurdish (or Kurdish language is a highly inflectional lan- Kurmanji), Central Kurdish (or Sorani), Southern guage, particularly due to a high number of affixes Kurdish and Laki, respectively with kmr, ckb, and clitics (Ahmadi and Hassani, 2020b). Regard- sdh and lki ISO 639-3 language codes, has ing nouns, although Sorani does not have gender been widely studied previously (Edmonds, 2013). or grammatical cases, it has a full article mark- Based on the structural differences between these, ing system for definite, indefinite and demonstra- some scholars believe that they are distinct lan- tive in singular and plural forms (Jugel, 2014). guages and therefore, refer to them as Kurdish lan- On the other hand, Kurmanji has a fewer num- guages (Kreyenbroek, 2005). On the other hand, ber of article markers for feminine and masculine it is also commonly believed by both scholars and genders (Thackston, 2006). With respect to the 73 Information retrieval and Text mining 3:63% Lexical resources 22.22 % 10:90% Dialectology 18.51 % 7:27% 9.25 % 3.70 % Sign language recognition 3.70 % 78:18% Machine Translation 11.11 % 12.96 % 1.85 % Optical character recognition 18.8 % Speech recognition Sorani Sorani and Kurmanji Other Kurmanji Sorani, Kurmanji and Gorani Morphological and syntactic analysis (a) per sub-field in NLP (b) per dialect Figure 1: Proportion of publications related to Kurdish language processing verbs, Kurdish has a few number of around 300 1998; Karimi, 2014). Unlike Kurmanji which single-word verbs (Walther and Sagot, 2010), e.g. uses oblique cases for this purpose, Sorani only kirdin/kirin “to do”, which are inflected based on uses different pronominal markers to specify erga- person (1,2,3, SG, PL), tense (past, present, future), tivity, therefore it is called split-ergative (Esmaili aspect (indefinite, perfect, progressive, imperfec- and Salavati, 2013). Except the past tenses, a tive) and mood (indicative, subjunctive, condi- nominative-accusative alignment is observed in tional). Unlike Kurmanji, Sorani Kurdish does other tenses. not have future tense and uses adverbs for this Not being equally documented and used, Kur- purpose. However, Kurdish extensively takes use dish dialects have different levels of linguistic of compound constructions for creating new verb resourcefulness. In comparison to Sorani and forms, particularly with (Noun + Verb), (Adjective Kurmanji which are widely used by the media + Verb) and (Preposition + Verb) forms (Traida, and press, Southern Kurdish and Laki are under- 2007). For instance, siław ‘hi (n)’, pîroz ‘holy’ documented and lack basic language resources (adj) and heł (verbal particle denoting ‘up’) with such as electronic dictionaries and corpora (Fat- the single-word verb kirdin can respectively form tah, 2000; Ahmadi et al., 2019; Ahmadi, 2020c). compound verbs siław kirdin “to greet”, pîroz kirdin “to congratulate” and heł kirdin “to turn 3 Current State of Kurdish Language on”. The stringing characteristic of the Arabic- Processing based script of Kurdish further adds to this mor- phological complexity in such a way that several The earliest works in the field of Kurdish language word forms may be concatenated together (Ah- processing date back to 2009. Our literature re- madi, 2020b). view indicates that some of these contributions fail to provide open-source solutions.
Recommended publications
  • 15 Things You Should Know About Spacy
    15 THINGS YOU SHOULD KNOW ABOUT SPACY ALEXANDER CS HENDORF @ EUROPYTHON 2020 0 -preface- Natural Language Processing Natural Language Processing (NLP): A Unstructured Data Avalanche - Est. 20% of all data - Structured Data - Data in databases, format-xyz Different Kinds of Data - Est. 80% of all data, requires extensive pre-processing and different methods for analysis - Unstructured Data - Verbal, text, sign- communication - Pictures, movies, x-rays - Pink noise, singularities Some Applications of NLP - Dialogue systems (Chatbots) - Machine Translation - Sentiment Analysis - Speech-to-text and vice versa - Spelling / grammar checking - Text completion Many Smart Things in Text Data! Rule-Based Exploratory / Statistical NLP Use Cases � Alexander C. S. Hendorf [email protected] -Partner & Principal Consultant Data Science & AI @hendorf Consulting and building AI & Data Science for enterprises. -Python Software Foundation Fellow, Python Softwareverband chair, Emeritus EuroPython organizer, Currently Program Chair of EuroSciPy, PyConDE & PyData Berlin, PyData community organizer -Speaker Europe & USA MongoDB World New York / San José, PyCons, CEBIT Developer World, BI Forum, IT-Tage FFM, PyData London, Berlin, PyParis,… here! NLP Alchemy Toolset 1 - What is spaCy? - Stable Open-source library for NLP: - Supports over 55+ languages - Comes with many pretrained language models - Designed for production usage - Advantages: - Fast and efficient - Relatively intuitive - Simple deep learning implementation - Out-of-the-box support for: - Named entity recognition - Part-of-speech (POS) tagging - Labelled dependency parsing - … 2 – Building Blocks of spaCy 2 – Building Blocks I. - Tokenization Segmenting text into words, punctuations marks - POS (Part-of-speech tagging) cat -> noun, scratched -> verb - Lemmatization cats -> cat, scratched -> scratch - Sentence Boundary Detection Hello, Mrs. Poppins! - NER (Named Entity Recognition) Marry Poppins -> person, Apple -> company - Serialization Saving NLP documents 2 – Building Blocks II.
    [Show full text]
  • Automatic Correction of Real-Word Errors in Spanish Clinical Texts
    sensors Article Automatic Correction of Real-Word Errors in Spanish Clinical Texts Daniel Bravo-Candel 1,Jésica López-Hernández 1, José Antonio García-Díaz 1 , Fernando Molina-Molina 2 and Francisco García-Sánchez 1,* 1 Department of Informatics and Systems, Faculty of Computer Science, Campus de Espinardo, University of Murcia, 30100 Murcia, Spain; [email protected] (D.B.-C.); [email protected] (J.L.-H.); [email protected] (J.A.G.-D.) 2 VÓCALI Sistemas Inteligentes S.L., 30100 Murcia, Spain; [email protected] * Correspondence: [email protected]; Tel.: +34-86888-8107 Abstract: Real-word errors are characterized by being actual terms in the dictionary. By providing context, real-word errors are detected. Traditional methods to detect and correct such errors are mostly based on counting the frequency of short word sequences in a corpus. Then, the probability of a word being a real-word error is computed. On the other hand, state-of-the-art approaches make use of deep learning models to learn context by extracting semantic features from text. In this work, a deep learning model were implemented for correcting real-word errors in clinical text. Specifically, a Seq2seq Neural Machine Translation Model mapped erroneous sentences to correct them. For that, different types of error were generated in correct sentences by using rules. Different Seq2seq models were trained and evaluated on two corpora: the Wikicorpus and a collection of three clinical datasets. The medicine corpus was much smaller than the Wikicorpus due to privacy issues when dealing Citation: Bravo-Candel, D.; López-Hernández, J.; García-Díaz, with patient information.
    [Show full text]
  • Current Issues in Kurdish Linguistics Current Issues in Kurdish Linguistics 1 Bamberg Studies in Kurdish Linguistics Bamberg Studies in Kurdish Linguistics
    Bamberg Studies in Kurdish Linguistics 1 Songül Gündoğdu, Ergin Öpengin, Geofrey Haig, Erik Anonby (eds.) Current issues in Kurdish linguistics Current issues in Kurdish linguistics 1 Bamberg Studies in Kurdish Linguistics Bamberg Studies in Kurdish Linguistics Series Editor: Geofrey Haig Editorial board: Erik Anonby, Ergin Öpengin, Ludwig Paul Volume 1 2019 Current issues in Kurdish linguistics Songül Gündoğdu, Ergin Öpengin, Geofrey Haig, Erik Anonby (eds.) 2019 Bibliographische Information der Deutschen Nationalbibliothek Die Deutsche Nationalbibliothek verzeichnet diese Publikation in der Deut schen Nationalbibliographie; detaillierte bibliographische Informationen sind im Internet über http://dnb.d-nb.de/ abrufbar. Diese Veröff entlichung wurde im Rahmen des Elite-Maststudiengangs „Kul- turwissenschaften des Vorderen Orients“ durch das Elitenetzwerk Bayern ge- fördert, einer Initiative des Bayerischen Staatsministeriums für Wissenschaft und Kunst. Die Verantwortung für den Inhalt dieser Veröff entlichung liegt bei den Auto- rinnen und Autoren. Dieses Werk ist als freie Onlineversion über das Forschungsinformations- system (FIS; https://fi s.uni-bamberg.de) der Universität Bamberg erreichbar. Das Werk – ausgenommen Cover, Zitate und Abbildungen – steht unter der CC-Lizenz CC-BY. Lizenzvertrag: Creative Commons Namensnennung 4.0 http://creativecommons.org/licenses/by/4.0. Herstellung und Druck: Digital Print Group, Nürnberg Umschlaggestaltung: University of Bamberg Press © University of Bamberg Press, Bamberg 2019 http://www.uni-bamberg.de/ubp/ ISSN: 2698-6612 ISBN: 978-3-86309-686-1 (Druckausgabe) eISBN: 978-3-86309-687-8 (Online-Ausgabe) URN: urn:nbn:de:bvb:473-opus4-558751 DOI: http://dx.doi.org/10.20378/irbo-55875 Acknowledgements This volume contains a selection of contributions originally presented at the Third International Conference on Kurdish Linguistics (ICKL3), University of Ams- terdam, in August 2016.
    [Show full text]
  • Welsh Language Technology Action Plan Progress Report 2020 Welsh Language Technology Action Plan: Progress Report 2020
    Welsh language technology action plan Progress report 2020 Welsh language technology action plan: Progress report 2020 Audience All those interested in ensuring that the Welsh language thrives digitally. Overview This report reviews progress with work packages of the Welsh Government’s Welsh language technology action plan between its October 2018 publication and the end of 2020. The Welsh language technology action plan derives from the Welsh Government’s strategy Cymraeg 2050: A million Welsh speakers (2017). Its aim is to plan technological developments to ensure that the Welsh language can be used in a wide variety of contexts, be that by using voice, keyboard or other means of human-computer interaction. Action required For information. Further information Enquiries about this document should be directed to: Welsh Language Division Welsh Government Cathays Park Cardiff CF10 3NQ e-mail: [email protected] @cymraeg Facebook/Cymraeg Additional copies This document can be accessed from gov.wales Related documents Prosperity for All: the national strategy (2017); Education in Wales: Our national mission, Action plan 2017–21 (2017); Cymraeg 2050: A million Welsh speakers (2017); Cymraeg 2050: A million Welsh speakers, Work programme 2017–21 (2017); Welsh language technology action plan (2018); Welsh-language Technology and Digital Media Action Plan (2013); Technology, Websites and Software: Welsh Language Considerations (Welsh Language Commissioner, 2016) Mae’r ddogfen yma hefyd ar gael yn Gymraeg. This document is also available in Welsh.
    [Show full text]
  • Perceptual Dialectology and GIS in Kurdish 1
    Perceptual Dialectology and GIS in Kurdish 1 Full title: A perceptual dialectological approach to linguistic variation and spatial analysis of Kurdish varieties Main Author: Eva Eppler, PhD, RCSLT, Mag. Phil Reader/Associate Professor in Linguistics Department of Media, Culture and Language University of Roehampton | London | SW15 5SL [email protected] | www.roehampton.ac.uk Tel: +44 (0) 20 8392 3791 Co-author: Josef Benedikt, PhD, Mag.rer.nat. Independent Scholar, Senior GIS Researcher GeoLogic Dr. Benedikt Roegergasse 11/18 1090 Vienna, Austria [email protected] | www.geologic.at Short Title: Perceptual Dialectology and GIS in Kurdish Perceptual Dialectology and GIS in Kurdish 2 Abstract: This paper presents results of a first investigation into Kurdish linguistic varieties and their spatial distribution. Kurdish dialects are used across five nation states in the Middle East and only one, Sorani, has official status in one of them. The study employs the ‘draw-a-map task’ established in Perceptual Dialectology; the analysis is supported by Geographical Information Systems (GIS). The results show that, despite the geolinguistic and geopolitical situation, Kurdish respondents have good knowledge of the main varieties of their language (Kurmanji, Sorani and the related variety Zazaki) and where to localize them. Awareness of the more diverse Southern Kurdish varieties is less definitive. This indicates that the Kurdish language plays a role in identity formation, but also that smaller isolated varieties are not only endangered in terms of speakers, but also in terms of their representations in Kurds’ mental maps of the linguistic landscape they live in. Acknowledgments: This work was supported by a Santander and by Ede & Ravenscroft Research grant 2016.
    [Show full text]
  • Learning to Read by Spelling Towards Unsupervised Text Recognition
    Learning to Read by Spelling Towards Unsupervised Text Recognition Ankush Gupta Andrea Vedaldi Andrew Zisserman Visual Geometry Group Visual Geometry Group Visual Geometry Group University of Oxford University of Oxford University of Oxford [email protected] [email protected] [email protected] tttttttttttttttttttttttttttttttttttttttttttrssssss ttttttt ny nytt nr nrttttttt ny ntt nrttttttt zzzz iterations iterations bcorote to whol th ticunthss tidio tiostolonzzzz trougfht to ferr oy disectins it has dicomered Training brought to view by dissection it was discovered Figure 1: Text recognition from unaligned data. We present a method for recognising text in images without using any labelled data. This is achieved by learning to align the statistics of the predicted text strings, against the statistics of valid text strings sampled from a corpus. The figure above visualises the transcriptions as various characters are learnt through the training iterations. The model firstlearns the concept of {space}, and hence, learns to segment the string into words; followed by common words like {to, it}, and only later learns to correctly map the less frequent characters like {v, w}. The last transcription also corresponds to the ground-truth (punctuations are not modelled). The colour bar on the right indicates the accuracy (darker means higher accuracy). ABSTRACT 1 INTRODUCTION This work presents a method for visual text recognition without read (ri:d) verb • Look at and comprehend the meaning of using any paired supervisory data. We formulate the text recogni- (written or printed matter) by interpreting the characters or tion task as one of aligning the conditional distribution of strings symbols of which it is composed.
    [Show full text]
  • Detecting Personal Life Events from Social Media
    Open Research Online The Open University’s repository of research publications and other research outputs Detecting Personal Life Events from Social Media Thesis How to cite: Dickinson, Thomas Kier (2019). Detecting Personal Life Events from Social Media. PhD thesis The Open University. For guidance on citations see FAQs. c 2018 The Author https://creativecommons.org/licenses/by-nc-nd/4.0/ Version: Version of Record Link(s) to article on publisher’s website: http://dx.doi.org/doi:10.21954/ou.ro.00010aa9 Copyright and Moral Rights for the articles on this site are retained by the individual authors and/or other copyright owners. For more information on Open Research Online’s data policy on reuse of materials please consult the policies page. oro.open.ac.uk Detecting Personal Life Events from Social Media a thesis presented by Thomas K. Dickinson to The Department of Science, Technology, Engineering and Mathematics in partial fulfilment of the requirements for the degree of Doctor of Philosophy in the subject of Computer Science The Open University Milton Keynes, England May 2019 Thesis advisor: Professor Harith Alani & Dr Paul Mulholland Thomas K. Dickinson Detecting Personal Life Events from Social Media Abstract Social media has become a dominating force over the past 15 years, with the rise of sites such as Facebook, Instagram, and Twitter. Some of us have been with these sites since the start, posting all about our personal lives and building up a digital identify of ourselves. But within this myriad of posts, what actually matters to us, and what do our digital identities tell people about ourselves? One way that we can start to filter through this data, is to build classifiers that can identify posts about our personal life events, allowing us to start to self reflect on what we share online.
    [Show full text]
  • Natural Language Toolkit Based Morphology Linguistics
    Published by : International Journal of Engineering Research & Technology (IJERT) http://www.ijert.org ISSN: 2278-0181 Vol. 9 Issue 01, January-2020 Natural Language Toolkit based Morphology Linguistics Alifya Khan Pratyusha Trivedi Information Technology Information Technology Vidyalankar Institute of Technology Vidyalankar Institute of Technology, Mumbai, India Mumbai, India Karthik Ashok Prof. Kanchan Dhuri Information Technology, Information Technology, Vidyalankar Institute of Technology, Vidyalankar Institute of Technology, Mumbai, India Mumbai, India Abstract— In the current scenario, there are different apps, of text to different languages, grammar check, etc. websites, etc. to carry out different functionalities with respect Generally, all these functionalities are used in tandem with to text such as grammar correction, translation of text, each other. For e.g., to read a sign board in a different extraction of text from image or videos, etc. There is no app or language, the first step is to extract text from that image and a website where a user can get all these functions/features at translate it to any respective language as required. Hence, to one place and hence the user is forced to install different apps do this one has to switch from application to application or visit different websites to carry out those functions. The which can be time consuming. To overcome this problem, proposed system identifies this problem and tries to overcome an integrated environment is built where all these it by providing various text-based features at one place, so that functionalities are available. the user will not have to hop from app to app or website to website to carry out various functions.
    [Show full text]
  • Design and Implementation of an Open Source Greek Pos Tagger and Entity Recognizer Using Spacy
    DESIGN AND IMPLEMENTATION OF AN OPEN SOURCE GREEK POS TAGGER AND ENTITY RECOGNIZER USING SPACY A PREPRINT Eleni Partalidou1, Eleftherios Spyromitros-Xioufis1, Stavros Doropoulos1, Stavros Vologiannidis2, and Konstantinos I. Diamantaras3 1DataScouting, 30 Vakchou Street, Thessaloniki, 54629, Greece 2Dpt of Computer, Informatics and Telecommunications Engineering, International Hellenic University, Terma Magnisias, Serres, 62124, Greece 3Dpt of Informatics and Electronic Engineering, International Hellenic University, Sindos, Thessaloniki, 57400, Greece ABSTRACT This paper proposes a machine learning approach to part-of-speech tagging and named entity recog- nition for Greek, focusing on the extraction of morphological features and classification of tokens into a small set of classes for named entities. The architecture model that was used is introduced. The greek version of the spaCy platform was added into the source code, a feature that did not exist before our contribution, and was used for building the models. Additionally, a part of speech tagger was trained that can detect the morphologyof the tokens and performs higher than the state-of-the-art results when classifying only the part of speech. For named entity recognition using spaCy, a model that extends the standard ENAMEX type (organization, location, person) was built. Certain experi- ments that were conducted indicate the need for flexibility in out-of-vocabulary words and there is an effort for resolving this issue. Finally, the evaluation results are discussed. Keywords Part-of-Speech Tagging, Named Entity Recognition, spaCy, Greek text. 1. Introduction In the research field of Natural Language Processing (NLP) there are several tasks that contribute to understanding arXiv:1912.10162v1 [cs.CL] 5 Dec 2019 natural text.
    [Show full text]
  • Spelling Correction: from Two-Level Morphology to Open Source
    Spelling Correction: from two-level morphology to open source Iñaki Alegria, Klara Ceberio, Nerea Ezeiza, Aitor Soroa, Gregorio Hernandez Ixa group. University of the Basque Country / Eleka S.L. 649 P.K. 20080 Donostia. Basque Country. [email protected] Abstract Basque is a highly inflected and agglutinative language (Alegria et al., 1996). Two-level morphology has been applied successfully to this kind of languages and there are two-level based descriptions for very different languages. After doing the morphological description for a language, it is easy to develop a spelling checker/corrector for this language. However, what happens if we want to use the speller in the "free world" (OpenOffice, Mozilla, emacs, LaTeX, ...)? Ispell and similar tools (aspell, hunspell, myspell) are the usual mechanisms for these purposes, but they do not fit the two-level model. In the absence of two-level morphology based mechanisms, an automatic conversion from two-level description to hunspell is described in this paper. previous work and reuse the morphological information. 1. Introduction Two-level morphology (Koskenniemi, 1983; Beesley & Karttunen, 2003) has been applied successfully to the 2. Free software for spelling correction morphological description of highly inflected languages. Unfortunately there are not open source tools for spelling There are two-level based descriptions for very different correction with these features: languages (English, German, Swedish, French, Spanish, • It is standardized in the most of the applications Danish, Norwegian, Finnish, Basque, Russian, Turkish, (OpenOffice, Mozilla, emacs, LaTeX, ...). Arab, Aymara, Swahili, etc.). • It is based on the two-level morphology. After doing the morphological description, it is easy to The spell family of spell checkers (ispell, aspell, myspell) develop a spelling checker/corrector for the language fits the first condition but not the second.
    [Show full text]
  • Translation, Sentiment and Voices: a Computational Model to Translate and Analyze Voices from Real-Time Video Calling
    Translation, Sentiment and Voices: A Computational Model to Translate and Analyze Voices from Real-Time Video Calling Aneek Barman Roy School of Computer Science and Statistics Department of Computer Science Thesis submitted for the degree of Master in Science Trinity College Dublin 2019 List of Tables 1 Table 2.1: The table presents a count of studies 22 2 Table 2.2: The table present a count of studies 23 3 Table 3.1: Europarl Training Corpus 31 4 Table 3.2: VidALL Neural Machine Translation Corpus 32 5 Table 3.3: Language Translation Model Specifications 33 6 Table 4.1: Speech Analysis Corpus Statistics 42 7 Table 4.2: Sentences Given to Speakers 43 8 Table 4.3: Hacki’s Vocal Intensity Comparators (Hacki, 1996) 45 9 Table 4.4: Model Vocal Intensity Comparators 45 10 Table 4.5: Sound Intensities for Sentence 1 and Sentence 2 46 11 Table 4.6: Classification of Sentiment Scores 47 12 Table 4.7: Sentiment Scores for Five Sentences 47 13 Table 4.8: Simple RNN-based Model Summary 52 14 Table 4.9: Simple RNN-based Model Results 52 15 Table 4.10: Embedded-based RNN Model Summary 53 16 Table 4.11: Embedded-RNN based Model Results 54 17 Table 5.1: Sentence A Evaluation of RNN and Embedded-RNN Predictions from 20 Epochs 63 18 Table 5.2: Sentence B Evaluation of RNN and Embedded-RNN Predictions from 20 Epochs 63 19 Table 5.3: Evaluation on VidALL Speech Transcriptions 64 v List of Figures 1 Figure 2.1: Google Neural Machine Translation Architecture (Wu et al.,2016) 10 2 Figure 2.2: Side-by-side Evaluation of Google Neural Translation Model (Wu
    [Show full text]
  • A Natural Language Processing System for Extracting Evidence of Drug Repurposing from Scientific Publications
    The Thirty-Second Innovative Applications of Artificial Intelligence Conference (IAAI-20) A Natural Language Processing System for Extracting Evidence of Drug Repurposing from Scientific Publications Shivashankar Subramanian,1,2 Ioana Baldini,1 Sushma Ravichandran,1 Dmitriy A. Katz-Rogozhnikov,1 Karthikeyan Natesan Ramamurthy,1 Prasanna Sattigeri,1 Kush R. Varshney,1 Annmarie Wang,3,4 Pradeep Mangalath,3,5 Laura B. Kleiman3 1IBM Research, Yorktown Heights, NY, USA, 2University of Melbourne, VIC, Australia, 3Cures Within Reach for Cancer, Cambridge, MA, USA, 4Massachusetts Institute of Technology, Cambridge, MA, USA, 5Harvard Medical School, Boston, MA, USA Abstract (Pantziarka et al. 2017; Bouche, Pantziarka, and Meheus 2017; Verbaanderd et al. 2017). However, manual review to More than 200 generic drugs approved by the U.S. Food and Drug Administration for non-cancer indications have shown identify and analyze potential evidence is time-consuming promise for treating cancer. Due to their long history of safe and intractable to scale, as PubMed indexes millions of ar- patient use, low cost, and widespread availability, repurpos- ticles and the collection is continuously updated. For exam- ing of these drugs represents a major opportunity to rapidly ple, the PubMed query Neoplasms [mh] yields more than improve outcomes for cancer patients and reduce healthcare 3.2 million articles.1 Hence there is a need for a (semi-) au- costs. In many cases, there is already evidence of efficacy for tomatic approach to identify relevant scientific articles and cancer, but trying to manually extract such evidence from the provide an easily consumable summary of the evidence. scientific literature is intractable.
    [Show full text]