A Rule-Based Normalization System for Greek Noisy User-Generated Text


Marsida Toska

Uppsala University, Department of Linguistics and Philology

Master Programme in Language Technology

Master’s Thesis in Language Technology, 30 ECTS credits

November 9, 2020

Supervisor: Eva Pettersson

Abstract

The ever-growing usage of social media platforms generates vast amounts of textual data every day, which could potentially serve as a great source of information. Therefore, mining user-generated data for commercial, academic, or other purposes has already attracted the interest of the research community. However, the informal writing which often characterizes online user-generated texts poses a challenge for automatic text processing with Natural Language Processing (NLP) tools. To mitigate the effect of noise in these texts, lexical normalization has been proposed as a preprocessing method; in short, it is the task of converting non-standard word forms into their canonical forms. The present work aims to contribute to this field by developing a rule-based normalization system for Greek tweets. We perform an analysis of the categories of the out-of-vocabulary (OOV) word forms identified in the dataset and define hand-crafted rules, which we combine with edit distance (the Levenshtein distance approach) to tackle noise in the cases under scope. To evaluate the performance of the system, we perform both an intrinsic evaluation and an extrinsic one, the latter exploring the effect of normalization on part-of-speech tagging. The results of the intrinsic evaluation suggest that our system has an accuracy of approx. 95%, compared to approx. 81% for the baseline. In the extrinsic evaluation, a boost of approx. 8% in tagging performance is observed when the text has been preprocessed through lexical normalization.
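The combination of hand-crafted rules and edit distance summarized above can be illustrated with a minimal Python sketch. It is not the system developed in this thesis: the toy lexicon, the single elongation rule, the `max_distance` threshold and all function names are assumptions made purely for illustration, whereas the actual system relies on resources such as Hunspell and an n-gram corpus.

```python
import re

# Hypothetical toy lexicon (assumption); a real system would query a full dictionary.
LEXICON = {"καλημέρα", "πολύ", "σήμερα", "καλά"}


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def collapse_elongation(token: str) -> str:
    """Illustrative rule: reduce a character repeated 3+ times to a single one."""
    return re.sub(r"(.)\1{2,}", r"\1", token)


def normalize(token: str, max_distance: int = 2) -> str:
    """Apply the rule first, then fall back to the closest lexicon entry."""
    if token in LEXICON:
        return token
    token = collapse_elongation(token)
    if token in LEXICON:
        return token
    # Ties are broken alphabetically so that the result is deterministic.
    best = min(LEXICON, key=lambda word: (levenshtein(token, word), word))
    return best if levenshtein(token, best) <= max_distance else token
```

In this sketch, an elongated and unstressed form such as `καλααα` is first reduced to `καλα` by the rule and then mapped to the lexicon entry `καλά`, which is one substitution away. Tokens with no lexicon entry within the threshold are left unchanged.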
Contents

Acknowledgements
1 Introduction
  1.1 Purpose
  1.2 Outline
2 Background
  2.1 Text Normalization
    2.1.1 Normalization of Historical Texts
    2.1.2 Normalization of Texts for Text-to-Speech Systems
  2.2 Characteristics of Noisy User-Generated Texts
  2.3 Methods for the Normalization of Noisy User-Generated Texts
    2.3.1 The Rule-Based Approach
    2.3.2 Levenshtein Distance
    2.3.3 Comparison of Phonetic Similarity
    2.3.4 Statistical Machine Translation and Neural Methods
  2.4 Greek Language and Greek Tweets
    2.4.1 Overview of Greek Spelling
    2.4.2 Linguistic Phenomena in Greek Tweets
  2.5 Part-of-Speech Tagging
3 Data & Resources
  3.1 The Dataset
  3.2 Resources
    3.2.1 Hunspell
    3.2.2 PANACEA n-gram Corpus
    3.2.3 UD Pipe
4 Preprocessing
  4.1 Cleaning the Dataset
  4.2 Test Set
  4.3 Systematic Analysis of Greek Tweets
    4.3.1 Sentence Structure
    4.3.2 Lower/Upper Case
    4.3.3 Neologisms
    4.3.4 Greeklish and Engreek
    4.3.5 Stress, Contractions, Elongations, Space
    4.3.6 Non-Standard Abbreviations
    4.3.7 Misspellings
  4.4 Scope Definition
5 System Architecture
  5.1 Module 1: Regular Expressions
    5.1.1 Non-Standard Abbreviations
    5.1.2 Elongations
    5.1.3 Contractions
    5.1.4 Misjoined words
    5.1.5 Truecasing
  5.2 Module 2: Rule about Stress Restoration
    5.2.1 Rule Overview
    5.2.2 Rule Analysis
    5.2.3 Handling of Special Cases
  5.3 Module 3: Edit distance
    5.3.1 Extraction of IV subset
    5.3.2 Extraction of Candidates
    5.3.3 Final Selection of the IV Counterpart
6 Evaluation and Results
  6.1 Performance of the Rule-Based System
  6.2 Error Analysis
  6.3 Effect of Normalization on Tagging
7 Discussion
8 Conclusion

Acknowledgements

For her guidance, constructive feedback, continuous support and encouragement, I would like to warmly thank my supervisor Eva Pettersson. I also wish to thank my absolutely unique family, my boyfriend and my friends spread out all over the world for always being there for me, even when they were not.

1 Introduction

User-generated texts, such as social media texts (e.g. tweets), constitute a vast source of information for opinion and event extraction. However, most of this information is composed in a language that is notorious for its high variability in sentence structure, its extensive usage of non-standard word forms and the presence of ungrammatical linguistic units (e.g. misspelled words), all of which result in noise (Sikdar and Gambäck, 2016). Since most Natural Language Processing (NLP) tools are trained on formal texts, such as news text, it has been observed that the performance of such tools declines when they are run over informal text (Gimpel et al., 2011; O’Connor et al., 2010). Good tagging and parsing results are essential for applications such as opinion mining, information retrieval, etc. (Kong et al., 2014).
Therefore, there is a need to preprocess noisy user-generated texts (NUGT) through normalization, so that the performance of preprocessing NLP tasks, such as Part-of-Speech (POS) tagging, parsing and other subsequent ones, is not compromised significantly, if at all (Clark and Araki, 2011).

1.1 Purpose

Normalization could concisely be defined as the task of converting word tokens into their standard form (Han et al., 2013). Depending on the text genre (formal vs. informal) and its purpose (preprocessing text for Text-to-Speech (TTS) systems, information retrieval or other NLP tasks such as tagging and parsing), it poses different challenges and consequently requires a different approach as well. The purpose of the present work is to explore the extent to which the rule-based normalization of Greek social media texts, specifically Greek tweets (tweets written in the Modern Greek language), can lead to more accurate tagging results. We opted for the rule-based approach because it has proven to deliver optimal results in other languages, at least when straightforward mappings suffice (Ruiz et al., 2014; Sidarenka et al., 2013). Additionally, including the Levenshtein distance algorithm allows us to deal with any type of spelling deviation, a phenomenon which we expect to be abundant in user-generated texts such as tweets. Therefore, the research questions which we will attempt to answer in this project could be formulated as follows:

1) What are the core categories of Greek tweets that could potentially be in need of normalization?

2) How well does a rule-based system combined with Levenshtein distance perform in normalizing Greek tweets?

3) What is the effect, if any, of normalization on the tagging results of Greek tweets?

1.2 Outline

The chapters are organized as follows:

Chapter 2 provides information on the background of the task of normalization. As it is not a new field, a few of the most common areas of application and methods are described. The chapter also provides an introduction to Greek NUGT as well as a brief overview of POS tagging.

Chapter 3 introduces the dataset along with the resources that were used for the purposes of this project. Part of the data was used for the analysis of the phenomena occurring in Greek tweets and for the creation of an annotated test set, which was used for the final evaluation of the system.

Chapter 4 contains information about the preprocessing steps, such as cleaning the data, performing an analysis and categorization of non-standard word forms, and annotating the test set.

Chapter 5 describes the actual implementation of the system by giving an overview of the architecture and illustrating approaches through examples.

Chapter 6 gives an overview of the evaluation results, both of the system itself and with regard to POS tagging. At that point, the answers to the research questions resulting from this work are summarized.

Chapters 7 and 8 discuss the results and provide a brief conclusion, respectively.

2 Background

2.1 Text Normalization

Text normalization (or canonicalization or standardization) is a prevalent task in the NLP pipeline. It includes normalization subtasks such as tokenization, lemmatization, stemming or sentence segmentation, but it can also be encountered in a more complex form in situations where, for instance, out-of-vocabulary (OOV) words must be addressed by converting them into a standard (lexicon-approved) form. Lexical normalization, where the focus of this work lies, consists in preprocessing a text on the word level and transforming it into a form that can be easily analyzed and processed by other tasks in the NLP pipeline (e.g. taggers) or by downstream applications (e.g. machine translation systems), so that these produce consistent results (Han et al., 2013; Jurafsky and Martin, 2009). As one may already