A Rule-Based Normalization System for Greek Noisy User-Generated Text
Recommended publications
-
Greek Numbers 05/10/2006 12:02 PM
History topic: Greek number systems. There were no single Greek national standards in the first millennium BC, since the various island states prided themselves on their independence. This meant that they each had their own currency, weights and measures etc. These in turn led to small differences in the number system between different states since a major function of a number system in ancient times was to handle business transactions. However, we will not go into sufficient detail in this article to examine the small differences between the system in separate states but rather we will look at its general structure. We should say immediately that the ancient Greeks had different systems for cardinal numbers and ordinal numbers so we must look carefully at what we mean by Greek number systems. Also we shall look briefly at some systems proposed by various Greek mathematicians but not widely adopted. The first Greek number system we examine is their acrophonic system, which was in use in the first millennium BC. 'Acrophonic' means that the symbols for the numerals come from the first letter of the number name, so the symbol has come from an abbreviation of the word which is used for the number. Here are the symbols for the numbers 5, 10, 100, 1000, 10000. [Figure: acrophonic symbols for 5, 10, 100, 1000, 10000.] We have omitted the symbol for 'one', a simple '|', which was an obvious notation not coming from the initial letter of a number. For 5, 10, 100, 1000, 10000 there will be only one puzzle for the reader and that is the symbol for 5 which should be P if it was the first letter of Pente. -
Preliminary Version of the Text Analysis Component, Including: Ner, Event Detection and Sentiment Analysis
D4.1 PRELIMINARY VERSION OF THE TEXT ANALYSIS COMPONENT, INCLUDING: NER, EVENT DETECTION AND SENTIMENT ANALYSIS
Grant Agreement nr: 611057
Project acronym: EUMSSI
Start date of project (dur.): December 1st 2013 (36 months)
Document due date: November 30th 2014 (+60 days) (12 months)
Actual date of delivery: December 2nd 2014
Leader: GFAI
Reply to: [email protected]
Document status: Submitted
Project co-funded by ICT-7th Framework Programme from the European Commission
Project ref. no.: 611057
Project full title: Event Understanding through Multimodal Social Stream Interpretation
Document name: EUMSSI_D4.1_Preliminary version of the text analysis component.pdf
Security (distribution level): PU – Public
Contractual date of delivery: November 30th 2014 (+60 days)
Deliverable name: Preliminary Version of the Text Analysis Component, including: NER, event detection and sentiment analysis
Type: P – Prototype
Status: Submitted
Version number: v1
Number of pages: 60
WP/Task responsible: GFAI / GFAI & UPF
Author(s): Susanne Preuss (GFAI), Maite Melero (UPF)
Other contributors: Mahmoud Gindiyeh (GFAI), Eelco Herder (LUH), Giang Tran Binh (LUH), Jens Grivolla (UPF)
EC Project Officer: Mrs. Aleksandra WESOLOWSKA [email protected]
Abstract: The deliverable reports on the resources and tools that have been gathered and installed for the preliminary version of the text analysis component.
Keywords: Text analysis component, Named Entity Recognition, Named Entity Linking, Keyphrase Extraction, Relation Extraction, Topic modelling, Sentiment Analysis
Circulated to partners: Yes
Peer review completed: Yes
Peer-reviewed by: Eelco Herder (L3S)
Coordinator approval: Yes
Table of Contents 1. -
+ Natali A, Professor of Cartqraphy, the Hebreu Uhiversity of -Msalem, Israel DICTIONARY of Toponymfc TERLMINO~OGY Wtaibynafiail~
United Nations Group of E%perts OR Working Paper 4eographicalNames No. 61 Eighteenth Session Geneva, u-23 August1996 Item7 of the E%ovisfonal Agenda REPORTSOF THE WORKINGGROUPS + Natali a, Professor of Cartqraphy, The Hebreu UhiVersity of -msalem, Israel DICTIONARY OF TOPONYMfC TERLMINO~OGY WtaIbyNafiaIl~- . PART I:RaLsx vbim 3.0 upi8elfuiyl9!J6 . 001 . 002 003 004 oo!l 006 007 . ooa 009 010 . ol3 014 015 sequala~esfocJphabedcsaipt. 016 putting into dphabetic order. see dso Kqucna ruIt!% Qphabctk 017 Rtlpreat8Ii00, e.g. ia 8 computer, wflich employs ooc only numm ds but also fetters. Ia a wider sense. aIso anploying punauatiocl tnarksmd-SymboIs. 018 Persod name. Esamples: Alfredi ‘Ali. 019 022 023 biliaw 024 02s seecIass.f- 026 GrqbicsymboIusedurunitiawrIdu~morespedficaty,r ppbic symbol in 1 non-dphabedc writiog ryste.n& Exmlptes: Chinese ct, , thong; Ambaric u , ha: Japaoese Hiragana Q) , no. 027 -.modiGed Wnprehauive term for cheater. simplified aad character, varIaoL 031 CbmJnyol 032 CISS, featm? 033 cQdedrepfwltatiul 034 035 036 037' 038 039 040 041 042 047 caavasion alphabet 048 ConMQo table* 049 0nevahte0frpointinlhisgr8ti~ . -.- w%idofplaaecoordiaarurnm;aingoftwosetsofsnpight~ -* rtcight8ngfIertoeachotkrodwithap8ltKliuofl8qthonbo&. rupenmposedonr(chieflytopogtaphtc)map.see8lsouTM gz 051 see axxdimtes. rectangufar. 052 A stahle form of speech, deriyed from a pbfgin, which has became the sole a ptincipal language of 8 qxech comtnunity. Example: Haitian awle (derived from Fresh). ‘053 adllRaIfeatlue see feature, allhlral. 054 055 * 056 057 Ac&uioaofsoftwamrcqkdfocusingrdgRaIdatabmem rstoauMe~osctlto~thisdatabase. 058 ckalog of defItitioas of lbe contmuofadigitaldatabase.~ud- hlg data element cefw labels. f0mw.s. internal refm codMndtextemty,~well~their-p,. 059 see&tadichlq. 060 DeMptioa of 8 basic unit of -Lkatifiile md defiile informatioa tooccqyrspecEcdataf!eldinrcomputernxaxtLExampk Pateofmtifii~ofluwtby~namaturhority’. -
Information Retrieval and Web Search
Information Retrieval and Web Search: Text processing. Instructor: Rada Mihalcea. (Note: Some of the slides in this slide set were adapted from an IR course taught by Prof. Ray Mooney at UT Austin.)
[Figure: IR System Architecture]
Text Processing Pipeline: documents to be indexed (or a user query), e.g. "Friends, Romans, countrymen." → Tokenizer → token stream: Friends, Romans, Countrymen → Linguistic modules → modified tokens: friend, roman, countryman → Indexer → inverted index (friend: 2, 4; roman: 1, 2; countryman: 13, 16).
From Text to Tokens to Terms
• Tokenization = segmenting text into tokens:
• token = a sequence of characters, in a particular document at a particular position.
• type = the class of all tokens that contain the same character sequence.
• Example: across “... to be or not to be ...” and “... so be it, he said ...”, the word “be” occurs as 3 tokens but 1 type.
• term = a (normalized) type that is included in the IR dictionary.
• Example
• text = “I slept and then I dreamed”
• tokens = I, slept, and, then, I, dreamed
• types = I, slept, and, then, dreamed
• terms = sleep, dream (after stopword removal and normalization).
Simple Tokenization
• Analyze text into a sequence of discrete tokens (words).
• Sometimes punctuation (e-mail), numbers (1999), and case (Republican vs. republican) can be a meaningful part of a token. – However, frequently they are not.
• Simplest approach is to ignore all numbers and punctuation and use only case-insensitive -
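The token / type / term distinction in the slide excerpt above can be sketched in a few lines of Python. This is only an illustration, not the course's implementation: the stopword set and the lemma table are hypothetical toy resources chosen to reproduce the slide's "I slept and then I dreamed" example, and the function names are made up for this sketch.

```python
import re

# Hypothetical toy resources, chosen only to reproduce the slide's example.
STOPWORDS = {"i", "and", "then"}
LEMMAS = {"slept": "sleep", "dreamed": "dream"}


def tokenize(text):
    """Split text into word tokens, dropping punctuation and whitespace."""
    return re.findall(r"\w+", text)


def types_of(tokens):
    """Distinct character sequences, in order of first appearance."""
    return list(dict.fromkeys(tokens))


def terms_of(tokens):
    """Normalized types kept for the index: case-fold, drop stopwords, lemmatize."""
    terms = []
    for tok in types_of(tokens):
        norm = tok.lower()
        if norm in STOPWORDS:
            continue
        term = LEMMAS.get(norm, norm)
        if term not in terms:
            terms.append(term)
    return terms


text = "I slept and then I dreamed"
tokens = tokenize(text)
print(tokens)            # six tokens, "I" appears twice
print(types_of(tokens))  # five types, "I" counted once
print(terms_of(tokens))  # two terms after stopword removal and lemmatization
```

A real system would replace the toy tables with a proper stopword list and a stemmer or lemmatizer; the point here is only that each stage shrinks the previous one.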
In Search of Fundamentals to Resist Ethnic Calamities and Maintain National Integrity
Scholarship Report – L. Picha, Meiji Jingu (明治神宮) & Shiseikan (至誠館). In Search of fundamentals to resist ethnic calamities and maintain national integrity. Lefkothea Picha, June-July 2013.
Contents
Acknowledgements and impressions
a. Cultural trip at Izumo Taisha and Matsue city
Introduction: Japan’s latest Tsunami versus Greek financial crisis reflecting national ethos
Part 1 The Historic Horizon in Greece
a. Classical Period
b. Persian wars
c. Alexander the Great’s Empire
d. Roman and Medieval Greece
e. The Byzantine Period
f. The Ottoman domination
g. Commentary on the Byzantine epoch and Ottoman occupation
h. World War II
i. Greece after World War II
j. Restoration of Democracy and Greek Politics in the era of Financial crisis
Part 2 The liturgical and spiritual Greek ethos
a. Greek mythology, the ancient Greek philosophy and Shinto
b. Orthodox theology, Christian ethics and Shinto
c. Purification process – Katharmos in ancient Greece, Christian Baptism and Misogi
Part 3 Greek warriors’ ethos and Reflections on Bushido
a. Battle of Thermopylae (480 BC)
b. Ancient Greek warriors’ armor
c. Comparison of Spartan soldiers and Samurai
d. The motto “freedom or death” and the Greek anthem
e. Monument of the unknown soldier and Yasukuni shrine
f. Women warriors and their supportive role against invaders
g. Reflections on Bushido and its importance in the modern era
h. Personal training in Budo and relation with Shiseikan
Part 4 Personal view on Greek nation’s metamorphosis
a. From the illustrious ancestors to the cultural decay. Is catastrophe a chance to revive Greek nation?
Acknowledgments and impressions
I would like to thank Araya Kancho for the scholarship received. -
Sirius - Wikipedia. Coordinates: 06h 45m 08.9173s, −16° 42′ 58.017″
Sirius (/ˈsɪriəs/, a romanization of Greek Σείριος, Seirios, lit. "glowing" or "scorching") is a star system and the brightest star in the Earth's night sky. With a visual apparent magnitude of −1.46, it is almost twice as bright as Canopus, the next brightest star. The system has the Bayer designation Alpha Canis Majoris (α CMa). What the naked eye perceives as a single star is a binary star system, consisting of a white main-sequence star of spectral type A0 or A1, termed Sirius A, and a faint white dwarf companion of spectral type DA2, called Sirius B. The distance separating Sirius A from its companion varies between 8.2 and 31.5 AU.[24] Sirius appears bright because of its intrinsic luminosity and its proximity to Earth. At a distance of 2.6 parsecs (8.6 ly), as determined by the Hipparcos astrometry satellite,[2][25][26] the Sirius system is one of Earth's near neighbours. Sirius is gradually moving closer to the Solar System, so it will slightly increase in brightness over the next 60,000 years. After that time its distance will begin to increase and it will become fainter, but it will continue to be the brightest star in the Earth's night sky for the next 210,000 years.[27] Sirius A is about twice as massive as the Sun (M☉) and has an absolute visual magnitude of 1.42. -
Pre-Proto-Iranians of Afghanistan As Initiators of Sakta Tantrism: on the Scythian/Saka Affiliation of the Dasas, Nuristanis and Magadhans
Iranica Antiqua, vol. XXXVII, 2002. PRE-PROTO-IRANIANS OF AFGHANISTAN AS INITIATORS OF SAKTA TANTRISM: ON THE SCYTHIAN/SAKA AFFILIATION OF THE DASAS, NURISTANIS AND MAGADHANS, by Asko PARPOLA (Helsinki). 1. Introduction 1.1 Preliminary notice Professor C. C. Lamberg-Karlovsky is a scholar striving at integrated understanding of wide-ranging historical processes, extending from Mesopotamia and Elam to Central Asia and the Indus Valley (cf. Lamberg-Karlovsky 1985; 1996) and even further, to the Altai. The present study has similar ambitions and deals with much the same area, although the approach is from the opposite direction, north to south. I am grateful to Dan Potts for the opportunity to present the paper in Karl's Festschrift. It extends and complements another recent essay of mine, ‘From the dialects of Old Indo-Aryan to Proto-Indo-Aryan and Proto-Iranian’, to appear in a volume in the memory of Sir Harold Bailey (Parpola in press a). To compensate for that wider framework which otherwise would be missing here, the main conclusions are summarized (with some further elaboration) below in section 1.2. Some fundamental ideas elaborated here were presented for the first time in 1988 in a paper entitled ‘The coming of the Aryans to Iran and India and the cultural and ethnic identity of the Dasas’ (Parpola 1988). Briefly stated, I suggested that the fortresses of the inimical Dasas raided by Ṛgvedic Aryans in the Indo-Iranian borderlands have an archaeological counterpart in the Bronze Age ‘temple-fort’ of Dashly-3 in northern Afghanistan, and that those fortresses were the venue of the autumnal festival of the protoform of Durga, the feline-escorted Hindu goddess of war and victory, who appears to be of ancient Near Eastern origin. -
Constructing a Lexicon of Arabic-English Named Entity Using SMT and Semantic Linked Data
Constructing a Lexicon of Arabic-English Named Entity using SMT and Semantic Linked Data. Emna Hkiri, Souheyl Mallat, and Mounir Zrigui, LaTICE Laboratory, Faculty of Sciences of Monastir, Tunisia. Abstract: Named entity recognition is the problem of locating and categorizing atomic entities in a given text. In this work, we used DBpedia Linked datasets and combined existing open source tools to generate from a parallel corpus a bilingual lexicon of Named Entities (NE). To annotate NE in the monolingual English corpus, we used linked data entities by mapping them to GATE gazetteers. In order to translate entities identified by the GATE tool from the English corpus, we used Moses, a statistical machine translation system. The construction of the Arabic-English named entities lexicon is based on the results of the Moses translation. Our method is fully automatic and aims to help Natural Language Processing (NLP) tasks such as machine translation, information retrieval, text mining and question answering. Our lexicon contains 48753 pairs of Arabic-English NE; it is freely available for use by other researchers. Keywords: Named Entity Recognition (NER), Named entity translation, Parallel Arabic-English lexicon, DBpedia, linked data entities, parallel corpus, SMT. Received April 1, 2015; accepted October 7, 2015.
1. Introduction
Named Entity Recognition (NER) consists in recognizing and categorizing all the Named Entities (NE) in a given text, corresponding to two intuitive classes: proper names (persons, locations and organizations) and numeric expressions (time, date, money and percent). Section 5 concludes the paper with directions for future work.
2. State of the art
Written and spoken Arabic is considered as difficult language to apprehend in the domain of Natural -
Task Force for the Review of the Romanization of Greek RE: Report of the Task Force
CC:DA/TF/ Review of the Romanization of Greek/3 Report, May 18, 2010 page: 1 TO: ALA/ALCTS/CCS/Committee on Cataloging: Description and Access (CC:DA) FROM: ALA/ALCTS/CCS/CC:DA Task Force for the Review of the Romanization of Greek RE: Report of the Task Force CHARGE TO THE TASK FORCE The Task Force is charged with assessing draft Romanization tables for Greek, educating CC:DA as necessary, and preparing necessary reports to support the revision process, leading to ultimate approval of an updated ALA-LC Romanization scheme for Greek. In particular, the Task Force should review the May 2010 draft for a timely report by ALA to LC. Review of subsequent tables may be called for, depending on the viability of this latest draft. The ALA-LC Romanization table - Greek, Proposed Revision May 2010 is located at the LC Policy and Standards Division website at: http://www.loc.gov/catdir/cpso/romanization/greekrev.pdf [archived as a supplement to this report on the CC:DA site] BACKGROUND INFORMATION FROM THE LIBRARY OF CONGRESS We note that when the May 2010 Greek table was presented for general review via email, the LC Policy and Standards Division offered the following information comparing the May 2010 table with the existing table, Greek (Also Coptic), available at the LC policy and Standards Division web site at: http://www.loc.gov/catdir/cpso/romanization/greek.pdf: "The Policy and Standards Division has taken another look at the revised Greek Romanization tables in conjunction with comments from the library community and its own staff with knowledge of Greek. -
The Active Imperfect of the Verbs of the '2Nd Conjugation'
The active imperfect of the verbs of the ‘2nd conjugation’ in the Peloponnesian varieties of Modern Greek* Nikolaos Pantelidis, Demokritus University of Thrace. The present paper treats the different types of formation and the inflectional patterns of the active imperfect of the verbs that in traditional grammar are known as verbs of the ‘2nd conjugation’ in the Peloponnesian varieties of Modern Greek (except Tsakonian and Maniot), mainly from a diachronic point of view. A reconstruction of the processes that led to the current situation is attempted and directions for further possible changes are suggested. The diachrony of the morphology of the imperfect of the ‘2nd conjugation’ in the Peloponnesian varieties involves developments such as morphologization of a phonological process and the evolution of number-oriented allomorphy at the level of aspectual markers, while at the same time offering interesting insights into the mechanisms and scope of morphological changes and the morphological structure of the Modern Greek verb. These developments can also offer important -
The Greek Alphabet Sight and Sounds of the Greek Letters (Module B) the Letters and Pronunciation of the Greek Alphabet 2 Phonology (Part 2)
The Greek Alphabet Sight and Sounds of the Greek Letters (Module B) The Letters and Pronunciation of the Greek Alphabet 2 Phonology (Part 2) Lesson Two Overview 2.0 Introduction, 2-1 2.1 Ten Similar Letters, 2-2 2.2 Six Deceptive Greek Letters, 2-4 2.3 Nine Different Greek Letters, 2-8 2.4 History of the Greek Alphabet, 2-13 Study Guide, 2-20 2.0 Introduction Lesson One introduced the twenty-four letters of the Greek alphabet. Lesson Two continues to present the building blocks for learning Greek phonics by merging vowels and consonants into syllables. Furthermore, this lesson underscores the similarities and dissimilarities between the Greek and English alphabetical letters and their phonemes. Almost without exception, introductory Greek grammars launch into grammar and vocabulary without first firmly grounding a student in the Greek phonemic system. This approach is appropriate if a teacher is present. However, it is little help for those who are “going at it alone,” or a small group who are learning NTGreek without the aid of a teacher’s pronunciation. This grammar’s introductory lessons go to great lengths to present a full-orbed pronunciation of the Erasmian Greek phonemic system. Those who are new to the Greek language without an instructor’s guidance will welcome this help, and it will prepare them to read Greek and not simply to translate it into their language. The phonic sounds of the Greek language are required to be carefully learned. A saturation of these sounds may be accomplished by using the accompanying MP3 audio files. -
Fasttext-Based Intent Detection for Inflected Languages †
information Article FastText-Based Intent Detection for Inflected Languages † Kaspars Balodis 1,2,* and Daiga Deksne 1 1 Tilde, Vienības gatve 75A, LV-1004 Rīga, Latvia; [email protected] 2 Faculty of Computing, University of Latvia, Raiņa blvd. 19, LV-1586 Rīga, Latvia * Correspondence: [email protected] † This paper is an extended version of our paper published in 18th International Conference AIMSA 2018, Varna, Bulgaria, 12–14 September 2018. Received: 15 January 2019; Accepted: 25 April 2019; Published: 1 May 2019 Abstract: Intent detection is one of the main tasks of a dialogue system. In this paper, we present our intent detection system that is based on fastText word embeddings and a neural network classifier. We find an improvement in fastText sentence vectorization, which, in some cases, shows a significant increase in intent detection accuracy. We evaluate the system on languages commonly spoken in Baltic countries: Estonian, Latvian, Lithuanian, English, and Russian. The results show that our intent detection system provides state-of-the-art results on three previously published datasets, outperforming many popular services. In addition to this, for Latvian, we explore how the accuracy of intent detection is affected if we normalize the text in advance. Keywords: intent detection; word embeddings; dialogue system 1. Introduction and Related Work Recent developments in deep learning have made neural networks the mainstream approach for a wide variety of tasks, ranging from image recognition to price forecasting to natural language processing. In natural language processing, neural networks are used for speech recognition and generation, machine translation, text classification, named entity recognition, text generation, and many other tasks.