Improving Exploration in Reinforcement Learning Through

Total Page:16

File Type:pdf, Size:1020Kb

Improving Exploration in Reinforcement Learning Through Extraction of Linguistic Resources from Multilingual Corpora and their Exploitation AHMAD RAZA SHAHID Ph.D. Thesis The University of York Artificial Intelligence Group Department of Computer Science United Kingdom 9th February 2012 Abstract Increasing availability of on-line and off-line multilingual resources along with the developments in the related automatic tools that can process this information, such as GIZA++ (Och & Ney 2003), has made it possible to build new multilin- gual resources that can be used for NLP/IR tasks. Lexicon generation is one such task, which if done by hand is quite expensive with human and capital costs involved. Generation of multilingual lexicons can now be automated, as is done in this research work. Wikipedia1, an on-line multilingual resource was gainfully employed to automatically build multilingual lexicons using simple search strategies. Europarl parallel corpus (Koehn 2002) was used to create multilingual sets of synonyms, that were later used to carry out the task of Word Sense Disambigua- tion (WSD) on the original corpus from which they were derived. The theoretical analysis of the methodology validated our approach. The multilingualsets of synonymswere then used to learn unsupervised mod- 1http://www.wikipedia.org/ ii Abstract iii els of word morphology in the individual languages. The set of experiments we carried out, along with another unsupervised technique, were evaluated against the gold standard. Our results compared very favorably with the other approach. The combination of the two approaches gave even better results. Contents ListofTables .............................. ix ListofFigures.............................. xi Acknowledgements . .. .. .. .. .. .. .. xii Declaration ............................... xiii 1 Introduction and Motivation 2 1.1 InitialMotivation ......................... 2 1.2 NLPandInformationRetrieval(IR) . 3 1.2.1 NLP............................ 3 1.2.2 CorporabasedApproaches. 4 1.2.3 InformationRetrieval(IR) . 6 1.2.4 MultilingualNLPandIR. 7 1.3 MultilingualResources . 8 1.4 ProblemStatements........................ 9 1.4.1 BuildingMultilingualLexicons . 9 1.4.2 CreatingandusingMultilingualSynsets . 10 1.4.3 MorphologicalAnalysis of MultilingualSynsets . 11 1.4.4 EvaluationanditsChallenges . 12 1.5 ThesisOutline........................... 12 1.6 NoteonTerminology . .. .. .. .. .. .. 13 2 Literature Review 14 2.1 Resources ............................. 15 2.1.1 Wikipedia......................... 15 2.1.2 ParallelCorpora . .. .. .. .. .. 16 2.1.3 WordNet:ALexicalSemanticResource . 18 iv CONTENTS v 2.2 NLPandIR ............................ 20 2.2.1 WordSenseDisambiguation . 20 2.2.1.1 SupervisedDisambiguation . 22 2.2.1.2 UnsupervisedDisambiguation. 26 2.2.1.3 PPAttachmentAmbiguity. 31 2.2.1.4 WordNetandWSD . 35 2.2.1.5 MultilingualDisambiguation . 37 2.2.1.6 DisambiguationinWikipedia . 38 2.2.1.7 UsingWikipediaforWSD . 40 2.2.2 Morphology ....................... 41 2.2.2.1 Analogy .................... 42 2.2.2.2 Harris’sApproach . 43 2.2.2.3 UnsupervisedApproach . 44 2.2.3 InformationRetrieval. 46 2.2.3.1 VectorSpaceModel . 47 2.2.3.2 TFIDF..................... 49 2.2.3.3 PerformanceMeasures . 50 2.2.3.4 ProbabilisticInformationRetrieval . 51 2.2.3.5 Multi-Lingual Information Retrieval (MLIR) . 53 2.2.3.6 ProbabilisticMulti-lingualIR . 56 2.2.3.7 Dictionary-BasedMLIR. 57 2.2.3.8 CorporabasedApproachesforIR . 59 2.3 MachineLearning(ML) . 60 2.3.1 Clustering......................... 60 2.3.2 MeasuresofClusteringQuality . 63 2.3.3 DecisionTrees ...................... 64 2.4 BuildingResourcesfromCorpora . 64 2.4.1 Extracting LinguisticResources from Wikipedia . 64 2.4.2 Building MultilingualLexicons and WordNets using Par- allelCorpora ....................... 68 3 Extraction of Multilingual Lexicons from Wikipedia 73 3.1 MainIdea ............................. 73 3.2 Methodology ........................... 75 3.3 LexiconGeneration. .. .. .. .. .. .. 77 3.3.1 Algorithm......................... 79 3.3.2 GeneralLexicons. 80 3.3.2.1 EBGandEGFP ................ 80 3.3.2.2 HistogramsforEBG. 82 3.3.2.3 HistogramsforEGFP . 85 3.3.2.4 Removal of Redundancy and Numeric Values 86 vi CONTENTS 3.3.2.5 HeptaLex ................... 94 3.3.3 DomainSpecificDictionaries . 96 3.3.3.1 ComputerScienceSpecificLexicon . 97 3.3.3.2 CategoryTranslations . 99 3.4 SomeProgrammingRelatedIssues . 100 3.5 AnalysisofLanguagesinWikipedia . 104 3.6 Evaluation............................. 107 3.7 Conclusion ............................ 107 4 Extraction of Multilingual Synsets from Aligned Corpora 113 4.1 MainIdea ............................. 113 4.2 Assumptions ........................... 114 4.3 ParallelCorporaandPre-processing . 115 4.3.1 StructureoftheEuroparlCorpus . 116 4.3.2 Pre-processing . 118 4.4 WordAlignment ......................... 119 4.4.1 Creation of Vocabulary and Sentence Files . 120 4.4.2 CreationofWordClasses. 122 4.5 CollationofWordsintoPhrases . 124 4.5.1 Author’sNote. 134 4.6 Disambiguation .......................... 135 4.7 Evaluation............................. 135 4.7.1 Baseline Comparisonfor Extractionof MultilingualSynsets136 4.7.1.1 ExperimentalDesign . 137 4.7.2 IssueswithEvaluation . 140 4.7.3 UsingClusteringforEvaluation . 141 4.7.3.1 Experiments . 144 4.7.4 Discussion ........................ 152 4.7.4.1 WhytheTags?. 153 4.7.4.2 Which Machine Learning Approach to Use? . 153 4.7.5 UsingDecisionTreesforEvaluation . 153 4.7.5.1 ExperimentalDesign . 155 4.8 Discussion............................. 156 4.9 ErrorAnalysis........................... 157 4.10 SemEval Parallel Corpora and Generation of Multilingual Synsets 159 4.11 TheoreticalAnalysis . 161 4.11.1 SenseInventory. 161 4.11.2 GoldStandard . 163 4.11.3 Methodology . 164 4.11.4 Experiments . 165 4.11.5 Baseline Comparison for Extraction of Multilingual Synsets174 CONTENTS vii 4.12 Discussion............................. 175 4.13 Conclusion ............................ 175 5 Morphology and Lexical Distances 179 5.1 MainIdea ............................. 179 5.2 MorphologicalAnalysis. 180 5.3 Experiments............................ 181 5.4 EditDistance ........................... 183 5.4.1 CalculatingEditDistances . 184 5.4.2 Edit Distances between Multilingual Proto-Synsets . 186 5.5 LookingforWordParadigms . 188 5.5.1 MergingParadigms. 191 5.5.2 Merging Paradigms based on Common Number of Stems 194 5.5.2.1 Signature Refinement and Merging Paradigms basedonCommonNumberofStems . 197 5.5.3 Discussion ........................ 199 5.6 Further Experimentsin MultilingualMorphology . 201 5.7 EvaluationofMorphologicalAnalysis . 201 5.7.1 Evaluation ........................ 203 5.7.2 SegmentationwithSupport. 204 5.7.3 AnalogyPrinciple . 205 5.7.4 Results .......................... 205 5.8 Conclusion ............................ 206 6 Conclusion 210 6.1 Summary ............................. 210 6.1.1 Automatic Generation of MultilingualLexicons . 211 6.1.2 Extraction of Multilingual Proto-Synsets from Parallel Corpora.......................... 212 6.1.3 MorphologicalAnalysis . 213 6.2 Contributions ........................... 215 6.3 FutureWork............................ 215 References 217 List of Tables 3.1 100entrieschosenfromHeptaLex,part1 . 109 3.2 100entrieschosenfromHeptaLex,part2 . 110 3.3 100entrieschosenfromHeptaLexpart3. 111 3.4 ResultsofevaluationofHeptaLex . 112 4.1 AsamplefromEnglishpartoftheEuroparlcorpus . 117 4.2 Sample of German translationof the Englishexample . 118 4.3 SampleoftheEnglishvocabularyfile . 120 4.4 Sentence file containing 2 pairs of sentences for English-German 121 4.5 SampleofEnglish-Germaninput . 121 4.6 Examples of word alignment probability for English and German 123 4.7 TheresultofwordalignmentforEnglish-German . 123 4.8 Sentences in French and Greek corresponding to Table 4.7 . 124 4.9 Generatedmultilingualsynsets . 125 4.10 Number of English words aligned with a non-pivotal language . 128 4.11 GroupnumbersassignedtoEnglishwords . 130 4.12 Corresponding German words and their Num values. 130 4.13 Example of non-consecutive alignment of German words . 131 4.14 The German entries followingthe ones in Table 4.13 . 131 4.15 proto-synsetsforEnglishandGermanonly . 131 4.16 Corresponding entries for French for the generation of phrases . 132 4.17 Revised group numbers for English alignment with French ... 132 4.18 CorrespondinginformationforFrench . 132 4.19 Entries for Greek for English-Greek word alignment . 133 4.20 Revised group numbers for English after English-Greek alignment 133 viii LIST OF TABLES ix 4.21 Greek table after changing group number information in English 133 4.22 Phrase group information for final generation of phrases. .... 134 4.23 Thefinalphrasesgenerated. 134 4.24 ResultsofBaselineComparison . 140 4.25 Outputoftheclusteringtool. 149 4.26 Impuritymeasuresfordifferentscenarios . 151 4.27 Accuracy figures for Word Alignment when English is used as a targetlanguage .......................... 158 4.28 A sample of the sense inventory for the target word bank..... 162 4.29 Summary for all languages for the five target words in SemEval 172 4.30 Avg. figuresforFrench,Spanish,andItalian . 173 4.31 Average figures for French and Italian for the target words. 173 4.32 Average figures for Spanish and Italian for the target words.. 173 5.1 Anexampleofmorphologicalsyntacticvariation . 186 5.2 PairofGreeksynonyms. 186 5.3 Asampleofeditdistancesforasynset-pair . 187 5.4 Asampleofsynset-pairs . 188 5.5 Asampleoftheparadigmscreated.. 193 5.6 Numbersofparadigmscreatedvsmergethreshold .
Recommended publications
  • Grapheme-To-Lexeme Feedback in the Spelling System: Evidence from a Dysgraphic Patient
    COGNITIVE NEUROPSYCHOLOGY, 2006, 23 (2), 278–307 Grapheme-to-lexeme feedback in the spelling system: Evidence from a dysgraphic patient Michael McCloskey Johns Hopkins University, Baltimore, USA Paul Macaruso Community College of Rhode Island, Warwick, and Haskins Laboratories, New Haven, USA Brenda Rapp Johns Hopkins University, Baltimore, USA This article presents an argument for grapheme-to-lexeme feedback in the cognitive spelling system, based on the impaired spelling performance of dysgraphic patient CM. The argument relates two features of CM’s spelling. First, letters from prior spelling responses intrude into sub- sequent responses at rates far greater than expected by chance. This letter persistence effect arises at a level of abstract grapheme representations, and apparently results from abnormal persistence of activation. Second, CM makes many formal lexical errors (e.g., carpet ! compute). Analyses revealed that a large proportion of these errors are “true” lexical errors originating in lexical selec- tion, rather than “chance” lexical errors that happen by chance to take the form of words. Additional analyses demonstrated that CM’s true lexical errors exhibit the letter persistence effect. We argue that this finding can be understood only within a functional architecture in which activation from the grapheme level feeds back to the lexeme level, thereby influencing lexical selection. INTRODUCTION a brain-damaged patient with an acquired spelling deficit, arguing from his error pattern that Like other forms of language processing, written the cognitive system for written word produc- word production implicates multiple levels of tion includes feedback connections from gra- representation, including semantic, orthographic pheme representations to orthographic lexeme lexeme, grapheme, and allograph levels.
    [Show full text]
  • ON SOME CATEGORIES for DESCRIBING the SEMOLEXEMIC STRUCTURE by Yoshihiko Ikegami
    ON SOME CATEGORIES FOR DESCRIBING THE SEMOLEXEMIC STRUCTURE by Yoshihiko Ikegami 1. A lexeme is the minimum unit that carries meaning. Thus a lexeme can be a "word" as well as an affix (i.e., something smaller than a word) or an idiom (i.e,, something larger than a word). 2. A sememe is a unit of meaning that can be realized as a single lexeme. It is defined as a structure constituted by those features having distinctive functions (i.e., serving to distinguish the sememe in question from other semernes that contrast with it).' A question that arises at this point is whether or not one lexeme always corresponds to just one serneme and no more. Three theoretical positions are foreseeable: (I) one which holds that one lexeme always corresponds to just one sememe and no more, (2) one which holds that one lexeme corresponds to an indefinitely large number of sememes, and (3) one which holds that one lexeme corresponds to a certain limited number of sememes. These three positions wiIl be referred to as (1) the "Grundbedeutung" theory, (2) the "use" theory, and (3) the "polysemy" theory, respectively. The Grundbedeutung theory, however attractive in itself, is to be rejected as unrealistic. Suppose a preliminary analysis has revealed that a lexeme seems to be used sometimes in an "abstract" sense and sometimes in a "concrete" sense. In order to posit a Grundbedeutung under such circumstances, it is to be assumed that there is a still higher level at which "abstract" and "concrete" are neutralized-this is certainly a theoretical possibility, but it seems highly unlikely and unrealistic from a psychological point of view.
    [Show full text]
  • A Method for Automatically Building and Evaluating Dictionary Resources
    A Method for Automatically Building and Evaluating Dictionary Resources Smaranda Muresan∗, Judith Klavansy ∗Computer Science Department, Columbia University 500 West 120th, New York, USA [email protected] yCenter for Research on Information Access, Columbia University 535 West 114th St, New York, USA [email protected] Abstract This paper describes a method toward automatically building dictionaries from text. We present DEFINDER, a rule-based system for extraction of definitions from on-line consumer-oriented medical articles. We provide an extensive evaluation on three dimensions: i) performance of the definition extraction technique in terms of precision and recall, ii) quality of the built dictionary as judged both by specialists and lay users, iii) coverage of existing on-line dictionaries. The corpus we used for the study is publicly available. A major contribution of the paper is the range of quantitative and qualitative evaluation methods. 1. Introduction used in the context of summarization of technical articles to Most machine readable dictionaries or glossaries are ei- provide explanation of the technical terms in lay language. ther manually built by human experts or transformed in Our system, was extensively evaluated. First we eval- electronic forms from hard-copy versions through an ex- uated the performance of the definition extraction method pensive digitization process. Also for some particular do- by comparing the results against a gold standard in terms mains, such as medical domain, the effort is concentrated in of precision and recall. Second, we evaluated the quality of building technical dictionaries for specialists that are of lit- the dictionary as judged both by specialists and lay users.
    [Show full text]
  • Biomedical Entity Representations with Synonym Marginalization
    Biomedical Entity Representations with Synonym Marginalization Mujeen Sung Hwisang Jeon Jinhyuk Leey Jaewoo Kangy Korea University fmujeensung,j hs,jinhyuk lee,[email protected] Abstract Unlike named entities from general domain text, typical biomedical entities have several different Biomedical named entities often play impor- surface forms, making the normalization of biomed- tant roles in many biomedical text mining ical entities very challenging. For instance, while tools. However, due to the incompleteness two chemical entities ‘motrin’ and ‘ibuprofen’ be- of provided synonyms and numerous varia- tions in their surface forms, normalization of long to the same concept ID (MeSH:D007052), biomedical entities is very challenging. In this they have completely different surface forms. paper, we focus on learning representations of On the other hand, mentions having similar sur- biomedical entities solely based on the syn- face forms could also have different meanings onyms of entities. To learn from the incom- (e.g. ‘dystrophinopathy’ (MeSH:D009136) and plete synonyms, we use a model-based candi- ‘bestrophinopathy’ (MeSH:C567518)). These ex- date selection and maximize the marginal like- amples show a strong need for building latent rep- lihood of the synonyms present in top candi- dates. Our model-based candidates are itera- resentations of biomedical entities that capture se- tively updated to contain more difficult neg- mantic information of the mentions. ative samples as our model evolves. In this In this paper, we propose a novel framework for way, we avoid the explicit pre-selection of learning biomedical entity representations based on negative samples from more than 400K can- the synonyms of entities.
    [Show full text]
  • Part 1: Introduction to The
    PREVIEW OF THE IPA HANDBOOK Handbook of the International Phonetic Association: A guide to the use of the International Phonetic Alphabet PARTI Introduction to the IPA 1. What is the International Phonetic Alphabet? The aim of the International Phonetic Association is to promote the scientific study of phonetics and the various practical applications of that science. For both these it is necessary to have a consistent way of representing the sounds of language in written form. From its foundation in 1886 the Association has been concerned to develop a system of notation which would be convenient to use, but comprehensive enough to cope with the wide variety of sounds found in the languages of the world; and to encourage the use of thjs notation as widely as possible among those concerned with language. The system is generally known as the International Phonetic Alphabet. Both the Association and its Alphabet are widely referred to by the abbreviation IPA, but here 'IPA' will be used only for the Alphabet. The IPA is based on the Roman alphabet, which has the advantage of being widely familiar, but also includes letters and additional symbols from a variety of other sources. These additions are necessary because the variety of sounds in languages is much greater than the number of letters in the Roman alphabet. The use of sequences of phonetic symbols to represent speech is known as transcription. The IPA can be used for many different purposes. For instance, it can be used as a way to show pronunciation in a dictionary, to record a language in linguistic fieldwork, to form the basis of a writing system for a language, or to annotate acoustic and other displays in the analysis of speech.
    [Show full text]
  • Unit 2 Structures Handout.Pdf
    2. The definition of a language as a structure of structures 2.1. Phonetics and phonology Relevance for studying language in its natural or primary medium: oral sounds rather than written symbols. Phonic medium: the range of sounds produced by the speech organs insofar as the play a role in language Speech sounds: Individual sounds within that range Phonetics is the study of the phonic medium: The study of the production, transmission, and reception of human sound-making used in speech. e.g. classification of sounds as voiced vs voiceless: /b/ vs /p/ Phonology is the study of the phonic medium not in itself but in relation with language. e.g. application of voice to the explanation of differences within the system of language: housen vs housev usen vs usev 2.1.1. Phonetics It is usually divided into three branches which study the phonic medium from three points of view: Articulatory phonetics: speech sounds according to the way in which they are produced by the speech organs. Acoustic phonetics: speech sounds according to the physical properties of their sound-waves. Auditory phonetics: speech sounds according to their perception and identification. Articulatory phonetics has the longest tradition, and its progress in the 19th century contributed a standardize and internationally accepted system of phonetic transcription: the origins of the International Phonetic Alphabet used today and relying on sound symbols and diacritics. It studies production in relation with the vocal tract, i.e., organs such as: lungs trachea or windpipe, containing: larynx vocal folds glottis pharyngeal cavity nose mouth, containing fixed organs: teeth and teeth ridge hard palate pharyngeal wall mobile organs: lips tongue soft palate jaw According to their function and participation, sounds may take several features: Voice: voiced vs voiceless sounds, according to the participation of the vocal folds e.g.
    [Show full text]
  • 12 Morphology and Lexical Semantics
    248 Beth Levin and Malka Rappaport Hovav 12 Morphology and Lexical Semantics BETH LEVIN AND MALKA RAPPAPORT HOVAV The relation between lexical semantics and morphology has not been the subject of much study. This may seem surprising, since a morpheme is often viewed as a minimal Saussurean sign relating form and meaning: it is a concept with a phonologically composed name. On this view, morphology has both a semantic side and a structural side, the latter sometimes called “morphological realization” (Aronoff 1994, Zwicky 1986b). Since morphology is the study of the structure and derivation of complex signs, attention could be focused on the semantic side (the composition of complex concepts) and the structural side (the composition of the complex names for the concepts) and the relation between them. In fact, recent work in morphology has been concerned almost exclusively with the composition of complex names for concepts – that is, with the struc- tural side of morphology. This dissociation of “form” from “meaning” was foreshadowed by Aronoff’s (1976) demonstration that morphemes are not necessarily associated with a constant meaning – or any meaning at all – and that their nature is basically structural. Although in early generative treat- ments of word formation, semantic operations accompanied formal morpho- logical operations (as in Aronoff’s Word Formation Rules), many subsequent generative theories of morphology, following Lieber (1980), explicitly dissoci- ate the lexical semantic operations of composition from the formal structural
    [Show full text]
  • DICTIONARY News
    Number 17 y July 2009 Kernerman kdictionaries.com/kdn DICTIONARY News KD’s BLDS: a brief introduction In 2005 K Dictionaries (KD) entered a project of developing We started by establishing an extensive infrastructure both dictionaries for learners of different languages. KD had already contentwise and technologically. The lexicographic compilation created several non-English titles a few years earlier, but those was divided into 25 projects: 1 for the French core, 8 for the were basic word-to-word dictionaries. The current task marked dictionary cores of the eight languages, another 8 for the a major policy shift for our company since for the first time we translation from French to eight languages, and 8 more for the were becoming heavily involved in learner dictionaries with translation of each language to French. target languages other than English. That was the beginning of An editorial team was set up for developing each of the nine our Bilingual Learners Dictionaries Series (BLDS), which so (French + 8) dictionary cores. The lexicographers worked from far covers 20 different language cores and keeps growing. a distance, usually at home, all over the world. The chief editor The BLDS was launched with a program for eight French for each language was responsible for preparing the editorial bilingual dictionaries together with Assimil, a leading publisher styleguide and the list of headwords. Since no corpora were for foreign language learning in France which made a strategic publicly available for any of these languages, each editor used decision to expand to dictionaries in cooperation with KD. different means to retrieve information in order to compile the The main target users were identified as speakers of French headword list.
    [Show full text]
  • Studies in the Linguistic Sciences
    UNIVERSITY OF ILLINOIS Llb'RARY AT URBANA-CHAMPAIGN MODERN LANGUAGE NOTICE: Return or renew all Library Materialsl The Minimum Fee (or each Lost Book is $50.00. The person charging this material is responsible for its return to the library from which it was withdrawn on or before the Latest Date stamped below. Theft, mutilation, and underlining of books are reasons for discipli- nary action and may result in dismissal from the University. To renew call Telephone Center, 333-8400 UNIVERSITY OF ILLINOIS LIBRARY AT URBANA-CHAMPAIGN ModeriJLanp & Lb Library ^rn 425 333-0076 L161—O-I096 Che Linguistic Sciences PAPERS IN GENERAL LINGUISTICS ANTHONY BRITTI Quantifier Repositioning SUSAN MEREDITH BURT Remarks on German Nominalization CHIN-CHUAN CHENG AND CHARLES KISSEBERTH Ikorovere Makua Tonology (Part 1) PETER COLE AND GABRIELLA HERMON Subject to Object Raising in an Est Framework: Evidence from Quechua RICHARD CURETON The Inclusion Constraint: Description and Explanation ALICE DAVISON Some Mysteries of Subordination CHARLES A. FERGUSON AND APIA DIL The Sociolinguistic Valiable (s) in Bengali: A Sound Change in Progress GABRIELLA HERMON Rule Ordering Versus Globality: Evidencefrom the Inversion Construction CHIN-W. KIM Neutralization in Korean Revisited DAVID ODDEN Principles of Stress Assignment: A Crosslinguistic View PAULA CHEN ROHRBACH The Acquisition of Chinese by Adult English Speakers: An Error Analysis MAURICE K.S. WONG Origin of the High Rising Changed Tone in Cantonese SORANEE WONGBIASAJ On the Passive in Thai Department of Linguistics University of Illinois^-^ STUDIES IN THE LINGUISTIC SCIENCES PUBLICATION OF THE DEPARTMENT OF LINGUISTICS UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN EDITORS: Charles W. Kisseberth, Braj B.
    [Show full text]
  • WME 3.0: an Enhanced and Validated Lexicon of Medical Concepts
    WME 3.0: An Enhanced and Validated Lexicon of Medical Concepts Anupam Mondal1 Dipankar Das1 Erik Cambria2 Sivaji Bandyopadhyay1 1Department of Computer Science and Engineering 2School of Computer Science and Engineering Jadavpur University, Kolkata, India Nanyang Technological University, Singapore [email protected], [email protected] [email protected], 1sivaji cse [email protected] Abstract However, medical text is in general unstructured since doctors do not like to fill forms and pre- Information extraction in the medical do- fer free-form notes of their observations. Hence, main is laborious and time-consuming due a lexical design is difficult due to lack of any to the insufficient number of domain- prior knowledge of medical terms and contexts. specific lexicons and lack of involve- Therefore, we are motivated to enhance a med- ment of domain experts such as doctors ical lexicon namely WordNet of Medical Events and medical practitioners. Thus, in the (WME 2.0) which helps to identify medical con- present work, we are motivated to de- cepts and their features. In order to enrich this sign a new lexicon, WME 3.0 (WordNet lexicon, we have employed various well-known of Medical Events), which contains over resources like conventional WordNet, SentiWord- 10,000 medical concepts along with their Net (Esuli and Sebastiani, 2006), SenticNet (Cam- part of speech, gloss (descriptive expla- bria et al., 2016), Bing Liu (Liu, 2012), and nations), polarity score, sentiment, sim- Taboada’s Adjective list (Taboada et al., 2011) ilar sentiment words, category, affinity and a preprocessed English medical dictionary1 on score and gravity score features.
    [Show full text]
  • Morphemes by Kirsten Mills Student, University of North Carolina at Pembroke, 1998
    Morphemes by Kirsten Mills http://www.uncp.edu/home/canada/work/caneng/morpheme.htm Student, University of North Carolina at Pembroke, 1998 Introduction Morphemes are what make up words. Often, morphemes are thought of as words but that is not always true. Some single morphemes are words while other words have two or more morphemes within them. Morphemes are also thought of as syllables but this is incorrect. Many words have two or more syllables but only one morpheme. Banana, apple, papaya, and nanny are just a few examples. On the other hand, many words have two morphemes and only one syllable; examples include cats, runs, and barked. Definitions morpheme: a combination of sounds that have a meaning. A morpheme does not necessarily have to be a word. Example: the word cats has two morphemes. Cat is a morpheme, and s is a morpheme. Every morpheme is either a base or an affix. An affix can be either a prefix or a suffix. Cat is the base morpheme, and s is a suffix. affix: a morpheme that comes at the beginning (prefix) or the ending (suffix) of a base morpheme. Note: An affix usually is a morpheme that cannot stand alone. Examples: -ful, -ly, -ity, -ness. A few exceptions are able, like, and less. base: a morpheme that gives a word its meaning. The base morpheme cat gives the word cats its meaning: a particular type of animal. prefix: an affix that comes before a base morpheme. The in in the word inspect is a prefix. suffix: an affix that comes after a base morpheme.
    [Show full text]
  • A Study of Chinese Medical Students As Dictionary Users and Potential Users for an Online Medical Termfinder
    Lexicography ASIALEX https://doi.org/10.1007/s40607-017-0031-9 ORIGINAL PAPER A study of Chinese medical students as dictionary users and potential users for an online medical termfinder Jun Ding1 Received: 2 May 2017 / Accepted: 12 November 2017 Ó Springer-Verlag GmbH Germany, part of Springer Nature 2017 Abstract This paper examines the dictionary use of Chinese medical students as ESP learners and their trial experience with the online Health Termfinder (HTF), a specialized dictionary tool for medical terms. Empirical data were collected first from a survey among a group of medical students at Fudan University on their previous dictionary-use behaviors and initial response to the introduction to the HTF and its bilingualization, and second from an assignment of using the bilingualized HTF for reading comprehension with the same group of students. The survey findings reveal that the medical students in general are aware of the importance of dictionaries in their ESP learning in spite of the alarming fact that the majority of them have not even used any English medical dictionaries (monolingual or bilin- gual). They are thus quite open to the idea of using an online termbank of spe- cialized medical terms. But the HTF assignment findings also show that proper guidance and training is necessary for the termbank to be made actually useful for the lexical needs of Chinese ESP learners. It is also expected that the students’ feedback on their HTF experience, especially their demand for the termbank to include more cross-disciplinary medical terms, could be taken into consideration by the HTF builders.
    [Show full text]