Univerza v Ljubljani/University of Ljubljana
Filozofska fakulteta/Faculty of Arts
Oddelek za prevajalstvo/Department of Translation

Senja Pollak

Polavtomatsko modeliranje področnega znanja iz večjezičnih korpusov

Semi-automatic Domain Modeling from Multilingual Corpora

Doktorska disertacija/Doctoral dissertation

Mentorica/Supervisor: Izr. prof. dr. Špela Vintar, University of Ljubljana, Faculty of Arts, Ljubljana, Slovenia
Študijski program/Study program: Prevodoslovje/Translation Studies

Somentorica/Co-supervisor: Prof.ssa. Paola Velardi, University La Sapienza, Rome, Italy

Ljubljana, 2014

Table of contents

Acknowledgements
Povzetek
Abstract
1 Introduction
1.1 Domain modeling
1.2 Research goals
1.3 Contributions to science
1.4 Structure of the thesis
2 Background and related work
2.1 Language and meaning: Linguistic perspective
2.1.1 Lexical meaning
2.1.2 Lexicography and terminography
2.1.3 Dictionaries and terminological collections
2.1.4 Definitions
2.1.5 Types of lexical definitions (defining strategies)
2.1.6 Lexicographic principles of meaningful definitions
2.2 Domain modeling: Computational perspective
2.2.1 (Semi-)automatic domain modeling: Extracting terms, definitions, semantic relations and ontology construction
2.2.2 Modeling the domain of language technologies
2.2.3 Web services and workflows
3 Problem description, corpus presentation and initial domain modeling
3.1 Problem description
3.2 Building the Language Technologies Corpus
3.2.1 Constructing the small LTC proceedings corpus
3.2.2 Constructing the main Language Technologies Corpus (LT corpus)
3.3 Domain modeling through topic ontology construction
3.3.1 Modeling the LTC proceedings corpus
3.3.2 Modeling the Language Technologies corpus
3.4 Setting the stage for automatic extraction: Analyzing definitions in running text
3.4.1 Genus et differentia definition type
3.4.2 Defining by paraphrases, synonyms, sibling concepts or antonyms
3.4.3 Extensional definitions
3.4.4 Other types of definitions: defining by purpose or properties
4 Methodology and background technologies
4.1 Overview of the definition extraction methodology
4.2 Definition extraction evaluation methodology
4.3 Background technologies and resources
4.3.1 ToTrTaLe morphosyntactic tagger and lemmatiser
4.3.2 LUIZ extractor
4.3.3 WordNet and sloWNet
4.3.4 ClowdFlows workflow composition and execution environment
4.4 Evaluation of selected background technologies
4.4.1 ToTrTaLe evaluation
4.4.2 LUIZ evaluation
5 Definition extraction from Slovene and English text corpora
5.1 Extracting definitions from Slovene texts
5.1.1 Pattern-based definition extraction
5.1.2 Term-based definition extraction
5.1.3 SloWNet-based definition extraction
5.2 Extracting definitions from English texts
5.2.1 Pattern-based definition extraction
5.2.2 Term-based definition extraction
5.2.3 WordNet-based definition extraction
5.3 Results of Slovene and English definition extraction methods and their combinations
5.3.1 Combining different approaches on the Slovene subcorpus
5.3.2 Combining different approaches on the English subcorpus
5.3.3 Subjectivity of evaluation results
5.3.4 Analysis of different types of definition candidates
6 Workflow implementation in ClowdFlows
6.1 Load corpus widget
6.2 ToTrTaLe widget
6.3 LUIZ widget
6.4 Definition extraction widgets
6.4.1 Pattern-based definition extraction widget
6.4.2 Term-based definition extraction widget
6.4.3 Wordnet-based definition extraction widget
6.5 Auxiliary widgets
6.5.1 Merge sentences widget
6.5.2 String to file widget
6.5.3 Term viewer widget
6.5.4 Sentence viewer widget
7 Conclusions and further work
8 References
9 List of figures
10 List of tables
APPENDIX A: Term-based definition extraction experiments
Razširjeni povzetek


Acknowledgements

I would like to thank my supervisor, Špela Vintar, for her enduring confidence, patience and friendly support during the entire duration of my doctoral studies. I would also like to thank my co-supervisor, Paola Velardi, for welcoming and hosting me during my stay at the Sapienza University, as well as for her prompt reading of the final version of the thesis. The two remaining members of the doctoral committee, Darja Fišer and Dunja Mladenić, both gave very useful and detailed comments that improved the final version of the dissertation. It was Dunja who initially, even before I began my PhD studies, directed me into research.

I am grateful to my colleagues: Janez Kranjc, Nejc Trdin, Tomaž Erjavec, and especially Anže Vavpetič, who helped with the workflow implementation; Jasmina Smailović, who was of great help in the corpus preprocessing phase and in the initial topic domain model construction; Janja Sterle and Živa Malovrh, who were helpful in the corpus and glossary construction phase; and Ana Zwitter Vitez and Damjan Popič, who helped me with the evaluation of results and with reviewing the text.

As a junior researcher, I was funded by the Slovene Research Agency (ARRS), and at this point my deep gratitude again goes to my supervisor Špela Vintar, who selected me as her doctoral student. This led to my employment at the Department of Translation at the Faculty of Arts, which provided a friendly work environment, where I was also able to gain valuable teaching experience. During my stay at the Sapienza University in Rome, I was also financially supported by the Italian Government. Finally, I would like to express my gratitude to my current employer, the Jožef Stefan Institute, where in the last year I was able to gain new knowledge and complete my thesis in perfect working conditions.

Above all, the greatest thank you goes to my mother whose moral and scientific support was invaluable all along. Last but not least, there are many good friends and family members, without whom the time of writing and completing this thesis, with all the unpredictable events in the last few years, would have been much more difficult, even impossible: my profound gratitude goes to each and every one of you.



Povzetek

Človeško znanje je dostopno v strokovnih besedilih, terminoloških slovarjih in enciklopedijah, v zadnjem času pa tudi v računalniku razumljivih predstavitvah področnega znanja, kot so taksonomije in ontologije. Ker je ročno modeliranje področnega znanja časovno in finančno zahtevno, so raziskovalci s področja jezikovnih tehnologij začeli razvijati (pol)avtomatske metode in orodja za luščenje strokovnega znanja iz nestrukturiranih besedil. Med njihove naloge prištevamo na primer eno- ali večjezično luščenje terminologije, definicij ali semantičnih relacij kot tudi (pol)avtomatske pristope h gradnji ontologij. Luščenji terminologije in definicij sta pomembna koraka modeliranja strokovnega znanja, vendar so razvite metode in orodja večinoma prilagojena za posamezne jezike, a le redko za manj razširjene jezike, kot je slovenščina. Zato je glavni doprinos doktorske disertacije, ki ponuja metodologijo za luščenje definicijskih stavkov iz korpusov v slovenskem in angleškem jeziku, prav luščenje definicij iz slovenskih nestrukturiranih besedil.

Predlagana metodologija temelji na treh različnih pristopih k luščenju definicijskih stavkov. Prvi sledi tradicionalnemu pristopu luščenja z uporabo leksikoskladenjskih vzorcev, drugi uporablja informacije, pridobljene z avtomatskim razpoznavanjem terminov, tretji pa temelji na luščenju stavkov, ki vsebujejo termin skupaj s svojo nadpomenko (iz semantičnega leksikona tipa wordnet). Razvito metodologijo, ki uporablja kombinacijo treh metod, smo preizkusili na resničnem problemu modeliranja področja jezikovnih tehnologij. V ta namen smo zgradili primerljivi slovensko-angleški korpus tega področja, ki vsebuje približno dva milijona pojavnic. Od približno 3400 izluščenih definicijskih kandidatov smo več kot 700 stavkov ocenili kot definicije. Rezultat tega dela je tudi pilotni Slovarček jezikovnih tehnologij.

Poleg predlagane metodologije je doprinos doktorske disertacije tudi kvalitativna analiza avtomatsko izluščenih definicijskih kandidatov. Poleg osnovne razvrstitve v dve kategoriji (stavek je ali ni definicija) smo končni nabor stavkov analizirali in označili tudi s podrobnejšimi kategorijami. V predlagani analizi so dodatne oznake ločene v kategorije, vezane na obliko definicije, vsebino definicije, definiendum, segmentacijo ter označevanje.

Dodatni prispevek disertacije je implementacija celotnega procesa – od nalaganja korpusa do pregleda izluščene terminologije in definicij – v obliki javno dostopnega delotoka, ki je preprost za uporabo v prevajalske, jezikoslovne ali terminografske namene. Posamezne komponente delotoka – med njimi tudi orodje za jezikoslovno označevanje korpusov v slovenskem in angleškem jeziku – pa so na voljo za vključevanje v druge delotoke procesiranja naravnega jezika.

Ključne besede: luščenje definicij, spletni delotoki, modeliranje domene, specializirani korpusi, jezikovne tehnologije


Abstract

Human knowledge is available in different forms, including domain texts, terminological dictionaries, encyclopaedias, and recently also computer-understandable representations of domain knowledge, such as taxonomies and ontologies. Since manual domain modeling is costly and time-consuming, researchers in human language technologies have started developing methods and tools for semi-automatic extraction of domain-specific knowledge from unstructured texts, involving tasks such as terminology extraction, definition extraction, semantic relation extraction, or semi-automatic ontology building. Terminology and definition extraction are important domain modeling steps. Since most of the existing methods and tools are language specific and not developed for minor languages, the main contribution of the dissertation is the developed definition extraction methodology for Slovene. The approach is proposed for Slovene and English, but is easily adaptable to other languages.

The proposed definition extraction methodology is based on three different approaches to extracting definition candidates. The first follows the traditional pattern-based approach, in which patterns are composed of lemmas and morphosyntactic descriptions; the second approach relies on pairs of domain terms extracted through automatic term extraction; the third approach exploits wordnet hypernym pairs. We propose an original combination of the three approaches. The developed methodology was applied to a real-world problem of modeling the language technologies domain, for which we constructed a comparable Slovene-English corpus consisting of about two million tokens. We extracted more than 3,400 definition candidates, of which over 700 (approximately 480 for Slovene and 230 for English) were evaluated as definitions.

An additional contribution of the dissertation is the qualitative analysis of automatically extracted definition candidates. This set of candidate definitions was analyzed and annotated with fine-grained categories added to the binary definition/non-definition tags. In the analysis, the tags are sorted into definition form, definition content, definiendum, segmentation, and annotation-related categories. One of the important results is the pilot Glossary of Human Language Technologies for Slovene.

An additional contribution is the proposed domain-modeling pipeline—from corpus uploading and preprocessing to inspecting the extracted term and definition candidates—implemented as an online publicly available workflow, easy to use for translation, linguistic or terminological tasks. The developed workflow components, including the ToTrTaLe corpus annotation tool, can be easily integrated in other natural language processing workflows.

Key words: definition extraction, online workflows, domain modeling, specialized corpora, language technologies


1 Introduction

Domain modeling and extracting domain-specific knowledge from texts have become prolific areas of natural language processing and information extraction, involving tasks such as topic detection, terminology extraction, extraction of semantic relations, definition extraction, named entity recognition and other tasks aimed at harvesting meaningful items of knowledge. This dissertation focuses on the task of domain understanding through definition extraction from domain corpora, with an emphasis on developing methods and tools for definition extraction from Slovene texts. This introductory chapter presents the topic of research, the research goals, the contributions to science and the structure of the thesis.

1.1 Domain modeling

Domain terms and definitions—as means for domain knowledge modeling—are normally collected in handmade monolingual or multilingual terminological dictionaries or glossaries. Taxonomies and ontologies (e.g., Gruber, 1993) have proven to be very adequate knowledge representation formalisms for expressing the relations between domain terms, while topic ontologies (Fortuna et al., 2007), expressing a taxonomy of domain topics, provide a different view on a domain (where a domain is represented as a set of documents). Since a large amount of domain knowledge is represented in domain texts in an unstructured way, as well as in semi-structured encyclopedic resources such as Wikipedia and WordNet (Fellbaum, 1998), researchers in natural language processing, computational linguistics and text mining have started developing methods and tools for semi-automatic extraction of domain-specific knowledge from texts, involving tasks such as terminology extraction, definition extraction, semantic relation extraction or semi-automatic ontology construction. However, the vast majority of these methods and tools are language specific and often not developed for minor languages such as Slovene. The potential of the proposed framework for domain modeling from multilingual text corpora will be demonstrated on a selected case study: the domain of language technologies. The application of the proposed methodology (using existing and newly developed automatic and semi-automatic knowledge extraction techniques) to a selected corpus will result in a proof-of-concept glossary in the domain of human language technologies (HLT).

1.2 Research goals

The manual construction of glossaries and taxonomies, let alone the development of domain-specific ontologies, represents a significant investment of effort and resources for every new domain. Moreover, the need for constant upgrading/development/evolution of specialized domain models poses the threat that, once such a project is completed, it quickly becomes outdated.


For this reason, the area of natural language processing has shown significant interest in automatic term and definition extraction as well as in semi-automatic taxonomy/ontology construction.

The main research question addressed in this thesis is how to automatically extract domain knowledge from unstructured domain corpora in Slovene and English in order to semi-automatically generate a domain knowledge model formed of a glossary of the selected domain, as a basis for further refinement by human experts. This research question is motivated by the idea that such a model has the potential to decrease the amount of human resources needed for modeling a new domain.

The dissertation focuses on the task of domain understanding through definition extraction from domain corpora, as definitions are an important mode of representation for specialized concepts. They play a crucial role in the process of establishing the conceptualization of a given domain, as they help to delimit and differentiate concepts. They are an indispensable part of specialized dictionaries, thesauri and ontologies, and can help non-experts and translators to understand and correctly use specialized linguistic expressions, which are the vehicles of knowledge transfer.

The overall goal of this dissertation is to develop a methodology and a tool for semi-automatic domain modeling through definition extraction from domain corpora, focusing on Slovene. We also aim to implement the methodology in an easy-to-use workflow environment that requires no computational knowledge from its users. Another important aspect of this dissertation is the analysis of definitions in running text. Our focus is on an in-depth analysis of automatically extracted definition candidates, in order to better understand the definition extraction task and related problems, defining strategies in academic writing and the variety of definition types.

In brief, given a corpus of domain texts, the main goals of the dissertation are the following:
- To develop an overall methodology for definition extraction as a process starting from a raw text corpus, annotating it automatically with morphosyntactic descriptors, followed by term extraction, definition candidate extraction, human selection and evaluation (a simplified sketch of this process is given below).
- To relate the lexicographic theory of definitions to the task of automatic definition extraction from running text.
- To provide a pilot Slovene glossary for the language technologies domain.
- To implement the proposed definition extraction methodology as a reusable workflow, showcased for definition extraction from Slovene and English text corpora but easily adaptable to other languages.
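The following short Python sketch illustrates the intended process. It is a deliberately naive toy, not the actual methodology: the sentence splitter, the regular-expression pattern and the term-matching heuristic are hypothetical stand-ins for the ToTrTaLe preprocessing, the LUIZ term extraction and the three definition extraction modules developed in Chapters 4 and 5.

```python
import re

# Toy stand-in for corpus preprocessing (the thesis uses ToTrTaLe).
def split_sentences(text):
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

# Toy pattern-based module: copular "X is a/an/the Y ..." sentences.
IS_A = re.compile(r"^(?P<definiendum>\w[\w -]*?) (is|are) (a|an|the)\b")

def pattern_based_candidates(sentences):
    return [s for s in sentences if IS_A.match(s)]

# Toy term-based module: keep sentences that contain at least two known terms
# (the thesis relies on automatically extracted terms rather than a given list).
def term_based_candidates(sentences, terms):
    return [s for s in sentences
            if sum(t.lower() in s.lower() for t in terms) >= 2]

def extract_definition_candidates(raw_text, terms):
    sentences = split_sentences(raw_text)
    candidates = set(pattern_based_candidates(sentences))
    candidates |= set(term_based_candidates(sentences, terms))
    return sorted(candidates)  # the candidates are then manually validated

if __name__ == "__main__":
    text = ("A lemmatiser is a tool that maps word forms to their base forms. "
            "We thank the reviewers. Terminology extraction relies on a "
            "lemmatiser and a part-of-speech tagger.")
    terms = ["lemmatiser", "terminology extraction", "part-of-speech tagger"]
    print(extract_definition_candidates(text, terms))
```

The real pipeline differs in every component (lemmas and morphosyntactic descriptions instead of surface strings, automatic term recognition, wordnet hypernyms, and a final human evaluation step), but the overall flow (preprocessing, term identification, candidate extraction, manual selection) is the one sketched here.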

1.3 Contributions to science

The main contributions of this dissertation are as follows:
- A definition extraction methodology, based on three different modules for extracting definition candidates: the pattern-based, the term-based and the wordnet-based1 definition extraction module.

1 As in Fišer (2009), we use small caps with the word wordnet to refer to the type of collections with literals, synsets, hypernyms, etc., whereas WordNet denotes the particular wordnet of English, developed at the Princeton University (Fellbaum, 1998; PWN, 2010), the first collection of this kind.


- Construction of a corpus of articles from the domain of language technologies in Slovene, and of a comparable corpus in English. A part of the corpus is available through a concordancer at the following address: http://nl.ijs.si/cuwi/sdjt_sl/.
- An annotated set of more than 33,000 corpus sentences (labeled with definition/non-definition categories).
- A qualitative analysis and typology of over 3,400 definition candidates detected in the corpus under investigation (categorization by definition type, content, etc.).
- A pilot Glossary of language technologies,2 consisting of approximately 500 definitions (available at: http://kt.ijs.si/senja_pollak/jt_glosar/).
- Definition extraction workflow: the implementation of the proposed definition extraction methodology as an online workflow, available for public reuse (available at: http://clowdflows.org/workflow/1380).

Additional contributions are the workflow implementations of previously existing tools:
- ToTrTaLe workflow: the implementation of the ToTrTaLe preprocessing tool (Erjavec et al., 2010) for corpus preprocessing (tokenization, lemmatization and morphosyntactic annotation) as an online workflow, available for public reuse (available at: http://clowdflows.org/workflow/228).
- LUIZ term extraction web service and widget: the implementation of the monolingual part of the LUIZ system (Vintar, 2010), which was previously available online only as a demo for Slovene, while now it is fully functional for both languages and easily reusable in any new workflow.

Parts of this work have been published in the following papers. The initial definition extraction methodology was published in Fišer et al. (2010), where the approach was applied to a text corpus of a different genre (mainly textbooks) than the corpus used for definition extraction in this thesis (mainly scientific texts). The pre-final definition extraction methodology implemented in the workflow environment was published in Pollak et al. (2012a, 2012c). The ToTrTaLe workflow for corpus preprocessing in Slovene was published in Pollak et al. (2012b, 2012c), while the Language Technologies Corpus and the initial domain models in terms of topic ontologies were presented in Smailović and Pollak (2011, 2012).

1.4 Structure of the thesis

The thesis is structured as follows. Following the brief introduction given in this chapter, Chapter 2 presents the background and the related work, where linguistic and natural language processing perspectives are provided.

2 The glossary also includes definitions by Živa Malovrh and Janja Sterle.


Chapter 3 defines the problem of domain modeling in terms of topic ontology construction, terminology and definition extraction, and is followed by the description of the process of corpus construction and the analysis of several definitions from the corpus. Chapter 4 introduces the definition extraction methodology, presents the evaluation measures and discusses the background technologies and their evaluation. Chapter 5 contains the core of the thesis: the pattern-based, term-based and wordnet-based approaches to definition extraction and their evaluation for each language. Section 5.1 presents the three methods and the results of definition extraction from the Slovene part of the corpus, and Section 5.2 the three methods and the definition extraction results obtained on the English subcorpus. Section 5.3 summarizes the results, proposes different novel combinations of the three methods for each language and proposes a qualitative systematization of the results. In Chapter 6 the details of the definition extraction workflow implementation are discussed, while Chapter 7 concludes the thesis and gives directions for further work. Throughout the thesis numerous examples of extracted definition candidates are presented and analyzed, illustrating the difficulty of the task (and of the corpus) and improving the understanding of the domain in terms of domain modeling. The thesis is supplemented with Appendix A, which describes the testing of different parameter settings for the term-based approach.


2 Background and related work

The practice of creating dictionaries in the sense of recording and explaining the lexical inventory of a language or language pair is the central goal of lexicography and boasts a long tradition. Building terminological dictionaries and other terminological collections is a younger, but very dynamic activity due to the rapid emergence of new specialized fields. In recent years, semi-automatic approaches to the creation of dictionaries and glossaries, as well as to the extraction of other relevant information from text corpora, have been developed. In this chapter we discuss the linguistic (Section 2.1) and natural language processing (Section 2.2) perspectives on this topic.

2.1 Language and meaning: Linguistic perspective

Concrete or abstract realities, to which an utterance refers, give rise to certain ideas or mental images in the human mind. A particular group of such ‘things’ can give rise to ideas that resemble one another (form the same concept) and that are different from other ‘things’, since they have some distinctive features in common. In communicative acts, we do not list the distinctive features of the concepts, but give the concepts a linguistic representation by way of a name. To understand the relationship between words (lexical units) and what they refer to, we analyze the notion of lexical meaning. The meaning of words can be explained by means of definitions, which can be collected in dictionaries. Definitions can be categorized into different definition types, and in their compilation there are some principles that should be respected in order to produce meaningful definitions.

2.1.1 Lexical meaning

Lexical meaning consists of several components: the designation or denotation (the ‘objective’, ‘real’ meaning), the connotation (the ‘subjective’, ‘emotive’ meaning), and (possibly) the range of application, the latter being related to the fact that every word’s applicability is limited by some of its properties, such as its stylistic value, semantic connections or grammatical category (Zgusta, 1971, p. 27, 42, 89; Svensen, 1993, p. 118). Lexical meaning is not carried only by single words, but concerns also multiword lexical units (Zgusta, 1971, p. 154), and it is linked to the function of cognition as a reflection and reconstruction of experience (Geeraerts, 2010, p. 11). When we use a word in a sentence, it is not the lexeme itself that occurs in the sentence, but a particular instantiation of that lexeme, and those instantiations are called lexical units (Murphy, 2010, p. 10).

Denotation can be understood as the relation between words and the extralinguistic world—the things or classes of things they denote—i.e., the denotatum. However, this relation is neither simple nor direct; for example, the denotatum (or the reference) of two expressions may be the same, while their meaning may be different (Geeraerts, 2010, p. 78). Between the word (lexical unit) and the denotatum, there is the designatum, which can be understood as the conception which stands between the reality and the word.


For the speakers of a language, the whole extralinguistic world is organized into designata (Zgusta, 1971, p. 27–32). Word meanings are not substantial but relational (i.e., also defined by what they are not), and according to structuralists we can differentiate between paradigmatic and syntagmatic relations. Paradigmatic relations hold between words that can fill the same position in a sentence or an expression (cf. Lyons, 1977), while the syntagmatic approach maintains that the word is defined by the other words that accompany it in language use, or by the totality of its uses (Paradis, 2012). Glanzberg (2011), for example, argues that words usually get their meanings in part by associating with concepts, but only in conjunction with substantial input from language (i.e., they get linguistically modulated meanings).

A notion similar to designatum is the (scientific) concept. The difference between them is, according to Zgusta (1971, p. 32), that the concept is the result of exact scientific work or at least logical thinking, and is usually exactly defined and rigorously used, while the designatum generally does not have these qualities. In a certain way a concept is therefore a special case of the designatum. There is no sharp line between the two notions—that of designatum and that of concept—and it often happens that a precise scientific concept is worked out on the basis of a designatum of a word, which is then used both as a word of general use (expressing the designatum) and as a term (expressing the concept). For instance, if we have a word like polyvinylchloride, the designatum and the precise scientific concept coincide, but if we have the word animal and the term animal, there is a difference, because the term expressing the concept also covers entities which would not necessarily be conceived as animals in a general use of the word (Zgusta, 1971, p. 32–33); take corals as an example. The difference between terms and words does not correspond exactly to the difference between general and scientific usage, but spans practically all the spheres of the language and concerns different degrees of preciseness. We can also see in the literature that the notion of concept often relates both to the (scientific) concept as Zgusta uses it and to the less precise designatum, and, as Zgusta notes (1971, p. 33), in the case of languages that have a long tradition of philological, philosophical and generally cultural work, a great part of the designata indeed tend to approach the status of precise concepts. In the Saussurian tradition, the concept would be on the side of the signifié—the content aspect of a sign—while the word or term is the counterpart of the expressional aspect—the signifiant (Saussure, 1997).

To sum up, in the field of designation, the relation of words to the segments of the extralinguistic world, there are three main elements: the (form of the) word (or term) as the expression capable of being communicated to the hearer (or the reader, etc.), the designatum (or the concept) as the respective mental, conceptual content expressed in it, and the denotatum as the respective segment of the extralinguistic world (Zgusta, 1971, p. 33). Note, however, that not all words have precisely the same type of lexical meaning; for purely designative words (lexical units) the denotatum is easier to conceive than for function words, pragmatic operators, deictic words, etc.
The connotation as the second component of lexical meaning can be understood as “all components of the lexical meaning that add some contrastive value to the basic, usually designative, function” (ibid., p. 38). Hjelmslev (1975) notes that in the process of signification, connotation necessarily follows denotation as a second step. Examples of words with the same designation but different connotation are to decease, to die and to peg out.


“Any stylistic property of a word, the fact that a word belongs to a certain style of the respective language, to a certain slang or a social dialect, or even to a geographical dialect (if the word is used in a non-dialectal context), or that it is either recently coined (a neologism) or on the contrary, obsolete carries additional semantic relevance, additional ‘information’ about the speaker, about his attitude to or evaluation of the subject, gives ‘color’ to the subject, conveys the information more powerfully, humorously, emotionally, ironically, is in consequence more expressive, and is, therefore, connotative” (Zgusta, 1971, p. 40).

The third component of lexical meaning is, according to Zgusta (1971, p. 41), the range of application or, in other words, selectional restrictions. Briefly, it concerns words that have the same designation and connotation, but are used differently depending on the context; e.g., stipend and salary both refer to financial remuneration for work, but the first is mostly used in connection with a teacher or priest, whereas the latter is used in connection with an official. The stylistic aspect of whether the word belongs to general or technical language is also part of the meaning related to the range of application (Svensen, 1993, p. 118).

One of the properties of lexical meaning is its generality. This can be perceived in different dimensions. First, a given designative lexical unit can be used in reference to any member of the class that belongs to the designatum (i.e., the word flower can be used in reference to any flower); second, designata are usually broad and frequently overlapping (many different things can be referred to as flower); last but not least, polysemy adds considerably to the generality of lexical meaning (Zgusta, 1971, p. 47). However, one should note that in terminological work generality is much more limited, even if Zgusta warns that “even technical terms are polysemous more frequently than one would think (e.g., carburettor (1) in a combustion engine (2) in an apparatus for manufacturing water gas)” (ibid., p. 61) and, as will be discussed in Section 2.1.2, the term terminology itself is the best proof of polysemy. We can also differentiate between general nouns, which are used to express general concepts denoting a group of things with common distinctive features, and proper nouns, which are used when individual concepts are referred to (Svensen, 1993, p. 115–116). In other words, common nouns are used to refer to categories of things, while proper nouns are used for instances. In contrast to generalization, concretization is the result of the application of a lexical unit in an actual utterance. Therefore, whereas lexical meaning is general, signification is concrete. This concretization is the result of the contextualization of the relation between the concrete thing and the context (Zgusta, 1971, p. 47).

2.1.2 Lexicography and terminography

The totality of means of expression in a language can be divided into general language and special language. Even if there is no distinct boundary between the two, it can be said that general language defines the sum of the means of linguistic expression encountered by most speakers of a given language, whereas special language goes beyond the general vocabulary based on the socio-linguistic or the subject-related aspect. Consequently, two different categories of special language can be identified. Group language serves the purpose of strengthening the sense of belonging within a social group, whereas technical language arises as a consequence of constant development and specialization in the fields of science, technology, and sociology (Svensen, 1993, p. 48–49).


In the context of terminology, special language, also called language for special purposes, was defined as “language used in a subject field and characterized by the use of specific linguistic means of expression”, where it is also specially noted that “the specific linguistic means of expression always include subject-specific terminology and phraseology and also may cover stylistic or syntactic features” (ISO 1087-1:2000a). We can see that in this sense the term special language corresponds to technical language and does not cover group language in Svensen’s terminology. Another term frequently used as a synonym is specialized language.

The discipline that deals with studying the meaning(s) of words and their structure in general language is called lexicology. Even if lexicology and lexicography are terms that are sometimes used as synonyms, lexicography is in fact a different notion. The most basic understanding of lexicography is related to compiling dictionaries, but according to Svensen (1993, p. 1) lexicography means more than that:

“lexicography is a branch of applied linguistics which consists in observing, collecting, selecting, and describing units from the stock of words and word combinations in one or more languages. /.../ Lexicography also includes the development and description of the theories and methods which are to be the basis of this activity.”

In contrast to lexicology, the science of terminology deals with special language, i.e., language from a special subject field (domain). First, we investigate different meanings of the term terminology. It can refer to (at least3) three things: (a) the methods of collecting, disseminating and standardizing terms, (b) the theory explaining the relationships between concepts and terms, and (c) a vocabulary of a particular discipline (Pearson, 1998, p. 10). In ISO standards, terminology is used and defined as a “set of designations belonging to one special language” (ISO 1087-1:2000a) and therefore corresponds to the notion under (c), while (a) can be linked to the term terminology science, defined as the “science studying the structure, formation, development, usage and management of terminologies in various subject fields” (ISO 1087-1:2000a). Point (b) has already been discussed in the section above; we continue with this topic and mention the changes in the understanding of this dichotomy in the rest of this section, as well as the closely related questions of the differences between words and terms or between the disciplines of lexicology and terminology.

As new concepts constantly appear, new linguistic expressions have to be coined. A new denotatum (either newly discovered or invented) results in a new designatum/concept (or they come into existence step by step together), and the new designatum/concept finds expression in a new lexical unit, a word/term. In terminological work it often happens that one can readily describe a concept that has no name. For example, if a new product has been developed, and the concept, with its name, is to be incorporated in the technical terminology of the field, the terminologist’s first step is to clarify and describe the content of the concept, and only then to provide it with a suitable name (Svensen, 1993, p. 48–49, 116). However, this is only one—onomasiological—view, in which a concept is taken to be prior and the name (term in our case) is found for it; this view is the basis of the traditional, classical approach to terminology.

3 Humbley (1997, p. 13) mentions that Bergenholtz (1995) gives four and Bruno de Bessé (1994) five different meanings.


The opposite view is the semasiological perspective, which starts from terms, and the work of the terminologist is to explain their meanings. This semasiological perspective is also a background principle of contemporary, corpus-based terminography, although both onomasiological and semasiological principles coexist in terminographic work.

If lexicologists and lexicographers mainly focus on words or lexemes, terminologists focus on words that became terms, i.e., words that have acquired protected status when used in special subject domains or, as called above, subject fields (Pearson, 1998, p. 7). The understanding of the word vs. term distinction differs within different theories. We have already mentioned in the section above how Zgusta relates it to different levels of preciseness and to the difference between designata (less precise, expressed by words) and concepts (precise, expressed by terms). Wüster (1979 in Pearson, 1998, p. 10) sees the difference between the terminology and lexicology disciplines as based on the fact that terms should be treated differently from general language words. In contrast to lexicology, where the lexical unit is the usual starting point, terminology work starts from the concept, and the concept should be considered in isolation from its label or term. Concepts are understood to exist independently of terms, since they are mental constructs to which we assign labels. Each concept is the product of a mental process whereby objects and phenomena in the real world are first of all perceived or postulated. In contemporary approaches, the dichotomy ‘word-term’ is wiped out. For Kageura (2002), terms are functional variants of words. Cabré (2003) claims that all terms are words by nature. Cabré (2003, p. 189) notes that “we recognize the terminological units from their meaning in a subject field, their internal structure and their lexical meaning”. Myking (2007, p. 86) says that traditional terminology is concept-based and the new directions are lexeme-based. The difference is also seen in the form, since a term can also contain non-alphabetic signs.

Next, we provide the definitions of basic notions as defined by ISO standards.
Term: verbal designation of a general concept in a specific subject field (ISO 1087-1:2000a).
Word: smallest linguistic unit conveying a specific meaning and capable of existing as a separate unit in a sentence (ISO 5127:2001).
Designation: representation of a concept by a sign which denotes it. Note: in terminology work three types of designations are distinguished: symbols, appellations and terms (ISO 1087-1:2000a).
Concept: unit of knowledge created by a unique combination of characteristics (ISO 1087-1:2000a).

Finally, we discuss the term terminography. In ISO standards (ISO 1087-1:2000b), terminography refers to the part of terminology work concerned with the recording and presentation of terminological data; a similar definition was given by Marie-Claude L’Homme (2004, p. 15), who defines terminography as the acquisition, compilation and management of terms:

“la terminographie regroupe les diverses activités d’acquisition, de compilation et de gestion des termes.”

As noted by S. E. Wright (2011) about the ISO definition, “many native-speakers of English object to the term ‘terminography’, but it is widely used in Canada”.


Terminography and terminological work can also be used as synonyms, and in ISO standards terminological work is defined as work concerned with the systematic collection, description, processing and presentation of concepts and their designations (ISO 1087-1:2000b). In analogy to lexicography, concerned with collecting and describing the basic units of general language, i.e., lexemes (or words), as well as with building general language dictionaries and reflecting the theoretical aspect of this process, terminography can be described as the science dealing with concepts and their names, i.e., terms, with the aim of building terminological dictionaries (or other terminological manuals, e.g., term banks, glossaries, etc.). On the one hand, lexicography can also deal with the theoretical aspect of dictionary building without actually constructing dictionaries (cf. Svensen, 1993, p. 1); the question remains whether terminography can also exist apart from the actual building of terminological collections. In the majority of works, terminology covers the theoretical part and terminography concerns only the actual development of terminological collections. For example, Vintar (2008, p. 5) states that the final aim of any terminographic work is the construction of a terminological collection, be it an extensive terminological dictionary or a small personal glossary. Similarly, Cabré (1999) suggests that terminography is terminology in practice, while Baker and Saldanha (2009, p. 288) also mention the alternative name applied terminology. In a similar vein, for some authors theoretical lexicography means lexicology, while the practical part is lexicography. Since terminography deals with special language, the term specialized lexicography is sometimes used.

The growth of electronic resources and tools has substantially influenced traditional dictionary building processes, and the term electronic lexicography is used to refer to the design, use and application of electronic dictionaries (Granger, 2012, p. 2). The integration of computer technology into dictionaries can vary from simply making the paper dictionary content available through the electronic medium up to taking full advantage of the electronic form (Fuertes-Olivera and Bergenholtz, 2010, p. 1; Granger, 2012, p. 2). Granger (2013, p. 2–11) highlights the six most significant innovations of electronic lexicography in comparison to traditional methods: corpus integration (the inclusion of authentic texts in dictionaries), more and better data (there are no more space limitations and multimedia data can be added), efficiency of access (quick search and different possibilities of database organization), customization (the content can be adapted to the user’s needs), hybridization (the limits between different types of language resources—e.g., dictionaries, encyclopedias, term banks, lexical databases, translation tools—are breaking down), and user input (collaborative or community-based input is integrated). The principles of Slovenian e-lexicography are discussed by Krek et al. (2013) in the proposal for a new Slovene dictionary. The merging of lexicography with computer technology and the constant growth of corpora have enabled the development of (semi-)automatic processes for term extraction and alignment between different languages and, more recently, also definition extraction, which is the main topic of this dissertation. Automatic approaches will be discussed in Section 2.2.

Note, however, that as can be seen from this section, the distinctions lexicology vs. lexicography and terminology vs. terminography are far from being drawn unanimously, and the terms are often used interchangeably; this also illustrates the difficulty of the terminology and definition extraction tasks addressed.


2.1.3 Dictionaries and terminological collections

A dictionary is a document which contains a list of lexical units and relevant information about them. It is composed of short dictionary entries, arranged in a conventional, usually alphabetical order (Svensen, 1993, p. 2; De Bessé et al., 1997, p. 129). However, the organization of dictionary entries has become much more flexible and dynamic in the era of electronic dictionaries. Traditionally dictionaries were printed books, but today one can say that they “are most familiar in their printed form; however, increasing numbers of dictionaries exist also in electronic forms which are independent of any particular printed form” (TEI P5, 2013, p. 261). The future of dictionary making lies in electronic dictionaries, and many specialists predict the disappearance of paper dictionaries in the near future (Granger, 2012, p. 2).

Traditionally, the difference between a dictionary and an encyclopedia is that the dictionary gives information about individual units of the language, whereas an encyclopedia communicates knowledge about the world. In other words, linguistic dictionaries are primarily concerned with language, i.e., they focus on explaining the meaning of words/lexical units of language and their linguistic properties, while encyclopedic dictionaries deal with explaining the meaning of phenomena, i.e., the denotata of the lexical units/words (Svensen, 1993, p. 2; Zgusta, 1971, p. 198). If the lexical items are structured according to semantic relations (synonyms, hypernyms, etc.), we talk about thesauri (De Bessé et al., 1997, p. 154).

Terminological dictionaries, also called technical dictionaries, are collections of terminological entries presenting information related to concepts or designations from one or more specific subject fields (ISO 1087-1:2000a). In contrast to general dictionaries dealing with general vocabularies, they cover specialized domain vocabularies and are more focused on defining and naming concepts than on the linguistic side, such as the pronunciation and inflection of the included lexemes (Svensen, 1993, p. 3, 21). A glossary can have two different meanings. Either it is defined as a “terminological dictionary which contains a list of designations from a subject field, together with equivalents in one or more languages” (ISO 1087:2000a), or—as used also in this thesis—a glossary can refer to a (unilingual) list of terms and their definitions (or other explanations of their meaning) in a particular subject field (De Bessé et al., 1997, p. 134). If the collection of terms is structured according to the conceptual relationships established for a subject field, it is a terminological thesaurus (De Bessé et al., 1997, p. 154). If a terminological collection is in a computer-processable form, it is called a terminological database or termbase, defined as a “database containing terminological data” (ISO 1087-1:2000b) or, in the previous ISO version, as “structured sets of terminological records in an information processing system” (ISO 1087:1990). A collection of terminological databases including the organizational framework for recording, processing and disseminating data is called—in the later withdrawn ISO 1087-2:2000 standard—a term bank.

With the turn of the 21st century, an important change was observed, with more and more dictionaries and other collections appearing in electronic format (Granger and Paquot, 2012). The limits between the above-mentioned categories of lexical resources are blurred.


Very broadly, electronic dictionaries can be defined as “primarily human-oriented collections of structured electronic data that give information about the form, meaning, and use of words and are stored in a range of devices (PC, Internet, mobile devices)”. Computer-oriented dictionaries are, on the other hand, lexical tools that are primarily designed for use in natural language processing applications (Granger, 2013, p. 2), and often the resources can be used by both humans and computers (cf. WordNet (Fellbaum, 1998)).

Recently, much effort has been invested in building modern language resources for Slovene. The Slovene lexical database (Gantar and Krek, 2011) can be used as the basis for lexicographic purposes—as described in the proposal for a new dictionary of the Slovene language (Krek et al., 2013)—as well as for the enhancement of natural language processing tools for Slovene. The database provides different levels of lexico-grammatical information, spanning from simple morphological data to syntactic and collocational data and corpus examples. Another lexical resource, sloWNet (Fišer, 2009), is presented in more detail in Section 4.3.3 (since we use it in our methodology); it is a resource of high value for various language technology applications, providing information about word senses, their hypernyms, other relations, translations and definitions. Termania,4 on the other hand, is a web portal combining many different mono- or multilingual, general and terminological dictionaries that can be submitted and searched by any user.

2.1.4 Definitions

In the following three subsections we discuss different views on definitions as found in the related lexicographic and philosophical literature. The different categorizations discussed in this chapter summarize others’ work, but will be referred to in our analyses in Section 3.4, as well as throughout Chapter 5.

The meaning of dictionary entry words and word combinations is specified by definitions in monolingual dictionaries and by means of equivalents in the other language in bilingual dictionaries (Svensen, 1993, p. 6). We use definitions to define the meaning. Definitions are definitions of symbols and not of objects/things, because only symbols have the meanings that definitions may explain. For example, we can define the word chair because it has meaning, but not a chair itself, since an actual chair is not a symbol that has meaning (on the other hand, we can sit on a chair or describe it, but we cannot sit on the symbol/word chair) (Copi and Cohen, 2009, p. 88). One of the fundamental tenets of traditional lexicography is that the meanings of all lexical items can be expressed by means of a paraphrase in the same language, the definition (Béjoint, 2000, p. 195). Béjoint (ibid.), referring to Dubois and Dubois (1971, p. 85), also identifies the presupposition that there are always at least two ways of expressing something without changing the meaning as a semantic universal. Zgusta (1971, p. 252), focusing on general dictionary building, claims that the basic instruments for the description of lexical meanings are the lexicographic definition, the location in the system of synonyms, the exemplification and the glosses. We focus on definitions, as well as on synonyms as an alternative method of defining a concept, while setting aside the purpose of examples and/or glosses in dictionaries, the glosses being defined by Zgusta (1971, p. 270) “as any descriptive or explanatory note within the entry”, where labels indicating the connotation, style, etc. are in his opinion also a species of glosses.

4 http://www.amebis.si/termania (Last accessed: February 1, 2014)


A definition is a characterization of the meaning of the (sense of the) lexeme (Jackson, 2002, p. 93). It is “a representation of a concept by a descriptive statement which serves to differentiate it from related concepts” (ISO 12620:2009). The concept to be defined is called the definiendum and corresponds to the headword in the context of dictionary building, while the part defining the meaning of the concept (definiendum) is called the definiens. In fact, the definiens is not the meaning of the definiendum, but it is—as the definiendum itself—a symbol, or group of symbols, that has the same meaning as the definiendum (Copi and Cohen, 2009, p. 88). In monolingual dictionaries the two parts are usually separated. However, in some second language learner dictionaries (cf. the COBUILD dictionary projects (Sinclair, 1987a)), as well as from the point of view of automatic definition extraction, entire sentences containing the definiendum and the definiens are considered, and the linking element between the two parts is in this context called a hinge (most commonly a verb) (Sinclair, 1987b; Hanks, 1987; Pearson, 1996; Barnbrook and Sinclair, 1994; Krek, 2004; Kosem, 2006); a minimal sketch of this sentence structure is given at the end of this subsection.

Definitions as found in dictionaries are only one definition category. In philosophy, several other categories are identified, depending on their function (cf. Copi and Cohen, 2009, p. 88; Parry and Hacker, 1991, p. 89–97). In lexical (or real) definitions, the term being defined already has some established use and therefore the definition reports the definiendum’s (prior and independent) meaning. These definitions are true or false (depending on whether they do or do not accurately report common usage—conventional meaning). An example of a true lexical definition is defining the word bird as any warm-blooded vertebrate with feathers (Copi and Cohen, 2009, p. 89–90). Stipulative (or nominative) definitions are the definitions that are not factually true or false, but are the ones in which a new (or already existing) term is assigned a specific meaning by definition and did not have (that) meaning before. It is a “proposal /.../ to use a definiendum to mean what is meant by the definiens” (Copi and Cohen, 2009, p. 89, see also Robinson, 1962), and if a term already exists it might be in contradiction with its lexical definition. For instance, the number equal to a billion trillions (10^21) has been named a zeta by stipulation. Precising definitions are used to eliminate the ambiguity or vagueness of terms. An example provided by Copi and Cohen (2009, p. 92) is the vagueness of the term horsepower that initiated a precising definition (the power needed to raise a weight of 550 pounds by one foot in one second). In contrast to stipulative definitions, the definiendum of a precising definition is not a new term; the definiendum should be assigned a more precise meaning, while respecting the established usage. Copi and Cohen (2009, p. 94–95, 116) also list theoretical definitions, which serve as comprehensive compressed summaries of some theory (their aim is to encapsulate the understanding of some intellectual sphere), and persuasive definitions, which are used to influence the conduct of others (e.g., commonly used in political argumentation).

In the next two subsections we explain different defining strategies and the types of lexical definitions, as well as the principles for well-formed definitions, as stated in the lexicographic literature.
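As an illustration of the definiendum-hinge-definiens view of full-sentence definitions mentioned above, the following minimal sketch splits a sentence on a small, hypothetical inventory of hinges. It is a toy illustration only; the hinge list and the splitting logic are deliberately simplified and are not the extraction patterns developed later in the thesis.

```python
import re

# Toy hinge inventory; multi-word hinges are listed first so that they are
# tried before the bare copula. This list is illustrative, not exhaustive.
HINGES = ["is defined as", "refers to", "denotes", "is", "are"]

def split_definition(sentence):
    """Return (definiendum, hinge, definiens) or None if no hinge is found."""
    for hinge in HINGES:
        match = re.search(rf"\b{re.escape(hinge)}\b", sentence, flags=re.IGNORECASE)
        if match:
            definiendum = sentence[:match.start()].strip()
            definiens = sentence[match.end():].strip().rstrip(".")
            if definiendum and definiens:
                return definiendum, hinge, definiens
    return None

print(split_definition("A glossary is a list of terms and their definitions."))
# ('A glossary', 'is', 'a list of terms and their definitions')
```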


2.1.5 Types of lexical definitions (defining strategies)

In this section we focus on lexical definitions and different defining strategies. We list the most important strategies and categories as found in the related literature. When analyzing the definitions in our corpus (cf. Section 3.4), we refer to a selected (simplified) subset of these categories. Svensen (1993, p. 117) distinguishes between true definitions, paraphrases (also including synonyms and near synonyms), combined definitions (hybrids of the two types mentioned before) and definitions by describing the use of the defined term. We can note that the term true definitions itself carries the connotation of defining a concept better than a paraphrase, a synonym or another defining strategy.

The ‘true’ definitions can define the concept by specifying its intension or its extension. The intension denotes the content of the concept, which can be defined as the combination of the distinctive features which the concept comprises, while the extension denotes the range of the concept, which can be defined as the combination of all the separate elements or classes which the concept comprises (Svensen, 1993, p. 120–121). To illustrate the difference, Svensen (1993, p. 121) provides the elements that should be specified by each method to define, e.g., a motor vehicle. The intension would be specified as ‘vehicle + engine-driven + steerable + mainly for use on roads or tracks’, while the extension could be specified as ‘car or motorcycle or moped or van or bus or truck’.

The extensional meaning of a term5 is the collection of the objects that constitutes the extension of the term (Copi and Cohen, 2009, p. 96). All the objects within the extension of a given term have some common attributes or characteristics that lead to the same term being used to denote them. The intension of the term is the set of attributes shared by all and only those objects to which a general term refers. The intensional meaning supposes some criterion for deciding whether a given object falls in the extension of that term. Every general term has both an intensional and an extensional meaning, where the term’s intension determines its extension; terms may have different intensions and the same extension (e.g., living person and living person with a spinal column have different intensions, the latter being greater than the former, but the same extension; the extension of a term can also be empty), but terms with different extensions cannot possibly have the same intension (ibid., p. 96–98). To sum up, the basic distinction can be made between defining strategies that approach the term by focusing on the class of objects to which the term refers (extensional definitions) and those focusing on the attributes that determine the class denoted by the term (intensional definitions) (ibid., p. 98–99). This contrast is also illustrated with a small sketch below. Next, we examine different principles and types based on these two main defining strategies.
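A minimal sketch, assuming Svensen’s motor-vehicle example and toy feature/class sets, of what deciding membership intensionally (via distinctive features) versus extensionally (via enumeration) amounts to:

```python
# Intensional characterization: the combination of distinctive features.
MOTOR_VEHICLE_INTENSION = {"vehicle", "engine-driven", "steerable",
                           "mainly for use on roads or tracks"}

# Extensional characterization: the classes the concept comprises.
MOTOR_VEHICLE_EXTENSION = {"car", "motorcycle", "moped", "van", "bus", "truck"}

def is_motor_vehicle_by_intension(features):
    """An object falls under the concept if it has all distinctive features."""
    return MOTOR_VEHICLE_INTENSION <= set(features)

def is_motor_vehicle_by_extension(name):
    """An object falls under the concept if it is one of the listed classes."""
    return name in MOTOR_VEHICLE_EXTENSION

print(is_motor_vehicle_by_intension(
    {"vehicle", "engine-driven", "steerable",
     "mainly for use on roads or tracks", "two-wheeled"}))  # True
print(is_motor_vehicle_by_extension("bicycle"))             # False
```

The same concept can thus be characterized either by the attributes an object must have (its intension) or by enumerating the classes it comprises (its extension), and the two characterizations pick out the same class of objects.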

Intensional definitions

The intension of a term means the attributes shared by all the objects denoted by the term, and shared only by those objects—or, in other words, all the attributes shared by all and only the members of the class designated by that term (Copi and Cohen, 2009, p. 102, 116). Copi and Cohen (2009, p. 102) distinguish three different senses of intension: subjective intension (the set of all attributes the speaker believes to be possessed by

5 Note that in the following sections, for simplification purposes, we do not make a distinction between words and terms (and designata and concepts). We use the term term in a wider sense, not necessarily in the terminological sense related to a specific subject field.


objects denoted by the word), objective intension (the factual total set of characteristics shared by all the objects in the term’s extension) and conventional intension. The latter is used for definitions and refers to a stable meaning of a term based on the implicit agreement between users to apply the same criterion for deciding whether any given object is part of the term’s extension; it does not presuppose omniscience (ibid.). In the related literature, we found six different subtypes of intensional definitions, which are analyzed below. The main and most common definition type for dictionary building is the definition by genus and differentiae, where the meaning of a term is analyzed and the term is defined by a superordinate concept (class)—the genus—and by the differences between the species denoted by the term and the members of all other species of the genus. The second is the synonym and paraphrase definition type, where another word (or paraphrase) with the same meaning as the word being defined is given. We discuss these two types extensively, since they are the most important for lexicographic and terminographic work. We also mention relational definitions, in which terms are defined by a relation (other than synonymy) to other terms; operational definitions, which state that a term is applied correctly to a given case if the performance of specified operations in that case yields a specific result; functional definitions, which define a term by explaining its use; and typifying definitions, which define a term by means of its typical properties.

Genus-differentia definition type (Analytical definitions)

The most common form of lexicographic definition is the ‘analytical one-phrase definition’, which consists of the genus proximum (superordinate concept) next to the definiendum—or just after the hinge in a full sentence definition—together with the differentia specifica, i.e., at least one distinctive feature typical of the definiendum (Svensen, 1993, p. 122). It is called analytical because the definition does not only provide the meaning of an unknown concept, but also analyzes its definiens into constituent features (Geeraerts, 2003, p. 89). The analyzability can be understood in terms of classes. Any class of things having members may have its membership divided into subclasses. The class whose membership is divided into subclasses is called the genus and the various subclasses are its species (Copi and Cohen, 2009, p. 105). The definiendum’s superordinate concept—the genus—specifies the class containing the definiendum as one element, while the distinctive features—the differentiae—specify in which ways the definiendum differs from other elements of the same class (Svensen, 1993, p. 122). We provide two genus-differentia definitions in which we add the notation of the different parts of the definiens, namely the genus proximum and the differentia specifica. The most typical order is that of Example (a), where the genus precedes the differentia. However, the differentia can also be specified before the genus, as in Example (b) below.

a)

b)

It is also possible that the two orders are combined, as in Example (c).


c) horse: a solid-hoofed plant-eating domesticated mammal with a flowing mane and tail, used for riding, racing, and to carry and pull loads (Jackson, 2002, p. 94).

In the last example the definiendum (horse) is related to its genus6 (mammal) and given a number of differentiae (solid-hoofed, plant-eating, domesticated, with a flowing mane and tail, used for riding, etc.) which are typical features of a horse compared to other mammals (see Jackson, 2002, p. 94). Even though it is more common for the differentiae to come after the genus part, the genus can already be restricted by some specific elements (differentiae), such as the first three properties in the above-mentioned example. Svensen (1993, p. 124) warns that since the content of a sign, and not the expression, is to be defined, a definition should if possible not use expressions such as name of ... or objects, such as..., with the exception of definitions of, e.g., function words. Note that this position is not fully aligned with that of Copi and Cohen (2009), for whom definitions are always definitions of symbols (this discussion is beyond the scope of this thesis). The genus should be neither too general nor too specific (Ayto, 1983 in Kosem, 2006), but this also depends on the final application. A definition of a concept in a terminological dictionary differs from one in a general dictionary: terminological dictionaries have more detailed definitions (Svensen, 1993, p. 3, 22). In Svensen’s (1993, p. 122–123) opinion, a difference between terminological and general-language dictionaries is that in terminological work the definition should include as many distinctive features as are needed to demarcate the concept from every other member of the class, whilst in general-language dictionaries this rule is not applied to the same extent and only “enough distinctive features should be mentioned to represent the content of the sign with accuracy sufficient for the purposes of the dictionary”. Zgusta makes a similar point when signaling the differences between the logical definition and the lexicographic definition, saying that “whereas the logical definition must unequivocally identify the defined object (the definiendum) in such a way that it is both put in a definite contrast against everything else that is definable and positively and unequivocally characterized as a member of the closest class, the lexicographic definition enumerates only the most important semantic features of the defined lexical unit, which suffice to differentiate it from other units” (Zgusta, 1971, p. 252–253). Zgusta also claims that the lexicographer should respect that the (lexicographic) “definition should be sufficiently specific, but not overspecific”, where the “indication of semantic features is based on what appears to be relevant to the general speaker of the language in question, not on properties that can be perceived only by a scientific study” (Zgusta, 1971, p. 254). This again differentiates the lexicographic definition for the purposes of general dictionary building from the terminological perspective, where the addressed audience consists of specialists (or translators in need of exact translations) rather than general speakers. However, when technical terms are defined in general dictionaries, it is often difficult to satisfy both scientific correctness and general intelligibility (Zgusta, 1971, p. 255). Svensen (1993, p. 123) therefore notes that for general-language dictionaries it is often enough to provide only the genus proximum (which does not need to be a direct superordinate concept) and possibly—but not necessarily—one or two distinctive

6 Note that it is not a genus proximum.


features. An example he provides (ibid.) is defining canasta as “a card game” or calcium as “a chemical element”, but he also notes that in these cases it is obligatory to use the indefinite article or an expression such as a kind of or a type of, in order to prevent the interpretation of the definition as a paraphrase (e.g., not every card game is called canasta). This definition, providing only the genus (the referent’s class) but not the differentia, can therefore be considered a special subtype of the analytical definition. It is called the classificatory (Borsodi, 1967, in Westerhout, 2010) or exclusive genus (Sierra et al., 2006, p. 230) definition type. Quantitative and qualitative definitions can be considered as special subtypes of analytical definitions, since their specificities concern the differentia part rather than the general genus-differentia structure. These categories were introduced by Borsodi (1967) and are summarized in Westerhout (2010, p. 37). Quantitative definitions describe the dimensions (size, weight, length, age...) of the definiendum (e.g., A mountain is a peak that rises over 2,000 feet (609.6 m)), while qualitative definitions state the qualities, characteristics, or properties of the definiendum. A special subcategorization was made by Nakamoto (1998) in the context of language learners’ monolingual dictionaries. He distinguishes between two groups of lexicographic definitions, based on two different ‘perspectives’, analyzing four British dictionaries of English as a foreign language. Referent-based definitions (RBSs) define the definiendum from the perspective of the entity to which they refer, while anthropocentric definitions (ACDs) are written from the perspective of a person. To illustrate the two types, two examples of dictionary definitions of watch are provided by Nakamoto (1998, p. 205):

d) a small clock to be worn, esp. on the wrist, or carried

e) a small clock that you wear on your wrist or carry in your pocket

Even if both definitions are analytical definitions consisting of the genus proximum (clock) and the differentiae specificae (what differentiates a watch from other types of clocks), Nakamoto (1998) identifies the most important difference in the perspective (cf. the use of the second person pronoun you, your). The use of the informal pronoun you was introduced systematically, along with full sentence definitions, in Sinclair’s (1987a) COBUILD dictionary (Nakamoto, 1998, p. 211). Even if the analytical (genus-differentia) type of definition is the most common type in dictionary construction and is even considered the most “prestigious” type, Béjoint (2000, p. 199) questions this presupposition and proposes that the relative efficacy of different types should be compared for different user groups and different word categories.

Paraphrases and synonyms

A second major type of definition consists of a paraphrase, i.e., “a brief rewriting of a name” (Svensen, 1993, p. 118). In this definition type we can find a synonym, a collection of synonyms or a synonymous phrase (paraphrase). Jackson (2002, p. 95) and Zgusta (1971, p. 261) state that smaller dictionaries with limited space use this defining method more frequently and that it is especially used for abstract words. Whether paraphrases on the one hand and synonyms and near-synonyms on the other are considered as one or as two different definition types depends on the author. Complete synonyms are words that are equivalent in their denotative and connotative meaning, as well as in their range of applications, i.e., the three aspects of the


lexical meaning elaborated in Section 2.1.1 above. Complete synonyms are more usual in technical terminology, but outside technical language near-synonyms are much more frequent, meaning that words are denotatively equivalent but have different connotations and/or belong to different style levels. Near-synonyms, or synonyms in the broader sense of the word, express great similarity instead of absolute identity. Svensen (1993, p. 131) and Zgusta (1971, p. 260) mention that a quite usual case is a hybrid form of definition, consisting of a genus-differentia or paraphrase definition followed by one or more synonyms. However, Svensen (1993, p. 132) considers the combination of a paraphrase and a (near-)synonym the better one, whilst the combination of a genus-differentia definition and a (near-)synonym can be misleading and, if used, should be labeled as such (e.g., using the abbreviation syn.). Defining by synonyms and near-synonyms—also called synthetic definitions, in contrast to analytical definitions that analyze the meaning of a term—has been judged differently by different authors. Zgusta (1971, p. 261) claims that “if handled with due care, this method can yield good results”.

Relational definitions

Relational definitions are synthetic definitions other than synonym-based ones, in which the definiendum is defined through its relation to other terms. Borsodi (1967, in Westerhout, 2010) distinguishes between antonymic definitions, which define with the help of referents of the opposite nature (e.g., Bad is the opposite of good (Westerhout, 2010, p. 38)), and meronymic definitions7, which explain the word by situating it between two other terms—“a simple definition by the enumeration of words which refer to any thing or any document which is between, or which is mediatory of, the extremes represented by synonyms and antonyms of the word being defined” (Borsodi, 1967, p. 27) (e.g., The present is the moment in time between past and future (ibid.)).

Operational definitions

A term’s intension can also be explained operationally—by tying the definiendum to some specific test. The test should be a public and repeatable operation (using specific processes or validation sets), i.e., a prescribed procedure that has an observable result. It is mainly used for distances, durations, etc. (Copi and Cohen, 2009, p. 103; Parry and Hacker, 1991, p. 106). An example is a sentence like: An acid is a substance that turns blue litmus red. This definition type has mainly been related to the fields of physics and psychology (an example is identifying intelligence with the score in IQ tests) and is strongly criticized by some authors (e.g., Swartz, 2010).

Functional definitions (use/purpose/function)

Jackson (2002, p. 95) identifies the type where the definition explains the use to which a (sense of a) word/term is put, especially when defining function words with no reference outside the language. Since the term is defined by its purpose/function, we call this a functional definition. Jackson’s (ibid.) example is: and (conjunction) used to connect words of the same part of speech, clauses or sentences. Functional definitions

7 In our opinion the chosen name of this definition type is somewhat confusing, since it does not use the classical ‘part/member-of’ meaning of meronymy.


can also be found as a subtype of analytical definitions, in which the genus is mentioned and the purpose/use is described in the differentia part. For instance, E. Westerhout (2010, p. 38) gives as an example Gnuplot is a program for drawing graphs, but calls it an operational definition, which is confusing with regard to the above-mentioned operational definitions as usually understood in the literature. Copi and Cohen (2009, p. 107) also mention that a lexical definition “should state the conventional intension of the term being defined”, which is not always an intrinsic characteristic of the things denoted by the term. The use of shape or material as the specific difference of a class is therefore, according to Copi and Cohen, usually an “inferior way to construct a definition”. For instance, for a shoe it is not essential that it is made of, e.g., leather; what is essential is its use, i.e., being an outer covering for the foot. In functional definitions the use/purpose/function can concern the differentia part (and thus constitute a sort of subcategory of the genus-differentia definition type) or have an autonomous structure.

Typifying definitions

Typical properties of the referent are usually used in combination with one of the previous techniques, especially with analytical definitions or paraphrases. Since they normally contain the genus, they can also be considered as a subtype of analytical definitions, where the differentia part mentions the typical properties. Jackson (2002, p. 95) provides an example for measles defined as an infectious viral disease causing fever and a red rash, typically occurring in childhood.

“Typifying definitions are structured similarly to analytical ones. They contain a genus proximum and one or more characteristic features. Nonetheless, instead of focusing on additional inherent facts, the definition gives more information on what is typical of the referent” (Heuberger, 2000, p. 16 in May, 2005, p. 74). If the most typical characteristic is the definiendum’s use, we can refer back to functional definitions, interpreted in that way as a subtype of typifying definitions (May, 2005). The last two defining strategies, i.e., functional definitions, defining by mentioning the function/use of the definiendum, and typifying definitions, defining by the most typical characteristics of the definiendum, are very often employed in the full sentence definitions introduced in the COBUILD dictionaries (Sinclair, 1987a). The advantages of full sentence definitions are also discussed in Gantar and Krek (2009).

Extensional definitions

In contrast to its intension, the extension of the definiendum can be used for defining its meaning. The extension basically means the objects denoted by a term. The most obvious way is to identify all the objects denoted by a term. However, it is not always possible and/or useful to list all the objects (Copi and Cohen, 2009, p. 99). We can therefore say that in extensional definitions the meaning of a term is provided by means of listing some or all of the objects named by the term (Parry and Hacker, 1991, p. 113). Instead of referring to the content, they refer to the range of the concept (Svensen, 1993, p. 123). Svensen (ibid.) states that this type of definition is less usual in general-language dictionaries and occurs mainly in terminographic work. Extensional definitions can be categorized into different types, based on the situation in which they occur. For ostensive definitions the examples indicated are perceptually present to the audience, which is not the case for citational definitions (Parry and


Hacker, 1991, p. 113–114). Copi and Cohen (2009, p. 116) make a slightly different categorization: they distinguish between three types of extensional definitions: definitions by example, in which we list, or give examples of, the objects denoted by the term; ostensive definitions, in which we point to, or indicate by gesture, the extension of the term being defined; and semi-ostensive definitions, in which the pointing or gesture is accompanied by a descriptive phrase whose meaning is assumed to be known (Copi and Cohen, 2009, p. 116). We list five types of extensional definitions, where the basic distinction between citational and ostensive definitions is taken from Parry and Hacker (1991, p. 113–114). We also describe partitive concept definitions, definitions by paradigm example and contextual definitions. A limitation of the extensional definition type concerns expressions that have no observable or known denotata, or no denotata at all; these cannot be extensionally defined (Parry and Hacker, 1991, p. 115).

Citational definitions

“A citational definition is an extensional definition, in which some or all of the objects named by the definiendum are indicated verbally or represented by pictures, drawings, etc., but these objects are not perceptually present to the person to whom the definition is intended” (Parry and Hacker, 1991, p. 114). One should note that this understanding is not accepted by all authors, since the use of extralinguistic elements such as drawings and pictures in a dictionary is for some authors always an example of ostensive definitions (see below) (cf. Zgusta, 1971, p. 255). An example of the extensional defining strategy provided by the authors (Parry and Hacker, 1991, p. 113–114) is defining West-Germanic languages by giving positive examples of such languages (English, German, Dutch, etc.). Such a definition does not explicitly state the properties a language must have in order to be West-Germanic, but gives the names of such languages to exemplify these properties. An optional strategy used with extensional definitions is to also provide negative examples, which are close to the category being defined but do not fall within it (in the example of West-Germanic languages, Danish or Icelandic can be negative examples, since they are Germanic but not West-Germanic languages). If the negative examples do not share at least some properties of the defined category, they are not useful negative examples; e.g., Chinese as a negative example does not tell much about West-Germanic languages, since the only common point is that it is a language (Parry and Hacker, 1991, p. 113). Citational definitions are most frequently simply referred to as extensional definitions, but also as definitions by example or exemplifying definitions. A special case of extensional (citational) definitions, in which all the examples of a finite set are enumerated, are sometimes called enumerative definitions (cf. Wikipedia, 2013). The critics of this defining method say that its limitations are that it is not (always) possible to give a collection of cases that determines the exact meaning of a term and that one cannot be sure that the common element extracted is the right one (Lewis, 1929 in Westerhout, 2010, p. 39), since one cannot be sure which property or set of properties is being referred to (Parry and Hacker, 1991, p. 115). For instance, two terms with different intensions can have the same extension (e.g., equilateral triangle and equiangular triangle). Moreover, not all types of terms can be defined by this method (Robinson, 1972 in Westerhout, 2010, p. 39), and the condition is that the denotata are mutually known, otherwise the cited examples are of no use.


Ostensive definitions

“An ostensive definition is an extensional definition in which some or all of the objects denoted by the definiendum term are actually produced, presented, or shown to the audience” (Parry and Hacker, 1991, p. 114). Ostensive methods do not use only words to explain unknown concepts, but define the object by extralinguistic strategies, such as indicating or pointing at the object, using drawings, or using demonstrative expressions; as Zgusta (1971, p. 256) claims, “instead of differentiating semantic features, the ostensive definition indicates an example or some examples from extralinguistic world”. As already mentioned, Zgusta (ibid.) states that “the extreme case of an ostensive definition is a picture of the denotatum. Such a picture is an absolutely extralinguistic element within a dictionary”. For example, to define turquoise blue, an object of that color can be pointed at or verbally indicated in sentences such as That lamp shade is turquoise blue (Parry and Hacker, 1991, p. 114). When a descriptive phrase is added to the definiens, the definition type is by some authors called a quasi-ostensive definition (Copi and Cohen, 2009, p. 101). An example is the word ‘desk’ means this article of furniture, accompanied by the appropriate gesture, but such additions presuppose prior understanding of the phrase article of furniture. However, if no denotatum of the word is perceptually present, or if words, although perfectly meaningful, do not denote anything at all, an ostensive definition is impossible (Parry and Hacker, 1991, p. 115; Copi and Cohen, 2009, p. 101). For lexicographic and terminographic work this definition type is less relevant8, but it is a very frequent defining strategy in everyday life, in children’s language acquisition and in second language learning. Ostensive definitions are also sometimes called demonstrative definitions.

Definition by paradigm example

This definition type can be either ostensive or citational and is “a non-equivalential extensional definition using as example an object or objects intended to illustrate clearly and non-controversially the conventional intension of the definiendum term” (Parry and Hacker, 1991, p. 114). An example provided by the authors is when a person names Leonardo da Vinci or Rembrandt as paradigm examples of the term artist; the authors add that such a definition is often followed by some form of conceptual definition (Parry and Hacker, 1991, p. 114–115). As we understand the notion, instead of enumerating all or at least a ‘sufficient’ number of the elements in the extension of the definiendum, one (or several) good representative(s) of the class is chosen.

Partitive concept definitions

Svensen (1993, p. 122) mentions that the combination of separate parts belonging to a whole is sometimes ‘rather incorrectly’ also referred to as extension, but does not provide a separate category for this defining strategy. In Westerhout (2010), referring to Borsodi (1967), this type, in which the parts of the definiendum are listed, is called the anatomic definition type (e.g., A table consists of rows and columns), but she lists it under the analytical, intensional definition type. An example where the definiendum is an individual and not a general concept is defining Benelux as Belgium, the Netherlands, and Luxembourg (Svensen, 1993, p. 124). In the case when the definiendum is a general

8 Except for pictures in dictionaries, if considered as ostensive definitions.


collective concept, and consequently the concepts included in the definiens are all of the same kind, no listing is necessary and the definition usually has the following form (ibid.): quintet: group of five musicians.

Contextual definitions

Westerhout (2010, p. 39), referring to Gergonne (1818) and Borsodi (1967), explains the contextual (also called illustrative or implicative) definition type as definitions in which a term is used instead of mentioned and in which there is no distinction between definiendum and definiens, as no phrase equivalent to the term is provided. A sentence implies/illustrates what something means. An example provided by Robinson (1954, p. 106) is A square has two diagonals, and each of them divides the square into two right-angled isosceles triangles. This defining strategy is widely used in COBUILD dictionaries (cf. Sinclair, 1987a; Gantar and Krek, 2009). In contextual definitions, again, the definiendum’s function or properties can be highlighted, so they can be recategorized into other definition types.

2.1.6 Lexicographic principles of meaningful definitions

In monolingual dictionaries, the same language is being described and used for describing, and lexicography has identified several principles for writing meaningful definitions (Jackson, 2002, p. 2–3; Atkins and Rundell, 2008, p. 412). These principles and the problems related to them, as stated in the lexicographic literature, are enumerated below.
- A word should be defined in terms simpler than itself (Zgusta, 1971, p. 257). In other words, as Béjoint (2000, p. 195) says: “the definition works /.../ not only because the two sides (word and definition) have the same meaning, but also because the right-hand side (the definition or definiens) is more easily understandable than the left-hand side (the word, or definiendum)”. One perspective on implementing this principle is that words must be defined by more frequent words (Weinreich, 1962, p. 37 in Béjoint, 2000, p. 201). However, Béjoint (ibid.) warns that this is not possible if the word to be defined is itself very frequent and, additionally, that more frequent words are more polysemous. Jackson also comments that this rule cannot always be followed for ‘simple’ words (Jackson, 2002, p. 93). Similarly, Svensen (1993, p. 118) comments on paraphrases that they “should consist only of words that can be expected to be better known to the users than the headword or phrase they are intended to explain” and should be “more understandable than the headword” (Svensen, 1993, p. 119), as well as that “general-language words and phrases must not have technical paraphrases” (Svensen, 1993, p. 119). A special case of applying this rule is to have a systematically chosen range of (simple) words to be used in lexeme definitions: this is called a defining vocabulary. In terminographic work, this postulate is less applicable, since the use of technical vocabulary is important and needed. A technical-language definition of a technical term is often considerably more detailed and complex than a general-language definition (Svensen, 1993, p. 134).
- The previous principle can be contextualized with a more general, pragmatic principle: “The language used should be appropriate to the linguistic skills, and


presumed technical knowledge, of the user” (Atkins and Rundell, 2008, p. 412).
- A definition should be substitutable for the item being defined, and therefore the head noun of the definition phrase should belong to the same word class as the defined lexeme (Zgusta, 1971, p. 258). This point is discussed by Béjoint (2000, p. 205), who mentions that the rule cannot be respected in several cases, for example for function words. Therefore, a closely related principle is that different forms of definitions are appropriate for different types of words (Jackson, 2002, p. 94). Concerning paraphrases, Svensen (1993, p. 118) claims that “[a]s far as possible, a paraphrase should have a syntactic form such that it can be substituted for the headword or phrase in a passage of text without yielding an artificial-looking result”. This is an important point, and Zgusta (1971, p. 258), quoting Weinreich, goes even further when saying that “a claim of interchangeability between the term and its definition ... is preposterous for natural language”.
- Circularity should be avoided (Svensen, 1993, p. 126; Jackson, 2002, p. 93). The most unacceptable type of circularity is when a definition uses the word being defined, for example defining a triangle as a polygon in the form of a triangle (Svensen, 1993, p. 126). On the other hand, a very frequent—and according to Svensen (ibid.) perfectly acceptable—way is to use a part of the headword in the definition, since in many cases the name of the genus proximum can be the same as a part of the headword (e.g., blacksmith: smith who works with iron). Circularity can also go beyond a single definition, i.e., two or more lexemes should not be defined in terms of each other (Jackson, 2002, p. 93). An example of circularity in the definitions of two lexemes is (Svensen, 1993, p. 126): north: point of horizon to left of person facing east; left: direction of north when one is facing east. The main point is that the user “shouldn’t have to consult another definition to understand the one s/he is looking up” (Atkins and Rundell, 2008, p. 412). Circularity also depends on the definition type. For example, relational definitions are always exposed to circularity issues. The synonym-based defining strategy is also likely to create circularity (Jackson, 2002, p. 94). In partitive concept relationship definitions, the circularity can have the following form: cell: part of a battery; battery: group of cells. Béjoint (2000, p. 203) says that the simple types of circularity can be avoided, but that circularity with more transition steps is “unavoidable and probably does not cause any practical problems”. Svensen (1993, p. 127) goes even further and says that in the context of general-language lexicography “it may even happen that circularity within a certain group of definitions in a system is necessary for the system to operate”.
- Closed dictionary, where closed means that all the lexical items in the microstructure should also be elements of the macrostructure (Béjoint, 2000, p. 200). This point is related to the circularity restriction mentioned above and postulates that the words used in the definitions should be defined in their own entries. Béjoint questions this point with the example of small metalinguistic words, which would have to be defined according to this principle, but would not qualify for inclusion in the (e.g., terminological) dictionary by any other criteria. And as Béjoint (2000, p.
201) says, the more ‘scientific’ a dictionary is in its definitions, the more difficult it is for a lexicographer to ‘close it’.


- In the section on the genus-differentia definition type, we have already discussed that a definition should state the essential attributes of the species, as well as that it should be neither too specific nor too broad.
- Other style-related rules are that ambiguous, obscure or figurative language must not be used in a definition, and that a definition should not be negative when it can be affirmative (Copi and Cohen, 2009, p. 108–109). However, there are exceptions; e.g., bald can only be defined negatively (as the state of not having hair on one’s head). Another piece of stylistic advice is that the language should “conform as far as possible to ‘normal’ prose” (cf. Atkins and Rundell, 2008, p. 412, contextualized in Gantar and Krek, 2009).

In the context of Slovene lexicography and the analysis of dictionary definitions, several authors should be mentioned. Gantar and Krek (2009, p. 155) discuss different defining strategies by comparing classical dictionary definitions with full sentence definitions in the context of building the Slovene Lexical Database. The definitions of the reference dictionary of the Slovene language (SSKJ) have been discussed and critically analyzed by Kosem (2006), Rozman (2010), Müller (2008), and Gantar and Krek (2009).

2.2 Domain modeling: Computational perspective

The dissertation addresses (semi-)automatic domain modeling. In this section we provide an overview of different approaches to automatic term and definition extraction, as well as of approaches to taxonomy and ontology construction. We also present how domains related to the language technologies domain have been modeled in related research.

2.2.1 (Semi-)automatic domain modeling: Extracting terms, definitions, semantic relations and ontology construction

The dissertation deals with domain modeling by means of knowledge extraction, and in particular definition extraction from domain corpora. The basic units of knowledge in domain texts are terms. Automatic terminology extraction has been implemented for various languages (e.g., for English by Sclano and Velardi (2007), Ahmad et al. (2007), Frantzi and Ananiadou (1999), Kozakov et al. (2004)), and for Slovene by Vintar (2002, 2010). For bilingual term extraction, commercial (SDL MultiTerm, Similis (Planas, 2005)) and non-commercial (e.g., Gaussier, 1998; Kupiec, 1993; Itagaki et al., 2007; Lefever et al., 2009; Macken et al., 2013; Vintar, 2010) systems have been developed. A more complex knowledge extraction task, which is also the main focus of this thesis, is the definition extraction task. Definition extraction approaches have been developed for several languages: for English (Navigli and Velardi, 2010; Borg et al., 2010), Dutch (Westerhout, 2010), French (Malaisé et al., 2004), German (Fahmi and Bouma, 2006; Storrer and Wellinghoff, 2006; Walter, 2008), Chinese (Zhang and Jiang, 2009), Portuguese (Del Gaudio and Branco, 2007; Del Gaudio et al., 2013), Romanian (Iftene et al., 2007), Polish (Degórski et al., 2008a, 2008b) as well as for other Slavic languages such as Czech and Bulgarian (Przepiórkowski et al., 2007). For Slovene, we started to develop the methodology in Fišer et al. (2010) and Pollak et al. (2012a). Besides definitions, (semi-)automatic extraction of other types of knowledge-rich contexts (Meyer, 1994) is of great importance, especially for terminographic purposes. Definitions usually contain hypernymy relations, while the extraction of other knowledge-rich contexts is based on semantic relations such as meronymy (part-whole),


attribute relations (what something looks like), function/purpose, synonymy, antonymy or causal relations (Meyer, 2001; L'Homme and Marchman, 2006). Techniques for extracting definitions and semantically related concepts from large specialized corpora or web resources are either based on (manually crafted) rules or on machine learning, and recent studies often combine both. Rule-based approaches rely on pattern matching using mainly syntactic and lexical features, but in some cases also paralinguistic and/or layout information. The patterns can be manually crafted, where Hearst's (1992) method for the extraction of hyponym relations from large corpora, based on a set of lexico-syntactic patterns, is the main reference. Synonyms and hypernyms have been addressed in Malaisé et al. (2004). Other studies include patterns expressing meronymy (Berland and Charniak, 1999; Meyer, 2001; Girju et al., 2003), cause-effect (Marshmann et al., 2002; Feliu, 2004; Meyer et al., 1999; Garcia, 1997) or function patterns (Meyer et al., 1999). Several authors use technical texts, where more structured knowledge is observed. Muresan and Klavans (2002) propose the rule-based DEFINDER system to extract definitions from technical medical articles, aiming at automatic dictionary construction. They use cue phrases (such as “is a term for”, “is defined as”, etc.) and text markers (e.g., “-” or “()”) in combination with a finite state grammar (using part-of-speech rules and noun phrase chunking). They also perform grammar analysis based on statistical parsing to identify linguistic phenomena in definition writing (e.g., appositions, relative clauses, and anaphora). Storrer and Wellinghoff (2006) based their patterns on 19 verbs that typically appear in definitions and used their valency frames for definition extraction. Simple pattern-based methods perform worse on unstructured text, such as web pages; therefore, additional filtering methods are required, as proposed by Velardi et al. (2008) in the context of glossary building. They use patterns (e.g., “refers to”, “defines”, “is a”) to extract candidates from the web and then filter them with a domain filter (based on available domain terms) and a stylistic filter in order to obtain the definitions with a ‘genus et differentia’ structure from a specified domain. Distinguishing between different definition types is proposed in Westerhout and Monachesi (2007) and was used in the European project LT4eL (2008) to build glossaries for e-learning in eight languages. Walter and Pinkal (2006) applied definition extraction to the legal domain with an interest in ontology building. They performed different experiments based on 33 rules (based on connectors) and identified the 18 best-performing rules. Del Gaudio and Branco (2007) distinguish between the is, verb and punctuation definition types. In Pollak et al. (2012a) and in this thesis we propose a combination of different methods for selecting definition candidates—patterns, term extraction and WordNet-based hypernyms—and a web service implementation of the system (a minimal sketch of cue-phrase pattern matching is given below).

The second line of relevant work is based on fully automatic methods, often using machine learning (ML). ML techniques are often used in combination with pattern recognition approaches or, in more recent work, as the main learning approach. Compared to pattern-based approaches, ML techniques require more training data and have to deal with often unbalanced datasets. Manually crafted rules can be considerably improved by ML techniques.
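Before turning to the machine-learning line of work in more detail, the pattern-based approaches described above can be illustrated with a minimal Python sketch that filters sentences using a few hand-written cue-phrase patterns; the patterns and example sentences below are invented for illustration and are far simpler than the rules of the cited systems or of the method developed later in this thesis.

import re

# Illustrative cue-phrase patterns (hypothetical; real systems use richer lexico-syntactic rules).
CUE_PATTERNS = [
    re.compile(r"\bis defined as\b", re.IGNORECASE),
    re.compile(r"\brefers to\b", re.IGNORECASE),
    re.compile(r"^\s*\w[\w\s-]{0,60}\bis (a|an)\b", re.IGNORECASE),
]

def definition_candidates(sentences):
    """Return the sentences that match at least one cue-phrase pattern."""
    return [s for s in sentences if any(p.search(s) for p in CUE_PATTERNS)]

example_sentences = [
    "A lemmatiser is a tool that maps word forms to their base forms.",
    "The experiments were repeated on seven conference proceedings.",
    "Terminology extraction refers to the automatic identification of domain terms.",
]
print(definition_candidates(example_sentences))  # keeps the first and the third sentence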
For extracting definitions from medical articles, Fahmi and Bouma (2006) applied a rule-based approach based on cue phrases and improved its performance using Naive Bayes, maximum entropy and SVM classifiers. As part of their feature set they included sentence position, which is corpus-specific. Standard ML classifiers were also applied to Polish texts (Degórski et al., 2008a) and Slovene texts (Fišer et al., 2010). Westerhout (2009, 2010) reports a


combination of grammar and ML techniques used in different experiments for different definition types extracted from Dutch texts. The problem of unbalanced datasets, typical for definition extraction tasks, was approached by Kobyliński and Przepiórkowski (2008), who used a Balanced Random Forest classifier to extract definitions from Polish texts, and by Del Gaudio et al. (2013), using different algorithms on English, Dutch and Portuguese corpora. A fully automatic method is proposed by Borg (2009) and Borg et al. (2009, 2010), who use genetic algorithms to learn distinguishing features of definitions and non-definitions and to weight the individual features. Cui et al. (2007) propose a method for definitional question answering based on the use of probabilistic lexico-semantic patterns (i.e., soft patterns) that generalize over lexico-syntactic ‘hard’ patterns. Soft patterns allow partial matching by calculating a generative degree-of-match probability between the test instance and the set of training instances. Navigli and Velardi (2010) propose automatically learnt Word-Class Lattices (WCLs), a generalization of word lattices that they use to model textual definitions, where the lattices are learned from an annotated dataset of definitions from Wikipedia. Their method is applied to the task of definition and hypernym extraction from corpora, as well as from the web. The method has been adapted to French and Italian by automatically constructing the training sets from Wikipedia (Faralli and Navigli, 2013). Reiplinger et al. (2012) also show how definitional patterns can be extracted partly automatically, by bootstrapping an initial seed set of patterns.

The majority of studies focus on the extraction of definitions and hypernymy concept relations, but several other types of relations have also been explored. Navigli and Velardi (2010) and Snow et al. (2004) proposed methods for automatic hypernymy extraction. For meronyms, Girju et al. (2006) used machine learning techniques and WordNet. Ittoo and Bouma (2009) improved meronymy extraction precision by disambiguating polysemous meronymy patterns using a modified distributional hypothesis (Harris, 1968). Yang and Callan (2009) presented a metric-based framework for the task of automatic taxonomy induction and consider hypernymy and meronymy relations. Their framework incrementally clusters terms based on an ontology metric, using very heterogeneous sets of features (contextual features, co-occurrence, syntactic dependency, lexico-syntactic patterns...). Besides hypernymy and meronymy, Pantel and Pennacchiotti (2006) used generic patterns, refined using the web, for other types of relations, such as reaction, succession or production. Beyond definitions and single semantic relations, recent research has addressed the hierarchical organization of knowledge, i.e., (semi-)automatic taxonomy (a hierarchy of is-a relations) and ontology induction, using databases, textual data or the web (Buitelaar et al., 2005; Biemann, 2005; Gómez-Pérez and Manzano-Macho, 2003; Maedche and Staab, 2009). One research direction is to consider ontology learning as a classification/clustering task, relying on the hypothesis of distributional similarity (Harris, 1954), where similar contexts define similar concepts (Cohen and Widdows, 2009; Pado and Lapata, 2007). Such approaches are able to discover relations which do not explicitly appear in the text, but are less accurate and often unable to provide labels for the discovered semantic classes.
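As a toy illustration of the distributional-similarity idea just mentioned (similar contexts suggest similar concepts), the following sketch builds bag-of-words context vectors for a few invented target terms and compares them with cosine similarity; real systems work on large corpora and use much richer features and weighting schemes.

from collections import Counter
from math import sqrt

# Toy corpus: each sentence is a list of tokens (in practice, lemmatized corpus sentences).
sentences = [
    ["the", "parser", "analyses", "the", "sentence"],
    ["the", "tagger", "analyses", "the", "sentence"],
    ["the", "corpus", "contains", "many", "documents"],
]

def context_vector(term):
    """Bag of words co-occurring with the term in the same sentence."""
    context = Counter()
    for sentence in sentences:
        if term in sentence:
            context.update(w for w in sentence if w != term)
    return context

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

vectors = {t: context_vector(t) for t in ["parser", "tagger", "corpus"]}
print(cosine(vectors["parser"], vectors["tagger"]))  # 1.0: identical contexts
print(cosine(vectors["parser"], vectors["corpus"]))  # much lower: different contexts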
Others rely on syntactic information as the first step of taxonomy construction. Ontology-related tasks can be classified depending on whether the aim is to enlarge an existing, hand-crafted ontology (e.g., WordNet (Fellbaum, 1998) or the Open


Directory Project9), a task known as ontology extension and population, or to build one from scratch. Snow et al. (2006) proposed a probabilistic model, combining evidence from multiple classifiers using syntactic or contextual information for the incremental construction of taxonomies. Yang and Callan (2009) proposed an incremental clustering approach based on calculating the semantic distance of each term pair in a taxonomy. Kozareva and Hovy (2010) took an initial given set of root concepts and basic level terms and used Hearst-like lexico-syntactic patterns to recursively harvest new terms from the web. To induce taxonomic relations between intermediate concepts they then searched the web again with surface patterns. Finally, they applied graph-based methods for building the acyclic graph. In Navigli et al. (2011) and Velardi et al. (2013) the authors propose a well-performing method for inducing lexical taxonomies from scratch using a domain corpus or the web. Their first step is automatic term, definition and hypernym extraction (cf. Navigli and Velardi, 2010), which produces a cyclic, possibly disconnected graph; they then use their new algorithm for inducing a taxonomy from the graph (a simplified sketch of this graph-based step is given below).

From the methodological perspective, our definition extraction approach is most similar to the traditional pattern-based approach introduced by Hearst (1992), but its originality lies in the combination with other methods applying shallower criteria (using the automatically recognized domain terms and wordnet terms with their hypernyms). Unlike the majority of authors, we aim at extracting definitions in their broader sense, since we consider focusing only on the genus-differentia definition type too limiting, especially when dealing with languages with fewer resources available and when extracting definitions from highly specialized running text. Since our main contribution is definition extraction from Slovene, we encounter problems similar to those of authors working on other Slavic languages (e.g., Degórski et al., 2008; Przepiórkowski et al., 2007), since we deal with a morphologically rich, relatively free word order, determinerless language. Our results are comparable to the results of definition extraction for other Slavic languages. Content-wise, our research is most similar to Reiplinger et al. (2012), who model the computational linguistics domain (cf. Section 2.2.2 below). As we do, they limit their search to scientific articles, and we can see that in this setting even for English the results are far from those of very well performing systems using the web (e.g., Navigli and Velardi, 2010). Very frequently, authors limit the search to a selected predefined list of terms, while we prefer the perspective of Westerhout (2010), where the search is open to all definitions. She is also one of the few authors who, in line with our research, lays great stress on qualitative linguistic analysis and has the aim of final glossary construction. A big comparative advantage is that we implement our method as a freely available, online workflow, needing no previous installation or background knowledge to run the system on a new corpus, or—thanks to its modularity—to combine and compare it with other approaches. Alternatively, we could try to train the best-performing existing systems (e.g., Navigli et al., 2011) on Slovene corpora, which will be considered in further work.
The authors themselves (Faralli and Navigli, 2013) propose a solution for the automatic acquisition of reliable training sets for new languages from Wikipedia first sentences. We can also imagine adding information from wordnets and using parallel corpora. However, the performance of their lattice-based method has not yet been tested on any Slavic language.
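The graph-based taxonomy induction step mentioned above can be illustrated with the following simplified sketch, which is not the algorithm of Navigli et al. (2011) or Velardi et al. (2013): each extracted (term, hypernym, confidence) triple is treated as a weighted edge, every term keeps only its best-scoring hypernym, and cycles are broken by removing the weakest edge on each detected cycle. The triples in the usage example are invented.

def induce_taxonomy(triples):
    """triples: iterable of (term, hypernym, confidence); returns a term -> hypernym mapping."""
    parent = {}
    for term, hypernym, weight in triples:
        if term != hypernym and (term not in parent or weight > parent[term][1]):
            parent[term] = (hypernym, weight)      # keep the best-scoring hypernym per term
    for start in list(parent):                     # break cycles: walk upwards from every term
        path, node = [], start
        while node in parent and node not in path:
            path.append(node)
            node = parent[node][0]
        if node in path:                           # a cycle was found; cut its weakest edge
            cycle = path[path.index(node):]
            weakest = min(cycle, key=lambda t: parent[t][1])
            del parent[weakest]
    return {term: hypernym for term, (hypernym, _) in parent.items()}

triples = [("tagger", "tool", 0.9), ("tool", "artefact", 0.8),
           ("artefact", "tool", 0.3),              # a spurious edge creating a cycle
           ("lemmatiser", "tool", 0.7)]
print(induce_taxonomy(triples))
# -> {'tagger': 'tool', 'tool': 'artefact', 'lemmatiser': 'tool'}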

9 http://www.dmoz.org/ (Last accessed: December 1, 2013)


2.2.2 Modeling the domain of language technologies

For the present dissertation, we have chosen the domain of language technologies as the target modeling domain. There has been some previous work addressing closely related domains. The ACL Anthology (Kan and Bird, 2013) is a digital archive today comprising over 24,000 documents from conferences and journals in computational linguistics. A subset of this anthology was used for constructing the ACL ARC reference corpus (Bird et al., 2008). Based on the ACL Anthology, Radev et al. (2009, 2013) created the ACL Anthology Network (AAN), a manually curated networked database of citations, collaborations, and summaries in the field of computational linguistics. These resources were used for various studies. For topic discovery, to our knowledge, three studies have been performed. Hall et al. (2008) used topic models to analyze topic changes in the domain (trend analysis), using the unsupervised Latent Dirichlet Allocation (LDA) approach developed by Blei et al. (2003) for inducing topic clusters (a toy sketch of such LDA-based topic induction is given at the end of this section). They also introduced topic entropy for measuring the diversity of ideas, its change over time, as well as differences between computational linguistics conferences. Paul and Girju (2009) addressed interdisciplinary topic modeling (also using LDA), where computational linguistics topics were discovered and aligned with topics in the bordering fields of linguistics and education. Moreover, they dealt with topic changes over time and were interested in discovering how different languages are represented in the field. Anderson et al. (2012) used the ACL ARC and the AAN corpora, from which they automatically generated topics—again using LDA—which were later manually labeled, and the papers (and their authors) were attributed to these topics. The epochs in computational linguistics and the flow of authors between the areas were then analyzed. Radev et al. (2009, 2013) proposed citation analysis, more precisely statistics about paper citations (e.g., the most cited authors in the field), author citations, and author collaboration networks. In Radev and Abu-Jbara (2012) citation analysis was also used for identifying trends in computational linguistics, as well as for summarization purposes, finding controversial arguments, paraphrase extraction and other tasks. The most similar to our interest is the work of Reiplinger et al. (2012), extracting a glossary for the computational linguistics domain in English. The authors use lexico-syntactic patterns and deep syntactic analysis to extract candidates for glossary definitions. Several other experiments were performed and presented in the workshop proceedings edited by Banchs (2012). For example, Vogel and Jurafsky (2012) annotated the AAN corpus with authors' gender and analyzed the differences in topics chosen by male and female researchers (where the topic models were again constructed using LDA), while a text reuse and authenticity analysis of the ACL domain was performed by Gupta and Rosso (2012). Related fields of computer science and artificial intelligence have also been modeled in terms of definition extraction and taxonomy induction. In the LT4eL project (2008), the ICT and e-learning domain corpus was considered for definition extraction in several languages (e.g., Westerhout, 2010; Del Gaudio et al., 2013), while the artificial intelligence domain (in English) was modeled in Navigli et al. (2011) and Velardi et al. (2013).
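As a toy sketch of the LDA-based topic induction used in the studies cited above (the real experiments work on thousands of preprocessed papers, not on three invented token lists), the following gensim snippet induces two topics and prints their top words; grouping documents by year and inspecting topic proportions over time gives a simple form of trend analysis.

from gensim import corpora, models

documents = [
    ["parsing", "dependency", "treebank", "parser", "syntax"],
    ["speech", "recognition", "acoustic", "model", "phoneme"],
    ["parsing", "grammar", "treebank", "syntax", "dependency"],
]
dictionary = corpora.Dictionary(documents)                   # token -> id mapping
bow_corpus = [dictionary.doc2bow(doc) for doc in documents]  # bag-of-words vectors
lda = models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary,
                      passes=20, random_state=1)
for topic_id, topic in lda.print_topics(num_words=4):
    print(topic_id, topic)
print(lda.get_document_topics(bow_corpus[0]))                # topic mixture of the first document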
This thesis addresses the language technologies domain, modeling it by means of


definition extraction (which is the main part of the thesis) as well as by briefly analyzing it through an approach to semi-automatic topic ontology construction (cf. Smailović and Pollak, 2011, 2012). None of the above-mentioned authors modeled the Slovene domain or proposed the topic ontology view.

2.2.3 Web services and workflows

This section presents the underlying principles of workflow composition and execution, as the basis of the technology used for implementing the NLP definition extraction workflow presented in Section 6. To the best of our knowledge, a workflow-based approach using a web service implementation of term and definition extraction modules has not yet been proposed. Data mining environments which allow for workflow composition and execution, implemented using a visual programming paradigm, include Weka (Witten et al., 2011), Orange (Demšar et al., 2004), KNIME (Berthold et al., 2008) and RapidMiner (Mierswa et al., 2006). Their most important common feature is the implementation of a workflow canvas, where workflows can be constructed using simple drag, drop and connect operations on the available components. This feature makes the platforms suitable also for non-experts, due to the representation of complex procedures as sequences of simple processing steps (workflow components named widgets). In order to allow distributed processing, a service-oriented architecture has been employed in platforms such as Orange4WS (Podpečan et al., 2012) and Taverna (Hull et al., 2006). The utilization of web services as processing components enables parallelization, remote execution, and high availability by default. A service-oriented architecture supports not only distributed processing but also distributed workflow development. Sharing workflows is an appealing feature of the myExperiment website of Taverna (Hull et al., 2006). It allows users to publicly upload their workflows so that they become available to a wider audience, and a link may be published in a research paper. However, users who wish to view or execute these workflows are still required to install the specific software in which the workflows were designed. The ClowdFlows platform (Kranjc et al., 2012), used for constructing the definition extraction workflow in this thesis, implements the described features with a distinct advantage. ClowdFlows is browser-based, requires no installation and can be run on any device with an internet connection, using any modern web browser. ClowdFlows is implemented as a cloud-based application that takes the processing load off the client's machine and moves it to remote servers, where experiments can be run with or without user supervision. Moreover, the constructed workflows can be shared and directly executed, improving on the sharing facility of myExperiment.
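To make the notion of a widget more concrete, the following hypothetical sketch assumes the convention—used by ClowdFlows-like platforms—that a widget is implemented as a Python function mapping an input dictionary to an output dictionary; the input and output names are invented for illustration, and the actual widgets developed in this thesis are described in Section 6.

def split_sentences(input_dict):
    """Hypothetical widget: naively split a text into sentences."""
    text = input_dict["text"]                        # assumed input name
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    return {"sentences": sentences}                  # assumed output name

# Outside the platform a widget is just a function and can be tested locally:
print(split_sentences({"text": "A corpus is a collection of texts. It can be annotated."}))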


3 Problem description, corpus presentation and initial domain modeling

Extracting domain-specific knowledge from texts is a challenging research task, addressed by numerous researchers in the areas of natural language processing (NLP), information extraction and text mining. In this chapter we first define the problem of domain modeling in terms of topic ontology construction, terminology and definition extraction. Subsequently, the construction of the corpus is described, followed by a discussion of the different definition types encountered in a subset of our corpus.

3.1 Problem description

This dissertation addresses the problem of domain modeling from multilingual corpora with the aim of improving domain understanding. The main challenge addressed in the dissertation, and the main motivation for this research, is to develop a definition extraction methodology and a tool for extracting a set of candidate definition sentences from Slovene text corpora. The secondary goal is to adapt this methodology and the developed tool for definition extraction from English texts. In both cases, the focus is also on the linguistic analysis of the different kinds of definitions occurring in the corpus. As an auxiliary goal, we also address a different modeling approach, which enables initial domain structuring and understanding through topic ontology construction from the documents constituting the given corpus. The language technologies domain is used as a case study in this dissertation. The objects of analysis are specialized texts from a certain domain, i.e., scientific texts including conference papers, journal articles, dissertations, etc., where the domain of interest is the language technologies domain. The basic units of knowledge in specialized texts are concepts, which are designated by terms or described by definitions. Domain terminology therefore represents a first domain modeling step. While terminology acquisition formerly represented a tedious manual task in the process of terminological dictionary construction, research in automatic term extraction in the last decade has enabled automatic or semi-automatic terminology extraction for different languages, including Slovene (Vintar, 2002; 2010). A more complex domain modeling task, which is the main focus of this thesis, is the definition extraction task. Definitions of specialized concepts/terms are an important source of knowledge and an invaluable part of dictionaries, thesauri, ontologies and lexica; therefore many approaches for their extraction have been proposed by NLP researchers, with very good results achieved especially for English. However, for Slavic languages the results are less favorable, which can be attributed especially to rich inflection and free word order (cf. Przepiórkowski, 2007). The first experiments in Slovene definition extraction were carried out in our own research (Fišer et al., 2010; Pollak et al., 2012a). While Navigli and Velardi (2010), Navigli et al. (2011) and Velardi et al. (2013), for example, extract definitions not only from domain corpora but also by searching for definitions on the Web, our research addresses domain modeling for a


given domain corpus, assuming the lack of additional information for languages with poorer web resources. With the exception of the work by E. Westerhout (2010), the concept of definition itself is rarely discussed in detail or given enough attention in the interpretation of results. A popular way to circumvent the fuzziness of the “definition of definition” is to label all non-ideal candidates as defining or knowledge-rich contexts to be validated by the user. In this thesis, we devote a great deal of attention to the analysis of the extracted definition candidates and discuss many borderline cases. Our work is mainly focused on Slovene, a Slavic language with a very complex morphology and less fixed word order; hence the approaches developed for English and other Germanic languages, based on very large—often web-crawled—text corpora, may not be easy to adapt. In general, definition extraction systems for Slavic languages perform much worse than comparable English systems, as reported by, e.g., Przepiórkowski et al. (2007), Degórski et al. (2008a, 2008b), and Kobyliński and Przepiórkowski (2008). One of the reasons is that many Slavic languages, including Slovene,10 lack appropriate preprocessing tools, such as good lemmatization and part-of-speech tagging systems, parsers and chunkers, needed for the implementation of well-performing definition extraction methods. Another obstacle is the fact that very large domain corpora are rarely readily available.

The presented work builds on Fišer et al. (2010), in which we reported on the methodology and the experiments with definition extraction from a Slovene popular science corpus (consisting mostly of textbook texts). In that work, in addition to definition candidate extraction, we used a classifier trained on Wikipedia definitions to help distinguish between good and bad definition candidates, where the first sentences in Wikipedia typically follow the Aristotelian per genus et differentiam structure (“X is Y which/that...”), in which a term to be defined (definiendum X) is defined using its hypernym (genus Y) and the difference (differentia, introduced by which/that …) that distinguishes X from other instances of class Y (a simplified sketch of harvesting such first sentences is given below). When analyzing these results we already observed that the main reason for the mismatch between the classifier's accuracy on Wikipedia definitions and on those extracted from textbooks was the fact that, in authentic running texts of various specialized genres, definitions run an entire gamut of different forms. In this thesis—unlike in most related work, where definitions are extracted from textbooks, manuals, Wikipedia articles or large web collections—the corpus consists of a limited amount of scientific articles and theses (in the language technologies domain), where concepts are more complex and knowledge is encoded in linguistically more intricate structures. In this setting, the assumption that definitions follow exclusively the Aristotelian per genus et differentiam structure is unrealistic, and other approaches to definition extraction need to be developed. Moreover, the domain is not chosen as a proof of concept only: the extracted definitions are indeed proposed as a basis for the construction of a Slovene HLT glossary or terminological dictionary. Finally, another focus of this thesis is to explore the variety of definition types appearing in running text.
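The idea of using Wikipedia first sentences as (weakly labelled) examples of per genus et differentiam definitions can be sketched as follows; this is a simplified illustration using the public MediaWiki API, not the exact procedure of Fišer et al. (2010), and the article title in the example is arbitrary.

import re
import requests

def wikipedia_first_sentence(title, lang="sl"):
    """Fetch the plain-text introduction of a Wikipedia article and return its first sentence."""
    response = requests.get(
        "https://%s.wikipedia.org/w/api.php" % lang,
        params={"action": "query", "prop": "extracts", "exintro": 1,
                "explaintext": 1, "titles": title, "format": "json"},
        timeout=10,
    )
    pages = response.json()["query"]["pages"]
    extract = next(iter(pages.values())).get("extract", "")
    match = re.search(r".+?[.!?](?=\s|$)", extract)  # naive first-sentence split
    return match.group(0) if match else extract

print(wikipedia_first_sentence("Jezikoslovje"))      # arbitrary example article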
An important distinguishing feature of this thesis is the implementation of the developed methodology in ClowdFlows (Kranjc et al., 2012), a novel, browser-based workflow construction and execution environment. This approach enables

10 Some of the tools for Slovene were elaborated very recently, but were not yet available at the time of conception of the work presented in this thesis (e.g., Grčar et al., 2012; Dobrovoljc et al., 2012).


the reuse of the developed workflow for definition extraction from any new corpus, as well as the reuse of individual workflow components in the construction of new natural language processing workflows.

3.2 Building the Language Technologies Corpus
For English, the task of definition extraction is often performed on large corpora, in some cases gathered from the Web. For highly specialized domains and for languages other than English, the Web may not provide an ideal source for the corpus, especially not for the purpose of terminology extraction, where a certain level of representativeness and domain coverage is crucial. Within this dissertation we created a corpus of specialized texts. We decided to consider the domain of language technologies for several reasons. Firstly, the domain is not yet modeled in terms of Slovene terminology and definitions (the state of the art for English was discussed in Chapter 2). Secondly, the language technologies domain is an example of a highly specialized domain where researchers write more in English than in their mother tongue; this situation is very typical for Slovene specialized language also in other domains. Finally, the language technologies domain was selected because the evaluation of the results requires a basic comprehension of the domain vocabulary and of the extracted definitions. For covering the domain of language technologies we proceeded in the following way. The most straightforward decision was to consider the papers published in the Proceedings of the biennial Language Technologies conference Jezikovne tehnologije, organized in Slovenia since 1998. Seven consecutive conference proceedings (1998–2010) were included. The articles in the proceedings are in Slovene or in English. In the rest of the thesis we refer to this part of the corpus as the Language Technologies Conference proceedings corpus (LTC proceedings corpus), or simply the ‘small corpus’. To improve vocabulary coverage we added other text types from the same domain, including Bachelor’s, Master’s and PhD theses, as well as several book chapters and Wikipedia articles on the subject. These documents, too, are in Slovene or in English. The extended corpus, which includes the small corpus, is hereafter referred to as the Language Technologies Corpus (LT corpus) or simply the ‘main corpus’ or ‘the corpus’. The Slovene part of the corpus is representative of the domain of language technologies as explored in Slovenia, while the English subcorpus was built as a corpus comparable to the Slovene part in terms of size, text types and topics (it also includes several articles by Slovene authors writing in English). The total size of the main corpus, the LT corpus, is 44,749 sentences (903,189 word tokens; 1,089,968 tokens including punctuation) for Slovene, and 43,018 sentences (909,606 word tokens; 1,073,470 tokens including punctuation) for English.

LT corpus                           Slovene     English     TOTAL
Sentences                           44,749      43,018      87,767
Word tokens (without punctuation)   903,189     909,606     1,812,795
Tokens (including punctuation)      1,089,968   1,073,470   2,163,438

Table 1. Counts of the Language Technologies Corpus in terms of sentences and word tokens.
We have not yet released the corpora for public use, since we have not obtained the authors’ permissions and have therefore used the collected documents only for personal use and for scientific purposes. However, the Slovene part of the LTC proceedings


corpus, after detailed preprocessing, was included in the concordancer by the editor of the proceedings and is available at nl.ijs.si:3003/cuwi/sdjt_sl. In further work, we might consider obtaining the necessary authors’ permissions in order to release the entire corpus for public use.
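To make the sentence and token counts reported in Table 1 concrete, the following minimal Python sketch shows how such statistics can be computed over a collection of plain-text documents. It assumes one UTF-8 text file per document and a naive regular-expression sentence splitter and tokeniser (the folder layout in the usage comment is hypothetical), so its output would not exactly match the counts reported above.

import re
from pathlib import Path

SENT_SPLIT = re.compile(r'(?<=[.!?])\s+')   # naive sentence boundary
TOKEN = re.compile(r'\w+|[^\w\s]')          # words/digits or single punctuation marks

def corpus_stats(folder):
    """Count sentences, word tokens and all tokens (including punctuation)."""
    sentences = all_tokens = word_tokens = 0
    for path in Path(folder).glob('*.txt'):
        text = path.read_text(encoding='utf-8')
        for sent in SENT_SPLIT.split(text):
            if not sent.strip():
                continue
            sentences += 1
            tokens = TOKEN.findall(sent)
            all_tokens += len(tokens)                            # including punctuation
            word_tokens += sum(t[0].isalnum() for t in tokens)   # words, digits, abbreviations
    return sentences, word_tokens, all_tokens

# e.g. corpus_stats('corpus/sl') -> (sentences, word tokens, tokens incl. punctuation)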

3.2.1 Constructing the small LTC proceedings corpus
The small LTC proceedings corpus consists of articles published in the proceedings of the language technologies conference Jezikovne tehnologije, which has been organized every two years since 1998 as a conference within the Information Society Multiconference.11 The size of the small corpus is 545,641 word tokens (640,095 tokens including the punctuation marks). In more detail, the Slovene part has 330,698 tokens including punctuation and 279,508 if we count only word tokens (including digits, abbreviations, etc.). The English part contains 309,397 tokens including punctuation and 266,133 tokens counting only words, abbreviations and digits. The proceedings are available online (http://www.sdjt.si/konference.html). As the articles in Slovene and English are available as PDF documents, we had to transform them into an appropriate raw text format for further processing.12 PDF-to-text conversion was performed using PDFBox13 or Nitro PDF reader,14 and in some cases where these tools did not produce good results, the “Extract PDF text” functionality available in Mac Automator was used. In general, Nitro PDF reader was better for extracting text from older articles and PDFBox for extracting text from newer ones. For a few articles where none of the tools performed well enough, we contacted the authors to obtain the articles in Word format. The text files were transformed to UTF-8 encoding. Some errors still remained, especially with the Slovene characters č, š, ž in some documents, and PDF-specific errors, such as errors with the letter f and word splitting at the end of lines, which were corrected using the find and replace function. The corpus was then annotated with several XML tags. Using mainly Perl scripts (but also some manual intervention) we annotated specific parts of the corpus, such as the title, abstract, references at the end of the article, tables, authors, footnotes, etc., and added a language identifier tag to Slovene and English articles or parts of articles, respectively. Based on the tagged corpus, we discarded the parts that we judged likely to produce noise for our task, such as the lists of references at the end of the articles, authors and their institutions, tables (but we kept the table and figure captions), footnote and page numbers, etc. The contents of tables, examples and footnotes were reinserted at the end of each document, so as not to break the original sentences in the main text. Finally, two subcorpora were created, one containing the articles in Slovene and the other those in English, in which each article was assigned a unique ID containing the information about the source (JT- for the proceedings of Jezikovne Tehnologije), followed by the year of publication, the article number, the language and the information about the text type: long article (Lart), short abstract (Sabs), as well as title

11 http://is.ijs.si
12 I wish to thank Jasmina Smailović for her help with corpus preprocessing in the framework of the work presented in Smailović and Pollak (2011).
13 http://www.codeproject.com/KB/string/pdf2text.aspx
14 http://www.nitropdf.com/


translations in the other language or quoted text in the other language (Stit and Scit, respectively). The corpus statistics in terms of the number of articles (or shorter parts of articles) per year are provided in Table 2 and Table 3. One can see that there are 109 Slovene articles (Lart) and 81 English ones, and that each subcorpus also contains smaller parts of articles, such as abstracts, translated titles and quotations.

Type/Year            1998  2000  2002  2004  2006  2008  2010  Total
Article (Lart)         17    12    28    15    15    12    10    109
Abstract (Sabs)         1     0     0     1    37     0     4     43
Tran. title (Stit)      0     0     0     0    30     0     1     31
Quotation (Scit)        0     0     0     0     0     0     0      0
Total                  18    12    28    16    82    12    15    183

Table 2. Counts of the Slovene small corpus covering the Proceedings of the Language Technologies conference. For each year, the number of articles and of other included units is provided.

Type/Year            1998  2000  2002  2004  2006  2008  2010  Total
Article (Lart)          4     5     9     8    37    11     7     81
Abstract (Sabs)        17    11     4    11    15    12     9     79
Tran. title (Stit)      0     0     0     1    15     0     2     18
Quotation (Scit)        0     0     1     3     0     0     0      4
Total                  21    16    14    23    67    23    18    182

Table 3. Counts of the English small corpus covering the Proceedings of the Language Technologies conference. For each year, the number of articles and of other included units is provided.
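As an illustration of the identifier scheme described above, the following is a minimal sketch of how such IDs can be composed; the exact formatting (separators, zero-padding) used in the actual corpus files may differ.

def make_doc_id(source, year, number, lang, text_type):
    """Compose a document ID from the source (e.g. 'JT' for the Jezikovne tehnologije
    proceedings), the year of publication, a running article number, the language
    ('sl' or 'en') and the text type ('Lart', 'Sabs', 'Stit' or 'Scit')."""
    return f"{source}-{year}-{number:03d}-{lang}-{text_type}"

# e.g. make_doc_id('JT', 2006, 41, 'sl', 'Lart') -> 'JT-2006-041-sl-Lart'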

3.2.2 Constructing the main Language Technologies Corpus (LT corpus)
Since the small LTC proceedings corpus is rather limited in size and text types, we decided to extend it with other types of texts from the same domain, especially with several Bachelor’s, Master’s and PhD theses, and book chapters. For the choice of texts we15 proceeded in the following way: besides the LTC proceedings corpus, which is the most representative collection of articles on the topic in Slovene, we first included the special issue of the Jezik in slovstvo journal on language technologies, and then proceeded by searching the national library archive and online collections by keywords or by contacting authors by e-mail. The English part of the main corpus was built as a corpus comparable to the Slovene part, also covering articles written by Slovene authors in English. On the extended corpus, i.e., the Language Technologies (LT) Corpus, which includes the LTC proceedings corpus and is further in the thesis referred to also simply as the corpus, much less preprocessing work was performed than on the more structured small LTC proceedings corpus. We did not use the automatic scripts that we used for cleaning the small corpus, since the main corpus is much more diverse and does not have easily identifiable patterns (sections are not uniformly numbered, abstracts in English, which were in the LTC proceedings corpus always one paragraph long, can be

15 This part of the corpus was collected with the help of students Živa Malovrh and Janja Sterle.


much longer, etc.). This corpus was mainly processed manually, but except for deleting the sections in other languages (abstracts, etc.), it still contains a lot of noise, which later also proved to be a problem in the definition extraction task. The total size of the main corpus is 44,749 sentences (903,189 word tokens; 1,089,968 tokens including punctuation) for Slovene, and 43,018 sentences (909,606 word tokens; 1,073,470 tokens including punctuation) for English. From these counts one can see that English sentences are on average longer than the Slovene ones, which can also influence the performance of definition extraction methods. The corpus statistics in terms of text types are provided in the pie charts of Figure 1 and Figure 2.

Figure 1. Slovene part of the main Language Technologies Corpus by text type.

Figure 2. English part of the main Language Technologies Corpus by text type.


3.3 Domain modeling through topic ontology construction
After providing the main information about the corpus in terms of its size and text types, we want to get a quick overview of the topics it covers. In order to support initial domain/corpus understanding, we use a simple modeling approach based on document clustering. This approach enables initial domain structuring through so-called topic ontology construction from the documents constituting the given corpus, for which we use the OntoGen topic ontology construction tool (Fortuna et al., 2006a; 2007). While an ontology is a “formal, explicit specification of a shared conceptualization” (Gruber, 1993), represented as a set of domain concepts and the relationships between these concepts, which can be used for domain modeling and reasoning about the domain entities, a topic ontology (Fortuna et al., 2006a) is a set of domain topics or concepts, formed of related documents, represented by the most characteristic topic keywords and related by the subconcept-of relationship. OntoGen (Fortuna et al., 2006a; 2007) is a semi-automatic and data-driven topic ontology construction tool, available at http://ontogen.ijs.si/. Semi-automatic means that the system is an interactive tool that aids the user during the topic ontology construction process. Data-driven means that most of the aid provided by the system is based on the underlying text data (document corpus) provided by the user. The system combines text mining techniques (k-means document clustering) with an efficient user interface to reduce both the time spent and the complexity of manual ontology construction for the user. The result of user-guided document clustering is a topic ontology, i.e., a hierarchical organization of the documents’ topics and their sub-topics. In OntoGen, the hierarchical decomposition of a given set of documents into document subsets is performed semi-automatically by k-means clustering. The parameter k is defined by the user at each step of the multi-layer hierarchical ontology construction process, and each subdomain is described by its main topics, i.e., the n most frequent keywords describing the document cluster. In this research, we used OntoGen separately on the Slovene and English parts of the corpus, for both the small LTC proceedings corpus and the main Language Technologies Corpus (see also Smailović and Pollak, 2011; 2012). The motivation for building a topic ontology is as follows. A topic ontology provides an initial idea of the corpus coverage in terms of topics, and in contrast to manually built ontologies, it represents the corpus content and not (or, better said, to a lesser extent) our perception of it. Moreover, in this type of ontology the concepts are grounded in documents, meaning that in future work (not yet implemented in this thesis) the extracted glossary could be organized hierarchically, where the terms and their definitions could be attributed to topics and sub-topics (in a thesaurus-like structure).
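As an illustration only, the clustering step underlying this process can be sketched in a few lines of Python, assuming a TF-IDF bag-of-words representation of the documents; OntoGen itself adds the interactive interface, the hierarchical refinement of clusters and its own preprocessing on top of this basic idea, so this is not its actual implementation.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_topics(documents, k, n_keywords=3):
    """Split documents into k clusters and describe each cluster by its
    most characteristic keywords (as used for naming the topics)."""
    vectorizer = TfidfVectorizer(stop_words='english')   # lemmatized input assumed for Slovene
    X = vectorizer.fit_transform(documents)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    terms = vectorizer.get_feature_names_out()
    topics = []
    for centroid in km.cluster_centers_:
        top = centroid.argsort()[::-1][:n_keywords]
        topics.append([terms[i] for i in top])            # e.g. ['speech', 'synthesis', 'vowel']
    return km.labels_, topics

Calling such a function with k=2 on the English subcorpus would correspond to the top-level split into two core topics discussed in Section 3.3.1 below.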

3.3.1 Modeling the LTC proceedings corpus
When a topic ontology is built automatically, by only suggesting to OntoGen the number of clusters k at each node of the hierarchy, the result for the English articles is the one shown in Figure 3. For every node, we tried different k values and chose the one that split the document set in the best way according to the user’s understanding of the domain. As one can see from Figure 3, the names of the concepts/topics are not intuitive, and in some cases it is hard to understand the concept that they represent. This happens because


OntoGen names each concept using the first three most frequent words from the automatically constructed keyword list. For example, if a concept is described by the keywords slovenian, translation, vowel, speakers, synthesis, speech, corpus, tagging, etc., OntoGen will name this concept “slovenian, translation, vowel”. A better way of naming the concepts is to involve the user, who can quickly find an appropriate concept name after observing all the topic keywords. Using this approach, the previous topic could be called Speech technologies. All the concepts in the English and Slovene topic ontologies shown in Figure 4 and Figure 5 were thus manually renamed based on the automatically extracted topic keywords. Moreover, we observed that several topics/concepts were not present in the topic ontology. For terms that often occur in the keyword lists of different concepts but are not among the three main topic keywords, we decided to use the semi-supervised method for adding topics. It is based on the Support Vector Machine (SVM)16 active learning method of OntoGen. For the English corpus, we entered queries for the Speech recognition and Speech translation concepts and answered some automatically proposed questions, such as “Would you classify document number 41 as an article on the topic of Speech recognition?”, which enabled the system to label the instances. After the concept node was constructed, it was added to the ontology as a sub-concept of the selected concept, in our case as a sub-concept of the Speech technologies concept. Similarly, we performed active learning on the Slovene corpus. We entered a query for the Prevajanje govora (Speech translation) concept, and based on our confirmation or rejection of the automatically proposed articles to be attributed to this topic (active learning), OntoGen learnt which articles should be attributed to the topic, and the new subconcept was added to the ontology. After manually renaming the concepts, using active learning for adding concepts, and manually moving some documents from one concept to another, we obtained an improved topic ontology. The resulting English topic ontology is shown in Figure 4. This ontology is more intuitive and understandable than the one shown in Figure 3. One can see from Figure 4 that the Language Technologies Corpus is divided into Computational linguistics and Speech technologies as its core topics. This is also the general division of the field of language technology (e.g., in Wikipedia, language technology is defined as follows: “Language technology is often called human language technology (HLT) or natural language processing (NLP) and consists of computational linguistics (or CL) and speech technology as its core but includes also many application oriented aspects of them.”). Thus, OntoGen split the root concept correctly; we only had to change the sub-concepts’ names. More manual work (supervised learning and manually moving some documents from one concept to another) was needed in further concept splitting at the lower nodes of the hierarchy. The Slovene topic ontology (after renaming the concepts, using active learning and manually moving some documents from one concept to another) is shown in Figure 5. One can see from the figure that the Slovene topic ontology is simpler than the English one, yet it required more manual work, i.e., moving some documents from one concept to another.
Besides the differences in the corpus content itself, this can partly be attributed to the different preprocessing of the English and Slovene texts.
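A rough sketch of the SVM-based active-learning step used for adding a topic is given below; the ask_user function is a hypothetical stand-in for OntoGen’s yes/no dialogue, the seed set must contain at least one positive and one negative document, and the actual procedure implemented in OntoGen (Fortuna et al., 2006b) differs in its details.

import numpy as np
from sklearn.svm import LinearSVC

def add_topic_by_active_learning(X, seed_idx, seed_labels, ask_user, n_queries=10):
    """X: document-term matrix; seed_idx/seed_labels: a few documents labelled in
    advance as belonging (1) or not belonging (0) to the queried topic."""
    labelled = dict(zip(seed_idx, seed_labels))
    for _ in range(n_queries):
        clf = LinearSVC().fit(X[list(labelled)], list(labelled.values()))
        margins = np.abs(clf.decision_function(X))
        margins[list(labelled)] = np.inf        # do not ask about already labelled documents
        query = int(np.argmin(margins))         # document the classifier is least certain about
        labelled[query] = ask_user(query)       # e.g. "Is document 41 about Speech recognition?"
    return clf, labelled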

16 SVM is a supervised learning method which constructs a separating hyperplane in a high-dimensional space of features (words) that has the largest distance to the nearest data points (documents) of different classes. See Fortuna et al. (2006b) for details on the use of SVM for active learning in OntoGen.


Figure 3. English topic ontology (LTC proceedings corpus) without cleaning the documents and without renaming concepts.

Figure 4. English topic ontology (LTC proceedings corpus) after manually renaming the concepts, using active learning for adding concepts and manually moving some documents from one concept to another.


Figure 5. Slovene topic ontology (LTC proceedings corpus) after manually renaming the concepts, using active learning for adding concepts and manually moving some documents from one concept to another.

Because OntoGen does not provide stemming for Slovene, we lemmatized the input text documents in the data preprocessing step. As for stop-word removal, the lists included in OntoGen may also differ between the two languages. Interestingly, one third of the Slovene articles belong to the topic Korpusno jezikoslovje (Corpus linguistics). Note that the Corpus linguistics category in our ontology also comprises the compilation of corpora and the development of tools for corpus processing (such as PoS taggers), and is thus not used in its strict sense, but in the sense of corpus construction and use. Concept visualization in the form of a Document Atlas (Fortuna et al., 2006c) is another functionality of OntoGen. It is based on dimensionality reduction for document visualization: the main concepts are first extracted from the documents, and this information is then used to position the documents on a two-dimensional plane via multidimensional scaling. Documents are presented as crosses on a map and the density is shown as a texture in the background of the map (the lighter the color, the higher the density). The most common keywords are shown for the areas around the map, and therefore the same keyword can occur more than once. The document visualization for the English articles can be seen in Figure 6. The two main concepts are marked with green and orange dashed lines. In the upper left corner of the concept visualization one can notice some non-standard characters. These reveal OntoGen’s encoding problems, probably due to the Slovene characters present in authors’ names and references or to some special characters from the tables. The concept visualization for the Slovene articles can be seen in Figure 7. The visualization is similar to the one for the English articles, i.e., the articles are divided into two major topics (Computational linguistics and Speech technologies) and the Computational linguistics topic contains many more articles than the Speech technologies topic.


Figure 6. Concept visualization of English text documents (LTC proceedings corpus). The automatic splitting into two main topics, which we label computational linguistics and speech technologies, can be observed.

Figure 7. Concept visualization from Slovene text documents (LTC proceedings corpus). The automatic splitting into two main topics that we label Računalniško jezikoslovje (Computational linguistics) and Govorne tehnologije (Speech technologies) can be observed.
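The idea behind the Document Atlas visualizations in Figures 6 and 7 can be sketched as follows, assuming a TF-IDF representation, truncated SVD for extracting the latent “concepts” and multidimensional scaling for the two-dimensional layout; the actual Document Atlas additionally computes keyword labels and the density texture, and its implementation differs from this sketch.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances

def document_map(documents, n_concepts=10):
    """Place documents on a 2-D plane so that similar documents lie close together."""
    X = TfidfVectorizer().fit_transform(documents)             # bag-of-words representation
    concepts = TruncatedSVD(n_components=n_concepts,
                            random_state=0).fit_transform(X)   # latent "concepts"
    dist = pairwise_distances(concepts, metric='cosine')
    coords = MDS(n_components=2, dissimilarity='precomputed',
                 random_state=0).fit_transform(dist)
    return coords                                              # one (x, y) point per document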


3.3.2 Modeling the Language Technologies corpus
For modeling the main corpus, we used all the methods described above. We manually moved several documents from one category to another, renamed the concepts expressed by keywords and used the active learning option. Much more manual work was needed than with the small corpus, since the categories were less clear; we attribute this to the fact that the main corpus is much more heterogeneous, with the length of the documents varying from very short abstracts, or even single sentences from translated titles, to entire doctoral dissertations. We also used the Document Atlas as an aid in semi-automatic ontology construction. The resulting topic ontologies of the Slovene and English language technologies domains are shown in Figure 8 and Figure 9.

Figure 8. Topic ontology constructed from the entire Slovene LT corpus.

A precise evaluation of the ontology coverage is a very hard task. A test that we will consider in further work is to compare an ontology created with OntoGen to an ontology manually created from scratch by two equally competent experts. This type of experiment could tell us whether we gain time and whether OntoGen finds all the concepts present in the corpus that the expert finds. Note that OntoGen itself was evaluated in experiments led by its developers (Fortuna et al., 2007) and proved to be of good help in the human ontology construction process. Here we propose an evaluation of the estimated coverage. Since we do not have a gold standard ontology for this specific corpus, we were only able to approximately evaluate the coverage of the language technologies research area for individual topics, as illustrated on the following sub-topic. In Wikipedia, the concept of Speech technologies is divided into six subfields (Speech synthesis, Speech recognition, Speaker recognition, Speaker verification, Speech compression, Multimodal interaction). In the main English corpus we can see that two (Speech synthesis, Speech recognition) of these six concepts are covered by the ontology; moreover, there is the topic of Dialog systems, which can be understood as partly


covering the topic of Multimodal interaction. The remaining uncovered topics are therefore Speaker recognition, Speaker verification and Speech compression. Speaker recognition is indeed a missing concept, since it occurs 9 times in the main corpus; it seems that the two related sub-domains (Speech recognition and Speaker recognition) were in fact merged into a common category. On the other hand, Speaker verification (even if the concept itself is highly related to Speaker recognition) and Speech compression do not occur in the corpus more than once, meaning that the constructed ontology adequately reflects the nature of the corpus. In contrast, the concept of Speech translation, which was present in the small corpus ontology, does not exist in the main corpus ontology and is therefore indeed missing from the ontology. Another concept present in the small corpus but not in the main corpus is Text mining, which is related to Language technologies but is not its direct sub-concept. The lesson learnt is thus to take into consideration all the concepts that were identified in the smaller corpus and inspect them as candidates for active learning on the larger corpus. As already stated, the ontologies we presented are topic ontologies, where the instances are documents, the classes are topics and subtopics, and the relations are hierarchical topic/sub-topic relations and not any other type of relations. This is a different approach from, e.g., the one of Navigli et al. (2011), where taxonomies are created from sentences in the documents, by searching for terms and their hypernyms and connecting them into a taxonomy.

Figure 9. Topic ontology constructed from the entire English LT corpus.

In this section we obtained an initial insight into the topics and subtopics of the language technologies domain. In the core part of this thesis, focusing on definition extraction, we thus expect to extract definitions covering these topics. Alternatively, this topic ontology construction phase could be used to determine the missing subtopics and to enlarge the corpus accordingly. In further work, we may also consider using the selected categories for hierarchically organizing the final glossary.


3.4 Setting the stage for automatic definition extraction: Analyzing definitions in running text
As stated in Section 3.1, the main task in this dissertation is the extraction of candidate definition sentences from Slovene and English unstructured text, applied to the domain of language technologies, with a focus on the linguistic analysis of the different kinds of definitions figuring in the corpus. We explicitly refer to unstructured texts in order to highlight the differences with the tasks where knowledge is extracted from (semi-)structured resources, such as (online) dictionaries, encyclopedias or Wikipedia. The main difference is that when extracting definitions from unstructured text, we focus on full-sentence definitions, where the definitions are often embedded in longer sentences. Moreover, unless we deal with special collections of articles, we cannot use position information (or at least not in a straightforward way), which is a very important factor when searching for definitions in Wikipedia articles, in which in most cases the first sentence is a definition. As linguistic analysis is of paramount importance for this thesis, this section discusses the types of definitions encountered in scientific articles in the language technologies domain, showing the complexity of the task addressed; even the question of what constitutes a definition is itself non-trivial. As discussed in Chapter 2, there is little agreement even among linguists, terminologists and domain experts on what constitutes a valid definition of a concept, and even less on the distinction between a definition and an explanation on the one hand, and a definition and a defining context on the other. In the strictest view, definitions cover only sentences with a genus and differentia structure (i.e., analytical definitions), while in other views many other definitional types are accepted, such as extensional, functional, relational or typifying definitions, to name but a few. In this section, we analyze a subsample of definition sentences to see how the categories discussed in Section 2.1.5 apply to our corpus. The most straightforward way of defining a term is, as already discussed by Aristotle,17 to provide the hypernym of the concept (genus) and the differentia that distinguishes the species from other instances of the same genus. This definition type assumes that the term to be defined (definiendum) is always a species and not an individual. However, for (semi-)automatic definition extraction, a limitation to this type of definitions could be insufficient. If we consider very specialized domains, especially in languages other than English (which is the lingua franca for a large part of scientific production), we often have quite limited resources for a specific domain. Our corpus, too, contains a limited collection of articles from a specific domain, using mainly highly specialized scientific discourse, characterized by the fact that a high level of prior knowledge is presupposed, basic terms are considered common knowledge, and additional information is provided through references to related work rather than through definitions and explanations. Therefore, one can expect that the terms will be defined in the text with many other defining strategies than only by using the above-mentioned genus-differentia definition type.
Even if other ways of defining a concept are less clear-cut cases and it is more difficult to decide whether a sentence is a definition or not, considering other definition types is necessary for the extraction of definitions from

17 Aristotle's Metaphysics. Stanford Encyclopedia of Philosophy, http://plato.stanford.edu/entries/aristotle-metaphysics/, (Last accessed: February 3, 2013).


completely unstructured resources. Therefore, in this thesis, we decided to use a broader interpretation of definition than the one defining a term by its genus and differentia, and accepted a large variety of definition types. We did not use any a priori condition on the formal structure of a definition sentence, and agreed with the interpretation that “as there is no formal terminological status for meaning, there is also no formal designation, meaning or definition of definition” (Dolezal, 1992, p. 2; p. 103). We simply evaluated as positive candidates the sentences that could be used (in their original or possibly manually edited form) for building a glossary, i.e., a list of terms particular to a field of knowledge (extracted from the corpus) together with their definitions. To get an initial insight into the kinds of definitions that can be found in running text, we manually annotated a sample of our corpus with “definition” vs. “non-definition” tags. In this section we analyze and discuss the different definition types, as well as the borderline examples. Below we enumerate and provide examples for the different definition types into which we manually categorized the selected set of definitions from the Slovene subcorpus, and discuss their specificities. All the examples are authentic sentences from the corpus. For each of them we provide an English translation. Note that in some cases the translation does not entirely reflect the discussed issue (e.g., when the word order of a Slovene sentence is discussed) and we therefore add a comment or a literal translation in the footnote. The examples in this section illustrate the difficulty of the task of definition extraction from the given specialized corpus addressed in this thesis, as the definitions in this corpus are substantially more complicated than the ones in simpler corpora such as textbooks or semi-structured resources, including Wikipedia.

3.4.1 Genus et differentia definition type
Genus et differentia: verb be
This category comprises formal definitions with the genus et differentia structure (“X is Y”). In Slovene, the definiendum (X) and the genus term (Y) are noun phrases in the nominative case (see Example (i)). The determiners (a, the), which can be good indicators for English structures (“a/the X is a/the Y, which...”), are not used in Slovene. The verb be can be realized in the third person singular, dual or plural form. The dual verb form is specific to Slovene and very few other languages and is used to refer to two entities only. The third person dual form of the verb biti [be] is sta and the plural form is so.

i. Lombardov efekt je pojav, pri katerem govorec poveča glasnost govora ob povečanju glasnosti šuma ozadja. [The Lombard effect is a phenomenon where a speaker increases the intensity of the speech when the level of background noise rises.]18

This sentence has a typical definition structure: the term to be defined, i.e., the definiendum, is a noun phrase (Lombardov efekt, where Lombardov is an adjective and efekt is a noun

18 Note that as the determiners a and the are not used in Slovene, a literal translation of the beginning of the sentence would be Lombard effect is phenomenon /.../. In order to facilitate the reading of the translated sentences, we have opted for non-literal translations where appropriate.


in the nominative case), followed by the third person present tense form of the copula verb biti [be] (je [is]) and a genus, also a noun in the nominative case (pojav [phenomenon]). It is followed by the differentia part, i.e., the part distinguishing the definiendum from other instances of the genus category, expressed in a relative clause.
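To make the structural description above concrete, the sketch below shows a purely illustrative surface pattern for the English counterpart of this structure (“X is a/the Y which/that/where ...”); the hand-crafted patterns actually used in Sections 5.1.1 and 5.2.1 are defined over lemmas and morphosyntactic tags (e.g., noun phrases in the nominative case for Slovene) rather than over raw strings, and differ from this toy example.

import re

# Toy pattern: definiendum + "is a/an/the" + genus + relative pronoun (differentia follows).
IS_A_PATTERN = re.compile(
    r'^(?P<definiendum>[A-Z][\w -]+?)\s+is\s+(?:a|an|the)\s+'
    r'(?P<genus>[\w -]+?),?\s+(?:which|that|where)\b')

def looks_like_definition(sentence):
    """Return True if the sentence matches the toy genus-differentia pattern."""
    return bool(IS_A_PATTERN.match(sentence.strip()))

# looks_like_definition("The Lombard effect is a phenomenon where a speaker increases "
#                       "the intensity of the speech when the level of background noise rises.")
# -> True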

Other variations of this definition type are:
- The definiendum is followed by the term’s English translation (Latin or Greek translations are less common in our corpus). The translation of a term can be explicitly introduced by the abbreviation angl. or ang. [Eng.], as in Example (ii), or simply given in parentheses (see Example (iii)). Sometimes, a Slovene synonymous expression for a term is provided, either between parentheses or separated by the conjunction ali [or] (cf. iv).

ii. Branje ustnic (angl. lipreading) je vizualna percepcija govora, ki temelji izključno na opazovanju artikulatornih gibov (ustnic) govorca brez poslušanja. [Lipreading (Engl. lipreading) is a visual perception of speech that is based exclusively on the observation of articulatory movements (of the lips) of the speaker without listening to him.]

iii. Avtomatsko oblikoslovno označevanje (part-of-speech tagging oz. word-class syntactic tagging) je postopek, pri katerem se vsaki besedi, ki se v besedilu pojavi, pripiše oblikoslovna oznaka. [Part-of-speech tagging (part-of-speech tagging or word-class syntactic tagging) is a process where each word occurring in a text is assigned a part-of-speech tag.]

iv. Leksikalne ontologije ali semantične mreže so sistemi kategorij med besednimi in pojmovnimi sistemi (med urejenimi glosarskimi, slovarskimi, tezaverskimi, enciklopedičnimi in terminološkimi zbirkami), ki obe vrsti sistemov povezujejo in se naslanjajo predvsem na taksonomsko organizirano zgradbo pojmov nekega področja. [Lexical ontologies or semantic networks are systems of categories between word and conceptual systems (between ordered glossary, dictionary, thesaurus, encyclopedia or terminology collections) that connect both types of systems and are based especially on a taxonomic organization of concepts of a given domain.]

- The genus noun phrase can also contain general words, such as expression, word, kind or type (see Example (v)). In this case, the term is not followed directly by a copula and a hypernym noun phrase; instead, the copula is followed by an introductory expression before the real hypernym. We can also place in this category the word part, although the “part_of” relation is a meronymic and not a hypernymic relation.

v. Besedna poravnava (Word Alignment) je izraz za tehnologijo statističnega pridobivanja leksikonov prevodnih ustreznic iz vzporednih korpusov. [Word alignment (Word Alignment) is an expression for the technology of statistical acquisition of lexicons of translation equivalents from parallel corpora.]


- The demonstrative pronoun can precede the genus noun phrase, as in the examples below:

vi. Osnova je tisti del besede, ki ima predmetni pomen, končnica pa tisti, ki zaznamuje slovnične lastnosti besede. [The root is that19 part of the word that bears the concept meaning, while the ending is the one that describes the grammatical properties of the word.]

vii. Ujemanje je »/t/ista vrsta slovnične, skladenjske vezi med besedami samostalniške besedne zveze oz. med stavčnimi členi, ko se odvisna beseda (ali stavčni člen) v sklonu, številu, osebi ali tudi spolu ravna po svojem nadrejenem delu /…/«. [Agreement is “/t/hat type of grammatical, syntactic link between words of the noun phrase or between sentence elements in which a subordinate word (or sentence element) agrees in case, number, person or gender with its main constituent /.../”.]

- The definiendum does not necessarily occur at the beginning of the sentence. In Example (viii) it is the word sopomenke [synonyms] that is defined by its hypernym words, placed at the beginning of the sentence.

viii. Besede so sopomenke, če jih je v besedilu mogoče zamenjati, brez da bi pri tem spremenili njegov pomen. (sic!)20 [Words are synonyms if they can be replaced in the text without changing its meaning.]

- The definition scope can be explicitly limited to a domain, an author’s theory or just a specific interpretation. In these cases, if the definition scope is provided at the beginning of the sentence, the copula verb comes first, followed by the word being defined and the genus.

ix. Tako v filozofiji kot v računalništvu je ontologija po definiciji predstavitev entitet, idej in dogodkov, skupaj z njihovimi lastnostmi in medsebojnimi razmerji – glede na izbrani sistem kategorij. [Both in philosophy and in computer science, an ontology is by definition a presentation of entities, ideas and events, together with their properties and relations – regarding the chosen system of categories.]21

x. Kot podatkovne strukture so semantične mreže usmerjeni grafi, v katerih so pojmi predstavljeni s točkami oz. vozlišči, razmerja pa s puščicami oz. povezavami med njimi. [As data structures, semantic networks are directed graphs, in which concepts are represented by points or nodes, and relations by arrows or links between them.]22

19 In natural English text one would use the definite article the instead of the demonstrative that. This comment refers to both examples (vi and vii).
20 brez da bi is a very common error in Slovene; the correct grammatical construction is ne da bi. Another mistake in the sentence is the use of njegovega [its] instead of njihov [their].
21 In Slovene, the word order is as follows: In philosophy and in computer science, is the ontology by definition a presentation of entities, ideas and events /.../.


Genus et differentia structure: verbs other than verb be
This category covers definitions with the genus et differentia structure where the verb is other than the verb biti [be]; alternative verbs include definirati, imenovati, opredeliti, meriti, predstavljati, biti znan pod imenom, veljati za, nanašati se, govoriti o [define, designate, measure, present, be known under the name, be considered as, refer to, speak about]. The definiendum, genus and differentia do not necessarily occur in this order, and all the other variations mentioned above are possible. Consider Example (xi), where the definition scope is limited to the field of literary studies.

xi. Znanstvenokritične izdaje se v literarnih vedah imenujejo tiste edicije, v katerih so besedila pregledana, prepisana, rekonstruirana, komentirana in naposled objavljena po načelih tekstne kritike ali ekdotike kot pomožne literarnovedne discipline. [In literary studies, critical editions denote those editions in which the texts are checked, transcribed, reconstructed, commented and finally published according to the principles of textual criticism or ecdotics as a supporting literary discipline.]

In the example above, the verb imenovati se [be called, designate] is a reflexive verb, and the structure is thus even more complicated, since the particle se is separated from the main verb. If we analyze the elements of this sentence in more detail, we observe the following:

[Znanstvenokritične izdaje]definiendum [se]pronoun of reflexive verb [v literarnih vedah]domain(scope) [imenujejo]verb [tiste]demonstrative_pronoun [edicije]genus, [v katerih so besedila pregledana, prepisana, rekonstruirana, komentirana in naposled objavljena po načelih tekstne kritike ali ekdotike kot pomožne literarnovedne discipline]differentia.

[Critical editions]definiendum []pronoun of reflexive verb [in literary studies]domain(scope) [denote]verb [those]demonstrative_pronoun [editions]genus, [in which the texts are checked, transcribed, reconstructed, commented and finally published according to the principles of textual criticism or ecdotics as a supporting literary discipline]differentia.

Consider Example (xii) concerning the verb biti predstavljen [be presented] and Example (xiii) for biti definiran [be defined]:

xii. V klasični teoriji (Katz in Fodor 1963) so pomeni besed predstavljeni kot množice potrebnih in zadostnih pogojev, ki zajemajo pojmovno vsebino, izraženo z besedami. [In classical theory (Katz and Fodor 1963), the meanings of words are presented as sets of necessary and sufficient conditions that include their conceptual meaning, explicated by words.]23

22 As in the example above, the literal translation of the beginning of the sentence is As data structures, are the semantic networks directed graphs /.../.
23 The word order in Slovene is as follows: In classical theory (Katz and Fodor 1963) are the meanings of words presented as sets of necessary and sufficient conditions, that /.../, meaning that the copula verb precedes the definiendum and the definiens (the words serving to define the definiendum).


xiii. V tej shemi so diskurzni označevalci definirani kot izrazi, ki k vsebini diskurza ne prispevajo nič ali skoraj nič, pojavljajo pa se v naslednjih pragmatičnih funkcijah: - vzpostavljanje povezave z vsebino prejšnjega oziroma sledečega diskurza, - vzpostavljanje in razvijanje odnosa med sogovorniki, - izražanje odnosa govorca do prejšnje oziroma sledeče vsebine diskurza, - organiziranje poteka diskurza na ravni prehodov med temami pogovora, menjavanja vlog in strukture izjave. [In this schema, discourse markers are defined as expressions that contribute nothing or nearly nothing to the discourse content, but appear in the following pragmatic functions: - forming a connection to the content of the preceding or following discourse, - building and developing the relation between co-speakers, - expressing the speaker’s attitude towards the preceding or following discourse content, - organizing the discourse development in terms of topic switching, role changing and utterance structure.]24

In the example below, veljati [be considered] and biti znan pod imenom [be known as] are used.

xiv. Tradicionalno velja ontologija za vejo filozofije in je bila dolgo znana pod imenom metafizika, ukvarja se z vprašanji t.i. entitet, ki obstajajo ali veljajo za obstoječe; kako se te entitete združujejo v večje razrede; kako so znotraj njih razdeljene hierarhično v smislu podobnosti in razlik. [Ontology is traditionally considered a branch of philosophy and was for a long time known under the name metaphysics; it considers the questions of so-called entities that exist or are considered as existing; how these entities are grouped into larger classes; and how they are hierarchically organized within these classes in terms of similarities and differences.]

xv. V skladu z jezikoslovno tradicijo se nanaša pojem simbolične prozodije na govorne značilnosti, ki se ne nanašajo na en sam fonetični segment, glas, temveč na večje enote, ki vključujejo več fonetičnih segmentov, kot so besede, fraze, stavki ali celo večji odseki govorjenega besedila. [According to the linguistic tradition, the concept of symbolic prosody refers to speech properties that do not refer to one single phonetic segment, a sound, but to larger units that include several phonetic segments, such as words, phrases, clauses or even larger parts of spoken text.]

In Example (xvi) the expression govorimo o [we talk about] is used.

xvi. Kadar gre za dvoumnost, pri kateri so različni pomeni besede med seboj povezani, govorimo o polisemiji ali večpomenskosti (npr. miška, ki je lahko del računalnika ali glodavec). [When the ambiguity is concerned,25 where the different meanings of a word are connected, we talk about polysemy (e.g., mouse, which can be a part of a computer or a rodent).]

Genus et differentia: without a verb
This category does not correspond to the formal structure “X is Y” but still defines the concept by the same strategy (using a hypernym and the differentia). The link between the

24 As in the example above, the copula verb occurs immediately after the introductory part: In this schema are the discourse markers defined as /.../.
25 The original Slovene structure starts with When “it goes about” ambiguity, /.../.


definiendum and the definiens (the defining part of the sentence) is not a defining verb; instead, the definiens is provided in an embedded clause, as part of a sentence that is itself not necessarily a definition. For this type, some manual refinement is generally needed. In the first example below, the definition is introduced by torej [hence], while in the second one it is just an apposition without any introductory element. In both examples, oziroma is also used to introduce two alternative synonymous expressions: izhodiščni oziroma prvi jezik [source or first language] and, in the second example, leksikalne enote oziroma leksemi [lexical units or lexemes].

xvii. Pri korpusih usvajanja tujega jezika sta pomembna ciljni jezik, torej jezik, "ki se ga nekdo uči z namenom, da bi ga obvladal bodisi kot svoj prvi, drugi ali tuji jezik" (Pirih Svetina 2005), in izhodiščni oziroma prvi jezik, "iz katerega se nekdo uči vse druge ali tuje jezike" (navedeno delo). [In corpora of foreign language learning, the target language, hence the language “that someone learns with the purpose of mastering it as his first, second or foreign language” (Pirih Svetina 2005), and the source or first language, “from which the person learns all the other languages” (ibid.), are both important.]

xviii. V središču vsake semantične zbirke, pa tudi pričujoče raziskave, so leksikalne enote oziroma leksemi, osnovni gradniki pomena v jeziku. [At the core of every semantic collection, as well as of the present study, are the lexical units or lexemes, the main building blocks of meaning in a language.]

Informal definitions can also be provided in parentheses, as illustrated below.

xix. Pri tem pristopu naletimo na posebnosti, ki jih lahko razdelimo v dve skupini: leksikalne vrzeli (pojem, ki je v nekem jeziku izražen z leksikalno enoto, je v drugem mogoče izraziti samo s prosto kombinacijo besed) in denotacijske razlike (v ciljnem jeziku obstaja prevodna ustreznica pojma izvornega jezika, vendar je nekoliko splošnejša ali nekoliko bolj specifična). [In this approach we encounter two groups of special cases: lexical gaps (a concept that is expressed with a lexical unit in one language can only be expressed by a free combination of words in the other) and denotational differences (in the target language there is a translation equivalent of the source language concept, but it is somewhat more general or more specific).]

Proper nouns (named entities)
All the defining strategies mentioned above can also be used for defining a named entity. This does not concern the formal structure of a definition, but rather its semantics. Whether a named entity should be considered a term to be defined or not depends on the final application.

xx. FIDA je referenčni korpus slovenskega pisanega jezika in obsega 100 mio besed iz različnih tekstovnih virov iz obdobja 1990-1999. [FIDA is the Slovene written language reference corpus and comprises 100 million words from different text sources from the period 1990-1999.]


3.4.2 Defining by paraphrases, synonyms, sibling concepts or antonyms
As noted in Section 2.1.5, even if in lexicography the analytical (genus-differentia) definition type has the prestigious status of being ‘the best definition type’, alternative methods should also be considered. While definition extraction research most commonly focuses on analytical definitions, the analysis of the corpus definitions presented in this section (and in the further experiments of this thesis) shows that the corpus definitions cover a large variety of definition types. In contrast to the previous category of analytical genus-differentia definitions, this section presents definitions that define a term by means of synonyms and paraphrases, sibling concepts or other related terms such as antonyms.

Paraphrases and synonyms
Some of the terms in the Language Technologies Corpus are defined by paraphrases or synonyms. All the definitions that define a new term in relation to other terms can be problematic with regard to circularity. Consider the example below, where the term enopojavnica [unique word] is defined through the synonym hapax legomena. If such a synonym is not defined in the same sentence or in the same collection, or is not part of general knowledge, we get a cyclic structure where the term is in fact not defined at all. The definition of the synonymous term should therefore be (possibly manually) added to the resulting domain dictionary.

xxi. Enopojavnice v korpusnem jezikoslovju imenujemo tudi hapax legomena in predstavljajo posebej zanimivo področje raziskovanja. [Unique words in corpus linguistics are also called hapax legomena and represent a particularly interesting domain of research.]

Sibling concepts
Sibling concept definitions can be understood as what E. Westerhout (2010, p. 37, referring also to Borsodi, 1967) introduces as analogic definitions, which are in that classification a subtype of synonymous definitions (e.g., Hyves is something like Myspace).26 Similarly to the synonymous definition type, circularity can be a problem; however, if the sibling concept is part of general knowledge, the definition is less problematic than if both terms were domain specific. In a definition of a term, a sibling concept can be used alone (as in the above example of Westerhout). Alternatively, it can be used in combination with other defining techniques, where the sibling concept can, for instance, replace the genus of a genus-differentia structure, or be used in addition to defining by genus-differentia, paraphrases, etc. (see Example xxii).

xxii. Wikislovar je sorodni projekt Wikipedije in je prost večjezični slovar z definicijami, izvorom besed, naglaševanjem in navedki. [Wiktionary is a related project of Wikipedia and is a free multilingual dictionary with definitions, etymologies, pronunciation and citations.]

In Example (xxiii) below, classical dictionaries are used as a sibling concept, but the differentia is stated by means of a functional defining strategy (see below).

26 It is debatable whether in this sentence the term Hyves is actually defined.


xxiii. Za razliko od klasičnih slovarjev semantične zbirke pomen besede definirajo glede na to, kako je ta povezan s pomeni drugih besed. [In contrast to classical dictionaries, semantic collections define the meaning of a word according to its relation to the meanings of other words.]

Antonyms and other relational definitions
In addition to synonyms or sibling concepts, other semantic relations can be used in definition sentences. In Section 2.1.5 we already introduced the antonymic and meronymic subtypes of relational definitions (where meronymic definitions, as defined by Borsodi (1967) and Westerhout (2010), situate a term between two other terms). The next sentence is an example of a relational definition, explaining that the term hyponymy denotes the inverse relation of hypernymy. However, as already noted, this is a borderline case, since one term is defined by another term.

xxiv. Najpogostejša relacija je hipernimija, s tem pa tudi njena inverzna relacija hiponimija. [The most common relation is hypernymy, and with it also its inverse relation hyponymy.]

The cases containing a circulus in definiendo, meaning that one unknown term is defined by means of another unknown term within the sentence or within the collection of definitions, are borderline definitions. Circularity is known to be problematic when defining a concept, and ideally these sentences should not be considered valid definitions if, e.g., hyponymy in the example above is not defined by other means in the same collection. However, to some extent circularity is, as discussed in Chapter 2, acceptable and unavoidable. Moreover, in tasks addressing the (semi-)automatic extraction of definitions from a specific domain with very limited resources, one can consider these sentences as borderline cases, but still as a sort of definition.

3.4.3 Extensional definitions
In the examples above we discussed intensional definitions, which provide the meaning of a term typically by specifying the necessary and sufficient conditions for belonging to the set being defined. Extensional definitions, on the other hand, define a term by enumerating the objects (all or typical examples) that fall under the term in question. In Section 2.1.5 we enumerated several types of extensional definitions. Since in our setting ostensive definitions are not relevant, we use the term extensional definitions mainly to refer to citational extensional definitions or partitive-concept definitions. Instead of specifying the hypernym, extensional definitions list (all/typical) realizations of a concept. In the example below, different taxonomical relations are listed (hypo- and hypernymy, meronymy, holonymy, troponymy) and some of these concepts are even additionally defined within the sentence or illustrated by examples.

xxv. Poleg nad- in podpomenskosti sta taksonomski razmerji tudi meronimija in holonimija, ki izražata odnos med delom in celoto (npr. volan ↔ avto), med glagoli pa troponimija, ki povezuje glagole glede na način izvajanja nekega dejanja (npr. govoriti ↔ šepetati) (Fellbaum 2002). [Besides hyper- and hyponymy, other taxonomical relations are meronymy and holonymy, which express the relation between a part and a whole (e.g., steering wheel ↔ car), and, between verbs, troponymy, which connects verbs according to the manner of performing an action (e.g., to speak ↔ to whisper) (Fellbaum 2002).]


3.4.4 Other types of definitions: defining by purpose or properties
Defining by purpose (functional definitions)
In this definition type the term can be defined without a hypernym: the term is defined by its purpose, what it is used for, etc. This is illustrated in Examples (xxvi) and (xxvii).

xxvi. Leksikalna semantika se ukvarja s pomenom besed in proučuje različne vidike besednega pomena, ki se realizirajo v tipični (pa tudi netipični) rabi v slovnično ustreznih kontekstih. [Lexical semantics deals with the meaning of words and investigates different aspects of word meaning that are realized in typical (as well as untypical) usage in grammatically appropriate contexts.]

xxvii. Sintetizator govora lahko pretvori poljubno slovensko besedilo v razumljiv računalniški govor. [A speech synthesizer can transform any Slovene text into comprehensible computer speech.]

Note that the last definition is too specific, because a speech synthesizer can transform any text (not only Slovene text) into comprehensible computer speech. Such examples are borderline cases, since they need manual refinement. As in the genus et differentia examples, several variations were observed, such as naming the author of a definition, as shown in Example (xxviii) below.

xxviii. Po Corazzonu ponuja ontologija merila, ki razlikujejo med seboj različne vrste stvari (konkretne od abstraktnih, obstoječe od neobstoječih, realne od idealnih, neodvisne od odvisnih …) [According to Corazzon, an ontology provides the criteria with which different types of things can be distinguished (concrete from abstract, existing from non-existing, real from ideal, independent from dependent ...)]

Verbs such as morajo [must] or skušajo [try] can be used in these definition types.

xxix. Programi za oblikoslovno označevanje morajo poljubnim besednim oblikam določiti možne oznake, nato pa izmed teh oznak izbrati pravo glede na kontekst, v katerem se besedna oblika pojavi. [Programs for part-of-speech tagging must assign all possible part-of-speech tags to any word form and then choose the right one among them based on the context in which the word form occurs.]

xxx. V zadnjem času so na področju računalniške obdelave naravnega jezika izjemno popularne statistične metode, ki z modeliranjem jezika s pomočjo velikih količin podatkov, dobljenih iz korpusov, in strojnim učenjem skušajo zaobiti potrebo po dragem in zamudnem ustvarjanju semantičnih virov. [Recently, statistical methods have become very popular in the domain of natural language processing; by modeling the language with the help of large amounts of data obtained from corpora, and by using machine learning, they try to circumvent the need for the expensive and time-consuming creation of semantic resources.]

xxxi. Reprezentativnost je sicer relativna kategorija, saj je nemogoče predvideti in v korpus zajeti vse besedilne variante, vendar pa se skuša z merili reprezentativnosti zajeti vsaj ključne, ki pa morajo vključevati čim več jezikovnih variant.


[Representativeness is a relative category, as it is impossible to foresee and include in the corpus all text variants; however, the criteria of representativeness aim to capture at least the key ones, which should include as many language variants as possible.]

Defining by properties (typifying definitions)
In this definition type, too, the hypernym can be omitted. In the example below, antonyms are defined by their characteristic property, while the expected hypernym is not expressed (in contrast, a formal genus et differentia definition of the word antonym might begin with “Antonyms are words that...”). Even if these kinds of definitions can be considered incomplete due to the missing hypernym, they are certainly good candidates for automatic extraction and further manual refinement.

xxxii. Za protipomenke je značilno, da imajo skupnih večino element (sic!) pomena, s to razliko, da zavzemajo skrajne vrednosti neke dimenzije (npr. vroče ↔ mrzlo). [It is characteristic of antonyms that they share the majority of elements of meaning, with the difference that they take the extreme values of a certain dimension (e.g., hot ↔ cold).]

Sometimes the boundary between defining by purpose and defining by properties is not very clear, as in Example (xxxiii), where the word lastnost [property] could easily be substituted by the word vloga [role] (“their main property is to establish and develop the relationship between co-speakers” could just as well be expressed as “their main role is to establish and develop the relationship between co-speakers”).

xxxiii. Tukaj za označevalce pragmatične strukture uporabljam izraz interakcijski označevalci, saj lastne predhodne raziskave (Verdonik et al., 2007; Verdonik et al., v tisku) kažejo, da je njihova osrednja lastnost vzpostavljanje in razvijanje odnosa med sogovorniki, izraz pragmatičen pa je lahko zelo široko in različno razumljen. [Here, for the markers of pragmatic structure, I use the expression interaction markers, because my previous research (Verdonik et al., 2007; Verdonik et al., in press) shows that their main property is to establish and develop the relationship between co-speakers, while the expression pragmatic can be understood very widely and non-uniformly.]

The definitions discussed above can also be embedded in a non-definitional sentence, where the definition is introduced in a relative clause, as illustrated in the sentence below.

xxxiv. Najbolj ohlapna so asociativna razmerja, ki povezujejo besede iz istega pomenskega polja (npr. zdravnik ↔ bolnišnica) in jih psihologi ponavadi pridobivajo s pomočjo asociativnih testov (Kilgarriff in Yallop 2000). [The loosest are the associative relations, which connect words from the same semantic field (e.g., doctor ↔ hospital) and which psychologists usually acquire using association tests (Kilgarriff and Yallop 2000).]

In this analysis we have shown that a simple “X is_a Y” pattern corresponding to the genus et differentia definition type is far from being the only way of defining concepts: it offers too restrictive a view of definitions as they occur in running text, and even this pattern raises several questions for discussion, for instance when the hypernym Y is too general or too specific. The different categories introduced in this linguistic analysis are


used in the discussion of the results in the rest of this thesis. They also serve as a basis for defining hand-crafted patterns for automatic extraction of definitions, presented in Sections 5.1.1 and 5.2.1 for Slovene and English, respectively. Further on in this thesis, Section 5.3.4 reconsiders the question of the different definition types introduced here, complementing this analysis with new examples and insights from a much larger set of automatically extracted Slovene and English definitions. This leads us to a brief conclusion and overview of the entire chapter. In summary, in Section 3.1 we described the task of the thesis, i.e., modeling a domain from Slovene and English text corpora, focusing mainly on the definition extraction task. In Section 3.2 we presented the building of the Language Technologies Corpus, our domain of interest. After the presentation of the corpora, we described the domain with initial models, separately for the Slovene and English subcorpora, in the form of topic ontologies (cf. Section 3.3). This domain modeling step provides an overall understanding of the topics covered by the corpus. In the last section, Section 3.4, we analyzed the same corpus from a different angle: for a better understanding of the definition extraction task from the Language Technologies Corpus, we analyzed a subset of definitions and classified them into different definition types depending on the strategies used for defining a term.


4 Methodology and background technologies

The main challenge tackled in this dissertation is to develop a definition extraction methodology and implement it as an online workflow. This chapter summarizes the main definition extraction methods (Section 4.1) and introduces the measures used for evaluating definition candidates (Section 4.2). The background technologies used in the definition extraction process are described in Section 4.3, while the evaluation of the reimplemented lemmatizer and morphosyntactic tagger (ToTrTaLe) and of the term extractor (LUIZ) is presented in Section 4.4. The evaluation of the background technologies is important, since their performance influences the definition extraction results.

4.1 Overview of the definition extraction methodology

The dissertation addresses domain modeling from specialized corpora, where the main challenge is to model the domain through definition extraction, used by the user as a means of glossary construction. This section presents a brief overview of the developed methodology. A top-level overview of the methodology is shown in Figure 10. Starting from a specialized corpus, the first step is text preprocessing (segmentation and tokenization, lemmatization, as well as morphosyntactic annotation of the corpus). Next, the domain is modeled by means of terminology extraction (i.e., extraction of terms specific to a given domain), followed by definition extraction, which results in a set of definition candidates. The term and definition candidates are proposed to the user for the final manual glossary construction phase. Note, however, that the schematic representation below is simplified: the process is semi-automatic rather than automatic, and the user is in the loop all the time. The user can intervene between the different phases: e.g., after corpus collection she can inspect the corpus with the OntoGen topic ontology construction approach (not included in the scheme, since it is not part of the definition extraction methodology) and complete or filter the corpus based on her findings. After the preprocessing step the user can decide to manually correct some errors. After the term extraction, one could filter the list of terms to be input to the next phase, and it is always the user who defines the parameters for the definition extraction phase. The final glossary construction step still requires quite some manual work but could be further automated in the future. The methodology is available as a workflow (cf. Chapter 6), and most probably, for any real application, the user will run the workflow several times and actively participate in the process.
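To make the sequence of phases concrete, the following minimal sketch expresses the pipeline as plain Python functions. All names and function bodies are illustrative stubs rather than the actual ToTrTaLe, LUIZ or definition extraction implementations, and the optional user_review hook only indicates where the user can intervene between phases.

```python
# Illustrative stubs only -- not the actual ToTrTaLe/LUIZ implementations.

def preprocess(corpus):
    """Stub for tokenization, lemmatization and MSD tagging."""
    return [sentence.split() for sentence in corpus]

def extract_terms(annotated):
    """Stub for term extraction; here simply all capitalized tokens."""
    return {tok for sent in annotated for tok in sent if tok.istitle()}

def extract_definitions(annotated, terms):
    """Stub: keep sentences that mention at least one term candidate."""
    return [sent for sent in annotated if any(tok in terms for tok in sent)]

def build_glossary_candidates(corpus, user_review=lambda x: x):
    """One pass of the pipeline; the user can rerun it after inspecting outputs."""
    annotated = user_review(preprocess(corpus))    # optional manual correction
    terms = user_review(extract_terms(annotated))  # optional filtering of terms
    definitions = extract_definitions(annotated, terms)
    return terms, definitions                      # input to manual glossary construction

print(build_glossary_candidates(["Lematizacija je postopek .", "Hvala za pozornost ."]))
```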


Figure 10. Definition extraction methodology overview.

We developed three basic methods to extract definition candidates from text, postulating that a sentence is a definition candidate if at least one of the following conditions is satisfied:
- It conforms to a predefined lexico-syntactic pattern (e.g., NP [nominative] is a NP [nominative]);
- It contains at least two domain-specific terms identified through automatic term recognition (possibly with additional constraints applied);
- It contains a wordnet term and its hypernym.
The first method follows a pattern-based approach. About ten patterns were manually defined for each language using the lemmas, part-of-speech information, as well as more detailed morphosyntactic descriptions, such as case information for nouns, person and tense information for verbs, etc. The patterns and their evaluation are presented in detail in Sections 5.1.1 and 5.2.1. The second method is primarily tailored to extract knowledge-rich contexts, as it focuses on sentences that contain at least n domain-specific single- or multi-word terms. Parameters for term-based definition extraction are the number of terms in a sentence, the number of multi-word terms, a verb between a term pair, the nominative condition, the termhood value, etc. (the presentation and comparison of different settings is provided in Sections 5.1.2 and 5.2.2). Depending on the parameter setting, the approach can be understood as an independent method for definition extraction (when all the constraints are applied) or as an additional approach, either as a domain filtering approach for the candidates extracted with other methods, or as a way of extracting knowledge-rich contexts when no real definitions are present in the corpus. Note that the terms used as input to this module are automatically extracted employing the term extraction methodology proposed by Vintar (2010), while the list of terminological candidates is also provided as one of the outputs of our system. The third approach exploits the per genus et differentiam characteristic of definitions and therefore searches for sentences in which a wordnet term occurs together with its direct hypernym. For English we use the Princeton WordNet (PWN, 2010; Fellbaum, 1998),


whereas for Slovene we use sloWNet (Fišer and Sagot, 2008), a Slovene counterpart of WordNet. The three approaches, together with text preprocessing performed with a morphosyntactic tagging and lemmatization tool and term extraction, represent the main ingredients of the definition extraction methodology developed in the scope of this thesis. The main results are the list of definition candidates, as well as a list of terminological candidates. Since our approach is intended to be semi-automatic, the resulting lists of term and definition candidates can be subject to further manual refinement when forming a domain glossary (terminological dictionary). The definition extraction methodology was applied to the Slovene and English parts of the Language Technologies Corpus, consisting of academic papers in the area of language technologies. The corpus was presented in detail in Section 3.2. The methodology extends our initial work (Fišer et al., 2010) and the recent work presented in Pollak et al. (2012a). In Fišer et al. (2010) only Slovene definition extraction was addressed and the corpus was less specialized, consisting mainly of popular science textbooks and articles. Highly specialized language, which characterizes the Language Technologies Corpus modeled in this thesis, is known to be more complex on the semantic, syntactic and lexical level compared to popular science discourse (Schmied, 2007). As the pattern-based method in Fišer et al. (2010) we used a single is_a pattern, which yields useful candidates if applied to structured texts such as textbooks or encyclopaedias. However, if used on less structured specialized texts, such as the scientific papers addressed in this thesis, a larger range of patterns yields better results. In Pollak et al. (2012a), we used eleven different patterns for Slovene and four different patterns for English. In the final methodology presented in this thesis, we extended the list to twelve patterns for Slovene and seven for English. In this thesis we examine different parameter settings for the term-based approach and elaborate the methodology in much greater detail, including its extensive evaluation.

4.2 Definition extraction evaluation methodology

This section presents the definition extraction evaluation methodology, which will be used for the quantitative and qualitative evaluation of results in Chapter 5. For the quantitative evaluation of definition extraction performance we use the measures of precision and recall, defined below.
- Precision denotes the percentage of actual definitions among all candidate sentences that were automatically extracted by the method.
- Recall denotes the percentage of all definition sentences that were successfully extracted. In our case, the recall is calculated on a recall test set, consisting of 150 manually selected actual definitions from the corpus.
An important part of the thesis is also the qualitative evaluation of the extracted definition candidates, discussed throughout Chapter 5. Besides the binary annotation of definition candidates into definitions (Y-yes) and non-definitions (N-no), needed for measuring precision and recall, we assign specific tags to extracted candidates, such as Y? and N? for borderline cases and more specific notions, such as Ns? (too specific – borderline non-definition), Ys? (too specific – borderline definition), Yg (too general definition), Ye? (borderline extensional definition), etc. These evaluation tags, inspired by the initial analysis of various definition types in Section 3.4, were


systematically used in Section 5.3.4 in order to help group the qualitative interpretation of results. The decision whether a definition candidate is a definition or not is a complicated question in itself. We opted for the interpretation that a definition does not have to correspond to a predefined formal criterion, such as the requirement that a definition should contain the hypernym and the differentia. Consequently, we left the formal criterion open and instead set the condition that the definition should be informative enough to be included in a glossary with no or minimal manual refinement. Moreover, being aware of possible differences in annotating candidate definition sentences, we performed a set of inter-annotator agreement experiments, which indicate the difficulty of the task as well as the reliability of the results. The inter-annotator agreement measures how different annotators agree in their judgments about the selected definition candidates. These results are discussed in Section 5.3.3 of this thesis.
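The two measures can be made explicit with a short sketch; the data structures below are illustrative, and how borderline tags (e.g., Y?) are counted is a choice left to the evaluator.

```python
def precision(annotated_candidates):
    """annotated_candidates: list of (sentence, label) pairs, label 'Y' or 'N';
    in this sketch only candidates labeled 'Y' count as definitions."""
    if not annotated_candidates:
        return 0.0
    hits = sum(1 for _, label in annotated_candidates if label == "Y")
    return hits / len(annotated_candidates)

def recall(extracted_sentences, recall_test_set):
    """Share of the manually selected definitions that were also extracted."""
    extracted = set(extracted_sentences)
    found = sum(1 for definition in recall_test_set if definition in extracted)
    return found / len(recall_test_set)

candidates = [("A is a B that ...", "Y"), ("We thank the reviewers.", "N")]
test_set = ["A is a B that ...", "C is a D used for ..."]
print(precision(candidates), recall([s for s, _ in candidates], test_set))
```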

4.3 Background technologies and resources

In this section we describe existing tools—the linguistic annotation tool ToTrTaLe and the LUIZ term-extraction tool—and resources (WordNet and sloWNet) that were used as background technologies in the methodology developed in the scope of this thesis.

4.3.1 ToTrTaLe morphosyntactic tagger and lemmatiser

The ToTaLe tool (Erjavec et al., 2005), whose name stands for the Tokenization, Tagging and Lemmatization pipeline comprising these three text processing steps, is available as a web application. ToTaLe has recently been extended with another module, Transcription, and the extended version is called ToTrTaLe (Erjavec, 2011). The transcription step is used for modernizing historical language (or, in fact, any non-standard language), and the tool was used as the first step in the annotation of a reference corpus of historical Slovene (Erjavec, 2012a). An additional extension of ToTrTaLe is the ability to process heavily annotated XML documents conformant to the Text Encoding Initiative Guidelines (TEI P5, 2007). The three main modules of ToTrTaLe, tokenization, tagging and lemmatization, are presented below. As a result of the work performed in this thesis, linguistic annotation with ToTrTaLe is now available as a web service and as a publicly available workflow (cf. Chapter 6 and Pollak et al., 2012c).

Tokenization
The multilingual tokenization module mlToken27 is written in Perl. In addition to splitting the input string into tokens, it also assigns to each token its type (e.g., XML tag, sentence-final punctuation, digit, abbreviation, URL, etc.) and preserves (subject to a flag) white-space, so that the input can be reconstituted from the output. Furthermore, the tokenizer segments the input text into sentences. The tokenizer can be fine-tuned by putting punctuation into various classes (e.g., word-breaking vs. non-breaking) and also uses several language-dependent resource files: a list of abbreviations (words ending in a period, which is a part of the token and

27 mlToken was written in 2005 by Camelia Ignat, then working at the EU Joint Research Centre in Ispra, Italy.


does not necessarily end a sentence); a list of multi-word units (tokens consisting of several space-separated words); and a list of (right or left) clitics, i.e., cases where one word should be treated as several tokens. The tokenization resources for Slovene and English were developed by hand.
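As a simplified illustration of how such resource files are used, the sketch below keeps known abbreviations as single tokens so that their periods do not trigger a sentence break; the abbreviation list and the splitting rules are invented for the example and are not mlToken's actual resources.

```python
import re

# Invented abbreviation list; real resources are language-dependent files.
ABBREVIATIONS = {"et al.", "npr.", "itd.", "ipd.", "dr."}

def tokenize(text):
    # keep known abbreviations as single tokens, otherwise split off punctuation
    tokens, words = [], text.split()
    i = 0
    while i < len(words):
        two = " ".join(words[i:i + 2])
        if two in ABBREVIATIONS:            # multi-word abbreviation, e.g. "et al."
            tokens.append(two); i += 2
        elif words[i] in ABBREVIATIONS:
            tokens.append(words[i]); i += 1
        else:
            tokens.extend(re.findall(r"\w+|[^\w\s]", words[i], re.UNICODE))
            i += 1
    return tokens

def split_sentences(tokens):
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in {".", "!", "?"}:          # abbreviation periods never reach here
            sentences.append(current); current = []
    if current:
        sentences.append(current)
    return sentences

print(split_sentences(tokenize("Glej npr. Erjavec et al. (2011). To je nov stavek.")))
```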

Tagging
Part-of-speech tagging is the process of assigning a word-level grammatical tag to each word in running text. The tagging is typically performed in two steps: the lexicon gives the possible tags for each word, while the disambiguation module assigns the correct tag based on the context of the word. Most contemporary taggers are trained on manually annotated corpora, and the TnT tagger we use is no exception. TnT (Brants, 2000) is a fast and robust tri-gram tagger, which is able, by the use of heuristics over the words in the training set, to tag unknown words. For languages with rich inflection, such as Slovene, it is better to speak of morphosyntactic descriptions (MSDs) rather than part-of-speech tags, as MSDs contain much more information than just the part of speech. For example, the tagsets for English typically have 20–50 different tags, while Slovene has over 1,000 MSDs. For Slovene, the tagger has been trained on jos1M, the 1 million word JOS corpus of contemporary Slovene (Erjavec et al., 2010), and is also given a large background lexicon extracted from the 600 million word FidaPLUS reference corpus of contemporary Slovene (Stabej et al., 2006; Arhar Holdt and Gorjanc, 2007). The English model was trained on the MULTEXT-East corpus (Erjavec, 2012b), i.e., on George Orwell’s novel “1984”. This is of course a very small corpus, so the resulting model is not very good. However, it does have the advantage of using the MULTEXT-East tagset, which is compatible with the JOS one.
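The two-step idea (lexicon lookup followed by contextual disambiguation) can be illustrated with a toy example. TnT itself is a trigram HMM with smoothing and suffix-based handling of unknown words; the lexicon, MSD tags and counts below are invented for illustration only.

```python
# Toy illustration of lexicon lookup + contextual disambiguation (not TnT itself).

LEXICON = {                       # possible MSD tags per word form (invented)
    "so": ["Va-r3p-n"],           # auxiliary verb, present, third person plural
    "mreže": ["Ncfpn", "Ncfsg"],  # noun: fem. plural nominative vs. singular genitive
    "grafi": ["Ncmpn"],           # noun, masc. plural nominative
}
BIGRAMS = {                       # toy tag-bigram counts from a training corpus
    ("<s>", "Va-r3p-n"): 5,
    ("Va-r3p-n", "Ncfpn"): 8, ("Va-r3p-n", "Ncfsg"): 1,
    ("Ncfpn", "Ncmpn"): 3,
}

def tag(words):
    tagged, prev = [], "<s>"
    for w in words:
        candidates = LEXICON.get(w, ["X"])   # "X" marks unknown words
        best = max(candidates, key=lambda t: BIGRAMS.get((prev, t), 0))
        tagged.append((w, best))
        prev = best
    return tagged

print(tag(["so", "mreže", "grafi"]))   # 'mreže' resolved to the nominative plural
```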

Lemmatization
For lemmatization the system uses CLOG (Erjavec and Džeroski, 2004), which implements a machine learning approach to the automatic lemmatization of (unknown) words. On the basis of input examples (word-form/lemma pairs, where a separate model is learnt for each morphosyntactic tag), CLOG learns a first-order decision list, essentially a sequence of if-then-else clauses whose basic operation is string concatenation. The learnt structures are Prolog programs, but in order to minimize interface issues a converter from Prolog to Perl was developed. The Slovene lemmatizer was trained on a lexicon extracted from the jos1M corpus. The resulting lemmatization is reasonably accurate. However, given that there are 2,000 separate classes, the learnt model is quite large: the Perl rules take about 2 MB, which makes loading the lemmatizer slow. The English model was trained on the English MULTEXT-East corpus, which has about 15,000 lemmas and produces a reasonably good model, especially as English is fairly simple to lemmatize. We implemented ToTrTaLe as a web service and used it as a workflow component, since the annotated corpus was needed in both main steps of the methodology, i.e., in the term extraction and the definition extraction step. The output of ToTrTaLe is illustrated in Table 4.
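A minimal sketch of the decision-list idea is given below: an ordered list of suffix-rewriting rules where the first matching rule wins. The rules are invented toy rules, not CLOG's learnt model, and in CLOG a separate list is learnt per MSD tag.

```python
# Toy suffix-rewriting rules (invented, not CLOG's learnt model).
RULES = [
    ("ami", "a"),   # e.g. besedami -> beseda
    ("ega", ""),    # e.g. lepega  -> lep
    ("om",  ""),    # e.g. korpusom -> korpus
]

def lemmatize(word_form):
    for suffix, replacement in RULES:          # first matching rule wins
        if word_form.endswith(suffix):
            return word_form[: len(word_form) - len(suffix)] + replacement
    return word_form                           # default: lemma equals word form

print(lemmatize("besedami"), lemmatize("korpusom"), lemmatize("jezik"))
```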


Table 4. A sample output of ToTrTaLe, annotating sentences and tokens, with lemmas and MSD tags on words.

4.3.2 LUIZ terminology extractor

LUIZ (Vintar, 2010) is a terminology extraction tool for English and Slovene. Terminology extraction is performed in two steps. First, the terminological candidates are extracted based on morphosyntactic patterns. Next, the terminological candidates are weighted and ranked by their ‘termhood’ value. To get the list of candidates, potentially relevant terminological phrases corresponding to predefined patterns (e.g., Noun + Noun, Adjective + Noun, etc.) are retrieved from a morphosyntactically annotated corpus. In order to attribute a termhood value to these candidates, first a list of all word types (i.e., lemmas) from the corpus is extracted and their frequencies are calculated. The words with the highest frequency are mainly function words. Next, the keyness of each lemma in the lexicon is calculated. Very general words from the domain corpus are expected to have approximately the same distribution in a reference corpus and in a domain corpus, while domain-specific words have a considerably higher (relative) frequency in the domain corpus. The keyness of each lemma is calculated as the ratio between the relative frequency of the word (lemma) in the domain corpus and its relative frequency in the reference corpus. As the reference corpus, LUIZ uses the FidaPLUS corpus (Stabej et al., 2006; Arhar Holdt and Gorjanc, 2007) for Slovene and the British National Corpus (BNC, 2001) for English. FidaPLUS consists of texts of different types from the majority of Slovene daily newspapers, various magazines and books from a number of publishers (fiction, non-fiction, textbooks), etc. It contains approximately 621 million words and is freely available. The BNC reference corpus used for English is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English. In the last step, the relative frequencies of lemmas are used to compute the termhood value of noun phrase terminological candidates. The termhood value W of a


candidate term a consisting of n words is computed as:

where fa is the absolute frequency of the candidate term in the domain-specific corpus, fn,D and fn,R are the frequencies of each constituent word in the domain-specific and the reference corpus, respectively, and ND and NR are the sizes of these two corpora in tokens (see Vintar, 2010). LUIZ exports two lists of terms, one for single-word terms and the other for multi-word terms. After the termhood score, the lemmatized form and the canonical form of the term are provided, the latter meaning that the term’s headword is in the singular for English and in the nominative singular for Slovene. As explained in the workflow implementation (see Section 6.4.2), we implemented the LUIZ term extractor as a web service, changed a few details, and provide a single output list in which single- and multi-word terms are ranked together. The term extraction was used for the term-based definition extraction.
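The exact formulation of W is given in Vintar (2010); as an illustration only, the sketch below assumes one plausible instantiation, combining the candidate's own frequency with the average ratio of its constituents' relative frequencies in the domain versus the reference corpus. All names and counts are illustrative.

```python
def termhood(constituent_lemmas, f_a, freq_dom, freq_ref, size_dom, size_ref):
    """Toy termhood weight; not necessarily Vintar's (2010) exact formula.
    f_a: frequency of the whole candidate in the domain corpus;
    freq_dom/freq_ref: per-lemma frequencies in the domain/reference corpus;
    size_dom/size_ref: corpus sizes in tokens."""
    n = len(constituent_lemmas)
    ratio_sum = 0.0
    for lemma in constituent_lemmas:
        rel_dom = freq_dom.get(lemma, 0) / size_dom
        rel_ref = max(freq_ref.get(lemma, 0), 1) / size_ref  # avoid division by zero
        ratio_sum += rel_dom / rel_ref
    return f_a / n * ratio_sum

freq_dom = {"strojen": 120, "prevajanje": 300}   # invented counts
freq_ref = {"strojen": 50, "prevajanje": 200}
print(termhood(["strojen", "prevajanje"], 80, freq_dom, freq_ref, 10**6, 6 * 10**8))
```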

4.3.3 WordNet and sloWNet

WordNet (Fellbaum, 1998; PWN, 2010) is a lexical database that groups words (called literals) into sets of synonyms called synsets; each synset expresses a distinct concept. The same word form with different meanings is represented in different synsets. Moreover, WordNet provides short, general definitions and records the various semantic relations, mainly hypernymy, between the synonym sets. Hypernymy relations are transitive and all noun hierarchy chains reach the ultimate root node. Other relations include meronymy (the part-whole relation) and troponymy, which relates verbs, while adjectives are organized in terms of antonymy. For example, the English concept with ID (03082979) has six different realizations (literals): {computer, computing_machine, computing_device, data_processor, electronic_computer, information_processing_system}.

The concept is defined by a short gloss (a machine for performing calculations automatically), and related concepts are provided, such as its direct hypernym [machine], defined as any mechanical or electrical device that transmits or modifies energy to perform or assist in the performance of human tasks. WordNet has become one of the most important resources used in a large variety of natural language processing applications. Its utility initiated the construction of wordnets for many other languages, including Slovene. Instead of manually building a large database, the Slovene wordnet, called sloWNet (Fišer, 2009; Fišer and Sagot, 2008), was built nearly fully automatically, by exploiting multiple multilingual resources, such as bilingual dictionaries, parallel corpora and online semantic resources. We used the version of sloWNet (2.1, 30/09/2009) containing about 20,000 unique literals, which are organized into almost 17,000 synsets, covering, as claimed in Vintar and Fišer (2013), about 15% of PWN. The most frequent domain in sloWNet 2.1 is Factotum, followed by the Zoology, Botany and Biology domains. SloWNet is aligned with the English PWN, and since sloWNet does not cover the entire inventory of PWN


concepts, there are some gaps (empty synsets) in the network. In our methodology, WordNet and sloWNet were used in the wordnet-based definition extraction method.
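For English, the core lookup of the wordnet-based method can be sketched with NLTK's interface to the Princeton WordNet (assuming the nltk package and its downloaded wordnet data are available); for Slovene the same logic would query sloWNet instead, which is not bundled with NLTK.

```python
from nltk.corpus import wordnet as wn  # requires nltk and the 'wordnet' data

def term_hypernym_pairs(term):
    """Return (term, direct hypernym lemma) pairs for all noun senses of the term."""
    pairs = []
    for synset in wn.synsets(term, pos=wn.NOUN):
        for hyper in synset.hypernyms():
            for lemma in hyper.lemma_names():
                pairs.append((term, lemma.replace("_", " ")))
    return pairs

# e.g. ('computer', 'machine'): a sentence containing both words is then a
# definition candidate for the wordnet-based method
print(term_hypernym_pairs("computer"))
```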

4.3.4 ClowdFlows workflow composition and execution environment

The ClowdFlows platform (Kranjc et al., 2012) consists of a workflow editor (the graphical user interface) and the server-side application, which handles the execution of the workflows and hosts a number of publicly available workflows. It is a cloud-based application that takes the processing load from the client’s machine and moves it to remote servers where experiments can be run with or without user supervision.

Figure 11. A screenshot of the ClowdFlows workflow editor in the Google Chrome browser.

As shown in Figure 11, the workflow editor consists of a workflow canvas and a widget repository, where widgets represent embedded chunks of software code: downloadable stand-alone applications which look and act like traditional applications but are implemented using web technologies and can therefore be easily embedded into third-party software. In this thesis, all NLP processing modules were implemented as such widgets, and they are shown in the widget repository menu at the left-hand side of the ClowdFlows canvas. The repository also includes a wide range of default widgets. The widgets are separated into categories for easier browsing and selection. By using ClowdFlows we were able to make the workflow developed in this thesis public, so that anyone can execute it. The workflow is simply exposed by a unique web address which can be accessed from any modern web browser. Whenever the user opens a public workflow, a copy of the workflow appears in his private workflow repository in ClowdFlows. The user may execute the workflow and view its results or expand it by adding or removing widgets.
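As an illustration of the widget granularity, the sketch below shows a backend in the style of a ClowdFlows widget function that maps an input dictionary to an output dictionary; the names, keys and trivial matching logic are illustrative and do not reproduce the actual widgets described in Chapter 6.

```python
# Illustrative widget-style function, assuming the convention of mapping an
# input dictionary (one key per input connection) to an output dictionary.

def definition_extraction_pattern_widget(input_dict):
    sentences = input_dict["sentences"]          # annotated sentences from ToTrTaLe
    pattern = input_dict.get("pattern", " je ")  # toy placeholder for MSD patterns
    candidates = [s for s in sentences if pattern in s]
    return {"candidates": candidates}

print(definition_extraction_pattern_widget(
    {"sentences": ["Lema je osnovna oblika besede .", "Hvala za pozornost ."]}))
```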

4.4 Evaluation of selected background technologies

Note that the results of the definition extraction do not depend only on the quality of definition extraction methods themselves, but also on the methods used in the preceding steps of the definition extraction workflow. Therefore, in the next two subsections we discuss the preprocessing output and term-extraction results on our corpus. These two


background technologies were evaluated since we implemented them as separate workflow components. In further work, it would be useful to add the sloWNet and PWN evaluation, as well as perform a more complete ToTrTaLe evaluation.

4.4.1 ToTrTaLe evaluation

ToTrTaLe, the tokenization, lemmatization and morphosyntactic annotation tool, is the background technology used for all the further steps in the term and definition extraction process. In this subsection we present the observed ToTrTaLe mistakes, focusing on Slovene, and propose some corrections that were partly implemented in the post-processing step of the workflow (offered as an optional parameter). The ToTrTaLe workflow implementation (and the proposed improvements), published in Pollak et al. (2012c), is presented in more detail in Section 6.2.

Incorrect sentence segmentation
We have detected errors in sentence segmentation, which originate mostly from the processing of abbreviations. Since the analyzed examples were taken from academic texts, specific abbreviations leading to incorrect separation of sentences are frequent. In some examples the abbreviations contain a period that is, if the abbreviation is not listed in the abbreviation repository, automatically interpreted as the end of the sentence. For instance, the abbreviation et al., frequently used when referring to other authors in academic writing, incorrectly implies the end of the sentence, and the year of publication is mistakenly treated by ToTrTaLe as the start of a new sentence. Note, however, that a period after an abbreviation does not always mean that the sentence actually continues. This is the case when an abbreviation occurs at the end of the sentence (ipd., itd., etc. often occur in this position). Consequently, in some cases two sentences are mistakenly tagged by ToTrTaLe as a single sentence. This mistake was also observed with the abbreviation EU or the measures KB, MB, GB when occurring in the last position of the sentence, just before the period.
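A post-processing heuristic for the "et al." case can be sketched as follows; the rule is illustrative, and the actual post-processing script in the workflow may use a different rule set.

```python
import re

def merge_wrong_splits(sentences):
    """Re-merge a sentence wrongly split after 'et al.' when the next
    'sentence' starts with a year, possibly in parentheses."""
    merged = []
    for sentence in sentences:
        if (merged and merged[-1].rstrip().endswith("et al.")
                and re.match(r"^\(?\d{4}", sentence.strip())):
            merged[-1] = merged[-1].rstrip() + " " + sentence.strip()
        else:
            merged.append(sentence)
    return merged

print(merge_wrong_splits(["Kot navajajo Erjavec et al.",
                          "(2005), je to osnovni pristop."]))
```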

Incorrect morphosyntactic annotations
We have also identified morphosyntactic (MSD) tagging mistakes. Since several grammatical forms can have the same realization, there are examples where a wrong MSD is attributed to a token. In several cases these mistakes occur systematically. Such mistakes can lead to lower performance in NLP tasks where MSD information is used (e.g., in the definition extraction task we can search for patterns “Noun-nominative is Noun-nominative”, and if the nominative case is not recognized this can result in lower recall). For example, in Slovene, in the first masculine declension, the forms of the nominative and accusative singular are the same for inanimate nouns. The same ambiguity, possibly leading to mistakes in MSD annotations, occurs for example in the second feminine declension singular and plural, as well as in the singular of the first feminine declension. Since the gender or number can also be wrongly assigned, other ambiguities can occur. In English, there are ambiguities, for example, between the third person singular verb form and a plural noun, both ending in “-s”. As an example, take the word works, which in the sequence different reference works on Slovenian grammar was tagged as a verb instead of a noun (plural form). As mentioned in Section 4.3.1, the system was trained on a relatively small corpus for English, therefore wrong annotations are expected.


Another type of error occurs in subject complement structures. For instance, in the Slovene sentence Kot podatkovne strukture so semantične mreže usmerjeni grafi. [As data structures, semantic networks are directed graphs.], the nominative plural feminine semantične mreže [semantic networks] is wrongly annotated as singular genitive feminine. Another frequent type of mistake, easy to correct, is unrecognized gender/number/case agreement between adjective and noun in noun phrases. For example, in the sentence Na eni strani imamo semantične leksikone… [On the one hand we have semantic lexicons...], semantične [semantic] is assigned a feminine plural nominative MSD, while leksikone [lexicons] is attributed a masculine plural accusative tag. Next, in several examples, sta (the dual form of the verb biti [be]) is tagged as a noun. Even if, when written with capital letters, STA can be used as an abbreviation for Slovenska tiskovna agencija [Slovene Press Agency], it is much more frequent as a word form of the auxiliary or copula verb.

Incorrect lemmatization
Apart from common errors of wrong lemmatization of individual words (e.g., hipernimija being lemmatized as hipernimi [hypernyms] and not as hipernimija [hypernymy]), there are systematic errors when lemmatizing Slovene adjectives in the comparative and superlative form, where the base form is not chosen as the lemma. Last but not least, there are errors caused by typographic mistakes in the original texts and by words split at the end of lines.

In summary, in this section we commented on several types of mistakes. It is obvious that these mistakes in corpus preprocessing influence the term and definition extraction results. For some of the systematically occurring mistakes we propose a post-processing script implemented in the definition extraction workflow. In further work, other preprocessing tools (e.g., TreeTagger (Schmid, 1994) or Obeliks (Grčar et al., 2012)) will be implemented and the influence of the preprocessing step will be tested.

4.4.2 LUIZ evaluation

An important step of domain modeling is the terminology extraction step. We implemented the monolingual terminology extraction of the LUIZ system (Vintar, 2010), presented in more detail in Sections 4.3.2 and 6.3. The performance of the term extraction system also influences one of the three developed definition extraction methods, i.e., the term-based definition extraction, which is evaluated in detail in Sections 5.1.2 and 5.2.2 for Slovene and English, respectively. Therefore, in this section, we evaluate the results of term extraction on the Language Technologies Corpus. The top-ranked terms from our corpus are provided in Table 5 (Slovene) and Table 6 (English). The results of the precision evaluation by two annotators and their agreement scores are presented in Table 7, while the recall results are presented in Table 8. First, the top 200 (single- or multi-word) domain terms for each language were evaluated by the domain expert (cf. A1 in Table 7). Each term was assigned a score of 1–5, where 1 means that the extracted candidate is not a term (e.g., table) and 5 that it is a fully lexicalized domain-specific term designating a specialized concept (e.g., machine translation). The scores between 2 and 4 are used to mark varying levels of domain-


specificity on the one hand (e.g., evaluation is a term, but not specific to this domain; score 3), and of phraseological stability on the other (e.g., translation production is a terminological collocation, not fully lexicalized and compositional in meaning; score 3).

1.000000 [korpus] <>
1.000000 [diskurzen označevalec] <>
0.756365 [govoren signal] <>
0.746904 [strojen prevajanje] <>
0.561043 [slovenski jezik] <>
0.414158 [jezikoven vir] <>
0.367204 [jezik] <>
0.343655 [besedilo] <>
0.319568 [beseda] <>
0.311063 [spleten stran] <>
0.294892 [beseden vrsta] <>
0.289277 [naraven jezik] <>
0.284740 [govoren zbirka] <>
0.279142 [pomnilnik prevod] <>
0.277600 [beseden zveza] <>
0.276101 [jezikoven tehnologija] <>
0.191862 [razpoznavanje govor] <>
0.169757 [referenčen korpus] <>
0.157759 [prevajanje govor] <>
0.144995 [vzporeden korpus] <>
...

Table 5. Top ranked 20 terms from the Slovene Language Technologies Corpus.

1.000000 [language] <>
1.000000 [machine translation] <>
0.979119 [word] <>
0.833899 [corpus] <>
0.650033 [translation] <>
0.536591 [translation memory] <>
0.291507 [mt system] <>
0.260914 [system] <>
0.251408 [target language] <>
0.244333 [language model] <>
0.226685 [text] <>
0.218025 [speech recognition] <>
0.172751 [sentence] <>
0.171767 [rule] <>
0.163233 [natural language] <>
0.153303 [data] <>
0.153056 [parallel corpus] <>
0.114266 [algorithm] <>
0.110055 [pos tag] <>
0.101847 [error] <>
...

Table 6. Top ranked 20 terms from the English Language Technologies Corpus.


Precision28                 Slovene terms                   English terms
Score                       A1      A2      Average         A1      A2      Average
Yes (2-5)                   0.905   0.710   0.860           0.815   0.980   0.940
Yes (5)                     0.620   0.225   0.205           0.490   0.295   0.225
IAA - overall agr.                          0.315                           0.345
IAA - kappa (unw.)                          0.130                           0.139

Table 7. Precision of the reimplemented LUIZ term extraction method (on the Slovene and English Language Technologies Corpus), evaluated by two annotators (Average denotes that the arithmetic mean of the two scores is at least 2 in the first row and equal to 5 in the second). The IAA scores for overall agreement and unweighted kappa are given in the last two rows and show low inter-annotator agreement.

Next, we asked two other annotators—both experts in linguistics and familiar with the field of language technologies—to evaluate the same set of terms (their scores are given in the A2 columns of Table 7). The mean of the two evaluators’ precision scores is given in the Average column, where in the first row the precision is computed based on terms that have an average score of at least 2, while the second row provides the share of the top 200 terms that were considered fully lexicalized domain terms by both annotators (i.e., assigned score 5). Additionally, we performed an inter-annotator agreement experiment (using Lowry’s (2013) Vassarstats online kappa calculator). When calculating the two evaluators’ agreement based on their evaluation on the scale from 1 to 5, the overall agreement is 31.5% for Slovene (where the two annotators gave the same score to 63 terms) and 34.5% for English (with 69 identically scored terms). Next, we calculated different kappa coefficients (cf. Cohen, 1960) for measuring the agreement between the two annotators of each dataset. For Slovene the observed unweighted kappa is 0.13 and for English 0.1389. In contrast to overall agreement scores, kappa takes into account the agreement occurring by chance. Kappa lies on a scale where 1 is perfect agreement, 0 is exactly what would be expected by chance, and negative values indicate agreement less than chance, i.e., potential systematic disagreement between the observers (Viera and Garrett, 2005). On the scale from less-than-chance to almost perfect agreement, our results show only slight agreement between the two annotators.29 Since the annotation categories are on an ordinal scale, we also computed kappa with linear (0.3162 and 0.2591 for Slovene and English, respectively) and quadratic weighting (0.4536 and 0.3703 for Slovene and English, respectively). These scores indicate that the inter-annotator agreement is fair to moderate if we take into account not only exact matches but also the distance between categories (it is not the same if the two annotators assigned scores 2 and 3, or 1 and 5). To give a better idea of the evaluated terms, we provide a few terms for different scores. Term candidates assigned score 5 by both annotators include, for example, memory-based machine translation, speech recognition and language resources; scores from 2 to 4 were given by both annotators to the terms symbol error (2), rule (3),

28 Note that the results in Table 7 and Table 8 are slightly different from those reported in Pollak et al. (2012a) due to some minor revisions/improvements of the implementation.
29 For interpreting kappa scores, note that a score less than 0 indicates less than chance agreement, 0.01–0.20 slight agreement, 0.21–0.40 fair agreement, 0.41–0.60 moderate agreement, 0.61–0.80 substantial agreement and 0.81–0.99 almost perfect agreement (Viera and Garrett, 2005).


probability distribution (4); score 1 was given to the expression van den, while 1.5 was the mean score for terms such as problem, number or user. As already mentioned, one of the evaluators of the English and Slovene terminology was the same and one was different in each case. We can note that the first evaluator, who evaluated both datasets and whose results are reported in the A1 columns, tagged many term candidates with score 1 (non-terms), but also interpreted many candidates as fully lexicalized domain-specific terms (‘perfect’ terms with score 5). The other two annotators assigned score 5 less frequently, and the same holds for score 1 (especially the second evaluator of the English dataset, who chose score 1 only once). If we want to compare the performance of the two systems—the English and Slovene term extraction—we should rely on the results where the same person evaluated the two datasets (annotator A1). From these results one can see that the system performs slightly better for Slovene than for English. The second part of the term evaluation involved the assessment of recall. A domain expert (A1) annotated a random text sample of the Slovene and English corpus with all terminological expressions (fully lexicalized domain-specific terms, cf. score 5 in the previous part of the evaluation). Approximately 65 terms were identified for each language and these samples were then compared to the lists of terms extracted by the term extraction system. Table 8 shows the results for both samples, using either all term candidates or just the top 10,000/5,000/200.

            Number of terms     Recall28
Slovene     33,978              0.714
            10,000              0.528
            5,000               0.443
            200                 0.257
English     22,196              0.824
            10,000              0.719
            5,000               0.596
            200                 0.246

Table 8. Recall of terminological candidates extracted from the LT Corpus.

The results of the term extraction step shown in this section have a considerable influence on the term-based definition extraction, presented in Sections 5.1.2 and 5.2.2. It was shown that the term extraction results are good but not perfect, and that we can expect some errors due to the recognition of more general terms instead of fully lexicalized domain terms only. In future versions, we will consider performing term filtering and human evaluation before inputting the term list into the term-based definition extraction system. On the other hand, based on the recall evaluation, we decided not to limit ourselves to searching only for definitions of the extracted terms, since we might still miss approx. 20% of the domain terms potentially defined in the text.
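For reference, the agreement statistics reported above for Table 7 can be computed with a short function; the implementation below is a generic sketch of Cohen's kappa with optional linear or quadratic weighting (the thesis used the Vassarstats online calculator), and the toy scores are invented.

```python
def weighted_kappa(scores_a1, scores_a2, categories, weighting="unweighted"):
    """Cohen's kappa for two annotators over ordinal categories;
    weighting: 'unweighted', 'linear' or 'quadratic'."""
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(scores_a1)
    obs = [[0.0] * k for _ in range(k)]          # observed joint distribution
    for x, y in zip(scores_a1, scores_a2):
        obs[index[x]][index[y]] += 1.0 / n
    p1 = [sum(row) for row in obs]               # marginals of annotator 1
    p2 = [sum(obs[i][j] for i in range(k)) for j in range(k)]

    def disagreement(i, j):                      # disagreement weights
        if weighting == "unweighted":
            return 0.0 if i == j else 1.0
        d = abs(i - j) / (k - 1)
        return d if weighting == "linear" else d * d

    d_obs = sum(disagreement(i, j) * obs[i][j] for i in range(k) for j in range(k))
    d_exp = sum(disagreement(i, j) * p1[i] * p2[j] for i in range(k) for j in range(k))
    return 1.0 - d_obs / d_exp

a1 = [5, 5, 1, 3, 2, 4, 5, 1, 3, 2]              # invented toy scores on the 1-5 scale
a2 = [5, 4, 1, 3, 3, 4, 3, 2, 3, 2]
for mode in ("unweighted", "linear", "quadratic"):
    print(mode, round(weighted_kappa(a1, a2, [1, 2, 3, 4, 5], mode), 3))
```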

In summary, this chapter provided an overview of the definition extraction methodology, first introducing the three definition extraction methods (Section 4.1), followed by the presentation of the methods for evaluating the extracted definition candidates in Section 4.2. The background technologies and resources were presented in Section 4.3, and those that were reimplemented in the workflow were evaluated in Section 4.4.


5 Definition extraction from Slovene and English text corpora

This chapter describes the methodology and the results of definition extraction from Slovene and English text corpora. In Section 4.1 we have already presented an overview of the proposed definition extraction methodology. This chapter presents the core of this thesis, i.e., the developed definition extraction methodology for Slovene and English, where the main focus of our approach is on Slovene. The methodology was applied to the Language Technologies Corpus, which is—as briefly discussed in Section 3.4—characterized by very complex constructions and presents difficult material for the given task. For each language, we developed and tested three different methods, i.e., the pattern-based, term-based and wordnet-based approach. Section 5.1 presents the methods and the results of definition extraction from the Slovene part of the corpus, and Section 5.2 the methods and the definition extraction results obtained on the English subcorpus. The concluding section (Section 5.3) summarizes the results, proposes different original combinations of the three methods for each language and discusses the results in terms of text type and types of definition/non-definition sentences, the latter proposing a qualitative systematization of the results. Throughout the chapter, numerous examples of extracted definition candidates are presented, evaluated and analyzed, illustrating the challenge of the task and improving the understanding of the domain in terms of domain modeling. Moreover, this analysis represents a novel contribution from the linguistic perspective, improving the understanding of definitions and defining strategies as they occur in running scientific text.

5.1 Extracting definitions from Slovene texts

This section presents the pattern-based, term-based and sloWNet-based methods, applied to definition extraction from the Slovene part of the corpus. Since the evaluation of the term-extraction system showed that quite a few terms (approx. 20%) are not extracted by the LUIZ system (cf. Table 8), we keep the setting more open and do not search only for definitions of a previously defined list of terms. In further work we could also consider experimenting with the alternative setting. Developing the definition extraction methodology for domain modeling is the main focus of this thesis; however, an additional aim is to semi-automatically construct a pilot Slovene glossary of the human language technologies domain. In our work, a glossary refers to a terminological resource consisting of domain terms and their definitions. Compared to a terminological dictionary, which should be as complete as possible (ideally defining ‘all’ the relevant concepts of a specific subject field, i.e., domain), a glossary conforming to our definition can also model a smaller domain (a corpus, a book, etc.), meaning that it may, but need not, cover an entire subject field. Moreover, for a terminological dictionary one expects that variations, synonyms, etc. are properly handled, while a glossary can also be a less extensive resource. A glossary can thus be viewed as a preliminary stage in building a proper terminological dictionary.


5.1.1 Pattern-based definition extraction

The pattern-based approach is the traditional approach, in which we use predefined lexico-syntactic patterns. The simplest pattern is “X je Y” [“X is Y”], where X is the term to be defined and Y is its hypernym. This corresponds to the Aristotelian view of definition, with genus and differentiae, meaning that if we have a term X to be defined, we define it by using its hypernym (Y) and by listing the differences from other types belonging to this class of entities (“X is Y that...”). In highly inflected languages, such as Slovene, we can add the condition that the noun phrases should agree in case and that the case should be nominative (i.e., “NP-nom is NP-nom”), where “NP” means noun phrase and “NP-nom” stands for a noun phrase in the nominative case. Clearly, this basic “is_a” pattern cannot cover all definition types. We can expect that in (semi-)structured texts, such as textbooks, encyclopaedias or Wikipedia, applying this pattern will yield good results. However, if used on less structured authentic specialized texts, such as scientific papers or theses, a larger range of patterns—capturing more definition candidates—should be considered. Therefore, inspired by the linguistic analysis of a small set of known definitions from the corpus, presented in Section 3.4, we crafted eleven additional patterns (or, more precisely, pattern types30), presented in Table 10. We used these patterns for automatic definition extraction and compared their performance in terms of precision and recall.
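To make the pattern notation concrete, the sketch below matches a simplified version of the basic “N-nom je/sta/so N-nom” pattern over (word, lemma, MSD) triples as produced by ToTrTaLe; the MSD checks are reduced to string tests and the example tags are illustrative, while the actual patterns described below handle full noun phrases and optional elements.

```python
def is_noun_nominative(msd):
    # JOS-style noun MSDs: first character N, fifth character = case ('n' = nominative)
    return msd.startswith("N") and len(msd) >= 5 and msd[4] == "n"

def matches_is_a(sentence):
    """sentence: list of (word, lemma, msd) triples; simplified 'X je Y' check."""
    for i in range(1, len(sentence) - 1):
        word, lemma, _ = sentence[i]
        if lemma == "biti" and word.lower() in {"je", "sta", "so"}:
            left_ok = is_noun_nominative(sentence[i - 1][2])
            right_ok = any(is_noun_nominative(msd) for _, _, msd in sentence[i + 1:])
            if left_ok and right_ok:
                return True
    return False

example = [("Lematizacija", "lematizacija", "Ncfsn"),
           ("je", "biti", "Va-r3s-n"),
           ("postopek", "postopek", "Ncmsn"),
           (",", ",", "Z")]
print(matches_is_a(example))
```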

X is Y pattern type
First, we experimented with the basic “X je Y” [“X is Y”] pattern type (see Table 9). We evaluated different realizations of this pattern, as described below.
1. The most basic is the “N_je/sta/so_N” pattern (cf. Pattern 1), where N denotes a noun. In English this pattern corresponds to “N_is/are_a/the_N”, whereby Slovene does not use articles. The auxiliary verb biti [be31] can occur in the third person singular (je) or in the forms for dual, sta [are, dual form] (specific to Slovene), or plural, so [are, plural form].
2. Next, we added the constraint that both nouns should agree in the nominative case (Pattern 2).
3. Since a simple noun in the nominative does not cover all types of noun phrases, we replaced it with a noun phrase (NP) in the nominative case (Pattern 3). Since no chunker was available for Slovene at the time these experiments were conceived, we manually defined different noun phrase types. All the noun phrases have a head noun that determines the noun phrase case (in our case we match noun phrases in the nominative case), while optional elements can precede the head noun (any number of adjectives) or follow it (nouns, adjectives and nouns, and/or prepositional phrases composed of a preposition introducing nouns, optionally preceded by adjective(s)). To illustrate this with examples, we extended the list from a simple noun jezik [language] in Pattern 2 to terms like

30 We can say patterns or pattern types, since a pattern type can in fact have different realizations of the same type of pattern, in some cases due to the order of constituting elements and in other cases due to optional elements and various noun phrase structures. 31 In the thesis, we had to choose between the bare infinitive and the full (or to-)infinitive form of verbs: for simplicity, the bare infinitive form is used in this thesis, e.g., we use be and not to be.


računalniško jezikoslovje [computational linguistics], računalniška zbirka besedil [electronic text collection] or sistem za strojno prevajanje [machine translation system], where the latter is in Slovene composed of a head noun, a preposition, an adjective and a noun [system for machine translation].
4. In Pattern 4 we investigate whether a continuation of the sentence after the second NP—i.e., the differentia following the genus NP—improves the results; this is marked by “...” in Table 9 below.
5. Pattern 5 investigates how the condition of a noun phrase starting the sentence influences the precision and recall.
6. Pattern 6 is similar to Pattern 5, except that the sentence obligatorily continues after the second NP.
7. In the last pattern (Pattern 7), we added two details. First, in a noun phrase we added the possibility that if several adjectives follow each other, the last element can be introduced by the conjunction in [and] or ali [or] (e.g., učna in testna množica [training and test set]); in Table 9 we mark this with NP°-nom in order to differentiate it from the previous noun phrases without this option. Second, a noun phrase can be followed by an optional English noun phrase translation, introduced by the abbreviation ang. or angl. (e.g., lematizacija (angl. lemmatization)). The latter is very frequently used in our corpus, but also in the Slovene Wikipedia or other texts when introducing terminology which is not yet established in Slovene.

X_is_Y type of patterns32                        # Extracted sentences   Precision (# of definitions in extracted sentences)   Recall estimate (# of def. in 150 recall test set)
1. N je/sta/so N                                 1,711                   0.1300 (est.)33                                       0.2600 (39)
2. N-nom je/sta/so N-nom                         541                     0.2052 (111)                                          0.2333 (35)
3. NP-nom je/sta/so NP-nom                       1,255                   0.1984 (249)                                          0.3933 (59)
4. NP-nom je/sta/so NP-nom...                    1,199                   0.2052 (246)                                          0.3867 (58)
5. ^NP-nom je/sta/so NP-nom                      645                     0.2356 (152)                                          0.2667 (40)
6. ^NP-nom je/sta/so NP-nom...                   622                     0.2411 (150)                                          0.2667 (40)
7. NP°-nom (ang./angl. NP)? je/sta/so NP-nom     1,281                   0.2022 (259)                                          0.4000 (60)

Table 9. Evaluation of precision and recall for different variations of the “X_is_Y” type of patterns on the Slovene corpus.

In Table 9, we provide the number of extracted sentences for each variation of the “X is Y” pattern type, together with the precision and recall for each pattern. For measuring precision, we evaluated all the extracted sentences.33 There are several reasons for evaluating the entire sets of candidate sentences throughout the thesis. Even if it represents a significant amount of invested time, this enabled us to actually identify the sentences that can be used as preliminary definitions for the language technologies glossary; from a simple

32 We use je/sta/so [is/are] in the pattern list in order to facilitate the reading. The MSD tag corresponding to je/sta/so defines the category of auxiliary verb types in third person singular, dual or plural, present tense, without negative value (negative value differentiates je [is] from ni [isn’t]). 33 While for other patterns we evaluated all the candidate sentences, for Pattern 1 (which is the most basic pattern) we decided not to evaluate the entire number of candidates, since with other patterns, we expected to achieve higher precision and/or recall. Therefore, for the first pattern, the estimation is provided for 100 randomly selected sentences only.


proof-of-concept setting this approach enabled us to build an initial glossary of the domain, thus modeling the Slovene language technologies domain. Secondly, the more sentences we evaluate, the more complete is our overview of the variety of definition types in running text, which is in line with the linguistic aims of this thesis. As already mentioned in Section 4.2, recall was measured against the test set of 150 definitions (the so-called recall test set), meaning that it is only an estimate of the actual recall. It is also interesting to observe the number of actual definitions extracted from the corpus, given in parentheses in the Precision column. The following symbols are used in Table 9: “/” means alternative choices, “^” means the beginning of a sentence, “?” means an optional element, “...” means continuation of a sentence, “N” means noun, nom means the nominative case, “NP” means noun phrase (details described in the text above) and “NP°” denotes noun phrases with optional in/ali [and/or] when enumerating the adjectives. The precision of applying the first pattern is 0.13 and the estimated recall is 0.26. We can see that simply by adding the condition that the noun should be expressed in the nominative case, precision gets much higher (from 0.13 to 0.2052, as shown in the second row of Table 9). Pattern 3 improves especially the recall (from 0.2333 to 0.3933) and extracts 249 actual definitions from the corpus, compared to 111. Precision is slightly lower in this case (when a variety of noun phrases is taken into consideration instead of nominative nouns only). Pattern 4 shows that definition extraction is more precise if the sentence continues after the genus part (in the majority of cases with a relative pronoun that introduces the differentia part), but some definitions are lost when applying this condition. Patterns 5 and 6 show that precision can get as high as 24% if we look only at sentences that start with a noun phrase. The last pattern (Pattern 7) uses the extended definition of a noun phrase, has the best coverage of the recall test set and extracts more definitions than the other patterns. To briefly discuss the results, we present a few examples. First, let us have a look at two examples of correctly extracted definitions:

xxxv. Lematizacija je postopek, pri katerem neki besedni obliki v besedilu tvorimo lemo (geslo, iztočnico). [Lemmatization is a process in which we assign a lemma (keyword, headword) to a word form in a text.]

xxxvi. Ontologija je izraz, sposojen iz filozofije, ki na področju računalništva in informatike označuje formalno urejeno strukturo pojmov določenega področja in razmerij med njimi za namene inteligentnih aplikacij. [Ontology is an expression, borrowed from philosophy, that in the domain of computer science and informatics denotes a formally organized structure of concepts of a selected domain and of the relations between them, for the purpose of intelligent applications.]

We can see that in the first sentence the second noun is a (quite general) hypernym, while in the second sentence the second noun is a general term, izraz [expression], and the defining part comes only later, introduced by označuje [denotes] (formalno urejeno strukturo pojmov [formally organized structure of concepts...]). Compared to Pattern 2, Pattern 3 covers examples in which a noun phrase has a more complex structure than a simple noun. For example, in sentence (xxxvii) the first noun phrase, programi s pomnilnikom prevodov [programs with translation memory], is composed of a head noun in the nominative plural, followed by a preposition, a noun in the instrumental case and a noun in the genitive case. A similar example is sentence


(xxxviii), where the first nominative noun phrase has the head noun in the nominative followed by another noun in the genitive case.

xxxvii. Programi s pomnilnikom prevodov so integrirana orodja, ki združujejo najmanj dve komponenti, in sicer pomnilnik prevodov in terminološko banko, lahko pa vključujejo še druga od zgoraj omenjenih orodij. [Translation memory programs are integrated tools that combine at least two components, namely a translation memory and a terminological bank, but can also include other of the above-mentioned tools.]

xxxviii. Pomnilnik prevodov je baza, ki vsebuje besedilne segmente v izvornem in ciljnem jeziku. [A translation memory is a database which contains text segments in the source and target languages.]

Next, we analyze different types of false positive sentences (extracted non-definitions). In some examples, we observe metaphoric use, where a sentence corresponds to the pattern, and is thus extracted, but cannot be used as a term definition because of its metaphorical meaning. Both sentences below also have the characteristic of defining a proper noun, which is, as already mentioned in Section 3.4.1, a complex case.

xxxix. Enciklopedija Britanika je kraljica med spletnimi enciklopedijami. [Encyclopedia Britannica is the queen of web encyclopaedias.]

xl. Softpedia je zakladnica informacij za vsakega računalničarja. [Softpedia is a treasury of information for every computer scientist.]

In other examples the defined word is not a domain term. In the sentence below, WordList denotes a function of the WordSmith tool, but we do not consider it a term. The tool in which the defined function is used (WordSmith) is not mentioned in the sentence, meaning that the definition is of little use without background knowledge. xli. Drugo osnovno orodje je WordList, ki sestavi seznam vseh pojavnic v korpusu in jih uredi bodisi po številu pojavitev bodisi po abecednem redu. [Another basic tool is WordList, which composes a list of all corpus tokens and orders them either by the number of occurrences or alphabetically.]

As the last example, we provide a sentence which could be a definition if the definiendum were specified. Regardless of the missing definiendum, the reader can guess that the sentence describes the holonymy/meronymy relation. xlii. Za tovrstne relacije nimamo slovenskega poimenovanja; gre za razmerja vključevanosti, in sicer X je del Y; v besediloslovju so bili uvedeni zaradi spoznanja, da razmerja nad- in podpomenskosti ne zajamejo nekaterih v besedilnem svetu pomensko nepredvidljivo povezanih enot (Gorjanc 1999: 146). [For these kinds of relations we do not have a Slovene name; they concern inclusion relations, namely X is part of Y; in text linguistics, they were introduced owing to the realization that the relations of hyper- and hyponymy do not capture some units that are semantically unpredictably connected in the text world (Gorjanc 1999: 146).]

One of the reasons for not extracting definitions is the incorrect assignment of morphosyntactic descriptions by ToTrTaLe in text preprocessing. For example in


sentence (xliii) the first noun phrase Sketch Engine is tagged as noun in nominative for Sketch and adjective in accusative for Engine and is therefore not extracted by any of the patterns. Similarly, in sentence (xliv) below, the noun phrase empirični pristop is incorrectly tagged as accusative, and is therefore not extracted.

xliii. Sketch Engine je korpusno orodje, ki na vhodu sprejme korpus kateregakoli jezika ter njegove slovnične vzorce, iz njih pa ustvari besedne skice (Word sketches) za besede tega jezika. [Sketch Engine is a corpus tool that takes as its input a corpus in any language and its grammatical patterns, and creates word sketches (Word sketches) for the words of this language.] xliv. Konverzacijska analiza je empirični pristop, ki uporablja v glavnem indukcijske metode in išče ponavljajoče se vzorce v najrazličnejših posnetkih človeških pogovorov. [Conversation analysis is an empirical approach that uses mainly induction methods and searches for repeating patterns in various human conversation recordings.]

In summary, we have analyzed the extraction results of different types of "X is/are Y" patterns, together with their precision and estimated recall. The best pattern in terms of precision was Pattern 6, while the best pattern concerning the estimated recall was Pattern 7, which extracted the largest number (259) of actual definitions from the corpus. For this reason, Pattern 7 was selected as the basic pattern to be included in the final set of patterns presented in Table 10.
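To make the conditions compared in Table 9 concrete, the following is a minimal sketch, in Python, of how the simplest "N je/sta/so N" check and the additional nominative condition could be implemented over tagged tokens. The token representation (form, lemma, pos, case) and the function name are assumptions introduced for illustration only; they deliberately simplify the actual ToTrTaLe output and the noun phrase handling of the later patterns.

# A minimal sketch of the "N je/sta/so N" check (Pattern 1 of Table 9) and the
# nominative condition (Pattern 2), over a simplified token representation.

COPULA_FORMS = {"je", "sta", "so"}  # is/are (singular, dual, plural)

def matches_simple_is_a(tokens, require_nominative=False):
    """Return True if the sentence matches 'N je/sta/so N', optionally
    requiring both nouns to be in the nominative case."""
    for i, tok in enumerate(tokens):
        if tok["form"].lower() in COPULA_FORMS and tok["lemma"] == "biti":
            before = tokens[i - 1] if i > 0 else None
            after = tokens[i + 1] if i + 1 < len(tokens) else None
            if not (before and after):
                continue
            if before["pos"] == "N" and after["pos"] == "N":
                if not require_nominative:
                    return True
                if before["case"] == "nom" and after["case"] == "nom":
                    return True
    return False

# Toy example: "Lematizacija je postopek" with hand-assigned tags.
sentence = [
    {"form": "Lematizacija", "lemma": "lematizacija", "pos": "N", "case": "nom"},
    {"form": "je", "lemma": "biti", "pos": "V", "case": None},
    {"form": "postopek", "lemma": "postopek", "pos": "N", "case": "nom"},
]
print(matches_simple_is_a(sentence, require_nominative=True))  # True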

Other pattern types
In addition to the selected "NP-nom je/sta/so NP-nom" pattern of the "X is Y" pattern type described above, we defined eleven other pattern types inspired by the analysis of the sentences presented in Section 3.4. Table 10 lists the twelve pattern types. For each pattern we provide its English translation together with the number of extracted sentences, the precision and the recall (estimated on the recall test set). The last row (TOTAL) presents the results of applying all the patterns (i.e., the union of all the extracted definition sentences), which shows that the pattern-based approach has an overall precision of 0.2251. A detailed description of the symbols used in the pattern list of Table 10 is as follows. "NP" denotes an extended notion of a noun phrase, covering different types of noun phrases, including the ones involving coordination of premodifiers (i.e., adjectives connected by in [and] or ali [or]); it also comprises an optional second noun phrase (an English translation), introduced by the abbreviation ang. or angl. "NP-nom" denotes a noun phrase in the nominative case, "/" denotes alternatives, and ".*" means any number of words. All the words in the patterns denote word lemmas, except when the word is used between single quotes ('), which denotes a word form. "ADV" denotes an adverb, "PRT" a particle and "V-inf" a verb in the infinitive. A question mark (?) means that the element can occur zero or one time. When a pattern is split into a) and b), both variants belong to the same pattern type and differ only in word order. As we can see from the results in Table 10, the majority of examples are covered by the first pattern, which is an "X is/are Y" pattern type (i.e., "NP-nom je/sta/so NP-nom"). However, as shown in this table, better results can be achieved by using all the


twelve pattern types proposed. All the pattern types of Table 10 and some corresponding examples are described in more detail below; a schematic sketch of how such patterns can be matched follows the list. 1. The first pattern (NP-nom je/sta/so NP-nom) consists of a noun phrase in the nominative case, followed by the third person singular, dual or plural form of the verb be, followed by another noun phrase in the nominative. Noun phrases (NP) can take a variety of forms, such as "adjective+noun", "noun+noun", etc. An optional English translation of the noun phrase, introduced by the abbreviation "ang." or "angl.", can follow an NP. The same holds for NP in all the following patterns. (Note that this pattern is the best performing pattern in terms of recall among all the "X is Y" pattern types evaluated in Table 9.) An example sentence covered by this pattern is given below. xlv. Lematizacija je postopek pripisovanja osnovne oblike besedam v korpusnem besedilu.34 [Lemmatization is the process of assigning a base form to words in a corpus text.]

2. The second pattern is similar to the first one but has an extra demonstrative pronoun tisti [that/those] before the second noun phrase. We selected the forms of the pronoun that cover examples in the nominative case (it can be in singular, dual or plural). It extracts sentences corresponding to the “NP is/are those NP ... that” pattern where that is a relative pronoun. An example is: xlvi. Morfologija ali oblikoslovje je tisti del slovnice (Top84), ki se ukvarja z notranjo zgradbo besed in njihovo funkcionalno vlogo v stavku. [Morphology35 is that part of the grammar (Top84) that is concerned with the internal structure of words and their functional role in a sentence].

3. The third pattern looks for noun phrases in the nominative case that are defined by any realization of the lemmas of defining verbs other than the verb be: definirati [define], opredeliti [determine], opisati [describe], followed by the particle kot [as] and another noun phrase. Example: xlvii. Dickinson (2009) v navezavi na Biberja (1993) reprezentativnost korpusa definira kot "mero, do katere vzorec vsebuje variabilnost celotne populacije" in ki nam omogoča, da pridobljene rezultate posplošujemo na celotno vzorčeno jezikovno zvrst. [Dickinson (2009), referring to Biber (1993), defines the representativeness of a corpus as "the extent to which a sample includes the full range of variability in a population", which enables the generalization of the results to the entire linguistic genre that was sampled in the corpus.]

34 The sentence starts with a section number and title. For easier reading we omit them in these examples, but we discuss the wrong segmentation issue in other sections. 35 In Slovene the structure is oblikoslovje ali morfologija [morphology or morphology], where the two nouns are synonyms, the first being the Slovene term and the latter the international form, and the two synonyms are connected by or.


Pattern | English translation of the pattern | # Extracted sentences | Precision (# of def. in extr. sentences) | Recall estimate (# of def. in 150 recall test set)
1. NP-nom je/sta/so32 NP-nom36 | NP-nom is/are NP-nom | 1,281 | 0.2022 (259) | 0.4000 (60)
2. NP-nom je/sta/so 'tisti/a/o/e' NP-nom .* ki | NP-nom is/are 'those' NP-nom .* that | 10 | 0.5000 (5) | 0.0133 (2)
3. NP .* definirati/opredeliti/opisati kot NP | NP .* define/describe/determine as NP | 33 | 0.4848 (16) | 0.0200 (3)
4. definirati/opredeliti/opisati NP kot NP | define/describe/determine NP as NP | 8 | 0.6250 (5) | 0.0067 (1)
5. a) pojem/beseda/termin/poimenovanje/izraz NP-nom .* nanašati/pomeniti; b) nanašati/pomeniti .* pojem/beseda/termin/poimenovanje/izraz NP-nom | a) concept/word/term/naming/expression NP-nom .* refer/mean; b) refer/mean .* concept/word/term/naming/expression NP-nom | 40 | 0.2500 (10) | 0.0067 (1)
6. 'V/Po' .* je/sta/so NP-nom definiran/opredeljen/predstavljen kot NP | 'In/According to' .* is/are NP-nom defined/determined/presented as NP | 5 | 0.8000 (4) | 0.0133 (2)
7. NP .* imenovati/poimenovati (ADV/PRT)? NP | NP .* call/name (ADV/PRT)? NP | 222 | 0.2793 (62) | 0.0533 (8)
8. imenovan (ADV/PRT)? NP | called (ADV/PRT)? NP | 87 | 0.2529 (22) | 0.0400 (6)
9. NP .* znan pod ime NP | NP .* known under the name NP | 2 | 1.0000 (2) | 0.0067 (1)
10. 'Kot' NP .* je/sta/so .* NP .* NP | 'As' NP .* is/are .* NP .* NP | 67 | 0.2239 (15) | 0.0600 (9)
11. naloga NP-gen je/sta/so NP/V-inf | role of NP-gen is/are NP/V-inf | 9 | 0.3333 (3) | 0.0067 (1)
12. a) kadar/ko/če .* 'govorimo' o NP; b) o .* NP .* 'govorimo' .* kadar/ko/če | a) when/if .* 'we talk' about NP; b) about .* NP .* 'we talk' .* when/if | 10 | 0.7000 (7) | 0.0067 (1)
TOTAL | | 1,728 | 0.2251 (389) | 0.5867 (88)
Table 10. Precision and recall of individual patterns on the Slovene data set. The second column contains the English translation of each pattern; the third column gives the number of extracted sentences, with the number of true definitions in parentheses. The fourth column gives the precision on all the extracted sentences and the last one the recall estimate, calculated on the 150 definitions test set, with the number of elements extracted from this dataset in parentheses.

36 This pattern corresponds to Pattern 7 of Table 9.

4. This pattern is similar to the previous one in that it uses the same verb lemmas definirati [define], opredeliti [determine], opisati [describe], but has a different structure: "definirati/opredeliti/opisati NP kot NP" ["define/determine/describe NP as NP"]. Example: xlviii. Frank Austermühl [1] definira terminološki program kot orodje za izdelavo in vzdrževanje terminologije. [Frank Austermühl [1] defines terminology software as a tool for creating and maintaining terminology.]

5. This pattern has the condition that the sentence should contain one of the following words pojem/beseda/termin/poimenovanje/izraz [concept/word/term/naming/expression] followed by a noun phrase in the nominative case and one of the verbs nanašati/pomeniti [refer/mean]. The pattern has two different possible orders, either “pojem/beseda/termin/poimenovanje/izraz NP-nom .* nanašati/pomeniti” [“concept/word/term/naming/expression NP-nom .* refer/mean”] or “nanašati/pomeniti .* pojem/beseda/termin/poimenovanje/izraz NP-nom” [“refer/mean.* concept/word/term/naming/expression NP-nom”], where .* means possible other elements. This pattern models sentences of type “word X means/refers to Y ...”, such as: xlix. Termin govorni dogodek izhaja iz etnografije govora (njegov avtor je Hymes, povzeto po (Coulthard, 1985)) in pomeni največje jezikovne enote, za katere lahko ugotovimo jezikovno strukturo. [The term speech event originates from ethnography of speech (its author is Hymes, summarized after (Coulthard, 1985)) and means the largest linguistic units for which we can determine the linguistic structure.]

6. This pattern extracts sentences like Example (xii) of the 100 initially analyzed definition sentences (In classical theory (Katz and Fodor 1963), the meanings of words are presented as sets of necessary and sufficient conditions that ...). The first part introduces the author or the scope of the definition (V [In] ... / Po [According to] ...), and the second part of the sentence is composed of the verb biti [be], a noun phrase in the nominative case and the participle form of a defining verb definiran/opredeljen/predstavljen [defined/determined/presented], followed by kot NP [as NP].

7. The pattern matches sentences in which the noun phrase is defined using the verbs imenovati or poimenovati [name/call]. An optional adverb or particle (e.g., tudi [also], kar [simply]) can occur before the second noun phrase; see the example below: l. Inventar jezikovnih poimenovanj pojmov neke stroke imenujemo tudi terminologija, na primer geološka, medicinska, planinska terminologija. [We also call an inventory of linguistic denotations for concepts of a particular subject a terminology, for example geological, medical or alpine terminology.]37

37 Note that the original Slovene word order differs from the English translation: [the inventory of linguistic denotations for concepts of a particular subject] [we call also] [a terminology, /.../].


8. Similar to the previous pattern, the participle imenovan [named/called] can be used. An example is given below. li. Avtomatsko pridobivanje leksikalnih podatkov iz korpusnih in primerljivih virov je v literaturi imenovano luščenje leksikalnih podatkov (ang. lexical data extraction). [Automatic acquisition of lexical data from corpora and comparable resources is in the literature called lexical data extraction (Eng. lexical data extraction).]

9. In this search pattern the definiendum is introduced by the expression znan pod imenom [known under the name] and is defined by a noun phrase at the beginning of the sentence. Example: lii. Pri skladenjskem označevanju gre tako za funkcijsko- kot tudi za pomenskoskladenjske oznake, ki seveda zahtevajo poglobljeno jezikoslovno analizo s pomočjo razvejane analize v drevesne strukture, znane pod imenom treebank. [Syntactic tagging is concerned both with assigning functional- and semantic-structural tags, which certainly require deep linguistic analysis with the aid of analysis into tree structures, known under the name treebank.]

10. The next pattern matches sentences like Example (x) (“As data structures, the semantic networks are pointed graphs, where the concepts are presented by points or nodes, and the relations by arrays or links between them”).

11. Since in the analyzed set we observed several sentences that define a concept by its purpose, we introduce a single simple pattern, "the role of NP-gen is/are", followed by a noun phrase or a verb in the infinitive form. Example: liii. Naloga oblikoslovnih označevalnikov besedil je določevanje besednih vrst (angleško "part-of-speech") ali še natančnejših oblik znotraj besednih vrst besedam v besedilu. [The role of part-of-speech taggers is to assign part-of-speech tags (English "part-of-speech") or even more detailed forms within parts of speech to words in a text.]

12. In the last pattern we used the structure kadar/ko/če [when/if] and govorimo o NP [we talk about NP]. We did not use the lemma govoriti [talk], which is not specific to defining contexts, but only the form govorimo o. The order can be reversed, as shown in versions a) and b) of the pattern. liv. Ko danes govorimo o korpusu, nam to pomeni računalniško zbirko besedil oz. delov besedil, zbranih po enotnih kriterijih za namene različnih, predvsem jezikoslovnih raziskav (Atkins et al. 1992: 1). [Nowadays, when we talk about corpora, we mean electronic collections of texts or parts of texts, selected according to explicit design criteria for the purpose of various, especially linguistic research (Atkins et al. 1992: 1).]
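The twelve pattern types above can be viewed as lexico-syntactic templates over lemmatised, morphosyntactically tagged tokens. The sketch below, referred to at the beginning of this list, shows one possible way of encoding a few of them as regular expressions over a flattened "lemma|tag" representation of a sentence; the token encoding, the simplified NP placeholder and the selection of patterns are illustrative assumptions, not the actual implementation used in this thesis.

# A schematic sketch: a few pattern types from Table 10 encoded as regular
# expressions over tokens rendered as "lemma|POS-Case" (e.g., "postopek|N-nom").
import re

NP_NOM = r"(?:\S+\|A-nom\s+)*\S+\|N-nom"   # crude NP-nom: optional adjectives + head noun
ANY = r"(?:\S+\s+)*?"                       # ".*" over whole tokens, non-greedy

PATTERNS = {
    # NP-nom je/sta/so NP-nom  (Pattern 1, simplified)
    "is_a": rf"^{NP_NOM}\s+biti\|V\S*\s+{NP_NOM}",
    # NP .* definirati/opredeliti/opisati kot NP  (Pattern 3, simplified)
    "define_as": rf"(?:^|\s)(?:definirati|opredeliti|opisati)\|V\S*\s+{ANY}kot\|\S+",
    # NP .* imenovati/poimenovati (ADV/PRT)? NP  (Pattern 7, simplified)
    "called": r"(?:^|\s)(?:imenovati|poimenovati)\|V\S*",
}

def match_patterns(encoded_sentence):
    """Return the names of all patterns that an encoded sentence matches."""
    return [name for name, rx in PATTERNS.items()
            if re.search(rx, encoded_sentence)]

# Toy example: "Lematizacija je postopek", encoded by lemma and simplified tag.
example = "lematizacija|N-nom biti|V-3s postopek|N-nom"
print(match_patterns(example))  # ['is_a']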


Discussion on definition candidates extracted by the pattern-based approach
The examples of definition sentences outlined in Section 3.4 and the quantitative results of pattern-based definition extraction presented in Table 10 indicate the complexity of the actual task of definition extraction from the Slovene Language Technologies Corpus. Having in mind the goal of building a pilot glossary of the Slovene language technologies domain, the pattern-based approach, covering a variety of different pattern types, resulted in a reasonable number of candidate definition sentences to be inspected for inclusion in the glossary. All 1,728 candidate sentences were manually inspected and validated in terms of defining relevance, of which 389 were deemed acceptable as preliminary glossary definitions. From Table 10 we can see that not all the patterns are productive to the same extent. For the moment they have been evaluated on our corpus only, but in the future, when evaluated on a larger set of text types, we may decide to keep only the most productive patterns, with a good precision-recall balance, in the methodology. However, compared to the simple straightforward pattern "NP-nom je/sta/so NP-nom" (Pattern 3 of Table 9), we extract 150 definitions more and improve both precision and recall. Although some effort was needed to define the different patterns, they can from now on be used for extracting definitions from any new corpus. Moreover, since the method is implemented in the modular workflow environment, the list of patterns can quickly be adjusted, while the pattern-based method can itself be combined with other methods. Compared to creating a list of definitions from scratch, we believe that the user can benefit from our approach: it is quicker to filter and edit sentences than to write definitions; the user can retrieve the context of each sentence from the corpus, since each sentence is associated with the source article ID; and less expert knowledge is needed (e.g., when the system is used as a support in translation). Next, we comment on some examples of extracted definition candidates, discuss the reasons for extracting false positive examples (i.e., sentences that correspond to one of the patterns but are not definitions) and discuss some other limitations of (semi-)automatic definition extraction. Let us first provide an example definition extracted with the first ("NP-nom is/are NP-nom") pattern (Example (lv)) and an example definition including a verb other than the verb be (Example (lvi)). Example (lv), extracted by the first pattern (the "NP-nom je NP-nom" definition type), is basically a genus and differentia definition. The definiendum reference corpus is defined, after the hinge is, by the genus (collection of texts), which already contains a specific element (it is a monolingual collection of texts) and is followed by the rest of the differentiae. The differentiae, i.e., what differentiates a reference corpus from other monolingual collections of texts, are, according to the given definition, its representativeness of a certain language as well as its specific use (it can serve as a basis for fundamental linguistic research). We can already see that the differentia structure is not as simple as in the typical (made up) examples of definition types. Instead of the structure ... that is representative, balanced, etc., the structure contains a relative clause ki naj bi ... [that is supposed to ...], which is already less clear because of the expressed modality.
After the first part, which provides the typical characteristics of the definiendum, the second part of the differentia mentions the purpose/use. We can therefore speak of a combined type, in which the genus-differentia structure has a differentia of the functional definition type, as defined in Section 2.1.5.


lv. Referenčni korpus je enojezikovna zbirka besedil, ki naj bi predstavljala celovito podobo nekega jezika in tako služila kot izhodišče za temeljne jezikovne raziskave. [A reference corpus is a monolingual collection of texts, which is supposed to present an integral representation of a given language and hence serve as a basis for fundamental linguistic research.]

A definition extracted by a verb other than the verb be is given below. The sentence is extracted by Pattern 3 on the basis of the verb definirati [define]. The sentence is again a genus-differentia definition type with a functional definition element in the differentia structure. Discourse markers is the definiendum and words or phrases is a complex genus, while the differentia is given by the functional definition (the difference between discourse markers and other words is claimed to lie in their function). In this sentence, which is taken out of context, the verb defines is not attributed to a scientific authority; the subject is expressed only through the verb in the third person singular. lvi. Diskurzne označevalce definira kot besede ali fraze, ki so uporabljene s primarno funkcijo usmeriti naslovnikovo pozornost k posebni vrsti povezave med izjavo, ki bo sledila, in trenutnim diskurznim kontekstom. [He defines discourse markers as words or phrases that are used with a primary role of focusing the recipient's attention on a specific kind of relation between the utterance that will follow and the current discursive context.]

As shown in Section 3.4 a definition in a corpus does not always correspond to the Aristotelian genus and differentia formula. Even sentences with no hypernym can be considered definitions. Example (lvii), extracted by Pattern 8, is a functional definition, defining the definiendum by its use/purpose. lvii. S tako imenovanimi pregledovalniki lahko poiščemo želene dele korpusa. [With so-called corpus processing tools, we can find specific parts of the corpus.]

Compare sentences (lviii) and (lix). The first one, extracted by Pattern 3, does not contain a hypernym, but is a definition sentence. The latter (extracted by Pattern 1), even though a hypernym is provided, is not an actual definition. It could be used as a second or third sentence in a dictionary entry, but it cannot function as a definition by itself. On the other hand, it provides a synonym, so it can be considered a knowledge-rich context; however, in order to be a definition, at least one of the two synonyms would need to be defined. lviii. Leibniz besedi definira kot sopomenki, če zamenjava ene z drugo nikoli ne spremeni pomena stavka, v katerem je do zamenjave prišlo. [Leibniz defines two words as synonyms if substituting one with the other never changes the meaning of the sentence in which the substitution was performed.] lix. Sopomenskost oziroma sinonimija je horizontalni pojav, relacija pa je simetrična: če je x sopomenka besede y, je tudi y sopomenka besede x /.../ [Synonymy38 is a horizontal phenomenon and the relation is symmetrical: if x is a synonym of the word y, y is also a synonym of the word x /.../]

38 In Slovene, the sentence starts with Synonymy or synonymy, where the first term is the Slovene term and the second one is the international term.


We can also observe the extensional definition type (Example (lx) below). Given that all the categories of metadiscourse elements are enumerated in order to define them, the sentence can be interpreted as an enumerative extensional definition. lx. Pri analizi sledi Hylandovi tipologiji, po kateri so metabesedilni elementi razvrščeni v deset kategorij (povzeto iz Pisanski, 2005): logični povezovalci (predvsem vezniki in prislovne besedne zveze), označevalci okvira (npr. najprej, nato, prvič, drugič, če zaključimo, moj namen je), endoforični označevalci (npr. glej spodaj, kot je bilo omenjeno zgoraj), dokazovalci (npr. citiranje), tolmači (npr. to se imenuje, z drugimi besedami), omejevalci in ojačevalci (npr. morda, možen, jasno), označevalci odnosa do vsebine (npr. žal, strinjam se), označevalci odnosa do bralca (npr. iskreno, bodite pozorni), označevalci osebe (npr. jaz, mi, moj, naš). [The analysis follows Hyland's typology, which classifies metadiscourse elements into ten categories (after Pisanski, 2005): logical connectors (especially conjunctions and adverbial phrases), frame markers (e.g., first, next, firstly, secondly, in conclusion, our aim is), endophoric markers (e.g., see below, as mentioned above), evidentials (e.g., citations), code glosses (e.g., this is called, in other words), hedges and boosters (e.g., maybe, possible, clearly), attitude markers (e.g., unfortunately, I agree), engagement markers (e.g., sincerely, be careful), person markers (e.g., me, we, my, our).]

On the other hand, we should mention that several types of sentences, formally corresponding to definition patterns, are not definitions. We have already briefly discussed different reasons for extracting non-definition sentences, and we here continue this line of analysis. Non-definitions can be caused by the fact that the sentence (or its hypernym) is too general or too specific, so that the sentence does not actually define the term (these two deficiencies are discussed in Sections 2.1.5 and 2.1.6). First, we mention several examples of non-definitions caused by a 'too general' meaning. E.g., sentence (lxi) does not define the tool Emacs and sentence (lxii) is not a definition of (English) lexicography. Both correspond to Pattern 1, which extracts genus and differentia definitions, but the hypernyms are too general (Emacs - contemporary product; English lexicography - dynamic field). Example (lxiii), for instance, describes the bag of words representation (BoW) and provides the term's hypernym, but does not provide sufficient information to act as a definition. Of course, what counts as a definition is itself a complex question, discussed in more detail in Section 5.3.3 on inter-annotator agreement. In Example (lxiv), a sibling concept is used instead of a hypernym (cf. analogical definition). However, the definition is insufficient, because neither the differences between the two concepts nor their characteristics are explained. Moreover, we should be sure that the sibling concept is also defined in the same glossary. These examples also show that the extracted sentences, even if they are not definitions, can contain useful information for further manual refinement of the automatically extracted sentences. lxi. Urejevalnik emacs je sodoben izdelek, ki je v marsikaterem tehnološkem pogledu pred komercialnimi izdelki najbolj znanih proizvajalcev. [Emacs editor is a contemporary product that is in many technological aspects ahead of the commercial products of the best-known producers.] lxii. Angleška leksikografija je dinamično področje, ki predvsem v zadnjih dvajsetih letih res sprotno spremlja spremembe na leksikalni ravni.



[English lexicography is a dynamic field that, especially in the last twenty years, has continually followed changes at the lexical level.] lxiii. Vreča besed (angl. bag of words ali BOW) je preprosta tehnika preoblikovanja besedila za potrebe klasifikacije. [Bag of words (Engl. bag of words or BOW) is a simple technique of text transformation for the needs of classification.] lxiv. Lematizaciji podoben postopek se imenuje krnjenje (angl. stemming). [A process similar to lemmatization is called stemming (Engl. stemming).]

In some examples (see lxv) the definition candidate is too specific. In this example, translation is defined only in a specific setting (statistical machine translation) and it does not define the general concept of translation. lxv. Besedilo je prevedeno glede na verjetnostno porazdelitev – prevod je tisto besedilo, ki ima najvišjo verjetnost, ta pa se običajno računa po posameznih povedih. [Text is translated according to probability distribution – translation is the text with the highest probability, which is usually computed for each individual sentence.]

A considerable number of false positive examples are out-of-domain sentences, meaning that even if the sentences are 'true' they cannot be used for domain modeling in terms of glossary construction. Some examples come from very general expressions such as problem, result or exception, which are very frequent in structures such as the results are..., the exception is..., the problem is... (cf. sentence lxvi). Numerous out-of-domain examples can be identified as being extracted from corpus articles about Slovene wordnet construction or from papers providing examples of semantic relations. Another type of false positive examples are sentences that are not 'true', such as the example sentences (lxviii) and (lxix), taken from a Master's thesis on automatic construction of logic exercises. lxvi. Naslednji problem je velika dinamičnost človeškega govora – pri hitrem govoru je odstotek napak pri razpoznavi večji. [The next problem is the high dynamics of human speech – the faster the speech, the higher the number of recognition errors.] lxvii. Miza je kos pohištva. [A table is a piece of furniture.] lxviii. Vsa majhna telesa so kocke. [All small bodies are cubes.] lxix. D je vitez. [D is a knight.] To avoid the extraction of out-of-domain sentences, we see several solutions. The first is to predefine a list of terms that we want to define, the second is to compute the domain termhood value for each sentence (we investigate this in Section 5.3, where we combine pattern-based and term-based definition extraction methods), and the third is to invest more time in corpus preprocessing by filtering out the noisy parts of the text (such as table contents, examples, etc.) and keeping only the main body text. We did perform a lot of preprocessing on the small LTC proceedings corpus, but not on the


main LT corpus. Note also that the last three examples above would not be extracted if a more restrictive "is_a" pattern, which makes the continuation of the sentence after the second noun phrase obligatory (cf. Patterns 4 and 6 in Table 9), were applied.

In summary, in Section 5.1.1 we defined and evaluated different patterns for pattern-based definition extraction from Slovene corpora. As shown in Table 10, the majority of examples are extracted by an extended version of the "X is_a Y" pattern type. Other patterns that extract a non-negligible number of definitions use verbs such as definirati, opredeliti, opisati [define, determine, describe] and imenovati, poimenovati [call, name], and have higher precision than the first pattern. The patterns extracting fewer definition candidates also contribute to the improved overall recall, but their utility should be tested on other types of corpora. The estimated recall of the union of the different patterns is 58.67%, and 389 definitions were extracted from the given Slovene corpus as a result of the manual evaluation of the entire set of 1,728 extracted definition candidates. Finally, a qualitative analysis was performed, discussing the different definition types extracted from the corpus by the pattern-based methods, with reference to the various definition types identified in Section 2.1.5 and Section 3.4. In addition, different reasons for extracting non-definitions or for not extracting definitions were discussed (partially in line with Section 2.1.6, which describes potential problems in definition construction).

5.1.2 Term-based definition extraction
The second approach, named term-based definition extraction, is primarily tailored to extracting knowledge-rich contexts, as it focuses on sentences that contain at least n domain-specific single- or multi-word terms. The first step of this approach is the extraction of domain terms. The term extraction module identifies potentially relevant terminological phrases on the basis of predefined morphosyntactic patterns (e.g., Noun+Noun, Adjective+Noun, etc.). These noun phrases are then filtered according to a weighting measure that compares normalized relative frequencies of single words between the domain-specific corpus and a general reference corpus of Slovene, i.e., the FidaPLUS corpus (Stabej et al., 2006; Arhar Holdt and Gorjanc, 2007). The term extraction tool LUIZ was briefly described in Section 4.3.2, while our implementation of the tool is presented in detail in Section 6.3. Once the domain terminology has been extracted, we can use different parameters to extract definition candidates. The least selective condition is that the sentence contains at least two domain terms (a term pair). For this setting we expect lower precision, since the condition is very loose, but it can lead to better recall than that achieved by the pattern-based approach. If only this basic condition is applied, the method can be used to measure the domain relevance of a sentence; however, it is too imprecise to be used on its own and is therefore expected to be useful only in conjunction with other methods (i.e., the pattern-based or the wordnet-based approach). In order to improve the precision of term-based definition extraction, additional conditions can be specified (e.g., termhood value, a verb between two terms, the nominative case for terms, the position of a term at the beginning of a sentence, etc.). Therefore, we performed several sets of experiments by adding further conditions to the basic one stated above, i.e., that a definition candidate contains at least two domain terms.
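As an illustration of this basic condition, the following minimal sketch keeps sentences that contain at least two domain terms from the top fraction of a termhood-ranked term list. The data structures, parameter names and the toy example are assumptions made for the sketch; they do not reproduce the actual LUIZ output format or the ClowdFlows implementation described in Chapter 6.

# A minimal sketch of the basic term-based filter: keep sentences that contain
# at least `min_terms` domain terms from the top fraction of the ranked term list.

def top_terms(ranked_terms, threshold=0.01):
    """Keep the top fraction (e.g., 1%) of the termhood-ranked term list."""
    cutoff = max(1, int(len(ranked_terms) * threshold))
    return set(ranked_terms[:cutoff])

def count_domain_terms(sentence_lemmas, term_set, max_term_len=4):
    """Return the set of (multi-)word domain terms occurring in a sentence."""
    found = set()
    for i in range(len(sentence_lemmas)):
        for j in range(i + 1, min(i + 1 + max_term_len, len(sentence_lemmas) + 1)):
            candidate = " ".join(sentence_lemmas[i:j])
            if candidate in term_set:
                found.add(candidate)
    return found

def extract_candidates(sentences, ranked_terms, threshold=0.01, min_terms=2):
    """Return sentences containing at least `min_terms` domain terms (a Setting A-like filter)."""
    terms = top_terms(ranked_terms, threshold)
    return [s for s in sentences if len(count_domain_terms(s, terms)) >= min_terms]

# Toy example (surface forms stand in for lemmas; the filler items only pad the list).
ranked = ["pomnilnik prevodov", "prevodna enota", "korpus", "lema"] + ["x%d" % i for i in range(396)]
sent = ["pomnilnik", "prevodov", "hraniti", "prevodna", "enota"]
print(extract_candidates([sent], ranked, threshold=0.01, min_terms=2))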


We tested several settings based on a number of hypotheses listed below. These are related to the different parameter settings tested in the experiments, as described later in this section.
1. Precision will increase if, when matching the domain terms in a sentence, only higher scored terms are taken into account. This hypothesis was checked by experimenting with a threshold parameter (e.g., reducing the list of automatically extracted terms to the top 1% is better than considering the top 10% of the extracted terms).
2. Taking into account more domain terms in a sentence will lead to higher precision (e.g., setting the number of terms to 3 instead of 2).
3. Imposing a verb condition, i.e., requiring that at least one verb occurs between two domain terms, will improve the performance. If more than two domain terms are considered, we also check whether requiring that a verb occurs between the first two terms (VF) increases precision compared to a setting where a verb may occur between any two terms (VA).
4. Precision will increase if one term occurs at the beginning of the sentence (we consider the beginning of a sentence to be the first or the second word position, not only the first position, because in English an article can precede the term, while in Slovene a preposition can).
5. Precision will increase if the first term is a multi-word expression (the multi-word first parameter).
6. Increasing the number of multi-word terms will yield higher precision (e.g., if the number of terms is set to 3 and the number of multi-word terms to 2, at least two out of the three domain terms detected in the sentence should be multi-word expressions).
7. Having terms in the nominative case will increase precision. (Note that this condition, applicable to Slovene only, is related to the pattern-based approach where "X-nom is Y-nom" is used, but it allows for any kind of verb or other elements between the terms, not only the verb be or another predefined verb; the terminological noun phrases in the nominative were identified based on the nominative case of the NP's head noun.)
For testing the above hypotheses we performed numerous experiments. Due to the complexity of testing numerous parameter settings, and for ease of reading, we decided to omit the full tables of experimental results and their extensive discussion from this section; the entire set of experiments is described in Appendix A (Table 23 and Table 24). This section describes only a selected set of experiments, reported in Table 11 below. In all these tables the columns correspond to the parameters related to the seven hypotheses above, while the rows represent different experiments (i.e., different parameter settings). Overall, the results presented in Table 23 and Table 24 mostly confirm the above hypotheses. A general trend is that the higher the termhood value39 and the number of nominatives in the sentence, the better the precision and the lower the recall. Moreover, the more terms and multi-word terms in a sentence, the better the precision. In addition, other constraints, such as having a verb between two terms, having a term at the

39 Terms extracted by the term extraction method are ranked by their termhood value, meaning that if, e.g., only the top 1% of terms are used, their termhood values are higher than if the top 2% of all extracted terms are used, etc.


beginning of a sentence and a multi-word term preceding a single-word term improve the results.
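To make the parameter conditions behind Hypotheses 3 to 7 concrete, the following is a minimal sketch of how such checks could be expressed for a sentence in which the domain terms have already been located. The token and term-match representations (with pos, case and span attributes) are assumptions introduced for the sketch and simplify the actual MSD-based implementation.

# A sketch of the additional conditions from Hypotheses 3-7, assuming that domain
# terms have already been matched in the sentence and that each match records its
# token span, whether it is multi-word, and the case of its head noun.

def passes_conditions(tokens, term_matches, verb="no", begin_sentence=False,
                      multiword_first=False, min_multiword=0, min_nominatives=0):
    """Check the extra term-based conditions for one sentence.

    tokens:       list of dicts with "pos" (e.g., "V" for verbs)
    term_matches: list of dicts with "start", "end", "multiword", "head_case",
                  sorted by their position in the sentence
    """
    if len(term_matches) < 2:
        return False
    # Hypothesis 3: a verb between the first two terms (VF) or between any two terms (VA).
    if verb in ("VF", "VA"):
        pairs = [term_matches[:2]] if verb == "VF" else \
                [term_matches[i:i + 2] for i in range(len(term_matches) - 1)]
        def verb_between(a, b):
            return any(t["pos"] == "V" for t in tokens[a["end"]:b["start"]])
        if not any(verb_between(a, b) for a, b in pairs):
            return False
    # Hypothesis 4: a term at the beginning of the sentence (first or second token).
    if begin_sentence and term_matches[0]["start"] > 1:
        return False
    # Hypothesis 5: the first term is a multi-word expression.
    if multiword_first and not term_matches[0]["multiword"]:
        return False
    # Hypothesis 6: minimum number of multi-word terms.
    if sum(m["multiword"] for m in term_matches) < min_multiword:
        return False
    # Hypothesis 7: minimum number of terms with a nominative head noun.
    if sum(m["head_case"] == "nom" for m in term_matches) < min_nominatives:
        return False
    return True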

Setting | Threshold (1) | # Terms (2) | Verb VA/VF (3) | Begin. sent. (4) | Multi-word first (5) | # Multi-word (6) | # Nominatives (7) | Extr. sent. | Precision | Recall est.
A | 1% | 2 | no | no | no | no | no | 28,215 | 0.0520* | 0.8533 (128)
B | 10% | 2 | no | no | no | no | no | 35,624 | 0.0300* | 0.9733 (146)
ee | 1% | 5 | VF | yes | yes | 3 | 2 | 102 | 0.2647 (27) | 0.0067 (1)
w | 1% | 4 | VA | yes | yes | 2 | 1 | 499 | 0.1944 (97) | 0.0333 (5)
R | 1% | 5 | VA | yes | yes | no | no | 594 | 0.1751 (104) | 0.0467 (7)
R & w | – | – | – | – | – | – | – | 721 | 0.1747 (126) | 0.0467 (7)

Table 11. Selection of settings of term-based methods from Table 23 and Table 24 (A: basic; B: highest recall; ee: highest precision; R: best precision-recall tradeoff40 among the settings without the nominative condition; w: best precision-recall tradeoff among the settings with the nominative condition; R&w: union of R and w, proposed as a suggested combination). For the least restrictive settings, marked with *, precision was evaluated on 1,000 randomly selected definition candidates; the others were evaluated in totality.
The results presented in the tables show the precision and recall achieved in the individual experiments. Precision evaluation was performed by extensive manual evaluation of several thousand definition candidates: in Table 11, Table 23 and Table 24 we evaluated all the extracted sentences for the majority of settings, and only for the settings marked with * did we evaluate a subset of 1,000 randomly selected sentences. Recall, on the other hand, was evaluated on the 150 definitions test set, hence providing only an estimated recall. However, to give a sense of the actual performance, the exact number of actual definitions found among the extracted definition candidates is also provided (these numbers are given in parentheses in the Precision column). Extensive manual evaluation was done not only to obtain the results presented in these tables but also for

40 A measure that combines precision and recall is called the F-measure. The basic variant is the F1-score, which is the harmonic mean of precision and recall. In our case we opted for F0.5, which attributes more weight to the precision score. The formula is as follows:

$F_{0.5} = \frac{(1 + 0.5^2) \cdot P \cdot R}{0.5^2 \cdot P + R} = \frac{1.25 \cdot P \cdot R}{0.25 \cdot P + R}$

We have decided for this option because the precision results in our settings are generally not very high and we prefer settings with slightly higher precision. We have calculated the F0.5 using the actual precision based on the evaluation of all definition candidates extracted by each setting (except in several cases with very loose constraints, where precision, marked with *, is an estimate). The recall is given as an estimate, for which we have taken the average of two recall estimates. One recall measure is the one presented in Section 4.2 and used in all the tables, i.e., the estimate on the 150 definitions test set. The other estimate of recall was motivated by the observation that even if the number of actual definitions extracted by a specific setting was higher than when using a different setting, the recall on the 150 definitions test set was sometimes the same, since it depends on the particular test set. Therefore another estimate of recall was proposed, measured in the following way: we randomly selected 1,000 sentences from the corpus and evaluated them with definition vs. non-definition tags. The number of definitions found in this sample, i.e., 24, was used to estimate the number of definitions in the entire corpus. The estimated recall was then computed as the number of actually extracted definitions divided by the estimated number of definitions in the corpus. The recall based on which the F0.5 was computed is the average of these two recall estimates.
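As a small worked example with illustrative numbers (not taken from our tables): for a setting with precision P = 0.20 and an averaged recall estimate R = 0.40, the score would be F0.5 = (1.25 × 0.20 × 0.40) / (0.25 × 0.20 + 0.40) = 0.10 / 0.45 ≈ 0.22.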


determining the candidate entries for the pilot Slovene language technologies glossary, which is one of the results of this thesis. Table 11 provides only a selection of the entire set of performed experiments, reported in Table 23 (experiments denoted by upper-case letters) and Table 24 (experiments denoted by lower-case letters). The most basic setting in all the experiments is Setting (A), with the simple condition that a sentence should contain 2 domain terms above a threshold set at 1%; the second simple setting (B) applies the same condition but considers the top 10% of terms with the highest termhood value. We can see that with (B) nearly all the candidates from the 150 definitions recall test set are selected, but both (A) and (B) select too many sentences, with too low precision, to be used in practice. The highest precision is achieved in Setting (ee), in which the strictest parameters are applied (e.g., at least 5 domain terms, out of which 3 must be multi-word expressions and 2 in the nominative case). The next two settings (R and w) were selected as the best compromise between precision and recall, with a slight preference for precision, where (R) is selected from the experiments without the nominative case constraint and (w) has the condition of at least one term in the nominative case. The last row of Table 11 (R&w) gives the results for the union of the sentences of these two settings (R and w)40 and is well-suited to be used as a default setting, given that it achieves a suitable tradeoff between precision and the number of extracted definitions (with a higher weight imposed on precision than on recall). Note that, depending on the objective of the user's application, one can choose to tune the method for higher precision or recall by selecting different parameter settings. Moreover, the choice also depends on whether the approach will be used on its own or in combination with other approaches. The results of combining different approaches will be further considered in Section 5.3.1. Note again that an extensive comparison of different parameter settings is provided in Table 23 and Table 24, used as a basis for a detailed quantitative evaluation of each of the hypotheses presented in this section (see Appendix A for details).
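To illustrate how such a combined default could be assembled, the sketch below encodes the R and w settings of Table 11 as parameter dictionaries and takes the union of their candidate sets, mirroring the R&w row. Here extract_with_setting is a hypothetical wrapper around a filter such as the extract_candidates sketch given earlier in this section, extended with the additional parameters; the names and structure are illustrative only.

# An illustrative sketch of combining two parameter settings by set union (R&w).

SETTING_R = dict(threshold=0.01, min_terms=5, verb="VA",
                 begin_sentence=True, multiword_first=True,
                 min_multiword=0, min_nominatives=0)
SETTING_W = dict(threshold=0.01, min_terms=4, verb="VA",
                 begin_sentence=True, multiword_first=True,
                 min_multiword=2, min_nominatives=1)

def union_of_settings(sentences, ranked_terms, extract_with_setting, settings):
    """Return the union of candidate sentence IDs over several settings."""
    candidates = set()
    for setting in settings:
        candidates |= set(extract_with_setting(sentences, ranked_terms, **setting))
    return candidates

# Usage (with a hypothetical extractor returning sentence IDs):
# ids = union_of_settings(sentences, ranked_terms, extract_with_setting,
#                         [SETTING_R, SETTING_W])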

Discussing definition candidates extracted by the term-based approach
Following the brief quantitative evaluation of the different methods summarized above, we shall qualitatively examine the term-based approach. We first outline several advantages of the term-based approach. Next, we illustrate the influence of different parameter settings by examining several examples of sentences extracted under certain parameter values.
Advantages of the term-based approach
An advantage of the term-based approach compared to the pattern-based approach is the extraction of sentences which do not contain a typical defining verb from any of the predefined patterns. For example, with the term-based approach one can extract sentences with a verb used for describing the purpose of a term, i.e., functional definitions. Note that in the first example below, translation memory is defined by its function, i.e., it is a functional definition within which the term translation unit is also defined by means of a paraphrase. The next example is also a functional definition. (Note that both examples were extracted with the nominative constraint.)

lxx. Pomnilnik prevodov hrani prevodne enote, tj. segmente (ponavadi povedi) nekega originala in njihove prevode.


[Translation memory saves translation units, i.e., segments (usually sentences) of an original, and their translations.] lxxi. Besedna skica prikazuje leksikalni profil izbrane iztočnice s podatki o njenem tipičnem sobesedilnem okolju (Gantar in sod. 2009: 33). [A word sketch presents the lexical profile of a selected word together with its typical contexts (Gantar et al. 2009: 33)].

The next example illustrates that the term-based approach can find defining verbs that were not explicitly included in the pattern-based approach, e.g., predstavljati [present] in the sentence below. The term-based approach could therefore in the future be used for enlarging the set of verbs and typical defining structures of the pattern-based approach. lxxii. Naglasno mesto predstavlja zlog, na katerem ima beseda tonsko ali jakostno izrazitost. [Accentuation position represents the syllable on which the word has a tonal or intensity stress.]

Example (lxxiii) shows another characteristic of the term-based approach; compared to the pattern-based approach, it is less sensitive to word order (which is more flexible in Slovene than in English) and to complex noun phrase construction issues. For instance, the pattern “NP je/sta/so NP” does not predict that the part after the copula verb be can start with a preposition (e.g., preposition na in na računalništvo vezana raziskovalna smer). lxxiii. Obdelava naravnega jezika (ONJ) je na računalništvo vezana raziskovalna smer, ki jezik in jezikoslovne ugotovitve uporablja predvsem za (pol) avtomatsko pridobivanje raznovrstnih podatkov, potrebnih za razvoj računalniških aplikacij, ki so z jezikom povezane (jezikovnih tehnologij). [Natural language processing (NLP) is a research area related to computer science, that uses the language and linguistic findings mostly for (semi-)automatic extraction of various data, needed for the development of computer applications related to language (language technologies).]41

An interesting definition subtype mentioned in Section 3.4.2 is the definition in which a term is defined through a sibling concept (and differentia), also called an analogical definition. (One should note that for actual terminological dictionary (glossary) creation, such a definition has its full meaning only if the sibling concept is defined in the same resource.) lxxiv. Temeljne razlike med terminologijo in leksikologijo Kot se terminologija ukvarja s termini in terminotvorjem, se tudi leksikologija ukvarja z leksemi in postopki tvorjenja leksikalnih enot. [Fundamental differences between terminology and lexicology Just as terminology deals with terms and term formation, so lexicology deals with lexemes and the procedures of forming lexical units.] lxxv. Leksikalne semantične mreže so zelo podobne semantičnim mrežam v umetni inteligenci, glavna razlika pa je v izhodišču, ki je pri leksikalnih mrežah leksikalna raven jezika, pri semantičnih mrežah pa pojmovna raven.

41 The English translation does not show the above-mentioned problem. A literal translation would be "Natural language processing (NLP) is to computer science related research area /.../".


[Lexical semantic networks are very similar to semantic networks in artificial intelligence, while the main difference is in the starting point, which is for lexical networks the lexical language level, whereas for semantic networks it is the concept level.] lxxvi. Statistično strojno prevajanje Statistična metoda strojnega prevajanja v nasprotju z metodami na osnovi pravil temelji na večji količini vzporednih besedil, iz katerih se s statističnimi algoritmi izračunavajo verjetnosti prevodne ekvivalence za posamezne jezikovne enote. [Statistical machine translation The statistical machine translation method is, in contrast to rule-based methods, based on larger amounts of parallel texts from which statistical algorithms calculate the probabilities of translation equivalents for individual language units.]

The term-based method also extracts more complex sentences, where an informal definition is embedded in a sentence and introduced by a relative pronoun. lxxvii. Načela gradnje semantičnih leksikonov Predstavljen semantični leksikon je oblikovan v skladu z načeli teorije relacijskih modelov (Evens: 1988), kjer pomene besed opredeljujejo (paradigmatska) pomenska razmerja, ki veljajo med besedami in jih združujejo v pomenske mreže. [Principles of semantic lexicon construction The presented semantic lexicon is constructed in accordance with the principles of relational models theory (Evens: 1988) where word meanings are determined by (paradigmatic) meaning relations that hold between words, connecting them into semantic networks.]

The pattern-based approach is highly dependent on the morphosyntactic descriptions assigned in text preprocessing. Since morphosyntactic corpus annotation is not error-free, an advantage of the term-based approach is that it also extracts sentences where the pattern-based approach fails because of such errors. E.g., sentence (lxxviii) should have been extracted by the "NP-nom is/are NP-nom"42 pattern, but since the second noun phrase is wrongly annotated as genitive instead of nominative, it was not extracted. In contrast, it appears among the definition candidates extracted by the term-based method with no nominative condition (or with the nominative condition set to 1 instead of 2 nominatives). Note that the given example is a borderline case: the term is defined by its hypernym, but the differentia is not specific enough. lxxviii. Referenčni korpusi so enojezikovne zbirke besedil, ki pomenijo obsežen vir informacij o jeziku in njegovih lastnostih. [Reference corpora are monolingual text collections, representing a comprehensive source of information about the language and its properties.]

A similar case is Example (lxxix). The sentence has the structure "NP-nom .* označuje [denote] NP-acc". Since this pattern type does not correspond to any of the manually crafted patterns, it illustrates the advantage of the term-based method, which extracts sentences with verbs and structures other than the manually defined ones (based on the performed linguistic analysis). Moreover, the first noun phrase izraz strojno prevajanje is incorrectly tagged as accusative and therefore could not be extracted based on the

42 NP-nom: noun phrase in the nominative case.


nominative patterns. However, these kinds of errors also influence the term-based extraction if the nominative parameter is applied.

lxxix. Izraz strojno prevajanje (MT – Machine Translation) navadno označuje računalniške sisteme za prevajanje naravnih jezikov, pri katerih je prevajalski proces do največje možne mere avtomatiziran.43 [Expression machine translation (MT – Machine Translation) usually denotes computer systems for translating natural languages, where the translation process is as much as possible automated.]

As the examples above illustrate, the term-based approach can successfully complement the pattern-based approach.

Effects of parameter settings
If the term-based approach is to be used on its own, it is important to use additional conditions, complementing the main condition that a sentence should contain at least two terms, in order to achieve higher precision. However, each additional condition limits the recall, and below we discuss the effects and characteristics of the different constraints/parameter settings. (Note that if the term-based approach is used in combination with other methods (intersection), settings with better recall should be considered, as will be discussed in Section 5.3.1.) This section thus illustrates the effects of different parameter settings, in line with the seven hypotheses postulated at the beginning of this section, while the quantitative results are provided and interpreted in Appendix A. (Note again that the different parameters correspond to the columns of Table 11, Table 23 and Table 24.)

Ad Hypothesis 1. The term-extraction threshold parameter influences precision and recall. For instance, when the parameter is set to the top ten percent of the automatically extracted domain terms, out-of-domain terms are often included. Even a termhood threshold set at 1% does not completely solve the problem, since term extraction methods do not perform without mistakes. The most common examples are those starting with tabela [table] or slika [figure], which are highly ranked in the extracted term list. Likewise, razlika [difference] and opis [description], included in the top 10%, are frequently used in articles but are not domain terms. E.g., Example (lxxx) includes two highly ranked words (description and chapter) which are actually not domain terms (they are specific to scientific writing but not to the language technologies domain). lxxx. Kratek opis programa prinaša poglavje IV-3.1.4. [A short description of the program is given in Chapter IV-3.1.4.]

Ad Hypothesis 2. The higher the number of domain terms, the higher the precision. However, we should note that in a number of sentences the segmentation was faulty, so that a wrongly separated title increased the number of terms in the sentence. While the sentences with wrong segmentation are all borderline cases

43 Note that the originally extracted sentence contains the title STROJNO PREVAJANJE NEKOČ IN DANES [MACHINE TRANSLATION IN THE PAST AND TODAY], due to wrong sentence segmentation. Wrong segmentation is addressed in Section 5.3.4, while for simplicity we omit the wrongly segmented title parts of sentences from some of the examples in the text.


(because of the wrong segmentation, and therefore requiring manual refinement), a side effect of this error is the increased number of terms in the sentence following the title; this feature may therefore even be considered beneficial for definition extraction.

lxxxi. [4.1.2 Kalkiranje Kalkiranje je terminotvorni postopek, kjer slovensko ustreznico tvorimo neposredno po tujejezični predlogi, ali drugače rečeno, kadar prevzeti izraz dobesedno, včasih celo po posameznih morfemih, prevedemo (npr. internet v med-mrežje, viktimologija v žrtv-o-slovje, escape character v ubežni znak).] [4.1.2. Calquing Calquing is a term-formation process, in which a Slovene equivalent is formed directly from the foreign word, or said differently, when we translate the borrowed term literally, sometimes even morpheme by morpheme (e.g., internet to med-mrežje, viktimologija to žrtv-o-slovje, escape character to ubežni znak).]

Ad Hypothesis 3. The constraint of a verb occurring between two domain terms results in higher precision, but the experiments show that in the majority of cases where exactly the same settings are used and only the verb condition is added, the recall remains the same (cf. Appendix A). The only examples where the recall is lower occur when the most basic setting (two domain terms) is used and the verb condition is added. For example, sentence (lxxxii) has a structure where the copula verb precedes both the definiendum and the definiens. These types of sentences cannot be extracted with the 'verb constraint' setting.

lxxxii. Na splošno je akustična segmentacija členitev zvočnega niza na homogene odseke po nekem vnaprej določenem pravilu. [In general, acoustic segmentation is the segmentation of an acoustic sequence into homogenous parts based on a predetermined rule.]44

Ad Hypothesis 4. The condition of having a term at the beginning of a sentence generally increases the precision. However, certain types of definitions are excluded because of their introductory part that defines the source or the scope of the definition. See Example (lxxxiii) below.

lxxxiii. Kot navaja Britanika,45 je korpus definiran kot zbirka besedil določene teme, ki se uporablja za lingvistično analizo. [As cited in Britannica, a corpus is defined as a collection of texts for a specific domain, used for linguistic analysis.]

Ad Hypotheses 5 and 6. The condition that the first occurring domain term is a multi-word term, and the parameter selecting a higher number of multi-word expressions, influence the recall by eliminating sentences like Example (lxxxiv). In this sentence, when the threshold is set at 1%, the considered terms are lema, osnovna oblika, beseda and oblika, meaning that there is only one multi-word expression (which is not the first

44 The English translation does not reflect the Slovene structure where the verb precedes both terms. Slovene structure has the copula verb is at the following place: [In general, IS acoustic segmentation the segmentation of acoustic sequence into homogenous parts based on a predetermined rule.] 45 In the original corpus, a footnote number stands after Britannica.


term); therefore, if the parameter for the number of multi-word terms in a sentence is set higher than 1, the sentence will not be extracted.

lxxxiv. Lema je kanonična, osnovna oblika besede (npr. lema za besedo „hitrega“ je „hiter“, za „pogledali“„pogledati“ itd.). [A lemma is the canonical, elementary form of a word (e.g., lemma for the word “quick” is “quick”,46 for “looked” is “look”, etc.)].

Ad Hypothesis 7. The nominative condition is generally highly correlated with better precision. However, there are definition sentences without nominative terms that we miss if the nominative condition is applied, and a sentence is also not extracted if it has only one nominative term but the setting requires two. A type of sentence that can be extracted with the condition of a multi-word term at the beginning of a sentence (cf. Hypotheses 4 and 5), but is excluded if the number of nominatives is set to 2, is Example (lxxxv), where the first term is in the accusative case. lxxxv. Na diskurzne označevalce med prvimi na kratko opozori (Kranjc, 1999: 65), in sicer navaja, da diskurzni označevalci, npr. veš, ja, aha, »/o/pravljajo vlogo sredstva preverjanja pozornosti, hkrati pa so tudi sredstvo označevanja oziroma kazanja različnih vrst udeleževanja in pritrjevanja«. [Discourse markers were initially noted by (Kranjc, 1999: 65), who states that discourse markers, such as e.g., you know, yes, aha, »/h/ave the function of checking attention and at the same time are the means of annotation or showing of different kinds of engagement or agreement«.]

Inspecting borderline cases (mostly knowledge-rich context sentences)

Since the conditions of this approach, regardless of the selected parameter settings, are still very loose, there are many definition candidates that cannot be accepted as definitions. However, in many cases we can still talk of knowledge-rich contexts, often resulting in categories of borderline definitions or non-definitions (as discussed in more detail in Section 5.3.4). These borderline knowledge-rich context (KRC) sentences can be attributed to two main reasons: either we have a domain term that the candidate sentence does not properly define but still provides useful information about (the definition candidate is too general or too specific, only a hypernym is provided, wrong segmentation, etc.), or the definiendum is not (fully) considered a domain term. It is worth considering the different types of extracted borderline definitions or non-definitions, as well as the reasons for extracting non-definition sentences. The decision about the definition/non-definition label is often not easy, and many cases are complex and depend on the evaluator's choice (as will be further discussed in Section 5.3.3).

Several definition candidates were annotated as non-definitions because the hypernym is too general and the sentence does not define the term (an example is given below).

lxxxvi. Naravni jezik je kompleksna in živa tvorba in ustrezna pravila za opisovanje so temu primerno zapletena, če jih je sploh mogoče vsa zapisati.

46 hitrega [quick] is in Slovene in the genitive case, therefore the English translation does not illustrate the difference between the base and inflected form.


[Natural language is a complex and living construct, and the rules adequate for describing it are correspondingly complicated, if it is at all possible to write them all down.]

The sentence can also be too specific. For instance, Example (lxxxvii) can be understood as a knowledge-rich context, since it provides information about lemmatization, but it is not a definition, since lemmatization cannot be limited to two specific transformations, such as deleting the information about declension and plural.

lxxxvii. Lematizacija pa je proces, kjer odstranimo besedam sklanjatev in množino. [Lemmatization is the process where we remove the declension and the plural from words.]

If a sentence contains a meaningful hypernymy or meronymy relation, it provides a knowledge-rich context, but may be insufficient to act as a definition. Such a borderline KRC sentence with an expressed ‘part-of’ relation is Example (lxxxviii). Sentences containing only a hypernym (the so-called exclusively genus or classificatory definitions) or sentences where the differentia part is too imprecise are thus in our opinion KRC, borderline cases that we mainly tagged as non-definitions (see Example (lxxxix)).

lxxxviii. Luščenje leksikonov je sestavni del pri metodah statističnega strojnega prevajanja. [Lexicon extraction is an integral part of statistical machine translation methods.]

lxxxix. Iskanje informacij, besedil oz. dokumentov (angl. »information/text/document retrieval«) je najenostavnejša in najbolj pogosto uporabljena oblika tekstovnega rudarjenja. [Information/text/document retrieval (Eng. »information/text/document retrieval«) is the simplest and the most frequently used form of text mining.]

A knowledge-rich context is also a sentence in which a term is illustrated by enumerating several instances of a concept, but which is insufficient to be considered an extensional definition. For this type of borderline case, see Example (xc). Moreover, as the verb mrgoleti [be rife with] is expressive and typical of figurative language, it is not appropriate for forming definitions, as mentioned in Section 2.1.6.

xc. Na svetovnem spletu kar mrgoli tovrstnih virov, ki jih lahko najdemo s pomočjo splošnih iskalnih orodij, kakršni so Google, Altavista, Najdi.si itd. [The web is rife with such resources, which one can find with the help of general search tools, such as Google, Altavista, Najdi.si, etc.]

Another borderline case, with wrong segmentation, providing a knowledge-rich context but not considered a definition, is Example (xci). It contains information that could, after manual refinement, be reformulated into a functional definition.

xci. Računalniško podprto prevajanje in strojno prevajanje Obstaja kar nekaj prevajalskih orodij, ki skušajo vsaj deloma, če ne (skoraj) popolnoma avtomatizirati prevajalski proces. [Computer-assisted translation and machine translation There are several translation tools that try to, at least partly if not (nearly) fully, automate the translation process.]


Another borderline case, which was evaluated as a definition but merits discussion, is Example (xcii). In this example two functions (WordList and KeyWords) of the corpus linguistics software WordSmith Tools are defined by means of functional definitions. The definition is a borderline case for several reasons. First, the definienda are named entities, special tools within WordSmith Tools, and whether such named entities should be treated as terms is in itself a relevant question; we decided that, since they are important for the domain we model, they should be considered domain terms to be defined in the glossary. Secondly, the sentence would have to be manually refined to become a real glossary entry, since it should be split into two definitions and the relative clause after the first noun phrase should be erased. This example shows the real difficulty of automatic definition extraction from unstructured texts: for limited domain corpora we cannot opt to extract only ‘perfect’, well-formed definitions.

xcii. Orodje WordSmith Tools, ki je podrobneje predstavljeno v 5.2, s funkcijo WordList izdela seznam pojavnic v posameznem korpusu, funkcija KeyWords47 pa omogoča izbor ključnih besed, ki to besedilo ločijo od referenčnega korpusa. [The WordSmith Tools, presented in more detail in 5.2, can with the function WordList construct the list of tokens in a specific corpus, while the function KeyWords enables the selection of keywords that differentiate this text from the reference corpus.]

In contrast, Example (xciii) is a borderline case that defines one of the two definienda but was not evaluated as a definition. The reason for evaluating it as a borderline non-definition is that, in order to define the specific Wordlist tool, the context of WordSmith Tools should have been specified. However, it is still a knowledge-rich context sentence.

xciii. Orodje Wordlist nam omogoča vertikalni vpogled v korpus, se pravi vpogled v besedni inventar izbranih besedil. [The Wordlist tool enables a vertical insight into the corpus, which means an insight into the word inventory of selected texts.]

To conclude, the term-based approach strongly depends on the term extraction system, which does not perform perfectly. If the threshold parameter is non-restrictive, set e.g., at 10%, the list contains many terms that are not real domain terms. But even if it is set at 1%, it may include terms that are not specific to the language technologies domain (e.g., terms characteristic of scientific texts, such as chapter, figure, table, etc.). On the other hand, even at 10% the system does not include some important domain terms (e.g., stemming). Another deficiency is that with less restrictive settings, which do not consider many of the syntactic characteristics of definition structures, the term-based approach is quite imprecise and accepts a large number of candidates. However, this deficiency can at the same time be considered an advantage, as the approach may find definitions that could not have been extracted by a more restrictive pattern-based approach, and manual filtering is fast and easy.

In summary, when developing the term-based definition extraction approach we performed a large set of experiments with different parameter settings (presented in Appendix A). By evaluating these settings, we showed that a higher number of terms, a higher number of multi-word terms, the nominative condition, the verb condition, the condition of a term appearing at the beginning of a sentence and the condition of the first term being a multi-word expression lead to higher precision in the majority of cases. Next, we identified some advantages of the term-based approach and illustrated the influence of different parameter settings on several definition candidates, examining also which types of sentences are excluded when applying different constraints. In the last part we discussed different types of borderline cases. The parameter settings should be tuned differently depending on the task: more restrictive if we want to extract mainly definitions, even at the cost of missing many, or less restrictive if the approach is used in combination with other methods, mainly for eliminating out-of-domain sentences. In further work, we may also consider manual filtering of terms after the term extraction step, in order to improve term-based definition extraction.

47 In the original corpus, a footnote number stands after KeyWords.
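To make the two tuning regimes discussed above concrete, the following sketch shows how they might be expressed as parameter presets. This is purely illustrative: the parameter names are hypothetical and do not correspond to the actual widget implementation described in Chapter 6.

```python
# Illustrative presets only; all parameter names are hypothetical.

RESTRICTIVE = {                  # favour precision: stand-alone glossary building
    "term_threshold": 0.01,      # keep only the top 1% of the ranked term list
    "min_terms": 4,              # at least four domain terms in the sentence
    "min_multiword_terms": 2,    # at least two multi-word terms
    "verb_between_terms": True,  # a verb must occur between two domain terms
    "term_at_beginning": True,   # a domain term must open the sentence
    "first_term_multiword": True,
    "min_nominative_terms": 2,   # Slovene only: two terms in the nominative case
}

LOOSE = {                        # favour recall: domain filter for other methods
    "term_threshold": 0.10,      # keep the top 10% of the ranked term list
    "min_terms": 2,              # only the basic two-domain-terms condition
    "min_multiword_terms": 0,
    "verb_between_terms": False,
    "term_at_beginning": False,
    "first_term_multiword": False,
    "min_nominative_terms": 0,
}
```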

5.1.3 SloWNet-based definition extraction

This approach exploits the per genus et differentiam characteristic of definitions and therefore searches for sentences in which a term occurs together with its hypernym. For this task the sloWNet lexical database (Fišer and Sagot, 2008), presented in Section 4.3.3, was used. We aim to extract sentences that contain a sloWNet term together with its direct hypernym. The problem is that sloWNet suffers from low coverage of terms specific to the language technologies domain.

In addition to the main condition that a sentence should contain a pair of sloWNet terms, where one term is a direct hypernym of the other, there is a further condition that there should be at least one word between these two terms. This additional condition prevents the extraction of sentences merely because of embedded terms; for illustration, consider an example from the English WordNet (Fellbaum, 1998): the two-word term computer system already contains the word system, which is a hypernym of computer system, so any sentence containing computer system would have been extracted if the extra condition were not applied. On the other hand, we apply a maximum window condition: we consider a window of maximum size seven, meaning that a term pair is considered relevant only if there are at most five other words between the two terms.
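The core co-occurrence check can be sketched as follows. The sketch assumes that the hypernymy pairs have already been read from sloWNet into a set of (term, direct hypernym) lemma pairs and, for simplicity, treats terms as single lemmas; it illustrates the conditions described above and is not the actual implementation (which is presented as a workflow widget in Chapter 6).

```python
# A minimal sketch of the sloWNet-based candidate check.
# `hypernym_pairs` is assumed to be a set of (term, direct_hypernym) lemma
# pairs taken from sloWNet; multi-word terms are ignored for simplicity.

def is_candidate(lemmas, hypernym_pairs, min_gap=1, max_gap=5):
    """Return True if the lemmatised sentence contains a term and its direct
    hypernym with at least `min_gap` and at most `max_gap` words in between
    (i.e., a window of at most seven tokens)."""
    positions = {}
    for i, lemma in enumerate(lemmas):
        positions.setdefault(lemma, []).append(i)
    for term, hyper in hypernym_pairs:
        for i in positions.get(term, []):
            for j in positions.get(hyper, []):
                gap = abs(j - i) - 1          # number of words between the two
                if min_gap <= gap <= max_gap:
                    return True
    return False

# Example: the pair ("korpus", "zbirka") would fire on a sentence whose
# lemmatised form contains both lemmas two to six tokens apart.
```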

# Extracted sentences | Precision (# of definitions in extracted sentences) | Recall estimate (# of def. in 150 recall test set)
4,670 | 0.057 (270) | 0.2533 (38)

Table 12. sloWNet-based definition extraction results.

One of the disadvantages of this method is that it also extracts sentences that are completely out of domain (this can be avoided by using the method in combination with the term-based approach rather than as an individual method, see Section 5.3.1). For example, in sentence (xciv) dividenda [dividend] and dobiček [earning] are in a direct hypernymy relation, but are not relevant to the language technologies domain.

xciv. Dividenda je v SSKJ definirana kot 'del dobička delniške družbe, ki ga dobi delničar na posamezno delnico'; v sami definiciji se nam tako pojavijo leksikalni elementi pojmovnega polja, ki sodijo skupaj: dividenda, dobiček, delniška družba, delničar, delnica. [In SSKJ48 a dividend is defined as ‘part of company’s earnings that a shareholder gets for his share’; /.../]

48 The general dictionary of the Slovene language.


The extraction of out-of-domain sentences is partly due to the nature of the corpus itself. Since several articles and one doctoral thesis in our corpus deal with the construction of the Slovene wordnet, hypernymy pairs are often given as examples in these texts. For instance, daddy and grandpa are hypernyms irrelevant for our domain, as shown in Example (xcv) below. This could in the future be filtered out by a more detailed text preprocessing phase (which was for now applied only to the small subcorpus), taking only the body text and not the examples in the text.

xcv. Zato smo v slovensko besedno mrežo leksem ata vključili dvakrat, enkrat v sinset {ata1, atek, ati, oče, očka, tata}, drugič pa v sinset {ata2, ded, dedek, stari ata, stari oče}. [This is why we included in the Slovene wordnet the lexeme ata49 twice, once in the synset {ata149, father, daddy /.../} and the other time in the synset {ata249, grandfather, grandpa /.../}]

Next, there are sloWNet hypernyms that cause the extraction of many non-definitions. For example, the hypernymy relation between figure and example, illustrated in Example (xcvi), or the pair "zaključek–poglavje" [conclusion–chapter] in Example (xcvii) are very frequent cases (but could be avoided if a list of terms to be defined were selected in advance).

xcvi. Slika 10: Primer vnosa v MultiTerm z vsemi podatki. [Figure 10: Example of MultiTerm input with all the data.]

xcvii. Zaključek bomo podali v petem poglavju. [The conclusions are presented in the fifth chapter.]

Another problem is irrelevant hypernymy pairs. Even though the Slovene wordnet follows the English WordNet structure and is linked to it, it was semi-automatically constructed and contains several irrelevant hypernymy pairs. For instance, in Example (xcviii) the sloWNet pair responsible for the extraction of the sentence is "govor–rezultat", which translates as "speech–result". The Slovene sloWNet IDs for this pair are 07081177 (govor) and 07069948 (rezultat), where the first ID corresponds to the English WordNet concept "parlance" or "idiom" with the definition "a manner of speaking that is natural to native speakers of a language". The second (hypernym) term has three synonyms in Slovene, one of them being rezultat [result], but the corresponding English synonym literals are formulation or expression, with the definition "the style of expressing yourself". From this example we can see that the pair on the basis of which the sentence was extracted from our corpus, "govor–rezultat" ["speech–result"], is not a relevant hypernymy pair. This type of error could be avoided by working with a manually validated subset of sloWNet.

xcviii. Članek sklenemo z rezultati preskusa razumljivosti in naravnosti sintetiziranega govora. [We conclude the article by presenting the results of testing the understandability and naturalness of synthesized speech.]

In general, when analyzing the sloWNet pairs on the basis of which the definition candidates were extracted, we observed that only in a few cases does the wordnet pair correspond to the definiendum and its hypernym. Much more frequent are cases such as Example (xcix) below, itself a borderline case, where the sentence was extracted because of the wordnet pair "language–text", which is not related to the definiendum in the sentence (transfer system).

49 Ata is not translated into English for illustration purposes.

xcix. Transferni sistemi (ang. transfer systems): Čeprav so vsi prevajalniki na nek način transferni, se poimenovanje uporablja za jezikovno odvisne sisteme, pri katerih je rezultat analize abstraktna predstavitev (govorjenega) besedila v vhodnem jeziku, vnos za sintezo besedila pa je abstraktna predstavitev besedila v ciljnem jeziku. [Transfer systems (Eng. transfer systems): Even if all translation systems are in a way transfer systems, the naming is used for language-dependent systems, where the result of the analysis is an abstract representation of (spoken) text in a source language, while the input for speech synthesis is the abstract text representation in the target language.]

The same goes for the sentence below, in which translation memory is defined, but which was extracted on the basis of a very general pair ("part–unit"). As already mentioned, sloWNet suffers from low coverage of terms specific to our domain.

c. Natančneje pa pomnilnik prevodov opiše Špela Vintar (Vintar 1998): »Pomnilnik prevodov je podatkovna zbirka prevodnih enot, navadno povedi ali krajših delov besedila, ki so v izvirniku in prevodu shranjeni v pomnilnik in so ob morebitni ponovitvi enakega ali zelo podobnega dela besedila na razpolago za ponovno uporabo. [Translation memory is described in more detail by Špela Vintar (Vintar 1998): “Translation memory is a database of translation units, usually sentences or shorter text parts, that are saved as a source text and translation in the memory, and if the same or similar parts of text occur, they are available for reuse.]

Even true hypernyms from the language technologies domain (the pair "strojno prevajanje–umetna inteligenca" ["machine translation–artificial intelligence"] in Example (ci)) give no guarantee that the sentence is a definition.

ci. Čeprav imajo semantične mreže dolgo zgodovino v filozofiji, sociologiji in jezikoslovju, so danes priljubljene predvsem v umetni inteligenci in za strojno prevajanje. [Even though semantic networks have a long history in philosophy, sociology and linguistics, today they are popular especially in artificial intelligence and for machine translation.]

As soon as we switch to a more general domain, from computational linguistics to linguistics, we extract more relevant hypernymy pairs. Sentence (cii) can be considered a borderline case of an extensional definition (since it is embedded in a running-text sentence), in which an enumeration of examples of punctuation marks is provided between parentheses. The hypernymy pairs for this example are "period–punctuation_mark" and "semicolon–punctuation_mark".

cii. V zapisih besedil jih predstavljajo ločila (pika, vejica, podpičje, klicaj, vprašaj). [In written texts, they are represented by punctuation marks (period, comma, semicolon, exclamation mark, question mark).]

The next definition was extracted because of the hypernymy pair "thesaurus–dictionary". We were interested in why the first word, corpus, was not matched and found out that the only sloWNet synset containing the word corpus is in the military domain. The definition is also a borderline definition, because it does not contain any other information about what a corpus is, except that it is an "electronic collection of texts".

ciii. Korpus je računalniška zbirka besedil in je izrednega pomena za izdelavo jezikovnih orodij bodisi za slovarje, slovnice, črkovalnike, tezavre, terminološke banke bodisi za pomnilnike prevodov. [A corpus is an electronic collection of texts and is extremely important for the construction of linguistic tools, either for dictionaries, grammars, spell-checkers, thesauri, terminological bases, or for translation memories.]

In the next example, the hypernymy pair responsible for extracting the definition was "homonym–word".

civ. Homonimi ali enakozvočnice so besede, ki sicer imajo enako glasovno podobo in več pomenov, med njimi pa ni videti kake metonimične ali metaforične povezanosti, niti si v danem trenutku ne moremo misliti, da bi bila taka zveza kdaj obstajala. [Homonyms50 are words that have the same sound representation but different meanings, between which there is no obvious metonymic or metaphorical relation and one cannot imagine that such a relation ever existed.]

To get an insight into the domain coverage, we analyzed the first 20 terms that were automatically extracted by the term-extraction method presented in Section 4.3.2: korpus [corpus], diskurzni označevalec [discourse marker], govorni signal [speech signal], strojno prevajanje [machine translation], slovenski jezik [Slovene language], jezikovni vir [language resource], jezik [language], besedilo [text], beseda [word], spletna stran [web page], besedna vrsta [part-of-speech], naravni jezik [natural language], govorna zbirka [speech database], pomnilnik prevodov [translation memory], besedna zveza [phrase], jezikovna tehnologija [language technology], razpoznavanje govora [speech recognition], referenčni korpus [reference corpus], prevajanje govora [speech translation], vzporedni korpus [parallel corpus]. Out of these domain terms only 8 are covered by sloWNet. One term, korpus [corpus], exists in sloWNet but only in the military domain, as shown in Example (ciii), so we do not count it as covered, while one of the eight covered terms is web page, which does not belong entirely to the language technologies domain.

In conclusion, we observe that sloWNet-based definition extraction has very low precision. We showed that the extracted term pairs in a hypernymy relation are often not the ones we expected. One of the main deficiencies of the sloWNet-based approach is its very low domain coverage, and we can expect the method to provide better results if applied to more general domains. On the other hand, even if sloWNet's main goal is to cover the general domain, we could first extend sloWNet with domain-specific terms (cf. Vintar and Fišer, 2013) or at least experiment with a small proof-of-concept handmade ontology. In the future, sloWNet will also have better coverage, since new terms (e.g., domain-specific terms, multi-word terms) are continuously being added, as shown in Vintar and Fišer (2009 and 2013); already today we could experiment with more recent versions of sloWNet. Moreover, wordnet-based definition extraction may yield better results in combination with a pattern-based approach or with syntactic analysis that could determine the position of the terms between which the relation should occur.

50 In Slovene, two expressions are provided ("homonimi ali enakozvočnice", literally "homonyms or homonyms"), the first term (homonim) being a foreignism and the second (enakozvočnice) the native Slovene term.


5.2 Extracting definitions from English texts

We demonstrate the potential of adapting the proposed methodology, initially developed for Slovene, which is the main focus of the thesis, to other languages. Therefore we also present methods for extracting definitions from English text corpora and analyze different types of definition sentences on a subcorpus of texts from the language technologies domain.

5.2.1 Pattern-based definition extraction

The pattern-based approach is the only one of the three methods that is language dependent. Based on the Slovene patterns, we manually created a list of patterns for English by using adequate regular expressions. We compare two settings: in the first, an extra condition requires the pattern (usually starting with an NP) to occur at the beginning of a sentence (see Table 13), while the second is without this condition (Table 14). The beginning-of-sentence criterion is used in order to increase the precision, since in English we cannot use the nominative case condition that serves the same purpose in Slovene. Moreover, in Table 15 we also evaluate a setting in which some typical variations of sentence beginnings preceding an NP are allowed (such as According to .*, In .*, As .*); these variations were defined in order to improve the recall compared to the setting of Table 13. Except for the sentence beginning, which varies across the three tables, the tables explore the same seven pattern types; for easier reading, in Table 15 we also give a more intuitive rendering of each pattern type, valid for all three tables.

At this stage, we have not used any chunker or parser, since we keep the method as similar to the Slovene pattern-based definition extraction as possible and demonstrate that it could be used also for other languages with no readily available sophisticated NLP tools. However, this will be revised in further work. In developing the pattern-based approach, we defined different noun phrase structures with regular expressions, varying from a simple noun (e.g., corpus) to compound noun phrases, such as machine translation or computational linguistics, and even longer noun phrases (e.g., Association for Computational Linguistics) composed of noun+preposition+adjective+noun. In all the patterns, the noun phrase can start with a determiner a/an/the or occur without it. We add the possibility that an alternative designation is used with another noun phrase introduced by or (NP or NP), and the same is true for adjectives within the noun phrases (e.g., ADJ or ADJ N). In the tables below, NP thus denotes all these varieties and ADV is used for adverb.

For easier reading of the simplified patterns provided in the tables below, note that "/" denotes alternatives (e.g., is/are matches is or are in the sentence, but not both) and "?" denotes zero or one occurrence of the preceding element. Two other symbols should be specified: "^" denotes the beginning of a sentence, while ".*" denotes any number and type of elements. Words in single quotes (e.g., ‘is’) denote inflected word forms, while words without quotes are word lemmas covering all the possible inflected forms of the word (e.g., refer covers refers, refer, referred, etc.).

One can see that with the patterns anchored at the beginning of the sentence, much higher precision can be achieved, i.e., 0.3292 in Table 13 compared to 0.1196 in Table 14; however, one third fewer definitions are extracted than in the setting without the beginning condition (273 definitions are extracted with all the patterns as shown in Table 14 compared to 185 in Table 13; the estimated recall results calculated against the 150 definition test set are 0.3733 and 0.2533, respectively).
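As an illustration of this notation, the first pattern with the beginning-of-sentence condition could be approximated in Python roughly as follows. The NP expression here is deliberately reduced (an optional determiner followed by up to four words) compared to the fuller noun phrase structures described above, so this is a sketch of the idea rather than the pattern set actually used.

```python
import re

# Heavily simplified NP: optional determiner followed by one to four words.
# The real patterns described above also cover "NP or NP", adjectival
# alternatives and longer prepositional constructions.
NP = r"(?:(?:a|an|the)\s+)?(?:\w+\s+){0,3}\w+"

# Pattern 1 with the beginning-of-sentence condition: ^NP 'is/are' NP
PATTERN_1 = re.compile(rf"^\s*{NP}\s+(?:is|are)\s+{NP}", re.IGNORECASE)

sentence = "A syntactic parser is a tool that gives the structural composition of a sentence."
if PATTERN_1.match(sentence):
    print("definition candidate")   # the sentence matches ^NP 'is/are' NP
```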


Compared to the setting in which a noun phrase occurs at the beginning of a sentence (Table 13), the recall can be slightly improved (at the cost of precision) if, in addition to an NP at the beginning of a sentence, other variations of the sentence beginning are accepted, as presented in Table 15. In this setting 200 definitions are extracted (precision is 0.2849 and recall 0.2733).

Pattern (beginning) | # Extract. sent. | Precision (# of def. in extr. sent.) | Recall estimate (# of def. in 150 recall test set)
1. ^NP ‘is/are’ NP | 480 | 0.3312 (159) | 0.2067 (31)
2. ^NP ‘is/are’? ADV? refer to (as)? NP | 14 | 0.4286 (6) | 0.0067 (1)
3. ^NP ‘is/are’? ADV? define ((in.*/by.*)?as)? NP | 29 | 0.3103 (9) | 0.0333 (5)
4. ^NP mean .* NP | 14 | 0.2857 (4) | 0.0067 (1)
5. ^the ‘concept/word/term/naming/expression’ NP .* use/describe/denote .* NP | 4 | 0.25 (1) | 0 (0)
6. ^the role of NP is to/NP | 2 | 0.5 (1) | 0 (0)
7. ^NP .* ‘known’ .* as .* NP | 19 | 0.2631 (5) | 0 (0)
TOTAL (beginning) | 562 | 0.3292 (185) | 0.2533 (38)

Table 13. Pattern-based definition extraction for English with the beginning of the sentence condition. Precision is evaluated on all the extracted sentences, while the 150 definition test set is used for evaluating the recall.

Pattern (no beginning) | # Extract. sent. | Precision (# of def. in extr. sent.) | Recall estimate (# of def. in 150 recall test set)
1. NP ‘is/are’ NP | 2,004 | 0.1098 (220) | 0.3 (45)
2. NP ‘is/are’? ADV? refer to (as)? NP | 65 | 0.2461 (16) | 0.0133 (2)
3. NP ‘is/are’? ADV? define ((in.*/by.*)?as)? NP | 102 | 0.147 (15) | 0.04 (6)
4. NP mean .* NP | 69 | 0.0869 (6) | 0.0067 (1)
5. the ‘concept/word/term/naming/expression’ NP .* use/describe/denote .* NP | 25 | 0.28 (7) | 0.0067 (1)
6. the role of NP is to/NP | 2 | 0.5 (1) | 0 (0)
7. NP .* ‘known’ .* as .* NP | 45 | 0.2667 (12) | 0.0067 (1)
TOTAL (no beginning) | 2,283 | 0.1196 (273) | 0.3733 (56)

Table 14. Pattern-based definition extraction for English without the beginning of the sentence condition.

As already mentioned, in Table 14 the patterns can occur anywhere in a sentence, Table 13 covers only the patterns occurring at the beginning of a sentence, while in Table 15 we tested other sentence beginnings, specified in variations b), c) and d) of the first five patterns. The examples below illustrate these variations. The sentence in Example (cv) begins with a simple noun phrase (parallel corpora) and corresponds to the pattern NP are NP. The second one (cvi) provides the reference of the definition (According to ISO 9126 software standards) before the definiendum usability. In the same way, the third example (cvii) defines the scope (In corpus linguistics). Example (cviii) uses the beginning "As .*", which is the least productive, extracting hardly any correct definitions from our corpus; it should, however, be retested on another corpus in order to prove its (non-)usefulness.


cv. Parallel corpora are texts and their translations, human translations, we should add, or at least translations supervised and post-edited by translators who understood both the source and the target text.

cvi. According to ISO 9126 software standards ([EAGLES96]) usability is a quality characteristic that is composed of three subcategories: • understandability • learnability • operability /.../

cvii. In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition, as well as its context—i.e. relationship with adjacent and related words in a phrase, sentence, or paragraph.

cviii. As said previously, morphological analysis is the process of deducing the base form (lemma) from the inflectional forms of the words (word-forms), while morphological synthesis is the process of producing the inflectional forms given the base form.

The last example is a borderline definition, since morphological analysis is a broader concept than simply "deducing the base form from the inflectional forms of the words", which is the process known as lemmatization. We nevertheless tagged the sentence as a (borderline) definition: within our computational linguistics domain the definition remains valid, but in linguistics morphological analysis denotes a broader concept than just lemmatization, referring to the identification, analysis and description of the structure of a language's morphemes and other linguistic units. The definition in our corpus thus defines the concept from the computational linguistics perspective, while from the broader linguistics perspective it would be considered too specific.

Next, we explain the seven pattern types and provide several examples of extracted definitions for each pattern, as well as some incorrectly extracted definition candidates. The majority of English definition sentences in our corpus are extracted by the first pattern, while the other patterns still contribute to better recall. The first pattern uses the structure "NP is/are NP", meaning that the copula verb can be in the third person singular or plural. The second pattern uses the lemma of the verb refer to, where the passive construction is/are referred to as can also be used. The third pattern is based on the verb define (also allowing the passive construction) and the fourth pattern uses the verb mean. In the fifth pattern the lemmas use, describe and denote are used, but the definiendum is preceded by an introducing expression such as the term, the word, the expression, etc. The sixth pattern describes the role of the defined term (cf. functional definitions) and the last pattern, "NP known as NP", in the majority of cases provides the definiendum's synonym.


Pattern (variated beginning) | # Extract. sent. | Precision (# of def. in extr. sent.) | Est. Recall (# of def. in 150 recall test set)
1. NP is/are NP
a) ^NP ‘is/are’ NP | 480 | 0.3312 (159) | 0.2067 (31)
b) ^’A/according’ to .* NP is/are NP | 3 | 0.3333 (1) | 0 (0)
c) ^in .* NP ‘is/are’ NP | 88 | 0.0795 (7) | 0.0133 (2)
d) ^as .* NP ‘is/are’ NP | 28 | 0.0357 (1) | 0 (0)
2. NP refer to NP (active/passive)
a) ^NP ‘is/are’? ADV? refer to (as)? NP | 14 | 0.4286 (6) | 0.0067 (1)
b) ^’A/according’ to .* NP ‘is/are’? ADV? refer to (as)? NP | 0 | 0 (0) | 0 (0)
c) ^in .* NP ‘is/are’? ADV? refer to (as)? NP | 4 | 0.25 (1) | 0 (0)
d) ^as .* NP ‘is/are’? ADV? refer to (as)? NP | 0 | 0 (0) | 0 (0)
3. NP define NP (active/passive)
a) ^NP ‘is/are’? ADV? define ((in.*/by.*)?as)? NP | 29 | 0.3103 (9) | 0.0333 (5)
b) ^’A/according’ to .* NP ‘is/are’? ADV? define ((in.*/by.*)?as)? NP | 0 | 0 (0) | 0 (0)
c) ^in .* NP ‘is/are’? ADV? define ((in.*/by.*)?as)? NP | 8 | 0.25 (2) | 0 (0)
d) ^as .* NP ‘is/are’? ADV? define ((in.*/by.*)?as)? NP | 1 | 0 (0) | 0 (0)
4. NP mean NP
a) ^NP mean .* NP | 14 | 0.2857 (4) | 0.0067 (1)
b) ^’A/according’ to .* NP mean .* NP | 0 | 0 (0) | 0 (0)
c) ^in .* NP mean .* NP | 6 | 0 (0) | 0 (0)
d) ^as .* NP mean .* NP | 0 | 0 (0) | 0 (0)
5. the ‘concept/word/term/naming/expression’ NP use/describe/denote (active or passive)
a) ^the ‘concept/word/term/naming/expression’ NP .* use/describe/denote NP | 4 | 0.25 (1) | 0 (0)
b) ^’A/according’ to .* the ‘concept/word/term/naming/expression’ NP .* use/describe/denote .* NP | 1 | 1 (1) | 0 (0)
c) ^in .* the ‘concept/word/term/naming/expression’ NP .* use/describe/denote .* NP | 3 | 0.6667 (2) | 0.0067 (1)
d) ^as .* the ‘concept/word/term/naming/expression’ NP .* use/describe/denote .* NP | 0 | 0 (0) | 0 (0)
6. the role of NP is to .../the role of NP is NP
a) ^the role of NP is to/NP | 2 | 0.5 (1) | 0 (0)
7. NP ... known as NP
a) NP .* ‘known’ .* as .* NP | 19 | 0.2631 (5) | 0 (0)
TOTAL (variated beginning) | 702 | 0.2849 (200) | 0.2733 (41)

Table 15. Pattern-based definition extraction for English with the beginning of the sentence condition – extended starting patterns. Precision is evaluated on all the extracted sentences, while the 150 definition test set is used for evaluating the recall.

1. The first pattern uses the structure "NP is/are NP". With this pattern we aim to extract well-formed definitions in which the first noun phrase is usually the term to be defined and the second one its hypernym. We can see that both the singular and the plural forms of the verb be are used. See the examples below.

cix. "Resource-poor" languages are languages for which few digital resources exist; and thus, languages whose computerization poses unique challenges.

cx. A text-to-speech system (TTS system) is an application that converts a written text into a speech signal.

cxi. A syntactic parser is a tool that gives the structural composition of a sentence in the form of a tree.

The first noun after the verbal pattern does not in all cases represent a hypernym. In Example (cxii) the hypernym of corpus linguistics is not sub-discipline (a very general noun) but a noun from our domain, i.e., linguistics. In terms of searching for hypernyms, this is a separate although not infrequent pattern; but since the words kind, discipline and sub-discipline are nouns, it still corresponds to the basic "NP is/are NP" pattern in terms of definition extraction.

cxii. Corpus linguistics is the sub-discipline of linguistics that deals with extracting language data from corpora and processing them for various applications such as grammars for human users and for computers, dictionaries, and lexicons.

In some cases, the definition is incorrectly separated from the title; in these cases we judge the definition as still correctly extracted, but needing minimal manual refinement (these cases are borderline definitions). In many cases this is due to preprocessing, and more such examples are extracted when we do not limit the search to sentences where the pattern occurs at the beginning of the sentence (the difference between Table 13 and Table 14). See some examples below:

cxiii. 4 Speech Recognition Speech Recognition is a pattern classification problem in which a continuously varying signal has to be mapped to a string of symbols (the phonetic transcription).

cxiv. 1.1 Natural Language Generation Natural language generation is the task of producing natural language surface forms from a machine representation of the information.

cxv. Considering Synonyms Synonyms are the words that have identical or very similar meanings but are written differently [1].

By analyzing the examples that differ between the two settings (applying the beginning condition or not), one could also extend the list of acceptable sentence beginnings provided in Table 15 (see Examples (cxvi) and (cxvii)).

cxvi. Accordingly, a topic ontology (Fortuna, 2007) is a hierarchical organization of documents' topics and their sub-topics.

cxvii. For corpus linguistics, words are symbols with two aspects, the aspect of expression and the aspect of meaning.


A less typical example is (cxviii). It is a borderline definition, since it contains two definitions, but the explanation is cyclic and thus only partially defines the concepts. Note also that the order of the definiendum and the hypernym is inverted, since vertex is a hypernym of authority and hub.

cxviii. A vertex is a good hub, if it points to many good authorities, and it is a good authority, if it is pointed to by many good hubs.

It is clear that there are also non-definitions extracted by the pattern-based approach. For instance, references to figures or tables, as in Example (cxix), or very frequently occurring sentences referring to examples or results, such as Examples (cxx) and (cxxi), represent a recurrent category of incorrectly extracted definition candidates.

cxix. Figure 6 is an example of a record editing window and illustrates the presentation of lexical information in English (terms, links between the terms and attested contexts of usage) and the list of descriptive fields for a given term.

cxx. The result is a list of place names occurring in the text with their offset and length, plus latitude and longitude, as well as information on the country they belong to and probably information about the hierarchical organisation of the country (e.g., town, province, region, country).

cxxi. Examples are Van Gogh IS-A painter, Seles IS-A tennis player.

But even when the sentences are not definitions, we often deal with knowledge-rich contexts, e.g.:

cxxii. The availability of semantic information is a crucial issue in the interpretation of texts, and therefore it is important for many tasks related with Natural Language Processing such as Information Extraction, Question Answering or Information Retrieval.

In several cases a domain term is followed by a structure corresponding to one of the defining patterns, but the hypernym is too general and the rest of the sentence does not define the term (e.g., Example (cxxiii)).

cxxiii. Machine translation is a hard problem with highly structured inputs, outputs, and relationships between the two.

A large range of sentences was incorrectly extracted because of mathematical abbreviations, which are tagged as noun phrases (see Example (cxxiv)).

cxxiv. V is a sequence of vertices hhv, vl,.

We also extract out-of-domain sentences, such as the one below.

cxxv. A dog is a kind of dog.

2. For the second pattern, using the verb refer to, a good example is for instance:

cxxvi. Document summarization refers to the task of creating document surrogates that are smaller in size but retain various characteristics of the original document, depending on the intended use.


The passive structure can also be used, as in Example (cxxvii), where the adverb usually also illustrates the usefulness of an optional adverb in the pattern.

cxxvii. The identification of words—and punctuation marks—is usually referred to as tokenisation, while determining sentence boundaries goes by the name of segmentation.

One of the reasons for extracting non-defining sentences is that a sentence complies with a pattern but is too specific to be used as a definition. See Example (cxxviii), where accuracy is defined only for a specific part-of-speech tagging application.

cxxviii. Accuracy refers to the percentage of words (i.e. word tokens) in a corpus which are correctly tagged.

Another incorrectly extracted sentence is (cxxix). The anaphora these two factors indicates that the definition spans several sentences, but at the sentence level we cannot consider it a definition. In the text preceding the extracted sentence, the defining context of the term faithfulness is given: a translation should be true to the meaning of the original.

cxxix. These two factors are often referred to as faithfulness and fluency.

3. Definitions extracted by patterns with the verb define are illustrated in Example (cxxx).

cxxx. Translation memory (TM) is defined by the Expert Advisory Group on Language Engineering Standards (EAGLES) Evaluation Working Group's document on the evaluation of natural language processing systems as " a multilingual text archive containing (segmented, aligned, parsed and classified) multilingual texts, allowing storage and retrieval of aligned multilingual text segments /.../ " .

4. Definitions extracted by patterns with the verb mean are illustrated in Examples (cxxxi) and (cxxxii).

cxxxi. Localisation means not just translation into the vernacular language, it means also adaptation to local currencies, measurements, and power supplies, and it means more subtle cultural and social adaptation.

cxxxii. Backup means taking a periodic copy of a file store.

5. The next pattern has one of the highest precision scores, since it introduces the definiendum by expressions such as the term, the expression, etc., in addition to using the defining verbs denote or describe or the more general verb use. The definitions extracted by this pattern are illustrated by Examples (cxxxiii) and (cxxxiv).

cxxxiii. In computer science the term ontology denotes a formal representation of a set of concepts of a domain and the relationships among these concepts.

cxxxiv. We use the term "domain knowledge" in the way commonly used in machine learning or inductive logic programming as knowledge that is not or cannot be expressed by learning examples themselves.


6. The next pattern, using the expression the role of, covers only one definition from our corpus, and even this example is a borderline case, since it is too specific. Given this low yield, in further developments we should test this pattern on other corpora and omit it from the pattern list if it does not prove useful.

cxxxv. The role of the language model is to provide the decoder with the possible phone sequences, along with their corresponding probabilities.

7. To conclude, we provide some definitions extracted by the last pattern. The pattern "NP .* known .* as NP" mainly extracts the definiendum's synonyms, which is not necessarily sufficient for a sentence to be classified as a definition (this also depends on the interpretation of what a definition is). However, as shown by the precision score, it is a relatively good indicator of a defining sentence. We provide two examples where, after a synonymous expression, a partitive concept definition (Example (cxxxvi)) or a functional definition (Example (cxxxvii)) is given. The last example (cxxxviii), on the other hand, shows that alternative namings alone are not enough to define a concept, illustrating why non-definitions are also extracted by this pattern.

cxxxvi. In other words, translation memory (also known as sentence memory) consists of a database that stores source and target language pairs of text segments that can be retrieved for use with present texts and texts to be translated in the future.

cxxxvii. A trie (also known as retrieval tree or prefix tree) provides a compact representation of strings with shared prefixes, which is exactly what is needed.

cxxxviii. As such, in Microsoft applications, UTF-16 is known simply as "Unicode", while UTF-8 is known as "Unicode ( UTF-8)".

We can see that the precision is quite high given the complexity of the corpus we are dealing with, but we still extract only about one quarter of the definitions; therefore other methods are examined to improve the recall.

5.2.2 Term-based definition extraction

The method has been introduced in Section 5.1.2 in the context of definition candidate extraction from Slovene corpora. The basic hypothesis is the same as in the Slovene counterpart. The basic condition for a sentence to be extracted as a definition candidate is that it contains at least two domain terms. With this very loose setting we aim to extract knowledge-rich contexts, not necessarily definitions. In combination with other methods, this simple approach can also be used as a domain filter. The first step is term extraction, where a weighting measure compares normalized relative frequencies of single words in a domain-specific corpus with those in a general reference corpus. The term-extraction methodology is described in detail in Vintar (2010) and in Section 4.3.2, while our workflow implementation of the service is presented in Section 6.4.2.

If we want to extract fewer definition candidates, but achieve better precision, additional constraints should be applied. In contrast to the Slovene term-based approach, we cannot use the morphosyntactic tag-based nominative condition, since English does not express the cases in the same way.
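As a schematic illustration of the contrastive weighting mentioned above (a generic formulation only; the exact weights used by LUIZ are those described in Section 4.3.2), the termhood of a single-word candidate $w$ can be expressed as the ratio of its normalized frequencies in the domain corpus $D$ and the reference corpus $R$:

\[
W(w) = \frac{f_D(w)\,/\,N_D}{f_R(w)\,/\,N_R},
\]

where $f_D(w)$ and $f_R(w)$ are the frequencies of $w$ in the two corpora and $N_D$ and $N_R$ their sizes; candidates are ranked by $W$ and the list is cut at the chosen threshold (e.g., the top 1% or 10%).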

107107

Therefore, for English we tested the same additional hypotheses as in the Slovene counterpart (except for the nominative condition): the influence on precision (and recall) of the verb condition, the beginning-of-the-sentence condition and the condition of the first term being a multi-word expression, as well as the influence of the main parameter settings, i.e., the threshold parameter, the number of terms and the number of multi-word terms. The results of the different experiments are provided in Table 16. The general trends are the same as in Slovene definition extraction, i.e., the higher the threshold, the number of terms and multi-word terms, as well as the number of additional constraints, the higher the precision and the lower the recall. The highest precision of term-based definition extraction is around 13% (Experiments (j) and (w)), but the number of extracted definitions is very small for these settings. If we want a considerable number of definitions, we get much lower precision, e.g., 0.0824 and 90 definitions extracted in Experiment (m).

Note that the results are much worse than for Slovene (see the results of Slovene term-based definition extraction in Table 23 and Table 24 of Appendix A). Already a brief comparison shows not only that the best results for Slovene, obtained with the nominative condition, are much better in terms of precision, reaching up to 0.2647 for the 1% threshold setting and up to 0.1381 for the 10% threshold setting, but also that even without the nominative condition the results for Slovene can reach more than 20% precision.

By comparing the different parameter settings of Table 16 we observe that if we use the additional verb, multi-word-term-first and beginning-of-the-sentence conditions, the precision continuously increases with the number of terms, and that if the verb figures between the first two terms (VF setting) the results are better than if the verb occurs between any terms (VA). The continuous growth of precision can be observed for the 1% threshold setting in Experiments (a)–(e) and (h)–(j), with the best precision result of 0.1295 (without applying the number-of-multi-word-terms parameter), however extracting only 29 definitions with an estimated recall of 0.0533 (Experiment (j)). If we want to extract more definitions, a better solution is to use the 10% threshold, where we can also see that applying the condition of a minimum of 4 domain terms in a sentence results in lower precision than the 5-terms condition (0.0755 precision for 4 terms in Experiment (g), extracting 119 definitions, compared to 90 extracted definitions with 5 terms and 0.0824 precision in Experiment (m)). Additional conditions were also verified: as already noted, the VF setting leads to better precision compared to the VA setting (e.g., pairs of experiments (b)–(c), (d)–(e), (v)–(w) and all other experimental settings). Not applying either of the two verb condition variations was tested in Experiments (h) and (x), showing that, similarly as in the evaluation of term-based definition extraction for Slovene, the verb condition generally improves the precision without significantly decreasing the recall.
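A schematic version of the candidate filter combining these conditions is sketched below. The helper structure (a list of (token, lemma, tag) triples from the tagger and a term list already cut at the chosen threshold) and all names are hypothetical; the actual implementation is the workflow widget described in Chapter 6.

```python
# Schematic term-based candidate filter; names and data layout are illustrative.
# `sentence` is a list of (token, lemma, tag) triples; `terms` is the ranked
# term list (lemmatised, space-separated for multi-word terms) cut at the
# chosen threshold (e.g., top 1% or 10%).

def is_candidate(sentence, terms, min_terms=5, min_multiword=0,
                 verb_between_first_two=True, term_at_beginning=True,
                 first_term_multiword=True):
    lemmas = [lemma for _, lemma, _ in sentence]
    hits = []                                     # (position, term) occurrences
    for pos in range(len(lemmas)):
        for term in terms:
            parts = term.split()
            if lemmas[pos:pos + len(parts)] == parts:
                hits.append((pos, term))
    if len(hits) < min_terms:
        return False
    if len([t for _, t in hits if " " in t]) < min_multiword:
        return False
    first_pos, first_term = min(hits)
    if term_at_beginning and first_pos > 1:       # allow e.g. a leading article
        return False
    if first_term_multiword and " " not in first_term:
        return False
    if verb_between_first_two:                    # VF condition
        later = sorted(p for p, _ in hits if p > first_pos)
        if later:
            span = sentence[first_pos + len(first_term.split()):later[0]]
            # assumes verb tags start with "V", as in common tagsets
            if not any(tag.startswith("V") for _, _, tag in span):
                return False
    return True
```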
The beginning-of-the-sentence condition and the condition of the first term being a multi-word expression both positively influence the precision (at the cost of decreased recall): if one of the two constraints is not applied, the precision decreases from 0.1295 in Experiment (j) to approximately 0.05 in Experiments (k) and (l), confirming the hypothesis that these two conditions are very useful if applied together. By increasing the number of requested multi-word terms the precision increases (e.g., up to 0.1392 in Experiment (w), the best result for 5 terms above the 1% threshold, 3 multi-word terms and all the additional constraints), but the number of definitions is very low. Also in all the other cases the precision is higher when the number-of-multi-word-terms condition is applied (e.g., pairs of experiments (a)–(n), (c)–(p), (e)–(t), (j)–(w), (q)–(r)) and when increasing the number of multi-word terms (cf. Experiments (q)–(r)). However, in practice we may opt for settings without the multi-word term condition, given that the increased precision comes at the cost of a decreased number of extracted definitions: e.g., in Experiment (z) only 48 definitions are extracted compared to 90 definitions extracted in Experiment (m).

There are several reasons why the results of definition extraction for English are lower than for Slovene. First, in English the nominative condition cannot be applied, while this feature significantly improved the precision of definition extraction in Slovene. Next, the background technologies used (the ToTrTaLe morphosyntactic tagger and lemmatizer, and the LUIZ term extractor) were mainly designed for Slovene and not for English. In future work other state-of-the-art preprocessing tools, such as TreeTagger (Schmid, 1994), will be considered, and different term extraction systems can be used for English (e.g., Sclano and Velardi, 2007). One of the reasons for the differences between the Slovene and English results could also be the smaller number of extracted terms in the term extraction step for English.

Even if the precision results are not satisfactory, we can still observe interesting definition structures. Different verb types for introducing definitions are used, and not only formal definitions with a term and its hypernym are extracted. This is illustrated by Examples (cxxxix) and (cxl). In Example (cxxxix) the terms translation memory and translation process are related, and even if there is no hypernymy relation, the sentence clearly defines translation memories. A more straightforward definition of the same term is sentence (cxl).

cxxxix. Translation memory helps the translation process by recognising previously translated texts: the system "keeps" sentences that have been previously translated, with their corresponding translation.

cxl. A Translation Memory (TM) is a type of computer-aided translation tool which stores previously translated texts alongside their corresponding source texts (ST) and allows translators to 'recycle' or re-use these texts, or parts of them, in new translations.

Both sentences are extracted even if several restrictions are used, since they contain multi-word terms listed in the top 1% of the terms, such as translation memory and translation process in (cxxxix) or translation memory, translation tool and source text in (cxl); there is also a high number of domain terms in both examples, a term figures at the beginning of the sentence (where the beginning is considered to be the first or the second position, thereby accepting e.g., the article a in the first position) and the first appearing term is a multi-word expression. We can see that while the second sentence uses the standard "is-a" relation, where the hypernym is introduced after the expression "is a type of", the first sentence defines the term by its purpose (functional definition). Two other examples where the term is defined by its purpose and the sentence does not contain the "is-a" pattern or its variation are listed below. Notice that (cxli) was extracted only by the 10% threshold settings, since the main term information retrieval is not included in the 1% term list.

cxli. Information retrieval is concerned with locating documents relevant to the user's information needs from a collection of documents.


Threshold | Terms (#) | Verb (VA–VF) | Beginning sent. | Multi-word first | Multi-word (#) | Extract. sent. | Precision (and # of definitions) | Recall on 150 test set (# of def.)
Beginning and first multi-word term
a) 1% | 2 | yes | yes | yes | no | 803 | 0.0772 (62) | 0.1133 (17)
b) 1% | 3 | yes-VA | yes | yes | no | 657 | 0.0852 (56) | 0.09333 (14)
c) 1% | 3 | yes-VF | yes | yes | no | 567 | 0.0935 (53) | 0.09333 (14)
d) 1% | 4 | yes-VA | yes | yes | no | 475 | 0.0947 (45) | 0.08 (12)
e) 1% | 4 | yes-VF | yes | yes | no | 365 | 0.1068 (39) | 0.0733 (11)
f) 10% | 4 | yes-VA | yes | yes | no | 1,904 | 0.0693 (132) | 0.1933 (29)
g) 10% | 4 | yes-VF | yes | yes | no | 1,576 | 0.0755 (119) | 0.1667 (25)
h) 1% | 5 | no | yes | yes | no | 336 | 0.0952 (32) | 0.06 (9)
i) 1% | 5 | yes-VA | yes | yes | no | 319 | 0.1003 (32) | 0.06 (9)
j) 1% | 5 | yes-VF | yes | yes | no | 224 | 0.1295 (29) | 0.0533 (8)
k) 1% | 5 | yes-VF | no | yes | no | 1,003 | 0.0518 (52) | 0.0867 (13)
l) 1% | 5 | yes-VF | yes | no | no | 1,461 | 0.0513 (75) | 0.16 (24)
m) 10% | 5 | yes-VF | yes | yes | no | 1,092 | 0.0824 (90) | 0.1333 (20)
Number of multi-word terms
n) 1% | 2 | yes | yes | yes | 2 | 388 | 0.0928 (36) | 0.0667 (10)
o) 1% | 3 | yes-VA | yes | yes | 2 | 349 | 0.1003 (35) | 0.06 (9)
p) 1% | 3 | yes-VF | yes | yes | 2 | 306 | 0.1078 (33) | 0.06 (9)
q) 10% | 3 | yes-VF | yes | yes | 2 | 1,480 | 0.0757 (112) | 0.1933 (29)
r) 10% | 3 | yes-VF | yes | yes | 3 | 779 | 0.0821 (64) | 0.0933 (14)
s) 1% | 4 | yes-VA | yes | yes | 2 | 278 | 0.1079 (30) | 0.0533 (8)
t) 1% | 4 | yes-VF | yes | yes | 2 | 219 | 0.1187 (26) | 0.0467 (4)
u) 10% | 4 | yes-VF | yes | yes | 2 | 1,173 | 0.0793 (93) | 0.14 (21)
v) 1% | 5 | yes-VA | yes | yes | 3 | 108 | 0.1111 (12) | 0.0267 (4)
w) 1% | 5 | yes-VF | yes | yes | 3 | 79 | 0.1392 (11) | 0.0267 (4)
x) 1% | 5 | no | yes | yes | 3 | 111 | 0.1081 (12) | 0.0267 (4)
y) 10% | 5 | yes-VA | yes | yes | 3 | 683 | 0.0791 (54) | 0.0733 (11)
z) 10% | 5 | yes-VF | yes | yes | 3 | 550 | 0.0872 (48) | 0.0733 (11)

Table 16. Term-based definition extraction on the English part of the corpus.

cxlii. In speech recognition, the objective is to predict the correct word sequence given the acoustic signals.

The term-based approach extracts other definition structures, complementary to those listed in the pattern-based approach, e.g.:

cxliii. Classifying new data according to the closest training data example is often called nearest neighbor classification.

Another advantage of this method compared to the pattern-based approach, in terms of recall, is that if the beginning of a sentence has a complex structure due to alternative denominations or additional references, the sentence is still extracted; see sentences (cxliv) and (cxlv).

cxliv. POS Tagging Part-of-speech tagging (POS tagging or POST), also called grammatical tagging, is the process of marking words in a text that correspond to a particular part of speech, based on both their definition and their context, i.e. the relationship between adjacent and related words in a phrase, sentence, or paragraph.

cxlv. Machine learning can be defined (after Witten & Frank (2002)) as the (semi)-automatic discovery of common patterns in a substantial amount of data, which results in the emergence of meaningful new information that was previously not apparent.

Named entities represent more borderline cases, since their terminological value depends on the specific application. However, if we consider the Language Technology team of the Joint Research Centre as a domain term, the definition of the aim of the team is a valid definition (cf. cxlvi). Specific parts of a program, such as the one denoted by the term Automatic Rule Refiner, were in the majority of cases not evaluated as terminological expressions, and the definition sentences in which they occur were thus marked as non-definitions (cf. cxlvii).

cxlvi. The Language Technology team of the Joint Research Centre (JRC) has the aim to produce a number of text analysis applications for ideally all official EU languages (and more) that help users to navigate in large multilingual document collections and that provide them with cross-lingual information access.

cxlvii. The Automatic Rule Refiner is the one in charge of executing linguistic rule manipulation inside the MT system.

Extensional definitions (defining a term by listing all or at least the representative examples of a definiendum category) can be written as a list in parentheses (see Example (cxlviii)) or after a colon (Example (cxlix)). These types of definitions (sometimes embedded in a longer sentence) have a higher number of domain terms.

cxlviii. Language resources (written and spoken corpora, lexicons, parsers, annotation tools, etc) are essential for the development of language technologies and for the training of students.

cxlix. A dialog system typically is composed of the following components: a speech recognizer, a natural language understanding module, a dialog manager, a natural language generation module and a speech synthesizer.


Especially with less restrictive settings, many extracted sentences are not definitions, which is not surprising since the conditions of the term-based approach are very loose. See for example the two sentences below, where terms such as evaluation and method are domain terms, but the sentences do not define a term (automatic evaluation or translation model).

cl. Automatic evaluation shows that the method works best for nouns, which is why we focus on them in the rest of this section.

cli. The translation model is based on word alignment.

A large number of extracted sentences, even if they are not definitions, represent knowledge-rich contexts:

clii. Speech synthesis starts with the linguistic analysis of the input text, where the orthographic text is converted into phonemic representation.

cliii. Translation model was trained using GIZA ++ (Och and Ney, 2003).

cliv. Statistical language models can be applied in several tasks of language technologies (Manning & Schütze, 1999), including automatic speech recognition, optical character and handwriting recognition, machine translation, correction.

clv. Language model plays an important role in statistical machine translation systems.

We also encounter non-definition sentences that provide a hypernym for the term to be defined but do not list the specific properties of the definiendum:

clvi. String kernels (§ 4.3.1.3) are a general and efficiently computable similarity measure that is smoother than edit distance.

Among the examples with expressed hypernyms, there are borderline cases closer to definitions (e.g., clvii) and cases closer to non-definitions, where the hypernym is too general (e.g., the non-informative hypernym important step in clviii).

clvii. Memory-based machine translation, as described in van den Bosch et al. (2007), is considered a form of Example-Based Machine Translation.

clviii. Part of speech tagging, often referred to as tagging, is an important step in many language technology systems.

Some examples are classified as borderline definitions because they are too specific to be very good definitions, e.g.:

clix. Symbol error rate–in our case, error rate–is based on the minimal number of edit operations (insertions, deletions and substitutions) necessary to transform the predicted string into the reference string, ignoring the reference alignment.

As in the Slovene term-based definition extraction, one of the main reasons for extracting non-definitions is that numerous terms that are too general are extracted as terminological candidates. Such expressions are, e.g., basic idea, future work or related work, figuring among the top 10% of terms. Slightly more justified, but still not specific to this domain, is the term evaluation, which is in the first 1% of terminological candidates (there are many other terminological candidates that are not good terms and figure in the top 1% of domain terms, such as result, good result (base form of best result), chapter,


figure, model, table). For example, the following is one of the 18 sentences extracted because of the term future work (with four domains and the "verb", the "multi-word first" and the "beginning of the sentence" conditions):

clx. As future work we plan one evaluation of the different methods to obtaining information, as well as a comparative between all of them, with particular attention in how the information is received and adapted.

One of the reasons for low precision can therefore be the term extraction step (cf. Section 4.4.2). If evaluated by the same person (cf. A1 columns of Table 7), LUIZ performs worse for English than for Slovene in terms of precision. As shown above, even a term that is specific to scientific writing but is not a real domain term can lead to the extraction of a considerable number of definition candidates. In further work this observation should be systematically examined, and the list of terminological candidates could be manually filtered by the user before proceeding to the term-based definition extraction step. The second reason for low precision is the morphosyntactic tagging, which results in many tagging mistakes when applied to the English corpus. In further work, we should therefore consider replacing the ToTrTaLe annotation tool with one of the standard taggers for English, e.g., TreeTagger (Schmid, 1994). On the other hand, the term-based approach generally handles morphosyntactic tagging errors better than the pattern-based approach: while the pattern-based approach is highly dependent on the morphosyntactic annotation, the term-based approach depends on its quality only in the term extraction step (whose results could also be improved by additional manual filtering) and in applying the verb condition. See for example sentence (clxi), which should have been discarded by the verb condition but remains in the set of definition candidates, since the author name Nivre was wrongly tagged as a verb. Similarly, in Example (clxii) the proper noun Brown is also tagged as a verb.

clxi. The Malt parser (Nivre et al., 2004), with a gold standard of 10,000 words.

clxii. Automatic machine translation system construction in case of corpus-based machine construction systems such as Statistical Machine Translation (SMT) (Brown et al., 1993; Och and Ney, 2003) or Example-Based Machine Translation (EBMT) (Nagao, 1984) and (Hutchins, 2005)

In conclusion, the term-based approach has lower precision than the pattern-based approach. As shown, the errors can often be attributed to the term extraction step, to morphosyntactic tagging errors, as well as to the rather loose constraints (depending on the settings). However, it provides an interesting complementary method for extracting knowledge-rich sentences. An additional advantage of this method is that, since it requires no manually crafted verb patterns, it can be used to analyze a larger variety of definitions in running text, which is a very interesting task from the linguistic perspective. If the corpus does not offer any sentence corresponding to a well-formed definition extracted by patterns, these term-based candidates are usually still valuable knowledge-rich contexts.


5.2.3 WordNet-based definition extraction

In this approach, the same conditions were used as for the sloWNet-based definition candidate extraction, meaning that the sentence should contain at least two terms identified in WordNet (PWN, 2010) as one being a direct hypernym of the other. As in the settings for the experiments on the Slovene dataset, the additional conditions are that there should be at least one word between the identified terms (in order to avoid extracting embedded term pairs) and that window size 7 is used, i.e., a maximum of 5 words may occur between two terms in order to consider them a relevant WordNet pair. Using the WordNet-based method, we extracted 3,258 candidates, which we evaluated; the results are 0.043 precision (140 definitions were correctly extracted) and 0.2667 recall, the latter being estimated on the 150 definitions test set.51 It is essential to qualitatively analyze these results and to identify the reasons for such low precision.
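The direct-hypernym condition itself is easy to approximate with NLTK's WordNet interface, as in the sketch below. The windowing is simplified to single-token term positions and the names are ours, so this is an illustration of the condition rather than the implementation used in the thesis.

    # Sketch of the WordNet pair condition: two terms in the same sentence, one a
    # direct hypernym of the other, separated by at least one and at most five words.
    # Requires the NLTK WordNet data (nltk.download('wordnet')).
    from nltk.corpus import wordnet as wn

    def is_direct_hypernym(hyponym, hypernym):
        """True if some sense of `hypernym` is a direct hypernym of some sense of `hyponym`."""
        hyper_synsets = set(wn.synsets(hypernym.replace(" ", "_")))
        return any(h in hyper_synsets
                   for s in wn.synsets(hyponym.replace(" ", "_"))
                   for h in s.hypernyms())

    def find_wordnet_pair(term_positions, max_gap=5):
        """term_positions: list of (term, token_index) pairs found in one sentence."""
        for term_a, pos_a in term_positions:
            for term_b, pos_b in term_positions:
                gap = abs(pos_a - pos_b) - 1          # words between the two terms
                if 1 <= gap <= max_gap and is_direct_hypernym(term_a, term_b):
                    return term_a, term_b             # (hyponym, hypernym)
        return None

    # e.g. is_direct_hypernym("metadata", "data") should identify the
    # "metadata"-"data" pair discussed below, at least in recent WordNet versions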

Number of extracted definition candidates   Precision (# of definitions)   Recall (on 150 definition test set)
3,258                                       0.043 (140)                    0.2667 (40)

Table 17. WordNet-based definition extraction candidates
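Spelled out, the two measures reported in Table 17 are (notation ours):

    \[
      P = \frac{\#\,\text{correctly extracted definitions}}{\#\,\text{extracted candidates}} = \frac{140}{3258} \approx 0.043,
      \qquad
      R_{150} = \frac{\#\,\text{test-set definitions among the candidates}}{150} = \frac{40}{150} \approx 0.2667 .
    \]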

Firstly, we provide some examples of correctly extracted definitions based on WordNet direct hypernyms. The system extracts sentence (clxiii) due to the correctly identified hypernymy pair "metadata-data", whereas in sentence (clxiv) machine translation is identified as a hyponym of computational linguistics. We can see that, in contrast to the pattern-based approach, the use of adverbs in the middle of the defining patterns does not prevent the extraction of the definitions (for instance, in the first example below the adverb usually occurs in the middle of the pattern is defined as). The variety of defining verbs is also much larger than when the verbs are enumerated in a predefined list (e.g., remains in the second example is not a typical verb used in definitions).

clxiii. Metadata is usually defined as 'data about data'.

clxiv. Machine translation remains the sub-division of computational linguistics dealing with having computers translate between languages.

Despite the fact that the English WordNet is much bigger than its Slovene counterpart sloWNet, the specific domain coverage remains low. For instance, the definition in Example (clxv) is not extracted through the pair "text_mining-data_mining", since WordNet does not include the term text mining, but only the term data mining; the pair based on which the sentence was extracted is "pattern-model". Similarly, in Example (clxvi) the good domain term to be defined, lemmatisation, is not a WordNet literal, but the sentence is extracted anyway, because of the pair "form-word".

clxv. Text mining (Feldman and Sanger, 2007) is a variant of data mining in which models and patterns are extracted from unstructured natural language text.

clxvi. Lemmatisation is the process of finding the normalised forms of words appearing in text.

51 The preliminary evaluation on 100 randomly selected sentences yielded much better results (see e.g. the results of similar settings in Pollak et al., 2012), which shows that 100 sentences are not enough for a reliable precision estimate.


In Example (clxvii), the pair of WordNet hypernyms is "linguistics-science". Even if the definiendum is in fact corpus linguistics and not linguistics, the system correctly extracts the definition, and the hypernym science is a valid hypernym also for corpus linguistics. As already mentioned, the domain coverage is not sufficient: the term corpus linguistics is very representative of our corpus but does not figure in WordNet. In WordNet we find various compound nouns including linguistics, such as computational linguistics, descriptive linguistics, diachronic linguistics, historical linguistics, prescriptive linguistics, structural linguistics or synchronic linguistics, but not corpus linguistics. Moreover, note that the sentence starts with because, which should be deleted in manual definition refinement, but the rest of the sentence is a definition. These types of examples are not the best definition sentences, since they cannot be included in the definition repository without any changes but need manual intervention. However, we decided to classify them as definitions, since the manual refinement does not require any expert knowledge.

clxvii. Because corpus linguistics is an empirical science, in which the investigator seeks to identify patterns of linguistic behaviour by inspection and analysis of naturally occurring samples of language.

As in the Slovene part, a large number of sentences are incorrectly extracted. A very common type of non-definition sentence is one that contains two terms in a hypernymy relation but does not define either of them. For example, sentence (clxviii) contains the words word and language but does not define any of them.

clxviii. Indeed, it does not work for most of the words that make up our general language vocabularies.

To better understand the complexity of definitions in running text as well as the WordNet hypernymy condition, consider the following sentence:

clxix. The European Union today has 15 Member States, and 11 official languages (Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish and Swedish).

Firstly, the sentence is not a definition, since it is out of date. Even if the sentence was a valid extensional definition when the article was written, today there are more official languages in the European Union and the definition is no longer true. Secondly, the term official language is not entirely relevant for our domain. Nevertheless, one would expect the sentence to be extracted based on the general term language and the specific languages (Danish, Dutch...). However, the identified hypernymy pair was "union-state", since the specific languages are not direct hyponyms of language. For instance, the direct hypernym of Greek is Indo-European language, and there are two more levels, Indo-Hittite and natural language, before reaching the node language. Even if the sentence above is not a definition, this example raises the question of loosening the hypernymy condition by taking into consideration all hypernyms of a term and not just the direct one; but since the precision is already very low, increasing the recall at the price of precision is not a good solution.

Many sentences may not qualify as definitions, yet represent knowledge-rich contexts. The embedded candidate definition for specialization in sentence (clxx) is not precise enough to define the concept. In this sentence we can again notice that the sentence is extracted based on a different term pair than the definiendum and its


hypernym (the detected WordNet pair in the sentence is "database-information").

clxx. First of all, he is responsible for managing all language-independent data, which are identified by the term «concept» within a given field of specialization, i.e. the area to which the concept belongs, the possible semantic links that connect the concept to other concepts in the database as well as the information illustrating the concept.

In Example (clxxi) a knowledge-rich context is extracted, i.e., word-based statistical machine translation is attributed the hypernym machine translation system, even if the WordNet pair is the more general pair "system-method". Also in the next two sentences, (clxxii) and (clxxiii), we find knowledge-rich contexts. In the first one, the recognition and categorisation of named entities is illustrated by the recognition of names of streets, etc., but the sentence is not a definition. The pairs based on which the extraction was performed are again not from the language technologies domain: "boulevard-street" and "avenue-street" in the first sentence and "corpus-part" in the last one.

clxxi. Word based statistical machine translation has emerged as a robust method for building machine translation systems.

clxxii. Recognition of names of the streets, boulevards, avenues, roads, etc., is an integral part of the problem of recognition and categorization of named entities (Chinchor et al, 1999).

clxxiii. The spoken corpus is a very important part of the national linguistic infrastructure.

The extraction of knowledge-rich contexts instead of definitions can also be based on more domain-specific concepts, such as "divergence-variant" in the first example below and "word-language" in the second. We notice that neither example contains a hypernym:

clxxiv. Kullback-Leibler Divergence (and Symmetrized Variant) The Kullback-Leibler divergence is specific to probability distribution.

clxxv. Inflectional languages usually have free word order.

Even a correct hypernym does not necessarily yield a definition. In Example (clxxvi) the provided knowledge-rich context is not enough to define the Hungarian language, since one does not learn where the language is spoken, by how many people, or what its origin is:

clxxvi. Hungarian is an agglutinative, free word order language with a rich morphology.

Further, there are many sentences that are not defining in any respect (in Example (clxxvii), it is the pair "document-representation" that is the cause of extraction).

clxxvii. Size of the representation of a document collection.

Arbitrary sentence extraction based on WordNet hypernyms can also be seen from the out-of-domain sentence below, where the WordNet pair is "time-example".

clxxviii. For example, for a long time in every newspapers article in Serbia it was common to use the word Kosovo for the Serbian province called Kosovo i Metohija (Kosovo and Metohija).

To sum up, many examples are good definitions with correctly identified hypernyms (such as the hypernymy pairs "morpheme-linguistic_unit" and "word-linguistic_unit" in definition (clxxix)). These are the most informative examples, because in addition to defining a concept, they already extract a hypernym from the manually crafted resource


WordNet, allowing for a better understanding of the domain and providing an excellent starting point for domain modeling. Other definitions are based on very general domain hypernyms, such as "language-text" in Example (clxxx), but still provide meaningful definitions. As previously discussed in the thesis, the distinction between a definition and a non-definition is not always clear and there are many borderline cases. For instance, a sentence that is not general enough to be a perfect definition was, as a borderline case, still classified as a definition in (clxxxi), whereas an equally borderline case in (clxxxii) was rejected. In the first example morphological tagging is defined as the process of assigning morphological information to a word, and even if the listed categories (e.g., the case) are not language independent, since not all languages have cases, the sentence gives a good idea of the morphological tagging process. In the second example (clxxxii), the candidate for an extensional definition may be valid for some languages, but since it is not generally valid, it was classified as a non-definition, despite the fact that it provides a very relevant knowledge-rich context.

clxxix. A morpheme is the smallest semantically meaningful linguistic unit from which a word is built.

clxxx. In other words, translation memory (also known as sentence memory) consists of a database that stores source and target language pairs of text segments that can be retrieved for use with present texts and texts to be translated in the future.

clxxxi. "Morphological tagging" is the process of assigning POS, case, number, gender, and other morphological information to each word in a corpus.

clxxxii. There are three moods: the indicative, the imperative, and the conditional.

Other discussed examples of borderline cases which can be classified as definitions are those explaining mathematical formulas, as in Example (clxxxiii), or extensional definitions providing all or typical realizations of a concept, as in Example (clxxxiv). In these two sentences the definition is embedded in a non-definitional sentence (we underline the definition part). In Example (clxxxv) the proper noun used for the name of the tool, SimFinder, is well defined, but it could be further discussed whether the concept should be part of a definition lexicon or not (the category of borderline definitions of named entities).

clxxxiii. Since a conditional probability can be expressed as a joint probability divided by a marginal probability like, for example, in equation (4.1), deriving a conditional model from a joint model usually requires first deriving a marginal model.

clxxxiv. However, also language checking tools such as spell checkers, grammar and style checkers have to meet the user's requirements and should easily be extensible to the specialised terminology, the text structure properties and the in-house writing conventions.

clxxxv. In the context of multidocument summarization, SimFinder (Hatzivassiloglou et al., 1999) identifies sentences that convey similar information across input documents to select the summary content.

Sentence (clxxxvi), which explains the concept of sublanguage through examples, was not labeled as a definition, and neither was sentence (clxxxvii), where a candidate definition is introduced by i.e. but in our opinion does not define the concept (for comparison, consider the definition of information filtering system from Wikipedia:

“An Information filtering system is a system that removes redundant or unwanted information from an information stream using (semi)automated or


computerized methods prior to presentation to a human user. Its main goal is the management of the information overload and increment of the semantic signal-to-noise ratio.")

clxxxvi. It may be restricted (or adapted) to a particular domain or sublanguage – the language of medicine is different from the language of engineering and the language of theatre criticism, etc. – by means of the definitions and constraints specified in the databases of domain or sublanguage information.

clxxxvii. An area where MT is already involved is that of information filtering (often for intelligence work), i.e. the analysis of foreign language texts by humans.

To conclude, even if the method provides a large number of knowledge-rich contexts, it yields relatively few definitions compared to the number of extracted sentences. This is due to the very low domain coverage of WordNet, and we can expect the method to be more relevant for more general domains. We achieved better results on popular science texts (cf. Fišer et al., 2010), which are also known to have better WordNet coverage (e.g., Schmied, 2007). In further work, additional experiments will be performed in order to see how preprocessing with a chunker/parser or limiting the WordNet pairs to domain terms only could improve the results.

5.3 Results of Slovene and English definition extraction methods and their combinations

In this section we summarize the results of the definition extraction methods described in the two previous sections and examine the possibility of combining different methods in order to improve the definition extraction results in terms of precision or recall. The quantitative evaluation of the method combinations is presented in Sections 5.3.1 and 5.3.2 for Slovene and English, respectively. The qualitative evaluation is discussed in Section 5.3.4, which provides a systematization of definition types and of the problems related to extracting definition candidates from running text.

5.3.1 Combining different approaches on the Slovene subcorpus

Let us first summarize the results of the definition extraction methods on the Slovene subcorpus (see Table 18).

- Using the pattern-based approach we extracted 1,728 definition candidates, of which 389 were true definitions, i.e., the precision is 0.2251 and the recall evaluated on the 150 definitions recall test set is 0.5867 (where the pattern-based approach takes the union of the basic "NP-nom is/are NP-nom" definition pattern with all other patterns of Table 10, using verbs such as define, determine, describe, with which we get better results).
- Different settings of the term-based approach can be tuned to achieve higher precision, higher recall, or the best compromise between the two. In Table 18 below, we summarize a combination of two settings, i.e., the union of settings (R) and (w) (see (R & w) of Table 11). The union of these two settings is selected as a suitable compromise between precision and recall,52 where (R)

52 For finding the best tradeoff we calculated the F0.5-score for all the results of Table 23 and Table 24.


is selected from the experiments without the nominative case constraint (reported in Table 23) and (w) from those with the condition of at least one term in the nominative case (cf. Table 24).
- We consider setting (R & w) well suited to be used as the default setting, given that it achieves a suitable tradeoff between the precision and the number of extracted definitions.
- The sloWNet approach (see Table 12) has very low precision, i.e., 0.057, and extracts 270 definitions.

The three methods, the pattern-based, the term-based and the WordNet-based definition extraction, are first combined in the following straightforward way (see the results in Table 18).

- Union contains all the sentences that were extracted by at least one of the three methods, leading to nearly 650 extracted definitions with a high recall of 0.7, but low precision (approximately only every tenth sentence being a definition).
- Intersection contains the sentences that were extracted by at least two out of three methods; this results in a higher precision of 0.26, with 129 extracted definitions. (We also tested the intersection of all three methods, but even though the precision was above 0.40, the number of extracted definitions was too small to be considered.) A short set-based sketch of these two combinations is given below.
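Treating each method's output as a set of sentence identifiers, the two combinations reduce to simple set operations; the following is a minimal sketch with illustrative names.

    def combine(pattern_ids, term_ids, wordnet_ids):
        """Inputs: sets of sentence identifiers extracted by each method."""
        union = pattern_ids | term_ids | wordnet_ids
        # "Intersection" in Table 18: sentences returned by at least two of the three methods
        at_least_two = ((pattern_ids & term_ids) |
                        (pattern_ids & wordnet_ids) |
                        (term_ids & wordnet_ids))
        return union, at_least_two

    # toy example: sentences 2 and 3 are shared by two methods
    union, at_least_two = combine({1, 2, 3}, {2, 4}, {2, 3, 5})
    print(sorted(union), sorted(at_least_two))   # [1, 2, 3, 4, 5] [2, 3]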

Methods on the Slovene subcorpus: Summary and main combinations   # Extracted sentences   Precision (# of definitions in extr. sentences)   Recall estimate (# of def. in 150 recall test set)
Pattern-based (Total of Table 10)       1,728   0.2251 (389)   0.5867 (88)
Term-based (R & w of Table 11)          721     0.1747 (126)   0.0467 (7)
sloWNet-based (sloWNet of Table 12)     4,670   0.0570 (270)   0.2533 (38)
Union                                   6,606   0.0978 (646)   0.7000 (105)
Intersection                            489     0.2638 (129)   0.1800 (27)

Table 18. Summary of definition extraction methods on the Slovene subcorpus. For the term-based approach a combined setting of two methods, (R) of Table 23 and (w) of Table 24, is taken, the same as the most appropriate term-based setting shown in Table 11. Union denotes the sentences extracted by any of the three methods, while Intersection denotes the sentences extracted by at least two out of three methods.

In addition to the above straightforward combinations of the three definition extraction methods, we performed several other experiments, trying to achieve improved precision or recall (cf. Table 19 with a selection of the most useful and intuitive method combinations tested). As already mentioned, especially with the term-based approach, the users can choose the settings according to whether they give more importance to precision or to recall. As a basis we took the pattern-based approach, which ensures a good compromise between the precision and the number of extracted sentences, and combined it with the other methods depending on whether we opted for better precision or recall.

The union of settings (R & w) is the union of the settings with the highest F0.5-scores. For more details on the F0.5-score see footnote 40.


Methods on Slovene subcorpus: Other combinations   # Extract. sent.   Precision (# of definitions in extr. sentences)   Recall estimate (# of def. in 150 recall test set)
Pattern-based (Total of Table 10)                              1,728   0.2251 (389)   0.5867 (88)
Enlarged with term best (ee)                                   1,814   0.2255 (409)   0.5867 (88)
Enlarged with term (R & w) (biased to improved recall)         2,382   0.2040 (486)   0.6133 (92)
Filtered by term (R & w) or WNet (biased to improved precision)  336   0.3184 (107)   0.1733 (26)

Table 19. Summary of definition extraction methods on the Slovene subcorpus, presenting the results of a selection of combined methods.

- We can improve the precision and the number of extracted definitions (409) by complementing the pattern-based results with the sentences extracted by the term-based definition extraction using the setting with the best precision (setting ee).
- Alternatively, we can extract nearly 100 additional definitions if the candidate definitions extracted by the term-based (R & w) approach are added, while the precision still stays above 20%.
- On the other hand, if one opts for higher precision, the candidates extracted by the pattern-based approach can be filtered by the term-based or the sloWNet definition candidates. In this case the precision is nearly 10% higher than with the pattern-based approach alone; however, only slightly more than 100 definitions, instead of nearly 400 pattern-based definitions, are extracted.

To conclude, the combination of various definition extraction methods can be useful to improve the precision or the recall compared to using the methods individually. In summary, the union of the three methods contains 646 definitions but the precision is under 10%; with precision over 30% about 100 definitions (107) were extracted; and with 20% precision 486 definitions were extracted (for this setting, the F1 measure is 0.3062, the F0.5 measure is 0.2354 and the F2 measure is 0.4377).
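The F-scores quoted above and in the footnotes are the standard weighted harmonic mean of precision and recall; as a check on the last setting (P = 0.2040, R = 0.6133):

    \[
      F_\beta = \frac{(1+\beta^2)\,P\,R}{\beta^2 P + R},
      \qquad
      F_{0.5} = \frac{1.25 \cdot 0.2040 \cdot 0.6133}{0.25 \cdot 0.2040 + 0.6133} \approx 0.2354,
      \qquad
      F_{2} = \frac{5 \cdot 0.2040 \cdot 0.6133}{4 \cdot 0.2040 + 0.6133} \approx 0.4377 .
    \]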

5.3.2 Combining different approaches on the English subcorpus

Let us summarize the results of the definition extraction methods on the English subcorpus (see Table 20).

- The pattern-based approach has three variants. With the patterns starting at the beginning of a sentence (beg.-novar.), we extracted 185 definitions with a precision of 0.3292. If we accept variations of the beginning of a sentence (beg.-allvar.), we extracted 200 definitions with a lower precision (0.2849). For substantially higher recall, the third variant of the pattern-based method (nobeg., i.e., the patterns can occur anywhere in the sentence and not necessarily at the sentence beginning) can be used; in this way 273 definitions were extracted, but the precision is below 12%. We can see that, compared to the Slovene pattern-based approach, the pattern-based approach for English can achieve a 10% higher precision.
- The term-based methods have lower precision than for Slovene (one of the reasons being that the nominative case constraint is not applicable). In Table 20 we report the results for setting (m) of Table 16, which offers a good compromise


between the precision and the number of extracted definitions.53 The precision is 0.0824 and the estimated recall is 0.1333, with 90 extracted definitions.
- For WordNet-based definition extraction the precision is below 5%, while the estimated recall is 0.2667, meaning that (as for Slovene) this method is not useful unless applied in combination with other methods.

Straightforward combinations of the results of the three approaches are Union (including all the sentences that were extracted by at least one of the three methods) and Intersection (the definition candidates that were extracted by at least two out of three approaches). For the pattern-based approach we took variant 2 (in which patterns need to occur at the beginning of a sentence, but different beginning variations are considered), while for the term-based approach we took the results of setting (m). However, neither of the two combinations is appropriate for use on its own, since the pattern-based approach by itself results in much higher precision. The only interesting observation is that with the union of the three methods one can extract more than half of the definition test set sentences (good recall) and 344 definitions from the entire corpus.

Methods on English subcorpus: Summary and main combinations   # Extracted sentences   Precision (# of definitions in extr. sent.)   Recall estimate (# of def. in 150 recall test set)
Pattern-based 1: beg.-novar. (Total of Table 13)    562     0.3292 (185)   0.2533 (38)
Pattern-based 2: beg.-allvar. (Total of Table 15)   702     0.2849 (200)   0.2733 (41)
Pattern-based 3: nobeg. (Total of Table 14)         2,283   0.1196 (273)   0.3733 (56)
Term-based (setting m of Table 16)                  1,092   0.0824 (90)    0.1333 (20)
WordNet-based (WNet of Table 17)                    3,258   0.0430 (140)   0.2667 (40)
Union                                               4,727   0.0728 (344)   0.5400 (81)
Intersection                                        318     0.2579 (82)    0.1333 (20)

Table 20. Summary of results of three definition extraction methods on the English subcorpus, as well as Union (i.e., sentences extracted by any of the methods) and Intersection (i.e., sentences extracted by at least two out of three methods).

Next, some other combinations of the three approaches were tested (see Table 21). As the basis we took the three variants of the pattern-based approach, since the pattern-based approach leads to the most satisfying results when used individually. We then combined the three variants with the other methods in different ways:

53 We calculated the F0.5-score for all the settings of Table 16. Setting (m) has the highest F0.5-score at the 10% termhood threshold. As explained in footnote 40, in the formula for calculating the F0.5-score the precision is the actual precision, while the recall is an average of two different recall estimates: one evaluated on the 150 definition recall test set (this recall estimate is also the one reported in all the tables of the thesis), the other based on 1,000 randomly evaluated sentences, out of which 20 were definitions, leading to an estimate of 860 definitions in the English subcorpus. We selected the best two settings in terms of the F0.5-score, one for the 1% termhood value (setting j) and the other for the 10% termhood value (setting m). (Note that this differs from Slovene, where we took the best two settings one with and the other without the nominative condition.) Since the sentences of the selected setting (j) are included in the results of the selected setting (m), only the results of (m) are reported in Table 20.


Methods on English subcorpus: Other combinations   # Extracted sentences   Precision (# of definitions in extr. sentences)   Recall estimate (# of def. in 150 recall test set)
Pattern-based 1 (beg.-novar.)                                      562     0.3292 (185)   0.2533 (38)
A: Filtered by term (m) or WNet (biased to improving precision)    107     0.5420 (58)    0.0867 (13)
Pattern-based 2 (beg.-allvar.)                                     702     0.2849 (200)   0.2733 (41)
B: Enlarged with term best (w)                                     778     0.2673 (208)   0.2867 (43)
Pattern-based 3 (nobeg.)                                           2,283   0.1196 (273)   0.3733 (56)
C: Filtered by term (m) or WNet                                    390     0.2231 (87)    0.1400 (21)
D: Enlarged with term (m) (biased to improving recall)             3,253   0.0987 (321)   0.4667 (70)
Combined E (Union of B and C)                                      1,022   0.2250 (230)   0.3333 (50)

Table 21. Summary of different combinations of definition extraction methods on the English subcorpus. As the basis, three variants of the pattern-based approach are taken and filtered or enlarged in different ways by term-based or WordNet-based definition extraction.

- Combination A, favoring precision, filters the sentences extracted by the most restrictive variant of the pattern-based approach (variant 1): it keeps only those pattern-based definition candidates that were also extracted either by the term-based approach (setting m) or by the WordNet-based definition extraction approach. This combination is a good choice if we strongly prefer high precision (nearly 55%) over good coverage of the domain (lower recall).
- Combination B is a compromise between fairly good precision and recall. To the sentences extracted by the pattern-based approach (variant 2), it only adds the sentences extracted by the best-performing term-based setting in terms of precision, i.e. setting (w). In this way we extracted 208 definitions with a precision above 28%.
- Combination C repeats the filtering of pattern-based candidates as in combination A, except that the third variant of the pattern-based approach (variant 3, without the beginning-of-sentence limitation, which has a lower precision but a higher recall) is taken as the basis.
- Combination D improves the recall by adding the term-based candidates (setting m) to the pattern-based approach (variant 3): 321 definitions at nearly 10% precision were extracted.
- Combination E offers a good balance between precision and recall. It takes all the sentences extracted by the pattern-based definition extraction method in which different beginnings of the sentence are considered (variant 2), all the sentences extracted by the best-performing (in terms of precision) term-based setting, as well as those sentences extracted by the less restrictive pattern-based method (variant 3) that were also extracted either by term-based (setting m) or by WordNet-based definition extraction; a set-based sketch of this combination is given after this list. It extracted 230 definitions (more than pattern-based variant 2) and the precision is 22.5%. This setting can be proposed as the suggested setting if the user has no other preferences for favoring precision or recall.
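In the same set notation as used for the Slovene combinations, combination E can be written compactly; the sketch below uses our own variable names for the per-method result sets.

    def combination_e(pattern2, pattern3, term_w, term_m, wordnet):
        """All arguments are sets of sentence identifiers.
        B = pattern-based variant 2 enlarged with the best term-based setting (w);
        C = pattern-based variant 3 filtered by term-based (m) or WordNet-based results."""
        b = pattern2 | term_w
        c = pattern3 & (term_m | wordnet)
        return b | c                      # E = union of B and C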

In summary, combining different definition extraction methods on the English subcorpus may lead either to improved precision or recall (extracting 58 definitions at


54% precision, 185 definitions at approx. 33% precision, 321 definitions at less than 10% precision but 46% recall, and 230 definitions at 0.225 precision and 0.3333 recall). (For the last setting, the F1 measure is 0.2686, the F0.5 measure is 0.2406 and the F2 measure is 0.304.) In comparison with extracting definitions from the Slovene subcorpus, we can see that at 10% or 20% precision only half as many definitions were extracted for English. However, above 30% precision more definitions are extracted for English, and for English the settings can also be tuned to achieve a precision above 50%.

5.3.3 Subjectivity of evaluation results

The relatively low results in terms of precision and recall can be partly attributed to the nature of our corpus and to the subjectivity of the quantitative evaluation. The complexity of the task has been illustrated by citing numerous complicated sentences extracted from the corpus. Given that our corpus mainly consists of academic articles, a high level of prior knowledge is presupposed: basic terms are considered common knowledge and additional information is provided through references to related work, not always through definitions and explanations. Moreover, the definitions are encoded in linguistically intricate structures. Ideally, the corpus would be extended with more popular science articles and textbooks, which have proven to contain less complex sentences.

When analyzing the definition candidates, we noted that in many cases the decision whether a sentence should be considered a definition or not is not obvious. This means that the quantitative results are highly dependent on the annotator. We therefore performed a quick inter-annotator agreement experiment. Fifteen sentences were evaluated by 21 annotators. We calculated a kappa statistic, i.e., a chance-adjusted measure of agreement. Randolph's free-marginal multi-rater kappa (see Randolph, 2005; Warrens, 2010) was used, as implemented in Randolph's (2008) Online Kappa Calculator. Values of kappa can range from -1.0 to 1.0, where -1.0 indicates perfect disagreement below chance, 0.0 indicates agreement equal to chance, and 1.0 denotes perfect agreement above chance. In our experiment, the inter-annotator agreement kappa value is 0.36, which is far from 0.70, the value usually taken to indicate adequate inter-rater agreement. We provide two examples: for Example (clxxxviii) all the evaluators agreed that the sentence is a definition, while Example (clxxxix) was tagged as a definition by 10 evaluators and as a non-definition by 11.

clxxxviii. Lombardov efekt je pojav, pri katerem govorec poveča glasnost govora ob povečanju glasnosti šuma ozadja. [The Lombard effect is a phenomenon, where a speaker increases the intensity of the speech when the level of background noise raises.]

clxxxix. Google Translate je tipični pripadnik sistemov statističnega strojnega prevajanja (Statistical Machine Translation-SMT), ki je predstavljena v Razdelku 3. [Google Translate is a typical representative of statistical machine translation systems (Statistical Machine Translation-SMT), presented in Section 3.]
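The reported agreement can be recomputed with the standard formula for Randolph's free-marginal multirater kappa; the short function below is our own sketch of that formula, not the Online Kappa Calculator itself.

    # Randolph's free-marginal multirater kappa for N items rated by r raters into k categories.
    # ratings[i][j] = number of raters who assigned item i to category j.
    def free_marginal_kappa(ratings, n_categories):
        n_items = len(ratings)
        n_raters = sum(ratings[0])
        # observed agreement: proportion of agreeing rater pairs, averaged over items
        p_o = sum(sum(c * (c - 1) for c in row) for row in ratings) / float(
            n_items * n_raters * (n_raters - 1))
        p_e = 1.0 / n_categories            # expected agreement with free marginals
        return (p_o - p_e) / (1.0 - p_e)

    # toy check with 21 raters and 2 categories (definition / non-definition):
    # one unanimous item and one 10-vs-11 split, as in the two examples above
    print(round(free_marginal_kappa([[21, 0], [10, 11]], 2), 2))   # 0.48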

In another experiment by which we relativize the value of the quantitative results, instead of assigning a binary definition/non-definition value we evaluated the definition candidates with scores from 1 (non-definition) to 5 (a perfect definition). For instance, when classifying the candidates only with definition/non-definition labels, 230 out of the 1,022 English candidate sentences (setting E of Table 21) were evaluated as


definitions. In contrast, when evaluating the definitions on the 1–5 scale, 345 out of 1,022 sentences were attributed a positive score (in the range between 2 and 5). This shows that for English, with a less restrictive evaluation, the precision could immediately be reported as 0.3376 instead of the 0.225 reported in Table 21.

5.3.4 Analysis of different types of definition candidates

In this section we systematically analyze all the definitions of the two settings that we chose as the final settings for English and Slovene. For Slovene this was setting B (biased to improved recall, cf. Table 19) and for English setting E (cf. Table 21). For Slovene we consider the 486 definitions that were extracted with 0.2040 precision and 0.6133 recall, while for English the 230 definitions, extracted with 0.2250 precision and 0.3333 recall, were analyzed. The two settings were selected based on the criterion of the largest number of definitions extracted at a relatively high precision, leading to 716 analyzed definitions out of 3,404 sentences analyzed for both languages in total.

Already in the previous chapters of this thesis, we have discussed different interesting phenomena related to the extraction of definitions from running text. Here we provide tags for marking different types of extracted candidate definitions and discuss various problems with their classification into definition/non-definition categories. In total, approx. 19,300 Slovene and 14,000 English sentences were evaluated as definitions or non-definitions (out of which more than 1,500 sentences, approx. 1,050 for Slovene and more than 500 for English, were classified as definitions). However, the 3,404 annotated sentences analyzed in this section were tagged more systematically, double-checked and reclassified into the definition/non-definition categories after a thorough analysis.

In Table 22 we list the tags for marking different types of sentences and discuss different problems with their classification into definition/non-definition categories. The basic tags are "Y", denoting yes for definitions, and "N" for non-definitions, but several other categories were used for (non-)definition labeling. Different letters denoting specificities of definition candidates can also be combined. If a question mark follows the tag, the sentence is a borderline case. In addition to the tags, explanations and examples illustrating each category are provided. This analysis mainly summarizes the observations discussed in the previous chapters, reflected in the tags grouped in the following way.

- Definition form related tags mainly distinguish between the main definition types, i.e. genus and differentia or paraphrases, extensional definitions, and definitions with no hypernym (e.g., functional definitions) or with a hypernym only.
- Definition content related tags cover different types of problems encountered in the analysis of the extracted sentences (when the content is too general, too specific, etc.).
- Definiendum related tags refer to the fact that many definienda are proper names or abbreviations, or that the definitions are provided for terms outside the domain.
- Segmentation related tags treat issues such as wrong segmentation, multiple definitions in the same sentence or definitions spanning over several sentences, and embedded definitions.


- The last category, annotation related tags, points to examples where either sentences with the same content are extracted from the corpus twice (and could be omitted) or sentences for which the annotation was changed while double-checking the tags.

Note that the examples frequently contain several tags, since the categories overlap; a short sketch of how such combined tags can be decomposed is given below, and the full tag inventory follows in Table 22.
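Since the letter codes are freely combinable (e.g. "Ylk" or "Nwa?"), each tag can be decomposed mechanically; the small sketch below assumes only the syntax described here (leading Y/N, optional trailing "?" for borderline cases, remaining letters as feature codes listed in Table 22).

    def parse_tag(tag):
        """Decompose an annotation tag such as 'Ylk' or 'Nwa?' into its parts."""
        borderline = tag.endswith("?")
        core = tag.rstrip("?")
        return {
            "definition": core[0] == "Y",   # 'Y' = definition, 'N' = non-definition
            "features": list(core[1:]),     # letter codes explained in Table 22
            "borderline": borderline,
        }

    print(parse_tag("Nwa?"))   # {'definition': False, 'features': ['w', 'a'], 'borderline': True}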

DEFINITION TYPE RELATED TAGS

Y/N  Definition/Non-definition
The two basic categories denote that a sentence is classified as a definition ("Y") or a non-definition ("N"). If no other tag labels are added, the sentences labeled by "Y" denote the most obvious definitions (mainly using the genus and differentia structure or defining by paraphrases). Sentences that are borderline definitions are

marked as "Y?", while the "N?" tag is used for borderline sentences containing knowledge-rich contexts without being definitions.
Examples:
- Y: Text classification is the task of assigning a text document to one or more categories, based on its contents. ["Y"]
- N: As such, language is a central theme of our research activities. ["N"]

Ye/Ne  Extensional
Sentences that define a term by providing its constituting parts, all possible realizations or representative examples are tagged as "Ye". Non-definitions in this category ("Ne?") usually denote sentences in which some examples of the class are provided, but the examples are not representative enough for the term to be clearly defined (frequently borderline cases).
Examples:
- Ye: Language resources are corpora and other lexical data: electronic versions of dictionaries for human users and lexicons for language technology applications. ["Ye?"]
- Ne: ZV. The following words are adverbs: kam 'where to', kje, kod 'where', kako 'how', kolikokrat 'how many times', kdaj 'when', zakaj 'why', koliko 'how much; how many', doklej 'till when'. ["New"]

Yn/Nn  No hypernym
Definitions that explore other possibilities than defining the term by its hypernym and the differentia or by paraphrases, but are not extensional definitions. The most common types are functional or typifying definitions. For non-definitions, especially the borderline cases marked by "Nn?" are interesting.
Examples:
- Yn: A trie (also known as retrieval tree or prefix tree) provides a compact representation of strings with shared prefixes, which is exactly what is needed. ["Yvny"]
- Nn: The answer is: 'correctness' is defined by what the annotation scheme allows or disallows — and this is an added reason why the annotation scheme has to be specific in detail, and has to correspond as closely as possible with linguistic realities recognized as such. ["Nsn?"]

Yp/Np  Incomplete (only hypernym)
Sentences (classified as definitions or non-definitions) that are usually borderline cases, since they do not sufficiently define the definiendum, but provide only its


hypernym (without providing enough specification elements).
Examples:
- Yp: Google Similarity Distance (GSD) is a word/phrase semantic similarity distance metric developed by Rudi Cilibrasi and Paul Vitanyi proposed in (Cilibrasi & Vitanyi, 2007). ["Yp"]
- Np: As a consequence, lemmatization is an indispensable preprocessing step for most language processing methods including term extraction. ["Npv"]

Yy/Ny  Relational (synonym, antonym or sibling concept)
Sentences that instead of a hypernym and differentia use other concepts such as synonyms, antonyms or sibling concepts (possibly followed by differentia); knowledge-rich context sentences evaluated as non-definitions frequently fall into this category ("Ny?").
Examples:
- Yy: Text mining (Feldman and Sanger, 2007) is a variant of data mining in which models and patterns are extracted from unstructured natural language text. ["Yy"]
- Ny: Other terms used for their denomination are: thematic roles, semantic cases, thematic relations, semantic arguments, etc. ["Nay"]

DEFINITION CONTENT RELATED TAGS

Yg/Ng  Too general
Often borderline cases (denoted by "?"), where the way of defining a term is too general (without providing enough specific elements), but all in all either still providing enough information for considering the term well defined ("Yg") or not ("Ng"). In genus et differentia definition types, the tag can also denote that the genus is too general, while the definition can still be a very informative lexical definition and not a borderline case.
Examples:
- Yg: A translation memory system is a tool that is designed for helping human translators during translation. ["Yg"]
- Ng: Homogeneity is a useful practical notion in corpus building, but since it is superficially like a bundle of internal criteria we must tread very carefully to avoid the danger of vicious circles. ["Ng"]

Ys/Ns  Too specific
Sentences that are in most cases borderline (marked by "?" after the definition/non-definition tag) because they are considered too specific to serve as a general definition of the definiendum; in the example below, the context of WordNet is not mentioned and the sentence is therefore somewhat too specific (human knowledge is needed to add the relevant context to this definition).
Examples:
- Ys: A synset is a group of data elements (synonyms) that are considered semantically equivalent [1]. ["Ys"]
- Ns: Recall is the extent to which all correct annotations are found in the output of the tagger. ["Ns?"]

Yc/Nc  Cyclic
Definition candidates that are borderline cases because they are cyclic, e.g., using in the definiens part the definiendum itself or a word with the same etymology.
Example:
- Yc: The Naive Bayes (NB) classifier is a probabilistic classifier based on


Bayes' theorem. ["Ylgc?"]

Yo/No  Outdated
In some cases the definiendum itself or the information provided in a definition is completely or partly outdated, no longer being valid or relevant. (Often some projects, organizations, etc. are not active anymore, and a completely valid definition should at least specify the dates of their activity.)
Examples:
- Yo: The TELRI Association is the independent pan-European voice of the multilingual research and development community, a devoted though impartial partner of the European language industry, and a respected consultant of the European Commission for the Multilingual Information Society. ["Ykogz?"]
- No: 7 There is a movement being spearheaded by a special interest group of the Localization Industry Standards Association (LISA) known as OSCAR (Open Standards for Container/Content Allowing Re-use) ["Nkwog?"]

Yz/Nz  Metaphorical
Sentences in which the hypernym or the sentence itself is used slightly metaphorically.
Examples:
- Yz: The Basic Multilingual Plane (BMP) is the official name of the "heart and soul of Unicode" (Gillam 2003), which contains the majority of the encoded characters from most of the modern writing systems (with the exception of the Han ideographs used in Chinese, Japanese and Korea). ["Yzy?3"]
- Nz: The World Wide Web is a marvelous place, with a vast range of languages, content domains and media formats. ["Nz?"]

Yf/Nf  Including formulas
When there is a (part of a) mathematical formula in the extracted sentence, we added a special tag, since such sentences should rather be discarded because of their noisy nature (the errors are mostly due to segmentation). In the small corpus the formulas were mostly discarded during manual preprocessing, but not in the entire corpus. However, in a few cases the formulas are mainly explained in textual form, and if only a few symbols are missing, we treat them as borderline definitions, since only minimal manual refinement is needed (see the underlined part in the first example below).
Examples:
- Yf: The precision is defined as the ratio number of correctly classified instances of class c number of instances classified as class c and the recall is defined as the ratio number of correctly classified instances of class c number of instances of class c The trade-off between precision and recall is measured by the value of F1-measure defined as 2 ∗ precision ∗ recall precision + recall. ["Yfwm?"]
- Nf: A useful example is Riley's entropy semiring : K=R_0×R_0 hp , hi_=h1/( 1 - p ), h/( 1 - p) 2i hp, hi_hp0, h0i = hp + p0, h + h0i—0=h0, 0i hp, hi hp0, h0i = hp0 p, p0 h + p h0i—1=h1 , 0i 184 where hp , hi_is undefined for p_1. ["Nf"]

DEFINIENDUM RELATED TAGS

Yl/Nl  Proper names
The definitions of proper name definienda (used for named entities) have a


special tag, since there is no unanimous agreement on whether named entities are domain terms that should figure in terminological resources or not. In our case, when a named entity was treated as a term and its definition was provided (tagged as "Yl"), we included it in the final glossary. "Nl", in contrast, indicates either that a named entity is not a relevant domain term (and therefore its definition is not relevant) or that a relevant domain term that is a named entity is not well defined.
Examples:
- Yl: An initiative of general interest is European Language Resources Association – ELRA [8], – an infrastructure for identifying, collecting, classifying, validating, distributing, and exploiting language and speech resources, such as basic data (corpora, recordings, terminology), linguistic models (grammars, lexica, HMM) and software tools. ["Ylk"]
- Nl: The Russian National Corpus (http://ruscorpora.ru) is an ongoing project that began in 2000. ["Nlg"]

Yk/Nk  Abbreviations
A special case of named entities are abbreviations, for which it is even more delicate to decide whether we consider them terms or not. If considered terms ("Yk"), the abbreviation should be expanded within the same terminological resource in order to make the definition valid. When the problem is not only that the abbreviation needs to be expanded, but that the concept denoted by the abbreviation is not even well defined, the sentence is marked with "Nk". In most cases these definitions are borderline cases.
Examples:
- Yk: LMF (ISO 24613) is a model that provides a common standardized framework for the construction of NLP lexicons. ["Yk?"]
- Nk: The DWDS is a venture which can be developed, expanded and detailed in multiple ways, but one with a practical and academic benefit right from the outset. ["Nkt"]

Yt/Nt  Borderline term or not a term
Sentences that have a definition structure, but in which the definiendum is a borderline term (mainly borderline cases), are marked with "Yt", while sentences that look like definitions but do not contain a real definiendum (e.g., it is not a domain term) are tagged with "Nt".
Examples:
- Yt: The Cluetrain Manifesto is a "movement" which examines the phenomenon that is the Internet and the substantial changes that we must all implement in our businesses to be successful in the global village. ["Yltd?"]
- Nt: 'Second position' is usually defined as the position after the first syntactic constituent or the first prosodic word. ["Nts"]

Yd/Nd  Out-of-domain
"Yd" denotes definitions that are not fully domain specific, but still relevant for the domain. On the other hand, if sentences, even though possibly being definitions, are completely irrelevant for the domain, the tag is "Nd".
Examples:
- Yd: Serbian language is an Indo-European, South-Slavic language, with 10 million speakers in Serbia (11 million world-wide) (Grimes, 1996). ["Yd?"]
- Nd: Jakarta is a tropical, humid city, with annual temperatures ranging between the extremes of 75 and 93 degree F (24 and 34 degree C) and a


relative humidity between 75 and 85 percent. ["Nlds"]

SEGMENTATION RELATED TAGS

Yw/Nw  Wrong segmentation
"Yw" is used for definitions for which minimal manual refinement should be performed, since the sentence is wrongly separated from the rest of the text (usually due to wrong title/first-sentence segmentation or unrecognized abbreviations). Non-definitions can also be wrongly segmented ("Nw"), but since non-definitions are not relevant for the discussion, it is mainly the borderline cases that are of interest; in the second example below, the sentence is not complete and therefore does not provide a definition.
Examples:
- Yw: 1 INTRODUCTION Lemmatization is the process of determining the canonical form of a word, called lemma, from its inflectional variants. ["Yw"]
- Nw: Word sense disambiguation is the problem of assigning which of several possible meanings of a word a certain ["Nwa?"]

Yv/Nv  Embedded
If a definition is embedded in a sentence that as a whole is not a definition, the tag is "Yv" (in the first example below we underline the embedded definition); sometimes only a single additional word is added to a definition, while in other cases the definition is provided in parentheses within a longer sentence. Sentences that contain an embedded knowledge-rich context but are not definitions (or contain too much irrelevant information) are marked by "Nv" (often also followed by "?").
Examples:
- Yv: For POS tagging, the first thing to list is the tagset—i.e., the list of symbols used for representing different POS categories. ["Yv?"]
- Nv: The next larger level at which errors are tallied in speech recognition is the sentence or utterance, i. e., a sequence of words. ["Nvg?"]

Ym/Nm  Defining several terms
If several concepts are defined in one sentence, the tag is "Ym" (indicating also that the real number of extracted definitions is higher than the one stated). In the example below we underline the two definienda. Similar cases, where definition candidates provide a knowledge-rich context but are not tagged as definitions, are annotated as "Nm".
Examples:
- Ym: Computational linguistics has theoretical and applied components, where theoretical computational linguistics takes up issues in theoretical linguistics and cognitive science, and applied computational linguistics focuses on the practical outcome of modelling human language use. ["Ymn?"]
- Nm: Izraz termin se nanaša na individualno enoto, po drugi strani pa se poimenovanje terminologija nanaša na kolektivni objekt (Kageura, 2002). [Translation: The expression term refers to an individual unit, while on the other hand the designation terminology refers to a collective object (Kageura, 2002).] ["Nmg?"]

Ya/Na  Spanning across two sentences
Sometimes definitions span across two sentences. If the evaluated sentence

129129

provides a satisfactory definition context, but its structure suggests that useful defining context could be found in the sentence before/after it to provide a more complete definition, the sentence was tagged as "Ya". If the isolated sentence is not enough for defining a term, but its structure suggests that a complete definition could be found in the sentences before/after it, the tag "Na" is used. Examples:
- Ya: An example of such data collection is the WordNet: a lexical database for the English language (WordNet, 2002; Lexical FreeNet, 2002). ["Ypal?"]
- Na: Other terms used for their denomination are: thematic roles, semantic cases, thematic relations, semantic arguments, etc. ["Nay?"]
ANNOTATION RELATED TAGS
Yb/Nb Doubles
Definitions/non-definitions that appear more than once in the text: sometimes in longer articles the author repeats a sentence more than once; in other cases the same sentence appears in different articles written by the same author. Example:
- Yb: Word sketches are one-page automatic, corpus-based summaries of a word's grammatical and collocational behaviour. ["Yb"]
Yr/Nr Revise annotation
When checking the evaluations, we believe that in several examples the Y/N evaluation should be revised. In the majority of cases borderline candidates are concerned, where the limit between definition and non-definition is fuzzy. In other examples it was simply a mistake made during the evaluation. "Yr" therefore means that after reconsideration we evaluated the sentence as a non-definition (although it was initially tagged as a definition), and "Nr" means that a sentence tagged as a non-definition should rather be considered a definition. These sentences should preferably be re-evaluated by another annotator. Examples:
- Yr: Basque is a free-constituent order language where PPs in a multiple-verb sentence can be attached to any of the verbs. ["Yrpsd"]
- Nr: In single-link clustering, distance between two clusters is the distance between the nearest neighbors in those clusters. ["Nnr?"]

Table 22. Tags and examples of different types of definition candidates.

In summary, we have shown that different combinations of methods lead to higher precision or recall, depending on the user's preferences. For Slovene, the setting with a good precision-recall compromise reaches approx. 20% precision and 61% recall, which is comparable to the more complex machine learning system for Polish (reported by Degórski et al., 2008b) and better than the grammar-based systems for Czech and Bulgarian (Przepiórkowski et al., 2007). The corpus used in this thesis contains complex sentences for which even human evaluators find it difficult to decide whether to tag them as definitions or non-definitions (0.36 kappa). Moreover, there are relatively few definitions in the corpus, as can be seen from a small experiment in which we evaluated 1,000 randomly selected sentences and found only 17 definitions. We repeated the experiment three times, once on the English subcorpus and twice on the Slovene subcorpus, and evaluated 17, 20 and 24 sentences positively. This shows that the improvement in precision is substantial and of high importance for the user (from approx. 0.02 to more than 0.2, depending on the selected
settings). For English, several authors report better (or even perfectly) performing systems, but as shown in Section 5.3.3, the quantitative evaluation results are rather subjective, as a 10% higher precision can be reported simply by loosening the criteria of what counts as a definition. For instance, Reiplinger et al. (2012) model the ACL domain, which is the same type of highly specialized corpus. They propose a five-point scoring system: if only sentences providing precise and concise descriptions of the concept are considered (score 5), their method achieves between 10% and 15% precision for the definitions of a selected set of 20 terms, while if the upper three levels are considered the precision is about 60%, which is comparable to our results when biased towards improved precision. We believe that the qualitative analysis of Section 5.3.4—complementing the quantitative results—is of major importance for an increased understanding of the domain, as well as of the complexity of the definition extraction task. In further work, we plan to include the recently developed tagger and lemmatizer for Slovene (Grčar et al., 2012) and Tree Tagger (Schmid, 1994) for English and check how they affect the results, especially for pattern-based definition extraction. The integration of a chunker—for Slovene, we could use the information from the recently developed dependency parser (Dobrovoljc et al., 2012)—would help in noun phrase detection, which is useful for all three methods. Another research direction is, instead of searching for all definition sentences as we do now, to limit the search to definitions of a list of previously selected terms.


6 Workflow implementation in ClowdFlows

This section describes the details of the online NLP workflow for definition extraction from Slovene and English text corpora, implemented in the ClowdFlows workflow construction and execution environment (Kranjc et al., 2012). The ClowdFlows environment has already been presented in the related work section (Section 4.3.4). In this section we describe the constituent parts of the workflow. Since we use Patterns (Pa), Terms (Te) and Wordnet (W), it is called the PaTeW workflow. The implementation of the definition extraction methodology as a workflow was conceived together with the co-authors of the papers Pollak et al. (2012a, 2012c), who had a major role in the actual incorporation of the definition extraction modules into the workflow execution engine. While an early implementation of the definition extraction workflow is described in Pollak et al. (2012a), Pollak et al. (2012c) focuses on the implementation of the ToTrTaLe annotation workflow.

A workflow in ClowdFlows is a set of widgets and connections. A widget is a single workflow processing unit with inputs, outputs and parameters. Connections are used to transfer data between two widgets and may exist only between an output of one widget and an input of another widget. Parameters are set manually by the user. The definition extraction workflow is shown in Figure 12. Besides the main terminology and definition extraction widgets, several other new auxiliary text processing and file manipulation widgets were developed and incorporated to enable seamless workflow execution.

Figure 12: Definition extraction workflow in ClowdFlows, available online at http://clowdflows.org/workflow/1380


The workflow contains different types of widgets: the preprocessing steps are covered by the Load corpus widget and the ToTrTaLe widget, term extraction is implemented in the Term extraction widget, and the main definition extraction step is implemented by three definition extraction widgets (Definition Extraction By Patterns, Definition Extraction By Terms, Definition Extraction By WNet), followed by the Merge sentences widget used for combining the results. The output visualization is provided by the Term candidates viewer and the Sentence viewer widgets.

The ontology construction phase described in Section 3.3 is currently not part of the definition extraction methodology, but it could in the future, if made accessible as a web service, be incorporated in the process and serve in the corpus inspection phase as well as in the final glossary construction phase.

While the term and definition extraction algorithms were implemented in Perl, the web services were implemented in the Python programming language (additionally, some freeware software packages were used, as explained in Section 6.1). The services are currently adapted to run on Unix-like operating systems, but are easily transferable to other operating systems.

The two main contributions of this workflow implementation are the ToTrTaLe web service54 and the Definition extraction web service55. The ToTrTaLe web service has two main functionalities: the first converts different input files to plain text format (see the Load corpus widget described in Section 6.1), while the second—available through the ToTrTaLe widget presented in Section 6.2—uses the (already existing) ToTrTaLe morphosyntactic tagger and lemmatizer to annotate the corpus. The two functionalities correspond to two operations described in one WSDL (Web Service Description Language) file. The Definition extraction web service, on the other hand, is available through three different widgets, one for each definition extraction technique (pattern-, term- and wordnet-based definition extraction). ClowdFlows can automatically construct widgets for web services, where each operation maps into one widget (and one web service can have several operations); it identifies the inputs and the outputs of the web service's operations from the WSDL description.

In addition to the web services mentioned above, further functionalities were required to adequately support the user in using these web services, and some additional platform-specific widgets were implemented accordingly. These widgets, not exposed as web services, are run on the server hosting the ClowdFlows application. In the following subsections we present the widgets that constitute the definition extraction workflow.
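To make the mapping from WSDL operations to widgets more concrete, the following is a minimal sketch of how one operation of such a WSDL-described service could be called from a standalone Python client, here using the third-party zeep SOAP library; the WSDL URL, operation name and parameter names are hypothetical placeholders and not the actual service description.

```python
# Minimal sketch of calling one operation of a WSDL-described web service from
# a standalone client, using the third-party 'zeep' SOAP library. The WSDL URL,
# operation name and parameter names are hypothetical placeholders.
from zeep import Client

client = Client("http://example.org/totrtale?wsdl")  # hypothetical WSDL URL
client.wsdl.dump()  # print the operations, inputs and outputs found in the WSDL

# Each WSDL operation maps to one ClowdFlows widget; calling an operation
# directly shows the inputs and outputs that the widget would expose.
result = client.service.annotate(          # hypothetical operation name
    document="Lematizacija je postopek ...",
    language="sl",
    postprocess=True,
)
print(result)
```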

6.1 Load corpus widget
The load corpus widget allows the user to conveniently upload his corpus in various formats, either as a single file or as several files compressed into a single ZIP file. The supported formats are PDF, DOC, DOCX, TXT and HTML, the latter being passed to the service as a URL in the document variable. Before being transferred, the files are encoded in the Base64 representation, since some of them may be binary.

54 This work was realized together with co-authors (Pollak et al., 2012c). The implementation was mainly done by Nejc Trdin. 55 The implementation was done in collaboration with Anže Vavpetič (cf. Pollak et al., 2012a).


On the service side, the first step is to decode the Base64 representation of the document. Based on the file extension, the program chooses the correct converter:
- If the file extension is HTML, we assume that a URL address is passed in the document variable and that the referenced document contains only plain text. The web service then downloads the document from the given URL as plain text.
- DOCX Microsoft Word documents are essentially compressed ZIP files containing the parts of the document in XML. The content of the file is first unzipped, and then all the plain text is extracted.
- DOC Microsoft Word files are converted using an external tool, wvText (Lachowicz and McNamara, 2006), which transforms the file into plain text. The tool is needed because the file is a compiled binary format whose contents are hard to extract manually without appropriate tools.
- PDF files are converted with the Python pdfminer library (Shinyama, 2010), a solid implementation for reading PDF files with which one can extract the text, images, tables, etc.
- If the file name ends with TXT, the file is assumed to be already in plain UTF-8 text format. The file is only read and sent to the output.
- ZIP files are extracted into a flat directory and converted appropriately—as above—based on the file extension. Note that ZIP files inside ZIP files are not permitted.

The resulting text representation is then passed through several regular expression filters in order to further normalize the text; for instance, sequences of whitespace characters are merged into a single character. The final step is sending the data, but before that, a unique identifier for each file is added to the beginning of the single plain text output. The subsequent steps leave these identifiers untouched, so the analysis can be traced through the whole workflow. At each step of the web service process, errors are accumulated in the error output variable. A simplified sketch of this conversion and normalization logic is given below.
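The following Python fragment is a minimal sketch, under simplifying assumptions, of the extension-based conversion dispatch and whitespace normalization described above; the helper names, the file-identifier format and the stubs standing in for the wvText and pdfminer conversions are illustrative, not the actual service code.

```python
import base64
import io
import re
import urllib.request
import zipfile

def normalize(text):
    # Merge runs of whitespace into a single character, as the regular
    # expression filters described above do.
    return re.sub(r"\s+", " ", text).strip()

def docx_to_text(data):
    # DOCX files are ZIP archives: read the main XML part and strip the tags.
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        xml = archive.read("word/document.xml").decode("utf-8", errors="ignore")
    return re.sub(r"<[^>]+>", " ", xml)

def doc_to_text(data):
    # Placeholder: the service calls the external wvText tool here.
    raise NotImplementedError("wvText conversion not shown in this sketch")

def pdf_to_text(data):
    # Placeholder: the service uses the pdfminer library here.
    raise NotImplementedError("pdfminer conversion not shown in this sketch")

def convert(name, payload):
    # 'payload' is the Base64-encoded document transferred to the service.
    data = base64.b64decode(payload)
    ext = name.rsplit(".", 1)[-1].lower()
    if ext == "html":
        # The payload is assumed to hold a URL pointing to plain text.
        url = data.decode("utf-8").strip()
        text = urllib.request.urlopen(url).read().decode("utf-8", errors="ignore")
    elif ext == "docx":
        text = docx_to_text(data)
    elif ext == "doc":
        text = doc_to_text(data)
    elif ext == "pdf":
        text = pdf_to_text(data)
    elif ext == "txt":
        text = data.decode("utf-8", errors="ignore")
    else:
        raise ValueError("unsupported format: " + ext)
    # Prepend a file identifier (format is an assumption) so that the document
    # can be traced through the rest of the workflow.
    return "###FILE:%s###\n%s" % (name, normalize(text))
```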

6.2 ToTrTaLe widget
The second operation of the ToTrTaLe web service, available through the ToTrTaLe widget, uses the ToTrTaLe text processing tool (Erjavec, 2011), described in detail in Section 4.3.1, to annotate the input texts. Tokenization, part-of-speech tagging and lemmatization are performed on the input text. The output of these three steps is a string of text tokens, where each word token is annotated with its context-disambiguated part-of-speech tag and the base form of the word, i.e., the lemma, thus abstracting away from the variability of word forms. For the ToTrTaLe annotation web service, the mandatory parameters are the document in plain text format and the language of the text. Additionally, the post-processing input parameter defines whether the post-processing scripts are run on the text. Before our implementation of the web service, ToTrTaLe was available online only as a web application for small portions of raw text.
The post-processing scripts are Perl implementations of corrections for some of the tagging mistakes described in Section 4.4.1. In the current post-processing implementation we
added a list of previously unrecognized abbreviations (such as et al., in sod., cca.) to avoid incorrect splitting of sentences. Wrongly merged sentences are corrected by splitting them into two sentences if certain abbreviations (such as etc.) are followed by a word starting with an upper-case letter (a minimal sketch of this correction is given at the end of this subsection). Other post-processing corrections include the correction of adjective-noun agreement, where we assume that the noun has the correct tag and the preceding adjective takes over its properties. Some other individual mistakes are treated in the post-processing script, but not all of them have been addressed. Even though the majority of the described mistakes are currently handled in this optional post-processing step, they should be taken into consideration in future versions of ToTrTaLe, by improving the tokenization rules or changing the tokenizer, re-training the tagger with larger and better corpora and lexica, and improving the lemmatization models or learner. We did not perform a proper evaluation of the influence of the post-processing step on definition extraction, but already on the Slovene part of the LTC proceedings corpus approximately 350 substitutions concerned sentences wrongly segmented because of the et al. abbreviation alone, which appears in a large majority of the defining sentences.
Since the web service is also useful on its own, not just as part of the definition extraction workflow, a separate ToTrTaLe workflow is also provided, available at http://clowdflows.org/workflow/228/ (cf. Figure 13). The accepted languages are English, Slovene and even historical Slovene. If used as a separate preprocessing step, the user can also select whether the output should be in the XML format (default) or in the plain text format. An example of the XML output is given in Table 4. The data and the processing request are sent by ClowdFlows to the ToTrTaLe annotation operation of the web service, which is run on a remote server. The output is written into the output variable, and any errors are passed to the error variable. The output string variable and the accumulated errors are passed on to the output of the web service, which is then sent back to the client.
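As an illustration of the segmentation corrections described in this subsection, the following is a minimal Python sketch rather than the actual Perl post-processing script; the abbreviation lists and the list-of-sentences representation are simplified assumptions.

```python
import re

# Abbreviations after which a sentence should not have been split (simplified).
PROTECTED = ("et al.", "in sod.", "cca.")
# Abbreviations after which two sentences are sometimes wrongly merged.
MERGE_SPLIT = re.compile(r"(?<=\betc\.)\s+(?=[A-ZČŠŽ])")

def fix_segmentation(sentences):
    """Re-join sentences wrongly split after protected abbreviations and
    split sentences wrongly merged after abbreviations such as 'etc.'."""
    merged = []
    for sentence in sentences:
        if merged and merged[-1].rstrip().endswith(PROTECTED):
            merged[-1] = merged[-1].rstrip() + " " + sentence.lstrip()
        else:
            merged.append(sentence)
    fixed = []
    for sentence in merged:
        # Split in two when 'etc.' is followed by an upper-case word,
        # which usually signals the start of a new sentence.
        fixed.extend(MERGE_SPLIT.split(sentence))
    return fixed

print(fix_segmentation(["Glej Pollak et al.", "(2012) za podrobnosti.",
                        "Corpora, lexica, etc. These are language resources."]))
```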

Figure 13. A screenshot of the ToTrTaLe workflow in ClowdFlows, available online at http://clowdflows.org/workflow/228/.


6.3 LUIZ widget
The term extraction widget implements the monolingual term extraction (for Slovene and English) of the LUIZ term recognition tool (Vintar, 2010). LUIZ is presented in detail in Section 4.3.2 on background technologies. The term extraction consists of two steps: extracting the noun phrase candidates based on morphosyntactic patterns, followed by weighting and ranking of the candidates based on their 'termhood' value, for single-word and multi-word terms separately. Before our implementation, the LUIZ term extractor was available online only as a demo for Slovene terminology extraction. We implemented LUIZ as a web service—more precisely, as one of the operations of the definition extraction web service—available as a workflow widget. It is composed of lexicon extraction, keyness ranking (lemmas ranked by relative frequency compared to the reference corpora), candidate noun phrase extraction, and ranking of terminological candidates. Compared to the original system, several details were changed in our implementation, such as filtering out URL tags and punctuation marks and uncapitalizing lemmas from FIDA; most importantly, instead of two lists we produce a single list of ranked terms, where the termhood values of single-word terms and multi-word terminological expressions are placed on a common scale. First a separate list is made for one-word terms and for multi-word terms; each list is then normalized so that the top-ranked terms have the value 1 and the others are proportionally distributed between 0 and 1 (a minimal sketch of this normalization is given at the end of this subsection). The top term candidates—single- and multi-word terms in a common list—extracted from the Slovene part of our Language Technologies Corpus are shown in Table 5 and Table 7. This term extraction web service operation can be used either separately—if we aim only at extracting the terms from a domain corpus—or as a necessary step for the second definition extraction method, implemented by the term-based definition extraction widget (cf. Section 6.4.2). The list of extracted terms can also be presented to the user for manual inspection.
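The following Python fragment is a minimal sketch of the normalization onto a common scale described above; the input format (lists of (term, termhood) pairs) and the toy values are assumptions for illustration.

```python
def merge_term_lists(single_word, multi_word):
    """Normalize two (term, termhood) lists so that the top term of each list
    gets the value 1.0, then merge them into one commonly ranked list."""
    def normalize(pairs):
        if not pairs:
            return []
        top = max(score for _, score in pairs)
        return [(term, score / top) for term, score in pairs]

    merged = normalize(single_word) + normalize(multi_word)
    return sorted(merged, key=lambda pair: pair[1], reverse=True)

# Toy usage with made-up termhood values:
single = [("korpus", 42.0), ("lema", 21.0)]
multi = [("strojno prevajanje", 17.5), ("jezikovne tehnologije", 35.0)]
print(merge_term_lists(single, multi))
```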

6.4 Definition extraction widgets

6.4.1 Pattern-based definition extraction widget
Pattern-based definition extraction is the first of the three definition extraction operations of the developed web service. The pattern-based definition extractor searches for sentences corresponding to predefined lexico-syntactic patterns. The user can upload his own pattern list or use the lists (one for each language) proposed in this thesis. The methodology is presented in detail in Sections 5.1.1 and 5.2.1. Patterns are composed of word forms or lemmas, part-of-speech information, as well as more detailed morphosyntactic descriptions, such as case information for Slovene nouns, or person and tense information for verbs. The basic pattern for Slovene is, for instance, "NP-nom Va-r3[psd]-n NP-nom", where "NP-nom" denotes a noun phrase in the nominative case and "Va-r3[psd]-n" matches the auxiliary verb in the present tense, third person singular, dual or plural, in the non-negated form; in other words, it corresponds to the je/sta/so [is/are] forms of the verb biti [be]. As—at the time of these experiments—there was no chunker available for Slovene, the basic part-of-speech annotation provided
by ToTrTaLe was used to determine the possible noun phrase structures and the positions of their head nouns. In further versions of the system, chunker output could be used, at least for English. In English, cases are not expressed in the same manner, and therefore the nominative-case constraint used in the majority of Slovene patterns cannot be applied. For that reason, the patterns for English are looser and less precise, and an optional beginning-of-sentence parameter can be applied to restrict the number of proposed candidates. The parameter can take three values: nobeg., meaning that a pattern can be found anywhere in a sentence; beg.-novar., the most restrictive setting, in which a pattern must occur at the beginning of a sentence; and beg.-allvar., in which different variations of sentence beginnings are permitted before the pattern. A simplified sketch of matching the basic Slovene pattern is given below.
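Below is a simplified Python sketch of matching the basic Slovene pattern over (word form, MSD tag, lemma) triples produced by the tagger; treating a single nominative noun as a stand-in for a full noun phrase, and reading the case from a fixed position of the noun tag, are deliberate simplifications of the actual Perl implementation.

```python
import re

# je/sta/so: auxiliary 'biti', present tense, third person, non-negated form.
AUX = re.compile(r"Va-r3[psd]-n")

def is_nominative_noun(msd):
    # Simplification: a noun tag whose case slot (assumed at index 4) is 'n'.
    return msd.startswith("N") and len(msd) > 4 and msd[4] == "n"

def matches_basic_pattern(tokens):
    """Rough stand-in for the 'NP-nom Va-r3[psd]-n NP-nom' pattern: a nominative
    noun, followed by the auxiliary je/sta/so, followed by a nominative noun."""
    for i, (_, msd, _) in enumerate(tokens):
        if not is_nominative_noun(msd):
            continue
        for j in range(i + 1, len(tokens)):
            _, msd_j, lemma_j = tokens[j]
            if lemma_j == "biti" and AUX.match(msd_j):
                if any(is_nominative_noun(m) for _, m, _ in tokens[j + 1:]):
                    return True
    return False

# Toy example ("Lematizacija je postopek ..."), with illustrative MSD tags:
sentence = [("Lematizacija", "Ncfsn", "lematizacija"),
            ("je", "Va-r3s-n", "biti"),
            ("postopek", "Ncmsn", "postopek")]
print(matches_basic_pattern(sentence))  # True
```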

6.4.2 Term-based definition extraction widget
The second definition extraction operation of the web service is implemented in the term-based definition extraction widget, which is primarily tailored to extract knowledge-rich contexts, as it focuses on sentences that contain at least n domain-specific single- or multi-word terminological expressions (terms). The term-based definition extraction uses the results of the term extraction web service. The parameters of this module are: the number of terms, the termhood threshold (defined as a percentage of terms, a number of terms or the termhood value itself), the number of terms in the nominative case (for Slovene), and the constraints that a verb should occur between two terms, that the first term should be a multi-word term, and that the sentence should begin with a term. The evaluation of different parameter settings is given in Sections 5.1.2 and 5.2.2 for Slovene and English, respectively. A minimal sketch of the core term-count filter is given below.
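As a minimal sketch of the core filter (not of the full parameter set listed above), the following Python fragment keeps sentences containing at least n of the extracted terms; the lemmatized-sentence and term representations are simplified assumptions.

```python
def term_based_candidates(sentences, terms, min_terms=2):
    """Keep sentences (given as lists of lemmas) that contain at least
    'min_terms' of the extracted domain terms (lemmatized, space-separated)."""
    term_set = {term.lower() for term in terms}
    candidates = []
    for lemmas in sentences:
        text = " " + " ".join(lemma.lower() for lemma in lemmas) + " "
        # Count how many single- or multi-word terms occur in the sentence.
        hits = sum(1 for term in term_set if " " + term + " " in text)
        if hits >= min_terms:
            candidates.append(lemmas)
    return candidates

# Toy usage:
terms = ["korpus", "strojno prevajanje"]
sentences = [["strojno", "prevajanje", "uporabljati", "vzporedni", "korpus"],
             ["to", "biti", "kratek", "stavek"]]
print(term_based_candidates(sentences, terms))
```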

6.4.3 Wordnet-based definition extraction widget
The third approach, implemented by the Wordnet-based definition extraction widget, searches for sentences in which a wordnet term occurs together with its direct hypernym. For English we use the Princeton WordNet (PWN, 2010; Fellbaum, 1998), whereas for Slovene we use sloWNet (Fišer and Sagot, 2008), a Slovene counterpart of WordNet. The approach is evaluated and described in more detail in Sections 5.1.3 and 5.2.3. A minimal sketch of the term-hypernym co-occurrence check is given below.
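The following is a minimal Python sketch of the term-hypernym co-occurrence check, using NLTK's interface to the Princeton WordNet for English (an analogous check for sloWNet would need its own loader, which is not shown); matching only single-word lemmas is a simplification.

```python
from nltk.corpus import wordnet as wn  # requires the NLTK 'wordnet' data package

def has_term_with_hypernym(sentence_lemmas):
    """Return True if the sentence contains a WordNet noun together with a
    lemma of one of its direct hypernym synsets."""
    words = {w.lower() for w in sentence_lemmas}
    for word in words:
        for synset in wn.synsets(word, pos=wn.NOUN):
            for hypernym in synset.hypernyms():
                hypernym_lemmas = {l.lower().replace("_", " ")
                                   for l in hypernym.lemma_names()}
                if words & hypernym_lemmas:
                    return True
    return False

# Toy example: 'collection' appears among the direct hypernyms of 'corpus'.
print(has_term_with_hypernym(["a", "corpus", "is", "a", "collection", "of", "texts"]))
```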

6.5 Auxiliary widgets

6.5.1 Merge sentences widget
The sentence merger widget allows the user to combine the results of several definition extraction methods: Intersection outputs the sentences that were extracted by at least two of the three methods, while Union takes the sentences extracted by any of the methods. A minimal sketch of both modes is given below.
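A minimal Python sketch of the two combination modes, assuming each method returns its candidate sentences as a set:

```python
from itertools import combinations

def union(results):
    """Sentences extracted by any of the methods."""
    return set().union(*results)

def intersection(results):
    """Sentences extracted by at least two of the methods."""
    agreed = set()
    for first, second in combinations(results, 2):
        agreed |= first & second
    return agreed

# Toy usage with the outputs of the three definition extraction methods:
patterns, terms, wordnet = {"s1", "s2"}, {"s2", "s3"}, {"s3", "s4"}
print(union([patterns, terms, wordnet]))         # {'s1', 's2', 's3', 's4'}
print(intersection([patterns, terms, wordnet]))  # {'s2', 's3'}
```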

6.5.2 String to file widget
This widget, available in ClowdFlows, is used for saving the output to a file.


6.5.3 Term viewer widget
The term candidate viewer widget formats and displays the terms (in lemmatized and canonical form) and their scores, as returned by the term extraction widget.

Figure 14. Illustrating the term candidate viewer widget functionality.

6.5.4 Sentence viewer widget
The sentence viewer widget (cf. Figure 15), similarly to the term candidate viewer widget, formats and displays the candidate definition sentences returned by the corresponding methods.

Figure 15. Illustrating the definition candidate viewer widget functionality.


In further work, we plan to test the influence of preprocessing by comparing the ToTrTaLe widget with other preprocessing tools, such as Tree Tagger for English (Schmid, 1994) or the recently developed Obeliks for Slovene (Grčar et al., 2012), or alternatively to retrain ToTrTaLe with larger and more recent corpora. For term extraction, we could compare the English part of LUIZ with comparable systems, such as the one by Sclano and Velardi (2007) or Macken et al. (2013), if they were made available as web services. In the current LUIZ implementation FidaPLUS (Arhar Holdt and Gorjanc, 2007) is used as a reference corpus, but we could easily update it with the recently developed Gigafida (Logar Berginc et al., 2012). For the wordnet-based definition extraction using sloWNet, the widget can be updated by replacing the current sloWNet entries with the updated version. To the best of our knowledge, the developed workflow is the only terminology and definition extraction workflow publicly available online; it can be applied to new corpora, enhanced with new modules and adapted to new languages by other researchers. While this workflow includes some tools which were previously developed by other authors (ToTrTaLe by Erjavec, 2011; LUIZ by Vintar, 2010), these modules have been refined and implemented as web services (Pollak et al., 2012c), enabling their inspection and reuse by other NLP researchers.


7 Conclusions and further work

The present dissertation addresses the problem of domain modeling from multilingual corpora, focusing on the task of definition extraction, the main added value being the extraction of definitions from Slovene texts. We addressed a real-life setting of modeling domain knowledge from existing corpora. For Slovene, there are many domains for which domain knowledge is not yet structured, and for which no terminological dictionaries or ontologies are available. However, small domain corpora can be collected and used as a basis for automatic or semi-automatic domain modeling. In our case, we decided to focus on the domain of Language Technologies.

We first constructed the Language Technologies Corpus, containing scientific articles, Bachelor's, Master's or PhD theses, as well as a few book chapters and Wikipedia articles. The Slovene domain corpus of approx. 1 million word tokens was collected, preprocessed and annotated. Subsequently, we constructed a comparable corpus in English, meaning that it covers the same domain and is approximately the same size. On this corpus (consisting of a Slovene and an English subcorpus) we first performed semi-automatic topic ontology construction using OntoGen (Fortuna et al., 2007), a semi-automatic and data-driven topic ontology editor. These topic ontologies (built for each language separately) were used for obtaining insight into the corpus, semi-automatically splitting the language technologies domain into two bigger subtopics (speech technologies and language technologies) and more specific subtopics for each field; e.g., in the Slovene subcorpus the language technologies domain was split into subdomains such as information extraction, computer-assisted translation, corpora and computational semantics, each of these containing further subdomains.

Having constructed this initial domain model, we were interested in extracting more specific domain knowledge, i.e., the domain terminology and definitions. For terminology extraction, we re-implemented the monolingual terminology extraction approach of the LUIZ system (Vintar, 2010), whereas the main focus and the main contribution of this dissertation is the semi-automatic definition extraction methodology, its implementation and the analysis of its results.

Several distinguishing aspects are characteristic of our work. Firstly, our definition extraction methodology is the only such methodology available for Slovene. Secondly, since we focus on Slovene, we do not perform complex text preprocessing steps and rely on simple PoS annotation only, due to the unavailability of more sophisticated annotation software for Slovene at the time the methodology was developed (in further versions the recently developed parser (Dobrovoljc et al., 2012) could be included in the methodology). Far from being interested only in the quantitative evaluation of the proposed definition extraction approach, one of the core points of the dissertation is also to extract a pilot language technologies glossary as well as to show and analyze a variety of definition types and related problems. Finally, we implemented our definition extraction methodology—from the initial corpus preprocessing to the final inspection of domain terms and definitions—as a publicly available workflow, where a user can, without
any installation, try and use the workflow for modeling his own corpora in Slovene or English, or use specific workflow components in building new workflows.

The main definition extraction methodology that we have developed consists of three definition extraction modules for each language, i.e., the pattern-based, the term-based and the wordnet-based definition extraction module, which can also be combined in a number of ways. The first—pattern-based—approach seeks out sentences corresponding to predefined lexico-syntactic patterns. Based on a preliminary analysis of some examples, a list of patterns was proposed, covering a large variety of definition types besides the standard "is_a" type. We have also evaluated the performance of the different patterns. For Slovene, case information is often used, while for English we also evaluated various settings concerning the position of the pattern (including an extra condition that the pattern we search for starts the sentence).

The term-based approach was designed primarily for extracting knowledge-rich contexts. A starting point is that a good definition candidate should contain at least two domain terms (possibly the definiendum and its hypernym, but relational, extensional and other definition types are also targeted). Next, other conditions—limiting the number of extracted sentences and mostly leading to increased precision—are added. These include the nominative condition, demanding that the domain terms be in the nominative case (for Slovene only); the verb condition (a verb should occur between two domain terms); the beginning-of-sentence condition (one term should be the first or second word in a sentence); and the condition that the first term should be a multi-word expression. The main parameter settings, i.e., the threshold parameter and the number of terms (and the number of multi-word terms), can also be set in different ways in order to optimize precision or recall.

The last approach to definition extraction is wordnet-based. Using the Princeton WordNet (PWN, 2010; Fellbaum, 1998) for English and sloWNet (Fišer and Sagot, 2008) for Slovene, this approach aims to extract sentences that contain a term present in the respective wordnet together with its direct hypernym. One of the problems observed with this approach is the low coverage of terms specific to the language technologies domain.

The three methods were combined in different ways, leading to a relatively low precision and recall compared to some state-of-the-art systems for English (e.g., Navigli and Velardi, 2010, extracting definitions for a list of terms, especially from the Web), but comparable to related systems for Slavic languages (e.g., Kobyliński and Przepiórkowski, 2008, as part of the LT4eL project (2008), which included Polish, Bulgarian and Czech).

The focus of the dissertation was on an in-depth analysis and discussion of different definition candidates, showing that the decision to classify a sentence as a definition or a non-definition is a difficult task in itself, and that a vast majority of the examples dealt with are borderline cases. The sentences extracted from scientific articles or theses are often very complex. During the thesis elaboration, we evaluated approx. 19,300 Slovene and 14,000 English sentences as definitions or non-definitions, of which more than 1,500 sentences (approx. 1,050 for Slovene and more than 500 for English) were classified as definitions. Moreover, more than 3,400 sentences (including 486 Slovene and 230 English definitions) were analyzed in more detail. This subset of 3,400 Slovene and English sentences was annotated with different tags, showing the complexity of the problem on such a difficult corpus. There are five different groups of tags. The first is the definition form category, which comprises the tags related to the analysis of the definition type. Within this category we observed that besides the main definition types
(e.g., genus and differentia definitions or paraphrases, which do not have a special tag), other types of definitions occur, such as extensional definitions, definitions without a hypernym (e.g., typifying or functional definitions), incomplete definitions where only a hypernym is provided, or relational definitions (using the definiendum's synonyms, antonyms or sibling concepts). The next tag group analyzes definition content by discussing problems such as too general or too specific definition content, outdated or metaphorical definitions, etc. Definiendum-related tags were used to add the information that a definiendum is, e.g., a named entity or an abbreviation, or that the term to be defined is out of the domain. The rest of the analysis concerns segmentation-related issues (pointing to definitions that, e.g., span across several sentences, or to sentences containing several definitions) and annotation-related issues (identifying doubles or sentences for which the label definition/non-definition should be reconsidered).

The annotated dataset is valuable from the linguistic perspective, as well as a potential resource for a machine learning approach to be used in further work. The main definition/non-definition labels can be used for setting up a simple classification task, while the more fine-grained labels could help in setting up a system for ranking the extracted definitions, or in the development of systems for extracting semantic relations. Some labels could have a specific use; segmentation labels, for instance, can be used to further address segmentation errors.

The presented qualitative analysis complements the quantitative evaluation well. With some exceptions (e.g., Westerhout, 2010, working on glossary creation for Dutch), the qualitative analysis of results is often ignored. As shown by the examples of definition candidates extracted from our corpus, the corpus is far from containing only simple sentences and represents difficult material for automatic definition extraction. An inter-annotator agreement (IAA) experiment showed that even human evaluators do not easily agree on what a definition is. In our case, the fixed marginal kappa is only 0.33 (we plan to repeat the experiments with more strictly defined criteria and check whether the IAA score improves). However, this situation is very realistic: for Slovene, new (or constantly growing) domains only rarely have well-structured resources, such as Wikipedia entries or textbooks, or large amounts of text available on the Web (since in some domains authors publish their work mainly in English). Therefore, only a limited amount of academic papers and similar works can be used as material for definition extraction.

Finally, an important contribution of the present dissertation is the implementation of the definition extraction pipeline in the ClowdFlows workflow construction environment, meaning that the proposed definition extraction workflows, as well as their constituent parts, can be used online, with no prior installation required. This is an important benefit for the Slovene language technologies community (even a very simple tool, such as the ToTrTaLe web service for corpus annotation, is an important contribution for any further Slovene NLP workflow). Moreover, our terminology and definition extraction system is, to the best of our knowledge, the only such system available as an online workflow, and can therefore be very easily combined with other NLP components, even for English.
In future research, we will pursue several directions. Since the workflow is modular, an obvious step is to add other preprocessing tools to the current implementation—such as Tree Tagger for English (Schmid, 1994) or the recently developed Obeliks for Slovene (Grčar et al., 2012)—as well as other term extraction systems (e.g., Sclano and Velardi, 2007), if made available as web services. In addition, we plan to continue the research along the following lines. First, machine learning
methods will be used to try to improve the results (our preliminary experiments were reported in Fišer et al. (2010)). In new experiments we will use the insights (transformed into features) from the experiments presented in this thesis and implement a system for ranking the extracted definition candidates. Next, the methodology for automatic term alignment from comparable corpora will be implemented in the workflow environment and tested on our corpus (our initial work is presented in Ljubešić et al., 2011 and Fišer et al., 2011). Moreover, as shown in a quick experiment, the quantitative results are very subjective and depend greatly on the evaluation criteria; we will therefore re-evaluate part of our dataset using the five-point scale proposed in Reiplinger et al. (2012). Our major focus will be on performing new experiments on different (possibly less complex) corpora and exploring the influence of text type on definition extraction, as well as on comparing our approach with other systems, where we will consider training the word-class lattices on Wikipedia as proposed in Faralli and Navigli (2013). In addition, we will consider other possible applications of our methodology, e.g., as a preprocessing step for knowledge discovery tasks, since—if the parameters are set adequately—one can filter the text by preserving only knowledge-rich sentences.

List of URLs of developed tools and resources:

- The developed PaTeW definition extraction workflow: http://clowdflows.org/workflow/1380
- The implemented ToTrTaLe workflow: http://clowdflows.org/workflow/228
- The pilot Language technologies glossary: http://kt.ijs.si/senja_pollak/jt_glosar/
- Part of the corpus available through a concordancer: http://nl.ijs.si/cuwi/sdjt_sl/
- The reimplemented LUIZ terminology extractor as part of the PaTeW workflow.


8 References

Ahmad, K., L. Gillam, and L. Tostevin. 2007. "University of Surrey Participation in TREC8: Weirdness Indexing for Logical Document Extrapolation and Retrieval (WILDER)." In Proceedings of the Eighth Text REtrieval Conference (TREC-8), 717–724.

Anderson, Ashton, Dan Jurafsky, and Daniel A. McFarland. 2012. “Towards a Computational History of the ACL: 1980-2008.” In Proceedings of the ACL-2012 Special Workshop on Rediscovering 50 Years of Discoveries, 13–21. Jeju Island, Korea: Association for Computational Linguistics.

Arhar Holdt, Špela, and Vojko Gorjanc. 2007. “Korpus FidaPLUS: Nova generacija slovenskega referenčnega korpusa.” Jezik in slovstvo 52 (2): 95–110.

Atkins, B. T. Sue, and Michael Rundell. 2008. The Oxford Guide to Practical Lexicography. New York: Oxford University Press.

Ayto, John. 1983. “On Specifying Meaning: Semantic Analysis and Dictionary Definitions.” In Lexicography: Principles and Practice, edited by Reinhard Hartmann, 89–98. London: Academic Press.

Banchs, Rafael E., ed. 2012. Proceedings of the ACL-2012 Special Workshop on Rediscovering 50 Years of Discoveries. Jeju Island, Korea: Association for Computational Linguistics.

Baker, Mona, and Gabriela Saldanha. 2009. Routledge Encyclopedia of Translation Studies. 2nd ed. Routledge (Taylor and Francis Group).

Barnbrook, Geoff, and John Sinclair. 1994. “Parsing Cobuild Entries.” In The Languages of Definition: The Formalisation of Dictionary Definitions for Natural Language Processing, edited by John Sinclair, Martin Hoelter, and Carol Peters, 13–58. Luxembourg: European Commission.

Béjoint, Henri. 2000. Modern Lexicography: An Introduction. Oxford: Oxford University Press.

Bergenholtz, Henning. 1995. “Wodurch Unterscheidet Sich Fachlexikographie von Terminologie.” Lexicographica 11: 50–59.

Berland, Matthew, and Eugene Charniak. 1999. “Finding Parts in Very Large Corpora.” In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics, 1910:57–64.


Berthold, Michael R, Nicolas Cebron, Fabian Dill, Thomas R Gabriel, Tobias Kötter, Thorsten Meinl, Peter Ohl, Christoph Sieb, Kilian Thiel, and Bernd Wiswedel. 2008. “KNIME: The Konstanz Information Miner.” Data Analysis Machine Learning and Applications 11: 319–326.

Bessé, Bruno de, Blaise Nkwenti-Azeh, and Juan C. Sager. 1997. “Glossary of Terms Used in Terminology.” Terminology 4 (1): 117–156.

Biemann, Chris. 2005. "Ontology Learning from Text – a Survey of Methods." LDV-Forum 20 (2): 75–93.

Bird, Steven, R. Dale, B. Dorr, B. Gibson, M. Joseph, M.-Y. Kan, D. Lee, B. Powley, D. Radev, and Y.-F. Tan. 2008. “The ACL Anthology Reference Corpus: A Reference Dataset for Bibliographic Research in Computational Linguistics.” In Proceedings of the 6th International Language Resources and Evaluation Conference (LREC 2008). Marrakech, Morocco.

Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3: 993–1022.

“BNC.” 2001. The British National Corpus, Version 2 (BNC World). Mike Scott’s list available at: http://www.lexically.net/downloads/BNC_wordlists/downloading BNC.htm. Last accessed January 15, 2012.

Borg, Claudia. 2009. “Automatic Definition Extraction Using Evolutionary Algorithms”. Master's thesis. University of Malta.

Borg, Claudia, Michael Rosner, and Pace Gordon. 2009. “Evolutionary Algorithms for Definition Extraction.” In Proceedings of the 1st Workshop in Definition Extraction. Borovets, Bulgaria.

Borg, Claudia, Mike Rosner, and Gordon J. Pace. 2010. "Automatic Grammar Rule Extraction and Ranking for Definitions." In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), edited by Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, and Daniel Tapias, 2577–2584. Valletta, Malta: European Language Resources Association (ELRA).

Borsodi, R. 1967. The Definition of Definition. Porter Sargent Publisher.

Brants, Thorsten. 2000. "TnT – A Statistical Part-of-Speech Tagger." In Proceedings of the Sixth Applied Natural Language Processing Conference (ANLP-2000), 224–231. Seattle, WA.

Buitelaar, P., P. Cimiano, and B. Magnini. 2005. Ontology Learning from Text: Methods, Evaluation and Applications. Frontiers in Artificial Intelligence and Applications, Vol. 123. IOS Press.

Cabré Castellví, M. Teresa. 2003. “Theories of Terminology: Their Description, Prescription and Explanation.” Terminology 9 (2): 163–199.


Cabré, Maria Teresa. 1999. Terminology: Theory, Methods, and Applications. Edited by Juan C. Sager. Amsterdam-Philadelphia: John Benjamins Publishing.

Cohen, Jacob. 1960. “A Coefficient of Agreement for Nominal Scales.” Educational and Psychological Measurement 20: 37–46.

Cohen, S. Marc. 2012. “Aristotle’s Metaphysics.” In The Stanford Encyclopedia of Philosophy, edited by Edward N. Zalta, Summer 2011 ed. Available at: http://plato.stanford.edu/entries/aristotle-metaphysics/. Last accessed in February 2013.

Cohen, Trevor, and Dominic Widdows. 2009. “Empirical Distributional Semantics: Methods and Biomedical Applications.” Journal of Biomedical Informatics 42 (2): 390–405.

Copi, Irving M., and Carl Cohen. 2009. Introduction to Logic. 13th ed. Upper Saddle River, New Yersey: Pearson/Prentice Hall.

Cui, Hang, Min-Yen Kan, and Tat-Seng Chua. 2007. "Soft Pattern Matching Models for Definitional Question Answering." ACM Transactions on Information Systems (TOIS) 25 (2).

De Bessé, Bruno. 1994. "Contribution à La Définition de La Terminologie." In Langues et Sociétés En Contact: Mélanges Offerts à Jean-Claude Corbeil (Canadiana Romanica, Vol. 8), edited by Pierre Martel and Jacques Maurais, 135–138. Tübingen: Max Niemeyer.

Degórski, Lukasz, Łukasz Kobyliński, and Adam Przepiórkowski. 2008a. "Definition Extraction: Improving Balanced Random Forests." In Proceedings of the International Multiconference on Computer Science and Information Technology (IMCSIT 2008): Computational Linguistics – Applications (CLA'08), 353–357. Wisła, Poland: IEEE.

Degórski, Lukasz, Michał Marcińczuk, and Adam Przepiórkowski. 2008b. "Definition Extraction Using a Sequential Combination of Baseline Grammars and Machine Learning Classifiers." In Proceedings of the 6th International Language Resources and Evaluation Conference (LREC 2008), 837–841. Marrakech, Morocco: European Language Resources Association (ELRA).

Del Gaudio, Rosa, Gustavo Batista, and António Branco. 2013. “Coping with Highly Imbalanced Datasets: A Case Study with Definition Extraction in a Multilingual Setting.” Natural Language Engineering (FirstView): 1–33.

Del Gaudio, Rosa, and António Branco. 2007. “Automatic Extraction of Definitions in Portuguese: A Rule-Based Approach.” In Proceedings of the 13th Portuguese Conference on Artificial Intelligence (EPIA2007). LNAI 4874., edited by José Neves, Manuel Filipe Santos, and José Manuel Machado, 4874:659 – 670. Springer Berlin Heidelberg.


Demšar, Janez, Blaž Zupan, Gregor Leban, and Tomaz Curk. 2004. “Orange: From Experimental Machine Learning to Interactive Data Mining.” In Knowledge Discovery in Databases: PKDD 2004. LNCS Vol. 3202., edited by Jean-François Boulicaut, Floriana Esposito, Fosca Giannotti, and Dino Pedreschi, 3202:537–539. Berlin, Heidelberg: Springer Berlin Heidelberg.

Dobrovoljc, Kaja, Simon Krek, and Jan Rupnik. 2012. “Skladenjski Razčlenjevalnik Za Slovenščino.” In Proceedings of the Eighth Language Technologies Conference, edited by Tomaž Erjavec and Jerneja Žganec Gros. Ljubljana: Institut Jožef Stefan.

Dubois, J., and C. Dubois. 1971. Introduction à La Lexicographie: Le Dictionnaire. Paris: Larousse.

Dolezal, Frederic. 1992. “The Meaning of Definition.” Lexicographica: 1–9.

Erjavec, Tomaž. 2011. “Automatic Linguistic Annotation of Historical Language: ToTrTaLe and XIX Century Slovene.” In Proceedings of the ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (ACL 2011).

Erjavec, Tomaž. 2012a. "The Goo300k Corpus of Historical Slovene." In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), 2257–2260. Istanbul, Turkey.

Erjavec, Tomaž. 2012b. "MULTEXT-East: Morphosyntactic Resources for Central and Eastern European Languages." Language Resources and Evaluation 46 (1): 131–142.

Erjavec, Tomaž, and Sašo Džeroski. 2004. "Machine Learning of Language Structure: Lemmatising Unknown Slovene Words." Applied Artificial Intelligence 18 (1): 17–41.

Erjavec, Tomaž, Darja Fišer, Simon Krek, and Nina Ledinek. 2010. “The JOS Linguistically Tagged Corpus of Slovene.” In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), edited by Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, and Daniel Tapias, 1806–1809. Valletta, Malta: European Language Resources Association (ELRA).

Erjavec, Tomaž, Camelia Ignat, Bruno Pouliquen, and Ralf Steinberger. 2005. “Massive Multi-Lingual Corpus Compilation: Acquis Communautaire and ToTaLe.” In Proceedings of the 2nd Language & Technology Conference (April 21–23, 2005), 32–36. Poznan, Poland.

Fahmi, Ismail, and Gosse Bouma. 2006. “Learning to Identify Definitions Using Syntactic Features.” In Proceedings of the EACL Workshop on Learning Structured Information in Natural Language Applications.


Faralli, Stefano, and Roberto Navigli. 2013. “A Java Framework for Multilingual Definition and Hypernym Extraction.” In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013, Sofia, Bulgaria), 103–108. Association for Computational Linguistics

Feliu, Judit. 2004. “Relacions Conceptuals i Terminologia: Anàlisi i Proposta de Detecció Semiautomàtica”. University Pompeu Fabra, Barcelona.

Fellbaum, Christiane. 1998. WordNet: An Electronic Lexical Database. Edited by Christiane Fellbaum. Cambridge, MA: MIT Press.

Fišer, Darja. 2009. “Izdelava slovenskega semantičnega leksikona z uporabo eno- in večjezičnih jezikovnih virov”. PhD thesis. Faculty of Arts, University of Ljubljana.

Fišer, Darja, Nikola Ljubešić, Špela Vintar, and Senja Pollak. 2011. “Building and Using Comparable Corpora for Domain-Specific Bilingual Lexicon Extraction.” In Proceedings of the 4th BUCC Workshop: Comparable Corpora and the Web (24 June, 2011), 19–26. Portland, Oregon: Association for Computational Linguistics.

Fišer, Darja, Senja Pollak, and Špela Vintar. 2010. “Learning to Mine Definitions from Slovene Structured and Unstructured Knowledge-Rich Resources.” In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), edited by Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, and Daniel Tapias. Valletta, Malta: European Language Resources Association (ELRA).

Fišer, Darja, and Benoît Sagot. 2008. “Combining Multiple Resources to Build Reliable Wordnets.” In Text, Speech and Dialogue: 11th International Conference, (TSD 2008), September 2008. LNCS., edited by Petr Sojka, Ales Horák, Ivan Kopecek, and Karel Pala, 5246:61–68. Brno, Czech Republic: Springer.

Fortuna, Blaž, Marko Grobelnik, and Dunja Mladenić. 2006a. “Semi-Automatic Construction of Topic Ontologies.” Edited by Markus Ackermann, Bettina Berendt, Marko Grobelnik, Andreas Hotho, Dunja Mladenič, Giovanni Semeraro, Myra Spiliopoulou, Gerd Stumme, Vojtěch Svátek, and Maarten Van Someren. Lecture Notes in Computer Science (LNCS). Semantics, Web and Mining (Revised Selected Papers of Joint International Workshop, EWMF 2005 and KDO 2005 (October 3-7, 2005) 4289: 121–131.

Fortuna, Blaž, Marko Grobelnik, and Dunja Mladenić. 2006b. "Semi-Automatic Data-Driven Ontology Construction System." In Proceedings of the 9th International Multi-Conference Information Society IS-2006. Ljubljana, Slovenia.

Fortuna, Blaž, Marko Grobelnik, and Dunja Mladenić. 2007. "OntoGen: Semi-Automatic Ontology Editor." In Human Interface (Part II) (HCI 2007), LNCS, edited by M. J. Smith and G. Salvendy, 4558:309–318. Springer.

Fortuna, Blaž, Dunja Mladenić, and Marko Grobelnik. 2006c. “Visualization of Text Document Corpus.” Informatica 29: 497–502.


Frantzi, K.T., and S. Ananiadou. 1999. “The CValue/NCValue Domain Independent Method for Multi-Word Term Extraction.” Journal of Natural Language Processing 6 (3): 145–179.

Fuertes-Olivera, Pedro A., and Henning Bergenholtz, ed. 2010. E-Lexicography. The Internet, Digital Initiatives and Lexicography. London, New York: Continuum.

Gantar, Polona, and Simon Krek. 2009. “Drugačen pogled na slovarske definicije: opisati, pojasniti, razložiti?” In Infrastruktura slovenščine in slovenistike, edited by Marko Stabej, 151–159. Ljubljana: Znanstvena založba Filozofske fakultete.

Garcia, Daniela. 1997. “Structuration Du Lexique de La Causalité et Réalisation D’un Outil D'aide Au Repérage de L'action Dans Les Textes.” In Proceeding de “Rencontres Terminologies et Intelligence Artificielle” (TIA 97). Toulouse, France.

Gaussier, Eric. 1998. “Flow Network Models for Word Alignment and Terminology Extraction From Bilingual Corpora.” In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics (Coling-ACL), 444–450.

Geeraerts, Dirk. 2003. “Meaning and Definition.” In A Practical Guide to Lexicography, edited by P. G. J. Van Sterkenburg, 83–93. John Benjamins Publishing.

Geeraerts, Dirk. 2010. Theories of Lexical Semantics. Oxford: Oxford University Press.

Gergonne, J. D. 1818. “Essai Sur La Théorie Des Définitions.” Annales de Mathématiques Pures et Appliquées 9.

Girju, Roxana, Adriana Badulescu, and Dan Moldovan. 2003. “Learning Semantic Constraints for the Automatic Discovery of Part-Whole Relations.” In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL 2003), 1–8. Morristown, NJ, USA: Association for Computational Linguistics.

Girju, Roxana, Adriana Badulescu, and Dan Moldovan.. 2006. “Automatic Discovery of Part-Whole Relations.” Computational Linguistics 32 (1): 83–135.

Glanzberg, Michael. 2011. “Meaning, Concepts, and the Lexicon”. Croatian Journal of Philosophy 31: 1-29.

Gómez-Pérez, Asuncion, and David Manzano-Macho. 2003. “A Survey of Ontology Learning Methods and Techniques (Deliverable 1.5).”

Granger, Sylviane. 2012. “Introduction: Electronic Lexicography-from Challenge to Opportunity.” In Electronic Lexicography, edited by Sylviane Granger and Magali Pacqot, 1–15. Oxford University Press.


Granger, Sylviane, and Magali Paquot, ed. 2012. Electronic Lexicography. Oxford University Press.

Grčar, Miha, Simon Krek, and Kaja Dobrovoljc. 2012. "Obeliks: Statistični Oblikoskladenjski Označevalnik in Lematizator Za Slovenski Jezik." In Proceedings of the Eighth Language Technologies Conference, edited by Tomaž Erjavec and Jerneja Žganec Gros. Ljubljana: Institut Jožef Stefan.

Gruber, Thomas. 1993. “Towards Principles for the Design of Ontologies Used for Knowledge Sharing.” Edited by N Guarino and R Poli. Formal Ontology in Conceptual Analysis and Knowledge Representation 43 (5-6): 907–928.

Gupta, Parth, and Paolo Rosso. 2012. “Text Reuse with ACL: (Upward) Trends.” In Proceedings of the ACL-2012 Special Workshop on Rediscovering 50 Years of Discoveries, 76–82. Jeju Island, Korea: Association for Computational Linguistics.

Hall, David Leo Wright, Daniel Jurafsky, and Christopher Manning. 2008. “Studying the History of Ideas Using Topic Models.” In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP ’08), 363–371. ACL.

Hanks, Patrick. 1987. “Definitions and Explanations.” In Looking Up: An Account of the COBUILD Project in Lexical Computing, edited by John Sinclair, 116–136. London-Glasgow: Collins.

Harris, Zellig Sabbettai. 1954. “Distributional Structure.” Word 10 (2/3): 146–162.

Harris, Zellig Sabbettai. 1968. Mathematical Structures of Language. New York: Interscience Publishers John Wiley & Sons.

Hearst, Marti A. 1992. “Automatic Acquisition of Hyponyms from Large Text Corpora.” In Proceedings of the 14th Conference on Computational Linguistics- Volume 2 (COLING ’92), 2:539–545. Nantes, France.: Association for Computational Linguistics.

Heuberger, R. 2000. Monolingual Dictionaries for Foreign Learners of English: A Constructive Evaluation of the State-of-the-Art. Reference Works in Book Form and on CD-ROM . Vienna: Wilhelm Braumüller.

Hjelmslev, Louis. 1975. "Résumé of a Theory of Language." Travaux Du Cercle Linguistique de Copenhague 16. Copenhague: Nordisk Sprog- og Kulturforlag.

Hull, Duncan, Katy Wolstencroft, Robert Stevens, Carole Goble, Mathew R Pocock, Peter Li, and Tom Oinn. 2006. “Taverna: a Tool for Building and Running Workflows of Services.” Nucleic Acids Research 34: W729–W732.

Humbley, John. 1997. “Is Terminology Specialized Lexicography? The Experience of French-Speaking Countries.” Hermes 18: 13–32.


Iftene, Adrian, Diana Trandaba, and Ionut Pistol. 2007. "Natural Language Processing and Knowledge Representation for e-Learning Environments: Applications for Romanian." In Proceedings of the RANLP 2007 Workshop, 19–25.

Itagaki, Masaki, Takako Aikawa, and Xiaodong He. 2007. “Automatic Validation of Terminology Translation Consistency with Statistical Method.” In Proceedings of the Machine Translation Summit XI, 269–274. Copenhagen, Denmark.

“ISO 1087-1:2000a.” International Standard: Terminology Work — Vocabulary — Part 1: Theory and Application. (Standard cited from the Glossary of Terminology Management of DG TRAD – Terminology Coordination Unit of European Parliament). Last accessed September 3, 2013. http://termcoord.wordpress.com/glossaries/glossary-of-terminology-management/.

“ISO 1087-1:2000b.” 2013. International Standard: Terminology Work — Vocabulary — Part 1: Theory and Application. (Standard cited from ISOcat Web Interface). Last accessed December 1, 2013. https://catalog.clarin.eu/isocat/interface/index.html.

“ISO 1087-2.” 2000. International Standard: Terminology Work - Vocabulary - Part 2: Computer Applications (Standard withdrawn in 2011). (Standard cited from ISOcat Web Interface). Last accessed December 1, 2013. https://catalog.clarin.eu/isocat/interface/index.html.

"ISO 1087:1990." 2013. International Standard: Terminology - Vocabulary. (Standard withdrawn and replaced by ISO 1087-2000.) (Standard cited from Ken Sall's Consulting XML Technology Webpage and from Strehlow and Wright, 1993). Accessed December 1, 2013. http://kensall.com/gov/glossary/glossary.dtd.txt.

“ISO 12620:2009.” International Standard. Terminology and Other Language and Content Resources — Specification of Data Categories and Management of a Data Category Registry for Language Resources. (Standard cited from ISOcat Web Interface). Last accessed December 1, 2013. https://catalog.clarin.eu/isocat/interface/index.html.

"ISO 5127:2001." International Standard: Information and Documentation — Vocabulary. (Standard cited from the Glossary of Terminology Management of DG TRAD – Terminology Coordination Unit of European Parliament). Last accessed September 3, 2013. http://termcoord.wordpress.com/glossaries/glossary-of-terminology-management/.

Ittoo, Ashwin, and Gosse Bouma. 2009. “Semantic Selectional Restrictions for Disambiguating Meronymy Relations.” In Proceedings of the 19th Meeting of Computational Linguistics in the Netherlands, edited by Barbara Plank, Erik Tjong Kim Sang, and Tim Van de Cruys.

Jackson, Howard. 2002. Lexicography: An Introduction. Routledge.


Kageura, Kyo. 2002. The Dynamics of Terminology: A Descriptive Theory of Term Formation and Terminological Growth. John Benjamins Publishing.

Kan, Min-Yen, and Steven Bird. 2013. "ACL Anthology." ACL Anthology. A Digital Archive of Research Papers in Computational Linguistics. (Kan, Min-Yen editor, 2008–; Bird, Steven editor, 2001–2007). Last accessed November 28, 2013. http://aclweb.org/anthology//.

Kobyliński, Łukasz, and Adam Przepiórkowski. 2008. “Definition Extraction with Balanced Random Forests.” In Advances in Natural Language Processing: Proceedings of the 6th International Conference on Natural Language Processing, GoTAL 2008, edited by Bengt Nordström and Aarne Ranta, 5221:237–247. Gothenburg, Sweden: Springer Berlin Heidelberg, LNCS.

Kosem, Iztok. 2006. “Definicijski Jezik v Slovarju Slovenskega Knjižnega Jezika s Stališča Sodobnih Leksikografskih Načel.” Jezik in Slovstvo 51 (5): 25–45.

Kozakov, L., Y. Park, T. Fin, Y. Drissi, Y. Doganata, and T. Cofino. 2004. “Glossary Extraction and Utilization in the Information Search and Delivery System for IBM Technical Support.” IBM Systems Journal 43 (3): 546–563.

Kozareva, Zornitsa, and Eduard Hovy. 2010. “A Semi-Supervised Method to Learn and Construct Taxonomies Using the Web.” In Proceedings of RANLP, 1110–1118. Cambridge, MA.

Kranjc, Janez, Vid Podpečan, and Nada Lavrač. 2012. “ClowdFlows: A Cloud Based Scientific Workflow Platform.” In Proceedings of ECML/PKDD-2012 (2), edited by Peter A Flach, Tijl De Bie, and Nello Cristianini, 7524:816–819. Springer LNCS.

Krek, Simon. 2004. “Slovarji Serije COBUILD in Formalizacija Definicijskega Jezika.” Jezik in Slovstvo 49 (2): 3–16.

Krek, Simon, Iztok Kosem, and Polona Gantar. 2013. “Predlog Za Izdelavo Slovarja Sodobnega Slovenskega Jezika.” Version 1 (May 20, 2013). http://www.sssj.si/datoteke/Predlog_SSSJ.pdf.

Kupiec, Julian. 1993. “An Algorithm for Finding Noun Phrase Correspondences in Bilingual Corpora.” In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics (ACL). Columbus, Ohio, United States.

Logar Berginc, Nataša, Miha Grčar, Marko Brakus, Tomaž Erjavec, Špela Arhar Holdt, and Simon Krek. 2012. Korpusi slovenskega jezika Gigafida, KRES, ccGigafida in ccKRES: gradnja, vsebina, uporaba. Ljubljana: Trojina.

L’Homme, Marie-Claude. 2004. La Terminologie: Principes et Techniques. Presses de l’Université de Montréal.


L’Homme, Marie-Claude, and Elizabeth Marshman. 2006. “Extracting Terminological Relationships from Specialized Corpora.” In Lexicography, Terminology, Translation: Text-Based Studies in Honour of Ingrid Meyer, edited by L. Bowker, 67–80. Ottawa: University of Ottawa Press.

Lachowicz, Dom, and Caolán McNamara. 2006. “wvWare, Library for Converting Word Document.”

Lapata, Mirella, and Sebastian Padó. 2007. “Dependency-Based Construction of Semantic Space Models.” Computational Linguistics 33: 161–199.

Lefever, Els, Lieve Macken, and Veronique Hoste. 2009. “Language-Independent Bilingual Terminology Extraction from a Multilingual Parallel Corpus.” In Proceedings of the 12th Conference of the European Chapter of the ACL, 496–504.

Lewis, C. I. 1929. Mind and the World-Order. Chicago, IL: Charles Scribner’s Sons.

Ljubešić, Nikola, Darja Fišer, Špela Vintar, and Senja Pollak. 2011. “Bilingual Lexicon Extraction from Comparable Corpora: a Comparative Study.” In Proceedings of the 1st International Workshop on Lexical Resources (An ESSLLI 2011 Workshop). Ljubljana, Slovenia.

Lowry, Richard. 2013. “Kappa as a Measure of Concordance in Categorical Sorting.” http://vassarstats.net/kappa.html. Last accessed: December 3, 2013.

Lyons, John. 1977. Semantics. Cambridge: Cambridge University Press.

LT4eL. 2008. Language Technology for eLearning project: www.lt4el.eu. Last accessed: June 3, 2013.

Macken, Lieve, Els Lefever, and Veronique Hoste. 2013. “TExSIS: Bilingual Terminology Extraction from Parallel Corpora Using Chunk-Based Alignment.” Terminology 19 (1): 1–30.

Maedche, Alexander, and Steffen Staab. 2009. “Ontology Learning.” In Handbook on Ontologies, 245–268. Springer.

Malaisé, Véronique, Pierre Zweigenbaum, and Bruno Bachimont. 2004. “Detecting Semantic Relations Between Terms in Definitions.” In COLING 2004 CompuTerm 2004: 3rd International Workshop on Computational Terminology, edited by Sophia Ananiadou and Pierre Zweigenbaum, 55–62.

Marshman, Elizabeth, Tricia Morgan, and Ingrid Meyer. 2002. “French Patterns for Expressing Concept Relations.” Terminology 8 (1): 1–29.

May, Kerstin. 2005. “English Monolingual Advanced Learners Dictionaries on CD-ROM: A Comparative Evaluation.” Master’s thesis. Technische Universität Chemnitz.


Meyer, Ingrid. 1994. “Linguistic Strategies and Computer Aids for Knowledge Engineering in Terminology.” L’actualité terminologique/Terminology Update 27 (4): 6–10. Ottawa: Public Works and Government Services Canada.

Meyer, Ingrid. 2001. “Extracting Knowledge-Rich Contexts for Terminography: A Conceptual and Methodological Framework.” In Recent Advances in Computational Terminology, edited by Didier Bourigault, Christian Jacquemin, and Marie-Claude L’Homme, 279–302.

Meyer, Ingrid, Kristen Mackintosh, Caroline Barrière, and Tricia Morgan. 1999. “Conceptual Sampling for Terminographical Corpus Analysis.” In Proceedings of the 5th International Congress on Terminology and Knowledge Engineering (TKE ’99), 256–267. Innsbruck, Austria.

Mierswa, Ingo, Michael Wurst, Ralf Klinkenberg, Martin Scholz, and Timm Euler. 2006. “YALE: Rapid Prototyping for Complex Data Mining Tasks.” In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, edited by T. Eliassi-Rad, L. H. Ungar, M. Craven, and D. Gunopulos, 2006:935–940.

Muresan, Smaranda, and Judith Klavans. 2002. “A Method for Automatically Building and Evaluating Dictionary Resources.” In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC 2002, May 29-31, 2002, Las Palmas, Canary Islands, Spain). European Language Resources Association.

Müller, Jakob. 2008. “Kritične Misli in Zamisli o SSKJ.” In Strokovni Posvet o Novem Slovarju Slovenskega Jezika, edited by Andrej Perdih, 17–25. Ljubljana: ZRC SAZU.

Murphy, Lynne M. 2010. Lexical Meaning. Cambridge: Cambridge University Press.

Myking, Johan. 2007. “No Fixed Boundaries.” In Indeterminacy in Terminology and LSP: Studies in Honour of Heribert Picht, edited by Antia Bassey, 73–91. Amsterdam (The Netherlands) and Philadelphia, USA: John Benjamins Publishing.

Nakamoto, Kyohei. 1998. “From Which Perspective Does the Definer Define the Definiendum: Anthropocentric or Referent-Based?” International Journal of Lexicography 11 (3): 205–218.

Navigli, Roberto, and Paola Velardi. 2010. “Learning Word-Class Lattices for Definition and Hypernym Extraction.” In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL 2010, Uppsala, Sweden), 1318–1327.

Navigli, Roberto, Paola Velardi, and Stefano Faralli. 2011. “A Graph-Based Algorithm for Inducing Lexical Taxonomies from Scratch.” In Proceedings of the 22nd International Joint Conference on Artificial Intelligence, 1872–1877. Barcelona, Spain.


Pantel, Patrick, and Marco Pennacchiotti. 2006. “Espresso: Leveraging Generic Patterns for Automatically Harvesting Semantic Relations.” In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics (COLING/ACL-06), 113–120. Sydney, Australia.

Paradis, Carita. 2012. “Lexical Semantics.” In The Encyclopedia of Applied Linguistics, edited by C.A. Chapelle. Oxford, UK: Wiley-Blackwell.

Parry, William Thomas, and Edward A. Hacker. 1991. Aristotelian Logic. Albany, NY: State University of New York Press.

Paul, Michael, and Roxana Girju. 2009. “Topic Modeling of Research Fields: An Interdisciplinary Perspective.” In Recent Advances in Natural Language Processing (RANLP 2009), Borovets, Bulgaria.

Pearson, Jennifer. 1996. “The Expression of Definitions in Specialised Texts: A Corpus-Based Analysis.” In Euralex 96 Proceedings, 2:817–824.

Pearson, Jennifer. 1998. Terms in Context. SCL Series, Vol. 1. Edited by Elena Tognini-Bonelli and Wolfgang Teubert. Amsterdam, The Netherlands and Philadelphia, USA: John Benjamins Publishing.

Planas, Emmanuel. 2005. “SIMILIS. Second-Generation Translation Memory Software.” In Proceedings of the 27th International Conference on Translating and the Computer (TC27). London, UK.

Podpečan, Vid, Monika Zemenova, and Nada Lavrač. 2012. “Orange4WS Environment for Service-Oriented Data Mining.” The Computer Journal 55 (1): 82–98.

Pollak, Senja, Anže Vavpetič, Janez Kranjc, Nada Lavrač, and Špela Vintar. 2012a. “NLP Workflow for online Definition Extraction from English and Slovene Text Corpora.” In Proceedings of the 11th Conference on Natural Language Processing (KONVENS 2012, Vienna, Austria, September 19-21, 2012), edited by E. Jancsary, 53–60.

Pollak, Senja, Nejc Trdin, Anže Vavpetič, and Tomaž Erjavec. 2012b. “A Web Service Implementation of Linguistic Annotation for Slovene and English.” In Proceedings of the 8th Language Technologies Conference, Proceedings of the 15th International Multiconference Information Society (IS 2012), 157–162.

Pollak, Senja, Nejc Trdin, Anže Vavpetič, and Tomaž Erjavec. 2012c. “NLP Web Services for Slovene and English: Morphosyntactic Tagging, Lemmatisation and Definition Extraction.” Informatica 36: 441–449.

Przepiórkowski, Adam. 2007. “Slavonic Information Extraction and Partial Parsing.” In Proceedings of the Workshop on Balto-Slavonic Natural Language Processing at ACL 2007, edited by Jakub Piskorski, Bruno Pouliquen, Ralf Steinberger, and Hristo Tanev, 1–10. Prague.


Przepiórkowski, Adam, Lukasz Degórski, Miroslav Spousta, Kiril Simov, Petya Osenova, Lothar Lemnitzer, Vladislav Kuboň, and Beata Wójtowicz. 2007. “Towards the Automatic Extraction of Definitions in Slavic.” In Proceedings of the Workshop on Balto-Slavonic Natural Language Processing at ACL 2007, edited by Jakub Piskorski, Bruno Pouliquen, Ralf Steinberger, and Hristo Tanev, 43–50. Prague.

“PWN.” 2010. “About WordNet.” WordNet. Princeton University. http://wordnet.princeton.edu. Last accessed: November 3, 2011.

Radev, Dragomir R., Pradeep Muthukrishnan, and Vahed Qazvinian. 2009. “The ACL Anthology Network Corpus.” In Proceedings of the 2009 Workshop on Text and Citation Analysis for Scholarly Digital Libraries, 54–61. Association for Computational Linguistics.

Radev, Dragomir R., Pradeep Muthukrishnan, Vahed Qazvinian, and Amjad Abu-Jbara. 2013. “The ACL Anthology Network Corpus.” Language Resources and Evaluation 47: 919–944.

Randolph, J. Justus. 2005. “Free-Marginal Multirater Kappa: An Alternative to Fleiss’ Fixed-Marginal Multirater Kappa.” Paper Presented at the Joensuu University Learning and Instruction Symposium 2005 (October 14–15, 2005, Joensuu, Finland). ERIC Document Reproduction.

Randolph, J. Justus. 2008. “Online Kappa Calculator.” justusrandolph.net/kappa/ Last accessed: September 2, 2013.

Reiplinger, Melanie, Ulrich Schäfer, and Magdalena Wolska. 2012. “Extracting Glossary Sentences from Scholarly Articles: A Comparative Evaluation of Pattern Bootstrapping and Deep Analysis.” In Proceedings of the ACL-2012 Special Workshop on Rediscovering 50 Years of Discoveries, 55–65. Jeju Island, Korea: Association for Computational Linguistics.

Robinson, Richard. 1954. Definition. Clarendon Press.

Robinson, Richard. 1963. Definition. Clarendon Press.

Robinson, Richard. 1972. Definitions. Oxford: Oxford University Press.

Rozman, Tadeja. 2010. “Vloga Enojezičnega Razlagalnega Slovarja Slovenščine Pri Razvoju Jezikovne Zmožnosti”. PhD thesis. University of Ljubljana.

Saussure, Ferdinand De. 1997. Predavanja iz splošnega jezikoslovja. Ljubljana: Studia Humanitatis.

Schmid, Helmut. 1994. “Probabilistic Part-of-Speech Tagging Using Decision Trees.” In Proceedings of the International Conference on New Methods in Language Processing., 44–49. Manchester, United Kingdom.


Schmied, Josef. 2007. “The Chemnitz Corpus of Specialised and Popular Academic English.” Studies in Variation, Contacts and Change in English 2.

Sclano, F., and Paola Velardi. 2007. “TermExtractor: a Web Application to Learn the Common Terminology of Interest Groups and Research Communities.” In Proceedings of the 9th Conf on Terminology and Artificial Intelligence TIA 2007: 8–9.

Shinyama, Yusuke. 2010. “PDFMiner.” http://www.unixuser.org/~euske/python/pdf.

Sierra, Gerardo, Rodrigo Alarcón, César Aguilar, and Carme Bach. 2008. “Definitional Verbal Patterns for Semantic Relation Extraction.” Edited by Alain Auger and Caroline Barrière. Terminology (Special Issue) 14 (1): 74–98.

Sinclair, John. 1987a. Collins COBUILD English Language Dictionary. 1st ed. London-Glasgow: Collins ELT.

Sinclair, John. 1987b. Looking Up: An Account of the COBUILD Project in Lexical Computing and the Development of the Collins COBUILD English Language Dictionary. London-Glasgow: Collins.

Smailović, Jasmina, and Senja Pollak. 2011. “Semi-Automated Construction of a Topic Ontology from Research Papers in the Domain of Language Technologies.” In Proceedings of the 5th Language & Technology Conference, 121–125. Poznan, Poland.

Smailović, Jasmina, and Senja Pollak. 2012. “Topic Ontology Construction from English and Slovene Language Technologies Corpora.” In Proceedings of the 8th Language Technologies Conference, Proceedings of the 15th International Multiconference Information Society (IS 2012). Ljubljana, Slovenia.

Snow, Rion, Dan Jurafsky, and Andrew Y. Ng. 2004. “Learning Syntactic Patterns for Automatic Hypernym Discovery.” In Proceedings of Advances in Neural Information Processing Systems, 1297–1304.

Snow, Rion, Daniel Jurafsky, and Andrew Y. Ng. 2006. “Semantic Taxonomy Induction from Heterogenous Evidence.” In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, 801–808.

Stabej, Marko, Vojko Gorjanc, Simon Krek, Špela Arhar Holdt, Jasna Hočevar, Urška Jarnovič, Amanda Saksida, et al. 2006. “FidaPLUS: Korpus Slovenskega Jezika.” http://www.fidaplus.net/. Last accessed January 15, 2012.

Storrer, Angelika, and Sandra Wellinghoff. 2006. “Automated Detection and Annotation of Term Definitions in German Text Corpora.” In Proceedings of the Fifth International Language Resources and Evaluation Conference (LREC 2006). Genoa, Italy.


Strehlow, R A, and Sue Ellen Wright. 1993. Standardizing Terminology for Better Communication: Practice, Applied Theory, and Results.

Svensen, Bo. 1993. Practical Lexicography: Principles and Methods Of Dictionary Making. Oxford University Press.

Swartz, Norman. 2010. “Definitions, Dictionaries, and Meanings.” Lectures and Class Notes of N. Swartz. Department of Philosophy, Simon Fraser University.

TEI P5. 2007. “TEI P5: Guidelines for Electronic Text Encoding and Interchange.” Text Encoding Initiative Consortium. http://www.tei-c.org/Guidelines/P5/.

TEI P5. 2013. “TEI P5: Guidelines for Electronic Text Encoding and Interchange (edited by Lou Burnard and Syd Bauman).” Text Encoding Initiative Consortium (version 2.5.0, last updated on 26th July 2013). http://www.tei-c.org/release/doc/tei-p5-doc/en/Guidelines.pdf.

Velardi, Paola, Stefano Faralli, and Roberto Navigli. 2013. “OntoLearn Reloaded: A Graph-Based Algorithm for Taxonomy Induction.” Computational Linguistics 39 (3): 665–707.

Velardi, Paola, Roberto Navigli, and Pierluigi D’Amadio. 2008. “Mining the Web to Create Specialized Glossaries.” IEEE Intelligent Systems 23 (5): 18–25.

Viera, Anthony J., and Joanne M. Garrett. 2005. “Understanding Interobserver Agreement: The Kappa Statistic.” Family Medicine 37 (5): 360–363.

Vintar, Špela. 2002. “Avtomatsko Luščenje Izrazja Iz Slovensko-Angleških Vzporednih Besedil.” In Jezikovne Tehnologije: Zbornik Konference / Proceedings of the Conference, edited by Tomaž Erjavec and Jerneja Žganec Gros, 78–85. Ljubljana: Institut “Jožef Stefan.”

Vintar, Špela. 2008. Terminologija: Terminološka Veda in Računalniško Podprta Terminografija. Ljubljana: Filozofska fakulteta.

Vintar, Špela. 2010. “Bilingual Term Recognition Revisited. The Bag-of-Equivalents Term Alignment Approach and Its Evaluation.” Terminology 16 (2): 141–158.

Vintar, Špela, and Darja Fišer. 2009. “Adding multi-word expressions to sloWNet.” In Proceedings of the 12th International Multiconference Information Society 2009, (Mondilex Fifth Open Workshop). Ljubljana, Slovenia.

Vintar, Špela, and Darja Fišer. 2013. “Enriching Slovene WordNet with Domain-Specific Terms.” Translation: Computation, Corpora, Cognition 1 (1): 29–44.

Vogel, Adam, and Dan Jurafsky. 2012. “He Said, She Said: Gender in the ACL Anthology.” In Proceedings of the ACL-2012 Special Workshop on Rediscovering 50 Years of Discoveries, 33–41. Jeju Island, Korea: Association for Computational Linguistics.


Walter, Stephan. 2008. “Linguistic Description and Automatic Extraction of Definitions from German Court Decisions.” In Proceedings of the Sixth International Language Resources and Evaluation Conference (LREC 2008), 2926–2932. Marrakech, Morocco.

Walter, Stephan, and Manfred Pinkal. 2006. “Automatic Extraction of Definitions from German Court Decisions.” In Proceedings of the ACL’06 Workshop on Information Extraction Beyond the Document, 20–26. Sydney, Australia.

Warrens, Matthijs J. 2010. “Inequalities Between Multi-Rater Kappas.” Advances in Data Analysis and Classification 4 (4): 271–286.

Weinreich, U. 1967. “Lexicographic Definition in Descriptive Semantics.” In Problems in Lexicography, edited by Householder and Saporta, 2nd ed., 25–43. Bloomington, IN: Indiana University Press.

Westerhout, Eline. 2009. “Definition Extraction Using Linguistic and Structural Features.” In Proceedings of the 1st International Workshop on Definition Extraction (RANLP-09), 61–67. Borovets, Bulgaria.

Westerhout, Eline. 2010. “Definition Extraction for Glossary Creation : a Study on Extracting Definitions for Semi-Automatic Glossary Creation in Dutch.” Lot Dissertation Series 252.

Westerhout, Eline, and Paola Monachesi. 2007. “Extraction of Dutch Definitory Contexts for e-Learning Purposes.” In Proceedings of the Computational Linguistics in the Netherlands.

Wikipedia. 2013. “Enumerative Definition.” http://en.wikipedia.org/wiki/Enumerative_definition. Last accessed: May 27, 2013.

Wright, Sue Ellen. 2011. “Terminography (explanation Note 2.2.4).” ISO 1087-1:2000, 3.6.2. http://www.isocat.org/rest/dc/4092 available through Clarin ISOcat web interface (https://catalog.clarin.eu/isocat/interface/index.html).

Wüster, Eugen. 1979. Einführung in Die Allgemeine Terminologielehre Und Terminologische Lexikographie. UNESCO ALSED LSP Network.

Yang, Hui, and Jamie Callan. 2009. “A Metric-Based Framework for Automatic Taxonomy Induction.” Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP: 271–279.

Zgusta, Ladislav. 1971. Manual of Lexicography. Berlin, New York: De Gruyter Mouton.

Zhang, Chunxia, and Jiang Peng. 2009. “Automatic Extraction of Definitions.” In 2nd IEEE International Conference on Computer Science and Information Technology (ICCSIT 2009), 364–368. Beijing, China: IEEE.


9 List of figures

Figure 1. Slovene part of the main Language technologies corpus by text type...... 36
Figure 2. English part of the main Language technologies corpus by text type...... 36
Figure 3. English topic ontology (LTC proceedings corpus) without cleaning the documents and without renaming concepts...... 39
Figure 4. English topic ontology (LTC proceedings corpus) after manually renaming the concepts, using active learning for adding concepts and manually moving some documents from one concept to another...... 39
Figure 5. Slovene topic ontology (LTC proceedings corpus) after manually renaming the concepts, using active learning for adding concepts and manually moving some documents from one concept to another...... 40
Figure 6. Concept visualization of English text documents (LTC proceedings corpus). The automatic splitting into two main topics, which we label computational linguistics and speech technologies, can be observed...... 41
Figure 7. Concept visualization from Slovene text documents (LTC proceedings corpus). The automatic splitting into two main topics that we label Računalniško jezikoslovje and Govorne tehnologije can be observed...... 41
Figure 8. Topic ontology constructed from the entire Slovene LT corpus...... 42
Figure 9. Topic ontology constructed from the entire English LT corpus...... 43
Figure 10. Definition extraction methodology overview...... 58
Figure 11. A screenshot of the ClowdFlows workflow editor in the Google Chrome browser...... 64
Figure 12. Definition extraction workflow in ClowdFlows, available online at http://clowdflows.org/workflow/1380...... 133
Figure 13. A screenshot of the ToTrTaLe workflow in ClowdFlows, available online at http://clowdflows.org/workflow/228/...... 136
Figure 14. Illustrating the term candidate viewer widget functionality...... 139
Figure 15. Illustrating the definition candidate viewer widget functionality...... 139


10 List of tables

Table 1. Counts of the Language Technologies Corpus in terms of sentences and word tokens...... 33
Table 2. Counts (# of articles) of the Slovene small corpus covering the Proceedings of the Language Technologies conference. For each year the information about the number of articles and other included units is provided...... 35
Table 3. Counts (# of articles) of the English small corpus covering the Proceedings of the Language Technologies conference. For each year the information about the number of articles and other included units is provided...... 35
Table 4. A sample output of ToTrTaLe, annotating sentences and tokens, with lemmas and MSD tags on words...... 62
Table 5. Top ranked 20 terms from the Slovene Language Technologies Corpus...... 67
Table 6. Top ranked 20 terms from the English Language Technologies Corpus...... 67
Table 7. Precision of the reimplemented LUIZ term extraction method (on the Slovene and English Language Technologies Corpus), evaluated by two annotators (the average denotes that the arithmetic mean of the two scores is above 2 in the first row and 5 in the second). The IAA scores for overall agreement and unweighted kappa are given in the last two rows and show low inter-annotator agreement...... 68
Table 8. Recall of terminological candidates extracted from the LT Corpus...... 69
Table 9. Evaluation of precision and recall for different variations of the “X_is_Y” type of patterns on the Slovene corpus...... 73
Table 10. Precision and recall of individual patterns on the Slovene data set. The second column contains the pattern translated into English, the third column indicates the number of extracted sentences with the number of true definitions between the parentheses. The fourth column gives the precision on all the extracted sentences and the last one the recall estimate, calculated on the 150 definitions dataset, with the number of elements extracted from this dataset between the parentheses...... 78
Table 11. Selection of settings of term-based methods from Table 23 and Table 24 (A: basic; B: highest recall; ee: highest precision; R: best precision-recall tradeoff (settings without nominative); w: best precision-recall tradeoff (settings with nominative); R&w: union of w and R, set as a suggested combination). For the less restrictive settings, the precision was evaluated on 1,000 randomly selected definition candidates; the others were evaluated in totality...... 87
Table 12. sloWNet-based definition extraction results...... 96


Table 13. Pattern-based definition extraction for English with the beginning of the sentence condition. Precision is evaluated on all the extracted sentences, while the 150 definitions test set was used for evaluating the recall...... 101
Table 14. Pattern-based definition extraction for English without the beginning of the sentence condition...... 101
Table 15. Pattern-based definition extraction for English with the beginning of the sentence condition – extended starting patterns. Precision is evaluated on all the extracted sentences, while the 150 definitions test set was used for evaluating the recall...... 103
Table 16. Term-based definition extraction on the English part of the corpus...... 110
Table 17. WordNet-based definition extraction candidates...... 114
Table 18. Summary of definition extraction methods on the Slovene subcorpus. For the term-based approach a combined setting of two methods, (R) of Table 23 and (w) of Table 24, is taken, the same combination that was selected as the most appropriate term-based setting in Table 11. Union denotes the sentences extracted by any of the three methods, while Intersection denotes the sentences extracted by at least two out of three methods...... 119
Table 19. Summary of definition extraction methods on the Slovene subcorpus, presenting the results of a selection of combined methods...... 120
Table 20. Summary of results of three definition extraction methods on the English subcorpus, as well as Union (i.e., sentences extracted by any of the methods) and Intersection (i.e., sentences extracted by at least two out of three methods)...... 121
Table 21. Summary of different combinations of definition extraction methods on the English subcorpus. As the basis, three variants of the pattern-based approach are taken that are filtered or enlarged in different ways by term-based or WordNet-based definition extraction...... 122
Table 22. Tags and examples of different types of definition candidates...... 130
Table 23. Evaluation of the term-based definition extraction approach – settings without the nominative condition. For the less restrictive settings, the precision was evaluated on 1,000 randomly selected definition candidates; the others were evaluated in totality. The recall was evaluated on the preselected 150 definitions dataset...... 168
Table 24. Evaluation of the term-based definition extraction approach – settings with the nominative condition. For the less restrictive settings, the precision was evaluated on 1,000 randomly selected definition candidates; the others were evaluated in totality. The recall was evaluated on the preselected 150 definitions dataset...... 169


APPENDIX A: Term-based definition extraction experiments

To check which combinations of parameters perform well, we evaluated the precision and recall for each parameter setting. The evaluation is based on the experiments presented in Table 23 and Table 24: when we refer to different settings in the text, we refer to experiments (A) to (FF) from Table 23 for the experiments without the nominative condition, and to experiments (a) to (ff) from Table 24 when the nominative condition is applied. To evaluate the precision, in the majority of experiments all the extracted sentences were inspected for correctness, which allowed us to analyze a large number of different definitions and to detect preliminary entries for the glossary. For the less restrictive settings we provide only an estimate of precision, obtained by evaluating a subset of 1,000 randomly selected definition candidates (these precision scores are marked as estimates in the tables). The estimated recall was evaluated on the 150 definitions test set, while a further indication of recall is the number of actually extracted definitions (given in the precision column): the more definitions extracted, the better the recall. In the rest of this section we verify the seven hypotheses from Section 5.1.2, examining the influence of each parameter on the precision and recall of the system.
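The following minimal sketch (an illustration only, not the thesis implementation; all function and variable names are assumptions) shows how the two evaluation measures can be computed for one parameter setting: precision either on all extracted candidates or on a random sample of 1,000 of them, and recall against the fixed test set of 150 manually selected definitions.

    import random

    def precision(candidates, is_definition, sample_size=1000, seed=0):
        # Estimate precision; `is_definition` stands in for the manual annotation.
        if not candidates:
            return 0.0
        if len(candidates) > sample_size:
            random.seed(seed)
            candidates = random.sample(candidates, sample_size)
        return sum(1 for sentence in candidates if is_definition(sentence)) / len(candidates)

    def recall(candidates, gold_definitions):
        # Recall on the fixed gold set (here: the 150 manually selected definitions).
        extracted = set(candidates)
        return sum(1 for d in gold_definitions if d in extracted) / len(gold_definitions)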

Verifying Hypothesis 1
Firstly, we can see that precision is higher when the term threshold parameter is set higher (i.e., when only the top 1% of extracted terms ranked by termhood value is selected, as opposed to the larger selection of the top 10%). The first four experiments with basic settings, i.e., the two using only the condition that a sentence should contain two terms ((A) and (B)) and the two where the verb condition is also applied ((C) and (D)), clearly show the influence of the threshold value on precision and recall. Note that when identifying the terms in a sentence, nested terms are not counted separately; only the longest term above the threshold is considered. For example, in computational linguistics, both computational linguistics and linguistics are terms, but only computational linguistics is counted. In (A) and (B) the estimated precision is 0.052 with 1% of terms, compared to 0.030 with 10% of terms. In (C) and (D), the estimated precision increases from 0.039 to 0.047 when only 1% instead of 10% of terms is used. In both cases, however, recall drops when the selection is restricted to the top 1% of terms. The percentage of terms influences the results in all other settings in a similar way. Consider, for example, the results in (H) to (T) and (d) to (p), where changing the threshold while keeping all other settings the same has a large influence on precision. In Experiments (S) and (T), for instance, the precision without the nominative condition reaches 0.172 when a sentence candidate must contain at least 5 domain terms and the top 1% of domain terms is used, while the same setting with 10% of domain terms extracts candidates with precision 0.1101; the higher precision, however, means that fewer definitions are extracted (75 for 1% and 222 for 10%). When the nominative condition is applied (see e.g., (o) and (p)), we get 0.202 precision (40 definitions) for 1% of terms and 0.1292 (160 definitions) for 10%, with 5 domain terms per sentence. The hypothesis is confirmed in all other experiments as well. An additional


test was performed with 5% of terms; as expected, the precision results are higher than for 10% and lower than for 1% of terms, while the opposite holds for the number of extracted definitions and the estimated recall (see for instance (CC), (DD) and (EE), as well as (aa), (cc) and (dd)).
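As a rough illustration of the term-threshold parameter and of the longest-match counting of terms in a sentence described above, consider the following sketch. The data structures, the maximum term length and all names are assumptions made for the example, not the thesis implementation.

    def top_terms(ranked_terms, percentage):
        # Keep the top `percentage` (e.g. 1, 5 or 10) of terms ranked by termhood value.
        k = max(1, int(len(ranked_terms) * percentage / 100))
        return ranked_terms[:k]

    def count_terms(tokens, term_set, max_len=4):
        # Count term occurrences longest-match-first, so nested terms (e.g. 'linguistics'
        # inside 'computational linguistics') are not counted separately.
        count, i = 0, 0
        while i < len(tokens):
            for length in range(min(max_len, len(tokens) - i), 0, -1):
                if " ".join(tokens[i:i + length]) in term_set:
                    count += 1
                    i += length
                    break
            else:
                i += 1
        return count

    # Example: only 'computational linguistics' is counted, not the nested 'linguistics'.
    terms = set(top_terms(["computational linguistics", "linguistics", "speech"], 100))
    print(count_terms("computational linguistics is a field".split(), terms))  # -> 1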

Verifying Hypothesis 2
Experiments were carried out to check whether setting the minimum number of terms in a sentence to more than 2 (the default) could increase the performance. This can be seen in the precision of Experiments (H), (K), (O) and (S): at the 1% threshold the precision consistently improves as the minimum number of terms is raised from n=2 (precision 0.12 in (H)) to n=5 (precision 0.172 in (S)). The same holds for the 10% threshold in the experiments with n=3 to n=5 (i.e., Experiments (M), (P) and (T)), but not for n=2, which has the same precision as n=4; note, however, that the precision of this setting is only an estimate based on 1,000 randomly selected sentences. When the nominative condition is used, the hypothesis that precision increases with a higher number of domain terms holds for the 'verb between any terms' (VA) setting, explained below under Hypothesis 3 (Experiments (d), (f), (j), (n)); with the 'verb-first' (VF) variant of the verb condition it holds for the 10% threshold ((e), (h), (l), (p)), but not in all of the 1% threshold experiments. Even though there are a few exceptions to the main observation that precision increases with more domain terms in a sentence, in all the experiments the precision of the lowest setting (n=2) is worse than that of the highest setting (n=5).

Verifying Hypothesis 3
Next, the simple condition that a verb should appear between two domain terms was tested in several experiments. Comparing the basic settings ((A) and (B)) with the settings that add the verb condition ((C) and (D)), we can see that the experiments using 10% of terms confirm the hypothesis that the verb condition improves precision (Experiment (D), with precision 0.039, outperforms the basic setting in (B) with precision 0.030), but the 1% setting does not (Experiment (C) does not outperform the basic setting in (A)). Note that these are estimates only, evaluated on 1,000 randomly selected sentences from the entire set of definition candidates extracted by the method. We therefore tested the hypothesis in other settings as well, i.e., (G) and (H); (Q) and (R); (Y) and (Z); (c) and (d); (m) and (n); (u) and (v), where all the examples confirm that the verb condition yields better precision. Note that in the experiments just mentioned, the verb condition does not affect the recall. Another verb-related hypothesis was also tested: if more than two terms are considered, how do the setting in which a verb must occur between the first two terms (VF) and the setting in which a verb may appear between any terms (VA) influence precision and recall? Comparing (J) and (K), we get higher precision (0.1444) when the verb must occur between the first two terms (K) than when the broader condition is used (0.1381 in (J)). The estimated recall on the recall test set shows no difference, but the number of extracted definitions shows that fewer definitions are extracted with the stricter condition (157 when the verb must occur between the first two terms, compared to 166 when it may occur between any terms). Similar tendencies can be observed in Experiments (L) and (M); (N) and (O); (Z) and (AA); (f) and (g); (h) and (i); (j) and (k); (v) and (x); (aa) and (ee). On the other hand, limiting the verb to the position between the first two terms does not improve the precision in Experiments (R) and (S); (CC)


and (FF); (n) and (o), all three using the 5 domain terms condition (explained above), where the precision is higher when the verb may occur between any terms. Conversely, Experiments (aa) and (ee) also use the 5 domain terms condition, yet there the VF setting performs better in terms of precision than VA. Note also that in the majority of experiments the difference in precision is quite small.
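A minimal sketch of the two verb conditions is given below. It assumes that each token carries a morphosyntactic tag with verbs tagged with a leading 'V', and that term occurrences are represented as (start, end) token spans; these representations and all names are illustrative assumptions, not the thesis implementation.

    def has_verb(tagged_tokens, start, end):
        # tagged_tokens: list of (word, tag) pairs; verbs assumed to start with 'V'.
        return any(tag.startswith("V") for _, tag in tagged_tokens[start:end])

    def verb_first(tagged_tokens, term_spans):
        # VF: a verb must occur between the first two domain terms.
        spans = sorted(term_spans)
        return len(spans) >= 2 and has_verb(tagged_tokens, spans[0][1], spans[1][0])

    def verb_anywhere(tagged_tokens, term_spans):
        # VA: a verb may occur between any two consecutive domain terms.
        spans = sorted(term_spans)
        return any(has_verb(tagged_tokens, spans[i][1], spans[i + 1][0])
                   for i in range(len(spans) - 1))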

Verifying Hypotheses 4 and 5
The constraints that a domain term should appear at the beginning of the sentence and that the first domain term should be a multi-word expression rather than a single-word term yielded better precision, especially when applied together. Compared to the simple setting with 2 domain terms and the verb condition (cf. Experiment (C) with 0.047 estimated precision), the precision increases to 0.12 when the two conditions ('beginning of a sentence' and 'multi-word term first') are added (cf. Experiment (H)). Comparing Experiment (G) with (H) confirms once again that it is better to also apply the verb condition. Precision also increases when the two tested conditions are applied individually (cf. (E) for 'beginning of the sentence' and (F) for 'multi-word first'). However, the cost of higher precision is a large decrease in (estimated) recall (cf. 0.8933 in (D) compared to 0.08 in (H)), which means that we must know which setting to use when optimizing for precision and which one for better recall. The experiments with the nominative constraint show similar results: in terms of precision, the best results are obtained when all three constraints are applied, namely the verb condition, the beginning of a sentence condition and the condition that a multi-word term precedes the other terms (0.1858 in (d)), compared to precision between 0.083 and 0.1653 when one of the three conditions is missing ((a)–(c)). On the other hand, the estimated recall and the number of extracted definitions decrease considerably with the beginning of a sentence and multi-word term conditions (e.g., 258 definitions and 0.2133 recall when the beginning of a sentence condition is not applied (a), compared to 84 definitions and 0.0333 recall with all three constraints used together with the nominative condition (d)).
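The two conditions can be illustrated with the following sketch; the span representation and the function names are assumptions made for the example only.

    def term_at_beginning(term_spans, max_start=0):
        # A domain term must start the sentence (token index 0, or close to it).
        return any(start <= max_start for start, _ in term_spans)

    def first_term_is_multiword(term_spans):
        # The first domain term in the sentence must span more than one token.
        if not term_spans:
            return False
        start, end = min(term_spans)
        return end - start > 1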

Verifying Hypothesis 6
The next hypothesis was that multi-word terms carry higher terminological value than single words, and thus that requiring a higher number of multi-word terms in a sentence should yield better precision. The hypothesis is confirmed in all the experiments (see for example the pairs (H) and (U); (K) and (V); (O) and (AA); (S) and (FF)). In more detail, for 2 domain terms with the termhood threshold set at 1% and the verb, beginning and multi-word-first conditions applied, the precision rises from 0.12 (H) to 0.1527 when the extra condition that both terms should be multi-word terms is added (Experiment (U)). A similar observation can be made with 5 domain terms (cf. Experiments (S) and (R) with precision scores 0.172 and 0.1751, respectively, depending on which type of verb condition is used (VF/VA)), where much higher precision is reached if at least 3 out of 5 terms are multi-word expressions: the precision is 0.2009 with the 'verb-first' condition (Experiment (FF)) and 0.2111 with the 'verb-anywhere' condition (Experiment (CC)). In the experiments with the nominative condition it also holds that precision increases with the number of multi-word terms. E.g., for a minimum of 2 terms, precision increases from 0.1858 (Experiment (d)) to 0.2236 (Experiment (q)) when both terms are multi-word expressions; for a minimum of 5 terms, the precision increases from 0.202


Settings – without nominative constraints / Results
Columns: (1) Threshold | (2) Terms (#) | (3) Verb (VA/VF) | (4) Beginning sent. | (5) Multi-word first | (6) Multi-word (#) | (7) Nominatives (#) | Extracted sentences (#) | Precision (# of definitions) | Recall on 150 test set (# of definitions)

Basic
A. | 1% | 2 | no | no | no | no | no | 28,215 | 0.052 | 0.8533 (128)
B. | 10% | 2 | no | no | no | no | no | 35,624 | 0.030 | 0.9733 (146)
Verb
C. | 1% | 2 | yes | no | no | no | no | 22,176 | 0.047 | 0.7067 (106)
D. | 10% | 2 | yes | no | no | no | no | 29,840 | 0.039 | 0.8933 (134)
Beginning of the sentence
E. | 1% | 2 | yes | yes | no | no | no | 8,548 | 0.065 | 0.2733 (41)
First multi-word term
F. | 1% | 2 | yes | no | yes | no | no | 7,958 | 0.075 | 0.34 (51)
Beginning and first multi-word term
G. | 1% | 2 | no | yes | yes | no | no | 1,715 | 0.1044 (179) | 0.08 (12)
H. | 1% | 2 | yes | yes | yes | no | no | 1,492 | 0.12 (179) | 0.08 (12)
I. | 10% | 2 | yes | yes | yes | no | no | 4,486 | 0.095 | 0.1667 (25)
J. | 1% | 3 | yes-VA | yes | yes | no | no | 1,202 | 0.1381 (166) | 0.0667 (10)
K. | 1% | 3 | yes-VF | yes | yes | no | no | 1,087 | 0.1444 (157) | 0.0667 (10)
L. | 10% | 3 | yes-VA | yes | yes | no | no | 4,010 | 0.0828 (332) | 0.1667 (25)
M. | 10% | 3 | yes-VF | yes | yes | no | no | 3,671 | 0.088 (323) | 0.1667 (25)
N. | 1% | 4 | yes-VA | yes | yes | no | no | 893 | 0.1522 (136) | 0.0533 (8)
O. | 1% | 4 | yes-VF | yes | yes | no | no | 712 | 0.1587 (113) | 0.0533 (8)
P. | 10% | 4 | yes-VF | yes | yes | no | no | 2,797 | 0.095 (266) | 0.1467 (22)
Q. | 1% | 5 | no | yes | yes | no | no | 619 | 0.168 (104) | 0.0467 (7)
R. | 1% | 5 | yes-VA | yes | yes | no | no | 594 | 0.1751 (104) | 0.0467 (7)
S. | 1% | 5 | yes-VF | yes | yes | no | no | 436 | 0.172 (75) | 0.0333 (5)
T. | 10% | 5 | yes-VF | yes | yes | no | no | 2,016 | 0.1101 (222) | 0.1267 (19)
Number of multi-word terms
U. | 1% | 2 | yes | yes | yes | 2 | no | 825 | 0.1527 (126) | 0.0533 (8)
V. | 1% | 3 | yes-VF | yes | yes | 2 | no | 696 | 0.1667 (116) | 0.0467 (7)
W. | 10% | 3 | yes-VF | yes | yes | 2 | no | 2,971 | 0.0966 (287) | 0.16 (24)
X. | 10% | 3 | yes-VF | yes | yes | 3 | no | 1,954 | 0.111 (217) | 0.0933 (14)
Y. | 1% | 4 | no | yes | yes | 2 | no | 670 | 0.1642 (110) | 0.04 (6)
Z. | 1% | 4 | yes-VA | yes | yes | 2 | no | 636 | 0.1729 (110) | 0.04 (6)
AA. | 1% | 4 | yes-VF | yes | yes | 2 | no | 508 | 0.1772 (90) | 0.04 (6)
BB. | 10% | 4 | yes-VF | yes | yes | 2 | no | 2,397 | 0.1026 (246) | 0.14 (21)
CC. | 1% | 5 | yes-VA | yes | yes | 3 | no | 289 | 0.2111 (61) | 0.02 (3)
DD. | 5% | 5 | yes-VA | yes | yes | 3 | no | 1,078 | 0.1317 (142) | 0.06 (9)
EE. | 10% | 5 | yes-VA | yes | yes | 3 | no | 1,707 | 0.1125 (192) | 0.0867 (13)
FF. | 1% | 5 | yes-VF | yes | yes | 3 | no | 214 | 0.2009 (43) | 0.0067 (1)

Table 23. Evaluation of the term-based definition extraction approach – settings without the nominative condition. For the less restrictive settings the precision was estimated on 1,000 randomly selected definition candidates; the others were evaluated in totality. The recall was evaluated on the preselected 150 definitions dataset.

Settings – with nominative constraints / Results
Columns: (1) Threshold | (2) Terms (#) | (3) Verb (VA/VF) | (4) Beginning sent. | (5) Multi-word first | (6) Multi-word (#) | (7) Nominatives (#) | Extracted sentences (#) | Precision (# of definitions) | Recall on 150 test set (# of definitions)

a) | 1% | 2 | yes | no | yes | no | 2 | 2,377 | 0.1085 (258) | 0.2133 (32)
b) | 1% | 2 | yes | yes | no | no | 2 | 2,787 | 0.083 | 0.1333 (20)
c) | 1% | 2 | no | yes | yes | no | 2 | 508 | 0.1653 (84) | 0.0333 (5)
d) | 1% | 2 | yes | yes | yes | no | 2 | 452 | 0.1858 (84) | 0.0333 (5)
e) | 10% | 2 | yes | yes | yes | no | 2 | 1,968 | 0.1113 (219) | 0.1133 (17)
f) | 1% | 3 | yes-VA | yes | yes | no | 2 | 438 | 0.1918 (84) | 0.0333 (5)
g) | 1% | 3 | yes-VF | yes | yes | no | 2 | 406 | 0.2020 (82) | 0.0333 (5)
h) | 10% | 3 | yes-VA | yes | yes | no | 2 | 1,917 | 0.1142 (219) | 0.1133 (17)
i) | 10% | 3 | yes-VF | yes | yes | no | 2 | 1,820 | 0.117 (213) | 0.1133 (17)
j) | 1% | 4 | yes-VA | yes | yes | no | 2 | 371 | 0.1995 (74) | 0.0267 (4)
k) | 1% | 4 | yes-VF | yes | yes | no | 2 | 287 | 0.2021 (58) | 0.0267 (4)
l) | 10% | 4 | yes-VF | yes | yes | no | 2 | 1,544 | 0.1198 (185) | 0.1 (15)
m) | 1% | 5 | no | yes | yes | no | 2 | 299 | 0.1973 (59) | 0.02 (3)
n) | 1% | 5 | yes-VA | yes | yes | no | 2 | 280 | 0.2107 (59) | 0.02 (3)
o) | 1% | 5 | yes-VF | yes | yes | no | 2 | 198 | 0.202 (40) | 0.0133 (2)
p) | 10% | 5 | yes-VF | yes | yes | no | 2 | 1,238 | 0.1292 (160) | 0.0933 (14)
Number of multi-word terms
q) | 1% | 2 | yes | yes | yes | 2 | 2 | 313 | 0.2236 (70) | 0.0267 (4)
r) | 1% | 3 | yes-VF | yes | yes | 2 | 2 | 287 | 0.2369 (68) | 0.0267 (4)
s) | 10% | 3 | yes-VF | yes | yes | 2 | 2 | 1,572 | 0.1259 (198) | 0.1067 (16)
t) | 10% | 3 | yes-VF | yes | yes | 3 | 2 | 1,144 | 0.1381 (158) | 0.06 (9)
u) | 1% | 4 | no | yes | yes | 2 | 2 | 300 | 0.2133 (64) | 0.02 (3)
v) | 1% | 4 | yes-VA | yes | yes | 2 | 2 | 279 | 0.2294 (64) | 0.02 (3)
w) | 1% | 4 | yes-VA | yes | yes | 2 | 1 | 499 | 0.1944 (97) | 0.0333 (5)
x) | 1% | 4 | yes-VF | yes | yes | 2 | 2 | 216 | 0.2315 (50) | 0.02 (3)
y) | 1% | 4 | yes-VF | yes | yes | 2 | 1 | 395 | 0.2051 (81) | 0.0333 (5)
z) | 10% | 4 | yes-VF | yes | yes | 2 | 2 | 1,370 | 0.1285 (176) | 0.0933 (14)
aa) | 1% | 5 | yes-VA | yes | yes | 3 | 2 | 141 | 0.2624 (37) | 0.0133 (2)
bb) | 1% | 5 | yes-VA | yes | yes | 3 | 1 | 228 | 0.2281 (52) | 0.02 (3)
cc) | 5% | 5 | yes-VA | yes | yes | 3 | 2 | 672 | 0.1518 (102) | 0.0467 (7)
dd) | 10% | 5 | yes-VA | yes | yes | 3 | 2 | 1,078 | 0.1354 (146) | 0.06 (9)
ee) | 1% | 5 | yes-VF | yes | yes | 3 | 2 | 102 | 0.2647 (27) | 0.0067 (1)
ff) | 1% | 5 | yes-VF | yes | yes | 3 | 1 | 169 | 0.2248 (38) | 0.0067 (1)

Table 24. Evaluation of the term-based definition extraction approach – settings with the nominative condition. For the less restrictive settings the precision was estimated on 1,000 randomly selected definition candidates; the others were evaluated in totality. The recall was evaluated on the preselected 150 definitions dataset.

(Experiment (o)) to 0.2647 (Experiment (ee)), or from 0.2107 (Experiment (n)) to 0.2624 (Experiment (aa)), if at least 3 terms are multi-word expressions.

Verifying Hypothesis 7
In the last set of experiments we checked the influence of the condition requiring the terms to be in the nominative case, which is applicable only to Slovene. We can see that the nominative case condition leads to higher precision. Take for example the experiments with 2 domain terms: Experiment (H) at the 1% threshold and Experiment (I) at the 10% threshold yield precision scores of 0.12 and 0.095, respectively. If we compare them to experiments with similar settings but with the nominative condition added (2 terms must be in the nominative case), precision increases considerably: 0.1858 in Experiment (d) and 0.1113 in (e). One of the highest precision results in our experiments is obtained with 5 terms and the nominative condition (2 terms in the nominative case): 0.2107 (Experiment (n)), which is higher than without the nominative condition (0.1751 in (R)); however, a much smaller number of definitions is extracted. To conclude the discussion of the tables, we report the overall best precision, which is achieved when all the above-mentioned constraints are applied (i.e., the verb condition, the beginning of the sentence condition, the multi-word term preceding the other terms, a higher number of multi-word terms and the nominative condition): the precision rises above 26% when 5 domain terms are required, of which at least 3 must be multi-word expressions and 2 must be in the nominative case (cf. Experiments (aa) and (ee)), compared to precision between 0.20 and 0.21 in Experiments (CC) and (FF), which are the best performing settings (in terms of precision) without the nominative condition. However, the number of extracted definitions and the estimated recall with these restrictive settings are very low. If we loosen the nominative case condition from two terms in the nominative case to only one, the precision is lower but the estimated recall and the number of extracted definitions are higher; this can be seen in Experiments (w), (y), (bb) and (ff). Compared to the best settings with two nominatives, the precision decreases from 0.2315 (x) to 0.2051 (y) for 4 terms and from 0.2647 (ee) to 0.2248 (ff) for 5 terms; on the other hand, more definitions are extracted, i.e., 81 instead of 50 and 38 instead of 27, respectively. In summary, the general trend is that the higher the termhood value56 and the number of nominatives in the sentence, the higher the precision and the lower the recall. Moreover, the more terms and multi-word terms in a sentence, the better the precision. In addition, the other constraints, such as a verb between two terms, a term at the beginning of the sentence and a multi-word term as the first domain term, further improve the results. Depending on the objective of the application, the user can tune the approach for higher precision or recall by selecting different parameter settings.

56 Terms extracted by the term extraction method are ranked by their termhood value; if, e.g., only the top 1% of terms is used, the termhood threshold is higher than if the top 2% of all extracted terms is used, etc.
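To summarize the parameters discussed in this appendix, the following sketch puts them together as one configurable sentence filter. It is an illustration only: the sentence representation (token spans for term occurrences, a precomputed count of terms in the nominative case derived from the MSD tags) and all names are assumptions, not the thesis implementation. The term threshold itself is applied upstream, when the term list is selected.

    def _has_verb(tagged_tokens, start, end):
        # tagged_tokens: list of (word, tag) pairs; verbs assumed to start with 'V'.
        return any(tag.startswith("V") for _, tag in tagged_tokens[start:end])

    def keep_candidate(sentence, cfg):
        # sentence: dict with 'tagged_tokens', 'term_spans', 'nominative_term_count'.
        spans = sorted(sentence["term_spans"])
        if len(spans) < cfg.get("min_terms", 2):
            return False
        if sum(1 for s, e in spans if e - s > 1) < cfg.get("min_multiword_terms", 0):
            return False
        if cfg.get("beginning_of_sentence") and spans[0][0] > 0:
            return False
        if cfg.get("multiword_first") and spans[0][1] - spans[0][0] < 2:
            return False
        verb = cfg.get("verb")  # None, "VF" (between first two terms) or "VA" (anywhere)
        if verb == "VF" and not _has_verb(sentence["tagged_tokens"], spans[0][1], spans[1][0]):
            return False
        if verb == "VA" and not any(_has_verb(sentence["tagged_tokens"], spans[i][1], spans[i + 1][0])
                                    for i in range(len(spans) - 1)):
            return False
        # Slovene only: require at least N terms in the nominative case.
        if sentence.get("nominative_term_count", 0) < cfg.get("min_nominative_terms", 0):
            return False
        return True

    # Roughly the high-precision setting of Experiment (aa): 1% term threshold (applied
    # upstream), 5 terms, 3 multi-word terms, verb anywhere, beginning-of-sentence,
    # multi-word term first, and 2 terms in the nominative case.
    config_aa = {"min_terms": 5, "min_multiword_terms": 3, "verb": "VA",
                 "beginning_of_sentence": True, "multiword_first": True,
                 "min_nominative_terms": 2}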


Razširjeni povzetek

Človeško znanje je dostopno v strokovnih besedilih, terminoloških slovarjih in enciklopedijah, v zadnjem času pa tudi v računalniku razumljivih predstavitvah področnega znanja, kot so taksonomije in ontologije. Ker je ročno modeliranje področnega znanja časovno in finančno zahtevno, so raziskovalci s področja jezikovnih tehnologij začeli razvijati (pol)avtomatske metode in orodja za luščenje strokovnega znanja iz nestrukturiranih besedil. Med njihove naloge prištevamo na primer luščenje terminologije, definicij ali semantičnih relacij kot tudi (pol)avtomatske pristope h gradnji taksonomij, ontologij in tematskih ontologij. Luščenji terminologije in definicij sta pomembna koraka modeliranja strokovnega znanja, vendar so razvite metode in orodja večinoma prilagojena za posamezne jezike, a le redko za manj razširjene jezike, kot je slovenščina. Zato je glavni doprinos doktorske disertacije, ki ponuja metodologijo za luščenje definicijskih stavkov iz korpusov v slovenskem in angleškem jeziku, prav luščenje definicij iz slovenskih nestrukturiranih besedil. V uvodnem poglavju predstavimo glavne cilje doktorske disertacije in prispevke k znanosti. Izhajamo iz hipoteze, da je mogoče – tudi kadar določena znanstvena veja ne razpolaga s strukturiranimi specializiranimi viri, kot so terminološki slovarji ali tezavri – s pomočjo računalniških metod samodejno izluščiti del področnega znanja iz nestrukturiranih specializiranih besedil. Osrednji cilj doktorske disertacije je iz razpoložljivih besedil polavtomatsko izluščiti model domene v obliki strokovnega izrazja in definicij. V ta namen predlagamo novo metodologijo luščenja definicij za slovenščino in angleščino in njeno implementacijo v obliki spletno dostopnega delotoka (angl. workflow). Kot obravnavano področje smo si izbrali področje jezikovnih tehnologij. Poleg glavnega doprinosa v obliki razvite metodologije luščenja definicij in njene implementacije (pri čemer še posebej poudarimo luščenje definicij iz slovenskih besedil, saj je to za razliko od angleščine še neraziskano področje) je v doktorskem delu predstavljenih še nekaj drugih pomembnih prispevkov k znanosti. Za namene modeliranja izbranega področja smo zgradili primerljiv slovensko- angleški Korpus jezikovnih tehnologij. V slovenskega smo vključili vse članke, predstavljene na konferenci Jezikovne tehnologije do vključno leta 2010, ter ga dopolnili z drugimi tipi besedil, kot so diplomske, magistrske in doktorske naloge, poglavja iz knjig in članki. Za angleščino smo zgradili slovenskemu delu primerljiv korpus. Slovenski del korpusa, ki zajema konferenčne zbornike, je dostopen prek konkordančnika na naslovu http://nl.ijs.si:3003/cuwi/sdjt_sl. Celotno metodologijo smo strnili v prosto dostopen delotok, implementiran v spletnem okolju za gradnjo delotokov Clowdflows (Kranjc et al., 2012), ki je dostopen na naslovu: http://www.clowdflows.org/workflow/1380/. V delotok lahko uporabnik prek spleta naloži korpus v različnih formatih, ga jezikoslovno označi, izlušči terminologijo in kandidate za definicije ter rezultate vizualizira ali shrani. Poleg osrednjega spletnega servisa za luščenje definicijskih stavkov, ki ga v delotoku sestavljajo trije glavni gradniki, sta za nadaljnjo rabo pomembni tudi novi implementaciji že obstoječih orodij v okolju Clowdflows: implementacija slovenskega

171171

in angleškega jezikoslovnega označevalnika ToTrTaLe (Erjavec et al., 2010), ki smo ga s soavtorji implementirali tudi kot samostojen delotok (Pollak et al., 2012c) in je dostopen na povezavi http://clowdflows.org/workflow/228/, ter implementacija slovenskega in angleškega luščilnika terminologije LUIZ (Vintar, 2010), ki je dostopen kot gradnik našega glavnega delotoka (Pollak et al., 2012a), po želji pa ga lahko vključimo tudi v druge delotoke. Izluščene angleške in slovenske definicijske kandidate smo ocenili in razvrstili s pripisanimi oznakami v dve glavni kategoriji kategoriji – 'definicije' in 'ne-definicije', poleg tega smo podizbor stavkov označili z bolj podrobnimi kategorijami, ki povezujejo leksikografski pogled na definicije (npr. podkategorije, vezane na tip ali vsebino definicije) s problematiko avtomatskega luščenja definicij iz besedil (npr. oznaka za napačno segmentiran stavek). Eden izmed rezultatov doktorske disertacije je tudi prosto dostopni pilotni Slovarček jezikovnih tehnologij (dostopen na strani http://kt.ijs.si/senja_pollak/jt_glosar/), ki smo ga ročno izdelali na podlagi avtomatsko izluščenih definicijskih kandidatov. V drugem poglavju podamo širši pregled sorodne literature. Področje predstavimo tako z jezikoslovne perspektive, kjer predstavimo predvsem leksikografski pogled na definicije, kot tudi iz jezikovnotehnološke perspektive modeliranja domene in luščenja področnega znanja, še posebej v obliki luščenja definicij. Najprej se posvetimo vprašanju odnosa med jezikom in pomenom (oz. v saussurjevski terminologiji med označevalcem in označencem). Ko nekaj izjavimo, se nanašamo na konkretne ali abstraktne realnosti. Nekatere skupine označencev so si na podlagi njihovih razlikovalnih lastnosti med seboj zelo podobne (tvorijo isti koncept) in so zelo različne od ostalih. V komunikacijskih dejanjih konceptov ne opisujemo z njihovimi razlikovalnimi lastnostmi, temveč za njih uporabljamo označevalce, to so besede oz. leksikalne enote. Njihov pomen pa lahko opišemo z definicijami, ki so zbrane v slovarjih ali slovarjem podobnih zbirkah. Definicije so v leksikografski tradiciji razdeljene v različne definicijske tipe, slovarji pa so zavezani določenim leksikografskim načelom. Odnos med označevalcem (besedo oz. leksikalno enoto) in označencem lahko razložimo s pojmom leksikalnega pomena, ki ga avtorji, kot sta Zgusta (1971) in Svensen (1993), razlagajo z njegovimi tremi sestavinami: denotat zajema objektivni pomen, konotacija subjektivni oz. emotivni pomen, obseg (angl. range of application) pa omejuje veljavnost besede glede na nekatere lastnosti, vezane na slog, pomen ali na slovnično kategorijo. V drugem poglavju uvedemo razliko med splošnim in strokovnim jezikom ter med področji leksikologije in terminologije kot tudi leksikografije in terminografije. Večji del drugega poglavja posvetimo obravnavi definicij. V filozofski literaturi avtorji ločijo več vrst definicij (Copi in Cohen, 2009; Parry in Hacker, 1991). Leksikalne definicije (angl. lexical definitions) se uporabljajo v slovarjih in razlagajo že uveljavljeni pomen definienduma. Te so resnične ali neresnične, saj točno opisujejo konvencionalno rabo besede ali pa ne. Stipulativne definicije (angl. stipulative definitions) so tiste, v katerih je definiendum (tj. definirani pojem) nov ali obstoječ izraz, ki se mu pripiše poljuben pomen, ne glede na njegov morebitni že obstoječi dejanski pomen. Za te definicije ne moremo trditi, da so resnične ali neresnične. Izostritvene definicije (angl. 
precising definitions) uporabljamo zato, da natančneje opredelimo pomen nekega izraza, vendar za razliko od stipulativnih definicij pri teh ne gre za nove, temveč za obstoječe izraze, prav tako njihov obstoječi konvencionalni

172172

pomen le zožajo, izostrijo, saj pojem podrobneje definirajo, vendar z že uveljavljenim pomenom niso v kontradikciji. Teoretične definicije (angl. theoretical definitions) so razumljivi strnjeni povzetki določene teorije. Prepričevalne definicije (angl. persuasive definitions) se uporabljajo predvsem v politični argumentaciji z namenom vplivanja na obnašanje drugih. Nadalje se posvetimo različnim definicijskim strategijam in pogledamo, katere tipe leksikalnih definicij navaja obstoječa literatura. Pojem, ki ga definiramo, se imenuje definiendum, del, ki definira njegov pomen, je definiens, oba dela pa sta lahko povezana z zglobom (angl. hinge). Glavna razlika glede načina definiranja definienduma je že pri Aristotelu postavljena med intenzionalnimi definicijami (angl. intensional definitions) in ekstenzionalnimi definicijami (angl. extensional definitions). Prve definirajo tako, da se osredotočajo na lastnosti (bistvena določila), ki so značilne za razred, ki ga definiendum opisuje (ne pa za entitete ostalih razredov), druge pa se osredotočajo na ekstenzijo definienduma, kar pomeni da navajajo vse možne oz. najbolj tipične realizacije definiranega pojma (gre torej za naštevanje vseh oz. tipičnih pripadajočih elementov razreda) (Copi in Cohen, 2009). V nadaljevanju obravnavamo različne podtipe intenzionalnih in ekstenzionalnih definicij, ki jih omenja leksikografska literatura. Najprej se posvetimo intenzionalnim definicijam. Najbolj tipične – in po mnenju nekaterih avtorjev najbolj prestižne – so leksikografske definicije z obliko genus et differentiae. To so definicije, kjer je definiendum definiran z nadpomenko oz. najbližjim rodom (genus) in vrstnimi razlikami (differentiae specificae) oz. vsaj eno bistveno značilnostjo, ki definiendum (oz. razred definienduma) ločuje od ostalih pripadnikov rodu (Svensen, 1993). Ker se pri definicijah genus-differentiae pomen definiensa analizira, se imenujejo tudi analitične definicije. Med intenzionalne definicijske strategije uvrščamo tudi definiranje s parafrazo ali s sinonimi (sintetične definicije), oz. širše razumljeno relacijske definicije, ki pojme definirajo v odnosu do drugih pojmov, na primer z njihovimi antonimi. V funkcijskih definicijah je definiendum definiran s svojo rabo, namenom oz. funkcijo. Še en podtip je definiranje s pomočjo tipičnih lastnosti, kar je običajno uporabljeno v kombinaciji z zgoraj omenjenimi analitičnimi ali funkcijskimi definicijami. Operacijske definicije pa definirajo definiendum z definiranjem specifičnih testov, ki so ponovljive operacije, ki vodijo vedno do enakih rezultatov. Druga strategija za definiranje pojmov je z njeno ekstenzijo. Za razliko od intenzionalnih definicij, ki se osredotočajo na bistvene lastnosti, s katerimi je pojem definiran, ekstenzija zajema množico stvari, na katere se pojem nanaša. Naštejemo lahko vse stvari, ki jih pojem zajema, ali pa le najbolj reprezentativne. Tudi pri ekstenzionalnih definicijah je v literaturi omenjenih več podtipov (cf. Parry in Hacker, 1991; Copi in Cohen, 2009; Zgusta, 1971; Svensen, 1993; Westerhout, 2010). Najbolj razširjen tip so navedbene definicije (angl. citational definitions), ki so tudi to, na kar mislimo, če ne specificiramo podtipa ekstenzionalnih definicij. Pri teh definicijah definirani pojem ni zaznavno prisoten, temveč se nanj nanašamo z besedami, tako da naštejemo predstavnike opisanega razreda (npr. za razlago pojma germanski jeziki naštejemo jezike, ki spadajo v to skupino). 
Za razliko od navedbenih definicij se pri ostenzivnih definicijah uporabljajo zunajjezikovne strategije, kot je npr. kazanje na elemente v prostoru. Poleg te glavne razdelitve ekstenzionalnih definicij na navedbene in ostenzivne pa literatura omenja še nekaj tipov ekstenzionalnih definicij, ki se ponavadi (a ne izključno) nanašajo na ekstenzionalne navedbene definicije. Naštevalne definicije (angl. enumerative definitions) so poseben podtip ekstenzionalnih definicij, v katerih

173173

naštejemo vse predstavnike definiranega razreda. V definiciji s paradigmaskim primerom (angl. definition by paradigm example), ki je sicer lahko navedbena ali ostenzivna, pojem definiramo z enim reprezentativnim primerom namesto naštevanja vseh ali tipičnih predstavnikov razreda. Definicije lahko tvorimo tudi z definiranjem sestavnih delov pojma (angl. partitive concept definition), npr. Benelux tvorijo Belgija, Nizozemska in Luksemburg. Zadnji tip definicij pa so kontekstualne definicije, kjer pomen v resnici ni definiran v ožjem pomenu besede, temveč je impliciran in ga je treba razbrati iz konteksta, saj med definiendumom in definiensom ni jasne strukturne ločnice. Podpoglavje se zaključi s kritično obravnavo leksikografskih principov tvorjenja dobrih definicij. Po teh načelih mora biti definiendum definiran z izrazi, ki so splošnejši od njega, izogibati se je treba krožnosti v definicijah in zbirkah, pri analitičnih definicijah se je potrebno osredotočiti na bistvene značilnosti, glede sloga pa se morajo definicije ogibati dvoumnega in metaforičnega izražanja, ter če je le možno, uporabljati trdilno obliko (prim. Jackson, 2002; Zgusta, 1971; Béjoint, 2000; Svensen, 1993). V drugem delu drugega poglavja se odmaknemo od filozofskih in leksikografskih pogledov ter se posvetimo avtomatskim pristopom modeliranja področnega znanja iz korpusov. Luščenje terminov kot osnovnih nosilcev znanja v specializiranih korpusih je že relativno dobro znano področje računalniškega jezikoslovja. Samodejne metode so bile razvite za različne jezike, npr. za angleščino Sclano in Velardi (2007), Ahmad et al. (2007), Frantzi in Ananiadou (1999), Kozakov et al. (2004) ter Vintar (2010) za slovenščino. Za dvojezično luščenje terminologije pa so na voljo komercialna (SDL MultiTerm, Similis) in nekomercialna (npr. Lefever et al., 2009; Macken et al., 2013; Vintar, 2010) orodja. V specializiranih besedilih se poleg samih terminov skrivajo še drugi dragoceni deli znanja, med njimi tudi definicije, katerih luščenje predstavlja osrednjo temo pričujoče doktorske raziskave. Metode luščenja definicij so bile razvite za več jezikov, kot so angleščina (Navigli in Velardi, 2010; Borg et al., 2010), nizozemščina (Westerhout, 2010), francoščina (Malaisé et al., 2004), nemščina (Fahmi in Bouma, 2006; Storrer in Wellinghoff, 2006; Walter, 2008), kitajščina (Zhang in Jiang, 2009), portugalščina (Del Gaudio in Branco, 2007; Del Gaudio et al., 2013), romunščina (Iftene et al., 2007), poljščina (Degórski et al., 2008a, 2008b) kot tudi za druge slovanske jezike (Przepiórkowski et al., 2007). Za slovenščino smo začeli razvijati metodologijo v Fišer et al. (2010) in Pollak et al. (2012a). Poleg luščenja definicij je pomembno področje modeliranja področnega znanja tudi luščenje semantičnih relacij, ne le nadpomenk in podpomenk, temveč tudi sinonimov, antonimov, meronimov ali vzročnih relacij (Meyer, 2001; L'Homme in Marchman, 2006). Dosedanji pristopi k samodejnemu luščenju definicij in semantičnih relacij iz specializiranih korpusov ali s spleta se v grobem delijo na dve veji: prva temelji na (ročno zgrajenih) pravilih oz. vzorcih, druga na strojnem učenju, pojavljajo pa se tudi kombinacije obeh pristopov. Na pravilih temelječi pristopi skušajo do definicij priti predvsem prek njihovih skladenjskih in leksikalnih značilnosti (vzorcev). Takšno metodo je uporabil že Hearst (1992), a tudi v novejših raziskavah se različice metode z vzorci še vedno pojavljajo (npr. 
A rule-based method is also among the approaches applied in this doctoral dissertation. The second strand of research uses machine learning methods, where definition discovery can be understood as a classification problem: from a training corpus of definitions, and in some cases also from negative examples, the algorithm tries to learn rules for distinguishing between true and false definitions.
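As a rough illustration of this classification view (and not a reproduction of any of the systems cited below), the following sketch trains a naive Bayes classifier on bag-of-words features; the toy training sentences and labels are invented for the example.

```python
# A toy illustration of definition extraction framed as sentence classification:
# naive Bayes over bag-of-words (unigram + bigram) features. The training data
# below is invented; the cited systems use far larger annotated corpora and
# richer feature sets.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_sentences = [
    "A corpus is a large collection of texts stored in electronic form.",
    "Lemmatisation is the process of mapping word forms to their base form.",
    "The experiments were run on a standard desktop computer.",
    "We thank the anonymous reviewers for their helpful comments.",
]
train_labels = [1, 1, 0, 0]  # 1 = definition, 0 = non-definition

classifier = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
classifier.fit(train_sentences, train_labels)

# Prediction on an unseen sentence (unreliable with such a tiny training set).
print(classifier.predict(["A tagger is a tool that assigns morphosyntactic labels."]))
```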


Using standard classification algorithms such as naive Bayes, decision trees and support vector machines (SVM), various authors have managed to distinguish between well-formed and poorly formed definitions (Del Gaudio and Branco, 2009; Chang and Zheng, 2007; Velardi et al., 2008; Fahmi and Bouma, 2006; Westerhout, 2010; Kobyliński and Przepiórkowski, 2008; Del Gaudio et al., 2013), while fully automatic approaches have also used genetic algorithms (Borg et al., 2010) and word-class lattices (Navigli and Velardi, 2010; Faralli and Navigli, 2013). Besides definitions and semantic relations, researchers have for some time been working on the (semi-)automatic construction of taxonomies and ontologies. Some aim to extend an existing manually built ontology such as WordNet57 (Fellbaum, 1998) or the Open Directory Project,58 while others want to create an ontology from scratch. Notable approaches include Snow et al. (2006) for incremental taxonomy construction and Yang and Callan (2009), who use clustering and compute a semantic distance for every pair of terms in the taxonomy. Kozareva and Hovy (2010) use patterns and graph-based methods. Navigli et al. (2011) and Velardi et al. (2013) first apply their method for extracting definitions and hypernyms from corpora and the web (Navigli and Velardi, 2010) and then induce a taxonomy from the graph built over all the extracted hypernyms.

We also briefly present the modeling of the language technologies domain. The ACL Anthology (Kan and Bird, 2013) is a digital archive that to date comprises over 24,000 journal and conference papers from the field of computational linguistics. A subset of these papers forms the ACL Anthology Reference Corpus, ACL ARC (Bird et al., 2008). Radev et al. (2009, 2013) have also built the ACL AAN network, which shows who cites whom, which authors collaborate, and so on. Several studies have been carried out on these corpora, mostly on discovering research topics (Hall et al., 2008; Paul and Girju, 2009; Anderson et al., 2012), with all of these authors using latent Dirichlet allocation (Blei et al., 2003). Radev and Abu-Jbara (2012) perform a citation analysis on the ACL AAN and show that it is useful for studying trends in computational linguistics, for summarization and for a number of other tasks. A task similar to ours is addressed by Reiplinger and co-authors (2012), who use lexico-syntactic patterns and deep syntactic analysis to extract glossary candidates from the English ACL ARC corpus.

We then present the area of web services and workflows. Data mining environments that support the construction and use of workflows include Weka (Witten et al., 2011), Orange (Demšar et al., 2004), KNIME (Berthold et al., 2008) and RapidMiner (Mierswa et al., 2006). Their common feature is a canvas on which the user builds workflows by simple drag and drop. Distributed processing is used in service-oriented architectures such as Orange4WS (Podpečan et al., 2012) and Taverna (Hull et al., 2006). Taverna makes workflows accessible to everyone, since their authors can upload them and make them available through a web link. ClowdFlows (Kranjc et al., 2012) is a cloud application that makes it possible to access existing workflows or to build new ones from any browser, without installing any software.
In Chapter 3 we formulate the goal of the dissertation in more detail: to build a domain model (semi-)automatically from the texts of a given domain.

57 The word wordnet is written with a lower-case initial when it refers to the type of lexical database whose structure and lexicon follow the principles of the first such project, the Princeton WordNet, for which the capitalised form is used (in this decision we follow Fišer, 2007).
58 http://www.dmoz.org/ (Last accessed: 1 December 2013)


A domain model can be understood in different ways. In the dissertation we start with the extraction of terms as important carriers of domain knowledge, but the emphasis is on the extraction of definition candidates, which allow a more complex understanding of the domain. We develop a methodology for extracting definitions from Slovene and English texts; the Slovene part is particularly important, since definition extraction for Slovene has so far remained unexplored. We apply the methodology to the domain of language technologies and package it as an online workflow, a tool that is freely accessible and simple to use. While one result of domain modeling is thus a set of extracted terms and their definitions (presented in the form of a pilot Language Technologies Glossary, Slovarček jezikovnih tehnologij), in Chapter 3 we also use an alternative approach to understanding the domain (or rather the corpus) through the construction of topic ontologies.

We then present the construction of the Language Technologies Corpus. The basic (small) corpus consists of papers from the Language Technologies (Jezikovne tehnologije) conference, which has been held in Slovenia every other year since 1998 (in English this corpus is called the LTC proceedings corpus). All the papers from the conference proceedings are assigned to the Slovene or the English part of the corpus according to their language. The corpus was also thoroughly cleaned: reference sections, author names and institutions were removed. The size of the small (LTC) corpus is 545,641 word forms. We then extended the smaller corpus with other types of documents and built the main Language Technologies Corpus (LT corpus). In constructing this reference corpus of the language technologies field in Slovenia, we added to the above-mentioned conference proceedings papers a selection of doctoral, master's and bachelor's theses, book chapters, papers from other conferences and Wikipedia articles. The English part of the comparable corpus was built following the same principle. The Slovene corpus contains 903,189 word forms, while the English part of the Language Technologies Corpus, built to be comparable to the Slovene part, contains 909,606 word forms (excluding punctuation).

The presentation of the corpus is followed by a subchapter in which we model the language technologies domain with OntoGen, a (semi-)automatic tool for building topic ontologies. Whereas an ontology (cf. e.g. Gruber, 1993) is a formal representation of knowledge describing the concepts of a domain and the relations between them, in a topic ontology (Fortuna et al., 2006a) the field, or more precisely the corpus of documents that defines the domain, is described by concepts in the form of the most characteristic keywords and by the hierarchical relations between them (sub- and superordinate concepts). The semi-automatic tool OntoGen (Fortuna et al., 2007), which we use to build topic ontologies, applies clustering to divide a set of documents into hierarchically organized concepts and subconcepts (topics) and describes them with keywords, which the user can also rename. OntoGen also supports visualization in the form of a document atlas. On the English and Slovene parts of the small (LTC proceedings) and the large (LT) language technologies corpora we built domain models in which, in addition to automatic clustering, we used OntoGen's options for manually naming the concepts and for active learning. For borderline documents, the program asks the user which category they belong to and on this basis improves the classification, or it searches for documents for missing concepts that were not initially included in the ontology.
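OntoGen itself is an interactive GUI tool; the sketch below only illustrates the underlying idea of topic-ontology construction – clustering documents and describing each cluster (concept) with its most characteristic keywords. It is not OntoGen's actual algorithm, and the toy documents are invented.

```python
# Minimal illustration: cluster documents with k-means over TF-IDF vectors and
# describe each cluster with its top keywords (a crude stand-in for the concepts
# and keyword descriptions produced by a topic-ontology tool).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def describe_clusters(documents, n_clusters=2, n_keywords=5):
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(documents)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(matrix)
    terms = vectorizer.get_feature_names_out()
    for i, centroid in enumerate(km.cluster_centers_):
        top = centroid.argsort()[::-1][:n_keywords]
        print(f"concept {i}: " + ", ".join(terms[j] for j in top))
    return km.labels_

toy_docs = [
    "acoustic model training for speech recognition",
    "prosody and speech synthesis evaluation",
    "statistical machine translation of parliamentary corpora",
    "part-of-speech tagging and parsing of Slovene text",
]
# The two clusters should roughly separate speech- from text-processing documents,
# though the exact split on such toy data may vary.
describe_clusters(toy_docs)
```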
The domain models are presented visually; here we may point out that in both languages and on both corpora – the corpus of conference proceedings papers (LTC proceedings corpus) and the main Language Technologies Corpus – the field divides into two main subfields, computational linguistics and speech technologies.


The Language Technologies Corpus is much larger and more heterogeneous, and we observed that the automatically induced categories are much less cleanly separated, which we attribute mainly to the very different lengths of the documents (from abstracts to entire doctoral dissertations). We built basic topic ontologies, in which small differences between the Slovene and the English part can be seen: for example, in the Slovene part one of the topics is named computer-assisted translation and is further divided into translation memories and machine translation, whereas the English part of the corpus includes only machine translation and does not cover translation memories or computer-assisted translation more broadly. Finally, we briefly evaluate the constructed topic ontologies.

The end of Chapter 3 serves as an introduction to the main topic of the dissertation, the extraction of definitions from text corpora. On a small excerpt of our corpus we analyze the definitions we observe. We find (and relate these findings to the theoretical chapter on definition types) that the classical definition consisting of genus and differentia is far from the only way of defining concepts in scientific texts. Besides the genus–differentia category (within which we see subgroups such as defining with the verb to be, with other verbs or without a verb), we distinguish a category of definitions by means of synonyms, antonyms, sibling terms and paraphrases, a category of extensional definitions, and a category covering the remaining types, above all defining a term through its use (functional definitions) or its properties (typifying definitions).

In Chapter 4 we present the methodology for achieving the central goal of the dissertation, that is, semi-automatic modeling of domain knowledge in the form of terms and definitions, together with the existing technologies used in our work, some of which we also evaluate. We first give a brief schematic overview of the methodology and present the methods for extracting definition sentences. The proposed methodology rests on three different approaches and their combinations. The first follows the traditional extraction approach using lexico-syntactic patterns, the second uses information obtained through automatic term recognition, and the third is based on extracting sentences that contain a term together with its hypernym (from a wordnet-type semantic lexicon). The patterns of the first approach were defined separately for each language on the basis of an analysis of a sample of definition sentences; they use lemmas, word forms and morphosyntactic tags such as noun case (for Slovene), verb person, and so on. Our second hypothesis assumes that sentences in which two terminological expressions occur are good definition candidates. To this we added further conditions, e.g. that at least one (or more) of the terms must be in the nominative case, that a verb must stand between two terms, and the like. To recognize terminologically relevant units in the text we used and adapted the LUIZ term extractor (Vintar, 2010), which proposes single- and multi-word terminological expressions on the basis of morphosyntactic patterns and a termhood score. Of course, not every sentence containing at least two terms is a definition, but such sentences are often semantically rich contexts, i.e. environments rich in knowledge and information about a term, in which definitions are found (knowledge-rich contexts, Meyer, 2001).
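To make the second hypothesis concrete, here is a stripped-down sketch of the term-based filter: a sentence containing at least two known terminological expressions is kept as a knowledge-rich candidate. The term list and sentence are toy examples; the thesis itself works with LUIZ term candidates over lemmatised, morphosyntactically tagged text.

```python
# Keep sentences that contain at least two known terms (simplified string matching;
# the real system matches lemmatised terms and applies further conditions).
TERMS = {"language model", "machine translation", "corpus", "translation memory"}

def is_knowledge_rich(sentence, terms=TERMS, min_terms=2):
    text = sentence.lower()
    found = sorted(t for t in terms if t in text)
    return len(found) >= min_terms, found

print(is_knowledge_rich("A language model assigns probabilities to sentences in a corpus."))
# -> (True, ['corpus', 'language model'])
```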
The third method targets the genus et differentia type of definition and extracts sentences that contain two expressions of which one is a hypernym of the other. To extract sentences with concepts in a hierarchical relation we used the semantic lexicon WordNet (PWN, 2010) for English and sloWNet (Fišer and Sagot, 2008) for Slovene.
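A minimal sketch of this wordnet-based idea follows: keep sentences that contain a noun together with one of its direct hypernyms. NLTK's Princeton WordNet interface is used here as a stand-in; the thesis queries WordNet/sloWNet through its own components, so this is an illustration only.

```python
# Requires: import nltk; nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def hypernym_pairs(words):
    """Return (hyponym, hypernym) pairs found among the given (single-word) nouns."""
    wordset = {w.lower().strip(".,") for w in words}
    pairs = set()
    for word in wordset:
        for synset in wn.synsets(word, pos=wn.NOUN):
            for hyper in synset.hypernyms():
                for lemma in hyper.lemma_names():
                    lemma = lemma.replace("_", " ").lower()
                    if lemma in wordset and lemma != word:
                        pairs.add((word, lemma))
    return pairs

sentence = "A treebank is a corpus in which every sentence is syntactically annotated."
print(hypernym_pairs(sentence.split()))
# Output depends on WordNet's coverage of these senses; it may well be empty.
```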


The second part of the chapter introduces the methodology for evaluating the definition extraction system. For the quantitative part of the evaluation we use the measures of precision and recall. Precision denotes the percentage of definitions among all the extracted sentences that the system proposes as definition candidates, while recall measures how many of the definitions in the corpus the system correctly detects. In most of the experiments we report the actual precision, since all the extracted candidates are evaluated, but only an estimate of recall, since the true number of definitions in the corpus is unknown; for this purpose we use a set of 150 definitions on which recall is measured. In addition to the quantitative binary decision whether an extracted sentence is a definition or not, in the evaluation of a subset of candidates we also assign additional labels, which can be combined with one another. These are qualitative in nature and mark borderline cases, overly general or overly specific definitions, and the like.

We then present the existing resources, tools and programs used in our work. We start with a description of the linguistic annotation tool ToTrTaLe (Erjavec et al., 2011), with which English and Slovene texts are segmented, lemmatised and annotated with morphosyntactic tags; we implemented this existing tool as a web service. We also describe LUIZ (Vintar, 2010), a terminology extractor for Slovene and English that works on the basis of morphosyntactic patterns and a comparison of word frequencies in the given corpus against a reference corpus. We implemented the tool as a workflow widget and used it in our workflow for terminology extraction and in one of the definition extraction methods. We also devote attention to the lexical databases WordNet (Fellbaum, 1998) and sloWNet (Fišer and Sagot, 2008), in which words (literals) are grouped into sets of synonyms (synsets), each synset representing one concept. The synsets, or concepts, are organized in a network through relations such as hypernymy and hyponymy, antonymy and meronymy (part–whole). SloWNet is similar to WordNet, but it is an automatically built resource whose synsets are linked to the original English WordNet. WordNet and sloWNet are used in one of the definition extraction methods. We present in more detail the ClowdFlows platform (Kranjc et al., 2012), which was used to implement the terminology and definition extraction workflow. ClowdFlows consists of a workflow editor (a graphical user interface), in which the user can also choose among existing widgets, and of a server side, invisible to the user, which takes care of executing workflows and of storing a large number of publicly available workflows.

Finally, we briefly evaluate the ToTrTaLe annotator and the LUIZ term extractor, since these two technologies have a substantial impact on the definition extraction results. For ToTrTaLe we describe the errors, focusing mainly on Slovene; they stem from incorrect sentence segmentation, incorrect assignment of morphosyntactic tags and incorrect lemmatisation. Errors that occur systematically can be partly remedied with a short rule-based post-processing program, which can be run by selecting an additional parameter in the workflow, i.e. in the ToTrTaLe widget. In the evaluation of the terminology extractor we report the precision and recall of this component and also compute inter-annotator agreement. Precision is rated on a scale from 1 to 5 for the top 200 candidates: for around 20% of the extracted terms the two annotators state that the candidate is a fully-fledged term, i.e. a fully lexicalised phrase denoting a concept from the field of language technologies, while around 90% of the candidates receive a positive score from both annotators. Recall is measured by having an expert annotate all the terms in a smaller subcorpus and then checking how many of them are detected by LUIZ.
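In symbols (our notation, not quoted from the dissertation): if C is the set of extracted candidate sentences, D the subset of C manually judged to be definitions, and G the gold-standard set of 150 definitions used for the recall estimate, then

```latex
\[
\mathrm{Precision} = \frac{|D|}{|C|}, \qquad
\widehat{\mathrm{Recall}} = \frac{|C \cap G|}{|G|}, \qquad |G| = 150 .
\]
```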
If only the top 200 term candidates are considered, recall is approximately 25%; if all candidates are considered, it is around 71% for Slovene and 82% for English. For precision we also computed the agreement between the two annotators: with linearly weighted kappa, agreement is 32% for Slovene and 26% for English. Despite slight variations in the weighting of kappa, the results point to slight (i.e. better than chance) to moderate agreement, but never to substantial or almost perfect agreement.
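For orientation, the linearly weighted kappa on the five-point scale used for term evaluation is standardly written as follows (the dissertation reports only the resulting values), with o_ij and e_ij the observed and chance-expected proportions of candidates rated i by one annotator and j by the other:

```latex
\[
\kappa_w = 1 - \frac{\sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}\, o_{ij}}
                    {\sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}\, e_{ij}},
\qquad
w_{ij} = \frac{|i-j|}{k-1}, \quad k = 5 .
\]
```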


Chapter 5 describes the central experiments of the dissertation. It is divided into three subchapters: the first covers definition extraction from the Slovene part of the corpus, the second extraction from the English part, and the third first summarizes the main results of the first two and then presents its main contribution, the various combinations of the three methods. In this chapter we also point out the limitations of quantitative evaluation and analyze different types of definition candidates.

We begin the extraction with lexico-syntactic patterns from Slovene texts with an experiment on the simplest pattern, »X je Y« ('X is Y'). We examine seven variations of the pattern, from the form 'noun je/sta/so noun' to 'noun phrase in the nominative (X) je/sta/so noun phrase in the nominative', and also check the effect of requiring the pattern to occur at the beginning of the sentence and of requiring that the second noun or noun phrase be followed by further material. For subsequent work we choose the second pattern given above, without the sentence-initial condition, since it has much better recall at the cost of slightly lower precision than some of the more restrictive patterns. In addition to this pattern, which uses the third-person present tense of the verb biti ('to be') linking two noun phrases in the nominative, we defined, on the basis of the analyzed sample of definitions, 11 further patterns using, for example, the verbs definirati, opredeliti, opisati, nanašati se, pomeniti, imenovati, poimenovati and govoriti o ('to define', 'to characterise', 'to describe', 'to refer to', 'to mean', 'to name', 'to term', 'to speak of') in various contexts. If all 12 pattern types are used, we extract 389 definitions from the corpus, which we estimate at just under 60% recall, with a precision of 22.5%. We also evaluate the individual patterns, which will allow us, in further work, to keep only the better-performing patterns if the results are confirmed on other types of corpora. All the patterns are illustrated with extracted examples, and we also analyze incorrectly extracted examples, such as overly general or overly specific sentences, out-of-domain sentences, and so on. We further show that some sentences we would expect the system to extract are not among the candidates, which we attribute, among other things, to errors of the ToTrTaLe morphosyntactic annotation.

The second approach starts from the basic hypothesis that definitions contain at least two terminological expressions. To this basic condition we add a number of further conditions that restrict the selection of candidates, since the number of terms alone is clearly not a sufficient criterion for definition extraction, although it does allow the detection of information-rich linguistic environments. To improve precision, we tested the following hypotheses, which are implemented as parameters in the workflow. The first hypothesis is that precision is higher if only terms with a higher termhood score are taken into account; we tested this by varying the parameter that determines the percentage of top-ranked terms considered (e.g. the top 1% contains more reliable candidates than the top 10%). The second hypothesis is that better candidates are those containing more terminological expressions (tested by requiring at least three terms per sentence instead of only the basic condition of at least two). The next hypothesis checks whether precision is affected by the condition that a verb stand between two terms; if more than two terms are considered, we test two variations, requiring the verb either between the first two terms or between any two.
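As a deliberately simplified, plain-text stand-in for the pattern-based experiments described above (the real patterns operate on lemmas and morphosyntactic tags, e.g. the nominative case for Slovene), here is an English counterpart of the simplest »X je Y« pattern; the regular expression and examples are ours.

```python
# Toy "NP is/are a(n) NP" matcher; the actual system uses language-specific
# patterns over ToTrTaLe-annotated text, not raw strings.
import re

IS_A = re.compile(
    r"^(?P<definiendum>[A-Z][\w-]*(?:\s+[\w-]+){0,3})\s+(?:is|are)\s+(?:a|an|the)\s+"
    r"(?P<genus>[\w-]+(?:\s+[\w-]+){0,5})"
)

def pattern_candidates(sentences):
    """Yield (sentence, definiendum, genus) triples matching the toy is-a pattern."""
    for sent in sentences:
        m = IS_A.match(sent)
        if m:
            yield sent, m.group("definiendum"), m.group("genus")

sents = ["A treebank is a parsed corpus annotated with syntactic structure.",
         "The experiment was repeated three times."]
for s, x, y in pattern_candidates(sents):
    print(f"{x!r} -> {y!r}")
# 'A treebank' -> 'parsed corpus annotated with syntactic structure'
```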
The next condition that we expect to improve precision is a term at the beginning of the sentence (as the first or second word). A further hypothesis is that precision will be better if the first term is a multi-word terminological expression, and also if several of the detected terms are multi-word expressions. The last condition, and perhaps the most important one for Slovene, is how many terms must be in the nominative case, which loosely targets definitions of the »X je Y« type, but without restricting the verb to biti ('to be') or to other predefined verbs as in the pattern-based approach.
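The conditions just listed can be pictured as boolean filters over a tagged sentence; a rough sketch follows, in which the token and tag formats are simplified placeholders (the real system works on ToTrTaLe output and LUIZ term lists, and the parameter names are ours).

```python
from dataclasses import dataclass

@dataclass
class Token:
    lemma: str
    msd: str  # simplified MSD tag, e.g. "Ncmsn" (noun ... nominative), "Vmpr3s" (verb)

def passes_conditions(tokens, terms, min_terms=2, min_nominative=1,
                      verb_between=False, term_near_start=False):
    positions = [i for i, t in enumerate(tokens) if t.lemma in terms]
    if len(positions) < min_terms:
        return False
    nominative = sum(1 for i in positions
                     if tokens[i].msd.startswith("N") and tokens[i].msd.endswith("n"))
    if nominative < min_nominative:
        return False
    if term_near_start and positions[0] > 1:
        return False
    if verb_between:
        between = tokens[positions[0] + 1:positions[1]]
        if not any(t.msd.startswith("V") for t in between):
            return False
    return True
```

Tightening these parameters trades recall for precision, which mirrors the experimental findings reported in Appendix A.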


We tested these hypotheses in detail and present the results of the individual experiments in Appendix A. The hypotheses listed above were confirmed: in general, the higher the termhood score and the number of terms in the nominative, the higher the precision and the lower the recall, and the same holds for the number of terms and the number of multi-word terms. Additional restrictions, such as a verb between the terminological expressions, a term at the beginning of the sentence and a multi-word term positioned before single-word ones, further improve precision in most cases. The highest precision is reached with the strictest conditions, but in that case only 27 definitions are extracted, at a precision of about 26%. With the selected, more moderate combination of conditions we extract 126 definitions with a precision of around 17.5%. In addition to the quantitative results, we also analyze the extracted candidates. We compare the term-based method with the pattern-based method and show the advantages and shortcomings of each. In general, pattern-based extraction offers a better trade-off between precision and recall, but term-based extraction also has some advantages: it is looser and allows the extraction of definition sentences that use verbs we did not define in advance, which is particularly important when a term is defined through its use. For the same reason the method also captures more complex sentences in which a definition is embedded, and a further advantage is that it is much less sensitive to linguistic annotation errors introduced in preprocessing.

In the third method we extract definitions with the help of a wordnet-type semantic lexicon. To extract definition candidates from Slovene texts we used the semantic lexicon sloWNet (Fišer and Sagot, 2008), with which we extracted from the corpus all sentences containing at least two sloWNet concepts of which one is a hypernym of the other. This method has the lowest precision: 270 definitions are extracted at a precision of only six percent, which can be attributed to the fact that the hypernym–hyponym pairs from the wordnet are not specific to the field of language technologies, a field that is poorly covered in sloWNet.

The next subchapter deals with definition extraction from English documents. For the first method, extraction with lexico-syntactic patterns, we adapted the patterns for Slovene definition extraction to English. We tested two different settings, one in which the pattern must start at the beginning of the sentence and another in which patterns are searched for anywhere in the sentence. We introduced this additional sentence-start option because English does not express case in the same way as Slovene, so the same restrictive conditions cannot be used in the patterns. The sentence-start requirement turns out to be a good condition for higher precision, which, however, comes at the cost of lower recall: without this condition we extract 273 definitions with a precision of around 12%, whereas with the additional condition we extract 185 definitions with a precision of approximately 33%. We also test an intermediate solution that allows more varied sentence beginnings, extracting 200 definitions with a precision of 28.5%. In extracting definitions from English texts with the help of term candidates, we test the same hypotheses as for Slovene, with the exception of the nominative-case condition; this is also the reason for the considerably poorer performance of the system for English (below 10%).
For extraction with the help of WordNet (Fellbaum, 1998), the situation is similar to Slovene: we extract overly general hypernym–hyponym pairs rather than pairs specific to the domain. Precision is similar to that for Slovene (only 4%), which means that the tested method is not usable on its own. For each method we also analyze the extracted sentences and comment on possible reasons for the good or deficient results. We summarize the results of the individual methods and then turn to the various combinations of the methods for both languages.


The different combinations of methods are geared towards better precision or better recall. From the Slovene corpus we extract 486 definitions at 20% precision, or 107 definitions at 32% precision. For definition extraction from English texts, a combination of methods can reach as much as 54% precision (but then only 58 definitions are extracted), while another combination yields 230 definitions at 22.5% precision.

The decision whether a given sentence is a definition or not is not always obvious, which is also reflected in the inter-annotator agreement. In an experiment with 21 annotators, 15 sentences were assessed and Randolph's variant of the kappa statistic was computed (Randolph, 2005; Warrens, 2010), in which 0 denotes chance agreement and –1 and 1 denote complete disagreement and complete agreement, respectively. The inter-annotator agreement is 0.36, which is well below 0.7, the value usually taken to indicate good agreement. We also illustrate the differences between sentences on which almost all annotators agree and those on which their judgements diverge the most.

The last part of Chapter 5 is devoted to qualitative analysis. We analyzed 3,404 sentences containing 716 definitions (for Slovene and English together). After sorting these sentences into the two main categories according to whether or not they are definitions, we labelled them with subcategories. Besides unambiguous sentences that clearly belong to the definition or non-definition category (no additional label) and labels for less clear cases (marked with a question mark), the candidates were assigned labels relating to the form of the definition (e.g. extensional definitions, definitions without a hypernym (functional definitions), definitions with only a hypernym (classificatory definitions)), labels relating to the content of the definition (e.g. overly general or overly specific), to the definiendum (marking whether it is a proper name, an abbreviation or a term outside the domain under consideration), and to segmentation (marking candidates with segmentation errors, as well as cases in which one definition spans several sentences or one sentence contains several definitions). The last category contains labels relating to possible errors in the annotation procedure; these include sentences that occur several times in the corpus and sentences whose basic definition/non-definition classification was changed after the first assessment. Each label (labels can also be combined) is illustrated with a sentence marked as a definition and one marked as a non-definition.

In Chapter 6 we present the workflow that implements our methodology and allows simple use without any software installation. Together with co-authors (Pollak et al., 2012a, 2012c) we built the workflow in the ClowdFlows environment. With the first widget, Load corpus, the user loads an arbitrary corpus, which can be in various formats (PDF, DOC, DOCX, TXT or HTML) and can consist of individual documents or ZIP files. This is followed by the already mentioned ToTrTaLe widget, which calls the ToTrTaLe web service for text annotation. The user chooses between Slovene and English; the additional options (parameters) are post-processing, which corrects some segmentation and morphosyntactic tagging errors, XML export format and, for Slovene, the tagging of historical Slovene (this part of the process, corpus loading and annotation with ToTrTaLe, is also available as a stand-alone workflow: http://clowdflows.org/workflow/228/).
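Returning to the experiment with 21 annotators and 15 sentences mentioned above: Randolph's free-marginal multirater kappa is standardly computed as below (this formulation is given only for orientation and is not quoted from the dissertation), where N is the number of items, n the number of raters, q the number of answer categories and n_ij the number of raters who assigned item i to category j:

```latex
\[
\kappa_{\mathrm{free}} = \frac{\bar{P}_o - \tfrac{1}{q}}{1 - \tfrac{1}{q}},
\qquad
\bar{P}_o = \frac{1}{N\,n(n-1)} \sum_{i=1}^{N} \sum_{j=1}^{q} n_{ij}\,(n_{ij}-1) .
\]
```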
The next workflow widget, Term extraction, implements a slightly adapted version of the LUIZ terminology extractor (Vintar, 2010); the user again chooses between Slovene and English. This is followed by the main part, the web service for extracting definition candidates. The web service has three operations. The first is implemented in the Definition extraction by patterns widget, where the user specifies the language and sentences matching the predefined lexico-syntactic patterns are extracted. An additional parameter lets the user choose whether the pattern must occur at the beginning of the sentence or may occur anywhere in it (we currently use this parameter for English, for which case information cannot be exploited).


The second definition extraction widget (Definition extraction by terms) extracts information-rich sentences; the user can choose among parameters of varying strictness, depending above all on whether the method is to be used on its own or in combination with the other methods. The available parameters implement the hypotheses mentioned in the description of the previous chapter, namely the number of terms, of terms in the nominative and of multi-word terms, a term at the beginning of the sentence, and a verb between two terminological expressions. The last definition extraction widget, Definition extraction by wordnet, implements candidate extraction using a wordnet.

A few auxiliary widgets follow. Merge sentences combines the extracted definition candidates in different ways: we can take all the candidates and only remove duplicates, obtaining the sentences extracted by the different methods, or we can keep only those sentences returned by at least two, or by all three, of the methods. Since the workflow is modular, the user can also build any other combination of the methods. Additional widgets that are implemented not as web services but as local widgets are Term viewer, Sentence viewer and String to file; the first two allow the user to view the extracted terms (with lemmas and in canonical form) and the definition candidates, while the third is used to save the results.

In the final chapter we present the conclusions, the main contributions of the dissertation and plans for further work. The main goal of the dissertation was to develop a procedure that enables the user to (semi-)automatically extract a model of domain knowledge from unstructured texts in the form of a draft glossary. The central part of the methodology is definition extraction through a combination of three methods (pattern-based extraction, term-based extraction and extraction using hyponym–hypernym pairs from a wordnet). More attention is devoted to extraction from Slovene texts, since no comparable methods exist for Slovene yet. A further contribution of the dissertation is the implementation of the entire process – from loading the corpus to inspecting the extracted terminology and definitions – in the form of a publicly accessible workflow that is easy to use for translation, linguistic or terminographic purposes. The individual components of the workflow – among them the tool for the linguistic annotation of Slovene and English corpora – are also available for inclusion in other natural language processing workflows. On the basis of the comparable English–Slovene Language Technologies Corpus built for this purpose, we extracted and evaluated a large number of definition candidates and condensed the final selection into the pilot Language Technologies Glossary. An important contribution of the dissertation is also the qualitative analysis of the automatically extracted definition candidates: in addition to the basic classification into two categories (a sentence is or is not a definition), the final set of sentences was analyzed and labelled with more detailed categories, separated into groups relating to the form of the definition, the content of the definition, the definiendum, segmentation and annotation. To capture the different topics covered by the corpus, we also offer a basic model in the form of topic ontologies built with OntoGen. Compared with the work of other authors, our work has several advantages, which we highlight below.
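The merging strategies described above amount to simple set operations over the three candidate lists; a minimal sketch follows (function and variable names are ours, not the widget's actual code).

```python
# Union of all candidates (duplicates removed), sentences returned by at least two
# methods, or sentences returned by all three.
from collections import Counter

def merge_candidates(by_patterns, by_terms, by_wordnet, mode="union"):
    methods = [set(by_patterns), set(by_terms), set(by_wordnet)]
    counts = Counter(s for m in methods for s in m)
    if mode == "union":
        return set(counts)                                   # all candidates, deduplicated
    if mode == "at_least_two":
        return {s for s, c in counts.items() if c >= 2}
    if mode == "intersection":
        return {s for s, c in counts.items() if c == 3}
    raise ValueError(mode)
```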
The entire process is implemented as a freely accessible workflow, available to all users without prior expertise or any system installation, which makes it – to our knowledge – the only system of its kind. We focus not only on quantitative results but also analyze and critically assess the problems that arise when extracting definitions from unstructured scientific texts. The definition extraction system for Slovene is itself a novelty. Besides extraction using lexico-syntactic patterns, which has quite a long tradition (cf. Hearst, 1992), we also introduce the two looser methods based on terms and on a wordnet, and all three methods can be combined with one another.


Unlike some other approaches that use machine learning (e.g. Navigli and Velardi, 2010), our methodology does not require manually annotated corpora in advance. So far we have applied the method to a single corpus, in the language technologies domain, which we wish to extend in future work. The relatively low precision and recall are partly attributable to the rather demanding style of academic papers, which only rarely contain the most typical forms of definitions. Our method makes it possible to find atypical definitions as well, but a fairly large number of candidates must be reviewed to obtain a good selection of true definitions.

In future work we will extend the work on several levels. Since the workflow is modular, individual widgets can be added or replaced, provided that the alternative components are available as web services. It would be interesting to include the topic ontology step in the workflow and to link it with definition extraction. For linguistic annotation we will test the impact of TreeTagger (Schmid, 1994) for English and of the recently developed Obeliks for Slovene (Grčar et al., 2012), and for terminology extraction we would like to test the systems of Sclano and Velardi (2007) or Macken et al. (2013). In addition, we will try to improve definition extraction with machine learning (initial experiments were presented in Fišer et al. (2010); in new experiments we will also build features from the attributes presented in this work). We will add the alignment of terms extracted from comparable corpora using the method presented in Ljubešić et al. (2011) and Fišer et al. (2011), and we will continue to study the effects of different evaluation schemes (e.g. the five-point scale of Macken et al., 2013). The emphasis will be on testing the methodology on new, including more popular-science, texts and on comparing the method with the work of other authors, for which we will try to adapt the method presented in Faralli and Navigli (2013) to Slovene. We will also examine the possibility of using the presented method as a preprocessing step for various text mining applications.


STATEMENT OF AUTHORSHIP OF THE DOCTORAL DISSERTATION

I, the undersigned Senja Pollak, enrolment number 18091231, born on 2 October 1980 in Koper, am the author of the doctoral dissertation entitled:

Polavtomatsko modeliranje področnega znanja iz večjezičnih korpusov
Semi-automatic Domain Modeling from Multilingual Corpora

With my signature I certify that:

- the submitted doctoral dissertation is exclusively the result of my own research work;
- I have ensured that the works and opinions of other authors used in the submitted work are listed in the list of references and cited in accordance with international standards and with the legislation of the Republic of Slovenia in the field of copyright and related rights;
- the electronic version is identical to the printed version of the doctoral dissertation;
- I consent to the publication of the doctoral dissertation on the websites of the Faculty of Arts, University of Ljubljana.

Ljubljana, ______

Author's signature: ______