Senja Pollak Polavtomatsko Modeliranje Področnega Znanja Iz

Univerza v Ljubljani/University of Ljubljana
Filozofska fakulteta/Faculty of Arts
Oddelek za prevajalstvo/Department of Translation

Senja Pollak

Polavtomatsko modeliranje področnega znanja iz večjezičnih korpusov
Semi-automatic Domain Modeling from Multilingual Corpora

Doktorska disertacija/Doctoral dissertation

Študijski program/Study program: Prevodoslovje/Translation Studies
Mentorica/Supervisor: Izr. prof. dr. Špela Vintar, University of Ljubljana, Faculty of Arts, Ljubljana, Slovenia
Somentorica/Co-supervisor: Prof.ssa Paola Velardi, University La Sapienza, Rome, Italy

Ljubljana, 2014

Table of contents

Acknowledgements
Povzetek (Summary)
Abstract

1 Introduction
  1.1 Domain modeling
  1.2 Research goals
  1.3 Contributions to science
  1.4 Structure of the thesis

2 Background and related work
  2.1 Language and meaning: Linguistic perspective
    2.1.1 Lexical meaning
    2.1.2 Lexicography and terminography
    2.1.3 Dictionaries and terminological collections
    2.1.4 Definitions
    2.1.5 Types of lexical definitions (defining strategies)
    2.1.6 Lexicographic principles of meaningful definitions
  2.2 Domain modeling: Computational perspective
    2.2.1 (Semi-)automatic domain modeling: Extracting terms, definitions, semantic relations and ontology construction
    2.2.2 Modeling the domain of language technologies
    2.2.3 Web services and workflows

3 Problem description, corpus presentation and initial domain modeling
  3.1 Problem description
  3.2 Building the Language Technologies Corpus
    3.2.1 Constructing the small LTC proceedings corpus
    3.2.2 Constructing the main Language Technologies Corpus (LT corpus)
  3.3 Domain modeling through topic ontology construction
    3.3.1 Modeling the LTC proceedings corpus
    3.3.2 Modeling the Language Technologies corpus
  3.4 Setting the stage for automatic definition extraction: Analyzing definitions in running text
    3.4.1 Genus et differentia definition type
    3.4.2 Defining by paraphrases, synonyms, sibling concepts or antonyms
    3.4.3 Extensional definitions
    3.4.4 Other types of definitions: defining by purpose or properties

4 Methodology and background technologies
  4.1 Overview of the definition extraction methodology
  4.2 Definition extraction evaluation methodology
  4.3 Background technologies and resources
    4.3.1 ToTrTaLe morphosyntactic tagger and lemmatiser
    4.3.2 LUIZ terminology extractor
    4.3.3 WordNet and sloWNet
    4.3.4 ClowdFlows workflow composition and execution environment
  4.4 Evaluation of selected background technologies
    4.4.1 ToTrTaLe evaluation
    4.4.2 LUIZ evaluation

5 Definition extraction from Slovene and English text corpora
  5.1 Extracting definitions from Slovene texts
    5.1.1 Pattern-based definition extraction
    5.1.2 Term-based definition extraction
    5.1.3 SloWNet-based definition extraction
  5.2 Extracting definitions from English texts
    5.2.1 Pattern-based definition extraction
    5.2.2 Term-based definition extraction
    5.2.3 WordNet-based definition extraction
  5.3 Results of Slovene and English definition extraction methods and their combinations
    5.3.1 Combining different approaches on the Slovene subcorpus
    5.3.2 Combining different approaches on the English subcorpus
    5.3.3 Subjectivity of evaluation results
    5.3.4 Analysis of different types of definition candidates

6 Workflow implementation in ClowdFlows
  6.1 Load corpus widget
  6.2 ToTrTaLe widget
  6.3 LUIZ widget
  6.4 Definition extraction widgets
    6.4.1 Pattern-based definition extraction widget
    6.4.2 Term-based definition extraction widget
    6.4.3 WordNet-based definition extraction widget
  6.5 Auxiliary widgets
    6.5.1 Merge sentences widget
    6.5.2 String to file widget
    6.5.3 Term viewer widget
