
Doctoral Thesis

The Semi-automatic expansion of existing terminological ontologies using knowledge patterns discovered on the WWW: An implementation and evaluation

PhD Series, No. 28.2007, ISBN 9788759383377


by Jakob Halskov

Copenhagen Business School Department of International Language Studies and Computational Linguistics

Submitted: March 22, 2007

Supervisor: Bodil Nistrup Madsen
Associate supervisor: Caroline Barrière

Abstract

The research object of this thesis is the so-called knowledge patterns and their usefulness in automatically extracting specific semantic relations from unannotated and uncategorized text on the WWW so as to facilitate semi-automatic updating and extension of existing ontological and terminological resources.

The main contribution of the thesis is the implementation of a complete ontology extension framework called WWW2REL which is 100% based on a knowledge-poor, domain-independent processing of WWW text snippets and includes the three stages of pattern discovery, pattern filtering and relation instance ranking. Unlike most comparable systems, WWW2REL is special in that it is highly portable, can be applied to any semantic relation type and operates directly on uncategorized WWW text snippets.

The system is tested on the biomedical UMLS Metathesaurus for four different relation types and manually evaluated by four domain experts. It is demonstrated that high precision in the task of knowledge discovery from a noisy text source can be achieved using a very simple instance relevance measure and two ranking heuristics. In contrast, many comparable systems operate on richly annotated academic text and tend to apply heuristics which are custom-tailored to a specific domain and/or relation type. With the overall best ranking scheme, average system performance across all four relation types ranges between 70% and 65% of the maximum possible F-score for the top 10 and top 50 relation instances, respectively.

Finally, the thesis experiments also examine the portability of individual knowledge patterns and of the ranking heuristics. It is concluded that synonymy KPs are the most domain-independent, closely followed by ISA KPs, whereas patterns for may_prevent and especially induces are more dependent on the domain. Empirical experiments also suggest that a ranking heuristic which penalizes relation instances whose arguments occur frequently in a general language corpus can be highly effective, but may need to be adapted to the domain in question.

Resumé (Danish abstract)

The research object of this project is the so-called knowledge patterns and their usefulness for the automatic retrieval of semantic relations from unannotated and uncategorized text material on the WWW, with a view to the semi-automatic updating of existing terminological resources.

The main contribution of the thesis consists in an implementation and evaluation of a complete ontology extension tool, called WWW2REL, which is 100% based on a knowledge-poor and domain-independent processing of text fragments on the WWW and comprises both pattern identification and pattern filtering as well as an automatic relevance assessment of the extracted relations. Unlike most comparable systems, WWW2REL stands out by being both domain- and relation-independent and at the same time web-based.

The system is tested on the biomedical UMLS Metathesaurus for four different relation types and evaluated manually by four domain experts. It is demonstrated that domain-independent knowledge retrieval from an uncategorized text source can be performed with high precision by means of a very simple relevance measure and two heuristic ranking methods. Many comparable systems rely exclusively on academic and semantically annotated texts and are often tailored to a single domain and/or a particular relation type. With the overall best ranking algorithm, the system achieves on average between 70% and 65% of the highest possible F-score for the top 10 and top 50 relation candidates, respectively.

The thesis further investigates the applicability of the individual knowledge patterns and ranking methods across domains. It is concluded that knowledge patterns for synonymy are the most domain-independent, closely followed by patterns for the generic relation, whereas knowledge patterns for the two causal relations are less portable. Finally, empirical experiments indicate that a ranking method which penalizes relations whose arguments are frequent in a general language corpus can be very effective, but should probably be adapted to the individual domain.

List of Tables

1 Academic word families ...... 31
2 NP compression in academic writing ...... 32
3 Example semantic relations in the UMLS ...... 49
4 Data mining, text mining, IR and computational linguistics ...... 60
5 Performance of pattern-based relation extraction systems ...... 63
6 System test (from [Mukherjea and Sahay, 2006]) ...... 66
7 Comparison of pattern-based relation extraction systems ...... 73
8 A contingency table of observed and expected frequencies ...... 78
9 Characteristic VPs in the BioMed corpus ranked by log-likelihood versus the BNC ...... 80
10 Characteristic VPs in the BioMed corpus ranked by log-odds versus the BNC ...... 81
11 Effect inducing drugs (examples) ...... 84
12 Example term variants for an induces/induced_by relation ...... 85
13 Effects of variant expansion, lexical and frequency filtering for ISA ...... 86
14 Lexically filtered, relatively frequent UMLS term pairs for KP discovery (examples) ...... 86
15 Knowledge patterns in context ...... 87
16 Query templates, training pairs and corpus sizes per relation type ...... 89
17 Top 10 unfiltered patterns by frequency of occurrence in snippets ...... 91
18 Top 10 unfiltered ISA patterns by frequency of occurrence in snippets ...... 91
19 Top 10 "induces" and "may_prevent" patterns containing a verb ...... 92
20 Negative term pairs for the "induces" relation ...... 95
21 Negative term pairs for the "may_prevent" relation ...... 95
22 Negative term pairs for the synonymy relation ...... 95
23 Non-ISA pairs ...... 96
24 Querying Google ...... 97
25 Positive pairs ...... 99
26 Negative pairs ...... 99
27 Induces pattern candidates (examples) ...... 101
28 May_prevent pattern candidates (examples) ...... 101
29 Synonymy pattern candidates (examples) ...... 101
30 ISA pattern candidates (examples) ...... 102
31 ISA KPs filtered by iteration range, average sample frequency and precision ...... 103
32 Synonymy KPs filtered by iteration range, average sample frequency and precision ...... 104
33 Number of filtered KPs used in system evaluation ...... 104
34 Automatic NP conflation ...... 107
35 Categories used in manual evaluation of relation correctness ...... 111
36 "may_prevent" - most frequent STY combinations ...... 113
37 "induces" - most frequent STY combinations ...... 114
38 System inputs for evaluation ...... 114
39 Interpretations of kappa values ...... 115

40 Inter-annotator agreement across all experiments ...... 116
41 How unsure are the experts? ...... 117
42 All candidates of "haloperidol ISA X" where the head is "antipsychotics" ...... 120
43 Correct candidates in individual experiments ...... 121
44 Aspirin induces X - top 10 candidates ...... 133
45 "Aspirin induces X": precision of sample-based schemes ...... 133
46 Selenium may_prevent X - top 10 candidates ...... 135
47 "Selenium may_prevent X": precision of sample-based schemes ...... 135
48 X induces vomiting - top 10 candidates ...... 137
49 "X induces vomiting": precision of sample-based schemes ...... 137
50 "X induces emesis": precision of sample-based schemes ...... 139
51 X induces emesis - top 10 candidates ...... 139
52 Drugs which induce emesis|vomiting ...... 140
53 "{drugs} induce emesis": precision of sample-based schemes ...... 141
54 Ranking of "glucose" synonyms ...... 142
55 Synonyms of "glucose" - top 10 candidates ...... 143
56 Levels of precision in chemical nomenclature ...... 144
57 Synonyms of "lactose" - top 10 candidates ...... 146
58 Synonyms of "formaldehyde": precision and recall of sample-based schemes ...... 147
59 Synonyms of "formaldehyde" - top 5 candidates ...... 147
60 Synonyms of "progesterone": precision of sample-based schemes ...... 148
61 Synonyms of "progesterone" - top 10 candidates ...... 148
62 Synonyms of "vitamin C": precision of sample-based schemes ...... 150
63 Synonyms of "vitamin C" - top 10 candidates ...... 150
64 Haloperidol ISA X - top 10 candidates ...... 151
65 Haloperidol ISA X - top 10 heads ...... 152
66 Haloperidol ISA X: precision of sample-based schemes ...... 152
67 "X ISA antipsychotic": precision of sample-based schemes ...... 154
68 "X ISA antipsychotic" - top 10 candidates ...... 154
69 Summary chart - overall system performance (F-scores) ...... 156
70 Summary chart - overall best ranking scheme ...... 156
71 Inter-system precision comparison ...... 157
72 Precision and recall of "aspirin induces X" patterns (expert judgments) ...... 162
73 Precision and recall of "X induces emesis" patterns (expert judgments) ...... 163
74 Precision and recall of "X induces vomiting" patterns (expert judgments) ...... 163
75 Precision and recall of "may_prevent" patterns (expert judgments) ...... 164
76 Recall and precision per template ...... 164
77 Precision and recall of ISA patterns (per template) ...... 165
78 Precision and recall of synonymy patterns (expert judgments) ...... 167
79 Existing relations in the UMLS Metathesaurus ...... 169
80 Recall versus UMLS - "aspirin has_physiologic_effect X" ...... 169
81 UMLS term variants of "alcohol" ...... 171
82 "New" knowledge retrieved per experiment (manual analysis) ...... 172
83 New relation instances retrieved by WWW2REL (examples) ...... 172

84 "perl ISA X" - top 10 candidates ...... 175
85 "X ISA programming language" - top 10 candidates ...... 176
86 Portability of ISA KPs ...... 177
87 "firewall(s) may_prevent X" - top 10 candidates ...... 177
88 "firewall(s) may_prevent X" - top 10 heads ...... 178
89 Portability of "may_prevent" KPs ...... 179
90 "computer virus(es) induce X" - top 10 candidates ...... 179
91 Portability of "induces" KPs ...... 180
92 Synonyms of "subroutine" - top 10 candidate heads ...... 181
93 Portability of synonymy KPs ...... 181
94 Discontinuous knowledge patterns (examples) ...... 189
95 Inversion of causal relations ...... 189

List of Figures

1 The DIPRE technique ...... 22
2 Unicentricity vs. pluricentricity ...... 30
3 The semiotic triangle ...... 32
4 Terms as stones in clay (from [Melby, 1995]) ...... 34
5 A fragment of the "Entity" subontology in the UMLS Semantic Network ...... 41
6 Typology of ontologies ...... 44
7 Terminological ontologies: Computer Aided Ontology Structuring (CAOS) ...... 46
8 Corpus typology ...... 54
9 The role of WWW in the system implementation ...... 75
10 WWW2REL diagram: discovering KPs and extracting relation instances from the WWW ...... 76
11 Drug hyponyms from UMLS used to discover ISA KPs ...... 83
12 "induces": Average precision of unfiltered KPs ...... 98
13 "may_prevent": Average precision of unfiltered KPs ...... 98
14 Frequency distribution of unfiltered pattern candidates ("induces") ...... 100
15 Ranking by "frq" (ISA, causality and synonymy) ...... 123
16 Ranking by "kpr" (ISA, causality and synonymy) ...... 124
17 Ranking by "fkpr" (ISA, causality and synonymy) ...... 125
18 Ranking by "pmi" (ISA, causality and synonymy) ...... 126
19 Ranking by "kpr_bnc" (ISA, causality and synonymy) ...... 128
20 Ranking by "kpr_head" (ISA, causality and synonymy) ...... 129
21 Ranking by "kpr_bnc_head" (ISA, causality and synonymy) ...... 131
22 Aspirin induces X - assorted ranking schemes ...... 132
23 Selenium may_prevent X - assorted ranking schemes ...... 134
24 X induces vomiting - assorted ranking schemes ...... 137
25 {drugs} induces vomiting - assorted ranking schemes ...... 138
26 X induces emesis - assorted ranking schemes ...... 139
27 {drugs} induce emesis - assorted ranking schemes ...... 140
28 Synonyms of "glucose" - assorted ranking schemes ...... 142
29 UMLS sugar ontology fragment ...... 144

30 Synonyms of "lactose" - assorted ranking schemes ...... 145
31 Synonyms of "formaldehyde" - assorted ranking schemes ...... 147
32 Synonyms of "progesterone" - assorted ranking schemes ...... 148
33 Synonyms of "vitamin c" - assorted ranking schemes ...... 149
34 Haloperidol ISA X - assorted ranking schemes ...... 152
35 "X ISA antipsychotic" - assorted ranking schemes ...... 153
36 Correlation between WWW and sample-based precision ...... 159
37 Correlation between BNC and WWW unigram frequencies ...... 160
38 Computing recall against the UMLS - ISA relations ...... 170
39 Proportion of KPs making a contribution in tests for two different domains ...... 182

Contents

1 Introduction ...... 8
1.1 Challenges ...... 11
1.2 Outline ...... 14
1.3 Research delimitation ...... 15
1.4 Contributions and hypotheses ...... 16

2 Theory ...... 17
2.1 Automatic knowledge acquisition ...... 18
2.1.1 Term and relation extraction ...... 19
2.1.2 Pattern-based approaches to relation extraction ...... 20
2.1.3 The evaluation problem ...... 23
2.2 Theories of specialized knowledge ...... 24
2.2.1 What is knowledge? ...... 24
2.2.2 LSP and LGP ...... 29
2.2.3 Termhood and the semiotic triangle ...... 32
2.2.4 Schools of terminology ...... 35
2.2.5 Theoretical stance of the thesis ...... 40
2.3 Knowledge representation ...... 40
2.3.1 Studying existence ...... 42
2.3.2 The terminological ontology ...... 43
2.3.3 Choice of relation types ...... 47
2.3.4 UMLS knowledge sources ...... 48
2.4 Conclusion ...... 50

3 Methodology and applications ...... 51
3.1 Corpus linguistics ...... 51
3.1.1 Web as corpus research ...... 53
3.1.2 Problems with Web as Corpus ...... 57
3.2 Information Retrieval and Information Extraction ...... 59
3.3 Text, data and web mining ...... 59
3.3.1 Text mining for the biomedical domain ...... 61

3.4 Pattern-based relation extraction systems ...... 63
3.4.1 SGPE ...... 64
3.4.2 Snowball ...... 65
3.4.3 RelationAnnotator ...... 66
3.4.4 Espresso ...... 67
3.4.5 KnowItAll ...... 69
3.4.6 PASTA ...... 69
3.4.7 [Alfonseca et al., 2006b] ...... 70
3.4.8 [Charniak and Berland, 1999] ...... 71
3.4.9 [Girju and Moldovan, 2002] ...... 71
3.4.10 [Nenadic and Ananiadou, 2006] ...... 72
3.4.11 System comparison ...... 72
3.5 Conclusion ...... 73

4 Pattern discovery and filtering ...... 74
4.1 KP discovery ...... 77
4.1.1 Start with a known ontology ...... 77
4.1.2 Select target relation(s) ...... 77
4.1.3 Select term pairs instantiating these relation(s) ...... 82
4.1.4 Build a training corpus ...... 86
4.1.5 Identify pattern candidates for the target relation(s) ...... 87
4.1.6 Example patterns ...... 90
4.2 KP filtering ...... 92
4.2.1 Selecting non-target relations ...... 93
4.2.2 Query flexibility ...... 97
4.2.3 Precision of all, unfiltered patterns ...... 97
4.2.4 Individual pattern precision ...... 100
4.2.5 Using iteration range to eliminate noisy KPs ...... 102
4.2.6 Conclusion ...... 105

5 Relation instance extraction ...... 105
5.1 System implementation ...... 105
5.1.1 Discovering and filtering KPs ...... 106
5.1.2 Discovering relation instances ...... 107
5.2 System evaluation ...... 109
5.2.1 Evaluation setup ...... 109
5.2.2 Manual evaluation issues ...... 111
5.2.3 Precision targets ...... 112
5.2.4 Selecting input terms ...... 113
5.2.5 Inter-annotator agreement ...... 115
5.3 Devising instance ranking schemes ...... 117
5.3.1 BNC discounting heuristic ...... 118
5.3.2 Head grouping heuristic ...... 119
5.3.3 Hypotheses ...... 120
5.4 Evaluation of ranking schemes ...... 120
5.4.1 Ranking by frequency ("frq") ...... 122

5.4.2 Ranking by KP range ("kpr") ...... 122
5.4.3 Ranking by "fkpr" ...... 122
5.4.4 Ranking by "pmi" ...... 122
5.4.5 Applying BNC-based discounting ...... 127
5.4.6 Applying head grouping ...... 127
5.4.7 Combining both heuristics ...... 130
5.4.8 Conclusion ...... 130
5.5 Evaluation of experiments ...... 130
5.5.1 Aspirin induces X ...... 132
5.5.2 Selenium may_prevent X ...... 134
5.5.3 X induces vomiting ...... 136
5.5.4 X induces emesis ...... 138
5.5.5 Synonymy ...... 141
5.5.6 Haloperidol ISA X ...... 151
5.5.7 X ISA antipsychotic ...... 153
5.5.8 Conclusion ...... 155

6 Recall and portability ...... 157
6.1 Correlation between snippets sample and WWW ...... 158
6.2 Correlation between BNC and WWW ...... 160
6.3 Precision and recall of individual knowledge patterns ...... 161
6.3.1 "Induces" patterns ...... 161
6.3.2 "May_prevent" patterns ...... 162
6.3.3 ISA patterns ...... 164
6.3.4 Synonymy patterns ...... 166
6.3.5 Conclusion ...... 167
6.4 Recall versus UMLS and "new" knowledge ...... 168
6.4.1 Conclusion ...... 174
6.5 Domain specificity of relations and KPs ...... 174
6.5.1 The ISA relation ...... 175
6.5.2 The causal relations ...... 176
6.5.3 Synonymy ...... 180
6.5.4 Conclusion ...... 182

7 Conclusion ...... 183
7.1 Key results ...... 183
7.1.1 Methodology for pattern-based relation instance extraction ...... 183
7.1.2 Measuring recall and new knowledge ...... 185
7.1.3 Domain specificity of KPs ...... 185
7.1.4 Domain specificity of instance filtering heuristics ...... 185
7.2 Future work ...... 186
7.2.1 Empirical data and sparseness issues ...... 186
7.2.2 Language of analysis ...... 187
7.2.3 Domain of analysis ...... 187
7.2.4 NLP improvements ...... 187
7.2.5 System integration ...... 188

7.2.6 Knowledge Pattern issues ...... 188

8 Appendices ...... 190
8.1 Conversion of BNC from SGML (SARA) to raw text ...... 190
8.2 vp2_log_likelihood.pl ...... 191
8.3 vp2log_odds.pl ...... 192
8.4 umls2random_term_pairs.pl ...... 193
8.5 umls2isa_term_pairs.pl ...... 194
8.6 umls2term_pairs.pl ...... 195
8.7 google2snippets.pl ...... 197
8.8 snippets2ten_fold_sets.pl ...... 199
8.9 learn_knowledge_patterns.sh ...... 201
8.9.1 extract_middle_context_VPs.pl ...... 201
8.10 form_queries.pl ...... 202
8.11 google2frequencies.pl ...... 203
8.12 normalize_and_compute_pattern_precision.pl ...... 204
8.12.1 compute_average_precision.pl ...... 206
8.13 kp_discovery_power.pl ...... 207
8.14 umls2synonym_pairs.pl ...... 208
8.15 term_and_kp2snippets.pl ...... 209
8.16 prepare_corpus.pl ...... 210
8.17 extract_relation_instances_store_in_database.pl ...... 212
8.18 Fleiss' kappa measure for inter-rater reliability ...... 215
8.19 compute_PRF.pl ...... 217
8.20 compute_PRF_head_grouping.pl ...... 222
8.21 load_www_freqs_for_pmi_into_mysql.pl ...... 226
8.22 compute_sample_pmi_PRF.pl ...... 229
8.23 compute_pmi_PRF.pl ...... 233
8.24 load_passive_kpr_from_www_into_mysql.pl ...... 237
8.25 measure_real_PRF_of_all_KPs.pl ...... 238
8.26 UMLS2term_pairs.pl ...... 242
8.27 Induces training pairs ...... 244
8.28 May_prevent training pairs ...... 245
8.29 Synonymy training pairs ...... 246
8.30 ISA training pairs ...... 246
8.30.1 Plural hypernym - hyponym ...... 246
8.30.2 Singular hypernym - hyponym ...... 247
8.30.3 Hyponym - plural hypernym ...... 248
8.30.4 Hyponym - singular hypernym ...... 250
8.31 Unfiltered "may_prevent" patterns precision scores ...... 251
8.32 Unfiltered "induces" patterns precision scores ...... 253
8.32.1 "Induced_by" patterns precision scores ...... 255
8.33 Unfiltered "ISA" patterns precision scores ...... 255
8.33.1 hyper_plur_hypo ...... 255
8.33.2 hyper_sing_hypo ...... 257
8.33.3 hypo_hyper_plur ...... 261

8.34 Unfiltered synonymy patterns precision scores ...... 264
8.35 Database ...... 268
8.36 "induces" KP performance (manual evaluation) ...... 268
8.37 "may_prevent" KP performance (manual evaluation) ...... 270
8.38 ISA KP performance (manual evaluation) ...... 272
8.39 Synonymy ranking schemes ...... 274
8.39.1 Vitamin C - F-scores ...... 274
8.39.2 Progesterone - F-scores ...... 274
8.39.3 Formaldehyde - F-scores ...... 275
8.39.4 Lactose - F-scores ...... 275
8.39.5 Glucose - F-scores ...... 276
8.40 X ISA antipsychotic - F-scores ...... 276
8.41 haloperidol ISA X - F-scores ...... 277
8.42 aspirin induces X - F-scores ...... 277
8.43 selenium may_prevent X - F-scores ...... 278
8.44 X induces vomiting - F-scores ...... 278
8.44.1 {drugs} induces vomiting - F-scores ...... 279
8.45 X induces emesis - F-scores ...... 279
8.45.1 {drugs} induces emesis - F-scores ...... 280
8.46 Correlation with WWW ...... 280
8.46.1 causal experiments - F-scores ...... 280
8.46.2 ISA experiments - F-scores ...... 281

References ...... 281

Preface

Thanks are extended to Hanne Dihn, Birgitte Graumann, Lena Skov Andersen and Nguyen Lee, without whom the manual system evaluation could not have been carried out. Also, I am deeply indebted to Caroline Barrière, whose untiring enthusiasm and feedback have provided an invaluable source of inspiration and encouragement. Finally, I am very grateful for the supervision provided by Bodil Nistrup Madsen and Sabine Kirchmeier-Andersen, for ideas and comments from colleagues at the Dept. of Computational Linguistics and for the fact that my wife put up with me through the years of thesis work.

1 Introduction

With the digital revolution and the genesis of a vast and freely accessible repository of text and knowledge known as the Internet, researchers from many fields, including text mining, computational linguistics and terminology, are struggling to overcome a major challenge of the Internet Age, namely information overload. How does one find the gold nuggets of relevant knowledge washing down the information river?

In the context of computational terminology, especially for the task of generating or updating ontologies and terminological knowledge bases, important gold nuggets are semantic and conceptual relations[1] expressed explicitly in natural language strings. Such strings have been called knowledge-rich contexts (KRCs), and a popular way of identifying KRCs has been to look for so-called "knowledge patterns" or KPs [Meyer, 2001, p290], which are instantiations of semantic relations in text. A KRC has been defined as

a context indicating at least one item of domain knowledge that could be useful for conceptual analysis. In other words, the context should indicate at least one conceptual characteristic, whether it be an attribute or a relation. [Meyer, 2001, p281]

The following two sentences are examples of KRCs, each containing a KP delimited by angle brackets.

1. Revici was also an early advocate of using selenium <to treat> cancer.
2. The use of antipsychotic medications<, including> haloperidol, can be associated with ...

In these cases the two KPs, "to treat" and ", including", can be used as entry points to the terminological gold nugget or KRC. While the pattern "to treat" can identify the causal relation between "selenium" and "cancer", the pattern ", including" may identify the generic, or ISA, relation between "antipsychotic medications" and "haloperidol". The main strategy of pattern-based approaches to relation extraction from free text is to compile lists of reliable patterns instantiating specific semantic relation types and use these lists to find new instances and gradually improve the coverage of (existing) ontologies. Section 2.1 provides more details on pattern-based approaches to automatic knowledge acquisition (AKA).

Automatically extracting semantic relation instances, the building blocks of ontologies, from free text is a way of minimizing the labor-intensive phase of manual knowledge engineering and thus overcoming the long-standing knowledge acquisition bottleneck. While ontologies are the end-product of the terminological tasks of conceptual clarification and knowledge structuring, they play an ancillary but vital role in the wider field of Natural Language Processing (NLP), for example

1. allowing automatic inference in Question Answering (QA) systems
2. recognizing textual entailment[2]
3. allowing automatic query expansion in Information Retrieval systems

Especially the latter point is the object of intense interest because it may in time realize the vision of the Semantic Web, on which users may search for content rather than textual strings. The benefits of conceptual indexing were already described in [Woods, 1997], and there are now multiple examples of concept-based (as opposed to keyword-based) IR projects, for example Ontoseek [Guarino et al., 1999], Ontobroker [Decker et al., 1999] and Ontoquery[3]. Ontologies play an important part in making web search engines behave more intelligently by providing them with knowledge about the world, so that a user query for the string "antipsychotics", for example, may also return documents containing not this string itself, but perhaps relevant hyponyms like "haloperidol". As a result of this ontological preoccupation in the research communities, the focus in computational terminology is also shifting from the investigation of terms and termhood to investigating the conceptual and semantic relations between terms[4]. On the application side there has been a change from the automatic term recognition (ATR) task to the automatic relation extraction task, which, in a terminological context, includes ATR as a subtask.

[1] In this thesis semantic relations are understood as a hypernym of conceptual relations. Synonymy is an example of a semantic relation type which is not also a conceptual relation.
[2] See e.g. the PASCAL conferences (www.pascal-network.org)
[3] www.ontoquery.dk
[4] The latter has always been the main research object of manual terminology work (see subsection 2.2.4 for a discussion, however)

Whether represented as lexical networks (e.g. [Byrd and Ravin, 1999]) or, in the case of conceptual relations, as ontologies (e.g. [Cimiano and Staab, 2005]), the approaches to identifying semantic or conceptual relations in free text basically fall into two categories.

1. pattern-based (pioneered by [Hearst, 1992, Ahmad and Fulford, 1992])
2. clustering-based (pioneered by [Michalski and Stepp, 1983, Fisher, 1987])

Clustering methods induce classes of co-hyponyms from text using the distributional hypothesis that lexical items which occur in similar contexts are semantically similar. While conceptual clustering can be a completely unsupervised approach, the pattern-based methods require a few training examples (seed relation instances) to learn recurrent patterns for a target relation type. These patterns can then be used to find more relation instances, by which more patterns can be discovered, and so forth.

While the advantage of conceptual clustering is that it is unsupervised and makes full use of the contextual information in the training data, its weakness is that the concept clusters which are induced are unlabelled. Also, it works only for the generic, or ISA, relation. Perhaps the main weakness of pattern-based approaches, on the other hand, is that the individual patterns must be learned prior to the extraction process. To be learned, the patterns obviously must be present in the data source, and this entails a potential data sparseness problem, which can hopefully be overcome by using the WWW as data source (see sections 3.1 and 4.1 for more on this). It is also a potential weakness that some patterns are more reliable than others. However, this problem also affects conceptual clustering in that some contexts, and thus the features they provide, will be more reliable than others. Finally, although largely unexplored in the literature, it is a potential weakness that patterns can be domain dependent (see section 6.5 for examples). The main advantage, on the other hand, is that patterns can be learned for any conceivable relation type and be used to build any kind of ontology rather than just taxonomies.

There is usually also a difference in terms of the purpose for which the two techniques of conceptual clustering and pattern-based relation extraction are used. In conceptual clustering the goal is to compile large lists of examples of specific classes or concepts so as to be able to find them in documents, for example for disambiguation purposes. In other words, it is not the intension of the concept, but its extension, which is in focus. Although pattern-based relation extraction may also be used in this way (see the application survey in section 3.4), it can also be used to find non-taxonomical relations which may provide the essential and delimiting characteristics of specific concepts. For example, that pain killers [may prevent: pain] is a characteristic which delimits this concept from other drugs. When used in this way, it is the intension of the concepts which is in focus, and intension rather than extension is the classical research object of terminology (see subsection 2.2.3 for more).

In terminology (and not just in the domain of Biomedicine, which constitutes the case study of this thesis) many types of semantic relations are important. One need only glance at research like [Nuopponen, 1994, Nuopponen, 2005] to realize the wealth of different semantic relation types which may be important in terminology work. Some relation types are particularly prominent in certain domains. For example, the causal relations "induces" and "may_prevent" are important both in Biomedicine and Information Technology (IT), as illustrated in the experiments of this thesis (chapters 4 and 6). Other relation types are widely used across almost any conceivable domain; for example, ISA, meronymy (or PART_OF) and FUNCTION are three basic relation types which are often used as delimiting characteristics when writing terminological definitions. Finally, the semantic (but not conceptual) relation of synonymy is important to clarify the concepts of any domain. In a terminological framework the pattern-based approach is thus more appropriate than conceptual clustering, and this is why the main topic of the thesis is a thorough investigation of the discovery, filtering and application of knowledge patterns (KPs) to extract relation instances from the WWW in order to assist terminologists working to maintain and extend terminological resources.
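Before turning to the challenges, the pattern-based strategy itself can be made concrete with a toy example of its two basic steps. This is a deliberately simplified sketch, not the WWW2REL implementation (which is replicated in the appendices); the example snippets and the seed pair are invented, and a real system would fetch thousands of snippets from a web search engine.

#!/usr/bin/perl
# Toy sketch of pattern-based relation extraction (not the WWW2REL code):
# step 1 learns a candidate KP from the contexts of a seed pair, step 2
# anchors that KP on a new input term to extract a candidate instance.
use strict;
use warnings;

# Invented stand-ins for WWW text snippets returned by a search engine.
my @snippets = (
    'Revici was also an early advocate of using selenium to treat cancer.',
    'Some clinicians recommend using zinc to treat colds.',
);

# Step 1: take the string between the two arguments of a seed pair as a KP.
my ($x, $y) = ('selenium', 'cancer');    # seed instance of the target relation
my %patterns;
for my $s (@snippets) {
    $patterns{$1}++ if $s =~ /\Q$x\E\s+(.+?)\s+\Q$y\E/i;
}

# Step 2: anchor each KP on a new input term ("zinc") and capture the
# other argument as a candidate relation instance.
for my $kp (sort keys %patterns) {       # here: "to treat"
    for my $s (@snippets) {
        print "zinc -- $kp --> $1\n" if $s =~ /zinc\s+\Q$kp\E\s+(\w+)/i;
    }
}
# prints: zinc -- to treat --> colds

In the actual system the learned patterns are of course filtered first (chapter 4), and the extracted instances are ranked by reliability (chapter 5) before being presented to a terminologist.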

1.1 Challenges

The main challenge in the pattern-based approach to automatic relation instance extraction is that KPs are not failproof access points to instances of the target semantic relations but can be noisy. In a landmark article, [Meyer, 2001] lists the following challenges to using KPs in automatic extraction tasks.

1. unpredictability

2. polysemy

3. anaphoric reference

4. domain-dependency

That KPs are unpredictable simply reflects the fact that they are part of natural rather than controlled or artificial language. There is virtually no limit to the creativity with which human beings express themselves, including when conveying knowledge to each other. The polysemy, or ambiguity, of KPs is another fascinating feature of natural language (or annoying, depending on the perspective). Anaphoric reference is a third feature of natural language, notorious for its complexity and a hard nut to crack for NLP applications. Finally, the possible domain-dependence of KPs is a challenge which concerns recall rather than precision (see more below). As for the precision-related challenges, at least four kinds of noise can occur when relying on KPs to extract relation instances automatically.

1. The KP does not realize a semantic relation at all

2. The KP expresses a different semantic relation than the target one

3. The KP realizes the target semantic relation, but its arguments do not represent domain-specific concepts, or they are at least semantically too vague to be terminologically interesting

4. The KP realizes the target semantic relation, its arguments are domain-specific, but the relation is incorrect

An example of the first type of noise is the following from the British National Corpus (BNC)

What <is a> Caesarian Birth?

The pattern “is a”, which might in other cases establish a hyponym-hypernym link, only establishes a link to the interrogative pronoun “what” and thus does not provide a hypernym of the concept represented by “Caesarian Birth”. The second type of noise can be illustrated by the pattern “arise from” in the following sentence from a glossary of medical terms.

Schwannomas and neurofibromas, tumors that <arise from> the sheaths that cover nerves and improve the conduction of nerve impulses.

In general language "arise from" will almost always (except for poetic language, perhaps) be used to establish cause-effect relations, so it is not unlikely that this string would be used as a KP for the retrieval of causal relation instances. In this case, however, it instantiates a locative relation between the concepts represented by "tumors" and "sheaths". As for the third type of noise, the BNC provides another example.

The universe <is a> cold, dark place!

This time "is a" does establish a link between a hyponym (universe) and a hypernym (place), but this link involves concepts of such a general nature that the relation will presumably not be useful for terminologists who work bottom-up modelling the knowledge of special domains, or even subdomains. This third type of noise is perhaps the hardest to identify and evaluate, because the line between fuzzy categories and domain-specific concepts can be difficult to draw (see also the discussion in section 2.2). Also, in non-terminological contexts this example would be perfectly valid and thus not be considered as noise. In this thesis, however, the purpose is to assist terminologists and domain experts compiling and structuring specialized knowledge in terminological databases, and hence semantic relations between non-specialized concepts are regarded as noise.

Finally, the fourth type of noise is particularly relevant when using the WWW as a knowledge source. In neat collections of academic papers one would not expect to encounter many incorrect semantic relations, but when authorship, text type and many other important quality parameters are unknown, it is not totally inconceivable that some relation instances will simply be false. The following is an example from a synonymy experiment in subsection 5.5.5.

1000mg of vitamin c, <aka> Ester C, if you feel a cold or flu coming on.

Since ester c is a modified (chemically enhanced) form of vitamin c, the synonymy relation established by the KP, "aka", is incorrect. The informal acronym for "also known as", of course, signals that the communicative setting may not be an academic one, and this is perhaps the explanation why incorrect semantic relations are established. Incorrect relations may also arise from incomplete processing of the natural language strings (see the discussion in subsection 5.1.2) or in cases where the strings themselves are incomplete due to the nature of the empirical data (i.e. fragmentary WWW text snippets).

Besides tackling the noise, or precision, problem, KP-based approaches to the automatic extraction of semantic relations must address another issue which complicates matters. Although KPs form a smaller set of items than the set of terms of a domain, and although they are generally used across different domains, the "discovery power" of individual KPs is likely to differ greatly. Some patterns may be much more commonly used in domain X than in domain Y, and some will be very reliable but occur only rarely. The quality conditions of a KP thus include at least the following three parameters.

1. High precision

2. High recall

3. High portability

Thus the main content of the three experimental chapters of this thesis (chapters 4, 5 and 6) is a comprehensive investigation of these three parameters based on a case study of KPs discovered in and evaluated on WWW text snippets. Striking the perfect balance between high recall, high precision and high portability can be hard, and the right balance may depend on the purpose of the application. However, to some extent the parameters are interdependent in that high precision KPs may tend to be domain-dependent and thus have a low portability and a low recall (at least in other domains than the one for which they were learned). Conversely, highly portable KPs will tend to have a high recall and, presumably, a somewhat lower precision.

As for the system precision and recall, WWW2REL extracts relation instances directly from the entire WWW and presents these to the user ranked by their assessed reliability, so from a pragmatic viewpoint precision should be favoured over recall.

System recall is a somewhat artificial concept in the context of knowledge discovery using the entire WWW as a data source, the exact bounds of which nobody knows (see subsection 3.1.2 for more on the Web as Corpus challenges).
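For reference, the precision, recall and F-score figures reported in the evaluation chapters follow the standard definitions; they are stated here for completeness, assuming the conventional balanced F-score:

\[
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F = \frac{2 \cdot P \cdot R}{P + R}
\]

where TP, FP and FN denote the numbers of true positives, false positives and false negatives among the extracted relation instances, respectively.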

1.2 Outline

The overall structure of the thesis is as follows. Chapter 2 contains its theoretical foundations, and chapter 3 discusses methodological issues and provides a survey of comparable relation extraction systems. The KP discovery and filtering step of WWW2REL is described in chapter 4, while chapter 5 provides a comprehensive system test and evaluation. The topic of chapter 6 is also the system evaluation, but this time from a perspective of recall and portability rather than precision. Finally, chapter 7 summarizes the key results and contributions of the thesis and also outlines future work. Source code developed for the experiments and the system implementation is replicated in the appendices (chapter 8).

In more detail, the main research problems and hypotheses of the thesis are outlined in section 1.4. Chapter 2 discusses theoretical aspects of the foundations of terminology (section 2.2), including the interdependent processes of knowledge discovery (section 2.1) and knowledge representation (section 2.3). In chapter 3 the methodological framework of the thesis experiments is outlined; this chapter includes a brief introduction to corpus linguistics (section 3.1) with special emphasis on the nascent Web as Corpus field (subsection 3.1.1), information retrieval (section 3.2) and text mining (section 3.3). It also provides an overview of prominent, existing pattern-based applications for the automatic extraction of relation instances (section 3.4) and compares these systems with WWW2REL.

Chapter 4, then, describes the process of initializing WWW2REL. This initialization includes the establishment of a framework for pattern discovery (section 4.1) and pattern filtering (section 4.2) using selected relation types from a biomedical ontology as a case study. Chapter 5 discusses issues related to the implementation of WWW2REL (section 5.1) and the manual evaluation of its performance (section 5.2). It also presents a number of instance reliability ranking schemes and heuristics (section 5.3), and finally in sections 5.4 and 5.5 eleven system tests are carried out and evaluated by using the four sets of filtered patterns from chapter 4 to automatically retrieve and rank relation instances from the WWW.

Chapter 6 investigates the performance impact of system parameters like text snippet sample size (section 6.1) and the choice of reference corpus (section 6.2). Section 6.3 presents a comprehensive evaluation of the usefulness of WWW2REL's individual KPs in terms of precision and recall. Finally, the chapter also examines two additional parameters which are important when assessing the overall usefulness of the WWW2REL system, namely its ability to detect "new" knowledge not recorded in the starting ontology (section 6.4) and the portability of WWW2REL to another domain (section 6.5). Chapter 7 summarizes the results of the experiments and evaluation and also outlines directions for interesting future work.

1.3 Research delimitation

The research presented in this thesis is delimited along five different dimensions, namely as regards the purpose and perspective of the work, the empirical data used in the work, the techniques employed, and finally the domain and language of analysis.

1. Purpose and perspective

(a) The purpose is to develop a relation extraction system which may assist practical terminology work in any domain-specific setting.
(b) Given (a), the perspective becomes the intension rather than the extension of concepts.

2. Empirical data

(a) The only source of empirical data is thousands of WWW text snippets, each containing at most one or two sentences and possibly only sentence fragments. This ensures system portability, but is also motivated by a number of other advantages outlined in subsection 3.1.1.

3. Techniques employed

(a) Given the delimitation in 1), the pattern-based approaches to relation extraction are more attractive than conceptual clustering. In other words, conceptual clustering techniques are ignored because the purpose is not to find long lists of possible instantiations of classes (i.e. the extension of concepts).
(b) Partly given the fragmentary nature of the empirical data, sophisticated NLP is not attempted.
   i. tagging and chunking are performed, but not full parsing
   ii. no attempt is made at resolving anaphora
   iii. semantic relations within NPs[5] (e.g. modifier-head relations) are ignored

4. Language of analysis

(a) All experiments are restricted to English. The main motivation for this restriction is to ensure that the results may be useful to a wider research community, but the restriction is also dictated by the choice of case study and case ontology (see below).

5. Domain of analysis

(a) Given the delimitation in 1), the domain of analysis is not language for general purposes, but language for special purposes. More specifically, the domain of Biomedicine is selected as a case study for reasons outlined below.

[5] Called "lexical term similarity" in [Nenadic and Ananiadou, 2006], as opposed to "syntactic term similarity", which they use to refer to KPs

While the semantics expressed by modifier-head relations remains an intriguing research area and has also been used to learn taxonomies from free text (see for example [Gillam, 2004, Gillam et al., 2005]), no attempt will be made at analyzing the modifier-head relations of biomedical terms. The reason is that these NP internal relations are largely expressed by implicit means, and the research object of this thesis is explicit knowledge patterns instantiating semantic relations between domain-specific concepts.

Finally, there are three compelling reasons for zeroing in on the biomedical domain. Firstly, Biomedicine is a huge domain which has an impact on the lives of practically all people on the planet. Secondly, because of increasingly swift drug development cycles, the biomedical domain is in dire need of tools which can assist in keeping ontological resources updated. The two relation types "induces" and "may_prevent" are particularly interesting because all drugs have to be tested for potential side effects, and copycat products are constantly introducing new side effects which have to be monitored. Thirdly, the Unified Medical Language System (UMLS) knowledge sources are not only among the most comprehensive ontological resources, they are also freely available[6], making them ideal both as a source of relation instances for KP discovery and as a baseline for an automatic evaluation of system performance.

1.4 Contributions and hypotheses

The main contribution of this thesis is to implement, test and manually evaluate a complete ontology extension framework which is 100% based on a knowledge-poor, domain-independent, pattern-based processing of WWW text snippets and includes the three stages of pattern discovery, pattern filtering and relation instance ranking. A distinctive feature of this ontology extension framework is that it is optimized both for precision and portability (domain-independence), since these are two top priorities for the intended users, namely terminologists. Secondary contributions include

• An in-depth empirical investigation of the performance of individual knowledge patterns, since such investigations are few and far between[7].

• A discussion of the pitfalls and challenges related to the evaluation of the recall of a knowledge discovery system versus an existing ontology.

• An investigation of system portability to another domain.

A key hypothesis is that high precision in automatic relation instance extraction can be achieved in spite of the following circumstances, which complicate the task but boost system portability.

1. The system operates exclusively on a noisy text source (the WWW).

2. The system makes no use of heuristics which are custom-tailored to any specific domain.

3. The system uses a very simple measure of relation instance reliability.

[6] http://umlsks.nlm.nih.gov
[7] [Barrière, 2001], [Girju and Moldovan, 2002] and [Marshman and L'Homme, 2006] are three examples.

In the KP discovery phase it is hypothesized that enforcing certain restrictions on the form of candidate KPs, namely requiring that they contain a verb, will significantly reduce noise. If formal restrictions are not enforced, noisy KP candidates can be eliminated by measuring the range of different term pairs with which each candidate occurs during a ten-fold validation process and deleting those with a low range.

As for the test phase, it is hypothesized that high precision can be attained by grouping relation instances by their NP head, penalizing heads which are overly frequent in a general language corpus and ranking the instances by the range of different KPs with which they co-occur. Although it is difficult to find systems trained and tested in similar settings, the properties of WWW2REL and its performance can still be meaningfully compared to existing systems, as is done in section 3.4.

A final hypothesis is that while assessing termhood by using frequency data from a general language corpus is an effective strategy for many specialized domains (including Biomedicine), it is not a good technique in domains where terms are predominantly formed by semantic extension of existing general language lexical units. The hypothesis is tested by applying the technique to a domain in which term formation is characterized by semantic extension (namely Information Technology).
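The two test-phase heuristics can be made concrete with a small sketch. The following is only one plausible, simplified reading of the hypothesis, not the actual implementation (cf. compute_PRF_head_grouping.pl in appendix 8.20); the candidate data, the naive head extraction and the BNC frequency cut-off are all invented for illustration.

#!/usr/bin/perl
# Simplified sketch of the hypothesized ranking: group candidate instances
# by NP head, rank by the range of distinct KPs that retrieved them ("kpr"),
# and penalize heads that are frequent in a general language corpus (BNC).
# All data and the frequency cut-off are invented for illustration.
use strict;
use warnings;

# candidate => list of KPs that retrieved it (hypothetical)
my %retrieved_by = (
    'severe nausea' => ['induces', 'can cause', 'is associated with'],
    'nausea'        => ['induces', 'may cause'],
    'problems'      => ['causes', 'leads to', 'results in', 'induces'],
);
my %bnc_freq = (nausea => 102, problems => 54211);    # invented BNC counts

# 1) head grouping: conflate candidates sharing the same NP head
my %by_head;
for my $cand (keys %retrieved_by) {
    my ($head) = $cand =~ /(\w+)\s*$/;                # naive head = last token
    push @{ $by_head{$head} }, @{ $retrieved_by{$cand} };
}

# 2) kpr with BNC discounting: count distinct KPs per head, and halve the
#    score of heads above a general-language frequency threshold
my %score;
for my $head (keys %by_head) {
    my %distinct = map { $_ => 1 } @{ $by_head{$head} };
    $score{$head} = keys %distinct;
    $score{$head} /= 2 if ($bnc_freq{$head} // 0) > 10_000;
}
print "$_\t$score{$_}\n" for sort { $score{$b} - $score{$a} } keys %score;

Ranked this way, a general-language head like "problems" drops below a more term-like head like "nausea", even though both were retrieved by equally many distinct KPs.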

2 Theory

This chapter provides the theoretical foundations for the methodological and empirical chapters which follow. As WWW2REL is essentially an application for automatic knowledge acquisition (AKA), the chapter starts off with a brief account of the AKA field, including a popular AKA technique called Dual Iterative Pattern Relation Expansion (DIPRE) [Brin, 1998] which has provided inspiration for the implementation outlined in chapter 4.

WWW2REL is devised as an aid in practical terminology work, and this focus necessitates a longer discussion of what is really meant by specialized knowledge (and specialized text) and how this might differ from knowledge as such (section 2.2). Also, since the relation instances extracted by WWW2REL must establish a link to a term in order to be judged relevant, this section includes a discussion on termhood (subsection 2.2.3) which leads to a brief account of the heated theoretical row over the research object of terminology as a science (subsection 2.2.4). Finally, as WWW2REL is tested on a comprehensive biomedical ontology known as the UMLS Metathesaurus, section 2.3 contains a discussion of the properties of ontologies in general (subsection 2.3.1), terminological ontologies (subsection 2.3.2) and in particular the properties of the UMLS Metathesaurus (subsection 2.3.4).

2.1 Automatic knowledge acquisition

This section briefly outlines important challenges in the field of automatic knowledge acquisition, or AKA for short. Automatically extracting relation instances from the WWW is an AKA task, and from the following 10-year-old quote it is apparent how fast the AKA field is developing.

It is often assumed that Knowledge Acquisition (KA) for expert systems and other knowledge-based programs must involve comprehensive interactive sessions with human experts. However there is now a growing awareness that a vast amount of human knowledge has already been extracted and codified in the form of printed text in dictionaries, thesauri, user manuals, encyclopaedias, reference guides and expository texts. [...] Knowledge Extraction (KE) systems [...] do not of course eliminate human input, but attempt to relegate it to a post processing phase, reducing the intellectual load on humans as far as possible whilst making good use of existing hard-won knowledge resources. [Bowden et al., 1996, p147]

In contrast to the eighties and early nineties, the first stages of knowledge acquisition nowadays rarely involve human experts, and [Bowden et al., 1996, p147] have been proven right about the promise of AKA. AKA is related to the field of knowledge discovery in databases (KDD), also known as data mining. However, the two terms are not synonymous, as revealed by the following quote which defines KDD as

[...] the non-trivial extraction of implicit, previously unknown and potentially useful information from data. [Frawley et al., 1992]

Unlike data mining, AKA normally does not procure truly new knowledge, but rather rediscovers and, in the case of ontology learning, also pieces together existing fragments of knowledge typically found in natural language text rather than structured data. A more detailed discussion of the AKA-related fields of data mining, Information Retrieval and Information Extraction can be found in chapter 3, along with a number of prominent systems which automatically acquire knowledge.

First of all, AKA need not be used to learn complete ontologies from text. Although the field of ontology learning is gaining popularity day by day, most terminologists would probably argue that even the best ontologies produced automatically need heavy manual postediting, depending on the complexity of the domain. Seeing as ontologies, whether generated manually or automatically, age rapidly, it can be equally useful to develop tools which update existing ontologies. Thus the system presented in this thesis makes no attempt at producing complete ontologies, but attempts only to discover, filter and apply knowledge patterns by means of the WWW in order to extract relation instances which can be used to augment ontologies. Nevertheless, the following point about the direction of ontology learning research also applies to the work reported in this thesis.

Ontology learning, in the Semantic Web context, is primarily concerned with knowledge acquisition from and for Web content and is thus moving away from small and homogenous data collections to tackle the massive data heterogeneity of the World Wide Web instead. [Buitelaar et al., 2005, p4]

To the extent that the challenges encountered in the empirical sections of this thesis are caused by the heterogeneity of the WWW, solutions and findings may thus also be useful for research on ontology learning.

2.1.1 Term and relation extraction

Term extraction is typically the first step in most AKA systems. Methods of automatic term recognition (ATR) are usually either linguistic, statistical or, more commonly, a mix of the two. Since the most common linguistic expression of specialized concepts is noun phrases (NPs), linguistic approaches (e.g. [Jacquemin, 1994]) rely on part-of-speech tagging and some degree of parsing to identify term candidates. Statistical approaches, on the other hand, often rely on an analysis of the degree of association between lexical units in the analysis corpus versus a reference corpus (see for example [Drouin, 2003] or [Gillam, 2004]). These analyses are based on contingency tables like the one presented in subsection 4.1.2 (a small worked example of such an association score closes this subsection).

ATR will be discussed no further at this point, since its use in WWW2REL is limited to a very simple statistical technique of penalizing candidates which are overly frequent in a general language reference corpus (see subsection 5.3.1 for details). Also, ATR need only be performed on one of the arguments in the binary relation instances extracted from the WWW, because the input term is fixed from the start (see subsection 5.1.2).

Relation instance extraction involves the subtask of ATR, or Named Entity Recognition (NER), when carried out for a specific domain. The goal of relation instance extraction is

[...] to detect occurrences of a prescribed type of relationship between a pair of entities of given types. While the type of the entities is usually very specific (eg genes, proteins or drugs), the type of relationship may be very general (eg any biochemical association) or very specific (eg a regulatory relationship). [Cohen and Hersh, 2005, p63]

Before we proceed further, it should be mentioned that some researchers prefer to use the term "role extraction" for this task, as indicated by the following quotation.

While there is much work on role extraction, very little work has been done for relationship recognition. Moreover, many papers that claim to be doing relationship recognition in reality address the task of role extraction: (usually two) entities are extracted and the relationship is implied by the co-occurrence of these entities or by the presence of some linguistic expression. These linguistic patterns could in principle distinguish between different relations, but instead are usually used to identify examples of one relation. In the related work for statistical models there has been, to the best of our knowledge, no attempt to distinguish between different relations that can occur between the same semantic entities. [Rosario and Hearst, 2004]

[Rosario and Hearst, 2004] are right that relation instance extraction, whether based on pattern matching or conceptual clustering, is perhaps more accurately described as role extraction, because it is not the relation types themselves which are being recognized but rather pairs of entities which form instances of a fixed relation type in which each entity plays a particular role, for example "hyponym" or "hypernym" in the case of the ISA relation. Nevertheless, role extraction or not, the end product is meant to be new relation instances not registered in the target ontology, and if multiple sets of patterns are available, multiple relation types can be recognized.

In a comprehensive survey of methods and tools for building ontologies from text, [Gómez-Pérez and Manzano-Macho, 2005] identify three groups of AKA tools differing by purpose.

1. Identifying relations

2. Identifying concepts

3. Building up taxonomies or ontologies

The primary purpose of the present study is, in fact, a mix of all three purposes or tasks. WWW2REL finds relation instances and in this process identifies specialized concepts, but it also provides taxonomical and ontological fragments which can be pieced together by terminologists or perhaps by the system itself in future, more advanced versions.
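To make the statistical side of the ATR discussion above concrete, the association analysis over a contingency table (cf. table 8 and the rankings in tables 9 and 10) can be carried out with Dunning's log-likelihood ratio. The sketch below is in the spirit of vp2_log_likelihood.pl (appendix 8.2) but is not that script; all frequencies and corpus sizes are invented for illustration.

#!/usr/bin/perl
# Minimal sketch of Dunning's log-likelihood (G2) over a 2x2 contingency
# table, in the spirit of vp2_log_likelihood.pl (appendix 8.2). All counts
# below are invented for illustration.
use strict;
use warnings;

# a = frequency of the unit in the analysis corpus (e.g. BioMed)
# b = frequency of the unit in the reference corpus (e.g. BNC)
# c, d = remaining token counts of the two corpora
sub log_likelihood {
    my ($a, $b, $c, $d) = @_;
    my $n   = $a + $b + $c + $d;
    my @obs = ($a, $b, $c, $d);
    my @exp = (
        ($a + $b) * ($a + $c) / $n,    # expected frequencies computed
        ($a + $b) * ($b + $d) / $n,    # from the row and column marginals
        ($c + $d) * ($a + $c) / $n,
        ($c + $d) * ($b + $d) / $n,
    );
    my $g2 = 0;
    for my $i (0 .. 3) {
        $g2 += 2 * $obs[$i] * log( $obs[$i] / $exp[$i] ) if $obs[$i] > 0;
    }
    return $g2;
}

# A VP occurring 120 times in a 1M-token analysis corpus but only 15 times
# in a 10M-token reference corpus receives a high association score.
printf "G2 = %.2f\n", log_likelihood( 120, 15, 1_000_000 - 120, 10_000_000 - 15 );

Units with a high G2 score are characteristically more frequent in the analysis corpus than the reference corpus would predict, which is exactly the property exploited when ranking candidate VPs against the BNC.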

2.1.2 Pattern-based approaches to relation extraction

AKA based on pattern matching is a well researched area. The patterns used in the retrieval of knowledge have been given various names, including

• semantic formulae [Lyons, 1977]

• lexico-syntactic patterns [Hearst, 1992]
• knowledge probes [Ahmad and Fulford, 1992]

• explicit relation markers [Bowden et al., 1996]

• knowledge patterns [Meyer, 2001] • operators [Penagos, 2004]

Throughout this thesis the term “knowledge patterns” (or the acronym KP) will be used. It should be noted, however, that this term has also been used in a different sense in formal ontology. In formal ontology, a knowledge pattern has been defined as “a first-order theory whose axioms are not part of the target knowledge-base, but can be incorporated via a renaming of their non-logical symbols” [Clark et al., 2004, p196].

Patterns in the non-formal, natural language sense can be used to extract bits of knowledge from unstructured, free text and are characterized by the notion of surface semantics. This means that they can be intuitively interpreted and easily acquired. [Hearst, 1992] reports on the identification of a set of lexico-syntactic patterns which satisfy the following desiderata.

(i) They occur frequently and in many text genres
(ii) They (almost) always indicate the relation of interest
(iii) They can be recognised with little or no pre-encoded knowledge
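One of the classic patterns satisfying these desiderata, “NP such as NP, NP and NP” signalling the ISA relation, can be sketched as a simple regular expression. The code and the example sentence below are illustrative assumptions rather than [Hearst, 1992]'s actual implementation; a real system would match part-of-speech-tagged NP chunks rather than raw words.

import re

PATTERN = re.compile(
    r"(?P<hyper>\w+(?:\s\w+)?)\s+such\s+as\s+"
    r"(?P<hypos>\w+(?:\s*,\s*\w+)*(?:\s+(?:and|or)\s+\w+)?)"
)

def extract_isa(sentence):
    # Returns (hyponym, hypernym) pairs matched by the "such as" pattern
    m = PATTERN.search(sentence)
    if not m:
        return []
    hyponyms = re.split(r"\s*,\s*|\s+(?:and|or)\s+", m.group("hypos"))
    return [(h, m.group("hyper")) for h in hyponyms]

print(extract_isa("antipsychotic agents such as haloperidol, clozapine and risperidone"))
# [('haloperidol', 'antipsychotic agents'), ('clozapine', 'antipsychotic agents'),
#  ('risperidone', 'antipsychotic agents')]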

She does not, however, present any empirical study of the extent to which desiderata (i) and (ii) hold. Such a study is provided in sections 6.3 and 6.5 of this thesis. Knowledge patterns can be divided into

1. linguistic patterns

2. non-linguistic patterns

Examples of non-linguistic, or paralinguistic [Meyer, 2001], patterns are all kinds of punctuation marks, for example parentheses, and other symbols like equation signs, arrows and so on. This study will completely ignore non-linguistic patterns as most of these are ignored by the web search engines.

Among the linguistic patterns, verbs and verb phrases (VPs) appear especially attractive because of their precision and their ability to identify terms (cf. [Barrière, 2001, Christensen, 2002]). However, for terminologically fundamental relation types like ISA and synonymy, many high recall patterns contain no verbs, for example “hypernym hyponym”, “synonym1 synonym2” and so on.

As discussed in section 1.1, using KPs to extract relation instances involves a number of natural language challenges, including polysemy, domain dependency and anaphoric reference. The issue of polysemy (i.e. KP ambiguity) is touched upon in section 6.3. The challenge of domain dependency is an interesting problem which appears to be ignored by most research in this area. Section 6.5 presents an analysis of the domain dependence of KPs learned from the contexts of biomedical term pairs by examining their recall when applied to collections of IT text snippets. Finally, while anaphoric reference is an interesting challenge, covering it would involve the detection of inter-sentential semantic relationships, and this is simply not feasible when using short text snippets which are rarely more than one or two sentences long. Subsection 4.1.4 has details on the web corpora used in the thesis experiments. Two additional challenges are

1. data sparseness

2. automatic evaluation

Although using the entire WWW minimizes the data sparseness problem, it may still be a problem when attempting to discover KPs using highly specialized term pairs from a domain like Biomedicine (see table 13 in subsection 4.1.3 for examples). As for the

Figure 1: The DIPRE technique

second challenge, automatic evaluation remains a largely unsolved problem, both for ontology/taxonomy learning and for automatic relation instance extraction. The main problem is how to interpret new information which is not present in the gold standard ontology against which the automatic evaluation is performed. One option is to select, for a specific relation type, sets of target and non-target instances from the gold standard ontology and see to what extent the system can find the correct arguments given correct and incorrect inputs (see for example [Mukherjea and Sahay, 2006]). However, this is a suboptimal evaluation strategy for a system which is designed to augment existing ontologies with new relation instances, and for this reason the present study features a comprehensive manual evaluation performed by four domain experts. Subsection 2.1.3 has more on the evaluation challenge.

Dual Iterative Pattern Relation Expansion (DIPRE)

Although the technique used in section 4.1 to discover KPs from text snippets on the WWW is not iterative, it is very much inspired by the Dual Iterative Pattern Relation Expansion, or DIPRE, technique introduced by Sergey Brin, one of the founders of Google, in [Brin, 1998]. [Brin, 1998] describes an unsupervised technique which extracts (author, title) pairs directly from unannotated web documents based only on a small set of five seed pairs. The basic idea of the DIPRE technique (visualized as figure 1) is that a tiny bit of existing semantic knowledge can be used to discover recurrent patterns instantiating similar bits of knowledge, which in turn can be used to find more recurrent patterns and so on. Its iterative steps can be summarized as follows:

1. Start with a sample of seed term pairs (relation instances)
2. Find occurrences of these instances in context (e.g. on the WWW)
3. Induce recurrent patterns from the occurrences and their individual contexts
4. Use these patterns to find more term pairs (relation instances)
5. Augment database and repeat from step 2 with the extra instances.

Although the DIPRE approach to relation instance extraction is very promising, the main challenge is to control the expansion phase (step 4) so that the system does not drift too far from the starting point and start extracting incorrect relation instances. The implementation of WWW2REL does not address this expansion issue, because it starts with a slightly larger set of seed instances (see section 4.1) and is able to extract plenty of patterns and subsequently new instances in a single pass due to the richness of data on the WWW.
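A minimal Python sketch of the loop summarized above may be helpful. The helper functions (search_snippets, induce_pattern, match_pattern) are hypothetical stand-ins for a web search API, the pattern generalization step and the pattern matching step, and the convergence test is a simplification of [Brin, 1998].

def dipre(seed_instances, search_snippets, induce_pattern, match_pattern,
          max_rounds=3):
    instances, patterns = set(seed_instances), set()
    for _ in range(max_rounds):
        # 2. find occurrences of the known instances in context
        contexts = [c for a, b in instances for c in search_snippets(a, b)]
        # 3. induce recurrent patterns from those occurrences
        patterns |= {induce_pattern(c) for c in contexts}
        # 4. use the patterns to find more term pairs
        new = {pair for p in patterns for pair in match_pattern(p)} - instances
        if not new:          # nothing new was found: stop iterating
            break
        instances |= new     # 5. augment the database and repeat
    return instances, patterns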

2.1.3 The evaluation problem

The automatic evaluation of relation extraction systems is often carried out on a set of manually annotated documents. For systems designed to extract relations from the WWW, manually downloading and annotating a set of web documents is impractical, because the very reason for searching on the entire WWW is to utilize the high precision of the pattern-based approach while compensating for its inherent data sparseness problems. The Snowball application [Agichtein and Gravano, 2000], for example, is evaluated by extracting 13,000 <organization, location> pairs from an online resource and eliminating from this list those pairs which do not have a single co-occurrence in any test document and thus cannot possibly be extracted. Nevertheless, a problem with respect to an automatic evaluation of system precision remains.

If the initial directory of organizations from Hoover’s contained all possible organizations, then we could just measure what fraction of the tuples in Extracted are in Ideal (precision) and what fraction of the tuples in Ideal are in Extracted (recall). Unfortunately, a large collection will contain many more tuples than are contained in any single manually compiled directory. [...] If we just calculated precision as above, all the valid tuples extracted by Snowball, which are not contained in our Ideal set, will unfairly lower the reported value of precision for the system. [Agichtein and Gravano, 2000]
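Spelled out in set notation, the evaluation discussed in the quotation computes

precision = |Extracted ∩ Ideal| / |Extracted|
recall = |Extracted ∩ Ideal| / |Ideal|

so that any valid extracted tuple missing from the Ideal set shrinks the numerator and thus unfairly lowers the reported precision.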

One solution is to evaluate system precision manually (as is done in this thesis by the four experts). The Snowball team suggests joining the set of ideal pairs with the set of extracted pairs on a unique key, in their case the organization name. In this way, tuples unseen in the gold standard are added to this standard before precision is computed. Additional problems complicate this procedure, namely variance and specificity. There may be multiple variants of an organization name (Microsoft, MS etc.) and locations can be more (Redmond) or less (California) specific. [Agichtein and Gravano, 2000] solve this problem by using the Whirl tool [Cohen, 1999] to conflate variant organization names and by accepting locations at both state/country and city level.

When extracting semantic relation instances for use in terminology work, however, additional evaluation problems arise. First of all, termhood is difficult to assess automatically, and there is no term tagger with an accuracy comparable to that of named entity taggers like Whirl. Thus one can expect noisier output, in the sense that more relation instances will be considered invalid because their arguments are simply not specific enough and thus irrelevant to the terminologist. While it is correct that “haloperidol” is a drug, for example, “drug” is arguably too vague a hypernym to be useful to a terminologist building an ontology of central nervous system agents.

Another reason why evaluating performance may be more difficult when retrieving terminological knowledge than when retrieving facts in Information Extraction (IE) tasks is that the search space may be more open. While tuples like <country, capital> express one-to-one relationships, non-taxonomical conceptual relations like “induces” (from the UMLS) are many-to-many. For example, contaminated water may cause vomiting, but it may also cause other things, and vomiting can be induced by substances other than contaminated water (more on this example in subsections 5.5.3 and 5.5.4). The unboundedness of the search space makes it difficult to evaluate the precision and recall of systems which extract instances of many-to-many relationships. While the list of all countries and their capitals is fairly static and can easily be obtained, there is no exhaustive list of valid “induces” instances. Section 6.4 offers concrete examples of the issues raised here and attempts an automatic evaluation of system recall versus the UMLS.

2.2 Theories of specialized knowledge

The field of terminology is concerned with the acquisition, management and structuring of specialized knowledge. It is an open question, and indeed a source of the theoretical debate surveyed in this section, to what extent the properties of specialized knowledge differ so fundamentally from those of general knowledge that their description merits a completely distinct modus operandi and that the study of specialized knowledge as opposed to knowledge in general should constitute a scientific field in its own right. However, before this debate can be probed, a few basic concepts concerning the properties of knowledge in general need to be established in subsection 2.2.1.

2.2.1 What is knowledge?

Understanding the nature of knowledge is clearly not a trivial problem, since philosophers have pondered this matter from the days of Plato and even before. In fact, an entire school of philosophical scholars known as “the skeptics” has proclaimed the impossibility of knowledge for centuries. As the following paragraphs will elucidate,

theories of knowledge are prone to be self-contradictory and ridden with circularities. However, these circularities and paradoxes mainly occur when studying existence from a top-down, philosophical perspective rather than the bottom-up, terminological perspective. Thus, given the age-long successes of practical terminology and knowledge structuring, we will take the liberty of disregarding the rather extreme, albeit philosophically interesting, stance of the skeptics altogether.

Typically, knowledge is contrasted with opinion on the basis of the strength of evidence. While one person might think that X is the case, another person can disagree, and an opinion must be justified by evidence in order to be classified as knowledge. Knowledge, then, could be defined as “justified true belief” [Orilia and Varzi, 1998]. Plato provided the first detailed theory of knowledge by introducing the notion of the Forms. Forms are mental representations of prototypical entities with idealized properties. Forms constitute the objects of our knowledge and exist independently of the objects, events or actions of reality which they are supposed to represent. Moreover, unlike the dynamic sensory input constantly processed by our brains, Forms are unchanging, making it possible for knowledge to be stable.

Plato's pupil, Aristotle, made contributions to both metaphysics and epistemology which are important in the context of terminology. He argued that the human mind is capable of abstracting general concepts (not unlike Plato's Forms) from real world objects which share certain features or properties. This process of grouping instances into categories based on shared properties is a fundamental way of imposing order on an otherwise chaotic world. It is also the source of what he called “basic knowledge” which is, in turn, the prerequisite for all further knowledge.

Classical conceptual analysis

Aristotle introduced the classical conceptual analysis, which is “a [definition] giving metaphysically necessary and jointly sufficient conditions for being in the extension across possible worlds for that concept”8. The notions of “extension” and “intension” were originally proposed by the German philosopher Gottfried Wilhelm Leibniz (1646-1716) and form part of the semantic triangle which is discussed in subsection 2.2.3. In short, the extension of a concept is the complete set of objects or entities to which the concept refers. The “necessary and jointly sufficient conditions”, on the other hand, constitute the intension of the concept and determine the concept's extension. Analytical definitions are especially important in terminology, because their conceptual intensions pinpoint the exact and non-overlapping positions of the target concepts in a conceptual hierarchy.

An implication of the classical, or Aristotelian, conceptual theory is that every complex concept has a classical analysis. By complex concept we understand a concept which has an analysis in terms of other concepts, and the definition of a classical analysis was quoted above. A classical analysis of a complex concept has two components: the analysandum and the analysans. The former is the concept which is being analyzed, and the latter is the concept which acts as the vehicle of analysis. A “necessary and sufficient condition” for falling under a concept C is a condition which must hold for all members of C but at the same time a condition which necessarily implies membership of C. For example, to be a bachelor you must be an unmarried male, and if you happen

8The Internet Encyclopedia of Philosophy, www.iep.utm.edu

to be an unmarried male, you are, in fact, a bachelor. Further conditions on classical conceptual analyses include the following.

1. classical analyses cannot be circular (*a bachelor is a bachelor)

2. the analysans of a classical analysis must be simpler than its analysandum

3. a classical analysis does not include any vague concepts in its analysandum or analysans9

4. the definition constraint of classical analyses implies a Substitutivity Principle whereby the analysandum and analysans are mutually substitutable [Orilia and Varzi, 1998]

Typical examples of analytical definitions in terminology thus have the format:

analysandum = analysans
analysans = genus proximum + differentia specifica

where genus proximum is the closest superordinate concept of the analysandum, for example “male” in the case of “bachelor”, and differentia specifica are the characteristics which distinguish the analysandum from its cohyponyms, in this case the feature-value specification [marital status: unmarried].

From the viewpoint of classical conceptual analysis, knowledge can then be defined as justified belief in a definition.
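The “bachelor” example can also be rendered as a simple data structure; the following Python fragment is purely illustrative and not a fragment of any actual termbase.

bachelor = {
    "genus_proximum": "male",                         # closest superordinate concept
    "differentiae": {"marital_status": "unmarried"},  # distinguishes it from its cohyponyms
}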

Pros and cons of classical conceptual analysis

That classical conceptual analysis is computationally attractive is evidenced by a long-standing goal of the Artificial Intelligence (AI) field of analyzing complex concepts by means of universal primitives. The motivation for finding such universal conceptual primitives was that they might allow computer systems to decompose natural language strings into formal and language independent knowledge representations and either generate translations of the strings into other languages by means of this “interlingua” or apply logical inference to access information only implicitly present in the strings. However, rule-based machine translation systems and manually engineered expert systems have proven surprisingly fragile and only functional in highly specialized contexts. In short, they have met with the knowledge acquisition bottleneck mentioned in the introduction and have been replaced by, or at least augmented with, probabilistic and inductive knowledge acquisition systems. Nevertheless, classical conceptual analysis as such remains an extremely useful technique in practical terminology work and in mark-up languages like XML used for content representation.

While classical conceptual analysis is computationally attractive, objections to this approach are many. Generally speaking, most critics maintain that while some complex concepts can be analyzed by the classical approach, other concepts cannot; many words, for example, represent vague or fuzzy concepts which are not characterized by binary membership conditions. In prototype theory (formulated in [Rosch, 1973] and popular

9The Internet Encyclopedia of Philosophy

in modern cognitive semantics) membership of a category is not a binary but a gradable property, in the sense that while a three-legged cat does belong to the category of cats, it is a less prototypical example than a regular, four-legged instance of the species and so on. The implications these objections have had for terminological theory are summarized in subsection 2.2.4.

Paradoxes of inquiry and discovery

While classical conceptual analysis has been hugely successful, it does involve what is known as Meno's paradox (from the dialogue by Plato), namely that the correctness of a conceptual analysis entails its triviality.

[...] our desire to know the analysis of [the concept] c cannot be satisfied unless we already know the analysis of c! But if we already know the analysis of c, we surely cannot learn it. So it appears that there is simply no way to learn the analysis of c. [Moffett, 2005]

In the context of linguistics, the paradox is known as “the paradox of language acquisition” and is illustrated by the following questions: How do we acquire the semantic primitives by which more complex concepts can be analyzed and understood, are these primitives truly universal, and if so, how many are there? One attempt at solving the paradox is the proposal of a “Universal Grammar”, of which Noam Chomsky has been a modern advocate. A universal grammar involves the idea that a linguistic competence module containing a finite number of deep-structure rules is hardwired into the human brain. The existence of such a grammar would solve the problem afflicting decompositional approaches to semantics (e.g. [Wierzbicka, 1992, Wierzbicka, 1995]), namely that the semantic primitives are themselves unanalyzable.

Critics of Universal Grammar and generative linguistics would argue that the only universal linguistic feature of the human brain is that it is equipped with a sophisticated pattern recognition module which allows a gradual and inductive generalization from large amounts of linguistic performance to a type of probabilistic grammar. Since the early 1990s, inductive and probabilistic approaches to linguistics have gradually been replacing the generative approaches, as witnessed by the tangible successes of statistical NLP and data-driven Machine Learning (see section 3.1 for more on corpus linguistics).

In the context of computational terminology, the language acquisition paradox need not concern us. The paradox of discovery, however, still needs to be addressed. Relation extraction systems like the one implemented in this thesis do not miraculously extract new knowledge (i.e. new analyses) from collections of text. But as Plato asserts by his Theory of Recollection, while learning analyses of concepts does not in any sense bring forth new knowledge, it is not a meaningless activity, because it does activate prior knowledge tacitly known by the individual or at any rate by human society in general.

[...] we do in fact know the analyses of most of our concepts; what we lack is explicit, conscious access [my emphasis] to those analyses. [Moffett, 2005]

In essence, what relation extraction systems try to do, then, is to help a terminologist (re)collect knowledge fragments in the form of linguistically instantiated semantic

relations which, when pieced together, may form an explicit and directly accessible conceptual analysis.

Types of knowledge

In an article which argues that the alliance between terminology and knowledge engineering is perhaps not theoretically fruitful, [Toft, 2000, p237] discusses the following knowledge dichotomy, originally introduced in [Oeser and Picht, 1999].

1. sphere

(a) common sense knowledge (concrete)
(b) specialized knowledge (abstract)

2. content

(a) descriptive/declarative knowledge
(b) procedural knowledge

Specialized knowledge, as opposed to common sense knowledge, requires a special language to be effectively communicated. While descriptive or declarative knowledge is knowing that something is the case, procedural knowledge is knowing how to accomplish a specific task. The type of knowledge typically processed by terminologists is specialized rather than common, but it can be both abstract and concrete depending on the target domain. An example of abstract, specialized knowledge is the terminology of a domain like Computer Science, while the terminology of upholstery, for example, is specialized but highly concrete. The knowledge of the Biomedicine domain which is the test case of this thesis is arguably more abstract than concrete.

As for the distinction between descriptive and procedural knowledge, the classical research object of terminology is descriptive knowledge, because it is easier to analyze this kind of knowledge conceptually by means of feature-value matrices and represent it in formal concept systems. Thus the knowledge sought by WWW2REL is descriptive rather than procedural.

Implicit and explicit knowledge

The terms “implicit” (or “tacit”) and “explicit” knowledge are mainly used in the field of knowledge management, about organizational activities such as making sure that employees codify important practices and work routines, so that these bits of implicit knowledge (typically of the procedural variety) can be made explicit, disseminated and used to boost the productivity of the organization. Nevertheless, the distinction can also be applied to characterize the nature of the descriptive knowledge sought by relation extraction systems like WWW2REL.

Most of the knowledge expressed by means of natural language is, in fact, only tacitly or implicitly present, and implicit knowledge exists in virtually any natural language text. World knowledge and inference mechanisms allow human beings to derive much more information from the linguistic signs in a text than is explicitly stated. For example, given the short sentence

His wife and her students left the room.

a human being will be able to infer a long list of knowledge fragments (conceptual relations), none of which are actually explicitly present in the sentence. The following are just a few examples.

• ISA(wife,woman)

• ISA(wife,adult)

• PART_OF(room,building)

• HAS(wife,husband)

Decoding the complete semantic content of the sentences which follow may require that these conceptual relations, or characteristics, are retrieved from the set of ontologies stored in the fuzzy database known as the human brain. The example illustrates the need to explicate, or recollect, implicit knowledge, for example by harvesting conceptual relations from natural language text and representing them in ontologies (the topic of section 2.3). For Language for General Purposes (LGP), monumental ontologies, for example WordNet10, have already been manually produced, but for Language for Special Purposes (LSP) the availability of ontological resources depends on the subject field. However, specialized ontologies tend to change at a much faster pace than general ontologies (due to the exponential technological progress of modern science), and thus the need to automate their construction is much more pronounced.

In conclusion, the focus of this thesis is exclusively on the automatic acquisition of specialized, descriptive and explicit knowledge in the form of semantic relation instances.

2.2.2 LSP and LGP

Specialized knowledge is typically communicated by a Language for Special Purposes, or LSP. The acronyms LSP and LGP (Language for General Purposes) often demarcate not only two distinct functions of natural language but also two distinct fields of linguistic research. Since the research object of terminology is specialized knowledge, the distinction between LSP and LGP mirrors the scholarly divide between the fields of terminology and lexicology.

However, as is usually the case with dichotomies, one can always ask the unorthodox question: How distinct is LSP really from LGP, or more specifically, what are the main features which distinguish the two? The key question is whether LSP and LGP are part of the same language system or not. Figure 2 visualizes the “unicentric” versus the “pluricentric” view of LSP [Kragh, 1995]. If one adopts the unicentric view of LSP, then the distinction is a purely quantitative one. Certain structures occur more frequently in certain varieties of LSP than in the language as a whole, but there are no qualitative differences between the two functions of language, no linguistic features

10http://wordnet.princeton.edu

Figure 2: Unicentricity vs. pluricentricity

which are unique to a certain LSP, for example. In fact, LSP and LGP should be considered “not as different forms of language, but as different ways of using the same language” [Widdowson, 1974]. The corpus linguistics point of view (see section 3.1 for more details) is clearly unicentric.

The borderline between LSP and LGP is not a clear-cut one, but rather a continuum. Any text can be analyzed linguistically and statistically for a number of linguistic features (for example pronoun-noun ratios) which are known to be characteristic of a range of communicative settings. Depending on how a text is positioned along a number of these dimensions, for example “involved versus informational” or “narrative versus non-narrative” [Biber et al., 1998, p148], it can be categorized as being more or less LSP'ish. This kind of multidimensional analysis of linguistic features has become the bread and butter of automatic text categorization.

The ratio of explicit versus implicit knowledge in a given text can be another feature with which to characterize LSP and LGP. This ratio varies depending on the communicative purpose and setting of the discourse. LGP will tend to have a lower density of explicit knowledge, because fully decoding LGP messages typically only requires knowledge of very general ontologies which can be presupposed of any reasonably educated adult familiar with the given language in which the message is encoded. On the other hand, LSP aimed at intermediates or novices tends to have a higher density of explicit knowledge, because a proper decoding is only possible given knowledge of much more specific (sub)ontologies which the author (or sender) does not expect of his readership (or receivers). However, LSP in an expert-expert setting will have a lower density of explicit knowledge.
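One such feature can be sketched in a few lines of Python; the pronoun-noun ratio mentioned above tends to be high in involved, conversational (LGP-like) language and low in informational (LSP-like) prose. The sketch assumes NLTK with its default English tokenizer and tagger models installed.

import nltk

def pronoun_noun_ratio(text):
    # Penn Treebank tags: PRP/PRP$ are pronouns, tags starting with NN are nouns
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text))]
    pronouns = sum(1 for t in tags if t in ("PRP", "PRP$"))
    nouns = sum(1 for t in tags if t.startswith("NN"))
    return pronouns / nouns if nouns else float("inf")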

Academic language

Empirical analyses of a range of LSP versus LGP corpora have, in fact, suggested the existence of a third, intermediate use of language, namely a kind of basic scientific vocabulary realized by families of Academic Words [Coxhead, 2000]. In his corpus-based analyses, [Coxhead, 2000] compiled a balanced Academic Corpus of approximately 3.5 million words representing a wide range of scientific domains and genres. He defined the following three criteria which must be met by an academic word.

Table 1: Academic word families

head word    word forms
modify       modification(s), modified, modifies, modifying, unmodified
category     categories, categorize, ...
...          ...

1. it must not occur among the 2,000 most frequent words of English

2. it must occur at least 10 times in each of four corpus sections11 and in 15 or more of 28 subject areas

3. it must occur at least 100 times in the Academic Corpus
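These three criteria translate directly into a filter. The Python sketch below assumes hypothetical, precomputed frequency tables (the variable names are illustrative only) and is merely meant to make the criteria concrete.

SECTIONS = ("Arts", "Commerce", "Law", "Science")  # the four corpus sections
SUBJECT_AREAS = range(28)                          # the 28 subject areas

def is_academic_word(family, top2000, freq_by_section, freq_by_subject, total_freq):
    return (
        family not in top2000                                         # criterion 1
        and all(freq_by_section[family][s] >= 10 for s in SECTIONS)   # criterion 2
        and sum(1 for a in SUBJECT_AREAS
                if freq_by_subject[family].get(a, 0) > 0) >= 15       # criterion 2
        and total_freq[family] >= 100                                 # criterion 3
    )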

Table 1 lists two examples out of the 570 academic word families identified by [Coxhead, 2000]. As will be evidenced by the manual evaluation of WWW2REL in chapter 5, this kind of specialized, but domain independent, vocabulary also represents vague or fuzzy concepts which are not helpful when extending an existing terminological ontology.

What we are looking for when extracting domain-specific concepts are typically noun phrases (NPs). Based on extensive empirical investigations, the Longman Grammar of Spoken and Written English (LGSWE) observes that almost 60% of all NPs in a corpus of academic prose have a modifier, either a premodifier, a postmodifier or both, whereas this is only the case for 15% of the NPs in a corpus of conversation and 30% in fiction [Biber et al., 1999, p578]. Moreover, on the discourse distribution of NP types in academic prose, the LGSWE study reveals that

Although it is by no means an absolute rule, repeated references to an entity tend to follow the same progression of noun phrase types across texts: N + postmodifier -> premodifier + N -> simple noun -> pronoun [Biber et al., 1999, p586]

An example of this phenomenon taken from the domain of Biomedicine is given in table 2. The tendency to use increasingly compact expressions to refer to the same concepts has an impact on the implementation of WWW2REL outlined in subsection 5.1.2. Since the system makes no attempt at resolving anaphora, the final stage of identifying the reference of pronouns is completely ignored. However, an attempt is made at conflating the two NP types representing the first and second references to a given entity in a discourse. The system also handles the third reference by grouping NPs by their head.

11Arts, Commerce, Law and Science

Table 2: NP compression in academic writing

mention           NP
1st mention       bleeding in the stomach
2nd mention       stomach bleeding
3rd mention       bleeding
subseq. mentions  it
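The head-based grouping mentioned above can be illustrated with a crude heuristic: strip a prepositional postmodifier and take the final word of what remains as the head. The Python sketch below is an assumption-laden simplification (the preposition list is arbitrary, and real head identification would require NP chunking or parsing).

import re
from collections import defaultdict

def head_of(np):
    # drop a prepositional postmodifier, then take the last word as the head
    core = re.split(r"\s+(?:in|of|on|for|with)\s+", np)[0]
    return core.split()[-1]

def group_by_head(noun_phrases):
    groups = defaultdict(list)
    for np in noun_phrases:
        groups[head_of(np)].append(np)
    return dict(groups)

print(group_by_head(["bleeding in the stomach", "stomach bleeding", "bleeding"]))
# {'bleeding': ['bleeding in the stomach', 'stomach bleeding', 'bleeding']}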

Figure 3: The semiotic triangle

2.2.3 Termhood and the semiotic triangle

When attempting to extract fragments of specialized knowledge from the entire WWW, the distinction between what is specialized and what is common (or general) becomes important, since the WWW is a repository of both types of knowledge (see subsection 3.1.1). Essentially, the problem is all about determining the degree of termhood of the arguments of the relation instances which are retrieved. As described in the introduction, when looking for bits of specialized knowledge, knowledge patterns are considered “noisy” if they lead to the identification of a general language, fuzzy concept. But what is meant by fuzzy? What is the difference between terms and words, and thus the difference between concepts referred to in specialized versus general language? This question of termhood is addressed in the following.

The semantics of lexical items, whether functioning as words or terms, can be explained by semiotic models. The most famous of these are Ferdinand de Saussure's (1857-1913) dualistic model of the linguistic sign and Charles Sanders Peirce's (1839-1914) triadic model of signs in general. Saussure defined linguistic signs as a link between two mutually dependent parts, namely a signifier (a sound pattern) and a signified (a concept), and asserted that this link is (ontologically) arbitrary, essentially echoing the conventionalist stance taken by Hermogenes in Plato's Cratylus (language works by pure convention). It should be noted that the relation between signifier and signified is often only relatively arbitrary, as can be observed in the semantics of noun compounds, which are usually a non-arbitrary juxtaposition of individual signs.

Peirce's model of signs (see figure 3), on the other hand, has three components: an

object, the representamen (or symbol) and an interpretant (or concept). Unlike Saussure's signified, Peirce's interpretant is itself a sign in the mind of the interpreter, and the interpretation of a sign can thus involve a theoretically infinite chain of intermediary signs. Cascades of semantic triangles are an inherent feature (and paradox) of human language and cognition, because the semantic content of linguistic signs can only be made explicit by relating these signs to other signs and so forth.

In terminology, the expressions extension, intension and term are typically used instead of object, interpretant and representamen, respectively. Terms need not have an extension (for example astrophysical “worm holes” may not exist) and intensions need not have a term (for example newly created notions in the mind of a scientist), but the extension and intension of a concept will always be inversely related to each other. The more features expressed in an intension, the fewer objects or referents will be included in its extension, and vice versa. As terminological ontologies will typically contain concepts having relatively rich intensions, augmenting these ontologies may involve data sparseness problems, as described in chapter 4.

For terminology, [Suonuuti, 1997] proposes the definition as a fourth component, turning the triangle into a pyramid. The definition has no direct bearing on the term or its extension, but it pinpoints the exact position of the concept in a conceptual hierarchy by describing the necessary and sufficient characteristics of the concept (see also subsection 2.2.1), which will include essential relations to other concepts in the hierarchy. The specificity of reference thus seems to be the property which grants termhood status to a lexical unit. In other words, terms are lexical units which can be defined analytically and be dissociated from context without losing referential specificity. Nevertheless, termhood criteria have been debated by many scholars, and the following are but a few of numerous definitions of the term.

1. a verbal designation of a general concept in a specific subject field (ISO 1087- 1/ISO 12620)

2. the terminological unit represents a concept, uniquely and completely, taken out of any textual context. The existence of this one-to-one relationship between a linguistic expression and an extra-linguistic object is a situation which particu- larly concerns the terminological units. [Bourigault, 1992]

3. It can be argued that a term is a place-holder for its definition, which in turn is a place-holder for a concept: in standardized terminology the definition should follow the principle of substitutability for exactly this reason. [Gillam, 2004, p74]

4. Terms signify concepts belonging to a specific subject field [Madsen, 1991, p84]

What distinguishes a term from a word is thus supposedly the monosemy and contextual independence of the former. The reference of words, as opposed to terms, is typically so general and imprecise that words become ambiguous if they are not accompanied by considerable amounts of context. However, the modal “should” in the third definition indicates a possible gap between linguistic reality and terminological desiderata.

Figure 4: Terms as stones in clay (from [Melby, 1995])

I would argue that contextual independence is not a characteristic which delimits terms from words, at least not in all domains. In some domains, for example Information Technology (IT), terms are often formed by semantic, in fact metaphorical (cf. [Meyer, 1997]), extension of existing, general language lexical units. They thus require contextual specification when used in communicative situations where the domain context is not fixed. IT terms like “window”, “desktop”, “menu”, “icon” and so on do not represent unambiguous concepts when taken out of any textual context, but must be accompanied by a domain label to secure their termhood (cf. [Halskov, 2005d, Halskov, 2005a]). Thus the first and the fourth definitions appear to be the most viable ones.

Approaching the nature of termhood from a Machine Translation angle, Alan Melby realized that using “a list of universal, language-independent sememes” was problematic because “the concepts of a narrow domain and the concepts of general language are of a fundamentally different nature” [Melby, 1995, p50]. “Lexical units and terminological units are both derived from sequences of characters from the same writing system, but neither is a subset of the other”. “A word is thus a chunk of pliable clay and a term is a hard stone” [Melby, 1995, p52]. Melby elaborates on his stone-clay analogy (figure 4) by admitting that terms can be superficially ambiguous, a phenomenon known as term variants, but stressing that only words are fundamentally ambiguous. While Melby's analogy is very apt, it does perhaps downplay the extent to which the terminology of some domains is affected by “determinologization”. This phenomenon has been defined as

[...] the ways in which terminological usage and meaning can ’loosen’ when a term captures the interest of the general public. [Meyer, 2000, p12]

When large numbers of non-experts start referring to their desktop computer cabinet as their “hard disk”, for example, one might indeed say that the former stone is starting to act like clay, at least in non-specialized communicative contexts. But does this mean that terms can be polysemous? Bodil Nistrup Madsen argues as follows.

One term corresponds to one concept (intension, sense), i.e. terms are never polysemous, only expressions are polysemous [...] In the same way it may be argued that only expressions, not terms, can be homonyms. [Madsen, 1991, p84]

The above erroneous usage of the expression “hard disk” and the coincidence that the string “window” may be used to refer to both physical and digital windows do not cause the two terms to be polysemous. In the domain-specific context, when IT experts convey information about their domain, “window” will always refer to the digital representations occurring on computer monitors, and “hard disk” will never refer to desktop cabinets. Even if two different concepts happen to have the same label, they are still treated as two unrelated entities in a termbase. However, when searching on the WWW for semantic relations (see subsection 3.1.1), the communicative context is not fixed and determinologization can be a very real problem (see also section 6.5).

2.2.4 Schools of terminology

The topic of this section is the research object of terminology as viewed by the following three distinct terminological schools.

1. General Theory of Terminology (GTT)

2. Communicative Theory of Terminology (CTT)

3. Socio-cognitive Theory of Terminology (SCTT)

The expression “terminology” has multiple meanings, usually one of the following.

1. a set of terms belonging to a special language

2. a body of knowledge about any set of terms, the concepts they represent and the relations between these

3. the practical task of identifying and processing 1) so as to arrive at 2)

4. the structuring of 2) in conceptual hierarchies (ontologies)

However, the big issue, theoretically speaking, is whether one might not replace “terms” with “words” and reach the conclusion that there is no difference between the fields of lexicology and terminology, or rather that terminology must be regarded as a subfield

of lexicology and not a scientific field in its own right. The debate about the theoretical foundations of terminology can be traced back to Saussure's dualistic model, which gives rise to two perspectives from which linguistic signs can be viewed, namely the semasiological and the onomasiological perspective. An argument supporting the claim that terminology is indeed a scientific field in its own right is that its perspective is onomasiological, i.e. the focal point is the conceptual side of linguistic signs rather than the usage of these signs themselves.

General Theory of Terminology (GTT)

Traditional terminological theory, best known as the General Theory of Terminology (GTT), is summarized in Wüster's Einführung in die allgemeine Terminologielehre und in die terminologische Lexikographie [Wüster, 1991], published posthumously in 1979. According to GTT, terminology is distinct from lexicology in three respects.

First of all, the modus operandi of terminologists should be purely onomasiological (as opposed to semasiological). GTT stresses the primacy and the autonomy of the concept as part of universal conceptual structures (ontologies). Concepts are the starting point, which is followed by a naming (labelling) process that must ensure unambiguous reference. In lexicology the point of departure is the lexical unit and the mapping of its semantic structure. A consequence of the onomasiological perspective is that phenomena like synonymy have been largely neglected, because this is a relation between lexical items rather than between concepts.

Secondly, terminologists focus on vocabulary and consider morphology, syntax and other aspects of linguistic performance to be largely irrelevant.

Finally, it is argued that the semantic content of terms should be protected (invariant) through standardized usage so that terms are characterized by a 1:1 mapping between term and concept, and this mapping is specified by means of the classical analytical definition (see subsection 2.2.1). If these principles are adhered to, terms are clearly different from words, which by nature are polysemous and exposed to semantic shifts over time.

It seems intuitively true that the chronology of term formation has the following order.

1. conceptual innovation

2. term creation

3. definition (mapping from term to concept)

4. standardization (elimination of ambiguity in the mapping of 3).

However, the final step of standardization is very hard, if not impossible, to achieve in reality, as speakers are free to use terminology as they please. The main discussion in newer theories of terminology, in fact, is whether all terms should be treated as if they were standardized and whether various linguistic, cognitive and social contexts of term usage should not also be investigated.

Communicative Theory of Terminology (CTT)

Communicative Theory of Terminology (CTT) is an example of a terminological school which argues that the linguistic and communicative aspects of terminology should be investigated more carefully. In [Cabré, 2000, Cabré, 2003] it is claimed that the research object of terminology is neither concepts nor terms but rather Terminological Units (TU).

At the core of the knowledge field of terminology we, therefore, find the terminological unit seen as a polyhedron with three viewpoints: the cognitive (the concept), the linguistic (the term) and the communicative (the situation) [...] each one of the three dimensions, while being inseparable in the terminological unit, permits a direct access to the object. [Cabré, 2003, p187].

Although GTT accounts for one dimension of the terminological polyhedron, namely the conceptual one, it fails to fully consider the other dimensions. This does not mean that GTT is flawed, because TUs are such complex and multidimensional phenomena that they can hardly be accessed on all fronts at once. It does mean, however, that GTT can only be an ancillary component in a more comprehensive theory which, according to Cabré, is just beginning to emerge.

While J. C. Sager did not use the name “Communicative Theory of Terminology”, he discussed the communicative aspects of specialized discourse a decade before Cabré. He argues that a message can be defined as "the totality of intention, assumed expectation, knowledge content and language selected by the sender" [Sager, 1990, p100]. The extent to which the recipient is able to decode the knowledge content and the intention of the message, that is the text and its purpose, will be decided by three important factors, namely

1. precision

2. appropriateness

3. economy of expression

The highest degree of precision of a term could be achieved by using the complete definition of the concept it represents. However, this practice would generate a large number of lengthy and complex terms which would violate the principle of economy of expression. According to this principle, compactness of realisation (see also the example in table 2), for instance by acronymy, abbreviation or ellipsis, is to be desired in efficient specialist communication, but these techniques may reduce precision if the interlocutors fail to remember the full form. Arriving at expressions which are both economical and precise is quite a balancing act, and the key to success is the third constraint, namely appropriateness.

Appropriateness is essentially a pragmatic criterion, a communicative norm or set of conventions which has been gradually established over time through frequently repeated special speech acts. Appropriateness decides which definitions can be presupposed in a given communicative context and thus "also decides the degree of general and special reference required in the individual speech act" [Sager, 1990, p112]. It is,

in other words, a question of convention and pragmatic context whether a long definition (partly phrased in words with general reference) or a compact term, which can be considered a substitute label for the same definition, happens to be used. In cases where "the recipient has neither the lexical nor the conceptual resources" [Sager, 1990, p114], precision and appropriateness of expression may take precedence over economy. This will presumably often be the case in textbooks or popular science magazines where domain-specific concepts have to be explained to non-experts. Such domain-specific, but “non-economical”, texts are rich in explicit knowledge and thus ideal targets for pattern-based relation extraction systems like WWW2REL.

Socio-cognitive Theory of Terminology (SCTT)

Unlike CTT, SCTT does not criticize GTT for ignoring the linguistic or communicative aspects of terms, but it claims that the treatment of the conceptual side of terms in GTT is idealized. SCTT claims that there is often no clear separation between general and specialized knowledge in human cognition. In other words, it questions the premise that the concepts represented by terms are really fundamentally different from the meanings represented by words.

While the GTT approach must be counted among the positivist or objectivist theories of science, SCTT, as advocated in [Temmerman, 2000], is a hermeneutic or experientialist theory. The premise in experientialism is that reality does not exist independently of the perceiving subject. All knowledge comes from experience, and meaning cannot be completely objectified, because it always involves a subject and is perceived and expressed through an inescapable filter (natural language). [Temmerman, 2000] thus claims that terms, more often than not, represent categories (or “notions” in the terminology of [Sager, 1990]) which are as fuzzy and dynamic as those represented by words. She argues that clear-cut concepts, which are not prototypical to some extent, are extremely rare outside of exact sciences like Mathematics and Chemistry [Temmerman, 2000, p223]. The analytical (intensional) definitions used in GTT are thus often inadequate, because prototypical categories with gradable membership cannot be understood in a logical or ontological structure.

In SCTT one speaks of Units of Understanding (UU) rather than of concepts. These UUs typically have prototype structure and are in constant evolution. UUs can rarely be intensionally defined but should be interpreted by means of “templates of understanding” which are composed of different modules of information depending on the receiver and the context.

Criticism of GTT

A proponent of CTT argues that Wüster

developed a theory about what terminology should be in order to ensure unambiguous plurilingual communication and not about what terminology actually is in its great variety and plurality [Cabré, 2003, p167]

Another critic of GTT has claimed that

in the traditional theory of terminology, there is not a single explanation of the formal relationship between ’concept’ and ’terms’ which makes it essentially different from the relationship between meaning and words in general linguistic semantics. [Kageura, 2002, pp21-22]

That terms are formally indistinguishable from words makes Kyo Kageura speculate that termhood is really an “aspectual category” [Kageura, 2002, p26]. If this is the case, it is an argument against the pluricentric view of LSP (see figure 2).

Most of what currently passes for a theoretical foundation of terminology amounts to little more than a simplified, a priori theory of conceptual structures supported by largely prescriptive principles of what "should be" rather than what is actual usage of terms. [Kageura, 2002, p1]

By using conceptual structure as the basis for a theory of terms one runs the risk of building a theory of something which can be used to describe terms rather than a theory of terms themselves. Like Kyo Kageura, Jennifer Pearson also points to the infelicities of traditional terminological theory. She criticizes GTT for conjuring up “pure terminologies” which are idealized and out of touch with reality. She argues vehemently against this notion by stating that "it is futile to propose differences between words and terms without reference to the circumstances in which they are used" and that we need to consider "what happens when terms are actually used in text rather than simply as labels for concepts in knowledge structures" [Pearson, 1998, pp7-8]. Language users, including specialists, often violate usage standards either because they are unfamiliar with them or because they need to use new terminology which has not yet been standardized.

In defence of GTT

Although the GTT approach to terminology and knowledge representation has been accused of being minimalistic, because concepts are regarded as clear-cut and non-overlapping nodes whose ontological position can always be unambiguously identified by means of analytical definitions, it has a number of obvious advantages. First of all, it allows the generation of unambiguous, formal ontologies (see subsection 2.3.2) which can be used for making logical inferences in advanced AI systems, for example. Secondly, it has proven its worth for several years now in practical terminology work as an ideal tool for conceptual clarification in virtually any domain.

A counter-argument against the GTT criticism voiced by [Temmerman, 2000] is that even if concepts are fuzzy, it is still possible to use analytical definitions and build ontologies which are not fuzzy. The only complication is that several definitions (and thus several ontologies) may arise when dealing with fuzzy concepts. But it is, in fact, typical of terminology work that there is more than one interpretation of, or perspective on, the knowledge of a subject field. The terminologist will then simply construct one ontology for each interpretation and select the most appropriate one or try to standardize the fuzzy concepts as described in the following quote.

Both in the case of ambiguous descriptions and in the case of different understandings it may be useful - as a basis for concept clarification - to set up several versions of an ontology representing the different views. In the case of descriptive terminology work, it would perhaps be relevant to set up different ontologies in which the superordinate concept of blotting [emphasis as in original] and the characteristics vary. In the case of

normative terminology work, however, the terminologist - in co-operation with subject specialists - will have to decide upon one superordinate concept and one delimiting characteristic. This does not mean that the different views or understandings could not be presented in the final result of the terminology work. [...] One very important thing to stress is that, no matter which solution is chosen, none of the different versions of the ontologies will contain indeterminacy, but it will be possible to describe indeterminacy of language by comparing different solutions [my emphasis]. [Madsen, 2007]

2.2.5 Theoretical stance of the thesis

In conclusion, the source of the heated debate on the scientific foundations of terminology appears to be a basic difference of purpose. Is our purpose to ensure unambiguous and efficient communication between experts (i.e. normative), or is it to document the myriad of ways in which specialized knowledge can be expressed in natural language (i.e. descriptive)? Or is it even to understand what goes on inside the heads of people when they process, store and communicate specialized knowledge expressed by means of terms?

Although they disagree on the way concepts should be represented, GTT and SCTT are both purely conceptual approaches to terminology and knowledge structuring. The approach adopted in this thesis is purely inductive and descriptive and manipulates the linguistic rather than the conceptual side of terms. It thus differs from GTT, which is a conceptual and primarily prescriptive framework, and from SCTT, which is a conceptual and, perhaps, more descriptive theory. While it shares the descriptive approach of CTT, it differs from CTT in that WWW2REL is a text mining system which makes no attempt at analyzing the larger units of discourse but operates at the more minute level of the individual sentence or even sentence fragment. Nevertheless, WWW2REL is designed to be a useful tool for terminologists, who may organize its output in ontologies using GTT principles, for example.

2.3 Knowledge representation

Although (pattern-based) AKA is the main research focus of the thesis, this section gives a brief account of the subsequent step of knowledge structuring and knowledge representation. The reason such an account is deemed pertinent is that the starting point of the relation extraction system is a set of seed relation instances from an existing ontology (in this case the UMLS Metathesaurus). Consequently, basic knowledge about the properties and the structure of this ontology, but also of ontologies in general, will be useful. The cursory overview of the field of ontology and knowledge representation presented in this section is also needed in order to be able to discuss the choice of UMLS relation types used in the empirical experiments and system evaluation in chapters 4 and 5. Finally, the section aims to define how terminological ontologies differ from linguistic, philosophical and other types of ontologies.

Section 2.2 included a survey of the theoretical schools of terminology, and it was concluded that the main aim of classical terminology (General Theory of Terminology

Figure 5: A fragment of the “Entity” subontology in the UMLS Semantic Network

or GTT) is normative in that its raison d'être is to ensure unambiguous communication between domain experts through conceptual clarification, standardization and representation. In practical terms, the GTT methodology is to identify the essential characteristics (which may be semantic relations) of the key concepts of the target domain, pinpoint their closest superordinate concepts and by an iterative process define the ontological positions of these key concepts (relative to each other). The resulting structure is a knowledge representation. These structures are also known as ontologies or concept systems, and they can be visualized as a network of inter-connected nodes with concept labels12. Figure 5 is a visualization of a small fragment of the Entity subontology which is part of the UMLS top ontology (the Semantic Network).

Generally speaking, the way knowledge is stored and represented depends on the particular purpose for which the knowledge has been compiled. Visualizations like the one in figure 5 can serve as intermediate representations which may be helpful as part of a conceptual clarification and structuring process. To be really useful in real world applications like AI or IR systems, a more formalized structure would be necessary, including, for example, a translation of concept definitions into a logical language or a complete mapping between individual concepts and their term variants. Typically, the structure would also be stored in a database so as to facilitate quick retrieval of (sub)sets of information.

12 In terminological ontologies the labels need not be terms which are actually used in natural language

2.3.1 Studying existence

The science of ontology has a single, but ambitious goal, namely the study of existence. Not surprisingly, its history is both long and intertwined with fundamental scientific disciplines like Epistemology and Philosophy (cf. subsection 2.2.1). This subsection will not ponder the philosophical aspects of ontology, but merely highlight a few aspects of knowledge representation which are important in the context of modern ontologies like the UMLS Metathesaurus. The following two definitions of "ontology" are taken from [Buitelaar et al., 2005] and [Sowa, 2000], respectively.

An ontology is an explicit, formal specification of a shared conceptualization of a domain of interest, where formal implies that the ontology should be machine-readable and shared that it is accepted by a group or a community. [Buitelaar et al., 2005, p3]

The subject of ontology is the study of the categories of things that exist or may exist in some domain. The product of such a study, called an ontology, is a catalog of the types of things that are assumed to exist in a domain of interest D from the perspective of a person who uses language L for the purpose of talking about D. [Sowa, 2000, p492]

While philosophical ontologies categorize things which exist, terminological ontologies also model a whole range of non-generic relationships, primarily between domain-specific concepts. In this sense, the second definition of ontology quoted above is perhaps a little restrictive, because it seems to limit the field of ontology to building taxonomies of categories. On the other hand, the definition is interesting because it underscores the significance of the variable of language. Ontological structures are not necessarily language independent, since the speakers of language A may have conceptualized some parts of reality differently from the speakers of language B. Some ontological categories in a domain D, as conceptualized in language A, may simply not exist in language B's conceptualization of the same domain, or they may have a wider range of subtypes in A than in B, for example. Irrespective of the language variable, however, all human beings have the ability to abstract simplified mental representations from the infinite details of sensory input by grouping together instances observed in reality into prototypical categories on the basis of certain shared essential characteristics. Or perhaps rather on the basis of certain essential, but different, characteristics.

All begins with contrasts: light-dark, up-down, hard-soft, loud-quiet, sweet-sour. Such contrasts [...] are the source of distinctions for generating the categories of existence [...] The contrasts, which relate the categories and determine whether a particular entity belongs to one or another, are more fundamental than the categories themselves. [Sowa, 2000, pp68-69]

The contrasts mentioned by [Sowa, 2000] can be represented as feature specifications, i.e. feature-value pairs in a feature-value matrix. The contrast between “hard” and

"soft", for example, could be possible values for a feature called "flexibility", and as visualized in the two feature-value matrices below, "flexibility" could be the one feature which distinguishes the two co-hyponyms "floppy disc" and "hard disc" from each other, i.e. the "subdividing dimension" in the terminology of [Madsen et al., 2005b].

floppy disc:
    ISA: data storage device
    medium type: magnetic
    medium flexibility: soft

hard disc:
    ISA: data storage device
    medium type: magnetic
    medium flexibility: hard

Contrasts, distinctive features or subdividing dimensions are an essential tool when clarifying key concepts in a particular domain so as to arrive at an optimally specified, consistent and unambiguous knowledge representation structure for the domain. The concepts represented by term variants can be decomposed into feature-value matrices, superordinate concepts can be identified (in this example the hypernym "data storage device") and co-hyponyms like "floppy disc" and "hard disc" can be distinguished from one another by identifying a subdividing dimension for which the two concepts have differing values. A dimension like "flexibility" could likely be used to distinguish a number of other co-hyponyms from each other (presumably in a range of other domains than IT), and this explains why [Sowa, 2000, pp68-69] emphasizes that contrasts are more fundamental than the categories they can be used to define. In the context of Biomedicine, which is the case study for which WWW2REL is tested in chapters 4 and 5, the "may_prevent" relation type is important because it may, for example, provide the distinctive feature of a class of central nervous system agents called "analgesics", as illustrated by the following feature-value matrix.

analgesic:
    ISA: central nervous system agent
    may_prevent: pain
    ...
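To make the mechanics of this comparison concrete, the following minimal sketch (illustrative only, not part of WWW2REL or CAOS; the dictionary encoding is an assumption) models concepts as feature-value mappings and identifies the subdividing dimension by comparing two co-hyponyms:

```python
# Illustrative sketch: concepts as feature-value matrices, mirroring the
# floppy disc / hard disc example above. The encoding is invented for
# illustration and is not the representation used in WWW2REL or CAOS.

floppy_disc = {
    "ISA": "data storage device",
    "medium type": "magnetic",
    "medium flexibility": "soft",
}
hard_disc = {
    "ISA": "data storage device",
    "medium type": "magnetic",
    "medium flexibility": "hard",
}

def subdividing_dimensions(a, b):
    """Features shared by two co-hyponyms but carrying differing values."""
    return [f for f in a.keys() & b.keys() if a[f] != b[f]]

print(subdividing_dimensions(floppy_disc, hard_disc))
# -> ['medium flexibility']
```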

2.3.2 The terminological ontology

The purpose of this subsection is to discuss the properties of terminological versus other kinds of ontologies and to classify the UMLS Metathesaurus by means of a set of dichotomies which are introduced and discussed. In her article "Alting på sin plads og plads til alting"13 [Madsen, 2000], Bodil Nistrup Madsen presents a typology of ontologies based on a comprehensive literature survey. The typology is replicated in figure 6, in which the node labels have been translated from Danish into English and the filled circles indicate the feature values of the UMLS Metathesaurus.

13 Own translation: "A place for everything and room for it all"

Figure 6: Typology of ontologies

While general (and philosophical) ontologies usually model reality from the top down using very superordinate and somewhat fuzzy concepts (or rather categories), the level of terminological ontologies is "special" in the sense that they typically model reality from the bottom up using highly specialized and clear-cut concepts. That the UMLS Metathesaurus is a "special" ontology explains why an ATR-like heuristic is introduced in WWW2REL in subsection 5.3.1. As for the dichotomy based on the dimension of domain, even if the concepts represented in an ontology do belong to a specific domain, it may not be a terminological ontology in the strict sense of the word. According to [Madsen et al., 2004], neither WordNet nor the UMLS Metathesaurus would be classified as truly terminological ontologies, because neither ontology meets the following three principles.

1. Uniqueness of dimensions

2. Uniqueness of primary feature specification

3. Grouping by subdividing dimensions

The three principles have been implemented by means of automatic consistency checks in a strictly onomasiological ontology editor called Computer-Aided Ontology Structuring, or CAOS [Madsen et al., 2005a]. A screenshot from CAOS2 can be seen in figure 7, and this example will be used to illustrate the principles behind terminological ontology building. The first principle means that a subdividing dimension14 can only be associated with one concept in a particular ontology15. The second principle means that a particular primary feature specification16, i.e. a feature specification which is not inherited from superordinate concepts, can only appear on one of the daughters of the concept containing the dimension in question. According to this principle, the primary feature specification [striking technique: front] on concept 1.1.1 in figure 7 cannot also appear on concept 1.1.2. Finally, the third principle means that subdividing dimensions cannot overlap. In other words, concepts 1.1.1 and 1.1.2 can have "only one feature specification containing as an attribute the subdividing dimension of the mother concept" [Madsen et al., 2004], i.e. the attribute "striking technique" in the example. The benefit of adhering to the three principles is structural and logical consistency. Returning to the classification of the UMLS Metathesaurus based on the typology represented in figure 6 and discussed in this subsection, the following can now be asserted.

• The UMLS Metathesaurus is a pragmatic, informal, domain-specific, task-independent, but language-dependent special ontology

It is informal because, even if all concepts in the Metathesaurus are linked to a table of definitions17, these definitions are phrased in statements of natural language rather than a formal language.

14 For example "character transfer" in figure 7
15 The concept "printer" in the example in figure 7
16 For example [striking technique: front] in figure 7
17 Named "MRDEF"

Figure 7: Terminological ontologies: Computer Aided Ontology Structuring (CAOS)

It is special and domain-specific because it models only specialized concepts from Biomedicine, and it is language dependent because term variants and definitions are phrased in English. Even if the Metathesaurus does not meet the three principles outlined in [Madsen et al., 2004], its characteristics align very well with what is normally expected of terminological ontologies, in the more inclusive sense of the term.
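As an illustration of how the first of the three principles above lends itself to automatic checking, here is a minimal sketch in the spirit of CAOS (not its actual implementation; the toy ontology encoding is invented for illustration):

```python
# Check the uniqueness-of-dimensions principle: a subdividing dimension
# may be associated with at most one concept in a particular ontology.
# Toy encoding: concept label -> its subdividing dimension (invented data).
from collections import Counter

ontology = {
    "printer": "character transfer",
    "concept A": "striking technique",
    "concept B": "striking technique",  # reusing a dimension violates principle 1
}

def dimension_violations(ontology):
    """Return the subdividing dimensions associated with more than one concept."""
    counts = Counter(ontology.values())
    return [dim for dim, n in counts.items() if n > 1]

print(dimension_violations(ontology))
# -> ['striking technique']
```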

2.3.3 Choice of relation types

The literature on the variety of semantic relation types used in terminology and lexicology is vast (see [Nuopponen, 2005] for a small sample). However, this section will only very briefly introduce some of the most well-known relation types used in terminology work and justify the choice of relations to be examined in the experimental part of the thesis. Needless to say, concepts and conceptual relations are the bread and butter of ontologies. While conceptual relations are typically visualized as links, they can, as illustrated in subsection 2.3.2, also be represented as a feature-value pair in a feature specification, and thus the tasks of identifying conceptual characteristics and conceptual relations are in a sense quite similar.

"Giving information about a concept relation and a related concept corresponds to the information on a characteristic of a concept" [Madsen et al., 2001, p7].

The number of ways in which concepts can be related to each other approaches infinity, but paralleling the distinction between logical and ontological concept systems, there are also two supertypes of conceptual relations, namely logical and ontological relations.

1. Semantic relations

   (a) synonymy
   (b) antonymy
   (c) ...
   (d) Conceptual relations
       i. logical relations
          A. ISA
       ii. ontological relations
          A. meronymy
          B. causality
          C. ...

As can be seen from the above diagram, the logical relation is also known as the ISA relation (or generic relation). It is fundamental in terminology, because it provides the genus proximum in classical definitions and because of the feature inheritance property

of logical concept systems. For example, if the concept "vehicle" has a particular feature, its hyponyms, for example the concept "bus", necessarily also have this feature. There is a wide range of ontological relation types, but the most universal one is meronymy, also known as the PART_OF or partitive relation (e.g. house - door). An important difference between a relation like synonymy and the other relation types listed above is that synonymy does not relate two different concepts, but rather two different terms referring to the same concept. Although the links in terminological ontologies represent conceptual relations, synonymy still plays a vital role in terminology because identifying and grouping synonyms under a concept label is a key part of the conceptual clarification process. Since ISA and synonymy are the two most fundamental relation types used in terminology (and lexicology, for that matter), there is, of course, no avoiding these relations in the empirical experiments in this thesis. Additionally, the experiments will examine the two causal, ontological relations, "induces" and "may_prevent", which are both relatively common in the UMLS Metathesaurus (see table 3 for statistics) and very important in the domain of Biomedicine (this claim is empirically tested in subsection 4.1.2). Finally, the importance of causality is also suggested by a number of more recent terminological studies of this relation type, including [Marshman, 2002], [Barrière, 2001, Barrière, 2002] and [Girju and Moldovan, 2002]. Thus a total of four different relation types, three conceptual and one semantic, are empirically investigated in this thesis.

1. ISA (hyponymy and hypernymy)
2. "induces" relation
3. "may_prevent" relation
4. synonymy

Subcategories of causal relations like "induces" and "may_prevent" have been established both from an existence dependency perspective (see [Barrière, 2001, Barrière, 2002]) and from a role perspective (see [Madsen et al., 2001]), but it is beyond the scope of this thesis to discuss these subclassifications. After all, such an analysis is unlikely to have any impact on the methods employed or the results reported in chapters 4, 5 and 6.

2.3.4 UMLS knowledge sources

This subsection briefly introduces the UMLS knowledge sources and presents a few references to academic work relating to their usefulness in NLP and computational terminology. The UMLS knowledge sources comprise

1. a Metathesaurus
2. a Semantic Network
3. a SPECIALIST lexicon

Table 3: Example semantic relations in the UMLS

    relation type                  #concept pairs
    isa, inverse_isa                      318,391
    has_contraindicated_drug               79,335
    tradename_of, has_tradename            75,767
    part_of, has_part                      42,637
    may_prevent                            18,805
    induces, induced_by                     3,854
    may_diagnose, may_be_diag.              3,037
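Counts like those in table 3 can be reproduced directly from the Metathesaurus relation files. The following sketch assumes a pipe-delimited MRREL-style export in which one column holds the specific relation attribute such as "may_prevent"; the column index used here is an assumption and should be verified against the documentation of the UMLS release at hand:

```python
# Count concept pairs per specific relation attribute (RELA) in a
# pipe-delimited MRREL-style file. The column index is an assumption
# and may differ between UMLS releases.
from collections import Counter

def relation_counts(path, rela_column=7):
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("|")
            if len(fields) > rela_column and fields[rela_column]:
                counts[fields[rela_column]] += 1
    return counts

# Hypothetical usage:
# counts = relation_counts("MRREL.RRF")
# print(counts["may_prevent"], counts["induces"])
```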

The Metathesaurus is a gigantic database containing information about 1.3 million biomedical and health-related concepts and the relations between them. The Semantic Network is an upper-level ontology for this domain which ensures that concepts in the Metathesaurus are categorized in a consistent manner; it contains 134 semantic types and 54 types of semantic links and is described in [McCray, 2003]. Finally, the SPECIALIST lexicon contains some 300,000 biomedical terms and was compiled to facilitate the development of NLP software for the biomedical domain. Table 3 lists some examples of the 54 semantic relations defined in the UMLS Metathesaurus along with the number of registered concept pairs for each relation type (in the 2006AB edition of the UMLSKS). As evidenced by the following quote, the UMLS is clearly a monumental effort towards bringing together diverse biomedical terminologies and forming a coherent framework for knowledge structuring within this domain.

The integration of standardized biomedical terminologies into a single, unified knowledge representation system has formed a key area of applied informatics research in recent years. The Unified Medical Language System (UMLS) is the most advanced and most prominent effort in this direction, bringing together within its Metathesaurus a large number of distinct source-terminologies. The UMLS Semantic Network, which is designed to support the integration of these source-terminologies, has proved to be a highly successful combination of formal coherence and broad scope. [Smith et al., 2004]

Examples of research made possible by the UMLS knowledge sources include mapping between ontologies [Burgun and Bodenreider, 2001], extending terminological ontologies with hyponyms of existing concepts [Bodenreider et al., 2002b], evaluating the performance of information extraction systems [Klavans and Muresan, 2000], semantic annotation of medical abstracts [Vintar et al., 2002], evaluating context features for medical relation mining [Vintar et al., 2003] and many more. Nevertheless, various papers (for example [Bodenreider, 2001], [Bodenreider et al., 2002a], [Kumar and Smith, 2003] and [Smith et al., 2004]) have identified a number of problems, for example coverage, redundancy or structural issues, in both the UMLS Metathesaurus and the UMLS Semantic Network (SN). That such problems exist in the

UMLS knowledge sources should come as no surprise given the diverse terminological sources from which the UMLS has been formed and given the complexity of the domain, as observed in the following quote.

One of the interesting aspects of the use of ontologies within bioinformatics is the complexity and difficulty of the modelling entailed. Compared to the modelling of man-made artefacts such as aeroplanes, some argue that natural systems are difficult to describe [19]. Biology is riddled with exceptions and it is often difficult to find the necessary conditions for class membership, let alone the sufficiency conditions. [...] There are several potential reasons for this, including:
• Membership claims are in fact incorrect
• Current biological knowledge is not rich enough to have found appropriate necessary and sufficiency conditions
• In the natural world, the boundaries between classes may be blurred. Evolution is often gradual and the properties that distinguish one class from another may be only partially represented in some individuals. [Stevens et al., 2004, p640]

The three points raised in the above quotation reflect the conflicting viewpoints of classical terminology and modern theories like Sociocognitive terminology, which argues that concepts in some domains should be viewed as prototype categories with graded membership conditions rather than clear-cut concepts (see subsection 2.2.4 for further details). That biomedical knowledge is complex and often incomplete, or even uncertain, is also reflected by at least one of the system tests, namely the one probing the beneficial effects of selenium in subsection 5.5.2. Discussing the structural infelicities and possible redundancies of the UMLS is beyond the scope of this thesis. WWW2REL primarily makes use of the UMLS to obtain training term pairs, and the correctness of these pairs should not be affected by ontological redundancies or circularities. However, it is interesting to observe that the UMLS has been criticized for low coverage in certain areas. For example, [Bodenreider et al., 2002a] have established that the coverage of concepts in the UMLS Metathesaurus ranges from 2% for gene product symbols to 44% for molecular functions. Applications like WWW2REL should thus be able to find unrecorded relation instances and augment even as comprehensive an ontology as the UMLS.

2.4 Conclusion

This chapter introduced the field of pattern-based AKA and discussed the basic properties of knowledge, specialized knowledge and knowledge representation. Given the objective of the thesis, namely the automatic extraction of semantic relation instances for augmenting existing terminological ontologies, the exposition focused on the properties of specialized, descriptive knowledge. The properties of termhood were discussed (subsection 2.2.3), and this led on to a brief summary of the ongoing debate about the status and objective of terminology as a scientific field. The viewpoints

of various terminological schools were juxtaposed (subsection 2.2.4), and it was concluded that WWW2REL fits into none of these frameworks because they are either prescriptive (GTT), purely conceptual (SCTT) or focused on units of discourse beyond the level usually processed by text mining systems (CTT). The chapter also discussed basic techniques and principles of knowledge representation as well as a typology of ontologies (section 2.3). It was observed that finding semantic relation instances is essentially a way of identifying the distinctive features of concepts and thus an essential task in practical terminology work (subsection 2.3.1). Even though the UMLS Metathesaurus is not a terminological ontology in the strict sense of the term (subsection 2.3.2), it is still a special and domain-specific ontology and is ideal for testing the WWW2REL system because it is freely available and very comprehensive. Before turning to the implementation and evaluation of WWW2REL in chapters 4, 5 and 6, chapter 3 will now introduce and discuss the methodological framework of text mining systems and also survey a range of similar systems.

3 Methodology and applications

While chapter 2 discussed theoretical aspects of terminology and knowledge engineering, the contents of this chapter will be of a more practical nature and primarily introduce the applied sciences which constitute the methodological foundations for the implementation of WWW2REL. Reflecting the revived interest in corpus linguistics and statistical Natural Language Processing (NLP) over the past two decades, the methodological stance taken in this thesis is predominantly empirical and inductive. However, many branches of empirical linguistics have emerged, and the system implementation and evaluation is based on metrics and techniques originating from a number of these applied disciplines. Time and space constraints do not allow an exhaustive treatment of each of these disciplines, but the following sections will provide cursory overviews of each discipline and discuss the specific metrics and the empirical data which are subsequently used in chapter 4. Section 3.1 introduces the discipline of corpus linguistics and discusses the nascent field of Web as Corpus (subsection 3.1.1). In section 3.2 key metrics from the field of Information Retrieval are presented. These are important because the entire evaluation performed throughout chapters 4, 5 and 6 makes use of similar metrics. Section 3.3 discusses the field of text mining, with special attention given to text mining for the biomedical domain. Finally, section 3.4 describes and compares existing pattern-based relation extraction systems in terms of their performance and their similarity to WWW2REL.

3.1 Corpus linguistics

Although text corpora need not be in digital form, corpus linguistics is very much a product of the digital revolution, with easy access to vast quantities of digitized text and fast computers to help detect regularities in this text. A landmark in corpus linguistics was the establishment of the Brown corpus, a one million word, balanced corpus of

American English, in 1967 by Henry Kucera and Nelson Francis [Kucera and Francis, 1969]. The corpus consisted of 500 text samples of 2,000 words each, representing fifteen different genres. One million words is not much by today's standards, but being highly balanced and carefully worked out, the Brown corpus continued to be used even through the following decades, when the Chomskyan focus on linguistic competence rather than performance made corpus linguistics a less attractive endeavour [McEnery and Wilson, 1996, pp4,18]. Towards the end of the eighties corpus linguistics reemerged from oblivion as faster computers and more digitized text revealed the promise of inducing usage patterns semi-automatically. The field of British lexicography in particular led the way, and projects like the COBUILD project, instigated by Professor John Sinclair at the University of Birmingham, remain among the great feats of that time. The most important revelation provided by, for example, corpus-based lexicography was that many native speaker intuitions about word frequencies and word collocations, for instance, were actually off the mark when compared to the results from extensive analyses of large quantities of actual usage. These insights paved the way for statistical NLP and the Machine Learning approaches to linguistics which are in vogue today, at the expense of introspective linguistics. Some twenty-odd years after the Brown corpus was compiled, the British National Corpus (BNC) set a new standard by featuring 100 times as many running words while being equally balanced. Today even the BNC is considered small, but it is still useful and will, in fact, be used in this thesis to find verbs characteristic of Biomedicine (see subsection 4.1.2) and to eliminate semantic relation instances whose arguments have a low termhood (see subsection 5.3.1). As a paradigm for the study of language, corpus linguistics can be applied wholesale or with moderation, so to speak. The two degrees of empiricity in language study are often distinguished as

1. corpus-based
2. corpus-driven

In a corpus-driven approach the commitment of the linguist is to the integrity of the data as a whole, and descriptions aim to be comprehensive with respect to corpus evidence. The corpus, therefore, is seen as more than a repository of examples to back pre-existing theories or a probabilistic extension to an already well defined system. The theoretical statements are fully consistent with, and reflect directly, the evidence provided by the corpus. Indeed, many of the statements are of a kind that are not usually accessible by any other means than the inspection of corpus evidence [...] recurrent patterns and frequency distributions are expected to form the basic evidence for linguistic categories; the absence of a pattern is considered potentially meaningful. [Tognini-Bonelli, 2001, p84]

While the adjective “corpus-based” can be affixed to virtually any language study which to some (unknown) extent has made use of corpora to provide examples or test pre-existing theories, the adjective “corpus-driven” clearly involves a greater degree of

commitment to the corpus or corpora used in the study. [Tognini-Bonelli, 2001] has the following comments on what really distinguishes the corpus-based linguist from the corpus-driven linguist.

[...] given that the data is non-negotiable, does the linguist choose to revise the theory and derive it more directly from corpus evidence, or does (s)he opt to insulate the data from the theory? [...] Given what has been said in the paragraphs above, the answer is that the corpus-based linguist will go for the second option, feeling that a certain amount of variation that has not been accounted for is not important enough to topple a well-established theoretical position. [Tognini-Bonelli, 2001, p67]

Insulating corpus evidence, for example by relegating it to a separate linguistic realm called "performance", has been a popular strategy even in corpus-based linguistics. Another, less extreme, strategy employed when the data does not completely fit the theory is standardization. As observed by [Tognini-Bonelli, 2001], tagging raw text is, strictly speaking, against the spirit of the corpus-driven approach in that the set of tags is the product of a linguistic theory, which is thus given higher priority than the actual data, which is in a sense forced to comply with this theory. Given the above quotations and discussion, the approach taken in the empirical investigations of knowledge pattern usage in this thesis must be classified as corpus-driven. At no point are pre-existing theories about the form of the patterns allowed to guide the discovery and filtering process. The only point at which the work violates the tenets of the corpus-driven research framework is when KP contexts are tagged and chunked so as to facilitate the extraction of relation instances18. Doing so allows a simplification and generalization of linguistic variation, which is essentially the first step of moving from expression to content. It is thus justifiable from the perspective of practical terminology work which, at least as expounded by the GTT school (see subsection 2.2.4), necessarily involves standardization as part of the process of knowledge representation. Even if the present work can be characterized as truly corpus-driven, one important question remains: can the WWW be regarded as a corpus? Subsection 3.1.1 will attempt to answer this question.

3.1.1 Web as corpus research

Before discussing the status of the WWW as a corpus, it will be worthwhile to revisit the corpus typologies and definitions of the "corpus" concept by some of the grand old men of corpus linguistics. The following definitions are taken from four textbooks in corpus linguistics by John Sinclair, Douglas Biber, Graeme Kennedy, and Tony McEnery and Andrew Wilson, respectively.

1. A corpus is a collection of naturally-occurring language text, chosen to characterize a state or variety of a language. [Sinclair, 1991, p171]

18 Forcing a verb for the "induces" and "may_prevent" relations (see subsection 4.1.5) does not violate the corpus-driven tenets, because the usefulness of doing so was established in another, truly corpus-driven study (cf. [Barrière, 2001]).

Figure 8: Corpus typology

2. [...] it utilizes a large and principled collection of natural texts [my emphasis], known as a ’corpus’. [Biber et al., 1998, p4]

3. Whereas a corpus designed for linguistic analysis is normally a systematic, planned and structured compilation of text [my emphasis], an archive is a text repository, often huge and opportunistically collected, and normally not structured. [Kennedy, 1998, p4]

4. So a corpus in modern linguistics, in contrast to being simply any body of text, might more accurately be described as a finite-sized body of machine-readable text, sampled in order to be maximally representative of the language variety under consideration [my emphasis]. [McEnery and Wilson, 1996, p24]

The key delimiting characteristic which sets a corpus apart from a text collection appears to be the way in which the texts have been compiled, namely in a "principled" versus an "opportunistic" manner. In other words, corpus status is granted to text collections which have been compiled so as to be as representative a sample as possible of the population they are meant to model. Since the textual content freely accessible on the WWW has not been compiled in a principled manner, we might already at this point deny the WWW corpus status. However, the WWW, or rather subsets of the textual content accessible through the WWW, could be called representative in the sense that, given their sheer volume, they will, in their entirety, be more representative than any manually constructed corpus could ever hope to be. Although textual content on the WWW, strictly speaking, does not qualify as a corpus because no planning went into its design, it does share the following properties with genuine corpora (marked in grey in figure 8). It is a

• dynamic

• multilingual

• diachronic

• written
• fulltext

collection of text. Probably the key feature which sets the WWW apart from most other text collections (and corpora) is its dynamic character. It is the dynamic nature of the WWW which makes it impossible to document how much text, and what kind of text, is actually accessible in the entire collection. URLs constantly appear, disappear and reappear, and although projects like the Internet Archive19 attempt to take snapshots of parts of the web at regular intervals, these snapshots are only small samples of all accessible text on the WWW. In this sense the WWW can be characterized as neither truly synchronic nor truly diachronic (although marked as such in figure 8): it is not truly synchronic because lots of ageing web documents remain accessible, and it is not completely diachronic because most web pages either get updated regularly or deleted. In fact, the WWW can be interpreted as the species of corpora called "monitor corpora" by John Sinclair in 1991 [Sinclair, 1991, pp24-25], although he had probably not envisaged as anarchistic a text source as the present-day WWW. As for the dimension of "thematic scope" in figure 8, the WWW in its entirety represents a collection of text which could be called "general", but through focused querying one can indirectly access the infinite range of specialized text clusters making up this whole. When querying the WWW for very specific terms, like the antipsychotic drug "haloperidol" (see section 5.5), it is presumably mainly the specialized text clusters which return hits, as laymen are unlikely to discuss the properties of haloperidol in their web logs. Also, the case of using the technical synonym for "vomiting", namely "emesis", appears to focus the query towards the specialized text clusters (see subsections 5.5.3 and 5.5.4).

19 www.archive.org

In short, querying the WWW has become the main source of knowledge and information for millions of users around the world. Even if the textual content on the WWW has not been placed there in a principled manner, it has come to be used as a de facto corpus in NLP tasks in recent years. The single most important benefit of using the web as a corpus is that it solves the long-standing problem of data sparseness. The use of the WWW in computational linguistics was ushered in by the introduction to the 2003 special issue of the journal Computational Linguistics [Kilgarriff and Grefenstette, 2003], and recently Web as Corpus (WAC) workshops have appeared at established conferences like those of the European Association for Computational Linguistics20. In sum, the paradigm seems to be shifting from optimizing recall on minute data collections towards optimizing precision on vast collections of data.

Language is so expressive that it is practically impossible for the patterns learned from a relatively small training set to cover all the different ways of describing events. Consequently, the IE patterns learned from manually annotated training sets typically represent only a subset of the IE patterns that could be useful for the task. [Patwardhan and Riloff, 2006]

In the task of augmenting terminological ontologies with new semantic relation instances, it is a great help to rely on the WWW for discovering KPs and subsequently for finding new instances with these KPs. First of all, as the WWW covers any conceivable domain of interest, it should be possible to achieve system portability. Secondly, since the WWW features the largest text collection on the planet, using it should minimize data sparseness problems. Especially for highly specialized, perhaps even domain-specific, relation types, learning and applying knowledge patterns without using the WWW will often involve a costly and time-consuming manual compilation of a specialized corpus, and this endeavour might still be thwarted by data sparseness. Not only is the WWW a vast data source, it also contains even the newest bits of knowledge and is freely available. As is discussed in subsection 5.2.2, however, it may also contain incorrect or old knowledge. The former is clearly undesirable, but the latter may be relevant in a terminological context. Finally, the WWW is a potential source of knowledge on any conceivable domain and in a wide range of languages. In fact, much recent research in automatic relation extraction illustrates that the web is already being used as a corpus to solve this task (see table 5 in section 3.4). Of course, it is not unproblematic to make use of an unprincipled text collection as if it were a corpus, or indeed a specialized corpus. The main disadvantage of searching for knowledge on the entire WWW is that it is a noisy source of knowledge. This noise, of course, negatively affects the performance of the system implemented and evaluated in this thesis, but, as argued above, the advantages of relying only on the WWW outweigh the disadvantages. All this is to say that the performance of relation extraction systems operating on tidy collections of research papers will be better (in terms of precision) than that of systems operating exclusively on the WWW. For example, automatic term recognition is easier when based on domain-specific text than on text covering all conceivable domains, because there is less polysemy in domain-specific text than on the WWW as such.

20 http://eacl06.itc.it/workshops/workshop.htm

For example, querying the WWW for the keyword "virus" will yield biological viruses, computer viruses and a range of metaphorical viruses, but in a corpus of biomedical research papers virtually all occurrences of "virus" will refer to the biological concept. A few additional challenges affecting WAC research are described in subsection 3.1.2.

3.1.2 Problems with Web as Corpus

This subsection introduces and discusses four factors which may constitute problems for language studies based on (or driven by) the Web as Corpus (WAC) approach. The list is by no means exhaustive, but nevertheless treats some of the more conspicuous and typical problem areas.

Structure Possibly the main challenge facing applications which treat the web as a corpus in language study of any kind is its lack of structure. The text accessible through the WWW represents virtually every conceivable text type, genre and language. Practically all stylistic levels of language use are represented, and it is often impossible to verify information about the author of a particular text. Indeed, the success of projects like the web-based encyclopedia Wikipedia is largely attributable to the practice of having multiple authors write and edit the same text. While the multilingual nature of the WWW was listed as an advantage, it may also be a curse in disguise. Even if information about the language and national origin of individual web pages is indexed by most web search engines, this information is not completely reliable, as non-nationals may acquire national domain names, for example. The retrieval of textual content in non-target languages may thus be caused by homographs, loan words and so on.

Dynamicity Secondly, it can be very difficult to assess the exact date when a particular text was produced. Although most web search engines allow the user to define certain date ranges as part of their queries, the dates are often unreliable because they refer not to a document's date of genesis but to the date at which the latest change in the document was observed and indexed by a web crawler. However, as the so-called RSS feeds21 gain popularity on the WWW, web content is becoming richly annotated with more reliable meta-data such as the date of publishing. Unfortunately, RSS feeds are mostly used for news services and web logs of a predominantly general language rather than terminological character. Extracting bits of knowledge from the WWW at large thus means that relations constituting both old and state-of-the-art knowledge are likely to be retrieved. However, both old and state-of-the-art knowledge can, in fact, be useful to terminologists building concept systems (more on this in subsection 5.2.2).

21 Really Simple Syndication

Document formats Thirdly, web pages are typically adorned with a variety of metalanguage mark-up like HTML, XML, JavaScript and so on. Although such mark-up can fairly easily be stripped from a document, and in fact is stripped automatically when using search engine APIs like Google's or Yahoo's to download text snippets, other sources of noise remain. One such remaining task is translating HTML entities back into regular characters (see appendix 8.8).
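In Python, for instance, this translation is a one-liner with standard library support (a minimal illustration; the snippet text is invented):

```python
# Translate HTML entities in a downloaded text snippet back into
# regular characters using the standard library.
from html import unescape

snippet = "selenium &amp; vitamin E may prevent prostate cancer &ndash; study"
print(unescape(snippet))
# -> selenium & vitamin E may prevent prostate cancer – study
```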

Content duplication Fourthly, the WWW is rife with duplication. Retrieving multiple identical text snippets may constitute a bias both when discovering KPs (see section 4.1) and when testing the system and ranking instances by their reliability (see chapter 5). For these reasons the Google API "filter" flag has been employed whenever possible. According to the API documentation22, this filter is supposed to eliminate near-duplicate content as well as multiple results coming from the same Web host. In spite of these web search engine API settings, some measure of duplicate content could not be avoided in the collections of text snippets used in WWW2REL. However, there are two reasons why this is unlikely to have biased the results to any significant extent. Firstly, a filtering technique called "iteration range" is used when filtering KPs in section 4.2; it looks beyond simple KP frequency and requires all candidate patterns to occur with a wide range of different term pairs to be considered valid. Secondly, an instance reliability measure referred to as "KP range" is used when ranking relation instances in chapter 5; it also looks beyond the instance frequency itself and focuses on the range of different KPs with which a candidate instance co-occurs, as sketched below.
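The intuition behind "KP range" can be stated in a few lines of code (a simplified sketch with invented data; the actual measure is defined in chapter 5):

```python
# Rank candidate relation instances by the number of *distinct* knowledge
# patterns attesting them, so that duplicated snippets repeating a single
# pattern cannot inflate an instance's score.
from collections import defaultdict

# (instance, pattern) observations harvested from snippets; invented data.
observations = [
    (("aspirin", "headache"), "X may prevent Y"),
    (("aspirin", "headache"), "X is used to prevent Y"),
    (("aspirin", "headache"), "X may prevent Y"),   # duplicate snippet
    (("selenium", "cancer"), "X may prevent Y"),
]

kp_range = defaultdict(set)
for instance, pattern in observations:
    kp_range[instance].add(pattern)

for instance, patterns in sorted(kp_range.items(),
                                 key=lambda kv: len(kv[1]), reverse=True):
    print(instance, "KP range =", len(patterns))
# ('aspirin', 'headache') scores 2 and outranks the single-pattern pair.
```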

Distribution and copyright An additional challenge with the WAC approach to language study is "how to make the corpus available to other researchers" [Sharoff, 2006, p453]. This is mainly a problem when compiling full text corpora from the WWW, which cannot be distributed without the consent of the person or organization holding the copyrights. [Sharoff, 2006] discusses how storing and distributing only a long list of URLs pointing to the (freely available) textual content would allow other researchers to regenerate the web corpus on their own computers without violating copyright regulations. However, many URLs will be so-called "deep links" pointing to interior parts of the target websites, and distributing them may be deemed illegal (cf. the Danish "Newsbooster" trial). Even if legality is not an issue, collections of URLs age rapidly due to the dynamicity of the WWW, so this distribution strategy may also prove suboptimal. As for the textual data on which the experiments presented in this thesis are carried out, the author will take the liberty of providing the reader with a URL pointing to the complete text archives23. This should not violate any copyrights, as the web corpora are not based on full texts, but on minute text snippets each containing at most one or two sentence fragments.

22 See http://code.google.com/apis/soapsearch/reference.html
23 They can be found at http://www.halskov.net/phd

3.2 Information Retrieval and Information Extraction

Information Retrieval (IR) is the science of searching for documents which are relevant in a particular context. WWW2REL searches not for individual documents, but for fragments of knowledge occurring in a wide range of documents (across the entire WWW). In this sense what the system performs is closer to Information Extraction (IE) than to IR. IE is about finding events, but also "specific facts about prespecified types of entities and relationships of interest" [Spasic et al., 2005]. As mentioned in the thesis introduction, however, IE differs from AKA in that it primarily targets conceptual extension rather than intension, and it is the latter which is the research object of terminologists and thus the focal point of WWW2REL.

IR (and the wider field of Machine Learning) features methods for measuring the performance of the document retrieval process which can also be used to measure the performance of relation extraction systems like WWW2REL, and for this reason a cursory description of the IR evaluation metrics is given in this section. The two basic performance measures are "precision" and "recall". They are defined as

\[ \mathrm{precision} = \frac{|\{\mathrm{documents}_{\mathrm{relevant}}\} \cap \{\mathrm{documents}_{\mathrm{retrieved}}\}|}{|\{\mathrm{documents}_{\mathrm{retrieved}}\}|} \]

\[ \mathrm{recall} = \frac{|\{\mathrm{documents}_{\mathrm{relevant}}\} \cap \{\mathrm{documents}_{\mathrm{retrieved}}\}|}{|\{\mathrm{documents}_{\mathrm{relevant}}\}|} \]

In other words, the precision of an IR system is the proportion of relevant documents in the set of retrieved documents, and its recall is given by the proportion of relevant documents retrieved out of the total number of relevant documents which could possibly be retrieved. Needless to say, it is easy to optimize either quality parameter on its own, but to be really useful IR systems need to find the right balance between high precision and high recall. This balance is given by the so-called F-measure:

\[ F = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \]

In technical terms, F is the weighted harmonic mean of precision and recall. The formulae given here will be applied multiple times in this thesis, for example in chapter 5 when measuring the performance of various relevance ranking algorithms against a gold standard provided by four domain experts. The only formal difference when applying these IR evaluation metrics to the task of relation instance extraction is that "documents" should be exchanged for "semantic relation instances".
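Applied to relation extraction, the computation reduces to set intersection over instances; a minimal sketch with invented gold-standard data:

```python
# Precision, recall and F-measure over extracted relation instances,
# evaluated against a gold standard (toy data for illustration).
gold      = {("aspirin", "headache"), ("selenium", "cancer"), ("opioid", "pain")}
extracted = {("aspirin", "headache"), ("selenium", "cancer"), ("virus", "fatigue")}

true_positives = gold & extracted
precision = len(true_positives) / len(extracted)
recall = len(true_positives) / len(gold)
f_score = 2 * precision * recall / (precision + recall)

print(f"P={precision:.2f} R={recall:.2f} F={f_score:.2f}")
# -> P=0.67 R=0.67 F=0.67
```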

3.3 Text, data and web mining

The terms "text mining", and its synonym "text data mining", are often used to describe applications which operate on natural language text to solve a number of specific tasks. This section will discuss what exactly differentiates the field of text mining from corpus linguistics (or more generally computational linguistics), data mining, web mining, Information Retrieval (IR) and other related, empirically founded methodologies.

Table 4: Data mining, text mining, IR and computational linguistics

                               textual data                 non-textual data
    Finding patterns           computational linguistics    standard data mining
    Finding novel nuggets      real TDM                     ?
    Finding non-novel nuggets  IR                           database queries

Text mining is differentiated from both Information Retrieval (IR) and text summarisation (TS) in that while IR and TS focus on the largest units of text such as documents, text mining operates at a finer level of granularity [my emphasis] and examines relationships between specific kinds of information contained both within and between documents. Text mining is also differentiated from full-blown natural language processing (NLP) in that NLP attempts to understand the meaning of the text as a whole, while text mining and knowledge extraction concentrate on solving a specific problem [my emphasis] in a specific domain identified a priori (possibly using some NLP techniques in the process). [Cohen and Hersh, 2005, p58]

While the difference between text mining and IR is a matter of the scale of analysis, the main difference between text mining and data mining is that in text mining pieces of information are extracted from unstructured text, whereas in data mining information is extracted from structured data. For this reason data mining is often referred to as Knowledge Discovery in Databases24. Also, the data which is analyzed in data mining is often non-textual. As for the distinction between Information Retrieval, computational linguistics and text/data mining, table 4, which is taken from [Hearst, 1999], is instructive because it clarifies the concepts using just three delimiting characteristics, namely the target and purpose of the analysis as well as the novelty of what is found. The acronym "TDM" in the table stands for text data mining. In the opinion of [Hearst, 1999], real text (data) mining, or TDM, should uncover previously unknown information from text, and as such it is comparable to the field of "hypothesis generation". Computational linguistics typically "just" automates and improves language analysis and synthesis for a variety of purposes, for example transforming text from one language into another, but does not discover new "nuggets" of knowledge, to use the terminology of [Hearst, 1999]. Nor does IR, but the difference between IR and computational linguistics is that IR processes (and returns) textual units in a qualitatively different manner than applications in computational linguistics. IR processes entire documents at a time to determine whether they contain relevant nuggets of knowledge for a specific user. Real TDM, on the other hand, tries to uncover entirely new nuggets from collections of documents which are often unrelated. Responding to the text mining typology in table 4, [Kroeze et al., 2003] raise the following criticism.

24 The ACM Special Interest Group SIGKDD organizes annual conferences in Knowledge Discovery and Data Mining

60 [...] the ’nuggets’ and searched-for ’needles’ already exist and they are already known by someone, and the problem is to locate them. Finding them cannot be regarded as novel information, in other words there is no such thing as novel information nuggets; it is a contradiction in terms. [...] Therefore, these two columns should be merged and the process can be called non-novel investigation. [Kroeze et al., 2003, p96]

The authors proceed to establish a new typology of data and text mining which basically distinguishes between

1. non-novel investigation = information retrieval

2. semi-novel investigation = knowledge discovery

3. novel investigation = knowledge creation

[Kroeze et al., 2003] are right that standard text mining, including what [Hearst, 1999] calls "real TDM", procures no new knowledge, in that the nuggets or patterns already exist in the text. However, there is a degree of novelty in finding these nuggets in various text sources, grouping them and displaying them in a structured manner to a user who may not have been aware of their existence. This is exactly what WWW2REL attempts in the following chapters. Finding, or rather creating, truly new knowledge (called "intelligent text mining" by [Kroeze et al., 2003]) has traditionally been the domain of human beings, and to accomplish this task computer systems will need to make use of advanced techniques from Artificial Intelligence. Finally, table 4 does not mention the field of "web mining", but this is simply a slightly more inclusive term which embraces both text mining and data mining, since patterns can be found in all sorts of data on the WWW, only some of which is text. In conclusion, what is attempted by WWW2REL, then, is a semi-novel investigation of text on the WWW to discover nuggets of specialized knowledge which are new only in the sense that they may not be recorded in the target ontology.

3.3.1 Text mining for the biomedical domain

Although Biomedicine is simply a convenient case study for the test and evaluation of WWW2REL, examining recent developments in the popular field of biomedical text mining will provide a reference point for comparisons. However, it must be stressed that the system presented in this thesis is not custom-tailored for the biomedical domain (due to portability considerations), as are many of the applications discussed in the following. [Cohen and Hersh, 2005] provide an excellent survey of recent research in biomedical text mining and identify the following prominent areas.

1. Named entity recognition (NER)

2. Text classification

3. Extracting synonyms and abbreviations
4. Relationship extraction
5. Hypothesis generation

Since relationship extraction is the topic of the thesis, the other branches of biomedical text mining will be ignored. Among the pattern-based approaches to biomedical relationship extraction, [Cohen and Hersh, 2005] distinguish between those which use patterns manually generated by domain experts (e.g. [Yu et al., 2002]), those which induce patterns automatically from term pair contexts (e.g. [Yu and Agichtein, 2003]), those relying mainly on collocational statistics and those making heavy use of NLP methods like parsing. In this classification, WWW2REL is a bit of a mix in that it induces patterns automatically, but also features basic NLP methods (tagging and chunking) and simple statistics (frequency statistics from a general language reference corpus). It is evident from the survey in [Cohen and Hersh, 2005] that practically all applications have been trained and tested only on collections of medical papers or on MEDLINE, which is the most comprehensive bibliographic database of biomedical journal citations and abstracts in the world25. All records in MEDLINE are indexed with the National Library of Medicine's controlled vocabulary known as MeSH (Medical Subject Headings), which contains some 23,000 descriptors related by parent/child relations and cross references. "The MeSH thesaurus is used by the NLM for indexing articles from 4,600 biomedical journals for the MEDLINE/PubMed database." [Ananiadou and McNaught, 2006, p54]. MeSH, in fact, is one of the source vocabularies providing concepts and relations for the UMLS Metathesaurus described in subsection 2.3.4. While the methodology for extending terminological ontologies using the WWW (chapter 4) could be applied to any conceivable domain, one might ask if the approach is useful in the context of Biomedicine, which already has these vast and widely used terminological resources.

While manual curation and indexing can be an aid to researchers searching for appropriate literature, a recent study of the information content of MEDLINE records by Kostoff et al. found a significant amount of conceptual information present only in the abstract field and missing from the MeSH terms. This is not surprising since the MEDLINE indexers and the MeSH vocabulary, while broadly based, cannot be expected to represent all of the concepts of interest for all potential users. Clearly, the full text of biomedical literature contains a wealth of information important to users that may not be completely captured by reviewers and curators. [Cohen and Hersh, 2005, p58]

The above quote suggests that going beyond MEDLINE and mining biomedical literature in a wider sense is indeed a useful endeavour. Automatically extracting semantic relation instances from free text on the WWW only takes this idea a bit further, namely away from mining neat collections of meticulously compiled but rapidly ageing biomedical papers to mining the largest and most dynamic collection of text on the planet.

25 See www.nlm.nih.gov/pubs/factsheets/medline.html

In fact, recent research in biomedical text mining already seems to be heading in this direction; see for example [Mukherjea and Sahay, 2006], to be discussed in section 3.4.

Table 5: Performance of pattern-based relation extraction systems

    relation types                performance    publication
    multiple, also non-ISA        P: 49%-85%     [Pantel and Pennacchiotti, 2006]
    location;organization         P: 88%-96%     [Agichtein and Gravano, 2000]
    ISA, "related classes"        P: 84%-100%    [Popescu et al., 2004, Etzioni et al., 2004]
    any                           P: 51%-56%     [Turney, 2006]
    country;capital and 9 more    P: 14%-96%     [Alfonseca et al., 2006a, Alfonseca et al., 2006b]
    UMLS: causes, diagnoses ..    P: 75%-100%    [Mukherjea and Sahay, 2006]
    gene-protein                  P: 65%         [Gaizauskas et al., 2003]
    gene-protein synonyms         P: 71%-90%     [Yu et al., 2002]
    "semantically related"        F: 37%-68%     [Nenadic and Ananiadou, 2006]
    PART_OF                       P: 55%         [Charniak and Berland, 1999]
    causal                        P: 66%         [Girju and Moldovan, 2002]

3.4 Pattern-based relation extraction systems

This section provides a non-exhaustive survey of existing applications using the pattern-based approach to automatically acquire knowledge in the form of semantic relation instances from text. Table 5 lists eleven such systems along with the relation types they extract and the performance reported in the respective publications. The performance scores (P for precision and F for F-score) pertain to the correctness of the relation instances returned by the systems. As indicated in the table, some systems focus on relation types which are specific to the domain of Biomedicine, whereas other systems handle more domain-independent relations (e.g. ISA, PART_OF, causality etc.). As observed in the following quote, one cannot directly compare the performance of systems operating on different data and solving different tasks.

Unfortunately, at this time precise comparative evaluation of existing IE systems developed for the biomedical domain cannot be made, since the tasks and text collections addressed by researchers vary widely. [Gaizauskas et al., 2003]

Although this is true, it is still worthwhile to examine a wide range of relation extraction systems because this will elucidate how they differ and also pinpoint possible unique features of the WWW2REL system (see table 7). When comparing the reported performance scores in table 5 with each other and with the precision scores reported in this thesis (chapter 5), it should be stressed that extracting relation instances from non-noisy domain-specific text in which terms have already been manually annotated with MeSH descriptors is an easier task than finding relation instances in unannotated and

uncategorized WWW text snippets. On the other hand, the relation types extracted by many systems custom-tailored for Biomedicine are often somewhat more specific26 than those extracted by systems operating on general language text collections. However, this does not necessarily make the automatic extraction a harder task, because it allows the developers to make use of highly domain-specific techniques, as described in subsections 3.4.1 and 3.4.6, for example. The two systems presented in [Turney, 2006] and [Nenadic and Ananiadou, 2006] differ from the others in that the relation type to be extracted is not fixed in advance. For this reason these systems will be left out of the survey. Also, it should be noted that the precision scores listed for the system developed by [Turney, 2006] pertain to two rather complicated tasks, namely solving word analogy questions and classifying implicit semantic relations in noun-modifier pairs. In both cases explicit patterns are induced empirically from free text, but they are not used to extract new relation instances, but rather to classify relations between existing instances. In this way [Turney, 2006] performs relationship recognition rather than relation instance extraction, or "role extraction" as it is called in subsection 2.1.1. Out of the eleven pattern-based relation extraction systems listed in table 5, the two systems developed by [Mukherjea and Sahay, 2006] and [Pantel and Pennacchiotti, 2006] are most similar to WWW2REL. The following subsections will briefly discuss how the systems work and how they differ from each other and from WWW2REL. The survey begins with a brief review of two systems which are less similar to WWW2REL, but which illustrate the point raised above about custom-tailoring to a specific domain.

3.4.1 SGPE

The SGPE system developed by [Yu et al., 2002] extracts synonymous gene and protein names from MEDLINE abstracts and full-text medical journal articles. In contrast to most of the extraction systems discussed in this section, the synonymy KPs used in SGPE are identified manually by the system developers and then used to retrieve sets of candidate synonyms from the abstracts or articles. The authors define two types of biomedical synonyms, namely synonymy between short and long forms and synonymy between single word short forms. They argue that the first type of synonymy is easier to detect automatically than the second, because it essentially is a mapping between full forms and their acronyms. The work presented in [Yu et al., 2002] thus only considers the second type of synonymy.

SGPE is an interesting case because it uses a range of filters to eliminate terms which are not genes or proteins from the sets of candidate synonyms. Two of the filters are similar in nature to the BNC-based termhood filter implemented in this thesis and described in subsection 5.3, but the other filters are highly domain-specific and would reduce system portability if reproduced. The domain-independent filters are a dictionary of units (sec, min etc.) and an unnamed dictionary of common English words. Candidate synonym sets are deleted if two-thirds or more of the terms are common English words. The domain-specific filters feature the following rules, as outlined in

26 e.g. the roles of amino acid residues in protein molecules [Gaizauskas et al., 2003]

[Yu et al., 2002].

• delete set if any single term has more than six letters

• delete set if any term contains two or more dashes

• delete candidate if it is listed two or more times in the same abstract or article

The first two rules are clearly specific to Biomedicine, or rather to gene and protein names. Including such rules would lower system portability and be against the objectives outlined in the thesis introduction. The last rule is based on the assumption that “most of the authors introduce a synonym only once in their abstracts” [Yu et al., 2002], and will not be useful in the context of WWW2REL, which operates on text snippets that are typically only one or two sentences long.

3.4.2 Snowball

Although it is an IE system focusing on conceptual extension rather than intension, the “Snowball” system developed by [Agichtein and Gravano, 2000] makes use of techniques which are also relevant in the task of terminological relation extraction. Snowball is based on the DIPRE technique described in subsection 2.1.2, but the strength of the system is that in each iteration of the algorithm it assigns a confidence score to each of the induced patterns to better control the expansion phase. The confidence of a pattern, P, is defined as

$$Conf(P) = \frac{P.positive}{P.positive + P.negative}$$

where P.positive is the number of positive matches for P, and P.negative is the number of negative matches. An <organization, location> pair is considered negative if there is, by the current iteration, a high confidence pair with the same organization but a different location. Analogously, an <organization, location> pair is considered positive if the exact same pair has previously been detected with high confidence. At each iteration Snowball then recomputes the confidence of the extracted pairs and keeps only those for which it is most confident.

There are two reasons why the Snowball methodology is not well suited for augmenting terminological ontologies in a framework like WWW2REL, which is not meant to be custom-tailored for a particular domain. As mentioned in subsection 2.1.3, Snowball relies on a named-entity tagger when detecting new <organization, location> pairs in documents, and for many domains NE taggers simply do not exist because they typically require large amounts of manually annotated data for training. Secondly, while it is the case that most organizations have a single headquarters, the relations examined in this thesis behave quite differently in terms of cardinality. For example, drugs typically have multiple side effects and exhaustive lists of these are not easy to come by.
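To make the confidence computation concrete, the following Python fragment re-implements it as a minimal sketch; the data structures, in particular the trusted_locations map and the tuple format, are illustrative stand-ins and not Snowball's actual internals.

import math

def pattern_confidence(matches, trusted_locations):
    """Snowball-style confidence for one induced pattern.

    matches: list of (organization, location) tuples extracted by the pattern.
    trusted_locations: organization -> location pairs already accepted with
    high confidence in earlier iterations.
    """
    positive = negative = 0
    for org, loc in matches:
        if org not in trusted_locations:
            continue  # unknown organization: neither positive nor negative
        if trusted_locations[org] == loc:
            positive += 1  # exact pair seen before with high confidence
        else:
            negative += 1  # same organization, conflicting location
    if positive + negative == 0:
        return 0.0
    return positive / (positive + negative)

# Invented example: one confirming pair, one conflicting pair, one unknown.
trusted = {"Microsoft": "Redmond", "Boeing": "Seattle"}
extracted = [("Microsoft", "Redmond"), ("Boeing", "Chicago"), ("Intel", "Santa Clara")]
print(pattern_confidence(extracted, trusted))  # 0.5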

Table 6: System test (from [Mukherjea and Sahay, 2006])

relation     semantic type (example input term)           Precision
causes       Disease (Typhoid)                            0.82
diagnoses    Anatomical Abnormality (Cyst)                1.00
consists of  Organic Chemical (Butane)                    0.75
affects      Gene (Statin)                                0.80
binds        Amino Acid, Peptide or Protein (Rhodopsin)   0.83

3.4.3 RelationAnnotator

The system developed by [Mukherjea and Sahay, 2006] resembles WWW2REL in that its knowledge source is also WWW text snippets, and it is also tested on the domain of Biomedicine by using UMLS terms. In fact, it appears to be one of the only relation extraction systems for Biomedicine operating directly on the WWW.

The authors first employ their system to classify a range of biomedical terms. They do this by looking for ISA relations between a set of 100 UMLS terms and a set of 10 common semantic types or classes in the UMLS, for example “vitamin”, “protein” and so on. In the absence of domain experts, they scramble the terms and classes to obtain an equal number of positive and negative examples. Using a set of handcrafted KPs, they then retrieve the total Google hit count for each term-KP-class triplet when inserting all patterns in the KP position. Using a minimum threshold value of 25, they report a precision score of 87.5% in this classification task.

Secondly, the authors have implemented a RelationAnnotator, which they test for the five relation types listed in table 6 combined with a range of UMLS terms representing five semantic types as input. Supporting the RelationAnnotator is the BioAnnotator, a named entity recognizer which conflates term variants in the text snippets which are known to represent the same UMLS concept. The relation instances are ranked by their total snippet co-occurrence with all KPs used for the target relation type, and average precision scores (evaluated by the authors themselves) are also given in table 6. Precision is computed as “the number of entities for which at least one relation that was identified by our system is correct [divided by] the number of entities for which at least one relation was identified by our system” [Mukherjea and Sahay, 2006]. Although there are many similarities between the Relation/Bio Annotator and WWW2REL, the two approaches differ in the following ways.

1. WWW2REL induces all KPs automatically using term pairs from the starting ontology (no handcrafting)

2. It makes no use of biomedical Named Entity Recognition (NER).

3. It features a range of different instance reliability measures (section 5.3)

4. It is evaluated by four domain experts rather than the developer himself

5. Its portability is tested by applying it to another domain (section 6.5)

6. The degree of “new” knowledge extracted by WWW2REL is assessed (section 6.4)

Although biomedical NER would presumably increase precision, this technique would only apply to the domain of Biomedicine and thus reduce the overall portability of WWW2REL. Also, NER may not be helpful in the task of identifying new relation instances not recorded in the available terminological resources.

3.4.4 Espresso

[Pantel and Pennacchiotti, 2006] report on a relation extraction system named “Espresso” which is in some ways more similar to WWW2REL than the RelationAnnotator. Unlike the system developed by [Mukherjea and Sahay, 2006], Espresso is meant to be a general purpose relation extraction system, and its KP arsenal is induced from textual corpora. KPs are induced by extracting all substrings containing a seed term pair and filtered by assessing their reliability. The reliability of each KP, r(p), is computed as “its average strength of association across each input instance i in I’, weighted by the reliability of each instance i” [Pantel and Pennacchiotti, 2006].

$$r(p) = \frac{\sum_{i \in I'} \frac{pmi(i,p)}{max_{pmi}} \cdot r(i)}{|I'|}$$

where max_pmi is the maximum pmi of all instances with all patterns, |I'| is the number of different relation instances co-occurring with the pattern, p, and pmi(i,p) is the pointwise mutual information of a particular instance and a particular pattern. pmi(i,p) is given by the probabilities of the two events.

$$pmi(i,p) = \log \frac{P(i,p)}{P(i) \cdot P(p)}$$

In other words, it is the ratio between the number of times i and p actually co-occur (the numerator) and the number of times they could be expected to co-occur (the denominator). The pointwise mutual information of a relation instance and a pattern can be estimated using Google hit counts as follows.

$$pmi(i,p) \approx \frac{C_{google}(t_1, p, t_2)}{C_{google}(t_1, *, t_2) \cdot C_{google}(*, p, *)}$$

where t1 and t2 are the two terms forming the relation instance, for example “aspirin * bleeding”, and “*” is a word wildcard matching one or more complete words. Espresso not only filters the induced KPs by their reliability, it also filters the extracted relation instances by their assessed reliability.
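The two quantities can be sketched in a few lines of Python; the function names, the dictionary-based bookkeeping and the example hit counts are illustrative assumptions, not Espresso's actual implementation.

def pmi_estimate(hits_t1_p_t2, hits_t1_any_t2, hits_any_p_any):
    """Hit-count approximation of pmi(i,p) shown above.

    hits_t1_p_t2:   hit count for the exact phrase "t1 p t2"
    hits_t1_any_t2: hit count for "t1 * t2", e.g. "aspirin * bleeding"
    hits_any_p_any: hit count for "* p *", i.e. the pattern alone
    Note: pmi proper is the log of this ratio; log is monotone, so either
    value yields the same ranking of candidates.
    """
    return hits_t1_p_t2 / (hits_t1_any_t2 * hits_any_p_any)

def pattern_reliability(p, instances, pmi, r_inst, max_pmi):
    """r(p): pmi-weighted average of the instance reliabilities over I'.

    pmi is a dict {(i, p): pmi value}; r_inst maps each instance i to r(i).
    """
    return sum((pmi[(i, p)] / max_pmi) * r_inst[i] for i in instances) / len(instances)

# Invented hit counts: 120 for "aspirin induces bleeding", 5,000 for
# "aspirin * bleeding", 2,000,000 for the pattern "* induces *".
score = pmi_estimate(120, 5_000, 2_000_000)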

Estimating the reliability of an instance is similar to estimating the reliability of a pattern. Intuitively, a reliable instance is one that is highly associated with as many reliable patterns as possible (i.e., we have more confidence in an instance when multiple reliable patterns instantiate it.) [Pantel and Pennacchiotti, 2006]

The formula for computing the reliability of an instance i, r(i), is thus completely analogous to that for computing the reliability of a pattern, p.

$$r(i) = \frac{\sum_{p \in P} \frac{pmi(i,p)}{max_{pmi}} \cdot r(p)}{|P|}$$

Finally, Espresso features a unique technique for utilizing “generic” KPs, i.e. patterns with a high recall but low precision, to boost system recall without lowering system precision. The noise generated by generic KPs is simply filtered out by means of WWW hit counts and the assumption that instances extracted by generic KPs will also be extracted by at least one of the system’s reliable KPs. Again, WWW2REL is similar to Espresso in many ways, but differs in that

1. With WWW2REL, KPs are induced from a noisy source (text snippets on the WWW)

2. WWW2REL is tested on the same noisy knowledge source, not an offline corpus like TREC-9

3. It is not based on an iterative algorithm

4. It uses a binary rather than a continuous measure of KP reliability, r(p)

(a) r(p) is set to 1 for all KPs accepted by a combination filter (see table 33 in subsection 4.2.5)

(b) r(p) is set to 0 for all other KPs

5. Its instance ranking schemes operate on a web corpus sample and no further WWW querying is necessary

6. It does not try to harness the recall power of generic KPs

The main reason WWW2REL is not based on an iterative algorithm (like Snowball and Espresso) is that when using the WWW to induce KPs and extract relation instances, data sparseness is less of an issue. If more term pair contexts are needed, the number of text snippets returned by the web search engine can simply be increased instead of going through one or more DIPRE expansions. However, if the number of seed term pairs is very low, or the pairs co-occur extremely rarely on the WWW, the WWW2REL KP discovery module could easily be made iterative.

Setting r(p) to 1 for all filtered KPs is motivated by the fact that effective KP filtering techniques (see section 4.2) eliminate low precision patterns, and also, implementing a BNC-based termhood filter (subsection 5.3.1) is expected to further increase precision. As for the final difference, when searching on the entire WWW, boosting recall is rarely necessary, but boosting precision is all-important.

3.4.5 KnowItAll

The KnowItAll system is a purely web-based IE system originally described in [Etzioni et al., 2004] and extended in [Popescu et al., 2004]. It consists of “an extensible ontology and a small number of generic rule templates from which it creates text extraction rules for each class and relation in its ontology” [Etzioni et al., 2004]. The generic rule templates are similar to the ones described in [Hearst, 1992], for example “NP1 such as NPList2” for the ISA relation. Filling out one of the argument slots (for example “cities” for NP1), the templates are converted into “discriminator phrases” and fed to an Extractor module which retrieves candidate relation instances from web documents using a range of search engines and a PoS tagger. The candidates are then processed by an Assessor module which determines the probability of each instance. The probabilities are computed by treating the hit counts of the discriminator phrases as conditionally independent features of the relation in question. These are combined by means of a naive Bayesian classifier, and KnowItAll estimates probabilities by bootstrapping positive and negative instances from the web.

In contrast to WWW2REL, the objective of KnowItAll is to extract hundreds, or even thousands, of instances of specific input classes. In other words, it focuses on conceptual extension rather than intension, and on building taxonomies rather than also compiling non-hierarchical relation instances which may provide terminologists with the distinctive features they need for their definitions. Although KnowItAll has been extended with a module for finding related classes [Popescu et al., 2004], the exact nature of the relation is not specified, in contrast to most other systems described in this section, including WWW2REL. Finally, the system also differs from WWW2REL in that the KPs, or rule templates, are manually devised rather than induced.
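The Assessor's probability combination can be illustrated with a minimal naive Bayes sketch under the stated independence assumption; the two-discriminator setup and all numbers below are invented for illustration and do not come from KnowItAll itself.

def naive_bayes_assess(prior, feature_probs):
    """Combine conditionally independent discriminator features.

    feature_probs: list of (P(f|class), P(f|not class)) pairs, one per
    discriminator phrase whose hit-count statistic was observed.
    Returns P(class | features) under the naive independence assumption.
    """
    p_yes, p_no = prior, 1.0 - prior
    for p_f_given_yes, p_f_given_no in feature_probs:
        p_yes *= p_f_given_yes
        p_no *= p_f_given_no
    return p_yes / (p_yes + p_no)

# Two discriminators fired for a candidate such as "Paris is a city".
print(naive_bayes_assess(0.5, [(0.8, 0.1), (0.6, 0.2)]))  # 0.96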

3.4.6 PASTA

PASTA [Gaizauskas et al., 2003] is an acronym for Protein Active Site Template Acquisition. It is a biomedical IE system which extracts two relation types, called “template relations” by the authors, and three so-called “template elements” from biomedical abstracts. The two relation types are IN_PROTEIN and IN_SPECIES, and the three elements are RESIDUE, PROTEIN and SPECIES. The system features the following four stages, or levels, of processing.

1. Text processing

(a) section analysis

(b) tokenization

(c) sentence splitting

2. Terminological processing

(a) morphological analysis (protein-specific affixes)

(b) lexical lookup (biological lexicons)

(c) terminology parsing

3. Syntactic and semantic processing

(a) PoS tagging

(b) phrase tagging

(c) predicate-argument representation

4. Discourse processing

(a) ontology-based inference

(b) coreference resolution

(c) extension of existing ontology

Like SGPE (see subsection 3.4.1), PASTA is custom-tailored for the biomedical domain, or rather the subdomain of protein structures, because it makes use of a protein lexicon in the terminology processing phase to identify protein elements in the abstracts. The system has been applied to a corpus of 1,513 MEDLINE abstracts on macromolecular structures, and in the task of template (i.e. relation) extraction it achieves an overall precision of 65%.

3.4.7 [Alfonseca et al., 2006b]

[Alfonseca et al., 2006b, Alfonseca et al., 2006a] describe how so-called “rote extractors” can be applied to unannotated text to learn KPs instantiating any type of semantic relation. A rote extractor estimates the probability of a relation r(p,q) between two entities, p and q, on the basis of the surrounding context C1pC2qC3. The probability of the relation given a specific context can be calculated as the number of times the two related elements r(x,y) appear in this context C1xC2yC3, divided by the total number of times that x appears in the context with any other word.

$$P(r(p,q) \mid C_1 x C_2 y C_3) = \frac{\sum_{x,y \in r} c(C_1 x C_2 y C_3)}{\sum_{x,z} c(C_1 x C_2 z C_3)}$$

where x is known as the hook and y is called the target. The basic extraction procedure is as follows.

1. Download a “target corpus” by querying the WWW for a number of seed term pairs

2. Extract seed contexts and identify recurrent patterns

3. Download a “hook corpus”

4. Apply patterns from 2) to the hook corpus and extract new pairs

5. Compute pattern precision using the above formula

6. Repeat
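Step 5 reduces to a simple counting exercise once the hook corpus has been processed. A minimal Python sketch, assuming the extracted pairs and the known relation pairs are available as tuples; the birth-year example data are invented for illustration.

def pattern_precision(context_matches, relation_pairs):
    """Precision of one generalized context C1 x C2 y C3.

    context_matches: (hook, target) pairs the context extracted from the
    hook corpus.
    relation_pairs: set of (hook, target) pairs known to instantiate the
    relation (the "target cell" for each hook).
    """
    correct = sum(1 for pair in context_matches if pair in relation_pairs)
    return correct / len(context_matches) if context_matches else 0.0

# Hypothetical example for a birth-year pattern "X was born in Y".
known = {("Mozart", "1756"), ("Einstein", "1879")}
extracted = [("Mozart", "1756"), ("Mozart", "Salzburg"), ("Einstein", "1879")]
print(pattern_precision(extracted, known))  # approx. 0.67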

The authors address two weaknesses in this approach, namely that the patterns are inflexible and may be ambiguous (i.e. extract instantiations of different relation types). To overcome the first limitation, the patterns are generalized by enriching the texts with PoS and NER tags, allowing wild cards and computing string edit distances. To resolve the ambiguity issue, the authors build a table in which the rows are different hooks, the columns are different relation types and the cells are sets of relation instances, possibly the empty set. Pattern precision is then computed by measuring to what extent the individual patterns retrieve instances, or “targets”, not recorded in the target cell. Precision rates are reported not for the individual patterns but for each of ten different relation types, and they range from 14% (birth-place) to 96% (death-year).

Again, introducing NER may boost precision, but it reduces system portability, and for this reason WWW2REL does not rely on NER. Also, assessing KP precision in the fashion of [Alfonseca et al., 2006b, Alfonseca et al., 2006a] is not possible when extracting instances of complex many-to-many relations like “induces” and “may_prevent” (chapter 5) for which no exhaustive gold standard can be obtained. Hence the need for a manual evaluation of KP precision. Further, WWW2REL does not download hook corpora, but hook+{KP} corpora, where {KP} is the set of all filtered patterns learned for the target relation type. This is done to measure the precision of each KP based on a manual evaluation (see section 6.3). Finally, because WWW2REL operates on sentence fragments rather than complete documents, it ignores the left and right contexts when discovering KPs.

3.4.8 [Charniak and Berland, 1999]

[Charniak and Berland, 1999] report on a system which can extract PART_OF relations from a 100 million word newspaper corpus (the North American News Corpus) with a precision of 55%. The approach is inductive and domain-independent. Five different patterns retrieve a number of candidate parts, and these are then filtered on morphological and statistical grounds. Candidates containing suffixes suggesting qualities (e.g. -ing or -ness) are deleted, and association strength is measured using log-likelihood ratios between the wholes and the parts (see subsection 4.1.2 for the formula). Even using a corpus of 100 million words, the authors list data sparseness as a very real problem.

3.4.9 [Girju and Moldovan, 2002]

The relation extraction system described in [Girju and Moldovan, 2002] is interesting because it targets the causal relation type, which embraces the two UMLS relations investigated in chapter 4 of this thesis. The authors distinguish between simple causatives (e.g. generate), resultative causatives (e.g. kill) and instrumental causatives (e.g. poison). They focus on intra-sentential syntactic patterns of the form “NP1 verb NP2” in which the verb is a simple causative. The causal markers are induced by extracting verbs which connect word pairs related by the CAUSE_TO relation in WordNet. While the patterns are induced automatically, their filtering is carried out by manual means. The filtering strategy involves semantic constraints on the NPs and the verbs; for example, the most general WordNet subsumer of the effect argument must be either a human action, phenomenon, state, psychological feature or event. Another constraint is that

verbs with a high number of WordNet senses are penalized. Evaluating the system against the judgments of two humans, an average precision of 66% is reported on a set of 300 relation instances, of which 230 are causative.

3.4.10 [Nenadic and Ananiadou, 2006]

The system developed by [Nenadic and Ananiadou, 2006] is similar to KnowItAll (subsection 3.4.5) in that it extracts “semantically related terms” rather than relation instances of a predefined relation type. On the other hand, it is similar to SGPE and PASTA in that it extracts these unlabelled relationships from biomedical MEDLINE abstracts. More specifically, it operates on a richly annotated and term tagged subset of 2,000 MEDLINE abstracts known as the Genia corpus27. The system identifies the following three types of term similarities.

1. Lexical term similarities

2. Syntactic term similarities

3. Contextual term similarities

Lexical term similarities are NP internal relations between head and modifier(s) and will not be discussed, given the topic of the thesis. The difference between syntactic and contextual term similarity is rather subtle. While syntactic similarity is identified by means of Hearst-like patterns (i.e. simple strings like “such as”, “e.g.”), contextual similarity is established through more generalized patterns not unlike those described in subsection 3.4.7. These patterns are called “context patterns” and include lemma and PoS tags. As a measure of term relatedness, “the distance between two terms is calculated as the mean of the sum of distances (the number of edges) of their respective classes from the nearest common ancestor in the Genia ontology” [Nenadic and Ananiadou, 2006]. Combining the three similarity measures, the authors achieve F-scores of 68% for semantically related terms (distance ≤ 3) and 37% for highly related terms (distance ≤ 1). They conclude that syntactic patterns have a high precision but low recall, while the opposite is the case for context patterns. However, as discussed in subsection 3.1.1, using the web as a corpus can be a way of overcoming this inherent limitation of syntactic patterns, or KPs.

The main difference between WWW2REL and the system developed by [Nenadic and Ananiadou, 2006] is that the text WWW2REL operates on is neither term tagged nor annotated with ontological information. Another key difference is that WWW2REL extracts instantiations of specific relation types rather than unlabelled associative relations. Finally, only the “context patterns” in [Nenadic and Ananiadou, 2006] are induced, whereas the syntactic patterns are compiled from the literature.
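The quoted distance measure is straightforward to compute once a parent map over the ontology classes is available. A minimal sketch, assuming a rooted, tree-shaped ontology; the class names in the toy fragment are hypothetical, not actual Genia classes.

def ancestors(node, parent):
    """Chain of ancestors from node up to the ontology root."""
    chain = []
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain

def term_distance(c1, c2, parent):
    """Mean number of edges from the two classes to their nearest common ancestor."""
    up1 = [c1] + ancestors(c1, parent)
    up2 = [c2] + ancestors(c2, parent)
    common = next(a for a in up1 if a in up2)  # exists because the tree is rooted
    return (up1.index(common) + up2.index(common)) / 2

# Toy fragment of a Genia-like ontology (invented class names).
parent = {"protein_molecule": "protein", "protein_family": "protein",
          "protein": "substance", "DNA": "substance"}
print(term_distance("protein_molecule", "protein_family", parent))  # 1.0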

3.4.11 System comparison

Table 7 summarizes the differences and similarities of the pattern-based relation extraction systems outlined in this section as compared with the WWW2REL system implemented in this thesis.

27 see www-tsujii.is.s.u-tokyo.ac.jp/~genia

Table 7: Comparison of pattern-based relation extraction systems

system/subsection   portability           KP ind.  text source  non-hier.  iterative
Espresso            high                  yes      TREC-9       yes        yes
Snowball            low (NER)             yes      newspapers   yes        yes
SGPE                low (Bio. filters)    no       MEDLINE      no         no
PASTA               low (Bio. lexicons)   yes      MEDLINE      yes        yes
RelationAnnotator   low (Bio. NER)        no       WWW snip.    yes        no
KnowItAll           high                  no       WWW docs     no         yes
3.4.10              low (Bio. ontology)   yes/no   MEDLINE      yes        no
3.4.7               low (NER)             no       WWW docs     yes        yes
3.4.8               high                  yes      newspapers   no         no
3.4.9               low (WordNet)         yes      TREC-9       yes        no
WWW2REL             high                  yes      WWW snip.    yes        no

The column “portability” indicates whether each system relies on domain-dependent techniques or resources, and if so which ones. The “non-hier.” column indicates whether each system also extracts non-hierarchical relation types, and “KP ind.” indicates whether KPs are induced from text or produced introspectively. While most systems induce their patterns automatically, few systems are truly portable (only Espresso, KnowItAll and [Charniak and Berland, 1999]) and few systems operate on noisy WWW text fragments (only RelationAnnotator, KnowItAll and [Alfonseca et al., 2006b]). In comparison with the systems described in this survey, WWW2REL is special in that it is both portable and inductive, works for any conceivable relation type and uses the web as a corpus. The three most similar systems either operate on less noisy text (Espresso), have low portability (RelationAnnotator) or focus on IE tasks (KnowItAll).

3.5 Conclusion

This chapter outlined the methodological considerations which have a bearing on the experiments which are to follow in chapters 4 and 5. It discussed the strengths and weaknesses of using the WWW as if it were a corpus. Secondly, it described the related fields of text mining, data mining, web mining, Information Retrieval (IR) and Information Extraction (IE) and concluded that the system implemented in this thesis is a text mining system, although the metrics by which it is evaluated originate from IR and Machine Learning. As for the novelty of the semantic relations retrieved by the system, it was concluded that these are only new in the sense that they are recovered from diverse sources and presented in a structured manner to the user, who may not have been aware of their existence and may find them useful for updating his termbase. Finally, the chapter made references to a range of existing pattern-based relation extraction systems, and it was concluded that WWW2REL differs from most existing applications by combining the following four features.

• It operates exclusively on fragmented textual content on the WWW.

• It induces KPs automatically from unannotated text relying on filtering techniques described in section 4.2.

• It can be applied to any conceivable relation type.

• It is conceived as a domain-independent application to ensure portability (this is tested in section 6.5).

Chapter 4 will now describe the first step in the WWW2REL ontology extension framework, namely discovering and filtering KPs for the target relation(s).

4 Pattern discovery and filtering

This chapter and chapter 5 describe the implementation and evaluation of WWW2REL, a highly portable relation extraction system operating exclusively on text snippets found on the WWW. As the system is inspired by the DIPRE technique (subsection 2.1.2), it only makes use of information which already exists in the specific ontology it aims to augment. Apart from the BNC (which is used as an optional termhood filter and could be exchanged for the freely available Google ngram counts28), all information used by the system is freely accessible on the WWW. In fact, figure 9 illustrates how the WWW serves multiple purposes in the system implementation. Thus all arrows marked by “WWW” indicate that text snippets from the WWW are used at these points to either discover or filter KPs, retrieve relation instances or even rank these by assessed reliability. The dotted lines in the figure indicate phases which are only mandatory during system evaluation, but which could be omitted when subsequently running the evaluated system.

The figure also provides a roadmap to most of the remaining sections of the thesis. The starting point is marked by the filled circle named “ontology”, representing subsection 4.1.1. Subsection 4.1.2 discusses how the selection of the target relation type(s) may be guided by corpus analysis, for example in the absence of domain experts. Subsection 4.1.3 describes issues related to the selection of term pairs instantiating the target relation type(s). The topic of subsection 4.1.4 is the retrieval from the WWW of a training corpus of text snippets containing these pairs in context. Finally, subsection 4.1.5 discusses how pattern candidates can be extracted from such a training corpus. While section 4.2 introduces various techniques which can be used to filter out noisy patterns, section 5.1 outlines the actual implementation of a system using these filtered knowledge patterns to extract relation instances from the WWW. Section 5.2 introduces the manual system evaluation setup and discusses issues related to such an evaluation. In section 5.3 a number of ranking schemes and heuristics are devised, and sections 5.4 and 5.5 report in great detail on the performance of these schemes and heuristics in eleven actual system tests involving the four relation types for which the system is tested. Finally, section 6.4 explores the degree of new knowledge retrieved by WWW2REL and also assesses recall versus the starting ontology (the UMLS Metathesaurus).

28 See www.ldc.upenn.edu/Catalog/

Figure 9: The role of the WWW in the system implementation

Figure 10: WWW2REL diagram: discovering KPs and extracting relation instances from the WWW

Figure 10 also summarizes the contents of chapters 4 and 5, but in terms of system input and output. It illustrates the two steps of KP discovery (top diagram) and relation instance extraction (bottom diagram). In the first step, system input is four sets of term pairs and output is four reduced sets of KPs. In the second step, system input is four sets of KPs combined with a number of input terms (see table 38), and output is four sets of ranked relation instances.

4.1 KP discovery

This section describes the methodology used to discover knowledge patterns in WWW text snippets based on a number of seed term pairs extracted from the target ontology. The KPs instantiate four different semantic relation types, namely two classical relations (synonymy and ISA) and two arguably less universal relations (“induces” and “may_prevent”). The section starts off by analyzing to what extent “induces” and “may_prevent” relations indeed seem to be particularly characteristic of biomedical text (subsection 4.1.2) and proceeds to examine a number of practical issues complicating the selection of seed term pairs from the UMLS Metathesaurus (subsection 4.1.3), the retrieval of term pair contexts (subsection 4.1.4) and the identification of KP candidates from these contexts (subsection 4.1.5).

4.1.1 Start with a known ontology

The UMLS knowledge sources are an ideal starting point for the construction of our relation extraction system for two very simple reasons. First, they are provided free of charge by the US National Library of Medicine. Secondly, they form the most comprehensive knowledge representation system for the domain of Biomedicine, a domain characterized by such fast-paced concept formation that semi-automatic tools are needed to keep track of its terminology. For more details on the UMLS, the reader is referred to subsection 2.3.4.

4.1.2 Select target relation(s)

Having selected an ontology, the next step is the selection of a number of semantic relation types for which new instances should be identified by the system. A fundamental question at this point is: which are the most important semantic relation types for the target domain, i.e. Biomedicine in this case? Although asking domain experts may provide a quick answer, these may not be at hand or may not have given much thought to the ontological characteristics of their domain.

Over the decades, practical terminology work has revealed the two conceptual relations, ISA and PART_OF, along with the semantic relation of synonymy, to be domain-independent. However, any domain will typically feature a number of additional conceptual relations which are, if not unique to the domain, then at least characteristic of it. Since VPs typically establish semantic relations between nominal arguments and can be used as knowledge probes in terminology (see e.g. [Christensen, 2002]), one way of empirically getting an indication as to which relation types are the most important ones for a particular domain is to compare the frequencies of all VPs in a comprehensive

corpus of text specific to the domain in question versus the frequencies of the same VPs in a balanced general language corpus. As described in section 3.1, the 100 M word British National Corpus29 (BNC) is a commonly used general language corpus.

Table 8: A contingency table of observed and expected frequencies

observed frequencies              expected frequencies
          y=VP  y≠VP
x=BioMed  Oa    Ob    R1          Ea = R1*C1/N    Eb = R1*C2/N
x=BNC     Oc    Od    R2          Ec = R2*C1/N    Ed = R2*C2/N
          C1    C2    N

This subsection presents two experiments carried out in order to examine to what degree the two UMLS relation types, “induces” and “may_prevent”, actually appear to be instantiated in a large biomedical corpus, namely the 90 M word BioMed Central open access full-text corpus30. The first step in the experiments is to extract all VPs from the text corpora and record their frequencies. The VPs are extracted from lemmatized, POS tagged31 and chunked32 corpora using the following CQL33 search template.

[chunk=’.-VP’]+ [chunk=’.-PP’]?

This template will extract even complex VPs like

being/being/B-VP possibly/possibly/I-VP influenced/influence/I-VP by/by/B-PP

where the positional attributes are wordform/lemma/chunk. Having extracted two sets of VPs (one for the analysis corpus, BioMed, and one for the reference corpus, BNC), it is time to consider relevant statistical measures. A number of statistical measures have been used to assess the degree of association between two lexical items (e.g. finding collocations) or between one lexical item and two corpora (e.g. term recognition). The approaches are typically based on a so-called contingency table (cf. table 8) but differ with regard to the importance attached to rare events34, i.e. rare words or phrases. To illustrate how association measures based on simple contingency tables can help identify VPs (and thus to some extent semantic relation types) which are characteristic of a domain-specific corpus, this subsection will apply the following two association measures to the BioMed corpus, using the BNC as reference corpus. The first association measure is the so-called log-odds ratio, which is given by:

29 see www.natcorp.ox.ac.uk for details
30 www.biomedcentral.com/info/about/datamining/
31 www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
32 http://chasen.org/~taku/software/yamcha/
33 Corpus Query Language - see www.ims.uni-stuttgart.de/projekte/CorpusWorkbench
34 see [Evert, 2004] and www.collocations.de for details

$$log\text{-}odds\text{-}ratio(VP) = \log \left( \frac{O_a \cdot O_d}{O_c \cdot O_b} \right)$$

where the variables Oa, Ob, Oc, Od are the observed frequencies as defined in table 8. The log-odds ratio35 ignores the expected frequencies (defined in the same table) and thus emphasizes the importance of the observed ones, implying a bias towards rare events. The second association measure is the log-likelihood ratio, given by

$$log\text{-}likelihood\text{-}ratio(VP) = 2 \cdot \sum_{i=a}^{d} O_i \cdot \log \left( \frac{O_i}{E_i} \right)$$

Since the log-likelihood ratio36 takes the expected frequencies into consideration, it provides a more conservative estimate of the association between a particular VP and the two corpora. For both association measures, minimum VP frequencies of 100 in the biomedical corpus and 1 in the BNC were enforced so as to prevent overly specialized verbs, typos and so on from cluttering the lists.

Tables 9 and 10 show the top 20 (and bottom 5) VPs in the BioMed corpus as ranked by their degree of association with this corpus versus the BNC (as measured by log-likelihood and log-odds ratio, respectively). When comparing the two lists of VPs most characteristic of the BioMed corpus, it is quite clear that the log-odds ratio (table 10) is useful in finding the most peculiar VPs which are relatively rare even in the biomedical corpus but hardly occur at all in the BNC. For example, the verb “lyse” (“be lysed in”) is indeed one such peculiar VP which is unique to the domain of Biomedicine or Biology. However, since the high-ranking VPs based on log-odds ratios on average only occur a few hundred times in the 90 M word BioMed corpus, this association measure appears to be a less appropriate tool than log-likelihood when trying to identify the most significant semantic relation types in a domain. It is interesting, though, that most of the VPs in this table are in the passive voice, but this is, of course, a general feature of academic writing rather than of Biomedicine as such. Finally, the two VPs, “can be downloaded from” and “can be accessed”, probably only rank high due to the age of the BNC.

The VPs in table 9, on the other hand, are much more common. There are quite a few verbs of communication and existence37, which are characteristic not of Biomedicine but of academic writing in general. One way of eliminating these academic VPs from the lists would be to compare VP frequencies in the biomedical corpus with their frequencies in a whole range of other specialized but non-biomedical corpora. However, such a comprehensive experiment would fall outside the scope of this thesis, and the Academic Word List discussed in subsection 2.2.2 might, in fact, be a useful filter. What is important in this pilot experiment is that six38 out of the top twenty VPs can, in fact, be categorized as causal (these are the VPs capitalized in table 9).
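Both association measures follow directly from the contingency table in table 8 and can be sketched in a few lines of Python. Two assumptions are made for illustration: the log-likelihood score is given a negative sign when the VP is under-represented in BioMed (matching the bottom rows of table 9), and the corpus totals in the example call are invented placeholders, so the printed score will not reproduce table 9 exactly.

import math

def log_odds_ratio(o_a, o_b, o_c, o_d):
    """Log-odds ratio from the observed frequencies Oa, Ob, Oc, Od."""
    return math.log((o_a * o_d) / (o_c * o_b))

def log_likelihood_ratio(o_a, o_b, o_c, o_d):
    """Log-likelihood ratio; expected frequencies come from the marginals."""
    n = o_a + o_b + o_c + o_d
    r1, r2 = o_a + o_b, o_c + o_d
    c1, c2 = o_a + o_c, o_b + o_d
    observed = [o_a, o_b, o_c, o_d]
    expected = [r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n]
    ll = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
    return ll if o_a >= expected[0] else -ll  # sign marks direction of association

# "induce": 37,999 occurrences in BioMed, 844 in the BNC; the VP token
# totals below are assumed values, not the actual corpus counts.
biomed_total, bnc_total = 9_000_000, 10_000_000
print(log_likelihood_ratio(37_999, biomed_total - 37_999, 844, bnc_total - 844))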

35 See perl script in appendix 8.3
36 See perl script in appendix 8.2
37 e.g. “show”, “indicate”, “suggest”, “demonstrate”, “report”
38 “induce”, “inhibit”, “result in”, “increase”, “decrease”, “treat with”

Table 9: Characteristic VPs in the BioMed corpus ranked by log-likelihood versus the BNC

f(biomed)  f(BNC)     LL        lemmatized VP
111,800    38,245     52,040    use
75,911     28,469     31,700    show
37,999     7,175      28,951    indicate
42,631     12,360     23,481    contain
37,999     844        21,454    INDUCE
17,087     1,014      20,297    compare to
43,312     17,498     16,416    suggest
13,405     827        15,780    participate in
14,222     1,254      15,190    be associate with
17,837     2,796      15,086    demonstrate
9,099      2          14,278    distribute under
10,054     308        13,417    encode
10,118     561        12,200    INHIBIT
15,921     3,197      11,668    base on
14,252     2,679      10,887    perform
13,695     2,614      10,364    RESULT IN
21,774     7,418      10,176    INCREASE
8,460      575        9,726     DECREASE
7,479      281        9,689     TREAT WITH
20,960     7,244      9,639     report
...
2,119      90,915     -93,662   think
13,492     134,041    -93,962   do
3,310      103,993    -102,113  know
2,143      236,395    -266,490  say
715,189    1,810,867  -319,581  be

Table 10: Characteristic VPs in the BioMed corpus ranked by log-odds versus the BNC

f(biomed)  f(BNC)   LO     lemmatized VP
9,099      2        8.60   distribute under
709        1        6.74   compare to control
620        1        6.61   be stimulate with
965        2        6.36   harbor
4,555      12       6.12   be permit in
364        1        6.08   be downloaded from
346        1        6.03   be highly express in
337        1        6.00   be quantify use
308        1        5.91   silence in
301        1        5.89   be extract use
292        1        5.86   be overexpressed in
546        2        5.79   be conduct use
4,127      16       5.73   can be access
510        2        5.72   profile of
499        2        5.70   be evaluate use
247        1        5.69   be lyse in
231        1        5.62   generate use
226        1        5.60   be evaluate with
598        3        5.47   be detect use
180        1        5.37   be transiently transfected with
...
531        35,924   -4.04  look
118        9,597    -4.22  buy
124        10,211   -4.23  watch
2,143      236,395  -4.53  say
109        26,145   -5.30  have get

Three of the VPs are likely to instantiate the “induces” relation (“result in”, “increase” and “induce”), and three will typically instantiate the “may_prevent” relation (“inhibit”, “decrease” and “treat with”). This would seem to corroborate the fact that causal relations indeed play an important role in Biomedicine. Incidentally, this claim is also backed by [Girju and Moldovan, 2002], who observe in a study of causality markers that 58% of all NP pairs linked by causal verbs in WordNet 1.7 are tagged as belonging to the domain of Medicine.

In conclusion, looking at frequent, domain-specific verbs can help us identify which relation types are important to a particular domain, but any statistically induced list should, of course, be supplemented by knowledge of the actual domain. For Biomedicine, it is a well-known fact that all new drugs have to be tested for potential side effects, and these must constantly be monitored and registered. Moreover, copycat drugs introducing additional side effects are also becoming an increasing problem. Thus the UMLS “induces/induced_by” relation type is clearly a very important one in the biomedical domain. Since the purpose of most drugs is to prevent or reduce the severity of various pathological conditions, the “may_prevent” relation type is also important. The statistical association measures for VPs in a biomedical corpus seemed to align fairly well with the actual ontological design of the UMLS.

In the KP discovery experiments which follow, we will focus on term pairs instantiating the two causal relations just mentioned, the terminologically fundamental ISA relation and synonymy. In other words, KPs will be discovered for

• ISA

• induces

• may_prevent

• synonymy

4.1.3 Select term pairs instantiating these relation(s)

Before any actual web search engine queries can be executed and a training corpus can be built, the following steps must be completed.

1. Extract UMLS concept pairs39 from the database for the target relation(s)

2. Apply term variant expansion

3. Apply lexical filtering

4. Apply frequency filtering

39 or single concepts in the case of synonymy

Figure 11: Drug hyponyms from UMLS used to discover ISA KPs

Extracting concept pairs Initially, a completely random extraction of term pairs from the UMLS Metathesaurus was considered (see the script in appendix 8.4). However, for the ISA relation there are no less than 318,39140 concept pairs recorded in the database. Extracting all recorded term variants for each of these concepts to produce all possible term pair combinations would be rather time consuming and unnecessary. Thus it was convenient to enforce certain restrictions on the semantic types of the ISA arguments to reduce the size of the search space and make the setting more realistic.

One possible application of discovering knowledge patterns for the “may_prevent” and “induces” relations is to implement a system which can identify side effects and beneficial, intended effects given a drug, and vice versa. Accordingly, it seemed natural to restrict the extraction of concept pairs for the ISA relation to a fragment of the drug subontology, although any fragment of the UMLS would presumably have been equally adequate. Also, “Clinical Drug” happens to be the most frequent semantic type in both the “may_prevent” and “induces” relation (cf. subsection 5.2.4). Figure 11 visualizes a small ontological fragment containing frequent hyponyms of the UMLS concept “drug”. Thus the first step of KP discovery for the ISA relation is to extract concept pairs from the UMLS in which one argument is either a vitamin, a gastrointestinal agent, a cardiovascular agent or one of the subtypes of a central nervous system agent. This is done using the script in appendix 8.5. The numbers in the figure are the Concept Unique Identifiers, or CUIs, by which the concepts are indexed in the UMLS Metathesaurus, and the filled circle in the figure marks the category of drugs, namely antipsychotics, for which the ISA KPs will be tested. Of course, as is standard practice in Machine Learning, no antipsychotics are used when discovering KPs (neither by the system nor by the author!).

Term variant expansion Most concepts recorded in the UMLS are associated with 4-5 term variants, and when compiling a training corpus, term variance raises the ques- tion of whether to randomly select just a single term variant for each concept, or to

40 In the 2006AB edition of the UMLSKS

search for all possible term variants. Since some term variants are much more commonly used (e.g. “Vitamin C”) than others (e.g. “ascorbic acid”), but frequency information is not recorded in the UMLS, it was decided to produce all possible combinations including all term variants.

Table 11: Effect inducing drugs (examples)

Ingredient     Strength   Dose form
Phenylephrine  2.5 MG/ML  Nasal Spray
Cetirizine     5 MG       Oral Tablet

Lexical filtering However, in the case of Biomedicine some term variants are extremely specialized. For example, drug entities in the UMLS Metathesaurus are characterized by the following three components.

Active ingredient + Strength + Dose form

Two examples can be seen in table 11. Clearly, searching for exact strings containing all three elements will result in data sparseness at best. One way of reducing the level of detail would be to ask the UMLS database for the immediate hypernyms of each term. For example, the pair “phenylephrine 6 mg/ml oral solution <=> mydriasis” becomes “methylparaben <=> pupil disorders” when querying for the hypernyms of the term pair. The induces relation between the latter pair is not as clear as the one between the former pair; thus using hypernyms will not necessarily overcome the data sparseness issue, but will certainly introduce invalid relations. Another way of reducing the level of detail would be to use the UMLS relation “has_tradename” and query for the actual drug trade names. However, the active ingredients are likely to be more universally used than specific drug trade names, which go in and out of fashion. In conclusion, the best way of reducing the level of detail in instances of “induces” and “may_prevent” relations is to make use of the “has_ingredient” relation, because this will give us the first column of names in table 11 and ignore strengths and dose forms (see also appendix 8.6).

Lexical filtering may further normalize the term variants and boost recall. [Aronson, 2001] discusses the challenges involved in developing the MetaMap application, which maps term variants found in biomedical text to the UMLS Metathesaurus concepts they represent. He suggests a number of ways of conflating strings which essentially refer to the same term variant. The techniques listed below include most of them.

• Removing non-essential parentheticals

• Syntactic uninversion41

• Removing case variation

• Removing hyphen variation

41 for example “Acid, ascorbic” => “ascorbic acid”

Table 12: Example term variants for an induces/induced_by relation

concept1 variants          concept2 variants
Ascorbic Acid              Hyperoxaluria
Acid, Ascorbic             Oxaluria
L-Ascorbic Acid
Vitamin C
C Vitamin
ascorbic acid preparation

In the following experiments, all possible combinations of all term variants are produced, and the only lexical filtering applied to the variants consists of the four points listed above, plus the removal of duplicates and the deletion of punctuation marks, which are not recognized by web search engines.
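A minimal Python sketch of such a normalization step, assuming comma-based uninversion and simple regular expressions; the function names are illustrative and do not correspond to the thesis's Perl scripts.

import re

def normalize_variant(term):
    """Lexical filtering along the lines of the four points above."""
    term = re.sub(r"\s*\([^)]*\)", "", term)   # drop non-essential parentheticals
    if "," in term:                            # syntactic uninversion:
        head, _, mod = term.partition(",")     # "Acid, Ascorbic" -> "Ascorbic Acid"
        term = f"{mod.strip()} {head.strip()}"
    term = term.lower()                        # remove case variation
    term = term.replace("-", " ")              # remove hyphen variation
    term = re.sub(r"[^\w\s]", "", term)        # strip punctuation not indexed by search engines
    return re.sub(r"\s+", " ", term).strip()

def expand_and_filter(variants):
    """Deduplicated set of normalized variants for one concept."""
    return sorted({normalize_variant(v) for v in variants})

print(expand_and_filter(["Acid, Ascorbic", "L-Ascorbic Acid", "Vitamin C"]))
# ['ascorbic acid', 'l ascorbic acid', 'vitamin c']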

Frequency filtering Table 12 lists all the UMLS term variants for each of the concepts in an example “induces” relation. A practical problem is that many term variants (even lexically filtered ones) are too rare, and when combined as a pair in the search for candidate knowledge patterns, they will generate queries yielding zero hits on the search engines. A simple way of avoiding data sparseness problems is to select at random term variants which co-occur with a certain minimum frequency. An arbitrary threshold of 100 hits (on Google) has been set as a minimum co-occurrence frequency for all term pairs. The hit counts are obtained by the script in appendix 8.11.

Table 13 illustrates the impact of the lexical filtering and frequency filtering on the number of actual term variant pairs used for discovering ISA patterns. It also shows that using the WWW to expand a special ontology does not completely banish all data sparseness problems. With the exception of ISA relations having “Vitamin” as the hypernym, only a small fraction of the lexically filtered term pairs have Google co-occurrence frequencies exceeding 100 hits (see column f(Google)>100 in table 13). Overall, only about 4.7% of the lexically filtered term pairs met this co-occurrence frequency threshold, which was deemed necessary to get a sufficient number of snippets from which to reliably extract knowledge patterns.

For a hierarchical relation like ISA, one way of reducing the sparseness problem would be to simply use more general hypernyms. For example, replacing the hypernym in a pair with a more general one would likely yield more hits on Google. However, this strategy is not an option with non-hierarchical relations like “induces” and “may_prevent” (cf. the example with “phenylephrine” above). Since ISA is such a fundamental and domain independent relation, another approach to coping with data sparseness would be to find training pairs in other terminological resources outside the specialized ontology, but in the case of less universal relation types, training pairs must typically be found in the existing ontology one wishes to augment. In such cases, the data sparseness problem appears to affect even WWW-based approaches to KP discovery.

Finally, synonym training pairs were found by retrieving all UMLS concept pairs

Table 13: Effects of variant expansion, lexical and frequency filtering for ISA

ISA relation      #concept pairs  #variant pairs  #filtered pairs  f(Google)>100
X;GI Agent        64              2,492           1,055            7 (0.7%)
X;vitamin         54              1,400           375              211 (56.3%)
X;CV Agent        43              1,526           435              16 (3.8%)
X;Antidepressant  49              1,400           678              55 (8.1%)
X;Analgesic       85              4,944           2,616            31 (1.2%)
X;Anesthetic      29              960             464              6 (1.3%)
X;Anticonvulsant  171             7,502           2,600            95 (3.7%)
X;Antipyretic     10              210             87               13 (15%)
X;Antiparkinson   56              2,850           973              3 (0.3%)
TOTAL             507             23,284          9,283            437 (4.7%)

for a specific relation type, in this case “induces”, generating all term variants of each concept and randomly selecting pairs on the basis of their term type code (TTY). To be included, one term variant must have the term type “PN”, or preferred name, and the other variant must be explicitly marked as a synonym42. Also, the pair must co-occur at least 100 times on Google. Enforcing these restrictions and looking only at concepts which are arguments of the “induces” relation resulted in the list which can be seen in appendix 8.29.

Table 14: Lexically filtered, relatively frequent UMLS term pairs for KP discovery (examples)

INDUCES                         MAY_PREVENT
alcohol;unconsciousness         prilocaine;pain
alcohol;vomiting                flunisolide;asthma
...                             ...

SYNONYMY                        ISA
hypotension;low blood pressure  analgesic;gabapentin
pregnancy;gestation             valproic acid;anticonvulsants

4.1.4 Build a training corpus

Table 14 gives a few examples of the training pairs which will now be used to compile four web corpora, one for each of the four relation types. The pairs have been lexically filtered and co-occur at least 100 times on Google. The complete lists can be seen in appendices 8.27, 8.28, 8.29 and 8.30. In order to identify KPs for the four relation types, one needs a large number of actual contexts in which term pairs instantiating these relations co-occur. The simplest approach is to download a number of text

42 see the script in appendix 8.14 and the UMLS website for details.

Table 15: Knowledge patterns in context

left                                    term1           middle  term2      right
                                        diarrhea        ...     parasites  , some cancers ...
                                        diarrhea        ...     bacteria
...to minimize the stomach irritation   aspirin         ...
a                                       nicotinic acid  ...     flushing
contaminated water and the              diarrhea        ...                it

snippets for each of the training term pairs using Google43 queries like:

1. ISA: “ketamine * analgesic”

2. INDUCES: “carbon dioxide * headache”

3. MAY_PREVENT: “mineral oil * constipation”

4. SYNONYMY: “dyspnea * breathlessness”

in which the “*” is a word wildcard representing at least one word. Two search parameters are specified: matches must be “allintext”, and automatic results filtering is activated to weed out duplicate content and host crowding. The script in appendix 8.7 provides the details of the procedure. Although some training pairs co-occur thousands of times on Google, most pairs co-occur only hundreds of times. Consequently, it was decided to retrieve only the top 100 snippets (hits) for each training pair so as to make sure that all pairs are equally represented in the training corpora. The individual snippets typically contain only 1-3 sentences (on average some 22 tokens per snippet), but this is enough context for our purposes.
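The harvesting loop itself is simple once a search API is available. A minimal sketch, in which search is a hypothetical stand-in for whatever web search interface is used; the thesis experiments relied on Google queries via the Perl script in appendix 8.7, which is not reproduced here.

def build_wildcard_query(term1, term2):
    """Phrase query with a word wildcard between the two terms."""
    return f'"{term1} * {term2}"'

def harvest_snippets(pairs, search, max_snippets=100):
    """Retrieve up to max_snippets text snippets per training pair.

    `search` is assumed to take a query string and return a list of
    snippet strings (duplicate-filtered, "allintext" matches).
    """
    corpus = {}
    for term1, term2 in pairs:
        corpus[(term1, term2)] = search(build_wildcard_query(term1, term2))[:max_snippets]
    return corpus

# Usage with a dummy search function standing in for a real web search API.
fake_search = lambda q: [f"snippet matching {q}"]
print(harvest_snippets([("dyspnea", "breathlessness")], fake_search))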

4.1.5 Identify pattern candidates for the target relation(s)

From the small collections of term pairs in context, pattern candidates can now be extracted. The main question which remains is which part of the term contexts to examine. [Alfonseca et al., 2006b] defines a term pair co-occurrence as having three contexts: left context, middle context and right context. As table 15 illustrates, useful knowledge patterns can occur in all three contexts, although, intuitively and empirically, the middle contexts seem more useful than the left and right contexts44.

From our experiments with English-language documents, we have found the middle context to be the most indicative of the relationship between the elements of the tuple. [Agichtein and Gravano, 2000]

While [Alfonseca et al., 2006b, p54] propose a maximum distance of eight tokens between the two terms in each pair, the limit used in [Turney, 2006] is three tokens.

43 Google is used because Yahoo offers no word wildcards
44 Although it can be void in case of relative clauses

[Ravichandran and Hovy, 2002] and [Pantel and Pennacchiotti, 2006] use a suffix tree constructor, which finds the longest common substrings of all lengths within the entire sentence. Rather than specifying a specific distance in terms of number of tokens, it seems more intuitive to let the distance be either linguistically determined or indefinite. In two of the experiments which follow, the distance is linguistically determined (“may_prevent” and “induces”), and in the other two cases (ISA and synonymy) it is indefinite. It should be mentioned that there is an upper limit on the number of words matched by the Google word wildcard, “*”. Although the exact number is not documented, it appears to be about eight.

Finally, it should be noted that [Ravichandran and Hovy, 2002], for example, enforce a frequency threshold and discard low frequency pattern candidates. However, these patterns may, in fact, be very good, and thus in the experiments which follow, all pattern candidates are kept (as advocated in [Pantel and Pennacchiotti, 2006]). The following query templates are used to extract knowledge pattern candidates from the middle contexts of instances representing the four relation types.

1. “induces” and “may_prevent”

(a) <term1> ($dummy{0,2} $vp+ $pp?) $np* <term2>

(b) <term1> ($dummy{0,2} $vp+ $pp) $np* <term2>

2. synonymy

(a) <term1> .* <term2>

(b) <term2> .* <term1>

3. ISA

(a) <hyponym> .* <hypernym>

(b) <hypernym> .* <hyponym>

(c) <hyponym> .* <hypernyms>

(d) <hypernyms> .* <hyponym>

The reason more query templates are needed for the ISA relation than for synonymy is that ISA is not a symmetrical relation like synonymy. For synonymy, there are no particular interdependencies between the patterns and the linguistic form of their arguments. Also, changing the position of the two arguments typically has no impact on the choice of knowledge pattern. Synonymy KPs are, for the most part, bidirectional. For example, “emesis, i.e. vomiting” and “vomiting, i.e. emesis”45 are equally grammatical. For the ISA relation, however, most patterns are unidirectional, and switching the position of the arguments will typically also necessitate a change of pattern. The pattern “and other”, for example, requires a hyponym as

45 some low frequent patterns like “emesis ... vomiting” may, however, be unidirectional

a first argument and its hypernym as its second argument. Additionally, this particular pattern requires the second argument to be a noun in the plural.

As for the actual processing of the text snippets, these are first term annotated by the script in appendix 8.8 so that the actual terms are substituted by <term1> and <term2> tags. In the case of “induces” and “may_prevent”, the snippets are then part-of-speech tagged using the IMS TreeTagger46 and chunked using Yamcha47 before the actual KP extraction takes place. The script in appendix 8.9 has the details. For the synonymy and ISA snippets, KPs are extracted from the middle contexts using a simple regular expression.

Table 16: Query templates, training pairs and corpus sizes per relation type

relation type  #query templates  #training pairs  #snippets (#tokens)
induces        1                 40               4,054 (94,000)
may_prevent    1                 40               3,993 (91,000)
synonymy       2                 20 x 2 = 40      2,444 (56,000)
ISA            4                 40 x 4 = 160     9,519 (205,000)

Table 16 summarizes the corpus sizes and the number of training pairs and query templates employed for each of the four relation types. Data sparseness is the explanation why fewer than the target 100 text snippets per term pair are compiled for synonymy and the ISA relation. One reason is that the ontological fragment used to randomly select training pairs for the ISA relation (see figure 11) is a little too specialized. Another reason is that Google sometimes returns only 60 or 70 snippets due to duplicate content, even if a hit count of 100 was originally reported. However, this sparseness affected the 160 pairs in a balanced manner, so there should be no bias towards one specific template or specific types of pairs.
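For the synonymy and ISA snippets, the middle-context extraction amounts to a single regular expression over the term-annotated snippets. A minimal Python sketch, assuming the <term1>/<term2> tag format introduced above; the example snippets are invented.

import re
from collections import Counter

# Term-annotated snippets: the matched terms of a training pair have been
# replaced by <term1> and <term2> tags (tag format is an assumption).
snippets = [
    "<term1> can cause <term2> in rare cases.",
    "doctors agree that <term1> can cause <term2>.",
    "<term1> , which may induce <term2> ,",
]

middle = re.compile(r"<term1>\s*(.*?)\s*<term2>")
candidates = Counter(m.group(1) for s in snippets for m in middle.finditer(s))
for pattern, f in candidates.most_common():
    print(f, pattern)  # e.g. "2 can cause"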

Using linguistic information or not The query templates illustrate how linguistic information may be allowed to play a role in the identification of knowledge patterns. First of all, one may specify the linguistic form of the patterns by requiring that they contain a verb (this solution can be attractive for reasons explained below). Secondly, the linguistic form of the arguments may correlate with the form of the pattern, for example due to noun-verb agreement in number. The latter case is particularly relevant for the ISA relation. Allowing for positional and morphological flexibility, a total of four48 query templates are used to extract knowledge pattern candidates for the ISA relation. For the causal relations, the query template specifies that patterns should be continuous sequences of VP elements ($vp+) flanked by the term pair (<term1> and <term2>). The rationale for forcing a verb in the middle context is provided by studies like [Girju and Moldovan, 2002], [Barrière, 2001] and [Christensen, 2002, Christensen, 2005]. [Girju and Moldovan, 2002] use search templates to automatically

46 www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
47 http://chasen.org/~taku/software/yamcha/
48 to simplify the experiment, morphological flexibility is not allowed for the hyponym. The names of drugs, however, will typically not occur in the plural, so the limitation does not affect the performance of the system in this particular setting.

identify explicit intra-sentential syntactic patterns which instantiate simple causative relations in free text. [Barrière, 2001] presents a detailed manual study of causality patterns which reveals verbs to be the second most frequent category instantiating these patterns and, more importantly, verbs to be far and away the most efficient (in terms of precision) of four major word classes. Finally, [Christensen, 2002] has compiled a catalogue of Danish verbs which have proven to be effective knowledge probes in the detection of terms. While the advantage of implementing a linguistic filter is that even the less frequent, but precise, patterns are not ignored, there are two obvious disadvantages. First of all, a linguistic filter makes the system language dependent. Although accurate taggers and chunkers are available for most major languages, language independent applications driven purely by statistics are currently in vogue. Secondly, when enforcing a linguistic filter specifying particular morpho-syntactic KP structures one runs the risk of being too restrictive. To avoid this pitfall and ensure structural flexibility, the VP can maximally be preceded by any two tokens ($dummy{0,2}) and optionally be followed by a PP element ($pp?). Finally, by allowing any number of optional NP elements ($np*) immediately before the rightmost term, premodifiers as in “very severe stomach pains”, for example, will not reduce recall. Similarly, the optional dummy tokens make sure that constructions with adverbs, complementizers, parentheses and other punctuation are not ignored. Forcing a verb in the middle context for synonymy and ISA relations seems too restrictive (for example, the patterns “i.e.” and “such as” contain no verbs), so for these relations any number of tokens of any kind is allowed between the two terms.
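To make the causal template concrete, the sketch below renders it as a regular expression over chunk-annotated tokens. The “token/CHUNK” encoding is an assumption made for illustration only; the actual system operates on TreeTagger/Yamcha output (appendix 8.9), and the $dummy, $vp, $pp and $np elements are only approximated here.

    import re

    # <term1> $dummy{0,2} $vp+ $pp? $np* <term2>, over "token/CHUNK" units
    CAUSAL_TEMPLATE = re.compile(
        r"<term1>\s+"
        r"(?:\S+/\S+\s+){0,2}?"      # $dummy{0,2}: adverbs, punctuation, etc.
        r"((?:\S+/VP\s+)+)"          # $vp+: the verbal core of the pattern
        r"(?:\S+/PP\s+)?"            # $pp?: an optional preposition
        r"(?:\S+/NP\s+)*"            # $np*: premodifiers of the second term
        r"<term2>")

    tagged = "<term1> may/VP cause/VP very/NP severe/NP stomach/NP <term2>"
    m = CAUSAL_TEMPLATE.search(tagged)
    if m:
        print(" ".join(tok.split("/")[0] for tok in m.group(1).split()))
        # -> "may cause"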

4.1.6 Example patterns

Before discussing how imprecise KP candidates can be automatically filtered out, one might ask to what extent noise really poses a problem at this stage. In other words, it will be interesting to see examples of pattern candidates as ranked by simple frequency of occurrence in the text snippets (the column “f” in the tables). Examples of the top ten most frequent patterns extracted for synonymy, “induces” and “may_prevent” are listed in table 17, and the ten most frequent patterns for the four ISA templates are given in table 18. For all four relation types there are unfiltered patterns which intuitively seem reliable, for example “prevents”, “can cause”, “such as” and “i.e.”. However, the noise in both table 17 and table 18 is almost overwhelming. For example, “ann” and “j” would intuitively not be good synonymy markers. The prepositions “for”, “on” and “in” are unlikely to be reliable markers of “may_prevent” instances, “anti” and “intolerance” are probably equally imprecise patterns for extracting “induces” instances, and the table of candidate ISA markers contains a number of drugs. The patterns “k” and “d” are indeed a bit puzzling, but they can be explained by the fact that many of the training pairs for this particular ISA template involved the hypernym “vitamin”. When searching on Google for “vitamin * retinol”, for example, it is thus very likely that “k” and “d” succeed “vitamin” and form co-hyponyms of “retinol” rather than establishing any hypernymic links. Table 19 lists the top 10 pattern candidates for the two causal relations when using the search template from subsection 4.1.5, i.e. when requiring that the patterns contain a verb.

Table 17: Top 10 unfiltered patterns by frequency of occurrence in snippets

  f    synonymy   f   may_prevent        f    induces
  122  or         73  for                203  induced
  58   acute      71  and                45   causes
  22   chronic    60  resistant          42   induces
  16   recurrent  49  prevents           34   anti
  16   and        36  on                 32   intolerance
  14   i.e.       36  in                 29   can cause
  13   severe     35  for acute, severe  29   are fever
  13   ann        33  injection          25   non
  12   j          32  cream for          22   to induce
  11   called     24  prevents febrile   21   and

Table 18: Top 10 unfiltered ISA patterns by frequency of occurrence in snippets

  f   hyper_sing;hypo  f   hyper_plur;hypo
  89  drug             79  such as
  68  k                52  e.g.
  58  effect of        24  including
  42  drugs            21  and
  41  d                20  eg
  37  agent            19  phenobarbital
  31  sodium           19  especially
  29  effects of       18  carbamazepine
  23  cymbalta         17  phenytoin
  20  properties of    14  include

  f    hypo;hyper_sing       f    hypo;hyper_plur
  123  an                    101  and other
  48   is an                 34   tricyclic
  34   as an                 28   or other
  29   has no                27   other
  24   as a novel            26   as
  24   has                   21   and tricyclic
  14   pharmacokinetics and  16   and
  13   as adjunctive         10   as adjunctive
  11   as                    9    or

Table 19: Top 10 “induces” and “may_prevent” patterns containing a verb

  may_prevent    induces
  prevents       induces
  reduces        does not cause
  to prevent     can cause
  prevent        to induce
  in preventing  induced
  had            include
  prevented      to cause
  decreases      causes
  to treat       produces
  reduced        may cause

As hypothesized in section 1.4, enforcing this very simple formal restriction indeed has a conspicuous, positive effect on precision. Virtually all pattern candidates for the two relation types now appear to be highly reliable, even when just ranking them by their frequency of occurrence in the text snippets. In conclusion, it would appear that almost no further filtering is necessary when forcing a verb in the pattern search template for the causal relations, whereas further filtering is certainly required to eliminate noise when using completely unrestrictive search templates (ISA and synonymy). How such filtering can be executed is the topic of section 4.2.

4.2 KP filtering

This section introduces two additional techniques (beyond forcing a verb) which can be used to filter out noisy KP candidates automatically. The first relies on the use of a set of negative term pairs instantiating non-target relation types. This technique is relatively common in the literature (see for example [Etzioni et al., 2004, Popescu et al., 2004]). The second technique is the novel idea of using a byproduct of the ten-fold-validation tests, namely the ten sets of positive term pairs, to measure the “iteration range” of each KP candidate (see subsection 4.2.5 for details). The iteration range of a pattern is a number between 1 and 10, and this measure is, to the author’s knowledge, unique to this thesis. Finally, although the expression “precision” will be used in what follows, this should not be understood as the actual precision of the KPs, but only as a very crude approximation. Computing actual precision scores requires a manual evaluation of the performance of each KP. The analysis of such an evaluation is provided in section 6.3. [Pantel and Pennacchiotti, 2006] distinguish between generic and reliable patterns, i.e. broad-coverage noisy patterns49 and highly precise patterns with low recall50. Reliable patterns are those which, on average, occur much more often with valid term

49 for example, “the door the car” for meronymy
50 for example, “the door the car” for meronymy

pairs than invalid term pairs. When using the entire WWW as text source, optimizing precision is more important than optimizing recall, and thus the following experiments focus on identifying reliable patterns. The basic approach mirrors that of [Alfonseca et al., 2006b, Alfonseca et al., 2006a] described in subsection 3.4.7, but it does not consider the complete context, C1 x C2 y C3, of the related elements, x and y, but only the middle context, t1 KP t2, in which t1 and t2 are two terms which instantiate the target semantic relation, and KP is a candidate knowledge pattern for this relation type. In other words, the precision (or reliability) of each candidate knowledge pattern, KP, is approximated as follows.

prec(KP) \approx \frac{\sum_{\langle t_1;t_2\rangle \in R} C_{\mathrm{Google}}(t_1\,KP\,t_2)}{\sum_{\langle t_1;t_2\rangle \in R} C_{\mathrm{Google}}(t_1\,KP\,t_2) + \sum_{\langle t_1;t_2\rangle \in \neg R} C_{\mathrm{Google}}(t_1\,KP\,t_2)}

The approximated precision of a KP is thus the combined Google co-occurrence frequency of the KP with a set of term pairs instantiating the target relation, R, divided by the combined co-occurrence frequency of the KP with a set of pairs instantiating both target, R, and non-target relations, ¬R. To filter out low precision KPs for the four selected relations (ISA, “induces”, “may_prevent” and synonymy), the four sets of term pairs from subsection 4.1.3 can be used as positives. The selection of negative term pairs, however, is the topic of subsection 4.2.1.
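In code, the approximation amounts to nothing more than two summed hit counts. The sketch below assumes a hits callback standing in for the Google hit-count queries; the toy counts are invented for illustration.

    def kp_precision(kp, positives, negatives, hits):
        """Approximate prec(KP): hits with positive pairs divided by hits
        with positive plus negative pairs."""
        pos = sum(hits(t1, kp, t2) for t1, t2 in positives)
        neg = sum(hits(t1, kp, t2) for t1, t2 in negatives)
        return pos / (pos + neg) if pos + neg else 0.0

    counts = {("niacin", "induces", "flushing"): 120,   # invented numbers
              ("eyes", "induces", "pupils"): 2}
    hits = lambda t1, kp, t2: counts.get((t1, kp, t2), 0)
    print(kp_precision("induces", [("niacin", "flushing")],
                       [("eyes", "pupils")], hits))      # -> 0.9836...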

4.2.1 Selecting non-target relations

The simplest and most objective way of selecting term pairs instantiating non-target relations would be to extract term pairs at random from all of the 53 non-target UMLS relation types. However, the UMLS Metathesaurus is a huge database and also contains a great number of very specialized relation types, for example “scale_of” or “mechanism_of_action_of”. Term pairs instantiating such relations cannot be expected to occur often enough on the Internet to provide reliable precision measurements. Using pairs instantiating such obscure relations, it will remain unknown whether a Google co-occurrence frequency of zero means that the KP has a high precision, or whether the negative pair is simply too rare to ever co-occur with the KP as a natural language string outside the database. One way of overcoming this problem is to require that the co-occurrence frequency of the negative term pairs exceed a certain threshold (see below). Another, completely randomized, approach to the selection of negative term pairs would be to scramble the UMLS term pairs of the target relation type as described in [Mukherjea and Sahay, 2006], so that scrambled and non-scrambled “induces” instances, for example, are used in the precision measurements for this relation. There are two reasons why this technique may be suboptimal. Firstly, the technique is likely to produce many negatives which will never co-occur. For example, given the two positive instances “alcohol;unconsciousness” and “vitamin c;diarrhea”, the technique would produce the negative pair “vitamin c;unconsciousness”, which has zero co-occurrences on Google with an interpolated word wild card. Secondly, the technique is also likely to produce term pairs which, in fact, are positive. For

example, it may be the case that “alcohol;diarrhea” is correct even if this instance is not recorded in the UMLS Metathesaurus and thus could not be automatically removed from the set of negatives. There are additional ways in which the selection of negative pairs can have a big impact on the subsequent precision measurements. Two factors which may be important when selecting negative pairs are word order and morphology. If the form of the pattern candidates learned for a specific relation type depends on the position and grammatical number of the entities they connect, then the negative pairs should be selected so as to mirror this morphosyntactic environment. If not, it might be morphosyntactically impossible for a candidate pattern to occur with any of the negative pairs. Finally, not all relation types are equally easy to distinguish from one another. Experience has shown that synonymy and ISA relations, for example, can be quite hard to tell apart, not just for machines but even for human judges, including domain experts. Especially for very specialized concepts with rich intensions but narrow extensions, it can be difficult to establish whether one is indeed dealing with two distinct concepts (possibly a hypernym-hyponym pair) or with synonyms of the same concept. Subsection 5.5.5 provides several examples which illustrate the proximity of the synonymy and ISA relations in Biomedicine. The semantic proximity of the two relations is presumably also reflected in the knowledge patterns which instantiate them, and for this reason synonym pairs are presumably very useful negatives when assessing the precision of ISA patterns, and vice versa. In summary, the following points are important to consider when selecting non-target relation types, and thus negative term pairs, to compute the precision of KP candidates for a relation type R.

1. Include relation types which are semantically close to R.
2. Include fundamental relation types like ISA, PART_OF and synonymy.
3. Examine whether the KP form of instances of R is morphosyntactically sensitive.

(a) If yes, verify that the morphosyntax of the negative term pairs matches that of the positives.

4. Make sure that the negative term pairs co-occur relatively frequently in the test data (in this case on the WWW).

Out of the four relation types investigated in the thesis, only the KP form of ISA instances is morphosyntactically sensitive. For this reason the selection of negative term pairs for the ISA relation will be dealt with separately.

Induces, may_prevent and synonymy

Following the principles in the above list (points 1, 2 and 4), negative pairs for “induces” and “may_prevent” are provided by the two classical relations, ISA and PART_OF, and by the other causal relation (“may_prevent” for “induces” and vice versa), the two causal relations being semantically close. For synonymy, instances of both “induces” and “may_prevent” are selected as negatives, as well as ISA instances. More specifically, the negative pairs are selected by the following procedure.

Table 20: Negative term pairs for the “induces” relation

  Relation     term1;term2                    fGoogle(term1 * term2)
  ISA          lung diseases;cystic fibrosis  15,900
  ISA          cystic fibrosis;lung disease   32,700
  PART_OF      eyes;pupils                    160,000
  may_prevent  melatonin;jet lag              28,800

Table 21: Negative term pairs for the “may_prevent” relation

  Relation  term1;term2                    fGoogle(term1 * term2)
  ISA       lung diseases;cystic fibrosis  15,900
  ISA       cystic fibrosis;lung disease   32,700
  PART_OF   eyes;pupils                    160,000
  induces   niacin;flushing                19,400

1. For each non-target relation type pick the most frequent semantic type occurring as argument.
2. Extract all concepts (or term variants in the case of synonymy) associated with these semantic types from the target ontology (in this case the UMLS).
3. Produce all term variants of these concepts.

(a) Optionally accept only preferred names (PN) to reduce the set.

4. Query Google to obtain term pair co-occurrence statistics.
5. Select X random negative pairs co-occurring more than Y times on Google and representing as many semantically close or fundamental non-target relations as possible (steps 4 and 5 are sketched in code after table 23).

Steps 1 and 2 may be skipped if the target ontology lacks a top ontology with information on semantic types. For the ISA relation the most prominent semantic type in the UMLS Metathesaurus is “Diseases”, and for the PART_OF relation the most prominent type is “Body Part”. X is set to 4 so as to match the number of positive pairs in each of the ten-fold iterations. Finally, Y is set to 15,000, an arbitrary threshold value established empirically. While examples of the positive training pairs can be found in appendices 8.27 (“induces”), 8.28 (“may_prevent”) and 8.29 (synonymy), the four negative pairs used to approximate pattern precision for each of the three relations are given in tables 20, 21 and 22.

Table 22: Negative term pairs for the synonymy relation

  Relation     term1;term2                    fGoogle(term1 * term2)
  ISA          lung diseases;cystic fibrosis  15,900
  ISA          cystic fibrosis;lung disease   32,700
  may_prevent  melatonin;jet lag              28,800
  induces      niacin;flushing                19,400

Table 23: Non-ISA pairs

  relation  ISA template       term1;term2        fGoogle(term1 * term2)
  PART_OF   singular-singular  tongue;mouth       725,000
  PART_OF   singular-singular  mouth;tongue       831,000
  induces   singular-singular  disease;symptom    278,000
  synonymy  singular-singular  illness;disease    1,010,000
  induces   plural-singular    drugs;disease      681,000
  synonymy  plural-singular    diseases;illness   123,000
  PART_OF   plural-singular    fingers;hand       1,110,000
  synonymy  plural-singular    illnesses;disease  270,000
  PART_OF   singular-plural    hand;fingers       816,000
  induces   singular-plural    drug;diseases      536,000
  induces   singular-plural    disease;symptoms   976,000
  synonymy  singular-plural    illness;diseases   144,000
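A minimal sketch of steps 4 and 5 of the selection procedure follows, with a cooc callback standing in for the “term1 * term2” hit-count query. The candidate pairs and counts below are invented, and the real procedure additionally balances the sample across semantically close and fundamental non-target relations rather than sampling blindly.

    import random

    def select_negatives(candidates, cooc, x=4, y=15_000, seed=0):
        """Keep candidate negative pairs co-occurring more than Y times,
        then draw X of them at random (steps 4 and 5)."""
        frequent = [(t1, t2) for t1, t2 in candidates if cooc(t1, t2) > y]
        random.Random(seed).shuffle(frequent)
        return frequent[:x]

    counts = {("eyes", "pupils"): 160_000,
              ("melatonin", "jet lag"): 28_800,
              ("scale", "pain scale"): 40}   # too rare to be usable
    cooc = lambda t1, t2: counts.get((t1, t2), 0)
    print(select_negatives(list(counts), cooc, x=2))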

ISA

As argued above, negative pairs for the ISA relation need to be selected so as to match the morphosyntactic environment of the templates used to discover the knowledge patterns in the first place (see subsection 4.1.5). For this reason the selection procedure deviates slightly from that of synonymy and the two causal relation types. However, with the exception of ensuring that the negative pairs match the templates in terms of position and number, most steps are the same. As before, two classical non-target relation types (PART_OF and synonymy) along with the two causal relations provide the negative term pairs. To simplify the selection process, the negative term pairs are simply ranked by their Google co-occurrence frequency, and the four most frequent pairs matching the target template are selected. These pairs are listed in table 23. The complete list of positive term pairs for the ISA relation is given in appendix 8.30.

Conclusion

In conclusion, it must be admitted that the best approximation of the true precision of a particular KP supposedly instantiating a particular relation type would be achieved by using a much wider range of negative term pairs than were selected in this subsection. Unfortunately, as can be seen from table 24, the number of queries already runs high with just four negative pairs per template per relation type. Using a wider range of negative pairs on the WWW to get a more accurate approximation of KP precision would thus, in practical terms, necessitate a frequency threshold on the number of KPs investigated so as to limit the number of search engine queries.

Table 24: Querying Google

  Relation                   #KP candidates (ave.)  #total queries (app.)
  ISA (template a)           404                    3,232
  ISA (template b)           203                    1,624
  ISA (template c)           363                    2,904
  ISA (template d)           316                    2,528
  synonymy (both templates)  459                    3,672
  induces                    333                    2,664
  may_prevent                380                    3,040
  TOTAL                      2,458                  19,664

The query counts follow directly from the number of candidates: each KP candidate is combined with the four positive and the four negative pairs, i.e. eight queries per candidate. Since the purpose at this point is not to measure the exact precision of the KP candidates, but to enforce a crude filter which will eliminate the most noisy ones, it seems justifiable to use only a few negative pairs, but to make sure that these pairs meet the four recommendation points outlined above.

4.2.2 Query flexibility

A methodological question when querying Google to obtain frequency counts for the positive and negative term pairs is whether to look for exact phrases or whether to allow a certain degree of contextual flexibility between the KP candidate and the terms. Contextual flexibility can be allowed by using the same word wild card (“*”) used when discovering the KP candidates in the first place. A simple example reveals the difference flexibility can make: while the query “cystic fibrosis lung disease” yields 16 hits, the flexible query “cystic fibrosis * lung disease” yields 60 hits. The disadvantage of allowing query flexibility is that it necessarily introduces a lot of noise, because the number of words allowed by “*” cannot be specified, but may range from one to about eight. Also, introducing query flexibility would require eight times as many queries to be sent to Google, because the “*” can be positioned to the left, right, both or no sides of the KP in both the negative and positive sets of queries. All in all, it was decided to look only for exact phrases, because allowing query flexibility would require a substantial amount of time and would introduce a host of new variables while not necessarily increasing the accuracy of the precision measurements.

4.2.3 Precision of all unfiltered patterns

As a first experiment it will be interesting to see whether simply applying the complete, unfiltered51 list of patterns discovered for “induces” and “may_prevent” can be used to distinguish target from non-target relation instances (when the patterns are simply ranked by frequency of occurrence in the training snippets). The positive pairs used in this test were extracted randomly from the UMLS (making sure that none of the pairs had been used to discover KPs). They are listed in table 25. The negative pairs were also extracted randomly from the UMLS, but so that two pairs represent the ISA relation, one pair represents the PART_OF relation, and the last pair represents the opposite causal relation (i.e. “induces” in the case of “may_prevent” and vice versa). These pairs can be seen in table 26.

51 i.e. only filtered by forcing a verb as specified in subsection 4.1.5

Figure 12: “induces”: Average precision of unfiltered KPs

Figure 13: “may_prevent”: Average precision of unfiltered KPs

Table 25: Positive pairs

  induces                   may_prevent
  alcohol;vomiting          prilocaine;pain
  plague;headache           metoclopramide;nausea
  candida albicans;allergy  psyllium;constipation
  alcohol;unconsciousness   calcium;magnesium deficiency

Table 26: Negative pairs

  Relation     term pair
  ISA          proteins;aprotinin
  ISA          ketones;acetone
  PART_OF      small intestine;duodenum
  may_prevent  feverfew;migraine
  induces      oxytocin;preterm labor

By using the standard 10-fold validation technique known from Machine Learning, precision scores are computed as averages over ten iterations. Although the approach is inspired by Machine Learning, no true learning takes place, because the term contexts in the training snippets are not annotated as either true or false but are all considered positive (albeit noisy). Precision is computed by inserting Google hit counts into the formula displayed at the beginning of this section, but as the co-occurrence frequencies of the randomly selected term pairs (both positives and negatives) vary, these hit counts are normalized as follows.

score(iteration) = \sum_{i=1}^{4} \frac{\sum_{KP \in LIST} C_{\mathrm{Google}}(pair_i, KP)}{C_{\mathrm{Google}}(pair_i)}

That is, in each of the ten iterations the total number of Google hits for all combinations of a specific term pair (1 out of 4) with the patterns in the list is divided by the co-occurrence frequency of this term pair (as measured using the Google word wild card query “term1 * term2”). Figures 12 and 13 show the average precision of the complete list of discovered patterns for “induces” and “may_prevent”, as well as when taking only the top 10, 20, 30, 50, 100 or 200 most frequent patterns. Not surprisingly, there is a clear tendency for precision to drop as less and less frequent patterns are included in the test. Interestingly, the drop in precision is not very pronounced, as it stays in the 85% to 99% range. For “induces” there is even a slight increase in precision from top 200 to all patterns. This seems to corroborate the claim that no further KP filtering is really necessary for the two causal relations when forcing a verb.
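The normalization can be sketched in a few lines of Python, with hits and cooc standing in for the two kinds of Google hit-count lookups (the toy numbers below are invented):

    def iteration_score(pairs, kps, hits, cooc):
        """For each of the four held-out pairs, sum the hit counts over all
        patterns and normalize by the pair's own co-occurrence frequency."""
        return sum(sum(hits(t1, kp, t2) for kp in kps) / cooc(t1, t2)
                   for t1, t2 in pairs)

    hits = lambda t1, kp, t2: 10      # constant toy counts
    cooc = lambda t1, t2: 100
    print(iteration_score([("a", "b")] * 4, ["causes", "induces"], hits, cooc))
    # -> 4 * (10 + 10) / 100 = 0.8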

Figure 14: Frequency distribution of unfiltered pattern candidates (“induces”)

Figure 14 reveals that the pattern frequencies follow a typical Zipfian distribution, in that 241 out of the 367 different patterns discovered for the “induces” relation occur only a single time in the corpus of snippets. Combining the information in figures 12 and 14, we can conclude that the majority of these low-frequency patterns must have a very high precision (although in some cases presumably a low recall). It may also be concluded that even though the number of Google queries could be more than halved simply by disregarding such KP singletons, this would entail the loss of many useful patterns. The patterns “will actually cause”, “which promotes”, “to bring about” and “triggering” are but a few examples.

4.2.4 Individual pattern precision

Using the sets of positive and negative term pairs introduced in subsections 4.1.3 and 4.2.1, respectively, the average precision of the individual patterns and their average total frequency can now be computed by inserting Google hit counts into the formula displayed at the beginning of this section. Again, precision scores are computed as averages over ten iterations, making sure that the term pairs used to discover the patterns in each iteration are not also used when testing these patterns to assess their precision. As for the actual implementation, the script used to divide the corpus of term pair contexts into ten-fold validation sets is replicated in appendix 8.8. The script in appendix 8.10 executes the formation of Google queries, and appendix 8.12 holds the implementation of the average precision scoring. Finally, appendices 8.31 and 8.32 contain the complete lists of the average frequencies and average precision scores of all pattern candidates for the two relations “induces” and “may_prevent”, respectively. Appendix 8.33 contains the results of automatic precision ranking of the candidate ISA patterns, and appendix 8.34 the results for synonymy. Examples from these complete lists are given in tables 27, 28, 29 and 30, respectively. Judging from the results in the appendices and the examples in the tables, the crude precision filter does seem to be working.

Table 27: Induces pattern candidates (examples)

  ave. prec(KP)  ave. f(KP)  KP
  100%           163         may cause
  100%           150         to induce
  100%           86          produce
  100%           15          does not cause
  99.7%          1514        induced
  87.4%          57          causes
  78.6%          82          can cause
  46.9%          104         is
  43.1%          68          include
  29.9%          11          are

Table 28: May_prevent pattern candidates (examples)

  ave. prec(KP)  ave. f(KP)  KP
  100%           444         to prevent
  100%           326         for relieving
  100%           308         for preventing
  100%           281         helps prevent
  100%           261         relieves
  100%           185         in preventing
  60.9%          374         reduces
  53.1%          35          can reduce
  22.2%          195         causes
  8.3%           364         is
  3.2%           28          are

Table 29: Synonymy pattern candidates (examples)

  ave. prec(KP)  ave. f(KP)  KP
  100%           188.1       also called
  100%           72.1        means
  100%           68.1        see
  100%           43.2        also known as
  100%           0.6         causing
  100%           0.2         often called
  99.81%         5899.5      or
  18.55%         521.8       is
  3.28%          102.3       induced
  0.28%          105.2       include
  0.00%          7.2         which causes

Table 30: ISA pattern candidates (examples)

  ave. prec(KP)  ave. f(KP)  KP
  99.96%         184.2       an
  89.57%         53.4        has
  88.59%         28.7        activity
  79.83%         22.6        is a new
  79.52%         82.3        such as
  76.72%         585.7       is
  40.52%         134.4       are
  19.63%         906.2       with
  9.77%          1717.1      on
  7.34%          604633.7    and
  5.91%          19.2        but
  0.00%          9.7         both

For example, prepositions like “on” and conjunctions like “and” get very low precision scores for the ISA relation in table 30. However, the relatively high precision scores for patterns like “is”, “are” and “include” for the “induces” relation in table 27 strike the eye. These cases can be explained by the fact that many knowledge patterns are, in fact, discontinuous templates extending beyond the middle context of the term pairs. In this case the template is presumably:

“(a)? side effect(s)? of <term1> (include|are|is) <term2>”

Interestingly, the negated pattern “does not cause” gets a precision score of 100% in table 27, presumably because it does not occur with any of the four negative term pairs. However, its matches could also be the result of dosage information given in the left context, for example “small doses of X [does not cause] Y”. Worse yet, completely misleading patterns like “causing” for synonymy (see table 29) still make it through the crude filter. Subsection 4.2.5 will introduce a way of eliminating this kind of residual noise.
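For illustration, the discontinuous template hypothesized above could be matched as follows, assuming snippets in which the term pair has already been tagged. The regular expression is an assumption for demonstration purposes, not the thesis implementation.

    import re

    SIDE_EFFECT = re.compile(
        r"\ba?\s*side effects?\s+of\s+<term1>\s+(include|are|is)\s+<term2>",
        re.IGNORECASE)

    snippet = "Common side effects of <term1> include <term2> and dizziness."
    print(bool(SIDE_EFFECT.search(snippet)))  # -> True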

4.2.5 Using iteration range to eliminate noisy KPs

As was indicated by the lists of unfiltered KPs and precision-scored KPs in subsections 4.1.6 and 4.2.4, a disadvantage of using a totally unrestrictive search template (as is the case for ISA and synonymy) is that a number of noisy patterns are found. A simple but effective strategy for eliminating residual noisy patterns is to examine the “iteration range” of all KP candidates. This is to be understood as the number of iterations (a number from one to ten) in which a particular candidate occurs during the ten-fold-validation process. Tables 31 and 32 illustrate the effects of applying an iteration range filter to the KPs discovered for the four ISA templates and synonymy, respectively. Looking at the patterns in the two tables, it seems that adding this filter has eliminated virtually all noise. Patterns like “drugs such as” are clearly domain dependent, but a closer analysis of KP domain dependence is deferred to the empirical experiment in section 6.5.

Table 31: ISA KPs filtered by iteration range, average sample frequency and precision

  ISA-a:
  pattern         precision  range
  efficacy of     100.00%    9
  action of       100.00%    9
  drugs           100.00%    9
  actions of      100.00%    8
  agents          100.00%    7
  agents such as  100.00%    7
  called          100.00%    7
  drugs such as   100.00%    7
  properties of   100.00%    7
  effects of      100.00%    10
  effect of       100.00%    10
  activity of     100.00%    10
  drug            99.94%     10
  activity        88.59%     10
  such as         79.52%     10

  ISA-b:
  pattern    precision  range
  e.g.       99.98%     10
  such as    99.88%     10
  including  99.16%     10
  like       89.60%     10
  i.e.       77.15%     9
  include    69.02%     10

  ISA-c:
  pattern          precision  range
  exerts its       100.00%    9
  as an            100.00%    9
  is an            100.00%    10
  is an effective  100.00%    10
  an               99.96%     10
  has              89.57%     10
  is a new         79.83%     10
  is               76.72%     10
  a new            69.65%     10
  as               63.94%     10
  has an           59.99%     10
  another          57.58%     10
  and other        55.96%     10

  ISA-d:
  pattern     precision  range
  and other   99.01%     10
  or other    69.48%     10
  other       68.38%     10
  with other  67.60%     10
  see         59.15%     10
  as          58.84%     10

Table 32: synonymy KPs filtered by iteration range, average sample frequency and precision

  pattern         precision  range
  or              99.13%     10
  see             100.00%    9
  also known as   100.00%    9
  ie              100.00%    9
  means           100.00%    8
  also called     100.00%    8
  acute           100.00%    7
  called          100.00%    6
  aka             100.00%    6
  is also called  100.00%    5
  mild            100.00%    4
  is known as     100.00%    4
  refers to       100.00%    4
  severe          100.00%    4
  was defined as  100.00%    4

Table 33: Number of filtered KPs used in system evaluation

  type of filter         induces  may_prevent  ISA    synonymy
  (1) “iteration range”  no       no           yes    yes
  (2) ave. frequency     no       no           yes    yes
  (3) linguistic         yes      yes          no     no
  (4) ave. precision     yes      yes          yes    yes
  #unfiltered KPs        367      380          1,286  459
  #filtered KPs          71       101          41     13

Finally, while the three adjectives “mild”, “severe” and “acute” in table 32 may be useful KPs for retrieving synonyms of diseases (cf. “... pruritus (severe itching) ...”), they appear less useful for retrieving synonyms of substances like those tested in the system evaluation (see subsection 5.2.4). Based on this observation one may ask to what extent KPs can be not only domain dependent, but also dependent on the semantic types of the arguments instantiating the target relation. Table 33 summarizes the effects of the combined filtering on the number of KPs for the four relation types. For all relation types a minimum average precision threshold of 50% was enforced. However, based on the findings in subsection 4.2.3, the average frequency and “iteration range” filters were only deemed necessary for the ISA and synonymy relations. For both relations the minimum average frequency threshold was set to 1, and the minimum iteration range was set to 7 for the ISA KPs, but for synonymy it was lowered to 4 so as to allow a greater number of patterns to be investigated (see appendix 8.34).

These threshold values were empirically established, and the impact of the individual filters and threshold values on system performance certainly merits further investigation in future work.
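The combined filtering logic can be summarized in a few lines. The KP record and thresholds below mirror the values just described (minimum precision 50%, minimum average frequency 1, minimum iteration range 7 for ISA and 4 for synonymy), but the code is a sketch rather than the thesis implementation; the linguistic filter (3) is applied earlier, at discovery time, and therefore does not appear here.

    from dataclasses import dataclass

    @dataclass
    class KP:
        pattern: str
        precision: float   # approximated precision, 0..1
        avg_freq: float    # average sample frequency
        iter_range: int    # iterations (out of ten) in which the KP occurs

    def passes_filters(kp, relation):
        """Apply filters (1), (2) and (4) from table 33."""
        if kp.precision < 0.5:
            return False
        if relation in ("ISA", "synonymy"):
            if kp.avg_freq < 1:
                return False
            if kp.iter_range < (7 if relation == "ISA" else 4):
                return False
        return True

    print(passes_filters(KP("such as", 0.7952, 82.3, 10), "ISA"))   # True
    print(passes_filters(KP("and", 0.0734, 604633.7, 10), "ISA"))   # False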

4.2.6 Conclusion

The experiments in this section have shown that using text snippets from the WWW appears to be an effective strategy not just for KP discovery, but also for KP filtering. In summary, section 4.1 described the WWW-based discovery of KPs instantiating four different relation types. For “induces” and “may_prevent” a search template forcing a verb appeared to reduce noise so much that the patterns could, in fact, be used with no further filtering as a high-precision relationship recognizer (see subsection 4.2.3). For synonymy and ISA, totally unrestrictive search templates were used, and predictably this resulted in much noisier patterns. However, it was demonstrated how this residual noise can be minimized by computing approximate KP precision scores on the WWW based on a principled selection of positive and negative term pairs (subsection 4.2.4) and by measuring a so-called “iteration range” (subsection 4.2.5) based on the number of iterations (out of ten) in which each KP candidate is found. While chapter 4 illustrated and discussed how KPs can be automatically discovered and filtered from text snippets on the WWW, chapter 5 describes the implementation and thorough evaluation of an actual relation extraction system based on these sets of filtered KPs.

5 Relation instance extraction

This chapter describes the implementation and evaluation of WWW2REL, an automatic relation extraction system which operates exclusively on WWW text snippets and is equipped with the four sets of KPs discovered and filtered in chapter 4. While the system implementation is the topic of section 5.1, system evaluation issues are discussed in section 5.2. WWW2REL features a range of instance ranking schemes which are devised in section 5.3 and comprehensively evaluated against a manually established gold standard (sections 5.4 and 5.5).

5.1 System implementation

This section provides a brief overview of the technicalities and the challenges of the system implementation, including its initialization and test phase. It also discusses certain system strengths and limitations affecting the output which four domain experts are subsequently asked to evaluate. This manual evaluation in turn provides the foundation for all subsequent analyses of the performance of various ranking schemes (section 5.4) as well as the performance of individual knowledge patterns (section 6.3). The figure in appendix 8.35 displays a schema of the database used to store all information about candidates, patterns and the mappings between them (the table called “np2kp”).

All graphs displayed in sections 5.4 and 5.5 are produced by querying this database. Consisting of only three tables, the database has a very simple structure. The table “patterns” is a static table containing all the filtered patterns obtained for each of the four relation types. The table “candidates” is activated during the system test and contains all NPs returned when querying Google using templates like “<input term> {KP} ?” and “? {KP} <input term>”, where {KP} is the set of knowledge patterns discovered for the target relation type. Finally, the table “np2kp” holds information on each and every <input term, KP, NP> triplet occurring in the text snippets retrieved for all experiments. In other words, it maps every candidate to the range of KPs with which it co-occurs.

5.1.1 Discovering and filtering KPs

Initializing the system essentially means discovering and filtering KPs for the target relation type(s). The actual steps of the implementation presented in this thesis are as follows.

1. Discover KPs

(a) Select random seed term pairs from the target ontology (appendices 8.4, 8.5, 8.6 and 8.14)
(b) Keep only pairs with a co-occurrence frequency exceeding an appropriate threshold (appendix 8.11)
(c) Download text snippets containing these pairs from Google (appendix 8.7)
(d) Term tag snippets and prepare 10-fold-validation sets (appendix 8.8)
(e) Optionally POS tag and chunk snippets (appendix 8.9)
(f) Extract pattern candidates from term pair contexts (appendix 8.9)

2. Filter KPs

(a) Combine the term pairs from the 10-fold-validation sets with KP candidates to form search engine queries (appendix 8.10)
(b) Retrieve hit counts from Google for positive and negative pairs (appendix 8.11)
(c) Normalize hit counts and compute crude pattern “precision” (appendix 8.12)
(d) Measure “iteration range” (for synonymy and ISA) (see subsection 4.2.5)

3. Store KPs in database

Table 34: Automatic NP conflation

  original NP                    transformed NP
  bleeding in the stomach        stomach bleeding
  a buildup of toxins            toxin buildup
  increased risk of haemorrhage  *haemorrhage increased risk
  stomach ulcers in some people  *people stomach ulcers

5.1.2 Discovering relation instances

Having discovered a set of filtered patterns for a particular relation type, it is time to implement a simple extraction system which can search the WWW to discover, for example, possible side effects given a certain drug as input. The system implementation has the following steps.

1. Read input term and target relation type52 from user
2. Form “<input term> {KP} [?]” queries for each of the filtered KPs for the target relation type
3. Retrieve top X snippets for each of these queries (appendix 8.15)
4. Remove markup, tag and chunk snippets (appendix 8.16)
5. Extract NPs in position [?], delete determiners/prepositions and attempt PP fronting (appendix 8.17)
6. For each NP compute instance reliability, r(i), using one of several ranking schemes (see section 5.3)

(a) Optionally ignore hapax legomena (singletons)
(b) Optionally apply BNC discounting heuristic (subsection 5.3.1)
(c) Optionally group NPs by their heads (subsection 5.3.2)

7. Output top X most relevant candidates as ranked by scheme Y
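To make the test phase concrete, here is a heavily simplified sketch of steps 2, 3 and 5: it forms one query per KP, scans the returned snippets and tallies whatever fills the [?] slot. fetch_snippets stands in for the Google download of step 3, and the one-to-two-word regular expression is a naive substitute for the chunker-based NP extraction of appendix 8.17.

    import re
    from collections import Counter

    def extract_candidates(input_term, kps, fetch_snippets):
        """Tally raw candidate NPs found in the [?] slot after each KP."""
        tally = Counter()
        for kp in kps:
            query = f'"{input_term} {kp} *"'        # "<input term> {KP} [?]"
            for snippet in fetch_snippets(query):
                m = re.search(re.escape(input_term) + r"\s+" + re.escape(kp)
                              + r"\s+(?:an?\s+|the\s+)?([a-z-]+(?:\s+[a-z-]+)?)",
                              snippet, re.IGNORECASE)
                if m:
                    tally[m.group(1).lower()] += 1
        return tally

    fake = lambda q: ["Aspirin may cause stomach bleeding in some patients."]
    print(extract_candidates("aspirin", ["may cause"], fake))
    # -> Counter({'stomach bleeding': 1})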

In light of the discussion in subsection 2.2.2 about the linguistic properties of academic writing, it was decided to split NPs postmodified by a PP, delete the preposition and determiners, and attempt PP fronting. This way variant NPs referring to the same concept can be conflated, system output can be generalized, and recall should be boosted. The technique results in transformations like the examples in table 34. As witnessed by the two examples marked by “*” in table 34, however, the simplistic transformation technique may also produce ungrammatical NPs. While these errors could be minimized using a more sophisticated transformation technique, it was deemed sufficient to simply accept only those transformed NPs which have at least one untransformed counterpart in the given data. In other words, if the transformation

52 so far, the system has only been initialized for four relation types

“*people stomach ulcers” does not occur as an untransformed NP, the system ignores the PP (“in some people”) and registers only “stomach ulcers”. Using this heuristic, a compromise is made between losing lots of information (ignoring all PP postmodifiers) and flooding the user with NP variants which essentially instantiate the same concept (keeping all PP postmodifiers as is).
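A sketch of the transformation and the attestation check is given below, using only the prepositions and determiners needed for the table 34 examples. The real implementation (appendix 8.17) is chunker-based; everything here, including the crude depluralization, is a simplification for illustration.

    import re

    DET = r"(?:an?\s+|the\s+|some\s+)?"

    def front_pp(np):
        """Split at the first preposition, drop determiners/prepositions,
        crudely singularize the fronted NP and reorder (cf. table 34)."""
        m = re.match(r"^" + DET + r"(.+?)\s+(?:of|in)\s+" + DET + r"(.+)$", np)
        if not m:
            return None
        head, pp_np = m.group(1), m.group(2)
        return re.sub(r"s$", "", pp_np) + " " + head

    def conflate(np, attested):
        """Accept the transformed NP only if independently attested as an
        untransformed NP; otherwise keep the NP minus its PP postmodifier."""
        t = front_pp(np)
        if t and t in attested:
            return t
        return re.match(r"^" + DET + r"(.+?)(?:\s+(?:of|in)\s.*)?$", np).group(1)

    attested = {"stomach bleeding", "toxin buildup"}
    print(conflate("bleeding in the stomach", attested))        # stomach bleeding
    print(conflate("a buildup of toxins", attested))            # toxin buildup
    print(conflate("stomach ulcers in some people", attested))  # stomach ulcers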

However, the current system implementation has the following NLP limitations:

1. lack of anaphora resolution (outside the scope of the thesis and of text mining)
2. only the first PP is analyzed and fronted (if possible)
3. NPs with conjunctions are not decomposed
4. input term modifiers are ignored
5. no lemmatization of candidates

As for the second limitation, full parsing of the text snippets would presumably allow the extraction of “stomach bleeding” from a phrase like “aspirin [may cause] patients to experience [stomach bleeding]”. In such cases only the NP head immediately following the KP (i.e. “patients”) will be extracted, even though it is semantically vague and terminologically irrelevant. It is doubtful, however, whether full parsing would be worth the effort, or even be possible, when processing sentence fragments. Also, the ranking heuristics devised in section 5.3 should ensure that semantically vague heads like “patients” are penalized. The third limitation may have had the effect of lowering the overall recall of the system. If a “glucose” synonym candidate like “dextrose or corn sugar” (which is very infrequent but nevertheless correct) were to be split into its constituent parts, system performance would presumably be boosted. The fourth limitation can explain why candidates like “blood sugar” and “hyperglycemia” are suggested by the system as synonyms of “glucose” in subsection 5.5.5 (table 54). It is simply because the input term premodifiers marked by angle brackets in the following sentence fragments are allowed.

1. ... <blood> [glucose], also known as blood sugar ...

2. ... <high blood> [glucose], or hyperglycemia, ...

While it is correct that “high blood glucose” and “hyperglycemia” are synonyms, “glucose” and “hyperglycemia” are not. Future versions of the system should explore whether disallowing input term modifiers will result in a performance gain or may cause data sparseness problems. Finally, there are three reasons for not including a lemmatization module in the system implementation. First of all, it would make the system language dependent. Secondly, many of the candidates (for example antipsychotic drugs) are so specialized that they will not be recorded in any freely available, machine-readable lexicons and would thus be left in their plural form by the lemmatization module anyway. Thirdly, adding biomedical NER modules would, as discussed in section 3.4, reduce system portability.

5.2 System evaluation

This section contains a description of the manual evaluation setup for the relation extraction system which was outlined in section 5.1, initialized in sections 4.1 and 4.2, and which will be tested in sections 5.4 and 5.5. The section discusses both issues particular to the evaluation of WWW2REL and challenges which are relevant when evaluating AKA systems in general.

5.2.1 Evaluation setup

Evaluation of ontology learning is an important but largely unsolved issue, as reported at the workshop [5] upon which this volume is based. Two evaluation stages are typically performed when evaluating an ontology learning method. First, term level evaluation assesses the performance of extracting domain relevant terms from the corpus. Second, an ontology quality evaluation stage assesses the quality of the extracted ontology. While term level evaluation can be performed by using the well-established recall/precision metrics, ontology quality evaluation is more subtle and there is no standard method for performing it. One approach is to compare an automatically extracted ontology with a Gold Standard ontology which is a manually built ontology of the same domain [...] Another approach is to evaluate the appropriateness of an ontology for a certain task. [Sabou, 2005, p131]

While it is perhaps debatable whether applying precision/recall metrics to the task of term extraction is any less problematic than applying them to the task of ontology learning, the above quote does highlight a problem which is highly relevant to the evaluation that follows. The task of automatically extracting semantic relation instances from free text and evaluating their usefulness in enriching an existing ontology is an intermediate step between the two stages of “term level evaluation” and “ontology quality evaluation” mentioned in the quote. However, since semantic relations are essentially the building blocks of ontologies, observations about the evaluation issue in the literature on ontology learning are also pertinent to the relation extraction task. In short, the main problem is the lack of a standard methodology and framework for evaluating and comparing applications which extract semantic relations or build ontologies from text. The performance measurements of a number of ranking schemes in section 5.4 will also be based on a Gold Standard. This standard is not an ontology but a great number of manually annotated relation instances proposed by WWW2REL given eleven different input terms (table 38) and the four sets of filtered KPs produced in section 4.2. In effect, the system was run for the eleven different experiments without any kind of instance ranking or filtering. In this way it produced approximately 2,000 candidate relation instances (cf. table 38), which were then given to the four domain experts who judged their correctness. Establishing a Gold Standard by means of unfiltered system output is perhaps a bit unorthodox, but given time and financial constraints it was unfeasible to ask the domain experts to produce an ontology for each experiment. Even if

the experts had had plenty of time to introspectively devise such ontologies, questions could be raised as to the completeness and appropriateness of the result. Measuring precision/recall against introspectively created ontologies would be as error-prone as measuring precision/recall automatically against the UMLS Metathesaurus. The main reason why it is problematic to measure performance in this way is that it completely rules out the possibility of retrieving new knowledge which just happens not to be recorded in the UMLS or not to have materialized through domain expert introspection (see section 6.4 for examples). Since finding “new” terminological knowledge is the main purpose of the implemented system, establishing the gold standard by means of unfiltered system output appeared to be the only viable solution. This decision is also supported by [Faatz and Steinmetz, 2005] who observe that

From the moment we consider words or phrases which do not come from the given ontology, there is no automatic way of judging about their quality: from our point of view quality statements are only allowed for the descriptors we already met with the given concepts. [Faatz and Steinmetz, 2005, p84]

An important question now remains. How are the domain experts supposed to evaluate the (terminological) relevance of the relation instances produced by the system?

During the concept per concept analysis of the extracted ontologies the domain experts rated concepts correct if they were useful for ontology building and were already included in the Gold Standard. Concepts that were relevant for the domain but not considered during manual ontology building were rated as new. Finally, irrelevant concepts, which could not be used, were marked as spurious. [Sabou, 2005, p132]

[Pantel and Ravichandran, 2004] also use three categories for judging the correctness of automatically harvested semantic relations (the ISA case). Unlike [Sabou, 2005] they distinguish between “correct”, “partially correct” and “incorrect” candidates. The evaluation setup used in the following experiments also includes three possible categories for judging the relevance of each relation instance extracted by the system. Unlike the experiments described in [Sabou, 2005], however, the experts in this thesis setup were not asked to assess the novelty of the extracted relations, since this can be assessed automatically by querying the UMLS knowledge sources (if it is not recorded, it is novel). Instead, the four domain experts were asked to use the three categories listed in table 35 when judging the correctness of instances extracted by WWW2REL. The category “4” is a special case which is only used in the two experiments “X induces vomiting” and “X induces emesis” due to the greater search space of these queries. Finally, it may seem crude to group instances for which the expert is “unsure” whether the target relation holds with instances in which the argument is deemed semantically vague. This was done in order to speed up the manual evaluation process by reducing the number of possible categories. On average, each expert was allotted only about 30 seconds to decide on each of the approximately 2,000 candidate instances. An unfortunate consequence of grouping the two categories is that no separate analysis of the semantically vague arguments can be carried out, although such an analysis might have contributed to the theoretical debate outlined in section 2.2.4.

Table 35: Categories used in manual evaluation of relation correctness

  category  meaning
  1         relation is correct AND relevant
  2         unsure OR argument of relation is fuzzy/vague/incomplete
  3         relation is incorrect
  4*        relation is correct AND argument is a drug

Nevertheless, the main focus of the experiments is to optimize system precision for terminologically relevant arguments of the target relations, i.e. instances annotated as category “1”.

5.2.2 Manual evaluation issues

When setting up a manual evaluation of an automatic relation extraction system there are a number of potential pitfalls, including but not limited to the following.

The human factor

Even domain experts are not infallible. There may be blind spots in their otherwise comprehensive knowledge of the subject area, but worse than that, there may be cases in which an expert is wrong but does not stop to consider this possibility during the evaluation. By asking four experts to look at the same data, an inter-annotator agreement can be computed (see subsection 5.2.5), and this should reduce the overall effect of individual misjudgments. Nevertheless, the judgments of the domain experts should not be taken as gospel: it is not totally inconceivable that the judgments of all four experts could be wrong in a few cases.

Simplified evaluation scale

By forcing the experts to select among only three possible verdicts (1 = correct, 2 = vague/unsure, 3 = incorrect), borderline cases can be hard to assess. However, time constraints forced the merger of the “argument is semantically vague” and “unsure” categories into one, as mentioned above.

Objectivity

Especially when the developer and the thesis writer are the same individual, a potential problem is that

When the developers are deeply involved in an assessment, the read- ers may rightfully question the impartiality of the entire process and the outcome measures, and thus also the conclusion. Therefore, a description of stakeholders participating in an assessment study is important for the interpretation of an assessment study. [Brender, 2006, p291]

In this case the thesis writer and the developer are the same person, but to secure impartiality in the evaluation of system performance, this task was exclusively assigned to four domain experts who had no part in the system design or implementation.

Old or incorrect knowledge

A potentially more problematic aspect of extracting bits of knowledge from the WWW is that this source contains millions of old documents which were created years ago and perhaps never updated. It also contains numerous documents containing incorrect information for one reason or another. Although restricting queries to particular web sites, domains or date ranges might reduce the risk of encountering incorrect or old knowledge, it is not obvious how such a delimitation of data sources could be carried out while avoiding that individual sources bias the results. It is obvious, however, that restricting queries to specific web sites would severely reduce the portability of the system to other domains and exacerbate data sparseness problems. Another issue is that the effort of painstakingly establishing a truly balanced, authoritative and up-to-date collection of URLs might be partly in vain due to the dynamic nature of the WWW, where URLs shift in and out of existence over time. Finally, the distribution of such a list of URLs to other researchers can be deemed illegal (see [Sharoff, 2006] and subsection 3.1.2 for more details). In the present case it was decided to employ totally unrestrictive Google queries, with the exception of setting the language to English. This choice is partly motivated by the data sparseness problems experienced in subsection 4.1.3, but also by the fact that old (but correct) knowledge will often be required to clarify concepts and build coherent concept systems in practical terminology work. Domain experts, on the other hand, will tend to focus on recent scientific developments rather than the long established facts which are relevant to terminologists. As for the problem of incorrect knowledge, this is a more severe challenge which may reduce the precision of the knowledge patterns unfairly. Even if someone asserts (for example in a blog) that “aspirin causes cancer” and the relation is deemed incorrect by the four experts, this need not be the fault of the KP “causes”. Thus the individual KP performance figures reported in section 6.3 are presumably underestimated when compared to the performance of most other applications for the domain of Biomedicine, which typically operate only on academic papers (cf. subsection 3.4.11).

Observer bias

Finally, domain experts will necessarily have slightly different educational backgrounds, and they may also have either positive or negative preconceptions about the system to be evaluated. In the latter case “the participants may start collecting extra data to prove themselves right and the system wrong or vice versa” [Brender, 2006, p296]. In either case, the only way to minimize observer bias is to include as many human evaluators as logistically possible. In the present case that number is four. All evaluators are pharmacists educated at the Danish University of Pharmaceutical Sciences53.

5.2.3 Precision targets

Having discussed a variety of pitfalls in manual system evaluations, it is time to ask a fundamental question concerning system performance. In short, what level of precision can be expected?

53 www.dfuni.dk

Table 36: “may_prevent” - most frequent STY combinations

  frequency  STY1                             STY2
  9,745      Clinical Drug                    Disease or Syndrome
  2,037      Clinical Drug                    Pathologic Function
  1,522      Clinical Drug                    Sign or Symptom
  996        Food                             Disease or Syndrome
  681        Clinical Drug                    Injury or Poisoning
  679        Clinical Drug                    Finding
  462        Organic Chemical                 Disease or Syndrome
  248        Drug Delivery Device             Disease or Syndrome
  242        Amino Acid, Peptide, or Protein  Pathologic Function
  198        Clinical Drug                    Congenital Abnormality
  ...

If one assumes that relationship extraction requires identification of three biomedical terms (two entities and one relationship), the performance of relationship extraction should be approximately equal to the cube of the performance of NER54. [...] the assumption does not seem to hold for biological relations. It may be easier to extract concepts in combination with the relationship between them owing to the increased local context that relationships provide. [Cohen and Hersh, 2005, pp60-61]

If state-of-the-art performance in Named Entity Recognition tasks is about 90%, then relation extraction performance could be expected to reach about 70% (0.9³ ≈ 0.73). However, in the case of WWW2REL, one term is required as input, so the system only has to identify one relation and one biomedical term. This raises the expected performance to about 80% (0.9² ≈ 0.81). On the other hand, working exclusively with noisy text on the WWW renders the task more difficult and should lower the performance expectations somewhat. As will become apparent in sections 5.4 and 5.5, this level of precision is certainly attainable in most experiments, while performance falls short of it in experiments where the input term is not really a term but a general-language word (see the “vomiting” experiment in subsection 5.5.3).

5.2.4 Selecting input terms

In order to select appropriate input terms with which to test WWW2REL, it is relevant to examine which entities typically occur as arguments of the target relations (for example “may_prevent” and “induces”) in the UMLS Metathesaurus, the most comprehensive ontology of Biomedicine. Extracting statistics directly from the UMLS, we get the results listed in tables 36 and 37. The tables reveal that for both “induces” and “may_prevent” the most frequent semantic type (STY for short) combination is “Clinical Drug” and “Disease or Syndrome”. Hence, drugs, diseases and syndromes will be selected as input terms in the system tests.

54 Named Entity Recognition

Table 37: “induces” - most frequent STY combinations

  frequency  STY1                             STY2
  611        Clinical Drug                    Disease or Syndrome
  590        Clinical Drug                    Pathologic Function
  346        Clinical Drug                    Sign or Symptom
  313        Clinical Drug                    Finding
  211        Amino Acid, Peptide, or Protein  Pathologic Function
  97         Food                             Disease or Syndrome
  49         Drug Delivery Device             Disease or Syndrome
  45         Clinical Drug                    Injury or Poisoning
  41         Organic Chemical                 Finding
  25         Organic Chemical                 Disease or Syndrome
  ...

Table 38: System inputs for evaluation

  input term        relation         #snippets (tokens)  #candidates  min_frq
  aspirin           induces          3,376 (98,000)      365          >1
  selenium          may_prevent      4,967 (121,000)     421          >=1
  vomiting          induces          5,110 (127,000)     317          >1
  emesis            induces          1,641 (40,000)      76           >1
  formaldehyde      synonymy         2,028 (48,000)      46           >2
  vitamin C         synonymy         2,684 (70,000)      63           >2
  lactose           synonymy         2,291 (57,000)      41           >2
  glucose           synonymy         3,171 (80,000)      100          >2
  progesterone      synonymy         2,631 (62,000)      61           >2
  antipsychotic(s)  ISA (hyponymy)   4,270 (95,000)      225          >1
  haloperidol       ISA (hypernymy)  3,940 (90,000)      141          >2
  TOTAL: 11         4                36,109 (888,000)    1,856

As for the input terms, these were randomly selected from the UMLS Metathesaurus among relatively frequent55 active ingredients occurring in “induces”, “may_prevent” and synonymy relations. For the ISA relation one hypernym and one hyponym were selected: the hypernym is “antipsychotic” (cf. the ontology fragment in table 11), and the hyponym is the single most frequent term variant of an antipsychotic, namely “haloperidol”. Table 38 summarizes the input terms and relation types for which the extraction system was run and for which the domain experts were asked to evaluate output. It also summarizes the number of snippets (and tokens) compiled from the WWW in each of the 11 experiments. The system tests and system evaluations were carried out in six sessions56 in November and December of 2006. For each session system output was produced from the database and e-mailed individually to the four domain experts in the form of a spreadsheet with three columns: one listing all the candidate relation instances for the experiment, one listing instance ID numbers (making it easy to load the judgments into the appropriate table fields in the database), and one empty column for the judgments themselves.

55 i.e. having a Google hit count exceeding 1 million hits
56 the five synonyms constituted a single session

For each session the experts were given 2.5 hours to evaluate the output. On average, the experts were able to evaluate 300 to 400 candidates in 2.5 hours, i.e. spending about 30 seconds per instance. For this purely pragmatic reason the minimum sample frequency in table 38 (min_frq) varies between 1 and 3, so as to keep the number of candidates within that range. In the special case of the two ISA experiments, the minimum sample frequency is to be understood as the combined frequency of a candidate across all four ISA query templates. Although it would have been interesting to have had several input terms per relation type, priority was given to covering all four relation types in the first place.

Table 39: Interpretations of kappa values

Kappa      Interpretation
< 0        poor agreement
0.0-0.20   slight agreement
0.21-0.40  fair agreement
0.41-0.60  moderate agreement
0.61-0.80  substantial agreement
0.81-1.00  almost perfect agreement

5.2.5 Inter-annotator agreement

Inter-annotator agreement is assessed using an implementation (cf. appendix 8.18) of Fleiss' kappa measure for inter-rater reliability [Fleiss, 1971]; a minimal sketch of the computation is also given at the end of this subsection. The kappa measure, k, is given by

k = (P − Pe) / (1 − Pe)

where the denominator indicates the degree of agreement that is attainable above chance, and the numerator indicates the degree of agreement actually observed above chance. The measure ranges from negative numbers to 1 (perfect agreement), but no universally accepted interpretation of kappa score intervals exists. Table 39 lists an interpretation from [Landis and Koch, 1977]. Even if there is no agreement on the interpretation of agreement levels in absolute terms, the kappa measure is still useful as a measure of relative agreement, i.e. for comparing whether there is greater agreement in one experiment versus another.

Kappa measures for all system evaluations are summarized in table 40. Generally speaking, the fewer the categories the higher the agreement and thus the kappa values. As described in subsection 5.2.1 the domain experts were asked to assign each candidate relation proposed by the system to one of three categories: “1” for correct, “2” for unsure/too vague and “3” for incorrect57. However, as some of the experts seemed a little hesitant to make use of the category “3” and seemed to prefer “2”, the kappa values in table 40 are given in two columns, “strict” and “lax”.

57 the additional category “4”, meaning “correct AND the argument is a drug”, was used in two experiments, but is counted as “1” when computing inter-annotator agreement

Table 40: Inter-annotator agreement across all experiments

Experiment                kappa (strict)  kappa (lax)  observations
synonyms of formaldehyde  0.54            0.75         46
X ISA antipsychotic       0.34            0.62         225
haloperidol ISA X         0.36            0.57         141
synonyms of lactose       0.40            0.57         41
synonyms of glucose       0.42            0.56         100
synonyms of vitamin C     0.38            0.45         63
X induces emesis          0.19            0.42         76
synonyms of progesterone  0.34            0.40         61
aspirin induces X         0.18            0.28         365
selenium may_prevent X    -0.01           0.28         421
X induces vomiting        0.07            0.23         317

“Strict” means that all four experts must assign a relation to the same of the three categories for there to be perfect agreement. “Lax” means that the categories “2” and “3” have been merged, so that perfect agreement is attained if the four experts all agree on whether a relation is “correct” (“1”) or “unsure/too vague/incorrect” (“2” or “3”). As predicted, kappa values are higher when using only two categories instead of three.

Regardless of the number of categories it is apparent that inter-annotator agreement in the synonymy and ISA experiments is far better than for the causal relations (“induces” and “may_prevent”) and the “X induces vomiting|emesis” experiments. However, in the latter case using the more technical synonym “emesis” rather than “vomiting” does boost agreement considerably. In terms of kappa value interpretations, agreement ranges from fair to substantial (formaldehyde) when using two categories and from poor (selenium) to moderate when using three categories. With kappa scores of a mere -0.01 and 0.28, the “selenium may_prevent X” experiment appears to have caused the domain experts considerable trouble.

Table 41 lists the proportion of “2” judgments used in the individual experiments. Again, it should be stressed that the “2” judgment also covers cases where the candidate proposed by the system was considered semantically too vague to be relevant or simply did not make sense58. The figures are thus artificially high if interpreted exclusively as the degree of expert uncertainty in each experiment. Nevertheless, the percentages indicate that the two causal relations along with the “X induces vomiting|emesis” experiments were the hardest to evaluate. The synonymy experiments all have a conspicuously low proportion of “2” judgments. Surprisingly, it appears to have been easier for the experts to assess whether a candidate is an antipsychotic drug than whether “haloperidol” is an antipsychotic. In the case of “haloperidol ISA X” most of the “2” judgments should probably be interpreted as “too vague” (for example candidates like “medication” or “drug”) rather than “unsure”.

58 for example when PP fronting fails and the semantic core of the NP happens to be the PP head

Table 41: How unsure are the experts?

experiment              #unsure/vague  %unsure/vague
vitamin c synonyms      18             7%
progesterone synonyms   20             8%
lactose synonyms        17             10%
formaldehyde synonyms   19             10%
glucose synonyms        75             19%
X ISA antipsychotic     244            27%
haloperidol ISA X       222            40%
X induces emesis        132            43%
X induces vomiting      587            46%
aspirin induces X       803            55%
selenium may_prevent X  1,001          59%

Both in terms of inter-annotator agreement and the proportion of “2” judgments, the “selenium may_prevent X” experiment appears to have been the most difficult of all eleven. One explanation is the banal fact that there is very little agreement in the field itself as to the correctness of the various beneficial effects proposed for selenium. This was revealed through private conversations with the four experts, and in hindsight another (or an additional) input term should probably have been selected for testing the KPs discovered for the “may_prevent” relation.
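Since the kappa computation is compact, a minimal re-implementation sketch is given below (in Python; the implementation actually used is the one in appendix 8.18). The data layout is an assumption made for illustration: ratings[i][c] holds the number of experts assigning candidate i to category c, with each row summing to the number of raters.

def fleiss_kappa(ratings):
    # N items (candidate relations), k categories, n raters (here: 4 experts)
    N = len(ratings)
    k = len(ratings[0])
    n = sum(ratings[0])
    # P_i: proportion of agreeing rater pairs on item i
    P_items = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]
    P_bar = sum(P_items) / N                       # mean observed agreement
    # p_j: overall proportion of judgments falling in category j
    p_cats = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    P_e = sum(p * p for p in p_cats)               # chance agreement
    return (P_bar - P_e) / (1 - P_e)

# toy example: 4 experts, 3 categories ("1", "2", "3"), 4 candidates
print(fleiss_kappa([[4, 0, 0], [2, 2, 0], [1, 1, 2], [0, 4, 0]]))

Merging the categories “2” and “3” before building the matrix yields the “lax” kappa values reported above.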

5.3 Devising instance ranking schemes

The arsenal of instance ranking schemes devised for WWW2REL and evaluated in sections 5.4 and 5.5 is listed below. It should be noted that all these ranking schemes operate exclusively on the samples of text snippets retrieved by WWW2REL and thus (unlike e.g. the Espresso system) require no further WWW querying to establish their ranking order.

1. frq = Csample(i)

2. kpr = KPRsample(i)

3. fkpr = Csample(i) × KPRsample(i)

4. pkpr: like 2) but including passive KPs in the KP range

5. pmi = ( Σ_{p∈P} pmi(i, p) / maxpmi ) / |P|

6. pmi2 = ( Σ_{p∈P} pmi2(i, p) / maxpmi2 ) / |P|

The first ranking scheme, named “frq”, is a baseline method in which instances are simply ranked by their frequency of occurrence in the corpus of text snippets.

The second scheme ranks relation instances by their KP range, i.e. the number of different filtered KPs with which each instance occurs in the snippets corpus. This scheme will be referred to as “kpr”. The third ranking scheme, “fkpr”, is a hybrid scheme in which instance frequency is multiplied by instance KP range.

As for the fourth scheme, “pkpr”, this is identical to the “kpr” scheme except that a small set of passive KPs (see the list in appendix 8.32.1) are added to the active construction KPs. This scheme is only implemented for the “induces” relation and is used to test the hypothesis that passive KPs may identify more terminologically relevant instances and improve precision, because heavy use of the passive voice is one of the most characteristic and universal features of academic language (see e.g. [Biber et al., 1998, Biber et al., 1999]). The “pkpr” of an instance is computed by querying Google to see how many of the passive patterns the instance co-occurs with. This number is then added to the “kpr” of the instance. The procedure is implemented in the script in appendix 8.24.

Finally, the fifth scheme is essentially a slightly modified version of the instance reliability formula used in the Espresso system [Pantel and Pennacchiotti, 2006]. It is also based on a sum of the pointwise mutual information (pmi) scores of all KPs with each instance, but as mentioned in subsection 3.4.4, the reliability of all KPs in WWW2REL is either set to 1 (i.e. 100%) if they are accepted by the combination filter (see table 33) or to 0 if they are not. In Espresso a continuous reliability scale is used.

It has been observed that “a well-known problem is that pointwise mutual information is biased towards infrequent events” [Pantel and Ravichandran, 2004]. In order to discount the significance of low-frequency pairs and compensate for this bias, one often squares the observed frequencies, as in the variant pmi formula given below (see e.g. [Evert, 2004]).

pmi2(i, p) ≈ Cgoogle(t, p, i)² / ( Cgoogle(t, ∗, i) × Cgoogle(∗, p, ∗) )

With the regular pmi and this pmi2 variant (see appendix 8.21) the number of ranking schemes is brought up to six in total.
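To make the schemes concrete, the following sketch shows how the sample-based scores might be computed. The data structures (a frequency count and a set of accepted KPs per instance, plus a pmi table) are illustrative assumptions and do not mirror the actual scripts in appendices 8.19-8.24.

def frq(i, freq):                         # scheme 1: Csample(i)
    return freq[i]

def kpr(i, kps):                          # scheme 2: KP range, KPRsample(i)
    return len(kps[i])

def fkpr(i, freq, kps):                   # scheme 3: frequency * KP range
    return freq[i] * len(kps[i])

def pmi_score(i, pmi, patterns):          # scheme 5: mean max-normalized pmi
    maxpmi = max(max(row.values()) for row in pmi.values())
    return sum(pmi[i].get(p, 0.0) for p in patterns) / (maxpmi * len(patterns))

The “pkpr” score (scheme 4) would simply add to kpr() the number of passive KPs the instance is found to co-occur with, and pmi2 (scheme 6) replaces the pmi table with its squared-frequency variant.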

5.3.1 BNC discounting heuristic

The ranking schemes are essentially attempting to do two things at once, namely

1. automatic relation instance extraction

2. automatic term recognition

The first objective is met by measuring the KP range, which is assumed to be an indication of the reliability of the target relation. The second objective can be met by the use of “BNC discounting”, a heuristic introduced in this thesis to penalize arguments which are too general and thus likely to be terminologically irrelevant (see the discussion in subsection 2.2.3 on termhood and fuzziness). BNC discounting can be applied to any of the basic ranking schemes, “frq”, “kpr” or “fkpr”, and is computed by means of the following formula, exemplified by “kpr”.

kpr_bnc = KPRsample(i) / log(CBNC(i))

When an argument is not seen in the BNC, the scheme will default to its main ranking style, i.e. “frq”, “kpr” or “fkpr”. In all other cases BNC filtering will reduce the overall reliability score of the candidate by dividing the main score by the logarithmized BNC frequency of the instance, i. For example, a term occurring 10,000 times in the BNC will have its ranking score divided by log(10,000) = 9.21, while one occurring 100 times in the BNC is only penalized by a factor of 4.60. In this way BNC discounting is a somewhat more conservative way of assessing the termhood of an argument than using, for instance, ratios of relative frequencies in a general versus a specific corpus, also known as term “keyness” or “weirdness” (see [Ahmad, 1993]).

It should be noted that BNC discounting is based on unigrams and thus is not applied to instance ngrams where n > 1. The main reason for this is that the BNC is a relatively small corpus by today's standards (it contains only 100 million tokens), so a bigram frequency list produced from the BNC would likely be even more affected by data sparseness than a unigram list, and thus be unreliable. Had the author been aware that Google published comprehensive ngram statistics based on a trillion web pages in the Fall of 2006, true ngram discounting might have been employed; it will be implemented in future versions of WWW2REL. Finally, BNC filtering is expected to be most useful in combination with an additional heuristic which happens to provide it with unigrams. This heuristic is a head noun grouping strategy, to be explained in subsection 5.3.2.
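A minimal sketch of the discounting itself, assuming a unigram frequency list bnc_freq compiled from the BNC (the names are illustrative, not those of the actual scripts):

import math

def kpr_bnc(i, kps, bnc_freq):
    score = len(kps[i])                # the main scheme, here "kpr"
    c = bnc_freq.get(i, 0)
    if c > 1:                          # arguments unseen in the BNC keep
        score /= math.log(c)           # their undiscounted score
    return score                       # e.g. CBNC(i) = 10,000 -> score / 9.21

Note the guard against frequencies of 0 and 1: an argument not seen in the BNC defaults to the main ranking style as described above, and log(1) = 0 would otherwise cause a division by zero.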

5.3.2 Head grouping heuristic

The head noun grouping heuristic can also be applied to the main ranking schemes. It works by grouping all candidate instances by their NP heads and executing a primary ranking of these heads based on the main scheme, i.e. “frq”, “kpr” or “fkpr”. Based on the same scheme it then ranks all NPs sharing a particular head (see appendix 8.20 for details, and the sketch at the end of this subsection). [Gillam, 2004, Gillam et al., 2005] report on a system which induces taxonomies automatically from text by identifying prominent “mother terms” (i.e. statistically “weird” unigrams) and then recursively expanding these into more complex terms by identifying words which collocate with the NP head. However, the idea of grouping relation instances, including non-taxonomical ones, by their NP head to boost system precision (and recall) is unique to this thesis as far as the author is aware.

It is hypothesized that head grouping will not only improve system performance but may also provide the user (i.e. the terminologist) with a better interface in which possible hyponyms of selected candidates can be expanded or hidden as need be. That grouping candidates by head noun could be a useful strategy for boosting performance is illustrated by the examples in table 42. In this case all the candidate hypernyms of “haloperidol” which have “antipsychotics” as their head noun are, in fact, judged to be correct by the four experts. Of course, grouping by head noun may also create clusters of incorrect candidates, but given the hypothesis that the KP range of an instance (or a cluster of instances) is a useful indicator of its reliability, this should

be no problem, because when grouping instances by their head, the KP range is simply computed for the heads of the candidates rather than for the individual NPs.

Table 42: All candidates of “haloperidol ISA X” where the head is “antipsychotics”

judgment  candidate                        head
1,1,1,1   classical antipsychotics         antipsychotics
1,1,2,1   conventional antipsychotics      “
1,1,1,1   first generation antipsychotics  “
1,1,1,2   older antipsychotics             “
1,1,2,1   traditional antipsychotics       “
1,1,1,1   typical antipsychotics           “
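A sketch of the grouping procedure, under the simplifying assumption that the NP head is the last token of each candidate (the actual implementation is the one in appendix 8.20):

from collections import defaultdict

def group_by_head(candidates, kps):
    heads = defaultdict(list)
    for np in candidates:
        heads[np.split()[-1]].append(np)     # group NPs by their head noun

    # primary ranking: the KP range is computed for the head, i.e. over the
    # union of the KPs of all NPs sharing that head
    def head_kpr(h):
        return len(set().union(*(kps[np] for np in heads[h])))

    ranking = []
    for h in sorted(heads, key=head_kpr, reverse=True):
        # secondary ranking of the NPs sharing head h, by their own KP range
        ranking.extend(sorted(heads[h], key=lambda np: len(kps[np]), reverse=True))
    return ranking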

5.3.3 Hypotheses

Concerning the effectiveness of the proposed ranking schemes, the following hypotheses will be tested in sections 5.4 and 5.5.

1. The “kpr” scheme will achieve higher precision than the baseline “frq” scheme as KP range is a more appropriate measure of relation strength than pure frequency.

2. Grouping by noun head will boost performance because it conflates non-restrictive linguistic variations.

3. Applying BNC-based instance discounting will boost performance because a high BNC frequency is an indicator of low termhood.

4. “pkpr_bnc” is the best reliability measure because it uses two key features of LSP (language for special purposes) to determine the termhood of arguments.

5. The pmi-based schemes will not be worth the extra (computational) effort because their expected frequencies are unreliable when based on a small sample of text.

6. Using more technical (domain-specific) synonyms as input terms will improve both recall and precision.

7. Identifying non-taxonomical, conceptual relations (“induces” and “may_prevent”) is a harder task than identifying ISA and synonymy relations.

5.4 Evaluation of ranking schemes

The data analyses in this section will compare the performance of the different ranking schemes and the two heuristics proposed in section 5.3, both to determine which schemes and heuristics perform best across all eleven experiments and to test some of the hypotheses proposed in subsection 5.3.3. For each of the eleven experiments precision, recall and F-scores are computed using the three formulae introduced in section 3.2, as implemented in appendices 8.19 (no head grouping), 8.20 (head grouping) and 8.21 plus 8.23, respectively.

Table 43: Correct candidates in individual experiments

term              relation         correct/total
aspirin           induces          148/365
selenium          may_prevent      50/421
vomiting          induces          59/317
emesis            induces          25/76
formaldehyde      synonymy         4/46
vitamin C         synonymy         5/63
lactose           synonymy         1/41
glucose           synonymy         4/100
progesterone      synonymy         2/61
antipsychotic(s)  ISA (hyponymy)   88/225
haloperidol       ISA (hypernymy)  57/141

In the text mining, IE and IR literature the performance of applications is often given only as an F-score. In the context of this evaluation, however, high precision is considered all-important. The reason is that the intended users of WWW2REL are terminologists needing assistance in the practical work of augmenting an ontology. If there is much noise among the top X relation instances returned by the system, the intended users are unlikely to benefit from using it.

Also, although recall figures are interesting, one should not forget that in the terminological context (unlike the typical IE context) such figures will always be somewhat artificial, in that terminological gold standards can rarely claim to be exhaustive. This is also the case for the gold standard established from system output, and even for the more comprehensive UMLS Metathesaurus as such. In fact, since WWW2REL is conceived as a knowledge discovery application, measuring to what extent it can find relation instances already recorded in the target ontology is less important than assessing its ability to extract unrecorded but relevant relation instances with high precision. Recall versus the UMLS and the proportion of unrecorded knowledge returned by the system is the topic of section 6.4. Finally, system users should have the option of raising or lowering the minimum candidate frequency threshold (see table 38), but the effects of this parameter on system performance are not explored in this thesis.

Table 43 provides statistics on the number of correct and relevant candidates in the individual experiments as a reference point for the following analyses. Correct candidates are defined as candidates having an average judgment equal to or less than 1.50. As will become apparent when analyzing the results of the individual experiments in section 5.5, the five synonymy experiments involve far fewer correct candidates than the other experiments.

In order to measure the correlation between the ranking performed by the four human experts (the gold standard) and the ranking performed by the various automatic ranking schemes, the use of a matched pairs rank test was considered (see e.g. [Oakes, 1998]).

Rather than producing a single number expressing the overall degree of correlation between the gold standard and the system ranking, however, it seemed more appropriate to visualize the degree of correlation as a graph (or rather lots of graphs) by plotting the accumulated precision scores of the system ranking against the gold standard. As the graphs can get rather cluttered, tables are also provided.
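The accumulated precision curves themselves are straightforward to compute: walk down a system ranking and record, at every cut-off, the proportion of candidates seen so far that the experts judged correct (average judgment of at most 1.50, as defined above). A minimal sketch, with illustrative names:

def accumulated_precision(ranking, judgments):
    curve, correct = [], 0
    for rank, candidate in enumerate(ranking, start=1):
        if sum(judgments[candidate]) / len(judgments[candidate]) <= 1.50:
            correct += 1
        curve.append(correct / rank)   # precision among the top `rank` candidates
    return curve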

5.4.1 Ranking by frequency (“frq”)

Figure 15 plots precision scores using the baseline “frq” ranking scheme for all experiments involving the causal relations, the ISA relation and synonymy, respectively. In the case of the ISA relation, precision is high for the first five or so candidate instances, but then drops fairly quickly. When applied to the synonymy experiments the precision of “frq” also seems to trail off quickly. In these experiments, however, there are only very few correct candidates (see table 43), so a quick drop in precision is to be expected. Finally, in three out of the four causality experiments, namely “selenium may_prevent X”, “X induces vomiting” and “X induces emesis”, the “frq” scheme performs extremely poorly overall, with precision scores in the 0.30 and 0.20 range. The “X induces vomiting|emesis” experiments are not displayed here but discussed in subsections 5.5.3 and 5.5.4.

5.4.2 Ranking by KP range (“kpr”)

Again, figure 16 plots precision scores for experiments involving the causal relations, the ISA relation and synonymy, but this time using the “kpr” scheme, in which instances are ranked by the range of different KPs with which they occur in the sample. In the case of the ISA relation, both experiments suggest that “kpr” gives considerably higher precision rates than “frq” (top graph in figure 15), also in the longer run. In the “selenium” experiment the precision boost is negligible, and for “aspirin” simple frequency ranking is actually preferable. However, as will be discussed in subsection 5.5.4, there is a drastic precision boost for “emesis” when using “kpr” rather than “frq”. Finally, for the five synonymy experiments “kpr” is only slightly better than “frq”.

5.4.3 Ranking by “fkpr”

This time instances are ranked by their KP range multiplied by their sample frequency, or “fkpr” for short. Figure 17 plots the precision scores for experiments involving the causal relations, the ISA relation and synonymy, respectively. For the causality experiments “fkpr” is slightly better than “frq” in the “selenium” experiment but more or less the same for “aspirin”. While “fkpr” performs better than the baseline in both ISA experiments, it performs worse than “kpr”. Finally, for synonymy “fkpr” seems to be slightly better than both “kpr” and “frq”.

5.4.4 Ranking by “pmi”

As evidenced by figure 18, the “pmi” ranking scheme makes little difference in the “aspirin” and “selenium” experiments, although it is somewhat better than the baseline in the latter case.

Figure 15: Ranking by “frq” (ISA, causality and synonymy)

Figure 16: Ranking by “kpr” (ISA, causality and synonymy)

Figure 17: Ranking by “fkpr” (ISA, causality and synonymy)

Figure 18: Ranking by “pmi” (ISA, causality and synonymy)

However, in subsections 5.5.3 and 5.5.4 it will be discussed how “pmi” is, in fact, one of the best ranking schemes in the “X induces vomiting|emesis” experiments. For the ISA relation, “pmi” performs worse than the baseline “frq” scheme, as precision rates drop very quickly. In this case “kpr” appears to be the best choice. Finally, for synonymy “pmi” appears slightly better than the other schemes in that it is the first scheme to identify a correct candidate in the “progesterone” experiment.

5.4.5 Applying BNC-based discounting

Seeing as “kpr” was one of the best ranking schemes so far, the BNC-based discounting filter is applied to this particular scheme so as to test its effect on precision. Figure 19 indicates that discounting relation instances which occur frequently in the general language corpus, the BNC, does indeed improve precision. The tendency is very clear in the two ISA experiments (the top graph in figure 19). Especially when finding hyponyms of the input term “antipsychotic(s)”, BNC discounting boosts precision tremendously. Though less conspicuous, the positive impact on precision is also visible when looking for hypernyms of “haloperidol”. It makes good sense that BNC discounting has a greater impact at the lower levels of an ontology, which is typically where the more specialized terms are found.

In the causal experiments BNC-based discounting clearly boosts precision in the “selenium may_prevent X” experiment. This experiment scores quite poorly with most other ranking schemes, but now gets good precision rates in the 0.70 to 1.00 range. Given the low degree of inter-annotator agreement and the high proportion of “2” judgments (i.e. unsure/vague) in this particular experiment, higher precision rates could probably not be expected. However, the BNC-based filter does not seem to boost precision for the “aspirin induces X” experiment. When taking a closer look at the output candidates for this experiment in subsection 5.5.1, it will become apparent why. Finally, for the five synonymy experiments (the bottom graph in figure 19) the BNC-based filter only has a marginally positive effect on precision.

5.4.6 Applying head grouping

To test whether or not grouping instances by their noun head improves precision, head grouping is activated for the “kpr” scheme and the results are compared across all experiments (see figure 20). Across all experiments and all three relation types there is a clear tendency for head grouping to boost the precision of the “kpr” scheme.

In the experiments for the two causal relations, head grouping has a significant, positive and sustained impact on precision in both the “aspirin” and “selenium” experiments, although the impact is much more conspicuous in the former case. Head grouping applied to the “kpr” scheme also works brilliantly in the experiment of finding hypernyms of “haloperidol”. When finding hyponyms of “antipsychotic(s)”, head grouping has a negative impact on the first 10 candidates, but after this initial drop in precision, grouping candidate heads by their “kpr” also proves the best ranking scheme for the extraction of hyponyms. Also in the case of synonymy head grouping boosts precision, especially for the “vitamin c” experiment, in which more correct synonyms are among the high ranking candidates.

Figure 19: Ranking by “kpr_bnc” (ISA, causality and synonymy)

Figure 20: Ranking by “kpr_head” (ISA, causality and synonymy)


5.4.7 Combining both heuristics

Since both head grouping and BNC-based discounting appear to have a significant and positive impact on the precision of the “kpr” ranking scheme, the two heuristics will now be activated simultaneously so as to explore whether they work well in combination. The results of applying this combined filter across all experiments (except “X induces vomiting|emesis”) can be seen in figure 21.

Overall, combining the two heuristics does boost precision for the causal relations and the ISA relation as compared to the individual application of each heuristic. Thus for the “aspirin” experiment (and also “emesis” and “vomiting”) precision is boosted, but in the case of “selenium” higher precision rates are achieved by using the BNC-based filter without head grouping. In the two ISA experiments combining the two filters yields the best precision scores presented in this section. Indeed, as evidenced by figure 21, precision is now nearly 90% for the top 50 hyponyms of “antipsychotic(s)”. In the case of synonymy, combining the two filters appears to be a bad idea. Only in a single experiment (“progesterone”) is precision boosted, while in two experiments (“vitamin c” and “lactose”) the combination causes precision rates to drop.

5.4.8 Conclusion

The graphs presented in this section revealed that while ranking relation instances by their simple frequency (the baseline scheme) may yield high precision in some cases (for example for synonymy and ISA relations), precision quickly deteriorates. While “pmi” is a useful scheme for identifying a few instances with high precision, its precision rate also deteriorates rather quickly. In comparison with the baseline ranking scheme, both “kpr” and “fkpr” achieve considerably higher precision scores, while “pmi” is dubious. The graphs also indicated that applying either head grouping or BNC-based instance discounting can provide significant precision boosts as compared to the unmodified “kpr” ranking scheme. In fact, combining the two heuristics boosted precision even further for all the investigated relation types except synonymy. As evidenced by the complete F-score plots in appendices 8.40, 8.41, 8.42 and 8.43, the two heuristics not only boost precision but also recall, both when activated individually and even more so when used in combination. The first three hypotheses formulated in subsection 5.3.3 are thus validated.

5.5 Evaluation of experiments

In the data analyses presented in this section the perspective is changed from evaluating the performance of the various ranking schemes to evaluating performance in the individual experiments. The analyses not only test the remaining hypotheses proposed in subsection 5.3.3, but also discuss challenges particular to each experiment. Each subsection presents examples of the top 10 relation instances proposed by WWW2REL for the particular experiment along with the domain expert judgments they were assigned.

Figure 21: Ranking by “kpr_bnc_head” (ISA, causality and synonymy)

Figure 22: Aspirin induces X - assorted ranking schemes

As in section 5.4, precision is considered critical in the evaluation. Nevertheless, occasional references will be made to the recall or F-score of individual ranking schemes. Plots will only be provided to elucidate experiment particularities not illustrated by the plots in section 5.4.

5.5.1 Aspirin induces X

Figure 22 and table 45 summarize the performance of all ranking schemes applied in the “aspirin induces X” experiment. In the very short run “head-pkpr-bnc” is the top performing scheme, i.e. adding the small set of 13 passive KPs in appendix 8.32.1 actually boosts precision. However, this scheme quickly runs out of steam, and in the longer run “head-fkpr-bnc” proves to be the best scheme of them all, achieving a precision of 0.80 for its top 25 candidates. Even without head grouping, “fkpr-bnc” is still the best scheme when looking beyond the top 10 candidates. The “pmi2” variant is the worst ranking scheme.

Interestingly, “head-fkpr-bnc” is considerably better than “head-kpr-bnc”, which was considered the overall best ranking scheme in section 5.4. The explanation is presumably that the high ranking head “bleeding” gets penalized by the BNC discounting heuristic, but less so when its KP range is multiplied by its frequency (fkpr) than when it is not (kpr). Table 44 lists the top 10 instances returned by the two ranking schemes “frq” and “head-fkpr-bnc” along with their expert judgments. While ranking by simple frequency is not a bad scheme (for the top 10 candidates), it is evident from the table that the top 10 candidates returned by “head-fkpr-bnc” are judged to be more correct.

As for performance in terms of F-scores, the figure in appendix 8.42 shows that ranking schemes using the BNC-based filter outperform the other schemes throughout the sample. A second observation which can be made from these plots is that grouping by head noun improves performance not only in terms of precision but also recall.

Table 44: Aspirin induces X - top 10 candidates

rank  candidate (“frq”)         judgment  candidate (“head-fkpr-bnc”)       judgment
1     apoptosis                 1,1,2,1   bleeding                          1,1,2,2
2     bleeding                  1,1,2,2   gastrointestinal bleeding         1,1,1,1
3     asthma                    1,1,2,1   stomach bleeding                  1,1,1,1
4     ulcers                    1,2,2,1   more bleeding                     1,1,2,3
5     ringing                   1,2,2,3   internal bleeding                 1,1,2,1
6     patency                   2,2,2,1   increased postoperative bleeding  1,1,1,3
7     headache                  1,1,1,1   ulcer bleeding                    1,1,1,1
8     gastrointestinal bleeding 1,1,1,1   gastric bleeding                  1,1,1,1
9     tinnitus                  1,1,2,1   increased bleeding                1,1,2,1
10    reye’s syndrome           1,1,2,3   liver damage and stomach bleeding 1,1,2,1
...

Table 45: “Aspirin induces X”: precision of sample-based schemes

scheme    top 3  top 10  top 25    scheme         top 3  top 10  top 25
kpr       0.67   0.50    0.48      kpr_head       1.00   0.80    0.60
kpr_bnc   0.33   0.60    0.56      kpr_bnc_head   1.00   0.80    0.64
fkpr_bnc  0.67   0.40    0.64      fkpr_bnc_head  1.00   0.90    0.80
pkpr_bnc  0.67   0.50    0.56      frq_bnc_head   0.67   0.80    0.68
frq       1.00   0.70    0.52      frq_head       0.67   0.90    0.68
fkpr      1.00   0.60    0.44      fkpr_head      1.00   0.90    0.64
pmi       1.00   0.60    0.52      pmi2           0.67   0.30    0.40

Figure 23: Selenium may_prevent X - assorted ranking schemes


5.5.2 Selenium may_prevent X

Figure 23 and table 47 illustrate how the various ranking schemes fare in the “selenium may_prevent X” experiment. The first thing which strikes the eye is that precision rates drop much faster than in the “aspirin” experiment. Secondly, grouping by head noun does not seem to have as positive and lasting an effect on performance as it had for “aspirin”. As reflected by the figure and the numbers in table 47, the best performing ranking scheme is “kpr-bnc”, which performs dramatically better (56% for top 25) than when the BNC filter is turned off (24% for top 25). Similarly, the “pmi” and “pmi2” schemes, which are also tuned to detect rare events, perform relatively well in this experiment.

Table 46 lists the top 10 candidates returned by “kpr-bnc” and its filterless version “kpr”. With BNC discounting highly relevant relation candidates are returned, but when the filter is turned off the picture is less promising. One reason for this is the crude NLP performed by the system. When NPs are postmodified by PPs, fronting is attempted (as illustrated in table 34), but when this fails the PP is simply omitted in an attempt to reduce the number of term variants and boost recall. Candidates like “incidence”, “rate” and “number” are an unfortunate consequence of this strategy. They are themselves semantically vague, but in the original text snippets they were presumably followed by one or more PPs holding the core meaning of the complex NP. Fortunately, these meaningless NP fragments are eliminated by the BNC filter, which makes up for the system's lack of syntactic processing.

In terms of F-scores, the figure in appendix 8.43 reveals “kpr-bnc” to be the best scheme in the short run, while “kpr-bnc-head” is the best scheme in the slightly longer run. Interestingly, it also shows that performance peaks at a much earlier stage for selenium than for aspirin.

Table 46: Selenium may_prevent X - top 10 candidates

rank  candidate (“kpr”)     judgment  candidate (“kpr-bnc”)      judgment
1     risk                  3,2,2,2   prostate cancer risk       1,1,1,1
2     cancer                2,1,2,2   prostate cancer            1,1,1,1
3     incidence             3,2,2,2   cancer risk                2,1,2,2
4     prostate cancer risk  1,1,1,1   toxic effects              2,1,2,2
5     prostate cancer       1,1,1,1   prostate cancer incidence  1,1,1,1
6     development           3,2,2,2   lung cancer risk           2,1,1,2
7     cancer risk           2,1,2,2   oxidative damage           2,1,2,2
8     rate                  3,2,2,3   cancer incidence           2,1,2,1
9     toxicity              2,1,2,2   dna damage                 1,1,2,2
10    number                3,2,2,3   tumor growth               2,1,1,1
...

Table 47: “Selenium may_prevent X”: precision of sample-based schemes

scheme    top 3  top 10  top 25    scheme         top 3  top 10  top 25
kpr       0.00   0.30    0.24      kpr_head       0.67   0.50    0.40
kpr_bnc   0.67   0.60    0.56      kpr_bnc_head   0.00   0.50    0.40
fkpr_bnc  0.67   0.40    0.36      fkpr_bnc_head  0.33   0.20    0.40
frq       0.33   0.30    0.16      frq_head       0.67   0.20    0.40
fkpr      0.33   0.30    0.24      fkpr_head      0.33   0.20    0.40
pmi       0.67   0.40    0.28      frq_bnc_head   0.67   0.30    0.32
pmi2      0.67   0.40    0.28

By the top 19 candidates “kpr-bnc” achieves an F-score of 0.38 for “selenium”, but at the same point for “aspirin” its F-score is only 0.15. In comparison with “aspirin induces X”, top performance in this experiment is achieved about one fourth of the way through the 421 candidates. This presumably reflects the fact that in this particular experiment no frequency threshold was enforced and singletons were allowed in the sample. In fact, 238 out of the 421 candidates for “selenium may_prevent X” occurred only once in the sample snippets, in perfect agreement with Zipf's law of word frequency distributions in natural language. Since many of these singletons appear to have been judged “incorrect” by the experts, enforcing a minimum sample frequency threshold of 2 would have drastically reduced the number of candidates, presumably with only a minimal reduction of recall.

An expert observation must be mentioned at this point. During the experiment the annotators commented that the beneficial effects of selenium are still being investigated by the scientific community and that many proposed effects of selenium are being debated and undergoing scrutiny. This made their evaluation difficult and explains the poor degree of inter-annotator agreement in this particular experiment as indicated in table 40. It also stresses that many of the results reported in this evaluation are, in fact, underestimated, in that it is often the retrieved knowledge rather than the system itself which is questioned. In the “selenium” experiment there are thus many cases of the fourth type of noise listed in section 1.1, namely where the KPs realize the target semantic relation and the arguments of this relation are indeed domain specific, but where the correctness of the knowledge expressed by the relation is questionable.

5.5.3 X induces vomiting

Using a disease (or rather a symptom) as input and trying to identify its causes is inspired by [Mukherjea and Sahay, 2006], who identify the causes of Typhoid. The numbers in table 43 show how the query template “X vomiting” results in a much larger number of candidates for X (317) than the template “X emesis” (76). This, of course, reflects the fact that the more technical of the two synonyms, i.e. “emesis”, is somewhat rarer on the WWW at large. As the following plots will reveal, however, relative rarity need not affect the quality of the query results in any negative manner. On the contrary, in fact.

Figure 24 and table 49 illustrate how precision is remarkably low almost regardless of which ranking method is applied. “Pkpr-bnc” is the best scheme in terms of precision, but it quickly trails off, so in the longer run “fkpr-bnc” is a safer option. Table 48 lists examples of the top 10 candidates proposed by the “fkpr” and “pkpr-bnc” schemes, respectively. Clearly, the range of things which induce vomiting is so large that most of them are considered too vague to be biomedically relevant. Also, in many cases it is possible to envision circumstances in which the argument might cause someone to vomit (for example chocolate), but this effect is not a distinctive feature of the argument, so to speak. No wonder the experts make heavy use of the judgments “2” and “3” in this experiment. Also in terms of F-scores, performance is poor throughout the sample, as reflected by the figure in appendix 8.44. Applying the two heuristics of BNC discounting and head grouping in the “head-kpr-bnc” scheme does boost overall performance, however.

Figure 24: X induces vomiting - assorted ranking schemes

Table 48: X induces vomiting - top 10 candidates

rank  judgment  candidate (“fkpr”)      candidate (“pkpr-bnc”)  judgment
1     4,4,2,2   ipecac                  ipecac                  4,4,2,2
2     3,3,3,2   water                   rotavirus               1,1,1,1
3     1,2,2,3   ingestion               gastrointestinal tract  2,2,3,3
4     1,2,3,2   medicine                large doses             1,2,2,2
5     2,2,2,3   stomach                 ipecac syrup            4,4,4,3
6     2,3,2,3   your child              your child              2,3,2,3
7     1,3,1,1   chocolate               intracranial pressure   1,2,2,3
8     1,2,2,2   food                    binge drinking          1,2,2,3
9     2,2,3,3   gastrointestinal tract  your veterinarian       2,2,2,3
10    4,4,4,3   ipecac syrup            your dog                1,3,2,3
...

Table 49: “X induces vomiting”: precision of sample-based schemes

scheme    top 3  top 10  top 25    scheme         top 3  top 10  top 25
kpr       0.33   0.20    0.28      kpr_head       0.33   0.20    0.24
kpr_bnc   0.67   0.30    0.20      kpr_bnc_head   0.67   0.30    0.28
fkpr_bnc  0.33   0.20    0.40      fkpr_bnc_head  0.33   0.20    0.20
pkpr-bnc  0.67   0.30    0.24      frq_bnc_head   0.67   0.30    0.24
frq       0.33   0.30    0.36      frq_head       0.00   0.00    0.20
fkpr      0.33   0.20    0.24      fkpr_head      0.33   0.10    0.16
pmi       0.33   0.30    0.28      pmi2           0.00   0.40    0.24

Figure 25: {drugs} induces vomiting - assorted ranking schemes

When enforcing the additional requirement that the candidates must also be drugs, performance drops even lower (cf. figure 25 and appendix 8.44.1). The four drugs in the sample are simply too infrequent to be effectively identified (as part of the top 10) by most of the ranking schemes. However, when using head grouping and BNC discounting, the single scheme “head-frq-bnc” is able to identify “baclofen” as its second candidate. All other schemes fail. It should be stressed, though, that the experts disagreed whether “ipecac” and “ipecac syrup” are drugs. Had all experts assigned drug status to these two candidates, performance would have been less miserable.

5.5.4 X induces emesis

Figure 26 and table 50 trace the precision of the same arsenal of schemes as before, only this time using the more technical synonym for vomiting, namely “emesis”. First of all, it is obvious that performance, both in terms of precision and F-scores (see appendix 8.45), is much higher when using “emesis” rather than “vomiting” as input term. The “kpr” and “pmi” schemes, neither of which relies on the BNC filter, are best in the short run, but they are outperformed by “fkpr-bnc” in the longer run. Head grouping does not appear to be particularly helpful, except in the very short run (top 3 candidates). Table 50 provides a breakdown of the precision scores of the various ranking schemes at selected points.

As can be seen from the top 10 candidates for “pmi” and “head-kpr-bnc” in table 51, part of the reason why the precision of “head-kpr-bnc” suddenly drops is the limited NLP of the system when dealing with candidates following the “ADJ doses of ” template, where the PP containing the drug is simply skipped because it cannot be fronted (cf. subsection 5.1.2). If the system were to be custom-tailored for the biomedical domain, this would be an important point for improvement.

In this experiment the experts were also asked to use the additional judgment “4” in those cases where X not only induces the effect (“emesis”) but also is a drug.

Figure 26: X induces emesis - assorted ranking schemes

Table 50: “X induces emesis”: precision of sample-based schemes

scheme    top 3  top 10  top 25    scheme         top 3  top 10  top 25
kpr       0.67   0.60    0.48      kpr_head       0.00   0.50    0.52
kpr_bnc   0.67   0.50    0.32      kpr_bnc_head   1.00   0.50    0.56
fkpr_bnc  0.33   0.50    0.64      fkpr_bnc_head  0.33   0.50    0.56
pkpr-bnc  0.67   0.50    0.40      frq_bnc_head   0.67   0.40    0.52
frq       0.00   0.50    0.60      frq_head       0.00   0.20    0.48
fkpr      0.67   0.50    0.56      fkpr_head      0.00   0.50    0.40
pmi       0.67   0.60    0.48      pmi2           0.67   0.60    0.36

Table 51: X induces emesis - top 10 candidates

rank  judgment  candidate (“pmi”)           candidate (“head-kpr-bnc”)  judgment
1     4,4,4,1   ipecac                      ipecac                      4,4,4,1
2     2,2,2,2   large doses                 carboplatin                 4,4,4,4
3     4,2,2,1   drug                        apomorphine                 4,4,4,4
4     4,2,2,1   drugs                       xylazine                    3,2,1,4
5     1,1,1,1   chemotherapy                cisplatin                   4,4,4,4
6     3,2,2,3   studies                     pid                         2,2,3,3
7     4,4,4,4   cisplatin                   chemotherapy                1,1,1,1
8     1,2,1,1   chemoreceptor trigger zone  large doses                 2,2,2,2
9     2,2,2,2   high doses                  divided doses               2,2,2,3
10    3,2,3,3   brain                       high doses                  2,2,2,2
...

Table 52: Drugs which induce emesis|vomiting

emesis (perfect agr.)  emesis  vomiting (perfect agr.)  vomiting
methotrexate           ipecac  loperamide               ipecac syrup
levodopa                       aspirin                  iron
tramadol                       lithium                  guaifenesin
paraplatin                     baclofen                 phenylephrine
carboplatin                                             moricizine
apomorphine                                             zoloft
cisplatin
morphine

TOTAL: 8/76            1/76    4/317                    6/317

Figure 27: {drugs} induce emesis - assorted ranking schemes

Table 52 lists those drugs retrieved by the two query templates which obtained either perfect agreement59 or near-perfect agreement60. Interestingly, there is almost no overlap between the drugs retrieved by the two query templates. In absolute numbers they yield nearly the same number of drugs, but relatively speaking 12% of all candidates are drugs when using the synonym “emesis”, against only 3% when using the more domain-independent synonym “vomiting”. When accepting only perfect agreement, as is done in the graphs in the appendices, these percentages drop to 11% and 1%, respectively.

While the task of identifying drugs which induce a particular effect is obviously more difficult than just finding things which induce this effect, figure 27 and table 53 show that, when using the input term “emesis” rather than “vomiting”, the task is by no means impossible.

59 all four experts gave the judgment “4”
60 three out of the four experts gave the judgment “4”

Table 53: “{drugs} induce emesis”: precision of sample-based schemes

scheme    top 3  top 10  top 25
kpr       0.00   0.10    0.16
kpr_bnc   0.33   0.30    0.16
fkpr_bnc  0.00   0.10    0.24
pkpr-bnc  0.33   0.20    0.16
frq       0.00   0.00    0.16
fkpr      0.00   0.10    0.16
pmi       0.00   0.20    0.12
pmi2      0.00   0.20    0.12

The two schemes “pkpr-bnc” and “kpr-bnc”, which are both based on BNC discounting, are the best options, and they outperform the two pmi-based variants by a wide margin. Adding the passive construction KPs indeed appears to strengthen the reliability of the causal relation instances retrieved.

Again, the main reason why “pkpr-bnc” and “kpr-bnc” are the best ranking schemes in the task of identifying drugs with a particular effect is that drug names are relatively rare in a general language corpus like the BNC. They are thus largely unaffected by the BNC frequency discounting because they have a high “weirdness”, to use the terminology of [Ahmad, 1993]. That the pmi-based schemes are outperformed is perhaps surprising, seeing as they are also tuned to identify rare events. However, the relatively small text sample may make it difficult to get a correct assessment of the expected frequencies on which pmi relies.

5.5.5 Synonymy

Before analyzing in detail the results of the five synonymy experiments, it is worthwhile to consider the fact that identifying synonyms in the strict, terminological sense is harder than simply identifying semantically related words. For example, many of the words clustered in Wordnet's so-called synsets would not be considered true synonyms when exposed to a thorough conceptual analysis.

Work in computational linguistics related to synonym detection has mainly focused on detecting semantically related words rather than exact synonyms [...] [Yu et al., 2002]

The above quote from the developers of the SGPE system, which extracts gene and protein synonyms from biomedical text, stresses that this is indeed also the case for the domain of Biomedicine, and the problem of detecting synonymy in the strict sense of the term will become apparent in the following discussions.

Synonyms of “glucose”

In the analysis of this first of the five synonymy experiments we will take a detailed look at the information in the test database, in which all the experimental data is stored and from which all precision graphs are generated. Table 54 illustrates how the candidate synonyms for “glucose” are ranked by a selection of the various schemes.

Table 54: Ranking of “glucose” synonyms

candidate (correct?)          frq       kpr      ...  kpr-bnc     pmi
blood sugar (no)              121 (#1)  9 (#3)   ...  9.00 (#2)   1.72 (#42)
dextrose (yes)                83 (#2)   10 (#1)  ...  2.91 (#14)  1.82 (#56)
sugar (no)                    72 (#3)   10 (#2)  ...  0.60 (#71)  1.29 (#31)
hypoglycemia (no)             31 (#4)   8 (#4)   ...  11.54 (#1)  1.74 (#41)
hyperglycemia (no)            29 (#5)   8 (#5)   ...  8.00 (#3)   1.80 (#32)
...
d-glucose (yes)               5 (#33)   3 (#19)  ...  3.00 (#10)  2.14 (#8)
corn sugar (yes)              4 (#49)   2 (#36)  ...  2.00 (#17)  1.88 (#30)
dextrose or corn sugar (yes)  3 (#67)   2 (#44)  ...  2.00 (#20)  1.98 (#23)

Figure 28: Synonyms of “glucose” - assorted ranking schemes

It is observed how three of the four correct candidate synonyms of “glucose”, namely “d-glucose”, “corn sugar” and “dextrose or corn sugar”, are both infrequent and associated with a very limited range of KPs in the sample text snippets (3, 2 and 2, respectively). Consequently, they rank low by the “frq”, “kpr” and “fkpr” schemes, but higher when the BNC filter is turned on (“kpr-bnc”) or with the pmi-based schemes, which are biased towards rare events. Also, the infrequency of the candidate “dextrose or corn sugar” illustrates how system performance (primarily recall) could be improved by adding more advanced morphosyntactic transformations so as to decompose coordinated NPs into their constituent parts.

Figure 28 displays the precision of assorted ranking schemes for synonyms of “glucose”. There are four correct candidates in the sample (three of which are very infrequent, as mentioned above), and only one of the ranking schemes returns a correct candidate as its first choice, namely “kpr”. The “pmi2” scheme is the first to identify a second synonym, as candidate number 12.

Table 55: Synonyms of “glucose” - top 10 candidates

rank  candidate (“kpr”)     judgment  candidate (“frq”)  judgment
1     dextrose              1,1,1,1   blood sugar        3,2,2,3
2     sugar                 3,1,3,1   dextrose           1,1,1,1
3     blood sugar           3,2,2,3   sugar              3,1,3,1
4     hyperglycemia         3,3,3,2   hypoglycemia       3,3,3,3
5     hypoglycemia          3,3,3,3   hyperglycemia      3,3,3,2
6     diabetes              3,3,3,3   amount             2,3,2,2
7     glycogen              3,3,2,2   insulin            3,3,3,2
8     blood                 3,3,3,2   glycogen           3,3,2,2
9     corn syrup            3,3,3,2   simple sugar       3,1,3,2
10    gestational diabetes  3,3,3,3   monosaccharide     2,2,2,3
...

Table 55 lists the top 10 candidates returned by the “kpr” and “frq” schemes, respectively. While most of the candidates (with the exception of “dextrose”) are not synonyms of glucose, they appear to be highly related to this concept in various ways. So related, in fact, that even the experts disagree as to the number of synonyms in the sample (see for example the “sugar” candidate).

At this point it will be helpful to visualize a small fragment of the UMLS sugar ontology (see figure 29), as this may illustrate the relative complexity of the knowledge required on the part of the experts in this (and the following) experiment. Incidentally, the UMLS sugar ontology is slightly simplified as compared to that presented in textbooks on Chemistry61.

The UMLS Metathesaurus contains no synonyms for “lactose”, but two synonyms for “glucose”, namely “dextrose” and “D-glucose”. Only one of these (dextrose) makes it into the top 10 in the system ranking, but the ontology fragment explains why candidates like “glycogen”, “monosaccharide”, “simple sugar” etc. appear among the top 10 candidates. They are not synonyms of “glucose” but hypernyms (for example “monosaccharide” and its synonym “simple sugar”) or other hyponyms of carbohydrate. Further down the list of candidates a wide range of these, including “galactose”, “glycogen”, “malt sugar” and “maltodextrin”, appear.

This suggests that the synonymy KPs are perhaps somewhat ambiguous and to some extent overlap with the ISA KPs (more on this ambiguity in section 6.3). Nevertheless, the system also retrieves candidates so closely related to “glucose” that they could be called near-synonyms. For example, the candidate “corn sugar” is indeed a synonym of “glucose”, but what about “corn syrup”? Another example of near-synonymy is the candidate “FDG”, which ranks fairly low with a frequency of 3 and a KP range of 2. FDG is an acronym for “fluorodeoxyglucose”, a glucose analogue. This means that “FDG” (C6H11FO5) is structurally similar to “glucose” (C6H12O6), but has a hydroxyl group replaced by fluorine.

61 glucose is, in fact, an “aldohexose” [McMurry, 1998, p. 441]

Figure 29: UMLS sugar ontology fragment

Table 56: Levels of precision in chemical nomenclature

nomenclature      features                example
common name                               corn sugar
systematic name   +semantic transparency  glucose|dextrose
chemical formula  +atom number/type       C6H12O6
IUPAC system      +atomic configuration   6-(hydroxymethyl)oxane-2,3,4,5-tetrol

Figure 30: Synonyms of “lactose” - assorted ranking schemes

As mentioned in the discussion about the Communicative Theory of Terminology (see subsection 2.2.4), [Sager, 1990] argues that the wording of messages in specialized discourse is determined by the three factors of precision, economy and appropriateness. Although a maximum level of precision can easily be attained by using complete and unambiguous definitions of the concepts used in the discourse, the factor of economy will tend to shorten and compress the complete definitions into more ambiguous phrases with the passage of time. In short, appropriateness is the balance between precision and economy which has been achieved over time in a specific discourse community. In the case of Chemistry (which partly overlaps with the domain of Biomedicine), this balance can be struck with at least four degrees of precision. Table 56 illustrates how these four degrees of precision are realized by four different ways of expressing the concept “corn sugar”.

In the example “corn sugar” is the common name for “glucose|dextrose” and is typically used by laymen. While “glucose|dextrose” is the systematic term used by domain experts (and knowledgeable non-experts), these experts often have an even greater need for precision and will use either the chemical formula, indicating the number and type of atoms in the particular molecule, or the IUPAC62 expression if the configuration of these atoms makes a semantic (or rather chemical) difference. All this is to say that with respect to synonymy, Biomedicine is perhaps a unique domain, because synonymy can be eliminated altogether by using IUPAC notation.

Synonyms of “lactose”

Figure 30 and the complete F-score plot in appendix 8.39.4 illustrate that “lactose” is a special case in that it has only one synonym, namely “milk sugar”. Nearly all schemes manage to identify this as their number one choice. There is one exception, though: “pmi2” fails miserably and ranks “milk sugar” as its third choice. In this case pmi discounting by squaring the observed frequency proved less effective than the unmodified pmi measure.

62 International Union of Pure and Applied Chemistry

Table 57: Synonyms of “lactose” - top 10 candidates

rank  candidate (“frq”)    judgment  candidate (“pmi2”)   judgment
1     milk sugar           1,1,1,1   milk                 3,3,3,2
2     milk                 3,3,3,2   galactose            3,2,3,3
3     lactose intolerance  3,3,3,3   milk sugar           1,1,1,1
4     condition            3,3,3,3   lactose intolerance  3,3,3,3
5     sugar                3,3,1,1   inducer              3,3,3,3
6     lactose-free milk    3,3,3,3   yoghurt              3,3,3,3
7     lactase              3,3,3,2   lactase deficiency   3,3,3,3
8     galactose            3,2,3,3   lactose maldigestion 3,3,3,3
9     iptg                 3,3,3,3   iptg                 3,3,3,3
10    milk allergy         3,3,3,2   lactose-free milk    3,3,3,3
...

Table 57 lists the top 10 candidates as ranked by “frq” and “pmi2”. The candidates in the table again suggest that synonymy KPs may have something in common with ISA KPs. As can be seen from the sugar ontology in figure 29, “galactose”, for example, is a carbohydrate like “lactose”, albeit a monosaccharide and not a disaccharide. Interestingly, there may also be an overlap between synonymy KPs and meronymy KPs. Lactose stands in a part-whole relation to the two candidates “milk” and “yoghurt”, so they are certainly semantically related, but not synonymous. This again illustrates how high precision KPs are required to distinguish between instances of closely related relation types, whereas identifying “semantically related” terms can be done through various other measures of distributional similarity, as in e.g. [Ananiadou and McNaught, 2006].

Synonyms of “formaldehyde”

While figure 31 plots the precision of assorted ranking algorithms in the task of finding synonyms of “formaldehyde”, the figure in appendix 8.39.3 plots their F-scores. Looking at the two graphs reveals that several algorithms are not only highly precise, but also place 75% of all the synonyms in the sample (3 out of 4) among their top 10 candidates. In fact, most of the action takes place among the top 5 candidates, and “fkpr-bnc” and “fkpr” are the best ranking schemes. However, “fkpr-bnc” is the first scheme to identify the fourth and final synonym, namely the infrequent “formic aldehyde or methyl aldehyde” (at rank 17). “Kpr-bnc”, on the other hand, is a surprisingly poor choice. Table 58 has a breakdown of precision and recall scores, and table 59 lists the top 5 candidates for “kpr-bnc” versus “fkpr”. In conclusion, identifying synonyms of “formaldehyde” (and also of “lactose”) appears to be a considerably easier task than for “glucose”.

Synonyms of “progesterone”

Figure 32 reveals “progesterone” to be the most challenging of all five synonymy experiments. As can be seen from table 60, the only schemes which rank correct candidates among their top 10 are “pmi2” and “pmi”. All other schemes perform even worse. Table 61 lists the top 10 candidate synonyms as ranked by “frq” versus “pmi2”.

Figure 31: Synonyms of “formaldehyde” - assorted ranking schemes

Table 58: Synonyms of “formaldehyde”: precision and recall of sample-based schemes

scheme    top 3 (P)  top 5 (P)  top 10 (P)  top 3 (R)  top 5 (R)  top 10 (R)
kpr       0.67       0.40       0.30        0.50       0.50       0.75
kpr_bnc   0.33       0.40       0.30        0.25       0.50       0.75
fkpr_bnc  1.00       0.60       0.30        0.75       0.75       0.75
frq       0.67       0.40       0.30        0.50       0.50       0.75
fkpr      1.00       0.60       0.30        0.75       0.75       0.75
pmi       0.67       0.60       0.30        0.50       0.75       0.75
pmi2      0.67       0.40       0.30        0.50       0.50       0.75

Table 59: Synonyms of “formaldehyde” - top 5 candidates

rank  candidate (“kpr-bnc”)       judgment  candidate (“fkpr”)          judgment
1     methanal                    1,3,1,1   formalin                    1,2,1,1
2     paraformaldehyde            3,3,3,3   methanal, methylene oxide   1,1,1,1
3     embalming fluid             2,3,3,2   methanal                    1,3,1,1
4     ureaform                    3,3,3,3   bakelite                    3,3,3,3
5     methanal, methylene oxide   1,1,1,1   paraformaldehyde            3,3,3,3
...

Figure 32: Synonyms of “progesterone” - assorted ranking schemes

Table 60: Synonyms of “progesterone”: precision of sample-based schemes

scheme    top 3  top 5  top 10
kpr       0.00   0.00   0.00
kpr_bnc   0.00   0.00   0.00
fkpr_bnc  0.00   0.00   0.00
frq       0.00   0.00   0.00
fkpr      0.00   0.00   0.00
pmi       0.00   0.00   0.10
pmi2      0.00   0.00   0.10

Table 61: Synonyms of “progesterone” - top 10 candidates

rank  candidate (“pmi2”)           judgment  candidate (“frq”)       judgment
1     medroxyprogesterone acetate  3,2,3,3   progestin               1,2,1,3
2     progestogens                 1,2,1,3   hormone                 3,3,3,3
3     progestins                   1,2,1,3   progestins              1,2,1,3
4     treatment                    3,3,3,3   estrogen                3,3,3,3
5     progestin                    1,2,1,3   estrogen dominance      3,3,3,3
6     estradiol                    3,3,3,3   new osteoblasts         3,3,3,3
7     hormone                      3,3,3,3   molecule                2,3,2,2
8     progestogen                  1,2,1,1   prometrium              3,3,3,3
9     prometrium                   3,3,3,3   provera                 3,3,3,3
10    estrogen dominance           3,3,3,3   progesterone substance  3,3,2,3
...

Figure 33: Synonyms of “vitamin c” - assorted ranking schemes

The candidates in the table reveal two things. First of all, there is considerable disagreement among the experts as to the status of “progestin(s)” as a synonym of progesterone. Secondly, it would appear that a lemmatization module might have improved system performance in this case. While the candidate “progestogen” is considered correct by the experts, its plural form, “progestogens”, which is ranked second by “pmi2”, ends up with an average correctness score exceeding the 1.50 threshold. The lack of lemmatization may have confused an expert here, thus reducing the apparent performance of the system. Quite a few correct hypernyms are retrieved by the system, for example “molecule” and “hormone”. Again this indicates an overlap between synonymy and ISA KPs. As for “progestin(s)”, this is indeed a near-synonym of “progesterone”. Progestins are synthetic progestogens, whereas progesterone is the only natural progestogen. Chemically speaking, the two terms refer to the same compound/substance, but terminologically speaking, progestin is a cohyponym of progesterone and not a synonym, because a distinctive feature (the nature of their genesis) can be identified. Indeed this semantic closeness is also reflected in the expert judgments of “progestin” (1,2,1,3) and “progestins” (1,2,1,3). Only one of the four experts does not consider these terms synonymous with “progesterone”, and one is in doubt. In fact, although progesterone has two synonyms (“progestogen” and “progestogens”) according to the domain experts, neither of these is among the ones listed as synonyms in the UMLS. The UMLS does consider “progestogen” and “progestin” synonymous, but they are both recorded as hypernyms of “progesterone”. These inconsistencies suggest that WWW2REL may also be used to revise, and not just augment, existing ontologies.
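The lemmatization gap just described could be narrowed even with a very crude conflation step. The Python sketch below is purely illustrative and not part of WWW2REL; it merges candidates that differ only by a trailing plural “s” and pools their scores, whereas a real lemmatizer would also handle the irregular forms this one misses.

    def conflate_plurals(candidates):
        # Merge candidates differing only by a trailing "s", pooling scores.
        merged = {}
        for cand, score in sorted(candidates.items()):
            if cand.endswith("s") and cand[:-1] in candidates:
                cand = cand[:-1]
            merged[cand] = merged.get(cand, 0) + score
        return merged

    print(conflate_plurals({"progestogen": 3, "progestogens": 2, "hormone": 1}))
    # {'hormone': 1, 'progestogen': 5}

With “progestogen” and “progestogens” conflated, the experts might have judged a single, stronger candidate instead of splitting their verdict over two surface forms.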

Synonyms of “vitamin c” In this experiment almost all ranking schemes perform really well and identify two out of the five synonyms of “vitamin C” among their top 5 candidates (see figure 33 and table 62). The only scheme which performs poorly is the baseline frequency ranking, “frq”.

Table 62: Synonyms of “vitamin C”: precision of sample-based schemes

scheme   | top 3 | top 5
kpr      | 0.67  | 0.40
kpr_bnc  | 0.67  | 0.40
fkpr_bnc | 0.67  | 0.40
frq      | 0.33  | 0.20
fkpr     | 0.67  | 0.40
pmi      | 0.67  | 0.40
pmi2     | 0.67  | 0.40

Table 63: Synonyms of “vitamin C” - top 10 candidates

rank | candidate (“kpr”)    | judgment | candidate (“frq”)  | judgment
1    | ascorbic acid        | 1,1,1,1  | ascorbic acid      | 1,1,1,1
2    | l-ascorbic acid      | 1,1,2,1  | advocacy arguments | 3,3,3,3
3    | ascorbate            | 1,1,3,3  | other uses         | 3,3,3,3
4    | vitamin e            | 3,3,3,3  | earlier section    | 3,3,3,3
5    | antioxidant          | 3,3,3,3  | vitamin e          | 3,3,3,3
6    | powerful antioxidant | 2,3,2,2  | antioxidant        | 3,3,3,3
7    | placebo              | 3,3,3,3  | ester-c            | 3,3,3,3
8    | scurvy               | 3,3,3,3  | l-ascorbic acid    | 1,1,2,1
9    | advocacy arguments   | 3,3,3,3  | powerful scavenger | 2,2,2,2
10   | other uses           | 3,3,3,3  | fewer cateracts    | 3,3,3,3
...

Table 63 has the top 10 candidates and their expert judgments for the schemes “frq” and “kpr”. The figure in appendix 8.39.1 reveals how additional synonyms are found further down the list, but all synonyms of vitamin C (in the sample) are, in fact, present in table 63. However, due to the lack of NP decomposition, “ascorbic acid and ascorbate” and “ascorbic acid or ascorbate” are regarded as two different candidates. The complex NPs, of course, contain two different candidates, but these also occur individually and are identified by WWW2REL. However, this is a recall issue, and terminologists using the system will presumably give priority to high precision. Overall, there are many near-synonyms in the lists of candidates proposed by the system (and a few hypernyms like “antioxidant”63). Further down the list, for example, we find candidates like “ascorbate” and “sodium ascorbate”. Neither of these is a synonym of “vitamin C/ascorbic acid”, because they are salts rather than acids. However, the correctness of a candidate like “ascorbate or ascorbic acid” is truly difficult to assess, because it is part right (ascorbic acid) and part wrong (ascorbate). Cases like these may, of course, lower the inter-annotator agreement.

63the mysterious candidate “powerful scavenger” is a fragment of the NP “powerful scavenger of free radicals”, which does not provide a synonym for “vitamin C” but describes its function

Table 64: Haloperidol ISA X - top 10 candidates

rank | candidate (“kpr-head-bnc”)      | judgment | candidate (“kpr-head”)
1    | antipsychotic                   | 1,1,1,1  | antipsychotic
2    | typical antipsychotic           | 1,1,2,1  | typical antipsychotic
3    | conventional antipsychotic      | 1,1,2,1  | conventional antipsychotic
4    | typical antipsychotics          | 1,1,1,1  | typical antipsychotics
5    | antipsychotics                  | 1,1,1,1  | antipsychotics
6    | conventional antipsychotics     | 1,1,2,1  | conventional antipsychotics
7    | older antipsychotics            | 1,1,1,2  | older antipsychotics
8    | traditional antipsychotics      | 1,1,2,1  | traditional antipsychotics
9    | first generation antipsychotics | 1,1,1,1  | first generation antipsychotics
10   | classical antipsychotics        | 1,1,1,1  | classical antipsychotics
...

5.5.6 Haloperidol ISA X

Terminological definitions based on classical conceptual analysis (see subsection 2.2.1) are formed by finding the immediate hypernym and the delimiting characteristics of the analysandum (i.e. the target concept). Thus in this subsection the relation extraction system attempts to identify hypernyms of the concept represented by the term “haloperidol”. In subsection 5.5.7, however, the system attempts to identify the hyponyms of an input term. Although that experiment concerns the extension rather than the intension of a concept, automatically identifying hyponyms may still be a useful tool for a terminologist surveying a new subdomain. At this point the reader perhaps wonders how the number of correct hypernym candidates for the antipsychotic “haloperidol” can possibly be as high as 57. The explanation can be found in figure 38 (section 6.4) and in table 64. This table lists the top 10 candidates when ranked by KP range and grouped by head, with and without the BNC filter. Both schemes rank the two heads, “antipsychotic” and “antipsychotics”, as number one and two, respectively. The reason there are so many correct candidates in the sample is mainly the great diversity of adjectival modifiers like “typical”, “classical”, “conventional” and so on. It is, of course, also because no lemmatization is performed, so that “antipsychotic” and “antipsychotics” are counted as two different heads. Finally, haloperidol, in fact, has a range of characteristic properties and thus multiple hypernyms, as illustrated in figure 38. If desired by the user, the linguistic variation can easily be reduced by displaying only the candidate heads and ignoring all NPs grouped under the individual heads, as is done in table 65. This table also reveals that there are significant differences between the two schemes, “kpr-head” and “kpr-head-bnc”. When the BNC filter is turned on, semantically vague heads like “drug(s)”, “agent” and “treatment” disappear from the list in favor of “antiemetic” and “butyrophenones”. Figure 34 and table 66 compare the precision of all ranking schemes.

Table 65: Haloperidol ISA X - top 10 heads

rank | candidate (“kpr-head”) | candidate (“kpr-head-bnc”)
1    | antipsychotic          | antipsychotic
2    | antipsychotics         | antipsychotics
3    | drugs                  | antiemetic
4    | agents                 | neuroleptics
5    | drug                   | high-potency
6    | treatment              | butyrophenones
7    | neuroleptics           | neuroleptic
8    | neuroleptic            | butyrophenone
9    | medication             | phenothiazines
10   | antiemetic             | medications
...

Figure 34: Haloperidol ISA X - assorted ranking schemes

Table 66: Haloperidol ISA X: precision of sample-based schemes

scheme        | top 5 | top 10 | top 25
fkpr          | 0.80  | 0.90   | 0.76
fkpr-head     | 0.80  | 0.70   | 0.76
fkpr-bnc      | 1.00  | 0.90   | 0.68
fkpr-bnc-head | 1.00  | 1.00   | 0.84
frq           | 1.00  | 0.70   | 0.64
frq-head      | 1.00  | 0.80   | 0.76
frq-bnc-head  | 1.00  | 1.00   | 0.80
kpr           | 1.00  | 0.90   | 0.72
kpr-head      | 1.00  | 1.00   | 0.80
kpr-bnc       | 1.00  | 0.90   | 0.84
kpr-bnc-head  | 1.00  | 1.00   | 0.84
pmi           | 0.80  | 0.70   | 0.64
pmi2          | 0.20  | 0.30   | 0.44

Figure 35: “X ISA antipsychotic” - assorted ranking schemes

As indicated by the scores in the table and the plots (including those in appendix 8.41), performance, both in terms of precision and recall, is greatly boosted by grouping candidates by their NP head. Furthermore, “(f)kpr-bnc-head” are the two best schemes, whereas “pmi2” performs really poorly. Again, this probably reflects the fact that the hypernyms of “haloperidol” are relatively frequent, and, as mentioned before, pmi favors rare events. Finally, the experiment underscores the difficulty of assessing whether a proposed hypernym is too general or not. When writing analytical definitions the objective is always to identify the closest superordinate concept (genus proximum) and a single distinctive feature. Thus two experts consider the candidates “drugs” and “prescription drugs” too vague, while three experts think the same of the candidate “typical drugs”. While such general hypernyms would clearly constitute a useful bit of knowledge for many non-experts (myself included before writing this thesis), the target user profile must be kept in mind here. For terminologists, who work bottom-up when structuring the knowledge of a domain, a hypernym like “drug” will probably be too general and of little use when building more specific subontologies level by level.

5.5.7 X ISA antipsychotic

Figure 35 reveals that the BNC discounting and head grouping heuristics really make a difference to the precision in the task of identifying antipsychotics in a sample of text snippets from the WWW. The baseline “frq” scheme and the pmi-based scheme, on the other hand, rapidly lose ground. Although precision is not, and could not be expected to be, quite as high as in the “haloperidol ISA X” experiment, it is still remarkably high considering the noisy source of knowledge. Grouping also boosts overall performance in terms of recall/F-score (see appendix 8.40). By the top 100 candidates the F-score of “head-kpr-bnc” exceeds 0.80.
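To make the interplay of the two heuristics concrete, the following Python sketch reproduces the general idea of KP-range (“kpr”) ranking with head grouping and BNC discounting. The last-token head extraction and the logarithmic discounting formula are simplifying assumptions made for illustration, not the actual WWW2REL implementation.

    from collections import defaultdict
    import math

    def rank_candidates(candidates, bnc_freq, use_bnc=True, group_by_head=True):
        # candidates: candidate NP -> set of KPs it co-occurred with
        # bnc_freq:   word -> BNC frequency (absent words count as 0)
        groups = defaultdict(set)
        for cand, patterns in candidates.items():
            key = cand.split()[-1] if group_by_head else cand  # naive NP head
            groups[key] |= patterns  # pool the KP range of the whole group
        scores = {}
        for key, patterns in groups.items():
            score = float(len(patterns))  # "kpr": number of distinct KPs
            if use_bnc:
                # Discount heads that are frequent in general language;
                # heads unseen in the BNC keep almost their full score.
                score /= math.log(bnc_freq.get(key, 0) + 2)
            scores[key] = score
        return sorted(scores, key=scores.get, reverse=True)

    cands = {"typical antipsychotic": {"is an", "such as"},
             "antipsychotic": {"called"},
             "drug": {"is a", "and other"}}
    print(rank_candidates(cands, {"drug": 13000, "antipsychotic": 40}))
    # ['antipsychotic', 'drug']

Pooling patterns under a shared head is what lets a well-attested head such as “antipsychotic” accumulate evidence from all of its modified variants.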

Table 67: “X ISA antipsychotic”: precision of sample-based schemes

scheme        | top 10 | top 25 | top 50
fkpr          | 0.80   | 0.80   | 0.58
fkpr-head     | 0.90   | 0.80   | 0.74
fkpr-bnc      | 0.80   | 0.72   | 0.50
fkpr-bnc-head | 0.90   | 0.84   | 0.84
frq           | 0.70   | 0.72   | 0.46
frq-head      | 0.90   | 0.64   | 0.64
frq-bnc-head  | 0.90   | 0.84   | 0.82
kpr           | 0.90   | 0.76   | 0.58
kpr-head      | 0.90   | 0.84   | 0.74
kpr-bnc       | 1.00   | 0.88   | 0.78
kpr-bnc-head  | 0.90   | 0.84   | 0.88
pmi           | 0.60   | 0.52   | 0.56
pmi2          | 0.30   | 0.48   | 0.58

Table 68: “X ISA antipsychotic” - top 10 candidates

rank | candidate (“pmi2”)       | judgment | candidate (“kpr-bnc”) | judgment
1    | chlorpromazine           | 1,1,1,1  | clozapine             | 1,1,1,1
2    | medications              | 2,2,2,3  | risperidone           | 1,1,1,1
3    | these drugs              | 2,2,3,3  | olanzapine            | 1,1,1,1
4    | agent                    | 2,2,3,3  | haloperidol           | 1,1,1,1
5    | drug                     | 2,2,2,2  | quetiapine            | 1,1,2,1
6    | high doses               | 2,2,3,3  | aripiprazole          | 1,1,1,1
7    | mglu23 receptor agonists | 3,2,2,2  | thioridazine          | 1,1,1,1
8    | agents                   | 2,2,3,3  | zyprexa               | 1,1,1,1
9    | perphenazine             | 1,1,1,1  | seroquel              | 1,1,1,1
10   | zeldox                   | 1,1,1,1  | risperdal             | 1,1,1,1
...

The precision figures in table 67 reveal “head-kpr-bnc” and “kpr-bnc” to be the two best ranking schemes. Finally, table 68 displays the top 10 candidates when ranked by the worst performing “pmi2” and best performing “kpr-bnc” schemes, respectively.

5.5.8 Conclusion

This subsection discusses the overall performance of WWW2REL, first in terms of precision and secondly in terms of F-score. An overall best instance ranking scheme is identified, a default system setting is suggested and a tentative inter-system performance comparison is attempted. Combining the results of the analyses presented in sections 5.4 and 5.5, it would appear that the two heuristics of head grouping and BNC discounting applied to the “kpr” or “fkpr” ranking scheme constitute the most effective strategy for optimizing precision, at the top 10, the top 25 and even larger numbers of relation instances. In line with the hypothesis from section 5.3, the highest precision rates were achieved in the ISA experiments. ISA relation instances appear to be the easiest to retrieve and rank automatically. As expected, the “X induces {vomiting | emesis}” experiments proved difficult, but using “emesis” boosted precision as hypothesized. The causal relations were relatively difficult as well, especially the “selenium” experiment, but as described in subsection 5.2.5 the “selenium” experiment was characterized by significant inter-annotator disagreement and is arguably an anomalous case. Finally, identifying at least one synonym with high precision proved feasible in all experiments (except for “progesterone”). It was hypothesized that the “pmi” schemes would not be worth the effort, but when identifying things which induce emesis or vomiting, or synonyms of progesterone, “pmi”, in fact, proved very useful. Finally, the use of passive KPs (the “pkpr” scheme) did have a positive effect on precision in the “aspirin” and “{drugs} induce emesis” experiments. However, the range of these patterns appears to be too small to make a lasting improvement in experiments with a large search space. One way of improving this ranking scheme would be to attach greater weight to these, presumably, highly reliable patterns than to the regular, active constructions.

Complete F-score plots of all experiments can be seen in the appendices, but table 69 computes the actual system performance as the ratio of the F-score achieved by the best ranking scheme in each of the eleven experiments to the maximum possible F-score. The maximum possible F-scores are computed relative to a fixed threshold value for the number of candidate instances taken into consideration. Deciding on a threshold value will always be a somewhat arbitrary exercise, but it seems unreasonable that a terminologist using WWW2REL should have to go through more than 50 candidate instances for any relation type. Given a uniform cut-off point of the top 50 instances, the maximum possible F-score in the “aspirin induces X” experiment is

$F_{max.pos.}(50) = 2 \cdot \frac{\frac{50}{50} \cdot \frac{50}{148}}{\frac{50}{50} + \frac{50}{148}} \approx 0.51$

In this setting, system performance when selecting the best ranking scheme (fkpr with both heuristics switched on) is thus 81.2% relative to perfect performance. For synonymy, the maximum possible F-scores drop drastically when raising the threshold from the top 10 to top 50 instances, whereas the exact opposite is the case for ISA and the two causal relations.
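The maximum possible F-scores used in table 69 follow directly from the cut-off: in the best case, every one of the inspected candidates is correct. The few lines of Python below reproduce the computation (the function name is hypothetical):

    def f_max_possible(cutoff, total_correct):
        # Best case: every one of the `cutoff` inspected candidates is correct.
        hits = min(cutoff, total_correct)
        precision = hits / cutoff
        recall = hits / total_correct
        return 2 * precision * recall / (precision + recall)

    print(round(f_max_possible(50, 148), 2))  # 0.51, as for "aspirin induces X"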

Table 69: Summary chart - overall system performance (F-scores)

experiment             | F(10)/Fmax.pos.(10) | Fmax.pos.(10) | F(50)/Fmax.pos.(50) | Fmax.pos.(50)
aspirin induces X      | 100%                | 0.13          | 81.2%               | 0.51
selenium may_prevent X | 60.0%               | 0.33          | 38.0%               | 1.00
X ISA antipsychotic    | 100%                | 0.20          | 86.9%               | 0.73
haloperidol ISA X      | 100%                | 0.30          | 78.1%               | 0.94
X induces vomiting     | 41.4%               | 0.29          | 31.6%               | 0.92
X induces emesis       | 80.1%               | 0.57          | 96.0%               | 0.67
vitamin c              | 43.5%               | 0.67          | 82.4%               | 0.18
lactose                | 100%                | 0.18          | 100%                | 0.05
glucose                | 24.5%               | 0.57          | 100%                | 0.15
formaldehyde           | 75.3%               | 0.57          | 100%                | 0.16
progesterone           | 51.1%               | 0.33          | 100%                | 0.08
OVERALL AVERAGE        | 70.5%               |               | 81.3%               |

Table 70: Summary chart - overall best ranking scheme

Top 10 instances:
ave. performance | scheme
69.6% | kpr-bnc-head
68.9% | kpr-head
65.4% | frq-bnc-head
63.9% | fkpr-bnc-head
58.9% | kpr-bnc

Top 50 instances:
ave. performance | scheme
65.6% | fkpr-bnc-head
64.6% | kpr-bnc-head
60.7% | frq-bnc-head
59.1% | kpr-bnc
57.2% | kpr-head

In the former case, it is easy to achieve 100% system performance (in terms of F-score) by raising the threshold value, but as illustrated by the plots in subsection 5.5.5, this comes at a heavy cost to precision and is probably not desirable in practical terminology work. To make WWW2REL truly useful to terminologists, a default setting should be provided. Table 70 identifies the overall best ranking scheme in terms of average performance at the two threshold levels across all experiments. It suggests that the default ranking scheme should be “kpr” with both heuristics turned on. As discussed in section 6.5, certain domain characteristics may dictate a change to this default setting, however. Other bits of domain knowledge, such as the perceived size of the search space for the target relation type, may also affect e.g. the threshold level.

Comparing the precision and F-scores achieved by WWW2REL with those reported for other pattern-based relation extraction systems, e.g. those listed in table 5, section 3.4, is non-trivial at best. For a comparison to be 100% fair, systems should be applied to the exact same data, in which instances of the target relation types should be manually annotated by domain experts to provide a common gold standard. Since no such gold standard appears to exist for the relevant relation types in a corpus of Google text snippets, and WWW2REL is designed to operate on exactly such data, a direct performance comparison is not possible. Nevertheless, precision ranges obtained by different systems but for the same relation types should be comparable, keeping in mind that some data sources (e.g. uncategorized WWW text snippets) pose a greater challenge than others (e.g. corpora of biomedical abstracts) and that precision rates are often reported without reference to the number of instances or the level of recall. However, in the case of the Espresso system [Pantel and Pennacchiotti, 2006], the number of instances is given, and thus table 71 contains a tentative comparison of its precision scores with those of WWW2REL.

Table 71: Inter-system precision comparison

system   | relation   | instances | precision
Espresso | ISA        | 200       | 85%
WWW2REL  | ISA-hypo   | 50 -> 100 | 88% -> 75%
WWW2REL  | ISA-hyper  | 50 -> 100 | 78% -> 55%
Espresso | production | 196       | 72.5%
WWW2REL  | induces    | 50 -> 100 | 72% -> 57%

Although the text types differ (a Chemistry textbook in the case of Espresso and WWW text snippets in the case of WWW2REL), the two systems perform at comparable precision levels. Precision levels in WWW2REL do deteriorate more rapidly as the number of instances is increased, but this is hardly surprising given the noisy nature of uncategorized text fragments on the Internet. In conclusion, it would appear that the precision rates achieved are quite satisfactory given the special context of the WWW2REL implementation, namely a noisy text source, the lack of sophisticated NLP techniques, the relative simplicity of the ranking schemes and the domain independence of the system.

6 Recall and portability

Having described and discussed the implementation of WWW2REL and examined its performance from two different perspectives, namely the precision of the various ranking schemes and the precision in the individual experiments, it is now time to investigate other parameters which affect the quality of the system. The first of these is the text snippet sample size, which was arbitrarily set to 100 snippets per query. Section 6.1 thus examines the correlation between ranking scheme performance on the snippets sample versus the WWW as a whole. Secondly, the BNC frequencies of all instance heads will be compared to the frequencies of these heads on the WWW (i.e. Google hit counts) so as to determine whether using web frequencies rather than BNC frequencies would have made a big difference to the discounting heuristic (section 6.2). Another important topic is a detailed investigation of the actual performance of individual knowledge patterns in order to determine which KPs are the most useful ones (see section 6.3). The issue of recall, and the related issue of portability, is further pursued in the two final sections of the chapter. Specifically, the following questions will be addressed.

157 1. What is the recall versus the UMLS?

2. What is the proportion of new knowledge retrieved?

3. How domain-independent are the WWW2REL KPs?

4. How domain-independent are the ranking schemes and heuristics?

Section 6.4 attempts to answer the first two questions, but also discusses how measuring system recall versus the UMLS Metathesaurus is not straightforward. Finally, section 6.5 contains a few tests of the system on another domain (IT) to examine the system portability questions raised above.

6.1 Correlation between snippets sample and WWW

This section will compare the performance of the best sample-based ranking schemes with their WWW-based extensions so as to explore in what way additional data would have affected the performance of these schemes. Of course, if candidates had been extracted from a much larger text sample, results might have been different. What is examined in this section, however, is exclusively the correlation between the ranking of the correct candidates when limiting analysis to the existing sample and when using hit counts from the entire WWW. The expectation is that candidates will be associated with a wider range of KPs and have a higher KP co-occurrence frequency when looking at the entire WWW, but that the sample-based and WWW-based schemes will largely correlate positively with each other. In other words, the graphs presented in this section are meant to elucidate what the “upper performance bound” would be if we had but time and hit counts enough, so to speak. Is the arbitrary sample size of the top 100 text snippets per query inadequate, or would additional snippets not have made much of a difference? The reason why the KP range of a particular candidate may be higher when looking at the entire WWW than it is in the snippets sample is that, for some KPs, some candidates may by coincidence not have been among the top 100 snippets returned by Google, although they do in fact occur with these KPs on the WWW. To minimize the time-consuming Google querying process and to keep the graphs as readable as possible, only the overall best ranking scheme, namely “(f)kpr-bnc”, will be analyzed, and only the experiments with many correct candidates, namely the causal relations and ISA relations, will be examined. As figure 36 shows, there is a clear and positive correlation between the sample-based and WWW-based performances in all four experiments. When computing KP ranges on the WWW, performance is boosted significantly for “aspirin induces X” in the short run and moderately in the longer run. For “selenium may_prevent X” the inverse correlation is observed, namely a moderate precision boost in the short run and a more significant one in the longer run. The reason why the positive effect of the WWW is greater for selenium in the longer run is presumably that singletons were allowed in this experiment and, in fact, constituted about half of all candidates. On the WWW these singletons, of course, occur much more often, and the “kpr-bnc” scheme is able to perform more accurately based on the extra statistics. The positive effect of using the entire WWW when ranking is also quite conspicuous for the ISA experiments, both when looking for hypernyms of “haloperidol” and hyponyms of “antipsychotics”.

Figure 36: Correlation between WWW and sample-based precision

Figure 37: Correlation between BNC and WWW unigram frequencies

In conclusion, there seems to be a high degree of correlation between the sample-based and WWW-based performances of the selected schemes. While the upper performance bound is not dramatically higher than the sample-based performance, extending the sample size from 100 snippets per query to perhaps double or triple that might still be well advised.
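The WWW-based extension of a scheme can be sketched as follows: a pattern contributes to a candidate's KP range whenever the instantiated pattern has at least one hit on the whole WWW, rather than only when it happens to occur in the 100-snippet sample. The hit_count callable below is an assumption standing in for whatever search API is used, and the query format is likewise simplified.

    def www_kp_range(term, candidates, patterns, hit_count):
        # A pattern counts towards a candidate's KP range if the exact
        # phrase "<term> <KP> <candidate>" occurs at least once on the WWW.
        return {cand: sum(1 for kp in patterns
                          if hit_count('"%s %s %s"' % (term, kp, cand)) > 0)
                for cand in candidates}

    # Toy stub standing in for real hit counts.
    fake = {'"aspirin may cause tinnitus"': 120,
            '"aspirin can cause tinnitus"': 55}
    print(www_kp_range("aspirin", ["tinnitus"], ["may cause", "can cause"],
                       lambda q: fake.get(q, 0)))  # {'tinnitus': 2}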

6.2 Correlation between BNC and WWW

The relation instance ranking experiments in section 5.4 indicated that a BNC-based discounting heuristic has a positive impact on system precision. Using the comprehensive ngram statistics recently made public by Google64 would increase the portability of WWW2REL, since the BNC is not freely available. However, instead of exploring whether the Google ngram statistics might have improved the efficiency of the discounting heuristic, this section will examine to what extent the BNC frequencies currently used by the system, in fact, correlate with the corresponding Google hit counts. Thus figure 37 plots the log-transformed BNC frequencies used by WWW2REL against their log-transformed Google hit counts. Not surprisingly, the Google hit counts are much higher than the BNC frequencies, but the plot does indicate a clear correlation between the two. It is thus unlikely that using Google hit counts (or the new Google ngram statistics) would have dramatically changed the effect of the discounting heuristic. In fact, a possible benefit of using BNC frequencies rather than Google ngram statistics as an indicator of termhood is that all text in the BNC was written before 1994. Since the terminology of many specialized domains, especially the test case of Biomedicine, changes rapidly, many terms would not be found in the BNC but would surely be found on the WWW.

64distributed by the Linguistic Data Consortium (LDC)

In this way, basing the discounting on Google hit counts might penalize correct candidates which would not have been penalized when basing the discounting on the BNC. On the other hand, using Google ngram statistics would allow discounting not just of the instance head but of the complete instance ngram. It is doubtful whether this would boost system precision, but it is an interesting question to be further explored in future work (section 7.2).
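The correlation visible in figure 37 can be quantified as a plain Pearson coefficient over log-transformed counts. The self-contained Python sketch below illustrates the computation; there is no claim that this is how the figure itself was produced.

    import math

    def log_pearson(bnc_counts, www_counts):
        # Pearson correlation of log-transformed frequencies (+1 for zeros).
        xs = [math.log(c + 1) for c in bnc_counts]
        ys = [math.log(c + 1) for c in www_counts]
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
        sy = math.sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy)

    # Toy (BNC, Google) frequency pairs for four instance heads.
    print(log_pearson([10, 200, 0, 5000], [20000, 900000, 300, 40000000]))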

6.3 Precision and recall of individual knowledge patterns

Based on the expert judgments, it is now possible to compute the actual precision65 of the individual knowledge patterns discovered for the four relation types. Actual precision scores are computed by a formula similar to the one displayed in section 3.2. In other words, the precision of a pattern is the number of times this pattern returns a correct relation instance66 divided by the total number of instances it returns (whether correct or incorrect). Recall, on the other hand, is the number of correct instances returned by the pattern divided by the total number of correct instances which occur in the sample and could possibly be returned by the pattern (see appendix 8.25 for details). Generally speaking, expert uncertainty or disagreement, as well as incorrect knowledge found on the WWW, make the performance scores presented in this section somewhat underestimated. While this is not completely “fair”, so to speak, these factors will negatively affect the precision scores of the individual knowledge patterns. Finally, by requiring the average correctness level to be 1.50 or below, we consider invalid many of those cases where the pattern does instantiate the target relation but the extracted argument is considered too vague or fuzzy (i.e. the judgment “2” has been used). This also “unfairly” reduces the performance scores of the individual KPs, but it is a reasonable requirement for a system which is to be used for the augmentation of terminological ontologies, although it may be too strict when applying the system to the expansion of general ontologies.
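The computation just described can be summarized in a short Python sketch; the data structures are hypothetical stand-ins for the thesis' actual bookkeeping, and the 1.50 threshold is the correctness criterion defined above.

    def kp_scores(extractions, judgments, retrievable, threshold=1.5):
        # extractions: KP -> list of instances it returned
        # judgments:   instance -> four expert scores (1 correct, 2 vague, 3 wrong)
        # retrievable: KP -> number of correct instances it could have returned
        def is_correct(inst):
            scores = judgments[inst]
            return sum(scores) / len(scores) <= threshold
        result = {}
        for kp, insts in extractions.items():
            hits = sum(1 for inst in insts if is_correct(inst))
            prec = hits / len(insts)
            rec = hits / retrievable[kp]
            f = 2 * prec * rec / (prec + rec) if hits else 0.0
            result[kp] = (round(f, 2), round(prec, 2), round(rec, 2))
        return result

    print(kp_scores({"may cause": ["ulcers", "problems"]},
                    {"ulcers": [1, 1, 2, 1], "problems": [3, 3, 3, 3]},
                    {"may cause": 125}))
    # {'may cause': (0.02, 0.5, 0.01)}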

6.3.1 “Induces” patterns

Table 72 lists a few examples of the KPs discovered for the UMLS “induces” relation. The patterns are ranked by F-score, but the table also lists the precision and recall scores of the individual patterns. The complete list can be found in appendix 8.36. Patterns marked by “NA” did not retrieve any instances from the WWW, and it was thus impossible to determine their precision or recall levels. The pattern “overdose can cause” is an example of a domain specific KP for the “induces” relation. It has perfect precision but retrieves only a few instances. The pattern “may aggravate” is another example of a highly reliable, but low recall pattern. Reports on the precision and recall of individual KPs are hard to find in the literature, as researchers normally report on the performance of systems applying a host of patterns to solve a specific task. However, the relatively high precision figures for “may cause” are, in fact, corroborated by the findings in [Barrière, 2001, p. 146], who lists “cause” as one of a number of noiseless patterns in a manually annotated sample.

65as opposed to the crude measure of precision computed in subsection 4.2.4
66defined as having an average correctness score of 1.50 or below

Table 72: Precision and recall of “aspirin X” patterns (expert judgments)

Pattern            | F-score | precision | recall (of 125)
may cause          | 0.20    | 0.69      | 0.11
can cause          | 0.18    | 0.58      | 0.11
can lead to        | 0.17    | 0.46      | 0.11
can induce         | 0.16    | 0.78      | 0.09
induces            | 0.13    | 0.55      | 0.07
may induce         | 0.12    | 0.79      | 0.07
promotes           | 0.12    | 0.63      | 0.07
leads to           | 0.12    | 0.37      | 0.07
to induce          | 0.11    | 0.85      | 0.06
...
overdose can cause | 0.04    | 1.00      | 0.02
may aggravate      | 0.04    | 1.00      | 0.02

Also, studying the automatic acquisition of causality markers, [Girju and Moldovan, 2002] find that the verb “induce” has a low degree of ambiguity and a high frequency, whereas “develop” has a high ambiguity and a high frequency. While these two findings fit with the results presented in appendix 8.36 and tables 72 and 73 (“induce” VPs generally have a high precision), many other markers found by [Girju and Moldovan, 2002] are not found in the present experiment, and vice versa, presumably due to the domain specific nature of the term pairs employed during KP discovery in this thesis. Tables 73 and 74 compare which KPs perform best when using the synonym “emesis” versus “vomiting”, ignoring how the KPs fare in the “aspirin induces X” experiment. Interestingly, there are far fewer patterns used with “emesis” (22) than with “vomiting” (43), but more significantly, “emesis” tends to collocate with patterns containing the verb “induce”. There are zero KPs containing this verb among the top 10 best performing patterns for “vomiting”, but six among the top 10 best patterns for “emesis”. “Vomiting”, on the other hand, collocates with stylistically informal features like the relative pronoun “that”. This illustrates that features of academic writing tend to occur in clusters, and how one can narrow the focus of WWW queries to this text genre by using not just technical synonyms like “emesis”, but also technical, or perhaps domain specific, KPs like “induce”. Section 6.5 features a small pilot study which attempts to assess the domain specificity of the KPs discussed in the present section.

6.3.2 “May_prevent” patterns

Table 75 lists a few examples of the KPs discovered for the UMLS “may_prevent” relation. Again, patterns are ranked by F-score, but the table also lists the precision and recall scores. The complete list can be found in appendix 8.37. Generally speaking, the precision (and also F-scores) of “may_prevent” patterns is lower than for “induces” patterns. This presumably reflects the fact that there was considerable inter-annotator disagreement in this experiment. Again, some patterns are very reliable but have low recall (e.g. “to combat”). Also, it seems that verbs in the generic present tense have a higher precision than verbs in the past tense, but this tendency is not clear. Patterns containing the verb “decrease” are also mentioned in a study by [Marshman and L'Homme, 2006], who observe that this causal marker has two non-causal senses out of a total of three senses. In other words, it can be rather noisy, as is also indicated by the precision scores for this marker in the table and the appendix.

Table 73: Precision and recall of “X emesis” patterns (expert judgments)

Pattern      | F-score | precision | recall (of 25)
induces      | 0.27    | 0.88      | 0.16
causes       | 0.27    | 0.79      | 0.16
to induce    | 0.23    | 0.44      | 0.16
induced      | 0.21    | 1.00      | 0.12
will induce  | 0.21    | 0.89      | 0.12
causing      | 0.15    | 1.00      | 0.08
can induce   | 0.15    | 0.82      | 0.08
may cause    | 0.14    | 0.71      | 0.08
cause        | 0.14    | 0.62      | 0.08
to cause     | 0.14    | 0.44      | 0.08
for inducing | 0.13    | 0.38      | 0.08
...

Table 74: Precision and recall of “X vomiting” patterns (expert judgments)

Pattern           | F-score | precision | recall (of 59)
overdose include  | 0.23    | 0.77      | 0.14
poisoning include | 0.20    | 0.69      | 0.12
produces          | 0.13    | 0.83      | 0.07
that can cause    | 0.13    | 0.17      | 0.10
causes            | 0.12    | 0.50      | 0.07
which causes      | 0.12    | 0.41      | 0.07
can also cause    | 0.12    | 0.14      | 0.10
to produce        | 0.11    | 0.34      | 0.07
can cause         | 0.11    | 0.17      | 0.08
does not cause    | 0.09    | 0.52      | 0.05
to cause          | 0.09    | 0.40      | 0.05
...

Table 75: Precision and recall of “may_prevent” patterns (expert judgments)

Pattern               | F-score | precision | recall (of 50)
helps prevent         | 0.16    | 0.42      | 0.10
prevents              | 0.16    | 0.35      | 0.10
protects against      | 0.14    | 0.24      | 0.10
decreased             | 0.14    | 0.24      | 0.10
reduced               | 0.13    | 0.12      | 0.14
could reduce          | 0.12    | 0.28      | 0.08
significantly reduces | 0.11    | 0.59      | 0.06
in preventing         | 0.11    | 0.53      | 0.06
...
cuts                  | 0.08    | 0.62      | 0.04
...
to combat             | 0.04    | 0.50      | 0.02

Table 76: Recall and precision per template

template | pattern                    | recall        | precision
a        | hypernym (sg) {KP} hyponym | 69% (100/145) | 48% (100/208)
b        | hypernym (pl) {KP} hyponym | 48% (70/145)  | 69% (70/101)
c        | hyponym {KP} hypernym (sg) | 49% (71/145)  | 39% (71/183)
d        | hyponym {KP} hypernym (pl) | 38% (55/145)  | 47% (55/118)

6.3.3 ISA patterns

Table 76 lists the precision and recall computed for each ISA template. Although the statistics are somewhat meagre, with just two input terms, haloperidol and antipsychotic(s), the figures in the table suggest that the most promising template, in terms of recall and for this particular subdomain, is template “a”. In other words, constructions where the hypernym occurs in the singular, followed by a KP and the hyponym, for example “an antipsychotic {KP} haloperidol”. Template “d” has the lowest recall, i.e. constructions where the hyponym is followed by a KP and then the hypernym in the plural, for example “risperidone {KP} antipsychotics”. In terms of precision, template “b” (for example, “antipsychotics {KP} haloperidol”) is far and away the best option with 69%.
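For illustration, the four templates can be approximated as simple regular expressions over lower-cased snippet text. The NP pattern below is deliberately naive, and the surface forms are a reconstruction for this sketch, not the system's actual queries.

    import re

    def isa_matches(snippet, kp, hyper_sg="antipsychotic",
                    hyper_pl="antipsychotics"):
        # Returns (template, matched NP) pairs found in one snippet.
        np_pat = r"([a-z][a-z0-9\- ]{2,40}?)"
        k = re.escape(kp)
        templates = {
            "a": rf"\b{hyper_sg} {k} {np_pat}\b",  # "an antipsychotic such as X"
            "b": rf"\b{hyper_pl} {k} {np_pat}\b",  # "antipsychotics like X"
            "c": rf"\b{np_pat} {k} {hyper_sg}\b",  # "X is an antipsychotic"
            "d": rf"\b{np_pat} {k} {hyper_pl}\b",  # "X and other antipsychotics"
        }
        hits = []
        for name, pattern in templates.items():
            for m in re.finditer(pattern, snippet.lower()):
                hits.append((name, m.group(1).strip()))
        return hits

    print(isa_matches("Haloperidol is an antipsychotic.", "is an"))
    # [('c', 'haloperidol')]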

Table 77: Precision and recall of ISA patterns (per template)

TEMPLATE (a) (of 100)
pattern        | F    | prec | rec
such as        | 0.57 | 0.88 | 0.42
agents such as | 0.35 | 0.88 | 0.22
drugs such as  | 0.31 | 0.91 | 0.19
efficacy of    | 0.30 | 0.55 | 0.21
effect of      | 0.28 | 0.54 | 0.19
called         | 0.25 | 0.75 | 0.15
activity of    | 0.25 | 0.56 | 0.16
effects of     | 0.25 | 0.49 | 0.17
action of      | 0.25 | 0.49 | 0.17
agent          | 0.21 | 0.72 | 0.12
properties of  | 0.16 | 0.44 | 0.10
actions of     | 0.11 | 0.30 | 0.07
drug           | 0.10 | 0.36 | 0.06
agents         | 0.06 | 0.58 | 0.03
drugs          | 0.06 | 0.53 | 0.03
activity       | 0.00 | 0.00 | 0.00

TEMPLATE (c) (of 71)
pattern         | F    | prec | rec
is an           | 0.55 | 0.82 | 0.41
and other       | 0.50 | 0.69 | 0.39
is an effective | 0.29 | 0.57 | 0.20
as              | 0.24 | 0.40 | 0.17
is a new        | 0.22 | 0.73 | 0.13
is              | 0.20 | 0.35 | 0.14
as an           | 0.20 | 0.27 | 0.15
has             | 0.12 | 0.38 | 0.07
an              | 0.12 | 0.37 | 0.07
has an          | 0.11 | 0.21 | 0.07
another         | 0.08 | 0.67 | 0.04
exerts its      | 0.07 | 0.18 | 0.04
a new           | 0.03 | 0.03 | 0.03

TEMPLATE (b) (of 70)
pattern   | F    | prec | rec
like      | 0.70 | 0.89 | 0.57
such as   | 0.68 | 0.87 | 0.56
including | 0.45 | 0.94 | 0.30
include   | 0.33 | 0.69 | 0.21
e.g.      | 0.27 | 0.97 | 0.16
i.e.      | 0.26 | 0.84 | 0.16

TEMPLATE (d) (of 55)
pattern    | F    | prec | rec
and other  | 0.66 | 0.73 | 0.60
or other   | 0.51 | 0.70 | 0.40
as         | 0.26 | 0.46 | 0.18
with other | 0.23 | 0.57 | 0.15
other      | 0.16 | 0.53 | 0.09
see        | 0.14 | 0.35 | 0.09

In table 77 the individual performance of all ISA KPs is listed, restricting the analysis to the relation instances retrieved by each of the four search templates. This restriction is enforced because many KPs are specific to one or more, but rarely all four, search templates. Thus even if there are 145 correct instances in the snippets sample for the entire ISA experiment67, the number of correct instances per template is used when computing recall and precision in table 77. Merging all four templates when computing KP performance would unfairly have penalized KPs for not retrieving instances they could not possibly retrieve. For example, the pattern “is an” could not have retrieved the instance “antipsychotics”, but only “antipsychotic”. In any case, performance scores based on the merged templates are listed in appendix 8.38. As a first observation, the F-scores in table 77 are significantly higher than for the causal KPs. There are at least two possible explanations why this is the case. First of all, the search space of the non-hierarchical causal relation types is more open than is the case for the ISA relation (see the number of correct instances in table 43), but more importantly, the number of different KPs discovered for the causal relation types is much higher than that discovered for the ISA relation and in particular the synonymy relation (see table 33). Thus any single causal KP, no matter how versatile, could not be expected to retrieve a great proportion of all the correct instances in the sample. Secondly, judging the correctness of ISA relation instances appears to be easier than judging the correctness of causal relation instances. This second explanation is corroborated by the fact that inter-annotator agreement is considerably higher in the ISA experiments than in the causal experiments (see table 40), and that the experts are less unsure when judging ISA as opposed to causal relation instances (see table 41). A second observation is that the “b” template features the two best performing KPs in terms of F-score, namely “like” and “such as”. However, many of the KPs for template “a”, in fact, really belong in template “b”, because the hypernym is used as an adjective and is followed by a noun in the plural, for example “antipsychotic agents|drugs”. Also, many KPs for this particular template express both genericity and causality at the same time, for example “effects|actions|properties of”. This is presumably a domain dependent phenomenon, and it will be investigated more closely in section 6.5. Finally, some patterns are extremely reliable but have relatively low recall (for example “e.g.” and “i.e.”). Also, the high score of the pattern “is an” reflects the fact that all six types of central nervous system agents used in the KP discovery phase started with an “a” (see figure 11). This was an unintended effect, but it underscores the importance of choosing one's seed pairs carefully.

6.3.4 Synonymy patterns

Table 78 lists the precision and recall of all KPs discovered for synonymy, based on the judgments of the four domain experts and computed over all five experiments. Generally speaking, the patterns appear to be less domain dependent than was the case for some of the ISA patterns discussed in subsection 6.3.3. However, the bottom three patterns, “severe”, “mild” and “acute”, are presumably domain dependent.

67 57 when “haloperidol” is used as input term and 88 when the input is “antipsychotic(s)”

Table 78: Precision and recall of synonymy patterns (expert judgments)

Pattern        | F-score | precision | recall (of 16)
aka            | 0.60    | 0.64      | 0.56
also known as  | 0.57    | 0.58      | 0.56
is also called | 0.53    | 0.66      | 0.44
called         | 0.22    | 0.15      | 0.38
i.e.           | 0.20    | 0.17      | 0.25
is known as    | 0.09    | 0.06      | 0.25
or             | 0.06    | 0.04      | 0.19
refers to      | 0.06    | 0.04      | 0.12
see            | 0.05    | 0.03      | 0.12
means          | 0.02    | 0.01      | 0.06
was defined as | 0.00    | 0.00      | 0.00
severe         | NA      | NA        | NA
mild           | NA      | NA        | NA
acute          | NA      | NA        | NA

None of these patterns retrieve any synonyms, but this does not mean that they are completely useless. They appear to be dependent on a particular semantic type of argument, namely physiological phenomena. The reason they retrieve nothing in the five experiments is that none of the input terms denote physiological phenomena; they are all chemical substances. That they were discovered in the first place can be explained by the nature of some of the synonymy training pairs, for example “pruritus - itching” (see appendix 8.29). Secondly, it is observed that three of the synonymy KPs are also ISA KPs, namely the patterns “called”, “see” and “i.e.”. These three KPs are likely the culprits behind many of the hypernyms returned by the system in some of the synonymy experiments discussed in subsection 5.5.5. As “see”, at least, appears to be a relatively low performing pattern, it might be disqualified on these grounds. Finally, it is interesting that the informal abbreviation “aka” is the best performing pattern in terms of F-score. This is probably yet another effect of using WWW text snippets rather than a tidy collection of academic papers as a source of knowledge.

6.3.5 Conclusion

In summary, a number of high performing KPs have been identified for each of the four relation types. Analyzing and comparing individual KP performance scores across the four relation types gives rise to the following observations.

• Due to the greater search space of the causal relations and the greater complexity of the knowledge expressed by these relations, the individual performance of causal KPs tends to be lower than that of ISA and synonymy KPs.

• Text genre can be targeted by using particular KPs. For example, causal KPs including the verb “induce” tend to co-occur with the technical synonym “emesis” rather than “vomiting”.

• The four different ISA search templates differ significantly in terms of precision and recall. The highest precision is achieved by using the “b” template, the highest recall by using the “a” template (see table 76).

• Some KPs appear to be domain dependent.

• A few KPs may even require certain semantic types as arguments (for example “mild” and “acute”).

• Some KPs can instantiate multiple relation types. For example, there appears to be an overlap between ISA and synonymy.

Interestingly, the tense and modality of patterns for the “may_prevent” and the “induces” relations appear to be different. While “may_prevent” KPs tend to be non-modal and in the present tense, “induces” KPs tend to be qualified by modal verbs. Thus the top three “may_prevent” KPs are “helps prevent” (F=0.16), “prevents” (F=0.16) and “protects against” (F=0.14), while the top three “induces” KPs are “may cause” (F=0.20), “can cause” (F=0.18) and “can lead to” (F=0.17). This could be explained by the fact that drug manufacturers may wish to downplay the likelihood of encountering undesirable side effects of their drugs, for example aspirin, while stressing the certainty of their beneficial effects. However, this observed morphological difference between KPs instantiating the two causal relations is likely to be domain dependent. Finally, it is apparent that the crude precision scores computed using negative term pairs in section 4.2 are unrealistically high when compared to the actual precision scores computed on the basis of expert judgments. While this section discussed the precision and recall of individual KPs, section 6.4 will now examine the overall recall of WWW2REL and the proportion of “new” knowledge discovered by the system.

6.4 Recall versus UMLS and “new” knowledge

[...] using a gold standard for the evaluation of automatically constructed ontologies is sometimes problematic and may lead to wrong conclusions about the quality of the learned ontology. This is due to the fact that if the learned ontology does not mirror the gold standard, it does not necessarily mean that it is wrong. [Cimiano and Staab, 2005]

Although the above quote discusses the problem of evaluating complete ontologies which have been automatically generated from text collections, one faces the same problem when automatically extracting semantic relation instances from text, as these are in a sense ontology fragments. Just because a particular extracted relation instance is not recorded in the UMLS Metathesaurus does not mean that this instance is “wrong”, or irrelevant at any rate. In this section an evaluation of recall versus relations recorded in the UMLS Metathesaurus will be attempted, even if such an evaluation is likely to provide only a very conservative estimate of system recall compared to recall versus the gold standard generated by the four domain experts.

Table 79: Existing relations in the UMLS Metathesaurus

experiment        | #concepts for X | #term variants for X
selenium X        | 1               | 2
aspirin X         | 0               | 3
haloperidol X     | 6               | 33
X antipsychotic   | 82              | 213
X vomiting|emesis | 38              | 82
lactose           | NA              | 0
glucose           | NA              | 2
formaldehyde      | NA              | 6
progesterone      | NA              | 3
vitamin c         | NA              | 3

Table 80: Recall versus UMLS - “aspirin has_physiologic_effect X”

UMLS                               | WWW2REL near-match                             | judgment
decreased platelet aggregation     | significant platelet dysfunction               | 1,2,1,1
decreased platelet aggregation     | expected inhibition of platelet function       | 1,1,1,2
decreased platelet aggregation     | equivalent platelet inhibitory effects         | 1,1,2,1
decreased prostaglandin production | reduced prostaglandin hydroperoxidase activity | 2,1,2,1
decreased Thromboxane production   | thrombocytopenia                               | 1,1,1,1
decreased Thromboxane production   | its anti-thrombotic activity                   | 2,1,1,1

Table 79 summarizes the number of concepts already recorded in the UMLS Metathesaurus for each of the experiments on the ISA, “induces” and “may_prevent” relations. For the five synonymy experiments the number of concepts for X is not available (“NA”) because synonymy is not a conceptual relation. The main observation which can be made on the basis of this table is that there is virtually no information in the UMLS on the (side) effects of aspirin or on the beneficial effects of selenium. For “selenium may_prevent X” a single concept is recorded, namely “deficiency diseases”, but for “aspirin induces X” we get nothing. However, via an analogous UMLS relation type, named “has_physiologic_effect”, we do find three concepts associated with aspirin, each represented by only a single term. None of these are found by the system as exact strings, but WWW2REL does find the equivalent expressions listed in table 80. The example is simply meant to illustrate the dangers of computing recall automatically through simple string matching. In this case automatically computing recall versus the UMLS would have given 0%. For the non-causal experiments the UMLS does record a number of relations, and table 82 provides a breakdown of the proportion of “new” versus existing knowledge retrieved in each experiment. However, evaluating recall for the ISA relation introduces new challenges, both when finding hypernyms of “haloperidol” and when finding examples of antipsychotics.

Figure 38: Computing recall against the UMLS - ISA relations

One question is what taxonomical distances to accept when computing recall against the UMLS Metathesaurus. Figure 38 illustrates how there are multiple hypernyms for “haloperidol” and how these hypernyms may also have multiple hypernyms and, in other words, form a complex polyhierarchy. When computing recall for the “haloperidol ISA X” experiment in table 82, only the immediate hypernyms in the Metathesaurus (in this case five concepts) are taken into consideration. When computing the ratio of new versus existing knowledge for the “X ISA antipsychotic” experiment, it is important to observe that many of the antipsychotic drugs have a range of different trade names, in the case of “haloperidol” no less than seven (see figure 38). Since the trade names are recorded as individual concepts rather than term variants, they must be found by querying each of the 82 antipsychotic drug concepts in the Metathesaurus for the “has_trade_name” relation (see appendix 8.26). Another problem when trying to compute recall against the UMLS is the vast number of “unnatural” variants recorded for each concept. Table 81 provides an example for the concept “alcohol”. As illustrated by the table, there are no less than 15 UMLS term variants for the two alcohol concepts, most of which would never be found (as exact strings) in natural language text because of the parentheses and syntactic inversion. This will also “artificially” reduce recall if it is computed automatically. For this reason, the “recall” column in table 82 expresses the proportion of UMLS concepts for which one or more term variants are retrieved by the system. For the five synonyms, the recall percentage does express the proportion of UMLS term variants retrieved by the system.

Table 81: UMLS term variants of “alcohol”

Concept UI | term variant
C0001975   | alcohol
           | alcohols
           | alcohols (chemical class)
           | alcohol [chemical class]
           | alcohol in any form
           | alcohol preparation

C0001962   | alcohol
           | ethyl alcohol
           | alcohol, ethyl
           | alcohol, dehydrated
           | grain alcohol
           | alcohol, dehydrated preparation
           | alcohol, grain
           | alcohol, ethyl preparation
           | alcohol

Finally, the system's lack of sophisticated NLP may also skew the figures for both the proportion of new knowledge detected and the recall versus UMLS, had they been automatically computed. The three main problems are

1. conjunctions and lists

2. appositions

3. non-restrictive (non-essential) adjectival modifiers

As for the conjunctions, the recall for “formaldehyde” in table 82 is, in fact, 57% (4/7), and not 29% as would have been the result of an automatic string matching operation. This is because the instance “formic aldehyde or methyl aldehyde” is not recorded in the UMLS in this exact form, but as the two separate parts of this disjunction. The same is the case with “vitamin C”, for which the system returns five different strings judged to be correct, but when morphosyntactic variations like “ascorbic acid and|or ascorbate” are conflated, only three term variants remain, bringing the proportion of new knowledge down to 33% (1/3). Also, in the experiment on finding antipsychotics, the degree of “new” knowledge retrieved by the system would be artificially high if computed automatically. Both “risperdal” and “zyprexa”, for example, are recorded as antipsychotics in the UMLS, but the two system candidates “zyprexa and risperdal” and “zyprexa or risperdal” are obviously not. An example of a correct candidate expressed as an apposition is “muscarinic receptor agonist, xanomeline”, in which a hypernym accompanies the antipsychotic drug “xanomeline”. In this case, “xanomeline” is, in fact, not registered as an antipsychotic in the Metathesaurus, but even if it had been registered, the candidate would have been counted as a new term when using simple string matching techniques.
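The kind of manual decomposition described here can be partially mimicked in code. The Python sketch below handles only the easy cases (explicit “and”/“or” coordination, comma-separated parts, and a small fixed list of modifiers assumed to be non-restrictive); it would fail on the harder elliptical cases discussed further on.

    import re

    NON_RESTRICTIVE = {"conventional", "typical", "classical", "traditional"}

    def normalize(candidate):
        # Split on commas and on "and"/"or" coordination, then strip
        # the listed (assumed non-restrictive) modifiers from each part.
        parts = re.split(r",\s*|\s+(?:and|or)\s+", candidate)
        cleaned = []
        for part in parts:
            tokens = [t for t in part.split() if t not in NON_RESTRICTIVE]
            if tokens:
                cleaned.append(" ".join(tokens))
        return cleaned

    print(normalize("ascorbic acid or ascorbate"))   # ['ascorbic acid', 'ascorbate']
    print(normalize("conventional antipsychotics"))  # ['antipsychotics']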

Table 82: “New” knowledge retrieved per experiment (manual analysis)

experiment      | recall vs. UMLS | A = correct terms | % of A not in UMLS
selenium X      | 0% (0/1)        | 49                | 100%
aspirin X       | NA              | 143               | 100%
X vomiting      | 5% (2/38)       | 61                | 95%
X emesis        | 3% (1/38)       | 25                | 96%
haloperidol X   | 80% (4/5)       | 48                | 69%
X antipsychotic | 30% (25/82)     | 57                | 39%
lactose         | NA              | 1                 | 100%
glucose         | 100%            | 3                 | 33%
formaldehyde    | 57% (4/7)       | 5                 | 20%
progesterone    | 0% (0/7)        | 2                 | 100%
vitamin c       | 66% (2/3)       | 3                 | 33%

Table 83: New relation instances retrieved by WWW2REL (examples)

experiment                | terms not recorded in the UMLS Metathesaurus
selenium may_prevent X    | prostate cancer risk, lung cancer risk, DNA damage, ...
aspirin induces X         | ulcers, stomach irritation, bronchoconstriction, tinnitus, ...
haloperidol ISA X         | CNS-depressant drug, less-sedating neuroleptic, ...
X ISA antipsychotic       | amisulpride, bretazenil, fluoxetine, iloperidone, zeldox, ...
X induces vomiting        | meningitis, overeating, stomach virus, salmonella, zoloft, ...
X induces emesis          | apomorphine, carboplatin, cisplatin, chemotherapy, tramadol, ...
synonyms of lactose       | milk sugar
synonyms of glucose       | corn sugar
synonyms of progesterone  | progestin, progestogen
synonyms of formaldehyde  | methylene oxide
synonyms of vitamin C     | ascorbate

Finally, the lack of lemmatization is less of a problem, because plural forms are usually included in the sets of term variants of concepts in the Metathesaurus. Non-restrictive adjectival modifiers like “conventional antipsychotics” are rare in the case of specific drug names, but very common when looking for hypernyms of the drug “haloperidol” (see table 42 for examples). These would also artificially boost the degree of “new” knowledge and lower recall if computed automatically. These cases again stress the potential usefulness of upgrading the system with more comprehensive morphosyntactic analysis. However, it should be noted that an automatic analysis of the three linguistic phenomena listed above is not straightforward. Conjunctions, for example, often involve ellipses which are non-trivial to detect automatically68. Also, it may be hard to determine automatically whether an adjectival modifier is restrictive or not. Thus in the absence of an advanced NLP module, both recall figures and the proportion of new knowledge detected by the system (table 82) are based on a manual analysis of system output in which conjunctions, lists and appositions are decomposed, while non-restrictive adjectival modifiers are stripped from the NPs. Looking at the percentages in table 82 and the examples in table 83, new instances are primarily retrieved for the causal relation types. Indeed, nearly 100% of all causal instances judged to be correct by the experts are not recorded in the UMLS. For the ISA relation the proportion of new knowledge is somewhat lower; especially the coverage of antipsychotic drugs in the UMLS appears fairly comprehensive (only 39% of all correct instances are not recorded). As regards the relatively high proportion of new hypernyms proposed for “haloperidol” (69%), this can be explained by the fact that many of the hypernyms are somewhat more superordinate than is allowed when computing recall. Examples include “medication”, “medicine”, “drug” and “antagonist”69. With the two exceptions of “progesterone” and “lactose”, the proportion of new knowledge is considerably lower in the synonymy experiments. One reason is that the number of synonyms for a particular concept is much lower, and presumably also more fixed, than the number of new drugs of a particular category or the number of side effects discovered for a particular drug. Also, as mentioned in subsection 5.5.5, the two new synonyms retrieved for “progesterone” are actually both questionable according to the UMLS Metathesaurus. Although domain experts might question the usefulness of enriching an ontology like the UMLS with layman expressions like “corn sugar” or “milk sugar”, such expressions are certainly terminologically relevant and could be useful to classify and include in an ontology to be used, for instance, when communicating medical texts to the wider public. Also, it must be stressed that the biomedical domain is only meant as a case study and that WWW2REL is envisioned as a domain independent system which might be applied to enrich ontologies in domains featuring less systematic and perhaps more ambiguous nomenclature.

68e.g. “persistent stomach upset or {Ø} pain”
69Indeed, many of these candidates are borderline cases with average correctness judgments of 1.50.

Finally, it is perhaps surprising that recall versus the UMLS for the “X induces vomiting|emesis” experiment is so low. One may speculate that the search space of these queries is simply too large to expect high recall in a small sample of text snippets. On the other hand, lots of correct relation instances are retrieved, although some may not be considered relevant in a strictly biomedical context. The only correct candidates returned by the system and recorded in the UMLS are “ipecac” and “ipecac syrup”.

6.4.1 Conclusion

This section discussed and illustrated the problems of automatically assessing the recall of a relation extraction system against a gold standard terminological resource. Even with more sophisticated NLP than is presently implemented, computing recall automatically will result in an erratic or, at best, very conservative estimate. The main reason is that linguistic variation will always be greater in natural language than in terminological resources, and that mapping term variants onto concepts with high precision will typically require human intervention. The problem of linguistic variation is compounded by using the WWW as a knowledge source rather than academic writing, which follows more predictable patterns. For these reasons measurements of recall versus the UMLS were based on a manual analysis of system output. Recall versus the UMLS ranged from 100% in the task of finding synonyms of “glucose” and 80% for finding hypernyms of “haloperidol” to 0% in the case of synonyms of “progesterone”. In the matter of determining the proportion of new knowledge retrieved by the system, an automatic evaluation is obviously not feasible. Based on the expert judgments and a manual analysis of system output, the proportion of new versus recorded knowledge ranged from 100% in the case of the causal relations and synonyms of “lactose” to 20% in the task of finding synonyms of “formaldehyde”. In conclusion, even if WWW2REL may have a relatively low recall versus the UMLS (at least in some of the experiments), it does discover lots of new knowledge, and this is essentially the main purpose of the system.

6.5 Domain specificity of relations and KPs

Having thoroughly examined the precision and recall of knowledge patterns instantiating four different relation types, it is now time to briefly investigate the third quality parameter of a KP, namely its portability across domains (also mentioned in chapter 1). Assessing a pattern's true portability would require an extensive analysis of its use in multifarious domains, but even if this section only features an analysis of a single other domain, this should at least give some indication of the domain dependence of each KP as well as of more global portability issues. The domain used in this comparison is Information Technology (IT). IT was selected for two reasons. Firstly, in contrast to the domain of Biomedicine, IT is more severely affected by determinologization processes (see the discussion in subsection 2.2.3), and thus in addition to highlighting portability issues, the comparison may also reveal additional challenges related to this phenomenon. Secondly, in the absence of a panel of experts like the four pharmacists, the author is able to act as an expert for this domain.

Table 84: “perl ISA X” - top 10 candidates

candidate (“kpr_head”)                judgment   candidate (“kpr_bnc_head”)   judgment
language                              2          oopl                         1
programming language                  1          tools oreilly                3
scripting language                    1          improbables                  3
interpreted language                  1          envt ajaxtk                  3
object-oriented language              1          esoterica                    3
interpreted programming language      1          lanuage                      2
oo language                           1          optimizer                    3
implementation language               1          envt                         3
network-capable high-level language   1          tools amazonuk               3
its macro language                    2          scripting                    2
...                                              ...

Six small experiments are carried out to test the domain specificity of the filtered KPs for synonymy, ISA, “induces” and “may_prevent”, respectively. The findings of each experiment are described in the following subsections. Although the experiments do not feature as detailed an analysis of the performance of the various ranking schemes as in section 5.4, the effects of the BNC-based filter on precision will also be discussed.

6.5.1 The ISA relation

“Perl ISA X” Table 84 lists the top 10 candidates returned for the ISA relation when using “perl” as an input term, grouping by NP head and ranking by KP range without (kpr) and with (kpr_bnc) the BNC filter. In this case the BNC filter has an absolutely disastrous effect on precision, as the NP head “language” gets heavily penalized and is replaced by incorrect, noisy candidates including spelling mistakes (“lanuage”) and bits of URLs. When turning the BNC filter off, head grouping proves a useful mechanism, not just for precision, but also because it may provide not just one but several different hypernyms. Thus “programming language” is a more superordinate hypernym than “interpreted language” or “object-oriented language”, and just by studying this top 10 a terminologist can build the first part of an entire taxonomy of programming languages.
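Head grouping, as applied here, can be approximated by bucketing candidates under their final noun and aggregating their scores; a minimal Perl sketch with illustrative candidates and KP range values:

#!/usr/bin/perl -w
use strict;

# candidate NP => range of distinct KPs it occurred with
# (illustrative values, not actual system output)
my %kpr = (
    "programming language"     => 7,
    "interpreted language"     => 4,
    "scripting language"       => 5,
    "object-oriented language" => 3,
);

# bucket candidates under their head noun (last token)
my %by_head = ();
foreach my $np (keys %kpr) {
    my @tokens = split(/ /, $np);
    $by_head{$tokens[-1]} += $kpr{$np};
}

# rank heads by their aggregated KP range
foreach my $head (sort { $by_head{$b} <=> $by_head{$a} } keys %by_head) {
    print $by_head{$head}, "\t", $head, "\n";
}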

“X ISA programming language” When looking for hyponyms, head grouping does not improve precision. This is especially true in the case of programming languages, which have very dissimilar names. Thus table 85 lists the top 10 candidates returned for the ISA relation when the input term is “programming language(s)”, ranking by KP range without (kpr) and with (kpr_bnc) the BNC filter. In this case precision for top 10 is 100%, but the BNC filter should not have the credit for this performance. Although it does not reduce precision for top 10, it does unfairly penalize the programming language “java”, which occurs 218 times in the BNC, but always in the sense of the Indonesian island rather than the programming language. We know this because Java (the language) did not exist before 1995 and all

Table 85: “X ISA programming language” - top 10 candidates

candidate (“kpr”)   judgment   candidate (“kpr_bnc”)   judgment
java                1          c                       1
c                   1          perl                    1
c++                 1          visual basic            1
visual basic        1          c#                      1
lisp                1          html                    1
html                1          cc++                    1
perl                1          javascript              1
c#                  1          c and c++               1
prolog              1          c, c++                  1
cc++                1          cobol                   1
...                            ...

text in the BNC predates 1994[70]. This example illustrates that while the BNC filter proved useful in the domain of Biomedicine, it can have dangerous side effects in domains like IT where many terms are coined by extending the meanings of existing lexical units. The portability of the ISA KPs can be inferred from the frequency numbers in table 86, which are based on both experiments. Generally speaking, the patterns in the lower part of the right column appear to have poor portability as they have low recall in the domain of IT. The two biomedical patterns “drugs such as” and “agents such as”, which ranked number two and three for their template in terms of both precision and recall (see table 77), are thus, not surprisingly, useless in IT.
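Frequency figures like those in table 86 could be collected with the same result-count scraping used by google2frequencies.pl in appendix 8.11; a minimal sketch, in which the term, the patterns and the 2006-era Google result-count markup are assumptions carried over from that script:

#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;

# Probe how often each KP co-occurs with an IT term by recording
# the hit count of the exact phrase "<term> <KP>".
my $term = "perl";
my @kps  = ("such as", "and other", "is a new", "drugs such as");
my $mech = WWW::Mechanize->new();
$mech->agent_alias('Windows IE 6');
foreach my $kp (@kps) {
    $mech->get("http://www.google.com/search?q=\"$kp $term\"");
    my $hits = 0;
    # result-count regex as assumed by google2frequencies.pl
    if ($mech->content() =~ /of about <b>([0-9,]+)<\/b> for/si) {
        ($hits = $1) =~ tr/,//d;
    }
    print $hits, "\t", $kp, "\n";
}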

6.5.2 The causal relations

This subsection features two small experiments, one for “may_prevent” and one for “induces”, which will assess the portability of KPs discovered in chapter 4 for these two UMLS relations.

“Firewall(s) X” Table 87 lists the top 10 candidates returned, this time for the “may_prevent” relation, when the input term is “firewall(s)”, grouping is applied and ranking is by KP range without (kpr) and with (kpr_bnc) the BNC filter. Again the BNC filter does not improve precision; in fact, it reduces it. The effect of turning on the BNC filter is not as disastrous as in the “perl ISA X” experiment, but this is really just a coincidence, because all text in the BNC predates 1994, i.e. stems from the blissful era before nasty things like “spyware” and other “malware” were concocted. The example reveals that there are actually two factors which endanger the use of the BNC as a filter in the domain of IT.

1. determinologization

[70] This also explains why older languages like “prolog” and “lisp” get (fairly) penalized

Table 86: Portability of ISA KPs

pattern (template)   IT frequency   pattern           IT frequency
such as (b)          214            see               60
such as (a)          205            i.e.              58
and other (d)        199            agent             45
or other             189            drug              31
with other           180            other             27
like                 167            agents            21
as (c)               165            is an effective   19
and other (c)        145            properties of     18
as (d)               144            another           18
is                   141            drugs             17
is an                132            effect of         7
called               128            activity          4
including            124            effects of        4
as an                97             activity of       3
include              93             action of         2
has                  93             exerts its        0
is a new             92             drugs such as     0
a new                91             efficacy of       0
has an               88             agents such as    0
an                   77             actions of        0
e.g.                 65

Table 87: “firewall(s) may_prevent X” - top 10 candidates

candidate (“head_kpr”)        judgment   candidate (“head_kpr_bnc”)   judgment
access                        1          spyware                      1
unauthorized access           1          viruses and spyware          1
unwanted access               1          bandwidth free spyware       2
external access               1          java applet                  1
network access                1          applet                       1
internet access               1          bandwidth                    3
unauthorised access           1          more your bandwidth          3
your access                   1          access                       1
unauthorized network access   1          unauthorized access          1
inbound access                1          unwanted access              1
...                                      ...

Table 88: “firewall(s) may_prevent X” - top 10 heads

candidate (“head_kpr”)   candidate (“head_kpr_bnc”)
access                   spyware
attacks                  applet
traffic                  bandwidth
users                    access
connections              hackers
hackers                  malware
number                   attacks
use                      traffic
computer                 signaling
security                 transversal
...                      ...

2. practice of term formation by semantic extension of existing lexical units

Whereas the use of “spyware” by a considerable number of non-experts is a consequence of the first factor, “languages” are an example of the latter factor. Of course, there are also multiple examples of IT terms created by semantic extension and subsequently affected by determinologization. Table 88 lists the top 10 heads when grouping by the same two schemes. It reveals that the concept of “bandwidth” had yet to be determinologized in the 1980s and early 1990s. Using a more recent general language corpus as a filter would presumably yield quite different (i.e. worse) results here. The frequency numbers in table 89 provide information about the portability of the “may_prevent” KPs. Out of the 101 KPs discovered for “may_prevent”, only 57 co-occur with “firewall” at least once on the entire WWW (the part indexed by Google). Table 89 lists examples of these domain-dependent patterns, but also a few of the patterns which appear to be portable.
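The discounting heuristic itself (defined in section 5.3) essentially pushes candidates with a high general language frequency down the list. The following minimal sketch uses a log-based stand-in formula and made-up frequency values, purely for illustration:

#!/usr/bin/perl -w
use strict;

# Illustrative KP range scores and (made-up) BNC frequencies.
# A candidate frequent in the BNC is discounted; "spyware" and
# "bandwidth", absent from or rare in the pre-1994 BNC, would
# escape the penalty, which is exactly the risk discussed above.
my %kpr = ("spyware" => 6, "access" => 9, "bandwidth" => 4);
my %bnc = ("spyware" => 0, "access" => 11862, "bandwidth" => 34);

foreach my $cand (sort keys %kpr) {
    my $score = $kpr{$cand} / log(2 + $bnc{$cand});
    printf("%.2f\t%s\n", $score, $cand);
}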

“Computer virus(es) X” Table 90 lists the top 10 candidates returned, this time for the “induces” relation, when the input term is “computer virus(es)”, head grouping is activated and ranking is by KP range without (kpr) and with (kpr_bnc) the BNC filter. In this case applying the BNC filter actually helps a little bit, in that “computer system outages” ranks as number one. However, most candidates in both columns of the table are equally vague, although some are correct (e.g. “loss” and “damage”). This example illustrates how using the WWW to extract semantic relations can be difficult if the target domain is undergoing determinologization. Using “computer virus” as input rather than just “virus” may have contributed to part of the vagueness in the results. In this query “computer” is used as a domain label to disambiguate the otherwise polysemous word “virus” (and avoid hits from biomedical text sources, for example). However, as shown in extensive empirical investigations in [Halskov, 2005b, Halskov, 2005c, Halskov, 2005a], IT experts rarely use domain labels when they communicate with each other; these domain labels are typically

Table 89: Portability of “may_prevent” KPs

pattern         IT frequency   pattern                  IT frequency
may_prevent     141            had significantly less   0
to prevent      122            significantly reduced    0
to reduce       108            based                    0
help prevent    104            can cure                 0
preventing      94             to cure                  0
prevent         93             does not cure            0
...                            heals                    0
for relieving   0              numbs                    0
relieves        0              deters                   0
relieve         0              works against            0
in relieving    0              lessens                  0
in treating     0              can relieve              0
decreased       0              combats                  0
may ease        0              alleviates               0
relieved        0              use in                   0
at preventing   0              ...

Table 90: “computer virus(es) induce X” - top 10 candidates

candidate (“kpr_head”)           judgment   candidate (“kpr_bnc_head”)   judgment
damage                           2          computer system outages      1
serious damage                   2          false-positives              3
real damage                      2          havoc                        2
irreversible damage              2          been causing havoc           2
costly damage                    2          major havoc                  2
very serious damage              2          damage                       2
world wide catastrophic damage   2          serious damage               2
data loss and/or damage          1          real damage                  2
extensive damage                 2          irreversible damage          2
significant damage               2          costly damage                2
...                                         ...

Table 91: Portability of “induces” KPs

pattern            IT frequency   pattern             IT frequency
causing            124            can induce          0
can cause          95             and induce          0
can result in      63             induced             0
cause              57             for inducing        0
causes             55             poisoning include   0
caused             49             poisoning           0
to cause           37             provokes            0
associated with    33             will produce        0
could cause        27             produced            0
...                               in producing        0
to induce          0              may aggravate       0
induces            0              aggravates          0
overdose include   0              may experience      0
which induces      0              may lead to         0
induce             0              can lead to         0
or induce          0              ...

used as a disambiguation device in non-expert discourse where the domain context is not established from the outset. As we cannot expect non-experts to provide us with a technical account of the effects of viruses on computer systems, the vague arguments returned in this experiment are not really surprising. Extracting (terminologically relevant) semantic relation instances from the WWW thus implies special challenges for the domain of IT which are less pronounced in Biomedicine. Finally, table 91 gives an indication as to the portability of the “induces” KPs. Overall, only 42% (30 out of 71) of these patterns occur at all with the term “computer virus(es)”, and judging by the zero occurrence patterns in the table, verbs like “induce”, “produce” and “aggravate” are domain specific markers of causality and cannot be ported to just any other domain.

6.5.3 Synonymy

Synonyms of “subroutine” Table 92 lists the top 10 candidate heads returned, this time for synonymy, when the input term is “subroutine”, ranking is by KP range and head grouping is on, with and without the BNC filter. With the filter turned off, five correct synonyms are among the top 10 candidate heads, namely “function”, “method”, “subprogram”, “procedure” and “sub”. However, when the filter is turned on, no correct candidate heads are returned among the top 10. Finally, table 93 lists all 15 synonymy KPs as ranked by their frequency of occurrence in the “subroutine” experiment. With the exception of the four patterns “acute”, “mild”, “severe” and “was defined as”, all KPs also occur as knowledge probes in the IT domain. This would indicate that synonymy is the most domain-independent relation type of the four relations investigated in this thesis, and that the synonymy patterns

Table 92: Synonyms of “subroutine” - top 10 candidate heads

candidate (“kpr-head-bnc”)   judgment   candidate (“kpr-head”)   judgment
subprogram                   2          function                 1
coderef                      3          method                   1
366                          3          closure                  3
mysub                        3          subprogram               2
canceling                    3          variable                 3
306d                         3          procedure                1
autoload                     3          c                        3
createchildcontrols          3          constructor              3
fdate                        3          parameter                3
minimumvalue                 3          sub                      1
...                                     ...

Table 93: Portability of synonymy KPs

pattern          IT frequency
or               314
called           63
see              48
i.e.             41
refers to        41
means            38
also called      22
is known as      18
aka              11
also known as    10
is also called   4
acute            0
mild             0
severe           0
was defined as   0

Figure 39: Proportion of KPs making a contribution in tests for two different domains

could readily be ported to other domains.

6.5.4 Conclusion

In conclusion, this section tested both the domain dependence of KPs discovered using biomedical term pairs and the portability of the effective BNC heuristic from Biomedicine to the domain of IT.

Firstly, as can be seen from the histograms in figure 39, it was discovered that the KPs for synonymy appeared truly domain-independent. The patterns for the ISA relation appeared relatively portable, while KPs for the two causal relations were the least portable of them all. However, by avoiding certain verbs like “induce”, “aggravate” and “produce”, for example, a wide range of causal markers could also be applied to find similar relation instances in another domain.

As to the usefulness of the BNC filter as part of a ranking scheme in the domain of IT, this cannot be recommended. Term formation by semantic extension of existing lexical units and the effects of determinologization make simple discounting by BNC frequency decidedly counter-productive or, at best, erratic. For such domains a more viable solution may be to exchange the BNC for a number of specialized reference corpora representing domains for which no formal similarities with the terminology of the target domain are expected.

[Pantel and Pennacchiotti, 2006] observe that the recall of relation extraction systems can be boosted (while maintaining precision) by employing a host of generic KPs but requiring that at least one high precision KP occur with each candidate instance. This is undoubtedly true, but the KP portability experiments presented in this section seem to indicate that many of these high precision KPs tend to be domain specific (“agents such as” and “drugs such as” are two cases in point), thus reducing the portability of the system.

7 Conclusion

In general terms, the main contributions of the thesis can be summarized by the following headings.

1. The development of a methodology to discover and filter knowledge patterns (KPs) for any semantic relation type on the WWW and to use these KPs to find instances of the relations also on the WWW.

(a) Devising an “iteration range” filter minimizing noise in the KP discovery phase.
(b) Devising and evaluating a set of schemes and two heuristics (see section 5.3) with which to rank the retrieved relation instances by their reliability.

2. A discussion and an analysis of the problem of measuring the recall of relation extraction systems against existing terminological resources like the UMLS Metathesaurus, particularly assessing the proportion of new knowledge found by such systems.

3. An analysis of the domain dependence of filtered KPs representing four different relation types.

4. An analysis of the domain dependence of heuristics which proved useful in the Biomedicine domain.

The following section summarizes key results as grouped by each of the contribution areas.

7.1 Key results

7.1.1 Methodology for pattern-based relation instance extraction

The implementation and evaluation of the WWW2REL ontology extension methodology revealed that simply issuing unrestricted queries to web search engines and retrieving and processing semantically unannotated text snippets was a surprisingly useful way both of discovering KPs for a particular domain (Biomedicine) and of filtering these KPs and extracting terminologically relevant relation instances by means of the reduced KP sets. Although using the WWW as a knowledge source does involve more noise in the empirical data, this noise can relatively easily be minimized by measuring the KP range of each candidate instance (the “kpr” scheme) and using this as an indicator of its reliability. In fact, the performance of WWW2REL in terms of precision appears to be on a par with the most comparable relation extraction system (see subsection 5.5.8). It was also noted that many other systems operate on academic papers, often semantically annotated, and employ various non-portable filtering techniques (see table 7 in subsection 3.4.11). On the negative side, the evaluation revealed that various minor extensions of the NLP implemented in WWW2REL are likely to improve performance. Possible improvements in this direction are listed in section 7.2.

KP discovery The results of the KP discovery experiments (section 4.1) revealed that

• term pairs used to discover KPs may need to be lexically filtered to avoid data sparseness even when querying the entire WWW.

• some relation types (e.g. ISA) may require several search templates because the form of their KPs can be morphosyntactically sensitive.

• forcing a verb in the KPs seems to eliminate much noise, but this may be too restrictive for some relation types (e.g. synonymy and ISA).

KP filtering The results of the KP filtering experiments (section 4.2) include

• the introduction of a novel KP filtering technique dubbed “iteration range filtering”. This technique appeared to be effective for relation types like ISA and synonymy for which no linguistic restrictions on the pattern form were enforced.

• the establishment of a set of principles for the selection of negative term pairs as part of an additional KP filter. For example, including negative pairs which instantiate relation types semantically close to the target relation.

Relation instance extraction and ranking Overall, the results presented in chapter 5 showed that relation instances can be extracted with high precision without weighting the individual patterns (as is done in [Pantel and Pennacchiotti, 2006]), but by simply computing the range of filtered KPs with which each instance co-occurs and optionally applying one or two heuristics. However, instances of non-hierarchical relations, in this case the causal relations “induces” and “may_prevent”, proved harder to extract automatically than instances of ISA and synonymy relations. The experts often disagree (see subsection 5.2.5), and sometimes it is not known, or documented, whether a particular drug, for example, is associated with a particular beneficial or unwanted physiological effect. Although WWW2REL was able to identify correct instances with high precision in four out of five synonymy experiments, these experiments revealed that finding synonyms in the strict terminological sense of the word is somewhat harder than finding terms which are just “semantically related”. Finally, when looking for things which induce a certain side effect (“vomiting” in the test case), using a more domain specific synonym for this side effect (“emesis” in the test case) significantly boosted both precision and recall. Six ranking schemes and two heuristics were devised, thoroughly tested and evaluated. The overall best strategy proved to be a ranking of instances by the range of different KPs with which they occur, i.e. the scheme “kpr”, or alternatively “fkpr” (see section 5.3 for a description). Activating the BNC discounting heuristic, which penalizes instances having a high frequency in a general language corpus, boosted precision and recall even further. Another heuristic of grouping instances by their NP head also had a positive impact on performance, and, in fact, applying both heuristics simultaneously gave the overall best performance (see table 70).
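In its simplest form, the “kpr” scheme just counts the distinct filtered KPs with which each candidate co-occurs; a minimal Perl sketch over hypothetical (candidate, KP) extraction tuples:

#!/usr/bin/perl -w
use strict;

# Each extraction event pairs a candidate with the KP that
# matched it; the "kpr" score of a candidate is the range
# (number of distinct types) of KPs it was seen with.
my @extractions = (
    ["drowsiness", "may cause"],
    ["drowsiness", "can induce"],
    ["drowsiness", "side effects include"],
    ["miracles",   "may cause"],
);

my %kps_seen = ();
foreach my $e (@extractions) {
    my ($cand, $kp) = @$e;
    $kps_seen{$cand}->{$kp}++;
}

foreach my $cand (sort { keys %{$kps_seen{$b}} <=> keys %{$kps_seen{$a}} } keys %kps_seen) {
    print scalar(keys %{$kps_seen{$cand}}), "\t", $cand, "\n";
}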

7.1.2 Measuring recall and new knowledge

For all four tested relation types, the implemented system, WWW2REL, was able to extract new relation instances not recorded in the most comprehensive terminological resource for the domain of Biomedicine, namely the UMLS Metathesaurus. A manual analysis of the degree of “new” (unrecorded) knowledge retrieved showed that 100% of the correct causal relation instances (“induces” and “may_prevent”) were new. 69% of the correct hypernyms retrieved for “haloperidol” were new, 39% of the correct hyponyms retrieved for “antipsychotic(s)” were new, and between 20% and 100% of the correct synonyms retrieved were new, or unrecorded, in the UMLS. However, a number of challenges make it difficult to measure automatically the degree of “new” knowledge and system recall versus the UMLS (section 6.4).

• unnatural term variants in the gold standard artificially reduce recall.

• non-restrictive adjectival modifiers in system candidates artificially reduce recall, but boost the degree of “new” knowledge found.

• the lack of decomposition of conjunctions, lists and appositions artificially reduces recall, but boosts the degree of new knowledge found.

7.1.3 Domain specificity of KPs

As is rarely done in comparable work, the domain portability of the filtered KPs was investigated empirically (section 6.5). It was concluded that, in the case of Biomedicine vs. IT, synonymy KPs were completely portable. ISA KPs were slightly less portable, and only between half and two-thirds of the causal KPs were portable. However, the domain-specific KPs did not appear to reduce system precision, since high precision was also achieved in the domain of IT for all four relation types.

7.1.4 Domain specificity of instance filtering heuristics

The BNC discounting heuristic seemed to be domain dependent (see the experiments in section 6.5), in that a number of correct instances, for example the programming language “java”, were unfairly penalized due to their formal similarity with words referring to non-target domain senses. The head grouping heuristic, on the other hand, appeared to be domain independent and was useful not only in Biomedicine but also in IT. In conclusion, the benefits of activating the two heuristics depend on

1. the size and homogeneity of the search space

(a) if large and homogeneous (e.g. finding hypernyms or effects given a drug)
    i. activating head grouping will boost performance
(b) if small or heterogeneous (e.g. finding synonyms or drugs given an effect)
    i. head grouping is less helpful

2. the target domain and the assumed specificity of the target arguments

(a) if terms in the target domain are formed by semantic extension of general language lexical items (e.g. IT)
    i. activating BNC discounting can be counter-productive
(b) if target terms are less specific than the input term (e.g. finding synonyms of “lactose”)
    i. activating BNC discounting can be counter-productive
(c) if target terms are highly specialized (e.g. finding hyponyms of a drug)
    i. activating BNC discounting will boost performance
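Taken together, these rules of thumb could be encoded as a simple configuration chooser; a minimal sketch in which the boolean properties are hypothetical inputs that a terminologist would supply per experiment:

#!/usr/bin/perl -w
use strict;

# Decide which of the two heuristics to activate, following the
# decision list above (a sketch, not part of WWW2REL).
sub choose_heuristics {
    my %q = @_;
    my %use = (head_grouping => 0, bnc_discount => 0);
    $use{head_grouping} = 1 if $q{large_homogeneous_space};
    $use{bnc_discount}  = 1 if $q{targets_highly_specialized}
                            && !$q{terms_by_semantic_extension};
    return %use;
}

my %cfg = choose_heuristics(
    large_homogeneous_space     => 1,  # e.g. hypernyms of a drug
    targets_highly_specialized  => 1,
    terms_by_semantic_extension => 0,
);
print "head grouping: $cfg{head_grouping}, BNC discount: $cfg{bnc_discount}\n";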

7.2 Future work

This section outlines directions in which future work might progress along two dimensions, namely an application-oriented one and a basic research-oriented one. The first dimension would involve augmenting the functionality and performance of future versions of WWW2REL, and the second dimension would involve a range of research questions including, for example, a more comprehensive analysis of the nature of knowledge patterns. Many of the following subsections on possible future work were also listed as thesis limitations in section 1.4.

7.2.1 Empirical data and sparseness issues

Firstly, testing to what extent using WWW text snippets complicates the relation extraction task. What sort of performance increase might be gained by applying WWW2REL to a tidy collection of MEDLINE abstracts? Also, how big a problem is data sparseness when using a limited collection of MEDLINE abstracts versus the entire WWW?

Secondly, using a greater range of input terms to explore the effect of variables like term frequency on system performance. The input terms used in the eleven experiments in this thesis were all relatively frequent, and it is not inconceivable that using less frequent input terms might result in data sparseness.

Thirdly, the threshold values of the KP filters described in subsection 4.2.5 were set empirically, but it would be interesting to further investigate these values and evaluate the performance impact of different threshold combinations in different settings. Presumably, the optimal selection of filters and threshold values will be determined not only by the target semantic relation type, but also by the domain specificity, or noisiness, of the text source.

Fourthly, will the discounting heuristic perform even better if the BNC unigram frequencies are exchanged for the n-gram statistics recently made public by Google? It is expected that such an exchange will have little effect, because there is a fairly strong correlation between the BNC unigram frequencies and the corresponding Google unigram hit counts. Also, when combining the discounting heuristic with the head grouping heuristic, there is no need to go beyond unigrams. Unless, of course, one aims to build complete taxonomies in a recursive fashion as discussed in e.g. [Gillam, 2004, Gillam et al., 2005].
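Verifying the claimed correlation would amount to computing, for example, Pearson's r over paired unigram counts; a minimal sketch with placeholder numbers (not measured values):

#!/usr/bin/perl -w
use strict;

# Pearson's r over paired BNC and Google unigram counts
# (the counts below are placeholders for illustration only).
my @bnc    = (314, 63, 48, 41, 38);
my @google = (2_100_000, 450_000, 380_000, 290_000, 260_000);

sub pearson {
    my ($x, $y) = @_;
    my $n = scalar(@$x);
    my ($sx, $sy, $sxx, $syy, $sxy) = (0, 0, 0, 0, 0);
    for (my $i = 0; $i < $n; $i++) {
        $sx  += $x->[$i];       $sy  += $y->[$i];
        $sxx += $x->[$i]**2;    $syy += $y->[$i]**2;
        $sxy += $x->[$i]*$y->[$i];
    }
    return ($n*$sxy - $sx*$sy) /
        sqrt(($n*$sxx - $sx**2)*($n*$syy - $sy**2));
}
printf("r = %.3f\n", pearson(\@bnc, \@google));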

Finally, one might explore the impact of manipulating Google variables like “daterange” and “site” on system performance. Will daterange restrictions weed out obsolete terms and improve performance? Will restricting searches to authoritative and domain-specific web sites improve performance? It is expected that such restrictions will cause data sparseness, reduce the portability of the system and not improve performance significantly.

7.2.2 Language of analysis

Applying the WWW2REL ontology extension framework to a minority language like Danish to see whether data sparseness issues will arise. It is the expectation that for highly specialized domains like Biomedicine there will not be enough textual content in Danish to reliably discover KPs from WWW text snippets, because few biomedical text books, academic papers and other terminological resources are available in Danish due to the ongoing, international anglification.

7.2.3 Domain of analysis

Applying WWW2REL to a range of different domains to further test the domain specificity of the discovered KPs, but also of the ranking schemes and heuristics. While homographs are rarely an issue when restricting analysis to domain-specific text, they are a very real problem when searching for knowledge on the entire WWW. The experiments for the domain of IT in section 6.5 illustrated that using domain tags like “computer virus” can be a solution for input terms which have homographs in non-target domains. However, they also revealed that using such tags may lead to the extraction of semantically vague arguments, because domain tags are primarily used in communication involving non-experts. In the case of Biomedicine no such input term disambiguation was necessary because terms like “haloperidol”, “aspirin” and “selenium” have no homographs in other domains. In this sense, the Biomedicine domain may, in fact, be a relatively benign case compared to many other domains where using uncategorized WWW text may require a word sense disambiguation module.

7.2.4 NLP improvements

As mentioned in chapters 4, 5 and 6, a more sophisticated NLP module may improve system performance both in terms of precision and recall. Key areas which could be improved include the following.

• automatic decomposition of conjunctions and lists in relation instances
• more sophisticated treatment of PP attachments in relation instances
• lemmatization of relation instances
• disallowing input term modifiers

Full parsing, however, is not deemed to be worth the effort, or even possible, because of the noisy and fragmented nature of the text snippets making up the system’s data source. Also, full parsing is rarely seen in text mining systems.
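As an illustration of the lemmatization item, a deliberately naive head-noun lemmatizer could collapse singular/plural candidate variants; the suffix rules below are a crude stand-in for a proper morphological analyser:

#!/usr/bin/perl -w
use strict;

# Strip common English plural endings from the head noun of a
# candidate NP, so that variants like "antipsychotic drugs" and
# "antipsychotic drug" collapse onto one form.
sub lemmatize_head {
    my @tokens = split(/ /, shift);
    for ($tokens[-1]) {
        s/ies$/y/ or s/(ch|sh|s|x|z)es$/$1/ or s/([^s])s$/$1/;
    }
    return join(" ", @tokens);
}

print lemmatize_head("antipsychotic drugs"), "\n";   # antipsychotic drug
print lemmatize_head("adverse reactions"), "\n";     # adverse reaction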

7.2.5 System integration

Firstly, the individual semantic relation extractors could be combined with each other to boost system performance. For example, KPs for ISA and a causal relation might make it possible to piece together complete definitions from fragmented text snippets on the WWW. While applications like [Klavans and Muresan, 2000] extract complete definitions from free text, such approaches may encounter data sparseness problems which can be avoided by looking for hypernyms (ISA) and delimiting characteristics (e.g. causality) separately. Also, in the case of the “X induces emesis” experiment, X constituted a large set of things, some of which were irrelevant in a biomedical context. However, running 1) “X induces emesis” and 2) “X ISA drug” and eliminating all X from 1) which are not found in 2) may filter out these non-biomedical concepts and reduce the search space (a minimal sketch of this step follows at the end of this subsection). Finally, the synonymy extractor could be used to cluster synonymous arguments of relation instances (for example “bleeding” and “hemorrhage”) and further reduce the workload of the terminologist.

Secondly, the WWW2REL system could be combined with an application like TerminoWeb[71] [Agbago and Barrière, 2005] developed at the Institute for Information Technology, National Research Council of Canada. TerminoWeb is an interactive tool developed to help terminologists compile documents from the WWW which are rich in specialized and explicit knowledge, and subsequently extract both terms and knowledge-rich contexts from these documents. A unique feature of the TerminoWeb system is that documents are ranked by their density of knowledge patterns as well as a number of other relevance parameters. However, TerminoWeb lacks a KP discovery module like WWW2REL which would help the user extend the system’s inventory of KPs, whether they be domain independent or not.
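The intersection step described above amounts to a simple hash lookup; a minimal Perl sketch with illustrative candidate lists (not actual system output):

#!/usr/bin/perl -w
use strict;

# Keep only those X from "X induces emesis" which also occur
# among the candidates returned for "X ISA drug".
my @induces_emesis = ("ipecac", "motion sickness", "ferrous sulfate", "bad news");
my @isa_drug       = ("ipecac", "ferrous sulfate", "haloperidol");

my %is_drug  = map { $_ => 1 } @isa_drug;
my @filtered = grep { $is_drug{$_} } @induces_emesis;
print join("\n", @filtered), "\n";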

7.2.6 Knowledge Pattern issues Automatically discovering, filtering and applying KPs in the task of relation instance extraction revealed a number of pattern-related issues which so far have only been discussed en passant in the thesis.

1. data sparseness (subsection 4.1.3)

2. invalid patterns (subsection 4.2.4)

3. discontinuous patterns (subsection 4.2.4)

4. ambiguous patterns (subsections 5.5.5 and 6.3.4)

The following paragraphs will briefly discuss how the two issues of discontinuous and invalid KPs could be explored in future work.

Discontinuous patterns Among the patterns discovered for the “induces” relation were fragments of discontinuous patterns in which the cause is embedded in a possessive construction headed by compounds like “side effect(s)” and followed by a linking

[71] http://termino.iit.nrc.ca

Table 94: Discontinuous knowledge patterns (examples)

Relation type   left                 middle               right
induces         side effect(s)? of   is | are | include
may_prevent                          provides             treatment | therapy | relief
...

Table 95: Inversion of causal relations

Relation type            Added element            Template
may_prevent -> induces   insufficiency            insufficiency {KP}
may_prevent -> induces   overdose                 overdose {KP}
may_prevent -> induces   insufficient|(lack of)   insufficient|(lack of) {KP}
may_prevent -> induces   excessive|(too much)     excessive|(too much) {KP}
induces -> may_prevent   low doses of             low doses of {KP}
may_prevent -> induces   high doses of            high doses of {KP}
...

verb (“is”, “are”, “include”) and the effect. Another example is the semantically vague verb “provide” (discovered in the “may_prevent” snippets) as used in the template in table 94. In this case the “may_prevent” relation is not instantiated as a verb in the middle context, but as a noun in the right context. In the WWW2REL implementation presented in this thesis no attempt was made at identifying such discontinuous KPs. All patterns were simply extracted from the middle contexts of the term training pairs. Discontinuous sequences of tokens are hard, if not impossible, to identify using traditional n-gram techniques. However, alternative ways of capturing non-contiguous word associations have been proposed and have been referred to as “skipgrams” or “concgrams” [Cheng et al., 2006, p4]. While using “concgrams” might be a way of identifying discontinuous KPs, it is doubtful whether adding them in text mining applications will be worth the extra effort. In spite of their high precision, discontinuous KPs will presumably have a relatively low recall and may also be domain dependent. The group of nouns in the “may_prevent” example in table 94 is unlikely to be useful when extracting “may_prevent” relations from the domain of Information Technology, for example. Nevertheless, it would be interesting to examine the usefulness of discovering and applying discontinuous KPs in a WWW2REL framework.
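On raw snippet text, the first template of table 94 could be matched with a discontinuous regular expression; a minimal sketch in which the NP approximations ([\w -]+) are a crude stand-in for proper chunking:

#!/usr/bin/perl -w
use strict;

# Match "side effect(s) of <cause> is|are|include <effect>" and
# emit the implied "induces" instance.
my $snippet = "The side effects of haloperidol include drowsiness and dizziness.";
if ($snippet =~ /side effects? of ([\w -]+?) (?:is|are|include) ([\w -]+)/i) {
    print "induces(", $1, ", ", $2, ")\n";
}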

Invalid patterns The problem of invalid KPs primarily arises in the case of the two causal relation types, “induces” and “may_prevent”, which can be contextually inverted and become each other’s opposite. For example, the reason why KPs like “causes” (see table 28) are discovered in the “may_prevent” training set can be found by examining the left contexts of the term pairs. The main culprit appears to be dosage information given in the left or middle contexts: although “calcium {may_prevent} osteoporosis”, “lack of calcium {induces} osteoporosis”. More examples are given in

table 95. Relational inversion caused by dosage information in the middle context could be avoided by disallowing nouns in the dummy position immediately preceding the KP in the discovery phase (see the search template in subsection 4.1.5). Similarly, inversion through dosage information in the left context could be avoided by disallowing modifiers of the input term and making sure that it is not a prepositional complement. As in the question of whether or not to disallow input term modifiers (see subsection 5.1.2), the effects of such filters on performance should be empirically tested so as to measure whether the possible increase in precision comes at too great a cost in recall or is computationally too expensive. It may also be the case that the phenomenon of contextual inversion of causal relations is specific to the biomedical domain and that implementing such filters may have counter-productive results in other domains.
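On TreeTagger-style output, the proposed restriction could be enforced by rejecting matches where the token line immediately preceding the KP carries a noun tag; a minimal sketch (in the real pipeline the tagged input term would be excluded from this check):

#!/usr/bin/perl -w
use strict;

# Reject a match if the token line immediately preceding the KP
# carries a noun tag (NN/NNS/NP/NPS in the TreeTagger set), so
# that dosage nouns like "overdose" cannot invert the relation.
sub noun_precedes_kp {
    my ($tagged, $kp) = @_;
    my @lines = split(/\n/, $tagged);   # token\tTAG\tlemma per line
    for (my $i = 1; $i < scalar(@lines); $i++) {
        if ($lines[$i] =~ /^\Q$kp\E\t/ && $lines[$i-1] =~ /\tN[NP]S?\t/) {
            return 1;
        }
    }
    return 0;
}

my $inverted = "calcium\tNP\tcalcium\noverdose\tNN\toverdose\ncauses\tVVZ\tcause\nosteoporosis\tNN\tosteoporosis";
my $fine     = "calcium\tNP\tcalcium\neffectively\tRB\teffectively\nprevents\tVVZ\tprevent\nosteoporosis\tNN\tosteoporosis";
print noun_precedes_kp($inverted, "causes") ? "reject\n" : "keep\n";   # reject
print noun_precedes_kp($fine, "prevents")   ? "reject\n" : "keep\n";   # keep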

8 Appendices

The appendices contain all source code for WWW2REL, for querying the UMLS Metathesaurus in various ways and for computing measures like Fleiss’ kappa. They also contain lists of training term pairs, individual KP precision scores and F-score plots for most of the thesis experiments.

8.1 Conversion of BNC from SGML (SARA) to raw text

###unpack from DVD
tar zxvfO texts.tar.gz > BNC.tags
###remove document headers
cat BNC.tags | perl -pe 's/<bncDoc id=(\w+)>/[\1]/; s/<[^>]+>//g;' > BNC.stripped
###one sentence per line
cat BNC.stripped | tr -s '\n' > BNC.stripped2
cat BNC.stripped2 | perl -e 'while(<>){ if(/\[(bnc.*)\]/){print "</bncDoc>\n<", $1, ">\n";} else{print "<s>\n", $_, "</s>\n";}}' > BNC.stripped_sent
###move line 1 to last line
cat BNC.stripped_sent | perl -e 'while(<>){print unless ($. == 1);} print "</bncDoc>";' > BNC.stripped2
###remove SGML entities
cat BNC.stripped2 | perl -e 'use HTML::Entities; while (<>){print decode_entities($_);}' > BNC.clean
###Remove unorthodox entities
cat BNC.clean | sed -r 's/%/% /g' | sed -r 's/&.{1,6};//g' > BNC.megaclean

8.2 vp2_log_likelihood.pl

#!/usr/bin/perl -w
#This script computes the log likelihood ratio as a
#measure of the degree of association between a VP and
#two different corpora
use strict;
scalar(@ARGV) == 5 or die "usage: <REF_VP_LIST> <AC_VP_LIST> <REF_SIZE> <AC_SIZE> <MIN_FREQ>";
###BNC: 110M, BIOMED: 92M
my %REF_VP = ();
open(REF,"<",$ARGV[0]);
while(<REF>){
   chomp;
   ###FORMAT: [0-9]+ [a-z -]+
   s/^[ ]+//;
   my @line = split;
   my $freq = shift(@line);
   my $vp = join(" ",@line);
   ###SKIP NONLEXICAL TOKENS
   next unless ($vp =~ /[a-zA-Z- ]+/);
   $REF_VP{$vp} = $freq;
}
close(REF);
my($e11,$e12,$e21,$e22,$o11,$o12,$o21,$o22,$c1,$c2);
my $N = $ARGV[2]+$ARGV[3];
open(AC,"<",$ARGV[1]);
while(<AC>){
   chomp;
   s/^[ ]+//;
   my @line = split;

   ###OBSERVED FREQUENCIES
   $o11 = shift(@line);      #ANALYSIS CORPUS
   $o12 = $ARGV[3] - $o11;   #ANALYSIS CORPUS
   next unless $o11 >= $ARGV[4]; #MIN FREQ OF VP IN AC
   my $ac_vp = join(" ",@line);

   #LOG OF ZERO IS UNDEFINED, VP MUST OCCUR IN BNC
   next unless exists($REF_VP{$ac_vp});
   $o21 = $REF_VP{$ac_vp};   #REF CORPUS
   $o22 = $ARGV[2] - $o21;   #REF CORPUS
   $c1 = $o11 + $o21;
   $c2 = $o12 + $o22;

   ###ESTIMATED FREQUENCIES
   $e11 = $ARGV[3]*($c1/$N);
   $e12 = $ARGV[3]*($c2/$N);
   $e21 = $ARGV[2]*($c1/$N);
   $e22 = $ARGV[2]*($c2/$N);
   my $log_like = 2*(($o11*log($o11/$e11))+($o12*log($o12/$e12))+
      ($o21*log($o21/$e21))+($o22*log($o22/$e22)));
   if ($o21/$ARGV[2] > $o11/$ARGV[3]){
      ###ASSOCIATION WITH BNC GETS NEGATIVE SCORES
      $log_like *= -1;
   }
   printf("%d\t%d\t%.2f\t%s\n", $o11, $o21, $log_like, $ac_vp);
}
close(AC);
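For reference, a hypothetical invocation of the script above, using the corpus sizes given in its comment and a minimum VP frequency of 5 (the file names are placeholders):

perl vp2_log_likelihood.pl bnc_vp_freqs.txt biomed_vp_freqs.txt 110000000 92000000 5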

8.3 vp2log_odds.pl

#!/usr/bin/perl -w
use strict;
scalar(@ARGV) == 5 or die "usage: <REF_VP_LIST> <AC_VP_LIST> <REF_SIZE> <AC_SIZE> <MIN_FREQ>";
###BNC: 110M, BIOMED: 92M
my %REF_VP = ();
open(REF,"<",$ARGV[0]);
while(<REF>){
   chomp;
   ###FORMAT: [0-9]+ [a-z -]+
   s/^[ ]+//;
   my @line = split;
   my $freq = shift(@line);
   my $vp = join(" ",@line);
   $REF_VP{$vp} = $freq;
}
close(REF);
open(AC,"<",$ARGV[1]);
while(<AC>){
   chomp;
   s/^[ ]+//;
   my @line = split;
   my $ac_freq = shift(@line);
   next unless $ac_freq >= $ARGV[4];
   my $ac_vp = join(" ",@line);

   ###MUST OCCUR ONCE IN THE BNC!
   next unless exists($REF_VP{$ac_vp});

   ###LO(VP) = log((ac_freq*ref_other)/(ref_freq*ac_other))
   my $ref_other = $ARGV[2] - $REF_VP{$ac_vp};
   my $ac_other = $ARGV[3] - $ac_freq;
   my $odds = ($ac_freq*$ref_other)/($REF_VP{$ac_vp}*$ac_other);
   printf("%d\t%d\t%.2f\t%s\n", $ac_freq, $REF_VP{$ac_vp}, log($odds), $ac_vp);
}
close(AC);

8.4 umls2random_term_pairs.pl

#!/usr/bin/perl
# Last edited by Jakob Halskov
# on August 29 2006
#
use DBI qw(:sql_types);
use strict;
scalar(@ARGV) == 2 or die "usage: <RELATION> <SAMPLE_SIZE>";
#Connect to the database
my $host = 'localhost';
my $db = 'UMLS';
my $dbh = DBI->connect("dbi:mysql:$db:$host");
###GET THE TOTAL ROW NUMBER
my $sth1 = $dbh->prepare(qq{SELECT COUNT(*) FROM MRREL WHERE RELA LIKE ?});
$sth1->execute($ARGV[0]);
my @row_total = $sth1->fetchrow_array();
###COMPUTE sample_size RANDOM OFFSETS
my @offsets = ();
my $i = 0;
while ($i < $ARGV[1]){
   my $offset = int(rand()*$row_total[0]);
   push(@offsets,$offset);
   $i++;
}
###FETCH sample_size RANDOM CUI PAIRS FOR TARGET RELATION
my $sth2 = $dbh->prepare(qq{SELECT CUI1,CUI2 FROM MRREL WHERE RELA LIKE ? LIMIT ?,1});
###PRINT ALL TERM VARIANTS OF EACH CONCEPT
my $sth3 = $dbh->prepare(qq{SELECT STR,TTY FROM MRCONSO WHERE CUI LIKE ? AND LAT LIKE 'ENG'});
foreach my $rand (@offsets){
   ###SPECIFY DATATYPES
   $sth2->bind_param(1,$ARGV[0],{TYPE => SQL_VARCHAR});
   $sth2->bind_param(2,$rand,{TYPE => SQL_INTEGER});
   $sth2->execute();
   my @pair = $sth2->fetchrow_array();
   ###GET ALL STRINGS OF CONCEPT 1
   $sth3->execute($pair[0]);
   my %hooks = ();
   while (my @terms = $sth3->fetchrow_array){
      ###terms[0]: STR, terms[1]: TTY
      $hooks{lc($terms[0])}++;
   }
   ###GET ALL STRINGS OF CONCEPT 2
   $sth3->execute($pair[1]);
   my %targets = ();
   while (my @terms = $sth3->fetchrow_array){
      $targets{lc($terms[0])}++;
   }
   print $pair[0], ": ";
   foreach my $hook (keys %hooks){
      print $hook, ";";
   }
   print " => ";
   print $pair[1], ": ";
   foreach my $target (keys %targets){
      print $target, ";";
   }
   print "\n\n";
}
$sth1->finish();
$sth2->finish();
$sth3->finish();
$dbh->disconnect();

8.5 umls2isa_term_pairs.pl

#!/usr/bin/perl
use DBI;
use strict;
scalar(@ARGV) == 2 or die "usage: <CUI> <hyponyms|hypernyms>";
###Connect to the database
my $host = 'localhost';
my $db = 'UMLS';
my $user = 'jakob';
my $pass = 'halskov';
my $dbh = DBI->connect("dbi:mysql:$db:$host", "$user", "$pass");
###FETCH ALL CONCEPT PAIRS FOR THE TARGET DIRECTION
my $sth0 = "";
if ($ARGV[1] eq "hyponyms"){
   $sth0 = $dbh->prepare("SELECT CUI2,CUI1 FROM MRREL WHERE RELA LIKE 'isa' AND CUI1 LIKE ?");
}
elsif ($ARGV[1] eq "hypernyms"){
   $sth0 = $dbh->prepare("SELECT CUI1,CUI2 FROM MRREL WHERE RELA LIKE 'isa' AND CUI2 LIKE ?");
}
my $sth1 = $dbh->prepare("SELECT STR FROM MRCONSO WHERE CUI LIKE ? AND LAT LIKE 'ENG'");
my %fixed_concept = ();
$sth0->execute($ARGV[0]);
while (my @pair = $sth0->fetchrow_array){
   ###GET TERM VARIANTS OF ARG1 (THE FIXED CONCEPT)
   $sth1->execute($pair[1]);
   while (my @names = $sth1->fetchrow_array){
      $fixed_concept{lc($names[0])}++;
   }
   last;
}
$sth0->execute($ARGV[0]);
my $cui_pairs = $sth0->rows;
while (my @pair = $sth0->fetchrow_array){
   ###GET TERM VARIANTS OF ARG2 (OTHER CONCEPTS)
   $sth1->execute($pair[0]);
   my %other_concept = ();
   while (my @names = $sth1->fetchrow_array){
      $other_concept{lc($names[0])}++;
   }
   foreach my $fixed (keys %fixed_concept){
      foreach my $other (keys %other_concept){
         print $fixed, " <=> ", $other, "\n";
      }
   }
}
$sth0->finish();
$sth1->finish();
$dbh->disconnect();
print STDERR $cui_pairs, " concept pairs\n";

8.6 umls2term_pairs.pl

#!/usr/bin/perl
use DBI;
use strict;
###Last modified by Jakob Halskov on Oct 19 2006
#
###This script retrieves term pairs (all variants) from
###UMLS in the target relation, but only those pairs for
###which either argument has at least one "ingredient_of"
###relation
scalar(@ARGV) == 1 or die "usage: <RELATION>";
###Connect to the database
my $host = 'localhost';
my $db = 'UMLS';
my $dbh = DBI->connect("dbi:mysql:$db:$host");
###FETCH ALL CUI PAIRS FOR TARGET RELATION
my $sth0 = $dbh->prepare("SELECT CUI1,CUI2 FROM MRREL WHERE RELA LIKE ?");
###FETCH ONLY THOSE PAIRS THAT HAVE ACTIVE INGREDIENTS
my $sth3 = $dbh->prepare("SELECT count(*) FROM MRREL WHERE RELA LIKE 'ingredient_of' AND CUI1 LIKE ?");
my $sth1 = $dbh->prepare("SELECT CUI2 FROM MRREL WHERE RELA LIKE 'ingredient_of' AND CUI1 LIKE ?");
###PRINT STRING VALUES
my $sth2 = $dbh->prepare("SELECT STR,TTY FROM MRCONSO WHERE CUI LIKE ? AND LAT LIKE 'ENG'");
my %ingredients = ();
my %effects = ();
$sth0->execute($ARGV[0]);
while (my @pair = $sth0->fetchrow_array){
   $sth3->execute($pair[1]);
   my @count = $sth3->fetchrow_array;
   ###DOES ARG HAVE ACTIVE INGREDIENT?
   if ($count[0] > 0){
      ###GET INGREDIENT CUI!
      $sth1->execute($pair[1]);
      my @ingr_cui = $sth1->fetchrow_array;
      ###GET INGREDIENT NAMES!
      $sth2->execute($ingr_cui[0]);
      %ingredients = (); ###RESET!
      while (my @ingr_name = $sth2->fetchrow_array){
         $ingredients{lc($ingr_name[0])} = $pair[1];
      }
      $sth2->execute($pair[0]);
      %effects = (); ###RESET!
      while (my @effect = $sth2->fetchrow_array){
         $effects{lc($effect[0])} = $pair[0];
      }
      foreach my $ingr (keys %ingredients){
         foreach my $eff (keys %effects){
            print $ingr, "\t", $ingredients{$ingr}, " <=> ", $eff, "\t", $effects{$eff}, "\n";
         }
      }
   }
}
$sth0->finish();
$sth1->finish();
$sth2->finish();
$sth3->finish();
$dbh->disconnect();

8.7 google2snippets.pl

#!/usr/bin/perl
###Last modified by Jakob Halskov on Oct 19 2006
#
###Based on a list of term pairs (t1 <=> t2) and a
###minimum co-occurrence frequency, this script
###retrieves arg2 number of snippets for each pair
###if the query "t1 * t2" yields more than arg2 hits
use SOAP::Lite;
my $google_key = 'INSERT GOOGLE API KEY HERE';
my $google_wdsl = "/home/jakob/scripts/GoogleSearch.wsdl";
scalar(@ARGV) == 2 or die "usage: <PAIR_FILE> <MIN_HITS>";
my $min = $ARGV[1];
my $google_search = SOAP::Lite->service("file:$google_wdsl");
open(PAIRS,"<",$ARGV[0]);
while (<PAIRS>){
   chomp;
   ###FORMAT: t1 <=> t2
   my @pairs = split(/ <=> /);
   my $filename1 = join("-",@pairs);
   my $filename2 = join("-",reverse @pairs);
   $filename1 =~ tr/ /_/;
   $filename2 =~ tr/ /_/;
   my $query = "allintext: \"" . $pairs[0] . " * " . $pairs[1] . "\"";
   for (my $offset=0;$offset<$min;$offset+=10){
      my $results = "";
      ###Google search format: key,query,start,
      ###maxResults,filter,restricts,safeSearch,
      ###language_restriction,input_encoding,
      ###output_encoding
      ###Setting filter to "true" will remove duplicate
      ###results and avoid host crowding
      eval {
         $results = $google_search->doGoogleSearch(
            $google_key,$query,$offset,10,"true","","false","",
            "latin1","latin1");
      };
      if ($results->{estimatedTotalResultsCount} <= $min){
         print STDERR "Less than ", $min, " hits for ", $query, "\n";
         last;
      }
      open(OUT,">>active_$filename1");
      if ($offset==0) {
         print STDERR $results->{estimatedTotalResultsCount}, "\t", $query, "\n";
      }
      foreach my $result (@{$results->{resultElements}}){
         print OUT $result->{snippet}, "\n";
      }
   }
   close(OUT);
   $query = "allintext: \"" . $pairs[1] . " * " . $pairs[0] . "\"";
   for (my $offset=0;$offset<$min;$offset+=10){
      my $results = "";
      eval {
         $results = $google_search->doGoogleSearch(
            $google_key,$query,$offset,10,"true","","false","",
            "latin1","latin1");
      };
      if ($results->{estimatedTotalResultsCount} <= $min){
         print STDERR "Less than ", $min, " hits for ", $query, "\n";
         last;
      }
      open(OUT,">>passive_$filename2");
      if ($offset == 0){
         print STDERR $results->{estimatedTotalResultsCount}, "\t", $query, "\n";
      }
      foreach my $result (@{$results->{resultElements}}){
         print OUT $result->{snippet}, "\n";
      }
   }
   close(OUT);
}
close(PAIRS);

8.8 snippets2ten_fold_sets.pl

#!/usr/bin/perl -w
###Last modified by Jakob Halskov on Oct 19 2006
#
###Given a series of snippets files as arguments, this script
###generates test and training sets for 10-fold-validation
###Snippet file name format: (active|passive)_t1_t1-t2_t2
use strict;
use HTML::Entities;
scalar(@ARGV) > 0 or die "usage: <SNIPPET_FILES>";
my %training = ();
my %history = ();
my %last_test_set = ();
my $snip_name = "";
while (scalar(@ARGV) > 0){
   &populate_training_array;
}
###EXTRACT 10 TEST SETS
my $snip_count = keys %training;
my $tenth = int($snip_count/10);
print STDERR "Training set: ", $snip_count, " snippets\n";
print STDERR "Test sets: ", $tenth, " snippets\n";
my $j = 0;
while ($j < 100){
   my $i = 0;
   ###EMPTY LAST_TEST_SET
   %last_test_set = ();
   open(TEST,">>test_$j");
   while($i<$tenth){
      my($snip_name,$snip) = each %training;
      ###HAS THE SNIPPET BEEN USED AS TEST BEFORE?
      if (!exists($history{$snip_name})){
         $history{$snip_name}++;
         print TEST $snip_name, "\n";
         ###SAVE TEST SET IN HASH
         $last_test_set{$snip_name} = $snip;
         ###DELETE FROM TOTAL CORPUS
         delete($training{$snip_name});
         $i++;
      }
   }
   close(TEST);
   ###PUT TEST SET BACK
   %training = (%training,%last_test_set);
   ###PRINT CORRESPONDING TRAINING SET (90%)
   open(TRAIN,">>train_$j");
   while (my($snip_name,$snip) = each %training){
      ###SNIPPET MUST NOT BE PART OF 10% TEST SET
      if(!exists($last_test_set{$snip_name})){
         print TRAIN $snip, "\n";
      }
   }
   close(TRAIN);
   $j+=10;
}
###SUBROUTINES###
sub populate_training_array {
   my $file = shift(@ARGV);
   my @pair = split(/\-/,$file);
   my ($term1,$term2) = ("","");
   if ($pair[0] =~ s/active_//){
      $term1 = $pair[0];
      $term2 = $pair[1];
      $snip_name = $term1 . "-" . $term2;
   }
   elsif ($pair[0] =~ s/passive_//){
      $term2 = $pair[0];
      $term1 = $pair[1];
      $snip_name = $term2 . "-" . $term1;
   }
   ###LOAD TEXT
   open(SNIPPETS,"<$file");
   undef $/;
   my $snip = <SNIPPETS>;
   $/ = "\n";
   close(SNIPPETS);
   ###DECODE ENTITIES
   decode_entities($snip);
   ###REMOVE MARK-UP
   $snip =~ s/<[^ ]+>//g;
   $snip =~ s/[ ]+/ /g;
   ###TAG TERMS AS <term1> AND <term2>
   $term1 =~ tr/_/ /;
   $term2 =~ tr/_/ /;
   $snip =~ s/$term1/<term1>/sig;
   $snip =~ s/$term2/<term2>/sig;
   ###NB: HASH KEYS MUST BE UNIQUE
   $training{$snip_name} = $snip;
}

8.9 learn_knowledge_patterns.sh

#!/bin/bash
###Last modified by Jakob Halskov on Oct 19 2006
#
###This bash script POS tags and chunks the training
###snippets. It runs the perl script extract_middle_
###context_VPs.pl to extract candidate knowledge
###patterns and filters these patterns by lowercasing,
###removing punctuation marks and so on. Results are
###two lists of pattern candidates (+/- frequencies)
for i in snippets_*; do
   tree-tagger-english $i > tag_$i;
done
for i in tag_*; do
   cat $i | sed -r 's/(.\tSENT\t.)/\1\n/' | yamcha -m ~/data/wsj_based.model > chunk_$i;
done
for i in chunk*; do
   extract_middle_context_VPs.pl $i active y | cut -f1 | sed -r 's/^$/_/' | tr '\n' ' ' | tr '_' '\n' | tr '[:upper:]' '[:lower:]' | tr -d '[,.;%:()-]' | sed -r 's/[ ]+/ /g' | sed -r 's/^[ ]+//' | sed -r 's/[ ]+$//' | egrep -v '^[ ]*$' | sort | uniq -c | sort -rn > kp_$i;
done
for i in kp_chunk*; do
   cat $i | sed -r 's/^[ ]+[0-9]+ /\+/' | sed -r 's/[ ]+/ \+/g' > no_frq_$i;
done

8.9.1 extract_middle_context_VPs.pl

#!/usr/bin/perl -w
###Last modified by Jakob Halskov on Oct 19 2006
#
###This script takes as input POS tagged and chunked
###snippet files. As output it prints flexible VPs as
###defined by the regular expressions below
use strict;
scalar(@ARGV) == 3 or die "usage: <CHUNK_FILE> <active|passive> <y|n>";
open(SNIPPETS,"<$ARGV[0]");
undef $/;
my $snip = <SNIPPETS>;
$/ = "\n";
close(SNIPPETS);
###TERM PAIRS ARE TAGGED <term1> ... <term2> OR VICE VERSA
my $vp = "(?:[^\n]+\t.-VP\n)+(?:[^\n]+\tB-PP\n)?";
my $np = "(?:[^\n]+\t.-NP\n)*"; ###OPTIONAL
my $dummy = "(?:[^\n]+\n){0,2}"; ###MAX 2
if ($ARGV[2] =~ /y/){
   $vp = "(" . $dummy . $vp . ")";
}
else {
   $vp = $dummy . "(" . $vp . ")";
}
##ALLOW COMPLEMENTIZERS, PARENTHESES AND SO ON AFTER TERM1
if ($ARGV[1] =~ /active/i){
   while($snip =~ /<term1>[^\n]+\n$vp$np<term2>/sig){
      print $1, "\n";
   }
}
elsif ($ARGV[1] =~ /passive/i){
   while($snip =~ /<term2>[^\n]+\n$vp$np<term1>/sig){
      print $1, "\n";
   }
}

8.10 form_queries.pl

#!/usr/bin/perl -w
###Last modified by Jakob Halskov on Oct 19 2006
#
###Given a list of term pairs and pattern candidates
###(with frequency numbering), this script produces Google
###queries, which allow for flexibility and active/passive
###alternations
use strict;
scalar(@ARGV) == 4 or die "usage: <PAIR_FILE> <PATTERN_FILE> <active|passive> <y|n>";
my @pairs = ();
open(PAIRS,"<",$ARGV[0]);
while(<PAIRS>){
   chomp;
   ###FORMAT: t1_t1-t2_t2
   tr/_/ /;
   s/^/\+/;
   s/ / \+/g;
   push(@pairs,$_);
}
close(PAIRS);
my @queries = ();
open(PATTERNS,"<",$ARGV[1]);
while(<PATTERNS>){
   chomp;
   ###FORMAT: frq_rank_number\tKP\n
   my @pat = split(/\t/);
   foreach my $pair (@pairs){
      my @terms = split(/-/,$pair);
      if ($ARGV[3] =~ /y/i && $ARGV[2] =~ /active/i){
         ###ALLOW ONLY RIGHT FLEXIBILITY
         print "\"" . $terms[0] . " " . $pat[1] . " * +" . $terms[1] . "\"\t" . $pat[1] . "\t" . $pat[0] . "\n";
      }
      elsif ($ARGV[3] =~ /n/i && $ARGV[2] =~ /active/i){
         print "\"" . $terms[0] . " " . $pat[1] . " +" . $terms[1] . "\"\t" . $pat[1] . "\t" . $pat[0] . "\n";
      }
      elsif($ARGV[2] eq "passive"){
         print "\"+" . $terms[1] . " " . $pat[1] . " " . $terms[0] . "\"\t" . $pat[1] . "\t" . $pat[0] . "\n";
      }
   }
}
close(PATTERNS);

8.11 google2frequencies.pl

#!/usr/bin/perl
###Last modified by Jakob Halskov on Oct 19 2006
#
###Based on a list of queries (term1-kp-term2), this
###script retrieves query frequencies from Google
use WWW::Mechanize;
use strict;
scalar(@ARGV) == 1 or die "usage: <QUERY_FILE>";
my $mech = WWW::Mechanize->new();
###MASQUERADE AS INTERNET EXPLORER
$mech->agent_alias('Windows IE 6');
open(QUERIES,"<",$ARGV[0]);
open(OUT,">>stats_$ARGV[0]");
while (<QUERIES>){
   chomp;
   ###FORMAT: query\tKP\tNUM\n
   my @line = split(/\t/);
   my $query = $line[0];
   my $url = "http://www.google.com/search?q=allintext: $query";
   $mech->get($url);
   my $text = $mech->content();
   if ($text =~ /of about <b>([0-9,]+)<\/b> for/sig){
      my $hits = $1;
      $hits =~ tr/,//d;
      print OUT $hits, "\t", $line[1], "\t", $query, "\t", $line[2], "\n";
      print STDERR $hits, "\t", $line[1], "\t", $query, "\t", $line[2], "\n";
   }
   else {
      print OUT "0", "\t", $line[1], "\t", $query, "\t", $line[2], "\n";
      print STDERR "0", "\t", $line[1], "\t", $query, "\t", $line[2], "\n";
   }
}
close(QUERIES);
close(OUT);

8.12 normalize_and_compute_pattern_precision.pl

#!/usr/bin/perl -w
###Last modified by Jakob Halskov on Oct 19 2006
#
###Input to this script is: two sets of query frequency
###statistics for term1-KP-term2 tuples where the term1-
###term2 pairs instantiate target and non-target relations,
###respectively. It also needs the co-occurrence statistics
###(term1 * term2) of the term pairs to normalize the hit
###counts. Output is precision scores for the KPs involved
###in this test (1 out of 10)
#
use strict;
scalar(@ARGV) == 4 || die "usage: <RIGHT_CO> <WRONG_CO> <RIGHT_STATS> <WRONG_STATS>";
my %abs_r = ();
my %abs_a = ();
my %right_co = ();
###LOAD CO-OCCURRENCE STATISTICS OF 4 CORRECT PAIRS
open(RIGHT,"<$ARGV[0]");
while(<RIGHT>){
   chomp;
   my @line = split(/\t/);
   ###FORMAT: FRQ\tt1_t1-t2_t2\n
   $line[1] =~ s/\-/\.\*/;
   $line[1] =~ s/_/ \\+/g; ##REMEMBER 'g' FLAG HERE
   $line[1] =~ s/^/\\+/;
   $right_co{$line[1]} = $line[0];
}
close(RIGHT);
###LOAD CO-OCCURRENCE STATISTICS OF 4 INCORRECT PAIRS
my %wrong_co = ();
open(WRONG,"<$ARGV[1]");
while(<WRONG>){
   chomp;
   my @line = split(/\t/);
   ###FORMAT: FRQ\tt1_t1-t2_t2\n
   $line[1] =~ s/\-/\.\*/;
   $line[1] =~ s/_/ \\+/g; ##NB: REMEMBER 'g' FLAG HERE
   $line[1] =~ s/^/\\+/;
   $wrong_co{$line[1]} = $line[0];
}
close(WRONG);
###LOAD QUERY FREQUENCY STATISTICS FOR "CORRECT"
###TERM-KP-TERM TUPLES
my %RIGHT = ();
my %ALL = ();
open(RIGHT_STATS,"<$ARGV[2]");
while(<RIGHT_STATS>){
   chomp;
   ###FORMAT: FRQ\tKP\tQUERY\tKP-RANK\n
   my @line = split(/\t/);
   $abs_a{$line[1]} += $line[0];
   $abs_r{$line[1]} += $line[0];
   if ($line[1] =~ /\+&/){
      next;
   }
   foreach my $pair (keys %right_co){
      if($line[2] =~ /$pair/){
         $RIGHT{$line[1]} += ($line[0]/$right_co{$pair});
         $ALL{$line[1]} += ($line[0]/$right_co{$pair});
      }
   }
}
close(RIGHT_STATS);
###LOAD QUERY FREQUENCY STATISTICS FOR "INCORRECT"
###TERM-KP-TERM TUPLES
my %WRONG = ();
open(WRONG_STATS,"<$ARGV[3]");
while(<WRONG_STATS>){
   chomp;
   ###FORMAT: FRQ\tKP\tQUERY\tKP-RANK\n
   my @line = split(/\t/);
   $abs_a{$line[1]} += $line[0];
   if ($line[1] =~ /\+&/){
      next;
   }
   foreach my $pair (keys %wrong_co){
      if($line[2] =~ /$pair/){
         $WRONG{$line[1]} += ($line[0]/$wrong_co{$pair});
         $ALL{$line[1]} += ($line[0]/$wrong_co{$pair});
      }
   }
}
close(WRONG_STATS);
###PRINT KP PRECISION BASED ON NORMALIZED HIT COUNTS
my $right = 0;
while (my($pat,$rf) = each %ALL){
   if (!exists $RIGHT{$pat}){
      $right = 0;
   }
   else {
      $right = $RIGHT{$pat};
   }
   unless ($rf == 0){
      my $prec = ($right/$rf)*100;
      print $prec, "\t", $abs_r{$pat}, "\t", $abs_a{$pat}-$abs_r{$pat}, "\t", $pat, "\n";
   }
}

8.12.1 compute_average_precision.pl

#!/usr/bin/perl -w
###Last modified by Jakob Halskov on Oct 19 2006
#
###This script simply computes the average precision
###of all KPs across the 10-fold-validation tests
#
use strict;
scalar(@ARGV) == 1 or die "usage: <10_PREC_FILES>";
my %total_prec = ();
my %occurrence = ();
my %total_freq = ();
open(PREC,"<$ARGV[0]");
while (<PREC>){
   chomp;
   ###FORMAT: PREC\tFRQ_R\tFRQ_W\tKP\n
   my @stats = split(/\t/);
   $total_prec{$stats[3]} += $stats[0];
   $occurrence{$stats[3]}++;
   $total_freq{$stats[3]} += $stats[1];
   $total_freq{$stats[3]} += $stats[2];
}
close(PREC);
while(my($kp,$prec) = each %total_prec){
   printf("%.2f\t%.1f\t%s\n", $prec/$occurrence{$kp}, $total_freq{$kp}/10, $kp);
}

8.13 kp_discovery_power.pl

#!/usr/bin/perl -w
###Last modified by Jakob Halskov on Oct 19 2006
#
###For each of the 10 tests, this script takes the
###list of patterns learned during training (36/40)
###and during testing (4/40) as well as two integers
###specifying the topX most frequent training and
###test patterns to compare when computing pattern
###"discovery power"
use strict;
scalar(@ARGV) == 4 || die "usage: <TRAIN_KPS> <TEST_KPS> <TOPX_TRAIN> <TOPX_TEST>";
my %training_pats = ();
my @test_pats = ();
open(TRAIN,"<$ARGV[0]");
while(<TRAIN>){
   chomp;
   my @line = split(/\t/);
   ###FORMAT: KP_RANK\tKP\n
   if($line[0] <= $ARGV[2]){
      $training_pats{$line[1]}++;
   }
}
close(TRAIN);
my $i = 1;
open(TEST,"<$ARGV[1]");
while(<TEST>){
   if ($i > $ARGV[3]){
      last;
   }
   else {
      chomp;
      push(@test_pats,$_);
      $i++;
   }
}
close(TEST);
my $hit = 0;
foreach my $test (@test_pats){
   if(exists($training_pats{$test})){
      print STDERR "HIT: ", $test, "\n";
      $hit++;
   }
}
print "RATIO: ", $hit/$ARGV[3], "\n";

8.14 umls2synonym_pairs.pl

#!/usr/bin/perl
# Last edited by Jakob Halskov
# on August 29 2006
#
use DBI qw(:sql_types);
use strict;
scalar(@ARGV) == 1 or die "usage: <RELATION>";
#Connect to the database
my $host = 'localhost';
my $db = 'UMLS';
my $user = 'jakob';
my $pass = 'halskov';
my $dbh = DBI->connect("dbi:mysql:$db:$host", "$user", "$pass");
##########FETCH CUI PAIRS FOR THE TARGET RELATION
my $sth2 = $dbh->prepare(qq{SELECT CUI1,CUI2 FROM MRREL WHERE RELA LIKE ?});
##########FETCH TERM VARIANTS FOR TARGET CONCEPTS
my $sth3 = $dbh->prepare(qq{SELECT STR,TTY FROM MRCONSO WHERE CUI LIKE ? AND LAT LIKE 'ENG'});
##########WE ONLY WANT VARIANTS EXPLICITLY MARKED AS SYNONYMS
my $syn = "(MTH_)?(SY|SS|RSY|RLS|SYGB|USY|ORS|ONS|OBS|OES|NSY|AS|BSY|ESY|IS)";
$sth2->execute($ARGV[0]);
my %pairs = ();
while (my @cuis = $sth2->fetchrow_array){
   $sth3->execute($cuis[0]);
   my $has_pn = 0;
   my $pn = "";
   while (my @terms = $sth3->fetchrow_array){
      if ($terms[1] =~ /PN/){
         $pn = $terms[0];
         $has_pn = 1;
      }
      elsif ($has_pn == 1 && $terms[1] =~ /$syn/){
         my $pair = lc($pn) . "," . lc($terms[0]);
         $pairs{$pair}++;
      }
   }
   $has_pn = 0;
   $pn = "";
   $sth3->execute($cuis[1]);
   while (my @terms = $sth3->fetchrow_array){
      if ($terms[1] =~ /PN/){
         $pn = $terms[0];
         $has_pn = 1;
      }
      elsif ($has_pn == 1 && $terms[1] =~ /$syn/){
         my $pair = lc($pn) . "," . lc($terms[0]);
         $pairs{$pair}++;
      }
   }
}
$sth2->finish();
$sth3->finish();
$dbh->disconnect();
foreach (keys %pairs){
   print $_, "\n";
}

8.15 term_and_kp2snippets.pl

#!/usr/bin/perl -w
###Last modified by Jakob Halskov on Oct 23 2006
#
###Given an input term and a list of reliable KPs
###for the target relation this script retrieves
###relation instances (NPs in the term-KP-NP triplet)
use SOAP::Lite;
use DBI;
use strict;
my $google_wdsl = "/home/jakob/scripts/GoogleSearch.wsdl";
my $google_key = 'INSERT DEVELOPER KEY HERE!';
scalar(@ARGV) == 3 or die "usage: <TERM> <RELATION> <l|r>";
my $term = $ARGV[0];
my $google_search = SOAP::Lite->service("file:$google_wdsl");
#Connect to the database
my $host = 'localhost';
my $db = 'www2rel';
my $dbh = DBI->connect("dbi:mysql:$db:$host");
###GET THE PATTERNS FOR THE RELATION FROM THE DATABASE
my $sth1 = $dbh->prepare("SELECT id,kp FROM patterns WHERE relation LIKE ?");
$sth1->execute($ARGV[1]);
my $t = $term;
$t =~ tr/ /_/;
while(my @line = $sth1->fetchrow_array()){
   ###FORMAT: $line[0]: id, $line[1]: KP
   my $pat = $line[1];
   $pat =~ tr/ /_/;
   my $query = "allintext: ";
   if ($ARGV[2] =~ /l/i){
      $query = "\"" . $term . " " . $line[1] . "\"";
      open(OUT,">>$t-$pat-$line[0]");
   }
   elsif ($ARGV[2] =~ /r/i){
      $query = "\"" . $line[1] . " " . $term . "\"";
      open(OUT,">>$pat-$line[0]-$t");
   }
   print STDERR $query, "\n";
   for (my $offset=0;$offset<100;$offset+=10){
      print STDERR $offset, "\n";
      my $results = $google_search->doGoogleSearch(
         $google_key,$query,$offset,10,"true","","false",
         "","latin1","latin1");
      last unless @{$results->{resultElements}};
      foreach my $result (@{$results->{resultElements}}){
         print OUT $result->{snippet}, "\n";
      }
   }
   close(OUT);
}
$sth1->finish();
$dbh->disconnect();

8.16 prepare_corpus.pl

#!/usr/bin/perl -w
use strict;
use HTML::Entities;

scalar(@ARGV) > 0 or die "usage: <SNIPPET_FILES>";

my $file_out = "";
while (scalar(@ARGV) > 0){
  my $file = shift(@ARGV);
  ###FILENAME1: active_term-kp-kpid
  ###FILENAME2: passive_kp-kpid-term
  my @tuple = split(/\-/,$file);
  my($term,$kp,$kp_id) = ("","","");
  if ($tuple[0] =~ s/active_//){
    $term = $tuple[0];
    $kp = $tuple[1];
    $kp_id = $tuple[2];
  } elsif ($tuple[0] =~ s/passive_//){
    $kp = $tuple[0];
    $kp_id = $tuple[1];
    $term = $tuple[2];
  }
  ###LOAD TEXT
  open(SNIPPETS,"<$file");
  undef $/;
  my $snip = <SNIPPETS>;
  $/ = "\n";
  close(SNIPPETS);

  ###DECODE ENTITIES
  decode_entities($snip);

  ###REMOVE MARK-UP
  $snip =~ s/<[^ ]+>//g;
  $snip =~ s/[ ]+/ /g;
  $file_out = $term . "_corpus";
  open(OUT,">>$file_out");

  ###TAG TERMS
  $term =~ tr/_/ /;
  my $regexp = "";
  $kp =~ tr/_/ /;
  $snip =~ s/$term//sig;
  $snip =~ s/$kp/$regexp/sig;
  print OUT $snip;
  close(OUT);
}
###POS-TAG AND CHUNK THE ASSEMBLED CORPUS
`~/TreeTagger/cmd/tree-tagger-english $file_out > tag_$file_out`;
my $temp = `cat tag_$file_out`;
$temp =~ s/(.\tSENT\t.)/$1\n/sg;
open(OUT,">tag_$file_out");
print OUT $temp;
close(OUT);
system("cat tag_$file_out | yamcha -m ~/bins/wsj_based.model > chunk_$file_out");

8.17 extract_relation_instances_store_in_database.pl

#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;
use DBI;

scalar(@ARGV) == 4 or die "usage: <CORPUS_FILE> <TERM> <RELATION> <active|passive>";

#RECORD THE DOWNLOAD DATE
my $dl_date = `date +%Y-%m-%d`;
my $mech = WWW::Mechanize->new();
$mech->agent_alias('Windows IE 6');

my $corpusname = $ARGV[0];
$corpusname =~ tr/[ ]+/_/;
open(SNIPPETS,"<$corpusname");
undef $/;
my $snip = <SNIPPETS>;
$/ = "\n";
close(SNIPPETS);
my $corpus = lc($snip); ###LOWER CASE EVERYTHING

###CONTEXTS ARE TAGGED: [...]
my $np = "(?:[^\n]+\t.-np\n)*[^\n]+\t(?:vvg|n[np]s?)\t[^\n]+\n"; #HEAD MUST BE A NOUN OR GERUND
my $prep = "[^\n]+\tb-pp\n";
my $pp = $prep . $np;
my $dummy = "(?:[^\n]+\n){0,2}";
my $np_mod = "(?:[^\n]+\t.-np\n)*";

###Connect to the database
my $host = 'localhost';
my $db = 'www2rel';
my $dbh = DBI->connect("dbi:mysql:$db:$host");

###Relation instance IDs are auto incremented in database
my $sth1 = $dbh->prepare("INSERT INTO candidates
  (head,sample_frq,np,term,relation,dl_date,kp_range)
  VALUES(?,?,?,?,?,?,?)");
###np2kp mappings are not auto incremented
my $sth2 = $dbh->prepare("INSERT INTO np2kp (np_id,kp_id,
  freq,position) VALUES(?,?,?,?)");

my %candidates_no_pom = ();
my ($regexp_no_pp,$regexp_pp) = ("","");
###REMEMBER QUERY FLEXIBILITY###
if ($ARGV[3] =~ /active/i){
  $regexp_no_pp = "[^\n]+\n$dummy [^\n]+\n($np)(?!$prep)";
  $regexp_pp = "[^\n]+\n$dummy [^\n]+\n($np)($prep$np)";
} elsif ($ARGV[3] =~ /passive/i){
  $regexp_no_pp = "($np) [^\n]+\n$np_mod";
  $regexp_pp = "($np)($prep$np) [^\n]+\n$np_mod";
}

while($corpus =~ /$regexp_no_pp/sig){
  my($kp_id,$np) = ("","");
  if ($ARGV[3] =~ /active/i){
    ($kp_id,$np) = ($1,$2);
  } else {
    ($np,$kp_id) = ($1,$2);
  }
  ###DELETE DETERMINERS
  $np =~ s/[^\n]+\tdt\t[^\n]+\n//g;
  if ($np =~ /\tvvg\t/){
    $np =~ s/([^\n]+)(\tvvg\t)[^\n]+(\t[^\n]+)/$1$2$1$3/g;
  }
  open(TMP_simple,">tmp_simple");
  print TMP_simple $np;
  close(TMP_simple);
  $np = `cat tmp_simple | cut -f1 | tr '\n' '_' | tr '_' ' '`;
  $np =~ s/[ ]+$//;
  $np =~ s/^[ ]+//;
  ###DELETE ALL NON-ALPHABETICALS
  $np =~ tr/0-9a-zA-Z -//cd;
  my @np = split(/[ ]+/,$np);
  $candidates_no_pom{$np[-1]}->{$np} .= $kp_id . "\t";
}

if ($ARGV[3] =~ /active/i){
  while($corpus =~ /$regexp_pp/sig){
    my($kp_id,$np,$pom) = ($1,$2,$3);
    ###TRANSFORM FX. "BLEEDING IN THE STOMACH" => "STOMACH BLEEDING"
    ###DELETE DETERMINERS AND PREPOSITIONS!
    $pom =~ s/[^\n]+\t(dt|in|to)\t[^\n]+\n//g;
    ###WE DON'T WANT BASE FORM OF GERUND HEADS!
    ###(fx. bleeding -> bleed)
    if ($pom =~ /\tvvg\t/){
      $pom =~ s/([^\n]+)(\tvvg\t)[^\n]+(\t[^\n]+)/$1$2$1$3/g;
    }
    $np =~ s/[^\n]+\tdt\t[^\n]+\n//g;
    if ($np =~ /\tvvg\t/){
      $np =~ s/([^\n]+)(\tvvg\t)[^\n]+(\t[^\n]+)/$1$2$1$3/g;
    }
    ###PRINT TO TMP FILE TO USE THE UNIX "CUT" TOOL
    open(TMP_pom,">tmp_pom");
    print TMP_pom $pom;
    close(TMP_pom);
    ###USE WORD-FORMS (COLUMN 1) NOT LEMMAS (COLUMN 3)
    my $np_pom = `cat tmp_pom | cut -f1 | tr '\n' '_' | tr '_' ' '`;
    $np_pom =~ s/[ ]+$//;
    $np_pom =~ s/^[ ]+//;
    ###DELETE ALL NON-ALPHABETICALS
    $np_pom =~ tr/0-9a-zA-Z -//cd;
    ###SAME AS ABOVE
    open(TMP_simple,">tmp_simple");
    print TMP_simple $np;
    close(TMP_simple);
    my $np_simple = `cat tmp_simple | cut -f1 | tr '\n' '_' | tr '_' ' '`;
    $np_simple =~ s/[ ]+$//;
    $np_simple =~ s/^[ ]+//;
    ###DELETE ALL NON-ALPHABETICALS
    $np_simple =~ tr/0-9a-zA-Z -//cd;
    my @np = split(/[ ]+/,$np_simple);
    ###SKIP PP IF TRANSFORMED NP IS NOT IN
    ###THE HASH %candidates_no_pom!
    my $np_transformed = $np_pom . " " . $np_simple;
    if (exists($candidates_no_pom{$np[-1]}->{$np_transformed})){
      $candidates_no_pom{$np[-1]}->{$np_transformed} .= $kp_id . "\t";
    } else {
      $candidates_no_pom{$np[-1]}->{$np_simple} .= $kp_id . "\t";
    }
  }
}

###STORE RELATION INSTANCE IN DATABASE
while(my($head,$np_hash_ref) = each %candidates_no_pom){
  ###VALUE FORMAT: "gi bleeding" => kp_id\tkp_id\t ...
  print STDERR $head, "\n";
  foreach my $np (keys %{$np_hash_ref}){
    print STDERR "\t", $np, "\n";
    ###GET NP SAMPLE FREQUENCY
    my @kps = split(/\t/,$np_hash_ref->{$np});
    my $np_sample_freq = scalar(@kps);
    my %diff_KPs = ();
    foreach my $kp_id (@kps){
      $diff_KPs{$kp_id}++;
    }
    my $kp_range = scalar(keys %diff_KPs);
    ###STORE NP IN candidates TABLE
    $sth1->execute($head,$np_sample_freq,$np,
      $ARGV[1],$ARGV[2],$dl_date,$kp_range);
    ###GET LAST INSERTED ID
    my $auto_id = $dbh->{'mysql_insertid'};
    ###STORE NP->KP MAPPINGS IN np2kp TABLE
    while(my($kp_id,$f) = each %diff_KPs){
      $sth2->execute($auto_id,$kp_id,$f,$ARGV[3]);
    }
  }
}
$sth1->finish();
$sth2->finish();
$dbh->disconnect();
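For later reference, the two statistics stored with each candidate NP are (notation added here): for a candidate noun phrase n extracted for a seed term,

  sample_frq(n) = total number of KP matches yielding n in the snippet corpus
  kp_range(n)   = number of distinct KPs that yielded n at least once

These two counts are the raw ingredients of the frq, kpr and fkpr ranking schemes used by the evaluation scripts below.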

8.18 Fleiss' kappa measure for inter-rater reliability

#!/usr/bin/perl -w
use strict;
use DBI;

scalar(@ARGV) == 4 or die "usage: <TERM> <RELATION> <MIN_FRQ> <lax|strict>";

#Connect to the database
my $host = 'localhost';
my $db = 'www2rel';
my $dbh = DBI->connect("dbi:mysql:$db:$host");

my ($sth0,$sth1) = ("","");
if ($ARGV[1] eq "isa"){
  $sth1 = $dbh->prepare("SELECT judgment FROM candidates
    WHERE term LIKE ? AND relation LIKE 'isa_%'
    AND judgment IS NOT NULL GROUP BY np");
} else {
  $sth1 = $dbh->prepare("SELECT judgment FROM candidates
    WHERE term LIKE ? AND relation LIKE ?
    AND sample_frq > ? AND judgment IS NOT NULL");
}

my @row_agr = ();
my @col_sums = ();
my $counter = 0;
if ($ARGV[1] eq "isa"){
  $sth1->execute($ARGV[0]);
} else {
  $sth1->execute($ARGV[0],$ARGV[1],$ARGV[2]);
}
my $no_of_instances = $sth1->rows();
#NB: THERE ARE FOUR JUDGES
my $matrix_sum = $no_of_instances*4;

while(my @judgments = $sth1->fetchrow_array){
  my @ratings = split(/,/,$judgments[0]);
  #INSERT "DON'T KNOW" IF JUDGMENT IS MISSING
  while(scalar(@ratings) < 4){
    push(@ratings,"2");
  }
  ###COMPUTE COLUMN AND ROW SUMS
  my $mean_sum = 0;
  my @cell_sum = (0,0,0);
  foreach (@ratings){
    if ($ARGV[3] eq "lax"){
      if (/[14]/){
        $col_sums[0]++;
        $cell_sum[0]++;
      } elsif (/[23]/){
        $col_sums[1]++;
        $cell_sum[1]++;
      }
    } elsif ($ARGV[3] eq "strict"){
      if (/[14]/){
        $col_sums[0]++;
        $cell_sum[0]++;
      } elsif (/2/){
        $col_sums[1]++;
        $cell_sum[1]++;
      } elsif (/3/){
        $col_sums[2]++;
        $cell_sum[2]++;
      }
    }
  }
  foreach (@cell_sum){
    $mean_sum += (($_*$_)-$_);
  }
  $row_agr[$counter] = $mean_sum/(4*(4-1));
  $counter++;
}
$sth1->finish();
$dbh->disconnect();

my $pi_sum = 0;
foreach (@row_agr){
  $pi_sum += $_;
}
my $prob_mean = ($pi_sum*4*(4-1))/($no_of_instances*4*(4-1));
my $prob_obs = 0;
foreach (@col_sums){
  $prob_obs += (($_/$matrix_sum)*($_/$matrix_sum));
}
my $kappa = ($prob_mean-$prob_obs)/(1-$prob_obs);
print "Kappa for ", $ARGV[0], " ", $ARGV[1], ": ", $kappa, "\n";
print $no_of_instances, " instances in experiment\n";
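The computation above is standard Fleiss' kappa for m = 4 raters and k = 2 ("lax") or k = 3 ("strict") categories. In conventional notation (added here; $prob_mean and $prob_obs correspond to P_bar and P_e):

  P_i   = (1 / (m(m-1))) * sum_j n_ij * (n_ij - 1)    (agreement on instance i)
  P_bar = (1/N) * sum_i P_i                           (mean observed agreement)
  p_j   = (sum_i n_ij) / (N*m);  P_e = sum_j p_j^2    (expected chance agreement)
  kappa = (P_bar - P_e) / (1 - P_e)

where n_ij is the number of judges assigning instance i to category j and N is the number of rated instances.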

8.19 compute_PRF.pl

#!/usr/bin/perl -w
use strict;
use DBI qw(:sql_types);

scalar(@ARGV) == 6 or die
  "usage: <TERM> <RELATION> <RANKING_SCHEME> <4_as_1:y|n> <corr_if_4:y|n> <MIN_FRQ>";

#Connect to the database
my $host = 'localhost';
my $db = 'www2rel';
my $dbh = DBI->connect("dbi:mysql:$db:$host");

###LOAD ALL CANDIDATES INTO HASH AND ADD AVE. JUDGMENT
my %rated_candidates = ();
my $sth1 = $dbh->prepare("SELECT np,judgment FROM candidates
  WHERE term LIKE ? AND relation LIKE ? AND sample_frq > ?
  AND judgment IS NOT NULL");
my $missing = 0;
$sth1->execute($ARGV[0],$ARGV[1],$ARGV[5]);
while(my @candidate = $sth1->fetchrow_array){
  my @ratings = split(/,/,$candidate[1]);
  ###INSERT "DON'T KNOW" IF JUDGMENT IS MISSING
  while(scalar(@ratings) < 4){
    push(@ratings,"2");
    $missing++;
  }
  ###STORE BY AVERAGE RATING!
  my $sum_rating = 0;
  foreach (@ratings){
    if ($ARGV[3] eq "y" && $_ =~ /4/){
      $sum_rating += 1;
    } else {
      $sum_rating += $_;
    }
  }
  $rated_candidates{$candidate[0]} = sprintf("%.2f",($sum_rating/4));
}
$sth1->finish();

###LOAD ALL CANDIDATES IN A PARTICULAR ORDER
my $sth2 = "";
my %BNC = ();
if ($ARGV[2] eq "frq"){
  $sth2 = $dbh->prepare("SELECT np FROM candidates
    WHERE term LIKE ? AND relation LIKE ? AND sample_frq > ?
    AND judgment IS NOT NULL
    ORDER BY sample_frq DESC,kp_range DESC");
} elsif ($ARGV[2] eq "kpr"){
  $sth2 = $dbh->prepare("SELECT np FROM candidates
    WHERE term LIKE ? AND relation LIKE ? AND sample_frq > ?
    AND judgment IS NOT NULL
    ORDER BY kp_range DESC,sample_frq DESC");
} elsif ($ARGV[2] eq "fkpr"){
  $sth2 = $dbh->prepare("SELECT np FROM candidates
    WHERE term like ? AND relation like ? AND sample_frq > ?
    AND judgment IS NOT NULL
    ORDER BY sample_frq*kp_range DESC");
} elsif ($ARGV[2] eq "pkpr"){
  $sth2 = $dbh->prepare("SELECT np,kp_range+kp_range_passive
    FROM candidates WHERE term LIKE ? AND relation LIKE ?
    AND sample_frq > ? AND judgment IS NOT NULL
    ORDER BY kp_range+kp_range_passive DESC");
} elsif ($ARGV[2] =~ /.*_bnc/i){
  $sth2 = $dbh->prepare("SELECT np,sample_frq,kp_range,
    kp_range_passive FROM candidates WHERE term LIKE ?
    AND relation LIKE ? AND sample_frq > ?
    AND judgment IS NOT NULL");
  ###BNC WORD FREQUENCY LIST
  open(BNC,"<...");
  while(<BNC>){
    chomp;
    my @line = split(/[ ]+/);
    $BNC{$line[0]} = $line[1];
  }
  close(BNC);
}
$sth2->execute($ARGV[0],$ARGV[1],$ARGV[5]);

my @system_candidates_ordered = ();
my %system_candidates = ();
while(my @candidate = $sth2->fetchrow_array){
  if ($ARGV[2] eq "fkpr_bnc"){
    ###FORMAT: np,sample_frq,kpr
    if (exists($BNC{$candidate[0]})){
      $system_candidates{$candidate[0]} =
        (($candidate[1]*$candidate[2])/log($BNC{$candidate[0]}));
    } else {
      $system_candidates{$candidate[0]} = $candidate[1]*$candidate[2];
    }
  } elsif ($ARGV[2] eq "kpr_bnc"){
    ###FORMAT: np,sample_frq,kpr
    if (exists($BNC{$candidate[0]})){
      $system_candidates{$candidate[0]} =
        ($candidate[2]/log($BNC{$candidate[0]}));
    } else {
      $system_candidates{$candidate[0]} = $candidate[2];
    }
  } elsif ($ARGV[2] eq "pkpr_bnc"){
    ###FORMAT: np,sample_frq,kpr,pkpr
    if (exists($BNC{$candidate[0]}) && defined($candidate[3])){
      $system_candidates{$candidate[0]} =
        ($candidate[2]+$candidate[3])/log($BNC{$candidate[0]});
    } else {
      $system_candidates{$candidate[0]} = $candidate[2];
    }
  } else {
    push(@system_candidates_ordered,$candidate[0]);
  }
}
$sth2->finish();
$dbh->disconnect();

###ORDER SYSTEM CANDIDATES IF REQUIRED
if ($ARGV[2] =~ /.*_bnc/i){
  foreach (sort { $system_candidates{$b} <=> $system_candidates{$a} }
           keys %system_candidates){
    push(@system_candidates_ordered,$_);
  }
}

###COMPUTE MAX RECALL OF CANDIDATES
my $all_correct = 0;
if ($ARGV[4] eq "y"){
  for (my $i=0;$i<scalar(@system_candidates_ordered);$i++){
    if ($rated_candidates{$system_candidates_ordered[$i]} == 4){
      $all_correct++;
    }
  }
} else {
  for (my $i=0;$i<scalar(@system_candidates_ordered);$i++){
    if ($rated_candidates{$system_candidates_ordered[$i]} < 1.75){
      $all_correct++;
    }
  }
}

if (-e "precision"){
  `rm precision`;
}
if (-e "recall"){
  `rm recall`;
}
if (-e "f-score"){
  `rm f-score`;
}
open(PRECISION,">>precision");
open(RECALL,">>recall");
open(F,">>f-score");

my $correct = 0;
if ($ARGV[4] eq "y"){
  for (my $i=0;$i<scalar(@system_candidates_ordered);$i++){
    if ($rated_candidates{$system_candidates_ordered[$i]} == 4){
      $correct++;
    }
    my $precision = $correct/($i+1);
    my $recall = $correct/$all_correct;
    my $F = 0;
    if (($precision+$recall) != 0){
      $F = 2*$precision*$recall/($precision+$recall);
    }
    printf PRECISION ("%d\t%.2f\n", $i+1,$precision);
    printf RECALL ("%d\t%.2f\n", $i+1,$recall);
    printf F ("%d\t%.2f\n", $i+1,$F);
  }
} else {
  for (my $i=0;$i<scalar(@system_candidates_ordered);$i++){
    if ($rated_candidates{$system_candidates_ordered[$i]} < 1.75){
      $correct++;
    }
    my $precision = $correct/($i+1);
    my $recall = $correct/$all_correct;
    my $F = 0;
    if (($precision+$recall) != 0){
      $F = 2*$precision*$recall/($precision+$recall);
    }
    printf PRECISION ("%d\t%.2f\n", $i+1,$precision);
    printf RECALL ("%d\t%.2f\n", $i+1,$recall);
    printf F ("%d\t%.2f\n", $i+1,$F);
  }
}
close(PRECISION);
close(RECALL);
close(F);
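The three output files trace the scores as a function of list rank. At cut-off k (standard definitions):

  P@k = correct(k) / k
  R@k = correct(k) / |all correct candidates|
  F@k = 2 * P@k * R@k / (P@k + R@k)

where correct(k) is the number of expert-approved instances among the top k ranked candidates.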

8.20 compute_PRF_head_grouping.pl

#!/usr/bin/perl -w
use strict;
use DBI qw(:sql_types);

scalar(@ARGV) == 5 or die
  "usage: <TERM> <RELATION> <RANKING_SCHEME> <4_as_1:y|n> <corr_if_4:y|n>";

if ($ARGV[1] eq "isa"){
  $ARGV[1] .= "%";
}

#Connect to the database
my $host = 'localhost';
my $db = 'www2rel';
my $dbh = DBI->connect("dbi:mysql:$db:$host");

###LOAD ALL CANDIDATES INTO HASH AND ADD AVE. JUDGMENT
my %rated_nps = ();
my $sth2 = $dbh->prepare("SELECT np,judgment FROM candidates
  WHERE term LIKE ? AND relation LIKE ? AND judgment IS NOT NULL");
my $missing = 0;
$sth2->execute($ARGV[0],$ARGV[1]);
while(my @candidate = $sth2->fetchrow_array){
  my @ratings = split(/,/,$candidate[1]);
  ###INSERT "DON'T KNOW" IF JUDGMENT IS MISSING
  while(scalar(@ratings) < 4){
    push(@ratings,"2");
    $missing++;
  }
  ###STORE BY AVERAGE RATING!
  my $sum_rating = 0;
  foreach (@ratings){
    if ($ARGV[3] eq "y" && $_ =~ /4/){
      $sum_rating++;
    } else {
      $sum_rating += $_;
    }
  }
  $rated_nps{$candidate[0]} = sprintf("%.2f",($sum_rating/4));
}
$sth2->finish();

###COMPUTE HEAD FRQ AND KPR
my %head_kpr = ();
my %head_frq = ();
my %head_fkpr = ();
my $sth1 = $dbh->prepare("SELECT head,kp_id FROM candidates,np2kp
  WHERE term LIKE ? AND relation LIKE ? AND freq > 0
  AND id=np2kp.np_id AND judgment IS NOT NULL");
$sth1->execute($ARGV[0],$ARGV[1]);
while(my @head = $sth1->fetchrow_array()){
  $head_kpr{$head[0]} .= $head[1] . ",";
}
$sth1->finish;
while (my($head,$kps) = each %head_kpr){
  my @list = split(/,/,$head_kpr{$head});
  my %diff = map { $_ => 1 } @list;
  my $kpr = keys %diff;
  $head_kpr{$head} = $kpr;
}
my $sth1a = $dbh->prepare("SELECT head,sample_frq FROM candidates
  WHERE term LIKE ? AND relation LIKE ? AND judgment IS NOT NULL
  GROUP BY np");
$sth1a->execute($ARGV[0],$ARGV[1]);
while(my @head = $sth1a->fetchrow_array()){
  $head_frq{$head[0]} += $head[1];
}
$sth1a->finish;
while (my($head,$frq) = each %head_frq){
  $head_fkpr{$head} = ($frq*$head_kpr{$head});
}

my @system_candidates_ordered = ();
my %system_candidates = ();
if ($ARGV[2] eq "frq"){
  foreach (sort { $head_frq{$b} <=> $head_frq{$a} } keys %head_frq){
    push(@system_candidates_ordered,$_);
  }
} elsif ($ARGV[2] eq "kpr"){
  foreach (sort { $head_kpr{$b} <=> $head_kpr{$a} } keys %head_kpr){
    push(@system_candidates_ordered,$_);
  }
} elsif ($ARGV[2] eq "fkpr"){
  foreach (sort { $head_fkpr{$b} <=> $head_fkpr{$a} } keys %head_fkpr){
    push(@system_candidates_ordered,$_);
  }
} elsif ($ARGV[2] =~ /.*_bnc/i){
  my %BNC = ();
  ###BNC WORD FREQUENCY LIST
  open(BNC,"<...");
  while(<BNC>){
    chomp;
    my @line = split(/[ ]+/);
    $BNC{$line[0]} = $line[1];
  }
  close(BNC);
  if ($ARGV[2] eq "frq_bnc"){
    foreach (keys %head_frq){
      if (exists($BNC{$_})){
        $system_candidates{$_} = $head_frq{$_}/log($BNC{$_});
      } else {
        $system_candidates{$_} = $head_frq{$_};
      }
    }
  } elsif ($ARGV[2] eq "kpr_bnc"){
    foreach (keys %head_kpr){
      if (exists($BNC{$_})){
        $system_candidates{$_} = $head_kpr{$_}/log($BNC{$_});
      } else {
        $system_candidates{$_} = $head_kpr{$_};
      }
    }
  } elsif ($ARGV[2] eq "fkpr_bnc"){
    foreach (keys %head_fkpr){
      if (exists($BNC{$_})){
        $system_candidates{$_} = $head_fkpr{$_}/log($BNC{$_});
      } else {
        $system_candidates{$_} = $head_fkpr{$_};
      }
    }
  }
  foreach (sort { $system_candidates{$b} <=> $system_candidates{$a} }
           keys %system_candidates){
    push(@system_candidates_ordered,$_);
  }
}

###COMPUTE MAX RECALL
my $all_correct = 0;
if ($ARGV[4] eq "y"){
  while (my($np,$judg) = each %rated_nps){
    if ($judg == 4){
      $all_correct++;
    }
  }
} else {
  while (my($np,$judg) = each %rated_nps){
    if ($judg < 1.75){
      $all_correct++;
    }
  }
}

if (-e "precision"){
  `rm precision`;
}
if (-e "recall"){
  `rm recall`;
}
if (-e "f-score"){
  `rm f-score`;
}
open(PRECISION,">>precision");
open(RECALL,">>recall");
open(F,">>f-score");

my $sth3 = "";
my($current,$correct) = (0,0);
for (my $i=0;$i<scalar(@system_candidates_ordered);$i++){
  if ($ARGV[2] =~ /^frq.*/i){
    $sth3 = $dbh->prepare("SELECT np,sum(sample_frq) FROM candidates
      WHERE term LIKE ? AND head LIKE ? AND judgment IS NOT NULL
      GROUP BY np ORDER BY sum(sample_frq) DESC");
  } elsif ($ARGV[2] =~ /^kpr.*/i){
    $sth3 = $dbh->prepare("SELECT np,sum(kp_range) FROM candidates
      WHERE term LIKE ? AND head LIKE ? AND judgment IS NOT NULL
      GROUP BY np ORDER BY sum(kp_range) DESC");
  } elsif ($ARGV[2] =~ /^fkpr.*/i){
    $sth3 = $dbh->prepare("SELECT np,sum(sample_frq)*sum(kp_range)
      AS score FROM candidates WHERE term LIKE ? AND head LIKE ?
      AND judgment IS NOT NULL GROUP BY np ORDER BY score DESC");
  }
  $sth3->execute($ARGV[0],$system_candidates_ordered[$i]);
  while (my @nps = $sth3->fetchrow_array){
    if ($ARGV[4] eq "y" && $rated_nps{$nps[0]} == 4){
      $correct++;
    } elsif ($ARGV[4] eq "n" && $rated_nps{$nps[0]} < 1.75){
      $correct++;
    }
    $current++;
    my $precision = $correct/$current;
    my $recall = $correct/$all_correct;
    my $F = 0;
    if (($precision + $recall) != 0){
      $F = 2*$precision*$recall/($precision+$recall);
    }
    printf PRECISION ("%d\t%.2f\n", $current,$precision);
    printf RECALL ("%d\t%.2f\n", $current,$recall);
    printf F ("%d\t%.2f\n", $current,$F);
    if ($current < 50){
      print $system_candidates_ordered[$i], "\t", $nps[0], "\t",
        $nps[1], "\n";
    }
  }
}
$sth3->finish();
$dbh->disconnect();

8.21 load_www_freqs_for_pmi_into_mysql.pl

#!/usr/bin/perl -w
use strict;
use DBI;
use WWW::Mechanize;

scalar(@ARGV) == 4 or die "usage: <TERM> <RELATION> <l|r|b> <MIN_FRQ>";

#RECORD THE DOWNLOAD DATE
my $dl_date = `date +%Y-%m-%d`;

#Connect to the database
my $host = 'localhost';
my $db = 'www2rel';
my $dbh = DBI->connect("dbi:mysql:$db:$host");

###GET KPS FOR TARGET RELATION AND THEIR GOOGLE FREQS
my $sth0 = $dbh->prepare("SELECT kp,id FROM patterns
  where relation like ?");
###GET ANNOTATED INSTANCES FOR TARGET RELATION
my $sth1 = $dbh->prepare("SELECT id,np FROM candidates
  WHERE relation LIKE ? AND term LIKE ? AND sample_frq > ?
  AND judgment IS NOT NULL");
###INSERT WWW-ONLY INSTANCES
my $sth2a = $dbh->prepare("INSERT INTO np2kp
  (np_id,kp_id,freq,position,observed,
  expected,dl_date,only_www) VALUES(?,?,?,?,?,?,?,?)");
###UPDATE EXISTING INSTANCES WITH WWW STATS
my $sth2b = $dbh->prepare("UPDATE np2kp SET observed=?,
  expected=?,dl_date=?,only_www=? WHERE np_id=? AND kp_id=?
  AND position like ?");
my $sth3 = $dbh->prepare("SELECT position FROM np2kp
  WHERE np_id=? AND kp_id=?");

my $mech = WWW::Mechanize->new();
###MASQUERADE AS BROWSER
$mech->agent_alias('Windows IE 6');

##################MAIN#######################
my %kps = ();
###GET KPS ONLY ONCE!###
my $kp_number = $sth0->execute($ARGV[1]);
while (my @kp = $sth0->fetchrow_array){
  ###$kp[0]: kp, $kp[1]: id
  $kps{$kp[0]} = $kp[1];
}
$sth0->finish();

my $filename = $ARGV[0] . "_log_file.txt";
open(LOG,">>$filename");

$sth1->execute($ARGV[1],$ARGV[0],$ARGV[3]);
my $instance_number = $sth1->rows();
if ($ARGV[2] eq "l"){
  &get_google_stats("l");
} elsif ($ARGV[2] eq "r"){
  &get_google_stats("r");
} elsif ($ARGV[2] eq "b"){
  &get_google_stats("l");
  &get_google_stats("r");
}
$sth1->finish();
$sth2a->finish();
$sth2b->finish();
$sth3->finish();
$dbh->disconnect();
close(LOG);

###SUBROUTINES###
sub get_google_stats {
  my $expected_query = "";
  my $observed_query = "";
  my $i = 1;
  while(my @instance = $sth1->fetchrow_array){
    ###$instance[0]: id, $instance[1]: np
    print STDERR $i, " of ", $instance_number, " ", $_[0], "\n";
    ###$_[0]: l|r
    if ($_[0] =~ /l/i){
      $expected_query = "\"" . $ARGV[0] . " * " . $instance[1] . "\"";
    } elsif ($_[0] =~ /r/i){
      $expected_query = "\"" . $instance[1] . " * " . $ARGV[0] . "\"";
    }
    my $url = "http://www.google.com/search?q=allintext:$expected_query";
    my $expected = 0;
    $mech->get($url);
    my $text = $mech->content();
    if ($text =~ /of about <b>([0-9,]+)<\/b> for/sig){
      $expected = $1;
      $expected =~ tr/,//d;
    }
    foreach my $kp (keys %kps){
      if ($_[0] =~ /l/i){
        $observed_query = "\"" . $ARGV[0] . " " . $kp . " " .
          $instance[1] . "\"";
      } elsif ($_[0] =~ /r/i){
        $observed_query = "\"" . $instance[1] . " " . $kp . " " .
          $ARGV[0] . "\"";
      }
      $url = "http://www.google.com/search?q=allintext:$observed_query";
      my $observed = 0;
      $mech->get($url);
      $text = $mech->content();
      if ($text =~ /of about <b>([0-9,]+)<\/b> for/sig){
        $observed = $1;
        $observed =~ tr/,//d;
      }
      #CHECK IF KP AND POSITION HAS BEEN USED OR NOT
      $sth3->execute($instance[0],$kps{$kp});
      my $rows = $sth3->rows();
      my @pos = $sth3->fetchrow_array();
      ###STORE FREQS IN DATABASE!
      if ($rows == 0){
        ###INSERT FREQS
        $sth2a->execute($instance[0],$kps{$kp},0,
          $_[0],$observed,$expected,$dl_date,1);
        print STDERR "Inserting, 0 rows\n";
      } elsif ($rows == 1 && $pos[0] !~ /$_[0]/i){
        ###INSERT FREQS
        $sth2a->execute($instance[0],$kps{$kp},1,
          $_[0],$observed,$expected,$dl_date,1);
        print STDERR "Inserting, 1 row\n";
      } else {
        ###UPDATE FREQS
        $sth2b->execute($observed,$expected,
          $dl_date,0,$instance[0],$kps{$kp},$_[0]);
        print STDERR "Updating, ", $rows, " rows\n";
      }
      ###PRINT DETAILS TO USER!
      print STDERR $observed, "\t", $observed_query, "\n";
      print STDERR $expected, "\t", $expected_query, "\n";
      print LOG $observed_query, ",", $observed, ",",
        $expected, ",", $kps{$kp}, "\n";
    }
    $i++;
  }
}
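For each annotated instance the script thus stores two Google page counts which the PMI scripts below combine: observed, the count for the exact phrase "term KP np", and expected, the count for "term * np", where Google's * wildcard stands for one arbitrary word. The former estimates the joint frequency of pattern and argument pair, the latter the frequency of the argument pair with any single connector.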

8.22 compute_sample_pmi_PRF.pl

#!/usr/bin/perl -w
use strict;
use DBI qw(:sql_types);

scalar(@ARGV) == 6 or die
  "usage: <TERM> <RELATION> <l|r|b> <pmi|pmi2> <4_as_1:y|n> <corr_if_4:y|n>";

#Connect to the database
my $host = 'localhost';
my $db = 'www2rel';
my $dbh = DBI->connect("dbi:mysql:$db:$host");

###LOAD ALL CANDIDATES INTO HASH AND ADD AVE. JUDGMENT
my %rated_candidates = ();
my $sth1 = $dbh->prepare("SELECT id,judgment FROM candidates
  WHERE term like ? AND relation like ? AND judgment IS NOT NULL
  GROUP BY np");
my $missing = 0;
$sth1->execute($ARGV[0],$ARGV[1]);
while(my @candidate = $sth1->fetchrow_array){
  my @ratings = split(/,/,$candidate[1]);
  ###INSERT "DON'T KNOW" IF JUDGMENT IS MISSING!
  while(scalar(@ratings) < 4){
    push(@ratings,"2");
    $missing++;
  }
  ###STORE BY AVERAGE RATING!
  my $sum_rating = 0;
  foreach (@ratings){
    if ($ARGV[4] eq "y" && $_ =~ /4/){
      $sum_rating += 1;
    } else {
      $sum_rating += $_;
    }
  }
  $rated_candidates{$candidate[0]} = sprintf("%.2f",($sum_rating/4));
}
$sth1->finish();

###GET KPs FOR RELATION!
my $sth2 = $dbh->prepare("SELECT id FROM patterns WHERE relation LIKE ?");
my %kps = ();
$sth2->execute($ARGV[1]);
while (my @kp = $sth2->fetchrow_array){
  $kps{$kp[0]} = 0;
}
$sth2->finish();

###GET KP EXPECTED FREQS
my $sth3 = $dbh->prepare("SELECT sum(freq) FROM np2kp
  WHERE kp_id = ? AND only_www = 0");
foreach (keys %kps){
  $sth3->execute($_);
  my @kp_freq = $sth3->fetchrow_array();
  $kps{$_} = $kp_freq[0];
}
$sth3->finish();

###GET OBSERVED FREQS AND NP EXPECTED FREQS
my $sth4 = "";
if ($ARGV[2] eq "b"){
  $sth4 = $dbh->prepare("SELECT freq,sample_frq FROM candidates,np2kp
    WHERE np_id=? AND id=np_id AND kp_id=? AND only_www like '0'");
} else {
  $sth4 = $dbh->prepare("SELECT freq,sample_frq FROM candidates,np2kp
    WHERE np_id=? AND id=np_id AND kp_id=? AND position like ?
    AND only_www like '0'");
}

my %candidate_pmi = ();
my @pmis = ();
###LOOP THROUGH CANDIDATES AND COMPUTE PMI!
while (my($np_id,$rating) = each %rated_candidates){
  while (my($kp_id,$kp_frq) = each %kps){
    if ($ARGV[2] eq "b"){
      $sth4->execute($np_id,$kp_id);
    } else {
      $sth4->execute($np_id,$kp_id,$ARGV[2]);
    }
    while (my @stats = $sth4->fetchrow_array){
      if ($sth4->rows() == 0 || $kp_frq == 0){
        next;
      }
      #$stats[0]: observed, $stats[1]: expected
      my $instance_pmi = 0;
      if ($ARGV[3] eq "pmi"){
        $instance_pmi = log($stats[0]/($stats[1]*$kp_frq));
      } elsif ($ARGV[3] eq "pmi2"){
        $instance_pmi = log(($stats[0]*$stats[0])/($stats[1]*$kp_frq));
      }
      print STDERR $instance_pmi, "\t", $np_id, "\t", $kp_id, "\n";
      $candidate_pmi{$np_id} += $instance_pmi;
      push(@pmis,$instance_pmi);
    }
  }
}
$sth4->finish();

my @pmi_max = sort {$b <=> $a} @pmis;
while ($pmi_max[0] == 0){
  shift(@pmi_max);
}
print STDERR $pmi_max[0], " max pmi!\n\n";

my %temp = ();
my $sth5 = $dbh->prepare("SELECT kp_range FROM candidates WHERE id=?");
###ORDER CANDIDATES BY PMI
while (my($np_id,$pmi_sum) = each %candidate_pmi){
  $sth5->execute($np_id);
  my @kpr = $sth5->fetchrow_array();
  my $pmi_score = ($pmi_sum/$pmi_max[0])/$kpr[0];
  $temp{$np_id} = $pmi_score;
}
my @system_candidates_ordered = sort { $temp{$b} <=> $temp{$a} } keys %temp;
$sth5->finish();
$dbh->disconnect();

###COMPUTE MAX RECALL
my $all_correct = 0;
foreach (keys %rated_candidates){
  if ($ARGV[5] eq "y"){
    if ($rated_candidates{$_} == 4){
      $all_correct++;
    }
  } else {
    if ($rated_candidates{$_} < 1.75){
      $all_correct++;
    }
  }
}
print $all_correct, " correct candidates\n";

if (-e "precision"){
  `rm precision`;
}
if (-e "recall"){
  `rm recall`;
}
if (-e "f-score"){
  `rm f-score`;
}
open(PRECISION,">>precision");
open(RECALL,">>recall");
open(F,">>f-score");

my $correct = 0;
if ($ARGV[5] eq "y"){
  for (my $i=0;$i<scalar(@system_candidates_ordered);$i++){
    if ($rated_candidates{$system_candidates_ordered[$i]} == 4){
      $correct++;
    }
    my $precision = $correct/($i+1);
    my $recall = $correct/$all_correct;
    my $F = 0;
    if (($precision+$recall) != 0){
      $F = 2*$precision*$recall/($precision+$recall);
    }
    printf PRECISION ("%d\t%.2f\n", $i+1,$precision);
    printf RECALL ("%d\t%.2f\n", $i+1,$recall);
    printf F ("%d\t%.2f\n", $i+1,$F);
  }
} else {
  for (my $i=0;$i<scalar(@system_candidates_ordered);$i++){
    if ($rated_candidates{$system_candidates_ordered[$i]} < 1.75){
      $correct++;
    }
    my $precision = $correct/($i+1);
    my $recall = $correct/$all_correct;
    my $F = 0;
    if (($precision+$recall) != 0){
      $F = 2*$precision*$recall/($precision+$recall);
    }
    printf PRECISION ("%d\t%.2f\n", $i+1,$precision);
    printf RECALL ("%d\t%.2f\n", $i+1,$recall);
    printf F ("%d\t%.2f\n", $i+1,$F);
  }
}
close(PRECISION);
close(RECALL);
close(F);
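The relevance score computed above can be stated as follows (notation added here): for candidate n and pattern p, with f(n,p) the observed co-occurrence frequency, f(n) the candidate's total sample frequency and f(p) the pattern's total frequency,

  pmi(n,p)  = log( f(n,p) / (f(n) * f(p)) )
  pmi2(n,p) = log( f(n,p)^2 / (f(n) * f(p)) )

The per-pattern values are summed over all patterns, normalized by the largest observed single value, and divided by the candidate's kp_range.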

8.23 compute_pmi_PRF.pl

#!/usr/bin/perl -w
use strict;
use DBI qw(:sql_types);

scalar(@ARGV) == 6 or die
  "usage: <TERM> <RELATION> <pmi|pmi2|pmi3> <MIN_FRQ> <4_as_1:y|n> <corr_if_4:y|n>";

#Connect to the database
my $host = 'localhost';
my $db = 'www2rel';
my $dbh = DBI->connect("dbi:mysql:$db:$host");

###LOAD CANDIDATES INTO HASH, ADD AVE. JUDGMENT
my %rated_candidates = ();
my $sth1 = $dbh->prepare("SELECT np,judgment FROM candidates
  WHERE term LIKE ? AND relation LIKE ? AND sample_frq > ?
  AND judgment IS NOT NULL");
my $missing = 0;
$sth1->execute($ARGV[0],$ARGV[1],$ARGV[3]);
while(my @candidate = $sth1->fetchrow_array){
  my @ratings = split(/,/,$candidate[1]);
  ###INSERT "DON'T KNOW" IF JUDGMENT IS MISSING
  while(scalar(@ratings) < 4){
    push(@ratings,"2");
    $missing++;
  }
  ###STORE BY AVERAGE RATING!
  my $sum_rating = 0;
  foreach (@ratings){
    if ($ARGV[4] eq "y" && $_ =~ /4/){
      $sum_rating += 1;
    } else {
      $sum_rating += $_;
    }
  }
  $rated_candidates{$candidate[0]} = sprintf("%.2f",($sum_rating/4));
}
$sth1->finish();

###LOAD ALL CANDIDATES TO COMPUTE PMI
my $sth2 = $dbh->prepare("SELECT np,observed,expected,
  patterns.google_freq,kp_id,position FROM candidates,
  np2kp,patterns WHERE candidates.id=np2kp.np_id AND
  np2kp.kp_id=patterns.id AND term LIKE ? AND candidates.
  relation LIKE ? AND sample_frq > ? AND judgment IS NOT NULL");
$sth2->execute($ARGV[0],$ARGV[1],$ARGV[3]);

my %candidate_pmi = ();
my %kp_count = ();
my @pmis = ();
while(my @cand = $sth2->fetchrow_array){
  if ($cand[1] > 0 && $cand[2] > 0){
    my $instance_pmi = 0;
    if ($ARGV[2] eq "pmi"){
      $instance_pmi = log($cand[1]/($cand[2]*$cand[3]));
    } elsif ($ARGV[2] eq "pmi2"){
      $instance_pmi = log(($cand[1]*$cand[1])/($cand[2]*$cand[3]));
    } elsif ($ARGV[2] eq "pmi3"){
      $instance_pmi = log(($cand[1]*$cand[1]*$cand[1])/
        ($cand[2]*$cand[3]));
    }
    $candidate_pmi{$cand[0]} += $instance_pmi;
    my $kp = $cand[4] . "_" . $cand[5];
    $kp_count{$kp}++;
    push(@pmis,$instance_pmi);
  }
}
$sth2->finish();
$dbh->disconnect();

my @pmi_max = sort {$b <=> $a} @pmis;
my $kp_no = scalar(keys %kp_count);
my %temp = ();
###ORDER CANDIDATES BY PMI
while (my($np,$pmi_sum) = each %candidate_pmi){
  my $pmi_score = ($pmi_sum/$pmi_max[0])/$kp_no;
  $temp{$np} = $pmi_score;
}
my @system_candidates_ordered = sort { $temp{$b} <=> $temp{$a} } keys %temp;

###COMPUTE MAX RECALL
my $all_correct = 0;
foreach (keys %rated_candidates){
  if ($ARGV[5] eq "y"){
    if ($rated_candidates{$_} == 4){
      $all_correct++;
    }
  } else {
    if ($rated_candidates{$_} < 1.75){
      $all_correct++;
    }
  }
}
print $all_correct, " correct candidates\n";

if (-e "precision"){
  `rm precision`;
}
if (-e "recall"){
  `rm recall`;
}
if (-e "f-score"){
  `rm f-score`;
}
open(PRECISION,">>precision");
open(RECALL,">>recall");
open(F,">>f-score");

my $correct = 0;
if ($ARGV[5] eq "y"){
  for (my $i=0;$i<scalar(@system_candidates_ordered);$i++){
    if ($rated_candidates{$system_candidates_ordered[$i]} == 4){
      $correct++;
    }
    my $precision = $correct/($i+1);
    my $recall = $correct/$all_correct;
    my $F = 0;
    if (($precision+$recall) != 0){
      $F = 2*$precision*$recall/($precision+$recall);
    }
    printf PRECISION ("%d\t%.2f\n", $i+1,$precision);
    printf RECALL ("%d\t%.2f\n", $i+1,$recall);
    printf F ("%d\t%.2f\n", $i+1,$F);
  }
} else {
  for (my $i=0;$i<scalar(@system_candidates_ordered);$i++){
    if ($rated_candidates{$system_candidates_ordered[$i]} < 1.75){
      $correct++;
    }
    my $precision = $correct/($i+1);
    my $recall = $correct/$all_correct;
    my $F = 0;
    if (($precision+$recall) != 0){
      $F = 2*$precision*$recall/($precision+$recall);
    }
    printf PRECISION ("%d\t%.2f\n", $i+1,$precision);
    printf RECALL ("%d\t%.2f\n", $i+1,$recall);
    printf F ("%d\t%.2f\n", $i+1,$F);
  }
}
close(PRECISION);
close(RECALL);
close(F);
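The pmi2 and pmi3 options raise the observed count to the second or third power before taking the logarithm:

  pmi_k(n,p) = log( observed^k / (expected * google_freq(p)) ),  k = 1, 2, 3

Squared and cubed variants of mutual information are a common device in the collocation extraction literature for dampening plain pointwise MI's bias towards very rare events.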

8.24 load_passive_kpr_from_www_into_mysql.pl

#!/usr/bin/perl -w
use strict;
use DBI qw(:sql_types);
use WWW::Mechanize;

scalar(@ARGV) == 4 or die "usage: <TERM> <RELATION> <MIN_FRQ> <l|r>";

#Connect to the database
my $host = 'localhost';
my $db = 'www2rel';
my $dbh = DBI->connect("dbi:mysql:$db:$host");

###LOAD THE PASSIVE KPs
my %pats = ();
my $sth2 = $dbh->prepare("SELECT kp,id FROM patterns
  WHERE relation like 'induced_by'");
$sth2->execute();
while(my @kp = $sth2->fetchrow_array){
  print STDERR $kp[0], "\n";
  $pats{$kp[0]} = $kp[1];
}
$sth2->finish();

my $mech = WWW::Mechanize->new();
$mech->agent_alias('Windows IE 6');

###GET THE CANDIDATES
my $sth1 = $dbh->prepare("SELECT np,id FROM candidates
  WHERE term like ? AND relation like ? AND sample_frq > ?
  AND judgment IS NOT NULL");
###UPDATE DB WITH PASSIVE KPR
my $sth0 = $dbh->prepare("UPDATE candidates SET
  kp_range_passive=?,passive_www_frq=? WHERE id=?");
###REMEMBER NP-KP MAPPINGS FOR PRECISION CALCULATIONS
my $sth3 = $dbh->prepare("INSERT INTO np2kp (np_id,kp_id,
  freq,position) VALUES (?,?,?,?)");

$sth1->execute($ARGV[0],$ARGV[1],$ARGV[2]);
while(my @np = $sth1->fetchrow_array){
  my $passive_www_kpr = 0;
  my $passive_www_frq = 0;
  foreach my $kp (keys %pats){
    ###FORM GOOGLE QUERY
    my $query = "";
    if ($ARGV[3] eq "r"){
      $query = "\"" . $np[0] . " " . $kp . " " . $ARGV[0] . "\"";
    } elsif ($ARGV[3] eq "l"){
      $query = "\"" . $ARGV[0] . " " . $kp . " " . $np[0] . "\"";
    }
    print STDERR $query, "\n";
    my $url = "http://www.google.com/search?q=allintext:$query";
    $mech->get($url);
    my $text = $mech->content();
    my $hits = 0;
    if ($text =~ /of about <b>([0-9,]+)<\/b> for/sig){
      $hits = $1;
      $hits =~ tr/,//d;
      $passive_www_frq += $hits;
      $passive_www_kpr++;
    }
    $sth3->execute($np[1],$pats{$kp},$hits,$ARGV[3]);
    print STDERR "\t", $hits, "\n";
  }
  $sth0->execute($passive_www_kpr,$passive_www_frq,$np[1]);
  print STDERR "------\n", $np[0], "\t", $passive_www_kpr, "\t",
    $passive_www_frq, "\n";
}
$sth0->finish();
$sth1->finish();
$sth3->finish();
$dbh->disconnect();

8.25 measure_real_PRF_of_all_KPs.pl

#!/usr/bin/perl -w
use strict;
use DBI qw(:sql_types);

scalar(@ARGV) == 6 or die
  "usage: <TERM> <RELATION> <l|r> <ALL_TERMS:y|n> <MAX_AVE_RATING> <4_as_1:y|n>";

#Connect to the database
my $host = 'localhost';
my $db = 'www2rel';
my $dbh = DBI->connect("dbi:mysql:$db:$host");

my %all_correct = ("aspirin" => 148,
                   "selenium" => 50,
                   "vomiting" => 59,
                   "emesis" => 25,
                   "formaldehyde" => 4,
                   "vitamin c" => 5,
                   "lactose" => 1,
                   "glucose" => 4,
                   "progesterone" => 2,
                   "antipsychotic" => 88,
                   "haloperidol" => 57,
                   "both_isa" => 145,
                   "isa_hyper_sing_hypo" => 100,
                   "isa_hyper_plur_hypo" => 70,
                   "isa_hypo_hyper_sing" => 71,
                   "isa_hypo_hyper_plur" => 55,
                   "all_synonyms" => 16,
                   "all_induces" => 232,
                  );

my $sth1 = "";
my $sth2 = "";
my $sth3 = "";
my %positives = ();
my %negatives = ();
if ($ARGV[3] eq "y"){
  if ($ARGV[1] eq "isa"){
    $sth1 = $dbh->prepare("SELECT judgment,id FROM candidates
      WHERE relation like 'isa_%' AND judgment IS NOT NULL");
    $sth1->execute();
  } else {
    $sth1 = $dbh->prepare("SELECT judgment,id FROM candidates
      WHERE relation like ? AND judgment IS NOT NULL");
    $sth1->execute($ARGV[1]);
  }
  $sth2 = $dbh->prepare("SELECT kp_id,freq FROM np2kp
    WHERE np_id=? AND only_www LIKE '0'");
} elsif ($ARGV[3] eq "n"){
  $sth1 = $dbh->prepare("SELECT judgment,id FROM candidates
    WHERE term like ? AND relation like ? AND judgment IS NOT NULL");
  if ($ARGV[1] =~ /isa.*/i){
    $sth2 = $dbh->prepare("SELECT kp_id,freq FROM np2kp
      WHERE np_id=? AND only_www LIKE '0'");
  } else {
    $sth2 = $dbh->prepare("SELECT kp_id,freq FROM np2kp
      WHERE np_id=? AND only_www LIKE '0' AND position like ?");
  }
  $sth1->execute($ARGV[0],$ARGV[1]);
}

my %recall = ();
my %precision_pos = ();
my %precision_neg = ();
###COMPUTE AVE. JUDGMENTS FOR ALL RELEVANT NPs
while(my @candidate = $sth1->fetchrow_array){
  my @ratings = split(/,/,$candidate[0]);
  my $total = 0;
  foreach (@ratings){
    if ($ARGV[5] eq "y" && $_ =~ /4/){
      $total++;
    } else {
      $total += $_;
    }
  }
  my $ave = $total/4;
  ###RECORD WITH WHICH KPS THIS NP OCCURS
  if ($ARGV[3] eq "y" || $ARGV[1] =~ /isa.*/i){
    $sth2->execute($candidate[1]);
  } else {
    $sth2->execute($candidate[1],$ARGV[2]);
  }
  while (my @kps = $sth2->fetchrow_array){
    if ($ave <= $ARGV[4]){
      #ASSOCIATE POS. CANDIDATES WITH THEIR KPS
      $recall{$kps[0]} .= $candidate[1] . "\t";
      $precision_pos{$kps[0]} += $kps[1];
    } else {
      $precision_neg{$kps[0]} += $kps[1];
    }
  }
}
$sth1->finish();
$sth2->finish();

if ($ARGV[0] eq "both_isa" && $ARGV[1] eq "isa" && $ARGV[3] eq "y"){
  $sth3 = $dbh->prepare("SELECT id,kp FROM patterns
    WHERE relation like 'isa_%'");
  $sth3->execute();
} else {
  $sth3 = $dbh->prepare("SELECT id,kp FROM patterns
    WHERE relation like ?");
  $sth3->execute($ARGV[1]);
}

###LOOP THROUGH ALL KPS FOR CANDIDATES AND COMPUTE P,R,F
while (my @kps = $sth3->fetchrow_array){
  ###FORMAT: kp_id, kp
  if (exists($recall{$kps[0]})){
    $recall{$kps[0]} =~ s/\t$//;
    my @tmp = split(/\t/,$recall{$kps[0]});
    my %correct = map { $_ => 1 } @tmp;
    my $correct_returned = scalar(keys %correct);
    #COMPUTE PREC., RECALL AND F-SCORE FOR THIS KP
    my $rec = $correct_returned/$all_correct{$ARGV[0]};
    my $prec = 0;
    if (!exists($precision_neg{$kps[0]})){
      $prec = 1;
    } else {
      $prec = $precision_pos{$kps[0]}/
        ($precision_pos{$kps[0]}+$precision_neg{$kps[0]});
    }
    my $F = 2*$prec*$rec/($prec+$rec);
    printf("%.2f\t%.2f\t%.2f\t%s_%d\n", $F, $prec, $rec,
      $kps[1], $kps[0]);
  } else {
    if (!exists($precision_neg{$kps[0]})){
      print "NA\tNA\tNA\t", $kps[1], "_", $kps[0], "\n";
    } else {
      print "0\t0\t0\t", $kps[1], "_", $kps[0], "\n";
    }
  }
}
$sth3->finish();
$dbh->disconnect();
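Per pattern, the printed scores follow the usual definitions, with precision weighted by match frequency (notation added here): for a pattern p,

  P(p) = f_pos(p) / (f_pos(p) + f_neg(p))
  R(p) = |distinct correct NPs matched by p| / |all correct NPs|
  F(p) = 2 * P(p) * R(p) / (P(p) + R(p))

where f_pos(p) and f_neg(p) are the summed match frequencies of p with correct and incorrect instances, and the gold totals per term/relation are hard-coded in %all_correct.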

8.26 UMLS2term_pairs.pl

#!/usr/bin/perl
# Last edited by Jakob Halskov
# on June 16 2006
#
use DBI;
use strict;

scalar(@ARGV) == 4 or die "Usage: <RELA> <CUI> <up|down> <INGREDIENTS:y|n>";

#Connect to the database
my $host = 'localhost';
my $db = 'UMLS';
my $user = 'jakob';
my $pass = 'halskov';
my $dbh = DBI->connect("dbi:mysql:$db:$host", "$user", "$pass");

###FETCH THE CUIS FOR THE GIVEN RELATION
my $sth1 = "";
if ($ARGV[2] eq "up"){
  $sth1 = $dbh->prepare("SELECT CUI1 FROM MRREL WHERE RELA LIKE ?
    AND CUI2 LIKE ? GROUP BY CUI1");
} elsif ($ARGV[2] eq "down"){
  $sth1 = $dbh->prepare("SELECT CUI2 FROM MRREL WHERE RELA LIKE ?
    AND CUI1 LIKE ? GROUP BY CUI2");
}
##########FETCH THE TRADE NAMES OF CONCEPT IF ANY
my $sth2 = $dbh->prepare("SELECT CUI1 FROM MRREL
  WHERE RELA LIKE 'has_tradename' AND CUI2 LIKE ? GROUP BY CUI1");
##########FETCH THE ACTIVE INGREDIENTS IF NEEDED
my $sth2b = $dbh->prepare("SELECT CUI1 FROM MRREL
  WHERE RELA LIKE 'has_ingredient' AND CUI2 LIKE ? GROUP BY CUI1");
##########FETCH THE TERM VARIANTS FOR EACH CUI PAIR
my $sth3 = $dbh->prepare("SELECT STR FROM MRCONSO WHERE CUI LIKE ?");

$sth1->execute($ARGV[0],$ARGV[1]);
while (my @cui = $sth1->fetchrow_array){
  $sth3->execute($cui[0]);
  while (my @vars = $sth3->fetchrow_array){
    print $cui[0], "_", lc($vars[0]), "\n";
  }
  ###PRINT ALSO ACTIVE INGREDIENTS?
  if ($ARGV[3] eq "y"){
    $sth2b->execute($cui[0]);
    if ($sth2b->rows() > 0){
      while (my @ingr = $sth2b->fetchrow_array){
        $sth3->execute($ingr[0]);
        while (my @names = $sth3->fetchrow_array){
          print $cui[0], "_", lc($names[0]), "\n";
        }
      }
    }
  }
  ###ANY TRADE NAMES??
  $sth2->execute($cui[0]);
  if ($sth2->rows() > 0){
    while (my @names = $sth2->fetchrow_array){
      $sth3->execute($names[0]);
      while (my @vars = $sth3->fetchrow_array){
        print $cui[0], "_", lc($vars[0]), "\n";
      }
    }
  }
}
print STDERR $sth1->rows(), " concepts!\n";
$sth1->finish();
$sth2->finish();
$sth2b->finish();
$sth3->finish();
$dbh->disconnect();

8.27 Induces training pairs

term1-term2
candida_albicans-allergy
alcohol-unconsciousness
plague-headache
alcohol-vomiting
sodium-unconsciousness
phenylephrine-mydriasis
alcohol-flushing
alcohol-diarrhea
co2-shaking
sodium-allergy
co2-respiratory_acidosis
niacinamide-flushing
ipecac_syrup-vomiting
lactose-diarrhea
carbon_dioxide-death
chorionic_gonadotropin-pregnancy
aspergillus-allergy
vitamin_c-diarrhea
calcium-diarrhea
thiopental-unconsciousness
sodium-vomiting
oxygen-retinopathy_of_prematurity
alcohol-amnesia
calcium-hyperoxaluria
co2-death
nicotinic_acid-flushing
methylene_blue-methemoglobinemia
atropine-mydriasis
niacin-flushing
ipecac-vomiting
niaspan-flushing
propofol-unconsciousness
glucose-allergy
carbon_dioxide-suffocation
candida-allergy
carbon_dioxide-respiratory_acidosis
nitrous_oxide-nausea
aspirin-diarrhea
syrup_of_ipecac-vomiting
scopolamine-amnesia

8.28 May_prevent training pairs

term1-term2
calcium-magnesium_deficiency
prilocaine-pain
metoclopramide-nausea
psyllium-constipation
calcium-plaque
pegfilgrastim-neutropenia
flunisolide-asthma
propofol-pain
ondansetron-vomiting
bupivacaine-pain
amantadine-flu
oseltamivir-flu
ether-pain
t4-hypothyroidism
mineral_oil-constipation
melatonin-jet_lag
rabies_vaccines-rabies
enoxaparin-thromboembolism
fluoride-plaque
thimerosal-flu
ketamine-pain
estrogen-postmenopausal_osteoporosis
insulin-hyperglycemia
mefloquine-malaria
cimetidine-duodenal_ulcer
zanamivir-flu
fluoride-tooth_decay
ganciclovir-cytomegalovirus_infection
disulfiram-alcoholism
levonorgestrel-pregnancy
granisetron-nausea
warfarin-stroke
echinacea-cold
botulinum_toxin-migraine
heparin-pulmonary_embolism
calcium-osteoporosis
ibuprofen-pain
beclomethasone-asthma
adrenalin-pain
dexamethasone-vomiting

8.29 Synonymy training pairs

term1 <=> term2
spontaneous abortion <=> miscarriage
diarrhea <=> loose stools
dyspnea <=> breathlessness
dyspnea <=> difficulty breathing
fever <=> increased body temperature
fever <=> hyperthermia
fever <=> pyrexia
fever <=> temperature increase
immediate hypersensitivity <=> allergy
hypotension <=> decreased blood pressure
hypotension <=> low blood pressure
kidney failure <=> renal failure
miosis <=> pupillary constriction
mydriasis <=> dilation of the pupil
pregnancy <=> gestation
retinopathy of prematurity <=> retrolental fibroplasia
RLF <=> retrolental fibroplasia
vomiting <=> emesis
myocardial infarction <=> heart attack
pruritus <=> itching

8.30 ISA training pairs

8.30.1 Plural hypernym - hyponym

term pair
analgesics;gabapentin
analgesics;ketamine
analgesics;nonopioid
antipyretics;aspirin
antipyretics;paracetamol
anesthetics;cyclopropane
anticonvulsant drugs;carbamazepine
anticonvulsant drugs;phenytoin
anticonvulsants;carbamazepine
anticonvulsants;diphenylhydantoin
anticonvulsants;divalproex
anticonvulsants;ethosuximide
anticonvulsants;etiracetam
anticonvulsants;gabapentin
anticonvulsants;lamotrigine
anticonvulsants;levetiracetam
anticonvulsants;primidone
anticonvulsants;topiramate
anticonvulsants;valproic acid
anticonvulsants;zonisamide
antidepressants;lamotrigine
antidepressants;moclobemide
antidepressants;sibutramine
antidepressants;tricyclic antidepressant
antidepressive agents;sertraline
anticonvulsants;clonazepam
antiparkinson agents;levodopa
antiparkinson agents;apomorphine
antiparkinsonian agents;levodopa
antidepressants;phenelzine
gastrointestinal agents;cisapride
gastrointestinal agents;infliximab
gastrointestinal agents;octreotide
gastrointestinal agents;antacids
gastrointestinal drugs;antacids
gastrointestinal agents;lactulose
cardiovascular agents;diuretics
cardiovascular drugs;diuretics
cardiovascular agents;calcium channel blockers
cardiovascular agents;diltiazem

8.30.2 Singular hypernym - hyponym

term pair
analgesic;adenosine
analgesic;dalargin
analgesic;flurbiprofen
analgesic;gabapentin
analgesic;ketamine
analgesic;nonopioid
anticonvulsant;diazepam
anticonvulsant;ethosuximide
anticonvulsant;felbamate
anticonvulsant;gabapentin
anticonvulsant;lorazepam
anticonvulsant;mephenytoin
anticonvulsant;melatonin
anticonvulsant;phenobarbital
anticonvulsant;propofol
anticonvulsant;topiramate
anticonvulsant;valproate
anticonvulsant;vigabatrin
antidepressant;duloxetine
antidepressant;lamotrigine
antidepressant;moclobemide
antidepressant;rolipram
antidepressant;sertraline
antidepressant;tricyclic antidepressant
antipyretic;paracetamol
analgesic-clonidine
antidepressant-phenelzine
antidepressant-reboxetine
anticonvulsant-phenobarbitone
anesthetic-cyclopropane
vitamin-tocopherol
vitamin-cholecalciferol
vitamin-retinol
vitamin-ascorbic acid
vitamin-phytonadione
vitamin-menaquinone
vitamin-calcitriol
vitamin-meadione
vitamin-bioflavonoid
vitamin-calcifediol

8.30.3 Hyponym - plural hypernym

term pair
clonazepam;anticonvulsants
gabapentin;anticonvulsants
lamotrigine;anticonvulsants
moclobemide;antidepressants
phenobarbital;anticonvulsants
phenytoin;anticonvulsants
reboxetine;antidepressants
tranylcypromine;antidepressants
valproic acid;anticonvulsants
lamotrigine;antidepressants
sertraline;antidepressants
duloxetine;antidepressants
phenelzine;antidepressants
milnacipran;antidepressants
valproate;anticonvulsants
carbamazepine;anticonvulsants
opioid analgesics;analgesics
nonopioid analgesics;analgesics
ketamine;analgesics
paracetamol;antipyretics
cyclopropane;anesthetics
lithium carbonate;antidepressants
tricyclic antidepressant;antidepressants
aspirin;antipyretics
gabapentin;analgesics
calcium channel blockers;cardiovascular agents
bioflavonoids;vitamins
phenytoin;anticonvulsant drugs
topiramate;anticonvulsants
sibutramine;antidepressants
clonidine;analgesics
vitamin e;vitamins
c vitamin;vitamins
a vitamin;vitamins
tocopherol;vitamins
cholecalciferol;vitamins
d vitamin;vitamins
vitamin k 3;vitamins
ascorbic acid;vitamins
vitamin d;vitamins
vitamin e;vitamins

8.30.4 Hyponym - singular hypernym

term pair
acetaminophen;antipyretic
carbamazepine;anticonvulsant
clonazepam;anticonvulsant
clonidine;analgesic
d 23129;anticonvulsant
diazepam;anticonvulsant
divalproex;anticonvulsant
felbamate;anticonvulsant
gabapentin;analgesic
ibuprofen;antipyretic
ketamine;analgesic
lamotrigine;analgesic
milnacipran;antidepressant
moclobemide;antidepressant
nonopioid;analgesic
oxcarbazepine;anticonvulsant
paracetamol;antipyretic
phenytoin;anticonvulsant
pregabalin;anticonvulsant
sertraline;antidepressant
sibutramine;antidepressant
topiramate;anticonvulsant
trimethadione;anticonvulsant
valproate;anticonvulsant
vigabatrin;anticonvulsant
gabapentin;anticonvulsant
phenobarbital;anticonvulsant
duloxetine;antidepressant
melatonin;anticonvulsant
primidone;anticonvulsant
cisapride;gastrointestinal agent
cisapride;gastrointestinal drug
motilin;gastrointestinal agent
ascorbic acid;vitamin
tocopherol;vitamin
retinol;vitamin
cholecalciferol;vitamin
menadine;vitamin
tocotrienol;vitamin
phytomenadione;vitamin

8.31 Unfiltered "may_prevent" patterns precision scores

Ave. precision  Ave. frequency  KP
100.00  444.1  +to +prevent
100.00  326.3  +for +relieving
100.00  307.9  +for +preventing
100.00  281.1  +helps +prevent
100.00  260.6  +relieves
100.00  185.2  +in +preventing
100.00  133.3  +cuts
100.00  127.8  +prevents
100.00  114.0  +relieve
100.00  59.5  +to +relieve
100.00  57.5  +to +reduce
100.00  37.0  +for +treating
100.00  29.2  +decreases
100.00  28.0  +help +prevent
100.00  23.7  +in +treating
100.00  22.9  +prevent
100.00  22.2  +reduced
100.00  22.1  +to +treat
100.00  20.2  +reduce
100.00  18.4  +to +control
100.00  15.4  +to +combat
100.00  13.1  +prevented
100.00  12.9  +containing
100.00  12.2  +based
100.00  11.2  +decreased
100.00  9.1  +may +ease
100.00  8.1  +improves
100.00  7.3  +can +prevent
100.00  6.8  +will +prevent
100.00  6.3  +to +alleviate
100.00  5.7  +affects
100.00  5.4  +significantly +reduces, +relieved, +can +relieve
100.00  5.3  +helps
100.00  5.1  +combats, +alleviates
100.00  5.0  +protects +against
100.00  4.0  +use +in
100.00  3.9  +used +for
100.00  3.8  +preventing, +at +preventing
100.00  3.7  +diminishes
100.00  3.3  +that +prevents
100.00  3.2  +had +significantly +less
100.00  3.0  +inhibits
100.00  2.7  +eliminates
100.00  2.6  +significantly +reduced, +has
100.00  2.5  +attenuates
100.00  2.3  +treated, +lowers
100.00  2.1  +in +relieving, +in +fighting, +group +had
100.00  2.0  +fighting
100.00  1.9  +provides
100.00  1.8  +provided, +in +decreasing
100.00  1.7  +group +experienced
100.00  1.6  +makes
100.00  1.4  +could +reduce
100.00  1.2  +will +relieve, +minimizes, +may +prevent
100.00  1.0  +would +decrease, +developed
100.00  1.0  +remedy, +was, +group +reported
100.00  0.9  +lessens, +helps +relieve
100.00  0.7  +reverses, +heals
100.00  0.6  +to +kill, +to +cure, +deters, +following
100.00  0.5  +numbs, significantly +decreased, +cut
100.00  0.5  +does +not +cure
100.00  0.4  +inhibited, to +suppress, +ease, +caused
100.00  0.3  +eases, +can +cure, +also +relieves, +eased
100.00  0.2  +works +against, +therapy +prevents, +provide
100.00  0.2  +may +decrease, +after +developing, +showing
100.00  0.2  +for +decreasing, +can, +alleviated
60.92  373.5  +reduces
53.15  34.9  +can +reduce
26.11  113.8  +may +reduce
22.21  195.0  +causes
17.5  24.9  +deficiency
11.87  4.3  +produces
11.45  131.0  +induced
11.45  22.9  +cause
10.24  153.7  +can +cause
8.31  364.2  +is
6.30  11.1  +does +not +cause
5.38  9.2  +causing
3.50  5.0  +can +also +cause
3.17  27.7  +are
0.97  88.8  +include
0.67  9.3  +were
0.57  137.7  +had
0.55  11.1  +will +cause
0.00  46.8  +contracted
0.00  6.3  +based +on
0.00  4.0  +reported
0.00  1.8  +will +have, +make

8.32 Unfiltered "induces" patterns precision scores

Ave. precision  Ave. frequency  KP
100.00  163.2  +may +cause
100.00  149.8  +to +induce
100.00  85.8  +produce
100.00  33.5  +induces
100.00  20.7  +to +cause
100.00  15.9  +cause
100.00  15.8  +overdose +include
100.00  14.7  +does +not +cause
100.00  11.6  +which +induces
100.00  7.7  +produces
100.00  7.2  +induce
100.00  7.1  +can +lead +to
100.00  6.4  +to +produce
100.00  6.1  +or +induce
100.00  5.6  +which +causes
100.00  5.3  +will +cause
100.00  5.0  +will +induce
100.00  4.7  +causing
100.00  4.5  +does +not +produce
100.00  4.3  +poisoning +include
100.00  3.3  +leading +to
100.00  3.0  +may +lead +to
100.00  2.8  +produced
100.00  2.7  +can +produce
100.00  2.3  +can +induce
100.00  2.1  +can +also +cause
100.00  1.9  +poisoning
100.00  1.6  +promotes, +can +result +in
100.00  1.5  +without +causing, +may +experience
100.00  1.3  +that +can +cause
100.00  1.1  +leads +to, +for +inducing
100.00  1.0  +may +induce
100.00  0.9  +which +promotes, +resulted +in, +and +induce
100.00  0.8  +characterized +by, +can +itself +cause
100.00  0.8  +provokes, +it +causes
100.00  0.6  +may +aggravate, +consumption +include, +retention
100.00  0.5  +caused
100.00  0.4  +commonly +causes, +when +you +have
100.00  0.4  +because +inducing
100.00  0.3  +usually +results +in, +was +used +for
100.00  0.3  +resulting +in, +will +produce, +intake +is
100.00  0.3  +in +producing, +has +been, +associated +with, +aggravates
100.00  0.2  +overdose +can +cause, +tension, +see
100.00  0.2  +maintains, +in +maintaining, +could +cause
100.00  0.2  +and +develop, +if +inducing, +has, +did +not +cause
99.71  1514.0  +induced
87.37  56.5  +causes
78.55  81.6  +can +cause
46.89  104.2  +is
43.08  68.3  +include
32.10  38.4  +associated
31.19  66.8  +related
29.87  11.1  +are
28.06  20.8  +to +prevent
15.08  40.8  +to +reduce
10.97  6.1  +experienced
5.31  282.1  +prevents
0.64  48.7  +to +treat
0.00  982.8  +relieves
0.00  139.0  +had
0.00  96.3  +reduces
0.00  78.3  +in +treating
0.00  41.1  +can +help +minimize
0.00  5.9  +relieved
0.00  5.4  +develop
0.00  2.4  +reduce

8.32.1 "Induced_by" patterns precision scores

Ave. precision  Ave. frequency  KP
100.00  53.6  +induced +by
100.00  23.7  +caused +by
66.51  12.6  +associated +with
100.00  10.6  +produced +by
100.00  4.6  +related +to
100.00  3.4  +was +induced +by
100.00  3.1  +by +administering
100.00  2.2  +resulting +from
100.00  2.2  +was +induced +with
100.00  1.3  +when +taken +with
100.00  0.6  +occurs +with
100.00  0.4  +occurs +when
100.00  0.4  +may +result +from

8.33 Unfiltered "ISA" patterns precision scores

8.33.1 hyper_plur_hypo

iteration range  precision  freq  pattern  accepted
10  99.98  107.9  +e.g  yes
10  99.98  107.9  +eg  yes
10  99.88  1289.4  +such +as  yes
10  99.16  74.8  +including  yes
10  9.77  1717.1  +on  no
10  89.60  56.7  +like  yes
10  69.43  13.4  +i.e  yes
10  69.02  37.4  +include  yes
10  5.91  19.2  +but  no
10  40.52  134.4  +are  no
10  30.28  753.2  +as  no
10  28.04  659.8  +to  no
10  23.44  32939.3  +or  no
10  23.16  124827.4  +and  no
10  19.63  906.2  +with  no
9  9.33  1170.6  +the  no
9  77.15  12.8  +ie  yes
9  22.17  11.2  +analgesics  no
9  0.00  9.7  +both  no
9  0.00  860.6  +a  no
9  0.00  7.2  +focus +on  no
9  0.00  4.5  +group  no
9  0.00  4.5  +developed  no
9  0.00  31.5  +corticosteroids  no
9  0.00  2.7  +recently  no
9  0.00  2.7  +including +the  no
9  0.00  20.1  +first  no
9  0.00  19.8  +were  no
9  0.00  196.2  +not  no
9  0.00  1.8  +since +this  no
9  0.00  1.8  +and +include  no
9  0.00  153.0  +known +as  no
7  100.00  17.7  +carbamazepine  no
7  100.00  13.6  +phenobarbital  no
6  100.00  9.3  +especially  no
6  100.00  41.8  +phenytoin  no
6  100.00  35.8  +phenytoin +and  no
5  100.00  6.8  +particularly  no
4  100.00  9.9  +other +than  no
4  100.00  8.1  +valproate +and  no
4  100.00  7.1  +carbamazepine +and  no
4  100.00  5.3  +valproate  no
4  100.00  4.3  +except  no
4  100.00  2.6  +mainly  no
4  100.00  16.3  +gabapentin +and  no
4  100.00  12.8  +phenobarbital +and  no
3  100.00  8.9  +e.g. +phenytoin  no
3  100.00  7.3  +lamotrigine  no
3  100.00  4.8  +gabapentin  no
3  100.00  3.1  +valproic +acid  no
3  100.00  2.2  +phenobarbitone  no
3  100.00  0.9  +diazepam  no
2  100.00  4.6  +barbiturates  no
2  100.00  3.7  +valproic +acid +and  no
2  100.00  2.2  +including +phenytoin  no
2  100.00  1.8  +topiramate +and  no
2  100.00  1.6  +diphenylhydantoin +and  no
2  100.00  1.5  +for +example  no
2  100.00  1.5  +divalproex +sodium  no
2  100.00  1.4  +sodium +valproate +and  no
2  100.00  1.3  +specifically  no
2  100.00  1.1  +in +particular  no
2  100.00  1.0  +decreased +serum  no
2  100.00  0.8  +including +carbamazepine  no
2  100.00  0.5  +only  no
1  100.00  1.4  +although  no
1  100.00  1.0  +lamotrigine +and  no
1  100.00  1.0  +gabapentin +or  no
1  100.00  1.0  +felbamate  no
1  100.00  0.9  +vigabatrin  no
1  100.00  0.6  +benzodiazepines  no
1  100.00  0.4  +divalproex +and  no
1  100.00  0.4  +antibiotics  no
1  100.00  0.3  +triazines  no
1  100.00  0.3  +such +valproate  no
1  100.00  0.3  +oxcarbazepine  no
1  100.00  0.3  +morphine  no
1  100.00  0.3  +ethosuximide  no
1  100.00  0.3  +divalproex  no
1  100.00  0.2  +topiramate  no
1  100.00  0.2  +patients +using  no
1  100.00  0.2  +initially  no
1  100.00  0.2  +fluoxetine +and  no
1  100.00  0.2  +fluoxetine  no
1  100.00  0.2  +clonazepam  no
1  100.00  0.2  +and +benzodiazepines  no
1  100.00  0.2  +along +with  no

8.33.2 hyper_sing_hypo

iteration range  precision  freq  pattern  accepted
10  99.94  338.6  +drug  yes
10  88.59  28.7  +activity  yes
10  79.52  82.3  +such +as  yes
10  7.34  604633.7  +and  no
10  46.35  13675.3  +e  no
10  42.63  5361.6  +d  no
10  5.18  8.2  +especially  no
10  50.52  276.9  +is  yes
10  47.77  57.4  +like  no
10  47.21  60.8  +than  no
10  4.62  12.0  +b  no
10  40.80  43.1  +e.g  no
10  39.80  13.7  +treatment +with  no
10  39.29  81.0  +medication  no
10  35.33  2642.6  +to  no
10  3.30  279.5  +medicine  no
10  29.83  100.4  +treatment  no
10  19.53  34.8  +except  no
10  18.97  25.3  +plus  no
10  16.67  30.3  +i.e  no
10  100.00  293.1  +effects +of  yes
10  100.00  190.9  +effect +of  yes
10  100.00  101.5  +activity +of  yes
10  0.82  390908.0  +or  no
10  0.00  2.0  +namely  no
9  9.94  694.6  +as  no
9  7.43  26.8  +ie  no
9  6.03  106.8  +including  no
9  45.33  39.5  +eg  no
9  32.74  15.5  +mechanisms +of  no
9  19.09  42.6  +includes  no
9  16.91  3114.6  +with  no
9  11.49  2076.1  +a  no
9  10.81  1144.7  +with +a  no
9  100.00  98.3  +efficacy +of  yes
9  100.00  90.8  +action +of  yes
9  100.00  41.4  +drugs  yes
9  100.00  30.1  +agent  yes
9  0.59  3882.4  +for  no
9  0.05  8153.2  +in  no
9  0.00  6.3  +is +called  no
9  0.00  59.4  +but  no
9  0.00  5.4  +parathyroid  no
9  0.00  49.5  +i  no
9  0.00  44.1  +versus  no
9  0.00  42.3  +human  no
9  0.00  3.6  +r  no
9  0.00  31.5  +over  no
9  0.00  2.7  +related +to  no
9  0.00  2.7  +family  no
9  0.00  235.8  +by  no
9  0.00  19.8  +s  no
9  0.00  16.2  +bipolar  no
9  0.00  15.3  +w  no
9  0.00  13.5  +either  no
9  0.00  10.8  +depression  no
8  100.00  23.6  +actions +of  yes
7  100.00  8.3  +agents  yes
7  100.00  7.6  +agents +such +as  yes
7  100.00  37.6  +called  yes
7  100.00  28.3  +drugs +such +as  yes
7  100.00  23.7  +properties +of  yes
6  100.00  7.9  +compound  no
6  100.00  6.2  +effects  no
6  100.00  1975.2  +k  no
5  100.00  6.7  +property +of  no
5  100.00  3.1  +therapy  no
5  100.00  3.0  +drugs, +including  no
5  100.00  3.0  +drugs +including  no
5  100.00  24.4  +response +to  no
4  100.00  4.6  +mechanism +of  no
4  100.00  3.4  +effectiveness +of  no
4  100.00  34.6  +therapy +with  no
4  100.00  2.4  +properties  no
4  100.00  1.9  +effect +with  no
4  100.00  15.8  +potency +of  no
4  100.00  10.5  +medications  no
3  100.00  9.0  +effect  no
3  100.00  69.5  +c  no
3  100.00  2.8  +drug +called  no
3  100.00  2.5  +medications +including  no
3  100.00  2.3  +drugs +(eg  no
3  100.00  2.1  +collection  no
3  100.00  21.3  +d +or  no
3  100.00  1.5  +effect +when  no
3  100.00  1.5  +compounds  no
3  100.00  1.1  +agents, +including  no
3  100.00  10.7  +doses +of  no
2  100.00  5.8  +dose +of  no
2  100.00  4.7  +effect +of +intrathecal  no
2  100.00  3.6  +c +und  no
2  100.00  2.8  +barbiturate  no
2  100.00  2.5  +action  no
2  100.00  1.5  +efficacy +for  no
2  100.00  1.5  +effects +of +intrathecal  no
2  100.00  10.5  +d +called  no
2  100.00  0.6  +mix  no
2  100.00  0.6  +medications +include  no
1  100.00  5.6  +premix  no
1  100.00  2.5  +d +such +as  no
1  100.00  2.0  +agents +include  no
1  100.00  1.2  +effect +for  no
1  100.00  1.1  +ki  no
1  100.00  1.1  +k +called  no
1  100.00  1.0  +activities +of  no
1  100.00  0.8  +potencies +of  no
1  100.00  0.7  +regimen +containing  no
1  100.00  0.7  +efficacy  no
1  100.00  0.7  +d +metabolites  no
1  100.00  0.7  +activity +for  no
1  100.00  0.6  +sodium  no
1  100.00  0.6  +order  no
1  100.00  0.6  +levels  no
1  100.00  0.5  +drugs +carbamazepine  no
1  100.00  0.4  +regimens +with  no
1  100.00  0.4  +d +see  no
1  100.00  0.4  +d +analog  no
1  100.00  0.4  +although  no
1  100.00  0.4  +activity +as  no
1  100.00  0.3  +phenytoin  no
1  100.00  0.3  +k +analogue  no
1  100.00  0.3  +group  no
1  100.00  0.3  +c +buy  no
1  100.00  0.3  +also +called  no
1  100.00  0.2  +therapies  no
1  100.00  0.2  +techniques  no
1  100.00  0.2  +regimen  no
1  100.00  0.2  +muscle +relaxant  no
1  100.00  0.2  +medications +carbamazepine +and  no
1  100.00  0.2  +fluoxetine  no
1  100.00  0.2  +effect +than  no
1  100.00  0.2  +effects +of +intravenous  no
1  100.00  0.2  +effects +for  no
1  100.00  0.2  +doses  no
1  100.00  0.2  +deficiencies  no
1  100.00  0.2  +actions  no

8.33.3 hypo_hyper_plur

iteration range  precision  freq  pattern  accepted
10  99.96  184.2  +an  yes
10  89.57  53.4  +has  yes
10  79.83  22.6  +is +a +new  yes
10  76.72  585.7  +is  yes
10  69.65  89.7  +a +new  yes
10  63.94  938.8  +as  yes
10  59.99  10.1  +has +an  yes
10  5.87  608786.3  +and  no
10  57.58  73.1  +another  yes
10  55.96  71.3  +and +other  yes
10  39.99  12.8  +provides  no
10  38.68  25.0  +s  no
10  37.19  687.2  +on +the  no
10  36.57  199.8  +are  no
10  36.56  977.1  +natural  no
10  31.69  162.7  +on  no
10  3.14  396.7  +this  no
10  23.74  134.2  +or +other  no
10  21.36  23777.9  +the  no
10  19.86  32.7  +the +only  no
10  18.93  24.4  +had  no
10  10.00  4337.0  +for  no
10  100.00  2334.6  +is +an  yes
10  100.00  22.2  +is +an +effective  yes
9  9.72  2052.6  +a  no
9  7.67  6676.1  +at  no
9  6.94  3.8  +and +the +other  no
9  5.23  13.0  +in +terms +of  no
9  43.09  15.7  +or +another  no
9  3.84  46.7  +in +our  no
9  33.79  40.8  +ie  no
9  33.32  5.7  +has +no  no
9  32.02  41.0  +i.e  no
9  30.71  1128.4  +and +the  no
9  20.36  45.6  +plus  no
9  19.19  8426.9  +in  no
9  17.18  4.9  +however  no
9  11.74  26.6  +a +common  no
9  11.11  23.3  +or +any +other  no
9  11.11  2.0  +tablet  no
9  11.10  3.8  +tablets  no
9  11.08  10.9  +has +both  no
9  10.57  5.6  +two  no
9  10.52  47.7  +like  no
9  1.00  27.4  +as +the +first  no
9  100.00  48.8  +exerts +its  yes
9  100.00  114.1  +as +an  yes
9  0.00  8.0  +can +lead +to  no
9  0.00  73.2  +its  no
9  0.00  7.2  +the +common  no
9  0.00  7.2  +d  no
9  0.00  5.4  +for +chronic  no
9  0.00  4.5  +showed  no
9  0.00  40.5  +fever  no
9  0.00  233.1  +including +the  no
9  0.00  18683.8  +my  no
9  0.00  144.0  +causes  no
8  25.21  18.6  +have  no
6  100.00  6.6  +a +potent  no
5  100.00  8.8  +a +novel  no
5  100.00  5.0  +sodium  no
5  100.00  17.8  +is +a +novel  no
5  100.00  17.2  +tocopherol  no
4  100.00  9.2  +which +is  no
4  100.00  5.7  +induced  no
4  100.00  1.8  +used +as +an  no
4  100.00  1270.5  +source +of  no
3  100.00  6.0  +increases  no
3  100.00  54.4  +und  no
3  100.00  3.9  +have +similar  no
3  100.00  32.2  +has +demonstrated  no
3  100.00  3.1  +pro  no
3  100.00  2.9  +also +called  no
3  100.00  27.3  +enhances +the  no
3  100.00  2.3  +serum  no
3  100.00  2.3  +deficiency  no
3  100.00  23.4  +synthetic  no
3  100.00  1.7  +metabolism  no
3  100.00  1.6  +containing  no
3  100.00  1.5  +was +the +first  no
3  100.00  1.5  +a +standard  no
3  100.00  1.3  +other  no
3  100.00  0.8  +which +has  no
2  100.00  5.7  +ascorbyl +palmitate  no
2  100.00  5.2  +concentrations +and  no
2  100.00  3.6  +analgesic  no
2  100.00  2.8  +exhibits  no
2  100.00  21.1  +retinol  no
2  100.00  1.9  +analgesic +and  no
2  100.00  1.6  +concentrate  no
2  100.00  1.5  +aka  no
2  100.00  1.2  +and +tocopherol  no
2  100.00  1.1  +to +another  no
2  100.00  0.9  +has +potent  no
2  100.00  0.7  +a +known  no
2  100.00  0.6  +synonyms  no
2  100.00  0.6  +and +carbamazepine  no
2  100.00  0.4  +is +the +preferred  no
1  100.00  8.8  +acetate  no
1  100.00  5.8  +phosphate  no
1  100.00  4.4  +equivalents  no
1  100.00  4.3  +new  no
1  100.00  3.3  +an +established  no
1  100.00  2.3  +carotenoids  no
1  100.00  1.6  +and +novel  no
1  100.00  1.4  +is +an +efficacious  no
1  100.00  1.3  +phenytoin  no
1  100.00  1.2  +are +effective  no
1  100.00  1.0  +is +an +analgesic  no
1  100.00  1.0  +improved +the  no
1  100.00  1.0  +as +sole  no
1  100.00  0.7  +that +had  no
1  100.00  0.7  +hydrochloride  no
1  100.00  0.7  +folate  no
1  100.00  0.7  +displays  no
1  100.00  0.6  +riboflavin  no
1  100.00  0.5  +used +for  no
1  100.00  0.5  +to +increase  no
1  100.00  0.5  +produced +an  no
1  100.00  0.5  +narcotic  no
1  100.00  0.5  +as +a +potential  no
1  100.00  0.5  +and +a +tricyclic  no
1  100.00  0.4  +or +any +appropriate  no
1  100.00  0.4  +high +potency  no
1  100.00  0.4  +had +comparable  no
1  100.00  0.4  +antioxidants  no
1  100.00  0.4  +also +has  no
1  100.00  0.3  +tricyclic  no
1  100.00  0.3  +to +an  no
1  100.00  0.3  +may +decrease  no
1  100.00  0.3  +fat +soluble  no
1  100.00  0.3  +exhibited +an  no
1  100.00  0.3  +exerts  no
1  100.00  0.3  +deficient  no
1  100.00  0.3  +at +an  no
1  100.00  0.2  +to +other  no
1  100.00  0.2  +to +have  no
1  100.00  0.2  +rich  no
1  100.00  0.2  +produced  no
1  100.00  0.2  +phytonadione  no
1  100.00  0.2  +has +analgesic  no
1  100.00  0.2  +daily  no
1  100.00  0.2  +a +major  no

8.34 Unfiltered synonymy patterns precision scores

iteration range precision freq pattern accepted
9 100.00 67.2 +see yes
9 100.00 45.1 +also +known +as yes
9 100.00 23.8 +ie yes
8 100.00 72.9 +means yes
8 100.00 187.7 +also +called yes
7 100.00 968.6 +acute yes
6 100.00 20.2 +called yes
6 100.00 15.8 +aka yes
5 100.00 6.5 +is +also +called yes
4 100.00 4.6 +mild yes
4 100.00 3.4 +is +known +as yes
4 100.00 2.9 +refers +to yes
4 100.00 19.1 +severe yes
4 100.00 1.7 +was +defined +as yes
3 100.00 6.0 +this no
3 100.00 4.5 +nausea no
3 100.00 3.8 +from no
3 100.00 3.3 +defined +as no
3 100.00 1.9 +associated +with no
3 100.00 17.7 +during no
3 100.00 12.8 +recurrent no
3 100.00 0.9 +in +patients +with no
2 100.00 8.7 +extreme no
2 100.00 7.0 +is +an no
2 100.00 6.8 +early no
2 100.00 6.1 +affects no
2 100.00 5.2 +postural no
2 100.00 48.2 +pregnancy no
2 100.00 4.7 +high no
2 100.00 45.1 +abnormally no
2 100.00 3.2 +unusually no
2 100.00 2.8 +sudden no
2 100.00 2.8 +or +acute no
2 100.00 2.7 +caused +by no
2 100.00 2.3 +excessive no
2 100.00 22.0 +multiple no
2 100.00 2.1 +includes no
2 100.00 2.1 +hypertension no
2 100.00 1.6 +an no
2 100.00 15.6 +ectopic no
2 100.00 1.1 +unlike no
2 100.00 1.1 +rather +than no
2 100.00 1.0 +gravidity no
2 100.00 0.9 +or +psychogenic no
2 100.00 0.7 +vulvar no
2 100.00 0.7 +and +recurrent no
2 100.00 0.6 +uf no
1 100.00 6.3 +throughout no
1 100.00 5.6 +end +stage no
1 100.00 4.5 +resulting +from +normal no
1 100.00 4.4 +called +a no
1 100.00 2.4 +headache no
1 100.00 2.2 +normal no
1 100.00 1.8 +is +characterized +by +frequent no
1 100.00 1.4 +patients +with no
1 100.00 1.2 +nor no
1 100.00 1.2 +generalized no
1 100.00 1.1 +type +i no
1 100.00 1.1 +stillbirth no
1 100.00 1.1 +first no
1 100.00 1.0 +or +postural no
1 100.00 1.0 +coronary no
1 100.00 0.9 +whereas no
1 100.00 0.9 +was +measured +with +the +baseline no
1 100.00 0.8 +terminated no
1 100.00 0.8 +is +called +an +acute no
1 100.00 0.7 +prolonged no
1 100.00 0.7 +now +called no
1 100.00 0.6 +to +prolong no
1 100.00 0.6 +testing no
1 100.00 0.6 +rash no
1 100.00 0.6 +multifetal no
1 100.00 0.6 +component no
1 100.00 0.6 +attacks no
1 100.00 0.5 +stroke no
1 100.00 0.5 +rop no
1 100.00 0.5 +nos no
1 100.00 0.5 +exercise no
1 100.00 0.4 +medicine +this +medicine +helps +lower +an no
1 100.00 0.4 +ischemic +heart +disease no
1 100.00 0.3 +threatened no
1 100.00 0.3 +since no
1 100.00 0.3 +may +differ +from no
1 100.00 0.3 +is +acute +and +severe +with +symptoms +of no
1 100.00 0.3 +intractable no
1 100.00 0.3 +in +pregnancy no
1 100.00 0.2 +without +actual no
1 100.00 0.2 +with +exertion no
1 100.00 0.2 +which +means no
1 100.00 0.2 +tummy no
1 100.00 0.2 +temperature no
1 100.00 0.2 +rlf no
1 100.00 0.2 +often +called no
1 100.00 0.2 +natural no
1 100.00 0.2 +may +prolong no
1 100.00 0.2 +massive no
1 100.00 0.2 +is +acute no
1 100.00 0.2 +in +rats no
1 100.00 0.2 +any no
10 99.13 6120.3 +or yes
10 31.23 672.6 +the no
10 26.98 19.6 +known +as no
10 22.51 372.1 +is no
10 20.72 6.6 +vs no
10 19.89 6011.5 +and no
9 19.68 5.5 +however no
9 14.25 408.7 +of no
9 13.10 8.9 +due +to no
10 10.92 240.2 +including no
9 10.60 602.0 +in no
10 9.76 17.5 +if no
10 8.03 47.1 +skin no
10 6.80 42.3 +chronic no
9 6.43 28.9 +with +normal no
9 6.12 235.1 +of +the no
9 5.74 15.3 +after no
10 5.12 98.8 +like no
10 4.01 144.3 +her no
9 3.73 85.2 +induced no
10 1.60 44.5 +eg no
9 1.03 25.4 +some no
9 0.46 121.9 +had no
9 0.03 969.4 +without no
9 0.00 95.4 +include no
9 0.00 9.0 +rt no
9 0.00 83.1 +no no
9 0.00 6.9 +yellow no
9 0.00 3.6 +two no
9 0.00 3.6 +other +than no
9 0.00 3.6 +ans no
9 0.00 35.1 +but +no no
9 0.00 3.3 +watery no
9 0.00 2.7 +especially no
9 0.00 252.9 +prevents no
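
A note on the "accepted" column in this and the preceding pattern table: the decisions are consistent with a joint threshold on a pattern's precision and on the number of seed-term iterations in which it recurs, with relation-specific threshold values. Roughly, iterations >= 4 and precision >= 99% would reproduce the synonymy table above, while iterations >= 9 and precision >= 55% would reproduce the preceding table. The following Python sketch illustrates a filter of this form; the threshold values are inferred from the printed tables, not quoted from the WWW2REL implementation.

# Illustrative pattern filter consistent with the "accepted" columns above.
# The thresholds are inferred from the printed tables and appear to be
# relation-specific; they are assumptions, not WWW2REL's actual settings.
def accept_pattern(iterations: int, precision: float,
                   min_iterations: int, min_precision: float) -> bool:
    """Accept a candidate knowledge pattern only if it recurs across enough
    seed iterations AND extracts instances with sufficient precision."""
    return iterations >= min_iterations and precision >= min_precision

# Example rows from the synonymy table above (inferred thresholds: 4, 99.0):
print(accept_pattern(9, 100.00, 4, 99.0))   # '+see'  -> True  (accepted: yes)
print(accept_pattern(3, 100.00, 4, 99.0))   # '+this' -> False (accepted: no)
print(accept_pattern(10, 31.23, 4, 99.0))   # '+the'  -> False (accepted: no)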

8.35 Database schema

8.36 “induces” KP performance (manual evaluation)

F precision recall pattern_id
0.20 0.69 0.11 may cause_1
0.18 0.58 0.11 can cause_71
0.17 0.46 0.11 can lead to_12
0.16 0.78 0.09 can induce_25
0.13 0.55 0.07 induces_4
0.12 0.79 0.07 may induce_35
0.12 0.63 0.07 promotes_28
0.12 0.37 0.07 leads to_33
0.11 0.85 0.06 to induce_2
0.11 0.64 0.06 can also cause_26
0.11 0.45 0.06 causes_70
0.11 0.30 0.07 cause_6
0.10 0.60 0.05 to cause_5
0.10 0.57 0.05 may lead to_22
0.10 0.50 0.05 could cause_67
0.09 0.68 0.05 can result in_29
0.08 0.36 0.05 causing_18
0.07 0.34 0.04 will cause_16
0.07 0.21 0.04 caused_46
0.06 0.79 0.03 does not cause_8
0.06 0.61 0.03 induce_11
0.06 0.27 0.03 can produce_24
0.06 0.26 0.03 produces_10
0.05 0.77 0.03 induced_69
0.05 0.62 0.03 provokes_39
0.05 0.62 0.03 produce_3
0.05 0.30 0.03 associated with_57
0.04 1.00 0.02 overdose can cause_61
0.04 1.00 0.02 may aggravate_44
0.04 1.00 0.02 commonly causes_48
0.04 0.67 0.02 when you have_47
0.04 0.56 0.02 which causes_15
0.04 0.50 0.02 in producing_55
0.04 0.38 0.02 was used for_51
0.04 0.38 0.02 and develop_68
0.04 0.35 0.02 that can cause_32
0.04 0.32 0.02 may experience_31
0.04 0.26 0.02 leading to_21
0.04 0.20 0.02 to produce_13
0.03 0.83 0.01 does not produce_19
0.03 0.50 0.01 overdose include_7
0.03 0.40 0.01 maintains_62
0.03 0.31 0.01 has_65
0.02 0.08 0.01 resulted in_37
0.01 1.00 0.01 without causing_30
0.01 1.00 0.01 which promotes_36
0.01 1.00 0.01 consumption include_45
0.01 0.50 0.01 which induces_9
0.01 0.50 0.01 for inducing_34
0.01 0.45 0.01 aggravates_58
0.01 0.33 0.01 will induce_17
0.01 0.15 0.01 in maintaining_63
0.01 0.12 0.01 produced_23
0.01 0.08 0.01 did not cause_66
NA NA NA usually results in_52
NA NA NA tension ·_59
NA NA NA see_60
NA NA NA retention_43
NA NA NA or induce_14
NA NA NA if inducing_64
NA NA NA can itself cause_42
NA NA NA because inducing_49
NA NA NA and induce_38
0 0 0 will produce_50
0 0 0 resulting in_53
0 0 0 poisoning include_20
0 0 0 poisoning_27
0 0 0 it causes_40
0 0 0 intake is_54
0 0 0 has been_56
0 0 0 characterized by_41
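
For reference, the F column in this and the two following tables behaves like the balanced F-score, i.e. the harmonic mean of precision and recall, up to rounding of the printed values; rows marked NA presumably correspond to patterns for which no relation instances could be evaluated. A minimal recomputation sketch:

def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: the harmonic mean of precision and recall.
    Returns 0.0 for patterns with no correct matches (P = R = 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Rows from the table above (printed values are rounded to two decimals):
print(round(f_score(0.85, 0.06), 2))  # 'to induce_2' -> 0.11 (table: 0.11)
print(round(f_score(0.69, 0.11), 2))  # 'may cause_1' -> 0.19 (table: 0.20)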

8.37 “may_prevent” KP performance (manual evaluation)

F precision recall pattern_id
0.16 0.42 0.10 helps prevent_75
0.16 0.35 0.10 prevents_79
0.14 0.24 0.10 protects against_109
0.14 0.24 0.10 decreased_96
0.13 0.12 0.14 reduced_88
0.12 0.28 0.08 could reduce_133
0.11 0.59 0.06 significantly reduces_103
0.11 0.53 0.06 in preventing_77
0.11 0.50 0.06 may decrease_166
0.10 0.30 0.06 prevent_87
0.10 0.12 0.08 reduce_90
0.10 0.11 0.10 inhibits_117
0.10 0.10 0.10 significantly reduced_119
0.09 0.19 0.06 can prevent_99
0.09 0.17 0.06 decreases_84
0.08 0.62 0.04 help prevent_85
0.08 0.62 0.04 cuts_78
0.08 0.12 0.06 prevented_93
0.07 0.49 0.04 to reduce_82
0.07 0.19 0.04 may prevent_136
0.07 0.10 0.06 provide_165
0.05 0.07 0.04 lowers_123
0.05 0.06 0.04 can reduce_172
0.05 0.04 0.06 cut_153
0.04 0.50 0.02 would decrease_137
0.04 0.50 0.02 to combat_92
0.04 0.50 0.02 attenuates_121
0.04 0.44 0.02 for preventing_74
0.04 0.36 0.02 developed_141
0.04 0.33 0.02 preventing_112
0.04 0.33 0.02 alleviates_108
0.04 0.25 0.02 works against_162
0.04 0.22 0.02 reduces_171
0.04 0.20 0.02 significantly decreased_150
0.04 0.20 0.02 provided_129
0.04 0.20 0.02 combats_107
0.03 0.08 0.02 inhibited_155
0.03 0.07 0.02 helps_106
0.03 0.04 0.02 to prevent_72
0.02 0.03 0.02 in treating_86
NA NA NA will relieve_134
NA NA NA to suppress_154
NA NA NA to relieve_81
NA NA NA to alleviate_101
NA NA NA therapy prevents_163
NA NA NA remedy_139
NA NA NA relieved_104
NA NA NA relieve_80
NA NA NA numbs_151
NA NA NA minimizes_135
NA NA NA may ease_97
NA NA NA lessens_142
NA NA NA in relieving_124
NA NA NA in fighting_125
NA NA NA heals_145
NA NA NA had significantly less_116
NA NA NA group reported_140
NA NA NA for relieving_73
NA NA NA for decreasing_167
NA NA NA fighting_127
NA NA NA eases_158
NA NA NA ease_156
NA NA NA does not cure_152
NA NA NA deters_149
NA NA NA can relieve_105
NA NA NA at preventing_113
NA NA NA also relieves_161
NA NA NA after developing_170
0 0 0 will prevent_100
0 0 0 was_138
0 0 0 use in_110
0 0 0 used for_111
0 0 0 treated_122
0 0 0 to treat_89
0 0 0 to kill_146
0 0 0 to cure_147
0 0 0 to control_91
0 0 0 that prevents_115
0 0 0 showing_164
0 0 0 reverses_144
0 0 0 relieves_76
0 0 0 provides_128
0 0 0 makes_132
0 0 0 in decreasing_130
0 0 0 improves_98
0 0 0 helps relieve_143
0 0 0 has_120
0 0 0 group had_126
0 0 0 group experienced_131
0 0 0 for treating_83
0 0 0 following_148
0 0 0 eliminates_118
0 0 0 eased_159
0 0 0 diminishes_114
0 0 0 containing_94
0 0 0 caused_157
0 0 0 can cure_160
0 0 0 can_168
0 0 0 based_95
0 0 0 alleviated_169
0 0 0 affects_102

8.38 ISA KP performance (manual evaluation)

F precision recall pattern_id
0.44 0.88 0.29 such as_341
0.42 0.89 0.28 like_323
0.41 0.87 0.27 such as_321
0.35 0.73 0.23 and other_342
0.32 0.82 0.20 is an_350
0.30 0.69 0.19 and other_360
0.26 0.88 0.15 agents such as_332
0.25 0.94 0.14 including_322
0.25 0.70 0.15 or other_343
0.23 0.91 0.13 drugs such as_334
0.23 0.55 0.14 efficacy of_326
0.21 0.54 0.13 effect of_337
0.19 0.49 0.12 effects of_336
0.19 0.49 0.12 action of_327
0.18 0.75 0.10 called_333
0.18 0.69 0.10 include_325
0.18 0.56 0.11 activity of_338
0.17 0.57 0.10 is an effective_351
0.15 0.72 0.08 agent_329
0.14 0.97 0.08 e.g._320
0.14 0.84 0.08 i.e._324
0.14 0.40 0.08 as_357
0.12 0.46 0.07 as_347
0.12 0.44 0.07 properties of_335
0.12 0.35 0.07 is_355
0.12 0.27 0.08 as an_349
0.11 0.73 0.06 is a new_354
0.10 0.57 0.06 with other_345
0.08 0.30 0.05 actions of_330
0.07 0.36 0.04 drug_339
0.06 0.53 0.03 other_344
0.06 0.38 0.03 has_353
0.06 0.37 0.03 an_352
0.06 0.35 0.03 see_346
0.06 0.21 0.03 has an_358
0.04 0.67 0.02 another_359
0.04 0.58 0.02 agents_331
0.04 0.53 0.02 drugs_328
0.04 0.18 0.02 exerts its_348
0.02 0.03 0.01 a new_356
0.00 0.00 0.00 activity_340

8.39 Synonymy ranking schemes

8.39.1 Vitamin C - F-scores

8.39.2 Progesterone - F-scores

8.39.3 Formaldehyde - F-scores

8.39.4 Lactose - F-scores

8.39.5 Glucose - F-scores

8.40 X ISA antipsychotic - F-scores

8.41 haloperidol ISA X - F-scores

8.42 aspirin induces X - F-scores

8.43 selenium may_prevent X - F-scores

8.44 X induces vomiting - F-scores

8.44.1 {drugs} induces vomiting - F-scores

8.45 X induces emesis - F-scores

8.45.1 {drugs} induces emesis - F-scores

8.46 Correlation with WWW

8.46.1 causal experiments - F-scores

8.46.2 ISA experiments - F-scores
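
The F-scores plotted in sections 8.39 through 8.46 are computed over the top-k relation instances returned by each ranking scheme. The following sketch shows one way such a top-k evaluation can be computed; the function and the example instances are illustrative assumptions, not code from the implementation.

def f_at_k(ranked, gold, k):
    """Precision, recall and F-score over the top-k ranked relation
    instances, given the set of instances judged correct (gold)."""
    hits = sum(1 for inst in ranked[:k] if inst in gold)
    precision = hits / k
    recall = hits / len(gold) if gold else 0.0
    f = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f

# Hypothetical usage with made-up instances:
ranked = ["aspirin_induces_nausea", "aspirin_induces_sleep", "aspirin_induces_bleeding"]
gold = {"aspirin_induces_nausea", "aspirin_induces_bleeding"}
print(f_at_k(ranked, gold, 2))  # (0.5, 0.5, 0.5)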

References

[Agbago and Barrière, 2005] Agbago, A. and Barrière, C. (2005). Corpus construction for terminology. In Proceedings of Corpus Linguistics 2005.
[Agichtein and Gravano, 2000] Agichtein, E. and Gravano, L. (2000). Snowball: Extracting relations from large plain-text collections. In Proceedings of the Fifth ACM Conference on Digital Libraries (DL 2000).
[Ahmad, 1993] Ahmad, K. (1993). Pragmatics of specialist terms: The acquisition and representation of terminology. In Machine Translation and the Lexicon, 3rd EAMT Workshop proceedings.
[Ahmad and Fulford, 1992] Ahmad, K. and Fulford, H. (1992). Semantic relations and their use in elaborating terminology. Technical report, University of Surrey, Computing Sciences.
[Alfonseca et al., 2006a] Alfonseca, E., Castells, P., Okumura, M., and Ruiz-Casado, M. (2006a). A rote extractor with edit distance-based generalization and multi-corpora precision calculation. In Proceedings of ACL 2006.
[Alfonseca et al., 2006b] Alfonseca, E., Ruiz-Casado, M., Okumura, M., and Castells, P. (2006b). Towards large-scale non-taxonomic relation extraction: Estimating the precision of rote extractors. In Proceedings of the 2nd Workshop on Ontology Learning and Population.
[Ananiadou and McNaught, 2006] Ananiadou, S. and McNaught, J. (2006). Text Mining for Biology and Biomedicine. Artech House.

[Aronson, 2001] Aronson, A. R. (2001). Effective mapping of biomedical text to the UMLS Metathesaurus: The MetaMap program. In Proceedings of AMIA 2001.

[Barrière, 2001] Barrière, C. (2001). Investigating the causal relation in informative texts. Terminology, 7(2):135–154.

[Barrière, 2002] Barrière, C. (2002). Hierarchical refinement and representation of the causal relation. Terminology, 8(1):91–111.

[Biber et al., 1998] Biber, D., Conrad, S., and Reppen, R. (1998). Corpus Linguistics - Investigating Language Structure and Use. Cambridge University Press.
[Biber et al., 1999] Biber, D., Johansson, S., Leech, G., Conrad, S., and Finegan, E. (1999). Longman Grammar of Spoken and Written English. Longman.
[Bodenreider, 2001] Bodenreider, O. (2001). Circular hierarchical relationships in the UMLS: Etiology, diagnosis, treatment, complications and prevention. In Proceedings of the American Medical Informatics Association (AMIA 2001).

[Bodenreider et al., 2002a] Bodenreider, O., Mitchell, J. A., and McCray, A. T. (2002a). Evaluation of the UMLS as a terminology and knowledge resource for biomedical informatics. In Proceedings of the AMIA Annual Symposium 2002, pages 61–65.

[Bodenreider et al., 2002b] Bodenreider, O., Rindflesch, T. C., and Burgun, A. (2002b). Unsupervised, corpus-based method for extending a biomedical terminology. In Proceedings of the Workshop on Natural Language Processing in the Biomedical Domain, pages 53–60.

[Bourigault, 1992] Bourigault, D. (1992). Surface grammatical analysis for the extraction of terminological noun phrases. In Proceedings of COLING-92.
[Bowden et al., 1996] Bowden, P. R., Halstead, P., and Rose, T. G. (1996). Extracting conceptual knowledge from text using explicit relation markers. In Proceedings of the 9th European Knowledge Acquisition Workshop on Advances in Knowledge Acquisition, pages 147–162.

[Brender, 2006] Brender, J. (2006). Evaluation Methods for Health Informatics. Academic Press.

[Brin, 1998] Brin, S. (1998). Extracting patterns and relations from the world wide web. In Proceedings of the International Conference on Extending Database Technology (EDBT 1998).

[Buitelaar et al., 2005] Buitelaar, P., Cimiano, P., and Magnini, B. (2005). Ontology learning from text: methods, evaluation and applications. IOS Press.

[Burgun and Bodenreider, 2001] Burgun, A. and Bodenreider, O. (2001). Comparing terms, concepts and semantic classes in WordNet and the Unified Medical Language System. In Proceedings of NAACL 2001, pages 77–82.

[Byrd and Ravin, 1999] Byrd, R. and Ravin, Y. (1999). Identifying and extracting relations in text. In Proceedings of NLDB 1999.
[Cabré, 2000] Cabré, M. T. (2000). Elements for a theory of terminology: Towards an alternative paradigm. Terminology, 6(1):35–57.
[Cabré, 2003] Cabré, M. T. (2003). Theories of terminology - their description, prescription and explanation. Terminology, 9(2):163–199.
[Charniak and Berland, 1999] Charniak, E. and Berland, M. (1999). Finding parts in very large corpora. In Proceedings of ACL, pages 57–64.
[Cheng et al., 2006] Cheng, W., Greaves, C., and Warren, M. (2006). From n-gram to skipgram to concgram. International Journal of Corpus Linguistics, 11(4):411–433.
[Christensen, 2002] Christensen, L. W. (2002). Danish verbs as knowledge probes in corpus-based terminology work. LSP and Professional Communication, 2(2):77–94.
[Christensen, 2005] Christensen, L. W. (2005). Hvordan ord sporer termer og andre terminologiske oplysninger. In Proceedings of NORDTERM 2005, pages 83–94.
[Cimiano and Staab, 2005] Cimiano, P. and Staab, S. (2005). Learning concept hierarchies from text with a guided hierarchical clustering algorithm. In Proceedings of the workshop Learning and Extending Lexical Ontologies by using Machine Learning Methods (ICML).
[Clark et al., 2004] Clark, P., Thompson, J., and Porter, B. (2004). Handbook on Ontologies, chapter Knowledge Patterns, pages 191–207. Springer.
[Cohen and Hersh, 2005] Cohen, A. M. and Hersh, W. R. (2005). A survey of current work in biomedical text mining. Briefings in Bioinformatics, 6(1):57–71.
[Cohen, 1999] Cohen, W. W. (1999). Interaction of heterogeneous databases without common domains using queries based on textual similarity. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora.
[Coxhead, 2000] Coxhead, A. (2000). A new academic word list. TESOL Quarterly, 34(2):213–238.
[Decker et al., 1999] Decker, S., Erdmann, M., Fensel, D., and Studer, R. (1999). Ontobroker: Ontology based access to distributed and semi-structured information. In Proceedings of DS-8, pages 351–369.
[Drouin, 2003] Drouin, P. (2003). Term extraction using non-technical corpora as a point of leverage. Terminology, 9(1):99–115.
[Etzioni et al., 2004] Etzioni, O., Cafarella, M., Downey, D., Kok, S., Popescu, A., Shaked, T., Soderland, S., Weld, D., and Yates, A. (2004). Web-scale information extraction in KnowItAll. In Proceedings of WWW-04.

[Evert, 2004] Evert, S. (2004). The Statistics of Word Cooccurrences: Word Pairs and Collocations. PhD thesis, University of Stuttgart.

[Faatz and Steinmetz, 2005] Faatz, A. and Steinmetz, R. (2005). Ontology learning from text: methods, evaluation and applications, chapter An Evaluation Framework for Ontology Enrichment, pages 77–91. IOS Press.

[Fisher, 1987] Fisher, D. H. (1987). Knowledge acquisition via conceptual clustering. Machine Learning, 2:139–172.

[Fleiss, 1971] Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378–382.

[Frawley et al., 1992] Frawley, W. J., Piatetsky-Shapiro, G., and Matheus, C. J. (1992). Knowledge discovery in databases: An overview. AI Magazine, Fall:57–70.

[Gaizauskas et al., 2003] Gaizauskas, R., Demetriou, G., Artymiuk, P. J., and Willett, P. (2003). Protein structures and information extraction from biological texts: The PASTA system. Bioinformatics, 19(1):135–143.

[Gillam, 2004] Gillam, L. (2004). Systems of concepts and their extraction from text. PhD thesis, School of Electronics and Physical Sciences, University of Surrey.

[Gillam et al., 2005] Gillam, L., Tariq, M., and Ahmad, K. (2005). Terminology and the construction of ontology. Terminology, 11(1):55–81.

[Girju and Moldovan, 2002] Girju, R. and Moldovan, D. (2002). Text mining for causal relations. In Proceedings of the 15th Florida Artificial Intelligence Research Society (FLAIRS) conference, pages 360–364.

[Gómez-Pérez and Manzano-Macho, 2005] Gómez-Pérez, A. and Manzano-Macho, D. (2005). An overview of methods and tools for ontology learning from texts. The Knowledge Engineering Review, 19(3):187–212.

[Guarino et al., 1999] Guarino, N., Masolo, C., and Vetere, G. (1999). OntoSeek: Content-based access to the web. IEEE Intelligent Systems, 14(3):70–80.
[Halskov, 2005a] Halskov, J. (2005a). Diasketching as a filter in web-based term extraction systems. In Proceedings of Terminology and Knowledge Engineering 2005 (TKE 2005), pages 397–408.
[Halskov, 2005b] Halskov, J. (2005b). Nettet som specialiseret korpus. In Proceedings of Nordterm 2005, pages 42–55.

[Halskov, 2005c] Halskov, J. (2005c). Probing the properties of determinologization - the diasketch. LAMBDA (Dept. of Computational Linguistics, Copenhagen Business School), 29.

[Halskov, 2005d] Halskov, J. (2005d). Web-based extraction of terminology and the issue of determinologization. In Proceedings of the 3rd Computational Linguistics in the North-East workshop (CLiNE 2005).

[Hearst, 1992] Hearst, M. A. (1992). Automatic acquisition of hyponyms from large text corpora. In Proceedings of COLING-92, pages 539–545.

[Hearst, 1999] Hearst, M. A. (1999). Untangling text data mining. In Proceedings of ACL ’99. The 37th annual meeting of the Association for Computational Linguistics.

[Jacquemin, 1994] Jacquemin, C. (1994). FASTR: A unification grammar and a parser for terminology extraction from large corpora. In Proceedings of IA-94.

[Kageura, 2002] Kageura, K. (2002). The Dynamics of Terminology - a descriptive theory of term formation and terminological growth. John Benjamins.

[Kennedy, 1998] Kennedy, G. (1998). An Introduction to Corpus Linguistics. Longman.

[Kilgarriff and Grefenstette, 2003] Kilgarriff, A. and Grefenstette, G. (2003). Introduction to the special issue on the web as corpus. Computational Linguistics, 29(3).
[Klavans and Muresan, 2000] Klavans, J. L. and Muresan, S. (2000). Evaluation of DEFINDER: A system to mine definitions from consumer-oriented medical text. In Proceedings of the Annual Fall Symposium of the American Medical Informatics Association.

[Kragh, 1995] Kragh, B. (1995). Linguistic Aspects of Technical Language. Copenhagen Business School.

[Kroeze et al., 2003] Kroeze, J. H., Matthee, M. C., and Bothma, T. J. D. (2003). Differentiating data- and text-mining terminology. In Proceedings of SAICSIT 2003.

[Kucera and Francis, 1969] Kucera, H. and Francis, W. N. (1969). Computational analysis of present-day American English. International Journal of American Linguistics, 35:71–75.

[Kumar and Smith, 2003] Kumar, A. and Smith, B. (2003). The Unified Medical Language System and the Gene Ontology: Some Critical Reflections. In Advances in Artificial Intelligence, Lecture Notes in Artificial Intelligence, pages 135–148. Springer.
[Landis and Koch, 1977] Landis, J. R. and Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33:159–174.
[Lyons, 1977] Lyons, J. (1977). Semantics. Cambridge University Press.

[Madsen, 1991] Madsen, B. N. (1991). In terms of concepts. Copenhagen Studies in Language, 14:67–91.

[Madsen, 2000] Madsen, B. N. (2000). I terminologins tjänst, chapter Alting på sin plads og plads til alting: om at ordne og udnytte viden om verden, pages 71–91. University of Vaasa.

[Madsen, 2007] Madsen, B. N. (2007). Indeterminacy in LSP and Terminology: Studies in honour of Heribert Picht, chapter Ontologies and indeterminacy. John Benjamins.

[Madsen et al., 2001] Madsen, B. N., Pedersen, B. S., and Thomsen, H. E. (2001). Defining semantic relations for OntoQuery. Technical report, Copenhagen Business School.
[Madsen et al., 2004] Madsen, B. N., Thomsen, H. E., and Vikner, C. (2004). Comparison of principles applying to domain specific versus general ontologies. In Proceedings of OntoLex 2004: Ontologies and Lexical Resources in Distributed Environments, pages 90–95.

[Madsen et al., 2005a] Madsen, B. N., Thomsen, H. E., and Vikner, C. (2005a). Multidimensionality in terminological concept modelling. In Proceedings of TKE 2005.

[Madsen et al., 2005b] Madsen, B. N., Thomsen, H. E., and Vikner, C. (2005b). Repræsentation af inddelingskriterier i CAOS 2. In Proceedings of Nordterm 2005.

[Marshman, 2002] Marshman, E. (2002). The cause relation in biopharmaceutical corpora: English and French patterns for knowledge extraction. Master's thesis, School of Translation and Interpretation, University of Ottawa.

[Marshman and L’Homme, 2006] Marshman, E. and L’Homme, M. C. (2006). Modern Approaches to Terminological Theories and Applications, chapter Disambiguation of Lexical Markers of Cause and Effect, pages 261–285. Peter Lang.

[McCray, 2003] McCray, A. T. (2003). An upper-level ontology for the biomedical domain. Comparative and Functional Genomics, 4:80–84.

[McEnery and Wilson, 1996] McEnery, T. and Wilson, A. (1996). Corpus Linguistics. Edinburgh University Press.

[McMurry, 1998] McMurry, J. (1998). Fundamentals of Organic Chemistry, 4th edition. Brooks/Cole Publishing Company.

[Melby, 1995] Melby, A. (1995). The possibility of language, volume 14 of Benjamins Translation Library. John Benjamins.

[Meyer, 1997] Meyer, I. (1997). Metaphorical Internet terms: A conceptual and structural analysis. Terminology, 4(1).

[Meyer, 2000] Meyer, I. (2000). When terms move into our everyday lives: An overview of de-terminologization. Terminology, 6(1):111–138.

[Meyer, 2001] Meyer, I. (2001). Extracting knowledge-rich contexts for terminography. In Bourigault, D., Jacquemin, C., and L’Homme, M.-C., editors, Recent Advances in Computational Terminology, chapter 14, pages 279–302. John Benjamins.

[Michalski and Stepp, 1983] Michalski, R. S. and Stepp, R. (1983). Learning from observation: Conceptual clustering. Machine Learning, An Artificial Intelligence Approach, 2:331–363.

[Moffett, 2005] Moffett, M. A. (2005). Language, communication and the paradox of analysis: Some philosophical remarks on Plato’s Cratylus. Philosophiegeschichte und logische Analyse, 8:57–68.
[Mukherjea and Sahay, 2006] Mukherjea, S. and Sahay, S. (2006). Discovering biomedical relations utilizing the world-wide web. In Proceedings of the Pacific Symposium on Biocomputing (PSB 2006).
[Nenadic and Ananiadou, 2006] Nenadic, G. and Ananiadou, S. (2006). Mining semantically related terms from biomedical literature. ACM Transactions on Asian Language Information Processing, 5(1):22–43.

[Nuopponen, 1994] Nuopponen, A. (1994). Begreppssystem för terminologisk analys. In Acta Wasaensia, number 38 in Språkvetenskap. Wasa University.
[Nuopponen, 2005] Nuopponen, A. (2005). Concept relations - an update of a concept relation classification. In Proceedings of TKE 2005.
[Oakes, 1998] Oakes, M. P. (1998). Statistics for Corpus Linguistics. Edinburgh University Press.
[Oeser and Picht, 1999] Oeser, E. and Picht, H. (1999). Terminologische Wissenstechnik. Fachsprachen - Languages for Special Purposes, 2:2229–2237.
[Orilia and Varzi, 1998] Orilia, F. and Varzi, A. C. (1998). A note on analysis and circular definitions. Grazer philosophische Studien, 54:107–115.
[Pantel and Pennacchiotti, 2006] Pantel, P. and Pennacchiotti, M. (2006). Espresso: Leveraging generic patterns for automatically harvesting semantic relations. In Proceedings of ACL 2006.
[Pantel and Ravichandran, 2004] Pantel, P. and Ravichandran, D. (2004). Automatically labeling semantic classes. In Proceedings of HLT/NAACL 2004.
[Patwardhan and Riloff, 2006] Patwardhan, S. and Riloff, E. (2006). Learning domain-specific information extraction patterns from the web. In Proceedings of the ACL Workshop on Information Extraction Beyond the Document.

[Pearson, 1998] Pearson, J. (1998). Terms in Context. John Benjamins.
[Penagos, 2004] Penagos, C. R. (2004). Metalinguistic information extraction for terminology. In Proceedings of COMPUTERM 2004.

[Popescu et al., 2004] Popescu, A.-M., Yates, A., and Etzioni, O. (2004). Class extraction from the world wide web. In Proceedings of the AAAI 2004 Workshop on Adaptive Text Extraction and Mining.

[Ravichandran and Hovy, 2002] Ravichandran, D. and Hovy, E. (2002). Learning surface text patterns for a question answering system. In Proceedings of ACL 2002, pages 41–47.

[Rosario and Hearst, 2004] Rosario, B. and Hearst, M. (2004). Classifying semantic relations in bioscience text. In Proceedings of ACL 2004.
[Rosch, 1973] Rosch, E. (1973). Natural categories. Cognitive Psychology, 4:328–350.

[Sabou, 2005] Sabou, M. (2005). Ontology Learning from Text: Methods, Applications and Evaluation, chapter Learning web service ontologies: an automatic extraction method and its evaluation. IOS Press.

[Sager, 1990] Sager, J. C. (1990). A practical course in terminology processing. John Benjamins.

[Sharoff, 2006] Sharoff, S. (2006). Open-source corpora: Using the net to fish for linguistic data. International Journal of Corpus Linguistics, 11(4):435–462.

[Sinclair, 1991] Sinclair, J. (1991). Corpus Concordance Collocation. Oxford University Press.

[Smith et al., 2004] Smith, B., Kumar, A., and Schulze-Kremer, S. (2004). Revising the UMLS Semantic Network. In Medinfo 2004.

[Sowa, 2000] Sowa, J. F. (2000). Knowledge Representation - Logical, Philosophical, and Computational Foundations. Brooks/Cole - Thomson Learning.

[Spasic et al., 2005] Spasic, I., Ananiadou, S., McNaught, J., and Kumar, A. (2005). Text mining and ontologies in biomedicine: making sense of raw text. Briefings in Bioinformatics, 6(3):239–251.

[Stevens et al., 2004] Stevens, R., Wroe, C., Lord, P., and Goble, C. (2004). Handbook on Ontologies, chapter Ontologies in Bioinformatics, pages 635–657. Springer.

[Suonuuti, 1997] Suonuuti, H. (1997). Guide to terminology. The Finnish Centre for Technical Terminology.

[Temmerman, 2000] Temmerman, R. (2000). Towards New Ways of Terminology Description. John Benjamins.

[Toft, 2000] Toft, B. (2000). I terminologins tjänst - Festskrift för Heribert Picht på 60-årsdagen, chapter Terminologi og vidensteknik: En lykkelig alliance? University of Vaasa.

[Tognini-Bonelli, 2001] Tognini-Bonelli, E. (2001). Corpus linguistics at work. John Benjamins.

[Turney, 2006] Turney, P. (2006). Expressing implicit semantic relations without supervision. In Proceedings of COLING-ACL 2006.

[Vintar et al., 2002] Vintar, S., Buitelaar, P., and Ripplinger, B. (2002). An efficient and flexible format for linguistic and semantic annotation. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC 2002).

[Vintar et al., 2003] Vintar, S., Sonntag, D., and Buitelaar, P. (2003). Evaluating context features for medical relation mining. In Proceedings of the ECML/PKDD Workshop on Data Mining and Text Mining for Bioinformatics.

[Widdowson, 1974] Widdowson, H. G. (1974). Literary and scientific uses of English. English Language Teaching, 4:282–292.

[Wierzbicka, 1992] Wierzbicka, A. (1992). The search for universal semantic primitives, pages 215–242. John Benjamins.

[Wierzbicka, 1995] Wierzbicka, A. (1995). Universal semantic primitives as a basis for lexical semantics. Folia Linguistica, 29(1-2):149–169.

[Woods, 1997] Woods, W. A. (1997). Conceptual indexing: a better way to organize knowledge. Technical report, Sun Microsystems Laboratories.

[Wüster, 1991] Wüster, E. (1979 (1991)). Einführung in die allgemeine Terminologielehre und terminologische Lexikographie. Romanistischer Verlag.

[Yu and Agichtein, 2003] Yu, H. and Agichtein, E. (2003). Extracting synonymous gene and protein terms from biological literature. Bioinformatics, 19:340–349.
[Yu et al., 2002] Yu, H., Hatzivassiloglou, V., Friedman, C., Rzhetsky, A., and Wilbur, W. J. (2002). Automatic extraction of gene and protein synonyms from MEDLINE and journal articles. In Proceedings of the AMIA Symposium 2002, pages 919–923.
