Semantic XML Tagging of Domain-Specific Text Archives: A Knowledge Discovery Approach

Dissertation for the award of the academic degree Doktoringenieur (Dr.-Ing.), accepted by the Faculty of Computer Science of Otto von Guericke University Magdeburg

by Diplom-Kaufmann Peter Karsten Winkler, born on 1 October 1971 in Berlin

Reviewers:
Prof. Dr. Myra Spiliopoulou
Prof. Dr. Gunter Saake
Prof. Dr. Stefan Conrad

Place and date of the doctoral colloquium: Magdeburg, 22 January 2009

Karsten Winkler. Semantic XML Tagging of Domain-Specific Text Archives: A Knowledge Discovery Approach. Dissertation, Faculty of Computer Science, Otto von Guericke University Magdeburg, Magdeburg, Germany, January 2009.

Contents

List of Figures ... v
List of Tables ... vii
List of Algorithms ... xi
Abstract ... xiii
Zusammenfassung (German Abstract) ... xv
Acknowledgments ... xvii

1 Introduction ... 1
  1.1 The Abundance of Text ... 1
  1.2 Defining Semantic XML Markup ... 3
  1.3 Benefits of Semantic XML Markup ... 9
  1.4 Research Questions ... 12
  1.5 Research Methodology ... 14
  1.6 Outline ... 16

2 Literature Review ... 19
  2.1 Storage, Retrieval, and Analysis of Textual Data ... 19
    2.1.1 Knowledge Discovery in Textual Databases ... 19
    2.1.2 Information Storage and Retrieval ... 23
    2.1.3 Information Extraction ... 25
  2.2 Discovering Concepts in Textual Data ... 26
    2.2.1 Topic Discovery in Text Documents ... 27
    2.2.2 Extracting Relational Tuples from Text ... 31
    2.2.3 Learning Taxonomies, Thesauri, and Ontologies ... 35
  2.3 Semantic Annotation of Text Documents ... 39
    2.3.1 Manual Semantic Text Annotation ... 40
    2.3.2 Semi-Automated Semantic Text Annotation ... 43
    2.3.3 Automated Semantic Text Annotation ... 47
  2.4 Schema Discovery in Marked-Up Text Documents ... 50
  2.5 Summary ... 53

3 DIAsDEM Framework ... 57
  3.1 Terminology ... 57
  3.2 Objectives and Overview ... 62
  3.3 Knowledge Discovery Phase ... 65
  3.4 Knowledge Application Phase ... 67
  3.5 Summary ... 68

4 DIAsDEM Knowledge Discovery Process ... 71
  4.1 Terminology ... 71
  4.2 Pre-Processing of Text Documents ... 75
    4.2.1 Creating and Tokenizing Text Units ... 75
    4.2.2 Extracting Named Entities ... 80
    4.2.3 Lemmatizing Words and Word Sense Disambiguation ... 83
    4.2.4 Establishing a Controlled Vocabulary ... 86
    4.2.5 Mapping Text Units onto Text Unit Vectors ... 94
  4.3 Clustering of Text Unit Vectors ... 99
    4.3.1 Clustering Textual Data: An Overview ... 99
    4.3.2 Selecting a Clustering Algorithm ... 107
    4.3.3 Ranking Clusters of Text Unit Vectors ... 115
    4.3.4 Iterative Clustering of Text Unit Vectors ... 124
  4.4 Post-Processing of Discovered Patterns ... 132
    4.4.1 Recommending Semantic Cluster Labels ... 133
    4.4.2 Establishing a Concept-Based XML DTD ... 138
    4.4.3 Semantic XML Tagging of Text Documents ... 143
  4.5 Bridging Knowledge Discovery and Knowledge Application ... 148
  4.6 Process Automation vs. Expert Involvement ... 152
  4.7 Summary ... 154

5 DIAsDEM Workbench ... 157
  5.1 Key Characteristics and Architecture ... 157
  5.2 Overview of Core Tasks ... 160
    5.2.1 Pre-Processing of Text Documents ... 161
    5.2.2 Iterative Clustering of Text Unit Vectors ... 166
    5.2.3 Post-Processing of Discovered Patterns ... 169
  5.3 Summary ... 171

6 Experimental Evaluation ... 173
  6.1 Assessing the Quality of Semantic XML Markup ... 173
    6.1.1 Quality Criteria for Semantic XML Markup ... 174
    6.1.2 Extending DIAsDEM Workbench ... 179
  6.2 Real-World Applications of the DIAsDEM Framework ... 180
    6.2.1 Semantic XML Markup for Competitive Intelligence ... 180
    6.2.2 German Commercial Register Entries ... 181
    6.2.3 News about U.S. Mergers and Acquisitions ... 194
  6.3 Summary ... 206

7 Conclusions ... 209
  7.1 Summary and Contribution ... 209
  7.2 Future Research ... 212
    7.2.1 Structuring the Concept-Based XML DTD ... 212
    7.2.2 Temporal Aspects of Discovered Knowledge ... 213
    7.2.3 Towards Automated Knowledge Discovery ... 214
  7.3 Concluding Remarks ... 215

A Contents of the Supplementary Web Site ... 217

B Specifications of Abstract Data Types ... 219
  B.1 ADT Notation, Primitive Data Types, and Arrays ... 219
  B.2 ADT for Strings ... 220
  B.3 ADT for Vectors ... 220
  B.4 ADT for Text Documents ... 220
  B.5 ADT for Text Archives ... 221
  B.6 ADT for Text Units ... 221
  B.7 ADT for Text Unit Layers ... 222
  B.8 ADT for Concepts ... 222
  B.9 ADT for Sets of Concepts ... 223
  B.10 ADT for Named Entity Types ... 223
  B.11 ADT for Sets of Named Entity Types ... 224
  B.12 ADT for Named Entities ... 224
  B.13 ADT for Sets of Named Entities ... 225
  B.14 ADT for Semantically Marked-Up Text Units ... 225
  B.15 ADT for Semantically Marked-Up Text Unit Layers ... 226
  B.16 ADT for Semantically Marked-Up Text Documents ... 226
  B.17 ADT for Semantically Marked-Up Text Archives ... 227
  B.18 ADT for Conceptual Document Structures ... 227
  B.19 ADT for KDT Algorithms ... 228
  B.20 ADT for KDT Process Flows ... 229
  B.21 ADT for Tokens ... 229
  B.22 ADT for Tokenized Text Units ... 229
  B.23 ADT for Intermediate Named Entities ... 230
  B.24 ADT for Sets of Intermediate Named Entities ... 231
  B.25 ADT for Text Unit Vectors ... 231
  B.26 ADT for Intermediate Text Units ... 232
  B.27 ADT for Intermediate Text Unit Layers ... 233
  B.28 ADT for Intermediate Text Documents ... 234
  B.29 ADT for Intermediate Text Archives ... 234
  B.30 ADT for Controlled Vocabulary Terms ... 235
  B.31 ADT for Controlled Vocabularies ... 235
  B.32 ADT for Text Unit Clusters ... 236
  B.33 ADT for Text Unit Clusterings ... 238
  B.34 ADT for Descriptor Weighting Schemata ... 239
  B.35 ADT for Clustering Algorithms ... 240
  B.36 ADT for Cluster Quality Criteria ... 241
  B.37 ADT for Iteration Metadata ... 241
  B.38 ADT for KDT Process Metadata ... 242

C List of Relevant German Vocabulary ... 245

List of Abbreviations ... 251
Notation and List of Symbols ... 253
Bibliography ... 257
Erklärung (Declaration) ... 295

List of Figures

1.1 Taxonomy of Markup in Text Documents ... 4
1.2 Triangle of Reference ... 5
1.3 Systems Development Research Process and Specific Research Process ... 15
2.1 Process of Knowledge Discovery in Textual Databases ... 22
2.2 Fundamental and Complementary Research Areas ... 54
3.1 Outline of the Two-Phase DIAsDEM Framework ... 64
3.2 Knowledge Discovery Process of the DIAsDEM Framework ... 65
3.3 Knowledge Application Process of the DIAsDEM Framework ... 67
4.1 Text Document $\check{t}_{E3}$ Decomposed into Three Distinct Text Unit Layers ... 76
4.2 Iterative Clustering in the DIAsDEM Knowledge Discovery Process ... 126
4.3 Illustration of Pattern Discovery in Four Clustering Iterations ... 128
4.4 Iterative Classification in the DIAsDEM Knowledge Application Process ... 150
4.5 Generic Process of Knowledge Discovery in Textual Databases ... 155
5.1 Architectural Overview of DIAsDEM Workbench ... 158
5.2 Screen Shot of DIAsDEM Workbench GUI Client ... 159
5.3 Screen Shot of the Replace Named Entities 2.1 Task ... 160
5.4 Screen Shot of the Batch Script Editor Tool ... 161
5.5 Screen Shot of the Thesaurus Editor Tool ... 165
5.6 Screen Shot of the Cluster Text Unit Vectors (hypKNOWsys) Task ... 167
5.7 Visualization of Text Unit Clustering Created by the Monitor Cluster Quality 2.2 Task ... 168
5.8 Visualization of Text Unit Cluster Created by the Monitor Cluster Quality 2.2 Task ... 169
5.9 Screen Shot of the Derive Conceptual DTD 2.2 Task ... 170
6.1 Screen Shot of the Tagging Quality Evaluator 2.2 Tool ... 180

List of Tables

1.1 Excerpt of XML Document Containing Semantically Marked-Up Text ... 8
2.1 Knowledge Discovery in Textual Databases: Tasks and Applications ... 21
4.1 Five Reuters News Items Used in Examples ... 72
4.2 Text Document $\check{t}_{E1}$ Decomposed into Text Unit Layer $\check{r}_{E1}$ ... 76
4.3 Tokenized Text Units after Tokenizing the Elements of Text Unit Layer $\check{r}_{E1}$ ... 79
4.4 Excerpt from the Extended Named Entity Hierarchy and Exemplary Named Entities ... 80
4.5 Processed Text Units of Intermediate Text Unit Layer $\bar{r}_{E1}$ after Extracting Named Entities ... 82
4.6 Intermediate Named Entities Identified in Tokenized Text Units of Intermediate Text Unit Layer $\bar{r}_{E1}$ ... 83
4.7 Processed Text Units of Intermediate Text Unit Layer $\bar{r}_{E1}$ after Lemmatization and Word Sense Disambiguation ... 86
4.8 Excerpt of ISO-2788 Thesaurus for Text Documents $\check{t}_{E1}$ through $\check{t}_{E5}$ and Corresponding DIAsDEM-Specific Controlled Vocabulary Terms ... 92
4.9 Text Unit Descriptors of the Controlled Vocabulary $V_E$ and Weighting Components for Exemplary Text Documents $\check{t}_{E1}$ through $\check{t}_{E5}$ ... 97
4.10 Text Unit Vectors of Intermediate Text Unit Layer $\bar{r}_{E1}$ ... 97
4.11 Common Proximity Measures between Two Text Documents Represented by m-Dimensional Property Vectors $t_1$ and $t_2$ ... 102
4.12 Five Relative Cluster Validity Indices ... 106
4.13 Summary of Three Proposed Clustering Algorithms w.r.t. the Fulfillment of DIAsDEM-Specific Selection Criteria ... 110
4.14 Sentences Assigned to Qualitatively Acceptable Text Unit Cluster 4 in Iteration 1 ... 116
4.15 Sentences Assigned to Qualitatively Unacceptable, Inhomogeneous Text Unit Cluster 7 in Iteration 1