Personalized Concept Hierarchy Construction
Hui Yang

CMU-LTI-11-018

Language Technologies Institute
School of Computer Science
Carnegie Mellon University
5000 Forbes Ave., Pittsburgh, PA 15213
www.lti.cs.cmu.edu

Thesis Committee:
Jamie Callan (Carnegie Mellon University, Chair)
Jaime Carbonell (Carnegie Mellon University)
Christos Faloutsos (Carnegie Mellon University)
Eduard Hovy (University of Southern California)

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Language and Information Technologies

© 2011, Hui Yang

Abstract

A concept hierarchy is a set of concepts and relations between those concepts. Since ancient times, concept hierarchies have been used to organize and access information. In some situations, task-specific and user-specific concept hierarchies are necessary to allow an overview of and easy access to a large set of documents. For example, in regulatory reforms, rule-makers in government regulatory agencies must quickly identify and respond to issues raised in public comments. A concept hierarchy constructed for a set of public comments hierarchically organizes the comments, and a user is able to easily "drill down" into documents that discuss a specific topic.

In particular, this dissertation addresses how to construct concept hierarchies from text collections automatically or with a human in the loop. The novel metric-based concept hierarchy construction framework transforms concept hierarchy construction into a multi-criterion optimization problem. It incrementally clusters concepts based on minimum evolution of hierarchy structure, as well as optimization derived from the modeling of concept abstractness and concept coherence. Moreover, this dissertation represents the semantic distance between concepts as a wide range of features, each of which corresponds to a state-of-the-art concept hierarchy construction technique, such as lexico-syntactic patterns, contextual information, and co-occurrence.
The use of multiple features allows a further study of the interaction between features and different types of semantic relations, as well as the interaction between features and concepts at different abstraction levels.

Besides the automatic framework for concept hierarchy construction, this dissertation also proposes an effective human-guided concept hierarchy construction framework to address personalization by learning from periodic manual guidance and directing the learned models towards personal preferences. Through human-computer interactions, the human and the machine work together to organize concepts into hierarchies. The machine's predictions not only save the user's effort but also make sensible suggestions to assist the user. This is one of the first works on real-time machine learning for organizing personalized and task-specific information in an interactive paradigm.

This dissertation also studies user behaviors during concept hierarchy construction. It explores whether people create concept hierarchies more quickly or more consistently using the proposed frameworks, whether there are consistent dataset-specific or user-specific differences in the hierarchies that people construct, whether people are self-consistent, and how these factors interact with different construction methods. The user study shows that dataset difficulty is a major factor affecting how people organize information into concept hierarchies. It also reveals that people are quite self-consistent in building hierarchies. This novel finding provides foundations to study the differences in concept hierarchy construction behaviors between individuals.

Last but not least, the dissertation proposes a novel similarity metric for measuring hierarchy similarity. Fragment-based Similarity (FBS) employs a unique bag-of-words representation for hierarchies and takes a fragment-based view to calculate hierarchy similarity.
FBS approximates tree edit distance well and greatly improves on tree edit distance's efficiency, from NP-hard to only O(n^3) and O(n) if pairwise node similarities are pre-calculated.

The research in this dissertation is an important step forward for concept hierarchy construction. It addresses important problems of concept hierarchy construction. In particular, it considers how to better model these problems with good theoretical foundations, how to study them via extensive empirical experiments and user studies, and how to solve them by developing practical applications for constructing personal concept hierarchies.

Acknowledgement

It is my great pleasure to express my deep and sincere gratitude to those who made this PhD dissertation possible.

Foremost, I am deeply grateful to my advisor, Professor Jamie Callan, for his continuous support and wonderful guidance throughout my entire PhD study. Professor Jamie Callan is a great advisor who is inspiring, perceptive, and patient. Professor Jamie Callan gave me the greatest encouragement to explore the fabulous research area of Information Retrieval and tremendously helped me to focus on the essential things. Most importantly, I learned from Professor Jamie Callan how to be a rigorous scholar. To me, Professor Jamie Callan is not only an academic advisor, but also a role model and a lifetime mentor.

Besides my advisor, I wish to express my sincerest gratitude to the rest of my thesis committee: Professor Jaime Carbonell, Professor Christos Faloutsos, and Professor Eduard Hovy, for their valuable advice and insightful comments. I greatly benefited from their encouragement, brilliant ideas and high-standard questions.

I would also express my warmest gratitude to Professor Tat-Seng Chua, who introduced me to the wonderful field of Information Retrieval and gave me important guidance and encouragement during my initial attempts in academic research.
I am also indebted to many collaborators and friends at Carnegie Mellon University, National University of Singapore, University of Pittsburgh, Microsoft and elsewhere for their great support and kind help. I benefited enormously from those extensive discussions, lunch time chats, and practice talks. I extend my thanks to Professor Yiming Yang, Professor Stuart Shulman, Anton Mityagin, Krysta Svore, Professor Jingtao Wang, Professor Milos Hauskrecht, Professor Scott Falman, Professor Noah Smith, Professor Aarti Singh, Dr. Alex Hauptmann, Professor Lori Levin, Professor Hwee-Tou Ng, Professor Chin-Hui Lee, Dr. Jon Elsas, Professor Jaime Arguello, Dr. Yifen Huang, Dr. Vasco Calais Pedro, Professor Jiong Sun, Dr. Kaimin Chang, Yi-Chia Wang, Dr. Jonathan Chung-Kuan Huang, Dr. Meryem Pinar Donmez, Andreas Zollmann, Professor Luo Si, Professor Yi Zhang, Dr. Fan Li, Dr. Jian Zhang, Yi Chang, Chuang Wu, Dr. Jie Lu, Dr. Paul Ogilvie, Dr. Kevyn Collins-Thompson, Dr. Yanjun Qi, Professor Yan Liu, Dr. Rong Yan, Yangbo Zhu, Pucktada Treeratpituk, Le Zhao, Ni Lao, Anagha Kulkarni, Dr. Shinjae Yoo, Dr. Abhimanyu Lad, Dr. Lingyun Gu, Dr. Wen Wu, Justin Betteridge, Frank Lin, Andrew Schlaikjer, Hideki Shima, Dr. Tien-Ho Lin, Dr. Oznur Tastan, David Pane, Mark Hoy, Thi Truong Avrahami, Dr. Michal Valko, Dr. Richard Pelikan, and many many more.

I owe my warmest thanks to my entire family for their love and understanding. My parents always provide me helpful and timely advice and help me get through the difficult times. My husband is the one who is always by my side and always believes in me. I am so grateful for his unconditional love and enormous support. Without them, this dissertation would not be possible. I also thank my daughter Victoria, who just turned two years old, for not crying too much when mom had to work.

I dedicate this dissertation to my family.

Contents

Abstract
Acknowledgement
1 Introduction
    1.1 Data Exploration in Notice Comment Rulemaking
    1.2 Search Result Organization in Web Search
    1.3 Personalization in Concept Hierarchies
        1.3.1 An Experiment on Personal Differences in Concept Hierarchies
    1.4 Challenges
    1.5 Our Approach
    1.6 Contributions of This Dissertation
    1.7 Outline
2 Related Work
    2.1 Ontology Learning
        2.1.1 Pattern-Based Ontology Learning
        2.1.2 Clustering-Based Ontology Learning
        2.1.3 Other Ontology Learning Approaches
    2.2 Human-Guided Machine Learning
    2.3 Interactive Technologies for Ontologies
    2.4 Summary
3 The Problem
    3.1 Problem Definition
    3.2 OntoCop - A Concept Hierarchy Construction Tool
    3.3 Datasets
        3.3.1 The Public Comment Datasets
        3.3.2 The Web Datasets
        3.3.3 North American Industry Classification System (NAICS)
        3.3.4 WordNet
        3.3.5 ODP
        3.3.6 Summary
    3.4 Measuring Hierarchy Similarity
        3.4.1 Tree Edit Distance
        3.4.2 Schema Measurement
        3.4.3 Indirect Measurement
    3.5 Fragment-Based Similarity
        3.5.1 Vector Representation of Hierarchies
        3.5.2 Identifying Matching Fragments
        3.5.3 Aggregating Similarity Scores
        3.5.4 Experiments
    3.6 Summary
4 Concept Extraction
    4.1 Concept Mining
    4.2 Concept Filtering
    4.3 Concept Unification
    4.4 Experimental Results
    4.5 Summary
5 Metric-based Concept Hierarchy Construction
    5.1 Desirable Properties for a Concept Hierarchy
        5.1.1 Minimum Semantic Distance and Minimum Evolution
        5.1.2 Abstractness
        5.1.3 Long Distance Coherence
        5.1.4 Summary
    5.2 Terminology: Concept Hierarchy, Hierarchy Metric, and Information Function
    5.3 The Metric-based