Hierarchical Information Clustering Using Ontology Languages
Travis D. Breaux
Department of Computer Science, North Carolina State University
[email protected]

Joel W. Reed
Computation Sciences and Engineering, Oak Ridge National Laboratory
[email protected]

Abstract

The tools to analyze and visualize information from multiple, inhomogeneous sources have traditionally relied on improvements in statistical methods. The results from statistical methods, however, overlook relevant semantic features present within natural language and text-based information. Emerging research in ontology languages (e.g., RDF, RDFS, SUO-KIF, and OWL) offers promising avenues for overcoming these limitations by leveraging existing and future libraries of meta-data and semantic mark-up. Using semantic features (e.g., hypernyms, meronyms, synonyms, etc.) encoded in ontology languages, methods such as keyword search and clustering can be augmented to analyze and visualize documents at conceptually higher levels. We present findings from a hierarchical clustering system modified for ontological indexing and run on a topic-centric test collection of documents each with fewer than 200 words. Our findings show that ontologies can impose a complete interpretation or subjective clustering onto a document set that is at least as good as meta-word search.

1. Introduction

With the Internet and World Wide Web came improved distribution and storage capabilities of information and an increase in the production and expansion of personal, commercial, and government online services. Recently, emerging wireless and remote access technologies are further increasing the ubiquity of network access and the size of information flows. For this reason, information retrieval (IR) tasks capable of identifying the most relevant information have continued to receive growing attention, with ontologies offering potential new approaches by providing deeper interpretations into information.

In particular, categorization and search tasks that use statistical methods such as Latent Semantic Indexing [1] or the Vector Space Model [2] combined with hierarchical clustering have been successfully demonstrated in a number of IR systems. While clustering offers a unique improvement over conventional, uninformed keyword search, traditional clustering requires sufficiently large populations of words before exact word matches can be used to decide relatedness. A fundamental limitation of these methods is that word indexing misses important semantic relationships available in emerging ontologies. An ontology provides specific relationships between words that can serve as an interpretation in the clustering algorithm. The ontology can provide a single point-of-view or be combined with other ontologies to produce more complex views of the information not previously obtainable by traditional methods.

This paper begins with a background in clustering and ontologies. In describing ontologies, we also provide a brief overview of semantic features commonly supported in ontology languages. Next, we introduce our approach using a hierarchical clustering system combined with our own ontology formatted in an extended RDF/RDFS. Finally, the results of our implementation, including visualizations, are presented and discussed, followed by a review of related work.

2. Background

Statistical methods in text-based information analysis generally seek to uncover correlations among word frequencies in a collection of documents. Perhaps the most elementary approach, the keyword search, organizes documents by indexed words. Extensions to this method apply various algorithms that produce relational rank factors specific to features in the information domain or user context, such as link relevance among web pages [3] or feature usage in software applications [4]. In applications with limited a priori domain knowledge, popular approaches include document clustering obtained by computing relatedness scores using a vector space model. These scores rank and relate documents by word frequencies within documents, commonly called the bag of words, and normalize an overall document score across several documents in a collection. The term frequency inverse document frequency (TFIDF) is a well established relatedness score. In addition, a minimum level of word filtering is performed, either to reduce word-form complexity (e.g., noun and verb stemming, contraction expansion, etc.), as in Porter stemming [5], or to reduce the number of statistically irrelevant words known as stop words [6] (e.g., articles, pronouns, prepositions, etc.). The general theory behind relatedness scores in text-based information analysis follows: abstract concepts are largely represented by nouns (e.g., persons, places, or things) and verbs (e.g., actions and some events), and conceptually related documents will share similar nouns and verbs. The frequency of relatedness, therefore, attempts to describe "just how close" two documents are by counting the number of common nouns and verbs between them.

Present-day ontologies can be grouped into two general categories: those that form meta-language dictionaries and those that are derived from knowledge bases built for inference engines and expert systems. In the former group, the ontology is organized around the words in a natural language via their lexical attributes (i.e., part-of-speech) and semantic relations. In the latter group, the ontology is composed of predicates that in appearance are words or word phrases from natural language (e.g., FruitOrVegetable1) or concepts using several semantic relations (e.g., AboveGroundLevelInAConstruction1). Since the content of these ontologies primarily serves as logical predicates, there is little emphasis placed on explicitly encoding individual relations, as is done in dictionary-style ontologies. In addition to the non-orthogonal conceptual predicates, the latter group often lacks verbs as another consequence of conventional formal inference (i.e., logical implications replacing terms indicative of state transitions). For IR applications that primarily use the natural language content of documents in their sorting algorithms, the dictionary-based ontologies are best suited for expanding relationships between terms within a document.

2.1 Ontology Languages

Ontology languages provide the formal structures that link terms through semantic relations. The categorical, taxonomic or class relations for hypernyms (i.e., super-class) and hyponyms (i.e., sub-class), used in term abstraction and refinement, respectively, are so popular they almost uniquely define the ontological prospect in many applications. In natural language, these relations are applicable to both nouns and verbs, although the emphasis in ontology development has been mostly on nouns or concepts that are compositions of several word forms (i.e., nouns, prepositions, verbs, etc.).

The part-whole relations for meronyms (i.e., parts of a whole) and holonyms (i.e., whole of its parts) are perhaps the next most important ontological features for nouns. Unlike the categorical relations, the part-whole relations have a number of variations exclusive to certain nouns [7], complicating the separation of part-whole structure from content, which is desirable in ontology language design. Other common noun-specific relations include synonyms, antonyms, and homonyms.

Exactly which relationships and other features are present in an ontology language is dictated by the intended application of a specific language. For example, the ontology languages based on subsets of first-order logic place more emphasis on logical operators and set-theoretic relations including disjointedness, transitivity and equivalence classes. Alternatively, part-whole relations are very popular in medical ontology languages, where the need to describe the composition of biological systems is an obvious priority. Evaluating an ontology language is therefore a matter of determining what relationships are supported by the language and required by the ontology or application domain. Adapting an existing ontology to a new application requires the ability to distinguish and separate desirable features from the undesirable to guarantee both the quality and persistent availability of extracted information.

Ontology languages may be community standards, such as LOOM [8], or they may be unique to one implementation, such as Princeton University's WordNet [9]. Recently, there has been much effort to develop standard ontology mark-up languages for indexing and searching HTML documents. Simple HTML Ontology Extensions (SHOE) is an ontology language intended to provide inference capability over arbitrary categories, relations and custom data-types [10]. Publishers mark up existing documents with SHOE instances, referencing external SHOE ontologies that either stand alone or extend other ontologies. Developers of SHOE have since deferred their efforts to the Semantic Web. The Web Ontology Language (OWL) is a continuing W3C project derived from a number of efforts including RDF/RDFS, the Semantic Web, DAML, and OIL. In the spirit of SHOE, OWL provides a language for composing ontologies that can be aligned with HTML content. Whereas the SHOE language implements a form of Horn logic, OWL attempts to implement Description Logic as an extension of RDF [11]. In both efforts, the formalism of the ontology language is driven by the desired inferential capabilities found in their respective logics. The inferential capability is added to