Semantic Web Topic Models: Integrating Ontological Knowledge and Probabilistic Topic Models
by
Mehdi Allahyari
(Under the Direction of Krys Kochut)
Abstract
Currently we are coping with a plethora of text (more than 80% of the web) generated and disseminated globally on the web, and thanks to new technologies such as smart devices and social networks, it keeps growing exponentially every day. This tremendous amount of text, mostly unstructured, is easy for humans to process and perceive, but significantly hard for machines to understand. Needless to say, this volume of text is an invaluable source of information and knowledge. Thus, there is an increasing need to design methods and algorithms to effectively process this sheer volume of text and extract high-quality information in an automatic fashion. Probabilistic topic models are a class of latent variable models for textual data that can be used to produce interpretable summaries of documents in the form of their constituent topics. However, because topic models are entirely unsupervised, they may create topics that are not always meaningful and understandable to humans. In this dissertation, we develop novel topic models that combine probabilistic topic modeling with domain knowledge in the form of ontologies within a single framework. These models effectively enhance the topic modeling process and produce topics that are best aligned with user modeling goals. We first describe the ontology-based topic model OntoLDA, in which a document is a mixture of topics, whereas topics are distributions over ontology concepts and concepts are multinomial distributions over words. We demonstrate the utility of this model for automatically generating labels for topics. We next propose the sOntoLDA topic model, which combines the DBpedia ontology with topic modeling, and use this model for semantic tagging of web documents. For all these models, we develop learning algorithms and show their usefulness with experiments conducted on real-world datasets.
Index words: Semantic Web, Ontologies, DBpedia, Topic models, Statistical learning, Domain knowledge

Semantic Web Topic Models: Integrating Ontological Knowledge and Probabilistic Topic Models
by
Mehdi Allahyari
B.S., University of Kashan, 2005
A Dissertation Submitted to the Graduate Faculty of The University of Georgia in Partial Fulfillment of the Requirements for the Degree
Doctor of Philosophy
Athens, Georgia
2016

© 2016 Mehdi Allahyari
All Rights Reserved

Semantic Web Topic Models: Integrating Ontological Knowledge and Probabilistic Topic Models
by
Mehdi Allahyari
Approved:
Major Professor: Krys Kochut
Committee: John Miller, Will York
Electronic Version Approved:
Suzanne Barbour
Dean of the Graduate School
The University of Georgia
August 2016

Semantic Web Topic Models: Integrating Ontological Knowledge and Probabilistic Topic Models
Mehdi Allahyari
June 2016

For my parents and my family: Parisa and Ryan
Acknowledgments
Over the past six years I have received support from a great many individuals. I owe my appreciation to all of these people, without whom this dissertation could not have been completed. First and foremost, I want to express my sincere appreciation and gratitude to my advisor, Dr. Krys Kochut, who has been a mentor, colleague, and friend. His willingness to give me the freedom to explore research topics about which I have been truly passionate, and at the same time his guidance to recover when I faltered, made my Ph.D. experience productive and rewarding. I am also deeply grateful to the other professors on my advisory committee, Dr. John Miller and Dr. Will York, for their encouragement and valuable advice during the years I was striving to advance my research. Most importantly, I owe my deepest gratitude to my wife, Dr. Parisa Darkhal, and my parents, to whom this dissertation is dedicated, for their unflagging love and continuous support throughout my life and during my studies.
Contents
Acknowledgments vi
List of Figures x
List of Tables xii
1 Introduction 1
   1.1 Outline 4
2 Background 7
   2.1 Semantic Web 7
   2.2 Probabilistic Topic Models 14
3 Related Work 18
   3.1 LDA-based Topic Models for Modeling Observed Features of Documents 19
   3.2 LDA-based Topic Models Exploiting Prior Knowledge 24
4 Motivating Example 28
   4.1 Combining Ontologies with Topic Models 29
5 Ontology-Based Text Classification 34
   5.1 Introduction 36
   5.2 Ontology-based Text Categorization 37
   5.3 Classification Categories 41
   5.4 Categorization Algorithm 45
   5.5 Experiments 53
   5.6 Conclusion and Future Work 58
6 OntoLDA: An Ontology-based Topic Model for Automatic Topic Labeling 59
   6.1 Introduction 61
   6.2 Background 64
   6.3 Motivating Example 68
   6.4 Related Work 70
   6.5 Problem Formulation 74
   6.6 Concept-based Topic Labeling 82
   6.7 Experiments 94
   6.8 Conclusions 104
7 Combining Topic Models with Wikipedia Category Network for Semantic Tagging 105
   7.1 Introduction 106
   7.2 Related Work 111
   7.3 Probabilistic Topic Models 114
   7.4 Ontology-based Topic Models for Semantic Tagging 116
   7.5 Experiments 122
   7.6 Classifying the Tagged Documents 137
   7.7 Conclusions 142
8 Conclusions and Future Work 145
   8.1 Summary of Contributions 146
   8.2 Future Work 148
Bibliography 151
List of Figures
2.1 Graph structure of the example RDF statement 10
2.2 Linked Open Data Cloud 13
2.3 LDA Graphical Model 16
5.1 Thematic graph from the example text 40
5.2 Instanced graph with selected matched entities for topics defined as union, intersection, and a combination of contexts 45
6.1 LDA Graphical Model 67
6.2 Graphical representation of OntoLDA model 77
6.3 Example of a topic represented by top concepts learned by OntoLDA 84
6.4 Semantic graph of the example topic φ described in Fig. 6.3 with |Vφ| = 13 85
6.5 Dominant thematic graph of the example topic described in Fig. 6.4 87
6.6 Core concepts of the dominant thematic graph of the example topic described in Fig. 6.5 88
6.7 Label graph of the concept “Oakland Raiders” along with its mScore to the category “American Football League teams” 90
6.8 Comparison of the systems using human evaluation 100
7.1 Graphical representation of LDA model 114
7.2 Graphical representation of sOntoLDA model 117
7.3 Precision, Coverage and MAP of Exact Match for Wikipedia Dataset 126
7.4 Precision, Coverage and MAP of Hierarchical Match for Wikipedia Dataset 128
7.5 Relations between the top 4 Wikipedia categories assigned to the article “Tooth brushing” 132
7.6 Precision, Coverage and MAP for Reuters Dataset 134
List of Tables
4.1 Example learned topics that are less useful or meaningful to the user 29
4.2 Top-10 words for topics from a document set 30
4.3 Top-10 words for topics from a document set. The second row presents the manually assigned labels 31
4.4 Sample topics with top-10 words and top-5 labels generated by the Mei et al. method 32
4.5 Sample topics with top-10 words and top-5 labels generated by our topic labeling method 33
5.1 Category details of the text corpus used 54
5.2 Fine-grained categorization of main categories 54
5.3 High-level ontological contexts categorization 55
5.4 Categorization based on unions of sub-categories 56
5.5 Categorization based on unions of sub-categories 56
6.1 Example of a topic with its label 62
6.2 Example topics with top-10 words learned from a document set 69
6.3 Example topics with top-10 words learned from a document set. The second row presents the manually assigned labels 70
6.4 Example of topic-word representation learned by LDA and topic-concept representation learned by OntoLDA 75
6.5 Notation used in this dissertation 76
6.6 Example of a topic with top-10 concepts (first column) and top-10 labels (second column) generated by our proposed method 93
6.7 Sample topics of the BAWE corpus with top-6 generated labels for the Mei method and OntoLDA + Concept Labeling, along with top-10 words 98
6.8 Sample topics of the Reuters corpus with top-6 generated labels for the Mei method and OntoLDA + Concept Labeling, along with top-10 words 99
6.9 Example topics from the two document sets (top-10 words are shown). The third row presents the manually assigned labels 102
6.10 Topic coherence on top T words. A higher coherence score means the topics are more coherent 103
6.11 Example topics with top-10 concept distributions in the OntoLDA model 104
7.1 Examples of five topics with their labels 108
7.2 Precision, Coverage and MAP values of “exact match” for Wikipedia Dataset 125
7.3 Precision, Coverage and MAP values of “hierarchical match” for Wikipedia Dataset 129
7.4 Top 5 categories selected for the article “Tooth brushing” 131
7.5 Precision, Coverage and MAP values of “hierarchical match” for Reuters Dataset 135
7.6 An example of top-10 words for 5 categories (topics) in the Wikipedia Dataset 136
7.7 An example of top-10 words for 5 categories (topics) in the Reuters Dataset 137
7.8 Multi-label classification Precision, Recall and F-Measure values of different algorithms 142
7.9 Precision, Recall and F-Measure values of different algorithms 143
Chapter 1
Introduction
Extracting high-quality and useful information from massive and constantly growing collections of text documents has become a challenging task and has gained a great deal of attention in recent years. This tremendous amount of text data, which is often unstructured, is created in a variety of forms such as social networks, the web, and other types of information-centric applications. Understanding and modeling the content of documents can be very beneficial in many applications, such as information retrieval, natural language processing, document classification, and document summarization. Consequently, there is an increasing need to design methods and algorithms that effectively process this avalanche of text in a wide variety of text applications. Probabilistic topic models are widely used to address these complex problems by virtue of their sound theoretical foundations in statistics and their capability to be extended and combined with other models in a systematic manner.
Probabilistic topic models such as Latent Dirichlet Allocation [17] are powerful techniques to analyze the content of documents and extract the underlying topics represented in the collection. Topic models usually assume that individual documents are mixtures of one or more topics, while topics are probability distributions over words. These models have been extensively used in a variety of text processing tasks, such as word sense disambiguation [47, 20], relation extraction [121], text classification [42, 65], and information retrieval [117]. Thus, topic models provide an effective framework for extracting the latent semantics from unstructured text collections. In addition to modeling textual data such as webpages, news articles, emails, and scientific and medical publications [86, 67, 108, 94, 100, 111], topic models have been demonstrated to be useful in modeling non-textual data such as image collections [13, 7, 119].
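The generative assumption above (documents as mixtures of topics, topics as distributions over words) can be simulated directly. The following is a minimal, illustrative sketch of LDA's generative process; all function and parameter names are hypothetical and not part of the dissertation's notation:

```python
import random

def sample_dirichlet(alpha, dim, rng):
    # Draw from a symmetric Dirichlet by normalizing Gamma samples.
    g = [rng.gammavariate(alpha, 1.0) for _ in range(dim)]
    s = sum(g)
    return [x / s for x in g]

def sample_discrete(probs, rng):
    # Draw an index proportionally to the given probabilities.
    r = rng.random()
    for i, p in enumerate(probs):
        r -= p
        if r <= 0:
            return i
    return len(probs) - 1

def generate_corpus(n_docs, doc_len, n_topics, vocab, alpha=0.5, beta=0.5, seed=1):
    """Generate documents by LDA's generative process: each topic is a
    distribution over words, each document is a mixture of topics, and
    each word is drawn by first choosing a topic, then a word from it."""
    rng = random.Random(seed)
    phi = [sample_dirichlet(beta, len(vocab), rng) for _ in range(n_topics)]
    corpus = []
    for _ in range(n_docs):
        theta = sample_dirichlet(alpha, n_topics, rng)  # per-document topic mixture
        doc = []
        for _ in range(doc_len):
            z = sample_discrete(theta, rng)    # choose a topic
            w = sample_discrete(phi[z], rng)   # choose a word from that topic
            doc.append(vocab[w])
        corpus.append(doc)
    return corpus
```

Inference works in the opposite direction: given only the documents, it recovers the hidden mixtures theta and topic distributions phi.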
However, due to the fact that topic models are entirely unsupervised, purely statistical, and data-driven, they may produce topics that are not always meaningful and understandable to humans. In other words, the discovered topics may not always correspond to what the user had in mind. We introduce a mechanism to cope with this issue: we develop topic models that allow the user to influence and guide the learned topics, while still maintaining the statistical pattern discovery abilities that make topic models a powerful tool.
In this dissertation, we first propose an ontology-based method for automatic document classification into dynamically defined topics of interest. We investigate what benefits can be obtained by taking advantage of ontologies compared to using traditional supervised machine learning algorithms. We next explore how to integrate prior knowledge in the form of ontologies into the topic modeling framework. We propose knowledge-based topic models that incorporate domain knowledge to guide the topic identification process. In particular, we develop topic models that combine ontological knowledge bases, such as the DBpedia ontology and Linked Open Data (LOD), with probabilistic topic models to get the best of both worlds. Integrating prior knowledge with topic models enhances the effectiveness of topic modeling and can be very helpful in directing the model towards the topics that are best aligned with user modeling goals when multiple candidate topic decompositions exist for a given corpus of documents. The main contributions of this research can be summarized as follows:
• We identify the restrictions of using traditional supervised learning techniques for the document classification task. We introduce an ontology-based method for automatic text document classification into dynamically defined topics. We show the benefits of our method over traditional methods and demonstrate its effectiveness through comprehensive evaluation.
• We introduce a new ontology-based topic model, OntoLDA, for the automatic topic labeling task. It captures the relationships between ontology concepts and the topics learned from text corpora, and generates labels for the topics relying on the ontological concepts.
• We develop a collapsed Gibbs sampling inference algorithm for the OntoLDA topic model.
• We evaluate the effectiveness of ontology-based topic models by conducting extensive experiments on different datasets.
• We develop a knowledge-based topic model called sOntoLDA topic model, which creates semantic tags for Web resources and online documents. It systematically combines the prior knowledge from the DBpedia ontology with the statistical topic models in a principled manner. Furthermore, it captures the distributions of DBpedia categories (concepts) over the words of the document collection.
• We create a collapsed Gibbs sampling inference algorithm for the sOntoLDA topic model.
• We show how the sOntoLDA topic model functions through illustrative examples and demonstrate its utility by running comprehensive evaluations on multiple datasets.
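The collapsed Gibbs samplers mentioned above specialize the standard LDA sampler, which repeatedly resamples each token's topic from its full conditional given all other assignments. As a point of reference, a minimal sketch for plain LDA (not the OntoLDA/sOntoLDA variants); all names here are illustrative:

```python
import random

def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for standard LDA.
    docs: list of documents, each a list of word ids in [0, vocab_size).
    Returns the final topic assignment z for every token."""
    rng = random.Random(seed)
    # Count matrices: document-topic, topic-word, and topic totals.
    n_dk = [[0] * n_topics for _ in docs]
    n_kw = [[0] * vocab_size for _ in range(n_topics)]
    n_k = [0] * n_topics
    # Random initialization of topic assignments.
    z = []
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            k = rng.randrange(n_topics)
            zd.append(k)
            n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the current assignment from the counts.
                n_dk[d][k] -= 1; n_kw[k][w] -= 1; n_k[k] -= 1
                # Full conditional p(z_i = k | z_-i, w), up to a constant.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta)
                    / (n_k[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                # Sample a new topic proportionally to the weights.
                r = rng.random() * sum(weights)
                for t, wt in enumerate(weights):
                    r -= wt
                    if r <= 0:
                        k = t
                        break
                z[d][i] = k
                n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1
    return z
```

The ontology-based samplers in Chapters 6 and 7 extend this scheme with additional count matrices over ontology concepts.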
1.1 Outline
The remainder of this dissertation is organized as follows:
• We formally define the Latent Dirichlet Allocation (LDA) topic model and inference algorithms for topic models in Chapter 2. This chapter also describes several primary concepts associated with the Semantic Web.
• We survey a wide variety of related work in Chapter 3, from extensions and modifications researchers have made to the standard LDA model to some of the recent research on exploiting prior knowledge in topic models.
• In Chapter 4, we describe a motivating example to answer questions such as “How can ontologies benefit probabilistic topic models?” and “How can domain knowledge from ontologies be integrated with unsupervised topic models?”
• Chapter 5 describes an ontology-based method for text classification into a dynamically defined set of topics. In this method, the ontology effectively serves as the classifier: not only does it not require a training set, but it also allows the user to change the topics of interest without re-training the classifier.
• Chapters 6 and 7 describe different mechanisms for the integration of ontological domain knowledge with unsupervised topic models. Chapter 6 introduces an ontology-based topic model for the task of automatic topic labeling, and illustrates the theory behind it and how it functions. This chapter additionally presents experiments showing the usefulness of ontology-based topic models. In Chapter 7, we describe a knowledge-based topic model for tagging documents automatically. We discuss the generative process of this model and provide the inference algorithm, which relies on collapsed Gibbs sampling. Moreover, we conduct extensive experiments showing the utility and robustness of this model.
• Chapter 8 concludes the dissertation, summarizing the contributions and describing directions for further research building on the foundations established in this work.
Chapter 2
Background
In this chapter, we formally describe some of the related concepts and notations. We begin with the formal definition of several fundamental Semantic Web concepts. We then formally define Latent Dirichlet Allocation (LDA), the state-of-the-art probabilistic topic modeling technique.
2.1 Semantic Web
The Semantic Web is considered an extension of the current Web through standards set by the World Wide Web Consortium (W3C)1. It was initiated by Tim Berners-Lee and formally introduced to the world in the May 2001 Scientific American article “The Semantic Web”. The Semantic Web brings structure to Web content, making the information not only human-readable but also representing it in a form that is machine-processable [9]. Tim Berners-Lee articulated a Web that, apart from being an infrastructure for documents and their links, would express a Web of data (i.e., a Web of resources with relations among them), which enables information to be shared and reused across applications. For example, resources could represent objects such as organizations, people, and locations, and links between them describe the relationships among them. In order to bring structure to the Web and allow information exchange across applications, the Semantic Web requires a few technologies. In the following sections, we outline the fundamental Semantic Web technologies that are necessary for achieving the functionality mentioned above.

1www.w3.org
2.1.1 Ontology
Ontologies have been designed as a way to express knowledge about a domain in the Semantic Web. Tom Gruber defines an ontology as an “explicit and formal specification of a conceptualization” [39]. We define the concept of ontology using the definition presented in W3C’s OWL Use Cases and Requirements Document2 as follows:
An ontology O formally defines a common set of terms that are used to de- scribe and represent a domain. An ontology defines the terms used to describe and represent an area of knowledge.
According to the definition above, we should mention a few points about ontology: 1) an ontology is domain-specific, i.e., it is used to describe and represent an area of knowledge such as education or medicine [122]; 2) an ontology consists of terms and relationships among these terms. Terms are often called classes or concepts, and relationships are called properties. Thanks to the introduction of standard languages and recent advancements in ontology creation, working with ontologies has become much easier, which has significantly impacted knowledge and data integration, exchange, and collaborative work. Ontologies can be broadly classified into two categories: (a) domain-specific ontologies, which are important sources of knowledge in their particular domains. Biomedical ontologies such as the Gene Ontology (GO) and the Ontology for Biomedical Investigations3 are examples of this type, where the former describes the function of genes and gene products and the latter is an integrated ontology for the description of life-science and clinical investigations. (b) In contrast, general-scope ontologies cover multiple domains and include concepts from numerous areas. DBpedia [5], an encyclopedic ontology derived from Wikipedia, contains knowledge from biology, science, art, music, and many more domains.

2http://www.w3.org/TR/webont-req/
2.1.2 Resource Description Framework (RDF)
RDF was originally created in early 1999 by the W3C as a standard for encoding metadata. RDF is the Semantic Web’s data model, designed to make statements about resources, especially web resources, in the form of ⟨subject, predicate, object⟩ expressions. These expressions are called triples. For example, we can express the fact “Barack Obama is the president of the United States.” as the RDF triple ⟨Barack Obama, isPresidentOf, United States⟩, whose graph structure is represented in Figure 2.1. RDF allows structured and semi-structured data to be mixed and shared across different applications.

3http://www.obofoundry.org/
Barack Obama --isPresidentOf--> United States
Figure 2.1: Graph structure of the example RDF statement
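The triple model shown in Figure 2.1 can be sketched in plain Python. The second triple and the match() helper below are hypothetical illustrations, not part of the dissertation's examples:

```python
# A tiny in-memory triple store: a set of (subject, predicate, object)
# tuples, mirroring the RDF statement shown in Figure 2.1.
triples = {
    ("Barack Obama", "isPresidentOf", "United States"),
    ("Barack Obama", "memberOf", "Democratic Party"),  # hypothetical extra fact
}

def match(store, s=None, p=None, o=None):
    """Return all triples matching a pattern; None acts as a wildcard."""
    return [
        (ts, tp, to) for (ts, tp, to) in store
        if (s is None or ts == s)
        and (p is None or tp == p)
        and (o is None or to == o)
    ]

# Which subject stands in the isPresidentOf relation to "United States"?
presidents = match(triples, p="isPresidentOf", o="United States")
```

Pattern matching over triples with wildcards is exactly what SPARQL basic graph patterns do at scale over RDF data.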
2.1.3 RDF Schema
RDFS is a set of classes and properties used to describe and encode RDF triples. According to the W3C4, RDFS is formally defined as:
“RDFS is a recommendation from W3C and it is an extensible knowledge representation language that one can use to create a vocabulary for describing classes, sub-classes and properties of RDF resources.”
Using this definition, RDFS is RDF’s vocabulary description language. For example, we can define a common vocabulary for various classes (types) of laptops and their properties.
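As an illustration of the laptop example, such a vocabulary could be written in RDFS using Turtle syntax. The ex: namespace, class names, and property names below are hypothetical:

```turtle
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/laptops#> .

ex:Laptop      a rdfs:Class .
ex:Ultrabook   a rdfs:Class ;
               rdfs:subClassOf ex:Laptop .
ex:screenSize  a rdf:Property ;
               rdfs:domain ex:Laptop ;
               rdfs:range  rdfs:Literal .
```

With this vocabulary, any resource typed as ex:Ultrabook is also, by RDFS inference, an ex:Laptop.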
4 http://www.w3.org/TR/rdf-schema/
2.1.4 Linked Open Data (LOD)
Linked Data is about creating typed links between data from various sources [10]. In other words, Linked Data is a method of publishing structured data in such a way that it is interlinked with other data sources. Linked Data is based on standard Web technologies such as HTTP, RDF, and URIs. Tim Berners-Lee outlined a set of rules for publishing linked data on the web as follows:
1. Use URIs as the identifiers for things.
2. Use HTTP so that the things can be looked up.
3. Provide useful information when people look up a URI, using standards such as RDF, SPARQL, etc.
4. Include links to other URIs, so that they can find more things.
Since Linked Data was introduced, many publishers have published their datasets in the Linked Data format. These datasets come in different forms, such as XML, RDF, CSV, and text, and cover multiple domains. As of 2014, the LOD cloud contained over 1,014 datasets5. Figure 2.2 illustrates the most recent image representing the datasets in the Linked Open Data cloud. One of the most important and primary datasets of the LOD is DBpedia [5, 11]. DBpedia is an ontology containing structured information extracted from Wikipedia and is publicly available on the Web. The English version of the DBpedia knowledge base

5http://linkeddatacatalog.dws.informatik.uni-mannheim.de/state/

describes 4.58 million things, of which 4.22 million are classified in a consistent ontology, including 1,445,000 persons, 735,000 places (including 478,000 populated places), 411,000 creative works (including 123,000 music albums, 87,000 films and 19,000 video games), 241,000 organizations (including 58,000 companies and 49,000 educational institutions), 251,000 species and 6,000 diseases6. The DBpedia knowledge base is very useful and provides many advantages: it covers many domains; because it is extracted from Wikipedia, it automatically evolves as Wikipedia changes; and it is multilingual, providing localized versions in 125 languages. Altogether it contains 3 billion pieces of information (RDF triples), of which 580 million were extracted from the English edition of Wikipedia. Because DBpedia is structured, it allows us to ask quite complex queries against Wikipedia. Hence, it is feasible to leverage this invaluable knowledge in a given data/text mining task. In fact, rich knowledge sources such as ontologies in the Semantic Web have been extensively utilized in a variety of data mining and knowledge discovery tasks [90].
6http://dbpedia.org/about
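For instance, a question that would be hard to answer against raw Wikipedia text becomes a simple structured query against DBpedia. The SPARQL query below is a hypothetical illustration; dbo:Film, dbo:director, and dbo:birthPlace are terms from the DBpedia ontology:

```sparql
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

# Films whose director was born in Chicago.
SELECT ?film ?director WHERE {
  ?film a dbo:Film ;
        dbo:director ?director .
  ?director dbo:birthPlace dbr:Chicago .
}
```

Such a query joins facts about films and facts about people, something plain keyword search over Wikipedia articles cannot express.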
[Figure 2.2: The Linked Open Data cloud; linked datasets as of August 2014.]
Figure 2.2: Linked Open Data Cloud
2.2 Probabilistic Topic Models
Probabilistic topic models are a family of algorithms used to uncover the hidden thematic structure of a collection of documents. The main idea of topic modeling is to build a probabilistic generative model for a corpus of text documents. In topic models, documents are mixtures of topics, where a topic is a probability distribution over words. The two main topic models are Probabilistic Latent Semantic Analysis (pLSA) [44] and Latent Dirichlet Allocation (LDA) [17]. Hofmann (1999) introduced pLSA for document modeling. However, pLSA provides no probabilistic model at the document level, which makes it difficult to generalize the model to new, unseen documents. Blei et al. [17] extended pLSA by introducing a Dirichlet prior on the per-document mixture weights of topics, and called the resulting model Latent Dirichlet Allocation (LDA). In this section we describe LDA. LDA [17] is a generative probabilistic model for extracting the thematic information (topics) of a collection of documents. It assumes that each document is made up of various topics, where each topic is a probability distribution over words.
Let D = {d1, d2, . . . , d|D|} be the corpus and V = {w1, w2, . . . , w|V|} be the vocabulary of the corpus. A topic zj, 1 ≤ j ≤ K, is represented as a multinomial probability distribution over the |V| words, p(wi|zj), with \(\sum_{i=1}^{|V|} p(w_i \mid z_j) = 1\). LDA generates the words in a two-stage process: words are generated from topics and topics are generated by documents. More formally, the distribution of words given the document is calculated as follows:
\[
  p(w_i \mid d) = \sum_{j=1}^{K} p(w_i \mid z_j)\, p(z_j \mid d) \tag{2.1}
\]
The graphical model of LDA is shown in Figure 7.1 and the generative process for the corpus D is as follows:
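To make the mixture in Equation (2.1) concrete, the following small numerical sketch computes p(wi|d) for one document. The topic and document distributions here are invented purely for illustration:

```python
import numpy as np

# Hypothetical vocabulary of 4 words and K = 2 topics.
# phi[j, i] = p(w_i | z_j): each row is one topic's word distribution.
phi = np.array([
    [0.5, 0.3, 0.1, 0.1],   # topic 1
    [0.1, 0.1, 0.4, 0.4],   # topic 2
])

# theta[j] = p(z_j | d): topic proportions for a single document d.
theta = np.array([0.7, 0.3])

# Equation (2.1): p(w_i | d) = sum_j p(w_i | z_j) * p(z_j | d)
p_w_given_d = theta @ phi

print(p_w_given_d)        # [0.38, 0.24, 0.19, 0.19]
print(p_w_given_d.sum())  # 1.0 -- a valid distribution over the vocabulary
```

Because each topic is a proper distribution over words and the topic weights sum to one, the resulting marginal word distribution for the document also sums to one.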
1. For each topic k ∈ {1, 2,...,K}, sample a word distribution φk ∼ Dir(β)
2. For each document d ∈ {1, 2,..., |D|},
(a) Sample a topic distribution θd ∼ Dir(α)
(b) For each word wn, where n ∈ {1, 2,...,N}, in document d,
i. Sample a topic zn ∼ Mult(θd)
ii. Sample a word wn ∼ Mult(φzn )
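The generative process above can be simulated directly. The sketch below samples a toy corpus from the LDA model; all sizes and hyperparameter values are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, D, N = 3, 8, 2, 10   # topics, vocabulary size, documents, words per doc (toy sizes)
alpha, beta = 0.1, 0.01    # symmetric Dirichlet hyperparameters (illustrative values)

# Step 1: one word distribution phi_k ~ Dir(beta) per topic.
phi = rng.dirichlet([beta] * V, size=K)

corpus = []
for d in range(D):
    # Step 2a: topic distribution theta_d ~ Dir(alpha) for document d.
    theta = rng.dirichlet([alpha] * K)
    doc = []
    for n in range(N):
        # Step 2b-i: sample a topic assignment z_n ~ Mult(theta_d).
        z = rng.choice(K, p=theta)
        # Step 2b-ii: sample a word w_n ~ Mult(phi_{z_n}).
        w = rng.choice(V, p=phi[z])
        doc.append(w)
    corpus.append(doc)

print(corpus)  # each document is a list of word indices into the vocabulary
```

Small α and β values yield sparse distributions, so each simulated document tends to concentrate on a few topics and each topic on a few words.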
The joint distribution of the model (hidden and observed variables) is:
\[
  P(\phi_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) =
  \prod_{j=1}^{K} P(\phi_j \mid \beta)
  \prod_{d=1}^{|D|} \Bigl( P(\theta_d \mid \alpha)
  \prod_{n=1}^{N} P(z_{d,n} \mid \theta_d)\, P(w_{d,n} \mid \phi_{1:K}, z_{d,n}) \Bigr)
\]
2.2.1 Inference and Parameter Estimation for LDA
In the LDA model, the word-topic distribution p(w|z) and the topic-document distribution p(z|d) are learned entirely in an unsupervised manner, without any prior knowledge about which words are related to which topics, or which topics are related to individual documents.
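Exact inference in LDA is intractable, and a widely used approximate method is collapsed Gibbs sampling (Griffiths and Steyvers, 2004), which integrates out φ and θ and resamples each word's topic assignment from its conditional distribution. The following is a minimal illustrative sampler, not necessarily the estimation procedure used in this dissertation:

```python
import numpy as np

def lda_gibbs(docs, K, V, alpha=0.1, beta=0.01, iters=100, seed=0):
    """Minimal collapsed Gibbs sampler for LDA (illustrative, not optimized).

    docs: list of documents, each a list of word indices in [0, V).
    Returns estimates of phi = p(w|z) and theta = p(z|d).
    """
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), K))  # topic counts per document
    n_kw = np.zeros((K, V))          # word counts per topic
    n_k = np.zeros(K)                # total words assigned to each topic
    z = [[0] * len(doc) for doc in docs]

    # Random initialization of topic assignments.
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = rng.integers(K)
            z[d][n] = k
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]  # remove the current assignment from the counts
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # Collapsed conditional p(z_{d,n} = k | rest), up to a constant.
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][n] = k
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

    # Posterior point estimates of the two distributions of interest.
    phi = (n_kw + beta) / (n_k[:, None] + V * beta)
    theta = (n_dk + alpha) / (n_dk.sum(axis=1, keepdims=True) + K * alpha)
    return phi, theta

# Toy usage: four tiny "documents" over a vocabulary of 4 words.
docs = [[0, 0, 1, 1], [0, 1, 0, 1], [2, 3, 2, 3], [2, 2, 3, 3]]
phi, theta = lda_gibbs(docs, K=2, V=4)
```

On this toy corpus the sampler tends to separate words {0, 1} and {2, 3} into distinct topics, mirroring how the documents were constructed.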