Wikitology: A Novel Hybrid Knowledge Base Derived from Wikipedia

WIKITOLOGY: A NOVEL HYBRID KNOWLEDGE BASE DERIVED FROM WIKIPEDIA

by Zareen Saba Syed

Thesis submitted to the Faculty of the Graduate School of the University of Maryland in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2010

ABSTRACT

Title of dissertation: WIKITOLOGY: A NOVEL HYBRID KNOWLEDGE BASE DERIVED FROM WIKIPEDIA
Zareen Saba Syed, Doctor of Philosophy, 2010
Dissertation directed by: Professor Timothy W. Finin, Department of Computer Science and Electrical Engineering

World knowledge may be available in different forms such as relational databases, triple stores, link graphs, metadata and free text. Human minds are capable of understanding and reasoning over knowledge represented in different ways and are influenced by different social, contextual and environmental factors. Following a similar model, we have integrated a variety of knowledge sources in a novel way to produce a single hybrid knowledge base, Wikitology, enabling applications to better access and exploit knowledge hidden in different forms.

Wikipedia proves to be an invaluable resource for generating a hybrid knowledge base due to the availability and interlinking of structured, semi-structured and unstructured encyclopedic information. However, Wikipedia is designed to facilitate human understanding and contribution: articles and categories are interlinked for better browsing and search, making the content easily understandable to humans but requiring intelligent approaches before applications can exploit it directly. Research projects like Cyc [61] have produced a complex, broad-coverage knowledge base; however, relatively few applications have been built that really exploit it. In contrast, the design and development of the Wikitology KB has been incremental, driven and guided by a variety of applications and approaches that exploit the knowledge available in Wikipedia in different ways. This evolution has resulted in a hybrid knowledge base that not only incorporates and integrates a variety of knowledge resources but also a variety of data structures, and exposes the knowledge hidden in different forms to applications through a single integrated query interface.

We demonstrate the value of the derived knowledge base by developing problem-specific intelligent approaches that exploit Wikitology for a diverse set of use cases, namely, document concept prediction, cross-document co-reference resolution defined as a task in Automatic Content Extraction (ACE) [1], entity linking to KB entities defined as part of the Text Analysis Conference Knowledge Base Population Track 2009 [65], and interpreting tables [94]. These use cases directly serve to evaluate the utility of the knowledge base for different applications and also demonstrate how the knowledge base can be exploited in different ways. Based on our work we have also developed a Wikitology API that applications can use to exploit this unique hybrid knowledge resource for solving real-world problems (see the illustrative query sketch at the end of this document).

The different use cases that exploit Wikitology for solving real-world problems also contribute to enriching the knowledge base automatically. The document concept prediction approach can predict inter-article and category links for new Wikipedia articles.
Cross-document co-reference resolution and entity linking provide a way to link entity mentions in Wikipedia articles or external articles to the corresponding entity articles in Wikipedia, and also help in suggesting redirects. In addition, we have developed specific approaches aimed at automatically enriching the Wikitology KB: unsupervised discovery of ontology elements using the inter-article links, generating disambiguation trees for entities, and estimating the PageRank of Wikipedia concepts to serve as a measure of popularity. Combined, these approaches can contribute to a number of steps in a broader unified framework for automatically adding new concepts to the Wikitology knowledge base.

© Copyright by Zareen Saba Syed, 2010

To My Parents.

Table of Contents

List of Tables
List of Figures

1 Introduction
1.1 Advantages of Deriving a Knowledge Base from Wikipedia
1.2 Thesis Statement
1.3 Contributions
1.4 Thesis Outline

2 Background and Related Work
2.1 Wikipedia Structure and Content
2.2 Wikipedia in-built Search
2.3 Existing Approaches for Refining and Enriching Wikipedia
2.4 Use of Wikipedia as a Knowledge Resource
2.5 Other Open Knowledge Sources

3 Introduction to Wikitology Knowledge Base
3.1 Evolution of Wikitology Knowledge Base
3.2 Evaluating the Knowledge Base
3.3 Automatic Enrichment of Knowledge Base
3.4 Wikitology vs. Existing Wikipedia-based Knowledge Resources
3.5 Conclusion

4 Novel Wikitology Based Approaches for Common Use Cases
4.1 Named Entity Classification
4.2 Document Concept Prediction
4.3 Cross Document Co-reference Resolution
4.4 Entity Linking
4.5 Interpreting Tables

5 Wikitology Hybrid KB Design, Query Interface and API
5.1 Introduction
5.2 Related Work
5.3 Wikitology Architecture and Design
5.4 Wikitology API
5.5 Example Applications
5.6 Conclusions

6 Approaches for Automatically Enriching Wikitology
6.1 Unsupervised Techniques for Discovering Ontology Elements from Wikipedia Article Links
6.2 Disambiguation Trees
6.3 A Broader Framework for Automatically Enriching Wikitology

7 Evaluation Summary
7.1 Approaches for Evaluating Knowledge Bases
7.2 Evaluating Wikitology Knowledge Base
7.3 Evaluation for Document Concept Prediction Task
7.4 Evaluation for Cross Document Coreference Resolution Task
7.5 Evaluation for Entity Linking Task
7.6 Evaluation for Interpreting Tables
7.7 Discussion
7.8 Target Applications for Wikitology
7.9 Conclusions

8 Conclusions
8.1 Discussion
8.2 Future Work

Bibliography

List of Tables

2.1 Wikipedia category statistics (Wikipedia March 2006 dump)
2.2 Property-related statistics in the DBpedia Infobox Ontology
3.1 Summary of features distinguishing Wikitology from existing knowledge resources derived from Wikipedia
4.1 Wikipedia statistics, articles with the word "born" in them
4.2 Results showing accuracy obtained using different feature sets. H: headings only; F: words in first paragraph; H+F: headings + words in first paragraph; H+F+S: headings + words in first paragraph + stop words
4.3 Confusion matrix using SVM
4.4 Confusion matrix using Decision Trees
4.5 Confusion matrix using Naïve Bayes
4.6 Top three concepts predicted for a single test document using Method 1 and Method 2
4.7 Titles of documents in the test sets
4.8 Common concept prediction results for a set of documents
4.9 Common concept prediction results for a set of documents related to "Organic farming" using Method 3 with different pulses and activation functions
4.10 Twelve features were computed for each pair of entities using Wikitology, seven aimed at measuring their similarity and five for measuring their dissimilarity
4.11 Evaluation results for Cross Document Entity Coreference Resolution using Wikitology-generated features
4.12 Accuracy obtained for the Entity Linking task for entities that exist in Wikipedia using different approaches
4.13 Accuracy obtained for positive and negative entity matches using learned score thresholds
4.14 Accuracy of PageRank prediction for different classification algorithms
4.15 Accuracy per PageRank class using the Decision Tree (J48 pruned) algorithm
4.16 Feature contributions for PageRank approximations
4.17 Test data set for interpreting tables
4.18 Entity type distribution in the test data set for interpreting tables
4.19 Evaluation results for class label prediction and cell linking
5.1 Property-related statistics in the DBpedia Infobox Ontology
5.2 Wikitology index structure and field types
6.1 Fifteen slots discovered for musician Michael Jackson along with scores and example fillers
6.2 Comparison of the precision, recall and F-measure for different feature sets for entity classification. The k column shows the number of clusters that maximized the F score
6.3 Manual evaluation of discovered properties
6.4 Evaluation results for class hierarchy prediction using different similarity functions
6.5 Disambiguation features evaluation for different entity types
7.1 Summary of performance of Wikitology-based approaches

List of Figures

2.1 Structure of a typical Wikipedia article with title, contents, infobox, table of contents with different sections, and categories
2.2 Demonstration of the DBpedia resource on Barack Obama
2.3 Linked Open Data datasets on the web (April 2008)
2.4 Freebase topic on Barack Obama along with structured data
2.5 Demonstration of Powerset Factz available online
3.1 Wikitology is a hybrid knowledge base integrating information in structured and unstructured forms
3.2 Evolution of the Wikitology system showing the different data structures incorporated over time and the different applications targeted
4.1 These graphs show the precision, average precision, recall and F-measure metrics as the average similarity threshold varies from 0.1 to 0.8. Legend label M1 is method 1, M1(1) and M1(2) are method 1 with scoring schemes 1 and 2, respectively, and SA1 and SA2 represent the use of spreading activation with one and two pulses, respectively
4.2 In the concept prediction task, precision, average precision and recall improve at higher similarity thresholds, with average precision remaining higher than precision, indicating that our ranking scheme ranks relevant links higher than irrelevant links
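Illustrative query sketch. To make the idea of a single integrated query interface over a hybrid Wikipedia-derived index concrete, the sketch below shows how an application might combine unstructured text evidence with structured fields when asking the knowledge base for candidate entities. It is a minimal illustration only: the in-memory index, the field names (title, text, categories, linked_concepts, page_rank) and the query function are hypothetical assumptions for this sketch, not the actual Wikitology API, which is described in Chapter 5 of the thesis.

```python
# Hypothetical sketch of querying a hybrid Wikipedia-derived knowledge base.
# Field names and scoring are illustrative assumptions, not the Wikitology API.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class KBEntry:
    title: str                                          # Wikipedia article title (entity name)
    text: str                                           # free-text article content
    categories: set = field(default_factory=set)        # structured category labels
    linked_concepts: set = field(default_factory=set)   # titles of linked articles
    page_rank: float = 0.0                              # popularity prior over the link graph

def score(entry: KBEntry, text_terms: set, required_category: Optional[str]) -> float:
    """Combine unstructured (term-overlap) and structured (category) evidence."""
    if required_category and required_category not in entry.categories:
        return 0.0                                       # structured filter
    terms = set(entry.text.lower().split())
    overlap = len(text_terms & terms) / (len(text_terms) or 1)  # text evidence
    return overlap + 0.1 * entry.page_rank               # small popularity boost

def query(kb: list, text: str, category: Optional[str] = None, k: int = 3) -> list:
    """Return titles of the top-k entries matching free text, optionally filtered by category."""
    text_terms = set(text.lower().split())
    ranked = sorted(kb, key=lambda e: score(e, text_terms, category), reverse=True)
    return [e.title for e in ranked[:k] if score(e, text_terms, category) > 0]

# Toy usage: link a mention of a basketball player to the right KB entity.
kb = [
    KBEntry("Michael Jordan", "american basketball player chicago bulls guard",
            {"Person", "Basketball players"}, {"Chicago Bulls", "NBA"}, 0.9),
    KBEntry("Michael I. Jordan", "american researcher machine learning statistics",
            {"Person", "Computer scientists"}, {"Machine learning"}, 0.4),
]
print(query(kb, "basketball player for the chicago bulls", category="Person"))
```

The point of the sketch is the single call site: an application passes free text plus optional structured constraints to one query interface, and ranking draws on both the article text and graph-derived signals such as a popularity prior, mirroring the way the use cases above (entity linking, coreference resolution, table interpretation) query Wikitology.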