World Scientific Encyclopedia with Semantic Computing and Robotic Intelligence
ISSN: 2529-7686
Published Vol. 1 Semantic Computing edited by Phillip C.-Y. Sheu
World Scientific Encyclopedia with Semantic Computing and Robotic Intelligence – Vol. 1 SEMANTIC COMPUTING
Editor Phillip C-Y Sheu University of California, Irvine
World Scientific
NEW JERSEY • LONDON • SINGAPORE • BEIJING • SHANGHAI • HONG KONG • TAIPEI • CHENNAI • TOKYO
Published by World Scientific Publishing Co. Pte. Ltd. 5 Toh Tuck Link, Singapore 596224 USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601 UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
Library of Congress Cataloging-in-Publication Data Names: Sheu, Phillip C.-Y., editor. Title: Semantic computing / editor, Phillip C-Y Sheu, University of California, Irvine. Other titles: Semantic computing (World Scientific (Firm)) Description: Hackensack, New Jersey : World Scientific, 2017. | Series: World Scientific encyclopedia with semantic computing and robotic intelligence ; vol. 1 | Includes bibliographical references and index. Identifiers: LCCN 2017032765| ISBN 9789813227910 (hardcover : alk. paper) | ISBN 9813227915 (hardcover : alk. paper) Subjects: LCSH: Semantic computing. Classification: LCC QA76.5913 .S46 2017 | DDC 006--dc23 LC record available at https://lccn.loc.gov/2017032765
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.
Copyright © 2018 by World Scientific Publishing Co. Pte. Ltd. All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
Desk Editor: Catherine Domingo Ong
Typeset by Stallion Press Email: [email protected]
Printed in Singapore
Foreword
The line between computer science and robotics is continuing to be softened. On one hand, computers are continuing to be humanized, and a large number of cyber-physical systems are being developed to act upon the physical world. On the other hand, the robotic community is working on future robots that are versatile computing machines with very sophisticated intelligence comparable to the human brain.

As a branch of computer science, semantic computing (SC) addresses the derivation, description, integration, and use of semantics ("meaning", "context", "intention") for all types of resource including data, document, tool, device, process and people. It considers a typical, but not exclusive, semantic system architecture to consist of five layers:

• (Layer 1) Semantic Analysis, which analyzes and converts signals such as pixels and words (content) to meanings (semantics);
• (Layer 2) Semantic Data Integration, which integrates data and their semantics from different sources within a unified model;
• (Layer 3) Semantic Services, which utilize the content and semantics as the source of knowledge and information for building new applications - which may in turn be available as services for building more complex applications;
• (Layer 4) Semantic Service Integration, which integrates services and their semantics from different sources within a unified model; and
• (Layer 5) Semantic Interface, which realizes user intentions, whether described in natural language or expressed in some other natural way.

A traditional robot often works within a confined context. If we take the Internet (including IoT) as the context and consider the advances in cloud computing and mobile computing, it is realistic to expect that future robots will be connected to the world of knowledge, so as to interact with humans intelligently as collaborative partners to solve general as well as domain specific problems. We call such intelligence robotic intelligence.

It is for this reason that a marriage between Semantic Computing and Robotic Intelligence is natural. Accordingly, we name this encyclopedia the Encyclopedia with Semantic Computing and Robotic Intelligence (ESCRI). We expect that the marriage of the two, together with other enabling technologies (such as machine vision, big data, cloud computing, mobile computing, VR, IoT, etc.), could synergize more significant contributions to achieve the ultimate goal of Artificial Intelligence.

We believe that while the current society is information centric, the future will be knowledge centric. The challenge is how to structure, deliver, share and make use of the large amounts of knowledge effectively and productively. The concept of this encyclopedia may constantly evolve to meet the challenge. Incorporating the recent technological advances in semantic computing and robotic intelligence, the encyclopedia has a long term goal of building a new model of reference that delivers quality knowledge to both professionals and laymen alike.

I would like to thank members of the editorial board for their kind support, and thank Dr. K. K. Phua, Dr. Yan Ng, Ms. Catherine Ong and Mr. Mun Kit Chew of World Scientific Publishing for helping me to launch this important journal.

Phillip Sheu
Editor-in-Chief
University of California, Irvine
CONTENTS
Foreword v
Part 1: Understanding Semantics 1
Open information extraction 3 D.-T. Vo and E. Bagheri
Methods and resources for computing semantic relatedness 9 Y. Feng and E. Bagheri
Semantic summarization of web news 15 F. Amato, V. Moscato, A. Picariello, G. Sperlì, A. D'Acierno and A. Penta
Event identification in social networks 21 F. Zarrinkalam and E. Bagheri
Community detection in social networks 29 H. Fani and E. Bagheri
High-level surveillance event detection 37 F. Persia and D. D'Auria
Part 2: Data Science 43
Selected topics in statistical computing 45 S. B. Chatla, C.-H. Chen and G. Shmueli
Bayesian networks: Theory, applications and sensitivity issues 63 R. S. Kenett
GLiM: Generalized linear models 77 J. R. Barr and S. Zacks
OLAP and machine learning 85 J. Jin
Survival analysis via Cox proportional hazards additive models 91 L. Bai and D. Gillen
Deep learning 103 X. Hao and G. Zhang
Two-stage and sequential sampling for estimation and testing with prescribed precision 111 S. Zacks
Business process mining 117 A. Pourmasoumi and E. Bagheri
The information quality framework for evaluating data science programs 125 S. Y. Coleman and R. S. Kenett
Part 3: Data Integration 139
Enriching semantic search with preference and quality scores 141 M. Missikoff, A. Formica, E. Pourabbas and F. Taglino
Multilingual semantic dictionaries for natural language processing: The case of BabelNet 149 C. D. Bovi and R. Navigli
Model-based documentation 165 F. Farazi, C. Chapman, P. Raju and W. Byrne
Entity linking for tweets 171 P. Basile and A. Caputo
Enabling semantic technologies using multimedia ontology 181 A. M. Rinaldi
Part 4: Applications 189
Semantic software engineering 191 T. Wang, A. Kitazawa and P. Sheu
A multimedia semantic framework for image understanding and retrieval 197 A. Penta
Use of semantics in robotics — improving doctors' performance using a cricothyrotomy simulator 209 D. D'Auria and F. Persia
Semantic localization 215 S. Ma and Q. Liu
Use of semantics in bio-informatics 223 C. C. N. Wang and J. J. P. Tsai
Actionable intelligence and online learning for semantic computing 231 C. Tekin and M. van der Schaar
Subject Index 239
Author Index 241
Open information extraction
Duc-Thuan Vo* and Ebrahim Bagheri†
Laboratory for Systems, Software and Semantics (LS3), Ryerson University, Toronto, ON, Canada
*[email protected]
[email protected]
Open information extraction (Open IE) systems aim to obtain relation tuples through highly scalable extraction that is portable across domains, by identifying a variety of relation phrases and their arguments in arbitrary sentences. The first generation of Open IE learns linear chain models based on unlexicalized features such as Part-of-Speech (POS) or shallow tags to label the intermediate words between pairs of potential arguments for identifying extractable relations. Open IE is currently in its second generation, which is able to extract instances of the most frequently observed relation types, such as Verb, Noun and Prep, Verb and Prep, and Infinitive, with deep linguistic analysis. These systems expose simple yet principled ways in which verbs express relationships in linguistics, such as verb phrase-based extraction or clause-based extraction, and they obtain significantly higher performance than the systems of the first generation. In this paper, we provide an overview of the two Open IE generations, including their strengths, weaknesses and application areas.
Keywords: Open information extraction; natural language processing; verb phrase-based extraction; clause-based extraction.
1. Information Extraction and Open Information Extraction

Information Extraction (IE) is growing as one of the active research areas in artificial intelligence for enabling computers to read and comprehend unstructured textual content.1 IE systems aim to distill semantic relations which present relevant segments of information on entities and the relationships between them from large numbers of textual documents. The main objective of IE is to extract and represent information as a tuple of two entities and a relationship between them. For instance, given the sentence "Barack Obama is the President of the United States", they venture to extract the relation tuple President of (Barack Obama, the United States) automatically. The identified relations can be used for enhancing machine reading by building knowledge bases in Resource Description Framework (RDF) or ontology forms. Most IE systems2-5 focus on extracting tuples from domain-specific corpora and rely on some form of pattern-matching technique. Therefore, the performance of these systems is heavily dependent on considerable domain specific knowledge. Several methods employ advanced pattern matching techniques in order to extract relation tuples from knowledge bases by learning patterns based on labeled training examples that serve as initial seeds.

Many of the current IE systems are limited in terms of scalability and portability across domains, while in most corpora like news, blogs, email and encyclopedias, the extractors need to be able to extract relation tuples from across different domains. Therefore, there has been a move towards next generation IE systems that can be highly scalable on large Web corpora. Etzioni et al.1 have introduced one of the pioneering Open IE systems called TextRunner.6 This system tackles an unbounded number of relations, eschews domain-specific training data, and scales linearly. This system does not presuppose a predefined set of relations and is targeted at all relations that can be extracted. Open IE is currently being developed in its second generation in systems such as ReVerb,7 OLLIE,3 and ClausIE,8 which extend from previous Open IE systems such as TextRunner,6 StatSnowBall,9 and WOE.10 Figure 1 summarizes the differences between traditional IE systems and the new IE systems which are called Open IE.1,11

2. First Open IE Generation

In the first generation, Open IE systems aimed at constructing a general model that could express a relation based on unlexicalized features such as POS or shallow tags, e.g., a description of a verb in its surrounding context or the presence of capitalization and punctuation. While traditional IE requires relations to be specified in their input, Open IE systems use their relation-independent model as self-training to learn relations and entities in the corpora. TextRunner is one of the first Open IE systems. It applied a Naive Bayes model with POS and Chunking features that trained tuples using examples heuristically generated from the Penn Treebank. Subsequent work showed that a linear-chain Conditional Random Field (CRF)1,6 or Markov Logic Network9 can be used for identifying extractable relations. Several Open IE systems have been proposed in the first generation, including TextRunner, WOE, and StatSnowBall, that typically consist of the following three stages: (1) intermediate levels of analysis, (2) learning models, and (3) presentation, which we elaborate in the following:
Example input:
"Bill Gates, Microsoft co-founder, stepped down as CEO in January 2000. Gates was included in the Forbes wealthiest list since 1987 and was the wealthiest from 1995 to 2007... It was announced that IBM would buy Ciao for an undisclosed amount. The CEO, MacLorrance, has occupied the corner office of the Hopkinton company... The company's storage business is also threatened by new, born-on-the-Web cloud providers like Dropbox and Box, and ..."

IE output: Co-founder(Bill Gates, Microsoft); Director-of(MacLorrance, Ciao); Employee-of(MacLorrance, Ciao)
Open IE output: (Bill Gates, be, Microsoft co-founder); (Bill Gates, stepped down as, CEO); (Bill Gates, was included in, the Forbes wealthiest list); (Bill Gates, was, the wealthiest); (IBM, would buy, Ciao); (MacLorrance, has occupied, the corner office of the Hopkinton); ...

IE: input = sentences + labeled relations; relations specified in advance; relation-specific extractor.
Open IE: input = sentences only; relations discovered freely; relation-independent extractor.

Fig. 1. IE versus Open IE.
Intermediate levels of analysis
In this stage, Natural Language Processing (NLP) techniques such as named entity recognition (NER), POS tagging and phrase-chunking are used. The sequence of words is taken as input and each word in the sequence is labeled with its part of speech, e.g., noun, verb or adjective, by a POS tagger. A set of nonoverlapping phrases in the sentence is derived based on the POS tags by a phrase chunker. Named entities in the sentence are located and categorized by NER. Some systems, such as TextRunner, WOE and KNext,8 work directly with the output of the syntactic and dependency parsers, as shown in Fig. 2. They define a method to identify useful proposition components of the parse trees. As a result, a parser will return a parse tree including the POS of each word, the presence of phrases, grammatical structures and semantic roles for the input sentence. The structure and annotation will be essential for determining the relationship between entities for the learning models of the next stage.

Learning models
An Open IE system would learn a general model that depicts how a relation could be expressed in a particular language. A linear-chain model such as a CRF can then be applied to a sequence which is labeled with POS tags, word segments, semantic roles, named entities, and traditional forms of relation extraction from the first stage. The system will train a learning model given a set of input observations to maximize the conditional probability of a finite set of labels. TextRunner and WOEpos use CRFs to learn whether sequences of tokens are part of a relation. When identifying entities, the system determines a maximum number of words and their surrounding pair of entities which could be considered as possible evidence of a relation. Figure 3 shows the entity pair "Albert Einstein" and "the Nobel Prize" with the relationship "was awarded" serving to anchor the entities. On the other hand, WOEparse learns relations generated from corePaths, a form of shortest path where a relation could exist, by computing the normalized logarithmic frequency as the probability that a relation could be found. For instance, the shortest path "Albert Einstein" nsubjpass "was awarded" dobj "the Nobel Prize" presents the relationship between "Albert Einstein" and "the Nobel Prize", which could be learned from the pattern "E1" nsubjpass "V" dobj "E2" in the training data.
Fig. 2. POS, NER and DP analysis of the sentence "Albert Einstein was awarded the Nobel Prize for Physics in 1921".
Fig. 3. A CRF is used to identify the relationship "was awarded" between "Albert Einstein" and "the Nobel Prize".
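These analyses can be reproduced with any off-the-shelf NLP pipeline. The following minimal sketch uses spaCy (an illustrative choice here; none of the systems above is tied to it) to produce the POS tags, dependency labels and named entities for the sentence of Fig. 2:

```python
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Albert Einstein was awarded the Nobel Prize for Physics in 1921.")

for token in doc:
    # POS tag and dependency label for each word, e.g. awarded/VERB/ROOT
    print(f"{token.text}\t{token.pos_}\t{token.dep_}")

for ent in doc.ents:
    # named entities, e.g. "Albert Einstein" -> PERSON, "1921" -> DATE
    print(ent.text, ent.label_)
```

The nsubjpass and dobj labels it prints are exactly the kind of dependencies from which corePath-style patterns are formed.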
Presentation
In this stage, Open IE systems provide a presentation of the extracted relation triples. The sentences of the input will be presented in the form of instances of a set of relations after being labeled by the learning models. TextRunner and WOE take sentences in a corpus and quickly extract the textual triples that are present in each sentence. The relation triples contain three textual components, where the first and third denote the pair of entity arguments and the second denotes the relationship between them, as (Arg1, Rel, Arg2). Figure 4 shows the differences in presentation between traditional IE and Open IE.

Sentence: "Apple Inc. is headquartered in California"
Traditional IE: Headquarters(Apple Inc., California)
Open IE: (Apple Inc., is headquartered in, California)
Fig. 4. Traditional IE and Open IE extractions.

Additionally, with large scale and heterogeneous corpora such as the Web, Open IE systems also need to address the disambiguation of entities; e.g., the same entity may be referred to by a variety of names (Obama, Barack Obama or B. H. Obama), or the same string (Michael) may refer to different entities. Open IE systems try to compute the probability that two strings denote synonymous pairs of entities based on a highly scalable and unsupervised analysis of tuples. TextRunner applies the Resolver system,12 while WOE uses the infoboxes from Wikipedia for classifying entities in the relation triples.

2.1. Advantages and disadvantages
Open IE systems need to be highly scalable and perform extractions on huge Web corpora such as news, blogs, emails, and encyclopedias. TextRunner was tested on a collection of over 120 million Web pages and extracted over 500 million triples. This system was also run, in a collaboration with Google, over one billion public Web pages with noticeable precision and recall on this large-scale corpus.

First generation Open IE systems can suffer from problems such as extracting incoherent and uninformative relations. Incoherent extractions are circumstances when the system extracts relation phrases that present a meaningless interpretation of the content.6,13 For example, TextRunner and WOE would extract a triple such as (Peter, thought, his career as a scientist) from the sentence "Peter thought that John began his career as a scientist", which is clearly incoherent because "Peter" cannot be taken as the first argument of the relation "began" with the second argument "his career as a scientist". The second problem, uninformative extractions, occurs when Open IE systems miss critical information of a relation. Uninformative extraction is a type of error relating to light verb constructions,14,23 due to multi-word predicates being composed of a verb and an additional noun. For example, given the sentence "Al-Qaeda claimed responsibility for the 9/11 attacks", Open IE systems such as TextRunner return the uninformative relation (Al-Qaeda, claimed, responsibility) instead of (Al-Qaeda, claimed responsibility for, the 9/11 attacks).

3. Second Open IE Generation

In the second generation, Open IE systems focus on addressing the problem of incoherent and uninformative relations. In some cases, TextRunner and WOE do not extract the full relation between two noun phrases, and only extract a portion of the relation, which is ambiguous. For instance, where the system should extract the relation "is author of", it only extracts "is" as the relation in the sentence "William Shakespeare is author of Romeo and Juliet". Similar to first generation systems, Open IE systems in the second generation have also applied NLP techniques in the intermediate level analysis of the input, and the output is processed in a similar vein to the first generation. They take a sentence as input, perform POS tagging, syntactic chunking and dependency parsing, and then return a set of relation triples. However, in the intermediate level analysis process, Open IE systems in the second generation focus deeply on a thorough linguistic analysis of sentences. They expose simple yet principled ways in which verbs express relationships in linguistics. Based on these linguistic relations, they obtain a significantly higher performance than previous systems in the first generation.

Several Open IE systems have been proposed after TextRunner and WOE, including ReVerb, OLLIE, Christensen et al.,15 ClausIE, and Vo and Bagheri,16 with two extraction paradigms, namely verb phrase-based relation extraction and clause-based relation extraction.

3.1. Verb phrase-based relation extraction
ReVerb is one of the first systems that extracts verb phrase-based relations. This system builds a set of syntactic and lexical constraints to identify relations based on verb phrases, then finds a pair of arguments for each identified relation phrase. ReVerb extracts relations by giving first priority to verbs; the system then extracts all arguments around the verb phrases, which helps the system avoid common errors, such as incoherent or uninformative extractions, made by previous systems in the first generation. ReVerb considers three grammatical structures mediated by verbs for identifying extractable relations. In each sentence, if a phrase matches one of the three grammatical structures, it will be considered as a relation. Figure 5 depicts the three grammatical structures in ReVerb.
V | VP | VW*P
V = verb particle? adv?
W = (noun | adj | adv | pron | det)
P = (prep | particle | inf. marker)
Fig. 5. Three grammatical structures in ReVerb.7

Given the sentence "Albert Einstein was awarded the Nobel Prize.", for each verb V (awarded) in sentence S, ReVerb will find the longest sequence of words (V | VP | VW*P) such that (1) it starts with V, (2) it satisfies the syntactic constraint, and (3) it satisfies the lexical constraint. As a result, (V | VP | VW*P) identifies "was awarded" as a relation. For each identified relation phrase R, it will find the nearest noun phrase X to the left of R, which is "Albert Einstein" in this case. Then it will find the nearest noun phrase Y to the right of R, which is "the Nobel Prize" in S.
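As an illustration only (not ReVerb's actual code), the pattern of Fig. 5 can be approximated by collapsing Penn Treebank POS tags into coarse V/W/P classes and running a regular expression over them; the hand-tagged tokens below stand in for a real POS tagger:

```python
import re

# Hand-tagged sentence (Penn Treebank POS tags); in practice these would
# come from a POS tagger.
tagged = [("Albert", "NNP"), ("Einstein", "NNP"), ("was", "VBD"),
          ("awarded", "VBN"), ("the", "DT"), ("Nobel", "NNP"),
          ("Prize", "NNP"), (".", ".")]

# Collapse tags into the coarse classes of the ReVerb pattern:
# V = verb, W = noun/adj/adv/pronoun/determiner, P = prep/particle/inf. marker.
def coarse(tag):
    if tag.startswith("VB"):
        return "V"
    if tag.startswith(("NN", "JJ", "RB", "PRP", "DT")):
        return "W"
    if tag in ("IN", "TO", "RP"):
        return "P"
    return "O"  # anything else

classes = "".join(coarse(t) for _, t in tagged)

# The relation-phrase pattern V | VP | VW*P, matched greedily.
pattern = re.compile(r"V+(?:W*P+)?")

for m in pattern.finditer(classes):
    phrase = " ".join(w for w, _ in tagged[m.start():m.end()])
    print("relation phrase:", phrase)  # -> "was awarded"
```

Here the auxiliary and the participle are both collapsed to V, a simplification of ReVerb's "verb particle? adv?" definition.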
Some limitations in ReVerb prevent the system from extracting all of the available information in a sentence; e.g., the system could not extract the relation between "Bill Gates" and "Microsoft" in the sentence "Microsoft co-founder Bill Gates spoke at ..." shown in Fig. 6. This is due to the fact that ReVerb ignores the context of the relation by only considering verbs, which could lead to false and/or incomplete relations. Mausam et al.3 have presented OLLIE, an extended ReVerb system, which stands for Open Language Learning for IE. OLLIE performs deep analysis on the identified verb-phrase relation, and then the system extracts all relations mediated by verbs, nouns, adjectives, and others. For instance, in Fig. 6 ReVerb only detects the verb phrase to identify the relation. However, OLLIE analyzes not only the verbs but also the nouns and adverbs that the system could determine. As in the earlier sentence, the relation ("Bill Gates", "co-founder of", "Microsoft") is extracted by OLLIE but will not be extracted by ReVerb.

OLLIE has addressed the problem in ReVerb by adding two new elements, namely "AttributedTo" and "ClausalModifier", to relation tuples when extracting all relations mediated by nouns, adjectives, and others. "AttributedTo" is used for determining additional information and "ClausalModifier" is used for adding conditional information, as seen in sentences 2 and 3 in Fig. 6. OLLIE produces high yield by extracting relations not only mediated by verbs but also mediated by nouns and adjectives. OLLIE follows ReVerb to identify potential relations based on verb-mediated relations. The system applies bootstrapping to learn other relation patterns using their similarity to relations found by ReVerb. In each pattern, the system uses a dependency path to connect a relation and its corresponding arguments for extracting relations mediated by nouns, adjectives and others. After identifying the general patterns, the system applies them to the corpus to obtain new tuples. Therefore, OLLIE extracts a higher number of relations from the same sentence compared to ReVerb.

1. "Microsoft co-founder Bill Gates spoke at ..."
   OLLIE: ("Bill Gates", "be co-founder of", "Microsoft")
2. "Early astronomers believed that the earth is the center of the universe."
   ReVerb: ("the earth", "be the center of", "the universe")
   OLLIE: ("the earth", "be the center of", "the universe") AttributedTo believe; Early astronomers
3. "If he wins five key states, Romney will be elected President."
   ReVerb: ("Romney", "will be elected", "President")
   OLLIE: ("Romney", "will be elected", "President") ClausalModifier if; he wins five key states
Fig. 6. ReVerb extraction versus OLLIE extraction.3

3.2. Clause-based relation extraction
A more recent Open IE system named ClausIE, presented by Corro and Gemulla,8 uses clause structures to extract relations and their arguments from natural language text. Different from verb phrase-based relation extraction, this work applies clause types in sentences to separate useful pieces. ClausIE uses dependency parsing and a set of rules over domain-independent lexica to detect clauses without any requirement for training data. ClausIE exploits the grammatical clause structure of the English language for detecting clauses and all of their constituents in a sentence. As a result, ClausIE obtains high-precision extraction of relations and it can also be flexibly customized to adapt to the underlying application domain. Another Open IE system, presented by Vo and Bagheri,16 uses a clause-based approach inspired by the work presented in Corro and Gemulla8 for Open IE. This work proposes a reformulation of the parse trees that helps the system identify discrete relations that are not found by ClausIE, and reduces the number of erroneous relation extractions; e.g., ClausIE incorrectly identifies "there" as a subject of a relation in the sentence "In today's meeting, there were four CEOs", which is avoided in the work by Vo and Bagheri.

Particularly, in these systems a clause can consist of different components such as a subject (S), a verb (V), an indirect object (O), a direct object (O), a complement (C), and/or one or more adverbials (A). As illustrated in Table 1, a clause can be categorized into different types based on its constituent components. Both of these systems obtain and exploit clauses for relation extraction in the following manner:

Step 1. Determining the set of clauses. This step seeks to identify the clauses in the input sentence by obtaining the head words of all the constituents of every clause. The mapping of syntactic and dependency parsing is utilized to identify the various clause constituents. Subsequently, a clause is constructed for every subject dependency, dependent constituents of the subject, and the governor of the verb.

Step 2. Identifying clause types. When a clause is obtained, it needs to be associated with one of the main clause types as shown in Table 1.
Table 1. Clause types.8,17
Clause type | Sentence | Derived clauses (pattern and proposition)

SV    Albert Einstein died in Princeton in 1955.
      SV (Albert Einstein, died)
      SVA (Albert Einstein, died in, Princeton)
      SVA (Albert Einstein, died in, 1955)
      SVAA (Albert Einstein, died in, 1955, [in] Princeton)
SVA   Albert Einstein remained in Princeton until his death.
      SVA (Albert Einstein, remained in, Princeton)
      SVAA (Albert Einstein, remained in, Princeton, until his death)
SVC   Albert Einstein is a scientist of the 20th century.
      SVC (Albert Einstein, is, a scientist)
      SVCA (Albert Einstein, is, a scientist, of the 20th century)
SVO   Albert Einstein has won the Nobel Prize in 1921.
      SVO (Albert Einstein, has won, the Nobel Prize)
      SVOA (Albert Einstein, has won, the Nobel Prize, in 1921)
SVOO  RSAS gave Albert Einstein the Nobel Prize.
      SVOO (RSAS, gave, Albert Einstein, the Nobel Prize)
SVOA  The doorman showed Albert Einstein to his office.
      SVOA (The doorman, showed, Albert Einstein, to his office)
SVOC  Albert Einstein declared the meeting open.
      SVOC (Albert Einstein, declared, the meeting, open)
Notes: S: Subject, V: Verb, A: Adverbial, C: Complement, O: Object.
In lieu of the previous assertions, these systems use a decision tree to identify the different clause types. In this process, the system marks all optional adverbials after the clause types have been identified.

Step 3. Extracting relations. The systems extract relations from a clause based on the patterns of the clause type as illustrated in Table 1. Assuming that a pattern consists of a subject, a relation and one or more arguments, it is reasonable to presume that the most reasonable choice is to generate n-ary propositions that consist of all the constituents of the clause along with some arguments. To generate a proposition as a triple relation (Arg1, Rel, Arg2), it is essential to determine which part of each constituent would be considered as the subject, the relation and the remaining arguments. These systems identify the subject of each clause and then use it to construct the proposition. To accomplish this, they map the subject of the clause to the subject of a proposition relation. This is followed by applying the patterns of the clause types in an effort to generate propositions on this basis. For instance, for the clause type SV in Table 1, the subject presentation "Albert Einstein" of the clause is used to construct the proposition with the following potential patterns: SV, SVA, and SVAA. Dependency parsing is used to forge a connection between the different parts of the pattern. As a final step, n-ary facts are extracted by placing the subject first, followed by the verb or the verb with its constituents. This is followed by the extraction of all the constituents following the verb in the order in which they appear. As a result, these systems link all arguments in the propositions in order to extract triple relations. A toy sketch of this step is shown below.
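The sketch makes several simplifying assumptions: clauses have at most one object, adverbials are kept verbatim rather than folded into the relation phrase (Table 1 merges "died in" into the relation), and the constituents are assumed to have been produced by Steps 1 and 2:

```python
# Pattern lists mirror the "Derived clauses" column of Table 1.
CLAUSE_PATTERNS = {
    "SV":   ["SV", "SVA", "SVAA"],  # adverbials are optional
    "SVA":  ["SVA", "SVAA"],
    "SVC":  ["SVC", "SVCA"],
    "SVO":  ["SVO", "SVOA"],
    "SVOO": ["SVOO"],
    "SVOA": ["SVOA"],
    "SVOC": ["SVOC"],
}

def propositions(clause_type, parts):
    """parts maps 'S','V' (and optionally 'C','O') to strings; 'A' to a list."""
    adverbials = parts.get("A", [])
    for pattern in CLAUSE_PATTERNS[clause_type]:
        n_adv = pattern.count("A")
        if n_adv > len(adverbials):
            continue  # pattern needs more adverbials than the clause has
        prop = [parts["S"], parts["V"]]
        prop += [parts[c] for c in pattern[2:] if c in ("C", "O")]
        prop += adverbials[:n_adv]
        yield tuple(prop)

# The SV example from Table 1: "Albert Einstein died in Princeton in 1955."
clause = {"S": "Albert Einstein", "V": "died", "A": ["in Princeton", "in 1955"]}
for p in propositions("SV", clause):
    print(p)
# ('Albert Einstein', 'died')
# ('Albert Einstein', 'died', 'in Princeton')
# ('Albert Einstein', 'died', 'in Princeton', 'in 1955')
```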
3.3. Advantages and disadvantages
The key differentiating characteristic of these systems is a linguistic analysis that guides the design of the constraints in ReVerb and the feature analysis in OLLIE. These systems address the incoherent and uninformative extractions which occur in the first generation by identifying a more meaningful relation phrase. OLLIE expands the syntactic scope of ReVerb by identifying relations mediated by nouns and adjectives around the verb phrase. Both ReVerb and OLLIE outperform the previous systems in the first Open IE generation. Another approach in the second generation, clause-based relation extraction, uses dependency parsing and a set of rules over domain-independent lexica to detect clauses for extracting relations without training data. These systems exploit the grammatical clauses of the English language to detect clauses and all of their constituents in a sentence. As a result, systems such as ClausIE obtain high-precision extractions and can also be flexibly customized to adapt to the underlying application domain.

In the second Open IE generation, binary extractions have been identified in ReVerb and OLLIE, but not all relationships are binary. Events can have time and location and may take several arguments (e.g., "Albert Einstein was awarded the Nobel Prize for Physics in 1921."). It would be essential to extend Open IE to handle n-ary and even nested extractions.

4. Application Areas

There are several areas where Open IE systems can be applied.
First, the ultimate objectives of Open IE systems are to enable the extraction of knowledge that can be represented in structured form and in a human readable format. The extracted knowledge can then be used to answer questions.16,18 For instance, TextRunner can support user input queries such as "(?, kill, bacteria)" or "(Barack Obama, ?, U.S)", similar to Question Answering systems. By replacing the question mark in the triple, questions such as "what kills bacteria" and "what are the relationships between Barack Obama and U.S" can be formulated and answered.
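As a toy illustration of this query style (hypothetical triples and matcher, not TextRunner's actual query engine):

```python
triples = [
    ("antibiotics", "kill", "bacteria"),
    ("Barack Obama", "is the President of", "the United States"),
]

def answer(query, facts):
    """Return every (Arg1, Rel, Arg2) fact matching a query with '?' slots."""
    return [
        fact for fact in facts
        if all(q == "?" or q == f for q, f in zip(query, fact))
    ]

print(answer(("?", "kill", "bacteria"), triples))
# -> [('antibiotics', 'kill', 'bacteria')]
```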
Second, Open IE could be integrated and applied in many higher-level NLP tasks such as text similarity or text summarization.19–21 Relation tuples from Open IE systems could be used to infer or measure the redundancy between sentences based on the facts extracted from the input corpora.
Finally, Open IE can enable the automated learning and population of an upper level ontology due to its ability in the scalable extraction of information across domains.22 For instance, Open IE systems can enable the learning of a new biomedical ontology by automatically reading and processing published papers in the literature on this topic.

References
1 O. Etzioni, M. Banko, S. Soderland and D. S. Weld, Open information extraction from the web, Commun. ACM 51(12) (2008).
2 R. Bunescu and R. J. Mooney, Subsequence kernels for relation extraction, Proc. NIPS (2005).
3 Mausam, M. Schmitz, R. Bart and S. Soderland, Open language learning for information extraction, Proc. EMNLP (2012).
4 D. T. Vo and E. Bagheri, Relation extraction using clause patterns and self-training, Proc. CICLing (2016).
5 G. Zhou, L. Qian and J. Fan, Tree kernel based semantic relation extraction with rich syntactic and semantic information, Inform. Sci. 180 (2010).
6 M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead and O. Etzioni, Open information extraction from the Web, Proc. IJCAI (2007).
7 A. Fader, S. Soderland and O. Etzioni, Identifying relations for open information extraction, Proc. EMNLP (2011).
8 L. D. Corro and R. Gemulla, ClausIE: Clause-based open information extraction, Proc. WWW (2013).
9 J. Zhu, Z. Nie, X. Liu, B. Zhang and J. R. Wen, StatSnowball: A statistical approach to extracting entity relationships, Proc. WWW (2009).
10 F. Wu and D. S. Weld, Open information extraction using Wikipedia, Proc. ACL (2010).
11 B. V. Durme and L. K. Schubert, Open knowledge extraction using compositional language processing, Proc. STEP (2008).
12 A. Yates and O. Etzioni, Unsupervised resolution of objects and relations on the web, Proc. HLT-NAACL (2007).
13 O. Etzioni, A. Fader, J. Christensen, S. Soderland and Mausam, Open information extraction: The second generation, Proc. IJCAI (2011).
14 S. Stevenson, A. Fazly and R. North, Statistical measures of the semi-productivity of light verb constructions, 2nd ACL Workshop on Multiword Expressions (2004), pp. 1–8.
15 J. Christensen, Mausam, S. Soderland and O. Etzioni, An analysis of open information extraction based on semantic role labeling, Proc. K-CAP (2011).
16 D. T. Vo and E. Bagheri, Clause-based open information extraction with grammatical structure reformation, Proc. CICLing (2016).
17 R. Quirk, S. Greenbaum, G. Leech and J. Svartvik, A Comprehensive Grammar of the English Language (Longman, 1985).
18 S. Reddy, M. Lapata and M. Steedman, Large-scale semantic parsing without question-answer pairs, Trans. Assoc. Comput. Linguist. 2, 377 (2014).
19 J. Christensen, Mausam, S. Soderland and O. Etzioni, Towards coherent multi-document summarization, Proc. NAACL (2013).
20 J. Christensen, S. Soderland, G. Bansal and Mausam, Hierarchical summarization: Scaling up multi-document summarization, Proc. ACL (2014).
21 O. Levy and Y. Goldberg, Dependency-based word embeddings, Proc. ACL (2014).
22 S. Soderland, J. Gilmer, R. Bart, O. Etzioni and D. S. Weld, Open information extraction to KBP relations in 3 hours, Proc. KBP (2013).
23 D. J. Allerton, Stretched Verb Constructions in English, Routledge Studies in Germanic Linguistics (Routledge, Taylor and Francis, New York, 2002).
24 H. Bast and E. Haussmann, More informative open information extraction via simple inference, Proc. ECIR (2014).
25 P. Pantel and M. Pennacchiotti, Espresso: Leveraging generic patterns for automatically harvesting semantic relations, Proc. COLING (2006).
26 G. Stanovsky, I. Dagan and Mausam, Open IE as an intermediate structure for semantic tasks, Proc. ACL (2015).
27 X. Yao and B. V. Durme, Information extraction over structured data: Question answering with Freebase, Proc. ACL (2014).
Methods and resources for computing semantic relatedness
Yue Feng* and Ebrahim Bagheri†
Ryerson University, Toronto, Ontario, Canada
*[email protected]
[email protected]
Semantic relatedness (SR) is defined as a measurement that quantitatively identifies some form of lexical or functional association between two words or concepts based on the contextual or semantic similarity of those two words, regardless of their syntactical differences. Section 1 of this entry outlines the working definition of SR and its applications and challenges. Section 2 identifies the knowledge resources that are popular among SR methods. Section 3 reviews the primary measurements used to calculate SR. Section 4 reviews the evaluation methodology, which includes gold standard datasets and methods. Finally, Sec. 5 introduces further reading. In order to develop appropriate SR methods, there are three key aspects that need to be examined: (1) the knowledge resources that are used as the source for extracting SR; (2) the methods that are used to quantify SR based on the adopted knowledge resource; and (3) the datasets and methods that are used for evaluating SR techniques. The first aspect involves the selection of knowledge bases such as WordNet or Wikipedia. Each knowledge base has its merits and downsides, which can directly affect the accuracy and the coverage of the SR method. The second aspect relies on different methods for utilizing the selected knowledge resources, for example, methods that depend on the path between two words, or on a vector representation of the word. As for the third aspect, the evaluation of SR methods consists of two parts, namely (1) the datasets that are used and (2) the various performance measurement methods. SR measures are increasingly applied in information retrieval to provide semantics between queries and documents and to reveal relatedness between non-syntactically-related content. Researchers have already applied many different information and knowledge sources in order to compute SR between two words. Empirical research has already shown that the results of many of these SR techniques have reasonable correlation with human subjects' interpretation of relatedness between two words.
Keywords: Semantic relatedness; information retrieval; similarity; natural language processing.
1. Overview of Semantic Relatedness

It is effortless for humans to determine the relatedness between two words based on the past experience that humans have in using and encountering related words in similar contexts. For example, as human beings, we know car and drive are highly related, while there is little connection between car and notebook. While the process of deciding semantic relatedness (SR) between two words is straightforward for humans, it is often challenging for machines to make a decision without having access to contextual knowledge surrounding each word. Formally, SR is defined as some form of lexical or functional association between two words rather than just lexical relations such as synonymy and hyponymy.1

1.1. Applications
Semantic relatedness is widely used in many practical applications, especially in natural language processing (NLP), such as word sense disambiguation,2 information retrieval,3 spelling correction1 and document summarization, where it is used to quantify the relations between words or between words and documents.4 SR is extremely useful in information retrieval techniques in terms of the retrieval process, where it allows for the identification of semantically-related but lexically-dissimilar content.1 Other more specialized domains such as biomedical informatics and geoinformatics have also taken advantage of SR techniques to measure the relationships between bioentities5 and geographic concepts,6 respectively.

1.2. Challenges
Developing SR methods is a formidable task which requires solutions for various challenges. Two primary challenges are encountered with the underlying knowledge resources and the formalization of the relatedness measures, respectively.

(1) Knowledge resources challenges: Knowledge resources provide descriptions for each word and its relations. Knowledge resources can be structured or unstructured, linguistically constructed by human subjects or collaboratively constructed through encyclopedias, or web-based.
It is challenging to clean and process the large set of knowledge resources and represent each word with its extracted descriptions, which requires considerable computation power.

(2) Formalization challenges: Designing algorithms to compute SR between words is also challenging, since efficiency and accuracy are two important factors to be considered.

2. Knowledge Resources

In the world of SR techniques, the term knowledge resources refers to the source of information from which the descriptions and relations of words are generated. Five knowledge resources that are popularly adopted in the literature are introduced below.

2.1. WordNet
WordNet is an English lexical database which is systematically developed by expert linguists. It is considered the most reliable knowledge resource due to the reason that it has been curated through a well-reviewed and controlled process. WordNet provides descriptions for English words and expresses the various meanings of a word, which is polysemous according to different contexts. Expert linguists defined relations and synsets in WordNet, which are two of its main parts, where the relations express the relations between two or more words, such as hypernymy, antonymy and hyponymy, and synsets are sets of synonymous words. Moreover, a short piece of text called a gloss is attached to describe the members of each synset.

WordNet has been widely applied in research for computing the degree of SR. For example, Rada et al.7 constructed a word graph whose nodes are WordNet synsets and whose edges are the associated relations. Then SR is represented as the shortest path between two nodes. Glosses defined in WordNet have also been explored to compute SR. For instance, Lesk8 introduced his method in 1986, which counts the word overlap between two glosses, where a higher count of overlap indicates higher SR between the two words.

A German version of WordNet has also been constructed, named GermaNet. GermaNet shares all the features of WordNet except that it does not include glosses; therefore, approaches based on glosses are not directly applicable to GermaNet. However, Gurevych9 has proposed an approach to solve the problem by generating pseudo-glosses for a target word, where the pseudo-glosses are the set of words that are in close relations to the target word in the relationship hierarchy.

2.2. Wikipedia
Wikipedia provides peer-review and content moderation processes to ensure reliable information. The information in Wikipedia is presented as a collection of articles where each article is focused on one specific concept. Besides articles, Wikipedia contains hyperlinks between articles, categories and disambiguation pages.

Some researchers have benefited from the textual content of Wikipedia articles. For example, a widely-used SR technique called explicit semantic analysis (ESA)10 treats a target word as a concept and uses its corresponding Wikipedia article as the knowledge resource to describe the target word; therefore, each word is represented as a vector of words from the associated Wikipedia article, and the weights are the TF-IDF values of the words. Then the cosine similarity method is applied on the two vectors for the two words, respectively, to calculate SR. Besides exploring the article contents, hyperlinks between Wikipedia articles can also be used to establish relationships between two words. Milne and Witten11 and Milne12 represented each word as a weighted vector of links obtained through the number of links on the corresponding Wikipedia article and the probability of the links' occurrences. In their work, they have shown that processing only links on Wikipedia is more efficient and can achieve comparable results with ESA. The Wikipedia category system has also been exploited for the task of SR. For instance, WikiRelate13 expressed the idea that the SR between two words is dependent on the relatedness of their categories; therefore, they represented each word with its related category.

2.3. Wiktionary
Wiktionary is designed as a lexical companion to Wikipedia, which is a multilingual, Web-based dictionary. Similar to WordNet, Wiktionary includes words, lexical relations between words, and glosses. Researchers have taken advantage of the large number of words in Wiktionary to create high dimensional concept vectors. For example, Zesch et al.14 constructed a concept vector for each word where the value of each term is the TF-IDF score in the corresponding Wiktionary entry. Then the SR is calculated based on the cosine similarity of the two concept vectors. Also, given the fact that Wiktionary consists of lexical-semantic relations embedded in the structure of each Wiktionary entry, researchers have also considered Wiktionary as a knowledge resource for computing SR. For instance, Krizhanovsky and Lin15 built a graph from Wiktionary where the nodes are the words and the edges are the lexical-semantic relations between pairs of words. Then they applied a path-based method on the graph to find the SR between words. Similar to WordNet, the glosses provided by Wiktionary have also been explored. Meyer and Gurevych16 performed a sense disambiguation process based on word overlaps between glosses.

2.4. Web search engines
Given that Web search engines provide access to over 45 billion web pages on the World Wide Web, their results have been used as a knowledge source for SR. For a given search query,
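The vector-space step shared by the ESA- and Wiktionary-based methods above can be sketched in a few lines; the one-line "entries" below are hypothetical stand-ins for full Wikipedia articles or Wiktionary entries:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

entry_car = "a car is a wheeled motor vehicle used for transport on roads"
entry_drive = "to drive is to operate and travel in a motor vehicle on roads"
entry_notebook = "a notebook is a book of blank pages for writing notes"

# Each word is described by the TF-IDF vector of its entry.
vectors = TfidfVectorizer().fit_transform([entry_car, entry_drive, entry_notebook])

print(cosine_similarity(vectors[0], vectors[1]))  # car vs drive: higher
print(cosine_similarity(vectors[0], vectors[2]))  # car vs notebook: lower
```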
search engines will return a collection of useful information, including rich snippets, which are short pieces of text each containing a set of terms describing the result page, Web page URIs, user-specified metadata and descriptive page titles. Works based on search engine snippets include the method of Spanakis et al.,17 in which they extracted lexico-syntactic patterns from snippets with the assumption that related words should have similar patterns. Duan and Zeng18 computed the SR based on the co-occurrences of the two words and the occurrences of each word in the snippets returned by the search engine. Also, there are some works that rely on the content of the retrieved pages. For example, Sahami and Heilman19 enhanced the snippets by including the top-k words with the highest TF-IDF values from each of the returned pages to represent a target word.

2.5. Semantic web
Some researchers have exploited the Semantic Web and the Web of Data. The data on the Web of Data is structured so that it can be interlinked. Also, the collection of Semantic Web technologies such as RDF and OWL, among others, allows for running queries. REWOrD20 is one of the earlier works in this area. In this work, each target word is represented as a vector where each element is generated from RDF predicates and their informativeness scores. The predicates are obtained from DBpedia triples where they correspond to each word, and the informativeness scores are computed based on predicate frequency and inverse triple frequency. After that, the cosine similarity method is applied on the vectors to generate the SR between two words. The semantic relations defined by the Web Ontology Language (OWL) have also been explored. For example, in Karanastasi and Christodoulakis's model,21 three facts, which are (1) the number of common properties and inverseOf properties that the two concepts share; (2) the path distance between the two concepts' common subsumer; and (3) the count of the common nouns and synonyms from the concepts' descriptions, are combined to compute SR.

3. Semantic Relatedness Methods

Many SR methods have been developed by manipulating the information extracted from the selected knowledge resources. Some methods use the relationships between each word from the knowledge resource to create a graph and apply these relations to indicate SR, while other methods directly use the content provided by the knowledge resource to represent each concept as a vector and apply vector similarity methods to compute the SR. Moreover, there have been works on temporal modeling for building SR techniques.

3.1. Resnik
Resnik22 proposed his model in 1995. The idea is that the more information two words share, the higher their SR will be. Therefore, the IS-A hierarchy is adopted to find the lowest common subsumer of two words in a taxonomy; then the information content value is calculated as the SR score.
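In symbols, Resnik's measure (a standard formulation of the idea described above) scores two words by the information content of their lowest common subsumer in the IS-A hierarchy:

$$\mathrm{IC}(c) = -\log p(c), \qquad \mathrm{sim}(w_1, w_2) = \mathrm{IC}\big(\mathrm{lcs}(w_1, w_2)\big),$$

where $p(c)$ is the probability of encountering an instance of concept $c$ in a reference corpus and $\mathrm{lcs}(w_1, w_2)$ is the lowest common subsumer of the two words.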
3.2. WikiRelate!
Strube and Ponzetto13 created a graph based on the information extracted from Wikipedia, where the nodes are Wikipedia articles and the edges are the links between the articles. Then the shortest path is selected between the two words, which are Wikipedia articles, to determine the SR score.

3.3. Hughes and Ramage
Hughes and Ramage23 construct a graph from WordNet where the nodes are Synsets, TokenPOS and Tokens, and the edges are the relations defined in WordNet between these nodes. The conditional probability from one node to another is calculated beforehand; then the authors apply a Random Walk algorithm on the graph to create a stationary distribution for each target word by starting the walk on the target word's node. Finally, SR is computed by comparing the similarity between the stationary distributions obtained for the two words.

3.4. ESA
Gabrilovich and Markovitch10 proposed the ESA technique in 2007, considering Wikipedia as its knowledge resource. In their approach, a semantic mapper is built to represent a target word as a vector of Wikipedia concepts, where the weights are the TF-IDF values of the words in the underlying articles. Then the SR is computed by calculating the similarity between the two vectors representing the two words.

3.5. Lesk
Lesk8 takes advantage of the glosses defined for each word in WordNet. Specifically, SR is determined by counting the number of words that overlap between the two glosses obtained for the two words. The higher the count of overlap, the more related the two words are.
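A minimal sketch of this gloss-overlap count, with hypothetical one-line glosses in place of real WordNet glosses and a toy stopword list:

```python
def gloss_overlap(gloss1, gloss2):
    """Count shared (non-trivial) words between two glosses."""
    stopwords = {"a", "an", "the", "of", "in", "and", "or", "to", "is"}
    words1 = set(gloss1.lower().split()) - stopwords
    words2 = set(gloss2.lower().split()) - stopwords
    return len(words1 & words2)

# Hypothetical glosses, for illustration only.
gloss_car = "a motor vehicle with four wheels used to carry people on roads"
gloss_drive = "operate and control the direction and speed of a motor vehicle"

print(gloss_overlap(gloss_car, gloss_drive))  # shared: motor, vehicle -> 2
```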
3.6. Sahami and Heilman
Sahami and Heilman19 benefit from the results returned by a Web search engine. By querying the target word, they enrich the short snippets by including the top words ranked based on the TF-IDF values from each returned page. Then the vector is used to compute the degree of SR between two words.
3.7. WLM
Milne11 intends to reduce the computation cost of the ESA approach; therefore, a more efficient model is built by considering the links found within the corresponding Wikipedia articles, where the basic assumption is that the more links two articles share, the more related they are. So a word is represented as a vector of links. Finally, SR is computed by comparing the similarity between the link vectors.

3.8. TSA
Radinsky et al.24 propose a temporal semantic analysis method based on the idea that a great deal of information can be revealed by studying the similarity of word usage patterns over time. Therefore, in their model, a word is represented as a weighted vector of concept time series obtained from a historical archive such as the NY Times archive. Then SR is found by comparing the similarity between the two time series.

4. Evaluation

In order to evaluate a SR method, researchers have adopted various gold standard datasets and strategies for comparative analysis. In this section, we introduce the common datasets and metrics researchers have used.

4.1. Datasets
The gold standard datasets are often constructed by collecting the subjective opinions of humans in terms of the SR between words. The main purpose of creating a SR dataset is to assign a degree of SR between a set of word pairs so they can be used as a gold standard benchmark for evaluating different SR methods. The datasets that have been used and cited in the literature are mainly in the English and German languages. Below are four popular English datasets.

4.1.1. RG-65
The Rubenstein–Goodenough (RG-65) dataset25 was created by collecting human judgments from 51 subjects; the similarity between each word pair is equal to the average of the scores given by the subjects. The RG-65 dataset includes 65 noun pairs, and the similarity of each word pair is scored on a scale from 0 to 4, where a higher score indicates higher similarity. The RG-65 dataset has been used as a gold standard in much research, such as Strube and Ponzetto.13

4.1.2. MC-30
Miller–Charles (MC-30)26 is a subset of the original RG-65 dataset that contains 30 noun pairs. The MC-30 dataset was additionally verified and evaluated by another 38 subjects, and it is widely adopted in many works such as in Refs. 11 and 17.

4.1.3. Fin-353
Finkelstein et al.27 introduced a dataset that contains 353 word pairs, where 30 word pairs are obtained from the MC-30 dataset. The dataset is divided into two parts, where the first part contains 153 word pairs judged by 13 subjects and the second part contains 200 word pairs judged by 16 subjects. In some literature, the first set is used for training and the second is used for evaluation. The use of the Fin-353 dataset can be found in Ref. 28 among others.

4.1.4. YP-130
Yang–Powers (YP-130) is a dataset designed especially for evaluating a SR method's ability to assign the relatedness between verbs. The YP-130 contains 130 verb pairs.

There are also some datasets in the German language. For instance, the Gurevych dataset (Gur-65)9 is the German translation of the English RG-65 dataset, and the Gurevych dataset (Gur-30) is a subset of the Gur-65 dataset, which is associated with the English MC-30 dataset. The Gurevych dataset (Gur-350)29 consists of 350 word pairs, which include nouns, verbs and adjectives judged by eight human subjects. The Zesch–Gurevych (ZG-222) dataset29 contains 222 domain-specific word pairs that were evaluated by 21 subjects, which include nouns, verbs and adjectives.

4.2. Methods
There are two typical ways to evaluate a SR method: (1) calculating the degree of correlation with human judgments and (2) measuring performance in application-specific tasks.

4.2.1. Correlation with human judgments
Calculating the correlation between the output of a SR method and the scores obtained from a gold standard dataset is one of the main techniques for evaluating a semantic method. Either the absolute values from a semantic method and the relatedness values from the gold standard are used, or the rankings produced by the relatedness method are compared with the rankings in the gold standard. Comparing the correlation between rankings is more popularly adopted in the literature because it is less sensitive to the actual relatedness values. The Pearson product-moment correlation coefficient31 and Spearman's rank correlation coefficient30 are the two most popular coefficients to calculate the correlation between a SR method and the human judgments.
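For example, rank correlation against a gold standard can be computed with SciPy's spearmanr; the scores below are made-up numbers for illustration:

```python
from scipy.stats import spearmanr

gold = [3.92, 3.84, 0.42, 1.18]    # human judgments for four word pairs
system = [0.81, 0.74, 0.10, 0.35]  # SR scores produced by a method

rho, p_value = spearmanr(gold, system)
print(f"Spearman's rho = {rho:.2f}")  # 1.00 here: the rankings agree exactly
```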
4.2.2. Application-specific tasks
Instead of directly comparing the output from a SR method with the gold standard dataset, a SR method can be embedded into an application-specific task, and the performance of the application can be the indicator of the performance of the SR method.
4.2.2. Application-specific tasks

Instead of directly comparing the output from a SR method with the gold standard dataset, a SR method can be embedded into an application-specific task, and the performance of the application can be the indicator of the performance of the SR method. The underlying hypothesis of this evaluation is that the more accurate a SR method is, the better the performance of the application task.

Various application-specific tasks have been used to evaluate SR methods. For instance, Sahami and Heilman19 evaluated their work through the task of search query suggestion; Patwardhan and Pedersen2 used their SR method in a word sense disambiguation application as the target evaluation application; while Gracia and Mena32 deployed their method in the ontology matching task.

References
1. I. A. Budan and H. Graeme, Evaluating wordnet-based measures of semantic distance, Comput. Linguist. 32(1), 13 (2006).
2. S. Patwardhan, S. Banerjee and T. Pedersen, SenseRelate: TargetWord: A generalized framework for word sense disambiguation, Proc. ACL 2005 on Interactive Poster and Demonstration Sessions, Association for Computational Linguistics (2005).
3. L. Finkelstein et al., Placing search in context: The concept revisited, Proc. 10th Int. Conf. World Wide Web, ACM (2001).
4. C. W. Leong and R. Mihalcea, Measuring the semantic relatedness between words and images, Proc. Ninth Int. Conf. Computational Semantics, Association for Computational Linguistics (2011).
5. M. E. Renda et al., Information Technology in Bio- and Medical Informatics.
6. B. Hecht et al., Explanatory semantic relatedness and explicit spatialization for exploratory search, Proc. 35th Int. ACM SIGIR Conf. Research and Development in Information Retrieval, ACM (2012).
7. R. Rada et al., Development and application of a metric on semantic nets, IEEE Trans. Syst. Man Cybern. 19(1), 17 (1989).
8. M. Lesk, Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone, Proc. 5th Ann. Int. Conf. Systems Documentation, ACM (1986).
9. I. Gurevych, Using the structure of a conceptual network in computing semantic relatedness, Natural Language Processing IJCNLP 2005 (Springer Berlin Heidelberg, 2005), pp. 767–778.
10. E. Gabrilovich and S. Markovitch, Computing semantic relatedness using wikipedia-based explicit semantic analysis, IJCAI, Vol. 7 (2007).
11. I. Witten and D. Milne, An effective, low-cost measure of semantic relatedness obtained from Wikipedia links, Proc. AAAI Workshop on Wikipedia and Artificial Intelligence: An Evolving Synergy, AAAI Press, Chicago, USA (2008).
12. D. Milne, Computing semantic relatedness using wikipedia link structure, Proc. New Zealand Computer Science Research Student Conference (2007).
13. M. Strube and S. P. Ponzetto, WikiRelate! Computing semantic relatedness using Wikipedia, AAAI, Vol. 6 (2006).
14. T. Zesch, C. Müller and I. Gurevych, Using wiktionary for computing semantic relatedness, AAAI, Vol. 8.
15. A. A. Krizhanovsky and L. Feiyu, Related terms search based on wordnet/wiktionary and its application in ontology matching, arXiv:0907.2209.
16. C. M. Meyer and I. Gurevych, To exhibit is not to loiter: A multilingual, sense-disambiguated wiktionary for measuring verb similarity, COLING 2012.
17. G. Spanakis, G. Siolas and A. Stafylopatis, A hybrid web-based measure for computing semantic relatedness between words, 21st Int. Conf. Tools with Artificial Intelligence, ICTAI'09, IEEE (2009).
18. J. Duan and J. Zeng, Computing semantic relatedness based on search result analysis, Proc. 2012 IEEE/WIC/ACM Int. Joint Conf. Web Intelligence and Intelligent Agent Technology, Vol. 3, IEEE Computer Society (2012).
19. M. Sahami and T. D. Heilman, A web-based kernel function for measuring the similarity of short text snippets, Proc. 15th Int. Conf. World Wide Web, ACM (2006).
20. G. Pirró, REWOrD: Semantic relatedness in the web of data, AAAI (2012).
21. A. Karanastasi and S. Christodoulakis, The OntoNL semantic relatedness measure for OWL ontologies, 2nd Int. Conf. Digital Information Management, ICDIM'07, Vol. 1, IEEE (2007).
22. P. Resnik, Using information content to evaluate semantic similarity in a taxonomy, arXiv:cmp-lg/9511007.
23. T. Hughes and D. Ramage, Lexical semantic relatedness with random graph walks, EMNLP-CoNLL (2007).
24. K. Radinsky et al., A word at a time: Computing word relatedness using temporal semantic analysis, Proc. 20th Int. Conf. World Wide Web, ACM (2011).
25. H. Rubenstein and J. B. Goodenough, Contextual correlates of synonymy, Commun. ACM 8(10), 627 (1965).
26. G. A. Miller and W. G. Charles, Contextual correlates of semantic similarity, Lang. Cogn. Process. 6(1), 1 (1991).
27. E. Agirre et al., A study on similarity and relatedness using distributional and wordnet-based approaches, Proc. Human Language Technologies: The 2009 Annual Conf. North American Chapter of the Association for Computational Linguistics, Association for Computational Linguistics (2009).
28. C. Fellbaum, WordNet (Blackwell Publishing Ltd, 1998).
29. T. Zesch and I. Gurevych, Automatically creating datasets for measures of semantic relatedness, Proc. Workshop on Linguistic Distances, Association for Computational Linguistics (2006).
30. J. H. Zar, Spearman rank correlation, Encyclopedia of Biostatistics (1998).
31. J. Benesty et al., Pearson correlation coefficient, Noise Reduction in Speech Processing (Springer Berlin Heidelberg, 2009), pp. 1–4.
32. J. Gracia and E. Mena, Web-based measure of semantic relatedness, Web Information Systems Engineering-WISE 2008 (Springer Berlin Heidelberg, 2008), pp. 136–150.
Semantic summarization of web news

Flora Amato*, Vincenzo Moscato†, Antonio Picariello‡ and Giancarlo Sperlí§
Dipartimento di Ingegneria Elettrica e Tecnologie dell'Informazione
University of Naples, Naples, Italy
*[email protected][email protected][email protected] §[email protected]
Antonio D'Acierno ISA - CNR, Avellino, Italy [email protected]
Antonio Penta United Technology Research Center Ireland (UTRC-1) [email protected]
In this paper, we present a general framework for retrieving relevant information from web news that exploits a novel summarization algorithm based on a deep semantic analysis of texts. In particular, we extract from each Web document a set of triples (subject, predicate, object) that are then used to build a summary through an unsupervised clustering algorithm exploiting the notion of semantic similarity. Finally, we leverage the centroids of the clusters to determine the most significant summary sentences using some heuristics. Several experiments, carried out using the standard DUC methodology and the ROUGE software, show how the proposed method outperforms several summarizer systems in terms of recall and readability.

Keywords: Web summarization; information extraction; text mining.
1. Introduction

The exponential growth of the Web has made the search and tracking of information apparently easier and faster, but the huge information overload requires algorithms and tools allowing fast and easy access to the specific desired information. In other words, on the one hand information is available to everybody; on the other hand, the vast quantity of information is itself the reason for the difficulty of retrieving data, creating the well-known problem of discriminating between "useful" and "useless" information. This new situation requires investigating new ways to handle and process information, which has to be delivered in a rather small space, retrieved in a short time, and represented as accurately as possible.

Since about 80% of the data produced in the information generation process is contained in natural-language texts, the search for suitable and efficient summarization techniques has become a hot research topic. Moreover, summarization, especially text summarization from web sources, may be considered as the process of "distilling" the most important information from a variety of logically related sources, such as the ones returned by classic search engines, in order to produce a short, concise and grammatically meaningful version of information spread out over pages and pages of text.

We can consider as an example a typical workflow related to a news reporter who has just been informed of a plane crash in Milan and who would quickly like to gather more details about this event. We suppose that the following textual information on the accident, extracted from some web sites already reporting the news, is publicly available:

"A Rockwell Commander 112 airplane crashed into the upper floors of the Pirelli Tower in Milan, Italy. Police and ambulances are at the scene. The president, Marcello Pera, just moments ago was informed about the incident in Milan, he said at his afternoon press briefing". "It was the second time since the Sept 11 terror attacks on New York and Washington that a plane has struck a high-rise building. Many people were on the streets as they left work for the evening at the time of the crash. Ambulances streamed into the area and pedestrians peered upward at the sky. The clock fell to the floor. The interior minister had informed the senate president, Marcello Pera, that the crash didn't appear to be a terror attack."
The paper is organized as follows. Section 2 provides an overview of the literature on multiple-document summarizer systems. In Sec. 3, we discuss a general model that is useful for solving the problem of automatic summarization, and we then illustrate the instance of the model used for our aims. Section 4 presents the summarization framework developed to test the theoretical model. Finally, in Sec. 5 we provide the experimental evaluation performed to test the effectiveness of the system and its performance.

2. Related Works

Summarization has been studied for a long time in the scientific community: as a matter of fact, automatic document summarization has been actively researched since the original work by Luhn,1 from both the Computational Linguistics and the Information Retrieval points of view.2

Automatic summarizer systems are classified by a number of different criteria.3–5 The first important categorization distinguishes between abstraction- and extraction-based summaries. An extraction-based summary involves the selection of text fragments from the source document, while an abstraction-based summary involves sentence compression and reformulation. Although an abstractive summary could be more concise, it requires deep natural language processing techniques, while extractive summaries are more feasible and have become the de facto standard in document summarization.6 Another key distinction is between summarization of single or multiple documents. The latter case brings about more challenges, as the summarization systems have to take into account different aspects such as the diversity and the similarity of the different sources and the order of the extracted information.

Our research contributions lie in the area of multi-document summarization: thus, in this context, it is important to compare our approach with the existing related work on the basis of: (i) how to define the information value of a text fragment, (ii) how to extract a meaningful part of it and (iii) how to combine different sources.

To extract the most significant part of a text, the majority of the proposed techniques follow the bag-of-words approach, which is based on the observation that words occurring frequently or rarely in a set of documents have to be considered with a higher probability for human summaries.8 In such systems, another important issue is "to determine not only what information can be included in the summary, but also how important this information is";7 thus, the relevance of the information changes depending on what information has already been included in the summary. To reach this goal, several systems and techniques have been proposed in the most recent literature.

Wang et al.9 use a probabilistic model, computing the words' relevance through a regression model based on several word features, such as frequency and dispersion. The words with the highest scores are selected for the final summary. In addition, in order to smooth the phenomenon of redundancy, a cross-sentence word overlap value is computed to discard sentences having the same semantic content. In Ref. 10, key-phrases are extracted from the narrative paragraphs using classifiers that learn the significance of each kind of sentence. Finally, a summary is generated by a sentence extraction algorithm. Another interesting approach is proposed by Ref. 11, where the goal is to analyze text and retrieve relevant information in the form of a semantic graph based on subject-verb-object triples extracted from sentences. A general framework able to combine sentence-level structure information and semantic similarity is presented in Ref. 6. Finally, in Ref. 12 the authors propose an extractive summarization of ontologies based on the notion of RDF sentence as the basic unit of summarization. The summarization is obtained by extracting a set of salient RDF sentences according to a re-ranking strategy. Ontologies are also used to support summarization: for instance, Ref. 13 describes how the Yago ontology can support entity recognition and disambiguation.

3. A Model for Automatic Multi-Document Summarization

3.1. Basic elements of the model

Our idea is inspired by the text summarization models based on the Maximum Coverage Problem,14,15 but differently from them we design a methodology that combines both the syntactic and the semantic structure of a text. In particular, in the proposed model, the documents are segmented into several linguistic units (named summarizable sentences). Our main goal is to cover as many conceptual units as possible using only a small number of sentences.

Definition 1 (Summarizable sentence and semantic atoms). A Summarizable Sentence defined over a document $D$ is a couple:

$\sigma = \langle s, \{t_1, t_2, \ldots, t_m\} \rangle$,   (1)

$s$ being a sentence belonging to $D$ and $\{t_1, t_2, \ldots, t_m\}$ being a set of atomic or structured information items that express in some way the semantic content related to $s$.

We provide the example sentences in Table 1 to better explain the introduced definitions.

Table 1. Example sentences.

Document | Sentence | Text of sentence
1 | 1(a) | People and ambulances were at the scene.
1 | 1(b) | Many cars were on the street.
1 | 1(c) | A person died in the building.
2 | 2(a) | Marcello Pera denied the terror attack.
2 | 2(b) | A man was killed in the crash.
2 | 2(c) | The president declared an emergency.
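As a minimal illustration of Definition 1, the sketch below models a summarizable sentence in Python; the class and field names are ours and not part of the model.

from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Triple:
    """A semantic atom t = (subject, predicate, object)."""
    subject: str
    predicate: str
    object: str

@dataclass
class SummarizableSentence:
    """The couple <s, {t_1, ..., t_m}> of Definition 1."""
    sentence: str
    atoms: Tuple[Triple, ...]

example = SummarizableSentence(
    sentence="People and ambulances were at the scene.",
    atoms=(Triple("people", "be", "scene"), Triple("ambulance", "be", "scene")),
)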
Definition 2 (Summarization algorithm). Let $D$ be a set of documents; a Summarization Algorithm is formed by a sequence of two functions $\varphi$ and $\psi$. The semantic partitioning function ($\varphi$) partitions $D$ into $K$ sets $P_1, \ldots, P_K$ of summarizable sentences having similar semantics in terms of semantic atoms, and returns for each set the related information score, obtained by opportunely combining the scores of the semantic atoms:

$\varphi : D \to S = \{\langle P_1, \hat{w}_1 \rangle, \ldots, \langle P_K, \hat{w}_K \rangle\}$   (2)

s.t. $P_i \cap P_j = \emptyset$, $\forall i \neq j$. The sequential sentence selection function ($\psi$):

$\psi : S \to \bar{S}$   (3)

selects a set of sentences $\bar{S}$ from the original documents containing the semantics of the most important clustered information sets, in such a way that:

(1) $|\bar{S}| \leq L$,
(2) $\forall P_k, \hat{w}_k \geq \alpha$, $\exists t_j, t_j \in \bar{S} : \mathrm{sim}(t_i, t_j) \geq \beta$, $t_i \in S$,

$\alpha$ and $\beta$ being two thresholds. With abuse of notation, we use the symbol $\in$ to indicate that a semantic atom comes from a sentence belonging to the set of summarizable sentences $\bar{S}$.

We are now going to explain the following points of our model: (i) how to represent and extract the semantic atoms of a document, (ii) how to evaluate the similarity between two semantic atoms, (iii) how to calculate a score for each semantic atom, and finally, (iv) how to define suitable semantic partitioning and sentence selection functions.

3.2. Extracting semantic atoms from a text

We adopted the principles behind the RDF framework used in the Semantic Web community to semantically describe web resources. The idea is based on representing data in terms of a triple ⟨subject, verb, object⟩, attaching to each element of the triple the tokens extracted by processing the document.a The triples are extracted from each sentence in the documents by applying a set of rules on the parse tree structure computed on each sentence. In our rules, subjects and objects are nouns or chunks, while the verbs are reported in the infinitive form. It is worth noting that, in the case of long sentences, more triples may be associated with a sentence. The rules are obtained by defining a set of patterns for subject, verb and object which include not only part-of-speech features but also parse tree structures. In particular, we start from the patterns described in Ref. 16 in order to include not only relations but also subjects and objects, and we add to the pattern expressions features related to the sentence linguistic structure (parse tree).

a. We do not consider any other grammatical units of a sentence, such as adjectives, prepositions and so on. This is because we are not interested in detecting the sentiment or opinion in the text, nor in providing just the triples themselves as final summarization results to the user; we exploit them as "pointers" to the original sentences, which are the real components of our summaries.
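The extraction rules of the paper are pattern-based and operate on the full parse tree; as a rough illustration only, the sketch below pulls subject-verb-object triples from direct dependency arcs using spaCy, which captures just the simplest of those patterns.

import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline with a parser

def extract_triples(text):
    """Collect (subject, verb, object) triples from direct nsubj/dobj arcs."""
    triples = []
    for sent in nlp(text).sents:
        for token in sent:
            if token.pos_ == "VERB":
                subjects = [c for c in token.children if c.dep_ == "nsubj"]
                objects = [c for c in token.children if c.dep_ == "dobj"]
                for s in subjects:
                    for o in objects:
                        # verbs are reported in the infinitive (lemma) form
                        triples.append((s.lemma_, token.lemma_, o.lemma_))
    return triples

print(extract_triples("Marcello Pera denied the terror attack."))
# e.g., [('Pera', 'deny', 'attack')]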
3.3. Semantic similarity function

In our model, we decided to compare two semantic atoms based on the similarity measure obtained by comparing the elements hosted by a summarization triple, namely subject, predicate, and object. In particular, let us consider two sentences and assume we extract from them two triples $t_1$ and $t_2$; we define the similarity between $t_1$ and $t_2$ as the function:

$\mathrm{sim}(t_1, t_2) = F_{agr}(F_{sim}(sub_1, sub_2), F_{sim}(pred_1, pred_2), F_{sim}(obj_1, obj_2))$   (4)

The function $F_{sim}$ is used to obtain the similarity among the values of the semantic atoms, while $F_{agr}$ is an aggregation function. In particular, we use the Wu and Palmer similarity17 for computing the similarity among the elements of our triples. This similarity is based on the WordNet knowledge base, which lets us compare triples based on their semantic content.
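A minimal sketch of Eq. (4), assuming NLTK's WordNet interface and using a plain average as the aggregation function F_agr (the concrete F_agr is not fixed by the text):

from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet') once

def f_sim(word1, word2):
    """Wu-Palmer similarity of the best-matching WordNet synset pair."""
    scores = [s1.wup_similarity(s2) or 0.0
              for s1 in wn.synsets(word1)
              for s2 in wn.synsets(word2)]
    return max(scores, default=0.0)

def triple_sim(t1, t2):
    """sim(t1, t2): aggregate similarity over subject, predicate and object."""
    parts = [f_sim(a, b) for a, b in zip(t1, t2)]
    return sum(parts) / len(parts)  # simple average as F_agr

print(triple_sim(("man", "kill", "crash"), ("person", "die", "building")))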
3.4. Semantic atom score

Several metrics (frequency-based, based on frequency and position, query-based) have been studied and applied, proving their capability to produce meaningful results on certain types of documents, as described in Ref. 4. We developed the following method, which combines the values of frequency and position with a value of source preference:b (i) a value of reputation ($r$) is assigned to each semantic atom, obtained from the reputation of the source to which the information element belongs; (ii) the frequency ($Fr$) is calculated through the occurrences of the semantic atom $t$ = (subject, predicate, object) across the document set ($Fr = \mathrm{count}(t)/N$); (iii) a score ($\rho$) is assigned depending on the position, in the original document, of the sentence to which a semantic atom is associated. We assume this relationship to be $\rho = 1/pos$, $pos$ being the related position.

The final score for a semantic atom is thus obtained by combining frequency, position and reputation as $\omega(t) = \alpha_1 r + \alpha_2 Fr + \alpha_3 \rho$, with $\sum_{i=1}^{3} \alpha_i = 1$ and $\alpha_i \in \mathbb{R}$. In turn, the information score for a set of summarizable sentences ($P$) is defined as the average of the information scores of all the semantic atoms belonging to $P$: $\hat{w}(P) = \frac{1}{|P|} \sum_{t_i \in P} \omega(t_i)$.

b. A value given by the user through qualitative feedback, which is assumed to be the reputation of each source.
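The score can be computed in a few lines; the sketch below is ours, with illustrative weights subject to the constraint that the three alpha values sum to 1, and a single per-source reputation value for simplicity.

from collections import Counter

def atom_scores(atoms, positions, reputation, alpha=(0.2, 0.5, 0.3)):
    """omega(t) = a1*r + a2*Fr + a3*rho for every semantic atom t.

    atoms      -- list of (subject, predicate, object) triples in the document set
    positions  -- 1-based position of the host sentence for each atom
    reputation -- reputation r of the source, e.g. in [0, 1] (user feedback)
    """
    a1, a2, a3 = alpha                    # illustrative weights, sum to 1
    counts = Counter(atoms)
    n = len(atoms)
    scores = {}
    for t, pos in zip(atoms, positions):
        fr = counts[t] / n                # frequency Fr = count(t) / N
        rho = 1.0 / pos                   # position score rho = 1 / pos
        scores[t] = a1 * reputation + a2 * fr + a3 * rho
    return scores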
3.5. The semantic partitioning function

In our context, the clusters have different shapes and are characterized by density-connected points; thus, we decided to use in our experiments the OPTICS algorithm,18 which is based on a density-based clustering approach. The clustering procedure is applied to the semantic atoms, thus we can use as distance $d(t_i, t_j)$ the value of $1 - \mathrm{sim}(t_i, t_j)$.

Cluster | ŵ | ω | DI | PI | SI | Sentence
1 | 0.25 | 0.3 | 1 | 12 | a | The ambulance arrived on the scene.
1 | 0.25 | 0.2 | 2 | 14 | b | The cars were on the street.
2 | 0.15 | 0.2 | 1 | 3 | c | A man was killed in the crash.
2 | 0.15 | 0.1 | 3 | 2 | d | He ended in the crash.
3 | 0.15 | 0.2 | 2 | 15 | e | He declared an emergency.
3 | 0.15 | 0.1 | 4 | 10 | f | The chief asked for help.

In this way, we will have clusters where the sentences that are semantically more similar are grouped together. We now have to describe how we choose the sentences from the clusters by maximizing the information score and minimizing the text redundancy.
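As a sketch of this step, assuming scikit-learn's OPTICS implementation and the triple_sim function outlined in Sec. 3.3, the atoms can be clustered over a precomputed matrix of distances d(t_i, t_j) = 1 - sim(t_i, t_j):

import numpy as np
from sklearn.cluster import OPTICS

def cluster_atoms(atoms, sim=None):
    """Cluster semantic atoms with OPTICS over d = 1 - sim."""
    sim = sim or triple_sim               # triple_sim as sketched in Sec. 3.3
    n = len(atoms)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = 1.0 - sim(atoms[i], atoms[j])
    # min_samples is an illustrative choice, not a value from the paper
    labels = OPTICS(min_samples=2, metric="precomputed").fit_predict(dist)
    return labels                          # labels[i] = cluster id, -1 = noise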
3.6. Sentence selection function and summary ordering

In this section, we discuss how we choose the sentences from the clusters by maximizing the information score and minimizing text redundancy, following the general approach described in the basic model. First, we order the clusters according to their scores; then we use the threshold $\delta_1$ to select the most important ones with respect to the length restrictions. After this stage, we select the sentence that has the most representative semantic atoms, minimizing, at the same time, the redundancy. As several sentences may have multiple semantic atoms, we need to eliminate the possibility that the same sentence is reported in the summary several times because its semantic atoms are spread over all the clusters. We penalize the average score of each cluster if we have already considered it during our summary building process; we also check whether the clusters contain any other useful information by comparing their cumulative average score with the threshold $\delta_2$. Once the optimal summary has been determined, the sentences are ordered by maximizing the partial ordering of the sentences within the single documents.
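A minimal sketch of this selection loop is given below; the cluster representation, the penalty factor and the way the two thresholds are applied are our simplifications of the procedure just described.

def select_sentences(clusters, max_len, delta1, delta2, penalty=0.5):
    """clusters: list of dicts {'score': float, 'sentences': [(text, atom_score)]}.

    Assumes delta2 > 0 so that repeatedly penalized clusters eventually drop out.
    """
    active = [c for c in clusters if c["score"] >= delta1]  # most important clusters
    summary = []
    while len(summary) < max_len and active:
        best = max(active, key=lambda c: c["score"])
        if best["score"] < delta2:        # no cluster carries useful information
            break
        text, _ = max(best["sentences"], key=lambda p: p[1])
        if text not in summary:           # never report the same sentence twice
            summary.append(text)
        best["score"] *= penalty          # penalize an already-visited cluster
    return summary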
4. Proposal Framework

We propose a framework, called iWIN, composed of the following steps (see Fig. 1): (i) Web Search: several web pages are retrieved and stored into a local repository, using the available APIs of the most common search engines; (ii) Text Extractor: extracts the relevant textual sentences of the documents by parsing the HTML pages; (iii) NLP and triple extraction: several NLP algorithms split the extracted text into sentences and identify the functional elements (subject, verb and object) of each sentence and their related form, thus extracting the related triple; (iv) Clustering: performs the clustering of the triples using the OPTICS algorithm; (v) Sentence Selection: on the basis of the generated clusters, the most representative sentences are selected using a maximum coverage sentence selection greedy algorithm; and (vi) Summary building: generates the summary related to the search keywords and user preferences.

Fig. 1. The summarization process.
5. Experimental Process

In this section, we provide a comparison between iWIN and other existing automatic summarizer systems using the ROUGE evaluation software.c The evaluation is based on ground truth summaries produced by human experts, using recall-based metrics, as the summary needs to contain the most important concepts coming from the considered documents. Two fundamental problems have been addressed: (i) the choice of the document data sets; (ii) the selection of existing tools to be compared with iWIN.

Concerning the data sets, our experiments have been performed on four different data corpora used in the text-summarization field.d With regard to summarizer systems, we have selected the following applications: (i) Copernic Summarizer22; (ii) Intellexer Summarizer Pro21; (iii) Great Summary20; (iv) Tools4Noobs Summarizer21; (v) Text Compactor.19

For each of the four chosen topics, three summaries of different lengths (15%, 25% and 35%) have been generated both by the current system and by the other tools. Figures 2-4 show the comparison between our system and the others with respect to the ground truth, using the average recall values of ROUGE-2 and ROUGE-SU4.

Fig. 2. ROUGE values, length = 15%.
Fig. 3. ROUGE values, length = 25%.
Fig. 4. ROUGE values, length = 35%.

c. Recall-Oriented Understudy for Gisting Evaluation, http://haydn.isi.edu/ROUGE/.
d. The first contains nine newspaper articles about the air accident which happened in Milan on 18th April 2002, when a small airplane crashed into the Pirelli skyscraper; there are also three ideal summaries generated by humans (15%, 25% and 35% of the original documents' length). The second data corpus is composed of four papers about a bomb explosion in a Moscow building, which caused many dead and injured; three human-generated summaries (15%, 25% and 35% of the original documents' length) are also available. A document set containing three papers about Al-Qaida, with summaries whose lengths are 15%, 25% and 35% of the original documents' length, is the third data set we used in our experiments. The fourth data corpus (Turkey) is a document about the problems related to Turkey's entry into the European Community, with available summaries with a number of sentences equal to 15%, 25% and 35% of the original document length.
5.1. Considerations on the efficiency of our approach

We have measured the running times (on a machine equipped with an Intel Core i5 CPU M 430 at 2.27 GHz and with 4 GB of DDR3 RAM) for each summary building phase in relation to the number of input sentences; the most critical step, which can constitute a bottleneck for the system, is the computation of the semantic distances. In order to have an idea of iWIN's efficiency with respect to some of the analyzed commercial systems, we also computed the running times of the Copernic and Intellexer summarizers. In particular, for a dataset of 10,000 input sentences, we observe an execution time of a few seconds for Copernic and 30 s for Intellexer, while our framework takes 500 s. We can point out that the average quality of the produced summary is better than that of the commercial systems, but the efficiency of our system at the moment represents a bottleneck for its application in the case of a large number of input sentences. We observed that, to reach the performance of the commercial systems, iWIN's elaboration can be significantly improved by caching the similarity values between terms that have to be computed multiple times and by using concurrent computation threads for subjects, verbs and objects. Moreover, the distance matrix can be opportunely stored and accessed through a proper indexing strategy.
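As an illustration of the caching optimization mentioned above, pairwise term similarities can be memoized so that each unordered pair is computed only once; this sketch assumes the f_sim function outlined in Sec. 3.3.

from functools import lru_cache

@lru_cache(maxsize=None)
def _sim_ordered(w1, w2):
    return f_sim(w1, w2)                  # f_sim as sketched in Sec. 3.3

def cached_sim(w1, w2):
    # order the pair so (a, b) and (b, a) share a single cache entry
    return _sim_ordered(*sorted((w1, w2)))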
References
1. H. P. Luhn, The automatic creation of literature abstracts, IBM J. Res. Dev., 159 (1958).
2. R. McDonald, A study of global inference algorithms in multi-document summarization, Proc. 29th Eur. Conf. IR Res. (2007), pp. 557–564.
3. U. Hahn and I. Mani, The challenges of automatic summarization, Computer 29 (2000).
4. R. McDonald and V. Hristidis, A survey of text summarization techniques, Mining Text Data, 43 (2012).
5. V. Gupta and S. Gurpreet, A survey of text summarization extractive techniques, J. Emerg. Technol. Web Intell. 258 (2010).
6. D. Wang and T. Li, Weighted consensus multi-document summarization, Inform. Process. Manag. 513 (2012).
7. H. Van Halteren and S. Teufel, Examining the consensus between human summaries: Initial experiments with factoid analysis, Proc. HLT-NAACL 03 Text Summarization Workshop, Vol. 5 (2003), pp. 57–64.
8. L. Vanderwende, H. Suzuki, C. Brockett and A. Nenkova, Beyond SumBasic: Task-focused summarization with sentence simplification and lexical expansion, Inf. Process. Manage. 1606 (2007).
9. M. Wang, X. Wang, C. Li and Z. Zhang, Multi-document summarization based on word feature mining, Proc. 2008 Int. Conf. Computer Science and Software Engineering, Vol. 01 (2008), pp. 743–746.
10. M. Wang, X. Wang, C. Li and Z. Zhang, World wide web site summarization, Web Intell. Agent Syst. 39 (2004).
11. D. Rusu, B. Fortuna, M. Grobelnik and D. Mladenic, Semantic graphs derived from triplets with application in document summarization, Informatica (Slovenia), 357 (2009).
12. X. Zhang, G. Cheng and Y. Qu, Ontology summarization based on RDF sentence graph, Proc. WWW Conf. (2007), pp. 322–334.
13. E. Baralis, L. Cagliero, S. Jabeen, A. Fiori and S. Shah, Multi-document summarization based on the Yago ontology, Expert Syst. Appl. 6976 (2013).
14. H. Takamura and M. Okumura, Text summarization model based on maximum coverage problem and its variant, Proc. 12th Conf. European Chapter of the ACL (2009), pp. 781–789.
15. D. Gillick and B. Favre, A scalable global model for summarization, Proc. Workshop on Integer Linear Programming for Natural Language Processing (2009), pp. 10–18.
16. O. Etzioni, A. Fader, J. Christensen, S. Soderland and M. Mausam, Open information extraction: The second generation, IJCAI (2011), pp. 3–10.
17. Z. Wu and M. Palmer, Verb semantics and lexical selection, 32nd Annual Meeting of the Association for Computational Linguistics (1994), pp. 133–138.
18. M. Ankerst, M. M. Breunig, H. Kriegel and J. Sander, OPTICS: Ordering points to identify the clustering structure, Proc. 1999 ACM SIGMOD Int. Conf. Management of Data (1999), pp. 49–60.
19. Copernic, http://www.copernic.com/data/pdf/summarization-white-paper-eng.pdf.
20. GreatSummary, http://www.greatsummary.com/.
21. Tool4Noobs, http://summarizer.intellexer.com/.
22. Copernic, https://www.copernic.com/en/products/summarizer/.
Event identification in social networks

Fattane Zarrinkalam*,†,‡ and Ebrahim Bagheri*
*Laboratory for Systems, Software and Semantics (LS3)
Ryerson University, Toronto, Canada
†Department of Computer Engineering, Ferdowsi University of Mashhad, Mashhad, Iran
[email protected]
‡Corresponding author.
Social networks enable users to freely communicate with each other and share their recent news, ongoing activities or views about different topics. As a result, they can be seen as a potentially viable source of information for understanding currently emerging topics/events. The ability to model emerging topics is a substantial step towards monitoring and summarizing the information originating from social sources. Applying traditional event detection methods, which are often designed for processing large, formal and structured documents, is less effective here due to the short length, noisiness and informality of social posts. Recent event detection techniques address these challenges by exploiting the opportunities behind the abundant information available in social networks. This article provides an overview of the state of the art in event detection from social networks.

Keywords: Event detection; social network analysis; topic detection and tracking.
1. Overview

With the emergence and growing popularity of social networks such as Twitter and Facebook, many users extensively use these platforms to express their feelings and views about a wide variety of social events/topics as they happen, in real time, even before they are released in traditional news outlets.1 This large amount of data produced by various social media services has recently attracted many researchers to analyze social posts to understand currently emerging topics/events. The ability to identify emerging topics is a substantial step towards monitoring and summarizing the information on social media, and it provides the potential for understanding and describing real-world events and improving the quality of higher-level applications in fields such as traditional news detection, computational journalism2 and urban monitoring,3 among others.

1.1. Problem definition

The task of topic detection and tracking (TDT) is to provide the means for news monitoring from multiple sources in order to keep users updated about news and developments. One of the first forums was the TDT Forum, held within TREC.4 There has been significant interest in TDT in the past for static documents in traditional media.5–8 Recently, however, the focus has moved to social network data sources. Most of these works use Twitter as their source of information, because the information that users publish on Twitter is more publicly accessible compared to other social networks.

To address the task of detecting topics from social media streams, a stream is considered to be made of posts generated by users in a social network (e.g., tweets in the case of Twitter). Each post, in addition to its text formed by a sequence of words/terms, includes a user id and a timestamp. Additionally, a time interval of interest and a desired update rate are provided. The expected output of topic detection algorithms is the detection of emerging topics/events. In TDT, a topic is "a seminal event or activity, along with all directly related events and activities,"4 where an event refers to a specific thing that happens at a certain time and place.9–11 A topic can be represented either through the clustering of the documents in the collection or by a set of most important terms or keywords that are selected.
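As a concrete reading of this input abstraction, the sketch below models a post and buckets a stream into fixed-length time windows; all names and the windowing helper are ours, not from the TDT literature.

from dataclasses import dataclass

@dataclass
class Post:
    text: str          # a sequence of words/terms
    user_id: str
    timestamp: float   # posting time, e.g. Unix epoch seconds

def windows(posts, start, interval):
    """Group a stream of posts into consecutive windows of `interval` seconds,
    mirroring the time interval of interest / update rate described above."""
    buckets = {}
    for p in posts:
        k = int((p.timestamp - start) // interval)
        buckets.setdefault(k, []).append(p)
    return buckets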
1.2. Challenges

The works proposed within the scope of TDT have proven to be relatively well established for topic detection in traditional textual corpora such as news articles.11 However, applying traditional methods for event detection in the social media context poses unique challenges due to the distinctive features of textual data in social media,12 such as Time Sensitivity, Short Length, Unstructured Phrases, and Abundant Information.
- Time sensitivity. Different from traditional textual data, text in social media has a real-time nature. Besides communicating and sharing ideas with each other, users in social networks may publish their feelings and views about a wide variety of recent events several times daily.7,13,14 Users may want to communicate instantly with friends about "What they are doing" (Twitter) or "What is on their mind" (Facebook).
- Short length. Most social media platforms restrict the length of posts. For example, Twitter allows users to post tweets that are no longer than 140 characters. Similarly, Picasa comments are limited to 512 characters, and personal status messages on Windows Live Messenger are restricted to 128 characters. Unlike standard text with lots of words and their resulting statistics, short messages consist of few phrases or sentences. They cannot provide sufficient context information for effective similarity measures, the basis of many text processing methods.12,29,57
- Unstructured phrases. In contrast with well-written, structured, and edited news releases, social posts might include large amounts of meaningless messages, polluted and informal content, irregular and abbreviated words, large numbers of spelling and grammatical errors, and improper sentence structures and mixed languages. In addition, in social networks, the distribution of content quality has high variance: from very high-quality items to low-quality, sometimes abusive content, which negatively affects the performance of detection algorithms.12,61
- Abundant information. In addition to the content itself, social media in general exhibit a rich variety of information sharing tools. For example, Twitter allows users to utilize the "#" symbol, called a hashtag, to mark keywords or topics in a tweet; an image is usually associated with multiple labels which are characterized by different regions in the image; and users are able to build connections with others (link information). Previous text analytics sources most often appear as ...

2.1. Specified event detection

Specified event detection aims at identifying known social events which are partially or fully specified by their content or metadata information, such as location, time, and venue.7,13,14 For example, Sakaki et al.14 have focused on monitoring tweets posted recently by users to detect earthquakes or rainbows. They have used three types of features: the number of words (statistical), the keywords in a tweet message, and the words surrounding user queries (contextual), to train a classifier and classify tweets into positive or negative cases. To identify the location of the event, a probabilistic spatiotemporal model is also built. They have evaluated their proposed approach in an earthquake-reporting system in Japan. The authors have found that the statistical features provided the best results, while a small improvement in performance was achieved by the combination of the three features.

Popescu and Pennacchiotti15 have proposed a framework to identify controversial events. This framework is based on the notion of a Twitter snapshot, which consists of a target entity, a given period, and a set of tweets about the entity from the target period. Given a set of Twitter snapshots, the authors first assign a controversy score to each snapshot and then rank the snapshots according to this score, considering a large number of features, such as linguistic, structural, sentiment, controversy and external features, in their model. The authors have concluded that hashtags are important semantic features for identifying the topic of a tweet. Further, they have found that linguistic, structural, and sentiment features have a considerable effect on controversy detection.

Benson et al.16 have proposed a model to identify a comprehensive list of musical events from Twitter based on artist–venue pairs. Their model is based on a conditional random field (CRF) to extract the artist name and location of the event. The input features to the CRF model include word ...
... returned by each strategy; then they have employed term-frequency analysis and co-location techniques to improve recall and to identify descriptive event terms and phrases, which are then used recursively to define new queries. Similarly, Becker et al.19 have proposed centrality-based approaches to extract high-quality, relevant, and useful tweets related to an event. Their approach is based on the idea that the most topically central messages in a cluster are more likely to reflect key aspects of the event than other, less central cluster messages.

2.2. Unspecified event detection

The real-time nature of social posts reflects events as they happen, covering emerging events, breaking news, and general topics that attract the attention of a large number of users. Therefore, these posts are useful for unknown event detection. Three main approaches have been studied in the literature for this purpose: topic-modeling, document-clustering and feature-clustering approaches.1

2.2.1. Topic modeling methods

Topic modeling methods such as LDA assume that a document is a mixture of topics and implicitly use co-occurrence patterns of terms to extract sets of correlated terms as the topics of a text corpus.20 More recent approaches have extended LDA to provide support for temporality, including the topics over time (ToT) model,21 which simultaneously captures term co-occurrences and the locality of those patterns over time and is hence able to discover more event-specific topics.

The majority of existing topic models, including LDA and ToT, focus on regular document collections, such as research papers, consisting of a relatively small number of long, high-quality documents. However, social posts are shorter and noisier than traditional documents. Users in social networks are not professional writers and use a very diverse vocabulary, and there are many abbreviations and typos. Moreover, online social media websites carry a social network full of context information, such as user features and user-generated labels, which has normally been ignored by the existing topic models. As a result, these models may not perform so well on social posts and might suffer from the sparsity problem.22–25

To address this problem, some works aggregate multiple short texts to create a single document and discover the topics by running LDA over this document.25–27 For instance, Hong and Davison30 have combined all the tweets from each user into one document and applied LDA to extract the document's topic mixture, which represents the user's interests. However, in social networks a small number of users usually account for a significant portion of the content, which makes the aggregation process less effective.

There are some recent works that deal with the sparsity problem by applying some restrictions to simplify the conventional topic models, or by developing novel topic models for short texts. For example, Zhao et al.28 have proposed the Twitter-LDA model, which assumes that a single tweet contains only one topic, differing from the standard LDA model. Diao et al.29 have proposed the biterm topic model (BTM), a novel topic model for short texts, which learns the topics by directly modeling the generation of word co-occurrence patterns (i.e., biterms) in the whole corpus. BTM was extended by Yan et al.57 by incorporating the burstiness of biterms as prior knowledge for bursty topic modeling, proposing a new probabilistic model named the bursty biterm topic model (BBTM) to discover bursty topics in microblogs. It should be noted that applying such restrictions, together with the fact that the number of topics in LDA is assumed to be fixed, can be considered strong assumptions for social network content because of the dynamic nature of social networks.
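The pooling strategy of Hong and Davison can be sketched in a few lines with gensim; the function below (ours) concatenates each user's tweets into one document before running LDA.

from gensim import corpora, models

def user_topic_mixtures(tweets_by_user, num_topics=10):
    """tweets_by_user: dict mapping user id -> list of tokenized tweets."""
    # pool each user's tweets into a single pseudo-document
    docs = [[w for tweet in tweets for w in tweet]
            for tweets in tweets_by_user.values()]
    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    lda = models.LdaModel(corpus, num_topics=num_topics, id2word=dictionary)
    # each pooled document's topic mixture approximates that user's interests
    return [lda.get_document_topics(bow) for bow in corpus]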
2.2.2. Document clustering methods

Document-clustering methods extract topics by clustering related documents and consider each resulting cluster as a topic. They mostly represent the textual content of each document as a bag of words or n-grams using a TF/IDF weighting schema and utilize cosine similarity measures to compute the co-occurrence of their words/n-grams.31,32 Document-clustering methods suffer from cluster fragmentation problems, and since the similarity of two documents may be sensitive to noise, they perform much better on long and formal documents than on social posts, which are short, noisy and informal.33 To address this problem, some works take into account, in addition to textual information, other rich attributes of social posts such as timestamps, publisher, location and hashtags.33,35,36 These works typically differ in the information they use and the measures they apply to compute the semantic distance between documents.

For example, Dong et al.34 have proposed a wavelet-based scheme to compute the pairwise similarity of tweets based on temporal, spatial, and textual features. Fang et al.35 have clustered tweets by taking into consideration multi-relations between tweets, measured using different features such as textual data, hashtags and timestamps. Petrovic et al.31 have proposed a method to detect new events from a stream of Twitter posts. To make event detection feasible on web-scale corpora, the authors have proposed a constant time and space approach based on an adapted variant of locality sensitive hashing (LSH) methods. The authors have found that ranking according to the number of users is better than ranking according to the number of tweets, and that considering the entropy of the message reduces the amount of spam messages in the output. Becker et al.39 have first proposed a method to identify real-world events using a classical incremental clustering algorithm; then, they have classified the cluster contents into real-world events or nonevents. These nonevents include Twitter-centric topics, which are trending activities in Twitter that do not reflect any real-world occurrences. They have trained the classifier on a variety of features, including temporal, social, topical, and Twitter-centric features, to decide whether a cluster (and its associated messages) contains a real-world event.

In the context of breaking news detection from Twitter, Sankaranarayanan et al.37 have proposed TwitterStand, a news processing system for Twitter that captures tweets related to late breaking news, taking into account both textual similarity and temporal proximity. They have used a naive Bayes classifier to separate news from irrelevant information and an online clustering algorithm based on weighted term vectors to cluster news. Further, they have used hashtags to reduce clustering errors. Similarly, Phuvipadawat and Murata38 have presented a method for breaking news detection in Twitter. They first sample tweets using predefined search queries, and then group them together to form a news story. Similarity between posts is based on tf-idf with an increased weight for proper noun terms, hashtags, and usernames. They use a weighted combination of the number of followers (reliability) and the number of retweeted messages (popularity), with a time adjustment for the freshness of the message, to rank each cluster. New messages are included in a cluster if they are similar to the first post and to the top-k terms in that cluster.
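In the spirit of the LSH-based first story detection of Petrovic et al., the sketch below flags a post as novel when no sufficiently similar earlier post is found; datasketch's MinHash LSH stands in for their random-hyperplane scheme, so this is an illustration rather than their exact algorithm.

from datasketch import MinHash, MinHashLSH

lsh = MinHashLSH(threshold=0.5, num_perm=64)  # illustrative parameters

def is_first_story(post_id, tokens):
    """Return True if no earlier post is close enough to this one."""
    m = MinHash(num_perm=64)
    for tok in tokens:
        m.update(tok.encode("utf8"))
    novel = len(lsh.query(m)) == 0        # constant-time neighbor lookup
    lsh.insert(post_id, m)                # make the post visible to later queries
    return novel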
2.2.3. Feature clustering methods

Feature clustering methods try to extract the features of topics from documents; topics are then detected by clustering these features based on their semantic relatedness. As one of the earlier works focusing on Twitter data, Cataldi et al.40 have constructed a co-occurrence graph of emerging terms, selected based on both the frequency of their occurrence and the importance of the users; the authors have then applied a graph-based method in order to extract emerging topics. Similarly, Long et al.41 have constructed a co-occurrence graph by extracting topical words from daily posts. To extract events during a time period, they have applied a top-down hierarchical clustering algorithm over the co-occurrence graph. After detecting events in different time periods, they track the changes of events in consecutive time periods and summarize an event by finding the most relevant posts for that event. The algorithm by Sayyadi et al.42 builds a term co-occurrence graph whose nodes are clustered using a community detection algorithm based on betweenness centrality. Additionally, the topic description is enriched with the documents that are most relevant to the identified terms. Graphs of short phrases, rather than of single terms, connected by edges representing lexical inclusion or similarity, have also been used.

There are also some works that utilize signal processing techniques for event detection from social networks. For instance, Weng and Lee43 have used wavelet analysis to discover events in Twitter streams. First, they have selected bursty words by representing each word as a frequency-based signal and measuring the bursty energy of each word using autocorrelation. Then, they build a graph whose nodes are bursty words and whose edges are the cross-correlations between each pair of bursty words, and they have used graph-partitioning techniques to discover events. Similarly, Cordeiro44 has used wavelet analysis for event detection from Twitter. This author has constructed a wavelet signal for each hashtag, instead of words, over time by counting the hashtag mentions in each interval. Then, he has applied the continuous wavelet transformation to get a time-frequency representation of each signal and has used peak analysis and local maxima detection techniques to detect an event within a given time interval. He et al.6 have used the Discrete Fourier Transform to classify the signal for each term based on its power and periodicity. Depending on the identified class, the distribution of appearances of a term over time is modeled using one or more Gaussians, and the KL-divergence between the distributions is then used to determine clusters.

In general, most of these works are based on terms and compute the similarity between pairs of terms based on their co-occurrence patterns. Petkos et al.45 have argued that algorithms that are only based on pairwise co-occurrence patterns cannot distinguish between topics which are specific to a given corpus. Therefore, they have proposed a soft frequent pattern mining approach to detect finer-grained topics. Zarrinkalam et al.46 have inferred fine-grained users' topics of interest by viewing each topic as a conjunction of several concepts, instead of terms, and benefit from graph clustering algorithms to extract temporally related concepts in a given time period. Further, they compute inter-concept similarity by customizing the concept co-occurrences within a single tweet to an increased, yet semantics-preserving, context.
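A toy version of the co-occurrence-graph approach (in the spirit of Sayyadi et al.) can be written with networkx, whose Girvan-Newman routine provides betweenness-based community detection; the weighting and the single-split cut are our simplifications.

import itertools
import networkx as nx
from networkx.algorithms.community import girvan_newman

def cooccurrence_topics(docs):
    """docs: list of tokenized documents; returns lists of co-clustered terms."""
    g = nx.Graph()
    for doc in docs:
        for w1, w2 in itertools.combinations(set(doc), 2):
            weight = g.get_edge_data(w1, w2, {"weight": 0})["weight"]
            g.add_edge(w1, w2, weight=weight + 1)  # count co-occurrences
    communities = next(girvan_newman(g))  # first split of the dendrogram
    return [sorted(c) for c in communities]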
3. Application Areas

There are a set of interesting applications of event/topic detection systems and methods. Health monitoring and management is an application in which the detection of events plays an important role. For example, Culotta49 has explored the possibility of tracking influenza by analyzing Twitter data. He has proposed an approach to predict influenza-like illness rates in a population, identifying influenza-related messages and comparing a number of regression models to correlate these messages with U.S. Centers for Disease Control and Prevention (CDC) statistics. Similarly, Aramaki et al.52 have identified flu outbreaks by analyzing tweets about influenza. Their results are similar to Google Trends-based flu outbreak detection, especially in the early stages of the outbreak.

Paul and Dredze50 have proposed a new topic model for Twitter, named the ailment topic aspect model (ATAM), that associates symptoms, treatments and general words with diseases. It produces more detailed ailment symptoms and tracks disease rates consistent with published government statistics (influenza surveillance) despite the lack of supervised influenza training data. In Ref. 51, the authors have used Twitter to identify posts which are about health issues, and they have investigated what types of links the users consult for publishing health-related information.
Natural event (disaster) detection is another application for the automatic detection of events from social networks. For example, Sakaki et al.14 have proposed an algorithm to monitor the real-time unfolding of events, such as earthquakes, in Twitter. Their approach can detect an earthquake with high probability by monitoring tweets; it detects earthquakes promptly and sends e-mails to registered users. The response time of the system is shown to be quite fast, similar to that of the Japan Meteorological Agency. Cheong and Cheong53 have analyzed the tweets during the Australian floods of 2011 to identify active players and their effectiveness in disseminating critical information. As their secondary goal, they have identified the most important users during the Australian floods to be: local authorities (Queensland Police Services), political personalities (Premier, Prime Minister, Opposition Leader and Members of Parliament), social media volunteers, traditional media reporters, and people from not-for-profit, humanitarian, and community associations. In Ref. 54, the authors have applied a visual analytics approach to a set of georeferenced tweets to detect flood events in Germany, providing visual information on a map. Their results confirmed the potential of Twitter as a distributed "social sensor". To overcome some caveats in interpreting immediate results, they have explored incorporating evidence from other data sources.

Some applications with marketing purposes have also utilized event detection methods. For example, Medvet et al.55 have focused on detecting events related to three major brands: Google, Microsoft and Apple. Examples of such events are the release of a new product like the new iPad or the Microsoft Security Essentials software. In order to achieve the desired outcome, the authors study the sentiment of the tweets. Si et al.56 have proposed a continuous Dirichlet Process Mixture model for Twitter sentiment to help predict the stock market. They extract the sentiment of each tweet based on its opinion-word distribution to build a sentiment time series. Then, they regress the stock index and the Twitter sentiment time series to predict the market.

There are also some works that model users' interests over the events detected from social networks. For example, Zarrinkalam et al.47 have proposed a graph-based link prediction schema to model a user's interest profile over a set of topics/events present in Twitter in a specified time interval. They have considered both the explicit and the implicit interests of the user. Their approach is independent of the underlying topic detection method; therefore, they have adopted two types of topic extraction methods: feature clustering and LDA approaches. Fani et al.48 have proposed a graph-based framework that utilizes multivariate time series analysis to tackle the problem of detecting time-sensitive topic-based communities of users who have a similar temporal tendency with regard to topics of interest in Twitter. To discover topics of interest from Twitter, they have utilized an LDA-based topic model that jointly captures word co-occurrences and the locality of those patterns over time.
4. Conclusion and Future Directions

Due to the fast growth and availability of social network data, many researchers have recently become attracted to event detection from social networks. Event detection aims at finding real-world occurrences that unfold over space and time. The problem of event detection from social networks faces different challenges due to the short length, noisiness and informality of social posts. In this paper, we presented an overview of the recent techniques for addressing this problem. These techniques are classified, according to the type of target event, into specified or unspecified event detection. Further, we provided some potential applications in which event detection techniques are utilized.

While there are many works related to event detection from social networks, one challenge that has to be addressed in this research area is the lack of public datasets. Privacy issues, along with social network companies' terms of use, hinder the availability of shared data. This obstacle is of great significance since it relates to the repeatability of experiments and comparison between approaches. As a result, most of the current approaches have focused on a single data source, especially the Twitter platform, because of the usability and accessibility of the Twitter API. However, being dependent on a single data source entails many risks. Therefore, one future direction can be monitoring and analyzing the events and activities from different social network services simultaneously. As an example, Kaleel58 has followed this idea and utilized Twitter posts and Facebook messages for event detection, using LSH to classify messages. The proposed algorithm first independently identifies new events (first stories) from both sources (Twitter, Facebook) and then hashes them into clusters.

As another future direction, there is no method in the field of event detection from social networks which is able to automatically answer the following questions for each detected event: what, when, where, and by whom. Therefore, improving current methods to address these questions can be a new future direction. As a social post is often associated with spatial and temporal information, it is possible to detect when and where an event happens.

Several further directions can be explored to achieve efficient and reliable event detection systems, such as: investigating how to model social streams together with other data sources, like news streams, to better detect and represent events;60 designing better feature extraction and query generation techniques; and designing more accurate filtering and detection algorithms, as well as techniques to support multiple languages.59
References

1. L. M. Aiello, G. Petkos, C. Martin and D. Corney, Sensing trending topics in twitter, IEEE Trans. Multimedia 15(6), 1268.
2. S. Cohen, J. T. Hamilton and F. Turner, Computational journalism, Commun. ACM 54, 66 (2011).
3. D. Quercia, J. Ellis, L. Capra and J. Crowcroft, Tracking "gross community happiness" from tweets, CSCW: ACM Conf. Computer Supported Cooperative Work, New York, NY, USA (2012), pp. 965–968.
4. J. Fiscus and G. Duddington, Topic detection and tracking overview, Topic Detection and Tracking: Event-Based Information Organization (2002), pp. 17–31.
5. J. Kleinberg, Bursty and hierarchical structure in streams, Proc. Eighth ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, KDD '02, ACM, New York, NY (2002), pp. 91–101.
6. Q. He, K. Chang and E.-P. Lim, Analyzing feature trajectories for event detection, Proc. 30th Annual Int. ACM SIGIR Conf. Research and Development in Information Retrieval, SIGIR '07, ACM, New York, NY (2007), pp. 207–214.
7. X. Wang, C. Zhai, X. Hu and R. Sproat, Mining correlated bursty topic patterns from coordinated text streams, Proc. 13th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, KDD '07, ACM, New York, NY (2007), pp. 784–793.
8. S. Goorha and L. Ungar, Discovery of significant emerging trends, Proc. 16th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, KDD '10, ACM, New York, NY (2010), pp. 57–64.
9. R. Troncy, B. Malocha and A. T. S. Fialho, Linking events with media, Proc. 6th Int. Conf. Semantic Systems, I-SEMANTICS '10, ACM, New York, NY (2010), pp. 42:1–42:4.
10. L. Xie, H. Sundaram and M. Campbell, Event mining in multimedia streams, Proc. IEEE 96(4), 623 (2008).
11. J. Allan, J. Carbonell, G. Doddington, J. Yamron and Y. Yang, Topic detection and tracking pilot study final report, Proc. DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, VA (1998), pp. 194–218.
12. X. Hu and H. Liu, Text analytics in social media, Mining Text Data (Springer US, 2012), pp. 385–414.
13. F. Atefeh and W. Khreich, A survey of techniques for event detection in Twitter, Comput. Intell. 31(1), 132 (2015).
14. T. Sakaki, M. Okazaki and Y. Matsuo, Earthquake shakes Twitter users: Real-time event detection by social sensors, Proc. 19th WWW (2010).
15. A.-M. Popescu and M. Pennacchiotti, Detecting controversial events from twitter, Proc. 19th ACM Int. Conf. Information and Knowledge Management, CIKM '10 (2010), pp. 1873–1876.
16. E. Benson, A. Haghighi and R. Barzilay, Event discovery in social media feeds, Proc. 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, HLT '11, Vol. 1 (2011), pp. 389–398.
17. R. Lee and K. Sumiya, Measuring geographical regularities of crowd behaviors for Twitter-based geosocial event detection, Proc. 2nd ACM SIGSPATIAL Int. Workshop on Location Based Social Networks, LBSN '10, ACM, New York, NY (2010), pp. 1–10.
18. H. Becker, F. Chen, D. Iter, M. Naaman and L. Gravano, Automatic identification and presentation of Twitter content for planned events, Int. AAAI Conf. Weblogs and Social Media (Barcelona, Spain, 2011).
19. H. Becker, M. Naaman and L. Gravano, Selecting quality Twitter content for events, Int. AAAI Conf. Weblogs and Social Media (Barcelona, Spain, 2011).
20. D. Blei, Probabilistic topic models, Commun. ACM 55(4), 77 (2012).
21. X. Wang and A. McCallum, Topics over time: A non-Markov continuous-time model of topical trends, Proc. 12th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (2006), pp. 424–433.
22. M. Michelson and S. A. Macskassy, Discovering users' topics of interest on twitter: A first look, 4th Workshop on Analytics for Noisy Unstructured Text Data (AND '10) (2010), pp. 73–80.
23. B. Sriram, D. Fuhry, E. Demir, H. Ferhatosmanoglu and M. Demirbas, Short text classification in twitter to improve information filtering, 33rd Int. ACM SIGIR Conf. Research and Development in Information Retrieval (2010), pp. 841–842.
24. C. Xueqi, Y. Xiaohui, L. Yanyan and G. Jiafeng, BTM: Topic modeling over short texts, IEEE Trans. Knowl. Data Eng. 26(12), 2928 (2014).
25. R. Mehrotra, S. Sanner, W. Buntine and L. Xie, Improving LDA topic models for microblogs via tweet pooling and automatic labeling, 36th Int. ACM SIGIR Conf. Research and Development in Information Retrieval (2013).
26. J. Weng, E. P. Lim, J. Jiang and Q. He, TwitterRank: Finding topic-sensitive influential twitterers, 3rd ACM Int. Conf. Web Search and Data Mining (WSDM '10) (2010), pp. 261–270.
27. N. F. N. Rajani, K. McArdle and J. Baldridge, Extracting topics based on authors, recipients and content in microblogs, 37th Int. ACM SIGIR Conf. Research & Development in Information Retrieval (2014), pp. 1171–1174.
28. W. X. Zhao, J. Jiang, J. Weng, J. He, E.-P. Lim, H. Yan and X. Li, Comparing twitter and traditional media using topic models, 33rd European Conf. Advances in Information Retrieval (2011), pp. 338–349.
29. Q. Diao, J. Jiang, F. Zhu and E. Lim, Finding bursty topics from microblogs, 50th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL HLT, 2012), pp. 536–544.
30. L. Hong and B. Davison, Empirical study of topic modeling in Twitter, 1st ACM Workshop on Social Media Analytics (2010).
31. S. Petrović, M. Osborne and V. Lavrenko, Streaming first story detection with application to Twitter, Proc. HLT: Annual Conf. North American Chapter of the Association for Computational Linguistics (2010), pp. 181–189.
32. G. Ifrim, B. Shi and I. Brigadir, Event detection in twitter using aggressive filtering and hierarchical tweet clustering, SNOW-DC@WWW (2014), pp. 33–40.
33. G. P. C. Fung, J. X. Yu, P. S. Yu and H. Lu, Parameter free bursty events detection in text streams, Proc. 31st Int. Conf. Very Large Data Bases, VLDB '05 (2005), pp. 181–192.
34. X. Dong, D. Mavroeidis, F. Calabrese and P. Frossard, Multiscale event detection in social media, CoRR abs/1406.7842 (2014).
35. Y. Fang, H. Zhang, Y. Ye and X. Li, Detecting hot topics from Twitter: A multiview approach, J. Inform. Sci. 40(5), 578 (2014).
36. S.-H. Yang, A. Kolcz, A. Schlaikjer and P. Gupta, Large-scale high-precision topic modeling on twitter, Proc. 20th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (2014), pp. 1907–1916.
37. J. Sankaranarayanan, H. Samet, B. E. Teitler, M. D. Lieberman and J. Sperling, TwitterStand: News in tweets, Proc. 17th ACM SIGSPATIAL Int. Conf. Advances in Geographic Information Systems (2009), pp. 42–51.
38. S. Phuvipadawat and T. Murata, Breaking news detection and tracking in Twitter, IEEE/WIC/ACM Int. Conf. Web Intelligence and Intelligent Agent Technology (WI-IAT), Vol. 3, Toronto, ON (2010), pp. 120–123.
39. H. Becker, M. Naaman and L. Gravano, Beyond trending topics: Real-world event identification on Twitter, ICWSM (Barcelona, Spain, 2011).
40. M. Cataldi, L. D. Caro and C. Schifanella, Emerging topic detection on twitter based on temporal and social terms evaluation, 10th Int. Workshop on Multimedia Data Mining, MDMKDD '10 (USA, 2010), pp. 4:1–4:10.
41. R. Long, H. Wang, Y. Chen, O. Jin and Y. Yu, Towards effective event detection, tracking and summarization on microblog data, Web-Age Information Management, Vol. 6897 (2011), pp. 652–663.
42. H. Sayyadi, M. Hurst and A. Maykov, Event detection and tracking in social streams, Proc. ICWSM 2009 (USA, 2009).
43. J. Weng and F. Lee, Event detection in Twitter, 5th Int. AAAI Conf. Weblogs and Social Media (2011), pp. 401–408.
44. M. Cordeiro, Twitter event detection: Combining wavelet analysis and topic inference summarization, Doctoral Symp. Informatics Engineering (2012).
45. G. Petkos, S. Papadopoulos, L. M. Aiello, R. Skraba and Y. Kompatsiaris, A soft frequent pattern mining approach for textual topic detection, 4th Int. Conf. Web Intelligence, Mining and Semantics (WIMS14) (2014), pp. 25:1–25:10.
46. F. Zarrinkalam, H. Fani, E. Bagheri, M. Kahani and W. Du, Semantics-enabled user interest detection from twitter, IEEE/WIC/ACM Web Intelligence Conf. (2015).
47. F. Zarrinkalam, H. Fani, E. Bagheri and M. Kahani, Inferring implicit topical interests on twitter, 38th European Conf. IR Research, ECIR 2016, Padua, Italy, March 20–23 (2016), pp. 479–491.
48. H. Fani, F. Zarrinkalam, E. Bagheri and W. Du, Time-sensitive topic-based communities on twitter, 29th Canadian Conf. Artificial Intelligence, Canadian AI 2016, Victoria, BC, Canada, May 31–June 3 (2016), pp. 192–204.
49. A. Culotta, Towards detecting influenza epidemics by analyzing twitter messages, KDD Workshop on Social Media Analytics (2010), pp. 115–122.
50. J. M. Paul and M. Dredze, A model for mining public health topics from twitter, Technical Report, Johns Hopkins University (2011).
51. E. V. D. Goot, H. Tanev and J. Linge, Combining twitter and media reports on public health events in MediSys, Proc. 22nd Int. Conf. World Wide Web Companion, International World Wide Web Conf. Steering Committee (2013), pp. 703–705.
52. E. Aramaki, S. Maskawa and M. Morita, Twitter catches the flu: Detecting influenza epidemics using Twitter, Proc. Conf. Empirical Methods in Natural Language Processing (2011), pp. 1568–1576.
53. F. Cheong and C. Cheong, Social media data mining: A social network analysis of tweets during the 2010–2011 Australian floods, PACIS, July (2011), pp. 1–16.
54. G. Fuchs, N. Andrienko, G. Andrienko, S. Bothe and H. Stange, Tracing the German centennial flood in the stream of tweets: First lessons learned, Proc. Second ACM SIGSPATIAL Int. Workshop on Crowdsourced and Volunteered Geographic Information (ACM, New York, NY, USA, 2013), pp. 31–38.
55. E. Medvet and A. Bartoli, Brand-related events detection, classification and summarization on twitter, IEEE/WIC/ACM Int. Conf. Web Intelligence and Intelligent Agent Technology (2012), pp. 297–302.
56. J. Si, A. Mukherjee, B. Liu, Q. Li, H. Li and X. Deng, Exploiting topic based twitter sentiment for stock prediction, ACL (2) (2013), pp. 24–29.
57. X. Yan, J. Guo, Y. Lan, J. Xu and X. Cheng, A probabilistic model for bursty topic discovery in microblogs, AAAI Conf. Artificial Intelligence (2015), pp. 353–359.
58. S. B. Kaleel, Event detection and trending in multiple social networking sites, Proc. 16th Communications & Networking Symp., Society for Computer Simulation International (2013).
59. G. Lejeune, R. Brixtel, A. Doucet and N. Lucas, Multilingual event extraction for epidemic detection, Artif. Intell. Med. 65(2), 131 (2015).
60. W. Gao, P. Li and K. Darwish, Joint topic modeling for event summarization across news and social media streams, Proc. 21st ACM Int. Conf. Information and Knowledge Management, CIKM '12 (New York, NY, USA, ACM, 2012), pp. 1173–1182.
61. P. Ferragina and U. Scaiella, Fast and accurate annotation of short texts with wikipedia pages, IEEE Softw. 29(1), 70 (2012).
62. S. Yang, A. Kolcz, A. Schlaikjer and P. Gupta, Large-scale high-precision topic modeling on twitter, Proc. 20th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (2014), pp. 1907–1916.
63. D. Duan, Y. Li, R. Li, R. Zhang, X. Gu and K. Wen, LIMTopic: A framework of incorporating link based importance into topic modeling, IEEE Trans. Knowl. Data Eng. (2013), 2493–2506.
64. M. JafariAsbagh, E. Ferrara, O. Varol, F. Menczer and A. Flammini, Clustering memes in social media streams, Soc. Netw. Anal. Min. (2014).
Community detection in social networks

Hossein Fani*,†,‡ and Ebrahim Bagheri*
*Laboratory for Systems, Software, and Semantics (LS3), Ryerson University, Toronto, ON, Canada
†University of New Brunswick, Fredericton, NB, Canada
‡[email protected]
Online social networks have become a fundamental part of the global online experience. They facilitate different modes of communication and social interactions, enabling individuals to play social roles that they regularly undertake in real social settings. In spite of the heterogeneity of the users and interactions, these networks exhibit common properties. For instance, individuals tend to associate with others who share similar interests, a tendency often known as homophily, leading to the formation of communities. This entry aims to provide an overview of the definitions for an online community and review different community detection methods in social networks. Finding communities is beneficial since they provide a summarization of network structure, highlighting the main properties of the network. Moreover, community detection has applications in sociology, biology, marketing and computer science, which helps scientists identify and extract actionable insight.

Keywords: Social network; community detection; topic modeling; link analysis.
1. Introduction

A social network is a net structure made up of social actors, mainly human individuals, and ties between them. Online social networks (OSNs) are online platforms that enable social actors, i.e., users, in spatially dispersed locations to build social relations. Online social networks facilitate different modes of communication and present diverse types of social interactions. They not only allow individual users to be connected and share content, but also provide the means for active engagement, which enables users to play social roles that they regularly undertake in real social settings. Such features have made OSNs a fundamental part of the global online experience, having pulled ahead of email.1 Given that individuals mimic their real-world ties and acquaintances in their online social preferences,2 the tremendous amount of information offered by OSNs can be mined through social network analysis (SNA) to help sociometrists, sociologists, and decision makers from many application areas with the identification of actionable insight.3,4 For instance, despite the heterogeneity of user bases and the variety of interactions, most of these networks exhibit common properties, including the small-world and scale-free properties.5,6 In addition, some users in the networks are better connected to each other than to the rest. In other words, individuals tend to associate with others who share similar interests in order to communicate news, opinions or other information of interest, as opposed to establishing sporadic connections; a tendency termed homophily, as a result of which communities emerge on social networks.7 Communities also occur in many other networked systems from biology to computer science to economics and politics, among others. Communities identify proteins that have the same function within the cell in protein networks,8 web pages about similar topics in the World Wide Web (WWW),9 functional modules such as cycles and pathways in metabolic networks,10 and compartments in food webs.11

The purpose of this entry is to provide an overview of the definition of community and review the different community detection methods in social networks. It is not an exhaustive survey of community detection algorithms. Rather, it aims at providing a systematic view of the fundamental principles.

2. Definition

The word community refers to a social context. People naturally tend to form groups, within their work environment, family, or friends. A community is a group of users who share similar interests, consume similar content or interact with each other more than with other users in the network. Communities are either explicit or latent. Explicit communities are known in advance and users deliberately participate in managing explicit communities, i.e., users create, destroy, subscribe to, and unsubscribe from them. For instance, Google's social network platform, Google+,a has Circles that allow users to put different people in specific groups. In contrast, in this entry, communities are meant to be latent. Members of latent communities do not tend to show explicit membership, and their similarity of interest lies within their social interactions.

a) plus.google.com
No universally accepted quantitative definition of community has been formulated yet in the literature. The notion of similarity based on which users are grouped into communities has been addressed differently in social network analysis. In fact, similarity often depends on the specific system at hand or the application one has in mind, no matter whether the connections are explicit. The similarity between pairs of users may be with respect to some reference property, based on part of the social network or the whole. Nonetheless, a required property of a community is cohesiveness. The more users gather into groups such that they are intra-group close (internal cohesion) and inter-group loose (external incoherence), the more the group would be considered as a community.

Moreover, in partitioned communities, each user is a member of one and only one community. However, in real networks users may belong to more than one community. In this case, one speaks of overlapping communities, where each user, being associated with a mixture, contributes partially to several or all communities in the network.

3. Application

Communities provide a summarization of network structure, highlighting the main properties of the network at a macro level; hence, they give insights into the dynamics and the overall status of the network. Community detection finds application in areas as diverse as sociology, biology, marketing and computer science. In sociology, it helps with understanding the formation of action groups in the real world, such as clubs and committees.12 Computer scientists study how information is disseminated in the network through communities. For instance, a community drives to connect like-minded people and encourages them to share more content. Further, grouping like-minded users who are also spatially near to each other may improve the performance of internet service providers, in that each community of users could be served by a dedicated mirror server.13 In marketing, companies can use communities to design targeted marketing: as the 2010 Edelman Trust Barometer Reportb found, 44% of users react to online advertisements if other users in their peer group have already done so. Also, communities are employed to discover previously unknown interests of users, alias implicit interest detection, which can be potentially useful in recommender systems to set up efficient recommendations.12

In a very recent concrete application, Customer Relationship Management (CRM) systems are empowered to tap into the power of social intelligence by looking at the collective behavior of users within communities in order to enhance client satisfaction and experience. As an example, customers often post their opinions, suggestions, criticisms or support requests through online social networks such as Twitterc or Facebook.d Customer service representatives would quickly identify the mindset of the customer that has called into the call center by a series of short questions. For such cases, appropriate techniques are required that would look at publicly available social and local customer data to understand their background so as to efficiently address their needs and work towards their satisfaction. Important data such as the list of influential users within the community, the position of a given user in relation to influential users, the impact of users' opinions on the community, the customer's social behavioral patterns, and the emergence of social movement patterns are of interest in order to customize the customer care experience for individual customers.14

b) www.edelman.co.uk/trustbarometer/files/edelmantrust-barometer-2010.pdf
c) twitter.com
d) www.facebook.com

4. Detection

Given a social network, at least two different questions may be raised about communities: (i) how to identify all communities, and (ii) given a user in the social network, what is the best community for the given user, if such a community exists. This entry addresses proposed approaches solving the former problem, known as community detection, also called community discovery or mining. The latter problem, known as community identification, is relevant but not aimed at here.

The problem of community detection is not well-defined since its main element, the concept of community, is not meticulously formulated. Some ambiguities are hidden and there are often many true answers to them. Therefore, there are plenty of methods in the literature and researchers do not try to ground the problem on a shared definition.

4.1. History

Probably the earliest account of research on community detection dates back to 1927. At the time, Stuart Rice studied the voting themes of people in small legislative bodies (less than 30 individuals). He looked for blocs based on the degree of agreement in casting votes within members of a group, called the Index of Cohesion, and between any two distinct groups, named the Index of Likeness.15 Later, in 1941, Davis et al.16 did a social anthropological study on the social activities of a small city and surrounding county of Mississippi over 18 months. They introduced the concept of caste to the earlier studies of community stratification by social class. They showed that there is a system of colored caste which parsed a community through rigid social ranks. The general approach was to partition the nodes of a network into discrete subgroup positions (communities) according to some equivalence definition. Meantime, George Homans showed that
social groups could be detected by reordering the rows and the columns of the matrix describing social ties until they form a block-diagonal shape.17 This procedure is now standard and mainly addressed as blockmodel analysis in social network analysis. The next analysis of community structure was carried out by Weiss and Jacobson in 1955,18 who searched for work groups within bureaucratic organizations based on attitudes and patterns of interactions. The authors collected the matrix of working relationships between members of an agency by means of private interviews. Each worker had been asked to list her co-workers along with the frequency, reason, subject, and importance of her contacts with them. In addition to reordering the matrix's rows and columns, work groups were separated by removing the persons working with people of different groups, i.e., liaison persons. This concept of liaison has received the name betweenness and is at the root of several modern algorithms of community detection.

4.2. Contemporaries

Existing community detection approaches can be broadly classified into two categories: link-based and content-based approaches. Link-based approaches, also known as topology-based, see a social network as a graph whose nodes are users and whose edges indicate explicit user relationships. On the other hand, content-based approaches mainly focus on the information content of the users in the social network to detect communities. Also called topic-based, the goal of these approaches is to detect communities formed around the topics extracted from users' information contents. Hybrid approaches incorporate both topological and topical information to find more meaningful communities with higher quality. Recently, researchers have performed longitudinal studies on the community detection task in which the social network is monitored at regular time intervals over a period of time.19,20 The time dimension opens up a new temporal version of community detection. The following sections include the details of some of the seminal works in each category.

4.2.1. Link analysis

Birds of a feather flock together. People tend to bond with similar others. The structures of ties in a network of any type, from friendship to work to information exchange, and other types of relationship are grounded on this tendency. Therefore, links between users can be considered as important clues for inferring their interest similarity and subsequently finding communities. This observation, which became the earliest reference guideline at the basis of most community definitions, was studied thoroughly long after its usage by McPherson et al.7 as the homophily principle: 'Similarity breeds connection'.

In link-based community detection methods, the social network is modeled by a graph with nodes representing social actors and edges representing relationships or interactions. The required cohesiveness property of communities is here reduced to connectedness, which means that connections within each community are dense and connections among different communities are relatively sparse. Respectively, primitive graph structures such as components and cliques are considered as promising communities.21 However, more meaningful communities can be detected based on graph partitioning (clustering) approaches, which try to minimize the number of edges between communities so that the nodes inside one community have more intra-connections than inter-connections with other communities. Most approaches are based on iterative bisection: continuously dividing one group into two groups, while the number of communities which should be in a network is unknown. In this respect, the Girvan–Newman approach has been used the most in link-based community detection.22 It partitions the graph by removing edges with high betweenness. The edge betweenness is the number of shortest paths that include an edge in a graph. In the proposed approach, the connectedness of the communities to be extracted is measured using modularity (Sec. 5). Other graph partitioning approaches include max-flow min-cut theory,23 the spectral bisection method,24 the Kernighan–Lin partition,25 and minimizing conductance cut.26

Link-based community detection can be viewed as a data mining/machine learning clustering, an unsupervised classification of users in a social network in which the proximity of data points is based on the topology of links. Then, unsupervised learning, which encompasses many other techniques such as k-means, mixture models, and hierarchical clustering, can be applied to detect communities.
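To make the link-analysis machinery above concrete, here is a minimal sketch using the Girvan–Newman implementation shipped with the networkx Python library; the benchmark graph and the stopping rule are illustrative assumptions, not part of any specific work surveyed here.

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman, modularity

# Link-based community detection via Girvan-Newman: repeatedly remove
# the edge with the highest betweenness and keep the partition with the
# best modularity (Sec. 5). Zachary's karate club is a standard toy
# benchmark shipped with networkx.
G = nx.karate_club_graph()

best_partition, best_q = None, float("-inf")
for partition in girvan_newman(G):
    q = modularity(G, partition)
    if q > best_q:
        best_partition, best_q = list(partition), q
    if len(partition) >= 10:   # illustrative stopping point
        break

print("number of communities:", len(best_partition))
print("modularity:", round(best_q, 3))
```

Each iteration removes the edges of highest betweenness; scanning the resulting partitions for the best modularity anticipates the quality measures discussed in Sec. 5.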
4.2.2. Content analysis

On the one hand, in spite of the fact that link-based techniques are intuitive and grounded on the sociological homophily principle, they fall short in identifying communities of users that share similar conceptual interests, due to two reasons among others. Firstly, many of the social connections are not based on users' interest similarity but on other factors, such as friendship and kinship, that do not necessarily reflect inter-user interest similarity. Secondly, many users who have similar interests do not share connections with each other.27 On the other hand, with the ever-growing size of online social networks, a lot of user-generated content, known as social content, is available on the networks, besides the links among users. Users maintain profile pages, write comments, share articles, tag photos and videos, and post their status updates. Therefore, researchers have explored the possibility of utilizing the topical similarity of social content to detect communities. They have proposed content- or topic-based community detection methods, irrespective of the social network structure, to detect like-minded communities of users.28

Most of the works in content-based community detection have focused on probabilistic models of textual content for detecting communities. For example, Abdelbary et al.29 have identified users' topics of interest and extracted topical communities using Gaussian Restricted Boltzmann Machines. Yin et al.30 have integrated community discovery with topic modeling in a unified generative model to detect communities of users who are coherent in both structural relationships and latent topics. In their framework, a community can be formed around multiple topics and a topic can be shared among multiple communities. Sachan et al.12 have proposed probabilistic schemes that incorporate users' posts, social connections, and interaction types to discover latent user communities in Twitter. In their work, they have considered three types of interactions: a conventional tweet, a reply tweet, and a retweet. Other authors have also proposed variations of Latent Dirichlet Allocation (LDA), for example, the Author-Topic model31 and the Community-User-Topic model,32 to identify communities.

Another stream of work casts the content-based community detection problem into a graph clustering problem. These works are based on a similarity metric which is able to compute the similarity of users based on their common topics of interest, and a clustering algorithm to extract groups of users (latent communities) who have similar interests. For example, Liu et al.33 have proposed a clustering algorithm based on topic distance between users to detect content-based communities in a social tagging network. In this work, LDA is used to extract hidden topics in tags. Peng et al.34 have proposed a hierarchical clustering algorithm to detect latent communities from tweets. They have used predefined categories in SINA Weibo and have calculated the pairwise similarity of users based on their degree of interest in each category. Like link-based methods, content-based community detection methods can be turned into data clustering in which communities are sets of points. The points, representing users, are close to each other inside versus outside the community with respect to a measure of distance or similarity defined for each pair of users. In this sense, closeness is the required cohesiveness property of the communities.
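The "users as points in topic space" view lends itself to a compact sketch. The pipeline below (LDA topic vectors per user, then k-means over those vectors) is a generic illustration of this family of methods rather than a reimplementation of any specific work cited above; the toy corpus and parameter values are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

# One document per user: the concatenation of that user's posts.
user_docs = [
    "stock market invest trading finance",
    "finance market shares invest portfolio",
    "football match goal league cup",
    "league goal striker football season",
]

# Represent users by their LDA topic-distribution vectors.
counts = CountVectorizer().fit_transform(user_docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
user_topics = lda.fit_transform(counts)   # shape: (n_users, n_topics)

# Cluster users in topic space; each cluster is a latent community.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(user_topics)
print(labels)   # e.g., [0 0 1 1]: two topic-based communities
```

Here closeness in topic space plays exactly the role of the cohesiveness property described above.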
4.2.3. Link jointly with content

Content-based methods are designed for regular documents and might suffer from the short, noisy, and informal social contents of some social networks such as Twitter or similar microblogging services. In such cases, the social content alone is not reliable enough information to extract true communities.35 Presumably, enriching social contents with social structure, i.e., links, does help with finding more meaningful communities. Several approaches have been proposed to combine link and content information for community detection. They have achieved better performance, as revealed in studies such as Refs. 36 and 37. Most of these approaches devise an integrated generative model for both link and content through shared latent variables for community memberships.

Erosheva et al.38 introduce Link-LDA, an overlapping community detection method to group scientific articles based on their abstract (content) and reference (link) parts. In their generative model, an article is assumed to be a coupled model for the abstract and the reference parts, each of which is characterized by LDA. They adopt the same bag-of-words assumption used in the abstract part for the reference part as well, named bag-of-references. Thus, articles that are similar in the abstract and the references tend to share the same topics. As opposed to Link-LDA, in which the citation links are treated as words, Nallapati et al.39 suggest to explicitly model the topical relationship between the text of the citing and cited documents. They propose Pairwise-Link-LDA to model the link existence between pairs of documents and have obtained better quality of topics by employing this additional information. Other approaches that utilize LDA to join link and content are Refs. 40 and 41. In addition to probabilistic generative models, there are other approaches, such as matrix factorization and kernel fusion for spectral clustering, that combine link and content information for community detection.42,43

4.2.4. Overlapping communities

The common approach to the problem of community detection is to partition the network into disjoint communities of members. Such approaches ignore the possibility that an individual may belong to two or more communities. However, many real social networks have communities with overlaps.44 For example, a person can belong to more than one social group, such as family groups and friend groups. Increasingly, researchers have begun to explore new methods which allow communities to overlap, namely overlapping communities. Overlapping community detection introduces a further variable, the membership of users in different communities, called covers. Since there is an enormous number of possible covers in overlapping communities compared to standard partitions, detecting such communities is expensive.

Some overlapping community detection algorithms utilize the structural information of users in the network to divide users of the network into different communities. The dominant algorithm in this trend is based on clique percolation theory,45 while LFM and OCG are based on local optimization of a fitness function over users' out/in links.46,47 Furthermore, some fuzzy community detection algorithms calculate the possibility of each node belonging to each community, such as SSDE and IBFO.48,49 Almost all algorithms need prior information to detect overlapping communities. For example, LFM needs a parameter to control the size of communities.
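networkx ships a k-clique-communities routine based on the clique percolation idea of Ref. 45, which makes the overlap explicit; the toy graph and the choice k = 3 below are assumptions for illustration.

```python
import networkx as nx
from networkx.algorithms.community import k_clique_communities

# Clique percolation: a community is the union of adjacent k-cliques
# (cliques sharing k-1 nodes). Nodes may appear in several communities,
# which yields overlapping covers.
G = nx.Graph()
G.add_edges_from([
    (1, 2), (1, 3), (2, 3),    # triangle A
    (3, 4), (3, 5), (4, 5),    # triangle B, overlapping A at node 3
    (6, 7),                    # a pendant pair, in no 3-clique
])

covers = list(k_clique_communities(G, 3))
print(covers)   # e.g., [frozenset({1, 2, 3}), frozenset({3, 4, 5})]
```

Node 3 belongs to both covers, while nodes 6 and 7 belong to none; this is exactly the behavior that disjoint partitioning methods cannot express.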
There are also some probabilistic approaches in which communities are latent variables with distributions on the entire user space, such as Ref. 50.

Recent studies, however, have focused on links. Initially suggested by Ahn et al.,51 link clustering finds communities of links rather than communities of users. The underlying assumption is that, while users can have many different relationships, the relationships within groups are structurally similar. By partitioning the links into non-overlapping groups, each user can participate in multiple communities by inheriting the community assignments of its links. The link clustering approach significantly speeds up the discovery of overlapping communities.

4.2.5. Temporal analysis

The above methods do not incorporate temporal aspects of users' interests and undermine the fact that users of communities would ideally show similar contribution or interest patterns for similar topics over time. The work by Hu et al.19 is one of the few that considers the notion of temporality. The authors propose a unified probabilistic generative model, namely GrosToT, to extract temporal topics and analyze topics' temporal dynamics in different communities. Fani et al.20 follow the same underlying hypothesis related to topics and temporality to find time-sensitive communities. They use time series analysis to model users' temporal dynamics. While GrosToT is primarily dependent on a variant of LDA for topic detection, the unique way of user representation in Ref. 20 provides the flexibility of being agnostic to any underlying topic detection method.
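The time-series representation of users can be sketched compactly: model each user as a topic-by-time matrix of interest weights and compare users by the similarity of those series. The data and similarity choice below are assumptions for illustration, not the models of Refs. 19 and 20.

```python
import numpy as np

# Each user: a (topics x time intervals) matrix of interest weights,
# e.g., how strongly the user posted about each topic in each week.
rng = np.random.default_rng(1)
users = {f"u{i}": rng.random((3, 8)) for i in range(4)}   # toy data

def temporal_similarity(a, b):
    # Cosine similarity of the flattened topic-time series: two users
    # are alike only if they care about the same topics at the same times.
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

names = list(users)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        s = temporal_similarity(users[names[i]], users[names[j]])
        print(names[i], names[j], round(s, 2))
# A standard clustering algorithm over this similarity matrix then
# yields time-sensitive topic-based communities.
```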
5. Quality Measure

The standard procedure for evaluating the results of a community detection algorithm is assessing the similarity between the results and the ground truth that is known for benchmark datasets. These benchmarks are typically small real-world social networks or synthetic ones. Similarity measures can be divided into two categories: measures based on pair counting and measures based on information theory. A thorough introduction of similarity measures for communities has been given in Ref. 52. The first type of measures, based on pair counting, depends on the number of pairs of vertices which are classified in the same (different) communities in the ground truth and in the result produced by the community detection method. The Rand index is the ratio of the number of user pairs correctly classified in both the ground truth and the result, either in the same or in different communities, over the total number of pairs.53 The Jaccard index is the ratio of the number of user pairs classified in the same community in both the ground truth and the result, over the number of user pairs which are classified in the same community in either the result or the ground truth. Both the Rand and the Jaccard index are adjusted for random grouping, in that a null model is introduced. The normal value of the index is subtracted from the expectation value of the index in the null model, and the result is normalized to [0, 1], yielding 0 for independent partitions and 1 for identical partitions. The second type of similarity measures models the problem of comparing communities as a problem of message decoding in information theory. The idea is that, if the output communities are similar to the ground truth, one needs very little information to infer the result given the ground truth. The extra (less) information can be used as a measure of (dis)similarity. The normalized mutual information is currently very often used in this type of evaluation.54 The normalized mutual information reaches 1 if the result and the ground truth are identical, whereas it has an expected value of 0 if they are independent. These measures have been recently extended to the case of overlapping communities, such as in the work by Lancichinetti et al.55

Ground truth is not available in most cases of real-world applications, and there is no well-defined criterion for evaluating the resulting communities. In such cases, quality functions are defined as a quantitative measure to assess the communities. The most popular quality function is the modularity introduced by Newman and Girvan.22 It is based on the idea that a random network is not expected to have a modular structure; communities emerge as the network deviates from a random network. Therefore, the more the density of links in the actual community exceeds the expected density when the users are connected randomly, the more modular a community is. Simply put, the modularity of a community is the number of links within communities minus the expected number of such links. Evidently, the expected link density depends on the chosen null model. One simple null model would be a network with the same number of links as the actual network, where links are placed between any pair of users with uniform probability. However, this null model yields a Poissonian degree distribution, which is not a true descriptor of real networks. With respect to the modularity, high values imply good partitions. So, a community with maximum modularity in a network should be near the best one. This ignites a class of community detection which is based on modularity maximization. While the application of modularity has been questioned, it continues to be the most popular and widely accepted measure of the fitness of communities.

As another quality function, the conductance of the community was chosen by Leskovec et al.26 The conductance of a community is the ratio between the cut size of the community and the minimum between the total degree of the community and that of the rest of the network. So, if the community is much smaller than the whole network, the conductance equals the ratio between the cut size and the total degree of the community. A good community is characterized by a low cut size and a large internal density of links, which result in low values of the conductance. For each real network, Leskovec et al. have carried out a systematic analysis of the quality of communities of various sizes. They derived the network community profile plot (NCPP), showing the minimum conductance score among subgraphs of a given size as a function of the size. They found that communities are well defined only when they are fairly small in size. Such small communities are weakly connected to the rest of the network, often by a single edge (in this case, they are called whiskers), and form the periphery of the network. The fact that the best communities appear to have a characteristic size of about 100 users is consistent with Dunbar's conjecture that 150 is the upper size limit for a working human community.56
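All three kinds of measures discussed in this section can be computed with standard Python tooling. The sketch below evaluates an assumed two-way partition of a benchmark graph: modularity via networkx, conductance straight from its definition, and normalized mutual information via scikit-learn.

```python
import networkx as nx
from networkx.algorithms.community import modularity, kernighan_lin_bisection
from sklearn.metrics import normalized_mutual_info_score

G = nx.karate_club_graph()

# An assumed two-way partition (networkx's Kernighan-Lin bisection).
part = kernighan_lin_bisection(G, seed=0)
print("modularity:", round(modularity(G, part), 3))

# Conductance of one community: cut size over the minimum of the two
# total degrees, exactly as defined in the text.
def conductance(G, S):
    S = set(S)
    cut = sum(1 for u, v in G.edges() if (u in S) != (v in S))
    deg_S = sum(d for _, d in G.degree(S))
    deg_rest = sum(d for _, d in G.degree()) - deg_S
    return cut / min(deg_S, deg_rest)

print("conductance:", round(conductance(G, part[0]), 3))

# NMI against a "ground truth": here, the club attribute that ships
# with the karate club graph.
truth = [G.nodes[n]["club"] for n in G]
labels = [0 if n in part[0] else 1 for n in G]
print("NMI:", round(normalized_mutual_info_score(truth, labels), 3))
```

Low conductance, high modularity and high NMI all point in the same direction here, but on larger networks the three measures can disagree, which is one reason the choice of quality function remains debated.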
6. Future Direction

No doubt community detection has matured and social network analysts have achieved an in-depth knowledge of communities, their emergence and evolution, in real social networks, but interesting challenges are still yet to be addressed. As hinted in the introduction (Sec. 1), the quest for a single correct definition of network communities and a single accurate community detection method seems to be futile, which is not necessarily a problem. Years of endeavors have resulted in many methods for finding communities based on a variety of principles. The picture that is emerging is that the choice of community detection algorithm depends on the properties of the network under study. In many ways, the problem of community detection has a parallel in the more mature topic of clustering in computer science, where a variety of methods exist, each one with standard applications and known issues. As a consequence, one challenge for the task of community detection is about distinguishing between existing ones. One way is to compare algorithms on real networks where the network's metadata is available. In the case of social networks, for example, we can use demographic and geographic information of users. One example of such a metadata-based evaluation has been done in Ref. 57, but a standard framework for evaluation has yet to emerge.

Moreover, networks are dynamic, with a time stamp associated with links, users, and the social contents. As seen, almost all measures and methods ignore temporal information. It is also inefficient to apply such community detection algorithms and measures to static snapshots of the social network in each time interval. In order to truly capture the properties of a social network, the methods have to analyze data in their full complexity, fueled with the time dimension. This trend has slowly started.19,20 What makes the problem more complex is the fact that online social network data arrive in a streaming fashion, especially the social contents. The states of the network need to be updated in an efficient way on the fly, in order to avoid a bottleneck in the processing pipeline. To date, only a small number of works have approached this problem directly.58

The last area is the computational complexity, in which the current methods will need dramatic enhancement, particularly with the ever-increasing size of current online social networks. As an example, Facebook has close to one billion active users who collectively spend twenty thousand years online in one day sharing information. Meanwhile, there are also 340 million tweets sent out by Twitter users. A community detection algorithm needs to be efficient and scalable, taking a practical amount of time to finish when applied on such large-scale networks. Many existing methods are only applicable to small networks. Providing fast and scalable versions of community detection methods is one proposed direction worthy of significant future efforts. One solution to deal with large-scale networks is sampling. The goal is to reduce the number of users and/or links while keeping the underlying network structure. Network sampling is done in the preprocessing step and is independent of the subsequent steps in community detection algorithms. Hence, it provides a performance improvement to all community detection algorithms. Although sampling seems to be straightforward and easy to implement, it has a direct impact on the results of community detection, in terms of both accuracy and efficiency. It has been shown that naively sampling users or links by a uniform distribution will bring bias into the output sampled network, which will affect the results negatively.59 One of the challenges going forward in social network analysis will be to provide sampling methods and, particularly, to take sampling into account in the overall performance analysis of community detection methods.
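A minimal sketch of the sampling step follows, using exactly the naive uniform node sampling that the text warns against; it is shown as the baseline a better sampler would have to beat, and the graph and sampling fraction are assumptions.

```python
import random
import networkx as nx

def uniform_node_sample(G, fraction=0.1, seed=0):
    """Keep a uniform random fraction of nodes and the induced subgraph.
    Naive baseline: known to bias the degree distribution and thus the
    downstream community structure (cf. Ref. 59)."""
    rng = random.Random(seed)
    k = max(1, int(fraction * G.number_of_nodes()))
    kept = rng.sample(list(G.nodes()), k)
    return G.subgraph(kept).copy()

G = nx.barabasi_albert_graph(10_000, 3, seed=0)   # assumed scale-free network
S = uniform_node_sample(G, fraction=0.1)
print(S.number_of_nodes(), S.number_of_edges())
# Community detection would then run on S instead of G, trading
# accuracy for speed; the sampler choice directly shapes the result.
```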
References

1. F. Benevenuto et al., Characterizing user behavior in online social networks, Proc. 9th ACM SIGCOMM Conf. Internet Measurement Conf. (ACM, 2009).
2. P. Mateos, Demographic, ethnic, and socioeconomic community structure in social networks, Encyclopedia of Social Network Analysis and Mining (Springer New York, 2014), pp. 342–346.
3. R. Claxton, J. Reades and B. Anderson, On the value of digital traces for commercial strategy and public policy: Telecommunications data as a case study, World Economic Forum Global Information Technology Report 2012: Living in a Hyperconnected World (World Economic Forum, 2012), pp. 105–112.
4. A. Stevenson and J. Hamill, Social media monitoring: A practical case example of city destinations, Social Media in Travel, Tourism and Hospitality (Ashgate, Farnham, 2012), pp. 293–312.
5. D. J. Watts and S. H. Strogatz, Collective dynamics of 'small-world' networks, Nature 393, 440 (1998).
6. A. L. Barabasi and R. Albert, Emergence of scaling in random networks, Science 286, 509 (1999).
7. M. McPherson, L. Smith-Lovin and J. M. Cook, Birds of a feather: Homophily in social networks, Ann. Rev. Sociol. 415 (2001).
8. J. Chen and B. Yuan, Detecting functional modules in the yeast protein–protein interaction network, Bioinformatics 22, 2283 (2006).
9. Y. Dourisboure, F. Geraci and M. Pellegrini, Extraction and classification of dense communities in the web, Proc. 16th Int. Conf. World Wide Web (ACM, 2007).
10. R. Guimera and L. A. N. Amaral, Functional cartography of complex metabolic networks, Nature 433, 895 (2005).
11. A. Krause et al., Compartments revealed in food-web structure, Nature 426, 282 (2003).
12. M. Sachan, D. Contractor, T. A. Faruquie and L. V. Subramaniam, Using content and interactions for discovering communities in social networks, 21st Int. Conf. World Wide Web (WWW '12) (2012), pp. 331–340.
13. B. Krishnamurthy and J. Wang, On network-aware clustering of web clients, ACM SIGCOMM Computer Commun. Rev. 30, 97 (2000).
14. Y. Richter, E. Yom-Tov and N. Slonim, Predicting customer churn in mobile networks through analysis of social groups, SDM (SIAM, 2010), pp. 732–741.
15. S. A. Rice, The identification of blocs in small political bodies, Am. Political Sci. Rev. 21, 619 (1927).
16. A. Davis et al., Deep South: A Social Anthropological Study of Caste and Class (University of Chicago Press, 1941).
17. G. C. Homans, The Human Group (1950).
18. R. S. Weiss and E. Jacobson, A method for the analysis of the structure of complex organizations, Am. Sociol. Rev. 20, 661 (1955).
19. Z. Hu, Y. Junjie and B. Cui, User group oriented temporal dynamics exploration, AAAI (2014).
20. H. Fani et al., Time-sensitive topic-based communities on twitter, Canadian Conf. Artificial Intelligence (Springer International Publishing, 2016).
21. S. Fortunato, Community detection in graphs, Phys. Rep. 486, 75 (2010).
22. M. Girvan and M. E. J. Newman, Community structure in social and biological networks, Proc. Natl. Acad. Sci. 99, 7821 (2002).
23. L. R. Ford and D. R. Fulkerson, Maximal flow through a network, Can. J. Math. 8, 399 (1956).
24. A. Pothen, H. D. Simon and K.-P. Liou, Partitioning sparse matrices with eigenvectors of graphs, SIAM J. Matrix Anal. Appl. 11, 430 (1990).
25. B. W. Kernighan and S. Lin, An efficient heuristic procedure for partitioning graphs, Bell System Tech. J. 49, 291 (1970).
26. J. Leskovec, J. Kleinberg and C. Faloutsos, Graphs over time: Densification laws, shrinking diameters and possible explanations, Proc. Eleventh ACM SIGKDD Int. Conf. Knowledge Discovery in Data Mining (ACM, 2005).
27. Q. Deng, Z. Li, X. Zhang and J. Xia, Interaction-based social relationship type identification in microblog, Int. Workshop on Behavior and Social Informatics and Computing (2013), pp. 151–164.
28. N. Natarajan, P. Sen and V. Chaoji, Community detection in content-sharing social networks, IEEE/ACM Int. Conf. Advances in Social Networks Analysis and Mining (2013), pp. 82–89.
29. H. A. Abdelbary, A. M. ElKorany and R. Bahgat, Utilizing deep learning for content-based community detection, Science and Information Conf. (2014), pp. 777–784.
30. Z. Yin, L. Cao, Q. Gu and J. Han, Latent community topic analysis: Integration of community discovery with topic modeling, ACM Trans. Intell. Syst. Technol. (TIST) 3(4) (2012).
31. M. Rosen-Zvi, T. Griffiths, M. Steyvers and P. Smyth, The author-topic model for authors and documents, 20th Conf. Uncertainty in Artificial Intelligence (2004), pp. 487–494.
32. D. Zhou, E. Manavoglu, J. Li, C. L. Giles and H. Zha, Probabilistic models for discovering e-communities, 15th Int. Conf. World Wide Web (2006), pp. 173–182.
33. H. Liu, H. Chen, M. Lin and Y. Wu, Community detection based on topic distance in social tagging networks, TELKOMNIKA Indonesian J. Electr. Eng. 12(5), 4038.
34. D. Peng, X. Lei and T. Huang, DICH: A framework for discovering implicit communities hidden in tweets, J. World Wide Web (2014).
35. T. Yang et al., Combining link and content for community detection: A discriminative approach, Proc. 15th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (ACM, 2009).
36. D. Cohn and T. Hofmann, The missing link — a probabilistic model of document content and hypertext connectivity, NIPS (2001).
37. L. Getoor, N. Friedman, D. Koller and B. Taskar, Learning probabilistic models of link structure, J. Mach. Learn. Res. 3 (2002).
38. E. Erosheva, S. Fienberg and J. Lafferty, Mixed membership models of scientific publications, Proc. Natl. Acad. Sci. 101 (2004).
39. R. M. Nallapati, A. Ahmed, E. P. Xing and W. W. Cohen, Joint latent topic models for text and citations, KDD (2008).
40. L. Dietz, S. Bickel and T. Scheffer, Unsupervised prediction of citation influences, ICML (2007).
41. A. Gruber, M. Rosen-Zvi and Y. Weiss, Latent topic models for hypertext, UAI (2008).
42. S. Zhu, K. Yu, Y. Chi and Y. Gong, Combining content and link for classification using matrix factorization, SIGIR (2007).
43. S. Yu, B. D. Moor and Y. Moreau, Clustering by heterogeneous data fusion: Framework and applications, NIPS Workshop (2009).
44. J. Xie, S. Kelley and B. K. Szymanski, Overlapping community detection in networks: The state of the art and comparative study, ACM Comput. Surv. 45 (2013), doi: 10.1145/2501654.2501657.
45. G. Palla, I. Derenyi, I. Farkas and T. Vicsek, Uncovering the overlapping community structure of complex networks in nature and society, Nature 435, 814 (2005).
46. A. Lancichinetti, S. Fortunato and J. Kertesz, Detecting the overlapping and hierarchical community structure in complex networks, New J. Phys. 11, 033015 (2009), doi: 10.1088/1367-2630/11/3/033015.
47. E. Becker, B. Robisson, C. E. Chapple, A. Guenoche and C. Brun, Multifunctional proteins revealed by overlapping clustering in protein interaction network, Bioinformatics 28, 84 (2012).
48. M. Magdon-Ismail and J. Purnell, SSDE-Cluster: Fast overlapping clustering of networks using sampled spectral distance embedding and GMMs, Proc. 3rd Int. Conf. Social Computing (SocialCom/PASSAT), Boston, MA, USA (IEEE Press, 2011), pp. 756–759, doi: 10.1109/PASSAT/SocialCom.2011.237.
49. X. Lei, S. Wu, L. Ge and A. Zhang, Clustering and overlapping modules detection in PPI network based on IBFO, Proteomics 13, 278 (2013).
50. W. Ren et al., Simple probabilistic algorithm for detecting community structure, Phys. Rev. E 79, 036111 (2009).
51. Y.-Y. Ahn, J. P. Bagrow and S. Lehmann, Link communities reveal multiscale complexity in networks, Nature 466, 761 (2010).
52. M. Meilă, Comparing clusterings — an information based distance, J. Multivariate Anal. 98, 873 (2007).
53. W. M. Rand, Objective criteria for the evaluation of clustering methods, J. Am. Stat. Assoc. 66, 846 (1971).
54. L. Danon et al., Comparing community structure identification, J. Stat. Mech.: Theory Exp. 09, P09008 (2005).
55. A. Lancichinetti, S. Fortunato and J. Kertesz, Detecting the overlapping and hierarchical community structure in complex networks, New J. Phys. 11, 033015 (2009).
56. R. Dunbar, Grooming, Gossip, and the Evolution of Language (Harvard University Press, Cambridge, USA, 1998).
57. Y.-Y. Ahn, J. P. Bagrow and S. Lehmann, Link communities reveal multi-scale complexity in networks, Nature (2010).
58. C. C. Aggarwal, Y. Zhao and S. Y. Philip, On clustering graph streams, SDM (SIAM, 2010), pp. 478–489.
59. Y. Ruan et al., Community discovery: Simple and scalable approaches, User Community Discovery (Springer International Publishing, 2015), pp. 23–54.
High-level surveillance event detection

Fabio Persia*,‡ and Daniela D'Auria†,§
*Faculty of Computer Science, Free University of Bozen-Bolzano, Piazza Domenicani 3, Bozen-Bolzano, 39100, Italy
†Department of Electrical Engineering and Information Technology, University of Naples Federico II, Via Claudio 21, Naples, 80125, Italy
‡[email protected]
§[email protected]
Security has been raised at major public buildings in the most famous and crowded cities all over the world following the terrorist attacks of recent years, the latest at the Promenade des Anglais in Nice. For that reason, video surveillance systems have become more and more essential for detecting and hopefully even preventing dangerous events in public areas. In this work, we present an overview of the evolution of high-level surveillance event detection systems along with a prototype for anomaly detection in the video surveillance context. The whole process is described, starting from the video frames captured by sensors/cameras until, at the end, some well-known reasoning algorithms for finding potentially dangerous activities are applied.

Keywords: Video surveillance; anomaly detection; event detection.
1. Problem Description

In recent years, the modern world's need for safety has caused a rapid spread of video surveillance systems; these systems are located especially in the most crowded places. The main purpose of a video surveillance system is to create automatic tools which can extend the faculties of human perception, allowing the collection and real-time analysis of data coming from many electronic "viewers" (sensors, cameras, etc.).

One of the main limits of modern security systems is that most of them have been designed for specific functionalities and contexts: they generally use only one kind of sensor (such as cameras, motes, scanners), which cannot notice all the possible important phenomena connected to the observation context. A second and not negligible limit is that the "semantics" of the phenomena (events) that such systems can notice is quite limited and, as well, these systems are not very flexible when we want to introduce new events to be identified. For example, a typical video surveillance system at the entrance of a tunnel uses a set of cameras monitoring train transit and the possible presence of objects in the scene. When a person transits on the tracks, we want the system to automatically identify the anomalous event and to signal it to a keeper. The commonest "image processing" algorithms (which can be directly implemented on a camera processor or stored on a dedicated server that processes the information sent by a camera) can quite precisely identify the changes between a frame and the next one and, in this way, discover the potential presence of anomalies (train transit, presence of a person, ...) in the scene. In the scene analysis, such a system does not consider all the environmental parameters, such as brightness, temperature and so on, and how these parameters can modify the surveys (the identification of a small object in the scene is more complex at night); as well, this system cannot identify semantic higher-level events (such as a package left near a track) with the same precision and reliability. Similarly, a traditional video surveillance system can discover, in a bank, the presence of objects near the safe, but cannot automatically notify an event interesting for the context, such as a "bank robbery" (every time an object is near the safe, an alarm should be generated; in this way, nevertheless, false alarms could also be generated when the bank clerk goes into the safe room to take some money).

In the end, we want a modern video surveillance system to attain the following points: to integrate heterogeneous information coming from different kinds of sensors, to be flexible in its capability to discover all possible events that can happen in the monitored environment, and to be adaptable to the context features of the observed scene. From a technological point of view, the main requirements of this kind of system are: heterogeneity of the adopted surveying systems, heterogeneity of the noticed data and of those to be processed, and wiring of devices and communication with servers dedicated to processing. However, the design and development of a complete framework addressing all the issues listed above is unfortunately a very challenging task.

In this work, we present a prototype of a framework for anomaly detection in the video surveillance context. The whole process is described: thus, we start from the video frames captured by sensors/cameras and then, after several steps, we apply some well-known reasoning algorithms1,2 for finding high-level unexplained activities in time-stamped observation data.
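To make the low-level change-detection step mentioned above concrete, the following is a minimal frame-differencing sketch in Python/NumPy; the threshold and the minimum changed-pixel count are illustrative assumptions, not parameters of the presented prototype.

```python
import numpy as np

def changed_region(prev_frame, curr_frame, diff_threshold=25, min_pixels=500):
    """Naive change detection between two consecutive grayscale frames.
    Returns True when enough pixels changed, i.e., a potential anomaly
    (train transit, person on the tracks, ...) that the higher layers
    must then interpret semantically."""
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    mask = diff > diff_threshold             # per-pixel change mask
    return int(mask.sum()) >= min_pixels     # enough evidence of motion?

# Toy usage with synthetic 480x640 frames.
rng = np.random.default_rng(0)
prev = rng.integers(0, 255, (480, 640), dtype=np.uint8)
curr = prev.copy()
curr[100:200, 100:200] = 255                 # a bright "object" appears
print(changed_region(prev, curr))            # True
```

As the text notes, such pixel-level triggers alone cannot distinguish a bank robbery from a clerk entering the safe room; that disambiguation is exactly what the higher, semantic layers of the framework are for.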
high-level unexplained activities in time-stamped observation data.

The remainder of the paper is organized as follows. Section 2 deals with the evolution of video surveillance systems up to the proposed general method (third generation surveillance systems, 3GSS), while Sec. 3 describes in detail a possible implementation of a prototype following the general method. Finally, Sec. 4 discusses some conclusions and possible future improvements.

2. Method presentation

The video surveillance systems proposed in the literature can be classified into three categories (or generations3) from a technological point of view. The three generations have followed the evolution of communication, image processing and data storage techniques, and they have been evolving with the same rapidity as these techniques.

First generation surveillance systems (1GSS) extend human perception capability from a spatial point of view: a set of cameras (the sensor layer) is used to capture visual signals from different positions in a monitored environment. Such signals are then transmitted in analog form and visualized by operators in a single location (the control room). The most considerable disadvantages of 1GSS are due to the operators' short span of attention, which is responsible for a high rate of missed recordings of important events.4

From the early 80's, because of the increasing interest of research in video processing and in order to improve the basic technologies in this area, scientists obtained a sensible improvement in camera resolution and, at the same time, a reduction of hardware (computers, memories, ...) costs. Most of the research made during the second generation surveillance systems (2GSS) period improved the development of automatic techniques called automated event detection. These techniques have made the monitoring of very large areas easier, because they act as pre-filters of defined events. The 2GSS systems are characterized by a good level of digitalization of signal transmission and processing, in the sense that the systems include digital components in some parts of their architecture.

The main goal of 3GSS is to obtain, manage and efficiently transmit, in real time, video events noticed by a large set of sensors through a fully digital approach; this approach uses digital components in all layers of the system architecture, from the sensor layer to the visual and codified presentation of information to operators.3 In a 3GSS system, cameras communicate with some processing and transmission devices: in this way, intelligent cameras are built. A network layer, whose principal component is an intelligent hub, has the purpose of assembling data coming from different cameras. So, we can say that the goal of an automatic video surveillance system is to act as a pre-filter for human validation of suspicious events. Such pre-filtering is generally based on video processing and provides some important parameters for object localization and for tracking objects' trajectories while they are in the monitored environment. Figure 1 shows a schematization of the logical architecture of a 3GSS video surveillance system, which can be assumed as the general method to be applied for building effective and efficient video surveillance systems.

The sensor layer is composed of one or more fixed or mobile cameras, whose purpose is to collect images to be sent to the image processing system (IPS). Images captured by cameras are stored in a specific video database (Video Repository)13,14 and then sent as input to the IPS, which processes them. This module extracts low-level information (such as the presence of new objects in the scene, their position, and so on) through image processing algorithms; it also converts this information into a format conformable to the syntax used by higher layers.
Fig. 1. Video surveillance system architecture.
Then, an event description language (EDL) has been defined on the basis of the image processing algorithms; through this language, it is possible to formalize a complex event in a strict way. The data organized in this way are stored in a specific area of the database (Data Collection).

A post-processing framework, called the high semantic reasoning center (HSRC), is the part of the system responsible for surveying occurrences of complex events, through the processing of the low-level data made available by the IPS. We can identify the following components:

• Event Repository: It is the part of the database in which predicates and complex event definitions are stored; information about the event occurrences surveyed in the video is stored in it, too.
• Data Collection: It is the part of the database that collects the output of the IPS framework, organized according to the EDL syntax.
• Agent Based Processor (ABP): Its main aims are to fetch the definition of the event of interest from the Event Repository, to fetch the observation, that is, the video description in terms of predicates, from Data Collection, and to verify the event occurrence during the observation.
• Subvideo Extractor: When an event is detected, this framework extracts from the video the frame sequence of interest and saves it in the Video Repository as a new file; in this way, the sequence is made available for on-line and off-line visualizations.
• Query Builder: The framework assigned to client service creation; it organizes the parameters on which the behavior of the ABP processor is based. Management services are built on the language capability and on the algorithms available in the IPS.

The system presents a set of services to final clients through user-friendly interfaces. Such services can be classified into two different categories:

(1) Management services: The system manager can define new kinds of primitive and complex events and can extend the image processing algorithm suite of the IPS framework.
(2) Client services: The client can specify the working system parameters, based for example on the alert mode. He can visualize on-line and off-line video sequences corresponding to alarms detected by the system; he can also visualize whole stored videos and compute some statistics on detected event occurrences.

In the last years, many frameworks have been developed to identify anomalies or suspicious events in video sequences. In Ref. 10, the authors present a framework for detecting complex events through an inference process based on Markov logic networks (MLNs) and rule-based event models. Another approach has been employed by Zin et al.,11 who propose an integrated framework for detecting suspicious behaviors in video surveillance systems exploiting multiple background modeling techniques, high-level motion feature extraction methods and embedded Markov chain models. Moreover, Helmer and Persia12 propose a framework for high-level surveillance event detection exploiting a language based on relational algebra extended by intervals.

3. The Proposed Implementation

In this section, we describe the prototype designed and developed for finding anomalous activities in the video surveillance context following the general method shown in Sec. 2. The architecture of the proposed prototype (Fig. 2) consists of the following layers: an Image Processing Library, a Video Labeler, an Activity Detection Engine and the Unexplained Activities Problem (UAP) Engine, which implements the algorithms for video anomaly detection. In particular, the Image Processing Library analyzes the video captured by sensors/cameras and returns the low-level annotations for each video frame as output; the Video Labeler fills the semantic gap between the low-level annotations captured for each frame and the high-level annotations, representing high-level events
Fig. 2. The prototype architecture.
Fig. 3. A video frame from ITEA-CANDELA dataset.
that can be associated to the video frames. Then, we used an Activity Detection Engine to find activity occurrences matching the well-known models, which can be classified into good and bad ones: such a module takes as inputs the high-level annotations previously caught by the Video Labeler and the stochastic activity models; eventually, the Unexplained Activity Problem (UAP) Engine described in Refs. 1 and 2 takes as input the activity occurrences previously found with the associated probabilities and the high-level annotations, and discovers the Unexplained Video Activities.

3.1. The image processing library

The image processing library used in our prototype implementation is the reading people tracker (RPT),5 which achieves good accuracy in object detection and tracking. RPT takes the frame sequence of the video as input and returns an XML file describing the low-level annotations caught in each frame, according to a standard schema defined in an XML Schema. We made only a few updates to the RPT's source code, in order to more easily obtain the type of each object detected in a frame (person, package, car). For instance, Fig. 4 shows the low-level annotations associated to frame number 18 (Fig. 3) of a video belonging to the ITEA-CANDELA dataset,a which has been used to carry out some preliminary experiments. As we can see in Fig. 4, the RPT correctly identifies two objects (represented by the XML elements called track) in the frame shown in Fig. 3: the former, identified by ID = 5, is a person (type = 5), while the latter, identified by ID = 100, is a package (type = 6). The XML attribute type of the element track denotes the type of the detected object.

3.2. The video labeler

As mentioned above, the Video Labeler fills the semantic gap between the low-level annotations captured for each frame and the high-level annotations. Through the Video Labeler, some high-level events, called action symbols, are detected together with the related timestamps; the output of the Video Labeler is thus the list of action symbols related to the considered video source. The Video Labeler has been implemented in the Java programming language: it uses the DOM libraries to parse the XML file containing the output of the Image Processing Library. The Video Labeler defines the rules that have to be checked to verify the presence of each high-level atomic event of interest in the video. A Java method has been defined for each action symbol we want to detect, containing the related rules. Listed below are some examples of rules defined to detect some atomic events (action symbols) in a video belonging to the ITEA-CANDELA dataset.

Action Symbol A: A person P goes into the central zone with the package

• There are at least two objects in the current frame;
• at least one of the objects is a person;
• at least one of the objects is a package;
• the person identified appears on the scene for the first time;
• the distance between the person's barycenter and the package one is smaller than a specific distance threshold.

Action Symbol B: A person P drops off the package

• There are at least two objects in the current frame;
• at least one of the objects is a person;
• at least one of the objects is a package;
• the person was previously holding a package;

a http://www.multitel.be/~va/candela/abandon.html.
Fig. 4. The related low level annotations.
• the distance between the person's barycenter and the package one is smaller than a specific distance threshold.

Action Symbol C: A person P goes into the central zone

• There is at least one object in the current frame;
• at least one of the objects is a person;
• the person identified appears on the scene for the first time;
• if there are also some packages on the scene, their distances are greater than a specific distance threshold.

Action Symbol D: A person P picks up the package

• There are at least two objects in the current frame;
• at least one of the objects is a person;
• at least one of the objects is a package;
• the distance between the person's barycenter and the package one is smaller than a specific distance threshold;
• the person was not previously holding a package.

Action Symbol E: A person P1 gives the package to another person P2

• There are at least three objects in the current frame;
• at least two of the objects are persons;
• at least one of the objects is a package;
• P1 was previously holding a package;
• in the current frame, both the distances of P1's and P2's barycenters from the package are smaller than a specific distance threshold;
• in the next frames, P1's distance from the package is greater than the threshold, while P2's one is smaller (meaning that P2 has got the package and P1 is no longer holding it).

Action Symbol F: A person P goes out of the central zone with the package

• This symbol is detected when a person holding a package does not appear anymore on the scene for a specified time to live (TTL).

3.3. The activity detection engine

The Activity Detection Engine finds activity occurrences matching the well-known models: such a module takes as inputs the list of action symbols previously caught by the Video Labeler and the stochastic activity models, and finally returns the list of the discovered activity occurrences with the related probabilities. To reach this goal, a specific software called temporal multi-activity graph index creation (tMAGIC), which is the implementation of a theoretical model presented in Ref. 6, has been used.

As a matter of fact, the tMAGIC approach6 addresses the problem of efficiently detecting occurrences of high-level activities from such interleaved data streams. It proposes a temporal probabilistic graph, so that the elapsed time between observations also plays a role in defining whether a sequence of observations constitutes an activity. First, a data structure called the temporal multiactivity graph was proposed to store multiple activities that need to be concurrently monitored. Then, an index called tMAGIC was defined; it exploits the data structure just described to examine and link observations as they occur. Algorithms for insertion and bulk insertion into the tMAGIC index were also defined, showing that this can be efficiently accomplished.
In this approach, the algorithms are basically designed to solve two problems: the evidence problem, which tries to find all occurrences of an activity (with probability over a threshold) within a given sequence of observations, and the identification problem, which tries to find the activity that best matches a sequence of observations. Some methods of reducing complexity and pruning strategies have been introduced to make the problem, which is intrinsically exponential, linear in the number of observations. It is demonstrated that tMAGIC has time and space complexity linear in the size of the input, and can efficiently retrieve instances of the monitored activities. Moreover, this activity detection engine has also been exploited in works belonging to different contexts, such as Refs. 7–9.

3.4. The UAP engine

The UAP Engine takes as input the activity occurrences previously found by the Activity Detection Engine with the associated probabilities, together with the list of the detected action symbols, and finally discovers the Unexplained Video Activities, that is, subsequences of the video source which are not sufficiently explained, with a certain confidence, by the activity models and that could thus be potentially dangerous. Such a module is based on the concept of possible worlds, has been developed in the Java programming language, and provides the implementations of the theoretical algorithms FindTUA and FindPUA.1,2

4. Conclusion and Future Work

This work presented an overview of the evolution of high-level surveillance event detection systems, along with a possible implementation of a prototype following the general method. More specifically, we started by describing how the video frames are captured by sensors/cameras and analyzed; we then showed the different steps applied in order to finally discover some high-level activities which are not sufficiently explained by the well-known activity models and that could be potentially dangerous in the video surveillance context.

Future work will be devoted to comparing this framework with others that can be built, for instance, by replacing the components used at each layer with others either already well-known in the literature or specifically designed and developed following innovative approaches. For instance, we plan to try another Image Processing Library, which would hopefully improve the overall effectiveness of the framework and allow the whole process to work as automatically as possible. Moreover, we can try to exploit a different UAP Engine for discovering unexplained activities in the video surveillance context, which would no longer be based on the concept of possible worlds, but on game theory.

References
1 M. Albanese, C. Molinaro, F. Persia, A. Picariello and V. S. Subrahmanian, Discovering the top-k unexplained sequences in time-stamped observation data, IEEE Trans. Knowl. Data Eng. 26(3), 577 (2014).
2 M. Albanese, C. Molinaro, F. Persia, A. Picariello and V. S. Subrahmanian, Finding unexplained activities in video, Int. Joint Conf. Artificial Intelligence (IJCAI) (2011), pp. 1628–1634.
3 J. K. Petersen, Understanding Surveillance Technologies (CRC Press, Boca Raton, FL, 2001).
4 C. Regazzoni and V. Ramesh, Scanning the issue/technology: Special issue on video communications, processing, and understanding for third generation surveillance systems, Proc. IEEE (2001).
5 N. T. Siebel and S. Maybank, Fusion of multiple tracking algorithms for robust people tracking, Proc. ECCV'02 (2002), pp. 373–387.
6 M. Albanese, A. Pugliese and V. S. Subrahmanian, Fast activity detection: Indexing for temporal stochastic automaton-based activity models, IEEE Trans. Knowl. Data Eng. 25(2), 360 (2013).
7 F. Persia and D. D'Auria, An application for finding expected activities in medical context scientific databases, SEBD (2014), pp. 77–88.
8 D. D'Auria and F. Persia, Automatic evaluation of medical doctors' performances while using a cricothyrotomy simulator, IRI (2014), pp. 514–519.
9 D. D'Auria and F. Persia, Discovering expected activities in medical context scientific databases, DATA (2014), pp. 446–453.
10 I. Onal, K. Kardas, Y. Rezaeitabar, U. Bayram, M. Bal, I. Ulusoy and N. K. Cicekli, A framework for detecting complex events in surveillance videos, IEEE Int. Conf. Multimedia and Expo Workshops (ICMEW) (2013), pp. 1–6.
11 T. T. Zin, P. Tin, H. Hama and T. Toriu, An integrated framework for detecting suspicious behaviors in video surveillance, Proc. SPIE 9026, Video Surveillance and Transportation Imaging Applications 2014, 902614 (2014), doi:10.1117/12.2041232.
12 S. Helmer and F. Persia, High-level surveillance event detection using an interval-based query language, IEEE 10th Int. Conf. Semantic Computing (ICSC'16) (2016), pp. 39–46.
13 V. Moscato, A. Picariello, F. Persia and A. Penta, A system for automatic image categorization, IEEE 3rd Int. Conf. Semantic Computing (ICSC'09) (2009), pp. 624–629.
14 V. Moscato, F. Persia, A. Picariello and A. Penta, iWIN: A summarizer system based on a semantic analysis of web documents, IEEE 6th Int. Conf. Semantic Computing (ICSC'12) (2012), pp. 162–169.
Selected topics in statistical computing

Suneel Babu Chatla*,‡, Chun-Houh Chen† and Galit Shmueli*
*Institute of Service Science, National Tsing Hua University, Hsinchu 30013, Taiwan R.O.C.
†Institute of Statistical Science, Academia Sinica, Taipei 11529, Taiwan R.O.C.
‡[email protected]
The field of computational statistics refers to statistical methods or tools that are computationally intensive. Due to the recent advances in computing power, some of these methods have become prominent and central to modern data analysis. In this paper, we focus on several of the main methods including density estimation, kernel smoothing, smoothing splines, and additive models. While the field of computational statistics includes many more methods, this paper serves as a brief introduction to selected popular topics. Keywords: Histogram; kernel density; local regression; additive models; splines; MCMC; Bootstrap.
1. Introduction

"Let the data speak for themselves"

In 1962, John Tukey76 published a paper on "the future of data analysis", which turns out to be extraordinarily clairvoyant. Specifically, he accorded algorithmic models the same foundation status as the algebraic models that statisticians had favored at that time. More than three decades later, in 1998, Jerome Friedman delivered a keynote speech23 in which he stressed the role of data-driven or algorithmic models in the next revolution of statistical computing. In response, the field of statistics has seen tremendous growth in research areas related to computational statistics.

According to the current Wikipedia entry on "Computational Statistics"a: "Computational statistics or statistical computing refers to the interface between statistics and computer science. It is the area of computational science (or scientific computing) specific to the mathematical science of statistics." Two well-known examples of statistical computing methods are the bootstrap and Markov chain Monte Carlo (MCMC). These methods are prohibitive with insufficient computing power. While the bootstrap has gained significant popularity both in academic research and in practical applications, its feasibility still relies on efficient computing. Similarly, MCMC, which is at the core of Bayesian analysis, is computationally very demanding. A third method which has also become prominent in both academia and practice is nonparametric estimation. Today, nonparametric models are popular data analytic tools due to their flexibility, despite being very computationally intensive, and even prohibitively intensive with large datasets.

In this paper, we provide summarized expositions of some of these important methods. The choice of methods highlights major computational methods for estimation and for inference. We do not aim to provide a comprehensive review of each of these methods, but rather a brief introduction. However, we compiled a list of references for readers interested in further information on any of these methods. For each of the methods, we provide the statistical definition and properties, as well as a brief illustration using an example dataset.

In addition to the aforementioned topics, the 21st century has witnessed tremendous growth in statistical computational methods such as functional data analysis, lasso, and machine learning methods such as random forests, neural networks, deep learning and support vector machines. Although most of these methods have roots in the machine learning field, they have become popular in the field of statistics as well. The recent book by Ref. 15 describes many of these topics.

The paper is organized as follows. In Sec. 2, we open with nonparametric density estimation. Sections 3 and 4 discuss smoothing methods and their extensions: Sec. 3 focuses on kernel smoothing, while Sec. 4 introduces spline smoothing. Section 5 covers additive models, and Sec. 6 introduces MCMC methods. The final section is dedicated to the two most popular resampling methods: the bootstrap and the jackknife.

2. Density Estimation

A basic characteristic describing the behavior of any random variable X is its probability density function. Knowledge of
a https://en.wikipedia.org/wiki/Computational_statistics, accessed August 24, 2016.
the density function is useful in many aspects. By looking at the density function chart, we can get a clear picture of whether the distribution is skewed, multi-modal, etc. In the simple case of a continuous random variable X over an interval X \in (a, b), the density is defined as

P(a < X < b) = \int_a^b f(x)\,dx.

In most practical studies, the density of X is not directly available. Instead, we are given a set of n observations x_1, \ldots, x_n that we assume are iid realizations of the random variable. We then aim to estimate the density on the basis of these observations. There are two basic estimation approaches: the parametric approach, which consists of representing the density with a finite set of parameters, and the nonparametric approach, which does not restrict the possible form of the density function by assuming it belongs to a pre-specified family of density functions.

In parametric estimation, only the parameters are unknown. Hence, the density estimation problem is equivalent to estimating the parameters. However, in the nonparametric approach, one must estimate the entire distribution, because we make no assumptions about the density function.

2.1. Histogram

The oldest and most widely used density estimator is the histogram. Detailed discussions are found in Refs. 65 and 34. Using the definition of the derivative, we can write the density in the following form:

f(x) = \frac{d}{dx}F(x) = \lim_{h \to 0} \frac{F(x+h) - F(x)}{h},  (1)

where F(x) is the cumulative distribution function of the random variable X. A natural finite-sample analog of Eq. (1) is to divide the real line into K equi-sized bins with small bin width h and replace F(x) with the empirical cumulative distribution function

\hat{F}(x) = \frac{\#\{x_i \le x\}}{n}.

This leads to the empirical density function estimator

\hat{f}(x) = \frac{(\#\{x_i \le b_{j+1}\} - \#\{x_i \le b_j\})/n}{h}, \quad x \in (b_j, b_{j+1}],

where (b_j, b_{j+1}] defines the boundaries of the jth bin and h = b_{j+1} - b_j. If we define n_j = \#\{x_i \le b_{j+1}\} - \#\{x_i \le b_j\}, then

\hat{f}(x) = \frac{n_j}{nh}.  (2)

The same histogram estimate can also be obtained using maximum likelihood estimation methods. Here, we try to find a density \hat{f} maximizing the likelihood of the observations,

\prod_{i=1}^{n} \hat{f}(x_i).  (3)

Since the above likelihood (or its logarithm) cannot be maximized directly, penalized maximum likelihood estimation can be used to obtain the histogram estimate.

Next, we proceed to calculate the bias, variance and MSE of the histogram estimator. These properties give us an idea of the accuracy and precision of the estimator. If we define

B_j = [x_0 + (j-1)h, \, x_0 + jh), \quad j \in \mathbb{Z},

with x_0 being the origin of the histogram, then the histogram estimator can be formally written as

\hat{f}_h(x) = (nh)^{-1} \sum_{i=1}^{n} \sum_{j} I(X_i \in B_j)\, I(x \in B_j).  (4)

We now derive the bias of the histogram estimator. Assume that the origin of the histogram x_0 is zero and x \in B_j. Since the X_i are identically distributed,

E(\hat{f}_h(x)) = (nh)^{-1} \sum_{i=1}^{n} E[I(X_i \in B_j)] = (nh)^{-1} n\, E[I(X \in B_j)] = h^{-1} \int_{(j-1)h}^{jh} f(u)\,du.

This last term is not equal to f(x) unless f(x) is constant in B_j. For simplicity, assume f(x) = a + cx, x \in B_j and a, c \in \mathbb{R}. Therefore,

Bias(\hat{f}_h(x)) = E[\hat{f}_h(x)] - f(x) = h^{-1} \int_{B_j} (f(u) - f(x))\,du = h^{-1} \int_{B_j} (a + cu - a - cx)\,du = c\left(\left(j - \tfrac{1}{2}\right)h - x\right).

Instead of the slope c, we may write the first derivative of the density at the midpoint (j - 1/2)h of the bin B_j:

Bias(\hat{f}_h(x)) = f'\left(\left(j - \tfrac{1}{2}\right)h\right)\left(\left(j - \tfrac{1}{2}\right)h - x\right) = O(1)\,O(h) = O(h), \quad h \to 0.

When f is not linear, a Taylor expansion of f to the first order reduces the problem to the linear case. Hence, the bias of the histogram is given by

Bias(\hat{f}_h(x)) = \left(\left(j - \tfrac{1}{2}\right)h - x\right) f'\left(\left(j - \tfrac{1}{2}\right)h\right) + o(h), \quad h \to 0.  (5)
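To make the estimator concrete, here is a minimal R sketch of Eq. (2); the sample, the origin x_0 and the bin width h are illustrative choices, not taken from the paper.

# Histogram density estimate at a point: f_hat(x) = n_j / (n h), cf. Eq. (2).
set.seed(1)
x  <- rnorm(100)          # illustrative sample
x0 <- 0; h <- 0.5         # origin and bin width (illustrative)
fhat_hist <- function(x_eval, x, x0, h) {
  j   <- floor((x_eval - x0) / h)                      # index of the bin containing x_eval
  n_j <- sum(x >= x0 + j * h & x < x0 + (j + 1) * h)   # bin count n_j
  n_j / (length(x) * h)
}
fhat_hist(0.25, x, x0, h)  # density estimate at a single point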
Similarly, the variance of the histogram estimator can be calculated as

Var(\hat{f}_h(x)) = Var\left((nh)^{-1} \sum_{i=1}^{n} I(X_i \in B_j)\right) = (nh)^{-2} \sum_{i=1}^{n} Var[I(X_i \in B_j)] = n^{-1} h^{-2}\, Var[I(X \in B_j)] = n^{-1} h^{-2} \int_{B_j} f(u)\,du \left(1 - \int_{B_j} f(u)\,du\right) = (nh)^{-1} \left(h^{-1} \int_{B_j} f(u)\,du\right)(1 - O(h)) = (nh)^{-1} (f(x) + o(1)), \quad h \to 0, \; nh \to \infty.

Bin width choice is crucial in constructing a histogram. As illustrated in Fig. 1, the bin width choice affects the bias–variance trade-off. The top three panels show histograms for a normal random sample with three different bin sizes; the bottom three histograms are from another normal sample. From the plot, it can be seen that the histograms with larger bin width have smaller variability but larger bias, and vice versa. Hence, we need to strike a balance between bias and variance to come up with a good histogram estimator.

Fig. 1. Histograms for two randomly simulated normal samples with five bins (left), 15 bins (middle), and 26 bins (right).

We observe that the variance of the histogram is proportional to f(x) and decreases as nh increases. This contradicts the fact that the bias of the histogram decreases as h decreases. To find a compromise, we consider the mean squared error (MSE):

MSE(\hat{f}_h(x)) = Var(\hat{f}_h(x)) + (Bias(\hat{f}_h(x)))^2 = \frac{1}{nh} f(x) + \left(\left(j - \tfrac{1}{2}\right)h - x\right)^2 f'\left(\left(j - \tfrac{1}{2}\right)h\right)^2 + o(h) + o\left(\frac{1}{nh}\right).

In order for the histogram estimator to be consistent, the MSE should converge to zero asymptotically. This means that the bin width should get smaller, with the number of observations per bin n_j getting larger, as n \to \infty. Thus, under nh \to \infty and h \to 0, the histogram estimator is consistent: \hat{f}_h(x) \to^p f(x).

Implementing the MSE formula is difficult in practice because of the unknown density involved. In addition, it would have to be calculated for each and every point. Instead of looking at the estimate at one particular point, it is worth calculating a measure of goodness of fit for the entire histogram. For this reason, the mean integrated squared error (MISE) is used.
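The effect illustrated in Fig. 1 is easy to reproduce. A minimal R sketch (the sample size and bin counts are illustrative):

# Same data, three bin counts: coarser bins are smoother but more biased.
set.seed(2)
x <- rnorm(200)
par(mfrow = c(1, 3))
for (k in c(5, 15, 26))
  hist(x, breaks = k, freq = FALSE, main = paste(k, "bins"))
# Note: 'breaks' is only a suggestion to hist(); R may adjust the bin count.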
It is defined as

MISE(\hat{f}_h) = E \int_{-\infty}^{\infty} (\hat{f}_h - f)^2(x)\,dx = \int_{-\infty}^{\infty} MSE(\hat{f}_h(x))\,dx = (nh)^{-1} + \frac{h^2}{12} \|f'\|_2^2 + o(h^2) + o((nh)^{-1}).

Note that \|f'\|_2^2 (dx is omitted in shorthand notation) is the square of the L_2 norm of f', which describes how smooth the density function f is. The common approach is to minimize MISE as a function of h without the higher-order terms (the asymptotic MISE, or AMISE). The minimizer h_0, called the optimal bandwidth, can be obtained by differentiating AMISE with respect to h:

h_0 = \left(\frac{6}{n \|f'\|_2^2}\right)^{1/3}.  (6)

Hence, we see that for minimizing AMISE we should theoretically choose h_0 \sim n^{-1/3}, which, substituted into the MISE formula, gives the best convergence rate O(n^{-2/3}) for sufficiently large n. Again, the solution of Eq. (6) does not help much, as it involves f', which is still unknown. However, this problem can be overcome by using a reference distribution (e.g., Gaussian). This method is often called the "plug-in" method.

2.2. Kernel density estimation

The idea of the kernel estimator was introduced by Ref. 57. Using the definition of the probability density, suppose X has density f. Then

f(x) = \lim_{h \to 0} \frac{1}{2h} P(x - h < X < x + h).

For any given h, we can estimate the probability P(x - h < X < x + h) by the proportion of observations falling in the interval (x - h, x + h). Thus, a naive estimator \hat{f} of the density is given by

\hat{f}(x) = \frac{1}{2hn} \sum_{i=1}^{n} I_{(x-h, x+h)}(X_i).

To express the estimator more formally, we define the weight function w by

w(x) = 1/2 if |x| < 1, and 0 otherwise.

Then, it is easy to see that the naive estimator can be written as

\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h} w\left(\frac{x - X_i}{h}\right).  (7)

However, the naive estimator is not wholly satisfactory, because \hat{f}(x) is of a "stepwise" nature and not differentiable everywhere. We therefore generalize the naive estimator to overcome some of these difficulties by replacing the weight function w with a kernel function K which satisfies the conditions

\int K(t)\,dt = 1, \quad \int t K(t)\,dt = 0, \quad \int t^2 K(t)\,dt = k_2 \ne 0.

Usually, but not always, K will be a symmetric probability density function. The kernel density estimator then becomes

\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right).

From the kernel density definition, it can be observed that:

• Kernel functions are symmetric around 0 and integrate to 1.
• Since the kernel is a density function, the kernel estimator is a density too: \int K(x)\,dx = 1 implies \int \hat{f}_h(x)\,dx = 1.
• The smoothness of the kernel is inherited by \hat{f}_h(x): if K is n times continuously differentiable, then \hat{f}_h(x) is also n times continuously differentiable.
• Unlike histograms, kernel estimates do not depend on the choice of origin.
• Usually kernels are positive, to assure that \hat{f}_h(x) is a density. There are reasons to consider negative kernels, but then \hat{f}_h(x) may sometimes be negative.

We next consider the bias of the kernel estimator:

Bias[\hat{f}_h(x)] = E[\hat{f}_h(x)] - f(x) = \int K(s) f(x + sh)\,ds - f(x) = \int K(s) \left[f(x) + shf'(x) + \frac{h^2 s^2}{2} f''(x) + o(h^2)\right] ds - f(x) = \frac{h^2}{2} f''(x)\, k_2 + o(h^2), \quad h \to 0.

For the proof, see Ref. 53. We see that the bias is quadratic in h; hence, we must choose a small h to reduce the bias. Similarly, the variance of the kernel estimator can be written as

var(\hat{f}_h(x)) = n^{-2}\, var\left(\sum_{i=1}^{n} K_h(x - X_i)\right) = n^{-1}\, var[K_h(x - X)] = (nh)^{-1} f(x) \int K^2 + o((nh)^{-1}), \quad nh \to \infty.

Similar to the histogram case, we observe a bias–variance trade-off. The variance is nearly proportional to (nh)^{-1}, which requires choosing h large to minimize the variance. However, this contradicts the aim of decreasing the bias by choosing a small h. From a smoothing perspective, a smaller bandwidth results in under-smoothing and a larger bandwidth results in over-smoothing. From the illustration in Fig. 2, we see that when the bandwidth is too small (left), the kernel estimator under-smoothes the true density, and when the bandwidth is large (right), the kernel estimator over-smoothes the underlying density.

Fig. 2. Kernel densities with three bandwidth choices (0.1, 0.5, and 1) for a sample from an exponential distribution.

Therefore, we consider the MSE or MISE as a compromise:

MSE[\hat{f}_h(x)] = \frac{1}{nh} f(x) \int K^2 + \frac{h^4}{4} (f''(x)\, k_2)^2 + o((nh)^{-1}) + o(h^4), \quad h \to 0, \; nh \to \infty.

Note that MSE[\hat{f}_h(x)] converges to zero if h \to 0 and nh \to \infty. Thus, the kernel density estimator is consistent, that is, \hat{f}_h(x) \to^p f(x). On the whole, the variance term in the MSE penalizes under-smoothing and the bias term penalizes over-smoothing.

Further, the asymptotically optimal bandwidth can be obtained by differentiating the MSE with respect to h and equating it to zero:

h_0 = \left(\frac{\int K^2}{(f''(x))^2 k_2^2\, n}\right)^{1/5}.

It can further be verified that if we substitute this bandwidth into the MISE formula, then

MISE(\hat{f}_{h_0}) = \frac{5}{4} \left(\int K^2\right)^{4/5} \left(k_2^2 \int f''(x)^2\right)^{1/5} n^{-4/5} = \frac{5}{4}\, C(K) \left(\int f''(x)^2\right)^{1/5} n^{-4/5},

where C(K) = (\int K^2)^{4/5} k_2^{2/5}. From the above formula, it can be observed that we should choose a kernel K with a small value of C(K), all other things being equal. The problem of minimizing C(K) can be reduced to that of minimizing \int K^2 by allowing suitably rescaled versions of kernels. In a different context, Ref. 38 showed that this problem can be solved by setting K to be the Epanechnikov kernel17 (see Table 1).

We define the efficiency of any symmetric kernel K by comparing it to the Epanechnikov kernel:

eff(K) = \{C(K_e)/C(K)\}^{5/4} = \frac{3}{5\sqrt{5}}\, k_2^{-1/2} \left(\int K^2\right)^{-1}.

The reason for the power 5/4 in the above equation is that, for large n, the MISE will be the same whether we use n observations with kernel K or n\,eff(K) observations with the kernel K_e.72 Some kernels and their efficiencies are given in Table 1.

The top four kernels are particular cases of the following family:

K(x; p) = \{2^{2p+1} B(p+1, p+1)\}^{-1} (1 - x^2)^p\, I_{\{|x| < 1\}},

where B(\cdot,\cdot) is the beta function. These kernels are symmetric beta densities on the interval [-1, 1]. For p = 0, the expression gives rise to the rectangular density, p = 1 to the Epanechnikov kernel, and p = 2 and p = 3 to the biweight and triweight kernels, respectively. The standard normal density is obtained as the limiting case p \to \infty.

Similar to the histogram scenario, the problem of choosing the bandwidth (smoothing parameter) is of crucial importance in density estimation. A natural method for choosing the bandwidth is to plot several curves and choose the estimate that is most desirable.
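Both the definition of \hat{f}_h and the bandwidth effect in Fig. 2 can be illustrated in a few lines of R. A minimal sketch (the exponential sample and the bandwidth values are illustrative):

# Hand-rolled Gaussian kernel density estimator: (1/(n h)) * sum K((x - X_i)/h).
kde <- function(x_eval, x, h)
  sapply(x_eval, function(z) mean(dnorm((z - x) / h)) / h)

set.seed(3)
x <- rexp(200)
grid <- seq(0, 10, length.out = 200)
plot(grid, kde(grid, x, 0.1), type = "l")   # under-smoothed
lines(grid, kde(grid, x, 1), lty = 2)       # over-smoothed
# The built-in density(x, bw = 0.5) gives a comparable estimate.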
For many applications, this approach is satisfactory. However, there is a need for data-driven and automatic procedures that are practical and have fast convergence rates. The problem of bandwidth selection has stimulated much research in kernel density estimation. The main approaches include cross-validation (CV) and "plug-in" methods (see the review paper by Ref. 39).

Table 1. Definitions of some kernels and their efficiencies.

Kernel         K(t)                                                    Efficiency
Rectangular    1/2 for |t| \le 1; 0 otherwise                          0.9295
Epanechnikov   (3/4)(1 - t^2/5)/\sqrt{5} for |t| \le \sqrt{5}; 0 otherwise   1
Biweight       (15/16)(1 - t^2)^2 for |t| \le 1; 0 otherwise           0.9939
Triweight      (35/32)(1 - t^2)^3 for |t| \le 1; 0 otherwise           0.987
Triangular     1 - |t| for |t| \le 1; 0 otherwise                      0.9859
Gaussian       (1/\sqrt{2\pi})\, e^{-(1/2)t^2}                         0.9512

As an example, consider least squares CV, which was suggested by Refs. 58 and 3; see also Refs. 4, 32 and 69. Given any estimator \hat{f} of a density f, the integrated squared error can be written as

\int (\hat{f} - f)^2 = \int \hat{f}^2 - 2 \int \hat{f} f + \int f^2.

Since the last term (\int f^2) does not involve \hat{f}, the relevant quantity R(\hat{f}) of the above equation is

R(\hat{f}) = \int \hat{f}^2 - 2 \int \hat{f} f.

The basic principle of least squares CV is to construct an estimate of R(\hat{f}) from the data themselves and then to minimize this estimate over h to give the choice of window width. The term \int \hat{f}^2 can be computed from the estimate \hat{f}. Further, if we define \hat{f}_{-i} as the density estimate constructed from all the data points except X_i, then \int \hat{f} f can be estimated using

\hat{f}_{-i}(x) = (n-1)^{-1} h^{-1} \sum_{j \ne i} K\left(h^{-1}(x - X_j)\right).

Now we define the required quantity, without any unknown terms f, as

M_0(h) = \int \hat{f}^2 - 2 n^{-1} \sum_i \hat{f}_{-i}(X_i).

The score M_0 depends only on the data, and the idea of least squares CV is to minimize this score over h. There also exists a computationally simple approach to estimating M_0; Ref. 69 provided the large-sample properties of this estimator. Thus, asymptotically, least squares CV achieves the best possible choice of smoothing parameter, in the sense of minimizing the integrated squared error. For further details, see Ref. 63.

Another possible approach, related to "plug-in" estimators, is to use a standard family of distributions as a reference for the value of \int f''(x)^2\,dx, which is the only unknown in the optimal bandwidth formula h_{opt}. For example, if we consider the normal distribution with variance \sigma^2, then

\int f''(x)^2\,dx = \sigma^{-5} \int \phi''(x)^2\,dx = \frac{3}{8\sqrt{\pi}}\, \sigma^{-5} \approx 0.212\, \sigma^{-5}.

If a Gaussian kernel is used, then the optimal bandwidth is

h_{opt} = (4\pi)^{-1/10} \left(\frac{3}{8\sqrt{\pi}}\right)^{-1/5} \sigma\, n^{-1/5} = \left(\frac{4}{3}\right)^{1/5} \sigma\, n^{-1/5} = 1.06\, \sigma\, n^{-1/5}.

A quick way of choosing the bandwidth is therefore to estimate \sigma from the data and substitute it into the above formula. While this works well if the normal distribution is a reasonable approximation, it may oversmooth if the population is multimodal. Better results can be obtained using a robust measure of spread such as the interquartile range R, which in this example yields h_{opt} = 0.79\, R\, n^{-1/5}. Similarly, one can improve this further by taking the minimum of the standard deviation and the interquartile range divided by 1.34. For most applications, these bandwidths are easy to evaluate and serve as good starting values.

Although the underlying idea is very simple and the first paper was published long ago, the kernel approach did not make much progress until recently, with advances in computing power. At present, without an efficient algorithm, the calculation of a kernel density for moderately large datasets can become prohibitive. The direct use of the above formulas for computation is very inefficient. Researchers have developed fast and efficient Fourier transformation methods to calculate the estimate, using the fact that the kernel estimate can be written as a convolution of the data and the kernel function.43
3. Kernel Smoothing

The problem of smoothing sequences of observations is important in many branches of science, as demonstrated by the number of different fields in which smoothing methods have been applied. Early contributions were made in fields as diverse as astronomy, actuarial science, and economics. Despite their long history, local regression methods received little attention in the statistics literature until the late 1970s. Initial work includes the mathematical development of Refs. 67, 40 and 68, and the LOWESS procedure of Ref. 9. Recent work on local regression includes Refs. 19, 20 and 36.

The local linear regression method was developed largely as an extension of parametric regression methods, and is accompanied by an elegant finite-sample theory of linear estimation methods that builds on theoretical results for parametric regression. It is a method for curve estimation by fitting locally weighted least squares regressions. One extension of local linear regression, called local polynomial regression, is discussed in Ref. 59 and in the monograph by Ref. 21.

Assume that (X_1, Y_1), \ldots, (X_n, Y_n) are iid observations with conditional mean and conditional variance denoted respectively by

m(x) = E(Y \mid X = x) and \sigma^2(x) = Var(Y \mid X = x).  (8)

Many important applications involve estimation of the regression function m(x) or its \nu-th derivative m^{(\nu)}(x). The performance of an estimator \hat{m}_\nu(x) of m^{(\nu)}(x) is assessed via its MSE or MISE, defined in the previous sections. While the MSE criterion is used when the main objective is to estimate the function at the point x, the MISE criterion is used when the main goal is to recover the whole curve.

3.1. Nadaraya–Watson estimator

If we do not assume a specific form for the regression function m(x), then a data point remote from x carries little information about the value of m(x). In such a case, an intuitive estimator of the conditional mean function is the running locally weighted average. If we consider a kernel K with bandwidth h as the weight function, the Nadaraya–Watson kernel regression estimator is given by

\hat{m}_h(x) = \frac{\sum_{i=1}^{n} K_h(X_i - x) Y_i}{\sum_{i=1}^{n} K_h(X_i - x)},  (9)

where K_h(\cdot) = K(\cdot/h)/h. For more details see Refs. 50, 80 and 33.

3.2. Gasser–Müller estimator

Assume that the data have already been sorted according to the X variable. Reference 24 proposed the following estimator:

\hat{m}_h(x) = \sum_{i=1}^{n} \left\{\int_{s_{i-1}}^{s_i} K_h(u - x)\,du\right\} Y_i,  (10)

with s_i = (X_i + X_{i+1})/2, X_0 = -\infty and X_{n+1} = +\infty. The weights in Eq. (10) add up to 1, so there is no need for a denominator as in the Nadaraya–Watson estimator. Although it was originally developed for equispaced designs, the Gasser–Müller estimator can also be used for non-equispaced designs. For its asymptotic properties please refer to Refs. 44 and 8.

3.3. Local linear estimator

This estimator assumes that locally the regression function m can be approximated by

m(z) \approx \sum_{j=0}^{p} \frac{m^{(j)}(x)}{j!} (z - x)^j \equiv \sum_{j=0}^{p} \beta_j (z - x)^j,  (11)

for z in a neighborhood of x, using a Taylor expansion. Using a local least squares formulation, the model coefficients can be estimated by minimizing the following function:

\sum_{i=1}^{n} \left\{Y_i - \sum_{j=0}^{p} \beta_j (X_i - x)^j\right\}^2 K_h(X_i - x),  (12)

where K(\cdot) is a kernel function with bandwidth h. If we let \hat{\beta}_j (j = 0, \ldots, p) be the estimates obtained from minimizing Eq. (12), then the estimator for the regression function and its derivatives is obtained as

\hat{m}_v(x) = v!\, \hat{\beta}_v.  (13)

When p = 1, the estimator \hat{m}_0(x) is termed a local linear smoother, or local linear estimator, with the following explicit expression:

\hat{m}_0(x) = \frac{\sum_{i=1}^{n} w_i Y_i}{\sum_{i=1}^{n} w_i}, \quad w_i = K_h(X_i - x)\{S_{n,2} - (X_i - x) S_{n,1}\},  (14)

where S_{n,j} = \sum_{i=1}^{n} K_h(X_i - x)(X_i - x)^j. When p = 0, the local linear estimator reduces to the Nadaraya–Watson estimator. Both the Nadaraya–Watson and Gasser–Müller estimators are local least squares estimators of this type, with weights w_i = K_h(X_i - x) and w_i = \int_{s_{i-1}}^{s_i} K_h(u - x)\,du, respectively. The asymptotic results are provided in Table 2, which is taken from Ref. 19.

Table 2. Comparison of asymptotic properties of local estimators.

Method            Bias                                              Variance
Nadaraya–Watson   (m''(x) + \frac{2 m'(x) f'(x)}{f(x)})\, b_n       V_n
Gasser–Müller     m''(x)\, b_n                                      1.5 V_n
Local linear      m''(x)\, b_n                                      V_n

Note: Here, b_n = \frac{h^2}{2} \int_{-1}^{1} u^2 K(u)\,du and V_n = \frac{\sigma^2(x)}{f(x)\, nh} \int_{-1}^{1} K^2(u)\,du.

To illustrate both the Nadaraya–Watson and local linear fits on data, we considered an example dataset on trees.2,60 This dataset includes measurements of the girth, height, and volume of timber in 31 felled black cherry trees. The smooth fit results are shown in Fig. 3, produced using the KernSmooth79 package in R-Software.73 In the right panel of Fig. 3, we see that for a larger bandwidth, as expected, the Nadaraya–Watson estimator fits a global constant model while the local linear fits a linear model.
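The two fits in Fig. 3 can be reproduced with the KernSmooth package cited above; locpoly() fits local polynomials of a chosen degree. A minimal sketch (the bandwidth follows the left panel of the figure):

library(KernSmooth)                               # provides locpoly()
x <- log(trees$Girth); y <- log(trees$Volume)     # built-in trees data
nw <- locpoly(x, y, degree = 0, bandwidth = 0.1)  # local constant (Nadaraya-Watson type)
ll <- locpoly(x, y, degree = 1, bandwidth = 0.1)  # local linear
plot(x, y, xlab = "log(Girth)", ylab = "log(Volume)")
lines(nw, lty = 2)
lines(ll, lty = 1)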
Fig. 3. Comparison of local linear fit versus Nadaraya–Watson fit for a model of (log) volume as a function of (log) girth, with choice of two bandwidths: 0.1 (a) and 0.5 (b).
Further, from the left panel, where we used a reasonable bandwidth, we see that the Nadaraya–Watson estimator has more bias than the local linear fit.

In comparison with the local linear fit, the Nadaraya–Watson estimator locally uses one parameter less without reducing the asymptotic variance. It suffers from large bias, particularly in regions where the derivative of the regression function or of the design density is large. It also does not adapt to nonuniform designs. In addition, it has been shown that the Nadaraya–Watson estimator has zero minimax efficiency; for details see Ref. 19. Based on the definition of minimax efficiency, a 90% efficient estimator uses only 90% of the data, which means that the Nadaraya–Watson estimator does not use all the available data. In contrast, the Gasser–Müller estimator corrects the bias of the Nadaraya–Watson estimator, but at the expense of increased variability for random designs. Further, both the Nadaraya–Watson and Gasser–Müller estimators have a large order of bias when estimating a curve in the boundary region. Comparisons between the local linear and local constant (Nadaraya–Watson) fits are discussed in detail by Refs. 8, 19 and 36.

3.4. Computational considerations

Recent proposals for fast implementations of nonparametric curve estimators include binning methods and updating methods. Reference 22 gave careful speed comparisons of these two fast implementations and direct naive implementations under a variety of settings, using various machines and software. Both fast methods turned out to be much faster, with negligible differences in accuracy. While the key idea of the binning method is to bin the data and compute the required quantities based on the binned data, the key idea of the updating method is to update the quantities previously computed. It has been reported that for practical purposes neither method dominates the other.

4. Smoothing Using Splines

Similar to local linear estimators, another family of methods that provides flexible data modeling is spline methods. These methods involve fitting piecewise polynomials, or splines, to allow the regression function to have discontinuities at certain locations, which are called "knots".18,78,30

4.1. Polynomial spline

Suppose that we want to approximate the unknown regression function m by a cubic spline function, that is, a piecewise polynomial with continuous first two derivatives. Let t_1, \ldots, t_J be a fixed knot sequence such that -\infty < t_1 < \cdots < t_J < +\infty. Then the cubic spline functions are twice continuously differentiable functions s such that the restriction of s to each of the intervals (-\infty, t_1], [t_1, t_2], \ldots, [t_{J-1}, t_J], [t_J, +\infty) is a cubic polynomial. The collection of all these cubic spline functions forms a (J+4)-dimensional linear space. There exist two popular cubic spline bases for this linear space:

Power basis: 1, x, x^2, x^3, (x - t_j)^3_+ \; (j = 1, \ldots, J).

B-spline basis: The ith B-spline of degree p = 3, written as N_{i,p}(u), is defined recursively as

N_{i,0}(u) = 1 if u_i \le u < u_{i+1}, and 0 otherwise;
N_{i,p}(u) = \frac{u - u_i}{u_{i+p} - u_i} N_{i,p-1}(u) + \frac{u_{i+p+1} - u}{u_{i+p+1} - u_{i+1}} N_{i+1,p-1}(u).

The above is usually referred to as the Cox–de Boor recursion formula.

The B-spline basis is typically numerically more stable, because the multiple correlation among the basis functions is smaller, but the power spline basis has the advantage that it provides an easier interpretation of the knots, so that deleting a particular basis function is the same as deleting that particular knot. The direct estimation of the regression function m depends on the choice of knot locations and the number of knots. There exist some methods based on the knot-deletion idea; for full details please see Refs. 41 and 42.
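In R, the splines package exposes this basis directly through bs(), so a regression spline reduces to ordinary least squares on the basis matrix. A minimal sketch (the knot placement is an illustrative choice):

library(splines)                                  # provides bs()
x <- log(trees$Girth); y <- log(trees$Volume)
knots <- quantile(x, c(1/3, 2/3))                 # two interior knots (illustrative)
fit <- lm(y ~ bs(x, knots = knots, degree = 3))   # cubic B-spline basis
summary(fit)$r.squared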
4.2. Smoothing spline

Consider the following objective function:

\sum_{i=1}^{n} \{Y_i - m(X_i)\}^2.  (15)

Minimizing this function gives the best possible estimate of the unknown regression function. The major problem with the above objective function is that any function m that interpolates the data satisfies it, thereby leading to overfitting. To avoid this, a penalty for overparametrization is imposed on the function. A convenient way of introducing such a penalty is via the roughness penalty approach. The following function is minimized:

\sum_{i=1}^{n} \{Y_i - m(X_i)\}^2 + \lambda \int \{m''(x)\}^2\,dx,  (16)

where \lambda > 0 is a smoothing parameter. The first term penalizes the lack of fit, which is in some sense modeling bias. The second term denotes the roughness penalty, which is related to overparametrization. It is evident that \lambda = 0 yields interpolation (undersmoothing) and \lambda = \infty yields linear regression (oversmoothing). Hence, the estimator \hat{m}_\lambda obtained from the objective function, which also depends on the smoothing parameter, is called the smoothing spline estimator. For local properties of this estimator, please refer to Ref. 51.

It is well known that the solution to the minimization of (16) is a cubic spline on the interval [X_{(1)}, X_{(n)}], and that it is unique in this data range. Moreover, the estimator is a linear smoother with weights that do not depend on the response {Y_i}.33 The connections between kernel regression, which we discussed in the previous section, and smoothing splines have been critically studied by Refs. 62 and 64.

The smoothing parameter \lambda can be chosen by minimizing the CV71,1 or generalized cross-validation (GCV)77,11 criteria. Both quantities are consistent estimates of the MISE of \hat{m}_\lambda. For other methods and details please see Ref. 78. Further, for computational issues please refer to Refs. 78 and 18.

5. Additive Models

While the smoothing methods discussed in the previous sections are mostly univariate, the additive model is a widely used multivariate smoothing technique. An additive model is defined as

Y = \alpha + \sum_{j=1}^{p} f_j(X_j) + \varepsilon,  (17)

where the errors \varepsilon are independent of the X_j's and have mean E(\varepsilon) = 0 and variance var(\varepsilon) = \sigma^2. The f_j are arbitrary univariate functions, one for each predictor. Since each variable is represented separately, the model retains the interpretative ease of a linear model.

The most general method for estimating additive models allows us to estimate each function by an arbitrary smoother. Some possible candidates are smoothing splines and kernel smoothers. The backfitting algorithm is a general-purpose algorithm that enables one to fit additive models with any kind of smoother, although for specific smoothers such as smoothing splines or penalized splines there exist separate estimation methods based on least squares. The backfitting algorithm is iterative and consists of the following steps:

(i) Initialize: \alpha = ave(y_i); f_j = f_j^0, j = 1, \ldots, p.
(ii) Cycle: for j = 1, \ldots, p, repeat f_j = S_j(y - \alpha - \sum_{k \ne j} f_k \mid x_j).
(iii) Continue (ii) until the individual functions do not change.

(A minimal code sketch of this loop appears at the end of this section.) The term S_j(y \mid x_j) denotes a smooth of the response y against the predictor x_j. The motivation for the backfitting algorithm can be understood using conditional expectation: if the additive model is correct, then for any k, E(Y - \alpha - \sum_{j \ne k} f_j(X_j) \mid X_k) = f_k(X_k). This suggests the appropriateness of the backfitting algorithm for computing all the f_j.

Reference 70 showed, in the context of regression splines (OLS estimation of spline models), that the additive model has the desirable property of reducing a full p-dimensional nonparametric regression problem to one that can be fitted with the same asymptotic efficiency as a univariate problem. Due to the lack of explicit expressions, the earlier research by Ref. 5 studied only the bivariate additive model in detail, and showed that both the convergence of the algorithm and the uniqueness of its solution depend on the behavior of the product of the two smoother matrices. Later, Refs. 52 and 45 extended the convergence theory to p dimensions. For more details, please see Refs. 5, 45 and 52.

We can write all the estimating equations in a compact form:

\begin{pmatrix} I & S_1 & \cdots & S_1 \\ S_2 & I & \cdots & S_2 \\ \vdots & \vdots & \ddots & \vdots \\ S_p & S_p & \cdots & I \end{pmatrix} \begin{pmatrix} f_1 \\ f_2 \\ \vdots \\ f_p \end{pmatrix} = \begin{pmatrix} S_1 y \\ S_2 y \\ \vdots \\ S_p y \end{pmatrix}, \quad \text{i.e.,} \quad P\hat{f} = Qy.

Backfitting is a Gauss–Seidel procedure for solving the above system. While one could directly use a QR decomposition, without any iterations, to solve the entire system, the computational complexity prohibits doing so. The difficulty is that QR requires O\{(np)^3\} operations, while backfitting involves only O(np) operations, which is much cheaper. For more details, see Ref. 37.

We return to the trees dataset that we considered in the last section, and fit an additive model with both height and girth as predictors to model volume.
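A bare-bones implementation of steps (i)–(iii), using smooth.spline() as each S_j, might look as follows. This is an illustrative sketch, not the algorithm as implemented in the gam package; backfit(), X and the convergence settings are our own names and choices.

backfit <- function(X, y, max_iter = 50, tol = 1e-6) {
  n <- nrow(X); p <- ncol(X)
  alpha <- mean(y)                             # step (i)
  f <- matrix(0, n, p)                         # f_j evaluated at the data points
  for (it in seq_len(max_iter)) {
    f_old <- f
    for (j in seq_len(p)) {                    # step (ii)
      r <- y - alpha - rowSums(f[, -j, drop = FALSE])    # partial residuals
      f[, j] <- predict(smooth.spline(X[, j], r), X[, j])$y
      f[, j] <- f[, j] - mean(f[, j])          # center each f_j for identifiability
    }
    if (max(abs(f - f_old)) < tol) break       # step (iii)
  }
  list(alpha = alpha, f = f)
}
fit <- backfit(cbind(log(trees$Height), log(trees$Girth)), log(trees$Volume))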
Fig. 4. Estimated functions for girth (a) and height (b) using smoothing splines.
log(Volume) log(Height) þ log(Girth). We used the gam35 method for score minimization. However, it also has some package in R-software. The fitted functions are shown in Fig. 4. issues related to convergence. In contrast, the outer method The backfitting method is very generic in the sense that it suffers from none of the disadvantages that performance it- can handle any type of smoothing function. However, there eration has but it is more computationally costly. For more exists another method specific to penalized splines, which has details, please see Ref. 81. become quite popular. This method uses penalized splines by The recent work by Ref. 83 showcases the successful estimating the model using penalized regression methods. In application of additive models on large datasets and uses Eq. (16), because m is linear in parameters , the penalty can performance iteration with block QR updating. This indicates always be written as a quadratic form in : the feasibility of applying these computationally intensive Z and useful models for big data. The routines are available in 00 fm ðxÞg2dx ¼ T S ; the R-package mgcv.82
Penalized likelihood maximization can only estimate model coefficients given the smoothing parameter \lambda. There exist two basic useful estimation approaches: when the scale parameter in the model is known, one can use Mallows' Cp criterion; when the scale parameter is unknown, one can use GCV. Furthermore, for models such as generalized linear models, which are estimated iteratively, there exist two numerically different ways of estimating the smoothing parameter:

Outer iteration: The score can be minimized directly. This means that the penalized regression must be evaluated for each trial set of smoothing parameters.

Performance iteration: The score can be minimized and the smoothing parameter selected for each working penalized linear model. This method is computationally efficient.

Performance iteration was originally proposed by Ref. 31. It usually converges, and requires only a reliable and efficient method for score minimization. However, it also has some issues related to convergence. In contrast, the outer method suffers from none of the disadvantages that performance iteration has, but it is more computationally costly. For more details, please see Ref. 81.

The recent work by Ref. 83 showcases the successful application of additive models on large datasets and uses performance iteration with block QR updating. This indicates the feasibility of applying these computationally intensive and useful models to big data. The routines are available in the R package mgcv.82

6. Markov Chain Monte Carlo

The MCMC methodology provides enormous scope for realistic and complex statistical modeling. The idea is to perform Monte Carlo integration using Markov chains. Bayesian statisticians, and sometimes also frequentists, need to integrate over possibly high-dimensional probability distributions to draw inference about model parameters or to generate predictions. For a brief history and overview please refer to Refs. 28 and 29.

6.1. Markov chains

Consider a sequence of random variables, \{X_0, X_1, X_2, ...\}, such that at each time t \ge 0, the next state X_{t+1} is sampled from a conditional distribution P(X_{t+1} \mid X_t), which depends only on the current state of the chain, X_t. That is, given the current state X_t, the next state X_{t+1} does not depend on the past states; this is called the memoryless property. This sequence is called a Markov chain.

The joint distribution of a Markov chain is determined by two components:

The marginal distribution of X_0, called the initial distribution.
The conditional density p(\cdot \mid \cdot), called the transition kernel of the chain.
It is assumed that the chain is time-homogeneous, which means that the probability P(\cdot \mid \cdot) does not depend on time t. The set in which X_t takes values is called the state space of the Markov chain, and it can be countably finite or infinite.

Under some regularity conditions, the chain will gradually forget its initial state and converge to a unique stationary (invariant) distribution, say \pi(\cdot), which does not depend on t and X_0. To converge to a stationary distribution, the chain needs to satisfy three important properties. First, it has to be irreducible, which means that from all starting points the Markov chain can reach any nonempty set with positive probability in some number of iterations. Second, the chain needs to be aperiodic, which means that it should not oscillate between any two points in a periodic manner. Third, the chain must be positive recurrent, as defined next.

Definition 1 (Ref. 49).
(i) A Markov chain X is called irreducible if for all i, j there exists t > 0 such that P_{i,j}(t) = P[X_t = j \mid X_0 = i] > 0.
(ii) Let \tau_{ii} be the time of the first return to state i, \tau_{ii} = \min\{t > 0 : X_t = i \mid X_0 = i\}. An irreducible chain X is recurrent if P[\tau_{ii} < \infty] = 1 for some i. Otherwise, X is transient. Another equivalent condition for recurrence is \sum_t P_{ij}(t) = \infty for all i, j.
(iii) An irreducible recurrent chain X is called positive recurrent if E[\tau_{ii}] < \infty for some i. Otherwise, it is called null-recurrent. Another equivalent condition for positive recurrence is the existence of a stationary probability distribution for X, that is, there exists \pi(\cdot) such that

    \sum_i \pi(i) P_{ij}(t) = \pi(j)    (20)

for all j and t \ge 0.
(iv) An irreducible chain X is called aperiodic if for some i, the greatest common divisor of \{t > 0 : P_{ii}(t) > 0\} is 1.

In MCMC, since we already have a target distribution \pi(\cdot), X will be positive recurrent if we can demonstrate irreducibility. After a sufficiently long burn-in of, say, m iterations, the points \{X_t; t = m+1, ..., n\} will be a dependent sample approximately from \pi(\cdot). We can now use the output from the Markov chain to estimate the required quantities. For example, we estimate E[f(X)], where X has distribution \pi(\cdot), as follows:

    \bar{f} = \frac{1}{n - m} \sum_{t=m+1}^{n} f(X_t).

This quantity is called an ergodic average and its convergence to the required expectation is ensured by the ergodic theorem.56,49

Theorem 2. If X is positive recurrent and aperiodic then its stationary distribution \pi(\cdot) is the unique probability distribution satisfying Eq. (20). We then say that X is ergodic and the following consequences hold:
(i) P_{ij}(t) \to \pi(j) as t \to \infty for all i, j.
(ii) (Ergodic theorem) If E[|f(X)|] < \infty, then P(\bar{f} \to E[f(X)]) = 1, where E[f(X)] = \sum_i f(i)\pi(i), the expectation of f(X) with respect to \pi(\cdot).

Most of the Markov chain procedures in MCMC are reversible, which means that they are positive recurrent with stationary distribution \pi(\cdot) and \pi(i)P_{ij} = \pi(j)P_{ji}. Further, we say that X is geometrically ergodic if it is ergodic (positive recurrent and aperiodic) and there exist 0 < \lambda < 1 and a function V(\cdot) > 1 such that

    \sum_j |P_{ij}(t) - \pi(j)| \le V(i)\lambda^t    (21)

for all i. The smallest \lambda for which there exists a function satisfying the above equation is called the rate of convergence. As a consequence of the geometric convergence, the central limit theorem can be used for ergodic averages, that is,

    N^{1/2}(\bar{f} - E[f(X)]) \to N(0, \sigma^2)

for some positive constant \sigma, as N \to \infty, with convergence in distribution. For an extensive treatment of geometric convergence and central limit theorems for Markov chains, please refer to Ref. 48.
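As a small numerical illustration of Eq. (20) and Theorem 2(i) (not part of the original text), consider a two-state chain in R.

# For a 2-state transition matrix P, the stationary distribution pi
# satisfies pi %*% P = pi, i.e., pi is the left eigenvector of P with
# eigenvalue 1.
P <- matrix(c(0.9, 0.1,
              0.3, 0.7), nrow = 2, byrow = TRUE)
e <- eigen(t(P))
pi_stat <- Re(e$vectors[, 1]); pi_stat <- pi_stat / sum(pi_stat)
pi_stat                                   # 0.75 0.25

# P_ij(t) -> pi(j) for all i as t grows (ergodicity)
Pt <- diag(2); for (t in 1:50) Pt <- Pt %*% P
round(Pt, 4)                              # both rows approach (0.75, 0.25)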
6.2. The Gibbs sampler and Metropolis–Hastings algorithm

Many MCMC algorithms are hybrids or generalizations of the two simplest methods: the Gibbs sampler and the Metropolis–Hastings algorithm. We therefore describe each of these two methods next.

6.2.1. Gibbs sampler

The Gibbs sampler enjoyed an initial surge of popularity starting with the paper of Ref. 27 (in a study of image processing models), while the roots of this method can be traced back to Ref. 47. The Gibbs sampler is a technique for indirectly generating random variables from a (marginal) distribution, without calculating the joint density. With the help of techniques like these we are able to avoid difficult calculations, replacing them with a sequence of easier calculations.

Let \pi(x) = \pi(x_1, ..., x_k), x \in \mathbb{R}^n, denote a joint density, and let \pi(x_i \mid x_{-i}) denote the induced full conditional densities for each of the components x_i, given values of the other components x_{-i} = (x_j; j \ne i), i = 1, ..., k, 1 < k \le n. Now the Gibbs sampler proceeds as follows. First, choose arbitrary
starting values x^0 = (x_1^0, ..., x_k^0). Then, successively make random drawings from the full conditional distributions \pi(x_i \mid x_{-i}), i = 1, ..., k, as follows66:

    x_1^1 from \pi(x_1 \mid x_{-1}^0),
    x_2^1 from \pi(x_2 \mid x_1^1, x_3^0, ..., x_k^0),
    x_3^1 from \pi(x_3 \mid x_1^1, x_2^1, x_4^0, ..., x_k^0),
    ...
    x_k^1 from \pi(x_k \mid x_{-k}^1).

This completes a transition from x^0 = (x_1^0, ..., x_k^0) to x^1 = (x_1^1, ..., x_k^1). Each complete cycle through the conditional distributions produces a sequence x^0, x^1, ..., x^t, ... which is a realization of the Markov chain, with transition probability from x^t to x^{t+1} given by

    TP(x^t, x^{t+1}) = \prod_{l=1}^{k} \pi(x_l^{t+1} \mid x_j^t, j > l; \; x_j^{t+1}, j < l).    (22)

Thus, the key feature of this algorithm is that it only samples from the full conditional distributions, which are often easier to evaluate than the joint density. For more details, see Ref. 6.
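A compact R sketch of the Gibbs sampler (not from the original text) for a bivariate normal target with correlation rho, where both full conditionals are normal and available in closed form:

set.seed(1)
rho <- 0.8
n_iter <- 5000
x <- matrix(0, nrow = n_iter, ncol = 2)
for (t in 2:n_iter) {
  # draw each component from its full conditional, given the other
  x[t, 1] <- rnorm(1, mean = rho * x[t - 1, 2], sd = sqrt(1 - rho^2))
  x[t, 2] <- rnorm(1, mean = rho * x[t, 1],     sd = sqrt(1 - rho^2))
}
burnin <- 1000
colMeans(x[-(1:burnin), ])    # both close to 0
cor(x[-(1:burnin), ])[1, 2]   # close to rho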
6.2.2. Metropolis–Hastings algorithm

The Metropolis–Hastings (MH) algorithm was developed by Ref. 47. This algorithm is extremely versatile and produces the Gibbs sampler as a special case.25

To construct a Markov chain X_1, X_2, ..., X_t, ... with equilibrium distribution \pi(x), the MH algorithm constructs the transition probability from X_t = x to the next realized state X_{t+1} as follows. Let q(x, x') denote a candidate generating density such that, given X_t = x, an x' drawn from q(x, x') is considered as a proposed possible value for X_{t+1}. With some probability \alpha(x, x'), we accept X_{t+1} = x'; otherwise, we reject the value generated from q(x, x') and set X_{t+1} = x. This construction defines a Markov chain with transition probabilities given by

    p(x, x') = q(x, x')\alpha(x, x') \quad \text{if } x' \ne x,
    \qquad
    p(x, x) = 1 - \sum_{x''} q(x, x'')\alpha(x, x'').

Next, we choose

    \alpha(x, x') = \min\left\{ \frac{\pi(x')q(x', x)}{\pi(x)q(x, x')}, 1 \right\} \quad \text{if } \pi(x)q(x, x') > 0,
    \qquad
    \alpha(x, x') = 1 \quad \text{if } \pi(x)q(x, x') = 0.

Choosing the arbitrary q(x, x') to be irreducible and aperiodic is a sufficient condition for \pi(x) to be the equilibrium distribution of the constructed chain.

It can be observed that different choices of q(x, x') lead to different specific algorithms. For q(x, x') = q(x', x), we have \alpha(x, x') = \min\{\pi(x')/\pi(x), 1\}, which is the well-known Metropolis algorithm.47 For q(x, x') = q(x' - x), the chain is driven by a random walk process. For more choices and their consequences please refer to Ref. 74. Similarly, for applications of the MH algorithm and for more details see Ref. 7.
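The following random-walk Metropolis sketch in R (an illustration added here, with a standard normal target) uses a symmetric proposal, so the acceptance probability reduces to min{pi(x')/pi(x), 1}:

set.seed(1)
target <- function(x) dnorm(x)     # pi(x), needed only up to a constant
n_iter <- 10000
x <- numeric(n_iter)
for (t in 2:n_iter) {
  prop  <- x[t - 1] + rnorm(1, sd = 1)            # candidate x' ~ q(x' - x)
  alpha <- min(target(prop) / target(x[t - 1]), 1)
  x[t]  <- if (runif(1) < alpha) prop else x[t - 1]
}
mean(x); var(x)    # approximately 0 and 1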
6.3. MCMC issues

There is a great deal of theory about the convergence properties of MCMC. However, it has not been found to be very useful in practice for determining convergence. A critical issue for users of MCMC is how to determine when to stop the algorithm. Sometimes a Markov chain can appear to have converged to the equilibrium distribution when it has not. This can happen due to prolonged transition times between regions of the state space or due to the multimodal nature of the equilibrium distribution. This phenomenon is often called pseudo convergence or multimodality.

The phenomenon of pseudo convergence has led many MCMC users to embrace the idea of comparing multiple runs of the sampler started at multiple points instead of the usual single run. It is believed that if the multiple runs converge to the same equilibrium distribution then everything is fine with the chain. However, this approach does not alleviate all the problems. Many times, running multiple chains leads to not running the sampler long enough to detect problems, such as bugs in the code. Those who have used MCMC in complicated problems are probably familiar with stories about last minute problems after running the chain for several weeks. In the following, we describe the two popular MCMC diagnostic methods.

Reference 26 proposed a convergence diagnostic method which is commonly known as the "Gelman–Rubin" diagnostic method. It consists of the following two steps. First, obtain an overdispersed estimate of the target distribution and from it generate the starting points for the desired number of independent chains (say 10). Second, run the Gibbs sampler and re-estimate the target distribution of the required scalar quantity as a conservative Student t distribution, the scale parameter of which involves both the between-chain variance and the within-chain variance. Convergence is then monitored by estimating the factor by which the scale parameter might shrink if sampling were continued indefinitely, namely

    \hat{R} = \sqrt{ \left( \frac{n-1}{n} + \frac{m+1}{mn} \frac{B}{W} \right) \frac{df}{df - 2} },

where B is the variance between the means from the m parallel chains, W is the average within-chain variance, df is the degrees of freedom of the approximating density, and n is the number of observations used to re-estimate the target density. The authors recommend an iterative process of running additional iterations of the parallel chains and redoing step 2 until the shrink factors for all the quantities of interest are near 1. Though created for the Gibbs sampler, the method of Ref. 26 may be applied to the output of any MCMC algorithm. It emphasizes the reduction of bias in estimation. There also exist a number of criticisms of the method of Ref. 26. It relies heavily on the user's ability to find a starting distribution that is highly overdispersed relative to the target distribution. This means that the user should have some prior knowledge of the target distribution. Although the approach is essentially univariate, the authors suggested using -2 times the log posterior density as a way of summarizing the convergence of a joint density.

Similarly, Ref. 55 proposed a diagnostic method which is intended both to detect convergence to the stationary distribution and to provide a way of bounding the variance estimates of the quantiles of functions of parameters. The user must first run a single-chain Gibbs sampler with the minimum number of iterations that would be needed to obtain the desired precision of estimation if the samples were independent. The approach is based on two-state Markov chain theory, as well as on standard sample size formulas involving binomial variance. For more details, please refer to Ref. 55. Critics point out that the method can produce variable estimates of the required number of iterations for the same problem given different initial chains, and that it is univariate rather than giving information about the full joint posterior distribution.

There are more methods available for MCMC convergence diagnostics, although they are not as popular as these. For a discussion of other methods refer to Ref. 10.

Continuing our illustrations using the trees data, we fit the same model that we used in Sec. 4 using MCMC methods. We used the MCMCpack46 package in R. For the sake of illustration, we considered three chains and 600 observations per chain. Of the 600, the first 100 observations are used for the burn-in. The summary results are as follows.

Iterations = 101:600
Thinning interval = 1
Number of chains = 3
Sample size per chain = 500

1. Empirical mean and standard deviation for each variable, plus standard error of the mean:

              Mean      SD       Naive SE   Time-series SE
(Intercept)  -6.640053  0.837601  2.163e-02  2.147e-02
log(Girth)    1.983349  0.078237  2.020e-03  2.017e-03
log(Height)   1.118669  0.213692  5.518e-03  5.512e-03
sigma2        0.007139  0.002037  5.259e-05  6.265e-05
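Output of this kind can be reproduced along the following lines; this is a sketch, since the seeds and starting values used for the three chains are not reported in the text, so numerical results will differ slightly.

library(MCMCpack)    # provides MCMCregress(); loads the coda package
chains <- mcmc.list(lapply(1:3, function(s)
  MCMCregress(log(Volume) ~ log(Girth) + log(Height), data = trees,
              burnin = 100, mcmc = 500, seed = s)))
summary(chains)       # empirical means, SDs and quantiles, as above
plot(chains)          # trace and density plots, cf. Fig. 5
gelman.diag(chains)   # Gelman-Rubin shrink factors, cf. Fig. 6
gelman.plot(chains)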
[Fig. 5 panels: trace plots (left) and posterior density estimates (right) for (Intercept), log(Girth), log(Height) and sigma2, over iterations 100 to 600, with N = 500 per chain.]
Fig. 5. MCMC trace plots with three chains for each parameter in the estimated model.
[Fig. 6 panels: Gelman–Rubin shrink factor (median and 97.5% quantile) versus last iteration in chain, for (Intercept), log(Girth), log(Height) and sigma2.]
Fig. 6. Gelman–Rubin diagnostics for each model parameter, using the results from three chains.
2. Quantiles for each variable:

               2.5%       25%        50%        75%        97.5%
(Intercept)  -8.307100  -7.185067  -6.643978  -6.084797  -4.96591
log(Girth)    1.832276   1.929704   1.986649   2.036207   2.13400
log(Height)   0.681964   0.981868   1.116321   1.256287   1.53511
sigma2        0.004133   0.005643   0.006824   0.008352   0.01161

Further, the trace plots for each parameter are displayed in Fig. 5. From the plots, it can be observed that the chains are well mixed, which gives an indication of convergence of the MCMC.

Because we used three chains, we can use the Gelman–Rubin convergence diagnostic method and check whether the shrink factor is close to 1 or not, which indicates convergence of the three chains to the same equilibrium distribution. The results are shown in Fig. 6. From the plots, we see that for all the parameters, the shrink factor and its 97.5% value are very close to 1, which confirms that the three chains converged to the same equilibrium distribution.
7. Resampling Methods

Resampling methods are statistical procedures that involve repeated sampling of the data. They replace the theoretical derivations required for applying traditional methods in statistical analysis by repeatedly resampling the original data and making inference from the resamples. Due to advances in computing power, these methods have become prominent and are particularly well appreciated by applied statisticians. The jackknife and bootstrap are the most popular data-resampling methods used in statistical analysis. For a comprehensive treatment of these methods, see Refs. 14 and 61. In the following, we describe the jackknife and bootstrap methods.
7.1. The jackknife

Quenouille54 introduced a method, later named the jackknife, to estimate the bias of an estimator by deleting one data point each time from the original dataset and recalculating the estimator based on the rest of the data. Let T_n = T_n(X_1, ..., X_n) be an estimator of an unknown parameter \theta. The bias of T_n is defined as

    \text{bias}(T_n) = E(T_n) - \theta.

Let T_{n-1,i} = T_{n-1}(X_1, ..., X_{i-1}, X_{i+1}, ..., X_n) be the given statistic but based on the n - 1 observations X_1, ..., X_{i-1}, X_{i+1}, ..., X_n, i = 1, ..., n. Quenouille's jackknife bias estimator is

    b_{\text{JACK}} = (n - 1)(\bar{T}_n - T_n),    (23)

where \bar{T}_n = \frac{1}{n}\sum_{i=1}^{n} T_{n-1,i}. This leads to a bias-reduced jackknife estimator of \theta,

    T_{\text{JACK}} = T_n - b_{\text{JACK}} = nT_n - (n - 1)\bar{T}_n.    (24)
The jackknife estimators b_{\text{JACK}} and T_{\text{JACK}} can be heuristically justified as follows. Suppose that

    \text{bias}(T_n) = \frac{a}{n} + \frac{b}{n^2} + O\left(\frac{1}{n^3}\right),    (25)

where a and b are unknown but do not depend on n. Since T_{n-1,i}, i = 1, ..., n, are identically distributed,

    \text{bias}(T_{n-1,i}) = \frac{a}{n-1} + \frac{b}{(n-1)^2} + O\left(\frac{1}{(n-1)^3}\right),    (26)
and \text{bias}(\bar{T}_n) has the same expression. Therefore,

    E(b_{\text{JACK}}) = (n-1)[\text{bias}(\bar{T}_n) - \text{bias}(T_n)]
                       = (n-1)\left[ a\left(\frac{1}{n-1} - \frac{1}{n}\right) + b\left(\frac{1}{(n-1)^2} - \frac{1}{n^2}\right) + O\left(\frac{1}{n^3}\right) \right]
                       = \frac{a}{n} + \frac{(2n-1)b}{n^2(n-1)} + O\left(\frac{1}{n^2}\right),

which means that, as an estimator of the bias of T_n, b_{\text{JACK}} is correct up to the order of n^{-2}. It follows that

    \text{bias}(T_{\text{JACK}}) = \text{bias}(T_n) - E(b_{\text{JACK}}) = -\frac{b}{n(n-1)} + O\left(\frac{1}{n^2}\right),

that is, the bias of T_{\text{JACK}} is of order n^{-2}. The jackknife produces a bias-reduced estimator by removing the first-order term in \text{bias}(T_n).

The jackknife has become a more valuable tool since Ref. 75 found that the jackknife can also be used to construct variance estimators. It is less dependent on model assumptions and does not need any theoretical formula as required by the traditional approach. Although it was prohibitive in the old days due to its computational costs, today it is certainly a popular tool in data analysis.
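A minimal R sketch of Eqs. (23) and (24), together with the usual jackknife standard error (the variance formula is the standard one from Ref. 75, not given explicitly in the text):

jackknife <- function(x, stat) {
  n  <- length(x)
  Tn <- stat(x)
  T_loo  <- sapply(1:n, function(i) stat(x[-i]))   # T_{n-1,i}
  T_bar  <- mean(T_loo)
  b_jack <- (n - 1) * (T_bar - Tn)                 # Eq. (23)
  list(estimate = n * Tn - (n - 1) * T_bar,        # Eq. (24)
       bias     = b_jack,
       se       = sqrt((n - 1) / n * sum((T_loo - T_bar)^2)))
}
jackknife(trees$Volume, stat = mean)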
7.2. The bootstrap

The bootstrap13 is conceptually the simplest of all resampling methods. Let X_1, ..., X_n denote the dataset of n independent and identically distributed (iid) observations from an unknown distribution F, which is estimated by \hat{F}, and let T_n = T_n(X_1, ..., X_n) be a given statistic. Then the variance of T_n is

    \text{var}(T_n) = \int \left[ T_n(x) - \int T_n(y) \prod_{i=1}^{n} dF(y_i) \right]^2 \prod_{i=1}^{n} dF(x_i),    (27)

where x = (x_1, ..., x_n) and y = (y_1, ..., y_n). Substituting \hat{F} for F, we obtain the bootstrap variance estimator

    \nu_{\text{BOOT}} = \int \left[ T_n(x) - \int T_n(y) \prod_{i=1}^{n} d\hat{F}(y_i) \right]^2 \prod_{i=1}^{n} d\hat{F}(x_i)
                      = \text{var}_*[T_n(X_1^*, ..., X_n^*) \mid X_1, ..., X_n],

where \{X_1^*, ..., X_n^*\} is an iid sample from \hat{F}, called a bootstrap sample, and \text{var}_*[\cdot \mid X_1, ..., X_n] denotes the conditional variance given X_1, ..., X_n. The variance cannot be used directly for practical applications when \nu_{\text{BOOT}} is not an explicit function of X_1, ..., X_n. Monte Carlo methods can be used to evaluate this expression when F is known: we repeatedly draw new datasets from F and then use the sample variance of the values T_n computed from the new datasets as a numerical approximation to \text{var}(T_n). Since \hat{F} is a known distribution, this idea can be further extended. That is, we can draw \{X_{1b}^*, ..., X_{nb}^*\}, b = 1, ..., B, independently from \hat{F}, conditioned on X_1, ..., X_n. Let T_{n,b}^* = T_n(X_{1b}^*, ..., X_{nb}^*); then we approximate \nu_{\text{BOOT}} using

    \nu_{\text{BOOT}}^{(B)} = \frac{1}{B} \sum_{b=1}^{B} \left[ T_{n,b}^* - \frac{1}{B} \sum_{l=1}^{B} T_{n,l}^* \right]^2.    (28)

From the law of large numbers, \nu_{\text{BOOT}} = \lim_{B \to \infty} \nu_{\text{BOOT}}^{(B)} almost surely. Both \nu_{\text{BOOT}} and its Monte Carlo approximations \nu_{\text{BOOT}}^{(B)} are called bootstrap estimators. While \nu_{\text{BOOT}}^{(B)} is more useful for practical applications, \nu_{\text{BOOT}} is convenient for theoretical derivations. The distribution \hat{F} used to generate the bootstrap datasets can be any estimator (parametric or nonparametric) of F based on X_1, ..., X_n. A simple nonparametric estimator of F is the empirical distribution.

While we have considered the bootstrap variance estimator here, the bootstrap method can be used for more general problems such as inference for regression parameters, hypothesis testing, etc. For further discussion of the bootstrap, see Ref. 16.

Next, we consider the bias and variance of the bootstrap estimator. Efron13 applied the delta method to approximate the bootstrap bias and variance. Let \{X_1^*, ..., X_n^*\} be a bootstrap sample from the empirical distribution F_n. Define

    P_i^* = (\text{the number of } X_j^* = X_i, \; j = 1, ..., n)/n

and

    P^* = (P_1^*, ..., P_n^*)'.

Given X_1, ..., X_n, the variable nP^* is distributed as a multinomial variable with parameters n and P^0 = (1, ..., 1)'/n. Then

    E_* P^* = P^0 \quad \text{and} \quad \text{var}_*(P^*) = n^{-2}\left(I - \frac{\mathbf{1}\mathbf{1}'}{n}\right),

where I is the identity matrix, \mathbf{1} is a column vector of 1's, and E_* and \text{var}_* are the bootstrap expectation and variance, respectively. Now, define a bootstrap estimator of the moment of a random variable R_n(X_1, ..., X_n; F). The properties of bootstrap estimators enable us to substitute the population quantities with the empirical quantities: R_n(X_1^*, ..., X_n^*; F_n) = R_n(P^*). If we expand this around P^0 using a multivariate Taylor expansion, we get the desired approximations for the bootstrap bias and variance:

    b_{\text{BOOT}} = E_* R_n(P^*) - R_n(P^0) \approx \frac{1}{2n^2} \text{tr}(V),
    \qquad
    \nu_{\text{BOOT}} = \text{var}_* R_n(P^*) \approx \frac{1}{n^2} U'U,

where U = \nabla R_n(P^0) and V = \nabla^2 R_n(P^0).
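Equation (28) translates directly into a few lines of R; a sketch (added here for illustration) for the variance of the sample median:

set.seed(1)
boot_var <- function(x, stat, B = 1000) {
  # resample from the empirical distribution and recompute the statistic
  T_star <- replicate(B, stat(sample(x, replace = TRUE)))
  mean((T_star - mean(T_star))^2)   # Eq. (28), with the 1/B divisor
}
boot_var(trees$Volume, stat = median)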
7.3. Comparing the jackknife and the bootstrap

In general, the jackknife will be easier to compute if n is less than, say, the 100 or 200 replicates used by the bootstrap for standard error estimation. However, by looking only at the n jackknife samples, the jackknife uses only limited information about the statistic, which means it might be less efficient than the bootstrap. In fact, it turns out that the jackknife can be viewed as a linear approximation to the bootstrap.14 Hence, if the statistic is linear then both estimators agree. However, for nonlinear statistics, there is a loss of information. Practically speaking, the accuracy of the jackknife estimate of standard error depends on how close the statistic is to linearity. Also, while it is not obvious how to estimate the entire sampling distribution of T_n by jackknifing, the bootstrap can be readily used to obtain a distribution estimator for T_n.

In considering the merits and demerits of the bootstrap, it should be noted that the general formulas for estimating standard errors that involve the observed Fisher information matrix are essentially bootstrap estimates carried out in a parametric framework. While the use of the Fisher information matrix involves parametric assumptions, the bootstrap is free of those. The data analyst is free to obtain standard errors for enormously complicated estimators, subject only to the constraints of computer time. In addition, if needed, one could obtain smoother bootstrap estimates by convoluting the nonparametric bootstrap with the parametric bootstrap; a parametric bootstrap generates samples based on the estimated parameters, while the nonparametric bootstrap generates samples based on the available data alone.

To provide a simple illustration, we again consider the trees data and fit an ordinary regression model with the formula mentioned in Sec. 4. To conduct a bootstrap analysis of the regression parameters, we resampled the data with replacement 100 times (bootstrap replications) and fit the same model to each sample. We calculated the mean and standard deviation for each regression coefficient, which are analogous to the OLS coefficient and standard error. We did the same for the jackknife estimators. The results are shown in Table 3. From the results, it can be seen that the jackknife is off due to the small sample size. However, the bootstrap results are much closer to the values from OLS.
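A sketch of the case-resampling computation behind Table 3 (the authors' exact resampling scheme and seed are not reported, so numerical results will differ slightly):

set.seed(1)
form <- log(Volume) ~ log(Height) + log(Girth)

# Bootstrap: 100 case resamples, refit the model on each
boot_coef <- replicate(100, {
  idx <- sample(nrow(trees), replace = TRUE)
  coef(lm(form, data = trees[idx, ]))
})
rowMeans(boot_coef)        # bootstrap coefficient estimates, cf. Table 3
apply(boot_coef, 1, sd)    # bootstrap standard errors

# Jackknife: leave one observation out at a time (31 refits)
jack_coef <- sapply(1:nrow(trees), function(i)
  coef(lm(form, data = trees[-i, ])))
rowMeans(jack_coef)        # jackknife coefficient estimates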
8. Conclusion

The area and methods of computational statistics have been evolving rapidly. Existing statistical software such as R already has efficient routines to implement and evaluate these methods. In addition, there exists literature on parallelizing these methods to make them even more efficient; for details, please see Ref. 83.

While some of the existing methods are still prohibitive even with moderately large data, such as the local linear estimator, implementations using more resourceful environments such as servers or clouds make such methods feasible even with big data. For an example, see Ref. 84, where the authors used a server (32 GB RAM) to estimate their proposed model on real data in no more than 34 s. This illustrates the helpfulness of computing power when estimating these computationally intensive methods.

To the best of our knowledge, there exist multiple algorithms or R packages to implement all the methods discussed here. However, it should be noted that not every implementation is computationally efficient. For example, Ref. 12 reported that within the R software there are 20 packages that implement density estimation, and found that two of them (KernSmooth, ASH) are very fast, accurate and also well-maintained. Hence, the user should be wise enough to choose efficient implementations when dealing with larger datasets.

Lastly, as we mentioned before, we are able to cover only a few of the modern statistical computing methods. For an expanded exposition of computational methods, especially for inference, see Ref. 15.

Acknowledgment

The first and third authors were supported in part by grant 105-2410-H-007-034-MY3 from the Ministry of Science and Technology in Taiwan.
Table 3. Comparison of the bootstrap, jackknife, and parametric method (OLS) in a regression setting (standard errors in parentheses).

                OLS            Bootstrap      Jackknife
(Intercept)    -6.63 (0.80)   -6.65 (0.73)   -6.63 (0.15)
log(Height)     1.12 (0.20)    1.12 (0.19)    1.11 (0.04)
log(Girth)      1.98 (0.08)    1.99 (0.07)    1.98 (0.01)
Observations    31
Samples                        100            31

References

1. D. M. Allen, The relationship between variable selection and data augmentation and a method for prediction, Technometrics 16(1), 125 (1974).
2. A. C. Atkinson, Plots, Transformations, and Regression: An Introduction to Graphical Methods of Diagnostic Regression Analysis (Clarendon Press, 1985).
3. A. W. Bowman, An alternative method of cross-validation for the smoothing of density estimates, Biometrika 71(2), 353 (1984).
4. A. W. Bowman, P. Hall and D. M. Titterington, Cross-validation in nonparametric estimation of probabilities and probability densities, Biometrika 71(2), 341 (1984).
5. A. Buja, T. Hastie and R. Tibshirani, Linear smoothers and additive models, Ann. Stat. 453 (1989).
6. G. Casella and E. I. George, Explaining the Gibbs sampler, Am. Stat. 46(3), 167 (1992).
7. S. Chib and E. Greenberg, Understanding the Metropolis–Hastings algorithm, Am. Stat. 49(4), 327 (1995).
8. C.-K. Chu and J. S. Marron, Choosing a kernel regression estimator, Stat. Sci. 404 (1991).
9. W. S. Cleveland, Robust locally weighted regression and smoothing scatterplots, J. Am. Stat. Assoc. 74(368), 829 (1979).
10. M. K. Cowles and B. P. Carlin, Markov chain Monte Carlo convergence diagnostics: A comparative review, J. Am. Stat. Assoc. 91(434), 883 (1996).
11. P. Craven and G. Wahba, Smoothing noisy data with spline functions, Num. Math. 31(4), 377 (1978).
12. H. Deng and H. Wickham, Density estimation in R (2011).
13. B. Efron, Bootstrap methods: Another look at the jackknife, Breakthroughs in Statistics (Springer, 1992), pp. 569–593.
14. B. Efron, The Jackknife, the Bootstrap and Other Resampling Plans, Vol. 38 (SIAM, 1982).
15. B. Efron and T. Hastie, Computer Age Statistical Inference, Vol. 5 (Cambridge University Press, 2016).
16. B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap (CRC Press, 1994).
17. V. A. Epanechnikov, Nonparametric estimation of a multidimensional probability density, Teor. Veroyatn. Primen. 14(1), 156 (1969).
18. R. L. Eubank, Spline Smoothing and Nonparametric Regression (1988).
19. J. Fan, Design-adaptive nonparametric regression, J. Am. Stat. Assoc. 87(420), 998 (1992).
20. J. Fan, Local linear regression smoothers and their minimax efficiencies, Ann. Stat. 196 (1993).
21. J. Fan and I. Gijbels, Local Polynomial Modelling and Its Applications, Monographs on Statistics and Applied Probability 66, Vol. 66 (CRC Press, 1996).
22. J. Fan and J. S. Marron, Fast implementations of nonparametric curve estimators, J. Comput. Graph. Stat. 3(1), 35 (1994).
23. J. H. Friedman, Data mining and statistics: What's the connection?, Comput. Sci. Stat. 29(1), 3 (1998).
24. T. Gasser and H.-G. Müller, Kernel estimation of regression functions, Smoothing Techniques for Curve Estimation (Springer, 1979), pp. 23–68.
25. A. Gelman, Iterative and non-iterative simulation algorithms, Comput. Sci. Stat. 433 (1993).
26. A. Gelman and D. B. Rubin, Inference from iterative simulation using multiple sequences, Stat. Sci. 457 (1992).
27. S. Geman and D. Geman, Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, IEEE Trans. Pattern Anal. Mach. Intell. (6), 721 (1984).
28. C. Geyer, Introduction to Markov chain Monte Carlo, Handbook of Markov Chain Monte Carlo (2011), pp. 3–48.
29. W. R. Gilks, Markov Chain Monte Carlo (Wiley Online Library).
30. P. J. Green and B. W. Silverman, Nonparametric Regression and Generalized Linear Models, Vol. 58, Monographs on Statistics and Applied Probability (1994).
31. C. Gu, Cross-validating non-Gaussian data, J. Comput. Graph. Stat. 1(2), 169 (1992).
32. P. Hall, Large sample optimality of least squares cross-validation in density estimation, Ann. Stat. 1156 (1983).
33. W. Härdle, Applied Nonparametric Regression (Cambridge, UK, 1990).
34. W. Härdle, Smoothing Techniques: With Implementation in S (Springer Science & Business Media, 2012).
35. T. Hastie, gam: Generalized additive models, R package version 1.06.2 (2011).
36. T. Hastie and C. Loader, Local regression: Automatic kernel carpentry, Stat. Sci. 120 (1993).
37. T. J. Hastie and R. J. Tibshirani, Generalized Additive Models, Vol. 43 (CRC Press, 1990).
38. J. L. Hodges Jr and E. L. Lehmann, The efficiency of some nonparametric competitors of the t-test, Ann. Math. Stat. 324 (1956).
39. M. C. Jones, J. S. Marron and S. J. Sheather, A brief survey of bandwidth selection for density estimation, J. Am. Stat. Assoc. 91(433), 401 (1996).
40. V. Ya. Katkovnik, Linear and nonlinear methods of nonparametric regression analysis, Sov. Autom. Control 5, 25 (1979).
41. C. Kooperberg and C. J. Stone, A study of logspline density estimation, Comput. Stat. Data Anal. 12(3), 327 (1991).
42. C. Kooperberg, C. J. Stone and Y. K. Truong, Hazard regression, J. Am. Stat. Assoc. 90(429), 78 (1995).
43. C. Loader, Local Regression and Likelihood (Springer Science & Business Media, 2006).
44. Y. P. Mack and H.-G. Müller, Derivative estimation in nonparametric regression with random predictor variable, Sankhya: Indian J. Stat. A 59 (1989).
45. E. Mammen, O. Linton and J. Nielsen, The existence and asymptotic properties of a backfitting projection algorithm under weak conditions, Ann. Stat. 27(5), 1443 (1999).
46. A. D. Martin, K. M. Quinn and J. H. Park, MCMCpack: Markov chain Monte Carlo in R, J. Stat. Softw. 42(9), 1 (2011).
47. N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller and E. Teller, Equation of state calculations by fast computing machines, J. Chem. Phys. 21(6), 1087 (1953).
48. S. P. Meyn and R. L. Tweedie, Stability of Markovian processes II: Continuous-time processes and sampled chains, Adv. Appl. Probab. 487 (1993).
49. P. Mykland, L. Tierney and B. Yu, Regeneration in Markov chain samplers, J. Am. Stat. Assoc. 90(429), 233 (1995).
50. E. A. Nadaraya, On estimating regression, Theory Probab. Appl. 9(1), 141 (1964).
51. D. Nychka, Splines as local smoothers, Ann. Stat. 1175 (1995).
52. J. D. Opsomer, Asymptotic properties of backfitting estimators, J. Multivariate Anal. 73(2), 166 (2000).
53. E. Parzen, On estimation of a probability density function and mode, Ann. Math. Stat. 33, 1065 (1962).
54. M. H. Quenouille, Approximate tests of correlation in time-series 3, Math. Proc. Cambr. Philos. Soc. 45, 483 (1949).
55. A. E. Raftery and S. Lewis, How many iterations in the Gibbs sampler?, Bayesian Stat. 4(2), 763 (1992).
56. G. O. Roberts, Markov chain concepts related to sampling algorithms, Markov Chain Monte Carlo in Practice, Vol. 57 (1996).
57. M. Rosenblatt, Remarks on some nonparametric estimates of a density function, Ann. Math. Stat. 27(3), 832 (1956).
58. M. Rudemo, Empirical choice of histograms and kernel density estimators, Scand. J. Stat. 65 (1982).
59. D. Ruppert and M. P. Wand, Multivariate locally weighted least squares regression, Ann. Stat. 1346 (1994).
60. T. A. Ryan, B. L. Joiner and B. F. Ryan, Minitab Student Handbook (1976).
61. J. Shao and D. Tu, The Jackknife and Bootstrap (Springer Science & Business Media, 2012).
62. B. W. Silverman, Spline smoothing: The equivalent variable kernel method, Ann. Stat. 898 (1984).
63. B. W. Silverman, Density Estimation for Statistics and Data Analysis, Vol. 26 (CRC Press, 1986).
64. B. W. Silverman, Some aspects of the spline smoothing approach to non-parametric regression curve fitting, J. R. Stat. Soc. B (Methodol.) 1 (1985).
65. J. S. Simonoff, Smoothing Methods in Statistics (Springer Science & Business Media, 2012).
66. A. F. M. Smith and G. O. Roberts, Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods, J. R. Stat. Soc. B (Methodol.) 3 (1993).
67. C. J. Stone, Consistent nonparametric regression, Ann. Stat. 595 (1977).
68. C. J. Stone, Optimal rates of convergence for nonparametric estimators, Ann. Stat. 1348 (1980).
69. C. J. Stone, An asymptotically optimal window selection rule for kernel density estimates, Ann. Stat. 1285 (1984).
70. C. J. Stone, Additive regression and other nonparametric models, Ann. Stat. 689 (1985).
71. M. Stone, Cross-validatory choice and assessment of statistical predictions, J. R. Stat. Soc. B (Methodol.) 111 (1974).
72. A. Stuart and M. G. Kendall, The Advanced Theory of Statistics, Vol. 2 (Charles Griffin, 1973).
73. R Core Team, R: A Language and Environment for Statistical Computing (2013).
74. L. Tierney, Markov chains for exploring posterior distributions, Ann. Stat. 1701 (1994).
75. J. W. Tukey, Bias and confidence in not-quite large samples, Ann. Math. Stat. 29, 614 (1958).
76. J. W. Tukey, The future of data analysis, Ann. Math. Stat. 33(1), 1 (1962).
77. G. Wahba, Practical approximate solutions to linear operator equations when the data are noisy, SIAM J. Numer. Anal. 14(4), 651 (1977).
78. G. Wahba and Y. Wang, When is the optimal regularization parameter insensitive to the choice of the loss function?, Commun. Stat. Theory Methods 19(5), 1685 (1990).
79. M. P. Wand and B. D. Ripley, KernSmooth: Functions for kernel smoothing for Wand and Jones (1995), R package version 2.23-15 (2015).
80. G. S. Watson, Smooth regression analysis, Sankhya: The Indian J. Stat. A 359 (1964).
81. S. Wood, Generalized Additive Models: An Introduction with R (CRC Press, 2006).
82. S. N. Wood, mgcv: GAMs and generalized ridge regression for R, R News 1(2), 20 (2001).
83. S. N. Wood, Y. Goude and S. Shaw, Generalized additive models for large data sets, J. R. Stat. Soc. C (Appl. Stat.) 64(1), 139 (2015).
84. X. Zhang, B. U. Park and J.-L. Wang, Time-varying additive models for longitudinal data, J. Am. Stat. Assoc. 108(503), 983 (2013).
Bayesian networks: Theory, applications and sensitivity issues

Ron S. Kenett
KPA Ltd., Raanana, Israel
University of Turin, Turin, Italy
[email protected]
This chapter is about an important tool in the data science workbench, Bayesian networks (BNs). Data science is about generating information from a given data set using applications of statistical methods. The quality of the information derived from data analysis depends on various dimensions, including the communication of results, the ability to translate results into actionable tasks and the capability to integrate various data sources [R. S. Kenett and G. Shmueli, On information quality, J. R. Stat. Soc. A 177(1), 3 (2014)]. This chapter demonstrates, with three examples, how the application of BNs provides a high level of information quality. It expands the treatment of BNs as a statistical tool and provides a wider scope of statistical analysis that matches current trends in data science. For more examples of deriving high information quality with BNs see [R. S. Kenett and G. Shmueli, Information Quality: The Potential of Data and Analytics to Generate Knowledge (John Wiley and Sons, 2016), www.wiley.com/go/information quality]. The three examples used in the chapter are complementary in scope. The first example is based on expert opinion assessments of risks in the operation of health care monitoring systems in a hospital environment. The second example is from the monitoring of an open source community and is a data rich application that combines expert opinion, social network analysis and continuous operational variables. The third example is totally data driven and is based on an extensive customer satisfaction survey of airline customers. Section 1 is an introduction to BNs; Sec. 2 provides a theoretical background on BNs; examples are provided in Sec. 3; Sec. 4 discusses sensitivity analysis of BNs; Sec. 5 lists a range of software applications implementing BNs; Sec. 6 concludes the chapter.

Keywords: Graphical models; Bayesian networks; expert opinion; big data; causality models.
1. Bayesian Networks Overview

Bayesian networks (BNs) implement a graphical model structure known as a directed acyclic graph (DAG) that is popular in statistics, machine learning and artificial intelligence. BNs enable an effective representation and computation of the joint probability distribution (JPD) over a set of random variables.40 The structure of a DAG is defined by two sets: the set of nodes and the set of directed arcs. The nodes represent random variables and are drawn as circles labeled by the variable names. The arcs represent links among the variables and are drawn as arrows between nodes. In particular, an arc from node Xi to node Xj represents a relation between the corresponding variables. Thus, an arrow indicates that a value taken by variable Xj depends on the value taken by variable Xi. This property is used to reduce the number of parameters that are required to characterize the JPD of the variables. This reduction provides an efficient way to compute the posterior probabilities given the evidence present in the data.4,22,32,41,44

In addition to the DAG structure, which is often considered the "qualitative" part of the model, a BN includes "quantitative" parameters. These parameters are described by applying the Markov property, where the conditional probability distribution (CPD) at each node depends only on its parents. For discrete random variables, this conditional probability is represented by a table, listing the local probability that a child node takes on each of its feasible values, for each combination of values of its parents. The joint distribution of a collection of variables is determined uniquely by these local conditional probability tables (CPTs). In learning the network structure, one can include white lists of forced causality links imposed by expert opinion and black lists of links that are not to be included in the network. For examples of BN applications to study management efficiency, web site usability, operational risks, biotechnology, customer satisfaction surveys, healthcare systems and testing of web services see, respectively, Refs. 2, 24–27, 31 and 42. For examples of applications of BNs to education, banking, forensics and official statistics see Refs. 10, 34, 43, 46 and 47. The next section provides theoretical details on how BNs are learned and what their properties are.

2. Theoretical Aspects of Bayesian Networks

2.1. Parameter learning

To fully specify a BN, and thus represent the JPD, it is necessary to specify for each node X the probability distribution for X conditional upon X's parents. The distribution of X, conditional upon its parents, may have any form, with or without constraints. These conditional distributions include parameters which are often unknown and must be estimated from data, for example using maximum likelihood.
Direct maximization of the likelihood (or of the posterior probability) is usually based on the expectation–maximization (EM) algorithm, which alternates computing expected values of the unobserved variables conditional on observed data, with maximizing the complete likelihood assuming that previously computed expected values are correct. Under mild regularity conditions, this process converges to maximum likelihood (or maximum posterior) values of the parameters.19

A Bayesian approach treats parameters as additional unobserved variables and computes a full posterior distribution over all nodes conditional upon observed data, and then integrates out the parameters. This, however, can be expensive and leads to large dimension models, and in practice classical parameter-setting approaches are more common.37

2.2. Structure learning

BNs can be specified by expert knowledge (using white lists and black lists), learned from data, or a combination of both. The parameters of the local distributions are learned from data, priors elicited from experts, or both. Learning the graph structure of a BN requires a scoring function and a search strategy. Common scoring functions include the posterior probability of the structure given the training data, the Bayesian information criterion (BIC) and the Akaike information criterion (AIC). When fitting models, adding parameters increases the likelihood, which may result in over-fitting. Both BIC and AIC resolve this problem by introducing a penalty term for the number of parameters in the model, with the penalty term being larger in BIC than in AIC. The time requirement of an exhaustive search, returning a structure that maximizes the score, is super-exponential in the number of variables. A local search strategy makes incremental changes aimed at improving the score of the structure. A global search algorithm like Markov chain Monte Carlo (MCMC) can avoid getting trapped in local minima. A partial list of structure learning algorithms includes Hill-Climbing with the BIC and AIC score functions, Grow-Shrink, Incremental Association, Fast Incremental Association, Interleaved Incremental Association, hybrid algorithms and Phase Restricted Maximization. For more on BN structure learning, see Ref. 36.
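The case studies in this chapter use the GeNIe software; for readers working in R, the following sketch shows score-based structure learning with white and black lists using the bnlearn package (an illustration added here, not the tool used in the chapter).

# Hill-climbing search with the BIC score on a small synthetic dataset
# shipped with bnlearn; a whitelist forces an arc in, a blacklist keeps
# an arc out (expert knowledge, as described in the text).
library(bnlearn)
data(learning.test)
dag <- hc(learning.test, score = "bic",
          whitelist = data.frame(from = "A", to = "B"),
          blacklist = data.frame(from = "F", to = "A"))
fitted <- bn.fit(dag, learning.test)   # estimate the local CPTs
fitted$B                               # CPT of node B given its parents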
2.3. Causality and Bayesian networks

Causality analysis has been studied from two main different points of view: the "probabilistic" view and the "mechanistic" view. Under the probabilistic view, the causal effect of an intervention is judged by comparing the evolution of the system when the intervention is present and when it is not present. The mechanistic point of view focuses on understanding the mechanisms determining how specific effects come about. The interventionist and mechanistic viewpoints are not mutually exclusive. For example, when studying biological systems, scientists carry out experiments where they intervene on the system by adding a substance or by knocking out genes. However, the effect of a drug product on the human body cannot be decided only in the laboratory. A mechanistic understanding based on pharmacometrics models is a preliminary condition for determining if a certain medicinal treatment should be studied in order to elucidate biological mechanisms used to intervene and either prevent or cure a disease.

The concept of potential outcomes is present in the work on randomized experiments by Fisher and Neyman in the 1920s, and was extended by Rubin in the 1970s to non-randomized studies and different modes of inference.35 In their work, causal effects are viewed as comparisons of potential outcomes, each corresponding to a level of the treatment and each observable had the treatment taken on the corresponding level, with at most one outcome actually observed, the one corresponding to the treatment level realized. In addition, the assignment mechanism needs to be explicitly defined as a probability model for how units receive the different treatment levels. With this perspective, a causal inference problem is viewed as a problem of missing data, where the assignment mechanism is explicitly modeled as a process for revealing the observed data. The assumptions on the assignment mechanism are crucial for identifying and deriving methods to estimate causal effects.16

Imai et al.21 studied how to design randomized experiments to identify causal mechanisms. They study designs that are useful in situations where researchers can directly manipulate the intermediate variable that lies on the causal path from the treatment to the outcome. Such a variable is often referred to as a 'mediator'. Under the parallel design, each subject is randomly assigned to one of two experiments. In one experiment, only the treatment variable is randomized, whereas in the other, both the treatment and the mediator are randomized. Under the crossover design, each experimental unit is sequentially assigned to two experiments, where the first assignment is conducted randomly and the subsequent assignment is determined without randomization, on the basis of the treatment and mediator values in the previous experiment. They also propose designs that permit the use of indirect and subtle manipulation. Under the parallel encouragement design, experimental subjects who are assigned to the second experiment are randomly encouraged to take (rather than being assigned to) certain values of the mediator after the treatment has been randomized. Similarly, the crossover encouragement design employs randomized encouragement rather than direct manipulation in the second experiment. These two designs generalize the classical parallel and crossover designs in clinical trials, allowing for imperfect manipulation, thus providing informative inferences about causal mechanisms by focusing on a subset of the population.

Causal Bayesian networks are BNs where the effect of any intervention can be defined by a 'do' operator that separates intervention from conditioning. The basic idea is that intervention breaks the influence of a confounder so that one can make a true causal assessment. The established counterfactual definitions of direct and indirect effects depend on an ability to manipulate mediators.
BN graphical representations, based on local independence graphs and dynamic path analysis, can be used to provide an overview of dynamic relations.1 As an alternative, the econometric approach develops explicit models of outcomes, where the causes of effects are investigated and the mechanisms governing the choice of treatment are analyzed. In such investigations, counterfactuals are studied (counterfactuals are possible outcomes in different hypothetical states of the world). The study of causality in analyses of economic policies involves: (a) defining counterfactuals, (b) identifying causal models from idealized data of population distributions and (c) identifying causal models from actual data, where sampling variability is an issue.20 Pearl developed BNs as the method of choice for reasoning in artificial intelligence and expert systems, replacing earlier ad hoc rule-based systems. His extensive work covers topics such as causal calculus, counterfactuals, the do-calculus, transportability, missingness graphs, causal mediation, graph mutilation and external validity.38 In a heated head-to-head debate between the probabilistic and mechanistic views, Pearl has taken strong stands against the probabilistic view; see for example the paper by Baker3 and the discussion by Pearl.39 The work of Aalen et al.1 and Imai et al.21 shows how these approaches can be used in complementary ways. For more examples of BN applications, see Fenton and Neil.12–14 The next section provides three BN application examples.

3. Bayesian Network Case Studies

This section presents applications of BNs to three diverse case studies. The first case study is based on expert opinion assessment of risks in the monitoring of patients in a hospital; the second example is based on an analysis of an open source community and involves social network analysis (SNA) and data related to the development of software code; the third case is derived from a large survey of customers of an airline company.

3.1. Patient monitoring in a hospital

Modern medical devices incorporate alarms that trigger visual and audio alerts. Alarm-related adverse incidents can lead to patient harm and represent a key patient safety issue. Handling clinical alarms in the monitoring of hospitalized patients is a complex and critical task. It involves both addressing the response to alarms and the proper setting of control limits. Critical alarm hazards include (1) failure of staff to be informed of a valid alarm on time and take appropriate action, and (2) alarm fatigue, when staff is overwhelmed, distracted and desensitized. Preventing harm to patients requires scrutinizing how alarms are initiated, communicated and responded to.

To assess the conditions of alarm monitoring in a specific hospital, one can conduct a mapping of current alarms and their impact on patient safety using a methodology developed by ECRI.11 This is performed using a spreadsheet documenting any or all alarm signals that a team identifies as potentially important (e.g., high priority) for management. Figure 1 is an example of such a worksheet. It is designed to assess the decision processes related to alarms by comparing potentially important signals in such a way that the most important become more obvious.
Care area     | Alarm load | Obstacles | Alarm signal          | Device/system       | Importance | Risk     | Load/fatigue
ICU           | moderate   | low       | low alarm volume      | physiologic monitor | high       | high     | low
ICU           | moderate   | low       | bradycardia           | physiologic monitor | high       | high     | low
ICU           | moderate   | low       | tachycardia           | physiologic monitor | moderate   | moderate | moderate
ICU           | high       | low       | leads off             | physiologic monitor | high       | moderate | moderate
ICU           | moderate   | low       | low oxygen saturation | physiologic monitor | high       | high     | moderate
ICU           | moderate   | low       | low BP                | physiologic monitor | high       | high     | moderate
ICU           | moderate   | low       | high BP               | physiologic monitor | moderate   | moderate | moderate
RECOVERY      | high       | low       | low alarm volume      | physiologic monitor | high       | high     | low
RECOVERY      | high       | low       | bradycardia           | physiologic monitor | high       | moderate | high
RECOVERY      | high       | low       | tachycardia           | physiologic monitor | moderate   | moderate | high
RECOVERY      | high       | low       | leads off             | physiologic monitor | high       | moderate | high
RECOVERY      | moderate   | low       | low oxygen saturation | physiologic monitor | high       | high     | moderate
RECOVERY      | moderate   | low       | low BP                | physiologic monitor | high       | high     | moderate
RECOVERY      | moderate   | low       | high BP               | physiologic monitor | moderate   | moderate | moderate
Med-surg dep  | high       | high      | low alarm volume      | physiologic monitor | high       | high     | low
Med-surg dep  | high       | high      | bradycardia           | physiologic monitor | high       | moderate | high
Med-surg dep  | high       | high      | tachycardia           | physiologic monitor | moderate   | moderate | high
Med-surg dep  | high       | high      | leads off             | physiologic monitor | high       | moderate | high
Med-surg dep  | moderate   | high      | low oxygen saturation | physiologic monitor | high       | high     | high
Med-surg dep  | moderate   | high      | low BP                | physiologic monitor | high       | high     | high
Med-surg dep  | moderate   | high      | high BP               | physiologic monitor | moderate   | moderate | high

(Column key: Obstacles = obstacles to effective alarm communication or response; Importance = medical opinion of alarm importance; Risk = risk to patient from alarm malfunction or delayed caregiver response; Load/fatigue = contribution to alarm load/fatigue.)
Fig. 1. Spreadsheet for alarm monitoring risks.
Filling in the spreadsheet requires entering ratings derived from experts, from other tools (e.g., a nursing staff survey) as well as from other sources (e.g., input from medical staff). The first column in Fig. 1 specifies the care area. In the figure, one can see information related to the intensive care unit (ICU), the recovery room and the surgery department. The second column reflects the alarm load intensity. The third column shows an assessment of obstacles to effective alarm communication or response. The next column lists the different types of alarm signals from a specific device. In Fig. 1, we see only the part related to a physiologic monitor. The last three columns correspond to a medical opinion of alarm importance, the risk to the patient from an alarm malfunction or delay, and the contribution of the alarm to load and fatigue of the medical staff.

We show next how data from such a clinical alarm hazard spreadsheet can be analyzed with a BN that links care area, device, alarm signal, risks to patients and load on staff. The software we use in the analysis is GeNie version 2.0 from the University of Pittsburgh (http://genie.sis.pitt.edu). Figure 2 shows the GeNie data entry screen corresponding to the data shown in Fig. 1. The low, medium and high levels in Fig. 1 are represented here as s1, s2 and s3. Figure 3 shows the BN derived from the data in Fig. 2.

Figure 4 shows a diagnostic analysis where we condition the BN on a high level of fatigue. We can now see which conditions are the main contributors to this hazardous condition. Specifically, one can see that the percentage of the surgery department increased from 33% to 42% and high load from 36% to 43%.

Figure 5 is an example of predictive analysis where the BN is conditioned on ICU and bradycardia. We see that an alarm indicating a slower than normal heart rate (bradycardia) in the ICU, if not treated properly, increases the potential for high-level harm to patients from 39% to 52%, a very dramatic increase. In parallel, we see that the impact of this condition on light load to staff increased from 30% to 49%, and the option of a low level of obstacles to communicating the alarm increased from 55% to 78%.

In other words, the bradycardia alarm in the ICU does not increase fatigue, is very effectively communicated with a low level of obstruction, and is considered a higher-than-average alarm in terms of impact on patient safety. This analysis relies on expert opinion assessments and does not consider unobservable latent variables.

In this case study, the data was derived from expert opinion assessment of seven types of alarms from a patient monitoring device placed in three care delivery areas. The alarms are evaluated on various criteria such as contribution to staff fatigue, risk to patient from malfunction, obstacles to communicating alarms, etc. This assessment feeds a spreadsheet which is then analyzed using a BN. The analysis provides both diagnostic and predictive capabilities that support various improvement initiatives aimed at improving monitoring effectiveness.
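For readers who prefer a scriptable environment, the diagnostic and predictive conditioning just described can be approximated with bnlearn's cpquery function; all variable names, level names and the simulated data below are hypothetical stand-ins for the spreadsheet columns, added here purely for illustration.

library(bnlearn)
set.seed(1)
# Hypothetical stand-in for the alarm hazard spreadsheet
d <- data.frame(
  care_area    = factor(sample(c("ICU", "recovery", "surgery"), 500, TRUE)),
  alarm_signal = factor(sample(c("bradycardia", "low_BP", "leads_off"), 500, TRUE)),
  fatigue      = factor(sample(c("low", "high"), 500, TRUE)),
  risk         = factor(sample(c("low", "moderate", "high"), 500, TRUE)))
dag <- hc(d)               # learn a structure (illustrative only)
fit <- bn.fit(dag, d)      # estimate the CPTs

# Diagnostic query: condition on high fatigue, look back at care area
cpquery(fit, event = (care_area == "surgery"), evidence = (fatigue == "high"))
# Predictive query: condition on ICU and bradycardia, look forward at risk
cpquery(fit, event = (risk == "high"),
        evidence = (care_area == "ICU") & (alarm_signal == "bradycardia"))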
Fig. 2. Data entry of GeNIe software Version 2.0.
Fig. 3. BN of data collected in spreadsheet presented in Fig. 1.
Fig. 4. BN conditioned on high contributors to fatigue (diagnostic analysis).
3.2. Risk management of open source software

The second example is based on data collected from a community developing open source software (OSS). Risk management is a necessary and challenging task for organizations that adopt OSS in their products and in their software development process.33 Risk management in OSS adoption can benefit from data that is publicly available about the adopted OSS components, as well as from data that describes the behavior of OSS communities. This use case is derived from the RISCOSS project (www.riscoss.eu), a platform and related assessment methodology for managing risks in OSS adoption.15

As a specific example, we aim at understanding the roles of various members of the OSS community, and the relationships between them, through an analysis of the mailing lists or forums of the community.
Fig. 5. BN conditioned on ICU and bradycardia (predictive analysis).
The analysis should provide us with information on dimensions such as the timeliness of an OSS community, which can be measured by its capacity to follow a roadmap or to release fixes and evolutions of the software on time. The methods illustrated here are based on data coming from the XWiki OSS community (http://www.xwiki.org), an open source platform for developing collaborative applications and managing knowledge using the wiki metaphor. XWiki was originally written in 2003 and released at the beginning of 2004; since then, a growing community of users and contributors has gathered around it. The data consists of user and developer mailing list archives, IRC chat archives, code commits and code review comments, and information about bugs and releases. The community amounts to around 650,000 lines of code, with around 95 contributors responsible for around 29,000 commits, more than 200,000 messages, and 10,000 issues reported since 2004.

Specifically, we apply SNA to the interactions between members of the OSS community. Highlighting actors of importance to the network is a common task of SNA, and centrality measures are ways of representing this importance in a quantifiable way. A node's (or actor's) importance is considered to be the extent of its involvement in a network, and this can be measured in several ways. Centrality measures are usually applied to undirected networks; the corresponding indices for directed graphs are termed prestige measures. Degree centrality is the simplest way to quantify a node's importance: it counts the number of nodes a node is connected to, with high numbers interpreted as higher importance. The degree of a node therefore provides local information about its importance. For a review of SNA, see Ref. 45.

As an example of an SNA, we analyze the data from the XWiki community over five years, using data preprocessing of the IRC chat archives to extract the dynamics of the XWiki community over time. Some of the challenges included a chat format change towards the end of 2010 and ambiguous names for a single user (e.g., Vincent, VincentM, Vinny, Vinz); eventually, names were fixed manually.

Figure 6 represents the visual rendering of the dynamics, over time, of the community in terms of the intensity and kind of relationships between the different groups of actors (mainly contributors and managers of the community), which can be captured by community metrics such as degree centrality. The analysis has been performed using NodeXL (http://nodexl.codeplex.com), a tool and a set of operations for SNA. The NodeXL Network Overview, Discovery and Exploration add-in for Excel adds network analysis and visualization features to the spreadsheet. The core of NodeXL is a special Excel workbook template that structures data for network analysis and visualization. Six main worksheets currently form the template: there are worksheets for "Edges", "Vertices" and "Images", in addition to worksheets for "Clusters", for mappings of nodes to clusters ("Cluster Vertices"), and for a global overview of the network's metrics. A NodeXL workflow typically moves through steps such as importing the data, cleaning the data, calculating graph metrics, creating clusters, creating sub-graph images, preparing edge lists, expanding the worksheet with graphing attributes, and showing graphs such as those in Fig. 6.

Each of the social networks presented in Fig. 6 is characterized by a range of measures, such as the degree centrality of the various community groups. The dynamics of a social network is reflected by changing values of such measures over time. We call these, and other measures derived from the OSS community, risk drivers.
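As a concrete illustration of degree centrality, the following R sketch uses the igraph package on an invented "who replied to whom" edge list; the chapter's own analysis was done with NodeXL, so igraph serves here only as a stand-in.

    library(igraph)

    # Invented interactions extracted from a mailing list or chat archive.
    edges <- data.frame(
      from = c("vincent", "anna",  "anna", "marta", "vincent", "paul"),
      to   = c("anna",    "marta", "paul", "paul",  "marta",   "vincent"))

    # Degree centrality on the undirected interaction graph: the number of
    # members each actor interacts with.
    g <- graph_from_data_frame(edges, directed = FALSE)
    sort(degree(g), decreasing = TRUE)

    # On the directed version, in-degree acts as a simple prestige measure.
    gd <- graph_from_data_frame(edges, directed = TRUE)
    degree(gd, mode = "in")

Tracking how these values change between time windows is one way to capture the community dynamics that Fig. 6 renders visually.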
Fig. 6. SNA of the XWiki OSS community chats.
These risk drivers form the raw data used in the risk assessment, in this case of adopting the XWiki OSS. This data can be aggregated continuously using specialized data collectors. The data sources used in this analysis consisted of:

(1) Mailing list archives:
. XWiki users mailing list: http://lists.xwiki.org/pipermail/users
. XWiki devs mailing list: http://lists.xwiki.org/pipermail/devs
(2) IRC chat archives: http://dev.xwiki.org/xwiki/bin/view/IRC/WebHome
(3) Commits (via git): https://github.com/xwiki
(4) Code review comments available on GitHub
(5) Everything about bugs and releases: http://jira.xwiki.org

From this data and the SNA measures, one derives risk drivers that determine risk indicators. Examples of risk indicators include Timeliness (is the community responsive to open issues?), Activeness (what is the profile and number of active members?) and Forking (is the community likely to split?). These risk indicators represent an interpretation of the risk driver values by OSS experts. To link risk drivers to risk indicators, the RISCOSS project developed a methodology based on workshops in which experts are asked to rate alternative scenarios. An example of such scenarios is presented in Fig. 7.

The risk drivers are set at various levels and the expert is asked to evaluate the scenario in terms of the risk indicator. As an example, in scenario 30 of Fig. 7, the scenario consists of two forum posts per day, 11 messages per thread, a low amount of daily mails, and a small community with a high proportion of developers and a medium-sized number of testers and companies using the OSS. These conditions were determined as signifying low Activeness by the expert who did the rating. After running about 50 such scenarios, one obtains data that can be analyzed with a BN, as in the alarm monitoring use case. Such an analysis is shown in Fig. 8.

The BN linking risk drivers to risk indicators provides a framework for ongoing risk management of the OSS community. It gives a tool for exploring the impact of specific behaviors of the OSS community and for evaluating risk mitigation strategies. For more on such models, see Ref. 15.
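A minimal sketch of this driver-to-indicator linkage in R with bnlearn, assuming invented driver names, discretizations and expert ratings (the real ones came from the RISCOSS workshops):

    library(bnlearn)

    # Hypothetical expert-rated scenarios: three discretized risk drivers
    # and the rated Activeness indicator.
    set.seed(2)
    n  <- 50  # roughly the number of rated scenarios mentioned above
    lv <- c("low", "med", "high")
    scenarios <- data.frame(
      PostsPerDay   = factor(sample(lv, n, TRUE), levels = lv),
      MsgsPerThread = factor(sample(lv, n, TRUE), levels = lv),
      CommunitySize = factor(sample(lv, n, TRUE), levels = lv),
      Activeness    = factor(sample(lv, n, TRUE), levels = lv))

    # A whitelist forces the drivers to point at the indicator, encoding
    # the expert-imposed causal direction.
    wl  <- data.frame(from = c("PostsPerDay", "MsgsPerThread", "CommunitySize"),
                      to   = "Activeness")
    dag <- hc(scenarios, whitelist = wl)
    fit <- bn.fit(dag, scenarios)

    # Predictive use: probability of low Activeness for a small community
    # with few posts per day.
    cpquery(fit, event = (Activeness == "low"),
            evidence = (CommunitySize == "low") & (PostsPerDay == "low"))

Once fitted, the same network can be queried continuously as the data collectors refresh the risk driver values.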
Fig. 7. Alternative scenarios for linking risk drivers to the Activeness risk indicator.
Fig. 8. BN linking risk drivers to risk indicators.
3.3. Satisfaction survey from an airline company

The third case study is about a customer satisfaction survey presented in Ref. 8. The example consists of a typical customer satisfaction questionnaire directed at passengers of an airline company to evaluate their experience. The questionnaire contains questions (items) on passengers' satisfaction with their overall experience and with six service elements (departure, booking, check-in, cabin environment, cabin crew, meal). The evaluation of each item is based on a four-point scale (from 1 = extremely dissatisfied to 4 = extremely satisfied). Additional information on passengers was also collected, such as gender, age, nationality and the purpose of the trip. The data consists of responses in n = 9720 valid questionnaires. The goal of the analysis is to evaluate these six dimensions and the level of satisfaction with the overall experience, taking into account the interdependencies between the degrees of satisfaction with different aspects of the service. Clearly, these cannot be assumed to be independent of each other, and a BN therefore presents a particularly well-suited tool for this kind of analysis. For more examples of BN applications to customer survey analysis, see Refs. 5 and 27. Figure 9 is a BN of this data constructed using the Hill-Climbing algorithm with the AIC score function. The proportion of customers expressing very high overall satisfaction (a rating of "4") is 33%.

In Fig. 10, one sees the conditioning of the BN on extremely dissatisfied customers (left) and extremely satisfied customers (right). The examples in Fig. 10 show how a BN provides an efficient profiling of very satisfied and very unsatisfied customers. This diagnostic information is a crucial input to initiatives aimed at improving customer satisfaction; contrasting very satisfied with very unsatisfied customers is typically an effective way to generate insights for achieving this goal. As in the previous two examples, the BN considered here is focused on observable data. This can be expanded by including modeling of latent variable effects. For an example of latent variables in the analysis of a customer satisfaction survey, see Ref. 17. For more considerations on how to determine causality, see Sec. 2.
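The following R sketch mirrors this analysis with bnlearn: structure learning by Hill-Climbing with the AIC score, followed by conditioning on the satisfaction extremes as in Fig. 10. The simulated four-point ratings are placeholders for the n = 9720 questionnaires, which are not reproduced here.

    library(bnlearn)

    # Simulated 4-point ratings standing in for the real survey responses.
    set.seed(3)
    n <- 1000
    item <- function() factor(sample(1:4, n, TRUE), levels = 1:4)
    survey <- data.frame(
      Departure = item(), Booking = item(), Checkin = item(),
      Cabin = item(), Crew = item(), Meal = item(), Experience = item())

    # Structure learning: Hill-Climbing with the AIC score, as in Fig. 9.
    dag <- hc(survey, score = "aic")
    fit <- bn.fit(dag, survey)

    # Profiling one extreme, as in Fig. 10 (right): the distribution of the
    # booking rating among extremely satisfied customers.
    sim <- cpdist(fit, nodes = "Booking", evidence = (Experience == "4"))
    prop.table(table(sim$Booking))

Repeating the query with evidence = (Experience == "1") yields the profile of extremely dissatisfied customers, and contrasting the two distributions reproduces the comparison displayed in Fig. 10.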
Fig. 9. BN of airline passenger customer satisfaction survey.
Fig. 10. BN conditioned on extremely dissatisfied customers (left) and extremely satisfied customers (right).
Table 1. BN arcs with the proportion of occurrence of each arc in the bootstrap replicates.
Arc                    | hc-bic | hc-aic | tabu-bic | tabu-aic | gs  | iamb | fiamb | intamb | mmhc-bic | mmhc-aic | rsmax | tot
Booking → Checkin      | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.5 | 0.5 | 0.5 | 1.0 | 1.0 | 1.0 | 9.5
Cabin → Crew           | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 6.0
Cabin → Departure      | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 9.0
Cabin → Experience     | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 9.0
Cabin → Meal           | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 11.0
Crew → Booking         | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 6.0
Crew → Departure       | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 7.0
Crew → Experience      | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 9.0
Departure → Booking    | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 6.0
Departure → Checkin    | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 7.0
Departure → Experience | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 9.0
Meal → Crew            | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 7.0
Cabin → Booking        | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0
Crew → Checkin         | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 4.0
Meal → Departure       | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0
Meal → Experience      | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 6.0
Booking → Departure    | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 5.0
Crew → Meal            | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 4.0
Departure → Meal       | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0
Checkin → Booking      | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.5 | 0.5 | 0.5 | 0.0 | 0.0 | 0.0 | 1.5
Checkin → Departure    | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 4.0
Booking → Experience   | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 2.0
Checkin → Experience   | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 2.0
Checkin → Crew         | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0
Departure → Cabin      | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0
4. Sensitivity Analysis of Bayesian Networks

The three examples presented above show how a BN can be used as a decision support tool for determining which predictor variables are important on the basis of their effect on target variables. In such an analysis, choosing an adequate BN structure is a critical task. In practice, there are several algorithms available for determining the BN structure, each one with its specific characteristics. For example, the R package bnlearn includes eleven algorithms: two score-based learning algorithms (Hill-Climbing with score functions BIC and AIC, and TABU with score functions BIC and AIC), five constraint-based learning algorithms (Grow-Shrink, Incremental Association, Fast Incremental Association, Interleaved Incremental Association, Max-Min Parents and Children), and two hybrid algorithms (MMHC with score functions BIC and AIC, and Phase Restricted Maximization). In this section, we present an approach for performing a sensitivity analysis of a BN across the various structure learning algorithms, in order to assess the robustness of the specific BN one plans to use. Following the application of different learning algorithms to set up a BN structure, some arcs in the network are recurrently present and some are not. As a basis for designing a robust BN, one can compute how often an arc is present, across the various algorithms, with respect to the total number of networks examined.

Table 1 shows the impact of the 11 learning algorithms implemented in the bnlearn R package. The last column represents the total number of arcs across the 11 algorithms. The robust structure is defined by the arcs that appear in a majority of the learned networks. For these variables, the link connection does not depend on the learning algorithm, and the derived prediction is therefore considered robust. For more on this topic and related sensitivity analysis issues, see Ref. 8.

After selection of a robust network, one can perform what-if sensitivity scenario analysis. These scenarios are computer experiments on a BN performed by conditioning on specific variable combinations and predicting the target variables using the empirically estimated network. We can then analyze the effect of variable combinations on target distributions using the type of conditioning demonstrated in Sec. 3. The next section provides an annotated listing of various software products implementing BNs.
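A minimal sketch of this arc-frequency computation with bnlearn, using the package's bundled learning.test data set in place of the survey data and a subset of the eleven algorithm configurations:

    library(bnlearn)
    data(learning.test)  # bundled discrete data set, variables A-F

    # One structure per learning-algorithm configuration; cextend() turns
    # the partially directed graphs returned by constraint-based methods
    # into consistent DAGs so that arcs() is comparable across methods.
    nets <- list(
      "hc-bic"   = hc(learning.test, score = "bic"),
      "hc-aic"   = hc(learning.test, score = "aic"),
      "tabu-bic" = tabu(learning.test, score = "bic"),
      "tabu-aic" = tabu(learning.test, score = "aic"),
      "gs"       = cextend(gs(learning.test)),
      "iamb"     = cextend(iamb(learning.test)),
      "mmhc"     = mmhc(learning.test))

    # Proportion of learned networks in which each arc appears.
    all_arcs <- unique(do.call(rbind, lapply(nets, arcs)))
    freq <- apply(all_arcs, 1, function(a)
      mean(sapply(nets, function(net)
        any(arcs(net)[, 1] == a[1] & arcs(net)[, 2] == a[2]))))
    data.frame(all_arcs, freq)[order(-freq), ]

    # Arcs present in a majority of the learned networks define the
    # robust structure.
    robust <- all_arcs[freq > 0.5, , drop = FALSE]

Running all eleven configurations of the text, rather than the seven sketched here, reproduces a table of the same form as Table 1.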
5. Software for Bayesian Network Applications

(i) Graphical Network Interface (GeNIe) is the graphical interface to the Structural Modeling, Inference, and Learning Engine (SMILE), a fully portable Bayesian inference engine developed by the Decision Systems Laboratory of the University of Pittsburgh and thoroughly field tested since 1998. Up to version 2.0, GeNIe could be freely downloaded from http://genie.sis.pitt.edu with no restrictions on applications or otherwise. Version 2.1 is now available from http://www.bayesfusion.com/ with commercial and academic versions, user guides and related documentation.

(ii) Hugin (http://www.hugin.com) is a commercial software which provides a variety of products for both research and nonacademic use. The HUGIN Decision Engine (HDE) implements state-of-the-art algorithms for BNs and influence diagrams such as object-oriented modeling, learning from data with both continuous and discrete variables, value of information analysis, sensitivity analysis and data conflict analysis.

(iii) IBM SPSS Modeler (http://www-01.ibm.com/software/analytics/spss) is a general application for analytics that has incorporated the Hugin tool for running BNs (http://www.ibm.com/developerworks/library/wa-bayes1). IBM SPSS is not free software.

(iv) The R bnlearn package is powerful and free. Compared with other available BN software programs, it is able to perform both constraint-based and score-based methods. It implements five constraint-based learning algorithms (Grow-Shrink, Incremental Association, Fast Incremental Association, Interleaved Incremental Association, Max-Min Parents and Children), two score-based learning algorithms (Hill-Climbing, TABU) and two hybrid algorithms (MMHC, Phase Restricted Maximization); the corresponding function calls are sketched after this list.

(v) Bayesia (http://www.bayesia.com) developed proprietary technology for BN analysis. In collaboration with research labs and big research projects, the company develops innovative technology solutions. Its products include (1) BayesiaLab, a BN publishing and automatic learning program which represents expert knowledge and allows one to find it among a mass of data; (2) Bayesia Market Simulator, a market simulation software package which can be used to compare the influence of a set of competing offers in relation to a defined population; (3) Bayesia Engines, a library of software components through which one can integrate the modeling and use of BNs; and (4) Bayesia Graph Layout Engine, a library of software components used to integrate the automatic positioning of graphs in a specific application.

(vi) Inatas (www.inatas.com) provides the Inatas System Modeller software package for both research and commercial use. The software permits the generation of networks from data and/or expert knowledge. It also permits the generation of ensemble models and the introduction of decision-theoretic elements for decision support or, through the use of a real-time data feed API, system automation. A cloud-based service with a GUI is in development.

(vii) SamIam (http://reasoning.cs.ucla.edu/samiam) is a comprehensive tool for modeling and reasoning with BNs, developed in Java by the Automated Reasoning Group at UCLA. SamIam includes two main components: a graphical user interface and a reasoning engine. The graphical interface lets users develop BN models and save them in a variety of formats. The reasoning engine supports many tasks including: classical inference; parameter estimation; time-space tradeoffs; sensitivity analysis; and explanation-generation based on MAP and MPE.

(viii) BNT (https://code.google.com/p/bnt) supports many types of CPDs (nodes), decision and utility nodes, static and dynamic BNs, and many different inference algorithms and methods for parameter learning. The source code is extensively documented, object-oriented and free, making it an excellent tool for teaching, research and rapid prototyping.

(ix) Agenarisk (http://www.agenarisk.com/) is able to handle continuous nodes without the need for static discretization. It enables decision-makers to measure and compare different risks in a way that is repeatable and auditable, and is ideal for risk scenario planning.
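For reference, here is a sketch of the bnlearn entry points corresponding to the algorithm families named in item (iv), using the package's bundled learning.test data set; each call returns a learned network structure.

    library(bnlearn)
    data(learning.test)

    hc(learning.test)          # score-based: Hill-Climbing
    tabu(learning.test)        # score-based: TABU search
    gs(learning.test)          # constraint-based: Grow-Shrink
    iamb(learning.test)        # constraint-based: Incremental Association
    fast.iamb(learning.test)   # constraint-based: Fast Incremental Association
    inter.iamb(learning.test)  # constraint-based: Interleaved Incremental Association
    mmpc(learning.test)        # constraint-based: Max-Min Parents and Children
    mmhc(learning.test)        # hybrid: Max-Min Hill-Climbing
    rsmax2(learning.test)      # hybrid: general two-phase restricted maximization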
6. Discussion

This chapter presents several examples of BNs in order to illustrate their wide range of relevance, from expert opinion-based data to big data applications. In all these cases, BNs have helped enhance the quality of the information derived from an analysis of the available data sets. In Secs. 1 and 2, we describe various technical aspects of BNs, including estimation of distributions and algorithms for learning the BN structure. In learning the network structure, one can include white lists of forced causality links imposed by expert opinion, and black lists of links that are not to be included in the network, again using inputs from content experts. This essential feature permits an effective dialogue with content experts, who can thereby impact the model used for data analysis. We also briefly discuss statistical inference of causality links, a very active area of research. In general, BNs provide a very effective descriptive causality analysis with a natural graphical display. A comprehensive approach to BNs, with applications in sports, medicine and risks, is provided by the Bayes Knowledge project (http://bayes-knowledge.com/). Additional examples of applications of BNs in the context of the quality of the generated information are included in Refs. 23 and 29.

References

1. O. Aalen, K. Røysland and J. M. Gran, Causality, mediation and time: A dynamic viewpoint, J. R. Stat. Soc. A 175(4), 831 (2012).
2. X. Bai, R. S. Kenett and W. Yu, Risk assessment and adaptive group testing of semantic web services, Int. J. Softw. Eng. Knowl. Eng. 22(5), 595 (2012).
3. S. Baker, Causal inference, probability theory, and graphical insights, Stat. Med. 32(25), 4319 (2013).
4. I. Ben Gal, Bayesian networks, in Encyclopedia of Statistics in Quality and Reliability, eds. F. Ruggeri, R. S. Kenett and F. Faltin (Wiley, UK, 2007).
5. A. C. Constantinou, N. Fenton, W. Marsh and L. Radlinski, From complex questionnaire and interviewing data to intelligent Bayesian network models for medical decision support, Artif. Intell. Med. 67, 75 (2016).
6. G. F. Cooper, The computational complexity of probabilistic inference using Bayesian belief networks, Artif. Intell. 42, 393 (1990).
7. C. Cornalba, R. S. Kenett and P. Giudici, Sensitivity analysis of Bayesian networks with stochastic emulators, ENBIS-DEINDE Proc., University of Torino, Torino, Italy (2007).
8. F. Cugnata, R. S. Kenett and S. Salini, Bayesian networks in survey data: Robustness and sensitivity issues, J. Qual. Technol. 48, 253 (2016).
9. L. DallaValle, Official statistics data integration using vines and nonparametric Bayesian belief nets, Qual. Technol. Quant. Manage. 11(1), 111 (2014).
10. M. Di Zio, G. Sacco, M. Scanu and P. Vicard, Multivariate techniques for imputation based on Bayesian networks, Neural Netw. World 4, 303 (2005).
11. ECRI, The Alarm Safety Handbook: Strategies, Tools, and Guidance (ECRI Institute, Plymouth Meeting, Pennsylvania, USA, 2014).
12. N. E. Fenton and M. Neil, The use of Bayes and causal modelling in decision making, uncertainty and risk, UPGRADE, Eur. J. Inf. Prof. CEPIS (Council of European Professional Informatics Societies) 12(5), 10 (2011).
13. N. E. Fenton and M. Neil, Risk Assessment and Decision Analysis with Bayesian Networks (CRC Press, 2012), http://www.bayesianrisk.com.
14. N. E. Fenton and M. Neil, Decision support software for probabilistic risk assessment using Bayesian networks, IEEE Softw. 31(2), 21 (2014).
15. X. Franch, R. S. Kenett, A. Susi, N. Galanis, R. Glott and F. Mancinelli, Community data for OSS adoption risk management, in The Art and Science of Analyzing Software Data, eds. C. Bird, T. Menzies and T. Zimmermann (Morgan Kaufmann, 2016).
16. B. Frosini, Causality and causal models: A conceptual perspective, Int. Stat. Rev. 74, 305 (2006).
17. M. Gasparini, F. Pellerey and M. Proietti, Bayesian hierarchical models to analyze customer satisfaction data for quality improvement: A case study, Appl. Stoch. Model Bus. Ind. 28, 571 (2012).
18. A. Harel, R. S. Kenett and F. Ruggeri, Modeling web usability diagnostics on the basis of usage statistics, in Statistical Methods in eCommerce Research, eds. W. Jank and G. Shmueli (John Wiley & Sons, New Jersey, USA, 2009).
19. D. Heckerman, A tutorial on learning with Bayesian networks, Microsoft Research technical report MSR-TR-95-06, revised November 1996, from http://research.microsoft.com.
20. J. Heckman, Econometric causality, Int. Stat. Rev. 76, 1 (2008).
21. K. Imai, D. Tingley and T. Yamamoto, Experimental designs for identifying causal mechanisms, J. R. Stat. Soc. A 176(1), 5 (2013).
22. F. V. Jensen, Bayesian Networks and Decision Graphs (Springer, 2001).
23. R. S. Kenett, On generating high infoQ with Bayesian networks, Qual. Technol. Quant. Manag. 13(3), (2016), http://dx.doi.org/10.1080/16843703.2016.11891.
24. R. Kenett, A. De Frenne, X. Tort-Martorell and C. McCollin, The statistical efficiency conjecture, in Applying Statistical Methods in Business and Industry — The State of the Art, eds. T. Greenfield, S. Coleman and R. Montgomery (John Wiley & Sons, Chichester, UK, 2008).
25. R. S. Kenett, Risk analysis in drug manufacturing and healthcare, in Statistical Methods in Healthcare, eds. F. Faltin, R. S. Kenett and F. Ruggeri (John Wiley & Sons, 2012).
26. R. S. Kenett and Y. Raanan, Operational Risk Management: A Practical Approach to Intelligent Data Analysis (John Wiley & Sons, Chichester, UK, 2010).
27. R. S. Kenett and S. Salini, Modern Analysis of Customer Satisfaction Surveys: With Applications Using R (John Wiley & Sons, Chichester, UK, 2011).
28. R. S. Kenett and G. Shmueli, On information quality, J. R. Stat. Soc. A 177(1), 3 (2014).
29. R. S. Kenett and G. Shmueli, Information Quality: The Potential of Data and Analytics to Generate Knowledge (John Wiley & Sons, 2016), www.wiley.com/go/information quality.
30. R. S. Kenett and S. Zacks, Modern Industrial Statistics: With Applications Using R, MINITAB and JMP, 2nd edn. (John Wiley & Sons, Chichester, UK, 2014).
31. R. S. Kenett, A. Harel and F. Ruggeri, Controlling the usability of web services, Int. J. Softw. Eng. Knowl. Eng. 19(5), 627 (2009).
32. T. Koski and J. Noble, Bayesian Networks — An Introduction (John Wiley & Sons, Chichester, UK, 2009).
33. J. Li, R. Conradi, O. Slyngstad, M. Torchiano, M. Morisio and C. Bunse, A state-of-the-practice survey of risk management in development with off-the-shelf software components, IEEE Trans. Softw. Eng. 34(2), 271 (2008).
34. D. Marella and P. Vicard, Object-oriented Bayesian networks for modelling the respondent measurement error, Commun. Stat. — Theory Methods 42(19), 3463 (2013).
35. F. Mealli, B. Pacini and D. B. Rubin, Statistical inference for causal effects, in Modern Analysis of Customer Satisfaction Surveys: With Applications Using R, eds. R. S. Kenett and S. Salini (John Wiley and Sons, Chichester, UK, 2012).
36. F. Musella, A PC algorithm variation for ordinal variables, Comput. Stat. 28(6), 2749 (2013).
37. R. E. Neapolitan, Learning Bayesian Networks (Prentice Hall, 2003).
38. J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference (Morgan Kaufmann, 1988).
39. J. Pearl, Comment on causal inference, probability theory, and graphical insights (by Stuart G. Baker), UCLA Cognitive Systems Laboratory, Stat. Med. 32(25), 4331 (2013).
40. J. Pearl, Bayesian networks: A model of self-activated memory for evidential reasoning, UCLA Technical Report CSD-850017, Proc. 7th Conf. Cognitive Science Society (University of California, Irvine, CA), pp. 329–334.
41. J. Pearl, Causality: Models, Reasoning, and Inference, 2nd edn. (Cambridge University Press, UK, 2009).
42. J. Peterson and R. S. Kenett, Modelling opportunities for statisticians supporting quality by design efforts for pharmaceutical development and manufacturing, Biopharmaceut. Rep. 18(2), 6 (2011).
43. L. D. Pietro, R. G. Mugion, F. Musella, M. F. Renzi and P. Vicard, Reconciling internal and external performance in a holistic approach: A Bayesian network model in higher education, Expert Syst. Appl. 42(5), 2691 (2015).
44. O. Pourret, P. Naïm and B. Marcot, Bayesian Networks: A Practical Guide to Applications (John Wiley & Sons, Chichester, UK, 2008).
45. M. Salter-Townshend, A. White, I. Gollini and T. B. Murphy, Review of statistical network analysis: Models, algorithms, and software, Stat. Anal. Data Min. 5(4), 243 (2012).
46. C. Tarantola, P. Vicard and I. Ntzoufras, Monitoring and improving Greek banking services using Bayesian networks: An analysis of mystery shopping data, Expert Syst. Appl. 39(11), 10103 (2012).
47. P. Vicard, A. P. Dawid, J. Mortera and S. L. Lauritzen, Estimation of mutation rates from paternity casework, Forensic Sci. Int. Genet. 2, 9 (2008).
GLiM: Generalized linear models

Joseph R. Barr*,‡ and Shelemyahu Zacks†,§
*HomeUnion, Irvine, California
†State University of New York (Emeritus), Binghamton, New York
‡[email protected]
§[email protected]
Much has been written on generalized linear models. The reader is referred to the following books: P. McCullagh and J. A. Nelder, Generalized Linear Models, 2nd edn. (Chapman & Hall, 1989); A. J. Dobson and A. Barnett, An Introduction to Generalized Linear Models, 3rd edn. (Chapman & Hall, 2008); C. E. McCulloch and S. R. Searle, Generalized Linear and Mixed Models (2008); R. H. Myers and D. C. Montgomery, Generalized Linear Models: With Applications in Engineering and the Sciences (2010); A. Agresti, Foundations of Linear and Generalized Linear Models, Wiley Series in Probability and Statistics (Wiley, New York, 2015); J. J. Faraway, Linear Models with R, 2nd edn. (Chapman & Hall, 2014). We present here the main ideas.

Keywords: GLiM; exponential families; MLE; model specification; goodness-of-fit.