Searching the news
Using a rich ontology with time-bound roles to search through annotated newspaper archives

Wouter van Atteveldt¹, Nel Ruigrok², Stefan Schlobach¹, and Frank van Harmelen¹

¹ Department of Artificial Intelligence, Free University Amsterdam, De Boelelaan 1071, 1071 HV Amsterdam
fwva,schlobac,[email protected]
² The Netherlands News Monitor, University of Amsterdam, Kloveniersburgwal 48, 1012 CX Amsterdam
[email protected]

Abstract. A frequent motivation for annotating documents using ontologies is to allow more efficient search. For collections of newspaper articles, it is often difficult to find specific articles based on keywords or topics alone. This paper describes a system that uses a formalisation of the content of newspaper articles to answer complex queries. The data for this system is created using Relational Content Analysis, a method used in Communication Sciences in which documents are annotated using a rich annotation scheme based on an ontology that includes political roles with temporal validity. Using custom inferencing over the temporal relations and query translation, our system can be used to search for and browse through newspaper articles and to perform systematic analyses by evaluating queries against all articles in the corpus. This makes the system useful both for the (Social) Scientist and for interested laypersons.

1 Introduction

A number of services exist that offer keyword-based search of news content, such as Google News³ and LexisNexis⁴. Such services have severe limitations, however. The first difficulty is the semantic gap between keywords and meaning that is always present in keyword-based search. Another limitation is that it is impossible to look for a relation, such as positive or negative, between two concepts without specifying all of its possible lexical representations. For example, a keyword-based query 'Blair support EU' will not return documents containing 'Prime Minister praises new Commission.'
A third limitation is that search is generally bounded by articles or sentences as the only possible unit of search, meaning that one cannot search for 'Two articles within one week in which a politician's stance on a topic has changed.'

³ http://news.google.com
⁴ http://www.lexisnexis.com

Queries such as the ones above require metadata not just about the topic and publishing details of an article but also about the content of that article. Generating such metadata automatically is a formidable challenge, but there are large corpora of articles manually annotated by Communication and Political Scientists that can be used. This paper is based on an analysis of the news coverage of the 2006 Dutch parliamentary election campaign, which was annotated manually using a rich annotation schema, summarised as answering: "Who says what about whom/what according to whom?" The concepts in this annotation are drawn from a detailed ontology of (political) actors and issues, including time-dependent political function and party membership information.

The system presented here utilizes these annotations for semantically informed search through the newspaper corpus, inspecting the results quantitatively and visually, and retrieving the original articles the results are derived from. Based on a formalised version of the annotation of the newspaper content, the system allows for very sophisticated queries. Moreover, since these annotations are based on a rich background ontology, it is possible to ask general queries as well as very detailed ones, bridging the gap between abstract concepts and concrete representation. Finally, we perform automatic reasoning over the validity of the temporal political functions, making it possible to use a political function such as a minister in the query, which yields answers from statements about the various ministers during their respective time in office.
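To illustrate the idea of time-bound roles, the following minimal Python sketch resolves a query for a political function to the statements about whoever held that function on the publication date. It is only an illustration under stated assumptions: the class names, field names, actors, and dates are hypothetical and invented for this example, and the sketch does not attempt to reproduce the custom inferencing over the ontology that the system itself uses.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class RoleAssignment:
    """Hypothetical record: an actor holds a role during [start, end]."""
    actor: str
    role: str
    start: date
    end: Optional[date]  # None means the role is still held


@dataclass(frozen=True)
class Statement:
    """Hypothetical annotated statement about an actor in an article."""
    article_id: int
    published: date
    subject: str


def holds_role(actor: str, role: str, on: date,
               roles: List[RoleAssignment]) -> bool:
    """True if `actor` holds `role` on the given date."""
    return any(
        r.actor == actor and r.role == role
        and r.start <= on and (r.end is None or on <= r.end)
        for r in roles
    )


def statements_by_role(role: str, statements: List[Statement],
                       roles: List[RoleAssignment]) -> List[Statement]:
    """Keep statements whose subject held `role` when the article appeared."""
    return [s for s in statements
            if holds_role(s.subject, role, s.published, roles)]


# Invented example data: Actor B succeeds Actor A as minister.
ROLES = [
    RoleAssignment("Actor A", "minister", date(2003, 5, 27), date(2006, 9, 21)),
    RoleAssignment("Actor B", "minister", date(2006, 9, 22), None),
]
STATEMENTS = [
    Statement(1, date(2006, 9, 1), "Actor A"),   # A is minister
    Statement(2, date(2006, 10, 1), "Actor A"),  # A no longer minister
    Statement(3, date(2006, 10, 1), "Actor B"),  # B is minister
    Statement(4, date(2006, 9, 1), "Actor B"),   # B not yet minister
]

matches = statements_by_role("minister", STATEMENTS, ROLES)
print([s.article_id for s in matches])  # [1, 3]
```

A query for 'minister' thus returns statements about both office holders, each restricted to his or her own time in office, without the user having to name either of them.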
The primary users of this system are Social Scientists investigating communication processes and effects. In the preparatory phase of such an investigation, a researcher can use the system to form an understanding of the corpus and to formalise the concepts he or she is interested in. In the analysis phase, the researcher can use the system to query the corpus using the formalised concepts and export the results for statistical analysis. In the post-analysis phase, the researcher can use the system to check the results, retrieve interesting articles for qualitative sense-making, and obtain quotes and examples for writing about the results.

It should be noted, however, that this system can also be very interesting for users outside academia. Often, the annotated material is of high relevance to society, such as election campaigns and high-profile issues such as the war in Iraq or Middle East policy. This material can be very interesting to politicians, civil society groups, NGOs, and citizens. The system described in this paper can help them search through the corpus for general trends or specific patterns.

The contribution of this paper is twofold. Firstly, using the annotations and the ontology supplied by the Social Scientists, we are able to create and test a search system which showcases the benefits of formalised metadata, such as querying at various levels of abstraction, searching for potentially highly complex patterns, and reasoning about temporal roles. Secondly, the system presented here makes the annotations created by the Social Scientists more accessible, making it easier for the scientist to perform both quantitative and qualitative analysis, and providing other users with a richer way of searching for news.

1.1 Knowledge Representation and Content Analysis

In this paper we present a system for searching for specific patterns in an annotated newspaper archive.
In our vision, there is a potential for synergy between Knowledge Representation, especially Semantic Web techniques, and Content Analysis. Studying the complex relationship between politics, the media, and the public requires very large data sets. In order to find general statistical patterns, these data sets often need to span multiple countries and events. Since this data is expensive to obtain, and often needs detailed knowledge of the subject language and society, it is important that researchers are able to combine and share data sets. Additionally, since there are many competing theories and methodologies for analysing news patterns and effects, these data sets should lend themselves to multiple analyses rather than being specific to one study.

We believe that using Semantic Web techniques can help alleviate these problems. We propose annotating as close to the text as possible, and using a formal ontology to aggregate the detailed objects found in the text to the theoretical concepts needed for the analysis. This minimises the amount of interpretation done by the annotators, and thus the potential for unreliable coding. Additionally, this makes it possible to combine the analyses of the different countries or time periods in a single scheme, where it is possible to have a concept such as 'opposition politician' map to different objects depending on time and place. Moreover, since it is possible to combine different aggregation schemes in one ontology, this allows for the same data to be used for different analyses. Finally, it is possible to express interesting variables, such as political frames like 'strategic framing' or 'internal conflict', as formal patterns, for example as a SeRQL query or OWL definition. This makes the process of aggregating and analyzing data more transparent, and makes it easier for researchers to duplicate and expand upon studies from other groups.

This is not a one-way street.
Content Analysts have been annotating texts for the last decades, and many methods have been developed to perform systematic annotation and evaluate annotation quality (Holsti, 1969; Krippendorff, 2004). As manual annotation is a necessary part of the Semantic Web vision, either for creating the metadata directly or for bootstrapping and evaluating Machine Learning tools, these techniques can play an important role. Moreover, the rich annotated corpora such as the one used in this paper can be very useful data for building and testing systems to show the usefulness of the Semantic Web techniques.

Previously, we have shown how to use hybrid logic to define concepts within annotated media material (Van Atteveldt and Schlobach, 2005). This can be directly generalised to using OWL to define such concepts, and also shows the limits of this approach because of the lack of variable binding in OWL. In (Van Atteveldt et al., 2007) we provide an in-depth discussion of the possibilities and challenges in using RDF to formalise media data. Finally, in (Van Atteveldt et al., 2006), we propose a vocabulary and formalisation standard for converting content analysis data to RDF.

Structure of this paper

The following section will describe the corpus used in this paper and the Relational Content Analysis that was used for the annotation. Section three will describe the way the data were coded formally and give an overview of the ontology used for this encoding. The fourth section will provide more information about the implemented system, and the fifth section will discuss its usage and usefulness.

2 Domain

In this section we describe the annotated data corpus on which this study is based, a collection of annotated newspapers and TV news items about the 2006 Dutch election campaign.
After a brief description of the campaign we will describe the method used for annotating the data and some aspects of the corpus this resulted in.

2.1 Data Collection

The use case presented in this study is based on a content analysis of political news from August 14th (first party manifesto) until November 22nd (Election Day) in six daily newspapers and two television news programs.