The Pennsylvania State University

The Graduate School

College of Earth and Mineral Sciences

MAPPING SEMANTIC AND SPATIAL MEDIASCAPES IN THE CATALONIAN

INDEPENDENCE MOVEMENT: GEOPOLITICS, SPORTS, AND BLACK

BOXES

A Dissertation in

Geography

by

Samuel K. Stehle

© 2017 Samuel K. Stehle

Submitted in Partial Fulfillment Of the Requirements for the Degree of

Doctor of Philosophy

December 2017

The dissertation of Samuel K. Stehle was reviewed and approved* by the following:

Donna J. Peuquet Professor Emeritus of Geography Dissertation Advisor Chair of Committee

Clio Andris Assistant Professor of Geography

Deryck Holdsworth Professor Emeritus of Geography

Burt L. Monroe Professor of Political Science, Social Data Analytics, and Information Sciences

Cynthia Brewer Professor of Geography Head of the Department of Geography

*Signatures are on file in the Graduate School

ii Abstract

This dissertation explores the geographic and semantic spaces of local and international news media reporting on the Catalonian independence movement of fall 2015. It shows that the media intersects several geographic themes through its connections between , sports, and other independence-seeking movements, most notably in Scotland. These connections are largely dictated by two factors: the scale and location of the media source, and the themes which these sources discuss in reference to the Catalonian movement. This project uses data-driven thematic analysis through Latent Dirichlet Allocation (LDA) to determine the primary topics present in the media. LDA defines probabilistic topics based on co-occurring terms within clusters of documents, and relies on several key parameters to generate a usable result. The highly variable semantic spaces which result from parameter combinations are scrutinized via this dissertation’s introduction and implementation of ‘interestingness’ measures. Interestingness measures define related but separate methods for evaluating the usability of data-driven results when multiple valid results are generated. Thus, this dissertation offers new methods to GIScience for evaluating data-driven methods. This dissertation then maps the semantic spaces discovered through the LDA process onto geographic space via the placenames present in the media reports. Modern, digital, and global media emphasizes unique places connected through the Catalonian independence context and the media which reports on it. Networks of soccer – via competing local and national teams, the nationalities of athletes, and international sponsorship deals – emerge alongside those of media conglomerates throughout Europe and the world. The linking of semantic and geographic spaces in this analysis generate new ways of understanding the impacts of globalized media.

iii Table of Contents

LIST OF TABLES ...... viii LIST OF FIGURES ...... x LIST OF EQUATIONS ...... xii ACKNOWLEDGEMENTS ...... xiii 1. Chapter 1:...... 1 1.1 Introduction ...... 2 1.2 Problem statement ...... 4 1.2.1 Research Objective ...... 4 1.2.2 Catalan Independence ...... 6 1.2.3 Topic Modeling ...... 7 1.2.4 Contributions ...... 8 1.3 Dissertation outline ...... 10 2. Chapter 2:...... 11 2.1 Introduction ...... 12 2.2 Big Data...... 12 2.2.1 Big data – paradigm-shifting for social science ...... 13 2.2.2 Theory and Data-Driven Research ...... 15 2.2.3 Big Data and Media ...... 17 2.2.4 Big Data and Politics: Event Data ...... 19 2.2.5 Evaluation ...... 21 2.2.5.1 Evaluation issues ...... 21 2.2.5.2 Interestingness...... 22 2.3 Geopolitics in Media and Sport ...... 27 2.3.1 Geography in the Media ...... 28 2.3.2 Geography through the Media ...... 29 2.3.3 Popular Geopolitics ...... 32 2.3.4 Sports and Popular Geopolitics ...... 35 2.3.5 Sports and National Identity ...... 37 2.3.6 Political Relationships Influence the Framing of Sports Rivalry ...... 37 2.4 Geography and Text: Computational Methods ...... 40 2.4.1 Placename Disambiguation ...... 40 2.4.2 Mining Big Text Data: Geographic Information Retrieval ...... 41 2.4.3 Thematic text analysis ...... 42 2.5 Summary ...... 44 3. Chapter 3:...... 45 3.1 Introduction ...... 46 3.2 Methods ...... 46 3.2.1 Big Data ...... 46 3.2.2 Latent Dirichlet Allocation ...... 47 3.2.2.1 Algorithm Procedure ...... 47 3.2.3 Topic Disambiguation ...... 51 3.3 Evaluation ...... 54 3.3.1 Expectation Maximization...... 55 3.3.2 Interestingness for Evaluation ...... 56 3.3.2.1 LDA Outputs Facilitate Interestingness Evaluation ...... 57

iv 3.3.2.2 Conciseness ...... 59 3.3.2.3 Generality/Coverage ...... 60 3.3.2.4 Peculiarity ...... 61 3.3.2.5 Diversity ...... 62 3.3.2.6 Reliability ...... 65 3.3.2.7 Novelty...... 67 3.3.2.8 Unexpectedness/surprisingness ...... 68 3.3.2.9 Utility, Actionability ...... 69 3.4 Catalonian Independence ...... 70 3.4.1 The movement ...... 71 3.5 Data Processing ...... 73 3.5.1 Data Collection ...... 73 3.5.2 Text Preprocessing ...... 74 3.5.3 Data Analysis...... 75 3.5.3.1 Parameterization ...... 76 3.5.3.2 Post-Analysis ...... 77 3.5.4 Spatial Analysis ...... 78 3.5.4.1 News Producers ...... 78 3.5.4.2 News Audiences ...... 79 3.5.4.3 News Content ...... 80 3.5.4.4 Spatial Analysis ...... 81 3.6 Summary ...... 83 4. Chapter 4:...... 84 4.1 Evaluation ...... 85 4.1.1 Issues in validation of exploratory techniques ...... 86 4.1.2 Interestingness Approach ...... 89 4.2 Model Sensitivity Analysis ...... 90 4.2.1 k – number of topics ...... 91 4.2.2 Alpha ...... 91 4.2.3 Term frequency – inverse document frequency ...... 92 4.3 Results ...... 94 4.3.1 Conciseness ...... 95 4.3.1.1 Sensitivity testing ...... 95 4.3.1.2 Recommendation ...... 95 4.3.2 Generality/Coverage ...... 96 4.3.2.1 Sensitivity testing ...... 96 4.3.2.2 Recommendation ...... 99 4.3.2.3 Geographic significance ...... 99 4.3.3 Peculiarity ...... 100 4.3.3.1 Model Sensitivity ...... 100 4.3.3.2 Recommendation ...... 105 4.3.3.3 Geographic significance ...... 106 4.3.4 Diversity ...... 106 4.3.4.1 Sensitivity Testing ...... 107 4.3.4.2 Recommendation ...... 111 4.3.4.3 Geographic significance ...... 112 4.3.5 Reliability ...... 113 4.3.5.1 Sensitivity Testing ...... 114 4.3.5.2 Recommendation ...... 118 4.3.6 Novelty ...... 118 4.3.6.1 Sensitivity Testing ...... 119

v 4.3.6.2 Recommendations ...... 121 4.3.6.3 Geographic significance ...... 123 4.3.7 Unexpectedness/Surprisingness ...... 123 4.3.7.1 Sensitivity Testing ...... 123 4.3.7.2 Recommendations ...... 125 4.3.7.3 Geographic significance ...... 126 4.3.8 Utility/Actionability ...... 126 4.3.8.1 Sensitivity Testing ...... 127 4.3.8.2 Recommendations ...... 130 4.4 Discussion...... 131 4.4.1 Summary...... 131 4.4.2 Contributions ...... 134 4.4.2.1 Geographic Information Science...... 135 5. Chapter 5:...... 137 5.1 Introduction ...... 138 5.2 Mediascapes ...... 139 5.2.1 Producer ...... 140 5.2.2 Audience ...... 141 5.2.3 Content ...... 142 5.2.4 Mapping Process ...... 143 5.3 Mapping Mediascapes ...... 143 5.3.1 Spatial Frequency ...... 144 5.3.2 Spatial Distribution ...... 148 5.3.3 Discussion...... 153 5.4 Mapping Semantic Space ...... 155 5.4.1 Global Distance Decay ...... 156 5.4.2 Comparing Catalonian Independence and Sport ...... 159 5.4.3 Comparing Catalonian Independence to Scottish Independence ...... 162 5.5 Conclusion ...... 165 6. Chapter 6:...... 169 6.1 Summary ...... 170 6.2 Contributions to Literature ...... 171 6.2.1 Popular Geopolitics ...... 171 6.2.2 Geocomputation ...... 172 6.2.3 Event Data ...... 173 6.3 Methodological Pitfalls and Solutions ...... 173 6.3.1 Linguistic Effects...... 174 6.3.2 The Self-Fulfilling Prophecy of Evaluation ...... 177 6.4 Future Work ...... 178 6.4.1 Sport ...... 178 6.4.1.1 The British Commonwealth Games ...... 179 6.4.1.2 The World Classic ...... 180 6.4.2 Other Useful Domains ...... 180 6.4.2.1 Brexit ...... 180 6.4.2.2 Real-Time Analysis ...... 181 6.4.3 Measuring Engagement with Digital Media...... 182 6.4.3 Projections ...... 183 REFERENCES ...... 185

vi Appendix: List of topics given by each of the 48 tested models ...... 197

vii List of Tables

Table 2.1 Descriptions of each of the interestingness measures as explained by three sets of authors, given by the following: 1 (Yao et al. 2006), 2 (Geng and Hamilton 2006), 3 (Silberschatz and Tuzhilin 1996) ...... 24 Table 3.1 Sample topics generated by LDA with clear themes regarding, (a) the Catalonian independence context with that of (b) Scotland’s independence movement and (c) convergence between Catalonian independence and soccer ...... 48 Table 4.1 List of the unique values for k – the number of topics – in tested LDA models, along with a general description of the expected semantic overlap of that value among each of the k topics...... 91 Table 4.2 List of the unique values for alpha used in this dissertation. Each value for alpha was generated using estimations from the ‘topicmodels’ package using the indicated number of topics in the estimation...... 92 Table 4.3 List of the three values for the tested minimum tf-idf value with their accompanying reduction in vocabulary size. Values were obtained using modifications on the median tf-idf value of 0.055...... 94 Table 4.4 List of the effects of specifying each of three minimum tf-idf values on the reduction in vocabulary size (number of term) and generality (number of input documents able to be classified)...... 98 Table 4.5 Selection of novel patterns from each of the four values of k. The themes described in lower-k models also appear in higher-k models, indicating the novelty increase as topics increase. The text processing procedure stems words, removing them of their conjugations and leaving only the term’s root, as in ‘polic’ for ‘police’ and ‘commod’ for ‘commodity.’...... 120 Table 4.6 Selection of novel patterns from each of the three models generated by the three minimum tf-idf values and a k of 25 and alpha of 0.016...... 121 Table 4.7An unexpected pattern, designated as such for the combination of terms which suggest a Scottish anti-Syrian-immigrant policy...... 124 Table 4.8 Observed actionable patterns, the parameters that generated them, and the labeled description of the news explained in the topic...... 128 Table 4.9 Summary of all recommendations for maximizing each interestingness measure. The confidence is my subjective trust in the conclusions given by each evaluation...... 132 Table 5.1 News publications captured at the conglomerate and national scales, their headquarters’ locations, and sampling information for geographic content...... 145 Table 5.2 Frequencies of place mentions and unique places in conglomerate and international news. Places are categorized by country, province (sub-nation administrative areas), cities, and other landmark or landform locations. Extra-state or unknown locations are not included in this data...... 147 Table 5.3 Definitions for the topics used in section 5.5.1 and 5.5.2 to compare (a) the Catalonian independence context with that of (b) Scotland’s independence movement and (c) convergence between Catalonian independence and soccer ..... 156

viii Table 5.4. List of the places which have the largest difference in the number of mentions within articles pertaining to topics on Catalonian independence and Catalonian independence plus soccer...... 161 Table 5.5 List of the places which have the largest difference in the number of mentions within articles pertaining to topics on Catalonian independence and the Scottish independence referendum...... 163

ix List of Figures

Figure 2.1 Compare the sponsorship of Barclay’s teams (c) with viewership (a) and athlete’s nationalities (b)...... 36 Figure 3.1 Graphical representation of the Dirichlet distribution of 3 variables. The height of the surface represents the probability of sampling related to the power given to each variable in the model. The height of the surface represents the probability of sampling in variable space, where a smaller alpha increases the probability of sampling a distribution of one variable at 100%, while a larger alpha increases the likelihood of an even distribution among the variables. Image from (Zhihui 2013) ...... 49 Figure 3.2 Plots of three common text modeling procedures – the Correlated Topics Model, LDA, and probabilistic Latent Semantic Indexing – used on two textual sources – the New York Times and Wikipedia – measured by two typical quantitative clustering mdethods (Chang et al 2009) ...... 53 Figure 3.3 Separatist regions throughout Europe. Many look toward possible success by Catalonia for precedent and strategies. Image from (Kassam 2015)...... 73 Figure 3.4 Density plots of three different topics pertaining to the Catalonian independence movement solely (red), the Scottish independence referendum (blue), and the Catalonian independence movement mixed with soccer themes (green). The x-axis of distance has orgin zero located in ...... 82 Figure 4.1 Two alternative clusterings for eleven objects with the salient properties of shape and color, adapted from Färber et al (2010)...... 87 Figure 4.2 Histogram showing the highly left-skewed term frequency-inverse document frequency of terms across all documents. The three test values for minimum tf-idf are shown by dashed vertical lines, with the median in red...... 97 Figure 4.3. Plots of peculiarity as a function of alpha and number of topics, keeping minimum tf-idf constant. (a) keeps minimum tf-idf constant at 0.03, (b) uses a minimum tf-idf value of 0.05, and (c) uses a value of 0.08, each comparing over the four values of alpha and k...... 102 Figure 4.4Plots of peculiarity as a function of minimum tf-idf and number of topics, keeping alpha constant. (a) keeps alpha constant at 0.006, (b) uses an alpha value of 0.011, (c) uses a value of 0.0165, and (d) uses a value of 0.029...... 105 Figure 4.5Plots of diversity variance score (as a percent of the maximum potential variance given by the value of k), as a function of k and alpha...... 108 Figure 4.6Plots of diversity variance score (as a percent of the maximum potential variance given by the value of k), as a function of k and the minimum tf-idf...... 111 Figure 4.7 One example plot of pattern reliability, consisting of 16 patterns with a minimum tf-idf of 0.05, and varying k and alpha...... 114 Figure 4.8 One example plot of pattern reliability, consisting of 12 patterns with an alpha of 0.011, and varying k and minimum tf-idf. Very little evidence exists to support the claim that greater topics yields greater reliability...... 115 Figure 4.9 Matrix consisting of the reliability between pairs of patterns, computed by the mean Jaccard similarity of 100 randomly sampled documents’ topic definitions. High reliability between two patterns is shown by brighter yellow cells. The X axis

x is sorted by observed clusters of similar models, and the Y axis is sorted first by the number of topics, then the alpha, and finally the minimum tf-idf...... 117 Figure 5.1 Locations of the headquarters of each news source collected. News are primarily situated in Barcelona and Madrid with seven and 5 sources, respectively...... 141 Figure 5.2 Map of all places mentioned in the content of conglomerate and international news sources, separated by countries/continents and other sub-country scale locations. Distributions are summarized with the mean center location ...... 149 Figure 5.3 Map of Europe showing comparisons between conglomerate and international news. Specific important cities or other locations are labeled for reference...... 150 Figure 5.4 Density plot of conglomerate (blue line) versus international news (red line). The X-axis represents the distance Barcelona, and the Y-axis represents the density of places mentioned at the given distance. Distance represents the centroid of the feature...... 152 Figure 5.5 Photo of routinely expressed pro-independence sentiment at FC Barcelona home soccer matches………………………………………………………………154 Figure 5.6 Density plot of place mentions given by Euclidean distance from Barcelona by topic. Catalonian independence is in red, Scottish independence is in blue, and Catalonian independence with soccer is in green. Countries’ distances are given by their centroids...... 157

xi List of Equations

Equation 3.1 Measure of variance from an even distribution used to evaluate diversity of an association rule in (Hilderman and Hamilton 2001). Where p_i is the probability for class i, q ̅ is the average probability for all classes, and m is the number of classes...... 64 Equation 3.2 Adjusted measure of variance for proportional assignment of documents to topics. The equation is essentially unchanged from equation 3.1, but specifies i as an index for each article coded into a topic, m as the number of articles in that topic, p_i as t ...... 65

xii Acknowledgements

To all of those without whose mentorship, assistance, and support this project would not have been possible: Laurie (Mom), Dave (Dad), Carla, Joyce (Grandma), Lee (Papa), and Paula – for giving me positive perspectives on life and duty whenever I needed it. Burt, Clio, and Deryck – for asking the difficult questions, challenging my comfort zone, and always encouraging me to discover. Donna – for being a constant source of anything I ever asked for, from line edits to furniture, on a weekly basis for six full years. And Arielle – for your love of late night grapefruit, without which I never would have turned this in on time, or fallen in love with you.

xiii

1. Chapter 1:

Research Objectives and Contributions

1 1.1 Introduction The advent of the digital age has resulted in a vastly increased volume, speed, and variety of data, and ease of information exchange. Data, and the means to extrapolate meaningful connections from it, has transformed how people interact with information, a process that both challenges and reaffirms our understanding of global phenomena, physically and culturally. Social and natural sciences alike are realizing the potential in a new level of spatio-temporal detail through big data and the contextual representations it allows. Broad observational areas of application within geography and spatially related sciences, including climate change science (Hampton et al. 2013), health geography (Widener and Li 2014), global geo-demography (Stewart et al. 2015) and urban studies (Farber et al. 2012), among others have benefitted from fine-grained spatio-temporal data and new processes for computationally analyzing it. Simultaneously, big data collection and analysis enables both real-time awareness and understanding of complex processes. This dissertation further explores how spatio-temporal big data can be leveraged to better understand the geographic relationships present in a globally connected news media. The merging of news and commentary in both traditional news outlets and social media has been closely scrutinized in politics around the world with recent claims of ‘fake news’ and other highly polarizing attempts to ‘spin’ current events. The geography of those sources of political narratives can reveal the objectives and spatial networks involved in their creation. Appadurai examines the impact of digital media on global networks through the concept of the mediascape (Appadurai 1996). Mediascapes describe the geography of news through the combination of where it is produced, who reads it, and the places which it connects via its spatial content. Rose’s methodology for making sense of visual sources (Rose 2012) provides helpful context for formalizing the spatialities of these properties of mediascapes. The networks of places that result from mediascape interactions are punctuated by new global links between places as digital media incorporates narratives from a range of spatial perspectives. The merging of news and commentary in digital media allows geopolitical messages to emerge from unlikely sources, such as popular film (Dodds 2006) and comic books (Dittmer 2007). Sports are another such source that has been explored for its geopolitical relelvance. International professional sports allow for athletes to represent

2 their nation against another, which can motivate fans and athletes alike in different ways than friendly matches can. Some organizations advancing friendly sport competition attempt to dismiss the presence of historic tensions, such as the British Commonwealth Games Federation, whose constitution explicitly states that the Games are “contests between athletes, and not contests between countries” (Sporting Intelligence 2013). This directive contrasts with the explicitly political origins of the Commonwealth itself. For many in ’s Catalonia region, the intertwining of politics and sport are both expected and encouraged, with a unique cultural identity merging soccer, history, and global inter-connections to produce a unique pro-independence narrative which reverberates throughout the local, regional, and international media. Catalonian sports, and professional soccer in particular, embrace their political identities which are geographically and historically grounded in Barcelona. The political identities of Barcelona’s two major soccer clubs, FC Barcelona and Espanyol, contrast with one another as pro-Catalonia and pro-Spanish, respectively. In turn, Catalonia has used the expectation of politics arising through sport to advance its secessionist philosophy through a combined Catalonian and European identity, as demonstrated in Chapter 5. The importance of sports to the Catalonian movement is evaluated via the differences in thematic narrative between news in several contexts, especially the Scottish independence movement. The two regions are linked through their ongoing efforts to establish independent states. Importantly, the spatial and semantic contexts of these separate but linked movements can be detected and explained in data-driven ways, preventing the need to define comprehensive ontologies of each situation. This dissertation utilizes text analysis techniques based on data science and big data methodologies to compare the spatial and semantic spaces of media narratives. It considers critiques of data-driven methods and addresses the argument that such methods are not based in theory and therefore, unable to confirm or deny established beliefs and real-world connections. By substantively evaluating results and patterns for the multitude of ways that they help understand the topic under study, this dissertation explores the impact that multiple parameter combinations have on resulting output. Although these impacts, termed interestingness in knowledge discovery applications, is defined in objective, subjective, and domain-dependent ways, substantive evaluation of the impacts

3 that atheoretical inputs have on analysis results will contribute to methodological theory for data science processes, and presents a novel technique for the field of GIScience.

1.2 Problem statement 1.2.1 Research Objective This dissertation merges techniques from geographic analysis; most specifically, big data, popular geopolitics, and space-time analysis. Although subsets of these techniques have been employed previously, the combination of these methods and disciplinary perspectives provides unique insights in the area of textual analysis of news content and geopolitical messages. Because news and social media increasingly provide sources of spatio-temporal data, particularly with the burgeoning use of big data techniques, this dissertation examines new ways of using textual sources in spatio- temporal research. It does so through the example I continue to refer to as the geopolitics of sport as represented in news media. Specifically, by connecting global geopolitics, sports, and automatic text analysis, this dissertation explores the geographic and semantic spaces of the news media’s portrayal of Catalonian independence. It aims to understand 1) how text analysis procedures can be used for analyzing the content of digital news reports of real world events, 2) how the patterns of similar narrative topics that emerge from global media, and the historical, spatial, and social influences on those patterns relate to the geography in the example of the Catalonian and Scottish independence movements, and 3) how substantive objective and subjective evaluation of text analysis procedures can establish theories for data-driven spatial analysis. First, this dissertation uses text analysis methods to understand the content of digital news and social media information. Such methods have been demonstrated in various applications for categorizing input text data: abstracts from scholarly journals (Deerwester et al. 1990), music lyrics (Logan, Kositsky, and Moreno 2004), and spam disambiguation (Biro, Szabo, and Benczur 2008). Latent Dirichlet Allocation (LDA) (Blei, Ng, and Jordan 2003), used in this project to categorize news media posts, not only sorts individual posts into a provided number of topics, but returns a list of the most likely terms which differentiate one topic from another. Toward responding to the

4 problem statement defined for this research, this project uses LDA to classify the text of news media posts into descriptive themes derived from key term combinations. This method’s ability to represent spatio-temporal multi-scale processes, especially historical and global connections that arrive through the context of sport provides advantages over qualitative, small sample studies as well as methods that heavily rely on predetermined dictionaries. LDA provides a data-driven summary of input text documents, so it not only overcomes the reliance used in many studies on predetermined actors and themes, it works best when the sample size of text is large, creating more dimensions to compare via combinations of terms and their similarity. Geography has utilized the LDA method for various applications from geographic information retrieval (Li et al. 2008) and abnormal event detection (Chae et al. 2012). With increasing use of social media and other textual data generated via publically accessible and socially generated sources, text analysis in various forms have become necessary for semantic and spatial analysis of news media data (Peuquet et al. 2015, Stehle and Peuquet 2015) and social media (Karimzadeh et al. 2013, Lansley and Longley 2016). Second, using the topical definitions derived from LDA, this dissertation examines the geographic and semantic spaces produced by the media’s in its role as narrator of a complex local and international geopolitical process, using independence movements as case studies. News reports were collected to examine the movement leading up to, as well as the period following, the Catalonian parliamentary vote – a non- official referendum on the public’s willingness to pursue secession from Spain. Scotland’s similar movement in 2014 resulted in a public ‘no’ vote for separating from the United Kingdom. These events were linked through geographic and semantic relationships as portrayed in the news media. Important geographic patterns are explained through the media’s multiple sites of production, audiences, and content (Rose 2012), and the ways that information is semantically organized in the narratives that they share. Thirdly, this project addresses a common critique in data science research that a lack of theory guiding the exploratory analysis necessary for investigating large datasets yields untrustworthy and non-confirmable findings (Miller and Goodchild 2014). Although many processes, such as LDA, require inputs that lack a theoretical foundation

5 based on the study context, that is not an excuse for simply relying on prior research suggestions for optimal input values. Research context – social, spatial, temporal, etc. – and data properties – volume, variety – make input parameters unique for any given situation. This project evaluates the parameterizations of LDA using the Knowledge Discovery in Databases concept of interestingness to compare model outputs and generate appropriate parameterizations for optimally interesting findings. Interestingness measures, of which seven are implemented in this dissertation, provide a new and critical scheme for the field of GIScience to further advance its growing data-driven focus. 1.2.2 Catalan Independence Identity politics are publically scrutinized when secessionist ideals are put to the test. Several cultures’ ambitions to create an independent state in the Western world have gained much attention recently, including the Quebecois in Canada, the Scottish in the UK, and the Catalonians in Spain. Although culturally and geographically these movements have little in common, they are linked through the globalization of culture, media, and finance (Appadurai 1996). Appadurai observes the collapse of geographic boundaries resulting from modern communication, sharing information and identity across great spatial distances and connecting places with otherwise little in common. To varying magnitudes, the connections between spatial and political contexts among these movements and global networks are traceable through the ways that local and international media portrays them. Appadurai’s mediascape concept explores modern media’s power to connect people and places through its production scale, its audience locality, and the geographic content that it produces. This concept is employed in the current research as documented in chapter 5. The mediascapes explored here show that the varying geographies produced by media are highly dependent of the locations of the media’s production headquarters and the scale of the audience that the source intends to reach. These findings are partially supported by work by Gasher and coauthors (Gasher and Klein 2008, Gasher 2009). By combining the sites of production and of the audience, as described by Rose’s visual interpretation of media’s narrative imagery (Rose 2012), this project demonstrates that the geographic content of news reporting is not based solely on its location, but that the scale of its audiences, from local to international, greatly influences the ways that it promotes global networks and political landscapes.

6 1.2.3 Topic Modeling The exploratory nature of many big data methods, and particularly clustering algorithms like that used here for text, remove the need to use theory to generate extensive hypotheses and make validating findings difficult to impossible. LDA uses common terms among input text documents to determine the most indicative terms of clusters, and those terms are not necessarily consistent with the salient terms human readers expect from specific categorizations of news articles (Färber et al. 2010). However, the categorizations represent major themes which themselves can be estimated and predicted. The specific political and geographic content of those themes change with varying parameterizations of the LDA model, and here that content is evaluated as a function of the parameters which generated the model. The particular topics observed here in Chapter 4 provide meaningful summaries of subsets of news articles pertaining to various aspects of the Catalonian, Spanish, and international news media, defined as ‘topics’ by LDA. A topic simply contains a list of terms, extracted from the text of news articles, which together form a label uniquely defining the semantic space of that subset of articles. In this analysis, rarely are the extracted topics uninterpretable; the terms almost always combine into a meaningful theme summarizing the articles classified by a given topic. In Chapter 4, these themes are explored and evaluated. Topic modeling is shown to be useful for spatial big data analysis of news media for flagging non-interesting articles with respect to Catalonian independence (topics such as travel recommendations and food recipes) and the sources which generate them, small variations among similar themes to facilitate fine-scale semantic comparison (the use of ‘Catalonia’ versus the Catalan spelling ‘Catalunya’ indicates local audiences), and articles which combine relevant themes into specific ways of exploring the geographic aspects of Catalonian’s independence movement through the media (Catalonian independence themes combine with soccer and with Scottish independence in topics which contain geographically and politically interesting spatial connections). Although many of the media’s major themes of news articles are well-known to readers: over time the major sections of politics, business, opinion, world, sports, and others have become familiar categories for classifying news stories, LDA provides a

7 more fine-scaled semantic comparison which directly uses terms from the text of articles to present a narrative theme. The scope of data collected here, via keywords and select sections’ RSS feeds, generally restricts such major news-based categories to a small number, therefore, choosing a larger number of topics as input to the LDA algorithm (or the parameter k) than present in news categorization, ensures sufficient detail among variations in output to preform meaningful analysis. Particularly meaningful topics emerging within the overarching themes of politics and sports include references to historical political relationships among athletes of various nationalities, comparisons to other similar global events of national and international significance, and intense ethno- nationalism stemming from culturally-driven secessionist sentiment. These themes are indicative of modern developments in the connection between geopolitics and sports, as opposed to formerly prominent themes such as propaganda in support of a state message, and the sports-as-war metaphor that pervaded athletic matchups between political rivals. 1.2.4 Contributions First, this dissertation is based on data-driven methods and theory. Big data grew as a concept fairly simultaneously across academic disciplines and industry sources, and so while disciplines contribute unique methodological and theoretical insight to the movement, interdisciplinary connections drive much of the cutting-edge research characteristic of big data. The methods used in this study originate and have been refined in information science, and in the geographic context use of such methods for text analysis has been quite limited. The increasingly social nature of publically available big data encourages new applications for the methods. As such, this dissertation contributes to text analysis methodologies, through a unique application of one such method on news articles. Exploratory techniques such as LDA depend largely on the nature of the input text data – the length of input documents, the languages, the range of vocabulary, etc. – so the careful attention paid here to the impact of parameterization of the algorithm will contribute further knowledge of the process in line with previous work (Griffiths and Steyvers 2004). Particular inputs as applied to particular data properties can contribute to default and recommended inputs in implementations of text analysis algorithms. This project seeks to describe the most paradigm-shifting impacts that big data have on social science and geography in particular. The qualitative nature of much work

8 within the discipline concerning media and of geopolitics encourages some degree of critique against big data methods. In fact, geography continues to produce most of the work that is largely critical of big data methodologies. These studies demonstrate mixed feelings about the power of human interpretation versus automatic coding of textual themes, with some suggesting little difference (Blair et al. 2016), while others showing an incongruence between the two (Chang et al. 2009). Objective and subjective means are both used to compare the results of the LDA process, with the ultimate goal of producing theory which explains the impact that variations on parameterization have on so-called ‘black box’ methods typical of data-driven research. Particularly, this research focuses on two areas of geographic analysis where automatic methods have the potential to provide insight into topics that have been historically qualitatively-driven; geographies of media and geopolitics. Media studies heavily emphasize the subjectivity of reports in news (Ó'Tuathail and Agnew 1992) and social media (boyd and Crawford 2012), and critique quantitative studies for considering the text of media to be an account of truth, rather than a writer’s and publisher’s account. This dissertation addresses this critique by considering the expected audience of media reports and the authors of them through spatial and scalar analysis of the variations in topics across many news sources and their cultural-linguistic context. Therefore, this dissertation contributes to understandings of the spatiality of media by mapping the semantic spaces of news content onto social and geographic spaces. Similarly, geopolitical analysis frequently utilizes popular texts and imagery to compare representations of political relationships and histories. Quantitative research of these texts is considered insufficient for lack of contextual detail in understanding the origins and histories that generate particular observed narratives. This research contributes a big-data approach to geopolitical studies within geography, offering an example of the advantages of automatic and quantitative methods for identifying and spatializing narratives. More substantially, this dissertation provides methods, theories, and tools for understanding and mapping the semantic spaces of the news media. It improves upon the LDA method for geographic analysis by formalizing the effects of its many parameters on generated topics, and contributes formal processes for evaluating similar data-driven methods. The interestingness measures that it uses to do so are not unfamiliar to

9 geographic literature, but have not been formally implemented and tested and applied to key spatial theories as evaluation tools for complex, data-driven methods as they are here. The replacement of primarily validation methods with comprehensive evaluation presents an alternative to the typical process used in geographic analysis. Valid models are obviously necessary, but in the case of multiple valid results, evaluation helps identify the most relevant results given a specific geographic context. The use of interestingness measures and their application to spatial problem solving provides a generalizable tool for data-driven GIScience methods. The processes for computing interestingness are implemented in R and will result in a general tool available as a downloadable package.

1.3 Dissertation outline The rest of this dissertation is organized as follows: Chapter 2 reviews four areas of literature that are important for the motivations and methods of this project. Specifically, it examines big data, recent critiques of big data methods and sources, popular geopolitics, the social-political impacts of professional sport, and text analysis methods for categorization and spatial analysis. Chapter 3 details the methodology used, describing the two empirical examples of the geopolitics of sport from which news media data was collected, and the impact of parameterization on the Latent Dirichlet Allocation text analysis algorithm’s ability to summarize input text. Chapter 4 evaluates the results from several LDA models derived from different parameterizations, aimed to find the most interesting subsets of articles collected during the Catalonian independence movement. Chapter 5 uses patterns identified in chapter 4 to show examples of sport as a facilitator of ethno-nationalist sentiment and both pro- and anti- independence narratives. In chapter 5, the mediascape of four months of media covering Catalonia’s independence movement is mapped as a function of media identity and the semantic spaces that it generates. Finally, chapter 6 concludes with response to the research objectives, contributions, and a look toward what this research indicates for future scholarship.

10

2. Chapter 2:

Research and Disciplinary Intersections

11 2.1 Introduction This chapter considers prior research on four topics that lend theoretical and methodological grounding to this project. First, it discusses the big data framework that enables new developments in data-based research. Particularly, the following section discusses big data as a paradigm and various ways that it influences social science research via analysis of the ‘V’s of big data. Big data has been unquestioningly accepted across social science, so it also offers a sample of critiques and responses to them from a geographical perspective. In section 3, it considers previous theoretical work on geopolitical structures, particularly through the source of the news media. Geopolitics are generally studied qualitatively in geography, and that emphasis, though methodologically not applied in this analysis, grounds the intentions of this work to address current critiques of big data in previous efforts to understand political interactions in geography. This section also discusses the significance of sport in the context of politics by looking at political identity, modernity, and major international sporting events. Finally, in section 4, it briefly outlines the methods used in this research and their context within text analysis and big data as a whole.

2.2 Big Data An accepted definition of big data says that anything that surpasses our current standard methods for representing and analyzing it, in terms of size and other data characteristics, qualifies the data as ‘big’ (MIT Technology Review 2013). This definition emphasizes the technological innovation necessary to ‘do’ big data. The impact of which is felt methodologically in the social sciences. ‘Doing’ big data relies on increases in storage capacity and advances in high performance and distributed computing. The consequences of data capacity for storage and analysis are significant for social science research, enabling interdisciplinary activity, individual level analysis, real- time exploration, integration of different data formats, among other advances. This section discusses the reactions that social science has had to recent big data trends, from what the ‘V’s mean for research, to evaluating the results that come from data-driven

12 research, and how they are considered important the research that described in this dissertation. 2.2.1 Big data – paradigm-shifting for social science Despite appearances and names, the amount of data is only one characterizing property of the data revolution that we call big data. Volume is critical, as many agree that modern data analysis became possible when data storage caught up with to the ability to collect data in real time (Miller and Han 2009). Real time data capture, or data velocity, is also a major component of big data, referring to the rapid generation of digital data and its analysis. The variety of data, especially in data types and sources, is a property which requires careful attention. With data coming from various sources, big data analysis spends considerable energy wrangling data when it is different formats and structures. Other big data ‘V’s, have been discussed in the literature, including vinculation and validity which, respectively, refer to the interrelatedness of data and coming together of those relationships, and the difficult question of evaluating true and valid findings and relationships (Monroe 2013). The veracity of big data refers to how much trust can be put in the quality of data and the conclusions that can be made from it (Lukoianova and Rubin 2014). Although veracity is an obvious issue with all data, big or small, the many big data studies carry the assumption that data irregularities and errors will be masked by the true values overwhelming the collection. Technical advances are necessary for analysis of these complex properties of data, but equally important are tools explicitly geared towards handlings the increasingly social and spatial nature of this data. Social media analysis has grown largely as a result of big data capabilities and philosophies, analyzed for political sentiment (Korson 2014), geolocated communication during disasters (Caragea et al. 2014), and social relationship networks (Braun and Gillespie 2011). Although much of this data remains behind pay- walls of mobile app data-generation companies, the social content of citizen-created data is unprecedented. Such rich geographic information in big, social data means that geography is embracing a data-driven fourth paradigm (Hey, Tansley and Tolle 2009), relying on the availability and spatial content of data. In the first paradigm, science emphasized experimentation. The second paradigm based scientific inquiry on theoretical insight. The third paradigm turned toward computational methods. As with previous

13 paradigm shifts, this data-driven fourth paradigm has been met with criticism and productive response. Throughout its history, geography has been influenced by paradigmatic shifts in science, generally. There is truth in the contentions that big data is not a new concept to geographers and spatial methods (Miller and Goodchild 2014). Spatial data is high- dimensional, and data volume has always tested the limit of software capabilities for storage and analysis. Data-based spatial study has seen its share of critiques among spatial theorists, concerned with the merging of ‘data’ and ‘information’ in scientific studies (Harvey 1972), the removal of theory from data-based activity (Gould 1981), and the assumption that general claims can be made from specific data (Hartshorne 1955). Trenched critiques of the positivist emphasis of quantitative geography and geographic information systems and cartography more specifically (Rose 1993) remain important texts to the field. The discipline of geography, and particularly GIScience, has undergone many methodological changes attributable to big data, from cloud computing (Shekhar et al. 2012) to greater emphasis on interdisciplinary collaboration (Kitchin 2014). The use of the particular term big data across many disciplines has brought additional tools to geographic analysis, such as machine learning methods for clustering and classification. These methods rely heavily on input data, rather than the scientific method and theoretical hypothesis verification. Data-driven approaches create problems and questions of alignment with existing theory (Miller and Goodchild 2014) and of evaluating the patterns that are found. Chris Anderson wrote a provocative article for Wired Magazine proclaiming the arrival of big data in which he proclaimed the ‘death of theory’ by applications using solely the insights contained within the data (Anderson 2008). For example, unsupervised clustering methods separate subsections of data based on some salient factor that may or may not be evident to human perception (Färber et al. 2010). Many thus fear the reliance that such processes have on particular types and organizations of data, rather than a reality understood through theory and observation. Anderson’s article celebrates the data-driven nature of big data analysis, generating its own critique about scientific research and how we value theory and

14 experimentation and the roles that data and hypotheses are generated. Boyd and Crawford are particularly concerned about Anderson’s dismissal of theory throughout the scientific process (2012). Data has subjectivities, ethical constraints, and technical constraints to who can access it, and the context in which big datasets are situated are often neglected with the assumption that data contains its own context. Miller and Goodchild fear changing scientific processes based more on specific responses to data’s limited representation of social-spatial-temporal processes, than on forming general rules applicable to many spatio-temporal contexts (2014). The authors state, though, that specificity is applicable and influential at small scales. Data-driven techniques are not influenced by data alone. Unsupervised clustering methods often require a user-specified input of k – the number of clusters to generate, in addition to a variety of others. Latent Dirichlet Allocation (LDA), which will be presented in greater detail in the next chapter, requires k, the alpha prior distribution for documents, a minimum term-frequency – inverse document frequency to reduce the vocabulary to a manageable and instrumental set of terms, a beta prior distribution for terms, and a number of iterations to conduct. These values themselves are not dictated by theory, but literature has suggested optimal input values. In this study, extensive experimentation and evaluation is used to consider the impact of each input to the LDA process, thereby providing guidance regarding the use of atheoretical parameterizations in the particular context of news data. 2.2.2 Theory and Data-Driven Research Miller and Goodchild resume a debate about the nomothetic (law-seeking) vs ideographic (description-seeking) nature of geography which has existed from the origins of modern geography to the present (Miller and Goodchild 2014, Sui and DeLyser 2012, DeLyser and Sui 2013). The debate continues under the more catchy question of whether or not geography and physics envy one another for their law-based and description-based natures (Massey 1999, O'Sullivan and Manson 2015). A better way of talking about theory and its relation to data-driven research is through the use of abductive reasoning. Miller describes abductive reasoning through the process of knowledge discovery in databases (KDD) (Miller 2010), where the availability of data precedes the formation of a hypothesis. Theory is not absent from abductive

15 hypothesis generation, as the methods used to derive knowledge from data rely on the belief that the method can reveal some intended result. Miller summarizes C.S. Pierce, who is credited with describing the abductive process, showing that the development of the hypothesis that best describes the data results is a weaker argument than that generated by an abduction or deduction process. Induction merely states that the chosen hypothesis may be one of several possible explanations, which are refined with domain and contextual knowledge and theory. Nonetheless, abduction is a critical process made more relevant by data influx and data-driven methodologies, and can provide a basis for further data reduction and hypothesis formation. Abductive reasoning is also the theoretical basis for many geovisual analysis approaches, where human identification of patterns, rather than targeted computational methods, facilitate hypothesis generation (Gahegan et al. 2002). Debates about spatial big data have paralleled and drawn from digital humanities literature, where digital representation is reflexively considered for its impact on humanities research while more and more of that research is being conducted with the help of digital resources (Gold 2012). Geographic scholarship has recently embraced this humanities scholarship in response to the big data paradigm and widening “chasm” between quantitative and qualitative methods (DeLyser and Sui 2013). Text analysis particularly benefits from work bridging this chasm, as quantitative analysis and subjective evaluation combine to make sense of large collections of qualitative textual information. Like the digital humanities, but originating from geographic scholarship, the emerging focus area of critical data studies aims to address questions of where data, especially spatial data, comes from, who is generating it and how to use it effectively and responsibly. Critical data studies originated as questions of big data, social science, and geographers’ roles by Dalton and Thatcher (2014). Their questioning evolved into dialogs intended to give the questions historical significance, ground them in existing theory, and situate them in societal structures (Dalton, Taylor and Thatcher 2016). Critical data studies are rooted in capitalist critique of data generation and access, as much of the individual-level data that social science research demands of big spatial data is generated and maintained by companies building mobile and network applications. Corporations

16 and other developers, as data storers and creators of technology for spatial data generation, privatize spatial/social data and the methods used to analyze it, limiting their accessibility for academic research (Thatcher 2014). Not to say that mobile application developers do not themselves contribute to social insight, but many protect and monetize their data in ways that are restrictive for social science research. This dissertation draws from these critical data studies as it considers the implication of using deriving information from digital news sources, whose claim to impartial reporting of world events is muddied by capitalistic motivations to create sellable content. 2.2.3 Big Data and Media News is expected to contain an objective view of important occurrences. Research acknowledges that objectivity may not always be the case when using news media as a data source (Sharp 1993). Social media analysis, on the other hand, often seeks the subjectivity of the media generator to understand the sentiment surrounding a particular event or geographic area (Lansley and Longley 2016) or by simply making value out of the immense quantity of social media users (Pak and Paroubek 2010). Objectivity though, remains an assumption guiding the use of news media in quantitative political research for its expanding coverage digitally and globally. Social research utilizes many ‘big’ aspects of news and social media. The text of media reports represents the most prevalent portion used in scientific research. Many methods now exist for automatic processing of text (e.g. Blei, Ng and Jordan 2003), which has increased the range of available data, in terms of both volume and content. Imagery is also examined as a significant part of what messages are conveyed to readers. Examples include popular comic books (Dittmer 2007), satirical comic strips in the news (Falah, Flint and Mamadouh 2006), infographics (Dick 2014), audio (Sterne 2012), and video (Nelson and Boynton 1997). Big data also considers the connections and relationships implied in the networked component of most social media platforms. Those connections can be useful for establishing roles for individuals within a network (Russell 2014), overlaps of interest and action, and other shared attributes. Networks increase rapidly in complexity with the number of individuals, and on the scale of many commonly used social media platforms, big data research remains in need of methods for analyzing the networks available. News

17 and social media do not only provide context for international and social events, big data and text extraction methods have made digital media a source of primary data itself (Caragea et al. 2014). Using social media as a data source raises concerns about exacerbating a digital divide between users and nonusers (Robinson and Feick 2016) and of privacy violation of users and their locations for such seemingly positive applications as humanitarian mapping (Shanley et al. 2013). Big data makes it easier to collect and analyze data on individuals, which means that the spatial, temporal, and personal expression that social media provides will continue to be a topic of insight and contention. In some fields, especially within social science, big data has become synonymous with using social media as a data source. Without a doubt, the volume of social media data generated with precise temporal and sometimes spatial information, spanning many languages and locations across the world, inspires new forms of understanding expression and reactions to world events. Big data provides methods for collecting, storing, processing, and analyzing these large collections of textual data. Much geographic scholarship emphasizes the subjectivity of social media data, and many rely on it to make claims about sentiment and expression by the individuals behind social media accounts. For the purpose of analyzing global interactions and political developments, news media are an increasingly important source of information. Qualitative research and big data alike utilize news media for the purpose of understanding global patterns and events. This section focuses on big data approaches and specifically that of event data, which derives an event structure from news articles by taking advantage of their typical structure of sharing information – who, what, where, why, and when – to represent politically relevant information. News media’s expansion of global coverage is perhaps directly related to its expansion of digital dissemination and readership. Mobile platforms and an increasing percentage of the public connected to the internet mean that people are receiving more of their news digitally than in print (Mitchell et al. 2016). Increasing demand for tailored reports based on readers’ interests and location also spurs digital development of news with increased categorization and filtering abilities. Spatial filtering in particular allows readers to connect with events occurring around the world, especially for those with

18 interest in specific places. Such a landscape of global media reporting and dissemination impacts the cultural landscape of emigrant and multi-cultural populations, as well as others who seek specific reporting on diverse topics and places. 2.2.4 Big Data and Politics: Event Data In quantitative analysis of news media as a source of global-scale geopolitical action and interaction, event data analysis is the most widely-used approach and its adoption is constantly growing. The use of event data originated from political science literature as a means of capturing global scale international interactions for extensive temporal analysis (Gerner et al. 1994). Event data utilizes digital news media reports for their representation of real world events and detailed descriptions of actors involved. As such, the creation of event data relies on sophisticated processing of text into categorizable and relevant representations. Event data, as defined in political science (Schrodt, Davis and Weddle 1994) represents global political interactions as dyadic interactions between two actors. Although the nature of an event is debated in many fields, Geography included, event data tends to contain several pieces of necessary information: a source actor (the “who” in a “who did what to whom” event structure), an action (the “what”), and a target actor (the “whom”). Extracted entities – from organized groups such Al Qaeda, individual politicians who are often referred to by name, to groups of people such as “citizens of Yemen” – become the actors in the event structure, and verbs represent actions, such as “agree to provide aid.” Actors, actions, and geographic entities are matched to a predefined dictionary to ensure consistency between expected political themes and printed content. This information is generally available within the first few sentences of a news report, and extracted from the text using natural language processing techniques. More recently, spatial information, derived from geographic placenames, have been included in addition to the temporal detail extracted from publication dates. The semantic, spatial, and temporal information contained within news media and the resulting event data yields many opportunities for political analysis. These properties yield tremendous detail for exploratory pattern analysis (Peuquet et al. 2015) and comparative analysis of patterns across spatial and temporal contexts (Stehle and Peuquet

19 2015). Highly detailed geopolitical data attracts many to the promise of event data, but it is not without complexities which complicate the representativeness of the information. The spatial information contained in news articles is inconsistent in scale and specificity. Big data embraces the inconsistent nature of spatial and other information by either excluding data records with lower resolution or upscaling the geographic information to a common resolution. This approach is not inaccurate, and the uniformity makes for efficient analysis and comparison. But when specificity in terms of spatial scale is available, the preference should be to use the more detailed information to understand social variation. The temporal information collected from news articles is critical for the predictive analyses that are typical of event data applications. In contrast to spatial information, temporal information is standardized across news articles via the publication date; all events consist of the same daily-scale temporal resolution. This may be a function of the article’s publication date or the amount of temporal detail contained in its introductory sentences. Further specific temporal markers may be available deeper in the text of an article, but event data extraction methods do not usually explore that far, and in the absence of a specific date and time, the article’s publication date is used to denote event timing. Surprisingly, the precise digital imprint of an online article’s publication time (in the scale of seconds or better), is ignored and all event timings are captured at the daily scale. Although this consistent temporal structure is convenient and necessary for predictive analysis, related events consist of many temporal and spatial scales which interact with one another (Allen 1983). Some event-oriented political studies ignore the multiple possible temporal scales of events, however, as spatial and temporal aggregation are necessary to test their null hypothesis that events are unrelated and the potentially related events share no regular temporal relationships (e.g. see Arva et al. 2013). This homogeneous temporal assumption given to events described in news articles is not consistent with geopolitical process, whether reported in the news or not, and may contain commentary on not just isolated events, but reflections of a longer process. In light of these and other nuances for representing political representation in the news,

20 event data remains a work in progress, as recognized by the leading researchers in the field (Beieler et al. 2016). Regardless of remaining theoretical and practical challenges, event data derived from popular news media presents a unique resource for understanding geographic processes, and represents a theoretical building block and personal motivation for this project. Predefined dictionaries assume a semantic structure on real-time, evolving processes, and geographic thinking on such processes often consider thematic textual descriptions in public sources. This project examines data-driven methods for extracting geopolitical information from digital news sources as an alternative to event-based representation by implementing a topic modeling perspective, introduced in section 4, and described in detail in the next chapter. 2.2.5 Evaluation

2.2.5.1 Evaluation issues Despite the popularity of methods for discovering and analyzing algorithmic output in big data, much of the evaluation of these processes is a neglected but necessary step toward furthering the field methodologically. Data-first methodologies and the resulting use of exploratory procedures for big data necessitates that evaluation is done in an iterative process and in a way that that does not rely on a single optimal solution. Unsupervised classification algorithms, in particular, have seen increased use in geographic applications beyond image processing. The data-driven nature of these procedures, such as the LDA text modeling method used here, reveal the variable and potentially unexpected nature of data analysis when data is unfamiliar and input parameters are atheoretical. Thus, evaluation is critical for understanding variation and its applicability and generalizability. There are two reasons for using classification on a collection of big data. First, as mentioned before, is for exploratory purposes (Gordon 1981). Although data should represent the conceptualization of the real world unique to each project, an exploratory analysis seeks to determine the major themes present in a collection of data. Exploratory analysis is often synonymous with knowledge discovery, as previously-unknown connections among data and themes are made (Miller and Han 2009). The second seeks to compare new insight with expected insight established through previous exploratory

21 analysis or theory. Even with the data-driven emphasis of big data, theory remains a significant driver of expectation and successful analysis. It is this expectation that is often used to evaluate classification processes, creating a gold-standard solution to compare against. Following from Färber et al, (2010) more attention should be paid to the process of evaluating classification algorithms. Färber et al dispel the veneration of the gold standard method against which results are evaluated. As long as “the whole point in performing unsupervised methods in data mining is to find previously unknown knowledge,” (ibid, 3) then comparing against a known result is not helpful. Classification methods themselves have been tested and become common tools in areas from text modeling (Blei et al. 2003) to image analysis (Pal 2005), and such methods proven to reflect real world processes when provided with appropriate data and parameterizations. The evaluation of such methods should be optimized to be critical of the insights they provide, rather than the ability of the method to provide a fully expected result, especially in exploratory knowledge discovery applications where unexpected results often generate more interest than expected ones. Additionally, classification methods may not be designed to produce a single optimal solution (Färber et al. 2010). Multiple parameterizations, random components, and partial class assignments make evaluation a complex and imperfect task. Although some measures, such as maximum likelihood may be used to compare sets of outputs (Vatsavai et al. 2012), quantitative measures of class validity seek to ensure that classes diverge from one another at a maximum potential. Many methods allow for classified data to exist proportionally in more than one class, representing a legitimate convergence of classes, rather than a measurable divergence (Blei and Lafferty 2007). With near infinite possible parameter combinations for many methods, it is clear that some evaluation process is necessary to ensure correct inputs and justify insightful results. 2.2.5.2 Interestingness One class of evaluation methods often suggested in information science literature concerns measuring the interestingness of patterns generated by classification methods and other data mining processes. This dissertation refers to the results from data mining processes as patterns to emphasize that such outcomes are variable organizations and

22 classifications driven by the input data, rather than a new summary metric. A pattern, with respect to data mining and knowledge discovery, may not be a regular, recurring, or reproducible result, as implied by the term in geographic and other literature, but represents an organization or specification of the input data. Some data mining processes do generate summary metrics such as maximum likelihood, which can be used for evaluation purposes, but it is the classifications – the patterns – they create that are interesting from a practical standpoint. In the case of topic modeling, many topics are generated, and the combination of all topics creates a pattern. Interestingness represents not a single means of evaluating patterns, but a set of strategies for determining the value of discovered patterns. Interestingness consists of several methods for assessing the value of a set of patterns, which fall into two major categories; objective and subjective measures. Although objective and subjective measures of interestingness can and should be utilized together, they can present dissimilar conclusions (Chang et al. 2009). Objective evaluations, like conciseness, generality, and reliability, are derived probabilistically, depend only on the data, and do not take into account the user or the application (Geng and Hamilton 2006). Most statistical evaluations are objective measures, such as maximum likelihood estimation, which measures reliability as defined in (ibid). Subjective evaluations, on the other hand, explicitly take advantage of the user’s knowledge and domain understanding of the data and patterns. In many cases, that knowledge can be formally represented as rules and beliefs and integrated with statistical measures. Unexpectedness/surprisingness (Silberschatz and Tuzhilin 1996) and novelty (Geng and Hamilton 2006) are common examples. Geng and Hamilton (2006) also add a third category, semantic evaluation, which includes evaluations that depend on the semantics of the data and domain, such as utility and actionability. Some do not consider semantic measures of interestingness a separate category from subjective measures because the user specifies domain knowledge in similar ways (Yao, Chen and Yang 2006). Table 2.1 provides an overview of interestingness measures, and they are described in detail in the following. More details on each measure and how they apply to this project will be discussed in Chapter 3.

23 INTERESTINGNESS DESCRIPTION CLASSIFICATION NOTE MEASURE CONCISENESS pattern contains relatively few attribute- objective 2 value pairs DIVERSITY patterns differ significantly from one objective 2 another GENERALITY/ pattern characterizes a greater portion of the objective 2 COVERAGE dataset PECULIARITY pattern is far away from other generated objective 2 patterns, as per a distance measure RELIABILITY pattern occurs in a high percentage of objective 2 applicable cases NOVELTY pattern contains information not previously subjective 2 known or inferred from other patterns UNEXPECTEDNESS pattern contradicts existing expectations subjective 2, 3 /SURPRISINGNESS ACTIONABILITY pattern enables decision making and acting subjective/ semantic1 2, 3 to user’s advantage UTILITY pattern contributes to reaching a goal subjective/ semantic1 2

Table 2.1 Descriptions of each of the interestingness measures as explained by three sets of authors, given by the following: 1 (Yao et al. 2006), 2 (Geng and Hamilton 2006), 3 (Silberschatz and Tuzhilin 1996)

Objective measures of interestingness are measured directly from the output of a computational process. Conciseness attempts to capture a minimal set of non-redundant results. In (Padmanabhan and Tuzhilin 2000), algorithms are designed to reduce redundancies. Although some coherence between classifications can be interesting, as shown in this research, redundancies can be checked for by comparing the patterns to one another for crossover. Diversity is measured by both the attributes within a pattern and between patterns, and is determined by comparing the probability distribution of each pattern element to the uniform distribution (Hilderman and Hamilton 2001). Generality/coverage is measured by the fraction of input data characterized by a pattern (Geng and Hamilton 2006). Generality is easy to measure, as many input observations can be dropped in the process of analysis when they do not consist of significant characteristics. For example, the text of a document may not contain enough significant terms to be used in the LDA process. Peculiarity is measured via some distance measure, where a resulting pattern differs significantly from all other patterns (Knorr, Ng and Tuckakov 2000). Peculiar patterns may also occur in conjunction with other measures, such as conciseness, but is also conflicts with generality (Geng and Hamilton 2006).

24 Subjective measures require user specification of the characteristics that make one pattern more interesting than another. As opposed to objective means, the subjective classification of interestingness measures incorporates the knowledge, experiences, and expectations of the user. Big data claims that objectivity is preferred for reproducibility and comparison (Anderson 2008), but geographers have also contended that, while a desire for objective evaluation is reasonable, big data, like all data, has inherent subjectivities (boyd and Crawford 2012). Subjective interestingness measures work with a user to formalize and compare against the knowledge that user has. Novelty, however, is an exception. Novelty is the feature that exploratory data analysis strives to discover, with novel patterns representing new, unknown connections from the data. A user’s knowledge or ignorance cannot be represented with full certainty in the process of analysis, so novelty is evaluated similarly to peculiarity, where a pattern not only must be different from other patterns, but also different from expectation (Geng and Hamilton 2006). Surprisingness or unexpectedness differ from novelty only in that unexpected patterns are not only novel, but contradict a user’s system of beliefs. Silberschatz and Tuzhilin (1996) formalize this system of beliefs for computational comparison. With iterative pattern generation, expectation arises from repetition of patterns, so even in an exploratory methodology, surprising patterns become clear. This dissertation anticipates a similar result, and uses multiple iterations of pattern disambiguation to establish an expectation, against which each pattern is evaluated to measure its interestingness. Yao and coauthors differentiate a third class of interestingness measures from within the subjective category defined above that they call semantic measures (Yao et al. 2006). Semantic measures are not only subjective, in that the specification of interesting patterns is user-driven, they are also data-, rule-, and pattern-driven. For instance, utility refers to a pattern’s ability to aid a user in reaching some goal, and actionability more specifically measures a user’s to take a specific action with the new knowledge resulting from a specific pattern (Geng and Hamilton 2006). Utility and actionability are clearly subjective measures, as the user specifies what information provides a benefit for a defined purpose. But these are also separate from subjective analyses like surprisingness and novelty because the objectives not only must prior beliefs be incorporated into the model, but so do the problems which demand action. A user with no domain knowledge

25 will find novelty in some patterns, but will have difficulty finding utility without understanding the necessary goals. Domain knowledge can be represented with rule criteria (Yao et al. 2006), but domain knowledge is independent of data and analysis, making semantic measures difficult to formalize. Interestingness measures are not independent of one another. While some, like peculiarity and generality confound one another, there may be times when both are desirable toward different ends. Quantitative studies often emphasize statistical significant through objective measures, rather than considering the possibility that other patterns could also generate interesting conclusions (Ward, Greenhill and Bakke 2010). Pattern communication can also depend on appropriate evaluation of significant patterns. Concise patterns may be useful for communicating findings to communities will little domain knowledge or comprehension of statistical reliability measures, but science often fails without statements of reproducibility and domain testing through actionable findings. Strategies should exist to evaluate objective interestingness against subjective and semantic evaluation, such as in visual analytics (Andrienko et al. 2011). Visual, interactive exploration of patterns against one another is one way to incorporate domain knowledge into evaluation. Even when measures of interestingness conflict with one another, having access to them and understanding of the ways that we evaluate analytical outcomes helps data-driven analysis proceed for communication and for taking meaningful action. Although these interestingness measures are defined in KDD applications, they have corollaries in geographic concepts as well. Few have utilized the measures explicitly as defined by the authors above, for example Laube and Purves 2006, Bogorny, Kuijpers and Alvares 2008, Miller and Han 2009, but the evaluative objectives that they represent appear in many spatial applications and theories. Coverage/generality resonates with theories in critical GIScience concerning representation and the difficulty in recognizing what is not represented (Kwan 2002, Kwan 2008). Mennis and Guo discuss data mining’s goal in detecting unique items among ‘unprecedented’ complexity and spatial interactions, which is measured through the peculiarity of the model (2009, 403). Model diversity reflects spatial clustering methods and autocorrelative measures, where within cluster differences are minimized and between cluster differences are maximized.

26 Finding novel spatial relationships and outliers to established expectations is one of the key tenets of spatial data mining as well (Shekhar, Zhang and Huang 2009). Finally, unexpectedness/surprisingness and actionability drive spatio-temporal pattern discovery as the primary goal of such research is to find new information that contradicts current expectations (Laube and Purves 2006). The fundamental purpose of these interestingness measures lies in knowledge discovery applications, but applied to spatial information, the concepts need little repurposing. This dissertation considers the definitions and implementations of interestingness in a knowledge discovery context, while applying the geographic knowledge contained within the Catalan application and the spatiality of each news article.

2.3 Geopolitics in Media and Sport With big data’s ability to simultaneously look at an increasingly spatial and temporal coverage of global events, the study of politics and their spatial and temporal characteristics are areas ripe with opportunity for fresh insight. Big data’s methodological advances parallel the advent of globalizing forces in that they are both furthered by an increasingly digital media. For the purpose of understanding and using media to understand geopolitics, globalization means that digital media’s readership is not limited to a geographic area. People maintain connections to events in places they have been, but also are exposed to new cultures, interact with people from those cultures, and travel with relative ease. This geographic dispersion of culture is traceable through news and social media and big data’s emphasis on real-time analysis, as interest and interaction happens at a rapid pace. In geography, engagement with political representations, especially those in the media, takes a more mixed methods approach. Although the methods used in this dissertation are quantitative, this section considers the approaches that geographers have used to examine the spatial and temporal aspects of politics beyond the use of big data. Texts are central to the study of geopolitics in geography, as representations of international relationships range from personal effect to elite intergovernmental action. In examining the range of ways that geopolitics are represented, especially with respect to

27 media and public exposure, the research that is summarized here defines many necessary theoretical points for understanding news media’s influence on public political narrative. Geographers have previously studied the media landscape in various forms, thematically and spatially. Mapping the readers of news establishes patterns of information diffusion across space (Toole, Cha and Gonzalez 2012), mapping the locations contained within news articles highlights the connections between places and the topics of the articles that they appear in (Gasher and Klein 2008), and the geographies of the news producers themselves impact both the readership and the content that can be expected from published content (Hay and Israel 2001). Rose considers these three facets essential to the production of the media’s image – what the readers see when they consume the article (Rose 2012). The spatiality of news is increasingly difficult to formalize as readership transitions from predominantly print to digital. Online readers are difficult to track geographically – ISP addresses can be unreliable for obtaining spatial information (Gruteser and Grunwald 2003) and companies may only record such information insofar as it may be used to sell advertisement space tailored to individual readers. Some online news publications provide space for readers to comment by signing into an account or by using a linked social media account such as Facebook, which have geolocation features or user-supplied location options. Despite the difficulty in measuring the spatiality of digital publications’ readership, the locations where the news is being produced and the spatial content of the news itself are more easily measured. 2.3.1 Geography in the Media Since most news media provides a geographic reference for the current events, many applications use a multi-step process to extract references to geographic placenames and place them on a map. This placename extraction process, only recently combined with event data coding schemes (Leetaru and Schrodt 2013), has given event data a larger presence within geography (Peuquet et al. 2015, Stehle and Peuquet 2015). The combination of locations, time, and theme regarding the political content of events adds geographic context to what previously could not be placed on a map. Primarily responsible for this development are numerous advances for capturing geographic information from text.

28 The ability to extract placenames, people, and other specific named objects from unstructured text derives from advancements in natural language processing and Named Entity Recognition (NER). Typically, NER can rely on capitalization and other syntactical clues within the context of English news reporting to extract known entities (Karimzadeh et al. 2013). Several open source engines are available to the public, with the most well-known being Stanford CoreNLP (Manning et al. 2014) and ANNIE GATE (Cunningham 2002). Both engines extract named entities and distinguish between the names of people and places using syntax within the corresponding text. With named entities, especially those extracted as placenames, the more uncertain task of deciphering the specific location which is being referred to, and geolocating to that location. The Geonames public database (geonames.com) contains records of geographic entities at many scales continental to local, landmarks and some places of business, as well as named topographic features. Many contain the names of these places in multiple languages. Identifying placenames is considerably simpler than disambiguating the locations that they are referring to because many locations have multiple names and many placenames refer to multiple locations on earth. A standard strategy is to prioritize the candidate location which is the most likely, disregarding context from the text. This option would emphasize locations with high population, making it difficult to automatically extract a location which shares a name with a more populous place. In utilizing placename data from the news, media-based research explores the engagement of media producers with the geography around them. Gasher and coauthors study international, national, and local media by comparing the number of times a media source mentions the country or state in which it is headquartered (Gasher and Klein 2008, Gasher 2009). Although the media’s geography is not limited to the places mentioned in a news article, Gasher’s work shows that, despite international audiences and the means to produce media reports on a global scale, news sources still tend to prioritize the perspectives, events, and impacts of events on the local scale. 2.3.2 Geography through the Media Geography is also reflected in the media through the sense of place that it creates, which contains references to geographic entities, but also creates a narrative image of

29 interconnected places and events. Appadurai intersects geography with the media to reveal a modern network of information sharing and culture via connections between global places (Appadurai 1996). Although cultures and places have their peculiarities and uniquenesses, unexpected developments and other shared interests create avenues for sharing information and experiences through a global media. These avenues create spaces where, despite not sharing spatial locations, ‘imagined communities’ (Anderson 1983) emerge across unexpected spaces. As these imagined communities emerge through the digital and global spaces of media access, Appadurai argues that the nation-state diminishes as the unifying cultural structure for modern populations (Appadurai 1996). Ironically, leverage the internationality of Barcelona in several ways, explored further in chapter 5, in their message promoting an independent state for means of cultural and governmental autonomy from Spain. Appadurai describes this spatiality and the imagery created by the interactions of media events with his concept of the mediascape (Appadurai 1996). The –scape suffix refers to the dynamic and perspective-based arrangement of the media, particularly with respect to the changing global and modern nature of context in which it is situated. Mediascapes constitute the complex network of images generated and shared by diverse forms of media to readers around the world in textual and graphic forms. Inside the network of media producers, their productions (both real and fantastical), and their audiences is a spatial network of connected people and places through the locations of people and events. Mediascapes are functions of both the geography of the media itself and that which it creates. The difference between the imagery created by news with audiences at local, national, and international levels clearly impacts the ways that geography is used to share a political narrative. Similarly, the global patterns of places entwined with one another via modern digital media appear in different ways within the three scales of media observed here. The media’s evolution as a source of geographic reference and association contributes to the sense of globalization which Appadurai emphasizes in his many anecdotes about sports and postcolonial nationalism. Places emerge together in both expected and unexpected ways through activities that bring places together – such as

30 athletes playing together and against one another (Skey 2014) – and via the aftermath of colonialism (Appadurai 1996). Appadurai’s use of the media does not consider some of the peculiarities that exist between different classes of media and how they generate different mediascapes. Scale is demonstrated by Appadurai in the production of subjects who are part of a local (specifically not defined spatially or tied to a specific place) community of ethnic, cultural, and relational contexts (Appadurai 1996). This ‘ethnoscape’ which, like the mediascape, is increasingly global and spatially diverse, emphasizes modern, decentralized communities based not on territory, but on imagined networks of shared interests. The removal of a spatial context in this definition of subjects, combined with its use of a spatial scalar term to describe it, serve to emphasize the lack of spatial distance as a cohesive element in the formation of modern -scapes, Despite the attempt to reduce the effect of space through distance to make sense of causes and consequences of globalization, Appaduarai’s definitions of ‘-scapes’ do not attempt to make sense of why particular places appear connected via specific forms of modern image-generation. Particularly, the news media, for all of its ability to advance a global narrative, does so with evident scalar and place-specific patterns. Thus, it is important to understand the image of the mediascape as the intersection of three sites which define the context of such material (Rose 2012). First, the production of the content, as a function of the producer, the setting, and the materials, is necessary to understand the significance of the creation and place it as part of a larger social and geography system. Second, the intended and actual audiences of the material and the means by which they access it determine the meaning that is made from them. Third and finally is that which is contained within the image source itself; its content and composition. To Rose, these facets are methodological ways of looking at and interpreting visual materials, of which text is a significant category. Discourse and narrative particularly have intentions to generate imagery for their consumers, and news is no exception given its necessary combination of entertaining content and accurate portrayal of world events. Within geographic and media studies literature, the producer of news is an important actor which carries geographic and topic contexts to the media that it generates.

31 Popular geopolitics has, from its beginning, used the perspectives of media producers to show differences among ‘elite’ or ‘intellectual’ texts and more accessible ones created for public consumption (Ó'Tuathail and Agnew 1992, Sharp 1993). Cartographically, the producers of reference maps and imagery consisting of maps contain images of geography reflective of the producer’s perspective on that geography, especially in political contexts (Edsall 2007). Rose also states that technology and the means of accessing visual materials impacts the way that an audience interprets the material (Rose 2012). Obviously, the digital nature of consumption associated with modernity exemplifies the ability of technology to mediate public interactions with media and news in particular. Digital production and audiencing means that the prior need to locate near the intended audience is far less critical than it would be with a predominantly print-based media (Mitchelstein and Boczkowski 2009). Hence, the presence of conglomerate news sources in this study, particularly via the technological prioritization of RSS feeds to distribute digital editions of published articles. Additionally, the digital audiences of news changes the way that audiences interact with content versus printed versions (Webster and Ksiazek 2012). 2.3.3 Popular Geopolitics Geopolitics establishes lines of inquiry separate from the longstanding tradition within geographic research of political geography. Much of popular political media and related political geography emphasize intergovernmental relations and actions by politicians, but politics also occur at scales separate from the federal: from the regional to the personal. Globalizing, digital media helps us recognize the pockets of culture and opinion that exist within state boundaries, meaning new lines of inquiry are established to specifically examine those other scales where politics take place through varying means. Popular geopolitics begins with the premise of understanding the ways that political messages are conveyed through mass media and other means with specific cultural significance (Ó'Tuathail and Agnew 1992). O’Tuathail and Agnew introduce the topic as a new line of inquiry separate from but related to other specific geopolitical research, such as feminist geopolitics (Dowler and Sharp 2001). The driving difference which sets popular geopolitics apart is its focus on media and other popularly-consumed representations of political interaction as separate from ‘elite’ texts that are created and

32 consumed by politicians and international actors (Sharp 1993). Such popular texts include news publications which are widely disseminated, and entertainment media which have political themes which may or may not be fictional or based on real political actors and relationships. Prominent examples considered for their popular representations of politics include the Readers Digest (Sharp 1993), espionage films (Dodds 2006), comic books (Dittmer 2007) television (Dittmer and Dodds 2008), and local, regional, and international news (Woon 2014). As O’Tuathail and Agnew state, these examples are important for geopolitical conveyance because of the particular cultural context through which they present real relationships internationally and domestically to the public. Thus, the source, audience, and spatio-temporal context in which media is published – the subjectivity that is ignored by more quantitative political applications of digital media – drives popular geopolitical inquiry. Popular geopolitics has often focused on imagery in its empirical exploration. Although the pictures painted by popular media sources through descriptive and often entertaining language inspires new understanding of political relationships, visual imagery media motivate much of popular geopolitics’ inquiry. This is primarily due to the specific popular appeal that certain types of visual imagery can have to a popular readership that text reports do not. Political cartoons, comics, and fictional film are great examples where political narratives are reinforced despite the obvious entertainment appeal (Dittmer 2007), which likely provide a greater motive to consume their political message. Thorogood explains the power of visual representations to geopolitics by examining the satirical content of the South Park television show through the grotesque ways that it portrays current events (Thorogood 2016). Regardless of whether the representations of political relationships in popular media are accurate, they have the power to influence popular opinion. Humor and exaggeration enhances the experience and bring in audiences that are more interested in being entertained than in learning about current events. However, popular geopolitical content also occurs in texts and imagery within the more routine nonfiction news media which is generally consumed for truthful accounts more than for entertainment. News media remains critical for popular portrayals of political relationships and localized perspectives on global events (McFarlane and Hay

33 2003), particularly when the scale of the media is considered. Woon compares local news narratives regarding regional geopolitics in the Philippines (Woon 2014), not through the imagination of comics or other humorous interpretations, but through the ways that local producers and consumers ‘see’ events and produce regional differences. Local reporting on religiously-motivated violence on the island of Mindanao frames the events as ubiquitous across the region, creating an image wrongly suggesting that isolated events are more widespread than they are. Such an analysis shows that the scale of news, with respect to the audience it intends to reach, matters for illustrating the geography of the places that it represents and addresses. Popular geopolitics make use of solely qualitative methods to explore the messages conveyed in media sources. Driven by specific political relationships and media sources, the number of available reports to draw from remains small enough that a targeted examination of specific features is preferred. Qualitative approaches facilitate a much more thorough analysis of the intricacies of humorous images and other political representations, such as satire. Although computational processes are getting better at detecting these linguistic and visual effects, popular geopolitics have always used qualitative methods to interpret the political and popular context of news media reporting. Thus, they require small collections of media reports for targeting specific geopolitical representations and publishing sources. Computational process aimed at deriving thematic summaries from text in order to overcome this small sample requirement, and reduce the potential bias of using specific publication sources to illustrate larger themes transcending individual news outlets. This dissertation considers the advantages of popular geopolitics as a context to explore the content of digital news reporting. First, it considers the context through which consumers of sports reporting view the content of those reports. Although the institutions of international domestic professional sports are inherently political, the consumers of sports in the media are primarily interested in the entertainment value of the sporting event. Popular geopolitics show that the primarily entertainment basis of some mediums disseminating political narratives provide a unique context for the audiences of those media. So too do the social and political identities of teams influence the ways that fans interact with their teams and rivalries. The geographies of the media’s audiences are

34 examined in Chapter 6. Second, this dissertation understands that qualitative inquiry restricts popular geopolitics to a smaller sample size than methodological capabilities allow, and uses big data methods to analyze large collections of news articles with broad spatio-temporal range. Finally, a focus on the differences between the production scale of news provides a unique lens to analyzing the spatial content of news. Local news has been examined for its coverage of local events (Howe 2009), but this project’s comparison of local and international coverage of a regional event is a unique perspective on media coverage in the context of popular geopolitics. 2.3.4 Sports and Popular Geopolitics Professional sport is important in many societies as a pastime and means of entertainment. Some sports organizations recognize that competition can be polarizing, and attempt to remove politics from the course of competition (Sporting Intelligence 2013) to maintain the appearance of liesure. Although sport frequently attempts to remove itself from politics, media coverage of sports makes their dissociation impossible. Frequently, at the intersection of politics and sport, news media keeps the camera and the reporting on the action, as was the case when a protest occurred at the 2014 football World Cup (White 2014). Nevertheless, it is the intense visual media coverage of sports that enables public exposure to the complexities of globalization and other forces. Media’s impact in promoting globalization is not just seen in sports participation and reporting (Appadurai 1996), but in the proliferation of international content, especially specialized by region, language, and culture, making it easier for readers to access news and commentary relating to the context about diverse places beyond that from which the publication comes. Advertisement has become a prominent example of media’s relationship with globalization. Sports factor into this vision of a changing global and modern landscape in ways that will continue to be seen in England’s professional soccer system, and the Premier League in particular. For example, sponsors from throughout Asia and the Middle East own majority stakes in teams that never play within the borders of their sponsor’s country. Television viewership within the home country of advertising companies drives this foreign interest, however that foreign viewership is not at all representative of the number of foreign athletes playing in the Premier League from those

35 same countries. Figure 2.1 compares the geographies of the Premier League’s sponsorship of teams (a), its athletes (b), and television viewership (c). Although sponsorship, like athletes’ nationalities and viewership, is spread through many nations across the world, sponsorship and viewership are disproportionally high in Asia and Africa compared to athlete representation. However, foreign advertising does reflect an international composition of professional sports – athletes in the Premier League represent over 56 different countries and territories – even while global representation of players is not proportional to sponsorship or viewership. Disproportional athlete representation with respect to sponsorship, viewership, and population is especially true in Asia, as shown in Figure 2.11.

Figure 2.1 Comparing the regional geographies of sponsors, athletes, and viewers of the Premier League. Sponsorship data aggregated by the home continent/region during the 2015-2016 season (Sporting Intelligence 2015), Athlete representation at the start of the 2013-2014 season aggregated by continent/region (Sporting Intilligence 2013) and Premier League viewership by continent/region during the 2014-2015 season (Gibson 2015)2.

1 For consistency’s sake, this data is aggregated by continent with the addition of the Middle East and the UK. The UK is added as a separate entitity because the Premier League is based there, and the patterns evident with respect to athlete representation versus sponsorship and viewership are so stark. 2 Viewership in the UK is aggregated with Europe. Still, the amount of viewership is not proportional to athlete representation.

36 2.3.5 Sports and National Identity Nation-building and representation appears often in geopolitical literature with respect to major sporting events such as the Olympic Games and the Football World Cup (Tomlinson and Young 2006). Geopolitical research also acknowledges that connections of power relations, social class, and economic transferal are equally as important as state- centric institutions, and examine geopolitics in other spheres where politics intersect forms of identity-building, such as sports fandom (Jackson and Haigh 2008). Local and regional teams in domestic leagues promote and sometimes adopt particular cultural and political identities of their own (ESPN FC 2014), which have been targeted by political elites for leverage over regional politics (Harding 2014). Like other forms of contextualizing geopolitical representations through popular texts and culture (Sharp 1993) both the audiences and the subjects – the athletes – of sports reporting and their political identities impact the messages conveyed through the geopolitics of sport. Media more or less drives what politics are visible to the public through the venue of sports, but politics play into the games themselves as well. Many athletes play together but do not speak a common language and ritualistically celebrate culture and talent upon success on the playing field. These interactions come into conflict between competing teams and individuals, as well as between athletes playing for a common side. Racism (Jarvie and Reid 1997), sexism (Scranton and Flintoff 2002), and international rivalry (Rudd and Levermore 2004) pervade the clubhouse and the playing field, with global participation creating more opportunities for conflict and reconciliation. Athletes’ diversity and their interactions with one another inside and outside of the public eye have been studied via racial exclusion and integration (Pelak 2005) and income inequality (Frick, Prinz and Winkelmann 2003). Media narratives that result from this diversity and their spatial variation motivate this dissertation, with potential contributions to understanding public sentiment surrounding some highly diverse sporting communities. 2.3.6 Political Relationships Influence the Framing of Sports Rivalry Sport facilitates the exchange of social and cultural experiences - such is written in the mission statement of the International Olympic Committee (International Olympic Commitee 2013) - but it can also expose and play upon deeply held bias and colonial histories. National rivalries play themselves out in sports competition, with athletes the

37 vehicles for national-scale posturing, therefore becoming political actors themselves, willing or not. In other times, athletes lose the ability to separate athletic competition from national identity and erupt in conflict during gameplay (Peck 2014), an act many ruling sports organizations deplore and punish (International Olympic Committee 2013). Asking athletes to ignore their political identities relies on an assumption that politics detracts from the entertainment of consuming sports, which is increasingly difficult in internationally-organized events, where athletes part with their clubs and unite to compete for their countries. Displays of political identity during the course of sport competition are not limited to athletes. The spectacle that sports provide, especially when broadcast to a large audience, frequently become the medium for fans and other participants to make public expressions in favor of cultural and political beliefs. In one example, athletes on the soccer field were incited to violence during an otherwise congenial match between Serbian and Albanian national teams when a fan – who, like all Albanian supporters, was banned from the match for fear of inciting racial violence – flew a drone onto the field with an Albanian flag (Peck 2014). The event gained international attention, not because of the intrigue of violence – players and referees are known to be killed by one another during the course of matches in some countries (Fox News 2013) – but because the abandoned match had international implications in European qualifying. At the opening ceremony of the 2014 World Cup, a child hired to release one of several doves as a sign of peace and friendly competition used the media’s attention to make a statement of protest defying Brazil’s treatment of indigenous land rights (White 2014). The live broadcast did not display the act, presumably on purpose, as no replay or mention was given. Media has control over the messages displayed, and in this case, may have silenced the attempt by somebody trying to use the spectacle of sport to gain recognition. Global events, such as the Winter and the Summer Olympic Games, taking place between countries willing and able to supply athletes for the many events contested (with a few historical restrictions on who can send athletes), create an environment where politics between countries can and do impact the intended friendly nature of competition. This happens in two ways with respect to political relationships: active nationalistic sentiment with a low latency, and historical factors that lead to deepening rivalry. In one

38 example, examined here, the ethno-nationalism expressed by Catalan people recently has proven to have a significant impact on sporting events and the behavior of fans attending them. In another, the deeply-rooted legacy of British imperialism casts a (curiously ignored by organizers and some nations) range of fan sentiment and nationalist discourse between particular nations competing in the British Commonwealth Games. Geographic proximity and shared orders help define many international and domestic athletic rivalries. Since borders are often the result of historical division, political history remains an important factor in defining sports rivalry, athletes’ conduct during matches, and public reaction to news reporting of rivalry matches. This dissertation explores the impact that colonialism has on modern sports. The explicitly colonial organization of the British Commonwealth Games attempts to remove politics from the course of sport competition, but some competing countries consider the games an opportunity to send a message of independence to a one-time occupying force (Perkin 1989). Colonial history provides an exception to the distance decay of sport rivalry, where nations from around the world attempt to send the same message to a single country. Reporting similarly varies between source locations, as news and public opinion differ with respect to the former political positions of competing sports nations. Active political disagreement tends to illicit the most intense response via sports than the historical kind. Many countries have been excluded, either by mandated organizational committee (DeSantis 2016) or by boycott (Sarantakes 2014) as a result of strained relations or unsupported policies. For the United States, Cold War politics shrouded Olympic Games both through self-exclusion and higher stakes competition (Hill 1999). Yet, most literature examines the Olympic Games and soccer’s World Cup as sources of propaganda and brand promotion for host countries (Tomlinson and Young 2006). Also, a temporary impact on sports, propagandizing international competition is meant to promote a particular vision of a political utilizing a global media fixation on a specifically-chosen image of a place. Motivations abound for utilizing sports in this manner, particularly through the globally visible sporting events of the Olympic Games and the World Cup (Kennett and Moragas 2006). Particular examples include soured opinions of a host country (Aron 2014), creating a model for international cooperation (Whang 2006), and self-promotion of an ambitious dictator (Gordon and London 2006).

39 These issues of national identity, at play between individual athletes and often adopted by their teams, play out not only during major internationally organized events, but during intervening time between events as well, showing that the geopolitics of sport continue to effect media and public opinion following, and in preparation for, global meetings of athletes and nations.

2.4 Geography and Text: Computational Methods Although some geographers argue that the only way to synthesize geographic knowledge from textual information is to closely read it, many methods have been developed for automatically extracting relevant geographic information from text. In information science, text analysis presents a number of challenges. Many strategies exist for extracting semantic information from text, some of which are discussed here. First, this section discusses placenames in text and how current studies disambiguate one placename from another. The next method –geographic information retrieval – responds to geographically-enhanced textual queries using placenames and spatial relationships. Finally, this section discusses non-spatial clustering methods for summarizing the semantic content of text, including the LDA method used in this research. Non-spatial text analysis processes are also used in geographic applications, as the link between geographic information retrieval and LDA is demonstrated. 2.4.1 Placename Disambiguation The first step of Geographic Information Retrieval (GIR) relies on having an accurate list of geographic entities extracted from the document. Identifying placenames in text, as the first step in geographic information retrieval, is in itself a difficult task (Janowicz, Raubal and Kuhn 2011). In unstructured text, such as in published news articles, geographic information is often presented in a clear way via natural language conventions and the prevalence of geographic entities makes news a frequent testbed for GIR innovations (Lieberman and Samet 2012). Capitalization and proper grammatical structure help disambiguation of placenames because several means exist for part-of- speech tagging and other grammatical processing tools can be adapted for GIR (Karimzadeh et al. 2013).

40 Placenames are not unique, as many names may refer to multiple separate locations. Much work on geographic information retrieval from news articles focuses on refining the larger geographic focus of articles (D'Ignazio et al. 2014), rather than the technical procedure of identifying placenames. Identified placenames then go through the process of disambiguating the appropriate geographic place associated with the name. Context is necessary, usually identified by comparing other separate placenames in the same document, because the same name often applies to many places (Karimzadeh et al. 2013). Many tools exist to accomplish the task of placename identification, disambiguation, and geocoding, including GeoTxt (ibid) and GATE (Cunningham 2002). Determining and geolocating geographic entities is a difficult process, but only represents the initial step in geographic information retrieval from text. 2.4.2 Mining Big Text Data: Geographic Information Retrieval Most spatial applications of big data methodologies on text address the goal of retrieving information from spatial queries. Geographic information retrieval (GIR) is a spatial adaptation of the information retrieval concept, used to extract useful information from a large database. Information retrieval general consists of a database and a query, where the query is similar to some subset of the data with varying adherence to the exact search terms (Janowicz et al. 2011). Since the query need not fit an exact result under the more sophisticated IR systems, results should be quantified for their relevance to the initial query. Search engines are the most common type of IR, with specific adaptations to query syntax to take advantage of semantic similarities in query context and data retrieved (Ding et al. 2004). Geographic applications extend information retrieval concepts into geographic information retrieval (GIR). In GIR applications, a spatial reference is specified in the query – for example, ‘coffee shops on College Ave’ – which means not only does the thematic relevance of the query result (returning coffee shops instead of pizza shops) have to be presented and ranked, but the geographic relevance as well (finding College Ave and ranking coffee shops by proximity). Jones and Purves describe the unique spatial issues with geographic information retrieval: using spatial qualifiers alongside placenames, interpreting vague placenames and spatial language, comparing spatial with non-spatial attributes and themes, displaying matches and partial matches, and evaluating

41 the success of GIR (Jones and Purves 2008). Although this dissertation is not concerned with retrieving relevant information from a structured database and optimizing queries to do so, geographic information retrieval relies on several associated text processing procedures which are central to this dissertation research. 2.4.3 Thematic text analysis Semantic analysis is paired with geographic information retrieval to not only help determine vague placenames that can refer to multiple locations, but also to determine the thematic nature of the full text. Several methods have been proposed to computationally determine thematic information from the organization of terms within each document. - Term frequency-inverse document frequency, or tf-idf, weights frequent terms highly, then reduces weights when they are evenly spread among documents, creating a metric that highly weights terms used frequently in a small number of documents (Aizawa 2003). Tf-idf is useful for determining interesting individual terms, but contain little nuance about how one document differs from another. Latent semantic indexing (LSI) uses a term-frequency approach considering the terms in each document in a matrix, which is compressed into the most informative terms with a singular value decomposition (Deerwester et al. 1990). LSI was not built for classifying individual document, though it can be used as such. LDA was later introduced to classify individual documents, compare proportional thematic assignment between terms and documents, and improve performance of existing text analysis procedures. LDA is a text analysis procedure which derives topics from a collection of input text documents. As a generative model, it relies on sampling from a distribution of unknown topics over known documents based on shared terms (Blei et al. 2003). LDA generates k topics which are derived from the top t number of terms which in combination, uniquely differentiate one topic from another. Both k and t are user inputs. LDA utilizes aspects of previous processes for deriving thematic information from text. A minimum tf-idf is often used to remove terms from the vocabulary, leaving only the most informative terms across the entire corpus of documents. It also utilizes the term- document matrix introduced by LSI to compare the frequent use of terms per document. Preliminary investigation of varying combinations of these parameters indicates that high values of k and t and median values of tf-idf provide adequate difference between topics

42 and interpretative detail. This corresponds with suggestions given in other published work (Griffiths and Steyvers 2004). But these inputs are extremely data-driven, so this research attempts to establish geographic and information science theory linking text analysis results to the variable inputs to LDA. In addition to the advantages over tf-idf and LSI, LDA also enables intra-topic analysis at the levels of topics, documents, and terms. Because topics are mixtures of terms, some terms partially indicate more than one topic, and documents, which are mixtures of terms, also proportionally correspond to more than topic. This strength is a particular advantage for finding the intersection of primary themes in this dissertation, such as geopolitics and sport. However, the LDA process is also considered something of a black box. The inputs to the algorithm have little real-world significance – they are only relelvant in the context of the algorithm and the properties of the other inputs. Thus, there is no means to suggest optimal values for these inputs without knowledge of their impacts. Similarly, the probabilistic nature of the generative process can be non- reproducible and highly dependent on inputs for prior distributions. That is why this dissertation devotes significant attention to understanding the impact that inputs have on the process, for both communication and modification of results. LDA draws from statistical and information science concepts, but its focus on textual analysis means that it can be an important method to other social science domains, including geography. The author of several text analysis techniques, including LDA and other extensions of it, wrote a short article introducing text analysis methods to the digital humanities (Blei 2012). Since then, many have embraced LDA and similar methods for digital representation of text for social science research (Schnober and Gurevych 2015). Geographers have found use in LDA procedures to combine thematic analysis with spatial analysis. Much of these studies focus on mapping the semantics of social media. Lai and Rand control for spatial and temporal variation on social media posts to examine topics (2013). This study pre-selects several salient themes from social media for its data acquisition phase, which reduces the utility of LDA for finding interesting common themes, nor does it map the spatial distribution of pattern in geographic space. Lansley and Longley map a year of topics in georeferenced Twitter posts in the London area (2016). This study considers topics and sub-groups of topics and compares their locations

43 with that of several types of land use at the local scale, finding that areas of social media activity are not even across space. This example, however, does not consider the impact that LDA’s parameterization has on the topic distribution. This dissertation maps spatial variation in topics at regional and international scales and experimentally considers optimal values for parameters of LDA.

2.5 Summary I have reviewed several concepts which contribute theoretically and methodologically to this research. Big data drives much of the process. Advances in data science have enabled new ways to represent and analyze political content. One result is event data – a representation of political interactions and their spatial and temporal components, the actors involved, and the nature of the event itself. In geography, analysis of political content, especially that which appears in news and entertainment media, often is of a qualitative nature. Popular geopolitics explore the ways that political narratives are represented through media, even when consumed for the primary purpose of entertainment. This dissertation does not use the qualitative methods that are preferred in popular geopolitical research, but considers the separation of the inherently political nature of sports competition with the expectations of the audiences when consuming sports reporting. National and cultural identity, international corporate influence, and public protest are central to the institution of sports, especially when the media is involved in representing and disseminating them. These political messages are both visual and textual. Geography has adapted several methods for text analysis. Those methods and their unique purposes are considered before describing the Latent Dirichlet Allocation method that used here to derive a thematic structure from input documents. Together, these previous studies help understand the objectives and methods utilized here. In the coming chapters, this dissertation describes the detailed procedure that used to analyze and evaluate thematic and spatial patterns in news media.

44

3. Chapter 3:

Methodologies

45 3.1 Introduction This chapter more specifically addresses the data and procedures used to respond to this project’s objectives. First, it considers the analysis of news and social media text data using automatic text analysis generally, and Latent Dirichlet Allocation (LDA) in particular. The first section discusses LDA and its relevance to the larger theme of big data. The process relies on precise parameterization, with optimal suggestions included in existing literature and tested here for comparison within the specific context of news article texts. Section 2 explains the process of evaluating the results from LDA using interestingness measures. The third section describes the empirical context of this project, namely the Catalonian independence movement in the Fall of 2015. Finally, section 4 describes the data collection and analysis process used in the following chapters to derive thematic and spatial meaning from news articles. With a desire for reproducibility, despite the stochastic elements of the LDA method and the historical context of the news articles themselves, this chapter intends to guide future research utilizing LDA and specifically as a technique to examine news media sources.

3.2 Methods 3.2.1 Big Data Despite the use of big data approaches in many areas of geographic research, including those utilizing news media as a data source, geopolitics remains a qualitatively- driven subject (Korson 2014). Text and imagery have long necessitated a qualitative approach, but big data’s methodological and interdisciplinary shifts encourage new strategies for analyzing large collections of text, such that the sample size limitations of qualitative methods can be overcome. Although more is not always better, expanding the sample geographically, temporally, and news publication sources casts a wider contextual net of topical coverage. Big data, although not a method in itself, provides distinct ways of thinking about geographic problems. First, and particularly of interest to fields in the social sciences, it considers data collection and the nature of social data generation. Big data scholarship contends that information that directly proves or disproves a hypothesis may be difficult

46 to collect. Instead, readily available and plentiful data is often used as a proxy to the phenomenon a researcher intends to study. Thus, more data increases the chances of finding connections and patterns, even while increasing the spurious data points and patterns in the sample. The need to separate the meaningful patterns from the spurious ones gives light to the second impact that big data has on geographic research. Methodologically, big data requires that significant attention be paid to the methods for deciphering ‘the signal from the noise’ (Anderson 2008) within the data. Big data is expanding the breadth of quantitative methods used throughout social science, borrowing from information science, statistics, and others, and this dissertation utilizes one in particular for analyzing and classifying textual information. 3.2.2 Latent Dirichlet Allocation Latent Dirichlet Allocation, or LDA, is a generative process, which determines clusters within an input corpus of individual documents consisting of unstructured text. Clusters are primarily determined by the use of similar terms at a greater proportion than throughout the rest of the corpus. Two outcomes are important for this analysis. First, LDA sorts articles into similar clusters. That similarity is used to compare what separates one cluster from another. Second, clusters are defined by the top key terms which semantically separate them. These clusters, or ‘topics’ in LDA, provide a summary of the content of the corpus, reducing the dimensionality of the entire corpus to a user-defined number of dimensions. 3.2.2.1 Algorithm Procedure LDA is a generative text model, which assumes that there is an underlying, latent structure governing the content of each input document. Documents are mixtures of one or more topics based on the terms that comprise them, while terms themselves are also mixtures of one or more topics. The generative nature of the process involves repeated examination of documents by sampling topics and terms from those topics. Over many iterations, the terms within each topic and the co-occurrence of documents into a specific topic is refined and stabilized. Many implementations of LDA accomplish this process using variants on a Gibbs Sampler to estimate the iterative procedure in a Monte Carlo chain where previous estimations for model parameters guide the sampling of the next step until the estimations reach a minimum variation between iterations (Griffiths and

47 Steyvers 2004, Porteous et al. 2008, Grün and Hornik 2011). The result is a set of k topics, each of which is defined by a unique set of terms, as shown in the example in Table 3.1.

(a) catalunya independ vote catalan junt resolute commiss elect (b) snp, vote, labour, scotland, parti, independ, referendum, sturgeon (c) leagu catalan football club catalonia independ play game

Table 3.1 Sample topics generated by LDA with clear themes regarding, (a) the Catalonian independence context with that of (b) Scotland’s independence movement and (c) convergence between Catalonian independence and soccer

The topic generation process begins with the assumption that documents are random mixtures of latent topics. The variable k represents the total number of possible topics to which each document can be assigned. Although this mixture of topics over documents is random, each topic is not identically distributed over each document such that the random topic options are equally likely to be assigned to each document, except in the unrealistic scenario where the documents contain the same text. In other words, it is not assumed that each topic will appear in the same proportion in each document, and thus topics emerge semantically via the differences in text across documents. The selection of a topic for a document is made from the multivariate Dirichlet distribution, which specifies a higher likelihood for one or more topics than others. Given a vector of alpha coefficients with size equal to the number of topics, the Dirichlet distribution represents the sampling probability of each topic given by its corresponding power with respect to all other topics. A simplified three-dimensional representation of the Dirichlet Distribution is illustrated in Figure 3.1. The three dimensions represent topics, while the Z-axis represents the likelihood that that topic will be sampled. A random distribution would contain a flat surface, while Figure 3.1 shows that the Dirichlet distribution is characterized by variable distributions over the spaces of topics, given by the definition of an alpha parameter.

48

Figure 3.1 Graphical representation of the Dirichlet distribution of 3 variables. The height of the surface represents the probability of sampling related to the power given to each variable in the model. The height of the surface represents the probability of sampling in variable space, where a smaller alpha increases the probability of sampling a distribution of one variable at 100%, while a larger alpha increases the likelihood of an even distribution among the variables. Image from (Zhihui 2013)

Alpha specifies the power to each topic in the probability distribution. Typically, alphas are positive, but they may not sum to one across all variables, and each topic’s respective influence is considered independent on one another. Thus, as the top row of Figure 3.1 demonstrates, increasing alpha’s magnitude increases the slope of the probability surface between the most and least likely sampling spaces, while the proportion between individual alphas determines the location on the simplex of the most likely sampled attribute combination. In the context of LDA, the vector of alphas corresponds to the expectation of any given document being assigned to the corresponding topic. Since the topics are not known a priori, the alphas cannot be known either, at least not as a vector of variable values. Therefore, the topic alphas are given as a single value in LDA, where the magnitude of this alpha determines the slope of the probability surface. Thus, a smaller alpha is more likely to generate documents located on the edges of the topic simplex, whereas a larger alpha, is more likely to generate documents located at the center of the topic simplex, representing equal topic distributions over all topics. The Dirichlet distribution is sampled once per document to estimate a mixture of topics. According to that mixture of topics, each term in the document is assumed to be generated by one or more of the topics. The term is then assigned to one of the topics

49 from the multinomial topic distribution given the document’s expected proportional topic breakdown. Then, each term in the document is reassigned to one or more topics based on the probability that each topic generated the term, taken from the assignment of terms to topics across the entire vocabulary, and the probability that the topic also generated the current document. This joint distribution is estimated by sampling from the known distribution to integrate out the Dirichlet prior distributions and calculate the relevant probabilities with the known data (Griffiths and Steyvers 2004). After many iterations, the process stabilizes as reassignment of terms to new topics are less likely. LDA also specifies a beta Dirichlet prior distribution parameter for terms per topic. A small beta parameter ensures that a representative number of terms may help define multiple topics where semantically they match some general aspect of those topics, while more rare or specific terms may help define a much smaller set of topics. Using a small beta is standard (Griffiths and Steyvers 2004) and a value of 0.1 is maintained in this dissertation over all parameterizations of the process. Using the alpha prior and the k number of topics, the LDA process iteratively samples random terms from each document, refining the topic definitions and the assignment of each document to a topic with iteration. The process converges toward zero refinement of the topic and document definitions with infinite iterations. One thousand iterations are performed before assuming convergence. This process varies from other topic models, such as latent semantic indexing (Deerwester et al. 1990) because it samples each document repeatedly and with randomized selections. This allows each document to be placed proportionally in more than one topic. The LDA process is very dependent on initial parameter choices. The number of topics impacts the number of unique themes and the detail in those themes that can be detected from the corpus. The initial choice for k is an important one, and one typically left to the discretion of the operator. A smaller number of topics are easier to evaluate with respect to one another, though in contrast, unrealistically high values for k may also generate usable results in that the topics may diverge to a much greater degree, allowing the operator to find the most unique subsets of input documents (Blei 2012). Some suggestions are given (Griffiths and Steyvers 2004), but among the many applications of LDA in the literature, a large range exists for k. Similarly, alpha is dependent on k and

50 does not have a theoretical basis so it must be tested. Some LDA implementations calculate an optimal alpha given an initial starting value of 50/k (Griffiths and Steyvers 2004, Grün and Hornik 2011). In this dissertation, the method of calculating an optimal alpha is performed first on multiple models with a variable k to obtain a range of acceptable alpha values for testing, which are then used in additional implementation which requires manual input. LDA produces several useable outputs. First, for understanding the content of the entire corpus, the topics are defined by the eight most important terms for identifying a given topic over other topics (eight terms was a choice made for this analysis, and is explained below). This result can be interpreted as a semi-classification of the collection of documents, as the eight terms generalize the high-dimensional data and can be interpreted semantically into themes. Each topic typically consists of many documents, and each document can be assigned to potentially many topics, proportionally. Most functions will require just the primary topic, or topic with the largest proportion, however, evaluations of a topic’s uniqueness, one of the motivations examined in Chapter 4, the secondary topics are also considered. 3.2.3 Topic Disambiguation LDA generates topic labels using terms that appear in the text itself, creating a form of in vivo code. The process of in vivo coding involves taking descriptive text and creating analytical codes, which consist of specific and revealing passages directly from the text (Strauss and Corbin 1990). As LDA assumes that latent topics generate the unique combinations of terms within documents, the topics are groups of terms frequently used together, and more often than not those terms generate semantic similarities which are interpreted as semantic topics. The terms used to delineate and describe topics generally exhibit a semantic structure which is interpretable, but it is ultimately the job of an evaluator to interpret the topic semantics and name each topic given the combination of terms which describe it (Blei 2012). This interpretation and labeling process addresses the second research objective of the project; to examine semantic spaces produced by the media in its role as narrator of a complex local and international geopolitical process by identifying and comparing topic codes and their interpreted empirical labels in news articles and social media posts. As will be seen in

51 this analysis, the similarities and differences in narratives from different news sources, and spatial variation among them, are important for contextualizing the expression of political identity at multiple spatial and social scales. Interpreting the results is dependent upon the amount of detail used to represent each topic. Although the choice of the number of terms to represent each topic is important for the interpretability of the result, it is not considered an input parameter. Rather, it is a choice of convenience versus comprehensiveness. Because LDA assumes that topics are a mixture of terms, each topic is represented as a proportion of each term in the vocabulary. Many terms do not contribute to the establishment of most topics, but the choice of how many terms can range from one to the size of the vocabulary. Semantic overload and interpretability demands a number of terms which is small enough to reduce cognitive burden and large enough to derive a contextual meaning from the combination of terms. For the combination of readability and topic disambiguation, eight representative terms have been chosen to compare topics in this project. Significant effort has been made to understand how humans interpret the output of latent variables models such as LDA with respect to objective measures of the algorithm’s success. In tests measuring the interpretability of topic models versus commonly used quantitative methods, such as log likelihood, for determining the representativeness of models to their input data, the opposing strategies exhibited a negative relationship among several similar topic models (See Figure 3.2 for a representation of this relationship) (Chang et al. 2009). Even in similar work where human coding was compared to automated topic methods, the general conclusion was that more similarities exist than disparities, the presence of individual terms in the definition of a topic disrupted the interpretability of model results (Blair et al. 2016). Additional research is needed to fully evaluate the utility of topic models for subjective interpretability. This dissertation attempts to contribute to this effort emphasizing subjective evaluation measures for pattern interestingness described in this chapter.

52

Figure 3.2 Plots of three common text modeling procedures – the Correlated Topics Model, LDA, and probabilistic Latent Semantic Indexing – used on two textual sources – the New York Times and Wikipedia – measured by two typical quantitative clustering mdethods (Chang et al 2009)

One potential critique of human interpretation of LDA’s output refers to the subjectivity of determining the context in which multiple terms generate a summary of underlying semantic structure. It is inevitable that individuals will have unique understandings of the thematic nature of the terms’ co-occurrence in any given derived topic. Inter-coder reliability is one method utilized in many situations where subjectivity is expected or inevitable, and requires agreement between multiple people and their evaluations (Lombard, Snyder-Duch, and Bracken 2002). In this project, however, context is critical for understanding the nature of topics that emerge from the corpus of news articles collected. Thus, to fully evaluate the content of resulting topics without requiring each evaluator to pursue extensive background research on Catalonian geopolitics, this project has employed the assistance of a single coder who has experience with Spanish and Catalonian culture during recent years. Evaluating the data together, we examined the topics with respect to the spatial, temporal, and cultural perspectives of our research and experience. We discussed and came to mutually agreed upon interpretations.3 This joint evaluation enabled the interpretation of the results of this

3 As part of the Undergraduate Research Opportunities Connection program organized by Penn State’s Department of Geography, I have supervised Latrese Morris, a Penn State junior majoring in Geography and English, for internship credit.

53 project and enhanced the data analysis by providing a measure to correct for instances where language served as a barrier to our ability to interpret results.

3.3 Evaluation I provide several ways of evaluating the effectiveness of classifications achieved by the LDA procedure. First is a series of evaluation measures described in Chapter 2 meant to consider the ‘interestingness’ of patterns derived from data analysis. Ranging from the quantitative to the subjective, measures such as unexpectedness, actionability, (Silberschatz & Tuzhilin, 1996), conciseness, generality, novelty, and diversity (Geng and Hamilton 2006) are offered. Conciseness and generality can be implemented as rules in an iterative analysis, while actionability and novelty require subjective input. Interesting/insightful results need not satisfy all of these measures, but explicit consideration of some in both their objective and subjective interpretations benefits the classification process by providing substantial interrogation of classes and assignments. Interestingness measures provide specific advantages over existing validation methods for understanding and quantifying the success of a data-driven process. Common validation or model selection procedures, such as expectation maximization (Dempster, Laird, and Rubin 1977), attempt to compute the validity of a model by maximizing the parameter values such that the results are most likely to have been generated by the input data. This is a fairly non-specific goal, and choosing a model directly by expectation maximization could yield a suboptimal model due to local maxima and depending on the definition of model success. Models derived from unsupervised processes need a nuanced evaluation which considers multiple factors when optimizing, rather than a single broad attempt at validation. In the following sections, this dissertation describes several methods for evaluating the results from data-driven methods to provide an alternative to the problems associated with expectation maximization. The first section discusses the expectation maximization measure as a factor for comparing the output of topic modeling. Then, each of the interestingness measures, as described by Geng and Hamilton (Geng and Hamilton 2006) and expanded upon by others, are introduced along with their specific implementations for this project.

54 3.3.1 Expectation Maximization Expectation maximization (EM) is a subset of likelihood estimation models aimed at comparing model outputs to the input data and parameters used to generate them. In a latent variable model, likelihood estimation measures the fit of the latent variables to the input data. Likelihood estimation is a convenient validation method to test the results of many methods which discover unobserved structure in complex data, and is used to determine an optimal set of input parameters. EM is a common example, although it is not without known flaws (Do and Batzoglou 2008). EM works similarly to the iterative process of LDA by iterating through attempts to test and refine the missing or latent data structures of the data, approaching a stable estimation. Expectation maximization is named for the two steps – an expectation step and a maximization step – which it uses to perform this estimation. In the expectation step, values for the latent information are defined as the model’s expected parameters, and a log-likelihood function is computed based on the fit of those parameters to the observed data. Because of the iterative structure of generative models and the LDA process in particular, this step is computed as part of the normal procedure. The next step, the maximization step, adds negligible processing time. The maximization step maximizes the log likelihood function by estimating optimal parameters (Dempster, Laird, and Rubin 1977). The simple theory behind the EM algorithm works well for quick comparisons between model parameter combinations and general evaluation, but in it comes with inconsistencies which threaten the ability to use the process as an independent validation method. EM is very susceptible to the initial values for parameters. EM’s parameter estimations are not independent between subsequent iterations, increasing the likelihood estimations are influenced highly by initial selections. In the case of incomplete data estimations, EM breaks down the problem into assumptions which result in a guarantee of only local maxima in the likelihood function (Do and Batzoglou 2008). Further, a latent variable model will perform best with respect to log likelihood when the latent variables match the data precisely, as in the case when the number of clusters approaches the number of observations. This is an unrealistic assumption, but underlines the

55 tendency of EM estimates to overfit to the training data in the expectation step (Karlis and Xakalaki 2002, Tian et al. 2011). These tendencies make automatic identification of optimal parameters and model selection impossible, but as a general guide and with careful selection and parameter testing, EM can be useful (Tian et al. 2011). As with many statistical procedures, evaluative statistics like log likelihood and p-value provide a valuable overview of potentially useful models, but alone they cannot provide actionable validation methods (Ward, Greenhill, and Bakke 2010a). Further evaluation methods are required to overcome the ease of applying the EM process to a latent variable model and use a maximum estimation to choose the correct parameter combinations. I propose such a list of methods to further evaluate the desired aspects of a latent variable classification method, meant to encourage consideration of the important definitions of a model’s success beyond the circumstantial and inconsistent results of the EM algorithm. 3.3.2 Interestingness for Evaluation Interestingness has long been proposed as a method for better understanding processes for discovering knowledge from databases (KDD), but has not been implemented in such a comprehensive form in geographic information science literature as is done here. Interestingness may refer to any of several related methods which attempt to define some subset of more valuable results. In many implementations of interestingness as applied to KDD procedures, the measures are built into the process of knowledge discovery to guide the pattern discovery process, rather than as an evaluative technique over many patterns. Building interestingness measures into the process of knowledge discovery ensures that only the most interesting or meaningful results are returned and optimal inputs used to achieve the most interesting outputs, reducing the burden on the evaluator to manually observe unique results. In this study, however, interestingness is applied after the knowledge discovery process to compare several sets of patterns and isolate specific variations that result from combinations of inputs. Each result generated by the LDA process and its specific parameter combinations is considered to be a valid result, and thus, a pattern worth exploring for its significance through interestingness. The project accomplishes two goals by separating the interestingness measures from the topic modeling processes. However,

56 arguably, evaluation is integral to the knowledge discovery process, and inherently not separable from the processes that yield knowledge from databases, but for evaluative purposes, topic modeling and evaluation are considered as separate processes. First, to address the third research objective regarding the critique that a lack of theory guiding the exploratory analysis necessary for investigating large datasets yields untrustworthy and non-confirmable findings, the project utilizes specific interestingness measures linked to variations in input parameters to establish theories on the use of topic models on large collections of news text. Theories which guide the use of topic modeling methods are lacking, presenting a barrier to their use in many fields, including spatial and social sciences. These barriers are reduced by identifying interesting features of variable input combinations to these methods. Secondly, separating the evaluation phase of the knowledge discovery process from the analytical phase enables more thorough examination of specific interestingness factors. 3.3.2.1 LDA Outputs Facilitate Interestingness Evaluation Subjective interestingness measures are one such example. Some subjective measures can be implemented by establishing rules corresponding to prior knowledge which guide the KDD process. Obviously, however, no rules can fully represent a discoverer’s lack of knowledge, which is critical for establishing the novelty of patterns with respect to individual users. When evaluated by an individual whose judgement reflects knowledge and gaps in that knowledge, subjective measures accomplish their objective in identifying interesting patterns with respect to other, less interesting patterns, rather than relying on formalization of that knowledge. For more objective measures of interestingness, collection of topics that generate each pattern and the ways that documents are assigned to them are compared against one another. An individual pattern is made of a set of k topics defined by the set of the most indicative terms which separate it from other topics, and every document is proportionally assigned to each topic with a value between 0%, referring to no assignment, and 100%, which indicates a complete assignment to only one topic. Each input news document is given a primary topic – the topic which contains the highest percent of the article’s assignment – and up to four additional topics when applicable. Most articles are given a single topic with 100% assignment, though a primary topic may

57 be assigned with as little as 33% assignment. In any scenario where the vocabulary of terms given as input to the topic modeling procedure (which is a necessary step in text modeling, as many terms are too common or too rare to indicate any latent structure), a subset of articles will inevitably not be representable because it consists of solely terms excluded from the vocabulary. The generality/coverage measure explained in Section 2.2.2 discusses unassignable documents and attempts to measure the amount of data lost in this critical processing step as a loss of important information and therefore interestingness. The proportional assignments of each article to one or more topics is also used in the evaluation of a topic’s as well as a pattern’s diversity and peculiarity. Comparing patterns for their interestingness is crucial in evaluating them as the result of unique combinations of inputs. I use two methods to compare interestingness within individual measures of interestingness; subjective rankings and numerical scales. Several factors make it impossible to compare across interestingness measures, theoretically and methodologically. Methodologically, subjective measures are harder to define as a percent of a theoretical maximum interestingness, which would allow it to be compared against other interestingness measures on a similar scale. Instead, I rank subjective measures based on their respective interestingness on individualized scales of value. Objective measures may have theoretical maximum values, and therefore can be evaluated on consistent scales, but the theoretical maximum for one value is likely more reachable than for a different value, so to compare them as more or less interesting as a combination of multiple measures misses the point of evaluating different modes of interestingness for individual patterns. A pattern that is 70% diverse and 40% novel, for example, is not 55% interesting. It may be an interesting pattern where other patterns have consistently lower diversity. Many interestingness measures are also applied in similar ways to both patterns and the attribute-value patterns that comprise them. Topics comprise patterns, so the interestingness of one should reflect the interestingness of the other, but one way that topic models produce knowledge from databases involves identifying subsets of the input data to devote to further scrutiny. Particularly to address the second research objective, the project identifies interesting topics within patterns (which may themselves be interesting, but do not necessarily have to be) which could reveal sets of news articles

58 which refer to the merging of sport and politics in interesting ways. Interestingness literature distinguishes between evaluating association rules and summaries of those rules in knowledge discovery applications (Geng and Hamilton 2006). Similarly, the project applies interestingness measures to full patterns as well as to individual topics within patterns (though analogously, topics or classifications do not have the same function or structure of association rules). Where possible and applicable, the project employs the same measures to evaluate both patterns and the topics that comprise them that emerge from unique combinations of input parameters. Chapter 2 described the differences between objective and subjective measures of interestingness. Objective measures take into account only the quantifiable output of KDD processes. Subjective and semantic measures depend largely on the opinions, knowledge, and decisions of the knowledge discoverer relating to the specific context of the data and inputs (in the case of semantic measures). All types of measures ultimately involve formalization of the rules and beliefs which define interestingness to the evaluator. Thus, even the objective measures described in this section have subjective decisions inherent in the way that they are measured and implemented. Those decisions are necessary to adapt the KDD concepts to the current textual application, but they do not have an exclusive bearing on the evaluation of one result against another like the subjective measurements do. The following sections explain the methods used to evaluate topics and patterns for each of seven different interestingness measures. For objective measures, the KDD definition and adaptations for use on the results from LDA are described. For the remaining subjective measures, the original motivation of the measure in KDD literature is presented, and the subjective interpretation of that measure for the research objective of the current application is given for specification of its interestingness. Although all have subjectivity built into their definitions, every attempt to maintain the original intention of each measure is made, particularly in the case of the objective measures. 3.3.2.2 Conciseness Conciseness of a pattern is achieved by having relatively few attribute-value pairs which comprise it (Geng and Hamilton 2006). The property of conciseness ensures that a pattern more easily fits into an evaluator’s existing knowledge because of its brevity.

59 ‘Attribute-value pairs’ is a suitable description for the article-topic constriction of patterns generated by LDA, so the conciseness of such patterns is easily measured and the input parameters of the algorithm directly affect its variability. Specifically, the number of topics, k, is the only parameter which impacts both the number of values, or topics, and the number of attributes that are assigned to each value, since each article must be assigned to at least one topic proportionally, and each topic must consist of at least one input article. Therefore, conciseness is inversely proportional to k, and no other parameter impacts the conciseness of a pattern measured in this way. A smaller set of topics fits more easily into an evaluator’s existing knowledge. A modeler may choose the number of terms to display in each topic definition, which also increases the size and reduces the readability of the result, but has not impact on the performance of the model. As per this definition and the impact that the number of topics solely has on the size of the result set, evaluating the conciseness of LDA is simple and directly influenced by the modeler’s choice of the k parameter. Inherent in modeling and knowledge discovery is one interpretation of Occam’s Razor, which states that simpler models (given all other parts being equal) should be preferred because simplicity is a goal unto itself (Domingos 1999). Although the theory does not apply to all scientific problems, and especially not to some machine learning techniques where comprehensive parameterization and training increase model complexity to increase representativeness, simplicity does drive the interpretability and preference of some model options over more complex ones (Padmanabhan and Tuzhilin 2000). 3.3.2.3 Generality/Coverage Generality and coverage measure how well a pattern encompasses the full input data. A general pattern covers a larger range of data, leaving few input observations unclassified. This property is measured simply by calculating the percent of the total input documents that were assigned to a topic by the LDA process. Generative text models contain the feature that each input document which consists of at least one term that can be used to help define a topic will be assigned to at least that topic. The only factor that effects this list of terms – the vocabulary – is the term frequency-inverse document frequency. The tf-idf first measures the frequency of each term used throughout the corpus of documents by counting the occurrences of each term. That

60 frequency measure is then normalized by the distribution of terms across all documents, such that evenly-distributed terms have reduced tf-idf values (Aizawa 2003). Thus, a term with a higher tf-idf will be more representative and deterministic of a topic subset of input documents. The minimum tf-idf value is chosen to reduce the vocabulary of terms to only the most representative ones. Literature suggests that choosing a minimum tf-idf that is ‘a little less than the median’ will offer a reasonable tradeoff between deterministic terms and size of vocabulary (Grün and Hornik 2011, 12). A smaller vocabulary yields fewer terms from which to define topics, and reduces the number of usable documents as individual documents are less likely to contain any terms in the vocabulary. Therefore, generality/coverage is inversely proportional to the minimum tf-idf. The only way that documents can be excluded from classification by the LDA is if the document contains no terms that correspond with the vocabulary. This relationship becomes more extreme when documents are short, because each document already has a higher probability that none of its terms correspond with the vocabulary. News article-length documents, however, have sufficient length and term diversity that the effect of reduced vocabulary is lessened. Nonetheless, the shortest and least diverse articles stand higher likelihood of not being included in the classification, and so generality is a necessary measure to account for reduction in the document corpus. Geographically, generality/coverage captures the necessity to use the power of representation to recognize when some voices aren’t considered (Kwan 2002b). Kwan considers this concept in terms of representing the experiences of women and minorities (Kwan 2008) in GIS data, where leaving whole classes of experiences off of the map ignores their perspectives. In big data research, missing information is assumed to have little effect because the pattern remains evident, however, this assumption is incorrect (boyd and Crawford 2012), and generality/coverage attempts to formalize how much information is lost in the analysis process. 3.3.2.4 Peculiarity Peculiarity attempts to measure the divergence between multiple patterns. The higher the separation between patterns, the larger the expected peculiarity. An optimally peculiar pattern will have no semantic overlap with any other pattern. This is an

61 unrealistic expectation, given that many of the same terms are important for defining topics, regardless of the number of topics or other input parameters, ensuring that there is some semantic overlap between patterns, and even between topics within a pattern. Evaluating the peculiarity of an individual pattern aims to determine how much of the pattern is unique to it, rather than knowledge that is replicable in a separate pattern. This is frequently done using a distance measure (Geng and Hamilton 2006) to evaluate how separate a pattern is from another pattern. To measure peculiarity as intended in the original KDD defintions, some subjective reinterpretation into a textual context is necessary. To analogize the definition for this LDA-driven study, peculiarity is measured via the number of terms in the discovered latent topics that are not shared among patterns. This dissertation measures peculiarity of a pattern by comparing the terms identified as important for delineating its topics to the terms which delineate every other topic. Although a subjective decision, this interpretation of pattern divergence adheres as closely to the KDD concept as possible. The lower the proportion of its terms which it shares with the other patterns, the larger its distance to the second pattern. The distances are then averaged between the given pattern and every other pattern and the inverse taken to calculate the pattern’s peculiarity. A simple exception makes this calculation slightly more complicated. As the number of topics fluctuates, the number of terms which are assigned to a single pattern changes exponentially. Comparing a 10-topic pattern, which has 80 (not necessarily unique) terms to a 50-topic model which has 350 terms means that the proportions and therefore the distances are not symmetrical. To ensure that distances are symmetrical, peculiarity is calculated by the proportion of the pattern with fewer topics and that distance is applied to both patterns compared. 3.3.2.5 Diversity Diversity refers to the range of represented values within a pattern. Like peculiarity, diversity expects a large variance among patterns. But unlike peculiarity, which compares across patterns, diversity measures variation within patterns. The diversity of a pattern attempts to quantify the range of separation among the individual topics within the pattern. Increasing diversity indicates a more interesting pattern through a wider range of potential insights. This follows the principles stated by Heubner (2009)

62 quoting Hilderman and Hamilton (2001) that a diverse, and therefore interesting, pattern is one that is optimally far away from the normal distribution. In order for the topics of a pattern to be separate for the purpose of measuring diversity, the topics must not be proportionally assigned to other topics, so a set of non- diverse topics yields the optimally diverse pattern. As stated in section 2.2.1, most of the articles which are categorizable by LDA have 100% assignment to a single topic, though many are proportionally assigned to multiple topics based on the combinations and frequency of terms which comprise them. LDA assigns the topic with the highest proportion to each input document. Utilizing these proportional assignments of documents to multiple topics generates the ability to measure the overall similarity between two topics as a function of their document members. Although there is subjectivity in the decision to represent the semantic overlap between topics in this way, the diversity measure is applied in the same way to each pattern and set of patterns, separating this measure from the subjective ones discussed in this chapter, which are dependent on the goals and intentions of the evaluator. Contrary to intuition, a diverse pattern relies on topics which themselves are not diverse. A more cohesive topic, or one consisting of documents coded 100% for the given topic, has minimal diversity, though a maximally diverse pattern consists entirely of 100% coded articles, ensuring complete separation between topics. Thus, articles’ proportional assignments affect topic diversity and pattern diversity in contradictory ways. This project emphasizes the diversity of patterns and the topics which contribute the most to diverse patterns by themselves being non-diverse. A pattern consisting of a set of non-diverse topics is interesting because of the separation between those topics, quantitatively and semantically, potentially leading to more unique insights than topics which share semantic overlap. Diversity is not always viewed as a positive measure of pattern interestingness when nuance and topic similarity provide their own interesting summaries of the input documents. LDA provides the benefit over other text modeling methods, such as Latent Sematic Analysis (LSA) (Deerwester et al. 1990) and probabilistic Latent Semantic Indexing (pLSI) (Hoffman 2001), that documents may be proportionally assigned to more than one topic, rather than a single topic alone. The single topic assignments of other

63 methods yield a full diversity score, as topics are not allowed to converge via documents’ multiple assignments. LDA incorporates those convergences, introducing imperfect diversity. LDA’s multiple assignments are considered a methodological advantage, and thus pattern diversity is deemphasized. An LDA model which exhibits no diversity is thus equivalent to the LSA or pLSI models, and so LDA assumes that the latent structure of topics necessitates a reduction of diversity. Here, diversity is measured as the variance from an assumed even distribution of each topic over all documents. Such a situation is the opposite of the result generated by the LSA model, where each document is coded with 100 percent certainty into just one topic, by specifying that each document is assigned at the proportion of 1/k to each of the k topics. This extreme lack of diversity, where every topic is as likely as the next to have generated the observed documents provides no analytical advantage, but allows for sufficient calculation of each pattern’s variance from the uniform distribution and complete lack of diversity. In one example of diversity, Blei – one of the authors of the LDA algorithm – demonstrates interesting results by manually choosing semantically non-overlapping topics for display (Blei 2012). Interesting patterns need not consist of entirely diverse results when a subset of topics within a pattern indicate high diversity. So this project’s evaluation of pattern diversity considers the arrangement of articles within topics to determine the level of diversity of topics within patterns. The proportions of each article which are assigned to each topic are central to the calculation of diversity. Here, the variance formula identified by both Hilderman and Hamilton (2001) and Huebner (2009) is used to measure the variance of patterns with respect to an uneven distribution, shown in Equation 3.1:

∑푚 (푝 − 푞̅)2 푖=1 푖 푚 − 1 Equation 3.1 Measure of variance from an even distribution used to evaluate diversity of an association rule in (Hilderman and Hamilton 2001). Where 푝푖 is the probability for class i, 푞̅ is the average probability for all classes, and m is the number of classes.

As Hilderman and Hamilton’s variance measure computes the average difference from an even distribution, Equation 3.2 reflects a modification to the definition to reflect

64 the components of topic models and the proportions of each article assigned to their primary topic:

푚 1 2 ∑푖=1(푝푖 − ) 푘 = 퐷퐼푉 푚 − 1 푡표푝푖푐 Equation 3.2 Adjusted measure of variance for proportional assignment of documents to topics. The equation is essentially unchanged from equation 3.1, but specifies i as an index for each article coded into a topic, m as the number of articles in that topic, 푝푖 as the proportion of the article coded into the given topic, and 1/k is the expected proportion if the document was coded equally among all k topics.

The expected distribution is defined as an article that is evenly distributed among all topics. This highly unlikely scenario represents minimal diversity, as no topic information can be gleaned from a combination of articles and topics indistinguishable from one another. Thus, this measure computes the deviation from the uniform distribution, and measures the contribution of each topic to the pattern’s diversity. Although this measure is applied to each topic in turn, it in essence measures the inverse diversity of each topic, or the topic’s contribution to pattern diversity. From each topic’s diversity score, two calculations comprise the extent of the interestingness concept. First, the variance calculation of a topic’s diversity can help identify the individual topics which are furthest from the other topics for more in-depth exploration. The qualitative interestingness measures explained below are primarily used for this purpose, but a topic’s diversity score will closely coincide with such a measure as novelty which aims to select topics based on their greater chance of containing unique and interesting information. The combined topic diversities within a single pattern is also used to compare parameterizations across sets of topics. The mean diversity of all topics within a pattern is calculated to derive the pattern’s diversity score and use that calculation to compare the diversity between all patterns and parameter combinations. 3.3.2.6 Reliability The reliability of a pattern measures the replicability of important information between subsequent patterns. Rather than consider reliability as the inverse of peculiarity, where diverging patterns are more interesting, reliability refers to the ability of the analysis to maintain thematic information from pattern to pattern, assuming that the information within each document remains constant between subsequent patterns. If that

65 assumption were correct, then parameterization would have no impact on the topic definitions observed and a completely reliable pattern provides no interesting information over a simpler, more efficient parameterization. However, it is also true that different parameterizations do impact the themes which appear in topics and the terms used to define each topic. So reliability is the closest of the interestingness measures to calculate some form of trustworthiness of the algorithm. For this reason, reliability is important and necessary, though in combination with one or more other measures to define both what makes it interesting and what makes the LDA process more than a random collection of terms and documents. As with the other objective measures, some subjective decisions were made to apply the original interestingness concept of reliability to this application of LDA. Since consistency can be measured by the classification of individual documents, reliability is measured by comparing that classification across each computed pattern.Reliability of a pattern is measured by comparing the classification of individual documents across each computed pattern. Reliability is higher where the same documents are classified into the same topic and the same terms appear in the topic. A random sample of 100 documents is selected (which corresponds with the maximum number of k topics used to parameterize LDA to attempt a representation of each topic) and track the changes in topic definitions and co-coded documents across different parameterizations. These changes are measured using the Jaccard coefficient (Chandrasekharan and Rajagopalan 1989), which simply calculates the intersection of two sets, divided by the union of those sets. Aside from being a simple similarity measure to calculate, Jaccard similarity compares sets of strings, rather than other set comparison methods, such as cosine similarity, which rely on matrices and counts. Thus, the Jaccard coefficient is used to compare sets of terms which define each topic, as well as the sets of documents added to the same topic via the textual representation of a unique index for each document. In general, greater similarity equals reliability. Because Jaccard similarity is a binary comparison, reliability is calculated with a combination of all pairwise set comparisons. First, to compare the similarity between two patterns via the topic definitions, the Jaccard similarity between each pair of topics is found by comparing the set of eight terms defining each topic. Two topics which are

66 more similar will contain more of the same terms, increasing the intersection of the sets. The mean of all Jaccard similarity values for each pair of topics within a pattern (Jaccard similarity is symmetrical, so the diagonal and lower left of the similarity matrix are ignored) is compared against the mean for all patterns’ Jaccard similarities, with greater divergence from the mean similarity equaling that pattern’s reliability measure. The expectation, also stated in Geng and Hamilton’s (2006) explanation of reliability, is that reliability appears as the inverse of peculiarity and of novelty, where interesting information cannot be captured by other patterns. Reliability strives to achieve a model which reproduces consistent results. Using recommended values for input parameter combinations (for example, the calculated alpha prior distribution given a specific k number of topics) should increase reliability scores by keeping topic definitions and document co-occurrences. Reliability does less than its inverse interestingness measures for identifying specific pattern for further scrutiny. Reliability will therefore be given less weight to the evaluation of pattern than other measures, but where it identifies interesting patterns in conjunction with other measures, those patterns are given priority. 3.3.2.7 Novelty Novelty represents the first of the subjective measures of interestingness employed in this dissertation. Novelty measures the uniqueness of individual patterns with respect to the knowledge of the individual doing the evaluating. The information in a novel pattern must not be previously known to the evaluator, nor can the information be discovered in a separate pattern or combination of patterns (Geng and Hamilton 2006). Since no knowledge discovery system can represent the universe of an evaluator’s prior knowledge, let alone the inverse of that – what the evaluator dos not know – novelty must be tested subjectively. The previous objective measures are possible because they consider the limited universe of the pattern and the collection of patterns. Novelty is very similar to the concept of peculiarity discussed previously. A novel pattern, which is unique from other patterns, is also a peculiar pattern in that the same information cannot be gleaned from other patterns. The difference is in the objective/subjective nature of these measures and the universe which is considered in the comparison. Peculiarity objectively measures how unique one pattern is from all other patterns by considering the universe of all patterns. Novelty is based on evaluator

67 knowledge and is evaluated manually, so it considers a much wider universe of knowledge of the user. Novelty is maximized by finding individual topics which contain new information. The subjective measurement of novelty ranks patterns by the uniqueness of the topics within them, not with a quantitative comparison. This dissertation calculates novelty by examining the terms designated as important for defining each topic, and assign a topic title to summarize the combination of terms. As cautioned by Blei et al, the key terms which designate each topic represent the primary dimensions of documents as per the latent ‘topic’ variable, but not necessarily should they be considered semantically relevant or interpretable in semantic space (Blei, Ng, and Jordan 2003). While terms individually certainly do not provide any interpretation power for topics, since they often (and probabilistically will with larger numbers of topics) repeat in multiple topics within the same patterns. For the sake of increased confidence, this evaluation stage employs an assistant to help evaluate the novelty of patterns. With a basic understanding and appreciation of the Catalonian context and the relevance of sport to that context, my research assistant was able to help me evaluate patterns for novelty, in addition to being my Spanish-speaking translator. It should be assumed then that some topics will transcend different patterns generated by combinations of input parameters, even if the specific terms delineating those topics do not entirely match. With increasing topics, the distribution of unique key terms per topic decreases, such that patterns with smaller k and larger tfi-df have a smaller variety of topics and terms which designate them, and hence are less likely to be novel. Certainly, the combination of terms creates unique semantic situations, and in larger k patterns, terms are more likely to be repeated in multiple topics, so novel patterns will be generated by unique combinations of terms. The possibility of unique combinations of terms favors the greater potentiality of novelty in patterns with large k and smaller minimum tf-idf. With this in mind, novelty is evaluated with the lowest potentiality first, anticipating that with additional evaluation will come greater recognition of novel information. 3.3.2.8 Unexpectedness/surprisingness Unexpectedness, like novelty, is subjective and measured against the knowledge of the person doing the evaluating. The only difference between novelty and

68 unexpectedness is that in order for a pattern to be unexpected, it is not necessarily represent new or unique knowledge, but must contradict the existing knowledge of the evaluator (Silberschatz and Tuzhilin 1996, Geng and Hamilton 2006). Unexpectedness and surprisingness are equivalent concepts in Silberschatz and Tuzhilin and in Geng and Hamilton. Contradictory findings are clearly interesting in the context of exploratory research for discovering and explaining alternative arguments and proofs. Unexpectedness is simple to overlook when the evaluator’s beliefs are strongest, so contradictory and surprising results should not be overlooked. Like in the case of novelty, individual topics and their terms cross over between patterns, so patterns are more similar than they are different. It is the topics and the combination of terms comprising them which can produce unexpected results. Comparing these terms and combinations provides for rating of the degree of surprisingness, though in this project, surprising patterns prove to be somewhat rare. Thus, surprisingness is best described here in a binary way, where a pattern exhibiting surprisingness is in itself significant. Like with novelty, a research assistant assisted the process of evaluating patterns and topics for their unexpectedness with respect to both the Catalonian context and the context of other patterns by discussing and reconciling our subjective interpretations. 3.3.2.9 Utility, Actionability Finally, each pattern is evaluated for its ability to contribute towards reaching and acting on a goal. For the purposes of this project, the concepts of utility (Geng and Hamilton 2006) and actionability (Silberschatz and Tuzhilin 1996, Geng and Hamilton 2006) are combined because the difference between reaching the goal (utility) of identifying a subset of news articles exemplifying a link between politics and sports, and the ability to act upon the pattern (actionability) by mapping the subset and presenting spatial patterns is negligible. Utility and actionability not only rely on the subjective decisions and knowledge of the evaluator, but the semantics and domain knowledge of the data as well. An unexpected pattern, for instance, may not be actionable if it does not facilitate any further exploration or action. The goal in evaluating the patterns generated by LDA is stated in research objective 2, where this dissertation seeks to understand the patterns of similar narrative

69 topics that emerge from global media, and the historical, spatial, and social influences on those patterns relating to two examples of sports with political contexts. Patterns that exhibit useful or actionable semantics will contain individual topics which cross the concepts of politics and sports. Pairs of topics which together span these two domains also contribute to the actionability of the pattern because they facilitate a comparison of the geographic spaces created by their semantic differences. This is explored in greater detail in Chapter 5. The optimal combinations of input which will generate the most useful and actionable results is difficult to predict. Larger numbers of topics mean that the semantic categories are more likely to be separated, so there should be fewer individual topics which transcend these themes. Similarly, a larger vocabulary of terms used to define topics and patterns will increase the range of terms corresponding to any given theme, increasing the complexity of identifying specific themes, so a smaller vocabulary should yield more useful and actionable results.

3.4 Catalonian Independence Sports have been a key source of propaganda for both fledgling and established governments in many cases worldwide. Mussolini used his image as a “sporting hero” (Gordon and London 2006, 45) to promote his fascist regime in , and more recently the Winter Olympics in Sochi were used by Vladimir Putin as a springboard to legitimize his Russian regime (Petersson 2014). The Olympic Games provide many opportunities for defining and displaying a desired image, especially via the opening and closing ceremonies where the host nation creatively shows off their history and key moments. Catalonia’s movement to establish support for independence has seen sport enter prominently into the conversation as well. When Barcelona hosted the Summer Olympics in 1992, it was used to advertise the Catalan identity as separate from Spanish, a sentiment that is today encouraged at FC Barcelona home soccer matches (ESPN FC 2014). Acknowledging the importance of the local soccer club for Barcelona – Catalonia’s political and social capital – the Spanish government has threatened to remove the team from its league competition should the independence motion succeed (Harding 2014). Barcelona and, by extension, Catalonia, use their established sport

70 identity to create international recognition through tourism and successful play, bolstered by heavy international athletes and support. This project examines one particular event in late 2015 which embodies different sides of the geopolitics of sport. This dissertation studies the injection of sports into the politics of the Catalan Independence movement during the late summer and autumn of 2015 and the political identity of sports in Barcelona and Catalonia during the process of the independence referendum vote is examined. 3.4.1 The movement A collection of news media is used to explore the use of sports to further discussion of a ramped-up Catalan independence movement. Digital news articles collected between September 2015 and February 2016 – during which time parliamentary elections demonstrated increased support within Catalonia for leadership to begin a process of establishing an autonomous state - are analyzed for temporal patterns in public sentiment and textual themes for representing the geopolitics of sport present during this movement. Sources include news publishers throughout Spain and Catalonia for binary comparison, and sources from other nations, especially those represented by athletes on FC Barcelona’s roster, which presumably have higher than normal interest in the team’s success. Sources are primarily in English, with Spanish, and sources included as well. FC Barcelona played domestic and international matches during this time, and the Vuelta a España cycling tour also took place in September, providing an important sport context to compare to. Regionally and temporally the narratives on Catalan independence and how sport is used to promote those narratives are expected to vary, with the greatest demonstration of competing narratives occurring between Barcelona and Madrid local sources, showing the competitive and political rivalries between these two regional – and international – capital cities. In November of 2014, the people of the Catalonian region of northern Spain were asked via referendum whether they supported or opposed the creation of a state independent from Spain. The Spanish government has blocked this and previous referendums, despite the growth in support among the people of Catalonia to form an independent state. On September 27th, 2015, parliamentary elections were held in Catalonia, after which major secessionist party Junts Pel Sì (Together for Yes) promised

71 to begin the process of breaking away from Spain (Kassam 2015). Although Junts Pel Sì captured a majority of available seats in this election, the party came 7 seats short of an outright parliamentary majority; today the movement for independence remains active on the minds of politicians and residents. During the time of the referendum vote held in 2014, Spanish politicians threatened the future of FC Barcelona – one of the world’s most well-known and successful soccer teams – and its place the Spanish professional league (Harding 2014). The international standing of the team rests in its ability to compete with other such teams within the Spanish professional league. Such a threat to the professional affiliation of the team (which also applies to the pro-unification team, Espanyol, also based in Barcelona) intends to highlight the competitive, financial, and social consequences of Catalonian independence in a way that influences not only the residents of the region, but globally as well. Sports’ history in Catalonia goes beyond that of soccer. Spain annually hosts the Vuelta a España. In September 2015, the Vuelta contained a stage that began in and ended in in Catalonia. This international competition encourages national allegiances, while the route through rural Catalonia could encourage the involvement of spectators and a timely opportunity for expression of cultural identity. This separatist movement is one of the strongest fueled by ethno-nationalist sentiment today, particularly in Europe where there are multiple movements (Figure 3.3). Because nationalism driven by ethnic affiliation rather than civic duty has existed prior to and separate from colonial organization, some have suggested that ethno-nationalism will increase with transnational migration and global modernity (Appadurai 1996). In Catalonia, ethno-nationalism is currently manifesting in a movement to establish a civic state around the primary acknowledgement that Catalan is an ethnicity distinct from Spanish. This example examines how ethno-nationalism sentiment is portrayed through reporting on domestic soccer where Spanish and Catalan teams compete routinely in major team rivalries, and on the international stage of cycling.

72

Figure 3.3 Separatist regions throughout Europe. Many look toward possible success by Catalonia for precedent and strategies. Image from (Kassam 2015).

3.5 Data Processing This final section explains the data processing and analysis procedures in the general sense that they apply to the Catalonia case study. Each section explains the data capture process, use of natural language processing tools for formatting the articles’ content, and the software packages which enabled data analysis and visualization. 3.5.1 Data Collection Data pertaining to the Catalonian independence movement was collected via online RSS feeds during the fall and winter seasons of 2015. To capture a period of time including the Catalonian parliamentary elections on September 19th, 2015, data collection began on August 21st and ended the process on the final day of November. The three and a half months of news data covers significant background media activity as well as reporting on immediate reactions to the referendum vote.

73 News was collected from publication sources with public RSS links. Despite the limitation that many sources do not have links for RSS capture, a variety of sources with varying thematic foci and spatial coverage was captured, ultimately generating a list of 64 unique RSS feeds. Many publications have unique feeds for specific topics or regions, which are included in the 64, where available. Some of the geographically specific areas represented by the RSS feeds include national sources for Spain covering national matters, national sports-specific sources, regional sources for Catalonia/Barcelona, the Canary Isles, the Mallorca Isles, Basque, Costa Blanca, Andalusia, and several larger communities in and around Catalonia, sources from countries with connections to FC Barcelona through athletes; Brazil and Argentina, and major international sources who have had coverage of and commentary on such parliamentary events. The automatic RSS collector was adapted from STempo (Peuquet et al. 2015) to scan each RSS feed daily and download the HTML for any new articles found. This program, written in Java, creates a new local folder each day, takes each valid response from the RSS feed, and writes the HTML to a new text file inside the daily folder. The program was packed into a runnable jar file and ran the program on the GeoVISTA server for continuous operation. A couple of unexpected interruptions occurred, but restarting the runnable jar within hours prevented the loss of daily news activity. 3.5.2 Text Preprocessing The HTML output from the news collection process contained more formatting and extraneous information related to the web content of the article than was necessary and useful for analyzing the text content of the article. The project used Python package Beautiful Soup for parsing HTML tagged documents (v.4, https://www.crummy.com/software/BeautifulSoup/). In a single document with uniform tags for representing specific information from the article and other special formatting components, Beautiful Soup enables extracting and searching all desired components of an HTML-encoded document. Built-in functions allow for traversing the tag structure, retrieving lists of all instances of a custom tag, and using regular expressions to identify text stored within strings. Beautiful Soup was used to extract the article content text, the publisher source information, and publication dates associated with each article.

74 The challenge increases when attempting to retrieve the same information from several files with no uniform HTML formatting guidelines. Each separate source uses slightly different HTML conventions for storing text and dates. Within the program built for this project to extract this information from my collection of news articles, I used extensive testing to define cases to capture every unique way that sources represented their articles. Some sources use tags such as ‘article_content’ to store the article’s text, but most could be captured by extracting all of the ‘p’ paragraph tags and excluding paragraphs that contained generic formatting strings, including ‘adSpace,’ ‘window,’ ‘headerHeight,’ and ‘commentsWidget.’ The source information from each article was extracted by searching the HTML document for strings relating to the source names. Publication date information is considerably less uniform than article text and sources. Some sources use similar tags to store dates, stored within name and property attributes called ‘datePublished, ’DC.date.created,’ or ‘_date_creation.’ Regular expressions were used to identify the names of these attributes, but simply finding all attributes containing the substring ‘date’ returns many non-date items. Yet other sources’ dates were discoverable using regular expressions to search for more specific ‘datetime’ tags. Several sources still remained without discoverable dates. Finally, remaining sources were examined manually and developed individual cases to extract and format their dates. After extracting the necessary information from each HTML article (no downloaded articles were lost by missing information during this process), they were formatted and combined into a single database. Source-specific information was joined to the extracted information for each article. The database contains nine columns: article text, file name on disk, folder on disk, source, country/region/locality of the source, topic covered by the source, source’s primary language, a numerical scale representing the geographic scale of the source, and the publication date. This database is used as the primary input to the LDA analysis procedure described next. 3.5.3 Data Analysis Extensive comparisons were undertaken to examine multiple implementations of LDA within R to determine their ease of use and compare the utility of their results. These implementations vary slightly in the format and comprehensiveness of the inputs

75 and outputs, and comparing their results leads to some interesting differences. Five procedures are implemented between two primary packages in R, the ‘topicmodels’ package (Grün and Hornik 2011) and the ‘lda’ package (Chang 2015). Despite some degree of topic overlap, some methods did not perform well, with incorrect assignments of documents to topics, topic definitions which lack semantic coherence, and comparatively low average log likelihood statistics. Consistently, the collapsed Gibbs sampler implemented in the ‘lda’ package avoided these issues4. 3.5.3.1 Parameterization The ‘lda’ package accepts standard inputs of documents, alpha, beta, number of iterations, and number of topics. Other options exist that are not required but have their own utility, such as initial assignments of documents to topics (to assign specificity to the alpha parameter), a burn-in factor to remove the first several iterations from the ultimate result, and a flag to compute the log likelihood of each iteration. One thousand iterations were used in this project. This negates the need for a burn-in and assumes an exploratory research design where no prior topic assignments are known. Thus, initial assignments are not given as an input. Since no burn-in was required, this value may be less significant than simply using the log likelihood of the result from the final iteration. But the average captures the generative nature of the LDA process through its random topic assignments in the initial iterations, and comparing the average log likelihood over all patterns ensures that the values sufficiently evaluate the algorithm and process more than just their outcomes. The ‘lda’ package implementation requires a double data type input for the alpha parameter, with little guidance for recommended values. The recommended value of 50/k (Griffiths and Steyvers 2004) is a reasonable starting point, but a more nuanced calculation is necessary for testing the impact if this atheoretical value (alpha is an atheoretical value in that the mathematical basis is not built into any real process with

4 In previous unpublished work with Oak Ridge National Laboratory, I evaluated five text modeling procedures in two packages in R with LDA-like inputs and outputs for known and seeded custom documents, other manually-classified document corpuses, and unknown unclassified data (keeping in mind that Färber and coauthors (2010) advise against evaluating topic models with previously-classified data because unsupervised methods emphasize finding new insights in unexplored data).

76 respect to the data or the expected result) and its effect on the LDA process. An alternate implementation of LDA was used in the ‘topicmodels’ package of R which tests for an optimal value of k during its iterative process, beginning with an initial value of 50/k. From this estimation, the “optimal” alpha value to use with the given parameter of k was found. Each alpha value calculated for the tested k values of 10, 20, 50, and 100 was included in the analysis. This enabled the impact of both higher and lower than expected alpha values given the test k value to also be tested. Finally, the ‘lda’ package implementation requires the input documents in a specific document-term matrix format. This object, consisting of lists of counts of terms which are stored as indices into the complete vocabulary list of terms over all documents, simplifies each document and removes the storage burden of strings and characters within the LDA process. To generate this vocabulary, one final test parameter to define its size was incorporated. As the collection of all terms from every document is constant between models, the term frequency is weighted by its inverse document frequency to define for each term a level of ‘usefulness’ with respect to finding interesting clusters within the corpus. The vocabulary is reduced by defining a minimum tf-idf, above which only the most useful terms have values. Three vocabularies for each combination of alpha and k were tested by specifying minimum tf-idf values close to the literature suggested value of ‘a little under’ the median tf-idf value (Griffiths and Steyvers 2004). This vocabulary and the instances of those terms within documents are used to create a document-term matrix. Where there are packages for generating document-term matrices in R, the ‘lda’ package specifies a list of individual document-term matrices rather than a single three- dimensional matrix. The term frequencies and format the required list for the process were generated. Chapter 4 describes the precise values and results of testing the LDA procedure using the ‘lda’ package of R with the combinations of the alpha, k, and minimum tf-idf parameters described here. 3.5.3.2 Post-Analysis The ‘lda’ package proportionally assigns each of the k topics to each of the input documents, and following that process, the usability of the resulting assignments is improved via several formatting modifications. First, the number of terms which define

77 each topic must be defined for optimal readability and interpretability. Eight terms accomplish this analytical necessity while maintaining readability on the digital display. These terms are stored and printed in both a summary file and a document-level assignment file and are used to evaluate the interestingness of each pattern as well as disambiguate topics into subjective classifications. The summary file primarily assists in topic disambiguation. It consists of the top eight terms for each discovered topic, and a list of the parameters used to generate them. Following each model, a document assignments file is also created, which contains document-level information including the article text, date published, source information, and their proportional assignments to topics. Most documents are assigned with 100% proportion to a single topic, but others may be assigned to many topics with low proportions. Because of the rarity of documents having assignments to over five different topics, only the top five topics for each document and the proportion at which it is assigned to that topic are added to the document assignments file. The summary file is more easily processed to obtain a list of the topics for topic- or pattern-level evaluation, while the document assignments files is used for document-level evaluation and spatial analysis. 3.5.4 Spatial Analysis Location comes into play in several stages of this analysis which facilitates a geographic analysis of the media and its content. Importantly, the three facets of media reporting each have locations which are extracted, mapped, and compared for spatial patterning: the media producer, the audience, and the news content (Rose 2012). Chapter 5 examines the spatial patterns of topics and media across different scales. 3.5.4.1 News Producers The first spatial facet of news reporting is that which is contained within the producer of the news. The producer includes the publishing company and the author and place of publication for individual articles. However, author names and publishing locations inconsistently appear on article text. Without additional information about author history and location, the name of the author provides little spatial information, so to remain consistent across each article and publication for the purposes of this project,

78 the geography of the media producer consists solely of the location of the producing company’s headquarters. This decision is discussed further in Chapter 5. Finding the location of the media producer involves manual scrutiny of the company’s website. For this, street addresses for each of the 35 companies included in the RSS collection list were recorded. Two of those companies listed no physical address on their website, and several others exist in multiple locations, all within Spain. Where available, maps provided by the company of their location on an embedded map server were also collected to aid in geolocating the address to a specific point. ArcGIS, with a road reference basemap provided by Esri through the ArcGIS Online server, was then used to geolocate each address to a digital point. Second, and the most uncertain facet for mapping purposes, is the spatial properties of the audience of each news article. With digital publications, the readership is impossible to define precisely. In paper production, subscribers and their addresses are records, even if inaccessible to researchers. Any person across the world with internet access can access a Catalan-language publication based in Barcelona which produces local content as easily as they can access content from the Guardian, which covers events around the world. 3.5.4.2 News Audiences Thus, this dissertation takes two approaches to understanding the geography of the audience of digital news articles. First, it assumes that the headquarters of the publication company is the center of the intended readership. A publisher is more likely to cover events in its near proximity than it is to cover events in other places. Intended audience is much simpler to estimate than the actual audience, since access to web traffic and ISP information is not publically available information. The intended audience, as determined by the location of the publisher’s headquarters, is augmented with an estimation of the scale of the publication. The second approximation of audience takes a subjective interpretation of the scale of intended readership of a publication with the headquarters listed on its website. Here, three scales are considered: local, national, and international. Local news is indicated by language; Catalan-language news focus entirely within the Catalonia region of Spain, and by their descriptions as servicing local areas. International news is

79 recognized across the world as reporting on topics that are important to a range of localities and international interests. Every other publication source lies somewhere in between a local and international definition is of a national scale, representing the news and information relevant to a single country. In Chapter 5, spatial analysis to compare the scale of publications is performed. The international scale is represented by The Guardian for its popular reporting on global scale events, though it is located primarily in London, UK. To compare equal numbers of articles between each classification, a sample size of 1000 articles is chosen. The Guardian, with nearly 10,000 total articles, is sampled at a rate of approximately ten percent to have an equal sample of 1000 documents to compare to the local scale news. National scale news has so much variety topically and spatially that only the local and international news are compared spatially. 3.5.4.3 News Content Mapping the content of the news articles involves a multi-step process that must be performed on each article individually. This project maps the content of both the publications’ scale as previously described, to examine the variation of spatial content with audience, as well as the spatial content of interesting and actionable topics with specific semantic themes. Actionable topics are identified in chapter 4 as responding to the research objectives of finding subsets of news which merge sports and geopolitics into a cohesive theme. Cohesive themes should consist of geographic spaces corresponding to specific geopolitical contexts within the Catalonian independence movement. The process of mapping the content of individual news articles begins by identifying mentions of geographic placenames in the text of the article. The process, which also extracts other proper nouns like people and organizations, is known as Named Entity Recognition, or NER (Cunningham 2002). This project utilizes a free-to-use service provided by Penn State’s GeoVISTA lab, GeoTxt (geotxt.org) (Karimzadeh et al. 2013). GeoTxt takes an unstructured piece of text, finds all instances of geographic placenames, and places a point on an embedded map server at the corresponding location. NER options include the Stanford CoreNLP engine (Manning et al. 2014) and GATE ANNIE (Cunningham 2002). Because even with contextual cues within the text, multiple

80 candidate locations may be produced by a single placename, several options exist in GeoTxt to refine a location if the correct candidate was not chosen to be placed on the map. To achieve the most accurate results, this option is utilized when necessary having read each article for contextual and other spatial clues about the geography of the article content. GeoTxt concludes by returning a GeoJSON formatted file containing information about each of the places identified in the text, their geographic scale defined by country, province, city, or other location (which includes topographical features, businesses, and airports among many others), and their locations within the text. After downloading the GeoJSON file, it must then be converted to a shapefile for analysis n ArcGIS. For this, the free online service Mapshaper (mapshaper.org) is used. Whereas GeoTxt takes one text document at a time, Mapshaper accepts any number of GeoJSON inputs and converts each one to a shapefile. All individual shapefiles are then combined using ArcGIS’s Merge tool. The geographic content of the news measures the spatial context of the event on which it is reporting. This context is separate from that of the publisher and the audience. It is important to note that the article content is related to but separate from the intended audience. An article’s content is more likely to resemble the intended audience because a publication will focus on producing content that is most relevant to its customers. However, mapping each mention of a spatial entity within a news article explores the connections between places when they appear together in the context of the article’s theme. 3.5.4.4 Spatial Analysis With lists of placenames related to article groupings based on media’s spatial properties and semantic spaces, two processes are used to analyze global connections and specific geographies. Many methods which describe spatial distributions, such as spatial autocorrelation, work better on local scale data than global (Getis and Ord 1992) because of the tendency of social factors to cluster at small scales as a result of natural and artificial boundaries. Here, global distributions are measured as deviations from a regular distance decay function with its origin at the center of the Catalonian independence movement; Barcelona. The largest proportion of placenames should show little distance decay, since the immediate vicinity around Barcelona is the greatest effected by the

81 movement. Next, Madrid and other localities throughout Spain are directly related and have local impacts on and as a result of the move toward independence. Expanding beyond Spain, many locations are related to the movement for social, media, and economic regions, and those places are expected to be mentioned in news about Catalonia more frequently than places which are not related, regardless of distance from Barcelona. This social distance from the center of the movement is measured in two ways. First, the two-dimensional distance radius with origin at Barcelona is collapsed to one dimension in a smoothed histogram of the density of the number of times a place is mentioned as a function of its distance from the origin. A regular distance decay function will have a peak at the origin and decrease in density with increasing distance. In this project, the rate of decrease will be great. However, the interesting points of this method will be in finding unexpected peaks in the distance decay plot. For example, see Figure 3.4, which compares the distance decay functions of three separate topics (this figure and its content is explored in greater detail in Chapter 5). Interruptions to the regular distance decay function indicate locations with unexpectedly high placename mentions with respect to the location’s distance from the spatial center of the Catalonian independence movement. Since the figure plots densities of place mentions over spatial distance, the places which correspond to interesting observable peaks are labeled.

Figure 3.4 Density plots of three different topics pertaining to the Catalonian independence movement solely (red), the Scottish independence referendum (blue), and the Catalonian independence movement mixed with soccer themes (green). The x-axis of distance has origin zero located in Barcelona

82 The places identified by disruptions to the distance decay function are examined for their relationships with Barcelona and the Catalonian independence movement. The fact that many locations with similar populations and international significance are relatively underrepresented in the distance decay plots generates a network of related places with links in the movement. These places and their networked links are discussed in Chapter 5 in terms of economic links, athletic links, and media links.

3.6 Summary This chapter discussed the implementations of analyses forthcoming in Chapters 4 and 5. The process for performing Latent Dirichlet Allocation was described, particularly as it is affected by the parameters tested with sensitivity testing in Chapter 4. The sensitivity tests are executed for each of the seven interestingness measures discussed here. Interestingness is a rarely used concept in GIScience, which is surprising given efforts to resolve many issues with data-driven research. Each of the interestingness measures’ implementations, discussed here, utilize the terms used to define topics and the proportional assignments of each article to each topic. Although defined specifically to evaluate the results given by LDA, these measures are implemented in ways that can be generalized to contribute to evaluating the results of data-driven methods in text analysis or in other areas. The methods for spatial analysis and comparing the geographies of media and topics was also described. The methods described here and the guiding methodological principles of data-driven research are implemented in the following two empirical chapters.

83

4. Chapter 4:

Evaluation

“The whole point in performing unsupervised methods in data mining is to find previously unknown knowledge”

-Ines Färber, et al 2010, On Using Class-Labels in Evaluation of Clustering

84 4.1 Evaluation ‘[W]ith enough data, the numbers speak for themselves,’ (Anderson 2008). Even as this contention received significant critique since its publication in Wired Magazine in 2008 (boyd and Crawford 2012, Dalton, Taylor, and Thatcher 2016), many processes still wrongfully allow the evaluation process to, indeed, let the data and models ‘speak for themselves.’ This chapter calls for a more nuanced set of evaluative techniques meant to guide the parameterizations of data-driven analyses. Assessing the success of data-driven techniques is an involved process. Success can be defined in many ways which depend on the data, the decisions made by the evaluator, and the function of the model used to produce a result. The process of model validation addresses the latter by confirming that an analysis process addresses its provided task. Big data applications, where exploration takes precedence over confirmation of known findings, are not particularly well-suited to validation due to missing or incongruous assumptions, especially when using unsupervised methods (Färber et al. 2010). An evaluation process, however, addresses all three measures of model success by defining the important, or interesting, components of success, and ensuring that the model contains the correct components to produce the most desirable result. Desired results are defined via interestingness measures; data mining concepts specifying important objective and subjective properties of patterns computed from big data. Throughout this chapter, the results from a single data mining procedure are referred to as a ‘pattern.’ This is consistent with KDD definitions, though slightly different from social science and even further from a mathematical definition. A mathematical definition of a pattern has regularity and recurrence which can be extracted from a larger collection of more random events. In geography, a pattern may or may not have temporal regularity, but is still separable from events occurring ‘in the background’ and can be explained by real-life processes. In KDD, the results from an analysis process are, in themselves, a pattern. Since the LDA analysis process is a dimension reduction method, generating unique cluster organizations of the input text, this definition is not dissimilar to the geographic definition of a pattern. The primary difference being that any

85 result from LDA is a pattern, and the subsequent task is to determing which are most interesting and useful for generating new insights about the data. This chapter tests several ways of evaluating data-driven analysis through the use of interestingness measures from knowledge discovery in databases. It performs a model sensitivity analysis on LDA, tracking the changes to the generated topics resulting from different parameter combinations. Although each combination of parameters generates a valid topic pattern over all documents, significant differences in topic definition and document assignment occur with varying parameters. Specifically, three parameters which significantly impact the LDA procedure are tested here: the alpha prior distribution for determining the tendency of documents to be assigned to more than one topic, the number of topics k, and the minimum term frequency (tf-idf)– inverse document frequency for determining the size of the vocabulary. Through testing 48 different models generated by varying each parameter at specific values, this chapter demonstrates the utility of nuanced evaluation using a combination of objective and subjective measures. This chapter first explores issues with the validation process in exploratory big data analysis, then describes an evaluation technique based on model sensitivity analysis and interestingness measures. It then demonstrates the process on the results generated by LDA on news data collected during the Catalonian independence movement. Finally, it illustrates connections and contributions to geographic data science specifically to data- driven methods which continue to have spatial application. 4.1.1 Issues in validation of exploratory techniques One of the most important distinctions in big data and data-driven analysis is that of the difference between confirmatory and exploratory techniques. Exploratory techniques, such as visual analytics (Cheng et al. 2013), computational pattern detection (Peuquet et al. 2015), unsupervised clustering (Steiger, Resch, and Zipf 2016) among many others make few assumptions about the arrangement and patterning of input data. These techniques become more necessary when data volume makes any prior assumptions about data relationships difficult to make without prior exploration. Unsupervised methods perhaps offer the best example of exploratory methods, where the user provides no direction suggesting pattern structure or relationships among the data, allowing them to emerge without predisposed supervision.

86 Färber et al consider the inherent exploratory nature of unsupervised methods through the process of validation (Färber et al. 2010). More accurately, they discuss the inability to validate methods which generate results that are not expected (unexpected in the sense that the knowledge was previously unknown, rather than the interpretation where results contradict existing knowledge. This distinction is important in knowledge discovery from databases). Since exploratory analysis are intended to generate unknown results, there exists no ‘gold standard’ reality to which a result can be compared for validity. To exploit the exploratory nature of unsupervised methods, the evaluative process following discovery of these patterns much also be free of user assumptions. Figure 4.1, adapted from Färber et al (2010), demonstrates the multiple potential clustering schemes depending on the latent variables of shape and color and of combinations of these variables when the number of clusters increases. Both color and shape are legitimate factors defining clusters in this set of observations, so validation based on either factor could yield insufficient assessments of model success. Latent variables are, by definition, unknown at the outset of analysis, revealed as important factors by the analysis process. Evaluating latent variable models using known information recreates existing knowledge by confirming the presence of expected variables, and may underemphasize a result which reveals information that was previously unknown.

Figure 4.1 Two alternative clusterings for eleven objects with the salient properties of shape and color, adapted from Färber et al (2010).

87 All measures of model validation require the presence of a single ‘gold standard’ result which can be compared against model-generated results for success. This gold standard is generated from more than the salient latent factors assigned to data manually by discerning users. Unsupervised learning methods contain several properties which necessarily reduce their ability to generate consistent results on subsequent attempts, even while these results are valid clusterings on latent factors present in the same data. These factors include the functionally infinite parameterizations that come from processes which require multiple ordinal inputs, generative components guiding Bayesian sampling methods, and the potential assignments of observations into multiple clusterings proportionally. In processes which require multiple input parameters reflecting properties of the observed data and desired format of results, little theory exists guiding appropriate values to use. This is, of course, acknowledging that many variables depend strongly on the nature of the input data; its volume, is dispersion, and its arrangement. In the example of inverse-distance weighting (IDW) for the purpose of spatially interpolating values between known points, multiple numerical inputs are necessary to define the appropriate function of the method (Watson and Philip 1985). The power of the function is necessary to define how rapidly the similarity function decays away from a tested point, defining how much weight is given to surrounding points versus more distance ones. The power value is not based on any known processes which actually represent real phenomena. Inappropriate values, especially ones that are too large, will result in incorrect results. IDW also requires a defined restriction on the number of neighbors that should be considered in the similarity function. Options exist for this parameter, including restrictions based on number of neighbors and fixed radius, both of which can be estimated and tested, but could have variable impacts on the interpolated result. The lack of a priori knowledge about the exact optimal values for parameters means that the only ways to determine their appropriate values for IDW are parameter testing and general rules of thumb. Generative components in models often contribute to the erosion of trust in output since results may not be consistent between subsequent model instances. In processes which require sampling, iteration, and convergence toward a truer result, randomization

88 is critical to the performance of the model. Generative models enable testing of unobserved variables by generating the observed variables as if they were dependent on the observed ones. By chaining together iterations of latent variable estimation and resulting generation of observations, a generative model converges on a single optimal arrangement of observed and unobserved information, such as in a Markov Chain process (Alvarez-Garcia et al. 2010). Here, randomization is necessary to prevent overfitting. Many clustering techniques allow for observations to exist in multiple clusters simultaneously. This multiple assignment feature with the majority proportion being the ‘primary’ cluster when one is needed to delineate clusters depending on the application. In a hypothetical validation environment where optimal clusterings are known and compared against model output, known values commonly have only singular classifications. Some models, like the Correlated Topics Model which is based on exploiting the proportional assignments of Latent Dirichlet Allocation to determine topic similarity (Blei and Lafferty 2007), actively search for proportional assignments, but most applications demand a single cluster assignment to model latent variables. The evaluation method in this dissertation utilizes of these factors when measuring the success of model outputs. The model sensitivity analysis discussed in section 4.2 embraces the variability caused by parameter combinations, considers the semantics of topic definitions to compare similarity in thematic identification rather than through particular common terms, and uses the proportional assignments generated by LDA to evaluate the extent to which the process can consistently separate classes and find useful combinations of distinct topics. 4.1.2 Interestingness Approach In the absence of a known gold standard, data driven clustering methods turn to likelihood statistics to evaluate the representativeness of missing variables and latent clusters. Log likelihood is one such statistic, and a particular method of calculating log likelihood – expectation maximization (Do and Batzoglou 2008) – is popular in generative latent variable models because the iterative procedures of sampling and refining are compatible with one another. But as explained in the previous chapter, expectation maximization has tendencies to produce sub-optimal results in the form of local maxima and overfitting as the number of clusters increases toward the n number of

89 observations. This project uses interestingness measures to overcome these issues with likelihood statistics. Many interestingness measures are common in KDD literature. Individually, interestingness measures are useful for particular purposes, as described in this chapter. Rarely, however, are each of the interestingness measures here utilized in the same research to explore how a pattern may be interesting in one sense, and uninteresting in another. Interestingness is even more of a rare concept in geographic research. Several studies have suggested the use of one or more subjects of interestingness (Laube and Purves 2006, Miller 2010, Miller and Goodchild 2014), but few implementations of interestingness exist in geography. This project contributes to geographic literature by reconsidering evaluation of data-driven research in terms of their interestingness, and by the implementations of each measure to explore their effect on the parameterizations of LDA. As described in Chapter three, each of the interestingness measures described in the KDD literature evaluate different aspects of the cluster generation process. Each measure specifies a destinct aspect for evaluating the success and usefulness of the topic definitions and document assignments toward evaluating the LDA procedure.

4.2 Model Sensitivity Analysis To test the interestingness of LDA’s topic definitions, the ‘evaludation’ approach of Augusiak, Van den Brink, and Grimm (2014) is utilized. Central to this approach of combining validation with evaluation is the process of model sensitivity analysis (Saltelli, Tarantola, and Campolongo 2000), which tests the impact of parameter values on the resulting patterns. Model sensitivity is measured in two ways; by keeping pairs of model parameters constant and allowing the third to vary by a consistent interval, and by comparing combinations of two parameters as they vary with respect to one another. The former method simply demonstrates the variance of a single parameter on any individual measure of interestingness by isolating its effect. The latter demonstrates the co- dependence of some parameters, such as k and alpha, where the optimal alpha prior distribution could depend on the number of topics specified. This chapter illustrates both methods in Section 4.3.

90 4.2.1 k – number of topics There is some disagreement about what the optimal number of topics should be specified in a process utilizing Latent Dirichlet Allocation. Too low of a value for k generates broad topics which have clear semantic themes, but match up well with general expectations from the data. Too large a k is quite unrealistic, where topics structures may artificially be placed on data that contains significantly less variation, even while topic definitions tend to merge semantically. The only recommendation given in the literature seems to suggest that larger k is a more desirable choice than a smaller k. In a pattern with greater topics, select individual topics tend to diverge thematically from one another to a greater degree than when using smaller k. Researchers demonstrate this fact by choosing the seven most clearly separate topics from a model generated with a k of 500 (Blei 2012). However, while this method demonstrates LDA’s ability to separate semantic structures, a pattern’s diversity is only one measure of its interestingness, so several values of k are tested here. A k value of 500 is not tested, which is unrealistically large and creates a burden on subjective evaluation, values for k are selected which span both low and high ends of expected and reasonable numbers of topics. See Table 4.1 listing each of the k values tested.

K TEST DESCRIPTION VALUE 10 small value, broad topics 25 mid value, some topic overlap 50 mid-large value, some topic overlap 100 large value, much topic overlap

Table 4.1 List of the unique values for k – the number of topics – in tested LDA models, along with a general description of the expected semantic overlap of that value among each of the k topics.

4.2.2 Alpha Following from Grün and Hornik’s analysis of appropriate alpha test values provided a constant k described in Chapter 2 (Grün and Hornik 2011), several appropriate alpha values are generated and tested given k. The effect of alpha on model results was tested by varying alpha and keeping all other parameters constant in subsequent runs of

91 the process. The values in Table 4.2 which are used to test the model were generated by the ‘topicmodels’ package in R with the topics indicated, then used as input to the ‘lda’ package implementation that was identified as providing the best results from the LDA process. Each of these values for alpha with each k value, not just the value that was used to generate the appropriate alpha are used. This method generates realistic alpha values to test, rather than random numerical selection.

ALPHA NUMBER OF TEST TOPICS USED TO DESCRIPTION VALUE GENERATE 0.006 10 small value, greatest topics per document 0.011 25 small value, high topics per document 0.016 50 mid value, lower topics per document 0.029 100 high value, least topics per document

Table 4.2 List of the unique values for alpha used in this dissertation. Each value for alpha was generated using estimations from the ‘topicmodels’ package using the indicated number of topics in the estimation.

Alpha has a measurable effect on the distribution of topics per document, and thus directly impacts the measurement of several interestingness measures. The larger the alpha parameter value, the steeper the slope of the convex Dirichlet distribution from the center of the simplex toward each of the edges, where individual topics lie. A small alpha which is less than one creates a concave distribution surface, making individual documents more likely to be generated by distributions of a single topic. Conversely, fewer topics lead to a larger alpha, and greater probability that any given document was generated by each topic at the same proportion. This is important with respect to measuring interestingness as it pertains to topic diversity of documents and patterns. Diversity measures the extent to which a topic is homogeneous with respect to the documents assigned to it, which is precisely what a pattern of high diversity attempts to maximize. The effect of alpha on diversity is measurable and consistent as demonstrated in Section 4.3.4. 4.2.3 Term frequency – inverse document frequency Finally, the impact of variable vocabularies on the definition and interpretability of patterns is measured. The vocabulary consists of all unique terms from all documents

92 in the corpus, from which terms are randomly selected during the LDA process to represent topics and compared to the actual distribution of that particular term in the document. Because the topics and selected terms ultimately converge on the observed distribution of terms within documents, small variations in the size of the vocabulary means very little to the semantic definitions of topics. However, the terms in the vocabulary should be confined to include only words which contribute to the identification of separate themes among subsets of the input documents. For this reason, text analysis procedures often make common vocabulary reduction steps, such as removing stop words. Additional vocabulary reduction steps are taken to further specify the vocabulary to include only the most useful terms for identifying thematic subsets of documents from the given corpus. The term frequency – inverse document frequency (tf-idf) metric directly measures each term’s importance to the separation of clusters of documents from the rest of the document collection. Although many use tf-idf independently of other text clustering methods to model text data, it is only used in a pre-analysis step defining the vocabulary of terms by specifying a minimum tf-idf cutoff, where terms with a greater tf- idf than the minimum are included in the vocabulary. To define appropriate values to test for minimum tf-idf, recommendations from the literature suggested minimum value of ‘a little under’ the median tf-idf value across all terms (Griffiths and Steyvers 2004). Subjectively, the value of 0.001 is used to represent ‘a little’ and modify the calculated median by subtracting. The effect of variable vocabulary sizes is tested by specifying two new minimum tf-idf values at equal intervals above and below the median value by using a larger modifier to the median. Based on ranges of tf-idf values in test scenarios, 0.025 is determined to be a suitable modifier above and below the median tf-idf value to represent realistic vocabulary variations. The three minimum tf-idf values used for articles in English are provided in Table 4.3.

93

MINIMUM VOCABULARY TF-IDF DESCRIPTION REDUCTION VALUE 0.030 18.43% small value, greater proportion of terms above the minimum 0.054 49.3% median value, half of terms excluded from vocabulary 0.080 70.44% high value, large reduction of terms from vocabulary

Table 4.3 List of the three values for the tested minimum tf-idf value with their accompanying reduction in vocabulary size. Values were obtained using modifications on the median tf-idf value of 0.055.

The vocabulary which is used to generate topics and assign documents to them has a marked impact on the aspects of the patterns which interestingness seeks to measure. This is specifically because of the impact that the minimum tf-idf parameter has on the vocabulary. As minimum tf-idf decreases, the size of the vocabulary increases, which also increases the uniqueness of the terms. Uniqueness is particularly important for the subjective measures, which seek to find information which is not shared among patterns.

4.3 Results This section contains results from the model sensitivity analysis conducted with the above parameters on the Latent Dirichlet Model. The combination of parameter values listed in Tables 4.1, 4.2, and 4.3 yield 48 separate models which vary by at least one parameter. All other necessary additions to the LDA implementation used from the ‘lda’ package of R (Chang 2015) – the beta parameter (0.1), the number of iterations (100), and the seed for recreation purposes – are kept stationary. In each subsection that follows, the 48 models are evaluated via each of the interestingness measures described in Geng and Hamilton (2006) and listed with their descriptions in Table 2.1. The interestingness measures are separated into the following subsections to fully evaluate the effect of varying parameter values on the isolated measure.

94 4.3.1 Conciseness As discussed in previous chapters, conciseness refers to the size of the result set, where fewer attribute-value pairs in the result represent a more concise pattern (Geng and Hamilton 2006). A more concise result more easily fits in to the existing knowledge base of a user, making it more interesting for its simplicity given the new information that it contains. In the LDA model, which utilizes each coadable piece of input data, attribute- value pairs correspond simply to the generated topics and their document membership. 4.3.1.1 Sensitivity testing Among the 48 tested models in this sensitivity analysis, the subset that varies with k are the only models of interest to compare for conciseness. Quite simply, lower k values generate more concise patterns. Obviously, this relationship exists only to an extent. The more concise model is one that contains a minimum number of attribute-value pairs. Such a model – a one-topic model – may be concise and therefore interesting, but it does not accomplish the goal of the model, which is to provide a useful summary of clusters within the input documents. In calculations of a minimal set of values, which are used to measure conciseness (Padmanabhan and Tuzhilin 2000), the informativeness of the pattern is also taken into account alongside its simplicity. In order to isolate individual interestingness measures, the minimal set of values for any model is not calcuated. Instead, conciseness serves as a philosophy for comparing multiple models that are otherwise similarly interesting. In particular, the four values of k selected exhibit varying degrees of conciseness with respect to one another. A 25-topic model is concise compared to a 50-topic model, which is concise compared to a 100-topic model. A 50-topic model may be ‘twice’ as concise as a 100-topic model, given that a 1-topic model is optimally concise, but no single value can be assigned to a pattern representing its conciseness isolated from other patterns. Similarly, since no application could make use of an optimally concise model, that value would mean little in isolation anyway. 4.3.1.2 Recommendation Evaluating the particular parameter combinations tested, it is concluded that a more concise model will result from a choice of a smaller k parameter. The number of topics has a large effect on other interestingness measures, so it seems unlikely that

95 conciseness would dictate the choice of k more than other factors. From a purely evaluative standpoint, conciseness should be used to determine a simpler pattern between two patterns that are otherwise equally interesting. Although some modelers use high numbers of topics to define small subsets of input data (Blei 2012), the more concise, simpler model provides benefits in the form of clearer topic separation, ease of finding particular clusters of interest, and overall model clarity, models with fewer topics are more desirable than similar models with more topics. 4.3.2 Generality/Coverage Generality, or coverage, refer to the extent to which a model uses a maximum amount of the input data to generate patterns. A more general model classifies a greater proportion of the input observations into the resulting pattern than a more specific model. Generality usually refers to classification rules (Webb and Brain 2006), so to specify the evaluative nature of the measure, this project uses coverage to represent the number of input records covered by the output pattern. Coverage is obviously useful in machine learning applications; if some data cannot be modeled by the process, then the pattern and its insights will be incomplete. Outlier detection presents a possible outlier to the benefit of full coverage, though it is unclear what outliers would be analytically interesting in this way, except to define which data does not fit a model’s assumptions. 4.3.2.1 Sensitivity testing LDA will classify every document which consists of terms above the minimum tf- idf, so the only parameter which impacts the coverage of the model is the minimum tf-idf value. To understand the effect of minimum tf-idf value on the size of the vocabulary and therefore the model’s coverage, first we should observe the tf-idf values for each term across all documents prior to vocabulary reduction. Figure 4.2 shows the number of terms for each calculated tf-idf value, with the three tested values for minimum tf-idf specified.

96

Figure 4.2 Histogram showing the highly left-skewed term frequency-inverse document frequency of terms across all documents. The three test values for minimum tf-idf are shown by dashed vertical lines, with the median in red.

Clearly, the tf-idf values are not normally distributed among all terms across all of the documents. The distribution is highly skewed to the left, meaning that most terms have very low tf-idf values, indicating that very common, evenly distributed terms or very rare terms dominate the vocabulary. Reducing the vocabulary with a minimum tf-idf removes these terms which are less helpful for establishing useful document subsets, and reduces the computational burden of searching the large vocabulary. The minimum tf-idf values that do not appear to have much impact on the number and nature of the terms in the vocabulary, but this choice does impact ways that individual documents are classified and even whether they are included in the process at all. These variations in vocabulary size translate to similarly variable coverages among models. The coverage of a model through the percent of the total input documents (21,688) which were classified by the given model are measured. A higher percent of classified documents represents a higher coverage with respect to the choice of minimum tf-idf. The reduction in vocabulary as a result of the minimum tf-idf selection along with the reduction in coverage as a result of the reduced vocabulary, are given for each of the three minimum tf-idf values in Table 4.4.

97 Min Total Vocabulary Vocabulary Total Classified Coverage Generality tf-idf terms size reduction documents documents reduction 0.3 24267 19794 18.43 % 21,688 21528 99.26 % 0.74 % 0.54 24267 12304 49.3 % 21,688 21484 99.06 % 0.94 % 0.8 24267 7173 70.44 % 21,688 20070 92.54 % 7.46 %

Table 4.4 List of the effects of specifying each of three minimum tf-idf values on the reduction in vocabulary size (number of term) and generality (number of input documents able to be classified).

These reductions in the number of documents may not appear to be a problem. However, the difference between a nearly one percent reduction and a seven-and-a-half percent reduction in total codable documents represents a significant loss of information. The tradeoff between allowing more terms that are less informative for defining subsets and incorporating more of the input documents is a significant one. With a minimum significantly higher than the median tf-idf value, such as the tested value of 0.8 here, the vocabulary reduction is prohibitively significant to produce a model with very low generality. However, due to the skewness of the distribution of terms toward lower tf-idf values, documents are increasingly likely to be excluded as the minimum tf-idf threshold is increased, resulting in a larger reduction in generality with increased minimum tf-idf beyond the median value. Some reduction in generality is expected when dealing with large numbers of observations. The reduction is not random; it is based on the calculated benefit of each term toward defining separate subsets of the inputs. The same documents that are excluded from a pattern generated by a minimum tf-idf of 0.3 are also excluded from patterns from each of the other values. Those excluded documents do not contribute terms toward defining topics, nor are they likely to contain other interesting information since the terms individually had low tf-idf values. Although some reduction is expected, only those with the lowest interestingness should be excluded, so the large reduction in generality that results from high minimum tf-idf values is not desirable. An additional tradeoff between coverage and computational efficiency is necessary to consider in the specification of a minimum tf-idf. An increasing vocabulary size also increases the necessary computation involved in drawing random terms from the multinomial distribution of the term vocabulary. This random selection occurs frequently during LDA’s iterative selection step, and so even negligible differences in the efficiency

98 of using a particular distribution can add up over many iterations. However, this impact is not considered detrimental given other procedures which more greatly impact computational performance (number of iterations, documents, and topics, for example) and the relative impact that small changes in the minimum tf-idf have on pattern interestingness, especially coverage. 4.3.2.2 Recommendation The literature suggest value of ‘a little less’ than the median tf-idf value over all terms is a reasonable threshold to use for specifying a vocabulary of informative terms (Griffiths and Steyvers 2004). The general pattern of higher minimum tf-idf generating patterns with lower coverage not only holds, but the effect exponentially applies to a detrimental effect on the LDA process. Lower minimum tf-idf values may be used to cover a greater proportion of the input data, as the most commonly-used and frequent terms are more likely to be used when a lower minimum tf-idf is speccified. The reduced tf-idf may also incorporate rare terms – jargon – which more uniquely define specific clusters than a term that has a high tf-idf based on usage but by itself does not reveal any interesting information about the semantic structure of the cluster. For the purpose of evaluating a pattern for its coverage, an approximately median minimum tf-idf reduces the vocabulary sufficiently for computational and semantic purposes, while maintaining a list of terms that is broadly defined so as to distinguish between many separate topics. 4.3.2.3 Geographic significance Generality also has an impact on the geographic properties of the model. Spatially and contextually, greater coverage yields a model which better encompasses the range of factors represented in the original data. News media already chooses which narratives to present and which ones to leave out, so capturing as much of the different representations of a particular event is the best possible strategy for understanding that event. This project’s inevitable removal of media perspectives through a suboptimal generality/coverage is also systematic based on the structure of the text and the terms used in each publication source. This presents an exclusion in line with Kwan’s concerns over mapping underrepresented populations in GIS (Kwan 2002b). All three perspectives presented in news articles – the publishers, the subjects of the articles, and the intended audience – may contribute to specific writing styles and vocabularies, which creates a

99 systematic means of excluding classes of individuals and their experiences. Kwan uses targeted data collection and mixed media to develop a narrative of perspectives that may not be collected automatically (Kwan 2008). Thus, coverage/generality is important to maximize, or risk the exclusion of media perspectives. Though big data theoretically covers a larger, more representative sample of the relevant perspectives, this systematic exclusion is an issue that generality measures and tries to minimize, regardless of the amount of data. 4.3.3 Peculiarity Peculiarity measures the divergence between a given pattern and every other pattern, to establish the uniqueness of the themes which it separates from one another. Peculiarity is in this way similar to the diversity and unexpectedness measures, all of which attempt to measure the maximal separation of topics and patterns from one another. LDA provides little analytical potential if it cannot separate the unique themes within the input corpus, so maximizing the separation of each topic is paramount. Peculiarity is measured via the quantifiable difference among the defintions of each pattern, given by the specific terms unique to each pattern. As a function of the number of terms which are unique, peculiarity is the inverse of the proportion of terms which are shared between a pair of patterns, forming a distance between them. However, since patterns which contain different values of k also have a different number of associated terms, peculiarity is not calculated as a proportion of the total terms which are shared, but as a proportion of the number of terms in the smaller of the patterns (lower value of k) which are shared between them, ensuring that the proportion is limited to a range of zero (no shared terms) to one (all terms in the pattern with lower k also appear in the other pattern). The mean of all proportions associated with each pattern is then taken after calculating the proportion for each pair of patterns. The inverse of this mean represents the distance between each pattern and all other patterns, such that a value of zero represents a complete thematic overlap between the given pattern and all other patterns, and therefore a lack of peculiarity, while a value of one represents the opposite. 4.3.3.1 Model Sensitivity As just mentioned, peculiarity between patterns is a function of the shared or unshared terms between them. The minimum term frequency-inverse document

100 frequency is the only parameter which has any significant impact on which terms can appear in topic definitions. The impacts of minimum tf-idf, alpha, and the number of topics on peculiarity are explored in Figures 4.3 and 4.4.

(a)

(b)

101 (c)

Figure 4.3. Plots of peculiarity as a function of alpha and number of topics, keeping minimum tf-idf constant. (a) keeps minimum tf-idf constant at 0.03, (b) uses a minimum tf-idf value of 0.05, and (c) uses a value of 0.08, each comparing over the four values of alpha and k.

It appears from Figure 4.3 that alpha has very little impact on the peculiarity of a pattern. The effects of alpha are explored in these three figures via the series shown in shades of red, with higher alpha given in a darker shade. Regardless of the alpha value, in models generated by the same minimum tf-idf, shown in sub-figures (a), (b), and (c), the series across the number of topics share very similar shapes and magnitudes, with the one exception in sub-figure (c). This is to be expected, as the terms used in topic definitions vary most notably by the size of the vocabulary, which dictates the number of terms from which to choose to define topics. Although the highest peculiarity values are generated by the highest minimum tf-idf value (Figure 4.3 (c)), the correlation does not hold through the other values of tf-idf. One pattern is obvious with varying minimum tf-idf. The peak peculiarity score appears at varying models generated by increasing number of topics with increases in tf- idf. The highest peculiarity score for a minimum tf-idf of 0.03 belongs to models (regardless of alpha) with a k of 10 (Figure 4.3 (a)). The same peak for models with minimum tf-idf of 0.05 occurs with k of 25 (Figure 4.3 (b)). Finally, the highest minimum tf-idf of 0.08 coincides with peaks at models with a k of 50 (Figure 4.3 (c)).

102 It thus seems that the expected pattern of higher peculiarity as a result of lower minimum tf-idf only applies when k is also considered. Lower minimum tf-idf creates a larger vocabulary, which can only result in a larger peculiarity with a greater number of possible terms used in topic definitions. This effect appears to be amplified with lower number of topics. Although greater topics and a larger vocabulary generate more unique (or novel, see Section 4.3.6) topics and patterns, the same relationship does not translate to peculiarity, as greater topics means that the greater chance of any term being repeated in multiple topics outweighs the additional chance that topics are also more likely to be semantically very different. With minimal topics, such as the 10-topic model whose peak peculiarity is a result of low minimum tf-idf, terms which are unique to a single topic carry much more weight than they do in a larger-k model. Additionally, the ratio of vocabulary size to total terms used in the topic definitions is small even with large minimum tf-idf values. So the multinomial probability of drawing any term from the vocabulary decreases faster than the number of new terms incorporated into the topics of a larger-k model than one with smaller k. Hence peculiarity is relatively high for larger k when the vocabulary size decreases, or minimum tf-idf increases.

(a)

103 (b)

(c)

104 (d)

Figure 4.4Plots of peculiarity as a function of minimum tf-idf and number of topics, keeping alpha constant. (a) keeps alpha constant at 0.006, (b) uses an alpha value of 0.011, (c) uses a value of 0.0165, and (d) uses a value of 0.029.

The relationship between high and low tf-idf with varying number of topics is also visible in Figure 4.4 where peculiarity is observed with varying tf-idf and topics, while keeping alpha constant. In the four images in Figure 4.4, the alpha parameter is constant and the minimum tf-idf is shown in each of three series with increasing shades of blue. In each of Figure 4.4 (a), (b), and (c), the low minimum tf-idf of 0.03 and the high tf-idf of 0.08 have inverse relationships with peculiarity as a function of the k topics. Contrary to other measures of interestingness, especially coverage and diversity, the near-median minimum tf-idf has almost no advantage to the peculiarity of any model. This is surprising, as the pattern between tf-idf and k is not consistent, but nonetheless carries significance for suggesting particular values of both variables in model selection. 4.3.3.2 Recommendation One clear recommendation from the calculation of peculiarity clearly shows that alpha has hardly any impact on the distance between two patterns via the terms in topic definitions. Some alphas may be more important for specific values of k, but peculiarity does not demonstrate any significant relationship with the alpha parameter. From both Figure 4.3 and 4.4, it seems that recommending optimal parameters to increase the peculiarity of patterns generated by LDA require that the number of topics

105 and the minimum tf-idf be jointly related. Lower numbers of topics require a lower minimum tf-idf to increase the vocabulary size, while higher numbers of topics demand higher minimum tf-idf values. This is still counter-intuitive, as larger values of k require more terms to define topics in order to be different from one another, however the increase in topics with shared terms is overshadowed by unique topics that utilize the additional terms from a larger vocabulary, minimizing the model’s peculiarity. 4.3.3.3 Geographic significance Peculiarity is particularly important in the presence of great complexity, which spatial data exemplifies through multi-dimensionality, autocorrelative structure, interactions in space and time, and the qualitative nature of much geographic information (Mennis and Guo 2009). Among these spatially specific complexities, data volume adds another dimension. Peculiarity maximizes the distance between observed patterns such that the high dimensionality of the data is reduced to clearly defined and interpretable spaces. In this application, peculiarity is measured by the uniqueness of the topics given by the terms that generate them, so attempts to capture the most separate themes that otherwise would be lost. The peculiar semantic spaces generated by LDA also present peculiar geographies. As Chapter 5 demonstrates, the topics discoverable in news articles contain different references to geographic features, showing the link between semantic and geographic spaces in a political context. 4.3.4 Diversity Diversity, as definied by Equation 3.2, measures the variety of concepts contained within a single pattern. Another way to think about it is that a non-diverse pattern, one whose topics contain semantic overlap, is less likely to be of use than a pattern whose topics are distinguishable and separable. The goal of LDA is to separate semantic themes through topic generation, so a pattern without diversity fails to accomplish the basic objective of the method. Here, diversity is measured as the divergence from an equally distributed set of articles to topics. In the case where every article contains the same semantic content, each topic would be identical, regardless of the number of topics, and each article would be assigned evenly to every topic.

106 4.3.4.1 Sensitivity Testing To evaluate the effects of the chosen parameter values on each pattern’s diversity, one parameter value is isolated at a time, and diversity variance is observed across the other two parameters. Figure 4.5 a-c shows the changes in diversity on the Y-axis while holding minimum tf-idf constant at the three tested values and varying values for alpha and the number of topics, along the X-axis. Figure 4.6 a-d does the same with four constant values of alpha.

(a)

(b)

107 (c)

Figure 4.5Plots of diversity variance score (as a percent of the maximum potential variance given by the value of k), as a function of k and alpha.

Several patterns are evident in the diversity data in Figure 4.5, as the series given by the four tested alpha values have difference shapes and magnitudes of diversity. The first clear pattern shows that regardless of the minimum tf-idf value or the number of topics, lower alpha values produce higher diversities. This evidence supports the very definition of alpha’s effect on the Dirichlet distribution and its role in LDA, shown in Figure 3.1. LDA requires a single alpha value, applied to all topics, and so only the top row of Figure 3.1 is necessary to understand the impact that alpha has. Although all Dirichlet distributions with uniform alpha have the highest probability of sampling at the center of the k-topic simplex (representing a selection with even representation among all topics), the slope of the surface away from the center and towards the corners (representing selections with 100% representation at a single topic) increases with increasing alpha. Therefore, the model expects more documents to be generated by a single topic, increasing the diversity over documents with multiple topics, with lower alpha. Surprisingly, the highest overall diversity scores do not come from lower values for minimum tf-idf. Logic suggests that a larger vocabulary will create more diversity among topics because the variety of terms increases, separating topics from one another

108 semantically. Instead, the largest single diversity value is generated by a high minimum tf-idf, low topics, and low alpha (Figure 4.5 (c)), while similarly high diversity values also appear when the minimum tf-idf is near a median value. There seems to be a tradeoff between vocabulary size and number of topics. When fewer terms are used to generate topics, unique themes can still be generated when there are fewer themes to generate. However, as topics increase, more specific topics tend to emerge, which requires a greater vocabulary to accommodate a similar diversity. In fact, the highest diversity values for each of the four values of k, except for the aforementioned peak in Figure 4.5 (c), are generated by low alphas combined with a minimum tf-idf that fits suggested given in the literature (Griffiths and Steyvers 2004).

(a)

109 (b)

(c)

110 (d)

Figure 4.6 Plots of diversity variance score (as a percent of the maximum potential variance given by the value of k), as a function of k and the minimum tf-idf.

Isolating each of the four alpha parameters here and varying the minimum tf-idf shows much the same conclusions as when isolating tf-idf. Higher total values for diversity are generated by lower values for alpha (Figure 4.6, (a) and (b)) with peaks of over 97% of potential diversity appearing for median and high minimum tf-idf values. Median values for tf-idf produce the most consistently diverse models, especially when using lower alpha, forcing the Dirichlet distribution closer to the edges of the topic simplex. A high minimum tf-idf can produce high diversity scores when k is a mid-range value, such as 50. But high minimum tf-idf can also produce variable diversities, including the lowest diversity scores among all combinations of parameters, especially at low values for k (Figure 4.6 (b), (c), and (d)). Strangely, however, when minimum tf-idf is high, and topics and alpha are low, the largest diversity score is produced. This is most likely an anomaly, as high minimum tf-idf values should generally be avoided, maximizing the diversity across all other combinations of parameters. 4.3.4.2 Recommendation Ultimately, diversity is impacted predominantly by the single alpha parameter. Alpha directly controls the number of topics which generate each document, which determines how diverse a document’s assignment can be among the k topics. The patterns

111 in Figure 4.6 show this relationship the best, as low alpha values consistently contain more diverse patterns than any other parameter. Alpha remains a parameter with no ‘correct’ usage, though some values are more appropriate given separate analysis choices, such as the number of topics. Using alpha values tested from the ‘topicmodels’ package, which contains an option to estimate an appropriate alpha given all other parameters, low alpha corresponds better to a low number of topics, which is why diversity tends to decrease as topics increase with low alpha as a reference. A low enough alpha will theoretically generate the same diversity scores as a Latent Semantic Indexing model, which does not allow each document to contain more than one topic. Thus, while achieving higher diversity means using a low alpha, one unique aspect of LDA separates it from LSI via allowing multiple topic assignments. LDA generates not only likely topics for each document (the higher the likelihood, the higher the likely diversity), but allows for correlating topics through each document’s proportional assignment to multiple topics. Low alpha should therefore be a priority for using the LDA process. The Dirichlet distribution assures a sufficient difference between LDA’s multiple topic assignment and LSI’s single topic assignment, such that the unique capability of LDA is maintained and not threatened by theoretically-small values of alpha. This is especially true at small values for k, where large alpha fails to generate diverse results. Additionally, the literature suggests using a minimum tf-idf value close to the median for all terms in all documents to reduce the vocabulary to the most interesting terms (Griffiths and Steyvers 2004). This finding is supported by diversity measures, as vocabularies which are too big (lower minimum tf-idf) produce non-diverse results, and vocabularies which are too small (higher values of minimum tf-idf) do not add additional diversity, while decreasing the possible thematic range of topics by removing specific terms. 4.3.4.3 Geographic significance This definition of diversity matches the expectations laid out in geographic concepts of spatial clustering and autocorrelation. Spatial autocorrelation drives much of the assumptions that geographers test and design theories based on, namely Tobler’s famous First Law of Geography (Tobler 1970). A diverse pattern, which maximizes

112 within-topic similarity, while minimizing between-topic similarity, presents an optimal model to test the assumption that articles classified into the same topic are more likely also to be correlated in space. Less diversity in a pattern generated by LDA contains topics which merge together in semantic space as opposed to a pattern with greater diversity. The same relationship would then assume to occur in geographic space as well. Where the assumption of spatially autocorrelated topics does not hold, as at the borders between locations separated by their ideologies and reactions to current events, the diversity measure is not an ideal indicator, however. 4.3.5 Reliability The reliability interestingness measure attempts to evaluate the repeatability of model results, such that parameter combinations do not generate anomalous results compared to other models. This definition suggests measuring features in the opposite way that peculiarity or novelty do. However, although in concept these measures sound opposite, they do not measure the same features of a pattern, and thus are not inverses of one another. Peculiarity and novelty seek to establish the uniqueness of individual or groups of topics for further scrutiny. Reliability measures trust in the model by tracking the topics which generate each document over all patterns. Less variation in the topics for a single document across all patterns indicates higher reliability. As described in Chapter 1, reliability is computed as a comparison of each pattern’s topic definitions to every other patterns’ definitions to compute the consistencies between them. Greater shared consistency between a given pattern and all other patterns indicates reliability. One hundred documents are selected at random, making sure that each document is coded in each document and not excluded as described in 4.3.2 Generality/Coverage when coverage is suboptimal, and the consistency of their topic assignments are compared across each pattern. Then, using the Jaccard similarity measure (Chandrasekharan and Rajagopalan 1989), specifically calculated as the intersection of two sets over the union of those sets, the reliability of a pattern is established. Jaccard similarity uses text matching, so provides an advantage over the numerical comparisons of cosine similarity for this text-based application.

113 4.3.5.1 Sensitivity Testing Theoretically, each of the parameters examined here have some minor impact on the reliability of a pattern. Alpha determines the likelihood that any given document will be assigned to multiple topics proportionally, so the primary topic of a document with multiple topics could vary depending on alpha’s allowance to do so. But this effect will also depend on the randomness of the algorithm as well as the other parameters. Any individual document is no more likely to see its primary topic change as a result solely of alpha, except in rare situations where two primary topics tie for highest likelihood and sorting changes which of the two is designated the primary topic. Alpha does not impact the reliability in these calculations because a document’s topic assignments is measured using the primary topic to maintain a uniform topic size among all documents. See Figure 4.7 for an example of model reliability results where alpha appears to have no impact on the pattern’s reliability.

Figure 4.7 One example plot of pattern reliability, consisting of 16 patterns with a minimum tf-idf of 0.05, and varying k and alpha.

Through observing changes in the number of topics, k, some variation in reliability is expected. No patterns are observable among variations in alpha across changing number of topics. The same pattern is observable in each of the plots of different minimum tf-idf, so only one plot is shown here as an example. In a pattern

114 definition given through the list of defined topics, greater k generates topics which are more specific thematically than in patterns generated by smaller k. This trend is more observable using novelty to measure the interestingness of a pattern (see section 4.3.6). Thus, in theory, reliability is considered the opposite feature of novelty. In calculations over an entire pattern, however, the effect of k is significantly lessened. While some of the additional topics are novel, many more have semantic overlap with other patterns already existing in the pattern, reducing the overall reliability more than theory would suggest. Still, some evidence exists which shows that lower k models have higher reliability than higher-k models. Figure 4.7 shows a trend toward reducing reliability as the number of topics increase, particularly with the highest reliability values appearing for the lowest k models, having 10 topics. The same conclusion is less evident when comparing over other parameters, such as the example plot in Figure 4.8, which compares minimum tf-idf and k over a constant alpha of 0.011.

Figure 4.8 One example plot of pattern reliability, consisting of 12 patterns with an alpha of 0.011, and varying k and minimum tf-idf. Very little evidence exists to support the claim that greater topics yields greater reliability.

Figure 4.7 and Figure 4.8 show a counterintuitive relationship between reliability and minimum tf-idf. Logically, a larger vocabulary, given by a smaller minimum tf-idf,

115 would create topics defined by a greater variety of terms, making reliability less likely. However, the opposite relationship is clearer. In Figure 4.8, the lower minimum tf-idf produces the highest reliability scores regardless of the number of topics, and the highest minimum tf-idf produces the far lowest reliability. Figure 4.9 shows a similar pattern, with high reliability values in brighter yellow colors. Clusters of similar reliability patterns between pairs of patterns is given by the dendrogram and re-ordered X axis, while the Y axis is sorted by the number of topics, the alpha, and then the minimum tf- idf. Notably, each of the three different values for minimum tf-idf tend to cluster together, with the models using a minimum tf-idf of 0.03 in the left-most cluster, 0.05 in the center of the matrix, and 0.8 in a cluster on the right side. The clusters also appear in the rows of the matrix, since every third row contains the same minimum tf-idf.

116

Figure 4.9 Matrix consisting of the reliability between pairs of patterns, computed by the mean Jaccard similarity of 100 randomly sampled documents’ topic definitions. High reliability between two patterns is shown by brighter yellow cells. The X axis is sorted by observed clusters of similar models, and the Y axis is sorted first by the number of topics, then the alpha, and finally the minimum tf-idf.

Models with the same minimum tf-idf tend to cluster together with respect to reliability, so tf-idf and therefore the size of the vocabulary, have a noticeable impact on reliability. The particular impact, though counterintuitive at the scale of topics, makes sense by focusing on the document level. In LDA, the topics generate each document, so the more unique topics (defined as having more unique terms compared to other topics),

117 the closer those topics will be to approximating the true structure of the document. Thus, a larger vocabulary is more likely to generate a topic which fits the thematic nature of any given document than a smaller vocabulary generated by a higher minimum tf-idf. This causes the lower minimum tf-idf models to describe topics which more reliably fit each of the document regardless of the other parameter values used. In addition, this means that reliability exists in the same pattern where novelty or peculiarity also exist, since uniqueness of topics drives the definitions of each measure. 4.3.5.2 Recommendation Since reliability is not impacted very strongly by any particular parameter, reliability is a form of interestingness that is difficult to give a specific recommendation. Of the three parameters however, minimum tf-idf has the greatest impact on the measure. Particularly, a lower minimum tf-idf increases the vocabulary size, and more accurately represents the detail of input documents, leading to more reliable topic assignments regardless of the values used for k or for alpha. This recommendation is subject to the size and peculiarity of the documents themselves. Given the news articles used as input data here, the documents have significant variation in their language and are longer than most other textual sources used in LDA applications (CITE something that uses abstracts). Thus, when using LDA on shorter, more similar documents, the different in reliability generated as a result of changing the minimum tf-idf would not even be as great as demonstrated here. To some degree, reliability also relies on a smaller set of topics. However, this effect is primarily a result of a lower variation in themes present in the model results at lower k models. Lowering the minimum tf-idf generates a vocabulary which more reliably reflects the input data than reducing the number of topics in the model. The trend in reliability across decreasing values for k is less obvious than the observable trends in lowering the minimum tf-idf. Thus, the most reliable way to increase reliability and consistency among individual document assignments is to reduce the minimum tf-idf. 4.3.6 Novelty Novelty is the first of the subjective interestingness measures which is used to understand and evaluate successful function of the LDA process. Like peculiarity and diversity, novelty seeks to investigate the ability of the model to generate information that

118 is unique in terms of other topics and patterns, as well as in the knowledge of the evaluator. Thus, since such knowledge (and especially the lack thereof (Geng and Hamilton 2006)) is impossible to fully formalize, novelty is evaluated subjectively. Novel patterns must both contain information that cannot be understood via other patterns, and they must contain information that was not previously understood by the evaluator. Finding novel information with respect to other patterns is the simpler of the two to find, since comparison with other patterns is relatively simple, given a maximum set of patterns to observe, and since the knowledge base of an evaluator is generally large. 4.3.6.1 Sensitivity Testing The number of topics chiefly allows for large variations in novelty, as the themes of particular topics change in specificity with changing topics. Many use this knowledge to design their initial model, using much larger values for k than the logical thematic structure of the data would dictate. The 500-topic model chosen by Blei to illustrate LDA and its ability to extract peculiar topics (although this is not the stated goal of the application described by Blei) does not fit the expectation of thematic structure over the sample of news articles that were his input data (Blei 2012). Peculiarity and novelty share the similar goal of describing unique information in the data. The goal of novelty in describing individual unique topics based on the existing knowledge of the evaluator is particularly susceptible to the number of topics, because as topics increase, the same topics are likely to remain in the model output, while additional, more specific topics are introduced. For example, see Table 4.5 and a selection of topics from various models with novel, unique themes varying with k. Each of the topics defined in the table are the most novel topics from not only the pattern, but the set of patterns consisting of the same k parameter. As k increases, topics become more unique, rather than more specific. This indicates that most documents are not placed into topics which reflect their thematic nature in models with lower k, since higher-k models are more likely to extract the more specific topics. As an example, the ‘food recipes’ topic appears in a 25-topic model and in every model with k greater than 25. In the selected topic shown in Table 4.5, the term ‘nadal’ appears, probably referring to Spaniard Rafael Nadal of international fame, though the presence of his name juxtaposes with the food-related topic. In higher-k

119 models, ‘nadal’ does not appear in the same food-related topic, indicating thematic expansion and specification. In the 50- and 100-topic models, the selected topics become more esoteric, referring to a specific incident during a soccer match (Shakira and her partner enduring racist chants), and to specific people in the tangentially-related topic of veterinary medicine.

k alpha tf-idf Topic definition News description glencor, company, volkswagon, emiss, Volkswagon’s emissions, Glencore’s 10 0.029 0.03 arrest, polic, car, commod mining scandals dish, chicken, pan, rice, squid, dice, food recipes 25 0.029 0.08 nadal, meat abus, insult, shakira, pop, fling, peel, Shakira and partner endure racist 50 0.011 0.08 verbal, racist chants at Espanyol soccer match pet, vet, microchip, vaccine, refuge, Veterinary research, Foncubierta and 100 0.011 0.08 fenc, foncubierta, martinez Martinez authors

Table 4.5 Selection of novel patterns from each of the four values of k. The themes described in lower-k models also appear in higher-k models, indicating the novelty increase as topics increase. The text processing procedure stems words, removing them of their conjugations and leaving only the term’s root, as in ‘polic’ for ‘police’ and ‘commod’ for ‘commodity.’

The vocabulary size, indicated by the minimum term frequency-inverse document frequency, also has an impact on the novelty of a pattern. Table 4.6 shows a selection of topics from three patterns varying only by their minimum tf-idf. Each of these topics concerns riders in the Vuelta a España, but with noticeable differences at separate minimum tf-idf values. In the two higher minimum tf-idf models of 0.05 and 0.08, news about the ongoing Vuelta a España was given a single topic (in this 25-topic model. In models with greater k, more topics on the Vuelta emerge). Each of those two topics include terms pertaining specifically to the cycling event – ‘kilometer’ – and the names of prominent riders – Fabio ‘Aru,’ Tom ‘Dumoulin,’ Bert-Jan ‘Lindeman,’ Andy ‘Schleck,’ Chris ‘Froom,’ Esteban ‘Chaves,’ John ‘Degenkolb,’ and Alejandro ‘Valverde.’ In the model with lower minimum tf-idf of 0.03, however, the Vuelta topic expands to two separate topics, one which contains the names of many of the same riders, the other containing only generic references to the race.

120

k = 25, alpha = 0.016

tf-idf Topic definition News description dumoulin, stage, sec, aru, tour, chave, vuelta, Vuelta a España news, rider-centered rider 0.03 race, rider, stage, tour, motorbik, vuelta, Vuelta a España news, generic jersey, driver aru, dumoulin, froom, ec, chave, min, Vuelta a España news, rider-centered 0.05 degenkolb, valverd aru, sec, degenkolb, kilomet, contador, edit, Vuelta a España news, rider-centered 0.08 lindeman, schleck

Table 4.6 Selection of novel patterns from each of the three models generated by the three minimum tf-idf values and a k of 25 and alpha of 0.016.

Thus, the increased vocabulary size achieved by lowering the minimum tf-idf value for terms changes the semantic space by adding more terms which help define semantic themes. At smaller vocabularies, the models could distinguish only one topic pertaining to the Vuelta a España. Only when a greater number of terms were introduced into the vocabulary, was the model able to differentiate news about specific riders from generic news about racing. Term frequency-inverse document frequency has the effect of creating more specific topics with the increased vocabulary size. In Table 4.6, the terms themselves have some variation with the size of vocabulary, indicating that each term’s tf-idf does not solely dictate which terms are useful for identifying separate topics. The tf-idf has been proposed as a method in itself for determining topic structure (Aizawa 2003). If this were sufficient, the minimum tf-idf would have no effect on topic structure, as the terms with the highest tf-idf would always form the basis for topics. Table 4.6 shows that new terms are introduced into topic definitions when the vocabulary increases, even while terms with lower tf-idfs are the ones added to it. LDA’s ability to find coincident terms and establish a thematic structure among documents sets it apart from tf-idf schemes (Blei, Ng, and Jordan 2003). 4.3.6.2 Recommendations Novelty relates to increase the uniqueness of topics present within a pattern. Uniqueness is in reference to both the other patterns (and topics within patterns) as well

121 as to the knowledge of the evaluator. Two parameters influence the thematic nature of topics – the number of topics and the terms that are included in the vocabulary. A greater number of topics increases the specificity of each topic over a smaller-k model. The addition of more topics to a model with all other factors constant generates novel topics beyond the semantic content which is already present, and it introduces sub-topics, or specifies additional detail separating minor themes which appear more generally in a lower-k model. Many researchers use very high-k models to generate as many novel topics as possible, extracting those novel patterns for additional scrutiny (Blei 2012). As topics increase, the number of documents assigned to those topics decreases, which allows greater specification of topics, even as many topics are also likely to share a semantic structure. Novelty seeks to define topics which provide unique thematic results, which is achieved as the number of topics increase. In the examples seen in Figure 4.5, the veterinary research topic consists of very few articles, which would thus be undetected in a model with fewer topics. Additionally, increasing the size of the vocabulary increases the semantic range of terms included in the generation of topics. Increasing the vocabulary size by lowering the minimum tf-idf incorporates common and frequent words as well as rare specific terms. Both of which increase the thematic range of topic definitions, increasing the likelihood of generating novel topics and patterns. Where increasing the number of topics introduces new topics and themes to the results, decreasing the minimum tf-idf can only add more novelty and specification to existing topics by introducing additional terms. In some cases, the additional terms may be enough to generate a new topic with from a subset of an existing one, as Table 4.6 shows. The effect on novelty of increasing the number of topics is greater than the effect of increasing the minimum tf-idf. The most novel topics occur when the total topics is large, allowing both smaller subsets of documents to emerge and for increased separation between topic themes. Even when the number of topics has no theoretical basis (500 topics, as used in (Blei 2012), does not coincide with expectations of unique themes, even in a large collection of news articles), the advantage of such a scheme to the pattern’s novelty is evident in the ability to select individual topics for their novelty with respect to other topics and to other patterns’ topics.

122 4.3.6.3 Geographic significance Novelty follows one of the primary objectives of data mining in general and spatial data mining in particular, which is to find outliers among existing and expected spatial patterns (Shekhar, Zhang, and Huang 2009). Potential outliers may be found in many different ways where expected patterns are not consistent, such as thematically different observations in spatial proximity, observations connected to one another via shared themes yet do not exist in similar geographic spaces, and emergent themes in geographic or semantic spaces. Differences from the expectation of spatio-temporal patterns, majority behaviors, and previously observed themes indicate new relationships between recorded events and their spatio-temporal context. 4.3.7 Unexpectedness/Surprisingness Unexpectedness, or surprisingness, like novelty, requires knowledge of what a user already knows in order to discover and represent what is not known. Unexpectedness is an important theme in knowledge discovery, where contradictory and surprising results could indicate failures of model assumptions, or the presence of polarized themes within the observed data. Both indicate patterns worthy of further scrutiny and provide insight into model performance, so unexpectedness remains important as a model evaluation method and a way of considering the interestingness of particular patterns. Surprisingness is a difficult concept to discover and to confirm that the themes that it represents are of any use. First, data collection is often tailored to the expectations of the research objectives. Much of the potential results contained within all news published during the timeframe of this project have been reduced to the most likely relevant articles by targeting specific news sources and using key words in news article collection. Violations of those expectations are much more likely to be novel, hinting at unknown unknowns, or new emergent themes. To contradict existing knowledge is therefore difficult. 4.3.7.1 Sensitivity Testing Even with the wide net cast to collect news on the general topics of sport, politics, and regional identity and the unique and novel topics that appear as a result, much of the topics are not unexpected or oppose the assumptions carried into this project. unexpected patterns are sought by observing the connections between terms in topic definitions.

123 Much of the evaluation and use of LDA is based on the definitions of topics, rather than the assignments of documents themselves into topics. This is the dimension-reduction and summary technique for making sense of large collections of data. The topic definitions themselves are useful tools for understanding the thematic structure of the data. Although documents themselves may not follow the thematic structure that generates them since topics are a combination of terms drawn from many documents, the documents which comprise each topic may be consulted to compare the representativeness of the topic to the document which ascribes to it. Only one pattern demonstrated a topic which indicates significant unexpectedness, violating a project assumption related to the expected geography. That assumption was that Scotland has been friendly and accepting of refugees from Syria. The particular topic, shown in Table 4.7, suggests the opposite, merging terms that suggest Scotland’s news is hinting at restricting or ‘prohibiting’ refugees from Syria. Topic definitions among all patterns have some degree of randomness, where terms that do not seem immediately relevant to the predominant theme appear in all patterns regardless of parameter values (for example, ‘nadal’ in the food recipes topic in Table 4.5).

Minimum Topics alpha Topic definition News description tf-idf refuge, syria, syrian, prohibit, Scotland feelings toward 25 0.011 0.056 migrant, gordon, scotland, uefa Syrian refugees

Table 4.7 An unexpected pattern, designated as such for the combination of terms which suggest a Scottish anti-Syrian-immigrant policy.

In the case of the topic described in Table 4.7, the replacement of several terms could provide clarity to the contradictory nature of this topic. The presence of ‘prohibit’ in this topic could be reflective of news concerning other countries’ views and policies on refugee intake, especially that of Spain. Replacing either of ‘prohibit’ or ‘Scotland’ could form a more coherent topic, though it is also possible that the news generated by this topic merge related themes with crossover of some subset of terms in the topic definition.

124 Unexpectedness cannot otherwise be rated and compared among models for comparative testing. Contrary to other measures, unexpectedness is binary, where a topic or pattern contradicts existing knowledge or it does not. It is tempting to consider novelty as a lesser degree of surprisingness, but the goals of each method are very separate. No other patterns exhibited anything else suggesting unexpected themes present in this data. 4.3.7.2 Recommendations In the absence of a true sensitivity test of unexpectedness given the measure’s binary nature, little confidence for recommending specific parameters to maximize a pattern’s unexpectedness exist. Unlike novelty, which can be considered on an ordinal scale, unexpectedness is a binary value and so the single data point is insufficient to make a viable recommendation. The theoretical conclusion unexpectedness/surprisingness resulting from the k parameter and the minimum tf-idf may still exist, however. A k that is too low generates very broad topics, while a k that is too large generates specific topics that reuse many of the same terms across each topic. Both scenarios’ patterns do not yield unexpected topics. Similarly, with tf-idf – a smaller vocabulary focusses each topic, while a large vocabulary creates topics which are varied but do not contradict one another. Unexpectedness requires a balance between specificity and generalization to ensure that topics do not just consist of synonymous terms in great detail, but are restricted to the most topically reflective terms based on their coincidence. Thus, median values for k and for minimum tf-idf be combined in models may generate the highest likelihood of unexpected patterns. Nevertheless, a little bit of flexibility in individual beliefs and assumptions and creative interpretation of violations to those assumptions may be necessary to define a pattern as interesting. Unexpectedness attempts to find topics and patterns which are separate from the topics and patterns which the other interestingness measures attempt to maximize. It isn’t less important than the other measures, but a pattern that is unexpected is not less interesting than one that is unexpected and also, for example, diverse. Unexpectedness is motivated by evaluating the representativeness of model assumptions and discovering topics which challenge previously held beliefs, so any model seeking to maximize unexpectedness most likely does so isolated from other interestingness measures. Finding

125 unexpected patterns is as much a factor of luck and the definition of specific beliefs as it is of parameterization. 4.3.7.3 Geographic significance Geographically, unexpectedness/surprisingness and novelty often appear together, since an unexpected pattern must also be novel in order to contradict existing knowledge. Laube uses the idea of unexpectedness often in analysis of movement patterns to identify features which interact with their surroundings differently than others (Laube, Imfeld, and Weibel 2005, Laube and Purves 2006, Laube 2014). The ability to create a model which matches the expected geographic dynamics of a particular system can reveal much about the way it works and interacts with space and time, but a model should both represent known dynamics and extract and indicate when interesting violations to that expectation also occur. Visualization methods take advantage of this necessity to more prominently display such variations, especially when they occur in real-time (Cao et al. 2012, Chae et al. 2012). Unexpectedness may at the least be a precursor to identifying interesting features to explore further. 4.3.8 Utility/Actionability Actionability and utility are the only two interestingness measures which have been classified as semantic measures. Semantic measures are a subclass of subjective measures which rely not only on the beliefs and decisions of the evaluator, but also explicitly on the data itself. This difference is important so as to distinguish between a pattern that is interesting for its unique characteristics, and one that contributes toward taking some action or contributing to a stated goal. Utility and actionability are combined in this project. Actionability usually implies action which can be justified for a separate party, such as acting on an unexpected pattern to change an undesirable behavior (Geng and Hamilton 2006). Utility has a more broad definition, referring to the ability of a pattern to contribute to satisfying some predefined goal. For the purpose of this project, the goal and expectation for taking action are the same: to find and further investigate topics and news articles related to the intersection of politics and sport in Catalonia’s independence movement. In this way, actionability can be considered the opposite goal of unexpectedness.

126 4.3.8.1 Sensitivity Testing Like the case of unexpectedness/surprisingness, actionability/utility is closer to a binary search than the other measures which are evaluated on an ordinal scale. Thus, actionability demands a search for specific topics which indicate themes pertaining to the second research objective of this project: discovering spatial patterns within news media narratives relating to the merging of sport and politics and the social and spatial influences on them. Mapping sentiment toward current events helps to understand these patterns, especially since nouns are the most frequent key terms used in topic definitions. Specifically, the Catalonian independence movement, FC Barcelona soccer club, and their connection to the Scottish independence movement provide ample relationships to explore in local, national, and international media. The topic definitions of each pattern are explored in a semi-systematic way with respect to parameter combinations, searching for topics which suggest the merging of these given themes. Table 4.8 lists some of the actionable topics and the parameter values that generated them. In all cases, the topics are considered actionable because of the use of terms which cross the otherwise separated themes also listed in the table below, or explicitly specify sentiment toward the dominant theme.

127 Min News Topics alpha Topic definition News description #2 tf-idf description #1 catalan, mas, espanyol, resolute, Catalan Espanyol soccer, Kiko 25 0.011 0.03 novemb, casilla, independ, artur independence Casilla goalkeeper rajoy, independ, catalan, Catalan Scottish independence 25 0.016 0.03 catalunya, parti, scotland, elect, independence percent mps, respons, fear, english, vote, English Negative community 50 0.011 0.03 community, owner, irrat parliamentary response to vote voting mas, catalan, independ, athlet, Catalan athletics 100 0.011 0.03 yesterday, artur, brereton, gestur independence labour, basqu, corbyn, independ, Scottish Basque politics 100 0.016 0.03 scotland, vote, snp, elect independence independ, catalan, catalonia, elect, Catalan Catalan 100 0.029 0.03 mas, vote, proindepend, parti independence ‘proindependence’ subtopic leagu, catalan, football, club, Soccer Catalan indpendence 100 0.029 0.03 catalonia, independ, play, game

nonsecessionist, driver, formula, Formula 1 racing ‘nonsecessionist’ 100 0.016 0.056 prix, japan, hamilton, sainz, sentiment alonso

Table 4.8 Observed actionable patterns, the parameters that generated them, and the labeled description of the news explained in the topic.

As subjective measures of interestingness, utility and actionability depend on user-defined goals and personal definitions of how those goals are accomplished. In this analysis, the primary goal is to find subsets of news articles which pertain to the merging of multiple sport and geopolitical related themes. Catalan independence, as the primary topic of interest, appears as one or more topics in every pattern with varying degrees of specific themes. Table 4.8 shows many topics across several of the 48 patterns which demonstrate not only key terms indicative of a topic pertaining to Catalan independence, but also indicate the presence of sport- and opinion-based themes injected into the topic via one or more unique terms. These topics have been selected as actionable especially for the unique terms which set these topics apart from other semantically similar topics, generating at least two distinct descriptions of the news represented by the topic. Although the political identity of Spanish sport clubs has been well documented (Borden 2013, ESPN FC 2014), many patterns separate political and sport reporting in their generated topics. The most useful and actionable topics to address the research

128 objective of exploring the ways that news merges these topics will be ones most likely to contain individual articles which merge the themes in question. Three topics in Table 4.8 appear to integrate aspects of sports and Catalan politics related to the independence movement. Specifically, Catalan independence and the referendum vote of September 2015 are linked with reporting on football events and the Espanyol soccer club especially, whose politics juxtapose with those of FC Barcelona in their support of a unified Spain, generic reference to ‘athletes’ who may also be associated with public or political ‘gestures,’ and with Formula 1 racing which has a much greater presence and geopolitical identification among drivers than in the U.S. In most patterns, Scottish independence and Spanish politics remain separate topics, but in a few topics, they are merged into unique and actionable topics, addressing of the key expectations of this project – that sports intersects with Catalonian independence in news reporting. One hypothesis of this project was that the Scottish independence movement, which occurred less almost exactly a year prior to the September 2015 Catalan independence referendum, would significantly differ in the ways that different sides of the secessionist debate spoke of the prospect and consequences of leaving Spain. With these two themes combining into single topics in rows two and five of Table 4.8, rather than remaining separate as in other patterns, there is hope that the political debates cross international borders to take advantage of lessons learned, threats levied, and consequences debated of turning down a path of secession. Exploring these derived topics in greater spatial and semantic depth is the subject of Chapter 5, but the uniqueness of the combinations of terms in these identified topics also marks consistencies in the LDA model for generating actionable results. Most importantly, of the seven patterns with unique and actionable topics, six of them were generated by the lowest possible value of tf-idf – 0.03. An easy explanation would agree that a larger vocabulary, caused by a smaller minimum tf-idf, incorporates more rare terms into topic definitions which can help to illuminate secondary topics within articles. Some of these same terms do appear in higher-tf-idf models, such as the secondary terms of ‘espanyol’ in the first row of Table 4.8, ‘scotland’ in the second row, ‘basqu’ in the fifth row, and ‘nonsecessionist’ in the eighth row, but never in combination with the primary topics as they are observed here. More importantly though, the secondary terms

129 in the other rows, specifically the emotional terms ‘fear’ and ‘irrat’ and the political opinion expressed through ‘proindepend’ only appear in the vocabulary when reducing the minimum to 0.03. Thus, despite the clear themes of news relating to coverage of the Catalan independence movement and the referendum vote on parliamentary support of pursuing independence, a lower minimum term frequency-inverse document frequency will help to incorporate some more specific emotional and partisan terms into the topic definitions of news coverage. 4.3.8.2 Recommendations Actionability is very much in the eye of the evaluator and the definitions of what constitutes interesting information. Here, actionability blends concepts associated with peculiarity and novelty, as the terms comprising the topics must generate unique and separate themes, although neither measure will be able to highlight any of the actionable topics discovered here because actionable topics consist largely of small inconsistencies in very common topics, using very common key terms. The commonality of these terms means that despite the conceptual similarity, peculiarity will not coincide with actionability through the computation of the measures. Actionable topics or patterns of them need not always exist independently of other interestingness measures, but the targeted data collection of this project toward accomplishing the stated research goals has helped to generate many relevant topics, with only some providing the semantic merging of sport and political themes of interest. As such, the recommendations that this dissertation gives for generating actionable patterns is tentative and subject wholly to the definitions of useful and actionable. In this case, where actionable patterns are given by topics which are combinations of known and common themes, parameter values combine in ways that are unique to measuring actionability compared to other measures. Clearly, according to the topics extracted as actionable in Table 4.8, a low minimum tf-idf typically generates the most interesting patterns. The larger vocabulary size allows many topics to become more specific, as does the number of topics in the model. As many topics get more precise and specific, the most common themes which appear regardless of model parameters (for example, topics indicating the Catalan independence referendum, the Scottish independence referendum, and FC Barcelona athlete news) often appear in multiple

130 topics with only small, but important, variations. Such variations in terms, such as the insertion of the terms ‘espanyol,’ ‘basqu,’ ‘proindependen,’ and ‘nonsecessionist’ into otherwise routinely occurring topics pertaining to Catalonia, Scotland, and sports are not possible without the additional vocabulary size used in generating more specific topics. This effect is seen in several interestingness measures, where more specific, unique, and novel topics require lower minimum tf-idf values to maximize interestingness. To a lesser degree, alpha also impacts the actionability of patterns as defined in this project’s objectives. In the topics observed here as actionable, all but one of the seven patterns use mid-range alpha values, suggesting that both low and high alphas are less helpful for finding merged themes within a single topic. Alpha does not generally change the interestingness of a pattern, since k and tf-idf do much more to impact the presence of specific terms within topic definitions. But among models with otherwise identical parameter values, alpha does appear to allow some unique terms to be included in topics which otherwise contain the same terms. Median values for alpha appear to facilitate more of this topic integration, as all but one of the seven actionable patterns in Table 4.8 is generated by an alpha of either 0.011 or 0.016.

4.4 Discussion 4.4.1 Summary The interestingness measures examined, described, and implemented here are impacted by various motivations and by different parameter values and combinations of them, as described in the sensitivity testing sections of each measure. It is important to note that the conclusions reached depend largely on the only constant variable across every tested model – the input data. Results are largely influenced by the average and variance in the length of input documents, and so these findings may not hold for input text with different properties. News articles remain an important source of geopolitical and other data, and until these conclusions can be proven for other data types, such as social media posts which have very distinguishable lengths, the recommendations provided based on this research will help guide text modeling to understand a specific data source and model.

131 Table 4.9 summarizes the recommendations from each of the interestingness measures examined in this section. Additionally, subjective confidence rating which is based on the ease of generating the recommendation that accompanies each measure is included. Confidence is a product of the observable patterns in changing parameter values and the ability to observe and reason about the impact that the parameter(s) have on the measure.

INTERESTINGNESS MEASURE RECOMMENDATION CONFIDENCE CONCISENESS low topics high GENERALITY/COVERAGE low tf-idf high PECULIARITY low topics + low tf-idf / low high topics + high tf-idf DIVERSITY low alpha, median tf-idf medium RELIABILITY low tf-idf low NOVELTY high topics, low tf-idf high SURPRISINGNESS/UNEXPECTEDNESS mid topics, mid tf-idf low UTILITY/ACTIONABILITY low tf-idf, mid alpha medium

Table 4.9 Summary of all recommendations for maximizing each interestingness measure. The confidence is my subjective trust in the conclusions given by each evaluation.

Several patterns are evident in the recommended values for generating interesting patterns. A general trend appears to be that low values for topics and minimum term frequency-inverse document frequency produce many different types of interestingness. Most significantly, a discussion around alpha and its impact on interestingness requires a discussion. Alpha is an important parameter which sets LDA apart from other clustering algorithms by adding the tendency of topics to contain similar semantics as a parameter to the model. The propery of topics and patterns which alpha describes is directly measurable by the diversity interestingness measure. So the results here, particularly those observed by the diversity interestingness measure and the overall lack of alpha serving as an important parameter in other measurements of pattern interestingness, confirm the function of alpha and the literature suggested values for alpha have measurable effects on model results. Because alpha is utilized as a document- level distribution parameter (each document is sampled once per iteration over the dirichlet distribution given by alpha), any measure seeking to establish semantic

132 uniqueness of an entire topic or collection of topics will be unaffected by alpha, which is supported by the results in this chapter. Diversity is perhaps the most important parameter for a clustering process since it measures the separation of each cluster from one another, while maximizing the within cluster similarity. That is, diversity measures the ability of of a clustering process to perform its function most effectively. Alpha’s important impact on this measure underscores the importance of the parameter for the LDA process. Despite the large impact that alpha has on LDA, different implementations of LDA deal differently with the parameter. Within R, two packages implement LDA; ‘topicmodels’ and ‘tm’ (many other languages have implementations and some standalone software contain LDA, but they were not considered in this analysis). The ‘tm’ package accepts an alpha parameter, but provides no guidance as to its appropriate usage. Alternatively, ‘topicmodels’ does not accept a custom alpha. This implementation uses the recommended alpha value of 50/k given by Griffiths and Steyvers (2004), or provides an option to estimate the alpha value by iterating through multiple models and using log likelihood to choose the model with the optimal performance (Grün and Hornik 2011). According to Grün and Hornik, in tests of both methods for defining alpha, the estimated and fixed values produced widely different values, though neither method produced a model more desirable than the other. It is fairly apparent that there is little agreement among the individual interestingness measures as to the best values for the input parameters of LDA. The 48 unique models run in this project suggests that LDA is complex in its ability to extract different properties from the patterns it produces, and that interestingness is more of a guide to the production of topics rather than a set of rules. Thus, a pattern should not be evaluated as having the highest ‘average’ interestingness among all methods – interestingness provides a class of several methods, some of which are more relevant to a particular application than others. By exploring interestingness individually and isolating the parameters with the biggest impact, this project is able to consider the specific impacts of each measure on model evaluation. Interestingness serves to indicate some aspect of the success of a model, however, an evaluation can utilize a potentially infinite number of parameter combinations,

133 especially considering LDA’s random selection from multinomial and Dirichlet distributions. In Knowledge Discovery applications, interestingness is formalized and maximized according to parameters and data sampling. While this is useful to determine the optimal inputs for generating desired results, as this chapter has done, it also shows that prioritizing for one or more forms of interestingness does so at the expense of others. The conclusion here is thus, that in the application of interestingness, it is important to consider what a successful model looks like. Interestingness provides several alternatives to the expectation maximization that is so frequently utilized to objectively describe the representativeness of a chosen model. I have described eight measures of a successful model, which together and separately provide explicit ways of thinking about and planning for expected model results. Applications demand separate conceptualizations of success, and this analysis expects to have generated several ways of defining and calculating model goals. 4.4.2 Contributions Interestingness measures clearly have use in knowledge discovery in databases research for formalizing association rules and model standards. The objective interestingness measures are often incorporated into the data mining process to automatically select models which maximize certain interesting features. Although such an approach is useful in enforcing rules and standards, relying strictly on the computations of interestingness will produce a less viable model than one which considers interestingness as a class of motivations behind the success of a model. It is this reconsideration of what makes a model successful where this dissertation hopes to contribute to geographic information science knowledge. Unsupervised methods continue to be used in data-driven contexts, and often with insufficient evaluation to confirm the model is generating the most usable results. Validation processes often take the insufficient step of maximizing p-value (Ward, Greenhill, and Bakke 2010a) or maximum likelihood (Dempster, Laird, and Rubin 1977), serving to uncritically evaluate the model based on a single measure of success. What do p-value and maximum likelihood represent in the validation of a machine learning process? The answers are much less clear than in each of the interestingness measures, which evaluate very specific

134 properties of model results and can be combined to form more comprehensive understandings of model effects and more useful results. Thus, the analysis of interestingness done here via text modeling contributes new perspectives for GIScience research in ever-growing applications of spatial big data. Model validation remains important for computational processes, proving the viability of a chosen procedure, but many valid models are not useful (Augusiak, Van den Brink, and Grimm 2014). LDA’s topic structure is a valid organization of the input data, but as the model sensitivity analyses here show, not every pattern generated by LDA has the same useful properties. 4.4.2.1 Geographic Information Science The biggest issue in objective evaluation under big data is the formalization of geographic knowledge. Data-driven research likes to assume no prior knowledge or assumptions and allow patterns to emerge from the data. Thus, a conflict between formal theory necessary to evaluate a model and its results, and facilitation of emergent themes in the analysis process itself. Formal theory in the case of interestingness, however, does not need to be semantic. To guide the use of parameters based on expected outcomes, those outcomes only need to be as broad as the features which each interestingness measure illuminates. The model sensitivity analysis in this chapter show the impacts that the necessary parameters to LDA have on directing the model output toward specific means of measuring a model against several forms of a pattern’s interesting features. This project alleviates some of the burden of generating a multitude of patterns for the purpose of more comprehensive evaluation through its parameter-based sensitivity analysis. In other limited use of concepts related to the interestingness of unexpected and actionable patterns, geographers have explored subjective evaluation as a means of identifying interesting features from prohibitively large decision spaces (Laube and Purves 2006, Miller 2010). Although this dissertation takes a similar approach exploring the differences between 48 unique models, the comprehensive evaluation here of several dimensions of interestingness and their resulting recommendations provide guidance for reducing that space via targeted use of parameters and combinations of them. Interestingness plays a specific role in facilitating the use of data mining process which are largely guided by very abstract numerical parameters. The number of topics in

135 LDA has some real-world relevance, since the expected number of topics should align with the number of separate themes which should be extracted from the input data. But alpha and the term frequency-inverse document frequency are two parameters which depend on data properties and personal user choice. With no more guidance on proper parameter usage and their effects on the model outcomes, the process becomes a black, nontransparent box. LDA’s Dirichlet distribution and multiple randomized multinomial term selection processes increase this black box effect when subsequent processes do not generate similar results. This chapter opens up the black box process of LDA and of its evaluation, specifying ways of getting desired results without multiple testing and extensive evaluation. Work remains to test these hypotheses on additional iterations, text properties such as length and language, and directed analysis with exaggerated input parameters tailored to one specific goal. The general patterns and recommendations observed here should continue to open new avenues for geographic knowledge discovery from text, as the opportunity for spatial data analysis in urban and social context continues to expand. Finally, interestingness is not only defined by KDD properties, but geographic interestingness as well. In this way, each of the patterns generated by LDA and other data mining techniques also consider the geographic patterns which they illuminate. The KDD interestingness measures utilized here have geographic significance as well as data science significance, as explained in each section. Further work remains to refine and formalize the geographic interestingness of these measures into useable processes for evaluating spatio-temporal data. In the next chapter, geographic patterns are extracted from the interesting patterns identified here. In future work, this multi-step semantic evaluation and geographic pattern comparison procedure can be integrated into a more thorough process for making sense of spatial geographic information.

136

5. Chapter 5:

Mediascapes

“Mediascapes… tend to be image-centered, narrative-based accounts of strips of reality, and what they offer to those who experience and transform them is a series of elements… out of which scripts can be formed of imagined lives, their own as well as those of others living in other places.”

-Arjun Appadurai (1996), Modernity at Large: Cultural Dimensions of Globalization p. 35

137 5.1 Introduction In the previous chapter, this project looked at the semantic spaces contained within a large collection of news data. The LDA process used in this way serves to reduce high dimensional text data into manageable and distinct clusters for topical analysis. In this chapter, several distinct subsets of the news media data are explored in detail to measure geographic factors in the news content and in the media itself. The geographies which are inherent in the media and the geographic narratives that are advanced through it are entwined with one another through the nature of the media itself and the topics which it presents to readers. The complexities of these geographic relationships in the context of globalization are described by Appadurai’s concept of the mediascape (Appadurai 1996), explored here as a combination of media’s sites of production, audiences, and printed context (Rose 2012). The nature of how people interact with and react to current events through news media has changed with the Internet’s ability to connect people and places. This connectivity brings communities together in ways that the print media cannot because of ease of accessing local and distant issues. Appadurai argues that globalization has helped create a cultural landscape connected not via common location, but by connections in other ways, especially for the case of digital media (Appadurai 1996). In Appadurai’s modern world, media imagery facilitates imagined communities (Anderson 1983) connected via their digital spaces and cultural significance, rather than their physical proximity. In one particular example which appears in this current research, the ‘Euskal Kazeta’ news source reports directly on events pertaining to the Basque region of Spain and to Basque cultural events in the U.S. Euskal Kazeta is unexpectedly based in southern California. This chapter examines the geographic spaces of the digital media as observed through the Catalan independence movement and its relationship to sport in order to show the extent to which spatiality factors into the production of digital news and the semantic spaces within it. Appadurai coins the term ‘mediascape’ to describe the changing dynamic and perspective-based cultural and spatial landscape of a global media. The mediscape is defined in section 5.2 as it relates to both the production of news media and to its geography. Section 5.3 explains the process of extracting and interpreting the

138 geographies of media in its relevant forms. Section 5.4 examines the spaces of the media via the spatial scale and location of its production and its audience, Section 5.5 explores the geography of several related topics observed within observed by LDA. Section 5.6 concludes the chapter by considering the difficulties of mapping the mediascape and discusses the significance of national identity beyond the common use of placenames in establishing mediascapes.

5.2 Mediascapes Communities are not just imagined by readers in the process of observing media narratives, they are also generated by the media and its spatial and cultural context. For the sake of capturing such a spatial dynamic in this project, news sources catering to audiences in Argentina and Brazil were included in the RSS feeds scraped each day with the expectation of finding unique narratives concerning Catalan independence through the FC Barcelona soccer team. Two of the most well-known athletes, not only within the team, but in all of international soccer, Lionel Messi and Neymar, are from Argentina and Brazil, respectively. Adding to the spatial complexity of the situation, both of the news publications included for the purpose of capturing some media context from these athletes’ home nations, the Argentina and the Brazil Sun, are owned by the same media conglomerate located in Sydney, Australia. Although the online papers contain material specifically important to readers with interest in South American events and are published by authors and press agencies located in South America, clearly the Australian owners’ perspective is also necessary to understand the content and context of the news which together form the mediascape. Rose discusses the media’s geographies as the combination of three sites: the producer, the audience, and the content (Rose 2012). Specifically, Rose’s methodology refers to visual media – photographs, video, art, etc. – but textual reports present a different kind of visualization that is translated through visual language and that the reader relates to via the way they interpret its connection to their experience. To perform this interpretation and decipher its meaning, an observer must consider who is producing the image and where their experience comes from. Second, the intended audience of the media influences its purpose and content. Finally, the content itself contains reference to

139 what is important and what is not. Here, the media is also a product of these three sites’ geographies, and the mapping of this spatial mediascape reveals how a geographic and political narrative is advanced through news media. In this project, each of these sites are examined to better represent the mediascapes in the Catalonian and Scottish independence movements. Although Rose specifically refers to visual media and methodologies for understanding it, Appaduarai’s mediascapes prove that the imagined landscapes of media portrayals of current events and actors are in their own way a visually interpreted and translated medium. The following sections describe some of the ways that research in geography has considered each of Rose’s three facets of the mediascape and how they are captured for their spatiality in this dissertation. In the specific context of Catalonian and Scottish secessionist movements, those intersections reveal complex geopolitical systems regarding state-building, public sentiment, and regional differences. 5.2.1 Producer In the current research, the spatiality of news producers is measured via the address locations of a publication source’s headquarters. Historically, the headquarters of a media production entity is expected to represent the center of influence the publication would like to have. These locations are given on the websites of the source, extracted manually, and manually geocoded to produce the spatial data shown in Figure 5.1. Clearly, the locations of news sources are dominated by the two major cities in Spain – Madrid and Barcelona – with other regional publications located in secondary cities throughout Spain. Of course, this pattern is determined by the choices of sources in the data collection phase of the research.

140

Figure 5.1 Locations of the headquarters of each news source collected. News are primarily situated in Barcelona and Madrid with seven and 5 sources, respectively.

More interesting with regard to the locations of publication sources shown in the map of Figure 5.1 is that many sources do not have headquarters located within Spain at all. Only a couple of the source locations are located within expected regions given data collection targeted at recovering news from geographic contexts outside of Spain: The Guardian based in London, UK, and Pagina 12 and La Razon located in Buenos Aires. Several additional exceptions to the expectation that sources are located near their distribution exist and reveal a separate class of news geographies that are classified in this dissertation as “conglomerates;” news catering to local or regional topics that are owned and operated in a foreign country. Examples include ‘The Local’ English language Barcelona news published in Stockholm, Sweden, and the previously mentioned ‘Euskal Kazeta’ in Venice, California and the ‘Argentina Star’ and ‘Brazil Sun’ operated in Sydney, Australia. 5.2.2 Audience Geographic analysis of the audiences of news must carefully consider both the intended audiences and the ways that actual audiences interact with the media. Generally, the location of a media producer reflects its intended audience in several ways, but particularly if there is a print component to the media circulation. It must be able to rapidly produce, print, and disseminate content to the most likely readers, and the

141 correspondents who cover current events can do so easier when in closer proximity to the event itself and the participants who can provide comment. Except for the case of conglomerates mentioned above, whether producer and audience are co-located is important for understanding the intentions of the media as both a facilitator of truthful reporting and an entity interested in generating a profit by providing entertaining, imagery-filled stories. Mapping the specific locations of the media’s audiences is not possible without access to the locational tracking resources that presumably media companies keep very private. But combining the locations of the news producers with the scale of their intended dissemination, with some contextual information about regional focus when necessary to differentiate between a conglomerate news source and an international one, the audience can be approximated using the scale classifications determined here. A random sample of 1000 articles from each of the international and conglomerate news producers are used to extract the geographies of those producers for comparison. An equal sample is taken from each to ensure the ability to compare total places between scales. The international scale producers, consist only of articles from ‘The Guardian.’ The conglomerate scale producers consist solely of articles from ‘The Local,’ concerning Catalonian news but headquartered in Stockholm, Sweden. Again, the mediascapes of news producers and their audiences are only worth exploring when their content contains significant geographic and semantic variation. Identifying the contextual differences between them is not analytically interesting without exploring how these geographic differences are manifested in the actual images being produced. 5.2.3 Content In studies of the geographic content within news media sources, named places are extracted from the text of the article and their locations disambiguated to their correct locations on a map. Previously, the semantic content of news was summarized into topics using key terms, some of which have both explicit and implicit geographic references. Catalonia and Scotland appear as key terms in several topics, almost exclusively referring to those communities’ independence movements. But in addition to the geography already detected within topic labels, nearly every article reporting on current events also

142 contains references to where the event occurred. Using the process of placename extraction using Named Entity Retrieval (Cunningham 2002, Manning et al. 2014), spatial disambiguation using the Geonames repository (geonames.org), and point geometry, the spatial content of news under different classifications and topics is mapped and compared. This chapter combines placename counting and spatial analysis to understand the geographic nature of mediascapes and semantic spaces. 5.2.4 Mapping Process To map the geographic content of news, the procedure described in Section 3.5.4.3 is carried out using the GeoTxt online engine (Karimzadeh et al. 2013). GeoTxt extracts placename content using two open source Named Entity Retrieval services; GATE ANNIE (Cunningham 2002) and Stanford CoreNLP (Manning et al. 2014). Placenames are then queried against the Geonames database, mapped onto a basemap, and provided in downloaded GeoJSON format. Each article must be processed by GeoTxt individually to respect its character limit of 3,900. Some manipulation of the input text and manual disambiguation of correct locations (when more than one place matches a placename) are necessary, and some locations are inevitably misclassified. But an effort is made to capture as much of the geography from within a news article as possible. Each resulting GeoJSON codeblock is saved individually in a new local file. Then, the Mapshaper engine (mapshaper.org) is used to convert each individual file a single shapefile for use in ArcGIS. Finally, all shapefiles resulting from articles sampled from the entire collection based on topical or audience scale are merged together to examine the entire geography of the mediascape as defined through these specific methods.

5.3 Mapping Mediascapes In this section, the locations and scales of media’s producers and audience are mapped and compared as functions of the geographic locations provided within their content. Specifically, local and international news interact with global narratives and places in different ways because of the audiences their content intends to speak to. International news tends to be located at a location central to its primary audience. The

143 Guardian, which is the most international news source among the collection analyzed here, is located in London, and despite its international content, does favor locations within the UK, as this section shows. Conglomerate news has a more localized audience, but its site of production tends not to coincide with the site of its audience. And finally, local news covers the immediate area around its location, with little expectation of an audience beyond a local scale. The spaces observable in samples from each of these three classifications of news are examined here by mapping the locations which appear their text, and compared for their global nature, their geographic specificity in spatial scale, and their ability to generate imagery of communities and Appadurai’s non-spatial definition of ‘localities.’ The Catalonian independence movement is a localized event, in both a spatial sense and in Appadurai’s modern ethnoscape sense, where it is not tied to a specific geographic location. As a provincial unit with Spain, Catalonia is a geographically bounded region. Pro-independence sentiment, however, is not bounded to the Catalonian territory; the culture and ethnicity that comes from being Catalonian and not Spanish follow those who feel a part of a community of Catalans. This section attempts to quantify this diaspora of Catalonianism through the mediascapes of local and international news. 5.3.1 Spatial Frequency To test the effect of digital media’s composition, given via the geographies of its producers and it audiences, on the mediascapes that they embody and produce, several different media sources with various compositions are compared in this section. Of the 35 different news sources whose RSS feeds were collected for this study, several sources within the given categories of local, conglomerate, and international news are identified. Important information about each of these sources is given in Table 5.1. These news sources publish at different rates and with different relevance to the Catalonian independence context, and so the total articles captured through this study period is widely uneven. To compare the quantity and nature of places mentioned in these news classifications, each of the scales are sampled to the same rate of 1000 articles. Each of the conglomerate (‘The Local’) and international (‘The Guardian’) news classifications are represented by a single source Significantly more articles from The

144 Guardian were collected during this timeframe, so the random sample contains fewer of the total published articles than for The Local. This information is also contained for reference in Table 5.1 News publications captured at the conglomerate and national scales, their headquarters’ locations, and sampling information for geographic content.

Scale Source Location Sample Proportion of articles Size sampled from source’s total Conglomerate The Local Stockholm, Sweden 1000 54.2% International The Guardian London, UK 1000 9.9%

Table 5.1 News publications captured at the conglomerate and national scales, their headquarters’ locations, and sampling information for geographic content.

The local and national scales of news are not included in this comparative analysis. First, the local news collected here is entirely in Spanish or Catalan languages, which makes NER strategies incompatible for direct comparison. Although software exists for extracting Spanish placenames, the Geonames database is unreliable for disambiguating Spanish and Catalan spellings5. Second, from the small and nonrepresentative sample of local news that could be manually extracted from articles, the local news almost exclusively mentions local places. While significant, this does not contribute to a narrative of global identity shown to be central to the Catalonian independence movement. Thus, the mediascapes of only international and conglomerate news are considered here. To begin to understand the unique mediascapes characteristic of each of these scales of news, the placenames extracted from the news articles are summarized in table 5.2 by their frequency inside and outside of the spatial context of Spain. From this data, the geography of the media’s audience appears strongly, in terms of the locations where the media speaks to, as well as the scale of that audience. Bear in mind that the articles from which the locations analyzed in this section are collected by specifying key terms

5 Numerous attempts were made by the investigator to coax the Spanish and Catalan text into the GeoTxt process for extracting and mapping places. Significantly fewer and less accurate places were extracted – only 67 of 82 extracted places could be located, out of a sample of nearly 400 articles. The automatic process clearly performed well below expectations.

145 related Scottish and Catalonian independence and sporting events, so the locations already have some predisposition to appearing in specific geographic contexts. Hence, in table 5.2, locations are compared as ‘within Catalonia,’ ‘within Spain’ and ‘outside of Spain’ to examine general patterns in media geographies. Additionally, the scale of the places mentions are divided into the scalar categories of ‘country,’ ‘province,’ ‘city,’ and ‘other location.’ Other locations are landmarks, places of business, or landforms. From the data in table 5.2, a couple of patterns emerge. First, there is evidence that the scale of audience, given by the comparison of conglomerate and international news, is reflected in the places which appear in their articles. A much greater percentage of the place mentions and the unique places that appear in conglomerate news (and local news as well) are of places which are more local in scale; e.g., cities, particularly measured by the number of unique places. International news contains the highest number of references to cities inside of Catalonia (328 in international news versus 170 in conglomerate), Spain, (945, 920), and outside of Spain (845, 462), but those mentions are of the same places, rather than references to a variety of locales. Comparing unique places as cities of Catalonia (6 in international news versus 9 in conglomerate), of Spain (31, 93), and outside of Spain (106, 156), the relationship switches. This pattern indicates the power of local and conglomerate news to provide a more detailed account of the geography of events, as a greater number of impacted places are mentioned.

146

CONGLOMERATE INTERNATIONAL TOTAL 4163 4004 PLACE Country 1070 921 MENTIONS

Province Within Spain 218 97 Outside Spain 77 334

City Within Catalonia 170 328 Within Spain 920 945 Outside Spain 462 845

Other Within Catalonia 15 51 locations Within Spain 145 101 Outside Spain 136 117

TOTAL 582 313 PLACES Country 119 60

Province Within Spain 17 8 Outside Spain 19 17

City Within Catalonia 9 6 Within Spain 93 31 Outside Spain 156 106

Other Within Catalonia 6 8 locations Within Spain 40 18 Outside Spain 62 23

Table 5.2 Frequencies of place mentions and unique places in conglomerate and international news. Places are categorized by country, province (sub-nation administrative areas), cities, and other landmark or landform locations. Extra-state or unknown locations are not included in this data.

The local perspective tends to disappear in larger media sources, except where repeated placenames are more important than a variety of them. Popular geopolitics and event data, as two popular methods for geographic analysis from unstructured media text, are not solely interested in capturing diverse geographic entities from media, but the presumption that international media sources contain a greater range of geographically interesting information is not supported by the data here. The mediascape is a concept meant to emphasize geographic diversity through the media’s ability to speak to and of places that seem otherwise disparate. Here, international media may not be fulfilling its

147 expectation as being more able to effectively discuss global impacts as local and conglomerate news. Another significant pattern which is evident in Table 5.2 is the vast difference in the ways that conglomerate and international news handle province-level data. Inside of Spain, Conglomerate-level news mentions province-scale locations more than twice as frequently as international news (97 for international news, 218 for conglomerate), but outside of Spain, the relationship switches, with international news mentioning provinces over three times as frequently (334 to 77). Unique place mentions are not significantly different, though conglomerate still mentions more unique places than international. Since both Catalonia and Scotland are considered provinces inside of existing states, the pattern her reflects the high number of mentions of Catalonia in conglomerate news, and high mentions of Scotland for international news. 5.3.2 Spatial Distribution The global distribution of placenames mentioned in the news for each of the three scales of media are quite reflective of the locations of both their production and their audiences. First, the mean centers of the spatial distributions of mentions show the epicenter of all locations on a global scale. The mean center is the single location where the average distance between that location and every other point is minimized. Locations are weighted only by the number of times the place is mentioned, creating additional overlapping points in the map in Figure 5.2. Not surprisingly, the center for both conglomerate and international news classifications are within Spain. The Guardian’s location in London appears to pull the mean center north, indicating a greater proportion of locations in the UK. However, the conglomerate news mean center does not follow a pattern indicative of its site of production, which is in Stockholm, Sweden. Instead, the mean center is much closer to the primary study area of Barcelona.

148

Figure 5.2 Map of all places mentioned in the content of conglomerate and international news sources, separated by countries/continents and other sub-country scale locations. Distributions are summarized with the mean center location.

Additional patterns are also noticeable between local scale place mentions and country and larger spatial units. The larger points in Figure 5.2 indicate countries, regions, and continents. Although overplotting is a factor in the map, some general trends are apparent. Africa and Asia are largely unrepresented by local-scale place mentions in both international and conglomerate news (but most apparently by international news). Local scale locations largely appear through Spain and the UK. Both of these patterns reflect the topical focus of the news collected. Although the mean center provides a general sense of the spatial distributions of place mentions among each scale of news, the technique and the overplotting of multiple mentions at the same location do not indicate the specific areas which cause these distributions are different. Figure 5.3 compares the differences between areas by the density of place mentions in a 60 square kilometer grid cell. The density of international mentions is subtracted from the density of conglomerate mentions in each cell, creating

149 the divergent binary comparison in the figure. The more red the cell, the greater the proportion of places in the international news, and vice versa with cells in blue.

Figure 5.3 Map of Europe showing comparisons between conglomerate and international news. Specific important cities or other locations are labeled for reference.

In Figure 5.3, the northward pull of the international news is obvious in the location of the mean ceanter and the greater proportion of international news throughout the UK. These locations are emblematic of the ways that ‘The Guardian’ has reported on the continued impacts and deliberations on Scottish independence, reflecting the locations of the UK capital of London, and Scotland’s largest city of Glasgow. Conglomerate news, with its audience primarily oriented to the Catalonia province and Barcelona in particular, never references Edinburgh except in lists of European urban areas in non-secessionist contexts. Scotland, though frequently referenced for its own

150 attempt at independence, does so far less frequently with respect to the region’s positive views on the independence movement and referendum vote. Conversely, the conglomerate news contains references to several locations with greater frequency than in the international news. The particular locations which appear in Figure 5.3 include Europe and the Mediterranean, the countries of Syria, Italy, Greece, Germany, and Turkey, the cities of Paris, Stuttgart, Rome, and Brussels, and various cities and regions within the southern and northwestern Spain and the Spanish Canary Islands and Balearic Islands. The presence of several locations in Spain, not only in greater numbers in the conglomerate source, but not appearing at all in the international source, further indicate the local nature of the audience of the conglomerate news classification. Surprisingly, no places with close proximity to the site of production in Stockholm are evident. Although the site of production is reflected in locations in the international news classification, the conglomerate style of news generally appears to maintain a localized audience as it portends to do. Partially, this is a function of the choice of ‘The Guardian’ to represent international news – although covering a global audience, it remains synonymous with its UK headquarters. The process of transitioning from a local and national news to an international audience means that sources still tend to emphasize domestic, nationalistic perspectives on global affairs (Hjarvard 2001), supporting the understanding that the site of production influences even the global, digital media. Conglomerate news also has the added complication of frequently being the international headquarters of more than one local news publication. ‘The Local,’ captured here to represent the conglomerate style, owns news with local foci in several countries, including Austria, Denmark, France, Germany, Italy, Norway, Sweden, and Switzerland. If this international context to the news producer had any substantial impact, locations within those countries would appear in Figure 5.3 more than they do. Although France, Germany, and Italy do have some increased activity, the lack of representation from the other five countries shows that the aforementioned cities of Rome, Paris, Stuttgart are a major part of ‘The Local’s’ reporting in Catalonia. Kätke and Taylor examine global media cities through the presence of publishing headquarters and branch correspondent offices of media conglomerates and find similar connections between Madrid, Barcelona,

151 and Rome, and between Barcelona and Paris (Krätke and Taylor 2004). While ‘The Local’ may be displaying social links between some of the largest cities in Europe, it may also be picking up on the same media relationships which link these cities through their hubs for media conglomerates. These patterns are also reflected in the distance decay function plotted in Figure 5.4. As demonstrated in Chapter 3, if distance was the only factor influencing the connections of places in the context of Catalonian independence, only the locations nearest to Barcelona, as the center of the Catalonian independence movement, would appear in news articles, regardless of the media’s scale. Instead, Figure 5.4 shows that, at specific distances from Barcelona, conglomerate and international news mention places at very different rates. Where the densities are high, relative to an expected distance decay and to one another, specific places emerge as significant in the media’s context.

Figure 5.4 Density plot of conglomerate (blue line) versus international news (red line). The X-axis represents the distance Barcelona, and the Y-axis represents the density of places mentioned at the given distance. Distance represents the centroid of the feature.

Figure 5.4 supports the observations given by the subtracted densities shown in the map in Figure 5.3. The gap between the density curves in Figure 5.4 represent the intensity of the color of cells in Figure 5.3. The density plot’s single dimension prevents it from observing the pattern where conglomerate news puts emphasis on Brussels, because the similarly distant combination of London, Scotland, and Edinburgh mentions in the international news overshadow the conglomerate pattern. Conglomerate news

152 seems to have a focus on the Syrian refugee crisis, as Syria and Mediterranean both exist as peaks in blue. Additional peaks corresponding the United States (8000 km) and Australia (15000 km) appear similarly strong for both news classifications. 5.3.3 Discussion Much of the discussion around the globalizing nature of digital news media surrounds major, international news sources. These sources have the biggest capacity to accommodate a global audience by working with advertisers and correspondents with intimate knowledge of global social patterns to make an international news provider viable. The effect of advertising, in particular, is demonstrated in expanding global interests in some news sources (Thurman 2007), and also in the expanding popularity of sports, such as the Premier League. Advertising, in the form of team sponsors, reflects the global nature of the Premier League’s audience, and Thurman suggests that the same pattern evolves among suppliers of digital, globalized media. The process of integrating global perspectives, economically and topically, appears to continue at a slower rate than that of the Premier League, where, as seen in Figure 2.1, sponsors and the viewing audience are internationally reflective of one another’s geography. The expected emergence of Catalonia as an international actor requires that its capital city also has an international presence, so the connections between these major cities is not an accident on the part of the media or the actors responsible for the event on which it reports. This internationalization, or‘Europeanization,’ of Catalonia, and Barcelona particularly, shows up in its history as an Olympic Game host city and home to one of the most well-known soccer teams in the world; FC Barcelona. FC Barcelona, which carries an openly Catalan identity, plays host to frequent pro-indpendence displays at its stadium (see Figure 5.5). Importantly, the fact that the sign in Figure 5.5 is in English, rather than containing the Catalan phrase ‘Independencia!’ demonstrates the motivation to speak to more than local Catalonian sentiment. This international context is reflected in the geographies of media reports pertaining to the movement toward independence and its global implications.

153

Figure 5.5 Photo of routinely expressed pro-independence sentiment at FC Barcelona home soccer matches.

In the comparisons of conglomerate news to both local and international news, several particular locations appear as having significance to the ways that the scale of media portrays the Catalonian context and the story of its choice to pursue independence. For its part, ‘The Local,’ as a conglomerate style of news highlights the conglomerate nature of European and global media in the links that it forges between major cities in the continent. One indication of this primarily European context is the high mentions of Syria and the Mediterranean Sea. Several European countries are politically involved in rescuing and taking in refugees, and it seems clear that the effect of the Syrian crisis is a focus for several countries via the conglomerate source. Clearly, the geography of the media itself impact the geography of what it shares with its readers. Thus, the content generated by media sources is not independent of its geographic context, and it passes that context to its readers in terms of places, people, and themes. In the next section, a similar approach is taken to examine the geographies of specific topics discovered in the previous chapter. Many topics have implicit and explicit spatial references, as the previous chapter showed, but as the topic is a summary of the full text of a news article, geographic factors may exist only as context to the implementation of the major themes of that article. Next, those semantic spaces are mapped onto geographic space to explore that spatial context and how digital news combines current events with global spatial patterns.

154 5.4 Mapping Semantic Space The nature of the news was explored above to understand how the sites of production and the audience influence the geography of the mediascape. The geographic spaces of the media are also largely an effect of the semantic spaces, which are explored in this section via the same placename extraction and mapping method used previously. In particular, some of the topics identified in chapter 4 are reexamined here and compared spatially to explore the geography of the multiple narratives present in reporting on the Catalonian independence movement. Although many topics have both explicit and implicit geography, placenames do not regularly appear in the topic definitions generated by the LDA process (Li et al. 2008), despite the association of particular places with semantic clusters as this chapter shows. Thus, the geography of articles corresponding with semantic topics is extracted by the GeoTxt program in the process described in Section 3.5.4.3. In this section, specific topics identified as interesting from the previous chapter are spatially mapped and compared to the distributions of similar topics from the same pattern. For this analysis, the pattern generated by input parameters of 100 topics, and alpha of 0.029, and a minimum term frequency-inverse document frequency of 0.03 was chosen for having created several patterns identified as interesting, specifically consisting of topics merging the Scottish independence and sport into the topical context of Catalonian independence. Table 5.3 contains the three topics which are compared. Row (a) contains a topic representative of solely Catalonian independence terms, including ‘junt,’ which represents the primary pro-independence political party Junts pel Sì, or “Together for Yes.” In addition, the Catalan spelling of ‘Catalunya’ appears, which further emphasizes that this topic is primarily a localized and pro-independence topic. Row (b) contains a topic representative of the Scottish independence movement, which was a non-binding ‘referendum,’ rather than Catalonia’s parliamentary election. The presence of ‘sturgeon’ and ‘snp’ also signal this as a pro-independence topic, representing the Scottish National Party and its leader, Nicola Sturgeon. Row (c) contains one of the topics identified in Chapter 4 as actionable for its merging of Catalonian independence and soccer narratives.

155 Because of the choice of a pattern generated by a high number of topics (k equals 100), many topics do have the same terms in their definitions, as the three topics compared here do with the term ‘independ.’ Although this means that topics have more semantic crossover, it also means that small diffrences between topics make it easy to identify the variations between themes when they exist. The topics used for spatial comparison in this section are chosen partly for their shared terms, such that the unshared terms can be isolated as impacting their differences as much as possible, particularly their common use of the term ‘independ.’

(a) catalunya independ vote catalan junt resolute commiss elect (b) snp, vote, labour, scotland, parti, independ, referendum, sturgeon (c) leagu catalan football club catalonia independ play game

Table 5.3 Definitions for the topics used in section 5.5.1 and 5.5.2 to compare (a) the Catalonian independence context with that of (b) Scotland’s independence movement and (c) convergence between Catalonian independence and soccer

5.4.1 Global Distance Decay First, before exploring the particular places which are unique to each topic, this section examines the global patterns of places within all three topics. Spatial autocorrelation at the global scale can only demonstrate how likely each topic is to present a spatial context clustered around a single point. As previous figures have demonstrated, the geography of news, regardless of media scale or of semantic theme, tends to cluster around the central focal point of this entire project – Catalonia, and in particular, Barcelona. Local spatial autocorrelation might indicate the extent to which a subset of the news is focused on a few locations, but more importantly for establishing interesting geographic network connections is knowing where concentrations of place mentions are globally. For this purpose, this section utilizes spatial density as a function of distance from Barcelona to discover places which do not abide by expected rates of distance decay. Figure 5.6 shows the spatial densities of places within articles of the three topics defined above.

156

Figure 5.6 Density plot of place mentions given by Euclidean distance from Barcelona by topic. Catalonian independence is in red, Scottish independence is in blue, Catalonian independence with soccer is in green, and a sample of articles from other topics in the semantic ‘background’ in solid grey. Countries’ distances are given by their centroids.

Although the three topics share similar patterns as shown by the shape of the density curves, several points of diversion from not only a regular distance decay, but more importantly, from one another. The regular distance decay is approximated by the grey line in Figure 5.6, which represents a sample of articles from non-political and non- sport related topics. This ‘background’ density is less concentrated on locations in and near to Spain, and has a more even distribution of locations within a distance of around 2500 km than the other topics. Peaks in the background of news stories do occur, such as the visible one at around 11500 km, which corresponds to two phenomena. First, the presence of new a new location, Malaysia, which is not mentioned in any of the three topical sets of articles, creates a higher density at that distance. Second, the Catalonian topic contains a higher density of locations closer to 10000 km from Barcelona, which pulls the density peak to a nearer distance, even while the mentions of other places at 11500 km are similar between that and the background topic, which include Argentina and Chile. Some of the important locations indicating significant places to one or more of the topics are indicated on the plot. These will be explored in more detail in the following sections, but are quickly highlighted here. First, the Scottish independence topic, shown in blue, has its most significant peaks of place mentions above the other topics in the United States. The United States (including variations ‘US’ and ‘USA’) within the

157 context of the Scottish independence topic represent the most significant concentration of place mentions outside of countries in close proximity (approximately 2000 km) to Barcelona. This pattern may be a result of the English language nature of the news examined here, but it is not a result of news headquartered in the United States or catering to an American audience. This likely reflects the intense public interest in the Scottish referendum in the United States, particularly with respect to the similar movement in Catalonia. Additional peaks appear in the Catalonian topic at locations with greater distance from Catalonia, particularly the South American countries of Brazil and Argentina, and Australia, particularly Sydney. Distance seems to have less effect on the network of places facilitated by news pertaining to the Catalonian independence movement alone. This pattern is one that would be expected of the topic containing soccer, since international competition is governed less by proximity than by the presence of major urban areas which facilitate a fanbase. This pattern is explained in the next section. Instead, the soccer-specific topic emphasizes some unexpected places over the others. Three areas in the Middle East – Syria, Israel, and Qatar – signify important places for the context of soccer over the other topics. Although not directly in comparison with Catalonian independence, soccer is important in this region and in the timeframe of this project. Soccer has enabled many refugees from Syria to feel like they have a sense of normalcy (Smith 2015), while Qatar is linked to Barcelona soccer via economic networks of sponsorship. The places emphasized here by increased connections between topics generate interesting networks that are explored in greater depth in the next sections. These networks are significant for reasons other than population or distance to Barcelona. Many highly-networked cities in Europe, connected via major economic and social flows, such as Rome, Munich, Berlin, Amsterdam, Prague, Milan, and many others, though appearing at times in news articles in these topics, do not present as having a specific relationship with any of the topics explored here. Surprisingly, even London and Scotland do not present as major locations within the topic of Scottish independence. Given high densities for all topics up to 2500 kilometers in distance from Barcelona, these locations simply are subsumed by the large number of places mentioned throughout Europe. All three topics

158 also tend to mention the same places, though the patterns observed here and in the ensuing chapters present interesting networks for further analysis. 5.4.2 Comparing Catalonian Independence and Sport To explore the influence of sport, specifically soccer, on the Catalonian independence movement, this section compares two similar topics primarily representing news on Catalonian independence, though one also references soccer. The Catalonian topic is the same as the one explored in the previous section against the Scottish movement, appearing in row (a) of Table 5.3. The other topic, merging the Catalonian independence context with soccer, is also shown in the row (c) of Table 5.3. This football topic is also identified as actionable in Section 4.3.8 of the previous chapter for its merging of the two themes. The topic chosen to represent the convergence of the Catalonian independence context with that of soccer represents both themes nearly equally. Many terms are used to define multiple topics within the same pattern, but when terms which represent a second theme are added (observe the actionable topics in Table 4.8), most examples use a single term which bridges a primary theme with a second theme. In this case, in row (c) of Table 5.3, sport-specific terms represent 5/8 of the topic definition, while the remaining three terms are specifically Catalonian independence terms. The terms ‘catalan,’ ‘independ,’ and ‘catalonia’ are not commonly found in sport reporting. Reporting which focusses especially on FC Barcelona, Espanyol, Real Madrid, and the rest of Spain’s La Liga soccer league frequently uses identifies teams using the name of the city which their home field is, not by any other administrative or geographic area, such as Catalonia. In the political context, however, Catatalonia is a common reference to the semi- autonomous community with Barcelona at its center. Thus, mentions of the place primarily appear in the context of political reporting by the media. Eleven percent of the 100 topics generated by the model which also contain a high alpha parameter of 0.029 and a low minimum tf-idf value of 0.03 contain the Catalonia placename, the adjective derivative ‘Catalan,’ or the Spanish spelling of ‘Catalunya.’ In each of these 11 topics, Catalonia or one of its derivatives is combined with politically charged terms, including ‘independ,’ ‘vote,’ and the combination of ‘artur’ and ‘mas,’ referring to the former president of Catalonia who organized a nonbinding referendum vote on Catalonian

159 secession in 2014 and was subsequently banned from holding office in violation of Spanish law because of it (Minder 2017). Mentioning Catalonia, the language, or the cultural identity is not limited to a political context, but most often this is the frame in which mentions of the community appear. Thus, the combination of soccer-specific terms and political Catalonia context within a single topic is noteworthy. The two topics – rows (a) and (c) in Table 5.3 – share two terms in common – ‘catalan’ and ‘independ’ – while ‘catalunya’ and ‘catalonia’ also share a meaning as representative of the political context of the community. The rest of the topics’ terms is where they diverge between electoral politics and soccer. Those diverging themes are explored through the geographies which are contained in the articles generated by each of the topics. More importantly than the global patterns here are the specific places which are causing the differences in distribution between the two topics. Table 5.4 lists the 12 places with the largest variation in the number of mentions between both topics, six for each topic. Despite showing a more global distribution of place mentions, the Catalonian- specific topic also contains more mentions of the three most important places associated with Catalonian independence: Catalonia, Barcelona, and Madrid. This pattern is not a result of greater number of place mentions overall for the Catalonia-specific topic versus the soccer-specific one, since the difference of 20 placenames overall is not significant.

160 Mentions in Catalonia- Mentions in soccer- Place name Difference specific topic specific topic TOTAL 533 513 20

Barcelona 49 30 19 Catalonia 23 6 17 Madrid 54 37 17 Brussels 7 1 6 England 9 3 6 Scotland 16 10 6

Girona 0 4 4 Qatar 0 4 4 Germany 1 6 5 Alicante 1 9 8 Paris 3 12 9 France 2 16 14

Table 5.4. List of the places which have the largest difference in the number of mentions within articles pertaining to topics on Catalonian independence and Catalonian independence plus soccer.

The specific places in Table 5.4 also tell a story of geographic connections with respect to sports and politics. The soccer-specific topic contains high mentions of secondary cities in Spain, each of which represents a city with a soccer club with frequent matches against FC Barcelona; Alicante and . Paris Saint-Germain, located in Paris, France, is one of the primary international rivals of FC Barcelona and Real Madrid, which are also highly mentioned, but not to the degree that they are in the Catalonian independence-specific topic. The political context of each of the top six places with more mentions in the Catalonian independence topic is clear between the binaries of Spanish cities and between Scotland and England. The presence of Brussels on this list is further proof of the international political significance in corresponding news reports. Brussels is the headquarters of the European Union, frequently cited as an important entity to the establishment of a Catalonian state. Pro-independence officials have routinely called for EU support for secession (White and Larraz 2014), with frequent appeals for and response from the organization through Brussels. Additionally, the presence of Qatar in this list and Israel and Syria in Figure 5.6 suggests a unique link that soccer shares with the Arab world which is not considered in the Catalonian independence topic. Qatar has been linked to soccer through its controversial hosting of the soccer World Cup in 2022, but the root of the connection

161 between Barcelona, the tiny gulf nation, and soccer is through the economy. During the 2015 year, FC Barcelona was sponsored by Qatar Airways, symbolizing to many the international nature of the team (Lowe 2014). As discussed in Chapter 2, soccer sponsorships are an indication of international reach. Through this network – a ‘financescape’ to Appadurai (Appadurai 1996) – unique places are brought together, and in this case Qatar is specifically mentioned by name on the uniforms of the athletes and on banners around the stadium. Catalonia as a whole relies on these international connections to portray a global identity as well as a Catalan one. The connections between Barcelona and Qatar and Barcelona and Brussels signify the same requirement of an international nation-building exercise. Catalonia and Barcelona have been demonstrating this to the world through media portrayals of identity, economic networks, and sports, dating specifically to the 1992 Olympic Games (Kennett and Moragas 2006). This pattern contrasts with that of the Scottish independence movement, examined next. 5.4.3 Comparing Catalonian Independence to Scottish Independence The first comparison is between independence movements in Scotland and Catalonia, which are similar, but also different in many ways. The most significant difference, besides the obvious geographic contexts, is that the events discussed here did not occur in concurrent time frames. In one sense, the independence movements for both regions is ongoing, but the peaks of public sentiment and media attention do not coincide. Thus, thematic connections between the Catalonian movement, peaking at the parliamentary vote in September 2015, reference the Scottish movement (which peaked during the referendum vote, which resulted in a ‘no,’ a year prior, in September 2014) but not the other way around. The Scottish movement does have a unique spatial context and a historical set of precedents, just as does Catalonia’s movement. The comparison between the spatial content of news pertaining to the aftermath of Scotland’s movement and the concurrent movement and vote in Catalonia shows the geographic differences in those processes. Before exploring those differences, it is important to note that these topics are not entirely comparable from a frequency perspective. The Catalunya independence topic (144) contains less than half of the articles than the Scottish independence topic (310)

162 does. The LDA process generates topics of a wide range of sizes, especially at larger values of k. The difference in number of articles between these two topics creates a large difference in the number of mentions of placenames, with the Catalunya topic containing 533 mentions of places, and the Scottish independence topic containing 1350. Because of this great discrepancy, it makes little sense to compare the quantities of places mentioned. Despite this large difference, meaningful conclusions can be made by observing the global distributions of locations, however.

Mentions in Catalonia- Mentions in Scottish- Place name Difference specific topic specific topic TOTAL 533 1350 20

Vigo 7 0 7 Brussels 7 1 6 Las Palmas 6 0 6 Aberdeen 5 0 5 Japan 5 0 5 Manchester 4 0 4

US/New York 10 29 19 UK 7 27 20 France 5 27 22 Barcelona 49 92 43 Madrid 54 97 43 Spain 58 162 104

Table 5.5 List of the places which have the largest difference in the number of mentions within articles pertaining to topics on Catalonian independence and the Scottish independence referendum.

Although the counts of places are slightly misleading in Table 5.5 because of the drastic difference in total placename mentions between articles in these topics, a couple of patterns emerge that can be examined for geographic significance. First, the fact that several places do actually have greater mentions in the Catalonian topic than the Scottish one is significant. Only Brussels appears in both this and Table 5.4, further showing Catalonia’s constant and strategic referral to the EU headquartered there. Further locations with relevance to soccer appear here – Vigo and Manchester correspond to well-known soccer clubs. And the global pattern of places linked to Catalonia despite great distance is exemplified by Japan (and to a lesser extent not displayed here, Australia and Canada).

163 Most surprising is the politically-relevant location of Aberdeen appearing as significance to the Catalonian topic but not the Scottish topic. Aberdeen was the site of a major debate on the independence referendum in September 2014, which according to this data, does not appear in any articles related to Scottish independence. Since the topics do contain several of the same terms and a semantically similar independence context, it is expected that they would share much of the same locations. However, the lack of a greater presence of places in Scotland and the UK with respect to Scottish independence is perplexing. The Scottish independence movement and climaxing non-binding referendum vote captured the attention of American news sources perhaps more than the Catalonian parliamentary election did. Thus, it is not a surprise that several American places are frequently mentioned within the Scottish independence topic, as seen in Figure 5.6, especially the Unites States (and variants), New York, and Las Vegas. Only one North American news source was used (‘New York Times’), and few articles from this source appeared in the topics examined in this chapter. While undoubtedly, the New York Times, as well as any news source, would have mentioned a greater number of placenames within the country of its production, the places mentioned in the news pertaining to Catalonian independence in this section are mentioning international locations from sources within Spain or the UK (The Guardian). Thus, the Scottish movement has observed US interest in the referendum and included several links the American entertainment interests (Las Vegas, Times Square), and its financial sector (the New York Stock Exchange). Although the Catalonian movement has made a more impactful effort to incorporate international narratives into its reporting on the movement, the Scottish movement has made numerous appeals to the United States’ sentiment. Although the attention of the United States raises awareness, the identity-based approach of Catlonia has helped advance the movement from within. I believe this has led to a more positive outlook within Catalonia on the possibility of international support for an independent state than the outlook within Scotland. This could be the subject of additional research, as the media continues to report on both efforts to create independent states, and the movements continue to influence residents positively about the local and international consequences of independence. Catalonia’s

164 appeal to the European Union (again evident in this comparison, as the Catalonian topic mentions Brussels seven times to the Scottish topic’s single mention) for support now becomes more relevant to Scotland following Britain’s vow to exit the EU (Castle 2014). Both movements have a similar situation with the EU for the time being, as they exist inside of EU member states while vying to also become separate member states. Although the Catalonian context has been the more international of the two, Scotland’s case will force it to make its case, as the EU nations’ willingness to separate a Scottish community from a British one will determine their support with respect to Brexit. Catalonia has only been successful rallying local residents with their international emphasis, while Scotland’s majority has only voted against secession.

5.5 Conclusion This chapter has attempted to spatialize the geographic relationships among media reports during the Catalonian independence movement. The patterns observed here between type of media and topics within the media indicate important spatial relationships for understanding geopolitics through globalization and mass media with respect to this event and its connections with sport and the international community. The places which appear in news articles in similar thematic contexts indicates a social connection between places, even if those places are not directly discussed in the same article. By using density plots here, the effect of distance is minimized with respect to the reasons why Barcelona is linked with other places. Obviously, in Figures 5.4 and 5.6, distance decay does play an extreme factor in how places are linked. By the nature of the data collection and the news sources chosen, more locations throughout Spain are likely to appear in collected news articles than more distant locations. But it is those locations which are far away but critical to the mediascapes of conglomerate and international news and semantic topics within that news that underscore the geographic significance of the media’s portrayal of the Catalonian and Scottish narratives. Several important global cities do not appear significant to either region’s independence movement. But more importantly for defining how these places characterize their internationality and the

165 networks of places and people that they link together, are those places which are not close to Barcelona but share social, economic, and media connections. Despite the geographic significance evident through mapping mediascapes as a function of the media producers, their audiences, and their primary topics, media occasionally produces important geopolitical narratives without the use of placenames. Several articles collected in this study do not contain any spatial reference, and therefore have not appeared in this chapter in any form. Generally, such non-spatial articles are an exception to the normal reporting style of news. Closer examination of some of these articles shows that these are opinion pieces and calls to participate in elections and plan for the local and national future. In one particular example, the image of a local community is strong, despite not specifying any scale or location to accompany the image:

“…The responsibility lies with those individuals who have the opportunity to take action and make decisions. From this standpoint it is the owners in any Community who specifically have that power to make decisions and take action and if they allow things not to function they are responsible. It is our advice that you take control of your community as the power is in your hands to do so. You have the vote. You are responsible. If you fail to take responsibility and exercising your rights by your own default you are as responsible for empowering any professional negligence bullying or tyranny as the offenders themselves.” -‘Spain Spain Spain,’ collected Nov 6, 2015

The image of responsibility and community, explicitly in this case, reflects precisely the power of media to generate what Anderson calls an ‘imagined community’ in that the members of the community do not need to meet one another or be collocated to feel a binding connection (Anderson 1983, Appadurai 1996). The lack of locations in this article (ironically, the publication is adamant that you know it concerns Spain) shows that, while the geography of placenames bring specific locations together, as shown in this chapter, geographically significant discussions still appear without placename content. Articles without locations still emerge as helpful in thematic analysis, hence this example’s classification in a topic with a pro-independence theme.

166 The mediascape is difficult to define for several reasons discussed in this chapter, which also complicate the ability to automatically extract geographic information from the media. First, many automatic methods rely on Named Entity Extraction to formalize geographic information from the text, as we performed here. This step misses the critical example where articles contain no spatial reference at all. Second, many news sources’ geographies are extremely hard to pin down as a combination of production, audience, and content. Particularly, conglomerate-style news, as defined here, consists of an inconsistent geographic context among its sites and scales of production and audience. Assuming that the geographic context between international, national, and local news, where the sites of production and audience are coordinated, generate comparable geographies, mischaracterizes some sources. Conglomerate sources, as this chapter showed, contain unique geographies that the international and local news did not, despite what some see as an inconsistency in the way that conglomerates operate to produce unbiased and factual news (Champlin and Knoedler 2002, Williams 2002). Third, audiences are difficult to accurately measure in the modern digital context in which most people now consume their news. Many more news producers become dissociated from local context because their audiences are spread through diverse and global geographies. This chapter has shown that the mediascapes emerging from different media perspective paint different but important pictures about the ways that news is produced and the geography of geopolitical events is shared through various sources. Prioritizing some media over others with respect to its geographic facets – its sites of production, audience, and content (Rose 2012) – favors specific views on the event. Event data methods, in particular, should be aware of the geographies that they prioritize in choosing the most international or well-read sources from which to collect its information. Finally, NER is unable to detect and represent the entire geography of news as it related to the identity of the subjects of a report. Despite significant contextual refinement in NER software, like GeoTxt (Karimzadeh et al. 2013), the process still relies on text matching between terms found in text and a dictionary of placenames with their associated geographic location. Thus, misspelled locations (a disappointingly frequent occurrence among these popular news sources) and alternative spellings (‘Catalonia,’

167 Cataluña,’ and ‘Catalunya’ are examples) decrease the likelihood that places can be extracted accurately and mapped spatially. More importantly, however, is that peoples’ identity with place and culture are not reflected in placenames specifically. The fact of being ‘Catalan6’ carries a different meaning than ‘Catalonia’ the place. The inability of NER to extract information about identity or properties about a place is problematic for doing social analysis about the intended audiences of news. In the news collection analyzed here, where the data was collected with Catalonian and Scottish nationalism in mind, these places are fairly well- represented by both placenames and by their adjective derivatives. It is the potential links to other national identities that is being missed here. When the subject of an article is Catalonia, both terms ‘Catalonia’ and ‘Catalan’ are likely to appear, and the place represented. In no-less important instances, such as soccer players’ identities, they get a single mention – “the Brazilian striker” – and that spatial identity disappears from the analysis. Although there is a difference between the nation of Brazil and the descriptive fact about a person, those spatialities are tied together when the media chooses to identify them as coming from a different place. It is those media-facilitated connections that the mediascape theorizes, and this chapter has attempted to map. Perhaps the most important outcome of this mediascape mapping exercise is a better understanding of the ways that geographic narratives are advanced by different geographic scales of media to the public depending on how and where they access it. Local news sources are presumed to contain the best reporting on local issues. However, conglomerate news, where the production is removed from the context of the event’s location, also has the power to connect new places to promote global diffusion and emphasize significance beyond the local scale. In the context of Catalonian independence, global connections are crucial for establishing necessary economic financescapes and cultural ethnoscapes (Appadurai 1996) pertaining to Catalanism and Europeanization as distinct from its Spanish nation.

6 Catalan, incidentally, geolocates to the city in Turkey. Manual override is necessary to prevent this inaccuracy, as well as many others, from appearing in the spatial analysis.

168

6. Chapter 6:

Conclusions and Future Work

169 6.1 Summary This dissertation explored the intersections of several methodologies and spatial contexts in pursuit of geopolitical research on current global events. Merging text analysis methods, the spatiality of the news mediascape, and comprehensive evaluation with the help of interestingness measures, it provides a unique look into the internationality of the Catalonian independence movement and the ways that it utilizes globalization, sport, and European identity. It considers the evaluation of machine learning methods such that a careful consideration of what makes for an interesting or successful model is formalized and can be planned for via appropriate parameterization. Once an interesting theme is discovered, the semantic space created by the articles which comprise the given topic are compared for their spatial content and the geographic networks that that content generates via media connections. The media itself is influenced by spatial factors, however, so through mapping the mediascape of news producers and their audiences, the project spatializes three scales of news media’s engagement with the Catalonian independence narrative. Specifically, media brings together semantic and geographic spaces to produce several influential narratives on local and national identity in Catalonia. First, the presence of the Scottish movement and independence referendum the year prior to the elections in Catalonia proved an important context to reporting on precedent and global implications of pursuing independence for Catalonia. In comparing the international networks of news articles on these topics, it is clear that the Catalan media has more successfully looked toward global structures to produce narratives on Catalonian independence. The media links Catalonian identity with the international soccer network of the European Champions league and the global profile of Spain’s La Liga, showing that soccer’s financial and competitive networks are reflected in independence reporting as well. Parliamentary elections, which resulted in a majority of seats in Catalonia given to pro-independence legislators shows the maturity and significance of that narrative. Methodologically, topic modeling was recently introduced to the field of geographic information science to cope with big, textual data, but applying it here provides unique benefits over other text analysis methods used in the field. Without the need to generate an ontology of terms to themes before observing the text itself, this data-

170 driven process ensures that emergent narratives are captured and not pre-ordained by a known dictionary. To make this method work, comprehensive evaluation is necessary to ensure that a valid model also produces a useable result. The interestingness measures that are utilized here to do so represent a unique but necessary way of thinking about what makes a data-driven model successful and how one valid result might contribute more than another to knowledge about the news media. In the rest of this chapter, this project concludes by considering broader implications of the work and future iterations. First, it discusses several contributions to geographic scholarship and the publication of these contributions to the geographic literature. Next, it considers some important potential issues resulting from assumptions made through the course of this analysis and solutions to address them. Finally, it concludes by considering the next steps to continuing research in the area of media geography, the geopolitics of sport, and the changing global landscape from local sentiment to international geopolitics.

6.2 Contributions to Literature This project spans several important themes in the discipline of geography. Within these themes, specific parts of this project can be extracted and presented as parts of larger disciplinary movements. In this section, three particular categories of spatial research and the specific contributions which this research makes to them – popular geopolitics, geocomputation, and event data – are described. Each of these subdisciplinary areas presents an opportunity to publish this work in a separate context. 6.2.1 Popular Geopolitics Although the primary emphasis of the project is computational while popular geopolitics exists almost entirely in a qualitative methodological space within geography, the media-based focus of this project contributes to strands of the subfield of popular geopolitics. Popular geopolitics specifically considers the ways that geopolitics are conveyed through mass media and popular forms of geopolitical representation such as comic strips and film.

171 This project has considered the intersection of the geographies of producer and audience given both their location and their scale. Generally, popular geopolitics utilizes media at a single scale; the national scale. Local media was shown here to have a different geographical engagement with the geography of the Catalonian independence movement, indicating different geographic networks facilitated by the news media. Sport is a big part of those geographic networks, with major European cities connected through soccer matchups and other athletes competing in international competition. The medium of sport allows for the spread of awareness and support of Catalonian identity through these networks because of the Catalonian culture embraced by FC Barcelona and the resulting influx of that identity in the media. This application of popular geopolitics intersects established literature subsets in both the geography of media and of sport. At this intersection of media and national identity, networks between places and narratives produce local and international sentiment in the unique political situation of Catalonia’s attempt to build support for its own version of nation-building. Catalan identity is consumed in the Europeanizaton of its financial and media networks. 6.2.2 Geocomputation Although geocomputation began as a subfield devoted to computational representation of geography (Longley et al. 1998), it now takes on a clearly analytical structure, incorporating many types of spatial data, from text to Census variables at the local level (Farber et al. 2012). The contribution of this project to the field of GIScience is both conceptual and empirical. LDA is no longer a method that needs ‘introducing’ to geographic information science literature (Li et al. 2008, see, Shekhar, Zhang, and Huang 2009, Lansley and Longley 2016). The application of interestingness measures for evaluating topic models and other data-driven methods is not only critical to any machine learning process, it follows from recent debates in geography about strictly objective validation for these and similar methods (boyd and Crawford 2012). To this debate, this dissertation offers interestingness measures and a specific way of thinking about the ways that a model can be successful beyond their validity. Though not new to KDD, interestingness here is considered in both a strictly data science capacity, as well as in the novel spatial adaptation of the measures. Then, each measure is

172 tested, compared, and evaluated for its particular relevance to LDA and media analysis. The concept of interestingness contributes to GIScience literature as a way to push data- driven methods beyond current critiques (Miller and Goodchild 2014). Additionally, the empirically-derived recommendations for parameterizing LDA demonstrated here facilitate better use of a method quickly being adopted for semantic spatial analysis. 6.2.3 Event Data Although analysis of event data is not a common approach in geography, geographic information science can contribute to improving the technique which is common in political science. The spatial component of event data is a fairly recent addition, and one that creates some confusion when locations are mapped, but represent different scales of features. As a way of deriving spatial and semantic information from news media, event data has proven useful to geographers (Peuquet et al. 2015) and to high intelligence communities in the U.S. government. Event data relies on a comprehensive ontology of actors, places, and other terms indicating event themes throughout the world. Data-driven methods reduce the overhead and still produce meaningful classifications of the input news data. In publishing this work, a significant part of the product contains critiques of both event data and topic modeling approaches given a standard set of global pattern analysis objectives. The pre-defined dictionary approach of event data does not allow for emergent themes, while semantic modeling cannot be used to model temporal or spatial variation without separating the input data into arbitrary categories. Comparing both approaches establishes useful scenarios for both. But ultimately, a combination of methods to reduce their respective shortcomings would be a positive contribution to politically-inclined geographic journals such as Geopolitics or Political Geography, which have shown the desire to publish more quantitative geopolitical research.

6.3 Methodological Pitfalls and Solutions Despite the contributions this project has made to geographic scholarship and spatial analysis methods, several issues arose during the process which should be addressed in future iterations of the work. All applications using unstructured text to

173 derive data have some uncertainty, which could account for some of the semantic and spatial variation observed here in Chapters 4 and 5. But systematic sources of uncertainty are greater issues that remain to be dealt with. Some of these issues have been introduced in previous sections, but they are discussed here in a more comprehensive form with respect to the languages of news and the self-fulfilling prophecy. 6.3.1 Linguistic Effects International research of any kind generally requires comprehension of multiple languages. The ability to speak Spanish and Catalan was not necessary for this project, however the differences between reading, communicating, and nuanced interpretation of political language in context are quite vast. This sort of interpretation is difficult and subjective, but because LDA specifically reduces the context of a full piece of text down to several important terms, any evaluation necessarily requires an interpretation of single terms and how their combinations indicate important themes. The investigator has limited reading comprehension of Spanish and thus language presents a significant barrier in this project, as well as in similar projects to this. In this section, language not only presents an interpretive issue, but an analytical one as well. To approach a more accurate representation of the media perspectives involved in the Catalonian independence movement, media sources in English, Spanish, and Catalan were collected. As shown in Chapter 5, local news sources almost exclusively were written in Spanish or Catalan, and the exclusion of those sources would have systematically eliminated the local perspective on the issue. Collecting multi-lingual news is not a problem, but analyzing it in a data-driven dimension reduction methodology with several other languages is not possible. The few terms that unique languages share with one another almost ensures that a topic model will not assign two articles in different languages to the same cluster. The ultimate result will be some English clusters, some Spanish clusters, and some Catalan clusters. As a technical definition, this is still dimension reduction, however, reducing on the known variable of language provides no new or interesting evidence of semantic clustering. Instead, the languages of interest must be separated prior to analysis, and each language articles run through a separate topic modelling process. Text analysis packages in R have the ability to detect language, and using this method can separate articles in

174 different languages. This precludes the ability to directly compare the content of one language news versus another through an isolated topic model, but the resulting topics can be compared semantically, if interpreted into representative topics. Hence, the problem returns to accurate interpretation of textual summaries in a second language. In an effort to increase the interpretative accuracy of the Spanish language news throughout this project, an undergraduate Spanish-speaking assistant was hired for academic research credit to assist with topic interpretation. It quickly became clear in the course of interpreting both the linguistic and the contextual perspectives of the Spanish news that simple translation is insufficient for interpreting the results from a topic model. For example, after being convinced that a topic containing the Spanish term ‘campos’ had to do with recreational camping or refugee camps, the alternative translation of ‘campos’ into ‘fields’ and pertaining to improperly parsed html code would have been just as reliable as part of the given topic. Additionally, given the difficulty in reading stemmed vocabulary in English, neither of us were able to interpret the meaning of several stemmed Spanish terms. Our inability to effectively interpret the content of these topics meant that; first, we would be unable to reach an agreement about the content, let alone the interestingness of these topics, and second, that we would be even less accurate in attempting to evaluate news in Catalan. For this reason, the Catalan and Spanish language news collected from local and national sources within Catalonia and Spain were not analyzed for their respective topics here. With the help of a native Spanish and Catalan speaker familiar with the geopolitics of Catalonia, some of these issues could have been resolved for the purpose of comparing the topics between languages, and more particularly, topics between news reports at a local scale and at an international scale. The local perspective of somebody with knowledge of the and the ways that its identity is portrayed in the media internalized by its citizens would greatly add to the ability to translate, interpret, and compare the content of these topic models in a meaningful way. Although the attempt made here was insufficient for this purpose, in future applications using this data, the help of a co-evaluator and trusted consumer of Catalonian politics will be used to investigate this data more comprehensively.

175 Despite the difficulty and inconsistency, Spanish and Catalan news were not completely removed from this analysis. In Chapter 5, the effect of language on mediascape mapping was discussed as a function of the scale of the media source’s audience. The contextual clues that GeoTxt relies upon to determine the correct geographic location disappear in non-English text. Adding English context informs the geolocation process of the spatial scale of the location (adding ‘city,’ ‘region,’ or ‘country’), which sometimes helps to disambiguate between multiple candidate locations in the Geonames database. But non-English placenames are often also not discoverable by the Named Entity Recognition system unless it is written specifically to process text in that language. Stanford’s NER system does have implementations in multiple languages, including Spanish. The convenience of NER paired with Geonames search and JSON export within the GeoTxt system, plus the manual clarification necessary to produce accurate placenames outweighs the lengthy overhead of creating a similar workflow from scratch. With GeoTxt written for English, this also required some manual manipulation of text to convert the Spanish or Catalan spelling of placenames into English. Any time these manual edits are necessary, a degree of error is introduced in the possibility of collecting fewer placenames than existed in the text. This could be partly responsible for the much fewer mentions of placenames collected in the local news, as explored in Chapter 5. Introducing human error, especially in the form of ignorance of the linguistic content, could directly influence some of the patterns observed here, especially when the language presents such a systematic variation. Since all local news is non-English, the topics generated in an English text analysis represents only part of the overall news mediascape. It remains to be seen whether the topics generated by Spanish, Catalan, and English language topic models are comparable on a semantic level. In the sense that they all consist of the same number of terms; the topics are comparable. However, the systematic spatial nature of the languages used by news media means that any variation could due to linguistic factors or to media factors. With the help of a native speaker, this comparison would be more comprehensive and the factors would stand a better chance of being isolated for their semantic effects.

176 The effect of language on text analysis should not be understated. Some methodologies don’t depend on language specifically, such as the bag-of-words approach that LDA uses. The semantics of the words are irrelevant, but ultimately the generated topics do produce semantically coherent topics. Others, which require contextual clues to produce knowledge, do require language-specific implementations, would be beneficial to the process of extracting multilingual spatial content in future projects. The GeoTxt development team at Penn State is aware of this and is producing a standard version which automatically detects language and adjusts the NER process accordingly. 6.3.2 The Self-Fulfilling Prophecy of Evaluation The self-fulfilling prophecy refers to a hypothesis that is nearly guaranteed to come true because of the interaction between predictions, inputs, and results. The analogy is not a perfect match to the situation in this project, but it feels apt to describe the process of producing variations in topical results by using LDA parameter combinations, then evaluating those variations as functions of interestingness. Despite providing thorough methods for evaluation, it is not an accident that interestingness measures are directly influenced by LDA parameters. The effects of the model parameters, which are measured here via the selected interestingness measures, are designed to be noticeable by the same interesting features. The self-fulfilling prophecy is that, since LDA’s parameters specifically influence interesting features of the model, using those same interesting features to evaluate the differences between models reinforces known parameter impacts. The best example of this is the conciseness measure. Referred to as having fewer attribute-value pairs, a concise model in the case of LDA has a consistent number of values (articles) paired with a limited number of discovered attributes (topics). LDA controls for exactly that by specifying the number of topics as a parameter. Similarly, alpha controls for the likelihood that each article was generated by a single topic, which directly influences the diversity of the resulting pattern. Despite the self-fulfilling nature of the relationship between some of the parameters to LDA and some interestingness measures from KDD, Chapter 4 did show that combinations of parameters can influence interestingness in different ways, and that not all of the measures evaluated here experience the same scale of impact with parameterization. Although the effects of some parameters can be known ahead of time

177 without testing, this project showed that parameters have measurably different impacts on many ways of measuring a model’s interestingness.

6.4 Future Work Despite some of the issues in working with digital text data and international media producers, this research design has the ability to facilitate analysis in a number of similar applications with global geopolitical significance. The global and digital media facilitate the formation of virtual communities (Anderson 1983), connecting people across the world who share a belief in something. This project uses sport to consider the geopolitical significance of international compositions of athletes and audiences, which lead to geographic connections through the focus of the media. Sport facilitates these connections, as was demonstrated in this dissertation, but can be the primary source of international communities as well, rather than contributing to the establishment of a Catalan community both within the semi-autonomous Catalonia and outside of Spain. 6.4.1 Sport Sport remains a significant medium through which geopolitical relationships and their historical and cultural context play out in the public view. Sport proved important in the Catalonian independence movement as a means through which Catalonia, and Barcelona in particular, demonstrated an international network of influence and cultural community. The international structure of FC Barcelona ensures that media from a variety of international places will cover the team. Of course, international coverage of successful soccer teams is common, but the added context of FC Barcelona’s pro- Catalonia political stance means that the message that ‘Catalonia is not Spain’ is normalized for observers of the event. Sport has the tendency to mask the presence of political narratives within the course of play, because audiences are primarily attracted to and expect the media to portray an entertaining event. In the examples described in the following sections, sporting events present the potential for unique analyses of geopolitical relationships and popular attitudes pertaining to them.

178 6.4.1.1 The British Commonwealth Games First, the British Commonwealth Games presents an example where sports and geopolitics collide, but against the stated intentions of the event organizers. The British Commonwealth Games (BCG), similar to the Olympic Games, takes place every four years and sees athletes from nations currently or once a part of the Commonwealth compete in several individual and team events. The BCG Constitution attempts to divorce politics from friendly competition, stating in Bylaw 3 that “The Commonwealth Games are competitions between athletes and not contests between countries” (Sporting Intelligence 2013). Much research proves that national identity is consumed in sports (Jackson and Haigh 2008), and thus the attempt to focus solely on athletic competition in the BCG is a futile attempt which advances colonialist narratives of authority. Particular matchups between national athletes establish geopolitical binaries in historical and modern contexts, for example India-Pakistan, Malaysia-Singapore, and Bahamas- England. Political histories dictate the reactions by fans and local media, and perhaps long- term implications for international relations. Although attention to the BCG in the U.S. is small, a large portion of the world’s countries have the same feelings about the events as American feel about the Olympic Games. In a pilot study observing news containing mentions of the BCG and specific country pairs, evidence exists that there are regional interest levels in matchups between nations with political histories and between most commonwealth countries and England. This interest is enhanced when the stakes of the matchup are higher, as in the case where medals can be awarded. Additional semantic research is necessary for comparing local, regional, and international media narratives to explore the content of the increased media activity. A working hypothesis suggests that the political narratives between nations competing the BCG differ by their colonial histories and their political relationships with neighboring nations. In either case, the BCG’s Bylaw 3 is an overstated attempt to produce a spectator event which does not remove the effects of colonial history and its enduring politics. The BCG will take place again in Australia’s Gold Coast in April 2018. Data collection will begin anew to capture news produced throughout the commonwealth to further explore the political climate surrounding the event.

179 6.4.1.2 The World Baseball Classic The World Baseball Classic is a recent attempt to bring the globally expanding sport of baseball to a single international competition. Attempting to capitalize on the popularity of soccer’s World Cup, the WBC brings together nations every four years to play a month-long tournament prior to the Major League Baseball season. Contrary to the soccer World Cup, baseball consists of very few professional leagues and baseball is less developed as an international event, so the nationality requirements to compete for a country are quite low. International baseball is dominated by a couple of world regions, but the emergence of new national competitors and regional politics make the WBC an ideal medium for exploring specific geopolitics. Despite U.S. Major League Baseball’s reign as the premier professional baseball league in the world and the American venues holding the finals of each WBC event to date, the U.S.A has underperformed in the WBC until the 2017 competition. Still, the event is promoted and organized by American interests, and so national pride and ownership of the event lacks diversity. There is a colonial feel to the event, with the United States vying to spread the sport’s influence across the world through friendly competition with little more at stake than bragging rights. 6.4.2 Other Useful Domains The more general, media-based framework used here to explore nationalism and globalization applies to politics at multiple scales. Here, the local and international media were compared for the geographic networks that they facilitate. Even established democracies have identity crises, as Britain’s impending exit from the European Union continues to show. Additionally, a strictly local adaptation of these methods provides a unique perspective into urban systems and interactions between people and public space. 6.4.2.1 Brexit The so-called Brexit movement, through which Great Britain endeavors to end its relationship with the European Union gives rise to a complex international political moment emphasizing nationalism over international cooperative structure. Though bound to happen through the efforts of both Britain and the EU, Brexit is causing a logistical nightmare amid fear of increasingly complex international boundaries between the UK and the rest of Europe (Vale 2017). The full consequences of Brexit remain to be seen,

180 but international politics are already in the process of being shaken up, and further studies of the geographic relationships and semantic spaces presented by European and British news media could facilitate more local-scale comparison of geopolitics. A media analysis of Brexit could take multiple forms to evaluate sentiment throughout Europe and the rest of the world. First, national media producers would provide good sources of sentiment on the issue from a comparison of multiple perspectives, including nations with solid EU membership, nations with ambitions of joining the EU, and those in between with direct and measurable impacts of the Brexit campaign, particularly Scotland, Ireland, and even Catalonia. More importantly than media presentations of European identity through the EU structure, are the ways that individuals read the news, internalize its message, and react publically. Any study of media impacts and reactions to Brexit should include scrutiny of public engagement with the media. Although the voting period is over and the Brexit process is largely underway, public perception is just as important for understanding the underlying sentiment which makes nationalist processes possible. 6.4.2.2 Real-Time Analysis The temporal dimensions of changing topics have not been explored here because of the relatively short time scale of data collected, the objective of studying the more wide-ranging spatial and thematic aspects of media reports, and the general lack of methodological consideration of dynamic topics with respect to time. Experiments have been made into dynamic topic models based on LDA (Blei and Lafferty, 2006), but much work remains considering the nature of a changing set of topics and the impact that individual documents have on the reduced semantic dimensionality over those documents. A dynamic topic model, in the procedural structure of LDA, would re- compute the topic structure with each additional document through time, which would change the overall topic structure and potentially reclassify existing documents into topics which they were not originally assigned to. Since documents that change with time were at one point added to the document collection with a specific topic relevant at the time, this dynamic process has the ability to change previous facts; it would alter the past. In applications which track themes over time, this approach is clearly problematic.

181 However, many applications could utilize real-time textual input to compare changing themes through a period of time. For example, online public reporting mechanisms allow citizens to generate and view qualitative reports about the conditions of their neighborhood. Importantly, these reporting tools are monitored by public service providers to direct the relevant agent to assist in such cases as unlit streetlights, broken signs, and downed trees (for example, see FixYourStreet.com, maintained by Dublin City Council). In an example such as this, a sevice provider might be interested in knowing about general themes and trends with respect to public needs. Past topic proportions as well as current ones are important for both classification of reports and for comparing changes in urban needs over time. Thus, any dynamic topic model must maintain a history of classified reports, as well as a current record of unresponded-to reports and newly appearing ones. A streaming visualization method for displaying topics would accomplish these goals. Because a topic model is a static snapshot summary of the input texts, a real-time application of topic modeling needs to update with each change to those texts and show connections between subsequent models. A visual presentation, such as the one presented in (Havre et al 2002), where time flows horizontally along the x-asix and unique topics are grouped along the y-axis, would do several things to improve communication of the benefits of topic modeling, as well as make the technique usable in real time. First, by measuring topics through the number of texts assigned to them, an overall summary of the data is presented, and that summary can be tracked with changing texts. And second, the nature of the procedure is better understood via the changing combinations of terms which define topics. Although the assignment of texts to topics is not stationary, the primary themes emerge by comparing the similarities between texts. 6.4.3 Measuring Engagement with Digital Media This analysis of media’s sites and scales of production and audience have produced interesting information about how different sources engage with geopolitics, but without actually exploring the audience of the media, it falls short of being able to explain audience engagement with its content. In the next iteration of this research, public participation and sentiment with regard to current events and the news media’s production of those events can be explored using online comments, digital tracking,

182 speaking with media consumers, and social media engagement in semantic and quantitative ways. There is much work to be done in linking news reporting with social media beyond the obvious network of shares, retweets, and replies. Semantic analysis approaches this goal by reducing articles to combinations of key terms, combined with text search and topic analysis models tailored specifically for social media. Within Twitter, the limits on length in social media posts creates an environment with less sentence structure and greater chance of misspelling to conserve space within the character limit. Therefore, topic models are less effective on Twitter posts, especially when compared with other forms of text which abide by more standard grammatical rules. Similar to running a topic model when the input contains multiple languages, the format difference between news and social media creates an additional factor with potential to outweigh the semantic topics factors. Comment boards are simple to retrieve as text on an article’s website, but the dynamic nature of comments makes message board scraping an uncertain task temporally. Comment boards are notoriously sites of trolls and other nonproductive engagement with other commenters, more than of the content itself. Although geopolitical content remains the driving point behind semantic analysis, this methodology may also shed light on the recent proliferation of the concept of ‘fake news,’ and particularly how polarizing viewpoints are shared and advanced through the population. As more of the population access their news via indirect sources such as social media, which emphasize commentary prior to content, news and commentary merge. 6.4.3 Projections Neither the Scottish nor the Catalonian movements have resulted in any substantial geopolitical changes. Now three and two years removed from the votes themselves, both nations stand pat with similar situations as when they began the process of seeking local and international support for independence. Hence, the findings comparing the geographies of these two movements remains relevant, except that the European context is quickly evolving and forcing Scotland and Catalonia to the front of geopolitical attentions throughout the world. Despite a ‘no’ vote outcome from Scotland’s previous attempt to gauge public opinion, renewed efforts to break from the

183 United Kingdom, and increased belief that the European Union will reverse course and accept an autonomous Scottish state (Johnston 2017) have fueled new governmental and public efforts to again seek the right to secede from the UK. Although in a separate situation, where Spain is already an EU member state and will remain so, Catalonia has the support of its people and a well-established global network of economic, social, and media flows to produce support for cultural and political liberation from Spanish administration. In the current and near future, these geographic situations will continue to feed off of one another to produce global narratives of unique cultural and social identity, international identity through the connections between places, and the further production of diasporic communities of people and pro-independence sentiment. Their ability to establish and leverage those networks via the media and via sport have separated the two movements in terms of outlook, but as these networks increase in global scale, particularly revolving around the European Union and European identity, these movements will capture the world’s attention, and people throughout the world will understand how geography connects their experience to the experiences of the Scottish and Catalonian people. As global geopolitics continues to progress and evolve, so too does this project, capturing the mediascapes of geopolitics, sport, and the imagery portrayed by the media in useful and interesting ways.

184 References Aizawa, Akiko. 2003. "An information-theoretic perspective of tf-idf measures." Information Processing & Management 39 (1):45-65. Allen, James F. 1983. "Maintaining knowledge about temporal intervals." Communications of the ACM 26 (11):832-843. Alvarez-Garcia, J. A., J. A. Ortega, L. Gonzalez-Abril, and F. Velasco. 2010. "Trip destination prediction based on past GPS log using a Hidden Markov Model." Expert Systems with Applications 37 (12):8166-8171. doi: 10.1016/j.eswa.2010.05.070. Anderson, Benedict. 1983. Imagined Communities: Reflections on the Origin and Spread of Nationalism. London, UK: Verso. Anderson, Chris. 2008. The End of Theory: The Data Deluge Makes the Scientific Method Obsolete. Wired Magazine 16 (7). Accessed June 23. Andrienko, Gennady, Natalia Andrienko, Daniel Keim, Alan M. MacEachren, and Stefan Wrobel. 2011. "Challenging problems of geospatial visual analytics." Journal of Visual Languages & Computing 22 (4):251-256. Appadurai, Arjun. 1996. Modernity at Large: Cultural Dimensions of Globalization. Minneapolis: University of Minnesota Press. Aron, Leon. 2014. The Putin Olympics. Politico Magazine. Accessed April 13, 2016. Arva, Bryan, John Beieler, Bejamin Fisher, Gustavo Lara, Philip A. Schrodt, Wonjun Song, Marsha Sowell, and Sam Stehle. 2013. "Improving Forecasts of International Events of Interest." European Political Science Association Annual General Conference, Barcelona, Spain, June 20-22. Augusiak, Jacqueline, Paul J. Van den Brink, and Volker Grimm. 2014. "Merging validation and evaluation of ecological models to ‘evaludation’: A review of terminology and a practical approach." Ecological Modelling 280:117-128. doi: 10.1016/j.ecolmodel.2013.11.009. Beieler, John, Patrick T. Brandt, Andrew Halterman, Philip A. Schrodt, and Erin M. Simpson. 2016. "Generating Political Event Data in Near Real Time: Opportunities and Challenges." In Computational social science: discovery and prediction, 98-120. New York, NY: Cambridge University Press. Biro, Istvan, Jacint Szabo, and Andra A. Benczur. 2008. "Latent dirichlet allocation in web spam filtering." AIRWeb: Adversarial Information Retrieval on the Web, New York, NY. Blair, Benjamin D., Christopher M. Weible, Tanya Heikkila, and Darrick Evensen. 2016. "Comparing Human and Automated Coding of News Articles on Hydraulic Fracturing in New York and Pennsylvania." Society & Natural Resources 29 (7):880-884. doi: 10.1080/08941920.2016.1150543. Blei, David M. 2012. "Topic Modeling and Digital Humanities." Journal of Digital Humanities 2 (1). Blei, David. M. & John D. Lafferty. 2006. “Dynamic topic models”. In ICML 23rd International Conference on Machine Learning. Pittsburgh, PA. June 25-29 Blei, David M., and John D. Lafferty. 2007. "A Correlated Topic Model of Science." The Annals of Applied Statistics 1 (1):17-35.

185 Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. "Latent Dirichlet Allocation." Journal of Machine Learning Research 3:993-1022. Bogorny, V., B. Kuijpers, and L. O. Alvares. 2008. "Reducing uninteresting spatial association rules in geographic databases using background knowledge: a summary of results." International Journal of Geographical Information Science 22 (4):361-386. doi: 10.1080/13658810701412991. Borden, Sam. 2013. The Invisible Team: Barcelona's Little Brother, Espanyol. The New York Times. Accessed 6/20/2017. boyd, danah, and Kate Crawford. 2012. "Critical Questions for Big Data: Provocations for a cultural, technological, and scholarly phenomenon." Information, Communication & Society 15 (5):662-679. doi: 10.1080/1369118x.2012.678878. Braun, Joshua, and Tarleton Gillespie. 2011. "Hosting the Public Discourse, Hosting the Public." Journalism Practice 5 (4):383-398. doi: 10.1080/17512786.2011.557560. Cao, Nan, Yu-Ru Lin, Xiaohua Sun, David Lazer, Shixia Liu, and Huamin Qu. 2012. "Whisper: tracing the Spatiotemporal Process of Information Diffusion in Real Time." IEEE Transactions on Visualization and Computer Graphics 18 (12):2649-2658. Caragea, Cornelia, Anna Squicciarini, Sam Stehle, Kishore Neppalli, and Andrea Tapia. 2014. "Mapping Moods: Geo-Mapped Sentiment Analysis During Hurricane Sandy." Proceedings of the 11th International ISCRAM Conference - University Park, PA. May 2014. S.R. Hiltz, M.S. Pfaff, L. Plotnick, A.C. Robinson eds. Castle, Stephen. 2014. Scotland Votes to Deman a Post-'Brexit'Independence Referendum. New York Times Magazine. Accessed July 27, 2017. Chae, Junghoon, Dennis Thom, Harold Bosch, Yun Jang, Ross Maciejewski, David S. Ebert, and Thomas Ertl. 2012. "Spatiotemporal Social Media Analytics for Abnormal Event Detection and Examination using Seasonal-Trend Decomposition." IEEE Conference on Visual Analytics Science and Technology, Seattle, WA. Champlin, Dell, and Janet Knoedler. 2002. "Operating in the Public Interest or in Pursuit of Private Profits? News in the Age of Media Consolidation." Journal of Economic Issues 36 (2):459-468. Chandrasekharan, M.P., and R. Rajagopalan. 1989. "GOUPABIL1TY: an analysis of the propertis of binary data matrices for group technology." International Journal of Production Research 27 (6):1035-1052. lda: Collapsed Gibbs Sampling Methods for Topic Models. R package version 1.4.2. Chang, Jonathan, Jordan Boyd-Graber, Sean Gerrish, Chong Wang, and David M. Blei. 2009. "Reading tea leaves: How humans interpret topic models." Advances in Neural Information Processing Systems 22:288-296. Cheng, Daniel, Peter Schretlen, Kathan Kronenfeld, Neil Bozowski, and William Wright. 2013. "Tile based visual analytics for Twitter big data exploratory analysis." IEEE International Conference on Big Data, Santa Clara, CA, Oct. 6-9. Cunningham, Hamish. 2002. "GATE, a General Architecture for Text Engineering." Computers and the Humanities 36 (2):223-254.

186 D'Ignazio, Catherine, Rahul Bhargava, Ethan Zuckerman, and Luisa Beck. 2014. "CLIFF-CLAVIN: Determining Geographic Focus for News Articles." NewsKDD, New York, NY. Dalton, Craig M., Linnet Taylor, and Jim Thatcher. 2016. "Critical Data Studies: A dialog on data and space." Big Data & Society:1-9. doi: 10.1177/2053951716648346. Dalton, Craig, and Jim Thatcher. 2014. What does a critical data study look like, and why do we care? Seven points for a critical approach to 'big data'. Society and Space open site. Deerwester, Scott, Susan T. Dumals, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. "Indexing by Latent Semantic Analysis." Journal of the American Society for Information Science 41 (6):391-407. DeLyser, D., and D. Sui. 2013. "Crossing the qualitative- quantitative divide II: Inventive approaches to big data, mobile methods, and rhythmanalysis." Progress in Human Geography 37 (2):293-305. doi: 10.1177/0309132512444063. Dempster, A.P, N.M. Laird, and D.B. Rubin. 1977. "Maximum Likelihood from Incomplete Data via the EM Algorithm." Journal of the Royal Statistical Society B 39 (1):1-38. DeSantis, Chris. 2016. Kuwait Remians Banned from Olympics as IOC Talks Break Down. Swim Swam. Accessed April 13, 2016. Dick, Murray. 2014. "Interactive Infographics and News Values." Digital Journalism 2 (4):490-506. Ding, Li, Tim Finin, Anupam Joshi, Rong Pan, R. Scott Cost, Yun Peng, Pavan Reddivari, Vishal Doshi, and Joel Sachs. 2004. "Swoogle: a search and metadata engine for the semantic web." ACM international conference on Information and knowledge management, New York, NY. Dittmer, Jason. 2007. "The tyranny of the serial: popular geopolitics, the nation, and comic book discourse." Antipode 39 (2):247-268. Dittmer, Jason, and Klaus Dodds. 2008. "Popular Geopolitics Past and Future: Fandom, Identities and Audiences." Geopolitics 13 (3):437-457. doi: 10.1080/14650040802203687. Do, Chuong B, and Serafim Batzoglou. 2008. "What is the expectation maximization algorithm?" Nature Biotechnology 26 (8):897-899. Dodds, Klaus. 2006. "Popular geopolitics and audience dispositions: James Bonds and the Internet Movie Database (IMDb)." Transaction of the Institute of British Geographers 31 (2):116-130. Domingos, Pedro. 1999. "The Role of Occam's Razor in Knolwedge Discovery." Data Mining and Knowledge Discovery 3 (4):409-425. Dowler, Lorraine, and Joanne Sharp. 2001. "A Feminist Geopolitics?" Space & Polity 5 (3):165-176. Edsall, Robert M. 2007. "Iconic Maps in American Political Discourse." Cartographica: The International Journal for Geographic Information and Visualization 42 (4):335-347. Empetrisor. 2016. Dirichlet-3d-panel.png. No modifications from original: Creative Commons Attribution-Share Alike 4.0 International.

187 ESPN FC 2014. Barcelona confirm support for Catalonia independence vote. Accessed April 12, 2016. Falah, Ghazi-Walid, Colin Flint, and Virginia Mamadouh. 2006. "Just War and Extraterritoriality: The Popular Geopolitics of the United States' War on Iraq as Reflected in Newspapers of the Arab World." Annals of the Association of American Geographers 96 (1):142-164. Färber, Ines, Stephan Günnermann, Hans-Peter Kriegel, Peer Kröger, Emmanuel Müller, Erich Schubert, Thomas Seidl, and Arthur Zimek. 2010. "On Using Class-Labels in Evaluation of Clusterings." Procedings of the 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington D.C. Farber, Steven, Tijs Neutens, Harvey J. Miller, and Xiao Li. 2012. "The Social Interaction Potential of Metropolitan Regions: A Time-Geographic Measurement Approach Using Joint Accessibility." Annals of the Association of American Geographers:120705112556007. doi: 10.1080/00045608.2012.689238. Fox News 2013. Brazil soccer referee killed during match; his head displayed on stake midfield. Accessed Nov. 23, 2016. Frick, Bernd, Joachim Prinz, and Karina Winkelmann. 2003. "Pay Inequalities and Team Performance." International Journal of Manpower 24 (4):472-488. Gahegan, Mark, Masahiro Takatsuka, Mike Wheeler, and Frank Hardisty. 2002. "Introducing GeoVISTA Studio: an integrated suite of visualization and computational methods for exploration and knowledge construction in geography." Computers, Enironment, and Urban Systems 26 (4):267-292. doi: 10.1016/S0198-9715(01)00046-1. Gasher, Mike. 2009. "Mapping the Online News World: A News-flow Study of the Three U.S. Dailies." aether: the journal of media geography iV:101-116. Gasher, Mike, and Reisa Klein. 2008. "Mapping the Geography of Online News." Canadian Journal of Communication 33:193-211. Geng, Liqiang, and Howard J. Hamilton. 2006. "Interestingness Measures for Data Mining: A Survey." ACM Computing Surveys 38 (3):9:2 - 9:32. doi: 10.1145/. Gerner, Deborah J., Philip A. Schrodt, Ronald A. Francisco, and Judith L. Weddle. 1994. "Machine Coding of Event Data Using Regional and International Sources." International Studies Quarterly 38 (1):91-119. Getis, Arthur, and J.K. Ord. 1992. "The Analysis of Spatial Association by Use of Distance Statistics." Geographical Analysis 24 (3):189-206. Gibson, Owen. 2015. How IMG spreads Premier League's global brand - from a trading estate near Heathrow. The Guardian. Gold, Matthew K. 2012. Debates in the Digital Humanities. Minneapolis, MN: University of Minnesota Press. Gordon, Andrew D. 1981. Classification: Methods for the Exploratory Analysis and Multivariate Data. New York, NY: Chapman and Hall. Gordon, Robert S.C., and John London. 2006. "Italy 1934: Football and Fascism." In National Identity and Global Sports Events, edited by Alan Tomlinson and Christopher Young. Albany, NY: SUNY Press. Gould, Peter. 1981. "Letting the data speak for themselves." Annals of the Association of American Geographers 71 (2):166-176.

188 Griffiths, T. L., and M. Steyvers. 2004. "Finding scientific topics." Procedings of the National Academy of Sciences 101 (1):5228-35. doi: 10.1073/pnas.0307752101. Grün, Bettina, and Kurt Hornik. 2011. "topicmodels: An R Package for Fitting Topic Models." Journal of Statistical Software 40 (13):1-30. Gruteser, Marco, and Dirk Grunwald. 2003. "Anonymous Usage of Location-Based Services Through Spatial and Temporal Cloaking." 1st international conference on Mobile systems, applications and services, San Francisco, CA. Hampton, Stephanie E., Carly A. Strasser, Joshua J. Tewksbury, Wendy K. Gram, Amber E. Budden, Archer L. Batcheller, Clifford S. Duke, and John H. Porter. 2013. "Big data and the future of ecology." Frontiers in Ecology and the Environment 11 (3):156-162. Harding, David. 2014. Catalan Independence Could Mean Barca’s La Liga Days Are Numbered. Vocativ. Accessed Nov. 29 2014. Hartshorne, Richard. 1955. "'Exceptionalism in geography' re-examined." Annals of the Association of American Geographers 45 (3):205-244. Harvey, David. 1972. "Revolutionary and counter revolutionary theory in geography and the problem of ghetto formation." Antipode 4 (2):1-13. Havre, S., E. Hetzler, P. Whitney & L. Nowell (2002) “ThemeRiver: visualizing thematic changes in large document collections”. IEEE Transactions on Visualization and Computer Graphics, 8, 9-20. Hay, Iain, and Mark Israel. 2001. "'Newsmaking geograph': communicating geography through the media." Applied Geography 21 (2). Hey, Tony, Stewart Tansley, and Kristin Tolle, eds. 2009. The Fourth Paradigm: Data- Intensive Scientific Discovery. Redmond, WA: Microsoft Research. Hilderman, Robert J., and Howard J. Hamilton. 2001. Knowledge discovery and measures of interest. New York, NY: Springer. Hill, Christopher R. 1999. The Cold War and the Olympic Movement. History Today 49 (1). Accessed April 13, 2016. Hjarvard, Stig. 2001. "News Media and the Gloablization of the Public Sphere." In News in a Globalized Society, edited by Stig Hjarvard, 17-39. Gothenburg, Sweden: NORDICOM. Hoffman, Thomas. 2001. "Unsupervised Learning by Probabilistic Latent Semantic Analysis." Machine Learning 42 (1-2):177-196. Howe, Peter D. 2009. "Newsworthy Spaces: The Semantic Geographies of Local News." Aether: the journal of media geography IV:43-61. Huebner, Richard A. 2009. "Diversity-Based Interestingness Measures for Association Rule Mining." Proceedings of American Society for Behavioral and Business Science, Las Vegas, NV. International Olympic Committee 2013. Olympic Charter. Lausanne, Switzerland Jackson, Steven J., and Stephen Haigh. 2008. "Between and beyond politics: Sport and foreign policy in a globalizing world." Sport in Society 11 (4):349-358. doi: 10.1080/17430430802019169. Janowicz, Krzysztof, Martin Raubal, and Werner Kuhn. 2011. "The semantics of similarity in geographic information retrieval." Journal of Spatial Information Science 2:29-57.

189 Jarvie, G., and I. Reid. 1997. "Race relations, sociology of sport and the new politics of race and racism." Leisure Studies 16 (4):211-219. Johnston, Alison. 2017. Scotland's independence vote will complicate Brexit in some very interesting ways. The Washington Post. Accessed July 31, 2017. Jones, Christopher B., and Ross S. Purves. 2008. "Geographic Information Retrieval." International Journal of Geographical Information Science 22 (3):219-228. Karimzadeh, Morteza, Wenyi Huang, Siddhartha Banerjee, Jan Oliver Wallgrün, Frank Hardisty, Scott Pezanowski, Prasenjit Mitra, and Alan M. MacEachren. 2013. "GeoTxt: A Web API to Leverage Place References in Text." Geographic Information Retrieval, Orlando, FL. Karlis, Dimitris, and Evdokia Xakalaki. 2002. "Choosing initial values for the EM algorithm for finite mixtures." Computational Statistics and Data Analysis 41:577-590. Kassam, Ashifa. 2015. Catalonia goes to the pools in an 'incredible moment for democracy'. The Guardian. Accessed September 30, 2015. Kennett, Christopher, and Miguel de Moragas. 2006. "Barcelona 1992: Evaluating the Olympic Legacy." In National Identity and Global Sports Events, edited by Alan Tomlinson and Christopher Young. Albany, NY: SUNY Press. Kitchin, R. 2014. "Big Data, new epistemologies and paradigm shifts." Big Data & Society 1 (1). doi: 10.1177/2053951714528481. Knorr, Edwin M., Raymond T. Ng, and Vladimir Tuckakov. 2000. "Distance-based outliers: algorithms and applications." The International Journal on Very Large Data Bases 8 (3-4):237-253. Korson, Cadey. 2014. "Political Agency and Citizen Journalism: Twitter as a Tool of Evaluation." The Professional Geographer 67 (3):364-373. doi: 10.1080/00330124.2014.970839. Krätke, Stefan, and Peter J. Taylor. 2004. "A world geography of global media cities." European Planning Studies 12 (4):459-477. doi: 10.1080/0965431042000212731. Kwan, Mei-Po. 2002a. "Feminist Visualization: Re-Envisioning GIS as a Method in Feminist Geographic Research." Annals of the Association of American Geographers 92 (4):645-661. Kwan, Mei-Po. 2002b. "Is GIS for Women? Reflections on the Critical Discourse in the 1990s." Gender, Place & Culture 9 (3):271-279. doi: 10.1080/0966369022000003888. Kwan, Mei-Po. 2008. "From oral histories to visual narratives: re-presenting the post- September 11 experience of the Muslim women in the USA." Social & Cultural Geography 9 (6):653-669. Lai, Victoria, and William Rand. 2013. "How do Twitter Coversations Differ based on Georgaphy, Time, and Subject?" IEEE International Conference on Social Computing, Amsterdam, Aug. 15, 2013. Lansley, Guy, and Paul A. Longley. 2016. "The geography of Twitter topics in London." Computers, Environment and Urban Systems 58:85-96. doi: 10.1016/j.compenvurbsys.2016.04.002. Laube, Patrick. 2014. Computational Movement Analysis. Zuruch, Switzerland: Springer.

190 Laube, Patrick, Stephan Imfeld, and Robert Weibel. 2005. "Discovering Relative Motion Patterns in Groups of Moving Point Objects." International Journal of Geographical Information Science 19 (6):639-668. Laube, Patrick, and Ross S. Purves. 2006. "An Approach to Evaluating Motion Pattern Detection Techniques in Spatio-Temporal Data." Computers, Enironment, and Urban Systems 30:347-374. Leetaru, Kalev, and Philip A. Schrodt. 2013. "GDELT: Global Data on Events, Location, and Tone, 1979-2012." International Studies Association Meeting 2013, San Francisco, CA, April, 2013. Li, Zhisheng, Chong Wang, Zing Zie, Xufa Wang, and Wei-Ying Ma. 2008. "Exploring LDA-Based Document Model for Geographic Information Retrieval." In Advances in Multilingual and Multimodal Information Retrieval, edited by C. Peters, V. Jikoun, T. Mandl, H. Müller, D.W. Oard, A. Peñas, V. Petras and D. Santos. Berlin: Springer. Lieberman, Michael D., and Hanan Samet. 2012. "Adaptive context features for toponym resolution in streaming news." Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, New York, NY. Logan, Beth, Andrew Kositsky, and Pedro Moreno. 2004. "Semantic analysis of song lyrics." IEEE Conference on multimedia and expo, Taipei, Taiwan, June 27-30, 2004. Lombard, Matthew, Jennifer Snyder-Duch, and Cheryl Campanella Bracken. 2002. "Content Analysis in Mass Communication: Assessment and Reporting of Intercoder Reliability." Human Communication Research 28 (4):587-604. Longley, P.A., Susan Brooks, W. Macmillan, and R.A. McDonnel. 1998. Geocomputation: a primer. London, UK: Wiley. Lowe, Sid. 2014. Where will Barcelona and Espanyol play if Catalonia ge independence? The Guardian. Accessed August 2, 2017. Lukoianova, Tatiana, and Victoria L. Rubin. 2014. "Veracity roadmap: Is big data objective, truthful and credible?" Advances in Classification Research Online 24 (1):4-15. Manning, Christopher D., Mihai Sureanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. "The Stanford CoreNLP Natural Language Processing Toolkit." Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Baltimore, MD, June 23- 24. Massey, Doreen. 1999. "Space-Time, ‘Science’ and the Relationship Between Physical Geography and Human Geography." Transactions of the Institute of British Geographers 24:261-276. McFarlane, Thomas, and Iain Hay. 2003. "The battle for Seattle: protest and popular geopolitics in The Australian newspaper." Political Geography 22 (2):211-232. doi: 10.1016/s0962-6298(02)00090-2. Mennis, Jeremy, and Diansheng Guo. 2009. "Spatial data mining and geographic knowledge discovery—An introduction." Computers, Environment and Urban Systems 33 (6):403-408. doi: 10.1016/j.compenvurbsys.2009.11.001.

191 Miller, Harvey J. 2010. "The Data Avalanche Is Here. Shouldn’t We Be Digging?" Journal of Regional Science 50 (1):181-201. doi: 10.1111/j.1467- 9787.2009.00641.x. Miller, Harvey J., and Michael F. Goodchild. 2014. "Data-driven geography." GeoJournal. doi: 10.1007/s10708-014-9602-6. Miller, Harvey J., and Jiawei Han. 2009. Geographic Data Mining and Knowledge Discovery. 2 ed. Boca Raton, FL: CRC Press. Minder, Rashael. 2017. Artur Mas, Former Catalan Leader, Is Barred From Holding Office. New York Times. Accessed July 25, 2017. MIT Technology Review 2013. The Big Data Conundrum: How to Define it? Mitchell, Amy, Jeffrey Gottfried, Michael Barthel, and Elisa Shearer. 2016. The Modern News Consumer: News attitudes and practices in the digital era. Pew Research Center. Mitchelstein, Eugenia, and Pablo J. Boczkowski. 2009. "Between tradition and change: A review of recent research on online news production." Journalism 10 (5):564- 586. Monroe, Burt. 2013. "The Five Vs of Big Data Political Science: Introduction to the Virtual Issue on Big Data in Political Science." Political Analysis Virtual Issue 4. Nelson, John S., and G. R. Boynton. 1997. Video Rhetorics: Televised Advertising in American Politics. Urbana, Illinois: University of Illinois Press. O'Sullivan, David, and Steven M. Manson. 2015. "Do Physicists Have Geography Envy? And What Can Geographers Learn from It?" Annals of the Association of American Geographers 105 (4):704-722. Ó'Tuathail, Gearóid, and John Agnew. 1992. "Geopolitics and discourse: Practical geopolitical reasoning in American foreign policy." Political Geography 11 (2):190-204. Padmanabhan, Balaji, and Alexander Tuzhilin. 2000. "Small is beautiful: discovering the minimal set of unexpected patterns." ACM SIGKDD international conference on Knowledge discovery and data mining, Boston, MA, Aug 20-23, 2000. Pak, Alexander, and Patrick Paroubek. 2010. "Twitter as a Corpus for Sentiment Analysis and Opinion Mining." Conference on Language Resources and Evaluation, Malta, May 17-23. Pal, Mahesh. 2005. "Random forest classifier for remote sensing classification." International Journal of Remote Sensing 26 (1):217-222. doi: 10.1080/01431160412331269698. Peck, Brooks. 2014. "Serbia-Albania match abandoned after drone flies banner over the pitch." Yahoo Sports, Last Modified April 20, 2016. . Pelak, Cynthia Fabrizio. 2005. "Athletes as Agents of Change: An Examination of Shifting Race Relations Within Women's Netball in Post-Apartheid South Africa." Sociology of Sport Journal 21:59-77. Perkin, Harold. 1989. "Teaching the nations hot to play: sport and society in the British empire and commonwealth." The International Journal of the History of Sport 6 (2):145-155.

192 Petersson, Bo. 2014. "Still Embodying the Myth? Russia's Recognition as a Great Power and the Sochi Winter Games." Problems of Post-Communism 61 (1):30-40. Peuquet, Donna J., Anthony C. Robinson, Samuel Stehle, Franklin A. Hardisty, and Wei Luo. 2015. "A method for discovery and analysis of temporal patterns in complex event data." International Journal of Geographical Information Science 29 (9):1588-1611. doi: 10.1080/13658816.2015.1042380. Porteous, Ian, David Newman, Alexander Ihler, Arthur Asuncion, Padhraic Smyth, and Max Welling. 2008. "Fast Collapsed Gibbs Sampling For Latent Dirihlet Allocation." Knowledge Discovery in Databases, Nas Vegas, NV, August 24-27. Robinson, Colin, and Robert Feick. 2016. "Bumps and bruises in the digital skins of cities: unevenly distributed user-generated content across US urban areas." Cartogaphy and Geographic Information Science 43 (4):283-300. doi: 10.1080/15230406.2015.1088801. Rose, Gilian. 1993. Feminsm and Geography. Minneapolis, MN: University of Minnesota Press. Rose, Gillian. 2012. Visual Methodologies: An Introduction to Researching with Visual Materials. 3 ed. Thousand Oaks, CA: SAGE Publishing. Rudd, Adrian, and Roger Levermore, eds. 2004. Sport and International Relations: An Emerging Relationship. New York, NY: Routledge. Russell, Matthew A. 2014. Mining the Social Web. 2 ed. Sebastopol, CA: O'Reilly Media. Saltelli, A., S. Tarantola, and F. Campolongo. 2000. "Sensitivity Analysis as an Ingredient of Modeling." Statistical Science 15 (4):377-395. Sarantakes, Nicholas Evan. 2014. Jimmy Carter's Disastrous Olympic Boycott. Politico Magazine. Accessed April 13, 2016. Schnober, Carsten, and Iryna Gurevych. 2015. "Combing Topic Models for Corpus Exploration: Applying LDA for COmplex Corpus Research Tasks in a Digital Humanities Project." 2015 Workshop on Topic Models: Post-Processing and Applications, New York, NY. Schrodt, P. A., S. G. Davis, and J. L. Weddle. 1994. "Political Science: KEDS--A Program for the Machine Coding of Event Data." Social Science Computer Review 12 (4):561-587. doi: 10.1177/089443939401200408. Scranton, Sheila, and Anne Flintoff, eds. 2002. Gender and Sport: A Reader. New York, NY: Routledge. Shanley, Lea A., Ryan Burns, Zachary Bastian, and Edward S. Robson. 2013. "Tweeting up a Storm: The Promises and Perils of Crisis Mapping." Photogrametric Engineering & Remote Sensing. Sharp, Joanne. 1993. "Publishing American identity: popular geopolitics, myth and The Reader's Digest." Political Geography 12 (6):491-503. Shekhar, Shashi, Viswanath Gunturi, Michael R. Evans, and KwangSoo Yang. 2012. "Spatial big-data challenges intersecting mobility and cloud computing." MobiDE Eleventh ACM International Workshop on Data Engineering for Wireless and Mobile Access, New York, NY. Shekhar, Shashi, Pusheng Zhang, and Yan Huang. 2009. "Spatial Data Mining." In Data Mining and Knowledge Discovery Handbook, edited by Oded Maimon and Lior Rokash, 837-854. New York, NY: Springer.

193 Silberschatz, Avi, and Alexander Tuzhilin. 1996. "What Makes Patterns Interesting in Knowledge Discovery Systems." IEEE Transactions on Knowledge and Data Engineering 8 (6):970-974. Skey, Michael. 2014. "‘What nationality he is doesn't matter a damn!’ International football, mediated identities and conditional cosmopolitanism." National Identities:1-17. doi: 10.1080/14608944.2014.934214. Smith, Rory. 2015. Football has social responsibility to help Europe's refugee crisis. ESPN. Accessed August 2, 2017. Sporting Intelligence 2013. A Permier League of nations. < http://www.sportingintelligence.com/wp-content/uploads/2013/08/PL-of-nations- 13-14-map-1024x700.jpg> Sporting Intelligence 2015. Put Your Shirt On It: Premier League sponsorship deals for 2015-2016. < http://www.sportingintelligence.com/wp- content/uploads/2015/07/PL-shirts-updated-20.7.jpg> Sports Illustrated Magazine 2014. La Liga president: Barcelona to be kicked out if Catalonia secedes. Accessed October 28, 2015. Stehle, Sam, and Donna J. Peuquet. 2015. "Analyzing Spatio-Temporal Patterns and their Evolution via Sequence Alignment." Spatial Cognition and Cognition 15 (2):68- 85. Steiger, Enrico, Bernd Resch, and Alexander Zipf. 2016. "Exploration of spatiotemporal and semantic clusters of Twitter data using unsupervised neural networks." International Journal of Geographical Information Science 30 (9):1694-1716. Sterne, Jonathan, ed. 2012. The Sound Studies Reader. New York, NY: Routledge. Stewart, R., J. Piburn, A. Sorokine, A. Myers, J. Moehl, and D. White. 2015. "World Spatiotemporal Analytics and Mapping Project (Wstamp): Discovering, Exploring, and Mapping Spatiotemporal Patterns across the World’s Largest Open Soruce Data Sets." ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences II-4/W2:95-102. doi: 10.5194/isprsannals-II-4-W2- 95-2015. Strauss, Anselm, and Juliet M. Corbin. 1990. Basics of qualitative research: Grounded theory procedures and techniques. Thousand Oaks, CA: Sage Publications. Sui, Daniel, and Dydia DeLyser. 2012. "Crossing the qualitative-quantitative chasm I: Hybrid geographies, the spatial turn, and volunteered geographic information (VGI)." Progress in Human Geography 36 (1):111-124. Thatcher, Jim. 2014. "Living on Fumes: Digital Footprints, Data Fumes, and the Limitations of Spatial Big Data." International Journal of Communications 8:1765-1783. Thorogood, Joe. 2016. "Satire and Geopolitics: Vulgarity, Ambiguity and the Body Grotesque inSouth Park." Geopolitics 21 (1):215-235. doi: 10.1080/14650045.2015.1089433. Thurman, Neil. 2007. "The globalization of journalism online." Journalism 8 (3):285- 307. Tian, G., Y. Xia, Y. Zhang, and D. Feng. 2011. "Hybrid genetic and variational expectation-maximization algorithm for gaussian-mixture-model-based brain MR image segmentation." IEEE Trans Inf Technol Biomed 15 (3):373-80. doi: 10.1109/TITB.2011.2106135.

194 Tobler, Waldo R. 1970. "A Computer Model Simulating Urban Growth in the Detroit Region." Economic Geography 46:234-240. Tomlinson, Alan, and Christopher Young, eds. 2006. national identity and global sports events: culture, politics, and spectable in the olympics and the football world cup. Albany, NY: State University of New York Press. Toole, Jameson L., Meeyoung Cha, and Marta C. Gonzalez. 2012. "Modeling the Adoption of Innovations in the Presence of Geographic and Media Influence." PLoS One 7 (1):1-9. Vale, Jon. 2017. Ireland 'demanding sea border with UK after Brexit'. The Independent. Accessed July 31, 2017. Vatsavai, Ranga Raju, Auroop Ganguly, Varun Chandola, Anthony Stefanidis, Scott Klasky, and Shashi Shekhar. 2012. "Spatiotemporal data mining in the era of big spatial data: algorithms and applications." ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial Data New York, NY. Ward, Michael D., Brian D. Greenhill, and Kristin M. Bakke. 2010a. "The perils of policy by p-value: Predicting civil conflicts." Journal of Peace Research 47 (4):363-375. doi: 10.1177/0022343309356491. Ward, Michael D., Brian D. Greenhill, and Kristin M. Bakke. 2010b. "The peris of policy by p-value: Predicting civil conflicts." Journal of Peace Research 47 (4):363- 375. Watson, David F., and G.M. Philip. 1985. "A refinement of inverse distance weighted interpolation." Geo-processing 2 (4):315-327. Webb, Geoffrey I., and Damien Brain. 2006. "Generality Is Predictive of Prediction Accuracy." In Data Mining, edited by G.J. Williams and S.J. Simoff. Berlin: Springer-Verlag. Webster, James G., and Thomas B. Ksiazek. 2012. "The Dynamics of Audience Fragmentation: Public Attention in an Age of Digital Media." Journal of Communication 62 (1):39-56. Whang, Soon-Hee. 2006. "Korea and Japan 2002: Public Space and Popular Celebration." In National Identity and Global Sports Events, edited by Alan Tomlinson and Christopher Young. Albany, NY: SUNY Press. White, Alan. 2014. "A 13-Year-Old Boy Protested At The World Cup Opening Ceremony But You Didn't See It." BuzzFeed, accessed June 14, 2014. . White, Sarah, and Teresa Larraz. 2014. Catalonia seeks support from EU for independence. The Independent. Accessed July 25, 2017. Widener, Michael, and Wenwen Li. 2014. "Using geolocated Twitter data to monitor the prevalence of healthy and unhealthy food references across the US." Applied Geography 54:189-197. Williams, Dmitri. 2002. "Synergy Bias: Conglomerates an Promotion in the News." Journal of Broadcasting & Electronic Media 46 (3):453-472. Woon, Chih Yuan. 2014. "Popular Geopolitics, Audiences and Identities: Reading the ‘War on Terror’ in the Philippines." Geopolitics 19 (3):656-683. doi: 10.1080/14650045.2014.907277.

195 Yao, Yiyu, Yaohua Chen, and Xuedong Yang. 2006. "A Measurement-Theoretic Foundation of Rule Interestingness Evaluation." In Foundations and Novel Approaches in Data Mining, edited by Tsau Young Lin, Setsuo Ohsuga, Churn- Jung Liau and Xiaohua Hu, 41-59. Berlin: Springer. Zhihui, J. 2013. LDA-math - Understanding Beta / Dirichlet Distribution. Capital of Statistics. English translation. https://cosx.org/2013/01/lda-math-beta-dirichlet last accessed Oct. 1, 2017

196 Appendix

List of topics given by each of the 48 tested models

k alpha min. topics tf- idf

10 .006 .03 1 tour aru race vuelta dumoulin stage froom rider 2 hotel citi roman art cave visit town rout 3 independ elect parti vote mas catalan catalonia seat 4 glencor compani volkswagen emiss car commod price market 5 snp scotland scottish referendum labour vote sturgeon parti 6 mas catalan independ court vote parti artur elect 7 leagu game player messi goal play club espanyol 8 properti buyer cent per game foreign market leagu 9 bedroom photograph beach mile javier lizonepa stage peloton 10 polic women arrest chicken sea rescu girl hospit

.054 1 glencor mas volkswagen pet emiss car vet anim 2 hotel scotland colon dalt murada snp miraflor gala 3 airlin car passport flight airport ramosgetti driver hire 4 snp scotland labour mas golf lorenzo eta rossi 5 espanyol casilla atltico aspa messi bara keeper bilbao 6 properti bedroom buyer owner commerci certif homeown finca 7 messi neymar espanyol bara rayo coach piqu deulofeu 8 beach park lockedfals wlsdexcept unhidewhenusedfals semihiddenfals theme royal 9 aru dumoulin photograph froom chave sec lizonepa peloton 10 dish roman cave oliv chicken pan rice pilgrimag

.08 1 lockedfals wlsdexcept unhidewhenusedfals semihiddenfals athlet bara namemedium gordon 2 hotel sail murada dalt colon miraflor gala gibraltar 3 pet pujol rossi vet mrquez vaccin bull busqueta 4 bedroom properti finca alcudia villa cave vilaluca vaccin 5 dish chicken pan rice squid dice carmichael transgen 6 properti buyer bbc shall homeown bust robinson paramount 7 cave roman photograph pilgrimag ramosgetti tower famlia sagrada 8 aru sec passport airlin torrevieja irrat degenkolb kilomet 9 photograph lizonepa golf waelecorbi tim royal bale iraizoz 10 glencor casilla volkswagen aspa emiss elch deulofeu bara

.011 .03 1 tour aru race stage dumoulin vuelta froom rider 2 hotel citi news roman cave art visit king 3 independ elect vote parti catalan mas catalonia junt 4 glencor compani volkswagen emiss commod price car park 5 snp scotland scottish referendum vote labour bedroom sturgeon 6 catalan mas court independ vote parti artur cup 7 leagu player goal game messi play score club 8 properti buyer cent per foreign game market minut 9 beach chicken sea rescu girl blue ramosgetti photograph 10 photograph javier stage lizonepa peloton mile waelecorbi tim

197 k alpha min. topics tf- idf

.054 1 pet anim refuge rescu bullfight vet migrant bull 2 hotel colon dalt murada miraflor gala tourism palma 3 properti bedroom buyer airlin owner flight commerci tourism 4 snp scotland labour golf sail bbc virgin syria 5 espanyol atltico messi mas casilla keeper iraizoz bilbao 6 mas glencor car volkswagen emiss passport ramosgetti rental 7 messi neymar espanyol bara aspa sevilla casilla emeri 8 beach park lockedfals wlsdexcept unhidewhenusedfals semihiddenfals royal torrevieja 9 aru dumoulin photograph froom chave sec lizonepa peloton 10 dish roman cave oliv chicken pan rice pilgrimag

.08 1 lockedfals wlsdexcept unhidewhenusedfals semihiddenfals golf airlin bara namemedium 2 hotel dalt murada colon gala miraflor sail palma 3 glencor casilla volkswagen aspa emiss elch bara deulofeu 4 chicken dish fifa carmichael nadal eta roca leak 5 bedroom properti finca alcudia villa cave rossi vilaluca 6 properti buyer bbc shall homeown bust hotel paramount 7 roman photograph cave pilgrimag ramosgetti tower lizonepa waelecorbi 8 pet vaccin vet asunta bull forest espigolador busqueta 9 aru sec pujol torrevieja rice pan irrat degenkolb 10 photograph passport lizonepa waelecorbi royal bale tim banus

.016 .03 1 tour stage aru race dumoulin vuelta rider froom 2 hotel news citi roman visit art cave rout 3 independ elect parti vote mas catalan catalonia seat 4 properti bedroom buyer cent per sale market costa 5 snp scotland scottish labour referendum vote sturgeon parti 6 catalan mas independ vote parti court properti elect 7 game goal player leagu minut play score bara 8 leagu espanyol game goal park club messi play 9 photograph beach mile stage lizonepa peloton waelecorbi rider 10 glencor volkswagen compani emiss commod car market debt

.054 1 glencor volkswagen emiss pet mas anim car rescu 2 hotel colon dalt murada gala miraflor palma tapa 3 properti bedroom buyer airlin commerci owner tourism flight 4 dish roman cave oliv chicken pan rice refuge 5 snp scotland labour mas golf sail bbc virgin 6 photograph messi neymar javier waelecorbi lizonepa tim piqu 7 mas passport car pujol document ramosgetti rental driver 8 espanyol messi casilla atltico bara neymar aspa keeper 9 park beach lockedfals wlsdexcept properti unhidewhenusedfals semihiddenfals theme 10 aru dumoulin froom sec chave photograph peloton min

198 k alpha min. topics tf- idf .08 1 lockedfals wlsdexcept unhidewhenusedfals semihiddenfals airlin athlet namemedium gordon 2 hotel dalt murada miraflor colon gala sail palma 3 bedroom properti finca villa alcudia cave alfr vilaluca 4 iraizoz bara carmichael fifa nadal uefa eta leak 5 properti buyer shall homeown hotel bust paramount jerez 6 bbc pujol rossi mrquez asunta robinson bull espigolador 7 roman cave pilgrimag girl walker fgm monument melilla 8 aru photograph sec lizonepa waelecorbi tim dish torrevieja 9 glencor casilla aspa volkswagen emiss pet deulofeu elch 10 passport golf royal bale irrat stfano marathon antequera

.029 .03 1 stage aru tour race dumoulin vuelta froom rider 2 hotel citi news roman cave art visit refuge 3 tourism flight airlin fli lockedfals wlsdexcept unhidewhenusedfals semihiddenfals 4 glencor compani volkswagen emiss arrest polic car commod 5 snp scotland scottish referendum vote labour bedroom independ 6 leagu espanyol goal club game player play passport 7 independ elect mas catalan parti vote catalonia cup 8 game player goal messi leagu play score minut 9 beach tourist marina sea rescu women girl ramosgetti 10 photograph stage javier lizonepa peloton waelecorbi mile rider

.054 1 pet anim bullfight percent vet mas refuge felip 2 hotel dalt murada colon miraflor gala palma tapa 3 properti bedroom buyer airlin golf commerci flight tourism 4 beach park properti theme royal torrevieja blue certif 5 snp scotland labour lockedfals wlsdexcept unhidewhenusedfals semihiddenfals mas 6 messi neymar espanyol bara rayo piqu barca deulofeu 7 aru dumoulin photograph froom sec chave lizonepa javier 8 mas glencor volkswagen car emiss passport hire document 9 dish roman cave chicken pan rice pilgrimag recip 10 espanyol casilla atltico aspa bara sevilla keeper bilbao

.08 1 lockedfals wlsdexcept unhidewhenusedfals semihiddenfals athlet bara namemedium gordon 2 hotel dalt murada colon gala miraflor sail mallorca 3 bedroom properti villa alcudia finca rossi cave alfr 4 airlin dish chicken pan rice ramosgetti irrat squid 5 golf passport royal vaccin adelson eurovega banus antequera 6 properti buyer pet bbc homeown shall vet bust 7 aru photograph sec lizonepa waelecorbi roman tim cave 8 pujol refuge asunta forest walker firefight busqueta manel 9 percent carmichael nadal roca sex leak prostitut eta 10 glencor casilla volkswagen emiss aspa elch bara torrevieja

199 k alpha min. topics tf- idf 25 .006 .03 1 news refuge sep percent germani austria oct aug 2 hotel torrevieja athlet lorenzo rossi celebr dalt murada 3 car properti passport tourism hire rental fee hotel 4 lockedfals wlsdexcept hotel unhidewhenusedfals semihiddenfals pet accent ramosgetti 5 hospit health polic vaccin patient pan rice women 6 flight airlin airport tourist tourism rout las puls 7 beach sea flag blue water boat rescu women 8 hrs anim bike gibraltar puls tenaci categori provinc 9 snp scotland referendum scottish vote labour parti independ 10 roman cave theme rout pilgrimag park citi art 11 glencor volkswagen mps emiss commod vote car debt 12 mas catalan parti independ vote cup elect artur 13 player aspa leagu game club play espanyol season 14 match sport nadal linesman game gordon play scotland 15 leagu goal minut game score neymar ball messi 16 messi player game espanyol play score goal leagu 17 chicken dish citi cook elch rayo bust guest 18 catalan espanyol court club novemb mas casilla independ 19 park fear royal london chariti foundat asunta polic 20 independ elect parti vote catalan mas catalonia seat 21 respons communiti owner sail fire charter balear jerez 22 properti bedroom buyer costa cent hous foreign golf 23 froom stage tour vuelta race chave cent sec 24 aru dumoulin stage vuelta tour rider jersey race 25 photograph stage lizonepa mile peloton javier waelecorbi rider

.054 1 syria refuge nadal prohibit reproduct syrian keeper bull 2 messi neymar hotel espanyol piqu murada dalt bara 3 properti buyer hotel lockedfals wlsdexcept unhidewhenusedfals semihiddenfals passeng 4 tourism elch rayo fli emeri oliv transgen fgm 5 flight airlin airport museum passeng pilot bust carrier 6 messi mas prosecutor pujol neymar felip summon father 7 snp scotland labour mas sand adelson eurovega vega 8 chelsea pedro bara bike atltico app mourinho arrow 9 beach blue sail boat virgin award concha httpcommonswikimediaorg 10 labour grayl athlet lloret scotland bbcs rescu robinson 11 snp scotland labour golf syria antequera student bbc 12 espanyol casilla aspa atltico sevilla bilbao beto coach 13 messi neymar espanyol bara casilla barca angl fifa 14 royal park pet marina banus chariti marathon carmichael 15 mas glencor volkswagen emiss car chines prosecutor catalua 16 aru dumoulin froom sec chave min valverd contador 17 theme park deulofeu paramount alhama multin anim beach 18 gibraltar bullfight tenaci southampton meat tourism deck genoa 19 owner properti pet shall homeown irrat certif fli 20 vaccin census document eta diphtheria javier marriag festiv 21 passport rossi mrquez affili passeng mosso lorenzo donat 22 bedroom properti dumoulin finca cave alcudia villa commerci 23 roman cave car hire pilgrimag rental ramosgetti tower 24 photograph lizonepa waelecorbi javier peloton tim torrevieja barrientosap 25 dish chicken pan rice recip simmer chop oliv

200 k alpha min. topics tf- idf

.08 1 lockedfals wlsdexcept unhidewhenusedfals semihiddenfals namemedium mrquez rossi meat 2 hotel murada dalt gala colon miraflor mallorca port 3 royal marathon hotel gala urdangarn fifa robinson infanta 4 airlin palma norwegian vuel japan cech roma psychiatr 5 percent nadal roca celler tenni eta djokov efe 6 virgin paramount properti vacation diver parad forcad espigolador 7 casilla bara vilaluca alfr asunta porto fox bentez 8 golf bullfight adelson eurovega antequera vega transplant bull 9 bbc refuge melilla drown rosel pet bartomeu robinson 10 glencor ramosgetti volkswagen emiss photograph famlia tower sagrada 11 aspa motorbik casilla uci refuge unipubl transgen gordon 12 properti buyer shall homeown rent hotel grade condominium 13 rice pan dish dice squid vaccin walker prawn 14 sex prostitut mcdermid workingclass cloud babi cabifi firefight 15 bale cristiano athlet jerez bara mansel mourinho pilot 16 bara iraizoz irrat overcom abus uefa leo easyjet 17 stfano athlet breez nonsecessionist googl contamin mosso iaaf 18 elch gibraltar tenaci carmichael steer southampton jonatha fgm 19 glencor volkswagen chicken emiss torrevieja pujol dish catalana 20 deulofeu bust banus forest linesman firefight montserrat lloret 21 pet vet cdc affili busqueta manel microchip vaccin 22 photograph lizonepa waelecorbi tim cave roman pilgrimag barrientosap 23 passport properti hire buyer puls hamilton envelop moren 24 bedroom properti villa finca alcudia cave buyer golf 25 aru sec degenkolb kilomet contador lindeman schleck oliveira

.011 .03 1 news passport refuge percent sep austria germani itali 2 hotel lockedfals wlsdexcept unhidewhenusedfals semihiddenfals pet dalt murada 3 polic women hospit vaccin health sea rescu man 4 flight airlin airport tourist tourism rout provinc las 5 snp scotland referendum vote scottish labour independ parti 6 puls categori sail news menu compani tag latest 7 beach labour parti blue flag referendum courtesi independ 8 golf chicken dish elch cook rayo minut costa 9 parti mas catalan independ rajoy catalunya court elect 10 respons communiti owner broadcast bbc fire referendum robinson 11 aspa player messi tax leagu sevilla club espanyol 12 torrevieja athlet lorenzo rossi mrquez race sport king 13 roman cave theme rout pilgrimag citi park art 14 leagu club player goal season score footbal play 15 leagu minut goal messi score neymar game ball 16 catalan mas espanyol resolut novemb casilla independ artur 17 messi espanyol game pujol play player deulofeu document 18 photograph stage lizonepa mile javier waelecorbi peloton tim 19 park fear royal arrest polic london terrorist pari 20 stage aru tour dumoulin vuelta race froom rider 21 independ elect catalan parti vote mas catalonia seat 22 car tourism hire rental driver fuel marina insur 23 bedroom properti hous certif estat commerci owner shall 24 properti buyer cent per foreign costa market sale

201 k alpha min. topics tf- idf 25 glencor volkswagen emiss mps commod vote car debt

.054 1 refuge syria syrian prohibit migrant gordon scotland uefa 2 hotel messi neymar murada dalt miraflor colon gala 3 properti buyer hotel lockedfals wlsdexcept unhidewhenusedfals semihiddenfals homeown 4 froom chave aru dumoulin sec min valverd kilomet 5 beach airlin airport flight passeng fli properti blue 6 mas boat marina sail yacht banus festiv port 7 snp scotland labour syria grayl bbc rescu lloret 8 labour snp scotland mas athlet sand adelson eurovega 9 roman cave car pilgrimag ramosgetti tower gaud photograph 10 espanyol keeper casilla bilbao refere messi neymar bara 11 aspa tourism espanyol sevilla casilla beto atltico roca 12 aru dumoulin golf sec document vaccin antequera census 13 park theme car paramount driver museum vacation alhama 14 atltico espanyol piec mas corner bicycl diego percent 15 glencor volkswagen emiss car chines rossi mrquez motogp 16 messi espanyol neymar barca bale striker piqu cristiano 17 mas passport pujol catalua cdc affili summon prosecutor 18 dish oliv pan chicken rice recip tomato simmer 19 park royal pedro chelsea chariti marathon bara birthday 20 pet owner anim deulofeu dog vet speed driver 21 bullfight mas gibraltar meat tenaci southampton busqueta manel 22 felip queen gasol obama efe award royal solidar 23 virgin diver parad boat carmen eta walker festiv 24 photograph lizonepa waelecorbi tim peloton javier torrevieja shall 25 bedroom properti finca alcudia cave villa commerci riviera

.08 1 glencor volkswagen emiss reproduct campo prohibit disqualif diesel 2 hotel murada dalt miraflor colon gala mallorca port 3 hotel virgin refuge bull bullfight diver melilla parad 4 torrevieja asunta hotel uefa porto bara basterra atkinson 5 alfr vilaluca fox photograph pene cech properti renf 6 passport airlin palma vuel transgen vacation norwegian genet 7 dish chicken pan rice dice squid nadal roca 8 shall properti bust carmichael leak bartomeu rosel pisarello 9 pujol linesman espigolador barba leftov percent catalana banca 10 aru sec bbc degenkolb robinson edit lindeman schleck 11 athlet mourinho japan bara schmidt zara hire brigada 12 bedroom properti homeown villa finca alcudia cave buyer 13 lockedfals wlsdexcept unhidewhenusedfals semihiddenfals iraizoz motorbik namemedium bentez 14 bale bara cristiano stfano abus documentari lfp hera 15 jerez motorbik pilot mansel motogp nonsecessionist balloon senna 16 photograph ramosgetti tower sagrada famlia lizonepa tallest basilica 17 aspa casilla eta walker bosqu evan blott nudist 18 royal banus marathon mcdermid startup workingclass stake cloud 19 golf adelson eurovega antequera vega transplant inland wifi 20 gibraltar elch tenaci steer southampton deck genoa monaco 21 deulofeu bara vaccin casilla forest diphtheria montserrat clinic 22 properti buyer rent grade fgm protocol januarymay hotel 23 rossi irrat mrquez overcom gasol cancer easyjet seminar

202 k alpha min. topics tf- idf 24 pet vet kilomet aru cdc affili microchip vaccin 25 photograph lizonepa cave roman waelecorbi tim pilgrimag cyclist

.016 .03 1 car passport hire fee rental compani charg driver 2 hotel lockedfals wlsdexcept unhidewhenusedfals pet semihiddenfals tourism accent 3 dumoulin stage sec aru tour chave vuelta rider 4 respons owner communiti tapa sail hotel broadcast fire 5 flight tourist airlin rout airport tourism las puls 6 beach blue flag virgin courtesi imag sea award 7 anim provinc hrs gibraltar tenaci southampton puls bull 8 torrevieja athlet lorenzo rossi mrquez race felip king 9 snp referendum scotland vote independ scottish labour parti 10 snp scottish scotland labour vote mps parti english 11 news refuge fear royal park london syria germani 12 independ elect vote parti mas catalan catalonia seat 13 rajoy independ catalan catalunya parti scotland elect percent 14 player leagu golf match game champion play manchest 15 messi player game espanyol play leagu season score 16 leagu goal minut neymar score messi game ball 17 chicken dish elch minut rayo cook guest unai 18 catalan mas independ artur cup novemb elect espanyol 19 pujol commiss produc jordi vilaluca alfr fox photograph 20 polic women sea vaccin hospit rescu rice pan 21 photograph stage lizonepa mile javier peloton waelecorbi tim 22 race rider stage tour motorbik vuelta jersey driver 23 glencor volkswagen emiss roman rout cave commod price 24 properti bedroom buyer foreign hous invest cent costa 25 tour aru vuelta stage froom cent race per

.054 1 bale prohibit cristiano uefa coach campo reproduct manchest 2 hotel murada dalt gala miraflor colon bust sail 3 properti buyer hotel lockedfals wlsdexcept unhidewhenusedfals semihiddenfals tapa 4 flight airlin refuge airport migrant palma tourism carrier 5 virgin bbc robinson bbcs festiv tradit boat diver 6 busqueta manel syria felip efe gasol refuge obama 7 snp scotland labour syria bbc grayl fair student 8 royal park atltico chelsea chariti pedro marathon birthday 9 mas scotland ramosgetti labour snp sand photograph adelson 10 espanyol speed driver casilla stielik uci linesman unipubl 11 rossi driver mrquez jerez lorenzo formula mansel car 12 aru dumoulin froom sec chave min degenkolb valverd 13 bilbao iraizoz refere athlet gracia gorka keeper bicycl 14 messi espanyol neymar sevilla aspa rayo emeri bara 15 mas passport pujol messi prosecutor catalua vilaluca alfr 16 messi neymar bara stfano piqu espanyol corner coach 17 car cave roman hire pilgrimag rental driver piec 18 dish oliv pan chicken tourism rice recip tomato 19 mas anim bullfight bull gibraltar meat tenaci southampton 20 glencor volkswagen emiss car chines marina banus puerto 21 photograph beach lizonepa waelecorbi tim javier peloton properti 22 properti certif torrevieja shall rent beach owner rescu

203 k alpha min. topics tf- idf 23 pet properti airport passeng park owner theme fli 24 golf document vaccin asunta census antequera carmichael diphtheria 25 bedroom properti homeown finca alcudia villa cave commerci

.08 1 glencor volkswagen emiss pujol motorbik uci unipubl reproduct 2 hotel murada dalt miraflor colon gala palma fira 3 hotel properti tapa paramount cappuccino port bicycl arround 4 pet royal vet cdc marathon vaccin affili microchip 5 vaccin banus forest transgen diphtheria montserrat genet humid 6 bara brazil uefa mourinho cas argentina fifa cuadrado 7 elch getaf esprito kakuta bueno jonatha casilla percent 8 photograph cave roman pilgrimag lizonepa barrientosap tower athlet 9 bedroom properti bbc villa finca alcudia cave golf 10 hire sail carmichael leak puls japan port roddin 11 refuge bust jerez pilot motorbik mansel pisarello juncker 12 bullfight lockedfals wlsdexcept bull unhidewhenusedfals semihiddenfals gordon bullrun 13 casilla aspa bara vilaluca asunta alfr porto fox 14 bara abus espigolador barba leftov urdangarn infanta nurs 15 passport torrevieja fifa robinson pope atkinson hamilton cuba 16 photograph waelecorbi tim lizonepa ramosgetti jordanafpgetti tower famlia 17 iraizoz sex prostitut lloret mcdermid girl coggin cloud 18 dish chicken pan rice dice squid nadal prawn 19 airlin lockedfals wlsdexcept unhidewhenusedfals semihiddenfals palma vuel norwegian 20 golf bale stfano adelson eurovega cristiano antequera vega 21 deulofeu gibraltar virgin sail tenaci southampton fgm diver 22 irrat roca eta walker overcom motogp celler blott 23 aru sec degenkolb kilomet contador edit lindeman schleck 24 rossi mrquez meat vacation summon breez nonsecessionist cancer 25 properti buyer homeown shall rent linesman grade ownership

.029 .03 1 passport golf refuge percent applic unit antequera costa 2 hotel lockedfals wlsdexcept roman cave unhidewhenusedfals semihiddenfals dalt 3 polic arrest terrorist vaccin kill suspect pan rice 4 flight airlin airport tourism rout island puls tourist 5 referendum scotland scottish vote snp independ labour parti 6 shall properti puerto club marina certif communiti banus 7 snp scotland vote scottish labour sturgeon parti mps 8 news anim mayor franco bullfight thelocalat thelocald thelocaldk 9 player aspa game leagu club play espanyol sevilla 10 athlet torrevieja lorenzo rossi mrquez race championship sport 11 player leagu espanyol goal play match minut season 12 messi goal leagu game neymar score play surez 13 properti cent per buyer foreign costa market sale 14 stage aru dumoulin tour vuelta race rider froom 15 independ mas elect catalan vote catalonia parti seat 16 properti royal park fear london foundat certif chariti 17 glencor volkswagen pet commod emiss price theme debt 18 elect parti vote independ catalan catalonia chicken cup 19 catalan mas independ parti elect vote artur cup 20 car driver compani hire rental fee charg custom

204 k alpha min. topics tf- idf 21 respons communiti owner fire sail charter jerez balear 22 beach sea blue water flag women rescu boat 23 bedroom properti hous sale buyer estat commerci costa 24 pujol document produc investig vilaluca fox alfr compani 25 photograph stage lizonepa mile peloton javier rider waelecorbi

.054 1 properti lockedfals wlsdexcept unhidewhenusedfals semihiddenfals hotel certif namemedium 2 hotel murada dalt miraflor colon gala rice pan 3 tourism puls categori properti roca degenkolb startup sbarag 4 flight airlin airport fli bilbao passeng tourism palma 5 boat sail rescu refuge migrant melilla yacht port 6 froom sec aru chave dumoulin min valverd kilomet 7 felip busqueta manel forest award podemoss montserrat obama 8 snp scotland labour syria bbc hes grayl student 9 labour scotland snp document asunta census bbc porto 10 park royal chelsea pedro runner espanyol birthday chariti 11 golf mas sand adelson eurovega athlet vega antequera 12 espanyol aspa casilla atltico sevilla corner beto stfano 13 fli southampton tenaci oliv gibraltar ship transgen atltico 14 mas virgin percent festiv eta robinson diver parad 15 mas passport pujol messi prosecutor vilaluca alfr fox 16 neymar messi bara torrevieja deulofeu coach manchest corner 17 bedroom properti finca alcudia villa cave commerci elch 18 refuge syria festiv javier linesman syrian uefa terrorist 19 messi espanyol neymar bara barca banus marina puerto 20 properti buyer beach tourism theme park blue passeng 21 pet anim bullfight bull vet bike dog meat 22 owner properti shall homeown certif tourism carmichael leak 23 cave roman chicken pilgrimag dish tower ramosgetti gaud 24 glencor car volkswagen emiss driver hire rental museum 25 photograph aru dumoulin lizonepa peloton javier waelecorbi tim

.08 1 refuge bbc prohibit uefa fifa robinson campbel reproduct 2 hotel dalt murada colon miraflor gala palma tapa 3 hotel port sail mallorca cappuccino hire forcad sick 4 hotel eta roca ram fira fun celler prohibit 5 vaccin deulofeu bull forest diphtheria googl montserrat firefight 6 airlin irrat palma banus vuel overcom abus easyjet 7 aspa casilla bara elch carmichael getaf jonatha leak 8 iraizoz percent app sex prostitut startup mcdermid oct 9 pujol espigolador bicycl barba leftov percent catalana pope 10 roman cave pilgrimag vacation tower monument jame summon 11 bedroom properti alcudia finca villa cave bbc golf 12 golf adelson eurovega antequera vega bullfight transplant inland 13 jerez motorbik pilot mansel bentez rafa benzema benitez 14 passport athlet motogp song pilot cancer balloon bilic 15 royal torrevieja asunta marathon porto basterra rosel bartomeu 16 dish chicken pan rice squid dice nadal meat 17 bara mourinho urdangarn infanta roma juventus uefa vermaelen 18 glencor sec aru volkswagen emiss bale motorbik cristiano 19 casilla fox alfr vilaluca photograph bosqu tower cech

205 k alpha min. topics tf- idf 20 lockedfals wlsdexcept unhidewhenusedfals semihiddenfals namemedium bust linesman transgen 21 gibraltar virgin sail tenaci southampton lloret diver girl 22 aru rossi mrquez melilla schleck contador ceuta enclav 23 pet vet kilomet affili cdc busqueta manel microchip 24 properti buyer homeown shall hotel paramount rent grade 25 photograph lizonepa waelecorbi tim ramosgetti barrientosap jordanafpgetti cyclist

50 .006 .03 1 gordon los scotland polic campo reproduct refuge prohibit 2 hotel dalt murada colon miraflor gala celebr tapa 3 tourism airlin flight fear fli airport rout travel 4 news polic refuge franco sep austria thelocalat thelocalfr 5 motogp japanes rider consecut helmet balloon sport town 6 document news resid polic donat census fals flat 7 properti buyer foreign costa cent market passeng per 8 train passeng driver bike polic car speed app 9 referendum snp parti independ rossi mrquez lorenzo vote 10 snp scotland vote scottish labour sturgeon parti confer 11 scotland referendum labour snp scottish vote independ parti 12 lockedfals wlsdexcept roman cave unhidewhenusedfals semihiddenfals rout accent 13 park royal court london catalan mas chariti foundat 14 mps vote labour snp scottish english scotland parti 15 store messi puls categori citi menu restaur shop 16 golf respons communiti owner referendum scottish costa robinson 17 puls compani categori percent car custom news tag 18 properti cent buyer per costa market sale foreign 19 player club chelsea match pedro play leagu game 20 goal score leagu season messi match player game 21 fire jerez yesterday museum catalan mansel mcdermid pilot 22 espanyol minut atltico casilla goal header bilbao keeper 23 aspa game sevilla minut play espanyol leagu emeri 24 score play leagu goal bale linesman hes fan 25 leagu play neymar game surez bara goal score 26 messi espanyol game player play nou goal enriqu 27 properti certif shall vaccin owner health communiti rent 28 tour froom communiti fli homeown stage owner oliv 29 beach blue flag courtesi imag award court playa 30 independ catalan mas parti catalonia elect seat vote 31 theme park resort marina produc murcia banus paramount 32 sport citi athlet bust king gestur brereton reproduc 33 festiv san javier airport flight ticket band globe 34 sea rescu beach water women provinc mar hrs 35 anim gibraltar tenaci southampton bull genoa steer malta 36 torrevieja king felip holiday enjoy urdangarn market park 37 polic pari arrest terrorist syria kill spaniard languag 38 dish passport chicken cook pan rice oil recip 39 bedroom properti commerci estat villa finca alcudia cave 40 photograph stage peloton mile imag lizonepa ramosgetti javier 41 car pet hire insur hous rental deulofeu vet 42 enjoy sail beauti museum asunta art citi charter 43 race stage rider tour sec dumoulin froom chave

206 k alpha min. topics tf- idf 44 dumoulin aru stage vuelta tour rodrguez sec dutchman 45 photograph stage mile rider waelecorbi tim lizonepa javier 46 catalan independ mas parti vote cup elect seat 47 pujol court investig mas judg catalan compani search 48 glencor volkswagen commod emiss debt price market slump 49 court neymar tax accus club player sentenc mas 50 independ vote parti elect catalan catalonia mas seat

.054 1 beach blue owner award concha httpcommonswikimediaorg campo reproduct 2 hotel dalt murada colon miraflor gala tapa mallorca 3 athlet busqueta manel montserrat forest lorenzo gestur podemoss 4 flight airlin airport fli tourism palma carrier irrat 5 owner homeown mas properti irrat concept eta proxi 6 refuge migrant walker syrian syria blott african boat 7 snp scotland labour golf grayl antequera fair hes 8 snp rossi mrquez lorenzo labour scotland mas evan 9 mas document census catalua irregular investitur uncov detain 10 bullfight bull virgin festiv anim tradit scotland diver 11 snp scotland labour syria forcad bbc student nun 12 espanyol stielik casilla atletico tiago calderon corner atltico 13 bbc robinson bbcs mas imparti scotland russia putin 14 scotland gordon linesman snp brown breast labour cancer 15 passport uefa prohibit hamilton beach nudist naturist nudism 16 rayo elch nadal getaf esprito kakuta jonatha bueno 17 mas summon prosecutor catal soccer attorney disqualif coach 18 bara bilbao keeper messi neymar iraizoz refere casilla 19 deulofeu atltico espanyol casilla elch substitut sevilla hes 20 asunta stfano porto cech coach basterra argentinian captain 21 atltico bale espanyol piec cristiano mcdermid diego bara 22 aspa espanyol sevilla casilla beto emeri festiv javier 23 messi neymar espanyol barca corner piqu pedro substitut 24 fli oliv transgen genet larva oxitec nurs greenpeac 25 messi rice pan recip dish squid dice simmer 26 motorbik speed driver uci unipubl peloton fgm protocol 27 tenaci southampton gibraltar deck genoa monaco malta steer 28 contamin palomar kerri drunken saloufest washington student radioact 29 pet vet microchip vaccin anim dog atkinson mas 30 roman cave pilgrimag bust gaud tower museum pisarello 31 torrevieja carmichael leak waterpark park roddin memo sourc 32 chicken dish marina banus guest tomato puerto oliv 33 flight airport museum vacation airlin ticket passeng puls 34 car hire rental driver percent debit cruyff fill 35 lockedfals wlsdexcept unhidewhenusedfals semihiddenfals sail boat rescu namemedium 36 dumoulin cdc froom schleck aru categori chave orica 37 properti fox vilaluca alfr villa photograph pene tworoom 38 glencor volkswagen emiss car jerez anim chines engin 39 tourism properti hotel commerci roca celler gastronomi eat 40 chelsea pedro donat affili passeng catalua copisa driver 41 sand adelson eurovega vega espigolador percent leftov barba 42 properti certif shall owner felip rent degenkolb grade 43 properti buyer tourism passeng airport beach fli homeown

207 k alpha min. topics tf- idf 44 bedroom properti finca alcudia cave villa commerci golf 45 theme park paramount hotel tourism alhama multin airport 46 park royal chariti marathon lockedfals wlsdexcept birthday unhidewhenusedfals 47 aru dumoulin sec photograph froom chave peloton min 48 photograph waelecorbi tim lizonepa javier aru jordanafpgetti peloton 49 glencor volkswagen emiss transplant car meat chines neymar 50 ramosgetti photograph kilomet tower aru famlia sagrada tallest

.08 1 pet bale cristiano vet stfano reproduct campo prohibit 2 hotel dalt murada colon miraflor gala tapa palma 3 hotel sail port cappuccino mallorca carmichael leak roddin 4 hotel paramount properti robinson castl carl athlet mir 5 torrevieja hotel fira ram fun dugdal weekday feburari 6 cech roma benitez bentez cristiano rafa soldier bara 7 iraizoz gordon athlet leo cuba argentina iker uruguay 8 glencor volkswagen emiss adelson eurovega vega fgm protocol 9 irrat overcom seminar phobia easyjet proxi rosel bartomeu 10 bbc fifa robinson refuge scaremong cabl contador netflix 11 espigolador barba leftov abus bara insult shakira racist 12 soccer bara efe primera roma argentin michell cristiano 13 properti buyer homeown rent grade assessor epc certifi 14 eta nonsecessionist cesar stake proeuropean def dos wari 15 photograph lizonepa waelecorbi tim aru barrientosap cyclist jordanafpgetti 16 melilla ceuta enclav prix coastguard merced subsaharan hamilton 17 nadal tenni ferrer djokov hotel semifin atp seed 18 bara uefa mourinho contamin brazil drown prohibit cuadrado 19 motogp balloon spencer beater freddi geisha hailwood katana 20 aspa casilla bara sociedad clinic prez bosqu elch 21 elch pet bullfight esprito kakuta jonatha bueno getaf 22 lloret casilla girl nava coggin gea iker keylor 23 passport rossi mrquez sex casilla prostitut nurs academi 24 bara gay wifi homosexu partida coentro fbio streak 25 deulofeu motorbik uci unipubl photograph gameiro gaya toch 26 dish chicken pan rice squid dice meat prawn 27 lockedfals wlsdexcept unhidewhenusedfals semihiddenfals namemedium shade qformattru namecolor 28 vaccin linesman diphtheria transplant evan olot bull nudist 29 hire busqueta manel roca celler lorca dish vat 30 golf percent antequera forcad inland sick nun oct 31 shall properti bust pisarello condominium transferor deed buyer 32 walker edit blott carlin redund mythic cabifi editori 33 roman cave pilgrimag tower monument jame properti pilot 34 pujol catalana banca renf ferrusola thiem denis angi 35 salah risen nov abdesalam flood percent getaf envelop 36 transgen forest genet mcdermid montserrat firefight benahav humid 37 airlin palma vuel norwegian atkinson addison kendal eta 38 bedroom properti vilaluca alfr fox villa photograph tworoom 39 bullfight urdangarn infanta pilot mallorca tejada bull commissair 40 royal athlet asunta marathon porto basterra rosario psychiatr 41 ramosgetti photograph cdc tower famlia sagrada tallest basilica 42 virgin diver vacation sail breez summon parad schmidt 43 properti buyer hotel gibraltar januarymay airlin downturn hardship

208 k alpha min. topics tf- idf 44 bedroom properti finca alcudia cave villa golf miraflor 45 aru sec degenkolb lindeman schleck gougeard contador balcel 46 glencor volkswagen emiss affili copisa gibraltar diesel picardo 47 kilomet aru oliveira cruyff campbel contador johan stuart 48 jerez mansel motorbik pilot hotel jewel eduardo chequer 49 tenaci gibraltar southampton malta genoa deck steer monaco 50 photograph lizonepa waelecorbi tim cyclist jordanafpgetti app barrientosap

.011 .03 1 rayo elch prohibit court getaf esprito kakuta jonatha 2 hotel dalt murada colon miraflor gala tapa mallorca 3 espanyol player stielik play goal bicycl atltico titl 4 puls categori vaccin airport news compani tag menu 5 provinc car driver road traffic kilometr citi town 6 properti buyer cent per foreign costa market sale 7 fli oliv transgen refuge organ larva releas genet 8 snp referendum vote scotland independ parti sturgeon labour 9 scotland snp scottish labour vote referendum elect parti 10 athlet sport bust gestur club celebr citi uefa 11 referendum scottish labour scotland parti bbc independ vote 12 pet hous owner communiti vet homeown your anim 13 park royal london chariti foundat race marathon runner 14 tax messi court accus prosecutor neymar alleg club 15 document census fals polic resid flat detain foreign 16 tourism restaur marina tourist banus puerto food wine 17 golf vote independ parti elect scotland referendum costa 18 polic pari women hospit french arrest kill terrorist 19 percent car arrest melilla compani engin oct terrorist 20 elect elector parti commiss match referendum mayor vote 21 player goal leagu club score play game messi 22 independ catalan mas elect jerez parti pilot forcad 23 minut goal ball atltico espanyol surez score neymar 24 cup season score club leagu stfano play titl 25 game messi espanyol play aspa minut leagu score 26 leagu footbal club play catalan game espanyol catalonia 27 news austria germani itali aug thelocalat thelocalch thelocald 28 dish passport chicken pan cook rice oil minut 29 gordon scotland motogp japanes rider brown currenc town 30 pujol catalan court commiss construct resolut compani judg 31 citi museum art travel attract enjoy vacation flight 32 theme park murcia film paramount entertain cruz multin 33 anim king bullfight felip franco bull death festiv 34 sea beach rescu women polic water british mar 35 rossi fire mrquez lorenzo catalonia race montserrat forest 36 mps respons fear english vote communiti owner irrat 37 properti certif shall purchas owner hotel communiti rent 38 bedroom properti estat torrevieja villa finca alcudia cave 39 car hire rental tourist driver insur sand fuel 40 beach blue flag courtesi imag award concha httpcommonswikimediaorg 41 rout lockedfals wlsdexcept cave roman unhidewhenusedfals semihiddenfals citi 42 virgin mas consult novemb catalan sea carmen diver 43 catalonia catalan court independ mas elect parti sail 44 independ catalan elect mas parti vote catalonia seat

209 k alpha min. topics tf- idf 45 aru stage tour dumoulin vuelta race rider jersey 46 stage froom sec race aru vuelta chave kilomet 47 dumoulin tour stage rider chave race vuelta sprint 48 ramosgetti photograph imag david asunta tenaci southampton gibraltar 49 photograph stage mile lizonepa javier peloton waelecorbi tim 50 glencor volkswagen emiss commod price debt compani car

.054 1 golf antequera prostitut sex reproduct inland campo prohibit 2 hotel park royal torrevieja gala celebr murada miraflor 3 hotel mallorca flight cappuccino airport airlin colon dalt 4 hotel jerez forest fira mansel ram montserrat pilot 5 pujol fli irrat overcom clan phobia easyjet seminar 6 flight airlin tourism airport palma carrier tenerif app 7 festiv virgin boat diver parad javier carmen fiesta 8 tourism fgm roca tradit protocol celler gastronomi solidar 9 efe cuba cuban elkhazzani efefil mexican obama franci 10 snp scotland labour bbc grayl syria fair hes 11 labour snp scotland mas evan rossi mrquez convergncia 12 mas passport sail yacht catalua mcdermid investitur port 13 nadal ferrer tenni cech djokov rafael semifin partida 14 scotland snp labour linesman gordon hes brown juncker 15 driver speed car uci unipubl motorbik traffic review 16 dish chicken pan rice espanyol recip simmer casilla 17 robinson bbc bbcs fifa festiv bilic song putin 18 fox vilaluca alfr photograph motogp japanes helmet lap 19 messi neymar espanyol bara barca hes piqu casilla 20 chelsea pedro syria mourinho nurs jihadist daesh cuadrado 21 aspa espanyol sevilla deulofeu bullfight beto emeri bull 22 atltico espanyol bale marina piec banus casilla puerto 23 asunta coach stfano porto soccer argentinian captain basterra 24 glencor volkswagen emiss bilbao car chines iraizoz keeper 25 rayo fli elch oliv jonatha museum ticket transgen 26 messi neymar corner ter stegen emeri piqu roma 27 neymar bara brazilian abus bate piqu alv messi 28 vaccin cloud cashpoint muslim pregnant thiem denis cough 29 cave roman pilgrimag gaud rossi tower mrquez lorenzo 30 cdc mas percent documentari hera bst detent affili 31 prohibit mas uefa nudist naturist nudism beach nuditi 32 mas forcad percent cruyff nun sick rivera com 33 theme park paramount alhama multin photograph studio pedro 34 refuge migrant syrian syria melilla african asylum isi 35 shall properti document certif bust census owner fraudul 36 owner aru startup hrs froom sitter melchor irrat 37 carmichael meat leak store lloret birmingham rescu coggin 38 lockedfals wlsdexcept pet unhidewhenusedfals semihiddenfals vet anim namemedium 39 properti buyer tourism passeng beach airport fli homeown 40 bedroom properti commerci finca alcudia cave villa hotel 41 sand adelson eurovega vega espigolador barba leftov percent 42 car hire rental driver volkswagen donat emiss affili 43 beach properti blue certif kilomet award rent degenkolb 44 homeown owner properti concept froom rescu contador helicopt 45 vaccin walker diphtheria blott nonsecessionist carlin contagion bildu

210 k alpha min. topics tf- idf 46 dumoulin froom chave aru valverd motorbik orica scan 47 athlet ramosgetti photograph tower famlia sagrada retir lorenzo 48 tenaci southampton gibraltar genoa malta deck lisbon steer 49 photograph aru dumoulin sec lizonepa javier peloton waelecorbi 50 mas messi prosecutor transplant summon busqueta manel attorney

.08 1 glencor volkswagen emiss campo reproduct prohibit los diesel 2 hotel dalt murada gala miraflor colon palma tapa 3 hotel sail port cappuccino mallorca carmichael leak hire 4 hotel refuge castl monument mir bellver firewor fundaci 5 airlin hotel palma busqueta manel fira ram norwegian 6 bara montserrat forest googl firefight benahav humid chernik 7 rossi mrquez linesman japan psychiatr motogp athlet tendon 8 glencor bale cristiano volkswagen stfano emiss dugdal ancelotti 9 girl walker fgm protocol lloret edit blott coggin 10 irrat overcom app easyjet seminar airlin phobia puls 11 bbc bust robinson pisarello putin scaremong amazon asham 12 gibraltar tenaci southampton steer genoa deck malta monaco 13 transplant urdangarn nurs infanta royal cabifi liver babi 14 athlet gordon leo argentina uruguay sur sportsmen brigada 15 pujol catalana banca netflix milk angi cassandra douglass 16 cdc disqualif campbel contador cashpoint affili unsur stuart 17 transgen genet documentari hera cuba cuban jihadist villanovens 18 forcad atkinson nun sick addison kendal prix magaluf 19 nadal meat tenni cancer djokov semifin seed ferrer 20 iraizoz bentez rafa iker nava keylor casilla gea 21 athlet motogp mcdermid song balloon bilic ham beater 22 elch getaf jonatha casilla esprito kakuta bueno bildu 23 aspa casilla motorbik unipubl uci bull bosqu evan 24 bara coentro fbio streak bale metro airbnb hire 25 ramosgetti bara photograph casilla tower famlia sagrada tallest 26 passport hire riazor deulofeu elch gameiro gaya toch 27 uefa fifa bara cas salah soccer roma efe 28 abus insult shakira pop fling peel verbal racist 29 aru prostitut sex contador melchor mauri cruyff edit 30 eta adelson eurovega vega percent cesar stake dos 31 lockedfals wlsdexcept unhidewhenusedfals semihiddenfals namemedium shade qformattru namecolor 32 golf vaccin antequera inland affili diphtheria copisa nudist 33 mosso contamin drown nov abaaoud kerri los vent 34 melilla ceuta enclav photograph coastguard refuge fenc subsaharan 35 jerez motorbik pilot mansel jewel herrero chequer gir 36 refuge espigolador barba leftov juncker rent maroto kidnap 37 cave roman pilgrimag banus tower monument jame cyclist 38 asunta vacation porto basterra nonsecessionist rosario girl usca 39 dish chicken pan rice torrevieja dice squid roca 40 virgin bullfight diver bull parad flood sail pilot 41 aru sec kilomet degenkolb lindeman oliveira schleck gougeard 42 pet royal vet marathon microchip vaccin cat princess 43 properti rent grade assessor epc certifi buyer puls 44 percent bicycl nov oct thiem denis secondhand thanksgiv 45 summon breez pujol hire aru spokesmen coutinho manoeuvr 46 properti buyer homeown shall paramount hotel condominium transferor

211 k alpha min. topics tf- idf 47 properti buyer hotel airlin gibraltar januarymay downturn anymor 48 bedroom properti villa finca alcudia cave golf miraflor 49 alfr vilaluca fox photograph mourinho villa properti tworoom 50 photograph lizonepa tim waelecorbi deulofeu barrientosap jordanafpgetti cyclist

.016 .03 1 gordon scotland los campo reproduct prohibit jose substitut 2 hotel dalt murada colon miraflor gala tapa mallorca 3 vaccin polic document health parent hospit asunta mother 4 commiss elector flight referendum recommend tenaci southampton gibraltar 5 virgin bullfight bull town alfr vilaluca festiv anim 6 provinc island citi town murcia categori restaur puls 7 properti cent per airlin buyer flight foreign airport 8 referendum parti snp independ scotland vote scottish labour 9 scotland scottish snp referendum labour vote sturgeon parti 10 snp vote scotland independ labour parti referendum sturgeon 11 tour communiti froom homeown fractur societi owner stage 12 beach blue flag courtesi imag court award concha 13 passport sail applic charter british balear enjoy beauti 14 messi tax charg prosecutor compani father alleg lionel 15 golf respons communiti owner broadcast bbc costa journalist 16 patient hospit transplant health nurs medic anim campbel 17 court mas novemb neymar catalan accus investig consult 18 hous play match score leagu game goal bale 19 match goal leagu player score season real ronaldo 20 leagu club score play goal footbal game messi 21 fire forest jerez sea mansel pilot montserrat metr 22 atltico minut espanyol goal game piec corner casilla 23 espanyol minut game play aspa messi leagu season 24 rayo elch linesman loan sevilla eta season penalti 25 espanyol citi tourism stielik play goal player corner 26 neymar messi bara ball minut goal surez score 27 bedroom cave properti lockedfals wlsdexcept roman unhidewhenusedfals semihiddenfals 28 dish chicken pan cook rice oil minut recip 29 parti king elect felip mayor franco vote urdangarn 30 theme park resort marina murcia paramount banus puerto 31 independ elect vote parti catalan mas catalonia seat 32 refuge syria arrest terrorist pari polic islam french 33 sport jump club metr messag runner express uefa 34 news court thelocalat thelocalch thelocald thelocaldk thelocalfr thelocalit 35 tourism deulofeu player play roca food restaur wine 36 sea rescu polic beach women man water british 37 photograph ramosgetti train imag david app compani tower 38 pet park royal london anim vet chariti foundat 39 properti buyer costa market cent per foreign price 40 car hire rental euro tourist sand driver fuel 41 properti certif purchas shall hotel owner rent communiti 42 citi museum art athlet bust travel attract meat 43 mps fli fear vote english oliv labour scottish 44 catalan mas parti court independ cup vote artur 45 stage tour dumoulin vuelta aru race rider froom 46 stage dumoulin aru rodrguez race vuelta sec tour

212 k alpha min. topics tf- idf 47 kilomet rider aru stage mile motogp japanes enter 48 parti elect catalonia independ catalan vote rajoy mas 49 photograph stage mile peloton lizonepa rider javier waelecorbi 50 glencor volkswagen emiss commod car price compani debt

.054 1 forcad reproduct prohibit campo mas los nun communic 2 hotel murada dalt colon miraflor gala tapa marina 3 hotel cappuccino mallorca port shop airport coff pollensa 4 hotel casilla monument pujol nonsecessionist mir carl festiv 5 hotel jerez fira mansel ram museum pilot car 6 owner irrat fli mas overcom hrs easyjet phobia 7 flight airlin park theme tourism airport palma paramount 8 virgin bullfight bull festiv anim fiesta tradit diver 9 rescu beach lloret birmingham store girl coggin drown 10 vaccin diphtheria emeri blown olot sevilla rosel bartomeu 11 snp scotland labour golf bbc antequera anniversari inland 12 snp scotland labour rossi mrquez lorenzo syria bbc 13 park royal torrevieja fli chariti marathon birthday oliv 14 documentari hera efe parad puerto efefil trump fenc 15 scotland labour linesman grayl gordon hes robinson fair 16 car bicycl driver passeng atkinson traffic bike nurs 17 mas mcdermid scotland dialogu manley mythic coexist amateur 18 bilbao iraizoz refere gracia walker gorka keeper blott 19 aru dumoulin sec chave min froom nadal valverd 20 messi neymar bara prosecutor pedro father transplant leo 21 roma bayern manchest cech coach juventus munich messi 22 glencor volkswagen emiss car chines casilla atltico dumoulin 23 asunta stfano porto basterra campbel argentinian captain coach 24 espanyol atltico corner piec diego tiago casilla stielik 25 aspa espanyol sevilla deulofeu beto emeri casilla pau 26 espanyol messi bale ramosgetti neymar photograph cristiano hes 27 rayo elch jonatha javier festiv sevilla getaf esprito 28 neymar messi bara tenaci southampton gibraltar casilla angl 29 rice pan recip dish squid simmer dice chop 30 motorbik driver speed athlet unipubl uci fgm protocol 31 chicken mas dish guest meat oliv urdangarn pimiento 32 bust athlet pisarello monarchi anomali felip queen museum 33 lockedfals wlsdexcept unhidewhenusedfals semihiddenfals pujol uefa prohibit catalana 34 car hire rental driver ski donat affili debit 35 refuge syria syrian melilla terrorist isi migrant percent 36 mas passport catalua summon investitur catal bilic song 37 tourism carmichael leak roca celler gastronomi roddin memo 38 roman cave pilgrimag gaud tower museum monument iberia 39 chelsea pedro bbc mourinho renf cuadrado mosso scaremong 40 hrs castelln airport startup flight percent cabifi electr 41 sand vega adelson eurovega percent espigolador barba leftov 42 pet anim lockedfals wlsdexcept unhidewhenusedfals semihiddenfals vet dog 43 vilaluca alfr fox photograph tworoom gasol tower pene 44 beach shall properti blue certif award degenkolb concha 45 properti buyer tourism passeng hotel sail commerci yacht 46 bedroom properti finca alcudia cave villa commerci asda 47 properti owner certif homeown rent busqueta manel grade

213 k alpha min. topics tf- idf 48 froom aru montserrat scan contador motorbik forest peloton 49 aru document dumoulin kilomet census cdc marriag irregular 50 photograph lizonepa javier waelecorbi peloton tim alvaro barrientosap

.08 1 bale cristiano stfano campo reproduct prohibit los andor 2 hotel dalt miraflor murada colon gala mallorca cappuccino 3 hotel elch fira esprito kakuta ram jonatha dalt 4 tapa hotel espigolador barba leftov cruyff ruta bier 5 asunta uefa porto basterra atkinson bara rosario addison 6 linesman leo argentina athlet bara summon uruguay beliz 7 properti buyer homeown rent grade hotel assessor epc 8 bbc documentari hera robinson refuge cancer scaremong magaluf 9 golf antequera forest inland montserrat firefight humid benahav 10 mosso cabifi startup elkhazzani nov uber blanket internazional 11 torrevieja paramount properti hotel bicycl puls tapa fox 12 motorbik irrat uci overcom unipubl seminar phobia easyjet 13 airlin cairn downturn hardship anymor prix properti roman 14 banus athlet jihadist iaaf typic royal greek abaaoud 15 eta pujol robinson catalana cesar banca bildu dos 16 kilomet aru gordon oliveira campbel cuba stuart cuban 17 jerez pilot mansel motorbik herrero proeuropean chequer gir 18 roman cave pilgrimag tower monument jame contador cashpoint 19 nadal tenni ferrer djokov semifin refuge seed atp 20 soccer efe primera magaluf eibar tomatina miami rebrand 21 motogp balloon song bilic moto beater freddi geisha 22 aspa casilla bara nurs clinic coentro fbio streak 23 bentez rafa casilla keylor nava collet gea iker 24 iraizoz bara bartomeu rosel brazil renf iker absenc 25 vaccin casilla diphtheria affili bosqu evan olot academi 26 deulofeu prez gibraltar gameiro gaya toch riazor wellstruck 27 gibraltar tenaci southampton steer genoa malta deck monaco 28 abus nonsecessionist insult shakira pop summon racist fling 29 rice pan dish squid dice roca prawn tablespoon 30 bullfight fgm protocol bull girl balcel crespo fgmaffect 31 lockedfals wlsdexcept unhidewhenusedfals semihiddenfals namemedium shade qformattru namecolor 32 chicken dish pan prohibit nudist naturist nudism nuditi 33 rossi mrquez percent melilla googl startup motogp enclav 34 royal athlet marathon rfea sur brigada tendon sportsmen 35 glencor volkswagen passport emiss forcad sick nun hamilton 36 airlin palma bust vuel norwegian pisarello puls mallorca 37 refuge percent vega adelson eurovega meat fenc cancer 38 properti buyer shall hotel condominium transferor deed gibraltar 39 aru sec degenkolb edit walker lindeman blott schleck 40 transgen mcdermid genet mourinho zara cuadrado primark inditex 41 pet vet microchip vaccin volkswagen emiss diesel cat 42 girl lloret bicycl percent coggin babi drown salah 43 pujol bull ferrusola lorca tordesilla poet vinegar doll 44 vacation hire breez bullrun bull percent bullfight nov 45 hire bara chernik coastguard subsaharan ceuta cyclist roden 46 bedroom properti villa finca alcudia cave vilaluca alfr 47 sail virgin diver transplant app parad urdangarn mallorca 48 ramosgetti photograph cdc fifa tower sagrada famlia tallest

214 k alpha min. topics tf- idf 49 cech roma tabarca olympiako lewandowski wolfsburg lyon eindhoven 50 photograph lizonepa waelecorbi tim barrientosap jordanafpgetti cyclist busqueta

.029 .03 1 elch document rayo coupl resid census getaf fals 2 hotel dalt murada colon miraflor gala tapa mallorca 3 referendum elector commiss nadal match vote scottish friday 4 mas investig fox vilaluca alfr rodrguez eta photograph 5 flight airlin airport rout tourism palma carrier fli 6 provinc hrs island kilometr citi road resid town 7 torrevieja chave dumoulin stage tour vuelta rider heat 8 properti cent per buyer foreign passeng alicant kilomet 9 snp vote referendum labour scotland independ parti scottish 10 scotland snp scottish labour vote referendum sturgeon parti 11 journalist bbc referendum coverag scottish robinson bbcs broadcast 12 tourism messi tax prosecutor club court roca accus 13 beach blue flag courtesi imag award concha httpcommonswikimediaorg 14 minut game atltico deulofeu play champion match espanyol 15 vaccin health franco rossi hospit mrquez race patient 16 player match coach leagu real play club stadium 17 espanyol game aspa messi play minut leagu sevilla 18 score season cup pujol club leagu real stfano 19 surez goal minut neymar messi corner espanyol ball 20 minut athlet bilbao mlaga sport iraizoz refere reproduc 21 play bale meat score fan fine cristiano hes 22 leagu club goal game score play footbal bara 23 hous linesman owner write sea your abus alba 24 independ catalonia elect court catalan romeva parti mass 25 dish passport chicken pan cook rice minut oil 26 tour froom stage race rider vuelta fractur contador 27 communiti homeown owner societi gasol polic properti cycl 28 anim court bull carmichael leak town parti dog 29 catalan independ mas parti elect vote cup golf 30 refuge polic arrest syria terrorist pari suspect islam 31 news puls categori percent menu sep compani latest 32 theme park murcia resort marina puerto tourism paramount 33 languag school english student obama unit king oct 34 fire jerez forest museum mansel pilot montserrat afternoon 35 sea rescu beach polic women water man british 36 asunta mother commiss app bike elector recommend parent 37 properti buyer costa market foreign cent per sale 38 respons communiti owner fli festiv oliv san javier 39 properti certif shall owner communiti rent purchas grade 40 bedroom cave properti lockedfals wlsdexcept roman citi 41 unhidewhenusedfals 42 car hire rental tourist driver compani euro sand 43 pet fear travel fli vet passeng drive hrs 44 photograph stage mile peloton lizonepa rider javier waelecorbi 45 sail ramosgetti photograph imag david balear charter tower 46 sec stage dumoulin aru min race jersey froom 47 park royal london foundat chariti race marathon birthday 48 independ vote elect parti catalan catalonia mas seat 49 aru stage dumoulin vuelta tour rider race ride

215 k alpha min. topics tf- idf 50 produc store basqu espigolador leftov barba compani cdc glencor volkswagen emiss car commod price compani debt

.054 1 vaccin espigolador barba leftov campo reproduct prohibit passeng 2 hotel murada dalt colon miraflor gala tapa mallorca 3 sail hotel yacht port skipper boat mallorca castel 4 flight airlin airport tourism palma carrier passeng fli 5 bullfight virgin bull festiv anim fiesta tradit diver 6 glencor volkswagen emiss car chines fli eta irrat 7 hrs rescu boat canaria coastguard flood gran helicopt 8 golf antequera fgm neymar protocol inland collabor bara 9 snp scotland labour syria bbc grayl fair student 10 lorenzo rossi mrquez athlet felip royal forcad gestur 11 chicken dish guest oliv tomato mcdermid pimiento spoon 12 labour snp scotland walker blott carlin homosexu tabloid 13 dumoulin aru kilomet forest montserrat schleck oliveira haimar 14 espanyol casilla atltico corner neymar messi bara diego 15 mas tourism roca investitur dialogu scotland celler gastronomi 16 mas summon robinson bicycl catal song festiv bbcs 17 motogp japanes campbel balloon circuit moto beater freddi 18 dumoulin sec aru chave min froom nadal valverd 19 southampton tenaci gibraltar meat monaco cancer lisbon genoa 20 puls categori manchest schmidt atltico leverkusen bayern espanyol 21 photograph lizonepa javier waelecorbi tim peloton torrevieja barrientosap 22 pan rice recip dish squid dice simmer chop 23 aspa espanyol sevilla casilla beto coach pau keeper 24 bilbao espanyol iraizoz refere messi piqu gracia neymar 25 ramosgetti photograph bale tower cristiano sagrada famlia fifa 26 messi espanyol deulofeu rayo elch hes emeri barca 27 neymar messi bara bust stegen ter corner munir 28 pujol car cdc mas affili catalua donat volkswagen 29 document census detain fraudul irregular marriag uncov brigad 30 froom aru contador javier festiv dumoulin scan motorbik 31 owner properti puls categori beach sitter irrat cashpoint 32 refuge syria syrian migrant melilla isi terrorist muslim 33 lockedfals wlsdexcept unhidewhenusedfals semihiddenfals namemedium shade qformattru namecolor 34 car hire rental driver jerez mansel uefa pilot 35 fox vilaluca alfr photograph decor tworoom pene villa 36 mas carmichael leak gasol attorney disqualif roddin prosecutor 37 asunta porto basterra girl ferrer rosario chines stake 38 tourism hotel paralys elabor student puerto driver cyclist 39 park theme paramount bike alhama multin app speed 40 messi scotland linesman prosecutor gordon father leo defraud 41 speed driver lloret beach rescu unipubl uci motorbik 42 pet park royal anim vet chariti marathon dog 43 passport mas cambridg prize diseas hamilton trump magaluf 44 pedro chelsea fli oliv transgen mourinho larva genet 45 marina banus puerto transplant busqueta manel store arabia 46 properti certif owner shall homeown rent degenkolb grade 47 properti buyer beach tourism blue passeng airport award 48 bedroom properti commerci villa finca alcudia cave hotel 49 percent sand vega adelson eurovega finland dialogu sainz

216 k alpha min. topics tf- idf 50 roman cave pilgrimag gaud tower museum motorbik iberia

.08 1 glencor volkswagen emiss reproduct prohibit campo los partial 2 hotel dalt murada gala colon miraflor mallorca cappuccino 3 hotel properti robinson paramount bbc fox mir castl 4 rossi tapa hotel mrquez banus royal urdangarn palma 5 hotel fira ram fun weekday feburari suday colon 6 photograph lizonepa waelecorbi tim barrientosap cyclist jordanafpgetti aru 7 airlin sail palma vuel mallorca norwegian port hire 8 motorbik irrat uci unipubl overcom easyjet phobia seminar 9 percent startup googl nov oct capita fas optimist 10 walker edit blott carlin schmidt editori drown mythic 11 bbc documentari casilla hera amazon nava iker gea 12 primera soccer cabifi nov roma athlet efe argentin 13 properti buyer homeown rent grade hotel januarymay assessor 14 bale cristiano stfano app compon ancelotti altruism selfish 15 forcad bicycl nov nun sick jihadist cairn percent 16 puls properti flood greek airbnb wifi eva helmet 17 shall properti jerez pilot mansel affili condominium transferor 18 forest montserrat firefight renf cashpoint humid benahav flame 19 iraizoz linesman cloud iker soccer efe espadal athlet 20 aru vacation breez schleck statehood dubious wedg centralistmind 21 golf antequera inland volkswagen emiss prix diesel hamilton 22 nadal tenni ferrer djokov semifin seed atp efe 23 motogp balloon spencer moto beater freddi geisha hailwood 24 casilla aspa bara bosqu abus lfp shakira elch 25 deulofeu elch getaf jonatha esprito kakuta bueno prez 26 bara nurs streak wifi coentro fbio leo bale 27 bara bentez brazil cruyff benitez rafa johan prez 28 passport royal marathon prostitut sex gala hamilton juve 29 fgm girl protocol bullrun cat abus song bull 30 transplant eta bartomeu nudist naturist nudism nuditi rosel 31 atkinson hire electr refuge magaluf addison kendal partida 32 ramosgetti photograph tower athlet sagrada famlia basilica tallest 33 refuge melilla ceuta enclav fenc percent oct juncker 34 bust meat pisarello cancer cech juncker roma gram 35 adelson eurovega vega percent pope vatican princess leonor 36 roman cave pilgrimag tower vilaluca alfr fox photograph 37 torrevieja nov netflix tapa flamenco pollut coromina aemet 38 hire summon nonsecessionist lamont wheelchair solo maxi neuron 39 bedroom properti pujol villa finca alcudia cave golf 40 lockedfals wlsdexcept unhidewhenusedfals semihiddenfals namemedium shade qformattru namecolor 41 bara bullfight espigolador uefa barba leftov bull prohibit 42 vaccin diphtheria lloret evan girl drown coggin olot 43 transgen genet mcdermid refuge campbel bildu zara inditex 44 athlet asunta porto basterra rosario sur japan brigada 45 pet vet vaccin microchip mourinho stake salah cuadrado 46 virgin sail gibraltar tenaci southampton carmichael busqueta manel 47 dish chicken pan rice squid dice prawn tablespoon 48 roca celler gasol watson dish eurobasket iberdrola nobel 49 aru sec degenkolb kilomet contador lindeman oliveira gougeard 50 cdc fifa bara affili cas efe soccer blatter

217 k alpha min. topics tf- idf

100 .006 .03 1 hotel cappuccino reproduct prohibit campo port los shop 2 hotel dalt murada miraflor colon gala tapa palma 3 hotel restaur free island tradit visit festiv mir 4 polic mother court victim women death woman anim 5 flight airlin airport rout tourism palma carrier tenerif 6 properti buyer foreign cent per market costa sale 7 fifa obama test store washington unit cuba contamin 8 referendum scottish labour independ parti scotland vote bbc 9 vaccin transplant price volatil snps liver babi hospit 10 citi bust jerez mayor scotland museum pilot sport 11 commiss referendum vote elector recommend britain charg deficit 12 referendum scotland snp scottish sturgeon parti vote independ 13 flag patient hospit cruyff union treatment suicid andrea 14 beach blue flag courtesi tourism imag award concha 15 sturgeon parti confer abus alba labour corbyn elect 16 pujol catalan investig court mas commiss judg novemb 17 speed driver rider motorbik unipubl uci sagan imag 18 parti elect catalonia catalan independ cup busqueta manel 19 snp scotland vote labour independ sturgeon scottish corbyn 20 snp vote scotland labour independ referendum parti sturgeon 21 puls categori news compani tag menu properti latest 22 match leagu player stadium coach play efe season 23 club player chelsea pedro season bara hes play 24 aru stage dumoulin vuelta tour ride astana rider 25 aspa play game espanyol minut sevilla score leagu 26 vaccin season cup score stfano diphtheria sex prostitut 27 espanyol atltico goal casilla corner surez piec play 28 minut bilbao mlaga iraizoz refere gracia header goalkeep 29 leagu catalan footbal catalonia club independ play lui 30 rayo elch emeri sevilla goal play penalti villarr 31 passport applic british real keeper hub expat david 32 messi espanyol game play nou camp goal score 33 goal game atltico match minut real score champion 34 neymar surez messi bara ball goal minut game 35 jose substitut game cesar minut target def dos 36 fgm protocol forcad nun sick girl parent health 37 chemic cloud packag explos polic bomb espadal firefight 38 gordon scotland motogp japanes rider pension brown town 39 citi museum art travel enjoy attract vacation flight 40 court san javier festiv town beach breez globe 41 snp scotland scottish vote referendum parti confer labour 42 properti hotel romeva commerci rock elect linesman catalan 43 catalan mas independ parti cup elect artur vote 44 arrest polic terrorist islam suspect syria isi interior 45 mayor elect provinc municip parti design holiday partida 46 club bara uefa messag player fan leagu sport 47 percent oct nov film startup music trump rate 48 bike app driver car passeng road citi speed 49 news aug austria germani thelocals thelocalch thelocalfr thelocaldk 50 beach franco drown man anim die camp memori 51 wine friend goal casilla deportivo tequila bankia user 52 news refuge sep germani rajoy austria itali thelocald

218 k alpha min. topics tf- idf 53 goal messi score neymar pedro minut bara player 54 provinc flood town island tabarca del car storm 55 king franco felip award queen celebr juan church 56 fire tenaci southampton gibraltar forest genoa deck malta 57 british tourist beach polic benidorm island photo holidaymak 58 tour race froom vuelta nibali egypt season stage 59 shall properti certif communiti asunta purchas owner debt 60 jump train retir metr hrs strike hospit airport 61 properti certif rossi mrquez rent lorenzo grade race 62 sea women water rescu beach mar lloret friend 63 search women sea polic rescu water driver formula 64 meat eat export cancer food product sale tomato 65 boat rescu hrs coast provinc helicopt air craft 66 cent properti per buyer costa tourism british search 67 bedroom properti estat finca alcudia cave real villa 68 pet mps hous vote english vet travel your 69 tourist sand vega adelson eurovega billion las spend 70 car hire rental driver fli insur compani fuel 71 torrevieja enjoy sail youll holiday charter balear yacht 72 tourism chicken dish food cook enjoy guest relax 73 lockedfals wlsdexcept cave roman unhidewhenusedfals rout semihiddenfals citi 74 rice pan recip dish dice squid minut simmer 75 virgin sea diver town carmen boat citi tradit 76 marina banus puerto bullfight shop polic resort marbella 77 theme park murcia paramount construct alhama multin donat 78 communiti owner respons tour froom homeown societi fractur 79 fear deulofeu fli player irrat overcom combin play 80 document resid census fals flat polic coupl foreign 81 stage race tour rider vuelta froom chave dumoulin 82 stage race aru rider min sec jersey winner 83 dumoulin aru stage tour minut vuelta rodrguez climb 84 race rider tour stage motorbik pull bmc garderen 85 athlet eta gestur lorenzo jos celebr investig fiesta 86 park royal ramosgetti london photograph imag david chariti 87 golf costa antequera del kingdom unit inland sol 88 sec chave catalonia stage froom chelsea invest lindeman 89 produc espigolador barba leftov food footbal schmidt imperfect 90 athlet polic suspect court iaaf marta friend feder 91 photograph stage waelecorbi tim mile javier lizonepa rider 92 photograph stage peloton mile lizonepa javier barrientosap alvaro 93 independ parti catalan vote elect rajoy catalonia mas 94 independ junt pel mas vote elect alli drive 95 vote independ elect parti catalan mas seat catalonia 96 independ elect catalan parti vote mas catalonia seat 97 independ catalan court catalonia argu mass elect mas 98 vilaluca alfr fox rodrguez car photograph volkswagen emiss 99 tax messi prosecutor compani father alleg accus charg 100 glencor volkswagen commod emiss price debt compani market

.054 1 reproduct campo prohibit retir los review depress sus 2 hotel colon murada dalt miraflor gala mallorca cappuccino 3 hotel forcad nun celebr sick uni iii mas

219 k alpha min. topics tf- idf 4 tapa virgin hotel diver carmen celebr boat motogp 5 chelsea pedro hotel fira ram mourinho fun fair 6 puls categori properti proeuropean continent wari cas fifa 7 rosel bartomeu nacion audiencia thiem denis coromina astorga 8 snp scotland mas labour dialogu disqualif convers attorney 9 golf scotland snp antequera inland labour hes volatil 10 deficit percent fas mas optimist scotland lorca capita 11 snp scotland labour bbc syria student bbcs dugdal 12 contamin kerri palomar ship washington russia obama plutonium 13 scotland labour snp evan princess pollster lamont leonor 14 beach blue bust award concha httpcommonswikimediaorg pisarello monarchi 15 tourism bale hotel cristiano hes paralys mas altruism 16 mas summon mcdermid robinson scotland investitur catal bbcs 17 cech andrea tabarca bayern striker lewandowski sedat roma 18 formula driver prix hamilton sainz thanksgiv alonso singapor 19 vaccin nadal donat affili copisa diphtheria tenni djokov 20 coach soccer efe neymar atletico primera piqu striker 21 dish chicken pan rice recip simmer chop oliv 22 atltico casilla espanyol deficit corua abraham cornellael fifthplac 23 stfano syria argentinian daesh terrorist jihadist isi muslim 24 espanyol atltico corner piec substitut diego stielik tiago 25 aspa sevilla espanyol emeri beto casilla blown pau 26 bilbao iraizoz refere gracia degenkolb froom gorka keeper 27 messi espanyol neymar bara casilla barca aspa piqu 28 rayo elch getaf esprito kakuta jonatha bueno santo 29 asunta porto keeper bentez coach rafa keylor basterra 30 pet vet anim microchip vaccin casilla dog ownership 31 deulofeu substitut jose emeri hes magaluf gameiro toch 32 dumoulin abus schleck shakira piqu verbal peel banana 33 tourism roca prostitut sex bicycl celler gastronomi prize 34 girl fgm protocol lloret birmingham rescu aston coggin 35 cloud espadal firefight chemic eat tradit emoji dish 36 festiv student saloufest eva drunken catalua stab nake 37 photograph peloton lizonepa javier barrientosap alvaro waelecorbi tim 38 jerez mansel pilot museum cesar chequer def gir 39 properti buyer passeng airport tourism fli puls beach 40 beach drown nudist mas naturist nudism nuditi felip 41 refuge syrian migrant sep syria com asylum cuba 42 froom scotland vent forest tradit vat momentum neverend 43 car castelln rain flood hotel mohedano beach commissair 44 lockedfals wlsdexcept unhidewhenusedfals semihiddenfals namemedium shade qformattru namecolor 45 busqueta manel melilla podemoss enclav african isi terrorist 46 uefa bara gestur provoc offens transmit prohibit arbitr 47 tomato bullfight festiv percent isil zone deciph rivera 48 pope percent homosexu vatican franci gay priest oct 49 roman cave pilgrimag gaud tower monument iberia jame 50 document census fraudul irregular detain marriag uncov brigad 51 airport pilot flight plane passeng hrs palma fli 52 eta nonsecessionist properti commerci downturn forgiven doom hardship 53 vaccin cough airbnb babi hotel tunisia hutch riu 54 properti buyer hotel vilaluca fox alfr commerci photograph

220 k alpha min. topics tf- idf 55 froom aru transplant scan runner liver contador crest 56 anim bullfight bull bullrun festiv gore photograph fiesta 57 passport pujol mas cdc treasur catalua document mosso 58 gasol pau egypt anim cairo eurobasket navarra cyclist 59 owner sitter irrat proxi pet properti car blue 60 lockedfals wlsdexcept unhidewhenusedfals semihiddenfals dog namemedium cat anim 61 rossi mrquez lorenzo espigolador barba leftov valentino imperfect 62 properti buyer tourism corner januarymay stake car seatbelt 63 contagion diphtheria vaccin iberdrola maroto marriag carvaj texa 64 flight airlin airport tourism palma carrier tenerif ryanair 65 fifa airport flight wifi passeng storm vuel flood 66 javier festiv breez globe bbc singer robinson scaremong 67 properti certif owner shall homeown rent grade ownership 68 bedroom properti finca alcudia cave villa commerci compact 69 marina sand banus adelson eurovega vega puerto store 70 car driver hire rental meat cancer cabifi startup 71 torrevieja pujol park waterpark catalana banca tapa hrs 72 theme park paramount alhama multin studio pedro airport 73 fli museum flight carmichael leak vacation irrat overcom 74 boat sail rescu yacht port beach helicopt coastguard 75 fli oliv transgen larva genet oxitec gibraltar hyperloop 76 scotland gordon montserrat forest car evacu firefight humid 77 car hire rental driver cashpoint snow ski storm 78 athlet lorenzo gestur celebr sportsmen sur hernndez connot 79 motorbik cancer garderen boeckman bike breast chave belgian 80 award bilic song actress ramosvinola prize actor klopp 81 ramosgetti photograph tower sagrada famlia tallest basilica gallifa 82 tenaci southampton gibraltar deck genoa malta steer monaco 83 park royal chariti marathon birthday glencor celebr runner 84 beach pedro blue elkhazzani award electr belgium concha 85 athlet ski iaaf puerto marta domnguez passport biolog 86 messi prosecutor father citizenship defraud leo uruguay payment 87 messi neymar roma stegen ter refuge milk florenzi 88 photograph waelecorbi tim lizonepa javier jordanafpgetti peloton jose 89 chave sec dumoulin froom kilomet aru valverd lindeman 90 store shop amazon zara retail inditex efe pakistan 91 aru dumoulin sec min ceremoni froom gougeard melchor 92 glencor mas volkswagen emiss car chines sourc prosecutor 93 snp labour scotland grayl fair hes syria darken 94 neymar messi bara corner munir bate brazilian vermaelen 95 speed driver unipubl uci motorbik overtak apolog peloton 96 campbel stuart marriag receipt cardboard solo angi cassandra 97 walker blott carlin editori tabloid edit newsquest nuj 98 app bike dialogu cruyff startup map finland johan 99 aru dumoulin haimar mano eurosport lorenzo valverd ceremoni 100 linesman juncker refere jimnez muoz jeanclaud helmet englishlanguag

.08 1 prohibit reproduct campo los partial andor para indirect 2 hotel murada dalt colon miraflor mallorca gala cappuccino 3 hotel carniv colon dalt miraflor murada gala palma 4 motorbik hotel uci unipubl castel bellver firewor fundaci 5 tapa hotel ruta salvador murada bier coll martiana

221 k alpha min. topics tf- idf 6 melilla enclav correa brcena racist jew jane plugin 7 percent googl startup oct teruel slope pist jane 8 meat cancer brazil japan gram coutinho jamn philipp 9 nonsecessionist breez dugdal kezia stfano coromina lamont tous 10 atkinson addison kendal michell juve peer wallac sociedad 11 lloret sex prostitut girl coggin drown capita fas 12 bbc refuge robinson scaremong jame quiz cervant mackay 13 vaccin diphtheria olot song bilic soldier contagion elkhazzani 14 jerez pilot mansel motorbik jewel cashpoint senna chequer 15 alfr vilaluca fox photograph villa tworoom pene tower 16 greek flood eva isil properti imf armrest alget 17 hire robinson bicycl port secondhand asham putin breez 18 campbel proeuropean stuart wari tomatina abus vacation buol 19 walker blott carlin cuba tabloid editori edit cuban 20 everybodi gordon chat nov homosexu gay stuart magaluf 21 royal banus marathon drown gala tabarca dubious wedg 22 prix getaf vent hamilton iberdrola alonso texa merced 23 nadal tenni djokov semifin ferrer seed muguruza atp 24 bara soccer benitez efe primera bentez rafa cristiano 25 athlet leo royal sportsmen brigada rfea preparatori tobalina 26 magaluf miami rebrand sant oct moragu hashish afacan 27 nudist naturist nudism nuditi prohibit lorca japan cpr 28 casilla elch bara jonatha getaf esprito kakuta bueno 29 bale cristiano linesman stfano affili copisa ancelotti altruism 30 gea bentez casilla nava iker porto keylor rafa 31 aspa casilla espigolador barba leftov ferro castel verdu 32 iraizoz forcad refuge nun sick percent iker juncker 33 casilla bosqu evan academi transplant ownership secondhalf sociedad 34 bara streak coentro fbio bale roma fifa cas 35 riazor prez gameiro gaya toch elch deulofeu wellstruck 36 cruyff stake netflix johan rastar ajax tower cheapest 37 abus bara insult lfp shakira section banana peel 38 fgm protocol girl bildu salah abdesalam crespo fgmaffect 39 firefight forest montserrat humid benahav cloud flame mosqu 40 cdc documentari hera affili sant lit balloon fenc 41 cesar def eta dos lliur terra grapo lehmann 42 irrat overcom seminar phobia easyjet proxi vaccin babi 43 bara uefa leo gambl prohibit cas klopp chess 44 athlet wifi partida iaaf domnguez passport biolog dope 45 bust pisarello stfano redund shall cosmos raul pilot 46 princess leonor juventus royal vinegar sofa parad actress 47 mosso vegetarian defibril reanim los benzema blackmail tapa 48 aru deulofeu edit melchor mauri contador degenkolb mythic 49 vacation athlet nov girl lpez juncker thirdplac nichola 50 eta jihadist kidnap moren soldier gando oct surfer 51 photograph ceuta coastguard cualedro firefight flame melilla subsaharan 52 rent hotel percent startup riu tunisia app oct 53 properti buyer hotel paramount fox airlin downturn hardship 54 lockedfals wlsdexcept unhidewhenusedfals semihiddenfals namemedium shade qformattru namecolor 55 bara cech brazil roma vermaelen lewandowski volley bartomeu 56 bullfight bull contamin kerri obama prohibit mallorca nov 57 vega adelson eurovega percent hotel golf florida coin

222 k alpha min. topics tf- idf 58 bull bullrun flood tordesilla risen bullfight spear forest 59 neuron solo angi cassandra douglass gibb incapacit jeal 60 passport hamilton ebola cyclist abus oliva romero oct 61 roca celler nurs transplant dish liver shred rubber 62 asunta porto basterra cat zara rosario pet amazon 63 rossi mrquez motogp roma contador prix marquez roquebrun 64 volkswagen emiss glencor diesel japan tendon psychiatr maebashi 65 virgin diver parad mourinho sail cuadrado cabifi puls 66 obama nobel efe hire medicin putin carter aguirr 67 carmichael leak roddin memo babi mitchel questionnair abaaoud 68 disqualif schmidt mexican efe gladbach album indirect bundesliga 69 properti shall buyer condominium transferor deed ownership expressli 70 properti buyer gibraltar januarymay mallorca yorkbas fox airlin 71 bedroom properti finca alcudia cave villa golf miraflor 72 properti homeown buyer rent grade assessor epc certifi 73 torrevieja tapa doln jacquelin los cyclist inland sant 74 dish pan chicken rice squid dice prawn tablespoon 75 pet vet microchip vaccin refuge fenc foncubierta martinez 76 cave roman pilgrimag tower monument jame bara repurchas 77 bedroom properti villa alcudia finca cave golf hotel 78 sail gibraltar tenaci southampton genoa monaco malta steer 79 airlin palma vuel app norwegian airway mallorca puls 80 percent transgen genet nov cancer defacto abus salut 81 aru sec kilomet degenkolb lindeman oliveira schleck gougeard 82 ferrer morisco nava carvaj palma benzema secondhalf volley 83 motogp pilot helmet balloon moto beater freddi geisha 84 golf antequera inland rosel bartomeu nacion audiencia sharm 85 busqueta manel ramosvinola argentina playmak michelin doll tenni 86 hire seatbelt dgt booster frontseat segu directa restraint 87 villanovens tejada pilot puls carvaj holland aspa mirag 88 vatican pope homosexu priest gay balda nov leak 89 refuge gibraltar electr bicycl meter picardo hire nov 90 summon vacation spokesmen existenti manoeuvr montn junta expir 91 photograph lizonepa waelecorbi tim jordanafpgetti cyclist barrientosap miami 92 ramosgetti photograph tower sagrada famlia tallest basilica expir 93 gordon mcdermid refuge mohsen cenaf ruth osama hungarian 94 renf cabl graduat idena indian pene envelop vilafranca 95 fifa soccer efe efefil blatter platini uefa coldplay 96 urdangarn infanta royal blanket palma flood huelva knit 97 photograph lizonepa waelecorbi tim barrientosap cyclist jordanafpgetti aru 98 lockedfals wlsdexcept unhidewhenusedfals semihiddenfals namemedium shade qformattru primark 99 pujol catalana banca gasol ferrusola eurobasket lill mauroy 100 glencor volkswagen emiss hash acorn resin melilla cher

.011 .03 1 beach blue flag courtesi imag award concha httpcommonswikimediaorg 2 hotel murada dalt colon miraflor gala celebr tradit 3 hotel cappuccino shop port mallorca visit town percent 4 torrevieja tapa hotel enjoy holiday market park price 5 women sex babi violenc prostitut percent worker mother 6 refuge hotel murcia chariti town healthcar tourist citi 7 properti buyer cent foreign tourism per costa market

223 k alpha min. topics tf- idf 8 referendum vote scotland scottish sturgeon elector per independ 9 snp scotland vote referendum labour sturgeon independ scottish 10 labour fgm walker protocol corbyn scotland independ snp 11 flag compani hospit union patient cruyff transplant suicid 12 tourism citi pujol hotel judici charg catalana judg 13 mas catalan independ athlet yesterday artur brereton gestur 14 vote stone britain cairn stewart negoti british treati 15 news aug thelocald thelocaldk thelocalch thelocalfr thelocalit thelocalno 16 journalist robinson bbcs referendum scottish gasol protest independ 17 referendum parti independ snp scottish labour scotland vote 18 elect coverag volatil media revenu snps product scottish 19 mcdermid shes hes women fiction degre misogynist ruth 20 bull anim cech festiv bullrun bullfight kill death 21 aru stage tour sprint dumoulin astana match vuelta 22 leagu goal match player real club score coach 23 keeper captain hrs real train gea casilla strike 24 espanyol casilla atltico goal surez minut play game 25 season score cup stfano leagu real club footbal 26 aspa espanyol minut sevilla play game leagu beto 27 minut bilbao goal mlaga iraizoz header goalkeep player 28 leagu play footbal catalan club score catalonia game 29 rayo elch getaf santo esprito kakuta jonatha loan 30 messi espanyol game play nou camp deep 31 sail enjoy charter balear yacht youll beauti island 32 pedro chelsea player leagu corner club score minut 33 properti sale hous price court market mortgag purchas 34 citi museum art travel enjoy attract vacation flight 35 link belief centralistmind wellconnect statehood wedg mobilis workingclass 36 sec stage aru dumoulin overal chave court froom 37 elect mayor parti municip women melilla vote gibraltar 38 referendum vote scotland elect parti labour snp tradit 39 messi minut score goal ball match leagu season 40 independ vote elect parti catalan mas catalonia seat 41 scotland gordon forcad brown nun sick salmond love 42 marina bust banus puerto linesman mayor citi resort 43 club uefa sea sport athlet messag fifa court 44 news itali germani thelocalat thelocals thelocalno thelocalch thelocalfr 45 syria arrest islam terrorist daesh polic moroccan jihadist 46 player bara club messi neymar play ball surez 47 polic migrant hrs rescu provinc african oct boat 48 urdangarn infanta mother royal king felip father husband 49 fear fli irrat overcom combin phobia seminar travel 50 provinc rain hrs island citi weather beach aragn 51 cent passport per british properti buyer applic sale 52 rossi mrquez race lorenzo catalonia invest champion dialogu 53 shall properti certif communiti asunta purchas owner debt 54 bike app road car citi speed kilometr rout 55 oct design startup percent efe market featur music 56 festiv san javier refuge breez globe euro visit 57 chicken dish cook enjoy guest relax food oliv 58 bedroom properti estat alcudia finca real cave villa 59 flight airlin properti airport rout tourism certif palma 60 car hire rental tourist driver sand insur compani

224 k alpha min. topics tf- idf 61 pet vaccin deulofeu vet player travel health anim 62 roman cave rout citi pilgrimag heritag compostela art 63 pan rice recip dish minut dice squid simmer 64 park theme court murcia paramount mass summon alhama 65 communiti owner respons tour froom fli homeown oliv 66 properti hotel vilaluca alfr fox rodrguez commerci rock 67 hous sea beach women water friend rescu mar 68 photograph stage waelecorbi tim mile lizonepa javier rider 69 document resid census fals flat coupl polic foreign 70 race tour stage vuelta rider froom chave dumoulin 71 stage aru dumoulin vuelta tour kilomet astana dutchman 72 motogp japanes rider moto beater freddi geisha hailwood 73 ramosgetti photograph david imag tenaci southampton gibraltar tower 74 lockedfals wlsdexcept unhidewhenusedfals semihiddenfals accent namemedium jerez sport 75 park royal london chariti foundat race marathon runner 76 golf costa bullfight antequera del kingdom scotland unit 77 categori puls news tag menu compani latest climb 78 car volkswagen emiss vehicl engin diesel softwar seat 79 game messi ball neymar surez goal player play 80 languag english school felip king contamin test obama 81 independ elect vote catalonia mas referendum parti seat 82 polic construct compani pujol investig commiss catalan search 83 snp independ sturgeon referendum carmichael parti leak britain 84 scotland snp scottish labour frack confer vote syria 85 meat pope cancer vatican airport church muslim franci 86 produc espigolador barba leftov leagu atkinson season club 87 glencor volkswagen commod emiss price debt market slump 88 pari polic man victim terrorist franco juan dead 89 independ parti catalan catalonia rajoy separatist elect basqu 90 photograph spaniard fire imag prize award vegetarian cualedro 91 independ mas catalan catalunya junt resolut vote pel 92 stage race tour degenkolb min adrift sec froom 93 speed rider driver motorbik uci unipubl sagan motorcycl 94 catalan mexico maxi teacher exhibit divorc solo profound 95 independ catalan mas parti catalonia elect cup seat 96 photograph stage mile peloton lizonepa javier ride alvaro 97 emeri blown villarr sevilla goal play season celta 98 catalan elect independ parti court vote resolut catalonia 99 messi tax prosecutor father charg alleg accus compani 100 mps english vote labour scottish grayl legisl hous

.054 1 reproduct prohibit campo los con persona sus todo 2 hotel colon dalt murada miraflor gala fair tradit 3 hotel dalt murada tapa colon miraflor gala palma 4 hotel cappuccino port shop mallorca bbc gala coff 5 vent anim tradit thorsten geoff hake jane plugin 6 passport festiv javier breez globe formula hamilton driver 7 snp scotland labour syria bbc student airport hes 8 labour grayl fair scotland hes dugdal davi fractious 9 tourism scotland gordon roca mcdermid gastronomi celler tradit 10 partida com fas terrorist deficit mosso capita alert 11 donat affili copisa gasol award agbar donor catalua

225 k alpha min. topics tf- idf 12 motorbik chave dumoulin cofidi bouhanni nacer uphil froom 13 roman cave pilgrimag gaud tower monument jame iberia 14 documentari hera percent greek redund fenc oct hrs 15 car hire rental driver meat debit robinson fill 16 campbel bbc bbcs stuart scotland receipt chess cpr 17 snp scotland labour sword solo angi cassandra douglass 18 snp scotland mas labour forcad dialogu sick nun 19 vilaluca alfr fox espigolador photograph barba leftov villa 20 nadal tenni djokov rafael semifin efe fognini muguruza 21 coach soccer manchest efe munich bayern mas primera 22 prez sociedad jonatha mourinho pedro corner clinic messi 23 deulofeu substitut jose emeri riazor hes bottom gameiro 24 atltico espanyol casilla neymar bara messi piec corner 25 lockedfals wlsdexcept unhidewhenusedfals semihiddenfals namemedium stfano shade qformattru 26 espanyol aspa casilla sevilla beto stielik atltico atletico 27 bilbao iraizoz refere gracia gorka keeper iker flight 28 bale cristiano atltico hes bara cech striker espanyol 29 espanyol messi fgm piqu protocol neymar rescu javier 30 rayo elch emeri sevilla coach getaf eta santo 31 asunta porto keeper basterra gea casilla iker rosario 32 messi espanyol neymar barca pedro bara alv substitut 33 abus gallifa fifa shakira piqu alv forna fling 34 prostitut sex nonsecessionist beach drown mosso ezkerra paula 35 chicken dish guest oliv pimiento spoon tomato pan 36 chelsea pedro cesar mourinho cuadrado def dos con 37 glencor volkswagen emiss car chines scotland mas juncker 38 mas beach felip nudist naturist nudism nuditi prohibit 39 properti buyer tourism passeng airport fli januarymay beach 40 owner irrat fli overcom easyjet phobia seminar rain 41 refuge melilla migrant syrian african syria isi terrorist 42 scotland snp labour bust monarchi pisarello evan rerun 43 bullfight uefa bull prohibit anim bara provoc lorca 44 bike app dialogu map cruyff johan finland semest 45 tourism hotel paralys cancer breast diagnos amazon elabor 46 document carmichael census leak detain irregular fraudul uncov 47 bull anim bullfight festiv bullrun hrs ski gore 48 forest montserrat evacu firefight humid benahav extinguish motorway 49 gay homosexu cuba cuban franci pope obama maroto 50 neymar brazil bara brazilian santo defraud sourc bartomeu 51 muslim park ski storm snow mosqu flood bilic 52 rice pan dish recip dice squid simmer chop 53 retir tendon achill maebashi depress jumper psychiatr japan 54 flight airlin airport palma tourism carrier passeng fli 55 busqueta manel podemoss hrs nurs mas electr meter 56 andrea tabarca volatil snp denounc celebr sedat parad 57 park royal chariti marathon birthday celebr runner anim 58 rossi mrquez lorenzo valentino motogp hrs yamaha bike 59 photograph waelecorbi tim lizonepa javier peloton vinegar map 60 vaccin diphtheria transplant olot contagion liver intens dhebron 61 tomato salah car belgium abdesalam terrorist festiv hutch 62 puls categori properti ferrer abaaoud syria catalua navarra 63 speed driver unipubl uci motorbik overtak apolog peloton

226 k alpha min. topics tf- idf 64 rescu beach lloret birmingham boat drown migrant aston 65 nov percent suicid puls categori cave driver renf 66 puerto elkhazzani thanksgiv belgium euribor passeng khazzani chariti 67 properti owner certif homeown shall buyer rent grade 68 bedroom properti commerci hotel alcudia finca cave villa 69 beach blue award concha httpcommonswikimediaorg gibraltar tourism properti 70 mas sand vega adelson eurovega percent sainz marina 71 park theme torrevieja paramount alhama multin properti studio 72 pet vet microchip vaccin walker anim blott dog 73 properti buyer neymar messi bara babi loan chernik 74 virgin celebr diver carmen boat motogp japanes tradit 75 aru froom marina banus puerto scan berth melchor 76 sail ticket museum yacht vacation flight park skipper 77 fli oliv transgen genet larva oxitec helicopt rescu 78 chave sec froom dumoulin degenkolb min valverd orica 79 aru dumoulin sec min froom passeng gougeard driver 80 mas summon store catal catalua froom primark provoc 81 athlet lorenzo gestur celebr anniversari nazi brigada rfea 82 car traffic driver stake seatbelt rastar chines booster 83 southampton tenaci gibraltar linesman ship lisbon deck genoa 84 messi jerez neymar mansel pilot museum eclips herrero 85 golf antequera inland anniversari scotland messi hrs cas 86 renf cabl googl passeng adif vilafranca pene figuer 87 refuge rosel bartomeu sep asylum nacion audiencia thiem 88 mas abus netflix fierc bitter defacto salut sourc 89 proeuropean continent magaluf wari blue advertis miami mas 90 percent rivera syria oct russia mas efe nov 91 bedroom properti finca alcudia cave villa commerci roig 92 mas investitur catalua midday bst proeu watson scotland 93 photograph waelecorbi tim lizonepa javier jordanafpgetti peloton jose 94 photograph peloton lizonepa javier barrientosap alvaro dumoulin aru 95 neymar messi bate bara catalua leo tenerif father 96 ramosgetti photograph tower sagrada famlia tallest basilica gaud 97 aru dumoulin kilomet sec oliveira valverd haimar mano 98 pujol clan athlet cdc catalana marta imput banca 99 gasol percent envelop nov powder chemic deficit santand 100 glencor volkswagen emiss car chines diesel engin owner

.08 1 prohibit reproduct campo los partial andor para indirect 2 hotel murada dalt colon miraflor mallorca gala cappuccino 3 hotel carniv colon dalt miraflor murada gala palma 4 motorbik hotel uci unipubl castel bellver firewor fundaci 5 tapa hotel ruta salvador murada bier coll martiana 6 melilla enclav correa brcena racist jew jane plugin 7 percent googl startup oct teruel slope pist jane 8 meat cancer brazil japan gram coutinho jamn philipp 9 nonsecessionist breez dugdal kezia stfano coromina lamont tous 10 atkinson addison kendal michell juve peer wallac sociedad 11 lloret sex prostitut girl coggin drown capita fas 12 bbc refuge robinson scaremong jame quiz cervant mackay 13 vaccin diphtheria olot song bilic soldier contagion elkhazzani 14 jerez pilot mansel motorbik jewel cashpoint senna chequer

227 k alpha min. topics tf- idf 15 alfr vilaluca fox photograph villa tworoom pene tower 16 greek flood eva isil properti imf armrest alget 17 hire robinson bicycl port secondhand asham putin breez 18 campbel proeuropean stuart wari tomatina abus vacation buol 19 walker blott carlin cuba tabloid editori edit cuban 20 everybodi gordon chat nov homosexu gay stuart magaluf 21 royal banus marathon drown gala tabarca dubious wedg 22 prix getaf vent hamilton iberdrola alonso texa merced 23 nadal tenni djokov semifin ferrer seed muguruza atp 24 bara soccer benitez efe primera bentez rafa cristiano 25 athlet leo royal sportsmen brigada rfea preparatori tobalina 26 magaluf miami rebrand sant oct moragu hashish afacan 27 nudist naturist nudism nuditi prohibit lorca japan cpr 28 casilla elch bara jonatha getaf esprito kakuta bueno 29 bale cristiano linesman stfano affili copisa ancelotti altruism 30 gea bentez casilla nava iker porto keylor rafa 31 aspa casilla espigolador barba leftov ferro castel verdu 32 iraizoz forcad refuge nun sick percent iker juncker 33 casilla bosqu evan academi transplant ownership secondhalf sociedad 34 bara streak coentro fbio bale roma fifa cas 35 riazor prez gameiro gaya toch elch deulofeu wellstruck 36 cruyff stake netflix johan rastar ajax tower cheapest 37 abus bara insult lfp shakira section banana peel 38 fgm protocol girl bildu salah abdesalam crespo fgmaffect 39 firefight forest montserrat humid benahav cloud flame mosqu 40 cdc documentari hera affili sant lit balloon fenc 41 cesar def eta dos lliur terra grapo lehmann 42 irrat overcom seminar phobia easyjet proxi vaccin babi 43 bara uefa leo gambl prohibit cas klopp chess 44 athlet wifi partida iaaf domnguez passport biolog dope 45 bust pisarello stfano redund shall cosmos raul pilot 46 princess leonor juventus royal vinegar sofa parad actress 47 mosso vegetarian defibril reanim los benzema blackmail tapa 48 aru deulofeu edit melchor mauri contador degenkolb mythic 49 vacation athlet nov girl lpez juncker thirdplac nichola 50 eta jihadist kidnap moren soldier gando oct surfer 51 photograph ceuta coastguard cualedro firefight flame melilla subsaharan 52 rent hotel percent startup riu tunisia app oct 53 properti buyer hotel paramount fox airlin downturn hardship 54 lockedfals wlsdexcept unhidewhenusedfals semihiddenfals namemedium shade qformattru namecolor 55 bara cech brazil roma vermaelen lewandowski volley bartomeu 56 bullfight bull contamin kerri obama prohibit mallorca nov 57 vega adelson eurovega percent hotel golf florida coin 58 bull bullrun flood tordesilla risen bullfight spear forest 59 neuron solo angi cassandra douglass gibb incapacit jeal 60 passport hamilton ebola cyclist abus oliva romero oct 61 roca celler nurs transplant dish liver shred rubber 62 asunta porto basterra cat zara rosario pet amazon 63 rossi mrquez motogp roma contador prix marquez roquebrun 64 volkswagen emiss glencor diesel japan tendon psychiatr maebashi 65 virgin diver parad mourinho sail cuadrado cabifi puls 66 obama nobel efe hire medicin putin carter aguirr

228 k alpha min. topics tf- idf 67 carmichael leak roddin memo babi mitchel questionnair abaaoud 68 disqualif schmidt mexican efe gladbach album indirect bundesliga 69 properti shall buyer condominium transferor deed ownership expressli 70 properti buyer gibraltar januarymay mallorca yorkbas fox airlin 71 bedroom properti finca alcudia cave villa golf miraflor 72 properti homeown buyer rent grade assessor epc certifi 73 torrevieja tapa doln jacquelin los cyclist inland sant 74 dish pan chicken rice squid dice prawn tablespoon 75 pet vet microchip vaccin refuge fenc foncubierta martinez 76 cave roman pilgrimag tower monument jame bara repurchas 77 bedroom properti villa alcudia finca cave golf hotel 78 sail gibraltar tenaci southampton genoa monaco malta steer 79 airlin palma vuel app norwegian airway mallorca puls 80 percent transgen genet nov cancer defacto abus salut 81 aru sec kilomet degenkolb lindeman oliveira schleck gougeard 82 ferrer morisco nava carvaj palma benzema secondhalf volley 83 motogp pilot helmet balloon moto beater freddi geisha 84 golf antequera inland rosel bartomeu nacion audiencia sharm 85 busqueta manel ramosvinola argentina playmak michelin doll tenni 86 hire seatbelt dgt booster frontseat segu directa restraint 87 villanovens tejada pilot puls carvaj holland aspa mirag 88 vatican pope homosexu priest gay balda nov leak 89 refuge gibraltar electr bicycl meter picardo hire nov 90 summon vacation spokesmen existenti manoeuvr montn junta expir 91 photograph lizonepa waelecorbi tim jordanafpgetti cyclist barrientosap miami 92 ramosgetti photograph tower sagrada famlia tallest basilica expir 93 gordon mcdermid refuge mohsen cenaf ruth osama hungarian 94 renf cabl graduat idena indian pene envelop vilafranca 95 fifa soccer efe efefil blatter platini uefa coldplay 96 urdangarn infanta royal blanket palma flood huelva knit 97 photograph lizonepa waelecorbi tim barrientosap cyclist jordanafpgetti aru 98 lockedfals wlsdexcept unhidewhenusedfals semihiddenfals namemedium shade qformattru primark 99 pujol catalana banca gasol ferrusola eurobasket lill mauroy 100 glencor volkswagen emiss hash acorn resin melilla cher

.016 .03 1 charg gordon scotland reproduct campo prohibit los cashpoint 2 hotel murada dalt colon miraflor gala mallorca cappuccino 3 hotel commiss recommend elector celebr referendum britain rueta 4 virgin hotel tradit sea celebr festiv citi town 5 torrevieja tapa hotel enjoy holiday market price park 6 refuge migrant border healthcar hospit melilla rajoy health 7 flight airlin airport rout tourism palma fli carrier 8 mexico oct iglesia singer percent efe mexican billion 9 bike car driver app road traffic passeng rout 10 labour parti referendum vote scottish independ catalan bbc 11 referendum scotland snp scottish sturgeon independ vote parti 12 referendum elector scotland vote scottish commiss ballot bbc 13 respons communiti owner flag irrat neglig function sport 14 pet hous vet your owner travel vaccin microchip 15 snp vote scotland labour parti independ sturgeon referendum 16 catalan mas pujol parti independ commiss court cup

229 k alpha min. topics tf- idf 17 speed rider driver motorbik uci sagan unipubl race 18 referendum robinson score leagu greec game bbcs journalist 19 english snp mps labour hous scotland scottish vote 20 snp referendum independ parti sturgeon vote confer quebec 21 golf club costa del antequera kingdom uefa fifa 22 match nadal game tenni play ferrer titl player 23 match player coach real goal play leagu messi 24 club pedro chelsea player leagu unit bara season 25 passport espanyol casilla applic stielik play real goal 26 atltico minut espanyol goal game piec champion casilla 27 cup season score leagu real stfano club titl 28 aspa sevilla espanyol minut game play leagu season 29 minut bilbao mlaga rossi iraizoz mrquez header lorenzo 30 play rayo elch score goal bale fan getaf 31 leagu catalan footbal club catalonia independ play game 32 fear fli festiv javier san irrat overcom travel 33 messi espanyol game lionel tax player play derbi 34 neymar messi surez ball goal bara minut score 35 shall properti certif communiti purchas debt expens owner 36 minut score ronaldo hattrick game messi espanyol player 37 rider race abus alba sustain motorbik dumoulin reuter 38 sex prostitut worker hospit patient cabl renf parent 39 lockedfals wlsdexcept unhidewhenusedfals semihiddenfals accent namemedium fgm que 40 polic bicycl vuelta cycl belong recov cloud stolen 41 fli court oliv transgen releas genet larva consum 42 mayor parti elect municip carmena fine citi town 43 snp scotland scottish labour frack confer vote tax 44 parti rajoy percent elect vote catalan independ resolut 45 independ catalan vote elect parti catalonia mas proindepend 46 cent per properti buyer sale british search foreign 47 ramosgetti photograph david imag tenaci southampton gibraltar tower 48 jump retir metr citi king quiet juan carlo 49 news aug thelocalat thelocalch thelocald thelocaldk thelocalit thelocalno 50 puls categori citi news tourism menu tourist tag 51 obama commiss version efe english felip juncker cuban 52 tourism tourist sport roca food wine tradit restaur 53 polic arrest costa car provinc tabarca drug suspect 54 rescu boat migrant coastguard helicopt morocco craft sea 55 fire forest weather firefight wind montserrat evacu afternoon 56 bullfight king felip royal urdangarn infanta franco bull 57 sea women water beach rescu lloret mar swim 58 mother asunta parent anim court daughter polic porto 59 arrest consum polic custom network terrorist mobil suspect 60 athlet lorenzo gestur celebr jos accompani twitter sur 61 produc espigolador leftov barba food atkinson imperfect club 62 pari kill terrorist dead injur son victim die 63 properti buyer cent foreign costa per market invest 64 bedroom properti estat real finca alcudia cave villa 65 marina tourist sand banus vega adelson eurovega puerto 66 car hire rental driver insur fuel fee deposit 67 beach blue flag courtesi vilaluca alfr imag fox 68 properti certif rent catalonia grade effici sell owner

230 k alpha min. topics tf- idf 69 dish chicken cook pan rice minut oil recip 70 roman cave rout citi pilgrimag heritag compostela art 71 park theme murcia paramount resort alhama multin cruz 72 communiti document homeown resid societi owner properti census 73 properti hotel commerci rock investor encourag purchas sector 74 enjoy sail beauti citi museum art charter travel 75 vaccin languag diphtheria english health children olot school 76 bust donat construct mayor catalan affili compani cech 77 leagu emeri play player sevilla blown beach footbal 78 froom stage tour aru kilomet fractur withdraw rider 79 tour race froom stage vuelta climb rider contador 80 aru vuelta astana tour stage triumph winner fabio 81 jerez transplant pilot mansel museum race senna eclips 82 park royal london chariti foundat race marathon birthday 83 eta scottish scotland judg warner distinct kill british 84 referendum parti labour snp independ confer scottish journalist 85 mps vote english scottish labour carmichael leak grayl 86 franco percent film award oct nov spaniard design 87 meat cancer health women vaccin export breast spaniard 88 photograph stage waelecorbi tim mile javier lizonepa rider 89 stage dumoulin race vuelta chave tour froom rider 90 strike athlet store compani hrs worker contamin driver 91 dumoulin aru stage sec jersey tour min astana 92 glencor volkswagen emiss car commod price debt market 93 mas independ court catalan novemb artur elect catalunya 94 vote parti independ elect seat referendum scotland catalonian 95 photograph stage peloton mile lizonepa javier barrientosap dumoulin 96 labour basqu corbyn independ scotland vote snp elect 97 independ elect parti catalonia catalan vote cup seat 98 mas independ catalan catalonia parti court seat elect 99 catalan mas parti court vote independ cup seat 100 court deulofeu mass linesman player summon mas play

.054 1 bull prohibit reproduct campo los bullrun para con 2 hotel colon murada dalt miraflor gala mallorca cappuccino 3 hotel walker edit tradit blott carlin manley bellver 4 tapa hotel espigolador leftov barba dialogu imperfect arround 5 messi sevilla leo bara palma jane plugin thorsten 6 airport flight puls categori airlin helicopt hrs rescu 7 virgin diver carmen boat parad tradit celebr effigi 8 neymar gasol bara brazil brazilian pau defraud vermaelen 9 scotland snp labour gordon evan brown cech anniversari 10 pujol clan cdc catalana treasur imput banca egypt 11 magaluf deficit fas optimist hrs capita mallorca scotland 12 snp scotland labour thread ferrier redund morisco antipoverti 13 tourism roca athlet celler gastronomi eat ski castelln 14 snp scotland labour syria bbc student airport fair 15 properti hotel commerci avenida michell earthquak review peer 16 kilomet aru oliveira traffic cairn valverd peloton speed 17 beach drown gallifa coastguard forna rescu greek boat 18 bbc robinson bbcs meat scotland imparti cancer cabifi 19 vaccin diphtheria olot contagion intens scotland dhebron diseas 20 labour snp scotland syria mock student bbc hes

231 k alpha min. topics tf- idf 21 mas sourc dialogu eta cesar dos def con 22 nonsecessionist driver formula prix japan hamilton sainz alonso 23 nadal donat affili tenni ferrer copisa djokov semifin 24 messi bilbao espanyol iraizoz corner refere stielik gracia 25 atkinson coach primera addison kendal soccer toshack venabl 26 manchest soccer coach efe bayern watson proeu chelsea 27 espanyol rescu atltico lloret birmingham casilla piqu neymar 28 stfano argentinian coach retir mas marriag bite sword 29 neymar messi bara espanyol atltico casilla piec angl 30 aspa sevilla espanyol beto emeri casilla pau neymar 31 bale cristiano hes salah belgium car ancelotti tomato 32 rayo elch getaf loan esprito kakuta jonatha emeri 33 asunta porto keeper captain basterra iker casilla rosario 34 casilla academi bosqu payment vicent substitut ownership secondhalf 35 messi espanyol deulofeu hes neymar barca striker rayo 36 linesman atltico bara arrow fullback coentro fbio nacho 37 substitut jose prez corner gameiro toch wellstruck emeri 38 partida mosso com dog atltico desquadra defibril reanim 39 tenaci southampton gibraltar deck malta monaco genoa steer 40 prostitut sex soldier dump troop ezkerra paula vip 41 fgm protocol girl froom crespo fgmaffect angliru garzn 42 scotland cloud volatil properti downturn commerci forgiven hardship 43 dunleavi qvortrup unifi saloufest drunken student festiv connot 44 beach nudist naturist nudism nuditi felip prohibit obama 45 passport forcad nun sick renf cabl com trump 46 pet vet anim microchip vaccin mas dog convergncia 47 refuge migrant syria melilla syrian terrorist isi asylum 48 properti certif shall owner bust rent grade pisarello 49 bullfight uefa prohibit bull anim mas juncker mestr 50 mcdermid balcel singer writer fiction hes misogynist ruth 51 andrea sep sedat festiv columbuss celebr clinic doll 52 document census marriag detain fraudul irregular uncov agent 53 cancer ski storm breast flood snow risen store 54 golf antequera inland anniversari scotland categori klopp ticket 55 properti buyer homeown twe pool tourism bedroom airbnb 56 sail yacht boat port skipper mallorca beach ticket 57 park theme paramount athlet alhama multin lorenzo gestur 58 urdangarn felip royal infanta passeng queen cristina driver 59 bicycl cruyff schmidt navarra characterist bike secondhand johan 60 transplant categori puls donat carolina storm torrenti donor 61 anim dog cat owner shelter pet getaf isil 62 rossi mrquez tourism lorenzo hotel paralys valentino motogp 63 mas pedro chelsea mourinho catalua hrs bara cuadrado 64 mas syria cuba efe cuban obama park russia 65 museum bike vacation app flight park ticket retir 66 googl continent proeuropean graduat wari percent blue nov 67 sand vega adelson eurovega percent sainz marina hotel 68 flight airlin airport tourism palma carrier javier festiv 69 properti buyer tourism passeng airport tradit januarymay fli 70 bedroom properti finca alcudia cave commerci villa asda 71 car hire driver rental traffic debit fill motorway 72 torrevieja park waterpark tabarca earthquak tapa sourc tradit 73 beach blue marina banus puerto award concha httpcommonswikimediaorg

232 k alpha min. topics tf- idf 74 chicken dish guest oliv dumoulin motorbik pimiento spoon 75 roman cave pilgrimag gaud tower motorbik iberia jame 76 pan rice recip dish dice simmer squid chop 77 owner homeown properti sitter irrat concept proxi ownership 78 fli overcom irrat phobia easyjet seminar speed rain 79 fli oliv transgen larva genet transplant oxitec palomar 80 montserrat forest firefight humid evacu benahav mas bst 81 espanyol bale pepe haul stfano cristiano atltico balthazar 82 park royal chariti marathon birthday celebr runner gala 83 aru froom dumoulin contador scan schleck melchor mauri 84 dumoulin chave sec froom degenkolb aru orica valverd 85 sec froom chave lindeman min aru garderen dan 86 motogp japanes circuit moto beater freddi geisha hailwood 87 jerez eta mansel pilot museum bildu chequer gir 88 glencor volkswagen emiss car chines mas prosecutor pedro 89 ramosgetti photograph tower sagrada famlia tallest basilica gaud 90 messi neymar roma coach munir ter stegen corner 91 alv argentina messi prosecutor amazon playmak father document 92 labour grayl carmichael leak fair hes scotland roddin 93 vilaluca alfr fox photograph busqueta manel podemoss tworoom 94 aru dumoulin min sec haimar mano gougeard medina 95 photograph lizonepa javier waelecorbi tim peloton barrientosap alvaro 96 speed driver unipubl uci motorbik mas overtak apolog 97 museum iberdrola review texa nov kit tradit efe 98 fifa anim mas cas abus chess bullfight percent 99 lockedfals wlsdexcept unhidewhenusedfals semihiddenfals namemedium shade qformattru namecolor 100 mas summon investitur catal prosecutor catalua midday provoc

.08 1 prohibit reproduct campo los partial andor para indirect 2 hotel murada dalt colon miraflor mallorca gala cappuccino 3 hotel carniv colon dalt miraflor murada gala palma 4 motorbik hotel uci unipubl castel bellver firewor fundaci 5 tapa hotel ruta salvador murada bier coll martiana 6 melilla enclav correa brcena racist jew jane plugin 7 percent googl startup oct teruel slope pist jane 8 meat cancer brazil japan gram coutinho jamn philipp 9 nonsecessionist breez dugdal kezia stfano coromina lamont tous 10 atkinson addison kendal michell juve peer wallac sociedad 11 lloret sex prostitut girl coggin drown capita fas 12 bbc refuge robinson scaremong jame quiz cervant mackay 13 vaccin diphtheria olot song bilic soldier contagion elkhazzani 14 jerez pilot mansel motorbik jewel cashpoint senna chequer 15 alfr vilaluca fox photograph villa tworoom pene tower 16 greek flood eva isil properti imf armrest alget 17 hire robinson bicycl port secondhand asham putin breez 18 campbel proeuropean stuart wari tomatina abus vacation buol 19 walker blott carlin cuba tabloid editori edit cuban 20 everybodi gordon chat nov homosexu gay stuart magaluf 21 royal banus marathon drown gala tabarca dubious wedg 22 prix getaf vent hamilton iberdrola alonso texa merced 23 nadal tenni djokov semifin ferrer seed muguruza atp 24 bara soccer benitez efe primera bentez rafa cristiano

233 k alpha min. topics tf- idf 25 athlet leo royal sportsmen brigada rfea preparatori tobalina 26 magaluf miami rebrand sant oct moragu hashish afacan 27 nudist naturist nudism nuditi prohibit lorca japan cpr 28 casilla elch bara jonatha getaf esprito kakuta bueno 29 bale cristiano linesman stfano affili copisa ancelotti altruism 30 gea bentez casilla nava iker porto keylor rafa 31 aspa casilla espigolador barba leftov ferro castel verdu 32 iraizoz forcad refuge nun sick percent iker juncker 33 casilla bosqu evan academi transplant ownership secondhalf sociedad 34 bara streak coentro fbio bale roma fifa cas 35 riazor prez gameiro gaya toch elch deulofeu wellstruck 36 cruyff stake netflix johan rastar ajax tower cheapest 37 abus bara insult lfp shakira section banana peel 38 fgm protocol girl bildu salah abdesalam crespo fgmaffect 39 firefight forest montserrat humid benahav cloud flame mosqu 40 cdc documentari hera affili sant lit balloon fenc 41 cesar def eta dos lliur terra grapo lehmann 42 irrat overcom seminar phobia easyjet proxi vaccin babi 43 bara uefa leo gambl prohibit cas klopp chess 44 athlet wifi partida iaaf domnguez passport biolog dope 45 bust pisarello stfano redund shall cosmos raul pilot 46 princess leonor juventus royal vinegar sofa parad actress 47 mosso vegetarian defibril reanim los benzema blackmail tapa 48 aru deulofeu edit melchor mauri contador degenkolb mythic 49 vacation athlet nov girl lpez juncker thirdplac nichola 50 eta jihadist kidnap moren soldier gando oct surfer 51 photograph ceuta coastguard cualedro firefight flame melilla subsaharan 52 rent hotel percent startup riu tunisia app oct 53 properti buyer hotel paramount fox airlin downturn hardship 54 lockedfals wlsdexcept unhidewhenusedfals semihiddenfals namemedium shade qformattru namecolor 55 bara cech brazil roma vermaelen lewandowski volley bartomeu 56 bullfight bull contamin kerri obama prohibit mallorca nov 57 vega adelson eurovega percent hotel golf florida coin 58 bull bullrun flood tordesilla risen bullfight spear forest 59 neuron solo angi cassandra douglass gibb incapacit jeal 60 passport hamilton ebola cyclist abus oliva romero oct 61 roca celler nurs transplant dish liver shred rubber 62 asunta porto basterra cat zara rosario pet amazon 63 rossi mrquez motogp roma contador prix marquez roquebrun 64 volkswagen emiss glencor diesel japan tendon psychiatr maebashi 65 virgin diver parad mourinho sail cuadrado cabifi puls 66 obama nobel efe hire medicin putin carter aguirr 67 carmichael leak roddin memo babi mitchel questionnair abaaoud 68 disqualif schmidt mexican efe gladbach album indirect bundesliga 69 properti shall buyer condominium transferor deed ownership expressli 70 properti buyer gibraltar januarymay mallorca yorkbas fox airlin 71 bedroom properti finca alcudia cave villa golf miraflor 72 properti homeown buyer rent grade assessor epc certifi 73 torrevieja tapa doln jacquelin los cyclist inland sant 74 dish pan chicken rice squid dice prawn tablespoon 75 pet vet microchip vaccin refuge fenc foncubierta martinez 76 cave roman pilgrimag tower monument jame bara repurchas

234 k alpha min. topics tf- idf 77 bedroom properti villa alcudia finca cave golf hotel 78 sail gibraltar tenaci southampton genoa monaco malta steer 79 airlin palma vuel app norwegian airway mallorca puls 80 percent transgen genet nov cancer defacto abus salut 81 aru sec kilomet degenkolb lindeman oliveira schleck gougeard 82 ferrer morisco nava carvaj palma benzema secondhalf volley 83 motogp pilot helmet balloon moto beater freddi geisha 84 golf antequera inland rosel bartomeu nacion audiencia sharm 85 busqueta manel ramosvinola argentina playmak michelin doll tenni 86 hire seatbelt dgt booster frontseat segu directa restraint 87 villanovens tejada pilot puls carvaj holland aspa mirag 88 vatican pope homosexu priest gay balda nov leak 89 refuge gibraltar electr bicycl meter picardo hire nov 90 summon vacation spokesmen existenti manoeuvr montn junta expir 91 photograph lizonepa waelecorbi tim jordanafpgetti cyclist barrientosap miami 92 ramosgetti photograph tower sagrada famlia tallest basilica expir 93 gordon mcdermid refuge mohsen cenaf ruth osama hungarian 94 renf cabl graduat idena indian pene envelop vilafranca 95 fifa soccer efe efefil blatter platini uefa coldplay 96 urdangarn infanta royal blanket palma flood huelva knit 97 photograph lizonepa waelecorbi tim barrientosap cyclist jordanafpgetti aru 98 lockedfals wlsdexcept unhidewhenusedfals semihiddenfals namemedium shade qformattru primark 99 pujol catalana banca gasol ferrusola eurobasket lill mauroy 100 glencor volkswagen emiss hash acorn resin melilla cher

.029 .03 1 prohibit court app bike reproduct los campo beach 2 hotel murada dalt miraflor colon gala celebr fair 3 hotel cappuccino shop restaur visit mallorca port citi 4 hotel fli tapa oliv transgen releas mallorca colon 5 polic arrest victim eta woman car suspect murder 6 flight airlin airport rout tourism palma carrier passeng 7 festiv san javier breez globe euro visit band 8 properti cent per buyer foreign passeng rayo elch 9 referendum vote scottish scotland sturgeon snp elector independ 10 mps vote english scottish labour scotland referendum parti 11 mas novemb catalan consult investig fgm court artur 12 referendum snp independ scotland parti labour scottish vote 13 linesman flag sport cruyff alleg town committe union 14 ramosgetti photograph david imag tower sagrada famlia tallest 15 court pujol investig commiss compani search jordi judg 16 espanyol play player goal stielik establish atltico calderon 17 club leagu player goal manchest unit score season 18 marina banus puerto resort journalist arabia berth saudi 19 snp vote labour scotland parti independ referendum sturgeon 20 sturgeon parti labour confer independ snps babi snp 21 motogp japanes rider formula driver race circuit balloon 22 match nadal messi game play player fifa tenni 23 messi game player match bara neymar ball surez 24 fire jump forest montserrat retir metr afternoon blaze 25 club atltico minut champion uefa game leagu casilla 26 season cup score real club carmichael stfano leagu

235 k alpha min. topics tf- idf 27 minut espanyol messi goal game play bilbao score 28 minut aspa play espanyol game sevilla ball casilla 29 neymar score play goal bara messi minut bale 30 leagu catalan footbal club catalonia independ play game 31 real athlet keeper bentez gea rafa iker coach 32 tourism tourist food roca casilla gastronomi restaur celler 33 game jose substitut surez minut goal deportivo messi 34 tour pedro froom chelsea minut stage fractur player 35 properti certif rent grade effici owner sell abus 36 rossi mrquez lorenzo race sex prostitut walker titl 37 arrest polic terrorist suspect syria terror pari islam 38 citi museum art travel enjoy attract vacation flight 39 glencor volkswagen commod emiss price debt compani market 40 mayor citi elect tourism parti colau hotel municip 41 scotland snp scottish labour vote confer frack tax 42 car tenaci gibraltar southampton deck volkswagen enabl genoa 43 vote independ parti catalan elect mas seat proindepend 44 catalan independ yesterday forcad elect parti nun mas 45 news refuge germani austria itali thelocalno thelocalch thelocaldk 46 efe legisl obama file unit spaniard thursday cuba 47 news catalonia invest foreign output dialogu produc black 48 transplant donat liver hospit organ donor patient health 49 bullfight bull anim town festiv kill bullrun protest 50 sea provinc beach water women rescu mar hrs 51 king felip sport queen franco royal urdangarn media 52 properti buyer costa cent market per foreign sol 53 puls categori news compani tag menu asunta latest 54 passport properti british applic rate hous sale purchas 55 film award star oct singer iglesia actress featur 56 fear fli passeng drive driver travel traffic speed 57 produc espigolador leftov barba food schmidt donat imperfect 58 bedroom properti estat real finca alcudia villa cave 59 pet vaccin travel vet anim microchip insur dog 60 tourist sand vega las adelson eurovega billion euro 61 car park royal hire rental london driver insur 62 torrevieja park market enjoy holiday price waterpark heat 63 beach blue flag courtesi imag award playa concha 64 chicken dish cook enjoy guest relax food oliv 65 roman cave rout citi pilgrimag compostela heritag art 66 rice pan dish recip minut chop dice squid 67 virgin sea town diver boat carmen citi tradit 68 park theme murcia paramount cruz alhama multin film 69 communiti owner respons enjoy sail homeown charter balear 70 properti hotel rock commerci race motorbik pull rider 71 hous owner your write pujol judg sitter email 72 referendum bust elector commiss recommend britain mayor citi 73 document resid fals census foreign meat flat coupl 74 migrant beach boat border refuge drown rescu african 75 tour race stage vuelta rider froom chave dumoulin 76 athlet celebr gestur lorenzo jos concert alberto juan 77 speed rider driver motorbik uci unipubl sagan motorcycl 78 lockedfals wlsdexcept unhidewhenusedfals semihiddenfals accent namemedium jerez que

236 k alpha min. topics tf- idf 79 golf costa del antequera kingdom unit inland sol 80 train passeng strike franco hrs worker memori rail 81 kilomet aru stage ride vuelta mile rider astana 82 deulofeu player play love everton talent scotland roberto 83 independ mas catalonia seat catalan elect parti vote 84 aru stage tour sprint dumoulin vuelta astana rider 85 messi ball goal neymar score ter stegen munir 86 elect independ parti catalan catalonia rajoy romeva negoti 87 temperatur sea ski water snow storm resort weather 88 atkinson renf cabl train gibraltar addison kendal british 89 parti mas catalan cup independ elect vote seat 90 independ catalan catalonia elect mas vote proindepend parti 91 stage sec race froom aru chave min mountain 92 photograph stage waelecorbi mile tim javier lizonepa rider 93 percent store startup compani oct googl zara rate 94 scotland gordon scottish brown contamin currenc british warner 95 percent rajoy parti languag podemo english oct nov 96 court independ vote elect catalan ballot mas mass 97 climb torr vuelta dumoulin polic minut schleck bicycl 98 dumoulin aru stage sec vuelta tour jersey astana 99 catalunya independ vote catalan junt resolut commiss elect 100 photograph stage peloton mile lizonepa javier barrientosap alvaro

.054 1 reproduct prohibit campo los para con resum persona 2 hotel colon murada dalt gala miraflor palma tapa 3 hotel cappuccino port mallorca shop mas pin arround 4 park hotel royal chariti celebr marathon gala birthday 5 atltico eta bara bale nacho sevilla streak fullback 6 bullfight bull anim festiv bullrun gore fiesta car 7 flight airlin airport tourism palma carrier fli passeng 8 deulofeu substitut jose emeri riazor hes bottom gameiro 9 snp labour renf hes scotland mosso dog cabl 10 transplant scotland snp evan deficit liver fas donat 11 golf scotland antequera labour snp anniversari inland rerun 12 snp scotland labour syria student bbc airport hes 13 snp scotland labour alli bbc unsur thread mohedano 14 roman cave pilgrimag gaud atltico tower casilla monument 15 pujol sail yacht clan skipper port catalana boat 16 scotland bbcs robinson bbc ferrer festiv abaaoud putin 17 mas forcad sick nun com monasteri marta carcao 18 fgm protocol girl commerci fli balcel hardship downturn 19 dish chicken pan rice recip simmer chop oliv 20 nadal donat affili tenni copisa djokov semifin rafael 21 bilbao iraizoz refere gracia linesman gorka keeper bara 22 coach soccer efe manchest primera juventus benitez atletico 23 walker fifa blott carlin edit editori tabloid newsquest 24 stfano forest montserrat firefight evacu coach argentinian humid 25 espanyol messi atltico neymar corner piec diego tiago 26 aspa espanyol sevilla beto casilla emeri pau cancer 27 espanyol bale piqu abus cristiano neymar javier rescu 28 rayo elch emeri sevilla getaf jonatha esprito kakuta 29 asunta porto keeper captain bentez basterra rafa gea 30 casilla bosqu vicent nonsecessionist coach substitut academi ownership

237 k alpha min. topics tf- idf 31 neymar bara messi casilla angl keeper bate espanyol 32 corner tomato disarray cristiano argentinian piqu villanovens substitut 33 prostitut sex gasol rain flood storm torrenti pau 34 virgin diver carmen boat celebr tradit fiesta parad 35 passeng airport ticket driver hrs wifi puls categori 36 espigolador barba leftov dialogu documentari hera imperfect juncker 37 bust monarchi pisarello felip queen cesar tradit anomali 38 torrevieja park waterpark tapa earthquak sourc tradit chess 39 jerez mansel pilot museum beach nudist naturist nudism 40 refuge sep percent migrant thiem denis tradit blue 41 mas catalua bayern munich psc olympiako hes investitur 42 refuge syria terrorist syrian isi melilla muslim migrant 43 contamin partida kerri palomar com eva washington ship 44 fli irrat overcom uefa easyjet phobia seminar song 45 document census detain uncov fraudul irregular marriag brigad 46 proeuropean continent student car wari graduat kidnap oct 47 messi neymar stegen ter roma barca munir bayer 48 tourism roca tradit gastronomi celler birth simon coexist 49 park theme paramount multin alhama hotel studio pedro 50 photograph alfr vilaluca fox properti puls categori tower 51 museum bara pedro student nov tan pupil review 52 beach blue award concha httpcommonswikimediaorg gibraltar marina tourism 53 museum vacation flight ticket park stake rastar chines 54 properti hotel commerci avenida forest trucador greenpeac jorna 55 vaccin tourism diphtheria hotel olot paralys contagion andrea 56 mas pope sep homosexu vatican abus gay franci 57 store percent primark retail elkhazzani babi zara inditex 58 rossi mrquez lorenzo valentino motogp yamaha bike sepang 59 pet anim vet dog vaccin microchip cat shelter 60 obama bbc felip cuba cuban award prize scaremong 61 chelsea pedro atkinson cruyff mourinho nurs addison kendal 62 beach rescu boat drown pedro coastguard gallifa helicopt 63 properti passport passeng buyer airport tourism fli beach 64 festiv javier breez globe singer blue ticket airport 65 properti owner certif homeown shall buyer rent grade 66 properti buyer tourism hrs januarymay tradit loan mallorca 67 bedroom properti alcudia finca cave villa commerci golf 68 mas sand vega adelson eurovega percent sainz marina 69 car hire driver rental traffic park debit fill 70 marina banus puerto shop store berth arabia typic 71 fli oliv transgen larva genet oxitec greenpeac categori 72 percent meat oct mas cancer rivera startup googl 73 froom aru kilomet dumoulin scan motorbik oliveira contador 74 dumoulin sec aru chave min froom orica valverd 75 scotland gordon froom bicycl brown prize characterist machin 76 athlet gestur celebr lorenzo labour anniversari royal sur 77 photograph peloton lizonepa javier barrientosap alvaro tim reinaafpgetti 78 motorbik garderen cofidi boeckman bouhanni nacer bike belgian 79 japanes motogp helmet pilot circuit sword moto lap 80 southampton tenaci gibraltar lisbon ship monaco genoa deck 81 bike app retir runner finland tendon underw map 82 puls categori cashpoint eat tradit getaf properti machin

238 k alpha min. topics tf- idf 83 mas summon catal bartomeu rosel prosecutor nacion audiencia 84 schmidt bara mourinho manchest leverkusen gladbach super messi 85 glencor volkswagen emiss car chines mcdermid mas prosecutor 86 neymar bara brazil brazilian santo sourc vegetarian defraud 87 bull netflix singer festiv existenti tordesilla mexican oct 88 messi prosecutor father leo argentina defraud payment citizenship 89 aru mas melchor mauri ceremoni youngest statehood dubious 90 lockedfals wlsdexcept unhidewhenusedfals semihiddenfals namemedium shade qformattru cech 91 athlet iaaf ski watson marta proeu domnguez biolog 92 labour grayl degenkolb fair sbarag froom hes sec 93 rescu lloret birmingham beach girl aston coggin lisa 94 photograph waelecorbi tim javier lizonepa jordanafpgetti peloton jose 95 speed driver uci unipubl motorbik busqueta manel apolog 96 dumoulin aru cdc schleck haimar mano bst affili 97 carmichael leak roddin memo mitchel questionnair rato euan 98 mas urdangarn dialogu infanta investitur sourc cristina catalua 99 volatil snp denounc princess parad iberdrola leonor fundamentalist 100 photograph ramosgetti tower sagrada famlia tallest basilica bildu

.08 1 prohibit reproduct campo los disqualif indirect persona para 2 hotel dalt colon mallorca murada gala miraflor cappuccino 3 hotel cdc uni jame carniv carrer rueta iii 4 hotel monument castl castel sant mir bellver firewor 5 tapa hotel meat colon cancer ruta bier coll 6 hotel ram fira fun murada feburari suday till 7 flood risen googl plugin thorsten geoff hake jane 8 athlet argentina royal rfea brigada sur preparatori tobalina 9 dugdal hire contador unsur thread kezia mohedano percent 10 percent startup googl oct michell app chess nov 11 photograph lizonepa waelecorbi tim barrientosap aru cyclist campo 12 bbc espigolador leftov barba refuge quiz jame cervant 13 kilomet aru oliveira bildu eta insidi arraiz saltir 14 virgin diver parad sail lloret maroto gay abbey 15 sex prostitut cairn downturn hardship anymor abaaoud airlin 16 rossi bullfight mrquez bull prix motogp greek ebola 17 robinson bbc fifa scaremong putin asham michelin soccer 18 campbel proeuropean stuart wari abus tomatina buol googl 19 bale cristiano bara stfano ancelotti streak altruism coentro 20 transgen mcdermid genet workingclass statehood dubious wedg centralistmind 21 prohibit reproduct campo los disqualif indirect persona para 22 nadal tenni djokov semifin villanovens fognini novak ramosvinola 23 soccer sociedad benitez efe primera clinic prez argentin 24 aspa casilla dgt ashley segu hacker databas maddison 25 breez stfano hutch athlet smell agricultur greenedg mafia 26 casilla bosqu secondhalf ownership academi sociedad elch seatbelt 27 iraizoz iker bara brazil coutinho argentina uruguay palma 28 elch alfr vilaluca fox photograph getaf esprito kakuta 29 casilla gea nava iker bentez keylor porto rafa 30 bara casilla transplant urdangarn infanta clinic contamin liver 31 deulofeu elch leo gameiro gaya toch riazor wellstruck 32 athlet japan psychiatr jumper tendon achill maebashi iaaf

239 k alpha min. topics tf- idf 33 melilla abus mourinho bara racist ceuta enclav shakira 34 fgm protocol girl crespo fgmaffect pollut nov whale 35 walker edit blott carlin cloud tabloid editori mythic 36 documentari hera lit fenc nobel japan preston medicin 37 eta cesar dos def jihadist glorifi lliur grapo 38 percent nov oct thanksgiv emoji sex balloon adri 39 bull bullrun nudist naturist nudism nuditi prohibit tordesilla 40 rubber vat shred dump moren gando cpr coastguard 41 chicken dish pan rice vent coin turmoil neverend 42 banus partida typic bob royal redund usca renf 43 forcad nun sick electr meter refuge uni bara 44 bust pisarello gasol eurobasket shall pilot mallorca magaluf 45 athlet thiem denis isil uruguay astorga highestearn beliz 46 roman cave pilgrimag tower jame monument contador snore 47 nonsecessionist mosso tabarca los defibril reanim manual scent 48 glencor forest montserrat firefight volkswagen humid benahav emiss 49 rosel bartomeu pet cat nacion audiencia cecilia dump 50 vega adelson eurovega percent golf hotel airbnb amsterdam 51 properti rent puls mohsen cenaf osama refuge galn 52 aspa muguruza nolito ferrer richter wta garbi seed 53 app hire miami florida forest android cyclist puls 54 evan cancer academi aru lamont tan climber angliru 55 passport hamilton pirotecnia zaragozana garrapinillo utrecht squid procur 56 motogp balloon beater freddi geisha hailwood katana motegi 57 song ham bilic cancer upton album salazar mammogram 58 properti shall deed condominium transferor babi schmidt buyer 59 sail hire port mallorca refuge fun croatia oct 60 lloret bicycl girl drown coggin secondhand port bather 61 properti buyer januarymay fox airlin mallorca asian yorkbas 62 bedroom properti villa finca alcudia cave golf miraflor 63 properti homeown buyer drown ownership bedroom euribor dorada 64 torrevieja tapa coromina tous burglar refuge doln ruiz 65 gibraltar southampton tenaci steer monaco deck sail genoa 66 properti hotel rent paramount buyer grade assessor epc 67 pet vet asunta vaccin microchip porto basterra rosario 68 pan rice dish dice squid prawn tablespoon audaci 69 irrat overcom easyjet seminar phobia proxi airlin ibi 70 vacation spokesmen indian existenti manoeuvr montn expir rhino 71 aru roca schleck celler melchor mauri contador degenkolb 72 bedroom properti alcudia finca villa cave golf miraflor 73 airlin palma gordon norwegian vuel pilot mallorca tejada 74 vaccin diphtheria olot contagion hyperloop morisco ahlborn infecti 75 puls nurs wifi prohibit oct carl wherev pop 76 motorbik uci unipubl gibraltar lorca picardo poet correa 77 roma summon cech lewandowski oct alcantara volley olympiako 78 jerez pilot mansel motorbik photograph senna herrero chequer 79 royal marathon gala watson actress oscar coastguard doll 80 golf antequera inland solo angi cassandra douglass gibb 81 milk bara hotel hire tunisia riu dairi pakistan 82 bara uefa brazil salah cas abdesalam fifa prohibit 83 juventus efe wolfsburg psv porto zenit lyon moscow 84 linesman volkswagen emiss diesel jimnez carcao dealership nox 85 pujol catalana banca parad cabifi princess leonor ferrusola

240 k alpha min. topics tf- idf 86 cruyff johan magaluf miami kidnap ajax rebrand cancer 87 atkinson bara addison kendal carvaj bentez ferrer sociedad 88 pope cuba vatican homosexu cuban franci gay soldier 89 affili copisa metro mosso flu vaccin jihadist luiz 90 lockedfals wlsdexcept unhidewhenusedfals semihiddenfals namemedium shade qformattru namecolor 91 obama mexican efe efefil ngela mango percent michell 92 refuge juncker fenc resettl obama oct nov bashar 93 renf cabl zara primark inditex electr puls iberdrola 94 nov vuel juve envelop alonso powder airlin googl 95 sec aru degenkolb lindeman gougeard sardinian campo blanket 96 carmichael leak roddin memo cashpoint mitchel questionnair flood 97 stake amazon netflix graduat rato rastar idena kradonara 98 glencor volkswagen emiss busqueta manel tower deciph encrypt 99 photograph lizonepa waelecorbi tim jordanafpgetti cyclist barrientosap balcel 100 photograph ramosgetti tower sagrada famlia basilica tallest gaudi

241

Vita Samuel Stehle

Samuel (Sam) Stehle graduated from the University of Utah’s Geography department in 2010 with a Bachelor’s of Science degree and an undergraduate minor in

Computer Science. While at Utah, he worked for the Digitally Integrated Geographic

Information Technologies (DIGIT) Lab as a GIS Analyst for two and a half years. His studies included GIS and Remote Sensing, ultimately gaining Certificates in both concentrations. He joined Penn State’s graduate degree program in 2011 to pursue a

Master’s in Geography. Sam’s Master’s thesis, successfully defended in 2013, was titled

“Pattern Matching via Sequence Alignment: Analyzing Spatio-Temporal Patterns and their Distances.” He immediately re-joined the Geography department to pursue a PhD, and complete the current project.

While a graduate student at Penn State, Sam has been a research assistant, teaching assistant, and course instructor. As a research assistance, he helped develop the

STempo pattern discovery environment and helped analyze social media data for emergency response. He has assisted on courses in urban geography and introductory

Geographic Information Systems. Sam has also instructed the Introduction to GIS course, and developed and taught a brand new course on GIS for Energy Land Management for the Energy, Business, and Finance department.

Sam will begin a postdoctoral research assignment at Maynooth University in

Maynooth, Ireland in fall 2017. The project merges statistics, big data, and visual analytics to create a dashboard for urban data based in Dublin.