Comparing surveillance by governmental actors and commercial actors: An LDA-assisted analysis of news media discourse

Keywords: discourse, framing, topic modelling, LDA, surveillance, privacy, news media, digital humanities, methodology

Abstract This research compares the American news media discourse on technology-practices of surveillance and matters of privacy in the context of government actors with the discourse on such practices and matters with regard to commercial actors. To this end, a hybrid, mixed-method methodology is developed. A Python-based topic modelling analysis is combined with a qualitative content analysis inspired by the discipline of discourse analysis. The corpus for the topic modelling consists of tens of thousands of texts taken from the News on the Web corpus. The corpus for the qualitative analysis consists of relevant articles identified via the topic modelling analysis, amongst other text mining procedures. These are articles in which surveillance, as well as other technology-practices related to privacy, are either negatively or positively framed. The analyses showed that government practices of surveillance are often critically approached from a legal perspective and positively evaluated from a national security perspective. In the commercial actor-discourse the positive and negative evaluations of surveillance can be regarded as a struggle between economic benefits and consumer convenience on the one hand, and information security on the other. In both discourses, privacy is the most talked about issue in negative framing, as other studies have also shown. Furthermore, technological properties are found to play a big role in both discourses, especially in surveillance-critical arguments. Evaluating the effectiveness of the methodology of this study, which can be regarded as experimental, it is found that topic modelling is a promising way to improve the objectivity of the standard, keyword-based manner of selecting a corpus for a content or discourse analysis, and to increase the representativeness of the eventual findings.

A Master Thesis for the MSc Communication and Information Sciences program of Tilburg University
Author: Chantal van Elden
First reader and supervisor: dr. Emmanuel Keulers
Second reader: dr. Ruud Koolen
Submitted: July 27 2017

Table of contents

1. Introduction
2. Theoretical framework
2.1. The characteristics of data-driven surveillance
2.2. Issues concerning surveillance
    Ethics: privacy harm and discrimination
    Philosophical and practical criticisms
2.3. Public discourse and pending questions
3. Methodology
3.1. Discourse Analysis, Content Analysis and Natural Language Processing
    Various approaches to discourse research and their theoretical and practical ramifications
    The methodological focus of this thesis
3.2. Method for the quantitative analysis
    The research corpus: origins and pre-processing
    Extraction of relevant articles: NER, keyword identification and collocation
    Topic modelling with LDA
3.3. Method for the qualitative analysis
    The implications and context of the corpus
    The coding framework
4. Results
4.1. Government surveillance and commercial surveillance compared in topics
4.2. Comparing the use of frames in the context of different kinds of actors
    The results of the topic modelling search for relevant articles
    Categorising the articles
    A closer look on the issue of privacy
5. Conclusion
6. Discussion

References

Appendix

1. Introduction

Societies around the world are becoming increasingly ‘digitalized’: human interaction, machine interaction and human-machine interaction nowadays largely take place via computer-facilitated communication channels, and much information is converted to or created in digital forms that can be accessed via the internet. This digitization has had, and continues to have, great effects on the way people relate to the world and on the way knowledge is generated,1 which makes the technologies, practices and discourses that constitute this digitization a popular subject of critical research into the social, cultural or ethical implications and consequences of certain dimensions or specific cases of digitization. One of the prime digitization-related subjects that has been studied in this regard is the subject of surveillance, the “systematic investigation or monitoring of the actions or communications of one or more persons” (Clarke 2012). Following the digitization of society as a whole, surveillance has also become increasingly data-driven, as the actions and communications that are observed via systems of surveillance are nowadays more often than not virtual: they consist of data or metadata that are the result of online (social) action. These data are then collected and analysed to monitor or predict the characteristics or behaviour of certain individuals, groups, or human behaviour in general, via data mining methods.2 These data-driven surveillance practices are for example executed by security agencies, which search for meaningful patterns or relationships in certain large collections of data such as aircraft passenger-name records in order to identify high-risk individuals and subject them to pre-emptive security measures such as extra airport security checks or a prohibition to fly (Leese 2014). 
Commercial actors engage in similar analyses, though for very different reasons: banks and insurance companies for example may use predictive data mining to automatically assess a potential customer’s creditworthiness, and retailers may use data mining to recommend certain items to the visitors of their online platform based on their overall web behaviour – all to the end of minimising costs or increasing revenue (Cheng et al. 2015, 6). Within various academic disciplines, and then especially the field of culture studies, critical analysis of such ‘technology-practices’3 of surveillance has become increasingly popular.

1 For example, according to the much-cited sociologist Castells (2011) the “revolution of communication technologies” has led to the emergence of a global ‘network society’, which is a new social structure in which the old limitations of networked interactions have been eliminated for users of new network technologies. Similarly, Mayer-Schoenberger and Cukier (2013) observe that the ‘datafication’ of society has led to a new epistemological paradigm, where the increased quantification of social action in the form of analysable data has led to radically new ways of creating (perceived) knowledge about social behaviour (in: Van Dijck 2014). It must be noted though that the ‘new’ communication- and information technologies that drive digitization are neither completely new (Bolter and Grusin 1996; Agar 2006; De Vries 2012; Beer 2016) nor determining for social conditions (Sismondo 2010).

2 Data mining is the process of “discovering novel, interesting, and potentially useful patterns from large data sets and applying algorithms to the extraction of hidden information”, typically in order to “build an efficient predictive or descriptive model of a large amount of data that not only best fits or explains it, but is also able to generalize to new data” (Cheng et al. 2015, 1).

3 With ‘technology-practices’ I mean the assemblage of technical, cultural and organisational aspects that make up a technology as it occurs in society; it is the object, or collection of objects plus the technical processes that sustain them, as well as its use and embedding in society (Pacey 1983, 5-6).

This research can roughly be placed in two categories: descriptive studies – either theory-oriented or case-oriented – and discourse research. Where critical descriptive studies are concerned, authors from various fields of study have voiced concerns about the consequences of both governmental and commercial surveillance practices for people’s right to privacy and the risks of inadequate data security (e.g. Tavani 1999; Van Wel and Royakkers 2004; Millar 2009; Hull et al. 2010; Bauman et al. 
2014; Lyon 2014; Van Dijck 2014). Moreover, in the case of pre-emptive security measures based on predictive data mining, there are warnings of discrimination and of jeopardizing the legal principle of presumption of innocence (e.g. Amoore and De Goede 2008; Kerr and Earle 2013; Leese 2014). Where discourse research is concerned, scholars have identified various patterns in the way people talk about and make sense of the technologies and practices related to surveillance in the media. Researchers have for example identified certain ‘frames’ that are commonly used to talk about specific technology-practices or controversial cases of surveillance, like the ‘surveillance can counter crime and terror’-frame, the ‘surveillance has led to an Orwellian dystopia’-frame, and the ‘privacy and digital services are inherently at odds’-frame (Bernard-Wills 2011; Lischka 2015; Mols and Janssen 2016). Interestingly, existing discourse research seems especially focussed on surveillance in a state context, and consequently on privacy in the context of police action, security services and government legislation, while surveillance is also executed by commercial actors – for example in the form of targeted advertising. The surveillance practices by governmental actors and commercial actors are also inherently intertwined: national security agencies for example often work together with commercial actors such as telecom providers and airlines to get the data they need (Amoore 2009; Lyon 2014). The relative lack of attention to surveillance practices by commercial actors in discourse-analytical studies makes it relevant to compare the discourse about technology-practices of surveillance by government actors on the one hand and the discourse about such practices by commercial actors on the other. Are privacy, discrimination and other ethical issues that accompany surveillance talked about in a different way or to a different degree? Are certain frames more dominant in one discourse than in the other? 
Is the one discourse more positive or negative than the other? Or to summarize all these concerns into one coherent research question: How are technology-practices of surveillance by government actors typically problematized in current-day news media discourse, and how is this done for similar practices by commercial actors? Next to these unanswered questions, existing discourse research into technology-practices of surveillance has left open some other opportunities for new research – opportunities located in the domain of methodology rather than theory. Currently, discourse-analytical research questions are generally answered by means of a qualitative analysis of content, in the spirit of a certain methodological tradition of (critical) Discourse Analysis or its more social-sciences oriented cousin, Content Analysis. For example, texts and sentences are studied and placed in certain predetermined meaningful categories (e.g. Lischka 2015), or frames are inductively derived from close reading (e.g. Möllers and Hälterlein 2013; Mols and Janssen 2016). This kind of research sometimes also has a quantitative dimension, where for example the number of positive frames is diachronically compared to the number of negative frames
(e.g. Fiss and Hirsch 2005). However, combining qualitative analysis with text mining techniques from the field of computational linguistics on large numbers of news media articles – like Kutter and Kantner (2012) and Koteyko, Jaspal and Nerlich (2011) have done for journalistic accounts of EU military intervention and global warming respectively – does not appear to have occurred yet in research concerned with the subject of surveillance. The relative lack of studies combining the strengths of qualitative analysis and quantitative analysis, in general as well as regarding the subjects of surveillance and privacy, makes it highly interesting to engage in such a hybrid, mixed-method study, and critically reflect on the procedures and results. Furthermore, when text mining techniques are combined with a qualitative analysis, these are typically techniques like collocation (e.g. Kutter and Kantner 2012; Koteyko, Jaspal and Nerlich 2011), concordances (e.g. Kutter and Kantner 2012; Branum and Charteris-Black 2015), and simple statistical measures like comparing the presence of certain keywords in different corpora across time, or testing whether the presence of certain frames inside a corpus is statistically significant (e.g. Fiss and Hirsch 2005; Hoffman 2011). The more complex topic modelling approach on the other hand, a Bayesian unsupervised machine learning technique that has been successfully used to analyse word patterns in a variety of large corpora (e.g. Newman et al. 2006; Maskeri, Sarkar and Heafield 2008; Mimno 2012; Rosen and Shibab 2015; Savov, Jatowt and Nielek 2017), does not appear to have often been embedded yet in a study involving research questions that belong to the domain of the humanities – though there have been some recent examples of such research (e.g. Jaworska and Nanda 2016; Van den Bos and Griffard 2016; Livermore, Riddell and Rockmore 2017). 
Lastly, since the research corpora for discourse research, regardless of the subject of interest, are typically constructed using a subjective, limited list of keywords, it is also highly interesting to explore different, more objective ways of selecting texts for analysis and reflect on their implications. Can text mining, and then specifically topic modelling, be used to identify relevant texts for a qualitative analysis? Thus, to answer the research question voiced earlier in this chapter, I will combine a topic modelling analysis of a very large corpus of recent English-language news media articles with a qualitative analysis of a sample of such articles that is selected via text-mining methods, while drawing upon theory from the fields of surveillance studies, new media studies and computational linguistics. The specific topic modelling method I will use is the popular Latent Dirichlet Allocation (LDA) algorithm. Using this technique, I will compare the word patterns in the discourse about governmental technology-practices of surveillance with those in the discourse about such practices by commercial actors. The results of this analysis will also be used to identify relevant texts for the discourse analysis. This research will be guided by the following two sub-questions:

SQ1: What concepts are commonly associated with privacy and surveillance in news media articles about commercial actors and governmental actors respectively?

SQ2: How are technology-practices of surveillance described, justified and critiqued in the news media discourse about governmental actors and commercial actors respectively? And what is the role of privacy in these discourses?

The answers to these questions will not only further academic understanding of the associations people have with surveillance and privacy in the context of different practices by different types of actors, as well as the strengths and weaknesses of mixed-method studies, but will also have clear societal relevance. As Möllers and Hälterlein asserted, privacy is a concept that carries meaning within a society, while being culturally and historically variable – and thus open to investigation of what it means exactly within a certain time and space bound context, and how this meaning influences technological development and political policy (2013, 58). The same applies, perhaps to a somewhat lesser extent, to the concept of surveillance. An analysis of discourse is a fitting way to carry out these investigations, as discourse both influences and reflects people’s fears, desires, needs and understanding of surveillance practices and privacy rights, which in their turn influence and reflect the way people act with regard to these surveillance practices and associated technologies and regulations. This means that for policy makers, activists and technology producers concerned with surveillance practices, it is highly relevant to understand what words the media – and thus the people in general, taking the media as spokespersons for (a part of) society at large as well as an influence on people’s view of the world – use to talk about privacy and surveillance; what this may mean for their view of the appropriateness of certain technology-practices; and whether and why there is a discrepancy between the way they view governmental actors and commercial actors in the context of these technology-practices. 
As the emergence of new technology-practices and regulations always takes place within a societal context of meaning, explicated in discourse, understanding this discourse furthers understanding of what regulations are necessary, what technology-practices are desirable, and what information may need to be provided to the public to correct their understanding of surveillance and privacy in light of academic research into the nature and problematics of these topics. The remainder of this thesis will be structured as follows. In chapter two I will provide an overview of the critical scientific discourse about the subject of data-driven surveillance. Here, attention will be paid to the ontology and historical significance of this type of surveillance; the (ethical) problems associated with it, amongst which harm to privacy; and the role of public (news) discourse. In chapter three I will discuss the methodology of this research, starting with a brief exploration of the fields of Discourse Analysis and Content Analysis and the positioning of this thesis within these fields, followed by a detailed description of the research corpus and the steps taken for the quantitative and qualitative analysis respectively. The fourth chapter will be dedicated to the actual analyses: the topic modelling of the government actor-centred corpus and the commercial actor-centred corpus and the discourse analysis of a sample of relevant texts from these two corpora. Lastly there will be a conclusion and discussion chapter.

2. Theoretical framework: Surveillance and privacy in contemporary societies

In this chapter I will explore in detail the concept of data-driven surveillance and the characteristics of technology-practices that belong to the domain of such surveillance. Moreover, I will discuss the main ethical issues, cultural implications and efficacy issues associated with surveillance practices in humanities discourse, which will provide a useful point of reference for my analysis of news media discourse. The last subchapter will be dedicated to the main results of previous discourse research into the subject of surveillance and the relevance of this for this thesis.

2.1 The characteristics of data-driven surveillance

Data-driven surveillance is a large and complex concept that encompasses and relates to a variety of different technology-practices such as CCTV in public spaces, GPS location feeds on smartphones, internet cookies, predictive policing, algorithmic censorship and targeted online advertising. These are technology-practices of different kinds – from an inherent characteristic of a technological device that is necessary for some, many or even all its functions (e.g. location feeds) to a collective term for various specific practices of acting upon predictive information (e.g. predictive policing and targeted advertisement) – that have in common that they either require or produce user data and either rely on or support predictive data mining. It is important to note that present-day, data-driven surveillance is not a completely new phenomenon. Rather, it should be regarded as an augmentation of pre-existing trends in surveillance (Lyon 2014, 4). The mathematical and computing sciences that underlie the communication- and information technologies that make data-driven surveillance possible have always been of influence on the practices of the military and state (Amoore 2009, 54), and governmentality has been characterised by the increasing creation, collection and use of data since the dawn of the 20th century (Grew 1984). The past couple of decades though, such procedures have become cheaper and easier to use, as well as more powerful, backed by automation, high computing power and the inherent intertwinement of technologies of surveillance with services people use in their everyday life, such as email, social media and banking. This has had various effects on the nature and consequences of surveillance that distinguish it from traditional forms of surveillance. 
The first of these effects is the high opaqueness of surveillance: while the ordinary lives of people are increasingly transparent thanks to the documentation of online activities, the actors that are exercising surveillance are increasingly invisible – rather than a physical camera holding watch or a clerk asking you questions, it is often computer programs that observe and take note. The second effect is that next to governmental actors, commercial actors now have the means for surveillance as well: the technologies that can be used for surveillance are highly accessible to anyone with the right resources – human and monetary – and can be applied for many different purposes. Related to this, the role of monetary incentives is also distinctive of current-day surveillance. In commercial surveillance, this is obvious: all data is used to create financial value, either by selling it, analysing it for eventual profit, or optimising one’s services. In governmental surveillance, the role of monetary incentives is visible in the choice for data-driven surveillance in the first place, as the hardware, software and data needed for this are oftentimes cheaper than security measures that rely on human labour (Lyon 2014, 6). Lastly, data-driven surveillance – as well as any other form of surveillance, really – is also characterised by a dependency on commercial actors, government actors and civil actors at the same time. All the specific technology-practices that have been mentioned so far are connected to, or even controlled by, various kinds of actors at different stages of their actualisation – sometimes even to such an extent that they “exceed any clear distinction between military/civil/commercial spheres” (Amoore 2009, 49-50). All surveillance requires technologies that need to be produced and distributed commercially; regulations that allow for their actualisation; and a public sphere of individuals to engage with. The PRISM program by the National Security Agency (NSA) of the United States for example, which gives the anti-terrorism agency unencrypted access to huge amounts of (meta)data about users of commercial services like Google, Facebook, Verizon, Microsoft and YouTube, needs the co-operation of these companies – under force of the rule of law, appeals to their commercial interests or via other means of persuasion –, as well as the co-operation of the public in the form of them using the technologies the NSA monitors and accepting the rightfulness and legality of the security agency’s practices. 
The latter is something which has become unstable since the whistleblower Edward Snowden leaked internal documents detailing the previously covert surveillance in 2013 – an act that led to criticism of the NSA by various media, (American) political actors and civil rights organisations. The consequence of the leaking and the ensuing public debates was that various “technical and legal responses” occurred, leading to greater civil and inter-country distrust as well as the discontinuation of some of the programs (Lyon 2014, 9). At the same time, there were governing bodies and other actors that, rather than critiquing the practices, claimed them to be necessary to protect national security and keep public order (Bauman et al. 2014, 140). Opinion polls showed the US public to be quite evenly divided between condemning Snowden for threatening the operations of the NSA and supporting him for exposing the depth of the surveillance (141). The surveillance by the NSA was criticised on various specific argumentative grounds, many of which also play a role in criticisms of other technology-practices of data-driven surveillance. In the academic sphere, and then specifically in the fields of surveillance studies and (critical) technology or new media studies, the critiques can be grouped in roughly three categories: ethical issues, long-term socio-cultural implications and a lack of efficacy.

2.2 Issues concerning surveillance

Ethics: privacy harm and discrimination

Where ethical issues are concerned, the biggest issue that accompanies surveillance – or at least the most talked about issue – is privacy. As the goal of surveillance is always to gather information about people, the nature of this information – private, not private or of some undetermined in-between variety, depending on the context the information is taken from and used in –4 can make the act of surveillance desirable or undesirable from a civil rights perspective, and legal or illegal regarding the existing regulatory frameworks applicable to the surveillance in question. This is rarely a straightforward, unambiguous assessment though, as privacy is a slippery social process that knows many different dimensions or varieties, from bodily privacy – which encompasses a range of rights such as freedom from torture and the right to keep bodily characteristics private –, to behavioural privacy – which encompasses things like the right to a private place and personal autonomy (Michael and Clarke 2013, 221). Where data-driven surveillance is concerned, privacy critiques are focussed especially on the privacy of information and/or the privacy of communication: people’s right to control and have knowledge of what happens with information or (meta)data about themselves, and the right to communicate without being monitored (ibid; Van Wel and Royakkers 2004, 130-131).5 The collection and analysis of personal information for surveillance purposes is often perceived to violate one or both of these principles, as personal data is either used without people’s consent (the way for example the NSA analysed personal data), or used in highly opaque ways, where people may have consented to information like the location data of their phone to be harvested and used by a certain commercial actor, but not (knowingly) consented to that data being sold to third parties (Michael and Clarke 2013, 224). 

4 For interesting reflections on the context-dependent nature of privacy, where information that is public in one context can become private in another, see for example Nissenbaum (2011) and Tavani (1999). Another interesting take on privacy can be found in the texts by the philosopher Millar, who for example argues that it is more productive to reflect on the nature of new information extracted from information gathered via surveillance than on that information itself, as it is especially that new information that can be harmful to privacy: they are inferences about a person that may point to personal facts that they do not want to disclose about themselves (2009). For a general overview of privacy research in various disciplines, see the meta-study by Smith et al. (2011).

5 Different authors may distinguish different types of privacy, that encompass different elements. Moreover, definitions of privacy are also culturally dependent, as is the value of it. Indeed, as the sociologist Lyon has also noted, the focus on the value of privacy in much of the debate on surveillance shows the dominance of western, liberal legal traditions in this debate (2014, 10). It has been argued that these issues complicate privacy as an effective means of keeping surveillance in check (Möllers and Hälterlein 2013, 57-58).

Two other problems that accompany data-driven surveillance and associated technology-practices are the interrelated issues of discrimination and harm to the presumption of innocence. The law theorists Kerr and Earle for example have expressed concern that the belief in the accuracy and necessity of data-driven surveillance for security purposes will lead to “a fundamental jurisprudential shift from our current ex post facto system of penalties and punishments to ex ante preventative measures”, meaning that people are treated as being guilty until proven otherwise instead of the other way around (2013, 66). This fear has a lot to do with the earlier mentioned concept of pre-emptive action: the limiting of the range of future options of a potentially risky person, either based on an algorithmic prediction of the likely consequences of allowing a person to act in a certain way, or some other, more traditional form of risk evaluation (67).6 Systems of data-driven surveillance have provided security actors with new kinds of information to base pre-emptive action on, which Kerr and Earle observe to be problematic in the way that these systems and information, and the assumptions of accuracy and necessity underlying them, make it increasingly common that prediction replaces the need for proof in matters of national security (69). 
Here, some might argue that prediction is in fact sufficient proof, whereas others might argue that pre-emption based on surveillance-assisted prediction gives way to discrimination, dehumanisation and unjust punishment, while people are “unable to observe, understand, participate in, or respond to information gathered or assumptions made about them” (71). Indeed, there are various examples of technology-practices of surveillance that have subtle discriminatory effects, both in governmental spheres and commercial spheres. In a case study of automated credit scoring with data collected via, amongst others, surveillance methods, the law theorists Citron and Pasquale showed how the non-transparent process of predicting the creditworthiness of potential customers in the US banking industry leads to arbitrary results that in many cases also particularly disadvantage women and minority groups, making the scoring system a self-fulfilling prophecy (2014, 8-18). Though there is anti-discriminatory legislation in place for credit scoring practices, which has been successfully invoked in some cases, the costs of going to court, the general lack of oversight and the black-box nature of credit scoring severely complicate taking action (ibid). Something similar is visible in surveillance-based structures of national security: forms of discrimination that can occur in regular policing work, such as racial profiling, can be extended to data-driven forms of profiling,7 which rely on creating strict group distinctions based on measurable variables (subject A has characteristic B, thus fits in category X). 
Here too, policy makers can implement structures to prevent issues like discrimination, such as is for example attempted in the PNR proposal of the European Commission,8 though the question remains whether data-driven surveillance and its uses are often not so far-reaching, black-boxed and subtle that they “might no longer be grasped by the direct and indirect anti-discriminatory approaches of the law” (Leese 2014, 500).


6 Pre-emptive measures are taken all the time for the sake of security and public order. When there is a public demonstration about a sensitive matter for example, extra police security will be allocated, and populations of entire countries can be barred from entering a certain state.

7 In profiling, a data-subject is created that represents an individual. This can be accomplished using all sorts of data about a person, from browsing behaviour to social media messages or census data, in which relevant correlations are sought, either between the various parts of the data subject or between the subject and other data subjects (Hildebrandt 2006, 19-20).

8 The PNR directive is a set of EU-wide rules that regulates the use of ‘passenger name records’ (PNR data) for security purposes. The directive includes various privacy-protective and anti-discriminatory clauses, such as the rule that data needs to be anonymised after six months; that it is forbidden to collect sensitive information such as sexuality and race; and that whether someone is subjected to a security check on the basis of the data always needs to be a human decision (Van Elden 2016).

Philosophical and practical criticisms

On a more abstract and philosophical note, critical scholars have argued that current-day practices of dataveillance have intensified the anticipatory nature of surveillance. New data sources and data mining techniques brought promises of even more accurate classification and prediction, which strengthened the anticipatory desires of surveillance and brought about a shift in the thinking about risk and security with harmful implications for the power relations within society. This shift entails a change in the focus of government action towards the future rather than the present or the past; towards pre-emption and the managing of consequences rather than trying to understand the causes of social problems (Lyon 2014, 6) – this while the latter may often be a more effective way of protecting society from disorder and violence, though it is arguably also more difficult and expensive (Agamben 2013). 
The shift in thinking about risk also entails a change in the conception of the future itself: rather than being viewed as “the predictable outcome of present trends or past occurrences”, the future is problematized as an “open set of endless possibilities” that is thus inherently risky and dangerous (Anderson 2010, 792-793). These observations resonate with the thesis of the Canadian philosopher Massumi that after the 9/11 attacks, the American collective consciousness entered a stage of perpetual insecurity that is symbolized by color-coded alert systems that know only stages of risk, no true safety (2005). The logic behind this kind of security system is one of political power and governmentality: by turning a threat, which is by definition unknowable, into a tangible cause of real action, political power can be wielded and acquired (35). This means that governmental technology-practices of surveillance are in a way both the cause and the remedy of a threat and that they, as the political geography scholar Amoore argues, rely upon a war-like architecture of us versus them and safe versus unsafe, where the lines between categories of risk and threat are typically drawn based on hidden associations within data (2009, 51). 
The entirety of the global collection of data-driven security systems that is built upon these various logics has been argued to be so ubiquitous that it is a real-life approximation of Deleuze’s famous ‘control society’ (Lyon 2014, 2) – an envisioned society characterised by radical transparency and intertwined networks, in which people no longer move from one separate space within society to the other, each with its own system of discipline (for example, from school to work to home), but in which all the spheres of society have melted together and are supervised and controlled by the same, often invisible complex systems (Deleuze 1992, 4-6).9 In a similar manner, the surveillance practices of commercial actors – which entail the profiling of people as consumers rather than citizens and potential enemies – have also been argued to form such a ubiquitous system that has become an inherent logic of society. An example is the critique by Campbell and Carlson, who view commercial surveillance as an exercise in efficiency that is characteristic of modern capitalism in general. According to them, commercial surveillance is in its essence the harvesting and processing of potentially profitable information in order to ensure “the greatest possible extraction of surplus value from production and consumption”, which is the driving force of capitalism (2002, 587). In this capitalist system of information and communication technologies, they argue, personal information has become a commodity, as has privacy – commodities that are held hostage on penalty of exclusion from the online marketplace (592).

9 This position might seem dystopian, but it should be noted that this is not necessarily so: as Deleuze himself stressed, the control society should not be taken as a final or worst possible system, but simply as a different kind of regime from the ones that can be distinguished in the past (1992, 4). Still, as Baumann et al. have rightly noted, where this type of critique is concerned “there is a danger that both popular and scholarly debate will be reduced to familiar narratives about technological developments reshaping the relations between watchers and watched, or the fulfilment of predictions by George Orwell or Philip K. Dick, or the transformation of representative democracies into totalitarian regimes in the name of protection”, which downplays the real-life complexity of technology-practices of surveillance and diverts attention from less obvious problematic consequences of surveillance (2014, 125-216).
In a Marxist manner, Campbell and Carlson regard surveillance as an inescapable tool of social control over the means and processes of economic production, which they consider especially problematic because of people’s own cooperation in their surveillance – the prime characteristic of Foucault’s interpretation of Bentham’s self-disciplining ‘panopticon’ (588-589), which was also a main inspiration for Deleuze’s ‘control society’.10 This overlap between the criticisms of commercial surveillance and security surveillance illustrates the deep intertwinement of technology-practices of surveillance by government actors and commercial actors: whether surveillance is executed to secure the safety and stability of a state or society, or to make a profit, the profiling relies on the same technologies, utilises the same data, and the two cannot function as well without each other as with each other – though the latter holds for government surveillance more than for commercial surveillance, as government actors generally do not share their data with commercial actors. Lastly, surveillance by government actors has also been criticized for a perceived lack of efficacy.11 This position has already been touched upon when describing the inherent focus of data-driven surveillance on hidden data patterns and the future rather than past trends and structural causes. This is especially visible in terrorism prevention: an official EU research project into the results of anti-terrorism actions between 2001 and 2013 showed that quantifiable predicted impacts have a much higher priority for security agencies than societal impacts. Moreover, the research showed that too little is known about the efficacy and legitimacy of these actions (‘SECILE Report Summary’). 
In a number of past cases of terrorism, though, the terrorists were either already known to security agencies or could easily have been identified using public records and standard police procedures – the problem was not a lack of data or appropriate mining methods, but rather a lack of communication between (international) security actors and a lack of knowledge and skills amongst their employees, which casts doubt upon the effectiveness of current data-driven surveillance practices for terrorism prevention (Jonas and Harper 2006; Van Elden 2016). The question of cost-effectiveness also comes into play here: are mass dataveillance and predictive data-mining worth the huge resource investments (Jonas and Harper 2006)? Are pre-emptive security measures at, for example, airports worth the investments (Stewart and Mueller 2013)? Questions of this sort are of a very different kind than cultural, social and philosophical criticisms, yet they are likely highly important for the actors that have an interest in employing practices of surveillance, and are thus worth keeping in mind when analysing the public discourse about such practices.

10 The analysis by Campbell and Carlson can of course be criticized, for example for its assumption that people have no benefit at all in the products of commercial surveillance (personalised advertisement etc.); that giving up personal information in exchange for a free-to-use service such as email or social media is not a commercial transaction; and that consumers are largely or even completely unaware of and passive about the surveillance.

11 The efficacy of commercial surveillance practices is not really a matter of critical research but rather of industry discourse and engineering research.

2.3 Public discourse and pending questions

Until now I have only discussed descriptive research into technology-practices of surveillance: culture-oriented, sociology-oriented, law-oriented, or more philosophical evaluations of the phenomena as they occur in society. Another prominent field of research is the analysis of discourse. Discourse can be defined as “a particular way of talking about and understanding the world”, or a certain aspect or dimension of this world (Jorgensen and Phillips 2001, 1). Discourse analysis is thus, in the broadest sense of the word, the analysis of material traces of this talking-about and sense-making in expressions of natural language. It is also a specific family of methodological approaches to textual analysis that exists next to other approaches, as will be further detailed in the next chapter. For now, it suffices to state that through discourse analysis, a subject like surveillance can be studied by its publicly perceived impact on society rather than its observed impact. Existing discourse-analytical research into surveillance and privacy has concerned itself with the representation of specific technology-practices such as CCTV (Möllers and Halterlein 2013) and the surveillance by the NSA (Branum and Charteris-Black 2015; Mols and Janssen 2016); or with the representation of a collection of cases and/or the subject of surveillance in general, where the discourse is taken to be representative of the discourse on surveillance as a whole (Finn and McCahill 2010; Barnard-Wills 2011; Lischka 2016). All this research surveys text, usually from mass media such as newspapers or broadcasters, sometimes from blogs, user comment platforms or government proceedings – any public arena where discourse resides and evolves is suitable. Several researchers have identified specific frames that are often used to describe and make sense of technology-practices of surveillance. 
The political scientist Barnard-Wills, for example, has performed an inductive, frame-centred discourse analysis of 300 British news media articles from 1990 till 2008, to understand “what counts as surveillance and when surveillance is considered acceptable and appropriate or unacceptable and inappropriate” (2011, 548). He found that the discourse about appropriate surveillance regularly features frames about effective crime prevention, risk management and protection of the vulnerable from terrorism. The discourse about inappropriate surveillance, on the other hand, features frames about harmful human impacts such as damage to privacy (referred to by him as personal liberty), inappropriate subjects of surveillance, capitalist surveillance industries, and totalitarianism (555). He also found that especially the surveillance-negative, personal liberty-centred discourse is highly complex and context-dependent, as it centres on the perception of certain lines being crossed or not – it matters whether surveillance is targeted or mass-applied, whether it is overt or covert, whether it is consensual or not, et cetera (563). Complementary to these results, the communication researcher Lischka found in an analysis of 475 radio transcripts and TV-broadcast transcripts that in surveillance-negative framing privacy is the most frequent subject, while safety and terrorism are the main subjects in surveillance-positive discourse (2015). She also found that when surveillance is criticised the focus is on certain specific technology-practices – i.e. cases where the lines are crossed – rather than surveillance in general, and that moralising arguments are used rather than arguments embedded in law and authority, which are more common in legitimising frames (15-18).

Following the dominance of the privacy issue in surveillance discourse inside the academic and the public sphere, there are also studies that have focussed on privacy attitudes about surveillance rather than surveillance attitudes in general. Möllers and Halterlein, for example, have performed an inductive discourse analysis of 117 documents – newspapers, blogs and parliamentary debates – about ‘smart CCTV’ to clarify how the relationship between surveillance and privacy is constructed in the public debate and to find out why surveillance technologies are so popular (2013). They uncovered four distinct discourses within the general discourse: ‘smart CCTV can counter crime and terror’; ‘regular CCTV is inefficient but can be improved with smart components’; ‘mass panic is unpredictable and uncontrollable but CCTV is a useful help’; and ‘CCTV is a threat to personal liberty but can be improved by increased control by data protection commissioners’ (64). In the discourse, personal liberty is equated with the right to privacy; Möllers and Halterlein conclude that privacy critiques are isolated from the other arguments against surveillance and that privacy and CCTV are not considered to be mutually exclusive. According to their analysis, privacy harm is typically seen as a regulatory problem only, and they argue that because of this privacy critiques do not generate the political pressure that critiques focussed on other social consequences such as social inequality might achieve (66-67). 
A similarly interesting study was performed by Mols and Janssen, who analysed Dutch privacy attitudes in 257 news articles and user-generated blogs about the surveillance by the NSA and their associates (2016). They found two surveillance-positive frames, namely the ‘if you’ve got nothing to hide, you’ve got nothing to fear’-frame and the ‘the end justifies the means’-frame, where diminished privacy is argued to be a worthy price for increased safety. Where surveillance-negative attitudes are concerned, they also found two dominant frames: the ‘Orwellian dystopia’-frame and the ‘privacy is dead’-frame, which both argue that we have entered a dangerous situation where privacy is concerned, which can only be resolved by a different type of government or radically different legislation, and by getting off the internet, respectively. Lastly, there are two more nuanced frames: the ‘privacy paradox’-frame, which describes the reality of privacy in the digital age as a complex issue where privacy and user-convenience are at odds, and the ‘empower the user’-frame, which describes privacy as an important right that needs to be protected by activism and inclusive public debate (7-8). They found that in the user-generated content published in the two weeks after revelations about the NSA, the negative, dystopian frames were most common, whereas in professional journalistic content the ‘empower the user’-frame was most common (10).

This research gives rise to various further questions, but the most obvious gap in the research so far is the question of the difference between the way people perceive technology-practices of surveillance by government actors and the way they perceive such practices in the context of the actions of commercial actors. These two contexts do not appear to have been comparatively analysed yet, and the commercial context in general appears less studied than the governance context. Moreover, as existing research has shown that privacy is a highly complex, context-dependent issue with doubtful value as an effective anti-surveillance argument, it is also interesting to pay specific attention to the way in which, and the extent to which, the privacy issue is discussed in these two different contexts. This has resulted in the following research question: How are technology-practices of surveillance by government actors typically problematized in current-day news media discourse, and how is this done for similar practices by commercial actors? To answer this question, I will combine two different research strategies: topic modelling and a framing-oriented discourse analysis. The theoretical foundations and methodological details and advantages of these strategies will be discussed in the next chapter. The quantitative analysis will also be used as a tool for generating a more objectively representative corpus for the qualitative analysis, as will also be explicated in the next chapter.


3. Methodology

This chapter will start off with a discussion of the theoretical assumptions and strengths and weaknesses of the two categories of discourse research and their various sub-categories, followed by the position of my thesis within this framework. After this, the research corpus and steps taken for the quantitative part of my research will be described. Lastly, the theoretical underpinnings and steps taken for the qualitative analysis will be explained.

3.1 Discourse Analysis, Content Analysis and Natural Language Processing

Various approaches to discourse research and their theoretical and practical ramifications

At the core of this thesis lies a discourse analysis: an analysis that aims to understand how people perceive the world by studying meaningful patterns in the structure of discourse, usually as expressed in texts. It is important to note here that the structure of a discourse is not at all permanent; it is rather a fluid, ever-changing construction that can emerge or fade away over time (Barnard-Wills 2011, 553). Next to being time-specific, discourse is also bound to a certain space, in the sense that the structure of a discourse can differ across different domains of society, and that it can be specific to a certain group of people (Jorgensen and Phillips 2001, 1). A discourse about a certain subject can for example be ‘political’ or ‘economic’, following the subjects that are discussed in the discourse and the voices that are considered an authority; or a discourse can belong to a particular organisation or culture, in the sense that it is produced by a group of people with certain shared characteristics, with group-specific purposes in mind. Where this kind of research is concerned, there are roughly two disciplines to distinguish: Discourse Analysis and Content Analysis. The latter is concerned with classifying a body of texts with the use of a ‘codebook’ of frames or some other sort of category, in order to make inferences about the meaning of the texts in a manner that is as structural, impersonal and objective as possible (Lewis, Zamith and Hermida 2013, 36). 
The former, on the other hand, is the interpretive, reflexive and context-conscious project of studying meaningful patterns in full texts – often in a critical manner, inspired by the philosopher Foucault, meaning that its practitioners are concerned with revealing linguistic expressions of power relations in society for the sake of social change (Jorgensen and Phillips 2001, 2).12 There is some overlap between these two disciplines though, as discourse analysts can also construct a systematic coding framework to guide the analysis, and content analysts can pay attention to the social context of texts, the political power of discourse, or the political implications of the analysis itself. This research can also be placed somewhere in the grey zone between these disciplines. Within both Discourse Analysis and Content Analysis various specific approaches are possible. Where Discourse Analysis is concerned, these approaches are distinguished mainly by their theoretical premises about the relations between actors, signs, structures, meaning and practices, which make them suitable for slightly different goals. There is no need to describe all these approaches, which are manifold and theoretically complex; for the methodology of this thesis it suffices to describe the social constructivist position that underlies many, if not all, approaches to discourse analysis, which is also the approach that I will take. This position can most briefly be summarized by three premises, namely that knowledge about reality does not objectively reflect that reality; that our knowledge, as well as the representations of that knowledge (discourse), are culturally, socially and historically constructed and thus influenced by these contexts; and that knowledge and discourse influence people’s actions (Jorgensen and Phillips 2001, 5-9).13 In other words: the social constructivist position states that discourse both reflects and affects meaning and social processes; that it is both an expression of knowledge and a means of generating it. In the context of surveillance and privacy this means that an analysis of the discourse about these concepts can reveal the way people make sense of technology-practices, as well as reveal traces of influential contexts. Moreover, it means that discourse analysis can offer grounds for making inferences about the way the discourse may influence people’s view of the world – and consequently their acting in it.

12 There exist various slightly different definitions of both discourse and content analysis, depending on the field of studies. The definitions I adhere to are the ones that are to my knowledge dominant in the field of cultural studies, to which discourse analysis primarily belongs, and in the social sciences, to which content analysis belongs.
Moving on to the field of Content Analysis – where texts are classified in a systematic manner with the help of a codebook – there are roughly two main approaches, the qualitative and the quantitative. Qualitative Content Analysis is generally concerned with classifying a relatively small sample of texts, based on the content of the whole texts, while Quantitative Content Analysis is concerned with classifying a relatively large sample of texts, based on the presence of certain (combinations of) words or frames (Kutter and Kantner 2012, 8). Since the development of Natural Language Processing (NLP), that is, automated techniques to convert natural language into a form that computers can process in order to execute a certain task, Quantitative Content Analysis is typically executed with the assistance of computers – an approach that is often referred to as ‘corpus linguistics’. Using computers to code a corpus increases the systematicness and objectiveness of the coding, as different human coders may deal differently with certain linguistic features and unexpected variations (ibid.). Next to replicability and a lack of human subjectivity, computer-assisted Content Analysis also has the advantage of being able to deal with vastly larger quantities of texts than human coders. Moreover, NLP can enhance classic Quantitative Content Analysis by uncovering statistical patterns inside documents or corpora. Computer programs can for example be used to reveal networks of words that are often used in combination with each other (collocation), or to test whether the difference between texts or corpora concerning the number of occurrences of certain words or frames is statistically significant.
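To make the notion of collocation concrete, the following is a minimal sketch that scores word pairs by pointwise mutual information (PMI), one common association measure. It is an illustration only: the function name, the window size, the minimum-count cut-off and the toy sentences are assumptions of this sketch and are not taken from the analysis in this thesis.

```python
import math
from collections import Counter

def pmi_collocations(tokenised_docs, window=2, min_count=2):
    """Rank word pairs that co-occur within `window` tokens by PMI (illustrative helper)."""
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in tokenised_docs:
        word_counts.update(tokens)
        for i, left in enumerate(tokens):
            # Pair each token with the next `window` tokens to its right.
            for right in tokens[i + 1 : i + 1 + window]:
                if left != right:
                    pair_counts[tuple(sorted((left, right)))] += 1
    total_words = sum(word_counts.values())
    total_pairs = sum(pair_counts.values())
    scores = {}
    for (a, b), n in pair_counts.items():
        if n < min_count:
            continue  # very rare pairs give unreliable PMI values
        p_pair = n / total_pairs
        p_a = word_counts[a] / total_words
        p_b = word_counts[b] / total_words
        scores[(a, b)] = math.log2(p_pair / (p_a * p_b))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy, made-up documents (already tokenised and lower-cased).
docs = [
    ["national", "security", "agency", "collects", "data"],
    ["national", "security", "concerns", "grow"],
    ["companies", "collect", "consumer", "data"],
]
ranked = pmi_collocations(docs, window=1, min_count=2)
```

In this toy example only the pair "national" and "security" occurs often enough to be scored. On a real corpus the cut-off would need to be considerably higher, since PMI tends to overrate rare pairs.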

13 It is important to note here that these premises do not (necessarily) mean that there is no such thing as an objective reality, nor that no representations of objective facts are possible; it rather means that all representations of knowledge are constructed in the inevitably influential context of social processes and existing knowledge, and are thus variable in the flexibility of their link to reality (Jorgensen and Phillips 2001, 5-9).

Some problems of non-computer-assisted content analysis persist though, and new problems occur. Different human coders deal differently with ambiguity, but computers cannot handle ambiguity at all, nor can they attribute meaning. If a computer is not explicitly told how to handle a certain textual element, it will not handle it, which means that the validity of the categorisation can still be questionable, not so much because of the procedure itself as because of the rules of coding. Moreover, the process of interpretation and inference making cannot be outsourced to a computer – human researchers will always need to give meaning to categorisations, patterns and relationships identified by computers. Indeed, researchers have reported computer-assisted content analysis to yield satisfactory results only for superficial analytical goals (Lewis, Zamith and Hermida 2013, 38). Thus, as is the case with quantitative analysis by human coders as well, a qualitative reading of texts or fragments of texts in light of their relevant contexts is a valuable addition to computer-assisted Quantitative Content Analysis, as several authors have argued (e.g. Lewis, Zamith and Hermida 2013; Koteyko et al. 2013; Kutter and Kantner 2012). At the same time, a quantitative analysis is also a helpful addition to a qualitative analysis, as the larger number of examples in the research corpus increases the empirical validity of the analysis and through this the generalisability of the findings (Jaworska and Nanda 2016).

The methodological focus of this thesis

Following this advice, I have taken on a hybrid research approach that can be placed somewhere in between discourse analysis, qualitative content analysis and quantitative content analysis, as it combines NLP-methods and methods of automated Quantitative Content Analysis with a systematic qualitative analysis of a smaller sample of full texts, while drawing upon some assumptions about reality essential to the discipline of Discourse Analysis and taking note of the societal implications of the research and its possible relevance for movements of social change. The quantitative dimension of this thesis concerns an advanced analysis of statistical word patterns inside a corpus of news articles to reveal the differences between the concepts associated with the two subjects of interest in this thesis (surveillance and privacy) in two different contexts (government actors and commercial actors). For this comparison, so-called ‘Latent Dirichlet Allocation’ (LDA) will be used: a probabilistic Bayesian machine learning model for representing documents in a large corpus as a collection of topics. Each topic is a probability distribution over all words in the documents, where the most probable words together represent a certain abstract theme within the documents. The LDA-topics will also be used to identify texts of interest and narrow down the corpus for the qualitative dimension of this research, which will entail a close reading of the news articles with the help of a collection of narrative frames that have been identified by other researchers as common ways to talk about surveillance and privacy.

This methodology – or rather, this collection of methodologies – has some definite advantages compared to existing discourse-analytical research into surveillance and/or privacy, as well as such research in general. First of all, there is the already discussed advantage of complementing quantitative information with qualitative information, where the qualitative gives context and meaning to the quantitative, and the quantitative gives structure and increased validity to the qualitative. Secondly, using topic modelling to identify relevant texts for the qualitative analysis will likely make for a more reliable research corpus. Both in qualitative and quantitative analyses of discourse, the texts for the research corpus are typically identified by searching for the mention of certain keywords in a large body of published texts. The list of key terms for this search is created in a highly subjective manner – the researcher selects them based on his or her own experience with the subject in question. However, there is no telling how complete such a list is; relevant terms may be missing, especially those that relate to issues that are not on the radar yet. With topic modelling, by contrast, one can identify texts not by keywords but by topics, which are distributions of many words that function as overarching themes – themes that are discovered based on statistical word patterns and probability distributions rather than human reasoning. This has the potential to increase the representativeness and validity of a research corpus for discourse research. The next two subchapters will further detail the methodology of this research. Moreover, the origins and characteristics of the research corpus will be accounted for.
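The description of LDA given earlier in this subchapter, with topics as probability distributions over words and documents as mixtures of topics, can be made tangible with a small sketch. The numbers below are hand-written and hypothetical; a fitted LDA model would estimate them from the corpus. The sketch shows only the resulting data structures, not the inference procedure, and all names in it are assumptions made for illustration.

```python
# Hypothetical topic-word distributions: each topic assigns a probability to
# every word in a (tiny) vocabulary, and each distribution sums to 1.
topics = {
    "topic_0": {"court": 0.4, "privacy": 0.3, "agency": 0.2, "data": 0.1},
    "topic_1": {"data": 0.4, "consumer": 0.3, "privacy": 0.2, "court": 0.1},
}

# Hypothetical document-topic mixture for one article (also sums to 1).
doc_topic_mixture = {"topic_0": 0.75, "topic_1": 0.25}

def top_words(topic, n=3):
    """Return the n most probable words of a topic, i.e. its 'label' words."""
    ranked = sorted(topic.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:n]]

def word_probability(word, mixture, topics):
    """p(word | document) = sum over topics of p(topic | document) * p(word | topic)."""
    return sum(weight * topics[name].get(word, 0.0) for name, weight in mixture.items())

label_words = top_words(topics["topic_0"])
p_privacy = word_probability("privacy", doc_topic_mixture, topics)
vocabulary = {word for topic in topics.values() for word in topic}
total = sum(word_probability(w, doc_topic_mixture, topics) for w in vocabulary)
```

Here `label_words` holds the three most probable words of the first topic, which is how a topic is usually read and interpreted, and `total` confirms that the mixture again yields a proper probability distribution over the vocabulary.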

3.2 Method for the quantitative analysis

This subchapter will detail the steps taken for the quantitative dimension of this research, which will support the following research question: What concepts are commonly associated with privacy and surveillance in news media articles about commercial actors and governmental actors respectively? These steps are a) the pre-processing of the research corpus to make the data suitable for the topic modelling analysis, b) the creation and fine-tuning of a topic-modelling algorithm, and c) the use of topic modelling to select a corpus for the discourse analysis. Most of this work took place in Python – a high-level programming language that can be used for all sorts of data manipulations.

The research corpus: origins and pre-processing

The research corpus of this thesis is the News-on-the-Web (NOW) corpus: an incredibly large corpus of texts scraped from online English-language news sources from all over the world, created by Brigham Young University, under the leadership of the linguistics scholar Davies (http://corpus.byu.edu/now/). The data in the corpus goes back to 2010 and thousands of articles are added every day. Each article is retrieved with BYU’s scraping software by following hyperlinks published on Google News during the hours of ten PM and one AM. The sources vary from big mass media sources like CNN and The New York Times to smaller, local news sources like The Missoulian and Oregon Live; and from thematic news outlets like Politico and Medical News Today to weblogs like Lifehacker and Mobile Marketing Watch. The raw corpus is organised in columns that contain, amongst others, a unique numerical ID for every article, the title and source of the articles, the original text of the articles, and a tokenized and lemmatized version of the articles – meaning that all individual words are separated from each other and, if necessary, reduced to their base form (verbs to stems, plurals to singulars), so the corpus can easily be used in NLP-procedures.

I chose to use this corpus mainly because of convenience: I needed a corpus of Dutch or English news articles that is large enough for topic modelling as well as accessible to me, as I did not have the time and skill to create a corpus myself. Initially I planned to use a corpus that is part of the Leipzig Corpora Collection (Quasthoff, Goldhahn and Eckart 2012). This corpus consists of 50 million lines of text taken from English news media articles published between 2010 and 2015, all mixed up together to comply with copyright regulations. However, when grouping the randomized lines back into the original articles – which was possible because each line was accompanied by the URL it was taken from – and calculating the distribution of the lengths of the articles, my thesis supervisor and I noticed that this distribution was Zipfian. 9,999,991 sentences formed 2,434,022 articles, some of which had a length of hundreds of lines, while the great majority was only a couple of lines long. When comparing the content of the recreated articles with the content of about twenty active URLs, it became clear that the articles contain only a selection of lines from the original articles rather than the full text. Because of this I decided that the corpus is unfit for topic modelling, as this is a procedure that is meant for describing the topical structure of complete texts that are of sufficient length to infer multiple themes per text. 
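The check on the Leipzig corpus described above, regrouping shuffled lines by their source URL and inspecting the resulting article lengths, can be sketched as follows. This is a minimal illustration under assumptions: the input format (pairs of URL and line), the function names, the threshold for a "short" article and the example URLs are all hypothetical and do not reproduce the actual script used.

```python
from collections import defaultdict
from statistics import median

def group_lines_by_url(rows):
    """Regroup (url, line) pairs into articles keyed by their source URL."""
    articles = defaultdict(list)
    for url, line in rows:
        articles[url].append(line)
    return dict(articles)

def length_summary(articles, short_threshold=2):
    """Summarise article lengths (in lines) to spot a heavily skewed distribution."""
    lengths = sorted((len(lines) for lines in articles.values()), reverse=True)
    short = sum(1 for n in lengths if n <= short_threshold)
    return {
        "articles": len(lengths),
        "longest": lengths[0],
        "median": median(lengths),
        "share_short": short / len(lengths),
    }

# Made-up rows standing in for the shuffled corpus lines.
rows = (
    [("http://example.com/a", f"line {i}") for i in range(5)]
    + [("http://example.com/b", "line 0")]
    + [("http://example.com/c", f"line {i}") for i in range(2)]
    + [("http://example.com/d", "line 0")]
)
summary = length_summary(group_lines_by_url(rows))
```

A summary with a few very long articles, a low median and a large share of very short articles is the kind of pattern that would prompt a comparison with the original web pages.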
The NOW corpus does comply with these prerequisites: both the exceptionally long documents (over 8000 words) and exceptionally short documents (between 100 and 150 words) in the corpus are almost complete articles – only a couple of groups of words are deleted in each article to comply with copyright regulation. For this research, I did not use all the data of the NOW corpus: I chose to use only the US-based articles from January to August of 2014, 2015 and 2016, to avoid distorting the results with cultural differences and because the US data files are the biggest. For the comparative topic modelling analysis, two sub-corpora needed to be extracted from my selection of the NOW corpus: one corpus that contains only articles that are about governmental actors or about both governmental and commercial actors, and one corpus with only articles that are about commercial actors or about both governmental and commercial actors. These corpora will be referred to as the ‘actor corpora’. For the discourse analysis, the corpus then needed to be further reduced to just articles about surveillance practices and/or privacy, as these are the specific news articles that will need to be qualitatively analysed to further generate knowledge about the differences between the discourses about surveillance and privacy in the context of different actors (see figure 1).

[Figure 1: diagram of four overlapping areas labelled S, G, C and P]

Figure 1: This figure illustrates what the research corpus looks like in terms of actor-subjects and the themes of surveillance and privacy. There is expected to be an overlap between articles about commercial actors (blue) and government actors (green), and articles about surveillance (orange) and privacy (yellow). The eventual corpus for the discourse analysis consists of the two areas covered with stripes and triangles.

Simply using a list of predefined keywords to construct these various corpora is unlikely to lead to the most representative collection of articles possible: there is no way to check how complete such a subjectively constructed list is, which means that the reduced corpus would probably miss some or even many relevant articles. To avoid this problem, I developed a seven-step procedure for creating the actor corpora and the corpus for the discourse analysis:

a) use named entity recognition (NER) to identify all actors in the NOW corpus;
b) manually extract all relevant actors from this list to be used as keywords;
c) use collocation to identify non-entity words that are highly associated with certain key actors and add these to the keyword list;
d) use the keywords to extract all articles from the corpus that contain governmental actors and/or commercial actors and create two ‘actor corpora’;
e) apply topic modelling to the two corpora;
f) use a list of domain keywords to identify topics about surveillance and/or privacy;
g) identify all articles that fit at least one of the topics and create the final corpus.

The last three steps are not just a methodological procedure to identify relevant texts but also an interesting analysis in and of themselves: the differences between the distribution and composition of the topics within the two corpora will make apparent some of the structural differences between the two discourses that the corpora represent.


Extraction of relevant articles: NER, keyword identification and collocation

For creating a list of all named entities in the corpus, the NER-function of an NLP-module for Python called spaCy was used – a module that is especially fast, up-to-date and well-suited for large quantities of data. I then manually searched the list of named entities for words that designate a governmental or commercial actor. In advance, I defined a governmental actor as any actor – be it an individual, organisation or other group – with certain political power that is concerned with the creation, execution or monitoring of policies that serve the needs of the peoples of one or more nation-states, or the needs of those states in themselves.14 When actors have political power, this means that they participate in an authoritative process of allocating values and resources for a society (Easton in: Van der Eijk 2001, 9). This kind of power is often wielded by members of the government of a state for just those people inside the state, but may also be wielded by non-state actors: over the past couple of decades, political power has increasingly spread to international powers like inter-governmental organisations (WHO, WTO) and supranational organisations (EU institutions), as well as to local powers like municipalities (Bovens 2005). All these non-state actors can be considered government actors as well when they are non-profit and co-operate with states to create and implement policies. The same goes for entities that execute certain specific tasks of the state, such as the police and courts, as these too are part of the sphere of governance.
Thus, the list of keywords about government actors was expected in advance to be highly diverse, ranging from country/state names and municipality names to the individual heads of governing bodies, and from government organisations such as security services and courts to supranational organisations or bodies such as the European Commission and the United Nations. Commercial actors on the other hand are a bit easier to define: I understand these as fully private, commercial enterprises and their representatives; that is, all organisations that produce a certain product to generate profit and those people that speak for, or are at the head of, such organisations. As it is only companies that work with people’s private data/information, or produce technologies that can be implemented in surveillance, that are of interest here, it is tempting to restrict the search to just that kind of company. However, because it is currently hard to tell in advance which companies use people’s personal data in some way in their business model, all commercial enterprises and their representatives will be added to the list, to avoid missing out on less known technology-practices.

When writing my Python code, I decided not to extract all named entities but only words that refer to organisations and geographical locations. Individuals can also represent a governmental or commercial actor, but because I would have to look up all personal names in the list of results to find out what their affiliation is, which would take an enormous amount of time, it is not feasible to include these actors. This is unlikely to be a limitation for my research, as affiliated persons are generally mentioned in combination with the organisation they are affiliated with. I assumed all geographical locations that refer to areas to be potential references to governmental actors. All organisations that I recognized I grouped in the category I knew them to belong to, and I rejected (unfamiliar) organisations where it was obvious from their name that they are not commercial or governmental (charities, hospitals etc.). Organisations I did not have any information on at all I looked up in a search engine. I also excluded news organisations, because they are likely referenced for their news content, and schools, universities, hospitals and art and research institutes, because they might be commercial but may also be partly or fully government-supported. In addition, these actors are generally more oriented towards the common good than towards profit – including such organisations would make the list of keywords too broad and the data noisy. Also, I did not use NER on all articles inside the corpus because this too would take far too much time; instead I only surveyed three months from one year (February, March and August 2014), on the assumption that the most common/important actors will be mentioned in these time periods as well as in any other recent time. Exploring the composition of the results of the NER procedure led me to decide to delete all geographic locations except country capitals such as Washington, Brussels and The Hague, as leaving in these locations would again result in a too-broad list of keywords, which would make the two actor corpora too much alike. This process resulted in two lists of keywords.

14 This conception of governance is partly inspired by Foucault, who regards the governing of a state as the art of disciplining the individuals of society for the sake of both, or either, their well-being and/or that of the state, using various tactics and tools – like for example, but not necessarily, taxes, health care, economy, nationalism, religion, the family, armed forces, etc. (1991).
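The restriction to organisations and geographical locations described above can be sketched as a simple filter over NER output. This is a minimal sketch, not the code used in the thesis: the label names `ORG`, `GPE` and `LOC` are assumptions based on common spaCy label sets, and the entity list is a hypothetical example.

```python
from collections import Counter

# Assumed label names for organisations and geographical locations;
# the thesis does not state which labels were used.
KEPT_LABELS = {"ORG", "GPE", "LOC"}

def candidate_actors(entities, kept_labels=KEPT_LABELS):
    """Count entity strings whose label is in kept_labels."""
    counts = Counter()
    for text, label in entities:
        if label in kept_labels:
            counts[text] += 1
    return counts

# Hypothetical NER output as (text, label) pairs. With spaCy, such pairs
# could be produced by: [(ent.text, ent.label_) for ent in nlp(article).ents]
entities = [
    ("NSA", "ORG"), ("Edward Snowden", "PERSON"), ("Washington", "GPE"),
    ("Google", "ORG"), ("NSA", "ORG"), ("Tuesday", "DATE"),
]
actors = candidate_actors(entities)
print(actors.most_common())
```

The resulting frequency list is the kind of material that would then be sorted manually into governmental and commercial actors.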
From each of these two lists, I randomly selected ten actors to perform a collocation analysis with, using the three months of articles I also used for the NER-procedure. As mentioned before, collocation is a procedure for analysing the co-occurrence relationships of (specific) words inside a text, to identify words that are semantically related in some way. The degree of semantic relatedness is calculated based on parameters such as the distance between words, the frequency of individual words inside the corpus and the exclusivity of the relationship between two words (Brezina, McEnery and Wattam 2015). Different collocation tools may consider different parameters and thus provide a researcher with different results when ‘fed’ a certain corpus. In this thesis, the GraphColl tool is used, which was developed by Brezina et al. As this is a relatively new tool that appears to rely on many different parameters, GraphColl seems to have an edge over similar available tools. Furthermore, contrary to popular linguistics software like WordSmith, the GraphColl tool is free to use. I performed the collocation analysis on all articles from March 2014 (which was the maximum amount of text the program could handle). The list of top ten collocates of the selected actor-keywords (see Appendix A) led me to expand my lists of actors with eleven non-actor words. These words were chosen because a) they had a high degree of semantic relatedness with the named entities, which means they are often used in relation with these entities, and b) I deemed these words highly associated with either the domain of governance or business, meaning that they are unlikely to be used in an article that does not feature either a governmental actor or commercial actor. On the other hand, they might feature in articles that are about governmental or commercial technology-practices of surveillance but do not feature a named entity from the list.
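To illustrate how frequency, distance and exclusivity can enter a collocation score, the sketch below computes a simple mutual-information style measure within a fixed window around a node word. It is an assumption-laden toy, not GraphColl's actual computation: the function name `collocates`, the window size and the example sentence are all hypothetical.

```python
import math
from collections import Counter

def collocates(tokens, node, span=1):
    """Score words co-occurring with `node` within +/- `span` tokens.

    Score = log2(observed / expected), where the expected count assumes
    the node and the collocate are distributed independently.
    """
    freq = Counter(tokens)
    n = len(tokens)
    observed = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        for j in range(max(0, i - span), min(n, i + span + 1)):
            if j != i:
                observed[tokens[j]] += 1
    scores = {}
    for word, count in observed.items():
        expected = freq[node] * freq[word] * 2 * span / n
        scores[word] = math.log2(count / expected)
    return scores

# Hypothetical toy text.
tokens = ("the senate passed the bill while the senate committee met "
          "and the market fell").split()
scores = collocates(tokens, "senate", span=1)
print(sorted(scores, key=scores.get, reverse=True))
```

In this toy example the frequent word ‘the’ scores lower than ‘passed’ and ‘committee’, even though it co-occurs with the node more often, because its relationship with the node is less exclusive.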
On top of these eleven words, I also added thirteen other words that did not come up in the top of the lists of collocates but realistically could have. These are very general words like government, legislation, business and marketing (see Appendix B). Searching all articles from January to August of the years 2014, 2015 and 2016 for the presence of at least one of the keywords from each list led to a government corpus of 136,203 articles and a commercial corpus of 157,003 articles. There was an overlap of 80,001 articles. A topic model needed to be trained on each of these two corpora. To keep the number of topics within a range that is relatively easy to explore by one person, as well as to not overburden my computer, I trained the topic models on only 50,000 randomly selected articles from the corpora – which is certainly a large enough sample to get a high-quality model. Lastly, to identify relevant topics in the model I created a short list of keywords that designate surveillance practices or privacy issues, based on my own domain expertise and the keywords used by other researchers. This list encompassed the following one-word terms: ‘surveillance’, ‘wiretapping’, ‘cctv’, ‘spying’, ‘tracking’, ‘rfid’, ‘privacy’, ‘private’, ‘NSA’, ‘data-mining’.15 The next section will describe what LDA is exactly and how it works, as well as the specific steps I took to train my topic model.
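The keyword-based construction of the two actor corpora can be sketched as a set intersection between each article's words and each keyword list. The sketch below is illustrative only: the function name `build_actor_corpora`, the shortened keyword lists and the four toy articles are hypothetical. The last lines apply inclusion-exclusion to the corpus sizes reported above to obtain the number of unique articles across both corpora.

```python
def build_actor_corpora(articles, gov_keywords, com_keywords):
    """Return the ids of articles matching each keyword list."""
    gov_ids, com_ids = set(), set()
    for article_id, tokens in articles.items():
        words = set(tokens)
        if words & gov_keywords:
            gov_ids.add(article_id)
        if words & com_keywords:
            com_ids.add(article_id)
    return gov_ids, com_ids

# Tiny hypothetical keyword lists and tokenised articles.
gov_keywords = {"government", "legislation"}
com_keywords = {"business", "marketing"}
articles = {
    1: ["government", "passes", "legislation"],
    2: ["business", "marketing", "budget"],
    3: ["government", "regulates", "business"],
    4: ["weather", "forecast"],
}
gov_ids, com_ids = build_actor_corpora(articles, gov_keywords, com_keywords)
overlap = gov_ids & com_ids

# Inclusion-exclusion applied to the corpus sizes reported above.
unique_articles = 136203 + 157003 - 80001
print(sorted(gov_ids), sorted(com_ids), sorted(overlap), unique_articles)
```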

Topic modelling with LDA

Latent Dirichlet Allocation is a statistical, probabilistic NLP-algorithm to extract related groups of words called topics from documents and represent both the full body of documents and the individual texts in terms of these topics (Riddell 2014, 100). Like many NLP-algorithms, as well as procedures for machine learning, LDA makes use of a vector-space model: texts are represented as a vector, a row of values inside a multi-dimensional space, where each column is a word and each value a frequency (Turney and Pantel 2010, 142-143). This sort of model is sometimes also called a ‘bag-of-words’ model, as the order of words in the text is discarded – the text simply becomes a bag of words. By representing a text, or any other entity with structural characteristics, as a vector of word frequencies or other features, large amounts of entities can be easily searched for the presence of certain values, compared for similarity, or otherwise analysed for structural features. The technique is applied in many useful (commercial) applications, such as online search engines, recommendation systems and facial recognition – indeed, all sorts of data-driven technology-practices of surveillance in fact rely on this sort of vectorised representation of information. It is also common in exact sciences such as artificial intelligence, cognitive neuroscience and computational linguistics (143-144). Where natural language processing is concerned, vector space models can also be used to make inferences about the meaning of texts. This sort of practice relies on what Turney and Pantel call the ‘statistical semantics hypothesis’: the assumption that “statistical patterns of human word usage can be used to figure out what people mean” (146). For example, the more similar the word frequency vectors of two texts are, the more similar the meanings inside the texts will be (for example: two books that both have ‘France’ and ‘wine’ as most frequent words will probably both be about French wine), and the more frequent certain pre-specified terms are in a document vector, the higher the chance that the document contains meanings that are associated with those terms (for example: the more often terms that are tagged as positive, such as ‘good’ and ‘fine’, are in a restaurant review, the higher the chance that it is a positive review).

LDA represents documents not as one vector in space but as a collection of topics, which are vectors that represent a probability distribution over all words in the corpus, centred around a group of words that have a relatively high probability of appearing in a specific topic. The procedure of LDA is roughly as follows: a collection of documents goes into the black-boxed LDA computer program, after which a predefined number of topics is discovered or ‘learned’ in the corpus via a generative, unsupervised machine learning procedure. These topics are predictive in nature, which means that new, unseen documents can be provided to the trained computer program and get assigned topics. Each of the documents provided to the model is then going to have a certain distribution of topics: the occurrence and frequency of words in the document overlaps to a certain extent with that predicted in the earlier discovered topics (Riddell 2014; Doig 2015).16 Abstract meanings can be derived from the topics, where the most probable words together convey a meaning that cannot be captured in one word, which provides a detailed approximation of the content of a document that is ideally as close as possible to what a human interpreter would find. Still, the topics can also be quite meaningless, as their quality can differ – sometimes the collection of words appears random to a human interpreter.

15 Only one-word terms can be considered because topic models are based on a ‘dictionary’ of single terms, in which there is no relation between the words.
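The bag-of-words representation and the similarity comparison described above can be illustrated with a short sketch. It is a toy example under simplifying assumptions (whitespace tokenisation, raw frequencies, cosine similarity as the comparison measure); the function names and example texts are hypothetical.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Represent a text as word frequencies, discarding word order."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two word-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

doc1 = bag_of_words("france wine wine france region")
doc2 = bag_of_words("wine france france vineyard")
doc3 = bag_of_words("court ruling privacy law")

print(cosine_similarity(doc1, doc2), cosine_similarity(doc1, doc3))
```

The two texts that share ‘france’ and ‘wine’ as frequent words end up close together, while the text about a court ruling shares no words with them and gets a similarity of zero.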
The quality of different topics, as well as the overall quality of a topic model as a whole, is highly dependent on the specific settings of the available parameters of the LDA model: each topic is unique to the model used to generate it, and only as good as that model (Riddell 2014, 108). What exactly constitutes a ‘good’ model is dependent on the size and characteristics of the data that the model should describe, and can only be discovered through interaction with the data. This means that for topic modelling research, different models should be tested to find a model that fits the data and purpose of the research. Moreover, any decisions made about the parameter settings for the model should be carefully documented. There are three main settings that determine how a topic model functions: the number of topics that the model should find (T), the assumed document–topic distributions (alpha) and the assumed topic–word distributions (eta). Unfortunately, there is little research into how best to find the optimal settings for a model, and there is also little known about the consequences of suboptimal settings for the quality of the topics (Wallach, Mimno and McCallum 2009, 1; Tang et al. 2014). This is especially true for choosing the number of topics that the model should find (Wallach, Mimno and McCallum 2009, 7). When a corpus contains enough different topics though, which is almost certainly the case with a corpus of such a size as mine, and the documents are of sufficient length, which is also the case with news articles, the assignment of certain topics to specific documents ought to be affected very little by an increasing number of topics, as the additional topics should be relatively uncommon. But this is only in ideal situations: generally, users should be careful not to select too high a number of topics, because this will negatively affect the coherence of the individual topics (ibid; Tang et al. 2014).

To ensure that the topic model is as efficient and robust as possible, Wallach, Mimno and McCallum recommend “using an asymmetric, hierarchical Dirichlet prior over the document–topic distributions and a symmetric Dirichlet prior over the topic–word distributions”,17 as this way the distribution of topics over documents gets most specific, which means that the documents will become as dissimilar as possible (1-2, 8). This way the distribution of the most likely topics also remains most stable in the face of increasing numbers of topics (ibid). Furthermore, Tang et al. recommend that for a corpus with documents that are likely to contain only a handful of topics “the Dirichlet parameter of the document-topic distributions should be set small (e.g. alpha=0.1)” to ensure that documents aren’t assigned too many topics (the expected probabilities should centre on relatively dominant topics). Moreover, they recommend that when “topics are known to be word-sparse, the Dirichlet parameter of the word distributions is set small (e.g. eta=0.01)” to ensure that topics aren’t assigned too many words and remain as specific as possible (the expected probabilities should centre on relatively dominant words).

16 The LDA process is a Bayesian learning process where the distribution of both the topics in the documents and the words in the topics is assumed to follow a Dirichlet distribution (Doig 2015; Zhao et al. 2015). For a more detailed, technical description of the procedure see Blei 2012.
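The effect of a small Dirichlet parameter can be made tangible by sampling. The sketch below draws document–topic distributions from a symmetric Dirichlet prior (built from normalised gamma draws) and compares how much probability the most dominant topic receives for a small and a large alpha. It is an illustration under assumptions: it only covers the symmetric case, not the asymmetric prior recommended by Wallach et al., and the number of topics, sample count and seed are arbitrary choices.

```python
import random

def sample_dirichlet(alpha, size, rng):
    """Draw one sample from a symmetric Dirichlet via normalised gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(alpha, size=10, samples=500, seed=0):
    """Average probability of the most dominant topic across samples."""
    rng = random.Random(seed)
    largest = [max(sample_dirichlet(alpha, size, rng)) for _ in range(samples)]
    return sum(largest) / samples

sparse = mean_largest_share(alpha=0.1)   # small alpha: few dominant topics
spread = mean_largest_share(alpha=10.0)  # large alpha: topics evenly mixed
print(round(sparse, 2), round(spread, 2))
```

With the small alpha, most of a document's probability mass concentrates on one or two topics, which is the behaviour Tang et al. aim for in documents that contain only a handful of topics.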

Choosing a topic model

The specific algorithm that I have used for the topic modelling is the LDA model of a popular Python module called Gensim. The coding in Python required the following two steps: turning articles into bags-of-words and training a number of topic models to find the most appropriate one for the analyses at hand. Based on the research mentioned in the previous section, I specified the document–topic distribution of my topic model (alpha) to be asymmetric and the topic–word distribution (eta) to be symmetric. Because the documents all have different lengths, it is difficult to know in advance whether they will contain many or few topics; I therefore let the model automatically learn an appropriate alpha value from the data. Moreover, because the word-sparsity of the topics is also difficult to predict, but is likely to be high as that of the articles is too, I also created a model with the eta-settings advised by Tang et al., as a low value for eta might be preferable to a symmetric one. To find a value for T that is not too small or too large, I trained many different variants of the two potential models, each with a different T-value, on 50,000 articles from the government actor corpus. Looking at the characteristics of this corpus, where 50,000 articles with a median length of 608 words contain 41,526,851 words in total, of which 61,882 are unique and relevant (237,949 word types minus all types that occur in at least sixty percent of the documents), a T of at least 200 – but probably more – seemed most appropriate.18

I tested both the coherence and effectivity of the models. The coherence is the quality of the individual topics, which I assessed by evaluating the ten most likely topics of all trained models. Subjective human interpretation of topics is a common way to find a good T-value for a topic model, and has been shown to correspond with statistical measures of topic accuracy, or even outperform them (Zhao et al. 2015; Chang et al.).19 The effectivity of the models I assessed by checking how many topics each model discovered in which one of the surveillance-keywords occurred in the top thirty of most probable words. This is very important because the goal of this research is not just getting high quality topics that correctly describe the corpus but also identifying relevant articles – two goals that may be at odds. Indeed, the most coherent model turned out to have a T of 350 (with a symmetric eta), while the model that returned the most surveillance-related topics was the one with a T of 900 (eta=0.01) (see Appendix C).20 Because the 350-model did not return any coherent topics for the keywords, I chose the 900-model for my analysis. After training this model on each of the two actor corpora, I explored the structural differences between the corpora to answer the sub-question: What concepts are commonly associated with privacy and surveillance in news media articles about commercial actors and governmental actors respectively?

17 The prior is the assumption that the model makes about the distribution of tokens and topics within the corpus. As mentioned in an earlier footnote and specified in the name LDA, this is a Dirichlet distribution. When the Dirichlet distribution is specified to be symmetric, there is assumed to be no prior knowledge about the distribution of tokens in topics and topics in documents, and vice versa.
Or, in terms of topic modelling: What do the topics that contain terms associated with surveillance and privacy look like? The results of this analysis will be discussed in the next chapter, after the sub-chapter about the qualitative research method. I also applied the topic models to all articles in the actor corpora to extract those articles that fit with a topic about surveillance and/or privacy. This came down to 10,721 articles for the government corpus and 19,458 articles for the commercial corpus – 27,511 unique articles. From these articles, a workable random sample was selected for the discourse analysis – the methodology of which will be discussed in the next section. The aim was to retrieve around 60 articles. More articles would be preferable, but this would not be manageable given the limited time one can spend on a Master’s thesis. The results of this search will be discussed in the results chapter.
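The two selection steps described above, finding topics with a surveillance keyword among their thirty most probable words and then extracting the articles that fit such a topic, can be sketched as follows. This is a simplified illustration on hand-made distributions, not the Gensim-based code of the thesis. The probability threshold `min_prob` is an assumption, as the text does not specify when an article counts as fitting a topic, and the keywords are lowercased on the assumption that the vocabulary is lowercased too.

```python
SURVEILLANCE_KEYWORDS = {
    "surveillance", "wiretapping", "cctv", "spying", "tracking",
    "rfid", "privacy", "private", "nsa", "data-mining",
}

def relevant_topics(topics, keywords, topn=30):
    """Return ids of topics with a keyword among their topn most probable words."""
    hits = []
    for topic_id, word_probs in topics.items():
        top_words = sorted(word_probs, key=word_probs.get, reverse=True)[:topn]
        if keywords & set(top_words):
            hits.append(topic_id)
    return hits

def matching_articles(doc_topics, topic_ids, min_prob=0.1):
    """Return ids of documents that give enough weight to a relevant topic."""
    wanted = set(topic_ids)
    return [doc_id for doc_id, dist in doc_topics.items()
            if any(t in wanted and p >= min_prob for t, p in dist)]

# Hypothetical topic-word and document-topic distributions.
topics = {
    0: {"nsa": 0.05, "surveillance": 0.04, "snowden": 0.03},
    1: {"stock": 0.06, "market": 0.05, "shares": 0.04},
}
doc_topics = {"a1": [(0, 0.7), (1, 0.3)], "a2": [(1, 0.95), (0, 0.05)]}

hits = relevant_topics(topics, SURVEILLANCE_KEYWORDS)
print(hits, matching_articles(doc_topics, hits))
```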

18 Using perplexity scores, Blei, Ng and Jordan found an optimal T of 50 for a corpus of 5,225 abstracts of research articles with 28,414 unique terms, and a T of 100 for a corpus of 16,333 newswire articles with 23,075 unique terms (2003). For a corpus of 885 abstracts published in the IEEE Transactions on Computational Biology and Bioinformatics, containing 5,004 unique words, Tang et al. found an optimal T of 40 (2015). This small set of examples seems to signal that both an increase in documents and an increase in unique tokens leads to an increase in topics, with the effect of the increase being greater the smaller the corpus is.

19 One can attempt to find the ideal T for a topic model using an iterative machine learning process, where many different models with different values for T are generated using a certain training set of data, after which the models are presented with a test set and a so-called perplexity rate is calculated (Zhao et al. 2015), or some other method of accuracy prediction is applied (Wallach et al. 2009). The assumptions behind this are: the lower the perplexity, the less ‘surprised’ the model is by the test data, the more accurate the model’s settings are, and the more suitable the model is to be applied to describe other yet unseen documents (Zhao et al. 2015). Such procedures are complex and time-consuming though, as well as not infallible, and certainly beyond the scope of this thesis, especially as there is no library for such a procedure available in Python for LDA.

20 Seeing as these are American news articles and I am not American, some topics that I deem incoherent might actually be coherent for American readers. Moreover, as each new model trained on the corpus will be different because of the random initial distributions, regardless of the parameter settings, chance is also a factor in the performance of a model. This means that in a different test session, other models might perform better.
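The idea behind the perplexity measure mentioned in footnote 19 can be illustrated on a much simpler model than LDA. The sketch below computes perplexity for a unigram model as the exponential of the negative mean log-probability of held-out tokens. The two word distributions and the test tokens are hypothetical, and the sketch is not a procedure for evaluating an LDA model.

```python
import math

def perplexity(word_probs, test_tokens):
    """exp of the negative mean log-probability the model gives the test tokens."""
    log_likelihood = sum(math.log(word_probs[token]) for token in test_tokens)
    return math.exp(-log_likelihood / len(test_tokens))

# Hypothetical unigram models over a four-word vocabulary.
uniform = {"privacy": 0.25, "data": 0.25, "court": 0.25, "market": 0.25}
fitted = {"privacy": 0.4, "data": 0.4, "court": 0.1, "market": 0.1}
test_tokens = ["privacy", "data", "privacy", "data"]

print(perplexity(uniform, test_tokens), perplexity(fitted, test_tokens))
```

The model that assigns more probability to the words that actually occur in the test data is less ‘surprised’ by them and therefore gets the lower perplexity.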

3.3 Method for the qualitative analysis

This sub-chapter details the method for the qualitative part of this research: the discourse analysis. Like many discourse analyses, it is an analysis of frames. Moreover, it is a descriptive rather than inductive analysis, which means that I will not induce frames from the articles but rather place articles inside established frames, to answer the following sub-questions: How are technology-practices of surveillance described, justified and critiqued in the news media discourse about governmental actors and commercial actors respectively? And what is the role of privacy in these discourses? The choice for the specific frames will be motivated in the current chapter, as well as the implications and context of my research corpus that are relevant for the interpretation of the results of the discourse analysis.

The implications and context of the corpus

As described in the first sub-chapter, discourse analysis entails the studying of meaningful patterns in the structure of discourse, usually as expressed in texts. Furthermore, a discourse always encompasses a certain domain of reality, restricted in time and space; it is generated by certain actors for certain actors, in a historical and cultural context. Any analysis of discourse, as well as the interpretation of that analysis, should thus be made with reference to the identity of the actors that created it, the audience that it is addressed to, and the specific context that the discourse has been created in. In the case of this research, the discourse is produced by news media, for a broad, American audience. Moreover, as it is centred around the concepts of surveillance and privacy, it is produced in the context of all sorts of other discourses that are of influence to it, such as political language, product-promoting language by technology manufacturers, and activist language by interest groups. Because of these facts, the representativeness of the news media discourse for the surveillance and privacy discourse as a whole – the full assemblage of practices, technologies and discourses (Barnard-Wills 2010, 549-550) – is limited in some particular ways. Firstly because of the obvious limitation to American news media sources as producers of the discourse, which means that the discourse is inherently influenced by judgements of ‘newsworthiness’ – issues will only get reported on when the journalists and editors believe they make good news stories for their English-speaking, largely American audience. Moreover, the multiplicity of different news actors in the news media sphere, as well as in the media sphere as a whole, means that different journalistic media may contribute to a different section of a discourse.
They may give some subjects a lot of attention while remaining disinterested in others, and may prefer giving word to certain kinds of voices, based on the experience, knowledge, or political or ideological orientations of these actors. For example, as Branum and Charteris-Black (2015) found in their analysis of the reporting strategies of various British newspapers on the revelations of Edward Snowden about the worldwide surveillance by the NSA, the content of the newspaper articles differed between media outlets along the lines of the personal biases of the journalists they attract and the expectations and interests of their regular audience. The newspaper The Guardian, for instance, used legal and moral grounds to condemn the surveillance practices revealed by Snowden, while The Sun condemned Snowden and defended surveillance on grounds of the wellbeing and safety of the state. The Daily Mail on the other hand stayed away from argumentative frames and only described the events and Snowden as a person (18-20). These differences between media outlets do not matter much for the current analysis though, as the goal of this research is to reveal the multiplicity of meanings inside the news discourse as a whole, with all its different underlying interests and influences, which may or may not explicitly become clear throughout the analysis.

This thesis will compare two spheres in the American public discourse on technology-practices of surveillance, namely those practices executed or controlled by government actors and those by commercial actors. Of course, the two spheres are not separate; as described earlier in this thesis, security actors may use surveillance technologies developed by commercial actors and often rely on the co-operation of commercial actors to do their surveillance work, while commercial actors are restrained and guided by government policy and legal frameworks in all facets of their business. Thus, close reading of the research corpus at hand is required to unravel the messy borders between surveillance by government actors and commercial actors, and also to gain insight into how exactly these borders are portrayed in the media. For this purpose, a collection of frames, or categories of frames, is needed to interpret the sentences of the four corpora.

The coding framework

Like any discourse analysis, mine will be just one of many different possible interpretations of the material – it is an exercise in creating a ‘reading’ that is as probable and true as possible given the corpus at hand and my domain knowledge as a researcher, and relevant to the theory that this thesis engages with. The specific framework is based largely on the one that was inductively constructed by Barnard-Wills from a corpus of three hundred randomly selected texts from thousands of scraped UK news media articles (2011). Like many frameworks, this one takes as its main distinction the difference between positive and negative outlooks on the appropriateness of the technology-practices of actors. These two categories overlap to some extent, in the sense that they often refer to each other and are influenced by the same other discourses – for example those emergent from national governments, the EU, certain businesses and social action groups (554). Still, they are the clearest broad categories to consider, as research has shown that they tend to use opposing frames in the same areas of interest. Within the two categories of appropriateness and inappropriateness, I will distinguish five areas of public interest that frames can appeal to, inspired by, but not entirely equal to, Barnard-Wills’ surveillance-specific interpretation of Neuman’s five key themes in news media texts (554-562). These are:

- ‘Economic issues’ (economical workings and interests; costs and benefits for an organisation or individual);
- ‘Human impact issues’ (consequences for culture, social relations, safety and the all-round welfare of (a certain group of) individuals);
- ‘Conflict situations’ (categorising protagonists as ‘us’ or ‘them’; a struggle between a ‘good’ and ‘bad’ party);
- ‘Control by powerful others’ (dystopian critiques or critiques thereof; grass-roots activism or local or new (business) initiatives against control by powerful actors), and:
- ‘Moral values’ (which is closely related to human impact, but focussed on abstract values rather than physical consequences for people’s lives)

There are of course many frames for describing surveillance that fall in more than one category. For example, when a surveillance practice is framed in an ‘end justifies means’ manner, where national security is argued to be more important than privacy – one of the main ways in which privacy is framed, according to Mols and Janssen (2016) and Lischka (2015) – this both says something about moral values (they are not absolute) and human impact (the physical safety of the individuals of society is paramount and can effectively be protected via surveillance). In such cases, the details of the framing are decisive: when national security or privacy is defended by presenting it as something that is important in and of itself the frame fits the ‘moral values’ category; when it is defended by referring to specific consequences for people’s lives (such as: people may get killed by terrorists, or people’s private photos may get viewed without permission) it is a human impact issue. In this sense, the categories described above should not be considered exclusive boxes but rather methodological tools to help identify and interpret the most common frames inside the corpus, and through this systematically compare the two discourse spheres. Moreover, though an inductive analysis is not the purpose of this thesis, I will remain attentive throughout the analysis to dominant categories of framing that did not come up in Barnard-Wills’ study but are relevant to understand the discourse that is the subject of study in this thesis. More categories may be added to the five key themes described above when this is beneficial to the analysis. The articles that will be analysed in this thesis are not just those that were selected because they feature talk about surveillance, but also those that feature the core issue of one of the negative frames of those practices, namely privacy.
This is because of the dominance and complexity of the subject of privacy in critical academic research into technology-practices of surveillance, as well as the dominant presence of privacy in previous discourse research. As Barnard-Wills’s frames are inspired by general frames within the news media, regardless of topic of interest, the current framework is expected to also suffice to describe the framing of the concept of privacy outside of the context of surveillance. Moreover, it can be expected that the categories of human impact and moral values will be highly dominant in this corpus, as privacy is perceived as a human right and a moral issue in the Western world, and is thus something that has tangible impact on people’s lives and welfare. The presence (or absence) of certain frames in the four overlapping discourses – privacy and surveillance; government practices and commercial practices – and the semantic details of the framing are highly interesting considering existing theory on how to make sense of surveillance and privacy and argue for social or regulatory action with regard to these subjects. To this purpose, the results of the discourse analysis will be discussed with reference to existing critical research.

4. Results

4.1 Government surveillance and commercial surveillance compared in topics

In the government actor corpus, the LDA model found nine topics in which one of the surveillance/privacy keywords (surveillance, wiretapping, cctv, spying, tracking, rfid, privacy, private, NSA, data-mining) occurred among the thirty most probable words. Five of these topics were indeed about these themes (see table 1); the other four featured the term ‘private’ and were clearly related to, for example, the finance sector and the prison industry (see appendix D). In the commercial actor corpus, six topics were found, of which four were clearly or somewhat related to technology-practices of surveillance (see table 2, appendix D). The topics are named intuitively.

| Web tracking | FBI encryption | Surveillance legality | Spying | NSA scandal |
| --- | --- | --- | --- | --- |
| google | fbi | amendment | cia | spy |
| search | cook | surveillance | celebration | nsa |
| privacy | bernardino | appropriation | celebrate | dom |
| data | comey | amend | mayo | snowden |
| user | encryption | invoice | tangible | surveillance |
| site | unlock | viability | spying | revelation |
| information | government | stymie | covert | bulk |
| web | noah | lister | inadvertently | router |
| use | james | massie | anniversary | encrypt |
| collect | locked | handicap | geoff | bruno |
| website | privacy | bankston | 6-year-old | bride |
| tracking | would | fourth | soa | sonar |
| tool | encrypted | congresswoman | rodeo | edward |
| page | ludicrous | defined | meltdown | targeted |
| online | help | backdoors | birthday | intelligence |
| can | co-defendant | gather | sanger | agency |
| internet | enforcement | crypto | slow-moving | government |
| personal | deluge | 702 | cinco | es |
| other | case | prevent | sparkler | maloney |
| location | director | warrantless | firefox | phone |

Table 1: The surveillance-related topics in the government actor corpus.
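
The keyword-based selection of topics described above can be sketched as follows. This is a minimal illustration in plain Python, not the code used in this thesis: the function names and the toy topic-word probabilities are assumptions made for the example, and only the keyword list and the cut-off of thirty words are taken from the text.

```python
# Keyword list as given in the text (lower-cased, as is usual after LDA preprocessing).
KEYWORDS = {"surveillance", "wiretapping", "cctv", "spying", "tracking",
            "rfid", "privacy", "private", "nsa", "data-mining"}


def top_words(word_probs, n=30):
    """Return the n most probable words of one topic (a word -> probability dict)."""
    ranked = sorted(word_probs.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:n]]


def keyword_topics(topics, keywords=KEYWORDS, n=30):
    """Map topic id -> keywords that occur among that topic's n most probable words."""
    hits = {}
    for topic_id, word_probs in topics.items():
        found = [w for w in top_words(word_probs, n) if w in keywords]
        if found:
            hits[topic_id] = found
    return hits


# Toy, made-up topic-word distributions (hypothetical, not the thesis model).
toy_topics = {
    0: {"google": 0.05, "search": 0.04, "privacy": 0.03, "data": 0.02},
    1: {"bank": 0.06, "private": 0.04, "equity": 0.03},
    2: {"yoga": 0.05, "gym": 0.04, "walk": 0.03},
}

print(keyword_topics(toy_topics))
```

In the toy data, topic 1 is selected only because of the word ‘private’, which mirrors the finance-related topics that had to be set aside by hand.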

The government actor-related topics feature several known actors: the government, the FBI, FBI director James Comey, Google, Apple’s CEO Tim Cook, the CIA, the NSA (which was also a keyword), Edward Snowden and the web browser Firefox. They feature technologies and objects like the internet, websites, webpages, encryption, warrants, amendments, routers, phones, and invoices. Verbs are to search, to collect, to track (which was also a keyword), to unlock, to enforce, to appropriate, to gather, to prevent, and to reveal – all actions that can be executed by actors with political power or surveillance powers. Interesting adjectives and other words are personal, location, warrantless, ludicrous, viability, inadvertently and bulk, which may be used to describe certain kinds of data or surveillance practices. There are also some seemingly irrelevant or illogical words included in the topics, especially in the CIA topic (‘Spying’), and to a lesser extent in the amendment topic (‘Surveillance legality’).

| Online info policy | Snowden booksale | Fitness tracking | Surveillance footage |
| --- | --- | --- | --- |
| information | Chain | activity | ms |
| use | Cent | exercise | footage |
| our | Shop | physical | surveillance |
| us | edward | fitness | convenience |
| may | Nsa | yoga | clerk |
| your | Per | day | tidal |
| transfer | kraft | gym | warrant |
| privacy | royalty | tracking | cathy |
| share | snowden | routine | stash |
| policy | pencil | tracker | unlock |
| how | bookstore | cage | locked |
| see | merchandise | step | shootout |
| detail | sell | monitor | couch |
| other | noble | wrist | robber |
| service | york-based | sit | embezzlement |
| outside | revelation | daily | asa |
| if | e-book | sync | resettle |
| locate | store | minute | tearful |
| business | best-seller | watch | twig |
| tailor | its | walk | surrey |

Table 2: The surveillance-related topics in the commercial actor-corpus.

The commercial actor-related topics are less numerous and also appear less strongly connected to the problematics of surveillance and privacy. The only recognizable named actors are Snowden and the NSA, and the topic in which they feature appears to predominantly describe articles about Snowden-related sales, like books. There are generally a lot of sales- and enterprise-related words like business, service, shop, store and best-seller. It is also interesting that there is a whole topic devoted to exercise-related practices of (arguable, potential) surveillance, which includes words like monitor, fitness, physical, tracking, daily, wrist and watch, and appears to describe articles about wearable tech that tracks physical reactions to exercise.

The four commercially-oriented topics returned almost double the number of articles returned by the five government-oriented topics, which means that the latter topics are on average less common than the former. This might be due to the fact that the commercially-oriented topics are broader and less specific to surveillance practices. Indeed, the lack of specific actors mentioned in the commercially-oriented topics most logically means that technology-practices of surveillance are generally discussed with reference to government actors and legal issues.

There are three commercial actors mentioned in the government-oriented topics. The first is Google, which features in a topic that appears to be about the various uses of search engines like Google and the way this specific business handles the data of its users. No legal terms are ranked highly in this topic, but privacy is ranked very high, which means that the privacy impact of Google’s policies and of other, more general web browsing-related actions is a frequent subject of interest in this topic. The second actor, the CEO of Apple, is mentioned in the FBI encryption topic, most likely because of the public debate in 2016 about whether the FBI can legally force tech manufacturers like Apple to give up user data that is protected by encryption when that data is needed for a criminal investigation. The last commercial actor, Firefox, is mentioned in the spying topic that also features the CIA – it is impossible to tell from the other probable terms in the topic why this actor appears in this context.

Zooming in on the explicit mentions of the concepts of surveillance and privacy in the topics, there are again clear differences between the commercial actor topics and the government-oriented topics. When surveillance is explicitly mentioned in the discourse about commercial actors, this mostly appears to concern video footage, for example in the context of robberies. In the discourse about government actors, on the other hand, surveillance is explicitly mentioned in a topic about amendments, invoices, congress and warrants, meaning that technology-practices of surveillance get explicated when their legal standing and appropriateness are being questioned by government actors.
When privacy is explicitly mentioned in the commercially-oriented topics, this occurs in the context of the online information policy of companies – the occurrence of the words ‘share’, ‘see’, ‘use’ and ‘service’ points towards a perspective focussed on user choices. In the government-oriented topics, on the other hand, privacy is explicitly mentioned in the topics about Google and FBI encryption, which raises the hypothesis that the privacy of civilians might be an equally common concern regarding technology-practices by governmental actors and commercial actors. In the web tracking topic, words like ‘collect’, ‘tracking’, ‘personal’, ‘data’ and ‘location’ imply a perspective focussed on what companies are doing with the personal data of users. The occurrence of the word privacy in the FBI encryption topic appears to mean that, in this public and legal debate, the danger that giving security agencies access to unencrypted user data poses to the privacy of users is the main ethical point of criticism – though of course the term privacy may also be used in arguments refuting such a criticism.

Indeed, to understand exactly how the terms in the topics relate to each other and in what ways the two discourses differ, a close reading of documents containing these topics is a necessity. How are technology-practices of surveillance described, justified and critiqued in the news media discourse about governmental actors and commercial actors respectively? What is the role of privacy in these discourses? And do the results of the discourse analysis correspond to or explain the results of the topic modelling analysis?

4.2 Comparing the use of frames in the context of different kinds of actors

The results of the topic modelling search for relevant articles

As described in the methodology, around 60 articles needed to be retrieved from the 27.00+ articles in the full corpus of texts that fit at least one of the topics. The prerequisite for such an article to actually be included in the discourse analysis was that the article either evaluates technology-practices in some way, or that it talks about people’s right to privacy – either in the context of surveillance or some other context. Scanning random articles in batches of a hundred in search of such texts led to a corpus of 67 articles. 38 texts featured surveillance practices by governmental actors, 19 by commercial actors. 16 articles weren’t about surveillance but did discuss privacy and data – these usually fit the commercial actor category better than the government category. These 67 articles were chosen from a sample of 700 articles (a yield of just under ten percent), meaning that the topic model identified a lot of unsuitable articles. There are several reasons for this. First, some articles included text that was not actually part of the article, which skewed the topic assignment and possibly also the content of some of the topics. Examples are privacy policies of websites (“Your information may be shared with other NBCUniversal businesses and used to better tailor our services and advertising to you. For more details about how we use your information, see our Privacy Policy”), and comment section information (“To protect your own privacy and the privacy of others, please do not include personally identifiable information, such as name, Social Security number, DoD ID number, OSI Case number, phone numbers or email addresses in the body of your comment”).
Second, there were articles in which important words from the topics were mentioned only in passing (“The sheriff’s office have obtained surveillance photos of the suspect and hope someone may recognize her”; “The elder Seleznev insisted in an interview that his son was innocent (…). He also said his son’s arrest may have been retaliation for Russia’s harboring of former National Security Agency contractor Edward Snowden”), and articles where words were used in a context unrelated to surveillance or privacy (“The tool is also handy for tracking language and style changes over time”). Third, there were articles that did talk about technology-practices of surveillance, or practices or technologies that are connected to it or could potentially be a part of surveillance, but did not include any evaluation of the practices or technologies in this light, making them irrelevant to my analysis even if they did fit one of the topics. Lastly, the assignment of topics to documents by the model appears somewhat too unspecific for the purpose of identifying relevant articles – for some texts it was unclear how they fit the topics. This applied mostly to the relatively short articles that the search returned.
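
The batch-wise screening described in this section can be illustrated with a short sketch. It assumes a hypothetical function `is_relevant` that stands in for the manual judgement made while reading each article; the function name, the fixed random seed and the stopping rule are assumptions for the example rather than a record of the actual procedure.

```python
import random


def screen_in_batches(article_ids, is_relevant, target=60, batch_size=100, seed=0):
    """Scan randomly ordered articles batch by batch until `target` relevant ones are found.

    Returns (relevant_ids, number_of_articles_scanned). Whole batches are scanned,
    so the result can overshoot the target.
    """
    order = list(article_ids)
    random.Random(seed).shuffle(order)  # seeded so the draw can be repeated
    relevant, scanned = [], 0
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        relevant.extend(a for a in batch if is_relevant(a))
        scanned += len(batch)
        if len(relevant) >= target:
            break
    return relevant, scanned


# Share of scanned articles that proved suitable, using the figures from the text.
yield_rate = 67 / 700
print(f"{yield_rate:.1%}")
```

Scanning whole batches before checking the target is one way to explain why the final corpus (67) is somewhat larger than the roughly 60 articles aimed for.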

Categorising the articles

In the corpus of 67 articles, there are texts from many different sources, in different journalistic styles, with different subjects of interest. The opinions, observations and facts about technology-practices of surveillance and privacy rights can be critical or supportive of the practices; can be about actual practices and situations or potential future practices and situations; and can be voiced by the authors of the article or by actors cited inside the article. Moreover, like any news article, they can provide a one-sided, unbalanced view of the practices in question or a comprehensive view; they can be argumentative in nature or descriptive; and be built from strong or weak arguments. I read each article in detail in search of positive and negative framing that fit one of the five thematic categories I selected (‘economic issues’, ‘human impact issues’, ‘conflict situations’, ‘control by powerful others’, and ‘moral values’). During this process, I found two more categories that did not come up in the analysis of Barnard-Wills. These are the ‘technological properties’ frames, which are arguments against or in favour of certain technology-practices based on technological limitations or opportunities and the perceived inevitability of certain characteristics of technology, and the ‘legal issues’ frames, which are arguments rooted in facts of law and evaluations of whether practices are too regulated, not regulated enough, or wrongly regulated.21 The results of the discourse analysis are schematically represented in the table below.

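| Categories of frames | Surveillance by government actors: positive | Surveillance by government actors: negative | Surveillance by commercial actors: positive | Surveillance by commercial actors: negative | Other practices involving (potential) privacy harm: positive | Other practices involving (potential) privacy harm: negative |
| --- | --- | --- | --- | --- | --- | --- |
| Economic issues | 2 | 2 | 7 | 4 | 2 | 3 |
| Human impact issues | 18 | 20 | 7 | 11 | 16 | 11 |
| Conflict situations | 0 | 1 | 0 | 2 | 0 | 0 |
| Control by powerful others | 0 | 7 | 0 | 6 | 0 | 3 |
| Moral values | 5 | 11 | 0 | 4 | 1 | 6 |
| Legal issues | 6 | 23 | 0 | 5 | 2 | 7 |
| Technological properties | 4 | 15 | 5 | 11 | 4 | 5 |
| Total number of mentions of frames | 35 | 79 | 19 | 43 | 25 | 35 |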

Table 3: All individual frames in the corpus, sorted by category, sentiment and actor/practice. If one frame was mentioned in two different articles, it was counted twice; if it was mentioned twice in one article, it was counted once. When two different frames in one article belonged to the same category, both frames were added to the count of that category.
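
The counting rule in the caption of table 3 can be made explicit with a small sketch. The data layout (one tuple per coded mention) and the example codings are invented for illustration; the coding in this thesis was done by hand, and only the rule itself is taken from the caption.

```python
from collections import Counter


def count_frames(mentions):
    """Count frames per (practice, sentiment, category) following the table 3 rule.

    Each mention is a tuple (article_id, practice, sentiment, category, frame).
    Repeats of one frame inside one article collapse to a single count; the same
    frame in another article, or a different frame in the same category, counts again.
    """
    unique = set(mentions)  # drops repeated mentions of a frame within one article
    return Counter((practice, sentiment, category)
                   for _, practice, sentiment, category, _ in unique)


# Invented example codings (hypothetical, for illustration only).
mentions = [
    ("a1", "government", "negative", "legal issues", "practice is unconstitutional"),
    ("a1", "government", "negative", "legal issues", "practice is unconstitutional"),  # repeat in same article
    ("a1", "government", "negative", "legal issues", "oversight is too weak"),         # different frame
    ("a2", "government", "negative", "legal issues", "practice is unconstitutional"),  # same frame, other article
    ("a2", "government", "positive", "human impact issues", "prevents attacks"),
]
counts = count_frames(mentions)
print(counts[("government", "negative", "legal issues")])
```

Because the unit of counting is the frame per article, the totals in table 3 can exceed the number of articles in the corpus.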

21 When privacy is framed as a right, this frame belongs to the category of moral values – only when a specific section of the law is referred to is it a ‘legal issues’ frame.

In the discourse about government practices of surveillance, the two major technology-practices that are discussed are the data mining by US national security actors like the NSA and the FBI, either targeted or in bulk, and the encryption of personal communication by technology companies. The main case in which these two technology-practices confront each other is the public discussion on the fight between the FBI and the technology company Apple regarding the iPhone of the perpetrator of a mass shooting – which is also a topic identified by the LDA model. The FBI couldn’t access the information on the locked phone and wanted Apple to create a ‘backdoor’ in the phone’s software, which Apple refused despite a court order. Popular frames in support of the FBI were the moral argument that the phone could contain evidence important for punishing the perpetrators and the human impact argument that it could help fight terrorism and prevent another attack. Popular frames in support of Apple were the technological properties and human impact arguments that creating such a backdoor would form a general weakness for all iPhones, which would endanger the privacy and security of all consumers. It was, as one journalist observed, a matter of “security versus security”.

The discourse about this specific case is also characterized by arguments against and in favour of encryption in general. National security actors and legal actors are repeatedly quoted as saying that the inability to access encrypted information makes it hard to find criminal activity and potential terrorists, and that new legislation is necessary to ensure companies like Apple are forced to comply in the future – the FBI eventually cracked the phone without Apple’s help, but this was not a scalable solution as it cost a lot of resources (an economic issue). Representatives of Apple and actors supportive of the company’s stance, on the other hand, argued against this by referring to physical consequences for people’s lives and by using moral and law-based arguments: taking encryption away from civilians was for example argued to be counterproductive, as the bad guys will still find ways to use it, and encryption was also equated with privacy, which was defended as being part of civil liberty and the American Constitution.

Another main subject in the articles about government surveillance is the whistle-blower Edward Snowden, or rather the practices by the NSA he revealed – which, just like the Apple versus FBI case, is also a topic identified by the topic model.
Where this kind of surveillance, as well as government actor surveillance in general, is concerned, positive discourse is characterised especially by human impact frames related to national security. Other arguments are rare, and either argue that the actions are legal (and those by Edward Snowden not) or that when better technology is used, the problem of accessing US citizens’ private communication without their consent will be diminished. The negative discourse, on the other hand, is characterized by many privacy arguments – either moral arguments presenting privacy as a right or something valuable in and of itself, or human impact arguments about private photos being in the hands of the NSA and the danger of self-censoring. In the category of human impact, the pushback against journalists critical of surveillance and the critique that people of colour suffer more from surveillance are also present, though both only once. Even more so than fears for privacy, legal arguments are characteristic of the critical discourse on government surveillance – existing legislation is argued to be insufficient and the practices of security agencies to be insufficiently transparent, or even illegal under existing law. In the category of technology-based frames, there are observations that the NSA’s ability to collect data is now outpacing its ability to analyse it; that the agency has failed to prevent terrorist attacks in the past despite all the data; and that as the internet gets more centralized and ubiquitous, surveillance is only going to get worse. In the category of ‘control by powerful others’, citizens are called on to reclaim their rights themselves by taking measures such as using encryption. Also interesting here is the ‘conflict situations’ category, as in the surveillance-negative discourse the government is sometimes presented as the bad party and the tech companies as the good ones, while in the surveillance-positive discourse the American security agencies are presented as the good actors when the surveillance in question is executed by other nations.

The discourse about surveillance practices by commercial actors is a lot less law-centred, which fits the results of the topic modelling. Arguments against surveillance are based mostly on human impact issues and technological properties; arguments in support of surveillance are predominantly based on human impact issues and economic issues – which is unsurprising since the prime interest of commercial enterprises is generating revenue, an economic issue according to the definition of this category of frames. Legal issues are not once mentioned in support of commercial surveillance. When they are mentioned as a criticism of these practices, this is mostly in the form of calls for stricter regulation or observations that practices don’t line up with existing laws, just as is the case with governmental surveillance.

The discourse about commercial surveillance can be split in two parts: a larger part about tracking web users and collecting, selling or analysing personal data for the sake of advertisement or personalized services, and a smaller part about specific (future) technology-practices that might be used for data-driven surveillance, amongst other uses. In the first part, surveillance is positively regarded from the viewpoint of economic issues like influencing consumer behaviour, identifying potential clients and monitoring the market – in only one case is there a human impact frame, about how relevant advertisement benefits users. More interesting is the previously unidentified category of frames about technological properties: here, surveillance is for example defended by referring to the possibilities of new technologies to ‘tag’ data that might be inappropriate, or to allow users to delete certain data.
More often though, technological properties are mentioned as a criticism of surveillance – the internet itself is for example described as a technology that is inherently bad for privacy as it is impossible to escape tracking; anonymized data is argued to ‘still be there’ while users don’t know what’s happening with it; and privacy is argued to become increasingly important as technology progresses and data gets more personal, for example about one’s health and movements. Most human impact issues in this part of the discourse centre on privacy: that companies know too much; that services that are presented as free aren’t actually free; that people don’t know how much companies know and what they do; and that online anonymity is important for people to be themselves and explore other sides of themselves. Related to these arguments are the moral frame that monetising personal data is inherently bad, and the economic frame that the services that are given in exchange for personal data are not worth the cost. Just as in the critical discourse about government surveillance, consumers are called on to protect themselves by doing things like using encryption, changing their privacy settings and using software that blocks third-party data-miners. There is one article that goes even further and argues that more control of data is not the solution, but that a new paradigm of thinking about technology and information is needed to end the idea that “more information is always better”. The only conflict frame in this part of the discourse is one where the battle against commercial surveillance is framed as a battle of the rich and powerful one percent against the ninety-nine percent.

Next to the more general debate on commercial data-mining practices, there are also articles that evaluate specific technology-practices. Here, there are again two categories: evaluations of drones, CCTV with facial recognition and other smart cameras; and evaluations of virtual reality, augmented reality and ‘internet-of-things’ applications – which are usually hypothetical rather than case-based. Regarding cameras, some of these articles discuss government practices rather than commercial practices, or both. Here, human impact is the main focus: advantages like crowd control and catching shoplifters contrast with concerns about taking footage of people when they aren’t aware of it or are in a private space, and concerns about the illegal collection of evidence. Regarding VR, AR and IoT, advantages for consumers in the form of better and more personal (social) information and advantages for businesses in the form of better targeted advertising are combined with more general societal advantages: greater efficiency in any area of logistics and information, and the empowerment of people who struggle with their health or physical disabilities. Concerns about privacy are addressed only once in a frame that is positive about these technologies, and are rejected in that same article by the statement that privacy doesn’t exist anymore: people upload their personal information everywhere and companies just use it. In this part of the surveillance-negative discourse, privacy is not perceived as dead yet. Instead, its complete eradication when data is collected everywhere around people, or even by technology they wear on or inside their body, is a human impact concern. Regarding virtual reality, one article raises the question whether people’s behaviour in VR, or even their thoughts and fantasies, should be regulated and scanned for inappropriateness. Where technological properties are concerned, multiple articles raise the question whether all the data that will be collected and automatically used in practical applications will really lead to better decisions, and whether the data (and people’s privacy) will be safe from hackers. One article also raises a conflict situation argument: the future world will be divided in two competing classes, the watchers and the watched, the latter of whom will be experimented on. Just as in the other ‘conflict’ frames in the corpus, there is a good, victimized party and a bad, powerful party, where the good party is the one that represents the public, or fights to protect it.

A closer look at the issue of privacy

In the articles discussed up till now, privacy is the main human impact concern as well as the main moral concern, both in the context of government actors and commercial actors – almost all negative framing in these two categories either directly or indirectly concerns privacy. In the surveillance-positive discourse, privacy is addressed less often. When it is mentioned in the context of government actors, it is only mentioned in legal issue frames that argue that actions are sufficiently in compliance with (new) regulations or advice by, for example, the Privacy and Civil Liberties Oversight Board, which means that in surveillance-positive discourse, other (human impact) arguments simply seem to weigh more heavily than privacy arguments. In articles that are positive about commercial surveillance, privacy is mentioned only in technological property arguments about the choices tech companies allow users to make to protect their own personal information; the things companies do to anonymize data; and the non-existence of privacy in the data-driven world. Moreover, one time it is argued that social media companies never do anything with user data that actually harms users, and one argument states that people’s conception of privacy will change as more objects and services work via the internet and society becomes more data-driven and transparent.

Privacy is also a much-discussed subject in articles about data-practices that do not fall under the umbrella of surveillance, but are still accompanied by questions of whether or not it is appropriate that personal information is collected, stored or used in a certain way. Here, there are articles about new bills that aim to strike a balance between data privacy and data access, for example an American bill about access to the Facebook data of deceased people, and a European bill about the ‘right to be forgotten’ (the right of people to demand that search engines like Google remove links to certain information about them). Other articles are about industry-specific cases of information collection and sharing, like educational institutes connecting student data from various sources with the help of private companies, and health care providers, health researchers and patients sharing data with each other. There is also one article on the general use of a specific technology, namely cloud computing. Furthermore, there are articles about court cases on (mis)use of personal images, like publishing a picture of a deceased victim of terrorism and secretly taking pictures below women’s skirts. Lastly, there are two articles that discuss surveillance technology that is used by consumers, like CCTV to protect one’s house and an app with which parents can track the whereabouts of their children. In the articles that are positive about one of this diverse set of technology-practices, the main arguments concern quality of services and security – both of which are intertwined with privacy.
On the one side, quality can be achieved while providing information security and privacy protection, thanks to characteristics of the technologies that are used, careful practices and good rules and laws. On the other side, the practices can achieve security: CCTV at home, wearable GPS for children and giving patients access to their own health data are presented as initiatives that give people more control and a sense of security. In the articles that are critical of these technology-practices, security, consent/knowledge and emotional harm are the most common arguments. Information technologies are observed to be inherently hackable and thus insecure and dangerous for privacy; tech developers are observed to have too few incentives to do their very best to ensure safety; there are worries about what might happen with data without people’s consent or knowledge (data being distributed to many unknown parties; advertising on the basis of private information, at undesirable moments); and there are complaints that information has been made public or used in a way that is perceived as harmful (private images that were published; information being gathered about students that may come back to haunt them later in life). Sometimes, these arguments are accompanied by a call for stricter laws about what can and can’t be done with certain information.

All in all, the discourse about privacy in general, across all articles in the corpus, is characterised by a focus on privacy as an abstract moral right or as having knowledge of and control over what happens with (sensitive) private information – knowledge and control that can be taken away by hackers and surveillance. Sometimes, but not often, human impact issues that are a consequence of this are discussed, like self-censoring, illegally collected evidence being used against suspects of a crime, negative impact on a career, emotional stress and, in one case, the expectation that the dominant societal conception of privacy will broaden and complaints about privacy harm by surveillance practices will die out.

Many of the main trends in the corpus regarding privacy correspond with the academic discourse on data-driven surveillance: in both discourses, privacy is the most talked about critical issue, and in both discourses privacy is mostly understood in terms of privacy of information and communication, where people are argued to have the right to control or have knowledge of what happens with information or (meta-)data about themselves, and the right to communicate without being monitored (Michael and Clarke 2013, 221; Van Wel and Royakkers 2004, 130-131). There are also some differences though, just as there are similarities and differences – but mostly similarities – between the results of this research and the results of other discourse-analytical studies of surveillance or privacy, which will be reflected on in the next chapter.

5. Conclusion

This thesis is centred around the following research question: How are technology-practices of surveillance by government actors typically problematized in current-day news media discourse, and how is this done for similar practices by commercial actors? The analyses in this thesis have shown that the core difference between the discourse about government surveillance and the one about commercial surveillance concerns the framework through which the practices are approached. As both the topic modelling analysis and the discourse analysis have shown, government practices are often approached from a law perspective – articles report on court cases, practices are evaluated in light of a certain law or right and there are calls for better or stricter laws, mostly when surveillance is framed in a negative manner. When surveillance is framed positively, it is not law-based arguments but human impact arguments that are dominant, specifically arguments related to the necessity or efficacy of surveillance for fighting crime or terrorism. In the discourse on surveillance by commercial actors, which appears to be smaller than the other discourse and also features fewer explicit mentions of surveillance, the positive and negative evaluations can be regarded as a struggle between, on the one hand, benefits for businesses and for people as consumers, who need more and better information and efficiency, and, on the other hand, benefits for people as human beings, who need privacy and security, set against the weaknesses of technologies, which cannot guarantee those things. In both discourses, privacy is the core human impact issue and moral issue – privacy of information and communication, where privacy can be a right or a feeling, or can concern a lack of control over and knowledge about personal information in some form (an image of the physical body, communication content, characteristics of a person, meta-data). This is especially true in the commercial actor-oriented discourse.
In the positive government surveillance discourse, privacy critiques are refuted by (implicitly) judging them secondary to other human impact issues; in the discourse on commercial surveillance these critiques are refuted by describing rules and technologies that help to make the surveillance practice, or another technology-practice that may harm privacy, safe from a privacy perspective, or by stating that privacy as a concept has changed or will change to fit current data practices. Next to privacy, security is a core concept that emerged in the discourse analysis. It is a concern that features in both discourses, in surveillance-positive as well as surveillance-negative frames. In the discourse about government surveillance, security is approached from a national security perspective, where terrorism and crime are the threats to people’s safety. In articles critical of government surveillance, security plays a role in the FBI versus Apple debate, where security is evoked as the security of people’s information and communication from malicious hackers, which would be compromised if the government were to take steps to weaken encryption. In the discourse about commercial surveillance, security is evoked in a similar manner – here, safety and privacy are inherently intertwined as well, as there are worries about how well-protected certain technologies are against hackers and/or surveillance. These arguments are often based on observations about technological properties, of the internet in general or of more specific technological objects or processes such as CCTV or cloud computing. These same criticisms are sometimes applicable to surveillance by government actors. Both discourses also feature frames in which people are called to reclaim control over their data against the power of government actors and/or big companies, sometimes with the help of small companies that have, for example, developed apps to communicate securely. Comparing these results to the theory discussed in the second chapter, both the criticism in the recent American news discourse on technology-practices of surveillance and the critical academic discourse feature privacy as the most common source of negative evaluations of such practices. As could be expected though, the news discourse is less diverse and more superficial than the academic discourse. Though governmental and commercial surveillance are highly interconnected, in the news media discourse the great majority of articles focus on only one of these actors. Some articles about the dispute between the FBI and Apple even present tech companies as the ‘good guys’ that protect the public against the too-powerful government – which may have its origins in American culture, which traditionally places a high value on corporate and personal freedom. Predictably, more abstract, philosophical arguments against surveillance rarely feature – there were just two articles discussing the cultural or ideological roots and implications of surveillance. More surprising is the relative uncommonness of criticism concerning discrimination and efficacy. Discrimination was discussed in only one article, in the context of government surveillance, where people of colour are argued to be wrongfully targeted more often.
Where efficacy is concerned, matters of costs and benefits are generally discussed in support of surveillance, as are arguments about how certain technology-practices can be used to optimize processes, create commercial opportunities and make life better or society safer. In criticisms of surveillance, the efficacy of the data-harvesting by the NSA was questioned only twice, and the costs of following up on alleged threats only once. The efficacy of the law, on the other hand, is a very big topic, especially regarding surveillance by government actors. Comparing the results of this research with the results of similar research, there are also some interesting parallels, as well as differences that cannot immediately be traced back to the differences between the corpora and the research focus (time, culture, keywords). Like Barnard-Wills and Lischka, I found that human impact frames related to physical security (from terrorism) are the most common frames in support of government surveillance – Lischka also found that privacy is the main concern in texts that are critical of such surveillance. It is striking that, even though this was not the aim of this thesis, I found two categories of frames that did not emerge in Barnard-Wills’s analysis, namely the ‘legal issues’ category and the ‘technological properties’ category. Neither of these themes plays a significant role in Barnard-Wills’s analysis; in Lischka’s analysis the law only plays a role as an authorisation for surveillance or a condemnation of whistle-blowers. In the privacy discourse analysis by Mols and Janssen, legal issues don’t emerge as topics of interest, but technological properties do, in an implicit way – the properties of the internet and its applications are part of the ‘privacy is dead’ frame, one of the six main ways in which they found privacy to be framed. In the CCTV and privacy analysis by Möllers and Hälterlein, both technological properties and legal issues play a role: they found the technological effectiveness of CCTV technology to be the most important positive framing, while in the negative discourse they found privacy issues to be the leading theme, combined with calls for better regulation – which is also something my analysis shows. My analysis can thus be seen as support for Möllers and Hälterlein’s thesis that privacy critiques are predominantly intertwined with regulatory problems that can be solved, rather than with other moral, economic and human impact issues such as inequality and cost-effectiveness, and that this intertwinement prevents privacy from exerting sufficient social and political pressure to actually halt surveillance. At the same time, the calls for stricter laws appear to translate into actual stricter laws, which can be considered a pushback against surveillance – though this mostly seems to apply to government surveillance and not to surveillance by commercial actors. In general, it appears that government surveillance is a greater concern in the American public sphere than commercial surveillance.


6. Discussion

To close off this research, it is important to reflect on the upsides and downsides of the methodology of this research – or rather, the combination of methodologies. After all, answering the main research question was not the sole purpose of this research: it was also an attempt to find a more objective and valid way of selecting the corpus for the discourse analysis than simply searching for the presence of one or more pre-defined keywords. To this end, I used a rather complex layered method consisting of named entity recognition, keywords and topic modelling. This method returned a highly diverse set of articles, the great majority of which weren’t relevant. Those articles that were deemed relevant, though, can reasonably be expected to be very representative of the news discourse as a whole, as they didn’t need to feature specific surveillance and privacy-related keywords to be selected – rather, they needed to be semantically related to those keywords to such an extent that they fit with a topic containing them. So even though the method is very experimental, the project is a success in the sense that the eventual corpus was certainly compiled in a more objective manner than in previous discourse research. The results of my analysis also showed a diverse range of frames in the surveillance discourse and opened new opportunities for further research. The first opportunity concerns the framing of surveillance and privacy in terms of technological properties. The presence of this type of framing in the research corpus highlighted the lack of attention in much previous research for the agency of technology in technology-practices of surveillance.
The hackability of technologies; ways for people to take control of technologies that contain their data; the technological complexity of surveillance practices; regulatory responses to these matters; and the discursive construction of all these dimensions are subjects that need to be studied in more detail to understand and shape the future of surveillance and privacy. How much control over their data – and thus their privacy – can people reclaim? In what ways and under which conditions is technology a tool of empowerment in this regard, and when is it a source of a sense of powerlessness or of social problems? How does technology change people’s perception of privacy? The second opportunity concerns the thesis raised by Möllers and Hälterlein about the ineffectiveness of privacy critiques, which is also supported by my research results. Why is there such a focus on privacy in the discourse? What are the differences between how privacy harm is framed and how other moral or human impact issues that are the consequence of surveillance are framed? Which arguments are most effective for actually limiting surveillance? Third, more comparisons between the discourse about surveillance by government actors and the discourse about surveillance by commercial actors are desirable, especially considering the relatively small size of the corpus I used for the discourse analysis. In a way, my analysis can be regarded as mostly exploratory: there is but a limited amount of time given for a Master’s thesis and the discourse analysis was combined with a time-consuming topic modelling analysis.
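The selection principle reflected on at the start of this chapter, in which an article qualifies because it fits a topic that contains surveillance- or privacy-related keywords and not because it contains those keywords itself, can be sketched as follows. This is a minimal illustration and not the code used for this thesis: the function name, the data layout (topics as lists of top words, articles as dictionaries of topic weights), the toy data and the threshold of 0.3 are all assumptions.

```python
def select_articles(topic_words, doc_topics, keywords, min_weight=0.3):
    """Return the ids of articles that fit a keyword-bearing topic.

    topic_words: dict of topic id -> list of top words (hypothetical layout)
    doc_topics:  dict of article id -> dict of topic id -> weight in the article
    keywords:    set of surveillance- and privacy-related keywords
    min_weight:  assumed threshold for an article "fitting" a topic
    """
    # Topics whose top words include at least one of the keywords.
    relevant = {t for t, words in topic_words.items() if keywords & set(words)}
    selected = []
    for doc_id, weights in doc_topics.items():
        # The article itself does not need to contain any keyword.
        if any(weights.get(t, 0.0) >= min_weight for t in relevant):
            selected.append(doc_id)
    return sorted(selected)


# Toy data, invented for illustration only.
topic_words = {
    0: ["court", "surveillance", "warrant", "fisa"],
    1: ["coffee", "cup", "tea"],
}
doc_topics = {
    "a1": {0: 0.6, 1: 0.1},   # fits the surveillance topic
    "a2": {0: 0.05, 1: 0.8},  # fits only the coffee topic
    "a3": {0: 0.35},          # fits the surveillance topic just above the threshold
}
selected = select_articles(topic_words, doc_topics, {"surveillance", "privacy"})
print(selected)
```

The threshold determines how many irrelevant articles end up in the selection; the value used here is arbitrary and would have to be tuned for a real corpus.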


Indeed, the last opportunity for further research concerns the topic modelling method, and particularly its uses for the digital humanities. As is apparent in the results and conclusion chapters, the most interesting information for answering the research question comes from the discourse analysis. As several other researchers have also observed, a superficial, computer-generated description of statistical patterns within a corpus holds very little value without human interpretation of the actual texts that make up the corpus. This is especially so when there is a risk that the texts are not one hundred percent as expected, as was the case with the articles in this research, which sometimes included lines of text that weren’t actually part of the article. Furthermore, this research has shown that topic modelling itself is quite an ambiguous method. The slightest change in the parameter settings of a model can produce a vastly different analysis, and little is known about what the ideal settings are for a model given a certain corpus. The goal of any analysis is to approach reality as closely as possible; to do this via topic modelling one needs to pay a lot of attention to the small details of the model. Many different models will need to be trained and tested in light of the specific purpose of the research, and the model that appears to return the strongest, most coherent topics might in fact miss some important themes in the corpus. Thus, more research into how to find the optimal model is definitely necessary. Moreover, any application of topic modelling for the humanities or social sciences needs to occur with full knowledge of the limitations of topic modelling and the consequences of different parameter settings. At any rate, using topic modelling for selecting a research corpus certainly seems like a more fruitful use of this technique than using it to interpret a corpus.
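One way to make the sensitivity to parameter settings tangible is to compare the topics of two models trained on the same corpus with different settings. The sketch below was not part of the analysis in this thesis. It assumes that each model is available as a list of topics and each topic as a list of its top words, and it uses the Jaccard overlap between word sets as a simple, assumed measure of similarity; the function names and the toy word lists are hypothetical.

```python
def jaccard(words_a, words_b):
    """Share of words two topics have in common (0 = none, 1 = identical)."""
    a, b = set(words_a), set(words_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_matches(model_a, model_b):
    """For each topic in model_a, the highest overlap with any topic in model_b."""
    return [
        max((jaccard(topic_a, topic_b) for topic_b in model_b), default=0.0)
        for topic_a in model_a
    ]


# Toy topics, invented for illustration only.
model_a = [["russia", "putin", "moscow", "nato"], ["coffee", "cup", "tea", "salad"]]
model_b = [["russia", "ukraine", "putin", "soviet"], ["stadium", "fan", "soccer", "yoga"]]

scores = best_matches(model_a, model_b)
stability = sum(scores) / len(scores)  # mean best overlap across the topics of model_a
print(scores, stability)
```

A low mean overlap suggests that the two settings carve up the corpus differently. It does not say which of the two models is closer to the texts, which still requires reading them.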


References

Agar, Jon. "What difference did computers make?" Social Studies of Science 36, no. 6 (2006): 869-907.
Agamben, Giorgio. "For a theory of destituent power." Public lecture in Athens, invitation and organization by Nicos Poulantzas Institute and SYRIZA Youth, November 16, 2013. http://www.chronosmag.eu/index.php/g-agamben-for-a-theory-of-destituent-power.html
Amoore, Louise. "Algorithmic war: Everyday geographies of the war on terror." Antipode 41, no. 1 (2009): 49-69.
Amoore, Louise, and Marieke De Goede. "Transactions after 9/11: the banal face of the preemptive strike." Transactions of the Institute of British Geographers 33, no. 2 (2008): 173-185.
Anderson, Ben. "Preemption, precaution, preparedness: Anticipatory action and future geographies." Progress in Human Geography 34, no. 6 (2010): 777-798.
Barnard-Wills, David. "UK news media discourses of surveillance." The Sociological Quarterly 52, no. 4 (2011): 548-567.
Bauman, Zygmunt, Didier Bigo, Paulo Esteves, Elspeth Guild, Vivienne Jabri, David Lyon, and R. B. J. Walker. "After Snowden: Rethinking the impact of surveillance." International Political Sociology 8, no. 2 (2014): 121-144.
Beer, David de. "How should we do the history of big data?" Big Data & Society 3, no. 1 (2016): 1-10.
Blei, David M. "Surveying a suite of algorithms that offer a solution to managing large document archives." Communications of the ACM 55, no. 4 (2012): 77-84.
Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent dirichlet allocation." Journal of Machine Learning Research 3 (2003): 993-1022.
Bolter, Jay David, and Richard Grusin. "Remediation." Configurations 4, no. 3 (1996): 311-358.
Bovens, Mark. "De verspreiding van de democratie." B en M 32 (2005): 119.
Branum, Jens, and Jonathan Charteris-Black. "The Edward Snowden affair: A corpus study of the British press." Discourse & Communication 9, no. 2 (2015): 199-220.
Brezina, Vaclav, Tony McEnery, and Stephen Wattam. "Collocations in context: A new perspective on collocation networks." International Journal of Corpus Linguistics 20, no. 2 (2015): 139-173.
Campbell, John Edward, and Matt Carlson. "Panopticon.com: Online surveillance and the commodification of privacy." Journal of Broadcasting & Electronic Media 46, no. 4 (2002): 586-606.
Castells, Manuel. The rise of the network society. John Wiley & Sons, 2011.


Chang, Jonathan, Sean Gerrish, Chong Wang, Jordan L. Boyd-Graber, and David M. Blei. "Reading tea leaves: How humans interpret topic models." In: Advances in neural information processing systems, 288-296. 2009.
Chen, Feng, Pan Deng, Jiafu Wan, Daqiang Zhang, Athanasios V. Vasilakos, and Xiaohui Rong. "Data mining for the internet of things: literature review and challenges." International Journal of Distributed Sensor Networks (2015).
Citron, Danielle Keats, and Frank A. Pasquale. "The scored society: due process for automated predictions." Washington Law Review 89 (2014).
Clarke, Roger. "A Framework for Surveillance Analysis." Last modified February 16, 2012. http://www.rogerclarke.com/DV/FSA.html
Davies, Mark. "NOW Corpus (News on the Web)." http://corpus.byu.edu/now/
Deleuze, Gilles. "Postscript on the Societies of Control." October 59 (1992): 3-7.
Finn, Rachel, and Michael McCahill. "Representing the surveilled: media representation and political discourse in three UK newspapers." In: Political Studies Association Conference Proceedings, 121-33. 2010.
Fiss, Peer C., and Paul M. Hirsch. "The discourse of globalization: Framing and sensemaking of an emerging concept." American Sociological Review 70, no. 1 (2005): 29-52.
Foucault, Michel. "Governmentality." In: The Foucault effect: Studies in governmentality. University of Chicago Press, 1991.
Grew, Raymond. "The nineteenth century European state." In: Statemaking and social movements, edited by Charles Bright and Susan Harding, 83-113. University of Michigan Press, 1984.
Hamilton, William L., Jure Leskovec, and Dan Jurafsky. "Diachronic word embeddings reveal statistical laws of semantic change." arXiv preprint arXiv:1605.09096 (2016).
Hildebrandt, Mireille. "Defining profiling: a new type of knowledge?" In: Profiling the European Citizen: 17 Cross-Disciplinary Perspectives, 17-45. Springer Netherlands, 2008.
Hoffman, Andrew J. "Talking past each other? Cultural framing of skeptical and convinced logics in the climate change debate." Organization & Environment 24, no. 1 (2011): 3-33.
Hull, Gordon, Heather Richter Lipford, and Celine Latulipe. "Contextual gaps: privacy issues on Facebook." Ethics and Information Technology 13, no. 4 (2011): 289-302.
Jaworska, Sylvia, and Anupam Nanda. "Doing Well by Talking Good: A Topic Modelling-Assisted Discourse Study of Corporate Social Responsibility." Applied Linguistics (2016): 1-28.
Jonas, Jeff, and Jim Harper. "Effective counterterrorism and the limited role of predictive data mining." Policy Analysis, no. 584 (2006).
Jørgensen, Marianne W., and Louise J. Phillips. Discourse analysis as theory and method. Sage, 2002.
Keller, Reiner. "The sociology of knowledge approach to discourse (SKAD)." Human Studies 34, no. 1 (2011): 43.


Kerr, Ian, and Jessica Earle. "Prediction, preemption, presumption: How Big Data threatens big picture privacy." Stanford Law Review Online 66 (2013).
Koteyko, Nelya, Rusi Jaspal, and Brigitte Nerlich. "Climate change and 'climategate' in online reader comments: a mixed methods study." The Geographical Journal 179, no. 1 (2013): 74-86.
Kutter, Amelie, and Cathleen Kantner. "Corpus-based content analysis: A method for investigating news coverage on war and intervention." International Relations Online Working Paper 1 (2012).
Leese, Matthias. "The new profiling: Algorithms, black boxes, and the failure of anti-discriminatory safeguards in the European Union." Security Dialogue 45, no. 5 (2014): 494-511.
Lewis, Seth C., Rodrigo Zamith, and Alfred Hermida. "Content analysis in an era of big data: A hybrid approach to computational and manual methods." Journal of Broadcasting & Electronic Media 57, no. 1 (2013): 34-52.
Lischka, Juliane A. "Surveillance discourse in UK broadcasting since the Snowden revelations." Digital Citizenship and Surveillance Society Media Stream (2015).
Livermore, Michael A., Allen Riddell, and Daniel Rockmore. "Agenda Formation and the US Supreme Court: A Topic Model Approach." Arizona Law Review, Forthcoming (2017). http://neukom.dartmouth.edu/docs/16_agenda_formation_livermore_riddell_rockmore.pdf
Lyon, David. "Surveillance, Snowden, and big data: Capacities, consequences, critique." Big Data & Society 1, no. 2 (2014).
Maskeri, Girish, Santonu Sarkar, and Kenneth Heafield. "Mining business topics in source code using latent dirichlet allocation." In: Proceedings of the 1st India software engineering conference, 113-120. ACM, 2008.
Massumi, Brian. "Fear (the spectrum said)." Positions: East Asia Cultures Critique 13, no. 1 (2005): 31-48.
Michael, Katina, and Roger Clarke. "Location and tracking of mobile devices: Überveillance stalks the streets." Computer Law & Security Review 29, no. 3 (2013): 216-228.
Millar, Jason. "Core privacy: a problem for predictive data mining." Lessons from the identity trail: Anonymity, privacy and identity in a networked society (2009): 103-119.
Mimno, David. "Computational historiography: Data mining in a century of classics journals." Journal on Computing and Cultural Heritage (JOCCH) 5, no. 1 (2012): 3-19.
Mimno, David, Hanna M. Wallach, Edmund Talley, Miriam Leenders, and Andrew McCallum. "Optimizing semantic coherence in topic models." In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2011.
Möllers, Norma, and Jens Hälterlein. "Privacy issues in public discourse: the case of 'smart' CCTV in Germany." Innovation: The European Journal of Social Science Research 26, no. 1-2 (2013): 57-70.


Mols, Anouk, and Susanne Janssen. "Not Interesting Enough to be Followed by the NSA: An analysis of Dutch privacy attitudes." Digital Journalism (2016): 1-22.
Newman, David, Chaitanya Chemudugunta, Padhraic Smyth, and Mark Steyvers. "Analyzing entities and topics in news articles using statistical topic models." In: ISI, 93-104. 2006.
Nissenbaum, Helen. "A contextual approach to privacy online." Daedalus 140, no. 4 (2011): 32-48.
Pacey, Arnold. "Technology: Practice and Culture." In: The culture of technology. MIT Press, 1983.
Riddell, Allen. "How to read 22,198 journal articles: Studying the history of German studies with topic models." In: Distant Readings: Topologies of German culture in the long nineteenth century, 91-114. 2014.
Rosen, Christoffer, and Emad Shihab. "What are mobile developers asking about? A large scale study using stack overflow." Empirical Software Engineering 21, no. 3 (2016): 1192-1223.
Savov, Pavel, Adam Jatowt, and Radoslaw Nielek. "Towards Understanding the Evolution of the WWW Conference." In: Proceedings of the 26th International Conference on World Wide Web Companion. International World Wide Web Conferences Steering Committee, 2017.
Sismondo, Sergio. "The Social Construction of Scientific and Technical Realities." In: An introduction to science and technology studies, 57-71. Chichester: Wiley-Blackwell, 2010.
Smith, H. Jeff, Tamara Dinev, and Heng Xu. "Information privacy research: an interdisciplinary review." MIS Quarterly 35, no. 4 (2011): 989-1016.
Stewart, Mark G., and John Mueller. "Cost-benefit analysis of airport security: Are airports too safe?" Journal of Air Transport Management 35 (2014): 19-28.
Tang, Jian, Zhaoshi Meng, XuanLong Nguyen, Qiaozhu Mei, and Ming Zhang. "Understanding the Limiting Factors of Topic Modeling via Posterior Contraction Analysis." ICML, 2014.
Tavani, Herman T. "Informational privacy, data mining, and the internet." Ethics and Information Technology 1, no. 2 (1999): 137-145.
"Topic Modeling with Python." YouTube video, 0:00-50:13. Posted by 'PyTexas', October 15, 2015. https://www.youtube.com/watch?v=BuMu-bdoVrU
Turney, Peter D., and Patrick Pantel. "From frequency to meaning: Vector space models of semantics." Journal of Artificial Intelligence Research 37 (2010): 141-188.
Van den Bos, Maarten, and Hermione Giffard. "Mining Public Discourse for Emerging Dutch Nationalism." Digital Humanities Quarterly 10, no. 3 (2016).
Van der Eijk, Cees. De kern van de politiek. Het Spinhuis, 2001.
Van Dijck, José. "Datafication, dataism and dataveillance: Big data between scientific paradigm and ideology." Surveillance & Society 12, no. 2 (2014): 197-208.
Van Elden, Chantal. "Hoe de EU vluchtdata wil gebruiken om terrorisme te bestrijden." VICE Motherboard, April 5, 2016.
Van Wel, Lita, and Lambèr Royakkers. "Ethical issues in web data mining." Ethics and Information Technology 6, no. 2 (2004): 129-140.


Vries, Imar de. Tantalisingly close: An archaeology of communication desires in discourses of mobile wireless media. Amsterdam University Press, 2012.
Wallach, Hanna M., Iain Murray, Ruslan Salakhutdinov, and David Mimno. "Evaluation methods for topic models." In: Proceedings of the 26th annual international conference on machine learning, 1105-1112. ACM, 2009.
Wallach, Hanna M., David Mimno, and Andrew McCallum. "Rethinking LDA: Why priors matter." In: Advances in neural information processing systems, 1973-1981. 2009.
Quasthoff, Uwe, Dirk Goldhahn, and Thomas Eckart. "Building large resources for text mining: The Leipzig Corpora Collection." In: Text Mining, 3-24. Springer International Publishing, 2014.
Zhao, Weizhong, James J. Chen, Roger Perkins, Zhichao Liu, Weigong Ge, Yijun Ding, and Wen Zou. "A heuristic approach to determine an appropriate number of topics in topic modeling." BMC Bioinformatics 16, no. 13 (2015).


Appendix A: Keywords and their collocates

Parameter settings of the collocation analysis: statistic MI; statistical cut-off value 5; span 7L-7R; minimum collocate frequency 3; filter none.

Actor/NE: Top ten collocates
the US House of Representatives: house, us, u, its, from, of, and, s, with, on
Office of Procurement and Contracts: bureau, prosecutor, ethics, box, management, congressional, declined, cards, website, affairs
the UN Security Council: resolutions, charter, council, commission, resolution, human, security, office, rights, korea
the U.S. Army Medical Department: barracks, lebanese, arming, patterson, insurgent, commanding, vietnamese, liberation, liberian, smallest
Moscow: misha, maslennikov, schearf, airstrikes, expanded, daniel, 1980, punk, negotiate, cathedral
Congress: sequester, sequestration, requests, approved, cuts, imf, round, request, budget, plans
Pentagon: females, babb, carla, correspondent, briefing, papers, opening, positions, col, lt
Wareham District Court: supreme, appeals, fisa, scalia, ruling, filed, stated, justice, criminal, constitutional
The Department of Commerce: transferring, dayton, treasury, commerce, justice, homeland, links, external, defense, fourth
the FBI: counter-terrorism, rogers, agent, director, training, office, department, house, committee, were
Dxcel: -
Oculus Rift: rift, virtual-reality, headset, vr, thrones, virtual, developers, reality, founder, version
McDonald's: -
Investors.com: accredited, crowdfunder, mosaic, entrepreneurs, risk, fund, debt, interest, investment, sell
LinkedIn: pulse, gmail, teixeira, yahoo, groups, google+, profile, invite, connections, profiles
WorldCom Inc.: -
Kickstarter: funded, campaign, raised, successfully, lumo, launch, backers, indiegogo, target, funding
Lehman Brothers Holdings Inc.: inc, as, s, in, @, the, of
American Airlines: alaska, istock, malaysia, divestiture, florida, mh370, northwest, slots, southwest, american
Flexera Software: stack, engineer, hardware, modern, computer, open, works, using, google, company
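To illustrate what the parameter settings of the collocation analysis in this appendix mean in practice, the following sketch ranks the collocates of a node word by mutual information (MI). It is not the tool that produced the lists above. It assumes one common formulation of MI, log2(observed / expected) without a window-size correction, and it assumes that the minimum collocate frequency applies to the number of co-occurrences within the span. The function name and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def mi_collocates(tokens, node, span=7, min_freq=3, cutoff=5.0, top_n=10):
    """Rank collocates of `node` by mutual information.

    span=7 mirrors the 7L-7R window, min_freq=3 the minimum collocate
    frequency and cutoff=5.0 the statistical cut-off value; how the original
    tool applies these settings is an assumption.
    """
    n = len(tokens)
    freq = Counter(tokens)
    cooccurrences = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        # Up to `span` tokens to the left and to the right of the node.
        window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
        cooccurrences.update(window)

    results = []
    for word, observed in cooccurrences.items():
        if word == node or observed < min_freq:
            continue
        expected = freq[node] * freq[word] / n
        score = math.log2(observed / expected)
        if score >= cutoff:
            results.append((word, score))
    # Highest MI first; ties are broken alphabetically.
    return sorted(results, key=lambda pair: (-pair[1], pair[0]))[:top_n]


# Toy corpus, invented for illustration: "agent" always follows "fbi",
# while the filler word "x" is frequent everywhere.
tokens = (["fbi", "agent"] + ["x"] * 100) * 3
collocates = mi_collocates(tokens, "fbi")
print(collocates)
```

Because MI rewards words that rarely occur outside the neighbourhood of the node, rare words tend to rank high, which may explain why personal names and other infrequent tokens appear in the lists above.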


Appendix B: Full keyword lists

Government keywords Commercial keywords NER NER Logic Technology the Supreme Court the Center for Law, Ethics and National Security Facebook Development MIPTV Congress U.S. Department of Environmental Quality Google Green Smoke Inc. Adobe the House Rules Committee Bureau of Consumer Protection Adobe Lenovo Transworld Data the Civil Rights Act Parliament Target Banif Plus Bank the Outlet Connection the U.S. Department of Justice the Public Trust Act Global Payments BroApp Dxcel the U.S. Helsinki Commission the American Center for Law and Justice Gallup Loveland Technologies ComiXology the California Department of Public Health the Foreign Intelligence Surveillance Act Kipling Lush Cosmetics Purina ONE the U.S. Census Bureau Elections Oversight Committee Entrevo Asia Pegasus Foods Condition One FBI the Regional Administrative Court Salesforce.com Beam Inc. Ozobot Current Population Survey OIPC Goldman Sachs Taco Bell Vex National Health Interview Survey the Federal Court of Appeals NTT Cinnabon CareWorks Berlin the Supreme Court of Michigan United Properties CWT Interactive the Delaware Department of Natural Resources the Tea Party and Environmental Control Social Media Kickstarter FIFA Gamesa Technology NYPD The Spanish Court LinkedIn Corporation Lights of America Inc. London the U.S. Departments of Education Starbucks Motorola ACE Hardware the Russian Army The National Center for Youth Law McDonald's CarePredict LG the Reserve Bank of M16 the Licensing Committee KCET India BMIT the Foreign Office Office of Attorney Regulation Counsel Tesla Laboratory Inc. Mandalah BIGresearch Lagunitas Brewing Washington the Washington State Bar Association BOC AkzoNobel Company Central Intelligence Agency the National Organization of Bar Counsel Intel Altrec Cold Spring Brewing Co. 
CIA The National Association of Parliamentarians AMD Badassdom.com Alynylam Pharmaceuticals Global Markets the Federal Bureau of Investigation the Ministry of Art and Culture Crytek IndieVest Pictures Intelligence Tokyo the US House of Representatives ATI Hibu Moscow the British Intelligence Service Sony Yahoo Illinois Tool Works Inc. the Ministry of Industry and Information State and Treasury Departments Technology Microsoft the Beats Music app Fiat NATO the California Department of Consumer Affairs Nintendo Ameritech Oculus Rift the Royal Australian Navy Office of Environmental Equity HBO SuperMedia Big Switch Networks the National Australia Day Council the Ministry of Information and Security Toyota AnchorBank EatingWell the Parliamentary Intelligence and Security the Department of Education Committee Ford RDA Holding IDFC Fed Brussels Cadillac Idearc Media Amazon Navy Hague Time Warner Dex One Corporation ExpressJet Air Force Federal Register Lego New Media Investment GroupAmerican Inc. Airlines NSA Senate PayPal PMGI Holdings Inc. SASC United Nations The Minnesota Court of Appeals' WFA Maxcom King Digital Entertainment the Intergovernmental Authority The Boone County Sheriff 's Department Billboard.com OnTrac Esquire.com the UN Security Council the Fort Worth Police Department Disney Flexera Software Pinterest the U.S. army the National Research Council Warner Bros Sony Pictures Instragram White House the Christian Democrats Walmart FLEX Snapchat the Iraqi Army the U.S. Department of Veterans Affairs Coca Cola Investors.com Samsung the Washington Department of Natural Qabila Media Tel Aviv Resources Productions FinTech Collective New Dehli Washington State Patrol TechTarget Inventec Philips E.U. OLE USFWS Office of Law Enforcement IMDB Aezon The Emotiv Insight China Development EU the U.S. 
Army Medical Department Forrester Bank Corp McAfee NASA The Department of Commerce Gmail Boston Dynamics BorrowLenses.com The Seattle Police Department Federal Election Commission the Bank of Japan General Atomics Burger King Burning Glass Supreme Court Federal Rule of Civil Procedure the Bank of England Technologies RoboForm the State Department the Vermont Drug Task Force Fox BioTrust Nutrition Nexia Lehman Brothers Paris Netflix Holdings Inc. Nike Medicare Collocation Wikimedia Aviva WorldCom Inc. Medicaid Services congressional CreditSesame.com BigCorp Engadget.com Office of Procurement and Contracts prosecutor FICO Hewlett - Packard Co Intrepid Potash Inc the Defense Department constitutional JPMorgan the Lewis Trust Group Metropolitan Police Department homeland Bank of America Corp Gordon Brothers Group Collocation The Department of Homeland Security justice Citigroup Inc Melville Corporation entrepreneurs CVS Caremark Immigration and Customs Enforcement counter-terrorism General Motors Corporation stock Department of Homeland Security/Immigration law WaMu Kellwood Company company The Defense Advanced Research Projects Agency Honda Pizza Hut entrepreneural Pentagon Other Mercedes - Benz Globe Telecom the National Security Council government Tech World Office American Coverage Corp.Other National Security Affairs federal Vida Fitness World Wide Worx business the U.S. District Court public administration CrunchBase CinemaSins commercial Precision Dynamics the National Institute of Justice legislation Farm Fresh Choice Corporation commerce the Orlando Police Department legislator eBay Property Solutions advertising the Florida Department of Law regulation Accenture Dynex Capital advertisement Task Force Mobimar Dropbox marketing the Supreme Board of Judges Cold Spark Media Airbnb product Wareham District Court BIOX Stripe the House of Representatives Verizon RAND Corp. 52

Appendix C: Topic model comparison

The topics are colour coded: green topics are the most coherent and worth two points; red topics are the least coherent and worth no points.
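The point system described above amounts to a simple sum over the colour labels of the inspected topics, which the following sketch illustrates. Only the values for green (two points) and red (no points) come from the description; the intermediate label "yellow" and its value of one point, the function name and the example labels are assumptions made for the purpose of illustration.

```python
# Points per colour label. Green and red follow the description above;
# "yellow" and its single point are an assumption.
POINTS = {"green": 2, "yellow": 1, "red": 0}


def performance_score(colours):
    """Sum the points of the colour labels assigned to a model's topics."""
    return sum(POINTS[colour] for colour in colours)


# Hypothetical labels for five inspected topics of one model.
labels = ["green", "green", "yellow", "red", "green"]
score = performance_score(labels)
print(score)
```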

T-value 200 300 400 150 250 eta-value symmetric 0,01 symmetric 0,01 symmetric 0,01 symmetric 0,01 symmetric 0,01 performancescore 9 score 7 score 6 score 6 score 9 score 5 score 9 score 10 score 9 score 6 efficacy 0 topics 0 topics 0 topics 1 topic 1 topic 2 topics 0 topics 1 topic 2 topics 1 topic topics [(4, [(129, [(178, [(105, [(172, [(239, [(99, [(55, [(11, [(221, [('google', 0.12435241451332284), [('facebook', 0.22195944023901065), [('hoffman', 0.10422023362483269), [('coffee', 0.22095524743671258), [('microsoft', 0.20278333183067271), [('admission', 0.15507663936024824), [('egypt', 0.074611973430694117), [('chesapeake', 0.038934492518370244), [('stadium', 0.16027897608225361), [('dodge', 0.078527949623460908), ('search', 0.090881131847963939), ('coffee', 0.08965449138577003), ('julian', 0.06684203554858624), ('cup', 0.074028594413692458), ('layoff', 0.071270363975023399), ('pearl', 0.098139556443602241), ('egyptian', 0.062506786786578492), ('dayton', 0.035600757773853117), ('fan', 0.078997471841425387), ('amnesty', 0.067005848171744875), ('alaska', 0.054352599829528087), ('freeman', 0.040955149982645558), ('gerry', 0.059626349052523488), ('fork', 0.064152077157503942), ('troll', 0.069880437749372595), ('carpet', 0.060070749773543458), ('klein', 0.033001699564826785), ('allergy', 0.028464247825152666), ('soccer', 0.072464324709685585), ('conway', 0.047295390090112362), ('afghanistan', 0.039637677492179592), ('cup', 0.036412469816599839), ('brendan', 0.046941152650738997), ('chapel', 0.032638729790408959), ('edt', 0.050479451241430988), ('vogue', 0.047600059852692708), ('sherman', 0.025905058128903643), ('condom', 0.028147618459094643), ('bruce', 0.066770157066728827), ('9th', 0.045917967116870281), ('taliban', 0.039313983255410717), ('tea', 0.032606427408734956), ('52', 0.035948103165008453), ('recycling', 0.02629971837532934), ('dawson', 0.043227530698650041), ('midway', 0.047484302078945891), ('tomb', 0.020262233492451667), ('cadet', 0.027697058023443862), 
('yoga', 0.058190540326605637), ('deferred', 0.043902171090760563), ('libertarian', 0.038909820086812506), ('landfill', 0.018102649040980547), ('stan', 0.034910580991840454), ('connor', 0.025515826794107712), ('wikimedia', 0.039818689443607121), ('gigi', 0.028205285710014672), ('neutrality', 0.017871523399250584), ('kay', 0.027253228590325401), ('dh', 0.027561480268202439), ('wheelchair', 0.035153762633948524), ('afghan', 0.035791087733765256), ('social', 0.016671146648126602), ('hospitality', 0.032781778413517378), ('abbie', 0.025083480355517791), ('router', 0.036078502485667456), ('herndon', 0.028103022677160201), ('wheeler', 0.017578017787995996), ('peanut', 0.025044958142036937), ('leslie', 0.025766355272861995), ('saunders', 0.026029339818817363), ('yahoo', 0.019748156939163652), ('zuckerberg', 0.016443167029201001), ('nab', 0.027456572749450906), ('recycle', 0.023222253189553618), ('absurdity', 0.029280034688134005), ('salad', 0.028031795252756651), ('ministry', 0.015144073556544561), ('pantry', 0.023105286601624062), ('orthodox', 0.024141719762943419), ('starve', 0.022603614028481198), ('gordon', 0.017228101717125958), ('post', 0.015862531911769874), ('two-week', 0.025972146791772708), ('lakewood', 0.018955300661319726), ('civility', 0.026973665519507351), ('axe', 0.026798150402885124), ('radar', 0.014222307061290546), ('wesley', 0.021639189267795331), ('indoor', 0.022061779512040924), ('paying', 0.021757003420014659), ('oracle', 0.015791319521989314)]), ('zimmerman', 0.015589197625922211)]), ('humphrey', 0.025400681995326066)]), ('forrest', 0.016521757802188106)]), ('nokia', 0.023341544432684921)]), ('grown-up', 0.021523852148867262)]), ('scan', 0.014219955255424681)]), ('needy', 0.019226192136888053)]), ('opposing', 0.021317455924260542)]), ('covenant', 0.019667761920402569)]), (175, (147, (146, (275, (296, (65, (103, (94, (54, (150, [('russian', 0.08134578267776181), [('hiv', 0.042823420392487446), [('photograph', 0.16152728903490968), [('helen', 
0.078464534239625988), [('chamber', 0.19904315400115544), [('incarcerate', 0.066508265251818008), [('norfolk', 0.02442910759494285), [('auction', 0.061792916590068699), [('test', 0.32860230770410065), [('abortion', 0.15308117097452648), ('russia', 0.079893666805368538), ('yoga', 0.033332399493095098), ('0', 0.13863654213208673), ('pablo', 0.063206609142052783), ('commerce', 0.16766929339405851), ('tan', 0.054452342740098747), ('costa', 0.023952752190147492), ('news', 0.037280594147011924), ('testing', 0.073267925989505642), ('turkey', 0.079513136997176562), ('ukraine', 0.033565487550023219), ('aids', 0.024285059028341861), ('uc', 0.09917834219337042), ('peru', 0.050169022574163315), ('entrepreneur', 0.049920285323359764), ('pdf', 0.051574204944059059), ('ashley', 0.022833230103517916), ('garcia', 0.036645917206329356), ('patent', 0.055099090836363683), ('cia', 0.044661111578035304), ('putin', 0.032397964407442879), ('volunteer', 0.022233123369455852), ('berkeley', 0.07831531621479676), ('sc', 0.046913993026760532), ('telecom', 0.0419235021152677), ('samaritan', 0.047793521756489273), ('torre', 0.022431427590669101), ('reuters', 0.034479556468224606), ('vaccine', 0.022909209681165714), ('wine', 0.042466084299857029), ('moscow', 0.020006299321600206), ('adam', 0.018402508195783433), ('teller', 0.022953411065711887), ('podcast', 0.043744944173906355), ('parlor', 0.038190466951765457), ('re-entry', 0.033438995040166529), ('ancient', 0.021129713521875069), ('thomson', 0.020915845228277594), ('gross', 0.022707655519980195), ('turkish', 0.04147975539817917), ('nato', 0.018260648698495707), ('monkey', 0.016790747560737361), ('photographic', 0.022392150240262386), ('ruth', 0.03141052338270596), ('entrepreneurship', 0.036359691473760808), ('wolfe', 0.032851871984037474), ('detainee', 0.019687447232754639), ('gates', 0.020586348195972146), ('result', 0.02267848468388697), ('clinic', 0.039339195508168737), ('soviet', 0.01547537153062393), ('singh', 0.016535863773810989), 
('yosemite', 0.020405927155902781), ('alberto', 0.02624624863525446), ('101', 0.035918805091183219), ('orphan', 0.032787719871218056), ('heritage', 0.01754778690404411), ('bid', 0.019655358432553668), ('holmes', 0.019151242007297915), ('erdogan', 0.033608914954213046), ('vladimir', 0.011277543866385178), ('hepatitis', 0.016436683359040638), ('99', 0.019176743002277879), ('extradition', 0.022477421407201107), ('sap', 0.030929515548014569), ('deutsche', 0.032193421105163114), ('honduras', 0.016972544157018075), ('gerry', 0.016925351902218895), ('claim', 0.016126314014145236), ('planned', 0.031928331355800775), ('president', 0.011240318436563691), ('stephen', 0.0147992267245226), ('cracker', 0.019102847890036762), ('fujimori', 0.021871709896344266), ('keynote', 0.02902394574936867), ('1923', 0.028410576142459506), ('chapman', 0.014372458359847403), ('richards', 0.015257169919952939), ('autism', 0.015055713274341131), ('parenthood', 0.024952816349719926), ('crimea', 0.010957547719880957)]), ('c', 0.013143063669038987)]), ('nova', 0.017777084326291107)]), ('aguirre', 0.018772680216868946)]), ('bezos', 0.02605717818046575)]), ('brittany', 0.028005895209750362)]), ('berta', 0.014237688045777202)]), ('periscope', 0.01514107060175378)]), ('troll', 0.014101094587702584)]), ('istanbul', 0.016989103735029654)]), (109, (27, (132, (22, (378, (250, (138, (84, (26, (218, [('west', 0.099345825944443009), [('flag', 0.0702711768628769), [('butler', 0.11846868381574543), [('confirmation', 0.076525490915821878), [('video', 0.023364254553414711), [('hospice', 0.088864023850217369), [('saudi', 0.056452330795086887), [('eagle', 0.053791914465335003), [('parish', 0.087514922340600876), [('council', 0.19796308761907774), ('sale', 0.058949146758808017), ('milk', 0.058733507123952595), ('mo', 0.074035632549959818), ('joyce', 0.038111641587406361), ('voa', 0.019414697479408632), ('deaf', 0.073011081997751817), ('jewish', 0.044677819066872952), ('golf', 0.051623326847672839), ('fort', 
0.08706121754629316), ('anderson', 0.12042444299515928), ('block', 0.057059252233681461), ('confederate', 0.043185293212194886), ('bloom', 0.058903116721505298), ('marilyn', 0.036925722463483607), ("'", 0.015833747573999368), ('aids', 0.069961569894480588), ('jew', 0.043717754951628295), ('nevada', 0.04866755829935409), ('parker', 0.065396040670650302), ('knight', 0.048697659184345633), ('libya', 0.046791474256022314), ('for-profit', 0.041435333923109947), ('designated', 0.052618742500911503), ('boyd', 0.035140507235408132), ('report', 0.012606473459989226), ('confinement', 0.0598832417379365), ('israel', 0.034740730199703217), ('watson', 0.030459004148129836), ('tape', 0.047645710611097361), ('trustee', 0.033189360487168217), ('benghazi', 0.028808698674239391), ('powder', 0.025546266201481257), ('bluff', 0.051639008531477133), ('ape', 0.033130958528048614), ('miner', 0.011049407327330009), ('solitary', 0.058749187240225854), ('carter', 0.030392686482758842), ('miles', 0.024571550928279978), ('baton', 0.040481311556615676), ('cedar', 0.022790427588329162), ('bid', 0.024632283535103938), ('exemption', 0.019082824410007859), ('shepherd', 0.050603938189997422), ('haley', 0.033069768793891639), ('2015', 0.010674285576025233), ('manning', 0.035900682539594019), ('arabia', 0.029687265092541413), ('matthews', 0.015101702690741278), ('rouge', 0.035949459101314897), ('klan', 0.018755056358179787), ('transfer', 0.024593135967194778), ('bo', 0.018360769345259841), ('deficiency', 0.049847918078172092), ('jaw', 0.030765958378314133), ('myanmar', 0.0098550619421135556), ('dying', 0.034006817066937145), ('yemen', 0.021392640689893907), ('grayson', 0.014632257309911028), ('gazette', 0.035389027177763224), ('member', 0.018100444432372839), ('avenue', 0.020141092291359346), ('howell', 0.017410368332574398), ('salvation', 0.038948068030622965), ('manual', 0.030008839237621283), ('china', 0.0096650309526690598), ('hearing', 0.026486434592180762), ('evans', 0.013112584110310896), 
('pga', 0.01459616470150075), ('gawker', 0.035174388657163665), ('herb', 0.016330225795738201), ('libyan', 0.01723955540826325), ('fingerprint', 0.017178034309436208), ('gupta', 0.032745104886589507), ('doyle', 0.028096349453271077), ('daniel', 0.0091024213835241372), ('woe', 0.024262356832823847), ('anti-semitism', 0.011048711188418395), ('bundy', 0.014487666432143437), ('springs', 0.025514958252509559), ('ku', 0.016290613203428773), ('bankruptcy', 0.01395248077832623)]), ('epstein', 0.016920266619173292)]), ('listed', 0.028267780821764142)]), ('mclaughlin', 0.023414851455606116)]), ('military', 0.0083692062089974345)]), ('distinctive', 0.020795075636872808)]), ('arab', 0.0075817314354402745)]), ('cattle', 0.012951471088312655)]), ('peter', 0.025409585739420013)]), ('peggy', 0.015378364751569704)]), (163, (1, (109, (14, (24, (336, (50, (138, (176, (85, [('israel', 0.10708357667289442), [('iowa', 0.11854609131947873), [('hansen', 0.0548203034701049), [('fitness', 0.059469038426292913), [('surrender', 0.13173064406692878), [('mineral', 0.10425839653282355), [('cbs', 0.086707356954873527), [('marijuana', 0.089968334662729119), [('flag', 0.11920430498130391), [('baton', 0.048571362245703693), ('israeli', 0.049726230899859253), ('photographer', 0.081886889414035838), ('amanda', 0.053301374036132984), ('pa', 0.055247876386387069), ('drunk', 0.11287796444741718), ('slope', 0.062485369533059845), ('colorado', 0.079519571259312666), ('colorado', 0.052683082005266522), ('guard', 0.1069255257250355), ('rouge', 0.043272262932301464), ('palestinian', 0.047042581916320866), ('nicholas', 0.029923446516165325), ('moss', 0.044134445265795762), ('resistance', 0.045602474272534742), ('stomach', 0.064650530552360072), ('martian', 0.048938080357249177), ('gazette', 0.030589627297598888), ('state', 0.025409157802450413), ('memorial', 0.090340747473110786), ('lexington', 0.03342206489307345), ('jewish', 0.031884901071166649), ('xx_p', 0.029887246754154054), ('jerome', 
0.044126956698809887), ('antibiotic', 0.040684269045961549), ('incarcerate', 0.052795398662828744), ('marks', 0.043092474350225153), ('bangladesh', 0.02463565282436618), ('denver', 0.020892134600953027), ('anthem', 0.033911250089497087), ('kaplan', 0.033238294851840905), ('jew', 0.024049169732648045), ('derby', 0.02644824365742366), ('two-day', 0.030637329306108796), ('median', 0.040249996324611424), ('ankle', 0.052299066944841738), ('shoreline', 0.042305921635901848), ('springs', 0.02294576086106357), ('pot', 0.017439860560384986), ('subway', 0.032453078871651268), ('sandy', 0.030641788911625202), ('gaza', 0.022506124431170355), ('holland', 0.02155723112337209), ('trained', 0.026959892733767022), ('commonwealth', 0.033235748609691813), ('23-year-old', 0.044476843107550901), ('vicinity', 0.031416198952457432), ('redstone', 0.02059773991026562), ('use', 0.017219486006865094), ('day', 0.024638597867005878), ('steam', 0.029428193527013976), ('hamas', 0.02099594362792152), ('judaism', 0.01958214985171703), ('cambodia', 0.026698800004270416), ('vanessa', 0.027241833696700125), ('meme', 0.04330866981657594), ('siberia', 0.026570502508104384), ('jennings', 0.019872381654420963), ('medical', 0.01551171160087619), ('independence', 0.020989308514648492), ('fema', 0.028001652407437457), ('jerusalem', 0.017559433171722118), ('sb', 0.016481827769680681), ('40th', 0.024769931641176086), ('beyonc', 0.025208887827487862), ('delusional', 0.038924894169598073), ('high-resolution', 0.025501197383714352), ('jean', 0.016768473463134731), ('smoke', 0.013921289785443705), ('barbecue', 0.016150347753240042), ('berg', 0.021434272605177114), ('arab', 0.0092507756849459204), ('reese', 0.014998720288531152), ('moose', 0.021282613827801385), ('ho', 0.020923205596667719), ('fond', 0.037661128944686768), ('18,000', 0.023156947850481103), ('hoover', 0.015957306296538229), ('cannabis', 0.013635531940417236), ('sebastian', 0.015740179926775708), ('marks', 0.019324178181134952), ('anti-semitism', 
0.0084424030961166349)]), ('flynn', 0.012637679693394291)]), ('ryder', 0.020030732142598446)]), ('formation', 0.017466727785778235)]), ('psychologically', 0.033036785617519354)]), ('cassette', 0.021491841586246698)]), ('gerry', 0.015871090660294695)]), ('law', 0.012418735175417094)]), ('service', 0.015348181302053109)]), ('colin', 0.01823747031638134)]), (106, (68, (293, (56, (272, (366, (20, (13, (84, (183, [('mcdonald', 0.076517531116571946), [('jewish', 0.097319946916281569), [('prince', 0.16492915668591968), [('vaccine', 0.096569180158456516), [('thompson', 0.18124818403037657), [('kansas', 0.33940134305343106), [('brady', 0.042614757056152107), [('brazil', 0.065613730113403934), [('cohen', 0.090905421986439194), [('bottle', 0.063471479435104824), ('portland', 0.055627289605417025), ('jew', 0.090296543931698753), ('hayes', 0.041156247552149972), ('g', 0.052807540250436616), ('capitol', 0.12782422117564737), ('vietnamese', 0.0351804747449792), ('watson', 0.031353667463647379), ('carter', 0.043198523872346226), ('restroom', 0.083184277282107227), ('rogers', 0.056323525419827776), ('memphis', 0.030223225718941402), ('bee', 0.025951220852089087), ('suv', 0.027070975856400881), ('autism', 0.03326172665639314), ('lock', 0.10588200632524673), ('ark', 0.030533031314589766), ('manchester', 0.031033381283242633), ('brazilian', 0.032978032044629221), ('arthur', 0.041485979016652581), ('execution', 0.050366510246299324), ('burger', 0.020843410599941367), ('holocaust', 0.025313656239651414), ('saint', 0.026150588347135715), ('immunity', 0.031074622395130792), ('13', 0.061813059987132477), ('ascension', 0.02450096088938563), ('rhode', 0.028633198409537649), ('wrestler', 0.025023254681307283), ('fiduciary', 0.029797978518378881), ('pharmacy', 0.046271549770350993), ('vanessa', 0.018960932133150213), ('israel', 0.023252438278850644), ('mcqueen', 0.024879711014483719), ('stein', 0.027769420932246842), ('holmes', 0.048666582266015779), ('kan', 0.024459340952480004), ('harlem', 
0.024773942987399988), ('colombia', 0.024160447795198089), ('stall', 0.023597803102496616), ('packaging', 0.031797808300547813), ('91', 0.016357736413918151), ('anti-semitism', 0.018185499498900668), ('owens', 0.022087150364092624), ('mm', 0.027596526084040802), ('concussion', 0.041950052436903564), ('yee', 0.024125004932618908), ('reynolds', 0.024407906826456775), ('hogan', 0.022823777241965989), ('commend', 0.023496221362151903), ('injection', 0.027829913984975304), ('oregonian', 0.013814340490094399), ('synagogue', 0.013665978928115543), ('impeachment', 0.018540959526932941), ('vaccination', 0.026602697328916), ('1966', 0.03882679878251212), ('mustard', 0.022128497131312944), ('yorkers', 0.018899013763990662), ('gawker', 0.01829256686005425), ('messiah', 0.023253587482732839), ('fairfax', 0.026091627151726745), ('franchisee', 0.013155337557602067), ('anti-semitic', 0.011815086573155262), ('dot', 0.017764292412003183), ('soap', 0.026478498141430988), ('helmet', 0.033395887762564372), ('frankfurt', 0.020035086872998563), ('manning', 0.017529464693250184), ('tape', 0.013603493334543866), ('extradition', 0.022335847308150556), ('boyd', 0.025199422316473714), ('perk', 0.013081223996898934), ('nazi', 0.010565403389246147), ('michel', 0.017496145856491084), ('edt', 0.02369901211711439), ('segregation', 0.030715615342931917), ('brownback', 0.019727097232479823), ('poem', 0.016777826222544642), ('silva', 0.012540240090024801), ('memorandum', 0.01933465077680897), ('ftc', 0.024219379456843475), ('bernard', 0.011893054585476698)]), ('dunn', 0.0083961936033249644)]), ('charlton', 0.016921934181062762)]), ('vaccinate', 0.016034160056281516)]), ('injury', 0.02407612053893907)]), ('overland', 0.01629573002064558)]), ('poetry', 0.015952504340878915)]), ('denton', 0.011132639442302382)]), ('davenport', 0.016853874608424173)]), ('troll', 0.021052336744415659)]), (82, (54, (221, (211, (392, (222, (121, (112, (185, (87, [('our', 0.010613895500569908), [('you', 
0.026118186286740831), [('like', 0.0089594282575742382), [('', 0.021161715173580346), [('investigation', 0.024135363753598416), [('you', 0.0091711520608918698), [('music', 0.015437289153164981), [('i', 0.010768691669815839), [('i', 0.14925262059407868), [('like', 0.009904821758344411), ('these', 0.0084310834772182118), ('can', 0.01379301725858789), ('character', 0.0070612136211160425), ('movie', 0.013262088248499004), ('report', 0.016587741479531671), ('them', 0.0066951212119003047), ("'", 0.01122430849338779), ('like', 0.010026682585173474), ('think', 0.022647390905392514), ('his', 0.0081509072302923813), ('social', 0.0068403707318977743), ('if', 0.01283365954071353), ('story', 0.006751058816808167), ('its', 0.0085216656085288483), ('attorney', 0.014102077777652353), ('into', 0.006653724531836473), ('song', 0.0098529500821291777), ('you', 0.0087986288812784714), ('my', 0.016141693105458793), ('into', 0.006389033040782071), ('most', 0.0063167533394445276), ('get', 0.0097538519047057036), ("'", 0.0065029370310242954), ('story', 0.0080720753409255549), ('office', 0.012895717162983251), ('up', 0.0063530374336120809), ('play', 0.0073806055157666604), ('there', 0.0060216952722434098), ('so', 0.013953301715409795), ('out', 0.0062788383997770796), ('how', 0.006059734040139402), ('there', 0.009215997242159742), ('even', 0.0064267052659405105), ('can', 0.0066118007788184647), ('department', 0.011890249215582783), ('like', 0.0062537412331610488), ('show', 0.0072705978443847394), ('them', 0.0057514526915797195), ('like', 0.013600738832274629), ('look', 0.0062647427755059201), ('such', 0.0060348229229168163), ('up', 0.0084456694521545649), ('show', 0.0060242687801015183), ('even', 0.0062679217850096581), ('federal', 0.011143433349558791), ('can', 0.0059545361693668884), ('band', 0.0070264396627281577), ('into', 0.0056486352970119757), ('me', 0.013060030180964551), ('image', 0.0057937623655806535), ('even', 0.0059441545122846636), ('out', 0.0081321754989486725), ('play', 
0.0056792356388875561), ('you', 0.0062462166007660955), ('statement', 0.0097891925741351079), ('there', 0.0059247379755973349), ('like', 0.006813041655532745), ('time', 0.0054754682932668794), ('go', 0.012400952141020553), ('i', 0.0053605368741321095), ('world', 0.0058687760334752946), ('like', 0.0080303176226747142), ('most', 0.0054246613300585472), ('into', 0.0054477984393395702), ('case', 0.0097227710560077642), ('out', 0.0056541987372772644), ('new', 0.0059026375781753493), ('so', 0.0050754567234868476), ('what', 0.011685972145641632), ('up', 0.0052503964117365104), ('many', 0.0055525974365882328), ('so', 0.0080107332302670631), ('see', 0.0050157189096265804), ('world', 0.0052836751917808341), ('general', 0.0094015390657792158), ('look', 0.0053475138981890777), ('up', 0.0054578634778968179), ('man', 0.0050707767115053025), ('there', 0.011568175165499139), ('its', 0.0052420033061592337), ('way', 0.0050445349118613892)]), ('than', 0.0076670386312119717)]), ('first', 0.0046310758865254409)]), ('time', 0.0052804273344848112)]), ('official', 0.009234325103316678)]), ('most', 0.0053217611689400281)]), ('album', 0.0054338390994167702)]), ('what', 0.0049921571459974933)]), ('really', 0.011085613378384287)]), ('story', 0.0051699605035156549)]), (129, (85, (152, (62, (178, (217, (135, (145, (19, (28, [('word', 0.0063684797682873025), [("'", 0.025954180591699301), [('law', 0.035202741595574877), [('his', 0.088891014545534858), [('like', 0.014265594624598388), [('she', 0.11357573365389861), [('his', 0.01537073617533862), [('restaurant', 0.0084159293431864748), [('i', 0.075523445101294037), [('you', 0.075929385200990671), ('might', 0.0048248236972672038), ('music', 0.014191466689069192), ('right', 0.016330569021706157), ('him', 0.029908676759154627), ('your', 0.0094299729691216319), ('her', 0.088866731500180596), ('american', 0.011317287842004772), ('like', 0.0079715364355927142), ('his', 0.032176949053736553), ('your', 0.031783262929553815), ('like', 
0.0040682547475724279), ('song', 0.0092852731064440985), ('government', 0.010124458682557626), ('i', 0.028512204427837976), ('look', 0.0057191563553206986), ('i', 0.034501700764856591), ('political', 0.010843937648684793), ('go', 0.0076229204974767352), ('my', 0.030079265578729416), ('can', 0.02428762008257249), ("'", 0.0038095749138755937), ('his', 0.0091487678323135535), ('should', 0.0087419464861923698), ("'", 0.015292875546261369), ('just', 0.0055701369346220836), ('my', 0.01322928387401108), ('america', 0.0066006581237264012), ('you', 0.0076205533287351419), ('me', 0.01810637975813648), ('if', 0.016780404040800834), ('seem', 0.0035884078619599987), ('like', 0.0083429316397662899), ('legal', 0.0087158296094866396), ('my', 0.014213841661145114), ('even', 0.0054363383353259204), ('go', 0.010819072396479737), ('would', 0.0061538532430162295), ('out', 0.006707787474313962), ('him', 0.0154285113547204), ('get', 0.012277366133876318), ('any', 0.0035505246705592533), ('i', 0.0083295166662316032), ('rule', 0.0083572005509701848), ('family', 0.010540288138413723), ('experience', 0.0053921144714822594), ('me', 0.0098278487608170258), ('what', 0.0059759820007826061), ('his', 0.0066496121303506539), ('go', 0.013034723413427355), ('what', 0.011964960569087095), ('such', 0.0032942602958887608), ('play', 0.0078880713954922223), ('court', 0.0080091935201102443), ('tell', 0.010348134905247006), ('them', 0.0051231898061065671), ('get', 0.0092492624304744316), ('even', 0.0049044674292624806), ('get', 0.006276424658672332), ('get', 0.011747513264800481), ('so', 0.01091650046305891), ('even', 0.0032656959928012309), ('show', 0.0070890056794375642), ('act', 0.0075354397674701071), ('me', 0.010180874714492458), ('use', 0.0049629264698792617), ("'", 0.0087842349367149523), ('like', 0.004727994506893464), ('up', 0.0061097211984719544), ('up', 0.0091167261716964491), ('how', 0.010314679023283594), ('only', 0.003193211640979624), ('band', 0.0066505703264958119), ('any', 
0.0074659516082847379), ('know', 0.0086437564920269251), ('through', 0.0047720195601795778), ('up', 0.0085586462948529351), ('politics', 0.0047220445224439439), ('what', 0.0057997892789436586), ('like', 0.0083665605194075034), ('people', 0.010045025066527024), ('most', 0.0031558825711887071)]), ('you', 0.0062809680380369196)]), ('decision', 0.0074422121664437591)]), ('go', 0.0084561694667925034)]), ('sound', 0.0046830868624621082)]), ('tell', 0.0084630360940341175)]), ('media', 0.0046286495074992521)]), ('into', 0.0055861054041208722)]), ('you', 0.00826664227384581)]), ('there', 0.0099686200558359618)]),

(64, (193, (274, (180, (52, (341, (91, (126, (38, (175, [('film', 0.018941207816629956), [('i', 0.052350991815489661), [('our', 0.025655292047533158), [('like', 0.016768228865840729), [('our', 0.013232168725000819), [('his', 0.013168323799984815), [('his', 0.097984129733608288), [('you', 0.06221984568786057), [('she', 0.14470872193707224), [('people', 0.01496377866390089), ("'", 0.013963601349050652), ('my', 0.023057141047832127), ('think', 0.016452184786278124), ('what', 0.011019765929688957), ('way', 0.0088019384658090689), ('story', 0.010560094324103488), ('him', 0.018083892864373653), ('your', 0.031524619974963738), ('her', 0.13721759016937943), ('what', 0.013480501854336982), ('movie', 0.013149062044670277), ('me', 0.012135197706547743), ('know', 0.013680759001198061), ('you', 0.010591334840285523), ('like', 0.0086487211580587291), ('show', 0.010075135857766575), ("'", 0.0099846067281568335), ('can', 0.02859725501196975), ('i', 0.013447751456738505), ('you', 0.011189387799665856), ('story', 0.0093586252942547722), ('life', 0.0098633475186439973), ('because', 0.013399783819824417), ('i', 0.0088347243460473231), ('even', 0.008314556402884133), ('character', 0.0098911657986805752), ('man', 0.0086334551972530385), ('if', 0.019981158017488027), ('husband', 0.0067336272522771278), ('so', 0.010384313939912789), ('character', 0.0086931472188331357), ('work', 0.0089699456332342775), ('right', 0.012458078904742825), ('so', 0.0076111144047748798), ('how', 0.0081865263906300955), ('i', 0.0072855828678233562), ('after', 0.0046827410182027195), ('how', 0.0091210767393293569), ('know', 0.0064682072792593892), ('our', 0.0092997581412896085), ('show', 0.0081499846020479921), ('his', 0.0084892807754365293), ('very', 0.0093291668751370876), ('there', 0.0072315980046489207), ('these', 0.0078061917426808149), ('like', 0.0070880363576441632), ('himself', 0.0046484736495575624), ('get', 0.0083331696321700434), ('go', 0.0057225997168515955), ('i', 0.0091388005314821906), ('play', 
0.0079525789673295169), ("'", 0.0083108562772258223), ('them', 0.0091950406126664469), ('just', 0.0062835181942869), ('because', 0.0077082260128591611), ('play', 0.0065369153491525783), ('into', 0.004592184107376579), ('what', 0.008192568336443655), ('life', 0.0055949960884896484), ('if', 0.0089959740666151755), ('series', 0.0055925666853268398), ('you', 0.0082686426389404696), ('how', 0.0086893153324163544), ('some', 0.0060830349372418911), ('them', 0.0075282110037090079), ('you', 0.0062539207729378123), ('king', 0.0044989550065491039), ('need', 0.0081465695210157091), ('would', 0.0054933173162908357), ('there', 0.0083474868943624863), ('star', 0.0054554746788890796), ('love', 0.0082675118596160294), ('way', 0.0086052675873979738), ('way', 0.0058417366284971729), ('why', 0.0069042549022357187), ("'", 0.0061424582674224903), ('book', 0.0043211464292132505), ('there', 0.0076033682293253141), ('tell', 0.0053990405874351126), ('no', 0.0073333754326149209), ('first', 0.0051406483663011208)]), ('she', 0.0074621157922387851)]), ('these', 0.00858494106124944)]), ('thing', 0.0054629582115249516)]), ('fact', 0.0065843410764569576)]), ('so', 0.0059401509301115242)]), ('wife', 0.0042220716192918808)]), ('use', 0.0067625228345768841)]), ('time', 0.0050789881365561756)]), ('can', 0.007217499361440549)]), (162, (103, (233, (258, (288, (125, (145, (109, (98, (141, [('think', 0.026451181410617003), [('like', 0.0098421448484459634), [('she', 0.16438552617371455), [('she', 0.15976449514252222), [('she', 0.19568423690968292), [('you', 0.012265697246529504), [('what', 0.0069438518574287973), [('his', 0.074402609405730652), [('our', 0.010095954165821144), [('i', 0.029276042498187636), ('just', 0.021540970498005178), ('you', 0.00707250252824786), ('her', 0.16023190520623082), ('her', 0.15308958849656237), ('her', 0.18538725105395337), ('what', 0.011354761845763978), ('can', 0.0060831404058192835), ('him', 0.024253352752772118), ('no', 0.0075258863461579069), ('my', 
0.019324204186986368), ('like', 0.018793740764869508), ('its', 0.0066686305551998401), ('mother', 0.011372381496646244), ('i', 0.017440048247663917), ('tell', 0.0096754814167078242), ('no', 0.010918936169625209), ('if', 0.0055412910373111414), ('i', 0.016793768193626126), ('what', 0.0075157373526645848), ('her', 0.013337220602019563), ('know', 0.017607898293566834), ('into', 0.0060536254930707916), ('family', 0.010656148431378864), ('mother', 0.0086989950049384569), ('husband', 0.0086964932376729035), ('if', 0.0092963821282659228), ('even', 0.0053129670924522473), ("'", 0.01603594714876725), ('even', 0.0068786528479449398), ('she', 0.012717870606860341), ('thing', 0.01424981445687), ('his', 0.0056150840717544611), ('tell', 0.0096117138479543934), ('husband', 0.0074617245396523275), ('daughter', 0.0076226478495785392), ('people', 0.0092623787824264968), ('no', 0.0053083197768088548), ('out', 0.01024855094906172), ('can', 0.0066256239624804597), ('me', 0.0097535292730248097), ('really', 0.014069794431621601), ('just', 0.0055938481288618294), ('husband', 0.0089631339457350301), ('tell', 0.0072893671153647513), ('woman', 0.0068531299320982921), ('can', 0.0091884131164984319), ('its', 0.0052478092491720649), ('get', 0.0092969770674529301), ('people', 0.0062111708014369526), ('life', 0.0087827563748994197), ('because', 0.013381419185436066), ('new', 0.0055698838234345353), ('daughter', 0.0087815306904208499), ('daughter', 0.0072781393803125058), ("'", 0.0068282776023212864), ('even', 0.0090366006113425414), ('book', 0.005145824589688033), ('up', 0.0092518714883929737), ('world', 0.0056632093685317917), ('like', 0.0075528203646703677), ('want', 0.013350047184328018), ('most', 0.0052954249256256785), ("'", 0.0067120438449117786), ('woman', 0.0067384272421961242), ('know', 0.0058353536855064673), ('them', 0.0080096536156312043), ('most', 0.004906249462556537), ('go', 0.0087617169295573175), ('many', 0.0055946864915447396), ('love', 0.0073193226675696684), ('see', 
0.013341205451857407), ('first', 0.004761416809581258), ('know', 0.0061710127129445584), ("'", 0.0062300076005176383), ('herself', 0.0056260521114221973), ('there', 0.0078682202667648068), ('world', 0.0047238413439584601), ('after', 0.0085644438046690013), ('these', 0.0052575725669013083), ('feel', 0.0057243692862737876), ('very', 0.010565690377410524)]), ('there', 0.0046633450971293672)]), ('life', 0.0059211004747812511)]), ('family', 0.0061365126554082106)]), ('family', 0.0051802044705269576)]), ('so', 0.00757403517762072)]), ('there', 0.0042958798077719369)]), ('tell', 0.0081460308493128408)]), ('so', 0.0052011590839949563)]), ('so', 0.005621548225058573)]), (167, (196, (276, (240, (217, (150, (42, (133, (174, (177, [('like', 0.016372272439218662), [('i', 0.066670423785309074), [('like', 0.021255362169975287), [('i', 0.075881273916820702), [('my', 0.017443804437605247), [('i', 0.098860250855417317), [('i', 0.063531256444228787), [('i', 0.08229245484708364), [('you', 0.072040768835530261), [('i', 0.098371254735981295), ('just', 0.0095672906057720046), ('you', 0.036507703704685458), ('just', 0.0188173020054643), ('you', 0.043541315894494469), ('like', 0.017101203618261736), ('you', 0.030347237377473017), ('you', 0.040334676861527891), ('you', 0.034899827074569567), ('your', 0.019749782152673676), ('you', 0.031033187979773804), ('even', 0.0085493201694032475), ('what', 0.016297902264202083), ('your', 0.018004991051067981), ('go', 0.018521250469618063), ('just', 0.015871749915585241), ('my', 0.018807078956414074), ('what', 0.014462271867496403), ('my', 0.01683226889249299), ('can', 0.017786777602390426), ('go', 0.020434816798179267), ('way', 0.0069243867260689812), ('go', 0.016231218574739463), ('think', 0.013991701537860295), ('get', 0.017566756705648317), ('think', 0.014647689548909526), ('what', 0.017694281149375036), ('so', 0.014310351115092497), ('what', 0.015946712220870563), ('if', 0.017149523448561726), ('get', 0.017344511338256617), ('them', 
0.0065899020819464348), ('get', 0.015512490831799203), ('thing', 0.012762137592725807), ('so', 0.017011816377299423), ('me', 0.013801429077782717), ('go', 0.017149858733257195), ('go', 0.01427455628257052), ('go', 0.014857066851212708), ('what', 0.016638304439798093), ('my', 0.016401184040899391), ('how', 0.0065499565430732548), ('so', 0.014972548022164538), ('really', 0.011047645083591131), ('what', 0.016939488861210331), ('know', 0.013488269659968601), ('there', 0.016246672189146633), ('get', 0.013820192271895335), ('so', 0.014620623503484483), ('get', 0.015826479389176678), ('what', 0.015477751020295659), ('your', 0.0065178542915773465), ('there', 0.01462472738508982), ('my', 0.01075022366868952), ('there', 0.01603130001063004), ('thing', 0.010650725548729019), ('think', 0.016009690049663544), ('my', 0.013256404425287577), ('think', 0.013474021706817121), ('there', 0.015550823919882582), ('there', 0.015075766749043902), ('thing', 0.0065143620875802132), ('think', 0.01279391026499588), ('how', 0.010427459014086623), ('think', 0.01484082211436526), ('want', 0.010351165815176089), ('get', 0.015818574581857422), ('there', 0.013031462640796807), ('get', 0.013090789966597136), ('so', 0.015516096329797391), ('think', 0.014871619683696614), ('show', 0.0063046683580792054), ('like', 0.012289878864610717), ('know', 0.0097415025635061367), ('if', 0.013225101555316815), ('see', 0.009232420093856681), ('so', 0.015468406721563931), ('like', 0.012067333648120793), ('there', 0.012704313378339162), ('go', 0.013206116360950034), ('so', 0.014242363205414037), ('feel', 0.0060993364811174171)])] ('just', 0.011871805986362885)])] ('want', 0.0097195069083230827)])] ('just', 0.013210173599477499)])] ('because', 0.0091808981458287638)])] ('me', 0.012975738093442758)])] ('think', 0.011223188647828734)])] ('like', 0.011943580189995216)])] ('people', 0.012154313451258041)])] ('just', 0.013417143557399839)])]

350 450 500 340 360 900 1000 850 symmetric 0,01 symmetric 0,01 symmetric 0,01 symmetric symmetric symmetric 0,01 0,01 0,01 score 13 score 7 score 10 score 6 score 5 score 6 score 9 score 9 score 6 score 4 score 3 score 4 1 topic 2 topics 1 topic 3 topics 2 topics 3 topics 1 topic 2 topics 5 topics 10 topics 4 topics 5 topics [(211, [(109, [(297, [(339, [(153, [(207, [(359, (57, [(651, [(665, [(14, [(630, [('crisis', 0.14781346426640851), [('genocide', 0.14411493349183283), [('hiv', 0.13158812380827881), [('hopkins', 0.13898835232791035), [('pin', 0.15202454379581948), [('singh', 0.066093513899523657), [('graduation', 0.0867859004552997), [('photography', 0.11096016035526646), [('hui', 0.33453444528515047), [('gurkha', 0.016872603479602098), [('i', 2.011937911442203e-05), [('len', 0.07571942320015343), ('ukraine', 0.12576454084132133), ('armenian', 0.062400041896519896), ('promotion', 0.11841621252470053), ('horton', 0.08648189862592405), ('hover', 0.076491495346253296), ('amanda', 0.0610745591946334), ('cedar', 0.078869281343387193), ('shed', 0.077029294179794164), ('like', 1.7716311737720657e-05), ('can', 2.3983979117526674e-05), ('start', 1.9641540694144704e-05), ('168', 0.05658669682872404), ('ukrainian', 0.050962467873992459), ('andrews', 0.05575834727901461), ('buffalo', 0.11176252772565466), ('morale', 0.083589933222288262), ('ramirez', 0.060180067993120527), ('reyes', 0.059172366132229245), ('89', 0.059527130716992852), ('mohamed', 0.072993507935070548), ('so', 1.7282757405721754e-05), ('i', 2.2450683288251341e-05), ('state', 1.9336868307731115e-05), ('ashe', 0.054091586682955918), ('eastern', 0.036940030564529526), ('feminine', 0.037545488220670033), ('aids', 0.083786675841640873), ('messenger', 0.08087660363657706), ('towers', 0.037844843596331702), ('jointly', 0.054767156852555142), ('diploma', 0.050559949348347934), ('disproportionate', 0.047665610403635573), ('study', 1.666754615946254e-05), ('you', 2.1878937827202594e-05), ('his', 
1.9182608549953404e-05), ('145', 0.048870660207529809), ('separatist', 0.027821271365918627), ('contestant', 0.036261868728212875), ('prevention', 0.048955178360442303), ('sandusky', 0.069289343204353052), ('malawi', 0.035624540304011805), ('dawson', 0.051549133676707916), ('olympia', 0.050226771992808707), ('spider', 0.043424845938381598), ('can', 1.6576078326870206e-05), ('people', 2.1334853202229864e-05), ('so', 1.8940997643754366e-05), ('odessa', 0.043560074887748289), ('yanukovych', 0.019195219388553387), ('atrocity', 0.028976178444761257), ('incidence', 0.043802929069891083), ('phony', 0.043373663746090224), ('embarrassed', 0.033453092759474605), ('benton', 0.03708754104638088), ('malaysia', 0.048241658655506066), ('prime', 0.038322255394287623), ('our', 1.6348490477044115e-05), ('moscow', 2.1129504254798579e-05), ('go', 1.8789521147606392e-05), ('dislodge', 0.042393010106379407), ('kiev', 0.018904295116653962), ('taint', 0.025635379961795837), ('zimbabwe', 0.034507293169720919), ('frenzy', 0.039102366497713287), ('370', 0.031981532158538828), ('sheltered', 0.033446276589756956), ('alvarez', 0.038698127236266873), ('maliki', 0.033218818266932099), ('family', 1.6246381988832401e-05), ('what', 2.105426562050186e-05), ('would', 1.8724673177490682e-05), ('tranquil', 0.035585011933498299), ('gingrich', 0.018717310273353868), ('receiving', 0.022301428913363865), ('receiving', 0.023689830740653957), ('outward', 0.027169577187692594), ('zeal', 0.031366148297028378), ('refreshing', 0.031627593885446982), ('marking', 0.035368266586008369), ('threatening', 0.03319199843781561), ('there', 1.6184416376309902e-05), ('if', 2.0785242498054309e-05), ('its', 1.8670800383635968e-05), ('question-and-answer', 0.035264742838353839), ('newt', 0.018007174264852415), ('kimmel', 0.021955787466970869), ('hiv/aids', 0.019430843207643662), ('bluntly', 0.026294173312201838), ('tanzania', 0.030884332109321174), ('parisian', 0.030499015964446857), ('pavement', 0.031087306213370774), 
('london-based', 0.031674989635308154), ('poverty', 1.5937315375401399e-05), ('his', 2.0545938022664596e-05), ('law', 1.8619811797271209e-05), ('malleable', 0.03524893851826684), ('sri', 0.017040927824013095)]), ('breast-feeding', 0.021283706337855798)]), ('manageable', 0.019354446250679343)]), ('paterno', 0.025365550913487887)]), ('countdown', 0.027762595568091325)]), ('88', 0.029735699297285721)]), ('further', 0.029768563639931227)]), ('specimen', 0.02919299770540796)]), ('than', 1.5779526331546329e-05)]), ('introvert', 2.0486530638680403e-05)]), ('out', 1.8582912872785806e-05)]), ('reconstitute', 0.028170646196514375)]), (53, (45, (83, (26, (328, (361, (7, (82, (451, (700, (868, (35, [('syria', 0.11093032629325139), [('meyer', 0.10527966262590296), [('henry', 0.35453206920205116), [('iv', 0.088026497553632524), [('garcia', 0.17425210577541167), [('parker', 0.18082836925169959), [('stanley', 0.13894163953960054), [('colombia', 0.10472127917040576), [('linda', 0.26318023145640629), [('rope', 0.23225015411557282), [('agonizing', 0.079575551651983531), [('scroll', 0.10040511861984129), ('syrian', 0.096257743413913507), ('bubble', 0.082689105269540847), ('forgive', 0.068604805969889771), ('kirby', 0.070662455219441811), ('blunt', 0.12497558087136626), ('pin', 0.075285744948024549), ('peterson', 0.10665042800783647), ('hernandez', 0.094072337564430897), ('ceres', 0.1266613040204384), ('mcclendon', 0.12629769782011888), ('oligarchy', 0.058015825446943732), ('chick', 0.075583970375679863), ('rebel', 0.066752758528975067), ('conversion', 0.08193631475632121), ('ribbon', 0.059252893583304656), ('laurie', 0.051865986099790137), ('cardboard', 0.078616889444353605), ('ratify', 0.044709928927119127), ('pirate', 0.090559461956521287), ('hitler', 0.085885826997797815), ('rahman', 0.089438583176754349), ('gannett', 0.064464449560384357), ('sherwood', 0.041514157622212805), ('flurry', 0.07244279170542231), ('regime', 0.043141159686152029), ('tanaka', 0.044077769369430801), ('82', 
0.057008261934705456), ('one-way', 0.039704666752653363), ('refrigerator', 0.062055469531179082), ('nassau', 0.036361248590321353), ('piano', 0.05168776049792885), ('luis', 0.069419905638775176), ('floss', 0.076723841138669563), ('ancestral', 0.05056072545704722), ('misogynist', 0.041071169461130506), ('terra', 0.064232420637804405), ('assad', 0.039805438760721408), ('pi', 0.034975623952829125), ('proclamation', 0.045727010400869045), ('weinstein', 0.038067830545438004), ('manuel', 0.060922211070689493), ('hobbs', 0.03199918871507753), ('8th', 0.044298511227430164), ('li', 0.065123094104197476), ('stoop', 0.057946675795309925), ('donahue', 0.026460656200697298), ('tijuana', 0.040404386527908523), ('mileage', 0.059922448774542267), ('aleppo', 0.038329633522482705), ('2022', 0.032864980242650728), ('uphill', 0.037280500476669691), ('woodrow', 0.035361331000693393), ('theirs', 0.054288959281754742), ('hebron', 0.029635109072364909), ('sanitation', 0.043836845492636209), ('colombian', 0.03918953201648797), ('casually', 0.037067811028659407), ('two-month', 0.024200470956822158), ('fealty', 0.037803777593596072), ('vicki', 0.040897436527794734), ('butler', 0.031861463374701859), ('harold', 0.028890877410149559), ('unthinkable', 0.035374561464972157), ('pore', 0.034707746331159244), ('linger', 0.046470919776325594), ('azaria', 0.026789907177191822), ('beatle', 0.033308230611488764), ('tucker', 0.036231418919515278), ('eliot', 0.035819524176187306), ('bulger', 0.020693402901284356), ('bloodthirsty', 0.033911563507101547), ('searcher', 0.0366633061900506), ('cincinnati', 0.027219434735294062), ('decades-old', 0.023144829018800807), ('erosion', 0.034714647496650965), ('hayden', 0.031498206361957298), ('apparatus', 0.046367324560800872), ('heller', 0.024506566981223499), ('incubator', 0.032046915887108701), ('volt', 0.03360326179543012), ('burris', 0.033691994252177858), ('angelenos', 0.017623230042727995), ('iger', 0.032343518319805281), ('curtin', 0.02310784465574079), 
('opposition', 0.025278531039666711), ('fetch', 0.02132365550424456), ('partisanship', 0.029768335904623848), ('remix', 0.029025597165801158), ('intensely', 0.034608070503941477), ('taipei', 0.01898736495003709), ('sweatshirt', 0.028831535098816991), ('stalin', 0.031055395017096518), ('live-in', 0.031450052625889159), ('whitey', 0.014732830635689411), ('bertrand', 0.023015463484409377), ('nlrb', 0.022515685831515992), ('bashar', 0.022651358176212423)]), ('inflated', 0.020748902399037691)]), ('documented', 0.026687292710088326)]), ('run-up', 0.02746155597361432)]), ('bloomfield', 0.024789729890176374)]), ('moriarty', 0.018662544047177938)]), ('barracks', 0.028463014517971955)]), ('adolf', 0.027891335379311565)]), ('cabaret', 0.030734049339053351)]), ('hearse', 0.013320719679149178)]), ('gardiner', 0.019343458825368767)]), ('resolutely', 0.021530352279730949)]), (173, (190, (100, (334, (118, (271, (304, (353, (320, (692, (668, (130, [('libya', 0.14303134140504437), [('nepal', 0.095977453451410635), [('selected', 0.061610355027701999), [('southern', 0.15190799739063249), [('bachelor', 0.17258464931100229), [('handling', 0.07425304652669551), [('earthquake', 0.10281192397756851), [('hudson', 0.094447252256511349), [('lynn', 0.2787908259488826), [('stein', 0.154290340673543), [('incarcerate', 0.20370952223988958), [('saw', 0.093332789359966611), ('benghazi', 0.067388184467558801), ('anthem', 0.064554588794922596), ('carrie', 0.059584685730920316), ('racist', 0.041789225738527808), ('orthodox', 0.11893588278302801), ('sage', 0.058108395170243916), ('horse', 0.095036036731328224), ('designated', 0.051630046235333606), ('fin', 0.121074369384013), ('jill', 0.13906929957526068), ('incarceration', 0.1996685235223673), ('29-year-old', 0.092162943485658178), ('gm', 0.066546745490304704), ('kathmandu', 0.03914693924185994), ('state-of-the-art', 0.051237705177542121), ('haggard', 0.039496074094881116), ('rethink', 0.073161753583848424), ('evasion', 0.055775405435841779), 
('cattle', 0.031643524093453314), ('flanagan', 0.041384043413118979), ('kushner', 0.10737155995336553), ('sabrina', 0.072204084373525157), ('incarcerated', 0.080054848134011009), ('missoula', 0.079783860552029975), ('switch', 0.050805172247606147), ('peru', 0.033215022623293421), ('departure', 0.043581400500833395), ('klan', 0.036254779265548827), ('injustice', 0.069896932755261043), ('winton', 0.055128171577695478), ('ranch', 0.028731215097566914), ('fountain', 0.03567145907741439), ('sympathizer', 0.087177941863060576), ('encroach', 0.061719500725986365), ('usefulness', 0.069507367707057732), ('bender', 0.077572984912668258), ('libyan', 0.050625736219143631), ('peanut', 0.021578242058917667), ('deterrent', 0.043291043428415631), ('supremacist', 0.031967888536807564), ('nasty', 0.065927732274393497), ('short-lived', 0.041155038277589102), ('quake', 0.028309024521139387), ('marshall', 0.035651140933992208), ('salah', 0.085717162537927286), ('vaccination', 0.042854666756381499), ('levey', 0.034041231610319939), ('weep', 0.069059305669865151), ('ben', 0.042822729886369183), ('pearson', 0.018869485910230667), ('donkey', 0.041110421981668845), ('ku', 0.031472109022129248), ('burgeoning', 0.049446164960575731), ('eleanor', 0.036908407381318113), ('land', 0.024845941271321965), ('ch', 0.034602976442196336), ('eel', 0.069809463772443986), ('vaccinate', 0.033916501085464582), ('88-year-old', 0.013608724777341992), ('untapped', 0.044184368828207393), ('sink', 0.03822788098184398), ('alberto', 0.016515979651096722), ('robbie', 0.040060018340931862), ('white', 0.029086299268998011), ('co-owner', 0.046175192227919631), ('triumph', 0.032715323944108803), ('rancher', 0.023957421601786877), ('itt', 0.03249113490187118), ('abdeslam', 0.06517828999217111), ('thigpen', 0.021489586206210626), ('tupac', 0.0097243526597771499), ('fiancee', 0.034665229992707887), ('tripoli', 0.026460111438422657), ('fujimori', 0.014758065833817228), ('rollout', 0.039852385408340393), ('bigot', 
0.024580902765323782), ('ordeal', 0.041836098567102034), ('purposely', 0.030717554388871807), ('disaster', 0.023101689822955466), ('segregation', 0.027023385878585174), ('molenbeek', 0.045497214801951805), ('prasad', 0.020648659562032785), ('730,000', 0.0084289902100520391), ('fender', 0.026246726402206141), ('ignition', 0.022989121012561422), ('lima', 0.012567743487437055), ('9%', 0.036044679147044859), ('klux', 0.024019954019838766), ('flurry', 0.040425995377681719), ('payback', 0.02783222400316222), ('hit', 0.022309771056322574), ('pulaski', 0.026435311921234089), ('life-size', 0.038981686512320049), ('mumps', 0.012548178714046962), ('malden', 0.004930941301148733), ('logger', 0.025503288280951029), ('harden', 0.022714409921493168)]), ('toby', 0.012416174861247463)]), ('eminent', 0.033493742993920696)]), ('transpire', 0.020258896312658221)]), ('unprepared', 0.040338320388189054)]), ('midday', 0.027097675660530767)]), ('strike', 0.021475611331753727)]), ('accrue', 0.024821689637517783)]), ('costello', 0.025647199567582046)]), ('whitelist', 0.01226181863261918)]), ('wexford', 0.0047741683522765921)]), ('talker', 0.022995186920554321)]), (278, (70, (91, (78, (52, (331, (278, (261, (133, (719, (432, (57, [('rodriguez', 0.068755156057737368), [('anderson', 0.21663776609906965), [('predator', 0.13681850687574659), [('tube', 0.13532573501181897), [('banner', 0.16820548783736775), [('tesla', 0.10566095817878626), [('u', 0.13590040180613014), [('jewish', 0.21151391097957747), [('cove', 0.12288574583023772), [('occurrence', 0.17060997595488916), [('artillery', 0.12881705340837643), [('smoothly', 0.13744349247007451), ('yahoo', 0.068527885350176135), ('virgin', 0.049614832979028411), ('mob', 0.091769553381664148), ('apollo', 0.087807389010894563), ('beverage', 0.076801515376136123), ('model', 0.092291844740320178), ('banana', 0.057094344502207371), ('jew', 0.17326016530987312), ('barclays', 0.10433287370609194), ('swamp', 0.13539696027501963), ('sabrina', 
0.11343688406650662), ('consequential', 0.0991080348850992), ('honduras', 0.062693070215387822), ('ian', 0.036674027533019571), ('prey', 0.076533295834634432), ('1/2', 0.052889212905679585), ('explicit', 0.071465299386331463), ('2018', 0.066471763632182532), ('incoming', 0.050827912322537007), ('holiday', 0.043008378092631937), ('referee', 0.089255120723156459), ('overlay', 0.064051970623952972), ('wacky', 0.094325097838840605), ('unease', 0.06990790716049744), ('jeremy', 0.057142064397702183), ('fischer', 0.0269121213504428), ('spike', 0.063538204507347082), ('circus', 0.046412306384327716), ('curse', 0.067961760095398227), ('musk', 0.050897318534858897), ('steven', 0.046591269266936353), ('anti-semitism', 0.034369283319307505), ('clyde', 0.078446269140806296), ('outperform', 0.058964678387157944), ('biting', 0.040866571181001621), ('thoroughfare', 0.057700008276963852), ('costa', 0.052817094940463397), ('81', 0.026127021952036401), ('molly', 0.049979800855282877), ('scroll', 0.042965637761664024), ('fidelity', 0.050973464230150173), ('michel', 0.041879645855205351), ('fantasy', 0.040863792980408833), ('holocaust', 0.033208232156665012), ('revolver', 0.072662992555758199), ('otherworldly', 0.055741544853996461), ('wanting', 0.038563687516006015), ('lithuanian', 0.023392505240395194), ('intel', 0.047968592861717181), ('lu', 0.024457368306033525), ('thug', 0.04685424777849196), ('baum', 0.038729585236191848), ('soda', 0.044686056225384518), ('impeachment', 0.041579903853983274), ('transitional', 0.029388115316812707), ('orthodox', 0.029138128554258656), ('speedway', 0.066217275488043209), ('rebranded', 0.021816379183866776), ('paternal', 0.030360188765829799), ('7-0', 0.014922384558990842), ('ana', 0.039646867659278738), ('omit', 0.023640960152948386), ('cobb', 0.044268316459579078), ('two-way', 0.032995022076614096), ('edible', 0.040034640865443387), ('elon', 0.028597925149698104), ('alphabet', 0.028326011653598351), ('community', 0.025275216109049519), 
('lopsided', 0.055662717845172904), ('ava', 0.017152530025569938), ('manhandle', 0.017781193715025907), ('holistically', 0.014336640744407988), ('nypd', 0.033655924073022904), ('islander', 0.01977045309021281), ('est', 0.039904851252903863), ('masse', 0.028561046518746087), ('crawl', 0.035917411872356801), ('imported', 0.023034269378840747), ('sock', 0.027674034652544716), ('anti-semitic', 0.02282669414823477), ('ubs', 0.055440218369338518), ('ey', 0.01691238337255151), ('hammel', 0.01557030318688018), ('mst', 0.012132166341304153), ('guatemala', 0.033450986909495335), ('overweight', 0.018204032590056703), ('winning', 0.039196738352103126), ('drawer', 0.027571989223708286), ('serving', 0.035636095483755306), ('3', 0.019954774271544806), ('thankful', 0.026601712955951512), ('holland', 0.021678454690713403), ('tropics', 0.044490273181998247), ('gleeson', 0.013844359889533099), ('$225,000', 0.013185860923542542), ('griff', 0.01205585270980911), ('unaccompanied', 0.031640797125932957)]), ('2.1', 0.018000551562112728)]), ('tall', 0.036849972930494496)]), ('malawi', 0.026465815865297837)]), ('lantern', 0.030441401310261337)]), ('beckman', 0.018916682910617429)]), ('mckinley', 0.026217926862240417)]), ('selective', 0.013207375559690135)]), ('defuse', 0.039246733520666877)]), ('pokemon', 0.012031522275622935)]), ('striver', 0.012128103626163853)]), ('rambo', 0.011569954136582745)]), (209, (224, (54, (242, (313, (334, (169, (304, (131, (742, (94, (768, [('marine', 0.19992362828394539), [('connected', 0.08318653748711248), [('outlaw', 0.10238425146086413), [('massage', 0.092438226550172489), [('disney', 0.17511296612053021), [('museum', 0.32570645496728695), [('moon', 0.22792896441391555), [('gene', 0.14472375012274799), [('confused', 0.16461493463995466), [('rivalry', 0.13906957319316132), [('motivational', 0.082311081061911678), [('sear', 0.11869201833148793), ('meeting', 0.15237316197027173), ('fx', 0.044339057853832908), ('navigation', 0.076883525277487097), 
('withdrawal', 0.081271657340775946), ('oyster', 0.056132711450158045), ('lucas', 0.067359438633096236), ('xx_n', 0.18946361840444376), ('wyoming', 0.054863306454770021), ('wrongly', 0.15307545221573854), ('portuguese', 0.082095592479516852), ('wallpaper', 0.048278525596613255), ('experimentation', 0.10913391138578091), ('corps', 0.13867339222877301), ('marked', 0.037480048755117429), ('finland', 0.067689593618764379), ('reiterate', 0.056484730371341767), ('mouse', 0.054423693593372138), ('friend', 0.026927550978425436), ('par', 0.0400380317605252), ('dna', 0.050935537229866164), ('derrick', 0.11837380957124116), ('attire', 0.063411695181115765), ('hui', 0.03994050456333522), ('chatter', 0.090904504016213888), ('lawn', 0.048043057219898377), ('usd', 0.034504954243378787), ('decree', 0.056478487502365282), ('pt', 0.049135207270423552), ('alligator', 0.045035217502718466), ('mccormick', 0.020739979627034787), ('dinosaur', 0.039810478961738241), ('genome', 0.044994045062635442), ('doherty', 0.065637807070350523), ('dissuade', 0.052888853525999764), ('delphi', 0.026376375413301022), ('id', 0.078097520129049552), ('carpenter', 0.046152699053235949), ('mini', 0.027836878859329894), ('nordic', 0.055692285451733298), ('benign', 0.046163906458717915), ('walt', 0.04264237201316206), ('patton', 0.019936714773719902), ('samuel', 0.033102577319160345), ('3d', 0.043255876124052538), ('ecstasy', 0.057883524519519364), ('draper', 0.044216196391213519), ('whaley', 0.023972147261943244), ('imaginative', 0.056079415290058637), ('twilight', 0.035994685736735151), ('compact', 0.027574276466762995), ('entitlement', 0.05432245041362041), ('whitman', 0.040899547926876549), ('cookie', 0.037785889816939061), ('hijack', 0.01887968293748667), ('eduardo', 0.023055618368408422), ('sherman', 0.036583221931609178), ('70-year-old', 0.054975754488499375), ('compatriot', 0.03319858357464766), ('lifesaver', 0.021258978003459023), ('rediscover', 0.045229764939363488), ('presentation', 
0.026311710338764116), ('upside', 0.021682040598163709), ('geographical', 0.045969356519845433), ('prolong', 0.030889756863051586), ('vista', 0.037436676964427658), ('decimate', 0.018421041260433247), ('byron', 0.022878631906864691), ('sequence', 0.032416120893859229), ('hawkeye', 0.052413463420726954), ('caitlyn', 0.031056137002337895), ('divvy', 0.017697810720879441), ('bennington', 0.023440442100829767), ('alexa', 0.023189274805316441), ('principally', 0.018253809254506558), ('state-run', 0.043240183473846477), ('gun-control', 0.026902114584017579), ('finch', 0.031513936865969205), ('chew', 0.017561248781402253), ('fletcher', 0.020827181524154068), ('hole', 0.029593702142765365), ('relatable', 0.043167096291576862), ('high-pressure', 0.025829938517476454), ('tidiness', 0.0074108479728605525), ('prosthetics', 0.020490165240424678), ('walton', 0.015422780210465383), ('dovish', 0.017050809985972769), ('consequently', 0.03675152749450239), ('lenient', 0.023806583620301743), ('resort', 0.02924435673679418), ('gum', 0.01524758040557451), ('enceladus', 0.018642510161087839), ('pga', 0.029070016426694659), ('nottingham', 0.042962603077059729), ('hollander', 0.023009928040278214), ('wtvm', 0.006579802445794365), ('1883', 0.017798015509756578), ('stack', 0.014778280422868875)]), ('inflatable', 0.016929371581218045)]), ('peacefully', 0.034598817999942542)]), ('leniency', 0.023672652612453472)]), ('dell', 0.02770878830260079)]), ('hoard', 0.01455894538248851)]), ('sergio', 0.017960213343650631)]), ('cellular', 0.027706397774298466)]), ('jump-start', 0.029745784694169067)]), ('centeno', 0.022708685144073117)]), ('payless', 0.0048448712417251015)]), ('purchased', 0.015502979342738183)]),


(124, (295, (209, (145, (178, (360, (111, (211, (461, (147, (27, (46, [('its', 0.0092132138609249802), [('you', 0.13982011995096955), [('she', 0.22466557974246787), [("'", 0.049953304518172578), [('law', 0.069490953839935407), [('world', 0.013645969096165157), [('show', 0.019350577672165404), [('like', 0.01873745744024093), [('2016', 0.18163736947851969), [('his', 0.1183554928880034), [('i', 0.074850618665442958), [('his', 0.17850418352840308), ('write', 0.0091593474116264291), ('your', 0.058800602362577128), ('her', 0.21599127843015398), ('i', 0.025148149509670415), ('court', 0.030703886813535551), ('american', 0.012932699650944209), ('character', 0.011995825109801055), ('you', 0.013749897851734954), ('march', 0.13977792151610194), ('him', 0.040043640924243853), ('my', 0.026770832164003486), ('him', 0.05470518980704412), ('history', 0.008578745896761171), ('if', 0.037113887683872762), ('husband', 0.0098358268316837104), ('you', 0.02402428799851895), ('state', 0.030529948081108562), ('its', 0.008633646267228819), ('story', 0.0098184427710990493), ('up', 0.010511231265618438), ('june', 0.053377171151676783), ('i', 0.027595529464076955), ('me', 0.019612879940888547), ('after', 0.014527624215982759), ("'", 0.0085417119080627452), ('can', 0.034548457398106483), ('daughter', 0.008557919319274826), ('like', 0.021816708689474652), ('rule', 0.018071509316425476), ('political', 0.0071899184377368711), ('series', 0.0091206700845782168), ('out', 0.010456555916301064), ('during', 0.033508475105772317), ("'", 0.011668717471174005), ('day', 0.017831522038736808), ('man', 0.013637325709112916), ('book', 0.0059061709420950848), ('get', 0.013312513869702123), ('after', 0.0077567453937451526), ('get', 0.017638767856328272), ('would', 0.014959522605957891), ('america', 0.0067937299327896841), ("'", 0.0082478127152494189), ('just', 0.0085477558034080074), ('22', 0.033273371243231466), ('up', 0.01084192114879667), ('get', 0.016539370392262338), ('tell', 0.012922092370928395), ('his', 
0.0055473429258972081), ('what', 0.011387681326231526), ('mother', 0.0069342148600799458), ('just', 0.014408356517640945), ('legal', 0.014261884998107722), ('many', 0.0067264268947126431), ('season', 0.007467255281653826), ('get', 0.0081758900301990723), ('23', 0.023093828040421399), ('get', 0.010584079500183724), ('work', 0.014524578304354836), ("'", 0.012497485310236759), ('time', 0.0051446513522387903), ('how', 0.0093503323894904636), ('tell', 0.0066968312368308443), ('so', 0.0130019131854134), ('federal', 0.013586693561827292), ('history', 0.0064585467342940161), ('like', 0.0069599853046090248), ('look', 0.007767615827173651), ('photo', 0.021718765976435087), ('go', 0.010576611552795122), ('time', 0.013006750888463895), ('know', 0.010710374177595501), ('most', 0.0047195868422405677), ('need', 0.0091781725177996105), ('herself', 0.0063748338505569112), ('go', 0.012554001552776769), ('case', 0.012800184773390934), ('us', 0.006368116001958188), ('episode', 0.0069037729684710966), ('there', 0.0077585096039155237), ('17', 0.020316756103598163), ('out', 0.010185497169207381), ('go', 0.012708134536752789), ('out', 0.010235642027201475), ('even', 0.0047188382793315756), ('there', 0.0080542764127185112), ('family', 0.0062477615219951044), ('show', 0.011260955680890939), ('government', 0.010954094716463338), ('war', 0.0056469763878689667), ('its', 0.006266203440301826), ('so', 0.0074619550321419967), ('19', 0.019975418454473599), ('me', 0.010021394635886298), ('out', 0.011527083927225037), ('go', 0.0097834036490696934), ('like', 0.0046341456650033155)]), ('should', 0.008019153908046868)]), ('life', 0.0054767442621223373)]), ('really', 0.010045967069994189)]), ('decision', 0.01083139845924019)]), ('most', 0.0055226248066311094)]), ('new', 0.0056433077606187808)]), ('into', 0.0068307029673462284)]), ('tuesday', 0.019647093922763967)]), ('tell', 0.0099576449135207502)]), ('up', 0.011373339316395765)]), ('would', 0.0087254675291883362)]), (112, (270, (373, (332, (337, (178, 
(168, (75, (101, (127, (380, (105, [('you', 0.026153629051272512), [('book', 0.033487672162038684), [('like', 0.010999572045393985), [('his', 0.081936807919321772), [('you', 0.11800385210362403), [('what', 0.012636772296107929), [('his', 0.16185498176901308), [('i', 0.11998979838887278), [('you', 0.047480189448322795), [('march', 0.09115638986565508), [('like', 0.025824200530389529), [('she', 0.24579004084527209), ('like', 0.02112881771506573), ('write', 0.015349318166642411), ('i', 0.0079148507931993441), ('him', 0.023174156555946113), ('your', 0.042591838106925989), ('like', 0.012618134706017711), ('him', 0.046360928882220612), ('my', 0.045555466595437299), ('i', 0.040542142749806567), ('april', 0.079750276880436607), ('i', 0.024653184726750026), ('her', 0.22232829531335938), ('get', 0.015680558183982046), ('i', 0.0096107588726961034), ('even', 0.0071007507441605416), ('into', 0.0099002339152938004), ('if', 0.027274151544432936), ('i', 0.012235928074879216), ("'", 0.015922633765160701), ('me', 0.026168779881949236), ('so', 0.026280434313693729), ('22', 0.030263299635072093), ('you', 0.016564440196738826), ('husband', 0.0101297702350457), ('out', 0.011312008085922341), ('like', 0.0072721424368465664), ('into', 0.0060438590407315904), ('out', 0.0093566222410062759), ('can', 0.026716170250264779), ('would', 0.009501800515999078), ('i', 0.012294663873212146), ('go', 0.0098294214694387445), ('what', 0.020257171939599368), ('19', 0.022952107692025353), ('go', 0.01630975068647127), ('after', 0.0098296446043192719), ('up', 0.010668965169840266), ('read', 0.0059092096556946213), ('seem', 0.0059683706514167129), ('after', 0.0083145800922257915), ('get', 0.016765568125955425), ('even', 0.0088825372594128933), ('after', 0.010164465837356208), ('his', 0.0091472496727289382), ('like', 0.020108336328472885), ('27', 0.022629168772243878), ('just', 0.015152815809506502), ('family', 0.0087299269740028149), ('can', 0.010568016873263699), ('time', 0.0058272592380653535), ('than', 
0.0057676338014763812), ('up', 0.008147352674738071), ('there', 0.012406998769350879), ('seem', 0.0086966340220110908), ('himself', 0.008070607428694478), ('get', 0.0081495639530983245), ('think', 0.018288278907659841), ('25', 0.021960919595570373), ('up', 0.014852385272064948), ('herself', 0.0072686136075343303), ('just', 0.010515827278914026), ('most', 0.0055648751618468262), ('no', 0.0056939153979232264), ('man', 0.0071149724836159674), ('so', 0.011385053043312079), ('people', 0.0082841501977285788), ('would', 0.0075920702903044326), ('day', 0.0080318835827287091), ('go', 0.01576960335250617), ('23', 0.021285354802693179), ('get', 0.014151699543696062), ('home', 0.0062376016335306912), ('there', 0.0094058492678396506), ('you', 0.0054813112213602678), ('time', 0.0053588233061980985), ('back', 0.0069664094617963909), ('what', 0.011106397562585558), ('there', 0.0082025078961582652), ('out', 0.0074538596758964119), ('time', 0.0075765889143464306), ('there', 0.015370346725560998), ('28', 0.021100587085112657), ('guy', 0.011317981501898196), ('take', 0.0059243462144558546), ('so', 0.0082667660553890319), ('story', 0.0053464887154532929), ('so', 0.0053083399397705223), ('then', 0.0054486709707051309), ('how', 0.0094830960412573932), ('so', 0.0081606234126289055), ('tell', 0.0073124222428841597), ('know', 0.0070983502206575207), ('just', 0.014657033515837243), ('26', 0.021085832470601706), ('show', 0.011187872713043449), ('tell', 0.0058976503493440172), ('what', 0.0078027249132703765)]), ('novel', 0.0051109743625463921)]), ('most', 0.0051694814791915563)]), ('take', 0.0052213772147861265)]), ('them', 0.0094322383013942523)]), ('way', 0.0077376757463750358)]), ('man', 0.0071646713152333777)]), ('out', 0.0069511525725372237)]), ('know', 0.014095389650148253)]), ('18', 0.020692324382604519)]), ('so', 0.01075770394229421)]), ('life', 0.0057533323492273691)]), (150, (342, (60, (371, (439, (235, (336, (193, (397, (273, (947, (100, [('she', 0.15759661715219564), [('she', 
0.17679948070882912), [('you', 0.11336358603798181), [('she', 0.15457203529710353), [('she', 0.15072629699948586), [('i', 0.13250422311564006), [('our', 0.01989653885183466), [('she', 0.19148930808929315), [('she', 0.35542451318339602), [('what', 0.033433769166004705), [('she', 0.16655073358052966), [('if', 0.019570699258323245), ('her', 0.15245823881601361), ('her', 0.15549149924652306), ('your', 0.044313732741916689), ('her', 0.14692534424308076), ('her', 0.14273979850451879), ('my', 0.042147669750113893), ('people', 0.013884861040091448), ('her', 0.18258121864968921), ('her', 0.27341006541540569), ('you', 0.032381642857555473), ('her', 0.15040271225317797), ('you', 0.017294924521279072), ('tell', 0.0094416714444247509), ('i', 0.018528861003507098), ('if', 0.025203536591518371), ('i', 0.014174187474809896), ('i', 0.012965326699046639), ('me', 0.026359938918542089), ('what', 0.011227137993149916), ('husband', 0.0084717758557864061), ('herself', 0.010339644283738689), ('i', 0.03051521288024759), ('i', 0.013116215351426751), ('would', 0.014909461903337352), ('after', 0.0086638298489275095), ("'", 0.013185156137211614), ('can', 0.023013977527528984), ("'", 0.0077210109380341724), ("'", 0.0082722906123595986), ('you', 0.017316782994325569), ('world', 0.010972637376940503), ('daughter', 0.0077675968548343481), ('work', 0.010165982973734068), ('go', 0.0292249693254974), ('tell', 0.008294692307781952), ('what', 0.014824930536571197), ('mother', 0.0084926813994846261), ('go', 0.0084830472266939766), ('get', 0.011214803727067157), ('tell', 0.0071551510381721237), ('tell', 0.0078590109542192154), ('what', 0.014092939521083522), ('these', 0.0075576716165591962), ('after', 0.0075773930856086495), ('woman', 0.0082687732621247283), ('so', 0.027107187608484001), ('woman', 0.0076977829505065973), ('i', 0.014544490453951692), ('daughter', 0.0080766489944683242), ('get', 0.008043640559077081), ('what', 0.011178168338018678), ('go', 0.0071409981715181339), ('go', 
0.0076145440102542237), ('know', 0.013525615243079542), ('can', 0.0066575818080510021), ('mother', 0.0073854813914631008), ('help', 0.0077856228923500036), ('know', 0.023886305416178175), ('out', 0.0070472506340707782), ('there', 0.013716047030115601), ('family', 0.0080516871840360018), ('tell', 0.0071612983161462584), ('people', 0.0093441972564527291), ('get', 0.0068135205054513969), ('out', 0.0067004790575410983), ('our', 0.013111961144392502), ('us', 0.0064449202261344114), ('tell', 0.0071299252954486678), ('first', 0.0067279813410037002), ('think', 0.023679953037766845), ('go', 0.0069759263186343363), ('can', 0.013177858731241273), ('i', 0.0077322433529850281), ('out', 0.007152491899741543), ('so', 0.0085472125066710179), ('out', 0.0063227084943309171), ('after', 0.0065984256566619676), ('people', 0.01301190618791234), ('human', 0.0062262452340750216), ("'", 0.0061918973518320028), ('after', 0.0056735134190073443), ('there', 0.023393697607686934), ('know', 0.0062377801002773342), ('no', 0.012531444051721439), ('husband', 0.0076170214112785241), ('up', 0.0064982004958571886), ('them', 0.0085400475525462764), ('husband', 0.0062177620102198076), ('get', 0.0063588368848616888), ('go', 0.0087269577763633699), ('life', 0.0060875874684561777), ('i', 0.0057722793711950616), ('husband', 0.0055791911851213582), ('talk', 0.016158382279714809), ('get', 0.0058666726618797479), ('so', 0.012008304852845757), ("'", 0.0072624964439734581)]), ('know', 0.0062637531617657557)]), ('how', 0.0083432635183441368)]), ('woman', 0.0059256294008275517)]), ('up', 0.0060441396430486044)]), ('think', 0.0080306513468452641)]), ('many', 0.0060548816483072289)]), ('herself', 0.0057555178046577431)]), ('meet', 0.004312850440953077)]), ('very', 0.014215135608823901)]), ('up', 0.0058538893082933354)]), ('just', 0.0093545058482486128)]), (96, (251, (378, (303, (306, (95, (262, (263, (29, (702, (357, (847, [('i', 0.11944584234151737), [("'", 0.024861254020421577), [('i', 0.15415276146762832), [('i', 
0.1300222449896983), [('i', 0.15116223880022361), [('you', 0.10420826235657396), [('even', 0.009590137959808091), [('i', 0.043468578253324439), [('you', 0.18796546780198722), [('i', 0.24186961510741981), [('you', 0.090056495404486972), [('you', 0.049167617512314944), ('my', 0.038051064624169788), ('out', 0.010620552968706136), ('my', 0.042050698713884616), ('my', 0.038638046623939805), ('my', 0.052993321833572832), ('your', 0.027528186769131109), ('like', 0.0080064656673812032), ('you', 0.041311826483525195), ('your', 0.081478879216539804), ('my', 0.079774850407549253), ('can', 0.034067655520912829), ('i', 0.033737985112064953), ('me', 0.023288558863307964), ('up', 0.0090701054812114771), ('me', 0.026688330287662647), ('me', 0.023886251256473177), ('me', 0.031538641814961341), ('if', 0.017429634946395673), ('most', 0.0067426041158336223), ('what', 0.028610648850970758), ('can', 0.039379753215344468), ('me', 0.037579447560134316), ('if', 0.02487493837825679), ('get', 0.024320347122685389), ('would', 0.0086035573821551786), ('like', 0.0088538067239625207), ('you', 0.020269447243275028), ('go', 0.010314194741595258), ('you', 0.018204753295559303), ('can', 0.017046667336255893), ('than', 0.0067254550781863306), ('go', 0.025203303043176809), ('if', 0.033615995035616533), ('would', 0.014505982218282179), ('your', 0.020994127628076103), ('there', 0.022584545816409046), ('so', 0.0082477418817290277), ('his', 0.0084737764383695478), ('know', 0.012004517390228953), ('like', 0.0099933418547352177), ('his', 0.011269261381962197), ('get', 0.016743193698541794), ('seem', 0.0064591982610798402), ('there', 0.024767851917932088), ('get', 0.016021298659588037), ('you', 0.010149712562922915), ('get', 0.017061522897846595), ('go', 0.022406363725856984), ('what', 0.0078825574064878019), ('if', 0.0077433805835239273), ('what', 0.011065193859426222), ('you', 0.0094612205240898178), ('know', 0.010146800812068726), ('so', 0.015167947304974286), ('much', 0.0060730336625845673), ('people', 
0.019909373655333632), ('what', 0.012613434497075447), ('what', 0.0098572748039857695), ('what', 0.014650996911903746), ('so', 0.01966966798375638), ('like', 0.0077238158616426489), ('get', 0.0076817864535395283), ('so', 0.010285477507147308), ('get', 0.0086832490541708957), ('him', 0.0089195637573015463), ('like', 0.015100781928102958), ('so', 0.0059328370113425092), ('so', 0.018311957768916524), ('people', 0.012133022119101475), ('go', 0.0096967725568879674), ('them', 0.014499070185337158), ('what', 0.018289715301195351), ('you', 0.0075219188189285186), ('you', 0.0070636414355567606), ('think', 0.010022754269664907), ('his', 0.0086686961033227289), ('what', 0.0086962570002552982), ('what', 0.013830375452239774), ('no', 0.0058120218874641886), ('think', 0.018086654097882248), ('need', 0.011292426554255046), ('know', 0.009457315767238423), ('people', 0.013906643644792809), ('like', 0.016304197638042407), ('go', 0.0075066771111805845), ('there', 0.006698506000968934), ('like', 0.0092482960364442292), ('so', 0.0086221073108988262), ('go', 0.0084517909900329764), ('there', 0.01302560103310676), ('would', 0.0057832557671237649), ('get', 0.017294637923569302), ('there', 0.010481200407337625), ('think', 0.0088161409036761357), ('there', 0.013432728179423077), ('people', 0.016161965434411899), ('time', 0.0072432109050846877)]), ('would', 0.0066695742089176689)]), ('go', 0.0091911979473393978)]), ('time', 0.0080784037075999088)]), ('tell', 0.0074359515491344121)]), ('just', 0.012035135352876937)]), ('if', 0.0055484119794018795)]), ('know', 0.016439822409510647)]), ('how', 0.0098120785661476879)]), ('like', 0.0078575219943269638)]), ('how', 0.012076942637341822)]), ('think', 0.014979116484395118)]), (5, (100, (99, (163, (161, (258, (191, (0, (732, (559, (113, (482, [('you', 0.059395987453698881), [('i', 0.097065779234058638), [('i', 0.027209345088406531), [('you', 0.047699341012118361), [('i', 0.036021302385549424), [('she', 0.1368780388015538), [('i', 
0.058956137732550346), [('no', 0.012558246217381297), [('i', 0.16965436038120116), [('i', 0.053870081170707268), [('i', 0.1039028038521936), [('i', 0.14489042486807707), ('i', 0.057894897449233478), ('you', 0.026862756101668477), ('there', 0.02162091860732392), ('i', 0.030027978131430755), ('like', 0.016517526487984484), ('her', 0.13664617926964298), ('you', 0.045709072549923034), ('would', 0.0099716525933655556), ('my', 0.045689760590131169), ('you', 0.049213020194011872), ('you', 0.027808472581134471), ('my', 0.041579729170077608), ('what', 0.023462122836120532), ('my', 0.019432996135173235), ('go', 0.020673837669543876), ('what', 0.021594053945066926), ('get', 0.016383994823293526), ('mother', 0.0094934611004206543), ('what', 0.01895752358852977), ('if', 0.0098552298552762561), ('me', 0.028314494784239405), ('get', 0.018632220614266819), ('my', 0.018523710189520842), ('me', 0.026557169744689881), ('go', 0.020892674622679699), ('what', 0.017999349631717031), ('get', 0.020200212080555127), ('there', 0.018636637254954319), ('so', 0.015403540116453394), ('i', 0.009403662034471464), ('get', 0.017664644251446354), ('i', 0.0096411257719253834), ('go', 0.014029823302678691), ('what', 0.018604869879047036), ('what', 0.01751543661923281), ('you', 0.019526151075440708), ('there', 0.019811429081047455), ('think', 0.016032267346476194), ('you', 0.019769372704199685), ('so', 0.01743229908814389), ('go', 0.015354901717360423), ('family', 0.0083692243224480593), ('go', 0.017229418237501095), ('like', 0.0091616068458202152), ('get', 0.01279122215408259), ('people', 0.018331655258817844), ('think', 0.016496691661867648), ('what', 0.011451300798203334), ('so', 0.019389285319336712), ('go', 0.016020654315691735), ('so', 0.01868057953164215), ('people', 0.016624686568814117), ('just', 0.014719373950842461), ('after', 0.0077551249781569378), ('so', 0.016370032135687149), ('even', 0.0084740939138813372), ('you', 0.012783858881463959), ('there', 0.017683594591086402), ('so', 
0.015248023648575647), ('think', 0.010882708302589559), ('think', 0.018580179298101126), ('so', 0.015777826746803194), ('what', 0.018122986936775366), ('if', 0.016329438265511524), ('there', 0.014657703212957705), ("'", 0.0077248730509314953), ('there', 0.015905479539374042), ('can', 0.0075547247021206416), ('know', 0.011395897614148864), ('like', 0.01752601027411587), ('go', 0.013910448834300544), ('go', 0.010208336525972196), ('people', 0.018140239661678343), ('there', 0.014670992386389444), ('like', 0.015134613746260428), ('go', 0.016168045994962042), ('what', 0.012919072109450945), ('daughter', 0.0075877272498495758), ('think', 0.015008553277046997), ('so', 0.0074156875076488646), ('think', 0.010287296174319749), ('so', 0.017241175501851523), ('there', 0.013666106058407238), ('like', 0.0099804145968635048), ('get', 0.01802398588801734), ('people', 0.014670939577080561), ('think', 0.014637537688476931), ('get', 0.014972320045845682), ('you', 0.012298128237938475), ('tell', 0.0074778478083818139), ('if', 0.013231659769831655), ('out', 0.0072309715899864192), ('what', 0.010146738911980721), ('can', 0.01505131785827944), ('me', 0.012271090738718885), ('so', 0.0099297361301763296), ('just', 0.014163968886080479)])] ('me', 0.013272494403399679)])] ('just', 0.013813936209964418)])] ('can', 0.014325930175531644)])] ('would', 0.010822116923969179)])] ('husband', 0.0067973590850921063)])] ('people', 0.012626188167574265)])] ('there', 0.0068634118024038576)])] ('so', 0.0096075643871226819)])] ('just', 0.014134631515256948)])] ('get', 0.012024694763356929)])] ('know', 0.0096932492240371076)]

Appendix D: Topic modelling results

Government actor corpus topics

Topic 215 | Topic 688 | Topic 144 | Topic 418 | Topic 718 | Topic 250 | Topic 399 | Topic 589 | Topic 662
'google' 0.095227877143434711 | 'fbi' 0.12297360659768283 | 'amendment' 0.32730642079997296 | 'spy' 0.065042979811322024 | 'cia' 0.20638656011419168 | 'e-mail' 0.13033705574691457 | 'patent' 0.056826166894215033 | 'charter' 0.27435464679938648 | 'facility' 0.2074413246866996
'search' 0.067490938298885486 | 'cook' 0.11723929954561001 | 'surveillance' 0.21283721177400028 | 'nsa' 0.056259978191477149 | 'celebration' 0.1898887151864839 | 'department' 0.030971715176220209 | 'claim' 0.049196230426213163 | 'accountability' 0.059601572672816898 | 'prison' 0.11449236835813888
'privacy' 0.051829624154915273 | 'bernardino' 0.049886482889926274 | 'appropriation' 0.044672496964598941 | 'dom' 0.043704494873907609 | 'celebrate' 0.16724847742161791 | 'state' 0.028853561218947606 | 'operator' 0.040976484331616173 | 'public' 0.035576990375033382 | 'inmate' 0.089168671187985479
'data' 0.045741215580974159 | 'comey' 0.043707051006270779 | 'amend' 0.032111478788303796 | 'snowden' 0.037455105249508673 | 'mayo' 0.047550986616720879 | 'official' 0.019916064151266653 | 'damage' 0.032703752234147884 | 'privatization' 0.022800348612002658 | 'correction' 0.053211815317920012
'user' 0.043601537246566749 | 'encryption' 0.03878905814114391 | 'invoice' 0.023567190574567166 | 'surveillance' 0.033575992613627245 | 'tangible' 0.035242519322758119 | 'server' 0.019754686245835974 | 'hedge' 0.028469983858864203 | 'privately' 0.021036928208928541 | 'correctional' 0.021680641620315022
'site' 0.040433008218049696 | 'unlock' 0.036497300226267425 | 'viability' 0.020052974512185729 | 'revelation' 0.026233312521838571 | 'spying' 0.01967986222445357 | 'her' 0.019447982281922901 | 'springs' 0.019823276298146224 | 'private' 0.020022734335334889 | 'private' 0.017597529178738747
'information' 0.040260344448992132 | 'government' 0.029497904759425566 | 'stymie' 0.017702501716821775 | 'bulk' 0.023758203040265678 | 'covert' 0.017826502386148205 | 'send' 0.018380399800176871 | 'private' 0.016965908596116096 | 'boon' 0.016734771519717927 | 'cell' 0.012434459026281984
'web' 0.028453120828763216 | 'noah' 0.023112000911125193 | 'lister' 0.01628490888884249 | 'router' 0.019955984052545242 | 'inadvertently' 0.016675829018656472 | 'clinton' 0.018359283664723943 | 'incur' 0.016356568566822854 | '$1.7' 0.015172914880400646 | 'state' 0.011424682921089661
'use' 0.023320262395775691 | 'james' 0.013419484022442994 | 'massie' 0.014567762740276264 | 'encrypt' 0.019103911768669754 | 'anniversary' 0.01498255054960841 | 'private' 0.016195268785076492 | 'entity' 0.016305628710493097 | 'scaling' 0.014565939734487414 | 'federal' 0.010809087382097233
'collect' 0.020142316316540669 | 'locked' 0.013173932505034479 | 'handicap' 0.013543675064166238 | 'bruno' 0.017773457380326717 | 'geoff' 0.014662529073515342 | 'she' 0.014371988972603052 | 'scope' 0.015782868937919841 | '4.6' 0.013065480606771466 | 'department' 0.009297611145518643
'website' 0.017244257273748804 | 'privacy' 0.011277964323217756 | 'bankston' 0.01297960408449656 | 'bride' 0.016765913928595343 | '6-year-old' 0.013351086094846303 | 'inquiry' 0.014112968831254125 | 'fee' 0.011607765895280854 | 'redress' 0.011828420837070612 | 'berkshire' 0.0089780212715140136
'tracking' 0.0096751208966546643 | 'would' 0.010374847620119355 | 'fourth' 0.011639083116704198 | 'sonar' 0.016697720098288838 | 'soa' 0.013157836055726685 | 'secretary' 0.013885742720712812 | 'financing' 0.011045582981901635 | 'mony' 0.011265716618411473 | '25000' 0.0086488127764338327
'tool' 0.009431423297294364 | 'encrypted' 0.01030474241856312 | 'congresswoman' 0.011481517999797807 | 'edward' 0.016662503035216072 | 'rodeo' 0.011779362964216335 | 'use' 0.012469744957755458 | 'diligence' 0.0099506345422121934 | 'victimization' 0.010451238398327256 | 'treasurer' 0.0085862026968408257
'page' 0.0093105457117312988 | 'ludicrous' 0.0095113464088774601 | 'defined' 0.011230422989996122 | 'targeted' 0.014715170132845855 | 'meltdown' 0.011747561479970735 | 'information' 0.011704716528236394 | 'invention' 0.0096392314489481996 | 'accountable' 0.010160976398884303 | 'system' 0.0082699264903963229
'online' 0.0084091154697743548 | 'help' 0.00897691232746297 | 'backdoors' 0.0061947306042991488 | 'intelligence' 0.013151221811720664 | 'birthday' 0.011658742637703797 | 'report' 0.011414770098624799 | 'excluding' 0.009630706092953082 | 'shaffer' 0.0098368469918415254 | 'creep' 0.0070707141933013537
'can' 0.0081966835462738211 | 'co-defendant' 0.0088420786850539491 | 'gather' 0.0055981012454663845 | 'agency' 0.013090186591289587 | 'sanger' 0.010168747930479763 | 'investigation' 0.011053330483602604 | 'firm' 0.0084445728060055913 | 'jarvis' 0.0092862852998108116 | 'overcrowded' 0.0069227028185777674
'internet' 0.008193154383480521 | 'enforcement' 0.0078503384226082471 | 'crypto' 0.0050761807510882649 | 'government' 0.012932003201373344 | 'slow-moving' 0.0080503040936704545 | 'benghazi' 0.010987158707737163 | 'under' 0.0081207630440483739 | 'tariq' 0.0088668902079318546 | 'contract' 0.0068194121641364834
'personal' 0.008134036710049709 | 'deluge' 0.0078395783287096687 | '702' 0.0048884732796910116 | 'es' 0.011141129314501898 | 'cinco' 0.0060609656559006157 | 'tell' 0.01077307891477669 | 'million' 0.0078096477606019272 | 'at-risk' 0.0087556590569643333 | 'overweight' 0.0057299524441840437
'other' 0.006648367625988744 | 'case' 0.007769407501207309 | 'prevent' 0.0038624745567328355 | 'maloney' 0.011049149176868302 | 'sparkler' 0.0059059465698635422 | 'classified' 0.0099442967224320639 | 'infringement' 0.0077676929770696818 | 'paltry' 0.0074983346985540537 | 'official' 0.0050388124324315447
'location' 0.0057823306967076877 | 'director' 0.0076069793807890893 | 'warrantless' 0.003748133748682768 | 'phone' 0.010028575565835118 | 'firefox' 0.0058439496525069653 | 'personal' 0.0095825367985901584 | 'deteriorating' 0.0076904488864887321 | 'blevins' 0.007018145377793425 | 'report' 0.0049238304248171542
'ethical' 0.005776529152019665 | 'shooter' 0.0074533292403663985 | 'naylor' 0.0035708031315626033 | 'secret' 0.0094315352919097963 | 'gentrifying' 0.0057995583626809025 | 'fbi' 0.0092661760479188331 | 'its' 0.0072651247719084117 | 'webcast' 0.0069657454228232224 | 'use' 0.0048172821690347246
'collection' 0.0057163795387919708 | '03' 0.0073811843131400372 | 'propose' 0.002870839513267011 | 'authenticate' 0.0083891086946502012 | 'high-security' 0.0035341474204220939 | 'release' 0.0076032199280518728 | 'property' 0.0071920409044046573 | 'esea' 0.0062526077676152972 | '$800' 0.0047955225929780408
'private' 0.0056825658734461588 | 'simmering' 0.0073718292635041377 | 'unfitness' 0.0028464886218846313 | 'collection' 0.0076657122314389209 | 'white' 0.0031566233711235376 | 'prostitute' 0.0075080127881648742 | 'purposely' 0.0063581470175776179 | 'cabal' 0.0060084882083174121 | 'than' 0.004654947689106923
'service' 0.0055673742237683441 | 'iphones' 0.007064838857739288 | 'would' 0.0025039440970801921 | 'rsa' 0.0076314458744519914 | 'select' 0.0028798369225439165 | 'depp' 0.0074768119861371978 | 'company' 0.005564492034541348 | '85000' 0.0059899244690591238 | 'population' 0.0044826051092602542
' 0.0055378439291552551 | 'pry' 0.0069005795339403229 | 'government' 0.0024977103607278796 | 'seaman' 0.0073004794074590955 | 'schramm' 0.0028397190006760656 | 'interview' 0.0071084105535649316 | 'prior' 0.0054907882932395966 | 'state' 0.0059129068235980586 | 'house' 0.0043605731982781773
'link' 0.005495564481211827 | 'elusive' 0.0068015234208355039 | 'second' 0.0024923146032962834 | 'communication' 0.0072497977631311031 | 'canopy' 0.0026836520342307536 | 'time' 0.0062151328940261586 | 'rapporteur' 0.0053171664337474485 | 'enron' 0.0058873085207036971 | 'end' 0.0040974495560376382
'track' 0.0054905165068756135 | 'san' 0.0066775299844470792 | 'ameri' 0.0024654194905625172 | 'american' 0.0070433387323988886 | 'day' 0.0026152528501404843 | 'no' 0.0057799788204537014 | 'purposefully' 0.0052182403368298668 | 'rubric' 0.0053462665253306086 | 'last' 0.0040830313813873774
'8th' 0.0052237175984986032 | 'order' 0.0059518302343145165 | 'section' 0.0022895226279250206 | 'call' 0.006940163389855078 | 'forthrightly' 0.0023722086225308968 | 'former' 0.0056287332163779567 | 'discretionary' 0.0052107438930175873 | 'equitably' 0.0052026338187037479 | 'safety' 0.0039196995317183122
'13th' 0.0048331900124353282 | 'into' 0.0056289181565164961 | 'use' 0.002282482375199344 | 'encryption' 0.0062898623755490296 | 'us' 0.0023243683891046273 | 'message' 0.0055449663165274953 | 'behalf' 0.0051441621766346035 | 'operate' 0.0049112747100765776 | 'release' 0.0038282057794793239
'cookie' 0.0047855378923058476 | 'access' 0.0053272871311136052 | 'unsupported' 0.0022402951551603041 | 'use' 0.006016643620428498 | 'during' 0.0019940942853271097 | 'claim' 0.0053785048489731375 | 'attributable' 0.0051036007804834155 | 'greenbelt' 0.0048783724238788927 | 'now' 0.0036556289630583511

Commercial actor corpus topics

Topic 310 | Topic 410 | Topic 852 | Topic 692 | Topic 286 | Topic 847
'ms' 0.19807931145898247 | 'activity' 0.099724991355441675 | 'information' 0.11635327462108316 | 'chain' 0.11956758198380274 | 'central' 0.12426090985413997 | 'private' 0.11599401976466901
'footage' 0.094684705871632524 | 'exercise' 0.081818920731572653 | 'use' 0.082108409917044739 | 'cent' 0.089657576029081545 | 'trillion' 0.039342122583494399 | 'spectrum' 0.092671037957801405
'surveillance' 0.090455388244917501 | 'physical' 0.041123945912335612 | 'our' 0.070471983114679171 | 'shop' 0.072606557246337133 | 'deficit' 0.031779389413776865 | 'adviser' 0.089851638479090515
'convenience' 0.047369534799311712 | 'fitness' 0.036486027727972797 | 'us' 0.057068966909828205 | 'edward' 0.040098534047258855 | 'capital' 0.024017682122628073 | 'nurse' 0.080499701973289498
'clerk' 0.046570672745881586 | 'yoga' 0.029677203730622118 | 'may' 0.05273079455843737 | 'nsa' 0.037805902045945111 | 'umbrella' 0.018565038948470471 | 'nursing' 0.045407889817570671
'tidal' 0.041157999055552306 | 'day' 0.02436369107244021 | 'your' 0.050301421928654978 | 'per' 0.026000103083812166 | 'demand' 0.016072458009259032 | 'alternative' 0.036998809156822771
'warrant' 0.0354439849076639 | 'gym' 0.021316716571532831 | 'transfer' 0.043090352382768239 | 'kraft' 0.024714381734239566 | 'stable' 0.015733285676747737 | 'five-year' 0.033420314436163903
'cathy' 0.018627741282795787 | 'tracking' 0.020792696274005554 | 'privacy' 0.036632265176589261 | 'royalty' 0.02220132796324541 | 'private' 0.013582310010377461 | '1.6' 0.018104172850400158
'stash' 0.017533454390229415 | 'routine' 0.018997207591925899 | 'share' 0.026080881895368464 | 'snowden' 0.020217008976456605 | 'institution' 0.01223115812247789 | 'self' 0.016701424235904011
'unlock' 0.016957879435875966 | 'tracker' 0.01595312568166966 | 'policy' 0.023117609902870193 | 'pencil' 0.018846656609071054 | 'recovery' 0.011913989819103398 | 'reit' 0.015332709927424082
'locked' 0.015271720956533537 | 'cage' 0.015870047942610949 | 'how' 0.021304373618859777 | 'bookstore' 0.017620356930351692 | 'balanced' 0.011884475346858082 | '0.7' 0.010532976702611458
'shootout' 0.013556650924476967 | 'step' 0.015426661479186114 | 'see' 0.021003555767687272 | 'merchandise' 0.016654406616121425 | '80%' 0.010448305265945534 | 'rn' 0.0094768457770410628
'couch' 0.0123937388283334 | 'monitor' 0.012169082971837553 | 'detail' 0.019930714706304874 | 'sell' 0.015879217433977695 | 'panic' 0.009196304781797374 | '3.7' 0.0092657830476609825
'robber' 0.01186702540284834 | 'wrist' 0.011334355130524146 | 'other' 0.019838013519996818 | 'noble' 0.01288585473162039 | 'balance' 0.0083707017962477027 | 'theodore' 0.0089209598864965125
'embezzlement' 0.010973390054840149 | 'sit' 0.009378176950823958 | 'service' 0.019216266516996625 | 'york-based' 0.01280378725630899 | 'quantitative' 0.0081114313626445538 | 'sophisticated' 0.0084289706908929531
'asa' 0.0064465787433984883 | 'daily' 0.0082580838023035633 | 'outside' 0.017469440050869933 | 'revelation' 0.010592540548560759 | 'policymaker' 0.00806946121180541 | 'trust' 0.0080908417430843819
'resettle' 0.0061422079798494535 | 'sync' 0.0080112786225782183 | 'if' 0.016901113170090534 | 'e-book' 0.010068114505061881 | 'crisis' 0.0079705367827340289 | 'lydia' 0.0073135882721914519
'tearful' 0.0061185618546347699 | 'minute' 0.0078817595549257099 | 'locate' 0.016586726660923739 | 'store' 0.0094438021368273128 | 'projected' 0.0078109617531124812 | 'traditional' 0.0072266291035041419
'twig' 0.0060905586331172603 | 'watch' 0.0078292672669585429 | 'business' 0.016494670031421264 | 'best-seller' 0.0093352739362140402 | 'derivative' 0.0071560691915379167 | 'professional' 0.0070505540594434091
'surrey' 0.0052790427512898952 | 'walk' 0.0069327856552585197 | 'tailor' 0.014453293128320105 | 'its' 0.0091153370460622316 | 'institutional' 0.006816948937569174 | 'take' 0.0064611110997250445
'ascertain' 0.0052199946409115586 | 'weight' 0.0068576278324111234 | 'better' 0.013206608760866151 | 'coloring' 0.0080858949228288981 | 'interest' 0.006763594634224373 | 'registered' 0.0064014350209042094
'glittering' 0.0048245185913179172 | 'logging' 0.0064031804159641048 | 'advertising' 0.012576416174127933 | 'best-selling' 0.0079860059883290375 | 'wiley' 0.0066927756183428177 | 'next' 0.0063681679027050879
'phone' 0.0045194620905330881 | 'healthy' 0.006104384908810572 | 'processed' 0.0092912563958750874 | 'cater' 0.0079773004490571262 | 'sovereign' 0.0065256464526930978 | 'attend' 0.0062384187527277897
'scandinavia' 0.004515686542824868 | 'session' 0.0059203288823765406 | 'then' 0.0062818233154450084 | 'sew' 0.0078250611270130471 | 'financial' 0.0064583749840352398 | 'flotation' 0.0059604769137696105
'lightbulb' 0.0044616273528447518 | 'hour' 0.005821567232561704 | 'e' 0.0051450645618613495 | 'upscale' 0.0071251822440824824 | 'economy' 0.00609855687581593 | 'late' 0.0057630245824791624
'exclusive' 0.0044346057147353415 | 'health' 0.0053972607647884393 | 'so' 0.0039882043726869962 | 'lingerie' 0.0049150297203258567 | 'devaluation' 0.0058774058023507393 | 'level' 0.0055824497842064339
'after' 0.0043966720334288638 | 'track' 0.0053547248507385655 | 'tell' 0.0035714469331142064 | 'bulk' 0.0048713286660696025 | 'residual' 0.0056428064498344381 | 'two' 0.0055367396366837037
'nite' 0.0043585728744689354 | 'lifestyle' 0.0052715373104035394 | 'news' 0.0031031187887571288 | 'collection' 0.0047516048862377936 | 'world' 0.00534043463895101 | 'dodd' 0.0055197848693288891
'cellphone' 0.0041788069409776675 | 'treadmill' 0.0051618749623549825 | 'photo' 0.0027713128897492532 | 'mannequin' 0.0042748427381826886 | 'long-term' 0.0052491125248650692 | 'need' 0.0053789894712974158
'reunification' 0.0041164465075098642 | 'your' 0.0049653979661683258 | 'first' 0.0026066016091356144 | 'agency' 0.0041588318949236881 | 'invariably' 0.0050950128833688003 | 'nationwide' 0.0051757866938757172
