Can Google Trends Help Predict the Swedish General Election?
USING SEARCH QUERY DATA TO PREDICT THE GENERAL ELECTION: CAN GOOGLE TRENDS HELP PREDICT THE SWEDISH GENERAL ELECTION?

Submitted by Rasmus Sjövill

A thesis submitted to the Department of Statistics in partial fulfillment of the requirements for a one-year Master of Arts degree in Statistics in the Faculty of Social Sciences

Supervisor: Mattias Nordin

Spring, 2020

ABSTRACT

The 2018 Swedish general election saw the largest collective polling error so far in the twenty-first century. As in most other advanced democracies, Swedish pollsters have faced extensive challenges in the form of declining response rates. To deal with this problem, a new method based on search query data is proposed. This thesis predicts the Swedish general election using Google Trends data by introducing three models based on the assumption that, during the pre-election period, actual voters of a party search for that party on Google. The results indicate that a model that exploits information about searches close to the election is in general a good predictor. However, I argue that this has more to do with the underlying weight this model is based on and little to do with Google Trends data. More analysis is needed before any direct conclusion about the use of search query data in election prediction can be drawn.

Keywords: Polling, Big Data, Google Trends Data, Political Prediction, Web Search Data.

Contents

1 Introduction
2 Literature Review
3 Data
3.1 The Swedes, Google and The Voters
4 Method
4.1 Model Development
4.2 Prediction Methodology
4.3 Joint Analysis
4.4 Weight Analysis
4.4.1 Long-term Model
4.4.2 Intermediate Model
4.4.3 Short-term Model
4.5 Weight Selection
5 Results
5.1 Swedish General Election 2018
5.1.1 Long-term Model
5.1.2 Intermediate Model
5.1.3 Short-term Model
5.2 Swedish General Election 2014
5.2.1 Short-term Model
5.3 Swedish General Election 2010
5.3.1 Short-term Model
6 Robustness
6.1 Model Specifics
6.1.1 Search Keywords
6.1.2 Change of Pre-election Period
6.2 County Level Analysis
7 Concluding Discussion

1 Introduction

We are spending an increasing amount of time on our phones and computers. Searching for information on the Internet has become a daily routine, whether we are looking for a good place to eat, the next vacation or a new job. Computing has become neatly integrated into our daily lives. This opens the possibility of a complete recording of all aspects of our lives, creating a data-driven society where information is stored in a huge data cloud. When this data is made accessible to scientists, it provides a universe of research potential.

Combining the words Internet and search usually brings one particular company to mind: Google. Previous research suggests that Google searches could be useful in predicting influenza epidemics (Polgreen et al. 2008; Ginsberg et al. 2009), goods sales (Choi and Varian 2009; Chamberlain, 2010) and the unemployment rate (McLaren and Shanbhogue, 2011; Askitas and Zimmermann, 2009; Suhoy, 2009). This suggests a new data approach, since the Internet has proven to provide answers to questions that are not even asked. There are over 150 billion searches on Google every month (internetlivestats.com, 2020). This raises the question: could data from Google searches help predict the results of the Swedish general election?

Election prediction is the science of predicting the outcome of an election, based on the results of a predefined set of methods. Predicting the election outcome is a complex task. In recent years, the 2016 EU referendum (Brexit) and the 2016 US presidential election are famous examples in which the majority of opinion polls wrongly predicted the outcome. Recently, a new approach based on Internet search data has been developed in the field of electoral prediction.
It is commonly known that when people are interested in or concerned about something, they are likely to search for information about it on the Internet. The Internet contains a wealth of data about the general public's opinions on political campaigns, events and people. This includes opinions that people would not necessarily otherwise reveal. Incorporating the public's views at a given moment into a model could be of great interest when predicting election results.

The paper attempts to tackle the following problem: given the right set of search terms, would it be possible to use such aggregated web statistics to predict election results? If so, is there some underlying logic behind these predictions, or are they simply a matter of luck? Answers to these questions are important since the aim is to develop a model that can be used to predict upcoming elections. More specifically, the paper focuses on predicting the vote support for all major political parties in the Swedish national election by employing three simple models. This is the first paper, to my knowledge, that i.) predicts the Swedish general election using Google Trends data, ii.) predicts vote share for all major political parties in an election, iii.) creates models based on different time horizons, and iv.) thoroughly analyses the relationship between Google search proportion and voting support.

The results indicate that a model, referred to as the Short-term Model, based on party support measured by the average polls of polling institutes is generally a good predictor for elections. However, I argue that this has more to do with the underlying voting percentage the model is based on and little to do with the Google Trends data.

The article is structured as follows. In Section 2, Literature Review, an outlook on the previous research on predicting general elections with Internet search data is presented along with the contribution of this thesis to previous studies.
In Section 3, Data, Google Trends data is presented along with keyword selection, the construction of a variable that measures Google search interest, and some descriptive statistics about Swedes' use of Google. In Section 4, Method, the statistical methods and analysis are described in detail along with the models used for prediction. In Section 5, Results, the main results of the study are presented, comparing accuracy measurements for the different models. In Section 6, Robustness, the sensitivity of the results is analysed. Finally, in Section 7, Concluding Discussion, the main findings and limitations of the thesis are discussed along with recommendations for future studies.

2 Literature Review

The potential value of search data has become increasingly recognised by researchers and scientists. Recent work suggests that search query data might be useful in economic forecasting due to its real-time nature and the ease of data collection. However, the topic is new and relatively little studied. To my knowledge, Ettredge et al. (2005) were the first to suggest the use of Internet search data in forecasting. Since then, many studies have researched the use of Internet search data in various contexts and found different results. For example, previous research suggests that Google searches could be useful in predicting influenza epidemics (Polgreen et al. 2008; Ginsberg et al. 2009), goods sales (Choi and Varian 2009; Chamberlain, 2010) and the unemployment rate (McLaren and Shanbhogue, 2011; Askitas and Zimmermann, 2009; Suhoy, 2009). As mentioned, search query data has attracted plenty of attention from researchers mainly because of its real-time nature and the ease of data collection. This work mainly focuses on the part of the research dealing with electoral prediction. For a better understanding of the research area, related studies using social media data and traditional data are also discussed and compared with search query data.
The literature review serves three purposes. First, it provides an outlook on the previous research on predicting general elections with Internet search data and the related field of social media data. Second, it gives a brief outlook on the traditional polling technique and the field of survey weighting, and it discusses the limitations and possible advantages of search query data. Third, it explains the contribution of this thesis to previous studies.

The main goal of any researcher or polling institute, whether they are predicting the general opinion or forecasting election results, is to obtain an accurate estimate. Known issues with traditional opinion polling techniques are related to selection problems in the way voters are polled. To deal with the selection issue, weights are commonly assigned to make the weighted records represent the population of interest as closely as possible. The weights are usually developed in a series of stages to compensate for bias arising from, for example, unequal selection probabilities, nonresponse, noncoverage and sampling fluctuations from known population values (Brick and Kalton, 1996). Many studies have reviewed weighting methods and evaluated them by the amount of bias (Kalton and Flores-Cervantes, 2003; Brick and Montaquila, 2009). Brick and Jones (2008) review bias for different weighting methods. Generally, no specific method performs better than the others at reducing bias. Many of the commonly used methods are in fact relatively similar, and the weighting adjustments they produce are highly correlated (Deville et al. 1993). Thus, the choice of auxiliary variables and the mode in which they are employed in the adjustments may be of more significance than the choice of method. When applying a complex weighting procedure, the focus should be on the assumptions of the specific statistical model and the information it can handle.
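
To make the idea of weighting concrete, the sketch below illustrates its simplest special case, post-stratification on a single auxiliary variable. It is not a method taken from the thesis or from the cited studies. The group labels, respondent counts, population shares and the function names poststratification_weights and weighted_support are all hypothetical and chosen only for illustration.

```python
def poststratification_weights(sample_counts, population_shares):
    """Return one weight per group: population share divided by sample share.

    sample_counts: dict mapping group label -> number of respondents (hypothetical data)
    population_shares: dict mapping group label -> known population share (sums to 1)
    """
    n = sum(sample_counts.values())
    weights = {}
    for group, count in sample_counts.items():
        if count == 0:
            # An empty group cannot be weighted up; groups would have to be merged.
            raise ValueError(f"no respondents in group {group!r}")
        sample_share = count / n
        weights[group] = population_shares[group] / sample_share
    return weights


def weighted_support(responses, weights):
    """Weighted share of respondents who support a party.

    responses: list of (group label, supports_party) tuples
    weights: dict mapping group label -> weight
    """
    total = sum(weights[group] for group, _ in responses)
    support = sum(weights[group] for group, supports in responses if supports)
    return support / total


# Hypothetical sample: young respondents are under-represented (20 of 100),
# although they are assumed to make up half of the population.
sample_counts = {"young": 20, "old": 80}
population_shares = {"young": 0.5, "old": 0.5}
weights = poststratification_weights(sample_counts, population_shares)

# Hypothetical answers: 10 of 20 young and 20 of 80 old respondents support the party.
responses = (
    [("young", True)] * 10 + [("young", False)] * 10
    + [("old", True)] * 20 + [("old", False)] * 60
)
unweighted = sum(1 for _, supports in responses if supports) / len(responses)
weighted = weighted_support(responses, weights)
print(f"unweighted: {unweighted:.3f}, weighted: {weighted:.3f}")
```

In this made-up sample the unweighted support is 30 percent, while the weighted estimate is 37.5 percent, because the under-represented group supports the party more often. The size of the correction depends entirely on the auxiliary variable used to form the groups, which is in line with the point above that the choice of auxiliary variables may matter more than the choice of weighting method.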