Predictive Analytics with Social Media Data

Niels Buus Lassen, Lisbeth la Cour, and Ravi Vatrapu

This chapter provides an overview of the share several aspects of our private, interper- extant literature on predictive analytics with sonal, social, and professional lives on social media data. First, we discuss the dif- Facebook, Twitter, Instagram, Tumblr, and ference between predictive vs. explanatory many other social media platforms. The result- models and the scientific purposes for and ing social data is persistent, archived, and can advantages of predictive models. Second, we be retrieved and analyzed by employing a present and discuss the foundational statisti- variety of research methods as documented cal issues in predictive modelling in general in this handbook (Quan-Haase & Sloan, with an emphasis on social media data. Chapter 1, this volume). Social data analytics Third, we present a selection of papers on is not only informing, but also transforming predictive analytics with social media data existing practices in politics, marketing, and categorize them based on the application investing, product development, entertain- domain, social media platform (Facebook, ment, and news media. This chapter focuses Twitter, etc.), independent and dependent on predictive analytics with social media variables involved, and the statistical meth- data. In other words, how social media data ods and techniques employed. Fourth and has been used to predict processes and out- last, we offer some reflections on predictive comes in the real world. analytics with social media data. Recent research in the field of Computational Social Science (Cioffi- Revilla, 2013; Conte et al., 2012; Lazer et al., Introduction 2009) has shown how data resulting from the widespread adoption and use of social media Social media has evolved into a vital constitu- channels such as Facebook and Twitter can be ent of many human activities. We increasingly used to predict outcomes such as Hollywood

BK-SAGE-SLOAN_QUAN-HAASE-160238-Chp20.indd 328 23/09/16 5:06 PM Predictive Analytics with Social Media Data 329

movie revenues (Asur & Huberman, 2010), users’ interactions and sentiments (Hussain, Apple iPhone sales (Lassen, Madsen, & Vatrapu, Hardt, & Jaffari, 2014; Robertson, Vatrapu, 2014), seasonal moods (Golder Vatrapu, & Medina, 2010a,b). On the other & Macy, 2011), and epidemic outbreaks hand, predictive models in political science (Chunara, Andrews, & Brownstein, 2012). sought to predict election outcomes from social Underlying assumptions for this research media data (Chung & Mustafaraj, 2011; Sang stream on predictive analytics with social & Bos, 2012; Skoric, Poor, Achananuparp, media data (Evangelos et al., 2013) are that Lim, & Jiang, 2012; Tsakalidis, Papadopoulos, social media actions such as tweeting, liking, Cristea, & Kompatsiaris, 2015). commenting and rating are proxies for user/ Distinguishing between explanation and consumer’s attention to a particular object/ prediction as discrete modelling goals, product and that the shared digital artefact Shmueli & Koppius (2011) argued that any that is persistent can create social influence model, which strives to embrace both expla- (Vatrapu et al., 2015). nation and prediction, will have to trade-off between explanatory and predictive power. More specifically, Shmueli & Koppius (2011) claim that predictive analytics can Predictive Models vs. advance scientific research in six scenarios: Explanatory Models (1) generating new theory for fast-changing environments which yield rich datasets about At the outset, we find that the difference difficult-to-hypothesize relationships and between predictive and explanatory models unmeasured-before concepts; (2) develop- needs to be emphasized. Predictive analytics ing alternate measures for constructs; (3) entail the application of data mining, machine comparing competing theories via tests of learning and statistical modelling to arrive at predictive accuracy; (4) augmenting contem- predictive models of future observations as well porary explanatory models through capturing as suitable methods for ascertaining the predic- complex patterns which underlie relation- tive power of these models in practice (Shmueli ships among key concepts; (5) establishing & Koppius, 2011). Consequently, predictive research relevance by evaluating the discrep- analytics differ from explanatory models in that ancy between theory and practice; and (6) the latter aims to: (1) draw statistical inferences quantifying the predictability of measureable from validating causal hypotheses about rela- phenomena. tionships among variables of interest, and; (2) This chapter discusses predictive model- assess the explanatory power of causal models ling of (big) social media data in social sci- underlying these relationships (Shmueli, 2010). ences. The focus will be entirely on what is This crucial distinction between explanatory often referred to as predictive models: mod- and predictive models is best surmised by els that use statistical and/or mathematical Shmueli & Koppius (2011) in the following modelling to predict a phenomenon of inter- statement: “whereas explanatory statistical est. Furthermore, the focus will be on pre- models are based on underlying causal relation- diction in the sense of forecasting a future ships between theoretical constructs, predictive outcome of the phenomenon of interest as models rely on associations between measura- such predictions are the ones that have so far ble variables” (p. 556). For example, in political received most attention in the literature. To science, explanatory models have investigated illustrate the concepts, models, methods and the extent to which social media platforms such evaluation of results we use examples from as Facebook can function as online public economics and finance. The general princi- spheres (Robertson & Vatrapu, 2010; Vatrapu, ples are, however, easily employed to other Robertson, & Dissanayake, 2008) in terms of social science fields as well, for example,

BK-SAGE-SLOAN_QUAN-HAASE-160238-Chp20.indd 329 23/09/16 5:06 PM 330 The SAGE Handbook of Social Media Research Methods

marketing. The concepts and principles The data that this section discusses are of a general nature and are informed by Hyndman & Once the phenomenon of interest is identi- Athanasopoulos (2014) and Chatfield (2002). fied, decisions concerning the data to be used This chapter does not discuss applicable have to be made. Data can be of different software solutions. However, it is worth men- types: time series (e.g. sales per month or tioning that there exist quite a few software sales per day), cross sectional (e.g. individu- packages with more or less automatic search als such as customers, for a given period in procedures when it comes to model specifica- time) or longitudinal/panel (a combination of tion. A few ones are, for example, SAS, SPSS the former two such as a set of customers and the Autometrics package of OxMetrics. observed through several months). Predictive models can be relevant for all these types of data and many of the basic principles for analysis are quite similar. In the remaining Predictive Modelling of Social parts of this section, for simplicity the focus Media Data will be on time series only. As social media data have been growing When performing predictive analysis on social in volume and importance during the last media data researchers often have to make a 10 years, in some cases the final number of lot of decisions along the way. Examples of observations for modelling may be rather the most important decisions or choices will limited as the dependent variable may reflect be discussed in the sections below. accounting and book-keeping and be rela- tively low-frequency like monthly or quarterly in nature. If this is the case, there may be The phenomenon of interest and a limit to how advanced models can be used. the type of forecasts In other cases, daily data may be available and more complex models may be considered. Quite often the focus will be on a single out- The frequency of the data is also impor- come (univariate modelling – one model tant for model specification itself. With more equation) where the goal is to derive a pre- high frequency data, a researcher may dis- diction or forecast of, for example, sales in a cover more informative dynamic patterns company or the stock price of the company. compared to a case with less frequent data. In some cases, more than one outcome will Consider a case where sales of a company be of interest and then a multivariate approach need to be forecasted. If the reaction time in which more than one relationship or model from increased activity on the Facebook page equation is specified, estimated, and used at of the company to changes in sales is short the same time is worth considering. From (e.g. just a couple of days) then if sales are now on let us assume that the phenomenon of available only on a monthly basis the lag interest is sales of a company and the social pattern between explanatory factors and out- media data are among the factors that are come may be difficult to identify and use. considered as explanatory for the outcome. In many cases there will be a large set of The discussion will then relate to the univari- potential explanatory factors that may be ate case. At this stage, a decision is also included in various tentative model specifica- necessary in relation to the data frequency. Is tions. Social media data may be just a part the predictive model supposed to be applied of such data and it will be important to also to forecast monthly sale, quarterly sales or include other variables. The quality as well sales of an even higher frequency like weekly as the quantity of data is very important for or daily? building a successful predictive model.

BK-SAGE-SLOAN_QUAN-HAASE-160238-Chp20.indd 330 23/09/16 5:06 PM Predictive Analytics with Social Media Data 331

Social media data Where f describes some relationship between and pre-processing what is inside the parenthesis and y. In principle, linear, non-linear, parametric, When researchers consider using social non-parametric and semi-parametric models media data for predictive purposes, at the may be considered. In general, non-linear outset the social media data will be collected models will require more data points/obser- at the level of the individual action (e.g. a vations than linear models as the structures Facebook ‘like’ or a tweet) and in order to they search for are more complex. prepare the data to enter a predictive model There is a range of possible starting points some pre-processing will be necessary. Often for the search process. At one end lies tradi- the data will need to be temporally aggre- tional econometrics where the starting point gated to match the temporal aggregation is often an economic or behavioural theory level of the outcome, for example, monthly that will guide the researcher in finding a set data. Also as some of the inputs from social of potential explanatory factors. At the other media are text variables, some filtering, inter- end of the range machine learning algorithms pretation, and classification may be neces- will help identify a relationship from a large sary. An example of the latter would be the set of social media data and other potential application of a supervised machine learning explanatory factors. The advantage of start- algorithm that classifies the posts and com- ing from a theory-based model specification ments into positive, negative or neutral senti- is that the researcher may be more confident ments (Thelwall, Chapter 32, this volume). that the model is robust in the sense that the At the current moment it is mainly the pre- identified relationship is reliable at least for processing of the social media data that is some period of time. Without a theory the considered challenging from the computa- identified structure may still work for pre- tional aspects of big data analytics (Council, dictions in the short run but may be less 2013). Once the individual actions (posts, robust and in general will not add much to likes, etc.) are temporally aggregated and an understanding of the phenomenon at hand. classified, the set of potential explanatory In between pure theoretically inspired mod- factors are usually rather limited and as the els and models based on data pattern discov- outcome variables are of fairly low frequen- eries are many models that include elements cies like monthly or quarterly (stock market of both categories. As theoretical models are data are actually sometimes used at a daily often more precise when it comes to selec- frequency) which means that the modelling tion of explanatory factors for the more fun- process deviates less from more classical damental or long-run relationships they may approaches within predictive modelling. be less precise when it comes to a description of dynamics and a combination that allows for a primary theoretically based long-run In search of a model equation – part may prove more useful. theory-based versus data-driven? To finalize the discussion of theory-based versus data-driven model selection the con- In very general terms a model equation will cept of causality is often useful. If a causal identify some relationship between the phe- relationship exists a change in an explanatory nomenon of interest (y) and a set of explana- factor is known to imply a change in the out- tory factors. The relationship will never be come. A model that suffers from a lack of a perfect either due to un-observable factors, causal relationship suffers from an endogene- measurement errors or other types of errors. ity problem (a concept used in econometrics). The general equation: y = f(explanatory factors) A model that suffers from an endogene- + error ity problem will not be useful for tests of a

BK-SAGE-SLOAN_QUAN-HAASE-160238-Chp20.indd 331 23/09/16 5:06 PM 332 The SAGE Handbook of Social Media Research Methods

theory of for policy evaluations. If the only outcome of interest is by investigating the purpose of the model is forecasting, iden- out-of-sample properties of the model. tification of a causal relationship is of less This statement calls for the need of an esti- importance as a strong association between mation (or training) sample and an evaluation the explanatory factors and the outcome may (or test) sample. As a good in-sample model be sufficient. However, without causality fit does not ensure good forecasting proper- the predictive model may be considered less ties of a predictive model, the evaluation robust (more risk of a model break-down) to process then naturally starts by an analysis general changes in structures and society and of the in-sample properties of the model and hence may be best at forecasting in the short extends to an out-of-sample analysis. run. If this is the case, some sort of monitor- ing on a continuous basis to identify a model break-down at an early stage is advisable. In-sample evaluation of the model The first thing to note is that if the model has Fitting of a predictive model a theoretical foundation the signs of the esti- In this step the researcher will adapt the mated coefficients will be compared to the mathematical specification of the predictive signs expected from the theory. model to the actual data. In the case of a A second thing to be aware of is whether linear regression model this is done by esti- the model fulfils the underlying statistical mation using the ordinary least squares assumptions (these may differ depending on (OLS) method or the maximum likelihood the type of model in focus). In classical linear (ML). For non-linear models such as neural regression modelling, problems such as auto- networks, some mathematical algorithm is correlation and heteroscedasticity will need used. In rare cases estimation of a model attention and a study of potential outliers is is not possible (e.g. in case of perfect multi- of high importance. When forecasting is the collinearity of a linear regression model). In final purpose of the model multicollinearity such a case the researcher has to re-think the is of less importance. Finally, indicators in model specification. relation to the functional form specification Estimation (the use of a formula or a proce- may provide useful information on how to dure) may in itself sound simple, but already improve the model. The overall fit of the model may be cap- at this stage the researcher has to specify the 2 2 set-up to be used for model evaluation in the tured by measures such as R , adjusted R , following step as they are highly dependent. the family for measures based on absolute Even though it may seem natural to use or squared errors (e.g. MSE, RMSE, MAE, as many data point as possible for the model MAPE), and information criteria such as fitting, there are other considerations to take AIC, and BIS. A small warning is justified into account as well. For the estimation step, here as too much emphasis on obtaining a it is stressed that in addition to the decision good fit may result in overfitting of the model of estimation or fitting method, a decision on which is not necessarily desirable when the exactly which sample or part of the sample to purpose of the model is forecasting. use for estimation is of importance too.

Out-of-sample evaluation Evaluation of a predictive model for forecasting purposes For an out-of-sample evaluation study the model is used to forecast values for a time The true test of a predictive model that is to period that was not used for the estimation be used for forecasting of future values of the of the model. In the ‘pure’ case neither

BK-SAGE-SLOAN_QUAN-HAASE-160238-Chp20.indd 332 23/09/16 5:06 PM Predictive Analytics with Social Media Data 333

future values of the explanatory factors Finally, from a practical perspective a nor future values of the outcome are known combination of forecasts from different basic and the model that is used to obtain the fore- predictive models is also a possibility and cast will need to rely on lagged values of the quite popular in certain fields. explanatory factors or to use predicted values of the explanatory factors. In the former case, the specification of the model equation in terms of lags will set a limit to how many Categorized List of Predictive periods into the future the model can predict. Models with Social Media Data In many cases an out-of-sample forecast evaluation will rely on sets of one step ahead Table 20.1 below presents a selected list of predictions, but predictions for a longer fore- research papers on predictive analytics with cast horizon (e.g. six months ahead for a social media data categorized across differ- model specified with monthly data) are also ent application domains in terms of social sometimes considered. media platform (Facebook, Twitter, etc.) and Once the out-of-sample forecasts are the independent and dependent variables obtained it is possible to calculate forecast involved. For conceptual exposition and lit- errors and to study their patterns. Focus erature review on the predictive power of areas will be of directional nature (the trend social media data (see Gayo-Avello et al. in the outcome captured), as they may be (2013)). related to predictability of turning points and summary measures for the errors will again prove useful (e.g. MSE, MAPE, etc.) Application Domains but this time for the forecasted period only. The idea of splitting the sample into different As can be seen from Table 20.1, there have parts for evaluation can be extended in vari- been many predictive models of sales based ous ways using cross-validation (Hyndman on social media data. Such predictive models & Athanasopoulos, 2014). work for the brands that can command large amounts of human attention on social media, and therefore generate big data on social Using a predictive model for media. Examples are iPhone sales, H&M revenues, Nike sales, etc., which are all prod- forecasting purposes uct categories around which there is a possi- Once a model has been chosen some con- bility to have large volumes and ranges of siderations concerning its implementation opinions on social media platforms. For are important. This topic is very much brands and products that don’t generate large related to the overall phenomenon and prob- volumes of social media data, for instance, lem; hence a general discussion is difficult insurance, banking, shipping, basic house- to provide. hold supplies, etc. the predictive models tend There is, however, one type of considera- not to work. One explanation for the success- tions that deserves mentioning: how often ful performance of the predictive models is the model needs re-estimation or specifica- that social media actions can be categorized tion updating. Given that often the general into the phases of the different domain-spe- data pattern is quite robust, the specification cific models from the application domains of updating may only take place in case of new marketing, finance, epidemiology, etc. For variables becoming available or in case a suf- example, the actual stock price for Apple is ficiently large number of data points have in rough terms mainly based on discounted become available such that more complex historical sales and expectations to future structures could be allowed for. sales. If social media can model sales, then

BK-SAGE-SLOAN_QUAN-HAASE-160238-Chp20.indd 333 23/09/16 5:06 PM The SAGE Handbook of Social Media Research Methods Model Effects Models Markov Network Markov Likelihood Principle Likelihood Bigrams Time-Series Multiple Regression Model Time-Series Time-Series Multiple Regression Model Time-Series Time-Series Multiple Regression Model Time-Series Time Series Linear Regression Model Time ARIMA/Time Series Multiple Regression ARIMA/Time Simple Seasonal AR Models and Fixed- Simple Seasonal Linear Regression SVM trained on hashtag metadata Multivariate Distribution Models Multivariate Time-Series Multiple Regression Model Time-Series Time-Series Multiple Regression Model Time-Series Multi-Variate/Linear Regression Multi-Variate/Linear Time-Series using Cross-Correlation Time-Series Time-Series Linear Regression Models Time-Series Unsupervised Bayesian Model based on Poisson Regression Model using Maximum Poisson Statistical Methods SVR Regression using Unigrams and ata D ocial Media distribution 3-month Treasury Bills I and stock prices Treasury 3-month Trend, Google s (measured on S&P 500), and consumer spending t-1 HSX, Twitter, RottenTomatoes.com Twitter, HSX, within n days after publication) Twitter activity Twitter S Twitter activity, sentiment and theatre activity, Twitter Twitter activity and sentiment Twitter Calm, Alert, Sure, Vital, Kind and Happy Vital, Sure, Alert, Calm, Google trend data car names Real personal income y, interest rates on Real personal income y, Historical sales and Google trend variable Twitter collective sentiment Twitter Twitter hashtags Twitter Measures from IMDB, Flixster, Yahoomovies, Yahoomovies, Flixster, Measures from IMDB, Twitter keywords Twitter Twitter activity and sentiment Twitter Twimpact variable (number of tweetations variable Twimpact Product/brand mentions Twitter sentiment variables Twitter Twitter texts about flu Twitter Team and player mentions Team Independent Variables unigrams and bigrams, Historical prices, nalytics with A Movie revenue iPhone sales Dow Jones Industrial Average Dow Jones Industrial Car sales Consumer spending Sales of cars, homes and travel Sales of cars, Political election outcome Political Political alignment Political Movie Academy Award Movie Academy Award winners Detecting influenza outbreaks Many types of sales Total number of citations Total Sales Brand variables Early stage influenza detection Sport results and number of goals Dependent Variables Stock-Prices esearch Publications on Predictive Publications on Predictive esearch Movies, HSX, Twitter, Twitter, HSX, Movies, RottenTomatoes.com R Twitter Twitter Twitter Google Trends Google Trends Google Trends Twitter Twitter IMDB, Flixster, Yahoo Yahoo Flixster, IMDB, Twitter Twitter Twitter Blogs Twitter Twitter Tumblr Social Data Twitter ategorization of c Flammini, & Menczer (2011) Flammini, (2010) Helden (2015) Tomkins (2005) Tomkins (2009) Bhamidipati (2014) able 20.1 T Asur & Huberman (2010) Lassen et al. (2014) Lassen et al. Bollen & Mao (2011) Voortman (2015) Voortman Vosen & Schmidt (2011) Vosen Choi & Varian (2012) Varian Choi & Chung & Mustafaraj (2011) Conover, Gonçalves, Ratkiewicz, Ratkiewicz, Gonçalves, Conover, Bothos, Apostolou, & Mentzas Apostolou, Bothos, Culotta (2010) Dijkman, Ipeirotis, Aertsen, & van & van Aertsen, Ipeirotis, Dijkman, Eysenbach (2011) Gruhl, Guha, Kumar, Novak, & Novak, Kumar, Guha, Gruhl, Jansen, Zhang, Sobel, & Chowdury Sobel, Zhang, Jansen, Li & Cardie (2013) Radosavljevic, Grbovic, Djuric, & Djuric, Grbovic, Radosavljevic, Reference & Klein (2009) Osborne, Ritterman,

BK-SAGE-SLOAN_QUAN-HAASE-160238-Chp20.indd 334 23/09/16 5:06 PM BK-SAGE-SLOAN_QUAN-HAASE-160238-Chp20.indd 335

Sang & Bos (2012) Twitter Dutch election outcome Twitter texts and sentiments Time Series Multiple Regression Model Shen, Wang, Luo, & Wang (2013) Twitter Entity belonging Twitter texts Decision tree, KAURI/LINDEN method Skoric et al. (2012) Twitter Election outcome Singapore Twitter activity Time Series Linear Regression Model Tsakalidis et al. (2015) Twitter Election outcomes EU Twitter texts Linear Regression (LR), Gaussian Process (GP) and Sequential Minimal Optimization for Regression (SMO) Tumasjan, Sprenger, Sandner, & Twitter Election outcomes Germany Twitter texts Probability Models Welpe (2010) Yu, Duan, & Cao (2013) Google blogs, Boardreader Firm equity value Variables for activity and sentiment Time Series Multiple Regression Model and Twitter compared to Google News Hughes, Rowe, Batey, & Lee (2012) Twitter and Facebook Socialising and info exchange Big5 personality traits, NFC and sociability Time-Series Multiple Regression Model Krauss, Nann, Simon, Gloor, & Forums Movie success and academy awards Intensity, positivity and trendsetter variables Time-Series Multiple Regression Model Fischbach (2008) Seiffertt & Wunsch (2008) Several Variables on financial markets Many types discussed Different Model Types Discussed Tang & Liu (2010) Flickr and YouTube Online behaviors Social Dimension variables SocioDim, several advanced models combined Karabulut (2013) Facebook Stock prices Facebook GNH (General national happiness), Time-Series Multiple Regression Model positivity, negativity Mao, Counts, & Bollen (2014) Twitter UK, US, and Canadian stock markets "Bullish" or "bearish" mentions on Twitter Time-Series Multiple Regression Model Bollen, Mao, & Zeng (2011) Twitter DJIA Twitter moods and feelings Self-Organizing Fuzzy Neural Network Bollen, Mao, & Pepe (2011) Twitter Socio-economic events Twitter moods and feelings Extended version of: Profile of mood states Eichstaedt et al. (2015) Twitter Heart attacks Anger, stress and fatigue Time-Series Multiple Regression Model De Choudhury, Gamon, Counts, & Twitter Depression Language, emotion, style, ego-network, and Support Vector Machine Horvitz (2013) user engagement De Choudhury, Counts, & Horvitz Twitter Postpartum changes in emotion and Engagement, emotion, ego-network and Support Vector Machine (2013) behaviour linguistic style De Choudhury, Counts, Horvitz, & Facebook Postpartum depression Social activity and interaction OLS Regression Model Hoff (2014) Weeks & Holbert (2013) Facebook, Twitter, YouTube Dissemination of News Content in Gender, age, web engine news search, email Decision Tree Model Social Media news activity and cell phone activity Gilbert & Karahalios (2009) Facebook Tie strength 15 communication variables Time-Series Multiple Regression Model Won et al. (2013) Weblog social media data Suicide Suicide related words and mentions Time-Series Multiple Regression Model 23/09/16 5:06 PM 336 The SAGE Handbook of Social Media Research Methods

there is a high potential for the associated Decision Trees, ARIMA, Dynamic Systems, stock price to also being modelled with Bayesian Networks, and combined models. social media data. In the case of epidemiol- In the next section, we present an illustra- ogy, all social media texts on flu can also be tive case study of predictive modelling with categorized in to the different domain-spe- big social data. cific phases of spread, incubation, immunity, resistance, susceptibility etc.

An Illustrative Case Study of Predictive Modelling Social Media Data Types For modelling stock prices, Twitter and In this section, we demonstrate how social Google Trends have proven to be the best media data from Twitter and Facebook can platforms. Twitter and Google Trends beat be used to predict the quarterly sales of Facebook for stock price modelling because iPhones and revenues of clothing retailer, of higher data volume and immediacy. On H&M, respectively. Based on a conceptual the other hand, Facebook data have been suc- model of social data (Vatrapu, Mukkamala, cessfully used for modelling sales, human & Hussain, 2014) consisting of Interactions emotions, personalities and human relations (actors, actions, activities, and artifacts) and to a brand. In general, picture and video Conversations (topics, keywords, pronouns, based social media platforms such as and sentiments), and drawing from the Instagram, YouTube and Netflix are becom- domain-specific theories in advertising and ing more prevalent and we expect them to sales from marketing (Belch, Belch, Kerr, & become more relevant for predictive models Powell, 2008), we developed and evaluated in the future. linear regression models that transform (a) iPhone tweets into a prediction of the quarterly iPhone sales with an average error close to the established prediction models from Independent and Dependent investment banks (Lassen et al., 2014) and Variables (b) Facebook likes into a prediction of the As can be seen from Table 20.1, a wide range global revenue of the fast fashion company, of dependent variables have been modelled: H&M. Our basic premise is that social media sales, stock prices, Net Promoter Score, hap- actions can serve as proxies for user’s atten- piness, feelings, personalities, interest areas, tion and as such have predictive power. The social groups, diseases, epidemics, suicide, central research question for this demonstra- crime, radicalization, civil unrest. The inde- tive case study was: To what extent can Big pendent variables used reflect the human Social Data predict real-world outcomes social relations to the dependent variables such as sales and revenues? Table 20.2 mainly consist of measures of social media below presents the dataset collected for pre- activity, feelings, personalities and dictive analytics purposes of this case study. sentiment. We adhered to the methodological sche- matic recommended by Shmueli & Koppius (2011) for building empirical predictive Statistical Methods Employed models. We built on and extended the predictive analytics method of Asur & Huberman We find that a wide range of statistical models (2010) and examined if the principles for for predictive analytics have been used includ- predicting movie revenue with Twitter data ing Regression, Neural Network, SVM, can also be used to predict iPhone sales and

BK-SAGE-SLOAN_QUAN-HAASE-160238-Chp20.indd 336 23/09/16 5:06 PM Predictive Analytics with Social Media Data 337

Table 20.2 overview of Dataset Company Data Source Time Period Size of Dataset

Apple Twitter 01–2007 to 10–2014 ∼500 million+ tweets containing "iPhone" Collected using Topsy Pro (http://topsy.thisisthebrigade.com) H&M Facebook 01–2009 to 10–2014 ∼15 million data points from the official H&M Facebook page Collected using the Social Data Analytics Tool (Hussain & Vatrapu, 2014)

H&M revenues for Facebook data. That is, if 2012–2014. In the case of the iPhone sales a tweet/like can serve as a proxy for a user’s prediction model, our average error of 5% attention towards a product and an underlying is not that far from the industry benchmark intention to purchase and/or recommend it. predictions of Morgan Stanley and IDC. That We extend Asur & Huberman (2010) in three said, there are several challenges and limi- important ways: (a) addition of Facebook tations to the predictive analytics processes social data, (b) theoretically informed time and their outcomes. First, we lack multiple lagging of the independent variable, social cases to extensively evaluate and validate the media actions, and (c) domain-specific sea- overall prediction model. A second limita- sonal weighting of the dependent variable, tion is the emerging challenge for predictive sales/revenues. Figures 20.1 and 20.2 pre- analytics from social data associated with sent the predicted vs. actual charts for Apple increasing sales in emerging markets such as iPhone sales and H&M revenues respectively. China with its own unique social media eco- With regard to our prediction models, system. By and large, the social media eco- we observed a 5–10% average error from system of China does not overlap with that our predictive models with the actual sales of Western countries to which Facebook and and revenue data over three-year period of Twitter belong. We suspect that the effect of

Figure 20.1 Predictive Model of iPhone Figure 20.2 Predictive Model of H&M Sales from Twitter Data Revenues from Facebook Data

BK-SAGE-SLOAN_QUAN-HAASE-160238-Chp20.indd 337 23/09/16 5:06 PM 338 The SAGE Handbook of Social Media Research Methods

non-overlapping social media ecosystems set of information. It is necessarily back- might be somewhat ameliorated for Veblen ward-looking as it relies on historical data goods such as iPhones given the conspicuous and irrespectively of how carefully the model consumption aspirations of a global middle specification and evaluation is done, there is class. This however remains an analytical no guarantee that the prediction of future val- challenge and restricts the predictive power ues of the variable of interest will be reliable. of our H&M prediction model. The patterns or theories that the model relies on may break down and render the model useless for predictive purposes. That being Conclusion said, careful predictive modelling is probably the best that can be done and, if applied and Predictive models offer powerful tools as used following the state of the art with most numerical forecasts and assessments of their emphasis placed on short term forecasting, uncertainty alongside quantitative statements predictive modelling is a very valuable tool. more generally may improve decisions in companies and by public authorities. The overall advice is to go for a parsimo- nious, simple model that captures the most Acknowledgements important features of the data, that fulfils the model assumptions and that provides a good fit both in sample and out of sample. We thank the members of the Centre for Furthermore, it is important that even during Business Data Analytics (http://bda.cbs.dk) the phase where the model is applied for its for their feedback. purpose, it performance is still monitored. The authors were partially supported We present a general model for predictive by the project Big Social Data Analytics: analytics of business outcomes from social Branding Algorithms, Predictive Models, and media data below. Dashboards funded by Industriens Fond (The Danish Industry Foundation). Any opinions,

ytt=×ββapAP+×tt+×ββdoDO+×tt+ ε findings, interpretations, conclusions or rec- ommendations expressed in this chapter are Where: those of its authors and do not represent the views of the Industriens Fond (The Danish yt = Outcome variable of interest Industry Foundation). At = Accumulated time-lagged social media activity associated with outcome variable at time t References At = Σ Ast

Asur, S., & Huberman, B. A. (2010). Predicting Ast = Social media activity in terms of actions by actors on artifacts associated with the future with social media. Paper presented at the IEEE/WIC/ACM International outcome variable at time t Conference on Web Intelligence and Intelli- Pt = Individual or social psychological gent Agent Technology (WI-IAT). attribute(s) at time t Belch, G. E., Belch, M. A., Kerr, G. F., & Powell, Dt = Social media dissemination factors I. (2008). Advertising and promotion: An Ot = Other explanatory factors integrated marketing communications perspective: McGraw-Hill, London. A final word of caution will end this chap- Bollen, J., & Mao, H. (2011). Twitter mood as a ter: any predictive model is based on a certain stock market predictor. Computer, 91–94.

BK-SAGE-SLOAN_QUAN-HAASE-160238-Chp20.indd 338 23/09/16 5:06 PM Predictive Analytics with Social Media Data 339

Bollen, J., Mao, H., & Pepe, A. (2011). Mode- De Choudhury, M., Counts, S., & Horvitz, E. ling public mood and emotion: Twitter senti- (2013). Predicting postpartum changes in ment and socio-economic phenomena. emotion and behavior via social media. ICWSM, 11, 450–453. Paper presented at the Proceedings of the Bollen, J., Mao, H., & Zeng, X. (2011). Twitter SIGCHI Conference on Human Factors in mood predicts the stock market. Journal of Computing Systems. Computational Science, 2(1), 1–8. De Choudhury, M., Counts, S., Horvitz, E. J., & Bothos, E., Apostolou, D., & Mentzas, G. Hoff, A. (2014). Characterizing and predict- (2010). Using Social Media to Predict Future ing postpartum depression from shared Events with Agent-Based Markets. IEEE Intel- Facebook data. Paper presented at the Pro- ligent Systems, 25(6), 50–58. ceedings of the 17th ACM conference on Chatfield, C. (2002). Confessions of a prag- Computer supported cooperative work and matic statistician. Journal of the Royal Statis- social computing. tical Society: Series D (The Statistician), 51(1), De Choudhury, M., Gamon, M., Counts, S., & 1–20. Horvitz, E. (2013). Predicting Depression via Choi, H., & Varian, H. (2012). Predicting the Social Media. Paper presented at the ICWSM. present with Google Trends. Economic Dijkman, R., Ipeirotis, P., Aertsen, F., & van Helden, Record, 88(s1), 2–9. R. (2015). Using Twitter to predict sales: a Chunara, R., Andrews, J. R., & Brownstein, J. S. case study. arXiv preprint arXiv:1503.04599. (2012). Social and News Media Enable Esti- Eichstaedt, J.C., Schwartz, H.A., Kern, M.L., mation of Epidemiological Patterns Early in Park, G., Labarthe, D.R., Merchant, R.M., the 2010 Haitian Cholera Outbreak. Ameri- Jha, S., Agrawal, M., Dziurzynski, L.A., Sap, can Journal of Tropical Medicine and M. and Weeg, C. Psychological language on Hygiene, 86(1), 39–45. doi: 10.4269/ Twitter predicts county-level heart disease ajtmh.2012.11–0597 mortality. Psychological science, 26(2), Chung, J. E., & Mustafaraj, E. (2011). Can col- 159–169. lective sentiment expressed on Twitter pre- Eysenbach, G. (2011). Can tweets predict cita- dict political elections? Paper presented at tions? Metrics of social impact based on the AAAI. Twitter and correlation with traditional met- Cioffi-Revilla, C. (2013). Introduction to Compu- rics of scientific impact. Journal of medical tational Social Science: Principles and Applica- Internet Research, 13(4), e123. tions: Springer Science & Business Media. Evangelos K, Efthimios T and Konstantinos T. Conover, M. D., Gonçalves, B., Ratkiewicz, J., (2013) Understanding the predictive power Flammini, A., & Menczer, F. (2011). Predict- of social media. Internet Research, 23(5), ing the political alignment of Twitter users. 544–559. Paper presented at the Privacy, Security, Risk Gilbert, E., & Karahalios, K. (2009). Predicting tie and Trust (PASSAT) and 2011 IEEE 3rd Inter- strength with social media. Paper presented national Conference on Social Computing at the Proceedings of the SIGCHI conference (SocialCom). on human factors in computing systems. Conte, R., Gilbert, N., Bonelli, G., Cioffi-Revilla, Golder, S. A., & Macy, M. W. (2011). Diurnal C., Deffuant, G., Kertesz, J., Loreto, V., and seasonal mood vary with work, sleep, Moat, S., Nadal, J.P., Sanchez, A., Nowak, and daylength across diverse cultures. Sci- A & Helbing, D. (2012). Manifesto of compu- ence, 333(6051), 1878–1881. tational social science. European Physical Gruhl, D., Guha, R., Kumar, R., Novak, J., & Journal, 214(1), 325–346. Tomkins, A. (2005). The predictive power of Council, N. (2013). Frontiers in massive data online chatter. Paper presented at the Pro- analysis: The National Academies Press ceedings of the 11th ACM SIGKDD interna- Washington, DC. tional conference on knowledge discovery in Culotta, A. (2010). Towards detecting influenza data mining. epidemics by analyzing Twitter messages. Hughes, D. J., Rowe, M., Batey, M., & Lee, A. Paper presented at the Proceedings of the (2012). A tale of two sites: Twitter vs. Face- first workshop on social media analytics. book and the personality predictors of social

BK-SAGE-SLOAN_QUAN-HAASE-160238-Chp20.indd 339 23/09/16 5:06 PM 340 The SAGE Handbook of Social Media Research Methods

media usage. Computers in Human Behav- Data for Forecasting and Statistics, Frankfurt, ior, 28(2), 561–569. Germany. Hussain, A., & Vatrapu, R. (2014). Social Data Radosavljevic, V., Grbovic, M., Djuric, N., & Bha- Analytics Tool (SODATO). In M. Tremblay, D. midipati, N. (2014). Large-scale World Cup VanderMeer, M. Rothenberger, A. Gupta, & 2014 outcome prediction based on Tumblr V. Yoon (Eds.), Advancing the Impact of posts. Paper presented at the KDD Workshop Design Science: Moving from Theory to Prac- on Large-Scale Sports Analytics , New York. tice (Vol. 8463, pp. 368–372): Springer Inter- Ritterman, J., Osborne, M., & Klein, E. (2009). national Publishing, Switzerland. Using prediction markets and Twitter to pre- Hussain, A., Vatrapu, R., Hardt, D., & Jaffari, Z. dict a swine flu pandemic. Paper presented (2014). Social Data Analytics Tool: A Demon- at the 1st international workshop on mining strative Case Study of Methodology and Soft- social media, Sevilla, Spain. ware. In M. Cantijoch, R. Gibson, & S. Ward Robertson, S., & Vatrapu, R. (2010). Digital (Eds.), Analyzing Social Media Data and Web Government. In B. Cronin (Ed.), Annual Networks (pp. 99–118): Palgrave Macmillan, Review of Information Science and Technol- UK.. ogy (Vol. 44, pp. 317–364). Hyndman, R. J., & Athanasopoulos, G. (2014). Robertson, S., Vatrapu, R., & Medina, R. Forecasting: principles and practice: OTexts: (2010a). Off the wall political discourse: https://www.otexts.org/fpp/ Facebook use in the 2008 US Presidential Jansen, B. J., Zhang, M., Sobel, K., & Chow- election. Information Polity, 15(1), 11–31. dury, A. (2009). Twitter power: Tweets as Robertson, S., Vatrapu, R., & Medina, R. (2010b). electronic word of mouth. Journal of the Online Video “Friends” Social Networking: American Society for Information Science Overlapping Online Public Spheres in the and Technology, 60(11), 2169–2188. 2008 U.S. Presidential Election. Journal of Karabulut, Y. (2013). Can Facebook predict Information Technology & Politics, 7(2–3), stock market activity? Paper presented at the 182–201. doi:10.1080/19331681003753420 AFA 2013 San Diego Meetings Paper. Sang, E. T. K., & Bos, J. (2012). Predicting the Krauss, J., Nann, S., Simon, D., Gloor, P. A., & 2011 Dutch senate election results with Twit- Fischbach, K. (2008). Predicting Movie Suc- ter. Paper presented at the Proceedings of cess and Academy Awards through Senti- the Workshop on Semantic Analysis in Social ment and Social Network Analysis. Paper Media, Avignon, France. presented at the ECIS. Seiffertt, J., & Wunsch, D. (2008). Intelligence Lassen, N., Madsen, R., & Vatrapu, R. (2014). in Markets: Asset Pricing, Mechanism Design, Predicting iPhone Sales from iPhone Tweets. and Natural Computation [Technology Proceedings of IEEE 18th International Enter- Review]. Computational Intelligence Maga- prise Distributed Object Computing Confer- zine, IEEE, 3(4), 27–30. ence (EDOC 2014), Ulm, Germany, 81–90, Shen, W., Wang, J., Luo, P., & Wang, M. ISBN: 1541–7719/1514, doi: 1510.1109/ (2013). Linking named entities in tweets EDOC.2014.1520. with knowledge base via user interest Lazer, D., Pentland, A.S., Adamic, L., Aral, S., modeling. Paper presented at the Proceed- Barabasi, A.L., Brewer, D., Christakis, N., ings of the 19th ACM SIGKDD international Contractor, N., Fowler, J., Gutmann, M. and conference on Knowledge discovery and Jebara, T. Computational Social Science. Sci- data mining, Chicago. ence, 323(5915), 721–723. doi: 10.1126/ Shmueli, G. (2010). To explain or to predict? science.1167742 Statistical Science, 25(3) 289–310. Li, J., & Cardie, C. (2013). Early stage influenza Shmueli, G., & Koppius, O. R. (2011). Predictive detection from Twitter. arXiv preprint analytics in information systems research. arXiv:1309.7340. MIS Quarterly, 35(3), 553–572. Mao, H., Counts, S., & Bollen, J. (2014). Quan- Skoric, M., Poor, N., Achananuparp, P., Lim, E.- tifying the effects of online bullishness on P., & Jiang, J. (2012). Tweets and votes: A international financial markets. Paper pre- study of the 2011 Singapore general elec- sented at the ECB Workshop on Using Big tion. Paper presented at the System Science

BK-SAGE-SLOAN_QUAN-HAASE-160238-Chp20.indd 340 23/09/16 5:06 PM Predictive Analytics with Social Media Data 341

(HICSS), 45th Hawaii International Confer- Social set analysis: four demonstrative case ence on System Sciences, Hawaii. studies. Proceedings of the 2015 Interna- Tang, L., & Liu, H. (2010). Toward predicting col- tional Conference on Social Media & Society. lective behavior via social dimension extrac- doi:10.1145/2789187.2789203 tion. Intelligent Systems, IEEE, 25(4), 19–25. Voortman, M. (2015). Validity and reliability Tsakalidis, A., Papadopoulos, S., Cristea, A. I., & of web search based predictions for car Kompatsiaris, Y. (2015). Predicting elections sales. for multiple countries using Twitter and polls. Vosen, S., & Schmidt, T. (2011). Forecasting Intelligent Systems, IEEE, 30(2), 10–17. private consumption: survey-based indica- Tumasjan, A., Sprenger, T. O., Sandner, P. G., & tors vs. Google trends. Journal of Forecast- Welpe, I. M. (2010). Predicting elections with ing, 30(6), 565–578. Twitter: What 140 characters reveal about Weeks, B. E., & Holbert, R. L. (2013). Predicting political sentiment. ICWSM, 10, 178–185. dissemination of news content in social Vatrapu, R., Mukkamala, R., & Hussain, A. media a focus on reception, friending, and (2014). A Set Theoretical Approach to Big partisanship. Journalism & Mass Communi- Social Data Analytics: Concepts, Methods, cation Quarterly, 90(2), 212–232. Tools, and Findings. Paper presented at the Won, H.-H., Myung, W., Song, G.-Y., Lee, W.- Computational Social Science Workshop at H., Kim, J.-W., Carroll, B. J., & Kim, D. K. the European Conference on Complex Sys- (2013). Predicting national suicide numbers tems 2014, Lucca. with social media data. PloS one, 8(4), Vatrapu, R., Robertson, S., & Dissanayake, W. e61809. (2008). Are Political Weblogs Public Spheres Yu, Y., Duan, W., & Cao, Q. (2013). The or Partisan Spheres? International Reports on impact of social and conventional media Socio-Informatics, 5(1), 7–26. on firm equity value: A sentiment analysis Vatrapu, R., Hussain, A., Lassen, N. B., Muk- approach. Decision Support Systems, kamala, R., Flesch, B., & Madsen, R. (2015). 55(4), 919–926.

BK-SAGE-SLOAN_QUAN-HAASE-160238-Chp20.indd 341 23/09/16 5:06 PM