Journal of Economic Literature 2019, 57(3), 535–574 https://doi.org/10.1257/jel.20181020

Text as Data†

Matthew Gentzkow, Bryan Kelly, and Matt Taddy*

An ever-increasing share of human interaction, communication, and culture is recorded as digital text. We provide an introduction to the use of text as an input to economic research. We discuss the features that make text different from other forms of data, offer a practical overview of relevant statistical methods, and survey a variety of applications. (JEL C38, C55, L82, Z13)

* Gentzkow: Stanford University. Kelly: Yale University and AQR Capital Management. Taddy: University of Chicago Booth School of Business.
† Go to https://doi.org/10.1257/jel.20181020 to visit the article page and view author disclosure statement(s).

1. Introduction

New technologies have made available vast quantities of digital text, recording an ever-increasing share of human interaction, communication, and culture. For social scientists, the information encoded in text is a rich complement to the more structured kinds of data traditionally used in research, and recent years have seen an explosion of empirical economics research using text as data.

To take just a few examples: In finance, text from financial news, social media, and company filings is used to predict asset price movements and study the causal impact of new information. In macroeconomics, text is used to forecast variation in inflation and unemployment, and to estimate the effects of policy uncertainty. In media economics, text from news and social media is used to study the drivers and effects of political slant. In industrial organization and marketing, text from advertisements and product reviews is used to study the drivers of consumer decision making. In political economy, text from politicians' speeches is used to study the dynamics of political agendas and debate.

The most important way that text differs from the kinds of data often used in economics is that text is inherently high-dimensional. Suppose that we have a sample of documents, each of which is w words long, and suppose that each word is drawn from a vocabulary of p possible words. Then the unique representation of these documents has dimension p^w. A sample of thirty-word Twitter messages that use only the one thousand most common words in the English language, for example, has roughly as many dimensions as there are atoms in the universe.

A consequence is that the statistical methods used to analyze text are closely related to those used to analyze high-dimensional data in other domains, such as machine learning and computational biology. Some methods, such as lasso and other penalized regressions, are applied to text more or less exactly as they are in other settings. Other methods, such as topic models and multinomial inverse regression, are close cousins of more general methods adapted to the specific structure of text data.
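The dimension count in the Twitter example is easy to verify directly. A back-of-the-envelope sketch (the atom count below is the commonly cited order-of-magnitude estimate, not a figure from this article):

```python
# Unique representations of a w-word document over a p-word vocabulary: p**w.
vocab_size = 1000        # the one thousand most common English words
doc_length = 30          # a thirty-word Twitter message

n_representations = vocab_size ** doc_length   # 1000^30 = 10^90
atoms_in_universe = 10 ** 80                   # common order-of-magnitude estimate

assert n_representations == 10 ** 90
assert n_representations > atoms_in_universe
```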

In all of the cases we consider, the analysis can be summarized in three steps:

1. Represent raw text D as a numerical array C;

2. Map C to predicted values V̂ of unknown outcomes V; and

3. Use V̂ in subsequent descriptive or causal analysis.

In the first step, the researcher must impose some preliminary restrictions to reduce the dimensionality of the data to a manageable level. Even the most cutting-edge high-dimensional techniques can make nothing of 1,000^30-dimensional raw Twitter data. In almost all the cases we discuss, the elements of C are counts of tokens: words, phrases, or other predefined features of text. This step may involve filtering out very common or uncommon words; dropping numbers, punctuation, or proper names; and restricting attention to a set of features such as words or phrases that are likely to be especially diagnostic. The mapping from raw text to C leverages prior information about the structure of language to reduce the dimensionality of the data prior to any statistical analysis.

The second step is where high-dimensional statistical methods are applied. In a classic example, the data is the text of emails, and the unknown variable of interest V is an indicator for whether the email is spam. The prediction V̂ determines whether or not to send the email to a spam filter. Another classic task is sentiment prediction (e.g., Pang, Lee, and Vaithyanathan 2002), where the unknown variable V is the true sentiment of a message (say, positive or negative), and the prediction V̂ might be used to identify positive reviews or comments about a product. A third task is predicting the incidence of local flu outbreaks from Google searches, where the outcome V is the true incidence of flu.

In these examples, and in the vast majority of settings where text analysis has been applied, the ultimate goal is prediction rather than causal inference. The interpretation of the mapping from C to V̂ is not usually an object of interest. Why certain words appear more often in spam, or why certain searches are correlated with flu, is not important so long as they generate highly accurate predictions. For example, Scott and Varian (2014, 2015) use data from Google searches to produce high-frequency estimates of macroeconomic variables such as unemployment claims, retail sales, and consumer sentiment that are otherwise available only at lower frequencies from survey data. Groseclose and Milyo (2005) compare the text of news outlets to speeches of congresspeople in order to estimate the outlets' political slant. A large literature in finance following Antweiler and Frank (2004) and Tetlock (2007) uses text from the internet or the news to predict stock prices.

In many social science studies, however, the goal is to go further and, in the third step, use text to infer causal relationships or the parameters of structural economic models. Stephens-Davidowitz (2014) uses Google search data to estimate local areas' racial animus, then studies the causal effect of racial animus on votes for Barack Obama in the 2008 election. Gentzkow and Shapiro (2010) use congressional and news text to estimate each news outlet's political slant, then study the supply and demand forces that determine slant in equilibrium. Engelberg and Parsons (2011) measure local news coverage of earnings announcements, then use the relationship between coverage and trading by local investors to separate the causal effect of news from other sources of correlation between news and stock prices.
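In code, the three steps amount to a simple pipeline skeleton. The function names below are hypothetical placeholders for whatever representation, prediction, and analysis choices a particular study makes:

```python
def text_pipeline(raw_docs, tokenize, predict, analyze):
    """Three-step skeleton: raw text -> array C -> predictions V-hat -> analysis.

    `tokenize`, `predict`, and `analyze` are hypothetical placeholders,
    not functions from any particular library.
    """
    C = [tokenize(doc) for doc in raw_docs]   # step 1: represent text as C
    V_hat = [predict(c) for c in C]           # step 2: map C to predictions
    return analyze(V_hat)                     # step 3: descriptive or causal analysis

# Toy usage: split into tokens, "predict" document length, sum in the analysis.
result = text_pipeline(["good night good night", "sweet sorrow"],
                       tokenize=str.split, predict=len, analyze=sum)
assert result == 6   # 4 tokens + 2 tokens
```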

In this paper, we provide an overview of methods for analyzing text and a survey of current applications in economics and related social sciences. The methods discussion is forward looking, providing an overview of methods that are currently applied in economics as well as those that we expect to have high value in the future. Our discussion of applications is selective and necessarily omits many worthy papers. We highlight examples that illustrate particular methods and use text data to make important substantive contributions even if they do not apply methods close to the frontier.

A number of other excellent surveys have been written in related areas. See Evans and Aceves (2016) and Grimmer and Stewart (2013) for related surveys focused on text analysis in sociology and political science, respectively. For methodological surveys, Bishop (2006), Hastie, Tibshirani, and Friedman (2009), and Murphy (2012) cover contemporary statistics and machine learning in general, while Jurafsky and Martin (2009) overview methods from computational linguistics and natural language processing. The Spring 2014 issue of the Journal of Economic Perspectives contains a symposium on "big data," which surveys broader applications of high-dimensional statistical methods to economics.

In section 2 we discuss representing text data as a manageable (though still high-dimensional) numerical array C; in section 3 we discuss methods from data mining and machine learning for predicting V from C. Section 4 then provides a selective survey of text analysis applications in social science, and section 5 concludes.

2. Representing Text as Data

When humans read text, they do not see a vector of dummy variables, nor a sequence of unrelated tokens. They interpret words in light of other words, and extract meaning from the text as a whole. It might seem obvious that any attempt to distill text into meaningful data must similarly take account of complex grammatical structures and rich interactions among words.

The field of computational linguistics has made tremendous progress in this kind of interpretation. Most of us have mobile phones that are capable of complex speech recognition. Algorithms exist to efficiently parse grammatical structure, disambiguate different senses of words, distinguish key points from secondary asides, and so on.

Yet virtually all analysis of text in the social sciences, like much of the text analysis in machine learning more generally, ignores the lion's share of this complexity. Raw text consists of an ordered sequence of language elements: words, punctuation, and white space. To reduce this to a simpler representation suitable for statistical analysis, we typically make three kinds of simplifications: dividing the text into individual documents i, reducing the number of language elements we consider, and limiting the extent to which we encode dependence among elements within documents. The result is a mapping from raw text to a numerical array C. A row c_i of C is a numerical vector with each element indicating the presence or count of a particular language token in document i.

2.1 What Is a Document?

The first step in constructing C is to divide raw text D into individual documents {D_i}. In many applications, this is governed by the level at which the attributes of interest V are defined. For spam detection, the outcome of interest is defined at the level of individual emails, so we want to divide text that way too. If V is daily stock price movements that we wish to predict from the prior day's news text, it might make sense to divide the news text by day as well.

In other cases, the natural way to define a document is not so clear. If we wish to
­predict legislators’ partisanship from their that occur fewer than k​ ​ times for some arbi- floor speeches (Gentzkow, Shapiro, and trary small integer ​k​. Taddy 2016), we could aggregate speech An approach that excludes both common so a document is a ­speaker–day, a ­speaker– and rare words and has proved very useful year, or all speech by a given speaker during in practice is filtering by term“­ frequency– the time she is in Congress. When we use inverse document frequency” (tf–idf).­ methods that treat documents as indepen- For a word or other feature ​j​ in document i​​, dent (which is true most of the time), finer term frequency (​t ​fij​​) is the count ​​cij​​ of occur- partitions will typically ease computation at rences of ​j​ in ​i​. Inverse document frequency the cost of limiting the dependence we are (​id ​f​j​​​) is the log of one over the share of able to capture. Theoretical guidance for the documents containing j​​: ​log(n d​ ​​)​ where ​​d​​ / j j right level of aggregation is often limited, so ​ i​ ​​​1​ ​c​​ 0 ​ and ​n​ is the total number of = ∑ [ ij> ] this is an important dimension along which documents. The object of interest ­tf–idf to check the sensitivity of results. is the product ​t ​f​​ id ​f​​. Very rare words ij × j will have low ­­tf–idf scores because ​t ​f​​ will 2.2 Feature Selection ij be low. Very common words that appear in To reduce the number of features to some- most or all documents will have low ­­tf–idf thing manageable, a common first step is to scores because id​ ​f​j​​​ will be low. (Note that strip out elements of the raw text other than this improves on simply excluding words that words. This might include punctuation, num- occur frequently because it will keep words bers, HTML tags, proper names, and so on. that occur frequently in some documents but It is also common to remove a subset of do not appear in others; these often provide words that are either very common or very useful information.) 
A common practice is to rare. Very common words, often called “stop keep only the words within each document ​i​ words,” include articles (“the,” “a”), conjunc- with ­­tf–idf scores above some rank or cutoff. tions (“and,” “or”), forms of the verb “to be,” A final step that is commonly used to and so on. These words are important to the reduce the feature space is stemming: replac- grammatical structure of sentences, but they ing words with their root such that, e.g., typically convey relatively little meaning on “economic,” “economics,” “economically” their own. The frequency of “the” is proba- are all replaced by the stem “economic.” The bly not very diagnostic of whether an email Porter stemmer (Porter 1980) is a standard is spam, for example. Common practice is stemming tool for English language text. to exclude stop words based on a ­predefined All of these cleaning steps reduce the list.1 Very rare words do convey meaning, but number of unique language elements we their added computational cost in expand- must consider and thus the dimensional- ing the set of features that must be consid- ity of the data. This can provide a massive ered often exceeds their diagnostic value. computational benefit, and it is also often A common­ approach is to exclude all words key to getting more interpretable model fits (e.g., in topic modeling). However, each of these steps requires careful decisions about 1 There is no single stop word list that has become a the elements likely to carry meaning in a standard. How aggressive one wants to be in filtering stop 2 words depends on the application. The web page http:// particular application. One researcher’s www.ranks.nl/stopwords shows several common stop word lists, including the one built into the database software SQL and the list claimed to have been used in early ver- 2 Denny and Spirling (2018) discuss the sensitivity of sions of Google search. 
2.3 n-grams

Producing a tractable representation also requires that we limit dependence among language elements. A fairly mild step in this direction, for example, might be to parse documents into distinct sentences and encode features of these sentences while ignoring the order in which they occur. The most common methodologies go much further.

The simplest and most common way to represent a document is called bag-of-words. The order of words is ignored altogether, and c_i is a vector whose length is equal to the number of words in the vocabulary and whose elements c_ij are the number of times word j occurs in document i. Suppose that the text of document i is

Good night, good night!
Parting is such sweet sorrow.

After stemming, removing stop words, and removing punctuation, we might be left with "good night good night part sweet sorrow." The bag-of-words representation would then have c_ij = 2 for j ∈ {good, night}, c_ij = 1 for j ∈ {part, sweet, sorrow}, and c_ij = 0 for all other words in the vocabulary.

This scheme can be extended to encode a limited amount of dependence by counting unique phrases rather than unique words. A phrase of length n is referred to as an n-gram. For example, in our snippet above, the count of 2-grams (or "bigrams") would have c_ij = 2 for j = good.night, c_ij = 1 for j including night.good, night.part, part.sweet, and sweet.sorrow, and c_ij = 0 for all other possible 2-grams. The bag-of-words representation then corresponds to counts of 1-grams.

Counting n-grams of order n > 1 yields data that describe a limited amount of the dependence between words. Specifically, the n-gram counts are sufficient for estimation of an n-order homogeneous Markov model across words (i.e., the model that arises if we assume that word choice is only dependent upon the previous n words). This can lead to richer modeling. In analysis of partisan speech, for example, single words are often insufficient to capture the patterns of interest: "death tax" and "tax break" are phrases with strong partisan overtones that are not evident if we look at the single words "death," "tax," and "break" (see, e.g., Gentzkow and Shapiro 2010).

Unfortunately, the dimension of c_i increases exponentially quickly with the order n of the phrases tracked. The majority of text analyses consider n-grams up to two or three at most, and the ubiquity of these simple representations (in both machine learning and social science) reflects a belief that the return to richer n-gram modeling is usually small relative to the cost. Best practice in many cases is to begin analysis by focusing on single words. Given the accuracy obtained with words alone, one can then evaluate if it is worth the extra time to move on to 2-grams or 3-grams.

2.4 Richer Representations

While rarely used in the social science literature to date, there is a vast array of methods from computational linguistics that capture richer features of text and may have high return in certain applications. One basic step beyond the simple n-gram counting above is to use sentence syntax to inform the text tokens used to summarize a document. For example, Goldberg and Orwant (2013) describe syntactic n-grams where words are grouped together whenever their meaning depends upon each other, according to a model of language syntax.

An alternative approach is to move beyond treating documents as counts of language tokens, and to instead consider the ordered sequence of transitions between words. In this case, one would typically break the document into sentences, and treat each as a separate unit for analysis. A single sentence of length s (i.e., containing s words) is then represented as a binary p × s matrix S, where the nonzero elements of S indicate occurrence of the row-word in the column-position within the sentence, and p is the length of the vocabulary. Such representations lead to a massive increase in the dimensions of the data to be modeled, and analysis of this data tends to proceed through word embedding: the mapping of words to a location in ℝ^K for some K ≪ p, such that the sentences are then sequences of points in this K-dimensional space. This is discussed in detail in section 3.3.
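The counting schemes of section 2.3 can be sketched in a few lines. Here is a minimal implementation applied to the snippet above (the helper function is our own, not a standard library routine):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count n-grams: contiguous runs of n tokens (n = 1 gives bag-of-words)."""
    return Counter(tuple(tokens[k:k + n]) for k in range(len(tokens) - n + 1))

tokens = "good night good night part sweet sorrow".split()

bag_of_words = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)

# As in the text: c_ij = 2 for "good" and "night", 1 for the others...
assert bag_of_words[("good",)] == 2 and bag_of_words[("sweet",)] == 1
# ...and good.night occurs twice, night.good once.
assert bigrams[("good", "night")] == 2 and bigrams[("night", "good")] == 1
```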
2.5 Other Practical Considerations

It is worth mentioning two details that can cause practical social science applications of these methods to diverge a bit from the ideal case considered in the statistics literature. First, researchers sometimes receive data in a pre-aggregated form. In the analysis of Google searches, for example, one might observe the number of searches containing each possible keyword on each day, but not the raw text of the individual searches. This means documents must be similarly aggregated (to days, rather than individual searches), and it also means that the natural representation where c_ij is the number of occurrences of word j on day i is not available. This is probably not a significant limitation, as the missing information (how many times per search a word occurs conditional on occurring at least once) is unlikely to be essential, but it is useful to note when mapping practice to theory.

A more serious issue is that researchers sometimes do not have direct access to the raw text and must access it through some interface such as a search engine. For example, Gentzkow and Shapiro (2010) count the number of newspaper articles containing partisan phrases by entering the phrases into a search interface (e.g., for the database ProQuest) and counting the number of matches they return. Baker, Bloom, and Davis (2016) perform similar searches to count the number of articles mentioning terms related to policy uncertainty. Saiz and Simonsohn (2013) count the number of web pages mentioning combinations of city names and terms related to corruption by entering queries in a search engine. Even if one can automate the searches in these cases, it is usually not feasible to produce counts for very large feature sets (e.g., every two-word phrase in the English language), and so the initial feature selection step must be relatively aggressive. Relatedly, interacting through a search interface means that there is no simple way to retrieve objects like the set of all words occurring at least twenty times in the corpus of documents, or the inputs to computing tf–idf.

3. Statistical Methods

This section considers methods for mapping the document-token matrix C to predictions V̂ of an attribute V. In some cases, the observed data is partitioned into submatrices C^train and C^test, where the matrix C^train collects rows for which we have observations V^train of V and the matrix C^test collects rows for which V is unobserved. The dimension of C^train is n^train × p, and the dimension of V^train is n^train × k, where k is the number of attributes we wish to predict.

Attributes in V can include observable quantities such as the frequency of flu cases, the positive or negative rating of movie reviews, or the unemployment rate, about which the documents are informative. There can also be latent attributes of interest, such as the topics being discussed in a congressional debate or in news articles.

Methods to connect counts c_i to attributes v_i can be roughly divided into four categories. The first, which we will call dictionary-based methods, do not involve statistical inference at all: they simply specify v̂_i = f(c_i) for some known function f(·). This is by far the most common method in the social science literature using text to date. In some cases, researchers define f(·) based on a prespecified dictionary of terms capturing particular categories of text. In Tetlock (2007), for example, c_i is a bag-of-words representation and the outcome of interest v_i is the latent "sentiment" of Wall Street Journal columns, defined along a number of dimensions such as "positive," "optimistic," and so on. The author defines the function f(·) using a dictionary called the General Inquirer, which provides lists of words associated with each of these sentiment categories.3 The elements of f(c_i) are defined to be the sum of the counts of words in each category. (As we discuss below, the main analysis then focuses on the first principal component of the resulting counts.) In Baker, Bloom, and Davis (2016), c_i is the count of articles in a given newspaper–month containing a set of prespecified terms such as "policy," "uncertainty," and "Federal Reserve," and the outcome of interest v_i is the degree of "policy uncertainty" in the economy. The authors define f(·) to be the raw count of the prespecified terms divided by the total number of articles in the newspaper–month, averaged across newspapers. We do not provide additional discussion of dictionary-based methods in this section, but we return to them in section 3.5 and in our discussion of applications in section 4.

The second and third groups of methods are distinguished by whether they begin from a model of p(v_i | c_i) or a model of p(c_i | v_i). In the former case, which we will call text regression methods, we directly estimate the conditional outcome distribution, usually via the conditional expectation E[v_i | c_i] of attributes v_i. This is intuitive: if we want to predict v_i from c_i, we would naturally regress the observed values of the former (V^train) on the corresponding values of the latter (C^train). Any generic regression technique can be applied, depending upon the nature of v_i. However, the high dimensionality of c_i, where p is often as large as or larger than n^train, requires use of regression techniques appropriate for such a setting, such as penalized linear or logistic regression.

In the latter case, we begin from a generative model of p(c_i | v_i).
To see why this is intuitive, note that in many cases the underlying causal relationship runs from outcomes to language rather than the other way around. For example, Google searches about the flu do not cause flu cases to occur; rather, people with the flu are more likely to produce such searches. Congresspeople's ideology is not determined by their use of partisan language; rather, people who are more conservative or liberal to begin with are more likely to use such language. From an economic point of view, the correct "structural" model of language in these cases maps from v_i to c_i, and as in other cases familiar to economists, modeling the underlying causal relationships can provide powerful guidance to inference and make the estimated model more interpretable.

Generative models can be further divided by whether the attributes are observed or latent. In the first case of unsupervised methods, we do not observe the true value of v_i for any documents. The function relating c_i and v_i is unknown, but we are willing to impose sufficient structure on it to allow us to infer v_i from c_i. This class includes methods such as topic modeling and its variants (e.g., latent Dirichlet allocation, or LDA). In the second case of supervised methods, we observe training data V^train and we can fit our model, say f_θ(c_i; v_i) for a vector of parameters θ, to this training set. The fitted model f̂_θ can then be inverted to predict v_i for documents in the test set, and can also be used to interpret the structural relationship between attributes and text. Finally, in some cases, v_i includes both observed and latent attributes for a semi-supervised analysis.

Lastly, we discuss word embeddings, which provide a richer representation of the underlying text than the token counts that underlie other methods. They have seen limited application in economics to date, but their dramatic successes in deep learning and other machine learning domains suggest they are likely to have high value in the future.

We close in section 3.5 with some broad recommendations for practitioners.

3 http://www.wjh.harvard.edu/~inquirer/.

3.1 Text Regression

Predicting an attribute v_i from counts c_i is a regression problem like any other, except that the high dimensionality of c_i makes ordinary least squares (OLS) and other standard techniques infeasible. The methods in this section are mainly applications of standard high-dimensional regression methods to text.

3.1.1 Penalized Linear Models

The most popular strategy for very high-dimensional regression in contemporary statistics and machine learning is the estimation of penalized linear models, particularly with L_1 penalization. We recommend this strategy for most text regression applications: linear models are intuitive and interpretable, and fast, high-quality software is available for big sparse input matrices like our C. For simple text-regression tasks with input dimension on the same order as the sample size, penalized linear models typically perform close to the frontier in terms of out-of-sample prediction.

Linear models in the sense we mean here are those in which v_i depends on c_i only through a linear index η_i = α + x_i′β, where x_i is a known transformation of c_i. In many cases, we simply have E[v_i | x_i] = η_i. It is also possible that E[v_i | x_i] = f(η_i) for some known link function f(·), as in the case of logistic regression.

Common transformations are the identity x_i = c_i, normalization by document length x_i = c_i/m_i with m_i = Σ_j c_ij, or the positive indicator x_ij = 1[c_ij > 0]. The best choice is application specific, and may be driven by interpretability: does one wish to interpret β_j as the added effect of an extra count for token j (if so, use x_ij = c_ij) or as the effect of the presence of token j (if so, use x_ij = 1[c_ij > 0])? The identity is a reasonable default in many settings.

Write l(α, β) for an unregularized objective proportional to the negative log likelihood, −log p(v_i | x_i). For example, in Gaussian (linear) regression, l(α, β) = Σ_i (v_i − η_i)², and in binomial (logistic) regression, l(α, β) = −Σ_i [η_i v_i − log(1 + e^{η_i})] for v_i ∈ {0, 1}. A penalized estimator is then the solution to

(1)  min_{α, β} { l(α, β) + nλ Σ_{j=1}^p κ_j(|β_j|) },

where λ > 0 controls overall penalty magnitude and the κ_j(·) are increasing "cost" functions that penalize deviations of the β_j from zero.

A few common cost functions are shown in figure 1. Those that have a non-differentiable spike at zero (lasso, elastic net, and log) lead to sparse estimators, with some coefficients set to exactly zero. The curvature of the penalty away from zero dictates the weight of shrinkage imposed on the nonzero coefficients: L_2 costs increase with coefficient size; lasso's L_1 penalty has zero curvature and imposes constant shrinkage; and as curvature
goes toward −∞ one approaches the L_0 penalty of subset selection.

[Figure 1: penalty cost functions. Panels: A. Ridge; B. Lasso; C. Elastic net; D. Log.]

Note: From left to right, L_2 costs (ridge, Hoerl and Kennard 1970), L_1 (lasso, Tibshirani 1996), the "elastic net" mixture of L_1 and L_2 (Zou and Hastie 2005), and the log penalty (Candès, Wakin, and Boyd 2008).

The lasso's L_1 penalty (Tibshirani 1996) is extremely popular: it yields sparse solutions with a number of desirable properties (e.g., Bickel, Ritov, and Tsybakov 2009; Wainwright 2009; Belloni, Chernozhukov, and Hansen 2013; Bühlmann and van de Geer 2011), and the number of nonzero estimated coefficients is an unbiased estimator of the regression degrees of freedom (which is useful in model selection; see Zou, Hastie, and Tibshirani 2007).4

Focusing on L_1 regularization, rewrite the penalized linear model objective as

(2)  min_{α, β} { l(α, β) + nλ Σ_{j=1}^p ω_j |β_j| }.

A common strategy sets the ω_j so that the penalty cost for each coefficient is scaled by the sample standard deviation of that covariate. In text analysis, where each covariate corresponds to some transformation of a specific text token, this type of weighting is referred to as "rare feature up-weighting" (e.g., Manning, Raghavan, and Schütze 2008) and is generally thought of as good practice: rare words are often most useful in differentiating between documents.5

Large λ leads to simple model estimates in the sense that most coefficients will be set at or close to zero, while as λ → 0 we approach maximum likelihood estimation (MLE). Since there is no way to define an optimal λ a priori, standard practice is to compute estimates for a large set of possible λ and then use some criterion to select the one that yields the best fit.
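For the Gaussian loss, the lasso solution of equation (2) (with ω_j = 1 and no intercept) can be computed by cyclic coordinate descent, where the L_1 penalty turns each one-dimensional update into a soft-thresholding step. A minimal sketch for illustration only; in practice one would use a mature package such as glmnet:

```python
def soft_threshold(z, t):
    """The L1 penalty maps each univariate update through soft-thresholding."""
    return z - t if z > t else z + t if z < -t else 0.0

def lasso(X, v, lam, n_iter=100):
    """Cyclic coordinate descent for
    min_b sum_i (v_i - x_i'b)^2 + n * lam * sum_j |b_j|
    (no intercept, unit penalty weights, for brevity)."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # residual with coordinate j excluded
            r = [v[i] - sum(X[i][k] * b[k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n))
            norm = sum(X[i][j] ** 2 for i in range(n))
            b[j] = soft_threshold(rho, n * lam / 2.0) / norm
    return b

# Orthogonal toy design: the penalty shrinks the first coefficient from its
# least squares value 2.0 down to 1.0 and sets the second exactly to zero.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
v = [2.0, 0.0, 2.0, 0.0]
assert lasso(X, v, lam=1.0) == [1.0, 0.0]
```

Setting lam=0 in the same routine returns the unpenalized least squares fit [2.0, 0.0], illustrating how λ → 0 approaches the MLE.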
One common approach is to leave out part of the training sample in estimation and then choose the λ that yields the best out-of-sample fit according to some criterion such as mean squared error. Rather than work with a single leave-out sample, researchers most often use K-fold cross-validation (CV).

4 Penalties with a bias that diminishes with coefficient size—such as the log penalty in figure 1 (Candès, Wakin, and Boyd 2008), the smoothly clipped absolute deviation (SCAD) of Fan and Li (2001), or the adaptive lasso of Zou (2006)—have been promoted in the statistics literature as improving upon the lasso by providing consistent variable selection and estimation in a wider range of settings. These diminishing-bias penalties lead to increased computation costs (due to a non-convex loss), but there exist efficient approximation algorithms (see, e.g., Fan, Xue, and Zou 2014; Taddy 2017b).

5 This is the same principle that motivates "inverse-document frequency" weighting schemes, such as tf–idf.

This splits the sample into K disjoint subsets, and then fits the full regularization path K times, excluding each subset in turn. This yields K realizations of the mean squared error or other out-of-sample fit measure for each value of λ. Common rules are to select the value of λ that minimizes the average error across these realizations, or (more conservatively) to choose the largest λ with mean error no more than one standard error away from the minimum.

Analytic alternatives to cross-validation are Akaike's information criterion (AIC; Akaike 1973) and the Bayesian information criterion (BIC) of Schwarz (1978). In particular, Flynn, Hurvich, and Simonoff (2013) describe a bias-corrected AIC objective for high-dimensional problems that they call AICc. It is motivated as an approximate likelihood maximization subject to a degrees of freedom (df_λ) adjustment: AICc(λ) = 2 l(α_λ, β_λ) + 2 df_λ n/(n − df_λ − 1). Similarly, the BIC objective is BIC(λ) = l(α_λ, β_λ) + df_λ log n, and is motivated as an approximation to the Bayesian posterior marginal likelihood in Kass and Wasserman (1995). AICc and BIC selection choose λ to minimize their respective objectives. The BIC tends to choose simpler models than cross-validation or AICc. Zou, Hastie, and Tibshirani (2007) recommend BIC for lasso penalty selection whenever variable selection, rather than predictive performance, is the primary goal.

3.1.2 Dimension Reduction

Another common solution for taming high-dimensional prediction problems is to form a small number of linear combinations of predictors and to use these derived indices as variables in an otherwise standard predictive regression. Two classic dimension reduction techniques are principal components regression (PCR) and partial least squares (PLS).

Penalized linear models use shrinkage and variable selection to manage high dimensionality by forcing the coefficients on most regressors to be close to (or, for lasso, exactly) zero. This can produce suboptimal forecasts when predictors are highly correlated. A transparent illustration of this problem would be a case in which all of the predictors are equal to the forecast target plus an i.i.d. noise term. In this situation, choosing a subset of predictors via lasso penalty is inferior to taking a simple average of the predictors and using this as the sole predictor in a univariate regression. This predictor averaging, as opposed to predictor selection, is the essence of dimension reduction.

PCR consists of a two-step procedure. In the first step, principal components analysis (PCA) combines regressors into a small set of K linear combinations that best preserve the covariance structure among the predictors. This amounts to solving the problem

(3) min_{Γ,B} trace[(C − ΓB′)(C − ΓB′)′],

subject to

rank(Γ) = rank(B) = K.

The count matrix C consists of n rows (one for each document) and p columns (one for each term). PCA seeks a low-rank representation ΓB′ that best approximates the text data C. This formulation has the character of a factor model. The n × K matrix Γ captures the prevalence of K common components, or "factors," in each document. The p × K matrix B describes the strength of association between each word and the factors. As we will see, this reduced-rank decomposition bears a close resemblance to other text analytic methods such as topic modeling and word embeddings.

In the second step, the K components are used in standard predictive regression. As an example, Foster, Liberman, and Stine (2013) use PCR to build a hedonic real estate pricing model that takes textual content of property listings as an input.6 With text data, where the number of features tends to vastly exceed the observation count, regularized versions of PCA such as predictor thresholding (e.g., Bai and Ng 2008) and sparse PCA (Zou, Hastie, and Tibshirani 2006) help exclude the least informative features to improve the predictive content of the dimension-reduced text.

A drawback of PCR is that it fails to incorporate the ultimate statistical objective—forecasting a particular set of attributes—in the dimensionality reduction step. PCA condenses text data into indices based on the covariation among the predictors. This happens prior to the forecasting step and without consideration of how predictors associate with the forecast target.

In contrast, PLS performs dimension reduction by directly exploiting covariation of predictors with the forecast target.7 Suppose we are interested in forecasting a scalar attribute v_i. PLS regression proceeds as follows. For each element j of the feature vector c_i, estimate the univariate covariance between v_i and c_ij. This covariance, denoted φ_j, reflects the attribute's "partial" sensitivity to each feature j. Next, form a single aggregate predictor by averaging all features, v̂_i = Σ_j φ_j c_ij / Σ_j φ_j. This forecast places the highest weight on the strongest univariate predictors, and the least weight on the weakest. In this way, PLS performs its dimension reduction with the ultimate forecasting objective in mind. The description of v̂_i reflects the K = 1 case, i.e., when text is condensed into a single predictive index. To use additional predictive indices, both v_i and c_ij are orthogonalized with respect to v̂_i, the above procedure is repeated on the orthogonalized data set, and the resulting forecast is added to the original v̂_i. This is iterated until the desired number of PLS components K is reached. Like PCR, PLS components describe the prevalence of K common factors in each document. And also like PCR, PLS can be implemented with a variety of regularization schemes to aid its performance in the ultra-high-dimensional world of text. Section 4 discusses applications using PLS in text regression.

PCR and PLS share a number of common properties. In both cases, K is a user-controlled parameter which, in many social science applications, is selected ex ante by the researcher. But, like any hyperparameter, K can be tuned via cross-validation. And neither method is scale invariant—the forecasting model is sensitive to the distribution of predictor variances. It is therefore common to variance-standardize features before applying PCR or PLS.

6 See Stock and Watson (2002a, b) for development of the PCR estimator and an application to macroeconomic forecasting with a large set of numerical predictors.

7 See Kelly and Pruitt (2013, 2015) for the asymptotic theory of PLS regression and its application to forecasting risk premia in financial markets.
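The two-step PCR procedure and the K = 1 PLS recursion described above can be sketched as follows. This is a minimal numpy illustration on hypothetical toy data, with variance-standardized features; the covariance weights φ_j follow the description in the text.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy data: token counts c_i and a scalar attribute v_i
# driven by the first five tokens.
n, p, K = 200, 50, 3
C = rng.poisson(1.0, size=(n, p)).astype(float)
v = C[:, :5].sum(axis=1) + rng.normal(size=n)

# Neither method is scale invariant, so variance-standardize first.
Z = (C - C.mean(axis=0)) / C.std(axis=0)

# PCR step 1: PCA via SVD, keeping the K leading components.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
Gamma = Z @ Vt[:K].T                      # n x K document factors

# PCR step 2: standard predictive regression of v on the factors.
G = np.column_stack([np.ones(n), Gamma])
coef, *_ = np.linalg.lstsq(G, v, rcond=None)
v_hat_pcr = G @ coef

# PLS (the K = 1 case): weight each feature by its univariate
# covariance with v, then average into one aggregate predictor.
phi = (Z * (v - v.mean())[:, None]).mean(axis=0)
v_hat_pls = Z @ phi / phi.sum()
```

Because the PLS weights use the forecast target directly, v_hat_pls tracks v even though the top principal components of Z need not align with the predictive features.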
3.1.3 Nonlinear Text Regression

Penalized linear models are the most widely applied text regression tools due to their simplicity, and because they may be viewed as a first-order approximation to potentially nonlinear and complex data generating processes (DGPs). In cases where a linear specification is too restrictive, there are several other machine learning tools that are well suited to represent nonlinear associations between text c_i and outcome attributes v_i. Here we briefly describe four such nonlinear regression methods—generalized linear models, support vector machines, regression trees, and deep learning—and provide references for readers interested in thorough treatments of each.

GLMs and SVMs.—One way to capture nonlinear associations between c_i and v_i is with a generalized linear model (GLM). These expand the linear model to include nonlinear functions of c_i such as polynomials or interactions, while otherwise treating the problem with the penalized linear regression methods discussed above.

A related method used in the social science literature is the support vector machine, or SVM (Vapnik 1995). This is used for text classification problems (when V is categorical), the prototypical example being email spam filtering. A detailed discussion of SVMs is beyond the scope of this review, but from a high level, the SVM finds hyperplanes in a basis expansion of C that partition the observations into sets with equal response (i.e., so that the v_i are all equal in each region).8

GLMs and SVMs both face the limitation that, without a priori assumptions for which basis transformations and interactions to include, they may overfit and require extensive tuning (Hastie, Tibshirani, and Friedman 2009; Murphy 2012). For example, multi-way interactions increase the parameterization combinatorially and can quickly overwhelm the penalization routine, and their performance suffers in the presence of many spurious "noise" inputs (Hastie, Tibshirani, and Friedman 2009).9

Regression Trees.—Regression trees have become a popular nonlinear approach for incorporating multi-way predictor interactions into regression and classification problems. The logic of trees differs markedly from traditional regressions. A tree "grows" by sequentially sorting data observations into bins based on values of the predictor variables. This partitions the data set into rectangular regions, and forms predictions as the average value of the outcome variable within each partition (Breiman et al. 1984). This structure is an effective way to accommodate rich interactions and nonlinear dependencies.

Two extensions of the simple regression tree have been highly successful thanks to clever regularization approaches that minimize the need for tuning and avoid overfitting. Random forests (Breiman 2001) average predictions from many trees that have been randomly perturbed in a bootstrap step. Boosted trees (e.g., Friedman 2002) recursively combine predictions from many oversimplified trees.10

The benefits of regression trees—nonlinearity and high-order interactions—are sometimes lessened in the presence of high-dimensional inputs. While we would generally recommend tree models, and especially random forests, they are often not worth the effort for simple text regression. Oftentimes, a more beneficial use of trees is in a final prediction step after some dimension reduction derived from the generative models in section 3.2.

8 Hastie, Tibshirani, and Friedman (2009, chapter 12) and Murphy (2012, chapter 14) provide detailed overviews of GLMs and SVMs. Joachims (1998) and Tong and Koller (2001) (among others) study text applications of SVMs.

9 Another drawback of SVMs is that they cannot be easily connected to the estimation of a probabilistic model, and the resulting fitted model can sometimes be difficult to interpret. Polson and Scott (2011) provide a pseudo-likelihood interpretation for a variant of the SVM objective. Our own experience has led us to lean away from SVMs for text analysis in favor of more easily interpretable models. Murphy (2012, chapter 14.6) attributes the popularity of SVMs in some application areas to an ignorance of alternatives.
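As a brief illustration of the tree ensembles just described, the following sketch fits a random forest to hypothetical toy data in which the outcome depends on an interaction between two token counts, a structure a linear model cannot capture directly. It uses scikit-learn's RandomForestRegressor; the data and signal are invented for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)

# Hypothetical toy data: the outcome is large only when both
# token 0 and token 1 appear in a document.
n, p = 500, 10
C = rng.poisson(1.0, size=(n, p)).astype(float)
v = 2.0 * (C[:, 0] > 0) * (C[:, 1] > 0) + rng.normal(scale=0.1, size=n)

# A random forest averages many bootstrap-perturbed trees (Breiman 2001);
# each tree partitions documents into bins by token-count splits.
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(C, v)
v_hat = forest.predict(C)
r2 = forest.score(C, v)
print(round(r2, 3))
```

The in-sample fit recovers most of the interaction structure without any hand-specified basis expansion.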
Deep Learning.—There is a host of other machine learning techniques that have been applied to text regression. The most common techniques not mentioned thus far are neural networks, which typically allow the inputs to act on the response through one or more layers of interacting nonlinear basis functions (e.g., see Bishop 1995). A main attraction of neural networks is their status as universal approximators, a theoretical result describing their ability to mimic general, smooth nonlinear associations.

In high-dimensional and very noisy settings, such as in text analysis, classical neural nets tend to suffer from the same issues referenced above: they often overfit and are difficult to tune. However, the recently popular "deep" versions of neural networks (with many layers, and fewer nodes per layer) incorporate a number of innovations that allow them to work better, faster, and with little tuning, even in difficult text analysis problems. Such deep neural nets (DNNs) are now the state-of-the-art solution for many machine learning tasks (LeCun, Bengio, and Hinton 2015).11 DNNs are now employed in many complex natural language processing tasks, such as translation (Sutskever, Vinyals, and Le 2014; Wu et al. 2016) and syntactic parsing (Chen and Manning 2014), as well as in exercises of relevance to social scientists—for example, Iyyer et al. (2014) infer political ideology from text using a DNN. They are frequently used in conjunction with richer text representations such as word embeddings, described more below.

3.1.4 Bayesian Regression Methods

The penalized methods above can all be interpreted as posterior maximization under some prior. For example, ridge regression maximizes the posterior under independent Gaussian priors on each coefficient, while Park and Casella (2008) and Hans (2009) give Bayesian interpretations to the lasso. See also the horseshoe of Carvalho, Polson, and Scott (2010) and the double Pareto of Armagan, Dunson, and Lee (2013) for Bayesian analogues of diminishing bias penalties like the log penalty on the right of figure 1.

For those looking to do a full Bayesian analysis for high-dimensional (e.g., text) regression, an especially appealing model is the spike-and-slab introduced in George and McCulloch (1993). This models the distribution over regression coefficients as a mixture between two densities centered at zero—one with very small variance (the spike) and another with large variance (the slab). This model allows one to compute posterior variable inclusion probabilities as, for each coefficient, the posterior probability that it came from the slab and not the spike component. Due to a need to integrate over the posterior distribution, e.g., via Markov chain Monte Carlo (MCMC), inference for spike-and-slab models is much more computationally intensive than fitting the penalized regressions of section 3.1.1. However, Yang, Wainwright, and Jordan (2016) argue that spike-and-slab estimates based on short MCMC samples can be useful in application, while Scott and Varian (2014) have engineered efficient implementations of the spike-and-slab model for big data applications. These procedures give a full accounting of parameter uncertainty, which we miss in a quick penalized regression.

3.2 Generative Language Models

Text regression treats the token counts as generic high-dimensional input variables, without any attempt to model structure that is specific to language data. In many settings it is useful to instead propose a generative model for the text tokens to learn about how the attributes influence word choice and account for various dependencies among words and among attributes. In this approach, the words in a document are viewed as the realization of a generative process defined through a probability model for p(c_i | v_i).

10 Hastie, Tibshirani, and Friedman (2009) provide an overview of these methods. In addition, see Wager, Hastie, and Efron (2014) and Wager and Athey (2018) for results on confidence intervals for random forests, and see Taddy et al. (2015) and Taddy et al. (2016) for an interpretation of random forests as a Bayesian posterior over potentially optimal trees.

11 Goodfellow, Bengio, and Courville (2016) provide a thorough textbook overview of these "deep learning" technologies, while Goldberg (2016) is an excellent primer on their use in natural language processing.
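As a minimal illustration of this generative view, one can simulate token counts from a document-specific token-probability vector. The vocabulary, attribute labels, and probabilities below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical generative process: each attribute value v maps to a
# token-probability vector q(v), and counts are drawn given v.
vocab = ["tax", "vote", "trade", "war"]
q_given_v = {
    "fiscal":  np.array([0.60, 0.20, 0.15, 0.05]),
    "foreign": np.array([0.05, 0.15, 0.30, 0.50]),
}

def generate_document(v, m):
    """Draw token counts c_i for a document of length m given attribute v."""
    return rng.multinomial(m, q_given_v[v])

c_i = generate_document("fiscal", m=100)
print(dict(zip(vocab, c_i.tolist())))
```

Inference then runs in the opposite direction: given observed counts c_i, learn about the attribute v_i and the probability vectors themselves.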

3.2.1 Unsupervised Generative Models

In the unsupervised setting, we have no direct observations of the true attributes v_i. Our inference about these attributes must therefore depend entirely on strong assumptions that we are willing to impose on the structure of the model p(c_i | v_i). Examples in the broader literature include cases where the v_i are latent factors, clusters, or categories. In text analysis, the leading application has been the case in which the v_i are topics.

A typical generative model implies that each observation c_i is a conditionally independent draw from the vocabulary of possible tokens according to some document-specific token probability vector, say q_i = [q_i1 ⋯ q_ip]′. Conditioning on document length, m_i = Σ_j c_ij, this implies a multinomial distribution for the counts

(4) c_i ∼ MN(q_i, m_i).

This multinomial model underlies the vast majority of contemporary generative models for text.

Under the basic model in (4), the function q_i = q(v_i) links attributes to the distribution of text counts. A leading example of this link function is the topic model specification of Blei, Ng, and Jordan (2003),12 where

(5) q_i = v_i1 θ_1 + v_i2 θ_2 + ⋯ + v_ik θ_k = Θ v_i.

Many readers will recognize the model in (5) as a factor model for the vector of normalized counts for each token in document i, c_i/m_i. Indeed, a topic model is simply a factor model for multinomial data. Each topic is a probability vector over possible tokens, denoted θ_l, l = 1, …, k (where θ_lj ≥ 0 and Σ_{j=1}^p θ_lj = 1). A topic can be thought of as a cluster of tokens that tend to appear together in documents. The latent attribute vector v_i is referred to as the set of topic weights (formally, a distribution over topics, with v_il ≥ 0 and Σ_{l=1}^k v_il = 1). Note that v_il describes the proportion of language in document i devoted to the lth topic. We can allow each document to have a mix of topics, or we can require that one v_il = 1 while the rest are zero, so that each document has a single topic.13

Since its introduction into text analysis, topic modeling has become hugely popular.14 (See Blei 2012 for a high-level overview.) The model has been especially useful in political science (e.g., Grimmer 2010), where researchers have been successful in attaching political issues and beliefs to the estimated latent topics.

Since the v_i are of course latent, estimation for topic models tends to make use of some alternating inference for V | Θ and Θ | V.

12 Standard least-squares factor models have long been employed in "latent semantic analysis" (LSA; Deerwester et al. 1990), which applies PCA (i.e., singular value decompositions) to token count transformations such as x_i = c_i/m_i or x_ij = c_ij log(d_j), where d_j = Σ_i 1[c_ij > 0]. Topic modeling and its precursor, probabilistic LSA, are generally seen as improving on such approaches by replacing arbitrary transformations with a plausible generative model.

13 Topic modeling is alternatively labeled "latent Dirichlet allocation" (LDA), which refers to the Bayesian model in Blei, Ng, and Jordan (2003) that treats each v_i and θ_l as generated from a Dirichlet-distributed prior. Another specification that is popular in political science (e.g., Quinn et al. 2010) keeps θ_l as Dirichlet-distributed but requires each document to have a single topic. This may be most appropriate for short documents, such as press releases or single speeches.

14 The same model was independently introduced in genetics by Pritchard, Stephens, and Donnelly (2000) for factorizing gene expression as a function of latent populations; it has been similarly successful in that field. Latent Dirichlet allocation is also an extension of a related mixture modeling approach in the latent semantic analysis of Hofmann (1999).
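A minimal sketch of this estimation task simulates a two-topic corpus from (4) and (5) and fits it with scikit-learn's LatentDirichletAllocation (which implements variational Bayesian inference). The topic vectors and corpus below are hypothetical toy values.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(4)

# Hypothetical two-topic corpus simulated from (4) and (5):
# each row of theta is a topic's probability vector over 6 tokens.
theta = np.array([[0.40, 0.40, 0.10, 0.05, 0.03, 0.02],
                  [0.02, 0.03, 0.05, 0.10, 0.40, 0.40]])
V = rng.dirichlet([0.3, 0.3], size=100)          # topic weights v_i
C = np.vstack([rng.multinomial(200, w @ theta) for w in V])

# Fit the topic model; components_ holds unnormalized topic-token scores.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(C)
topics = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
weights = lda.transform(C)                       # estimated topic weights
print(np.round(topics, 2))
```

The estimated rows of `topics` are probability vectors over tokens, and each row of `weights` is a document's estimated distribution over topics.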

One possibility is to employ a version of the expectation-maximization (EM) algorithm to either maximize the likelihood implied by (4) and (5) or, after incorporating the usual Dirichlet priors on v_i and θ_l, to maximize the posterior; this is the approach taken in Taddy (2012; see this paper also for a review of topic estimation techniques). Alternatively, one can target the full posterior distribution p(Θ, V | c_i). Estimation, say for Θ, then proceeds by maximization of the estimated marginal posterior, say p(Θ | c_i).

Due to the size of the data sets and dimension of the models, posterior approximation for topic models usually uses some form of variational inference (Wainwright and Jordan 2008) that fits a tractable parametric family to be as close as possible (e.g., in Kullback–Leibler divergence) to the true posterior. This variational approach was used in the original Blei, Ng, and Jordan (2003) paper and in many applications since. Hoffman et al. (2013) present a stochastic variational inference algorithm that takes advantage of techniques for optimization on massive data; this algorithm is used in many contemporary topic modeling applications. Another approach, which is more computationally intensive but can yield more accurate posterior approximations, is the MCMC algorithm of Griffiths and Steyvers (2004). Alternatively, for quick estimation without uncertainty quantification, the posterior maximization algorithm of Taddy (2012) is a good option.

The choice of k, the number of topics, is often fairly arbitrary. Data-driven choices do exist: Taddy (2012) describes a model selection process for k that is based upon Bayes factors, Airoldi et al. (2010) provide a cross-validation (CV) scheme, while Teh et al. (2006) use Bayesian nonparametric techniques that view k as an unknown model parameter. In practice, however, it is very common to simply start with a number of topics on the order of ten, and then adjust the number of topics in whatever direction seems to improve interpretability. Whether this ad hoc procedure is problematic depends on the application. As we discuss below, in many applications of topic models to date, the goal is to provide an intuitive description of text, rather than inference on some underlying "true" parameters; in these cases, the ad hoc selection of the number of topics may be reasonable.

The basic topic model has been generalized and extended in a variety of ways. A prominent example is the dynamic topic model of Blei and Lafferty (2006), which considers documents that are indexed by date (e.g., publication date for academic articles) and allows the topics, say Θ_t, to evolve smoothly in time. Another example is the supervised topic model of Blei and McAuliffe (2007), which combines the standard topic model with an extra equation relating the weights v_i to some additional attribute y_i in p(y_i | v_i). This pushes the latent topics to be relevant to y_i as well as the text c_i. In these and many other extensions, the modifications are designed to incorporate available document metadata (in these examples, time and y_i respectively).

3.2.2 Supervised Generative Models

In supervised models, the attributes v_i are observed in a training set and thus may be directly harnessed to inform the model of text generation. Perhaps the most common supervised generative model is the so-called naive Bayes classifier (e.g., Murphy 2012), which treats counts for each token as independent with class-dependent means. For example, the observed attribute might be author identity for each document in the corpus, with the model specifying different mean token counts for each author.

In naive Bayes, v_i is a univariate categorical variable and the token count distribution is factorized as p(c_i | v_i) = Π_j p_j(c_ij | v_i), thus "naively" specifying conditional independence between tokens j.
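A minimal sketch of such a classifier follows, with hypothetical Poisson token models for two authors and prediction by Bayes's rule; all parameter values are invented for the example.

```python
import numpy as np

# Hypothetical Poisson naive Bayes: each class v (here, an author) has
# its own vector of mean token counts theta_v.
theta = {
    "author_A": np.array([2.0, 0.5, 0.1]),
    "author_B": np.array([0.2, 1.0, 2.0]),
}
prior = {"author_A": 0.5, "author_B": 0.5}

def log_poisson(c, mu):
    # log Pois(c; mu), dropping the log(c!) term common to every class
    return c * np.log(mu) - mu

def classify(c):
    """Invert the class-conditional token models via Bayes's rule."""
    scores = {v: np.log(prior[v]) + log_poisson(c, theta[v]).sum()
              for v in theta}
    return max(scores, key=scores.get)

print(classify(np.array([5, 1, 0])))   # counts favoring author_A's tokens
print(classify(np.array([0, 2, 4])))   # counts favoring author_B's tokens
```

Because the token models are independent across j, the log posterior score is just a sum of per-token terms plus the log prior.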

This rules out the possibility that by choosing to say one token (say, "hello") we reduce the probability that we say some other token (say, "hi"). The parameters of each independent token distribution are estimated, yielding p̂_j for j = 1, …, p. The model can then be inverted for prediction, with classification probabilities for the possible class labels obtained via Bayes's rule as

(6) p(V | c_i) = p(c_i | V) π_V / Σ_a p(c_i | a) π_a = [Π_j p_j(c_ij | V)] π_V / Σ_a [Π_j p_j(c_ij | a)] π_a,

where π_a is the prior probability on class a (usually just one over the number of possible classes). In text analysis, Poisson naive Bayes procedures, with p(c_ij | V) = Pois(c_ij; θ_Vj) where E[c_ij | V] = θ_Vj, have been used as far back as Mosteller and Wallace (1963). Some recent social science applications use binomial naive Bayes, which sets p(c_ij | V) = Bin(c_ij; m_i, θ_Vj) where E[c_ij/m_i | V] = θ_Vj. The Poisson model has some statistical justification in the analysis of text counts (Taddy 2015a), but the binomial specification seems to be more common in off-the-shelf software.

A more realistic sampling model for text token counts is the multinomial model of (4). This introduces limited dependence between token counts, encoding the fact that using one token for a given utterance must slightly lower the expected count for all other tokens. Combining such a sampling scheme with generalized linear models, Taddy (2013b) advocates the use of multinomial logistic regression to connect text counts with observable attributes. The generative model specifies probabilities in the multinomial distribution of (4) as

(7) q_ij = e^{η_ij} / Σ_{h=1}^p e^{η_ih},  η_ij = α_j + v_i′ φ_j.

Taddy (2013a, b) applies these models in the setting of univariate or low-dimensional v_i, focusing on their use for prediction of future document attributes through an inversion strategy discussed below. More recently, Taddy (2015a) provides a distributed-computing strategy that allows the model implied by (7) to be fit (using penalized deviance methods as detailed in section 3.1.1) for high-dimensional v_i on massive corpora. This facilitates language models containing a large number of sources of heterogeneity (even document-specific random effects), thus allowing social scientists to apply familiar regression analysis tools in their text analyses.

Application of the logistic regression text models implied by (7) often requires an inversion step for inference about attributes conditional on text—that is, to map from p(c_i | v_i) to p(v_i | c_i). The simple Bayes's rule technique of (6) is difficult to apply beyond a single categorical attribute. Instead, Taddy (2013b) uses the inverse regression ideas of Cook (2007) in deriving sufficient projections from the fitted models. Say Φ = [φ_1 ⋯ φ_p] is the matrix of regression coefficients from (7) across all tokens j; then the token count projection Φc_i is a sufficient statistic for v_i in the sense that

(8) v_i ⫫ c_i | Φc_i,

i.e., the attributes are independent of the text counts conditional upon the projection. Thus, the fitted logistic regression model yields a map from high-dimensional text to the presumably lower dimensional attributes of interest, and this map can be used instead of the full text counts for future inference tasks. For example, to predict variable v_i you can fit the low-dimensional OLS regression of v_i on Φc_i. Use of projections built in this way is referred to as multinomial inverse regression (MNIR). This idea can also be applied to only a subset of the variables in v_i, yielding projections that are sufficient for the text content relevant to those variables after conditioning on the other attributes in v_i. Taddy (2015a) details use of such sufficient projections in a variety of applications, including attribute prediction, treatment effect estimation, and document indexing.

New techniques are arising that combine MNIR techniques with the latent structure of topic models. For example, Rabinovich and Blei (2014) directly combine the logistic regression in (7) with the topic model of (5) in a mixture specification. Alternatively, the structural topic model of Roberts et al. (2013) allows both topic content (θ_l) and topic prevalence (latent v_i) to depend on observable document attributes. Such semi-supervised techniques seem promising for their combination of the strong text–attribute connection of MNIR with topic modeling's ability to account for latent clustering and dependency within documents.

3.3 Word Embeddings

Throughout this article, documents have been represented through token count vectors, c_i. This is a crude language summarization. It abstracts from any notion of similarity between words (such as run, runner, jogger) or syntactical richness. One of the frontiers of textual analysis is in developing new representations of text data that more faithfully capture its meaning.

Instead of identifying words only as an index for location in a long vocabulary list, imagine representing words as points in a large vector space, with similar words colocated, and an internally consistent arithmetic on the space for relating words to one another. For example, suppose our vocabulary consists of six words: {king, queen, prince, man, woman, child}. The vector space representation of this vocabulary based on similarity of their meaning might look something like figure 2, panel A.15

In the vector space, words are relationally oriented and we can begin to draw meaning from term positions, something that is not possible in simple bag-of-words approaches. For example, in the right panel of figure 2, we can see that by subtracting the vector man from the vector king, and then adding to this woman, we arrive spatially close to queen. Likewise, the combination king − man + child lies in close proximity to the vector prince.

Such word embeddings, also known as distributed language representations, amount to a preprocessing of the text data to replace word identities—encoded as binary indicators in a vocabulary-length vector—with an embedding (location) of each vocabulary word in ℝ^K, where K is the dimension of the latent representation space. The dimensions of the vector space correspond to various aspects of meaning that give words their content. Continuing from the simplified example vocabulary, the latent (and, in reality, unlabeled) dimensions and associated word embeddings might look like:

Dimension     king   queen  prince  man    woman  child
Royalty       0.99   0.99   0.95    0.01   0.02   0.01
Masculinity   0.94   0.06   0.02    0.99   0.02   0.49
Age           0.73   0.81   0.15    0.61   0.68   0.09
…

This type of text representation has long been applied in natural language processing (Rumelhart, Hinton, and Williams 1986; Morin and Bengio 2005). The embeddings must be estimated and are chosen to optimize, perhaps approximately, an objective function defined on the original text (such as a likelihood for word occurrences). They form the basis for many deep learning applications involving textual data (see, e.g., Chen and Manning 2014; Goldberg 2016).

15 This example is motivated by https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/.

Panel A Panel B

an an woan ld woan kn een kn an – kn woan + een

ne

kn an –

Figure 2. A Graphical Example of Word Embeddings and angles between words for fundamen- The j​th​ rows of ​ ​ and B​ ​ (denoted ​​ ​​ and ​​ ​​) Γ γj βj tal tasks such as classification, and have give a K​ ​-dimensional embedding of the j​th​ begun to be adopted by social scientists word, so ­co-occurrences of terms ​i​ and ​j​ as a useful summary representation of text are approximated as ​​​​ ​ ​ ​ ​. This geometric γi βj′ data. representation of the text has an intuitive Some popular embedding techniques interpretation. The inner product of terms’ are Word2Vec (Mikolov et al. 2013) and embeddings, which measures the close- Global Vector for Word Representation ness of the pair in the K​ ​-dimensional vec- (GloVe, Pennington, Socher, and Manning tor space, describes how likely the pair is to 2014). The key preliminary step in ­co-occur. these methods is to settle on a notion of Researchers are beginning to connect ­co-occurrence among terms. For example, these ­vector-space language models with the consider a p​ p​ matrix denoted CoOccur​ ​, sorts of document attributes that are of inter- × whose ​​(i, j)​​ entry counts the number of est in social science. For example, Le and times in your corpus that the terms i​​ and j​​ Mikolov (2014) estimate latent document appear within, say, ​b​ words of each other. scores in a vector space, while Taddy (2015b) This is known as the ­skip-gram definition of develops an inversion rule for document ­co-occurrences. classification based upon Word2Vec. In one To embed CoOccur​ ​ in a K​ ​-dimensional especially compelling application, Bolukbasi vector space, where ​K​ is much smaller than p​ ​ et al. (2016) estimate the direction of gen- (say a few hundred), we solve the same type der in an embedding space by averaging the of problem that PCA used to summarize the angles between female and male descriptors. word count matrix in equation (3). 
In partic- They then show that stereotypically male ular, we can find rank-​K​ matrices ​ and ​B​ and female jobs, for example, live at the Γ ​ that best approximate co-occurrences­ among corresponding ends of the implied gender terms: vector. This information is used to derive an algorithm for removing these gender biases, ​CoOccur ​B . ​ ​ so as to provide a more “fair” set of inputs ≈ Γ ′ Gentzkow, Kelly, and Taddy: Text as Data 553 for machine learning tasks. Approaches like More promising are computation algorithms this, which use embeddings as the basis for that approximate the sampling distribution, mathematical analyses of text, can play a role the most common being the familiar non- in the next generation of ­text-as-data applica- parametric bootstrap (Efron 1979). This tions in social science. repeatedly draws samples with replacement of the same size as the original data set and 3.4 Uncertainty Quantification ­reestimates parameters of interest on the The machine learning literature on text bootstrapped samples, with the resulting set analysis is focused on point estimation and of estimated parameters approximating the predictive performance. Social scientists sampling distribution.17 often seek to interpret parameters or func- Unfortunately, the nonparametric boot- tions of the fitted models, and hence desire strap fails for many of the algorithms used on strategies for quantifying the statistical text. For example, it is known to fail for meth- uncertainty around these targets – that is, for ods that involve ­non-differentiable loss func- statistical inference. tions (e.g., the lasso), and with-replacement­ Many machine learning methods for text resampling produces overfit in the bootstrap analysis are based upon a Bayesian modeling samples (repeated observations make predic- approach, where uncertainty quantification is tion seem easier than it actually is). 
Hence, often available as part of the estimation pro- for many applications, it is better to look to cess. In MCMC sampling, as in the Bayesian methods more suitable for ­high-dimensional regression of Scott and Varian (2014) or the estimation algorithms. The two primary can- topic modeling of Griffiths and Steyvers didates are the parametric bootstrap and (2004), the software returns samples from subsampling. the posterior and thus inference is imme- The parametric bootstrap generates new diately available. For estimators relying on unrepeated observations for each boot- variational inference—i.e., fitting a trac- strap sample given an estimated generative table distribution as closely as possible to model (or other assumed form for the data the true posterior—one can simulate from generating process).18 In doing so, it avoids the approximate distribution to conduct pathologies of the nonparametric bootstrap inference.16 that arise from using the empirical sample Frequentist uncertainty quantification is distribution. The cost is that the parametric often favored by social scientists, but analytic bootstrap is, of course, parametric: It makes sampling distributions are unavailable for strong assumptions about the underlying most of the methods discussed here. Some generative model, and one must bear in results exist for the lasso in stylized settings mind that the resulting inference is condi- (especially Knight and Fu 2000), but these tional upon these assumptions.19 assume low-dimensional asymptotic scenar- ios that may be unrealistic for text analysis. 17 See, e.g., Horowitz (2003) for an overview. 18 See Efron (2012) for an overview that also makes interesting connections to Bayesian inference. 
16 Due to its popularity in the deep learning commu- 19 For example, in a linear regression model, the nity, variational inference is a common feature in newer parametric bootstrap requires simulating errors from machine learning frameworks; see, for example, Edward an assumed, say Gaussian, distribution. One must make (Tran et al. 2016, 2017) for a python library that builds assumptions on the exact form of this distribution, includ- variational inference on top of the TensorFlow platform. ing whether the errors are homoskedastic or not. This Edward and similar tools can be used to implement topic contrasts with our usual approaches to standard errors models and the other kinds of text models that we dis- for linear regression that are robust to assumptions on the cussed above. functional form of the errors. 554 Journal of Economic Literature, Vol. LVII (September 2019)
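To make the contrast between the two bootstrap schemes concrete, the sketch below applies both to a deliberately simple case (a no-intercept OLS slope, a smooth estimator for which both schemes are valid). The data-generating process, coefficient value, and number of bootstrap draws are illustrative assumptions, not taken from the article.

```python
import math
import random

random.seed(0)

# Toy data: y = 2*x + Gaussian noise (illustrative values only).
n, beta_true = 200, 2.0
x = [random.gauss(0, 1) for _ in range(n)]
y = [beta_true * xi + random.gauss(0, 1) for xi in x]

def ols_slope(xs, ys):
    # No-intercept least squares slope.
    return sum(a * b for a, b in zip(xs, ys)) / sum(a * a for a in xs)

beta_hat = ols_slope(x, y)

def nonparametric_bootstrap(B=300):
    # Resample (x_i, y_i) pairs with replacement and re-estimate.
    draws = []
    for _ in range(B):
        idx = [random.randrange(n) for _ in range(n)]
        draws.append(ols_slope([x[i] for i in idx], [y[i] for i in idx]))
    return draws

def parametric_bootstrap(B=300):
    # Simulate fresh Gaussian errors from the fitted model, so every
    # bootstrap observation is new rather than a repeat of the data.
    resid = [yi - beta_hat * xi for xi, yi in zip(x, y)]
    sigma = math.sqrt(sum(r * r for r in resid) / (n - 1))
    draws = []
    for _ in range(B):
        y_sim = [beta_hat * xi + random.gauss(0, sigma) for xi in x]
        draws.append(ols_slope(x, y_sim))
    return draws

def sd(v):
    m = sum(v) / len(v)
    return math.sqrt(sum((a - m) ** 2 for a in v) / (len(v) - 1))

se_np = sd(nonparametric_bootstrap())  # nonparametric standard error
se_p = sd(parametric_bootstrap())      # parametric standard error
```

For this smooth estimator the two standard errors roughly agree; the text's warning is that for non-differentiable, model-selecting estimators such as the lasso only the parametric scheme (or subsampling, below) remains reliable, and it is reliable only conditional on the assumed error model.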

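Stepping back to section 3.3, the embedding arithmetic described there can also be checked numerically. This sketch uses the toy three-dimensional embeddings from the illustrative table (Royalty, Masculinity, Age); real embeddings are estimated, with hundreds of unlabeled dimensions, but the nearest-neighbor logic is the same.

```python
import math

# Toy embeddings copied from the illustrative table in section 3.3.
emb = {
    "king":   [0.99, 0.94, 0.73],
    "queen":  [0.99, 0.06, 0.81],
    "prince": [0.95, 0.02, 0.15],
    "man":    [0.01, 0.99, 0.61],
    "woman":  [0.02, 0.02, 0.68],
    "child":  [0.01, 0.49, 0.09],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def analogy(a, minus, plus):
    # Form the vector a - minus + plus and return the vocabulary word
    # closest to it in cosine similarity.
    target = [p - q + r for p, q, r in zip(emb[a], emb[minus], emb[plus])]
    return max(emb, key=lambda w: cosine(emb[w], target))

print(analogy("king", "man", "woman"))   # queen
print(analogy("king", "man", "child"))   # prince
```

Even in three labeled dimensions, king − man + woman lands closest to queen, mirroring the geometry sketched in figure 2, panel B.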
An alternative method, subsampling, provides a nonparametric approach to inference that remains robust to estimation features such as non-differentiable losses and model selection. The book by Politis, Romano, and Wolf (1999) gives a comprehensive overview. In subsampling, the data are partitioned into subsamples without replacement (the number of subsamples being a function of the total sample size) and the target parameters are reestimated separately on each subsample. The advantage of subsampling is that, because each subsample is a draw from the true DGP, it works for a wide variety of estimation algorithms. However, since each subsample is smaller than the sample of interest, one needs to know the estimator's rate of convergence in order to translate between the uncertainty in the subsamples and in the full sample.20

Finally, one may consider sample splitting with methods that involve a model selection step (i.e., those setting some parameter values to zero, such as lasso). Model selection is performed on one "selection" sample, then the standard inference is performed on the second "estimation" sample conditional upon the selected model.21 Econometricians have successfully used this approach to obtain accurate inference for machine learning procedures. For example, Chernozhukov et al. (2018) use it in the context of treatment effect estimation via lasso, and Athey and Imbens (2016) use sample splitting for inference about heterogeneous treatment effects in the context of a causal tree model.

3.5 Some Practical Advice

The methods above will be compared and contrasted in our subsequent discussion of applications. In broad terms, however, we can make some rough recommendations for practitioners.

3.5.1 Choosing the Best Approach for a Specific Application

Dictionary-based methods heavily weight prior information about the function mapping features c_i to outcomes v_i. They are therefore most appropriate in cases where such prior information is strong and reliable, and where information in the data is correspondingly weak. An obvious example is a case where the outcomes v_i are not observed for any i, so there is no training data available to fit a supervised model, and where the mapping of interest does not match the factor structure of unsupervised methods such as topic models.
In the setting of Baker, Bloom, and Davis (2016), for example, there is no ground truth data on the actual level of policy uncertainty reflected in particular articles, and fitting a topic model would be unlikely to endogenously pick out policy uncertainty as a topic. A second example is where some training data do exist, but it is sufficiently small or noisy that the researcher believes a prior-driven specification of f(⋅) is likely to be more reliable.

Text regression is generally a good choice for predicting a single attribute, especially when one has a large amount of labeled training data available. As described in Ng and Jordan (2001) and Taddy (2013c), supervised generative techniques such as naive Bayes and MNIR can improve prediction when p is large relative to n; however, these gains diminish with the sample size due to the asymptotic efficiency of many text regression techniques. In text regression, we have found that it is usually unwise to attempt to learn flexible functional forms unless n is much larger than p. When this is not the case, we generally recommend linear regression methods. Given the availability of fast and robust tools (gamlr and glmnet in R, and scikit-learn in Python), and the typically high dimensionality of text data, many prediction tasks in social science with text inputs can be efficiently addressed via penalized linear regression.

When there are multiple attributes of interest, and one wishes to resolve or control for interdependencies between these attributes and their effects on language, then one will need to work with a generative model for text. Multinomial logistic regression and its extensions can be applied to such situations, particularly via distributed multinomial regression. Alternatively, for corpora of many unlabeled documents (or when the labels do not tell the whole story that one wishes to investigate), topic modeling is the obvious approach. Word embeddings are also becoming an option for such questions. In the spirit of contemporary machine learning, it is also perfectly fine to combine techniques. For example, a common setting will have a large corpus of labeled documents as well as a smaller set of documents about which some metadata exist. One approach is to fit a topic model on the larger corpus, and to then use these topics as well as the token counts for supervised text regression on the smaller labeled corpus.

3.5.2 Model Validation and Interpretation

Ex ante criteria for selecting an empirical approach are suggestive at best. In practice, it is also crucial to validate the performance of the estimation approach ex post. Real research often involves an iterative tuning process with repeated rounds of estimation, validation, and adjustment.

When the goal is prediction, the primary tool for validation is checking out-of-sample predictive performance on data held out from the main estimation sample. In section 3.1.1, we discussed the technique of cross-validation (CV) for penalty selection, a leading example. More generally, whenever one works with complex and high-dimensional data, it is good practice to reserve a test set of data to use in estimation of the true average prediction error. Looping across multiple test sets, as in CV, is a common way of reducing the variance of these error estimates. (See Efron 2004 for a classic overview.)

In many social science applications, the goal is to go beyond prediction and use the values V̂ in some subsequent descriptive or causal analysis. In these cases, it is important to also validate the accuracy with which the fitted model is capturing the economic or descriptive quantity of interest.

One approach that is often effective is manual audits: cross-checking some subset of the fitted values against the coding a human would produce by hand. An informal version of this is for a researcher to simply inspect a subset of documents alongside the fitted V̂ and evaluate whether the estimates align with the concept of interest. A formal version would involve having one or more people manually classify each document in a subset and evaluating quantitatively the consistency between the human and machine codings. The subsample of documents does not need to be large in order for this exercise to be valuable—often as few as twenty or thirty documents is enough to provide a sense of whether the model is performing as desired.

This kind of auditing is especially important for dictionary methods. Validity hinges on the assumption that a particular function of text features—counts of positive or negative words, an indicator for the presence of certain keywords, etc.—will be a valid predictor of the true latent variable V. In a setting where we have sufficient prior information to justify this assumption, we typically also have enough prior information to evaluate whether the resulting classification looks accurate. An excellent example of this is Baker, Bloom, and Davis (2016), who perform a careful manual audit to validate their dictionary-based method for identifying articles that discuss policy uncertainty.

Audits are also valuable in studies using other methods. In Gentzkow and Shapiro (2010), for example, the authors perform an audit of news articles that their fitted model classifies as having a right-leaning or left-leaning slant. They do not compare this against hand coding directly, but rather count the number of times the key phrases that are weighted by the model are used straightforwardly in news text, as opposed to occurring in quotation marks or in other types of articles such as letters to the editor.

A second approach to validating a fitted model is inspecting the estimated coefficients or other parameters of the model directly. In the context of text regression methods, however, this needs to be approached with caution. While there is a substantial literature on statistical properties of estimated parameters in penalized regression models (see Bühlmann and van de Geer 2011 and Hastie, Tibshirani, and Wainwright 2015), the reality is that these coefficients are typically only interpretable in cases where the true model is extremely sparse, so that the model is likely to have selected the correct set of variables with high probability. Otherwise, multicollinearity means the set of variables selected can be highly unstable.

These difficulties notwithstanding, inspecting the most important coefficients to see if they make intuitive sense can still be useful as a validation and sanity check. Note that "most important" can be defined in a number of ways; one can rank estimated coefficients by their absolute values, or by absolute value scaled by the standard deviation of the associated covariate, or perhaps by the order in which they first become nonzero in a lasso path of decreasing penalties. Alternatively, see Gentzkow, Shapiro, and Taddy (2016) for application-specific term rankings.

Inspection of fitted parameters is generally more informative in the context of a generative model. Even there, some caution is in order. For example, Taddy (2015a) finds that for MNIR models, getting an interpretable set of word loadings requires careful penalty tuning and the inclusion of appropriate control variables. As in text regression, it is usually worthwhile to look at the largest coefficients for validation but not take the smaller values too seriously.

Interpretation or story building around estimated parameters tends to be a major focus for topic models and other unsupervised generative models. Interpretation of the fitted topics usually proceeds by ranking the tokens in each topic according to token probability, θ_lj, or by token lift, θ_lj / p̄_j, with p̄_j = (1/n) Σ_i c_ij / m_i. For example, if the five highest-lift tokens in topic l for a model fit to a corpus of restaurant reviews are another.minute, flag.down, over.minute, wait.over, arrive.after, we might expect that reviews with high v_il correspond to negative experiences where the patron was forced to wait for service and food (example from Taddy 2012). Again, however, we caution against the overinterpretation of these unsupervised models: the posterior distributions informing parameter estimates are often multimodal, and multiple topic model runs can lead to multiple different interpretations. As argued in Airoldi and Bischof (2016) and in a comment by Taddy (2017a), the best way to build interpretability for topic models may be to add some supervision (i.e., to incorporate external information on the topics for some set of cases).

20 For many applications, doing this translation under the assumption of a standard √n learning rate is a reasonable choice. For example, Knight and Fu (2000) show conditions under which the √n rate holds for the lasso. However, in many text-as-data applications the dimension of the data (the size of the vocabulary) will be large and growing with the number of observations. In such settings √n learning may be optimistic, and more sophisticated methods may need to be used to infer the learning rate (see, e.g., Politis, Romano, and Wolf 1999).
21 For example, when using lasso, one might apply OLS in the second stage using only the covariates selected by lasso in the first stage.

4. Applications

We now turn to applications of text analysis in economics and related social sciences.

Rather than presenting a comprehensive literature survey, the goal of this section is to present a selection of illustrative papers to give the reader a sense of the wide diversity of questions that may be addressed with textual analysis and to provide a flavor of how some of the methods in section 3 are applied in practice.

4.1 Authorship

A classic descriptive problem is inferring the author of a document. While this is not usually a first-order research question for social scientists, it provides a particularly clean example, and a good starting point to understand the applications that follow.

In what is often seen as the first modern statistical analysis of text data, Mosteller and Wallace (1963) use text analysis to infer the authorship of the disputed Federalist Papers that had alternatively been attributed to either Alexander Hamilton or James Madison. They define documents i to be individual Federalist Papers, the data features c_i of interest to be counts of function words such as "an," "of," and "upon" in each document, and the outcome v_i to be an indicator for the identity of the author. Note that the function words the authors focus on are exactly the "stop words" that are frequently excluded from analysis (as discussed in section 2 above). The key feature of these words is that their use by a given author tends to be stable regardless of the topic, tone, or intent of the piece of writing. This means they provide little valuable information if the goal is to infer characteristics such as political slant or discussion of policy uncertainty that are independent of the idiosyncratic styles of particular authors. When such styles are the object of interest, however, function words become among the most informative text characteristics. Mosteller and Wallace (1963) use a sample of Federalist Papers, whose authorship by either Madison or Hamilton is undisputed, to train a naive Bayes classifier (a supervised generative model, as discussed in section 3.2) in which the probabilities p(c_ij | v_i) of each phrase j are assumed to be independent Poisson or negative binomial random variables, and the inferences for the unknown documents are made by Bayes' rule. The results provide overwhelming evidence that all of the disputed papers were authored by Madison.

Stock and Trebbi (2003) apply similar methods to answer an authorship question of more direct interest to economists: who invented instrumental variables? The earliest known derivation of the instrumental variables estimator appears in an appendix to The Tariff on Animal and Vegetable Oils, a 1928 book by statistician Philip Wright. While the bulk of the book is devoted to a "painfully detailed treatise on animal and vegetable oils, their production, uses, markets and tariffs," the appendix is of an entirely different character, with "a succinct and insightful explanation of why data on price and quantity alone are in general inadequate for estimating either supply or demand; two separate and correct derivations of the instrumental variables estimators of the supply and demand elasticities; and an empirical application" (Stock and Trebbi 2003, p. 177). The contrast between the two parts of the book has led to speculation that the appendix was not written by Philip Wright, but rather by his son Sewall. Sewall Wright was a geneticist who had originated the method of "path coefficients" used in one of the derivations in the appendix. Several authors including Manski (1988) are on record attributing authorship to Sewall; others including Angrist and Krueger (2001) attribute it to Philip.

In Stock and Trebbi's (2003) study, the outcome v_i is an indicator for authorship by either Philip or Sewall. The data features c_i = [c_i^func c_i^gram] are counts of the same function words used by Mosteller and

Wallace (1963) plus counts of a set of grammatical constructions (e.g., "noun followed by adverb") measured using an algorithm due to Mannion and Dixon (1997). The training sample C^train consists of forty-five documents known to have been written by either Philip or Sewall, and the test sample C^test in which v_i is unobserved consists of eight blocks of text from the appendix plus one block of text from chapter 1 of The Tariff on Animal and Vegetable Oils included as a validity check. The authors apply principal components analysis, which we can think of as an unsupervised cousin of the topic modeling approach discussed in section 3.2.1. They extract the first four principal components from c_i^func and c_i^gram respectively, and then run regressions of the binary authorship variable on the principal components, resulting in predicted values v̂_i^func and v̂_i^gram.

The results provide overwhelming evidence that the disputed appendix was in fact written by Philip. Figure 3 plots the values (v̂_i^func, v̂_i^gram) for all of the documents in the sample. Each point in the figure is a document i, and the labels indicate whether the document is known to be written by Philip ("P" or "1"), known to be written by Sewall ("S"), or of uncertain authorship ("B"). The measures clearly distinguish the two authors, with documents by each forming clear, nonoverlapping clusters. The uncertain documents all fall squarely within the cluster attributed to Philip, with the predicted values v̂_i^func, v̂_i^gram ≈ 1.

4.2 Stock Prices

An early example analyzing news text for stock price prediction appears in Cowles (1933). He subjectively categorizes the text of editorial articles of William Peter Hamilton, chief editor of the Wall Street Journal from 1902–29, as "bullish," "bearish," or "doubtful." Cowles then uses these classifications to predict future returns of the Dow Jones Industrial Average. Hamilton's track record is unimpressive. A market-timing strategy based on his Wall Street Journal editorials underperforms a passive investment in the Dow Jones Industrial Average by 3.5 percentage points per year.

In its modern form, the implementation of text-based prediction in finance is computationally driven, but it applies methods that are conceptually similar to Cowles's approach, seeking to predict the target quantity V (the Dow Jones return, in the example of Cowles) from the array of token counts C. We discuss three examples of recent papers that study equity return prediction in the spirit of Cowles: one relying on a preexisting dictionary (as discussed at the beginning of section 3), one using regression techniques (as discussed in section 3.1), and another using generative models (as discussed in section 3.2).

Tetlock's 2007 paper is a leading dictionary-based example of analyzing media sentiment and the stock market. He studies word counts c_i in the Wall Street Journal's widely read "Abreast of the Market" column. Counts from each article i are converted into a vector of sentiment scores v̂_i in seventy-seven different sentiment dimensions based on the Harvard IV-4 psychosocial dictionary.22 The time series of daily sentiment scores for each category (v̂_i) are condensed into a single principal component, which Tetlock names the "pessimism factor" due to the component's especially close association with the "pessimism" dimension of the sentiment categories.

22 While it has only recently been used in economics and finance, the Harvard dictionary and associated General Inquirer software for textual content analysis dates to the 1960s and has been widely used in linguistics, psychology, sociology, and anthropology.

The second stage of the analysis uses this pessimism score to forecast stock market

Figure 3. Scatterplot of Authorship Predictions from PCA Method

Source: Stock and Trebbi (2003). Copyright American Economic Association; reproduced with the permission of the Journal of Economic Perspectives.

activity. High pessimism significantly negatively forecasts one-day-ahead returns on the Dow Jones Industrial Average. This effect is transitory, and the short-term index dip associated with media pessimism reverts within a week, consistent with the interpretation that article text is informative regarding media and investor sentiment, as opposed to containing fundamental news that permanently impacts prices.

The number of studies using dictionary methods to study asset pricing phenomena is growing. Loughran and McDonald (2011) demonstrate that the widely used Harvard dictionary can be ill-suited for financial applications. They construct an alternative, finance-specific dictionary of positive and negative terms and document its improved predictive power over existing sentiment dictionaries. Bollen, Mao, and Zeng (2011) document a significant predictive correlation between Twitter messages and the stock market using other dictionary-based tools such as OpinionFinder and Google's Profile of Mood States. Wisniewski and Lambe (2013) show that negative media attention of the banking sector, summarized via ad hoc pre-defined word lists, Granger-causes bank stock returns during the 2007–2009 financial crisis and not the reverse, suggesting that journalistic views have the potential to influence market outcomes, at least in extreme states of the world.

The use of text regression for asset pricing is exemplified by Jegadeesh and Wu (2013). They estimate the response of company-level stock returns, v_i, to text information in the company's annual report (token counts, c_i). The authors' objective is to determine whether regression techniques offer improved stock return forecasts relative to dictionary methods.

The authors propose the following regression model to capture correlations between occurrences of individual words and subsequent stock return realizations around regulatory filing dates:

  v_i = a + b Σ_j w_j (c_ij / Σ_j c_ij) + ε_i.

Documents i are defined to be annual reports filed by firms at the Securities and Exchange Commission. The outcome variable v_i is a stock's cumulative four-day return beginning on the filing day. The independent variable c_ij is a count of occurrences of word j in annual report i. The coefficient w_j summarizes the average association between an occurrence of word j and the stock's subsequent return. The authors show how to estimate w_j from a cross-sectional regression, along with a subsequent rescaling of all coefficients to remove the common influence parameter b. Finally, variables to predict returns are built from the estimated weights, and are shown to have stronger out-of-sample forecasting performance than dictionary-based indices from Loughran and McDonald (2011). The results highlight the limitations of using fixed dictionaries for diverse predictive problems, and that these limitations are often surmountable by estimating application-specific weights via regression.

Manela and Moreira (2017) take a regression approach to construct an index of news-implied market volatility based on text from the Wall Street Journal from 1890–2009. They apply support vector machines, a nonlinear regression method that we discuss in section 3.1.3. This approach applies a penalized least squares objective to identify a small subset of words whose frequencies are most useful for predicting outcomes—in this case, turbulence in financial markets. Two important findings emerge from their analysis. First, the terms most closely associated with market volatility relate to government policy and wars. Second, high levels of news-implied volatility forecast high future stock market returns. These two facts together give insight into the types of risks that drive investors' valuation decisions.23

23 While Manela and Moreira (2017) study aggregate market volatility, Kogan et al. (2009) and Boudoukh et al. (2016) use text from news and regulatory filings to predict firm-specific volatility. Chinco, Clark-Joseph, and Ye (2017) apply lasso in high frequency return prediction using preprocessed financial news text sentiment as an explanatory variable.

The closest modern analog of Cowles's study is Antweiler and Frank (2004), who take a generative modeling approach to ask: how informative are the views of stock market prognosticators who post on internet message boards? Similar to Cowles's analysis, these authors classify postings on stock message boards as "buy," "sell," or "hold" signals. But the vast number of postings, roughly 1.5 million in the analyzed sample, makes subjective classification of messages infeasible. Instead, generative techniques allow the authors to automatically classify messages. The authors create a training sample of one thousand messages, and form V^train by manually classifying messages into one of the three categories. They then use the naive Bayes method described in section 3.2.2 to estimate a probability model that maps word counts of postings C into classifications V̂ for the remaining 1.5 million messages. Finally, the buy/sell/hold classification of each message is aggregated into an index that is used to forecast stock returns. Consistent with the conclusions of Cowles, message board postings show little ability to predict stock returns. They do, however, possess significant and economically meaningful

Bandiera et al. (2017) apply unsupervised machine learning—topic modeling (LDA)—to a large panel of CEO diary data. They uncover two distinct behavioral types, which they classify as "leaders," who focus on communication and coordination activities, and "managers," who emphasize production-related activities. They show that, due to horizontal differentiation of firm and manager types, appropriately matched firms and CEOs enjoy better firm performance. Mismatches are more common in lower-income economies, and mismatches can account for 13 percent of the labor productivity gap between firms in high- and middle/low-income countries.

4.3 Central Bank Communication

A related line of research analyzes the impact of communication from central banks on financial markets. As central banks rely increasingly on public statements to achieve policy objectives, an understanding of their effects is increasingly relevant.

Lucca and Trebbi (2009) use the content of Federal Open Market Committee (FOMC) statements to predict fluctuations in Treasury securities. To do this, they use two different dictionary-based methods (section 3)—Google and Factiva semantic orientation scores—to construct v̂_i, which quantifies the direction and intensity of the ith FOMC statement. In the Google score, c_i counts how many Google search hits occur when searching for phrases in i plus one of the words from a list of antonym pairs signifying positive or negative sentiment (e.g., "hawkish" versus "dovish"). These counts are mapped into v̂_i by differencing the frequency of positive and negative searches and averaging over all phrases in i. The Factiva score is calculated similarly. Next, the central bank sentiment proxies v̂_i are used to predict Treasury yields in a vector autoregression (VAR). They find that changes in statement content, as opposed to unexpected deviations in the federal funds target rate, are the main driver of changes in interest rates.

Born, Ehrmann, and Fratzscher (2014) extend this idea to study the effect of central bank sentiment on stock market returns and volatility. They construct a financial stability sentiment index v̂_i from Financial Stability Reports (FSRs) and speeches given by central bank governors. Their approach uses a sentiment dictionary to assign optimism scores to word counts c_i from central bank communications. They find that optimistic FSRs tend to increase equity prices and reduce market volatility during the subsequent month.

Hansen, McMahon, and Prat (2018) study how FOMC transparency affects debate during meetings by exploiting a change in disclosure policy. Prior to November 1993, FOMC meeting transcripts were secret, but following a policy shift transcripts became public with a time lag. Increased transparency has both potential benefits, such as more efficient and informed debate due to greater accountability of policy makers, and potential costs: transparency may make committee members more cautious, biased toward the status quo, or prone to group-think.

The authors use topic modeling (section 3.2.1) to study 149 FOMC meeting transcripts during Alan Greenspan's tenure. The unit of observation is a member–meeting. The vector c_i counts the words used by FOMC member m in meeting t, and i is defined as the pair (m, t). The outcome of interest, v_i, is a vector that includes the proportion of i's language devoted to the K different topics (estimated from the fitted topic model), the concentration of these topic weights, and the frequency of data citation by i. Next, a difference-in-differences regression estimates the effects of the change in transparency on v̂_i. The authors find that, after the move to a more transparent system, inexperienced members discuss a wider range of topics and make more references to data when discussing economic conditions (consistent with increased accountability), but also speak more like Chairman Greenspan during policy discussions (consistent with increased conformity). Overall, the accountability effect appears stronger, as inexperienced members' topics appear to be more influential in shaping future deliberation after transparency.

4.4 Nowcasting

Important variables such as unemployment, retail sales, and GDP are measured at low frequency, and estimates are released with a significant lag. Others, such as racial prejudice or local government corruption, are not captured by standard measures at all. Text produced online, such as search queries, social media posts, and listings on job websites, can be used to construct alternative real-time estimates of the current values of these variables. By contrast with the standard exercise of forecasting future variables, this process of using diverse data sources to estimate current variables has been termed "nowcasting" in the literature (Bańbura et al. 2013).

A prominent early example is the Google Flu Trends project. Zeng and Wagner (2002) note that the volume of searches or web hits seeking information related to a disease may be a strong predictor of its prevalence. Johnson et al. (2004) provide an early data point suggesting that browsing influenza-related articles on the website healthlink.com is correlated with traditional surveillance data from the Centers for Disease Control (CDC). In the late 2000s, a group of Google engineers built on this idea to create a product that predicts flu prevalence from Google searches using text regression. The results are reported in a widely cited Nature article by Ginsberg et al. (2009).

The raw data consist of "hundreds of billions of individual searches from 5 years of Google web search logs." Aggregated search counts are arranged into a vector c_i, where a document i is defined to be a particular US region in a particular week, and the outcome of interest v_i is the true prevalence of flu in the region–week. In the training data, this is taken to be equal to the rate measured by the CDC. The authors first restrict attention to the fifty million most common terms, then select those most diagnostic of an outbreak using text regression (section 3.1), specifically a variant of partial least squares regression. They first run fifty million univariate regressions of log(v_i / (1 − v_i)) on log(c_ij / (1 − c_ij)), where c_ij is the share of searches in i containing search term j. They then fit a sequence of multivariate regression models of v_i on the top n terms j, as ranked by average predictive power across regions, for n ∈ {1, 2, …}. Next, they select the value of n that yields the best fit on a hold-out sample. This yields a regression model with n = 45 terms. The model produces accurate flu rate estimates for all regions approximately 1–2 weeks ahead of the CDC's regular report publication dates.25

25 A number of subsequent papers debate the longer-term performance of Google Flu Trends. Lazer et al. (2014), for example, show that the accuracy of the Google Flu Trends model—which has not been recalibrated or updated based on more recent data—has deteriorated dramatically, and that in recent years it is outperformed by simple extrapolation from prior CDC estimates. This may reflect changes in both search patterns and the epidemiology of the flu, and it suggests a general lesson that the predictive relationship mapping text to a real outcome of interest may not be stable over time. On the other hand, Preis and Moat (2014) argue that an adaptive version of the model that more flexibly accounts for joint dynamics in flu incidence and search volume significantly improves real-time influenza monitoring.
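The selection procedure described above, univariate log-odds screening followed by choosing the number of terms on a hold-out sample, can be illustrated on simulated data. Everything below (the number of terms, the noise level, the helper names) is an assumption for illustration; the real exercise screened fifty million candidate terms.

```python
# A stylized version of the two-step term-selection procedure described above.
# Simulated weekly search shares stand in for the actual query logs.
import numpy as np

rng = np.random.default_rng(1)
logit = lambda p: np.log(p / (1 - p))

n_weeks, n_terms = 120, 30
x = rng.uniform(0.01, 0.10, size=(n_weeks, n_terms))   # c_ij: search-term shares
beta = np.zeros(n_terms); beta[:5] = 8.0               # only 5 terms are informative
flu = 1 / (1 + np.exp(-(-4 + x @ beta + rng.normal(0, 0.3, n_weeks))))  # v_i

train, hold = np.arange(0, 80), np.arange(80, 120)
y = logit(flu)

# Step 1: univariate regressions of logit(v) on logit(c_j); rank terms by R^2.
def r2(a, b):
    resid = b - np.polyval(np.polyfit(a, b, 1), a)
    return 1 - resid.var() / b.var()

scores = [r2(logit(x[train, j]), y[train]) for j in range(n_terms)]
order = np.argsort(scores)[::-1]

# Step 2: fit multivariate models on the top n terms; pick n on the hold-out.
def fit_predict(cols):
    X = np.column_stack([np.ones(n_weeks), logit(x[:, cols])])
    coef, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    return X @ coef

hold_mse = []
for n in range(1, 16):
    pred = fit_predict(order[:n])
    hold_mse.append(np.mean((pred[hold] - y[hold]) ** 2))

best_n = 1 + int(np.argmin(hold_mse))
```

The same logic scales to the actual setting by replacing the simulated shares with aggregated query counts and the simulated outcome with CDC-measured flu rates.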

Related work in economics attempts to nowcast macroeconomic variables using data on the frequency of Google search terms. In Choi and Varian (2012) and Scott and Varian (2014, 2015), search term counts are aggregated by week and by geographic location, then converted to location-specific frequency indices. They estimate spike-and-slab Bayesian forecasting models, discussed in section 3.1.4 above. Forecasts of regional retail sales, new housing starts, and tourism activity are all significantly improved by incorporating into linear models a few search-term indices relevant to each category. Their results suggest a potential for large gains in forecasting power using web browser search data.

Saiz and Simonsohn (2013) use web search results to estimate the current extent of corruption in US cities. Standard corruption measures based on surveys are available at the country and state level, but not for smaller geographies. The authors use a dictionary approach in which the index v̂_i of corruption is defined to be the ratio of search hits for the name of a geographic area i plus the word "corruption" divided by hits for the name of the geographic area alone. These counts are extracted from search engine results. As a validation, the authors first show that country-level and state-level versions of their measure correlate strongly with established corruption indices and covary in a similar way with country- and state-level demographics. They then compute their measure for US cities and study its observable correlates.

Stephens-Davidowitz (2014) uses the frequency of racially charged terms in Google searches to estimate levels of racial animus in different areas of the United States. Estimating animus via traditional surveys is challenging because individuals are often reluctant to state their true attitudes. The paper's results suggest Google searches provide a less filtered, and therefore more accurate, measure. The author uses a dictionary approach in which the index v̂_i of racial animus in area i is the share of searches originating in that area that contain a set of racist words. He then uses these measures to estimate the impact of racial animus on votes for Barack Obama in the 2008 election, finding a statistically significant and economically large negative effect on Obama's vote share relative to the Democratic vote share in the previous election.

4.5 Policy Uncertainty

Among the most influential applications of text analysis in the economics literature to date is a measure of economic policy uncertainty (EPU) developed by Baker, Bloom, and Davis (2016). Uncertainty about both the path of future government policies and the impact of current government policies has the potential to increase risk for economic actors and so to depress investment and other economic activity. The authors use text from news outlets to provide a high-frequency measure of EPU and then estimate its economic effects.

Baker, Bloom, and Davis (2016) define the unit of observation i to be a country–month. The outcome of interest v_i is the true level of economic policy uncertainty. The authors apply a dictionary method to produce estimates v̂_i based on digital archives of ten leading newspapers in the United States. An element of the input data c_ij is a count of the number of articles in newspaper j in country–month i containing at least one keyword from each of three categories defined by hand: one related to the economy, a second related to policy, and a third related to uncertainty. The raw counts are scaled by the total number of articles in the corresponding newspaper–month and normalized to have standard deviation one. The predicted value v̂_i is then defined to be a simple average of these scaled counts across newspapers.

The simplicity of this construction makes the index flexible across a broad range of applications. For instance, by including a fourth, policy-specific category of keywords, the authors can estimate narrower indices related to Federal Reserve policy, inflation, and so on. Baker, Bloom, and Davis (2016) validate v̂_i using a human audit of twelve thousand articles from 1900–2012. Teams manually scored articles on the extent to which they discuss economic policy uncertainty and the specific policies they relate to. The resulting human-coded index has a high correlation with v̂_i.

With the estimated v̂_i in hand, the authors analyze the micro- and macro-level effects of EPU. Using firm-level regressions, they first measure how firms respond to this uncertainty and find that it leads to reduced employment, reduced investment, and greater asset price volatility for the affected firm. Then, using both US and international panel VAR models, the authors find that increased v̂_i is a strong predictor of lower investment, employment, and production.

Hassan et al. (2017) measure political risk at the firm level by analyzing quarterly earnings call transcripts. Their measure captures the frequency with which policy-oriented language and "risk" synonyms co-occur in a transcript. Firms with high levels of political risk actively hedge these risks by lobbying more intensively and donating more to politicians. When a firm's political risk rises, it tends to retrench hiring and investment, consistent with the findings of Baker, Bloom, and Davis (2016) at the aggregate level. These findings indicate that political shocks are an important source of idiosyncratic firm-level risk.
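The dictionary construction behind an EPU-style index can be sketched directly: flag articles containing at least one term from each of three hand-defined categories, compute per-month shares, normalize, and average across newspapers. The keyword sets and toy "articles" below are illustrative stand-ins, not the authors' actual term lists.

```python
# A minimal sketch of a Baker-Bloom-Davis style dictionary index.
# Keyword lists and articles are hypothetical examples.
import numpy as np

ECONOMY = {"economy", "economic"}
POLICY = {"policy", "regulation", "congress", "legislation"}
UNCERTAINTY = {"uncertain", "uncertainty"}

def article_flag(text):
    """1 if the article contains at least one term from each category."""
    words = set(text.lower().split())
    return int(bool(words & ECONOMY and words & POLICY and words & UNCERTAINTY))

def epu_index(papers):
    """papers: list of newspapers, each a list of months, each a list of articles.
    Returns one index value per month, averaged across newspapers.
    Assumes each newspaper's monthly shares vary (nonzero std dev)."""
    scaled = []
    for months in papers:
        # share of flagged articles in each newspaper-month
        shares = np.array([np.mean([article_flag(a) for a in m]) for m in months])
        scaled.append(shares / shares.std())      # normalize to unit std dev
    return np.mean(scaled, axis=0)                # simple average across papers

papers = [
    [["the economy is fine", "sports news today"],
     ["economic policy uncertainty rises", "congress debates uncertainty over economic policy"]],
    [["weather report", "local election results"],
     ["policy uncertainty weighs on the economy"]],
]
index = epu_index(papers)
```

Adding a fourth category set to `article_flag` would produce the narrower topic-specific indices mentioned above.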

4.6 Media Slant

A text analysis problem that has received significant attention in the social science literature is measuring the political slant of media content. Media outlets have long been seen as having a uniquely important role in the political process, with the power to potentially sway both public opinion and policy. Understanding how and why media outlets slant the information they present is important to understanding the role media play in practice, and to informing the large body of government regulation designed to preserve a diverse range of political perspectives.

Groseclose and Milyo (2005) offer a pioneering application of text analysis methods to this problem. In their setting, i indexes a set of large US media outlets, and documents are defined to be the complete news text or broadcast transcripts for an outlet i. The outcome of interest v_i is the political slant of outlet i. To give this measure content, the authors use speeches by politicians in the US Congress to form a training sample, and define v_i within this sample to be a politician's Americans for Democratic Action (ADA) score, a measure of left–right political ideology based on congressional voting records. The predicted values v̂_i for the media outlets thus place them on the same left–right scale as the politicians, and answer the question "what kind of politician does this news outlet's content sound most similar to?"

The raw data are the full text of speeches by congresspeople and news reports by media outlets over a period spanning the 1990s to the early 2000s.26 The authors dramatically reduce the dimensionality of the data in an initial step by deciding to focus on a particularly informative subset of phrases: the names of two hundred think tanks. These think tanks are widely viewed as having clear political positions (e.g., the Heritage Foundation on the right and the NAACP on the left). The relative frequency with which a politician cites conservative as opposed to liberal think tanks turns out to be strongly correlated with the politician's ideology. The paper's premise is that the citation frequencies of news outlets will then provide a good index of those outlets' political slants. The features of interest c_i are a (1 × 50) vector of citation counts for each of forty-four highly cited think tanks plus six groups of smaller think tanks.

26 For members of Congress, the authors use all entries in the Congressional Record from January 1, 1993 to December 31, 2002. The text includes both floor speeches and documents the member chose to insert in the record but did not read on the floor. For news outlets, the time period covered is different for different outlets, with start dates as early as January 1990 and end dates as late as July 2004.

The text analysis is based on a supervised generative model (section 3.2.2). The utility that congress member or media firm i derives from citing think tank j is

  U_ij = a_j + b_j v_i + e_ij,

where v_i is the observable ADA score of a congress member i or the unobserved slant of media outlet i, and e_ij is an error distributed type-I extreme value. The coefficient b_j captures the extent to which think tank j is cited relatively more by conservatives. The model is fit by maximum likelihood, with the parameters (a_j, b_j) and the unknown slants v_i estimated jointly. This is an efficient but computationally intensive approach to estimation, and it constrains the authors' focus to twenty outlets. This limitation can be sidestepped using more recent approaches such as Taddy's (2013b) multinomial inverse regression.

Figure 4 shows the results, which suggest three main findings. First, the media outlets are all relatively centrist: with one exception, they are all to the left of the average Republican and to the right of the average Democrat.27 Second, the ordering matches conventional wisdom, with the New York Times and Washington Post on the left, and Fox News and the Washington Times on the right. Third, the large majority of outlets fall to the left of the average in Congress, which is denoted in the figure by "average US voter." The last fact underlies the authors' main conclusion, which is that there is an overall liberal bias in the media.

27 The one notable exception is the Wall Street Journal, which is generally considered to be right-of-center but which is estimated by Groseclose and Milyo (2005) to be the most left-wing outlet in their sample. This may reflect an idiosyncrasy specific to the way they cite think tanks; Gentzkow and Shapiro (2010) use a broader sample of text features and estimate a much more conservative slant for the Wall Street Journal.

Gentzkow and Shapiro (2010) build on the Groseclose and Milyo (2005) approach to measure the slant of 433 US daily newspapers. The main difference in approach is that Gentzkow and Shapiro (2010) omit the initial step that restricts the space of features to mentions of think tanks, and instead consider all phrases that appear in the 2005 Congressional Record as potential predictors, letting the data select those that are most diagnostic of ideology. These could potentially be think tank names, but they turn out instead to be politically charged phrases such as "death tax," "bring our troops home," and "war on terror."

After standard preprocessing—stemming and omitting stop words—the authors produce counts of all 2-grams and 3-grams by speaker. They then select the top one thousand phrases (five hundred of each length) by a χ² criterion that captures the degree to which each phrase is diagnostic of the speaker's party. This is the standard χ²-test statistic for the null hypothesis that phrase j is used equally often by Democrats and Republicans, and it will be high for phrases that are both used frequently and used asymmetrically by the parties.28

28 The statistic is

  χ²_j = (f_jr f_∼jd − f_jd f_∼jr)² / [(f_jr + f_jd)(f_jr + f_∼jd)(f_∼jr + f_jd)(f_∼jr + f_∼jd)],

where f_jd and f_jr denote the number of times phrase j is used by Democrats or Republicans, respectively, and f_∼jd and f_∼jr denote the number of times phrases other than j are used by Democrats and Republicans, respectively.
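The χ² selection statistic in footnote 28 is straightforward to compute from party-level phrase counts. The phrases and counts below are invented for illustration; only the formula follows the text.

```python
# Chi-squared phrase-selection statistic from footnote 28, on toy counts.
import numpy as np

phrases = ["death tax", "bring our troops home", "health care", "american people"]
f_jd = np.array([10, 150, 200, 190])   # uses of each phrase by Democrats (toy)
f_jr = np.array([90, 20, 205, 235])    # uses of each phrase by Republicans (toy)

def chi2(f_jd, f_jr):
    """Vectorized statistic for every phrase j at once."""
    f_njd = f_jd.sum() - f_jd          # uses of all other phrases by Democrats
    f_njr = f_jr.sum() - f_jr          # uses of all other phrases by Republicans
    num = (f_jr * f_njd - f_jd * f_njr) ** 2
    den = ((f_jr + f_jd) * (f_jr + f_njd)
           * (f_njr + f_jd) * (f_njr + f_njd))
    return num / den

stats = chi2(f_jd, f_jr)
top = [phrases[j] for j in np.argsort(stats)[::-1][:2]]   # most partisan phrases
```

As the text notes, the statistic rewards phrases that are both frequent and used asymmetrically, so the two partisan phrases dominate the two neutral ones in this toy example.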

Figure 4. Distribution of Political Orientation: Media Outlets and Members of Congress

Source: Groseclose and Milyo (2005). Reprinted with permission from the Quarterly Journal of Economics.

Next, a two-stage supervised generative method is used to predict newspaper slant v_i from the selected features. In the first stage, the authors run a separate regression for each phrase j of counts c_ij on speaker i's ideology, which is measured as the 2004 Republican vote share in the speaker's district. They then use the estimated coefficients β̂_j to produce predicted slant v̂_i ∝ Σ_{j=1}^{1,000} β̂_j c_ij for the unknown newspapers i.29

29 As Taddy (2013b) notes, this method (which Gentzkow and Shapiro 2010 derive in an ad hoc fashion) is essentially partial least squares. It differs from the standard implementation in that the variables v_i and c_ij would normally be standardized. Taddy (2013b) shows that doing so increases the in-sample predictive power of the measure from 0.37 to 0.57.

The main focus of the study is characterizing the incentives that drive newspapers' choice of slant. With the estimated v̂_i in hand, the authors estimate a model of consumer demand in which a consumer's utility from reading newspaper i depends on the distance between i's slant v_i and an ideal slant v*, which is greater the more conservative the consumer's ideology. Estimates of this model using zipcode-level circulation data imply the level of slant that newspapers would choose if their only incentive were to maximize profits. The authors then compare this profit-maximizing slant to the level actually chosen by newspapers, and ask whether the deviations can be predicted by the identity of the newspaper's owner or by other nonmarket factors such as the party of local incumbent politicians. They find that profit maximization fits the data well, and that ownership plays no role in explaining the choice of slant. In this study, v̂_i is both an independent variable of interest (in the demand analysis) and an outcome of interest (in the supply analysis).

Note that both Groseclose and Milyo (2005) and Gentzkow and Shapiro (2010) use a two-step procedure in which they reduce the dimensionality of the data in a first stage and then estimate a predictive model in the second. Taddy (2013b) shows how to combine a more sophisticated generative model with a novel estimation algorithm to fit the predictive model in a single step using the full set of phrases in the data. He shows that this substantially increases the in-sample predictive power of the measure.

Greenstein, Gu, and Zhu (2016) analyze the extent of bias and slant among Wikipedia contributors using similar methods. They find that contributors tend to edit articles with slants in opposition to their own. They also show that contributors' slants become less extreme as they gain experience, and that the bias reduction is largest for those with the most extreme initial biases.

4.7 Market Definition and Innovation Impact

Many important questions in industrial organization hinge on the appropriate definition of product markets. Standard industry definitions can be an imperfect proxy for the economically relevant concept. Hoberg and Phillips (2016) provide a novel way of classifying industries based on product descriptions in the text of company disclosures. This allows for flexible industry classifications that may vary over time as firms and economies evolve, and it allows the researchers to analyze the effect of shocks on competition and product offerings.

Each publicly traded firm in the United States must file an annual 10-K report describing, among other aspects of its business, the products that it offers. The unit of analysis i is a firm–year. Token counts from the business description section of the ith 10-K filing are represented in the vector c_i. A pairwise cosine similarity score s_ij, based on the angle between c_i and c_j, describes the closeness of product offerings for each pair i and j in the same filing year. Industries are then defined by clustering firms according to their cosine similarities.
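The classification just described rests on two operations, cosine similarity between count vectors and agglomerative clustering, which can be sketched as follows. The four "firms" and their token counts are toy examples, not actual 10-K text, and the helper function is our own simplification.

```python
# Sketch of text-based industry classification: cosine similarity between
# token-count vectors, then single-linkage agglomeration to a target number
# of clusters. All data are toy examples.
import numpy as np

# rows: firms; columns: token counts from hypothetical business descriptions
C = np.array([
    [8, 2, 0, 0],   # firm A: mostly "software"-type tokens
    [7, 3, 0, 1],   # firm B: similar to A
    [0, 1, 9, 4],   # firm C: mostly "drug"-type tokens
    [1, 0, 8, 5],   # firm D: similar to C
], dtype=float)

norms = np.linalg.norm(C, axis=1, keepdims=True)
S = (C / norms) @ (C / norms).T          # s_ij: pairwise cosine similarity

def agglomerate(S, n_clusters):
    """Greedy single linkage: repeatedly merge the two most similar clusters."""
    clusters = [{i} for i in range(len(S))]
    while len(clusters) > n_clusters:
        best, pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                link = max(S[i, j] for i in clusters[a] for j in clusters[b])
                if link > best:
                    best, pair = link, (a, b)
        a, b = pair
        clusters[a] |= clusters.pop(b)
    return clusters

industries = agglomerate(S, 2)
```

In the actual application the same steps run over thousands of firm–year vectors and stop at three hundred clusters rather than two.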

The clustering algorithm begins by assuming each firm is its own industry, and it gradually agglomerates firms into industries by grouping a firm with the cluster containing its nearest neighbor according to s_ij. The algorithm terminates when the number of industries (clusters) reaches three hundred, a number chosen for comparability to Standard Industrial Classification and North American Industry Classification System codes.30

30 Best, Hjort, and Szakonyi (2017) use a similar approach to classify products in their study of public procurement and organizational bureaucracy in Russia.

After establishing an industry assignment v̂_i for each firm–year, the authors examine the effect of military and software industry shocks on competition and product offerings among firms. As an example, they find that the events of September 11, 2001, increased entry in high-demand military markets and pushed products in this industry toward "non-battlefield information gathering and products intended for potential ground conflicts."

In a similar vein, Kelly et al. (2018) use cosine similarity among patent documents to create new indicators of patent quality. They assign higher quality to patents that are novel, in that they have low similarity with the existing stock of patents, and impactful, in that they have high similarity with subsequent patents. They then show that text-based novelty and similarity scores correlate strongly with measures of market value. Atalay et al. (2017) use text from job ads to measure task content, and they use their measure to show that within-occupation shifts in task content are at least as important as employment shifts across occupations in describing the large secular reallocation of routine tasks from humans to machines.

4.8 Topics in Research, Politics, and Law

A number of studies apply topic models (section 3.2.1) to describe how the focus of attention in a specific text corpus shifts over time.

A seminal contribution in this vein is Blei and Lafferty's (2007) analysis of topics in Science. Documents i are individual articles, the data c_i are counts of individual words, and the outcome of interest v_i is a vector of weights indicating the share of a given article devoted to each of one hundred latent topics. The authors extend the baseline LDA model of Blei, Ng, and Jordan (2003) to allow the importance of one topic in a particular article to be correlated with the presence of other topics. They fit the model using all Science articles from 1990–99. The results deliver an automated classification of article content into semantically coherent topics such as evolution, DNA and genetics, cellular biology, and volcanoes.

Applying similar methods in the political domain, Quinn et al. (2010) use a topic model to identify the issues being discussed in the US Senate over the period 1997–2004. Their approach deviates from the baseline LDA model in two ways. First, they assume that each speech is associated with a single topic. Second, their model incorporates time-series dynamics that allow the proportion of speeches generated by a given topic to evolve gradually over the sample, similar to the dynamic topic model of Blei and Lafferty (2006). Their preferred specification is a model with forty-two topics, a number chosen to maximize the subjective interpretability of the resulting topics.

Table 1 shows the words with the highest weights in each of twelve fitted topics. The labels "Judicial Nominations," "Constitutional," and so on are assigned by hand by the authors. The results suggest that the automated procedure successfully isolates coherent topics of congressional debate. After discussing the structure of topics in the fitted model, the authors track the relative importance of the topics across congressional sessions and argue that spikes in discussion of particular topics track, in an intuitive way, the occurrence of important debates and external events.

TABLE 1 Congressional Record Topics and Key Words

Topic (Short Label): Keys
1. Judicial nominations: nomine, confirm, nomin, circuit, hear, court, judg, judici, case, vacanc
2. Constitutional: case, court, attornei, supreme, justic, nomin, judg, m, decis, constitut
3. Campaign finance: campaign, candid, elect, monei, contribut, polit, soft, ad, parti, limit
4. Abortion: procedur, abort, babi, thi, life, doctor, human, ban, decis, or
5. Crime 1 [violent]: enforc, act, crime, gun, law, victim, violenc, abus, prevent, juvenil
6. Child protection: gun, tobacco, smoke, kid, show, firearm, crime, kill, law, school
7. Health 1 [medical]: diseas, cancer, research, health, prevent, patient, treatment, devic, food
8. Social welfare: care, health, act, home, hospit, support, children, educ, student, nurs
9. Education: school, teacher, educ, student, children, test, local, learn, district, class
10. Military 1 [manpower]: veteran, va, forc, militari, care, reserv, serv, men, guard, member
11. Military 2 [infrastructure]: appropri, defens, forc, report, request, confer, guard, depart, fund, project
12. Intelligence: intellig, homeland, commiss, depart, agenc, director, secur, base, defens

Source: Quinn et al. (2010). Reprinted with permission from John Wiley and Sons.
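Estimates like the topic weights behind Table 1 come from LDA-type models. Below is a minimal collapsed Gibbs sampler for vanilla LDA (not the single-topic, dynamic variant of Quinn et al.), run on a toy corpus; all counts and hyperparameters are illustrative choices of ours.

```python
# A minimal collapsed Gibbs sampler for vanilla LDA on a toy corpus.
import numpy as np

rng = np.random.default_rng(2)

docs = [
    [0, 0, 1, 2, 1],        # documents as lists of word ids
    [0, 1, 1, 2, 0],
    [3, 4, 4, 5, 3],
    [4, 3, 5, 5, 4],
]
V, K = 6, 2                 # vocabulary size, number of topics
alpha, beta = 0.5, 0.1      # symmetric Dirichlet hyperparameters

# count tables: topic-word, doc-topic, topic totals
nkw = np.zeros((K, V)); ndk = np.zeros((len(docs), K)); nk = np.zeros(K)
z = []                      # current topic assignment of every token
for d, doc in enumerate(docs):
    zd = rng.integers(K, size=len(doc))
    z.append(zd)
    for w, t in zip(doc, zd):
        nkw[t, w] += 1; ndk[d, t] += 1; nk[t] += 1

for _ in range(200):        # Gibbs sweeps
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            t = z[d][n]     # remove the token's current assignment
            nkw[t, w] -= 1; ndk[d, t] -= 1; nk[t] -= 1
            # full conditional p(z = k | everything else)
            p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
            t = rng.choice(K, p=p / p.sum())
            z[d][n] = t
            nkw[t, w] += 1; ndk[d, t] += 1; nk[t] += 1

theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)  # doc-topic shares
```

Because the first two toy documents share one vocabulary and the last two share another, the sampler should assign each pair a common dominant topic, mirroring how fitted topics separate "Judicial nominations" words from "Education" words in the real corpus.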

Sim, Routledge, and Smith (2015) estimate a topic model from the text of amicus briefs to the Supreme Court of the United States. They show that the overall topical composition of briefs for a given case, particularly along a conservative–liberal dimension, is highly predictive of how individual judges vote in the case.

5. Conclusion

Digital text provides a rich repository of information about economic and social activity. Modern statistical tools give researchers the ability to extract this information and encode it in a quantitative form amenable to descriptive or causal analysis. Both the availability of text data and the frontier of methods are expanding rapidly, and we expect the importance of text in empirical economics to continue to grow.

The review of applications above suggests a number of areas where innovation should proceed rapidly in coming years. First, a large share of text analysis applications continue to rely on ad hoc dictionary methods rather than deploying more sophisticated methods for feature selection and model training. As we have emphasized, dictionary methods are appropriate in cases where prior information is strong and the availability of appropriately labeled training data is limited. Experience in other fields, however, suggests that modern methods will likely outperform ad hoc approaches in a substantial share of cases.

Second, some of the workhorse methods of text analysis, such as penalized linear or logistic regression, have still seen limited application in social science. In other contexts, these methods provide a robust baseline that performs similarly to or better than more complex methods. We expect the domains in which these methods are applied to grow.

Finally, virtually all of the methods applied to date, including those we would label as sophisticated or on the frontier, are based on fitting predictive models to simple counts of text features.
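As a concrete illustration of the penalized-regression baseline discussed above, here is a small L2-penalized logistic regression on simulated bag-of-words counts, trained by plain gradient descent. The corpus, penalty, and learning rate are arbitrary toy choices of ours, not a recipe from any of the studies reviewed.

```python
# A sketch of the penalized logistic regression baseline on toy word counts.
import numpy as np

rng = np.random.default_rng(3)

V, n = 20, 200
# class 0 favors the first half of the vocabulary, class 1 the second half
probs = np.array([np.r_[np.full(10, 0.08), np.full(10, 0.02)],
                  np.r_[np.full(10, 0.02), np.full(10, 0.08)]])
y = rng.integers(2, size=n)
X = np.array([rng.multinomial(30, probs[c] / probs[c].sum()) for c in y], float)

def fit_penalized_logit(X, y, lam=0.1, lr=0.01, iters=500):
    w = np.zeros(X.shape[1]); b = 0.0
    for _ in range(iters):
        p = 1 / (1 + np.exp(-(X @ w + b)))
        grad_w = X.T @ (p - y) / len(y) + lam * w   # L2 penalty shrinks coefficients
        grad_b = np.mean(p - y)
        w -= lr * grad_w; b -= lr * grad_b
    return w, b

w, b = fit_penalized_logit(X, y)
pred = (X @ w + b > 0).astype(int)
accuracy = (pred == y).mean()
```

Swapping the L2 penalty for an L1 penalty (lasso) would additionally zero out uninformative words, which is often the more natural choice in high-dimensional text settings.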
Richer representations, such as tinue to rely on ad hoc dictionary methods word embeddings (3.3), and linguistic mod- rather than deploying more sophisticated els that draw on natural language processing 570 Journal of Economic Literature, Vol. LVII (September 2019) tools have seen tremendous success else- ­Economic Time Series Using Targeted Predictors.” Journal of Econometrics 146 (2): 304–17. where, and we see great potential for their Baker, Scott R., Nicholas Bloom, and Steven J. Davis. application in economics. 2016. “Measuring Economic Policy Uncertainty.” The rise of text analysis is part of a broader Quarterly Journal of Economics 131 (4): 1593–636. Ba´nbura, Marta, Domenico Giannone, Michele trend toward greater use of machine learn- Modugno, and Lucrezia Reichlin. 2013. “Now-Cast- ing and related statistical methods in eco- ing and the Real-Time Data Flow.” In Handbook of nomics. With the growing availability of Economic Forecasting, Vol. 2A, edited by Allan Tim- mermann and Graham Elliot, 195–237. Sebastopol: ­high-dimensional data in many domains— O’Reilly Media. from consumer purchase and browsing Bandiera, Oriana, Stephen Hansen, Andrea Prat, and behavior, to satellite and other spatial data, to Raffaella Sadun. 2017. “CEO Behavior and Firm Performance.” NBER Working Paper 23248. genetics and ­neuro-economics—the returns Belloni, Alexandre, Victor Chernozhukov, and Chris- are high to economists investing in learning tian B. Hansen. 2013. “Inference for High-Dimen- these methods and to increasing the flow of sional Sparse Econometric Models.” In Advances in Economics and Econometrics: Tenth World Con- ideas between economics and fields such as gress, Vol. 3, 245–95. Cambridge: Cambridge Uni- statistics and computer science, where fron- versity Press. tier innovations in these methods are taking Best, Michael Carlos, Jonas Hjort, and David Szakonyi. 2017. “Individuals and Organizations as Sources of place. 