Journal of Economic Literature 2019, 57(3), 535–574 https://doi.org/10.1257/jel.20181020

Text as Data†

Matthew Gentzkow, Bryan Kelly, and Matt Taddy*

An ever-increasing share of human interaction, communication, and culture is recorded as digital text. We provide an introduction to the use of text as an input to economic research. We discuss the features that make text different from other forms of data, offer a practical overview of relevant statistical methods, and survey a variety of applications. (JEL C38, C55, L82, Z13)

* Gentzkow: Stanford University. Kelly: Yale University and AQR Capital Management. Taddy: University of Chicago Booth School of Business.
† Go to https://doi.org/10.1257/jel.20181020 to visit the article page and view author disclosure statement(s).

1. Introduction

New technologies have made available vast quantities of digital text, recording an ever-increasing share of human interaction, communication, and culture. For social scientists, the information encoded in text is a rich complement to the more structured kinds of data traditionally used in research, and recent years have seen an explosion of empirical economics research using text as data.

To take just a few examples: In finance, text from financial news, social media, and company filings is used to predict asset price movements and study the causal impact of new information. In macroeconomics, text is used to forecast variation in inflation and unemployment, and to estimate the effects of policy uncertainty. In media economics, text from news and social media is used to study the drivers and effects of political slant. In industrial organization and marketing, text from advertisements and product reviews is used to study the drivers of consumer decision making. In political economy, text from politicians' speeches is used to study the dynamics of political agendas and debate.

The most important way that text differs from the kinds of data often used in economics is that text is inherently high-dimensional. Suppose that we have a sample of documents, each of which is w words long, and suppose that each word is drawn from a vocabulary of p possible words. Then the unique representation of these documents has dimension p^w. A sample of thirty-word Twitter messages that use only the one thousand most common words in the English language, for example, has roughly as many dimensions as there are atoms in the universe.

A consequence is that the statistical methods used to analyze text are closely related to those used to analyze high-dimensional data in other domains, such as machine learning and computational biology. Some methods, such as lasso and other penalized regressions, are applied to text more or less exactly as they are in other settings. Other methods, such as topic models and multinomial inverse regression, are close cousins of more general methods adapted to the specific structure of text data.
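The dimension count in the Twitter example is easy to verify directly. A back-of-the-envelope sketch (the atom count below is the commonly cited order-of-magnitude estimate, not a figure from this article):

```python
# Unique representations of a w-word document over a p-word vocabulary: p**w.
vocab_size = 1000        # the one thousand most common English words
doc_length = 30          # a thirty-word Twitter message

n_representations = vocab_size ** doc_length   # 1000^30 = 10^90
atoms_in_universe = 10 ** 80                   # common order-of-magnitude estimate

assert n_representations == 10 ** 90
assert n_representations > atoms_in_universe
```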

In all of the cases we consider, the analysis can be summarized in three steps:

1. Represent raw text D as a numerical array C;

2. Map C to predicted values V̂ of unknown outcomes V; and

3. Use V̂ in subsequent descriptive or causal analysis.

In the first step, the researcher must impose some preliminary restrictions to reduce the dimensionality of the data to a manageable level. Even the most cutting-edge high-dimensional techniques can make nothing of 1,000^30-dimensional raw Twitter data. In almost all the cases we discuss, the elements of C are counts of tokens: words, phrases, or other predefined features of text. This step may involve filtering out very common or uncommon words; dropping numbers, punctuation, or proper names; and restricting attention to a set of features such as words or phrases that are likely to be especially diagnostic. The mapping from raw text to C leverages prior information about the structure of language to reduce the dimensionality of the data prior to any statistical analysis.

The second step is where high-dimensional statistical methods are applied. In a classic example, the data is the text of emails, and the unknown variable of interest V is an indicator for whether the email is spam. The prediction V̂ determines whether or not to send the email to a spam filter. Another classic task is sentiment prediction (e.g., Pang, Lee, and Vaithyanathan 2002), where the unknown variable V is the true sentiment of a message (say, positive or negative), and the prediction V̂ might be used to identify positive reviews or comments about a product. A third task is predicting the incidence of local flu outbreaks from Google searches, where the outcome V is the true incidence of flu.

In these examples, and in the vast majority of settings where text analysis has been applied, the ultimate goal is prediction rather than causal inference. The interpretation of the mapping from C to V̂ is not usually an object of interest. Why certain words appear more often in spam, or why certain searches are correlated with flu, is not important so long as they generate highly accurate predictions. For example, Scott and Varian (2014, 2015) use data from Google searches to produce high-frequency estimates of macroeconomic variables such as unemployment claims, retail sales, and consumer sentiment that are otherwise available only at lower frequencies from survey data. Groseclose and Milyo (2005) compare the text of news outlets to speeches of congresspeople in order to estimate the outlets' political slant. A large literature in finance following Antweiler and Frank (2004) and Tetlock (2007) uses text from the internet or the news to predict stock prices.

In many social science studies, however, the goal is to go further and, in the third step, use text to infer causal relationships or the parameters of structural economic models. Stephens-Davidowitz (2014) uses Google search data to estimate local areas' racial animus, then studies the causal effect of racial animus on votes for Barack Obama in the 2008 election. Gentzkow and Shapiro (2010) use congressional and news text to estimate each news outlet's political slant, then study the supply and demand forces that determine slant in equilibrium. Engelberg and Parsons (2011) measure local news coverage of earnings announcements, then use the relationship between coverage and trading by local investors to separate the causal effect of news from other sources of correlation between news and stock prices.
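In code, the three steps amount to a simple pipeline skeleton. The function names below are hypothetical placeholders for whatever representation, prediction, and analysis choices a particular study makes:

```python
def text_pipeline(raw_docs, tokenize, predict, analyze):
    """Three-step skeleton: raw text -> array C -> predictions V-hat -> analysis.

    `tokenize`, `predict`, and `analyze` are hypothetical placeholders,
    not functions from any particular library.
    """
    C = [tokenize(doc) for doc in raw_docs]   # step 1: represent text as C
    V_hat = [predict(c) for c in C]           # step 2: map C to predictions
    return analyze(V_hat)                     # step 3: descriptive or causal analysis

# Toy usage: split into tokens, "predict" document length, sum in the analysis.
result = text_pipeline(["good night good night", "sweet sorrow"],
                       tokenize=str.split, predict=len, analyze=sum)
assert result == 6   # 4 tokens + 2 tokens
```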

In this paper, we provide an overview of methods for analyzing text and a survey of current applications in economics and related social sciences. The methods discussion is forward looking, providing an overview of methods that are currently applied in economics as well as those that we expect to have high value in the future. Our discussion of applications is selective and necessarily omits many worthy papers. We highlight examples that illustrate particular methods and use text data to make important substantive contributions even if they do not apply methods close to the frontier.

A number of other excellent surveys have been written in related areas. See Evans and Aceves (2016) and Grimmer and Stewart (2013) for related surveys focused on text analysis in sociology and political science, respectively. For methodological surveys, Bishop (2006), Hastie, Tibshirani, and Friedman (2009), and Murphy (2012) cover contemporary statistics and machine learning in general, while Jurafsky and Martin (2009) overview methods from computational linguistics and natural language processing. The Spring 2014 issue of the Journal of Economic Perspectives contains a symposium on "big data," which surveys broader applications of high-dimensional statistical methods to economics.

In section 2 we discuss representing text data as a manageable (though still high-dimensional) numerical array C; in section 3 we discuss methods from data mining and machine learning for predicting V from C. Section 4 then provides a selective survey of text analysis applications in social science, and section 5 concludes.

2. Representing Text as Data

When humans read text, they do not see a vector of dummy variables, nor a sequence of unrelated tokens. They interpret words in light of other words, and extract meaning from the text as a whole. It might seem obvious that any attempt to distill text into meaningful data must similarly take account of complex grammatical structures and rich interactions among words.

The field of computational linguistics has made tremendous progress in this kind of interpretation. Most of us have mobile phones that are capable of complex speech recognition. Algorithms exist to efficiently parse grammatical structure, disambiguate different senses of words, distinguish key points from secondary asides, and so on.

Yet virtually all analysis of text in the social sciences, like much of the text analysis in machine learning more generally, ignores the lion's share of this complexity. Raw text consists of an ordered sequence of language elements: words, punctuation, and white space. To reduce this to a simpler representation suitable for statistical analysis, we typically make three kinds of simplifications: dividing the text into individual documents i, reducing the number of language elements we consider, and limiting the extent to which we encode dependence among elements within documents. The result is a mapping from raw text to a numerical array C. A row c_i of C is a numerical vector with each element indicating the presence or count of a particular language token in document i.

2.1 What Is a Document?

The first step in constructing C is to divide raw text D into individual documents {D_i}. In many applications, this is governed by the level at which the attributes of interest V are defined. For spam detection, the outcome of interest is defined at the level of individual emails, so we want to divide text that way too. If V is daily stock price movements that we wish to predict from the prior day's news text, it might make sense to divide the news text by day as well.

In other cases, the natural way to define a document is not so clear. If we wish to
­predict legislators’ partisanship from their that occur fewer than k​ ​ times for some arbi- floor speeches (Gentzkow, Shapiro, and trary small integer ​k​. Taddy 2016), we could aggregate speech An approach that excludes both common so a document is a ­speaker–day, a ­speaker– and rare words and has proved very useful year, or all speech by a given speaker during in practice is filtering by term“­ frequency– the time she is in Congress. When we use inverse document frequency” (tf–idf).­ methods that treat documents as indepen- For a word or other feature ​j​ in document i​​, dent (which is true most of the time), finer term frequency (​t ​fij​​) is the count ​​cij​​ of occur- partitions will typically ease computation at rences of ​j​ in ​i​. Inverse document frequency the cost of limiting the dependence we are (​id ​f​j​​​) is the log of one over the share of able to capture. Theoretical guidance for the documents containing j​​: ​log(n d​ ​​)​ where ​​d​​ / j j right level of aggregation is often limited, so ​ i​ ​​​1​ ​c​​ 0 ​ and ​n​ is the total number of = ∑ [ ij> ] this is an important dimension along which documents. The object of interest ­tf–idf to check the sensitivity of results. is the product ​t ​f​​ id ​f​​. Very rare words ij × j will have low ­­tf–idf scores because ​t ​f​​ will 2.2 Feature Selection ij be low. Very common words that appear in To reduce the number of features to some- most or all documents will have low ­­tf–idf thing manageable, a common first step is to scores because id​ ​f​j​​​ will be low. (Note that strip out elements of the raw text other than this improves on simply excluding words that words. This might include punctuation, num- occur frequently because it will keep words bers, HTML tags, proper names, and so on. that occur frequently in some documents but It is also common to remove a subset of do not appear in others; these often provide words that are either very common or very useful information.) 
A common practice is to rare. Very common words, often called “stop keep only the words within each document ​i​ words,” include articles (“the,” “a”), conjunc- with ­­tf–idf scores above some rank or cutoff. tions (“and,” “or”), forms of the verb “to be,” A final step that is commonly used to and so on. These words are important to the reduce the feature space is stemming: replac- grammatical structure of sentences, but they ing words with their root such that, e.g., typically convey relatively little meaning on “economic,” “economics,” “economically” their own. The frequency of “the” is proba- are all replaced by the stem “economic.” The bly not very diagnostic of whether an email Porter stemmer (Porter 1980) is a standard is spam, for example. Common practice is stemming tool for English language text. to exclude stop words based on a ­predefined All of these cleaning steps reduce the list.1 Very rare words do convey meaning, but number of unique language elements we their added computational cost in expand- must consider and thus the dimensional- ing the set of features that must be consid- ity of the data. This can provide a massive ered often exceeds their diagnostic value. computational benefit, and it is also often A common­ approach is to exclude all words key to getting more interpretable model fits (e.g., in topic modeling). However, each of these steps requires careful decisions about 1 There is no single stop word list that has become a the elements likely to carry meaning in a standard. How aggressive one wants to be in filtering stop 2 words depends on the application. The web page http:// particular application. One researcher’s www.ranks.nl/stopwords shows several common stop word lists, including the one built into the database software SQL and the list claimed to have been used in early ver- 2 Denny and Spirling (2018) discuss the sensitivity of sions of Google search. 
2.3 n-grams

Producing a tractable representation also requires that we limit dependence among language elements. A fairly mild step in this direction, for example, might be to parse documents into distinct sentences and encode features of these sentences while ignoring the order in which they occur. The most common methodologies go much further.

The simplest and most common way to represent a document is called bag-of-words. The order of words is ignored altogether, and c_i is a vector whose length is equal to the number of words in the vocabulary and whose elements c_ij are the number of times word j occurs in document i. Suppose that the text of document i is

Good night, good night!
Parting is such sweet sorrow.

After stemming, removing stop words, and removing punctuation, we might be left with "good night good night part sweet sorrow." The bag-of-words representation would then have c_ij = 2 for j ∈ {good, night}, c_ij = 1 for j ∈ {part, sweet, sorrow}, and c_ij = 0 for all other words in the vocabulary.

This scheme can be extended to encode a limited amount of dependence by counting unique phrases rather than unique words. A phrase of length n is referred to as an n-gram. For example, in our snippet above, the count of 2-grams (or "bigrams") would have c_ij = 2 for j = good.night, c_ij = 1 for j including night.good, night.part, part.sweet, and sweet.sorrow, and c_ij = 0 for all other possible 2-grams. The bag-of-words representation then corresponds to counts of 1-grams.

Counting n-grams of order n > 1 yields data that describe a limited amount of the dependence between words. Specifically, the n-gram counts are sufficient for estimation of an n-order homogeneous Markov model across words (i.e., the model that arises if we assume that word choice is only dependent upon the previous n words). This can lead to richer modeling. In analysis of partisan speech, for example, single words are often insufficient to capture the patterns of interest: "death tax" and "tax break" are phrases with strong partisan overtones that are not evident if we look at the single words "death," "tax," and "break" (see, e.g., Gentzkow and Shapiro 2010).

Unfortunately, the dimension of c_i increases exponentially quickly with the order n of the phrases tracked. The majority of text analyses consider n-grams up to two or three at most, and the ubiquity of these simple representations (in both machine learning and social science) reflects a belief that the return to richer n-gram modeling is usually small relative to the cost. Best practice in many cases is to begin analysis by focusing on single words. Given the accuracy obtained with words alone, one can then evaluate if it is worth the extra time to move on to 2-grams or 3-grams.

2.4 Richer Representations

While rarely used in the social science literature to date, there is a vast array of methods from computational linguistics that capture richer features of text and may have high return in certain applications. One basic step beyond the simple n-gram counting above is to use sentence syntax to inform the text tokens used to summarize a document. For example, Goldberg and Orwant (2013) describe syntactic n-grams where words are grouped together whenever their meaning depends upon each other, according to a model of language syntax.

An alternative approach is to move beyond treating documents as counts of language tokens, and to instead consider the ordered sequence of transitions between words. In this case, one would typically break the document into sentences, and treat each as a separate unit for analysis. A single sentence of length s (i.e., containing s words) is then represented as a binary p × s matrix S, where the nonzero elements of S indicate occurrence of the row-word in the column-position within the sentence, and p is the length of the vocabulary. Such representations lead to a massive increase in the dimensions of the data to be modeled, and analysis of this data tends to proceed through word embedding: the mapping of words to a location in ℝ^K for some K ≪ p, such that the sentences are then sequences of points in this K-dimensional space. This is discussed in detail in section 3.3.
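The counting schemes of section 2.3 can be sketched in a few lines. Here is a minimal implementation applied to the snippet above (the helper function is our own, not a standard library routine):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count n-grams: contiguous runs of n tokens (n = 1 gives bag-of-words)."""
    return Counter(tuple(tokens[k:k + n]) for k in range(len(tokens) - n + 1))

tokens = "good night good night part sweet sorrow".split()

bag_of_words = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)

# As in the text: c_ij = 2 for "good" and "night", 1 for the others...
assert bag_of_words[("good",)] == 2 and bag_of_words[("sweet",)] == 1
# ...and good.night occurs twice, night.good once.
assert bigrams[("good", "night")] == 2 and bigrams[("night", "good")] == 1
```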
2.5 Other Practical Considerations

It is worth mentioning two details that can cause practical social science applications of these methods to diverge a bit from the ideal case considered in the statistics literature. First, researchers sometimes receive data in a pre-aggregated form. In the analysis of Google searches, for example, one might observe the number of searches containing each possible keyword on each day, but not the raw text of the individual searches. This means documents must be similarly aggregated (to days, rather than individual searches), and it also means that the natural representation where c_ij is the number of occurrences of word j on day i is not available. This is probably not a significant limitation, as the missing information (how many times per search a word occurs conditional on occurring at least once) is unlikely to be essential, but it is useful to note when mapping practice to theory.

A more serious issue is that researchers sometimes do not have direct access to the raw text and must access it through some interface such as a search engine. For example, Gentzkow and Shapiro (2010) count the number of newspaper articles containing partisan phrases by entering the phrases into a search interface (e.g., for the database ProQuest) and counting the number of matches they return. Baker, Bloom, and Davis (2016) perform similar searches to count the number of articles mentioning terms related to policy uncertainty. Saiz and Simonsohn (2013) count the number of web pages mentioning combinations of city names and terms related to corruption by entering queries in a search engine. Even if one can automate the searches in these cases, it is usually not feasible to produce counts for very large feature sets (e.g., every two-word phrase in the English language), and so the initial feature selection step must be relatively aggressive. Relatedly, interacting through a search interface means that there is no simple way to retrieve objects like the set of all words occurring at least twenty times in the corpus of documents, or the inputs to computing tf–idf.

3. Statistical Methods

This section considers methods for mapping the document-token matrix C to predictions V̂ of an attribute V. In some cases, the observed data is partitioned into submatrices C^train and C^test, where the matrix C^train collects rows for which we have observations V^train of V and the matrix C^test collects rows for which V is unobserved. The dimension of C^train is n^train × p, and the dimension of V^train is n^train × k, where k is the number of attributes we wish to predict.

Attributes in V can include observable quantities such as the frequency of flu cases, the positive or negative rating of movie reviews, or the unemployment rate, about which the documents are informative. There can also be latent attributes of interest, such as the topics being discussed in a congressional debate or in news articles.

Methods to connect counts c_i to attributes v_i can be roughly divided into four categories. The first, which we will call dictionary-based methods, do not involve statistical inference at all: they simply specify v̂_i = f(c_i) for some known function f(·). This is by far the most common method in the social science literature using text to date. In some cases, researchers define f(·) based on a prespecified dictionary of terms capturing particular categories of text. In Tetlock (2007), for example, c_i is a bag-of-words representation and the outcome of interest v_i is the latent "sentiment" of Wall Street Journal columns, defined along a number of dimensions such as "positive," "optimistic," and so on. The author defines the function f(·) using a dictionary called the General Inquirer, which provides lists of words associated with each of these sentiment categories.3 The elements of f(c_i) are defined to be the sum of the counts of words in each category. (As we discuss below, the main analysis then focuses on the first principal component of the resulting counts.) In Baker, Bloom, and Davis (2016), c_i is the count of articles in a given newspaper–month containing a set of prespecified terms such as "policy," "uncertainty," and "Federal Reserve," and the outcome of interest v_i is the degree of "policy uncertainty" in the economy. The authors define f(·) to be the raw count of the prespecified terms divided by the total number of articles in the newspaper–month, averaged across newspapers. We do not provide additional discussion of dictionary-based methods in this section, but we return to them in section 3.5 and in our discussion of applications in section 4.

The second and third groups of methods are distinguished by whether they begin from a model of p(v_i | c_i) or a model of p(c_i | v_i). In the former case, which we will call text regression methods, we directly estimate the conditional outcome distribution, usually via the conditional expectation E[v_i | c_i] of attributes v_i. This is intuitive: if we want to predict v_i from c_i, we would naturally regress the observed values of the former (V^train) on the corresponding values of the latter (C^train). Any generic regression technique can be applied, depending upon the nature of v_i. However, the high dimensionality of c_i, where p is often as large as or larger than n^train, requires use of regression techniques appropriate for such a setting, such as penalized linear or logistic regression.

In the latter case, we begin from a generative model of p(c_i | v_i).
To see why this is intuitive, note that in many cases the underlying causal relationship runs from outcomes to language rather than the other way around. For example, Google searches about the flu do not cause flu cases to occur; rather, people with the flu are more likely to produce such searches. Congresspeople's ideology is not determined by their use of partisan language; rather, people who are more conservative or liberal to begin with are more likely to use such language. From an economic point of view, the correct "structural" model of language in these cases maps from v_i to c_i, and as in other cases familiar to economists, modeling the underlying causal relationships can provide powerful guidance to inference and make the estimated model more interpretable.

Generative models can be further divided by whether the attributes are observed or latent. In the first case of unsupervised methods, we do not observe the true value of v_i for any documents. The function relating c_i and v_i is unknown, but we are willing to impose sufficient structure on it to allow us to infer v_i from c_i. This class includes methods such as topic modeling and its variants (e.g., latent Dirichlet allocation, or LDA). In the second case of supervised methods, we observe training data V^train and we can fit our model, say f_θ(c_i; v_i) for a vector of parameters θ, to this training set. The fitted model f̂_θ can then be inverted to predict v_i for documents in the test set, and can also be used to interpret the structural relationship between attributes and text. Finally, in some cases, v_i includes both observed and latent attributes for a semi-supervised analysis.

Lastly, we discuss word embeddings, which provide a richer representation of the underlying text than the token counts that underlie other methods. They have seen limited application in economics to date, but their dramatic successes in deep learning and other machine learning domains suggest they are likely to have high value in the future.

We close in section 3.5 with some broad recommendations for practitioners.

3 http://www.wjh.harvard.edu/~inquirer/.

3.1 Text Regression

Predicting an attribute v_i from counts c_i is a regression problem like any other, except that the high dimensionality of c_i makes ordinary least squares (OLS) and other standard techniques infeasible. The methods in this section are mainly applications of standard high-dimensional regression methods to text.

3.1.1 Penalized Linear Models

The most popular strategy for very high-dimensional regression in contemporary statistics and machine learning is the estimation of penalized linear models, particularly with L_1 penalization. We recommend this strategy for most text regression applications: linear models are intuitive and interpretable, and fast, high-quality software is available for big sparse input matrices like our C. For simple text-regression tasks with input dimension on the same order as the sample size, penalized linear models typically perform close to the frontier in terms of out-of-sample prediction.

Linear models in the sense we mean here are those in which v_i depends on c_i only through a linear index η_i = α + x_i′β, where x_i is a known transformation of c_i. In many cases, we simply have E[v_i | x_i] = η_i. It is also possible that E[v_i | x_i] = f(η_i) for some known link function f(·), as in the case of logistic regression.

Common transformations are the identity x_i = c_i, normalization by document length x_i = c_i/m_i with m_i = Σ_j c_ij, or the positive indicator x_ij = 1[c_ij > 0]. The best choice is application specific, and may be driven by interpretability: does one wish to interpret β_j as the added effect of an extra count for token j (if so, use x_ij = c_ij) or as the effect of the presence of token j (if so, use x_ij = 1[c_ij > 0])? The identity is a reasonable default in many settings.

Write l(α, β) for an unregularized objective proportional to the negative log likelihood, −log p(v_i | x_i). For example, in Gaussian (linear) regression, l(α, β) = Σ_i (v_i − η_i)², and in binomial (logistic) regression, l(α, β) = −Σ_i [η_i v_i − log(1 + e^{η_i})] for v_i ∈ {0, 1}. A penalized estimator is then the solution to

(1)  min_{α, β} { l(α, β) + nλ Σ_{j=1}^p κ_j(|β_j|) },

where λ > 0 controls overall penalty magnitude and the κ_j(·) are increasing "cost" functions that penalize deviations of the β_j from zero.

A few common cost functions are shown in figure 1. Those that have a non-differentiable spike at zero (lasso, elastic net, and log) lead to sparse estimators, with some coefficients set to exactly zero. The curvature of the penalty away from zero dictates the weight of shrinkage imposed on the nonzero coefficients: L_2 costs increase with coefficient size; lasso's L_1 penalty has zero curvature and imposes constant shrinkage; and as curvature
goes toward −∞ one approaches the L_0 penalty of subset selection.

[Figure 1: penalty cost functions. Panels: A. Ridge; B. Lasso; C. Elastic net; D. Log.]

Note: From left to right, L_2 costs (ridge, Hoerl and Kennard 1970), L_1 (lasso, Tibshirani 1996), the "elastic net" mixture of L_1 and L_2 (Zou and Hastie 2005), and the log penalty (Candès, Wakin, and Boyd 2008).

The lasso's L_1 penalty (Tibshirani 1996) is extremely popular: it yields sparse solutions with a number of desirable properties (e.g., Bickel, Ritov, and Tsybakov 2009; Wainwright 2009; Belloni, Chernozhukov, and Hansen 2013; Bühlmann and van de Geer 2011), and the number of nonzero estimated coefficients is an unbiased estimator of the regression degrees of freedom (which is useful in model selection; see Zou, Hastie, and Tibshirani 2007).4

Focusing on L_1 regularization, rewrite the penalized linear model objective as

(2)  min_{α, β} { l(α, β) + nλ Σ_{j=1}^p ω_j |β_j| }.

A common strategy sets the ω_j so that the penalty cost for each coefficient is scaled by the sample standard deviation of that covariate. In text analysis, where each covariate corresponds to some transformation of a specific text token, this type of weighting is referred to as "rare feature up-weighting" (e.g., Manning, Raghavan, and Schütze 2008) and is generally thought of as good practice: rare words are often most useful in differentiating between documents.5

Large λ leads to simple model estimates in the sense that most coefficients will be set at or close to zero, while as λ → 0 we approach maximum likelihood estimation (MLE). Since there is no way to define an optimal λ a priori, standard practice is to compute estimates for a large set of possible λ and then use some criterion to select the one that yields the best fit.
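For the Gaussian loss, the lasso solution of equation (2) (with ω_j = 1 and no intercept) can be computed by cyclic coordinate descent, where the L_1 penalty turns each one-dimensional update into a soft-thresholding step. A minimal sketch for illustration only; in practice one would use a mature package such as glmnet:

```python
def soft_threshold(z, t):
    """The L1 penalty maps each univariate update through soft-thresholding."""
    return z - t if z > t else z + t if z < -t else 0.0

def lasso(X, v, lam, n_iter=100):
    """Cyclic coordinate descent for
    min_b sum_i (v_i - x_i'b)^2 + n * lam * sum_j |b_j|
    (no intercept, unit penalty weights, for brevity)."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # residual with coordinate j excluded
            r = [v[i] - sum(X[i][k] * b[k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n))
            norm = sum(X[i][j] ** 2 for i in range(n))
            b[j] = soft_threshold(rho, n * lam / 2.0) / norm
    return b

# Orthogonal toy design: the penalty shrinks the first coefficient from its
# least squares value 2.0 down to 1.0 and sets the second exactly to zero.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
v = [2.0, 0.0, 2.0, 0.0]
assert lasso(X, v, lam=1.0) == [1.0, 0.0]
```

Setting lam=0 in the same routine returns the unpenalized least squares fit [2.0, 0.0], illustrating how λ → 0 approaches the MLE.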
One common approach is to leave out part of the training sample in estimation and then choose the λ that yields the best out-of-sample fit according to some criterion such as mean squared error. Rather than work with a single leave-out sample, researchers most often use K-fold cross-validation (CV).

4 Penalties with a bias that diminishes with coefficient size—such as the log penalty in figure 1 (Candès, Wakin, and Boyd 2008), the smoothly clipped absolute deviation (SCAD) of Fan and Li (2001), or the adaptive lasso of Zou (2006)—have been promoted in the statistics literature as improving upon the lasso by providing consistent variable selection and estimation in a wider range of settings. These diminishing-bias penalties lead to increased computation costs (due to a non-convex loss), but there exist efficient approximation algorithms (see, e.g., Fan, Xue, and Zou 2014; Taddy 2017b).

5 This is the same principle that motivates "inverse-document frequency" weighting schemes, such as tf–idf.

This splits the sample into K disjoint subsets, and then fits the full regularization path K times, excluding each subset in turn. This yields K realizations of the mean squared error or other out-of-sample fit measure for each value of λ. Common rules are to select the value of λ that minimizes the average error across these realizations, or (more conservatively) to choose the largest λ with mean error no more than one standard error away from the minimum.

Analytic alternatives to cross-validation are Akaike's information criterion (AIC; Akaike 1973) and the Bayesian information criterion (BIC) of Schwarz (1978). In particular, Flynn, Hurvich, and Simonoff (2013) describe a bias-corrected AIC objective for high-dimensional problems that they call AICc. It is motivated as an approximate likelihood maximization subject to a degrees of freedom (df_λ) adjustment: AICc(λ) = 2 l(α_λ, β_λ) + 2 df_λ n/(n − df_λ − 1). Similarly, the BIC objective is BIC(λ) = l(α_λ, β_λ) + df_λ log n, and is motivated as an approximation to the Bayesian posterior marginal likelihood in Kass and Wasserman (1995). AICc and BIC selection choose λ to minimize their respective objectives. The BIC tends to choose simpler models than cross-validation or AICc. Zou, Hastie, and Tibshirani (2007) recommend BIC for lasso penalty selection whenever variable selection, rather than predictive performance, is the primary goal.

3.1.2 Dimension Reduction

Another common solution for taming high-dimensional prediction problems is to form a small number of linear combinations of predictors and to use these derived indices as variables in an otherwise standard predictive regression. Two classic dimension reduction techniques are principal components regression (PCR) and partial least squares (PLS).

Penalized linear models use shrinkage and variable selection to manage high dimensionality by forcing the coefficients on most regressors to be close to (or, for lasso, exactly) zero. This can produce suboptimal forecasts when predictors are highly correlated. A transparent illustration of this problem would be a case in which all of the predictors are equal to the forecast target plus an i.i.d. noise term. In this situation, choosing a subset of predictors via lasso penalty is inferior to taking a simple average of the predictors and using this as the sole predictor in a univariate regression. This predictor averaging, as opposed to predictor selection, is the essence of dimension reduction.

PCR consists of a two-step procedure. In the first step, principal components analysis (PCA) combines regressors into a small set of K linear combinations that best preserve the covariance structure among the predictors. This amounts to solving the problem

(3) min_{Γ,B} trace[(C − ΓB′)(C − ΓB′)′],

subject to

rank(Γ) = rank(B) = K.

The count matrix C consists of n rows (one for each document) and p columns (one for each term). PCA seeks a low-rank representation ΓB′ that best approximates the text data C. This formulation has the character of a factor model. The n × K matrix Γ captures the prevalence of K common components, or "factors," in each document. The p × K matrix B describes the strength of association between each word and the factors. As we will see, this reduced-rank decomposition bears a close resemblance to other text analytic methods such as topic modeling and word embeddings.

In the second step, the K components are used in standard predictive regression. As an example, Foster, Liberman, and Stine (2013) use PCR to build a hedonic real estate pricing model that takes textual content of property listings as an input.6 With text data, where the number of features tends to vastly exceed the observation count, regularized versions of PCA such as predictor thresholding (e.g., Bai and Ng 2008) and sparse PCA (Zou, Hastie, and Tibshirani 2006) help exclude the least informative features to improve the predictive content of the dimension-reduced text.

A drawback of PCR is that it fails to incorporate the ultimate statistical objective—forecasting a particular set of attributes—in the dimensionality reduction step. PCA condenses text data into indices based on the covariation among the predictors. This happens prior to the forecasting step and without consideration of how predictors associate with the forecast target.

In contrast, PLS performs dimension reduction by directly exploiting covariation of predictors with the forecast target.7 Suppose we are interested in forecasting a scalar attribute v_i. PLS regression proceeds as follows. For each element j of the feature vector c_i, estimate the univariate covariance between v_i and c_ij. This covariance, denoted φ_j, reflects the attribute's "partial" sensitivity to each feature j. Next, form a single aggregate predictor by averaging all features, v̂_i = Σ_j φ_j c_ij / Σ_j φ_j. This forecast places the highest weight on the strongest univariate predictors, and the least weight on the weakest. In this way, PLS performs its dimension reduction with the ultimate forecasting objective in mind. The description of v̂_i reflects the K = 1 case, i.e., when text is condensed into a single predictive index. To use additional predictive indices, both v_i and c_ij are orthogonalized with respect to v̂_i, the above procedure is repeated on the orthogonalized data set, and the resulting forecast is added to the original v̂_i. This is iterated until the desired number of PLS components K is reached. Like PCR, PLS components describe the prevalence of K common factors in each document. And also like PCR, PLS can be implemented with a variety of regularization schemes to aid its performance in the ultra-high-dimensional world of text. Section 4 discusses applications using PLS in text regression.

PCR and PLS share a number of common properties. In both cases, K is a user-controlled parameter which, in many social science applications, is selected ex ante by the researcher. But, like any hyperparameter, K can be tuned via cross-validation. And neither method is scale invariant—the forecasting model is sensitive to the distribution of predictor variances. It is therefore common to variance-standardize features before applying PCR or PLS.

6 See Stock and Watson (2002a, b) for development of the PCR estimator and an application to macroeconomic forecasting with a large set of numerical predictors.

7 See Kelly and Pruitt (2013, 2015) for the asymptotic theory of PLS regression and its application to forecasting risk premia in financial markets.
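The two-step PCR procedure and the K = 1 PLS recursion described above can be sketched as follows. This is a minimal numpy illustration on hypothetical toy data, with variance-standardized features; the covariance weights φ_j follow the description in the text.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy data: token counts c_i and a scalar attribute v_i
# driven by the first five tokens.
n, p, K = 200, 50, 3
C = rng.poisson(1.0, size=(n, p)).astype(float)
v = C[:, :5].sum(axis=1) + rng.normal(size=n)

# Neither method is scale invariant, so variance-standardize first.
Z = (C - C.mean(axis=0)) / C.std(axis=0)

# PCR step 1: PCA via SVD, keeping the K leading components.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
Gamma = Z @ Vt[:K].T                      # n x K document factors

# PCR step 2: standard predictive regression of v on the factors.
G = np.column_stack([np.ones(n), Gamma])
coef, *_ = np.linalg.lstsq(G, v, rcond=None)
v_hat_pcr = G @ coef

# PLS (the K = 1 case): weight each feature by its univariate
# covariance with v, then average into one aggregate predictor.
phi = (Z * (v - v.mean())[:, None]).mean(axis=0)
v_hat_pls = Z @ phi / phi.sum()
```

Because the PLS weights use the forecast target directly, v_hat_pls tracks v even though the top principal components of Z need not align with the predictive features.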
3.1.3 Nonlinear Text Regression

Penalized linear models are the most widely applied text regression tools due to their simplicity, and because they may be viewed as a first-order approximation to potentially nonlinear and complex data generating processes (DGPs). In cases where a linear specification is too restrictive, there are several other machine learning tools that are well suited to represent nonlinear associations between text c_i and outcome attributes v_i. Here we briefly describe four such nonlinear regression methods—generalized linear models, support vector machines, regression trees, and deep learning—and provide references for readers interested in thorough treatments of each.

GLMs and SVMs.—One way to capture nonlinear associations between c_i and v_i is with a generalized linear model (GLM). These expand the linear model to include nonlinear functions of c_i such as polynomials or interactions, while otherwise treating the problem with the penalized linear regression methods discussed above.

A related method used in the social science literature is the support vector machine, or SVM (Vapnik 1995). This is used for text classification problems (when V is categorical), the prototypical example being email spam filtering. A detailed discussion of SVMs is beyond the scope of this review, but from a high level, the SVM finds hyperplanes in a basis expansion of C that partition the observations into sets with equal response (i.e., so that the v_i are all equal in each region).8

GLMs and SVMs both face the limitation that, without a priori assumptions for which basis transformations and interactions to include, they may overfit and require extensive tuning (Hastie, Tibshirani, and Friedman 2009; Murphy 2012). For example, multi-way interactions increase the parameterization combinatorially and can quickly overwhelm the penalization routine, and their performance suffers in the presence of many spurious "noise" inputs (Hastie, Tibshirani, and Friedman 2009).9

Regression Trees.—Regression trees have become a popular nonlinear approach for incorporating multi-way predictor interactions into regression and classification problems. The logic of trees differs markedly from traditional regressions. A tree "grows" by sequentially sorting data observations into bins based on values of the predictor variables. This partitions the data set into rectangular regions, and forms predictions as the average value of the outcome variable within each partition (Breiman et al. 1984). This structure is an effective way to accommodate rich interactions and nonlinear dependencies.

Two extensions of the simple regression tree have been highly successful thanks to clever regularization approaches that minimize the need for tuning and avoid overfitting. Random forests (Breiman 2001) average predictions from many trees that have been randomly perturbed in a bootstrap step. Boosted trees (e.g., Friedman 2002) recursively combine predictions from many oversimplified trees.10

The benefits of regression trees—nonlinearity and high-order interactions—are sometimes lessened in the presence of high-dimensional inputs. While we would generally recommend tree models, and especially random forests, they are often not worth the effort for simple text regression. Oftentimes, a more beneficial use of trees is in a final prediction step after some dimension reduction derived from the generative models in section 3.2.

8 Hastie, Tibshirani, and Friedman (2009, chapter 12) and Murphy (2012, chapter 14) provide detailed overviews of GLMs and SVMs. Joachims (1998) and Tong and Koller (2001) (among others) study text applications of SVMs.

9 Another drawback of SVMs is that they cannot be easily connected to the estimation of a probabilistic model, and the resulting fitted model can sometimes be difficult to interpret. Polson and Scott (2011) provide a pseudo-likelihood interpretation for a variant of the SVM objective. Our own experience has led us to lean away from SVMs for text analysis in favor of more easily interpretable models. Murphy (2012, chapter 14.6) attributes the popularity of SVMs in some application areas to an ignorance of alternatives.
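As a brief illustration of the tree ensembles just described, the following sketch fits a random forest to hypothetical toy data in which the outcome depends on an interaction between two token counts, a structure a linear model cannot capture directly. It uses scikit-learn's RandomForestRegressor; the data and signal are invented for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)

# Hypothetical toy data: the outcome is large only when both
# token 0 and token 1 appear in a document.
n, p = 500, 10
C = rng.poisson(1.0, size=(n, p)).astype(float)
v = 2.0 * (C[:, 0] > 0) * (C[:, 1] > 0) + rng.normal(scale=0.1, size=n)

# A random forest averages many bootstrap-perturbed trees (Breiman 2001);
# each tree partitions documents into bins by token-count splits.
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(C, v)
v_hat = forest.predict(C)
r2 = forest.score(C, v)
print(round(r2, 3))
```

The in-sample fit recovers most of the interaction structure without any hand-specified basis expansion.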
Deep Learning.—There is a host of other machine learning techniques that have been applied to text regression. The most common techniques not mentioned thus far are neural networks, which typically allow the inputs to act on the response through one or more layers of interacting nonlinear basis functions (e.g., see Bishop 1995). A main attraction of neural networks is their status as universal approximators, a theoretical result describing their ability to mimic general, smooth nonlinear associations.

In high-dimensional and very noisy settings, such as in text analysis, classical neural nets tend to suffer from the same issues referenced above: they often overfit and are difficult to tune. However, the recently popular "deep" versions of neural networks (with many layers, and fewer nodes per layer) incorporate a number of innovations that allow them to work better, faster, and with little tuning, even in difficult text analysis problems. Such deep neural nets (DNNs) are now the state-of-the-art solution for many machine learning tasks (LeCun, Bengio, and Hinton 2015).11 DNNs are now employed in many complex natural language processing tasks, such as translation (Sutskever, Vinyals, and Le 2014; Wu et al. 2016) and syntactic parsing (Chen and Manning 2014), as well as in exercises of relevance to social scientists—for example, Iyyer et al. (2014) infer political ideology from text using a DNN. They are frequently used in conjunction with richer text representations such as word embeddings, described more below.

3.1.4 Bayesian Regression Methods

The penalized methods above can all be interpreted as posterior maximization under some prior. For example, ridge regression maximizes the posterior under independent Gaussian priors on each coefficient, while Park and Casella (2008) and Hans (2009) give Bayesian interpretations to the lasso. See also the horseshoe of Carvalho, Polson, and Scott (2010) and the double Pareto of Armagan, Dunson, and Lee (2013) for Bayesian analogues of diminishing bias penalties like the log penalty on the right of figure 1.

For those looking to do a full Bayesian analysis for high-dimensional (e.g., text) regression, an especially appealing model is the spike-and-slab introduced in George and McCulloch (1993). This models the distribution over regression coefficients as a mixture between two densities centered at zero—one with very small variance (the spike) and another with large variance (the slab). This model allows one to compute posterior variable inclusion probabilities as, for each coefficient, the posterior probability that it came from the slab and not the spike component. Due to a need to integrate over the posterior distribution, e.g., via Markov chain Monte Carlo (MCMC), inference for spike-and-slab models is much more computationally intensive than fitting the penalized regressions of section 3.1.1. However, Yang, Wainwright, and Jordan (2016) argue that spike-and-slab estimates based on short MCMC samples can be useful in application, while Scott and Varian (2014) have engineered efficient implementations of the spike-and-slab model for big data applications. These procedures give a full accounting of parameter uncertainty, which we miss in a quick penalized regression.

3.2 Generative Language Models

Text regression treats the token counts as generic high-dimensional input variables, without any attempt to model structure that is specific to language data. In many settings it is useful to instead propose a generative model for the text tokens to learn about how the attributes influence word choice and account for various dependencies among words and among attributes. In this approach, the words in a document are viewed as the realization of a generative process defined through a probability model for p(c_i | v_i).

10 Hastie, Tibshirani, and Friedman (2009) provide an overview of these methods. In addition, see Wager, Hastie, and Efron (2014) and Wager and Athey (2018) for results on confidence intervals for random forests, and see Taddy et al. (2015) and Taddy et al. (2016) for an interpretation of random forests as a Bayesian posterior over potentially optimal trees.

11 Goodfellow, Bengio, and Courville (2016) provide a thorough textbook overview of these "deep learning" technologies, while Goldberg (2016) is an excellent primer on their use in natural language processing.
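As a minimal illustration of this generative view, one can simulate token counts from a document-specific token-probability vector. The vocabulary, attribute labels, and probabilities below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical generative process: each attribute value v maps to a
# token-probability vector q(v), and counts are drawn given v.
vocab = ["tax", "vote", "trade", "war"]
q_given_v = {
    "fiscal":  np.array([0.60, 0.20, 0.15, 0.05]),
    "foreign": np.array([0.05, 0.15, 0.30, 0.50]),
}

def generate_document(v, m):
    """Draw token counts c_i for a document of length m given attribute v."""
    return rng.multinomial(m, q_given_v[v])

c_i = generate_document("fiscal", m=100)
print(dict(zip(vocab, c_i.tolist())))
```

Inference then runs in the opposite direction: given observed counts c_i, learn about the attribute v_i and the probability vectors themselves.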

3.2.1 Unsupervised Generative Models

In the unsupervised setting, we have no direct observations of the true attributes v_i. Our inference about these attributes must therefore depend entirely on strong assumptions that we are willing to impose on the structure of the model p(c_i | v_i). Examples in the broader literature include cases where the v_i are latent factors, clusters, or categories. In text analysis, the leading application has been the case in which the v_i are topics.

A typical generative model implies that each observation c_i is a conditionally independent draw from the vocabulary of possible tokens according to some document-specific token probability vector, say q_i = [q_i1 ⋯ q_ip]′. Conditioning on document length, m_i = Σ_j c_ij, this implies a multinomial distribution for the counts

(4) c_i ∼ MN(q_i, m_i).

This multinomial model underlies the vast majority of contemporary generative models for text.

Under the basic model in (4), the function q_i = q(v_i) links attributes to the distribution of text counts. A leading example of this link function is the topic model specification of Blei, Ng, and Jordan (2003),12 where

(5) q_i = v_i1 θ_1 + v_i2 θ_2 + ⋯ + v_ik θ_k = Θ v_i.

Many readers will recognize the model in (5) as a factor model for the vector of normalized counts for each token in document i, c_i/m_i. Indeed, a topic model is simply a factor model for multinomial data. Each topic is a probability vector over possible tokens, denoted θ_l, l = 1, …, k (where θ_lj ≥ 0 and Σ_{j=1}^p θ_lj = 1). A topic can be thought of as a cluster of tokens that tend to appear together in documents. The latent attribute vector v_i is referred to as the set of topic weights (formally, a distribution over topics, with v_il ≥ 0 and Σ_{l=1}^k v_il = 1). Note that v_il describes the proportion of language in document i devoted to the lth topic. We can allow each document to have a mix of topics, or we can require that one v_il = 1 while the rest are zero, so that each document has a single topic.13

Since its introduction into text analysis, topic modeling has become hugely popular.14 (See Blei 2012 for a high-level overview.) The model has been especially useful in political science (e.g., Grimmer 2010), where researchers have been successful in attaching political issues and beliefs to the estimated latent topics.

Since the v_i are of course latent, estimation for topic models tends to make use of some alternating inference for V | Θ and Θ | V.

12 Standard least-squares factor models have long been employed in "latent semantic analysis" (LSA; Deerwester et al. 1990), which applies PCA (i.e., singular value decompositions) to token count transformations such as x_i = c_i/m_i or x_ij = c_ij log(d_j), where d_j = Σ_i 1[c_ij > 0]. Topic modeling and its precursor, probabilistic LSA, are generally seen as improving on such approaches by replacing arbitrary transformations with a plausible generative model.

13 Topic modeling is alternatively labeled "latent Dirichlet allocation" (LDA), which refers to the Bayesian model in Blei, Ng, and Jordan (2003) that treats each v_i and θ_l as generated from a Dirichlet-distributed prior. Another specification that is popular in political science (e.g., Quinn et al. 2010) keeps θ_l as Dirichlet-distributed but requires each document to have a single topic. This may be most appropriate for short documents, such as press releases or single speeches.

14 The same model was independently introduced in genetics by Pritchard, Stephens, and Donnelly (2000) for factorizing gene expression as a function of latent populations; it has been similarly successful in that field. Latent Dirichlet allocation is also an extension of a related mixture modeling approach in the latent semantic analysis of Hofmann (1999).
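A minimal sketch of this estimation task simulates a two-topic corpus from (4) and (5) and fits it with scikit-learn's LatentDirichletAllocation (which implements variational Bayesian inference). The topic vectors and corpus below are hypothetical toy values.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(4)

# Hypothetical two-topic corpus simulated from (4) and (5):
# each row of theta is a topic's probability vector over 6 tokens.
theta = np.array([[0.40, 0.40, 0.10, 0.05, 0.03, 0.02],
                  [0.02, 0.03, 0.05, 0.10, 0.40, 0.40]])
V = rng.dirichlet([0.3, 0.3], size=100)          # topic weights v_i
C = np.vstack([rng.multinomial(200, w @ theta) for w in V])

# Fit the topic model; components_ holds unnormalized topic-token scores.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(C)
topics = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
weights = lda.transform(C)                       # estimated topic weights
print(np.round(topics, 2))
```

The estimated rows of `topics` are probability vectors over tokens, and each row of `weights` is a document's estimated distribution over topics.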

One possibility is to employ a version of the expectation-maximization (EM) algorithm to either maximize the likelihood implied by (4) and (5) or, after incorporating the usual Dirichlet priors on v_i and θ_l, to maximize the posterior; this is the approach taken in Taddy (2012; see this paper also for a review of topic estimation techniques). Alternatively, one can target the full posterior distribution p(Θ, V | c_i). Estimation, say for Θ, then proceeds by maximization of the estimated marginal posterior, say p(Θ | c_i).

Due to the size of the data sets and dimension of the models, posterior approximation for topic models usually uses some form of variational inference (Wainwright and Jordan 2008) that fits a tractable parametric family to be as close as possible (e.g., in Kullback–Leibler divergence) to the true posterior. This variational approach was used in the original Blei, Ng, and Jordan (2003) paper and in many applications since. Hoffman et al. (2013) present a stochastic variational inference algorithm that takes advantage of techniques for optimization on massive data; this algorithm is used in many contemporary topic modeling applications. Another approach, which is more computationally intensive but can yield more accurate posterior approximations, is the MCMC algorithm of Griffiths and Steyvers (2004). Alternatively, for quick estimation without uncertainty quantification, the posterior maximization algorithm of Taddy (2012) is a good option.

The choice of k, the number of topics, is often fairly arbitrary. Data-driven choices do exist: Taddy (2012) describes a model selection process for k that is based upon Bayes factors, Airoldi et al. (2010) provide a cross-validation (CV) scheme, while Teh et al. (2006) use Bayesian nonparametric techniques that view k as an unknown model parameter. In practice, however, it is very common to simply start with a number of topics on the order of ten, and then adjust the number of topics in whatever direction seems to improve interpretability. Whether this ad hoc procedure is problematic depends on the application. As we discuss below, in many applications of topic models to date, the goal is to provide an intuitive description of text, rather than inference on some underlying "true" parameters; in these cases, the ad hoc selection of the number of topics may be reasonable.

The basic topic model has been generalized and extended in a variety of ways. A prominent example is the dynamic topic model of Blei and Lafferty (2006), which considers documents that are indexed by date (e.g., publication date for academic articles) and allows the topics, say Θ_t, to evolve smoothly in time. Another example is the supervised topic model of Blei and McAuliffe (2007), which combines the standard topic model with an extra equation relating the weights v_i to some additional attribute y_i in p(y_i | v_i). This pushes the latent topics to be relevant to y_i as well as the text c_i. In these and many other extensions, the modifications are designed to incorporate available document metadata (in these examples, time and y_i respectively).

3.2.2 Supervised Generative Models

In supervised models, the attributes v_i are observed in a training set and thus may be directly harnessed to inform the model of text generation. Perhaps the most common supervised generative model is the so-called naive Bayes classifier (e.g., Murphy 2012), which treats counts for each token as independent with class-dependent means. For example, the observed attribute might be author identity for each document in the corpus, with the model specifying different mean token counts for each author.

In naive Bayes, v_i is a univariate categorical variable and the token count distribution is factorized as p(c_i | v_i) = Π_j p_j(c_ij | v_i), thus "naively" specifying conditional independence between tokens j.
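A minimal sketch of such a classifier follows, with hypothetical Poisson token models for two authors and prediction by Bayes's rule; all parameter values are invented for the example.

```python
import numpy as np

# Hypothetical Poisson naive Bayes: each class v (here, an author) has
# its own vector of mean token counts theta_v.
theta = {
    "author_A": np.array([2.0, 0.5, 0.1]),
    "author_B": np.array([0.2, 1.0, 2.0]),
}
prior = {"author_A": 0.5, "author_B": 0.5}

def log_poisson(c, mu):
    # log Pois(c; mu), dropping the log(c!) term common to every class
    return c * np.log(mu) - mu

def classify(c):
    """Invert the class-conditional token models via Bayes's rule."""
    scores = {v: np.log(prior[v]) + log_poisson(c, theta[v]).sum()
              for v in theta}
    return max(scores, key=scores.get)

print(classify(np.array([5, 1, 0])))   # counts favoring author_A's tokens
print(classify(np.array([0, 2, 4])))   # counts favoring author_B's tokens
```

Because the token models are independent across j, the log posterior score is just a sum of per-token terms plus the log prior.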

This rules out the possibility that by choosing to say one token (say, "hello") we reduce the probability that we say some other token (say, "hi"). The parameters of each independent token distribution are estimated, yielding p̂_j for j = 1, …, p. The model can then be inverted for prediction, with classification probabilities for the possible class labels obtained via Bayes's rule as

(6) p(V | c_i) = p(c_i | V) π_V / Σ_a p(c_i | a) π_a = [Π_j p_j(c_ij | V)] π_V / Σ_a [Π_j p_j(c_ij | a)] π_a,

where π_a is the prior probability on class a (usually just one over the number of possible classes). In text analysis, Poisson naive Bayes procedures, with p(c_ij | V) = Pois(c_ij; θ_Vj) where E[c_ij | V] = θ_Vj, have been used as far back as Mosteller and Wallace (1963). Some recent social science applications use binomial naive Bayes, which sets p(c_ij | V) = Bin(c_ij; m_i, θ_Vj) where E[c_ij/m_i | V] = θ_Vj. The Poisson model has some statistical justification in the analysis of text counts (Taddy 2015a), but the binomial specification seems to be more common in off-the-shelf software.

A more realistic sampling model for text token counts is the multinomial model of (4). This introduces limited dependence between token counts, encoding the fact that using one token for a given utterance must slightly lower the expected count for all other tokens. Combining such a sampling scheme with generalized linear models, Taddy (2013b) advocates the use of multinomial logistic regression to connect text counts with observable attributes. The generative model specifies probabilities in the multinomial distribution of (4) as

(7) q_ij = e^{η_ij} / Σ_{h=1}^p e^{η_ih},  η_ij = α_j + v_i′ φ_j.

Taddy (2013a, b) applies these models in the setting of univariate or low-dimensional v_i, focusing on their use for prediction of future document attributes through an inversion strategy discussed below. More recently, Taddy (2015a) provides a distributed-computing strategy that allows the model implied by (7) to be fit (using penalized deviance methods as detailed in section 3.1.1) for high-dimensional v_i on massive corpora. This facilitates language models containing a large number of sources of heterogeneity (even document-specific random effects), thus allowing social scientists to apply familiar regression analysis tools in their text analyses.

Application of the logistic regression text models implied by (7) often requires an inversion step for inference about attributes conditional on text—that is, to map from p(c_i | v_i) to p(v_i | c_i). The simple Bayes's rule technique of (6) is difficult to apply beyond a single categorical attribute. Instead, Taddy (2013b) uses the inverse regression ideas of Cook (2007) in deriving sufficient projections from the fitted models. Say Φ = [φ_1 ⋯ φ_p] is the matrix of regression coefficients from (7) across all tokens j; then the token count projection Φc_i is a sufficient statistic for v_i in the sense that

(8) v_i ⫫ c_i | Φc_i,

i.e., the attributes are independent of the text counts conditional upon the projection. Thus, the fitted logistic regression model yields a map from high-dimensional text to the presumably lower dimensional attributes of interest, and this map can be used instead of the full text counts for future inference tasks. For example, to predict variable v_i you can fit the low-dimensional OLS regression of v_i on Φc_i. Use of projections built in this way is referred to as multinomial inverse regression (MNIR). This idea can also be applied to only a subset of the variables in v_i, yielding projections that are sufficient for the text content relevant to those variables after conditioning on the other attributes in v_i. Taddy (2015a) details use of such sufficient projections in a variety of applications, including attribute prediction, treatment effect estimation, and document indexing.

New techniques are arising that combine MNIR techniques with the latent structure of topic models. For example, Rabinovich and Blei (2014) directly combine the logistic regression in (7) with the topic model of (5) in a mixture specification. Alternatively, the structural topic model of Roberts et al. (2013) allows both topic content (θ_l) and topic prevalence (latent v_i) to depend on observable document attributes. Such semi-supervised techniques seem promising for their combination of the strong text–attribute connection of MNIR with topic modeling's ability to account for latent clustering and dependency within documents.

3.3 Word Embeddings

Throughout this article, documents have been represented through token count vectors, c_i. This is a crude language summarization. It abstracts from any notion of similarity between words (such as run, runner, jogger) or syntactical richness. One of the frontiers of textual analysis is in developing new representations of text data that more faithfully capture its meaning.

Instead of identifying words only as an index for location in a long vocabulary list, imagine representing words as points in a large vector space, with similar words colocated, and an internally consistent arithmetic on the space for relating words to one another. For example, suppose our vocabulary consists of six words: {king, queen, prince, man, woman, child}. The vector space representation of this vocabulary based on similarity of their meaning might look something like figure 2, panel A.15

In the vector space, words are relationally oriented and we can begin to draw meaning from term positions, something that is not possible in simple bag-of-words approaches. For example, in the right panel of figure 2, we can see that by subtracting the vector man from the vector king, and then adding to this woman, we arrive spatially close to queen. Likewise, the combination king − man + child lies in close proximity to the vector prince.

Such word embeddings, also known as distributed language representations, amount to a preprocessing of the text data to replace word identities—encoded as binary indicators in a vocabulary-length vector—with an embedding (location) of each vocabulary word in ℝ^K, where K is the dimension of the latent representation space. The dimensions of the vector space correspond to various aspects of meaning that give words their content. Continuing from the simplified example vocabulary, the latent (and, in reality, unlabeled) dimensions and associated word embeddings might look like:

Dimension     king   queen  prince  man    woman  child
Royalty       0.99   0.99   0.95    0.01   0.02   0.01
Masculinity   0.94   0.06   0.02    0.99   0.02   0.49
Age           0.73   0.81   0.15    0.61   0.68   0.09
…

This type of text representation has long been applied in natural language processing (Rumelhart, Hinton, and Williams 1986; Morin and Bengio 2005). The embeddings must be estimated and are chosen to optimize, perhaps approximately, an objective function defined on the original text (such as a likelihood for word occurrences). They form the basis for many deep learning applications involving textual data (see, e.g., Chen and Manning 2014; Goldberg 2016).

15 This example is motivated by https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/.

Panel A Panel B

an an woan ld woan kn een kn an – kn woan + een

ne

kn an –

Figure 2. A Graphical Example of Word Embeddings and angles between words for fundamen- The j​th​ rows of ​ ​ and B​ ​ (denoted ​​ ​​ and ​​ ​​) Γ γj βj tal tasks such as classification, and have give a K​ ​-dimensional embedding of the j​th​ begun to be adopted by social scientists word, so ­co-occurrences of terms ​i​ and ​j​ as a useful summary representation of text are approximated as ​​​​ ​ ​ ​ ​. This geometric γi βj′ data. representation of the text has an intuitive Some popular embedding techniques interpretation. The inner product of terms’ are Word2Vec (Mikolov et al. 2013) and embeddings, which measures the close- Global Vector for Word Representation ness of the pair in the K​ ​-dimensional vec- (GloVe, Pennington, Socher, and Manning tor space, describes how likely the pair is to 2014). The key preliminary step in ­co-occur. these methods is to settle on a notion of Researchers are beginning to connect ­co-occurrence among terms. For example, these ­vector-space language models with the consider a p​ p​ matrix denoted CoOccur​ ​, sorts of document attributes that are of inter- × whose ​​(i, j)​​ entry counts the number of est in social science. For example, Le and times in your corpus that the terms i​​ and j​​ Mikolov (2014) estimate latent document appear within, say, ​b​ words of each other. scores in a vector space, while Taddy (2015b) This is known as the ­skip-gram definition of develops an inversion rule for document ­co-occurrences. classification based upon Word2Vec. In one To embed CoOccur​ ​ in a K​ ​-dimensional especially compelling application, Bolukbasi vector space, where ​K​ is much smaller than p​ ​ et al. (2016) estimate the direction of gen- (say a few hundred), we solve the same type der in an embedding space by averaging the of problem that PCA used to summarize the angles between female and male descriptors. word count matrix in equation (3). 
In partic- They then show that stereotypically male ular, we can find rank-​K​ matrices ​ and ​B​ and female jobs, for example, live at the Γ ​ that best approximate co-occurrences­ among corresponding ends of the implied gender terms: vector. This information is used to derive an algorithm for removing these gender biases, ​CoOccur ​B . ​ ​ so as to provide a more “fair” set of inputs ≈ Γ ′ Gentzkow, Kelly, and Taddy: Text as Data 553 for machine learning tasks. Approaches like More promising are computation algorithms this, which use embeddings as the basis for that approximate the sampling distribution, mathematical analyses of text, can play a role the most common being the familiar non- in the next generation of ­text-as-data applica- parametric bootstrap (Efron 1979). This tions in social science. repeatedly draws samples with replacement of the same size as the original data set and 3.4 Uncertainty Quantification ­reestimates parameters of interest on the The machine learning literature on text bootstrapped samples, with the resulting set analysis is focused on point estimation and of estimated parameters approximating the predictive performance. Social scientists sampling distribution.17 often seek to interpret parameters or func- Unfortunately, the nonparametric boot- tions of the fitted models, and hence desire strap fails for many of the algorithms used on strategies for quantifying the statistical text. For example, it is known to fail for meth- uncertainty around these targets – that is, for ods that involve ­non-differentiable loss func- statistical inference. tions (e.g., the lasso), and with-replacement­ Many machine learning methods for text resampling produces overfit in the bootstrap analysis are based upon a Bayesian modeling samples (repeated observations make predic- approach, where uncertainty quantification is tion seem easier than it actually is). 
Hence, often available as part of the estimation pro- for many applications, it is better to look to cess. In MCMC sampling, as in the Bayesian methods more suitable for ­high-dimensional regression of Scott and Varian (2014) or the estimation algorithms. The two primary can- topic modeling of Griffiths and Steyvers didates are the parametric bootstrap and (2004), the software returns samples from subsampling. the posterior and thus inference is imme- The parametric bootstrap generates new diately available. For estimators relying on unrepeated observations for each boot- variational inference—i.e., fitting a trac- strap sample given an estimated generative table distribution as closely as possible to model (or other assumed form for the data the true posterior—one can simulate from generating process).18 In doing so, it avoids the approximate distribution to conduct pathologies of the nonparametric bootstrap inference.16 that arise from using the empirical sample Frequentist uncertainty quantification is distribution. The cost is that the parametric often favored by social scientists, but analytic bootstrap is, of course, parametric: It makes sampling distributions are unavailable for strong assumptions about the underlying most of the methods discussed here. Some generative model, and one must bear in results exist for the lasso in stylized settings mind that the resulting inference is condi- (especially Knight and Fu 2000), but these tional upon these assumptions.19 assume low-dimensional asymptotic scenar- ios that may be unrealistic for text analysis. 17 See, e.g., Horowitz (2003) for an overview. 18 See Efron (2012) for an overview that also makes interesting connections to Bayesian inference. 
16 Due to its popularity in the deep learning commu- 19 For example, in a linear regression model, the nity, variational inference is a common feature in newer parametric bootstrap requires simulating errors from machine learning frameworks; see, for example, Edward an assumed, say Gaussian, distribution. One must make (Tran et al. 2016, 2017) for a python library that builds assumptions on the exact form of this distribution, includ- variational inference on top of the TensorFlow platform. ing whether the errors are homoskedastic or not. This Edward and similar tools can be used to implement topic contrasts with our usual approaches to standard errors models and the other kinds of text models that we dis- for linear regression that are robust to assumptions on the cussed above. functional form of the errors. 554 Journal of Economic Literature, Vol. LVII (September 2019)
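To make the contrast between the two bootstrap schemes concrete, the sketch below applies both to a deliberately simple case (a no-intercept OLS slope, a smooth estimator for which both schemes are valid). The data-generating process, coefficient value, and number of bootstrap draws are illustrative assumptions, not taken from the article.

```python
import math
import random

random.seed(0)

# Toy data: y = 2*x + Gaussian noise (illustrative values only).
n, beta_true = 200, 2.0
x = [random.gauss(0, 1) for _ in range(n)]
y = [beta_true * xi + random.gauss(0, 1) for xi in x]

def ols_slope(xs, ys):
    # No-intercept least squares slope.
    return sum(a * b for a, b in zip(xs, ys)) / sum(a * a for a in xs)

beta_hat = ols_slope(x, y)

def nonparametric_bootstrap(B=300):
    # Resample (x_i, y_i) pairs with replacement and re-estimate.
    draws = []
    for _ in range(B):
        idx = [random.randrange(n) for _ in range(n)]
        draws.append(ols_slope([x[i] for i in idx], [y[i] for i in idx]))
    return draws

def parametric_bootstrap(B=300):
    # Simulate fresh Gaussian errors from the fitted model, so every
    # bootstrap observation is new rather than a repeat of the data.
    resid = [yi - beta_hat * xi for xi, yi in zip(x, y)]
    sigma = math.sqrt(sum(r * r for r in resid) / (n - 1))
    draws = []
    for _ in range(B):
        y_sim = [beta_hat * xi + random.gauss(0, sigma) for xi in x]
        draws.append(ols_slope(x, y_sim))
    return draws

def sd(v):
    m = sum(v) / len(v)
    return math.sqrt(sum((a - m) ** 2 for a in v) / (len(v) - 1))

se_np = sd(nonparametric_bootstrap())  # nonparametric standard error
se_p = sd(parametric_bootstrap())      # parametric standard error
```

For this smooth estimator the two standard errors roughly agree; the text's warning is that for non-differentiable, model-selecting estimators such as the lasso only the parametric scheme (or subsampling, below) remains reliable, and it is reliable only conditional on the assumed error model.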

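Stepping back to section 3.3, the embedding arithmetic described there can also be checked numerically. This sketch uses the toy three-dimensional embeddings from the illustrative table (Royalty, Masculinity, Age); real embeddings are estimated, with hundreds of unlabeled dimensions, but the nearest-neighbor logic is the same.

```python
import math

# Toy embeddings copied from the illustrative table in section 3.3.
emb = {
    "king":   [0.99, 0.94, 0.73],
    "queen":  [0.99, 0.06, 0.81],
    "prince": [0.95, 0.02, 0.15],
    "man":    [0.01, 0.99, 0.61],
    "woman":  [0.02, 0.02, 0.68],
    "child":  [0.01, 0.49, 0.09],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def analogy(a, minus, plus):
    # Form the vector a - minus + plus and return the vocabulary word
    # closest to it in cosine similarity.
    target = [p - q + r for p, q, r in zip(emb[a], emb[minus], emb[plus])]
    return max(emb, key=lambda w: cosine(emb[w], target))

print(analogy("king", "man", "woman"))   # queen
print(analogy("king", "man", "child"))   # prince
```

Even in three labeled dimensions, king − man + woman lands closest to queen, mirroring the geometry sketched in figure 2, panel B.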
An alternative method, subsampling, provides a nonparametric approach to inference that remains robust to estimation features such as non-differentiable losses and model selection. The book by Politis, Romano, and Wolf (1999) gives a comprehensive overview. In subsampling, the data are partitioned into subsamples without replacement (the number of subsamples being a function of the total sample size) and the target parameters are reestimated separately on each subsample. The advantage of subsampling is that, because each subsample is a draw from the true DGP, it works for a wide variety of estimation algorithms. However, since each subsample is smaller than the sample of interest, one needs to know the estimator's rate of convergence in order to translate between the uncertainty in the subsamples and in the full sample.20

Finally, one may consider sample splitting with methods that involve a model selection step (i.e., those setting some parameter values to zero, such as lasso). Model selection is performed on one "selection" sample, then the standard inference is performed on the second "estimation" sample conditional upon the selected model.21 Econometricians have successfully used this approach to obtain accurate inference for machine learning procedures. For example, Chernozhukov et al. (2018) use it in the context of treatment effect estimation via lasso, and Athey and Imbens (2016) use sample splitting for inference about heterogeneous treatment effects in the context of a causal tree model.

3.5 Some Practical Advice

The methods above will be compared and contrasted in our subsequent discussion of applications. In broad terms, however, we can make some rough recommendations for practitioners.

3.5.1 Choosing the Best Approach for a Specific Application

Dictionary-based methods heavily weight prior information about the function mapping features c_i to outcomes v_i. They are therefore most appropriate in cases where such prior information is strong and reliable, and where information in the data is correspondingly weak. An obvious example is a case where the outcomes v_i are not observed for any i, so there is no training data available to fit a supervised model, and where the mapping of interest does not match the factor structure of unsupervised methods such as topic models.
In the setting of Baker, Bloom, and Davis (2016), for example, there is no ground truth data on the actual level of policy uncertainty reflected in particular articles, and fitting a topic model would be unlikely to endogenously pick out policy uncertainty as a topic. A second example is where some training data do exist, but it is sufficiently small or noisy that the researcher believes a prior-driven specification of f(⋅) is likely to be more reliable.

Text regression is generally a good choice for predicting a single attribute, especially when one has a large amount of labeled training data available. As described in Ng and Jordan (2001) and Taddy (2013c), supervised generative techniques such as naive Bayes and MNIR can improve prediction when p is large relative to n; however, these gains diminish with the sample size due to the asymptotic efficiency of many text regression techniques. In text regression, we have found that it is usually unwise to attempt to learn flexible functional forms unless n is much larger than p. When this is not the case, we generally recommend linear regression methods. Given the availability of fast and robust tools (gamlr and glmnet in R, and scikit-learn in Python), and the typically high dimensionality of text data, many prediction tasks in social science with text inputs can be efficiently addressed via penalized linear regression.

When there are multiple attributes of interest, and one wishes to resolve or control for interdependencies between these attributes and their effects on language, then one will need to work with a generative model for text. Multinomial logistic regression and its extensions can be applied to such situations, particularly via distributed multinomial regression. Alternatively, for corpora of many unlabeled documents (or when the labels do not tell the whole story that one wishes to investigate), topic modeling is the obvious approach. Word embeddings are also becoming an option for such questions. In the spirit of contemporary machine learning, it is also perfectly fine to combine techniques. For example, a common setting will have a large corpus of labeled documents as well as a smaller set of documents about which some metadata exist. One approach is to fit a topic model on the larger corpus, and to then use these topics as well as the token counts for supervised text regression on the smaller labeled corpus.

3.5.2 Model Validation and Interpretation

Ex ante criteria for selecting an empirical approach are suggestive at best. In practice, it is also crucial to validate the performance of the estimation approach ex post. Real research often involves an iterative tuning process with repeated rounds of estimation, validation, and adjustment.

When the goal is prediction, the primary tool for validation is checking out-of-sample predictive performance on data held out from the main estimation sample. In section 3.1.1, we discussed the technique of cross-validation (CV) for penalty selection, a leading example. More generally, whenever one works with complex and high-dimensional data, it is good practice to reserve a test set of data to use in estimation of the true average prediction error. Looping across multiple test sets, as in CV, is a common way of reducing the variance of these error estimates. (See Efron 2004 for a classic overview.)

In many social science applications, the goal is to go beyond prediction and use the values V̂ in some subsequent descriptive or causal analysis. In these cases, it is important to also validate the accuracy with which the fitted model is capturing the economic or descriptive quantity of interest.

One approach that is often effective is manual audits: cross-checking some subset of the fitted values against the coding a human would produce by hand. An informal version of this is for a researcher to simply inspect a subset of documents alongside the fitted V̂ and evaluate whether the estimates align with the concept of interest. A formal version would involve having one or more people manually classify each document in a subset and evaluating quantitatively the consistency between the human and machine codings. The subsample of documents does not need to be large in order for this exercise to be valuable—often as few as twenty or thirty documents is enough to provide a sense of whether the model is performing as desired.

This kind of auditing is especially important for dictionary methods. Validity hinges on the assumption that a particular function of text features—counts of positive or negative words, an indicator for the presence of certain keywords, etc.—will be a valid predictor of the true latent variable V. In a setting where we have sufficient prior information to justify this assumption, we typically also have enough prior information to evaluate whether the resulting classification looks accurate. An excellent example of this is Baker, Bloom, and Davis (2016), who perform a careful manual audit to validate their dictionary-based method for identifying articles that discuss policy uncertainty.

Audits are also valuable in studies using other methods. In Gentzkow and Shapiro (2010), for example, the authors perform an audit of news articles that their fitted model classifies as having a right-leaning or left-leaning slant. They do not compare this against hand coding directly, but rather count the number of times the key phrases that are weighted by the model are used straightforwardly in news text, as opposed to occurring in quotation marks or in other types of articles such as letters to the editor.

A second approach to validating a fitted model is inspecting the estimated coefficients or other parameters of the model directly. In the context of text regression methods, however, this needs to be approached with caution. While there is a substantial literature on statistical properties of estimated parameters in penalized regression models (see Bühlmann and van de Geer 2011 and Hastie, Tibshirani, and Wainwright 2015), the reality is that these coefficients are typically only interpretable in cases where the true model is extremely sparse, so that the model is likely to have selected the correct set of variables with high probability. Otherwise, multicollinearity means the set of variables selected can be highly unstable.

These difficulties notwithstanding, inspecting the most important coefficients to see if they make intuitive sense can still be useful as a validation and sanity check. Note that "most important" can be defined in a number of ways; one can rank estimated coefficients by their absolute values, or by absolute value scaled by the standard deviation of the associated covariate, or perhaps by the order in which they first become nonzero in a lasso path of decreasing penalties. Alternatively, see Gentzkow, Shapiro, and Taddy (2016) for application-specific term rankings.

Inspection of fitted parameters is generally more informative in the context of a generative model. Even there, some caution is in order. For example, Taddy (2015a) finds that for MNIR models, getting an interpretable set of word loadings requires careful penalty tuning and the inclusion of appropriate control variables. As in text regression, it is usually worthwhile to look at the largest coefficients for validation but not take the smaller values too seriously.

Interpretation or story building around estimated parameters tends to be a major focus for topic models and other unsupervised generative models. Interpretation of the fitted topics usually proceeds by ranking the tokens in each topic according to token probability, θ_lj, or by token lift, θ_lj / p̄_j, with p̄_j = (1/n) Σ_i c_ij / m_i. For example, if the five highest-lift tokens in topic l for a model fit to a corpus of restaurant reviews are another.minute, flag.down, over.minute, wait.over, arrive.after, we might expect that reviews with high v_il correspond to negative experiences where the patron was forced to wait for service and food (example from Taddy 2012). Again, however, we caution against the overinterpretation of these unsupervised models: the posterior distributions informing parameter estimates are often multimodal, and multiple topic model runs can lead to multiple different interpretations. As argued in Airoldi and Bischof (2016) and in a comment by Taddy (2017a), the best way to build interpretability for topic models may be to add some supervision (i.e., to incorporate external information on the topics for some set of cases).

20 For many applications, doing this translation under the assumption of a standard √n learning rate is a reasonable choice. For example, Knight and Fu (2000) show conditions under which the √n rate holds for the lasso. However, in many text-as-data applications the dimension of the data (the size of the vocabulary) will be large and growing with the number of observations. In such settings √n learning may be optimistic, and more sophisticated methods may need to be used to infer the learning rate (see, e.g., Politis, Romano, and Wolf 1999).
21 For example, when using lasso, one might apply OLS in the second stage using only the covariates selected by lasso in the first stage.

4. Applications

We now turn to applications of text analysis in economics and related social sciences.

Rather than presenting a comprehensive literature survey, the goal of this section is to present a selection of illustrative papers to give the reader a sense of the wide diversity of questions that may be addressed with textual analysis and to provide a flavor of how some of the methods in section 3 are applied in practice.

4.1 Authorship

A classic descriptive problem is inferring the author of a document. While this is not usually a first-order research question for social scientists, it provides a particularly clean example, and a good starting point to understand the applications that follow.

In what is often seen as the first modern statistical analysis of text data, Mosteller and Wallace (1963) use text analysis to infer the authorship of the disputed Federalist Papers that had alternatively been attributed to either Alexander Hamilton or James Madison. They define documents i to be individual Federalist Papers, the data features c_i of interest to be counts of function words such as "an," "of," and "upon" in each document, and the outcome v_i to be an indicator for the identity of the author. Note that the function words the authors focus on are exactly the "stop words" that are frequently excluded from analysis (as discussed in section 2 above). The key feature of these words is that their use by a given author tends to be stable regardless of the topic, tone, or intent of the piece of writing. This means they provide little valuable information if the goal is to infer characteristics such as political slant or discussion of policy uncertainty that are independent of the idiosyncratic styles of particular authors. When such styles are the object of interest, however, function words become among the most informative text characteristics. Mosteller and Wallace (1963) use a sample of Federalist Papers, whose authorship by either Madison or Hamilton is undisputed, to train a naive Bayes classifier (a supervised generative model, as discussed in section 3.2) in which the probabilities p(c_ij | v_i) of each phrase j are assumed to be independent Poisson or negative binomial random variables, and the inferences for the unknown documents are made by Bayes' rule. The results provide overwhelming evidence that all of the disputed papers were authored by Madison.

Stock and Trebbi (2003) apply similar methods to answer an authorship question of more direct interest to economists: who invented instrumental variables? The earliest known derivation of the instrumental variables estimator appears in an appendix to The Tariff on Animal and Vegetable Oils, a 1928 book by statistician Philip Wright. While the bulk of the book is devoted to a "painfully detailed treatise on animal and vegetable oils, their production, uses, markets and tariffs," the appendix is of an entirely different character, with "a succinct and insightful explanation of why data on price and quantity alone are in general inadequate for estimating either supply or demand; two separate and correct derivations of the instrumental variables estimators of the supply and demand elasticities; and an empirical application" (Stock and Trebbi 2003, p. 177). The contrast between the two parts of the book has led to speculation that the appendix was not written by Philip Wright, but rather by his son Sewall. Sewall Wright was a geneticist who had originated the method of "path coefficients" used in one of the derivations in the appendix. Several authors including Manski (1988) are on record attributing authorship to Sewall; others including Angrist and Krueger (2001) attribute it to Philip.

In Stock and Trebbi's (2003) study, the outcome v_i is an indicator for authorship by either Philip or Sewall. The data features c_i = [c_i^func c_i^gram] are counts of the same function words used by Mosteller and

Wallace (1963) plus counts of a set of grammatical constructions (e.g., "noun followed by adverb") measured using an algorithm due to Mannion and Dixon (1997). The training sample C^train consists of forty-five documents known to have been written by either Philip or Sewall, and the test sample C^test in which v_i is unobserved consists of eight blocks of text from the appendix plus one block of text from chapter 1 of The Tariff on Animal and Vegetable Oils included as a validity check. The authors apply principal components analysis, which we can think of as an unsupervised cousin of the topic modeling approach discussed in section 3.2.1. They extract the first four principal components from c_i^func and c_i^gram respectively, and then run regressions of the binary authorship variable on the principal components, resulting in predicted values v̂_i^func and v̂_i^gram.

The results provide overwhelming evidence that the disputed appendix was in fact written by Philip. Figure 3 plots the values (v̂_i^func, v̂_i^gram) for all of the documents in the sample. Each point in the figure is a document i, and the labels indicate whether the document is known to be written by Philip ("P" or "1"), known to be written by Sewall ("S"), or of uncertain authorship ("B"). The measures clearly distinguish the two authors, with documents by each forming clear, nonoverlapping clusters. The uncertain documents all fall squarely within the cluster attributed to Philip, with the predicted values v̂_i^func, v̂_i^gram ≈ 1.

4.2 Stock Prices

An early example analyzing news text for stock price prediction appears in Cowles (1933). He subjectively categorizes the text of editorial articles of William Peter Hamilton, chief editor of the Wall Street Journal from 1902–29, as "bullish," "bearish," or "doubtful." Cowles then uses these classifications to predict future returns of the Dow Jones Industrial Average. Hamilton's track record is unimpressive. A market-timing strategy based on his Wall Street Journal editorials underperforms a passive investment in the Dow Jones Industrial Average by 3.5 percentage points per year.

In its modern form, the implementation of text-based prediction in finance is computationally driven, but it applies methods that are conceptually similar to Cowles's approach, seeking to predict the target quantity V (the Dow Jones return, in the example of Cowles) from the array of token counts C. We discuss three examples of recent papers that study equity return prediction in the spirit of Cowles: one relying on a preexisting dictionary (as discussed at the beginning of section 3), one using regression techniques (as discussed in section 3.1), and another using generative models (as discussed in section 3.2).

Tetlock's 2007 paper is a leading dictionary-based example of analyzing media sentiment and the stock market. He studies word counts c_i in the Wall Street Journal's widely read "Abreast of the Market" column. Counts from each article i are converted into a vector of sentiment scores v̂_i in seventy-seven different sentiment dimensions based on the Harvard IV-4 psychosocial dictionary.22 The time series of daily sentiment scores for each category (v̂_i) are condensed into a single principal component, which Tetlock names the "pessimism factor" due to the component's especially close association with the "pessimism" dimension of the sentiment categories.

22 While it has only recently been used in economics and finance, the Harvard dictionary and associated General Inquirer software for textual content analysis dates to the 1960s and has been widely used in linguistics, psychology, sociology, and anthropology.

The second stage of the analysis uses this pessimism score to forecast stock market

Figure 3. Scatterplot of Authorship Predictions from PCA Method

Source: Stock and Trebbi (2003). Copyright American Economic Association; reproduced with the permission of the Journal of Economic Perspectives.

activity. High pessimism significantly negatively forecasts one-day-ahead returns on the Dow Jones Industrial Average. This effect is transitory, and the short-term index dip associated with media pessimism reverts within a week, consistent with the interpretation that article text is informative regarding media and investor sentiment, as opposed to containing fundamental news that permanently impacts prices.

The number of studies using dictionary methods to study asset pricing phenomena is growing. Loughran and McDonald (2011) demonstrate that the widely used Harvard dictionary can be ill-suited for financial applications. They construct an alternative, finance-specific dictionary of positive and negative terms and document its improved predictive power over existing sentiment dictionaries. Bollen, Mao, and Zeng (2011) document a significant predictive correlation between Twitter messages and the stock market using other dictionary-based tools such as OpinionFinder and Google's Profile of Mood States. Wisniewski and Lambe (2013) show that negative media attention of the banking sector, summarized via ad hoc pre-defined word lists, Granger-causes bank stock returns during the 2007–2009 financial crisis and not the reverse, suggesting that journalistic views have the potential to influence market outcomes, at least in extreme states of the world.

The use of text regression for asset pricing is exemplified by Jegadeesh and Wu (2013). They estimate the response of company-level stock returns, v_i, to text information in the company's annual report (token counts, c_i). The authors' objective is to determine whether regression techniques offer improved stock return forecasts relative to dictionary methods.

The authors propose the following regression model to capture correlations between occurrences of individual words and subsequent stock return realizations around regulatory filing dates:

  v_i = a + b Σ_j w_j (c_ij / Σ_j c_ij) + ε_i.

Documents i are defined to be annual reports filed by firms at the Securities and Exchange Commission. The outcome variable v_i is a stock's cumulative four-day return beginning on the filing day. The independent variable c_ij is a count of occurrences of word j in annual report i. The coefficient w_j summarizes the average association between an occurrence of word j and the stock's subsequent return. The authors show how to estimate w_j from a cross-sectional regression, along with a subsequent rescaling of all coefficients to remove the common influence parameter b. Finally, variables to predict returns are built from the estimated weights, and are shown to have stronger out-of-sample forecasting performance than dictionary-based indices from Loughran and McDonald (2011). The results highlight the limitations of using fixed dictionaries for diverse predictive problems, and that these limitations are often surmountable by estimating application-specific weights via regression.

Manela and Moreira (2017) take a regression approach to construct an index of news-implied market volatility based on text from the Wall Street Journal from 1890–2009. They apply support vector machines, a nonlinear regression method that we discuss in section 3.1.3. This approach applies a penalized least squares objective to identify a small subset of words whose frequencies are most useful for predicting outcomes—in this case, turbulence in financial markets. Two important findings emerge from their analysis. First, the terms most closely associated with market volatility relate to government policy and wars. Second, high levels of news-implied volatility forecast high future stock market returns. These two facts together give insight into the types of risks that drive investors' valuation decisions.23

23 While Manela and Moreira (2017) study aggregate market volatility, Kogan et al. (2009) and Boudoukh et al. (2016) use text from news and regulatory filings to predict firm-specific volatility. Chinco, Clark-Joseph, and Ye (2017) apply lasso in high frequency return prediction using preprocessed financial news text sentiment as an explanatory variable.

The closest modern analog of Cowles's study is Antweiler and Frank (2004), who take a generative modeling approach to ask: how informative are the views of stock market prognosticators who post on internet message boards? Similar to Cowles's analysis, these authors classify postings on stock message boards as "buy," "sell," or "hold" signals. But the vast number of postings, roughly 1.5 million in the analyzed sample, makes subjective classification of messages infeasible. Instead, generative techniques allow the authors to automatically classify messages. The authors create a training sample of one thousand messages, and form V^train by manually classifying messages into one of the three categories. They then use the naive Bayes method described in section 3.2.2 to estimate a probability model that maps word counts of postings C into classifications V̂ for the remaining 1.5 million messages. Finally, the buy/sell/hold classification of each message is aggregated into an index that is used to forecast stock returns. Consistent with the conclusions of Cowles, message board postings show little ability to predict stock returns. They do, however, possess significant and economically meaningful

Bandiera et al. (2017) apply unsupervised machine learning—topic modeling (LDA)—to a large panel of CEO diary data. They uncover two distinct behavioral types, which they classify as "leaders," who focus on communication and coordination activities, and "managers," who emphasize production-related activities. They show that, due to horizontal differentiation of firm and manager types, appropriately matched firms and CEOs enjoy better firm performance. Mismatches are more common in lower-income economies, and mismatches can account for 13 percent of the labor productivity gap between firms in high- and middle/low-income countries.

4.3 Central Bank Communication

A related line of research analyzes the impact of communication from central banks on financial markets. As central banks rely increasingly on public statements to achieve policy objectives, an understanding of their effects is increasingly relevant.

Lucca and Trebbi (2009) use the content of Federal Open Market Committee (FOMC) statements to predict fluctuations in Treasury securities. To do this, they use two different dictionary-based methods (section 3)—Google and Factiva semantic orientation scores—to construct v̂_i, which quantifies the direction and intensity of the ith FOMC statement. In the Google score, c_i counts how many Google search hits occur when searching for phrases in i plus one of the words from a list of antonym pairs signifying positive or negative sentiment (e.g., "hawkish" versus "dovish"). These counts are mapped into v̂_i by differencing the frequency of positive and negative searches and averaging over all phrases in i. The Factiva score is calculated similarly. Next, the central bank sentiment proxies v̂_i are used to predict Treasury yields in a vector autoregression (VAR). They find that changes in statement content, as opposed to unexpected deviations in the federal funds target rate, are the main driver of changes in interest rates.

Born, Ehrmann, and Fratzscher (2014) extend this idea to study the effect of central bank sentiment on stock market returns and volatility. They construct a financial stability sentiment index v̂_i from Financial Stability Reports (FSRs) and speeches given by central bank governors. Their approach uses a sentiment dictionary to assign optimism scores to word counts c_i from central bank communications. They find that optimistic FSRs tend to increase equity prices and reduce market volatility during the subsequent month.

Hansen, McMahon, and Prat (2018) study how FOMC transparency affects debate during meetings by exploiting a change in disclosure policy. Prior to November 1993, FOMC meeting transcripts were secret, but following a policy shift transcripts became public with a time lag. Increased transparency has both potential benefits, such as more efficient and informed debate due to greater accountability of policy makers, and potential costs: transparency may make committee members more cautious, biased toward the status quo, or prone to group-think.

The authors use topic modeling (section 3.2.1) to study 149 FOMC meeting transcripts during Alan Greenspan's tenure. The unit of observation is a member–meeting. The vector c_i counts the words used by FOMC member m in meeting t, and i is defined as the pair (m, t). The outcome of interest, v_i, is a vector that includes the proportion of i's language devoted to the K different topics (estimated from the fitted topic model), the concentration of these topic weights, and the frequency of data citation by i. Next, a difference-in-differences regression estimates the effects of the change in transparency on v̂_i. The authors find that, after the move to a more transparent system, inexperienced members discuss a wider range of topics and make more references to data when discussing economic conditions (consistent with increased accountability), but also speak more like Chairman Greenspan during policy discussions (consistent with increased conformity). Overall, the accountability effect appears stronger, as inexperienced members' topics appear to be more influential in shaping future deliberation after transparency.

4.4 Nowcasting

Important variables such as unemployment, retail sales, and GDP are measured at low frequency, and estimates are released with a significant lag. Others, such as racial prejudice or local government corruption, are not captured by standard measures at all. Text produced online, such as search queries, social media posts, and listings on job websites, can be used to construct alternative real-time estimates of the current values of these variables. By contrast with the standard exercise of forecasting future variables, this process of using diverse data sources to estimate current variables has been termed "nowcasting" in the literature (Bańbura et al. 2013).

A prominent early example is the Google Flu Trends project. Zeng and Wagner (2002) note that the volume of searches or web hits seeking information related to a disease may be a strong predictor of its prevalence. Johnson et al. (2004) provide an early data point suggesting that browsing influenza-related articles on the website healthlink.com is correlated with traditional surveillance data from the Centers for Disease Control (CDC). In the late 2000s, a group of Google engineers built on this idea to create a product that predicts flu prevalence from Google searches using text regression. The results are reported in a widely cited Nature article by Ginsberg et al. (2009).

The raw data consist of "hundreds of billions of individual searches from 5 years of Google web search logs." Aggregated search counts are arranged into a vector c_i, where a document i is defined to be a particular US region in a particular week, and the outcome of interest v_i is the true prevalence of flu in the region–week. In the training data, this is taken to be equal to the rate measured by the CDC. The authors first restrict attention to the fifty million most common terms, then select those most diagnostic of an outbreak using text regression (section 3.1), specifically a variant of partial least squares regression. They first run fifty million univariate regressions of log(v_i / (1 − v_i)) on log(c_ij / (1 − c_ij)), where c_ij is the share of searches in i containing search term j. They then fit a sequence of multivariate regression models of v_i on the top n terms j, as ranked by average predictive power across regions, for n ∈ {1, 2, …}. Next, they select the value of n that yields the best fit on a hold-out sample. This yields a regression model with n = 45 terms. The model produces accurate flu rate estimates for all regions approximately 1–2 weeks ahead of the CDC's regular report publication dates.25

25 A number of subsequent papers debate the longer-term performance of Google Flu Trends. Lazer et al. (2014), for example, show that the accuracy of the Google Flu Trends model—which has not been recalibrated or updated based on more recent data—has deteriorated dramatically, and that in recent years it is outperformed by simple extrapolation from prior CDC estimates. This may reflect changes in both search patterns and the epidemiology of the flu, and it suggests a general lesson that the predictive relationship mapping text to a real outcome of interest may not be stable over time. On the other hand, Preis and Moat (2014) argue that an adaptive version of the model that more flexibly accounts for joint dynamics in flu incidence and search volume significantly improves real-time influenza monitoring.
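The selection procedure described above, univariate log-odds screening followed by choosing the number of terms on a hold-out sample, can be illustrated on simulated data. Everything below (the number of terms, the noise level, the helper names) is an assumption for illustration; the real exercise screened fifty million candidate terms.

```python
# A stylized version of the two-step term-selection procedure described above.
# Simulated weekly search shares stand in for the actual query logs.
import numpy as np

rng = np.random.default_rng(1)
logit = lambda p: np.log(p / (1 - p))

n_weeks, n_terms = 120, 30
x = rng.uniform(0.01, 0.10, size=(n_weeks, n_terms))   # c_ij: search-term shares
beta = np.zeros(n_terms); beta[:5] = 8.0               # only 5 terms are informative
flu = 1 / (1 + np.exp(-(-4 + x @ beta + rng.normal(0, 0.3, n_weeks))))  # v_i

train, hold = np.arange(0, 80), np.arange(80, 120)
y = logit(flu)

# Step 1: univariate regressions of logit(v) on logit(c_j); rank terms by R^2.
def r2(a, b):
    resid = b - np.polyval(np.polyfit(a, b, 1), a)
    return 1 - resid.var() / b.var()

scores = [r2(logit(x[train, j]), y[train]) for j in range(n_terms)]
order = np.argsort(scores)[::-1]

# Step 2: fit multivariate models on the top n terms; pick n on the hold-out.
def fit_predict(cols):
    X = np.column_stack([np.ones(n_weeks), logit(x[:, cols])])
    coef, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    return X @ coef

hold_mse = []
for n in range(1, 16):
    pred = fit_predict(order[:n])
    hold_mse.append(np.mean((pred[hold] - y[hold]) ** 2))

best_n = 1 + int(np.argmin(hold_mse))
```

The same logic scales to the actual setting by replacing the simulated shares with aggregated query counts and the simulated outcome with CDC-measured flu rates.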

Related work in economics attempts to nowcast macroeconomic variables using data on the frequency of Google search terms. In Choi and Varian (2012) and Scott and Varian (2014, 2015), search term counts are aggregated by week and by geographic location, then converted to location-specific frequency indices. They estimate spike-and-slab Bayesian forecasting models, discussed in section 3.1.4 above. Forecasts of regional retail sales, new housing starts, and tourism activity are all significantly improved by incorporating into linear models a few search-term indices relevant to each category. Their results suggest a potential for large gains in forecasting power using web browser search data.

Saiz and Simonsohn (2013) use web search results to estimate the current extent of corruption in US cities. Standard corruption measures based on surveys are available at the country and state level, but not for smaller geographies. The authors use a dictionary approach in which the index v̂_i of corruption is defined to be the ratio of search hits for the name of a geographic area i plus the word "corruption" divided by hits for the name of the geographic area alone. These counts are extracted from search engine results. As a validation, the authors first show that country-level and state-level versions of their measure correlate strongly with established corruption indices and covary in a similar way with country- and state-level demographics. They then compute their measure for US cities and study its observable correlates.

Stephens-Davidowitz (2014) uses the frequency of racially charged terms in Google searches to estimate levels of racial animus in different areas of the United States. Estimating animus via traditional surveys is challenging because individuals are often reluctant to state their true attitudes. The paper's results suggest Google searches provide a less filtered, and therefore more accurate, measure. The author uses a dictionary approach in which the index v̂_i of racial animus in area i is the share of searches originating in that area that contain a set of racist words. He then uses these measures to estimate the impact of racial animus on votes for Barack Obama in the 2008 election, finding a statistically significant and economically large negative effect on Obama's vote share relative to the Democratic vote share in the previous election.

4.5 Policy Uncertainty

Among the most influential applications of text analysis in the economics literature to date is a measure of economic policy uncertainty (EPU) developed by Baker, Bloom, and Davis (2016). Uncertainty about both the path of future government policies and the impact of current government policies has the potential to increase risk for economic actors and so to depress investment and other economic activity. The authors use text from news outlets to provide a high-frequency measure of EPU and then estimate its economic effects.

Baker, Bloom, and Davis (2016) define the unit of observation i to be a country–month. The outcome of interest v_i is the true level of economic policy uncertainty. The authors apply a dictionary method to produce estimates v̂_i based on digital archives of ten leading newspapers in the United States. An element of the input data c_ij is a count of the number of articles in newspaper j in country–month i containing at least one keyword from each of three categories defined by hand: one related to the economy, a second related to policy, and a third related to uncertainty. The raw counts are scaled by the total number of articles in the corresponding newspaper–month and normalized to have standard deviation one. The predicted value v̂_i is then defined to be a simple average of these scaled counts across newspapers.

The simplicity of this construction makes the index flexible across a broad range of applications. For instance, by including a fourth, policy-specific category of keywords, the authors can estimate narrower indices related to Federal Reserve policy, inflation, and so on. Baker, Bloom, and Davis (2016) validate v̂_i using a human audit of twelve thousand articles from 1900–2012. Teams manually scored articles on the extent to which they discuss economic policy uncertainty and the specific policies they relate to. The resulting human-coded index has a high correlation with v̂_i.

With the estimated v̂_i in hand, the authors analyze the micro- and macro-level effects of EPU. Using firm-level regressions, they first measure how firms respond to this uncertainty and find that it leads to reduced employment, reduced investment, and greater asset price volatility for the affected firm. Then, using both US and international panel VAR models, the authors find that increased v̂_i is a strong predictor of lower investment, employment, and production.

Hassan et al. (2017) measure political risk at the firm level by analyzing quarterly earnings call transcripts. Their measure captures the frequency with which policy-oriented language and "risk" synonyms co-occur in a transcript. Firms with high levels of political risk actively hedge these risks by lobbying more intensively and donating more to politicians. When a firm's political risk rises, it tends to retrench hiring and investment, consistent with the findings of Baker, Bloom, and Davis (2016) at the aggregate level. These findings indicate that political shocks are an important source of idiosyncratic firm-level risk.
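The dictionary construction behind an EPU-style index can be sketched directly: flag articles containing at least one term from each of three hand-defined categories, compute per-month shares, normalize, and average across newspapers. The keyword sets and toy "articles" below are illustrative stand-ins, not the authors' actual term lists.

```python
# A minimal sketch of a Baker-Bloom-Davis style dictionary index.
# Keyword lists and articles are hypothetical examples.
import numpy as np

ECONOMY = {"economy", "economic"}
POLICY = {"policy", "regulation", "congress", "legislation"}
UNCERTAINTY = {"uncertain", "uncertainty"}

def article_flag(text):
    """1 if the article contains at least one term from each category."""
    words = set(text.lower().split())
    return int(bool(words & ECONOMY and words & POLICY and words & UNCERTAINTY))

def epu_index(papers):
    """papers: list of newspapers, each a list of months, each a list of articles.
    Returns one index value per month, averaged across newspapers.
    Assumes each newspaper's monthly shares vary (nonzero std dev)."""
    scaled = []
    for months in papers:
        # share of flagged articles in each newspaper-month
        shares = np.array([np.mean([article_flag(a) for a in m]) for m in months])
        scaled.append(shares / shares.std())      # normalize to unit std dev
    return np.mean(scaled, axis=0)                # simple average across papers

papers = [
    [["the economy is fine", "sports news today"],
     ["economic policy uncertainty rises", "congress debates uncertainty over economic policy"]],
    [["weather report", "local election results"],
     ["policy uncertainty weighs on the economy"]],
]
index = epu_index(papers)
```

Adding a fourth category set to `article_flag` would produce the narrower topic-specific indices mentioned above.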

4.6 Media Slant

A text analysis problem that has received significant attention in the social science literature is measuring the political slant of media content. Media outlets have long been seen as having a uniquely important role in the political process, with the power to potentially sway both public opinion and policy. Understanding how and why media outlets slant the information they present is important to understanding the role media play in practice, and to informing the large body of government regulation designed to preserve a diverse range of political perspectives.

Groseclose and Milyo (2005) offer a pioneering application of text analysis methods to this problem. In their setting, i indexes a set of large US media outlets, and documents are defined to be the complete news text or broadcast transcripts for an outlet i. The outcome of interest v_i is the political slant of outlet i. To give this measure content, the authors use speeches by politicians in the US Congress to form a training sample, and define v_i within this sample to be a politician's Americans for Democratic Action (ADA) score, a measure of left–right political ideology based on congressional voting records. The predicted values v̂_i for the media outlets thus place them on the same left–right scale as the politicians, and answer the question "what kind of politician does this news outlet's content sound most similar to?"

The raw data are the full text of speeches by congresspeople and news reports by media outlets over a period spanning the 1990s to the early 2000s.26 The authors dramatically reduce the dimensionality of the data in an initial step by deciding to focus on a particularly informative subset of phrases: the names of two hundred think tanks. These think tanks are widely viewed as having clear political positions (e.g., the Heritage Foundation on the right and the NAACP on the left). The relative frequency with which a politician cites conservative as opposed to liberal think tanks turns out to be strongly correlated with the politician's ideology. The paper's premise is that the citation frequencies of news outlets will then provide a good index of those outlets' political slants. The features of interest c_i are a (1 × 50) vector of citation counts for each of forty-four highly cited think tanks plus six groups of smaller think tanks.

26 For members of Congress, the authors use all entries in the Congressional Record from January 1, 1993 to December 31, 2002. The text includes both floor speeches and documents the member chose to insert in the record but did not read on the floor. For news outlets, the time period covered is different for different outlets, with start dates as early as January 1990 and end dates as late as July 2004.

The text analysis is based on a supervised generative model (section 3.2.2). The utility that congress member or media firm i derives from citing think tank j is

  U_ij = a_j + b_j v_i + e_ij,

where v_i is the observable ADA score of a congress member i or the unobserved slant of media outlet i, and e_ij is an error distributed type-I extreme value. The coefficient b_j captures the extent to which think tank j is cited relatively more by conservatives. The model is fit by maximum likelihood, with the parameters (a_j, b_j) and the unknown slants v_i estimated jointly. This is an efficient but computationally intensive approach to estimation, and it constrains the authors' focus to twenty outlets. This limitation can be sidestepped using more recent approaches such as Taddy's (2013b) multinomial inverse regression.

Figure 4 shows the results, which suggest three main findings. First, the media outlets are all relatively centrist: with one exception, they are all to the left of the average Republican and to the right of the average Democrat.27 Second, the ordering matches conventional wisdom, with the New York Times and Washington Post on the left, and Fox News and the Washington Times on the right. Third, the large majority of outlets fall to the left of the average in Congress, which is denoted in the figure by "average US voter." The last fact underlies the authors' main conclusion, which is that there is an overall liberal bias in the media.

27 The one notable exception is the Wall Street Journal, which is generally considered to be right-of-center but which is estimated by Groseclose and Milyo (2005) to be the most left-wing outlet in their sample. This may reflect an idiosyncrasy specific to the way they cite think tanks; Gentzkow and Shapiro (2010) use a broader sample of text features and estimate a much more conservative slant for the Wall Street Journal.

Gentzkow and Shapiro (2010) build on the Groseclose and Milyo (2005) approach to measure the slant of 433 US daily newspapers. The main difference in approach is that Gentzkow and Shapiro (2010) omit the initial step that restricts the space of features to mentions of think tanks, and instead consider all phrases that appear in the 2005 Congressional Record as potential predictors, letting the data select those that are most diagnostic of ideology. These could potentially be think tank names, but they turn out instead to be politically charged phrases such as "death tax," "bring our troops home," and "war on terror."

After standard preprocessing—stemming and omitting stop words—the authors produce counts of all 2-grams and 3-grams by speaker. They then select the top one thousand phrases (five hundred of each length) by a χ² criterion that captures the degree to which each phrase is diagnostic of the speaker's party. This is the standard χ²-test statistic for the null hypothesis that phrase j is used equally often by Democrats and Republicans, and it will be high for phrases that are both used frequently and used asymmetrically by the parties.28

28 The statistic is

  χ²_j = (f_jr f_∼jd − f_jd f_∼jr)² / [(f_jr + f_jd)(f_jr + f_∼jd)(f_∼jr + f_jd)(f_∼jr + f_∼jd)],

where f_jd and f_jr denote the number of times phrase j is used by Democrats or Republicans, respectively, and f_∼jd and f_∼jr denote the number of times phrases other than j are used by Democrats and Republicans, respectively.
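The χ² selection statistic in footnote 28 is straightforward to compute from party-level phrase counts. The phrases and counts below are invented for illustration; only the formula follows the text.

```python
# Chi-squared phrase-selection statistic from footnote 28, on toy counts.
import numpy as np

phrases = ["death tax", "bring our troops home", "health care", "american people"]
f_jd = np.array([10, 150, 200, 190])   # uses of each phrase by Democrats (toy)
f_jr = np.array([90, 20, 205, 235])    # uses of each phrase by Republicans (toy)

def chi2(f_jd, f_jr):
    """Vectorized statistic for every phrase j at once."""
    f_njd = f_jd.sum() - f_jd          # uses of all other phrases by Democrats
    f_njr = f_jr.sum() - f_jr          # uses of all other phrases by Republicans
    num = (f_jr * f_njd - f_jd * f_njr) ** 2
    den = ((f_jr + f_jd) * (f_jr + f_njd)
           * (f_njr + f_jd) * (f_njr + f_njd))
    return num / den

stats = chi2(f_jd, f_jr)
top = [phrases[j] for j in np.argsort(stats)[::-1][:2]]   # most partisan phrases
```

As the text notes, the statistic rewards phrases that are both frequent and used asymmetrically, so the two partisan phrases dominate the two neutral ones in this toy example.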

Figure 4. Distribution of Political Orientation: Media Outlets and Members of Congress

Source: Groseclose and Milyo (2005). Reprinted with permission from the Quarterly Journal of Economics.

Next, a two-stage supervised generative method is used to predict newspaper slant v_i from the selected features. In the first stage, the authors run a separate regression for each phrase j of counts c_ij on speaker i's ideology, which is measured as the 2004 Republican vote share in the speaker's district. They then use the estimated coefficients β̂_j to produce predicted slant v̂_i ∝ Σ_{j=1}^{1,000} β̂_j c_ij for the unknown newspapers i.29

29 As Taddy (2013b) notes, this method (which Gentzkow and Shapiro 2010 derive in an ad hoc fashion) is essentially partial least squares. It differs from the standard implementation in that the variables v_i and c_ij would normally be standardized. Taddy (2013b) shows that doing so increases the in-sample predictive power of the measure from 0.37 to 0.57.

The main focus of the study is characterizing the incentives that drive newspapers' choice of slant. With the estimated v̂_i in hand, the authors estimate a model of consumer demand in which a consumer's utility from reading newspaper i depends on the distance between i's slant v_i and an ideal slant v*, which is greater the more conservative the consumer's ideology. Estimates of this model using zipcode-level circulation data imply the level of slant that newspapers would choose if their only incentive were to maximize profits. The authors then compare this profit-maximizing slant to the level actually chosen by newspapers, and ask whether the deviations can be predicted by the identity of the newspaper's owner or by other nonmarket factors such as the party of local incumbent politicians. They find that profit maximization fits the data well, and that ownership plays no role in explaining the choice of slant. In this study, v̂_i is both an independent variable of interest (in the demand analysis) and an outcome of interest (in the supply analysis).

Note that both Groseclose and Milyo (2005) and Gentzkow and Shapiro (2010) use a two-step procedure in which they reduce the dimensionality of the data in a first stage and then estimate a predictive model in the second. Taddy (2013b) shows how to combine a more sophisticated generative model with a novel estimation algorithm to fit the predictive model in a single step using the full set of phrases in the data. He shows that this substantially increases the in-sample predictive power of the measure.

Greenstein, Gu, and Zhu (2016) analyze the extent of bias and slant among Wikipedia contributors using similar methods. They find that contributors tend to edit articles with slants in opposition to their own. They also show that contributors' slants become less extreme as they gain experience, and that the bias reduction is largest for those with the most extreme initial biases.

4.7 Market Definition and Innovation Impact

Many important questions in industrial organization hinge on the appropriate definition of product markets. Standard industry definitions can be an imperfect proxy for the economically relevant concept. Hoberg and Phillips (2016) provide a novel way of classifying industries based on product descriptions in the text of company disclosures. This allows for flexible industry classifications that may vary over time as firms and economies evolve, and it allows the researchers to analyze the effect of shocks on competition and product offerings.

Each publicly traded firm in the United States must file an annual 10-K report describing, among other aspects of its business, the products that it offers. The unit of analysis i is a firm–year. Token counts from the business description section of the ith 10-K filing are represented in the vector c_i. A pairwise cosine similarity score s_ij, based on the angle between c_i and c_j, describes the closeness of product offerings for each pair i and j in the same filing year. Industries are then defined by clustering firms according to their cosine similarities.
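The classification just described rests on two operations, cosine similarity between count vectors and agglomerative clustering, which can be sketched as follows. The four "firms" and their token counts are toy examples, not actual 10-K text, and the helper function is our own simplification.

```python
# Sketch of text-based industry classification: cosine similarity between
# token-count vectors, then single-linkage agglomeration to a target number
# of clusters. All data are toy examples.
import numpy as np

# rows: firms; columns: token counts from hypothetical business descriptions
C = np.array([
    [8, 2, 0, 0],   # firm A: mostly "software"-type tokens
    [7, 3, 0, 1],   # firm B: similar to A
    [0, 1, 9, 4],   # firm C: mostly "drug"-type tokens
    [1, 0, 8, 5],   # firm D: similar to C
], dtype=float)

norms = np.linalg.norm(C, axis=1, keepdims=True)
S = (C / norms) @ (C / norms).T          # s_ij: pairwise cosine similarity

def agglomerate(S, n_clusters):
    """Greedy single linkage: repeatedly merge the two most similar clusters."""
    clusters = [{i} for i in range(len(S))]
    while len(clusters) > n_clusters:
        best, pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                link = max(S[i, j] for i in clusters[a] for j in clusters[b])
                if link > best:
                    best, pair = link, (a, b)
        a, b = pair
        clusters[a] |= clusters.pop(b)
    return clusters

industries = agglomerate(S, 2)
```

In the actual application the same steps run over thousands of firm–year vectors and stop at three hundred clusters rather than two.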

The clustering algorithm begins by assuming each firm is its own industry, and it gradually agglomerates firms into industries by grouping a firm with the cluster containing its nearest neighbor according to s_ij. The algorithm terminates when the number of industries (clusters) reaches three hundred, a number chosen for comparability to Standard Industrial Classification and North American Industry Classification System codes.30

30 Best, Hjort, and Szakonyi (2017) use a similar approach to classify products in their study of public procurement and organizational bureaucracy in Russia.

After establishing an industry assignment v̂_i for each firm–year, the authors examine the effect of military and software industry shocks on competition and product offerings among firms. As an example, they find that the events of September 11, 2001, increased entry in high-demand military markets and pushed products in this industry toward "non-battlefield information gathering and products intended for potential ground conflicts."

In a similar vein, Kelly et al. (2018) use cosine similarity among patent documents to create new indicators of patent quality. They assign higher quality to patents that are novel, in that they have low similarity with the existing stock of patents, and impactful, in that they have high similarity with subsequent patents. They then show that text-based novelty and similarity scores correlate strongly with measures of market value. Atalay et al. (2017) use text from job ads to measure task content, and they use their measure to show that within-occupation shifts in task content are at least as important as employment shifts across occupations in describing the large secular reallocation of routine tasks from humans to machines.

4.8 Topics in Research, Politics, and Law

A number of studies apply topic models (section 3.2.1) to describe how the focus of attention in a specific text corpus shifts over time.

A seminal contribution in this vein is Blei and Lafferty's (2007) analysis of topics in Science. Documents i are individual articles, the data c_i are counts of individual words, and the outcome of interest v_i is a vector of weights indicating the share of a given article devoted to each of one hundred latent topics. The authors extend the baseline LDA model of Blei, Ng, and Jordan (2003) to allow the importance of one topic in a particular article to be correlated with the presence of other topics. They fit the model using all Science articles from 1990–99. The results deliver an automated classification of article content into semantically coherent topics such as evolution, DNA and genetics, cellular biology, and volcanoes.

Applying similar methods in the political domain, Quinn et al. (2010) use a topic model to identify the issues being discussed in the US Senate over the period 1997–2004. Their approach deviates from the baseline LDA model in two ways. First, they assume that each speech is associated with a single topic. Second, their model incorporates time-series dynamics that allow the proportion of speeches generated by a given topic to evolve gradually over the sample, similar to the dynamic topic model of Blei and Lafferty (2006). Their preferred specification is a model with forty-two topics, a number chosen to maximize the subjective interpretability of the resulting topics.

Table 1 shows the words with the highest weights in each of twelve fitted topics. The labels "Judicial Nominations," "Constitutional," and so on are assigned by hand by the authors. The results suggest that the automated procedure successfully isolates coherent topics of congressional debate. After discussing the structure of topics in the fitted model, the authors track the relative importance of the topics across congressional sessions and argue that spikes in discussion of particular topics track, in an intuitive way, the occurrence of important debates and external events.

TABLE 1 Congressional Record Topics and Key Words

Topic (Short Label): Keys
1. Judicial nominations: nomine, confirm, nomin, circuit, hear, court, judg, judici, case, vacanc
2. Constitutional: case, court, attornei, supreme, justic, nomin, judg, m, decis, constitut
3. Campaign finance: campaign, candid, elect, monei, contribut, polit, soft, ad, parti, limit
4. Abortion: procedur, abort, babi, thi, life, doctor, human, ban, decis, or
5. Crime 1 [violent]: enforc, act, crime, gun, law, victim, violenc, abus, prevent, juvenil
6. Child protection: gun, tobacco, smoke, kid, show, firearm, crime, kill, law, school
7. Health 1 [medical]: diseas, cancer, research, health, prevent, patient, treatment, devic, food
8. Social welfare: care, health, act, home, hospit, support, children, educ, student, nurs
9. Education: school, teacher, educ, student, children, test, local, learn, district, class
10. Military 1 [manpower]: veteran, va, forc, militari, care, reserv, serv, men, guard, member
11. Military 2 [infrastructure]: appropri, defens, forc, report, request, confer, guard, depart, fund, project
12. Intelligence: intellig, homeland, commiss, depart, agenc, director, secur, base, defens

Source: Quinn et al. (2010). Reprinted with permission from John Wiley and Sons.
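Estimates like the topic weights behind Table 1 come from LDA-type models. Below is a minimal collapsed Gibbs sampler for vanilla LDA (not the single-topic, dynamic variant of Quinn et al.), run on a toy corpus; all counts and hyperparameters are illustrative choices of ours.

```python
# A minimal collapsed Gibbs sampler for vanilla LDA on a toy corpus.
import numpy as np

rng = np.random.default_rng(2)

docs = [
    [0, 0, 1, 2, 1],        # documents as lists of word ids
    [0, 1, 1, 2, 0],
    [3, 4, 4, 5, 3],
    [4, 3, 5, 5, 4],
]
V, K = 6, 2                 # vocabulary size, number of topics
alpha, beta = 0.5, 0.1      # symmetric Dirichlet hyperparameters

# count tables: topic-word, doc-topic, topic totals
nkw = np.zeros((K, V)); ndk = np.zeros((len(docs), K)); nk = np.zeros(K)
z = []                      # current topic assignment of every token
for d, doc in enumerate(docs):
    zd = rng.integers(K, size=len(doc))
    z.append(zd)
    for w, t in zip(doc, zd):
        nkw[t, w] += 1; ndk[d, t] += 1; nk[t] += 1

for _ in range(200):        # Gibbs sweeps
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            t = z[d][n]     # remove the token's current assignment
            nkw[t, w] -= 1; ndk[d, t] -= 1; nk[t] -= 1
            # full conditional p(z = k | everything else)
            p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
            t = rng.choice(K, p=p / p.sum())
            z[d][n] = t
            nkw[t, w] += 1; ndk[d, t] += 1; nk[t] += 1

theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)  # doc-topic shares
```

Because the first two toy documents share one vocabulary and the last two share another, the sampler should assign each pair a common dominant topic, mirroring how fitted topics separate "Judicial nominations" words from "Education" words in the real corpus.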

Sim, Routledge, and Smith (2015) estimate a topic model from the text of amicus briefs to the Supreme Court of the United States. They show that the overall topical composition of briefs for a given case, particularly along a conservative–liberal dimension, is highly predictive of how individual judges vote in the case.

5. Conclusion

Digital text provides a rich repository of information about economic and social activity. Modern statistical tools give researchers the ability to extract this information and encode it in a quantitative form amenable to descriptive or causal analysis. Both the availability of text data and the frontier of methods are expanding rapidly, and we expect the importance of text in empirical economics to continue to grow.

The review of applications above suggests a number of areas where innovation should proceed rapidly in coming years. First, a large share of text analysis applications continue to rely on ad hoc dictionary methods rather than deploying more sophisticated methods for feature selection and model training. As we have emphasized, dictionary methods are appropriate in cases where prior information is strong and the availability of appropriately labeled training data is limited. Experience in other fields, however, suggests that modern methods will likely outperform ad hoc approaches in a substantial share of cases.

Second, some of the workhorse methods of text analysis, such as penalized linear or logistic regression, have still seen limited application in social science. In other contexts, these methods provide a robust baseline that performs similarly to or better than more complex methods. We expect the domains in which these methods are applied to grow.

Finally, virtually all of the methods applied to date, including those we would label as sophisticated or on the frontier, are based on fitting predictive models to simple counts of text features.
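As a concrete illustration of the penalized-regression baseline discussed above, here is a small L2-penalized logistic regression on simulated bag-of-words counts, trained by plain gradient descent. The corpus, penalty, and learning rate are arbitrary toy choices of ours, not a recipe from any of the studies reviewed.

```python
# A sketch of the penalized logistic regression baseline on toy word counts.
import numpy as np

rng = np.random.default_rng(3)

V, n = 20, 200
# class 0 favors the first half of the vocabulary, class 1 the second half
probs = np.array([np.r_[np.full(10, 0.08), np.full(10, 0.02)],
                  np.r_[np.full(10, 0.02), np.full(10, 0.08)]])
y = rng.integers(2, size=n)
X = np.array([rng.multinomial(30, probs[c] / probs[c].sum()) for c in y], float)

def fit_penalized_logit(X, y, lam=0.1, lr=0.01, iters=500):
    w = np.zeros(X.shape[1]); b = 0.0
    for _ in range(iters):
        p = 1 / (1 + np.exp(-(X @ w + b)))
        grad_w = X.T @ (p - y) / len(y) + lam * w   # L2 penalty shrinks coefficients
        grad_b = np.mean(p - y)
        w -= lr * grad_w; b -= lr * grad_b
    return w, b

w, b = fit_penalized_logit(X, y)
pred = (X @ w + b > 0).astype(int)
accuracy = (pred == y).mean()
```

Swapping the L2 penalty for an L1 penalty (lasso) would additionally zero out uninformative words, which is often the more natural choice in high-dimensional text settings.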
Richer representations, such as tinue to rely on ad hoc dictionary methods word embeddings (3.3), and linguistic mod- rather than deploying more sophisticated els that draw on natural language processing 570 Journal of Economic Literature, Vol. LVII (September 2019) tools have seen tremendous success else- ­Economic Time Series Using Targeted Predictors.” Journal of Econometrics 146 (2): 304–17. where, and we see great potential for their Baker, Scott R., Nicholas Bloom, and Steven J. Davis. application in economics. 2016. “Measuring Economic Policy Uncertainty.” The rise of text analysis is part of a broader Quarterly Journal of Economics 131 (4): 1593–636. Ba´nbura, Marta, Domenico Giannone, Michele trend toward greater use of machine learn- Modugno, and Lucrezia Reichlin. 2013. “Now-Cast- ing and related statistical methods in eco- ing and the Real-Time Data Flow.” In Handbook of nomics. With the growing availability of Economic Forecasting, Vol. 2A, edited by Allan Tim- mermann and Graham Elliot, 195–237. Sebastopol: ­high-dimensional data in many domains— O’Reilly Media. from consumer purchase and browsing Bandiera, Oriana, Stephen Hansen, Andrea Prat, and behavior, to satellite and other spatial data, to Raffaella Sadun. 2017. “CEO Behavior and Firm Performance.” NBER Working Paper 23248. genetics and ­neuro-economics—the returns Belloni, Alexandre, Victor Chernozhukov, and Chris- are high to economists investing in learning tian B. Hansen. 2013. “Inference for High-Dimen- these methods and to increasing the flow of sional Sparse Econometric Models.” In Advances in Economics and Econometrics: Tenth World Con- ideas between economics and fields such as gress, Vol. 3, 245–95. Cambridge: Cambridge Uni- statistics and computer science, where fron- versity Press. tier innovations in these methods are taking Best, Michael Carlos, Jonas Hjort, and David Szakonyi. 2017. “Individuals and Organizations as Sources of place. 