The Visible Map and the Hidden Structure of Economics[1]

Angela Ambrosino*, Mario Cedrini*, John B. Davis**, Stefano Fiori*, Marco Guerzoni*,*** and Massimiliano Nuccio*

* Dipartimento di Economia e Statistica “Cognetti de Martiis”, Università di Torino, Italy
** Marquette University and University of Amsterdam
*** ICRIOS, Bocconi University, Milan

Abstract

The paper is a first, preliminary attempt to illustrate the potential of topic modeling as an information retrieval system that helps reduce problems of information overload in the sciences, and in economics in particular. Noting that some of the motives for using automated tools as information retrieval systems in economics have to do with the changing structure of the discipline itself, we argue that the standard classification system in economics, developed over a hundred years ago by the American Economic Association, the Journal of Economic Literature (JEL) codes, can easily assist in detecting the major faults of unsupervised techniques and possibly provide suggestions about how to correct them. With this aim in mind, we apply the topic-modeling technique known as Latent Dirichlet Allocation (LDA) to the corpus of (some 1,500) “exemplary” documents indicated for each classification of the JEL codes by the American Economic Association in the “JEL codes guide” (https://www.aeaweb.org/jel/guide/jel.php). LDA serves to discover the hidden (latent) thematic structure in large archives of documents by detecting probabilistic regularities, that is, trends in language use and recurring themes in the form of co-occurring words. The ambition is to propose and interpret measures of (dis)similarity between JEL codes and the LDA topics resulting from the analysis.

1 This work is part of a more general research project by Despina, the Big Data Lab of the University of Turin, Dipartimento di Economia e Statistica “Cognetti de Martiis” (www.despina.unito.it). We gratefully acknowledge financial support from the European Society for the History of Economic Thought (ESHET grant 2015).

Science and the problem of information overload

One of the various possible reasons why (Google’s chief) economist Hal R. Varian could be tremendously right in predicting, in October 2008 (speaking with McKinsey’s director in the San Francisco office, James Manyika, in Napa, California[2]), that “the sexy job in the next ten years will be statisticians” – “the ability to take data – to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it’s going to be a hugely important skill in the next decades” – has to do with the growth of scientific knowledge. In the 1950s, the world community of economists produced around a thousand articles in economics (indexed in Thomson Reuters’ Web of Science); when the Journal of Economic Literature was launched, some 5,000 major articles were published every year in about 250 economics journals. There are currently 347 impact-factor journals in economics, and economists currently publish some 20,000 works per annum (Claveau and Gingras, 2016). Scientific classification is already a very difficult task, and may become an impossible one. Still, the world of “ubiquitous data” Varian refers to is both a driver and source of, and at the same time a provider of solutions to, various forms of information overload.

2 “Hal Varian on how the Web challenges managers”, available at: https://www.mckinsey.com/industries/high-tech/our-insights/hal-varian-on-how-the-web-challenges-managers (accessed 12 May 2018).
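As a purely illustrative aside on the measures of (dis)similarity between JEL codes and LDA topics announced in the abstract: the paper does not commit, at this point, to a specific measure, but one standard way to quantify how far apart two probability distributions over the same vocabulary are is the Jensen-Shannon distance. The minimal sketch below uses invented numbers and hypothetical variable names; it is not the measure necessarily adopted in this study.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Two hypothetical probability distributions over the same (tiny) vocabulary,
# e.g. one estimated for a JEL code and one for an LDA topic; the numbers are invented.
jel_code_dist = np.array([0.6, 0.3, 0.1])
lda_topic_dist = np.array([0.5, 0.4, 0.1])

# Jensen-Shannon distance in base 2: 0 for identical distributions, 1 for disjoint ones.
dissimilarity = jensenshannon(jel_code_dist, lda_topic_dist, base=2)
print(f"Jensen-Shannon distance: {dissimilarity:.3f}")
```

The same computation extends directly to distributions over a full vocabulary, which is why distances of this kind are a natural candidate when comparing word-based representations of classification codes and topics.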
Web search engines, which often mine data available in databases and directories, help find information stored on World Wide Web sites, reducing not only the time needed to retrieve information but also the amount of information that one has to consult. The fundamental difference between such software systems and web directories is that the latter require human editors, whereas search engines can even operate in real time, running algorithms on web crawlers that collect and assess items at the time of the search query. At the same time, search is greatly facilitated by the design, through web-mining techniques, of taxonomies of entities, which help match search-query keywords with keywords from answers: new entities result from machine learning applied to current web search results, by detecting commonalities between the latter and existing entities.

In many ways, the problem of scientific classification, of defining research areas within disciplines, is one of information overload, given that classification systems derive their importance, primarily, from being information retrieval systems. The parallel with the search for information on the World Wide Web may provide intriguing lines of reasoning about the future of classification schemes. As concerns economics in particular, it seems important to add, borrowing from Cherrier’s (2017) history of the Journal of Economic Literature (JEL) classification system (developed by the American Economic Association, AEA), that scholars increasingly tend to retrieve information about the stock of knowledge and current developments of individual fields through keywords, the trend dating back to the launch of the economics research database EconLit. There are evident practical reasons for this, related to the need to save more time and resources than one could hope to save by relying on maps of science, however well designed.

To avoid falling victim to the “burden of knowledge” – “if one is to stand on the shoulders of giants, one must first climb up their backs, and the greater the body of knowledge, the harder this climb becomes” (Jones 2009, 284) – that is, the increasing difficulty that economists at early stages of their career face in trying to reach the research frontier and achieve new breakthroughs, scholars need ever more powerful – also in terms of precision – search engines and classification schemes. To tackle the challenge of the unprecedented growth of scientific knowledge, efficient ways of “simplifying” the knowledge space (Jones 2009, 310) are needed. Some recent developments in the field of information retrieval provide promising results. Text-mining techniques, and “topic modeling” in particular, have been proposed with the ambition to “find short descriptions of the members of a collection that enable efficient processing of large collections while preserving the essential statistical relationships that are useful for basic tasks such as classification, novelty detection, summarization, and similarity and relevance judgments” (Blei, Ng and Jordan 2003, 993).
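To fix ideas, the following minimal sketch – in Python with scikit-learn, on an invented four-document corpus, and in no way the pipeline used for the JEL corpus analysed in this paper – shows the two outputs such a model produces: topics as probability distributions over a vocabulary, and documents as mixtures of topics.

```python
# Minimal, illustrative LDA sketch (toy corpus and parameters are hypothetical).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "monetary policy interest rates and inflation targeting",
    "inflation expectations and central bank credibility",
    "labour market unemployment and wage bargaining",
    "wage rigidity unemployment and job search frictions",
]

# Bag-of-words representation: LDA sees only word counts, not word order.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)
vocab = vectorizer.get_feature_names_out()

# Two topics for a four-document toy corpus; real corpora need far more data and topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # one row per document: its mixture of topics

# Each topic is a probability distribution over the fixed vocabulary.
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

for k, dist in enumerate(topic_word):
    top_words = [vocab[i] for i in dist.argsort()[::-1][:5]]
    print(f"topic {k}: {', '.join(top_words)}")

for d, mix in enumerate(doc_topics):
    print(f"document {d}: topic proportions {mix.round(2)}")
```

On a real corpus one would also tune the number of topics, refine the preprocessing and validate the results qualitatively; the sketch only illustrates the kind of “short description” – a handful of topic proportions per document – that topic modeling provides.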
Topic modeling aims at discovering the hidden (latent) thematic structure in large archives of documents. One of the most widely employed generative statistical models, Latent Dirichlet Allocation (LDA[3]), detects probabilistic regularities, or trends in language use – recurring themes in the form of co-occurring words. It groups the words that compose the documents of the archive – on the assumption that words referring to similar subjects appear in similar contexts – into different probability distributions over the words of a fixed vocabulary. Being constellations, or sets of groups of words associated under one of the themes that run through the articles of the dataset, “topics” constitute the “latent” (that is, inferred from the data) structures. All documents, it is assumed, share the same set of topics, but each document exhibits these topics in different proportions, depending on the words that are present in it. It is to be noted that LDA generates topics and associates topics with documents at the same time: the purpose served by the model is to detect topics by “reverse-engineering” the original intentions (that is, to discuss one or more specific themes) of the authors of the documents included in the corpus under examination (Mohr and Bogdanov 2013)[4].

3 On topic modeling, see the special issues of Poetics, 41(6), 2013, and of the Journal of Digital Humanities, 2(1), 2012. On LDA in particular, see Blei 2012a and Blei, Ng and Jordan 2003.

Topic modeling and the “distant reading” perspective

The LDA process of topic detection, if considered as a possible tool for generating maps of scientific knowledge, arguably incorporates a number of remarkable features. First, the process is automated and unsupervised – human intervention is fundamental ex post but not ex ante. However debatable, the criterion provided by the co-occurrence of words in the full text (rather than the metadata only) of the documents in a corpus is objective, readily available and clearly recognizable. This contrasts with the problems that traditionally affect the classification of scientific knowledge (see Suominen and Toivanen 2015): the lack of agreed and internationally standardized measures, the path dependency implied in the historical derivation of classification schemes, and a scant ability to identify new research fields and programs. Second, it induces a fundamental change of perspective. Digitization means being able to “work on 200,000 novels instead of 200”: this allows us to do “the same thing 1,000 times bigger”, since “the new scale changes our relationship to our object, and in fact it changes the object itself” (Moretti 2017, 1). Topics detected through the “distant reading” (Moretti 2005, 2013) of LDA, which explicitly rejects the emphasis we usually place on the individuality or even uniqueness of texts, “make form visible within data” (ibid., 7), and reveal the latent, hidden, otherwise invisible structures of