The Visible Map and the Hidden Structure of Economics[1]

Angela Ambrosino*, Mario Cedrini*, John B. Davis**, Stefano Fiori*, Marco Guerzoni*,*** and Massimiliano Nuccio*

* Dipartimento di Economia e Statistica “Cognetti de Martiis”, Università di Torino, Italy ** Marquette University and University of Amsterdam *** ICRIOS, Bocconi University, Milan

Abstract The paper is a first, preliminary attempt to illustrate the potential of topic modeling as an information retrieval system that can help reduce problems of information overload in the sciences, and in economics in particular. Noting that some motives for the use of automated tools as information retrieval systems in economics have to do with the changing structure of the discipline itself, we argue that the standard classification system in economics, developed over a hundred years ago by the American Economic Association – the Journal of Economic Literature (JEL) codes – can easily assist in detecting the major faults of unsupervised techniques and possibly provide suggestions about how to correct them. With this aim in mind, we apply the topic-modeling technique known as Latent Dirichlet Allocation (LDA) to the corpus of (some 1,650) “exemplary” documents, one for each classification of the JEL codes, indicated by the American Economic Association in the “JEL codes guide” (https://www.aeaweb.org/jel/guide/jel.php). LDA serves to discover the hidden (latent) thematic structure in large archives of documents by detecting probabilistic regularities, that is, trends in language and recurring themes in the form of co-occurring words. The ambition is to propose and interpret measures of (dis)similarity between JEL codes and the LDA topics resulting from the analysis.

[1] This work is part of a more general research project by Despina, the Big Data Lab of the University of Turin, Dipartimento di Economia e Statistica “Cognetti de Martiis” (www.despina.unito.it). We gratefully acknowledge financial support from the European Society for the History of Economic Thought (ESHET grant 2015).

Science and the problem of information overload

One of the various possible reasons why (Google’s chief economist) Hal R. Varian could be tremendously right in predicting, in October 2008 (speaking with James Manyika, a director in McKinsey’s San Francisco office, in Napa, California[2]), that “the sexy job in the next ten years will be statisticians” – “the ability to take data – to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it – that’s going to be a hugely important skill in the next decades” – has to do with the growth of scientific knowledge. In the 1950s, the world community of economists produced around a thousand economics articles per year (indexed in Thomson Reuters’ Web of Science); when the Journal of Economic Literature was launched, some 5,000 major articles were published every year in about 250 economics journals. There are currently 347 impact-factor journals in economics, and economists publish some 20,000 works per annum (Claveau and Gingras 2016).

Scientific classification is already a very difficult task, and may become an impossible one. Still, the world of “ubiquitous data” Varian refers to is at once a driver and source of, and a provider of solutions to, various forms of information overload. Web search engines, which often mine data available in databases and directories, help find information stored on World Wide Web sites, reducing not only the time needed to retrieve information but also the amount of information one has to consult. The fundamental difference between such software systems and web directories is that the latter require human editors, whereas search engines can even operate in real time, by running algorithms on web crawlers, which allow collecting and assessing items at the time of the search query. At the same time, search is greatly facilitated by the design, through web mining techniques, of taxonomies of entities, which help match search query keywords with keywords from answers: new entities result from machine learning applied to current web search results, by detecting commonalities between the latter and existing entities.

In many ways, the problem of scientific classification, of defining research areas within disciplines, is one of information overload, given that classification systems derive their importance, primarily, from being information retrieval systems. The parallel with the search for information on the World Wide Web may provide intriguing lines of reasoning about the future of classification schemes. As concerns economics in particular, it seems important to add, borrowing from Cherrier’s (2017) history of the Journal of Economic Literature (JEL) classification system (developed by the American Economic Association, AEA), that scholars increasingly tend to retrieve information about the stock of knowledge and current developments of individual fields through keywords, a trend dating back to the launch of the economics research database EconLit. There are evident practical reasons for this, related to the need to save more time and resources than one could hope to save by relying on however well-designed maps of science. To avoid falling victim to the “burden of knowledge” – “if one is to stand on the shoulders of giants, one must first climb up their backs, and the greater the body of knowledge, the harder this climb becomes” (Jones 2009, 284) – that is, the increasing difficulty that economists at early stages of their career face in trying to reach the research frontier and achieve new breakthroughs, scholars need ever more powerful – also in terms of precision – search engines and classification schemes.

[2] “Hal Varian on how the Web challenges managers”, available at: https://www.mckinsey.com/industries/high-tech/our-insights/hal-varian-on-how-the-web-challenges-managers (accessed: 12 May 2018).

To tackle the challenge of the unprecedented growth of scientific knowledge, efficient ways of “simplifying” the knowledge space (Jones 2009, 310) are needed. Some recent developments in the field of information retrieval provide promising results. Text-mining techniques, and “topic modeling” in particular, have been proposed with the ambition to “find short descriptions of the members of a collection that enable efficient processing of large collections while preserving the essential statistical relationships that are useful for basic tasks such as classification, novelty detection, summarization, and similarity and relevance judgments” (Blei, Ng and Jordan 2003, 993). Topic modeling aims at discovering the hidden (latent) thematic structure in large archives of documents: one of the most widely employed generative statistical models, Latent Dirichlet Allocation (LDA[3]), detects probabilistic regularities, or trends in language texts – recurring themes in the form of co-occurring words. It groups the words that compose the archive’s documents – on the assumption that words referring to similar subjects appear in similar contexts – into different probability distributions over the words of a fixed vocabulary. Being constellations, or sets of groups of words associated under one of the themes that run through the articles of the dataset, “topics” constitute the “latent” (that is, inferred from the data) structures. All documents, it is assumed, share the same set of topics, but each document exhibits such topics in different proportions, depending on the words it contains. It is to be noted that LDA generates topics and associates topics with documents at the same time: the purpose served by the model is to detect topics by “reverse-engineering” the original intentions (that is, to discuss one or more specific themes) of the authors of the documents included in the corpus under examination (Mohr and Bogdanov 2013)[4].

[3] On topic modeling, see the special issues of Poetics, 41(6), 2013, and of the Journal of Digital Humanities, 2(1), 2012. On LDA in particular, see Blei 2012, Blei, Ng and Jordan 2003.
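To make these mechanics concrete, the following minimal sketch (our own illustration, not part of the analysis below; it assumes Python with the gensim library, a toy corpus, and an arbitrary choice of two topics) fits an LDA model and prints both the topics, as distributions over the vocabulary, and each document, as a mixture of the shared topics.

```python
# Minimal LDA sketch (assumptions: Python 3, gensim installed; toy data).
from gensim import corpora, models

# Toy corpus: each "document" is a tokenized abstract-like snippet.
docs = [
    ["wage", "labor", "employment", "worker", "union"],
    ["bank", "credit", "financial", "debt", "capital"],
    ["labor", "wage", "unemployment", "worker", "job"],
    ["financial", "bank", "stock", "investment", "capital"],
]

dictionary = corpora.Dictionary(docs)        # the fixed vocabulary
bow = [dictionary.doc2bow(d) for d in docs]  # bag-of-words counts

# Fit LDA with an a priori number of topics (here 2, chosen by us).
lda = models.LdaModel(bow, num_topics=2, id2word=dictionary,
                      random_state=0, passes=50)

for k in range(2):                 # topics as distributions over words
    print(lda.print_topic(k))
for d in bow:                      # documents as mixtures of topics
    print(lda.get_document_topics(d, minimum_probability=0.0))
```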

Topic modeling and the “distant reading” perspective

The LDA process of topic detection, if considered as a possible tool for generating maps of scientific knowledge, arguably incorporates a number of remarkable features.

First, the process is automated and unsupervised – human intervention is fundamental ex post but not ex ante. However debatable, the criterion provided by the co-occurrence of words in the full text (rather than in the metadata only) of the documents in a corpus is objective, readily available and clearly recognizable. This contrasts with the problems that traditionally affect the classification of scientific knowledge (see Suominen and Toivanen 2016): the lack of agreed and internationally standardized measures, the path dependency implied in the historical derivation of classification schemes, and the scant ability to identify new research fields and programs.

Second, it induces a fundamental change of perspective. Digitization means being able to “work on 200,000 novels instead of 200”: this allows us to do “the same thing 1,000 times bigger”, since “the new scale changes our relationship to our object, and in fact it changes the object itself” (Moretti 2017, 1). Topics detected through the “distant reading” (Moretti 2005, 2013) of LDA, which explicitly rejects the emphasis we usually place on the individuality or even uniqueness of texts, “make form visible within data” (ibid.: 7), and reveal the latent, hidden, otherwise invisible structures of scientists’ works.

Third, science maps built by topic modeling techniques rest on a logic of similarity and interconnection, as against the logic of unlikeness and boundaries implied in traditional classification systems. Moreover, the ambition of uncovering an author’s deep but hidden, because unconscious, intentions makes “distant” reading throw light on the social dimension of knowledge development, that is, on the “social” not within individual works (as is the case in hermeneutics) but as a trait shaping the whole field.

[4] Topic modeling algorithms discover the hidden structure that might be said to have generated the collection of observed documents, the utility of LDA stemming from the fact that “the inferred hidden structure resembles the thematic structure of the collection” (Blei 2012, 79), which thereby becomes manageable.

“Distant reading”, that is, “understanding literature not by studying particular texts, but by aggregating and analyzing massive amounts of data”, has been harshly criticised, for a variety of reasons – “What are we mortal beings supposed to do with all these books? Franco Moretti has a solution: don’t read them”, as the author of Being Wrong: Adventures in the Margin of Error, Kathryn Schulz (2011), succinctly puts it. Less controversial is its use for purposes of scientific classification (Glenisson et al. 2005). Text-mining techniques based on co-occurrence of words have the merit of “excluding significant human judgment in selecting terms to be included in the analysis and forcing the publication into a category based on a relatively low dimensional representation of the actual content” (Suominen and Toivanen 2016, 2466). At the end of their comparison between unsupervised and human-assigned classifications of science, Suominen and Toivanen conclude that, while it seems impossible to argue for the superiority of one method over the other, the main advantages of machine learning-based science classifications lie, first, in practicality (“they make it much easier to generate high-accuracy specialty-purpose maps, for example, to zoom into special areas of research or to detect emerging fields, and so forth, and they allow for a broader set of statistical analysis than historical classification schemes”, 2474). Second, human-assigned classifications “are best at monitoring the behavior of known and defined bodies of knowledge, but lend themselves poorly—if at all—to correctly identifying the emergence of truly new epistemic bodies of knowledge”, which is, on the contrary, one of the main tasks that unsupervised methods are called upon to accomplish.

The main problem with text-mining techniques as tools for drawing maps of science is the general bag-of-words approach, which detects topics without any knowledge of linguistic patterns and properties, whereas word order and contextuality, to mention just two, evidently matter a lot. Once again, the evolution of web search engines is highly illustrative in this regard (see Redondo 2014). The current, provisional result of this evolution was prompted by two main trends, one towards long-tail queries and the other towards precision, both suggesting to search engines the need to go beyond the use of keywords alone. The latter have in fact turned into genuine learning machines, exploiting the possibilities offered by semantic search – which improves accuracy by understanding both searcher intent and the contextual meaning of terms. Search engines can in fact be assisted in the attempt to display the most relevant results by a series of “search entities”, among which are search history data, like the search session, the time at which the query is submitted, and searcher location, but also documents responsive to the query, anchor text in a link in a document, as well as co-occurrence of words and the distance between them.

A good cartography of science, helping scholars find relevant scientific information (through an efficient information retrieval system), is fundamental for scientific dialogue. Can unsupervised text-mining techniques, which grant the possibility of drawing an infinite number of maps with different zooming operations, represent a significant (however preliminary) step in the evolution towards ever more efficient (and possibly learning) scientific information retrieval systems? We ourselves have recently illustrated the philosophy and concrete workings of topic modeling as a tool for mapping science and tracking changes in the structure of economics over time (Ambrosino et al. 2017). To assess the potential of topic modeling, however, it is also necessary to test the current limits of the technique.

The importance of JEL codes in a time of economics’ fragmentation

The possibility of evaluating the strengths and weaknesses of text-mining tools is provided by the “official”, standard classification system in economics, developed over a hundred years ago by the American Economic Association: the Journal of Economic Literature (JEL) codes. There is no need to insist on the merits of human classification schemes, which we here take for granted. We limit ourselves to noting that, while a qualified comparison with the JEL codes can easily assist us in detecting the major faults of unsupervised techniques and possibly provide suggestions about how to correct them, some motives for the use of automated tools as information retrieval systems in economics have to do with the changing structure of the discipline itself.

As Cherrier (2017, 546) notes:

The history of the JEL codes is … essentially a story of how economists have perceived their discipline. Though economists increasingly use keyword searches to locate material, the JEL codes remain important. They provide a map with which to navigate the discipline on the American Economic Association (AEA) website. They are used to publish and search job offers, assign grant applications and submitted papers to referees, and search for book reviewers. Bibliometric studies of the characteristics of economists’ publications, including size, age structure, coauthorship, subject matter, methodology, and citation patterns overwhelmingly rely on JEL codes to categorize papers, assuming a stability that did not exist in most categories … And it is because the classification matters that the AEA is currently contemplating yet another revision.

The JEL codes classification, Cherrier (547) continues, “does not provide a pure image of the discipline. Instead, it is a compromise between looking forward and looking backward, between AEA officials’ sometimes conflicting visions of their science and the multiple and contradictory demands they face”. The importance of the JEL codes, on one side, and the various compromises they embody, on the other, can explain why economists are increasingly devoting attention to the JEL classification (see for instance Kosnik 2018). The last revision of the JEL codes, in 1988-1990, aimed to provide “a map with which to navigate a growing and rapidly changing discipline” (Cherrier 2017, 547). The then Journal of Economic Literature editor, John Pencavel, had clear ideas about the necessity of the revision. In the 1991 symposium of the Economic Journal on ‘The Next Hundred Years’ of the discipline, Pencavel (1991) emphasized the continuing growth of economics in both size and diversity. “Economists”, he wrote, “will be an increasingly heterogeneous assortment of scholars. Indeed, it will become difficult to identify exactly what common elements bind us all” (p. 85). In a “bigger”, “more varied”, and “more competitive” economics, he observed, “a rough pyramidal hierarchy will persist, but there will be a much wider base with many minarets representing local confluences of authority” (p. 81). In a “fragmented world of specialization” (ibid.), practitioners would hardly have been aware of developments “in more than a few narrow fields of the subject” (ibid.). In the 1970s and 1980s, economists had “increasingly relied on the JEL codes as a map to navigate the profession intellectually and institutionally; to place their work within the discipline and to publish and screen job offers, scroll conference programs, apply for grants, and choose referees. Getting a code, and having it placed in the right category, became an increasingly important element in establishing intellectual and institutional space within the profession” (Cherrier 2017, 569-570).

At the same time, new JEL codes were required to counteract “economists’ frustration with the lack of space for new approaches” (p. 548). It is of considerable interest that, after a period of remarkable stability – that is, of incremental changes only (and no creation of new categories) – the AEA Executive Committee is currently discussing the possibility of a new revision. The 1988-90 revision was intended to reflect the (changed) structure of the discipline: as Cherrier observes, “the most likely reasons to undertake a new revision would be the fragmentation of the discipline due to the emergence of new methods, and the disappearance of the core, on which the stability of the past twenty years was based” (p. 577).

Now, despite the good reasons to caution against the attempt to use unsupervised techniques in tandem with human-assigned classification systems, the former can count not on one – the abovementioned information overload problem – but on two decisive factors in their favor. Economists – historians of economic thought and economic methodologists in primis, but also heterodox economists and so-called “mainstream dissenters” – are increasingly debating the current structure of the economics discipline, starting from the precise nature of its “unity”. Even in the recent past, contributions on the general “state” of economics in the literature were often prompted by the need to counteract (external) critiques of economics imperialism, i.e. the pugilistic attitude of economics towards the social sciences, which tended to crowd out alternative approaches in territories traditionally occupied by other disciplines. Or they were responses to the challenge led by one or another of the various heterodox schools of thought of so-called “first-wave pluralism” – each of them conceiving itself as the paradigmatic, stand-alone alternative to the neoclassical core of the discipline (Garnett, Olsen and Starr 2009). In contrast to this, on one side, the unity of economics is now externally perceived as “flexible” (Reay 2012, Fourcade 2018), that is, resting more on technique and epistemology than on core beliefs. Above all, leading figures of the new research programmes (evolutionary economics, behavioural economics, cognitive economics, experimental economics, neuroeconomics, complexity economics, and so on) that, often having their origins outside economics, populate the current “mainstream pluralism”, are mainly discussing, explicitly or indirectly, the unity of today’s economics, as Davis (2006: 17) remarks, by “making use of methodology and history arguments to justify their departures from neoclassicism, and differentiate their approaches from each other”. The very fact that methodology and history of economics are regaining importance, thereby reversing the trend of steady decline of the previous decades, seemingly provides evidence of the fragmented nature of mainstream economics, as well as, equally importantly, of an increasing awareness of this fragmentation among mainstream economists themselves (as anticipated by Colander, Holt and Rosser 2004).

Views on the extent and precise nature of this fragmentation diverge significantly. Mainstream “pluralism”, Dow (2008: 73) argues, would be a misnomer for a “plurality” of (mainstream) theories, which however does not imply pluralism “in terms of methodological approach, and in attitude to methodological alternatives”. Scholars who, in contrast to Dow, emphasize the radical novelty of the acceptance of a plurality of approaches into mainstream economics, with explicit recognition by mainstream economists themselves of the changing structure of the discipline, have advanced alternative possible scenarios for mainstream pluralism. While a cycle theory of scientific development (with a succession of periods of monism and periods of pluralism) may induce one to believe that a new, post-neoclassical mainstream will reestablish monism, “mainstream pluralism” could even persist over time under the impact of ever-growing specialization, as a way of reducing the gap between scholars’ competencies and the difficulty of reaching the frontier of economic research (Cedrini and Fontana 2017).

Unsupervised text-mining techniques can potentially offer valuable assistance in this regard. Unlike traditional classification schemes, they can serve the purpose of mapping a fragmented science. Science maps “are generated through a scientific analysis of large-scale scholarly datasets in an effort to extract, connect, and make sense of the bits and pieces of knowledge they contain. [They can help to] identify major research areas, experts, institutions, collections, grants, papers, journals, and ideas in a domain of interest. They can show homogeneity vs. heterogeneity … and relative speed of progress. They allow us to track the emergence, evolution, and disappearance of topics and help to identify the most promising areas of research” (Börner et al. 2012). The (above-illustrated) peculiar nature of topic modeling might prove decisive for the analysis of the changing structure of a discipline, economics, wherein the proliferation of potentially incompatible research programmes (with their “local” rather than “global” knowledge) arguably reduces the “core” to the neoclassical orthodoxy and enlarges the mainstream (Cedrini and Fontana 2017). A new map is needed, in other words, to reflect the new “immaturity” (in Kuhn’s terms) of economics – which has become more applied since the Seventies (Backhouse and Cherrier 2014) and moved from a theory-based (key concepts) to a tool-based (admissible empirical practices) discipline (Panhans and Singleton 2017).

The dataset: The JEL Codes “Guide”

The present study is primarily intended to explore the limits, without assigning any negative meaning to the word, of unsupervised text-mining techniques as tools for drawing science maps. Practically speaking, it illustrates the very preliminary results of a stress test carried out on the JEL codes by applying LDA to the corpus of 1656 “exemplary” documents, one for each classification, indicated by the AEA in the “JEL codes guide” (https://www.aeaweb.org/jel/guide/jel.php). The obvious biases and intrinsic limitations that affect any possible alternative database of economics texts, from JSTOR to ISI-WoS (as well as considerations of practicality), allow, but do not by themselves justify, the use of this peculiar database; its use rather stems from the ambition to exploit the philosophy of topic modeling to the fullest. The guide, in fact, “provides JEL Code application guidelines, keywords, and examples of items within each classification” (ibid.). Now, “examples of items within each classification” does not mean that “exemplary” articles can be taken as “representative” of the classifications themselves, and it would be a stretch to consider them as if they were so, without qualifications. At the same time, there is actually no plausible reason to reject the hypothesis that each “example” has been selected as “typical” of its classification and, in any case, examples – all the more so when included in a “guide”, which induces one to take them as references – are by definition illustrative. Although limited, “examples” are to a certain extent representative documents, which show the character of the whole.

Therefore, the guide’s “examples”, one for each classification of the JEL codes, are included in the list in their individuality and because of their individuality, which induces one to identify them, to a certain extent, as highly representative of a classification. But they are included also because they express the traits of a group, or the commonalities (regularities) between articles that contain that particular JEL code. Now, the aim that justifies the adoption of the “distant reading” perspective embodied in topic modeling is to identify patterns, regularities that shape (literary or scientific) fields, and that the individuality of individual, representative or exemplary texts would otherwise obscure or make invisible. Even in the written production of the economics discipline, “individual texts in their individuality” are not all that matters (the expression is borrowed from Moretti 2017).

The dataset includes 1656 documents: 1588 journal articles, 21 books, 28 working papers, 7 collective volume articles, 1 book review, and 1 dissertation. Heterogeneity in the number of sub-categories for each JEL code is reflected in the number of exemplary documents per JEL code.

[Figure: Documents by type. Journal articles account for 96% of the dataset, working papers for 2%, books for 1%, and collective volume articles, book reviews and dissertations for 1% or less.]

Remarkably, 52% of the journal articles were published in 2005. The Guide does not provide any explanation of the criteria followed in identifying exemplary papers, or of the publication dates, although both the timing of the latest revision of the JEL codes and technological factors can easily explain why the majority of exemplary articles were published in the last fifteen years. The continuous creation of new sub-categories also matters. Other types of documents cover a shorter time window (1994-2014, while journal articles cover the period 1985-2015) but exhibit a similar distribution over time: the majority were published between 2000 and 2014.

[Figure: Journal articles per year, 1985-2015. The distribution peaks sharply in 2005.]

[Figure: Books per year, 1994-2014, with the largest shares in 2004 (19%), 2003 (13%) and 2014 (13%).]

[Figure: Collective volume articles per year, 1992-2011, with the largest share in 2005 (20%).]

Consider now the distribution of exemplary articles per JEL code (the list of JEL codes is reported below).

A General Economics and Teaching
B History of Economic Thought, Methodology, and Heterodox Approaches
C Mathematical and Quantitative Methods
D Microeconomics
E Macroeconomics and Monetary Economics
F International Economics
G Financial Economics
H Public Economics
I Health, Education, and Welfare
J Labor and Demographic Economics
K Law and Economics
L Industrial Organization
M Business Administration and Business Economics • Marketing • Accounting • Personnel Economics
N Economic History
O Economic Development, Innovation, Technological Change, and Growth
P Economic Systems
Q Agricultural and Natural Resource Economics • Environmental and Ecological Economics
R Urban, Rural, Regional, Real Estate, and Transportation Economics
Y Miscellaneous Categories
Z Other Special Topics

[Figure: Exemplary documents per JEL code, ranging from 0 to about 150; C, D, L and N show the highest counts.]

JEL codes C (Mathematical and Quantitative Methods), D (Microeconomics), L (Industrial Organization) and N (Economic History) exhibit the highest numbers of exemplary documents. Each of these codes has 9 sub-codes, with a high number of further subfields.

Let us now concentrate (for a variety of reasons, the most important being sample size and intrinsic homogeneity) on journal articles (96% of the dataset). The Guide includes papers published in a considerable number of academic journals (498), which gives a measure of the growing size and variety of the discipline (with the proliferation of specialized journals variously contributing to its development). The distribution of articles per journal is evidently heterogeneous.

[Figure: Articles per journal: 80% of journals contribute fewer than five papers each, 20% contribute five or more.]

Most journals (80%) included in the list contribute between one and four articles (201 with only 1 paper, 99 with 2 papers, 99 with 3 or 4 papers). Only 20% of journals account for five or more papers: 75 journals contribute between 5 and 9 papers, and 25 contribute 10 or more.

[Figure: Journals per number of articles: 201 journals with 1 article, 99 with 2, 99 with 3 or 4, 75 with between 5 and 9, and 25 with 10 or more.]

The list of journals with 10 or more papers in the Guide includes “top” generalist journals like the American Economic Review as well as more specialized journals like Econometric Theory.

[Figure: Journals with 10 or more articles in the Guide, by number of papers (up to about 35).]

Remarkably, only two of the “Blue Ribbon Eight” economics journals (the American Economic Review and the International Economic Review) appear in the list of journals with the highest number of articles in the Guide.

"Blue Ribbon Journals" articles in the Guide 35 30 25 20 15 10 5 0

The list of 498 journals publishing at least one of the articles in the Guide includes both generalist and specialized journals. The graphs below illustrate the distribution across JEL codes of the “exemplary” articles published in some of the journals with 10 or more articles in the Guide.

Consider the case of Econometric Theory. This highly specialized journal presents articles that are “exemplary” exclusively for JEL code C, “Mathematical and Quantitative Methods”. Articles published in other specialist journals, like the Journal of Development Studies and the Journal of Economic Development, on the contrary cover various JEL codes.

[Figure: Econometric Theory, articles per JEL code: 100% C (Mathematical and Quantitative Methods).]

[Figure: Journal of Development Studies, articles per JEL code: O 46%, I 27%, G 18%, P 9%.]

[Figure: Journal of Economic Development, articles per JEL code: O 23%, E 17%, D 12%, I 12%, J 12%, G 6%, H 6%, N 6%, P 6%.]

Now consider the case of two journals specialized in economic history, Economic History and the Journal of Economic History. Both journals describe themselves, in their “Aims and scope”, as committed to openness and the promotion of interdisciplinary approaches. Still, Economic History appears here as associated exclusively with JEL code N (“Economic History”), while articles in the Guide published in the Journal of Economic History, as is the case for the more specialized Business History, cover a variety of JEL codes.

[Figure: Journal of Economic History, articles per JEL code: N 88%, D 6%, J 6%. Business History, articles per JEL code: N 75%, I 13%, L 6%, R 6%.]

As expected, generalist journals like the American Economic Review and Economics Letters, on the contrary, cover various JEL codes with their articles in the Guide.

[Figure: American Economic Review, articles per JEL code: articles spread across A, C, D, E, F, G, H, I, K, L and N, with C, D and L the largest shares (17% each).]

[Figure: Economics Letters, articles per JEL code: D 17%, J 17%, C 15%, L 9%, E 9%, F 6%, H 6%, I 6%, O 6%, G 3%, P 3%, Q 3%.]

Finally, it is to be noted that a vast portion of “exemplary” documents, despite being “examples” of a specific JEL code, are in truth defined by a plurality of JEL codes. Take the documents that appear in the Guide as exemplary of JEL code K: only 18 of the 47 documents are defined exclusively by JEL code K. Thirteen JEL codes, K included (D, F, G, H, J, K, L, M, N, O, P, R, Z), concur to define the corpus. Twenty-nine exemplary documents of Law and Economics are also associated with other JEL codes, predominantly Labor and Demographic Economics (J), Industrial Organization (L) and Microeconomics (D).

Law and Economics: exemplary articles, associated JEL codes

Article   Associated JEL codes
1         K00 L84
2         D72 K10
3         K11 K33
4         K11 O34
5         K12 K42 N44 P31 P37
6         D82 D86 K12
7         K13
8         D42 D82 K13 L12 O32
9         K14 K42
10        K14 K41
11        K1
12        K1
13        K19
14        K19
15        K20 K40 O17
16        K21 G34 L41 K42
17        L86 L41 M14 K21 L21
18        K22 G18 G14 K41 L63 L81
19        K42 K13 K22 K41
20        K22 G28 K41
21        L96 K23 L43
22        K2
23        K2
24        K29
25        K29
26        K30 K31 K32
27        J31 K31 J58 R23
28        J28 K31 K32 L51
29        F13 K33 K10 F14 P33
30        D74 K12 K20 K33
31        H26 H21 K34
32        H26 H21 K34
33        D14 K35
34        J12 K36
35        D13 D14 D91 J12 J16 K36
36        J11 J61 J71 K31
37        F22 J51 J61
38        J15 J18 K39 K42 Z13
39        J11 J12 K39
40        K3
41        K39 R21 R31 R38
42        G12 K40 N23 N43
43        D72 K40
44        K41 K42
45        K41 D82
46        K42
47        K49

[Figure: JEL codes appearing in the exemplary documents for JEL code K (D, F, G, H, J, K, L, M, N, O, P, R, Z), with K by far the most frequent.]

A topic modeling analysis of the JEL codes Guide

The only human intervention involved in the exercise is the a priori definition of the number of topics to be detected, which LDA requires. Accuracy increases with the number of topics: a more complex and detailed model produces a better fit of the data and reduces the biases at work. Still, there is no standardized procedure to derive such a number (see Rhody 2012), nor any test supporting a precise choice of the parameters, especially when topic modeling is employed to explore the content of a dataset rather than for prediction (Mimno and Blei 2011). We thus followed a research heuristic that combines a sensitivity analysis with the educated opinion of the authors about the meaningfulness of the choices themselves – in line with our previous topic-modeling analysis of the corpus of economics articles stored in the JSTOR database (Ambrosino et al. 2017), which will quite naturally be used as a possible benchmark.
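Such a sensitivity analysis can be given a concrete (if simplified) shape: the coherence-based loop below is our own illustrative assumption, not the exact procedure followed here, and it reuses the toy objects (docs, dictionary, bow) from the earlier sketch.

```python
# Illustrative sensitivity analysis over the number of topics.
from gensim.models import CoherenceModel, LdaModel

for k in (5, 10, 20, 27, 40):
    lda_k = LdaModel(bow, num_topics=k, id2word=dictionary,
                     random_state=0, passes=20)
    # c_v coherence: higher values suggest more interpretable topics.
    score = CoherenceModel(model=lda_k, texts=docs, dictionary=dictionary,
                           coherence="c_v").get_coherence()
    print(f"{k} topics -> coherence {score:.3f}")
```

The resulting scores are then weighed against the authors’ judgment about the interpretability of the wordlists, rather than taken as a mechanical decision rule.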

Two important caveats accompany the very preliminary results shown in the tables below. The first is that the database selected for the LDA analysis required us to download each (recent) article individually from journals’ websites and academic repositories (and older documents from the JSTOR digital library), in the form of a pdf file. In the vast majority of cases, these files contain data about the download process (source and URL, retrieval date, Internet protocol address, information about the digital library, and so on). Optical character recognition (OCR) errors (as Topic 2 makes evident) and the (however limited) presence of non-English articles (see Topic 18) add to the difficulty of obtaining meaningful results for the complete list of topics. The general ambition of stressing the limits of the technique here employed requires us to abstain from aggressive preprocessing (see Yau et al. 2013, Suominen and Toivanen 2016), and even to accept, at this first stage of the analysis at least, the appearance of non-useful words within the topics’ wordlists.
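For later stages of the analysis, a cleaning step of the kind deliberately avoided here would look roughly as follows. The sketch is our own assumption: the regular expressions and the stop-word list are illustrative, not the pipeline actually used.

```python
# Illustrative (and deliberately mild) preprocessing sketch.
import re

STOPWORDS = {"the", "and", "for", "journal", "downloaded", "jstor"}

def clean(text: str) -> list[str]:
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # download URLs left in the pdf
    text = re.sub(r"[^a-z\s]", " ", text)      # digits, punctuation, OCR debris
    return [t for t in text.split() if len(t) > 2 and t not in STOPWORDS]

print(clean("Downloaded from http://www.jstor.org ... wage bargaining and unions"))
# -> ['from', 'wage', 'bargaining', 'unions']
```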

The second caveat is that we encountered technical problems in retrieving the Guide articles for JEL code Q, “Agricultural and Natural Resource Economics • Environmental and Ecological Economics”. These articles are therefore, provisionally, excluded from the analysis.

Much evidently remains to be done, and will be done, as concerns stemming and stop-word filtering, as we show below. Nor have we yet employed the web-based interactive visualization of topics developed by Sievert and Shirley (2014), LDAvis, which provides simply indispensable assistance in labelling and interpreting topics (for an illustration of its analytical power, see Ambrosino et al. 2017). Still, despite such limitations, the results are promising.

Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8 Topic 9 workers model financial information les land law health data labor rst bank journal des württemberg legal income table employment choice banks behavior china regional see education results wage signi capital consumer chinese wildberg states age nthe labour speci firms consumers silver century regulation children variables work rms debt experimental brazil islamic enforcement family also job cient credit crossref jit region act household one wages cant ownership game brazilian government state child level women ect companies decision chocolate ebhausen laws care number union ects banking subjects production political rights school effects worker models finance risk ford common regulatory life may unemployment time corporate media accounting economic government wkh two employees cation firm experiment remittances population property households sample bene assets experiments une agricultural crime social time hours travel market afd dans central court population average jobs utility state economic sucre nthe costs individuals effect self xed funds price plants farms agency inequality journal working nancial investment theory tea agriculture legislation mortality variable training cients stock afa taxpayers auingen policy retirement using unions erent company group bord see public welfare year

24

Topic 10 Topic 11 Topic 12 Topic 13 Topic 14 Topic 15 Topic 16 Topic 17 Topic 18 tourism political local countries org history economics new der que war urban growth use century economic nthe die las power public economic downloaded new theory one und los state city world jstor industry torino can von del states housing country content per ndownloaded also vgl para international state development terms production universita social des por conflict cities capital subject economic studi may auf tourist party land per http prices economy will den una democracy development international utc early economists economic das tourists economic regional gdp jan see january many diversity con soviet government government afa price history nof ist como german community economy seller industrial degli system sich tsa democratic areas global goods nthe smith however als core united services sector buyer trade science market für vol american population income contract cent academic even werden política government property foreign price american journal policy dass destination politics service investment party british social nand eine sobre labour per china aircraft nineteenth https well auch fiscal world county bank loan years vol example nicht www military counties policy www also work university wird

Topic 19 Topic 20 Topic 21 Topic 22 Topic 23 Topic 24 Topic 25 Topic 26 Topic 27 equilibrium firms trade model matrix model rate sports tax let firm environmental can spatial models monetary team market set business countries will auction can inflation players risk function industry domestic capital commodity distribution money league returns can innovation international price transport parameters exchange teams return theorem market export nthe regional estimation policy sport income one technology goods equilibrium regions test real salary stock nthe product pollution cost network data rates art wealth lemma management products costs bidders variables interest ort volatility proof products wto economics algorithm economic professional rate game knowledge tariff case models estimator growth football taxes given service exports economic bid approach price player finance case research imports rate input using prices debt financial two production import given transportation conditional economy clubs portfolio agents companies foreign two auctions parameter shocks performance price theory new standards investment matrices case output hacienda investors player cost agreements market model journal bank donations journal will system world demand problem nthe market motivation asset see industries policy value can structural period religious value follows performance country journal bids tests gdp baseball equity

As mentioned in the sections above, LDA assumes articles to share the same set of topics (that is, it considers and describes them as distributions over topics), which however enter each document of the corpus in different proportions (relevance); LDA estimates the probabilistic weight of each topic in each article. Consider the (randomly selected) article “Public enforcement/private monitoring: Evaluating a new approach to regulating the minimum wage”, by David Weil, published in 2005 in the ILR Review, Volume 58, issue 2, pages 238-257. The following table illustrates the complete topic distribution for the article.

25

Topic    1    2    3    4    5    6    7    8    9    10   11   12   13   14
Weight   0.14 0.00 0.00 0.01 0.00 0.00 0.24 0.00 0.41 0.00 0.00 0.00 0.00 0.00
Topic    15   16   17   18   19   20   21   22   23   24   25   26   27
Weight   0.03 0.00 0.06 0.00 0.00 0.06 0.01 0.04 0.00 0.00 0.00 0.00 0.00

The topic distribution suggests that the article mainly results from a combination of Topic 9 (41%), Topic 7 (24%) and Topic 1 (14%); despite their positive weights, the other topics (17, 20, 22, 15, 21, 4) are of reduced importance in defining the document. While Topic 9 appears as an unstructured list of common words in economics (and may disappear after future preprocessing), Topic 7 can be safely identified as one of “Regulation”, and Topic 1 as one of “Labour”. The article appears in the JEL codes guide as an “example” of the subcategory J380, “Wages, Compensation, and Labor Costs: Public Policy”.
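Retrieving such a distribution is mechanical once a model has been fitted. A sketch, continuing the illustrative gensim code above (the file name is hypothetical):

```python
# Sketch: topic distribution of a single (hypothetical) article.
with open("weil_2005.txt") as f:     # hypothetical plain-text version
    tokens = clean(f.read())         # reusing the cleaning sketch above
bow_doc = dictionary.doc2bow(tokens)

# Full distribution: the weight of every topic, including near-zero ones.
for topic_id, weight in lda.get_document_topics(bow_doc, minimum_probability=0.0):
    print(f"Topic {topic_id + 1}: {weight:.2f}")
```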

Let us now consider the mean value of the weight of each of the 27 topics in each of the articles that the Guide considers as “examples” of JEL code J, “Labor and Demographic Economics”. The table below illustrates the (average) topic distribution for JEL code J. As expected, Topic 1 (if we exclude Topic 9, for the reasons just outlined) is by far the most “prevalent”.

Topic    1    2    3    4    5    6    7    8    9    10   11   12   13   14
Weight   0.19 0.01 0.00 0.01 0.00 0.00 0.03 0.07 0.21 0.00 0.02 0.02 0.03 0.01
Topic    15   16   17   18   19   20   21   22   23   24   25   26   27
Weight   0.02 0.02 0.15 0.00 0.02 0.01 0.01 0.13 0.00 0.01 0.01 0.01 0.01
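Computing these averages amounts to stacking the per-document distributions of a code’s exemplary articles and taking column means. A sketch, assuming a hypothetical mapping docs_by_code from JEL letters to lists of bag-of-words documents:

```python
# Sketch: average topic distribution per JEL code.
import numpy as np

NUM_TOPICS = 27

def topic_vector(bow_doc):
    """Dense vector of a document's topic weights."""
    v = np.zeros(NUM_TOPICS)
    for topic_id, weight in lda.get_document_topics(bow_doc, minimum_probability=0.0):
        v[topic_id] = weight
    return v

# docs_by_code: hypothetical dict, e.g. {"J": [bow_1, bow_2, ...], ...}
mean_dist = {code: np.mean([topic_vector(d) for d in code_docs], axis=0)
             for code, code_docs in docs_by_code.items()}
print(mean_dist["J"].round(2))  # e.g. 0.19 for Topic 1, 0.15 for Topic 17, ...
```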

(Very preliminary) results and further (future) developments

We are thus able to propose a first, however rough, attempt at evaluating the degree of similarity between the human-assigned JEL codes classification and the topic modeling analysis of the corpus of JEL codes “examples”. The comparison rests on assigning to each JEL code the most representative (prevalent, on average) topics over the whole of the articles that the Guide considers as “examples” of that specific JEL code. Grey shades in the graph below help visualize the degree of correspondence between JEL codes and LDA-detected topics.

[Figure: Correspondence between JEL codes and the most prevalent LDA-detected topics; darker shades indicate stronger correspondence.]

Topics 16, 17 and 22 likely share the same features already outlined for Topic 9, and can at this stage be set aside until we preprocess the database – which means that JEL codes A and B, likely the most affected by these technical problems, will be the object of future analysis.

JEL code C, “Mathematical and Quantitative Methods”, is associated with Topics 24 and 19, the former being a topic of Econometrics, the latter of Mathematics.

JEL code D, “Microeconomics”, is associated with Topics 4 and 19, that is with a topic of Game Theory and one of Mathematics.

JEL code E, “Macroeconomics and Monetary Economics”, is associated with a topic (25) that one would label exactly as the JEL code itself.

JEL code F, “International Economics”, is associated with Topics 13, 21, 25, that is with a topic of International Economics, one of International Trade, and one of Monetary Economics.

JEL code G, “Financial Economics”, is associated with Topics 3 and 27, that is with a topic of Financial Economics and one of Investment.

JEL code H, “Public Economics”, is associated with Topic 12, which appears as one of Regional and Urban Economics.

JEL code I, “Health, Education, and Welfare”, is mainly associated with Topic 8, which one would label exactly in the same way.

JEL code J, “Labor and Demographic Economics”, is strongly associated with Topic 1, clearly a topic of labour.

JEL code K, “Law and Economics”, is strongly associated with Topic 7, a topic of regulation.

JEL code L, “Industrial Organization”, is associated with Topic 20, which one would label as such.

Itself a composite code, JEL code M, “Business Administration and Business Economics • Marketing • Accounting • Personnel Economics”, is associated with Topic 1 (Labour) and Topic 20 (Industrial Organization).

JEL code N, “Economic History”, is strongly associated with Topic 15, which scarcely needs explanation.

JEL code O, “Economic Development, Innovation, Technological Change, and Growth”, is associated with Topic 13, international economics.

The same goes for JEL code P, “Economic Systems” (Topic 13, international economics).

JEL code Z, “Other Special Topics”, is associated with Topic 26, which is one of “Sports Economics” – the subcode Z2.
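The assignments listed above, and the measures of (dis)similarity announced at the outset, can be derived mechanically from the mean distributions computed earlier. A sketch under the same assumptions; cosine similarity is our illustrative choice of measure, not one the analysis commits to:

```python
# Sketch: top topics per JEL code and a simple (dis)similarity measure.
from numpy.linalg import norm

def top_topics(code, n=3):
    """1-based indices of the n most prevalent topics for a JEL code."""
    return [int(i) + 1 for i in mean_dist[code].argsort()[::-1][:n]]

def cosine(a, b):
    return float(a @ b / (norm(a) * norm(b)))

print(top_topics("J"))                         # e.g. [9, 1, 17]
print(cosine(mean_dist["J"], mean_dist["K"]))  # closeness of two JEL codes
```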

Leaving aside the (abovementioned) technicalities, which have to be dealt with and will be dealt with as soon as possible, the research project is evidently at a very preliminary stage. The Guide offers not one but a series of possible further analyses enabling the construction of a more complete framework with which to verify the potential of topic modeling as a possible antidote to information overload problems in science. First, the results of the test outlined here can easily be compared with the conclusions previously arrived at when applying the LDA algorithm to the whole corpus of economics articles included in the JSTOR digital library. Second, the JEL codes and sub-codes in the Guide offer a list of keywords that can readily be used for comparison with the (LDA-detected) topics’ wordlists, as sketched below. Finally, the possibility of analysing documents appearing in the Guide at the same time as blends of JEL codes and blends of topics provides a potentially powerful preliminary instrument for exploring how the complexity of human written artefacts may impact on the attempt to impose too-rigid classifications, even when human-assigned.
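The second of these analyses could, for instance, score the overlap between a code’s official keywords and each topic’s wordlist. A sketch; the keyword set shown is hypothetical, not quoted from the Guide:

```python
# Sketch: overlap between JEL keywords and an LDA topic's wordlist.
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

# Hypothetical keywords for a J sub-code; the Guide provides the real ones.
jel_keywords = {"wages", "employment", "unions", "labor", "demography"}
topic_words = {w for w, _ in lda.show_topic(0, topn=20)}  # Topic 1's top words
print(jaccard(jel_keywords, topic_words))
```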

References

Ambrosino A., Cedrini M., Davis J., Fiori S., Guerzoni M. & Nuccio M. (2017), “What Topic Modeling Could Reveal about the Evolution of Economics”, paper presented at the 21st ESHET Conference, Antwerp, 18-20 May.

Backhouse R.E. and Cherrier B. (2014), “Becoming Applied: The Transformation of Economics after 1970”, Center for the History of Political Economy Working Paper 2014-15.

Blei D. M. (2012), “Probabilistic Topic Modeling”, Communications of the ACM, 55(4): 77-84.

Blei D. M. and Lafferty J. D. (2006), “Dynamic Topic Models”, Proceedings of the 23rd international Conference on Machine Learning. ICML ‘06, ACM, 113-120.

Blei D.M., Ng A.Y and Jordan M.I. (2003), “Latent Dirichlet Allocation”, Journal of Machine Learning Research, 3 (Jan): 993-1022.

Börner K., Klavans R., Patek M., Zoss A.M., Biberstine J.R., Light R.P., et al. (2012) “Design and Update of a Classification System: The UCSD Map of Science”, PLoS ONE, 7(7): e39464.

Cedrini M. and Fontana M. (2017), “Just Another Niche in the Wall? How Specialization Is Changing the Face of Mainstream Economics”, Cambridge Journal of Economics, 42(2): 427-51.

Cherrier B. (2017). “Classifying Economics: A History of the JEL Codes”, Journal of Economic Literature, 55(2): 545-79.

Claveau F. and Gingras Y. (2016), “Macrodynamics of Economics: A Bibliometric History”, History of Political Economy, 48(4): 551-592.

Colander D., Holt R., and Rosser Jr. J.B. (2004), “The Changing Face of Mainstream Economics”, Review of Political Economy, 16(4): 485–499.

Davis J.B. (2006), “The Turn in Economics: Neoclassical Dominance to Mainstream Pluralism?”, Journal of Institutional Economics, 2(1): 1-20.

Di Caro L., Guerzoni M., Nuccio M., and Siragusa G. (2017), “A Bimodal Network Approach to Model Topic Dynamics”, mimeo (available at: https://arxiv.org/abs/1709.09373).

Dow S.C. (2008), “Plurality in Orthodox and Heterodox Economics”, The Journal of Philosophical Economics, 1(2): 73-96.

Fourcade M. (2018). “Economics: The View from Below”, Swiss Journal of Economics and Statistics, 154(1).

Garnett Jr R. F., Olsen E., & Starr M. (Eds.). (2009), Economic Pluralism. London and New York: Routledge.

Glenisson P., Glänzel W., & Persson O. (2005), “Combining Full‐text Analysis and Bibliometric Indicators. A Pilot Study”, Scientometrics, 63(1): 163-80.

Jones B. F. (2009). “The Burden of Knowledge and the ‘Death of the Renaissance Man’: Is Innovation Getting Harder?”, Review of Economic Studies, 76(1): 283–317.

Kosnik L.-R. (2018), “A Survey of JEL Codes: What Do They Mean and are They Used Consistently?”, Journal of Economic Surveys, 32(1): 249-72.

Mimno D. and Blei D. (2011). “Bayesian Checking for Topic Models”. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Stroudsburg, PA: Association for Computational Linguistics, 227-237.

Mohr J.W. and Bogdanov P. (2013), “Introduction—Topic Models: What They Are and Why They Matter”, Poetics, 41(6): 545-569.

Moretti F. (2017), “Patterns and Interpretation”. Speech given at the “Distant Reading and Data-Driven Research in the History of Philosophy” Conference, Università di Torino, 16-18 January.

Moretti F. (2013), Distant Reading, London and New York: Verso.

Moretti F. (2005), Graphs, Maps, Trees: Abstract Models for a Literary History, London and New York: Verso.

Panhans M.T. and Singleton J.D. (2017), “The Empirical Economist’s Toolkit: From Models to Methods”, History of Political Economy, 49(S): 127-57.

Pencavel J. (1991), “Prospects for Economics”, Economic Journal, 101(404): 81-87.

Reay M.J. (2012), “The Flexible Unity of Economics”, American Journal of Sociology, 118(1): 45-87.

Redondo S. (2014), “SEO 101: What is Semantic Search and Why Should I Care?”, SearchEngineJournal, 22 November, available at: https://www.searchenginejournal.com/seo-101-semantic-search- care/119760/ (accessed: 12 May 2018).

Rhody L.M. (2012), “Topic Modeling and Figurative Language”, Journal of Digital Humanities, 2(1).

Sievert C. and Shirley K. E. (2014), “LDAvis: A Method for Visualizing and Interpreting Topics”, Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces, Baltimore, Maryland: Association for Computational Linguistics, 63-70.

Schulz K. (2011), “What Is Distant Reading?”, The New York Times, 24 June, available at: https://www.nytimes.com/2011/06/26/books/review/the-mechanic-muse-what-is-distant- reading.html (accessed: 12 May 2018).

Suominen A. and Toivanen H. (2016), “Map of Science with Topic Modeling: Comparison of Unsupervised Learning and Human‐assigned Subject Classification”, Journal of the Association for Information Science and Technology, 67(10): 2464-76.

Yau C.‐K., Porter A., Newman N., & Suominen A. (2013), “Clustering Scientific Documents with Topic Modeling”, Scientometrics, 100(3): 767-86.
