
Comparing LDA as classification method with JEL classification codes

Jelle Jacobus Schagen, [email protected], 10035877
June 27, 2015

Bachelor Thesis Econometrics
Supervisors: dr. Kees Jan van Garderen and dr. Marco van der Leij
Roeterseilandcomplex, Faculteit Economie en Bedrijfskunde, Universiteit van Amsterdam

Abstract

In this paper the LDA model of Blei et al. (2003) is applied to a dataset of economic articles to create topics. These topics form the basis for predicting which JEL categories are attached to the economic articles. The predicted topics and JEL categories show significant similarities, but only part of the dataset was predicted correctly. This occurred because of an incomplete match between the LDA topics and the JEL categories.

Contents

1. Introduction
2. Theory behind LDA and JEL codes
   2.1. LDA model
        2.1.1. The theory
        2.1.2. The script
   2.2. JEL codes
   2.3. Theory discussion
3. Research Methods
   3.1. Dataset
   3.2. Creating topics
   3.3. Pearson Chi square test
4. Results
   4.1. Topics created by LDA
        4.1.1. First level JEL codes
        4.1.2. Second level JEL codes
   4.2. Outcomes predictions
   4.3. Hypothesis testing
        4.3.1. JEL code correction
        4.3.2. Pearson Chi square test applied
5. Conclusion
References
A. JEL codes

1. Introduction

Whenever a thesis, like this one, or another article is published, the number of articles on the web increases. This never-ending growth over time makes it harder to find a relevant article when searching for something specific. To solve this problem, keywords are attached to papers when they are published. These keywords are chosen by the author and based on the subject.
By searching for the appropriate keywords, the requested articles can be found. The keywords chosen by the author are either made up by the author or taken from an existing classification system. Keywords made up by the author can describe the content of his work perfectly, because he is free to choose. When using an existing system, the author is restricted to a fixed set of categories and picks the keywords that best match the content.

In this paper the Latent Dirichlet Allocation (LDA) model, as published by Blei, Ng and Jordan (2003), is used for the classification of economic articles. The LDA model generates topics, consisting of related words, based on the occurrence of those words in a corpus. A good implementation of LDA as a classification method could result in an automatic classification method. LDA can be used as a classification method because each article can be represented as a distribution over different topics. Larger percentages indicate more similarity and thus increase the probability that the subject of a topic is the same as the subject of the article.

Another classification method for ordering economic articles is produced by the Journal of Economic Literature and is called a JEL classification code. The codes are assigned to the articles by their authors, and an article can carry multiple codes. The classification consists of twenty main categories from the field of economics; the second- and third-level subcategories specify the subject of an article. Because the third level of the JEL codes largely covers the same subjects as the second level above it, the focus of this paper is on the second-level JEL codes.

By recognizing LDA-generated topics as subcategories of the JEL code, a link between these classification methods is made. Articles with a JEL code can be compared with the topic distribution of the LDA model.
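As an illustration of this idea, an article's topic distribution can be mapped to a category by taking the topic with the largest share. The sketch below uses purely hypothetical numbers and a made-up topic-to-JEL mapping; it is not a result from this thesis.

```python
# Hypothetical sketch: classify an article by its largest topic share.
# The topic-to-JEL mapping below is illustrative, not taken from the thesis.
topic_to_jel = {0: "C", 1: "E", 2: "G"}  # assumed mapping of topic index to JEL category

def predict_jel(theta):
    """theta: list of topic proportions for one article (sums to 1)."""
    best_topic = max(range(len(theta)), key=lambda k: theta[k])
    return topic_to_jel[best_topic]

print(predict_jel([0.1, 0.7, 0.2]))  # the article is dominated by topic 1
```

In practice an article can carry multiple JEL codes, so a threshold on the topic shares, rather than a single argmax, may be closer to the comparison made later in this paper.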
If the classification methods give similar results, an automatic classification system could be possible. The goal of this paper is to create this link and find an answer to the question: can the LDA model successfully predict the JEL codes of articles?

To make a comparison between LDA and JEL possible, a dataset from EconLit is chosen. The dataset contains 181 thousand abstracts of economic articles. The necessary information, such as year of publication and the JEL codes, is also included in the dataset. The publications in the dataset come from the period 2000 up to and including 2011.

The theory behind the LDA model is presented in the theoretical framework, which also contains information about the different levels within the JEL classification codes. Second, the methods used for trimming the data, creating topics and adjusting JEL codes to the LDA-formed topics are explained. Subsequently a description of the comparison methods is given. Finally, the results are presented and discussed in the conclusion.

2. Theory behind LDA and JEL codes

To understand the research, it is important to understand both classification methods, LDA and JEL. Therefore the structure of these methods is explained.

2.1. LDA model

2.1.1. The theory

The LDA model considers documents as a mixture of a finite number of topics, and each meaningful word in a document as allocated by one of those topics. To make the terms clear, the same terminology as Blei et al. (2003) will be used:

• A word is the smallest unit within this model and is denoted by w.
• A document consists of N words and can be written as d = (w1, w2, ..., wN).
• A corpus is the biggest unit and is a collection of M documents, with notation D = (d1, d2, ..., dM).

When LDA is applied to a corpus, it allocates words to a set number of topics based on their occurrence in the corpus.
A mixture of these topics over the different documents is also produced. In doing so, LDA finds useful applications such as creating topics from news articles, as done by Xin Zhao et al. (2011) with articles from the New York Times. In their paper the topics created from the New York Times are compared with topics created by applying an adjusted LDA model to Twitter data. The Twitter-LDA model proposed by Xin Zhao et al. was adjusted to take the restricted length of tweets into account. This paper is restricted to the original LDA model as published by Blei et al. (2003).

An improvement of the original LDA model over other text corpus models, such as the mixture of unigrams (Nigam et al., 2000), is the multinomial distribution θ over the topics in each document. Blei shows that the restriction to one topic per document is too restrictive to model a corpus. The pLSI model (Hofmann, 1999) does allow multiple topics in one document, but is restricted to the training set that is used: the topic distribution is only known for documents in the training set. Besides that, Blei et al. show in their comparison between LDA and pLSI that the pLSI model suffers from overfitting.

Within each topic zn, the words wn follow a multinomial distribution. The same word can be part of multiple topics: every word wi, with i ranging over the whole vocabulary, has a probability p(wi | zn) under topic zn. Another requirement on the distribution of words is the 'bag-of-words' assumption, which states that the order of the words does not affect the LDA model.

Blei et al. (2003) describe LDA as a three-step generative process for creating the documents in a corpus:

1. Choose N ∼ Poisson(ζ).
2. Choose θ ∼ Dirichlet(α).
3. For each of the N words wn:
   a) Choose a topic zn ∼ Multinomial(θ).
   b) Choose a word wn from p(wn | zn).

Two important steps for this paper are step 2 and step 3b.
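The three-step generative process above can be sketched in a few lines of Python. The parameter values (K topics, vocabulary size V, the prior α and the Poisson mean ζ) are illustrative choices for this sketch, not values used in the thesis, and the per-topic word distributions β are drawn at random rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (not from the thesis): K topics over a vocabulary of V words.
K, V = 3, 8
alpha = np.ones(K) * 0.5             # Dirichlet prior over the topic mixture
beta = rng.dirichlet(np.ones(V), K)  # p(w | z): one word distribution per topic
zeta = 10                            # Poisson mean for the document length

# Step 1: draw the document length N.
N = max(1, rng.poisson(zeta))
# Step 2: draw the document's topic mixture theta.
theta = rng.dirichlet(alpha)
# Step 3: for each word, draw a topic z_n from theta, then a word w_n from p(w | z_n).
doc = []
for _ in range(N):
    z = rng.choice(K, p=theta)
    w = rng.choice(V, p=beta[z])
    doc.append(w)

print(doc)  # a generated document, represented as word indices into the vocabulary
```

Inference in LDA runs this process in reverse: given only the documents, it estimates θ for each document (step 2) and the word distributions p(w | z) (step 3b).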
If step 2 is applied to the EconLit dataset, it gives the distribution over the formed topics for every abstract. Step 3b gives the probabilities of appearance of certain words within a topic. For both steps, high percentages mean a high chance of appearance and thus relevance.

2.1.2. The script

The LDA model applied in this paper is a script in the programming language Python, produced by Hansen et al. (2014). The script is publicly available and can be adjusted. By adjusting its parameters it can be fitted to the research goal and the dataset that is used. For example, one of the parameters is the number of topics created by LDA. To make an approximation for a different number of JEL categories, the number of topics needs to be adjusted.

The script also contains two steps to clean the dataset. The first is the removal of so-called stopwords. Stopwords are words that help to build a sentence but do not carry information, for example 'the', 'and', 'where' and 'themself'. Removing them is necessary because stopwords do not help describe the content of a document. The second step is the stemming of words. Here the base form of each word is taken, in order to match words with the same meaning but a different conjugation. An example of a base word is 'financ', which occurs within the abstracts in the forms 'finance' and 'financing'.

2.2. JEL codes

The second classification method in this paper is the JEL classification code, as described on the website of the American Economic Association.
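The two cleaning steps described in section 2.1.2 can be sketched as follows. This is a simplified stand-in, not the Hansen et al. (2014) script: the stopword list is a small illustrative sample, and the suffix-stripping rule is a crude substitute for a real stemmer such as the Porter stemmer.

```python
# Minimal sketch of the two cleaning steps (stopword removal, then stemming).
# The stopword list and the suffix rules are illustrative simplifications.
STOPWORDS = {"the", "and", "where", "themself", "of", "a", "in", "is"}

def stem(word):
    # Crude stand-in for a real stemmer (e.g. Porter): strip common endings
    # so that 'finance' and 'financing' both reduce to the base 'financ'.
    for suffix in ("ing", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return stem(word[: len(word) - len(suffix)])
    return word

def clean(text):
    tokens = text.lower().split()
    return [stem(t) for t in tokens if t not in STOPWORDS]

print(clean("The financing of finance"))  # → ['financ', 'financ']
```

After cleaning, both conjugations map to the same base word, so the LDA model counts them as occurrences of a single vocabulary item.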