TOPIC MODELLING

After Named Entity Recognition, we'll now be looking at a technique called Topic Modelling (TM from now on). The reason for this choice is that TM has lately been gaining momentum within DH research and projects, often in combination with other methods such as network visualisation and network analysis. First off, what kind of algorithm is TM? It is an unsupervised one, meaning that the relationships between documents that are eventually extracted are not learned from a training set, as happens with supervised algorithms, but are obtained by fitting the input data to a model, namely the topic model. What TM does is extract a given number of topics from an input set of documents. One of the key ideas behind it is that every topic is present to varying degrees within each document. The typical output of running a corpus through TM is a set of distinctive words for each topic and, for each document, a percentage value indicating the presence of each topic within that document. The algorithm underlying TM is called Latent Dirichlet Allocation (LDA) and was described in a paper published by Blei, Ng and Jordan in 2003. Matthew Jockers has recently written an introduction to LDA for a non-technical audience entitled The LDA Buffet is Now Open; or, Latent Dirichlet Allocation for English Majors–I thoroughly recommend reading it before going further with this lesson.
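To make this more concrete, here is a purely invented sketch of what such output might look like for a model with three topics run over two documents (the words and percentages below are hypothetical and only meant to illustrate the shape of the output):

Topic 1: needle, silk, pattern, work, thread
Topic 2: school, teacher, pupils, lessons, reading
Topic 3: wages, factory, labour, hours, employment

Document 1: Topic 1: 72%, Topic 2: 21%, Topic 3: 7%
Document 2: Topic 1: 5%, Topic 2: 48%, Topic 3: 47%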

Some Applications of Topic Modelling

This is a list of applications based on TM that I'm aware of in the fields of Digital Humanities and Humanities Computing. Do check them out to get an idea of the purposes it is currently being used for:

• Cameron Blevins, Topic Modeling the Diary of Martha Ballard
• Robert K. Nelson, Mining the Dispatch
• Elijah Meeks, Comprehending the Digital Humanities
• David Mimno, Perseus' Thematic Index of Classics in JSTOR

Frameworks and Packages

There exist quite a few frameworks and packages out there that contain an implementation of topic models. This is a list of them, by no means exhaustive:

• Mallet (Java)
  – Mallet is actually much more than a framework for TM: it is a framework written in Java for statistical natural language processing in general, of which topic modelling is only one component.
• Gensim (Python)
  – Gensim implements not only TM but also document similarity queries.
• Stanford Topic Modelling Toolbox

For this lesson we'll be using Gensim, as it fits well into the Pythonic setting of this module and is also nicely documented. But keep in mind that it's not the only option out there! To start with, let's install Gensim by running the following from the command line:

sudo easy_install gensim
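If easy_install is not available on your system, installing via pip (assuming you have it installed) works just as well:

pip install gensim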

To test whether the installation was successful, just try to import it from the Python interactive interpreter:

>>> import gensim

You can also get additional information about the package with the following command:

>>> help(gensim)

On the Gensim website you can find a detailed tutorial on installation.

About the Data

For this exercise we are going to use the OCRed text of one issue of the English Woman's Journal, which you can find in the "data/" directory. This is probably not the best set of documents to show the potential of topic modelling, for at least the following reasons:

• it's a rather "dirty" text, with many OCR errors, and it hasn't been manually cleaned
• the files are divided into one file per page, so stories or articles often span several pages and a single page may contain two or more of them
• it's a rather small corpus of around 40k words

However, I thought it was interesting mainly because:

• it's something different from the usual example, which uses the text of the English Wikipedia–you can find that very example described in detail on the Gensim website
• it's an example which clearly shows the problems related to using historical texts

Create the Corpus

In order to process our documents with Gensim we first need to transform them into a suitable format. Let's import the modules that we need:

import gensim
from gensim import corpora, models
import glob
import codecs
import os
import string
from nltk.corpus import stopwords
from nltk.tokenize.regexp import WordPunctTokenizer

Let's now read the files into memory: note that when reading each file we intentionally skip the first line of each page, as it usually contains the running head (remember, this is a not-so-clean OCRed text of a journal issue!), which can appear on multiple pages and would thus distort the word frequency counts and the topic composition.

# this is the path to the data directory
data_dir = "data/NCSE_EWJ/"

# reads the content of that folder and opens all .txt files that are found
files = [codecs.open(infile, 'r', 'utf-8') for infile in glob.glob(os.path.join(data_dir, '*.txt'))]

# reads into memory the content of each file but skips the first line as it contains the running head
texts = [" ".join(file.readlines()[1:]) for file in files]

In the next step we are going to:

• load a list of stop words from NLTK

• tokenise each text in our directory with NLTK's WordPunctTokenizer–as a result, some of the resulting tokens contain one or more punctuation characters

This is the code which performs these steps:

# imports the list of English stopwords from NLTK
en_stopwords = stopwords.words('english')

# each text in our list is tokenised using NLTK's WordPunctTokenizer and stopwords are removed
# ...tokenisation was covered in the "Text Processing" lesson, in case these concepts do not sound familiar to you
tokenised = [[word.lower() for word in WordPunctTokenizer().tokenize(text) if word.lower() not in en_stopwords] for text in texts]

Now let’s remove those words that are actually punctuation–note that this filter will catch tokens like “,” but not like “,(”:

# iterates over each word in each document and keeps it only if it isn't just a punctuation character
tokenised = [[word for word in text if word not in string.punctuation] for text in tokenised]
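If you also want to catch tokens such as ",(", which combine several punctuation characters, a slightly stricter filter can be used. The following is just a minimal sketch, under the assumption that any token consisting entirely of punctuation characters should be discarded:

# drops any token made up entirely of punctuation characters,
# so that tokens such as ",(" are removed as well
tokenised = [[word for word in text if not all(char in string.punctuation for char in word)] for text in tokenised]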

We then filter out the tokens occurring only once, as they are not very informative for our TM algorithm and, in some cases, are likely to be OCR errors:

all_tokens = sum(tokenised, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
tokenised = [[word for word in text if word not in tokens_once] for text in tokenised]

Now let’s remove the words with relatively high frequency:

# first we create a dictionary containing an occurrence count for each word across all texts
words = {}
for text in tokenised:
    for word in text:
        if word in words:
            words[word] += 1
        else:
            words[word] = 1
freq_word = [(w, words[w]) for w in words.keys()]

# we order the words by frequency...
freq_word.sort(key=lambda tup: tup[1])

# ...then we reverse them into descending order, meaning the most frequent come first
freq_word.reverse()

# of those we keep only the most common, say the first 30
most_common = [x[0] for x in freq_word[:30]]

# and finally we exclude them from our corpus
tokenised = [[word for word in text if word not in most_common] for text in tokenised]

Now we are ready to create the corpus and write it to a file:

# we create the dictionary, where the mapping between token IDs and token strings is stored
dictionary = corpora.Dictionary(tokenised)

# then we create a corpus, which is a list of documents (lists), where each document is represented as a
# list of token IDs (not token strings! for the sake of performance)
corpus = [dictionary.doc2bow(text) for text in tokenised]

# finally, with the method serialize, we write the corpus to disk encoded in Gensim's format
corpora.MmCorpus.serialize('./extra/NCSE_corpus.mm', corpus)
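As an aside, Gensim's Dictionary class has a built-in method for pruning very rare and very frequent tokens, which can replace most of the manual filtering we did above. Note that it works on document frequencies (in how many documents a token appears) rather than on raw occurrence counts, so the result is not identical to ours; the thresholds below are only an illustration, and if you use it you should rebuild the corpus with doc2bow afterwards, since the token IDs change:

# a possible alternative to the manual filtering: drop tokens appearing in fewer
# than 2 documents or in more than 50% of the documents
dictionary.filter_extremes(no_below=2, no_above=0.5)
corpus = [dictionary.doc2bow(text) for text in tokenised]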

As you can see, there are two main objects that we need in order to manipulate our corpus: "corpus" and "dictionary". Why? It's actually quite simple. Gensim does not store all the tokens in our texts as strings, as this would not be a performant approach. Each token is replaced by an integer ID, and this mapping is stored in the "dictionary" object. Each document in the corpus is then represented as a list of value pairs, where the first value is the token ID and the second is a number indicating its frequency. As an example, let's take the first document in the corpus and print its content (in Gensim's representation format):

print corpus[0]

Let's now look up each token ID in the "dictionary" object and print it together with its frequency:

for word in corpus[0]:
    print dictionary.id2token[word[0]], word[1]

As you can see, the word "english" is the most frequent in this document, with 5 occurrences. Having saved the corpus in this format, we can load it at any time without having to re-tokenise and pre-process it from scratch:

corpus = corpora.MmCorpus('./extra/NCSE_corpus.mm')
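The dictionary can be persisted in the same spirit, so that the mapping between token IDs and token strings is also available when we reload the corpus later on. A minimal sketch, assuming we want to keep it next to the corpus file:

# saves the dictionary to disk...
dictionary.save('./extra/NCSE_dictionary.dict')
# ...and loads it back whenever it is needed
dictionary = corpora.Dictionary.load('./extra/NCSE_dictionary.dict')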

Extract Topics from the Corpus

Let's now run the LDA algorithm, which actually takes only one line. We initialise an LdaModel, passing as initialisation parameters our corpus and the dictionary object (which, as we have already seen, contains the mapping between token IDs and token strings). Additionally, we specify how many topics we want to extract. We do not have time to go into more detail about how to determine the optimal number of topics but, as you will notice by looking at the resulting topics, setting a different number of topics leads to (very) different results. Let's start by extracting 5 topics:

lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=5)
lda.show_topics(topics=-1)

The function "show_topics()" prints the topics that were extracted. For each topic, the 10 most characterising words are selected and printed, together with a value indicating the "weight" of each word within that topic. The behaviour of this function changes according to the value of the parameter "topics" which is passed to it. With the value -1 all the topics are displayed, whereas if no value is specified (and therefore the default value of 10 is used) only 10 topics, randomly selected from those that were extracted, are printed. Let's now extract 20 topics:

lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=20)
lda.show_topics(topics=-1)

and now 50:

lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=50)
lda.show_topics(topics=-1)
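As mentioned at the beginning of this lesson, LDA also tells us, for each document, to what degree each topic is present in it. As a minimal sketch (reusing the "lda" and "corpus" objects defined above), this is how the topic mixture of the first document could be inspected:

# querying the model with a bag-of-words document returns a list of
# (topic id, proportion) pairs for the topics detected in that document
doc_topics = lda[corpus[0]]
for topic_id, proportion in doc_topics:
    print topic_id, proportion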

Understanding the Results

An essential part of TM is the interpretation of the results. Topics, as extracted by LDA, do not come with a label attached to them, but labels can eventually be assigned by someone with knowledge of the corpus who looks at them. There are also several ways of visualising the topics, the words that characterise them and the documents that are related to each topic–have a look at this neat visualisation by Elijah Meeks as an example of how the results of topic modelling can be visualised.

Further resources

• http://electricarchaeologist.wordpress.com/2011/08/30/getting-started-with-mallet-and-topic-modeling/
• http://mith.umd.edu/topic-modeling-in-the-humanities-an-overview/
• http://www.matthewjockers.net/2010/03/19/whos-your-dh-blog-mate-match-making-the-day-of-dh-bloggers-with-topic-modeling/
