Framework for Topic Modelling

Radim Řehůřek and Petr Sojka
NLP Centre, Faculty of Informatics, Masaryk University, Brno, Czech Republic
{xrehurek,sojka}@fi.muni.cz
http://nlp.fi.muni.cz/projekty/gensim/

NLP Framework for VSM

Large corpora are ubiquitous in today’s world and memory quickly becomes the limiting factor in practical applications of the Vector Space Model (VSM). In this paper, we identify a gap in existing implementations of many of the popular algorithms, which is their scalability and ease of use. We describe a Natural Language Processing software framework which is based on the idea of document streaming, i.e. processing corpora document after document, in a memory independent fashion. Within this framework, we implement several popular algorithms for topical inference, including Latent Semantic Analysis and Latent Dirichlet Allocation, in a way that makes them completely independent of the training corpus size. Particular emphasis is placed on straightforward and intuitive framework design, so that modifications and extensions of the methods and/or their application by interested practitioners are effortless. We demonstrate the usefulness of our approach on a real-world scenario of computing document similarities within an existing digital library DML-CZ.

Introduction

“Controlling complexity is the essence of computer programming.”
Brian Kernighan [2]

The Vector Space Model (VSM) is a proven and powerful paradigm in NLP, in which documents are represented as vectors in a high-dimensional space. The idea behind topical modelling is that texts in natural languages can be expressed in terms of a limited number of underlying concepts (or topics), a process which both improves efficiency (new representation takes up less space) and eliminates noise (transformation into topics can be viewed as noise reduction). A topical search for related documents is orthogonal to the more well-known “fulltext” search, which would match particular words, possibly combined through boolean operators. Research on topical models has recently picked up pace, especially in the field of generative topic models such as Latent Dirichlet Allocation and their hierarchical extensions.

System Design

“Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.”
Doug McIlroy [4]

Our choices in designing the proposed framework are a reflection of these perceived shortcomings. They can be explicitly summarised into:

Corpus size independence. We want the package to be able to detect topics based on corpora which are larger than the available RAM, in accordance with the current trends in NLP (see e.g. [3]).

Intuitive API. We wish to minimise the number of method names and interfaces that need to be memorised in order to use the package. The terminology is NLP-centric.

Easy deployment. The package should work out-of-the-box on all major platforms, even without root privileges and without any system-wide installations.

Cover popular algorithms. We seek to provide novel, scalable implementations of algorithms such as TF-IDF, Latent Semantic Analysis, Random Projections or Latent Dirichlet Allocation.

We chose Python as the programming language, mainly because of its straightforward, compact syntax, multiplatform nature and ease of deployment. Python is also suitable for handling strings and boasts a fast, high quality library for numerical computing, numpy, which we use extensively.

A corpus is represented as a sequence of documents and at no point is there a need for the whole corpus to be stored in memory. This feature is not an afterthought on lazy evaluation, but rather a core requirement for our application and as such reflected in the package philosophy. To ensure transparent ease of use, we define corpus to be any iterable returning documents:

    >>> for document in corpus:
    ...     pass

In turn, a document is a sparse vector representation of its constituent fields (such as terms or topics), again realised as a simple iterable:

    >>> for fieldId, fieldValue in document:
    ...     pass

This is a deceptively simple interface; while a corpus is allowed to be something as simple as

    >>> corpus = [[(1, 0.8), (8, 0.6)]]

this streaming interface also subsumes loading/storing matrices from/to disk.

Note the lack of package-specific keywords, required method names, base class inheritance etc. This is in accordance with our main selling points: ease of use and data scalability.

Needless to say, both corpora and documents are not restricted to these interfaces; in addition to supporting iteration, they may (and usually do) contain additional methods and attributes, such as internal document ids, means of visualisation, document class tags and whatever else is needed for a particular application.

The second core interface is that of transformations. Where a corpus represents data, a transformation represents the process of translating documents from one vector space into another (such as from a TF-IDF space into an LSA space). Realisation in Python is through the dictionary [] mapping notation and is again quite intuitive:

    >>> from gensim.models import LsiModel
    >>> lsi = LsiModel(corpus, numTopics=2)
    >>> lsi[new_document]
    [(0, 0.197), (1, -0.056)]

While an intuitive interface is important for software adoption, it is of course rather trivial and useless in itself. We have therefore implemented some of the popular VSM methods, Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA).

The framework is heavily documented and is available from http://nlp.fi.muni.cz/projekty/gensim/. This website contains sections which describe the framework and provide usage tutorials, as well as sections on download and installation instructions. The framework is open sourced and distributed under an OSI-approved LGPL license.

Application

“An idea that is developed and put into action is more important than an idea that exists only as an idea.”
Hindu Prince Gautama Siddharta, the founder of Buddhism, 563–483 B.C.

Many digital libraries today start to offer browsing features based on pairwise document content similarity. For collections having hundreds of thousands of documents, computation of similarity scores is a challenge [1]. We have faced this task during the project of The Digital Mathematics Library DML-CZ [5]. The emphasis was not on developing new IR methods for this task, although some modifications were obviously necessary, such as answering the question of what constitutes a “token”, which differs between mathematics and the more common English ASCII texts.

With the collection’s growth and a steady feed of new papers, lack of scalability appeared to be the main issue. This drove us to develop our new document similarity framework.

As of today, the corpus contains 61,293 fulltext documents for a total of about 270 million tokens. There are mathematical papers from the Czech Digital Mathematics Library DML-CZ http://dml.cz (22,991 papers), from the NUMDAM repository http://numdam.org (17,636 papers) and from the math part of arXiv http://arxiv.org/archive/math (20,666 papers). After filtering out word types that either appear less than five times in the corpus (mostly OCR errors) or in more than one half of the documents (stop words), we are left with 315,167 distinct word types. Although this is by no means an exceptionally big corpus, it already prohibits storing the sparse term-document matrices in main memory, ruling out most available VSM software systems.

We have tried several VSM approaches to representing documents as vectors: term weighting by TF-IDF, Latent Semantic Analysis, Random Projections and Latent Dirichlet Allocation. In all cases, we used the cosine measure to assess document similarity.

When evaluating data scalability, one of our two main design goals (together with ease of use), we note that memory usage is now dominated by the transformation models themselves. These in turn depend on the vocabulary size and the number of topics (but not on the training corpus size). With 315,167 word types and 200 latent topics, both LSA and LDA models take up about 480 MB of RAM.

Although evaluation of the quality of the obtained similarities is not the subject of this paper, it is of course of utmost practical importance. Here we note that it is notoriously hard to evaluate the quality, as even the preferences of different types of similarity are subjective (match of main topic, or subdomain, or specific wording/plagiarism) and depend on the motivation of the reader. For this reason, we have decided to present all the computed similarities to our library users at once, see e.g. http://dml.cz/handle/10338.dmlcz/100785/SimilarArticles. At the present time, we are gathering feedback from mathematicians on these results and it is worth noting that the framework proposed in this paper makes such side-by-side comparison of methods straightforward and feasible.

Conclusion

We believe that our framework makes an important step in the direction of current trends in Natural Language Processing and fills a practical gap in existing software systems. We have argued that the common practice, where each novel topical algorithm gets implemented from scratch (often inventing, unfortunately, yet another I/O format for its data in the process) is undesirable. We have analysed the reasons for this practice and hypothesised that this is partly due to the steep API learning curve of existing IR frameworks.

Our framework makes a conscious effort to make parsing, processing and transforming corpora into vector spaces as intuitive as possible. It is platform independent and requires no compilation or installations past Python+numpy. As an added bonus, the package provides ready implementations of some of the popular IR algorithms, such as Latent Semantic Analysis and Latent Dirichlet Allocation. These are novel, pure-Python implementations that make use of modern state-of-the-art iterative algorithms. This enables them to work over practically unlimited corpora, which no longer need to fit in RAM.

We believe this package is useful to topic modelling experts in implementing new algorithms as well as to the general NLP community, who is eager to try out these algorithms but who often finds the task of translating the original implementations (not to say the original articles!) to its needs quite daunting.

References

[1] T. Elsayed, J. Lin, and D. W. Oard. Pairwise Document Similarity in Large Collections with MapReduce. In HLT ’08: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies, pages 265–268, Morristown, NJ, USA, 2008. Association for Computational Linguistics.
[2] B. W. Kernighan and P. J. Plauger. Software Tools. Addison-Wesley Professional, 1976.
[3] A. Kilgarriff and G. Grefenstette. Introduction to the Special Issue on the Web as Corpus. Computational Linguistics, 29(3):333–347, 2003.
[4] M. D. McIlroy, E. N. Pinson, and B. A. Tague. UNIX Time-Sharing System: Foreword. The Bell System Technical Journal, 57(6 (part 2)), July/Aug. 1978.
[5] P. Sojka. An Experience with Building Digital Open Access Repository DML-CZ. In Proceedings of CASLIN 2009, Institutional Online Repositories and Open Access, 16th International Seminar, pages 74–78, Teplá Monastery, Czech Republic, 2009. University of West Bohemia, Pilsen, CZ.
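
Supplementary sketch. The corpus and transformation interfaces described under System Design can be illustrated in plain Python without the package itself. The code below is a minimal sketch under stated assumptions, not the framework's implementation: the class and function names (StreamedCorpus, ScaleTransformation, write_demo_corpus) and the one-document-per-line "fieldId:value" file format are hypothetical and chosen only for this example. It shows a corpus that is merely an iterable streaming sparse documents from disk, and a toy transformation applied through the [] notation.

```python
import os
import tempfile


class StreamedCorpus:
    """Hypothetical corpus: one document per line, as 'fieldId:value' pairs."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every pass, so the corpus can be iterated
        # repeatedly while only one document is held in memory at a time.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                pairs = (item.split(":") for item in line.split())
                yield [(int(field_id), float(value)) for field_id, value in pairs]


class ScaleTransformation:
    """Hypothetical toy transformation, applied through the [] notation."""

    def __init__(self, factor):
        self.factor = factor

    def __getitem__(self, document):
        # Map a sparse document into a "new space" by rescaling each field.
        return [(field_id, value * self.factor) for field_id, value in document]


def write_demo_corpus():
    """Write a two-document demo corpus to a temporary file; return its path."""
    handle, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(handle, "w", encoding="utf-8") as out:
        out.write("1:0.8 8:0.6\n")
        out.write("2:1.0\n")
    return path


if __name__ == "__main__":
    demo_path = write_demo_corpus()
    corpus = StreamedCorpus(demo_path)
    double = ScaleTransformation(2.0)
    for document in corpus:
        print(double[document])
    os.remove(demo_path)
```

Because StreamedCorpus only implements iteration, it can be swapped for an in-memory list of documents without changing any code that consumes it, which is the point of defining a corpus as "any iterable returning documents".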

New Challenges for NLP Frameworks, LREC 2010, Malta, May 22nd, 2010

Support of grants MUNI/E/0084/2009 of MU Brno, 1ET200190513 of the Acad. of Sci. of the CR and MŠMT ČR LC536 is acknowledged.