Topic modeling of public repositories at scale using names in source code

Vadim Markovtsev, Eiso Kant
[email protected], [email protected]
source{d}, Madrid, Spain

February 2017

arXiv:1704.00135v2 [cs.PL] 20 May 2017

Abstract—Programming languages themselves have a limited number of reserved keywords and character based tokens that define the language specification. However, programmers make rich use of natural language within their code through comments, text literals and naming entities. The programmer-defined names that can be found in source code are a rich source of information for building a high level understanding of a project. The goal of this paper is to apply topic modeling to the names used in over 13.6 million repositories and to interpret the inferred topics. One of the problems in such a study is the occurrence of duplicate repositories not officially marked as forks (obscure forks). We show how to address it using the same identifiers which are extracted for topic modeling. We open with a discussion on naming in source code, then elaborate on our approach to removing exact and fuzzy duplicate repositories using Locality Sensitive Hashing on the bag-of-words model, then discuss our work on topic modeling, and finally present the results of our data analysis together with open access to the source code, tools and datasets.

Index Terms—programming, open source, source code, software repositories, git, GitHub, topic modeling, ARTM, locality sensitive hashing, MinHash, open dataset, data.world.

I. INTRODUCTION

There are more than 18 million non-empty public repositories on GitHub which are not marked as forks, which makes GitHub the largest version control repository hosting service. It has become difficult to explore such a large number of projects and nearly impossible to classify them. One of the main sources of information that exists about public repositories is their code.

To gain a deeper understanding of software development it is important to understand the trends among open-source projects. Bleeding edge technologies are often used first in open source projects and later employed in proprietary solutions when they become stable enough¹. An exploratory analysis of open-source projects can help to detect such trends and provide valuable insight for industry and academia.

¹ Notable examples include the Linux OS kernel, the PostgreSQL database engine, the Apache Spark cluster-computing framework and the Docker containers.

Since GitHub appeared, the open-source movement has gained significant momentum. Historically, developers would manually register their open-source projects in software digests. As the number of projects dramatically grew, those lists became very hard to update; as a result they became more fragmented and started specializing exclusively in narrow technological ecosystems. The next attempt to classify open source projects was based on manually submitted lists of keywords. While this approach works [1], it requires careful keyword engineering to appear comprehensive, and thus it was not widely adopted by end users in practice. GitHub introduced repository tags in January 2017, which is a variant of manual keyword submission.

The present paper describes how to conduct fully automated topic extraction from millions of public repositories. It scales linearly with the overall source code size and has a substantial performance reserve to support future growth. We propose building a bag-of-words model on the names occurring in source code and applying proven Natural Language Processing algorithms to it. In particular, we describe how the "Weighted MinHash" algorithm [2] helps to filter fuzzy duplicates and how an Additive Regularized Topic Model (ARTM) [3] can be efficiently trained. The result of the topic modeling is a nearly-complete classification of open source projects, which reflects their drastic variety along multiple features. The dataset we work with consists of approx. 18 million public repositories retrieved from GitHub in October 2016.
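To make the fuzzy deduplication idea concrete, the listing below is a minimal sketch of weighted MinHash over two bag-of-words count vectors. It assumes the open-source datasketch package rather than the implementation used in this paper, and the toy vectors are hypothetical; the actual filtering approach is described later, in Section IV.

import numpy as np
from datasketch import WeightedMinHashGenerator

# Two hypothetical repositories as weighted bags-of-words over a shared
# vocabulary; the nearly identical counts model an obscure fork.
repo_a = np.array([4.0, 0.0, 2.0, 7.0, 1.0, 0.0])
repo_b = np.array([4.0, 0.0, 2.0, 6.0, 1.0, 0.0])

gen = WeightedMinHashGenerator(dim=len(repo_a), sample_size=128, seed=1)
hash_a = gen.minhash(repo_a)
hash_b = gen.minhash(repo_b)

# Estimate of the weighted Jaccard similarity between the two bags;
# a value close to 1.0 flags the pair as a likely fuzzy duplicate.
print(hash_a.jaccard(hash_b))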
The rest of the paper is organised as follows: Section II reviews prior work on the subject. Section III elaborates on how we turn software repositories into bags-of-words. Section IV describes the approach to efficient filtering of fuzzy repository clones. Section V covers the building of the ARTM model with 256 manually labeled topics. Section VI presents the achieved topic modeling results. Section VII lists the open datasets we were able to prepare. Finally, Section VIII presents a conclusion and suggests improvements for future work.

II. RELATED WORK

A. Academia

An open source community study which presented statistics about manually picked topics was published in 2005 by J. Xu et al. [4].

Blincoe et al. [5] studied GitHub ecosystems using reference coupling over the GHTorrent dataset [6], which contained 2.4 million projects. This research employs an alternative topic modeling method on the source code of 13.6 million projects. Instead of using the GHTorrent dataset, we have prepared open datasets from almost all public repositories on GitHub in order to have a more comprehensive overview.

M. Lungu [7] conducted an in-depth study of software ecosystems in 2009, the year when GitHub appeared. The examples in this work used samples of approx. 10 repositories, and the proposed discovery methods did not include Natural Language Processing.

The problem of the correct analysis of forks on GitHub has been discussed by Kalliamvakou et al. [8] along with other valuable concerns.

Topic modeling of source code has been applied to a variety of problems reviewed in [9]: improvement of software maintenance [10], [11], defects explanation [12], concept analysis [13], [14], software evolution analysis [15], [16], finding similarities and clones [17], clustering source code and discovering the internal structure [18], [19], [20], and summarizing [21], [22], [23]. In the aforementioned works, the scope of the research was focused on individual projects.

The usage of topic modeling in [24] focused on improving software maintenance and was evaluated on 4 software projects. Concepts were extracted using a corpus of 24 projects in [25]. Giriprasad Sridhara et al. [26], Yang and Tan [27], and Howard et al. [28] considered comments and/or names to find semantically similar terms; Haiduc and Marcus [29] researched common domain terms appearing in source code. The approach presented in this paper reveals similar and domain terms as well, but leverages a significantly larger dataset of 13.6 million repositories.

Bajracharya and Lopes [30] trained a topic model on the year-long usage log of Koders, one of the major commercial code search engines. The topic categories suggested by Bajracharya and Lopes share little similarity with the categories described in this paper since the input domain is very different.

B. Industry

To our knowledge, there are few companies which maintain a complete mirror of GitHub repositories. source{d} [31] is focused on doing machine learning on top of the collected source code. Software Heritage [32] strives to become web.archive.org for open source software. SourceGraph [33] processes source code references, internal and external, and created a complete reference graph for projects written in Golang.

libraries.io [34] is not GitHub centric but rather processes the dependencies and metadata of open source packages fetched from a number of repositories. It analyses the dependency graph at the level of projects, while SourceGraph analyses it at the level of functions.

III. TURNING REPOSITORIES INTO BAGS-OF-WORDS

For the purpose of our analysis we choose to use the latest version of the master branch of each repository and treat each repository as a single document. An improvement for further research would be to use the entire history of each repository, including the unique code found in each branch.

A. Preliminary Processing

Our first goal is to process each repository to identify which files contain source code and which files are redundant for our purpose. GitHub has an open-source machine learning based library named linguist [35] that identifies the programming language used within a file based on its extension and contents. We modified it to also identify vendor code and automatically generated files. The first step in our pre-processing is to run linguist over each repository's master branch. From 11.94 million repositories we end up with 402.6 million source files in which we have high confidence that the source code was written by a developer in that project. Identifying the programming language used within each file is important for the next step, the names extraction, as it determines the programming language parser.
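As a minimal sketch of this filtering step, the snippet below shells out to the stock github-linguist command-line tool (an assumption of this illustration; the paper relies on a modified linguist, so only the built-in vendor and generated-file detection applies here).

import subprocess

def source_files_breakdown(repo_path):
    """Return linguist's per-language breakdown for a local clone:
    language percentages followed by the files detected for each
    language. Vendored and generated files are excluded by linguist's
    built-in heuristics."""
    result = subprocess.run(
        ["github-linguist", "--breakdown", repo_path],
        capture_output=True, text=True, check=True)
    return result.stdout

if __name__ == "__main__":
    print(source_files_breakdown("/path/to/clone"))  # hypothetical path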
B. Extracting Names

Source code highlighting is a typical task for professional text editors and IDEs, and several open source libraries have been created to tackle it. Each works by having a grammar file written per programming language which contains the rules. Pygments [36] is a high quality community-driven package for Python which supports more than 400 programming languages and markups. According to Pygments, all source code tokens are classified across the following categories: comments, escapes, indentations and generic symbols, reserved keywords, literals, operators, punctuation and names.

Linguist and Pygments have different sets of supported languages. Linguist stores its list at master/lib/linguist/languages.yml and the similar Pygments list is stored as pygments.lexers.LEXERS. Each has nearly 400 items and the intersection is approximately 200 programming languages (the "programming" Linguist item type). The languages common to Linguist and Pygments which were chosen are listed in Appendix A. In this research we apply Pygments to the 402.6 million source files to extract all tokens which belong to the type Token.Name.

C. Processing names

The next step is to process the names according to naming conventions. As an example, class FooBarBaz adds three words to the bag: foo, bar and baz, while int wdSize adds two: wdsize and size. Fig. 1 is the full listing of the function, written in Python 3.4+, which splits identifiers.
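The Fig. 1 listing is not reproduced in this excerpt. The sketch below illustrates both steps under stated assumptions: Pygments yields (token type, value) pairs and every value under Token.Name is kept, while the split_identifier helper, its MIN_PART threshold and its regular expressions are hypothetical stand-ins consistent with the FooBarBaz and wdSize examples above, not the paper's exact rules.

import re

from pygments import lex
from pygments.lexers import get_lexer_by_name
from pygments.token import Token

MIN_PART = 3  # assumed minimum length of a standalone word part

def split_identifier(name):
    """Split a programmer-defined name into lowercase words:
    FooBarBaz -> foo, bar, baz; wdSize -> wdsize, size."""
    chunks = []
    for token in re.split(r"[^a-zA-Z]+", name):
        # break CamelCase humps while keeping acronym runs together
        chunks.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", token))
    chunks = [c.lower() for c in chunks]
    words = set()
    for i, chunk in enumerate(chunks):
        if len(chunk) >= MIN_PART:
            words.add(chunk)
        elif i + 1 < len(chunks):
            # too short to stand alone: glue it to the following part
            words.add(chunk + chunks[i + 1])
    return words

def bag_of_names(code, language):
    """Extract every Token.Name occurrence with Pygments and split it."""
    bag = set()
    for token_type, value in lex(code, get_lexer_by_name(language)):
        if token_type in Token.Name:
            bag |= split_identifier(value)
    return bag

print(bag_of_names("class FooBarBaz:\n    wdSize = 0\n", "python"))
# -> {'foo', 'bar', 'baz', 'wdsize', 'size'} (set order varies)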
