FLORIDA STATE UNIVERSITY

COLLEGE OF ARTS AND SCIENCES

PYQUERY:

A SEARCH ENGINE FOR PYTHON PACKAGES AND MODULES

By

SHIVA KRISHNA IMMINNI

A Thesis submitted to the Department of Computer Science in partial fulfillment of the requirements for the degree of Master of Science

2015

Copyright 2015 Shiva Krishna Imminni. All Rights Reserved.

Shiva Krishna Imminni defended this thesis on November 13, 2015. The members of the supervisory committee were:

Piyush Kumar Professor Directing Thesis

Sonia Haiduc Committee Member

Margareta Ackerman Committee Member

The Graduate School has verified and approved the above-named committee members, and certifies that the thesis has been approved in accordance with university requirements.

I dedicate this thesis to my family. I am grateful to my loving parents, Nageswara Rao and Subba Laxmi, who made me the person I am today. I am thankful to my affectionate sister, Ramya Krishna, who is very special to me and always stood by my side.

ACKNOWLEDGMENTS

I owe thanks to many people. Firstly, I would like to express my gratitude to Dr. Piyush Kumar for directing my thesis. Without his continuous support, patience, guidance and immense knowledge, PyQuery wouldn't be so successful. He truly made a difference in my life by introducing me to the Python programming language and helping me learn how to contribute to the Python community. He trusted me and remained patient during the difficult times. I would also like to thank Dr. Sonia Haiduc and Dr. Margareta Ackerman for serving on the committee, monitoring my progress and providing insightful comments. They helped me learn multiple perspectives that widened my research. I would like to thank my team members Mir Anamul Hasan, Michael Duckett, Puneet Sachdeva and Sudipta Karmakar for their time, support, commitment and contributions to PyQuery.

TABLE OF CONTENTS

List of Tables
List of Figures
Abstract

1 Introduction
  1.1 Objective
  1.2 Approach

2 Related Work

3 Data Collection
  3.1 Package Level Search
    3.1.1 Metadata - Packages
    3.1.2 Code Quality
  3.2 Module Level Search
    3.2.1 Mirror Python Packages
    3.2.2 Metadata - Modules
    3.2.3 Code Quality

4 Data Indexing and Searching
  4.1 Data Indexing
    4.1.1 Package Level Search
    4.1.2 Module Level Search
  4.2 Data Searching
    4.2.1 Package Level Search
    4.2.2 Module Level Search

5 Data Presentation
  5.1 Server Setup
  5.2 Browser Interface
  5.3 Package Level Search
    5.3.1 Ranking of Packages
  5.4 Module Level Search
    5.4.1 Preprocessing

6 System Level Flow Diagram

7 Results

8 Conclusions and Future Work
  8.1 Thesis Summary
  8.2 Recommendation for Future Work

Bibliography
Biographical Sketch

LIST OF TABLES

5.1 Ranking matrix for keyword music.
5.2 Matching packages and their scores for keyword music.
7.1 Results comparison for keyword - requests.
7.2 Results comparison for keyword - flask.
7.3 Results comparison for keyword - pygments.
7.4 Results comparison for keyword - ...
7.5 Results comparison for keyword - pylint.
7.6 Results comparison for keyword - biological computing.
7.7 Results comparison for keyword - 3D printing.
7.8 Results comparison for keyword - framework.
7.9 Results comparison for keyword - material science.
7.10 Results comparison for keyword - google maps.

LIST OF FIGURES

3.1 Metadata from PyPI for package requests.
3.2 Pseudocode for collecting metadata of a module.
3.3 Metadata for a module in Flask package.
4.1 Package level data index mapping.
4.2 River definition.
4.3 Custom Analyzer with custom pattern filter.
4.4 Module level data indexing mapping.
4.5 Package level search query.
4.6 Module level search query.
5.1 Package modal.
5.2 Package statistics.
5.3 Other packages from author.
5.4 Pseudocode for ranking algorithm.
5.5 Module modal.
6.1 System Level Flow Diagram of PyQuery.
6.2 PyQuery home page.
6.3 PyQuery package level search template.
6.4 PyQuery module level search template.

ABSTRACT

The Python Package Index (PyPI) is a repository that hosts all the packages ever developed for the Python community. It hosts thousands of packages from different developers and, for the Python community, it is the primary source for downloading and installing packages. It also provides a simple web interface to search for these packages. A direct search on PyPI returns hundreds of packages that are not intuitively ordered, making it harder to find the right package. Developers consequently resort to mature search engines like Google, Bing or Yahoo, which redirect them to the appropriate package homepage at PyPI. Hence, the first task of this thesis is to improve search results for Python packages. Secondly, this thesis also attempts to develop a new search engine that allows Python developers to perform a code search targeting Python modules. Currently, existing search engines classify programming languages such that a developer must select a programming language from a list. As a result, every time a developer performs a search operation, he or she has to choose Python out of a plethora of programming languages. This thesis seeks to offer a more reliable and dedicated search engine that caters specifically to the Python community and ensures a more efficient way to search for Python packages and modules.

CHAPTER 1

INTRODUCTION

Python is a high-level programming language built around simplicity and efficiency. It emphasizes code simplicity and can perform more functionality in fewer lines of code. In order to streamline code and speed up development, many developers use application packages, which reduce the need to copy definitions into each program. These packages consist of written applications in the form of different modules, which contain the actual code. Python's main power lies within these packages and the wide range of functionality that they bring to the software development field. Providing ways and means to deliver information about these reusable components is of utmost importance. PyPI, a software repository for Python packages, offers a search feature to look for available packages meeting the user's needs. It implements a trivial ranking algorithm to detect matching packages for a user's keyword, resulting in a poorly sorted, huge list of closely and similarly scored packages. From this immense list of results, it is hard to find a package that meets the user's needs in a reasonable amount of time. Due to the lack of an efficient native search engine for the Python community, developers often rely on mature and multipurpose search engines like Google, Yahoo and Bing. In order to express his or her interest in Python packages, a developer taking this route has to articulate a query and, on top of that, provide additional input. A dedicated search engine for the Python community would bypass the need to specify one's interest in Python. One may argue that such a search engine would not alter the experience of a developer who is searching for Python packages. However, considering that a developer on average searches for packages, code and related information in five search sessions with 12 total queries each workday [1], a search engine that saves this time and effort is desirable. An additional software development practice that saves time is code reuse [2]. There are many factors that influence the practice of code reuse [3] [4]. One such factor is the availability of the right tools to find reusable components. Searching for code is a serious challenge

for developers. Code search engines such as Krugle1, Searchcode2 and BlackDuck3 have attempted to ameliorate the hardship of searching for code by targeting a wide range of languages. Currently, Python developers who conduct code searches have to learn how to configure these search engines so that they display Python-specific results. As a result, although these search engines exemplify the ideal that one product may solve all kinds of problems, such an ideal fails to overcome the problems faced by Python developers. Python developers would rather rely on a code search engine that is designed for searching Python packages exclusively.

1.1 Objective

This thesis seeks to contribute to the Python community by developing a dedicated Python search engine (PyQuery)4 that enables Python developers to search for Python Packages and Modules (code) and encourages them to take advantage of an important software development practice, code reuse. With PyQuery we want to facilitate the search process, improve query results, and collect packages and modules into one user-friendly interface that provides the functionality of a myriad of code search tools. PyQuery will also be able to synchronize with the Python Package Index to provide users with code documentation and downloads, thereby providing all steps in the code search process.

1.2 Approach

PyQuery is organized into three separate components: Data Collection, Data Indexing and Data Presentation. The Data Collection component is responsible for collecting all the data required to facilitate the search operation. This data includes source code for packages, metadata about packages from PyPI, preprocessed metadata about modules, code quality analysis for packages and other relevant information that helps us deliver meaningful search results. In order to provide the most recent updates to packages at PyPI, we ensure that the data we collect and maintain is always in sync with changes made at PyPI. For this reason, we use the Bandersnatch5 mirror client of PyPI which keeps track of changes utilizing state files.

1http://www.krugle.com/ 2https://searchcode.com/ 3https://code.openhub.net/ 4https://pypi.compgeom.com/ 5https://pypi.python.org/pypi/bandersnatch

The Data Indexing component stores all the data we have collected and processed in the Data Collection component in a structured schema that facilitates faster search queries for matching packages and modules. We used Elasticsearch (ES)6, a flexible and powerful, open source, distributed, real-time search and analytics engine built on top of Apache Lucene, to index our data. We rely on FileSystem River (FSRiver)7, an ES plugin, to index documents from the local file system using SSH. In ES, we used separate indexes for files related to module level search from those of package level search. By using this approach, we can map each query to its specific type and related files. The Data Presentation component delivers matched search results to the user in a fashion that is both appealing and easy to follow. We have used Flask8 for server side scripting. When a user queries for matching packages, we send a query to the index responsible for packages and retrieve the required details that allow the user to see the most significant packages, their scores, statistics and other relevant information. We implemented a ranking algorithm that fine tunes the ES results by sorting them based on various metrics. Additionally, when a user queries for matching modules, a request is sent to the ES index for modules, which contains metadata (e.g., class name, method name, etc.), to get a list of matches alongside their line numbers and the path to each module on the server. For every match, a code snippet containing the matching line is rendered using Pygments9. To reduce the time spent processing matched results, all the modules are preprocessed with Pygments and each line number is mapped to its starting byte address in the file, so that the server can quickly open the rendered file, seek to the calculated byte location, and pull the required piece of HTML code snippet.

6https://www.elastic.co/products/elasticsearch 7https://github.com/dadoonet/fsriver 8http://flask.pocoo.org/ 9http://pygments.org/

CHAPTER 2

RELATED WORK

Search engines employ metrics for scoring and ranking, but these metrics are often limited and do not differentiate the significant packages. Additionally, these metrics do not exhibit all the qualities that may be relevant to what a user wants out of a specific module or package. The PyPI [5] website is the exemplar for this project. When one searches for packages, PyPI follows a very simple search algorithm which gives a score for each package based on the query. Certain fields such as name, summary, and keywords are matched against the query and a binary score for each field is computed (basically a "yes; it matched" or "no; it didn't"). A weight is given for each field, and the composite scores from each field are added to create a total score for each package. Packages are first sorted by score and then in lexicographical order of package names. We found this information at stackoverflow1 and followed the steps given to confirm the working of the PyPI search algorithm. The above method employed by PyPI works, but it doesn't distinguish the packages very well. For example, searching for "flask" will yield 1750 results with a top score of 9 given to 162 packages (fortunately, the Flask package, which should be at the top when searched, is listed 4th due to the alphabetical sorting). This also makes it very easy to influence the outcome of popular queries if you are the package developer. An algorithm which resists the influence of a package owner would be a better fit for reliable package searches.

PyPI Ranking [6] is another website, created by Taichino, that is similar to PyPI and PyQuery in that it searches only Python packages and no other languages. It has a search function that takes in a user's search query and finds relevant packages. It also syncs with PyPI so that the user can access the information contained on PyPI such as documentation and downloads. The main difference, however, is that PyPI Ranking ranks packages based only on the number of downloads, so packages with more downloads will emerge higher up on the list. This means that packages get more value based on their popularity, which is a valuable metric, but not the only valuable

1http://stackoverflow.com/questions/28685680/what-does-weight-on-search-results-in-pypi-help-in-choosing-a-package

metric. Furthermore, the website only allows a package level search, whereas PyQuery contains both a package level search function and a module level search function, providing more resources to the user. Additionally, the website makes use of Django to facilitate the web development whereas PyQuery uses Flask.

There are multiple code search engines that allow users to look for written code that relates to their search query. These code search engines include websites such as Krugle2, Open Hub Code Search - BlackDuck3, and SearchCode4. These websites allow users to enter a search query, and then they list sample lines of code based on the results for that query. These websites are limited, however, because they can only search code at GitHub5, Sourceforge6 and other open source repositories. They are not contained within one context in which a user might want to find a specific package or module. Additionally, the websites do a search based purely on term occurrence, by identifying the user's search term within the lines of code and returning the code samples with numerous hits. The results a user receives for their search key may not address what they want, but rather just contain the term itself. Consequently, the results are not scored, due to the lack of relevant information to incorporate as metrics. PyQuery accesses data directly from PyPI, preprocesses the data to extract useful information from code, indexes the data within itself, searches the data, and reorders it based on a ranking function. PyQuery is also constructed within the Python community so that Python packages and modules are only ranked against other Python packages and modules. These results are more valuable due to the metrics they are based on and the nature of the searching algorithm.

In the past, people have attempted to do code search in languages like Java7 based on semantics that are test driven [7], which required users to specify what they are searching for as precisely as possible. This means that they need to provide both the syntax and the semantics of their target. Furthermore, they must pass a set of test cases that include sample input and expected output to filter the potential set of matches. This is a great technique to search for code that can be reused; however, it has its limitations. This tool requires the kind of detail regarding the input that the user will not know in the first place. This tool is more helpful for testing a reusable coding entity whose path from the

2http://www.krugle.com/ 3https://code.openhub.net/ 4https://searchcode.com/ 5https://github.com/ 6http://sourceforge.net/ 7http://www.oracle.com/technetwork/java/index.html

package root (e.g., str.capitalize()) is known to the user precisely. If a user is looking for code that capitalizes the first character of a string, he may guess the function name to be capitalize() but may not precisely know it can be found in the str package with the signature str.capitalize(). If a user already knows this information, he or she may directly look inside the usage documents to see if it meets his or her requirements (though he or she may have to execute test cases on their own).

Nullege is a dedicated search engine for Python source code [8]. As a keyword, it requires a UNIX-path-like structure ("/" replaced by ".") used in Python import statements that always starts at the level of the package root folder. Some sample queries for Nullege include "flask.app.Environment", "requests.adapters.BaseAdapter" and "scipy.add". Results from a search operation on Nullege point to the source code where the programming entity is imported. This is a useful tool for users who are familiar with the folder structure of a package and are generally curious about exploring its source code or learning which packages import it. A user can't directly pass a generic keyword that infers the purpose of the programming entity he or she is interested in. For users who want to learn whether there exists a reusable component for a specific task at hand and are not aware of the precise location to look at, Nullege is not the right tool. Because of the limitations imposed on the input and the type of results returned, Nullege can be classified as an exploration tool for the source code tree of Python packages rather than a search engine for source code. PyQuery allows users to perform a generic keyword search without limitations on input like those of Nullege. PyQuery results are usually code snippets that point to definitions of programming entities rather than import statements.

We have used the Abstract Syntax Tree (AST) to collect the various programming entity names and their line numbers in modules for code search. Many research topics that analyze software code use ASTs. Some applications of ASTs include Semantics-Based Code Search [7], Understanding source code evolution using abstract syntax tree matching [9] and Employing Source Code Information to Improve Question-Answering in Stack Overflow [10]. These implementations construct an AST for the code under consideration and extract the needed information by walking through the tree or directly visiting the required node. For this purpose, we have used the ast module [11] in Python. Chapter 3 elaborates on how we extract metadata about modules for code search.

CHAPTER 3

DATA COLLECTION

For any search engine to work, it requires data to perform search operations. Data could be anything: it could be of any form and any type. For the problem we plan to solve, we have to address the question "What kind of data are we interested in?". We are interested in data related to Python packages that can help us return meaningful results for a user query. We intend to provide two flavors of the search engine: Package Level Search and Module Level Search. Let us examine the tools and configurations that help us collect the data required to achieve this goal.

3.1 Package Level Search

A package is a collection of modules that are meant to solve problems of some type. For example, the "requests" package is developed to handle HTTP capabilities. According to its homepage1, it has various features, including International Domains and URLs, Keep-Alive and Connection Pooling, Sessions with Cookie Persistence, Browser-style SSL Verification, etc. A user interested in these features would like to use this library to solve his or her problem. A developer may produce a library and assign a name to it that may or may not directly have any relation to the purpose of the library. A user would get to know whether the library helps solve his problem not just by looking at its name alone but at the description, sample usage and other useful metadata about the package mentioned on its homepage. Sometimes, when a user has to pick between multiple packages that are trying to solve the same problem, criteria like popularity of the author, number of downloads, frequency of releases, code quality and efficiency start to factor in. A search engine that returns Python packages as matches to a user query would require similar information.

3.1.1 Metadata - Packages

We have discussed how a developer’s description of a package on its homepage, frequency of release, code quality, popularity of author and number of downloads helps a user to decide whether

1http://docs.python-requests.org/en/latest/

a given library solves his or her problem. PyQuery needs this information to search for matching packages for the user's query and, if multiple packages qualify, to prioritize one over the other. One direct way to get the description of a package is to crawl its homepage at PyPI2. Though this sounds pretty straightforward and easy, gathering URL information for the latest stable release of each package and maintaining this information could be tricky, and searching for the required information in crawled data could be time consuming. We found an elegant and much simpler way to gather the metadata of a package. PyPI allows users to access metadata information about a package via an HTTP request3. This returns a JSON file with keys such as description, author, package_url, downloads.last_month, downloads.last_week, downloads.last_day, releases.x.x.x, etc. For example, one can query the PyPI website for metadata about the "requests" package through URL4. Refer to Figure 3.1 for a sample response from PyPI.
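As a minimal sketch of this step (error handling omitted; the helper name is ours, not part of the PyQuery code base), the metadata for a package can be pulled with a single HTTP request:

# Minimal sketch: fetch package metadata from the PyPI JSON API.
import requests

def fetch_pypi_metadata(package_name):
    url = "http://pypi.python.org/pypi/{0}/json".format(package_name)
    return requests.get(url).json()

meta = fetch_pypi_metadata("requests")
print(meta["info"]["summary"])                  # short description
print(meta["info"]["downloads"]["last_month"])  # recent download count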

3.1.2 Code Quality

PEP 0008 – Style Guide for Python Code5 describes a set of semantic rules and guidelines that Python developers should incorporate into their code. These standards are highly encouraged by the Python community. Standard libraries that are shipped with the Python installation are written using these conventions. One main reason to emphasize a standardized style guide is to increase code readability. The Python code base is pretty huge, and it is important to maintain consistency across it. The conventions set in the Python style guide make the Python language beautiful and easy to follow as you read. The code quality of a package can be measured in multiple ways. First, we can check if the package under consideration follows the style guide for Python code. The Python community has tools to check a package's compliance with the style guide. pep86 is a simple Python module that uses only standard libraries and validates any Python code against the PEP 8 style guide. Pylint7 is another such tool that checks line length, variable names, unused imports, duplicate code and other coding standards against PEP 8.

2https://pypi.python.org/pypi 3http://pypi.python.org/pypi/<package name>/json 4http://pypi.python.org/pypi/requests/json 5https://www.python.org/dev/peps/pep-0008/ 6https://pypi.python.org/pypi/pep8 7http://www.pylint.org/

"info": {
  ...
  "package_url": "http://pypi.python.org/pypi/requests",
  "author": "Kenneth Reitz",
  "author_email": "[email protected]",
  "description": "Requests: HTTP for Humans...",
  ......
  "release_url": "http://pypi.python.org/pypi/requests/2.7.0",
  "downloads": {
    "last_month": 4002673,
    "last_week": 1307529,
    "last_day": 198964
  },
  ......
},
"releases": {
  "1.0.4": [
    {
      "has_sig": false,
      "upload_time": "2012-12-23T07:45:10",
      "comment_text": "",
      "python_version": "source",
      "url": "https://pypi.python.org/packages/source/r/requests/requests-1.0.4.tar.gz",
      "md5_digest": "0b7448f9e1a077a7218720575003a1b6",
      "downloads": 111768,
      "filename": "requests-1.0.4.tar.gz",
      "packagetype": "sdist",
      "size": 336280
    }
  ],
  ......
}

Figure 3.1: Metadata from PyPI for package requests.

pep8 and Pylint are great tools that we can use to check for code quality, but we have decided to use Prospector8, which brings together the functionality of both pep8 and Pylint. It also adds the functionality of the code complexity analysis tool called McCabe9. When a package is processed with Prospector, it gives a count of errors, warnings and messages along with their detailed descriptions. This information gives an indication of code quality. There is another set of information we can use for analyzing code quality. Developers care about how well the code is commented. The ratio of the number of comment lines to the total number of lines, the number of code lines to the total number of lines, and the number of warnings to the number of lines offer some metrics for code quality analysis. CLOC10 helps us acquire this information. As CLOC stands for Count Lines of Code, when we run CLOC on the Python package under consideration, it returns the total number of files, the number of comment lines, the number of blank lines and the number of code lines. We collect this information to check for code quality.
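A rough sketch of how these tools could be driven from Python is shown below; the exact command line flags (Prospector's JSON output and CLOC's --json switch) and output keys are assumptions that depend on the installed tool versions, not a transcript of our scripts.

# Hedged sketch: gather code quality counts for one package directory.
import json
import subprocess

def quality_metrics(package_dir):
    # Prospector (pep8 + Pylint + McCabe); "--output-format json" is assumed.
    prospector_out = subprocess.check_output(
        ["prospector", "--output-format", "json", package_dir])
    warnings = len(json.loads(prospector_out)["messages"])

    # CLOC line counts; the "--json" flag and "SUM" key are assumed.
    cloc_out = subprocess.check_output(["cloc", "--json", package_dir])
    totals = json.loads(cloc_out)["SUM"]

    return {"warnings": warnings,
            "comment_lines": totals["comment"],
            "blank_lines": totals["blank"],
            "code_lines": totals["code"]}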

3.2 Module Level Search

A module is a Python file with the extension ".py". It is a collection of multiple programming units such as classes, methods and variables. Some developers are interested in searching for these programming entities in a module, so we wanted to build a search engine for them. There are various steps involved in achieving this goal.

3.2.1 Mirror Python Packages

In order to allow users to perform module level search, i.e., allow users to search for classes, methods and variables, we need to extract this information from modules and the packages that hold them. We are interested in the source code of all Python packages available. If there is a new release for any package we already have, we want to update the information we have on this package. All of these operations could be complex or cumbersome if we were to do it by automating the process of downloading source code from each package's homepage (assuming we somehow managed to collect source code download URLs for all packages). We found a better alternative. We came across a practice followed by software development organizations. Some of these organizations

8https://github.com/landscapeio/prospector 9https://pypi.python.org/pypi/mccabe 10http://cloc.sourceforge.net/

would not like their developers to hit the world wide web to download the software packages they need for development. Instead, they maintain a local mirror of the PyPI repository from which developers can download the necessary packages without connecting to the Internet. Currently, PyPI is home to 50,000+ packages. It would be a single point of failure if it went down. In order to avoid such a disaster, PyPI has come up with PEP 38111, a mirroring infrastructure that can clone the entire PyPI repository onto a desired machine. People started making public and private mirrors using this infrastructure. For our purposes, we use Bandersnatch12, a client side implementation of PEP 381, to sync Python packages. When bandersnatch is executed for the first time it will mirror the entire PyPI, i.e., download all the Python packages. It will also maintain state files that track the current state of the repository, which are later used to sync with PyPI and pick up any updates made to the packages. A recurring cron job that executes the command "bandersnatch mirror" keeps the local repository always up to date.

3.2.2 Metadata - Modules

We have previously discussed that developers show interest in doing a code search for programming entities. We mirrored the entire PyPI repository onto our servers using bandersnatch. In order to enable code search, we have to find useful information in the modules of each package, i.e., get a list of programming entities for each module. There are many programming entities in a Python module, but we are mainly interested in classes, functions under classes, variables under classes, global functions, recurring inner functions inside global functions, variables inside global functions and global variables. We maintain each of them in a separate key so that we can give more weight to certain entities than others. To collect the required information, we iterate through all packages; within each package we iterate through all modules; for each module, we construct an Abstract Syntax Tree using the ast13 module from Python and perform a walk (visit all) operation on this tree. As the walk operation visits each programming entity, it invokes the corresponding function inside ast.NodeVisitor, such as visit_Name, visit_FunctionDef, visit_ClassDef and so on, according to the current element. We override the ast.NodeVisitor class and the functions inside it and perform the visit all operation on top of it so that we have control over the operations performed inside them. For example, during visit all, if a class is being visited, a

11https://www.python.org/dev/peps/pep-0381/ 12https://pypi.python.org/pypi/bandersnatch 13https://docs.python.org/2/library/ast.html

# Sample code for collecting metadata
class PyPINodeVisitor(ast.NodeVisitor):
    def visit_Name(self, node):
        # collect variable name and line number
        ...
    def visit_FunctionDef(self, node):
        # collect function name and line number
        ...
    def visit_ClassDef(self, node):
        # collect class name and line number
        ...
    def visit_all(self, node):
        # call super class visit function
        ...

if __name__ == "__main__":
    for modules in packages:
        for module in modules:
            tree = ast.parse(module)
            JSONfile = PyPINodeVisitor().visit_all(tree)

Figure 3.2: Pseudocode for collecting metadata of a module.

function call to visit_ClassDef is invoked. Since we have overridden this function, we are in control of the information passed to it and can decide what to do with it. We can collect information that is of interest to us, such as the names of the various classes and the line numbers at which they occur. Figure 3.2 is the pseudocode for using ast to generate the required metadata of a module. This way, we can collect all the metadata for a module and save it in a JSON format, making it available for Module Level Search. Figure 3.3 is an example of one such JSON file we have generated using this process. Each identifier is concatenated with its line number and additional underscores to reach a minimum length of 18. The reason behind this format is discussed in Chapter 4. As part of data collection, it is important that the information being collected is stored in an agreed format that enables better indexing and searching techniques.
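A minimal sketch of this padding scheme (the helper below is ours, not the exact code used) would look like the following:

def pad_identifier(name, lineno, min_len=18):
    # join identifier and line number, then right-pad with underscores
    token = "{0}_{1}".format(name, lineno)
    return token.ljust(min_len, "_")

pad_identifier("SessionMixin", 28)   # -> 'SessionMixin_28___'
pad_identifier("save_session", 315)  # -> 'save_session_315__'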

3.2.3 Code Quality

Similar to the method we applied to collect code quality for packages using Prospector and CLOC, we have also collected this information at the module level. We couldn't use Prospector to process a single module like we did for packages, so we used Pylint instead of Prospector. CLOC helped obtain the number of comment lines, number of blank lines and number of code lines at the module level.

12 { ‘‘class ’’: ‘‘ SessionMixin 2 8 TaggedJSONSerializer 55 SecureCookieSession 109 NullSession 1 1 9 SessionInterface 134 SecureCookieSessionInterface 272 ’’, ‘‘class function ’’: ‘‘ get s i g n i n g serializer 2 9 0 open session 3 0 1 save session 3 1 5 ’’, ‘‘class variable ’’: ‘‘ salt 2 7 8 digest method 280 key derivation 283 serializer 2 8 7 session class 2 8 8 self 2 9 0 app 2 9 0 signer kwargs 2 9 3 self 3 0 1 app 3 0 1 request 3 0 1 val 3 0 5 max age 3 0 8 self 3 1 5 app 3 1 5 session 3 1 5 response 3 1 5 domain 3 1 6 path 3 1 7 httponly 3 2 3 secure 3 2 4 expires 3 2 5 val 3 2 6 a t a 3 1 0 ’’, ‘‘function ’’: ‘‘ total seconds 2 4 ’’, ‘‘function function ’’: ‘‘’’, ‘‘function var’’: ‘‘’’, ‘‘module’ ’: ‘‘sessions ’’, ‘ ‘module path’ ’: ‘‘Flask −0.10.1/flask/sessions.py’’, ‘‘variable ’’: ‘‘ session json serializer 106 ’’ }

Figure 3.3: Metadata for a module in Flask package.

CHAPTER 4

DATA INDEXING AND SEARCHING

We used Elasticsearch (ES)1, a flexible and powerful, open source, distributed, real-time search and analytics engine built on top of Apache Lucene, to index our data and query the indexed data for both package level search and module level search. FileSystem River (FSRiver)2, an ES plugin, is used to index documents from a local file system or a remote file system (using SSH).

4.1 Data Indexing

4.1.1 Package Level Search

Extracted data for each Python package is indexed in ES using FSRiver, where the data for each Python package is considered a document. Although all fields in a document are indexed in the ES server, only the following fields are analyzed before indexing: name, summary, keywords and description (refer to Figure 4.1 for the ES mapping), using the ES Snowball analyzer3. The Snowball analyzer generates tokens using the standard tokenizer, removes English stop words and applies the standard filter, lowercase filter and snowball filter. The other fields are not analyzed before indexing, either because they are numbers (e.g., info.downloads.last_month) or because they are of no interest with respect to the search query. Figure 4.2 depicts the river definition, which actually indexes the package level data (in .json format) located on the server, looks for updates every 12 hours and reindexes the data if there is any update.

4.1.2 Module Level Search

Extracted data for each module in a Python package is indexed in ES, where the data for each module in a Python package is considered a document. All fields except module_path in a document are analyzed using a custom analyzer (refer to Figure 4.3 for the definition of the custom analyzer) before they are indexed. The custom analyzer generates tokens using a custom

1https://www.elastic.co/products/elasticsearch 2https://github.com/dadoonet/fsriver 3https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-snowball-analyzer.html

PUT packagedata/packageindex/_mapping
{
  "packageindex": {
    "properties": {
      "info": {
        "properties": {
          "name": {
            "type": "string",
            "index": "analyzed",
            "analyzer": "snowball"
          },
          "summary": {
            "type": "string",
            "index": "analyzed",
            "analyzer": "snowball"
          },
          "keywords": {
            "type": "string",
            "index": "analyzed",
            "analyzer": "snowball"
          },
          "description": {
            "type": "string",
            "index": "analyzed",
            "analyzer": "snowball"
          }
        }
      }
    }
  }
}

Figure 4.1: Package level data index mapping.

tokenizer (splitting by underscore, camel case and numbers, while at the same time preserving the original token) and uses the ES snowball filter4. The module_path field is not analyzed, as we will not look for matches in this field for a search query. We used the fast vector highlighter5 (by setting term_vector to with_positions_offsets in the mapping) instead of the plain highlighter in module level search.

4https://www.elastic.co/guide/en/elasticsearch/reference/1.4/analysis-snowball-tokenfilter.html 5https://www.elastic.co/guide/en/elasticsearch/reference/1.3/search-request-highlighting.html

PUT _river/packageindexriver/_meta
{
  "type": "fs",
  "fs": {
    "url": "/server/package/data/directory/path",
    "update_rate": "12h",
    "json_support": true
  },
  "index": {
    "index": "packagedata",
    "type": "packageindex"
  }
}

Figure 4.2: River definition.

The fast vector highlighter lets you define fragment size and number of matching fragments to be returned. Figure 4.4 depicts the mapping defined for module level data indexing.
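As a sanity check of the custom "code" analyzer, one could run a stored identifier through the Elasticsearch analyze API; the host, port and exact stemmed tokens below are assumptions, but an identifier such as "SecureCookieSession_109" is expected to yield the preserved original token plus camel-case and digit sub-tokens.

# Hedged sketch: inspect the tokens produced by the "code" analyzer (ES 1.x).
import requests

resp = requests.get("http://localhost:9200/moduledata/_analyze",
                    params={"analyzer": "code"},
                    data="SecureCookieSession_109")
print([t["token"] for t in resp.json()["tokens"]])
# expected, roughly: the lowercased original plus sub-tokens for "secure",
# "cookie" (stemmed), "session" and "109"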

4.2 Data Searching

We define our search query according to ES Query DSL to look for matches in the index for user queries. ES uses the Boolean model to find matching documents, and a formula called the practical scoring function to calculate relevance. This formula borrows concepts from term frequency/inverse document frequency and the vector space model but adds modern features like a coordination factor, field length normalization, and term or query clause boosting6.
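For reference, the practical scoring function documented for ES and Lucene has roughly the following shape (a paraphrase of the documentation cited above, not a formula specific to PyQuery; the exact boosts and norms depend on the mapping and the query):

\[ \mathrm{score}(q,d) = \mathrm{queryNorm}(q) \cdot \mathrm{coord}(q,d) \cdot \sum_{t \in q} \big( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^{2} \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(t,d) \big) \]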

4.2.1 Package Level Search

Figure 4.5 depicts the query used for package level search. ES looks for matches for the user search query in the following fields: name, author, summary, description and keywords. Based on the matches it ranks the results and returns name, author, summary, description, version, keywords, number of downloads in the last month for each top n ranked Python package, where n is the number of matching packages requested. Summary and description of a matched package are

6https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html

PUT moduledata
{
  "settings": {
    "analysis": {
      "filter": {
        "code1": {
          "type": "pattern_capture",
          "preserve_original": 1,
          "patterns": [
            "(\\p{Ll}+|\\p{Lu}\\p{Ll}+|\\p{Lu}+)",
            "(\\d+)"
          ]
        }
      },
      "analyzer": {
        "code": {
          "tokenizer": "pattern",
          "filter": ["code1", "lowercase", "snowball"]
        }
      }
    }
  }
}

Figure 4.3: Custom Analyzer with custom pattern filter.

returned in a section where matches are highlighted using the <em> tag, which we utilize in the user interface while displaying the results.

4.2.2 Module Level Search

Figure 4.6 depicts the query used for module level search. ES detects matches to the user search query in the following fields of a module document: class, class_function, class_variable, function, function_function, function_var and variable. Different weights are assigned to matches in different fields based on their importance. For example, a match in the class field will weigh more than a match in the function field of a module. Weights are assigned using a caret (^) sign followed by a number. Based on the matches, it ranks the results and returns the path to the module (module_path) where the match occurred. Using this information, we will retrieve the

PUT moduledata/moduleindex/_mapping
{
  "pypimtype": {
    "properties": {
      "module": {
        "type": "string",
        "store": "yes",
        "analyzer": "code"
      },
      "module_path": {
        "type": "string",
        "index": "not_analyzed"
      },
      "class": {
        "type": "string",
        "store": "yes",
        "analyzer": "code",
        "term_vector": "with_positions_offsets"
      },
      "class_function": {
        "type": "string",
        "store": "yes",
        "analyzer": "code",
        "term_vector": "with_positions_offsets"
      },
      "class_variable": {
        "type": "string",
        "store": "yes",
        "analyzer": "code",
        "term_vector": "with_positions_offsets"
      },
      "function": {
        "type": "string",
        "store": "yes",
        "analyzer": "code",
        "term_vector": "with_positions_offsets"
      },
      ......
    }
  }
}

Figure 4.4: Module level data indexing mapping.

GET packagedata/packageindex/_search
{
  "query": {
    "multi_match": {
      "query": "search query",
      "operator": "or",
      "fields": ["info.name^30", "info.author", "info.summary", "info.version", "info.keywords"]
    }
  },
  "fields": [
    "info.name", "info.author", "info.summary", "info.version", "info.keywords", "info.downloads.last_month"
  ],
  "highlight": {
    "fields": {
      "summary": {},
      "description": {}
    }
  }
}

Figure 4.5: Package level search query.

module's .py file and extract the relevant code segment for each matching module. We used the Fast Vector Highlighter (FVH) for module level search, which returns the top n fragments from each field (class, class_function, class_variable, function, function_function, function_var and variable) in a module document. The fragment size has been strategically defined to be 18, which is the minimum fragment size ES allows. While forming the module level search documents we appended the line number to each variable where it appears, so that we can use the module_path information and the line number information to retrieve the relevant code segment. If a variable appended with its line number is less than 18 characters long, we append underscores to make it 18. This trick is useful because ES FVH creates fragments based on fragment size and word boundaries. This way, ES creates a fragment for each variable in a field, looks for matches to the search query in each field and returns the n top fragments, where n is the number of fragments set in the query.

GET moduledata/moduleindex/_search
{
  "query": {
    "multi_match": {
      "query": "user query",
      "fields": ["class^5", "class_function", "class_variable", "function^4", "function_function", "function_var", "variable^3"]
    }
  },
  "fields": ["module_path"],
  "_source": false,
  "highlight": {
    "order": "score",
    "require_field_match": true,
    "fields": {
      "class": {"number_of_fragments": 5, "fragment_size": 18},
      "class_function": {"number_of_fragments": 5, "fragment_size": 18},
      "class_variable": {"number_of_fragments": 5, "fragment_size": 18},
      "function": {"number_of_fragments": 5, "fragment_size": 18},
      ......,
      "variable": {"number_of_fragments": 5, "fragment_size": 18}
    }
  }
}

Figure 4.6: Module level search query.

CHAPTER 5

DATA PRESENTATION

Once the data is indexed, it needs to be presented quickly and in a clear, presentable manner. We chose to create a search engine type of interface. Our goal for package level search was to provide a ranked list of relevant packages and their details for any given query. For module level search, we wanted to provide actual source code snippets related to the query. In order to display some of the source code to the user, preprocessing was necessary.

5.1 Server Setup

Our server has a simple stack setup. We use Nginx1 to handle requests, Flask2/Python3 to process and serve our data, and uWSGI4 as an interface between the web server and Flask. Elasticsearch (ES)5 is used to hold our data for package and module level search. To make sure we are using the latest packages from PyPI, we use databases to track the modified times of each package. Much of the rendering and manipulation of the browser interface is done using JavaScript6. Our JavaScript library of choice is jQuery7.
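A minimal sketch of such an edge node is given below; the endpoint path, index name and host are assumptions, and the real implementation adds the reranking and templating described later in this chapter.

# Hedged sketch: a Flask edge node that forwards a package query to ES.
from elasticsearch import Elasticsearch
from flask import Flask, jsonify, request

app = Flask(__name__)
es = Elasticsearch(["localhost:9200"])

@app.route("/search/packages")
def search_packages():
    user_query = request.args.get("q", "")
    body = {"query": {"multi_match": {
        "query": user_query,
        "fields": ["info.name^30", "info.author", "info.summary",
                   "info.version", "info.keywords"]}}}
    hits = es.search(index="packagedata", body=body)["hits"]["hits"]
    return jsonify(results=[hit["_source"]["info"] for hit in hits])

if __name__ == "__main__":
    app.run()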

5.2 Browser Interface

The interface features a home page with a simple text box for queries and a choice of either package or module level search. The results page displays the query information at the top and the results themselves below. The user can modify his or her query or change the search type from package level search to module level search and vice versa. In Chapter 6 we have added some sample screenshots of the browser interface.

1https://www.nginx.com/resources/wiki/ 2http://flask.pocoo.org/ 3https://www.python.org/ 4https://uwsgi-docs.readthedocs.org/en/latest/ 5https://www.elastic.co/products/elasticsearch 6https://developer.mozilla.org/en-US/docs/Web/JavaScript 7https://jqueryui.com/

Figure 5.1: Package modal.

5.3 Package Level Search

When a user sends a query to the server for package level search, the query is processed, and a ranked list of packages is sent back. Each package is depicted on the browser as a tile (refer to Figure 6.3). The tile provides minimal information about the package, including the package name, the author of the package, a brief description, the number of downloads and the score assigned by the ranking algorithm. The user has the option to click the tile to view more information about the package and to visit PyPI and other sites related to the package. When the user clicks on the tile, a modal is opened that contains a detailed description, the version, the source code homepage, the PyPI homepage, the score from the ranking algorithm (refer to Figure 5.1), statistics of the package as a bar graph, pie chart and numbers (refer to Figure 5.2), and other packages from the author (refer to Figure 5.3). This process relies on the ranking of packages.

Figure 5.2: Package statistics.

5.3.1 Ranking of Packages

One of the most important aspects which distinguishes this search engine from others is the use of many types of data for ranking the packages. In Chapter 3 and Chapter 4 we have discussed how all the preprocessed information about packages and modules, including basic details from PyPI, is stored in our ES server. We felt that the ES relevance algorithm was not thorough enough to return meaningful results. So we are using a few other metrics, namely Bing search results8, the number of downloads, the ratio of warnings to the total number of lines, the ratio of comments to the total number of lines and the ratio of code lines to the total number of lines (gathered by Prospector and CLOC, also visible in the package modal referenced in Figure 5.2). All of these metrics are passed as columns to the ranking algorithm. Together, these columns form a matrix. After the ranking algorithm is executed, it returns a list of packages reverse sorted by the newly generated scores and a dictionary mapping package names to their scores. Note that

8http://datamarket.azure.com/dataset/bing/search

Figure 5.3: Other packages from author.

the ES column and the Bing column are primary columns while the others are secondary derived columns. Tuning these column weights could be tricky. One way to fine tune this algorithm is to try different combinations of weights and learn which one works better. We can also fine tune this algorithm by adding more primary columns or secondary derived columns. For example, at the time of writing this thesis, we were experimenting with PageRank as one of the primary columns. We sought to calculate PageRank based on the import statements in each module. Just like in web pages, where a page "A" links to some other page "B", in Python modules we have import statements where a module "C" imports module "D". This can be considered a vote cast by "C" for module "D", and this information can be used to generate PageRank for modules and packages.

Ranking Algorithm.

1. Request ES server for top 20 results for user query.

2. Request Bing for top 20 results for user query.

3. Generate other relevant columns or metrics.

4. Pass the list of all columns generated in steps 1, 2 and 3 to the rerank function along with weights for each column.

5. Inside rerank function

(a) Find the max length of the columns, i.e. the number of rows in the matrix.

(b) Construct a results dictionary over the set union of all cells in the matrix, with the package name as key and the value (score) initialized to 0.

(c) For each row in the matrix

    i. For each cell in the row

        A. Find the package's score for its position using the formula (maxlen − row.rownumber) × weightvector(cell.columnnumber). Here "maxlen" is the total number of rows in the matrix, "row.rownumber" is the position of the current row in the matrix, and "weightvector(cell.columnnumber)" gives the weight assigned to the particular column the cell belongs to. (A worked example follows this list.)

        B. Add this score to the existing score of that package in the dictionary created in step 5(b). At the end of this step, each package in the results dictionary will have a cumulative score for its standing in the different columns.

(d) From the results dictionary, generate a list reverse sorted with the score as key. This gives the results list in reranked order.
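For illustration (with hypothetical numbers, not necessarily the exact weights used to produce Table 5.2): if the matrix has maxlen = 36 rows, the ES column has weight 0.4 and the downloads column has weight 0.1, then a package sitting in the first row (row.rownumber = 0) of the ES column earns (36 − 0) × 0.4 = 14.4, and the same package sitting in the third row (row.rownumber = 2) of the downloads column earns another (36 − 2) × 0.1 = 3.4, for a cumulative score of 17.8.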

Note that there could be duplicates between the top 20 results of the ES server and the top 20 results of Bing, so the total number of rows in the matrix passed to the ranking function is not always 40. Figure 5.4 is the sample pseudocode for the ranking algorithm. Table 5.1 is the sample matrix formed for the keyword "music". In this table you can notice the duplicate name "vis-framework" between the primary columns Elasticsearch and Bing. This only means that both primary columns agree that this is a relevant result for the given keyword. You can also notice duplicates within the same primary column, Bing. This scenario occurs when multiple versions of the same package become popular. From the table we can also notice that these duplicates are carried forward to the secondary columns, thus influencing the ranks. Whether to allow duplicates in the primary columns and whether to carry them forward to the secondary columns are decisions yet to be made. For now, this practice in the ranking algorithm has looked promising, with positive effects. As we investigate more use cases, if it turns out that duplicates have a negative influence, we can always eliminate them without disturbing

the nature of the ranking function. For the keyword "music", Table 5.2 shows the matching packages and their scores calculated by the ranking function, ordered in descending order.

5.4 Module Level Search

Refer to Figure 6.4 to view the template used for Module Level Search. Similar to package level search, the user sends a query to the server. However, this time around the server returns a list of code excerpts. One of our concerns was to keep the wait time of queries low. So a preprocessing step was added to retrieve the code faster.

5.4.1 Preprocessing

To reduce the wait time of searches, the source code needs to be adapted for the browser. As previously mentioned, Bandersnatch is used to create a local mirror of the PyPI repository on our server; however, only compressed packages are maintained by Bandersnatch. A decompression step is required to examine each Python module in plain text. Our aim was to show nicely formatted, stylized lines of code. Initially, we were going to send about twenty lines from the source code to the user and render the code on the user's side using a third party JavaScript library such as SyntaxHighlighter9. This worked well except for multi-line constructs such as doc strings. Since there is a possibility of missing lines of a doc string during client-side rendering, the renderer has no way of knowing how to stylize the doc string. Instead, we fixed this by rendering the code snippets before sending them to the client (server-side rendering). For this, we used Pygments10, a Python library for creating stylized code in Python and other languages in numerous formats such as HTML. This inevitably increases the amount of data sent from the server, but it ensures that the code is correctly stylized. Lines of Code (LOC) is another trick used to expedite the code display. It creates a "mapping" file for each module. The mapping file in this instance stores the exact byte at which each line in a module starts. When it is time to grab a snippet of code, the server can open the file and immediately seek to the correct spot rather than search or linearly pass through the rest of the lines. There could be a thousand or more lines of code that we avoid reading by using this technique. This cuts down on processing time and only requires another step in the preprocessing.

9http://alexgorbatchev.com/SyntaxHighlighter/ 10http://pygments.org/
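A simplified sketch of this mapping idea (an assumed file layout, not the exact implementation) is shown below: record the byte offset at which every line of the rendered module starts, then seek straight to a matched line when a snippet is needed.

# Hedged sketch: build a line-to-byte-offset map and use it to grab a snippet.
def build_line_offsets(path):
    offsets, position = [], 0
    with open(path, "rb") as handle:
        for line in handle:
            offsets.append(position)
            position += len(line)
    return offsets

def read_snippet(path, offsets, start_line, num_lines=20):
    with open(path, "rb") as handle:
        handle.seek(offsets[start_line - 1])  # jump directly, no linear scan
        return b"".join(handle.readline() for _ in range(num_lines))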

generateTop20ESResults():
{
    # query ES server and get top 20 results
}

generateTop20BingResults():
{
    # query bing and get top 20
    # filter out real packages from pages
}

generateOtherMetrics(listOfPackagesFromESandBing):
{
    for eachpackage in listOfPackagesFromESandBing:
        column_downloads = eachpackage.downloads
        column_warnings = eachpackage.warnings / eachpackage.lines
        column_comments = eachpackage.comments / eachpackage.lines
        column_code = eachpackage.code / eachpackage.lines
    column_downloads.sortReverse()  # more downloads, better the package
    column_warnings.sort()          # fewer warnings, better the package
    column_comments.sortReverse()   # more comments, better the package
    column_code.sortReverse()       # more code, better the package
    return column_downloads, column_warnings, column_comments, column_code
}

rerank(weightvector, matrix):
{
    maxlen = max(len(column) for column in matrix)
    # Create a dict as keys:0.0 from the union of cells in the matrix
    resultsDictionary = getDictionaryFromCells(matrix)
    # go through the matrix one row at a time
    for row in matrix:
        for cell in row:
            if cell.packagename:  # avoiding empty cells
                resultsDictionary[cell.packagename] += (maxlen - row.rownumber) * weightvector[cell.column_number]
    # higher the score, better the package
    resultsList = sortReverse(resultsDictionary, key=score)
    # resultsList gives you the order of packages
    # resultsDictionary gives the score of each package
    return resultsList, resultsDictionary
}

# execution starts here
columns = []
columns.append(generateTop20ESResults)            # primary column, more weight
columns.append(generateTop20BingResults)          # primary column, more weight
columns.extend(generateOtherMetrics(ES + Bing))   # secondary columns
weightvector = [0.4, 0.2, 0.1, 0.1, 0.1, 0.1]     # can vary
rerank(weightvector, columns)

Figure 5.4: Pseudocode for ranking algorithm.

Table 5.1: Ranking matrix for keyword music.

Elasticsearch | Bing | Downloads | Warnings/Lines | Comments/Lines | Code/Lines
vk-music | musicplayer | pylast | vk-music | mps-youtube | mopidy-musicbox-webclient
jmbo-music | music | mps-youtube | jmbo-music | mps-youtube | musicplayer
panya-music | mopidy-gmusic | mps-youtube | panya-music | vis-framework | musicplayer
tweet-music | pyspotify | jmbo-music | tweet-music | pyacoustid | mopidy-gmusic
google-music-playlist-importer | musicplayer | hachoir-metadata | google-music-playlist-importer | cherrymusic | hachoir-metadata
music-score-creator | mps-youtube | pyspotify | music-score-creator | music21 | bdmusic
spilleliste | music21 | mp3play | spilleliste | music21 | music21
raspberry jam | mps-youtube | bdmusic | raspberry jam | pyspotify | music21
kurzfile | gmusicapi | pyacoustid | kurzfile | mp3play | pyspotify
vis-framework | vis-framework | gmusicapi | vis-framework | gmusicapi | mp3play
gmusic-rating-sync | music22 | vis-framework | gmusic-rating-sync | bdmusic | pyacoustid
vkmusic | pygame-music-grid | vis-framework | vkmusic | hachoir-metadata | cherrymusic
melody-dl | bdmusic | music21 | melody-dl | musicplayer | vis-framework
leftasrain | mopidy-musicbox-webclient | music21 | leftasrain | musicplayer | gmusicapi
youtubegen | netease-musicbox | raspberry jam | youtubegen | mopidy-gmusic | mps-youtube
marlib | hachoir-metadata | mopidy-musicbox-webclient | marlib | vk-music | mps-youtube
chordgenerator | mp3play | musicplayer | chordgenerator | jmbo-music | vk-music
tempi | music21 | musicplayer | tempi | panya-music | jmbo-music
raindrop.py | pyacoustid | panya-music | raindrop.py | tweet-music | panya-music
pylast | cherrymusic | cherrymusic | pylast | google-music-playlist-importer | tweet-music
- | - | kurzfile | musicplayer | music-score-creator | google-music-playlist-importer
- | - | mopidy-gmusic | musicplayer | spilleliste | music-score-creator
- | - | music-score-creator | music21 | raspberry jam | spilleliste
- | - | chordgenerator | music21 | kurzfile | raspberry jam
- | - | tweet-music | mps-youtube | vis-framework | kurzfile
- | - | vkmusic | mps-youtube | gmusic-rating-sync | vis-framework
- | - | leftasrain | pyacoustid | vkmusic | gmusic-rating-sync
- | - | gmusic-rating-sync | vis-framework | melody-dl | vkmusic
- | - | vk-music | pyspotify | leftasrain | melody-dl
- | - | tempi | mopidy-musicbox-webclient | youtubegen | leftasrain
- | - | google-music-playlist-importer | mp3play | marlib | youtubegen
- | - | marlib | hachoir-metadata | chordgenerator | marlib
- | - | melody-dl | bdmusic | tempi | chordgenerator
- | - | raindrop.py | cherrymusic | raindrop.py | tempi
- | - | spilleliste | mopidy-gmusic | pylast | raindrop.py
- | - | youtubegen | gmusicapi | mopidy-musicbox-webclient | pylast

Table 5.2: Matching packages and their scores for keyword music.

Package Name | Score | Package Name | Score
musicplayer | 45.8 | music-score-creator | 13.8
mps-youtube | 44.6 | raspberry jam | 13.6
music21 | 39 | google-music-playlist-importer | 13.5
vis-framework | 33 | kurzfile | 12.5
pyspotify | 22.8 | spilleliste | 12.1
mopidy-gmusic | 20.8 | gmusic-rating-sync | 10.8
gmusicapi | 19 | vkmusic | 10.5
bdmusic | 18.6 | music22 | 10.4
hachoir-metadata | 17.8 | pygame-music-grid | 10
jmbo-music | 17.7 | leftasrain | 9.4
mp3play | 17.1 | melody-dl | 9.3
pyacoustid | 16.9 | pylast | 9
vk-music | 15.7 | netease-musicbox | 8.8
panya-music | 15.7 | chordgenerator | 8.2
mopidy-musicbox-webclient | 15.7 | youtubegen | 8
tweet-music | 14.6 | marlib | 7.9
cherrymusic | 14.5 | tempi | 7.1
music | 14 | raindrop.py | 6.2

We limited the number of code snippets returned to 20. Sending more than 20 matches for a module level search would generate a huge response and increase the response time. Before sending the top 20 results, we apply the ranking algorithm discussed for package level search, with changes to the input of the ranking function. As of now, there is only one primary column, namely the results from the ES server, and a list of secondary columns similar to the package level ones, but each of them (except for the downloads column) points to module level rather than package level statistics. For example, consider the column "Warnings/Lines": for package level search it is the ratio of the total number of warnings for a package to its total number of lines of code, while for module level search it is the ratio of the total number of warnings for a module to its total number of lines of code. After rendering the matching code snippets on the front end, for user convenience, we

have given the user the option to click on a code snippet, which opens a modal that displays the entire code of the module. This gives users more visibility into modules. Figure 5.5 shows this modal.

Figure 5.5: Module modal.
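As a rough sketch of how the module level secondary columns might be assembled for the reranker, consider the function below; the attribute names on the module objects (warnings, comments, code, lines, package_downloads) are illustrative assumptions, not PyQuery's actual schema.

    # Illustrative sketch: building module level secondary columns for the reranker.
    def module_metric_columns(matching_modules):
        downloads = []   # downloads of the package the module belongs to
        warnings = []    # warnings per line, computed per module
        comments = []    # comment lines per line, computed per module
        code = []        # code lines per line, computed per module
        for module in matching_modules:
            downloads.append((module.name, module.package_downloads))
            warnings.append((module.name, module.warnings / module.lines))
            comments.append((module.name, module.comments / module.lines))
            code.append((module.name, module.code / module.lines))
        # secondary columns are ranked best-first, as in Figure 5.4
        downloads.sort(key=lambda x: x[1], reverse=True)  # more downloads is better
        warnings.sort(key=lambda x: x[1])                 # fewer warnings is better
        comments.sort(key=lambda x: x[1], reverse=True)   # more comments is better
        code.sort(key=lambda x: x[1], reverse=True)       # more code is better
        return [downloads, warnings, comments, code]

These columns, together with the single primary column of ES results, form the matrix handed to the rerank function of Figure 5.4.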

CHAPTER 6

SYSTEM LEVEL FLOW DIAGRAM

Figure 6.1 is a System Level Flow Diagram of PyQuery. A set of Python scripts is run in batch overnight to generate all the required details mentioned in the Data Collection chapter. This preprocessed information is stored in JSON format. Before executing these batch scripts, the bandersnatch1 mirror client is run so that the local packages are in sync with PyPI2 and we deliver the most up to date information. All the generated files belong either to package search or to module search. We maintain separate Elasticsearch (ES)3 indexes for package search and module search. These indexes are configured to update at regular intervals whenever the files they point to change, and they form the core of PyQuery. A web interface customized for an easy flow of information to users is developed in Flask4 and deployed using an NGINX5 server. Figure 6.2 is PyQuery's homepage, whose design is mainly inspired by Google's homepage. It serves separate edge nodes for package search and module search. When a user hits the package level search page, an AJAX call is made to the edge node responsible for retrieving matching packages from the ES index. Based upon the matching packages retrieved from ES, a set of metrics is formulated and passed to the ranking algorithm discussed in Chapter 5. This returns to the requesting front end page a list of packages sorted in decreasing order of the score calculated by the ranking algorithm, so the highest scoring package is positioned at the top. Figure 6.3 shows the result of a Package Level Search on PyQuery for the keyword "flask". When a user hits the module level search page, an AJAX6 call is made to the edge node responsible for retrieving matching modules or lines of code based on the metadata index in ES. After collecting the matching modules and the line numbers at which the matches occurred, the Lines of Code (LOC) technique discussed in Chapter 5 is executed to quickly capture code snippets from the matching modules and display them on the requesting front end page. Figure 6.4 shows the result of a Module Level Search on PyQuery for the keyword "quicksort".

1https://pypi.python.org/pypi/bandersnatch 2https://pypi.python.org/pypi 3https://www.elastic.co/products/elasticsearch 4http://flask.pocoo.org/ 5https://www.nginx.com/resources/wiki/ 6http://api.jquery.com/jquery.ajax/

Figure 6.1: System Level Flow Diagram of PyQuery.
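As a rough illustration of how such a package search edge node could look, the sketch below serves an AJAX endpoint in Flask, queries Elasticsearch and reranks the hits; the endpoint path, index name, field names and the build_columns/rerank helpers are assumptions for the sketch, not PyQuery's actual code.

    from flask import Flask, jsonify, request
    from elasticsearch import Elasticsearch

    app = Flask(__name__)
    es = Elasticsearch()                 # assumes a local ES instance

    @app.route("/search/packages")       # hypothetical endpoint called via AJAX
    def package_search():
        keyword = request.args.get("q", "")
        # query the (assumed) package index; field names are illustrative
        hits = es.search(index="packages", body={
            "query": {"multi_match": {"query": keyword,
                                      "fields": ["name", "summary", "description"]}},
            "size": 20,
        })["hits"]["hits"]
        matches = [h["_source"] for h in hits]
        # build the metric columns and rerank as in Figure 5.4 (helpers assumed)
        columns = build_columns(matches)
        ranked, scores = rerank([0.4, 0.2, 0.1, 0.1, 0.1, 0.1], columns)
        return jsonify({"results": [{"package": p, "score": scores[p]} for p in ranked]})

The module level edge node follows the same pattern, except that it queries the module metadata index and runs the LOC snippet extraction before responding.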

Figure 6.2: PyQuery homepage.

Figure 6.3: PyQuery package level search template.

Figure 6.4: PyQuery module level search template.

CHAPTER 7

RESULTS

Our primary goal was to build a better PyPI search engine. We wanted to make the search more meaningful and to avoid a huge list of closely and similarly scored packages. With PyPI being the state of the art, we compare the results of PyQuery with those of PyPI. For the comparison, we search 5 keywords that directly match a package name on PyPI (Table 7.1 to Table 7.5) and 5 generic keywords that describe the purpose of a package (Table 7.6 to Table 7.10).

1. The total number of results returned by PyPI was highest for the keyword "Django", at 11,292, which is approximately 1/4 of the total number of packages on PyPI. On the other hand, PyQuery, with two primary columns in the ranking function, always returns at most 40 results. As a point of reference, on Google 81% of users view only one results page [12], which corresponds to about 10 results.

2. The maximum number of packages to which PyPI assigned the highest score for a single keyword was 162, for the keyword "flask". PyQuery always assigned the highest score to only one package.

3. For the first 5 keywords, where we expect a particular package to appear in first position because the keyword matches the package name, PyPI showed this behavior only 2 out of 5 times. PyQuery exhibited this behavior 5 out of 5 times.

4. For PyPI, the distribution of scores is very narrow. Among the top 5 scores, on 4 out of 10 occasions all 5 were identical, on 3 out of 10 occasions 4 of them were identical, and on 2 out of 10 occasions 3 of them were identical. PyQuery scores are more widely spread.

5. For the last 5 keywords, which do not point directly to any package name but instead describe a developer's need, we observe that PyQuery's results are more appealing, more diversely scored and unique. For example, for the keyword "web development framework", PyQuery returned all unique results, with packages like Django and Pyramid (widely used web development frameworks) in the top 5. On the other hand, among the top 5 results from PyPI there were only 3 unique results, some of which were related to web testing frameworks.

6. Among the top 5 results, on 4 out of 10 occasions PyPI returned duplicate package names referring to multiple versions of the same package. PyQuery always returns only one result per package, referring to its latest version.

Based on the above grounds of comparison, it is clear that we have met our goals: to improve on PyPI, offer a meaningful search and avoid closely and similarly scored packages.

Table 7.1: Results comparison for keyword - requests.

Keyword: requests
# of results from PyPI: 5117
# of results from PyQuery: 28
Top 5 results from PyPI: cache requests, curl to requests, drequests, helga-pull-requests, jsonrpc-requests
Top 5 results from PyQuery: Requests, Requests-OAuthlib, Requests-Mock, Requests-Futures, Requests-Oauth
First 5 scores - PyPI: 9, 9, 9, 9, 9
First 5 scores - PyQuery: 131.60, 68.50, 51.00, 45.60, 39.90
# of packages with highest score - PyPI: 21
# of packages with highest score - PyQuery: 1
Rank (score) of expected match - PyPI: 7 (9)
Rank (score) of expected match - PyQuery: 1 (131.60)

Table 7.2: Results comparison for keyword - flask.

Keyword: flask
# of results from PyPI: 1750
# of results from PyQuery: 37
Top 5 results from PyPI: airbrake flask, draftin a flask, fireflask, Flask, Flask-AlchemyDumps
Top 5 results from PyQuery: Flask, Flask-Admin, Flask-JSONRPC, Flask-Restless, Flask Debug-toolbar
First 5 scores - PyPI: 9, 9, 9, 9, 9
First 5 scores - PyQuery: 46.70, 44.80, 41.20, 26.10, 25.90
# of packages with highest score - PyPI: 162
# of packages with highest score - PyQuery: 1
Rank (score) of expected match - PyPI: 4 (9)
Rank (score) of expected match - PyQuery: 1 (46.70)

Table 7.3: Results comparison for keyword - pygments.

Keyword: pygments
# of results from PyPI: 250
# of results from PyQuery: 29
Top 5 results from PyPI: Pygments, django mce pygments, pygments-asl, pygments-gchangelog, pygments-rspec
Top 5 results from PyQuery: Pygments, pygments-style-github, Pygments-Xslfo-Formatter, Bibtex-Pygments-Lexer, Mistune
First 5 scores - PyPI: 11, 9, 9, 9, 9
First 5 scores - PyQuery: 88.10, 31.50, 28.50, 24.90, 19.30
# of packages with highest score - PyPI: 1
# of packages with highest score - PyQuery: 1
Rank (score) of expected match - PyPI: 1 (11)
Rank (score) of expected match - PyQuery: 1 (88.10)

Table 7.4: Results comparison for keyword - Django.

Keyword: Django
# of results from PyPI: 11292
# of results from PyQuery: 38
Top 5 results from PyPI: Django, django-hstore, django-modelsatts, django-notifications-hq, django-notifications-hq
Top 5 results from PyQuery: Django, Django-Appconf, Django-, Django-Nose, Django-Inplaceedit
First 5 scores - PyPI: 10, 10, 10, 10, 10
First 5 scores - PyQuery: 75.30, 25.50, 25.20, 25.20, 23.50
# of packages with highest score - PyPI: 6
# of packages with highest score - PyQuery: 1
Rank (score) of expected match - PyPI: 1 (10)
Rank (score) of expected match - PyQuery: 1 (73.30)

Table 7.5: Results comparison for keyword - pylint.

Keyword: pylint
# of results from PyPI: 361
# of results from PyQuery: 25
Top 5 results from PyPI: gt-pylint-commit-hook, plpylint, pylint, pylint-patcher, pylint-
Top 5 results from PyQuery: Pylint, Pylint2tusar, Django-Jenkins, Pylama pylint, Logilab-Astng
First 5 scores - PyPI: 9, 9, 9, 9, 9
First 5 scores - PyQuery: 91.90, 21.80, 20.60, 17.30, 16.00
# of packages with highest score - PyPI: 6
# of packages with highest score - PyQuery: 1
Rank (score) of expected match - PyPI: 3 (9)
Rank (score) of expected match - PyQuery: 1 (91.90)

Table 7.6: Results comparison for keyword - biological computing.

Keyword: biological computing
# of results from PyPI: 6
# of results from PyQuery: 23
Top 5 results from PyPI: blacktie, appdynamics, appdynamics, appdynamics, inspyred
Top 5 results from PyQuery: BiologicalProcessNetworks, Blacktie, PyDSTool, PySCeS, Csb
First 5 scores - PyPI: 3, 2, 2, 2, 2
First 5 scores - PyQuery: 15.10, 14.10, 12.90, 12.50, 11.80
# of packages with highest score - PyPI: 1
# of packages with highest score - PyQuery: 1
Relevant packages among top 5 - PyPI: 1
Relevant packages among top 5 - PyQuery: 1, 2, 3, 4, 5

Table 7.7: Results comparison for keyword - 3D printing.

Keyword: 3D printing
# of results from PyPI: 26
# of results from PyQuery: 31
Top 5 results from PyPI: fabby, tangible, blockmodel, citygml2stl, demakein
Top 5 results from PyQuery: Pymeshio, Demakein, C3d, Bqclient, Pyautocad
First 5 scores - PyPI: 7, 7, 6, 5, 4
First 5 scores - PyQuery: 44.00, 21.50, 18.90, 18.90, 18.60
# of packages with highest score - PyPI: 2
# of packages with highest score - PyQuery: 1
Relevant packages among top 5 - PyPI: 1, 2, 3, 4, 5
Relevant packages among top 5 - PyQuery: 1, 2, 3, 4, 5

Table 7.8: Results comparison for keyword - web development framework.

Keyword: web development framework
# of results from PyPI: 801
# of results from PyQuery: 32
Top 5 results from PyPI: HalWeb, WebPages, robotframework-extendedselenium2library, robotframework-extendedselenium2library, robotframework-extendedselenium2library
Top 5 results from PyQuery: Django, Pyramid, Pylons, Moya, Circuits
First 5 scores - PyPI: 16, 16, 15, 15, 15
First 5 scores - PyQuery: 65.80, 48.20, 37.60, 32.60, 23.80
# of packages with highest score - PyPI: 2
# of packages with highest score - PyQuery: 1
Relevant packages among top 5 - PyPI: 1, 2
Relevant packages among top 5 - PyQuery: 1, 2, 3, 4, 5

Table 7.9: Results comparison for keyword - material science.

Keyword: material science
# of results from PyPI: 52
# of results from PyQuery: 29
Top 5 results from PyPI: py bonemat abaqus, MatMethods, MatMiner, pymatgen, pymatgen
Top 5 results from PyQuery: FiPy, Pymatgen, Pymatgen-Db, Custodian, Mpmath
First 5 scores - PyPI: 7, 6, 6, 6, 6
First 5 scores - PyQuery: 57.50, 55.20, 20.30, 19.70, 17.50
# of packages with highest score - PyPI: 1
# of packages with highest score - PyQuery: 1
Relevant packages among top 5 - PyPI: 1, 2, 3, 4, 5 (Note: 4 and 5 are duplicate results.)
Relevant packages among top 5 - PyQuery: 1, 2, 3, 4

Table 7.10: Results comparison for keyword - google maps.

Keyword: google maps
# of results from PyPI: 290
# of results from PyQuery: 31
Top 5 results from PyPI: Product.ATGoogleMaps, trytond google maps, django-google-maps, djangocms-gmaps, Flask-GoogleMaps
Top 5 results from PyQuery: Googlemaps, Django-Google-Maps, Flask-GoogleMaps, Gmaps, Geolocation-Python
First 5 scores - PyPI: 18, 18, 14, 14, 14
First 5 scores - PyQuery: 50.20, 39.40, 39.20, 39.00, 38.60
# of packages with highest score - PyPI: 2
# of packages with highest score - PyQuery: 1
Relevant packages among top 5 - PyPI: 1, 2, 3, 4, 5 (Note: results are relevant to the query but miss general purpose packages like Googlemaps among the top 5.)
Relevant packages among top 5 - PyQuery: 1, 2, 3, 4, 5

CHAPTER 8

CONCLUSIONS AND FUTURE WORK

We believe we have succeeded in developing a dedicated search engine for Python packages and modules, and we expect the Python community to adopt PyQuery widely. PyQuery would allow Python developers to explore well written, widely adopted, popular and highly apt Python packages and modules for their programming needs. It offers itself as an encouraging tool for the Python community to follow the software engineering practice of code reuse.

8.1 Thesis Summary

In this thesis we have proposed some concrete ideas on how to develop a dedicated search engine for Python packages and modules. We have sought to build an improved version of the state of the art Python search engine, PyPI. Although PyPI is the first and only tool to address this problem, its results are found to be of little use for user needs and requirements. We have discussed various tools and techniques that are brought together as one single tool, called PyQuery, to facilitate better search, better ranking and better package visibility. With PyQuery we want to bridge the gap between the high demand for ways to deliver reusable Python components for code reuse and the lack of efficient tools at users' disposal to achieve it. In Chapter 1 we discussed the relevance of this problem and our objective and approach towards solving it. In Chapter 2 we highlighted the related work in this area. For package level search, PyPI being the only search engine that performs Python package search, we elaborated on how the PyPI search algorithm works and offered reasons as to why we think it needs improvement. For module level search, there is no dedicated code search engine for Python, so we explored code search engines that work across multiple languages and reasoned about the need for a dedicated one for Python. PyQuery is divided into three components: Data Collection, Data Indexing and Data Presentation. Since we intend to provide two modes of search, i.e. Package Level Search and Module Level Search, at each component we employ a list of tools and techniques to achieve specific goals related to these modes. In Chapter 3 we discussed the Data Collection module and the use of

the Bandersnatch1 PEP 381 mirror client to clone Python packages locally, which are later processed with code analysis tools such as Prospector2 and CLOC3. We explored how to make use of Abstract Syntax Trees (ast) to extract useful information, or metadata, from Python modules. We also described the JSON file format used for saving all this information, with an example for each type of data. In Chapter 4 we demonstrated how to feed structured data to Elasticsearch (ES)4 and how to make use of the FS River5 and Analyzer6 plugins to digest the fed data. ES is built on top of Apache Lucene7 and offers a wide variety of methods to configure data indexing and data retrieval. We explained the purpose behind agreeing on a specific format for the JSON files so that we can make use of the configuration options ES offers. One such configuration is the minimum fragment size. By setting the minimum fragment size to 18 and storing each identifier together with its line number as a single underscore-separated word, right padded with underscores to a minimum length of 18, we were able to get a matching identifier and its line number as one single match. This reduced the size of the JSON files indexed in ES drastically and also saves the time needed to fetch the line number from another key. In this chapter we also outlined some sample queries to index the structured data and retrieve meaningful information from it. In Chapter 5 we covered data presentation concepts such as the browser interface and the server setup. We discussed our implementation of a server side ranking algorithm for package and module level search, the columns involved in the ranking metrics and an example view of these columns for a sample query. We also presented our preprocessing implementation for faster code search, which involves generating the starting byte address of each line in a module and rendering code with Pygments. In Chapter 6 we gave an overview of how all three components of PyQuery work together, with a system level flow diagram. Finally, in Chapter 7 we compared the results of PyQuery with those of PyPI to show that we have achieved our goals: to improve on PyPI, offer a meaningful search and avoid closely and similarly scored packages.

1https://pypi.python.org/pypi/bandersnatch 2https://github.com/landscapeio/prospector 3http://cloc.sourceforge.net/ 4https://www.elastic.co/products/elasticsearch 5https://github.com/dadoonet/fsriver 6https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-snowball-analyzer.html 7https://lucene.apache.org/core/
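A minimal sketch of the identifier-plus-line-number encoding described above is shown below; the helper name and the sample identifiers are hypothetical, and only the underscore joining and the padding to a minimum length of 18 follow the description in Chapter 4.

    def encode_identifier(name, line_number, min_length=18):
        # Join the identifier and its line number into one underscore-separated word,
        # then right-pad with underscores until the minimum fragment size is reached.
        token = "{}_{}".format(name, line_number)
        return token.ljust(min_length, "_")

    # A short identifier gets padded so that Elasticsearch returns the identifier
    # and its line number together in a single highlighted fragment.
    print(encode_identifier("quicksort", 42))      # quicksort_42______
    print(encode_identifier("binary_search", 7))   # binary_search_7___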

8.2 Recommendation for Future Work

Although PyQuery accomplishes the goals initially established, there is definitely scope for improvement. In this section we list ways to improve it further. We want to perform a large scale comparison of PyQuery. Currently, we have tested PyQuery with a set of keywords for which we know the matching packages, and we observed that PyQuery does better than the state of the art, PyPI. Python is an extensive language, and people from many different fields use it to solve problems in their respective disciplines. In this process there is a continuous production of useful packages, and thousands of packages are popular for various reasons. Knowing in advance all the possible keywords that map to these packages is nearly impossible. A tool gains popularity and importance only when it is widely accepted by its user base. By reaching out to Python developers from various disciplines, we can gauge how well PyQuery maps keywords to the right packages. We plan to run large scale user surveys, asking professional developers to search for packages they use both by direct package name and by keywords that describe a package's purpose. We want to collect their feedback, learn whether PyQuery meets their requirements and does a better job than PyPI, and list the use cases where PyQuery needs to do better. We can also extend PyQuery into a recommendation system. We can apply a collaborative filtering technique, i.e., capture user actions to learn their likes and dislikes of the Python packages we suggest and later use this data to predict a list of packages a user would find interesting. This would allow further improvements to PyQuery. If a user trusts a specific author and tends to explore packages developed by that author more often, we could make the search results more appealing by promoting packages from that author among the set of initially matched packages. If a user tends to explore packages specific to a field or category, it is likely that he or she works in that field; if a user management component is added to PyQuery in the future, then every time a user logs in to the website we can suggest popular packages from his or her field on a dashboard, or suggest the latest news about updates to packages in that field. These are a few of the many possibilities collaborative filtering offers for facilitating better search operations. It would allow developers to receive the latest information on packages and help them make the best of Python packages and modules. Many successful giants in the field of entertainment, like Netflix and Comcast, use collaborative filtering to

always keep their users engaged with their websites. Since PyQuery seeks to help developers explore Python packages, it could find great purpose for collaborative filtering.
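As a rough sketch of the kind of collaborative filtering this could involve, the snippet below scores unseen packages by their co-occurrence with packages a user has already explored; the interaction data, function name and scoring choice (cosine similarity over user sets) are purely hypothetical and are not part of PyQuery today.

    # Hypothetical item-based collaborative filtering over package exploration data.
    from collections import defaultdict
    from math import sqrt

    def recommend(interactions, user, top_n=5):
        # interactions maps each user to the set of packages he or she explored
        package_users = defaultdict(set)
        for u, packages in interactions.items():
            for p in packages:
                package_users[p].add(u)
        seen = interactions.get(user, set())
        scores = defaultdict(float)
        # score unseen packages by cosine similarity to packages the user explored
        for p in seen:
            for q, users_q in package_users.items():
                if q in seen:
                    continue
                overlap = len(package_users[p] & users_q)
                if overlap:
                    scores[q] += overlap / sqrt(len(package_users[p]) * len(users_q))
        return sorted(scores, key=scores.get, reverse=True)[:top_n]

    interactions = {
        "alice": {"flask", "requests", "gunicorn"},
        "bob": {"flask", "django", "requests"},
        "carol": {"django", "requests"},
    }
    print(recommend(interactions, "carol"))   # ['flask', 'gunicorn']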

BIBLIOGRAPHY

[1] Caitlin Sadowski, Kathryn T Stolee, and Sebastian Elbaum. How developers search for code: a case study. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, pages 191–201. ACM, 2015.

[2] Asimina Zaimi, Noni Triantafyllidou, Androklis Mavridis, Theodore Chaikalis, Ignatios Deligiannis, Panagiotis Sfetsos, and Ioannis Stamelos. An empirical study on the reuse of third-party libraries in open-source software development. In Proceedings of the 7th Balkan Conference on Informatics Conference, page 4. ACM, 2015.

[3] Andy Lynex and Paul J Layzell. Organisational considerations for software reuse. Annals of Software Engineering, 5(1):105–124, 1998.

[4] David C Rine and Robert M Sonnemann. Investments in reusable software. a study of software reuse investment success factors. Journal of systems and software, 41(1):17–32, 1998.

[5] Python Software Foundation. PyPI. https://pypi.python.org/pypi.

[6] Taichino. PyPI ranking. http://pypi-ranking.info/alltime, 2012.

[7] Steven P Reiss. Semantics-based code search. In Software Engineering, 2009. ICSE 2009. IEEE 31st International Conference on, pages 243–253. IEEE, 2009.

[8] Nullege: a search engine for Python source code. http://nullege.com/.

[9] Iulian Neamtiu, Jeffrey S Foster, and Michael Hicks. Understanding source code evolution using abstract syntax tree matching. ACM SIGSOFT Software Engineering Notes, 30(4):1–5, 2005.

[10] Themistoklis Diamantopoulos and Andreas L Symeonidis. Employing source code information to improve question-answering in stack overflow.

[11] ast: Abstract Syntax Trees. https://docs.python.org/2/library/ast.html.

[12] Bernard J Jansen and Amanda Spink. How are we searching the world wide web? a comparison of nine search engine transaction logs. Information Processing & Management, 42(1):248–263, 2006.

BIOGRAPHICAL SKETCH

My name is Shiva Krishna Imminni and I was born in the metropolitan city of Hyderabad, India. My father, Mr. Nageswara Rao, is a government employee and my mother, Mrs. Subba Laxmi, is a homemaker. They are my biggest inspiration and support. I am the elder of my parents' two children. My sister, Ramya Krishna Imminni, is very close to my heart and is very special to me. My family is the guiding force behind the success I have had in my career. I received my Bachelor's degree from Jawaharlal Nehru Technology University in May 2011 and joined FactSet Research Systems as a QA Automation Analyst. At FactSet, I wrote QA Automation scripts in various languages like , Ruby and Jscript, and worked on automation frameworks like TestComplete and Selenium. I was one among the first three employees hired for the QA Automation process, so I had many opportunities to try various job roles and experiment with new technologies. Out of all the job roles I took on, I liked training new hires the most. I was promoted to QA Automation Analyst 2 in a short span of 1 year and awarded Star Performer for the year 2013. It was at FactSet that I developed Testlogger, a Ruby library to generate log files, custom built for QA terminology like and . I worked at FactSet for 2 years, from November 2011 to December 2013, and gained diverse experience performing various roles. I joined the Department of Computer Science at Florida State University as a Master of Science student in Spring 2013. At FSU I have continued to gain professional experience working part time as a Software Developer and Graduate Research Assistant at iDigInfo, the Institute of Digital Information and Scientific Communication. At iDigInfo, I worked on various projects related to research in specimen digitization. Some of these projects include Morphbank, a continuously growing database of images that scientists use for international collaboration, research and education, and iDigInfo-OCR, an optical character recognition software for digitizing label information of specimen collections. I also worked as a Graduate Teaching Assistant for a Bachelor level Software Engineering course. As a part of my course work, I took a Python course under Dr. Piyush Kumar that led to my interest in working on PyQuery. The experience I gained while working on PyQuery helped me get an internship opportunity: during the Summer of 2015, I interned with Bank of America, where I worked on various technologies related to Big Data, including Hadoop HDFS, Hive and Impala.
