FLORIDA STATE UNIVERSITY

COLLEGE OF ARTS AND SCIENCES

PYQUERY:

A SEARCH ENGINE FOR PYTHON PACKAGES AND MODULES

By

SHIVA KRISHNA IMMINNI

A Thesis submitted to the Department of Computer Science in partial fulfillment of the requirements for the degree of Master of Science

2015

Copyright 2015 Shiva Krishna Imminni. All Rights Reserved.

Shiva Krishna Imminni defended this thesis on November 13, 2015. The members of the supervisory committee were:

Piyush Kumar Professor Directing Thesis

Sonia Haiduc Committee Member

Margareta Ackerman Committee Member

The Graduate School has verified and approved the above-named committee members, and certifies that the thesis has been approved in accordance with university requirements.

I dedicate this thesis to my family. I am grateful to my loving parents, Nageswara Rao and Subba Laxmi, who made me the person I am today. I am thankful to my affectionate sister, Ramya Krishna, who is very special to me and always stood by my side.

ACKNOWLEDGMENTS

I owe thanks to many people. Firstly, I would like to express my gratitude to Dr. Piyush Kumar for directing my thesis. Without his continuous support, patience, guidance and immense knowledge, PyQuery wouldn't be so successful. He truly made a difference in my life by introducing me to the Python programming language and helping me learn how to contribute to the Python community. He trusted me and remained patient during the difficult times. I would also like to thank Dr. Sonia Haiduc and Dr. Margareta Ackerman for serving on the committee, monitoring my progress and providing insightful comments. They helped me learn multiple perspectives that widened my research. I would like to thank my team members Mir Anamul Hasan, Michael Duckett, Puneet Sachdeva and Sudipta Karmakar for their time, support, commitment and contributions to PyQuery.

TABLE OF CONTENTS

List of Tables
List of Figures
Abstract

1 Introduction
  1.1 Objective
  1.2 Approach

2 Related Work

3 Data Collection
  3.1 Package Level Search
    3.1.1 Metadata - Packages
    3.1.2 Code Quality
  3.2 Module Level Search
    3.2.1 Mirror Python Packages
    3.2.2 Metadata - Modules
    3.2.3 Code Quality

4 Data Indexing and Searching
  4.1 Data Indexing
    4.1.1 Package Level Search
    4.1.2 Module Level Search
  4.2 Data Searching
    4.2.1 Package Level Search
    4.2.2 Module Level Search

5 Data Presentation
  5.1 Server Setup
  5.2 Browser Interface
  5.3 Package Level Search
    5.3.1 Ranking of Packages
  5.4 Module Level Search
    5.4.1 Preprocessing

6 System Level Flow Diagram

7 Results

8 Conclusions and Future Work
  8.1 Thesis Summary
  8.2 Recommendation for Future Work

Bibliography
Biographical Sketch

LIST OF TABLES

5.1 Ranking matrix for keyword music.
5.2 Matching packages and their scores for keyword music.
7.1 Results comparison for keyword - requests.
7.2 Results comparison for keyword - flask.
7.3 Results comparison for keyword - pygments.
7.4 Results comparison for keyword - ...
7.5 Results comparison for keyword - pylint.
7.6 Results comparison for keyword - biological computing.
7.7 Results comparison for keyword - 3D printing.
7.8 Results comparison for keyword - framework.
7.9 Results comparison for keyword - material science.
7.10 Results comparison for keyword - google maps.

LIST OF FIGURES

3.1 Metadata from PyPI for package requests.
3.2 Pseudocode for collecting metadata of a module.
3.3 Metadata for a module in Flask package.
4.1 Package level data index mapping.
4.2 River definition.
4.3 Custom Analyzer with custom pattern filter.
4.4 Module level data indexing mapping.
4.5 Package level search query.
4.6 Module level search query.
5.1 Package modal.
5.2 Package statistics.
5.3 Other packages from author.
5.4 Pseudocode for ranking algorithm.
5.5 Module modal.
6.1 System Level Flow Diagram of PyQuery.
6.2 PyQuery home page.
6.3 PyQuery package level search template.
6.4 PyQuery module level search template.

ABSTRACT

The Python Package Index (PyPI) is a repository that hosts all the packages ever developed for the Python community. It hosts thousands of packages from different developers and, for the Python community, it is the primary source for downloading and installing packages. It also provides a simple web interface to search for these packages. A direct search on PyPI returns hundreds of packages that are not intuitively ordered, making it harder to find the right package. Developers consequently resort to mature search engines like Google, Bing or Yahoo, which redirect them to the appropriate package homepage at PyPI. Hence, the first task of this thesis is to improve search results for Python packages. Secondly, this thesis also attempts to develop a new search engine that allows Python developers to perform a code search targeting Python modules. Currently, existing search engines classify programming languages such that a developer must select a programming language from a list. As a result, every time a developer performs a search operation, he or she has to choose Python out of a plethora of programming languages. This thesis seeks to offer a more reliable and dedicated search engine that caters specifically to the Python community and ensures a more efficient way to search for Python packages and modules.

CHAPTER 1

INTRODUCTION

Python is a high-level programming language built around simplicity and efficiency. It emphasizes code simplicity and can perform more functionality in fewer lines of code. In order to streamline code and speed up development, many developers use application packages, which reduce the need to copy definitions into each program. These packages consist of written applications in the form of different modules, which contain the actual code. Python's main power lies within these packages and the wide range of functionality that they bring to the software development field. Providing ways and means to deliver information about these reusable components is of utmost importance. PyPI, a software repository for Python packages, offers a search feature to look for available packages meeting the user's needs. It implements a trivial ranking algorithm to detect matching packages for a user's keyword, resulting in a poorly sorted, huge list of closely and similarly scored packages. From this immense list of results, it is hard to find a package that meets the user's needs in a reasonable amount of time. Due to the lack of an efficient native search engine for the Python community, developers often rely on mature and multipurpose search engines like Google, Yahoo and Bing. In order to express his or her interest in Python packages, a developer taking this route has to articulate a query and, on top of that, provide additional input. A dedicated search engine for the Python community would bypass the need to specify one's interest in Python. One may argue that such a search engine would not alter the experience of a developer who is searching for Python packages. However, considering that a developer on average searches for packages, code and related information in five search sessions with 12 total queries each workday [1], a search engine that saves this time and effort is desirable. An additional software development practice that saves time is code reuse [2]. There are many factors that influence the practice of code reuse [3] [4]. One such factor is the availability of the right tools to find reusable components. Searching for code is a serious challenge

for developers. Code search engines such as Krugle1, Searchcode2 and BlackDuck3 have attempted to ameliorate the hardship of searching for code by targeting a wide range of languages. Currently, Python developers who conduct code searches have to learn how to configure these search engines so that they display Python-specific results. As a result, although these search engines exemplify the ideal that one product may solve all kinds of problems, such an ideal fails to overcome the problems faced by Python developers. Python developers would rather rely on a code search engine that is designed for searching Python packages exclusively.

1.1 Objective

This thesis seeks to contribute to the Python community by developing a dedicated Python search engine (PyQuery)4 that enables Python developers to search for Python Packages and Modules (code) and encourages them to take advantage of an important software development practice, code reuse. With PyQuery we want to facilitate the search process, improve query results, and collect packages and modules into one user-friendly interface that provides the functionality of a myriad of code search tools. PyQuery will also be able to synchronize with the Python Package Index to provide users with code documentation and downloads, thereby providing all steps in the code search process.

1.2 Approach

PyQuery is organized into three separate components: Data Collection, Data Indexing and Data Presentation. The Data Collection component is responsible for collecting all the data required to facilitate the search operation. This data includes source code for packages, metadata about packages from PyPI, preprocessed metadata about modules, code quality analysis for packages and other relevant information that helps us deliver meaningful search results. In order to provide the most recent updates to packages at PyPI, we ensure that the data we collect and maintain is always in sync with changes made at PyPI. For this reason, we use the Bandersnatch5 mirror client of PyPI which keeps track of changes utilizing state files.

1http://www.krugle.com/ 2https://searchcode.com/ 3https://code.openhub.net/ 4https://pypi.compgeom.com/ 5https://pypi.python.org/pypi/bandersnatch

The Data Indexing component stores all the data we have collected and processed in the Data Collection component in a structured schema that facilitates faster search queries for matching packages and modules. We used Elasticsearch (ES)6, a flexible and powerful, open source, distributed, real-time search and analytics engine built on top of Apache Lucene, to index our data. We rely on FileSystem River (FSRiver)7, an ES plugin, to index documents from the local file system using SSH. In ES, we used separate indexes for files related to module level search from those of package level search. By using this approach, we can map each query to its specific type and related files. The Data Presentation component delivers matched search results to the user in a fashion that is both appealing and easy to follow. We have used Flask8 for server side scripting. When a user queries for matching packages, we send a query to the index responsible for packages and retrieve the required details that allow the user to see the most significant packages, their scores, statistics and other relevant information. We implemented a ranking algorithm that fine tunes the ES results by sorting them based on various metrics. Additionally, when a user queries for matching modules, a request is sent to the ES index for modules, which contains metadata (e.g., class name, method name, etc.), to get a list of matches alongside their line numbers and the path to each module on the server. For every match, a code snippet containing the matching line is rendered using Pygments9. To reduce the time spent processing matched results, all the modules are preprocessed with Pygments and each line number is mapped to its starting byte address in the file, so that the server can quickly open the rendered file, seek to the calculated byte location, and pull the required piece of HTML code snippet.

6https://www.elastic.co/products/elasticsearch 7https://github.com/dadoonet/fsriver 8http://flask.pocoo.org/ 9http://pygments.org/

CHAPTER 2

RELATED WORK

Search engines employ metrics for scoring and ranking, but these metrics are often limited and do not differentiate the significant packages. Additionally, these metrics do not exhibit all the qualities that may be relevant to what a user wants out of a specific module or package. The PyPI [5] website is the exemplar for this project. When one searches for packages, PyPI follows a very simple search algorithm which gives a score for each package based on the query. Certain fields such as name, summary, and keywords are matched against the query and a binary score for each field is computed (basically a "yes; it matched" or "no; it didn't"). A weight is given for each field, and the composite scores from each field are added to create a total score for each package. Packages are first sorted by score and then in lexicographical order of package names. We found this information at stackoverflow1 and followed the steps given to confirm the working of the PyPI search algorithm. The above method employed by PyPI works, but it doesn't distinguish the packages very well. For example, searching for "flask" will yield 1750 results with a top score of 9 given to 162 packages (fortunately, the Flask package, which should be at the top when searched, is listed 4th due to the alphabetical sorting). This also makes it very easy to influence the outcome of popular queries if you are the package developer. An algorithm which resists the influence of a package owner would be a better fit for reliable package searches.

PyPI Ranking [6] is another website, created by Taichino, that is similar to PyPI and PyQuery in that it searches only Python packages and no other languages. It has a search function that takes in a user's search query and finds relevant packages. It also syncs with PyPI so that the user can access the information contained on PyPI such as documentation and downloads. The main difference, however, is that PyPI Ranking ranks packages based only on the number of downloads, so packages with more downloads will emerge higher up on the list. This means that packages get more value based on their popularity, which is a valuable metric, but not the only valuable

1http://stackoverflow.com/questions/28685680/what-does-weight-on-search-results-in-pypi-help-in-choosing-a-package

metric. Furthermore, the website only allows a package level search, whereas PyQuery contains both a package level search function and a module level search function, providing more resources to the user. Additionally, the website makes use of Django to facilitate the web development whereas PyQuery uses Flask.

There are multiple code search engines that allow users to look for written code that relates to their search query. These code search engines include websites such as Krugle2, Open Hub Code Search - BlackDuck3, and SearchCode4. These websites allow users to enter a search query, and then they list sample lines of code based on the results for that query. These websites are limited, however, because they can only search code at GitHub5, Sourceforge6 and other open source repositories. They are not contained within one context in which a user might want to find a specific package or module. Additionally, the websites do a search based purely on term occurrence, by identifying the user's search term within the lines of code and returning the code samples with numerous hits. The results a user receives for their search key may not address what they want, but rather just contain the term itself. Consequently, the results are not scored, due to the lack of relevant information to incorporate as metrics. PyQuery accesses data directly from PyPI, preprocesses the data to extract useful information from code, indexes the data within itself, searches the data, and reorders it based on a ranking function. PyQuery is also constructed within the Python community so that Python packages and modules are only ranked against other Python packages and modules. These results are more valuable due to the metrics they are based on and the nature of the searching algorithm.

In the past, people have attempted to do code search in languages like Java7 based on semantics that are test driven [7], which required users to specify what they are searching for as precisely as possible. This means that they need to provide both the syntax and the semantics of their target. Furthermore, they must pass a set of test cases that include sample input and expected output to filter the potential set of matches. This is a great technique to search for code that can be reused; however, it has its limitations. This tool requires the kind of detail regarding the input that the user will not know in the first place. This tool is more helpful for testing a reusable coding entity whose path from the

2http://www.krugle.com/ 3https://code.openhub.net/ 4https://searchcode.com/ 5https://github.com/ 6http://sourceforge.net/ 7http://www.oracle.com/technetwork/java/index.html

package root (e.g., str.capitalize()) is known to the user precisely. If a user is looking for code that capitalizes the first character of a string, he may guess the function name to be capitalize() but may not precisely know it can be found in the str package with the signature str.capitalize(). If a user already knows this information, he or she may directly look inside the usage documents to see if it meets his or her requirements (though he or she may have to execute test cases on their own).

Nullege is a dedicated search engine for Python source code [8]. As a keyword, it requires a UNIX-path-like structure ("/" replaced by ".") used in Python import statements that always starts at the level of the package root folder. Some sample queries for Nullege include "flask.app.Environment", "requests.adapters.BaseAdapter" and "scipy.add". Results from a search operation on Nullege point to the source code where the programming entity is imported. This is a useful tool for users who are familiar with the folder structure of a package and are generally curious about exploring its source code or learning which packages import it. A user can't directly pass a generic keyword that infers the purpose of the programming entity he or she is interested in. For users who want to learn whether there exists a reusable component for a specific task at hand and are not aware of the precise location to look at, Nullege is not the right tool. Because of the limitations imposed on the input and the type of results returned, Nullege can be classified as an exploration tool for the source code tree of Python packages rather than a search engine for source code. PyQuery allows users to perform a generic keyword search without limitations on input like those of Nullege. PyQuery results are usually code snippets that point to definitions of programming entities rather than import statements.

We have used the Abstract Syntax Tree (AST) to collect the various programming entity names and their line numbers in modules for code search. Many research topics that analyze software code use ASTs. Some applications of ASTs include Semantics-Based Code Search [7], Understanding source code evolution using abstract syntax tree matching [9] and Employing Source Code Information to Improve Question-Answering in Stack Overflow [10]. These implementations construct an AST for the code under consideration and extract the needed information by walking through the tree or directly visiting the required node. For this purpose, we have used the ast module [11] in Python. Chapter 3 elaborates on how we extract metadata about modules for code search.

CHAPTER 3

DATA COLLECTION

For any search engine to work, it requires data to perform search operations. Data could be anything: it could be of any form and any type. For the problem we plan to solve, we have to address the question "What kind of data are we interested in?". We are interested in data related to Python packages that can help us return meaningful results for a user query. We intend to provide two flavors of the search engine: Package Level Search and Module Level Search. Let us examine the tools and configurations that help us collect the data required to achieve this goal.

3.1 Package Level Search

A package is a collection of modules that are meant to solve problems of some type. For example, the "requests" package is developed to handle HTTP capabilities. According to its homepage1, it has various features, including International Domains and URLs, Keep-Alive and Connection Pooling, Sessions with Cookie Persistence, Browser-style SSL Verification, etc. A user interested in these features would like to use this library to solve his or her problem. A developer may produce a library and assign a name to it that may or may not directly have any relation to the purpose of the library. A user would get to know whether the library helps solve his problem not just by looking at its name alone but at the description, sample usage and other useful metadata about the package mentioned on its homepage. Sometimes, when a user has to pick between multiple packages that are trying to solve the same problem, criteria like popularity of the author, number of downloads, frequency of releases, code quality and efficiency start to factor in. A search engine that returns Python packages as matches to a user query would require similar information.

3.1.1 Metadata - Packages

We have discussed how a developer’s description of a package on its homepage, frequency of release, code quality, popularity of author and number of downloads helps a user to decide whether

1http://docs.python-requests.org/en/latest/

a given library solves his or her problem. PyQuery needs this information to search for matching packages for the user's query and, if multiple packages qualify, to prioritize one over the other. One direct way to get the description of a package is to crawl its homepage at PyPI2. Though this sounds pretty straightforward and easy, gathering URL information for the latest stable release of each package and maintaining this information could be tricky, and searching for the required information in crawled data could be time consuming. We found an elegant and much simpler way to gather the metadata of a package. PyPI allows users to access metadata information about a package via an HTTP request3. This returns a JSON file with keys such as description, author, package_url, downloads.last_month, downloads.last_week, downloads.last_day, releases.x.x.x, etc. For example, one can query the PyPI website for metadata about the "requests" package through URL4. Refer to Figure 3.1 for a sample response from PyPI.
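As a minimal sketch of this step (error handling omitted; the helper name is ours, not part of the PyQuery code base), the metadata for a package can be pulled with a single HTTP request:

# Minimal sketch: fetch package metadata from the PyPI JSON API.
import requests

def fetch_pypi_metadata(package_name):
    url = "http://pypi.python.org/pypi/{0}/json".format(package_name)
    return requests.get(url).json()

meta = fetch_pypi_metadata("requests")
print(meta["info"]["summary"])                  # short description
print(meta["info"]["downloads"]["last_month"])  # recent download count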

3.1.2 Code Quality

PEP 0008 – Style Guide for Python Code5 describes a set of semantic rules and guidelines that Python developers should incorporate into their code. These standards are highly encouraged by the Python community. Standard libraries that are shipped with the Python installation are written using these conventions. One main reason to emphasize a standardized style guide is to increase code readability. The Python code base is pretty huge, and it is important to maintain consistency across it. The conventions set in the Python style guide make the Python language beautiful and easy to follow as you read. The code quality of a package can be measured in multiple ways. First, we can check if the package under consideration follows the style guide for Python code. The Python community has tools to check a package's compliance with the style guide. pep86 is a simple Python module that uses only standard libraries and validates any Python code against the PEP 8 style guide. Pylint7 is another such tool that checks line length, variable names, unused imports, duplicate code and other coding standards against PEP 8.

2https://pypi.python.org/pypi 3http://pypi.python.org/pypi/<package name>/json 4http://pypi.python.org/pypi/requests/json 5https://www.python.org/dev/peps/pep-0008/ 6https://pypi.python.org/pypi/pep8 7http://www.pylint.org/

"info": {
  ...
  "package_url": "http://pypi.python.org/pypi/requests",
  "author": "Kenneth Reitz",
  "author_email": "[email protected]",
  "description": "Requests: HTTP for Humans...",
  ......
  "release_url": "http://pypi.python.org/pypi/requests/2.7.0",
  "downloads": {
    "last_month": 4002673,
    "last_week": 1307529,
    "last_day": 198964
  },
  ......
},
"releases": {
  "1.0.4": [
    {
      "has_sig": false,
      "upload_time": "2012-12-23T07:45:10",
      "comment_text": "",
      "python_version": "source",
      "url": "https://pypi.python.org/packages/source/r/requests/requests-1.0.4.tar.gz",
      "md5_digest": "0b7448f9e1a077a7218720575003a1b6",
      "downloads": 111768,
      "filename": "requests-1.0.4.tar.gz",
      "packagetype": "sdist",
      "size": 336280
    }
  ],
  ......
}

Figure 3.1: Metadata from PyPI for package requests.

pep8 and Pylint are great tools that we can use to check for code quality, but we have decided to use Prospector8, which brings together the functionality of both pep8 and Pylint. It also adds the functionality of the code complexity analysis tool called McCabe9. When a package is processed with Prospector, it gives a count of errors, warnings and messages along with their detailed descriptions. This information gives an indication of code quality. There is another set of information we can use for analyzing code quality. Developers care about how well the code is commented. The ratio of the number of comment lines to the total number of lines, the number of code lines to the total number of lines, and the number of warnings to the number of lines offer some metrics for code quality analysis. CLOC10 helps us acquire this information. As CLOC stands for Count Lines of Code, when we run CLOC on the Python package under consideration, it returns the total number of files, the number of comment lines, the number of blank lines and the number of code lines. We collect this information to check for code quality.
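A rough sketch of how these tools could be driven from Python is shown below; the exact command line flags (Prospector's JSON output and CLOC's --json switch) and output keys are assumptions that depend on the installed tool versions, not a transcript of our scripts.

# Hedged sketch: gather code quality counts for one package directory.
import json
import subprocess

def quality_metrics(package_dir):
    # Prospector (pep8 + Pylint + McCabe); "--output-format json" is assumed.
    prospector_out = subprocess.check_output(
        ["prospector", "--output-format", "json", package_dir])
    warnings = len(json.loads(prospector_out)["messages"])

    # CLOC line counts; the "--json" flag and "SUM" key are assumed.
    cloc_out = subprocess.check_output(["cloc", "--json", package_dir])
    totals = json.loads(cloc_out)["SUM"]

    return {"warnings": warnings,
            "comment_lines": totals["comment"],
            "blank_lines": totals["blank"],
            "code_lines": totals["code"]}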

3.2 Module Level Search

A module is a Python file with the extension ".py". It is a collection of multiple programming units such as classes, methods and variables. Some developers are interested in searching for these programming entities in a module, so we wanted to build a search engine for them. There are various steps involved in achieving this goal.

3.2.1 Mirror Python Packages

In order to allow users to perform module level search, i.e., allow users to search for classes, methods and variables, we need to extract this information from modules and the packages that hold them. We are interested in the source code of all Python packages available. If there is a new release for any package we already have, we want to update the information we have on this package. All of these operations could be complex or cumbersome if we were to do it by automating the process of downloading source code from each package's homepage (assuming we somehow managed to collect source code download URLs for all packages). We found a better alternative. We came across a practice followed by software development organizations. Some of these organizations

8https://github.com/landscapeio/prospector 9https://pypi.python.org/pypi/mccabe 10http://cloc.sourceforge.net/

would not like their developers to hit the world wide web to download the software packages they need for development. Instead, they maintain a local mirror of the PyPI repository from which developers can download the necessary packages without connecting to the Internet. Currently, PyPI is home to 50,000+ packages. It would be a single point of failure if it went down. In order to avoid such a disaster, PyPI has come up with PEP 38111, a mirroring infrastructure that can clone the entire PyPI repository onto a desired machine. People started making public and private mirrors using this infrastructure. For our purposes, we use Bandersnatch12, a client side implementation of PEP 381, to sync Python packages. When bandersnatch is executed for the first time it will mirror the entire PyPI, i.e., download all the Python packages. It will also maintain state files that track the current state of the repository, which are later used to sync with PyPI and pick up any updates made to the packages. A recurring cron job that executes the command "bandersnatch mirror" keeps the local repository always up to date.

3.2.2 Metadata - Modules

We have previously discussed that developers show interest in doing a code search for programming entities. We mirrored the entire PyPI repository onto our servers using bandersnatch. In order to enable code search, we have to find useful information in the modules of each package, i.e., get a list of programming entities for each module. There are many programming entities in a Python module, but we are mainly interested in classes, functions under classes, variables under classes, global functions, recurring inner functions inside global functions, variables inside global functions and global variables. We maintain each of them in a separate key so that we can give more weight to certain entities than others. To collect the required information, we iterate through all packages; within each package we iterate through all modules; for each module, we construct an Abstract Syntax Tree using the ast13 module from Python and perform a walk (visit all) operation on this tree. As the walk operation visits each programming entity, it invokes the corresponding function inside ast.NodeVisitor, such as visit_Name, visit_FunctionDef, visit_ClassDef and so on, according to the current element. We override the ast.NodeVisitor class and the functions inside it and perform the visit all operation on top of it so that we have control over the operations performed inside them. For example, during visit all, if a class is being visited, a

11https://www.python.org/dev/peps/pep-0381/ 12https://pypi.python.org/pypi/bandersnatch 13https://docs.python.org/2/library/ast.html

# Sample code for collecting metadata
class PyPINodeVisitor(ast.NodeVisitor):
    def visit_Name(self, node):
        # collect variable name and line number
        ...
    def visit_FunctionDef(self, node):
        # collect function name and line number
        ...
    def visit_ClassDef(self, node):
        # collect class name and line number
        ...
    def visit_all(self, node):
        # call super class visit function
        ...

if __name__ == "__main__":
    for modules in packages:
        for module in modules:
            tree = ast.parse(module)
            JSONfile = PyPINodeVisitor().visit_all(tree)

Figure 3.2: Pseudocode for collecting metadata of a module.

function call to visit_ClassDef is invoked. Since we have overridden this function, we are in control of the information passed to it and can decide what to do with it. We can collect information that is of interest to us, such as the names of the various classes and the line numbers at which they occur. Figure 3.2 is the pseudocode for using ast to generate the required metadata of a module. This way, we can collect all the metadata for a module and save it in a JSON format, making it available for Module Level Search. Figure 3.3 is an example of one such JSON file we have generated using this process. Each identifier is concatenated with its line number and additional underscores to reach a minimum length of 18. The reason behind this format is discussed in Chapter 4. As part of data collection, it is important that the information being collected is stored in an agreed format that enables better indexing and searching techniques.
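A minimal sketch of this padding scheme (the helper below is ours, not the exact code used) would look like the following:

def pad_identifier(name, lineno, min_len=18):
    # join identifier and line number, then right-pad with underscores
    token = "{0}_{1}".format(name, lineno)
    return token.ljust(min_len, "_")

pad_identifier("SessionMixin", 28)   # -> 'SessionMixin_28___'
pad_identifier("save_session", 315)  # -> 'save_session_315__'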

3.2.3 Code Quality

Similar to the method we applied to collect code quality for packages using Prospector and CLOC, we have also collected this information at the module level. We couldn't use Prospector to process a single module like we did for packages, so we used Pylint instead of Prospector. CLOC helped obtain the number of comment lines, number of blank lines and number of code lines at the module level.

12 { ‘‘class ’’: ‘‘ SessionMixin 2 8 TaggedJSONSerializer 55 SecureCookieSession 109 NullSession 1 1 9 SessionInterface 134 SecureCookieSessionInterface 272 ’’, ‘‘class function ’’: ‘‘ get s i g n i n g serializer 2 9 0 open session 3 0 1 save session 3 1 5 ’’, ‘‘class variable ’’: ‘‘ salt 2 7 8 digest method 280 key derivation 283 serializer 2 8 7 session class 2 8 8 self 2 9 0 app 2 9 0 signer kwargs 2 9 3 self 3 0 1 app 3 0 1 request 3 0 1 val 3 0 5 max age 3 0 8 self 3 1 5 app 3 1 5 session 3 1 5 response 3 1 5 domain 3 1 6 path 3 1 7 httponly 3 2 3 secure 3 2 4 expires 3 2 5 val 3 2 6 a t a 3 1 0 ’’, ‘‘function ’’: ‘‘ total seconds 2 4 ’’, ‘‘function function ’’: ‘‘’’, ‘‘function var’’: ‘‘’’, ‘‘module’ ’: ‘‘sessions ’’, ‘ ‘module path’ ’: ‘‘Flask −0.10.1/flask/sessions.py’’, ‘‘variable ’’: ‘‘ session json serializer 106 ’’ }

Figure 3.3: Metadata for a module in Flask package.

CHAPTER 4

DATA INDEXING AND SEARCHING

We used Elasticsearch (ES)1, a flexible and powerful, open source, distributed, real-time search and analytics engine built on top of Apache Lucene, to index our data and query the indexed data for both package level search and module level search. FileSystem River (FSRiver)2, an ES plugin, is used to index documents from a local file system or a remote file system (using SSH).

4.1 Data Indexing

4.1.1 Package Level Search

Extracted data for each Python package is indexed in ES using FSRiver, where the data for each Python package is considered a document. Although all fields in a document are indexed in the ES server, only the following fields are analyzed before indexing: name, summary, keywords and description (refer to Figure 4.1 for the ES mapping), using the ES Snowball analyzer3. The Snowball analyzer generates tokens using the standard tokenizer, removes English stop words and applies the standard filter, lowercase filter and snowball filter. The other fields are not analyzed before indexing, either because they are numbers (e.g., info.downloads.last_month) or because they are of no interest with respect to the search query. Figure 4.2 depicts the river definition, which actually indexes the package level data (in .json format) located on the server, looks for updates every 12 hours and reindexes the data if there is any update.

4.1.2 Module Level Search

Extracted data for each module in a Python package is indexed in ES, where the data for each module in a Python package is considered a document. All fields except module_path in a document are analyzed using a custom analyzer (refer to Figure 4.3 for the definition of the custom analyzer) before they are indexed. The custom analyzer generates tokens using a custom

1https://www.elastic.co/products/elasticsearch 2https://github.com/dadoonet/fsriver 3https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-snowball-analyzer.html

PUT packagedata/packageindex/_mapping
{
  "packageindex": {
    "properties": {
      "info": {
        "properties": {
          "name": {
            "type": "string",
            "index": "analyzed",
            "analyzer": "snowball"
          },
          "summary": {
            "type": "string",
            "index": "analyzed",
            "analyzer": "snowball"
          },
          "keywords": {
            "type": "string",
            "index": "analyzed",
            "analyzer": "snowball"
          },
          "description": {
            "type": "string",
            "index": "analyzed",
            "analyzer": "snowball"
          }
        }
      }
    }
  }
}

Figure 4.1: Package level data index mapping.

tokenizer (splitting by underscore, camel case and numbers, while at the same time preserving the original token) and uses the ES snowball filter4. The module_path field is not analyzed, as we will not look for matches in this field for a search query. We used the fast vector highlighter5 (by setting term_vector to with_positions_offsets in the mapping) instead of the plain highlighter in module level search.

4https://www.elastic.co/guide/en/elasticsearch/reference/1.4/analysis-snowball-tokenfilter.html 5https://www.elastic.co/guide/en/elasticsearch/reference/1.3/search-request-highlighting.html

PUT _river/packageindexriver/_meta
{
  "type": "fs",
  "fs": {
    "url": "/server/package/data/directory/path",
    "update_rate": "12h",
    "json_support": true
  },
  "index": {
    "index": "packagedata",
    "type": "packageindex"
  }
}

Figure 4.2: River definition.

The fast vector highlighter lets you define fragment size and number of matching fragments to be returned. Figure 4.4 depicts the mapping defined for module level data indexing.
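As a sanity check of the custom "code" analyzer, one could run a stored identifier through the Elasticsearch analyze API; the host, port and exact stemmed tokens below are assumptions, but an identifier such as "SecureCookieSession_109" is expected to yield the preserved original token plus camel-case and digit sub-tokens.

# Hedged sketch: inspect the tokens produced by the "code" analyzer (ES 1.x).
import requests

resp = requests.get("http://localhost:9200/moduledata/_analyze",
                    params={"analyzer": "code"},
                    data="SecureCookieSession_109")
print([t["token"] for t in resp.json()["tokens"]])
# expected, roughly: the lowercased original plus sub-tokens for "secure",
# "cookie" (stemmed), "session" and "109"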

4.2 Data Searching

We define our search query according to ES Query DSL to look for matches in the index for user queries. ES uses the Boolean model to find matching documents, and a formula called the practical scoring function to calculate relevance. This formula borrows concepts from term frequency/inverse document frequency and the vector space model but adds modern features like a coordination factor, field length normalization, and term or query clause boosting6.
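For reference, the practical scoring function documented for ES and Lucene has roughly the following shape (a paraphrase of the documentation cited above, not a formula specific to PyQuery; the exact boosts and norms depend on the mapping and the query):

\[ \mathrm{score}(q,d) = \mathrm{queryNorm}(q) \cdot \mathrm{coord}(q,d) \cdot \sum_{t \in q} \big( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^{2} \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(t,d) \big) \]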

4.2.1 Package Level Search

Figure 4.5 depicts the query used for package level search. ES looks for matches for the user search query in the following fields: name, author, summary, description and keywords. Based on the matches it ranks the results and returns name, author, summary, description, version, keywords, number of downloads in the last month for each top n ranked Python package, where n is the number of matching packages requested. Summary and description of a matched package are

6https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html

PUT moduledata
{
  "settings": {
    "analysis": {
      "filter": {
        "code1": {
          "type": "pattern_capture",
          "preserve_original": 1,
          "patterns": [
            "(\\p{Ll}+|\\p{Lu}\\p{Ll}+|\\p{Lu}+)",
            "(\\d+)"
          ]
        }
      },
      "analyzer": {
        "code": {
          "tokenizer": "pattern",
          "filter": ["code1", "lowercase", "snowball"]
        }
      }
    }
  }
}

Figure 4.3: Custom Analyzer with custom pattern filter.

returned in a section where matches are highlighted using the <em> tag, which we utilize in the user interface while displaying the results.

4.2.2 Module Level Search

Figure 4.6 depicts the query used for module level search. ES detects matches to the user search query in the following fields of a module document: class, class_function, class_variable, function, function_function, function_var and variable. Different weights are assigned to matches in different fields based on their importance. For example, a match in the class field will weigh more than a match in the function field of a module. Weights are assigned using a caret (^) sign followed by a number. Based on the matches, it ranks the results and returns the path to the module (module_path) where the match occurred. Using this information, we will retrieve the

PUT moduledata/moduleindex/_mapping
{
  "pypimtype": {
    "properties": {
      "module": {
        "type": "string",
        "store": "yes",
        "analyzer": "code"
      },
      "module_path": {
        "type": "string",
        "index": "not_analyzed"
      },
      "class": {
        "type": "string",
        "store": "yes",
        "analyzer": "code",
        "term_vector": "with_positions_offsets"
      },
      "class_function": {
        "type": "string",
        "store": "yes",
        "analyzer": "code",
        "term_vector": "with_positions_offsets"
      },
      "class_variable": {
        "type": "string",
        "store": "yes",
        "analyzer": "code",
        "term_vector": "with_positions_offsets"
      },
      "function": {
        "type": "string",
        "store": "yes",
        "analyzer": "code",
        "term_vector": "with_positions_offsets"
      },
      ......
    }
  }
}

Figure 4.4: Module level data indexing mapping.

GET packagedata/packageindex/_search
{
  "query": {
    "multi_match": {
      "query": "search query",
      "operator": "or",
      "fields": ["info.name^30", "info.author", "info.summary", "info.version", "info.keywords"]
    }
  },
  "fields": [
    "info.name", "info.author", "info.summary", "info.version", "info.keywords", "info.downloads.last_month"
  ],
  "highlight": {
    "fields": {
      "summary": {},
      "description": {}
    }
  }
}

Figure 4.5: Package level search query.

module's .py file and extract the relevant code segment for each matching module. We used the Fast Vector Highlighter (FVH) for module level search, which returns the top n fragments from each field (class, class_function, class_variable, function, function_function, function_var and variable) in a module document. The fragment size has been strategically defined to be 18, which is the minimum fragment size ES allows. While forming the module level search documents we appended the line number to each variable where it appears, so that we can use the module_path information and the line number information to retrieve the relevant code segment. If a variable appended with its line number is less than 18 characters long, we append underscores to make it 18. This trick is useful because ES FVH creates fragments based on fragment size and word boundaries. This way, ES creates a fragment for each variable in a field, looks for matches to the search query in each field and returns the n top fragments, where n is the number of fragments set in the query.

GET moduledata/moduleindex/_search
{
  "query": {
    "multi_match": {
      "query": "user query",
      "fields": ["class^5", "class_function", "class_variable", "function^4", "function_function", "function_var", "variable^3"]
    }
  },
  "fields": ["module_path"],
  "_source": false,
  "highlight": {
    "order": "score",
    "require_field_match": true,
    "fields": {
      "class": {"number_of_fragments": 5, "fragment_size": 18},
      "class_function": {"number_of_fragments": 5, "fragment_size": 18},
      "class_variable": {"number_of_fragments": 5, "fragment_size": 18},
      "function": {"number_of_fragments": 5, "fragment_size": 18},
      ......,
      "variable": {"number_of_fragments": 5, "fragment_size": 18}
    }
  }
}

Figure 4.6: Module level search query.

CHAPTER 5

DATA PRESENTATION

Once the data is indexed, it needs to be presented quickly and in a clear, presentable manner. We chose to create a search engine type of interface. Our goal for package level search was to provide a ranked list of relevant packages and their details for any given query. For module level search, we wanted to provide actual source code snippets related to the query. In order to display some of the source code to the user, preprocessing was necessary.

5.1 Server Setup

Our server has a simple stack setup. We use Nginx1 to handle requests, Flask2/Python3 to process and serve our data, and uWSGI4 as an interface between the web server and Flask. Elasticsearch (ES)5 is used to hold our data for package and module level search. To make sure we are using the latest packages from PyPI, we use databases to track the modified times of each package. Much of the rendering and manipulation of the browser interface is done using JavaScript6. Our JavaScript library of choice is jQuery7.
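A minimal sketch of such an edge node is given below; the endpoint path, index name and host are assumptions, and the real implementation adds the reranking and templating described later in this chapter.

# Hedged sketch: a Flask edge node that forwards a package query to ES.
from elasticsearch import Elasticsearch
from flask import Flask, jsonify, request

app = Flask(__name__)
es = Elasticsearch(["localhost:9200"])

@app.route("/search/packages")
def search_packages():
    user_query = request.args.get("q", "")
    body = {"query": {"multi_match": {
        "query": user_query,
        "fields": ["info.name^30", "info.author", "info.summary",
                   "info.version", "info.keywords"]}}}
    hits = es.search(index="packagedata", body=body)["hits"]["hits"]
    return jsonify(results=[hit["_source"]["info"] for hit in hits])

if __name__ == "__main__":
    app.run()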

5.2 Browser Interface

The interface features a home page with a simple text box for queries and a choice of either package or module level search. The results page displays the query information at the top and the results themselves below. The user can modify his or her query or change the search type from package level search to module level search and vice versa. In Chapter 6 we have added some sample screenshots of the browser interface.

1https://www.nginx.com/resources/wiki/ 2http://flask.pocoo.org/ 3https://www.python.org/ 4https://uwsgi-docs.readthedocs.org/en/latest/ 5https://www.elastic.co/products/elasticsearch 6https://developer.mozilla.org/en-US/docs/Web/JavaScript 7https://jqueryui.com/

Figure 5.1: Package modal.

5.3 Package Level Search

When a user sends a query to the server for package level search, the query is processed, and a ranked list of packages is sent back. Each package is depicted on the browser as a tile (refer to Figure 6.3). The tile provides minimal information about the package, including the package name, the author of the package, a brief description, the number of downloads and the score assigned by the ranking algorithm. The user has the option to click the tile to view more information about the package and to visit PyPI and other sites related to the package. When the user clicks on the tile, a modal is opened that contains a detailed description, the version, the source code homepage, the PyPI homepage, the score from the ranking algorithm (refer to Figure 5.1), statistics of the package as a bar graph, pie chart and numbers (refer to Figure 5.2), and other packages from the author (refer to Figure 5.3). This process relies on the ranking of packages.

Figure 5.2: Package statistics.

5.3.1 Ranking of Packages

One of the most important aspects which distinguishes this search engine from others is the use of many types of data for ranking the packages. In Chapter 3 and Chapter 4 we have discussed how all the preprocessed information about packages and modules, including basic details from PyPI, is stored in our ES server. We felt that the ES relevance algorithm was not thorough enough to return meaningful results. So we are using a few other metrics, namely Bing search results8, the number of downloads, the ratio of warnings to the total number of lines, the ratio of comments to the total number of lines and the ratio of code lines to the total number of lines (gathered by Prospector and CLOC, also visible in the package modal referenced in Figure 5.2). All of these metrics are passed as columns to the ranking algorithm. Together, these columns form a matrix. After the ranking algorithm is executed, it returns a list of packages reverse sorted by the newly generated scores and a dictionary mapping package names to their scores. Note that

8http://datamarket.azure.com/dataset/bing/search

Figure 5.3: Other packages from author.

the ES column and the Bing column are primary columns while the others are secondary derived columns. Tuning these column weights could be tricky. One way to fine tune this algorithm is to try different combinations of weights and learn which one works better. We can also fine tune this algorithm by adding more primary columns or secondary derived columns. For example, at the time of writing this thesis, we were experimenting with PageRank as one of the primary columns. We sought to calculate PageRank based on the import statements in each module. Just like in web pages, where a page "A" links to some other page "B", in Python modules we have import statements where a module "C" imports module "D". This can be considered a vote cast by "C" for module "D", and this information can be used to generate PageRank for modules and packages.

Ranking Algorithm.

1. Request ES server for top 20 results for user query.

2. Request Bing for top 20 results for user query.

3. Generate other relevant columns or metrics.

4. Pass the list of all columns generated in steps 1, 2 and 3 to the rerank function along with weights for each column.

5. Inside rerank function

(a) Find the max length of the columns, i.e. the number of rows in the matrix.

(b) Construct a results dictionary over the set union of all cells in the matrix, with the package name as key and the value (score) initialized to 0.

(c) For each row in the matrix

    i. For each cell in the row

        A. Find the package's score for its position using the formula (maxlen − row.rownumber) × weightvector(cell.columnnumber). Here "maxlen" is the total number of rows in the matrix, "row.rownumber" is the position of the current row in the matrix, and "weightvector(cell.columnnumber)" gives the weight assigned to the particular column the cell belongs to. (A worked example follows this list.)

        B. Add this score to the existing score of that package in the dictionary created in step 5(b). At the end of this step, each package in the results dictionary will have a cumulative score for its standing in the different columns.

(d) From the results dictionary, generate a list reverse sorted with the score as key. This gives the results list in reranked order.
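For illustration (with hypothetical numbers, not necessarily the exact weights used to produce Table 5.2): if the matrix has maxlen = 36 rows, the ES column has weight 0.4 and the downloads column has weight 0.1, then a package sitting in the first row (row.rownumber = 0) of the ES column earns (36 − 0) × 0.4 = 14.4, and the same package sitting in the third row (row.rownumber = 2) of the downloads column earns another (36 − 2) × 0.1 = 3.4, for a cumulative score of 17.8.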

Note that there could be duplicates between the top 20 results of the ES server and the top 20 results of Bing, so the total number of rows in the matrix passed to the ranking function is not always 40. Figure 5.4 is the sample pseudocode for the ranking algorithm. Table 5.1 is the sample matrix formed for the keyword "music". In this table you can notice the duplicate name "vis-framework" between the primary columns Elasticsearch and Bing. This only means that both primary columns agree that this is a relevant result for the given keyword. You can also notice duplicates within the same primary column, Bing. This scenario occurs when multiple versions of the same package become popular. From the table we can also notice that these duplicates are carried forward to the secondary columns, thus influencing the ranks. Whether to allow duplicates in the primary columns and whether to carry them forward to the secondary columns are decisions yet to be made. For now, this practice in the ranking algorithm has looked promising, with positive effects. As we investigate more use cases, if it turns out that duplicates have a negative influence, we can always eliminate them without disturbing

the nature of the ranking function. For the keyword "music", Table 5.2 shows the matching packages and their scores calculated by the ranking function, ordered in descending order.

5.4 Module Level Search

Refer to Figure 6.4 to view the template used for Module Level Search. Similar to package level search, the user sends a query to the server. However, this time around the server returns a list of code excerpts. One of our concerns was to keep the wait time of queries low. So a preprocessing step was added to retrieve the code faster.

5.4.1 Preprocessing

To reduce the wait time of searches, the source code needs to be adapted for the browser. As previously mentioned, Bandersnatch is used to create a local mirror of the PyPI repository on our server; however, only compressed packages are maintained by Bandersnatch. A decompression step is required to examine each Python module in plain text. Our aim was to show nicely formatted, stylized lines of code. Initially, we were going to send about twenty lines from the source code to the user and render the code on the user's side using a third party JavaScript library such as SyntaxHighlighter9. This worked well except for multi-line constructs such as doc strings. Since there is a possibility of missing lines of a doc string during client-side rendering, the renderer has no way of knowing how to stylize the doc string. Instead, we fixed this by rendering the code snippets before sending them to the client (server-side rendering). For this, we used Pygments10, a Python library for creating stylized code in Python and other languages in numerous formats such as HTML. This inevitably increases the amount of data sent from the server, but it ensures that the code is correctly stylized. Lines of Code (LOC) is another trick used to expedite the code display. It creates a "mapping" file for each module. The mapping file in this instance stores the exact byte at which each line in a module starts. When it is time to grab a snippet of code, the server can open the file and immediately seek to the correct spot rather than search or linearly pass through the rest of the lines. There could be a thousand or more lines of code that we avoid reading by using this technique. This cuts down on processing time and only requires another step in the preprocessing.

9http://alexgorbatchev.com/SyntaxHighlighter/ 10http://pygments.org/
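A simplified sketch of this mapping idea (an assumed file layout, not the exact implementation) is shown below: record the byte offset at which every line of the rendered module starts, then seek straight to a matched line when a snippet is needed.

# Hedged sketch: build a line-to-byte-offset map and use it to grab a snippet.
def build_line_offsets(path):
    offsets, position = [], 0
    with open(path, "rb") as handle:
        for line in handle:
            offsets.append(position)
            position += len(line)
    return offsets

def read_snippet(path, offsets, start_line, num_lines=20):
    with open(path, "rb") as handle:
        handle.seek(offsets[start_line - 1])  # jump directly, no linear scan
        return b"".join(handle.readline() for _ in range(num_lines))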

generateTop20ESResults():
{
    # query ES server and get top 20 results
}

generateTop20BingResults():
{
    # query bing and get top 20
    # filter out real packages from pages
}

generateOtherMetrics(listOfPackagesFromESandBing):
{
    for eachpackage in listOfPackagesFromESandBing:
        column_downloads = eachpackage.downloads
        column_warnings = eachpackage.warnings / eachpackage.lines
        column_comments = eachpackage.comments / eachpackage.lines
        column_code = eachpackage.code / eachpackage.lines
    column_downloads.sortReverse()  # more downloads, better the package
    column_warnings.sort()          # fewer warnings, better the package
    column_comments.sortReverse()   # more comments, better the package
    column_code.sortReverse()       # more code, better the package
    return column_downloads, column_warnings, column_comments, column_code
}

rerank(weightvector, matrix):
{
    maxlen = max(len(column) for column in matrix)
    # Create a dict as keys:0.0 from the union of cells in the matrix
    resultsDictionary = getDictionaryFromCells(matrix)
    # go through the matrix one row at a time
    for row in matrix:
        for cell in row:
            if cell.packagename:  # avoiding empty cells
                resultsDictionary[cell.packagename] += (maxlen - row.rownumber) * weightvector[cell.column_number]
    # higher the score, better the package
    resultsList = sortReverse(resultsDictionary, key=score)
    # resultsList gives you the order of packages
    # resultsDictionary gives the score of each package
    return resultsList, resultsDictionary
}

# execution starts here
columns = []
columns.append(generateTop20ESResults)            # primary column, more weight
columns.append(generateTop20BingResults)          # primary column, more weight
columns.extend(generateOtherMetrics(ES + Bing))   # secondary columns
weightvector = [0.4, 0.2, 0.1, 0.1, 0.1, 0.1]     # can vary
rerank(weightvector, columns)

Figure 5.4: Pseudocode for ranking algorithm.

Table 5.1: Ranking matrix for keyword music.

Elasticsearch | Bing | Downloads | Warnings/Lines | Comments/Lines | Code/Lines
vk-music | musicplayer | pylast | vk-music | mps-youtube | mopidy-musicbox-webclient
jmbo-music | music | mps-youtube | jmbo-music | mps-youtube | musicplayer
panya-music | mopidy-gmusic | mps-youtube | panya-music | vis-framework | musicplayer
tweet-music | pyspotify | jmbo-music | tweet-music | pyacoustid | mopidy-gmusic
google-music-playlist-importer | musicplayer | hachoir-metadata | google-music-playlist-importer | cherrymusic | hachoir-metadata
music-score-creator | mps-youtube | pyspotify | music-score-creator | music21 | bdmusic
spilleliste | music21 | mp3play | spilleliste | music21 | music21
raspberry jam | mps-youtube | bdmusic | raspberry jam | pyspotify | music21
kurzfile | gmusicapi | pyacoustid | kurzfile | mp3play | pyspotify
vis-framework | vis-framework | gmusicapi | vis-framework | gmusicapi | mp3play
gmusic-rating-sync | music22 | vis-framework | gmusic-rating-sync | bdmusic | pyacoustid
vkmusic | pygame-music-grid | vis-framework | vkmusic | hachoir-metadata | cherrymusic
melody-dl | bdmusic | music21 | melody-dl | musicplayer | vis-framework
leftasrain | mopidy-musicbox-webclient | music21 | leftasrain | musicplayer | gmusicapi
youtubegen | netease-musicbox | raspberry jam | youtubegen | mopidy-gmusic | mps-youtube
marlib | hachoir-metadata | mopidy-musicbox-webclient | marlib | vk-music | mps-youtube
chordgenerator | mp3play | musicplayer | chordgenerator | jmbo-music | vk-music
tempi | music21 | musicplayer | tempi | panya-music | jmbo-music
raindrop.py | pyacoustid | panya-music | raindrop.py | tweet-music | panya-music
pylast | cherrymusic | cherrymusic | pylast | google-music-playlist-importer | tweet-music
- | - | kurzfile | musicplayer | music-score-creator | google-music-playlist-importer
- | - | mopidy-gmusic | musicplayer | spilleliste | music-score-creator
- | - | music-score-creator | music21 | raspberry jam | spilleliste
- | - | chordgenerator | music21 | kurzfile | raspberry jam
- | - | tweet-music | mps-youtube | vis-framework | kurzfile
- | - | vkmusic | mps-youtube | gmusic-rating-sync | vis-framework
- | - | leftasrain | pyacoustid | vkmusic | gmusic-rating-sync
- | - | gmusic-rating-sync | vis-framework | melody-dl | vkmusic
- | - | vk-music | pyspotify | leftasrain | melody-dl
- | - | tempi | mopidy-musicbox-webclient | youtubegen | leftasrain
- | - | google-music-playlist-importer | mp3play | marlib | youtubegen
- | - | marlib | hachoir-metadata | chordgenerator | marlib
- | - | melody-dl | bdmusic | tempi | chordgenerator
- | - | raindrop.py | cherrymusic | raindrop.py | tempi
- | - | spilleliste | mopidy-gmusic | pylast | raindrop.py
- | - | youtubegen | gmusicapi | mopidy-musicbox-webclient | pylast

Table 5.2: Matching packages and their scores for keyword music.

Package Name | Score | Package Name | Score
musicplayer | 45.8 | music-score-creator | 13.8
mps-youtube | 44.6 | raspberry jam | 13.6
music21 | 39 | google-music-playlist-importer | 13.5
vis-framework | 33 | kurzfile | 12.5
pyspotify | 22.8 | spilleliste | 12.1
mopidy-gmusic | 20.8 | gmusic-rating-sync | 10.8
gmusicapi | 19 | vkmusic | 10.5
bdmusic | 18.6 | music22 | 10.4
hachoir-metadata | 17.8 | pygame-music-grid | 10
jmbo-music | 17.7 | leftasrain | 9.4
mp3play | 17.1 | melody-dl | 9.3
pyacoustid | 16.9 | pylast | 9
vk-music | 15.7 | netease-musicbox | 8.8
panya-music | 15.7 | chordgenerator | 8.2
mopidy-musicbox-webclient | 15.7 | youtubegen | 8
tweet-music | 14.6 | marlib | 7.9
cherrymusic | 14.5 | tempi | 7.1
music | 14 | raindrop.py | 6.2

We limited the number of code snippets returned to 20. Sending more than 20 matches for a module level search would generate a huge response and increase the response time. Before sending the top 20 results, we apply the ranking algorithm discussed for package level search, with changes to the input of the ranking function. As of now, there is only one primary column, namely the results from the ES server, and a list of secondary columns similar to the package level ones, but each of them (except for the downloads column) points to module level rather than package level statistics. For example, consider the column "Warnings/Lines": for package level search it is the ratio of the total number of warnings for a package to its total number of lines of code, while for module level search it is the ratio of the total number of warnings for a module to its total number of lines of code. After rendering the matching code snippets on the front end, for user convenience, we

have given the user the option to click on a code snippet, which opens a modal that displays the entire code of the module. This gives users more visibility into modules. Figure 5.5 shows this modal.

Figure 5.5: Module modal.
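As a rough sketch of how the module level secondary columns might be assembled for the reranker, consider the function below; the attribute names on the module objects (warnings, comments, code, lines, package_downloads) are illustrative assumptions, not PyQuery's actual schema.

    # Illustrative sketch: building module level secondary columns for the reranker.
    def module_metric_columns(matching_modules):
        downloads = []   # downloads of the package the module belongs to
        warnings = []    # warnings per line, computed per module
        comments = []    # comment lines per line, computed per module
        code = []        # code lines per line, computed per module
        for module in matching_modules:
            downloads.append((module.name, module.package_downloads))
            warnings.append((module.name, module.warnings / module.lines))
            comments.append((module.name, module.comments / module.lines))
            code.append((module.name, module.code / module.lines))
        # secondary columns are ranked best-first, as in Figure 5.4
        downloads.sort(key=lambda x: x[1], reverse=True)  # more downloads is better
        warnings.sort(key=lambda x: x[1])                 # fewer warnings is better
        comments.sort(key=lambda x: x[1], reverse=True)   # more comments is better
        code.sort(key=lambda x: x[1], reverse=True)       # more code is better
        return [downloads, warnings, comments, code]

These columns, together with the single primary column of ES results, form the matrix handed to the rerank function of Figure 5.4.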

CHAPTER 6

SYSTEM LEVEL FLOW DIAGRAM

Figure 6.1 is a System Level Flow Diagram of PyQuery. A set of Python scripts is run in batch overnight to generate all the required details mentioned in the Data Collection chapter. This preprocessed information is stored in JSON format. Before executing these batch scripts, the bandersnatch1 mirror client is run so that the local packages are in sync with PyPI2 and we deliver the most up to date information. All the generated files belong either to package search or to module search. We maintain separate Elasticsearch (ES)3 indexes for package search and module search. These indexes are configured to update at regular intervals whenever the files they point to change, and they form the core of PyQuery. A web interface customized for an easy flow of information to users is developed in Flask4 and deployed using an NGINX5 server. Figure 6.2 is PyQuery's homepage, whose design is mainly inspired by Google's homepage. It serves separate edge nodes for package search and module search. When a user hits the package level search page, an AJAX call is made to the edge node responsible for retrieving matching packages from the ES index. Based upon the matching packages retrieved from ES, a set of metrics is formulated and passed to the ranking algorithm discussed in Chapter 5. This returns to the requesting front end page a list of packages sorted in decreasing order of the score calculated by the ranking algorithm, so the highest scoring package is positioned at the top. Figure 6.3 shows the result of a Package Level Search on PyQuery for the keyword "flask". When a user hits the module level search page, an AJAX6 call is made to the edge node responsible for retrieving matching modules or lines of code based on the metadata index in ES. After collecting the matching modules and the line numbers at which the matches occurred, the Lines of Code (LOC) technique discussed in Chapter 5 is executed to quickly capture code snippets from the matching modules and display them on the requesting front end page. Figure 6.4 shows the result of a Module Level Search on PyQuery for the keyword "quicksort".

1https://pypi.python.org/pypi/bandersnatch 2https://pypi.python.org/pypi 3https://www.elastic.co/products/elasticsearch 4http://flask.pocoo.org/ 5https://www.nginx.com/resources/wiki/ 6http://api.jquery.com/jquery.ajax/

Figure 6.1: System Level Flow Diagram of PyQuery.
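As a rough illustration of how such a package search edge node could look, the sketch below serves an AJAX endpoint in Flask, queries Elasticsearch and reranks the hits; the endpoint path, index name, field names and the build_columns/rerank helpers are assumptions for the sketch, not PyQuery's actual code.

    from flask import Flask, jsonify, request
    from elasticsearch import Elasticsearch

    app = Flask(__name__)
    es = Elasticsearch()                 # assumes a local ES instance

    @app.route("/search/packages")       # hypothetical endpoint called via AJAX
    def package_search():
        keyword = request.args.get("q", "")
        # query the (assumed) package index; field names are illustrative
        hits = es.search(index="packages", body={
            "query": {"multi_match": {"query": keyword,
                                      "fields": ["name", "summary", "description"]}},
            "size": 20,
        })["hits"]["hits"]
        matches = [h["_source"] for h in hits]
        # build the metric columns and rerank as in Figure 5.4 (helpers assumed)
        columns = build_columns(matches)
        ranked, scores = rerank([0.4, 0.2, 0.1, 0.1, 0.1, 0.1], columns)
        return jsonify({"results": [{"package": p, "score": scores[p]} for p in ranked]})

The module level edge node follows the same pattern, except that it queries the module metadata index and runs the LOC snippet extraction before responding.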

Figure 6.2: PyQuery homepage.

Figure 6.3: PyQuery package level search template.

Figure 6.4: PyQuery module level search template.

CHAPTER 7

RESULTS

Our primary goal was to build a better PyPI search engine. We wanted to make the search more meaningful and to avoid a huge list of closely and similarly scored packages. With PyPI being the state of the art, we compare the results of PyQuery with those of PyPI. For the comparison, we search 5 keywords that directly match a package name on PyPI (Table 7.1 to Table 7.5) and 5 generic keywords that describe the purpose of a package (Table 7.6 to Table 7.10).

1. The total number of results returned by PyPI was highest for the keyword "Django", at 11,292, which is approximately 1/4 of the total number of packages on PyPI. On the other hand, PyQuery, with two primary columns in the ranking function, always returns at most 40 results. As a point of reference, on Google 81% of users view only one results page [12], which corresponds to about 10 results.

2. The maximum number of packages to which PyPI assigned the highest score for a single keyword was 162, for the keyword "flask". PyQuery always assigned the highest score to only one package.

3. For the first 5 keywords, where we expect a particular package to appear in first position because the keyword matches the package name, PyPI showed this behavior only 2 out of 5 times. PyQuery exhibited this behavior 5 out of 5 times.

4. For PyPI, the distribution of scores is very narrow. Among the top 5 scores, on 4 out of 10 occasions all 5 were identical, on 3 out of 10 occasions 4 of them were identical, and on 2 out of 10 occasions 3 of them were identical. PyQuery scores are more widely spread.

5. For the last 5 keywords, which do not point directly to any package name but instead describe a developer's need, we observe that PyQuery's results are more appealing, more diversely scored and unique. For example, for the keyword "web development framework", PyQuery returned all unique results, with packages like Django and Pyramid (widely used web development frameworks) in the top 5. On the other hand, among the top 5 results from PyPI there were only 3 unique results, some of which were related to web testing frameworks.

6. Among the top 5 results, on 4 out of 10 occasions PyPI returned duplicate package names referring to multiple versions of the same package. PyQuery always returns only one result per package, referring to its latest version.

Based on the above grounds of comparison, it is clear that we have met our goals: to improve on PyPI, offer a meaningful search and avoid closely and similarly scored packages.

Table 7.1: Results comparison for keyword - requests.

Keyword: requests
# of results from PyPI: 5117
# of results from PyQuery: 28
Top 5 results from PyPI: cache requests, curl to requests, drequests, helga-pull-requests, jsonrpc-requests
Top 5 results from PyQuery: Requests, Requests-OAuthlib, Requests-Mock, Requests-Futures, Requests-Oauth
First 5 scores - PyPI: 9, 9, 9, 9, 9
First 5 scores - PyQuery: 131.60, 68.50, 51.00, 45.60, 39.90
# of packages with highest score - PyPI: 21
# of packages with highest score - PyQuery: 1
Rank (score) of expected match - PyPI: 7 (9)
Rank (score) of expected match - PyQuery: 1 (131.60)

Table 7.2: Results comparison for keyword - flask.

Keyword: flask
# of results from PyPI: 1750
# of results from PyQuery: 37
Top 5 results from PyPI: airbrake flask, draftin a flask, fireflask, Flask, Flask-AlchemyDumps
Top 5 results from PyQuery: Flask, Flask-Admin, Flask-JSONRPC, Flask-Restless, Flask Debug-toolbar
First 5 scores - PyPI: 9, 9, 9, 9, 9
First 5 scores - PyQuery: 46.70, 44.80, 41.20, 26.10, 25.90
# of packages with highest score - PyPI: 162
# of packages with highest score - PyQuery: 1
Rank (score) of expected match - PyPI: 4 (9)
Rank (score) of expected match - PyQuery: 1 (46.70)

Table 7.3: Results comparison for keyword - pygments.

Keyword: pygments
# of results from PyPI: 250
# of results from PyQuery: 29
Top 5 results from PyPI: Pygments, django mce pygments, pygments-asl, pygments-gchangelog, pygments-rspec
Top 5 results from PyQuery: Pygments, pygments-style-github, Pygments-Xslfo-Formatter, Bibtex-Pygments-Lexer, Mistune
First 5 scores - PyPI: 11, 9, 9, 9, 9
First 5 scores - PyQuery: 88.10, 31.50, 28.50, 24.90, 19.30
# of packages with highest score - PyPI: 1
# of packages with highest score - PyQuery: 1
Rank (score) of expected match - PyPI: 1 (11)
Rank (score) of expected match - PyQuery: 1 (88.10)

Table 7.4: Results comparison for keyword - Django.

Keyword: Django
# of results from PyPI: 11292
# of results from PyQuery: 38
Top 5 results from PyPI: Django, django-hstore, django-modelsatts, django-notifications-hq, django-notifications-hq
Top 5 results from PyQuery: Django, Django-Appconf, Django-, Django-Nose, Django-Inplaceedit
First 5 scores - PyPI: 10, 10, 10, 10, 10
First 5 scores - PyQuery: 75.30, 25.50, 25.20, 25.20, 23.50
# of packages with highest score - PyPI: 6
# of packages with highest score - PyQuery: 1
Rank (score) of expected match - PyPI: 1 (10)
Rank (score) of expected match - PyQuery: 1 (73.30)

Table 7.5: Results comparison for keyword - pylint.

Keyword: pylint
# of results from PyPI: 361
# of results from PyQuery: 25
Top 5 results from PyPI: gt-pylint-commit-hook, plpylint, pylint, pylint-patcher, pylint-
Top 5 results from PyQuery: Pylint, Pylint2tusar, Django-Jenkins, Pylama pylint, Logilab-Astng
First 5 scores - PyPI: 9, 9, 9, 9, 9
First 5 scores - PyQuery: 91.90, 21.80, 20.60, 17.30, 16.00
# of packages with highest score - PyPI: 6
# of packages with highest score - PyQuery: 1
Rank (score) of expected match - PyPI: 3 (9)
Rank (score) of expected match - PyQuery: 1 (91.90)

Table 7.6: Results comparison for keyword - biological computing.

Keyword: biological computing
# of results from PyPI: 6
# of results from PyQuery: 23
Top 5 results from PyPI: blacktie, appdynamics, appdynamics, appdynamics, inspyred
Top 5 results from PyQuery: BiologicalProcessNetworks, Blacktie, PyDSTool, PySCeS, Csb
First 5 scores - PyPI: 3, 2, 2, 2, 2
First 5 scores - PyQuery: 15.10, 14.10, 12.90, 12.50, 11.80
# of packages with highest score - PyPI: 1
# of packages with highest score - PyQuery: 1
Relevant packages among top 5 - PyPI: 1
Relevant packages among top 5 - PyQuery: 1, 2, 3, 4, 5

Table 7.7: Results comparison for keyword - 3D printing.

Keyword: 3D printing
# of results from PyPI: 26
# of results from PyQuery: 31
Top 5 results from PyPI: fabby, tangible, blockmodel, citygml2stl, demakein
Top 5 results from PyQuery: Pymeshio, Demakein, C3d, Bqclient, Pyautocad
First 5 scores - PyPI: 7, 7, 6, 5, 4
First 5 scores - PyQuery: 44.00, 21.50, 18.90, 18.90, 18.60
# of packages with highest score - PyPI: 2
# of packages with highest score - PyQuery: 1
Relevant packages among top 5 - PyPI: 1, 2, 3, 4, 5
Relevant packages among top 5 - PyQuery: 1, 2, 3, 4, 5

Table 7.8: Results comparison for keyword - web development framework.

Keyword: web development framework
# of results from PyPI: 801
# of results from PyQuery: 32
Top 5 results from PyPI: HalWeb, WebPages, robotframework-extendedselenium2library, robotframework-extendedselenium2library, robotframework-extendedselenium2library
Top 5 results from PyQuery: Django, Pyramid, Pylons, Moya, Circuits
First 5 scores - PyPI: 16, 16, 15, 15, 15
First 5 scores - PyQuery: 65.80, 48.20, 37.60, 32.60, 23.80
# of packages with highest score - PyPI: 2
# of packages with highest score - PyQuery: 1
Relevant packages among top 5 - PyPI: 1, 2
Relevant packages among top 5 - PyQuery: 1, 2, 3, 4, 5

Table 7.9: Results comparison for keyword - material science.

Keyword: material science
# of results from PyPI: 52
# of results from PyQuery: 29
Top 5 results from PyPI: py bonemat abaqus, MatMethods, MatMiner, pymatgen, pymatgen
Top 5 results from PyQuery: FiPy, Pymatgen, Pymatgen-Db, Custodian, Mpmath
First 5 scores - PyPI: 7, 6, 6, 6, 6
First 5 scores - PyQuery: 57.50, 55.20, 20.30, 19.70, 17.50
# of packages with highest score - PyPI: 1
# of packages with highest score - PyQuery: 1
Relevant packages among top 5 - PyPI: 1, 2, 3, 4, 5 (Note: 4 and 5 are duplicate results.)
Relevant packages among top 5 - PyQuery: 1, 2, 3, 4

Table 7.10: Results comparison for keyword - google maps.

Keyword: google maps
# of results from PyPI: 290
# of results from PyQuery: 31
Top 5 results from PyPI: Product.ATGoogleMaps, trytond google maps, django-google-maps, djangocms-gmaps, Flask-GoogleMaps
Top 5 results from PyQuery: Googlemaps, Django-Google-Maps, Flask-GoogleMaps, Gmaps, Geolocation-Python
First 5 scores - PyPI: 18, 18, 14, 14, 14
First 5 scores - PyQuery: 50.20, 39.40, 39.20, 39.00, 38.60
# of packages with highest score - PyPI: 2
# of packages with highest score - PyQuery: 1
Relevant packages among top 5 - PyPI: 1, 2, 3, 4, 5 (Note: results are relevant to the query but miss general purpose packages like Googlemaps among the top 5.)
Relevant packages among top 5 - PyQuery: 1, 2, 3, 4, 5

CHAPTER 8

CONCLUSIONS AND FUTURE WORK

We believe we have succeeded in developing a dedicated search engine for Python packages and modules, and we expect the Python community to adopt PyQuery widely. PyQuery would allow Python developers to explore well written, widely adopted, popular and highly apt Python packages and modules for their programming needs. It offers itself as an encouraging tool for the Python community to follow the software engineering practice of code reuse.

8.1 Thesis Summary

In this thesis we have proposed some concrete ideas on how to develop a dedicated search engine for Python packages and modules. We have sought to build an improved version of the state of the art Python search engine, PyPI. Although PyPI is the first and only tool to address this problem, its results are found to be of little use for user needs and requirements. We have discussed various tools and techniques that are brought together as one single tool, called PyQuery, to facilitate better search, better ranking and better package visibility. With PyQuery we want to bridge the gap between the high demand for ways to deliver reusable Python components for code reuse and the lack of efficient tools at users' disposal to achieve it. In Chapter 1 we discussed the relevance of this problem and our objective and approach towards solving it. In Chapter 2 we highlighted the related work in this area. For package level search, PyPI being the only search engine that performs Python package search, we elaborated on how the PyPI search algorithm works and offered reasons as to why we think it needs improvement. For module level search, there is no dedicated code search engine for Python, so we explored code search engines that work across multiple languages and reasoned about the need for a dedicated one for Python. PyQuery is divided into three components: Data Collection, Data Indexing and Data Presentation. Since we intend to provide two modes of search, i.e. Package Level Search and Module Level Search, at each component we employ a list of tools and techniques to achieve specific goals related to these modes. In Chapter 3 we discussed the Data Collection module and the use of

the Bandersnatch1 PEP 381 mirror client to clone Python packages locally, which are later processed with code analysis tools such as Prospector2 and CLOC3. We explored how to make use of Abstract Syntax Trees (ast) to extract useful information, or metadata, from Python modules. We also described the JSON file format used for saving all this information, with an example for each type of data. In Chapter 4 we demonstrated how to feed structured data to Elasticsearch (ES)4 and how to make use of the FS River5 and Analyzer6 plugins to digest the fed data. ES is built on top of Apache Lucene7 and offers a wide variety of methods to configure data indexing and data retrieval. We explained the purpose behind agreeing on a specific format for the JSON files so that we can make use of the configuration options ES offers. One such configuration is the minimum fragment size. By setting the minimum fragment size to 18 and storing each identifier together with its line number as a single underscore-separated word, right padded with underscores to a minimum length of 18, we were able to get a matching identifier and its line number as one single match. This reduced the size of the JSON files indexed in ES drastically and also saves the time needed to fetch the line number from another key. In this chapter we also outlined some sample queries to index the structured data and retrieve meaningful information from it. In Chapter 5 we covered data presentation concepts such as the browser interface and the server setup. We discussed our implementation of a server side ranking algorithm for package and module level search, the columns involved in the ranking metrics and an example view of these columns for a sample query. We also presented our preprocessing implementation for faster code search, which involves generating the starting byte address of each line in a module and rendering code with Pygments. In Chapter 6 we gave an overview of how all three components of PyQuery work together, with a system level flow diagram. Finally, in Chapter 7 we compared the results of PyQuery with those of PyPI to show that we have achieved our goals: to improve on PyPI, offer a meaningful search and avoid closely and similarly scored packages.

1https://pypi.python.org/pypi/bandersnatch 2https://github.com/landscapeio/prospector 3http://cloc.sourceforge.net/ 4https://www.elastic.co/products/elasticsearch 5https://github.com/dadoonet/fsriver 6https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-snowball-analyzer.html 7https://lucene.apache.org/core/
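A minimal sketch of the identifier-plus-line-number encoding described above is shown below; the helper name and the sample identifiers are hypothetical, and only the underscore joining and the padding to a minimum length of 18 follow the description in Chapter 4.

    def encode_identifier(name, line_number, min_length=18):
        # Join the identifier and its line number into one underscore-separated word,
        # then right-pad with underscores until the minimum fragment size is reached.
        token = "{}_{}".format(name, line_number)
        return token.ljust(min_length, "_")

    # A short identifier gets padded so that Elasticsearch returns the identifier
    # and its line number together in a single highlighted fragment.
    print(encode_identifier("quicksort", 42))      # quicksort_42______
    print(encode_identifier("binary_search", 7))   # binary_search_7___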

8.2 Recommendation for Future Work

Although PyQuery accomplishes the goals initially established, there is definitely scope for improvement. In this section we list ways to improve it further. We want to perform a large scale comparison of PyQuery. Currently, we have tested PyQuery with a set of keywords for which we know the matching packages, and we observed that PyQuery does better than the state of the art, PyPI. Python is an extensive language, and people from many different fields use it to solve problems in their respective disciplines. In this process there is a continuous production of useful packages, and thousands of packages are popular for various reasons. Knowing in advance all the possible keywords that map to these packages is nearly impossible. A tool gains popularity and importance only when it is widely accepted by its user base. By reaching out to Python developers from various disciplines, we can gauge how well PyQuery maps keywords to the right packages. We plan to run large scale user surveys, asking professional developers to search for packages they use both by direct package name and by keywords that describe a package's purpose. We want to collect their feedback, learn whether PyQuery meets their requirements and does a better job than PyPI, and list the use cases where PyQuery needs to do better. We can also extend PyQuery into a recommendation system. We can apply a collaborative filtering technique, i.e., capture user actions to learn their likes and dislikes of the Python packages we suggest and later use this data to predict a list of packages a user would find interesting. This would allow further improvements to PyQuery. If a user trusts a specific author and tends to explore packages developed by that author more often, we could make the search results more appealing by promoting packages from that author among the set of initially matched packages. If a user tends to explore packages specific to a field or category, it is likely that he or she works in that field; if a user management component is added to PyQuery in the future, then every time a user logs in to the website we can suggest popular packages from his or her field on a dashboard, or suggest the latest news about updates to packages in that field. These are a few of the many possibilities collaborative filtering offers for facilitating better search operations. It would allow developers to receive the latest information on packages and help them make the best of Python packages and modules. Many successful giants in the field of entertainment, like Netflix and Comcast, use collaborative filtering to

always keep their users engaged with their websites. Since PyQuery seeks to help developers explore Python packages, it could find great purpose for collaborative filtering.
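As a rough sketch of the kind of collaborative filtering this could involve, the snippet below scores unseen packages by their co-occurrence with packages a user has already explored; the interaction data, function name and scoring choice (cosine similarity over user sets) are purely hypothetical and are not part of PyQuery today.

    # Hypothetical item-based collaborative filtering over package exploration data.
    from collections import defaultdict
    from math import sqrt

    def recommend(interactions, user, top_n=5):
        # interactions maps each user to the set of packages he or she explored
        package_users = defaultdict(set)
        for u, packages in interactions.items():
            for p in packages:
                package_users[p].add(u)
        seen = interactions.get(user, set())
        scores = defaultdict(float)
        # score unseen packages by cosine similarity to packages the user explored
        for p in seen:
            for q, users_q in package_users.items():
                if q in seen:
                    continue
                overlap = len(package_users[p] & users_q)
                if overlap:
                    scores[q] += overlap / sqrt(len(package_users[p]) * len(users_q))
        return sorted(scores, key=scores.get, reverse=True)[:top_n]

    interactions = {
        "alice": {"flask", "requests", "gunicorn"},
        "bob": {"flask", "django", "requests"},
        "carol": {"django", "requests"},
    }
    print(recommend(interactions, "carol"))   # ['flask', 'gunicorn']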

BIBLIOGRAPHY

[1] Caitlin Sadowski, Kathryn T Stolee, and Sebastian Elbaum. How developers search for code: a case study. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, pages 191–201. ACM, 2015.

[2] Asimina Zaimi, Noni Triantafyllidou, Androklis Mavridis, Theodore Chaikalis, Ignatios Deligiannis, Panagiotis Sfetsos, and Ioannis Stamelos. An empirical study on the reuse of third-party libraries in open-source software development. In Proceedings of the 7th Balkan Conference on Informatics Conference, page 4. ACM, 2015.

[3] Andy Lynex and Paul J Layzell. Organisational considerations for software reuse. Annals of Software Engineering, 5(1):105–124, 1998.

[4] David C Rine and Robert M Sonnemann. Investments in reusable software. a study of software reuse investment success factors. Journal of systems and software, 41(1):17–32, 1998.

[5] Python Software Foundation. PyPI. https://pypi.python.org/pypi.

[6] Taichino. PyPI ranking. http://pypi-ranking.info/alltime, 2012.

[7] Steven P Reiss. Semantics-based code search. In Software Engineering, 2009. ICSE 2009. IEEE 31st International Conference on, pages 243–253. IEEE, 2009.

[8] Nullege: a search engine for Python source code. http://nullege.com/.

[9] Iulian Neamtiu, Jeffrey S Foster, and Michael Hicks. Understanding source code evolution using abstract syntax tree matching. ACM SIGSOFT Software Engineering Notes, 30(4):1–5, 2005.

[10] Themistoklis Diamantopoulos and Andreas L Symeonidis. Employing source code information to improve question-answering in stack overflow.

[11] ast: Abstract Syntax Trees. https://docs.python.org/2/library/ast.html.

[12] Bernard J Jansen and Amanda Spink. How are we searching the world wide web? a comparison of nine search engine transaction logs. Information Processing & Management, 42(1):248–263, 2006.

BIOGRAPHICAL SKETCH

My name is Shiva Krishna Imminni and I was born in the metropolitan city of Hyderabad, India. My father, Mr. Nageswara Rao, is a government employee and my mother, Mrs. Subba Laxmi, is a homemaker. They are my biggest inspiration and support. I am the elder of my parents' two children. My sister, Ramya Krishna Imminni, is very close to my heart and is very special to me. My family is the guiding force behind the success I have had in my career. I received my Bachelor's degree from Jawaharlal Nehru Technology University in May 2011 and joined FactSet Research Systems as a QA Automation Analyst. At FactSet, I wrote QA Automation scripts in various languages like , Ruby and Jscript, and worked on automation frameworks like TestComplete and Selenium. I was one among the first three employees hired for the QA Automation process, so I had many opportunities to try various job roles and experiment with new technologies. Out of all the job roles I took on, I liked training new hires the most. I was promoted to QA Automation Analyst 2 in a short span of 1 year and awarded Star Performer for the year 2013. It was at FactSet that I developed Testlogger, a Ruby library to generate log files, custom built for QA terminology like and . I worked at FactSet for 2 years, from November 2011 to December 2013, and gained diverse experience performing various roles. I joined the Department of Computer Science at Florida State University as a Master of Science student in Spring 2013. At FSU I have continued to gain professional experience working part time as a Software Developer and Graduate Research Assistant at iDigInfo, the Institute of Digital Information and Scientific Communication. At iDigInfo, I worked on various projects related to research in specimen digitization. Some of these projects include Morphbank, a continuously growing database of images that scientists use for international collaboration, research and education, and iDigInfo-OCR, an optical character recognition software for digitizing label information of specimen collections. I also worked as a Graduate Teaching Assistant for a Bachelor level Software Engineering course. As a part of my course work, I took a Python course under Dr. Piyush Kumar that led to my interest in working on PyQuery. The experience I gained while working on PyQuery helped me get an internship opportunity: during the Summer of 2015, I interned with Bank of America, where I worked on various technologies related to Big Data, including Hadoop HDFS, Hive and Impala.
