
SOLVING THE SEARCH FOR SOURCE CODE

by

Kathryn T. Stolee

A DISSERTATION

Presented to the Faculty of

The Graduate College at the University of Nebraska

In Partial Fulfilment of Requirements

For the Degree of Doctor of Philosophy

Major: Computer Science

Under the Supervision of Professor Sebastian Elbaum

Lincoln, Nebraska

August, 2013

SOLVING THE SEARCH FOR SOURCE CODE

Kathryn T. Stolee, Ph.D. University of Nebraska, 2013

Adviser: Sebastian Elbaum

Programmers frequently search for source code to reuse using keyword searches. When effective and efficient, a code search can boost programmer productivity; however, the search effectiveness depends on the programmer’s ability to specify a query that captures how the desired code may have been implemented. Further, the results often include many irrelevant matches that must be filtered manually. More semantic search approaches could address these limitations, yet existing approaches either do not scale, are not flexible enough to find approximate matches, or require complex specifications. We propose a novel approach to semantic search that addresses some of these limitations and is designed for queries that can be described using an example. In this approach, programmers write lightweight specifications as inputs and expected output examples for the behavior of desired code. Using these specifications, an SMT solver identifies source code from a repository that matches the specifications. The repository is composed of program snippets encoded as constraints that approximate the semantics of the code. This research contributes the first work toward using SMT solvers to search for existing source code. In this dissertation, we motivate the study of code search and the utility of a more semantic approach to code search. We introduce and illustrate the generality of our approach using subsets of three languages: Java, Yahoo! Pipes, and SQL. Our approach is implemented in a tool, Satsy, for Yahoo! Pipes and Java. The evaluation covers various aspects of the approach, and the results indicate that this approach is effective at finding relevant code. Even with a small repository, our search is competitive with state-of-the-practice syntactic searches when searching for Java code. Further, this approach is flexible and can be used on its own, or in conjunction with a syntactic search. Finally, we show that this approach is adaptable to finding approximate matches when exact matches do not exist, and that programmers are capable of composing input/output queries with reasonable speed and accuracy. These results are promising and lead to several open research questions that we are only beginning to explore.

DEDICATION

To my loving spouse.

ACKNOWLEDGMENTS

A tremendous amount of time and effort goes into the work behind and writing of a dissertation. I have been extremely fortunate to have a team of people behind me, providing support and guidance along the way. It would be impossible to list everyone and express exactly how much their support means to me. To my family, friends, and colleagues, thank you. To my husband, Derrick, for his unfailing love and support. To my advisor, Sebastian, for his support, guidance, and friendship. To my committee, Matt, Myra, Gregg, and Kent, for their time, effort, and interest. To my family, Mom, Dad, Colin, Dan, Ann, Lillie, Vivian, and Ethan. To my best friends, Aimee, Stephanie, and Elena. To my UNL colleagues, Brady, Zhihong, Josh, Eric, Javier, Tingting, and Pingyu. To the UNL CSE faculty and staff. To the Lincoln, NE community. Thank you for the past 9 years of hospitality.

GRANT INFORMATION

This work is supported in part by NSF SHF-1218265, NSF GRFP under CFDA-47.076, a Faculty Research Award, and AFOSR #9550-10-1-0406.

Contents

Contents vii

List of Figures xvi

List of Tables xviii

1 Introduction 1
1.1 Contributions ...... 5
1.2 Outline of Dissertation ...... 6

2 Background 8
2.1 Java ...... 9
2.2 Yahoo! Pipes ...... 10
2.3 SQL ...... 12
2.4 Constraint Solvers ...... 13
2.5 Symbolic Execution ...... 14

3 Motivation 18
3.1 User Survey ...... 19
3.1.1 Participants ...... 19
3.1.2 Search Frequency ...... 21
3.1.3 Why do Programmers Search for Source Code? ...... 22
3.1.4 Tools Used in Code Search Activities ...... 22
3.1.5 Summary ...... 24
3.2 State-of-the-Practice Code Search ...... 25
3.2.1 Java ...... 25
3.2.2 Yahoo! Pipes ...... 26
3.2.3 SQL ...... 28
3.3 Input/Output Examples in the Wild ...... 29
3.3.1 Sampling ...... 29
3.3.2 Analysis ...... 30
3.3.3 Results ...... 30
3.4 Summary ...... 33

4 Illustrative Examples 34
4.1 Semantic Search Approach Overview ...... 34
4.2 Java Source Code ...... 36
4.2.1 Refining the Search with Multiple I/O Pairs ...... 37
4.2.2 Mapping Input and Output ...... 37
4.2.3 Partial Matches ...... 38
4.2.4 Using the Solver for Symbolic Variables ...... 41
4.2.5 Poor Matches ...... 42
4.3 Yahoo! Pipes Mashups ...... 43
4.3.1 Assigning Inputs to Fetch Modules ...... 44
4.3.2 Yahoo! Pipes Example: Encoding a Pipe ...... 46
4.3.3 Abstracting String Values ...... 48
4.4 SQL Select Statements ...... 49
4.4.1 Large Specifications ...... 49
4.4.2 Implicit Join on Inputs ...... 50
4.5 Summary ...... 51

5 Approach 52
5.1 Specifying Behavior ...... 52
5.1.1 Specification Refinement ...... 54
5.1.2 Specification Abstraction ...... 56
5.1.3 Specification Encoding ...... 57
5.2 Encoding ...... 58
5.2.1 Program Transformation ...... 59
5.2.2 Mapping Input and Output ...... 64
5.2.3 Abstraction Selection ...... 66
5.2.3.1 Weakening Example in Java ...... 68
5.2.3.2 Weakening Example in Yahoo! Pipes ...... 71
5.3 Solving ...... 72
5.3.1 Matching with Multiple I/O Pairs ...... 73
5.3.2 Ranking ...... 74
5.3.2.1 Path Matching ...... 75
5.3.2.2 Concrete Variable Density ...... 76
5.3.2.3 User Context ...... 78
5.3.3 Ambiguity ...... 78
5.4 Summary ...... 79

6 Implementation 80
6.1 Solving with Satsy ...... 81
6.2 Data Types ...... 82
6.3 Java Implementation Detail ...... 85
6.3.1 Language Support ...... 85
6.3.2 Program Transformation ...... 88
6.3.2.1 JPF Driver Creation ...... 88
6.3.2.2 JPF Listener ...... 89
6.3.2.3 Program Encoding ...... 90
6.3.2.4 Abstraction ...... 93
6.3.2.5 Limitations and Nuances ...... 93
6.3.3 Constraint Storage ...... 94
6.4 Yahoo! Pipes Implementation Detail ...... 95
6.4.1 Language Support ...... 95
6.4.2 Program Transformation ...... 97
6.4.2.1 List Implementation ...... 97
6.4.2.2 Program Encoding ...... 99
6.4.2.3 Abstraction ...... 101
6.4.3 Constraint Storage ...... 102
6.5 Summary ...... 102

7 Java Evaluation 104
7.1 Java Study 1 ...... 106
7.1.1 RQ2(a): Comparison to State-of-the-Practice ...... 107
7.1.1.1 Artifacts ...... 108
7.1.1.2 Metrics ...... 111
7.1.1.3 Implementation ...... 112
7.1.1.4 Results ...... 112
7.1.2 RQ2(b): Augmenting the State-of-the-Practice ...... 116
7.1.2.1 Artifacts ...... 117
7.1.2.2 Metrics ...... 117
7.1.2.3 Implementation ...... 118
7.1.2.4 Results ...... 118
7.2 Java Study 2 ...... 120
7.2.1 Design ...... 122
7.2.1.1 Treatment Structure ...... 123
7.2.1.2 Experimental Design ...... 123
7.2.2 Artifacts ...... 124
7.2.2.1 Repository ...... 124
7.2.2.2 Programming Tasks ...... 125
7.2.2.3 Query Generation ...... 129
7.2.2.4 Snippet Retrieval ...... 130
7.2.3 Metrics ...... 134
7.2.4 Implementation ...... 135
7.2.4.1 Tasks ...... 135
7.2.4.2 Deployment ...... 137
7.2.4.3 Participants ...... 137
7.2.5 Results ...... 138
7.2.5.1 On Types of Problem ...... 142
7.2.5.2 MTurk participants ...... 142
7.2.5.3 On the Impact of Ranking on Results ...... 143
7.2.5.4 On the Influence of Repository Size on Results ...... 150
7.3 Summary ...... 152

8 Yahoo! Pipes Evaluation 153
8.1 Yahoo! Pipes Study 1 ...... 155
8.1.1 Artifacts ...... 155
8.1.2 Metrics ...... 157
8.1.3 Implementation ...... 158
8.1.4 Results ...... 158
8.1.4.1 RQ2(d): Impact of Varying Abstraction Levels ...... 160
8.1.4.2 RQ2(e): Impact of Solver Time on Search Effectiveness ...... 161
8.2 Yahoo! Pipes Study 2 ...... 161
8.2.1 Design ...... 162
8.2.1.1 Yahoo! Pipes ...... 163
8.2.1.2 SQL ...... 164
8.2.2 Task Creation ...... 166
8.2.2.1 Object and Output Creation ...... 167
8.2.2.2 Artifact Selection ...... 169
8.2.3 Participants ...... 170
8.2.4 Implementation ...... 171
8.2.4.1 Classroom ...... 171
8.2.4.2 Mechanical Turk ...... 171
8.2.5 Metrics ...... 172
8.2.6 Results ...... 173
8.2.6.1 Yahoo! Pipes ...... 173
8.2.6.2 SQL ...... 175
8.2.6.3 Comparing Results ...... 176
8.3 Summary ...... 178

9 Threats to Validity 180
9.1 Internal ...... 181
9.1.1 Selection Bias ...... 181
9.1.2 Instrumentation ...... 182
9.1.3 Resentful Demoralization ...... 183
9.1.4 History ...... 184
9.2 External ...... 184
9.2.1 Interaction of Selection and Treatment ...... 184
9.2.2 Interaction of Setting and Treatment ...... 185
9.3 Construct ...... 187
9.3.1 Mono-method Bias ...... 187
9.3.2 Mono-operation Bias ...... 188
9.3.3 Hypothesis Guessing ...... 188
9.4 Conclusion ...... 188
9.4.1 Random Irrelevancies in Experimental Setting ...... 189
9.4.2 Reliability of Measures ...... 189
9.4.3 Reliability of Treatment Implementation ...... 190

10 Related Work 191
10.1 Code Search and Repository Analysis ...... 191
10.1.1 Code Search ...... 191
10.1.1.1 Syntactic Code Search ...... 192
10.1.1.2 Semantic Code Search ...... 193
10.1.2 Studies on Repositories ...... 195
10.1.3 IDE Programming Support ...... 196
10.2 Program Analysis ...... 197
10.2.1 Verification and Validation ...... 197
10.2.2 Program Generation and Program Synthesis ...... 198
10.2.3 Code Reuse ...... 200

11 Discussion 201
11.1 Empirical Studies ...... 202
11.1.1 Characterizing Code Search Behavior ...... 203
11.1.2 Feasibility of Input/Output Examples as Queries ...... 204
11.1.3 Alternative Query Models ...... 205
11.1.4 Delivering the Search to the Programmer ...... 207
11.2 Extending the Approach ...... 208
11.2.1 Solver-Guided Specifications ...... 208
11.2.2 Program Composition ...... 209
11.2.3 Abstraction ...... 210
11.2.4 Language Support ...... 210
11.2.5 Performance ...... 211
11.2.6 Multiple I/O Satisfiability ...... 212
11.2.7 Ranking ...... 214
11.3 Other Applications ...... 214
11.3.1 Repository Evolution ...... 215
11.3.2 Refactoring ...... 215
11.3.3 Automated Patching ...... 215
11.3.4 Test-driven Development ...... 216
11.4 Conclusion ...... 216

A Encoding Details 218
A.1 Java ...... 218
A.2 Yahoo! Pipes ...... 226

B User Quasi-Experiment and Study on Search Habits 230
B.1 User Survey ...... 230
B.2 Additional Survey Results ...... 231
B.3 Qualification Test ...... 233
B.4 Informed Consent IRB# 20120212388 EX ...... 235

C Search Results Study 238
C.1 Qualification Test ...... 238
C.2 Informed Consent IRB# 20130513551 EX ...... 240
C.3 R Analysis Details ...... 243

Bibliography 245

List of Figures

2.1 Yahoo! Pipes Example Method ...... 10
2.2 Example Record/Item from an RSS Feed (author anonymized) ...... 11
2.3 Symbolic Execution of Program in Listing 2.1 ...... 15

4.1 High-Level View of Semantic Search with Input/Output ...... 35
4.2 Yahoo! Pipes Input/Output Example ...... 44
4.3 Possible Solutions to Input/Output Example in Fig 4.2 ...... 45
4.4 Mapping the Pipe in Figure 2.1 onto Constraints ...... 46

5.1 Detailed View of Approach ...... 53
5.2 Symbolic Execution of Program in Listing 5.7 ...... 60
5.3 Type Lattice Example ...... 67
5.4 Example of Yahoo! Pipes Filter Module ...... 72
5.5 Lattice Encoding Example of Yahoo! Pipes Filter Module using Figure 5.3 ...... 72

6.1 Satsy Solving Overview ...... 81
6.2 Java Encoding Overview ...... 86
6.3 Java Supported Grammar (intra-procedural) ...... 87
6.4 Yahoo! Pipes Encoding Overview ...... 96
6.5 Yahoo! Pipes Supported Grammar ...... 97
6.6 Type Lattice for Yahoo! Pipes ...... 101

7.1 RQ2(a) Workflow ...... 107
7.2 RQ2(b) Workflow ...... 117
7.3 RQ2(c) Workflow ...... 121
7.4 Query Generator Instructions ...... 130
7.5 Query Generator Packet Page for Q1 ...... 131
7.6 Basic Task: Code Snippet and Related Questions ...... 136
7.7 Problem Group: Problem Description and I/O Example ...... 136
7.8 Impact of Changes in Repository Size on Matches, Step size = 0.01, Reps = 5 ...... 151
8.1 Task 4 in Yahoo! Pipes Study ...... 164
8.2 SQL Input/Output Specification ...... 165
8.3 Task 6 in SQL Study ...... 166

List of Tables

3.1 Summary of Participants (Question 2) ...... 21
3.2 Programming and Search Frequency (Question 4, 6) ...... 21
3.3 Uses of Matched Code (Question 10) ...... 22
3.4 Where do Programmers Search for Sample Code? (Question 7) ...... 23
3.5 State-of-the-Practice Search by URL Query on Global Repository ...... 27
3.6 Question type categories in StackOverflow. The numbers in parentheses represent the questions with textual or input/output examples. ...... 30

5.1 Path Matching Ranking ...... 76

6.1 Sample Java Implementation Details ...... 91
6.2 Sample Yahoo! Pipes Encoding ...... 100

7.1 Java Evaluations for RQ2 ...... 105
7.2 Output Types for Java Repository ...... 109
7.3 Instances of API Calls in Java Repository for Study 1 ...... 109
7.4 Java Artifacts Specifications for RQ2(a) ...... 110
7.5 Java Results for RQ2(a) ...... 113
7.6 Java Results for RQ2(b) ...... 119
7.7 Dimension 1: Input/output data types ...... 126
7.8 Programming Tasks for Query Generators ...... 127
7.9 Dimension 2: Input/output Relationships ...... 128
7.10 Allocation of query generators to search approach and problem ...... 132
7.11 P@10 Response Matrix ...... 139
7.12 Dependent Variable: P@10, test of between subjects effects ...... 141
7.13 Test of Means among Search Algorithms ...... 141
7.14 Average Relevance per Rank for Search Approaches ...... 144
7.15 Path Matching Ranking Buckets ...... 145
7.16 Results From Repository by Bucket ...... 146
7.17 Results From Repository by Bucket, Relaxed ...... 147
7.18 Path Matching Bucket Counts and Relevance ...... 148
7.19 Hypothetical P@10 Response Matrix for Satsy Under New Ranking ...... 149

8.1 Specifications for Example Pipes ...... 157
8.2 Pipe Specification Search Results ...... 159
8.3 User Study Design ...... 163
8.4 Cost of Providing Specifications, Pipes and SQL ...... 168
8.5 Input/Output Sizes for Experimental Tasks ...... 168
8.6 Cost of Providing Specifications, Pipes and SQL ...... 174
8.7 Differences in Results Based on Implementation. H0: µmt = µs ...... 177

A.1 Java Implementation Details (part 1) ...... 220
A.2 Java Implementation Details (part 2) ...... 221
A.3 Java Implementation Details (part 3) ...... 222
A.4 Java Implementation Details (part 4) ...... 223
A.5 Java Implementation Details (part 5) ...... 224
A.6 Java Implementation Details (part 6) ...... 225
A.7 Yahoo! Pipes Implementation Details (part 1) ...... 227
A.8 Yahoo! Pipes Implementation Details (part 2) ...... 228
A.9 Yahoo! Pipes Implementation Details (part 3) ...... 229

B.1 Survey Questions ...... 231
B.2 Programming Languages Used by Participants (Question 5) ...... 232
B.3 Search Methods Used by Participants (Question 8) ...... 232
B.4 Search Methods Used by Participants - Other (Question 8) ...... 232

Chapter 1

Introduction

Consider a novice Java programmer who is trying to find a snippet of code that extracts an alias from an e-mail address (also known as a username; the part before the @ symbol). The programmer turns to Google, which our survey (Section 3.1) and others [59] show is the most common approach, and issues a search query with the following keywords: extract alias from e-mail address in Java. As expected, the query returns millions of results (over 46 million, in fact). None of the top ten results (P@10, a typical IR measure to assess the precision of results [10]) even provide a method for decomposing an e-mail address into parts, which is the first step towards extracting the alias. Now, if the programmer is knowledgeable enough about the domain to refine the query with the term substring, then the top ten results include two relevant solutions. Despite the simplicity of the programming task, this illustrates a common situation for programmers. Following a search for code to reuse or to use as an example, programmers must sift through many irrelevant results, especially when the desired behavior cannot be tied to source code syntax or documentation. As repositories of source code grow in size and diversity, and as programmers continue to turn to search during development [57], finding relevant code becomes increasingly important.

We have designed an approach to code search that addresses many of the shortcomings of existing search techniques, most notably by allowing programmers to describe what they want their code to do rather than how it is implemented, returning only relevant results in the search, and finding close enough solutions when no exact solutions exist. The general idea is that programmers provide concrete behavioral specifications as inputs and outputs, and an SMT solver identifies available source code, encoded as constraints, that matches the specifications. This approach is designed to be effective for a subset of code search tasks, that is, those that can be expressed by an input and output example. The underlying assumption is that a non-trivial number of code search queries have this property, and we provide some evidence of this in Section 3.3. With the previous example of searching for a program that extracts the alias from an e-mail address, the input could be the string “[email protected]” and the output the string “susie”. While this form of query changes the search model from the common keyword query, it lets the programmer specify the desired behavior without the need to know how to achieve a certain outcome, just what that outcome is.

This example-based specification model is inspired by two lines of work: programming by example (or programming by demonstration) and program synthesis. Some programming by example approaches aim to generate programs for tasks that are demonstrated by example [12], such as providing a string before and after a transformation. Yet, the programs that can be generated through those approaches are limited and must follow well-defined templates or sequences. For example, in the TELS text editing by example system [76], programs are generated by recording and generalizing editing actions on strings in a text editor. For generalizations that involve string constants, they use a rigid hierarchy based on common subsequences. Other properties of strings, such as length or case-insensitive equality, are not included in the hierarchy and thus would not be part of a generated program. In our work, we find existing code that performs a relevant transformation, which promotes reuse.

The other line of work, program synthesis, aims to generate programs that match a provided input/output example, and program generation is guided by a constraint solver. Similar to programming by demonstration, this approach also relies on predefined functions and templates to guide the solver in finding a solution. The solver will try every possible combination of functions and templates to achieve the desired behavior, which can be quite time consuming, even for small programs (e.g., in toy problems, insertion and deletion on graphs can take several minutes to resolve [60]). Our approach is not restricted to predefined functions and templates, allowing us to return code that may be too complex for a code synthesizer to generate efficiently, and it also promotes reuse (we discuss more related work in Section 10.2).

Just like most other search engines, our search approach involves three phases: specifying a query, indexing entities (i.e., source code), and matching the query to relevant entities. We discuss each phase in the context of our approach to code search.

Specifying a query is performed by the programmer, the one who wants to find code for a particular task. In our approach, a programmer writes a concrete behavioral specification in the form of input and output, such as “[email protected]” and “susie” from the previous example.

Indexing source code is an automated process performed offline. This process involves no input from the programmer. Our indexing is unique in that it requires a transformation that involves symbolic analysis [7, 8, 36] to map a program’s semantics onto constraints that summarize the program behavior. For example, the indexing process would map the Java snippet:

1 int upper = input.indexOf('@');
2 String output = input.substring(0, upper);

onto the following constraints (roughly):

c1. (assert ((input.charAt(upper) = '@') ∨ (upper = -1)))
c2. (assert (for (0 ≤ i < upper) input.charAt(i) ≠ '@'))
c3. (assert (for (0 ≤ i < upper) output.charAt(i) = input.charAt(i)))

Constraints c1 and c2 represent line 1 of the source code. The first constraint, c1, defines upper as the location of '@' in input or -1, and c2 asserts that upper is the first index of '@' in input, per the semantics of the indexOf method in java.lang.String. (The value of upper must be -1 only in the event that '@' is not a character of input. Some additional constraints, not shown here for brevity, are required to prevent upper from defaulting to -1, which would make this system of constraints trivially satisfiable.) Constraint c3 represents the second line of source code. It asserts that the output matches input within the bounds of 0 and upper, per the semantics of the substring method in java.lang.String. This is the basic process by which our approach indexes programs: mapping program semantics to constraints by evaluating each program statement. The constraints are generated automatically, a process described in Section 5.2. The constraints are never shown to the programmer, but rather are consumed by the solver during the search process to identify viable matches.

Matching is now possible, given the input/output query provided by a user and an encoded repository of programs generated by symbolic analysis. The first step in this phase consists of transforming the provided input/output into additional constraints. For the previous example, that would be:

c4. (assert (input == "[email protected]"))
c5. (assert (output == "susie"))

The second step consists of pairing the input/output constraints with each of the programs indexed in the repository, and using an SMT solver to identify which pairs are satisfiable and hence constitute a match. For our email alias example, an SMT solver would return sat for the conjunction of the snippet encoded through constraints (c1 ∧ c2 ∧ c3) and the input/output encoded through constraints (c4 ∧ c5), indicating that the code indeed matches the specification. Contrastingly, if the specified output was instead “mail.com” (the programmer meant to identify the e-mail domain instead of the alias), the SMT solver would return unsat, indicating that the code does not match the specification.
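To make the matching step concrete, this check can be written in the SMTLIB2 format that off-the-shelf solvers accept (Section 2.4). The following is a simplified, hand-written sketch rather than the exact encoding Satsy generates; in particular, the solver's built-in string operators str.indexof and str.substr stand in for constraints c1 through c3:

(declare-const input String)
(declare-const output String)
(declare-const upper Int)
; snippet constraints (c1-c3, approximated with built-in string operators)
(assert (= upper (str.indexof input "@" 0)))
(assert (= output (str.substr input 0 upper)))
; input/output constraints (c4 and c5)
(assert (= input "[email protected]"))
(assert (= output "susie"))
(check-sat)

A string-capable solver such as Z3 reports sat for this script; replacing “susie” with “mail.com” in the last assertion yields unsat instead.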

This illustrates the essence and novelty of the approach, but it does not address some critical issues such as the broader applicability of the approach to other domains, handling richer specification models required by diverse domains, and refining the set of potential matches. In this work, we address these issues. We formally define our approach to semantic search. For two domains, the Java String library and Yahoo! Pipes mashup programs, we instantiate and assess the approach. An additional domain is included for motivation and approach illustration, SQL select statements (Chapter 4). These domains were selected in part to illustrate the generality of the approach by utilizing three diverse forms of input/output specification, and in part because of their relatively simple and common underlying semantics and the availability of repositories that could be searched for evaluation (Chapter 7, Chapter 8).

1.1 Contributions

The contributions of this work are:

• Characterization of how developers use search to find code based on a survey of 100 participants (Section 3.1)

• Definition of a novel approach to semantic code search that uses an SMT solver to identify matches given lightweight specifications and programs encoded as constraints using symbolic analysis (Section 5)

• Illustration of the applicability of this approach and a range of potential specification models covering three languages: Java, Yahoo! Pipes, and SQL (Section 4)

• Implementation of the approach in two domains: Java string library and Yahoo! Pipes (Section 6)

• Empirical evaluation of how our search approach compares to, and can improve on, the state-of-the-practice code searches in Java (Section 7)

• Assessment of the effectiveness and efficiency of the approach given various levels of abstraction and measuring precision and recall in Yahoo! Pipes (Section 8.1)

• Assessment of the cost and accuracy of formulating output from an input in Yahoo! Pipes and SQL through a controlled quasi-experiment (Section 8.2)

1.2 Outline of Dissertation

The rest of this dissertation is organized as follows. Chapter 2 provides background information on the domains used in the evaluation and the program analysis techniques used in the transformation process from source code to constraints. To show the importance of research in code search and motivate the use of examples as a query model, we survey programmers and look for evidence of input/output examples in the wild in Chapter 3. To illustrate various aspects of our approach, from the benefits of using multiple specifications to the power of abstraction, Chapter 4 provides several examples in three targeted domains, Java, Yahoo! Pipes, and SQL. Our approach to semantic search is formally defined in Chapter 5, and we describe the implementation details for Java and Yahoo! Pipes in Chapter 6. The evaluation and results are split into two chapters by domain, where we evaluate this approach in the Java domain in Chapter 7 and in Yahoo! Pipes in Chapter 8. Threats to validity are presented in Chapter 9, followed by related work in Chapter 10. A discussion of the implications of this work, future work, and the conclusion are in Chapter 11.

Chapter 2

Background

Our approach to code search combines two areas of software engineering, program analysis and code search. In this section, we introduce the concepts from program analysis required to understand the approach and the domains used for illustration and evaluation throughout the rest of this work. Our approach to code search has been applied in three domains, Java (Section 2.1), Yahoo! Pipes (Section 2.2), and SQL (Section 2.3). We provide brief backgrounds on each of these domains, focusing on how programs are composed in the state-of-the-practice, as code search is intended to be most useful during program composition. The indexing phase of the proposed search approach relies on symbolic execution, a program analysis technique that uses symbols, rather than concrete values, to execute programs (Section 2.5). The solving phase relies on constraint solvers, which we describe in Section 2.4.

2.1 Java

Java [28] is an object-oriented programming language used for a variety of purposes and primarily by experienced programmers. Millions of lines of Java code from open source projects are publicly available to analyze [14], and Java is ranked as the second-most-popular programming language today based on the number of skilled engineers world-wide, courses offered, and third-party vendors [73]. In a survey we conducted of 109 programmers about their search behavior (Section 3.1), Java was the most popular language among the participants, with 63 (58%) identifying Java as one of their most commonly used programming languages. Listing 2.1 provides a simple Java method that takes a string as input and returns that same string minus the last character (i.e., trimmed).

Listing 2.1: Java Example Method
1 static String trimEnd(String input) {
2     String trimmed;
3     if (input.length() > 1) {
4         trimmed = input.substring(0, input.length() - 1);
5     } else {
6         trimmed = input;
7     }
8     return trimmed;
9 }

Java programmers use a variety of tools to support their development practices (e.g., IDEs, Javadoc documentation, tutorials, automated analysis tools, advanced APIs). Programs are composed by typing (or copy-and-pasting) into a text editor, which makes it easy to leverage resources written by others.

2.2 Yahoo! Pipes

Yahoo! Pipes is a visual mashup language that boasts over 90,000 users [32], and a public repository of over 100,000 artifacts [49]. Motivated by the popularity of the Yahoo! Pipes mashup environment and the vast number of artifacts available in a public repository, we chose to target Yahoo! Pipes as part of this work. A mashup is an application that manipulates and composes existing data sources or functionality to create a new piece of data or service. One common type of mashup, for example, consists of grabbing data from some data sources (e.g., house sales, vote records, bike trails, map data), joining those data sets, filtering them according to a criterion, and plotting them on a map published at a site [77]. This type of behavior is naturally expressed in Yahoo! Pipes. To write a program in Yahoo! Pipes, programmers use the Pipes Editor, a visual development environment accessible through the browser. Pipes are composed by dragging and dropping predefined modules and connecting them with wires to define the data and control flow. Figure 2.1 shows a sample pipe mashup that joins and filters the content of two RSS feeds based on the word “tennis”.

Figure 2.1: Yahoo! Pipes Example Method

Field        Value
Title        Your Local Doppler Radar
Description  This map shows the location and intensity of precipitation in your area. The color of the precipitation corresponds to the rate at which it is falling. This map is updated every 15 minutes.
Author       Peter Sagal
Link         http://www.weather.com/weather/map/93012
Date         Fri Jan 13 11:15:22 CST 2012

Figure 2.2: Example Record/Item from an RSS Feed (author anonymized)

The structure of a pipe resembles a graph, where the nodes are referred to as modules (boxes in the figure), and the edges are referred to as wires (connections between the modules). Visually, mashup modules appear as the boxes in Figure 2.1 and wires appear as the lines between the modules. The behavior of the pipe can be best understood from top to bottom, as the data flows in a directional manner from the top of the pipe through the output at the bottom. In the example, there are two URLs, one in each Fetch Feed module, which act as inputs to the pipe. The output from those modules is concatenated by the Union module, and then a filter operation is performed to retain items with the word “tennis” in the description (the Filter module). Every pipe has exactly one output, shown at the bottom of the pipe and labeled Pipe Output. The output of a module is a combined and modified list of items from the RSS feeds provided as input. The URLs access RSS feeds, and a pipe can have one or more URLs as input. An RSS feed produces a list of items. Each item is a map data structure with key-value pairs. For illustration, an example item, also called a record, gathered from an RSS feed is shown in Figure 2.2. The keys are Title, Description, Link, Author, and Date. The first four keys map to values of type string, and the last key maps to an integer (i.e., the date is converted to an integer).

Yahoo! Pipes programs are executed on Yahoo’s servers and the output is provided in the browser. The output of a Pipes program can be integrated into a webpage or an RSS feed aggregator.

2.3 SQL

SQL [65] statements are programs that support database management operations (e.g., insertion, deletion, retrieval). It is ranked 17th in popularity among programming languages according to a recent study based on the number of skilled engineers world-wide, courses offered, and third-party vendors [73]. SQL select statements have been used for decades to support data retrieval, operating on their own or being embedded into other programming languages. For example, consider the following database table containing contact information for club members and their join date:

Name         | Date       | email
-------------+------------+------------------
Alice Bower  | 2008-08-19 | [email protected]
Carl Dent    | 2008-10-10 | [email protected]
Eleanor Fae  | 2008-08-19 | [email protected]
Greg Happ    | 2008-08-19 | [email protected]

The following SQL select statement will extract only those participants who have e-mail domains @unl.edu:

SELECT * FROM members WHERE email LIKE ‘%@unl.edu’;

yielding the following result:

Name         | Date       | email
-------------+------------+------------------
Alice Bower  | 2008-08-19 | [email protected]
Greg Happ    | 2008-08-19 | [email protected]

2.4 Constraint Solvers

In our search approach, matching is performed by a Satisfiability Modulo Theories (SMT) solver, which is a specific type of constraint solver. The purpose of an SMT solver is to determine if a logical formula is satisfiable. That is, the solver looks for an assignment of values to variables such that a formula is true.

To illustrate, consider the following example with four predicates including linear arithmetic and inequalities. Let the variables be integers, a, b, c ∈ Z, subject to the following constraints:

(a ≥ 0) ∧ (b = 2) ∧ (c = 3) ∧ (a + b = c)

An SMT solver would determine that this formula is satisfiable when a ↦ 1 (this reads, “a maps to 1”). The output of a solver has three potential values, sat, unsat, and unknown. In this case, sat is returned since the formula is satisfiable. One comment about this formula is that it is rather tightly constrained in that there is only one value for a such that the formula is satisfiable. However, a more flexible formula can be created and given to the solver, such as the following (the second predicate is removed):

(a ≥ 0) ∧ (c = 3) ∧ (a + b = c)

Here, the domain of a is constrained to be non-negative (a ∈ Z ∧ a ≥ 0), but the domain of b is unconstrained within the integers (b ∈ Z). There are many possible solutions, for example, a ↦ 1 ∧ b ↦ 2, and a ↦ 2 ∧ b ↦ 1. Thus, the solver would return sat. However, the solver needs to find only one solution to return sat.

In the event that no solution exists, such as in the following formula, the solver returns unsat. This means there is a conflict among the constraints and there does not exist an assignment of values to the variables such that the formula is satisfiable:

(a = 0) ∧ (b = 2) ∧ (c = 3) ∧ (a + b = c)

The change from the first formula is in the first predicate, where a = 0, and now there is a conflict since 0 + 2 ≠ 3. Thus, unsat is returned. A third output of a solver is unknown, which can occur when a constraint system is too complex for the solver, or when finding a solution takes too long and the solver is stopped preemptively.

SMT solvers are similar to SAT solvers, but instead of boolean variables, SMT formulas can contain predicates from other theories, such as the linear arithmetic just shown, and additionally uninterpreted functions. Many off-the-shelf solvers are available for use (e.g., Z3 [79], Yices [78], CVC3 [11]), and most accept a common format for the constraint definition, the SMTLIB2 format [61]. An example of this format is shown in Listing 6.1. In this work, we translate source code semantics and concrete input/output examples into predicates that can be consumed and checked by an SMT solver. That is, given an input/output example and an encoded program, the solver will return sat if the program meets the specification.
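As a concrete illustration, the first formula above can be written in the SMTLIB2 format just mentioned. The following is a minimal, hand-written sketch of that encoding (Listing 6.1 shows the format as it is used by our implementation):

(declare-const a Int)
(declare-const b Int)
(declare-const c Int)
(assert (>= a 0))
(assert (= b 2))
(assert (= c 3))
(assert (= (+ a b) c))
(check-sat)   ; sat
(get-model)   ; e.g., a = 1, b = 2, c = 3

Changing the first assertion to (assert (= a 0)) reproduces the unsatisfiable variant, for which check-sat reports unsat.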

2.5 Symbolic Execution

Symbolic execution is the process of executing a program with symbolic, rather than concrete, values. It traverses the paths of a program while accumulating path conditions to describe the potential behavior of the program. This technique was first proposed in the 1970s [7, 8, 36], but has come to see greater popularity and application with recent advances in computing technology [6]. In our approach, we use symbolic execution to collect path constraints to assist with the encoding process (Section 5.2).

To give an idea of how symbolic execution works, consider the program in Listing 2.1. There is exactly one predicate, leading to two path conditions. The predicate comes from the if-statement on line 3. There are two string variables in the program, input and trimmed. One, input, has a symbolic value after line 2 since its value is never defined concretely. We show the symbolic execution tree in Figure 2.3, where the transitions are labeled with program control points.

Figure 2.3: Symbolic Execution of Program in Listing 2.1

At the start of the program (lines 1-2), variable input is symbolic in the domain of Strings and trimmed is null. This is shown at the top of the tree where trimmed : null, input : I appear, indicating that input holds a free, or symbolic, value. For program arguments (i.e., input), the developer can choose which are symbolic or concrete; in this analysis, it is left symbolic. The path condition (PC) is true since all executions of the program will get to line 2. Line 3 holds a predicate, on which the tree branches. On the left side is the true branch when I.length() > 1 (I, holding

the symbolic value for input), and on the right is the false branch where I.length() ≤ 1. The path conditions are updated as a conjunction of true and the new branching condition. This branching stage also checks for path infeasibility with a decision procedure so that not all paths need to be explored and unreachable code can be ignored. A constraint solver, like the ones introduced in Section 2.4, is often used for this stage. Lines 4 and 6 in the program restrict the value of trimmed. This is reflected after the 4 and 6 transitions in the tree, where trimmed : I.substring(0, I.length() − 1) after line 4 or trimmed : I after line 6. Both branches of the tree end at the return statement on line 8; nothing changes in the path condition or the variable domains. The path condition at the end holds only symbolic values as the value of input is free throughout the analysis. In the encoding phase of our approach to semantic search, these paths and domain restrictions on the variable values are leveraged when translating program source code into constraints for a solver. For example, given the execution tree, Listing 2.2 shows the variables used, path condition, and executed statements corresponding to the left branch of the tree in Figure 2.3, and Listing 2.3 corresponds to the right branch.

Listing 2.2: Path #1 for Listing 2.1
1 String input;
2 String trimmed;
3 pc1: input.length > 1
4 trimmed = input.substring(0, input.length() - 1);
5 return trimmed;

Listing 2.3: Path #2 for Listing 2.1
1 String input;
2 String trimmed;
3 pc2: input.length <= 1
4 trimmed = input;
5 return trimmed;

Since input is never defined in Listing 2.2 and Listing 2.3, this preserves the fact that it is symbolic from the perspective of the solver. Thus, input and I are synonymous in the path encoding. Each of these paths would be encoded as conjunctions of constraints, and the disjunction of the encoded paths represents the behavior of the original program. In our approach, we are able to use path conditions computed through symbolic execution to define the relationships among variables in a program and the potential values variables can hold under different executions of the program.
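To connect these paths to the solving step described in Section 2.4, the sketch below shows one way the constraints for Path #1 could be checked against a candidate input/output pair using the SMTLIB2 string theory. It is a simplified, hand-written illustration with a hypothetical pair (“abc”, “ab”), not the exact constraints our encoder emits:

(declare-const input String)
(declare-const trimmed String)
; path condition and effect of Path #1 in Listing 2.2
(assert (> (str.len input) 1))
(assert (= trimmed (str.substr input 0 (- (str.len input) 1))))
; a candidate input/output pair
(assert (= input "abc"))
(assert (= trimmed "ab"))
(check-sat)   ; sat: this path matches the pair

Encoding Path #2 in the same way and taking the disjunction of the two path encodings yields constraints that approximate the behavior of the whole method.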

Chapter 3

Motivation

Searching for source code is a common task among programmers, but the tools, approaches, and uses for found code vary widely and are often tightly coupled with the programming language and developer context. In this section, we motivate the need for further research in code search and provide evidence for the utility of an input/output query model. To motivate research in code search, we present the results of a survey that aimed to gain a better understanding of how and why programmers search for source code (Section 3.1). Next, we illustrate how code search manifests itself in the state-of-the-practice for our targeted domains, Java, Yahoo! Pipes, and SQL (Section 3.2). Last, we motivate the use of an input/output query model for code search. In our proposed approach, programmers write queries as examples, specifying the input and output of their desired code (Section 3.3). This addresses some of the shortcomings of the state-of-the-practice textual query model. To motivate this shift, we explore questions asked in online help forums, observing the frequency of input/output examples within the questions.

3.1 User Survey

Previous work has studied the question of how and why programmers search for source code [59], with a survey that focused on graduate student behavior. To confirm the findings and understand more about the tools used in code search, we conducted a survey with similar goals. Additionally, we targeted different populations: undergraduate students and people with programming knowledge on Amazon’s Mechanical Turk marketplace [44]. To characterize how and why programmers search for source code, we designed a survey with the goal of addressing the following research question, RQ1, which we broke down further into three sub-questions:

RQ1: How and why do programmers search for source code?

RQ1(a): How frequently do programmers search for code?

RQ1(b): Why do programmers search for source code?

RQ1(c): Which tools do programmers use to search for code?

The survey contained ten questions related to the search habits of the participant (details are found in Table B.1 in Appendix B). The first five questions aim to profile the survey participants. Survey question 6 addresses RQ1(a), questions 7, 8, and 9 pertain to RQ1(c), and question 10 addresses RQ1(b).

3.1.1 Participants

We targeted two populations with this survey: students in two undergraduate classes at the University of Nebraska-Lincoln and workers on Mechanical Turk. The survey was administered prior to a quasi-experiment discussed in Section 8.2.

Mechanical Turk [44] is a service hosted by Amazon that allows people to reach and compensate others to complete tasks that require human input, such as tagging images or answering survey questions. It hosts the tasks, manages payment, and makes the tasks accessible to a large and existing workforce. There are two roles in Mechanical Turk, a requester and a worker. The requester is the creator of a human intelligence task, or HIT, which is intended to be a small, goal-oriented task. The worker is the one who completes the HIT. A HIT may or may not have prerequisites for the user, which are referred to as qualifications. Completed tasks are returned to the requester for evaluation. If a requester is dissatisfied with submitted work, they hold the right not to pay the worker.

With the Mechanical Turk population, we used a qualification test to help control the quality of responses in an unknown population (see Appendix B). This contained programming questions that required correct responses in order for respondents to participate in the study. These questions, shown in Appendix B.3, ensured that participants had sufficient knowledge of the Yahoo! Pipes and SQL programming languages.

In total, we received valid responses from 109 participants.¹ Of those, 43 (40%) came from junior/senior undergraduate classes at UNL while the remaining 60% came from Mechanical Turk. The breakdown across participant group and gender is shown in Table 3.1.

In a question about programming experience, eight (7%) participants self-reported to have no programming experience,² 19 (17%) had less than two years of experience, 53 (49%) had between two and five years of experience, and 29 (27%) had more than five years of programming experience. The most common languages used by participants were object-oriented languages like Java and C/C++. Database or spreadsheet languages are also popular, with SQL and Excel being the 4th and 5th most common languages listed (a complete list of languages can be found in Appendix B.2).

¹ Three participants were removed from the pool on account of inconsistent responses. These participants claimed to program weekly but search for source code daily, and this seemed suspicious.
² The participants with no programming experience were retained since the subsequent quasi-experiment performed by the participants dealt with end-user programming languages. Even so, these participants passed the qualification test questions.

Male Female Total Embedded Systems class at UNL 25 2 27 Software Engineering class at UNL 14 2 16 Mechanical Turk 44 22 66 Total 83 26 109

Table 3.2: Programming and Search Frequency (Question 4, 6)

Activity Daily Weekly Monthly Never Programming 43 49 8 9 Code Search 25 53 22 9

53 (49%) had between two and five years of experience, and 29 (27%) had more than five years of programming experience. The most common languages used by participants were object-oriented languages like Java and C/C++. Database or spreadsheet languages are also popular, with SQL and Excel being the 4th and 5th most common languages listed (a complete list of languages can be found in Appendix B.2).

3.1.2 Search Frequency

To address RQ1(a), we asked how frequently programmers write source code and how frequently programmers search for code. Table 3.2 summarizes our findings. Among all participants, 43 (39%) reported they program daily and 49 (45%) program on a weekly basis. Among the participants who program daily, over half (25 of 43) also search for code daily. In summary, among those participants who program daily or weekly, 85% search for code at least weekly. This finding is consistent with a recent (and independent) survey that looked at the search habits of 36 graduate students [59].

Table 3.3: Uses of Matched Code (Question 10)

Code Use                      Count  Percent
Copy/paste as is                 14      13%
Copy/paste and modify            52      48%
Get ideas for implementation     73      67%
Link to found code                8       7%
Other                             6       6%

It was reported that 50% of the participants searched for code “frequently” while 39% did so “occasionally.”

3.1.3 Why do Programmers Search for Source Code?

To understand a little more about the motivation for code search and address RQ1(b), we asked participants what they did with the source code they were looking for. Table 3.3 summarizes what the participants did with useful code once it was found (using a multi-select question). Half of the participants reported that they would copy/paste and modify found code, which is akin to ad hoc reuse. Two-thirds would use it to get ideas for implementation, and 13% would copy/paste as is. This is consistent with the previous findings that reuse and implementation examples are the most common purposes for a code search [59]. For the 6% who selected other, the participants indicated that the use depends on the time and resources they have available.

3.1.4 Tools Used in Code Search Activities

To address RQ1(c), we asked participants about the tools they use for code search and the types of information they use for their search queries. In a free response question about where the participants search for code, nearly two-thirds (66%) mentioned using the web, the internet, or specifically Google. Table 3.4 presents the full results.

Table 3.4: Where do Programmers Search for Sample Code? (Question 7)

Location             Count  Percent
Google                  47    43.1%
the web                 24    22.0%
Stackoverflow           22    20.2%
Google Code Search      19    17.4%
Company code base        5     4.6%
MSDN                     4     3.7%
API documentation        3     2.8%
www3schools              3     2.8%
14 other sources         1     0.9%

One-fifth (20%) mentioned searching stackoverflow.com specifically, a free question and answer site for programming and technology-related questions. Only 17% mentioned using a code-specific search engine like Google Code Search. Despite the availability of code-specific search engines, information search engines are the most common tools used for code search, echoing the findings in a previous survey [59]. The most popular methods used by participants for code search (in a multi-select question) were keyword (72%), function name (45%), and libraries used (35%). The 25% who selected other generally indicated that they search using a description of their goal (e.g., reading file input) and the language they wish to use (see Table B.3). One participant wrote that they search for code “as a question worded how I think other people would have asked it on stackoverflow or forums”. Another said they use “the specifics of a feature’s functionality I’m trying to mimic”, and another searches based on the “end result”. These comments indicate that the programmers are trying to use the textual query to describe the behavior of the code they need. Finding relevant source code, however, is not always easy with the current tools. The participants reported that they must explore an average of 3.4 snippets of code before something useful is found. A previous study found that approximately 3 out of

the first 10 matches were useful, which aligns with our finding [59]. It is important to mention that neither survey accounted for the process of query reformulation, the process of re-stating a query after viewing irrelevant search results; reformulation is quite common in searching [27] and adds to the overhead. Evaluating the cost of a syntactic search for finding source code to reuse is an open question left for future work (Section 11.1.1).

3.1.5 Summary

From this survey, we have observed that code search is common and that the overhead, which comes from examining and determining whether or not a match is useful, is non-trivial. These results are in line with recent findings from a study investigating the effectiveness of different code search approaches [59]. Still unclear is what it means for a participant to explore a search result, the cost of that exploration, and whether the self-reported search behaviors reflect actual search behavior. Gaining a more thorough understanding of the search and exploration processes and durations is critical to understanding the real cost of a code search, and this is left for future work. Regardless, the current overhead does suggest the need for better search tools. Also unclear is when programmers search for code in their development processes, and what triggers the code search. Understanding this question is necessary for integrating a code search tool intuitively into the development process. As this research is still in the formative stages, this, too, is left for future work.

3.2 State-of-the-Practice Code Search

Searching for source code in different languages presents different challenges. Some languages have plentiful and well-documented online resources while others are proprietary or have other features that make searching via textual queries difficult (e.g., visual language, no public repository). In this section, we focus on three languages that expose different contexts, Java, Yahoo! Pipes, and SQL. We introduce each language and explore the current search capabilities in the state-of-the-practice.

3.2.1 Java

Given the popularity of the Java programming language, it is not surprising that Java resources abound on the web and can be found easily using textual search. The Java programmers we surveyed also frequently search for source code online. Of the 63 Java programmers, 56 (89%) program at least weekly, and 49 (78%) search for code at least weekly, which follows from the results of RQ1(a). On average, however, Java programmers look at 3.5 snippets before finding something useful, which is slightly higher than the average. To illustrate the challenges programmers face with current search support, we draw on the results of three studies. In the first (Section 7.1), we performed 17 searches

for source code on the web using titles from stackoverflow.com as textual queries. Stackoverflow is a popular question and answer website where programmers can ask programming and technology-related questions and members of the community can suggest answers. The most relevant answer can be marked as ‘accepted’. Over 2.5 million questions have been asked on this forum. The selection criteria are detailed in Section 7.1.1, but these questions were all tagged with [java], [string], and [substring] and contained input/output examples as part

of the question text. By evaluating the behavior of the code snippets on the first 10 results (Section 7.1), we found that, on average, 1.5 of the snippets were relevant in that they behaved according to the input/output examples provided in the question text. However, we must point out that this performance is nearly twice as bad as the performance of syntactic search engines reported in our survey, where something relevant was found after exploring 3.4 snippets, indicating that approximately 3 snippets among the top 10 ought to be relevant (Section 3.1). A follow-up study that used humans to judge the relevance of snippets, rather than an analysis of the behavior, revealed that approximately 6.75 of the top 10 results are relevant (Section 7.2). This is twice as good as the self-reported surveys, but it may be artificially high due to our experimental procedures (Section 7.2.2.4). Based on our findings, the exact cost of evaluating queries in Java is unclear, but if we take the surveys as an average, programmers must sift through a fair amount of irrelevant code (3-4 snippets) before finding what they want.

3.2.2 Yahoo! Pipes

Often, existing search capabilities of domain-specific languages are more limited than those for more mainstream languages. This may be due to a lack of a central repository, non-searchable code representations (e.g., for visual languages), or a lack of existing resources to search over (e.g., domain-specific languages). For Yahoo! Pipes, programmers can search only within the Pipes environment using URLs accessed, tags, keywords, or program components as queries. Using common search engines like Google is not viable as the pipes reside in a repository that is public, but has a proprietary format.

Table 3.5: State-of-the-Practice Search by URL Query on Global Repository

Pipe  URLs                   Matches  P@10
1     rss.weather.com             71     2
2     feeds.feedburner.com    16,990     0
3     anunturi-gratis.ro       1,281     0
      anunturi-utile.ro
      feedproxy.google.com
4     ocregister.com              38     0
5     feeds.gawker.com             4     1
      lifehacker.com.au

Even if the authors make the pipe available outside of the repository, a syntactic query can match only pipe metadata, such as the title and description created by the author.

To illustrate the challenges programmers face with current search support, we performed searches for five mashups that are intended to be typical pipes in the repository based on popularity and structural uniqueness (the rationale for the choice of these mashups is given in Section 8.1.1), querying for the URLs used in each mashup. These searches were executed on the large, public Yahoo! Pipes repository, and the results are shown in Table 3.5. The number of matches can be in the thousands, which is not surprising as many mashups include common URL domains. For each search, the number of matches and P@10 are reported.³ For two of the searches, Pipe 2 and Pipe 3, thousands of matches were returned in the search. The search with Pipe 5, on the other hand, returned only four results and one was relevant; two pipes were relevant among the top 10 for Pipe 1. The average number of relevant matches among the top ten results (P@10), determined by executing each pipe, is 0.06 (the metric, P@10, is the percentage of relevant items among the top 10). What this illustrates is that for the more common URLs, programmers must sift through a lot of irrelevant results, and a pipe that behaves as they want might not be easy to find.

Using other built-in search capabilities does not fare much better. Searching by components retrieves even noisier results and requires the programmer to know how the pipe could have been implemented. The effectiveness of searching with tags is dependent on the community’s ability and decision to systematically categorize their artifacts. Furthermore, given the redundancy among artifacts in the pipes repository, it has been hypothesized that programmers simply cannot find what they are looking for and are forced to re-invent the wheel [71].

³ Search results reflect the state of the repository on February 22, 2012.

3.2.3 SQL

In the survey presented in Section 3.1, 20 (18%) of the participants identified SQL as one of their most commonly used programming languages. All of the programmers perform programming tasks at least weekly and 70% search for code at least weekly. In a multi-select question, it was revealed that the most common tool used for search is Google (65%), and most programmers search using keywords (75%) and function names (55%). After performing a search, an average of 3.75 snippets of code must be explored before finding something useful. Programmers can find many resources related to SQL using information search engines. However, given the simplicity of the SQL syntax and its popularity, even well conceived syntactic queries will return many irrelevant results. For example, searching Google for the term, “select rows from table based on substring in SQL” produces over 500k responses, and among the top 10, only two results provide queries that contain

the like operator, which is a critical piece of the select statement. The process of evaluating code snippets for relevance often requires trial and error, and is costly to

the programmer. Given the frequency of keyword code searches that are performed by SQL programmers, it seems reasonable to target this domain with an alternative search approach.

3.3 Input/Output Examples in the Wild

In this section, we motivate the use of the input/output query model by investigating the extent to which programmers already employ that query model in online help forums. We observe that Java, Yahoo! Pipes, and SQL programmers often turn to peer communities when they are looking for help. One popular community is

stackoverflow.com, so this is the forum we use to identify input/output examples in the wild. 4

3.3.1 Sampling

We collected questions related to Yahoo! Pipes using the tag [yahoo-pipes], questions related to SQL using the tags [mysql] and [select], and questions related to Java using the [java] and [string] tags. The second tag in SQL was used to restrict the questions to those dealing with select statements, as that is the scope of our SQL implementation [70]. The second tag in Java is meant to restrict the questions to those dealing with strings, as this has been the primary focus of our implementation (Section 6.3). In Yahoo! Pipes, 248 questions were returned on March 27, 2013; in SQL, over 1,500 questions were returned on February 19, 2012; and in Java, 7,420 questions were returned on June 17, 2013. In each domain, we sorted the results according to popularity (i.e., votes) and retained the top 100.

4In Section 8.2, we present the results of a quasi-experiment illustrating that programmers can accurately specify desired code using this model.

Table 3.6: Question type categories in StackOverflow. The numbers in parentheses represent the questions with textual or input/output examples.

Question Type                                    | SQL     | Y Pipes | Java
-------------------------------------------------+---------+---------+--------
How do I do X with {SQL, YP, Java}?              | 74 (53) | 58 (40) | 43 (29)
Can I do X with/without Y?                       | 8 (4)   | 18 (14) | 9 (4)
What’s wrong with ..., or Why does ... work?     | 7 (2)   | 9 (8)   | 15 (14)
How does Y work?                                 | 6 (1)   | 4 (1)   | 19 (8)
What is an alternative to {SQL, YP, Java} for X? | 0 (0)   | 10 (0)  | 0 (0)
Best way to do X?                                | 0 (0)   | 0 (0)   | 5 (4)
Y versus Z?                                      | 4 (1)   | 0 (0)   | 9 (4)
unrelated                                        | 1 (0)   | 1 (0)   | 0 (0)

3.3.2 Analysis

An initial analysis was performed in the SQL domain to observe common question themes. This involved two passes over the questions. The first pass was for content analysis to collect common question themes and the second pass categorized the questions. This same process was repeated for the Yahoo! Pipes domain, and then on Java. When a question fit two or more categories, we selected the category that most closely fit based on the accepted or highest-voted answer from the community. Eight categories emerged from this analysis. The next step was to check if the questions also had examples to illustrate the context of the question, and if those examples were in the form of an input and output. Examples contain parts of the context for the question asker as a means to illustrate their question.

3.3.3 Results

In each domain, there were a handful of dominant question types, shown in Table 3.6. In a question containing X, it represents a specific task, such as remove a field from an RSS item in Yahoo! Pipes, combine two tables in SQL, or capitalize the first letter in a string in Java. Y refers to a language construct, such as the Regex module in Yahoo! Pipes, the Inner Join construct in SQL, or the hashCode() function in Java. For each question type, we provide the frequency of occurrence in each language, as well as the number of those questions for which a descriptive or input/output example was provided (in parentheses). As an example, the How does Y work? category covers six questions from SQL, four questions from Yahoo! Pipes, and 19 questions from Java; one question in each of Yahoo! Pipes and SQL provided an example to more clearly illustrate the question being asked. In Java, examples were more common, with eight of those questions providing examples. The dominant type of question for all languages is “How do I do X”. This represents 74 questions in SQL, 58 questions in Yahoo! Pipes, and 43 questions in Java. A sample of this type in Java is: “I’ve been looking for a simple java algorithm to generate a pseudo-random alpha-numeric string. ... Ideally I would be able to specify a length depending on my uniqueness needs. For example, a generated string of length 12 might look something like ”AEYGF7K0DM1X”.”5 To provide another example, consider the following question from SQL: “I really can’t find a simple or even any solution via sql to get unique data from DB (mySQL). I will give a sample (simplified): TABLE t

fruit  | color  | weight
-------+--------+-------
apple  | red    | 34
kiwi   | red    | 23
banana | yellow | 100
kiwi   | black  | 3567
apple  | yellow | 23
banana | green  | 2345
pear   | green  | 1900

5http://stackoverflow.com/questions/41107/how-to-generate-a-random-alpha-numeric-string

“And now I want output - something like distinct(apple) and distinct(color) together and order by weight desc:

kiwi   | black | 3567
banana | green | 2345
apple  | red   | 34

“– pear | green // is not ok, becouse green is already used – banana | yellow // is not ok, becouse banana is already used So I need not only group by fruit, but also color (all unique). Any advice or solution?” 6 Note that syntactic search mechanisms are not well equipped to answer this type of question as the developer does not know what SQL query or query components may be used to solve the problem; the developer asking this type of question knows only the behavior that is desired. We also observed that these questions usually come with examples that help developers better specify the required behavior. Of the 74 “How do I do X” questions in SQL, 53 contained examples. Further, 36 examples were actual snippets of tables that serve as inputs and records that they expect as outputs. Of the 58 questions in Yahoo! Pipes of this type, 40 contained examples and 8 of those examples had example inputs and outputs for the desired behavior. In Java, 29 of the 43 questions contained examples, and 13 of those were inputs and outputs for their desired code. In some questions, the outputs were difficult to specify. One Java question asks how to convert a string to a Date() object. 7 While the input is easy to specify, the output is not, and thus this was not counted. To work around that difficulty, some questions

6http://stackoverflow.com/questions/7639830
7http://stackoverflow.com/questions/4216745/java-string-to-date-conversion

used test cases to illustrate input/output examples in a more standard format.8 All of this provides evidence that programmers already think in terms of examples when trying to accomplish tasks using SQL and Java, and to a lesser extent, the Yahoo! Pipes language. Another interesting observation is that “How does Y work” questions are not often asked by programmers of their community, particularly for SQL and Yahoo! Pipes. This may indicate that this type of question is well handled by existing resources such as tutorials or the results of syntactic search engines.

3.4 Summary

In this chapter, we have presented the results of a survey that asked 109 participants about their programming experience and search habits. It was clear that programmers frequently search for code, and that state-of-the-practice code search techniques have some limitations that may be addressed with a more semantic approach. Then, we introduced the three languages used for illustration and evaluation throughout the rest of this work, showing some specific examples of when state-of-the-practice code search falls short. Since the state-of-the-practice code search requires the use of a textual query and our approach uses an input/output query model, we also explored the frequency of input/output specifications in the wild, observing that examples are

commonly used in questions asked on stackoverflow.com. The next chapter provides some illustrative examples of our search approach in the three targeted languages: Java, Yahoo! Pipes, and SQL.

8http://stackoverflow.com/questions/2559759/how-do-i-convert-camelcase-into-human-readable-names-in-java

Chapter 4

Illustrative Examples

Based on the evidence presented in Chapter 3, we see an opportunity to support programmers searching for source code by using an example-based query model that might be more natural to how they currently conceptualize their needs during search. An additional opportunity comes from some of the limitations of existing search approaches, particularly for domain-specific languages, where source code examples and documentation can be hard to find. Our approach to code search takes advantage of both of these opportunities, first by using input/output examples as queries, and second by generating a repository to search over from existing code.

4.1 Semantic Search Approach Overview

Figure 4.1 provides a big picture of our approach. The approach relies on an input/output query model, which may be useful when a programmer can provide an example of what they want code to do, but perhaps finds the task difficult to describe (I/O Specification). A source code repository contains the potential results for the search, and is built by analyzing source code, rather than documentation or other resources that may be scarce for domain-specific languages.


Figure 4.1: High-Level View of Semantic Search with Input/Output

Behind the scenes, the source code repository is transformed into constraints, a process called encoding. When a programmer issues a query, the I/O specification is transformed into constraints and sent to the SMT Solver, along with the encoded repository. The SMT Solver identifies source code from the repository that meets the specification; this relevant source code is returned to the programmer as results. We started exploring this approach to semantic search in the context of the end-user programming language, Yahoo! Pipes. Programs written in this language perform operations on lists of items. To generalize the approach to a more common language with similar semantics, we targeted SQL select statements. These programs, or queries, perform filtering operations over tables of data, which added a dimension of complexity to the first implementation. Supporting the Yahoo! Pipes language fragment also requires operations on strings, specifically identifying equality and substring relationships. To build on that support and explore our search approach in a broader context, we have also targeted Java methods and snippets that contain

calls to the java.lang.String library.
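
To make the online portion of this flow concrete, the following is a minimal Java sketch of the matching loop behind Figure 4.1. The SmtSolver and EncodedProgram types are illustrative stand-ins for a solver binding and the offline-encoded repository; this is a sketch of the idea, not the actual Satsy implementation described in Chapter 6.

import java.util.ArrayList;
import java.util.List;

// Illustrative types only: constraints are kept as SMT-LIB text, and the
// solver is hidden behind a minimal interface.
interface SmtSolver {
    // true if the conjunction of the program constraints and the
    // specification constraints is satisfiable
    boolean isSatisfiable(List<String> programConstraints, List<String> specConstraints);
}

interface EncodedProgram {
    // each encoded path is a set (conjunction) of constraints;
    // the program as a whole is the disjunction of its paths
    List<List<String>> encodedPaths();
}

class SemanticSearch {
    private final SmtSolver solver;

    SemanticSearch(SmtSolver solver) { this.solver = solver; }

    // Online phase: pair the encoded specification with each program that was
    // encoded offline, and keep the programs the solver finds satisfiable.
    List<EncodedProgram> search(List<EncodedProgram> repository, List<String> specConstraints) {
        List<EncodedProgram> matches = new ArrayList<>();
        for (EncodedProgram p : repository) {
            for (List<String> path : p.encodedPaths()) {
                if (solver.isSatisfiable(path, specConstraints)) {
                    matches.add(p);
                    break; // one satisfiable path is enough to report a match
                }
            }
        }
        return matches;
    }
}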

In this chapter, we describe how our proposed approach could be instantiated in the domains targeted by our implementation for Java, Yahoo! Pipes and SQL, and when it would be valuable to do so.

4.2 Java Source Code

We begin illustrating the approach with Java, as this is the most popular of the supported languages and thus has the greatest opportunity for impact. Our approach requires a query consisting of pairs of example inputs and expected output. In the context of the Java String library, those inputs and outputs could be one of several datatypes: integers, characters, strings, and booleans are supported by our current implementation (Chapter 6). A repository of programs is also required, which has been encoded as constraints and from which source code matches are found during search. One of the main challenges of this approach, and one that involves a lot of complexity, is mapping source code onto constraints. In this section, we give a flavor of that complexity with the example in Section 4.2.3, but more details are provided in the approach definition (Section 5.2) and with the implementation details (Section 6.3). Next, we provide five examples in the Java domain to illustrate key aspects of our search. First, we show how refinement on the specification can impact search results; second, we show how to map input/output specifications onto snippets of code; third, we show how to handle complex programs with multiple paths; fourth, we illustrate how the solver can find values for symbolic variables; and fifth, we show what happens when no good matches can be found.

4.2.1 Refining the Search with Multiple I/O Pairs

In Section 1, we presented a specification that could be used as a query to find code that extracts an alias from an email address. The input, “[email protected]” and the output, “susie”, form the specification. When using only one input/output pair, the specifications may be too weak, so many results may be irrelevant. For the alias

extraction example, consider the following two results, r1 and r2, each consisting of a

single line: The first result, r1, is found by mapping the output to scheme and the

input to uri.

r1: String scheme = uri.substring(0, 5);

The second result is found by mapping the output to username and the input to to.

r2: username = to.substring(0, to.indexOf(’@’));

Deciding which results are actually relevant, rather than coincidental, is up to the developer and may not be straightforward. To help with this process, the developer can provide additional input/output pairs to prune coincidental matches. For example,

adding the input/output pair, input2 = “[email protected]” and output2 = “alex”, will

remove r1 from the result set (i.e., it matches only the first input/output because the

string “susie” has five characters), leaving r2 as a more relevant result (see Section 5.1.1 for more details).
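
To see why the second pair prunes r1 but not r2, the two one-line results can be replayed by hand. Since the email addresses in the text above are masked, the addresses below are illustrative placeholders; any addresses whose aliases are "susie" and "alex" behave the same way.

public class RefinementDemo {
    // r1: take a fixed-length prefix (coincidentally "susie" for the first input)
    static String r1(String uri) {
        return uri.substring(0, 5);
    }

    // r2: take everything before the '@' (the intended alias extraction)
    static String r2(String to) {
        return to.substring(0, to.indexOf('@'));
    }

    public static void main(String[] args) {
        System.out.println(r1("[email protected]")); // "susie" -> matches the first pair
        System.out.println(r2("[email protected]")); // "susie" -> matches the first pair

        System.out.println(r1("[email protected]"));  // "alex@" -> does not match "alex"
        System.out.println(r2("[email protected]"));  // "alex"  -> still a match
    }
}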

4.2.2 Mapping Input and Output

The previous example had an input and output that were both of the same type, string. However, we support multiple data types and the input and output types can be heterogeneous. For a programmer who wants to find the length of a file extension (including the dot), the input is a string (i.e., the name of a file) and the output is an

integer (i.e., the extension length). Consider the example input string “foo.txt” and the integer output, 4. The following snippet is a match that involves four API calls:

int begin = s.lastIndexOf(".");
int end = s.length();
String ext = s.substring(begin, end);
int len = ext.length();

Here, the input is mapped to the only undefined variable in the code snippet, s (inferred to be of type string). The output is mapped to the LHS of the final

assignment statement, len, which is the only unused variable. These bindings are calculated by computing and exploring the def-use pairs in the snippet [46], where variables are marked as potential inputs while they are used but not defined. There may be many potential mappings of an input/output specification to a code snippet. These mappings are critical for a piece of source code to be identified as a match. For specifications with multiple input values, swapping their order could be the difference between a piece of code that is a match, and one that is not. Generally, this process is left up to the solver (Section 5.2.2).
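
As a sanity check of this mapping, the snippet can be wrapped in a method whose parameter plays the role of s and whose return value plays the role of len; the wrapper below is purely illustrative.

public class ExtensionLengthDemo {
    // The snippet from above, with the free variable s bound to the input and
    // the final assignment len treated as the output.
    static int extensionLength(String s) {
        int begin = s.lastIndexOf(".");
        int end = s.length();
        String ext = s.substring(begin, end);
        int len = ext.length();
        return len;
    }

    public static void main(String[] args) {
        // Input "foo.txt" maps to s; ".txt" has length 4, the expected output.
        System.out.println(extensionLength("foo.txt")); // 4
    }
}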

4.2.3 Partial Matches

When entire methods rather than blocks of source code are encoded, the input/output mapping is more straightforward than it is with a code snippet. That is, the input maps to the method parameters and the output to the return value. However, if the method has multiple paths, an input/output pair may not require all parts of the source code. We can have a situation in which an input/output pair matches only part of the source code, specifically, the path that would be executed.

Consider the following method, which returns the input string with the first letter capitalized. Encoding this method involves performing a symbolic execution to identify the path conditions.

1 String capitalizeFirstLetter(String in) {
2   if (in.length() >= 1) {
3     String t = in.toUpperCase();
4     String t2 = t.substring(0, 1);
5     String t3 = in.substring(1, in.length());
6     return t2.concat(t3);
7   } else {
8     return "";
9   }
10 }

Two paths are identified in this method, and each is encoded separately. The full encoding for this method, then, is a disjunction of the two paths. For the true branch of the predicate on line 2 (i.e., in.length() ≥ 1), the following represents a rough encoding of the path:

c1. (assert (in.length() >= 1))
c2. (assert (t.length() = in.length()))
c3. (assert (t = in.toUpperCase()))
c4. (assert (t2.charAt(0) = t.charAt(0)))
c5. (assert (t2.length() = 1))
c6. (assert (for (1 <= i < in.length()) (t3.charAt(i - 1) = in.charAt(i))))
c7. (assert (t3.length() = (in.length() - 1)))
c8. (assert (return = t2 + t3))
c9. (assert (return.length() = (t2.length() + t3.length())))

The predicate is represented by c1. Constraints c2 and c3 describe the relationship between variables t and in in terms of length and content (Section 6.2 describes how strings are handled in the encoding). A complete definition of the toUpperCase() function is provided in Table A.5. Line 4 is represented by constraints c4 and c5, relating the length and content of strings t and t2. Representing line 5 is a little more complicated and requires a quantifier in c6 to set the content of t3 to be all except the

first character of in, and the length relationship is described by c7. Finally, the value of the return variable is described in c8 and c9, where the ‘+’ operator is overloaded to represent concatenation of strings in c8. For the false branch of the predicate on line 2, the following represents the encoding of that path:

c10. (assert (in.length() < 1))
c11. (assert (return = ""))
c12. (assert (return.length() = 0))

Here, the predicate is represented as c10, with constraints c11 and c12 asserting the length and content of the return value. To identify this program as a match, a query could have an input of “abc” with an expected output of “Abc”. This would satisfy the constraints for the first path; however, for the second path, the query would violate constraint c10 since the length of the input is not less than one. In this method, the input/output exercises only one path represented by constraints c1 − c9. As demonstrated in Section 4.2.1, the query could provide an additional example, such as the empty string as the input and the output, which would then exercise the second path. Regardless, the presence of multiple paths in a program brings up some interesting questions. In the presence of multiple input/output pairs, we could have source code that matches some of the pairs, or input/output pairs that exercise only part of a piece of source code. These situations all impact the matching criteria for the approach, which is discussed in more detail in Section 5.3.
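
The two example pairs discussed above can be replayed directly against the method to confirm which path each one exercises; the small harness below is only an illustration of that observation.

public class PathCoverageDemo {
    static String capitalizeFirstLetter(String in) {
        if (in.length() >= 1) {
            String t = in.toUpperCase();
            String t2 = t.substring(0, 1);
            String t3 = in.substring(1, in.length());
            return t2.concat(t3);
        } else {
            return "";
        }
    }

    public static void main(String[] args) {
        // ("abc", "Abc") satisfies only the true branch (constraints c1-c9)
        System.out.println(capitalizeFirstLetter("abc")); // "Abc"
        // ("", "") exercises the false branch (constraints c10-c12)
        System.out.println(capitalizeFirstLetter(""));    // ""
    }
}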

4.2.4 Using the Solver for Symbolic Variables

In some snippets, there may be additional variables that are used, but are not defined and not bound to the input. Consider the following snippet, which matches for the input/output used in the example in Section 4.2.2:

int index = names.length() - names.indexOf( flag );

After mapping the input to names and the output to index, this code is not executable

because we know nothing about the value of flag, so state-of-the-art semantic search engines that utilize test cases to identify matching code (e.g., [39, 50, 53]), would fail. In our approach, the uninitialized variables in the snippet remain uninitialized in the encoding, and we make no assumptions about the values they hold.1 This snippet is identified as a match because the satisfiable model produced by the solver reveals

that the specification matches this snippet when flag is set to “.txt” (the solver could have identified “.”, “.t”, or “.tx” as possible values, but it needs to find only one solution to complete the satisfiable model). By encoding the behavior of the snippets as constraints, we can identify incomplete code as a match and leverage the solver to guide its instantiation. Applying that guidance yields the following, modified and complete code:

int index = names.length() - names.indexOf(".txt");

Clearly, this code would not be considered a match for other input/output examples in which the file extension is not “.txt”. A working solution could be found by adding

additional input/output examples and forcing flag to equal “.”.

We refer to uninstantiated variables, like flag, as symbolic and variables that hold values, like the string “.txt”, as concrete.
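
Replaying the solver's assignment by hand shows why the instantiated snippet satisfies the example from Section 4.2.2; the harness below is illustrative only.

public class SymbolicFlagDemo {
    public static void main(String[] args) {
        String names = "foo.txt"; // the input from Section 4.2.2
        String flag = ".txt";     // one value the solver can assign to the symbolic variable

        int index = names.length() - names.indexOf(flag);
        System.out.println(index); // 4, the expected output

        // The solver's other candidates satisfy the same example:
        System.out.println(names.length() - names.indexOf("."));  // 4
        System.out.println(names.length() - names.indexOf(".t")); // 4
    }
}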

1In our approach, the snippet is taken out of context so we must use type inference to reveal that flag is either a character or a string. We assume the more expressive case of string.

4.2.5 Poor Matches

The previous example showed a situation when a program could be instantiated to match the intention of the query. However, finding such a program is not always possible. Say a programmer wants to trim off the last character of a string. As an input, they provide the string, “admirer”, the integer 1 since we’re trimming one character, and the output string, “admire.”2. In the repository we built for the evaluation in Section 7.2, there are 64 potential programs that take as an input an integer and a string, and return a string. However, none of these are results for the example. For small repositories of programs, or for very specific queries, this situation may not be uncommon. Our approach to this issue is to use abstraction, and there are two types of abstraction we consider. The first is relaxing the type signature of the specification, and the second is abstracting the encoding of the source code. We leave the latter for future work. To abstract on the type signature, we can try to find matches with type signatures that subsume the specification yet have the required output type. Instead of matching every input to method parameters, this approach leaves some variables free. Performing such a relaxation returns 13 results, including the following:

String extractKey(String key, int start, String delegatePrefix, String prefix) {
    return delegatePrefix + key.substring(start + prefix.length(), key.length() - 1);
}

Here, there are three string inputs and one integer input, with the output being a string. This means that when we map the input to the source code, there will be two free variables. This source code is a match for the input/output example with key ↦ “admirer” and start ↦ 1. The solver will assign the following values

2This corresponds to question 11 used later in the evaluation (Table 7.8)

to variables, delegatePrefix ↦ “a” and prefix ↦ “”. Now, this code does not necessarily match the intention of the specification. Since the integer input was set to 1, the solver had to set the value of delegatePrefix to be the first character of the input string, since that character was chopped off in the substring call. This may not be the ideal solution, but through abstraction, our search was able to come up with some sort of result that could help guide the programmer in the right direction.
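
Replaying this assignment by hand confirms the match, and also makes the mismatch with the query's intent easy to see; the harness below is only an illustration.

public class RelaxedSignatureDemo {
    static String extractKey(String key, int start, String delegatePrefix, String prefix) {
        return delegatePrefix + key.substring(start + prefix.length(), key.length() - 1);
    }

    public static void main(String[] args) {
        // key and start come from the query; the solver supplies the two free variables
        System.out.println(extractKey("admirer", 1, "a", "")); // "admire"
    }
}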

4.3 Yahoo! Pipes Mashups

In a Pipes mashup program, RSS feeds are manipulated and aggregated. The external data sources accessed by the Fetch Feed modules are the inputs to the pipe, so in our approach, the programmer provides the URLs for RSS feed(s) as input. This mimics how a programmer would build a pipe from scratch. Similar to the Pipes Editor environment, our framework fetches the RSS feed; this produces the input list. Rather than programming the rest of the pipe, with our approach, the programmer modifies this list directly by reordering, removing, or modifying items to form the output list. An example of this process is shown in Figure 4.2. The programmer provides the URL, such as the Input in Figure 4.2, and our framework retrieves the RSS feed, which has n items. For this URL, n = 35, which leads to a very long input list. The programmer selects the third item as the desired output, forming an output list of size one. A programmer working in the Pipes environment has the ability to reuse an entire program by making a copy (i.e., cloning), or by importing an entire Pipe as a subroutine (i.e., a subpipe). The reuse of modules or sequences of modules is difficult since copying and pasting modules is not supported. Since the scope of reuse in the existing framework is at the whole program level, we maintain that level of abstraction for the programmer.


Figure 4.2: Yahoo! Pipes Input/Output Example

For our search in this domain, the programmer specifies the behavior of an entire program, and so entire programs are encoded as constraints and returned by the search (as opposed to snippets or methods, as in Java). Part of the future work is to consider partial programs to promote smaller-scale reuse.

4.3.1 Assigning Inputs to Fetch Modules

Say a programmer wants to collect news about tennis from a website, and creates the specification shown in Figure 4.2 (the selected item for the output list contains “tennis” in the description). Using our approach to perform a semantic search over a repository of encoded Pipes programs (Section 8) reveals two possible matches, shown in Figure 4.3(a) and Figure 4.3(b). The solution in Figure 4.3(b) performs head and tail operations on the list to extract the third item, whereas the solution in Figure 4.3(a) joins two RSS feeds and permits items that have “tennis” in the

Figure 4.3: Possible Solutions to the Input/Output Example in Figure 4.2 ((a) and (b))

description. While both match the specification, the solution in Figure 4.3(b) is likely a coincidental match, and could be pruned by adding another input/output example, as demonstrated by the example in Section 4.2.1.

Much like the example in Section 4.2.4 where the variable flag was symbolic, since the specification had only one input URL, the pipe solution in Figure 4.3(a) has an

uninitialized input, URL2. In this domain, instead of leaving the RSS feed symbolic, we assume unmapped fetch modules have empty input lists. That is, URL2 is set to an empty list. This is necessary because the RSS feeds are external resources, and if left symbolic, the solver may identify a program as a match, but require an RSS feed that does not exist. Contrastingly in Java, the symbolic variables we currently consider are not dependent on an external resource and can receive arbitrary values assigned by the solver. As our language support increases to consider complex objects and parameters, this will no longer be the case and we may draw on this concept from the Yahoo! Pipes implementation.

Module — Encoding as Constraints

Fetch1:
  c1: (assert (in(Fetch1) = getRSS(URL1)))
  c2: (assert (out(Fetch1) = in(Fetch1)))
Fetch2:
  c3: (assert (in(Fetch2) = getRSS(URL2)))
  c4: (assert (out(Fetch2) = in(Fetch2)))
wire1:
  c5: (assert (in1(Union) = out(Fetch1)))
wire2:
  c6: (assert (in2(Union) = out(Fetch2)))
Union:
  c7a: (assert (for (0 ≤ i < size(in1(Union))) recOf(out(Union), i) = recOf(in1(Union), i)))
  c7b: (assert (for (size(in1(Union)) ≤ i < (size(in1(Union)) + size(in2(Union)))) recOf(out(Union), i) = recOf(in2(Union), (i - size(in1(Union))))))
  c8: (assert (for (0 ≤ i < size(out(Union))) (hasRec(in1(Union), recOf(out(Union), i)) = true) ∨ (hasRec(in2(Union), recOf(out(Union), i)) = true)))
  c9: (assert (((size(in1(Union)) + size(in2(Union))) = size(out(Union)))))
wire3:
  c10: (assert (in(Filter) = out(Union)))
Filter:
  c11: (assert (for (0 ≤ i < size(in(Filter))) ((recOf(in(Filter), i) = r) ∧ contains(field(r “descr”), “tennis”)) ⇒ (hasRec(out(Filter), r) = true)))
  c12: (assert (for (0 ≤ i < size(out(Filter))) (hasRec(in(Filter), recOf(out(Filter), i)) = true)))
  c13: (assert (for (0 ≤ i < size(out(Filter))) (for (i < j < size(out(Filter))) ((recOf(out(Filter), i) = r1) ∧ (recOf(out(Filter), j) = r2)) ⇒ (∃ k, l ((k < l) ∧ (0 ≤ k < size(in(Filter))) ∧ (0 ≤ l < size(in(Filter))) ∧ (recOf(in(Filter), k) = r1) ∧ (recOf(in(Filter), l) = r2))))))
  c14: (assert (size(in(Filter)) >= size(out(Filter))))
wire4:
  c15: (assert (in(Output) = out(Filter)))
Output:
  c16: (assert (out(Output) = in(Output)))

Definitions: Let l be a List, i be an Integer, and r be a Record.
  recOf(l, i) = r ⇐⇒ l[i] = r
  hasRec(l, r) = true ⇐⇒ ∃i | ((0 ≤ i < size(l)) ∧ (l[i] = r))

Figure 4.4: Mapping the Pipe in Figure 2.1 onto Constraints

4.3.2 Yahoo! Pipes Example: Encoding a Pipe

The encoding process is briefly illustrated in Figure 4.4 for the pipe in Figure 4.3(a). As in the Java domain, we require a mapping of program constructs to constraints representing their behavior. As modules have clear semantics and interfaces, in this domain, we map each module onto constraints and use wires to define the relationships

between modules. A full description of that process is provided in Section 6.4, but we provide the intuition here. Module constraints are expressed in terms of the input to and output from the

module (e.g., in(Filter) refers to the list of items that enters the Filter module, and out(Filter) refers to the list of items that exits the Filter module). Constraints

c1 and c3 assign input variables (i.e., URL1 and URL2), to each of the Fetch Feed (succinctly, Fetch) modules. The getRSS() function retrieves a list of RSS items. Constraints c2 and c4 ensure that the output from the Fetch modules are the same as the input. When a pipe is encoded, the URL information is abstracted away so the pipe can be solved for any URL provided as input; this abstraction is imperative so the programmer can find pipes that behave as desired given their defined input and

output. This is illustrated in the example with the URL1 and URL2 symbolic names in constraints c2 and c4. This process is unique to Yahoo! Pipes since the inputs to the program are specified concretely, unlike a Java program where the inputs are passed as arguments to a method (for example). Constraints c5 and c6 connect the output from the Fetch modules to the Union module as inputs. The Union module concatenates its input lists, which is described by constraints c7a, c7b, c8, and c9. The first constraint, c7a, ensures that all the items

at the front of the output list, out(Union) come from the first input list, in1(Union),

and the second constraint, c7b, ensures that the next items are from in2(Union). This is called inclusion. The recOf(L, i) function is an accessor to get the record of a list L at index i. The next constraint, c8, ensures that all items in the output list from the module exist in one of the two input lists, and in this way no extra items are appended to the end of the list. This constraint enforces exclusion. The last constraint

for this module, c9, ensures that the size of the out(Union) is equal to the sum of the sizes of the input lists. This is a size constraint. The output from the Union

module goes to the Filter module per c10. Representing the Filter module requires four constraints that enforce inclusion, exclusion, order, and size properties. The first,

c11, ensures that all items in in(Filter) that contain “tennis” in the description also exist in the out(Filter) list. The exclusion constraint, c12, ensures that all records in the output are also from the input (i.e., none were added and out(Filter) ⊆ in(Filter)). The next constraint for this module, c13, ensures that if two records exist in the output list, their ordering is the same as it was in the input list. In this way, the module is order-preserving. The final constraint for the Filter module, c14, ensures that the output list is at most as long as the input list. Constraint c15 ensures that the output from the Filter module goes to the input of the Output module, and

c16 ensures that the output of the pipe, out(Output) is the same as in(Output).

4.3.3 Abstracting String Values

When a matching program cannot be found, we can apply abstractions to the encodings to find code that does not exist as such, but is an approximate match and can be instantiated to meet the user specifications. For example, say a programmer wants to filter an RSS feed based on “volleyball” rather than “tennis”. The inclusion constraint for the Filter module in Figure 2.1, c10, contains as part of the

implication, contains(field(r “descr”), “tennis”)). At a concrete abstraction level, the string “tennis” is encoded as is, which would not satisfy a specification that requires “volleyball”. However, with a weaker encoding consisting of constraint

contains(field(r “descr”), s)) for some string s, an SMT solver could determine that for s = “volleyball”, this program is a match. This form of abstraction allows the search to identify programs that are approximate matches for the desired behavior, and can be modified systematically to satisfy the

input/output specifications, similar to the example in Section 4.2.4 where flag was instantiated. We implement and evaluate two abstraction levels within the pipes domain (Section 8).

4.4 SQL Select Statements

When instantiating our approach for SQL, the input and output take the form of database table(s). The indexed SQL select statements are encoded as constraints. Given example tables as input and output, the SMT solver determines, for each encoded SQL select statement, if it matches the specification.

4.4.1 Large Specifications

Consider the programmer who asked the question on stackoverflow, “I have table with records ‘user’ and ‘balance’. How to show 10 usernames with highest balance? ... How to show but only when they have more than 1.000.000$?”3 The programmer knew the desired behavior and described it through a concrete input example table:

  id  | username | balance   | status
------+----------+-----------+-------
  145 | rekin76  | 469370.44 | 0
   56 | avcio    | 466921.90 | 0
  705 | shantee  | 149160.09 | 0
 5725 | ter      |  93004.45 | 0
 3414 | rut1999  |  80944.80 | 0
  ... | ...      |       ... | ...

Based on the accepted answer from stackoverflow, we created an output table:

3http://stackoverflow.com/questions/11599636

username
--------
rekin76
avcio
shantee

With this input/output specification in the form of tables, our approach identifies as a

match the recommendation of three positively voted responses in stackoverflow.com:

SELECT username
FROM table
WHERE balance >= 1000000
ORDER BY balance DESC
LIMIT 10;

An interesting aspect of this domain is that the input/output specifications can be large since they may come from live databases. It then becomes important to understand the impact of large specifications on solver time. As we have explored in previous work, but have not presented as part of this paper, it is not just the specification size that matters but also the complexity of behavior exhibited in the specification [70].

4.4.2 Implicit Join on Inputs

Consider a programmer who wants to extract salary information for employees from

a database. The programmer has two database tables, one called employee with fields [id, name, address], and another called payroll with fields [id, account, salary]. Their desired output table contains [name, salary] for each employee id, which requires combining the two input tables, as is done in the following query:

SELECT name, salary
FROM employee, payroll
WHERE employee.id = payroll.id
ORDER BY salary;

This query requires an implicit join on the id field for the two input tables in order to create the output table. Our approach supports the case when multiple inputs form a single output, a concept we explore further in Section 5.1. Such a query is useful in any of the supported domains (e.g., multiple strings in Java, multiple URLs in Yahoo! Pipes, or multiple tables in SQL), and is especially common in database queries that require merging multiple tables.

4.5 Summary

In this chapter, we have discussed several interesting aspects of our approach to semantic code search in three domains, illustrating the generality of the approach. Chapter 3 revealed some opportunities to better support code search for programmers, and we specifically focus on the opportunities to allow programmers to specify search queries as examples and to build repositories to search over using existing code. We have shown how our search is flexible in finding programs that are partial matches and programs that are close to the desired behavior when intended solutions are not available. We have also shown how a programmer can identify relevant code when there are many coincidental matches by adding additional input/output examples, and how we can return source code that provides a partial match for a specification or for a specification that only partially matches source code. In the next chapter, we define this approach to semantic code search more formally.

Chapter 5

Approach

In this chapter, we present the general definitions of each piece of our approach to semantic search via SMT solvers. Illustrated in Figure 5.1, the gray boxes indicate the key components and technical challenges: specifying behavior by defining lightweight specifications as input/output pairs, encoding programs and specifications as constraints, returning and ranking results, and refining program encodings and specifications when too few or too many matches are found. The crawling and program encoding processes happen offline, meaning those activities do not impact the search time, whereas invoking the solver to identify relevant code happens online, meaning this activity does impact the search time. We now describe each component in detail.

5.1 Specifying Behavior

Figure 5.1: Detailed View of Approach

Instead of textual queries to find syntactic matches, our approach takes lightweight specifications that characterize the desired behavior of the code (Lightweight Specifications in Figure 5.1). These specifications, LS, are lightweight and incomplete in that they consist of concrete input/output pairs that exemplify part of the desired system behavior, like “[email protected]” and “susie” from Chapter 1. To more completely specify the desired behavior, multiple input/output pairs can be defined: LS = {ls1, ls2, . . . , lsk}, for k pairs where lsj = (Ij, Oj). Each input, Ij, is a set of elements and each output, Oj, is a set of elements related to Ij. Each element of the input and the output also has a defined type, which is used in the solving phase to identify potentially matching programs (Section 5.3). For example, say a programmer is looking for a Java program that determines if a string contains another string, case insensitive. This requires two inputs, both of type string, and one output of type boolean. LS in Listing 5.1 illustrates two input/output pairs.

Listing 5.1: Lightweight Specification Example
LS = { ({"Foo","fO"}, {true}),
       ({"abc","def"}, {false}) }

Here, ls1 is a positive example returning true, and ls2 is an example returning false.

The inputs for both pairs, I1 and I2, each have two elements, both strings, and the

outputs, O1 and O2, both contain a single element of type boolean. The elements that compose the input and output take different forms depending on the domain. The examples used until this point, as well as our current implementation, restrict the output to a single element. Including multiple outputs is part of future work described in Section 11.2.4. For our current implementation of the approach in Java (Chapter 6), the input is a non-empty set of booleans, characters, integers and strings, and the output is either a boolean, character, integer, or string. In Yahoo! Pipes, the input is a non-empty set of URLs (from which the RSS feeds are retrieved) and the output is a list of items. For SQL select statements, the input is a non-empty set of tables and the output is a table. Conceptually, the input and the output can be any element that can be used to illustrate the behavior of desired code. Each instantiation of the approach defines what forms these elements can take. The size of k defines, in part, the strength of the specifications and hence the number of potential matches. If the specifications or the encoded program constraints are too weak, many matches may be returned (too many in Figure 5.1); if they are too strong, the solver may not yield any results (too few). Refinement and abstraction are processes that help to address these situations, respectively.
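
As a concrete illustration, the LS from Listing 5.1 could be represented in memory along the following lines. The class names are illustrative and are not part of the approach or the implementation; the per-element type information that the solving phase also needs is omitted for brevity.

import java.util.List;

class SpecExample {
    public static void main(String[] args) {
        // LS from Listing 5.1: two string inputs and one boolean output per pair
        LightweightSpec ls = new LightweightSpec(List.of(
                new IOPair(List.of("Foo", "fO"), true),   // ls1: positive example
                new IOPair(List.of("abc", "def"), false)  // ls2: negative example
        ));
        System.out.println(ls);
    }
}

// Illustrative representation only; these names are not part of the approach.
record IOPair(List<?> inputs, Object output) {}

record LightweightSpec(List<IOPair> pairs) {}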

5.1.1 Specification Refinement

As illustrated in Section 4.2.1, a programmer can provide specifications incrementally, starting with a small number of pairs and adding more to further constrain the search.

This incremental process would follow the arrow from Too Many results back to Lightweight Specifications in Figure 5.1. To provide an example, given the specification from Listing 5.1, Listing 5.2 is a potential match. It matches both input/output pairs in LS, but simply checks the lengths of the input strings and misses the objective of the query. On the other hand, Listing 5.3 matches LS and the objective of the query in that it converts both input strings to lowercase and then checks for containment. However, this result may be buried below many irrelevant results.

Listing 5.2: Specification Refinement Example Result 1
boolean notSameLength(String a, String b) {
    return a.length() != b.length();
}

Listing 5.3: Specification Refinement Example Result 2
boolean checkContains(String input1, String input2) {
    String inA = input1.toLowerCase();
    String inB = input2.toLowerCase();
    return inA.contains(inB);
}

This is an instance when specification refinement would be helpful as adding another input/output pair can differentiate between these two relevant results. This

process is defined as adding(LS, lsk+1) → LS′, where lsk+1 is the newly added input/output pair to LS, forming LS′. For the previous example, we suggest two directions to differentiate between Listing 5.2 and Listing 5.3 as results for LS in Listing 5.1. The first, adding(LS, ({“abc”, “ABC”}, {true})) → LS′, would rule out Listing 5.2 because the strings have the same length, and yet the output is true. Another option would be adding(LS, ({“abc”, “d”}, {false})) → LS′, which also rules out Listing 5.2 since the lengths are different and the output is false.

Alternatively, the programmer can replace an input/output pair with one that captures a more unique or distinguishing aspect of the desired behavior. This process

is defined as replace(LS, lsi, lsj) → LS′, where the specification lsi is replaced by

lsj. As an example, the following replace call would differentiate between Listing 5.2 and Listing 5.3 by replacing the positive example with one in which the two input strings have equal length, as follows: replace(LS, ({“Foo”, “fO”}, {true}), ({“abc”, “ABC”}, {true})) → LS′. Providing guidance on this process is part of the future work described in Section 11.2.1.
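
Replaying the suggested pair ({“abc”, “ABC”}, {true}) against the two candidate results shows the effect of the refinement; the harness below is illustrative only.

public class RefinementCheck {
    // Listing 5.2: matches the original LS, but misses the intent of the query
    static boolean notSameLength(String a, String b) {
        return a.length() != b.length();
    }

    // Listing 5.3: matches the original LS and the intent (case-insensitive containment)
    static boolean checkContains(String input1, String input2) {
        String inA = input1.toLowerCase();
        String inB = input2.toLowerCase();
        return inA.contains(inB);
    }

    public static void main(String[] args) {
        // Added pair ({"abc", "ABC"}, {true}): the strings have equal length,
        // so Listing 5.2 no longer matches while Listing 5.3 still does.
        System.out.println(notSameLength("abc", "ABC")); // false, but true was expected
        System.out.println(checkContains("abc", "ABC")); // true, as expected
    }
}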

5.1.2 Specification Abstraction

If there are no matches for a query, the programmer has the option to weaken the specifications. This can occur when there truly is no match in the repository, or when the search takes too long to provide a response. An example of this last case occurs when the tables provided as input for SQL have hundreds of rows causing the solving time to take minutes; in this case it may be useful to select the subset of the table that still captures the key desired behavior. A similar phenomenon can be observed with Yahoo! Pipes when there are many items in the input and output lists, or in Java when performing multiple operations over very long strings. This

process is defined as remove(LS, lsi) → LS′, which is the reverse of adding in that

an input/output pair, lsi, is removed. Alternatively, replace(LS, lsi, lsj) → LS′ can

be used to capture a different set of behavior by replacing lsi with lsj. Alternative approaches consider negative examples or regular expressions, which we leave for future work (Section 11.1.3).

5.1.3 Specification Encoding

Prior to solving, the specifications LS must be encoded, transforming LS into CLS =

{Cls1 ,Cls2 ,...,Clsk }, where each Clsj represents the constraints for an input/output

pair lsj, and CLS is the set of encoded input/output pairs. For example, consider the LS in Listing 5.1. Each ls ∈ LS is encoded separately.

Listing 5.4 shows Cls1 , the encoding for ls1; Cls2 , the encoding for ls2, is in Listing 5.5.

Listing 5.4: Cls1: Encoded Lightweight Specification ls1 from Listing 5.1
1 // ls = ({"Foo","fO"}, {true})
2 (declare-fun input1 () String)
3 (declare-fun input2 () String)
4 (declare-fun output () Boolean)
5
6 (assert (= input1 "Foo"))
7 (assert (= input2 "fO"))
8 (assert (= output true))

Listing 5.5: Cls2: Encoded Lightweight Specification ls2 from Listing 5.1
1 // ls = ({"abc","def"}, {false})
2 (declare-fun input1 () String)
3 (declare-fun input2 () String)
4 (declare-fun output () Boolean)
5
6 (assert (= input1 "abc"))
7 (assert (= input2 "def"))
8 (assert (= output false))

In Listing 5.4, the input and output variables are declared on lines 2-4. Lines 6-8 assert the values of the specifications. The structure is identical for Listing 5.5. Since no variable names are provided as part of the specification definition, arbitrary names are

always assigned (e.g., input1). The binding of these variables to program variables, which is necessary for solving, happens as part of the solving phase (Section 5.3).
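
As an illustration of this step, a specification encoder for the string/boolean case could emit constraints in the style of Listings 5.4 and 5.5 roughly as follows; the class and method names are illustrative, not the actual implementation.

import java.util.List;

class SpecEncoder {
    // Emit declarations and assertions in the style of Listings 5.4 and 5.5
    // for a pair with string inputs and a boolean output (other types omitted).
    static String encode(List<String> inputs, boolean output) {
        StringBuilder smt = new StringBuilder();
        for (int i = 0; i < inputs.size(); i++) {
            smt.append("(declare-fun input").append(i + 1).append(" () String)\n");
        }
        smt.append("(declare-fun output () Boolean)\n\n");
        for (int i = 0; i < inputs.size(); i++) {
            smt.append("(assert (= input").append(i + 1)
               .append(" \"").append(inputs.get(i)).append("\"))\n");
        }
        smt.append("(assert (= output ").append(output).append("))\n");
        return smt.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode(List.of("Foo", "fO"), true)); // encodes ls1 from Listing 5.1
    }
}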

5.2 Encoding

In our approach, encoding and solving are analogous to crawling and indexing performed by search engines [38]. Offline, a repository (Code Repository in Figure 5.1) is crawled to collect programs. These programs are parsed, then encoded as constraints using symbolic analysis [7, 8, 36], and the constraints are stored in a Constraint Repository.

The process of encoding takes a collected set of programs RepP = {P1,P2,...Pn}

and transforms it into a collection of encoded programs, RepPenc = {CP1 ,CP2 ,...,CPn }. Two general steps are involved, program transformation and program input/output identification, as shown by the pseudocode in Listing 5.6. For all programs in a repository, and all paths in those programs, the encoding process transforms each path into a set of constraints. An encoded program is a set of its encoded paths, and an encoded repository is a set of encoded programs.

Listing 5.6: Encoding Process Overview
1 encodeRepository(Repository RepP ) {
2   RepPenc ← {}  // RepPenc is a set of programs, each encoded
3
4   for (Program P : RepP ) {
5     Penc ← {}  // Penc is a set of paths in P, each encoded
6
7     for (Path q : P ) {
8       Cq ← {}  // An encoded path, Cq, is a set of constraints
9       Cq ← transformToConstraints(ssa(q))
10      Cq ← Cq ∪ addInputOutputConstraints(ssa(q))
11
12      Penc ← Penc ∨ {Cq}  // add encoded path as disjunction to Penc
13    }
14
15    RepPenc ← RepPenc ∪ {Penc}  // add encoded program to RepPenc
16  }
17  return RepPenc
18 }

Each step of the process is described in the sections that follow, considering the Java method in Listing 5.7 for illustration. Java was selected for illustration due to its complexity and for ease of presentation. This method was formulated to illustrate the

challenges and opportunities that arise in the Java language. It takes a string, foo, as input and returns a string. If foo contains the string “x”, then the substring from 0 to the index of “x” is returned. If not, an unmodified foo is returned.

Listing 5.7: Approach Encoding Example
1 static String bar(String foo) {
2   if (foo.contains("x")) {
3     int i = foo.indexOf("x");
4     foo = foo.substring(0, i);
5   }
6   return foo;
7 }

5.2.1 Program Transformation

The transformation process happens at the path level in a program. Since the current specification model requires concrete input and output values, each will exercise a single path, as illustrated with the example in Section 4.2.3. To facilitate the path-level analysis, we take advantage of symbolic analysis. For each program P ∈ RepP , a symbolic execution engine traverses the paths in P so each can be encoded separately. For convenience, these paths are identified as

QP = {q1, q2, . . . , qm} for m paths in P . This process is implied as part of the for-loop on line 7 in Listing 5.6. To illustrate with program P in Listing 5.7, the symbolic execution tree is shown in Figure 5.2. There is one predicate in the program, and so there are two path conditions and two branches. These two branches form the set of paths for P , denoted

foo: S; PC: true
  line 2 (true):  foo: S; PC: S.contains("x") = true
    line 3:       foo: S, i: S.indexOf("x"); PC: S.contains("x") = true
      line 4:     foo: S, i: S.indexOf("x"), foo1: S.substring(0, S.indexOf("x")); PC: S.contains("x") = true
        line 6:   foo: S, i: S.indexOf("x"), foo1: S.substring(0, S.indexOf("x")); PC: S.contains("x") = true
  line 2 (false): foo: S; PC: S.contains("x") = false
    line 6:       foo: S; PC: S.contains("x") = false

Figure 5.2: Symbolic Execution of Program in Listing 5.7

QP = {q1, q2}. From the symbolic execution tree, path encodings are built at each leaf. The path condition is set to true at the beginning of the method and it is refined every time a predicate is reached. In the left leaf in Figure 5.2, three variables are defined, foo, i, and foo1. The path condition is set on the transition of line 2, where PC : S.contains(“x”) = true. In the original program, foo is redefined on line 4 in Listing 5.7, thus, a new variable is introduced, foo1, to represent the new value,

foo1 : S.substring(0, S.indexOf(“x”)). In essence, this converts the program into static single assignment (SSA) form. As variables are introduced and manipulated throughout the program, their domains are updated in the symbolic execution tree. For example, line 3 introduces i, and so i is defined after the 3 transition in Figure 5.2. Line 4 re-defines foo, so foo1 is introduced after the 4 transition in Figure 5.2. At the end, the domains on the variables represent the restrictions that have been imposed on their values throughout the symbolic execution of this program path, dictated by the program statements that are executed along the path. The path condition sets the conditions under which this path is executed, and the variables used

in the path conditions are symbolic. Let q1 in Listing 5.8 represent the path for the left branch in Figure 5.2. Lines 1-3 define the variables, the path condition is on line 4, lines 5-6 manipulate the variable domains, and the return statement is on line 7. Since foo is symbolic and unrestricted during the symbolic execution of the method, in the paths, foo is declared in Listing 5.8, but not defined, leaving it symbolic.

Listing 5.8: Path q1 for Listing 5.7
1 String foo;
2 String foo1;
3 int i;
4 pc1: foo.contains("x") == true
5 i = foo.indexOf("x");
6 foo1 = foo.substring(0, i);
7 return foo1;

The right leaf of Figure 5.2 corresponds to the path condition where PC : S.contains(“x”) = false. Here, no manipulations are made to the domain of foo,

and no additional variables are introduced. Let q2 in Listing 5.9 represent the path for the right branch in Figure 5.2. Line 1 defines the variable foo, line 2 defines the path condition, and line 3 is the return statement.

Listing 5.9: Path q2 for Listing 5.7
1 String foo;
2 pc2: foo.contains("x") == false
3 return foo;

Listing 5.10: Cq1: Encoding of path q1 from Listing 5.8
1 // declare variables
2 (declare-fun foo () String)
3 (declare-fun foo1 () String)
4 (declare-fun param1 () String)
5 (declare-fun param2 () String)
6 (declare-fun i () Int)
7 (declare-fun lower () Int)
8
9 // path condition
10 (assert (= foo.contains(param1) true))
11
12 // variable values
13 (assert (= foo.indexOf(param2) i))
14 (assert (= param1 "x"))
15 (assert (= param2 "x"))
16 (assert (= foo.substring(lower, i) foo1))
17 (assert (= lower 0))

Listing 5.11: Cq2: Encoding of path q2 from Listing 5.9
1 // declare variables
2 (declare-fun foo () String)
3 (declare-fun param1 () String)
4
5 // path condition
6 (assert (= foo.contains(param1) false))
7
8 // variable values
9 (assert (= param1 "x"))

The outputs of the symbolic execution stage of the analysis are the path conditions and variable domain restrictions based on the executed statements, shown in Listing 5.8 and Listing 5.9 for the method in Listing 5.7.

An encoded program Penc is a set of its encoded paths. So, each path is transformed into constraints representing the path logic. Since our approach is meant to support many languages, a different encoding scheme is required for each language (see Chapter 6). The transformation process for each path begins on line 8 of the algorithm in Listing 5.6.

Each encoded path is represented in conjunctive normal form, Cq = {c1 ∧ c2 ∧

· · · ∧ co} for the o constraints. Continuing with the example, rough representations of the encodings for the paths from Listing 5.7 are provided in prefix notation (see

Chapter 6 for specifics). Listing 5.10 roughly represents the constraints for q1 and

Listing 5.11 roughly represents the constraints for q2. Java-style comments are included to differentiate between variable declarations and encoded program logic. This process is repeated for every path q to create an encoded representation of

P as a disjunction of encoded paths; CP = {Cq1 ∨ Cq2 ∨ · · · ∨ Cqm } for the m paths in P . With the example, CP = {Cq1 ∨ Cq2 }. In the end, the encoding process maps every program to a set of constraints such that RepPenc = {CP1 ,CP2 ,...,CPn }. The aggregation of RepPenc is shown on line 15 of Listing 5.6. Critical to the efficiency of the approach is the granularity of the encoding. The

finest granularity corresponds to encoding the whole program behavior in CP . At the coarsest granularity the encoding would capture none of the program behavior so CP = {true}. These extremes correspond to the least and the greatest number of matches and the worst and the best search speeds respectively, but there is a spectrum of choices in between. Two levels are explored in Chapter 6, specifically encoding at the component level (Yahoo! Pipes) and the library level (Java).

5.2.2 Mapping Input and Output

For each path q ∈ QP , the next step is to identify its input and output, creating an input/output signature and mapping for when the path is paired with some LS during solving. Line 10 in Listing 5.6 represents this process. The input/output is defined and mapped for paths rather than for whole programs since the transformation to SSA may require that the output be mapped to different variables in different paths (i.e., the φ function in SSA serves this purpose). In general, an input could be any memory location, web resource, file, or other entity containing information for a program that is read, but not written. An output is potentially any variable, file, or memory location that is written. For most programs,

the input and output are symbolic by default. For example, variables input1, input2 and the return statement in Listing 5.3 do not have concrete values unless the program is executed. In the case of a program that has hard-coded inputs, as for a Yahoo!

Pipes program which has concrete URLs embedded in the Fetch Feed modules, the input must be relaxed and encoded symbolically so the program can be matched against any arbitrary LS with matching types. In Java, the inputs are identified as variables that are used but not defined. The output is the result of a return statement, or in the case of a code block, the left-hand side of an assignment statement. For branching structures, the arity of the input/output may vary with paths depending on the variables that are used. In Yahoo!

Pipes, the inputs are the lists of items flowing out of Fetch Feed modules and the output is the list of items flowing out of the Pipe Output module. In SQL, the inputs are the tables accessed in the FROM clause of the select statement, and the output is the resulting table. By identifying the inputs and outputs from a program path q, an

input/output signature is created and mapped onto the program variables to prepare for binding with an arbitrary LS.

Continuing with the Java example, for paths q1 and q2, the input is identified as foo, which is the only variable that is used but not defined. The output is mapped to the return statements in the paths. In this example, the output is different for each

path. For q1, the output is foo1 whereas for q2, the output is foo. In both cases, the output is of type string. Once the input/output variables have been identified, the next step is to bind the variables to new input/output variables that will be instantiated when the path is

bound to an LS in the solving phase. The mapping constraints for q1 are shown in

Listing 5.12 and for q2 in Listing 5.13. The new variables are input and output, both of type string like their counterparts in the program encoding.

Listing 5.12: Input/Output Mapping for path q1 from Listing 5.8 // declare input and output variables (declare-fun input () String) (declare-fun output () String)

// map input and output to program variables (assert (= input foo)) (assert (= output foo1))

Listing 5.13: Input/Output Mapping for path q2 from Listing 5.9 // declare input and output variables (declare-fun input () String) (declare-fun output () String)

// map input and output to program variables (assert (= input foo)) (assert (= output foo))

While not the case with this example, if there are multiple inputs in a path q with the same type, the mapping is a disjunction of all possible bijections of the input in LS

to the input variables in q. This allows the solver to identify bindings of specifications to program variables such that the program can match the specification. For this reason, the ordering of the elements in an input do not matter. An example of this situation is shown in Section 5.2.3 with Listing 5.15. The last step is to add the mapping constraints to the encoded path constraints, as shown with the union operator on Line 10 in Listing 5.6. This completes the path

encoding Cq so it can be added to Penc, as shown in Line 12 of Listing 5.6. Note that

every encoded path is a set of constraints, and so Penc is a set of sets of constraints. In the solving phase when a path q is matched to an LS, there needs to be type matching on the inputs and outputs in q to the types in LS. For this reason, we create a type signature for q, similar to an input/output pair in LS. The difference is that the type signature contains variable names, rather than values, and type annotations. The values for these variables are set when the path is combined with

a specification and sent to the solver. For example, with q1, the type signature

TSq1 = ({input:String}, {output:String}). The type signature for q2 is identical. Type signatures are attached to paths rather than programs as the signatures can change in the presence of abstraction, and can also be different across paths in a program depending on the variables that are used. More discussion on this follows in Section 5.2.3.1.
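To make the type-signature machinery concrete, here is a minimal Java sketch of a signature like TSq1 and of the subsumption test (TSq ⊇ TSLS) used later when selecting candidate paths. The class and method names are hypothetical, assuming a simple count-per-type model of the inputs.

import java.util.*;

// Illustrative representation of a type signature; not Satsy's actual classes.
class TypeSignature {
    final Map<String, Integer> inputTypeCounts;   // e.g., {String: 1}
    final String outputType;                       // e.g., "String"

    TypeSignature(Map<String, Integer> inputTypeCounts, String outputType) {
        this.inputTypeCounts = inputTypeCounts;
        this.outputType = outputType;
    }

    // True when this (path) signature offers at least as many inputs of each type
    // as the specification's signature, and the output types agree.
    boolean subsumes(TypeSignature spec) {
        if (!outputType.equals(spec.outputType)) return false;
        for (Map.Entry<String, Integer> e : spec.inputTypeCounts.entrySet()) {
            if (inputTypeCounts.getOrDefault(e.getKey(), 0) < e.getValue()) return false;
        }
        return true;
    }
}

Under this model, TSq1 = ({input:String}, {output:String}) would be constructed as new TypeSignature(Map.of("String", 1), "String").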

5.2.3 Abstraction Selection

Encoding occurs at an abstraction level (Abstraction Selector in Figure 5.1). The initial encodings are as concrete as possible, as shown in Listing 5.10 and Listing 5.11, but weaker encodings that replace concrete values with symbolic ones, as discussed with the example in Section 4.3.3, are also computed and used. Weaker encodings can approximate program behavior and be useful when searching over small repositories or in the presence of strong specifications. We exploit the fact that most languages contain constraints over multiple data types (e.g., strings, characters, integers, booleans) for which the variable values can be relaxed and encoded as symbolic.

An abstraction lattice, shown in Figure 5.3, guides the abstraction selection. This lattice shows the potential abstraction levels, based on data type, for the types supported in our Java implementation. While this lattice is defined for types, it could also consider abstraction in variable instances; defining such additional lattices is part of future work. A subset of the lattice also applies for Yahoo! Pipes and SQL.1 The lattice provides a convenient structure to show the relationships among abstraction levels. At the bottom, the programs are encoded as concretely as possible. By moving up the abstraction lattice, the encodings relax and programs have the potential to satisfy a broader set of specifications.

Figure 5.3: Type Lattice Example

1The Yahoo! Pipes and SQL languages take advantage of strings and integers, so the lattice subset includes just those datatypes.

A link between two levels in the lattice indicates that the lower level is weaker than the upper level. That is, the upper level subsumes the lower level in that it has the potential to match a broader set of specifications. Bottom levels represent lower levels of abstraction (i.e., more concrete values, weaker specifications) and the top levels represent higher levels of abstraction (i.e., more symbolic values, stronger specifications). Moving downward in the lattice is a weakening move; in our context, that means that some programs that match a specification at the upper level will not match at the lower level. Conversely, moving upward in the lattice is a strengthening move, where there may not exist a program that matches a specification at a lower level, but there may exist a program that will match at a higher level. Drawing on the lattice in Figure 5.3, moving from the fully symbolic level at the top, down and all the way to the left, is a weakening move that toggles the booleans from symbolic to concrete.
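For illustration, the lattice structure can be sketched in Java by modeling a level as the set of data types whose values are symbolic. This is our own formalization of Figure 5.3, not code from the implementation; the class and method names are hypothetical.

import java.util.*;

// A level of the data-type abstraction lattice: the empty set is the fully
// concrete bottom level; the full set is the fully symbolic top level.
class AbstractionLevel {
    final Set<String> symbolicTypes;   // subset of {boolean, char, int, String}

    AbstractionLevel(Set<String> symbolicTypes) {
        this.symbolicTypes = symbolicTypes;
    }

    // An upper level subsumes a lower one when it abstracts at least the same
    // types, so it can potentially match a broader set of specifications.
    boolean subsumes(AbstractionLevel lower) {
        return symbolicTypes.containsAll(lower.symbolicTypes);
    }

    // Directly linked lattice neighbors differ by exactly one data type.
    boolean isDirectlyAbove(AbstractionLevel lower) {
        return subsumes(lower) && symbolicTypes.size() == lower.symbolicTypes.size() + 1;
    }

    public static void main(String[] args) {
        AbstractionLevel bottom = new AbstractionLevel(Set.of());           // all types concrete
        AbstractionLevel strings = new AbstractionLevel(Set.of("String"));  // strings symbolic
        AbstractionLevel top = new AbstractionLevel(Set.of("boolean", "char", "int", "String"));
        System.out.println(top.subsumes(strings));            // true
        System.out.println(strings.isDirectlyAbove(bottom));  // true
    }
}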

5.2.3.1 Weakening Example in Java

With the encoded path q1 in Listing 5.10, foo is encoded symbolically, but param1 and param2 are encoded concretely. A weaker encoding simply defines some, or all, concrete variables to be symbolic. In Listing 5.10, removing the constraints on lines 12

and 13 would turn param1 and param2 from concrete to symbolic encodings, obtaining a more general version of that program. This would be equivalent to adding two additional arguments to the program in Listing 5.7, forming the modified program in Listing 5.14, where param1 and param2 are now symbolic as part of the input.

In the abstraction lattice in Figure 5.3, the initial encoding of q1 corresponds to the bottom level of the lattice, where all types are concrete. Once the strings are abstracted and the constraints removed, the encoding moves up the abstraction lattice and to the left, to the level where strings are made symbolic. However, other data types could be abstracted in q1, such as the integer variable lower. From the string-symbolic level, abstracting the integers would move up the abstraction lattice and to the left again. Since there are no concrete character or boolean variables in q1, that level is the same as all those that subsume it (i.e., those connected and above).

Listing 5.14: Approach Encoding Example
1 static String bar(String foo, String param1, String param2) {
2   if(foo.contains(param1)) {
3     int i = foo.indexOf(param2);
4     foo = foo.substring(0, i);
5   }
6   return foo;
7 }

For consistency, variables weakened in q1 must also be weakened in q2. Otherwise, the two paths, which came from the same program, would not be representative of

that same program at the higher abstraction level. Since param1 is the only common,

weakened variable between q1 and q2, only line 7 in Listing 5.11 needs to be removed.

To show the benefits of a weakened encoding, consider the following two specifications, LS1 and LS2:

LS1 = { ({"cowxboy"}, {"cow"}) }
LS2 = { ({"cow-boy"}, {"cow"}) }

LS1 will match the concrete encoding of q1 in Listing 5.10 and the weakened encoding

in which param1 and param2 are made symbolic. In both encodings, I1 = “cowxboy”

is mapped to foo. Considering LS2, the more concrete encoding of q1 would not match.

However, with the weakened version of q1, there would be a match since param1 and param2 would be treated as symbolic and could match any string.2 Abstraction can be very powerful in this approach to code search, finding code that is close enough to a match in that it can be changed slightly to match the specifications.

2In the solving phase, the satisfiable model would reveal that param1 = “-” and param2 = “-”. See Section 5.3 for details.

One consequence of weakening the string variables is that param1 and param2 become potential inputs, which changes the type signature of q1, forming TS′q1. That is, TS′q1 = ({input1:String, input2:String, input3:String}, {output:String}). Additionally, the input/output mapping changes as a result of the new set of inputs. The new mappings are shown in Listing 5.15.

Listing 5.15: More Abstract Input/Output Mapping for path q1 // declare input and output variables (declare-fun input1 () String) (declare-fun input2 () String) (declare-fun input3 () String) (declare-fun output () String)

// map input and output to program variables (assert (or (and (= input1 foo) (= input2 param1) (= input3 param2)) (and (= input1 foo) (= input3 param1) (= input2 param2)) (and (= input2 foo) (= input1 param1) (= input3 param2)) (and (= input2 foo) (= input3 param1) (= input1 param2)) (and (= input3 foo) (= input1 param1) (= input2 param2)) (and (= input3 foo) (= input2 param1) (= input1 param2)))) (assert (= output foo1))

With the addition of symbolic string variables, param1 and param2, assigning inputs to program variables becomes more complex. Now that there are three inputs of type String in q1, all possible assignments of input string variables to the symbolic program variables in q1 must be considered, so all possible bijections of the inputs from the type signature to the input variables in q1 are encoded. For this example, there are 3! = 6 possible mappings. The bijections are computed within type to prevent type mismatch. That is, if q1 also had two integer inputs, then the integer bijection and mapping would be computed independently of the string mapping.

Similarly, the type signature of q2 also changes to TS′q2 = ({input1:String, input2:String}, {output:String}), and the revised mapping is shown in Listing 5.16.

Listing 5.16: More Abstract Input/Output Mapping for path q2 // declare input and output variables (declare-fun input1 () String) (declare-fun input2 () String) (declare-fun output () String)

// map input and output to program variables (assert (or (and (= input1 foo) (= input2 param1)) (and (= input2 foo) (= input1 param1)))) (assert (= output foo))
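The within-type bijections behind Listings 5.15 and 5.16 can be enumerated as permutations. The sketch below is illustrative (the class and method names are ours) and shows how the 3! = 6 string mappings for the weakened q1 would be generated; any integer inputs would be permuted in a separate, independent call.

import java.util.*;

public class BijectionEnumerator {

    // Returns every permutation of programVars; pairing the i-th specification
    // input with the i-th element of a permutation yields one disjunct of the
    // mapping constraint.
    static List<List<String>> permutations(List<String> programVars) {
        List<List<String>> result = new ArrayList<>();
        permute(new ArrayList<>(programVars), 0, result);
        return result;
    }

    private static void permute(List<String> vars, int k, List<List<String>> out) {
        if (k == vars.size()) {
            out.add(new ArrayList<>(vars));
            return;
        }
        for (int i = k; i < vars.size(); i++) {
            Collections.swap(vars, k, i);
            permute(vars, k + 1, out);
            Collections.swap(vars, k, i);
        }
    }

    public static void main(String[] args) {
        // Three string inputs, as in the weakened q1: foo, param1, param2
        List<String> stringVars = List.of("foo", "param1", "param2");
        System.out.println(permutations(stringVars).size()); // 3! = 6, matching Listing 5.15
    }
}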

5.2.3.2 Weakening Example in Yahoo! Pipes

An example of how abstraction impacts the encoding of a filter module in Yahoo! Pipes (Figure 5.4) is shown in Figure 5.5. The behavior of the module is to accept items with tennis in the description. In the fully concrete encoding, the string tennis is concrete; in the encoding where strings are abstracted, the string has been replaced with a symbolic string s, so the constraint is satisfied if the solver can identify a value for s. This allows for a weaker encoding of the filter module.

One thing to point out about the lattice in Figure 5.3 is that it toggles all instances of a data type from symbolic to concrete. There may be stages in between two levels where some, but not all, of the variables of a given type should be relaxed. Another consideration is that types might not be the only property that should be considered in an abstraction lattice. Other properties, such as bounds on integers or string lengths, could be added to the lattice. Identifying useful abstractions and levels of abstraction requires thorough empirical experimentation, and is part of the future work described in Section 11.2.3.

Figure 5.4: Example of Yahoo! Pipes Filter Module

Concrete encoding: (recordOf(in, r) ∧ contains(field(r, “description”), “tennis”)) → recordOf(out, r)
String-symbolic encoding: ∃s | (recordOf(in, r) ∧ contains(field(r, “description”), s)) → recordOf(out, r)

Figure 5.5: Lattice Encoding Example of Yahoo! Pipes Filter Module using Figure 5.3

5.3 Solving

The constraint repository, RepPenc, is used by the solver, in conjunction with

the encoded specifications, CLS, to determine matches (SMT Solver in Figure 5.1).

Given CLS, for each CP ∈ RepPenc, the approach invokes Solve(Cq ∧ CLS), which returns sat, unsat, or unknown, for every Cq ∈ CP such that TSq ⊇ TSLS. The final condition ensures that the type signature of q contains at least as many elements as the type signature for LS. A more constrained search requires equality, whereas a more relaxed search allows for the superset relationship (the result of a superset relationship is that some input variables in q remain symbolic and the solver, then, assigns values). Solve returns sat when a satisfiable model is found or unsat when no model is possible. When the solver is stopped before it reaches a conclusion or it cannot handle a set of constraints, unknown might be returned. The set of matches, or SatP, consists of all programs P for which at least one path q ∈ QP returns sat.

In practice, to invoke the SMT solver for a given specification and encoded path, some additional information is needed, which we call search parameters. The first

parameter is the abstraction level of the encoded programs, which is set using the Abstraction Selector, shown in Figure 5.1. The selector starts with the strictest (most concrete) level, but this may be relaxed as the search process iterates in the presence of tight or complex constraints, as described in Section 5.2.3. The second parameter is the solver time, which defines how long the solver is allowed to run on a particular constraint system. In some cases, as shown in Section 8.1.4, it can take several minutes for the solver to return sat or unsat, so setting a maximum solver time can lead to an efficient search, though it can miss some matches.
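A self-contained Java sketch of this solving loop is shown below. The types (Program, Path, SmtSolver) are simplified stand-ins for Satsy's internals and the Z3 bindings, and the names are ours; the actual implementation also adds the input/output mapping constraints before solving.

import java.util.*;

public class SearchLoop {
    enum Result { SAT, UNSAT, UNKNOWN }

    interface SmtSolver { Result solve(List<String> constraints, long timeoutMillis); }

    static class Path {
        List<String> constraints = new ArrayList<>();        // Cq, with the I/O mapping
        Map<String, Integer> inputTypes = new HashMap<>();   // input side of TSq
    }

    static class Program { String id; List<Path> paths = new ArrayList<>(); }

    // TSq ⊇ TSLS: the path offers at least as many inputs of each type as the specification.
    static boolean subsumes(Map<String, Integer> tsQ, Map<String, Integer> tsLS) {
        for (Map.Entry<String, Integer> e : tsLS.entrySet())
            if (tsQ.getOrDefault(e.getKey(), 0) < e.getValue()) return false;
        return true;
    }

    // Returns SatP: every program with at least one path satisfiable together with CLS.
    static List<Program> search(List<Program> repPenc, List<String> cls,
                                Map<String, Integer> tsLS, SmtSolver solver, long maxMillis) {
        List<Program> satP = new ArrayList<>();
        for (Program cp : repPenc) {
            for (Path cq : cp.paths) {
                if (!subsumes(cq.inputTypes, tsLS)) continue;    // skip incompatible paths
                List<String> system = new ArrayList<>(cq.constraints);
                system.addAll(cls);                               // Cq ∧ CLS
                if (solver.solve(system, maxMillis) == Result.SAT) {
                    satP.add(cp);                                 // one sat path suffices
                    break;
                }
            }
        }
        return satP;
    }
}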

5.3.1 Matching with Multiple I/O Pairs

When a specification contains multiple input/output pairs (i.e., k > 1), each of the pairs is checked independently.3 That is, SatP is computed for every ls ∈ LS. Strong matches are those that appear in the intersection for all specifications, specifically ⋂ls∈LS SatPls. Similarly, those paths that match more ls are also strong matches.

Some programs will over-approximate the behavior of the specifications when the following holds: ∃q ∈ QP such that ∀ls ∈ LS, Solve(Cq ∧ Cls) = unsat. That is, there exists a path in P that LS does not cover. As an example, consider the following LS:

LS = {({"cowxboy"}, {"cow"}), ({"abcx"}, {"abc"})}

The program in Listing 5.7 over-approximates this behavior since the specification

exercises only one of the two paths. Thus, there is behavior, namely path q2 in

Listing 5.11, that returns unsat for both ls1 and ls2 above. Since q2 does not contribute to the satisfiability of the specification, it is possibly irrelevant.

3Refining the approach to consider the input/output pairs simultaneously is left for future work and discussed in more detail in Section 11.2.6.

At the same time, since both ls1 and ls2 return sat for q1, the specification may contain some redundancy. Only part of the specification is necessary to identify P as a result. Other programs will under-approximate the behavior of the specifications when the following holds: ∃ls ∈ LS such that ∀q ∈ QP, Solve(Cq ∧ Cls) = unsat. That is, there exists an ls ∈ LS that does not have a matching path in P. As an example, consider the following LS:

LS = {({"cowxboy"}, {"cow"}), ({"abcx"}, {"abc"}), ({"foo"}, {"bar"})}

Considering again the program in Listing 5.7, both paths q1 and q2 return unsat for ls3, so P under-approximates the desired behavior and only partially satisfies the specification. We hypothesize that the number of specifications or paths matched could be considered in the ranking of the results (Section 5.3.2), and explore this notion in Section 7.2.5.3. Besides ranking, the unmatched specifications could be used to guide modifications to a program. One approach to this leverages counter-example guided inductive synthesis [40], in which the concrete input/output pairs that are unmatched could be used to guide modification to the program P. This discussion is left for future work in Section 11.2.2. For unmatched paths, test-case generation techniques could generate input/output pairs that would match those paths, and ask the user if that behavior is consistent with the code they desire.

5.3.2 Ranking

For the search in Java, the results are returned to the programmer ordered according to various criteria. We discuss three directions in this section, yet their detailed analysis, implementation, and study are left for future work.

5.3.2.1 Path Matching

For each LS and each program P , there are a variety of levels of matching by considering subsets of LS and subsets of the paths in P . The strength of a match is measured as a function of how well the program P matches a provided LS.

For a program P , QP is the set of paths in P. For QP and a specification LS, let

Qsat be the set of paths q such that ∃ls ∈ LS | Solve(Cls ∧ Cq) = sat. That is, Qsat is

the set of paths in P such that at least one ls ∈ LS returns sat. Further, let LSsat be

the set of ls such that ∃q ∈ QP | Solve(Cls ∧ Cq) = sat. That is, LSsat is the set of

input/output pairs such that at least one path q ∈ QP returns sat.

When LS = LSsat, all ls ∈ LS have a match in P , and this program should be

highly ranked. When QP = Qsat, all paths in P have been covered by LS, and perhaps these programs should be highly ranked. Table 5.1 presents a matrix with potential rankings considering the relationships among these sets. For example, the highest rank could be assigned to a program when both of the aforementioned conditions hold.

If either LSsat or Qsat is empty, this means either no paths in P match, or no ls ∈ LS match, so P is not a result. Handling middle rank programs requires further study

(Section 7.2.5.3). When Qsat = QP and LSsat ⊂ LS, then all paths in P are matched, but only part of LS contributed to the match. This could describe a situation when P has one path, but |LS| ≥ 2, meaning that at most half of LS is satisfied by P , and in

that case P under-approximates the specification. When LSsat = LS and Qsat ⊂ QP , this means all of LS was matched by some path(s) in P , but not all of P was matched. That is, P over-approximates the behavior described by LS, which may or may not be reasonable. This would be an opportunity to then generate additional I/O pairs that exercise the unmatched paths in P and ask the programmer if that behavior matches their intention (see Section 5.3.1). Again, further study is needed. 76

Table 5.1: Path Matching Ranking

              LSsat = ∅    LSsat ⊂ LS     LSsat = LS
Qsat = ∅          –             –              –
Qsat ⊂ QP         –        Lowest Rank    Middle Rank
Qsat = QP         –        Middle Rank    Highest Rank

Within each of the quadrants, there is another consideration that relates to how

saturated a program is by LS. If |LSsat| > |Qsat|, then at least one of the matched

paths is covered by two or more ls ∈ LSsat. On the other hand, if |LSsat| ≤ |Qsat|, then every ls exercises a different behavior in P . Understanding how to consider this information in a ranking algorithm is left for future study.
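As a sketch, the quadrants of Table 5.1 translate directly into a small ranking function. The enum and method below are illustrative only and are not part of the implemented ranking; the saturation comparison mentioned above would be layered on top as a tie-breaker.

import java.util.*;

public class PathMatchRanker {
    enum Rank { NOT_A_RESULT, LOWEST, MIDDLE, HIGHEST }

    static <S, Q> Rank rank(Set<S> lsSat, Set<S> ls, Set<Q> qSat, Set<Q> qP) {
        if (lsSat.isEmpty() || qSat.isEmpty()) return Rank.NOT_A_RESULT;  // no match at all
        boolean allSpecsMatched = lsSat.equals(ls);   // LSsat = LS
        boolean allPathsMatched = qSat.equals(qP);    // Qsat = QP
        if (allSpecsMatched && allPathsMatched) return Rank.HIGHEST;
        if (!allSpecsMatched && !allPathsMatched) return Rank.LOWEST;
        return Rank.MIDDLE;                           // exactly one of the two conditions holds
    }
}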

5.3.2.2 Concrete Variable Density

Another direction is to consider the density of concrete variables in the program (calculated by the percentage of variables that are concrete), as these are more likely to fit the programmer’s query as is and without modification (such as instantiating symbolic variables with values from the satisfiable model). As an example, consider the LS in Listing 5.17, which provides an input/output pair with an integer input and a boolean output. Two potential solutions are shown in Listing 5.18 and Listing 5.19.

Listing 5.17: LS for Concrete Variable Density Ranking LS = {({3}, {true})}

Assigning LS to the first solution in Listing 5.18 means all variables are assigned since a ↦ input. The density of concrete variables in this first solution is 1/1 = 1.00.

Listing 5.18: Solution #1 for Listing 5.17 boolean op_ifle(int a) { return a > 0; }

On the other hand, there are three input variables in the second solution in Listing 5.19. Assigning LS leaves two free so the solver assigns values depending on the desired

output from LS (e.g., for ls1, if low ↦ input, then a satisfying assignment of the others could be x ↦ 4 and pup ↦ 5). The density of concrete variables in this solution is 1/3 = 0.33. With this ranking algorithm, the solution in Listing 5.18 would rank above the solution in Listing 5.19.

Listing 5.19: Solution #2 for Listing 5.17 boolean isBetween(int x, int low, int pup) { return low <= x && x <= pup; }

In the event of a tie, however, this ranking approach falls short. Consider the solution in Listing 5.20, where the variables could be mapped as follows: col ↦ input, PROFILE_DATA_COLUMN ↦ 3, and row ↦ 0.

Listing 5.20: Solution #3 for Listing 5.17 int PROFILE_DATA_COLUMN;

boolean isCellEditable(int row, int col) { return col == PROFILE_DATA_COLUMN; }

Here, the density of concrete variables is also 1/3 = 0.33, yet, unlike Listing 5.19, one of the variables is never used (i.e., row). Another criterion must be used to break the tie. One criterion could be the number of reads and writes on the variables. For example, instantiating a symbolic variable that is used only once seems to require less change to the program intention than to instantiate a symbolic variable that is read

five times, so our initial response would be to rank the former program higher. Another approach could use program paths (e.g., Listing 5.18 has one path but Listing 5.19 has two paths). Understanding the best ranking requires thorough investigation and user input, and is left for future work.
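A sketch of the density criterion as code is shown below, under the assumption that the encoder already reports how many variables end up concrete once LS is mapped onto the snippet; the Snippet container and its field names are hypothetical.

import java.util.*;

public class DensityRanker {
    static class Snippet {
        String source;
        int concreteVariables;   // variables with concrete values after LS is mapped
        int totalVariables;
    }

    static double density(Snippet s) {
        return s.totalVariables == 0 ? 0.0 : (double) s.concreteVariables / s.totalVariables;
    }

    // Higher density ranks first; ties would fall through to a secondary
    // criterion such as variable read/write counts, as discussed above.
    static void rank(List<Snippet> results) {
        results.sort(Comparator.comparingDouble(DensityRanker::density).reversed());
    }
}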

5.3.2.3 User Context

A third ranking approach would exploit the user context, and is likely more appropriately applied within a constrained environment like a company repository. Considering user context could, for example, rank higher those snippets of code that have been written by the programmer who specified LS, or that come from a file recently touched by that same programmer. In search results, user context can be a powerful aid since programmers are likely most familiar with code they have written or worked with in the past. As with the other ranking algorithms, further study is needed here.

5.3.3 Ambiguity

Under the hood, the quantifier instantiation heuristics in most SMT solvers are sound but may be incomplete [13]. Most SMT solvers, and specifically the Z3 [79] solver used in our implementation, use the Model-Based Quantifier Instantiation (MBQI) technique for building models that satisfy queries with quantifiers, which most of the Java encodings make use of (see Section 6.3). MBQI creates a candidate interpretation of the quantified formula based on an interpretation that satisfies the quantifier-free assertions, but sometimes the candidate interpretation cannot be found. The result is that unknown may be returned when the logic is actually satisfiable. Treating a response of unknown as unsatisfiable means our search is also incomplete, but treating

those responses as results means our search is potentially unsound. Understanding what should be done with the unknown results requires further investigation and is left for future work.

5.4 Summary

At this point, we have presented a general definition of the approach including all phases of the search process, specifying a query, indexing source code, and finding matches. Each phase has been illustrated using Java examples for clarity. Still left to be studied is when to apply abstractions during the search process and how to rank matches. Potential solutions to these questions will need to be thoroughly evaluated empirically, and that is left for future work. In the next chapter, we describe in greater detail how this approach has been instantiated in two of the targeted languages, Java and Yahoo! Pipes. 80

Chapter 6

Implementation

In this chapter, we introduce Satsy, an input/output search engine that implements our approach to semantic search (Chapter 5). Satsy supports two languages, Java and Yahoo! Pipes, for the purpose of finding programs given an encoded repository and a specification.1 The encoding tools, which transform programs into constraints in the SMT-LIB2 [61] format, are different for each supported language and are described separately from Satsy. Constraints in SMT-LIB2 format are stored as text and are consumed by the Z3 SMT solver v.4.1 [79]. The implementation utilizes the UFNIA theory (non-linear integer arithmetic with uninterpreted sort and function symbols) in the encoding of all programs. An overview of how Satsy works is presented next, followed by the data types supported in our implementation and the implementation details for Java in Section 6.3 and for Yahoo! Pipes in Section 6.4.

1The SQL implementation is not within the scope of this thesis, but is included in this motivation and illustration because it has been part of previous work [70].

Figure 6.1: Satsy Solving Overview

6.1 Solving with Satsy

Satsy is the tool that takes as input a lightweight specification and a repository of constraints, and returns a set of ranked results as output. Satsy implements the online portion of the approach in Figure 5.1, and is shown in Figure 6.1. Given an LS

from a programmer, Satsy first encodes the specification (Section 5.1.3), forming CLS. Next, a set of candidate paths is gathered from a repository of encoded programs,

RepPenc, subject to a specified abstraction level of encodings. The set of candidate paths represents those paths whose type signature subsumes the type signature of the specification, TSq ⊇ TSLS. Next, for each path q, and for

each ls ∈ LS, the encoded specification is “glued” to Cq using additional constraints

that map TSls onto TSq (Section 5.2.2). Next, the solver is invoked for each ls ∈ LS,

and each candidate path q, solve(CLS ∧ Cq), with a specified maximum solver time. For each program (where paths are organized back into their original program), four pieces of information are returned for consideration in the ranking algorithm: encoded

paths QP, matched paths Qsat, specification LS, and matched specification LSsat (see Section 7.2.5.3 for details).

Building the repository of encoded programs, RepPenc, is a process that is highly dependent on the language and available tools. The Java implementation is described in Section 6.3, followed by the Yahoo! Pipes implementation in Section 6.4.

6.2 Data Types

Encoding the language subsets for Yahoo! Pipes and Java requires similar data types. Three primitive types are supported (characters, integers, booleans) and one composite type, list. These basic types are sufficient to represent all the constructs the encoding supports across these domains. For example, a string is a shorthand for a list of characters and a Yahoo! Pipes record is a map of strings to objects with names modeled as strings. Booleans and integers are built-in types for the UFNIA theory, but characters

(Char), lists (List), records (Record), and strings (String) are declared as uninterpreted sorts. As strings are used by both targeted languages, the string implementation is described here (lists and records are discussed in Section 6.4). Each character supported in the encoding is declared as a distinct function with

type Char. A string is defined by its length and by explicitly defining every ith character of the string for 0 ≤ i < length. An axiom determines equality between strings when both have the same length and contain the same characters. Listing 6.1 shows how strings are encoded in SMT LIB2 format (lines that start with the ‘;’ character are comments). The declarations for the sorts are on lines 2 and

3. The character alphabet in this example is set to a, b, c on lines 6 - 9, though in practice it contains the entire roman alphabet and symbols found on a standard English keyboard, as shown in the grammar in Figure 6.3. The functions required for string definition are on lines 12 and 13, and the axiom for string equality is on lines

16 - 22. It reads, for all strings s1 and s2, if they have the same length and if for all integers i where 0 ≤ i < length(s1) the strings have matching characters at index i, then this implies their equality (line 22).

Listing 6.1: String Encoding Constraints
1 ; custom sorts for string and char definition
2 (declare-sort Char 0)
3 (declare-sort String 0)
4
5 ; functions for all Chars
6 (declare-fun _a_ () Char)
7 (declare-fun _b_ () Char)
8 (declare-fun _c_ () Char)
9 (assert (distinct _a_ _b_ _c_))
10
11 ; functions for string definition
12 (declare-fun length ( String ) Int)
13 (declare-fun charAt (String Int) Char)
14
15 ; axiom for equality
16 (assert (forall ((s1 String) (s2 String))
17   (=> (and
18         (= (length s1) (length s2))
19         (forall ((i Int))
20           (=> (and (<= 0 i) (< i (length s1)))
21               (= (charAt s1 i) (charAt s2 i)))))
22       (= s1 s2))))

An alternative choice to the implementation in Listing 6.1 would have been to use the theory of arrays, which is supported by many SMT solvers (e.g., Z3 [79]). The decision to use uninterpreted sorts and axioms instead was based on performance; at the time of implementation, we were using Z3 as the solver in Satsy, and array theory in Z3 was slower than the alternative. To illustrate the string implementation, consider the assignment statement, String s = “abc”; The implementation of string s is shown in Listing 6.2. The string is defined on line 2, after which lines 3 - 5 set the value of s and line 6 asserts the length. 84

Listing 6.2: A String as Constraints
1 ; String s = "abc";
2 (declare-fun s () String)
3 (assert (= (charAt s 0) _a_))
4 (assert (= (charAt s 1) _b_))
5 (assert (= (charAt s 2) _c_))
6 (assert (= (length s) 3))

To efficiently support operations on strings, the lengths are bounded where the bound is configurable (in line with recent work on solving string constraints [5][35]). Moreover, this is reasonable given that humans will create input/output pairs that will tend to be small. The same length restrictions are placed on list lengths. For four of the supported datatypes, integers, characters, booleans, and strings, our encoding supports concrete or symbolic values. For integers and booleans, as these are built-in types, removing their values, as was shown in Section 5.2.3.1, is sufficient for encoding symbolic variables with these types. For characters, however, an additional axiom is required to restrict the solver to select from only the characters defined in the implementation when instantiating a symbolic character within the solver. Listing 6.3 shows an axiom for the alphabet presented in Listing 6.1.

Listing 6.3: Symbolic Character Encoding
1 (assert (forall ((c Char))
2   (or
3     (= c _a_)
4     (= c _b_)
5     (= c _c_))))

For strings, declaring a symbolic string simply requires a constraint that the length of the string be greater than or equal to zero. For example, weakening the string s in Listing 6.2 would look like the encoding in Listing 6.4, where the string is declared on line 2 and the length constraint is added on line 3.

Listing 6.4: A Symbolic String as Constraints
1 ; String s;
2 (declare-fun s () String)
3 (assert (>= (length s) 0))

With the current implementation, there is an opportunity for the solver to assign characters at indices in a string that are outside of the bounds of the string length. That is, a string of length 3 could have a value at index 5. Presently, equality between strings is defined by the values of the strings within the bounds of zero and the string length. Further, for any operation on a string the string bounds are checked so no operation occurs outside of the bounds of the string. Thus, this should have a limited impact, if any, on the results of the study. This threat to internal validity is discussed further in Section 9.1.2.

6.3 Java Implementation Detail

This section describes how the encoding section of the approach (Section 5.2) has been implemented in the Java language. This process is depicted in Figure 6.2. For each program in a repository that matches our supported language, the first step (1) is to check if there are multiple paths. If there are, the program is sent to JPF (2), where the paths are traversed and logged. All paths, either from the original single-path program or from JPF, are encoded as constraints (3), which are stored in a repository (4). Each of these steps is now described in detail.

6.3.1 Language Support

Figure 6.2: Java Encoding Overview

All programs P ∈ RepP (Figure 6.2) contain only constructs that are supported by our implementation. This includes a subset of the Java language covering predicates,

assignment statements, and return statements, subject to the grammar shown in

Figure 6.3. We support arithmetic operations and most of the java.lang.String API,

including the following string manipulation functions: charAt, concat, contains,

endsWith, equals, equalsIgnoreCase, indexOf, isEmpty, lastIndexOf, length,

replace, startsWith, substring, toUpperCase, toLowerCase, and trim. This API was selected based on its popularity, the rich set of behaviors that can be achieved using combinations of the methods, and the potential to extend some of the principles to other APIs, such as list and array processing (see future work in Section 11.2.4). The grammar in Figure 6.3 was implemented in ANTLR. It is used to recognize when we can support a whole method or part of a method, and in the transformation process. The scope of supported Java snippets covers methods and blocks that fit the supported grammar. Some snippets have a single path while others have multiple 87

Figure 6.3: Java Supported Grammar (intra-procedural) hstatementsi ::=( hstatementi)* hstatementi ::= hvariableDeclarationi | hassignmenti | hifStatementi | hreturni hvariableDeclarationi ::= htypei hidenti ‘;’ hifStatementi ::=‘ if’‘(’ hexpri ‘)’‘{’ hstatementsi ‘}’ (‘else’‘if’‘(’ hexpri ‘)’‘{’ hstatementsi ‘}’)* (‘else’‘{’ hstatementsi ‘}’)? hassignmenti ::= htypei? hidenti ‘=’ hexpri ‘;’ hreturni ::=‘ return’ hexpri ‘;’ hexpri ::= hlogicalOrExpressioni hlogicalOrExpressioni ::= hlogicalAndExpressioni (‘||’ hlogicalAndExpressioni)* hlogicalAndExpressioni ::= hequalityExpressioni (‘&&’ hequalityExpressioni)? hequalityExpressioni ::= hrelationalExpressioni ((‘!=’ | ‘==’) hrelationalExpressioni)? hrelationalExpressioni ::= haddExpressioni ((‘<’ | ‘<=’ | ‘>’ | ‘>=’) haddExpressioni)? haddExpressioni ::= hmultExpressioni ((‘+’ | ‘-’) hmultExpressioni)? hmultExpressioni ::= hprimaryExpressioni ((‘*’ | ‘/’ | ‘%’) hprimaryExpressioni)? hprimaryExpressioni ::= hinvokeExpressioni | hstringLiterali | hcharLiterali | hintegeri | hbooleanLiterali | ‘(’ hexpri ‘)’ | hidenti hinvokeExpressioni ::=( hidenti | hstringLiterali) ( (‘.’ hnoparami ‘(’‘)’) | (honeparami ‘(’ hexpri ‘)’) | (htwoparami ‘(’ hexpri (‘,’ hexpri)? ‘)’ ))+ hnoparami ::=‘ length’ | ‘toLowerCase’ | ‘toUpperCase’ | ‘trim’ honeparami ::=‘ charAt’ | ‘concat’ | ‘contains’ | ‘endsWith’ | ‘equals’ | ‘equalsIgnoreCase’ | ‘indexOf’ | ‘lastIndexOf’ | ‘startsWith’ | ‘substring’ htwoparami ::=‘ indexOf’ | ‘lastIndexOf’ | ‘replace’ htypei ::=‘ char’ | ‘String’ | ‘int’ | ‘boolean’ hbooleanLiterali ::=‘ true’ | ‘false’ hintegeri ::=(‘ 0’ | (‘1’..‘9’) (‘0’..‘9’)* ); hstringLiterali ::=‘ "’ (LETTER | DIGIT | ‘:’ | ‘/’ | ‘_’ | ‘,’ | ‘.’ | ‘\’ | ‘\"’ | ‘’’ | ‘-’ | ‘’ | ‘(’ | ‘)’ | ‘;’ | ‘{’ | ‘?’ | ‘}’ | ‘=’ | ‘@’ | ‘[’ | ‘]’ | ‘&’ | ‘!’ | ‘#’ | ‘%’ | ‘+’ | ‘*’ | ‘^’ | ‘~’ | ‘‘’ | ‘<’ | ‘>’ | ‘|’ )* ‘"’ hcharLiterali ::=‘ ‘’( hescapei | ∼(‘’’|‘\’) ) ‘’’ hescapei ::=‘ \’ (‘b’ | ‘t’ | ‘n’ | ‘f’ | ‘r’ | ‘"’ | ‘’’ | ‘\’) hidenti ::= ( LETTER | ‘ ’ | ‘$’ )( LETTER | ‘ ’ | DIGIT | ‘$’ )* 88

paths. Snippets are removed from context and considered independently of their original implementation. That is, if a class has only one method supported by the grammar, we keep that method and throw away the rest. For this reason, the behavior of the code may differ from the behavior in the original implementation (e.g., if a field value is accessed, we declare it symbolically within the scope of the snippet). This threat is discussed in Section 9.1.2.

6.3.2 Program Transformation

The transformation process covers steps (1) - (3) in Figure 6.2. Methods that contain a predicate, conjunction, or disjunction need to be processed as multi-path programs using symbolic execution, as described in Section 5.2.1 ((2) in Figure 6.2). This is done using a listener attached to the symbolic execution extension [34, 33] to the Java PathFinder model checker [75]. The basic process involves two steps. First, a JPF driver is written to invoke the method under evaluation, and second, the listener is attached. The listener traverses the paths in the method, recording path conditions and the executed statements along those paths. Next, each path is encoded as constraints ((3) in Figure 6.2). This is done using an ANTLR grammar and a set of transformation rules that map program constructs onto constraints.

6.3.2.1 JPF Driver Creation

For JPF to be invoked on a method, that method needs to be called from a main method so JPF can execute it symbolically by replacing the concrete inputs with symbolic ones. This process is necessary because the listener (which collects path condition information) needs to be attached to executable Java code in order to traverse the paths.

Since the methods we encode are taken out of context, a driver needs to be created that will invoke each method from a main method. The driver wraps the method in a class with a main method, and calls the method from the main method using arbitrary, concrete values as arguments. In this way, a path is formed to the method that JPF can trace. The implementation can create a driver and invoke JPF automatically when given a method, without human intervention. An example driver for the method in Listing 5.7 is shown in Listing 6.5. The

method under evaluation, bar, appears at lines 9-15. The main method, on lines 5-7, calls bar on line 6 with an arbitrary input string, “abc”. When the listener is invoked on bar, “abc” is replaced by a symbolic string.

Listing 6.5: JPF Driver for Listing 5.7
1 package jpfDrivers;
2
3 public class DynamicDriver1 {
4
5   public static void main(String[] args){
6     bar("abc");
7   }
8
9   static String bar(String foo) {
10     if(foo.contains("x")) {
11       int i = foo.indexOf("x");
12       foo = foo.substring(0, i);
13     }
14     return foo;
15   }
16 }

6.3.2.2 JPF Listener

We have implemented a listener in JPF that records path conditions and executed statements for Java methods with bodies that are recognized by the grammar in Figure 6.3.

Within JPF, the symbolic execution of a method is a depth-first traversal of the search space [33]. As a method is symbolically executed, the listener logs the statements that have been executed by processing the bytecodes. When a branching bytecode is reached, JPF makes a decision on whether to take the true or false branch. The listener records this decision and continues aggregating executed statements. When the end of the method is reached (i.e., a return statement), the listener logs this path. Then, the search backtracks to a branching condition, negates it, and the process continues. The end result is a set of paths containing branching conditions

and executed statements. These form the set of paths QP for a program P. As an example, the listener was used to break down the multi-path method in Listing 5.7 in Section 5.2.1 into the paths in Listing 5.9 and Listing 5.8 using the driver in Listing 6.5. Embedded within the listener is the logic to transform each path into SSA form. For single-path programs, each program is pre-processed into SSA form outside JPF using an ANTLR grammar (part of (3) in Figure 6.2). One by-product of using JPF as the symbolic execution engine to collect paths is that it removes from consideration any infeasible paths (i.e., those paths are no longer traversed once a conflict is identified, and so the paths are never recorded by the listener). This is advantageous for the solving phase because paths that will never return true for any input are not returned by the listener, and thus are never encoded and stored in our constraint repository.

6.3.2.3 Program Encoding

Encoding happens on the path level. Once the paths have been identified, the transformation step can begin ((3) in Figure 6.2). Using the grammar in Figure 6.3, each piece of each program path is transformed into constraints using our transformation tool. The non-terminals in the Java grammar

are represented as program objects in the transformation code, and each object has

different properties. For example, an hinvokeExpressioni contains the receiving object (an hidenti or a hstringLiterali), the method name, and between zero and two expressions (hexpri). Within the transformation code is a transformation table, which acts like a template for each object. Given the object, the properties fill in the holes of the template and the constraints are generated systematically. The collection of all constraints generated for all non-terminals in the Java grammar forms the encoded program path. A tricky part of the encoding is translating method invocations on string objects. The encoding logic for each of the supported string library methods is shown in Table A.1, Table A.2, Table A.3, Table A.4, Table A.5, and Table A.6. We provide a sample here in Table 6.1. For each method call, the constraint purpose and definition are presented. When types for the variables are not stated, we used the following scheme: i and j are integers, s and sub are strings, c is a character.

Table 6.1: Sample Java Implementation Details

s.contains(sub) = true
  ensures size constraints of strings:
    (length(s) ≥ length(sub))
  ensures sub is a subset of s:
    (∃j ((j ≥ 0) ∧ (j < (length(s) − length(sub)))
      ∧ (∀i (((i < length(sub)) ∧ (i ≥ 0)) → (charAt(s, (i + j)) = charAt(sub, i))))))

s.indexOf(c) = i
  ensures s[i] = c:
    (s.charAt(i) = c) ∨ (i = −1)
  ensures i is the first occurrence of c:
    (∀j (((j ≥ 0) ∧ (j < i)) → (¬(charAt(s, j) = c))))
  sets bounds on i:
    ((i ≥ 0) ∧ (i < length(s))) ∨ (i = −1)

For example, consider the encoding for s.contains(sub) = true in Table 6.1. To encode this as constraints, we consider the length and content of strings s and sub. For length, it is required that the length of the receiving object s be at least the length of the argument sub. This is the first constraint defined in Table 6.1 for the contains method invocation. For content, we need to identify an offset in s such that between the bounds of the offset and the offset plus the length of sub, the two strings match. This is the criterion for containment, and is expressed in the second constraint in Table 6.1, where j is the offset. One oddity in the implementation comes with the encodings for indexOf and lastIndexOf in Table A.2 and Table A.3, with an example in Table 6.1. The index i is set to either be the location of the character or string in the reference string (i.e., s), or −1 in the event that the character or string cannot be found. In the SMT-LIB2 conversion, the first and third constraints on indexOf are joined to ensure no interference (e.g., it would be invalid if (s.charAt(i) = c) in the first constraint and (i = −1) in the third both evaluate to true, so these are combined to protect against that situation). More details on this follow in Appendix A.1. The encodings over-constrain the program behavior, and in this way, err on the side of returning unsat to avoid false positives. The consequence is that some potential matches may be missed. For example, the alphabet considered for string values is pre-determined. If, for example, a programmer provides a specification that requires the solver to use an accented character, such as ó, then unsat would be returned since that character is not recognized by our implementation. An unintended benefit is that this actually improves the search time since the solver is able to more quickly return unsat by providing a conflict than to return sat and provide a satisfiable assignment.

6.3.2.4 Abstraction

One process not described for Java is that of applying abstractions. In the encoding process for source code and with the current implementation, weakening concrete values with symbolic values is not supported. Adding support for weakening in Java would involve a small engineering effort, and is part of our planned future work. For the specifications provided by the programmer, the current implementation supports the adding, removing, and replace functions (Section 5.1.2).

6.3.2.5 Limitations and Nuances

Some restrictions and extensions are made to the grammar in Figure 6.3 with respect to the encodings, and are worth pointing out. These are:

• As shown in Table A.4, calls to s.replace(c, d) = t restrict the types of the arguments to replace to characters. The grammar does not impose such a restriction, which means the encoding must reject snippets that contain CharSequence objects as arguments to replace.

• In the Java VM, characters are represented as integers, allowing for relational

comparisons using the ‘<’, ‘<=’, ‘>’, and ‘>=’ operators. However, in our encoding, each character is defined as a function returning an uninterpreted sort (Section 6.2), meaning that relational operators cannot be encoded. Thus, the encoding restricts relational operators to integers; addressing this limitation is part of the future work.

• The null value for all data types is not currently supported. One challenge of this is defining equality among the null value across datatypes, and this is left for future work.

• The ‘+’ operator can be overloaded for string concatenation

• As described in Section 5.2.2, the mapping involves all possible bijections of the inputs in LS to the input variables in q. However, we do not keep track of the bindings for each ls ∈ LS to ensure consistency in binding among all input/output pairs in a specification. This is left for future work.

6.3.3 Constraint Storage

Step (4) in Figure 6.2 involves storing the constraints for each program. As each path in a program is processed separately, the paths for each program are stored separately. In this step, we use a MySQL database with the following structure:

mysql> describe snip;
+---------+---------------+------+-----+---------+----------------+
| Field   | Type          | Null | Key | Default | Extra          |
+---------+---------------+------+-----+---------+----------------+
| id      | int(11)       | NO   | PRI | NULL    | auto_increment |
| file    | varchar(1024) | NO   |     | NULL    |                |
| method  | varchar(128)  | NO   |     | NULL    |                |
| source  | text          | NO   |     | NULL    |                |
| path    | text          | NO   |     | NULL    |                |
| srcHash | varchar(16)   | YES  | MUL | NULL    |                |
+---------+---------------+------+-----+---------+----------------+

The id field is a unique identifier for the record. The file and method fields provide provenance information for where the path and its parent method came from. The original implementation of the method is stored in source, and the path field is the SSA representation of the path, right before it goes through the encoding step. The srcHash field is a foreign key linking each record to related records in

the other database tables. One table contains type signature information used to quickly identify potential matches and the other contains the actual constraints as text. When different levels of abstraction in the program encodings are supported by the Java implementation, the table containing the constraints will need to have an additional field indicating the abstraction level of that particular encoding. Using this information is part of the Yahoo! Pipes implementation, which considers four levels of abstraction in the encoding (Section 6.4.3). We chose a MySQL database for storage out of convenience, though in practice, the efficiency of retrieving the constraint representation of potential matches to provide to Satsy (Figure 6.1) is very important, so this may not be the best option long-term.
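For illustration, the sketch below stores one encoded path in the snip table described above using plain JDBC. The connection handling and the choice of hash function are assumptions for the example, not details of the actual implementation.

import java.sql.*;

// A sketch of persisting one path into the snip table; only the columns shown
// in the schema above are used, and srcHash here is a placeholder hash.
public class PathStore {
    public static void store(Connection db, String file, String method,
                             String source, String ssaPath) throws SQLException {
        String srcHash = Integer.toHexString(source.hashCode()); // placeholder hash
        String sql = "INSERT INTO snip (file, method, source, path, srcHash) VALUES (?, ?, ?, ?, ?)";
        try (PreparedStatement ps = db.prepareStatement(sql)) {
            ps.setString(1, file);     // provenance: originating file
            ps.setString(2, method);   // provenance: parent method
            ps.setString(3, source);   // original implementation
            ps.setString(4, ssaPath);  // SSA form of the path, pre-encoding
            ps.setString(5, srcHash);  // key linking to the signature/constraint tables
            ps.executeUpdate();
        }
    }
}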

6.4 Yahoo! Pipes Implementation Detail

This section describes how the encoding section of the approach (Section 5.2) has been implemented in the Yahoo! Pipes language. This process is depicted in Figure 6.4. For all programs in a repository that match our supported language, the first step is to (1) refactor the program into a simplified version to ease some of the burden on the solver. Next, the refactored program is (2) encoded into constraints, which (3) are stored in a repository. Each of these steps is described here in detail.

6.4.1 Language Support

All programs P ∈ RepP (Figure 6.4) contain only constructs that are supported by our implementation. This includes a subset of the Yahoo! Pipes mashup language covering

the following modules: fetch, filter, output, sort, split, tail, truncate, and

union, covering six of the top 10 most commonly used constructs.


Figure 6.4: Yahoo! Pipes Encoding Overview

We formally describe the supported language fragment in the grammar in Figure 6.5. To map a pipe onto the grammar, we transform each program into a parallel-serial graph, as defined in our previous work [71]. To illustrate, the program

in Figure 4.3(b) would be represented as: output tail truncate fetch, and the program in Figure 4.3(a) would be represented as: output filter union (fetch) (fetch). This language fragment covers common list manipulation operators, but our

implementation does not cover constructs like rename, through which items within a list are modified. Future work includes extending this language support to cover constructs that manipulate the list items themselves.

Figure 6.5: Yahoo! Pipes Supported Grammar

hpipei ::= ‘output’ hcompositioni
hcompositioni ::= hoperatori? hgroupingi | hiniti
hgroupingi ::= ‘union’ hsegmenti+
hsegmenti ::= ‘(’ hiniti ‘)’ | ‘(’ hoperatori? hgroupingi ‘)’ | ‘(’ (‘(’ hoperatori ‘)’)* ‘)’ ‘split’ hcompositioni
hiniti ::= hinteriori? ‘fetch’
hoperatori ::= (‘filter’ | ‘sort’ | ‘truncate’ | ‘tail’)+

6.4.2 Program Transformation

The transformation process covers steps (1) and (2) in Figure 6.4. Since there are no predicates in the Yahoo! Pipes language, every program has only one program path. For this reason, symbolic execution is not necessary.

6.4.2.1 List Implementation

Lists in Yahoo! Pipes are implemented similarly to strings (Listing 6.1), with a list being analogous to a string and a record being analogous to a character. Lists are defined by their size (i.e., length) and by defining the records at each index i in the list. Listing 6.6 describes the List implementation for Yahoo! Pipes. Comparing this to the string implementation, the main differences come in the custom sort definitions on lines 2 and 3 and in the list definition function names, which are on lines 12 and 13. The function recordOf for lists behaves like charAt for strings, and size for lists behaves like length for strings. 98

Listing 6.6: A Yahoo! Pipes list as Constraints 1 ; custom sorts for list and record definition 2 (declare-sort Record 0) 3 (declare-sort List 0) 4 5 ; functions for all Records 6 (declare-fun r1 () Record) 7 (declare-fun r2 () Record) 8 (declare-fun r3 () Record) 9 (assert (distinct r1 r2 r3)) 10 11 ; functions for list definition 12 (declare-fun size ( List ) Int ) 13 (declare-fun recordOf (List Int) Record) 14 15 ;axiom for equality 16 (assert (forall ((l1 List) (l2 List)) 17 (=> (and 18 (= (size l1) (size l2)) 19 (forall ((j Int)) 20 (=> (and (<= 0 j) (< j (size l1))) 21 (= (recordOf l1 j) (recordOf l2 j))))) 22 (= l1 l2)))) 23 24 ; List in = [r1, r2, r3]; 25 (declare-fun in () List) 26 (assert (= (recordOf in 0) r1)) 27 (assert (= (recordOf in 1) r2)) 28 (assert (= (recordOf in 2) r3)) 29 (assert (= (size in) 3))

Listing 6.7: Yahoo! Pipes Record Implementation
1 ; uninterpreted functions for list and record definition
2 (declare-sort Record 0)
3 (declare-sort String 0)
4
5 ; getter functions for the records
6 (declare-fun getTitle (Record) String)
7 (declare-fun getDescr (Record) String)
8 (declare-fun getAuthor (Record) String)
9 (declare-fun getLink (Record) String)
10 (declare-fun getPubdate (Record) Int)

A list is composed of records, and as shown in Figure 2.2, a record is a map of key-value pairs. In the constraint representation, each of the keys requires a getter function used to set the values for each key. The record data structure, as implemented in SMT-LIB2 format, is shown in Listing 6.7. The getter functions are on lines 6-10 for each key in the map.

6.4.2.2 Program Encoding

In the transformation process, we take advantage of a previous infrastructure [68] to refactor the Pipes programs to a simpler format ((1) in Figure 6.4). These refactorings focus on decreasing the size of the pipes and standardizing them according to the community standards [68]. This is not a necessary step for the encoding process, though it may lead to modest performance gains. The biggest advantage, though we have not evaluated it fully, is that it relieves some burden on Satsy when solving for Pipes programs (Figure 6.1) since the sizes of the programs (and thus the number of constraints) are smaller.

In a Pipes program, the encoder ((2) in Figure 6.4) maps each module and wire onto constraints using inclusion, exclusion, order, and size properties. Inclusion constraints ensure completeness; all relevant records from the input exist in the output. Exclusion constraints ensure precision; all records in the output are relevant. Order constraints ensure that the records are ordered properly in the list. Size constraints ensure that the lengths of the input and output lists for each module make sense (e.g., the lengths are equal for a sort module, but size(out) ≤ size(in) for a filter module). The equality constraints ensure that the output of the source module is equivalent to the input of the destination module.

The specifics of the module encodings are shown in Table A.7, Table A.8, and Table A.9 for the supported constructs, and a sample is provided in this section in

Table 6.2. For each module, the constraint type (inclusion, exclusion, order, size) is shown followed by the constraint definition. When types for the variables are not

stated, we used the following scheme: i and j are integers, s is a string, r, r1, and r2 are records, in and out are lists.

Table 6.2: Sample Yahoo! Pipes Encoding

filter(in, title, equals, s) = out
  inclusion:
    (∀i (((i ≥ 0) ∧ (i < size(in))) →
      (((getTitle(recordOf(in, i)) = s) → hasRec(out, r))
       ∧ ((¬(getTitle(recordOf(in, i)) = s)) → ¬(hasRec(out, r))))))
  exclusion:
    (∀r (hasRec(out, r) → hasRec(in, r)))
  order:
    (∀r1, r2 ((hasRec(out, r1) ∧ hasRec(out, r2)
      ∧ (∃i, j ((i < j) ∧ (recordOf(out, i) = r1) ∧ (recordOf(out, j) = r2))))
      → (∃k, l ((k < l) ∧ (recordOf(in, k) = r1) ∧ (recordOf(in, l) = r2)))))
  size:
    (size(in) ≥ size(out))

To illustrate using the filter module, the inclusion constraint ensures that all records in the input with some property are in the output. In Table 6.2, that property is that the title of every record is equal to some string s. The exclusion constraint ensures that all records in the output come from the input (i.e., no new records were introduced). The order constraint makes sure the ordering of the items is preserved from input to output, and the length constraint ensures that the output list is the same size or shorter than the input list. 101

There is an additional restriction to the grammar in Figure 6.5. Due to the implementation of strings in the encoding, the sort module can be encoded only when the sort operation is performed on the publication date of the items. This is similar to the restriction in the Java encoding where relational operators are valid only on integers.

6.4.2.3 Abstraction

In the Yahoo! Pipes language, we support abstraction on datatypes as described in Section 5.2.3, which is used by Satsy (Figure 6.1). Specifically, we support four levels based on the supported datatypes, integers and strings. These levels are shown in the lattice in Figure 6.6, which is a subset of the lattice presented in Figure 5.3. During encoding weakening (Section 5.2.3.2), abstraction can be applied for strings and integers within the filter module, and on integers within the truncate module. That is, in the filter module encoding shown in Table A.7, a symbolic representation would relax the value of s, allowing the solver to identify a value for s to match the input/output specification. For the truncate module encoding shown in Table A.8, a symbolic encoding would relax the value of n, asserting that n ≥ 0.



    



Figure 6.6: Type Lattice for Yahoo! Pipes

6.4.3 Constraint Storage

Step (3) in Figure 6.4 involves storing the constraints for each program. In this implementation, we support four levels of abstraction. Thus, for each program, we have four different sets of constraints representing the abstraction levels (corresponding to the levels of the lattice in Figure 6.6). Unlike the Java implementation, all inputs and outputs in the Yahoo! Pipes language have the same types, so type matching is not a necessary step in the process. However, the input can contain multiple URLs and thus multiple lists, so matching the number of input lists in LS to the number of URLs in a program is one of the criteria for a program to be a candidate match in Satsy (Figure 6.1). Thus, as part of the constraint storage, the number of potential inputs, corresponding to the number of Fetch Feed modules, is recorded. While the Java implementation uses a database storage structure, for Yahoo! Pipes, the storage is in flat files on a server. Abstraction levels are identified by the file extension, and the number of potential inputs is indicated as a comment at the top of each file.
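As one way to picture this storage scheme, the sketch below reads a constraint file whose extension encodes the abstraction level and whose first line records the number of Fetch Feed inputs. The ".abs" naming and the comment syntax are assumptions made for illustration, not the format actually used.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical reader for the flat-file constraint store: abstraction level taken from
// the file extension, input count from a first-line comment such as "; inputs=2".
class PipeConstraintFile {
    final int abstractionLevel;
    final int numInputs;
    final List<String> constraintLines;

    PipeConstraintFile(int level, int inputs, List<String> lines) {
        this.abstractionLevel = level;
        this.numInputs = inputs;
        this.constraintLines = lines;
    }

    static PipeConstraintFile load(Path file) throws IOException {
        List<String> lines = Files.readAllLines(file);
        String name = file.getFileName().toString();      // e.g., "pipe123.abs2"
        int level = Integer.parseInt(name.substring(name.lastIndexOf(".abs") + 4));
        int inputs = Integer.parseInt(lines.get(0).replaceAll("[^0-9]", ""));
        return new PipeConstraintFile(level, inputs, lines.subList(1, lines.size()));
    }
}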

6.5 Summary

In this section, we have presented the implementation details for Satsy, the semantic search engine, and for two languages, Java and Yahoo! Pipes. By now, it has probably become clear that the adaptation effort of this approach to a new domain is quite high and requires experts who are intimately familiar with a targeted domain's semantics and SMT theories. This overhead could be mitigated, in part, by reusing some of the transformations that are common among libraries and languages (e.g., string manipulation between Yahoo! Pipes and Java), or by approaches that automatically extract invariants from program code. In Chapter 10, we review the state-of-the-art research in this area.

Chapter 7

Java Evaluation

In the evaluation of this approach, we have designed a series of studies to assess and highlight some key aspects of the approach through the two implemented domains: Java and Yahoo! Pipes. This chapter focuses on the Java implementation in Satsy, and we aim to address one general research question, which is broken into three sub-questions:

RQ2: How effective is Satsy at returning relevant source code?

RQ2(a) How do search results from Satsy compare to those found using Google when searching over the same repository of programs?

RQ2(b) How much can existing syntactic search approaches be improved by augmenting results with Satsy?

RQ2(c) How do search results from Satsy compare to those found using Google and those found with the code-specific search engine, Merobase, from the perspective of a programmer?

Toward RQ2 in Java (RQ1 was introduced and discussed in Section 3.1), we conduct two separate studies. The first study, presented in Section 7.1, addresses RQ2(a) and RQ2(b). Effectiveness is measured using the semantics of the first 10 source code snippets returned by a search, computing P @10 based on behavior. That is, we analyze each source code snippet and verify that it behaves as specified by an input/output example. In the second Java study, which addresses RQ2(c) and is presented in Section 7.2, we measure effectiveness by asking human participants if source code returned by a search is relevant to a programming task, computing P @10 based on their feedback. Comparing our search to state-of-the-art and state-of-the-practice searches, as is the goal of RQ2(a) and RQ2(c), is challenging because the query models are heterogeneous across approaches (e.g., keyword, formal specification, input/output example, etc.) and the content of the repositories may vary significantly. To evaluate the effectiveness of the search in Java, we performed a staged assessment with these studies. For each study, we require four parts to assess search effectiveness: a repository, queries to issue, search engines, and oracles to judge correctness of the search results. Table 7.1 presents a broad overview of these parts for each of the Java assessments, and their associated research questions.

Table 7.1: Java Evaluations for RQ2

  RQ       Repository   Queries         Searches                    Oracle
  RQ2(a)   Koders       Stackoverflow   Constrained Google, Satsy   Stackoverflow
  RQ2(b)   Web          Stackoverflow   Google, Satsy               Analysis
  RQ2(c)   GitHub       Humans          Google, Merobase, Satsy     Humans

In the first assessment toward RQ2(a) (Section 7.1.1), the repository was gathered by scraping koders.com. The queries issued against the repository were gathered from questions asked on stackoverflow.com, and two search engines were compared: a Google search engine pointed at our scraped repository, and Satsy pointed at the same repository. Correctness of the search results was judged by analyzing the results and determining 1) whether each result matches the input/output example from the stackoverflow.com questions, and 2) whether a result matches the accepted answer from stackoverflow.com (i.e., the oracle).

In the second assessment toward RQ2(b) (Section 7.1.2), we reuse the queries from RQ2(a). The goal of this study is to evaluate the benefits of using Satsy after a Google search. Thus, the repository is the Web as indexed by Google. Matches were identified as those that were returned by Google and not rejected by Satsy, and were therefore based on the source code behavior.

To address some of the limitations of the assessment regarding RQ2(a) (e.g., larger repository, human assessors of relevance), we performed another assessment toward RQ2(c) (Section 7.2). Here, the repository was built by scraping projects from GitHub, and we used humans to generate queries for three search engines: Google, a code-specific search engine called Merobase, and our search pointed at the newly-scraped repository. Since ultimately this approach is intended to help programmers, correctness of the search results was judged by humans.

7.1 Java Study 1

Comparing two search approaches requires issuing the same query over the same space of potential matches (i.e., repository) so that the results can be compared. In this first Java study, we compare Satsy to a state-of-the-practice syntactic search, Google. Since our encodings are limited, we could not simply index all programs on the web as Google does, but we are able to control the space of potential matches using a local code repository. We then search for code using Satsy and Google, where both search approaches pull matches from the same local repository (RQ2(a)). To complement

Figure 7.1: RQ2(a) Workflow

this study, and address in part the threat that constraining Google is not realistic in practice, we then evaluate how using Satsy after a syntactic search performed using unconstrained Google can reduce the code snippets that must be evaluated by the programmer (RQ2(b)) [70].

7.1.1 RQ2(a): Comparison to State-of-the-Practice

The goal of this experiment is to analyze the search results from Google and from Satsy for the purpose of evaluation with respect to the behavior of the code in the search results. Code behavior is analyzed from the point of view of the researcher

in the context of questions asked on stackoverflow.com that can be answered with results from a local repository, where the queries for the searches were collected directly from the question text. This experiment aims to address RQ2(a): How do search results from Satsy compare to those found using Google when searching over the same repository of programs? We use a local repository that we created and control to compare the relevance of results obtained by Satsy on the local repository against the relevance of the results obtained by a Google search engine pointed to our local repository. This process is

depicted in Figure 7.1. The first step is to collect artifacts for a local repository; all search results returned by Satsy and Google will come from this repository (see (1) in

Figure 7.1). Next, we collect artifacts from stackoverflow.com with queries in two formats, keywords for Google and input/output for Satsy (2). The third step is to search Google given the keyword query (3) and obtain the results (4). Next, we search Satsy with the input/output example (5) and obtain results (6). The last step (not pictured) is to compute metrics and compare the results.

7.1.1.1 Artifacts

There are several artifacts required for this study, as indicated in the design. We discuss the two phases of artifact collection. First, we need to build a repository of programs to search over. Second, we need queries for the searches, which come from

questions asked on stackoverflow.com.

Repository. We built a local repository on July 29, 2012 by issuing syntactic searches on Koders.com [37], a code search engine that indexes open source code repositories containing more than 15 million lines of code. The searches contained each

of the java.lang.String functions supported by our encoding2. We then scraped all lines of Java source code that contained a call to at least one of the supported functions, totaling 5192 lines. We pruned out duplicates, lines that contained functions we do not support, and those that are not assignment or return statements. This left 713 unique snippets of code that form the Java code repository used in this evaluation. These lines of source code were not simply assignment statements, rather, they could

2When these artifacts were scraped, the following Java methods were not supported: equalsIgnoreCase, isEmpty, replace, toUpperCase, toLowerCase, and trim. Additionally, if-statements were not supported.

have multiple, nested method invocations. Table 7.2 summarizes the output types for each snippet in the repository.

Table 7.2: Output Types for Java Repository

  Output Type   Snippets
  String        255
  Int           340
  Boolean        42
  Char           76
  Total         713

Among the 713 snippets, there were a total of 793 calls to Java string methods, with an average of 1.11 method invocations per snippet. Table 7.3 shows the API calls and their frequency of appearance, in order of popularity. For example, indexOf was used most frequently with 206 instances, and endsWith was among the least frequently used with 18 instances. This illustrates the level of complexity and diversity in the snippets.

Table 7.3: Instances of API Calls in Java Repository for Study 1

  indexOf   substring   lastIndexOf   concat   charAt
  206       176         170           88       76

  length    endsWith    contains      startsWith   equals
  35        18          10            10           4

Queries. Satsy and Google use different query models, keyword for Google and input/output for Satsy, so we found a source that would allow us to extract both input/output examples and keywords to perform the searches. Using questions

posted on stackoverflow.com, we use the posting title as the keyword query and the input/output example(s) as the query for Satsy.

Table 7.4: Java Artifacts Specifications for RQ2(a)

Q 1   Just copy a substring in java
      LS: {({“Animal.dog”}, {“Animal”}), ({“World.game”}, {“World”})}
Q 2   extract string including whitespaces within string (java)
      LS: {({“23 14 this is random”}, {“this is random”})}
Q 3   How to get a 1.2 formatted string from String?
      LS: {({“1.500000154”}, {“1.5”})}
Q 4   How to pull out sub-string from string (Android)?
      LS: {({“TextText”}, {“TextText”})}
Q 5   Trim last 4 characters of Object
      LS: {({“Breakfast($10)”}, {“Breakfast”})}
Q 6   Removing a substring between two characters (java)
      LS: {({“I really want ...”}, {“I really want ...”})}
Q 7   Splitting up a string in Java
      LS: {({“i i i block of text”}, {“block of text”})}
Q 8   How to find substring of a string with whitespaces in Java?
      LS: {({“c not in(5,6)”}, {“true”})}
Q 9   Limiting the number of characters in a string, and chopping off the rest
      LS: {({“124891”}, {“1248”}), ({“difference”}, {“diff”}), ({“22.348”}, {“22.3”}), ({“montreal”}, {“mont”})}
Q 10  Trim String in Java while preserve full word
      LS: {({“The quick brown fox jumps”}, {“The quick brown...”})}
Q 11  How to return everything after x characters in a string
      LS: {({“This is a looong string”}, {“is a looong string”})}
Q 12  Slice a string in groovy
      LS: {({“nnYYYYYYnnnnnnn”}, {“YYYYYY”})}
Q 13  How to replace case-insensitive literal substrings in Java
      LS: {({“FooBar”}, {“Bar”}), ({“fooBar”}, {“Bar”})}
Q 14  Removing first character of a string
      LS: {({“Jamaica”}, {“amaica”})}
Q 15  How to find nth occurrence of character in a string?
      LS: {({“/folder1/folder2/folder3/”}, {“folder3”})}
Q 16  Java finding substring
      LS: {({“**tok=zHVVMHy...”}, {“zHVVMHy”})}
Q 17  Finding a string within a string
      LS: {({“...MN=5,DTM=DIS...”}, {“DTM=DISABLED”})}

Of the 67 questions tagged in stackoverflow.com with [java], [string], and

[substring], 40 (60%) contain some form of explicit example. For 17 of those questions, our Java implementation supported encoding the input/output example(s). The remaining 23 involved constructs we do not support, such as regular expressions or arrays. The titles and input/output for the 17 questions are shown in Table 7.4. The Q column identifies the question number, and Title is the title as it appears

in stackoverflow.com. Each of these questions has an input and output example, where the input and the output are each of type string. These are shown in the LS from Q column. For some questions, specifically, Q 1, 9, and 13, multiple examples were provided, and we consider all of those in the search.

7.1.1.2 Metrics

For RQ2(a), to compare the results across the search techniques on the local repository, we use the number of results and P@10, a common IR metric [10] computed here as the number of relevant results among the top 10, divided by 10. A relevant result is one that will behave as specified given an input/output example. In the case of uninitialized variables, the solver was free to find values that would make the snippet work, and in this way our measure of relevance was generous, though not biased toward any particular search approach. We simply allowed incomplete Java code as results.
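As a minimal sketch of the bookkeeping behind P@10 (an illustration, not Satsy code), relevance judgments for the inspected results are counted and divided by 10, even when a search returns fewer than ten results:

// Computes P@10 from per-result relevance judgments; divides by 10 even when
// fewer than ten results were returned.
class PrecisionAt10 {
    static double precisionAt10(boolean[] relevant) {
        int hits = 0;
        for (int i = 0; i < Math.min(10, relevant.length); i++) {
            if (relevant[i]) hits++;   // relevant[i]: result i behaves as specified
        }
        return hits / 10.0;
    }

    public static void main(String[] args) {
        // Four results returned, all relevant: P@10 = 0.4 (cf. Q 1 for Satsy in Table 7.5).
        System.out.println(precisionAt10(new boolean[] {true, true, true, true}));
    }
}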

To illustrate, consider ls1 = ({“Animal.dog”}, {“Animal”}) for Q 1. The following Java source code matches where filename ↦ “Animal.dog” and classname ↦ “Animal”, and thus would be considered a result.3

String classname = filename.substring(0, filename.indexOf('.'));

3Binding of all variable names is required, as illustrated in Section 2.4, which is why the LHS of the assignment statement is mapped to the output from ls1.

This solution has no free variables, and the solver returns sat given the mapping, so it is clear that this is a result. However, consider the following Java source code:

repository = location.substring(0, colon_index);

The mapping of location ↦ “Animal.dog” and repository ↦ “Animal” would result in a match. In this case, however, there is an additional free variable, colon_index. Since the solver can assign this variable a value corresponding to the index of the “.” character, this is also considered a

result. Yet, with mappings only of the components in ls1, this code would not be executable. This is what we mean by generous matching; incomplete code can be identified as a result because the solver is free to assign values as needed. In contrast, the

following Java source code would not be a result for ls1: s = s.concat(d);

No matter what variable bindings are used, the source code behavior does not match the example.
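The role the solver plays for free variables can be approximated by brute force. The sketch below searches for an integer binding of colon_index under which the earlier snippet maps “Animal.dog” to “Animal”, and finds one at index 6; no analogous binding exists for the concat snippet. The enumeration bound and printed labels are assumptions of the sketch, not solver output.

// Brute-force stand-in for the solver's treatment of the free variable colon_index
// in `repository = location.substring(0, colon_index)`.
class GenerousMatchSketch {
    public static void main(String[] args) {
        String location = "Animal.dog";
        String expected = "Animal";
        for (int colonIndex = 0; colonIndex <= location.length(); colonIndex++) {
            if (location.substring(0, colonIndex).equals(expected)) {
                System.out.println("sat with colon_index = " + colonIndex);  // prints 6
                return;
            }
        }
        System.out.println("unsat: no binding makes the snippet match");
    }
}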

7.1.1.3 Implementation

This evaluation was an artifact-only evaluation and required no human input, outside of artifact collection. We ran this experiment on a generic laptop computer. Timing data in Section 7.1.1.4 for encoding was collected while running the encoder in Eclipse;

timing data for solving was collected using the -st flag on the command line with Z3.

7.1.1.4 Results

The results for RQ2(a) are shown in Table 7.5. The Q column matches the specifications shown in Table 7.4. The next sets of columns, Satsy and Google Local, show the number of results and P@10 metrics for the two search approaches. Each row represents one of the 17 questions used for the queries.

Table 7.5: Java Results for RQ2(a)

            Satsy             Google Local
  Q         #      P@10       #      P@10
  1         4      0.4        99     0.1
  2         24     1.0        34     0.3
  3         48     1.0        37     0.2
  4         13     *1.0       100    *0.2
  5         48     1.0        5      0.0
  6         0      0.0        99     0.0
  7         21     1.0        42     0.3
  8         20     1.0        99     0.0
  9         49     *1.0       41     *0.3
  10        0      0.0        40     0.0
  11        23     *1.0       70     0.3
  12        13     *1.0       38     0.2
  13        24     1.0        2      0.0
  14        22     *1.0       38     0.3
  15        13     1.0        42     0.2
  16        14     1.0        2      0.0
  17        13     1.0        34     0.2
  Average   20.5   0.85       48.4   0.15

Key: #: the number of results from the search; P@10: relevant results from the search (* results match Stackoverflow)

Using Satsy, an average of 20.5 matches were found for each query, ranging from zero (in two cases, questions 6 and 10) to 49 results.4 For Q 2 in Table 7.4, for example, given the input “23 14 this is random” and output “this is random”, Satsy finds 24 matches, with P @10 = 1.0. Among these, the following source code matched the specification:

String message = name.substring(6);

4For those questions that have multiple input/output pairs, we ensure that the models returned can satisfy all pairs.

By the design of our semantic search approach implemented in Satsy, all the results are relevant in that they match the input/output specifications, so the P @10 metric will always be 1.0, unless there are fewer than 10 results, in which case it is still computed as the number of relevant results divided by 10. However, as we pointed out in Section 4.2, some matches may be coincidental. Without programmer input, differentiating the coincidental matches from the more relevant matches is not feasible. When using a Google search engine on the local repository, on average, 48.4 matches were found for each query, ranging from two to 100. These results are under the Google Local column in Table 7.5. In Q 2, for example, we see that Google returns 34 results. Checking the top ten results against the input/output specifications reveals that only three of the top ten results were relevant. For example, the following snippet is irrelevant because it grabs the first part of a string, rather than the last part as required in the input/output example:

String string = string.substring(0, end);

On average, P @10 = 0.15, so 1.5 of the results are relevant, with a range from zero to three. Comparing the retrieved results to the solutions proposed and positively voted on

by the stackoverflow.com community shows that Satsy returns results that match the community solutions for five of the 17 searches (Q 4, 9, 11, 12, and 14, each marked with the *), and the Google results matched stackoverflow.com for two searches (Q 4 and 9). A match was determined if all API calls were the same between two snippets.5

5The stackoverflow.com community often proposed solutions that used regular expressions, string tokenizers, and arrays, which are not currently supported by our encoding and thus do not appear in any of our result sets.

To illustrate, Satsy returns a total of 49 snippets for Q 9, including snippets s1 and s2: s1. repository = location.substring(0, colon_index);

s2. String fieldname = line.substring(0, idx);

Google returned 41 results, including snippets s3 and s4: s3. String = string.substring(0, end);

s4. String axispart = mdxquery.substring(mdxquery.indexOf(select), mdxquery.indexOf(from));

Stackoverflow suggests snippet s5 as a result: s5. String.substring(0, maxLength);

While s4 can be instantiated to fit the specification in Q 9, snippets s1, s2, and s3 match the API calls used in s5. In terms of performance, encoding all 713 snippets takes 2.991 seconds (averaged over ten runs), which is approximately 4ms per snippet. Among all the input/output examples in Table 7.4 and all searches, the average solver time to determine sat is 0.0483 seconds and to determine unsat is 0.0051 seconds. However, given that the snippets and the specifications are small, this may represent a best-case scenario. In the presence of larger and more complex programs, and larger and more complex specifications, these performance measures will degrade, as we observe with the Yahoo! Pipes assessment (Section 8.1). What we observed is that, using the same repository, Google returns over twice as many results as Satsy, but among the top ten, Satsy is over five times more effective at returning relevant results. For four of the 17 searches (5, 8, 13, 16), Satsy provides relevant matches when Google finds none, as the syntactic query was not rich enough

to identify relevant results. Yet, this evaluation comes with many limitations, which we aim to address with a follow-up study for RQ2(c) (Section 7.2). Threats to validity are discussed in Section 9.

7.1.2 RQ2(b): Augmenting the State-of-the-Practice

The goal of this experiment is to analyze the search results from Google combined with Satsy. We aim to evaluate the reduced space of source code that must be analyzed from the point of view of the developer in the context of running Satsy over the results from a Google query. The queries were drawn from those asked on

stackoverflow.com that can be answered with results from a local repository (the same queries as in RQ2(a)). This experiment addresses RQ2(b): How much can existing syntactic search approaches be improved by augmenting results with Satsy? Under the hypothesis that semantic and syntactic approaches to code search are complementary, we perform a Google search on the web (i.e., a syntactic approach) and explore how the results could be filtered and improved by also using Satsy. The results from Google form a repository of preliminary matches; with an input/output query, Satsy identifies which results are relevant, which are not, and which might be relevant (i.e., those that Satsy cannot encode). The workflow for this study is presented in Figure 7.2.

The numbers in Figure 7.2 correspond to steps in the workflow. First, we take a question from stackoverflow.com (Artifact) and extract the keyword and input/output queries, precisely as was done for RQ2(a). Second, Google is searched using the keyword query (2); the output is a set of results (3). Source code snippets are extracted from the top 10 search results, and this forms a small repository of preliminary matches (4).

   

   

 

   

Figure 7.2: RQ2(b) Workflow

Using the input/output query and the repository of preliminary matches built from the results in (3), Satsy searches for programs that match (5). The final set of results in (6) contains the source code that either fits the input/output query or that Satsy could not encode because, for example, it contains constructs not supported in our encoding, like arrays. The snippets that Satsy does support and that return unsat for the input/output example are pruned from the search results. The result set in (6) thus represents the reduced set of results for the programmer to evaluate.

7.1.2.1 Artifacts

In this study, the repository of programs is built on-the-fly from the results of a Google query on the Web. The questions used for queries are the same as for RQ2(a), shown in Table 7.4.

7.1.2.2 Metrics

For RQ2(b), we define new metrics, S@10 and S’@10. The metric S@10 represents the number of code snippets returned in the top ten results from a syntactic Google search. We measure the number of snippets since the search results on the web may return pages with multiple snippets of code. To capture S@10, we issue Google queries, then scrape and count the Java code snippets from the top 10 page results. This corresponds to

the results in step (3) of Figure 7.2. In all searches, the stackoverflow.com question from which the keyword query was generated was always the top result; this result was ignored because including it would bias the results toward the question.6 So, results 2-11 from Google were used to generate S@10. The metric S’@10 represents the number of snippets the programmer must evaluate after applying Satsy on top of the Google results. To capture S’@10, we first must encode all the snippets in S@10. Next, we run Satsy using the encoded S@10 snippets as a repository and the example input/output as a specification, and discard snippets that are irrelevant, forming the result set in step (6) of Figure 7.2. This set of Discarded snippets represents those that the programmer does not need to evaluate by hand. S’@10 is calculated as the difference (S@10 − Discarded), representing the reduced space of snippets for the user to evaluate.
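The bookkeeping behind S’@10 is simple and is sketched below; the Verdict enum and method names are assumptions used only to make the computation explicit.

// S'@10 = S@10 - Discarded: only snippets Satsy can encode and that return unsat
// are discarded; sat and not-encodable snippets remain for the programmer.
class AugmentedSearchSketch {
    enum Verdict { SAT, UNSAT, NOT_ENCODABLE }

    static int sPrimeAt10(Verdict[] snippets) {
        int discarded = 0;
        for (Verdict v : snippets) {
            if (v == Verdict.UNSAT) discarded++;
        }
        return snippets.length - discarded;
    }

    public static void main(String[] args) {
        Verdict[] s = { Verdict.UNSAT, Verdict.NOT_ENCODABLE, Verdict.SAT, Verdict.UNSAT };
        System.out.println(sPrimeAt10(s));   // 2 of the 4 snippets remain to be inspected
    }
}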

7.1.2.3 Implementation

As in RQ2(a), this evaluation was artifact-only and required no human input, outside of artifact collection. We ran this experiment on a generic laptop computer.

7.1.2.4 Results

The results for RQ2(b) are shown in Table 7.6. The Q column matches the specifications shown in Table 7.4. The columns under the Google + Satsy label show results for RQ2(b), including metrics S@10, S’@10, and Discarded. The % Reduction column quantifies the difference between S@10 and S’@10 as a percentage decrease.

6This restriction is lifted for the study toward RQ2(c) in Section 7.2.

Table 7.6: Java Results for RQ2(b)

          Google + Satsy
  Q       S@10    Discard   S’@10   % Reduction
  1       25      18        7       72%
  2       17      0         17      0%
  3       0       0         0       –
  4       36      12        *24     33%
  5       3       0         3       0%
  6       37      6         31      16%
  7       16      4         12      25%
  8       38      7         31      18%
  9       0       0         0       –
  10      9       2         7       22%
  11      6       3         3       50%
  12      7       2         *5      29%
  13      29      11        17      38%
  14      0       0         0       –
  15      0       0         0       –
  16      26      14        12      54%
  17      8       7         1       88%
  Avg.    15      5         10      34%

Key: S@10: one-line Java snippets from the top 10 Google pages; Discarded: snippets from S@10 for which Satsy returns unsat; S’@10: the reduced pool of snippets to evaluate (* a sat snippet was found); % Reduction: the reduction in snippets to be evaluated

On average, S@10 = 15 snippets were gathered from the top 10 search results per Google search, with a range from zero to 38 (zero occurred when none of the retrieved pages were in the Java language). The number of Discarded snippets captures the space of results that the programmer would not have to examine if using Satsy to refine the Google results. The programmer must then look only at S’@10 snippets. Overall, the number of snippets returned could be reduced by 34% just by using our semantic search on top of the Google results. In two cases, for questions 4 and 12 (marked with *), at least one snippet returned sat, indicating that the snippet matches the specification and would be a solution. Since Satsy does not support the entire Java language, matches were not as common; for those snippets that we do support, most could be quickly discarded. While Satsy can assist syntactic searches by removing irrelevant results, it should be noted that if a syntactic query misses a possible solution (i.e., a snippet of code that would provide a solution is not in the Google result set), then Satsy would not have the opportunity to evaluate that solution. Here again, the effectiveness of the search is dependent on the programmer's ability to write a query tied to documentation or syntax, a limitation that is addressed when Satsy is used in isolation.

7.2 Java Study 2

Similar to the study presented in Section 7.1.1 (differences italicized), the goal of this experiment is to analyze the search results from Google, a code-specific search engine Merobase, and Satsy for the purpose of evaluation with respect to how relevant source code is to a programming task from the point of view of developers in the

context of questions asked on stackoverflow.com, where the queries for the searches were created by developers. The goal of this experiment is to address RQ2(c): How do search results from Satsy compare to those found using Google and those found with the code-specific search engine, Merobase, from the perspective of a programmer? This second Java study was designed to address four threats to validity of the first Java study in Section 7.1.1. First, the queries used for the first study were

gathered from the titles of questions on stackoverflow.com. While an acceptable first step, these may not be representative of what programmers would actually search for online.

     

   

    

   

Figure 7.3: RQ2(c) Workflow

In this study, we use human generators to write queries that are used for searching. This also allows us to get multiple queries for the same problem to increase the reliability of P @10 measures. Second, in the first study, by constraining Google to a local repository, it could not take advantage of the sophisticated ranking algorithms that have made it so effective. In this study, we lift this restriction and perform searches on unconstrained Google. Third, as Google is not specifically designed to search for source code, we added an additional search, Merobase,7 which has indexed over eight million open source Java components. Fourth, the ultimate judge of the relevance of a piece of source code is a human. While the accepted answers on stackoverflow.com give an indication of what a human finds to be the best answer, there may be a broader idea of relevance that we are not capturing. In this study, we use human participants to assess whether source code found in search results is relevant.

7Merobase.com is a source code search engine that has indexed over eight million components. It allows programmers to search using type signatures of their desired code, which is the feature we explored in this evaluation.

We describe the workflow for this evaluation in Figure 7.3. We begin by obtaining

questions from stackoverflow.com. These questions are given to human generators (1), who create queries for the programming problem in three formats: keyword for Google, method signature for Merobase, and input/output for Satsy. Next, given the keyword query, a Google search is performed (2) and source code snippets are extracted from the results (3). This process is repeated for Merobase given the method signature (4), yielding a set of results (5). The same process is performed with Satsy given the input/output query (6), and results are obtained (7). Once results are obtained for all search approaches given each of several artifacts for each of several query generators, the code snippets are posted to Mechanical Turk where human participants evaluate the relevance of the code snippets to the programming task from

the stackoverflow.com artifact.

7.2.1 Design

Search queries are generated with respect to an information need (we call this a problem), with a format dictated by the search approach that will be used. The problem and the search approach form the two treatments that we manipulate in this experiment, in a full factorial design. To generate the queries used in the searches, we use human query generators. For the purpose of analysis, the query generators are treated as the subjects, as they are assigned to combinations of the treatments. Each query is issued against a search engine to get a set of results. Each of the top 10 results is judged for relevance by a human; the average relevance among the top 10 is the response variable (P@10).

7.2.1.1 Treatment Structure

The treatment structure is full factorial because every level of every factor is combined with every level of every other factor. The factors and their levels are:

  A  Search approach (Google, Merobase, Satsy)
  B  Problem / Programming Task (22 problems)

Ultimately we want to compare the effectiveness of the search approaches (factor A) considering a syntactic search engine (i.e., Google), a code-specific search engine (i.e., Merobase), and a semantic search engine (i.e., Satsy). We carefully selected a variety of problems (factor B) because we want to compare the approaches’ effectiveness under different contexts, under the hypothesis that certain search approaches are better suited for certain types of problems. Section 7.2.2.2 provides full details on the levels of factor B.

7.2.1.2 Experimental Design

We set up the study as a 3x22 factorial design with 12 subjects per group, where factor A is the search approach with three levels and factor B is the programming problem with 22 levels. To control costs, we chose to reduce the size and treat this as a smaller pilot study; a full study is intended as future work. We therefore reduced the size of the deployed study to 3x8 with three subjects per group, which was the final size used for the analysis. In this design, the subjects are the query generators, and we had 12 total subjects. In the reduced study, query generators were randomly assigned to each group until coverage of three subjects per group was reached (72 queries are needed to cover a 3x8 study with three subjects per group, so we used approximately six queries from

each of the query generators8). Relevance of the search results was determined by 30 human participants on Mechanical Turk (different from the query generators). For each query, 10 snippets of source code were gathered. Each snippet was evaluated by a different participant on Mechanical Turk, and each participant evaluated the relevance of 24 total snippets (eight from each level of Factor A, crossed with three from each level of factor B).

7.2.2 Artifacts

There are several artifacts required for this study, as indicated in the design and Figure 7.3. In this section, we discuss four phases of artifact collection. First, we need to build a repository of programs to search over (Search for Satsy, (6) in Figure 7.3). Second, we need problems for which solutions will be searched (Artifact in Figure 7.3). Third, we collect queries for the problems from the query generators ((1) in Figure 7.3). Fourth, we collect code snippets from the results of searches (Results, (3), (5), and (7) in Figure 7.3).

7.2.2.1 Repository

Two of the search approaches, Google and Merobase, already have repositories to search over. For Google, this is all the webpages it has indexed. For Merobase, this is eight million Java components from open source projects. For Satsy, we need to build the repository. We collected a repository of programs by first scraping 2,952 projects tagged as

[java] from GitHub.com, a project hosting website. These projects were scraped on February 3, 2013, and at that time, represented about 10% of the total Java projects accessible through the website. We explored all the *.java files, accounting for 197,473

8Table 7.10 provides details on this.

files with over 700,000 methods. Of the files, 5,506 contained at least one method supported by the Java grammar in Figure 6.3. While we had the opportunity to reuse the repository from the Java study toward RQ2(a), we wanted to capture a more diverse set of behaviors. Unlike the study in Section 7.1.1, here, we considered entire methods rather than single-line programs. We encoded 8,430 methods in total with 9,909 paths among the methods. The average paths per method was only 1.17, indicating that most of the supported methods had only one path. Of the methods, 534 had multiple paths, with an average of 3.8 paths per multi-path method. These numbers do seem low, and we identify two reasons for this. The first is that our encoding is missing some commonly-used constructs, like loops and the value null. We will be able to support more methods as the language coverage grows (Section 11.2.4). The second reason follows from some limitations of the symbolic execution extension to JPF. The string processing within JPF is incomplete, meaning errors are thrown on some multi-path methods that our Java grammar (Figure 6.3) supports, and thus those methods are not encoded and stored. A small repository size does hurt the effectiveness of Satsy, but scraping more code seemed unlikely to increase the size and diversity of the existing repository without first addressing some of the language and implementation limitations (as we explore in Section 7.2.5.4).

7.2.2.2 Programming Tasks

Comparing the three search approaches requires queries in three different formats. To facilitate comparison across the search approaches, we gathered programming tasks from which query generators would write queries in three formats. One requirement was that each question needed to have input/output examples supported by our implementation.

Table 7.7: Dimension 1: Input/output data types

                             Output Type
  Input Type    Char   Boolean      Integer   String
  Char                 20, 28
  Boolean               1           12        30
  Integer               7, 22       23        18, 24, 25, 27, 29
  String        13      5, 21, 26             2, 6, 8, 11, 31

We had an additional goal of type diversity to illustrate the applicability of this approach beyond string processing. To obtain these programming problems, we first explored 3,500 questions on

stackoverflow.com tagged with [java], ordered by popularity, on April 16, 2013. Questions with input/output examples and solutions that could be supported by our implementation were scarce, and we obtained only 18 questions. Second, to increase the diversity of questions with respect to data types, we also looked at the first three pages with tags [java][char], [java][math], [java][boolean], and [java][integer][string] on April 20, 2013. This yielded 13 more questions, for a total of 31. We then re-examined the questions to remove duplicates or very similar questions. The final count was 22 questions (i.e., nine questions were duplicates or very similar, such as “remove the first character of a string” and “remove the last character of a string”, and we do not gain much diversity by having both). Mapping the input and output types of the examples puts the 22 questions into the categories described by the matrix in Table 7.7, which relates the input type of an example to its output type. For example, question 5, “check if one string contains another, case insensitive,” takes as input two strings and returns a boolean (output). It is worth mentioning that many of the categories in Table 7.7 are blank. This is mainly a function of implementation limitations. For example, we are unable to perform relative character

Table 7.8: Programming Tasks for Query Generators

Q 1   Check if two out of three booleans are true.
      Sample LS: ({true, true, false}, {true}), ({false, true, false}, {false})
Q 2   Left-pad integers in Java with zeros, so each resulting string has length 4
      Sample LS: ({“23”}, {“0023”}), ({“123”}, {“0123”})
Q 5   Check if one string contains another, case insensitive
      Sample LS: ({“aBCd”, “cd”}, {true})
Q 6   Capitalize the first letter of a string
      Sample LS: ({“foo”}, {“Foo”})
Q 7   Determine if a number is positive
      Sample LS: ({2}, {true}), ({−4}, {false})
Q 8   Trim the file extension from a file name
      Sample LS: ({“foo.txt”}, {“foo”})
Q 11  Trim the last character from a string
      Sample LS: ({“admirer”}, {“admire”})
Q 12  Convert a boolean to an integer
      Sample LS: ({true}, {1}), ({false}, {0})
Q 13  Turn a string into a char
      Sample LS: ({“c”}, {‘c’})
Q 18  Replace a character at a specific location in a string
      Sample LS: ({“fffoo”, 1, “b”}, {“fbfoo”})
Q 20  Determine if a character is numeric
      Sample LS: ({‘5’}, {true}), ({‘F’}, {false})
Q 21  Check if one string is a rotation of another (a rotation is when the first part of a string is spliced off and tacked onto the end)
      Sample LS: ({“stack”, “cksta”}, {true}), ({“stack”, “stakc”}, {false})
Q 22  Check if two integers sum to a third
      Sample LS: ({2, 5, 3}, {true}), ({2, 5, 8}, {false})
Q 23  Get the next number in a range given a step size (e.g., range [1,10], step = 3, current number = 4. Next = 7)
      Sample LS: ({4, 1, 10, 3}, {7}), ({4, 1, 10, −3}, {1})
Q 24  How do you format the day of the month with the proper suffix, for example “th” for “11th” or “nd” for “2nd”?
      Sample LS: ({11}, {“th”}), ({2}, {“nd”}), ({1}, {“st”})
Q 25  How to append “0” i times at the end of a string?
      Sample LS: ({3, “fab”}, {“fab000”})
Q 26  I want to determine if a string is not all whitespace.
      Sample LS: ({“ 5 ”}, {true}), ({“”}, {false})
Q 27  How to truncate a string that is beyond a certain length with ellipses?
      Sample LS: ({“How about a new car”, 15}, {“How about a ...”})
Q 28  How to tell if a character variable is within a string?
      Sample LS: ({“abc”, ‘c’}, {true}), ({“abd”, ‘c’}, {false})
Q 29  Create a String method to return a value based on an int. Example: return “young” when age is 18, “old” when age is 30.
      Sample LS: ({18}, {“young”}), ({30}, {“old”})
Q 30  Get the string value for a boolean
      Sample LS: ({true}, {“true”})
Q 31  I want to convert a String 35634646 to have the thousand “,” so it should be 35,634,646
      Sample LS: ({“2345678”}, {“2,345,678”}), ({“12345”}, {“12,345”})

manipulations, such as incrementing a character; we do not support converting a character to a string (though the reverse is supported); and comparing characters lexicographically is not supported. This rules out many of the categories involving characters. Additionally, getting the integer value of a string was a frequent question, and such a solution is unsupported by our implementation (unless the conversion uses a series of if-statements on the string values, which would be unlikely to find in existing code). The final set of questions, and sample LS, are shown in Table 7.8. To understand common themes among the questions, we looked at the relationships between the inputs and the outputs, categorizing the questions as shown in Table 7.9. The idea is that search algorithms have different strengths in different classes of problems, and this is a potential classification we wanted to explore. The first category, checking a property of an input, covers questions like Q 1: “check if two booleans are true,” or Q 22: “check if two integers sum to a third.” The second category, manipulating the input, covers questions like Q 6: “capitalize the first letter of a string.” The final category, computing something new based on input, covers type conversion like Q 30: “get the string representation of a boolean.”

Table 7.9: Dimension 2: Input/output Relationships

  Category                                  Questions                      Total
  Checking a property of the input          1, 5, 7, 20, 21, 22, 26, 28    8
  Manipulating the input                    2, 6, 8, 11, 18, 25, 27, 31    8
  Computing something new based on input    12, 13, 23, 24, 29, 30         6

As mentioned in Section 7.2.1, due to the cost of the large study, we reduced the levels of factor B (the problems) from 22 to eight, which seemed like a large enough number that our results would not be tied to the particular problems selected (as we explore in Section 7.2.5). The final set of questions was 5, 6, 7, 8, 11, 13, 20, and 21.

We selected the first eight questions for which Satsy returned at least 10 snippets (the number 10 follows from our use of the P @10 metric)9. This criterion of requiring matches from Satsy was used out of necessity: it is hard to compare other search algorithms to ours when we do not produce any results. However, this may have biased the results toward our analysis. A discussion of the steps we took to combat this threat follows in threats to validity (Section 9.2.1).

7.2.2.3 Query Generation

The query generators were responsible for generating queries in three formats for each of the questions in Table 7.8. The generators were given paper packets with 22 pages; the questions in each packet were randomly ordered using the Fisher-Yates algorithm to control for learning effects. A sample page is shown in Figure 7.5 for Q 1. The question is stated at the top, followed by a box for a descriptive (keyword) query, a type signature for the desired code (for Merobase), and input/output pairs for Satsy. The full instructions given to the query generators are provided in Figure 7.4. Twelve query generators were used in this phase. These generators were graduate students and staff in Computer Science and Engineering at UNL. Each participant was compensated for their time with a $10 gift card to Amazon.com. Since these query generators were not subjects being evaluated, IRB approval was not required for this stage of the research. The end result was that we obtained 12 queries for each of the 22 problems, for each of the three search approaches. While we collected queries for all 22 questions, as discussed, only 8 problems were used in the implementation of the study. Additionally, only three query generators were assigned to each combination of search approach and problem. These query generators were randomly assigned to groups, with the added

9As a point of interest, we investigate the total number of matches by our search in Section 7.2.5.3.

1. Read each problem description
2. Pretend that you are a Java developer who needs to perform the task
3. Instead of starting from scratch, you have chosen to search for an existing solution. Generate search queries in the following formats (examples follow):
   a) Descriptive: using keyword, as you would use when searching Google for a solution.
   b) Signature: using the method name (optional), input types, and output type, as you would use when searching a code-specific search engine like www.merobase.com
   c) Semantic: using example input(s) and output of a method that behaves like the problem description. These are like test cases, and you can use as many input/output pairs as you see fit.
4. In all of the above, limit the data types to String, boolean, char, and int.

You have one hour to complete these tasks.

Figure 7.4: Query Generator Instructions

requirement that all the queries within the groups be distinct to encourage variety in the results. This last requirement was added after some of the type signature and input/output queries from the generators, for some problems, turned out to be identical. Table 7.10 shows the allocation of generators to groups. For example, with Google and problem 5, generators 1, 4, and 6 were assigned. For problem 5 and Merobase, generators 3, 5, and 7 were assigned. For Satsy on problem 5, generators 8, 9, and 10 were assigned.10
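The packet ordering described earlier uses the standard Fisher-Yates shuffle; a minimal sketch follows, with the 22 question identifiers as placeholders.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Standard Fisher-Yates shuffle, as used to randomize the question order in each packet.
class PacketShuffle {
    static <T> void fisherYates(List<T> items, Random rnd) {
        for (int i = items.size() - 1; i > 0; i--) {
            int j = rnd.nextInt(i + 1);        // uniform index in [0, i]
            Collections.swap(items, i, j);
        }
    }

    public static void main(String[] args) {
        List<Integer> questions = new ArrayList<>();
        for (int q = 1; q <= 22; q++) questions.add(q);
        fisherYates(questions, new Random());
        System.out.println(questions);         // one random packet ordering
    }
}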

7.2.2.4 Snippet Retrieval

With the queries created by the generators, the next step was to execute the searches and gather the top 10 code snippets for each search. For each problem and search

10Merobase queries were never gathered from generator 12 because, as that person remarked on their packet, they would never search in this way and therefore did not provide those queries.

Figure 7.5: Query Generator Packet Page for Q1 — the page states the question (“Check if two out of three booleans are true.”) followed by entry boxes for a Descriptive query, a Signature query (return type, method name, parameters), and four Semantic input/output pairs.

Table 7.10: Allocation of query generators to search approach and problem

                       Search Approach
  Problem   Google       Merobase     Satsy
  5         1, 4, 6      3, 5, 7      8, 9, 10
  6         3, 5, 7      7, 8, 9      1, 10, 12
  7         8, 9, 10     5, 7, 11     1, 4, 6
  8         2, 11, 12    1, 4, 6      3, 5, 7
  11        6, 7, 10     1, 3, 10     1, 2, 8
  13        1, 2, 3      4, 8, 11     5, 6, 9
  20        4, 8, 12     5, 9, 11     2, 7, 10
  21        5, 9, 11     6, 7, 10     1, 2, 3

combination, we performed three searches, where the queries were gathered according to the allocation of query generators to groups. That is, if generator 1 is assigned to the Google, Problem 5 group, then we take the keyword query generated by generator 1 for Problem 5 and perform a search. The following describes how the searches were performed and the snippets gathered for the three search approaches, Google, Merobase, and Satsy. Per the study design, our goal was to obtain 10 snippets of source code for each search. In some cases, like with Google, this required visiting more than the top 10 results. A discussion of this threat to validity is in Section 9.

Google: The Google queries were conducted using the browser in incognito mode with cleared cookies and browser history. For each of the top 10 results from a search, we clicked through to the webpage and grabbed the first code snippet and type signature, if available. Often, results would not be provided in Java, but in another language like PHP or C#. If that was the case, the snippet was still collected and provided as a potential result since it may provide an idea for a solution that could be relevant in Java. This addresses,

in part, a threat to validity of the study in Section 7.1.2 where we ignored non-Java results from a Google query. Here, we include these as they may be informative to the programmer. When a webpage had no code snippet, we simply moved to the next one until 10 pages with code snippets had been visited. For some problems and some query generators, we had to visit more than 13 webpages to get enough snippets. With problem 20 and generator 4, there were three pages that returned no code. For problem 21 and query generator 11, there were also three pages that returned no code. However, this may have increased the performance of Google artificially since we are considering snippets outside of the boundaries for the P @10 metric. For 13 of the 24 queries issued to Google, we had to visit at least one additional page to find source code. Among those 13 queries, a total of 22 of the top 10 results (approximately 10% of all webpages in the top 10 visited) had no code snippets. We discuss this threat to validity in Section 9.4.2. When a result would come up that was from the original question asked on

stackoverflow.com, from which the problem was taken, the first snippet (often the community-accepted response) was collected as a result. This is a change from the studies in Section 7.1 where such results were ignored. We did this so as not to bias the results against Google. The end result was 30 code snippets for each problem (10 from each assigned query generator). If there were duplicate snippets, these were retained in the set of 30. This happens when the same webpage appears in the top 10 results from different queries.

Merobase: For each problem and for each of the signature queries, we conducted

a query on Merobase on May 19, 2013. As the type signatures created by the query generators assumed Java, we restricted the results to Java only using a flag in the advanced search options. For each of the top 10 results, we clicked on the files and

copied the method with the matched signature. In the results, if a file contained multiple matches for a signature, it appeared in the results once for each matched signature. The end result was 30 Java methods for each problem. Duplicates were retained.

Satsy: For each problem and each input/output query, we searched Satsy for source code using the repository described previously in this section. We used a ranking algorithm similar to that in Section 5.3.2.1, and fully described in Section 7.2.5.3. Our initial search was constrained such that the type signature of the query/specification

had to be equal to the type signature of the snippets being evaluated (i.e., TS_q = TS_LS was a precondition for a path q to be considered as a potential match and sent to the solver, Section 5.3). If not enough matches were returned at this level, we relaxed the

matching so TS_q ⊃ TS_LS, in order to obtain at least 10 matches. The end result was 30 Java methods for each problem. Duplicates were retained.
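The type-signature precondition can be pictured with the following sketch, which is not Satsy's implementation: types are modeled as plain strings, an exact-match pass is tried first, and a superset pass is used only when too few candidates survive.

import java.util.List;

// Illustration of the candidate filter: exact signature match first (TS_q = TS_LS),
// then a relaxed superset match (TS_q ⊃ TS_LS) if fewer than 10 candidates remain.
class SignatureFilterSketch {
    static boolean exactMatch(List<String> tsSnippet, List<String> tsSpec) {
        return tsSnippet.equals(tsSpec);
    }

    static boolean supersetMatch(List<String> tsSnippet, List<String> tsSpec) {
        return tsSnippet.size() > tsSpec.size() && tsSnippet.containsAll(tsSpec);
    }

    public static void main(String[] args) {
        List<String> spec = List.of("String", "int");            // types in the I/O example
        List<String> snippetA = List.of("String", "int");
        List<String> snippetB = List.of("String", "int", "boolean");
        System.out.println(exactMatch(snippetA, spec));          // true: first pass
        System.out.println(supersetMatch(snippetB, spec));       // true: relaxed pass
    }
}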

7.2.3 Metrics

For RQ2(c), to compare the results across the searches, we use P@10. In this study, relevance means the source code can be easily adapted to solve the problem. A human participant decides whether or not a code snippet is relevant to a problem description. We introduce an additional metric, Q@10, which represents the number of results among the top 10 that solve the problem, divided by 10. In this study, solving means the code seems to work as is, without modification. A human participant decides whether or not a code snippet solves a problem based on its description.

7.2.4 Implementation

After collecting all the code snippets, calculating the metrics involves human participants. We created experimental tasks that presented people with snippets and programming tasks; their responses allowed us to calculate the P@10 and Q@10 metrics. The study was deployed on Amazon's Mechanical Turk (this platform is described in Section 3.1.1).

7.2.4.1 Tasks

Creating experimental tasks for this study involves the composition of basic tasks. A basic task presents a participant with a code snippet and asks if that snippet is relevant to a problem description, and if the snippet solves the problem. The definitions for relevance and solving are included in the basic task, and short-answer responses are required to explain their choices. Figure 7.6 shows a basic task with code snippet and the relevance/solves questions, taken from Q 21. We call a problem description and its three associated basic tasks a problem group. Each problem description presented to a participant is accompanied by three basic tasks, where one basic task is created from each search approach. That is, given a problem description, the participant views three snippets, one from Google, one from Merobase, and one from Satsy, and determines if each is relevant to the problem, and if each solves the problem. The participant is not made aware of the origins of the snippets. Additionally, the participants were instructed to generate an input and output for the problem description; this is used by the researchers to assess participant understanding of the problem description, and is part of the problem group. An example of a problem description and the input/output is shown in Figure 7.7 for Q 11.

Figure 7.6: Basic Task: Code Snippet and Related Questions

Figure 7.7: Problem Group: Problem Description and I/O Example

An experimental task is composed of eight problem groups, with one from each problem description used in this phase of the experiment. To create an experimental task, we perform four randomizations. To describe this process, we start with the largest unit, the experimental task, and work our way down to basic tasks. All randomizations are performed to control for learning effects. Within an experimental task, the ordering of problem groups is randomized. Within each problem group, the ordering of basic task search approaches is randomized. That is, some problem groups show a snippet from Google first, followed by Satsy and then

Merobase. Others show a snippet from Merobase first, followed by a snippet from Google and then from us, and so forth. Within a basic task, the selection of query generator from which the snippet was gathered is random. Within the search results from the query generator, the selection of snippet is random (i.e., it could come from any rank in the results).

7.2.4.2 Deployment

Each experimental task is implemented as a human intelligence task, or HIT, on Mechanical Turk. We created 30 HITs, so each search result from each query generator assigned to a search approach / problem group appeared exactly once.11 Each HIT paid $3.25,12 and each participant could complete one HIT. The study was available from May 28, 2013 until June 04, 2013.

7.2.4.3 Participants

A prerequisite for participating in the study was to pass a qualification test. This included four Java questions to ensure participants are reasonably competent with Java before participating in the study. Appendix C.1 provides the questions. This is also how the IRB informed consent was delivered; this information is provided in Appendix C.2. Scoring at least 50% on the Java questions, being 19 years of age, and accepting the IRB informed consent allowed participants access to the HITs.

11 30 HITs * 8 problems * 3 snippets per problem = 720 snippets. In the study, we had 3 search approaches * 8 problems * 3 generators * 10 search results = 720 snippets.
12 The informed consent in Appendix C.2 was originally written with 22 tasks each paying $0.20, for a total compensation of $4.40. The study was re-designed with one task containing eight problems and paying $3.25. The informed consent remained the same and the change in design was accepted by the IRB.

Recruitment was managed through Mechanical Turk and a post advertising the study on the HITsWorthTurkingFor page of reddit.com. There were 30 Mechanical Turk workers who participated in the experiment.

7.2.5 Results

Toward RQ2(c), we computed P @10 and Q@10 for every generator in every group of search approach and problem. These are the response variables. However, as the P @10 and Q@10 metrics are strongly correlated (Spearman’s r = 0.718138), we concentrate primarily on P @10 for this analysis. We begin by reporting descriptive statistics for P @10 in Table 7.11. In this factorial design, the linear model is as follows, with annotations following:

y_{ijk} = µ + A_j + B_k + AB_{jk} + ε_{ijk}

  j = {Google, Merobase, Satsy}      (the search approaches)
  k = {5, 6, 7, 8, 11, 13, 20, 21}   (the problems)

In Table 7.11, factor A is represented in the columns, and factor B is represented by the rows. Within each cell, the P @10 statistic is reported for each query generator assigned to the group, along with the cell (or group) mean. For example, x_{1,g,5} = 0.700 reports the P @10 value for one query generator assigned to the Google/Problem 5 group. Row means for factor B are included, as are column means for factor A. For example, x̄_{·5} = 0.556 represents the mean P @10 value across all search approaches for problem 5. For the columns, x̄_{g·} = 0.675 represents the mean across all problems for the Google search approach. The grand mean is x̄_{··} = 0.528, and the overall variance is σ² = 0.068. From the descriptive statistics, Google has the highest P @10, followed by Satsy and then Merobase.

Table 7.11: P@10 Response Matrix
(each cell lists the P@10 values of the three query generators, with the cell mean in parentheses)

Problem      Google (Factor A)            Merobase                     Satsy                        Row mean
(Factor B)
5            0.700, 0.800, 0.400 (0.633)  0.500, 0.800, 0.200 (0.500)  0.200, 0.600, 0.800 (0.533)  0.556
6            0.500, 0.700, 1.000 (0.733)  0.200, 0.100, 0.200 (0.167)  0.400, 0.500, 1.000 (0.633)  0.511
7            0.600, 0.500, 0.600 (0.567)  0.200, 0.200, 0.700 (0.367)  0.300, 0.500, 0.400 (0.400)  0.444
8            0.700, 0.900, 0.400 (0.667)  0.200, 0.600, 0.200 (0.333)  0.800, 0.600, 0.600 (0.667)  0.556
11           0.900, 0.900, 0.400 (0.733)  0.300, 0.800, 0.800 (0.633)  0.400, 0.100, 0.900 (0.467)  0.611
13           0.700, 0.400, 0.700 (0.600)  0.600, 0.700, 0.500 (0.600)  0.900, 0.800, 0.800 (0.833)  0.678
20           0.400, 0.800, 0.800 (0.667)  0.200, 0.400, 0.200 (0.267)  0.400, 0.400, 0.600 (0.467)  0.467
21           0.700, 0.900, 0.800 (0.800)  0.100, 0.200, 0.100 (0.133)  0.100, 0.300, 0.400 (0.267)  0.400
Column mean  0.675                        0.375                        0.533                        0.528 (grand mean)
Variance     0.036                        0.062                        0.065                        0.068

Though not reported in a table, for the Q@10 metric the ranking of the search approaches is the same: Google scored the highest with Q@10 = 0.338, followed by Satsy at Q@10 = 0.233 and Merobase at Q@10 = 0.075.
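To make the response variables concrete, the following minimal Java sketch (our illustration, not part of Satsy or the study scripts; the class and method names are hypothetical) computes P@10 from the ten relevance judgments collected for one query and the mean of a search-approach/problem cell:

import java.util.List;

public class ResponseVariables {

    // P@10: the fraction of the ten returned snippets judged relevant.
    static double precisionAt10(List<Boolean> judgments) {
        long relevant = judgments.stream().filter(j -> j).count();
        return relevant / 10.0;
    }

    // Cell (group) mean over the three query generators assigned to a group.
    static double cellMean(double x1, double x2, double x3) {
        return (x1 + x2 + x3) / 3.0;
    }

    public static void main(String[] args) {
        // The Google/Problem 5 cell of Table 7.11: 0.700, 0.800, 0.400.
        System.out.printf("cell mean = %.3f%n", cellMean(0.700, 0.800, 0.400)); // 0.633
    }
}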

For problem 21 and Google, x̄_{g,21} = 0.800. For all three query generators, the first search response was the stackoverflow.com question from which the task was generated, which may have contributed to the high P@10 value. On the other hand, the lowest relevance scores for Merobase and Satsy correspond to problem 21. For problem 11 and Merobase, x̄_{m,11} = 0.633. This problem was also Google's second-highest, and for Satsy it ranked in the middle. While relevance was high, Q@10 for Merobase was still low at 0.100. Satsy performed best on problem 13, and this was the best overall performance across all search approaches and all programming problems, with x̄_{s,13} = 0.833. The highest Q@10 metrics over all questions and search approaches come from Satsy on problems 6 and 8, where Q@10 = 0.500 for both.

To understand why differences in the means were observed for the three search approaches, we use a two-factor ANOVA.13 Table 7.12 presents the ANOVA with P@10 as the dependent variable, considering Factor A (Search), Factor B (Problem), and their interaction (Search:Problem).14 The null hypothesis is H0: µg = µm = µs, where µg is the mean for Google, µm is the mean for Merobase, and µs is the mean for Satsy. The F-ratio is significant only for the Search factor at α = 0.001, indicating that the observed differences in means among the search approaches are not likely due to chance.

13: The design of the experimental set-up is a 3x22 factorial, though several levels of randomization were required to deploy this experiment on Mechanical Turk. This could have been considered a split-plot design where the 3x8 matrix is the whole-plot, split among the 30 Mechanical Turk participants. However, due to lack of replication in that experimental design, we chose the simpler and more conservative analysis of a 3x22 factorial design, perhaps at the cost of power that could have resulted from blocking on the Mechanical Turk participant.
14: We analyze the study as a full factorial in the R statistical package. The details of the R analysis are found in Appendix C.3.

Table 7.12: Dependent Variable: P@10, test of between-subjects effects

Source          df   Sum Sq   Mean Sq   F-Ratio   Pr(>F)        Sig
Search           2   1.08111  0.54056   11.6527   7.496x10^-5   ***
Problem          7   0.52444  0.07492    1.6151   0.1540
Search:Problem  14   0.99222  0.07087    1.5278   0.1373
Residuals       48   2.22667  0.04639

Signif. codes: *** α = 0.001, ** α = 0.01, * α = 0.05
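As a quick check on the table entries (our restatement of the standard computation, not an additional analysis), the F-ratio for the Search factor is the ratio of its mean square to the residual mean square:

\[ F_{\mathrm{Search}} = \frac{MS_{\mathrm{Search}}}{MS_{\mathrm{Residuals}}} = \frac{0.54056}{0.04639} \approx 11.65 \]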

We are specifically interested in the performance of Satsy compared to the other search approaches, with the one-sided null hypothesis H0: µg ≥ µs and alternative Ha: µg < µs for Google; replacing µg with µm forms the hypotheses for Merobase. Since we do not assume normality of the data, we use the Mann-Whitney test; the results of this analysis are shown in Table 7.13. We reject the null hypothesis at α = 0.05 for Merobase, but we do not reject the null hypothesis for Google. So, we can say that Satsy out-performs Merobase, but not Google.

Table 7.13: Test of Means among Search Algorithms

Response   Null hypothesis   p-value       Test
P@10       H0: µg ≥ µs       p = 0.9774    Mann-Whitney
P@10       H0: µm ≥ µs       p = 0.0163*   Mann-Whitney

Signif. codes: *** α = 0.001, ** α = 0.01, * α = 0.05

It seems rather clear that Google outperforms the other search approaches in terms of relevance, but it is unclear why this is the case. We have several conjectures that we plan to explore in future work, some of which we touch on in the sections that follow. The query generators were graduate students in computer science who were accustomed to writing Google queries, so it is unclear whether Google's good performance was a function of excellent queries. Along the same vein, the problem descriptions were taken from stackoverflow, so the queries from the generators may have been tightly bound to those descriptions. This would artificially increase the performance of Google. As discussed as part of the snippet retrieval for Google, we bypassed 22 webpages among the top 10 results for the 24 queries (approximately 10% of the results; see Section 7.2.2.4). This may have had an impact on the relevance for Google, since we assisted its performance by finding source code beyond the top 10 results. Additionally, for all three of the queries for question 21, the top result was the stackoverflow.com webpage. Other factors that could have brought down the performance of Satsy include the richness of the repository, our chosen ranking algorithm, and the expressiveness of the input/output format. These are just some of the future directions to explore in seeking to improve this work.

7.2.5.1 On Types of Problem

While Factor B (problem) does not account for the differences in means according to Table 7.12, the search approach performance may be dependent on the problem type. Only one of the problems used in the study, problem 13, fell under the 'computing something new' category in Table 7.9, but this was the highest-performing problem for Satsy. The next highest performances came from questions 6 and 8, both of which involve manipulating the input. Our search may be particularly well suited for those types of problems. Further study is needed here, with more coverage of the 'computing something new' category, to compare performance across problem types.

7.2.5.2 MTurk participants

One concern we had about the reliability of the results related to fatigue on the part of the Mechanical Turk participants. The average completion time for the experimental tasks was 54 minutes and 40 seconds, so fatigue was a concern, presenting an internal threat to validity (see Section 9.1). As a cursory check, we looked at the correlation between when a snippet appeared in an experimental task (i.e., its order) and the participants' likelihood of answering 'yes' to the relevance question. There was very low correlation between the order and the frequency of 'yes' responses (Spearman's r = 0.0544). This is an indication that the participants likely did not simply get lazy and start marking 'no' toward the end. Using short-answer questions in each of the basic tasks (Figure 7.6) was intended to combat this as well.

7.2.5.3 On the Impact of Ranking on Results

While commercial search tools like Google and Merobase have the opportunity to observe user behavior and refine their ranking algorithms, the ranking used by Satsy was less refined and based entirely on conjecture. In the ranking, we grouped the matched snippets according to the strength of the match, using a modified version of the ranking presented in Section 5.3.2.1. Here, we discuss the relevance of search results at each rank for each search approach. Then, we discuss in further detail the ranking used in Satsy and how that ranking could be improved in the future based on the Mechanical Turk participants' relevance responses.

Looking at the relevance among items at various ranks allows us to compare the Satsy ranking algorithm to the others. Table 7.14 shows, for each search approach, the average relevance of snippets that appeared at rank 1, rank 2, and so forth down to rank 10. For all ranks except 7 and 10, the relevance of the Google results was the best; at rank 7, Satsy's relevance was higher, and at rank 10, they were tied. We would expect the relevance at the ranks to follow a decreasing function. Fitting a linear model to the Google relevance yields a slope coefficient of -0.0202. For Merobase and Satsy, on the other hand, the slopes are much flatter, at -0.0085 and -0.0009, respectively. This leads us to conjecture that the ranking algorithms may have an impact on the overall relevance of the results.

For the ranking algorithm used in Satsy, Table 7.15 presents the bucket approach that was used to assign results to ranks. Each bucket number is shown in parentheses in a cell of the table. The columns represent the amount of LS that is satisfied by a particular program P, and the rows represent the amount of the program P that satisfies LS. The first column indicates that exactly one ls ∈ LS is matched, the second column that some proper subset LS_sat ⊂ LS with |LS_sat| > 1 is matched, and the final column that all ls ∈ LS are satisfied. The first row indicates that a subset of the paths in P are matched, whereas the second row indicates that all the paths in P are matched by LS_sat.

Within the cells of the last two columns, there is a saturation criterion that further divides each bucket. Consider, for example, when Q_sat = Q_P and LS_sat = LS both hold. We could have the case of bucket (1), when every ls ∈ LS exercises a distinct path in P, or, if there is a path in P that is covered by multiple ls ∈ LS, the program fits in bucket (2). That is, the paths for a program in bucket (2) are more saturated than for a program in bucket (1). For the ranking algorithm used by Satsy in this study, we selected a snippet from each bucket in round-robin fashion, using the following total order on the buckets: 2, 1, 4, 3, 6, 5, 8, 7, 9, 10. The idea is that a program that is more saturated better 'fits' a specification, so, for example, bucket (2) appears before bucket (1), though we note that the bottom row of Table 7.15 favors single-path programs.
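As one possible reading of this round-robin selection (a sketch for illustration only, not the Satsy implementation; snippets are assumed to have already been classified into buckets), the following Java code fills the top-k slots by repeatedly cycling through the buckets in the total order given above and taking one snippet from each non-empty bucket per pass:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class RoundRobinRanking {

    // Total order over the buckets of Table 7.15, as used in this study.
    private static final int[] BUCKET_ORDER = {2, 1, 4, 3, 6, 5, 8, 7, 9, 10};

    // Selects up to k snippets; note that this consumes the bucket lists.
    static List<String> selectTopK(Map<Integer, List<String>> snippetsByBucket, int k) {
        List<String> selected = new ArrayList<>();
        boolean progress = true;
        while (selected.size() < k && progress) {
            progress = false;
            for (int bucket : BUCKET_ORDER) {
                List<String> bucketSnippets = snippetsByBucket.get(bucket);
                if (bucketSnippets != null && !bucketSnippets.isEmpty()) {
                    selected.add(bucketSnippets.remove(0));
                    progress = true;
                    if (selected.size() == k) {
                        break;
                    }
                }
            }
        }
        return selected;
    }
}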

Table 7.14: Average Relevance per Rank for Search Approaches

Rank      1      2      3      4      5      6      7      8      9      10
Google    0.792  0.792  0.708  0.667  0.583  0.792  0.583  0.542  0.708  0.583
Merobase  0.333  0.458  0.417  0.417  0.333  0.333  0.542  0.250  0.375  0.292
Satsy     0.708  0.458  0.583  0.375  0.375  0.625  0.625  0.458  0.542  0.583

Table 7.15: Path Matching Ranking Buckets

                |LS_sat| = 1   1 < |LS_sat| < |LS|        LS_sat = LS
Q_sat ⊂ Q_P     (10)           (7) |LS_sat| ≤ |Q_sat|     (3) |LS| ≤ |Q_sat|
                               (8) |LS_sat| > |Q_sat|     (4) |LS| > |Q_sat|
Q_sat = Q_P     (9)            (5) |LS_sat| ≤ |Q_P|       (1) |LS| ≤ |Q_P|
                               (6) |LS_sat| > |Q_P|       (2) |LS| > |Q_P|

This round-robin approach was intended to facilitate a post-analysis of the relevance among the buckets. In the study, we used three query generators to write queries for each of eight problems. For each query, Table 7.16 shows the total results from the repository, classified by the buckets in Table 7.15, considering a tight matching on the type signatures (TS_q = TS_LS for all matches, as described in Section 5.3). On average across all queries, 431 snippets were returned. For seven of the 24 queries, not enough results were returned to fill the 10 spots required for the analysis, so another search was conducted with a relaxed type signature matching, TS_q ⊇ TS_LS. Those results are shown in Table 7.17. After attempting to fill the top 10 slots with the round-robin approach on the results from the more constrained search, the same approach was taken with the relaxed search. The fewest matches came from the first query generator assigned to problem 21: the constrained search returned zero matches and the relaxed search returned only 12. On the other hand, the third query for problem 8 returned 1,772 results in the constrained search and 2,067 in the relaxed search. For each of the 24 queries, we selected 10 results among the buckets, yielding 240 total snippets from Satsy that were evaluated by Mechanical Turk participants. Table 7.18 shows how these selected snippets fell among the buckets, along with the average relevance among the snippets in each bucket. If a relaxed search was required (Section 7.2.2.4), we indicate this with an 'r' in the bucket name and compute the relevance separately.

Table 7.16: Results From Repository by Bucket

P    QG           1     2      3     4     5     6      7     8     9       10     Sum
5    x_{1,s,5}    1     0      1     0     0     0      0     0     0       0      2
     x_{2,s,5}    0     154    0     0     2     18     0     1     82      5      262
     x_{3,s,5}    0     152    0     0     0     7      0     1     93      6      259
     x̄_{s,5}      0.50  77.00  1.00  1.00  1.75  7.75   1.75  2.50  46.00   5.30   144.50
6    x_{1,s,6}    0     0      0     0     0     13     0     0     1672    53     1738
     x_{2,s,6}    0     13     0     0     0     0      0     0     1687    66     1766
     x_{3,s,6}    0     0      0     0     0     0      0     0     14      0      14
     x̄_{s,6}      0.00  4.33   0.00  0.00  0.00  4.33   0.00  0.00  1124.33 39.67  1172.67
7    x_{1,s,7}    2     145    7     0     0     0      0     0     291     75     520
     x_{2,s,7}    0     103    0     0     0     217    4     61    138     14     537
     x_{3,s,7}    0     5      0     0     4     445    39    29    19      9      550
     x̄_{s,7}      0.67  84.33  2.33  0.00  1.33  220.67 14.33 30.00 149.33  32.67  535.67
8    x_{1,s,8}    0     1      0     6     0     0      0     0     1       0      8
     x_{2,s,8}    0     5      0     0     0     1      0     1     1689    61     1757
     x_{3,s,8}    0     0      0     0     5     0      1     0     1696    70     1772
     x̄_{s,8}      0.00  2.00   0.00  2.00  1.67  0.33   0.33  0.33  1128.67 43.67  1179.00
11   x_{1,s,11}   0     3      0     2     0     0      0     0     0       0      5
     x_{2,s,11}   1     0      0     0     0     0      0     0     15      13     29
     x_{3,s,11}   0     1      0     0     0     3      0     1     15      12     32
     x̄_{s,11}     0.33  1.33   0.00  0.67  0.00  1.00   0.00  0.33  10.00   8.33   22.00
13   x_{1,s,13}   0     7      0     1     0     0      0     0     0       0      8
     x_{2,s,13}   0     7      0     1     0     0      0     0     0       0      8
     x_{3,s,13}   7     0      1     0     0     0      0     0     0       0      8
     x̄_{s,13}     2.33  4.67   0.33  0.67  0.00  0.00   0.00  0.00  0.00    0.00   8.00
20   x_{1,s,20}   0     0      0     0     0     1      1     0     13      27     42
     x_{2,s,20}   0     0      0     1     0     11     0     27    3       0      42
     x_{3,s,20}   0     0      0     1     0     11     0     25    3       2      42
     x̄_{s,20}     0.00  0.00   0.00  0.67  0.00  7.67   0.33  17.33 6.33    9.67   42.00
21   x_{1,s,21}   0     0      0     0     0     0      0     0     0       0      0
     x_{2,s,21}   0     1      0     0     2     264    0     4     0       0      271
     x_{3,s,21}   0     1      1     0     0     0      0     0     253     7      262
     x̄_{s,21}     0.00  0.67   0.33  0.00  0.67  88.00  0.00  1.33  84.33   2.33   177.67
All  x̄_{s,·}      0.46  25.00  0.50  0.67  0.75  41.54  2.17  6.58  320.54  17.92  431.83

A total of 40 snippets were collected from relaxed searches, and these are split among relaxed buckets (1r), (2r), (3r), and (4r). Consider, for example, bucket (9): of the 240 snippets presented to Mechanical Turk participants, 57 were classified as bucket (9) matches, and the average relevance among these snippets was 0.456. Referring back to Table 7.11, the overall P@10 for Satsy was 0.533. As shown in Table 7.18, six buckets have a higher average relevance than this P@10; in order from highest to lowest relevance, they are (4), (2), (4r), (6), (5), and (1) (in fact, the first three are higher than the overall relevance for Google at 0.675).

Table 7.17: Results From Repository by Bucket, Relaxed

P    QG           1r    2r     3r    4r    5r    6r     7r    8r    9r      10r    Sum
5    x_{1,s,5}    15    0      2     0     0     0      0     0     0       0      17
     x_{2,s,5}    0     168    0     0     2     23     0     1     84      5      283
     x_{3,s,5}    0     168    0     0     0     12     0     1     95      7      283
     x̄_{s,5}      5.00  112.00 0.67  0.00  0.67  11.67  0.00  0.67  59.67   4.00   194.33
6    x_{1,s,6}    2     185    0     2     0     16     0     0     1768    59     2032
     x_{2,s,6}    3     198    0     1     0     2      0     0     1782    73     2059
     x_{3,s,6}    1     182    0     3     0     3      0     0     39      1      229
     x̄_{s,6}      2.00  188.33 0.00  2.00  0.00  7.00   0.00  0.00  1196.33 44.33  1440.00
7    x_{1,s,7}    52    294    11    0     0     0      0     0     337     71     765
     x_{2,s,7}    4     298    0     4     8     255    5     62    138     8      782
     x_{3,s,7}    4     200    0     5     10    481    41    33    19      9      802
     x̄_{s,7}      20.00 264.00 3.67  3.00  6.00  245.33 15.33 31.67 164.67  29.33  783.00
8    x_{1,s,8}    8     198    0     12    0     0      0     0     3       0      221
     x_{2,s,8}    3     208    0     3     0     3      0     1     1767    66     2051
     x_{3,s,8}    3     175    1     1     7     36     3     0     1768    73     2067
     x̄_{s,8}      4.67  193.67 0.33  5.33  2.33  13.00  1.00  0.33  1179.33 46.33  1446.33
11   x_{1,s,11}   8     201    0     8     0     0      0     0     1       0      218
     x_{2,s,11}   10    168    1     2     0     0      0     0     56      17     254
     x_{3,s,11}   1     177    0     3     0     34     0     3     27      14     259
     x̄_{s,11}     6.33  182.00 0.33  4.33  0.00  11.33  0.00  1.00  28.00   10.33  243.67
13   x_{1,s,13}   0     42     0     2     0     0      0     0     0       0      44
     x_{2,s,13}   0     42     0     2     0     0      0     0     0       0      44
     x_{3,s,13}   42    0      2     0     0     0      0     0     0       0      44
     x̄_{s,13}     14.00 28.00  0.67  1.33  0.00  0.00   0.00  0.00  0.00    0.00   44.00
20   x_{1,s,20}   0     13     0     1     0     1      1     0     13      27     56
     x_{2,s,20}   0     13     0     2     0     11     0     27    3       0      56
     x_{3,s,20}   0     13     0     2     0     11     1     24    3       2      56
     x̄_{s,20}     0.00  13.00  0.00  1.67  0.00  7.67   0.67  17.00 6.33    9.67   56.00
21   x_{1,s,21}   0     12     0     0     0     0      0     0     0       0      12
     x_{2,s,21}   0     15     0     0     2     271    0     4     0       0      292
     x_{3,s,21}   0     15     1     0     0     0      0     0     260     7      283
     x̄_{s,21}     0.00  14.00  0.33  0.00  0.67  90.33  0.00  1.33  86.67   2.33   195.67
All  x̄_{s,·}      6.13  129.78 0.69  2.30  1.26  50.39  2.22  6.78  354.91  19.09  573.57

When it comes to matching the entire specification (last column), it would seem that saturation is preferred, where some ls ∈ LS actually cover the same path in P; this observation follows from the fact that the relevance for bucket (2) is greater than for bucket (1), though this might be a function of sample size. We conjecture that if Satsy had access to the information presented in Table 7.18, we could have increased the P@10 by using more programs from the more relevant buckets. To test this hypothesis, we re-computed the top 10 results for the Satsy queries using a ranking algorithm based on Table 7.18.

Table 7.18: Path Matching Bucket Counts and Relevance
(each cell lists: bucket - number of snippets - average relevance)

                |LS_sat| = 1        1 < |LS_sat| < |LS|    LS_sat = LS
Q_sat ⊂ Q_P     (10) - 34 - 0.500   (7) - 4 - 0.500        (3) - 0 - 0.000, (3r) - 3 - 0.333
                                    (8) - 18 - 0.444       (4) - 5 - 1.000, (4r) - 3 - 0.667
Q_sat = Q_P     (9) - 57 - 0.456    (5) - 9 - 0.556        (1) - 9 - 0.556, (1r) - 9 - 0.444
                                    (6) - 28 - 0.571       (2) - 36 - 0.889, (2r) - 25 - 0.200

Instead of round-robin, we used a greedy approach and took as many results as possible from each bucket, starting from the most relevant. That is, for each query, we took all results from bucket (4), followed by bucket (2), then (4r), and so forth until 10 results were obtained. Next, we computed the hypothetical relevance of each result based on its bucket: for example, a result from bucket (2) was assigned a relevance of 0.889, and a result from bucket (2r) was assigned a relevance of 0.200. Taking the average of these relevances forms a hypothetical P@10. We show the hypothetical changes in relevance in Table 7.19, alongside the response variables for Satsy from Table 7.11. The new column, Hypothetical Satsy, shows the hypothetical P@10 values for each of the queries. In most cases, the hypothetical P@10 was higher than the P@10 observed in the study, yet for seven of the 24 queries the original search results had a higher P@10. This was particularly noticeable for Problem 13, the only problem for which the relevance decreased under the new ranking, from x̄_{s,13} = 0.833 to x̄'_{s,13} = 0.660. This was an interesting case: many of the results came from bucket (1), and this was the only problem for which Mechanical Turk participants rated snippets from bucket (1) as relevant; for every other query with responses from bucket (1), those snippets were marked as irrelevant. Thus, the average relevance for bucket (1) was relatively low at 0.556, which brought down the average relevance for this problem.
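A minimal Java sketch of this greedy re-ranking and the resulting hypothetical P@10 (our illustration, not the analysis scripts; the bucket order beyond (4), (2), (4r) is filled in from the average relevances in Table 7.18, with ties broken arbitrarily):

import java.util.LinkedHashMap;
import java.util.Map;

public class HypotheticalP10 {

    // Average relevance per bucket from Table 7.18, in decreasing order.
    private static final Map<String, Double> BUCKET_RELEVANCE = new LinkedHashMap<>();
    static {
        BUCKET_RELEVANCE.put("4", 1.000);  BUCKET_RELEVANCE.put("2", 0.889);
        BUCKET_RELEVANCE.put("4r", 0.667); BUCKET_RELEVANCE.put("6", 0.571);
        BUCKET_RELEVANCE.put("5", 0.556);  BUCKET_RELEVANCE.put("1", 0.556);
        BUCKET_RELEVANCE.put("7", 0.500);  BUCKET_RELEVANCE.put("10", 0.500);
        BUCKET_RELEVANCE.put("9", 0.456);  BUCKET_RELEVANCE.put("8", 0.444);
        BUCKET_RELEVANCE.put("1r", 0.444); BUCKET_RELEVANCE.put("3r", 0.333);
        BUCKET_RELEVANCE.put("2r", 0.200); BUCKET_RELEVANCE.put("3", 0.000);
    }

    // bucketCounts maps each bucket to the number of results a query has in it
    // (constrained and relaxed buckets combined, as in Tables 7.16 and 7.17).
    static double hypotheticalP10(Map<String, Integer> bucketCounts) {
        double totalRelevance = 0.0;
        int taken = 0;
        for (Map.Entry<String, Double> bucket : BUCKET_RELEVANCE.entrySet()) {
            int available = bucketCounts.getOrDefault(bucket.getKey(), 0);
            while (available > 0 && taken < 10) {
                totalRelevance += bucket.getValue(); // each result gets its bucket's relevance
                available--;
                taken++;
            }
            if (taken == 10) {
                break;
            }
        }
        // Average over the filled slots (every study query filled all 10 slots).
        return taken == 0 ? 0.0 : totalRelevance / taken;
    }
}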

Table 7.19: Hypothetical P@10 Response Matrix for Satsy Under New Ranking

Problem   Query        Satsy   Hypothetical Satsy
5         x_{1,s,5}    0.200   0.456
          x_{2,s,5}    0.600   0.889
          x_{3,s,5}    0.800   0.889
          mean         0.533   0.744
6         x_{1,s,6}    0.400   0.590
          x_{2,s,6}    0.500   0.889
          x_{3,s,6}    1.000   0.550
          mean         0.633   0.676
7         x_{1,s,7}    0.300   0.889
          x_{2,s,7}    0.500   0.889
          x_{3,s,7}    0.400   0.778
          mean         0.400   0.852
8         x_{1,s,8}    0.800   0.889
          x_{2,s,8}    0.600   0.713
          x_{3,s,8}    0.600   0.544
          mean         0.667   0.715
11        x_{1,s,11}   0.400   0.767
          x_{2,s,11}   0.100   0.498
          x_{3,s,11}   0.900   0.571
          mean         0.467   0.612
13        x_{1,s,13}   0.900   0.729
          x_{2,s,13}   0.800   0.729
          x_{3,s,13}   0.800   0.522
          mean         0.833   0.660
20        x_{1,s,20}   0.400   0.524
          x_{2,s,20}   0.400   0.581
          x_{3,s,20}   0.600   0.581
          mean         0.467   0.562
21        x_{1,s,21}   0.100   0.200
          x_{2,s,21}   0.300   0.571
          x_{3,s,21}   0.400   0.530
          mean         0.267   0.434
Overall   mean         0.533   0.657

Overall, had the ranking approach been changed based on Table 7.18 and had the participant responses remained the same (according to bucket), the overall P@10 for Satsy would increase from 0.533 to 0.656, which is nearly as high as the P@10 for Google. Thus, with smarter ranking, our search approach would likely be very competitive with Google.

7.2.5.4 On the Influence of Repository Size on Results

One reason Satsy may not have outperformed Google is the limited size of our repository. To evaluate this conjecture, we artificially manipulated the size of our repository and observed the impact on the number of matches found for search queries. We conducted 24 searches, corresponding to the 24 queries used with Satsy. The independent variable was the size of the repository being used in the search, as a percentage of the full Java repository. The dependent variable is the percentage of matches Satsy finds in the reduced repository over the total matches in the full repository, for each query. We conducted this experiment with a step size of 0.01, hence exploring 100 different repository sizes. For each query and repository size, we computed a new repository five times and took the average of the matches found in each repository. The results are shown in Figure 7.8. Each query is represented as a line, averaged over the five replications. Using the averages across all the queries, we obtain a linear model, y = ax + b, with a slope coefficient of a = 1.0043 and an intercept of b = -0.0027. It is not surprising to find that the number of matches grows proportionally to the repository size, given the small size of the repository.
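A sketch of the subsampling procedure just described (for illustration only; the Search interface stands in for issuing one of the 24 input/output queries against an encoded repository and is not an actual Satsy API):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class RepositorySizeExperiment {

    interface Search {
        // Returns the number of snippets in 'repository' that match 'query'.
        int search(String query, List<String> repository);
    }

    // Returns (fraction of repository, average fraction of full-repository matches found).
    static List<double[]> run(Search satsy, String query, List<String> fullRepository) {
        int fullMatches = satsy.search(query, fullRepository);
        if (fullMatches == 0) {
            throw new IllegalArgumentException("query has no matches in the full repository");
        }
        List<double[]> curve = new ArrayList<>();
        for (int step = 1; step <= 100; step++) {       // step size 0.01
            double fraction = step / 100.0;
            double sum = 0.0;
            for (int rep = 0; rep < 5; rep++) {          // five replications per size
                List<String> sample = new ArrayList<>(fullRepository);
                Collections.shuffle(sample);             // new random repository each time
                sample = sample.subList(0, (int) Math.round(fraction * sample.size()));
                sum += satsy.search(query, sample) / (double) fullMatches;
            }
            curve.add(new double[] {fraction, sum / 5.0});
        }
        return curve;
    }
}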

[Plot: Percent of Matches (y-axis) versus Percent of Database (x-axis), one line per query.]

Figure 7.8: Impact of Changes in Repository Size on Matches, Step size = 0.01, Reps=5

There are two reasons this might be true. The first is that our repository has not begun to saturate the problem domain; the other is that the problems are so basic, or our matching criteria so loose, that there will always be source code that can be identified as a result. What this analysis does not capture is the diversity of the matches. That is, we could have two results with the exact same bytecodes but different naming conventions, and these are marked as two separate matches. If we first cluster the programs based on bytecodes and only report one match per cluster, our conjecture is that the graph would flatten out quickly since "duplicates" would be ignored. Further study is needed here.
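A minimal sketch of that clustering idea (ours, for illustration; how the bytecode of a matched snippet is obtained is outside its scope): matches with identical bytecode sequences are collapsed to a single representative before being counted.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BytecodeDedup {

    static class Snippet {
        final String id;
        final byte[] bytecode; // hypothetical: the compiled form of the matched snippet
        Snippet(String id, byte[] bytecode) { this.id = id; this.bytecode = bytecode; }
    }

    // Keeps the first match seen for each distinct bytecode sequence.
    static List<Snippet> dedup(List<Snippet> matches) {
        Map<String, Snippet> byBytecode = new LinkedHashMap<>();
        for (Snippet s : matches) {
            byBytecode.putIfAbsent(Arrays.toString(s.bytecode), s);
        }
        return new ArrayList<>(byBytecode.values());
    }
}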

7.3 Summary

We have presented the design and results of two studies comparing the relevance of results from our new search tool, Satsy, to those found with Google, Google with a constrained repository, and with Merobase.

In RQ2(a), the query generator was simply a question posted on stackoverflow.com, and presumably a question for which the asker could not find a response online. However, the title of the question may not be representative of the query that would be issued to find this information; this threat to external validity is discussed further in Section 9.2.2. The study toward RQ2(c) was designed to address some of the threats to validity of the study for RQ2(a) by, for example, including human query generators, using a richer repository, and using regular Google. However, the threat that the query generator is separated from the query evaluator remains. Toward RQ2(c), while Google outperformed the others in terms of the collected metrics, we have reason to believe that refining the ranking algorithm used in Satsy and increasing the language coverage for Java would make Satsy more competitive with Google. Of course, we cannot ignore the fact that the query model necessary for Satsy is drastically different from that for Google. This needs to be studied and explored further to evaluate the impact Satsy could have on state-of-the-practice code searches.

Chapter 8

Yahoo! Pipes Evaluation

In the evaluation of this approach, we have designed a series of studies to assess and highlight some of its key aspects through the two implemented domains: Java and Yahoo! Pipes. This chapter focuses on the Yahoo! Pipes implementation in Satsy and aims to address two general research objectives. RQ2 was also the focus of Chapter 7; RQ3 is new for this domain:

RQ2: How effective is Satsy at returning relevant source code?

RQ2(d) What is the impact of varying the abstraction level for the encoded programs in the repository?

RQ2(e) What is the impact of solver time on the effectiveness of the search?

RQ3: What is the cost of defining a query in terms of input/output for the programmer?

RQ3(a) How accurately can programmers compose input/output specifications?

RQ3(b) How efficiently can programmers compose input/output specifications?

Every search conducted with Satsy toward RQ2 uses an input/output query.1 We measure the effectiveness of a search in terms of code semantics, analyzing the returned source code and determining if it would behave as specified by the input/output example, and reporting the number of results, precision, and recall, where the baseline is our search at the most abstract encoding level. This differs from the Java studies (Chapter 7) because we are not comparing against another search engine, but rather against Satsy with the most flexible program encoding. The Yahoo! Pipes domain is better suited than Java to evaluate RQ2(d) and RQ2(e) for two reasons. First, in our current Java evaluation, the specifications and snippets are small, so the solver times are fast. In Yahoo! Pipes, the specifications can get quite large, and solving can take minutes. Second, the Java code snippets we encode are taken out of context, so many variables are already symbolic. The Yahoo! Pipes programs, on the other hand, are encoded in their entirety, which leaves much more opportunity for abstraction. In RQ3, we measure and assess the correctness and efficiency of programmers writing Yahoo! Pipes and SQL specifications as part of an empirical user study. We included SQL in this study as it is a domain we plan to target more extensively in a future version of Satsy. It also provides an additional domain against which the results from Yahoo! Pipes can be compared. For each research question, we describe how the repositories were built, the metrics we use, and the results. All of the study artifacts are available online.2 Next, we describe the experimental contexts for these studies.

1: RQ1 was introduced and discussed in Section 3.1.
2: http://cse.unl.edu/~kstolee/semsearch/

8.1 Yahoo! Pipes Study 1

In Yahoo! Pipes, the specifications can be quite large and complex, and tweaking the search parameters can have a profound impact on the results. The goal of this study is to analyze the impact of tweaking search parameters, specifically abstraction and solver time, for the purpose of evaluation with respect to effectiveness in terms of precision and recall, from the point of view of the researcher, in the context of searching a controlled, local repository of Yahoo! Pipes programs using input/output queries [67, 69, 70]. In this study, we compare and contrast different configurations of Satsy considering levels of abstraction and solver time. Toward this goal and to refine RQ2, we aim to address RQ2(d) and RQ2(e).

8.1.1 Artifacts

To evaluate RQ2, we require a repository of Yahoo! Pipes programs and specifications with which to search the repository. In a previous study with Yahoo! Pipes [71], we scraped 32,887 pipes programs from the public repository by issuing approximately 50 queries against the repository and removing all duplicates. Among these pipes, 2,859 are supported by our encoding (Section 6.4), which forms the local repository. To perform the searches for the study, we gathered specifications from five representative pipes in the repository. These are intended to be "typical" pipes in the repository based on their structural uniqueness and their popularity (measured by the number of clones, or explicit copies, of the pipes), and were selected as follows. The pipes were first grouped by similar structures (i.e., the modules and wires, but not necessarily the field values, are the same for every pipe in each cluster), yielding 154 clusters. The pipes within the clusters were then refactored.3 Clusters with pipes that

3: Refactorings focused on decreasing the size of the pipes and standardizing them according to the community standards [68].

have identical structures post-refactoring were merged. Then, clusters that had fewer than five pipes were removed, as were the most and the least popular clusters. This was done to identify pipes that are “typical” yet diverse. From each of the remaining five clusters, we randomly selected one pipe, forming the five example pipes used for evaluation. Each specification was obtained by retrieving the RSS feeds from the URLs in the fetch feed modules to form the input list(s) and executing the pipe to capture the output list. Next, we describe the derived specifications for each. Pipe 1 has two URLs and gathers weather information that has “10-Day” or “Current” in the title. The URLs for this pipe, and for the other pipes, are shown in Table 3.5. For Pipe 1, only one URL is listed, yet in the specification there are two URLs. Since both URLs have the same subdomain and domain names, for the purposes of the search results in Table 3.5, only one needed to be listed. The specification, then, has two URLs in the input, and the output is obtained by executing the pipe.

Table 8.1 shows, for each of the pipes, its specification, where each x_i represents an RSS item like the one in Figure 2.2. For Pipe 1, there are two input lists, each with five items. The output list has two items, one from each of the input lists. Pipe 2 has two URLs and looks for hotel information, retaining RSS items that contain "hotel" in the description field and sorting the list by publication date. As with Pipe 1, the URLs are similar, so only one is listed in Table 3.5. The output has three items, as shown in Table 8.1. Pipe 3 has three URLs and retains up to the first three items from each feed; thus, the output has nine items. Pipe 4 has one URL and grabs the third item in the list, so the output has one item. Finally, Pipe 5 looks for items from one URL that have "au" in the description; the output has seven items. Note that each input list is limited in size to five items to control the sizes of the specification and thus the solver time.

Table 8.1: Specifications for Example Pipes

LS1 = ({[x0, x1, x2, x3, x4], [x5, x6, x7, x8, x9, x10]}, {[x0, x5]})
LS2 = ({[x0, x1, x2, x3, x4], [x5, x6, x7, x8, x9, x10]}, {[x9, x1, x2]})
LS3 = ({[x0, x1, x2, x3, x4], [x5, x6, x7, x8, x9, x10], [x11, x12, x13, x14, x15]}, {[x0, x1, x2, x5, x6, x7, x10, x11, x12]})
LS4 = ({[x0, x1, x2, x3, x4]}, {[x2]})
LS5 = ({[x0, x1, x2, x3, x4], [x5, x6, x7, x8, x9, x10]}, {[x1, x5, x6, x2, x7, x8, x9]})

8.1.2 Metrics

To address RQ2(d) and RQ2(e), we manipulate two search parameters: the abstraction of the program encodings (RQ2(d)) and the maximum solver time (RQ2(e)). These are shown as parameters to Satsy in Figure 6.1. We report the number of pipes in the local repository that return sat, unsat, and unknown (?) at each of four solver times, 1, 10, 100, and 1,000 seconds (sec.), considering two levels of abstraction on the program encoding as described in Section 5.2.3.2. These levels are all concrete and all symbolic, corresponding to the fully concrete BCIS encoding and the BCIS encoding with symbolic integers and strings, respectively. As described in Section 6.4.2.3, our implementation in Yahoo! Pipes supports symbolic values on integers and strings only. These search parameters can be manipulated only using Satsy, so we cannot directly compare Satsy's results to a syntactic search on the local repository. We also calculate precision and recall, where precision = |relevant ∩ sat| / |sat| and recall = |relevant ∩ sat| / |relevant|, using the results of our own search as a baseline. That is, the relevant results are defined as those that will eventually (given infinite time) return sat with a symbolic encoding, which represents the pipes for which some instantiation of the module field values can achieve the desired behavior. The precision of Satsy will always be 1.00, as we protect against spurious results by design and base the set of relevant results on the symbolic encodings, which yield a superset of the results returned for a concrete encoding. The recall value does vary based on the parameter settings.
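For illustration, using the Specification 1 results from Table 8.2 (Section 8.1.4): at the 1,000-second limit the symbolic encoding returns sat for 100 pipes and the concrete encoding for 17, and the reported recall values correspond to a relevant set of roughly 103 pipes, giving

\[ \mathrm{recall}_{\mathrm{symbolic}} = \frac{100}{103} \approx 0.971 \qquad \text{and} \qquad \mathrm{recall}_{\mathrm{concrete}} = \frac{17}{103} \approx 0.165. \]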

We chose precision and recall as metrics since we were able to characterize the repository in this domain in terms of what is relevant. In the Java studies, our goal was to compare against existing searches, so P@10 was appropriate since we cannot calculate precision and recall for search engines we do not control, such as Google. Here, we are interested in the impact of various Satsy configurations.

8.1.3 Implementation

This was an artifact-only evaluation and required no human input outside of artifact collection. The process was as follows. First, we encoded each pipe specification. Second, for a given abstraction level and maximum solver time, we searched the local repository for matches and collected the metrics. Third, we calculated precision and recall. The oracle of relevant programs was created by executing each specification with a maximum solver time of 1,000 seconds on a repository encoded at the BCIS level of abstraction; the (very few) programs for which the solver did not return were checked by hand for relevance given the specification. Our data were collected under Linux on 2.4GHz Opteron 250s with 16GB of RAM.

8.1.4 Results

RQ2(d) and RQ2(e) aim to explore the impact of solver time and the abstraction level of program encodings on the precision and recall of the search. Satsy searches the repository using the five pipe specifications, given the solver times and abstraction levels described, and calculates precision and recall. For each combination of solver time and abstraction, we report the results for each specification in Table 8.2. The first set of columns reports the results for the Concrete abstraction level, and the second set for the Symbolic abstraction level. Each row

Table 8.2: Pipe Specification Search Results

Specification 1
         Concrete                       Symbolic
sec.     Sat   Unsat   ?    Recall      Sat   Unsat   ?     Recall
1000     17    2842    0    0.165       100   2756    3     0.971
100      16    2842    1    0.155       24    2756    79    0.233
10       0     2836    23   0.000       0     2723    136   0.000
1        0     2794    65   0.000       0     2572    287   0.000

Specification 2
         Concrete                       Symbolic
sec.     Sat   Unsat   ?    Recall      Sat   Unsat   ?     Recall
1000     1     2858    0    0.333       2     2856    1     0.667
100      0     2858    1    0.000       0     2853    6     0.000
10       0     2836    23   0.000       0     2785    74    0.000
1        0     2783    76   0.000       0     2567    292   0.000

Specification 3
         Concrete                       Symbolic
sec.     Sat   Unsat   ?    Recall      Sat   Unsat   ?     Recall
1000     3     2856    0    0.143       18    2838    3     0.857
100      0     2856    3    0.000       0     2833    26    0.000
10       0     2835    24   0.000       0     2651    208   0.000
1        0     2798    61   0.000       0     2554    305   0.000

Specification 4
         Concrete                       Symbolic
sec.     Sat   Unsat   ?    Recall      Sat   Unsat   ?     Recall
1000     1     2858    0    0.011       89    2770    0     1.000
100      1     2858    0    0.011       86    2770    3     0.966
10       1     2858    0    0.011       3     2770    86    0.034
1        0     2795    64   0.000       0     2758    101   0.000

Specification 5
         Concrete                       Symbolic
sec.     Sat   Unsat   ?    Recall      Sat   Unsat   ?     Recall
1000     1     2858    0    1.000       1     2858    0     1.000
100      1     2858    0    1.000       0     2857    2     0.000
10       0     2851    8    0.000       0     2773    86    0.000
1        0     2799    60   0.000       0     2607    252   0.000

represents the results given a specific maximum solver time. The number of matches is shown in the Sat column, and the number of discarded programs is shown in the Unsat column. If the solver was stopped before it could make a decision, then the solver returned unknown, which is shown in the ? column, followed by the recall metric. Matches from the repository can come from pipes with various topologies. For example, with Specification 1 and the concrete encoding at 1000 seconds, there are 17

matches. Of those, 13 matches permit records that contain the substring http:// in the description, which happens to be the case for all records in o. Three others simply

grab the first element from each input list, and the last result is P1 itself. Overall, we observe that for all searches and abstraction levels, at least one match is found with the maximum solver time of 1000 seconds, which is fitting as each specification was derived from a pipe in our local repository. Additionally, using symbolic encodings usually yields more results than concrete. Next, we look into the impact of abstraction level and solver time on recall.

8.1.4.1 RQ2(d): Impact of Varying Abstraction Levels

We now discuss the results in the context of RQ2(d): What is the impact of varying the abstraction level for the encoded programs in the repository? At any given level of maximum solver time, using symbolic encodings usually yields more results than concrete encodings. For instance, with Specification 4 in Table 8.2 at 1,000 seconds, all the programs have been determined to be sat or unsat for both the concrete and symbolic encodings (? = 0 for both). Yet, the symbolic encoding yields 89 possible matches while the concrete encoding finds only one, since there is only one pipe in the repository that will return sat with the concrete encoding. With the symbolic encoding, the title string in the filter modules is symbolic, so pipes that return all records with a particular title are reported as sat by the solver. Thus, the symbolic encodings may produce more results, and more quickly, but possibly at the cost of additional adaptation.

8.1.4.2 RQ2(e): Impact of Solver Time on Search Effectiveness

We now discuss the results in the context of RQ2(e): What is the impact of solver time on the effectiveness of the search? For all searches and abstraction levels, at least one match is found with the maximum solver time of 1,000 seconds, though it does take some time for results to be found. For Specification 2 and Specification 3, the first result is returned when the solver time is set to 1,000 seconds, but not before. Yet, for the other three specifications, a result is returned with the concrete encoding within 100 seconds, and in the case of Specification 4, within 10 seconds. More results can be found with the symbolic encoding, but it takes about the same amount of time for matches to be returned. There is a time advantage to using a concrete encoding, as it can discard programs that do not match faster than the symbolic encoding, in part because the constraint system is tighter and there are fewer decisions the solver has to make. For all examples, and all ranges of solver times, the number of unsat programs for the concrete encoding is always greater than or equal to the number of unsat programs for its symbolic counterpart. This makes sense for the longer solver times, since the symbolic encoding has the potential to recognize more matches, yet the trend also holds for the smaller solver times.

8.2 Yahoo! Pipes Study 2

The goal of this study is to analyze the use of input/output queries for search in Yahoo! Pipes and SQL for the purpose of evaluation with respect to accuracy and time

from the point of view of the researcher in the context of programmers determining the expected output when given a problem description and input. Toward this goal and RQ3, we aim to address RQ3(a) and RQ3(b). We designed a user study around ten tasks that, given a behavioral description of a desired functionality and an input, ask the participants to select the output. The experimental tasks were split among the two targeted domains, Yahoo! Pipes and SQL, with five tasks in each. Note that the participants specify only the output, yet we aim to evaluate the input/output specification model as a whole. The rationale for this choice is twofold. First, for the domains being evaluated, the inputs can easily be obtained by the programmer naturally from their context (from a target URL in the case of a pipe, or from a database table in the case of SQL). So, having the participants start with an existing input but with the challenge of computing an output mimics how a semantic search with an input/output model would be implemented in practice for these domains. Second, given a problem description, by fixing the input we can automate the accuracy assessment since we know the expected output. This reduces the cost of executing the experiment and the chances of accuracy measurement errors. We discuss the experimental aspects related to this point, including the threat to construct validity, in Section 9.3.

8.2.1 Design

We designed an experiment according to the experimental design in Table 8.3. Each of the ten experimental tasks uses one of the targeted domains, L1 for Yahoo! Pipes and L2 for SQL. The assignment of participants to tasks was done using self-selection (SS). In each task, we presented the participant with an input (a table or a list, denoted I) and a problem description (D), which serve as the objects.

Table 8.3: User Study Design

                          Pretest                      Post test
Task   Domain   Assign    Measures      Objects        Measures
1      L1       SS        O1...O10      I1, D1         O11, O12
2      L1       SS        O1...O10      I2, D2         O11, O12
3      L1       SS        O1...O10      I3, D3         O11, O12
4      L1       SS        O1...O10      I4, D4         O11, O12
5      L1       SS        O1...O10      I5, D5         O11, O12
6      L2       SS        O1...O10      I6, D6         O11, O12
7      L2       SS        O1...O10      I6, D7         O11, O12
8      L2       SS        O1...O10      I6, D8         O11, O12
9      L2       SS        O1...O10      I6, D9         O11, O12
10     L2       SS        O1...O10      I6, D10        O11, O12

The participant was asked to select the output from the input list considering the problem. Since this study aims to characterize the performance and accuracy of the input/output query model, there are no treatments. The pretest measures are the results of a survey that asks about programming experience and search behavior (O1...O10), and were used in the analysis reported in Section 3.1. The post-test measures were accuracy (O11) and time (O12), and are used to address RQ3(a) and RQ3(b).

8.2.1.1 Yahoo! Pipes

A task in Yahoo! Pipes, covering Tasks 1-5, presents the participant with a problem description (D) and an input list of RSS items (I). Considering the problem description, the participant is instructed to select what the output should be using check-boxes next to each RSS item in the input I. An example task is shown in Figure 8.1, which represents Task 4 in the user study. The problem description is, "Select the third-most-recent item from the list," and the input list consists of five RSS items. The expected output is the third item, with a publication date of Thu Jan 12 18:02:00 CST 2012; this forms an output list of size one.

Guidelines and Assumptions:

Answer the question below, selecting all relevant items using the checkboxes. Accuracy must be greater than 50% to get paid.

Select the third most-recent item from the list

Title: As space junk falls, Russia hints at sabotage
Description: A space probe stuck in orbit could fall back to Earth as soon as Sunday or Monday, though most experts say the chance that debris will harm anyone on the surface is slim. And even as the Russian space agency released new forecasts of the probe's...
Link: link
Date: Fri Jan 13 11:49:00 CST 2012

Title: Warm today; red-flag warning for mountains
Description: Warm, dry winds will raise wildfire risk in the Santa Ana Mountains Friday, prompting a red-flag warning from the National Weather Service. Gusts from the northeast as high as 45 to 50 mph could rake the windiest spots in canyons, passes and slopes...
Link: link
Date: Fri Jan 13 08:49:00 CST 2012

Title: Red-flag warning for Santa Ana Mountains
Description: The National Weather Service has issued a red-flag wildfire warning for the Santa Ana Mountains from midnight Thursday to 2 p.m. Friday, prompted by expectations of gusty winds and low humidity. High-pressure air over the Great Basin is ratcheting...
Link: link
Date: Thu Jan 12 18:02:00 CST 2012

Title: Ocean Institute to celebrate Marine Protected Areas
Description: The Ocean Institute at Dana Point Harbor will join the celebration of the fourth Underwater Parks Day from 10 a.m. to 3 p.m. Jan. 21. The day is a coordinated event by marine-science centers throughout Southern California to educate the public...
Link: link
Date: Thu Jan 12 15:00:00 CST 2012

Title: Whale watching preps art students for Dana Point festival
Description: The common dolphin is abundant in the wild but rarely appears in captivity – which means you can see it only by sailing the ocean and meeting it face to face. Some things can only be experienced in person. That's why organizers of the Dana Point...
Link: link
Date: Thu Jan 12 12:24:00 CST 2012

Figure 8.1: Task 4 in Yahoo! Pipes Study

8.2.1.2 SQL

In SQL, the input and output take the form of database table(s). The output table would be the selected column(s) and row(s) from the input table(s), such as the example shown in Figure 8.2.

Input Table:

  id  | username | balance   | status
------+----------+-----------+-------
  145 | rekin76  | 469370.44 | 0
   56 | avcio    | 466921.90 | 0
  705 | shantee  | 149160.09 | 0
 5725 | ter      |  93004.45 | 0
 3414 | rut1999  |  80944.80 | 0
  ... | ...      | ...       | ...

Output Table:

 username
----------
 rekin76
 avcio
 shantee

Figure 8.2: SQL Input/Output Specification

The input table has four fields: id, username, balance, and status. The output table has one field, username, which contains a subset of the records from the input table. An example SQL query that produces the output from the input is:

SELECT username FROM table WHERE balance >= 1000000 ORDER BY balance DESC LIMIT 10;

For Tasks 6-10, the participant was presented with a problem description (D) and an example input table (I). For all tasks in SQL, the same table was used as the

object (I6), but the problem descriptions differed. This choice was made to ease the burden on the participants, though it may have led to learning effects. This threat to validity is discussed in Section 9.1.4. Considering the problem description and the input, participants selected the output using check-boxes next to the relevant rows or columns. An example task is shown in Figure 8.3, which corresponds to Task 6 in the user study. Here, the problem statement is, "Select the rows with a price per unit greater than $1.00 from the table." Selecting rows 3, 5, 6, and 8 is the expected response.

Guidelines and Assumptions:

Answer the question below, selecting all relevant items using the checkboxes. Accuracy must be greater than 50% to get paid.

Select the rows with a price per unit greater than $1.00 from the table below, using the checkboxes next to each row.

Id   Fruit     Variety       Vendor               Price   Unit
1    Apple     Fuji          John's Produce       $0.67   lb
2    Apple     Jonathan      John's Produce       $0.84   lb
3    Apple     Fuji          Fran's Fruit         $1.23   kg
4    Pear      Green Anjou   John's Produce       $0.98   lb
5    Pear      Bartlett      Pearly Pears         $2.01   kg
6    Grapes    Champagne     Eduardo's Uvas       $5.21   kg
7    Bananas   Plantain      Pato's Plantains     $0.50   lb
8    Melon     Honeydew      Oregon Melon, Inc.   $1.10   lb

Figure 8.3: Task 6 in SQL Study

8.2.2 Task Creation

Each experimental task in Table 8.3 requires three pieces of information: the problem description, the input, and the expected output. When a task is presented to the participant, the problem description and input are shown (these are the objects). The participant then selects the output based on the information presented. This is the response variable. We use the expected output as an oracle against which the participant’s response is compared using the metrics (Section 8.2.5). In order to create realistic experimental tasks, we begin with programs (artifacts) from which inputs, expected outputs, and problem descriptions are created. In this section, we describe how this process works and how we selected the programs used to create the experimental tasks.


8.2.2.1 Object and Output Creation

Object creation involves taking a program, extracting or constructing an input and an output that match its behavior, and generating a problem description. Each experimental task is built from a different program. Here, we describe this process for Yahoo! Pipes and for SQL.

Yahoo! Pipes. An input is a list of RSS items. Given a Yahoo! Pipes program, the input is derived by extracting the URLs in the pipes and accessing the RSS feeds. Executing the pipe program generates the output list, which is used as the oracle. The problem description was generated by the researchers to concisely reflect the semantics of the program. Table 8.4 shows the problem descriptions used in Tasks 1-5. To capture the behavior of the pipes while keeping the input list a reasonable size, since the participants would be evaluating each RSS item, we limited the number of records from each URL to seven, though some RSS feeds returned fewer than seven items. An additional bound of 100 characters was imposed on the string lengths. In some cases, there were items in the output that did not originate from the input list, due to the imposed item restriction. Those items were removed so the output reflected the input. The input and output lists’ sizes, in terms of items, are shown in Table 8.5. The range of input sizes is from five to 21 items, and the range of output sizes is from one to nine items.
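A minimal Java sketch of this input preparation (our illustration, not the actual study tooling; the RssItem type is a hypothetical placeholder): each feed is cut to at most seven items and every string field is truncated to 100 characters.

import java.util.ArrayList;
import java.util.List;

public class PipesTaskInput {

    static class RssItem {
        final String title, description, link, date;
        RssItem(String title, String description, String link, String date) {
            this.title = title; this.description = description;
            this.link = link; this.date = date;
        }
    }

    // Keeps at most seven items per feed and truncates string fields to 100 characters.
    static List<RssItem> prepareInput(List<RssItem> feed) {
        List<RssItem> input = new ArrayList<>();
        for (RssItem item : feed.subList(0, Math.min(7, feed.size()))) {
            input.add(new RssItem(truncate(item.title), truncate(item.description),
                    truncate(item.link), truncate(item.date)));
        }
        return input;
    }

    private static String truncate(String s) {
        return s.length() <= 100 ? s : s.substring(0, 100);
    }
}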

SQL. An input in a SQL task is a table. An output is either a subset of that table or a new table that was built from the input. Given a SQL select statement (i.e., the

SQL program), the input (I6) and output are generated so that the tables are distinct

Table 8.4: Cost of Providing Specifications, Pipes and SQL

Task   Textual Description
1      Select all records that show the Current Weather Conditions or the 10-Day Forecast for Malibu, Exeter, or Camarillo
2      Select the four most-recent records from the list that contain information about a hotel
3      Select the first three records from each source, where the sources are indicated using different background colors
4      Select the third most-recent record from the list
5      Select all records with the pink background, and those items from the grey background with "au" in the description
6      Select the rows that match some criteria.
7      Select the rows and a subset of columns that match some criteria.
8      Counts the number of rows that match some criteria.
9      Computes an average for different groups of rows.
10     Identifies distinct values from the table subject to some criteria.

Table 8.5: Input/Output Sizes for Experimental Tasks

Task                   1    2    3    4    5    6    7    8    9    10
Input Size             21   21   14   5    11   8    14   5    5    5
Expected Output Size   6    4    9    1    4    4    7    1    1    1

and the output is returned when the SQL program is executed on the input. The output forms the oracle against which the participant responses are compared. In the study design shown in Table 8.3, the input table is the same for all SQL

experimental tasks. The problem descriptions, D6-D10, were generated by the researchers to reflect the semantics of a given SQL select statement, and were designed to be as concise as possible. These are shown in the Textual Description column of Table 8.4 for Tasks 6-10. The sizes of the inputs and expected outputs are shown in Table 8.5. The input table is the same for all tasks, but the input sizes differ. Task 6 asks the participant to select only rows, so there are only 8 checkboxes. Task 7, on the other hand, asks about columns and rows, so 14 checkboxes are available. In Tasks 8-10, five possible outputs are presented to the participant, who selects one or more of the options as the output.

8.2.2.2 Artifact Selection

Each experimental task is derived from an artifact. Here, we describe the selection process.

Yahoo! Pipes. To create the objects for each task related to L1, we had to collect five pipes and extract the input/output. These pipes are intended to be “typical” pipes in the repository based on structural uniqueness and their popularity, and so we reuse the five Yahoo! Pipes artifacts from the previous Yahoo! Pipes study (Section 8.1.1).

SQL. In SQL, the artifacts were created since there is no existing reference SQL repository. The creation of these SQL artifacts was based on language coverage, using many of the common constructs in SQL select statements. The first artifact, corresponding to Task 6, is a SQL select statement that chooses all columns with a basic inequality condition:

SELECT * FROM table WHERE price > 1.00;

The second artifact, corresponding to Task 7, selects a subset of the possible columns

and includes a conjunction on the conditions in the WHERE clause:

SELECT Id, Fruit, Variety FROM table WHERE price >= 0.75 AND price <= 1.25;

Some select statements return a table of data, while others, such as those including the

count construct, return a single value, such as the following artifact that corresponds to Task 8:

SELECT count(*) FROM table WHERE Unit = 'lb';

The fourth artifact corresponds to Task 9 and combines multiple rows using the avg and group by components:

SELECT avg(Price) FROM table GROUP BY Fruit;

The last artifact, corresponding to Task 10, weeds out certain rows based on the distinct values in a particular column:

SELECT distinct(Vendor) FROM table WHERE Fruit = 'apple';

8.2.3 Participants

The participants in this study were solicited from two populations: undergraduate computer science classes at UNL and workers on Amazon's Mechanical Turk [44]. Recruitment of student participants was done by asking the permission of the faculty instructors of the classes, one in embedded systems and the other in software engineering. Recruitment of Mechanical Turk workers was performed by making the tasks available on their server. From there, participation comes from self-selection. The first part of the study involved a 10-question survey about the participants'

programming habits (pretest measures O1...O10). The Mechanical Turk workers were paid for their tasks, whereas the student participants were not. Participation was restricted to those who accepted the informed consent.4 The counts of participants and gender are shown in Table 3.1. Of the 109 participants, 43 came from junior/senior undergraduate classes at UNL and the rest came from Mechanical Turk. Less than a quarter of the participants reported having less than a year of programming experience, half reported between 2 and 5 years, and the rest had more than 5 years.

4: The tasks, informed consent, and UNL Institutional Review Board approved process can be found in Appendix B.

8.2.4 Implementation

Study participants were presented with a problem description, D, and an input, I, and asked to determine what the output should be. This study delivery was based on the participant group. The students performed the study in the classroom. The Mechanical Turk workers performed the study online.

8.2.4.1 Classroom

Implementing this study in the classroom required the creation of paper packets that delivered the IRB informed consent, pretest survey, and the experimental tasks. Each packet was numbered at random and distributed to a classroom of students. The students had 15 minutes in total to complete as much of the packet as possible. The first pages of the packets contained the informed consent and pretest survey. The experimental tasks followed and the order was the same for all packets, starting with the SQL Tasks 6 - 10, and then the Yahoo! Pipes Tasks 1 - 5.

8.2.4.2 Mechanical Turk

In the Mechanical Turk implementation, the pretest survey was implemented as a qualification exam. In addition to the survey questions, there were two competency questions, one about Yahoo! Pipes and another about SQL, which had to be answered with at least 50% accuracy. The IRB form was signed electronically. When a qualification test was completed, the results were sent to the researchers for evaluation. A passing score and IRB acceptance allowed the worker access to the experimental tasks. Each experimental task in the study was implemented as a human intelligence task, or HIT. Participants had a maximum of 10 minutes to complete each HIT, and were paid $0.26 if their work was accepted. To prevent participants from 'gaming' the tasks, 50% accuracy was required for payment (Section 8.2.5 describes the scoring). The study was available for three weeks.

8.2.5 Metrics

Two metrics, time and accuracy, were collected as posttest measures. Time was collected by the Mechanical Turk server on a per-task basis. In the classroom implementation, the students worked at their own pace through the packets until the 15-minute mark, so we were unable to collect timing information. Participants selected the output from each experimental task using checkboxes next to each RSS item (Yahoo! Pipes) or columns and rows (SQL) in the input. Accuracy was measured by scoring the participants' responses against the oracles (Section 8.2.2.1). In scoring, if a participant selected a record that was supposed to be unselected, or vice versa, a point was not awarded. The score awarded was a percentage out of the total points possible in the task. For example, consider the following score sheet for a list of 5 items, which represents the five RSS items shown in Figure 8.1:

Item   Oracle   Response   Points Awarded
1      0        0          1
2      0        1          0
3      1        1          1
4      0        0          1
5      0        0          1

Score: 4/5 = 80%

The Item column represents a list of items, the Oracle column indicates the expected output generated during object creation (Section 8.2.2.1), the Response column indicates the participant's response, and the Points Awarded column indicates how many points were awarded per item. In the oracle, only item 3 is selected.

The participant selected items 2 and 3. Points were awarded for answering correctly on items 1, 3, 4, and 5, leading to an overall score of 80% (item 2 was mis-selected). In the Mechanical Turk implementation, this participant would have been paid for the work since the score is at least 50%.
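As a concrete illustration, this scoring amounts to counting the items on which a response agrees with the oracle. The following minimal sketch assumes boolean selection vectors; the class, method, and variable names are ours and not part of the study's instrumentation.

public class Scoring {
    // Fraction of items on which the participant's selection matches the oracle.
    static double score(boolean[] oracle, boolean[] response) {
        int points = 0;
        for (int i = 0; i < oracle.length; i++) {
            if (oracle[i] == response[i]) {
                points++; // one point per correctly selected or unselected item
            }
        }
        return (double) points / oracle.length;
    }

    public static void main(String[] args) {
        boolean[] oracle   = {false, false, true, false, false}; // only item 3 selected
        boolean[] response = {false, true,  true, false, false}; // items 2 and 3 selected
        System.out.println(score(oracle, response)); // prints 0.8, i.e., 4/5 = 80%
    }
}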

8.2.6 Results

Here, we present the results of the study with respect to RQ3 (What is the cost of defining a query in terms of input/output for the programmer?) in the two targeted languages, Yahoo! Pipes and SQL. Specifically, we target two sub-questions of RQ3, RQ3(a) and RQ3(b).

RQ3(a) How accurately can programmers compose input/output specifications?

RQ3(b) How efficiently can programmers compose input/output specifications?

The target population of this study is novice programmers as the languages being evaluated are domain-specific, end-user programming languages. However, the participants had more programming experience than the typical end-user programmer (see Section 3.1.1). Further study with actual novice programmers is needed.

8.2.6.1 Yahoo! Pipes

The accuracy and timing information for Yahoo! Pipes Tasks 1 - 5 are summarized in Table 8.6. The total number of participants per task is indicated by the n column, followed by the mean and median accuracy for those tasks aggregated across all participants (RQ3(a)). An average of 67 participants performed each Pipes task, with a range from 63 to 72.

RQ3(a): Accuracy. Aggregated across all Yahoo! Pipes tasks, the average accuracy was 92%. Participants performed the best on Task 4, with 96% average accuracy and a median accuracy of 100%. Among the Pipes tasks, Task 4 had the smallest input list and the smallest expected output (Table 8.5). There is a negative correlation between task accuracy and the input size (Spearman’s r = −0.2946), but not between the task accuracy and the output size (Spearman’s r = −0.0136).

Table 8.6: Cost of Providing Specifications, Pipes and SQL

               Accuracy                Timing (m:ss)
Task       n     Mean    Median    n2    Mean    Median
Pipes 1    63    90%     94%       30    2:30    1:48
Pipes 2    60    90%     90%       24    3:48    2:55
Pipes 3    65    93%     100%      29    1:20    0:46
Pipes 4    72    96%     100%      30    1:14    0:47
Pipes 5    70    95%     100%      30    2:26    2:05
SQL 6      72    99%     100%      30    0:41    0:34
SQL 7      71    88%     93%       30    1:37    1:29
SQL 8      63    86%     100%      21    1:08    0:52
SQL 9      68    94%     100%      26    1:44    1:21
SQL 10     72    92%     100%      30    1:22    1:03

RQ3(b): Efficiency. As mentioned in Section 8.2.4, the timing data is only available for the Mechanical Turk participants, so n2 in Table 8.6 indicates the number of participants from whom we gathered timing data. For the Yahoo! Pipes tasks, n2 ranged from 24 to 30. The average time per task was 2:12, but the range was from 1:14 to 3:48. The timing appears to depend, in part, on the size of the input presented to the participant (Table 8.5). The Spearman correlation between the time and the size of the input reveals a moderate positive relationship (r = 0.3565); that is not the case for the time and the output size (r = −0.0397). To give more context to the timing data of formulating an input/output query, consider that understanding irrelevant pipe matches of similar complexity may take on the order of 16 minutes [66].

Although full understanding of a pipe is not needed to discard irrelevant matches, the cost of this pruning activity is such that investing in a query formulation that takes a couple of minutes but returns only semantically relevant results seems promising. As we discuss later, a comprehensive study comparing this input/output query format against other query models is one target of our future work.

Summary. Overall, the most accurate task had the shortest completion time and the least accurate task had the longest completion time. This fits with the observation that there is a negative correlation between completion time and task accuracy. There is also a positive correlation between completion time and input size. We conjecture that it was easier for the participants to get distracted when the input size was large, which led to longer completion times and lower accuracy. On the other hand, with a small input size, the task could be answered quickly and accurately. This may imply that the input/output query model is more appropriate for tasks that can be characterized through small examples, or perhaps a series of small examples rather than a single large one.

8.2.6.2 SQL

The accuracy and timing information for Tasks 6 - 10 are summarized in Table 8.6. An average of 69 participants completed each of the SQL tasks.

RQ3(a): Accuracy. Across all SQL tasks, as with Yahoo! Pipes, the average accuracy was 92%. SQL participants performed best on Task 6, with an average score of 99% and a median of 100%. The lowest accuracy came from Task 8, with an average of 86% and a median of 100%. Since the input table was the same for all the SQL tasks, we could not draw conclusions about the relationship between accuracy and input size.

RQ3(b): Efficiency. The timing data reveals an average completion time of 1:18. Participants provided the fastest responses on Task 6, with an average of 41 seconds and a median of 34 seconds. Task 9 took the longest, with an average of 1:44. In general, longer times were associated with lower accuracy (Spearman's r = −0.1970), just as we observed with Yahoo! Pipes. Participants who had to guess at an answer likely took longer than those who knew it outright.

8.2.6.3 Comparing Results

While the focus of RQ3 is on the feasibility of the specification model, some interesting results emerged from the study pertaining to the use of two diverse populations, students and Mechanical Turk workers. Since this study involves two different implementations, we wanted to know whether there are differences in the accuracy results based on the population.

Analysis. We segmented the results based on population and computed the mean for each task, for each domain, and overall. The mean accuracy for the Mechanical Turk population is µmt and the mean for the student population is µs. For each task, each domain, and overall, we performed a Mann-Whitney-Wilcoxon test5 with the null hypothesis H0: µmt = µs. The alternative is Ha: µmt ≠ µs.
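To make the comparison concrete, the following minimal sketch runs the same test on two hypothetical accuracy samples using Apache Commons Math; the dissertation does not state which statistics package was actually used, and the data values here are invented.

import org.apache.commons.math3.stat.inference.MannWhitneyUTest;

public class AccuracyComparison {
    public static void main(String[] args) {
        // Hypothetical per-participant accuracy scores for one task (not the study's data).
        double[] mturk    = {0.80, 1.00, 0.90, 0.95, 1.00, 0.85};
        double[] students = {0.85, 0.90, 1.00, 0.80, 0.95, 0.75};

        // Non-parametric test of H0: the two samples come from the same distribution.
        MannWhitneyUTest test = new MannWhitneyUTest();
        double p = test.mannWhitneyUTest(mturk, students);
        System.out.printf("p-value = %.4f%n", p); // reject H0 only if p falls below alpha
    }
}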

Results. Table 8.7 presents the results of this comparative analysis. Overall, no significant difference between the groups in terms of accuracy was observed (p = 0.2122).

5We tested the normality of the data using the Shapiro-Wilk test with a null hypothesis that the population is normal. The null hypothesis was rejected at α = 0.05, and so we used the non-parametric Mann-Whitney-Wilcoxon test.

Table 8.7: Differences in Results Based on Implementation. H0 : µmt = µs

Task      Mechanical Turk        Students           p-value
          n      µmt             n      µs
1         30     0.893           35     0.896        0.7144
2         24     0.924           39     0.841        0.0021***
3         29     0.985           36     0.893        0.0304**
4         30     0.947           42     0.962        0.6220
5         30     0.964           40     0.939        0.7773
YP        143    0.943           192    0.908        0.0095***
6         30     1.000           42     0.991        0.2366
7         30     0.883           41     0.886        0.8778
8         21     0.809           42     0.881        0.1876
9         26     0.908           42     0.967        0.0637*
10        30     0.947           42     0.947        0.5660
SQL       137    0.916           209    0.926        0.4160
Overall   280    0.929           401    0.917        0.2122

α = 0.1*   α = 0.05**   α = 0.01***

This provides support for the use of large-scale and cost-effective crowdsourcing environments like Mechanical Turk to conduct empirical studies that at least complement other studies. Care needs to be taken, however, as the population gets smaller or the requirements more specific. As we show next, splitting the analysis by domain reveals that the accuracy of the tasks was significantly different for the Yahoo! Pipes domain, but not for the SQL domain. Yahoo! Pipes: In this domain (YP), we reject the null hypothesis that there is no difference between the populations at α = 0.01. We observe that the performance of the students is lower than the performance of the Mechanical Turk participants. Breaking this down further by task, we observe significant differences for Tasks 2 and 3 at α = 0.01 and α = 0.05, respectively. The other tasks did not reveal significant differences between the groups. We conjecture that these differences manifest as a result of the delivery mechanisms used in the study. Recall that for the students we used hard copies of the tasks, while for the Mechanical Turk participants we had online versions of the tasks.

For Task 2, the participants were asked to select the most recent items. In the student version, the list of items spanned multiple pages, and it may have been hard to keep track of the dates. In the web version, the participants had an easier time scrolling through the data and searching for keywords with the browser's find function. For Task 3, lists of data in the web format had different background colors, but the students' gray-scale copies did not differentiate the lists as clearly, and again the lists spanned multiple pages. We conjecture that this may have contributed to lower student performance. SQL: In this domain (SQL), we did not reject the null hypothesis and found no significant difference between the student results and the Mechanical Turk results. However, the average Mechanical Turk score was lower than the average student score, even though µmt may be artificially high.

We say µmt may be artificially high because, to control for participant quality in an unknown Mechanical Turk population, we rejected HITs and did not consider the results when the accuracy was less than 50%. Among the SQL tasks, 25 HITs were rejected (see Chapter 9), and these were not included in µmt, which may therefore be inflated. Among the students, only three results were below 50% (one

from Task 8 and two from Task 10), and these were considered as part of µs. Thus, there may be an unobserved significant difference in which the students outperformed the Mechanical Turk workers.

8.3 Summary

We have presented two studies on Yahoo! Pipes targeting different aspects of the search approach. In the first, we manipulated search parameters sent to Satsy and measured the impact on precision and recall. In the second, we evaluated the feasibility of using input/output specifications in Yahoo! Pipes and SQL. From evaluating RQ2(d), we found that symbolic encodings may produce more results, and produce them more quickly, but possibly at the cost of additional adaptation. From RQ2(e), we learned there is a time advantage to using a concrete encoding, as it can discard programs that do not match faster than the symbolic encodings, in part because the constraint system is tighter and there are fewer decisions the solver has to make. Overall, solver time has a clear impact on recall. Since cutting off the solver before it has reached a conclusion returns unknown, the recall is reduced, as only the sat pipes are returned to the programmer. Treating the unknown pipes as results would increase recall to 1.00, but at the cost of precision. Studying this tradeoff is left as future work. Toward RQ3, our study shows that programmers can generate outputs when given an input and a program description in the Yahoo! Pipes and SQL domains. Further, this can be done with high accuracy and in a reasonable amount of time.

Chapter 9

Threats to Validity

Our evaluation explores different aspects of our approach in each of the supported languages, and each study comes with its own limitations and threats to validity. Throughout the presentation of the studies and results in this thesis, we have identified several threats to validity. In this chapter, we enumerate and elaborate on those more completely. As each threat references its related research question, we re-state the questions here for convenience.

RQ1: How and why do programmers search for source code? (Section 3.1)

RQ1(a): How frequently do programmers search for code? (Section 3.1.2)

RQ1(b): Why do programmers search for source code? (Section 3.1.3)

RQ1(c): Which tools do programmers use to search for code? (Section 3.1.4)

RQ2: How effective is Satsy at returning relevant source code? (Chapter 7, Chapter 8)

RQ2(a) How do search results from Satsy compare to those found using Google when searching over the same repository of programs? (Section 7.1.1)

RQ2(b) How much can existing syntactic search approaches be improved by augmenting results with Satsy? (Section 7.1.2)

RQ2(c) How do search results from Satsy compare to those found using Google and those found with the code-specific search engine, Merobase, from the perspective of a programmer? (Section 7.2)

RQ2(d) What is the impact of varying the abstraction level for the encoded programs in the repository? (Section 8.1)

RQ2(e) What is the impact of solver time on the effectiveness of the search? (Section 8.1)

RQ3: What is the cost of defining a query in terms of input/output for the programmer? (Section 8.2)

RQ3(a) How accurately can programmers compose input/output specifications? (Section 8.2)

RQ3(b) How efficiently can programmers compose input/output specifications? (Section 8.2)

9.1 Internal

These threats impact the ability to confirm that independent variables influence the dependent variables.

9.1.1 Selection Bias

For all studies involving Mechanical Turk (RQ1, RQ2(c), RQ3), the results are subject to self-selection bias by the participants. This is also true for the student populations in RQ3. In the study toward RQ3 (which impacts RQ1), for both the student and Mechanical Turk populations, the participants chose which tasks to complete and in which order. It is also possible that there was some overlap between the Mechanical Turk participants and the classroom participants. Toward RQ2(c), the participants performed large experimental tasks and only one was available per Mechanical Turk worker, but they still self-selected as participants. Additionally for RQ2(c), the query generators were computer science graduate students, and potentially more expert than the average programmer who would be searching for source code. Thus, the results associated with more familiar query formats (i.e., keyword for Google) may be biased by better-than-average queries leading to better-than-average search results. The results toward RQ1 are based on self-reported answers on a survey and may not be representative of actual behavior. Observing programmer search habits in practice would allow us to validate the results.

9.1.2 Instrumentation

Toward RQ2(c), we may have biased the results in favor of Google by collecting snippets of source code from pages beyond the top 10 results in the event that some webpages did not contain any source code (Section 7.2.2.4). Toward RQ3, the instrumentation may have played a role in the results, since the study was performed on paper by one group and online by another group. On the one hand, this may have led to increased accuracy in the overall results since we did not over-constrain the study context. On the other hand, some tasks may be better suited for one context versus another, and this may have impacted the results.

When building the repository of encoded programs for Java, we remove each code snippet from its context. Thus, the encoded programs may not be entirely representative of the behavior of the source code in its original implementation. For example, if a method accesses a field and it has a static value defined within a class, we ignore that value and represent the field value symbolically. Addressing this threat and considering a broader scope of code snippets is part of future work. Implementation errors are a risk for all studies involving Satsy (RQ2 ). To combat this error, we have developed extensive test suites and validated all Satsy search results (a reasonable endeavor considering the space of matches was relatively small). While this process is manual, it helped identify, and correct, for false positives. However, missed results are harder to diagnose, and for that we relied exclusively on test suites to execute and diagnose that behavior. One implementation limitation discussed in Section 6.2 deals with string bounds. In the implementation used for all the studies in Section 7 and Section 8, it was possible for the solver to assign character values to a string outside of the bounds of the string length. While this was not observed to be the case for any of the test cases or study results examined by the researchers, this could have led to some false positives when the solver assigned string values, as was the case when strings are made symbolic via abstraction. However, since the checks on string values all occur within the bounds of a string (as shown by the inclusion of bounds on all string method encodings in Appendix A.1), the impact should be limited.

9.1.3 Resentful Demoralization

For RQ2(c), the average completion time for the experimental tasks was 54 minutes and 40 seconds, which is longer than anticipated. With a payoff of $3.25 per task, participants may have decided not to perform as well toward the end. To combat this threat, we randomized the ordering of the programming tasks that appeared in each experimental task.

9.1.4 History

In the evaluation of RQ3 in the SQL domain, the input tables were the same for all tasks. While this was intended to ease the burden on the participant, it may have led to learning effects in the results, especially since the ordering of the tasks was not random for the classroom participants. It was, however, random for the Mechanical Turk participants.

9.2 External

External threats to validity deal with the generalizability of results. That is, do our results extend to other contexts?

9.2.1 Interaction of Selection and Treatment

For all studies, we may have used populations that are not representative of the populations we want to study. For RQ2(c), we did not collect demographic information for the human participants on Mechanical Turk. However, we did require them to pass a pretest including Java competency questions, which indicates that the participants are reasonably competent with Java. For RQ3 and RQ1, the participants in the study were Mechanical Turk workers and computer science undergraduate students. While only 8% of the participants had no

programming experience, these populations may not be representative of programmers who search for code. We should note, however, that 72% of the participants search for source code at least weekly. For RQ2(c), we selected eight of the original 22 questions for use in our study, based on the fact that our search returned some results (Section 7.2.2.2). This may have biased the questions toward those that are well-suited for our study. To combat this, we took a two-staged approach. First, when looking for 10 snippets with Satsy, we started with the constrained search where the type signatures matched. If not enough matches were returned (which was the case for seven of the 24 queries per Table 7.16), we relaxed the search to consider programs where TSp ⊃ TSLS (Section 5.3) to fill out the 10 slots. Second, when Google did not return source code in some of the top 10 results, we continued looking at search results until 10 results were obtained (Section 7.2.2.4). Regardless, the bias remains.

9.2.2 Interaction of Setting and Treatment

For the studies toward RQ2(a) and RQ2(c), we chose to compare Satsy to Google and Merobase, under the assumption that tools such as these are used for code search in practice. According to previous work and our own survey, Google is reported to be the most common tool used for code search, though code search engines like Merobase are also used. There could still be discrepancies between the self-reported tools and those actually used. For all studies related to RQ2 and RQ3, the selection of experimental tasks and questions may not be representative of tasks for which programmers would search for

answers. For RQ2(a-c), all tasks came from stackoverflow.com, so the assumption is that these are questions programmers actually ask. For RQ2(d-e), the Yahoo!

Pipes queries were built from artifacts gathered from the community; since someone went through the effort of creating those artifacts, the assumption is that these are representative of real tasks. For RQ3 and SQL, the questions were created by the researchers with the goal of language coverage; these may not be representative of actual SQL queries. Replication of this study with other tasks is needed to generalize further. For RQ2, we have evaluated various aspects of the approach in just two domains, Java and Yahoo! Pipes. While these domains are diverse, it remains to be shown if the effectiveness of this approach generalizes to other domains. For RQ3, we have evaluated the input/output query model in two domain-specific end-user programming languages, Yahoo! Pipes and SQL. The extent to which this query model extends to other programming languages is yet to be explored. For RQ2(a) and RQ2(b), the syntactic queries were taken from the titles of

questions on stackoverflow.com, and may differ from queries issued by programmers. We aimed to address this threat with RQ2(c) by using humans to generate the queries. Here, there is an additional threat to validity in that the questions posted on

stackoverflow.com represent problems for which, presumably, the question askers could not find answers. Thus, it may bias the results against Google to look for results that are hard to find, and these may not be representative of questions asked to Google in general. To combat this, in RQ2(c), we would skip over webpages that did not have source code until 10 snippets of source code were obtained. This way, the 10 snippets from Google could be compared to 10 snippets from Merobase and Satsy.

9.3 Construct

These threats to validity impact the ability to generalize the constructs of the experiment to the theory behind the experiment. That is, are we measuring what we intend to measure?

9.3.1 Mono-method Bias

For RQ2(a), in building the repository, we retrieved only one-line solutions. Yet, some queries may require more than a one-line solution, so those potential solutions were ignored. We addressed this threat in part with RQ2(c) by encoding entire methods, but this brings another threat to validity. Some methods may contain code that is 90% supported by our approach and that solves a query, but since we do not support the entire method, it is ignored. Thus, the repositories may be biased toward certain types of questions or problems.

For RQ2(c), the questions we gathered from stackoverflow.com were classified according to their type in an effort to get a broad set of questions. This classification was subjective and may have biased the final set of questions used for the experiment. For RQ3, the accuracy results for Tasks 8, 9, and 10 may be inflated by design. As described in Section 8.2.6.3, we did not collect results for those tasks performed in Mechanical Turk if the accuracy was lower than 50%. This was done to control for quality, but it also means that the accuracy of the tasks is biased. For these three tasks, we rejected the work of 12, 8, and 5 workers, respectively, based on accuracy. We use P@10 extensively as a metric to indicate relevance of search results. Yet, this might not capture other aspects of search, such as the cost of looking at irrelevant results. Further, the P@10 metric may have been inflated by some of our decisions related to Google snippet collection toward RQ2(c) (Section 7.2.2.4).
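For reference, P@10 is the fraction of the top ten returned snippets judged relevant. A minimal sketch of the computation, over hypothetical relevance judgments rather than the study's data:

// Precision at 10: relevant snippets among the first ten results, divided by ten.
static double precisionAt10(boolean[] relevantTop10) {
    int hits = 0;
    for (int i = 0; i < 10 && i < relevantTop10.length; i++) {
        if (relevantTop10[i]) hits++; // count a hit for each relevant result in the top ten
    }
    return hits / 10.0;
}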

9.3.2 Mono-operation Bias

For RQ3, the participants specified only the output in the tasks, whereas the goal was to evaluate the input/output query model. However, for the domains being evaluated, the input can easily be obtained (from a URL or database table), so having the participants specify only the output mimics how our approach could be used in practice. In other domains, the input may need to be user-generated, which could be more expensive. Toward RQ2 with Yahoo! Pipes, the specifications collected from the Pipes artifacts represent the state of the RSS feeds on a particular day. Gathering the RSS feeds on a different day or at a different time can yield a different input/output, which can lead to a different set of relevant results.

9.3.3 Hypothesis Guessing

In the study toward RQ2(c), the query generators were taken from the pool of graduate students and staff associated with the ESQuaReD software engineering research lab. Some of them may have been familiar with the goals of this research, so the input/output queries provided may have been of superior quality in an effort to support this research, or of inferior quality intended to sabotage it.

9.4 Conclusion

Threats to conclusion validity concern our ability to draw conclusions about the outcome of an experiment based on the treatment.

9.4.1 Random Irrelevancies in Experimental Setting

For those studies involving Mechanical Turk (RQ1, RQ2(c), and RQ3 ), we could not control the context in which the participants interacted with our study tasks. Thus, there could have been elements outside of the tasks, such as noise or distractions, that influenced the results.

9.4.2 Reliability of Measures

For the survey toward RQ1, the questions asked may have been poorly worded or may have been misunderstood by participants. Thus, we may not have captured actual information about participant search habits. Toward RQ3, the accuracy and timing data were highly dependent on the quality and clarity of the problem description. While the researchers tried to make the descriptions succinct, these measures may be unreliable. A similar threat exists for RQ2(c), where the problem descriptions may have been misunderstood by the query generators. This is why we used multiple query generators for each combination of search approach and problem factors in the experimental design (Section 7.2.1.2). Toward RQ2(c), the Mechanical Turk participants who judged the snippets were asked if each snippet was relevant to the problem, and if it solved the problem. Definitions were given for each of these words (Section 7.2.4.1), yet, the participants were judging the source code in a static view and did not need to work with the code in any context outside of a textual description of the programming task. Therefore, their judgement of relevance and solving may not be reliable. 190

9.4.3 Reliability of Treatment Implementation

Toward RQ3, we used two pools of participants, Mechanical Turk workers and students, which required two different implementations of the study. In the classroom implementation, two different researchers presented the study to the different classes on two different occasions. By IRB protocol, there was a script to be used when introducing the study to ensure consistency. As an aside, there are no significant differences in accuracy between the students and the Mechanical Turk participants at α = 0.10 on all tasks except for Pipes Task 3, in which the student results are significantly lower (p = 0.0212). We suspect this is a result of differences in instrumentation (i.e., paper vs. online) for the two groups. For RQ1, the search habits are self-reported and based on the participants' best recollection. These may not be representative of their behavior in practice. Actually observing participants searching for code is part of the future work to gain an understanding of what programmers use, have, and need when searching for source code.

Chapter 10

Related Work

This work involves the combination of two diverse software engineering processes: searching for source code and program analysis. In this chapter, we discuss related work within each of these categories.

10.1 Code Search and Repository Analysis

Searching for source code as a human process involves specifying queries, retrieving information from repositories, and delivering the search to the programmer. In this section, we discuss related work for each of these research areas.

10.1.1 Code Search

We have described an approach to code search that is semantic and uses input/output examples to define the queries, a repository of programs to define the set of potential matches, and an SMT solver to identify the matches. There are two areas of related work here, syntactic code search and semantic code search.

10.1.1.1 Syntactic Code Search

Recent studies have revealed that programmers typically use general search engines to find code for reuse [59]. More specialized code search engines (e.g., Koders, Krugle, ohloh) incorporate various filtering capabilities (e.g., languages, libraries) and program syntax into the query to better guide the matching process [59]. Other approaches add natural language processing to increase the potential matches [19, 43]. For example, Exemplar is a search tool that takes a natural-language query containing high-level concepts (e.g., MIME, data sets) as input, then uses information retrieval and program analysis techniques to retrieve applications that implement these concepts [19]. Our work is different in that we perform a more semantic search, but as we show (Section 7.1.2), both approaches are complementary and can be easily combined. Other code search approaches are targeted at a specific code search sub-problem [45, 55]. XSnippet allows programmers to search over a repository for sample code related to object instantiation [55]. The query model exploits the type and hierarchy of the provided object, and matches are based on mining code for instantiations of that object. By limiting the scope of the search to object instantiation, this search can generalize the query to return code for instantiations of an object with a similar type and interface. Another approach focuses on searching for source code with specific API usage [45]. By mining temporal specifications from code snippets, a query uses a partial program and an automata-based approach finds source code that uses the constructs in the partial program. Instead of a program explicitly providing a query, another specification model involves extracting code structures from a developer’s work environment. Recent work has used such structural characteristics to locate example code in a repository [25, 26]. Method signatures, return types, and fields are used to find potential code to reuse. To 193

increase the chance that the recommended code behaves according to the programmer’s needs, heuristics are used to match the code already written by the programmer against a repository of source code to better identify structurally matching code that might be relevant [26]. We may be able to use similar principles in a ranking algorithm for results that would consider the programmer context (Section 5.3.2.3). Instead of recommending source code, other research has sought to recommend code-related web pages [57]. Our search approach operates under a similar assumption, that programmers frequently search the web for source code. This related work has shown that when programmers search for code, nearly 14% of the webpages visited are actually re-visits; their goal is to help programmers find those revisited webpages faster. Rather, our work aims to help programmers find source code to reuse.

10.1.1.2 Semantic Code Search

Early work in semantic search required developers to write complex specifications of the desired behavior using first-order logic or specialized languages (e.g., [17, 48, 81]), which can be expensive to develop and error-prone. The most similar work to ours is that on specification matching by Zaremski and Wing [81, 80]. They propose an approach for specification matching for component retrieval, where the specifications could be defined by signature [80] or by pre/postconditions [81], and our work is most similar to the latter. In their approach the pre/postconditions for the desired component are written by hand, and their approach matches the specifications with the components using a theorem prover to show equivalence. In a relaxation process, they weaken or strengthen the matching on the pre/postconditions. In our approach, we weaken the encodings of the programs (analogous to components) in addition to the specifications. Similar to their approach, the components searched are first filtered by type signature, which is a technique we apply when searching our Java repository in 194 which inputs and outputs can have different datatypes (Section 6.4). Their definition of code reuse, adapt[ing] a component to fit its environments constraints, is similar to the definition of relevant code we use for calculating precision and recall (Section 8.1.4), corresponding to the copy/paste and modify behavior that nearly half the programmers we surveyed reported to do with useful code once found (Table 3.3). Building on this search model, Penix, et al. [48] showed that for component reuse, automated reasoning can be used to determine what changes are necessary to reuse a component and guide component adaptation, using an extended version of the lattice from the previous work to identify closely, but not necessarily exactly, matching components [81]. In the evaluation, the authors use a data-flow language abstraction with many similarities to the Yahoo! Pipes components supported in our approach. If no components match the specs of the query, the first step they take is to weaken the requirement for feature- equivalence, in essence finding the same operations performed on different data types. The second step is to reduce the number of query features the component contains, in essence finding partial component solutions for the problem, using automated reasoning to guide the relaxation. Other approaches to semantic search use queries in the form of algebraic specifica- tions. The component behavior is modeled using finite-state machines and a constraint solver is used to check the satisfiability of queries [17]. In this approach, the queries must be written directly in the specification language and the scope of applicability is small, finite programs. Additionally, the algebraic models are unable to identify components that might be close matches to a specification. The previous approaches to semantic search require a complex query model, which comes at a high cost. The cost of writing specifications can be reduced by using incomplete behavioral specifications, such as those provided by test cases (a form of input/output) [39, 50, 53], but these approaches require that the code be executed to 195

find matches. Further, executing test cases against the code will return only exact matches, missing many relevant matches that may simply have a slightly different signature (e.g., extra parameter). Reordering the search parameters is an approach that can find relevant code with a slightly different signature from that specified originally by the programmer [53].

10.1.2 Studies on Repositories

Our research is dependent on the existence of large and diverse code repositories from which source code can be matched and returned during search. Repositories provide a mechanism for programmers to share code and learn from the experiences of others, and tend to attract many participants to the communities. For example, Yahoo! Pipes, one of our targeted domains, has over 90,000 users [32] and over 100,000 artifacts [49]. The number of lines of code from open source projects that are available and can be searched by code search engines, including code in Java and SQL, is over ten billion [47]. Recent work has aimed to capture the diversity that exists in large repositories of code. Clone detection (e.g., [30]) is a means of identifying locations with similar code, but are highly dependent on the structure and does not characterize repository diversity. Yet, it can be observed through clone detection studies that there are more clones when the size of the clone is small [30]. Studies of syntactic diversity confirm that there is low diversity among smaller code fragments. A study with over 420 million lines of source code in Java, C, and C++ revealed that at least 90% of code is syntactically redundant when considering code with 20 tokens or less (where six tokens is approximately one line) [14]. Our approach could leverage work such as this to do a smart search over code in a repository, starting with syntactically diverse code 196

early in the search. Not surprisingly, such syntactic redundancy is not limited to these languages; in our study of 32,000 Yahoo! Pipes programs, we found that nearly 60% of the programs were identical to another in the repository when the field values were abstracted away [71]. Mining software repositories is an active area of research with the goal of extracting and measuring properties of large data sets. The main objective of research in this area is to explore trends across different resources and aggregate information. An example of this is to extract temporal properties regarding interface usage from open source Linux projects for the purpose of anomaly detection [20]. Another example uses language modeling to uncover programming practices by mining projects hosted on GitHub [1]. Related to queries, other work has defined a query language for source control repositories that allows programmers to ask questions about the repositories, such as looking for an author who modified a file after another author [24]. Contrastingly, we are interested in the problem of extracting existing information from large repositories for the purpose of reuse, rather than data aggregation.

10.1.3 IDE Programming Support

As we are suggesting relevant source code for a programmer to use and possibly integrate into their development context, our approach has some similarity to code completion. Some approaches to this consider the program code that has previously been written to recommend likely APIs to use in the present [54]. Similar also to syntactic search, other approaches extract keywords from the code and also consider those when making code suggestions [42]. Our approach also has the potential to be useful for, and to automate parts of, test-driven development [16]. Since the programmer is required to specify the behavior of the code in terms of inputs and outputs, using these as tests could identify code that behaves as the developer needs, automating part of the development process.

10.2 Program Analysis

In our work, we leverage several program analysis techniques, including symbolic execution and constraint solving. Additionally, our work has implications for code reuse and program synthesis. In this section, we discuss related work in these areas.

10.2.1 Verification and Validation

In this work, we have talked about how symbolic analysis is used to generate constraints that represent the program behavior, and how that representation is used in the search process. Symbolic execution [7, 8, 36] is a technique that executes code with symbolic, rather than concrete, values, and can generate such symbolic summaries of source code. We leverage these existing tools, when available, as described in Chapter 6. For Yahoo! Pipes, symbolic execution tools are not readily available, so we were responsible for all steps in the translation process [69]. For Java, however, tools like the symbolic execution extension [33, 34] to the Java PathFinder model checker [75] can generate symbolic summaries that we can use, but they are limited in the data types and libraries that are supported. String processing support in particular is still limited, since the problem of path feasibility for strings is undecidable in the general case, but decidable for bounded strings [5]. String constraint solvers built on top of SAT solvers have been used to automatically fix HTML generation errors [56]. In that work, only three string operations were considered for the repair task: insertion, deletion, and repair. These activities involve only pure string constraints. In JPF, inclusion of support for methods that have mixed integer and string constraints, such

as the indexOf method that returns an integer based on a property of a string, has recently been investigated [52]. For path identification in multi-path programs, we are dependent on such support when such methods are involved in the path conditions. In validation, constraint and SMT solvers have been used extensively for test case generation. Toward the goal of database generation for testing, reverse query processing takes a query and a result table as inputs and using a constraint solver, produces a database instance that could have produced the result [4]. Other work in test case generation for SQL queries has used SMT solvers to generate tables based on queries [74]. In our work, we do not generate database tables, but rather determine if a given query could have produced a specified result set (output) from specified input table(s). Under the assumption that building symbolic execution engines is tedious for new languages, recent work has sought to automatically synthesize symbolic execution engines from input/output examples [18]. This work has shown promise for x86 instructions, which are small and side-effect-free, and deal only with primitive data types, so limitations abound. Such an approach would be extremely useful for the generalizability of our approach, where new domains could be added and encoded by simply generating a symbolic execution engine from examples.

10.2.2 Program Generation and Program Synthesis

Previous work in the area of automated program generation relates to our work in that high-level specifications are used as the basis to derive programs [2]. Some approaches represent the program code as algebraic equations [3], which is similar to our approach of representing program code as constraints. Generalizing these specifications can allow for the automatic generation of Java code, in this instance,

from the equational models [3]. The goals of this previous work and our approach are similar, to end up with code that matches specifications, but the approaches are different. Instead of generating a program, as is done in the previous approaches, we search for existing code that matches or approximately matches the desired behavior. Closer to our work is program synthesis that uses solvers to derive a function that maps an input to an output (e.g., [21, 22, 23, 29, 60]). The domains of applicability include string processing [21], spreadsheet table transformations [23], a domain- specific language for geometric constructions [22], data structure manipulations [60], and loop-free bit manipulation programs [29]. Some of these approaches also utilize supplementary information like the code structure to help guide the solver [60]. The key difference between program synthesis and our approach is that we use the solver to find a match against real programs that have been encoded, while these synthesis efforts have to define a domain specific grammar that can be traversed exhaustively to generate a program that matches the programmers’ constraints. Tinker is a programming by demonstration tool built for the Lisp language [41] that also uses examples, but not solvers. The Tinker tool allows beginners to specify concrete data inputs in Lisp and specify how each example is handled using expressions. The tool records the manipulations and produces a program to automate the process. Another means for program synthesis, as opposed to specifying input and outputs, utilizes partial programs called sketches (e.g., [51, 62, 63, 64]). These approaches typically operate over small, low-level and finite programs like bit vector manipu- lation [64] and hardware circuits [51], but more recently have targeted higher-level applications such as concurrent data structures [63] and optimizing code [62]. Part of our future work involves partial programs as a supplementary specification for our search approach, which makes this area most similar to our future work. 200

10.2.3 Code Reuse

In the code reuse process, there are two primary activities: finding and integrating. Our approach focuses on finding, which is what we have evaluated, but it has potential to be useful with integration. For effective reuse, scope and dependencies must be understood for developers to effectively integrate code [15]. Some recent work assists programmers with integrating new code by matching it to structural properties in their development environment (e.g., method signature, return types) [9, 26]. These approaches guarantee structural matching, but the behavior of the integrated code may not be well understood. In our search, the user provides the input and output of their specified code, which contains some structural information in terms of types. After code is found to be a match, replacing the variables in a code snippet with variables from the developer’s context is straightforward, since that mapping is a necessary part of the search process. This would be a step toward integration. 201

Chapter 11

Discussion

In this section, we describe the implications of this work and of the studies we have conducted, and how each relates to our future work. This work involves the interlocking of two very different software engineering processes. One of them, the process of searching for source code, is a deeply human task, rooted in the context of a programmer looking for something, with success criteria judged by the programmer themselves. The other process, computing and using symbolic summaries of source code, is deep in the heart of state-of-the-art of program analysis techniques, leveraging symbolic execution and constraint solvers. While an unlikely combination, we have shown in this dissertation that these processes can be combined in such a way to benefit the programmer, and at the same time, push the state-of-the-art in code search. From the human perspective, our approach leverages examples to describe desired source code. In programming, examples can be very powerful. Programmers use examples in a formal sense by writing functional test cases, in describing a situation to a colleague, or in posting a question online (Section 3.3). In this work, we have proposed the use of examples for another purpose, searching for source code to reuse. 202

From the program analysis perspective, our approach leverages symbolic execution and constraint solvers. Yet, it is also limited by the current tools. Specifically, in the process of collecting constraints for multi-path programs, the limitations of state- of-the-art symbolic execution engines with respect to string processing restrict the expressiveness of the programs we can encode (Section 7.2.2.1). Further, the constraint solver runtime in the presence of large specifications as was the case with Yahoo! Pipes, may make this search approach impractical in practice (Section 8.1.4.2). As program analysis techniques and solving approaches, theories, and tools improve in speed and expressiveness, so does our approach to code search. In the introduction of this work (Chapter 1), we presented the code search problem as having three stages, querying, indexing, and matching. We have explored various aspects of each of these stages, to various degrees, with various languages, as a means of validating the research ideas outlined in the general approach (Chapter 5) given the instantiations in two languages, for Yahoo! Pipes and for Java (Chapter 6). In the next sections, we discuss the impact of the work presented in this dissertation and point to open questions and future directions related to our exploration.

11.1 Empirical Studies

The studies presented in this work form a first step toward evaluating the effectiveness and utility of the search. More studies are needed, particularly related to RQ2(c), where we used only eight of the 22 problems in the final implementation. Additionally, each snippet from each search was evaluated by only one participant on Mechanical Turk, so it would be useful to have replication to get an average relevance per snippet. The impact of abstraction and solver time on Satsy was evaluated only in the Yahoo! Pipes domain. Extending the implementation of Java to include abstraction

is part of future work, and will require an evaluation similar to the one in Yahoo! Pipes toward RQ2. Additionally, we identify more opportunities for empirical studies related to code search, including characterizing code search behavior, evaluating further the feasibility of input/output as a query model, exploring other query models, and delivering the search to the programmer.

11.1.1 Characterizing Code Search Behavior

Searching for source code is a human process that is performed frequently during programming tasks [59], as we have shown in Section 3.1. We have proposed the use of input/output examples as a query model for code search. Two of the studies presented in this work, toward RQ2(c) and RQ3, in addition to the user survey toward RQ1, aim to characterize search behavior and evaluate the feasibility of using examples for search queries. Our user survey showed that programmers frequently search for source code while programming (Section 3.1), corroborating the findings of previous work [59]. Of the 109 programmers surveyed, most reported that they search to find code that can be used for implementation ideas (67%) and/or to obtain code that can be reused with modification (48%, Table 3.3). With this high frequency of code search, it’s important for researchers and tool builders to focus on getting the programmer what they need with minimal time and effort. Understanding the real cost of a code search in the state-of-the-practice is important for researchers in code search. This would give a baseline to compare against when introducing a new technique. Observing programmers in practice while they search for source code could provide an indication of the cost and frequency of query reformulation, 204

the number of snippets explored, and the cost of exploring a snippet. This could give an overall measure of how much time and effort it takes to find relevant code. In our second Java study (Section 7.2), we used human participants to judge the relevance of source code to a programming task. Another future direction is to interview programmers immediately after a code search task about why they stopped the search, what they learned, and what source code they may intend to reuse. Alternative studies could include think-aloud methods during code search tasks or instrumenting the search tools. From such studies we could also learn whether programmers are open to using and learning from search results in different languages that could provide a similar solution (e.g., a C# solution to a Java question). This would give us an idea of how many languages need to be supported to make our approach to code search, and Satsy, useful in practice.

11.1.2 Feasibility of Input/Output Examples as Queries

In the study for RQ2(c), we show, indirectly, that graduate students in computer science (i.e., the query generators) are rather adept at generating input/output queries to describe simple programming tasks, specifically those in Table 7.8. We know this because the P@10 value for the Satsy search approach was 0.533 overall (Table 7.11), indicating that two things went right: 1) the queries were descriptive of the problem, and 2) Satsy had some relevant solutions in the repository. The former speaks to the ability of programmers to generate input/output queries in Java. Yet, Java is only one of the languages targeted by our approach (Chapter 4) and implementation (Chapter 6). In a controlled quasi-experiment toward RQ3, we explored the feasibility of using input/output examples in Yahoo! Pipes and SQL. Presumably, the participants in this experiment were less familiar with these domain-specific languages than the graduate students were with Java; yet, the participants were able to select output given an input and problem description with over 90% accuracy in just a couple of minutes (Section 8.2.6). Performing a similar study in Java is left to future work.

11.1.3 Alternative Query Models

Although we have shown that programmers can compose queries as examples, and that they do this when asking questions online (Section 3.3), we would like to explore some alternative and some assistive specification models to find the one (or combination) that best meets programmer needs when searching for source code. Presently, we have several ideas for how to make the query model more natural and/or expressive. For example:

1. Negation in Examples. There may be a need for programmers to specify examples that throw exceptions, return some error condition, or specify output values that should not be allowed given an input. Being able to recognize error conditions from source code, and then model them as error conditions, would then allow programmers to include generic error conditions in their specifications, instead of guessing how another programmer may have implemented their error conditions. Allowing negation of the output given an input would address the latter need.
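For instance, with a hypothetical negation operator ¬ and a generic error token, neither of which is part of the current query model, a specification such as

LS: {({-1}, {error}), ({4}, ¬{0})}

would ask for code that signals some error condition on the input -1 and returns anything other than 0 on the input 4.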

2. Partial Programs. Developers may know part of a solution, but not the whole solution, and may benefit from a search that allows them to provide what they have and receive a completed piece of source code as a result. This is similar to program synthesis by sketching [51, 62, 63, 64], except that in our approach, the programmer would not need to specify the “holes”. Instead, we could find existing code that has the same pieces as the original and behaves

as specified with input/output examples. A natural extension of this approach would be, then, to let programmers specify variables or expressions, perhaps sharing variables, rather than concrete values, as part of the input/output pairs.

3. Inferring Additional Examples. This approach would be assistive in that it would help programmers create more general specifications. The idea is to infer structured data from a programmer’s specification, possibly using a data abstraction like Topes [58], and then automatically generate other specifications with different formats but the same intention. For example, say a programmer gave the following input/output pair:

LS: {({”402−432−2345”}, {”402”})}

Recognizing the input as a phone number and the output as the area code, we could infer the following, additional specification:

LS0: {({”(913) 212−4545”}, {”913”}),({”+1 515 020 3113”}, {”515”})}

Thus, a new, stronger specification can be created on behalf of the programmer, yielding a more general solution from the search.

4. Inferring Patterns. Similar to the previous example, another assistive approach could infer patterns on the input. For example, given the following specification,

LS: {({"[email protected]"}, {"rizzo"}), ({"[email protected]"}, {"sandy"})}

We could infer a pattern of the following, where X and Y are symbolic values:

LS: {({X"@ridell.org"}, {X}), ({Y"@univ.edu"}, {Y})}

5. Other Program Properties. Any property that can be encoded for a program can also be used as a specification, such as the programming constructs used, the number of parameters to a function, or even performance. In some cases, as with the programming constructs used, this specification may act as a syntactic filter on the semantic results. For example, if a user gives an e-mail address with the alias susie as the input and susie as the output, specifying also the use of the function indexOf would prevent some coincidental matches, such as:

String scheme = uri.substring(0, 5);

which is coincidental unless all e-mail aliases have exactly five characters.

6. Conditional Specifications. Adding predicates to the lightweight specifications would allow the programmer to segment the input space and provide input/output pairs that should hold under various conditions. This means the programmer could provide symbolic inputs and outputs instead of specific test values. For example, consider a search for a program that returns the square root of a number, or zero if the number is negative. The programmer could specify the following (a sketch of the corresponding constraint appears after this list):

Predicate    input    output
x ≥ 0        x        sqrt(x)
x < 0        x        0

7. Generating Additional Examples. For multi-path programs, there may be an opportunity to create additional input/output examples based on paths that are not matched by a specification. That is, additional input/output pairs could be generated for each path q ∈ Q_P \ Q_sat.
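To make the third idea (inferring additional examples) concrete, the following is a minimal sketch in Java of how a recognizer might infer extra phone-number examples from a single input/output pair. The class, method, and regular expression are hypothetical; a real implementation would rely on a richer data abstraction such as Topes [58].

import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExampleInference {
    // If the pair looks like (phone number, area code), emit the same pair in two
    // other common phone-number formats; otherwise, infer nothing.
    static List<String[]> inferPhoneVariants(String input, String output) {
        Matcher m = Pattern.compile("(\\d{3})-(\\d{3})-(\\d{4})").matcher(input);
        if (!m.matches() || !m.group(1).equals(output)) {
            return Collections.emptyList();
        }
        String area = m.group(1), prefix = m.group(2), line = m.group(3);
        return Arrays.asList(
            new String[] { "(" + area + ") " + prefix + "-" + line, output },
            new String[] { "+1 " + area + " " + prefix + " " + line, output });
    }
}

For the pair above, this would add ("(402) 432-2345", "402") and ("+1 402 432 2345", "402") to the specification.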
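For the sixth idea (conditional specifications), the table above could be encoded directly as an implication over symbolic values. As a rough sketch, assuming output denotes the returned value, the two rows would contribute the constraint ((x ≥ 0) → (output = sqrt(x))) ∧ ((x < 0) → (output = 0)), which the solver would check against an encoded program in the same way as a concrete input/output pair.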

11.1.4 Delivering the Search to the Programmer

Delivery of the search to the user is another dimension of this approach not yet explored. This has implications for HCI, end-user computing, and software engineering research, as we seek to answer the following questions: how should the search be delivered to the programmer, and when in the development process is search most appropriately made available? We have built this approach on the principle that the user should not be exposed to the underlying representation used in the search (i.e., the constraints). This raises many questions and challenges, such as how to present abstractions to the programmer, how to explain why some snippets might be better than others, or how to guide the programmer to refine their query to find a suitable piece of code. These will all need to be addressed as this work progresses.

11.2 Extending the Approach

Our approach has been defined to be general enough to cover a broad set of languages, from Yahoo! Pipes to SQL to Java. Still, there are many directions that can be explored to extend the approach. Specifically, we describe how to use the solver to guide modifications to the specifications and how to stitch programs together in program composition. Other future directions include identifying additional abstractions for encoding, supporting additional languages, and improving matching, ranking, and performance.

11.2.1 Solver-Guided Specifications

When specifications are too strong, we may be able to provide some guidance to the programmer on how to appropriately tune their specifications. We propose to provide this guidance by extracting and using the unsat core. When an SMT solver returns unsat, it is because a satisfying model cannot be found, and thus some constraints could not be satisfied; extracting the unsat core from the solver identifies these constraints. Since the programmer owns the specification, guidance could be provided on how to tune it, for example by suggesting that the constraints from the specification that appear in the unsat core be relaxed or negated. When a specification is too weak, there may be many matches. The solver could also be used to generate sample inputs and outputs that differentiate between matches, helping programmers see how two pieces of source code differ.
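As a rough illustration of the mechanics, and not a feature of the current Satsy implementation, the sketch below uses the Z3 Java bindings to tag a hypothetical program constraint and a specification constraint with labels and then read back the unsat core when their combination cannot be satisfied; the constraint contents and label names are illustrative only.

import com.microsoft.z3.*;

public class UnsatCoreSketch {
    public static void main(String[] args) {
        Context ctx = new Context();
        Solver solver = ctx.mkSolver();
        IntExpr out = ctx.mkIntConst("out");

        // Hypothetical program constraint: the encoded snippet only returns non-negative values.
        solver.assertAndTrack(ctx.mkGe(out, ctx.mkInt(0)), ctx.mkBoolConst("program"));
        // Hypothetical specification constraint: the query asks for the output -1.
        solver.assertAndTrack(ctx.mkEq(out, ctx.mkInt(-1)), ctx.mkBoolConst("spec_output"));

        if (solver.check() == Status.UNSATISFIABLE) {
            // Labels in the core name the conflicting assertions; labels that belong to
            // the specification are the candidates to relax or negate.
            for (BoolExpr label : solver.getUnsatCore()) {
                System.out.println("in core: " + label);
            }
        }
        ctx.close();
    }
}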

11.2.2 Program Composition

Our current implementation returns single programs to the developer, yet there are some instances when no program matches enough of the specification. This could mean that the specification goes entirely unmatched, or that it goes partially unmatched. Even after abstraction, it could be that no single program matches the user specifications. However, composition of programs, where the inputs and outputs of multiple programs are chained together, could create the desired behavior. This could be particularly useful for repositories that serve small communities, such as a biology lab that uses Perl. This would promote reuse first, but also help scientists to grow the repository by creating new source code. In this way the repository would better fit the current needs of the community. Another approach could be useful when a program matches some, but not all, of a specification. Counter-example guided inductive synthesis [40] is designed to use concrete input/output pairs that are unmatched to guide modification to the program P , where P has some “holes” that have been identified. We may be able to use similar principles in augmenting source code that partially matches a specification.

11.2.3 Abstraction

Our current implementation supports abstraction on data types only (Section 5.2.3). However, other abstractions or models may be helpful to allow the search to index richer code. Section 5.2.3 describes possible abstraction strategies for the encoding, specifically using the data types. Finding appropriate abstraction levels (i.e., instance abstraction vs. abstracting all values) will require thorough experimentation and specialized heuristics. We also plan to explore other lattices that consider program properties, such as object and heap shape. These might be useful for finding relevant programs that are similar to a current implementation, and are part of our future work.

11.2.4 Language Support

Our current instantiation covers most of the Java String library, arithmetic operations, and list processing in Yahoo! Pipes (Chapter 6). We are able to provide solutions for many different programming tasks, as shown throughout the evaluation chapters. As shown in Chapter 7 and Chapter 8, we can use a combination of existing symbolic execution engines (e.g., JPF for Java) and our own implementations (i.e., for Yahoo! Pipes) to generate symbolic summaries of source code. Even with the current limitations of existing tools, we could create a repository with over 8,000 Java methods from open source projects (Section 7.2.2.1). Extending the language support would allow us to encode richer programs, provide solutions for more complex queries, and increase our potential to have an impact on the state-of-the-practice.

We have several directions planned for this. The next constructs in Java we plan to support are the null value, lists, loops, arrays, and file manipulation. Some of the query generators in the second Java study (Section 7.2) used null and arrays as parts of their input/output queries, and since those are not supported by our approach, the specifications could not be used. We also plan to extend this approach to concurrent programs, where we would model the concurrency using, for example, the number of processes and shared variables. Loop processing remains a challenge in symbolic execution [72], and our current implementation does not support looping constructs. However, loops are frequently used, and our lack of support has limited the types of questions we can answer and the size of our constraint repository. Modeling object creation and heap manipulation are two additional challenges of the indexing phase not yet addressed. Tackling these challenges will be important for the practicality of this approach.

The current implementation in Java is limited to a single output. However, we can think of many situations in which multiple outputs would be helpful; a simple example is decomposing a URL into domain and path. Extending the implementation to allow for multiple outputs is part of our future work. As we have pointed out, the investment for encoding a new language is quite high, but happens only once (Section 6.5). We are interested in exploring techniques that would ease the burden of extending our approach to new languages; one possible direction is to automatically synthesize symbolic execution engines from input/output examples [18]. We are also interested in identifying and extending support for our approach to new, promising domains. Specifically, we are interested in extending the approach to C, as it is the most common programming language today [73].

11.2.5 Performance

For our studies in Yahoo! Pipes, finding a relevant program from a specification could take on the order of minutes (Section 8.1.4). With the Java examples, which had much smaller specifications, the solver returned in under a second (Section 7.1.1.4). As the complexity of the programs and queries supported by our encodings increases, so does the solver time. Encoding a language fragment requires defining many uninterpreted functions, axioms, and data sorts so that an arbitrary piece of code in that language can be encoded. However, not every piece of code requires all of that information as part of the constraint system. Retaining only the functions and axioms required for encoding a particular piece of code could improve the performance of the solver. For example, supporting string encoding requires a substantial amount of set-up: several axioms and uninterpreted functions are declared, forming a preamble for every encoded program (Section 6.2). Not all of this information is required for every implementation. Specifically, a character set is required for C_P and for LS_enc; each is a subset of the actual character set that is included in the header used for solving. There may be a benefit to using a “minimum set” of character encodings by combining the characters required for C_P and LS_enc and pruning the duplicates. One drawback is that this would limit the solver when instantiating symbolic variables, which could lead Satsy to over-reject. If the performance gains are big enough, it might be worth the risk; evaluating this is left to future work.
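As a sketch of the pruning step, with hypothetical method and variable names rather than the actual preamble generation of Section 6.2, the “minimum set” could be computed by taking the union of the characters that appear in the program's string literals and in the specification strings, and declaring only those in the preamble:

import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class PreambleCharacters {
    // Collect the characters needed to encode one program against one specification,
    // so the preamble can declare a minimum character set instead of the full one.
    static Set<Character> minimumCharacterSet(List<String> programLiterals, List<String> specStrings) {
        Set<Character> needed = new TreeSet<>();
        for (String s : programLiterals) {
            for (char c : s.toCharArray()) needed.add(c);
        }
        for (String s : specStrings) {
            for (char c : s.toCharArray()) needed.add(c);
        }
        return needed;
    }
}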

11.2.6 Multiple I/O Satisfiability

Our current implementation supports the inclusion of multiple input/output pairs. This is useful so programmers can more clearly illustrate the behavior of their desired code. When |LS| > 1, we currently consider the I/O pairs independently when solving. Future work is to consider them simultaneously and identify matching variables. This requires bootstrapping the solver to identify the variable bindings that led to satisfiability, and then using that information in the linking phase for subsequent input/output pairs. For example, consider the code in Listing 11.1 and the following input/output pairs:

ls_1 ← ({2 : int, 3 : int}, false : boolean), and

ls_2 ← ({3 : int, 2 : int}, false : boolean).

Listing 11.1: Simultaneous Satisfiability Example

boolean between(int begin, int pos) {
    return (begin <= pos);
}

Considered independently, Solve(C_ls1 ∧ P_11.1) → SAT and Solve(C_ls2 ∧ P_11.1) → SAT. Using the satisfiable model, and referring to I_1 as a list rather than a set, begin ↦ I_1[2] and pos ↦ I_1[1]. For ls_2, this mapping is reversed, where begin ↦ I_2[1] and pos ↦ I_2[2]. In order for these input/output pairs to be considered simultaneously, the binding of the satisfiable model from Solve(C_ls1 ∧ P_11.1) needs to be retained and used as input when considering ls_2, for example, Solve(C_ls2 ∧ P_11.1 ∧ model_1). This treats each index in the input specification as the same variable in the program encoding, and then simultaneous satisfiability may be possible.

It must also be considered that the initial binding for the first input/output pair may not work for the second input/output pair, while an alternative binding would work for both. Consider an alternative ls_1 ← ({2 : int, 2 : int}, true : boolean). The binding begin ↦ I[2] and pos ↦ I[1] may result from Solve(C_ls1 ∧ P_11.1), but Solve(C_ls2 ∧ P_11.1 ∧ model_1) → unsat. However, swapping the bindings so that begin ↦ I[1] and pos ↦ I[2] works for both input/output pairs. In the event that the first model does not work, negations of the initial bindings can be added as additional constraints on the first solver invocation to force the second binding model, which then identifies Listing 11.1 as satisfying, simultaneously, LS.
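The following is a minimal sketch of this idea using the Z3 Java bindings; it is not Satsy's linking phase, and the encoding is simplified. The binding choice is modeled by a single boolean swap that decides which input position feeds which parameter of Listing 11.1; the value chosen when solving the first pair is read from the model and asserted when checking the second pair, and the binding is negated and re-solved if that check fails.

import com.microsoft.z3.*;

public class SimultaneousSatSketch {
    // Encodes Listing 11.1 against one input/output pair under the shared binding choice.
    static BoolExpr encodePair(Context ctx, BoolExpr swap, int v1, int v2, boolean expected) {
        IntExpr begin = (IntExpr) ctx.mkITE(swap, ctx.mkInt(v2), ctx.mkInt(v1));
        IntExpr pos   = (IntExpr) ctx.mkITE(swap, ctx.mkInt(v1), ctx.mkInt(v2));
        BoolExpr ret  = ctx.mkLe(begin, pos);        // P_11.1: return (begin <= pos)
        return ctx.mkEq(ret, ctx.mkBool(expected));  // the output must match the pair
    }

    public static void main(String[] args) {
        Context ctx = new Context();
        BoolExpr swap = ctx.mkBoolConst("swap");                 // shared variable binding
        BoolExpr pair1 = encodePair(ctx, swap, 2, 2, true);      // alternative ls_1 = ({2, 2}, true)
        BoolExpr pair2 = encodePair(ctx, swap, 3, 2, false);     // ls_2 = ({3, 2}, false)

        Solver first = ctx.mkSolver();
        first.add(pair1);
        Status result = Status.UNKNOWN;
        while (first.check() == Status.SATISFIABLE) {
            BoolExpr chosen = (BoolExpr) first.getModel().eval(swap, true);
            Solver second = ctx.mkSolver();
            second.add(pair2, ctx.mkEq(swap, chosen));           // retain ls_1's binding
            result = second.check();
            if (result == Status.SATISFIABLE) break;
            // The binding chosen for ls_1 does not work for ls_2: negate it and force
            // the first invocation to produce a different binding, as described above.
            first.add(ctx.mkNot(ctx.mkEq(swap, chosen)));
        }
        System.out.println("simultaneously satisfiable: " + (result == Status.SATISFIABLE));
        ctx.close();
    }
}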

11.2.7 Ranking

One of the most important factors in the effectiveness of a search approach, beyond having a rich pool of artifacts to search over and being able to identify results, is returning the most relevant results first. This process is called ranking, and it matters considerably. From the results of RQ2(c), and specifically the exploration in Section 7.2.5.3, we found that the overall performance of Satsy may have been handicapped by an inadequate ranking algorithm. By creating a ranking algorithm based on participant responses to the Satsy snippets, and regenerating the top 10 search results based on the new ranking, we predicted that our hypothetical ranking for Satsy could increase the overall P@10 metric from x̄_s· = 0.533 to x̄′_s· = 0.657, putting it nearly on par with Google's performance with x̄_g· = 0.675. Based on the findings in RQ2(c), we are very interested in finding a ranking algorithm that makes us competitive with the state-of-the-practice. Many factors likely play a role in the ideal ranking approach, such as the amount of the program or specification that is matched, the complexity of the source code, who owns the code, and so forth. Identifying the important factors in ranking requires much more exploration.
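As one possible starting point, and purely as a sketch with hypothetical fields and weights rather than the ranking evaluated in Section 7.2.5.3, a score could combine the fraction of the specification a snippet matches with a penalty for its complexity:

class RankedSnippet implements Comparable<RankedSnippet> {
    double specFractionMatched;  // portion of the input/output pairs the snippet satisfies
    int constructCount;          // rough proxy for the complexity of the source code

    double score() {
        return 0.8 * specFractionMatched - 0.2 * Math.log(1 + constructCount);
    }

    @Override
    public int compareTo(RankedSnippet other) {
        return Double.compare(other.score(), this.score()); // higher scores rank first
    }
}

Which factors belong in such a score, and with what weights, is exactly the open question.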

11.3 Other Applications

We have shown how to use our approach to code search for the sake of code search. However, the approach we have defined and implemented may have applicability in other areas of software engineering, such as tracking repository evolution, refactoring, automated patching, and test-driven development.

11.3.1 Repository Evolution

Our repository of encoded programs is built by scraping existing code, but ignores the fact that source code repositories grow as new code is committed and updated. One question we will explore in future work is how to handle changes in the source code used to build our repository. It may be useful to keep the old versions of the code, unless that code is buggy. This information may also be useful for ranking purposes, where code from more recent versions of projects would be reported first. It could additionally be used to summarize semantic changes to a repository over time.

11.3.2 Refactoring

In the Yahoo! Pipes implementation, part of the encoding step involves refactoring the programs for simplicity and size reduction. This support, however, has not been integrated into the encoder for Java. One future direction of work is to pre-process programs prior to encoding and remove smells like unreachable code, unused variables, and unnecessary abstraction.

11.3.3 Automated Patching

Another application of this approach could be to automate the generation of code patches. Say a program has a fault that has been isolated to a particular line (using, for example, fault-finding tools such as Tarantula [31]). Traces of passing test cases provide situations under which the code behaves correctly. The failing test case provides an input and an undesirable output. This information could be leveraged in searching for a patch to the program that would behave as specified by all test cases. Exploring this research direction is left for future work.
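As a small, hypothetical illustration in the input/output notation used throughout this dissertation: if a passing test observes f("402-432-2345") = "402", and the failing test supplies the input "(913) 212-4545" with the expected, but not observed, output "913", the derived query would be LS = {({"402-432-2345"}, {"402"}), ({"(913) 212-4545"}, {"913"})}, and any repository snippet that satisfies both pairs becomes a candidate patch for the faulty line.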

11.3.4 Test-driven Development

Another application is for test-driven development, where the tests would define the behavior of a unit to be searched. It could therefore be possible, with a suitably rich repository, to compose a program from its test suite. While this could expedite the development process, if the specifications are not carefully crafted, code that over-approximates the behavior of a specification could have unintended consequences. A programmer would likely need to be involved at every step of the process to act as a “code reviewer.”
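For instance, a unit test written before the implementation could be translated mechanically into an input/output query. In the hypothetical JUnit test below, the assertion yields the pair ({"402-432-2345"}, {"402"}), which could be submitted to Satsy even though the class under test does not exist yet.

import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class AreaCodeTest {
    @Test
    public void extractsAreaCode() {
        // The assertion is exactly an input/output example:
        // input "402-432-2345", expected output "402".
        // AreaCode is the not-yet-written unit the search should supply.
        assertEquals("402", AreaCode.extractAreaCode("402-432-2345"));
    }
}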

11.4 Conclusion

We have presented an approach to source code search that uses input/output examples as queries and searches a repository for source code that matches the defined behavior. The novelty of the approach lies in the use of an SMT solver for matching input/output examples to source code that meets the specification. This necessitates transforming the source code and the specifications into first-order logic so the solver can identify matches. In this work, we have motivated the need for better code search tools and techniques, formalized the definition of this approach, described its implementation in a tool, Satsy, and evaluated the approach across several dimensions.

To motivate the need for better code search, we surveyed over 100 programmers about their search habits, finding that code search is a common task and that current search tools are often inadequate. To justify the choice of an input/output model for queries, we explored questions asked by the community on stackoverflow.com and found that questions are frequently accompanied by input/output examples, indicating that programmers already think in this way when looking for help online. Additionally, we conducted a quasi-experiment that showed programmers can quickly and accurately compose specifications in this model.

Given that the input/output query model is reasonable, we also evaluated the effectiveness of our search using Satsy applied to two languages, Java and Yahoo! Pipes. In Java, we conducted a series of studies comparing Satsy to state-of-the-practice search engines, Google and Merobase, a code-specific search engine. Our results show that Satsy outperforms Merobase and is competitive with Google in terms of returning relevant results, where relevance was judged by 30 programmers. Further, we show that Satsy can operate on top of a syntactic search engine by filtering the results based on desired behavior. Doing so can reduce the amount of source code that must be evaluated by a programmer by 34%. In Yahoo! Pipes, we showed the impact of abstraction and the power of the approach to identify approximate matches when exact matches do not exist in a repository.

While this example-driven approach to semantic code search seems promising in the domains we studied, generality remains a goal that needs to be addressed in the context of richer programs and possibly with alternative specification models. These directions, and others, are outlined as part of the future work. We have presented the first work in applying SMT solvers to the problem of code search. This is just one step toward our ultimate goal of leveraging existing resources, such as source code repositories, to positively impact programmer productivity.

Appendix A

Encoding Details

The following provides the encoding details for the Java and Yahoo! Pipes implementations (see Chapter 6).

A.1 Java

The encoding logic for each of the supported string library methods is shown in Table A.1, Table A.2, Table A.3, Table A.4, Table A.5, and Table A.6. It is organized by method name, alphabetically. For each method call, the constraint purpose and definition are presented. When types for the variables are not stated, we used the following scheme: i, j, k, and from are integers, s, sub, t, and u are strings, c and d are characters, and b is a boolean. Each row in the tables is implemented as an assert statement in SMTLIB2 format, with one exception to handle disjunction. For the indexOf and lastIndexOf operators, the constraint that ensures s[i] = c needs a little more processing to protect against interference in the values. That is, for s.indexOf(c) = i in Table A.2, we want to force i to be the index of c in s when c appears in s, and otherwise let i = −1. This protects against the case where i = −1 would satisfy the first and third constraints even though c appears in s. So, the constraint (s.charAt(i) = c) ∨ (i = −1) is implemented as follows:

(assert
  (ite (exists ((j Int)) (and (>= j 0) (< j (length s)) (= (charAt s j) c)))
       (>= i 0)
       (= i (- 1))))

This means that if character c exists at some index j within the bounds of [0, length(s)), then i must be at least zero; otherwise it is set to −1. This presents a forcing condition that ensures i = −1 is not the default behavior. The processing is a little more complicated when dealing with strings as arguments, as in s.indexOf(sub) = i in Table A.2. The following ensures that i is the index of sub in s if sub appears in s:

(assert
  (ite (exists ((start Int))
         (and (>= start 0)
              (< start (length s))
              (forall ((check Int))
                (=> (and (< check (+ start (length sub))) (>= check start))
                    (= (charOf s check) (charOf sub (- check start)))))))
       (>= i 0)
       (= i (- 1))))

In this case, we are looking for a start index such that the substring sub exists in the receiving object s. If such an index exists, then i ≥ 0; otherwise i = −1.

Table A.1: Java Implementation Details (part 1)

s.charAt(i) = c
  definition: (charAt(s, i) = c)

s.concat(t) = u
  ensures the lengths match: (length(u) = (length(s) + length(t)))
  ensures that the content of u = s + t: (∀i (((i ≥ 0) ∧ (i < length(s))) → (charAt(u, i) = charAt(s, i))) ∧ (((i ≥ length(s)) ∧ (i < (length(s) + length(t)))) → (charAt(u, i) = charAt(t, (i − length(s))))))

s.contains(sub) = true
  ensure size constraints of strings: (length(s) ≥ length(sub))
  ensures sub is a subset of s: (∃j ((j ≥ 0) ∧ (j < (length(s) − length(sub))) ∧ (∀i (((i < length(sub)) ∧ (i ≥ 0)) → (charAt(s, (i + j)) = charAt(sub, i))))))

s.endsWith(sub) = true
  ensure length constraints of strings: (length(s) ≥ length(sub))
  ensures s ends with sub: (∃j ((j ≥ (length(s) − length(sub))) ∧ (j < length(s)) ∧ (∀i (((i < length(s)) ∧ (i ≥ j)) → (charAt(s, i) = charAt(sub, (i − j)))))))

s.equals(t) = b
  ensures s and t are the same object within the solver: ((s = t) ⇐⇒ b)
  axiom for equality: (((∀i ((i ≥ 0) ∧ (i < length(s))) → (charAt(s, i) = charAt(t, i))) ∧ (length(s) = length(t))) ⇐⇒ b)

Table A.2: Java Implementation Details (part 2)

s.equalsIgnoreCase(t) = true
  ensure length constraints of strings: (length(s) = length(t))
  converts to lowercase and checks content: (∀i ((i ≥ 0) ∧ (i < length(s))) → (toLower(charAt(s, i)) = toLower(charAt(t, i))))

s.indexOf(c) = i
  ensures s[i] = c: (s.charAt(i) = c) ∨ (i = −1)
  ensures i is the first occurrence of c: (∀j (((j ≥ 0) ∧ (j < i)) → (¬(charAt(s, j) = c))))
  sets bounds on i: ((i ≥ 0) ∧ (i < length(s))) ∨ (i = −1)

s.indexOf(c, from) = i
  ensures s[i] = c: (s.charAt(i) = c) ∨ (i = −1)
  ensures i is the first occurrence of c after from: (∀j (((j ≥ from) ∧ (j < i)) → (¬(charAt(s, j) = c))))
  sets bounds on i: ((i ≥ 0) ∧ (i < length(s))) ∨ (i = −1)
  sets bounds on from: ((from ≥ 0) ∧ (from < length(s)))
  relationship between bounds: (i ≥ from) ∨ (i = −1)

s.indexOf(sub) = i
  ensure length constraints: (length(s) ≥ (length(sub) + i)) ∨ (i = −1)
  ensure sub matches s starting at i: (∀j (((j ≥ i) ∧ (j < (i + length(sub)))) → (charAt(s, j) = charAt(sub, (j − i))))) ∨ (i = −1)
  ensure i is first occurrence of sub: (∀j (((j ≥ 0) ∧ (j < i)) → (∃k ((k ≥ j) ∧ (k < (j + length(sub))) ∧ (¬(charAt(s, k) = charAt(sub, (k − j))))))))
  sets bounds on i: ((i ≥ 0) ∧ (i < (length(s) − length(sub)))) ∨ (i = −1)

Table A.3: Java Implementation Details (part 3)

s.indexOf(sub, from) = i
  ensure length constraints: (length(s) ≥ (length(sub) + i)) ∨ (i = −1)
  ensures sub matches s starting at i: (∀j (((j ≥ i) ∧ (j < (i + length(sub)))) → (charAt(s, j) = charAt(sub, (j − i))))) ∨ (i = −1)
  ensures i is first occurrence of sub from from: (∀j (((j ≥ from) ∧ (j < i)) → (∃k ((k ≥ j) ∧ (k < (j + length(sub))) ∧ (¬(charAt(s, k) = charAt(sub, (k − j))))))))
  sets bounds on i: ((i ≥ 0) ∧ (i ≤ (length(s) − length(sub)))) ∨ (i = −1)
  sets bounds on from: ((from ≥ 0) ∧ (from < (length(s) − length(sub))))
  relationship between bounds: (from ≤ i) ∨ (i = −1)

s.isEmpty() = b
  length of s is zero: ((length(s) = 0) ⇐⇒ b)

s.lastIndexOf(c) = i
  ensures s[i] = c: (s.charAt(i) = c) ∨ (i = −1)
  ensures i is the last occurrence of c: (∀j (((j > i) ∧ (j < length(s))) → (¬(charAt(s, j) = c))))
  sets bounds on i: ((i ≥ 0) ∧ (i < length(s))) ∨ (i = −1)

s.lastIndexOf(c, from) = i
  ensures s[i] = c: (s.charAt(i) = c) ∨ (i = −1)
  ensures i is the last occurrence of c before from: (∀j (((j > i) ∧ (j < length(s))) → (¬(charAt(s, j) = c))))
  sets bounds on i: ((i ≥ 0) ∧ (i < length(s))) ∨ (i = −1)
  sets bounds on from: ((from ≥ 0) ∧ (from < length(s)))
  relationship between bounds: (from ≥ i)

Table A.4: Java Implementation Details (part 4)

s.lastIndexOf(sub) = i
  ensure i is the last index of sub in s: (∀j (((j ≥ i) ∧ (j < (i + length(sub)))) → (charAt(s, j) = charAt(sub, (j − i))))) ∨ (i = −1)
  ensure substring is last instance: (∀j (((j > i) ∧ (j < length(s) − 1)) → (∃k ((k ≥ j) ∧ (k < (j + length(sub))) ∧ (¬(charAt(s, k) = charAt(sub, (k − j))))))))
  sets bounds on i: ((i ≥ 0) ∧ (i < (length(s) − length(sub)))) ∨ (i = −1)

s.lastIndexOf(sub, from) = i
  ensures i is the last index of sub in s ending at or before from: (∀j (((j ≥ i) ∧ (j < (i + length(sub)))) → (charAt(s, j) = charAt(sub, (j − i))))) ∨ (i = −1)
  ensures substring is last instance from from: (∀j (((j > i) ∧ (j ≤ from)) → (∃k ((k ≥ j) ∧ (k < (j + length(sub))) ∧ (¬(charAt(s, k) = charAt(sub, (k − j))))))))
  sets bounds on i: ((i ≥ 0) ∧ (i < (length(s) − length(sub)))) ∨ (i = −1)
  sets bounds on from: ((from ≥ 0) ∧ (from < length(s)))
  relationship between bounds: (from ≥ i)

s.length() = i
  assert length: (length(s) = i)
  every index from 0 to i has a character: (∀j (((j ≥ 0) ∧ (j < i)) → (∃c (charAt(s, j) = c))))

s.replace(c, d) = t
  ensure length constraints: (length(s) = length(t))
  ensure all instances of c in s are replaced by d in t: (∀i ((i ≥ 0) ∧ (i < length(s))) → (((charAt(s, i) = c) ∧ (charAt(t, i) = d)) ∨ ((¬(charAt(s, i) = c)) ∧ (charAt(s, i) = charAt(t, i)))))

Table A.5: Java Implementation Details (part 5)

s.startsWith(sub) = true
  ensure size constraints of strings: (length(s) ≥ length(sub))
  ensures s starts with sub: (∀i (((i ≥ 0) ∧ (i < length(sub))) → (charAt(s, i) = charAt(sub, i))))

s.substring(i) = sub
  ensures s ends with sub starting at i: (∀k (((k ≥ i) ∧ (k < (i + length(sub)))) → (charAt(s, k) = charAt(sub, (k − i)))))
  sets bounds on i: (i ≥ 0) ∧ (i < length(sub))
  sets constraints on string sizes: (length(s) ≥ length(sub))

s.substring(i, j) = sub
  ensures s between i and j equals sub: (∀k (((k ≥ i) ∧ (k < j)) → (charAt(s, k) = charAt(sub, (k − i)))))
  sets bounds on i: (i ≥ 0) ∧ (i < j) ∧ (j < length(s))
  sets constraints on string sizes: (length(s) ≥ length(sub)) ∧ ((j − i) = length(sub))

s.toUpperCase() = t
  ensures the lengths match: (length(s) = length(t))
  make sure that all lower-case letters in s are upper case in t: (∀i ((i ≥ 0) ∧ (i < length(s))) → ((((charAt(s, i) = ‘a’) ∧ (charAt(t, i) = ‘A’)) ∨ ((charAt(s, i) = ‘b’) ∧ (charAt(t, i) = ‘B’)) ∨ ((charAt(s, i) = ‘c’) ∧ (charAt(t, i) = ‘C’)) ∨ ... ∨ ((charAt(s, i) = ‘z’) ∧ (charAt(t, i) = ‘Z’))) ∨ ((¬(charAt(s, i) = ‘a’)) ∧ (¬(charAt(s, i) = ‘b’)) ∧ (¬(charAt(s, i) = ‘c’)) ∧ ... ∧ (¬(charAt(s, i) = ‘z’)) ∧ (charAt(s, i) = charAt(t, i)))))

Table A.6: Java Implementation Details (part 6)

s.toLowerCase() = t
  ensures the lengths match: (length(s) = length(t))
  make sure that all upper-case letters in s are lower case in t: (∀i ((i ≥ 0) ∧ (i < length(s))) → ((((charAt(s, i) = ‘A’) ∧ (charAt(t, i) = ‘a’)) ∨ ((charAt(s, i) = ‘B’) ∧ (charAt(t, i) = ‘b’)) ∨ ((charAt(s, i) = ‘C’) ∧ (charAt(t, i) = ‘c’)) ∨ ... ∨ ((charAt(s, i) = ‘Z’) ∧ (charAt(t, i) = ‘z’))) ∨ ((¬(charAt(s, i) = ‘A’)) ∧ (¬(charAt(s, i) = ‘B’)) ∧ (¬(charAt(s, i) = ‘C’)) ∧ ... ∧ (¬(charAt(s, i) = ‘Z’)) ∧ (charAt(s, i) = charAt(t, i)))))

s.trim() = t
  ensure length requirement: (length(t) ≤ length(s))
  s and t are identical between the whitespaces; j is an offset, where t is identical to s between j and j + length(t): (∃j ((j ≥ 0) ∧ (j < (length(s) − length(t))) ∧ (∀i (((i ≥ 0) ∧ (i < length(t))) → (charAt(s, (i + j)) = charAt(t, i)))) ∧ (∀k ((((k ≥ 0) ∧ (k < j)) → (charAt(s, k) = ‘ ’)) ∧ (((k ≥ (j + length(t))) ∧ (k < length(s))) → (charAt(s, k) = ‘ ’))))))
  t has no leading or trailing whitespaces: ((length(t) > 0) → ((¬(charAt(t, 0) = ‘ ’)) ∧ (¬(charAt(t, (length(t) − 1)) = ‘ ’))))

A.2 Yahoo! Pipes

The specifics of the module encodings are shown in Table A.7, Table A.8, and Table A.9 for the supported constructs, listed alphabetically. The hasRec(L, r) and recordOf(L, i) functions are defined at the end of Table A.9. For each module, the constraint types (inclusion, exclusion, order, size) are shown, followed by the constraint definitions. When types for the variables are not stated, we used the following scheme: i, j, k, and l are integers, s is a string, r, r1, and r2 are records, and L, in, in1, in2, in3, in4, in5, out, out1, and out2 are lists. Each row in the tables is implemented as an assert statement in SMTLIB2 format.

Table A.7: Yahoo! Pipes Implementation Details (part 1)

fetch(in) = out
  inclusion: (∀r (hasRec(in, r) → hasRec(out, r)))
  exclusion: (∀r (hasRec(out, r) → hasRec(in, r)))
  order: (∀i (((i ≥ 0) ∧ (i < size(out))) → (recordOf(in, i) = recordOf(out, i))))
  size: (size(in) = size(out))

filter(in, title, equals, s) = out
  inclusion: (∀i (((i ≥ 0) ∧ (i < size(in))) → (((getTitle(recordOf(in, i)) = s) → hasRec(out, r)) ∧ ((¬(getTitle(recordOf(in, i)) = s)) → (¬(hasRec(out, r)))))))
  exclusion: (∀r (hasRec(out, r) → hasRec(in, r)))
  order: (∀r1, r2 ((hasRec(out, r1) ∧ hasRec(out, r2) ∧ (∃i, j ((i < j) ∧ (recordOf(out, i) = r1) ∧ (recordOf(out, j) = r2)))) → (∃k, l ((k < l) ∧ (recordOf(in, k) = r1) ∧ (recordOf(in, l) = r2)))))
  size: (size(in) ≥ size(out))

sort(in) = out
  inclusion: (∀r (hasRec(in, r) → hasRec(out, r)))
  exclusion: (∀r (hasRec(out, r) → hasRec(in, r)))
  order: (∀i, j (((i < j) ∧ (i ≥ 0) ∧ (j < size(out))) → (getPubdate(recordOf(out, i)) ≤ getPubdate(recordOf(out, j)))))
  size: (size(in) = size(out))

Table A.8: Yahoo! Pipes Implementation Details (part 2)

tail(in, n) = out
  inclusion: (∀i (((i ≥ max(0, size(in) − n)) ∧ (i < size(in))) → (recordOf(in, i) = recordOf(out, i))))
  exclusion: (∀r (hasRec(out, r) → hasRec(in, r)))
  order: (∀i (((i ≥ 0) ∧ (i < size(out))) → (recordOf(in, i + max(0, size(in) − n)) = recordOf(out, i))))
  size: (size(in) ≥ size(out))

truncate(in, n) = out
  inclusion: (∀i (((i ≥ 0) ∧ (i < min(n, size(in)))) → (recordOf(in, i) = recordOf(out, i))))
  exclusion: (∀r (hasRec(out, r) → hasRec(in, r)))
  order: (∀i (((i ≥ 0) ∧ (i < size(out))) → (recordOf(in, i) = recordOf(out, i))))
  size: (size(in) ≥ size(out))

union(in1, in2, in3, in4, in5) = out
  inclusion: (∀r ((hasRec(in1, r) ∨ hasRec(in2, r) ∨ hasRec(in3, r) ∨ hasRec(in4, r) ∨ hasRec(in5, r)) → hasRec(out, r)))
  exclusion: (∀r (hasRec(out, r) → (hasRec(in1, r) ∨ hasRec(in2, r) ∨ hasRec(in3, r) ∨ hasRec(in4, r) ∨ hasRec(in5, r))))
  order: (∀i ((((i ≥ 0) ∧ (i < size(in1))) → (recordOf(in1, i) = recordOf(out, i))) ∧ (((i ≥ size(in1)) ∧ (i < size(in1) + size(in2))) → (recordOf(in2, i − size(in1)) = recordOf(out, i))) ∧ ...))
  size: ((size(in1) + size(in2) + size(in3) + size(in4) + size(in5)) = size(out))

Table A.9: Yahoo! Pipes Implementation Details (part 3)

split(in) = {out1, out2}
  inclusion: (∀r (hasRec(in, r) → (hasRec(out1, r) ∧ hasRec(out2, r))))
  exclusion: (∀r ((hasRec(out1, r) ∨ hasRec(out2, r)) → hasRec(in, r)))
  order: (∀i (((i ≥ 0) ∧ (i < size(out1))) → ((recordOf(in, i) = recordOf(out1, i)) ∧ (recordOf(in, i) = recordOf(out2, i)))))
  size: ((size(in) = size(out1)) ∧ (size(in) = size(out2)))

output(in) = out
  inclusion: (∀r (hasRec(in, r) → hasRec(out, r)))
  exclusion: (∀r (hasRec(out, r) → hasRec(in, r)))
  order: (∀i (((i ≥ 0) ∧ (i < size(out))) → (recordOf(in, i) = recordOf(out, i))))
  size: (size(in) = size(out))

wire
  equality: (source = destination)
  size: (size(destination(in)) = size(source(out)))

hasRec(L, r)
  definition: ((hasRec(L, r) = true) ⇐⇒ (∃i ((i ≥ 0) ∧ (i < size(L)) ∧ (recordOf(L, i) = r))))

recordOf(L, i) = r
  definition: (recordOf(L, i) = r)

Appendix B

User Quasi-Experiment and Study on Search Habits

This section provides the user survey, quasi-experiment related to input/output specification, and IRB informed consent for a user study.

B.1 User Survey

Table B.1 reports the survey questions and answers used in the user survey described in Section 3.1. The Question column states the question as it appeared to the participants. The Format column describes the question format, indicating if it used radio buttons (R, single-selection), check-boxes (C, multiple-selection), or was open-ended (O). For example, Question 3 asks about the programming experience of participants, providing four options in a radio-button format.

Table B.1: Survey Questions

1. Are you at least 19 years old? [Format: R] (Yes, No)
2. What is your gender? [Format: R] (Male, Female, Prefer not to answer)
3. How long have you been programming? [Format: R] (0 years, less than 2 years, 2-5 years, 5+ years)
4. How often do you program? [Format: R] (never, daily, weekly, monthly)
5. What are your primary programming languages? [Format: O]
6. How often do you look for example source code by searching (e.g., using the web, an IDE search feature, grep, etc.)? [Format: R] (never, monthly, weekly, daily)
7. Where do you usually search for sample code? [Format: O]
8. How do you search for code? [Format: C, O] (function name, keyword, variable name, libraries used, other please specify)
9. How many snippets of code do you typically examine before finding something useful? [Format: R] (1, 2, 3, 4, 5, 6+)
10. What do you do with useful code once found? [Format: C, O] (Copy/paste as is, Copy/paste and modify, Just get ideas for implementation, Link to found code, Other please specify)

Format Key: R = radio button, C = checkbox, O = open-ended

B.2 Additional Survey Results

The results for some survey questions in Table B.1 were not reported in Section 3.1. Results for Question 5 are shown in Table B.2. Table 3.4 shows results for Question 7, and Tables B.3 and B.4 show results for Question 8.

Table B.2: Programming Languages Used by Participants (Question 5)

Language     Count   Percent
Java         63      57.8%
C            52      47.7%
C++          43      39.4%
SQL          20      18.3%
Excel        18      16.5%
Javascript   17      15.6%
C#           16      14.7%
29 others    ≤ 9     < 10%

Table B.3: Search Methods Used by Participants (Question 8)

Search Method    Count   Percent
Function Name    49      45%
Keyword          79      72%
Variable Name    18      17%
Libraries Used   38      35%
Other            25      23%

Table B.4: Search Methods Used by Participants - Other (Question 8)

Search Method                            Count
Functionality or problem description     14
Language                                 3
Algorithm                                3
Error                                    2
5 others                                 1

B.3 Qualification Test

To control for quality among the Mechanical Turk participants, we included a qualification test that contained competency questions related to Yahoo! Pipes and SQL. Participants were required to answer the competency questions with 50% or greater accuracy, meet the age requirement, and agree to the informed consent (Section B.4) in order to participate in the study. The following two questions were used to determine participant competency (scoring was performed as described in Section 8.2.5, and the answer is indicated with an x next to the relevant rows):

1. Select the rows with a price per unit equal to $0.50 from the table below, using the checkboxes next to each row.

Id Fruit Vendor Price

x 1 Apple John’s Produce $0.50

2 Apple John’s Produce $0.84

3 Apple Fran’s Fruit $1.50

4 Pear John’s Produce $0.98

5 Pear Pearly Pears $2.50

x 6 Grapes Eduardo’s Uvas $0.50

x 7 Bananas Pato’s Plantains $0.50

2. Select all items with “mountains” or “Mountains” in the title.

Title: Cool, dry after light rain; wet weekend? Description: After a light weekend rain, Orange County should dry out for the work week, though we could be looking at more rain for the ... Link: http://sciencedude.ocregister.com/2012/01/16/ cool-dry-after-light-rain-more-ahead/166663/ Date: Mon Jan 16 08:51:00 CST 2012

Title: As space junk falls, Russia hints at sabotage Description: A space probe stuck in orbit could fall back to Earth as soon as Sunday or Monday, though most experts say the chance that ... Link: http://sciencedude.ocregister.com/2012/01/13/ as-space-junk-falls-russia-hints-at-sabotage/166579/ Date: Fri Jan 13 11:49:00 CST 2012

x Title: Warm today; red-flag warning for mountains Description: Warm, dry winds will raise wildfire risk in the Santa Ana Mountains Friday, prompting a red-flag warning from the National Weather ... Link: http://sciencedude.blog.ocregister.com/2012/01/13/ warm-today-red-flag-warning-for-mountains/166607/ Date: Fri Jan 13 08:49:00 CST 2012

x Title: Red-flag warning for Santa Ana Mountains Description: The National Weather Service has issued a red-flag wildfire warning for the Santa Ana Mountains from midnight Thursday to 2 p.m. Friday ... Link: http://sciencedude.blog.ocregister.com/2012/01/12/ red-flag-fire-warning-for-santa-ana-mountains/166585/ Date: Thu Jan 12 18:02:00 CST 2012

Title: Ocean Institute to celebrate Marine Protected Areas Description: The Ocean Institute at Dana Point Harbor will join the celebration of the fourth Underwater Parks Day from 10 a.m. to 3 p.m. Jan. 21 ... Link: http://www.ocregister.com/news/marine-335239-aquarium-institute. html Date: Thu Jan 12 15:00:00 CST 2012

B.4 Informed Consent IRB# 20120212388 EX

The user survey was conducted with two groups of participants, students and Mechanical Turk participants. As such, two separate informed consent forms were required, but the deviations were small. The IRB protocol for this study includes the following terms and conditions, with differences marked:

Purpose of the Research

The goal of this research is to assess the feasibility of using input/output specifications to demonstrate desired program behavior.

Procedures

(Mechanical Turk) You must be at least 19 years of age to participate in this study. Participation in this study will involve the completion of one or more HITs that will ask you to answer a question regarding a list of items or a table. You will be asked multiple-choice questions about the pipes. This study will be conducted from your personal computer via the Mechanical Turk website. Each HIT should take no more than 1-5 minutes to complete, though the allotted time by Mechanical Turk allows 10 minutes per HIT. To complete all 10 of the HITs available to you should take no longer than one hour in total.

(Students) You have been asked to participate based on your enrollment in this course, and you must be at least 19 years of age to participate. The assessment will be administered here, today. Participation in this study is voluntary and will involve the completion of a survey regarding your programming practices and the completion of up to 10 tasks in which you will determine the results of a query on a table or list of data. This assessment will take no more than 15 minutes to complete.

Benefits and Risks

This assessment will provide you with practice in analyzing queries on tables and/or lists. There are no known risks or discomforts.

Compensation

(Mechanical Turk only) For each HIT completed (up to 10), you will receive compensation of $0.26, for a potential total compensation of $2.60. If you choose not to complete any HITs, you will not receive any monetary compensation.

Confidentiality

(Mechanical Turk) Your answers will be strictly confidential and will not be connected to your name, email, IP address, or any other identifying information.

(Students) Your answers will be strictly confidential and will not be connected to your name or made available to your instructor.

Use of Data

The data collected will be aggregated and used for research in the area of software engineering, so if you decide to participate we would appreciate your effort in completing the assigned task to the best of your abilities. The data will be summarized across all participants (i.e., we will not be able to identify you from your data). Results will be reported in conference proceedings, journals, and presentations at software engineering conferences and workshops.

Opportunity to Ask Questions

You may ask any questions concerning this research and have those questions answered before agreeing to participate in or during the study. The investigators for this research are Dr. Elbaum and Ms. Stolee; their respective email addresses are ’elbaum at cse.unl.edu’ and ’kstolee at cse.unl.edu’. If you have questions concerning your rights as a research subject that have not been answered by the investigator, or to report any concerns about the study, you may contact the University of Nebraska-Lincoln Institutional Review Board, telephone (402) 472-6965 or e-mail ’irb at unl.edu’.

Freedom to Withdraw

(Mechanical Turk) You are free to decide not to participate in this study or to withdraw at any time without adversely affecting your relationship with the investigators or the University of Nebraska. Your decision will not result in any loss of benefits to which you are otherwise entitled, but compensation is only provided once HITs have been completed.

(Students) You are free to opt-out at any time without adversely affecting your relationship with the investigators, your instructor, or UNL.

Consent, Right to Receive a Copy

Your acceptance certifies that you have decided to participate having read and understood the information presented. You may save this page for your records.

Affiliations

We are researchers in the ESQuaReD lab at the University of Nebraska-Lincoln.

Appendix C

Search Results Study

C.1 Qualification Test

The comprehension section of the qualification test used two code snippets and asked multiple-choice questions about the behavior of the source code. The questions were as follows:

Answer the next two questions considering the following Java method:

public static boolean foo(String s, char c) {
    if (s.indexOf(c) != -1) {
        return true;
    } else {
        return false;
    }
}

1. Identify the output (return value) for the following method invocation:

foo("hello", 'x');

a) True

b) False

c) Null

2. Identify the method invocation that yields the following output (return value):

true

a) foo("I drive", 'a');

b) foo("I drive a car", 'a');

c) foo("running is better", 'a');

Answer the next two questions considering the following Java method:

public static String bar(String s, int i) {
    if (s.length() > i) {
        return s.substring(0, i);
    } else {
        int diff = i - s.length();
        for (int j = 0; j < diff; j++) {
            s = s + '*';
        }
    }
    return s;
}

1. Identify the output (return value) for the following method invocation:

bar("hello", 7);

a) hello

b) hello**

c) he

d) Null

2. Identify the method invocation that yields the following output (return value):

he

a) bar("hello", 7);

b) bar("help", 7);

c) bar("hello", 3);

d) bar("help", 2);

C.2 Informed Consent IRB# 20130513551 EX

The IRB protocol for this study includes the following terms and conditions:

Purpose of the Research

The goal of this research is to measure the relevance of search results from three search techniques for program source code: descriptive, signature, and semantic.

Procedures

You must be at least 19 years of age to participate in this study. Participation in this study will involve the completion of one or more HITs that will ask you to answer a question about the relevance of source code to a programming task. You will be asked to answer multiple-choice questions about the source code and also to justify your answers. This study will be conducted from your personal computer via the Mechanical Turk website. Each HIT should take no more than 3-5 minutes to complete, though the allotted time by Mechanical Turk is 20 minutes per HIT. To complete all 22 of the HITs available to you should take no longer than two hours in total.

Risks and/or Discomforts

There are no known risks or discomforts.

Benefits

This assessment will provide you with practice in assessing the relevance of source code for a particular goal.

Compensation

For each HIT satisfactorily completed (up to 22), you will receive compensation of $0.20, for a potential total compensation of $4.40. If you choose not to complete any HITs, you will not receive any monetary compensation.

Confidentiality

Your answers will be strictly confidential and will not be connected to your name, email, IP address, or any other identifying information. The data will be summarized across all participants and results will be reported in conference proceedings, journals, dissertations, and presentations at software engineering conferences and workshops.

Opportunity to Ask Questions

You may ask any questions concerning this research and have those questions answered before agreeing to participate in or during the study. The investigators for this research are Dr. Elbaum and Ms. Stolee; their respective email addresses are ’elbaum at cse.unl.edu’ and ’kstolee at cse.unl.edu’. If you have questions concerning your rights as a research subject that have not been answered by the investigator, or to report any concerns about the study, you may contact the University of Nebraska-Lincoln Institutional Review Board, telephone (402) 472-6965 or e-mail ’irb at unl.edu’.

Freedom to Withdraw

You are free to decide not to participate in this study or to withdraw at any time without adversely affecting your relationship with the investigators or the University of Nebraska. Your decision will not result in any loss of benefits to which you are otherwise entitled, but compensation is only provided once HITs have been completed.

Consent, Right to Receive a Copy

Your acceptance certifies that you have decided to participate having read and understood the information presented. You may save this page for your records.

Affiliations

We are researchers in the ESQuaReD lab at the University of Nebraska-Lincoln.

C.3 R Analysis Details

In the R analysis, we read in a file that contains summary information, with a column for each of the factors, Problem and Search (both categorical), a column for the Generators (also categorical; these are the ’subjects’ in the design), and the P@10 and Q@10 metrics in the P and Q columns, respectively.

dataagg <- read.csv(file="phdthesis/dissertation/mturkResultsForRagg.csv", head=TRUE, sep=",")
summary(dataagg)
      Problem    Generator  Search      Snippet        P                Q
 eight   : 9   g      : 8   goog:24   Min.   :1   Min.   :0.1000   Min.   :0.0000
 eleven  : 9   e      : 7   mero:24   1st Qu.:1   1st Qu.:0.3000   1st Qu.:0.0000
 five    : 9   j      : 7   mine:24   Median :1   Median :0.5000   Median :0.2000
 seven   : 9   a      : 6             Mean   :1   Mean   :0.5278   Mean   :0.2153
 six     : 9   b      : 6             3rd Qu.:1   3rd Qu.:0.8000   3rd Qu.:0.3000
 thirteen: 9   c      : 6             Max.   :1   Max.   :1.0000   Max.   :0.8000
 (Other) :18   (Other):32
> summary(Generator)
 a  b  c  d  e  f  g  h  i  j  k  l
 6  6  6  6  7  6  8  6  6  7  5  3
> summary(Problem)
   eight   eleven     five    seven      six thirteen   twenty twentyone
       9        9        9        9        9        9        9        9

We observe a strong correlation between P@10 and Q@10, so in the ANOVA we consider P@10 as the response variable. Search and Problem are the factors in this two-factor factorial design.

> cor(P, Q, method="spearman")
[1] 0.718138
> summary(aov(P ~ Search*Problem, data=dataagg))
               Df  Sum Sq Mean Sq F value    Pr(>F)
Search          2 1.08111 0.54056 11.6527 7.496e-05 ***
Problem         7 0.52444 0.07492  1.6151    0.1540
Search:Problem 14 0.99222 0.07087  1.5278    0.1373
Residuals      48 2.22667 0.04639
---
Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

The hypothesis tests were conducted using the Mann-Whitney non-parametric test of means:

> wilcox.test(P[Search=="goog"], P[Search=="mine"])

Wilcoxon rank sum test with continuity correction

data:  P[Search == "goog"] and P[Search == "mine"]
W = 383.5, p-value = 0.04751
alternative hypothesis: true location shift is not equal to 0

> wilcox.test(P[Search=="goog"], P[Search=="mero"])

Wilcoxon rank sum test with continuity correction

data:  P[Search == "goog"] and P[Search == "mero"]
W = 470.5, p-value = 0.0001498
alternative hypothesis: true location shift is not equal to 0

> wilcox.test(P[Search=="mine"], P[Search=="mero"])

Wilcoxon rank sum test with continuity correction

data:  P[Search == "mine"] and P[Search == "mero"]
W = 391, p-value = 0.0326
alternative hypothesis: true location shift is not equal to 0

We took the average number of ’yes’ answers given over time for each of the basic tasks. The slope coefficient is very low, 0.01739, indicating no linear relationship between a participant's likelihood of answering ’yes’ and the order of the basic tasks.

> order = 1:24
> rel = c(13,15,15,19,15,18,18,12,12,17,13,19,21,16,15,15,18,16,17,16,12,17,17,14)
> plot(order, rel)
> r = lm(formula = rel ~ order)
> r

Call:
lm(formula = rel ~ order)

Coefficients:
(Intercept)        order
   15.61594      0.01739

Bibliography

[1] Miltiadis Allamanis and Charles Sutton. Mining source code repositories at massive scale using language modeling. In Proceedings of the 10th Working Conference on Mining Software Repositories, MSR ’13, pages 207–216, Piscataway, NJ, USA, 2013. IEEE Press.

[2] Robert Balzer. A 15 year perspective on automatic programming. IEEE Transactions on Software Engineering, 11(11):1257–1268, November 1985.

[3] Don Batory, Jacob Neal Sarvela, and Axel Rauschmayer. Scaling step-wise refinement. In Proceedings of the 25th International Conference on Software Engineering, ICSE ’03, pages 187–197, Washington, DC, USA, 2003. IEEE Computer Society.

[4] C. Binnig, D. Kossmann, and E. Lo. Reverse query processing. In IEEE 23rd International Conference on Data Engineering, 2007, pages 506–515. IEEE, 2007.

[5] Nikolaj Bjørner, Nikolai Tillmann, and Andrei Voronkov. Path feasibility analysis for string-manipulating programs. In Proceedings of the 15th International Conference on Tools and Algorithms for the Construction and Analysis of Systems: Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2009, TACAS '09, pages 307–321, Berlin, Heidelberg, 2009. Springer-Verlag.

[6] Cristian Cadar, Patrice Godefroid, Sarfraz Khurshid, Corina S. Păsăreanu, Koushik Sen, Nikolai Tillmann, and Willem Visser. Symbolic execution for software testing in practice: preliminary assessment. In Proceedings of the 33rd International Conference on Software Engineering, ICSE '11, pages 1066–1071, New York, NY, USA, 2011. ACM.

[7] Lori A. Clarke. A system to generate test data and symbolically execute programs. IEEE Transactions on Software Engineering, SE-2(3):215–222, September 1976.

[8] Lori A. Clarke and Debra J. Richardson. Applications of symbolic evaluation. J. Syst. Softw., 5(1):15–35, February 1985.

[9] Rylan Cottrell, Robert J. Walker, and Jörg Denzinger. Semi-automating small-scale source code reuse via structural correspondence. In International Symposium on Foundations of Software Engineering, pages 214–225, 2008.

[10] N. Craswell and D. Hawking. Overview of the TREC 2004 web track. In Proceedings of the 13th Text Retrieval Conference, NIST, pages 1–9, 2004.

[11] CVC3. http://www.cs.nyu.edu/acsys/cvc3/, June 2013.

[12] Allen Cypher, Daniel C. Halbert, David Kurlander, Henry Lieberman, David Maulsby, Brad A. Myers, and Alan Turransky. Watch what I do: programming by demonstration. MIT Press, Cambridge, MA, USA, 1993.

[13] Leonardo De Moura and Nikolaj Bjørner. Z3: An efficient SMT solver. In Tools and Algorithms for the Construction and Analysis of Systems, pages 337–340. Springer, 2008.

[14] Mark Gabel and Zhendong Su. A study of the uniqueness of source code. In Proceedings of the eighteenth ACM SIGSOFT international symposium on Foundations of software engineering, FSE '10, pages 147–156, New York, NY, USA, 2010. ACM.

[15] David Garlan, Robert Allen, and John Ockerbloom. Architectural mismatch: Why reuse is so hard. IEEE Software, 12:17–26, November 1995.

[16] Boby George and Laurie Williams. An initial investigation of test driven development in industry. In Proceedings of the 2003 ACM symposium on Applied computing, SAC '03, pages 1135–1139, New York, NY, USA, 2003. ACM.

[17] Carlo Ghezzi and Andrea Mocci. Behavior model based component search: an initial assessment. In ICSE Workshop on Search-driven Development: Users, Infrastructure, Tools and Evaluation, 2010.

[18] Patrice Godefroid and Ankur Taly. Automated synthesis of symbolic instruction encodings from i/o samples. In Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation, PLDI ’12, pages 441–452, New York, NY, USA, 2012. ACM.

[19] Mark Grechanik, Chen Fu, Qing Xie, Collin McMillan, Denys Poshyvanyk, and Chad Cumby. Exemplar: Executable examples archive. In International Conference on Software Engineering, pages 259–262, 2010.

[20] Natalie Gruska, Andrzej Wasylkowski, and Andreas Zeller. Learning from 6,000 projects: lightweight cross-project anomaly detection. In Proceedings of the 19th international symposium on Software testing and analysis, ISSTA ’10, pages 119–130, New York, NY, USA, 2010. ACM.

[21] Sumit Gulwani. Automating string processing in spreadsheets using input-output examples. In Proceedings of the 38th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages, POPL '11, pages 317–330, New York, NY, USA, 2011. ACM.

[22] Sumit Gulwani, Vijay Anand Korthikanti, and Ashish Tiwari. Synthesizing geometry constructions. In Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation, PLDI ’11, pages 50–61, New York, NY, USA, 2011. ACM.

[23] William R. Harris and Sumit Gulwani. Spreadsheet table transformations from examples. SIGPLAN Not., 46(6):317–328, June 2011.

[24] Abram Hindle and Daniel M. German. Scql: a formal model and a query language for source control repositories. SIGSOFT Softw. Eng. Notes, 30(4):1–5, May 2005.

[25] Reid Holmes and Gail C. Murphy. Using structural context to recommend source code examples. In International Conference on Software Engineering, ICSE ’05, pages 117–125, New York, NY, USA, 2005. ACM.

[26] Reid Holmes, Robert J. Walker, and Gail C. Murphy. Approximate structural context matching: An approach to recommend relevant examples. IEEE Transactions on Software Engineering, 32(12):952–970, December 2006.

[27] Jeff Huang and Efthimis N. Efthimiadis. Analyzing and evaluating query reformulation strategies in web search logs. In Proceedings of the 18th ACM conference on Information and knowledge management, CIKM '09, pages 77–86, New York, NY, USA, 2009. ACM.

[28] Java Language and Virtual Machine Specifications. http://docs.oracle.com/javase/specs/, June 2013.

[29] Susmit Jha, Sumit Gulwani, Sanjit A. Seshia, and Ashish Tiwari. Oracle-guided component-based program synthesis. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1, ICSE ’10, pages 215–224, New York, NY, USA, 2010. ACM.

[30] Lingxiao Jiang, Ghassan Misherghi, Zhendong Su, and Stephane Glondu. Deckard: Scalable and accurate tree-based detection of code clones. In International Conference on Software Engineering, ICSE ’07, pages 96–105, Washington, DC, USA, 2007. IEEE Computer Society.

[31] James A. Jones, Mary Jean Harrold, and John Stasko. Visualization of test information to assist fault localization. In Proceedings of the 24th International Conference on Software Engineering, ICSE ’02, pages 467–477, New York, NY, USA, 2002. ACM.

[32] M. Cameron Jones and Elizabeth F. Churchill. Conversations in Developer Communities: A Preliminary Analysis of the Yahoo! Pipes Community. In International Conference on Communities and Technologies, 2009.

[33] Symbolic PathFinder, December 2012.

[34] Sarfraz Khurshid, Corina S. Păsăreanu, and Willem Visser. Generalized symbolic execution for model checking and testing. In Proceedings of the 9th international conference on Tools and algorithms for the construction and analysis of systems, TACAS'03, pages 553–568, Berlin, Heidelberg, 2003. Springer-Verlag.

[35] Adam Kiezun, Vijay Ganesh, Philip J. Guo, Pieter Hooimeijer, and Michael D. Ernst. Hampi: a solver for string constraints. In International symposium on Software testing and analysis, ISSTA '09, pages 105–116, 2009.

[36] James C. King. Symbolic execution and program testing. Communications of the ACM, 19(7):385–394, July 1976.

[37] Koders.com. http://koders.com/, August 2012.

[38] A. N. Langville and C. D. Meyer. Google's PageRank and Beyond. Princeton University Press, 2006.

[39] Otávio Augusto Lazzarini Lemos, Sushil Krishna Bajracharya, and Joel Ossher. CodeGenie: a tool for test-driven source code search. In Companion to the 22nd ACM SIGPLAN conference on Object-oriented programming systems and applications companion, OOPSLA '07, pages 917–918, New York, NY, USA, 2007. ACM.

[40] Armando Solar-Lezama. Program Synthesis by Sketching. PhD thesis, EECS Department, University of California, Berkeley, 2008.

[41] Henry Lieberman. Watch What I Do, chapter Tinker: a programming by demonstration system for beginning programmers, pages 49–64. MIT Press, Cambridge, MA, USA, 1993.

[42] Greg Little and Robert C. Miller. Keyword programming in Java. In Proceedings of the twenty-second IEEE/ACM international conference on Automated software engineering, ASE ’07, pages 84–93, New York, NY, USA, 2007. ACM.

[43] Collin McMillan, Mark Grechanik, Denys Poshyvanyk, Qing Xie, and Chen Fu. Portfolio: finding relevant functions and their usage. In Proceedings of the 33rd International Conference on Software Engineering, ICSE ’11, pages 111–120, New York, NY, USA, 2011. ACM.

[44] Amazon Mechanical Turk. https://www.mturk.com/mturk/welcome, June 2010.

[45] Alon Mishne, Sharon Shoham, and Eran Yahav. Typestate-based semantic code search over partial programs. In Proceedings of the ACM international conference on Object oriented programming systems languages and applications, OOPSLA ’12, pages 997–1016, New York, NY, USA, 2012. ACM.

[46] F. Nielson, H. R. Nielson, and C. Hankin. Principles of Program Analysis. Springer, 2004.

[47] Ohloh Code Search. http://code.ohloh.net/, September 2012.

[48] John Penix and Perry Alexander. Efficient specification-based component retrieval. Automated Software Engineering, 6, April 1999.

[49] Yahoo! Pipes. http://pipes.yahoo.com/, June 2012.

[50] Andy Podgurski and Lynn Pierce. Retrieving reusable software by sampling behavior. ACM Transactions on Software Engineering and Methodology, 2, July 1993.

[51] Andreas Raabe and Rastislav Bodik. Synthesizing hardware from sketches. In Proceedings of the 46th Annual Design Automation Conference, DAC ’09, pages 623–624, New York, NY, USA, 2009. ACM.

[52] Gideon Redelinghuys, Willem Visser, and Jaco Geldenhuys. Symbolic execution of programs with strings. In Proceedings of the South African Institute for Computer Scientists and Information Technologists Conference, SAICSIT ’12, pages 139–148, New York, NY, USA, 2012. ACM.

[53] Steven P. Reiss. Semantics-based code search. In Proceedings of the International Conference on Software Engineering, pages 243–253, 2009.

[54] Romain Robbes and Michele Lanza. Improving code completion with program history. Automated Software Engineering, 17(2):181–212, June 2010.

[55] Naiyana Sahavechaphan and Kajal Claypool. XSnippet: mining for sample code. SIGPLAN Not., 41(10):413–430, October 2006.

[56] Hesam Samimi, Max Schäfer, Shay Artzi, Todd Millstein, Frank Tip, and Laurie Hendren. Automated repair of HTML generation errors in PHP applications using string constraint solving. In Proceedings of the 2012 International Conference on Software Engineering, ICSE 2012, pages 277–287, Piscataway, NJ, USA, 2012. IEEE Press.

[57] Nicholas Sawadsky, Gail C. Murphy, and Rahul Jiresal. Reverb: recommending code-related web pages. In Proceedings of the 2013 International Conference on Software Engineering, ICSE ’13, pages 812–821, Piscataway, NJ, USA, 2013. IEEE Press.

[58] Christopher Scaffidi, Brad Myers, and Mary Shaw. Topes: reusable abstractions for validating data. In Proceedings of the 30th international conference on Software engineering, ICSE ’08, pages 1–10, New York, NY, USA, 2008. ACM.

[59] Susan Elliott Sim, Medha Umarji, Sukanya Ratanotayanon, and Cristina V. Lopes. How well do search engines support code retrieval on the web? ACM Transactions on Software Engineering and Methodology, 21(1):4:1–4:25, December 2011.

[60] Rishabh Singh and Armando Solar-Lezama. Synthesizing data structure manipula- tions from storyboards. In Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering, ESEC/FSE ’11, pages 289–299, New York, NY, USA, 2011. ACM.

[61] SMT-LIB, December 2012.

[62] Armando Solar-Lezama, Gilad Arnold, Liviu Tancau, Rastislav Bodik, Vijay Saraswat, and Sanjit Seshia. Sketching stencils. SIGPLAN Not., 42(6):167–178, June 2007.

[63] Armando Solar-Lezama, Christopher Grant Jones, and Rastislav Bodik. Sketching concurrent data structures. In Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation, PLDI ’08, pages 136–148, New York, NY, USA, 2008. ACM.

[64] Armando Solar-Lezama, Liviu Tancau, Rastislav Bodik, Sanjit Seshia, and Vijay Saraswat. Combinatorial sketching for finite programs. In Proceedings of the 12th international conference on Architectural support for programming languages and operating systems, ASPLOS-XII, pages 404–415, New York, NY, USA, 2006. ACM.

[65] SQL Language Reference. http://docs.oracle.com/cd/B28359_01/server.111/b28286/toc.htm, June 2013.

[66] Kathryn T. Stolee. Analysis and Transformation of Pipe-like Web Mashups for End User Programmers. Master’s Thesis, University of Nebraska–Lincoln, June 2010.

[67] Kathryn T. Stolee. Finding suitable programs: Semantic search with incomplete and lightweight specifications. In International Conference on Software Engineering Doctoral Symposium, pages 1571–1574, 2012.

[68] Kathryn T. Stolee and Sebastian Elbaum. Refactoring pipe-like mashups for end-user programmers. In International Conference on Software Engineering, 2011.

[69] Kathryn T. Stolee and Sebastian Elbaum. Toward semantic search via SMT solver. In Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, FSE ’12, pages 25:1–25:4, New York, NY, USA, 2012. ACM.

[70] Kathryn T. Stolee, Sebastian Elbaum, and Daniel Dobos. Solving the search for source code. Technical report, University of Nebraska-Lincoln, November 2012.

[71] Kathryn T. Stolee, Sebastian Elbaum, and Anita Sarma. Discovering how end-user programmers and their communities use public repositories: a study on Yahoo! Pipes. Information and Software Technology, 2012.

[72] Jan Strejček and Marek Trtík. Abstracting path conditions. In Proceedings of the 2012 International Symposium on Software Testing and Analysis, ISSTA 2012, pages 155–165, New York, NY, USA, 2012. ACM.

[73] TIOBE Programming Community Index for June 2013. http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html, June 2013.

[74] M. Veanes, N. Tillmann, and J. De Halleux. Qex: Symbolic SQL query explorer. In Logic for Programming, Artificial Intelligence, and Reasoning, pages 425–446. Springer, 2010.

[75] Willem Visser, Klaus Havelund, Guillaume Brat, Seungjoon Park, and Flavio Lerda. Model checking programs. Automated Software Engineering, 10(2):203–232, April 2003.

[76] Ian H. Witten and Dan Mo. Watch What I Do, chapter TELS: learning text editing tasks from examples, pages 183–203. MIT Press, Cambridge, MA, USA, 1993.

[77] Jeffrey Wong and Jason Hong. What Do We “Mashup” When We Make Mashups? In International Workshop on End-User Software Engineering, 2008.

[78] The Yices SMT Solver. http://yices.csl.sri.com/, June 2013.

[79] Z3: Theorem Prover. http://research.microsoft.com/projects/z3/, November 2011.

[80] Amy Moormann Zaremski and Jeannette M. Wing. Signature matching: a tool for using software libraries. ACM Transactions on Software Engineering and Methodology, 4(2):146–170, April 1995.

[81] Amy Moormann Zaremski and Jeannette M. Wing. Specification matching of software components. ACM Transactions on Software Engineering and Methodology, 6, October 1997.