        

Aalto University publication series DOCTORAL DISSERTATIONS 17/2013

Automatic Algorithm Recognition Based on Programming Schemas and Beacons

A Supervised Machine Learning Classification Approach

Ahmad Taherkhani

A doctoral dissertation completed for the degree of Doctor of Science (Technology) (Doctor of Philosophy) to be defended, with the permission of the Aalto University School of Science, at a public examination held at the lecture hall T2 of the school on the 8th of March 2013 at 12 noon.

Aalto University
School of Science
Department of Computer Science and Engineering
Learning + Technology Group

Supervising professor
Professor Lauri Malmi

Thesis advisor
D.Sc. (Tech) Ari Korhonen

Preliminary examiners
Professor Jorma Sajaniemi, University of Eastern Finland, Finland
Dr. Colin Johnson, University of Kent, United Kingdom

Opponent
Professor Tapio Salakoski, University of Turku, Finland

Aalto University publication series DOCTORAL DISSERTATIONS 17/2013

© Ahmad Taherkhani

ISBN 978-952-60-4989-2 (printed)
ISBN 978-952-60-4990-8 (pdf)
ISSN-L 1799-4934
ISSN 1799-4934 (printed)
ISSN 1799-4942 (pdf)
http://urn.fi/URN:ISBN:978-952-60-4990-8

Unigrafia Oy Helsinki 2013

Finland

Publication orders (printed book): The dissertation can be read at http://lib.tkk.fi/Diss/

Aalto University, P.O. Box 11000, FI-00076 Aalto, www.aalto.fi

Author: Ahmad Taherkhani
Name of the doctoral dissertation: Automatic Algorithm Recognition Based on Programming Schemas and Beacons: A Supervised Machine Learning Classification Approach
Publisher: Aalto University School of Science
Unit: Department of Computer Science and Engineering
Series: Aalto University publication series DOCTORAL DISSERTATIONS 17/2013
Field of research: Software Systems
Manuscript submitted: 11 September 2012
Date of the defence: 8 March 2013
Permission to publish granted (date): 18 December 2012
Language: English
Type: Article dissertation (summary + original articles)

Abstract

In this thesis, we present techniques to recognize basic algorithms covered in computer science education from source code. The techniques use various software metrics, language constructs and other characteristics of source code, as well as the concept of schemas and beacons from program comprehension models. Schemas are high-level programming knowledge with detailed knowledge abstracted out. Beacons are statements that imply specific structures in a program. Moreover, roles of variables constitute an important part of the techniques. Roles are concepts that describe the behavior and usage of variables in a program. They were originally introduced to help novices learn programming.

We discuss two methods for algorithm recognition. The first one is a classification method based on a supervised machine learning technique. It uses the vectors of characteristics and beacons automatically computed from the algorithm implementations of a training set to learn which characteristics and beacons best describe each algorithm. Based on these observed instance-class pairs, the system assigns a class to each new input algorithm implementation according to its characteristics and beacons. We use the C4.5 algorithm to generate a decision tree that performs the task. In the second method, the schema detection method, algorithms are defined as schemas that exist in the knowledge base of the system. To identify an algorithm, the method searches the source code to detect schemas that correspond to those predefined schemas. Moreover, we present a method that combines these two methods: it first applies the schema detection method to extract algorithmic schemas from the given program and then applies the classification method to the schema parts only. This enhances the reliability of the classification method, as the characteristics and beacons are computed only from the algorithm implementation code, instead of the whole given program.

We discuss several empirical studies conducted to evaluate the performance of the methods. Some results are as follows: evaluated by leave-one-out cross-validation, the estimated classification accuracy for sorting algorithms is 98.1%, for searching, heap, basic tree traversal and graph algorithms 97.3%, and for the combined method (on sorting algorithms and their variations from real student submissions) 97.0%. For the schema detection method, the accuracy is 88.3% and 94.1%, respectively. In addition, we present a study for categorizing student-implemented sorting algorithms and their variations in order to find problematic solutions that would allow us to give feedback on them. We also explain how these variations can be automatically recognized.

Keywords: algorithm recognition, schema detection, beacons, roles of variables, program comprehension, automated assessment
ISBN (printed): 978-952-60-4989-2
ISBN (pdf): 978-952-60-4990-8
ISSN-L: 1799-4934
ISSN (printed): 1799-4934
ISSN (pdf): 1799-4942
Location of publisher: Espoo
Location of printing: Helsinki
Year: 2013
Pages: 254
urn: http://urn.fi/URN:ISBN:978-952-60-4990-8

To the memory of my brother and my father-in-law who both were a big part of my life.

Preface

Back in 2007, after working in industry for several years, I contacted Professor Lauri Malmi, the supervisor of this dissertation, about the possibility of working in his research group to complete my master’s thesis. He hired me for a perfect position, funded by the Academy of Finland. The project was much larger than a master’s thesis, and after completing my thesis he offered me an opportunity to continue working on it to pursue my doctoral degree. Having always wanted to do a PhD and being interested in the topic, I asked for a leave of absence from my work at Accenture and took the opportunity. It was Lauri who made it all possible: he gave me an opportunity to become a part of the computing education community, allowed me to focus almost full-time on my research and provided me with excellent guidance throughout the project. So thank you, Lauri, I really appreciate all of this.

Doctor Ari Korhonen provided great ideas and insightful suggestions which are reflected throughout this thesis. His friendly discussions, can-do attitude and critical thinking always helped me forward. I also thank him for showing me how to be a good instructor.

I thank all the members of our research group, the Learning + Technology Group (LeTech), for providing a friendly environment in which to work and for their comments during our weekly meetings. I especially thank Doctor Jan Lönnberg for his help.

I am grateful to the pre-examiners, Professor Jorma Sajaniemi and Doctor Colin Johnson, for taking the time to examine my thesis and for their valuable observations and suggestions. Sajaniemi also read my Licentiate’s thesis and one of my journal publications and provided constructive feedback. It is also an honor to have Professor Tapio Salakoski as my opponent.

I would also like to thank the participants of the international

conferences, who provided comments and feedback on my presentations. The same goes for the international working group that I had the chance to participate in.

I thank my parents for giving me the opportunity for an education and an abiding respect for learning. Last, and most importantly, I would like to thank my kids, Ava and Arad, for being so special, and my wife Elmira for her enormous patience, support and continuing encouragement in whatever I decide to do.

Helsinki, December 28, 2012,

Ahmad Taherkhani

Contents

Preface iii

Contents v

List of Publications ix

Author’s Contribution xi

1. Introduction 1
   1.1 Motivation 1
   1.2 Research Questions 2
   1.3 Structure of the Thesis 5

2. Algorithm Recognition and Related Work 7
   2.1 Algorithm Recognition (AR) 7
   2.2 Related Work 8
       2.2.1 Program Comprehension 8
       2.2.2 Clone Detection 14
       2.2.3 Program Similarity Evaluation Techniques 16
       2.2.4 Reverse Engineering Techniques 17
       2.2.5 Roles of Variables 18

3. Program Comprehension and Roles of Variables, a Theoretical Background 19
   3.1 Schemas and Beacons 19
   3.2 Roles of Variables 21
       3.2.1 An Example 22
   3.3 The Link Between RoV and PC 23

4. Decision Tree Classifiers and the C4.5 Algorithm 27
   4.1 Decision Tree Classifiers in General 27
   4.2 The C4.5 Decision Tree Classifier 30

5. Overall Process and Common Characteristics 31
   5.1 Overall Process 31
   5.2 Common Characteristics 35
       5.2.1 Computing Characteristics 36
   5.3 The Tool for Detecting Roles of Variables 37

6. Schemas and Beacons for an Analyzed Set of Algorithms 41
   6.1 Algorithmic Schemas 41
       6.1.1 Schemas for Sorting Algorithms 41
       6.1.2 Schemas for Searching, Heap, Basic Tree Traversal and Graph Algorithms 43
       6.1.3 Detecting Schemas 44
   6.2 Beacons 45
       6.2.1 Beacons for Sorting Algorithms 45
       6.2.2 Beacons for Searching, Heap, Basic Tree Traversal and Graph Algorithms 47

7. Empirical Studies and Results 51
   7.1 An Overview of the Data Sets and Empirical Studies 51
       7.1.1 The Data Sets 51
       7.1.2 The Publications and Empirical Studies 53
   7.2 Manual Analysis and the Classification Tree Constructed by the C4.5 Algorithm 55
       7.2.1 Manual Analysis 55
       7.2.2 The Classification Tree Constructed by the C4.5 Algorithm 56
   7.3 Students’ Implementations, a Categorization and Automatic Recognition 57
       7.3.1 Categorizing the Variations 58
       7.3.2 Automatic Recognition 59
   7.4 Using the SDM and CLM for Recognizing Sorting Algorithms and Their Variations 60
       7.4.1 The SDM 60
       7.4.2 The CSC 61
   7.5 Using the CSC for Recognizing Algorithms from Other Fields 61
   7.6 Building a Classification Tree of Sorting and Other Algorithms for Recognizing Students’ Sorting Algorithm Implementations 64
       7.6.1 The Decision Tree and Classification Accuracy 64
       7.6.2 Recognizing the Students’ Implementations 64

8. Discussion and Conclusions 69
   8.1 Discussion 69
       8.1.1 Applications of the Method 69
       8.1.2 Our Methods and Other Research Fields 71
   8.2 Research Questions Revisited and Future Work 72
   8.3 Validity 75
       8.3.1 Internal Validity 75
       8.3.2 External Validity 77

Bibliography 79

A. The C4.5 Decision Tree Classifier 89
   A.1 Finding the Best Attribute 89
   A.2 Finding the Right Size 91

B. Pseudo-Code for Sorting, Searching, Heap, Basic Tree Traversal and Graph Algorithms 93
   B.1 Sorting Algorithms 93
   B.2 Binary Search Algorithms 95
   B.3 Depth First Search Algorithm 96
   B.4 Tree Traversal Algorithms 97
   B.5 Heap Algorithms 97
   B.6 Graph Algorithms 99

Errata 101

Publications 103

List of Publications

This thesis consists of an overview and of the following publications which are referred to in the text by their Roman numerals.

I Ahmad Taherkhani, Ari Korhonen and Lauri Malmi. Recognizing Algorithms Using Language Constructs, Software Metrics and Roles of Variables: An Experiment with Sorting Algorithms. The Computer Journal, Volume 54 Issue 7, pages 1049–1066, June 2011.

II Ahmad Taherkhani. Using Decision Tree Classifiers in Source Code Analysis to Recognize Algorithms: An Experiment with Sorting Algorithms. The Computer Journal, Volume 54 Issue 11, pages 1845–1860, October 2011.

III Ahmad Taherkhani, Ari Korhonen and Lauri Malmi. Categorizing Variations of Student-Implemented Sorting Algorithms. Computer Science Education, Volume 22 Issue 2, pages 109–138, June 2012.

IV Ahmad Taherkhani, Ari Korhonen and Lauri Malmi. Automatic Recognition of Students’ Sorting Algorithm Implementations in a Data Structures and Algorithms Course. In Proceedings of the 12th Koli Calling International Conference on Computing Education Research, Tahko, Finland, 10 pages, 15–18 November 2012, to appear.

V Ahmad Taherkhani. Automatic Algorithm Recognition Based on Programming Schemas. In Proceedings of the 23rd Annual Workshop on the Psychology of Programming Interest Group (PPIG ’11), University of

York, UK, 12 pages, 6–8 September 2011.

VI Ahmad Taherkhani and Lauri Malmi. Beacon- and Schema-Based Method for Recognizing Algorithms from Students’ Source Code. Submitted to Journal of Educational Data Mining, 23 pages, submitted in June 2012.

VII Ahmad Taherkhani. Schema Detection and Beacon-Based Classification for Algorithm Recognition. In Proceedings of the 24th Annual Workshop on the Psychology of Programming Interest Group (PPIG ’12), London Metropolitan University, UK, 12 pages, 21–23 November 2012.

Author’s Contribution

Publication I: “Recognizing Algorithms Using Language Constructs, Software Metrics and Roles of Variables: An Experiment with Sorting Algorithms”

Taherkhani carried out the literature survey, formulated the method, designed and implemented the system, collected and analyzed the data, performed the evaluation and wrote most of the paper, with Korhonen and Malmi providing feedback and suggestions throughout the process.

Publication II: “Using Decision Tree Classifiers in Source Code Analysis to Recognize Algorithms: An Experiment with Sorting Algorithms”

Taherkhani is the sole author of this paper.

Publication III: “Categorizing Variations of Student-Implemented Sorting Algorithms”

Taherkhani carried out the literature survey, analyzed the data, performed the categorization and wrote most of the paper, with Korhonen and Malmi providing feedback and suggestions throughout the process.

Publication IV: “Automatic Recognition of Students’ Sorting Algorithm Implementations in a Data Structures and Algorithms Course”

Taherkhani analyzed the data, performed the evaluation and wrote most of the paper, with Korhonen and Malmi providing feedback and suggestions throughout the process.

Publication V: “Automatic Algorithm Recognition Based on Programming Schemas”

Taherkhani is the sole author of this paper.

Publication VI: “Beacon- and Schema-Based Method for Recognizing Algorithms from Students’ Source Code”

Taherkhani formulated the method, designed and implemented the system, collected and analyzed the data, performed the evaluation and wrote most of the paper, with Malmi providing feedback and suggestions throughout the process.

Publication VII: “Schema Detection and Beacon-Based Classification for Algorithm Recognition”

Taherkhani is the sole author of this paper.

1. Introduction

1.1 Motivation

Data structures and algorithms are central topics in programming education. Basic programming education requires that students solve a large number of programming exercises. To help teachers assess students’ work, especially in large courses, a number of automatic assessment tools have been developed, including Boss [43], CourseMarker [37] and WebCAT [25]. Ala-Mutka [1] listed topics which could be analyzed using the existing tools. These included 1) functionality, 2) efficiency, 3) testing skills, 4) special features like memory management, 5) coding style, 6) programming errors, 7) software metrics, 8) program design, 9) use of specific language features, and 10) plagiarism. A more recent survey by Ihantola et al. [38] shows new activities in the field, such as integration of automatic assessment tools and learning management systems, more sophisticated ways to evaluate program functionality, and integration of manual and automatic assessment. However, in these two comprehensive surveys, no tools were reported that could automatically analyze what kinds of algorithms students use in their programs and how they have implemented them. For example, an automatic assessment tool verifies the correctness of a sorting algorithm by examining whether the program produces the requested correct output. However, the tool cannot easily and reliably assess whether students have actually used the requested algorithm, or give feedback on their implementations.¹ This is the main motivation of the work presented

¹ A simple approach would be to check some intermediate states, but this is very cumbersome and unreliable, as students may very well implement the basic algorithm in slightly different ways, for example, by taking the pivot item from the left or the right end in Quicksort.

in this thesis: developing methods that can automatically recognize different types of basic algorithms covered in computer science education. We call such methods Algorithm Recognition (AR). Such methods can be applied to many other problems as well. For example, all the following problems share the common task of recognizing algorithms or parts of source code and thus can apply AR methods: source code optimization [65] (tuning existing algorithms or replacing them with more efficient ones), clone recognition [4, 60] (recognizing and removing clones as an essential part of code refactoring), software maintenance (especially maintaining large legacy code with insufficient or non-existent documentation), and program translation via abstraction and reimplementation [100] (a source-to-source translation approach, which involves an abstract understanding of what the target program does).

1.2 Research Questions

Algorithms are well-defined computational procedures that take some value(s) as input and produce some value(s) as output [16] in a finite amount of time. Algorithms consist of specific instructions that should be performed in a specific order to achieve a specific goal. Programming schemas are high-level programming knowledge on how to solve a particular problem [21]. Beacons, on the other hand, are highly informative statements that imply specific structures in a program [10]. This thesis introduces a static method for recognizing algorithms from Java source code based on the concept of programming schemas (which in this thesis we also call algorithmic schemas or just schemas) and beacons. Algorithms have specific functionalities, and in order to achieve these functionalities, a programmer should use specific abstract patterns (schemas) and elements (beacons) when implementing algorithms. For example, Roles of Variables (RoV) [85], which we consider as algorithm-specific characteristics and beacons, explicate the ways in which variables are used in computer programs and provide specific patterns for how their values are updated. Roles are concepts that associate variables with their behavior and purpose in a program. To implement an algorithm, a programmer uses a set of variables with particular roles to achieve the particular functionality in question. In this thesis, we investigate several research questions. The main research question is:

1. How could we automatically recognize basic algorithms and their variations from source code?

By basic algorithms we mean the algorithms that are commonly introduced in learning resources and in data structures and algorithms courses as solutions to the classical algorithmic problems, such as sorting algorithms, searching algorithms, graph algorithms, etc. To address this question, we present different approaches and divide the main research question into the following related questions. We examine the applicability of programming schemas and algorithm-specific characteristics and beacons in AR. We investigate whether basic algorithms can be recognized by analyzing and extracting high-level schemas from the implementation code. Likewise, we examine the usefulness of characteristics and beacons in AR. Thus, two of our research questions are:

2. Can algorithmic characteristics and beacons be utilized in the AR process, and how?

3. Can programming schemas facilitate automatic AR? How can we implement a method based on schemas?

To answer these questions, we use a set of characteristics containing various metrics that are selected based on literature overviews. These characteristics are computed for each given algorithm implementation. Moreover, by analyzing how algorithms work, we discern a set of beacons that characterize the function and principle of each algorithm. We utilize these characteristics and beacons in the AR process. With regard to programming schemas, we develop a method that extracts schemas from a given algorithm implementation and identifies the implementation by matching the extracted schemas against a predefined set of schemas and subschemas stored in a knowledge base. We call this method the Schema Detection Method and use the abbreviation SDM for it. Furthermore, we specifically study the usefulness of RoV as beacons in recognizing basic algorithms. We aim to discover how distinctive RoV are in identifying algorithmic patterns and how valuable they are in the automatic AR process. In this regard, the research question is:

4. How applicable and useful are RoV in recognizing basic algorithms?

As discussed above, we analyze basic algorithms to find a set of distinctive and algorithm-specific characteristics and beacons. We then apply machine learning techniques to examine which characteristics and beacons (including RoV) can best separate implementations of different algorithms. The research question connected to this is:

5. Can machine learning methods, and in particular the C4.5 algorithm, be used for the AR problem, and how accurate are they?

We will use the C4.5 algorithm, which is a well-known algorithm for generating classification trees. The algorithm selects the characteristics and beacons that can best distinguish between algorithm implementations and uses them in constructing a classification tree that can guide the AR process for a new data set. To investigate the suitability of the algorithm and the accuracy of the classification, we will perform different types of evaluation using various data sets. We will call this method the Classification Method and denote it as CLM (an illustrative sketch of this idea is given at the end of this section). Moreover, we investigate the possibility and advantages of combining the CLM and SDM. Therefore, the research question here is:

6. How can we combine the SDM and CLM to get more reliable results?

We name this combined method Combination of Schema detection and Classification and abbreviate it as CSC. Finally, one direction of our research is to give automatic feedback to students on their problematic algorithm implementations and make them rethink their solutions. To do this, we first need to discover what kinds of problematic algorithm variations students use. We carried out a study focusing on categorizing variations of student-implemented sorting algorithms and testing how accurately Aari, the Automatic Algorithm Recognition Instrument that we developed, can recognize authentic students’ sorting algorithm implementations. Thus, the final two research questions are:

7. How can we classify students’ implementations of sorting algorithms? What kinds of variations of well-known sorting algorithms do students use?

8. How accurately can Aari recognize student-implemented sorting algorithms and their variations?

Similar studies need to be done for all the algorithms and their variations we would like to provide feedback on. The results can be used for developing a tool that gives useful feedback on students’ implementations.
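To illustrate the idea behind the CLM mentioned above, the following sketch summarizes an algorithm implementation as a vector of characteristics and beacons and classifies it by walking a small hand-written decision tree of the kind the C4.5 algorithm constructs. All names and the particular tests are hypothetical; the actual characteristics, beacons and trees are discussed in later chapters.

    // A minimal sketch (hypothetical names and tests; not the actual Aari
    // implementation) of the classification idea: an algorithm implementation
    // is summarized as a vector of characteristics and beacons, and a decision
    // tree of the kind C4.5 constructs assigns it a class.
    public class ClassificationSketch {

        // Characteristics and beacons computed from one implementation.
        static class FeatureVector {
            boolean usesRecursion;      // language construct
            boolean swapInNestedLoops;  // beacon typical of some sorting algorithms
            int numberOfLoops;          // software metric
            int stepperVariables;       // roles of variables: number of steppers
        }

        // A hand-written stand-in for a learned tree: each internal node tests
        // one attribute, each leaf assigns an algorithm class.
        static String classify(FeatureVector f) {
            if (f.usesRecursion) {
                return f.numberOfLoops >= 1 ? "Quicksort" : "Mergesort";
            } else if (f.swapInNestedLoops) {
                return "Bubble sort";
            } else if (f.stepperVariables >= 2) {
                return "Insertion sort";
            }
            return "Selection sort";
        }

        public static void main(String[] args) {
            FeatureVector f = new FeatureVector();
            f.swapInNestedLoops = true;
            f.numberOfLoops = 2;
            f.stepperVariables = 2;
            System.out.println(classify(f)); // prints "Bubble sort"
        }
    }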

1.3 Structure of the Thesis

This thesis is structured as follows. Chapter 2 discusses the AR task, previous work on program comprehension and other related research fields. Chapter 3 gives an overview of programming schemas and beacons, which is based on program comprehension models, presents a definition of RoV along with an example and highlights their connection to program comprehension. Chapter 4 explains decision tree classifiers in general and briefly discusses the C4.5 algorithm. The overall process of AR, including the common characteristics of algorithms, is presented in Chapter 5, followed by a more specific discussion on the schemas and beacons for an analyzed set of algorithms in Chapter 6. Chapter 7 focuses on the empirical studies, data sets and results. Finally, Chapter 8 discusses related issues, summarizes the results of the thesis, outlines some directions for future work and concludes this thesis with a discussion of the validity issues involved in this research.

2. Algorithm Recognition and Related Work

In this chapter, we first present an overview of the algorithm recognition task. This is followed by a discussion of the related research fields.

2.1 Algorithm Recognition (AR)

The task in AR is to identify algorithms from source code. Recognizing an algorithm involves discovering its functionality and comprehending the corresponding program code. The problem in AR is not only to identify and differentiate between different algorithms with different functionalities, but also between different algorithms that perform the same functionality. As an example, in addition to identifying and differentiating between sorting and searching algorithms, different sorting algorithms should also be identified and distinguished. AR can be applied to various problems, such as code optimization, software engineering activities, examining and grading students’ work, and so on. AR is a non-trivial task. To perform the same computational task, such as sorting an array, several different algorithms can be used. For example, the sorting problem can be solved by using Quicksort or Mergesort, among many others. However, the problem of recognizing the applied algorithm has several complications. First, while essentially being the same algorithm, Quicksort, as an example, can be implemented in several considerably different ways. Each implementation, however, matches the same basic idea (partition of an array of values followed by the recursive execution of the algorithm for both partitions), but they differ in lower-level details (such as partitioning, pivot item selection method, and so forth). Moreover, each of these variants can be coded in several different ways, for instance, using different loops, initializations, conditional expressions, and so on. Another aspect of the complexity of the

AR task comes from the fact that in real-world programs, algorithms are not “pure algorithm code” as in textbook examples. They include calls to other functions, processing of application data and other activities related to the domain. For a more detailed discussion, see Publication I. With respect to computational complexity, the AR problem can be considered similar to detecting semantic clones (clones that have the same functionality) from source code. As we will discuss in Subsection 2.2.2, detecting semantic clones is undecidable in general [7]. However, as will be described in Chapter 5 when discussing the AR process, we approach the problem by examining schemas, extracting the characteristics and beacons from algorithm implementations and analyzing the implementations as characteristic and beacon vectors. Furthermore, we limit the scope of our work to a particular group of algorithms. In addition, we are not looking for absolute exactness. Even humans make errors and cannot always achieve perfect accuracy. Thus, recognizing algorithms with reasonable precision is our aim.
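As a concrete illustration of the Quicksort example above, the two sketches below (constructed for this summary, not taken from the data sets analyzed in this work) implement the same basic idea, partitioning the array and recursing on both parts, but differ in pivot selection and partitioning details. Variations of this kind are exactly what makes recognition by simple pattern matching difficult.

    // Two Quicksort sketches: the same basic idea (partition, then recurse on
    // both parts) implemented with different pivot selection and partitioning
    // details. Constructed for illustration only.
    public class QuicksortVariants {

        // Variant 1: pivot taken from the right end, Lomuto-style partitioning.
        static void sortA(int[] a, int lo, int hi) {
            if (lo >= hi) return;
            int pivot = a[hi], i = lo;
            for (int j = lo; j < hi; j++) {
                if (a[j] < pivot) { int t = a[i]; a[i] = a[j]; a[j] = t; i++; }
            }
            int t = a[i]; a[i] = a[hi]; a[hi] = t;  // move the pivot into place
            sortA(a, lo, i - 1);
            sortA(a, i + 1, hi);
        }

        // Variant 2: pivot taken from the middle, Hoare-style partitioning.
        static void sortB(int[] a, int lo, int hi) {
            if (lo >= hi) return;
            int pivot = a[(lo + hi) / 2], i = lo, j = hi;
            while (i <= j) {
                while (a[i] < pivot) i++;
                while (a[j] > pivot) j--;
                if (i <= j) { int t = a[i]; a[i] = a[j]; a[j] = t; i++; j--; }
            }
            sortB(a, lo, j);
            sortB(a, i, hi);
        }
    }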

2.2 Related Work

We can view the AR problem from different perspectives and in connection with different research fields. We first give an overview of program comprehension research and then present other related work, explaining their relevance to AR.

2.2.1 Program Comprehension

Program Comprehension (PC) has been studied from both theoretical and practical points of view. Theoretical PC studies have focused on understanding how programmers comprehend programs. These studies introduce PC models that explain elements involved in the process of PC. Practical PC studies have been mainly motivated by finding effective solutions to be used in software engineering tasks and by developing automated tools to facilitate understanding programs [94]. The purpose of AR is to determine what algorithm a piece of code implements. Therefore, algorithm recognition facilitates PC and can be regarded as a subfield of practical PC.


Figure 2.1. Key elements of program comprehension models [89]

Theoretical PC

Theoretical PC research deals with PC from the psychology of programming point of view and tries to answer questions related to the process of understanding programs, including: What strategies do programmers use when comprehending programs? Which ones are the most useful? What kinds of cognitive structures do programmers build or have when comprehending programs? What kinds of external representations are most helpful in the process of understanding? PC is a process in which a programmer builds his or her own mental representation of the program. Understanding programs is a process that involves different elements, as shown in Figure 2.1 [89]. External representation means how the target program is represented to the programmer. The assimilation process and cognitive structure are internal to the programmer. Cognitive structures include the programmer’s knowledge base (his/her prior knowledge and the domain knowledge related to the target program) and the mental representation he/she has built of the target program. An assimilation process is the process of building a mental representation of the target program using the knowledge base and the given representation of the program. In the assimilation process, top-down, bottom-up or integrated strategies of building a mental representation may be used. In a top-down strategy, the assimilation process starts by utilizing the

knowledge about the application domain and proceeds to more detailed levels in the code to verify the hypothesis formed based upon the domain. In a bottom-up strategy, the assimilation process starts at a lower level of abstraction with individual code statements and proceeds to a higher level by grouping these statements. The final mental representation of the target program is constructed by repeating this process of chunking lower levels successively into higher levels. In an integrated strategy, the programmer switches between the top-down and bottom-up models whenever he/she finds it necessary in order to build his/her mental representation effectively. In the PC literature, the integrated strategy is also referred to as a combined, hybrid, opportunistic or mixed strategy. Several PC models have been presented, which differ in issues such as what assimilation strategies they recommend, what the effect of the programming paradigm is on forming program and domain knowledge, etc. For more information on these models, see, for example, the following reviews on the topic: [17, 22, 69, 89, 94, 98, 99]. We will get back to some of these models in Chapter 3 when discussing the theoretical background of our method.

Practical PC

Practical studies on PC and the techniques and tools they develop have been influenced by the models introduced by the theoretical studies. According to Storey [94], the characteristics that influence the cognitive strategies used by programmers also influence the requirements for supporting tools. As an example, the top-down and bottom-up strategies introduced in PC models are reflected in a supporting tool so that the tool should support “browsing from high-level abstractions or concepts to lower level details, taking advantage of beacons in the code; bottom-up comprehension requires following control-flow and data-flow links” [94]. By extracting the knowledge from the given program, PC tools can be applied to different problems such as teaching novices, generating documentation from code, restructuring programs and code reuse [74]. Based on their functionality, PC tools can be divided into one of the following categories: extraction, analysis and presentation. Extraction tools perform the tasks related to parsing and data gathering. Analysis tools carry out static and/or dynamic analyses to facilitate different activities including clustering, concept assignment, feature identification, transformations, domain analysis, slicing and metrics calculations. Presentation tools comprise code editors, browsers, hypertext, and visualizations. Some

tools may have multiple functionalities and be capable of carrying out different tasks from each category [94]. Knowledge-based techniques are widely adopted in practical PC tools. The basic idea is to store stereotypical pieces of code – which are called plans, schemas, chunks, clichés or idioms in different studies – in a knowledge base and match the target program against these pieces. Since the functionalities of the plans in the knowledge base are known, the functionality of the target program can be discovered if a match is found between the target program and the plans. As with the assimilation process in theoretical PC, there are three main techniques to perform matching: top-down, bottom-up and hybrid techniques. Top-down techniques use the goal of the program to select the right plans from the knowledge base. This speeds up the process of selecting the right plans and makes the matching more effective. However, the main disadvantage of these techniques is that they need the specification of the target program, which is not necessarily available in real life, especially in the case of legacy systems. For example, PROUST [41], as a tool that uses the top-down strategy of analysis, matches functional goals against pieces of code using programming plans. It gets top-level functional goals as an input and outlines how the goals are implemented in a program, but cannot identify these goals for an arbitrary program. Moreover, as top-down approaches process plans connected to the goal of the target program, they cannot perform partial plan recognition. The main concern with bottom-up techniques, on the other hand, is efficiency. As a statement can be part of several different plans and the same plan can be part of different bigger plans, the process of matching statements and plans can become inefficient. This is especially true when the target program is a real-life program with thousands of lines of code. PAT, a Program Analysis Tool [36] that recognizes concepts based on pattern matching, is an example of a bottom-up analyzer. It identifies abstract concepts by analyzing semantic information such as control flow dependencies among sub-concepts. The technique is based on an Event Base and a Plan Base, where basic events are generated from code. Plans define the relation between events and trigger a new event which corresponds to the intention of the plan. A plan needs to be defined for each implementation variant and thus the number of plans grows quickly. Hybrid techniques (e.g., [74], as discussed below) use a combination of the two techniques. In the following, we discuss some of the knowledge-based techniques in more detail.

Quilici [74] developed a hybrid top-down, bottom-up technique to find operations and objects written in the C language and to replace them with C++ code. In order to recognize plans, Quilici defines recognition rules that list components for each plan and describe constraints for these components. Organized in a plan library, plans have different relationships with each other: they are indexed by other plans, they have specialization relationships with other plans, and they have a list of other plans that they imply. Using these relationships, a particular program construct can activate a plan to be investigated against the code under examination. In the same way, a detected plan may suggest related indexed plans for examination. Plan indexing limits the search space and thus speeds up finding the right plans. The specialization relationship makes it possible to first match general plans and then search for specialized versions of those plans. Furthermore, using a list of implied plans, it is possible to infer the existence of other plans by identifying a plan, even though those plans themselves have not been analyzed yet. Quilici defines plans as lists of attributes that characterize each plan and are represented by frames. Target programs, as well as plans, are represented by an abstract syntax tree (AST) with frames as its nodes. Frames represent all types of programming objects that need to be identified and replaced, including primitive operations such as addition and more complex structures like loops. Plan recognition is performed in a depth-first manner based on the specialization relationship between plans. Kozaczynski et al. [53] developed a method for automatic recognition of programming concepts, a term they use for programming plans. The authors define abstract concepts as language-independent ideas of computation and problem solving methods and divide them into the following three classes: 1) programming concepts include general coding strategies, data structures and algorithms, 2) architectural concepts are related to architectural components such as databases, networks and operating systems, and 3) domain concepts are implementations of application or business logic. A representation of a given program is created by parsing it into an AST, followed by several semantic analyses, including definition-use chain analysis and control-dependency relation analysis. Abstract concepts, concept recognition rules (information that describes what the concepts are and how they can be recognized based on lower-level concepts, i.e., sub-concepts) and constraints on and among the sub-concepts are organized in a concept classification hierarchy, called a concept model. The

recognition is carried out by building abstract concepts on top of the AST nodes of the target program in a top-down mode. Evaluation of an abstract concept may be triggered by each AST node. The concept recognition rules related to the triggered abstract concept are used to compare it to the triggering part of the AST. The user interface makes it possible to browse the results of the recognition process. Wills [102] uses the term clichés for commonly used computational structures. The target code is represented as annotated flow graphs by GRASPER, the system that implements the technique, and clichés are encoded as an attributed graph grammar. The problem of recognizing clichés is thus transformed into the problem of parsing a flow graph of the given source code based on the graph grammar, which is NP-complete. The cliché library used in the approach is manually constructed from textbooks and other sources. The result of the recognition is a hierarchy of clichés and the relationships between them as identified from the analyzed program. Among others, a fuzzy reasoning technique [12, 13] was introduced to improve the performance of knowledge-based PC techniques and address their problems of scalability and inefficiency. Instead of matching all the statements and plans of the target program against the plans in the knowledge base, these approaches first identify candidate chunks in the target code using data dependency analysis and a set of heuristics, as described in [11]. These chunks are then abstracted and mapped into higher-level concepts that are used to retrieve a set of similar program plans from a plan library. The retrieved plans are ranked using a fuzzy reasoning technique, and the plan identified as the most similar to the candidate chunk is used to perform the costly, more detailed matching for automated program understanding. Our SDM draws on the concept of schemas, which is central in many PC models. Another connection of our method to PC research is that our method matches the schemas detected from the given program against the schemas stored in a knowledge base in a bottom-up manner, just like knowledge-based practical PC methods do. We will get back to this in Chapter 3. A number of other techniques and tools have been introduced to facilitate PC in software engineering activities. Many of these techniques deal with concept location, that is, finding fragments of the code that implement the domain concept that a programmer is looking for in order to, for example, perform a change request. As discussed in [20], concept location

approaches are broadly divided into dynamic techniques, which are based on analyzing execution traces and their mapping to source code (see, for example, [24, 26]), and static techniques, which analyze program dependencies and textual information within source code (e.g., [28, 61]). Most concept location approaches are interactive and iterative [28], where the process is initiated by the programmer by formulating a domain concept as a query. Information retrieval methods, such as Latent Semantic Indexing (LSI), are used to map the query to software components (see, e.g., [61]). The programmer then evaluates the results of the query and, if necessary, makes more detailed queries. Hybrid approaches use both static and dynamic analyses to address the limitations of these techniques by using static information to filter the execution traces (see, e.g., [59, 72]). For example, in [72], the LSI information retrieval technique is combined with a dynamic technique (scenario-based probabilistic ranking) to improve the effectiveness and precision of feature location. Poshyvanyk et al. introduced a method combining formal concept analysis and LSI [71, 73] that is able to reduce the programmers’ effort by producing more relevant search results.

2.2.2 Clone Detection

Clones are code duplicates that result from copying and pasting code fragments for code reuse purposes, either directly or with minor modifications. Cloning is commonly practiced by software developers because it is an easy way to develop software [54]. Studies suggest that up to 20% of software systems are implemented as clones [62, 81]. Cloning is also an endemic problem in large, industrial systems [6, 23]. It makes software maintenance more complicated and increases maintenance costs: code duplication may duplicate errors, and changes to the original code must also be performed on the duplicated code [54]. Clones can be a substantial problem in software development and maintenance because inconsistent changes (e.g., bug fixes) to clones can lead to unexpected behavior [44]. Clone detection (CD) improves the quality of the source code and eliminates harmful consequences of cloning. In addition, CD has great potential in the maintenance and re-engineering of legacy systems [62] and can benefit many other software engineering tasks. CD techniques can be applied in activities that involve a comparison analysis between two different versions of a system. For example, they can be applied in origin analysis, where the problem is to analyze two versions of a system to understand

where, how and why structural changes have occurred [33]. Furthermore, CD techniques can be applied in evolution analysis, that is, mapping two or more different versions of a software system to find relations between them in order to understand their evolution behavior [82] (for more information on the application of CD in other domains, see [80]). From our point of view, however, the most relevant application of CD is in the program comprehension field: identifying the functionality of a cloned fragment helps us understand other parts of the software that include the same fragment. Clones are defined as code fragments that are similar. Here, similarity is based either on semantics or on program text. Detecting semantic similarity is an undecidable problem in general, and therefore most approaches and surveys have focused on program text similarity [7, 97]. Program text similarity is defined in terms of text, tokens, syntax, code metric vectors or data and control dependencies [97]. Different clone taxonomies have been presented (see, e.g., [3, 45, 46, 62]). Most research literature classifies clones into three types [97]. Tiark et al. [97] present a refined categorization, as follows: 1) an exact clone (type-1) is an exact copy of consecutive code fragments without modifications (except for whitespace and comments); 2) a parameter-substituted clone (type-2) is a copy where only parameters (identifiers or literals) have been substituted; 3) a structure-substituted clone (type-3) is a copy where program structures (complete subtrees in the syntax tree) have been substituted. For parameter-substituted clones, a leaf in the syntax tree can be replaced by another leaf, whereas for structure-substituted clones, larger subtrees can be substituted; and 4) a modified clone (type-4) is a copy whose modifications go beyond structure substitutions through added and/or deleted code. Several techniques for detecting clones have been introduced, including: textual approaches (text-based comparison between code fragments, see, e.g., [23, 60]), lexical approaches (the source code is transformed into a sequence of tokens and these are compared, see, for example, [2, 5]), metrics-based approaches (the comparison is based on the metrics collected from the source code, as, for example, in [51, 62]), tree-based approaches (clones are found by comparing the subtrees of the abstract syntax tree of a program, see, e.g., [6, 104]) and program dependency graph approaches (the program is represented as program dependency graphs and isomorphic subgraphs are reported as clones, as, e.g., in [50, 54]). Roy et al. [81] and Bellon et al. [7] compare and evaluate a number of different clone detection techniques and tools in their surveys.
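To make the taxonomy concrete, the fragments below (constructed for this summary, not drawn from any of the cited studies) show a small original fragment together with a type-2 and a type-3 clone of it.

    // Constructed fragments illustrating the clone types defined above.
    public class CloneTypes {

        // Original fragment.
        static int sumPrices(int[] prices) {
            int sum = 0;
            for (int i = 0; i < prices.length; i++) {
                sum += prices[i];
            }
            return sum;
        }

        // Type-2 (parameter-substituted) clone: only identifiers differ.
        static int sumCosts(int[] costs) {
            int total = 0;
            for (int k = 0; k < costs.length; k++) {
                total += costs[k];
            }
            return total;
        }

        // Type-3 (structure-substituted) clone: a whole subtree has changed;
        // the accumulation now skips non-positive values.
        static int sumPositiveCosts(int[] costs) {
            int total = 0;
            for (int k = 0; k < costs.length; k++) {
                if (costs[k] > 0) {
                    total += costs[k];
                }
            }
            return total;
        }
    }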

The problem of AR is close to the CD area. The purpose of AR is to look for similar patterns of algorithmic code in the source code, in the same way that CD techniques look for similar code fragments. Some of the techniques that we use in AR are also used in CD (e.g., analyzing language constructs and software metrics). Recognizing algorithms from source code can be compared with, and considered similar to, detecting type-3 and semantic clones, where according to the recent surveys [7, 97] the current techniques and tools perform poorly. There are also other similarities between the techniques we use for the AR problem and recent trends in CD. In their recent study on state-of-the-art CD tools, Tiark et al. [97] suggest that decision trees should be used to distinguish real clones from false positives. In particular, they use several different metrics and apply supervised classification (a decision tree constructed by the C4.5 algorithm) to identify distinguishing characteristics and the most useful combinations of metrics that can indicate real type-3 clones. To summarize, the main difference between AR and CD is that in AR, we look for the implementations of a predefined set of algorithms, whereas in CD, clones are unknown code fragments. On the other hand, as the goal in CD is to find similar or almost similar pieces of code, it is not necessary to know what algorithms the detected clones implement. CD methods can utilize all kinds of identifiers that can provide any information in the process. These identifiers may include comments, relations between the code and other documents, etc. For example, comments may often be cloned along with the piece of code that programmers copy and paste. We do not use these types of identifiers in AR.

2.2.3 Program Similarity Evaluation Techniques

The problem in program similarity evaluation research is to find the degree of similarity between computer programs. The main motivation for these studies is to detect plagiarism and prevent students from copying each other’s work. Based on how programs are analyzed, these techniques can be divided into two categories: attribute-counting techniques (see, for example, [34, 42, 56, 78]) and structure-based techniques (see, e.g., [40, 47, 67, 103]). Attribute-counting techniques use distinguishing characteristics to find the similarity between two programs, whereas structure-based techniques focus on examining the structure of the code. Attribute-counting methods have been criticized as being sensitive to even textual

modification of the code, but structure-based methods are generally regarded as more tolerant to modifications imposed by students to make two programs look different [67]. Structure-based methods can be further divided into string matching based systems and tree matching based systems. Program similarity evaluation techniques widely use software metrics (such as Halstead’s metrics) as a measure of similarity. The relevance of these techniques to our method is that, as we will discuss in Chapter 5, our method also utilizes these metrics.
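As an illustration of attribute counting, the sketch below computes one of Halstead’s measures, the program volume, for a fragment that is assumed to be already split into operator and operand tokens. Real systems derive the tokens from parsed source code; the names and the token split here are made up for this example.

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    // A rough attribute-counting sketch: Halstead's program volume computed
    // from a fragment assumed to be already split into operator and operand
    // tokens (illustrative only).
    public class HalsteadSketch {

        static double volume(String[] operators, String[] operands) {
            Set<String> distinctOperators = new HashSet<>(Arrays.asList(operators));
            Set<String> distinctOperands = new HashSet<>(Arrays.asList(operands));
            int n = distinctOperators.size() + distinctOperands.size(); // vocabulary
            int length = operators.length + operands.length;            // program length
            return length * (Math.log(n) / Math.log(2));                // V = N * log2(n)
        }

        public static void main(String[] args) {
            // Tokens of "sum = sum + prices[i]" (an illustrative split).
            String[] operators = { "=", "+", "[]" };
            String[] operands  = { "sum", "sum", "prices", "i" };
            System.out.println(volume(operators, operands)); // about 18.1
        }
    }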

2.2.4 Reverse Engineering Techniques

Chikofsky and Cross define reverse engineering as the process of creating higher-level abstractions from source code [15]. This involves analyzing the target system and identifying its components and the interrelationships between them. More specifically, reverse engineering techniques are used to understand a system in order to recover its high-level design plans, create high-level documentation for it, rebuild it, extend its functionality, fix its faults, enhance its functions and so forth. By extracting the desired information out of complex systems, reverse engineering techniques provide software maintainers a way to understand these systems, thus making maintenance tasks easier. Understanding a program in this sense refers to extracting information about the structure of the program, including control and data flow and data structures, rather than understanding its functionality. The different reports that can be generated by carrying out these analyses indeed help maintainers gain a better understanding of a program and enable them to modify the program in a much more efficient way, but they do not provide direct and concise information about what the program does or what algorithm is in question. Reverse engineering techniques have been criticized for the fact that they are not able to perform the task of PC and derive abstract specifications from source code automatically; rather, they generate documentation that can help humans to complete these tasks [75]. However, the reverse engineering field provides useful techniques and methods for program analysis. The research fields discussed above widely use these techniques and in this sense are closely related to reverse engineering. As an example, many studies on concept location discussed in Section 2.2.1 are actually published in reverse engineering forums (see, e.g., [61]), as are several studies on clone detection discussed in Section 2.2.2 (see [2, 3, 54]).

2.2.5 Roles of Variables

Roles of Variables (RoV) is a research field that explores the patterns of how variables are used and how their values are updated in programs. We use RoV as distinguishing factors to recognize algorithms, and thus RoV research is close to our research. Concerning theoretical PC, studies on using RoV in elementary programming courses show that roles provide a conceptual framework for novices that helps them comprehend and construct programs better [14, 55, 86]. Utilizing roles in teaching also helps students learn strategies related to deep program structures (“knowledge concerning data flow and function of the program reflect deep knowledge which is an indication of a better understanding of the code” [55]) as opposed to surface knowledge (“program knowledge concerning operations and control structures reflect surface knowledge, i.e., knowledge that is readily available by looking at a program” [55]). We will discuss RoV and their relation to PC and AR in more detail in the next chapter.

3. Program Comprehension and Roles of Variables, a Theoretical Background

In this chapter, we first present a theoretical background based on program comprehension (PC) models for the algorithmic schemas and beacons part of our method. In Section 3.2, we discuss the concept of roles of variables and in Section 3.3, we explain how roles can be utilized as beacons. Beacons are important elements of PC models and we will discuss how the concept connects roles of variables with PC.

3.1 Schemas and Beacons

Schemas consist of generic conceptual knowledge that abstracts detailed knowledge of programming structures. Détienne defines schemas as formalized knowledge structures [21]. Through programming experience, programmers create and extend schemas. When dealing with new tasks with similar schemas, programmers use their stored schemas to understand and solve these tasks that differ in lower-level abstraction and implementation details. Schemas may contain other schemas in a hierarchical manner. Schemas and the process of schema creation are the focus of a number of theoretical PC studies. These studies show that programmers use schemas when working on programming tasks (see, for example, [64, 79, 91, 92]). Possession of schemas is a key factor that turns novices into experts [21, 91]. According to Détienne, the studies of Soloway and Ehrlich are among the most important reports on the topic [21]. Soloway and Ehrlich define schemas (which they call plans) as “generic program fragments that represent stereotypic action sequences in programming” [91]. To investigate the differences between experts and novices, they used plan-like and unplan-like programs. A plan-like program is a program that uses stereotypical plans and conforms to rules of programming discourse, unlike an unplan-like program (we will get

back to this in Section 3.3). Soloway and Ehrlich showed that novices perform poorly due to their lack of programming plans and discourse rules. Moreover, they showed that experts perform significantly better with plan-like programs than with unplan-like programs, because plan-like programs follow the patterns that they have developed and used when solving the tasks, while unplan-like programs do not. Beacons are key statements or features that indicate the existence of a particular structure or operation in code. A particular structure or operation could have different related beacons that indicate its occurrence more or less strongly. Moreover, the same beacon may indicate different structures or operations [10]. While novices do not use beacons heavily, experts rely on beacons and use them as important elements in understanding programs [18, 101]. Beacons provide a link between the source code and the process of verifying the hypotheses derived from the source code, and thus help programmers to accept or reject their hypotheses about the code. As an example, the existence of a swap operation, especially inside a pair of loops, indicates sorting of array elements [10]. Soloway and Ehrlich’s model [91] uses the term critical lines for the same meaning: the statements that carry important information about program plans and can be considered as the key representatives of the plans that help experts recognize them. As can be noted, the concepts of schema and beacon are closely connected. The idea of the algorithmic schemas and beacons of our method comes from the corresponding concepts introduced by PC models. Our SDM uses a set of predefined high-level algorithmic schemas stored in its knowledge base. In these schemas, all details, such as the type of loops, conditionals, irrelevant assignment statements, etc., are abstracted out. These correspond to the schemas that experts have developed. When recognizing an algorithm, the method extracts schemas of the same abstract level from the given program and compares them with those from its knowledge base to find a match, just like an expert deals with a given program. That is, abstracted stereotypical implementations of algorithms are used to automatically recognize new implementation instances that differ in implementation details. For each supported algorithm, the knowledge base includes the related schemas, subschemas and beacons. These consist of, for example, loops, recursion, specific operations, etc. The method makes the final decision by putting these separately recognized elements together and examining their relationships in terms of nesting, execution order, etc., again, just like

experts do. To give an example, we can repeat the example of Brooks [10] presented above: a swap operation inside a pair of nested loops forms a schema that indicates sorting of array elements. Algorithm-specific beacons are utilized to distinguish between algorithms that have similar algorithmic schemas. For example, the above example of a swap operation within two nested loops can indicate an implementation of Bubble sort, but also an implementation of a variation of Insertion sort that performs a swap operation (instead of a shift operation) in the inner loop. In these cases, patterns that implement features specific to the way each algorithm works are extracted from the source code. These patterns are utilized as beacons to recognize borderline cases. We will discuss this in Chapter 5.

3.2 Roles of Variables

Roles of Variables (RoV) constitute an essential part of our method for Algorithm Recognition (AR). In this section, we first discuss the concept of RoV, its history and its original application. In the next section, we explain the relationship between RoV and PC and outline how RoV can be utilized as beacons in PC and AR. The concept of RoV was first introduced by Sajaniemi [85]. The idea behind RoV is that each variable used in a program plays a particular role that is related to the way it is used. RoV are specific patterns of how variables are used in source code and how their values are updated. For example, a variable that is used for storing a value in a program for a short period of time can be assigned the temporary role. As Sajaniemi and Navarro argue, RoV are a part of programming knowledge that has remained tacit [84]. Experts and experienced programmers have always been aware of the existing variable roles and have used them, although the concept has never been articulated. Giving an explicit meaning to the concept makes it a valuable tool that can be used in teaching programming to novices by showing the different ways in which variables can be used in a program. Although RoV were originally introduced to help students learn programming, the concept can offer an effective and unique tool for analyzing a program for different purposes. In this thesis, we have extended the application of RoV by applying them to the problem of AR. Roles are cognitive concepts [8, 30], implying that human inspectors may have different interpretations of the role of a single variable.

Role Description
Stepper A variable that systematically goes through a succession of values.

Temporary A variable that holds a value for a short period of time.

Most-wanted holder A variable that holds the most desirable value that is found so far.

Most-recent holder A variable that holds the latest value from a set of values that is being gone through, and a variable that holds the latest input value.

Fixed value A variable that keeps its value throughout the program.

One-way flag A variable that can have only two values and once its value has been changed, it cannot get its previous value back again.

Follower A variable that always gets its value from another variable, that is, its new values are determined by the old values of another variable.

Gatherer A variable that collects the values of other variables. A typical example is a variable that holds the sum of other variables in a loop, and thus its value changes after each execution of the loop.

Organizer A data structure holding values that can be rearranged is a typical example of the organizer role. For example, an array to be sorted in sorting algorithms has an organizer role.

Container A data structure into which elements can be added or from which elements can be removed.

Walker A variable that is used for going through or traversing a data structure.

Table 3.1. The roles of variables and their descriptions

However, as Bishop and Johnson [9] and Gerdt [31] describe, roles can be analyzed automatically using data flow analysis and machine learning techniques. As reported in [85], Sajaniemi identified nine roles that cover 99% of all variables used in 109 novice-level procedural programs. Currently, based on a study on applying the roles in object-oriented, procedural and functional programming [87], a total of 11 roles are recognized. These roles are presented in Table 3.1¹. Note that the last three roles shown in the table are related to data structures.

3.2.1 An Example

Figure 3.1 shows a typical implementation of Selection sort in Java. There are five variables in the code with the following roles. A loop counter, that is, a variable of an integer type used to control the iterations of a loop, is a typical example of a stepper. In the figure, variables i and j have the stepper role. Variable min stores the position of the smallest element found so far in the array and thus has the most-wanted holder role. A typical example of the temporary role is a variable used in a swap operation. Variable temp in the figure demonstrates the temporary role. Finally, the data structure numbers is an array that has the organizer role.

¹ See the RoV Home Page (http://www.cs.joensuu.fi/~saja/var_roles/) for more comprehensive information on roles.


Figure 3.1. An example of the stepper, temporary, organizer and most-wanted holder roles in a typical implementation of Selection sort
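Since the code of Figure 3.1 is not reproduced here, the following sketch shows what such a typical Selection sort implementation with these roles might look like. The variable names (numbers, i, j, min, temp) follow the discussion above, but the exact code of the original figure may differ.

    public class SelectionSortRolesExample {
        // numbers: organizer role (a data structure whose elements are rearranged)
        static void selectionSort(int[] numbers) {
            for (int i = 0; i < numbers.length - 1; i++) {      // i: stepper
                int min = i;                                      // min: most-wanted holder
                for (int j = i + 1; j < numbers.length; j++) {    // j: stepper
                    if (numbers[j] < numbers[min]) {
                        min = j;                                  // position of the smallest element so far
                    }
                }
                int temp = numbers[i];                            // temp: temporary (swap operation)
                numbers[i] = numbers[min];
                numbers[min] = temp;
            }
        }
    }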

3.3 The Link Between RoV and PC

RoV were introduced as a concept to help novices learn programming. Although some work on RoV has been linked to theoretical PC research (e.g., Kuittinen and Sajaniemi's study [55] draws on Pennington's work [70]), the author is not aware of any studies establishing a further explicit connection between the two, nor of further applications of RoV to theoretical PC. Automatic role detection tools, such as [9], [29] and [31], can correspondingly be considered as related to the practical PC research field. In this section, we discuss how RoV can serve as beacons in PC. In the previous section, we presented the definition of critical lines and of plan-like/unplan-like programs as introduced by Soloway and Ehrlich [91]. Figure 3.2 shows the two programs in the Algol language which Soloway and Ehrlich used in their study on PC [91]. The two programs are essentially identical except for lines 5 and 9. The Alpha program (on the left side of the figure) is a maximum search plan and the Beta program (on the right side) is a minimum search plan. In the study, these programs were shown to expert programmers (41 subjects) three times (each time for 20 seconds). On the first trial, the programmers were asked to recall the program lines verbatim as much as they could.

Figure 3.2. Plan-like and unplan-like programs used in Soloway and Ehrlich's PC study [91]

On the second and third trials, the programmers were asked to correct or complete their recall from the previous trial. The corrections/additions were made using a different color pencil each time, which made it possible to track the changes carried out on each trial. The programmers were expected to recall the Alpha program earlier, since it is a plan-like program. In the Beta program, the variable name (max) does not reflect its function, which is a minimum search function. Therefore, the program violates the discourse rule of using proper variable names and is thus considered an unplan-like program. In their study, Soloway and Ehrlich focused on lines 5 and 9, as they are the critical lines of these programs. The results showed that the programmers recalled significantly more critical lines, and recalled them earlier, from the Alpha program than from the Beta program. The conclusion was that plan-like programs help programmers in the PC task and that critical lines are important in the process.

Roughly speaking, line 9 in Figure 3.2 and lines 4 and 5 in Figure 3.1 are identical. They both make a comparison between the currently encountered number and the minimum/maximum value of an array of numbers encountered so far. They then store the value of the current number into the variable holding the minimum/maximum value, if it is smaller/larger than the current value of that variable. As illustrated in Figure 3.1, variable min in line 5 has the most-wanted holder role. Therefore, since line 9 in Figure 3.2 is a critical line, the most-wanted holder role can also be considered a critical line (or a beacon) in a search plan. In addition, as discussed above, Brooks [10] regards the presence of a swap operation as a beacon in sorting functions. Swap operations typically include a temporary role, and thus this role can be regarded as part of the beacon in the example of Figure 3.1. As we will discuss in this thesis, RoV play an important part in our method for the automatic recognition of algorithms. Specifically, for example, the presence of the most-wanted holder role is a strong indicator

(i.e., a beacon) for recognizing Selection sort algorithm implementations. As discussed in Chapter 2, in a study on the effects of teaching RoV in elementary programming courses, Kuittinen and Sajaniemi [55] found that “the teaching of roles seems to assist in the adoption of programming strategies related to deep program structures, i.e., use of variables”. This is a clear indication of the applicability of RoV in PC. Furthermore, since 11 roles can cover all variables in novice-level programs [85], RoV are also inclusive and comprehensive as a tool to be used in PC. From the above discussion, we hypothesize that RoV can be used in PC tasks as beacons. In the case of AR, we will show this in this thesis. As RoV must first be learned before they can be utilized as beacons, one can argue that roles may place a burden on programmers in PC tasks instead of helping them. However, as Sajaniemi and Navarro argue [84], RoV are tacit knowledge of experts. Thus, for experts, roles are already somewhat familiar and do not require a huge effort to be learned. In the case of novices, it can be logically concluded from Sajaniemi and Navarro's argument that novices will (tacitly) adopt RoV, just like other programming skills, as they gain more experience in programming and become experts. It should be noted that, as discussed above, the same difference between experts and novices applies to the other elements of PC models as well, such as schemas, beacons, general programming knowledge and other elements of programmers' knowledge base.

4. Decision Tree Classifiers and the C4.5 Algorithm

In this chapter, we first discuss the important issues concerning decision tree classifiers in general. This is followed by a brief discussion of the C4.5 decision tree classifier algorithm [77]. The issues discussed in this chapter are intended to help the reader understand how decision trees work. Readers who are familiar with these topics may skip this chapter.

4.1 Decision Tree Classifiers in General

Decision tree classifiers, also called classification trees or simply decision trees, are used to classify the instances of a data set into appropriate classes. Decision trees belong to the supervised machine learning classification methods [52]. In these methods, a set of known instances, called a training set, is first introduced to the system. The system classifies each instance of the set, associates each class with the attributes of each instance and learns which class each instance belongs to. Based on what the trained system has learned in the learning phase, it is able to classify the instances of a previously unseen set (i.e., the testing or evaluating set) in the testing or evaluating phase [96]. Unsupervised machine learning methods use unlabeled instances. Unsupervised algorithms, for example clustering algorithms, examine a given data set to find regularities between the instances and to group them into meaningful clusters [39]. In a training set used to build a decision tree classifier, each instance consists of a group of attributes that describe that instance. One of the attributes is the class of the instance. In the learning phase, the task is to find a function that maps from the other attributes to the class attribute. In the testing phase, the task is to assign a correct class to each instance of the testing set. The mapping function found in the learning phase is used

to carry out the task in the testing phase. A decision tree consists of internal nodes (including the root node) and leaves. Each internal node contains a test that results in splitting the data set into subsets based on the outcome of the test. Decision trees are constructed using the divide and conquer principle. Starting from the root node, based on the selected attribute, the instances of a training set are divided into two branches. This is repeated recursively for each internal node, only for those instances that have reached that node. Each leaf is labeled with the corresponding class. The outcome of each test at each internal node is shown on each arc from that internal node to its children. If the internal nodes of a decision tree use a single attribute of the input instance to determine which child to visit next, the tree is called a univariate tree. In a multivariate tree, more than one attribute is tested in internal nodes [68]. When a new instance of a set is given to a decision tree, each of its attributes is tested at the corresponding internal nodes, starting from the root. Depending on the outcome of the test in each internal node, the appropriate child node of that internal node is visited next. This is continued until a leaf is encountered and a class is assigned to the instance. There are different issues associated with decision trees and their performance. In what follows, we present an overview of some of these issues.

Finding the best attribute. Different attributes of an instance differ in how well they are able to split the data. In tree induction (the process of constructing a tree from the training set [68]), it is important to select attributes that can discriminate between different classes of data in the best possible way. The attribute that best distinguishes between the samples of the training data will be located at the root of the tree [52]. This selection process is then repeated to select the best distinguishing attribute for the internal nodes in a recursive manner. In the literature, this is often referred to as finding the best split [68]. The best split improves the accuracy of the decision tree and helps to keep its size right. To find such an attribute, all the attributes are examined using some goodness measure. These goodness measures are basically statistical tests and include information gain, distance measures and the Gini index, to name a few [52, 68]. The explanation of all these measures is beyond the scope of this thesis, but information gain is discussed in more detail in relation to the C4.5 algorithm in Appendix A.
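For reference, the information gain measure used by ID3/C4.5-style tree induction can be written as follows; this is the standard textbook definition, given here only as background:

\[
\mathrm{Entropy}(S) = -\sum_{c \in C} p_c \log_2 p_c,
\qquad
\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, \mathrm{Entropy}(S_v),
\]

where S is a set of training instances, C is the set of classes, p_c is the proportion of instances in S that belong to class c, and S_v is the subset of S for which attribute A takes the value v. The attribute with the highest gain is chosen for the split.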

Finding the right size. After being built, decision trees need to be simplified, as they are often unnecessarily complex. Complexity is associated with overfitting, which in turn causes generalization problems. Overfitted trees adopt the structure of the learning data at such a detailed level that they become very specific to that data and cannot classify the instances of unseen data well. It has been claimed that the quality of a decision tree depends more on the right size than on the right split [68]. There can be many decision trees of different sizes that are correct over the same training set, but a smaller tree is preferred. A simpler decision tree is more likely to correctly recognize more instances of a testing set, because it can capture the structure of the problem and the relationship between the class of an instance and its attributes more effectively [76]. In addition to higher accuracy, smaller trees are more comprehensible as well [49]. Choosing the best discriminating attributes helps keep the size of a tree small. Because the problem of finding the smallest decision tree that is consistent with the training set is NP-complete [52, 77], selecting the right tests is very important in generating near-optimal trees. In his survey on the automatic construction of decision trees, Murthy [68] lists several methods for obtaining right-sized trees. The most widely used method is pruning. In pruning, the complete tree is built first. Here, the complete tree means the tree in which no further splitting will improve the accuracy of the tree on the training data. In the next step, those subtrees with only a small impact on the accuracy of the tree are removed. There are many variations of pruning methods, and it has been shown in different studies that there is no single pruning method that is superior to the others [52, 68]. Another method is called stopping or prepruning, where the instances of the data set are not subdivided any further at some point. An interesting approach to pruning is to combine the tree building phase and the pruning phase. In this approach, if it appears that a node will be removed in the pruning phase, it is not expanded in the building phase in the first place. This results in saving a noticeable amount of time [52].

There are several other issues related to decision trees, such as how to deal with missing attributes, how to measure the quality of decision trees, etc., that are beyond the scope of this thesis. To sum up, decision trees are powerful, simple and easily interpretable classifiers. Because of these properties, decision trees are used in many different fields, including

statistics, pattern recognition, decision theory, signal processing, machine learning and artificial neural networks [68].

4.2 The C4.5 Decision Tree Classifier

We chose the C4.5 algorithm [77] to build the decision tree because it is widely used, is the best-known algorithm for this task, and has a good combination of error rate and speed [58]. The C4.5 algorithm preserves the advantages of its predecessor, the ID3 algorithm [76], but is further developed in many regards. It provides an accurate, readable and comprehensible model of the structure of the data and the relationship between the attributes and this structure. More details on how the C4.5 algorithm deals with the important issues related to building decision trees, presented in the previous section, can be found in Appendix A.

5. Overall Process and Common Characteristics

We developed two different methods for recognizing algorithms: the schema detection method (SDM) and the classification method (CLM). Moreover, we developed a method that uses a combination of these two. This chapter gives an overview of these methods and describes the processes that are commonly used for recognizing different algorithms in general, including detecting algorithmic schemas, evaluating the estimated accuracy of the classification, constructing a decision tree classifier and using the constructed decision tree to recognize previously unseen instances. This discussion is generic and does not detail how these processes work for particular algorithms; that is the topic of the next chapter. Instead, the common characteristics that are used for all types of algorithms are listed in detail.

5.1 Overall Process

Constructing a decision tree for classifying previously unknown instances typically consists of three main phases: constructing a tree using a set of training data instances, evaluating the estimated accuracy of the classification and using the tree to classify the instances of an unseen data set. These phases are illustrated in Figures 5.1 and 5.2: the first two phases in Figure 5.1 (Steps 3 and 4), and the third phase in Figure 5.2 (Step 3). These figures show the combination of SDM and CLM, which we abbreviate as CSC. Figure 5.1 includes four steps represented by rectangles with a white background. The process starts with detecting the schemas and the related beacons from an input program. This step identifies algorithmic schemas and extracts the code that implements the target algorithm from the given program, so that only this code is further processed and the irrelevant application data processing code is left out of the process. In order to identify schemas, we compute the schema-related beacons in this step as

well (we will get back to this in Chapter 6). If this step detects the schemas successfully, its output is the implementation part of the algorithm from the original program; otherwise, it is the original input program unchanged. Next, the characteristics and other beacons are computed and stored in a database as a vector representation of the analyzed program. These vectors are labeled with their correct type (denoted by the dashed arrow in the figure) and given to the C4.5 algorithm as learning data, based on which a decision tree is constructed (Step 3). In the final step, the estimated accuracy of the classification is evaluated using the leave-one-out cross-validation technique. It should be noted that Steps 3 and 4 in Figure 5.1 are independent of each other. This means that we can evaluate the estimated accuracy of the classification before we build a classification tree. Note also that Steps 1 and 2 of the figure are executed as many times as there are instances in the data set, whereas Steps 3 and 4 are executed only once. Figure 5.2 shows how the constructed decision tree is used to classify the instances of a previously unseen data set. The steps with a white background are identical to those with the same name in Figure 5.1; the differences pertain to the steps represented by gray rectangles. The step denoted by 0 starts the process. It is a preprocessing step performed as a part of input data validation, where input programs are automatically tested using an automatic assessment system that gives feedback about the correctness of the program in terms of black-box testing. If a program does not pass the tests in this step, it is not analyzed further. Correctly working implementations are further processed in order to detect their schemas and to extract and store their characteristics and beacons as discussed above. Step 3, “Recognize unseen instances”, gets the classification tree constructed in Figure 5.1 and the unlabeled vector representations as input. It assigns a type to each instance by traversing the classification tree according to the characteristics and beacons of that instance. The CSC, as illustrated in Figure 5.1, is described in more detail in Publication VI and Publication VII. Publication II discusses the CLM, that is, Steps 2, 3 and 4 of the figure, in more detail without the SDM. A more detailed description of the SDM can be found in Publication V. Finally, Publication IV discusses the process depicted in Figure 5.2 in more detail without the SDM.
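The following sketch illustrates the leave-one-out cross-validation used in the evaluation step: each instance is held out once and classified by a tree built from all the remaining instances. The Instance and Classifier types and the buildTree method are hypothetical placeholders standing in for the vector representations and the C4.5 tree construction, not the actual Aari interfaces.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical minimal types; the real system stores vectors in a database
    // and builds the tree with the C4.5 algorithm.
    record Instance(double[] features, String label) {}

    interface Classifier {
        String classify(double[] features);
    }

    public class LeaveOneOut {
        // Placeholder for tree construction (C4.5 in the actual system).
        static Classifier buildTree(List<Instance> training) {
            return features -> "unknown"; // trivial stand-in for the learned tree
        }

        // Estimated accuracy: each instance is held out once and classified
        // by a tree built from all the remaining instances.
        static double estimateAccuracy(List<Instance> data) {
            int correct = 0;
            for (int i = 0; i < data.size(); i++) {
                List<Instance> training = new ArrayList<>(data);
                Instance heldOut = training.remove(i);
                Classifier tree = buildTree(training);
                if (tree.classify(heldOut.features()).equals(heldOut.label())) {
                    correct++;
                }
            }
            return (double) correct / data.size();
        }
    }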



Figure 5.1. The process of building a decision tree and evaluating the estimated accuracy of the classification. Phase 1 and Phase 2 denote the first two phases in constructing a decision tree for classifying unknown instances; see text


Figure 5.2. An overview of the process of recognizing previously unseen algorithm implementations. Phase 3 indicates the final phase in a typical construction of a decision tree for classifying unknown instances; see text

Numerical characteristics
N1: Total number of operators.
N2: Total number of operands.
n1: Number of unique operators.
n2: Number of unique operands.
N: Program length (N = N1 + N2).
n: Program vocabulary (n = n1 + n2).
NAS: Number of assignment statements.
MCC: Cyclomatic complexity (i.e., McCabe complexity) [63].
LoC: Lines of code.
NoV: Number of variables.
NoL: Number of loops.
NoNL: Number of nested loops.
NoB: Number of blocks.

Truth value characteristics
Recursive: Whether the algorithm uses recursion.
Tail recursive: Whether the algorithm is tail recursive.
Roles of variables: Roles of the variables in the program.
Auxiliary array: Whether the algorithm uses an auxiliary array (for the algorithms that use arrays in their implementation).

Structural characteristics
Block/loop information: Information about blocks and loops, their starting and ending lines, length and the interconnection between them (how they are positioned in relation to each other).
Loop counter information: Information about initializing and updating the value of loop counters. This allows us to determine, as an example, incrementing and decrementing loops.
Dependency information: Direct and indirect dependencies between variables.

Table 5.1. The numerical, truth value and structural characteristics

5.2 Common Characteristics

We define characteristics as the shared features of all algorithm implementations. Beacons, on the other hand, are algorithm-specific features that distinguish a particular algorithm from others. Table 5.1 shows the characteristics that we compute from given programs. We divided the characteristics into three groups: numerical characteristics, truth value characteristics and structural characteristics¹. The numerical characteristics are commonly used software metrics that denote features of the code expressed as positive integers. The first six characteristics shown in the table are Halstead's metrics [35]. Structural characteristics allow us to identify language constructs and different patterns as well as algorithm-specific beacons. The characteristics of Table 5.1 were selected based on manual analyses of many different types of sorting algorithms in the early stages of our work, and as a result of literature reviews, especially on the program similarity evaluation techniques discussed in Chapter 2. These characteristics include

¹ In Publication I and Publication II, the last two groups are called “descriptive characteristics” and “other characteristics”.

various metrics, such as size metrics, control-flow metrics (the cyclomatic complexity) and data-flow metrics (roles of variables), and thus represent different program aspects. We hypothesized that these characteristics, along with the algorithm-specific beacons (which will be discussed in Chapter 6), would be sufficient to recognize different sorting algorithm implementations. As we developed our method further to cover variations of sorting algorithms as well as algorithms from other fields, we discerned several other useful beacons that, in addition to these characteristics, are computed from the given code and used in the process. We implemented a tool named Aari (an Automatic Algorithm Recognition Instrument) that computes all the schemas, characteristics and beacons for programs written in Java. The characteristics and related beacons are stored in a database, and thus each algorithm is represented by an n-dimensional vector in the database, where n is the number of characteristics and beacons. We call these characteristic and beacon vector representations the technical definitions of the corresponding algorithm implementations. In our method, roles of variables can be considered as both characteristics and beacons. Roles are characteristics, because they are detected for all variables in all given programs and, in this sense, are common features of all algorithms. Roles are also beacons in the sense that, as we will discuss in Chapter 6, the existence of particular roles in implementations of particular types of algorithms is investigated as an algorithm-specific feature to distinguish these algorithms from others. It should also be noted that we use roles as truth value characteristics, as we are interested in knowing whether or not a particular role appears in a given algorithm implementation. Roles could also be used as numerical characteristics (e.g., how many variables have a particular role in an input program). This would provide us with an additional distinguishing feature when differentiating between algorithm implementations that contain different numbers of variables with a particular role.
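As an illustration of such a vector representation, the following sketch shows how a technical definition could be stored for one analyzed program. The field names are illustrative placeholders and do not reflect the actual Aari database schema.

    import java.util.Map;

    // A simplified sketch of a "technical definition": the characteristic and
    // beacon vector computed for one analyzed program (hypothetical field names).
    record TechnicalDefinition(
            int totalOperators,        // N1
            int totalOperands,         // N2
            int cyclomaticComplexity,  // MCC
            int linesOfCode,           // LoC
            int numberOfLoops,         // NoL
            int numberOfNestedLoops,   // NoNL
            boolean recursive,
            boolean usesAuxiliaryArray,
            Map<String, Boolean> roles,    // e.g. "most-wanted holder" -> true
            Map<String, Boolean> beacons,  // e.g. "Swap_inner_loop" -> true
            String label                   // algorithm class, known only for training data
    ) {}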

5.2.1 Computing Characteristics

In this subsection, we explain how some of the numerical characteristics are computed. Salt [88] and Miller et al. [66] discuss strategies for calculating Halstead's metrics for Pascal and Ada programs. We count as operators all arithmetic, relational and logical operators as well as keywords and method calls. Likewise, identifiers (variables and all other

identifiers other than keywords), constants, literals and type specifiers are counted as operands. These are computed by scanning the source code. The cyclomatic complexity [63] is a measure of how many paths there are through a program. It is defined with reference to the control flow graph of the program and is calculated as CC = E − N + P, where E represents the number of edges of the graph, N is the number of nodes of the graph, and P the number of connected components. As a software metric, the cyclomatic complexity can be computed as CC = “the number of decision points (i.e., if statements or conditional loops)” + 1 [63]. We use the following strategy for computing dependency information between variables. Variable i is directly dependent on variable j, if i gets its value directly from j. If there is a third variable k on which j is directly or indirectly dependent, i also becomes indirectly dependent on k. A variable can be both directly and indirectly dependent on another one. The way the other characteristics of Table 5.1 are computed is straightforward. We will discuss in the next section how roles of variables are detected from a program.
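To make the simpler counting formulation concrete, the sketch below counts decision points in a stream of tokens. The chosen keyword set is a simplifying assumption, and the actual computation in Aari is performed on the parsed Java source rather than on raw tokens.

    import java.util.List;
    import java.util.Set;

    public class CyclomaticComplexity {
        // Keywords treated as decision points in this simplified sketch
        // (if statements and conditional loops).
        private static final Set<String> DECISION_KEYWORDS = Set.of("if", "for", "while");

        // CC = number of decision points + 1
        static int compute(List<String> tokens) {
            int decisionPoints = 0;
            for (String token : tokens) {
                if (DECISION_KEYWORDS.contains(token)) {
                    decisionPoints++;
                }
            }
            return decisionPoints + 1;
        }

        public static void main(String[] args) {
            // Tokens of a method containing one loop and one if statement: CC = 3.
            List<String> tokens = List.of("for", "(", ")", "{", "if", "(", ")", "{", "}", "}");
            System.out.println(compute(tokens)); // prints 3
        }
    }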

5.3 The Tool for Detecting Roles of Variables

A tool developed by Bishop and Johnson for the automatic detection of roles of variables [9] is integrated into Aari. The tool detects roles using program analysis techniques, particularly program slicing and data flow analysis. A set of example programs was used to analyze how each role can be defined in terms of the way a variable is assigned and used. Based on a comparison of these assignment and usage conditions with the roles, a set of conditions, shown in Table 5.2, was identified, based on which role assignments can be checked. To detect roles, all occurrences of each variable in the program are captured first. The outcome of this analysis is the program slice for each variable. This is followed by data flow analysis for each program slice. The tool then compares the assignment and usage conditions of each variable of the target program with the predefined conditions. If the user has provided a role for a variable, the tool checks whether the corresponding conditions for the provided role are met by the corresponding variable. If so, the tool confirms that the role provided by the user is correct. Otherwise, the tool prints the role it believes to be correct and justifies its decision by giving an appropriate message. If there is no role suggested by the user, the tool simply prints the role it considers the most appropriate for the variable in question. For details on the assignments, usage conditions and how the role detector works, see [9].

Rule Description
A Variable is assigned in a loop.
B Variable is used in its assignment loop.
C Variable is used conditionally in its assignment loop.
D Variable is used directly for its assigning loop condition.
E Variable is used indirectly for its assigning loop condition.
F Variable is assigned in a “for” loop statement.
G Variable is used directly in the program.
H Variable is assigned in a branch for which it is part of the condition.
I Variable appears directly on both sides of an assignment statement.
J Variable appears indirectly on both sides of an assignment statement.
K Variable is directly toggled within a loop.
L Variable is indirectly toggled within a loop.
M Variable is incremented/decremented within a loop.
N Variable is used outside of the loop in which it is assigned.
O Variable is assigned in a loop before it is used in that loop.
P Variable is used conditionally for a loop outside of its assignment loop.
Q Variable appears in an array organizing type statement.
R Variable is of type array.
S Variable is assigned within a loop with a combination of other variables, values and operators.
T Variable is assigned with the output from a method call.
U Variable is assigned with a value resulting from the instantiation of a new object or directly with a boolean value.

Table 5.2. The rules based on which roles of variables are detected. See [9]

Bishop and Johnson developed their role detector for educational purposes. Therefore, the tool allows users to provide a role for a variable and to check whether the tool agrees with them. Although providing a role for a variable is optional, special tags along with the name of the variable must be provided for each variable in a program; otherwise the tool will not consider the variable. During our project, we improved the tool in this and other regards to make the process of detecting roles fully automatic and to make the tool more suitable for our purpose. In addition, we tuned the tool in order to improve its performance. For example, in some implementations that used a do-while loop, the tool did not detect the conditions for the one-way flag role correctly. We fixed the code so that the role was recognized correctly in these situations. As another example, the temporary role typically appears in swap operations, which in turn are commonly used in sorting algorithms. In programs where a swap operation was performed in a separate method, the temporary role was sometimes falsely recognized as a fixed value by the role detector. To solve the problem, we automatically removed the method calls to swap operations in a preprocessing step and inlined the corresponding swap method body in the target programs. As a result, temporary roles were

detected much more accurately. We developed a way to provide the aforementioned required special tags automatically, and we improved the tool in many ways in order to get the detected roles directly out of the tool for further processing. We also carried out several minor modifications.
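The following sketch illustrates the kind of source-to-source rewrite that the swap-inlining preprocessing amounts to; it is an illustrative example rather than a program from the analyzed data sets, and the actual preprocessing operates on the parsed program rather than on text.

    public class SwapInliningExample {
        // Before preprocessing: the swap is performed in a separate method, which
        // could cause temp to be falsely recognized as a fixed value.
        static void swap(int[] a, int i, int j) {
            int temp = a[i];
            a[i] = a[j];
            a[j] = temp;
        }

        static void sortWithCall(int[] numbers) {
            for (int i = 0; i < numbers.length - 1; i++) {
                for (int j = i + 1; j < numbers.length; j++) {
                    if (numbers[j] < numbers[i]) {
                        swap(numbers, i, j);   // this call is removed by the preprocessing step
                    }
                }
            }
        }

        // After preprocessing: the swap method body is inlined at the call site,
        // so the temporary role of temp is detected in its actual context.
        static void sortInlined(int[] numbers) {
            for (int i = 0; i < numbers.length - 1; i++) {
                for (int j = i + 1; j < numbers.length; j++) {
                    if (numbers[j] < numbers[i]) {
                        int temp = numbers[i];
                        numbers[i] = numbers[j];
                        numbers[j] = temp;
                    }
                }
            }
        }
    }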

6. Schemas and Beacons for an Analyzed Set of Algorithms

To show the performance of our method, we applied the method to various algorithms, including sorting, searching, heap, basic tree traversal and graph algorithms. We analyzed a set of implementations of these algorithms to identify their algorithmic schemas (which we also call programming schemas or just schemas) and to discern their algorithm-specific beacons. To make it easier to follow the discussion, pseudo-code examples for these algorithms are presented in Appendix B. These schemas and beacons are the topics of this chapter. We will discuss the data sets and empirical studies in Chapter 7.

6.1 Algorithmic Schemas

6.1.1 Schemas for Sorting Algorithms

We present the schemas for the Bubble sort, Insertion sort, Selection sort, Quicksort and Mergesort algorithms. These algorithms form two clearly different groups, the first group consisting of Bubble sort, Insertion sort and Selection sort, and the second one of Quicksort and Mergesort. Of these, especially the algorithms in the first group have very similar internal structures. This similarity indicates that we should be able to differentiate between closely related algorithms. To complicate the matter further, we discuss the schemas of two student-implemented variations: Insertion sort WS (Insertion With Swap, a variation of Insertion sort where, instead of shifting the elements in the inner loop, they are swapped) and Selection sort WILS (Selection With Inner Loop Swap, a variation of Selection sort that swaps each element that is in a wrong position compared to the element pointed to by the loop counter of the outer loop, instead of storing its position and swapping it once in the outer loop; see Figure 6.1 for this difference). We

will get back to these variations in Chapter 7.

Figure 6.1. Fig. 6.1a shows a typical implementation of Selection sort WILS that uses swap in the inner loop instead of the outer loop. Fig. 6.1b illustrates a typical implementation of a standard Selection sort

We defined the schemas illustrated in Figures 6.2 and 6.3 for the non-recursive (Bubble sort, Insertion sort, Selection sort, Insertion sort WS and Selection sort WILS) and recursive (Quicksort and Mergesort) sorting algorithms, respectively¹. The nesting relationships between the loops and blocks are depicted by the indentations. We define an implementational definition of an algorithm as the abstraction of its implementation, which reflects the functionality and structure of the algorithm. Implementational definitions do not include implementation details, such as the type of loops or variables, but only high-level structural and functional features of algorithms. In Publication III, we have discussed the implementational definitions for these sorting algorithms. For example, implementations of Bubble sort include two nested loops and a swap operation performed in the inner loop. Moreover, the two elements compared in each pass are adjacent. The schemas illustrated in Figures 6.2 and 6.3 further abstract these implementational definitions. As Figure 6.2 shows, the schemas of Bubble sort, Insertion sort WS and Selection sort WILS are similar. We use algorithm-specific beacons to differentiate between them. Implementations of Insertion sort WS are distinguished using the following beacons: the outer loop of the two nested loops used in these implementations is incrementing and the inner loop decrementing (this is the way Insertion sort algorithms and their variations are commonly implemented). Moreover, the inner loop counter is initialized to the value of the outer loop counter. The beacon that differentiates between implementations of Bubble sort and Selection sort WILS is that, as discussed above, in Bubble sort implementations the two compared elements in the inner loop in each pass are adjacent, whereas

¹ Although it is feasible to write, for example, a recursive Bubble sort or a non-recursive Quicksort, it is not a common practice and did not occur in our data. Moreover, it can be debated whether, for example, a non-recursive Quicksort is essentially the same algorithm as the commonly known recursive Quicksort.

this is not the case in Selection sort WILS implementations. We will discuss beacons for sorting algorithms in Subsection 6.2.1. For more details on the schemas of these sorting algorithms, see Publication V and Publication VI.

Figure 6.2. Algorithmic schemas for the non-recursive analyzed sorting algorithms


Figure 6.3. Algorithmic schemas for Quicksort and Mergesort. The three dots shown in the Mergesort schemas indicate that merging may have other schemas as well

6.1.2 Schemas for Searching, Heap, Basic Tree Traversal and Graph Algorithms

In a different study, we analyzed implementations of several other algorithms, including searching, heap, basic tree traversal and graph algorithms, for their schemas. These schemas are illustrated in Figure 6.4. Many of the algorithms shown in Figure 6.4 have well-established recursive as well as non-recursive versions. For these algorithms, the analyzed version is indicated in parentheses after their name.

Figure 6.4. The schemas for the analyzed algorithms

Furthermore, indentations indicate the nesting relationships between the loops and blocks. Also note that the schemas of Figure 6.4 show abstract typical implementations of the algorithms and that slightly different implementations are also possible. For example, some implementations of the non-recursive heap remove algorithm might perform the LEFT_CHILD_INDEX_SEARCH operation once before the loop and again at the end of the loop. As another example, some implementations of Dijkstra's algorithm might have more than one loop within the outer loop. We have not shown these details in the schemas of Figure 6.4, but we have considered them in the implementation of the Aari system. These schemas and the way they are computed are discussed in more detail in Publication VII.

6.1.3 Detecting Schemas

The schemas of Figures 6.2, 6.3 and 6.4 are stored in the knowledge base of the system as illustrated in the figures. The target program is analyzed to find schemas of the same abstraction level. Schemas and subschemas are examined and their elements, including loops, recursion, specific operations, etc., are analyzed. The execution order and nesting relationships

between these elements are also investigated. When searching for the schemas and subschemas, details such as the type of loops or irrelevant assignments are not taken into consideration. For example, when searching for loops, While loops and For loops are treated the same. As another example, when searching for a swap operation, the three consecutive assignment statements that usually constitute a swap operation are examined, ignoring possible assignment statements that may appear before or after these statements. After scanning the program and detecting the schemas and subschemas, they are matched against those from the knowledge base to identify the algorithm in question. When two or more algorithms have similar schemas at the abstract level illustrated in Figures 6.2, 6.3 and 6.4, we need to investigate these similar schemas at a lower level to be able to distinguish between them. Our technique for doing so is to examine beacons. For example, as discussed above, the schemas of Bubble sort, Insertion sort WS and Selection sort WILS, which are similar as shown in Figure 6.2, are identified using the following algorithm-specific beacons: in implementations of Insertion sort WS, the outer loop of the two nested loops used in these implementations is incrementing and the inner loop decrementing. Moreover, the inner loop counter is initialized to the value of the outer loop counter. Likewise, in Bubble sort implementations, the two compared elements in the inner loop in each pass are adjacent, whereas this is not the case in Selection sort WILS implementations. These beacons are recognized by examining how the loop counter variables of the outer and inner loop are initialized, how their values are changed and how the elements of an array are compared within the inner loop. Algorithm-specific beacons are discussed in the next section.
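To illustrate the kind of low-level check described above, the following sketch tests whether three consecutive assignment statements form a swap. The statement representation is a deliberately simplified assumption (each assignment reduced to a left-hand side and a right-hand side string); the actual detection in Aari works on the parsed program and tolerates surrounding irrelevant statements.

    import java.util.List;

    public class SwapDetector {
        // Simplified view of an assignment statement: lhs = rhs.
        record Assignment(String lhs, String rhs) {}

        // Returns true if the three consecutive assignments match the pattern
        //   t = x;  x = y;  y = t;
        // which is the usual shape of a swap operation.
        static boolean isSwap(List<Assignment> stmts) {
            if (stmts.size() != 3) {
                return false;
            }
            Assignment first = stmts.get(0), second = stmts.get(1), third = stmts.get(2);
            return first.rhs().equals(second.lhs())      // t = x, then x is reassigned
                    && second.rhs().equals(third.lhs())  // x = y, then y is reassigned
                    && third.rhs().equals(first.lhs());  // y = t closes the cycle
        }

        public static void main(String[] args) {
            List<Assignment> swap = List.of(
                    new Assignment("temp", "a[i]"),
                    new Assignment("a[i]", "a[j]"),
                    new Assignment("a[j]", "temp"));
            System.out.println(isSwap(swap)); // prints true
        }
    }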

6.2 Beacons

6.2.1 Beacons for Sorting Algorithms

We discerned a set of beacons specific to the sorting algorithms listed in Subsection 6.1.1. These beacons can be utilized to separate implementations of these algorithms from each other, as well as to distinguish between implementations of these algorithms and implementations of algorithms from other fields. They help both in the schema detection process and

in building a classification tree, along with the characteristics discussed in the previous chapter. In the following, we present a list of the beacons that are selected by the C4.5 algorithm for constructing a classification tree for recognizing the sorting algorithms and their variations. Furthermore, the algorithm that each beacon primarily indicates is presented, along with a brief explanation of how each beacon is detected from source code. A complete list of all the computed beacons and their descriptions can be found in Publication VI.

• MWH (Most-Wanted Holder): whether the implementation of the algorithm includes a variable appearing in the MWH role. MWH mainly indicates implementations of Selection sort. The existence of an MWH role in the code is examined by going through all the roles detected by the role analyzer.

• One_way_flag: whether the implementation includes a variable appearing in the one-way flag role. This mainly indicates implementations of Bubble sort WF (Bubble sort With Flag, an optimized version which terminates if no swap is performed in the inner loop). As for MWH, we can examine the existence of a one_way_flag role from all the roles detected by the role analyzer.

• Swap_inner_loop: whether a swap operation is performed in the inner loop of the two nested loops. This mainly indicates implementations of Bubble sort, Bubble sort WF, Insertion sort WS and Selection sort WILS. This beacon is analyzed by detecting a swap operation and checking that it is located within the inner loop.

• OIID (Outer loop Incrementing Inner Decrementing): whether, of the two nested loops, the outer loop is incrementing and the inner decrementing. This mainly indicates implementations of Insertion sort and Insertion sort WS. We can investigate the existence of this beacon using the structural characteristic “Loop counter information”, as described in Table 5.1.

• IITO (Inner loop counter Initialized To Outer loop counter): whether, of the two nested loops, the inner loop counter is initialized to the value of the outer loop counter. This mainly indicates implementations of Insertion sort and Insertion sort WS. The existence of this beacon is examined

using the structural characteristic “Dependency information” between variables, as described in Table 5.1.

• Shift_inner_loop: whether a shift operation is used in the inner loop of the two nested loops. This mainly indicates the implementations of Insertion sort. In order to detect this beacon, we look for a shift operation (i.e., moving the elements of an array to the right until the insertion point is reached) that takes place within the inner loop.

• Efficient_pivot: whether the implementation includes efficient pivot selection. This mainly indicates implementations of Quicksort EP (Quicksort with Efficient Pivot selection, which uses a more efficient pivot selection strategy than simply picking the pivot from the left or right end of the given array). This beacon is analyzed by examining whether the right-hand side of the pivot's assignment statement involves the middle index or the median of the first, last and middle items of the given array.
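As an illustration of the pattern this beacon targets, the following fragment shows one common form of efficient pivot selection, choosing the median of the first, middle and last elements. It is a generic textbook-style sketch, not code taken from the analyzed data sets.

    public class PivotSelection {
        // Median-of-three pivot selection: one typical pattern the Efficient_pivot
        // beacon is meant to capture (another is simply using the middle index).
        static int medianOfThree(int[] a, int low, int high) {
            int mid = (low + high) / 2;
            int first = a[low], middle = a[mid], last = a[high];
            if ((first <= middle && middle <= last) || (last <= middle && middle <= first)) {
                return middle;
            }
            if ((middle <= first && first <= last) || (last <= first && first <= middle)) {
                return first;
            }
            return last;
        }
    }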

6.2.2 Beacons for Searching, Heap, Basic Tree Traversal and Graph Algorithms

For the analyzed searching, heap, basic tree traversal and graph algorithms, we found the following set of beacons, which are used by the corresponding classification trees to identify implementations of these algorithms. For a list of all the computed beacons, see Publication VII.

• MPSL (MidPoint Search in a Loop): whether the implementation of the algorithm includes searching for the midpoint of an array within a loop. This mainly indicates implementations of the non-recursive binary search algorithm. To examine this beacon, we analyze the right-hand side of an assignment statement within a loop to make sure that it includes searching for the midpoint of an array, for example, mid = (low + high)/2.

• MPBR (MidPoint Before Recursion): whether the implementation includes searching for the midpoint before two recursive calls. This mainly indicates implementations of the recursive binary search algorithm. This is investigated as for the MPSL beacon, but looking for the midpoint search to occur before two recursive calls rather than within a loop.

• TSRC (Two Sequential Recursive Calls): whether the implementation includes two sequential recursive calls. This mainly indicates implementations of the preorder and postorder tree traversal algorithms and separates these implementations from implementations of the inorder traversal algorithm. We check that the two recursive calls occur one after the other; in the inorder traversal algorithm, there exists a statement between the two recursive calls.

• TPNI (Two Parent Nodes Index search): whether the implementation includes searching for the indexes of two parent nodes before and after a loop. This mainly indicates implementations of the heap insertion algorithm. Detecting this beacon involves identifying the computation of the index of the parent of a given node with index i, which is i/2.

• LRCI (Left and Right Child node Index search): whether the implementation includes searching for the indexes of the left and right child nodes within a loop. This mainly indicates implementations of the heap remove algorithm. This beacon is detected by analyzing LEFT_CHILD_INDEX_SEARCH and RIGHT_CHILD_INDEX_SEARCH, which for a node with index i are 2i and 2i + 1, respectively. Some implementations compute the index of the right child of a node by simply incrementing the index of its left child by one, instead of computing it using the index of the node².

• DUTHL (Distance Update within THree nested Loops): whether the implementation includes distance updating performed within three nested loops. This mainly indicates implementations of Floyd's algorithm. The existence of this beacon is examined by analyzing the occurrence of the DISTANCE_UPDATING operation within three nested loops. More specifically, we investigate whether the given implementation includes the following statement in the nested loops: if v.d > u.d + w(u, v) then v.d = u.d + w(u, v). That is, the process of DISTANCE_UPDATING for an edge (u, v) involves examining whether the shortest path found so far to the vertex v can be improved by going through the vertex u, and updating the shortest path to v if this is the case.
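For concreteness, the fragment below sketches the distance-updating statement inside three nested loops that the DUTHL beacon targets. It uses the common adjacency-matrix formulation of Floyd's algorithm (the thesis text expresses the update with vertex attributes v.d and edge weights w(u, v)); it is a generic sketch, not code from the analyzed data sets.

    public class FloydSketch {
        // dist[u][v] initially holds the edge weight w(u, v), or a large value
        // if there is no edge; after the loops it holds the shortest distances.
        static void floyd(double[][] dist) {
            int n = dist.length;
            for (int k = 0; k < n; k++) {            // intermediate vertex
                for (int u = 0; u < n; u++) {        // source vertex
                    for (int v = 0; v < n; v++) {    // target vertex
                        // DISTANCE_UPDATING: the statement the DUTHL beacon looks for.
                        if (dist[u][v] > dist[u][k] + dist[k][v]) {
                            dist[u][v] = dist[u][k] + dist[k][v];
                        }
                    }
                }
            }
        }
    }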


² If the tree root is at index 0, the parent, left child and right child of each node are located at (i − 1)/2, 2i + 1 and 2i + 2, respectively. We have considered these cases in the implementation for detecting the beacon and the corresponding schema as well.

Some of these beacons may seem generic and likely to occur in implementations of the other algorithms as well. For example, one could say that searching for the midpoint of an array (in the MPSL and MPBR beacons found in non-recursive and recursive binary search algorithms) may also exist in implementations of Mergesort (in the code for dividing the array into two halves) and Quicksort (for selecting the middle index of the array as a pivot). Likewise, it could be argued that “two sequential recursive calls” (the TSRC beacon found in preorder and postorder traversal algorithms) are also present in Mergesort and Quicksort implementations. However, it should be noted that these beacons are closely related to the schemas illustrated in Figure 6.4 and their values are set to true only when these schemas are detected. On the other hand, these schemas depict implementations of the corresponding algorithms at a very high level of abstraction and do not show the code related to the details that could be utilized to detect them. In the implementation of the Aari system, various checks have been added to prevent the values of the aforementioned beacons from being falsely set to true in the case of the other algorithms. These checks are essential in order to identify the true value of those beacons that may seem to be shared by several algorithms, as discussed above. For example, in the case of the MPSL beacon, we check that the algorithm is not recursive and that the midpoint search occurs within a loop. For the MPBR beacon, the relationship between the block of the midpoint search and the block of the recursive call is checked. Similarly, for the TSRC beacon, by investigating the code fragments before and after the recursive calls, we make sure that the schemas of Mergesort and Quicksort (as shown in Figure 6.3) are not in question before setting the value of this beacon to true.

7. Empirical Studies and Results

In this chapter, we summarize the empirical studies conducted to evaluate the performance of our methods throughout their development. For each empirical study, we only present brief highlights of the results, referring the reader to the publication where the results are published. We first present an overview of the data sets used in the empirical studies, as well as a brief description of the objectives and contributions of the empirical studies and the way they are related to each other. The layout of the tables used for presenting the results of the empirical studies in this chapter is not necessarily identical to that of the tables presenting the same results in the publications. We have changed the appearance of some tables in order to make them consistent and to give a logical and understandable summary here.

7.1 An Overview of the Data Sets and Empirical Studies

7.1.1 The Data Sets

We collected three different data sets for our empirical studies (MS, SUB, and MIX). These data sets are shown in the second column of Table 7.1. The first data set was collected from various learning resources, including textbooks and the Web, and a few instances were from students' submissions. It mainly consists of the implementations of five basic sorting algorithms, as indicated in Table 7.1. The category “Others” includes the implementations of other sorting algorithms and hybrid implementations of, for example, Quicksort-Insertion sort, as well as the implementations of algorithms from other fields, such as binary search. The second data set was collected from genuine students' sorting algorithm implementations in a first-year data structures and algorithms course. “Others” consists of the implementations of other standard algorithms (Shellsort and Heapsort), the implementations of less-known inefficient sorting algorithms (such as Bozo sort) and the implementations of student-made inefficient algorithms (see Publication III for more details on these variations). Finally, we collected the third data set from textbooks and websites.

First data set
Algorithm        MS    MS1   MS2                                MS3
Bubble sort      41    41    41                                 Bubble: 26, Bubble WF: 15
Insertion sort   52    52    Insertion: 43, Insertion WS: 9     Insertion: 43, Insertion WS: 9
Selection sort   43    43    Selection: 43, Selection WILS: 0   Selection: 43, Selection WILS: 0
Mergesort        34    34    34                                 34
Quicksort        39    39    39                                 Quicksort: 22, Quicksort EP: 17
Others           78    -     -                                  -
Total            287   209   209                                209

Second data set
Algorithm        SUB   SUB1  SUB2
Bubble sort      29    29    Bubble: 14, Bubble WF: 15
Insertion sort   17    17    17
Insertion WS     10    10    10
Selection sort   36    36    36
Selection WILS   13    13    13
Mergesort        20    20    20
Quicksort        34    34    Quicksort: 15, Quicksort EP: 19
Others           33    -     -
Total            192   159   159

Third data set
Algorithm               MIX   Abbreviation
Non-recu. BinSearch     36    NBS
Recursive BinSearch     13    RBS
Depth First Search      15    DFS
Inorder Traversal       23    InT
Preorder Traversal      24    PreT
Postorder Traversal     22    PostT
Heap Insertion          22    HeapI
Heap Remove             21    HeapR
Dijkstra's algorithm    23    Dijkstra
Floyd's algorithm       23    Floyd
Total                   222

Table 7.1. The data sets and their subsets used in different empirical studies. In the third data set, the depth first search algorithm is recursive and heap insertion and remove are non-recursive. Insertion WS: Insertion With Swap, Selection sort WILS: Selection With Inner Loop Swap, Bubble sort WF: Bubble sort With Flag, Quicksort EP: Quicksort with Efficient Pivot selection. MS: Multi-Source algorithm implementations (collected from textbooks and websites), MS1: Multi-Source sorting algorithm implementations collected from MS, MS2: Multi-Source sorting algorithm implementations including the Insertion WS and Selection WILS variations, collected from MS1, MS3: Multi-Source sorting algorithm implementations including the Insertion WS, Selection WILS, Bubble WF and Quicksort EP variations, collected from MS2, SUB: Submissions (authentic students' submissions), SUB1: Submissions of sorting algorithm implementations collected from SUB, SUB2: Submissions of sorting algorithm implementations including the Bubble WF and Quicksort EP variations, collected from SUB1, MIX: algorithms from different fields (collected from textbooks and websites)

Table 7.1 also shows several subsets of these data sets that we used in different empirical studies (the third, fourth and fifth columns). These subsets were formed by further analyzing their corresponding main data sets. There are two reasons for using these subsets: first, to select a group of algorithms we want to examine; for example, MS1 (for the abbreviations, see the caption of the table) selects only the implementations of the five sorting algorithms from the data set MS; second, to differentiate between the variations of the algorithms that we need to analyze. These variations are Insertion WS (Insertion With Swap, a variation of Insertion sort that swaps the elements in the inner loop instead of shifting them) and Selection sort WILS (Selection With Inner Loop Swap, a variation of Selection sort that swaps each element that is in a wrong position compared to the element pointed to by the loop counter of the outer loop, instead of storing its position and swapping it once in the outer loop). In addition, we distinguish between two optimized sorting algorithms and their basic versions: Quicksort EP (Quicksort with Efficient Pivot selection, which uses a more efficient pivot selection than simply picking the pivot from the left or right end of the given array) and Bubble sort WF (Bubble sort With Flag, an optimized version which uses a boolean value to indicate whether a swap operation is performed in the inner loop, and terminates if not). The aim is to identify these variations and versions in students' work and give useful feedback on them. For more details, see Publication VI. As an example, the subset

As an example, the subset MS3 further distinguishes these variations and optimized versions from the data set MS, as Table 7.1 illustrates. We gathered these three main data sets and formed the related subsets presented in Table 7.1 during our research in order to evaluate the performance of our methods, since no data set exists or is publicly available for this purpose. More details on the first, second and third data sets can be found in Publication I, Publication III and Publication VII, respectively.
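To make the difference between a standard algorithm and one of its inefficient variations concrete, the following Java sketch contrasts a textbook Insertion sort with the Insertion WS variation described above. The code is our own illustration rather than an implementation taken from the data sets; both methods produce the same sorted array, but the variation performs redundant writes by swapping instead of shifting.

```java
// Standard Insertion sort: shifts larger elements to the right and
// places the key element once per iteration of the outer loop.
static void insertionSort(int[] a) {
    for (int i = 1; i < a.length; i++) {
        int key = a[i];
        int j = i - 1;
        while (j >= 0 && a[j] > key) {
            a[j + 1] = a[j];   // shift, no swap
            j--;
        }
        a[j + 1] = key;
    }
}

// Insertion WS: the same ordering logic, but the inner loop swaps
// adjacent elements instead of shifting, which causes extra writes.
static void insertionSortWithSwap(int[] a) {
    for (int i = 1; i < a.length; i++) {
        for (int j = i; j > 0 && a[j - 1] > a[j]; j--) {
            int tmp = a[j - 1];
            a[j - 1] = a[j];
            a[j] = tmp;
        }
    }
}
```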

7.1.2 The Publications and Empirical Studies

Table 7.2 outlines the topics of the publications and summarizes the related empirical studies. For each empirical study, the data set and the main contribution are presented. In the following sections, we present the main results of each empirical study and explain how it improves the previous one (where applicable).

Pub. I (data set MS): Analysis of sorting algorithm implementations and manual selection of the best characteristics and beacons. Developing a method to recognize implementations of a testing set based on the selected characteristics and beacons. Using a hold-out method to evaluate the performance of the method.

Pub. II (data set MS1): Using the C4.5 algorithm to automatically select the best characteristics and beacons and build a decision tree to classify sorting algorithms. Evaluating the estimated accuracy of the classification by leave-one-out cross-validation.

Pub. III (data set SUB): Categorizing student-implemented sorting algorithms and identifying their inefficient variations in order to develop a tool that can automatically recognize these variations and give feedback to students on their problematic solutions.

Pub. IV (data sets MS1+SUB): Using the data set MS1 to construct a classification tree (as described in Publication II) and recognizing students' implementations of sorting algorithms (SUB) by the tree.

Pub. V (data sets MS2+SUB1): Developing a method to recognize algorithms by detecting algorithmic schemas from source code. Applying the method to sorting algorithms and their variations and evaluating the performance of the method using a combined data set. Using "Others" from the MS and SUB data sets to test how the method performs on the other algorithm implementations.

Pub. VI (data sets MS3+SUB2): Combining the SDM and CLM to enhance the reliability and performance. Developing techniques to identify the optimized versions of the sorting algorithms in addition to their variations, in order to give useful feedback. Evaluating the estimated accuracy of the classification using leave-one-out cross-validation.

Pub. VII (data set MIX): Extending the SDM and CLM to searching, tree traversal, heap and graph algorithms. Defining a set of schemas and beacons for recognizing these algorithms and evaluating the performance of the SDM and CLM.

UP (data sets MIX+MS1+SUB): Using combined learning data (MIX+MS1) to construct a classification tree, evaluating its performance using the leave-one-out cross-validation technique and recognizing previously unseen students' sorting implementations in the SUB data set using the tree (as in Publication IV).

Table 7.2. A summary of the publications and the related empirical studies, their objectives/contributions, the data set(s) they use and the way they are related to each other. UP: unpublished

The results of the empirical study discussed in Publication VII are not reported in detail in that paper due to space limitations. We discuss them at the end of Section 7.5. Moreover, the final empirical study, denoted by UP in Table 7.2, is not included in the publications and thus will be discussed in more detail in Section 7.6.

To evaluate the classification performance, we use the following widely used metrics: True Positive (TP, implementations that are correctly assigned to a class), False Positive (FP, implementations that are incorrectly assigned to a class) and False Negative (FN, implementations that belong to a class but are not assigned to it). Moreover, based on these metrics, we discuss the results in terms of precision (the proportion of correctly recognized positive case implementations among all implementations recognized as positive cases, i.e., precision = TP/(TP+FP)), recall or true positive rate (the proportion of correctly recognized positive case implementations among all implementations that should have been recognized as positive cases, i.e., recall = TP/(TP+FN)) and False Negative Rate (the proportion of incorrectly rejected implementations among all implementations that should have been recognized as positive cases, i.e., FNR = FN/(TP+FN)). True Negative (TN) cases are instances that are correctly rejected for each algorithm class. Because most instances are negative with respect to any single class, the proportion of TN cases compared to TP, FP and FN cases is excessively large. This is considered a problematic situation in the literature [95]. Moreover, in some applications of classification, the number of TN cases may remain unknown (e.g., when identifying web documents based on queries provided to a web search engine) [95]. Instead, for example, precision and recall are widely used, since their values, as discussed above, do not depend on TN cases. In the results presented in this chapter, TN cases are not as informative as the aforementioned metrics and therefore we do not discuss them here.

In addition, we use the confusion matrix to discuss the incorrectly identified implementations in more detail. The confusion matrix is an N × N matrix, where each entry Iij gives the number of instances that belong to class Ii but are classified as class Ij [96]. The instances located on the diagonal are classified correctly. For the explanations of the abbreviations of the beacons discussed in this chapter, see Section 6.2.
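For readers who prefer code to formulas, the following Java sketch computes the per-class measures defined above directly from a confusion matrix. It is only an illustration of the definitions; the class names and matrix contents are placeholders, not values from our studies.

```java
/** Per-class precision, recall and FNR from an N x N confusion matrix m,
 *  where m[i][j] counts instances of class i that were classified as class j. */
static void printMeasures(int[][] m, String[] classNames) {
    for (int c = 0; c < m.length; c++) {
        int tp = m[c][c];
        int fn = 0, fp = 0;
        for (int j = 0; j < m.length; j++) {
            if (j == c) continue;
            fn += m[c][j];   // instances of class c assigned elsewhere
            fp += m[j][c];   // instances of other classes assigned to c
        }
        double precision = (tp + fp == 0) ? 0.0 : (double) tp / (tp + fp);
        double recall    = (tp + fn == 0) ? 0.0 : (double) tp / (tp + fn);
        double fnr       = (tp + fn == 0) ? 0.0 : (double) fn / (tp + fn);
        System.out.printf("%s: precision=%.3f recall=%.3f FNR=%.3f%n",
                classNames[c], precision, recall, fnr);
    }
}
```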

7.2 Manual Analysis and the Classification Tree Constructed by the C4.5 Algorithm

7.2.1 Manual Analysis

We analyzed the implementations of 70 sorting algorithms of the MS data set for Bubble sort, Insertion sort, Selection sort, Quicksort and Mergesort, as shown in Table 7.1. We discerned a set of characteristics and beacons (see Section 5.2) and, based on our judgment, decided how to use them to help us distinguish between implementations of these algorithms. We used the numerical characteristics to filter out the implementations of the testing set that have greater or smaller values of these characteristics than those of the analyzed 70 implementations. Moreover, we used the beacons to differentiate between those implementations that pass this filter. We implemented a tool that automatically computed all the characteristics and beacons and assigned a type to a given implementation based on the logic discussed above. We tested the performance of the method using 217 implementations of sorting and other algorithms (see Section 7.1). 86% of the implementations of the testing set were recognized correctly. More details on this study can be found in Publication I.
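The two-stage logic described above (range filters on numerical characteristics, then beacons to decide among the candidates that pass) could be sketched in Java roughly as follows. The characteristic names, ranges and beacon tests here are hypothetical placeholders chosen for illustration; they are not the actual characteristics, thresholds or beacons selected in Publication I.

```java
import java.util.ArrayList;
import java.util.List;

class ManualClassifierSketch {

    /** Characteristics and beacons computed from one implementation (illustrative only). */
    record Features(int linesOfCode, int loopDepth, boolean recursive,
                    boolean swapInInnerLoop, boolean mergesIntoAuxiliaryArray) {}

    static List<String> candidateAlgorithms(Features f) {
        List<String> candidates = new ArrayList<>();
        // Stage 1: numerical characteristics filter out implementations whose
        // values fall outside the ranges observed in the analyzed implementations.
        if (!f.recursive() && f.loopDepth() == 2
                && f.linesOfCode() >= 8 && f.linesOfCode() <= 30) {
            candidates.add("iterative quadratic sort");
        }
        if (f.recursive()) {
            candidates.add("recursive divide-and-conquer sort");
        }
        // Stage 2: beacons differentiate between the implementations that pass the filter.
        if (candidates.contains("recursive divide-and-conquer sort")
                && f.mergesIntoAuxiliaryArray()) {
            return List.of("Mergesort");
        }
        if (candidates.contains("iterative quadratic sort") && f.swapInInnerLoop()) {
            return List.of("Bubble sort or a swap-based variation");
        }
        return candidates;
    }
}
```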

7.2.2 The Classification Tree Constructed by the C4.5 Algorithm

It is very difficult to manually select the best distinguishing factors, even in a seemingly simple task of classifying the five sorting algorithms. This is illustrated by our next study, explained in Publication II. In this empirical study, we automated the process of constructing a classification tree using the C4.5 algorithm¹. Since our objective was to examine what kind of classification tree the C4.5 algorithm builds for the five sorting algorithms and how accurately this tree performs, as our learning data we used the MS1 data set, which includes only the implementations of these algorithms. The C4.5 algorithm selects the best splits and builds a compact and understandable classification tree, shown in Figure 7.1. Evaluated by the leave-one-out cross-validation method, the performance estimate of the classification is 98,1% (i.e., 205 of the total 209 implementations of the data set are classified correctly). In Publication II, we defined FP as the cases where an algorithm implementation that does not belong to the target set of the five sorting algorithms is incorrectly recognized as one of them. As a result of this definition, since all the implementations of the data set belong to one of the five sorting algorithms, FP cases did not occur in the empirical study (see Table 7.3 from Publication II, where false positives are shown as FP´ to signify this definition). However, if we use the definition of FP cases discussed in Subsection 7.1.2, that is, if we also consider as FP cases those implementations of the target set that are falsely recognized as another class of the target set, then the value of FP cases would be 4, since four Insertion sort implementations are falsely recognized as Bubble sort. Thus, the results presented by the commonly used performance measures would be as shown in Table 7.4. Moreover, as Table 7.3 shows, in Publication II we defined a TN case as correctly rejecting an implementation which does not belong to any member of the target set. This definition resulted in the value of the TN cases being zero in the empirical study as well (shown as TN´ in Table 7.3). For more details, see Publication II.

Note that the materials used for evaluating the accuracy of the classification presented in this section and in Section 7.2.1 (reported in Publication I) are different. In Publication II, we have presented a comparison between the accuracy of these classifications achieved by using the same material.

1 In our empirical studies, we used J48, which is an open source Java implementation of the C4.5 algorithm in the Weka data mining software, developed at the University of Waikato. URL: http://www.cs.waikato.ac.nz/~ml/weka/
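As a rough illustration of the workflow in this subsection, the tree construction and the leave-one-out evaluation can be reproduced with Weka's J48 along the following lines. The input file name and the attribute layout are assumptions made for the sake of the example; they are not the actual data set of Publication II.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SortingTreeSketch {
    public static void main(String[] args) throws Exception {
        // Each row is a characteristic/beacon vector; the last attribute is the
        // algorithm class. "sorting-vectors.arff" is a hypothetical file name.
        Instances data = DataSource.read("sorting-vectors.arff");
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();            // Weka's open-source C4.5 implementation
        tree.buildClassifier(data);
        System.out.println(tree);        // prints the constructed decision tree

        // Leave-one-out cross-validation: the number of folds equals the number of
        // instances, so each instance is classified by a tree trained on the rest.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, data.numInstances(), new Random(1));
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());   // confusion matrix
    }
}
```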


Figure 7.1. The classification tree constructed by the C4.5 algorithm for recognizing sorting algorithm implementations

7.3 Students’ Sorting Algorithm Implementations, a Categorization and Automatic Recognition

We collected authentic students' sorting algorithm implementations in a first-year data structures and algorithms course (the data set denoted by SUB in Table 7.1). We used it for two purposes: 1) examining what variations of sorting algorithms students implement in order to discover their problematic solutions and insufficient understanding; this helps us further develop our method to detect these variations and give automatic feedback on them, and 2) testing the Aari system, which uses the classification tree of Figure 7.1, with a previously unseen data set.

Algorithm   TP    FP´   FN   TN´   FNR     Recall   TNR   FPR   Total
Bubble      41    0     0    0     0       1        0     0     41
Insertion   48    0     4    0     0.077   0.923    0     0     52
Selection   43    0     0    0     0       1.0      0     0     43
Merge       34    0     0    0     0       1        0     0     34
Quick       39    0     0    0     0       1        0     0     39
Total       205   0     4    0     -       -        -     -     209

Table 7.3. The value of different metrics indicating the estimated classification accuracy using leave-one-out cross-validation (from Publication II). TNR = TN´/(TN´+FP´), FPR = FP´/(TN´+FP´). See the text for the definitions of FP´ and TN´

Algorithm   TP    FP   FN   FNR     Recall   Precision   Total
Bubble      41    4    0    0       1        0.911       41
Insertion   48    0    4    0.077   0.923    1           52
Selection   43    0    0    0       1        1           43
Merge       34    0    0    0       1        1           34
Quick       39    0    0    0       1        1           39
Total       205   4    4    -       -        -           209

Table 7.4. Different metrics used for evaluating the estimated performance of the classification using leave-one-out cross-validation

7.3.1 Categorizing the Variations

Manual categorization of students' submissions revealed that they implement many inefficient variations. Two of these variations are Insertion sort WS and Selection sort WILS, which we explained in Subsection 7.1.1. Based on these results, we developed techniques to recognize these variations (with both the SDM and the CLM). The submissions were collected in two rounds: at the beginning of the course, before the students had received any instruction on sorting algorithms, and after a lecture on sorting algorithms. Table 7.5 summarizes the types of the manually analyzed submissions separately for the first round, the second round and in total. More details on this study, such as what types of algorithms the category "Others" includes, can be found in Publication III. The main contribution of this article is the manual analysis of students' sorting algorithm implementations in order to identify and categorize their problematic solutions.

Algorithm               Round 1 (%)   Round 2 (%)   Total (%)
Bubble sort             25 (22)       4 (5)         29 (15)
Insertion sort          8 (7)         9 (11)        17 (9)
Selection sort          30 (27)       6 (8)         36 (19)
Mergesort               4 (4)         16 (20)       20 (10)
Quicksort               2 (2)         32 (40)       34 (18)
Inefficient variations  20 (18)       3 (4)         23 (12)
Others                  23 (21)       10 (13)       33 (17)
Total                   112           80            192

Table 7.5. Sorting algorithms implemented by the students in the first and second round. The values are computed manually. The percentages show the results with respect to the number of total submitted implementations in the corresponding round. For example, there are 25 implementations of Bubble sort algorithm in the first round, that is, 22% of all the implementations in this round (which is 112)

Algorithm         Round 1   Correct (%)   Round 2   Correct (%)   Total   Correct (%)   False (%)
Bubble sort       25        24 (96)       4         4 (100)       29      28 (97)       1 (3)
Insertion sort    8         8 (100)       9         8 (89)        17      16 (94)       1 (6)
Selection sort    30        30 (100)      6         6 (100)       36      36 (100)      0 (0)
Mergesort         4         1 (25)        16        11 (69)       20      12 (60)       8 (40)
Quicksort         2         2 (100)       32        32 (100)      34      34 (100)      0 (0)
Inefficient var.  20        11 (55)       3         2 (67)        23      13 (57)       10 (43)
Others            23        3 (13)        10        2 (20)        33      5 (15)        28 (85)
Total             112       79 (71)       80        65 (81)       192     144 (75)      48 (25)

Table 7.6. Students' sorting algorithm implementations (a new unseen data set) recognized automatically by the Aari system. For the first and second round, the percentages show the results with respect to the number of the implementations of each algorithm in the corresponding round. For example, the number of the implementations of the Bubble sort algorithm in the first round is 25, of which 24 implementations are correctly recognized, that is, 96 percent of the 25 implementations

7.3.2 Automatic Recognition

We tested the Aari system, which used the classification tree illustrated in Figure 7.1, with students' sorting algorithm implementations as a previously unseen data set (i.e., a data set that had not been used in constructing the classification tree). The results of this automatic recognition, summarized in Table 7.6, show that Aari performs very well on implementations of those types of sorting algorithms that it has been trained to recognize. The implementations of these algorithms are recognized with an average accuracy of about 90%. When considering all the implementations, Aari achieved an overall accuracy of 71% and 81% for the first and second round, respectively. Note that in Table 7.6, the implementations of inefficient variations are considered correctly recognized if they are recognized as the corresponding standard algorithms, that is, as Selection sort and Insertion sort. For further details, see Publication IV.

Algorithm            Detected (%)   Not detected (%)   Total
Bubble sort          63 (90,0)      7 (10,0)           70
Insertion sort       55 (91,7)      5 (8,3)            60
Insertion sort WS    17 (89,5)      2 (10,5)           19
Selection sort       68 (86,1)      11 (13,9)          79
Selection sort WILS  13 (100)       0 (0)              13
Mergesort            41 (75,9)      13 (24,1)          54
Quicksort            68 (93,2)      5 (6,8)            73
Total                325 (88,3)     43 (11,7)          368

Table 7.7. The results of detecting algorithmic schemas for sorting algorithms and their variations

7.4 Using the SDM and CLM for Recognizing Sorting Algorithms and Their Variations

7.4.1 The SDM

We developed another method for algorithm recognition that is based on the concept of algorithmic schemas. The theoretical background of the method is discussed in Chapter 3. The CLM considers the whole given program as the algorithmic code and computes the characteristics from the whole program. The aim of the SDM is to search for the fragments of code that implement the algorithm in question and select them for further analysis, leaving the irrelevant code out of the process. We applied the method to the five basic sorting algorithm implementations discussed above and to their two variations that we found by analyzing the students' implementations, namely Insertion WS and Selection WILS. The schemas for these algorithms are presented in Subsection 6.1.1, Figures 6.2 and 6.3, and the beacons in Subsection 6.2.1. To differentiate between the algorithms and variations with the same schemas, we use algorithm-specific beacons. We used the data sets MS2 and SUB1 (see Table 7.1) to evaluate the performance of the method. Table 7.7 summarizes the results. In addition, we used the implementations of other algorithms from the MS and SUB data sets (denoted by "Others" in Table 7.1) to test how many implementations would be falsely recognized as one of the sorting algorithms. Of the 111 implementations of other algorithms (78 implementations from the MS data set and 33 from the SUB data set), 10 implementations (i.e., 9 percent) were falsely detected as including one of the specified algorithmic schemas. More details on this study, such as the schemas and results, can be found in Publication V.

Algorithm        TP    FP   FN   FNR     Recall   Precision   Total
Bubble sort      38    4    2    0.050   0.950    0.905       40
Insertion sort   60    0    0    0       1        1           60
Selection sort   79    0    0    0       1        1           79
Mergesort        52    1    2    0.037   0.963    0.981       54
Quicksort        35    1    2    0.054   0.946    0.972       37
Insertion WS     15    1    4    0.211   0.789    0.938       19
Selection WILS   12    2    1    0.077   0.923    0.857       13
Bubble sort WF   30    1    0    0       1        0.968       30
Quicksort EP     36    1    0    0       1        0.973       36
Total            357   11   11   -       -        -           368

Table 7.8. The estimated classification accuracy achieved by leave-one-out cross-validation method for recognizing sorting algorithms and their variations using the CSC

7.4.2 The CSC

In our next study, we combined the SDM and CLM and developed a method that first detects the algorithmic schemas in the given program and then computes the characteristics and beacons only from the code related to these schemas, rather than from the whole program. Because the irrelevant code is not considered for further analysis, this improves the reliability of the CLM. A discussion on how the CSC works is presented in Chapter 5, particularly Section 5.1. We applied the CSC to the five sorting algorithms and their variations. In addition, we developed techniques for detecting optimized versions of two sorting algorithms, Bubble WF and Quicksort EP (see Subsection 7.1.1 for the explanation), and considered these versions in our empirical study as well. This would allow us to provide useful feedback to students on their implementations. For this empirical study, we used the data sets MS3 and SUB2. We evaluated the performance estimate of the classification using the leave-one-out cross-validation method. The estimated classification accuracy is 97.0%, that is, of the 368 instances of the data sets, the number of correctly classified instances is 357 and the number of incorrectly classified instances is 11 (i.e., 3.0%). Table 7.8 summarizes the results. For more details, especially the constructed classification tree, see Publication VI.
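The combined pipeline can be pictured in Java roughly as below. The SchemaDetector and FeatureExtractor interfaces are placeholders standing in for the components described in Chapters 5 and 6, and the Weka calls assume a J48 tree already trained as in the earlier sketch; this is an illustration of the data flow, not the actual implementation of the CSC.

```java
import weka.classifiers.trees.J48;
import weka.core.DenseInstance;
import weka.core.Instance;
import weka.core.Instances;

class CscPipelineSketch {
    /** Placeholder for the schema detection method (SDM). */
    interface SchemaDetector { String extractAlgorithmFragment(String sourceCode); }
    /** Placeholder for computing the characteristic/beacon vector. */
    interface FeatureExtractor { double[] vector(String codeFragment); }

    /** Returns the index of the predicted algorithm class. */
    static double classify(String program, SchemaDetector sdm, FeatureExtractor fx,
                           J48 trainedTree, Instances header) throws Exception {
        // 1. SDM: keep only the code fragment that implements the algorithm.
        String fragment = sdm.extractAlgorithmFragment(program);
        // 2. Characteristics and beacons are computed from that fragment only,
        //    so irrelevant code does not distort their values.
        double[] values = fx.vector(fragment);   // must match the attribute layout of 'header'
        Instance instance = new DenseInstance(1.0, values);
        instance.setDataset(header);
        // 3. CLM: the trained decision tree assigns the algorithm class.
        return trainedTree.classifyInstance(instance);
    }
}
```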

7.5 Using the CSC for Recognizing Algorithms from Other Fields

Having obtained good results from applying the CSC to sorting algorithms and their variations, we extended the methods to cover algorithms from other fields: searching, heap, basic tree traversal and graph algorithms. We collected 222 implementations of 10 different algorithms (the data set denoted by MIX in Table 7.1).

Algorithm   NBS   RBS   DFS   InT   PreT   PostT   HeapI   HeapR   Dijkstra   Floyd
NBS         35    0     0     1     0      0       0       0       0          0
RBS         0     13    0     0     0      0       0       0       0          0
DFS         0     0     14    0     0      0       0       0       1          0
InT         0     0     0     23    0      0       0       0       0          0
PreT        0     0     0     1     23     0       0       0       0          0
PostT       0     0     0     0     0      22      0       0       0          0
HeapI       0     0     1     0     0      0       21      0       0          0
HeapR       0     0     0     0     0      0       0       21      0          0
Dijkstra    0     0     1     0     0      0       0       0       22         0
Floyd       0     0     0     0     0      0       0       0       1          22

Table 7.9. The confusion matrix that shows how each implementation is recognized using leave-one-out cross-validation; the row headings indicate the actual type of each algorithm and the column headings indicate what type it was recognized as. See MIX data set in Table 7.1 for the explanations of the abbreviations

We defined the schemas and beacons related to these algorithms. These are discussed in Subsections 6.1.2 and 6.2.2, respectively. We evaluated the performance of both the SDM and the CLM. The schemas were detected with an average accuracy of 94,1%. The estimated accuracy of the classification measured by the leave-one-out cross-validation method was 97,3%. Figure 7.2 shows how accurately the implementations of the data set are recognized by the SDM. Figure 7.3 illustrates the results for the CLM. Details, including the corresponding decision tree, can be found in Publication VII. In addition to these, we present the confusion matrix of Table 7.9 to discuss the correctly and falsely classified implementations in more detail. In the table, the implementations positioned on the diagonal are recognized correctly. As an example, all the implementations of recursive binary search (RBS) are recognized correctly, while one implementation of non-recursive binary search (NBS) is falsely recognized as the inorder tree traversal algorithm (InT). Finally, we use the performance evaluation measures shown in Table 7.10 to discuss the results in further detail. As can be seen from the table, larger numbers of false positive and false negative cases correspond to lower precision and recall. For the definitions of these evaluation measures, see Subsection 7.1.2. The results presented in this section suggest that the proposed method for algorithm recognition can be extended to other fields of algorithms with high accuracy.

62 ,-(- '   .  *+  +

                              ! "# $ % $ %  $ &'( &'(! ") ' # *+   

Figure 7.2. The results of detecting the algorithmic schemas for the implementations of the data set MIX (see Table 7.1)


Figure 7.3. Correctly identified implementations of the data set MIX, obtained by evaluating the performance estimate of the classification by leave-one-out cross-validation

Algorithm                 TP    FP   FN   FNR     Recall   Precision   Total
Non-recursive BinSearch   35    0    1    0.028   0.972    1           36
Recursive BinSearch       13    0    0    0       1        1           13
Depth first search        14    2    1    0.067   0.933    0.875       15
Inorder tree traversal    23    2    0    0       1        0.920       23
Preorder tree traversal   23    0    1    0.042   0.958    1           24
Postorder tree traversal  22    0    0    0       1        1           22
Heap insertion            21    0    1    0.045   0.955    1           22
Heap remove               21    0    0    0       1        1           21
Dijkstra's algorithm      22    2    1    0.043   0.957    0.917       23
Floyd's algorithm         22    0    1    0.043   0.957    1           23
Total                     216   6    6    -       -        -           222

Table 7.10. The value of the performance evaluation measures for classifying the instances of the MIX data set evaluated by leave-one-out cross-validation method

7.6 Building a Classification Tree of Sorting and Other Algorithms for Recognizing Students' Sorting Algorithm Implementations

In this section, we describe an empirical study on using the implementations of sorting, searching, heap, tree traversal and graph algorithms (i.e., the combined data set MIX + MS1) to build a classification tree and evaluating the performance estimate of this classification using the leave-one-out cross-validation method. This would further show the generalizability of the method within the domain of basic algorithms covered in computer science education. Moreover, in our final empirical study, we used the classification tree to recognize the authentic students' sorting algorithm implementations (the data set SUB). A similar empirical study was carried out and reported in Publication IV with a simpler classification tree constructed only from the implementations of the five sorting algorithms (see Subsection 7.3.2). The purpose of this empirical study is to examine how accurately the student implementations will be recognized with a more complex classification tree. This would also show the ability of the method to be extended. This empirical study is not reported in the publications and thus its results will be covered in greater detail.

7.6.1 The Decision Tree and Classification Accuracy

Figure 7.4 shows the constructed classification tree, with the results of evaluating the accuracy of the classification presented in Table 7.11. The last column of the table indicates the algorithms as which the falsely classified implementations are recognized. Table 7.12 shows more detailed results presented by commonly used performance evaluation measures, including recall and precision. As can be seen, the classification accuracy remains high, even though a number of algorithms from different fields are covered.

7.6.2 Recognizing the Students’ Implementations

Finally, we used the classification tree of Figure 7.4 to recognize the authentic student-implemented sorting algorithms (the data set denoted by SUB in Table 7.1). We performed the evaluation for each round separately. The process in this empirical study corresponds to the steps illustrated in Figure 5.2. Table 7.13 summarizes the results for each round and in total.


Figure 7.4. The classification tree constructed for recognizing sorting, searching, heap, basic tree traversal and graph algorithms using the data sets MIX and MS1. Recur: recursive, TailR: tail recursive, UOperato: number of unique operators, Bub: Bubble sort, Insert: Insertion sort, Selec: Selection sort. For the abbreviations of the beacons and other algorithms, see Subsection 6.2.2 and Table 7.1 respectively

Algorithm                 Correct (%)   False (%)   Total   Falsely recognized as
Bubble sort               41 (100)      0 (0)       41      -
Insertion sort            51 (98,1)     1 (1,9)     52      Bubble sort
Selection sort            42 (97,7)     1 (2,3)     43      Bubble sort
Mergesort                 34 (100)      0 (0)       34      -
Quicksort                 38 (97,4)     1 (2,6)     39      Depth first search
Non-recursive BinSearch   36 (100)      0 (0)       36      -
Recursive BinSearch       13 (100)      0 (0)       13      -
Depth first search        14 (93,3)     1 (6,7)     15      Mergesort
Inorder tree traversal    23 (100)      0 (0)       23      -
Preorder tree traversal   23 (95,8)     1 (4,2)     24      Inorder tree traversal
Postorder tree traversal  22 (100)      0 (0)       22      -
Heap insertion            21 (95,5)     1 (4,5)     22      Depth first search
Heap remove               19 (90,5)     2 (9,5)     21      Heap insertion
Dijkstra's algorithm      21 (91,3)     2 (8,7)     23      Bubble, Selection
Floyd's algorithm         22 (95,7)     1 (4,3)     23      Dijkstra's algorithm
Total                     420 (97,4)    11 (2,6)    431     -

Table 7.11. The results of evaluating the estimated classification accuracy. The last column indicates the algorithms as which the falsely classified implementations are recognized

Algorithm                 TP    FP   FN   FNR     Recall   Precision   Total
Bubble sort               41    3    0    0       1        0.932       41
Insertion sort            51    0    1    0.019   0.981    1           52
Selection sort            42    1    1    0.023   0.977    0.977       43
Mergesort                 34    1    0    0       1        0.971       34
Quicksort                 38    0    1    0.026   0.974    1           39
Non-recursive BinSearch   36    0    0    0       1        1           36
Recursive BinSearch       13    0    0    0       1        1           13
Depth first search        14    2    1    0.067   0.933    0.875       15
Inorder tree traversal    23    1    0    0       1        0.958       23
Preorder tree traversal   23    0    1    0.042   0.958    1           24
Postorder tree traversal  22    0    0    0       1        1           22
Heap insertion            21    2    1    0.045   0.955    0.913       22
Heap remove               19    0    2    0.095   0.905    1           21
Dijkstra's algorithm      21    1    2    0.087   0.913    0.955       23
Floyd's algorithm         22    0    1    0.043   0.957    1           23
Total                     420   11   11   -       -        -           431

Table 7.12. Different measures used for evaluating the estimated performance of the classification using leave-one-out cross-validation

Algorithm         Round 1   Correct (%)   Round 2   Correct (%)   Total   Correct (%)   False (%)
Bubble sort       25        24 (96)       4         4 (100)       29      28 (97)       1 (3)
Insertion sort    8         7 (88)        9         9 (100)       17      16 (94)       1 (6)
Selection sort    30        30 (100)      6         6 (100)       36      36 (100)      0 (0)
Mergesort         4         1 (25)        16        14 (88)       20      15 (75)       5 (25)
Quicksort         2         2 (100)       32        32 (100)      34      34 (100)      0 (0)
Inefficient var.  20        10 (50)       3         2 (67)        23      12 (52)       11 (48)
Others            23        4 (17)        10        2 (20)        33      6 (18)        27 (82)
Total             112       78 (70)       80        69 (86)       192     147 (77)      45 (23)

Table 7.13. The results of recognizing the student-implemented sorting algorithms (as a new unseen data set)

For the first and second round, the percentages show the results with respect to the number of the implementations of each algorithm in the corresponding round. As an example, the number of the implementations of Insertion sort in the first round is 8, of which 7 implementations are correctly recognized, that is, 88 percent of the 8 implementations. Again, a comparison between these results and the results shown in Table 7.6 (reported in Publication IV) shows that although in this empirical study the classification tree is much more complex and is constructed using implementations from various fields of algorithms, the estimated accuracy of the classification remains practically the same. It should be noted that since the system is not trained to recognize the algorithms of the category "Others", these algorithms cannot be identified. If "Others" includes algorithms that should be classified by the system as belonging to a class but are not, they would be considered false negative cases. Similarly, if some algorithms from "Others" are falsely recognized as belonging to a class, they would be false positive cases.

8. Discussion and Conclusions

This chapter first discusses the issues related to the proposed method, such as its applications. This is followed by a summary of what has been done in terms of the research questions posed in Section 1.2 and directions for future work. A brief discussion on validity issues concludes this thesis.

8.1 Discussion

This thesis introduces techniques developed for recognizing basic algorithms and their variations within the scope of computer science education. Several techniques are developed and evaluated, including 1) manual analysis of the implementations of a learning data set in order to identify a set of characteristics and beacons to differentiate between algorithms, 2) applying machine learning methods to select the best discriminators to distinguish between the algorithms, build an automatic classification method and evaluate the estimated performance of the classification, 3) analyzing authentic student-implemented algorithms in order to discern the variations students implement (for sorting algorithms), 4) developing a method based on schemas for identifying the sorting algorithms and their variations, 5) developing a method that combines the SDM and CLM in order to achieve a more reliable performance, and 6) extending the methods and developing similar techniques for other fields of algorithms in order to demonstrate their potential.

8.1.1 Applications of the Method

The proposed method could support a teacher in assessing students' submissions by examining whether students have implemented the required algorithm. Although the results are not 100% accurate, due to the statistical nature of the method, assessing a major part of the submissions is of great help in reducing the burden of marking students' assignments, especially in large courses. This would allow the teacher to focus only on the solutions which do not conform to the specification, rather than assessing all the submissions. In this context, false positive cases are more difficult to track and are thus the more serious problem. If an implementation is incorrectly classified as, for example, a Quicksort (a false positive case), this will not be discovered, since the teacher accepts the positive cases and does not inspect them. However, when there are a large number of submissions to be assessed, even teachers make errors and cannot assess all the submissions with 100% accuracy. Moreover, a teacher can take random samples from the positive cases and evaluate the accuracy of the system with regard to false positive cases within a certain period of time. This will give an indication of the benefit of the system (in terms of saving the teacher's time) compared to the accepted incorrect cases. This application of the method would allow the teacher to give better personal comments to the students and pick up interesting examples to discuss with them.

On the other hand, students could use the informative feedback that the Aari system can provide on their implementations to gain a better understanding of their code. We have shown that students make problematic implementation choices in the case of sorting algorithms, and we have developed and discussed techniques that can identify these kinds of implementation choices. We can thus give automatic feedback on problematic solutions and reasonably assume that these types of feedback could be justified and beneficial. However, in order to use Aari to give feedback on a comprehensive set of algorithms, we need to do similar studies to examine student-implemented variations in other fields of algorithms. Moreover, we need to evaluate our method in an educational setting and investigate how students use the system and how useful they find it. A big concern remains the accuracy of the system. For the algorithms that Aari has been trained to recognize, it performs well (as reported in Publication IV). However, for an open task of implementing sorting algorithms, Aari achieved an average accuracy of 75%. Therefore, since making a summative assessment on whether something is right or wrong might have a negative effect, we should give feedback in the form of suggestions about what the student should look at. In short, before using Aari as a feedback providing system, we need to situate it in the educational setting and assess how it could improve education.

The method also has potential to be extended and applied in software engineering related tasks. As an example, the task in clone detection is to identify similar pieces of code. This is what our method, with the appropriate further development, would be capable of performing for the implementations that are stored in its knowledge base. However, since these activities involve dealing with large-scale software (unlike implementations in computer science education), the performance of the method in this context should be evaluated with empirical tests.

8.1.2 Our Methods and Other Research Fields

The methods introduced in this thesis for algorithm recognition (AR), which is considered a subfield of practical program comprehension (PC), draw on techniques and concepts used in the other research fields discussed in Chapter 2. For example, the methods use software metrics that are also used in program similarity evaluation techniques, and they use the concept of programming schemas introduced by studies on theoretical PC. We could benefit from the experiences and results of related studies as well. As an example, concept and feature location approaches in the PC field have achieved good results by combining dynamic and static techniques. These hybrid techniques use program dependencies and textual information gained by static analysis of source code to filter the execution traces, which are often very large and contain a lot of noise [59]. In order to increase the accuracy (i.e., decrease false positive and false negative cases), this possibility should be explored in AR as well. Moreover, we can examine the applicability of the control flow and data flow metrics used in metrics-based clone detection approaches (e.g., [62]) to see whether they provide useful information for AR.

Our SDM applies the results from theoretical PC in practical PC. Our CLM, on the other hand, utilizes machine learning techniques to recognize unseen algorithm implementations. A number of PC tools use plans in their knowledge base to facilitate the process of comprehension. These tools, however, do not use supervised machine learning widely. Our method brings supervised learning into the process of PC. The method teaches a supervised learner (i.e., a classifier), using a set of learning data instances, how to identify each algorithm based on its features. The learner is then able to apply this knowledge to recognize new and previously unseen instances. This concept of using learning in PC tools is also discussed by Gerdt and Sajaniemi in the context of developing a role detection tool based on machine learning techniques [32]. Moreover, by applying the concept of beacons from theoretical PC, we have introduced features that can be utilized to characterize and detect source code fragments in, for example, clone detection tools. An important part of these beacons are roles of variables, for which we have introduced a new application area in this thesis.

8.2 Research Questions Revisited and Future Work

We posed eight questions in Section 1.2, of which the first one was:

1. How could we automatically recognize basic algorithms and their variations from source code?

We have answered this question throughout this thesis by presenting the CLM and SDM, as well as the characteristics and beacons. With respect to this question, we conclude that our method works with high accuracy for the algorithms that it has been trained to recognize. However, the accuracy of the method decreases with previously unknown algorithms that the method has no mechanism to deal with. For example, implementations of Shell sort might be falsely recognized as, say, Bubble sort. To tackle this, we need to add an appropriate mechanism to the method so that it can deal with other algorithms that are covered in data structures and algorithms courses. The second and third questions concerned the algorithmic characteristics, beacons and schemas as follows:

2. Can algorithmic characteristics and beacons be utilized in the AR process, and how?

3. Can programming schemas facilitate automatic AR? How can we implement a method based on schemas?

To answer these questions, we introduced a set of characteristics and beacons, and developed schemas for a number of algorithms, as discussed in Chapters 5 and 6. In Chapter 7, we presented the evaluation results of the SDM, showing that the method performs with high accuracy. With respect to the schemas, an important question is the level of abstraction at which they are defined. If the level of abstraction is too high, there exists a threat that, in larger programs, false implementations would be identified as acceptable (i.e., false positive cases increase). On the other hand, a too low level of abstraction may result in identifying correct implementations as not acceptable (false negative cases). The research question with regard to RoV was:

4. How applicable and useful are RoV in recognizing basic algorithms?

The usefulness of RoV in introductory programming education has been investigated in many studies, and the results show that using RoV can increase students' skills in comprehending and constructing programs (see, e.g., [14, 86], as well as [93] for information from a teacher's point of view). According to our empirical studies, RoV are very useful beacons in the AR task as well. As an example, the most-wanted holder role distinguishes the implementations of Selection sort from the implementations of the other sorting algorithms that we analyzed. In our empirical studies, we used the tool developed by Bishop and Johnson [9] for detecting RoV. The tool detected the roles in the empirical studies on sorting algorithms very accurately. However, when analyzing searching, heap, basic tree traversal and graph algorithms, the tool did not detect the roles of the variables in the implementations of these algorithms accurately enough. Perhaps with a more accurate tool, RoV could have played a distinguishing role as beacons for these algorithms as well. As an example, the low index in a binary search (e.g., low = middle + 1) has a follower role ([83]) that could be a useful beacon for differentiating the implementations of the binary search algorithm from other implementations. Developing a role detection tool that performs with high precision is a challenging task, but as Gerdt and Sajaniemi describe in [31], good results are achieved by using data flow analysis and applying machine learning techniques to determine data flow characteristics for roles. Unfortunately, we did not get access to this tool. One interesting idea would be to go to a lower level of abstraction and directly use the data flow characteristics that define the roles, instead of the roles themselves, and evaluate which would characterize the functionality of an algorithm better.
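To make the binary search example concrete, the following Java sketch shows where the mentioned update occurs; only the follower role of the low index is taken from the discussion above (and [83]), and no other role attributions are claimed here.

```java
/** Iterative (non-recursive) binary search over a sorted array. */
static int binarySearch(int[] a, int key) {
    int low = 0;
    int high = a.length - 1;
    while (low <= high) {
        int middle = (low + high) >>> 1;   // unsigned shift avoids overflow
        if (a[middle] < key) {
            low = middle + 1;              // 'low' gets its new value from 'middle':
                                           // the follower role mentioned in the text
        } else if (a[middle] > key) {
            high = middle - 1;             // analogous update of the upper bound
        } else {
            return middle;                 // key found
        }
    }
    return -1;                             // key not found
}
```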

With respect to the applicability of machine learning techniques, we posed the following research question:

5. Can machine learning methods, and in particular the C4.5 algorithm, be used in the AR problem, and how accurate is it?

Converting implementations of algorithms to characteristic and beacon vectors and using these vectors as technical definitions of algorithms allows us to utilize machine learning techniques. Using the C4.5 decision tree classifier in our CLM, and the estimated performance of the resulting classification, illustrate the applicability of the C4.5 algorithm as a supervised machine learning classification technique. The quality of the learning and testing data sets impacts the performance of machine learning methods. In order to have representative and unbiased data sets, we have collected them from different sources, including textbooks, Web resources and students' implementations.
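As an illustration of what such a vector-based technical definition might look like as Weka input, the ARFF fragment below sketches one possible layout. The attribute names, value sets and the single data row are hypothetical placeholders, not attributes or instances from our actual data sets.

```
% Hypothetical ARFF layout for characteristic/beacon vectors (illustrative only).
@relation algorithm-implementations

@attribute lines_of_code        numeric
@attribute loop_depth           numeric
@attribute recursive            {yes, no}
@attribute swap_in_inner_loop   {yes, no}
@attribute most_wanted_holder   {yes, no}
@attribute class                {bubble, insertion, selection, merge, quick}

@data
14, 2, no, yes, no, bubble
```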

The sixth question concerned the CSC:

6. How can we combine the SDM and CLM to get more reliable results?

The CSC was discussed in Chapter 5 and its performance evaluation results were presented in Chapter 7. By selecting only the code fragment that implements the algorithm in question for further analysis, the CSC achieves more reliable performance, as the irrelevant code does not affect the values of the computed characteristics and beacons. The seventh and eighth research questions were the following:

7. How can we classify students' implementations of sorting algorithms? What kinds of variations of well-known sorting algorithms do students use?

8. How accurately can Aari recognize student-implemented sorting algorithms and their variations?

The results of categorizing sorting algorithms are presented in Subsection 7.3.1. The purpose of the categorization is to discover what types of problematic solutions students use and to develop a method to automatically identify these solutions. This allows us to give useful feedback that makes students rethink their solutions (which remains for future work). We found that students have many misconceptions related to sorting algorithms. They include unnecessary swaps in their Insertion and Selection sort implementations, which makes the code more complicated and inefficient. In this context, the term misconception is used to indicate problematic understandings or a failure to fully understand the basic principles behind some well-known algorithms. Similar studies should be done for other fields of algorithms as well. Moreover, to get a better insight into students' misconceptions, we need to ask them how they describe their own code and what they think about their solutions. We discussed the schemas and beacons developed to recognize these variations in Subsections 6.1.1 and 6.2.1.

Finally, Subsection 7.3.2 discusses the performance of the Aari system on students' implementations. In this respect, we conclude that Aari performs accurately with the algorithms that it has been trained to recognize. However, as expected, if implementations of algorithms that Aari has no mechanism to deal with are involved, the accuracy decreases. As also discussed above, covering other fields of algorithms will address this problem. In addition, Aari needs to be further evaluated with authentic students' implementations from other algorithm fields as well.

In this thesis, we have examined the application of the presented techniques to programs written in Java. As, for example, the length of a program in terms of lines of code varies from one programming language to another, a direction for future work could be to examine the techniques on materials written in other languages and compare the results.

8.3 Validity

In this section we discuss possible issues with internal and external validity involved in our research. We reflect on the possible related threats and discuss how we have addressed them.

8.3.1 Internal Validity

As roles of variables play an important role in our method, it is important to use a tool that detects roles accurately. Poor performance of such a tool is a threat to the internal validity of our research. To eliminate this threat, we used the most accurate role detector available to us. We made several improvements to the tool to increase its accuracy (see Section 5.3). As a result, for those roles that are used in recognizing sorting algorithms, the tool worked very accurately. As discussed above, for other algorithms, the tool did not perform accurately enough. A more accurate role detector is needed in order to utilize the roles in recognizing other fields of algorithms.

Another potential threat is the manual categorization of the students' implementations of sorting algorithms reported in Publication III. According to Knuth, "the classification of sorting methods into various families such as 'insertion,' 'exchange,' 'selection,' etc., is not always clear-cut." [48]. When it comes to novices' implementations, the categorization is even more difficult. In addition to the implementations of standard sorting algorithms, there were many other types of implementations in the data, such as inefficient variations, which we categorized as Selection sort with inner loop swap and Insertion sort with swap (discussed in Subsection 7.1.1). Therefore, it is possible that some implementations have been categorized wrongly. Parts of the results of our research, as well as parts of the techniques developed in the Aari system, are based on this categorization, and thus internal validity may be threatened by incorrectly categorized algorithm implementations. To rule out or reduce this threat, the categorization, and especially the unclear and borderline cases, were iteratively reconsidered and discussed by three researchers as follows. In the beginning, one of the researchers analyzed the implementations and tentatively classified them into the appropriate categories. After this, he and another researcher reviewed and discussed the categories and reconsidered especially the implementations with no obvious and clear category. As a result of this discussion, the categorization was specified and the borderline cases were reclassified if necessary. All the researchers discussed the resulting categorization and unclear implementations one more time and agreed on the final categorization. The researchers reached a consensus on the problematic cases. The implementations gathered from textbooks and websites were naturally much clearer and more reliable and thus do not pose a serious threat.

It should also be emphasized that we selected the metrics used in the techniques presented in this thesis based on a literature review. Before selecting them, we could not find any empirical evidence on their application to the algorithm recognition problem, although several of them are widely applied and evaluated in program similarity evaluation techniques. It is possible that there exist other metrics that can be applied to the problem with good results. The applicability of different metrics in terms of accuracy could be an interesting direction for future work.

In addition, there always exists a potential risk that a program may have bugs that produce incorrect results and cause the program to perform in an unexpected way. The Aari system is no exception. By doing different tests and comparing the results, we have tried to reduce this risk and improve the quality of the system.

8.3.2 External Validity

External validity is related to the extent to which the results of a study are generalizable. In our study, this refers to whether the proposed methods can be applied to different fields of algorithms with the same accuracy. In the beginning of our research, we demonstrated our methods and their performance in the case of basic sorting algorithms. We then extended our methods to the variations of these sorting algorithms, as well as to searching, heap, basic tree traversal and graph algorithms, with practically the same results. This indicates that the methods and results are generalizable. In addition, we collected the implementations of our data sets randomly, that is, with no preference for, e.g., a particular source. This makes the implementations representative and implies that, for each analyzed algorithm field, the methods are highly likely to generalize to other implementations of that algorithm field with the same results.

Bibliography

[1] Kirsti Ala-Mutka. A survey of automated assessment approaches for programming assignments. Computer Science Education, 15(2):83–102, 2005.

[2] Brenda S. Baker. On finding duplication and near-duplication in large software systems. In Proceedings of the Second Working Conference on Reverse Engineering (WCRE '95), 1995, pages 86–95. IEEE Computer Society Washington, DC, USA, 1995.

[3] Magdalena Balazinska, Ettore Merlo, Michel Dagenais, Bruno Lagüe, and Kostas Kontogiannis. Advanced clone-analysis to support object-oriented system refactoring. In Proceedings of the Seventh Working Conference on Reverse Engineering, pages 98–107. IEEE Computer Society Washington, DC, USA, 2000.

[4] Hamid Abdul Basit and Stan Jarzabek. Detecting higher-level similarity patterns in programs. In Proceedings of the 10th European Software Engineering Conference, Lisbon, Portugal, 5–9 September, pages 156–165. ACM, New York, NY, USA, 2005.

[5] Hamid Abdul Basit, Simon J. Puglisi, William F. Smyth McMaster, Andrew Turpin, and Stan Jarzabek. Efficient token based clone detection with flexible tokenization. In Proceedings of the 6th European Software Engineering Conference and Foundations of Software Engineering, ESEC/FSE 2007, pages 513–516. ACM New York, NY, USA, 2007.

[6] Ira D. Baxter, Andrew Yahin, Leonardo Moura, Marcelo Sant'Anna, and Lorraine Bier. Clone detection using abstract syntax trees. In Proceedings of the 14th IEEE International Conference on Software Maintenance, Bethesda, Maryland, USA, 16–19 March, 1998, pages 368–377. IEEE Computer Society Washington, DC, USA, 1998.

[7] Stefan Bellon, Rainer Koschke, Giuliano Antoniol, Jens Krinke, and Ettore Merlo. Comparison and evaluation of clone detection tools. IEEE Transactions on Software Engineering, 33(9):577–591, 2007.

[8] Mordechai Ben-Ari and Jorma Sajaniemi. Roles of variables as seen by CS educators. SIGCSE Bulletin, 36(3):52–56, 2004. ISSN 0097-8418.

[9] C. Bishop and C. G. Johnson. Assessing roles of variables by program analysis. In Proceedings of the 5th Baltic Sea Conference on Computing Education Research, Koli, Finland, 17–20 November, pages 131–136. University of Joensuu, Finland, 2005.

[10] Ruven Brooks. Towards a theory of the comprehension of computer programs. International Journal of Man-Machine Studies, 18(6):543–554, 1983.

[11] Ilene Burnstein and Katherine Roberson. Automated chunking to support program comprehension. In Proceedings of the 5th International Workshop on Program Comprehension (IWPC ’97), pages 40–49. IEEE Computer Society Washington, DC, USA, 1997.

[12] Ilene Burnstein, Katherine Roberson, Floyd Saner, Abdul Mirza, and Abdallah Tubaishat. A role for chunking and fuzzy reasoning in a program comprehension and debugging tool. In 9th International Conference on Tools with Artificial Intelligence (ICTAI '97), pages 102–109. IEEE Computer Society Washington, DC, USA, 1997.

[13] Irene Burnstein and Floyd Saner. An application of fuzzy reasoning to support automated program comprehension. In Proceedings of the 7th International Workshop on Program Comprehension (IWPC '99), Pittsburgh, Pennsylvania, USA, 5–7 May, pages 66–73. IEEE Computer Society Washington, DC, USA, 1999.

[14] Pauli Byckling and Jorma Sajaniemi. Roles of variables and programming skills improvement. SIGCSE Bulletin, 38(1):413–417, 2006. ISSN 0097-8418. doi: http://doi.acm.org/10.1145/1124706.1121470.

[15] Elliot J. Chikofsky and James H. Cross II. Reverse engineering and design recovery: a taxonomy. IEEE Software, 7(1):13–17, 1990.

[16] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. The MIT Press, Cambridge, Massachusetts, USA, third edition, 2009.

[17] Cynthia L. Corritore and Susan Wiedenbeck. An exploratory study of program comprehension strategies of procedural and object-oriented programmers. International Journal of Human-Computer Studies, 54(1):1–23, 2001.

[18] Martha E. Crosby, Jean Scholtz, and Susan Wiedenbeck. The roles beacons play in comprehension for novice and expert programmers. In Proceedings of the 14th Annual Workshop on the Psychology of Programming Interest Group (PPIG '02), Brunel University, London, UK, pages 58–73, 2002.

[19] Nell Dale, Daniel T. Joyce, and Chip Weems. Object-Oriented Data Structures Using Java. Jones and Bartlett Publishers, Sudbury, MA, USA, first edition, 2002.

[20] Bogdan Dit, Meghan Revelle, Malcom Gethers, and Denys Poshyvanyk. Feature location in source code: A taxonomy and survey. Journal of Software Maintenance and Evolution: Research and Practice (JSME), To appear.

[21] Françoise Détienne. Expert programming knowledge: A schema-based approach. In J.-M. Hoc, T. R. G. Green, R. Samurcay, and D. J. Gilmore, editors, Psychology of Programming, pages 205–222. Academic Press, London, 1990.

[22] Françoise Détienne. What model(s) for program understanding? In Conference on Using Complex Information Systems (UCIS '96), Poitiers, France, 1996.

[23] Stéphane Ducasse, Matthias Rieger, and Serge Demeyer. A language independent approach for detecting duplicated code. In Proceedings of the IEEE International Conference on Software Maintenance (ICSM ’99), 1999, pages 109–118. IEEE Computer Society Washington, DC, USA, 1999.

[24] Dennis Edwards, Sharon Simmons, and Norman Wilde. An approach to feature location in distributed systems. Journal of Systems and Software, 79(1):57–68, 2006.

[25] Stephen H. Edwards. Rethinking computer science education from a test-first perspective. In Companion of the 18th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications, Anaheim, California, USA, 26–30 October, pages 148–155. ACM, New York, NY, USA, 2003. ISBN 1-58113-751-6. doi: http://doi.acm.org/10.1145/949344.949390.

[26] Andrew David Eisenberg and Kris De Volder. Dynamic feature traces: Finding features in unfamiliar code. In Proceedings of the 21st IEEE International Conference on Software Maintenance (ICSM '05), 2005, pages 337–346. IEEE Computer Society Washington, DC, USA, 2005.

[27] William H. Ford and William R. Topp. Data Structures with Java. Pearson Prentice Hall, Upper Saddle River, NJ, USA, 2005.

[28] Gregory Gay, Sonia Haiduc, Andrian Marcus, and Tim Menzies. On the use of relevance feedback in IR-based concept location. In Proceedings of the 25th IEEE International Conference on Software Maintenance (ICSM '09), Edmonton, Canada, 20–26 September, 2009, pages 351–360. IEEE Computer Society Washington, DC, USA, 2009.

[29] Petri Gerdt. A system for the automatic detection of variable roles. Licentiate Thesis, Department of Computer Science, University of Joensuu, Finland, 2007.

[30] Petri Gerdt and Jorma Sajaniemi. An approach to automatic detection of variable roles in program animation. In Proceedings of the 3rd Program Visualization Workshop, the University of Warwick, UK, 1–2 July, pages 86–93. The University of Warwick, UK, 2004.

[31] Petri Gerdt and Jorma Sajaniemi. A web-based service for the automatic detection of roles of variables. In Proceedings of the 11th annual SIGCSE conference on Innovation and technology in computer science education, Bologna, Italy, 26–28 June, pages 178–182. ACM, New York, NY, USA, 2006. ISBN 1-59593-055-8. doi: http://doi.acm.org/10.1145/1140124.1140172.

[32] Petri Gerdt and Jorma Sajaniemi. Introducing learning into automatic program comprehension. In Proceedings of the 19th Annual Workshop of the Psychology of Programming Interest Group (PPIG ’07), Joensuu, Finland, July 2007, pages 101–115. International Proceedings Series 7, University of Joensuu, Department of Computer Science and Statistics, 2007.

[33] Michael W. Godfrey and Lijie Zou. Using origin analysis to detect merging and splitting of source code entities. IEEE Transactions on Software Engineering, 31(2):166–181, 2005.

[34] Sam Grier. A tool that detects plagiarism in pascal programs. In Proceedings of the twelfth SIGCSE technical symposium on Computer science education (SIGCSE ’81), pages 15–20. ACM New York, NY, USA, 1981.

[35] Maurice H. Halstead. Elements of Software Science. Elsevier Science Inc, New York, NY, USA, 1977.

[36] Mehdi T. Harandi and Jim Q. Ning. Knowledge-based program analysis. IEEE Software, 7(1):74–81, 1990.

[37] Colin Higgins, Pavlos Symeonidis, and Athanasios Tsintsifas. The marking system for CourseMaster. In Proceedings of the 7th annual conference on Innovation and Technology in Computer Science Education, Aarhus, Denmark, 24–26 June, pages 46–50. ACM, New York, NY, USA, 2002. ISBN 1-58113-499-1. doi: http://doi.acm.org/10.1145/544414.544431.

[38] Petri Ihantola, Ville Karavirta, Otto Seppälä, and Tuukka Ahoniemi. Review of recent systems for automatic assessment of programming assignments. In Proceedings of the 10th Koli Calling International Conference on Computing Education Research (Koli Calling 2010), pages 86–93. ACM New York, NY, USA, 2010.

[39] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys (CSUR), 31(3):264–323, 1999.

[40] Jeong-Hoon Ji, Gyun Woo, and Hwan-Gue Cho. A source code linearization technique for detecting plagiarized programs. In Proceedings of the 12th annual SIGCSE conference on Innovation and technology in computer science education (ITiCSE ’07), pages 73–77. ACM New York, NY, USA, 2007.

[41] W. Lewis Johnson and Elliot Soloway. PROUST: Knowledge-based program understanding. In Proceedings of the 7th international conference on Software engineering, Orlando, Florida, USA, 26–29 March, pages 369–380. IEEE Press Piscataway, NJ, USA, 1984.

[42] Edward L. Jones. Metrics based plagiarism monitoring. In Proceedings of the sixth annual CCSC northeastern conference on The journal of computing in small colleges (CCSC ’01), pages 253–261. Consortium for Computing Sciences in Colleges, USA, 2001.

[43] Mike Joy, Nathan Griffiths, and Russell Boyatt. The BOSS online submission and assessment system. ACM Journal on Educational Resources in Computing, 5(3):1–28, 2005.

[44] Elmar Juergens, Florian Deissenboeck, Benjamin Hummel, and Stefan Wagner. Do code clones matter? In Proceedings of the 31st International Conference on Software Engineering (ICSE ’09), pages 485–495. IEEE Computer Society Washington, DC, USA, 2009.

[45] Cory Kapser and Michael W. Godfrey. A taxonomy of clones in source code: The re-engineers most wanted list. In Proceedings of IWDSC’03, 2003.

[46] Cory Kapser and Michael W. Godfrey. Toward a taxonomy of clones in source code: A case study. In Evolution of large scale industrial software architectures, pages 67–78, 2003.

[47] Young-Chul Kim and Jaeyoung Choi. A program plagiarism evaluation system. In Proceedings of the 2005 international conference on Computational Science and Its Applications (ICCSA ’05), pages 10–19. Springer-Verlag Berlin, Heidelberg, 2005.

[48] Donald E. Knuth. The Art of Computer Programming: Sorting and Searching, volume 3. Addison-Wesley, New Jersey, USA, second edition, 1998.

[49] Ron Kohavi and John Ross Quinlan. Decision tree discovery. In Handbook of Data Mining and Knowledge Discovery, pages 267–276. University Press, 1999.

[50] Raghavan Komondoor and Susan Horwitz. Using slicing to identify duplication in source code. In Proceedings of the 8th International Symposium on Static Analysis, SAS 2001, Paris, France, 16–18 July, 2001, pages 40–56. Springer, 2001.

[51] Kostas A Kontogiannis, R DeMori, Ettore M Merlo, M Galler, and Morris Bernstein. Pattern matching for clone and concept detection. Journal of Automated Software Engineering, 3(1–2):77–108, 1996.

[52] S. B. Kotsiantis. Supervised machine learning: A review of classification techniques. Informatica, An International Journal of Computing and Informatics, 31(3):249–268, 2007.

[53] Wojtek Kozaczynski, Jim Ning, and Tom Sarver. Program concept recognition. In Proceedings of the Seventh Knowledge-Based Software Engineering Conference (KBSE ’92), Los Alamitos, CA, 20–23 September, 1992, pages 216–225. IEEE Computer Society Press, 1992.

[54] Jens Krinke. Identifying similar code with program dependence graphs. In Proceedings of the 8th Working Conference on Reverse Engineering (WCRE ’01), 2001, pages 301–309. IEEE Computer Society Washington, DC, USA, 2001.

[55] Marja Kuittinen and Jorma Sajaniemi. Teaching roles of variables in elementary programming courses. In Proceedings of the 9th annual SIGCSE conference on Innovation and technology in computer science education (Leeds, United Kingdom, 2004), pages 57–61, 2004.

[56] Ronald J. Leach. Using metrics to evaluate student programs. ACM SIGCSE Bulletin, 27(2):41–43, 1995.

[57] Anany Levitin. Introduction to the Design and Analysis of Algorithms. Pearson Education Inc., Boston, MA, USA, third edition, 2012.

[58] Tjen-Sien Lim, Wei-Yin Loh, and Yu-Shan Shih. A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning, 40(3):203–228, 2000.

[59] Dapeng Liu, Andrian Marcus, Denys Poshyvanyk, and Vaclav Rajlich. Feature location via information retrieval based filtering of a single scenario execution trace. In Proceedings of the twenty-second IEEE/ACM international conference on Automated software engineering (ASE ’07), Atlanta, Georgia, 5–9 November, 2007, pages 234–243. ACM New York, NY, USA, 2007.

[60] Andrian Marcus and Jonathan I. Maletic. Identification of high-level concept clones in source code. In Proceedings of the 16th IEEE International Conference on Automated Software Engineering, San Diego, California, 26–29 November, pages 107–114. IEEE, Washington, DC, USA, 2001.

[61] Andrian Marcus, Andrey Sergeyev, Vaclav Rajlich, and Jonathan I. Maletic. An information retrieval approach to concept location in source code. In Proceedings of the 11th Working Conference on Reverse Engineering (WCRE ’04), Delft, The Netherlands, 8–12 November, 2004, pages 214–223. IEEE Computer Society Washington, DC, USA, 2004.

[62] Jean Mayrand, Claude Leblanc, and Ettore Merlo. Experiment on the automatic detection of function clones in a software system using metrics. In Proceedings of the 1996 International Conference on Software Maintenance (ICSM ’96), November 1996, pages 244–254. IEEE Computer Society Washington, DC, USA, 1996.

[63] Thomas J. McCabe. A complexity measure. IEEE Transactions on Software Engineering, SE-2:308–320, 1976.

[64] Katherine B. McKeithen, Judith S. Reitman, Henry H. Rueter, and Stephen C. Hirtle. Knowledge organization and skill differences in computer programmers. Cognitive Psychology, 13(3):307–325, 1981.

[65] Robert Metzger and Zhaofang Wen. Automatic Algorithm Recognition and Replacement. The MIT Press, USA, 2000.

[66] Dennis M. Miller, Robert S. Maness, James William Howatt, and Wade H. Shaw. A software science counting strategy for the full Ada language. ACM SIGPLAN Notices, 22(5):32–41, 1987.

[67] Maxim Mozgovoy. Enhancing Computer-Aided Plagiarism Detection. Doctoral dissertation, University of Joensuu, 2007.

[68] Sreerama K. Murthy. Automatic construction of decision trees from data: A multi-disciplinary survey. Data Mining and Knowledge Discovery, 2(4): 345–389, 1998.

[69] Michael O’Brien. Software comprehension – a review and research direction. Technical Report no. UL-CSIS-03-3. University of Limerick, pages 176–185, 2003.

[70] Nancy Pennington. Comprehension strategies in programming. Empirical studies of programmers: second workshop, pages 100–113, 1987.

[71] Denys Poshyvanyk and Andrian Marcus. Combining formal concept analysis with information retrieval for concept location in source code. In Proceedings of the 15th International Conference on Program Comprehension (ICPC ’07), Banff, Alberta, Canada, 26–29 June, 2007, pages 37–48. IEEE Computer Society Washington, DC, USA, 2007.

[72] Denys Poshyvanyk and Andrian Marcus. Feature location using probabilistic ranking of methods based on execution scenarios and information retrieval. IEEE Transactions on Software Engineering, 33(6):420–432, 2007.

[73] Denys Poshyvanyk, Malcom Gethers, and Andrian Marcus. Concept location using formal concept analysis and information retrieval. ACM Transactions on Software Engineering and Methodology (TOSEM), 21(4), 2012.

[74] Alex Quilici. A memory-based approach to recognizing programming plans. Communications of the ACM, 37(5):84–93, 1994.

[75] Alex Quilici. Reverse engineering of legacy systems: a path toward success. In Proceedings of the 17th international conference on Software engineering, Seattle, Washington, USA, 24–28 April, pages 333–336. ACM, New York, NY, USA, 1995.

[76] John Ross Quinlan. Induction of decision trees. Machine Learning, 1(1): 81–106, 1986.

[77] John Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, USA, 1993.

[78] Michael J. Rees. Automatic assessment aids for Pascal programs. ACM SIGPLAN Notices, 17(10):33–42, 1982.

[79] Robert S. Rist. Schema creation in programming. Cognitive Science, 13: 389–414, 1989.

[80] Chanchal K. Roy and James R. Cordy. A survey on software clone detection research. Queen’s Technical Report:541, 115 pp., 2007.

[81] Chanchal K. Roy, James R. Cordy, and Rainer Koschke. Comparison and evaluation of code clone detection techniques and tools: A qualitative approach. Science of Computer Programming, 74(7):470–495, 2009.

[82] Filip Van Rysselberghe and Serge Demeyer. Studying software evolution using clone detection. In Proceedings of the 4th ECOOP ’03 International Workshop on Object-Oriented Reengineering (WOOR ’03), Darmstadt, Germany, July 2003, pages 71–75. USENIX Association Berkeley, CA, USA, 2003.

[83] Jorma Sajaniemi. Visualizing roles of variables to novice programmers. In Proceedings of the 14th Annual Workshop on the Psychology of Programming Interest Group (PPIG ’02), Brunel University, London, UK, 2002.

[84] Jorma Sajaniemi and Raquel Navarro Prieto. Roles of variables in experts’ programming knowledge. In Proceedings of the 17th Annual Workshop on the Psychology of Programming Interest Group (PPIG ’05), University of Sussex, UK, 2005.

[85] Jorma Sajaniemi. An empirical analysis of roles of variables in novice-level procedural programs. In Proceedings of the IEEE 2002 Symposia on Human Centric Computing Languages and Environments, Arlington, Virginia, USA, 3–6 September, pages 37–39. IEEE Computer Society Washington, DC, USA, 2002.

[86] Jorma Sajaniemi and Marja Kuittinen. An experiment on using roles of variables in teaching introductory programming. Computer Science Education, 15(1):59–82, 2005.

[87] Jorma Sajaniemi, Mordechai Ben-Ari, Pauli Byckling, Petri Gerdt, and Yevgeniya Kulikova. Roles of variables in three programming paradigms. Computer Science Education, 16(4):261–279, 2006.

[88] Norman F. Salt. Defining software science counting strategies. ACM SIGPLAN Notices, 17(3):58–67, 1982.

[89] Carsten Schulte, Tony Clear, Ahmad Taherkhani, Teresa Busjahn, and James H. Paterson. An introduction to program comprehension for computer science educators. In Proceedings of the 2010 ITiCSE working group reports on Innovation and technology in computer science education (ITiCSE-WGR ’10), Bilkent, Ankara, Turkey, June 26–30, 2010, pages 65–86. ACM, New York, NY, USA, 2010.

[90] Robert Sedgewick. Algorithms in Java, Parts 1-4. Pearson Education Inc., Boston, MA, USA, third edition, 2002.

[91] Elliot Soloway and Kate Ehrlich. Empirical studies of programming knowl- edge. IEEE Transactions on Software Engineering, 10(5):595–609, 1984.

[92] Elliot Soloway, Beth Adelson, and Kate Ehrlich. Knowledge and processes in the comprehension of computer programs. In M. Chi, R. Glaser, and M. Farr, editors, The Nature of Expertise, pages 129–152. Lawrence Erlbaum Associates, 1988.

[93] Juha Sorva, Ville Karavirta, and Ari Korhonen. Roles of variables in teaching. Journal of Information Technology Education, 6:407–423, 2007. URL http://jite.org/documents/Vol6/JITEv6p407-423Sorva280.pdf.

[94] Margaret-Anne Storey. Theories, tools and research methods in program comprehension: past, present and future. Software Quality Journal, 14(3): 187–208, 2006.

[95] Hanna Suominen, Tapio Pahikkala, and Tapio Salakoski. Critical points in assessing learning performance via cross-validation. In Timo Honkela, Matti Pöllä, Mari-Sanna Paukkeri, and Olli Simula, editors, Proceedings of the 2nd International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR 2008), pages 9–22, Espoo, Finland, 2008. Helsinki University of Technology. ISBN 978-951-22-9525-8.

[96] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining. Addison-Wesley, USA, 2006. ISBN 0-321-32136-7.

[97] Rebecca Tiarks, Rainer Koschke, and Raimar Falke. An extended assessment of type-3 clones as detected by state-of-the-art tools. Software Quality Journal, 19(2):295–331, 2011.

[98] Anneliese von Mayrhauser and A. Marie Vans. Program understanding – a survey. Technical Report CS-94-120, Department of Computer Science, Colorado State University, 1994.

[99] Anneliese von Mayrhauser and A. Marie Vans. Program understanding: Models and experiments. In: M. Yovits and M. Zelkovitz (eds.), Advances in Computers, 40:1–37, 1995.

[100] Richard C. Waters. Program translation via abstraction and reimplementation. IEEE Transactions on Software Engineering, 14(8):1207–1228, 1988.

[101] Susan Wiedenbeck. Beacons in computer program comprehension. International Journal of Man-Machine Studies, 25(6):697–709, 1986.

[102] Linda M. Wills. Using attributed flow graph parsing to recognize clichés in programs. In Proceedings of the 5th International Workshop on Graph Grammars and Their Application to Computer Science, pages 170–184. Springer-Verlag London, UK, 1996.

[103] Michael J. Wise. YAP3: Improved detection of similarities in computer program and other texts. In Proceedings of the twenty-seventh SIGCSE technical symposium on Computer science education (SIGCSE ’96), pages 130–134. ACM New York, NY, USA, 1996.

[104] Wuu Yang. Identifying syntactic differences between two programs. Software–Practice and Experience, 21(7):739–755, 1991.

A. The C4.5 Decision Tree Classifier

The C4.5 algorithm is one of the most widely used and best-known algorithms for building decision trees [58]. In the following, we explain how the C4.5 algorithm deals with the important issues related to building decision trees, such as finding the best attributes to construct a decision tree and finding the right size for the tree. The discussion in this section is based on the book about the C4.5 algorithm written by its inventor [77].

A.1 Finding the Best Attribute

The earlier version of the C4.5 algorithm used information gain to evaluate the tests and find the best split. As described below, a more accurate criterion called information gain ratio was adopted later. Information gain (also called mutual information) is based on entropy, a measure used in information theory. Entropy indicates the average amount of information needed to identify the class of an instance in a set. Let S be a set of instances, c the number of different classes in S, and n_i the number of instances in S that belong to class i. Entropy is defined as follows:

(1)  \mathrm{entropy}(S) = -\sum_{i=1}^{c} \frac{n_i}{|S|} \,\log_2 \frac{n_i}{|S|}

The information gain is the difference between the entropy of the set S before the split and the entropy of S after the split that follows some test T. Therefore, the information gain can be computed by the following formula:

(2)  \mathrm{gain}(T) = \mathrm{entropy}(S) - \sum_{j=1}^{k} \frac{|S_j|}{|S|} \,\mathrm{entropy}(S_j)

Here, k is the number of outcomes of the test T (i.e., the number of values of the attribute tested by T), and S_j is the subset of instances in S for which T has outcome j. gain(T) measures the information gained by splitting the set S according to the test T. To perform the split, the C4.5 algorithm, like its predecessor ID3, selects the test that gives the maximum information gain. Thus, the decision tree is generated so that the internal nodes that give the largest information gain are expanded.

Although information gain was used as the criterion in ID3 for many years with good results, Quinlan, the inventor of the ID3 and C4.5 algorithms, developed a criterion called information gain ratio to fix a deficiency of information gain: it favors tests that result in many outcomes. This causes problems when the outcomes of such tests have no value with regard to the classification, for example because of the small number of instances associated with each outcome. His solution is to adjust the gain of these kinds of tests. The information gain ratio is the ratio of the information gain to the split information: it relates the information relevant to the classification that is produced by the split to the information provided by the split itself. Thus, the information gain ratio is defined as

(3)  \mathrm{gain\ ratio}(T) = \mathrm{gain}(T) / \mathrm{split\ entropy}(T)

where split entropy(T) is computed by the following formula:

(4)  \mathrm{split\ entropy}(T) = -\sum_{j=1}^{k} \frac{|S_j|}{|S|} \,\log_2 \frac{|S_j|}{|S|}

The denominator in Formula 3 grows rapidly if a test results in many outcomes. If, in addition, the test is trivial (for example, each outcome of the split contains only one instance), the numerator remains small, so the overall information gain ratio stays small. This prevents such tests from being selected.

In the case of unknown attribute values, information gain is computed as follows. Let p_1 denote the probability that the value of the attribute A tested in test T is known, and let p_2 denote the probability that it is unknown. The information gain is

(5)  \mathrm{gain}(T) = p_1 \times \Big(\mathrm{entropy}(S) - \sum_{j=1}^{k} \frac{|S_j|}{|S|} \,\mathrm{entropy}(S_j)\Big) + p_2 \times 0

The value of zero at the end of Formula 5 reflects the fact that if the value of the attribute is missing, no information can be gained for the corresponding instance from the split in question. If we suppose that the value of A is known in fraction F of the instances in the set S, we get the following simpler formula for computing information gain in the presence of unknown attribute values:

(6)  \mathrm{gain}(T) = F \times \Big(\mathrm{entropy}(S) - \sum_{j=1}^{k} \frac{|S_j|}{|S|} \,\mathrm{entropy}(S_j)\Big)

Formula 6 is the same as Formula 2 multiplied by the fraction of the instances that have the value of the corresponding attribute available. The effect of missing attribute values on the information gain ratio can be taken into account in a similar way, using Formula 4.
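To make Formulas 1–6 concrete, the following Java sketch computes entropy, information gain with unknown attribute values, and the information gain ratio for a single categorical test. It is not the implementation used in this work; the class and method names, and the convention of marking an unknown value with "?", are illustrative assumptions.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SplitCriteria {

    // Formula 1: entropy(S) = -sum_i (n_i/|S|) * log2(n_i/|S|)
    static double entropy(List<String> labels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String label : labels) {
            counts.merge(label, 1, Integer::sum);
        }
        double n = labels.size();
        double e = 0.0;
        for (int ni : counts.values()) {
            double p = ni / n;
            e -= p * Math.log(p) / Math.log(2);
        }
        return e;
    }

    // Formulas 2 and 6: information gain of a single test, computed over the
    // instances whose attribute value is known and scaled by the fraction F
    // of known values. "?" marks an unknown value (an assumption of this sketch).
    static double gain(List<String> attrValues, List<String> classLabels) {
        List<String> knownLabels = new ArrayList<>();
        Map<String, List<String>> outcomes = new HashMap<>();
        for (int i = 0; i < attrValues.size(); i++) {
            if (attrValues.get(i).equals("?")) {
                continue; // an unknown attribute value contributes no information
            }
            knownLabels.add(classLabels.get(i));
            outcomes.computeIfAbsent(attrValues.get(i), k -> new ArrayList<>())
                    .add(classLabels.get(i));
        }
        double known = knownLabels.size();
        double weightedSubsetEntropy = 0.0;
        for (List<String> subset : outcomes.values()) {
            weightedSubsetEntropy += subset.size() / known * entropy(subset);
        }
        double fractionKnown = known / attrValues.size();
        return fractionKnown * (entropy(knownLabels) - weightedSubsetEntropy);
    }

    // Formula 4: split information of the test, i.e. the entropy of its outcomes.
    static double splitEntropy(List<String> attrValues) {
        return entropy(attrValues);
    }

    // Formula 3: gain ratio(T) = gain(T) / split entropy(T)
    static double gainRatio(List<String> attrValues, List<String> classLabels) {
        double split = splitEntropy(attrValues);
        return split == 0.0 ? 0.0 : gain(attrValues, classLabels) / split;
    }

    public static void main(String[] args) {
        // Toy data: one attribute with outcomes a, b, c and one unknown value.
        List<String> attr = Arrays.asList("a", "a", "b", "b", "c", "?");
        List<String> labels = Arrays.asList("sort", "sort", "search", "search", "sort", "search");
        System.out.printf("gain = %.3f, gain ratio = %.3f%n",
                gain(attr, labels), gainRatio(attr, labels));
    }
}

On this toy data the gain ratio is smaller than the gain, illustrating how the split information penalizes a test with many outcomes.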

A.2 Finding the Right Size

The issue of finding the right size is handled in the C4.5 algorithm by pruning the tree after it has been constructed. The tree is built using the divide-and-conquer principle without evaluating any split during the building phase. This results in an overfitted tree, which is then pruned to make it simpler: those parts of the tree that are not important in terms of accuracy are removed. This approach involves extra computation for building parts of the tree that are later eliminated in the pruning phase, but this is well justified by the more accurate and reliable final result [77]. In the C4.5 algorithm, pruning involves replacing a subtree either with a leaf or with one of its branches. Pruning is error-based, that is, the replacement is carried out if it results in a lower predicted error rate. The process starts from the bottom of the tree and proceeds by investigating each non-leaf subtree. To predict the error rate, the C4.5 algorithm uses a pruning heuristic based on estimating the probability of misclassified instances in a leaf relative to all instances covered by that leaf.
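As a rough illustration of this bottom-up replacement decision, consider the following Java sketch. It only covers the case of replacing a subtree with a leaf, and the predictedErrorRate method is a simple stand-in (a Laplace-style estimate assumed here for illustration) rather than the pessimistic, confidence-based estimate that C4.5 actually uses.

import java.util.ArrayList;
import java.util.List;

// Bottom-up, error-based pruning: a subtree is replaced by a leaf when the
// predicted error of the leaf does not exceed the predicted error of the subtree.
class ErrorBasedPruning {

    static class Node {
        List<Node> children = new ArrayList<>();
        int instances;   // number of training instances covered by this node
        int leafErrors;  // errors made if this node predicted its majority class

        boolean isLeaf() {
            return children.isEmpty();
        }
    }

    // Placeholder for the pessimistic error estimate; a Laplace-style correction
    // is assumed here purely for illustration, not C4.5's actual formula.
    static double predictedErrorRate(int errors, int instances) {
        return (errors + 1.0) / (instances + 2.0);
    }

    // Prunes the subtree rooted at node and returns its predicted number of errors.
    static double prune(Node node) {
        double errorsAsLeaf = predictedErrorRate(node.leafErrors, node.instances) * node.instances;
        if (node.isLeaf()) {
            return errorsAsLeaf;
        }
        double errorsAsSubtree = 0.0;
        for (Node child : node.children) {
            errorsAsSubtree += prune(child); // prune children first (bottom-up)
        }
        if (errorsAsLeaf <= errorsAsSubtree) {
            node.children.clear();           // replace the subtree with a leaf
            return errorsAsLeaf;
        }
        return errorsAsSubtree;
    }
}

The point of the design is that the same predicted-error measure is applied to a subtree and to the leaf that would replace it, so each decision remains local and the pass over the tree stays bottom-up.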

B. Pseudo-Code for Sorting, Searching, Heap, Basic Tree Traversal and Graph Algorithms

The pseudo-code examples for the algorithms discussed in this thesis are listed below.

B.1 Sorting Algorithms

Bubble Sort Bubble sort (algorithm 1) is adapted from [16].

Algorithm 1 BUBBLE-SORT(A)
for i = 1 to A.length − 1 do
  for j = A.length downto i + 1 do
    if A[j] < A[j − 1] then
      swap A[j] and A[j − 1]
    end if
  end for
end for

Insertion Sort Insertion sort (algorithm 2) is adapted from [57].

Algorithm 2 INSERTION-SORT(A)
for i = 1 to A.length − 1 do
  key = A[i]
  j = i − 1
  while j ≥ 0 and A[j] > key do
    A[j + 1] = A[j]
    j = j − 1
  end while
  A[j + 1] = key
end for

Selection Sort Selection sort (algorithm 3) is adapted from [57].

Algorithm 3 SELECTION-SORT(A)
for i = 0 to A.length − 2 do
  min = i
  for j = i + 1 to A.length − 1 do
    if A[j] < A[min] then
      min = j
    end if
  end for
  swap A[i] and A[min]
end for

Mergesort Mergesort (algorithms 4 and 5) is adapted from [16].

Algorithm 4 MERGESORT(A, p, r)
if p < r then
  q = ⌊(p + r)/2⌋
  MERGESORT(A, p, q)
  MERGESORT(A, q + 1, r)
  MERGE(A, p, q, r)
end if

Algorithm 5 MERGE(A, p, q, r)
n1 = q − p + 1; n2 = r − q
let L[1..n1 + 1] and R[1..n2 + 1] be new arrays
for i = 1 to n1 do
  L[i] = A[p + i − 1]
end for
for j = 1 to n2 do
  R[j] = A[q + j]
end for
L[n1 + 1] = ∞; R[n2 + 1] = ∞
i = 1; j = 1
for k = p to r do
  if L[i] ≤ R[j] then
    A[k] = L[i]
    i = i + 1
  else
    A[k] = R[j]
    j = j + 1
  end if
end for

Quicksort Quicksort (algorithms 6 and 7) is adapted from [16].

Algorithm 6 QUICKSORT(A, p, r)
if p < r then
  q = PARTITION(A, p, r)
  QUICKSORT(A, p, q − 1)
  QUICKSORT(A, q + 1, r)
end if

Algorithm 7 PARTITION(A, p, r)
x = A[r]
i = p − 1
for j = p to r − 1 do
  if A[j] ≤ x then
    i = i + 1
    swap A[i] and A[j]
  end if
end for
swap A[i + 1] and A[r]
return i + 1

B.2 Binary Search Algorithms

Non-recursive Binary Search Non-recursive binary search (algorithm 8) is adapted from [90].

Algorithm 8 NON-RECURSIVE-BINARY-SEARCH(A, l, r, v)
while r ≥ l do
  m = ⌊(l + r)/2⌋
  if v == A[m] then
    return m
  end if
  if v < A[m] then
    r = m − 1
  else
    l = m + 1
  end if
end while
return −1

Recursive Binary Search Recursive binary search (algorithm 9) is adapted from [90].

Algorithm 9 RECURSIVE-BINARY-SEARCH(A, l, r, v)
if l > r then
  return null
end if
m = ⌊(l + r)/2⌋
if v == A[m] then
  return A[m]
end if
if v < A[m] then
  return RECURSIVE-BINARY-SEARCH(A, l, m − 1, v)
else
  return RECURSIVE-BINARY-SEARCH(A, m + 1, r, v)
end if

B.3 Depth First Search Algorithm

Depth first search (algorithms 10 and 11) is adapted from [57].

Algorithm 10 DEPTH-FIRST-SEARCH(G)
mark each vertex in V with 0 as a mark of being “unvisited”
count = 0
for each vertex v in V do
  if v is marked with 0 then
    DFS-VISIT(v)
  end if
end for

Algorithm 11 DFS-VISIT(v)
count = count + 1; mark v with count
for each vertex w in V adjacent to v do
  if w is marked with 0 then
    DFS-VISIT(w)
  end if
end for

B.4 Tree Traversal Algorithms

Preorder Traversal Preorder traversal (algorithm 12) is adapted from [90].

Algorithm 12 PREORDER-TRAVERSAL(x)
if x == NIL then
  return
end if
print(x.key)
PREORDER-TRAVERSAL(x.left)
PREORDER-TRAVERSAL(x.right)

Inorder Traversal Inorder traversal (algorithm 13) is adapted from [16].

Algorithm 13 INORDER-TRAVERSAL(x)
if x ≠ NIL then
  INORDER-TRAVERSAL(x.left)
  print(x.key)
  INORDER-TRAVERSAL(x.right)
end if

Postorder Traversal Postorder traversal (algorithm 14) is adapted from [19].

Algorithm 14 POSTORDER-TRAVERSAL(x)
if x ≠ NIL then
  POSTORDER-TRAVERSAL(x.left)
  POSTORDER-TRAVERSAL(x.right)
  print(x.key)
end if

B.5 Heap Algorithms

Heap Insertion Heap insertion (algorithm 15) is adapted from [27].

Algorithm 15 HEAP-INSERTION(H, last, item)
currPos = last
parentPos = (currPos − 1)/2
while currPos ≠ 0 and item > H[parentPos] do
  H[currPos] = H[parentPos]
  currPos = parentPos
  parentPos = (currPos − 1)/2
end while
H[currPos] = item

Heap Remove Heap remove (algorithms 16 and 17) is adapted from [27].

Algorithm 16 HEAP-REMOVE(H, last)
temp = H[0]
H[0] = H[last − 1]
H[last − 1] = temp
ADJUST-HEAP(H, 0, last − 1)
return temp

Algorithm 17 ADJUST-HEAP(H, first, last)
currPos = first
target = H[first]
childPos = 2 * currPos + 1
while childPos < last do
  if childPos + 1 < last and H[childPos + 1] > H[childPos] then
    childPos = childPos + 1
  end if
  if H[childPos] > target then
    H[currPos] = H[childPos]
    currPos = childPos
    childPos = 2 * currPos + 1
  else
    break
  end if
end while
H[currPos] = target

B.6 Graph Algorithms

Dijkstra’s Algorithm Dijkstra’s algorithm for the single-source shortest-paths problem (algorithm 18) is adapted from [57].

Algorithm 18 DIJKSTRA(G, s)
Initialize(Q) //initialize priority queue to empty
for every vertex v in V do
  d_v = ∞
  p_v = null
  Insert(Q, v, d_v) //initialize vertex priority in the priority queue
end for
d_s = 0
Decrease(Q, s, d_s) //update priority of s with d_s
V_T = ∅
for i = 0 to |V| − 1 do
  u* = DeleteMin(Q) //delete the minimum priority element
  V_T = V_T ∪ {u*}
  for every vertex u in V − V_T that is adjacent to u* do
    if d_u* + w(u*, u) < d_u then
      d_u = d_u* + w(u*, u); p_u = u*
      Decrease(Q, u, d_u)
    end if
  end for
end for

Floyd’s Algorithm Floyd’s algorithm for the all-pairs shortest-paths problem (algorithm 19) is adapted from [57].

Algorithm 19 FLOYD(W[1..n, 1..n])
D = W
for k = 1 to n do
  for i = 1 to n do
    for j = 1 to n do
      D[i, j] = min{D[i, j], D[i, k] + D[k, j]}
    end for
  end for
end for
return D

Errata

Publication I

In Subsection 3.2 on page 1055, ‘(the implemented tool that performs the’ should be ‘(the implemented tool that performs the recognition), the first phase to be performed’. In Subsection 5.4 on page 1061, ‘is less that’ should be ‘is less than’. In Subsection 5.4 on page 1062, ‘column in blue’ should be ‘column in light gray’, ‘column in green’ should be ‘column in medium gray’ and ‘column in red’ should be ‘column in dark gray’. In Subsection 5.4 on page 1063, ‘reveals than’ should be ‘reveals that’.

Publication II

In Subsection 1.1 on page 1847, ‘are much more’ should be ‘is much more’. In Subsection 2.1 on page 1848, ‘sourced code’ should be ‘source code’. In Subsection 6.3 on page 1858, ‘experience’ should be ‘experiment’. In Subsection 7, ‘Simple additions’ should be ‘Simple addition’.

Publication III

In the caption of Figure 5, ‘in’ is redundant.

Publication V

In Section 4, the second-to-last paragraph, ‘This make’ should be ‘This makes’.

In Section 6, ‘111 program’ should be ‘111 programs’. In Sections 6 and 7, ‘10 percent’ should be ‘9 percent’.
