
UCAM-CL-TR-844 Technical Report ISSN 1476-2986

Number 844

Computer Laboratory

Decompilation as search

Wei Ming Khoo

November 2013

15 JJ Thomson Avenue
Cambridge CB3 0FD
United Kingdom
phone +44 1223 763500
http://www.cl.cam.ac.uk/

© 2013 Wei Ming Khoo

This technical report is based on a dissertation submitted August 2013 by the author for the degree of Doctor of Philosophy to the University of Cambridge, Hughes Hall.

Technical reports published by the University of Cambridge Computer Laboratory are freely available via the Internet:

http://www.cl.cam.ac.uk/techreports/

ISSN 1476-2986

Abstract

Decompilation is the process of converting programs in a low-level representation, such as machine code, into high-level programs that are human readable, compilable and semantically equivalent. The current de facto approach to decompilation is largely modelled on compiler theory and only focusses on one or two of these desirable goals at a time. This thesis makes the case that decompilation is more effectively accomplished through search. It is observed that software development is seldom a clean-slate process and much software is available in public repositories. To back this claim, evidence is presented from three categories of software development: corporate software development, open source projects and malware creation. Evidence strongly suggests that code reuse is prevalent in all categories.

Two approaches to search-based decompilation are proposed. The first approach borrows inspiration from information retrieval, and constitutes the first contribution of this thesis. It uses instruction mnemonics, control-flow sub-graphs and data constants, which can be quickly extracted from a disassembly, and relies on the popular text search engine CLucene. The time taken to analyse a function is small enough to be practical and the technique achieves an F2 measure of above 83.0% for two benchmarks.

The second approach and contribution of this thesis is perturbation analysis, which is able to differentiate between algorithms implementing the same functionality, e.g. bubblesort versus quicksort, and between different implementations of the same algorithm, e.g. quicksort from Wikipedia versus quicksort from Rosetta code. Test-based indexing (TBI) uses random testing to characterise the input-output behaviour of a function; perturbation-based indexing (PBI) is TBI with additional input-output behaviour obtained through perturbation analysis. TBI/PBI achieves an F2 measure of 88.4% on five benchmarks involving different compilers and compiler options.

To perform perturbation analysis, function prototyping is needed, the standard way comprising liveness and reaching-definitions analysis. However, it is observed that in practice actual prototypes fall into one of a few possible categories, enabling the type system to be simplified considerably. The third and final contribution is an approach to prototype recovery that follows the principle of conformant execution, in the form of inlined data source tracking, to infer arrays, pointer-to-pointers and recursive data structures.

Acknowledgments

I gratefully acknowledge the financial support of DSO National Laboratories, which made it possible for me to pursue this PhD. I thank Tan Yang Meng and Chia Hock Teck, who first encouraged me to take this journey; Ross Anderson, for deciding to take a green and clueless individual under his wing as a PhD student in 2009; Alan Mycroft, for agreeing to co-advise this same green and clueless PhD student; Hassen Saïdi, for having me as a summer intern and for getting me started on the fascinating topic of decompilation. I thank my co-authors and collaborators Pietro Liò, Hyoungshick Kim, Michael Meeks and Ed Robbins, for fruitful discussions that have given me new perspectives and have helped me grow as a researcher. For providing me with much-needed advice, help and encouragement along the way, I thank Saad Aloteibi, Jonathan Anderson, Joseph Bonneau, Omar Choudary, Richard Clayton, Saar Drimer, Christopher Gautier, Khilan Gudka, Markus Kuhn, Steven Murdoch, Frank Stajano, Raoul Urma, Robert and Rubin Xu. I am grateful to Graham Titmus and Chris Hadley for help with setting up the Rendezvous server, and Laurent Simon for help with server testing. I am thankful to Andrew Bernat and Frank Eigler for help with the Dyninst API and the Dyninst team for creating a great tool.

Thank you, Dad, Mum, Pa, Mi and Li Ying, for always being there, especially during times when I needed the most support and encouragement. To Grace, Natalie and Daniel: thank you for being my inspiration. You are a gift, and my joy. To my wife, Shea Lin: thank you for your steadfast support when I was running long experiments, writing or away, for being my comfort when the going was tough, and for your uncanny ability to see past setbacks. Thank you for being you.

Soli Deo gloria

Contents

1 Introduction
  1.1 Decompilation
  1.2 Chapter outline

2 Models of decompilation: a survey
  2.1 Heuristics-driven model
  2.2 Compiler model
  2.3 Formal methods model
  2.4 Assembly-to-C translation model
  2.5 Information-flow model

3 Decompilation as search
  3.1 How prevalent is code reuse?
  3.2 Prior work in software reuse research
  3.3 GNU Public License violations
  3.4 Proprietary software copyright infringement
  3.5 A study of code reuse on Github
  3.6 Code reuse in malicious software
    3.6.1 Analysis of "last modified" dates
    3.6.2 Search for "http://" string
  3.7 Summary of findings
  3.8 Proposed approach: Search-based decompilation
    3.8.1 Case study: Statistical machine translation
    3.8.2 Proposed research agenda
  3.9 Related work

4 Token-based code indexing
  4.1 Introduction
  4.2 Design space
  4.3 Feature extraction
  4.4 Instruction mnemonics
  4.5 Control-flow subgraphs
  4.6 Data constants
  4.7 What makes a good model?
  4.8 Indexing and querying
  4.9 Implementation
  4.10 Evaluation
    4.10.1 Optimal df threshold
    4.10.2 Comparison of n-grams versus n-perms
    4.10.3 Mixed n-gram models
    4.10.4 Control-flow k-graphs versus extended k-graphs
    4.10.5 Mixed k-graph models
    4.10.6 Data constants
    4.10.7 Composite models
    4.10.8 Timing
  4.11 Discussion
    4.11.1 Limitations
    4.11.2 Threats to validity
    4.11.3 Mnemonic n-grams and basic block boundaries
  4.12 Related work

5 Perturbation analysis
  5.1 Overview
  5.2 Assembly-to-C translation
  5.3 Source-binary matching for functions
  5.4 Test-based indexing
  5.5 Identifying structural similarity
  5.6 Perturbation analysis
  5.7 Guard functions
  5.8 Perturbation-based indexing
  5.9 Implementation
  5.10 Evaluation
    5.10.1 Perturbation and guard functions
    5.10.2 Comparison of different implementations
    5.10.3 Coreutils dataset
    5.10.4 Compilers and compiler options
  5.11 Discussion
    5.11.1 Undefined behaviour
    5.11.2 Indirect jumps and external code
    5.11.3 Function prototyping
  5.12 Related work

6 Prototype recovery via inlined data source tracking
  6.1 Survey of prototypes in coreutils and linux
  6.2 Algorithm design
  6.3 Conformant execution for type recovery
  6.4 Memory validity, Fmem
  6.5 Address validity, Faddr
  6.6 Inlined data source tracking
  6.7 Probabilistic branch negation
  6.8 Type system
  6.9 Distance metric
  6.10 Implementation
  6.11 Evaluation
    6.11.1 Basic inlined data source tracking
    6.11.2 Adding probabilistic branch negation
    6.11.3 Timing
  6.12 Discussion
  6.13 Related work

7 Rendezvous: a prototype search engine
  7.1 System architecture
  7.2 Results and performance
    7.2.1 Storage requirements
    7.2.2 Performance

8 Conclusions and future work
  8.1 Future work

1 Introduction

“You can’t trust code that you did not totally create yourself. (Especially code from companies that employ people like me.)” – Ken Thompson, Reflections on Trusting Trust, 1984

Possibly the most well-known software backdoor was described by Ken Thompson in his famous 1984 ACM Turing award acceptance speech, titled Reflections on Trusting Trust [1]. A backdoor is a metaphor for a well-hidden feature in a piece of software that exhibits behaviour more permissive than normal operations would allow. The backdoor that Thompson wrote existed in the login program on a UNIX system, and allowed anyone with knowledge of a backdoor password to log in as, or gain access to, any user account on that machine. From a software auditor's perspective, the problem with this backdoor was that it was not visible by inspecting the source code of the login program: although the backdoor existed in the login executable, it was inserted during compilation by a specially crafted C compiler. Nor was this functionality visible in the source code of the compiler, because the compiler executable had been modified to recognise when it was compiling itself and to insert the backdoor-inserting ability into its compiled self. Thus the backdoor could not be spotted by inspecting the source code of either program.

One method to detect a trojanised compiler was proposed by David Wheeler in his PhD dissertation, using an approach called diverse double-compiling [2]. The key idea is to make use of the fact that source code compiled by two different but correct compilers must necessarily yield functionally equivalent results. The solution is therefore to compile the compiler source code twice, using the suspicious compiler and a second, imported compiler. If both compilers are correct, the compilation will yield two compilers which are functionally equivalent. To test that they are, a second compilation by these two compiled compilers will yield a third and fourth compiler that are equal byte-for-byte under certain assumptions. If they are not equal, there is a difference in functionality between the two compilers that were initially used (see Figure 1.1).
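Figure 1.1 below summarises the two compilation stages. As a minimal sketch of the final check only (this is an illustration, not Wheeler's actual tooling; the file names cc_A2 and cc_B2 are hypothetical paths to the stage-two compiler binaries), the byte-for-byte comparison could be performed as follows:

    #include <stdio.h>

    /* Compare two files byte-for-byte; returns 1 if identical, 0 otherwise
       (or 0 if either file cannot be opened). */
    static int bitwise_identical(const char *path_a, const char *path_b)
    {
        FILE *fa = fopen(path_a, "rb");
        FILE *fb = fopen(path_b, "rb");
        int same = (fa != NULL) && (fb != NULL);

        while (same) {
            int ca = fgetc(fa);
            int cb = fgetc(fb);
            if (ca != cb)
                same = 0;          /* differing byte or differing length */
            else if (ca == EOF)
                break;             /* both ended together: identical     */
        }
        if (fa) fclose(fa);
        if (fb) fclose(fb);
        return same;
    }

    int main(void)
    {
        /* "cc_A2" and "cc_B2" are hypothetical paths to the stage-two
           compilers obtained by compiling the compiler source sA twice
           with each toolchain (Figure 1.1). */
        if (bitwise_identical("cc_A2", "cc_B2"))
            printf("A2 and B2 are bitwise identical\n");
        else
            printf("A2 and B2 differ: at least one initial compiler is suspect\n");
        return 0;
    }

Any byte difference at this final stage indicates, under the scheme's assumptions, a functional difference between the two compilers that were initially used.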

Compiler A0  --- correct ----------------------  Compiler B0
     | compiles sA                                    | compiles sA
Compiler A1  --- functionally equivalent ------  Compiler B1
     | compiles sA                                    | compiles sA
Compiler A2  --- bitwise identical ------------  Compiler B2

Figure 1.1: Overview of the diverse double-compiling approach to detect a trojanised compiler. Compilers A0 and B0 are assumed to be two correct compilers. Using A0 and B0 to compile the same compiler source, sA, gives rise to compilers A1 and B1 respectively, which are expected to be functionally equivalent. A second iteration of this process produces compilers A2 and B2, which are expected to be bitwise identical, provided sA is deterministic. Otherwise, the initial assumption is wrong and one, or both, of the initial compilers contains an error.

Another approach to detect a trojanised compiler is to use software reverse engineering. We can use a disassembler, such as the UNIX utility objdump, to inspect the machine instructions contained within an executable. Knowing this, however, the objdump program could also be compromised with a backdoor that selected which parts of the login executable and C compiler to display. We cannot trust the disassembler on the system, but we can similarly import one or more external disassemblers. The chance of having the exact same backdoor in n independently developed disassemblers is small for a sufficiently large n. When run on the compiler or login executable, all disassemblers should give the same sequence of machine instructions; otherwise there is something wrong and we can examine the difference between the disassemblies.

In June of 2010, the world was introduced to the first-ever targeted cyberweapon, known as Stuxnet. Stuxnet was a computer program designed specifically to hunt down a particular industrial control system, the Siemens Simatic WinCC Step7 controller, but not just any Step7 system: most of the Stuxnet infections were located in Iran, a country that was never known for huge malware outbreaks [3]. The one characteristic that separated Stuxnet from other malware was its reliability. Stuxnet had 8 methods of replicating itself from machine to machine, and it contained 5 exploits that had 100% reliable code execution capability on Windows Vista and later. An exploit is similar to a backdoor in that it allows for more permissive behaviour by the entity using it. The subtle difference is that a backdoor is usually intentional on the part of the software designer, while an exploit is not. Stuxnet was reliable to the point of being able to defeat anti-exploit security mechanisms such as data execution prevention (DEP), circumvent 10 different security products from companies such as Microsoft, Symantec and McAfee, and fool host-based intrusion detection systems by using a customised dynamic-linked library (DLL) loading procedure. Its main kernel driver was digitally signed with a valid code signing certificate so that it could be installed on any Windows operating system since Windows 2000 without any users being notified [4].

The goal of Stuxnet was to cause damage to a specific group of centrifuges used in Iranian uranium enrichment, and it did this by first seeking out computers containing Step7 software that controlled a series of 33 or more frequency converter drives, which in turn controlled an array of exactly 164 spinning centrifuges. If this specific configuration was not found on a computer, it would simply shut itself down quietly. However, if it found what it was looking for, Stuxnet would take over the controller, and periodically, once every 27 days, it would speed up and slow down the spinning frequency of the centrifuges for about an hour each time before restoring it to the normal frequency.

Assuming that Stuxnet was launched in June 2009, it took about a year for it to be discovered, and a further 5 months for software analysts and industrial control systems experts to decipher the true intent of Stuxnet's payload. Stuxnet was possibly the most complex "malware" discovered to date.

Analysing a program for the presence of a trojan or a malware sample is a task known as software reverse engineering. The goal of reverse engineering (RE) is to reconstruct a system's design from the system's details and extract the intention from the implementation, and it applies as much to software as it does to any engineered system. Software RE is important for tasks such as vulnerability analysis, malware analysis and identifying software copyright infringement. Vulnerability analysis is the process of looking for intended or unintended features in a piece of software or a system that would allow exceptionally permissive behaviour if exercised. As Balakrishnan and Reps pointed out, "what you see [in the source code] is not what you execute [5]". Malware analysis is the process of understanding the capabilities of a malicious program, and/or being able to uniquely identify future versions of it. Software RE can also be used to investigate verbatim copying of proprietary or protected code, complementing source code auditing, especially if the software uses third-party code to which the publisher does not have source code access, or if obtaining the source code is challenging [6].

1.1 Decompilation

The goals of decompilation, producing compilable, semantically equivalent and readable code, are difficult to achieve, even with today's tools.

The standard approach to software reverse engineering is decompilation, also known as reverse compilation. The goal of Hex-Rays, a commercial decompiler, is to convert "executable programs into a human readable C-like pseudocode text" [7]; the Boomerang decompiler aims "to create a high level, compilable, possibly even maintainable source file that does the same thing [as the executable]" [8]. Thus, the goals of decompilation are three-fold.

Compilable A decompiler should produce a program that can be re-compiled to an executable.

Semantically equivalent A decompiled program should do the same thing as the original executable.

Readable The decompiled output should be comprehensible, for example, with well-defined functions, variables, control-flow structure, high-level expressions and data type information, which would not otherwise exist.

Table 1.1 compares the output of three decompilers: the free but closed-source Reverse Engineering Compiler (REC) Studio version 4, the open-source Boomerang 0.3.1 and the commercial Hex-Rays version 6.2. The decompilers were tested on three functions, the Fibonacci function, bubblesort and the GNU C library __udivmoddi4 function, and assessed based on compilability, equivalence of the output and readability. The function __udivmoddi4, given unsigned integers a and b, computes a mod b and a/b. The outputs of the three functions are listed in Figures 1.2, 1.3 and 1.4 for comparison. We use the short-hand, e.g. Fibonacci-Boomerang, to refer to the decompiled output of a target program by a decompiler, e.g. the decompiled output of the Fibonacci executable by Boomerang.

                                      | REC Studio 4          | Boomerang 0.3.1                 | Hex-Rays 6.2
Compilable                            | no (undefined macros) | some (with some manual editing) | yes
Equivalence                           | some                  | some                            | some
Readability: variable types/names     | no (unknown types)    | yes                             | some
Readability: control-flow             | yes                   | yes                             | yes
Readability: higher-order expressions | no (assembly present) | yes                             | no (assembly present)
Total                                 | 1.5/5                 | 4/5                             | 3/5

Table 1.1: Comparison of the output of three currently available decompilers.
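Of the three test functions, __udivmoddi4 has the least obvious contract, so it is worth stating the equivalence criterion concretely. The following check harness is a sketch, not part of the original evaluation; it assumes a candidate implementation (for example, a recompiled decompiler output) is linked in under the prototype shown, and uses the host compiler's native 64-bit division as the reference:

    #include <assert.h>
    #include <stdio.h>

    typedef unsigned long long UDWtype;

    /* Prototype of the function under test, matching the libgcc routine
       discussed in this section. */
    UDWtype __udivmoddi4(UDWtype n, UDWtype d, UDWtype *rp);

    /* Check one input pair against the expected behaviour: the return value
       should equal n / d, and *rp should receive n % d. */
    static void check(UDWtype n, UDWtype d)
    {
        UDWtype r = 0;
        UDWtype q = __udivmoddi4(n, d, &r);
        assert(q == n / d);
        assert(r == n % d);
        printf("%llu / %llu = %llu, remainder %llu\n", n, d, q, r);
    }

    int main(void)
    {
        check(10, 3);                        /* small values                       */
        check(0xFFFFFFFFFFFFFFFFULL, 7);     /* needs the full 64-bit width        */
        check(1ULL << 40, (1ULL << 33) + 5); /* divisor whose high word d1 != 0    */
        return 0;
    }

Checks of this kind make the "semantically equivalent" goal testable rather than a matter of inspection alone.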

With respect to producing compilable output, REC used undeclared macros pop() and push() and internal types unknown and intOrPtr, which made the C output uncompilable. It is not known why REC typed variable _v16 in function L00401090 of bubblesort as a char, which was incompatible with type intOrPtr, the first parameter type of L00401000. This type mismatch would be flagged by the compiler as an error. Bubblesort-Boomerang was compilable with some manual editing, such as replacing the internal type __size32 with int, and declaring a few overlooked variables. However, the first parameter in the main function of bubblesort-Boomerang, &0, was not a valid C expression (Listing 1.7).

In terms of equivalence, all three decompilers produced an equivalent C program for Fibonacci. For the bubblesort program, however, none of the decompilers correctly inferred the array in main, choosing instead to type the three-element array as three integers (Hex-Rays), two integers and a character (REC), or a constant (Boomerang). Indeed, only the first array element is read in main; the second and third array elements are only read within bubblesort. Without array bounds information, it is challenging to deduce the array in main. Hex-Rays comes the closest to decompiling an equivalent bubblesort C program, inferring the three-element array as three integers. However, the equivalence of the recompiled program to the original bubblesort program depends on the order in which the compiler chooses to arrange the integers on the stack. The Hex-Rays-decompiled variable order was v1, v2, v3, i.e. v1 at the lowest stack address. When the bubblesort-HexRays C program was recompiled with gcc, the disassembly revealed that gcc had chosen to swap the locations of v2 and v3 on the stack, resulting in a different variable order, v1, v3, v2, and thus a semantically different bubblesort program from the original (Listing 1.8). For the __udivmoddi4 function, the if (d1 != 0) statement in line u5 (Listing 1.9) and the if (param2 != 0) statement in line ub2 (udivmoddi4-Boomerang, Listing 1.11) were not equivalent, since the variable d1, which comprised the most significant 32 bits of the 64-bit parameter d (line u3), did not equate to param2.

In terms of readability, Boomerang was arguably the most readable, followed closely by Hex-Rays. Boomerang was the only decompiler not to rely on macros, choosing instead to inline all low-level operations. REC was the only decompiler to leave the types as intOrPtr and unknown, possibly due to the lack of type information, while the other two chose to use integers. None of the decompilers managed to reconstruct the high-level expression fib(x - 1) + fib(x - 2); however, the order of the two calls was preserved. Hex-Rays was unable to infer the first Bubblesort parameter as a pointer, instead choosing to leave it as an integer.

Listing 1.1: Original

    int fib ( int x ) {
      if(x>1) return fib(x - 1) + fib(x - 2);
      else return x;
    }

Listing 1.2: REC

    L00401000( intOrPtr _a4 ) {
      _unknown_ __ebp;
      _unknown_ _t8;
      if (_a4 <= 1 ) {
        return _a4;
      }
      _t8 = L00401000(_a4 - 1);
      return L00401000(_a4 - 2) + _t8;
    }

Listing 1.3: Boomerang

    int fib ( int param1 ) {
      int eax;    // r24
      int eax_1;  // r24{30}
      if ( param1 <= 1 ) {
        eax = param1;
      } else {
        eax = fib(param1 - 1);
        eax_1 = fib(param1 - 2);
        eax = eax_1 + eax;
      }
      return eax;
    }

Listing 1.4: Hex-Rays

    int __cdecl fib ( int a1 ) {
      int v1;     // esi@2
      int result; // eax@2
      if ( a1 <= 1 ) {
        result = a1;
      } else {
        v1 = fib(a1 - 1);
        result = v1 + fib(a1 - 2);
      }
      return result;
    }

Figure 1.2: Listings for 1.1 the Original Fibonacci program, and output of the three decompilers 1.2 REC 4, 1.3 Boomerang 0.3.1 and 1.4 Hex-Rays 6.2, edited for length.

Considering the decompiled __udivmoddi4 C programs, udivmoddi4-HexRays was in fact shorter than the equivalent fragment of the original __udivmoddi4 function—the BSR (bit scan reverse) instruction was left as is, as was the expression HIDWORD(a3), which corresponded to register EDX (Listing 1.10). None of the decompilers recovered the correct prototype for __udivmoddi4, which was UDWtype __udivmoddi4(UDWtype, UDWtype, UDWtype *), or the equivalent unsigned long long __udivmoddi4(unsigned long long, unsigned long long, unsigned long long *). Hex-Rays inferred int __udivmoddi4(unsigned int, unsigned int, __int64, int), choosing to split the first 64-bit parameter into two 32-bit integers, inferring the second as signed __int64 instead of unsigned __int64, and inferring the third as an integer instead of a pointer; Boomerang was closer with void __udivmoddi4(unsigned long long, unsigned long long, unsigned long long *, unsigned long long, unsigned long long), adding two extra parameters. No output was obtained from REC for function __udivmoddi4.

Overall, based on the limited programs tested, the focus of the three decompilers appeared to be largely on producing readable C-like code; equivalence was less of a priority. With the exception of Hex-Rays, manual editing was inevitable if compilable code was a requirement.

Today, the de facto decompilation approach involves several steps—disassembly [9], control-flow reconstruction [10], data-flow analysis, control-flow structuring, type inference, and code generation [11, 12, 13]. These are largely modelled on compiler theory. However, with the availability of numerous public code repositories today, such as Github.com, Google Code and Microsoft's CodePlex.com, this thesis proposes performing decompilation as search.

Listing 1.5: Original

    void bubblesort (int *a, int size) {
      int swapped = 1;
      while( swapped ) {
        int i;
        swapped = 0;
        for(i=1; i < size; i++) {
          if( a[i-1] > a[i] ) {
            int tmp;
            tmp = a[i-1];
            a[i-1] = a[i];
            a[i] = tmp;
            swapped = 1;
          }
        }
      }
    }
    int main(int argc, char **argv){
      int x[3]; x[0] = 0; x[1] = 10; x[2] = 20;
      bubblesort(x, 3);
      return 0;
    }

Listing 1.6: REC

    L00401000( intOrPtr _a4, intOrPtr _a8 ) {
      signed int _v8, _v12;
      intOrPtr _v16;
      _unknown_ __esi, __ebp;
      _v8 = 1;
      while(_v8 != 0) {
        _v8 = 0;
        _v12 = 1;
        while(_v12 < _a8) {
          if( *((intOrPtr*)(_a4 + _v12 * 4 - 4)) > *((intOrPtr*)(_a4 + _v12 * 4)) ) {
            _v16 = *((intOrPtr*)(_a4+_v12*4-4));
            *((intOrPtr*)(_a4 + _v12 * 4 - 4)) = *((intOrPtr*)(_a4 + _v12 * 4));
            *((intOrPtr*)(_a4 + _v12 * 4)) = _v16;
            _v8 = 1;
          }
          _v12 = _v12 + 1;
        }
      }
    }
    L00401090() {
      intOrPtr _v8 = 20, _v12 = 10;
      char _v16 = 0;
      _unknown_ __ebp;
      L00401000( &_v16, 3);
      return 0;
    }

Listing 1.7: Boomerang

    void bubblesort(__size32 *param1, int param2) {
      unsigned int eax, eax_1;
      int ebx, ecx;
      __size32 esi;
      if (param2 > 1) {
        eax = 1; esi = 0;
        for(;;) {
          eax_1 = eax;
          ecx = *(param1+(eax_1-1)*4);
          ebx = *(param1 + eax_1 * 4);
          if (ecx > ebx) {
            *(int*)(param1+(eax_1-1)*4) = ebx;
            *(int*)(param1 + eax_1 * 4) = ecx;
            esi = 1;
          }
          eax = eax_1 + 1;
          if (eax_1 + 1 != param2) continue;
          if (esi == 0) break;
          eax = 1; esi = 0;
        }
      }
    }
    int main(int argc, char *argv[]) {
      bubblesort(&0, 3);
      return 0;
    }

Listing 1.8: Hex-Rays

    void __cdecl bubblesort(int a1, signed int a2) {
      signed int v2, i;
      int v4, v5;
      if ( a2 > 1 ) {
        v2 = 1;
        for(i=0;;i=0){
          do {
            v4 = *(_DWORD *)(a1 + 4 * v2 - 4);
            v5 = *(_DWORD *)(a1 + 4 * v2);
            if ( v4 > v5 ) {
              *(_DWORD *)(a1 + 4 * v2 - 4) = v5;
              *(_DWORD *)(a1 + 4 * v2) = v4;
              i = 1;
            }
            ++v2;
          } while ( v2 != a2 );
          if ( !i ) break;
          v2 = 1;
        }
      }
    }
    int __cdecl main() {
      int v1 = 0, v2 = 10, v3 = 20;
      bubblesort((int)&v1, 3);
      return 0;
    }

Figure 1.3: Listings for 1.5 the Original Bubblesort program, and output of the three decompilers 1.6 REC 4, 1.7 Boomerang 0.3.1 and 1.8 Hex-Rays 6.2, edited for length.


A similar shift occurred in machine translation. Translating between human languages is hard, and computer scientists tried for two generations to tackle it using natural-language processing techniques that relied on syntactical analysis. Yet this problem is solved much more successfully nowadays by Google Translate (http://translate.google.com). According to Google, when a translation is generated, "it looks for patterns in hundreds of millions of documents to help decide on the best translation. . . . The more human-translated documents that Google Translate can analyse in a specific language, the better the translation quality will be" (http://translate.google.com/about).

Listing 1.9: Original

    typedef unsigned long long UDWtype;
    typedef long long DWtype;
    typedef unsigned int UWtype;
    typedef int Wtype;
    struct DWstruct { Wtype low, high;};
    typedef union { struct DWstruct s; DWtype ll; } DWunion;
    ...
    static UDWtype
    __udivmoddi4 (UDWtype n, UDWtype d, UDWtype *rp) {
    u1:   UWtype b, bm; DWunion nn, dd;
    u2:   nn.ll = n; dd.ll = d;
    u3:   UWtype d0 = dd.s.low; UWtype d1 = dd.s.high;
    u4:   UWtype n0 = nn.s.low; UWtype n1 = nn.s.high;
          ...
    u5:   if (d1 != 0) {
            if (d1 <= n1) {
              /* count_leading_zeros(count, x) counts the number of zero-bits from the
                 msb to the first nonzero bit in the UWtype X. */
              count_leading_zeros (bm, d1);
              if (bm != 0) {
                UWtype m1, m0;
                /* Normalize. */
                b = W_TYPE_SIZE /* 32 */ - bm;
                d1 = (d1 << bm) | (d0 >> b);
                ...
    }

Listing 1.10: Hex-Rays

    signed int __usercall _udivmoddi4(unsigned int a1, unsigned int a2,
                                      __int64 a3, int a4) {
      unsigned int v5 = a2, v16;
      char v25;
      _EDX = HIDWORD(a3);
      ...
      if ( HIDWORD(a3) ) {
        if ( HIDWORD(a3) <= v5 ) {
          __asm { bsr ecx, edx }
          v25 = _ECX ^ 0x1F;
          if ( _ECX ^ 0x1F ) {
            v16 = (HIDWORD(a3) << v25) | ((unsigned int)a3 >> (32 - v25));
            ...
    }

Listing 1.11: Boomerang

    void __udivmoddi4(unsigned long long param1, unsigned long long param2,
                      unsigned long long *param3, unsigned long long param4,
                      unsigned long long param5) {
    ub1:  unsigned long long ecx = param4, ecx_1, ecx_2, ebp_1;
          ...
    ub2:  if (param2 != 0) {
            if (param2 <= param5) {
              if (param2 != 0) {
                ecx = 32;
                do {
                  ecx_1 = ecx;
                  ecx = ecx_1 - 1;
                } while ((param2 >> ecx_1 - 1 & 0x1ffffff) == 0);
              }
              ecx_2 = ecx;
              if ((ecx_2 ^ 31) != 0) {
                ecx = (32 - (ecx_2 ^ 31));
                ebp_1 = param1 >> ecx | param2 << (ecx_2 ^ 31);
                ...
    }

Figure 1.4: Listings for 1.9 a fragment of the original __udivmoddi4, and corresponding output of 1.10 Hex-Rays 6.2 and 1.11 Boomerang 0.3.1, edited for length.

Instead of asking "How do we perform decompilation?", we want to ask "Given a candidate decompilation, how likely is it to be correct?". If a good enough match does exist, we get equivalence, compilable code, and readability in terms of variable names, comments and labels, which are challenging to obtain via current approaches. The more source code there is to index, the better the decompilation will be.

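As a toy illustration of the kind of query such a search might issue (Chapter 4 indexes instruction mnemonics, control-flow subgraphs and data constants with a text search engine), the sketch below turns a disassembled function's mnemonic sequence into overlapping 3-gram tokens; the mnemonic sequence and token format shown here are invented for illustration only:

    #include <stdio.h>

    /* Mnemonics of a disassembled function (hypothetical example sequence). */
    static const char *mnemonics[] = { "push", "mov", "cmp", "jle", "call", "add", "ret" };
    #define NUM_MNEMONICS (sizeof mnemonics / sizeof mnemonics[0])

    int main(void)
    {
        /* Emit overlapping mnemonic 3-grams; each token would be one term in a
           query against an index of known, already-compiled functions. */
        for (size_t i = 0; i + 3 <= NUM_MNEMONICS; i++)
            printf("%s_%s_%s\n", mnemonics[i], mnemonics[i + 1], mnemonics[i + 2]);
        return 0;
    }

Tokens like these, matched against an index built from compiled open-source code, rank candidate functions; the best-scoring match then supplies source-level names, comments and structure.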

1.2 Chapter outline

The rest of this dissertation is structured in the following manner.

Models of decompilation: a survey (Chapter 2) surveys prior work in decompilation, tracing the five different models of decompilation in an area of research that spans more than fifty years.

Decompilation as search (Chapter 3) provides the motivation and the arguments for performing decompilation as a search process. It addresses the questions: “How prevalent is code reuse?” and “What components tend to be reused?”. A study of code reuse in practice is provided, with evidence taken from: past research on code reuse in the software industry and in open source projects, software repositories on social coding site GitHub.com, leaked ZeuS malware source code and software copyright infringement lawsuits. This chapter concludes with a proposed research agenda for search-based decompilation.

Token-based code indexing (Chapter 4) contributes a code indexing technique that is focussed on speed, optimised for accuracy. It makes use of three code features at its core—instruction mnemonics, control-flow subgraphs and data constants—to perform binary matching. This work was published at the 10th Working Conference on Mining Software Repositories 2013 [14].

Perturbation analysis (Chapter 5) addresses the problem of differentiating between different algorithms implementing the same functionality, e.g. bubblesort versus quicksort, and between different implementations of the same algorithm, e.g. bubblesort from Wikipedia versus bubblesort from Rosetta code. This involves examining the input-output relationship of a function.

Prototype recovery via inlined data source tracking (Chapter 6) describes an approach to prototype recovery using data tracking and follows on from Chapter 5 by providing the basis with which to perform perturbation analysis. Unlike prior work, the focus is on discerning broad type categories—non-pointers, pointers, pointer-to-pointers—rather than the full range of C types.

Rendezvous: a prototype search engine (Chapter 7) is a demonstration of what a search-based decompiler might look like in practice. Sample screenshots are given, as well as its results and performance.

Chapter 8 concludes by summarising the contributions made and discussing future work in search-based decompilation.

2 Models of decompilation: a survey

From an academic perspective, work on decompilation spans more than fifty years of research. As a nascent discipline in the 1960s, the earliest approaches to decompilation can perhaps be described as being heuristics-driven. The compiler model, which emerged in 1973, gathered considerable momentum and is currently considered to be the de facto approach to decompilation. The formal methods model is a more recent development, with publications first appearing in 1991. Although not considered to be decompilation by some, the usefulness of assembly-to-C translation has earned it a place among the other models, with papers being published from 1999. Finally, the information-flow model, proposed in 2008, views source code and machine code as high and low security information respectively, and decompilation as exploiting information leakage to reconstruct an equivalent source-language program for an observed machine-language program. Both Cifuentes [11] and Van Emmerik [13] give comprehensive summaries of the history of decompilers for the periods up to 1994 and 2007 respectively. One example of a post-2007 development is the emergence of C++ decompilers [15]. What follows is the history of decompilation seen through the lens of the five aforementioned models of decompilation. When a system or methodology spans more than one model, the most representative one is chosen.

2.1 Heuristics-driven model

The heuristics-driven model exploits knowledge of code patterns to derive heuristics with which to perform decompilation, with the main drawback of being machine and/or compiler dependent.

The earliest decompilers were pattern-based, beginning in the 1960s with the Donelly-NELIAC (D-NELIAC) and Lockheed Missiles and Space Company (LMSC) NELIAC decompilers, which used platform-specific rules to translate program instructions into NELIAC, a variant of the ALGOL-58 programming language. Platforms worked on included UNIVAC, CDC and IBM [16, 17]. Sassaman adopted a similar approach to translate IBM 7000 instructions into FORTRAN in 1966 [18].

Several syntax and grammar-based approaches were proposed, starting in the 1970s. In his PhD thesis, Hollander described a syntax-directed approach, involving matching patterns of IBM/360 assembly instruction sequences and converting them into the corresponding ALGOL-like instructions [19]. Schneider and Winiger described a translation-based decompiler that used the inverse of a compiler grammar intended for ALGOL-like languages [20].

Hood used a definite-clause grammar parser to decompile unoptimised Intel 8085 assembly into "SmallC", a subset of C. Inference of Pascal data types was performed using operation type, operation width and run-time bounds checks inserted by the Pascal compiler [21]. Systematic decompilation was proposed by Gorman, using identifiable machine code sequences as patterns to form a grammar with which to build a VAX-to-PASCAL decompiler [22].

The 8086 C decompiling system, developed by Chen et al. from 1991 to 1993, could recover a C program from a Motorola 68000 program compiled with the Microsoft C compiler version 5. Library function recognition was performed using pre-computed unique instruction sequence matching; type inference was done via a rule-based approach [23, 24]. DisC is a C decompiler targeting x86 programs running on the DOS operating system compiled with the TurboC compiler version 2.0/2.01 [25]. Guilfanov described a type system for standard C library functions using declarations in compiler header files such as windows.h and a set of platform-dependent type propagation rules. The type system is implemented in the IDA Pro interactive disassembler [26]. In 2007, Guilfanov announced the launch of the Hex-Rays decompiler plugin for IDA Pro. It was mentioned that "probabilistic methods and heuristics" and data-flow analysis were in use [27].

Fokin et al. described a C++ decompiler called SmartDec, which is able to reconstruct polymorphic class hierarchies, types and exception handling code. For the first task, the hierarchy reconstruction is straightforward if the run-time type information (RTTI) is available and the virtual tables (vtables) are located. Otherwise, four rules are used which look at the attributes of the vtables, virtual functions and constructors/destructors involved in order to determine the class hierarchy [28]. Basic types are treated as being composed of three components—core, i.e. integer, pointer or float, size and sign. The algorithm for inferring composite types, such as arrays and structures, operates over finite sets of labelled memory accesses [29]. The types of object pointers are treated separately, since subclass fields can be of different types for different instantiations of the same parent object. To deal with this issue, the algorithm avoids merging type information beyond the inferred object size [15]. The main task in recovering exception handling code is identifying try blocks and their associated throw and catch blocks. At the time of publishing, the method for reconstructing exception handling code was compiler-dependent. For gcc-generated code, reconstruction is done by tracking the call sites, or sites where an exception may be thrown, and the corresponding landing pad, or code that executes the catch block [15]. The pattern-based structural analysis used in SmartDec identifies and reduces constructs such as do-while loops, if-then, if-then-else and compound conditionals, and break- and continue-related control-flow edges [15].

2.2 Compiler model

The compiler model borrows techniques from compiler theory and program analysis, adapting them to perform decompilation.

In 1973, Housel and Halstead were the first to propose reusing compiler techniques to perform decompilation. The intermediate representation they used was called the Final Abstract Representation, and Housel's decompiler reused concepts from compiler, graph and optimisation theory [30, 31].

Cifuentes contributed techniques for recovering function parameters and return variables using data flow analysis, and program control flow through a structuring algorithm, in her 1994 PhD thesis. She demonstrated her techniques through what is considered to be the first general C decompiler for x86 programs, dcc [11, 32, 33]. In 1999, Mycroft was the first to consider constraint-based type inference in the context of decompilation. The use of the static single assignment form (SSA) was also suggested to "de-colour" register allocation [12]. Johnstone et al. applied well-known compiler techniques to decompile assembly language for TMS320C6x and ADSP-21xx digital signal processors to ANSI C. The decompilation process includes applying structural control-flow analysis to recover the call graph and control structures, and applying constant propagation and live variable analysis to recover variables [34]. Van Emmerik formally addressed the use of SSA in decompilation in his 2007 PhD thesis, describing the algorithms to convert machine code into and out of SSA, perform data flow-based type inference and resolve switch-table jump targets. These techniques were implemented in the open-source decompiler Boomerang [13, 8, 35]. In 2011, Lee et al. extended the work of Mycroft to consider both an upper and a lower bound on variable types using type judgements. They demonstrated their method, named TIE, to be applicable as both a static and a dynamic analysis [36].

Several researchers have focussed on data-flow analysis and abstract interpretation [37] to analyse machine code. Balakrishnan and Reps demonstrated a way to perform abstract interpretation on machine code, allowing the recovery of a good approximation to the variables using the value-set domain [38, 39, 40, 5]. Their work is incorporated in a commercial tool named CodeSurfer/x86 [41]. Chang et al. proposed a modular decompilation framework so as to reuse existing source-level analysis tools, such as a model checker for C. In this framework, each decompiler is a different abstract interpreter and occupies a different level in the decompilation pipeline. Adjacent decompilers are connected via common intermediate languages and communicate via queries and reinterpretations [42]. Kinder proposed a simultaneous control and data flow analysis using abstract interpretation to overcome the "chicken and egg" problem in control-flow reconstruction [10]. SecondWrite is a binary rewriting tool that lifts machine code to the Low-Level Virtual Machine (LLVM) [43] intermediate representation, and makes use of the LLVM back-end to produce a C program. The lifting process involves, among other things, recovering the function prototype and deconstructing the stack into individual frames and variables. In performing these tasks, SecondWrite makes extensive use of value-set analysis (VSA) [38], for example to extract parameters and variables. Where VSA is too imprecise, the analysis is supplemented by run-time information. Type recovery is coupled with variable recovery, and the algorithm assigns a points-to set for each abstract location (aloc) as it is being discovered. Data structures are then formed by unifying the hierarchy of points-to sets [44, 45, 46].

2.3 Formal methods model

The formal methods approach focusses on two aspects: using formal methods to perform decompilation, and using decompilation as a means to verify machine code.

From 1991 to 1994, Breuer et al. demonstrated three approaches to generating, from the specification of a compiler, a decompiler compiler, the inverse of compiler compilers such as yacc. In the first approach, the authors observe that a decompiler can be viewed as a function from object code to a list of possible source-code programs; thus decompiling equates to enumerating the attribute grammar of the compiler. A decompiler compiler for a subset of the Occam programming language was demonstrated [47]. In the second approach, the declarative nature of logic programming was exploited: given a compiler specified in terms of Prolog Horn clauses, a decompiler is constructed by inverting the clauses and then reordering them to prevent non-termination [48]. The third approach defined a decompilation description language (DeCoDe), and implemented a decompiler compiler in C++ [49].

The FermaT transformation system, developed by Ward in 2000, is a C decompiler for IBM 370 assembler that uses the Wide Spectrum Language (WSL), also developed by the author, as its intermediate representation. At its core, the system performs successive transformations so that the resulting program is equivalent under a precisely defined denotational semantics. Expressions in WSL use first-order logic and include, for example, existential and universal quantification over infinite sets. The system is able to derive program specifications from assembly with some manual intervention [50, 51].

Katsumata and Ohori coined the term proof-directed decompilation in their 2001 paper, and observed that a bytecode language can be translated to and from intuitionistic propositional logic through the Curry-Howard isomorphism. They demonstrated their technique for a subset of the JVM assembly language, decompiling to typed lambda calculus [52]. In a follow-up paper, Mycroft compared type-based decompilation with proof-directed decompilation and concluded that the construction of the static single assignment (SSA) form in the former is equivalent to proof transformation in the latter for code with static jumps. It was also suggested that the two approaches may be combined by using the type-based approach as a metaframework to construct the appropriate proof transformation [53].

More recently, the work of Myreen et al. focussed on verification of ARM machine code by performing decompilation into logic. The decompilation is performed in three phases: firstly, evaluating the instruction set architecture (ISA) model for each instruction, stated in terms of a machine-code Hoare triple; secondly, computing the control-flow graph (CFG); and thirdly, composing the Hoare triples following the CFG [54, 55].

2.4 Assembly-to-C translation model

Assembly-to-C translation, which enables post-compilation optimisation and cross-platform porting of software, focusses on producing re-compilable and equivalent code, sacrificing readability.

Cifuentes et al. described the University of Queensland Binary Translator (UQBT), developed from 1999 to 2003, which enables quick porting of programs between Intel x86, Sparc and Alpha platforms. The translation involves four phases. The source-machine binary is first decoded to a source register transfer language (RTL) program using a binary-decoder; secondly, the source-RTL program is transformed into a machine-independent higher-level RTL (HRTL) program via an instruction-decoder; thirdly, the reverse procedure transforms the HRTL program down to a target-RTL program via an instruction-encoder, or alternatively into a C program; lastly, the target-machine binary is generated via a binary-encoder or an appropriate C compiler. The binary-encoder and binary-decoder are automatically generated using the New Jersey machine code toolkit [56]. UQBT performs three machine-level analyses to address three issues that face binary translation: locating all functions and all code, computing indirect control-flow targets and determining function parameters and return variables [57, 58, 59, 60].

RevNIC [61] and RevGen [62] are, or incorporate, assembly-to-C translators that allow fast and accurate porting of, for example, proprietary network drivers, and facilitate further analysis using source-level tools. Translation is performed in two stages: firstly to the LLVM intermediate representation via either a static or a dynamic binary translator, then to C via a CFG builder and the LLVM backend. The local and global state is preserved by the translator, and control-flow transfers are replaced by goto statements.

BCR (Binary Code Reuse) is a tool that allows a malware analyst to extract and reuse, say, a decryption routine from a piece of malware. This is achieved through extracting a fragment of interest from the program and inferring its interface, that is, its parameters, return variables and their data dependencies. The output of BCR is a C program comprising the function's inferred prototype and its body, which consists of inlined x86 assembly. The prototype is inferred via taint information collected from multiple execution traces [63].

2.5 Information-flow model

Decompilation is viewed as exploiting information leakage from the compilation process to reconstruct source code; this is the only model to consider an adversarial setting.

Drawing inspiration from compiler optimisation, Feigin and Mycroft proposed framing compilation as a one-to-many transformation of high-security information, or the source code L, to low-security information, or the machine code M. The equivalence set of machine code programs is known as a kernel. The success of a decompiler relies on its ability to infer the kernel for the observed M and recover L. The larger the kernel, or the higher the degree to which the compiler (the defender) is optimising or obfuscating, the harder the decompiler (the attacker) has to work [64]. Feigin alluded to Program Expression Graphs [65], which make equivalence relations explicit in the intermediate representation and may assist information flow-based decompilation [66]. Also noted is related work in code reverse engineering via side channel analysis [67, 68].

Viewing reverse engineering in the same vein, Della Preda and Giacobazzi proposed a theoretical framework for the obfuscation versus static analysis arms race. The observational power of the attacker, the decompiler, is characterised by the types of static analysis he or she can perform and which the defender, the compiler, must defend against [69].

This chapter has summarised prior work in decompilation in terms of five models. It is evident that recent and ongoing research encompasses all models. Popular decompilers today include Hex-Rays and Boomerang, highlighting the prominence of the heuristics-driven and compiler models, also the oldest models of the five. Introduced in the next chapter is a new model of decompilation, decompilation as search.

3 Decompilation as search

"When Google Translate generates a translation, it looks for patterns in hundreds of millions of documents to help decide on the best translation...The more human-translated documents that Google Translate can analyse in a specific language, the better the translation quality will be."

– Google Translate1

In 2004, Van Emmerik and Waddington described in some detail the process of decompiling a real-world Windows-based application. The application, which dealt with speech analysis and included heavy mathematics processing, was written and compiled with Microsoft Visual C++, and the main executable was 670KB in size. With the aid of a commercial disassembler, it took the team 415 man-hours, not including the first week of exploration, to decompile and recover 1,500 lines of source code consisting of 40 functions and representing about 8% of the executable [70]. This effort included the development of the Boomerang decompiler [8]. The payload of the Stuxnet worm, which came to light in June 2010, took a multi-national team of numerous malware and supervisory control and data acquisition (SCADA) systems experts five months of intensive coordinated analysis to decipher [3].

This chapter takes the form of a position paper and describes the motivation and arguments for performing decompilation, which can be a tedious and time-consuming process, as search. It first examines code reuse practices in software companies, open-source software development and malware development. This is done using four methods: a survey of prior research in code reuse, a survey of software copyright litigation over the years, a study of code reuse in the popular software repository Github.com and lastly a study of code reuse in the leaked source code of the ZeuS malware. This is followed by an articulation of the proposed approach, search-based decompilation, and we will consider, as a case study, the emergence of statistical machine translation (SMT) in the area of computational linguistics and propose a research agenda for search-based decompilation. We conclude by surveying related work in search-based software engineering (SBSE), software provenance and code clone detection techniques for binary code.

1 http://translate.google.com

3.1 How prevalent is code reuse?

Code reuse is prevalent in software companies and open-source communities, and factors such as market competitiveness and complexity of software will continue to bolster this trend. In this first section, prior work in software reuse literature is surveyed in order to address two main questions: If software is reused, what software components are reused, and how much is reused?

3.2 Prior work in software reuse research

Code or software reuse can be defined as the use of existing software components in order to construct new systems of high quality [71] and as a means to improve productivity [72]. Code reuse can be sub-divided into two categories: white-box and black-box. White-box reuse refers to the reuse of components by modification and adaptation, while black-box reuse refers to reuse of components without modification [73]. It was observed in the 1990s that software reuse, although highly desirable, does not tend to occur in a systematic fashion [71], and the success of software reuse can be attributed both to technical factors and, to a larger extent, non-technical ones. Object-oriented programming (OOP) has seen the most promise for component reuse; OOP environments such as Eiffel [74] have integrated object libraries; component- and framework-based middleware technologies, such as CORBA, J2EE and .NET, have become mainstream [75]. As observed by Schmidt in 2006, the non-technical factors affecting software reuse can be summarised as belonging to the following five areas [76, 77].

Market competitiveness Schmidt observed that code reuse was more likely to be practised in business environments that were highly competitive, such as financial services or wireless networking, and where the time-to-market was crucial. On the flip side, the 1980s was an example of a period when little software reuse was achieved in the U.S. defence industry, due to the defence build-up in the Reagan administration when research funding was abundant [77]. In 2008, open-source projects were found to actively reuse software components in order to integrate functionality quickly and mitigate development costs [78].

Application-domain complexity Software components that were easy to develop, such as generic linked lists, stacks or queues, were often rewritten from scratch. In contrast, developing domain-specific components, such as real-time systems, from scratch proved too error-prone, costly and time-consuming [76]. In a study of the IBM reusable software library (RSL), a library containing more than thirty libraries for data types, graphical interfaces, operating system services and so on, it was found that RSL's software found its way into no more than 15-20% of any software product to which RSL was an exclusive library vendor. In contrast, domain-specific components made up anywhere between 30% and 90% of the final product [79]. More recently, in 2007, Mockus [80] examined the reuse of directories and files between open-source projects and analysed a sample containing 5.3 million source-code files. He found that about 50% of these files were reused at least once in different projects. The most reused files were text templates, or "PO" (Portable Object) files, used by the GNU gettext toolset, the install module for Perl and files belonging to internationalisation modules. The largest batch of files that were reused at least 50 times were 701 include files from the Linux kernel and 750 system-dependent configuration files from GNU libc. Haefliger [78] studied the code reuse practices amongst six open-source projects in 2008 and observed that all the projects reused external software components; a total of 55 examples were found. The reused components represented 16.9 million lines of code, whereas the number of lines of non-reused code was 6.0 million. It was discovered that 85% of reused components could be found in the Debian packages list, and the three most commonly reused components were the zlib compression library, the MySQL database and the gtk/gdk/glib graphical toolkit.

Corporate culture and development process A 1980s study of Japanese "software factories", which were making greater progress than U.S. firms and having greater success in reusing software components, found that the key enabler was organizational factors, such as emphasis on centralised processes, methods and tools [81]. Open source projects were found to conduct searches for suitable candidates for software to reuse via the developers' mailing list [78].

Quality of reusable component repositories In 1993, Diaz [71] argued that although the 1980s saw the creation of large-scale reuse programs, such as the U.S. Advanced Research Projects Agency's (ARPA) Software Technology for Adaptable, Reliable Systems initiative, what was missing was a standardization/certification process to maintain and ascertain software quality. Knight and Dunn [72] suggested in 1998 that a precise definition for certification of reusable software components is desirable, since a reused component can add to the quality of the new product, not just mere functionality. For open-source project developers seeking to reuse software, code quality was judged by looking at the code, reading the documentation, frequently asked questions and related articles, and considering its popularity and how active its developers were [78].

Leadership and empowerment of skilled developers The quality and quantity of skilled developers and leaders determine the ability of a company or a project to succeed with software reuse. Conversely, projects that lack a critical mass of developers rarely succeed, regardless of the level of managerial and organisational support [77].

We now examine cases of GNU Public License violations and proprietary software copyright infringement in order to answer the following two questions: What software components are reused, and how is software reuse proven?

3.3 GNU Public License violations

The GPL grants other parties permission to copy, modify and redistribute software protected under the GPL as long as those parties satisfy certain conditions. In version 2 of the license, one of these conditions is that the parties provide either the "complete corresponding machine-readable source code" or a "written offer . . . to give any third party . . . a complete machine-readable copy of the corresponding source code" of the modified or derived work. According to Clause 4, "Any attempt otherwise to copy, modify, sublicense or distribute the Program is void, and will automatically terminate your rights under this License" [82].

There are two aims in looking at GPL violations. Firstly, the types of software involved are of interest, as is the frequency of such violations in the IT industry. Secondly, cases of GPL violations are publicly documented by the Software Freedom Law Centre [83] (SFLC), the Free Software Foundation [84] (FSF), GPL-violations.org and third-party websites such as Groklaw.net.

Harald Welte, founder of GPL-violations.org, said that he saw 200 cases of GPL violations in Europe from 2003 to 2012 [85]. Bradley Kuhn, a technical director at the SFLC, said he found on average one violation per day for the period 21 August to 8 November 2009 [86]. Due to the large number of cases that are pending enforcement action or resolution in court, only cases where either a lawsuit was filed and won (Won) or a lawsuit was filed and settled out of court (Settled) were considered. To date, there has been no known case in which the terms of a GPL-styled license were not successfully upheld in court.

As of February 2013, there have been 19 documented and concluded civil lawsuits involving GPL'd software: 11 were heard in the US, 8 were heard in Europe (Table 3.1). In 16 out of 19, or 84.2% of cases, violations were found in firmware meant for embedded devices, and the GPL'd components ranged from msdosfs, initrd, the memory technology devices (MTD) subsystem, the GNU C compiler, the GNU debugger, binutils and BusyBox to the packet filtering module netfilter/iptables.

In addition to lawsuits, there have been more than 30 examples where the terms of the GPL were successfully enforced without the need for litigation. Early cases of GPL compliance include RTLinux in 2001 [111] and Lindows in 2002 [112]. In 2004, Welte was said to have coaxed several distributors, including Fujitsu-Siemens, Securepoint, Belkin and Asus, into compliance [113]. In 2005, a further 13 companies, including Motorola, Acer, AOpen, Micronet, Buffalo and TrendWare, were handed warning letters by Welte [114]. Welte's gpl-violations.org project managed to conclude more than 25 amicable agreements, two preliminary injunctions and one court order for the period January to March 2005 alone [114].

It was not always clear how the GPL'd software components were discovered. In the Viasat/Samsung case a Linux developer discovered code belonging to BusyBox, parts of which he had written [115]. In the Mistic Software v. ScummVM case, the ScummVM developers were able to reproduce the same 0.9.0-related graphics glitch present in one of the games [116]. In one case, Welte v. Fortinet Ltd., encryption was used, possibly to hide the presence of GPL'd code.

3.4 Proprietary software copyright infringement

There are seven known cases involving proprietary software incorporating other proprietary software. In the case of Cadence Design Sys. v. Avant!, the nature of the software in question is not documented; however, what is known is that both companies were chip-design software manufacturers [117]. Three of the cases dealt with operating systems and programming-related software: Green Hills v. Microtec dealt with compilers [118], the

Year | Case | Product | GPL component | Outcome
2002 | Progress Software v. NuSphere | Gemini MySQL Table | MySQL | Settled [87]
2004 | Welte v. Allnet | Router firmware | netfilter/iptables | Settled [88]
2004 | Welte v. Gigabyte Technology | Router firmware | Linux, netfilter/iptables | Settled [89]
2004 | Welte v. Sitecom | Router firmware | netfilter/iptables | Won [90, 91]
2005 | Welte v. Fortinet | FortiOS, Firewall, Antivirus | Linux kernel | Won [92]
2006 | Welte v. DLink | Data storage device firmware | Linux kernel (msdosfs, initrd, mtd) | Won [93, 94]
2007 | SFLC v. Monsoon Multimedia | HAVA firmware | Busybox | Settled [95]
2008 | SFLC v. High-Gain Antennas | Firmware | Busybox | Settled [96]
2007 | SFLC v. Xterasys | Firmware | Busybox | Settled [97]
2008 | SFLC v. Verizon | FiOS firmware | Busybox | Settled [98]
2008 | SFLC v. Extreme Networks | Firmware | Busybox | Settled [99]
2008 | SFLC v. Bell Microproducts | Firmware | Busybox | Settled [100, 101]
2008 | SFLC v. Super Micro Computer | Firmware | Busybox | Settled [102, 99]
2008 | Welte v. Skype Technologies | VOIP phone firmware | Linux kernel | Won [103]
2008 | FSF v. Cisco | Router firmware | GCC, binutils, GDB | Settled [104, 105]
2008 | Jin v. IChessU | IChessU client | Jin | Settled [106]
2009 | SFLC v. JVC, Samsung, Best Buy et al. | Cell phones, PDAs and other small, specialized electronic devices | Busybox | Settled/Won [107, 108]
2009 | EDU 4 v. AFPA | EOF | VNC | GPL infringement successfully appealed by AFPA [109]
2011 | AVM v. Cybits | Router firmware | Linux kernel, iptables | GPL parts ruled modifiable [110]

Table 3.1: Known cases of GPL violations either settled or won.

Cisco v. Huawei case involved embedded systems [119, 120], while Compuware v. IBM dealt with mainframe systems [121, 122], and Veritas v. Microsoft dealt with the disk manager in the Windows 2000 operating system [123, 124]. Database management software was also featured in two of the cases, Computer Associates (CA) v. Quest Software [125] and CA v. Rocket Software [126, 127] (Table 3.2).

In Veritas v. Microsoft and Compuware v. IBM, the source code was shared under a non-disclosure agreement. In a similar case, Green Hills v. Microtec, source code was shared under a distribution agreement. In CA v. Quest the source code was allegedly obtained by former CA employees [125]. In Cisco v. Huawei the evidence presented was that of the similarity of the user interface commands, but crucially also that of a similar software bug that existed in both the versions of IOS and VRP in use at the time.

Year | Case | Infringing product | Infringed product | Outcome
1993 | Green Hills v. Microtec | “compiler technology” | “compiler technology” | Settled [118]
1995 | Cadence Design Sys. v. Avant! Corp. | Unknown | Unknown | Settled [117]
1999 | McRoberts Software v. Media 100 | Comet/CG (character generation) | Finish product line source code | Won [128]
2002 | Compuware v. IBM | File Manager, Fault Analyser | File-AID, Abend-AID, Xpediter | Settled [121, 122]
2003 | Cisco v. Huawei | Quidway Versatile Routing Platform (VRP) | Internetwork Operating System (IOS) | Settled [119, 120]
2005 | CA v. Quest Software | Unknown | Quest Central for DB2 | Settled [125]
2006 | Veritas v. Microsoft | Logical Disk Manager | Volume Manager | Settled [123, 124]
2009 | CA v. Rocket Software | “DB2 products” | “DB2 products” | Settled [126, 127]

Table 3.2: U.S. cases involving proprietary software incorporating proprietary software. In cases where the product in question is not known, it is labelled as Unknown.

3.5 A study of code reuse on Github

Github.com is the leading repository for open-source code bases, hosting a large variety of projects in dozens of languages, including code belonging to several well-known software companies, such as Twitter, Facebook and NetFlix. As an open-source repository, Github had 1,153,059 commits from January to May 2011, more commits than Sourceforge, Google Code and Microsoft’s CodePlex combined [129]. The purpose of studying code reuse on Github was to determine the components most widely reused. Two methods were used to study the two forms of code reuse: to study white-box reuse, the most popularly forked projects were examined and a keywords-based search on Github and Google was performed; to study black-box reuse, a keywords-based search was performed. This study found that projects centred on web servers and related tools saw more than 2,000 project instances of white-box code reuse. Operating system and database development were also areas where white-box reuse was actively pursued. The level of black-box reuse was significantly higher than that of white-box reuse. Black-box code reuse on Github was centred around: installation and programming tools (21,000 instances), web servers and related tools (2,000 instances), databases (1,400 instances), cryptographic libraries (885 instances) and compression libraries (348 instances). Some specialised libraries that were commonly reused were the Boost C++ library for C++ projects, libusb for embedded devices, and libsdl for game development.

White-box reuse Forking a project on Github means making a copy of that project in order to modify it. The number of forks that a project has is a good proxy not only for the popularity of that project, but also for white-box code reuse. The most-forked repository as of February 2013 was Twitter’s Bootstrap front-end web development framework with 12,631 forks. Of the top ten most popularly forked repositories, seven were for web-based development. The three non-web-based code bases were a demonstration-only repository, a customised configuration manager for the zsh shell program and a development toolkit for mobile platform applications (PhoneGap). The most popular C/C++-based repository on Github was the Linux kernel with 1,933 forks (Table 3.3).

Rank | Repository | Description | Forks
1 | Bootstrap | Front-end web framework | 12,631
2 | Spoon-Knife | Demonstration repository | 12,249
3 | homebrew | A package manager for Mac OS | 5,477
4 | rails | A Ruby-based web framework | 5,014
5 | html5-boilerplate | An html5-based web framework | 3,821
6 | hw3_rottenpotatoes | A movie rating website | 3,598
7 | oh-my-zsh | A configuration management framework for the zsh shell | 3,310
8 | node.js | A Javascript-based event-based I/O library | 3,305
9 | jQuery | A Javascript-based library | 3,217
10 | phonegap-start | A tool for PhoneGap application development | 2,720
11 | impress.js | A Javascript-based presentation library | 2,442
12 | hw4_rottenpotatoes | A movie rating website | 2,343
13 | backbone | A Javascript-based database library | 2,284
14 | d3 | A Javascript-based visualisation library | 2,103
15 | jQuery-ui | A user-interface library for jQuery | 2,044
16 | symfony | A PHP-based web framework | 2,043
17 | game-of-life | A demonstration application for the Jenkins build management tool | 2,008
18 | Linux | A UNIX-based kernel | 1,960
19 | CodeIgniter | A PHP-based web framework | 1,951
20 | phonegap-plugins | Plugin development for PhoneGap | 1,900

Table 3.3: The most forked repositories on Github (as of February 2013).

A second independent indicator for white-box reuse was the number of search results for the phrase “based on”. This search returned 269,789 out of 5,379,772 repositories and 28,754,112 hits in all indexed source code on Github as of February 2013. Since the occurrence of this phrase on its own does not imply code reuse, for example “based on position information” is a false positive, these search results were further analysed by including additional terms, such as well-known project names and platforms, in the query. A similar query was made on the Google search engine using the site-specific qualifier site:github.com, and the gross and unique numbers of hits were noted. The results, tabulated in Table 3.4, are sorted by the number of Github repositories found. Searches that returned fewer than 1,000 hits were excluded. It was observed that web-based projects occupied the top-ranked spots in both of these tables, in particular the jQuery, Bootstrap, Node.js and Rails projects. This can be explained by the fact that the most popular programming language on Github is Javascript, with 21% of all projects written in the language. The other popular languages were Ruby (13%), Java (8%) and Python (8%). One possible implication of this is that white-box reuse is most popular amongst web-based software developers. The second category of software projects in which white-box reuse occurred involved operating systems projects, especially Linux-related projects. The high number of hits for Linux can also be attributed to the fact that many repositories had a directory named “Linux”, thus each occurrence of a file within a “Linux” directory produced a hit in the search results. The relevant results for the phrase “based on Google” were “Google Web Toolkit”, “Google File System”, “Google Closure Library”, and “the Google Chromium project”. The top entry for the Linux-related queries was “based on the Linux kernel” with 2,710 hits. One phrase that had 1,790 occurrences was “Based on linux/fs/binfmt_script.c”, found in the file linux/fs/binfmt_em86.c. The Ubuntu-related entries were related to configuration files and setup scripts, for example “based on the Ubuntu EC2 Quick Start Script”, “install_dependencies.sh (based on the Ubuntu script)”, “based on the Ubuntu /etc/init.d/skeleton template”, and “based on Ubuntu apache2.conf”. MySQL-related results were databases, parsers, lexers and scripts based on the MySQL code base. Database projects associated with MySQL included “MariaDB”, “WikiDAT”, “NodeDB”, “MySQL-for-Python3”, “Twitter MySQL”, “GoogleCloudSQL”, “Shinobi” and “Smart Library Management System (SLMS)”, to name a few. There was only one relevant Windows-based result, “based on Windows winerror.h”. For queries related to GNU, “based on the GNU Multi-Precision library” was the most commonly occurring phrase, followed by “based on the GNU toolchain”. Popular Mozilla projects and software that were of interest to developers were “Mozilla XUL”, the “Mozilla Rhino Javascript engine”, “Mozilla Jetpack” and a “Mozilla SHA1 implementation”.

False positives On manual inspection of the other results, it was often the case that the phrase “based on X ” referred to X as a platform, an API or an SDK, rather than an instance of code reuse. Examples of this were statements such as “the server is based on Windows” and “the development is based on Windows, therefore. . . ”. A fair proportion were statements alluding to the use of a certain API or SDK, such as “based on the Google App Engine SDK” and “based on Windows IMM32 API”. Android-related hits were primarily associated with Android application development, and included projects dealing with the “Android NDK”, “Android Maven Plugin”, “Android SDK” and an “Android example project from Cordova/PhoneGap”. Windows-related entries were mostly associated with SDKs and demonstration code. A commonly occurring phrase for the Apple-related query was “(Based on Apple Inc. build . . . )”. This phrase was printed to standard output whenever the command llvm-gcc --version was executed, and hence it was found in numerous issues filed on Github.

Black-box reuse The next set of experiments was performed to investigate black-box reuse on Github. To do this, a list of the libraries from the Debian distribution with the most reverse dependencies was obtained by running the command apt-cache dotty on an Ubuntu 11.10 system. If a package A is dependent on B, then we say that B is reverse dependent on A. Out of a total of 35,945 libraries, the 10 with the most reverse dependencies are listed in Table 3.5. The number of repositories containing the phrase “apt-get install” was used as an indication of the number of projects that listed dependency information; there were 11,033 such repositories according to Github search. Table 3.6 lists the results of searching for “apt-get install <library>” on Github. The top entries were associated with the installation of build tools (make, g++, build-essential) and language-specific packages (ruby, python-dev and php). Tools such as wget and git were also commonly used to download additional packages not maintained by the distribution. The popularity of Ruby- and Python-based projects explains the high occurrence of the ruby, rubygems, python-dev and python-support packages. The popularity of web-based applications on Github was evident given the number of repositories using apache2, libxml2, libxslt, libssl and nginx. Databases were also commonly listed as dependencies, for example mysql and postgresql. The zlib compression library was included as a requirement in 348 repositories.

Search phrase | Github (repositories) | Google (unique entries)
“based on (the) jQuery” | 8,336 (683) | 293,400 (650)
“based on (the/Twitter’s) Bootstrap” | 3,005 (623) | 115,117 (460)
“based on (the) Rails” | 11,441 (479) | 699,700 (320)
“based on (the) Google” | 3,783 (426) | 345,700 (630)
“based on (the) Node” | 38,527 (367) | 2,346,700 (590)
“based on (the) Linux” | 291,490 (334) | 2,318,600 (220)
“based on (the) Ubuntu” | 2,080 (286) | 44,800 (270)
“based on (the) Android” | 4,292 (259) | 106,230 (260)
“based on (the) MySQL” | 1,809 (249) | 351,700 (140)
“based on (the) Windows” | 12,568 (227) | 604,650 (150)
“based on (the) CodeIgniter” | 3,400 (195) | 3,350 (260)
“based on (the) Symfony” | 3,345 (154) | 28,430 (200)
“based on (the) Debian” | 2,749 (145) | 121,440 (220)
“based on (the) GNU” | 23,825 (129) | 11,010,600 (150)
“based on (the) Gtk” | 10,498 (90) | 1,920 (120)
“based on (the) BackBone” | 1,797 (88) | 175,660 (230)
“based on (the) Apple” | 10,275 (70) | 240,800 (610)
“based on (the) Microsoft” | 6,314 (49) | 194,430 (200)
“based on (the) Gnome” | 3,706 (45) | 178,620 (120)
“based on (the) Mozilla” | 16,595 (42) | 571,250 (350)
“based on (the) BSD” | 15,810 (37) | 369,890 (120)
“based on (the) zlib” | 24,973 (16) | 54,620 (110)
“based on (the) FreeBSD” | 6,241 (14) | 21,170 (130)
“based on (the) Berkeley” | 1,386 (6) | 69,800 (110)

Table 3.4: Number of results returned for Github search and Google search (using site:github.com) sorted by the number of Github repositories.

Rank | Package | Description | Reverse dependencies
1 | libc6 | GNU C library | 14,721
2 | libstdc++6 | GNU standard C++ library | 4,089
3 | python | An interactive high-level object-oriented language | 4,009
4 | libgcc1 | GNU C compiler support library | 3,811
5 | libglib2.0 | GLib library of C routines | 2,758
6 | libx11-6 | X11 client-side library | 1,582
7 | libgtk2.0 | GTK+ graphical user interface library | 1,464
8 | libqtcore4 | Qt 4 core module | 1,221
9 | zlib1g | Compression library (runtime) | 1,205
10 | python-support | Automated rebuilding support for Python modules | 1,017

Table 3.5: Libraries with the most reverse dependencies in Ubuntu 11.10. If a package A is dependent on B, B is reverse dependent on A.

The libboost library was popular among C++-based projects. Given that the g++ compiler and libboost were mentioned about 2,000 and 260 times respectively, it is estimated that libboost was used in about 13% of C++ projects. The term “boost” was found in 2,816 C++ repositories, which is about 14% of C++ repositories. Another popular C++ library was libqt4, which is a cross-platform application framework. More specialised libraries in the list were libusb and libsdl. The libusb library provides an interface that gives applications access to USB-enabled embedded devices; popular devices included the Microsoft Kinect, Android devices and Apple iPhones. The libsdl library is a multimedia-focussed library that exposes an interface to graphics, sound and input devices for game development.

Package | Github repositories
make | 5,620
git | 4,506
g++ | 1,994
ruby | 1,785
python-dev | 1,414
build-essential | 1,158
curl | 1,043
wget | 1,036
python-support | 970
gcc | 966
tar | 939
mysql | 924
php | 810
libxml2 | 564
openssl | 504
apache2 | 501
rubygems | 486
postgresql | 442
libssl | 384
nginx | 381
zlib | 348
autoconf | 330
dpkg | 329
automake | 266
libxslt | 265
libboost | 263
libsqlite3 | 261
subversion | 261
libtool | 260
bison | 223
python2.7 | 194
ant | 179
libqt4 | 173
flex | 169
aptitude | 163
libc6 | 139
python2.6 | 139
rsync | 139
libusb | 138
locales | 133
libsdl | 133
openjdk-6-jdk | 126
adduser | 122
libpng | 122

Table 3.6: Number of occurrences of common library names returned by Github search using “apt-get install <library>”.

3.6 Code reuse in malicious software

Analysis of the ZeuS source code suggests that code reuse is prevalent in malware development. Studying code reuse in malware development is interesting because, perhaps surprisingly, companies producing malware operate like standard commercial software firms. One such company, Innovative Marketing Ukraine, was reported to have reaped US$180 million in profits in 2008 [130]. The ZeuS trojan is a well-known malware family specialising in stealing on-line banking credentials and automating parts of the money-laundering process. ZeuS is not used by its developers but is sold as a toolkit to cybercrime rings doing the actual money laundering. Groups working on toolkits like ZeuS adopt a normal software development lifecycle and employ QA staff, project managers and developers for specialised components such as kernel drivers, sophisticated packers and graphical user interfaces [131]. In May 2011, a version of the source code for the ZeuS toolkit, worth as much as US$10,000 per copy, was leaked [132, 133]. A copy of the source code is available on Github2. Two aspects of the source code were studied, namely the amount of actively developed software within the ZeuS source code and the extent to which software was obtained from external sources.

2 https://github.com/Visgean/Zeus

The amount of actively developed software components was studied by looking at the “last modified” dates in the file metadata; a list of externally obtained software was obtained via a search for the string “http://” in the source code. Analysis of the ZeuS source code suggests that 6.3% of the files, or 5.7% of the code, was worked on post-October 2010. If work did indeed begin in October 2010, then 94.3% of version 2.0.8.9 was “legacy” code. The 5.7% that was new involved functionality such as data interception, encryption and storage in the client component. A study of the web references embedded in the comments throughout the entire source code tree revealed a total of 3,426 lines (3.7%) of white-box code reuse throughout the whole code base, and 8,180 out of 140,800 bytes (5.8%) of black-box code reuse considering only the 32-bit client executable. However, this is a conservative estimate. Categories of software components reused in ZeuS 2.0.8.9 included: keyboard symbols-to-unicode encoding (xterm), VNC (UltraVNC), random number generation (Mersenne Twister), data compression (info-zip, UCL) and binary analysis (BEAEngine). The leaked ZeuS source code was version 2.0.8.9 of the toolkit, and is made up of 68,975 lines of C++ and 22,914 lines of PHP. The toolkit is divided into six components: bcserver, builder, buildtools, client, common and server. Comments are in Russian encoded using the Windows-1251 Cyrillic code page. The client has three main functions: obtaining login credentials, transferring the credentials to the botmaster’s server and ensuring its continued presence on the host. The bcserver component stands for the “back connect server”, and exists as a Microsoft Visual Studio 2003 C++ project. The server component, which is written in PHP, is responsible for storing stolen credentials in a MySQL database.

3.6.1 Analysis of “last modified” dates

The oldest and newest “last modified” dates were “14 October 2010” and “10 May 2011” respectively. Out of the 126 C++ and PHP files in the source code, 118 or 93.7% had the same oldest last-modified date. In terms of code, these files made up 94.3% of the code base. It is possible that these files were unmodified from the previous release when the timestamps were last updated. The breakdown of these files was as follows.

bcserver (3 out of 3 or 100%), builder (7 out of 9 or 77.8%), buildtools (4 out of 4 or 100%), client (28 out of 32 or 87.5%), common (32 out of 34 or 94.1%) and server (44 out of 44 or 100%).

The files that had last-modified dates after October 2010 were:

- source/builder/buildconfig.cpp (sets parameters such as the encryption key and server URL in the client, last modified 4 November 2010),

- source/builder/info.cpp (information dialog box in builder, last modified 2 April 2011),
- source/client/core.cpp (core functionality of the client, last modified 29 October 2010),
- source/client/cryptedstrings.cpp (list of hard-coded obfuscated strings, last modified 14 April 2011),
- source/client/httpgrabber.cpp (HTTP/HTTPS data interception, last modified 5 November 2010),
- source/client/wininethook.cpp (hooking of wininet.dll, last modified 3 November 2010),
- source/common/binstorage.cpp (encrypted file management on the client, last modified 5 November 2010) and
- source/common/crypt.cpp (functions dealing with cryptography and random number generation, last modified 5 November 2010).

Most of the development on the client appears to have been from October to November 2010; the builder component was worked on up until 2 April 2011. A 30-line function in client/core.cpp, getKernel32Handle, was found to contain comments in English (Figure 3.1). This was unusual given that most of the comments were in Russian. A search on Google for the comment string “clear the direction flag for the loop” revealed a post on a software security-related blog dated 19 June 2009 entitled “Retrieving Kernel32’s Base Address” [134]. The blog post described three methods of retrieving the base address of the kernel32.dll image in memory. The code snippet which was copied was the third of these methods, and this method was, according to the blog post, “more robust” in dealing with Windows 2000 up to Windows 7 RC1. Analysis of the files modified post-October 2010 seemed to suggest that the changes focussed on the client, and concentrated in the areas of data interception, encryption and storage, possibly to deal with newer versions of Windows. The files belonging to the server were untouched.

3.6.2 Search for “http://” string

The second method used to study code reuse was to search for the “http://” string in the source code as an indicator of publicly available information used by the developers of ZeuS. The results are listed in Table 3.7. Web references were found in the client (1–4), server (5–6) and common (7–11) components, and can be classified into four categories: network connectivity and communications (1–4, 6, 10–11), data compression (5, 9), binary analysis (7) and random number generation (8).

SOCKS The two SOCKS-related URLs found in the source code referred to the SOCKS RFC specifications, not to actual source code. A quick search for the file name socks5server.cpp and the variable name S5_CLIENT_IS_IPV6 yielded no results on Google.

HMODULE _getKernel32Handle(void) {
#if defined _WIN64
    return NULL; //FIXME
#else
    __asm {
        cld                     //clear the direction flag for the loop
        mov edx, fs:[0x30]      //get a pointer to the PEB
        mov edx, [edx + 0x0C]   //get PEB->Ldr
        mov edx, [edx + 0x14]   //get the first module from the InMemoryOrder module list
    next_mod:
        mov esi, [edx + 0x28]   //get pointer to modules name (unicode string)
        mov ecx, 24             //the length we want to check
        xor edi, edi            //clear edi which will store the hash of the module name
    loop_modname:
        xor eax, eax            //clear eax
        lodsb                   //read in the next byte of the name
        cmp al, 'a'             //some versions of Windows use lower case module names
        jl not_lowercase
        sub al, 0x20            //if so normalise to uppercase
    not_lowercase:
        ror edi, 13             //rotate right our hash value
        add edi, eax            //add the next byte of the name to the hash
        loop loop_modname       //loop until we have read enough
        cmp edi, 0x6A4ABC5B     //compare the hash with that of KERNEL32.DLL
        mov eax, [edx + 0x10]   //get this modules base address
        mov edx, [edx]          //get the next module
        jne next_mod            //if it doesn't match, process the next module
    };
#endif
}

Figure 3.1: The code listing for function getKernel32Handle, which was posted on a blog in June 2009 [134].

xterm The xterm reference was an indication that the source code of the terminal emulator for the X Window System was used in the VNC server implementation in the client. On inspection of the source code of xterm, it was discovered that the keysymtab[] array from keysym2ucs.c, which occupied 800 lines of C, was reproduced verbatim in client/vnc/vnckeyboard.cpp. The keySymToUnciode (sic) function in the same file, comprising 45 lines of C, was a close derivative of the keysym2ucs function. Attribution to the original author (Markus G. Kuhn <[email protected]>, University of Cambridge, April 2001) had been removed, presumably to obscure the origins of the code.

UltraVNC Based on the reference to UltraVNC in the ZeuS VNC module, the source tar archives of UltraVNC versions 1.0.2, 1.0.8.2 and 1.1.8.0 were analysed for signs of reuse. Although the header file in which the reference was found, rfb.h, bore resemblance to the rfb/rfbproto.h file in the UltraVNC source tree, there was no evidence of verbatim copying of UltraVNC code. If it is assumed that ZeuS’ RFB implementation was based on the UltraVNC version, the number of lines reused was 1,216.

No. | URL | Description
1 | http://www.opennet.ru/base/net/socks5_rfc1928.txt.html | Specification for the SOCKS5 protocol (client)
2 | http://www.sockschain.com/doc/socks4_protocol.htm | Specification for the SOCKS4 protocol (client)
3 | http://invisible-island.net/xterm | The xterm emulator used by the VNC server (client)
4 | http://www.uvnc.com | UltraVNC (client)
5 | http://www.info-zip.org/Zip.html | A data compression program used to archive files (server)
6 | http://etherx.jabber.org/streams | Part of the Jabber API to interact with the Jabber Instant Messaging network (server)
7 | http://beatrix2004.free.fr/BeaEngine | An x86 disassembler (common)
8 | http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/MT2002/emt19937ar.html | Mersenne Twister random number generator (common)
9 | http://www.oberhumer.com/opensource/ucl | A data compression library (common)
10 | http://www.google.com/webhp | URL used to test network latency (common)
11 | http://www.winehq.org/pipermail/wine-bugs/2008-January/088451.html | A description of a bug in the way WINE handles the WriteConsole Windows API call (common)

Table 3.7: Results of a search for “http://” in the source code of ZeuS version 2.0.8.9.

info-zip With regards to the reference to http://www.info-zip.org/Zip.html, on inspection of the PHP script in question, server/php/fsarc.php (server file system archival), the following code was found.

$cli = 'zip -r -9 -q -S "'.$archive.'" "'.implode('" "', $files).'"';
exec($cli, $e, $r);

A copy of zip.exe, with an MD5 hash of 83af340778e7c353b9a2d2a788c3a13a, corresponding to version 2.3.2, was found in the same directory.

Jabber To investigate the implementation of the client for the Jabber instant messaging protocol, two searches were made: firstly for the file name in question, jabberclass.php, then for unique strings in the script. Both searches came up with nothing pointing to earlier, public versions of the script. Thus, it was concluded that the Jabber client was implemented from scratch.

BEAEngine The reference to the BEAEngine was found in common/disasm.h, which contained three functions: Disasm::init, Disasm::uninit, and Disasm::_getOpcodeLength. An external call to an LDE function was found in the last of these. LDE, or Length Disassembler Engine, is statically linked from the object file lib/x32/lde32.lib.

Mersenne Twister A URL was found in common/crypt.cpp which made reference to the Mersenne Twister pseudo-random number generator developed by Makoto Matsumoto and Takuji Nishimura. The C source code of the Mersenne Twister was available from the website. Comparison of the two sets of code revealed that 62 lines of C were copied verbatim from the original source code, namely the init_genrand and genrand_int32 functions.

UCL The next file that was investigated was common/ucl.h. Both the URL for the UCL data compression library and the string “1.03” were found. Manual analysis of both implementations suggests that the ZeuS version was a syntactically modified version of the original. Examples of syntactic changes included: replacement of the ucl_uint and ucl_bytep types with DWORD and LPBYTE respectively, replacement of calls to ucl_alloc and ucl_free with Mem::alloc and Mem::free respectively, removal of all calls to assert and removal of all statements guarded by the UCL_DEBUG preprocessor symbol. The ZeuS variant of the UCL library included a total of 1,303 lines of C. This was an example of large-scale white-box reuse in ZeuS.

Google/webhp, WINE The last two references were not directly related to code reuse. Requests to www.google.com/webhp were used to test the latency of the network connection as part of the client’s reporting back to the server. The reference to the WINE bug suggests that at least one developer was using a Linux machine to test the buildtools or bcserver components.

3.7 Summary of findings

The findings of the four surveys may be summarised as follows.

- Software reuse is widely practised in open-source projects, commercial projects and malware development alike, with black-box reuse favoured over white-box reuse. The exception is web-related projects, which are more likely to perform white-box than black-box reuse.
- Reused software can make up between 30% and 90% of the final product.
- Widely reused software includes operating systems software, such as the Linux kernel; software development tools and libraries, such as the GNU C library; database software; encryption libraries; and compression libraries.
- This trend is likely to continue, bolstered by factors such as market competitiveness, the growing complexity of software products and the availability of high-quality, public source code.

3.8 Proposed approach: Search-based decompilation

Building upon the observation that code reuse is prevalent, this thesis proposes framing decompilation as a search problem. It is noted that a similar shift occurred in machine translation, from a rule-based approach to statistical machine translation (SMT). What follows is an examination of the factors leading to the emergence of SMT, together with its benefits and challenges; a research agenda for search-based decompilation is then proposed.

3.8.1 Case study: Statistical machine translation

Statistical machine translation provides valuable insight into both the benefits and the challenges from which search-based decompilation can learn. In April 2006, Google switched its Translate system from Systran to its own machine translation system. Google’s Franz Och explained the switch:

“Most state-of-the-art commercial machine translation systems in use today have been developed using a rule-based approach and require a lot of work by linguists to define vocabularies and grammars. Several research systems, including ours, take a different approach: we feed the computer with billions of words of text, both monolingual text in the target language, and aligned text consisting of examples of human translations between the languages. We then apply statistical learning techniques to build a translation model.”[135]

Although the idea of applying statistics and cryptanalysis to language translation was first proposed by Weaver in 1949 [136], it was only in 1990 that Brown et al. [137] described the main ideas for statistical machine translation (SMT) in what is now widely considered to be the seminal work of the field. The re-emergence of SMT could be attributed to several factors, including: the growth of the Internet, which caused a spike in demand for the dissemination as well as the assimilation of information in multiple languages, the availability of fast and cheap computing hardware, the development of automatic translation metrics, the availability of several free SMT toolkits and the availability of a corpus of parallel texts [138]. A similar situation exists for a proposed search-based decompilation (SBD) approach: there is a plethora of open-source projects such as GNU, the Apache foundation, Linux and the BSD distributions, as well as public code repositories such as Google Code, Github and Sourceforge; the corpus of parallel “texts” can be obtained through compilation using the most widely used gcc and g++ GNU compilers. What is needed, therefore, is an SBD methodology, and a quality metric for the decompiled output. As demonstrated in the Cisco v. Huawei case, software bugs form a fingerprint which can uniquely identify a software component and thus provide legal evidence in court. Such fingerprinting could prove useful also as a quality metric for SBD and is thus an interesting line of research to pursue. The goal of SMT is to take a sentence in the source language and transform it into a linguistically equivalent sentence in the target language using a statistical model. Broadly speaking, this model is composed of three components: a language model, which given an English string e assigns P(e), the probability of occurrence of e in the corpus; a translation model, which given a pair of strings f, e assigns P(f|e), the conditional probability that f occurs given e; and a decoding algorithm, which given a language model, a translation model and a new sentence f, finds a translation e that maximises P(e)·P(f|e). The translation model may be word-based or phrase-based, with the word-to-word or phrase-to-phrase correspondences modelled using an alignment procedure. Phrase-based SMT is considered to be superior to word-based SMT [139]. One benefit of SMT is its non-reliance on linguistic content. Unlike rule-based translation, SMT systems are not tailored to any specific pair of languages, and developing a new SMT system can take as little as a day [140]. Oard and Och reported constructing a Cebuano-to-English SMT system in ten days [141]. Secondly, SMT systems tend to produce natural translations instead of literal translations. For example, given “Schaukelstuhl”, the intuition is that “rocking chair” has a higher P(e)·P(f|e) than “rockingstool”. A fundamental research question for SBD is “What constitutes a token?”, since binary code is different and possibly more complex, as it can include, for example, conditional control flow and recursion. For text, tokens are typically words or phrases; for binary code, potential candidates are instructions, instruction sequences or even graphs. One challenge faced by phrase-based SMT is that statistical anomalies may occur, e.g. “train to Berlin” may be translated as “train to Paris” if there are more occurrences of the latter phrase.
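For reference, the model just described can be stated compactly as the standard noisy-channel decoding objective, using the P(e) and P(f|e) notation introduced above (this is a restatement of the prose, not an additional assumption):

\hat{e} \;=\; \arg\max_{e} P(e \mid f)
        \;=\; \arg\max_{e} \frac{P(e)\, P(f \mid e)}{P(f)}
        \;=\; \arg\max_{e} P(e)\, P(f \mid e)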
Out-of-vocabulary words are words that do not occur in the training set, such as names, and hence cannot be translated. Koehn suggested that nouns be treated separately with more advanced features and modelling [142]. Another challenge for phrase-based SMT is the inability to produce correct syntax at the sentence level whilst having correct local structure, sometimes resulting in incoherent sentences. This is due in part to the inability of phrase-based SMT to handle large-scale reordering. One proposed solution is to perform syntax transformation prior to phrase-based SMT [142]. A third challenge for SMT is its dependence on the corpus: results can be poor for a corpus that is sparse, unrepresentative or of poor quality [143]. This problem can only be addressed by including more training data. The corresponding potential challenges for SBD are in dealing with possible “out-of-vocabulary” instructions or instruction sequences, the ability to perform large-scale reordering of decompiled source code, and its dependence on the quality of the “corpus”, which can vary with the platform and compiler used. It is perhaps still premature to consider these issues at the outset without an initial SBD methodology. Given that external software components can make up as much as 90% of the final product, it is practical to initially focus on whole-function matching. Moreover, function-level decompilation is the default use-case of both the Boomerang [8] and Hex-Rays [7] decompilers.

3.8.2 Proposed research agenda

To summarise, the following research agenda for SBD is proposed.

- Drawing inspiration from SMT, one possible starting point is to investigate a token-based approach to SBD, and address the question “What constitutes a token?” in the context of binary code.

- It is natural to focus on whole-function matching as a starting point. Depending on its success, other forms of matching, such as basic block matching or call-graph matching, may be subsequently considered.
- Based on the observation that software bugs form a fingerprint that uniquely identifies software, investigating software testing as a means to fingerprint and evaluate the quality of the decompiled output is a viable avenue of research.
- The GNU C library and the Linux kernel are useful as an initial SBD corpus, since they are widely used by open source and commercial projects alike.

3.9 Related work

Search-based software engineering (SBSE). SBSE considers the forward problem of using search techniques, such as operations research and evolutionary computation, to generate programs, often in a high-level language, that meet one or more success criteria [144]. Harman et al. provide two comprehensive surveys of the field [145, 144]. SBSE considers two main components in its methodology: a suitable abstraction for the problem that lends itself to search, and a fitness function through which solutions can be evaluated [145]. The applications of SBSE are wide-ranging and include testing and debugging, maintenance, software design and requirements and specification engineering [144]. The SBSE community has thus far not considered decompilation as one of its focus areas. It is noted, however, that the abstraction-fitness framework of SBSE is also relevant for SBD.

Binary code provenance. Davies et al. coined the term “Software Bertillonage” to refer to the problem of determining the origin of software components, in their case Java class archives. Analogous to human fingerprints and mugshots, their index, called anchored class signatures, comprised the type signatures of the class, its default constructor and its methods [146]. Rosenblum et al. studied software authorship attribution using instruction- and control-flow-based representations of binary code. They were able to correctly identify the author out of a set of 20 with 77% accuracy, and rank the correct author among the top five 94% of the time, showing that programmer style is preserved through the compilation process [147]. Using similar techniques, Rosenblum et al. were also able to jointly infer the source language, compiler family and version, and optimisation level options used to produce a binary with 90% accuracy [148]. While the approaches to provenance are relevant to SBD, the goals are different in that the focus of SBD is on matching functionality, rather than artefacts. Unlike the work of Davies et al., the focus here is on machine code rather than Java bytecode.

Binary code clone detection. Unlike determining provenance, which aims to track software across multiple applications, the aim of binary code clone detection is to track duplicated code within the same application, or between closely related versions of the same application. The Binary Matching tool (BMAT), developed by Wang et al., was able to successfully match different versions of the same dynamically linked library (DLL) using branch prediction and code coverage information [149]. Dullien and Rolles extracted vulnerability information by performing a “diff” on the patched and unpatched versions of the same executable. Their technique relied on iteratively determining pairs of fixpoints in the call graph and control-flow graph. Basic blocks were matched using the method of small primes product to overcome the problem of instruction reordering [150]. Sæbjørnsen et al. developed a technique that performed fast matching of binary code clones using an instruction-based representation [151]. Hemel et al. used string literals to identify binary code clones. The advantage of this method was that, unlike instruction- and control-flow-based techniques, disassembly was not required [152]. Like software provenance, the approaches here are also relevant to SBD, but the goal is to match the same function across different applications. A key issue that SBD faces is compiler-related code transformations, which binary clone detection does not traditionally deal with. As a starting point for investigating search-based decompilation, the next chapter addresses the question “What constitutes a token?”, and explores a token-based approach to code search.

4 Token-based code indexing

“Sometimes even a truly awesome n-gram might just not been said yet. Just the other day I asked the Google about the sentence ‘Batman high-fived Super- man’ and the Googs found nothing! I DO NOT TRUST ANY MODEL THAT DOUBTS THE GRAMMATICALITY OF THAT SENTENCE.” – T-Rex, in a comic by the Berkeley NLP group1

4.1 Introduction

This chapter describes the design of a token-based method for code indexing and search. It first describes the motivation and design goals, followed by an exploration of the design space; finally the chosen technique is evaluated on a real-world data set. As seen in Chapter 3, code reuse is common practice in open source and commercial projects alike, fueled by market pressures, the need for highly complex software and the availability of high-quality reusable software such as the GNU C library and the Linux kernel. Currently, an auditor seeking to identify reused components in a software product, such as for GPL-compliance, has two main approaches available: source code review and software reverse engineering. However, source code availability is not always guaranteed, especially if the code was developed by a third party, and understanding machine code is at present rather challenging. This thesis proposes search-based decompilation, a new approach to reverse engineer software. By relying on open-source initiatives such as GNU, the Apache foundation, Linux and BSD distributions, and public code repositories such as Github.com and Google Code (code.google.com), identifying code reuse can be framed as an indexing and search problem. As of February 2013, the GNU C library has an estimated 1.18 million lines of code, the GNU core utilities tool suite has an estimated 57,000 lines of C, and the Linux kernel has 15.3 million lines according to the open-source project tracker Ohloh [153]. With this many lines of code on the web today, it is clear that a scalable method for indexing and search is needed. Ideally both speed and accuracy are desirable, so a reasonable approach

1http://nlp.cs.berkeley.edu/Comics.shtml

is to initially approximate using several existing methods that are fast, then optimise and evaluate for accuracy. Another viable approach is to have accuracy as a goal, then optimise for speed. The former approach is adopted here. Thus, the primary goal is performance: what is needed is a fast information retrieval scheme for executable code. Like a text search engine, having speedy response times is key for usability. The secondary goals are precision (few false positives) and recall (few false negatives). Since there is no control over the compiler or the options used, the model has to be robust with respect to different compilers and their various optimisations. This chapter makes the following contributions.

1. The problem of identifying large code bases generated by different compilers and their various optimisations is addressed, which has not been done before.

2. A prototype search engine for binary code called Rendezvous was implemented, which makes use of a combination of instruction mnemonics, control-flow subgraphs and data constants (Section 4.3). Experiments show that Rendezvous is able to achieve an 86.7% F2 measure (defined in Section 4.7) for the GNU C library 2.16 compiled with gcc -O1 and -O2 optimisation levels, and an 83.0% F2 measure for the coreutils 6.10 suite of programs, compiled with gcc -O2 and clang -O2 (Section 4.10).

3. As an early prototype, the time-to-index is about 0.4 seconds per function in the worst case, but this could be sped up further in a production system (Section 4.10.8).

4.2 Design space

Recall the evidence in Chapter 3 that code is largely copied then modified. Since code presented to the retrieval system may be modified from the original, one of the goals is to be able to identify the invariant parts of it, such as certain unique functions. This can be accomplished by approximating executable code by a statistical model. A statistical model is built by breaking up code into short chunks, or tokens, and assigning a probability to their occurrence in the reference corpus. What should the tokens consist of? For natural language, the choice of token is a word or phrase; machine code is more complex in comparison. The decision space of program features is not unexplored, and the typical choice to make is between static and dynamic methods. Static analysis was the favoured choice primarily for its relative simplicity. Dynamic analysis has the disadvantage that it requires a virtual execution environment, which is costly in terms of time and resources [154]. Static disassembly is by no means a solved problem in the general sense [155], but there are well-known techniques for it, such as linear sweep and recursive traversal [9], and they work well for a wide range of executables. Three candidate features, or representations, were considered: instruction mnemonics, control-flow subgraphs, and data constants. Instruction mnemonics refers to the machine language instructions that specify the operation to be performed, for example the mov, push and pop instructions on the 32-bit x86 architecture. A control-flow graph (CFG) is a directed graph that represents the flow of control in a program. The nodes of the graph represent the basic blocks; the edges represent the flow of control between nodes. A subgraph is a connected graph comprising a subset of nodes in the CFG. Data constants are fixed values used by the instructions, such as in computation or as a memory offset. The two most common types of constants are integers and strings. These features were chosen as they are derived directly from a disassembly. Instruction mnemonics were chosen as the simplest representation of code semantics; the control-flow graph was chosen as the simplest representation of program structure; data constants were chosen as the simplest representation of data values. A summary of the features considered is given in Table 4.1.

Feature
Instruction mnemonic n-grams [156]
Instruction mnemonic n-perms [157]
Control-flow subgraph [158]
Extended control-flow subgraph
Data constants

Table 4.1: A summary of the different features considered.

[Figure 4.1 pipeline: Executable -> (Disassemble) -> Functions -> (Tokenise) -> Tokens -> (Term-specific processing) -> Query terms.]

Figure 4.1: Overview of the feature extraction process.

4.3 Feature extraction

The feature extraction procedure involves three steps (Figure 4.1). The executable is first disassembled into its constituent functions. The disassembled functions are then tokenised, that is, broken down into instruction mnemonics, control-flow subgraphs and constants. Finally, tokens are further processed to form query terms, which are then used to construct a search query. The extraction process for the three different features will now be described in detail using a running example. Figure 4.2 shows the sorting algorithm bubblesort written in C.

4.4 Instruction mnemonics

The Dyninst binary instrumentation tool [159] was used to extract the instruction mnemonics. An instruction mnemonic differs from an opcode in that the former is a textual description, whilst the latter is the hexadecimal encoding of the instruction and is typically the first byte. Multiple opcodes may map to the same mnemonic; for instance, opcodes 0x8b and 0x89 have the same mnemonic mov. Dyninst recognises 470 mnemonics, including 64 floating point instructions and 42 SSE SIMD instructions. The executable is disassembled and the bytes making up each instruction are coalesced together as one block. Subsequently, the n-gram model is used, which assumes a Markov property, that is, token occurrences are influenced only by the n-1 tokens before them. To form the first n-gram, mnemonics 0 to n-1 are concatenated; to form the second n-gram, mnemonics 1 to n are concatenated, and so on. The n-grams are allowed to run over basic block boundaries. A basic block is an instruction sequence that does not contain incoming or outgoing control flow, and usually ends with a jump instruction.

// a: array, n: size of a
void bubblesort( int *a, int n )
{
    int i, swapped = 1;
    while( swapped ) {
        swapped = 0;
        for( i = 0; i < n-1; i++ ) {
            if( a[i] < a[i+1] ) {
                int tmp = a[i];
                a[i] = a[i+1];
                a[i+1] = tmp;
                swapped = 1;
            }
        }
    }
}

Figure 4.2: Source code for our running example: bubblesort.

One disadvantage of using mnemonic n-grams is that some instruction sequences may be reordered without affecting the program semantics. For example, the following two instruction sequences are semantically identical and yet they give different 3-grams.

mov ebp, esp              mov ebp, esp
sub esp, 0x10             movl -0x4(ebp), 0x1
movl -0x4(ebp), 0x1       sub esp, 0x10

An alternative to the n-gram model is to use n-perms [157]. The n-perm model does not take order into consideration, and is set-based rather than sequence-based. So in the above example, there will be only one n-perm that represents both sequences: mov, movl, sub. The trade-off in using n-perms, however, is that there are not as many n-perms as n-grams for the same n, and this might affect the accuracy. The instruction mnemonics and 1-, 2- and 3-grams and corresponding n-perms for bubblesort are shown in Figure 4.3.
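To make the two token types concrete, the following sketch derives 3-grams and 3-perms from the six-mnemonic prefix shown in Figure 4.3. It is illustrative only: the n-perm is realised here by sorting each window, which is one simple way of obtaining an order-insensitive token, and is not necessarily how Rendezvous implements it.

#include <stdio.h>
#include <string.h>

#define N 3   /* token size: 3-grams and 3-perms */

int main(void) {
    /* the first six mnemonics of bubblesort, as in Figure 4.3 */
    const char *mnemonics[] = { "push", "mov", "sub", "movl", "jmp", "movl" };
    int count = sizeof(mnemonics) / sizeof(mnemonics[0]);

    for (int i = 0; i + N <= count; i++) {
        char gram[64] = "", perm[64] = "";
        const char *window[N];

        /* n-gram: order-preserving concatenation of N consecutive mnemonics */
        for (int j = 0; j < N; j++) {
            strcat(gram, mnemonics[i + j]);
            if (j < N - 1) strcat(gram, " ");
            window[j] = mnemonics[i + j];
        }

        /* n-perm: the same window treated as unordered; sorting the window
           lexicographically makes reordered but equivalent sequences collide */
        for (int a = 0; a < N - 1; a++)
            for (int b = a + 1; b < N; b++)
                if (strcmp(window[a], window[b]) > 0) {
                    const char *t = window[a];
                    window[a] = window[b];
                    window[b] = t;
                }
        for (int j = 0; j < N; j++) {
            strcat(perm, window[j]);
            if (j < N - 1) strcat(perm, " ");
        }
        printf("3-gram: %-22s 3-perm: %s\n", gram, perm);
    }
    return 0;
}

Run on the two reordered sequences shown earlier, the sorted-window form yields the same 3-perm for both, which is precisely the robustness to reordering that the n-perm model trades accuracy for.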

4.5 Control-flow subgraphs

The second feature considered was control flow. To construct the control-flow graph (CFG), the basic blocks (BBs) and their flow targets are extracted from the disassembly. A BB is a continuous instruction sequence for which there are no intermediate jumps into or out of the sequence. Call instructions are one of the exceptions, as they are assumed to return.

80483c4: push ebp
80483c5: mov ebp, esp
80483c7: sub esp, 0x10
80483ca: movl -0x4(ebp), 0x1
80483d1: jmp 804844b
80483d3: movl -0x4(ebp), 0x0
80483da: movl -0x8(ebp), 0x1
80483e1: jmp 8048443
80483e3: mov eax, -0x8(ebp)
...

1-grams: push, mov, sub, movl, jmp, ...
2-grams: push mov, mov sub, sub movl, movl jmp, jmp movl, ...
3-grams: push mov sub, mov sub movl, sub movl jmp, movl jmp movl, jmp movl movl, ...
1-perms: push, mov, sub, movl, jmp, ...
2-perms: push mov, mov sub, sub movl, movl jmp, ...
3-perms: push mov sub, mov sub movl, sub movl jmp, movl jmp movl, ...

Figure 4.3: Instruction mnemonics and 1-, 2-, and 3-grams and corresponding n-perms for bubblesort (bottom) based on the first six instructions (top).

[Figure 4.4 diagram: the CFG of bubblesort (basic blocks 1-8), the two highlighted 4-node subgraphs 1-2-3-5 and 5-6-7-8, and their 4x4 adjacency matrices in canonical ordering.]

Figure 4.4: CFG of bubblesort, two k-graphs, k = 4, and their canonical matrix ordering. Each matrix is converted to a 16-bit number by concatenating its rows. Element (0,0) is bit 0; element (3,3) is bit 15. Thus the first subgraph, 1-2-3-5, corresponds to 0x1214, and the second, 5-6-7-8, to 0x1286.

Conditional instructions, such as cmov and loop, are another exception. The BBs form the nodes in the graph and the flow targets are the directed edges between nodes.

[Figure 4.5 diagram: the 3x3 adjacency matrices of 3-graphs 1-2-3 and 3-5-6 (left) and their 4x4 extended counterparts including the external node V* (right).]

Figure 4.5: Differentiating bubblesort's 3-graphs 1-2-3 and 3-5-6 with extended k-graphs. The left column shows the adjacency matrices for the k-graphs; the right column shows the corresponding extended k-graphs. The V* node represents all nodes external to the subgraph.

The CFG, which can be large, is not indexed as it is. Instead, subgraphs of size k, or k-graphs, are extracted. The approach used is similar to the one adopted by Krügel et al. [158]. Firstly, a list of connected k-graphs is generated from the CFG. This is done by choosing each block as a starting node and traversing all possible valid edges beginning from that node until k nodes are encountered. Next, each subgraph is converted to a matrix of size k by k. The matrix is then reduced to its canonical form via a pre-computed matrix-to-matrix mapping. This mapping may be computed off-line via standard tools such as Nauty [160], or by brute force since k is small (3 to 7). Each unique subgraph corresponds to a k^2-bit number. For k = 4, a 16-bit value is obtained for each subgraph. Figure 4.4 shows the CFG of bubblesort, two k-graphs, 1-2-3-5 and 5-6-7-8, and their canonical matrix forms 0x1214 and 0x1286 respectively. The canonical form in this example is the node labelling that results in the smallest possible numerical value when the rows of the matrix are concatenated. Matrix element (0,0) is the 0th bit and (3,3) the 15th bit. One shortcoming of using k-graphs is that for small values of k, the uniqueness of the graph is low. For instance, considering 3-graphs in the CFG of bubblesort, the graphs 1-2-4, 1-2-3 and 3-5-6 all produce an identical k-graph. To deal with this issue, an extension to the k-graph is proposed, called the extended k-graph. In addition to the edges solely between internal nodes, an extended k-graph includes edges that have one end point at an internal node, but have the other at an external virtual node, written as V*. This adds a row and a column to the adjacency matrix. The additional row contains edges that arrive from an external node; the extra column indicates edges with an external node as their destination. This now allows us to differentiate between the 3-graphs mentioned before.
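The canonicalisation step can be illustrated by the brute-force route mentioned above, which is feasible because k is small. In the sketch below the adjacency matrix of a 4-node subgraph is assumed for illustration (its edge set is not taken from Figure 4.4, so the resulting value will not in general be 0x1214 or 0x1286); all 4! node labellings are tried and the smallest row-concatenated 16-bit value is kept.

#include <stdio.h>
#include <stdint.h>

#define K 4

/* Encode the relabelled adjacency matrix as a K*K-bit number: element (i,j)
   becomes bit i*K + j, so (0,0) is bit 0 and (3,3) is bit 15 for K = 4,
   matching the row-concatenation described in the text. */
static uint32_t encode(int adj[K][K], const int perm[K]) {
    uint32_t v = 0;
    for (int i = 0; i < K; i++)
        for (int j = 0; j < K; j++)
            if (adj[perm[i]][perm[j]])
                v |= 1u << (i * K + j);
    return v;
}

/* Try every node relabelling and keep the numerically smallest encoding. */
static void canonical(int adj[K][K], int perm[K], int used, int depth, uint32_t *best) {
    if (depth == K) {
        uint32_t v = encode(adj, perm);
        if (v < *best) *best = v;
        return;
    }
    for (int n = 0; n < K; n++) {
        if (used & (1 << n)) continue;
        perm[depth] = n;
        canonical(adj, perm, used | (1 << n), depth + 1, best);
    }
}

int main(void) {
    /* An assumed 4-node subgraph: edges 0->1, 1->2, 1->3, 3->0 (illustrative only). */
    int adj[K][K] = {
        { 0, 1, 0, 0 },
        { 0, 0, 1, 1 },
        { 0, 0, 0, 0 },
        { 1, 0, 0, 0 },
    };
    int perm[K];
    uint32_t best = UINT32_MAX;
    canonical(adj, perm, 0, 0, &best);
    printf("canonical 16-bit form: 0x%04x\n", (unsigned)best);
    return 0;
}

In practice this matrix-to-matrix mapping is precomputed once (or obtained from a tool such as Nauty), so the per-subgraph cost at indexing time is a table lookup rather than a 4! search.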

4.6 Data constants

The motivation for using constants is the empirical observation that constants do not change with the compiler or compiler optimisation. Two types of constants were considered: 32-bit integers and strings. Integers considered included immediate operands, displacements and scalar multipliers; the strings considered were ANSI single-byte null-terminated strings. The extraction algorithm is as follows: all constants are first extracted from an instruction. Explicitly excluded are displacements associated with the stack and frame pointers, ESP and EBP, as these depend on the stack layout and hence vary with the compiler, as well as immediate operands associated with conditional branch instructions. The constants are then separated by type. Since a 32-bit instruction set is assumed, the data could either be a 32-bit integer or a pointer. An address lookup is made to determine whether the value v corresponds to a valid address in the data or code segment, and if so the data dv is retrieved. Since dv can also be an address, the procedure stops at the first level of indirection. If dv is a valid ANSI string, that is, it consists of valid ASCII characters and is terminated by a null byte, it is assigned type string; otherwise dv is not used. In all other cases, v is treated as an integer. Figure 4.6 shows the constants extracted from the usage function of the vdir program compiled with gcc default options.

804eab5: movl 0x8(esp), 0x5
804eabd: movl 0x4(esp), 0x805b8bc
804eac5: movl (esp), 0
804eacc: call 804944c
...
805b8bc: "Try ‘%s --help’..."
...

Constants: 0x5, 0x0, "Try ‘%s --help’..."

Figure 4.6: Data constants for a code snippet from the usage function of vdir.
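A sketch of the classification logic described above is given below. The helpers is_valid_address and read_ansi_string are hypothetical stand-ins for real lookups into the binary's code and data segments, and the fallback for pointers to non-string data is one reading of the text rather than a documented rule.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef enum { CONST_IGNORED, CONST_INTEGER, CONST_STRING } const_type;

/* Hypothetical stub: a real tool would consult the executable's section
   headers to decide whether v falls inside the code or data segment. */
static bool is_valid_address(uint32_t v) { (void)v; return false; }

/* Hypothetical stub: a real tool would read the bytes at v and accept them
   only if they are printable ASCII terminated by a null byte. */
static const char *read_ansi_string(uint32_t v) { (void)v; return NULL; }

static const_type classify_constant(uint32_t v,
                                    bool is_sp_or_bp_displacement,
                                    bool is_cond_branch_immediate) {
    /* ESP/EBP displacements depend on stack layout, and conditional-branch
       immediates depend on code placement, so both are excluded. */
    if (is_sp_or_bp_displacement || is_cond_branch_immediate)
        return CONST_IGNORED;

    if (is_valid_address(v)) {
        /* Follow one level of indirection only: keep the referenced data if
           it is a valid ANSI string; otherwise the referenced data is not
           used, and that case is simply ignored here. */
        return read_ansi_string(v) != NULL ? CONST_STRING : CONST_IGNORED;
    }
    return CONST_INTEGER;   /* everything else is treated as a 32-bit integer */
}

int main(void) {
    printf("%d\n", classify_constant(0x5, false, false));   /* CONST_INTEGER */
    printf("%d\n", classify_constant(0x10, true, false));   /* CONST_IGNORED */
    return 0;
}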

4.7 What makes a good model?

How do we know when we have a good statistical model? Given a corpus of executables and a query, a good model is one with high precision and recall. A true positive (tp) refers to a correctly retrieved document relevant to the query; a true negative (tn) is a correctly omitted irrelevant document; a false positive (fp) is an incorrectly retrieved irrelevant document; and a false negative (fn) is a missing but relevant document. The precision and recall are defined as

precision = tp / (tp + fp)        recall = tp / (tp + fn)

In other words, a good model will retrieve many relevant documents, omitting many other irrelevant ones. A measure that combines both precision and recall is the F measure, defined as the following.

Fβ = (1 + β²) · (precision · recall) / (β² · precision + recall)

Figure 4.7: Setup for indexing and querying: the corpus of binaries retrieved from the web is processed by the indexer into a term-frequency map and an inverted index, while an unknown executable is decomposed into a query that the query engine evaluates against the index to return ranked search results.

When β = 1, the formula is known as the F1 measure and both precision and recall are weighted equally. Here, however, the F2 measure is favoured:

F2 = 5 · (precision · recall) / (4 · precision + recall)

The reason that the F2 measure is used and not the F1 measure is that retrieving relevant documents is the main priority; false positives are less of a concern. Thus recall is of higher priority than precision, and F2 weights recall twice as much as precision.
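As a brief worked example of this weighting: with precision 0.5 and recall 0.8, F2 = 5 · (0.5 · 0.8) / (4 · 0.5 + 0.8) = 2.0/2.8 ≈ 0.71, whereas swapping the two values gives F2 = 2.0 / (4 · 0.8 + 0.5) = 2.0/3.7 ≈ 0.54; the same pair of scores is rewarded more when the larger of the two is the recall.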

4.8 Indexing and querying

What we have discussed so far covers the extraction of terms from an executable. In this section, the process of incorporating the tokens into a standard text-based index is described, along with how queries are made against this index so as to give meaningful search results. Figure 4.7 shows a summary of the indexing and querying process. Since there are 52 different alphabetic symbols, a 32-bit integer can be encoded as a 6-letter word for an alphabetic indexer.

The indexing process is straightforward: the corpus of binaries retrieved from the web is first processed to give a global set of terms Sglobal. The terms are processed by an indexer which produces two data mappings. The first is the term frequency mapping, which maps a term to its frequency in the index; the second is the inverted index, which maps a term to the list of documents containing that term.

Two query models were considered: the Boolean model (BM) and the vector space model (VSM). BM is a set-based model and the document weights are assigned 1 if the term occurs in the document, or 0 otherwise. Boolean queries are formed by combining terms with Boolean operators such as AND, OR and NOT. VSM is distance-based, and two documents are similar if the angle between their weight vectors is small. The weight vectors are computed via the normalised term frequencies of all terms in the documents. The model used is a combination of the two: documents are first filtered via the BM, then ranked and scored by the VSM.

Given an executable of interest, it is firstly decomposed into a set of terms and encoded as strings of alphabetic symbols to give a set of terms S. For example, the mnemonic sequence push, mov, push, push, sub corresponds to the 4-grams 0x73f97373 and 0xf97373b3, which encode as the query terms XvxFGF and baNUAL. A Boolean expression Q is then constructed from the set of terms S. Unlike a typical user-entered text query, Q may be of the order of thousands of terms long or, conversely, so short as to be too common. Three strategies were employed to deal with these two issues, namely term de-duplication, padding and unique term selection.

Term de-duplication is an obvious strategy that reduces the term count of the query. For a desired query length lQ, the first term t0 is selected and other occurrences of t0 up to length 2lQ are removed. This process is repeated until lQ terms are reached; the (lQ + 1)th and subsequent terms are truncated.

Another problem arises if Q is too short, as it may then result in too many matches. To deal with this issue, terms of high frequency that are not in S are added, negated with the logical NOT. For example, if lQ is 3 and our query has two terms, A and B, the third term NOT C is added, where C is the term with the highest term frequency in Sglobal. This eliminates matches that contain Q along with other common terms.

At the end of these steps, what is obtained is a bag of terms for each term category. The terms are firstly concatenated with AND, e.g. XvxFGF AND baNUAL, which constructs the most restrictive query. The rationale for using AND first is so that the query engine will find an exact match if one exists and return that single result purely based on the BM. This query is sent to the query engine, which in turn queries the index and returns the results in the form of a ranked list. If this query returns no results, a second query is constructed with unique term selection.
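The de-duplication and padding steps can be pictured with a small sketch. The function below builds the initial AND query as a plain string; the term encoding, the ordering of Sglobal by frequency and the CLucene query API are all outside its scope, and the names used here (build_and_query, global_most_frequent) are illustrative rather than part of the implementation.

#include <set>
#include <string>
#include <vector>

std::string build_and_query(const std::vector<std::string> &terms,
                            const std::vector<std::string> &global_most_frequent,
                            size_t lQ) {
    // Term de-duplication: keep the first occurrence of each term,
    // truncating once lQ terms have been collected.
    std::vector<std::string> q;
    std::set<std::string> seen;
    for (const std::string &t : terms) {
        if (q.size() >= lQ) break;
        if (seen.insert(t).second) q.push_back(t);
    }
    // Padding: if the query is too short, add the most frequent global
    // terms not already present, negated with NOT.
    std::vector<std::string> padding;
    for (const std::string &t : global_most_frequent) {
        if (q.size() + padding.size() >= lQ) break;
        if (!seen.count(t)) padding.push_back("NOT " + t);
    }
    std::string expr;
    for (const std::string &t : q)
        expr += (expr.empty() ? "" : " AND ") + t;
    for (const std::string &t : padding)
        expr += " AND " + t;
    return expr;   // e.g. "XvxFGF AND baNUAL AND NOT aaaaaa"
}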
The aim of unique term selection is to choose terms with low document frequency, that is, rare terms. This is done if the size of S is larger than the maximum query length lQ. The document frequency is defined as the number of documents in the corpus in which a term occurs. Whether a term in S remains in or is removed from the query is determined by the document frequency threshold, dfthreshold, the maximum allowed document frequency for a term to be included in the query; in other words, only terms whose document frequency is below dfthreshold are included. If dfthreshold is set too low, not enough terms will make it through; conversely, if it is set too high, too many will be included in Q. The resulting terms are then concatenated with OR to form Q, e.g. XvxFGF OR baNUAL. This second OR query deals with situations where an exact match does not exist, and the VSM is relied upon to locate the closest match.

Search results are ranked according to the default scoring formula used by the open source CLucene text search engine. Given a query Q and a document D, the similarity score function is defined as the following.

Score(Q, D) = coord(Q, D) · C · (V(Q) · V(D)) / |V(Q)|

where coord is a score factor based on the fraction of all query terms that a document contains, C is a normalisation factor, V(Q) · V(D) is the dot product of the weighted vectors, and |V(Q)| is the Euclidean norm of the query vector. Comparing the scoring function with the cosine similarity measure:

cos(θ) = (V(Q) · V(D)) / (|V(Q)| |V(D)|)

the |V(D)| term is not used on its own in the scoring function, as removing document length information affects the performance. Instead, a different document length normalisation factor is used; in the equation above this factor is incorporated into C [161].

4.9 Implementation

The tasks of disassembly and of extracting n-grams, n-perms, control-flow k-graphs, extended k-graphs and data constants were performed using the open source dynamic instrumentation library Dyninst version 8.0. The Nauty graph library [160] was used to convert k-graphs to their canonical form. The tasks of indexing and querying were performed using the open source text search engine CLucene 2.3.3.4.

The term frequency map used in unique term selection was implemented as a Bloom filter [162], since only a membership test is required; in other words, the Bloom filter consisted of all terms with document frequency below dfthreshold. The maximum term length was fixed at six, which is sufficient to encode a 32-bit integer using the 52 upper- and lower-case letters. Instruction mnemonics were encoded using 8 bits, with higher-order bits truncated. Mnemonic n-grams up to n = 4 were considered. Control-flow k-graphs and extended k-graphs were encoded as 32-bit integers. Integer constants were by definition 32-bit integers; strings were truncated to six characters. A total of 10,500 lines of C++ were written for the implementation of disassembly and feature extraction, and 1,000 lines of C++ for indexing and querying.
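The six-letter encoding can be illustrated with a short sketch. Since 52^6 ≈ 1.98 × 10^10 comfortably exceeds 2^32, six letters suffice; the exact mapping used by Rendezvous is not spelled out in the text, so encode_base52 below is an assumed base-52 scheme for illustration only.

#include <cstdint>
#include <string>

std::string encode_base52(uint32_t v) {
    static const char alphabet[] =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    std::string word(6, 'a');
    for (int i = 0; i < 6; ++i) {   // least-significant digit first
        word[i] = alphabet[v % 52];
        v /= 52;
    }
    return word;
}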

4.10 Evaluation

This section describes the experiments that were performed to answer the following questions.

What is the optimal value of dfthreshold?

What is the accuracy of the various code features?

What is the effect of compiler optimisation on accuracy?

What is the effect of the compiler on accuracy?

What is the time taken for binary code indexing and querying?

The first evaluation data set consisted of 2,706 functions from the GNU C library version 2.16, comprising 1.18 million lines of code. The functions were obtained by compiling the library under gcc -O1 and gcc -O2, referred to as GCC1 and GCC2 respectively. All initialisation and finalisation functions generated by the compiler, such as init, fini and i686.get_pc_thunk.bx, were excluded. The evaluation on this set, referred to as the glibc set, consisted of two experiments. In the first experiment, GCC1 is indexed and queries are formed using GCC1 and GCC2. This procedure is then repeated with GCC2 indexed in the second experiment, and the results from both experiments are summed to obtain the combined precision, recall and F2 measures.

The second data set was the coreutils 6.10 suite of tools, or the coreutils set, which was compiled under gcc and clang default (-O2) configurations. This data set contained 98 binaries and 1,205 functions, comprising 78,000 lines of code. To obtain the precision, recall and F2 values, one set of functions is similarly indexed and the other used to form queries, as with the glibc set. All experiments were carried out on an Intel Core 2 Duo machine running Ubuntu 12.04 with 1 GB of RAM.

The results at a glance are summarised in Table 4.2. The composite model comprising 4-grams, 5-graphs and constants gave the best overall performance for both data sets. This suggests that n-grams, k-graphs and constants are independent, and thus all three are useful as features.

Model                                        glibc F2   coreutils F2
Best n-gram (4-gram)                         0.764      0.665
Best k-graph (5-graph)                       0.706      0.627
Constants                                    0.681      0.772
Best mixed n-gram (1+4-gram)                 0.777      0.671
Best mixed k-graph (5+7-graph)               0.768      0.657
Best composite (4-gram/5-graph/constants)    0.867      0.830

Table 4.2: Results at a glance.

4.10.1 Optimal df threshold

Recall that the query model is based on BM first, then ranked by VSM. However, the input terms to the BM can be further influenced by varying df threshold, which determines the terms to include in the query.

To investigate the optimal value of dfthreshold, an experiment was performed using glibc and 4-grams as the term type. The dfthreshold was varied and the resulting performance measured. The ranked list was restricted to at most five results, that is, if the correct match occurred outside of the top five results, it was treated as a false negative. Table 4.3 shows the results of this experiment. For example, a dfthreshold of 1 means that only terms having a document frequency less than or equal to 1 were included, and a dfthreshold of ∞ means that all terms were included.

Although there is an initial increase from dfthreshold 1 to dfthreshold 4, this was not sustained by increasing the value of dfthreshold further, and the F2 changes were insignificant. Since there was no gain in varying dfthreshold, it was fixed at ∞ in all experiments.

dfthreshold   Precision   Recall   F2
1             0.202       0.395    0.331
2             0.177       0.587    0.401
3             0.165       0.649    0.410
4             0.161       0.677    0.413
5             0.157       0.673    0.406
6             0.160       0.702    0.418
7             0.159       0.709    0.419
8             0.157       0.708    0.415
9             0.157       0.716    0.418
10            0.155       0.712    0.414
11            0.151       0.696    0.405
12            0.152       0.702    0.408
13            0.153       0.705    0.410
∞             0.151       0.709    0.408

Table 4.3: Performance using various values of df threshold on the glibc set using 4-grams.

Figure 4.8: The F2 measures for n-grams and n-perms (glibc data set).

4.10.2 Comparison of n-grams versus n-perms

Firstly, the accuracy of n-grams was compared with n-perms, with n taking values from 1 to 4; larger values of n were not considered. The glibc test set was used to ascertain the accuracy of these two methods in the presence of compiler optimisations. The results are shown in Figure 4.8. The overall best F2 measure was 0.764, obtained using the 4-gram model. The 1-gram and 1-perm models were identical by definition, but the n-gram model out-performed n-perms for n > 1. One explanation for this difference is that the n-perm model produced too few terms, so that irrelevant results affected recall rates; this is evident from the number of unique terms generated by the two models (Table 4.4). The 2-gram model generated 1,483 unique terms whilst the 4-perm model generated only 889.

n    n-gram   n-perm
1    121      121
2    1,483    306
3    6,337    542
4    16,584   889

Table 4.4: The number of unique terms for n-grams and n-perms (glibc data set).

Figure 4.9: The precision, recall rates and the F2 measures for (a) 2-grams and (b) 4-perms of instruction mnemonics (glibc data set).

Next, 2-grams and 4-perms were analysed in more detail. The precision and recall rates were varied by adjusting the threshold of the r highest ranked results obtained. For example, if this threshold was 1, only the highest ranked result returned by the query engine was considered; the rest of the results were ignored. The value of r was varied from 1 to 10. Figures 4.9a and 4.9b show that the maximum F2 measure obtained for 2-grams was 0.684 at r = 1, and the precision and recall rates were 0.495 and 0.756 respectively. The corresponding maximum F2 value was 0.585 for 4-perms also at r = 1.

There is an observed tension between precision and recall. As r increases, the number of successful matches increases causing the recall to improve, but this also causes the false positives to increase, reducing precision.

The second larger coreutils data set was similarly tested with the n-gram and n-perm models, with n = 1, 2, 3, 4. Similar observations were made—n-grams out-performed n-perms for all n.

4.10.3 Mixed n-gram models

Next, mixed n-gram models were considered to see if combining n-grams produced better results. Considering pairwise combinations of the 1-gram to 4-gram models gives a total of six possibilities. These combined models were tested on the glibc and coreutils sets and the results are shown in Table 4.5. The two highest scores were obtained using 1- and 4-grams (1+4-gram) and 2- and 4-grams (2+4-gram) for the two data sets. This was surprising since 1-grams generated a small fraction of terms, e.g. 121, compared to 4-grams, e.g. 16,584. Also notable was the fact that almost all mixed n-gram models performed better than the single n-gram models.

Model      glibc   coreutils
1+2-gram   0.682   0.619
1+3-gram   0.741   0.649
1+4-gram   0.777   0.671
2+3-gram   0.737   0.655
2+4-gram   0.777   0.675
3+4-gram   0.765   0.671

Table 4.5: Performance of mixed n-gram models by F2 measure.

glibc
           k-graph                            extended k-graph
           Precision   Recall   F2            Precision   Recall   F2
3-graph    0.070       0.133    0.113         0.022       0.062    0.046
4-graph    0.436       0.652    0.593         0.231       0.398    0.348
5-graph    0.730       0.700    0.706         0.621       0.600    0.604
6-graph    0.732       0.620    0.639         0.682       0.622    0.633
7-graph    0.767       0.609    0.635         0.728       0.610    0.631

coreutils
           k-graph                            extended k-graph
           Precision   Recall   F2            Precision   Recall   F2
3-graph    0.110       0.200    0.172         0.042       0.080    0.068
4-graph    0.401       0.586    0.537         0.218       0.360    0.318
5-graph    0.643       0.623    0.627         0.553       0.531    0.535
6-graph    0.617       0.527    0.543         0.660       0.602    0.613
7-graph    0.664       0.560    0.578         0.663       0.566    0.583

Table 4.6: Results for k-graphs and extended k-graphs.

4.10.4 Control-flow k-graphs versus extended k-graphs

In the next set of experiments, control-flow k-graphs and extended k-graphs for k = 3, 4, 5, 6, 7 were evaluated. The results are summarised in Table 4.6. The model which gave the highest F2 was 5-graphs for the glibc data set at 0.706, and also for the coreutils data set at 0.627. This consistency was surprising given that there were thousands of different functions being considered. The second observation was that the performance of extended k-graphs was lower than that of regular k-graphs. This difference was more marked for glibc than for coreutils, at 7 and 1.4 percentage points respectively. The implication is that k-graphs were in fact a better feature than extended k-graphs.

4.10.5 Mixed k-graph models

As with n-grams, mixed k-graph models were considered as a possible way to improve performance over single k-graph models. The mixed models were limited to a combination of at most two k-graph models, giving a total of ten possibilities.

Model         glibc F2   coreutils F2
3+4-graphs    0.607      0.509
3+5-graphs    0.720      0.630
3+6-graphs    0.661      0.568
3+7-graphs    0.655      0.559
4+5-graphs    0.740      0.624
4+6-graphs    0.741      0.624
4+7-graphs    0.749      0.649
5+6-graphs    0.752      0.650
5+7-graphs    0.768      0.657
6+7-graphs    0.720      0.624

Table 4.7: Results for mixed k-graph models.

            Precision   Recall   F2
glibc       0.690       0.679    0.681
coreutils   0.867       0.751    0.772

Table 4.8: Results of using data constants to identify functions in the glibc and coreutils data sets.

Again, the best mixed model was the same for both data sets. The 5+7-graph model gave the best F2 value for both glibc (0.768) and coreutils (0.657) (Table 4.7). Lastly, the mixed k-graph models performed better than the single k-graph models.

4.10.6 Data constants

Table 4.8 shows the results of using the third feature, data constants, to match functions compiled using different optimisations (glibc) and different compilers (coreutils). The performance was better for coreutils, at 0.772, compared to glibc at 0.681. One possible explanation for this difference is that none of the functions in the glibc set contained strings, whilst 889 functions, or 40.3% of the functions in the coreutils set, did.

4.10.7 Composite models

Building upon the observation that mixed models were more successful than single models, the last set of models considered were composite ones, i.e. models that combined n-grams, k-graphs and constants. The terms from each individual model, for example 4-grams, were extracted and then concatenated to form a composite document for each function. Two composite models were considered: the first was made up of the highest performing single model from each of the three categories; the second was made up of the highest performing mixed model from each category, except for constants. Thus, the models tested for the glibc set comprised 4-gram/5-graph/constants and 1-gram/4-gram/5-graph/7-graph/constants. The corresponding models for the coreutils set were 4-gram/5-graph/constants and 2-gram/4-gram/5-graph/7-graph/constants. The results are listed in Table 4.9.

Overall, the best composite model out-performed the best mixed models, giving an F2 score of 0.867 and 0.830 for glibc and coreutils respectively. The highest scores for mixed models were 0.777 (1-gram/4-gram) and 0.772 (constants). One observation was that including more models did not necessarily result in better performance. This was evident from the fact that the composite models with three components fared better than the model with five.

In addition to maximising F2, the recall rates for r = 10 were of interest, since users of the search engine are not expected to venture beyond the first page of results. Considering only the top ten ranked results, the recall rates were 0.925 and 0.878 for glibc and coreutils respectively.

Of the 342 false negatives from the glibc set, 206 were found to be small functions, having 6 instructions or fewer. Since Rendezvous uses a statistical model to analyse executable code, it is understandable that it has problems differentiating between small functions. One of the largest functions in this group was getfsent from glibc (Figure 4.10). The compiled executables, getfsentO1 and getfsentO2 respectively, differed significantly due to several factors. Firstly, the function fstab_fetch was inlined, causing the mov and call instructions in getfsentO1 to be expanded to eight instructions. Secondly, there were two instruction substitutions: the instruction mov eax, 0x0 in getfsentO1 was substituted by the xor eax, eax instruction, which uses two bytes instead of five, and the call to fstab_convert was substituted by an unconditional jump. In the latter substitution, the call was assumed to return, whereas the jump did not; this was evident from the fact that the stack was restored immediately prior to the jump. This altered the control-flow graph, since the edge from the jmp instruction to the final BB was no longer present in getfsentO2. Thirdly, there were two occurrences of instruction reordering: the first was the swapping of the second and third instructions of both functions; the second was the swapping of the test and mov instructions following the call to fstab_init. The sum of these changes meant that there were no 3-grams, 4-grams nor data constants in common between the two functions, and the two 4-graphs did not match. In such cases, the matching could benefit from a more accurate form of analysis, such as symbolic execution [163]. This is left to future work.

struct fstab *getfsent( void )
{
    struct fstab_state *state;

    state = fstab_init(0);
    if( state == NULL )
        return NULL;
    if( fstab_fetch(state) == NULL )
        return NULL;
    return fstab_convert(state);
}

Figure 4.10: Source code of getfsent.

Data set    Model                                       Precision   Recall   F2
glibc       4-gram/5-graph/constants                    0.870       0.866    0.867
            1-gram/4-gram/5-graph/7-graph/constants     0.850       0.841    0.843
            4-gram/5-graph/constants (r = 10)           0.118       0.925    0.390
coreutils   4-gram/5-graph/constants                    0.835       0.829    0.830
            2-gram/4-gram/5-graph/7-graph/constants     0.833       0.798    0.805
            4-gram/5-graph/constants (r = 10)           0.203       0.878    0.527

Table 4.9: Results of the composite models. Where indicated, the variable r is the number of ranked results considered; otherwise r = 1.

                                  Average (s)   Worst (s)
Feature      n-gram               46.684        51.881
             k-graph              110.874       114.922
             constants            627.656       680.148
             null                 11.013        15.135
Query construction                6.133         16.125
Query                             116.101       118.005
Total (2,410 functions)           907.448       981.081
Total per function                0.377         0.407

Table 4.10: Average and worst-case timings for coreutils set.

4.10.8 Timing

The final set of experiments was performed to determine the time taken for a binary program to be disassembled, for the terms to be extracted and for the query to return with the search results. The coreutils set was timed for this experiment, and the timing included both the gcc-compiled code and the clang-compiled code to give a total of 2,410 functions. Table 4.10 shows the average case as well as the worst-case timings for each individual phase. The “null” row indicates the time taken for Dyninst to complete the disassembly without performing any feature extraction. The total time was computed by summing the time taken to extract each of the three features. Strictly speaking, this time is an overestimate since the binary was disassembled two more times than was necessary in practice. On the other hand, no end-to-end timings were done to take into consideration the computational time required for the front-end system, so the timings are an approximation at best.

It was found that a significant portion of the time was spent extracting constants from the disassembly. The reason is that this procedure is currently made up of several different tools and scripts, and there are plans to streamline it in future.

4.11 Discussion

4.11.1 Limitations

An important assumption made in this chapter is that the binary code in question is not deliberately obfuscated. The presence of code packing, encryption or self-modifying code would make disassembly, and therefore feature extraction, difficult to perform. In practice, Rendezvous may require additional techniques, such as dynamic code instrumentation and symbolic execution, to analyse heavily obfuscated executables. However, as mentioned, static analysis was considered primarily for its performance. Including dynamic methods in Rendezvous is left to future work.

4.11.2 Threats to validity

Threats to internal validity include the limited software tested and the limited number of compilers and compiler optimisation levels used. The high accuracy observed may be due to a limited sample size of software analysed, and future work will involve analysing larger code bases. It is possible that the gcc and clang compilers naturally produce similar binary code. Likewise, the output of gcc -O1 and gcc -O2 could be naturally similar. Optimisation level gcc -O3 was not considered.

The most important threat to external validity is the assumption that there is no active code obfuscation involved in producing the code under consideration. Code obfuscation, such as the use of code packing and encryption, may be common in actual binary code in order to reduce code size or to prevent reverse engineering and modification. Such techniques may increase the difficulty of disassembly and of identifying function boundaries.

4.11.3 Mnemonic n-grams and basic block boundaries

In this chapter, n-grams were allowed to run over BB boundaries. The implication is that functions which have short BBs, e.g. three instructions or fewer, might not be correctly detected using 4-grams alone, especially if there is BB reordering during compiler optimisation. This could explain why 1+4-grams performed better than the 4-gram model. The alternative approach is to disallow n-grams from running over BB boundaries. However, this comes with the penalty of having fewer unique n-grams available. Further experiments are needed to determine whether this alternative approach does indeed perform better.

4.12 Related work

The line of work that is most closely related is that of binary code clone detection. Sæbjørnsen et al. [151] worked on detecting "copied and pasted" code in Windows XP binaries and the Linux kernel by constructing and comparing vectors of features comprising instruction mnemonics, and exact and normalised operands, located within a set of windows in the code segment. The main goal was to find large code clones within the same code base using a single compiler, and hence their method did not need to address issues with multiple compilers and their optimisations. In contrast, since the goal of Rendezvous is to do binary code clone matching across different code bases, it was necessary to address the compiler optimisation problem, and the proposed technique has been demonstrated to be sufficiently accurate to be successful. Hemel et al. [152] looked purely at strings in the binary to uncover code violating the GNU public license. The advantage of their technique was that it eliminated the need to perform disassembly. However, as the experiments show, between 60–70% of functions were identified using only string constants. Other approaches include directed acyclic graphs [156], program dependence graphs [164] and program expression graphs [165]. None of these approaches were considered, as their computational costs are higher than those of the features Rendezvous currently uses.

A closely related area is source code clone detection and search, where techniques may be divided into string-based, token-based, tree-based and semantics-based methods [166]. Examples include CCFinder [167], CP-Miner [168], MUDABlue [169], CLAN [170] and XIAO [171]. Rendezvous, however, is targeted at locating code in binary form, but borrows some inspiration from the token-based approach.

A related field is malware analysis and detection, whose goal is to classify a binary as being malicious, or as belonging to a previously known family. Code features that have been studied include byte values [172], opcodes [173], control-flow subgraphs [158], call graphs [150], as well as run-time behavioural techniques [174]. Even though Rendezvous borrows techniques from this field, the aim is to do more fine-grained analysis and identify binary code at the function level. At the moment, only code obfuscation up to the level of the compiler and its different optimisations is considered.

5 Perturbation analysis

“Does the flap of a butterfly’s wings in Brazil set off a tornado in Texas?” – Title of chaos theory pioneer Edward Lorenz’s 1972 talk

Search-based decompilation is premised on being able to match machine code with its functional equivalent in a reference corpus. Token-based matching (Chapter 4) is one approach to accomplishing this task. However, a problem arises when different compilers or compiler options are used, resulting in a failure to match syntactically different but semantically equivalent instruction sequences, such as xor EAX, EAX and mov EAX, 0, or to deal with basic block reordering. Performing symbolic execution with theorem proving, as advocated by Gao et al. [175] in a tool called BinHunt, is one possible solution. BinHunt first tries matching basic blocks for semantic equivalence by generating the symbolic expressions for all basic-block output variables. Since the compiler may use different registers and variables, BinHunt tries all possible permutations of symbolic expressions to obtain a match, if one exists. The matching strength of two basic blocks is defined to be 1.0 if the matching permutation uses the same registers, and 0.9 if different registers are used; this is to deal with context mismatch. The second mechanism implemented by BinHunt is a backtracking-based subgraph isomorphism that relies on the results of basic-block matching. However, BinHunt does not differentiate between different algorithms implementing the same functionality, e.g. bubblesort and quicksort, or between independently developed pieces of code that are not the result of copy and paste.

Performing structural analysis is important for search-based decompilation, since one hardly expects an implementation of bubblesort to decompile to quicksort, or vice versa. Existing approaches to binary code clone detection assume that the underlying application is closely related [149, 150, 175], or that the clone is a result of copy and paste [176, 152]. The rest of this chapter makes the following contributions.

This chapter addresses the problem of differentiating between different implementations of the same algorithm, which has not been addressed before. It describes perturbation analysis, a technique that can identify structural similarity and is able to distinguish between 11 subtle variants of 5 sorting algorithms (Section 5.3).


f1: mov  ECX, EAX        g1: int g(int x)
f2: add  ECX, 2          g2: {
f3: imul EAX, ECX        g3:     return x*x + 2*x + 1;
f4: add  EAX, 1          g4: }
f5: retn

Figure 5.1: Our running example. Assembly program f, which computes x(x + 2) + 1, is the compiled version of the C program g, which computes x² + 2x + 1.

A source-binary indexing scheme based on test-based indexing (TBI, Section 5.4) and perturbation-based indexing (PBI, Section 5.8) is proposed. Experiments conducted on 100 functions in coreutils 6.10, compiled with 5 different configurations (gcc -O2, gcc -O3, gcc -Os, clang -O2 and clang -O3), show that the total precision and recall achieved are 68.7% and 95.2% respectively for TBI and PBI combined (Section 5.10.4).

5.1 Overview

Consider the equivalent functions, f and g, in Figure 5.1 and their corresponding listings (gcc default options were used). The argument x is stored in register EAX at the start of f. It is not immediately obvious that the two functions are semantically equivalent. Indeed, different compilation optimisations may, and often do, yield significant syntactic differences in the binary. The goal is to match the source code of g to that of the binary code of f.

There are at least three possible approaches to accomplishing this. The first method is to compile the source code down to binary form and do the comparison at the machine code level. The second method is to decompile the binary f up to the higher-level language of g. A third approach is to do both: compile g and decompile f to a suitable intermediate representation. The second approach was chosen since, as a programming language, C has been, and continues to be, widely used in the compiler, decompiler and software testing communities. This is evident from the fact that tools continue to be written for it, many more than those targeting binaries and intermediate representations. For the purposes of this chapter, it is assumed that the source code to match against is in C, since the tools used are written to analyse C, as are many of the functions in the reference corpus, including the GNU C library and the coreutils tool set.

The first step in the process is obtaining a disassembled version of the binary. The disassembly is then lifted to C via an assembly-to-C translator. Two source-binary matching techniques are used: test-based indexing (TBI) and perturbation-based indexing (PBI). TBI tests two functions for input-output congruence (Section 5.4); PBI, which makes use of perturbation analysis (Section 5.6), tests two functions for structural similarity (Section 5.8). We first discuss the implementation of the assembly-to-C translator.

The first step in the process is obtaining a disassembled version of the binary. The disas- sembly is then lifted to C via an assembly-to-C translator. Two source-binary matching techniques are used: test-based indexing (TBI) and perturbation-based indexing (PBI). TBI tests two functions for input-output congruence (Section 5.4); PBI, which makes use of perturbation analysis (Section 5.6), tests two functions for structural similarity (Section 5.8). We first discuss the implementation of the assembly-to-C translator. 5. Perturbation analysis 65

unsigned int eax;
bool cf, pf, af, zf, sf, of;
...
cf = (eax + 2) < eax;
pf = parity_lookup[ (char)(eax + 2) ];
af = (((char)eax ^ 2 ^ (char)(eax + 2)) & 0x10) != 0;
zf = (eax + 2) == 0;
sf = (eax + 2) >> 31;
of = ( ~(eax ^ 2) & (2 ^ (eax + 2)) & 0x80000000 ) != 0;
eax = eax + 2;

Figure 5.2: The assembly-to-C translation of add eax, 2.

5.2 Assembly-to-C translation

The aim of assembly-to-C translation is solely to express the semantics of the disassembled instructions in valid C syntax, since the analysis is done in C. Readability is not a concern here, so it is sufficient to perform a direct per-instruction translation. The CPU registers, flags and the stack are treated as local variables. A 32-bit Intel x86 CPU is assumed, and each instruction in the function is decoded to determine three things: the operands, the operating width, i.e. 8, 16 or 32 bits, and the operations to perform on the simulated CPU and stack. Memory locations are dereferenced as indicated by the operating width. For example, mov AL, [ECX] is translated as al = *(char *)ecx;. The output of the translation is a sequence of C statements implementing the semantics one instruction at a time. For example, add EAX, 2 is translated as shown in Figure 5.2. From the translation, we can see that the carry flag (cf) is set when an unsigned overflow occurs, while the overflow flag (of) indicates a signed overflow, which for this instruction happens when eax is 0x7FFFFFFE or 0x7FFFFFFF.

The layout of the basic blocks in the disassembled function is preserved in the C version. Jumps in the program are translated as goto statements; calls are implemented as calls in C syntax, with the appropriate registers or stack locations as input parameters and return variables. The targets of indirect jumps and calls, e.g. jmp [EAX], need to be determined before translation, and at the moment this is done manually. External calls to libraries and system calls are currently unsupported.
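As a further example of the scheme, a compare followed by a conditional jump might be rendered as below. This is an assumed output shape rather than the translator's verbatim output; the label name loc_804f00a is hypothetical, and all flags other than zf are omitted for brevity.

int translated_block(unsigned int eax, unsigned int ebx) {
    bool zf;
    /* cmp eax, ebx */
    zf = (eax == ebx);                 // other flag updates omitted
    /* jnz loc_804f00a */
    if (!zf) goto loc_804f00a;
    /* fall-through basic block ... */
    return 0;
loc_804f00a:
    /* translated code for the jump target ... */
    return 1;
}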

5.3 Source-binary matching for functions

Before describing the algorithm for source-binary matching, we briefly define some nota- tion.

Definition (State) A state, s, is represented by the tuple

s = ⟨a1, ..., am, x1, ..., xn, v⟩

where the ai are the input parameters, the xj are the program variables, including stack and global variables, and v is the output variable. A valid state s is one which belongs to a global, finite, non-empty set of states S. At the initial state, s0, only the ai are defined. At a final state, sf, v is defined.

Definition (Program) A program is a finite set of pairs of the form (Ci, Ti) where i is the location and Ci is the computation to perform on the current state si−1 to generate a new state si = Ci(si−1). Ti is a predicate that determines (Ci+1, Ti+1) based on si.

Definition (Path) A path is a finite sequence of (Ci, Ti) pairs constituting a legal path through the program, applying successive transformations to the program state. For example,

sk = Ck ∘ Ck−1 ∘ ... ∘ C2 ∘ C1(s0)

where ∘ denotes the composition of functions.

For brevity only the output variable defined in the post-computation state sk is listed.

Definition (Congruence) For two procedures f and g, we say f and g are congruent if all terminating paths of f and g starting from the same global input state s0 return the same global output state, i.e. sf^f = sf^g.

5.4 Test-based indexing

The problem of determining congruence can be reduced to one of code coverage testing: simply call the two procedures with equal inputs, and assert that they return the same output. Existing test cases can provide this input; alternatively, automated software testing tools [177, 178, 179] can help. A possible third approach is to perform random testing, in what is termed in this chapter test-based indexing. The main premise is that, given random input, a function should produce an output that is unique regardless of the implementation.

The first step is to infer the prototype of a function f. Here it is assumed that the function prototype is known; in practice, there are methods to perform function prototyping, such as reaching definitions [180] and liveness analysis [11]. The prototype determines the test harness, which is used to initialise the state of the function with a pre-determined pseudo-random input I and retrieve the outputs O. The test harness is itself a C program that calls f as a sub-routine. The arguments are initialised according to their type. For instance, a char is assigned a random 8-bit number, a short a random 16-bit number and so on; pointer types such as char * and char ** are assigned a pointer to an appropriately populated data structure. The void * type is ambiguous, and is treated as a char *. A compound data type, or struct, is initialised according to its constituent types. Similarly, the return value is treated according to its type.

Secondly, f(I) is executed with the help of the test harness to obtain O. To produce a test-based index, an (I, O) pair is hashed to give H(I, O). The procedure for retrieval is then a matter of matching hashes with the reference index, which can be done efficiently.

f:
                       s0^f = ⟨eax0 = a⟩
f1: mov ECX, EAX       s1^f = ⟨ecx1 = a⟩
f2: add EAX, 2         s2^f = ⟨eax2 = a + 2⟩
f3: imul EAX, ECX      s3^f = ⟨eax3 = a(a + 2)⟩
f4: add EAX, 1         s4^f = ⟨eax4 = a(a + 2) + 1⟩
f5: retn               sf^f = ⟨v = a(a + 2) + 1⟩

g:
                       s0^g = ⟨x0 = a⟩
g1: x++;               s1^g = ⟨x1 = a + 1⟩
g2: return x*x;        sf^g = ⟨v = (a + 1)(a + 1)⟩

h:
                       s0^h = ⟨x0 = a⟩
h1: x = x*(x+2);       s1^h = ⟨x1 = a(a + 2)⟩
h2: x++;               s2^h = ⟨x2 = a(a + 2) + 1⟩
h3: return x;          sf^h = ⟨v = a(a + 2) + 1⟩

k:
                       s0^k = ⟨x0 = a⟩
k1: y = 1;             s1^k = ⟨y1 = 1⟩
k2: y += x*(x+2);      s2^k = ⟨y2 = a(a + 2) + 1⟩
k3: return y;          sf^k = ⟨v = a(a + 2) + 1⟩

Figure 5.3: Functions f, g, h and k are congruent as they have common input and output states. Functions f and h share a common intermediate state at (s3^f, s1^h). However, h and k do not share any intermediate states even though they comprise the addition of otherwise identical terms.

We define the variable nTBI, which determines the number of hashes H(I, O1), H(I, O2), ..., H(I, OnTBI) to use as an index.
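As an illustration of how an index entry might be produced, the sketch below assumes a function whose inferred prototype takes an int array and a length, fills the array from a fixed-seed pseudo-random generator, runs the function, and hashes the serialised input and output together. The prototype, the serialisation and the use of std::hash are simplifying assumptions; the actual implementation derives its harness from gcc -aux-info prototypes and uses MD5 hashes.

#include <cstdint>
#include <functional>
#include <random>
#include <string>
#include <vector>

using TargetFn = void (*)(int *buf, int len);   // assumed prototype

uint64_t tbi_hash(TargetFn f, uint32_t seed, int len) {
    std::mt19937 prng(seed);                    // pre-determined pseudo-random input I
    std::vector<int> buf(len);
    for (int &x : buf) x = static_cast<int>(prng());

    std::string io;                             // serialise I ...
    for (int x : buf) io += std::to_string(x) + ",";
    f(buf.data(), len);                         // execute f(I) to obtain O
    for (int x : buf) io += std::to_string(x) + ",";   // ... and O
    return std::hash<std::string>{}(io);        // stands in for H(I, O)
}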

5.5 Identifying structural similarity

The problem, however, is that the above technique is external, and nothing is learned about the internal algorithm of the procedure. For instance, under this matching scheme, if f was bubblesort and g was mergesort, a reasonable conclusion is that the procedure was a sorting function, nothing more. This is insufficient if the goal of the analyst is to retrieve the algorithm from the binary. Moreover, in the case where two or more congruent matches are found, a way of further narrowing the search is clearly needed. In short, a higher-fidelity matching scheme is called for.

A simple approach to include more internal distinction, i.e. to identify structural similarity, is to locate common intermediate states. This can be done by tracking the states and comparing them path by path. We revisit our running example of Figure 5.1, which has been annotated with state information in Figure 5.3. The states are expressed in terms of the input parameter a. We observe that since function f shares more common states with h, we say that f is structurally more similar to h than to g.

Determining structural similarity using intermediate states, however, is sensitive to instruction reordering. Suppose that the addition of the two terms in function h were reordered, as in function k in Figure 5.3. Function k does not share any common intermediate states with f, even though their algorithms comprise the addition of otherwise identical terms.

5.6 Perturbation analysis

In this section, a better structural comparison algorithm is presented, called perturbation analysis. A perturbation function P : S → S is a localised state transformation that, given a state s ∈ S, returns P(s) ∈ S. A perturbation is localised in the sense that the transformation is applied to a single variable in a state. If unambiguous, P(si) is taken to mean that P is applied to the variable used or defined in state i; otherwise an index is added, e.g. P(s1a).


Figure 5.4: An illustration of two congruent procedures f and g perturbed at states i, j, m and n, resulting in new final states for which Pi(sf^f) = Pj(sf^g) and Pm(sf^f) = Pn(sf^g).

The intuition is that perturbation induces a sufficiently distinct final state, and two structurally similar procedures should, for some variable pair, result in the same deviant state. The more such variable pairs there are, the more structurally similar the procedures. This may be thought of as an analogue of differential fault analysis [181, 182]. A more formal definition is as follows.

Definition (State congruence) Given two congruent procedures f and g, and a perturbation function P that is applied at state i in f, resulting in the final state Pi(sf^f) such that Pi(sf^f) ≠ sf^f, if there exists a state j in g that results in the final state Pj(sf^g) such that Pi(sf^f) = Pj(sf^g), then we say that states i and j are congruent in P. Alternatively, we say that i and j are in a congruent set of P. The expression Pi(sf) is used to refer to the final state when P is applied to si, that is

Pi(sf) = Cn ∘ Cn−1 ∘ ... ∘ Ci+1 ∘ P ∘ Ci ∘ ... ∘ C1(s0)

Figure 5.4 shows an example of procedures f and g that are congruent, as they have the same final state from a common input state. In addition, when P is applied to states i in f and j in g, they reach the same deviant output state, denoted by Pi(sf^f) = Pj(sf^g). Thus, states i and j are congruent in P. Similarly, states m and n are congruent in P. Note also that i and j are in a different congruent set from m and n, as the final states are distinct.

An example of a perturbation function is the definition-increment operator, P^def(x) = x + 1. Applying P^def(si) is taken to mean "post-write increment the variable defined at si". Applying this to our example, P^def(s4^f) results in the final state (in terms of the input parameter a)

(a) P^def

P1^def(sf^f) = ⟨v = (a + 1)(a + 2) + 1⟩
P2^def(sf^f) = ⟨v = a(a + 3) + 1⟩
P3^def(sf^f) = ⟨v = a(a + 2) + 2⟩
P4^def(sf^f) = ⟨v = a(a + 2) + 2⟩
P1^def(sf^g) = ⟨v = (a + 2)(a + 2)⟩
P1^def(sf^h) = ⟨v = a(a + 2) + 2⟩
P2^def(sf^h) = ⟨v = a(a + 2) + 2⟩
P1^def(sf^k) = ⟨v = a(a + 2) + 2⟩
P2^def(sf^k) = ⟨v = a(a + 2) + 2⟩

(b) P^use                                     Congruence set

P2^use(sf^f)  = ⟨v = a(a + 3) + 1⟩            1
P3a^use(sf^f) = ⟨v = a(a + 3) + 1⟩            1
P3b^use(sf^f) = ⟨v = (a + 1)(a + 2) + 1⟩      2
P4^use(sf^f)  = ⟨v = a(a + 2) + 2⟩            3
P2a^use(sf^g) = ⟨v = (a + 2)(a + 1)⟩
P2b^use(sf^g) = ⟨v = (a + 1)(a + 2)⟩
P1a^use(sf^h) = ⟨v = (a + 1)(a + 2) + 1⟩      2
P1b^use(sf^h) = ⟨v = a(a + 3) + 1⟩            1
P2^use(sf^h)  = ⟨v = a(a + 2) + 2⟩            3
P1^use(sf^k)  = ⟨v = a(a + 2) + 2⟩            3
P2a^use(sf^k) = ⟨v = (a + 1)(a + 2) + 1⟩      2
P2b^use(sf^k) = ⟨v = a(a + 3) + 1⟩            1

Figure 5.5: a. Perturbation function P^def(x) = x + 1 applied to our running example. Instructions f3, f4, h1, h2, k1 and k2 are congruent in P^def. b. Perturbation function P^use applied to our running example produces three congruent sets: {f2, f3a, h1b, k2b}, {f3b, h1a, k2a} and {f4, h2, k1}, corresponding to the two x terms and the constant 1 term.

P4^def(sf^f) = ⟨v = a(a + 2) + 2⟩

The same perturbation applied at P^def(s1^k) and P^def(s2^h) produces a matching final state

P1^def(sf^k) = P2^def(sf^h) = P4^def(sf^f) = ⟨v = a(a + 2) + 2⟩

Extending this to the rest of the states, we obtain the congruent set under P^def comprising f3, f4, h1, h2, k1 and k2. On the other hand,

P1^def(sf^g) = ⟨v = (a + 2)(a + 2)⟩ ≠ ⟨v = a(a + 2) + 2⟩

implying that instruction g1 is not in that set. The other possible final states Pi^def(sf) are shown in Figure 5.5a.

The second perturbation function is the use-increment operator, P^use, which is similar to P^def, except that P^use operates on variable uses instead of definitions. Informally, P^use(sik) can be described as "pre-read increment the kth variable used in state i". A subscript is appended to the state index, as a state may use more than one variable. If P^use is applied to our example again, the final states in Figure 5.5b are obtained.

Undefined behaviour may occur when the domain of a transformed variable x, written dom(x′), is undefined. Since our example consists solely of integers, it is easy to see that if dom(x) is defined, then dom(x′) is also defined. However, if x were to be used as an index into an array, then the validity of dom(x′) depends on whether it exceeds the array bounds.

5.7 Guard functions

Perturbation analysis is performed by instrumenting, that is, adding additional instructions to, all possible perturbation sites in a function, one per variable of interest, according to the perturbation function P. The program is modified to add a guard input parameter guard, of int type, and each site i is guarded with a guard function Gi : int → bool that activates the transformation Pi if it evaluates to true. The function G^single, which activates one guard at a time, is defined as

Gi^single(guard) = { true,   if guard = i
                     false,  otherwise

A second guard function is defined for when more than one active site is desired. The following guard function, G^bitvector, allows this.

Gi^bitvector(guard) = { true,   if (guard >> i) & 1 = 1
                        false,  otherwise

Guard G^bitvector allows the matching of variables that are split or merged as a result of compilation, for example in partial loop unrolling optimisations.

Given an existing test suite, the congruent sets can be automatically derived by iterating through all values of the guard variables and solving fG(x, guardi) = hG(x, guardj), where fG and hG are the guarded versions of f and h, and x are the test cases. Solutions are of the form (guardi, guardj) which, if G^single is used, imply that the pair of sites (i, j) is congruent. The set of congruent pairs for our example in Figure 5.5b is {(3, 5), (2, 2), (1, 3)}.

The Jaccard coefficient is used to measure the structural similarity between f and h. Let Sf ∩ Sh be the set of congruent pairs and Sf ∪ Sh be the set of perturbation sites; then the structural similarity of f and h is given by

SS(Sf, Sh) = |Sf ∩ Sh| / |Sf ∪ Sh| = 6/6 = 100%
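To make the instrumentation concrete, the following sketch shows what function h from the running example might look like after guarding each P^def site with G^single. The function name h_guarded, the helper Gsingle and the site numbering are illustrative; the thesis generates this instrumentation automatically over the translated C. Congruent pairs are then found by iterating over guard values and testing fG(x, guardi) = hG(x, guardj) on the test inputs.

// Illustrative sketch only: function h with its P^def perturbation sites
// guarded by G^single. Passing guard = 0 leaves the function unperturbed.
static bool Gsingle(int guard, int site) { return guard == site; }

int h_guarded(int x, int guard) {
    x = x * (x + 2);
    if (Gsingle(guard, 1)) x += 1;   // P^def at h1: post-write increment of x
    x++;
    if (Gsingle(guard, 2)) x += 1;   // P^def at h2: post-write increment of x
    return x;
}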

5.8 Perturbation-based indexing

Perturbation-based indexing (PBI) follows from perturbation analysis and is similar to test-based indexing (TBI), except that f is transformed to its guarded form, fG. Output OS, which consists of an output state for each active perturbation site in S, is then obtained by executing fG(I, guard). As with TBI, nPBI is defined to be the number of hashed input-output pairs H(I, Os1), ..., H(I, Os_nPBI) to be included in the index. Because the search space of variables whilst performing perturbation analysis can be large, the number of perturbation sites explored is restricted by nPBI.

To evaluate TBI and PBI, the definitions of precision, recall and the F2 measure in Section 4.7 are reused.

5.9 Implementation

Disassembly was performed using the Dyninst binary instrumentation framework. The assembly-to-C translator was based on the CPU emulator [183] and was implemented as a Dyninst plugin in 1,581 lines of C. The compile-time library that implements the x86 instruction semantics was written in 1,192 lines of C. Function prototypes were obtained using the output of gcc -aux-info. All experiments were carried out on an Intel Core 2 Duo machine running Ubuntu 12.04 with 1 GB of RAM.

          G^bitvector                             G^single
          P^def             P^use                 P^def         P^use
B vs B    122/143 (85.3%)   145/167 (86.8%)       8/8 (100%)    12/12 (100%)
B vs I    46/129 (35.7%)    36/179 (20.1%)        4/9 (44.4%)   4/13 (30.8%)
B vs S    0/129 (0%)        0/167 (0%)            0/9 (0%)      0/13 (0%)
I vs B    63/143 (44.1%)    58/162 (35.8%)        4/7 (57.1%)   4/11 (36.4%)
I vs I    113/129 (87.6%)   147/174 (84.5%)       8/8 (100%)    12/12 (100%)
I vs S    0/129 (0%)        0/162 (0%)            0/8 (0%)      0/12 (0%)
S vs B    0/143 (0%)        0/162 (0%)            0/8 (0%)      0/11 (0%)
S vs I    0/129 (0%)        0/174 (0%)            0/9 (0%)      0/12 (0%)
S vs S    128/129 (99.2%)   155/162 (95.7%)       9/9 (100%)    12/12 (100%)

Table 5.1: Structural similarity between three sorting algorithms: bubblesort (B), insertionsort (I) and selectionsort (S). The comparison is between source and the binary compiled with gcc -O0, using the perturbation functions P^use and P^def, and the two guard functions G^single and G^bitvector.

5.10 Evaluation

The most important questions appear to be the following.

1. What is the accuracy of the perturbation functions P^def and P^use, and of the guard functions G^single and G^bitvector?

2. Do compiler optimisations or the compiler affect accuracy?

3. What are the optimal values for the number of test-based and perturbation-based indices, nTBI and nPBI respectively?

4. What is the accuracy of test-based and perturbation-based indexing?

Two data sets were used in the evaluation. The first set, or the sorting set, comprised five well-known sorting algorithms, including bubblesort, insertionsort, selectionsort and quicksort. The algorithms as listed on Wikipedia.org and in the Rosetta.org code repository were used. The second data set, also called the coreutils set, consisted of 100 internal functions from coreutils 6.10, comprising 172,648 instructions. These functions were chosen as they did not make system or API calls.

5.10.1 Perturbation and guard functions

As a first test, structural similarity analysis was performed on three sorting algorithms: bubblesort, insertionsort and selectionsort. The functions were compiled with gcc -O0, then translated to C and instrumented with the two perturbation functions, P^def and P^use, using the two guard functions, G^single and G^bitvector, giving a total of four possible combinations. A randomly initialised array of 100 integers was used as input, and the structural similarity between the three functions was measured using the method described in Section 5.7.

        R-B           R-BO          R-I
W-BO    4/8 (50%)     4/8 (50%)     2/7 (28.6%)
W-I     2/7 (28.6%)   2/7 (28.6%)   6/6 (100%)

(a) Bubblesort/Insertionsort

          W-Half         W-First        W-Last         R-Half         R-First        R-Last
W-Half    16/16 (100%)   -              -              -              -              -
W-First   0/16 (0%)      16/16 (100%)   -              -              -              -
W-Last    0/16 (0%)      0/16 (0%)      16/16 (100%)   -              -              -
R-Half    0/13 (0%)      0/13 (0%)      0/13 (0%)      10/10 (100%)   -              -
R-First   0/13 (0%)      0/13 (0%)      0/13 (0%)      0/10 (0%)      10/10 (100%)   -
R-Last    0/13 (0%)      0/13 (0%)      0/13 (0%)      0/10 (0%)      2/10 (20%)     10/10 (100%)

(b) Quicksort

Figure 5.6: a. Structural similarity between Rosetta-Bubblesort (R-B), Rosetta-Bubblesort-optimised (R-BO), Rosetta-Insertionsort (R-I), Wikipedia-Bubblesort-optimised (W-BO) and Wikipedia-Insertionsort (W-I). b. Structural similarity of quicksort with different pivot points for the Wikipedia and Rosetta implementations. Half: pivot is the (n/2)th element; First: pivot is the 0th element; Last: pivot is the (n − 1)th element.

The results are listed in Table 5.1. If the threshold for structural similarity was set at 50%, all combinations except G^single with P^def had a 100% detection rate. Looking at the results for G^bitvector, the similarity was never 100% when it should have been. The reason was that G^bitvector allows all possible combinations of perturbations, some of which could not be matched. It was also observed that the similarity between bubblesort and insertionsort obtained using G^bitvector was not 0%. The spurious match was found to be due to the "swap block" in bubblesort and the innermost a[iHole] = a[iHole-1] assignment of insertionsort. Because this assignment was executed the same number of times as the swap block in bubblesort, they were found to be congruent.

To conclude, P^use performed better than P^def. Apart from the results, G^bitvector has a severe disadvantage in the number of perturbation sites it can accommodate, which in this case is 32. In contrast, G^single is able to accommodate up to 2^32 sites. These two observations motivated the decision to use G^single and P^use for the rest of the experiments.

5.10.2 Comparison of different implementations

The aim of the next set of experiments was to test the ability of perturbation analysis to differentiate between implementations of the same algorithm, in this case the implementations of bubblesort and insertionsort from Wikipedia (W-BO, W-I) as well as from the Rosetta.org code repository (R-B, R-BO, R-I). The difference between R-B and R-BO was that the number of iterations of the inner loop decreased with each successive iteration of the outer loop of R-BO. In addition, the quicksort algorithm was varied by tuning it to use different pivots (W-Half, W-First, W-Last, R-Half, R-First and R-Last).

Perturbation analysis found no difference between the insertionsort implementations (Figure 5.6a). On manual inspection, the only noticeable difference was the starting index of the outer loops; the outer loop of the Rosetta algorithm started from index 1, and that of Wikipedia started from index 0.

nTBI   Precision          Recall
1      105/169 (62.1%)    105/105 (100%)
2      105/165 (63.6%)    105/105 (100%)
5      105/165 (63.6%)    105/105 (100%)
10     105/165 (63.6%)    105/105 (100%)
100    105/165 (63.6%)    105/105 (100%)

Table 5.2: The precision and recall rates for the self-test of gcc -O2 versus gcc -O2, using TBI with nTBI = 1, 2, 5, 10, 100, the number of input-output pairs.

Not surprisingly, when compared with the bubblesort algorithms, the Rosetta insertionsort gave the same result as the Wikipedia one. The Rosetta bubblesort (R-BO), however, was not the same as its Wikipedia counterpart (W-BO), due to the outer loop. While the number of iterations of the R-BO inner loop decreased by one at each successive outer loop iteration, the W-BO algorithm was more optimised, with each successive inner loop limit determined by the last swap encountered in the previous iteration. Despite this difference, the total number of executions of the swap block was the same amongst all the bubblesort algorithms. As a result, perturbation analysis still found their similarity to be 50%.

In contrast, perturbation analysis was not able to find any similarity between the same quicksort implementation given three different initial pivots, resulting in almost no matches (Figure 5.6b); the choice of pivot affected the dynamic behaviour of the algorithm too much. As to the different quicksort implementations, the Wikipedia-Quicksort (W-Q) partitions the array into two with a single pointer in an ascending for-loop, swapping elements less than the pivot towards the bottom indices. The Rosetta-Quicksort (R-Q) algorithm does the partitioning using two pointers in a two-nested while loop, one pointer starting from index 0, the other from index n − 1, and completing the partition when the two pointers cross each other. Moreover, the W-Q algorithm had three swap blocks, while R-Q had only one. As a result, the number of perturbation sites was smaller in the R-Q case (5 versus 8), and perturbation analysis could not find any similarity between the two implementations.

5.10.3 Coreutils data set

In this set of experiments, the accuracy of test-based indexing as described in Section 5.4 was evaluated. The first step was to study the effect of nTBI, the number of TBI input-output pairs, on performance. Table 5.2 shows that precision increased with nTBI, peaking at nTBI = 2, while recall remained at 100%. The maximum precision and recall rates obtained for TBI were 63.6% and 100% respectively. TBI generated 84 unique MD5 hashes for the 105 functions analysed.

Keeping nTBI constant at 10, a further experiment was performed to study the effect of including nPBI, the number of perturbation-based input-output pairs. The value of nPBI which gave the best performance was found to be 1, with precision and recall at 70.9% and 100% respectively (Figure 5.7a).

The largest group giving the same input-output hash consisted of four functions dealing with comparisons of data structures (Figure 5.7b). On manual analysis, it was discovered that ls.c:dev_ino_compare and du.c:entry_compare were in fact exact copies of each other. The reason for the unexpected match between the functions fts-cycle.c:AD_compare and src_to_dest_compare was that both functions dealt with data structures that happened to have the same pointer offsets.

5.10.4 Compilers and compiler options

The aim of this set of experiments was to test the accuracy of TBI and PBI in detecting the coreutils functions compiled with two compilers, gcc and clang, using different compile options. The results are shown in Table 5.3.

The first observation was that TBI, or nPBI = 0, was out-performed by TBI/PBI in all data sets. The F2 measure of PBI peaked at nPBI = 2 for gcc and overall. This was surprising, since the performance was expected to improve with larger values of nPBI and with more perturbation sites in use. One possible explanation is that certain compiler optimisations created variables that had unique perturbation behaviour. There was tension observed between precision and recall, especially in the two sets of clang-compiled functions: higher precision was accompanied by lower recall, and vice versa. False negatives were partly due to "function clones", variants of a frequently-called routine that the compiler creates to optimise run-time performance. As a result of cloning, the parameters and their number can change, so that the function prototype given by gcc -aux-info no longer matches and thus the input state was not initialised correctly. Other false negatives were due to invalid pointer dereferences, resulting in undefined behaviour.

nPBI   Precision          Recall
1      105/148 (70.9%)    105/105 (100%)
2      105/149 (70.5%)    105/105 (100%)
3      105/149 (70.5%)    105/105 (100%)
4      105/153 (68.6%)    105/105 (100%)
5      105/153 (68.6%)    105/105 (100%)
10     105/150 (70.0%)    105/105 (100%)
50     105/150 (70.0%)    105/105 (100%)
100    105/149 (70.5%)    105/105 (100%)

(a) Precision/Recall

Function                         Num. insns
fts-cycle.c:AD_compare           30
cp-hash.c:src_to_dest_compare    30
du.c:entry_compare               30
ls.c:dev_ino_compare             30

(b) Hash collisions

Table 5.7: a. The precision and recall rates for PBI for nPBI = 1, 2, ..., 5, 10, 50, 100, the number of perturbation input-output pairs, keeping nTBI = 10. b. The largest group with hash collisions, consisting of functions that compare between data structures. The last two functions, entry_compare and dev_ino_compare, were found to be copy-and-pasted code clones. The other false matches were due to insufficient code coverage.

Data set   nPBI   Precision          Recall            F2
gccO3      0      88/146 (60.3%)     88/90 (97.8%)     87.0
           1      87/133 (65.4%)     87/88 (98.9%)     89.7
           2      87/128 (68.0%)     87/88 (98.9%)     90.7
           5      87/132 (65.9%)     87/88 (98.9%)     89.9
           10     86/128 (67.2%)     86/88 (97.7%)     89.6
gccOs      0      91/151 (60.3%)     91/98 (92.9%)     83.8
           1      91/138 (65.9%)     91/98 (92.9%)     85.9
           2      91/135 (67.4%)     91/98 (92.9%)     86.4
           5      91/139 (65.5%)     91/98 (92.9%)     85.7
           10     91/139 (65.5%)     91/98 (92.9%)     85.7
clangO2    0      85/139 (61.2%)     85/86 (98.8%)     88.0
           1      80/111 (72.1%)     80/86 (93.0%)     87.9
           2      83/114 (72.8%)     83/86 (96.5%)     90.6
           5      84/122 (68.9%)     84/86 (97.7%)     90.2
           10     80/106 (75.5%)     80/86 (93.0%)     88.9
clangO3    0      80/134 (59.7%)     80/85 (94.1%)     84.4
           1      75/106 (70.8%)     75/85 (88.2%)     84.1
           2      79/115 (68.7%)     79/85 (92.9%)     86.8
           5      81/121 (66.9%)     81/85 (95.3%)     87.8
           10     76/104 (73.1%)     76/85 (89.4%)     85.6
Overall    0      344/567 (60.7%)    344/359 (95.8%)   85.9
           1      333/485 (68.7%)    333/357 (93.2%)   87.0
           2      340/495 (68.7%)    340/357 (95.2%)   88.4
           5      343/515 (66.6%)    343/357 (96.1%)   88.3
           10     333/477 (69.8%)    333/357 (93.3%)   87.4

Table 5.3: Precision and recall for coreutils compiled with gcc -O3 (gccO3), gcc -Os (gccOs), clang -O2 (clangO2) and clang -O3 (clangO3), keeping nTBI = 10. The reference set used was coreutils compiled with the default setting gcc -O2.

5.11 Discussion

5.11.1 Undefined behaviour

Since perturbing variables has the potential to produce undefined behaviour in code, the ways in which undefined behaviour occurred in practice were examined. Numerous instances of segmentation faults and three instances of an infinite loop were observed as a result of perturbation. The segmentation faults were caused by invalid memory dereferences, either because of an invalid pointer or an invalid array index. Two solutions were adopted to deal with this issue. The first was to instrument a simple check before each memory dereference to see if the address was a valid one; if this check failed, the function was made to exit immediately. The 9 most significant bits of ESP were used as the definition of the valid address space, and this worked for all but one case. In that last case, the array index slightly exceeded the array bounds whilst being updated in a for-loop; this was not caught by the pointer check and the perturbation site was removed.
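A minimal sketch of this dereference check is shown below. The macro and helper names, and the way the current stack pointer is obtained, are illustrative assumptions; the text only specifies that the upper 9 bits of the candidate address are compared with those of ESP, and that the function exits early on a mismatch.

#include <stdint.h>

/* The valid address space is defined by the 9 most significant bits of ESP:
 * an address is accepted only if its top 9 bits match those of the stack
 * pointer.  The 32-bit x86 target is an assumption of this sketch. */
#define TOP9_MASK 0xFF800000u

static inline uint32_t current_esp(void) {
    uint32_t esp;
    __asm__ volatile ("mov %%esp, %0" : "=r"(esp));   /* gcc/clang inline asm */
    return esp;
}

static inline int addr_is_valid(uint32_t addr) {
    return (addr & TOP9_MASK) == (current_esp() & TOP9_MASK);
}

/* Instrumentation inserted before a dereference of pointer p:
 *     if (!addr_is_valid((uint32_t)p)) return;    -- abandon this execution
 *     ... *p ...
 */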

In the second approach, it was observed that the array size was often passed as a parameter. Thus, to prevent the size from being too large, whenever an array was found to be one of the function parameters, the value of every integer parameter was capped at the array size. A more rigorous approach to the problem of invalid memory dereferences is to use fat pointers [184, 185]. However, it was observed that, thus far, undefined behaviour was sufficiently mitigated by the above-mentioned simple techniques that it did not severely affect the detection rate. The infinite loops were all caused by perturbations, in the form of increment statements ++i, made to a decrementing loop counter (e.g. one updated with i--), causing non-termination of the loop. The solution was to manually remove the affected perturbation sites.

5.11.2 Indirect jumps and external code

An assumption made in this chapter was that the control flow is statically known. In reality, there exist indirect jumps, for example due to virtual functions, as well as uses of function pointers whose targets are not statically known. One possible solution is to collect and perform analysis on execution traces instead of a static disassembly, relying on automated code coverage tools to fill in gaps in the control flow. So far, only internal functions have been considered, that is, functions that do not rely on external code. Analysing functions that, for example, make system calls is more involved; the approach used by KLEE [177] is to use the uClibc C library, or micro-controller C library. Another approach, used by Jiang and Su [176], is to treat the return value of a system call as another input to the function, eliminating the need to implement the call.

5.11.3 Function prototyping

As mentioned, the function prototypes were determined using the output of gcc -aux-info. However, approximating void * with char * caused segmentation faults in some cases, as these parameters pointed to a data structure and had to be dealt with as such. The solution was to edit the function prototypes and replace void * with the appropriate struct pointer. As a matter of practical consideration, three data types were found to be sufficient: int, char[] and char **. All bool, char and short variables could be approximated by an int, since they are passed as 32-bit values on the stack; all long long 64-bit arguments could be approximated by two ints; arrays, both one and two dimensional, could be approximated by a large enough char array; finally, all arrays of pointers could be approximated by a char ** initialised to point to a pre-determined number of char arrays. In most cases, a struct could be approximated with either a char[] or a char ** of the appropriate size, with the exception of seven cases which had to be initialised according to their constituent member types to prevent segmentation faults. The overall implication of this is that three type categories, non-pointers, pointers and arrays of pointers, are sufficient for matching and indexing the majority of variables.
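As an illustration of this three-type approximation, the sketch below prepares an input state for a hypothetical function whose true prototype mixes several C types. The function name, buffer sizes and helper name are invented for the example; this chapter does not prescribe these particular constants.

#include <string.h>

/* True prototype (not visible to the analysis):
 *     int lookup(struct table *t, long long key, char *names[]);
 * Approximated prototype used to build the input state:
 *     int lookup(char t[], int key_lo, int key_hi, char **names);
 */
#define BUF_SZ  256   /* a large enough char array standing in for a struct */
#define N_PTRS    8   /* pre-determined number of char arrays behind char ** */

static char  t_buf[BUF_SZ];
static char  name_bufs[N_PTRS][BUF_SZ];
static char *names[N_PTRS];

void init_input_state(void) {
    memset(t_buf, 0, sizeof t_buf);            /* struct table *  ~  char[]  */
    for (int i = 0; i < N_PTRS; i++) {         /* char *[]        ~  char ** */
        memset(name_bufs[i], 0, sizeof name_bufs[i]);
        names[i] = name_bufs[i];
    }
    /* long long key ~ two ints (key_lo, key_hi), passed as two 32-bit slots */
}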

5.12 Related work

The work that is closely related to TBI/PBI is that of EqMiner [176], which also makes use of randomised input to find functionally equivalent code using input-output behaviour. EqMiner does not deal with intermediate state; instead, it extracts consecutive subsequences of statements, which allows it to match, for instance, fragments of bubblesort and selectionsort. Matching statement subsequences is a complementary technique to perturbation analysis, although perturbation analysis can match at the granularity of variables while subsequence matching is limited to matching between statements, which may involve more than one variable. BinHunt [175] performs semantic matching on basic blocks using symbolic execution and a theorem prover. Matching is done via pair-wise comparisons of all symbolic formulae generated. However, symbolic execution is a computationally expensive technique to apply to whole functions [175].

A related problem is software plagiarism and code clone detection. This has been an issue since at least the 1980s, when IBM brought lawsuits against firms that had cloned the PC-AT ROM based on 'software birthmarks', such as the order in which registers were pushed and popped [186]. Several techniques have been proposed since, most notably program dependence graphs [164], directed acyclic graphs [187] and program expression graphs [165]. The commercial tool BinDiff [188, 150] does binary similarity analysis, first at the call-graph level, then at the procedure level by performing graph matching, then at the basic block level using instruction mnemonics. Another binary matching tool is Pierce and McFarling's BMAT [149], originally designed to propagate profile information from an old, well-documented build to a newer one for testing purposes.

6 Prototype recovery via inlined data source tracking

“Type information encapsulates much that distinguishes low level machine code from high level source code.” – Mike Van Emmerik [13]

As opposed to high-level source code, machine code contains no type information. Type recovery for machine code is thus an important problem in decompilation and in reverse engineering in general, as is evident from the numerous papers that have been written on the subject [12, 26, 39, 13, 29, 189, 190, 36]. A sub-problem in this domain is the recovery of function prototypes. Prototype recovery is important for at least two reasons. Firstly, prototypes are important for interprocedural analysis [11]; secondly, prototypes are needed for determining valid input to a procedure. In our context, prototype recovery is needed to perform perturbation analysis, discussed in the previous chapter (Chapter 5), so that the appropriate number and type of arguments can be placed on the stack or in the registers.

Previous work in prototype recovery has focussed on static methods and on basic types such as integers and string constants [11]. However, static methods have to deal with at least two issues. Firstly, pointers in machine code induce aliasing problems, more so than in high-level source code, due in part to complex pointer expressions. This problem is further aggravated by the stack, which must first be logically split into local variables, register spills, parameters, caller and callee saves and compiler-generated storage such as return addresses and frame pointers. Secondly, variables with more than one live range can either be split into separate variables or united as one. This decision to split or to unite, which can affect the precision of the type assignment, is undecidable in general [13].

This chapter adopts an approach to prototype recovery that is based on dynamic analysis and differs from previous approaches in two fundamental ways. Firstly, it does not make the assumption that an existing set of test input is available. Indeed, dynamic analysis will not be meaningful without first inferring valid input to the procedure. Secondly, the approach exploits the fact that in practice function prototypes congregate into a finite and manageable number of categories, enabling a simpler type system to be used, yet still being able to recover complex parameter types such as recursive data structures. As a side note, an additional benefit of performing dynamic analysis is that it fits naturally with perturbation analysis.

        coreutils                            linux
Freq.   Argslist   Percentage       Freq.    Argslist   Percentage
758     N          18.4%            83,845   P          31.3%
697     P          16.9%            38,438   PP         14.4%
372     PP          9.0%            31,019   -          11.6%
351     -           8.5%            27,124   PN         10.1%
272     PN          6.6%            12,469   PPP         4.7%
267     NN          6.5%             9,356   PNN         3.5%
141     PPN         3.4%             8,977   PPN         3.4%
132     NP          3.2%             8,774   N           3.3%
107     NQ          2.6%             4,959   PNP         1.9%
88      PNN         2.1%             3,598   PPPN        1.3%

Table 6.1: The frequency and percentage of the most common function arguments (argslists) in coreutils and linux (N refers to a non-pointer, P represents a pointer, Q represents a pointer to pointers).

This chapter makes the following contributions:

Of the function prototypes surveyed in the GNU coreutils tool suite 6.10 and the Linux kernel 3.0.1, 99.0% had 6 parameters or fewer. Pointers to user-defined data structures, i.e. struct X *, were the most common parameter type. The majority of data structures were found to consist of either all non-pointers or all pointers, allowing them to be approximated by the char * and char ** types respectively (Section 6.1).

The algorithm for prototype recovery is inspired by software exploit detection and software testing. Inlined data source tracking (Section 6.6), the primary type inference mechanism, was inspired by the principle of conformant execution; probabilistic branch negation is an additional technique used to improve code coverage (Section 6.7).

Unlike prior prototype recovery algorithms, the algorithm is able to recover complex types, such as recursive data structures.

The algorithm is able to correctly recover 84% of function prototypes from coreutils, and 62% of function prototypes from glibc, out-performing the current state-of-the-art prototype recovery algorithm. The worst-case timing is 0.4 seconds per function for coreutils and 1.65 seconds for glibc (Section 6.11).

6.1 Survey of prototypes in coreutils and linux

Two data sets were surveyed: 4,115 functions from the coreutils 6.10 tool suite, and 267,463 functions from the linux kernel version 3.0.1.

A prototype can be divided into the list of arguments, the argslist, and the return type. By grouping all data types into three categories, non-pointers (N), pointers (P) and pointers to pointers (Q), the first task was to count the frequency of argslist categories expressed in terms of N, P and Q. The prototypes were obtained using the gcc compiler option -aux-info. The most commonly occurring argslist categories are listed in Table 6.1. There were differences in the order of the most common argslists, although the single pointer (P), double pointer (PP) and void argument (-) categories were among the most common in both data sets. Interestingly, the relative frequency order of these three, P, PP, -, was the same for both sets of functions. There was, however, more consistency in the number of parameters per function. Approximately 35% of functions had one parameter; about 28% had two; 15% had three; and about 10% had no function parameters (Figure 6.1). The function with the most parameters was cq_enet_rq_desc_dec (Cisco ethernet receive descriptor decode) from linux, with 29. Function prototypes with six or fewer parameters accounted for 99.0% of all prototypes in both data sets.

Figure 6.1: Number of parameters for functions in coreutils and linux.

In terms of individual parameter types (500,000 in total), there were 354,248 occurrences of a pointer type, or 70.0% of all parameters; 149,966, or 29.7%, were non-pointers; pointers to pointers numbered 1,565, or 0.3% of all parameters. Among the pointer types, pointers to user-defined data structures, or struct types, were the most common (55.2%), followed by pointers to a basic type, for example char * or int * (13.9%), then by union types (0.3%) and finally function pointers (0.1%). Next, a closer examination of the struct types in use was performed, since struct pointers were the most popular parameter type overall. Pointer members were similarly mapped to P, non-pointers to N and pointers to pointers to Q. There were a total of 39 and 11,256 different struct types in coreutils and linux respectively. The ten most popular struct layouts are listed in Table 6.2. Interestingly, the two most frequently occurring structs consisted of two and six non-pointers, accounting for 27% and 32% of all structs in coreutils and linux respectively. Aside from a few exceptions, the most frequently occurring data structures consisted of either all non-pointers or all pointers, accounting for more than 50% of all structs.

        coreutils                            linux
Freq.   Members   Percentage       Freq.    Members   Percentage
6       NN        15.4%            2,333    NNNNNN    20.7%
5       NNNNNN    12.8%            1,241    NN        11.0%
3       PPPPPP     7.7%              934    NNN        8.3%
3       PPP        7.7%              746    NNNN       6.6%
2       PPPPPN     5.1%              664    N          5.9%
2       PPPPP      5.1%              541    PPPPPP     4.8%
2       NP         5.1%              480    NNNNN      4.3%
2       NNP        5.1%              294    PP         2.6%
2       NNNNN      5.1%              244    P          2.2%
1       PPPP       2.6%              191    PNNNNN     1.7%

Table 6.2: Most common constituent member types of user-defined struct types found in coreutils and linux.

The implication of Table 6.2 is that from the perspective of performing type recovery, the majority of data structures can be approximated with either a large representative char array, for structures of the form NNNN..., or a pointer to pointers, for structures of the form PPPP.... This simplifies the type recovery process significantly since we need only consider three type categories: P, N and Q.
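The survey's N/P/Q classification of struct members can be made concrete with a small sketch that maps a struct's member kinds to a layout string of the sort listed in Table 6.2. The enum representation of member kinds is an assumption made for illustration; the survey itself obtained the type information from the program source rather than from code like this.

#include <stdio.h>

/* Member kinds used in the survey: non-pointer, pointer, pointer to pointers. */
typedef enum { MEM_N, MEM_P, MEM_Q } member_kind;

/* Map a list of member kinds to a layout string such as "NNP" or "PPPPPP". */
static void layout_string(const member_kind *members, int n, char *out) {
    static const char codes[] = { 'N', 'P', 'Q' };
    for (int i = 0; i < n; i++)
        out[i] = codes[members[i]];
    out[n] = '\0';
}

int main(void) {
    /* struct telephone { char *name; int number; }  =>  "PN" */
    member_kind telephone[] = { MEM_P, MEM_N };
    char buf[16];
    layout_string(telephone, 2, buf);
    printf("telephone layout: %s\n", buf);
    return 0;
}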

6.2 Algorithm design

We observe that it is rarely intentional for a function to use memory that has not been initialised by its caller, or to access memory addresses that are out of bounds. Hence, one approach is to use these two criteria, uninitialised memory and invalid pointers, to validate candidate prototypes: invalid solutions should fail to meet one or both conditions; conversely, valid solutions should satisfy both conditions. This is analogous to a valid type assignment satisfying all of its type constraints. This chapter makes the following assumptions.

1. It is assumed that the function under analysis, f, does not contain uninitialised memory errors.

2. It is assumed that complete control-flow information is available. Control-flow re- construction is considered to be a separate problem from prototype recovery.

3. It is assumed that standard calling conventions, such as stack-based and fastcall conventions, are used.

The proposed algorithm is based on conformant execution, which is described now.

6.3 Conformant execution for type recovery

A concept recently coined by Jacobson et al., an execution trace of a program is said to be conformant to a policy, which is a set of requirements, if it satisfies that policy for all its program states [191]. The aim of enforcing a conformant execution policy is to stop an attacker from exploiting a software vulnerability in a program in order to execute his or her code. For example, in the context of preventing code injection-based exploits such as buffer overflow exploits, an appropriate policy is to require that the instruction pointer cannot point to memory that has both the write and execute bits set. This is also known as the W⊕X data execution prevention policy [192]. As a second example, in the context of preventing code-reuse attacks, such as return-oriented programming, in which an exploit populates the call stack with a list of addresses in order to execute several short instruction sequences that end with a retn instruction, an appropriate policy is to check the validity of all control-flow source-target pairs, also known as maintaining control-flow integrity [193].

While conformant execution is useful for detecting unintended code execution, in this chapter it is adapted for the purposes of type recovery. The main premise is that a valid prototype, when translated to an execution trace, should conform to certain validity policies, for example those defined in Sections 6.4 and 6.5.

The goal is to infer the prototype of a function, which involves inferring the argslist and the return type. An argslist defines an input state, and the two terms are used interchangeably. A program state, s, is represented by the tuple

s = ⟨a1, ..., am, x1, ..., xn, v⟩

where the ai are the input parameters, the xj are the program variables and v is the output variable. At the input state, s0, only the ai are defined. At the final state, sf, v is defined. Given an input state s0 and a policy F, s0 is valid if and only if all subsequent states conform to F, i.e. F(s1) ∧ F(s2) ∧ ... ∧ F(sf) is satisfied. Policy F determines the precision of the analysis: if F contains too many requirements, or is too restrictive, then no valid s0 may be found; conversely, if F contains too few requirements, or is too permissive, then too many s0 are valid solutions. Our policy for prototype recovery is made up of two components, namely memory validity, Fmem, and address validity, Faddr.

6.4 Memory validity, Fmem

The goal of the memory validity policy, Fmem, is to ensure that no uninitialised memory is used by the function. Although the use of uninitialised memory may be an error in the function, this is rarely intended and the assumption made is that a function given valid input does not contain such errors.

Consider the following caller-callee code snippet, which follows a typical stack-based calling convention. The dereference of pointer EBP + 8 at line g3 is of interest, as it may cause an uninitialised value on the stack to be used. Assuming the stack is empty at instruction f1, there are three items on the stack by the time execution reaches instruction g3: the constant 3, the return address f3 and the saved EBP from g1. Therefore, EBP + 8 logically points to a value placed by the caller function f, in this case 3.

f:
     ...
f1:  push 3
f2:  call g
f3:  ...

g:
g1:  push EBP
g2:  mov ESP, EBP
g3:  mov EDX, [EBP + 8]
     ...

However, given only instructions g1, g2 and g3, what is inferred is that the four bytes located at EBP + 8 are in use in g, but their type is not known.

To implement Fmem, the heuristic used is that the contents of uninitialised memory remain the same before and after the arguments are written by the caller. Therefore, instrumentation is inserted before the first instruction so that the stack and CPU registers (excluding the stack and frame pointers) are populated with a predefined known magic value, say MAGIC (lines g0i and g0ii in Figure 6.2). The initInputState function simulates the caller function, which populates the input parameters (line g0). For instrumentation inserted before an instruction, we use the Roman numeral suffixes i, ii, ...; for instrumentation inserted after an instruction, we use the suffixes a, b, .... We use the C{ ... } notation to indicate inlined C, analogous to the asm{ ... } notation for inlined assembly. Next, before every pointer dereference or register use, an assertion is inserted to test whether the memory pointed to, or the register, contains the magic value. Where the register read is a sub-register, e.g. AL, the test is made on the whole register, i.e. EAX. Suppose the prototype, x, was the empty argslist, -. Upon execution, the assertion fails at line g3i since the address location EBP + 8 is uninitialised, Fmem is violated and thus x is not valid. On the other hand, if x was the argslist N, consisting of a single integer, then the four bytes at location EBP + 8 are populated and the assertion evaluates to true. Note that the argslist NN will similarly be valid since it also produces a conformant execution trace; here the principle of Occam's razor is used and N is chosen over NN since it is the simpler explanation. There is, however, the issue of false positives, since some registers are used as temporary local variables and have their contents read in order to be stored on the stack at the beginning of the function. The process of reading and storing these registers will be flagged as violating Fmem. Instructions such as xor EAX, EAX also generate false positives since, although syntactically register EAX is read, semantically its value is overwritten. To deal with the first case, only the fastcall registers EAX, ECX and EDX are pre-populated with MAGIC; to deal with the second case, such instructions are not instrumented.

6.5 Address validity, Faddr

As mentioned, it is assumed that a function, given valid input, should not access out-of-bounds memory. The code snippet in Figure 6.3 gives an example of a potential out-of-bounds memory read: in instruction g4, the memory address being dereferenced is EDX.

g:
C{
g0i:   int stack[max];
g0ii:  for(i=0;i<max;i++) stack[i] = MAGIC;
g0:    initInputState();
}
g1:  push EBP
g2:  mov ESP, EBP
C{
g3i:   assert( [EBP + 8] != MAGIC, "Fmem violated" );
}
g3:  mov EDX, [EBP + 8]
     ...

Figure 6.2: An example of instrumenting a function to implement policy Fmem.

g:
C{
g0i:   int stack[max];
g0:    initInputState();
}
g1:  push EBP
g2:  mov ESP, EBP
g3:  mov EDX, [EBP + 8]
C{
g4i:   assert( EDX >= &stack[0] && EDX <= &stack[max], "Faddr violated" );
}
g4:  mov EAX, [EDX]
     ...

Figure 6.3: An example of checking address validity, which constitutes policy Faddr.

Policy Faddr constrains memory addresses to be within the bounds of the stack, and this is asserted prior to executing g4 in instruction g4i. If this assertion fails for argslist x, Faddr is violated and x is not a valid solution. In practice, this policy will not infer any valid solutions for some functions since global variables lie outside of the local stack frame. This issue is addressed by inlined data source tracking (Section 6.6).

The implementation of both Fmem and Faddr enforces the policy that uninitialised memory is not read and out-of-bounds memory accesses are not permitted. However, there are two aspects to finding a valid x: firstly, the data structures involved need to be inferred; secondly, there is also a need to infer the appropriate values with which to populate those data structures. While simple, the downside of using only Fmem and Faddr is that many executions of f are potentially required to find a valid x. In the next section, a technique called inlined data source tracking (IDST) is proposed. IDST addresses these two issues and quickly finds candidate data structures from only one or a few executions, accelerating the process of reconstructing a valid x.

6.6 Inlined data source tracking

It is observed that traversing a data structure consists of pointer dereferences and pointer arithmetic. The addressing modes used in pointer dereferences tend to be one of the following four types: base-only, e.g. [EAX], base-immediate, e.g. [EAX + 4], base-indexed, e.g. [EAX + EDX], and base-scaled, e.g. [EAX + 4 * EDX]. Pointer arithmetic tends to be simple addition involving small constants. Pointer subtraction is rare in practice and is currently not dealt with, but it would not be difficult to implement. The key idea of inlined data source tracking (IDST) is to exploit the fact that the base pointer of a data structure is likely to be constant in its upper, or most significant, bits, with its lower bits updated to point to the various data structure components. Thus a data structure can be tracked by storing its source information in the upper bits of the base pointer, and only rewriting the pointer with a proper address immediately prior to a dereference. The requirements for the tracking scheme were the following.

1. The data sources to be tracked include all input parameters and global variables.

2. The number of successive pointer dereferences made using the same base pointer needs to be tracked, thus allowing the separation of numbers, arrays, pointer-to-pointers and recursive data structures.

3. The scheme has to allow pointer arithmetic, particularly the addition of small offsets to the base pointer.

4. The scheme should not track variables that are not data structures. Thus tracking is to cease when the variable is not used as a pointer, for example when it is used in multiplication, division or logical operations.

Two schemes were considered: an inlined scheme and a shadow tracking scheme. The latter was envisaged to use shadow registers and shadow memory. The advantage of such a scheme is the strict non-interference between the tracking data and the actual computation data. However, the main disadvantage of this scheme is the complexity of implementing the shadow memory and the shadow computation to be performed for all the different instruction types in use. Compared to a shadow tracking scheme, inline tracking is more expedient. The potential downside, however, is the possible interference between the tracking data and the actual computation data, which could cause incorrect tracking information to be reported. To deal with requirement 3, all tracking information is encoded in the most significant 16 bits, leaving the lower 16 bits for pointer arithmetic. To deal with requirement 4, 8 bits are reserved for use as a "tamper" seal which would be destroyed should the variable be used in multiplication, division or logical operations. This 8-bit magic number could be placed altogether, or interspersed with the tracking data; the choice was made to place the magic number at the beginning and at the end of the upper 16 bits. The tracking scheme consisted of the following components (Figure 6.4); helper routines for manipulating this encoding are sketched after the list.

An 8-bit magic value, MAGIC, split into two groups, which indicates that this variable is still being tracked.

MAGIC      Dereference counter   Source identifier   MAGIC      Offset
(4 bits)   (4 bits)              (4 bits)            (4 bits)   (16 bits)

Figure 6.4: The encoding scheme for inlined data source tracking.

A 4-bit dereference counter, deref_count, which tracks the number of successive dereferences made with a pointer.

A 4-bit data source identifier, source_ID.

A 16-bit base pointer offset value, which allows pointer arithmetic to be performed without interfering with the tracking. Offsets are assumed to be 2^16 or smaller.
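A minimal sketch of how this encoding could be packed and queried is given below. The helper names contains_magic and increment_deref mirror those appearing in Figure 6.5, but the exact bit positions and C signatures shown here are assumptions for illustration, chosen to be consistent with Figure 6.4 and with the MAGIC pattern 0xA00A0000 used in Figure 6.5.

#include <stdint.h>

#define MAGIC_MASK   0xF00F0000u   /* MAGIC nibbles: bits 28-31 and 16-19 */
#define MAGIC_VALUE  0xA00A0000u   /* the 0xA..A pattern from Figure 6.5  */
#define DEREF_SHIFT  24            /* 4-bit dereference counter, bits 24-27 */
#define DEREF_MASK   0x0F000000u
#define SRC_SHIFT    20            /* 4-bit source identifier, bits 20-23 */
#define SRC_MASK     0x00F00000u

/* Is this 32-bit value still carrying tracking information? */
static int contains_magic(uint32_t v) {
    return (v & MAGIC_MASK) == MAGIC_VALUE;
}

/* Build the initial encoded value for data source 'id' (offset and count zero). */
static uint32_t encode_source(uint32_t id) {
    return MAGIC_VALUE | ((id << SRC_SHIFT) & SRC_MASK);
}

/* Record one more successive dereference through this base pointer. */
static uint32_t increment_deref(uint32_t v) {
    uint32_t count = (v & DEREF_MASK) >> DEREF_SHIFT;
    if (count < 0xF) count++;                    /* saturate at 2^4 - 1 */
    return (v & ~DEREF_MASK) | (count << DEREF_SHIFT);
}

/* Extractors used when interpreting the log after execution. */
static uint32_t deref_count(uint32_t v) { return (v & DEREF_MASK) >> DEREF_SHIFT; }
static uint32_t source_id(uint32_t v)   { return (v & SRC_MASK) >> SRC_SHIFT; }

With this layout, encode_source(0) yields 0xA00A0000 and encode_source(1) yields 0xA01A0000, matching the initial register values in Figure 6.5.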

To implement this scheme, the function is instrumented at four different sets of locations: at the start of the function, before every register use, before every pointer dereference and after every pointer dereference. In addition, a segment of the stack is set aside as scratch space, scratch[smax], and a saved slot is reserved for saving and restoring base address pointers.

1. At the start of the function, all data sources are initialised, that is, the 8-bit magic value is set, the dereference counter is set to 0, the source identifier is set to the unique value assigned to that source.

2. Before every register use, a check is made to see if the 8-bit magic value is present. If so, the entire 32-bit value of the register is recorded.

3. Before every pointer dereference, the base address register is checked for the 8-bit magic value. If it is present, its value is copied to the saved slot, its dereference counter is incremented by one, the 32-bit value is recorded, its 32-bit value is written to every element in scratch, and lastly the base address is assigned the value of &scratch. Since the base address now points to &scratch, future dereferences will use the new encoded value.

4. After the pointer dereference, the base address register is restored with the value in the saved slot.

In the final implemented scheme, instead of having both the Fmem and Faddr policies, the Fmem policy is replaced by inlined tracking, while the Faddr policy is retained. An example of inlined data source tracking is given in Figure 6.5. The output log indicates that register EAX is an input parameter, from instruction g3, and is a pointer, since its dereference counter is one. Register EDX is not an input parameter, since it is not read. Since EDX is overwritten in instruction g4v, it is no longer tracked after instruction g4, but future reads from the scratch array will continue to track the data structure. One can leverage this scheme to also infer recursive data structures, such as linked lists and trees. We say a data structure A is recursive if A contains a member that is a pointer to A.

g:
C{
g0i:   int scratch[smax];
g0ii:  EAX = 0xA00A0000; /* source ID = 0 */
g0iii: EDX = 0xA01A0000; /* source ID = 1 */
g0:    initInputState();
}
g1:  push EBP
g2:  mov ESP, EBP
C{
g3i:   if( contains_magic( EAX ) )
g3ii:    log( "%x\n", EAX );
}
g3:  mov EDX, EAX
C{
g4i:   if( contains_magic( EDX ) ) {
g4ii:    EDX = increment_deref( EDX );
g4iii:   log( "%x\n", EDX );
g4iv:    for(i=0;i<smax;i++) scratch[i] = EDX;
g4v:     EDX = (int)&scratch;
       }
}
g4:  mov EAX, [EDX]
     ...

Expected output:
l0: a00a0000
l1: a10a0000

Figure 6.5: An example of inlined data source tracking and the expected output after execution. The MAGIC bits, which occupy bits 16 to 19 and 28 to 31, are 0xA00A0000. Line l0 indicates that the EAX register (source ID = 0) is an input parameter and line l1 indicates that EAX is a pointer (with deref_count = 1), while EDX (source ID = 1) is unused.

In practice, it is rare for non-recursive data structures to have 2^4 - 1 or more nested levels of pointer indirection. Thus, if the dereference counter for a structure X reaches 2^4 - 1, it is likely due to the iterative and recursive procedures required for enumerating a recursive data structure, and X is assigned the recursive type R. The corresponding data source is subsequently assigned the NULL value so that it is no longer tracked.

The return variable type is inferred using the heuristic that the return variable, v, is stored in EAX.

If v is still tracked, i.e. it contains a valid MAGIC value, the type of a parameter or return variable is derived from the value of deref_count in the following manner.

v =   N,  if deref_count = 0
      P,  if deref_count = 1
      Q,  if 2 ≤ deref_count < 2^4 - 1
      R,  otherwise

Otherwise, v is assigned type N.

6.7 Probabilistic branch negation

One shortcoming of using a dynamic approach is the inability to reach all code. Two techniques have been proposed in the past to increase code coverage, which can be described as input data manipulation, also known as automated test-case generation [194, 177], and control-flow manipulation, used in tools such as Flayer [195]. The former approach was not considered, as the input data is used to store tracking information. On the other hand, the instruction semantics could be modified. The technique adopted by Drewry and Ormandy is to manually convert one or a few conditional branches to unconditional ones [195]. A similar approach, called probabilistic branch negation, is proposed here. However, instead of one or a few conditional branches, all conditional branches are modified, and we define the probability Pb ∈ [0, 1] that a branch is negated. For example, let

B : JNZ I;

which is equivalent to

B : if (!ZF) goto I;

in C, where ZF is the zero flag and I is the branch target. The negation of B is

¬B : if (ZF) goto I;

Given a random number r ∈ [0, 1], the branch is negated with probability Pb, i.e.

if (r ≤ Pb) ¬B else B.

If Pb = 0, the instruction is unmodified from the original. If Pb = 1, the branch is always negated.
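A minimal sketch of how a translated conditional branch could implement this negation is shown below. The macro names, the use of rand() as the random source and the ZF variable are assumptions for illustration; Section 6.10 states only that the branch macros in a second header file were modified to include Pb and the negated branch.

#include <stdlib.h>

/* Branch negation probability, 0.0 (never negate) to 1.0 (always negate). */
static double Pb = 0.2;

/* Zero flag, assumed to be maintained by the translated arithmetic macros. */
static int ZF;

/* Original translated branch: jump to label I if ZF is clear. */
#define JNZ32(I)       if (!ZF) goto I;

/* Probabilistically negated version: with probability (approximately) Pb
 * take the opposite decision, otherwise behave like the original branch. */
#define JNZ32_PBN(I)                                             \
    if ((double)rand() / RAND_MAX <= Pb) { if (ZF)  goto I; }    \
    else                                 { if (!ZF) goto I; }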

6.8 Type system

The six types in the type system are: ⊤, N, P, Q, R and ⊥. The partial order is given by

⊤ :> N :> P :> Q :> R :> ⊥

where :> is the subtype relation. The maximal element ⊤ is “type not yet inferred”; the minimal element ⊥ represents “type conflict”, but it is unused in practice; type N represents “either a number or a pointer type”; type P, or “pointer type”, is a subtype of N; type Q is a subtype of P, for example a pointer to a structure containing at least one pointer; finally, the recursive data structure type R is a subtype of Q.

The mapping of C types to this type system is as follows. Basic C types, such as bool and char, map to N. Arrays, both one and two dimensional, map to P. Pointers to pointers, such as char **, map to Q. For a data structure pointer struct A *, the member with the largest value in this lattice is used: if that member is N, then struct A * maps to P; otherwise, if that member is P or higher, then struct A * maps to Q. For example, given a data structure telephone that has two members, char *name and int number, and a function that has the prototype int getTelephoneNumber( struct telephone * ), the first argument of getTelephoneNumber will be mapped to Q according to this mapping, since the parameter is a pointer to a telephone structure and the largest element of telephone is a pointer. The mapping of C types to this type system is summarised in Table 6.3.

C type                                                     Maps to
bool, char, short, int                                     N
long long int                                              NN
char *, ..., int *                                         P
void *                                                     P
char[], ..., int[]                                         P
char[][], ..., int[][]                                     P
char **, ..., int **                                       Q
void **                                                    Q
Data structure pointers whose largest type member = N      P
Data structure pointers whose largest type member <: N     Q
Recursive data structures                                  R

Table 6.3: Mapping of C types to the type system.

We define the meet operator, ⊓, as

x ⊓ y =   x,  if x <: y
          y,  otherwise

The ⊓ operator is used when combining solutions from multiple runs of the probabilistic branch negation algorithm.
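The lattice and the meet operator can be captured in a few lines of C, which also makes it easy to combine the solutions from several runs. The enum ordering used below is an assumption chosen so that a larger value means a more specific type; the chapter does not prescribe this particular representation.

/* Type lattice: TOP :> N :> P :> Q :> R :> BOT.
 * Larger enum value = lower (more specific) in the lattice. */
typedef enum { T_TOP = 0, T_N, T_P, T_Q, T_R, T_BOT } ty;

/* Meet: x if x <: y (x is at least as specific), otherwise y. */
static ty meet(ty x, ty y) {
    return (x >= y) ? x : y;
}

/* Combine per-run inferences for one variable, e.g. from the basic
 * run and several branch-negated runs. */
static ty combine(const ty *runs, int n) {
    ty acc = T_TOP;
    for (int i = 0; i < n; i++)
        acc = meet(acc, runs[i]);
    return acc;
}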

6.9 Distance metric

The goal is to infer a type that is close to the actual type, and for the purposes of evaluation a distance metric is defined to measure the performance of the recovery.

Let the inferred prototype be x = ⟨v, x0, x1, ...⟩, where v is the return type and x0, x1, ... are the parameter types. Let the actual prototype be y = ⟨u, y0, y1, ...⟩ and let the number of parameters in a prototype p be denoted by |p|. Let Dl, the lattice distance between elements xi and yi, be equal to the difference in their relative positions on the lattice. For example, Dl(N, P) = 1. Let the maximum distance between two elements on the lattice be Dl_max = Dl(⊤, R) = 4. If x contains a different number of parameters than y, then Dl_max is added to the distance for every parameter incorrectly missed or added. Thus, the distance is defined as

D(x, y) = Dl_max · abs(|x| − |y|) + Σ_i Dl(xi, yi) + Dl(v, u)

where the sum is taken over the first min(|x|, |y|) parameters. For example, if x = P and y = NN, then their distance is given by

D(P, NN) = 4 · abs(1 − 2) + Dl(P, N) = 4 + 1 = 5
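A small sketch of this metric is given below; it repeats the lattice enum from the earlier sketch so as to be self-contained, and the array-based prototype representation is an assumption for illustration.

#include <stdlib.h>

typedef enum { T_TOP = 0, T_N, T_P, T_Q, T_R, T_BOT } ty;

#define DL_MAX 4   /* Dl(TOP, R) */

/* Lattice distance: difference in positions along TOP :> N :> P :> Q :> R. */
static int dl(ty a, ty b) { return abs((int)a - (int)b); }

/* Distance between an inferred prototype (return xr, parameters xs[nx])
 * and the actual prototype (return yr, parameters ys[ny]). */
static int proto_distance(ty xr, const ty *xs, int nx,
                          ty yr, const ty *ys, int ny) {
    int d = DL_MAX * abs(nx - ny) + dl(xr, yr);
    int m = nx < ny ? nx : ny;
    for (int i = 0; i < m; i++)
        d += dl(xs[i], ys[i]);
    return d;
}

/* Example from the text: x = (P), y = (N, N) gives 4 + 1 = 5 when the
 * return types are equal, so that the Dl(v, u) term contributes nothing. */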

6.10 Implementation

.exe -> [Disassemble] -> .asm -> [Translate] -> .C -> [Instrument] -> .C' -> [Compile] -> .exe' -> [Execute] -> .log

Figure 6.6: Overview of the IDST work flow.

Inlined data source tracking is implemented in five phases: disassembly, translation, instrumentation, compilation and execution (Figure 6.6). Disassembly and translation are performed using the Dyninst binary instrumentation framework. The translated program comprises two parts: a C file containing the assembly instructions written as macros, e.g. MOV32( EAX, 0 ), and a header file containing the macro definitions, e.g. #define MOV32( dst, src ) dst = src;. More details of the assembly-to-C translator, which consists of 1,546 lines of C++, can be found in Section 5.2. Compilation is performed using clang. Probabilistic branch negation was implemented as a second header file, with the conditional branch macros modified to include the Pb variable and the negated branch instructions. The following data sources were tracked: the fastcall CPU registers EAX, ECX and EDX; the stack locations [ESP + N], for N = 4, 8, 12, 16, 20, 24; and global variables.
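To illustrate the translation step, the sketch below shows what a translated fragment and its macro header might look like. Apart from MOV32, whose definition is quoted in the text, the macro names, register globals and modelled stack are assumptions for illustration; the actual translator emits considerably more detail, including flag updates.

/* ---- macro header: one-line C semantics for each instruction macro ---- */
#include <stdint.h>

static uint32_t EAX, ECX;                    /* modelled registers      */
static uint32_t stack_mem[1024];
static uint32_t *ESP = stack_mem + 1024;     /* modelled stack pointer  */

#define MOV32( dst, src )   dst = src;
#define ADD32( dst, src )   dst += (uint32_t)(src);
#define PUSH32( src )       *--ESP = (uint32_t)(src);

/* ---- translated function: one macro per original instruction ---- */
uint32_t f(void) {
    PUSH32( ECX )           /* f1: push ECX     */
    MOV32( EAX, 0 )         /* f2: mov EAX, 0   */
    ADD32( EAX, ECX )       /* f3: add EAX, ECX */
    return EAX;             /* f4: retn         */
}

A second header redefines the conditional branch macros to include the Pb test shown earlier, so the same translated C file can be compiled once with branch negation and once without.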

6.11 Evaluation

The prototype recovery technique was evaluated on two data sets: 99 functions from coreutils 6.14 (coreutils), and 343 functions from the GNU C library 2.17 (glibc). Both sets of functions were compiled with gcc -O2. Function prototypes were obtained using gcc -aux-info. All experiments were conducted on an Intel Core i5 2.5 GHz with 4 GB of RAM. The results obtained were compared with the current state-of-the-art algorithm used in the SecondWrite binary rewriting tool [46]. The pertinent research questions are the following.

How accurate is basic inlined data source tracking (IDST) in inferring function prototypes?

How effective is probabilistic branch negation (BN) in improving the performance of IDST?

What is the run-time cost of performing IDST?

6.11.1 Basic inlined data source tracking

In the first set of experiments, the accuracy of the basic inlined tracking algorithm, as per Section 6.6, was evaluated using the distance metric described in Section 6.9. The results for the coreutils and glibc data sets are shown in Figure 6.7.

Figure 6.7: Type distance for functions in coreutils and glibc using basic inlined data source tracking with Faddr.

The number of prototypes correctly inferred, i.e. having a type distance of zero, was 78 out of 99 (78.8%) for coreutils and 202 out of 343 (58.9%) for glibc. The number of prototypes having a distance of at most four, equal to Dl_max, was 95 out of 99 (96.0%) for coreutils and 302 out of 343 (88.0%) for glibc. On manual analysis, one of the reasons for incorrectly inferred types was poor code coverage; for example, string manipulation functions had switch constructs that dealt only with ASCII values, parameters that were given incorrect buffer lengths, and so on. Code coverage was addressed in the next set of experiments.

6.11.2 Adding probabilistic branch negation

Two schemes were considered: a purely probabilistic branch negation scheme (BN), and a hybrid scheme that combined the basic approach with probabilistic branch negation (Basic + BN). For the BN scheme, the function was executed eight times, and the overall solution was the ⊓ (meet) over all eight solutions. For the Basic + BN scheme, the overall solution was the ⊓ of the basic solution and the eight BN solutions.

Figure 6.8: Branch negation probability, Pb versus average type distance for functions in coreutils, for BN only and Basic + BN.

Values of Pb from 0 to 0.5 were tested. When Pb = 0, there was no branch negation, and the BN scheme was equivalent to the Basic scheme. When Pb = 0.5, there was a 50% probability of each branch being negated. The results obtained for coreutils are shown in Figure 6.8.

At Pb = 0, there was no difference between the BN and Basic schemes, which was to be expected. However, for Pb > 0, the BN scheme performed worse than the Basic scheme, with an average type distance of 0.56 at Pb = 0.2, which increased to 0.57 for Pb > 0.2. This was surprising, as the code coverage appeared to be poorer than that of the Basic scheme. One explanation is that the paths executed when using BN were different from the ones executed otherwise. When combined with the Basic scheme, Basic + BN inferred prototypes with an average type distance of 0.354 for coreutils, peaking at Pb = 0.2 and identifying 84% of prototypes. This confirmed our hypothesis that the paths executed were mutually exclusive under the two schemes. Thus Basic + BN was an improvement over the basic algorithm, which only managed an average type distance of 0.495 and was able to identify 78.8% of prototypes. Similar results were obtained for glibc: the Basic + BN scheme outperformed both the BN and Basic schemes, with its performance peaking at Pb = 0.2 and inferred prototypes having an average type distance of 1.33. A total of 62.4% of prototypes were inferred correctly. In contrast, the basic scheme only managed an average type distance of 1.52 and inferred 58.9% of prototypes correctly (Figure 6.9). Comparing these results with those of SecondWrite, the false positive rate reported by ElWazeer et al. was 0.2 for arguments and 0.44 for return variables [46]. The false positive rate was defined to be the difference between the number of arguments or return variables inferred and the actual number, i.e. the abs(|x| − |y|) component of the distance metric defined in Section 6.9. In comparison, the false positive rate for the Basic + BN method was 0.175 for coreutils and glibc combined.

Figure 6.9: A comparison of the Basic approach versus Basic + BN.

                                  coreutils               coreutils               glibc
                                  Basic                   Basic + BN              Basic + BN
Phase                             Average (s)  Worst (s)  Average (s)  Worst (s)  Average (s)  Worst (s)
Disassembly, Translation,
  Instrumentation                 16.4         16.6       17.3         17.3       470.7        483.0
Compilation                       10.2         10.7       20.4         20.4       50.7         51.8
Execution                         0.56         0.69       3.0          3.1        30.2         31.1
Total for all functions           27.2         28.0       40.7         40.8       551.5        565.9
Average per function              0.275        0.283      0.411        0.412      1.61         1.65

Table 6.4: Average and worst case timing for the Basic and Basic + BN algorithms.

6.11.3 Timing

The timing of the Basic + BN algorithm can be split into three phases: disassembly to instrumentation, compilation, and execution of the instrumented binary. The average and worst case timings for three configurations are listed in Table 6.4. Comparing Basic and Basic + BN, the disassembly phases did not differ significantly. However, compilation took twice as long in Basic + BN, since the instrumented C program is compiled once with Pb defined and once without. Overall, Basic was about 30% faster than Basic + BN, but about 12–28% less accurate in terms of average type distance. The per-function timing for coreutils was about a quarter of that for glibc, but both were well under two seconds. The function that took the longest time to analyse was sha256_process_block, which contained 3,366 instructions and took 8.2 seconds of processing time.

6.12 Discussion

While the aim of IDST is to recover the function prototype, it is the first step in the larger goal of recovering the types of other variables. One approach to extending the current implementation to perform type inference is to use IDST to obtain valid input. The rest of the variables can then be recovered via techniques such as REWARDS [189] or Howard [190].

static const char *file_name;

/* Set the file name to be reported in the event an error is
   detected by close_stdout. */
void
close_stdout_set_file_name (const char *file)
{
  file_name = file;
}

Figure 6.10: IDST was unable to correctly infer file as a pointer as it was never used as one.

IDST failed to find the right type in a few cases. It was unable to infer parameters that were never read, in which case it assumed fewer parameters than were actually present. A parameter that was read but never used as a pointer was assigned the type N. Similarly, a pointer to pointers that was only ever used as a pointer was assigned the type P. For instance, the function close_stdout_set_file_name has a pointer parameter of type P, but IDST assigned N because the parameter char *file was read and assigned to the global variable char *file_name but was never dereferenced (Figure 6.10).

A related issue was the approximation of a data structure by its largest member type. Functions with data structure parameters did not always use all of the members, in particular the largest one. In such cases IDST inferred a type that differed from the mapped type.

Another case where IDST had difficulty involved the void * C type. A void * is type-ambiguous in that it "can be converted to or from a pointer to any incomplete or object type" [196]. An example is the function insque, which inserts an element into a doubly-linked list. It has two parameters, declared as void * but cast to struct qelem *, which is a recursive data structure. Another example is obstack_allocated_p, which determines whether an object has been allocated from an object stack data structure. The object parameter is declared as a void * as it can be of any object type. IDST inferred the object as type N since it is merely used as an address and never dereferenced.

Thus far only leaf functions have been considered. Extending this technique to other functions requires dealing with API and system calls. For POSIX system calls, there is uClibc, a small C library optimised for embedded platforms. However, this is not possible for API calls that are not open-sourced. Another option is to perform execution environment modelling [197], which uses program synthesis methods to construct a model given a specification for an API call.

6.13 Related work

Previous work on type recovery in machine code can be categorised into four main groups: the constraint-based approach [12, 29, 36], the data-flow analysis approach [13], the REWARDS information-flow approach [189] and the Howard method, which uses dynamic memory usage modelling [190]. The first two approaches are primarily static whilst the latter two are dynamic in nature. The exception is TIE [36], which can be used statically or dynamically. The approach proposed here is based on the principle of conformant execution, which hitherto has not been used in the context of type recovery. The difference between IDST and REWARDS, Howard and TIE is that IDST does not assume that it receives valid input, whereas they do. The goal is thus also different: IDST seeks to infer the function prototype, although like Howard and TIE the aim is to recover higher-level data structures.

Inferring function prototypes is a sub-problem of type recovery. The method of Cifuentes is based on liveness analysis and reaching definitions; parameter types considered included basic types, such as 8-, 16- and 32-bit integers, and string constants where API call parameter type information was available [11]. Zhang et al. observed that a parameter is defined by an ancestor function and used by its descendent; conversely, return variables are defined by a descendent function and used by its ancestor. Thus identifying parameters and return variables is framed as identifying data dependence edges that cross function boundaries. Arrays and structures are not supported [198]. BCR is a binary code reuse tool that identifies parameters and return variables from multiple execution traces using a combination of instruction idiom detection, taint analysis and symbolic execution; the availability of one valid input is assumed, and types up to a pointer to a single-level data structure are supported [63]. Anand et al. framed the problem of recovering function parameters as one of determining value ranges of address expressions, using value-set analysis (VSA). Where VSA is too imprecise, run-time checks are instrumented in the binary. However, it is unclear how valid input is obtained to enable run-time analysis [45, 46]. Unlike prior prototype recovery approaches, IDST is able to recover recursive data structures, and no valid input is assumed.

Conformant execution was recently coined by Jacobson et al. as a means to stop code reuse attacks [191]. However, the same term can be generalised to other memory corruption error detection and prevention mechanisms. Memory corruption protection techniques may be grouped according to what they protect, which can be data integrity, code integrity, data-flow integrity or control-flow integrity [199]; for example, data-execution prevention [192] is an example of protecting data integrity. In this framework, IDST maintains data-flow integrity by pointer rewriting and instrumented bounds checking.

Improving code coverage is an important problem in software testing. Two main approaches have emerged from this research, namely automated test-case generation and control-flow modification. The goal of the former approach is to generate test cases to maximise the code covered; examples of this approach include random testing or fuzzing [200, 201], symbolic execution, e.g. [202], and concolic execution [203, 194, 177].
On the other hand, control-flow modification takes a different approach by forcibly altering the execution flow at one or more conditional branches, for example to skip the calculation of checksums, which are expensive to solve symbolically [195, 204]. Control-flow modification has so far been applied only in a limited number of contexts. This chapter proposes probabilistic branch negation as a technique for aiding type recovery, and decompilation in general. The average type distance was optimal when Pb = 0.2, implying that code coverage is maximised under this condition; further work is needed to understand this phenomenon better.

7 Rendezvous: a prototype search engine

The token-based approach, described in Chapter 4, was implemented as a search engine, currently hosted at http://www.rendezvousalpha.com. This chapter discusses its architecture, results, storage requirements and performance.

7.1 System architecture

The front-end was implemented in 1,655 lines of Go; the back-end comprised a disassembler, which relied on the Dyninst binary instrumentation tool, and the text search engine CLucene. As a proof-of-concept prototype, code indexing in Rendezvous focussed on 32-bit x86 executables in the Ubuntu 11.04 base distribution, using instruction mnemonic 2- and 4-grams. The matching process in Rendezvous was performed at the binary level, i.e. the binaries in the corpus were matched against the binary under consideration. The mapping of the search results back to the source was done using the -aux-info option in gcc, which provides the locations of all function declarations in the source code. However, -aux-info overwrites its output file with every call to gcc. Since the output file for logging the auxiliary information is the same for all source files, that information is lost with every new file compiled. To overcome this issue, gcc was modified so that the fopen call for the -aux-info option used append ("a") rather than write ("w") mode.
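As an illustration of the token-based indexing the back-end performs, the sketch below extracts mnemonic n-grams from a disassembled instruction sequence. Rendering the n-grams as space-separated terms printed to standard output is an assumption for illustration; the real pipeline obtains the mnemonics from the Dyninst disassembly and hands the resulting terms to CLucene.

#include <stdio.h>

/* Emit the mnemonic n-grams of a function as search terms, one per line.
 * For the sequence PUSH MOV SUB CALL RETN and n = 2 this prints
 * "PUSH MOV", "MOV SUB", "SUB CALL", "CALL RETN". */
static void emit_ngrams(const char *mnemonics[], int count, int n) {
    for (int i = 0; i + n <= count; i++) {
        for (int j = 0; j < n; j++)
            printf("%s%s", mnemonics[i + j], j + 1 < n ? " " : "\n");
    }
}

int main(void) {
    const char *fn[] = { "PUSH", "MOV", "SUB", "CALL", "RETN" };
    emit_ngrams(fn, 5, 2);   /* 2-grams; the index also stores 4-grams */
    return 0;
}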

7.2 Results and performance

Two levels of search were implemented: queries could be made at either the executable level or the function level. Executable-level queries are automatically made when the binary is uploaded to the Rendezvous server. The results page for executable-level searches displays the ranked results list in the top pane, with the bottom pane showing the list of functions on the left, the list of files in the source package, if any, in the centre and the contents of the README file, if any, on the right (Figure 7.1). This example shows two results returned, namely cat from the 11.04 and 11.10 Ubuntu distributions.


Figure 7.1: The results page for an executable-level query for cat.

Figure 7.2: The results page for a function-level query for disable_priv_mode.

Function-level queries are embedded as links in the function list in the bottom left pane of both types of results page. The results page for function-level searches similarly displays the ranked results in the top pane and the function list in the bottom left. The centre pane displays the disassembly of the function under consideration and the right pane the source code of the chosen search result, if any (Figure 7.2). The search result for the disable_priv_mode function lists the correct match as the first entry. Based on a random sample of thirty queries, the precision and recall rates were 40.7% and 88.9% respectively, giving an F2-measure of 71.9%. This is in the same ballpark as the results predicted by Table 4.5. Exact matches, or instances where the correct match was the only one returned, occurred 63.0% of the time. Occasionally, where inexact matches were returned by Rendezvous, these functions were still useful from an analyst's perspective since they bore some resemblance to the query function. For example, the closest solution for with_input_from_string was set_pipestatus_from_exit, which, on manual inspection, was not an exact match (Figure 7.3).

Figure 7.3: Function under consideration, with_input_from_string (left), with the closest match, set_pipestatus_from_exit (right).

Figure 7.4: Function under consideration, cmd_init (left), with the closest match, make_word_list (right).

                           Average (s)   Worst (s)
Disassembly                0.773         0.827
CLucene query              0.148         1.317
Total (results uncached)   2.30          9.88
Total (results cached)     0.381         0.964

Table 7.1: Time taken for the back-end disassembly and query phases, and for uncached and cached timings including the front-end.

Nevertheless, the higher-order behaviour was similar: the initialisation of a structure which is passed as a parameter to a callee function. Another example of an incorrect match was the query for cmd_init, which returned make_word_list as the closest matched function (Figure 7.4). Here too, the higher-order behaviour was similar: a function call followed by writing to a data structure and returning, although there was disagreement as to whether one or two calls were made. This could in theory be resolved by including 5-grams since, assuming the mnemonic sequence for make_word_list was PUSH, MOV, SUB, MOV, MOV, MOV, CALL, MOV, MOV, MOV, LEAVE, RETN, the presence of the 5-gram CALL, MOV, MOV, MOV, MOV in cmd_init would eliminate make_word_list as a candidate. Although an exact match was not always obtained, a partial match provides the basis for further refinement, and this is a topic for further research (see Future work).

7.2.1 Storage requirements

A total of 40,929 executables from five distributions (Ubuntu 11.04, Ubuntu 11.10, FreeBSD 9.0, Debian 5.0.5 and Fedora 18) and 192,239 functions from executables in the Ubuntu 11.04 base distribution were indexed by Rendezvous. The "executables" indices required 5.6 GB and the "functions" index 126 MB. In all, the size of the server was a manageable 11 GB, with the CLucene indices taking up half of the disk space and the source code of Ubuntu 11.04 taking up the other half. CLucene initially produced an EFBIG (File too large) error when generating an index larger than 2 GB, owing to the 2 GB file size limit on 32-bit Linux. This was overcome by generating multiple CLucene indices and having the front-end combine the results of multiple queries. Porting the server to a 64-bit operating system should resolve this issue.
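In the spirit of that workaround, each index can be queried separately and the ranked lists merged by score. The sketch below is a simplified illustration: query_index is a hypothetical stub standing in for the CLucene calls, and the merge ignores the fact that scores from separately built indices are not strictly comparable, which a production system might need to renormalise.

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

struct Hit {              // hypothetical search hit
    std::string name;     // matched executable or function
    double score;         // engine-assigned relevance score
};

// Stub: a real implementation would open the index at `index_path` and run
// `query` against it using the search engine's API.
std::vector<Hit> query_index(const std::string& index_path,
                             const std::string& query) {
    (void)index_path;
    (void)query;
    return {};
}

// Merge the ranked lists returned by several indices into a single list,
// ordered by descending score and truncated to the top `k` results.
std::vector<Hit> merged_query(const std::vector<std::string>& index_paths,
                              const std::string& query, size_t k) {
    std::vector<Hit> all;
    for (const std::string& path : index_paths) {
        std::vector<Hit> hits = query_index(path, query);
        all.insert(all.end(), hits.begin(), hits.end());
    }
    std::sort(all.begin(), all.end(),
              [](const Hit& a, const Hit& b) { return a.score > b.score; });
    if (all.size() > k) all.resize(k);
    return all;
}

int main() {
    std::vector<std::string> shards = {"executables.0", "executables.1"};
    for (const Hit& h : merged_query(shards, "mov call mov retn", 10))
        std::cout << h.name << " " << h.score << "\n";  // empty with the stub
    return 0;
}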

7.2.2 Performance

The time taken to generate the six CLucene indices on an Intel Core 2 Duo machine with 1 GB of RAM was about two hours. The current version of Rendezvous takes between 1 and 10 seconds to disassemble, query and return results when they are uncached; otherwise, requests are handled within 0.3–0.9 seconds. The disassembly phase is the main bottleneck at the moment, consistently taking 0.7–0.9 seconds (Table 7.1), and improving the disassembler is a topic for future work.
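The gap between the cached and uncached timings suggests that most of the per-request cost can be avoided by memoising results. The following minimal sketch, which is not the Rendezvous implementation, caches query results keyed by a digest of the uploaded binary; digest_of here is a toy hash, whereas a real server would use a cryptographic digest of the file.

#include <functional>
#include <iostream>
#include <string>
#include <unordered_map>

// Toy digest for illustration only; a production server would hash the
// uploaded file with a cryptographic digest such as SHA-1.
static std::string digest_of(const std::string& binary_bytes) {
    return std::to_string(std::hash<std::string>{}(binary_bytes));
}

class ResultCache {
public:
    // `run_full_query` stands in for the disassemble-then-search pipeline.
    explicit ResultCache(std::function<std::string(const std::string&)> run_full_query)
        : run_full_query_(std::move(run_full_query)) {}

    // Cached requests return immediately; uncached ones pay the full cost once.
    std::string lookup(const std::string& binary_bytes) {
        const std::string key = digest_of(binary_bytes);
        auto it = cache_.find(key);
        if (it != cache_.end()) return it->second;
        std::string results = run_full_query_(binary_bytes);
        cache_.emplace(key, results);
        return results;
    }

private:
    std::function<std::string(const std::string&)> run_full_query_;
    std::unordered_map<std::string, std::string> cache_;
};

int main() {
    ResultCache cache([](const std::string& bytes) {
        return "results for a binary of " + std::to_string(bytes.size()) + " bytes";
    });
    std::string fake_binary(4096, '\x90');
    std::cout << cache.lookup(fake_binary) << "\n";  // uncached: full pipeline
    std::cout << cache.lookup(fake_binary) << "\n";  // cached: immediate
    return 0;
}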

8 Conclusions and future work

Decompilation has been viewed as trying to reconstruct "pigs from sausages" [51]. On the other hand, the need for reverse engineering tools has increased, and the research community has responded with better tools and techniques. The goals of decompilation can be summarised as producing compilable, equivalent and readable code; current tools focus on only one or two of these goals at a time. This thesis has argued for search-based decompilation (SBD) based on empirical evidence that code reuse is prevalent. With SBD, the goals of obtaining compilable, equivalent and readable code are met simultaneously whenever a suitable match is found.

Two techniques were proposed. The first, token-based code indexing, was shown to scale at least to a real-world corpus of 40,929 executables and 192,239 functions, achieving an F2-measure of 83.0–86.7% for programs compiled with two different configurations. The second, test- and perturbation-based indexing, was shown to have high precision and recall (88.4% F2-measure) on a smaller data set of internal functions compiled with five different configurations. The effectiveness of perturbation analysis was demonstrated by its ability to distinguish between eleven variants of five sorting algorithms. The main task ahead for TBI/PBI is dealing with API and system calls.

Prototype recovery is a prerequisite for perturbation analysis, and the proposed approach, inlined data source tracking with probabilistic branch negation, was demonstrated to have a 62–84% success rate on a dataset of 443 leaf functions. It was also more accurate than the current state-of-the-art prototype-recovery algorithm at inferring the correct number of parameters, with a false positive rate of 0.175.

8.1 Future work

Work on the SBD approach is by no means complete, and SBD can benefit from further research in the following areas.

Development of decompilation evaluation metrics   Metrics are an important aspect of SBD, and indeed of decompilation research as a whole. At the moment, evaluation centres on comparisons with the output of existing decompilers, some of which are proprietary or closed-source. Having a corpus of source and binary programs is helpful, but an evaluation metric that captures finer-grained details, such as source-binary variable alignment, would be more beneficial. Perturbation analysis (Chapter 5) is a possible candidate, but more work needs to be done to address the handling of system and API calls.

Sparse/Specialised disassembly   As noted in Chapter 4 and in the Rendezvous prototype, disassembly is currently a performance bottleneck, especially for large binaries. Possible research questions here are: what is the trade-off between performing a complete disassembly and a "sparse" disassembly? What specialised disassembly methods can or should be adopted for lightly obfuscated binaries, e.g. XOR-encrypted or UPX-packed1 binaries?

Partial matching   Thus far the primary consideration has been function-level matching: either there is a match for the whole function or there is none at all. What is to be done in cases where a function has been partially modified from the original? One possible approach is to first find the closest match, then patch the source code until it is an exact match.

This thesis has demonstrated beyond a reasonable doubt that search can be an enormous enabler in decompiling programs that are becoming larger and more complex, and that reuse more software. The challenge now is to transition from a prototype to a production system, and to scale up to index and match more software. That is a matter of engineering; the feasibility of search-based decompilation is no longer in question.

1 http://upx.sourceforge.net

Bibliography

[1] K. Thompson, “Reflections on trusting trust,” Commun. ACM, vol. 27, no. 8, pp. 761–763, Aug. 1984.

[2] D. A. Wheeler, “Fully countering Trusting Trust through diverse double-compiling,” Ph.D. dissertation, George Mason University, 2009.

[3] K. Zetter, “How digital detectives deciphered Stuxnet,” http://www.wired.com/threatlevel/2011/07/how-digital-detectives-deciphered-stuxnet/, 2011.

[4] F. Lindner, “Targeted industrial control system attacks,” in Hashdays security conference, 2011.

[5] G. Balakrishnan and T. Reps, “WYSINWYX: What you see is not what you execute,” ACM Trans. Program. Lang. Syst., vol. 32, no. 6, pp. 23:1–23:84, Aug. 2010.

[6] A. Cox and T. Smedley, “Reverse engineering in support of litigation: Experiences in an adversarial environment,” in Proceedings of the 13th Working Conference on Reverse Engineering, ser. WCRE ’06. IEEE Computer Society, 2006, pp. 288–292.

[7] Hex-Rays SA, “The Hex-Rays decompiler,” http://www.hex-rays.com/products/decompiler/index.shtml.

[8] “Boomerang decompiler,” http://boomerang.sourceforge.net.

[9] B. Schwarz, S. K. Debray, and G. R. Andrews, “Disassembly of executable code revisited,” in Proceedings of the 2002 9th Working Conference on Reverse Engineering, ser. WCRE ’02, A. van Deursen and E. Burd, Eds. IEEE Computer Society, 2002, pp. 45–54.

[10] J. Kinder, F. Zuleger, and H. Veith, “An abstract interpretation-based framework for control flow reconstruction from binaries,” in Proceedings of the 10th International Conference on Verification, Model Checking, and Abstract Interpretation, ser. VMCAI ’09. Berlin, Heidelberg: Springer-Verlag, 2009, pp. 214–228.

[11] C. Cifuentes, “Reverse compilation techniques,” Ph.D. dissertation, University of Queensland, 1994.

[12] A. Mycroft, “Type-based decompilation (or program reconstruction via type reconstruction),” in Proceedings of the 8th European Symposium on Programming, ser. ESOP ’99, S. D. Swierstra, Ed. Springer, 1999, pp. 208–223.

[13] M. J. Van Emmerik, “Static single assignment for decompilation,” Ph.D. dissertation, University of Queensland, 2007.


[14] W. M. Khoo, A. Mycroft, and R. Anderson, “Rendezvous: A search engine for binary code,” in Proceedings of the 10th Working Conference on Mining Software Repositories, ser. MSR ’13, T. Zimmermann, M. D. Penta, and S. Kim, Eds. IEEE / ACM, May 2013, pp. 329–338.

[15] A. Fokin, E. Derevenetc, A. Chernov, and K. Troshina, “SmartDec: Approaching C++ decompilation,” in Proceedings of the 2011 18th Working Conference on Reverse Engineering, ser. WCRE ’11, M. Pinzger, D. Poshyvanyk, and J. Buckley, Eds. Washington, DC, USA: IEEE Computer Society, 2011, pp. 347–356.

[16] M. H. Halstead, “Machine-independent computer programming,” Chapter 11, pp. 143–150, 1962.

[17] M. H. Halstead, “Machine-independence and third-generation computers,” in AFIPS Fall Joint Computing Conference, 1967, pp. 587–592.

[18] W. A. Sassaman, “A computer program to translate machine language into FORTRAN,” in American Federation of Information Processing Societies: Proceedings of the Joint Computer Conference, ser. AFIPS ’66 (Spring). New York, NY, USA: ACM, April 1966, pp. 235–239.

[19] C. R. Hollander, “Decompilation of object programs,” Ph.D. dissertation, Stanford University, 1973.

[20] V. Schneider and G. Winiger, “Translation grammars for compilation and decompilation,” BIT Numerical Mathematics, vol. 14, no. 1, pp. 78–86, 1974.

[21] S. T. Hood, “Decompiling with definite clause grammars,” Defence Science and Technology Organisation, Electronics Research Laboratory, Salisbury, South Australia, Tech. Rep. ERL-0571-RR, September 1991.

[22] J. J. O’Gorman, “Systematic decompilation,” Ph.D. dissertation, University of Limerick, 1991.

[23] F. Chen and Z. Liu, “C function recognition technique and its implementation in 8086 C decompiling system,” Mini-Micro Systems, vol. 12, no. 11, pp. 33–40,47, 1991.

[24] F. Chen, Z. Liu, and L. Li, “Design and implementation techniques of the 8086 C decompiling system,” Mini-Micro Systems, vol. 14, no. 4, pp. 10–18,31, 1993.

[25] S. Kumar, “DisC - decompiler for TurboC,” http://www.debugmode.com/ dcompile/disc.htm, 2001.

[26] I. Guilfanov, “Simple type system for program reengineering,” in Proceedings of the 8th Working Conference on Reverse Engineering, ser. WCRE ’01. Los Alamitos, CA, USA: IEEE Computer Society, 2001, pp. 357–361.

[27] I. Guilfanov, “Decompilation gets real,” http://www.hexblog.com/?p=56.

[28] A. Fokin, K. Troshina, and A. Chernov, “Reconstruction of class hierarchies for decompilation of C++ programs,” in CSMR, 2010, pp. 240–243.

[29] E. N. Dolgova and A. V. Chernov, “Automatic reconstruction of data types in the decompilation problem,” Programming and Computer Software, vol. 35, no. 2, pp. 105–119, 2009.

[30] B. C. Housel, “A study of decompiling machine languages into high-level machine independent languages,” Ph.D. dissertation, Purdue University, 1973.

[31] B. C. Housel and M. H. Halstead, “A methodology for machine language decompilation,” in Proceedings of the 1974 annual conference - Volume 1, ser. ACM ’74. New York, NY, USA: ACM, 1974, pp. 254–260.

[32] C. Cifuentes and K. J. Gough, “Decompilation of binary programs,” Softw. Pract. Exper., vol. 25, no. 7, pp. 811–829, Jul. 1995.

[33] C. Cifuentes, “Interprocedural data flow decompilation,” J. Prog. Lang., vol. 4, no. 2, pp. 77–99, 1996.

[34] A. Johnstone, E. Scott, and T. Womack, “Reverse compilation for digital signal processors: A working example,” in Proceedings of the 33rd Hawaii International Conference on System Sciences - Volume 8, ser. HICSS ’00. Washington, DC, USA: IEEE Computer Society, 2000, pp. 8003–.

[35] C. Cifuentes and M. V. Emmerik, “Recovery of jump table case statements from binary code,” in 7th International Workshop on Program Comprehension, ser. IWPC ’99. IEEE Computer Society, 1999, pp. 192–199.

[36] J. Lee, T. Avgerinos, and D. Brumley, “TIE: Principled reverse engineering of types in binary programs,” in Proceedings of the 18th Annual Network and Distributed System Security Symposium, ser. NDSS ’11. The Internet Society, 2011.

[37] P. Cousot and R. Cousot, “Abstract interpretation: A unified lattice model for static analysis of programs by construction or approximation of fixpoints,” in Proceedings of the 4th ACM SIGACT-SIGPLAN symposium on Principles of programming languages, ser. POPL ’77, R. M. Graham, M. A. Harrison, and R. Sethi, Eds. New York, NY, USA: ACM, 1977, pp. 238–252.

[38] G. Balakrishnan and T. Reps, “Analyzing memory accesses in x86 executables,” in Proceedings of the 13th International Conference on Compiler Construction, 2004.

[39] G. Balakrishnan and T. W. Reps, “DIVINE: Discovering variables in executables,” in Proceedings of the 8th International Conference on Verification, Model Checking, and Abstract Interpretation, ser. VMCAI ’07, B. Cook and A. Podelski, Eds., vol. 4349. Springer, 2007, pp. 1–28.

[40] T. W. Reps and G. Balakrishnan, “Improved memory-access analysis for x86 executables,” in Proceedings of the 17th International Conference on Compiler Construction, ser. CC ’08, L. J. Hendren, Ed., vol. 4959. Springer, 2008, pp. 16–35.

[41] T. W. Reps, G. Balakrishnan, J. Lim, and T. Teitelbaum, “A next-generation platform for analyzing executables,” in Proceedings of the 3rd Asian Symposium on Programming Languages and Systems, ser. APLAS ’05, K. Yi, Ed., vol. 3780. Springer, 2005, pp. 212–229.

[42] B.-Y. E. Chang, M. Harren, and G. C. Necula, “Analysis of low-level code using cooperating decompilers,” in Proceedings of the 13th international conference on Static Analysis, ser. SAS’06. Berlin, Heidelberg: Springer-Verlag, 2006, pp. 318– 335.

[43] C. Lattner and V. Adve, “LLVM: A compilation framework for lifelong program analysis & transformation,” in Proceedings of the 2004 International Symposium on Code Generation and Optimization (CGO’04), 2004.

[44] K. Anand, M. Smithson, K. Elwazeer, A. Kotha, and R. Barua, “Decompilation to high IR in a binary rewriter,” University of Maryland, Tech. Rep., 2010.

[45] K. Anand, M. Smithson, K. Elwazeer, A. Kotha, J. Gruen, N. Giles, and R. Barua, “A compiler-level intermediate representation based binary analysis and rewriting system,” in EuroSys, 2013, pp. 295–308.

[46] K. ElWazeer, K. Anand, A. Kotha, M. Smithson, and R. Barua, “Scalable variable and data type detection in a binary rewriter,” in Proceedings of the 34th ACM SIGPLAN conference on Programming language design and implementation, ser. PLDI ’13, H.-J. Boehm and C. Flanagan, Eds. New York, NY, USA: ACM, 2013, pp. 51–60.

[47] J. P. Bowen, “From programs to object code using logic and logic programming,” in Code Generation – Concepts, Tools, Techniques, Proc. International Workshop on Code Generation, Dagstuhl. Springer-Verlag, 1992, pp. 173–192.

[48] J. Bowen, “From programs to object code and back again using logic programming: Compilation and decompilation,” Journal of Software Maintenance: Research and Practice, vol. 5, no. 4, pp. 205–234, 1993.

[49] P. T. Breuer and J. P. Bowen, “Decompilation: The enumeration of types and grammars,” ACM Trans. Program. Lang. Syst., vol. 16, no. 5, pp. 1613–1647, Sep. 1994.

[50] M. P. Ward, “Reverse engineering from assembler to formal specifications via program transformations,” in Proceedings of the 7th Working Conference on Reverse Engineering, ser. WCRE ’00. Washington, DC, USA: IEEE Computer Society, 2000, pp. 11–.

[51] M. P. Ward, “Pigs from sausages? Reengineering from assembler to C via FermaT transformations,” Science of Computer Programming Special Issue: Transformations Everywhere, vol. 52, no. 1–3, pp. 213–255, 2004.

[52] S. Katsumata and A. Ohori, “Proof-directed de-compilation of low-level code,” in Proceedings of the 10th European Symposium on Programming Languages and Systems, ser. ESOP ’01, D. Sands, Ed. London, UK: Springer-Verlag, 2001, pp. 352–366.

[53] A. Mycroft, A. Ohori, and S. Katsumata, “Comparing type-based and proof-directed decompilation,” in Proceedings of the 2001 8th Working Conference on Reverse Engineering, ser. WCRE ’01. IEEE Computer Society, 2001, pp. 362–367.

[54] M. Myreen, M. Gordon, and K. Slind, “Machine-code verification for multiple architectures - an application of decompilation into logic,” in Formal Methods in Computer-Aided Design, ser. FMCAD ’08, 2008, pp. 1–8.

[55] M. Myreen, M. Gordon, and K. Slind, “Decompilation into logic – improved,” in Formal Methods in Computer-Aided Design, ser. FMCAD ’12, 2012, pp. 78–81.

[56] N. Ramsey and M. F. Fernandez, “The New Jersey machine-code toolkit,” in Proceedings of the USENIX 1995 Technical Conference Proceedings, ser. TCON ’95. Berkeley, CA, USA: USENIX Association, 1995, pp. 24–24.

[57] C. Cifuentes, M. V. Emmerik, and N. Ramsey, “The design of a resourceable and retargetable binary translator,” in Proceedings of the 1999 6th Working Conference on Reverse Engineering, ser. WCRE ’99. IEEE Computer Society, 1999, pp. 280– 291.

[58] C. Cifuentes and M. V. Emmerik, “UQBT: Adaptable binary translation at low cost,” Computer, vol. 33, no. 3, pp. 60–66, Mar. 2000.

[59] J. Tröger and C. Cifuentes, “Analysis of virtual method invocation for binary translation,” in Proceedings of the 2002 9th Working Conference on Reverse Engineering, ser. WCRE ’02, A. van Deursen and E. Burd, Eds. IEEE Computer Society, 2002, pp. 65–.

[60] N. Ramsey and C. Cifuentes, “A transformational approach to binary translation of delayed branches,” ACM Trans. Program. Lang. Syst., vol. 25, no. 2, pp. 210–224, 2003.

[61] V. Chipounov and G. Candea, “Reverse engineering of binary device drivers with RevNIC,” in EuroSys, 2010, pp. 167–180.

[62] V. Chipounov and G. Candea, “Enabling sophisticated analyses of x86 binaries with RevGen,” in Dependable Systems and Networks Workshops (DSN-W), 2011 IEEE/IFIP 41st International Conference on, 2011, pp. 211–216.

[63] J. Caballero, N. M. Johnson, S. McCamant, and D. Song, “Binary code extraction and interface identification for security applications,” in Proceedings of the 17th Annual Network and Distributed System Security Symposium, ser. NDSS ’10. The Internet Society, 2010.

[64] B. Feigin and A. Mycroft, “Decompilation is an information-flow problem,” in Proceedings of the 4th International Workshop on Programming Language Interference and Dependence (PLID 2008), Valencia, Spain, July 2008.

[65] R. Tate, M. Stepp, Z. Tatlock, and S. Lerner, “Equality saturation: A new approach to optimization,” in Proceedings of the 36th annual ACM SIGPLAN-SIGACT symposium on Principles of Programming Languages, ser. POPL ’09, Z. Shao and B. C. Pierce, Eds. New York, NY, USA: ACM, 2009, pp. 264–276.

[66] B. Feigin, “Interpretational overhead in system software,” Ph.D. dissertation, University of Cambridge, 2011.

[67] D. Vermoen, M. Witteman, and G. N. Gaydadjiev, “Reverse engineering Java card applets using power analysis,” in Proceedings of the 1st IFIP TC6 /WG8.8 /WG11.2 international conference on Information security theory and practices: smart cards, mobile and ubiquitous computing systems, ser. WISTP’07. Berlin, Heidelberg: Springer-Verlag, 2007, pp. 138–149.

[68] T. Eisenbarth, C. Paar, and B. Weghenkel, “Building a side channel based disassembler,” Transactions on Computational Science, vol. 10, pp. 78–99, 2010.

[69] M. D. Preda and R. Giacobazzi, “Semantic-based code obfuscation by abstract interpretation,” in ICALP, 2005, pp. 1325–1336.

[70] M. V. Emmerik and T. Waddington, “Using a decompiler for real-world source recovery,” in Proceedings of the 11th Working Conference on Reverse Engineering, ser. WCRE ’04. IEEE Computer Society, November 2004, pp. 27–36.

[71] R. P. Díaz, “Status report: Software reusability,” IEEE Software, vol. 10, no. 3, pp. 61–66, 1993.

[72] J. C. Knight and M. F. Dunn, “Software quality through domain-driven certification,” Ann. Softw. Eng., vol. 5, pp. 293–315, Jan. 1998.

[73] T. Ravichandran and M. A. Rothenberger, “Software reuse strategies and component markets,” Commun. ACM, vol. 46, no. 8, pp. 109–114, Aug. 2003.

[74] B. Meyer, Object-Oriented Software Construction, 2nd Edition. Prentice-Hall, 1997.

[75] B. Jalender, A. Govardhan, and P. Premchand, “A pragmatic approach to software reuse,” J. Theoretical and Applied Information Technology, vol. 14, no. 2, pp. 87–96, 2005.

[76] M. E. Fayad and D. C. Schmidt, “Lessons learned building reusable OO frameworks for distributed software,” Commun. ACM, vol. 40, no. 10, pp. 85–87, Oct. 1997.

[77] D. C. Schmidt, “Why software reuse had failed and how to make it work for you,” 2006. [Online]. Available: http://www.cs.westl.edu/∼schmidt/reuse-lessons.html [Accessed: 4/3/2013]

[78] S. Haefliger, G. von Krogh, and S. Spaeth, “Code reuse in open source software,” Management Science, vol. 54, no. 1, pp. 180–193, January 2008.

[79] J. S. Poulin, “Populating software repositories: Incentives and domain-specific software,” J. Syst. Softw., vol. 30, no. 3, pp. 187–199, Sep. 1995.

[80] A. Mockus, “Large-scale code reuse in open source software,” in First International Workshop on Emerging Trends in FLOSS Research and Development, ser. FLOSS ’07, May 2007, p. 7.

[81] M. A. Cusumano, Japan’s software factories: A challenge to U.S. management. New York, NY, USA: Oxford University Press, Inc., 1991.

[82] GNU Foundation, “GNU Public License version 2,” http://www.gnu.org/licenses/gpl-2.0.html, June 1991.

[83] Software Freedom Law Center, “Software Freedom Law Center,” http://www. softwarefreedom.org.

[84] Free Software Foundation, Inc., “Free software foundation,” http://www.fsf.org.

[85] H. Welte, “Current Developments in GPL Compliance,” http://taipei.freedomhec.org/dlfile/gpl compliance.pdf, June 2012, FreedomHEC Taipei 2012 conference.

[86] B. Kuhn, “GPL enforcement: Don’t jump to conclusions, but do report violations,” http://ebb.org/bkuhn/blog/2009/11/08/gpl-enforcement.html, 11 2009.

[87] “Progress Software Corporation v MySQL AB,” February 2002, 195 F. Supp. 2d 238, D. Mass. [Online]. Available: http://www.leagle.com/xmlResult.aspx? xmldoc=2002523195FSupp2d238 1505.xml [Accessed: 27/02/2013]

[88] H. Welte, “Allnet GmbH resolves iptables GPL violation,” February 2004. [Online]. Available: http://lwn.net/Articles/71418 [Accessed: 28/2/2013]

[89] H. Welte, “Netfilter/iptables project concludes agreement on GPL licensing with Gigabyte,” October 2004. [Online]. Available: http://www.gpl-violations.org/ news/20041022-iptables-gigabyte.html [Accessed: 27/2/2013]

[90] “Welte v. Sitecom Deutschland GmbH,” March 2004, District Court of Munich, case 21 O 6123/04. [Online]. Available: http://www.jbb.de/fileadmin/download/ urteil lg muenchen gpl.pdf [Accessed: 28/2/2013]

[91] “Welte v. Sitecom Deutschland GmbH,” March 2004, District Court of Munich, case 21 O 6123/04 (English translation). [Online]. Available: http://www.jbb.de/ fileadmin/download/judgement dc munich gpl.pdf [Accessed: 28/2/2013]

[92] H. Welte, “Gpl-violations.org project was granted a preliminary injunction against Fortinet UK Ltd.” April 2005. [Online]. Available: http://www.gpl-violations.org/ news/20050414-fortinet-injunction.html [Accessed: 28/2/2013]

[93] H. Welte, “Gpl-violations.org project prevails in court case on GPL violation by D-Link,” September 2006. [Online]. Available: http://www.gpl-violations.org/ news/20060922-dlink-judgement frankfurt.html [Accessed: 28/2/2013]

[94] “Welte v. D-Link,” September 2006, 2-6 0 224/06, Landgericht Frankfurt AM Main. [Online]. Available: http://www.jjb.de/urteil lg frankfurt gpl.pdf [Accessed: 28/2/2013]

[95] Software Freedom Law Center, Inc., “BusyBox Developers and Monsoon Multimedia agree to dismiss GPL lawsuit,” October 2007. [Online]. Available: http://www. softwarefreedom.org/news/2007/oct/30/busybox-monsoon-settlement/ [Accessed: 28/2/2013]

[96] Software Freedom Law Center, Inc., “BusyBox Developers and High-Gain Antennas agree to dismiss GPL lawsuit,” March 2008. [Online]. Available: http://www.softwarefreedom.org/news/2008/mar/06/busybox-hga/ [Accessed: 28/2/2013]

[97] Software Freedom Law Center, Inc., “BusyBox Developers and Xterasys Corporation agree to dismiss GPL lawsuit,” December 2007. [Online]. Available: http://www.softwarefreedom.org/news/2007/dec/17/busybox-xterasys-settlement/ [Accessed: 28/2/2013]

[98] Software Freedom Law Center, Inc., “BusyBox Developers agree to end GPL lawsuit against Verizon,” March 2008. [Online]. Available: http://www. softwarefreedom.org/news/2008/mar/17/busybox-verizon/ [Accessed: 28/2/2013]

[99] Software Freedom Law Center, Inc., “BusyBox developers and Supermicro agree to end GPL lawsuit,” July 2008. [Online]. Available: http://www.softwarefreedom. org/news/2008/jul/23/busybox-supermicro/ [Accessed: 28/2/2013]

[100] “Andersen v. Bell Microproducts, Inc.” October 2008, No. 08-CV-5270, Doc. No. 16, S.D.N.Y., notice of voluntary dismissal.

[101] “Andersen v. Bell Microproducts, Inc.” June 2008, Civil Action No. 08-CV-5270. [Online]. Available: http://www.softwarefreedom.org/news/2008/jun/10/busybox/ bell-complaint.pdf [Accessed: 28/2/2013]

[102] “Erik Andersen and Rob Landlet v. Super Micro Computer, Inc.” June 2008, Civil Action No. 08-CV-5269. [Online]. Available: http://www.softwarefreedom. org/news/2008/jun/10/busybox/supermicro-complaint.pdf [Accessed: 28/2/2013]

[103] “Welte v. Skype Technologies S.A.” July 2007, District Court of Munich, case 7 O 5245/07. [Online]. Available: http://www.ifross.de/Fremdartikel/ LGMuenchenUrteil.pdf [Accessed: 28/2/2013]

[104] “Free Software Foundation, Inc. v. Cisco Systems, Inc.” December 2008, Civil Action No. 08-CV-10764. [Online]. Available: http://www.fsf.org/licensing/ complaint-2008-12-11.pdf [Accessed: 28/2/2013]

[105] B. Smith, “FSF settles suit against Cisco,” May 2009. [Online]. Available: http://www.fsf.org/news/2009-05-cisco-settlement [Accessed: 28/2/2013]

[106] J. Klinger, “Jin vs. IChessU: The copyright infringement case (settled),” October 2008. [Online]. Available: http://www.jinchess.com/ichessu/ [Accessed: 28/2/2013]

[107] “Software Freedom Conservancy, Inc. and Erik Andersen v. Best Buy Co., Inc. et al.” December 2009, Civil Action No. 09-CV-10155. [Online]. Available: http://www. softwarefreedom.org/resources/2009/busybox-compliant-2009-12-14.pdf [Accessed: 27/2/2013]

[108] “Software Freedom Conservancy, Inc. and Erik Andersen v. Best Buy Co., Inc. et al.” http://www.softwarefreedom.org/resources/2010/2010-06-03-Motion against Westinghouse.pdf, June 2010, No. 09-CV-10155, Motion for default judgment against Westinghouse Digital Electronics LLC. and associated documents.

[109] M. von Willebrand, “A look at EDU 4 v. AFPA, also known as the “Paris GPL case”,” IFOSS L. Rev., vol. 1, no. 2, pp. 123–126, 2009.

[110] “AVM Computersysteme Vertriebs GmbH v. Bybits AG,” November 2011, 16 O 255/10, Landgericht Berlin.

[111] P. Galli, “Free Software Foundation targets RTLinux for GPL violations,” September 2011. [Online]. Available: http://www.eweek.com/c/a/Application-Development/Free-Software-Foundation-Targets-RTLinux-for-GPL-violations/ [Accessed: 28/2/2013]

[112] T. Gasperson, “FSF: Lindows OS moving toward GPL compliance,” June 2002. [Online]. Available: http://archive09.linux.com/articles/23277 [Accessed: 28/2/2013]

[113] S. Shankland, “GPL gains clout in German legal case,” April 2004. [Online]. Available: http://news.cnet.com/2100-7344-5198117.html [Accessed: 1/3/2013]

[114] H. Welte, “13 Companies at CeBIT receive warning letter regarding their alleged GPL incompliance,” March 2005. [Online]. Available: http://gpl-violations.org/news/20050314-cebit-letter-action.html [Accessed: 1/3/2013]

[115] R. Rohde, “Rohde v. Viasat on GPL/LGPL,” 2009. [Online]. Available: http://duff.dk/viasat [Accessed: 28/2/2013]

[116] E. Sandulenko, “GPL, ScummVM and violations,” June 2009. [Online]. Available: http://sev-notes.blogspot.co.uk/2009/06/gpl-scummvm-and-violations. html [Accessed: 28/2/2013]

[117] S. Olsen, “Cadence, Avast settle trade-secret suit,” November 2002. [Online]. Available: http://news.cnet.com/Cadence,-Avast-settle-trade-secret-suit/2100-1025 3-965890.html [Accessed: 1/3/2013]

[118] W. Curtis, “Green Hills Software collects $4.25 million from Microtec Research to settle trade secret lawsuit,” July 1993. [Online]. Available: http://www.ghs.com/news/archive/01jul93.html [Accessed: 1/3/2013]

[119] “Cisco Systems v. Huawei Technologies,” January 2003, Civil Action No. 2-03C- 027, Eastern District of Texas. [Online]. Available: http://miltest.wikispaces.com/ file/view/Cisco Huawei Complaint.pdf [Accessed: 1/3/2013]

[120] J. Leyden, “Cisco drops Huawei lawsuit,” July 2004. [Online]. Available: http: //www.theregister.co.uk/2004/07/29/cisco huawei case ends [Accessed: 1/3/2013]

[121] “Compuware Corp. v. International Business Machines Corp.” December 2002, Case 2:02-CV-70906-GCS, Eastern District of Michigan. [Online]. Available: http://miltest.wikispaces.com/file/view/Compuware IBM Complaint.pdf [Accessed: 1/3/2013]

[122] S. Cowley and S. Larson, “IBM, Compuware reach $400M settlement,” March 2005. [Online]. Available: http://www.computerworld.com/s/article/100622/ IBM Compuware Reach 400M Settlement?taxonomyId=070 [Accessed: 1/3/2013]

[123] J. N. Hoover, “Symantec, Microsoft settle Windows suit,” April 2008. [Online]. Available: http://www.informationweek.co.uk/windows/operating-systems/symantec-microsoft-settle-windows-suit/207001547 [Accessed: 1/3/2013]

[124] “Veritas Operating Corp. v. Microsoft Corp.” May 2006, CV06-0703, Western District of Washington. [Online]. Available: http://miltest.wikispaces.com/file/view/Veritas MSFT Complaint.pdf [Accessed: 1/3/2013]

[125] S. Cowley, “Quest pays CA $16 million to settle lawsuit,” March 2005. [Online]. Available: http://www.infoworld.com/t/business/quest-pays-ca-16-million-settle-lawsuit-299 [Accessed: 1/2/2013]

[126] “CA v. Rocket Software,” September 2008, Case 1:07-CV-01476-ADS-MLO, Eastern District of New York. [Online]. Available: http://miltest.wikispaces.com/file/view/CA Rocket ruling.pdf [Accessed: 1/3/2013]

[127] R. Weisman, “CA and Rocket Software reach settlement,” February 2009. [Online]. Available: http://www.boston.com/business/ticker/2009/02/ca and rocket s.html [Accessed: 1/3/2013]

[128] “McRoberts Software, Inc. v. Media 100, Inc.” May 2003. [Online]. Available: http://miltest.wikispaces.com/file/view/McRoberts Media100 7thCit Op.pdf [Accessed: 1/3/2013]

[129] K. Finley, “Github has passed SourceForge and Google Code in popularity,” http: //readwrite.com/2011/06/02/github-has-passed-sourceforge, June 2011.

[130] J. Finkle, “Perks and paintball: Life inside a global cybercrime ring,” http://www.pcpro.co.uk/features/356737/perks-and-paintball-life-inside-a-global-cybercrime-ring, 2010.

[131] J. A. Kaplan, “Software theft a problem for actual thieves,” http://www.foxnews. com/tech/2010/10/01/software-theft-problem-zeus-botnet/, 2010.

[132] D. Fisher, “Zeus source code leaked,” http://threatpost.com/en us/blogs/ zeus-source-code-leaked-051011, 2011.

[133] D. Goodin, “Source code leaked for pricey ZeuS crimeware kit,” http://www. theregister.co.uk/2011/05/10/zeus crimeware kit leaked/, 2011.

[134] S. Fewer, “Retrieving Kernel32’s base address,” http://blog.harmonysecurity.com/2009 06 01 archive.html, June 2009.

[135] F. Och, “Statistical machine translation live,” http://googleresearch.blogspot.co. uk/2006/04/statistical-machine-translation-live.html, 2006.

[136] W. Weaver, “Translation (1949),” in Machine Translation of Languages. MIT Press, 1955.

[137] P. F. Brown, J. Cocke, S. A. D. Pietra, V. J. D. Pietra, F. Jelinek, J. D. Lafferty, R. L. Mercer, and P. S. Roossin, “A statistical approach to machine translation,” Comput. Linguist., vol. 16, no. 2, pp. 79–85, Jun. 1990.

[138] A. Lopez, “Statistical machine translation,” ACM Comput. Surv., vol. 40, no. 3, pp. 8:1–8:49, August 2008.

[139] M. R. Costa-Jussà, J. M. Crego, D. Vilar, J. A. R. Fonollosa, J. B. Mariño, and H. Ney, “Analysis and system combination of phrase- and n-gram-based statistical machine translation systems,” in HLT-NAACL (Short Papers), 2007, pp. 137–140.

[140] Y. Al-onaizan, J. Curin, M. Jahr, K. Knight, J. Lafferty, D. Melamed, F.-J. Och, D. Purdy, N. A. Smith, and D. Yarowsky, “Statistical machine translation,” Final Report, JHU Summer Workshop, Tech. Rep., 1999.

[141] D. Oard and F. Och, “Rapid-response machine translation for unexpected languages,” in MT Summit IX, 2003.

[142] P. Koehn, “Challenges in statistical machine translation,” http://homepages.inf.ed. ac.uk/pkoehn/publications/challenges2004.pdf, 2004, talk given at PARC, Google, ISI, MITRE, BBN, Univ. of Montreal.

[143] A. Way and H. Hassan, “Statistical machine translation: Trend and challenges,” in Proceedings of the 2nd International Conference on Arabic Language Resources and Tools, April 2009, keynote address.

[144] M. Harman, S. A. Mansouri, and Y. Zhang, “Search-based software engineering: Trends, techniques and applications,” ACM Comput. Surv., vol. 45, no. 1, pp. 11:1– 11:61, Dec. 2012.

[145] M. Harman, S. A. Mansouri, and Y. Zhang, “Search-based software engineering: Trends, techniques and applications,” King’s College London, Department of Computer Science, London, United Kingdom, Tech. Rep. TR-09-03, 2009.

[146] J. Davies, D. M. German, M. W. Godfrey, and A. Hindle, “Software Bertillonage: Finding the provenance of an entity,” in Proceedings of the 8th Working Conference on Mining Software Repositories, ser. MSR ’11. ACM, 2011, pp. 183–192.

[147] N. E. Rosenblum, X. Zhu, and B. P. Miller, “Who wrote this code? Identifying the authors of program binaries,” in Proceedings of the 16th European Symposium on Research in Computer Security, ser. ESORICS ’11, V. Atluri and C. Díaz, Eds. Springer, 2011, pp. 172–189.

[148] N. E. Rosenblum, B. P. Miller, and X. Zhu, “Recovering the toolchain provenance of binary code,” in Proceedings of the 20th International Symposium on Software Testing and Analysis, ser. ISSTA ’11, M. B. Dwyer and F. Tip, Eds., 2011, pp. 100–110.

[149] Z. Wang, K. Pierce, and S. McFarling, “BMAT - a binary matching tool for stale profile propagation,” J. Instruction-Level Parallelism, vol. 2, 2000.

[150] T. Dullien and R. Rolles, “Graph-based comparison of executable objects,” in Proceedings of Symposium sur la sécurité des technologies de l’information et des communications (SSTIC’05), 2005.

[151] A. Sæbjørnsen, J. Willcock, T. Panas, D. Quinlan, and Z. Su, “Detecting code clones in binary executables,” in Proceedings of the 18th International Symposium on Software Testing and Analysis, ser. ISSTA ’09, G. Rothermel and L. K. Dillon, Eds. ACM, 2009, pp. 117–128.

[152] A. Hemel, K. T. Kalleberg, R. Vermaas, and E. Dolstra, “Finding violations through binary code clone detection,” in Proceedings of the 8th International Working Conference on Mining Software Repositories, ser. MSR ’11, A. van Deursen, T. Xie, and T. Zimmermann, Eds. IEEE, 2011, pp. 63–72.

[153] Ohloh, “The open source network,” http://www.ohloh.net/.

[154] M. Mock, “Dynamic analysis from the bottom up,” in Proc. 1st ICSE Int. Workshop on Dynamic Analysis (WODA). IEEE Computer Society, 2003, pp. 13–16.

[155] A. Moser, C. Kruegel, and E. Kirda, “Limits of static analysis for malware detection,” in Proceedings of the 21st Annual Computer Security Applications Conference, ser. ACSAC ’07. IEEE Computer Society, Dec 2007, pp. 421–430.

[156] G. Myles and C. Collberg, “K-gram based software birthmarks,” in Proceedings of the 2005 ACM Symposium on Applied Computing, ser. SAC ’05. ACM, 2005, pp. 314–318.

[157] M. E. Karim, A. Walenstein, A. Lakhotia, and L. Parida, “Malware phylogeny generation using permutations of code,” Journal in Computer Virology, vol. 1, no. 1-2, pp. 13–23, 2005.

[158] C. Krügel, E. Kirda, D. Mutz, W. K. Robertson, and G. Vigna, “Polymorphic worm detection using structural information of executables,” in Proceedings of the 8th International Conference on Recent advances in intrusion detection, ser. RAID’05, 2005, pp. 207–226.

[159] B. Buck and J. K. Hollingsworth, “An API for runtime code patching,” Int. J. High Perform. Comput. Appl., vol. 14, no. 4, pp. 317–329, Nov. 2000.

[160] B. D. McKay, “Practical graph isomorphism,” Congressus Numerantium, vol. 30, pp. 45–87, 1981.

[161] Apache Software Foundation, “Similarity (Lucene 3.6.2 API),” http://lucene. apache.org/core/3 6 2/api/core/org/apache/lucene/search/Similarity.html.

[162] B. H. Bloom, “Space/time trade-offs in hash coding with allowable errors,” Commun. ACM, vol. 13, no. 7, pp. 422–426, Jul. 1970.

[163] S. F. Siegel, A. Mironova, G. S. Avrunin, and L. A. Clarke, “Combining symbolic execution with model checking to verify parallel numerical programs,” ACM Trans. Softw. Eng. Methodol., vol. 17, no. 2, pp. 10:1–10:34, May 2008.

[164] C. Liu, C. Chen, J. Han, and P. S. Yu, “GPLAG: Detection of software plagiarism by program dependence graph analysis,” in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’06. ACM, 2006, pp. 872–881.

[165] C. Gautier, “Software plagiarism detection with PEGs,” Master’s thesis, University of Cambridge, 2011.

[166] L. Jiang, G. Misherghi, Z. Su, and S. Glondu, “DECKARD: Scalable and accurate tree-based detection of code clones,” in Proceedings of the 29th International Conference on Software Engineering, ser. ICSE ’07. IEEE Computer Society, 2007, pp. 96–105.

[167] T. Kamiya, S. Kusumoto, and K. Inoue, “CCFinder: A multilinguistic token-based code clone detection system for large scale source code,” Software Engineering, IEEE Transactions on, vol. 28, no. 7, pp. 654 – 670, jul 2002.

[168] Z. Li, S. Lu, S. Myagmar, and Y. Zhou, “CP-Miner: A tool for finding copy-paste and related bugs in operating system code,” in Proceedings of the 6th USENIX Symposium on Operating Systems Design and Implementation, ser. OSDI ’04. USENIX Association, 2004, pp. 20–20.

[169] S. Kawaguchi, P. K. Garg, M. Matsushita, and K. Inoue, “MUDABlue: An automatic categorization system for open source repositories,” J. Syst. Softw., vol. 79, no. 7, pp. 939–953, Jul. 2006.

[170] C. McMillan, M. Grechanik, and D. Poshyvanyk, “Detecting similar software applications,” in Proceedings of the 2012 International Conference on Software Engineering, ser. ICSE ’12, M. Glinz, G. C. Murphy, and M. Pezzè, Eds. Piscataway, NJ, USA: IEEE, 2012, pp. 364–374.

[171] Y. Dang, D. Zhang, S. Ge, C. Chu, Y. Qiu, and T. Xie, “XIAO: Tuning code clones at hands of engineers in practice,” in Proceedings of the 28th Annual Computer Security Applications Conference, ser. ACSAC ’12, R. H. Zakon, Ed. New York, NY, USA: ACM, 2012, pp. 369–378.

[172] W.-J. Li, K. Wang, S. Stolfo, and B. Herzog, “Fileprints: Identifying file types by n-gram analysis,” in Information Assurance Workshop, 2005. IAW ’05. Proceedings from the Sixth Annual IEEE SMC, june 2005, pp. 64 – 71.

[173] D. Bilar, “Opcodes as predictor for malware,” Int. J. Electron. Secur. Digit. Forensic, vol. 1, no. 2, pp. 156–168, Jan. 2007.

[174] U. Bayer, P. M. Comparetti, C. Hlauschek, C. Krügel, and E. Kirda, “Scalable, behavior-based malware clustering,” in Proceedings of the 16th Annual Network and Distributed System Security Symposium, ser. NDSS ’09. The Internet Society, 2009.

[175] D. Gao, M. K. Reiter, and D. X. Song, “BinHunt: Automatically finding semantic differences in binary programs,” in ICICS, 2008, pp. 238–255.

[176] L. Jiang and Z. Su, “Automatic mining of functionally equivalent code fragments via random testing,” in Proceedings of the 18th International Symposium on Software Testing and Analysis, ser. ISSTA ’09, G. Rothermel and L. K. Dillon, Eds. ACM, 2009, pp. 81–92.

[177] C. Cadar, D. Dunbar, and D. R. Engler, “KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs,” in Proceedings of the 8th USENIX Symposium on Operating Systems Design and Implementation, ser. OSDI ’08, R. Draves and R. van Renesse, Eds. USENIX Association, 2008, pp. 209–224.

[178] J. Burnim and K. Sen, “Heuristics for scalable dynamic test generation,” in ASE, 2008, pp. 443–446.

[179] D. Kroening, E. Clarke, and K. Yorav, “Behavioral consistency of C and Verilog programs using bounded model checking,” in Proceedings of DAC 2003. ACM Press, 2003, pp. 368–371.

[180] F. Nielson, H. R. Nielson, and C. Hankin, Principles of program analysis. Springer, 1999.

[181] E. Biham and A. Shamir, “Differential fault analysis: Identifying the structure of unknown ciphers sealed in tamper-proof devices,” preprint, 1996.

[182] R. J. Anderson and M. G. Kuhn, “Low cost attacks on tamper resistant devices,” in Proceedings of the 5th International Workshop on Security Protocols. London, UK, UK: Springer-Verlag, 1998, pp. 125–136.

[183] “Bochs: The open source IA-32 emulation project,” http://bochs.sourceforge.net.

[184] T. Jim, J. G. Morrisett, D. Grossman, M. W. Hicks, J. Cheney, and Y. Wang, “Cyclone: A safe dialect of C,” in Proceedings of the General Track of the 2002 USENIX Annual Technical Conference, ser. ATEC ’02, C. S. Ellis, Ed. Berkeley, CA, USA: USENIX Association, 2002, pp. 275–288.

[185] S. Nagarakatte, J. Zhao, M. M. Martin, and S. Zdancewic, “SoftBound: Highly compatible and complete spatial memory safety for C,” in Proceedings of the 2009 ACM SIGPLAN conference on Programming language design and implementation, ser. PLDI ’09, M. Hind and A. Diwan, Eds. New York, NY, USA: ACM, 2009, pp. 245–258.

[186] R. Anderson, Security Engineering, Second Edition. Wiley, 2008.

[187] G. Myles and C. S. Collberg, “Detecting software theft via whole program path birthmarks,” in Information Security, 7th International Conference, ISC 2004, Palo Alto, CA, USA, September 27-29, 2004, Proceedings, ser. Lecture Notes in Computer Science, K. Zhang and Y. Zheng, Eds., vol. 3225. Springer, 2004, pp. 404–415.

[188] zynamics, “BinDiff,” http://www.zynamics.com/bindiff.html.

[189] Z. Lin, X. Zhang, and D. Xu, “Automatic reverse engineering of data structures from binary execution,” in Proceedings of the 17th Annual Network and Distributed System Security Symposium, ser. NDSS ’10. The Internet Society, 2010.

[190] A. Slowinska, T. Stancescu, and H. Bos, “Howard: A dynamic excavator for reverse engineering data structures,” in Proceedings of the 18th Annual Network and Distributed System Security Symposium, ser. NDSS ’11. The Internet Society, 2011.

[191] E. R. Jacobson, A. R. Bernat, W. R. Williams, and B. P. Miller, “Detecting code reuse attacks with a model of conformant program execution (poster),” in Proceedings of 16th International Symposium on Research in Attacks, Intrusions and Defenses, ser. RAID’13, October 2013.

[192] A. van de Ven, “Limiting buffer overflows with ExecShield,” http://www.redhat.com/magazine/009jul05/features/execshield/, 2005.

[193] M. Abadi, M. Budiu, U. Erlingsson, and J. Ligatti, “Control-flow integrity principles, implementations, and applications,” ACM Trans. Inf. Syst. Secur., vol. 13, no. 1, pp. 4:1–4:40, Nov. 2009.

[194] P. Godefroid, N. Klarlund, and K. Sen, “DART: Directed automated random testing,” in Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation, ser. PLDI ’05, V. Sarkar and M. W. Hall, Eds. New York, NY, USA: ACM, 2005, pp. 213–223.

[195] W. Drewry and T. Ormandy, “Flayer: Exposing application internals,” in Proceedings of the first USENIX workshop on Offensive Technologies, ser. WOOT ’07. Berkeley, CA, USA: USENIX Association, 2007, pp. 1:1–1:9.

[196] ISO, ISO/IEC 9899-1999: Programming Languages — C. International Organization for Standardization, Dec 1999.

[197] D. Qi, W. N. Sumner, F. Qin, M. Zheng, X. Zhang, and A. Roychoudhury, “Modeling software execution environment,” in Proceedings of the 2012 19th Working Conference on Reverse Engineering, ser. WCRE ’12. Washington, DC, USA: IEEE Computer Society, 2012, pp. 415–424.

[198] J. Zhang, R. Zhao, and J. Pang, “Parameter and return-value analysis of binary executables,” in 31st Annual International Computer Software and Applications Conference, ser. COMPSAC ’07, vol. 1. IEEE Computer Society, 2007, pp. 501–508.

[199] L. Szekeres, M. Payer, T. Wei, and D. Song, “SoK: Eternal war in memory,” in Proceedings of the 2013 IEEE Symposium on Security and Privacy, ser. SP ’13. IEEE Computer Society, 2013, pp. 48–62.

[200] B. P. Miller, L. Fredriksen, and B. So, “An empirical study of the reliability of UNIX utilities,” Commun. ACM, vol. 33, no. 12, pp. 32–44, Dec. 1990.

[201] J. D. DeMott, “Enhancing automated fault discovery and analysis,” Ph.D. dissertation, Michigan State University, 2012.

[202] W. Visser, C. S. Păsăreanu, and S. Khurshid, “Test input generation with Java PathFinder,” in Proceedings of the 2004 ACM SIGSOFT International Symposium on Software Testing and Analysis, ser. ISSTA ’04, G. S. Avrunin and G. Rothermel, Eds. New York, NY, USA: ACM, 2004, pp. 97–107.

[203] K. Sen, D. Marinov, and G. Agha, “CUTE: A concolic unit testing engine for C,” in Proceedings of the 10th European software engineering conference held jointly with 13th ACM SIGSOFT international symposium on Foundations of software engineering, ser. ESEC/FSE-13. New York, NY, USA: ACM, 2005, pp. 263–272.

[204] T. Wang, T. Wei, G. Gu, and W. Zou, “TaintScope: A checksum-aware directed fuzzing tool for automatic software vulnerability detection,” in Proceedings of the 2010 IEEE Symposium on Security and Privacy, ser. SP ’10. Washington, DC, USA: IEEE Computer Society, 2010, pp. 497–512.