Automatic Identification of Architecture and Endianness Using Binary File Contents
Total Page:16
File Type:pdf, Size:1020Kb
Sami Kairajärvi Automatic identification of architecture and endianness using binary file contents in Information Technology Master’s Thesis May 14, 2019 University of Jyväskylä Faculty of Information Technology Author: Sami Kairajärvi Contact information: [email protected] Supervisor: Andrei Costin Title: Automatic identification of architecture and endianness using binary file contents Työn nimi: Arkkitehtuurin ja tavujärjestyksen automaattinen tunnistaminen binääritiedoston sisällön avulla Project: Master’s Thesis Study line: Cyber security Page count: 71+5 Abstract: This thesis explores how architecture and endianness of executable code can be identified using binary file contents, as falsely identifying the architecture caused about 10% of failures of firmware analysis in a recent study (Costin et al. 2014). A literature review was performed to identify the current state-of-the-art methods and how they could be improved in terms of algorithms, performance, data sets, and support tools. The thorough review identified methods presented by Clemens (2015) and De Nicolao et al. (2018) as the state-of-the-art and found that they had good results. However, these methods were found lacking essential tools to acquire or build the data sets as well as requiring more comprehensive comparison of classifier performance on full binaries. An experimental evaluation was performed to test classifier performance on different situations. For example, when training and testing classifiers with only code sections from executable files, all the classifiers performed equally well achieving over 98% accuracy. On samples with very small code sections 3-nearest neighbors and SVM had the best performance achieving 90% accuracy at 128 bytes. At the same time, random forest classifier performed the best classifying full binaries when trained with code sections at 90% accuracy and 99.2% when trained using full binaries. Keywords: Firmware Analysis, Supervised Machine Learning, Classification, Binary Code Suomenkielinen tiivistelmä: Tässä tutkielmassa tutkitaan kuinka suoritettavan ohjelman arkkitehtuurin ja tavujärjestyksen voi tunnistaa binääritiedoston sisällön avulla. Tätä tietoa tarvittiin esimerkiksi siksi, että arkkitehtuurin väärä tunnistus aiheutti 10% laiteohjelmistoanalyysin epäonnistumisista viimeaikaisessa tutkimuksessa (Costin et al. 2014). Tutkielmassa tehtiin kirjallisuuskatsaus, jossa etsittiin tämänhetkiset parhaat metodit sekä i kuinka niitä voisi parantaa. Clemens (2015) ja De Nicolao et al. (2018) tunnistettiin johtavina tutkimuksina, ja niissä esitetyt menetelmät antoivat hyviä luokittelutuloksia. Puutteita löytyi aineistonkeräyksestä sekä luokittelijoiden laajemmasta vertailusta kokonaisten binäärien luokittelussa. Tutkielman kokeellisessa osiossa luokittelijoiden suorituskykyä mitattiin erilaisissa tilanteissa. Luokittelijoita esimerkiksi opetettiin ja testattiin vain suoritettavien ohjelmien koodiosioilla, jolloin kaikki luokittelijat saavuttivat vähintään 98% tarkkuuden. 3-lähimmän naapurin menetelmä sekä SVM luokittelija suoriutuivat parhaiten pienten näytteiden luokittelussa saavuttaen 90% tarkkuuden jo 128 tavun kokoisten tiedostojen kohdalla. Satunnaismetsä luokittelijalla oli paras suorituskyky kokonaisten binääritiedostojen luokittelussa 90% tarkkuudella, kun luokittelija opetettiin binääritiedoston koodiosilla, ja 99.2%, kun luokittelija opetettiin kokonaisilla binääritiedostoilla. Avainsanat: Laiteohjelmiston analysointi, valvottu koneoppiminen, luokittelu, binäärikoodi ii List of Figures Figure 1. Example of an ELF file structure .........................................8 Figure 2. Overview of steps in supervised learning (Kotsiantis, I. Zaharakis, and P. Pintelas 2007) ...........................................................18 Figure 3. Workflow of the tool to download data sets .............................19 Figure 4. Bayesian network representation (Dua and Du 2016) .................26 Figure 5. Example of a kNN where k=5 (Dua and Du 2016) .....................27 Figure 6. Example of a decision tree (Dua and Du 2016) .........................29 Figure 7. Example of a logistic regression .........................................30 Figure 8. Example of a multilayer perceptron. It is composed of input neurons X1::XN, N number of neurons H in m hidden layers, and output neurons Y1::YN (Kononenko and Kukar 2007) .................................31 Figure 9. Example of SVM classifier (Dua and Du 2016)..........................32 Figure 10. Steps to choose a research methodology (University of Jyväskylä 2010).............................................................................34 Figure 11. Classifier and dataflow setup used in Azure Machine Learning .....38 Figure 12. Coefficients in logistic regression when using the additional powerpcspe signatures. Class 13 is powerpc while class 14 is powerpcspe. .....................................................................42 Figure 13. Impact of the sample size on classifier performance .................44 Figure 14. Confusion matrix of random forest classification results on full binaries. Columns show the predicted class frequencies while rows show the true class frequencies. ..............................................47 Figure 15. Output JSON from FIRMADYNE scraper ..............................65 Figure 16. Output JSON from Jigdo downloader ...................................66 Figure 17. Output JSON from Debian extractor ...................................66 Figure 18. (Partial) Output JSON from Debian unpacker .........................67 Figure 19. Output JSON from binary extractor ....................................68 List of Tables Table 1. A confusion matrix for binary classification (Sokolova, Japkowicz, and Szpakowicz 2006) ..........................................................21 Table 2. Cumulative data set statistics..............................................36 Table 3. Data set statistics ...........................................................36 Table 4. Classifiers and settings used by Clemens (2015) ........................37 Table 5. F1 score with random forest trained and tested with full binaries in Weka with and without powerpcspe signatures ..............................40 Table 6. Confusion matrix of powerpc and powerpcspe architectures without the additional signatures .......................................................41 Table 7. Confusion matrix of powerpc and powerpcspe architectures with the additional signatures .......................................................41 iii Table 8. Classifier performance with different feature sets, also compared with some metrics presented by Clemens (2015) ............................43 Table 9. Logistic regression 10-fold cross validation performance on different platforms ..............................................................45 Table 10. Classifier performance on full binaries ..................................46 Table 11. Accuracy, precision and recall for random forest and decision jungle classifiers trained in Azure Machine Learning platform. Performance results from random forest implemented in Weka added for comparison. .................................................................48 Table 12. Classifier performance on full binaries trained with full binaries. Random forest and random jungle missing AUC and F1 measures as Azure platform does not provide them. Accuracies achieved in classifying full binaries with classifiers trained with code sections added for comparison. ..........................................................48 iv Contents 1 INTRODUCTION ................................................................1 2 OVERVIEW OF SECURITY-ORIENTED FIRMWARE ANALYSIS ............4 2.1 Unpacking .................................................................4 2.2 Identification and classification..........................................5 2.3 Static analysis .............................................................6 2.4 Architecture identification ...............................................7 2.5 Emulation ..................................................................11 2.6 Dynamic analysis..........................................................12 2.7 Domain-specific analysis .................................................13 3 MACHINE LEARNING .........................................................15 3.1 Overview of artificial intelligence and its subfields ....................15 3.2 Unsupervised learning ...................................................16 3.3 Supervised learning ......................................................16 3.3.1Identification of required data .....................................18 3.3.2Data pre-processing and feature selection .......................19 3.3.3Algorithm selection .................................................20 3.3.4Evaluation ...........................................................21 3.3.5Evaluation metrics ..................................................21 3.3.6Cross-validation .....................................................23 3.4 Reinforcement learning ..................................................24 3.5 Deep learning .............................................................24 3.6 Classifiers..................................................................25 3.6.1Bayesian classifiers .................................................25 3.6.2Lazy classifiers ......................................................27 3.6.3Trees .................................................................28