SECURWARE 2011 : The Fifth International Conference on Emerging Security Information, Systems and Technologies Proposal of n-gram Based Algorithm for Malware Classification Abdurrahman Pektaş Mehmet Eriş Tankut Acarman The Scientific and Technological The Scientific and Technological Computer Engineering Dept. Research Council of Turkey Research Council of Turkey Galatasaray University National Research Institute of National Research Institute of Istanbul, Turkey Electronics and Cryptology Electronics and Cryptology e-mail:
[email protected] Gebze, Turkey Gebze, Turkey e-mail:
[email protected] e-mail:
[email protected] Abstract— Obfuscation techniques degrade the n-gram malware used by the previous work. Similar methodologies features of binary form of the malware. In this study, have been used in source authorship, information retrieval methodology to classify malware instances by using n-gram and natural language processing [5], [6]. features of its disassembled code is presented. The presented The first known use of machine learning in malware statistical method uses the n-gram features of the malware to detection is presented by the work of Tesauro et al. in [7]. classify its instance with respect to their families. n-gram is a fixed size sliding window of byte array, where n is the size of This detection algorithm was successfully implemented in the window. The contribution of the presented method is IBM’s antivirus scanner. They used 3-grams as a feature set capability of using only one vector to represent malware and neural networks as a classification model. When the 3- subfamily which is called subfamily centroid. Using only one grams parameter is selected, the number of all n-gram vector for classification simply reduces the dimension of the n- features becomes 2563, which leads to some spacing gram space.