Identifying Polymorphic Malware Variants Using Biosequence Analysis Techniques

By Vijay Naidu

SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY AT AUCKLAND UNIVERSITY OF TECHNOLOGY, AUCKLAND, NEW ZEALAND

June 2018

© Copyright by Vijay Naidu, 2018

Abstract

Modern antivirus systems (AVSs) are not able to detect new polymorphic malware variants until they emerge, even when signatures of one or more variants belonging to a specific polymorphic malware family are known. Polymorphic malware can transform into functionally identical variants of itself. Polymorphism changes the order of the viral code, but typically not the code itself, to avoid signature-based detection. Current AVSs detect malware by adopting signatures based on the most essential parts of a known virus, such as execution traces and instruction sequences. Virus writers exploit the weaknesses of malware signature databases by creating new variants using the same engine employed by an already existing polymorphic malware family.

In this thesis, virus detection and signature extraction techniques are presented. These techniques were developed by exploring string matching techniques traditionally employed in biosequence analysis. The main contribution of these matching techniques is to extract syntactic patterns (i.e. conserved regions/sequences) from semantically rich polymorphic hex code. These extracted syntactic patterns act as signatures and are used in the identification of polymorphic malware variants belonging to the same family. Moreover, these extracted syntactic patterns can help in identifying new variants that make simple alterations to their newly generated variants. The string matching approaches presented in this thesis may revolutionise our knowledge of polymorphic variant generation and give rise to a new era of string-based syntactic AVSs.
Table of Contents

Abstract
Table of Contents
List of Figures
List of Tables
Attestation of Authorship
Acknowledgements

Chapter 1 Introduction
  1.1 Motivation
  1.2 Background and Related Work
  1.3 Syntactic and Semantic Approaches
  1.4 Problem Statements, Research Objectives and Questions
    1.4.1 Problem Statements
    1.4.2 Research Objectives
    1.4.3 Research Questions
  1.5 Hypothesis and Proposed Approach
    1.5.1 Drawbacks of Previous Approaches
    1.5.2 Hypothesis
    1.5.3 Smith-Waterman Algorithm (SWA)
    1.5.4 NNge
    1.5.5 Limitations of Proposed Approach and Possible Solutions
  1.6 Thesis Description
    1.6.1 Thesis Contribution
    1.6.2 Thesis Structure
    1.6.3 Publications

Chapter 2 Malware, Polymorphic Malware, and their Detection Approaches
  2.1 Classification of Malware and Recent Research into Malware Detection
    2.1.1 Virus
    2.1.2 Previous Research into Malware Detection
    2.1.3 Classification of Viruses by Masking Strategies
    2.1.4 Polymorphism
    2.1.5 Classification of Polymorphism
    2.1.6 Levels of Polymorphism
    2.1.7 Mutation Engine
    2.1.8 Polymorphic Decryptor (The decryption routine)
    2.1.9 Metamorphism
  2.2 Malware Detection Techniques
    2.2.1 Machine Learning/Data Mining Approach
    2.2.2 Normalisation Approach
    2.2.3 Scan Engine (Signature based Approach)
    2.2.4 Cryptanalysis
    2.2.5 Heuristic Approach
  2.3 History of Malware – Timeline
  2.4 Tool Validation
    2.4.1 Predictive Validation
    2.4.2 Triangulation Approach
  2.5 Summary

Chapter 3 Research Design
  3.1 Research Design
  3.2 Identifying and analysing the problem
  3.3 Defining research objectives and questions
  3.4 Designing the proposed approach and conducting experiments
  3.5 Discussion of Results and Evidence
  3.6 Analysis and Evaluation
  3.7 Overview of thesis
  3.8 Summary

Chapter 4 A String-Based Method for Syntactically Identifying Polymorphic Virus Variants
  4.1. Introduction
  4.2. String-Based Syntactic Detection of Polymorphic Malware Variants Method: An Overview
  4.3. String-Based Syntactic Detection of Polymorphic Malware Variants Method: Systems and Methods
    4.3.1 Hex Dump Extraction
    4.3.2 Hex to DNA Code Conversion
    4.3.3 Process of Pairwise Local Sequence Alignment
    4.3.4 Meta-Signature Virus Testing
  4.4. Experimental Results
  4.5. Summary

Chapter 5 Exploring Advanced Sequence Alignment Techniques in
Recommended publications
  • MalDet: How to Detect the Malware?
    International Journal of Computer Applications (0975 – 8887), Volume 123 – No.6, August 2015. MalDet: How to Detect the Malware? Samridhi Sharma, Department of CSE, Seth Jai Parkash Mukand Lal Institute of Engineering and Technology, Harayana, India. Shabnam Parveen, Assistant Professor, Department of CSE, Seth Jai Parkash Mukand Lal Institute of Engineering and Technology, Harayana, India.
    ABSTRACT Malware is malicious software. This software used to interrupt computer functionality. Protecting the internet is probably a enormous task that the contemporary epoch of computers have seen. Day by day the threat levels large thus making the network susceptible to attacks. Many novel strategies are brought into the field of cyber security to guard websites from attacks. But still malware has remained a grave reason of anxiety to web developers and server administrators. With this war takes place amid the security community and malicious software developers, the security specialists use all possible techniques, methods and strategies to discontinue and eliminate the threats while the malware developers utilize new types of malwares that avoid implemented security features.
    [...] to violate the privacy and security of a system. According to the Symantec Internet Threat Report 499,811 new malware samples were received in the second half of 2007 detection. So it becomes necessary to detect the malware. This paper is organized as follows: Section two has covered the recent state of the malware security. Section three discusses about the malware classification, section four presents the malware detector. Section five studies mechanism of malware detection, and finally section six explains malware normalization process.
    2. RELATED WORK Technology has turn out to be an building block in recent
  • Digital Skills in Sub-Saharan Africa Spotlight on Ghana
    Digital Skills in Sub-Saharan Africa: Spotlight on Ghana. IN COOPERATION WITH:
    ABOUT IFC IFC—a sister organization of the World Bank and member of the World Bank Group—is the largest global development institution focused on the private sector in emerging markets. We work with more than 2,000 businesses worldwide, using our capital, expertise, and influence to create markets and opportunities in the toughest areas of the world. For more information, visit www.ifc.org.
    ABOUT REPORT This publication, Digital Skills in Sub-Saharan Africa: Spotlight on Ghana, was produced by the Manufacturing Agribusiness and Services department of the International Finance Corporation, in cooperation with the Global Education practice at L.E.K. Consulting. It was developed under the overall guidance of Tomasz Telma (Senior Director, MAS), Mary-Jean Moyo (Director, MAS, Middle-East and Africa), Elena Sterlin (Senior Manager, Global Health and Education, MAS) and Olaf Schmidt (Manager, Services, MAS, Sub-Saharan Africa).
    Research and writing underpinning the report was conducted by the L.E.K. Global Education practice. The L.E.K. team was led by Ashwin Assomull, Maryanna Abdo, and Ridhi Gupta, including writing by Maryanna Abdo, Priyanka Thapar, and Jaisal Kapoor and research contributions by Neil Aneja, Shrrinesh Balasubramanian, Patrick Desmond, Ridhi Gupta, Jaisal Kapoor, Rohan Sur, and Priyanka Thapar. Sudeep Laad provided valuable insights on the Ghana market landscape and opportunity sizing. L.E.K. is a global management consulting firm that uses deep industry expertise and rigorous analysis to help business leaders achieve practical results with real impact. The Global Education practice is a specialist international team based in Singapore serving a global client base from China to Chile.
    ACKNOWLEDGMENTS The report would not have been possible without the participation of leadership and alumni from eight case study organizations, including: Andela: Lara Kok, Executive Coordinator; Anudip: Dipak
  • Redundancy-Free Analysis of Multi-Revision Software Artifacts
    Empirical Software Engineering (preprint). Redundancy-free Analysis of Multi-revision Software Artifacts. Carol V. Alexandru · Sebastiano Panichella · Sebastian Proksch · Harald C. Gall. 30. April 2018.
    Abstract Researchers often analyze several revisions of a software project to obtain historical data about its evolution. For example, they statically analyze the source code and monitor the evolution of certain metrics over multiple revisions. The time and resource requirements for running these analyses often make it necessary to limit the number of analyzed revisions, e.g., by only selecting major revisions or by using a coarse-grained sampling strategy, which could remove significant details of the evolution. Most existing analysis techniques are not designed for the analysis of multi-revision artifacts and they treat each revision individually. However, the actual difference between two subsequent revisions is typically very small. Thus, tools tailored for the analysis of multiple revisions should only analyze these differences, thereby preventing re-computation and storage of redundant data, improving scalability and enabling the study of a larger number of revisions. In this work, we propose the Lean Language-Independent Software Analyzer (LISA), a generic framework for representing and analyzing multi-revisioned software artifacts. It employs a redundancy-free, multi-revision representation for artifacts and avoids re-computation by only analyzing changed artifact fragments across thousands of revisions. The evaluation of our approach consists of measuring the effect of each individual technique incorporated, an in-depth study of LISA's resource requirements and a large-scale analysis over 7 million program revisions of 4,000 software projects written in four languages.
  • 13 Templates-Generics.Pdf
    CS 242 2012 Generic programming in OO Languages Reading Text: Sections 9.4.1 and 9.4.3 J Koskinen, Metaprogramming in C++, Sections 2 – 5 Gilad Bracha, Generics in the Java Programming Language Questions • If subtyping and inheritance are so great, why do we need type parameterization in object-oriented languages? • The great polymorphism debate – Subtype polymorphism • Apply f(Object x) to any y : C <: Object – Parametric polymorphism • Apply generic <T> f(T x) to any y : C Do these serve similar or different purposes? Outline • C++ Templates – Polymorphism vs Overloading – C++ Template specialization – Example: Standard Template Library (STL) – C++ Template metaprogramming • Java Generics – Subtyping versus generics – Static type checking for generics – Implementation of Java generics Polymorphism vs Overloading • Parametric polymorphism – Single algorithm may be given many types – Type variable may be replaced by any type – f :: t→t => f :: Int→Int, f :: Bool→Bool, ... • Overloading – A single symbol may refer to more than one algorithm – Each algorithm may have different type – Choice of algorithm determined by type context – Types of symbol may be arbitrarily different – + has types int*int→int, real*real→real, ... Polymorphism: Haskell vs C++ • Haskell polymorphic function – Declarations (generally) require no type information – Type inference uses type variables – Type inference substitutes for variables as needed to instantiate polymorphic code • C++ function template – Programmer declares argument, result types of fctns – Programmers
  • TypeDevil: Dynamic Type Inconsistency Analysis for JavaScript
    TypeDevil: Dynamic Type Inconsistency Analysis for JavaScript. Michael Pradel∗x, Parker Schuh∗, and Koushik Sen∗. ∗EECS Department, University of California, Berkeley. xDepartment of Computer Science, TU Darmstadt.
    Abstract—Dynamic languages, such as JavaScript, give programmers the freedom to ignore types, and enable them to write concise code in short time. Despite this freedom, many programs follow implicit type rules, for example, that a function has a particular signature or that a property has a particular type. Violations of such implicit type rules often correlate with problems in the program. This paper presents TypeDevil, a mostly dynamic analysis that warns developers about inconsistent types. The key idea is to assign a set of observed types to each variable, property, and function, to merge types based in their structure, and to warn developers about variables, properties, and functions that have inconsistent types. To deal with the pervasiveness of polymorphic behavior in real-world JavaScript programs, we present a set of techniques to remove spurious warnings and to merge related warnings.
    [...] of type number with undefined, which is benign when running the program in its default configuration but causes a crash with a slightly different configuration. Finding these problems is difficult because they do not always lead to obvious signs of misbehavior when executing the programs. How can developers detect such problems despite the permissive nature of JavaScript? All three examples in Figure 1 share the property that a variable, property or function has multiple, inconsistent types. In Figure 1a, variable dnaOutputStr holds both the undefined value and string values. In Figure 1b, function leftPad sometimes returns an object and sometimes returns
  • Usuba, Optimizing Bitslicing Compiler Darius Mercadier
    Usuba, Optimizing Bitslicing Compiler. Darius Mercadier. To cite this version: Darius Mercadier. Usuba, Optimizing Bitslicing Compiler. Programming Languages [cs.PL]. Sorbonne Université (France), 2020. English. tel-03133456. HAL Id: tel-03133456, https://tel.archives-ouvertes.fr/tel-03133456. Submitted on 6 Feb 2021.
    HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
    Doctoral thesis of Sorbonne Université. Speciality: Computer Science. Doctoral school: Informatique, Télécommunications et Électronique (Paris). Presented by Darius Mercadier to obtain the degree of Doctor of Sorbonne Université. Thesis subject: Usuba, Optimizing Bitslicing Compiler. Defended on 20 November 2020 before a jury composed of: Gilles Muller (thesis director), Pierre-Évariste Dagand (thesis supervisor), Karthik Bhargavan (reviewer), Sandrine Blazy (reviewer), Caroline Collange (examiner), Xavier Leroy (examiner), Thomas Pornin (examiner), Damien Vergnaud (examiner).
    Abstract Bitslicing is a technique commonly used in cryptography to implement high-throughput parallel and constant-time symmetric primitives. However, writing, optimizing and protecting bitsliced implementations by hand are tedious tasks, requiring knowledge in cryptography, CPU microarchitectures and side-channel attacks. The resulting programs tend to be hard to maintain due to their high complexity.
  • Polymorphic Blending Attacks
    Polymorphic Blending Attacks. Prahlad Fogla, Monirul Sharif, Roberto Perdisci, Oleg Kolesnikov, Wenke Lee. College of Computing, Georgia Institute of Technology, 801 Atlantic Drive, Atlanta, Georgia 30332. {prahlad, msharif, rperdisc, ok, wenke}@cc.gatech.edu
    Abstract A very effective means to evade signature-based intrusion detection systems (IDS) is to employ polymorphic techniques to generate attack instances that do not share a fixed signature. Anomaly-based intrusion detection systems provide good defense because existing polymorphic techniques can make the attack instances look different from each other, but cannot make them look like normal. In this paper we introduce a new class of polymorphic attacks, called polymorphic blending attacks, that can effectively evade byte frequency-based network anomaly IDS by carefully matching the statistics of the mutated attack instances to the normal profiles. The proposed polymorphic blending attacks can be viewed as a subclass of the mimicry attacks.
    [...] because the invariant parts of the attack may not be sufficient to construct a signature that produces very few false positives. On the other hand, each instance of a polymorphic attack needs to contain exploit code that is typically not used in normal activities. Thus, each instance looks different from normal. Existing polymorphic techniques [28] focus on making the attack instances look different from each other, and not much on making them look like normal. This means that network payload anomaly detection systems can provide a good defense against the current generation of polymorphic attacks. However, if a polymorphic attack can blend in with (or look like) normal traffic, it can successfully evade an anomaly-based IDS that relies solely on pay-
  • SMB Security Series: How to Protect Your Business from Malware, Phishing, and Cybercrime Dan Sullivan
    How to Protect Your Business from Malware, Phishing, and Cybercrime The SMB Security Series Malware, Phishing, and Cybercrime – Dangerous Threats Facing the SMB State of Cybercrime sponsored by Dan Sullivan SMB Security Series: How to Protect Your Business from Malware, Phishing, and Cybercrime Dan Sullivan Introduction to Realtime Publishers by Don Jones, Series Editor For several years now, Realtime has produced dozens and dozens of high‐quality books that just happen to be delivered in electronic format—at no cost to you, the reader. We’ve made this unique publishing model work through the generous support and cooperation of our sponsors, who agree to bear each book’s production expenses for the benefit of our readers. Although we’ve always offered our publications to you for free, don’t think for a moment that quality is anything less than our top priority. My job is to make sure that our books are as good as—and in most cases better than—any printed book that would cost you $40 or more. Our electronic publishing model offers several advantages over printed books: You receive chapters literally as fast as our authors produce them (hence the “realtime” aspect of our model), and we can update chapters to reflect the latest changes in technology. I want to point out that our books are by no means paid advertisements or white papers. We’re an independent publishing company, and an important aspect of my job is to make sure that our authors are free to voice their expertise and opinions without reservation or restriction.
  • On the Infeasibility of Modeling Polymorphic Shellcode*
    On the Infeasibility of Modeling Polymorphic Shellcode∗. Yingbo Song, Michael E. Locasto, Angelos Stavrou, Angelos D. Keromytis, Salvatore J. Stolfo. Dept. of Computer Science, Columbia University.
    ABSTRACT Polymorphic malcode remains a troubling threat. The ability for malcode to automatically transform into semantically equivalent variants frustrates attempts to rapidly construct a single, simple, easily verifiable representation. We present a quantitative analysis of the strengths and limitations of shellcode polymorphism and consider its impact on current intrusion detection practice. We focus on the nature of shellcode decoding routines. The empirical evidence we gather helps show that modeling the class of self–modifying code is likely intractable by known methods, including both statistical constructs and string signatures. In addition, we develop and present measures that provide insight into the capabilities, strengths, and weaknesses of polymorphic engines. In order to explore countermeasures to future polymorphic threats, we show how to improve polymorphic techniques and create a proof-of-concept engine expressing these improvements.
    General Terms Experimentation, Measurement, Security
    Keywords polymorphism, shellcode, signature generation, statistical models
    1. INTRODUCTION Code injection attacks have traditionally received a great deal of attention from both security researchers and the blackhat community [1, 14], and researchers have proposed a variety of defenses, from artificial diversity of the address space [5] or instruction set [20, 4] to compiler-added integrity checking of the stack [10, 15]
  • Polymorphic and Metamorphic Code Applications in Portable Executable Files Protection
    Volume 51, Number 1, 2010. ACTA TECHNICA NAPOCENSIS Electronics and Telecommunications. POLYMORPHIC AND METAMORPHIC CODE APPLICATIONS IN PORTABLE EXECUTABLE FILES PROTECTION. Liviu PETREAN, “Emil Racoviţă” High School Baia Mare, 56 V. Alecsandri, tel. 0262 224 266.
    Abstract: Assembly code obfuscation is one of the most popular ways used by software developers to protect their intellectual property. This paper is reviewing the methods of software security employing metamorphic and polymorphic code transformations used mostly by computer viruses. Keywords: code, polymorphic, portable
    I. INTRODUCTION The illegal copying of computer programs causes huge revenue losses of software companies and most of the time these losses exceed the earnings. As a consequence the software companies should use strong protection for their intellectual property, but surprisingly, we often encounter the absence of such protection or just a futile security routine. Many software producers argued these frailties affirming that sooner or later their product will be reversed with or without protection [1], [3], [6]. They are right but only partially, because even if everything that can be run can be reversed, the problem is how long is the
    [...] to execute. Nowadays self-modifying code is used by programs that do not want to reveal their presence such as computer viruses and executable compressors and protectors. Self modifying code is quite straightforward to write when using assembly language but it can also be implemented in high level language interpreters as C and C++. The usage of self modifying code has many purposes. Those which present an interest for us in this paper are mentioned below: 1. Hiding the code to prevent reverse engineering, through the use of a disassembler or debugger.
  • FreeNAS® 11.2-U3 User Guide
    FreeNAS® 11.2-U3 User Guide, March 2019 Edition. FreeNAS® is © 2011-2019 iXsystems. FreeNAS® and the FreeNAS® logo are registered trademarks of iXsystems. FreeBSD® is a registered trademark of the FreeBSD Foundation. Written by users of the FreeNAS® network-attached storage operating system. Version 11.2. Copyright © 2011-2019 iXsystems (https://www.ixsystems.com/)
    CONTENTS
    Welcome
    Typographic Conventions
    1 Introduction
      1.1 New Features in 11.2
        1.1.1 RELEASE-U1
        1.1.2 U2
        1.1.3 U3
      1.2 Path and Name Lengths
      1.3 Hardware Recommendations
        1.3.1 RAM
        1.3.2 The Operating System Device
        1.3.3 Storage Disks and Controllers
        1.3.4 Network Interfaces
      1.4 Getting Started with ZFS
    2 Installing and Upgrading
      2.1 Getting FreeNAS®
      2.2 Preparing the Media
  • A Review of Polymorphic Malware Detection Techniques
    A review of polymorphic malware detection techniques Joma Alrzini1, Diane Pennington2 1University of Strathclyde, UK, [email protected] 2University of Strathclyde, UK, [email protected] ABSTRACT Despite the continuous updating of anti- detection systems for the signature algorithm, using sandbox analysis, machine malicious programs (malware), malware has moved to an learning algorithms, deep learning framework based on a hybrid abnormal threat level; it is being generated and spread faster than malware analysis method, and feature engineering approach. before. One of the most serious challenges faced by anti-detection malware programs is an automatic mutation in the code; this is 2. AN OVERVIEW OF POLYMORPHIC MALWARE called polymorphic malware via the polymorphic engine. In this The first emergence of polymorphic malware occurred in case, it is difficult to block the impact of signature-based 1990. It comes in several structures to create malicious code by detection. Hence new techniques have to be used in order to using polymorphic engines. This type of malware is less likely analyse modern malware. One of these techniques is machine to be detected by an antivirus application. The most commonly learning algorithms in a virtual machine (VM) that can run the used techniques for writing polymorphic malware are packed malicious file and analyse it dynamically through encryption/decryption and data appending. These techniques automated testing of the code. Moreover, recent research used use code obfuscations to evade antivirus scanners. Thus, an image processing techniques with deep learning framework as a effective method has to be used to detect unknown malware with hybrid method with two analysis types and extracting a feature obfuscation; the machine learning method is now the most engineering approach in the analysis process in order to detect effective approach, particularly for abnormal malware [29].