Spam Campaign Detection, Analysis, and Formalization

Spam Campaign Detection, Analysis, and Formalization Thèse Mina Sheikhalishahi Doctorat en informatique Philosophiæ doctor (Ph.D.) Québec, Canada © Mina Sheikhalishahi, 2016 Spam Campaign Detection, Analysis, and Formalization Thèse Mina Sheikhalishahi Sous la direction de: Directeur de recherche: Mohamed Mejri Codirectrice de recherche: Nadia Tawbi Résumé Les courriels Spams (courriels indésirables ou pourriels) imposent des coûts annuels extrême- ment lourds en termes de temps, d’espace de stockage et d’argent aux utilisateurs privés et aux entreprises. Afin de lutter efficacement contre le problème des spams, il ne suffit pas d’arrêter les messages de spam qui sont livrés à la boîte de réception de l’utilisateur. Il est obligatoire, soit d’essayer de trouver et de persécuter les spammeurs qui, généralement, se cachent derrière des réseaux complexes de dispositifs infectés, ou d’analyser le comportement des spammeurs afin de trouver des stratégies de défense appropriées. Cependant, une telle tâche est difficile en raison des techniques de camouflage, ce qui nécessite une analyse manuelle des spams corrélés pour trouver les spammeurs. Pour faciliter une telle analyse, qui doit être effectuée sur de grandes quantités des courriels non classés, nous proposons une méthodologie de regroupement catégorique, nommé CCTree, permettant de diviser un grand volume de spams en des campagnes, et ce, en se basant sur leur similarité structurale. Nous montrons l’efficacité et l’efficience de notre algorithme de clustering proposé par plusieurs expériences. Ensuite, une approche d’auto-apprentissage est proposée pour étiqueter les campagnes de spam en se basant sur le but des spammeur, par exemple, phishing. Les campagnes de spam marquées sont utilisées afin de former un clas- sificateur, qui peut être appliqué dans la classification des nouveaux courriels de spam. En outre, les campagnes marquées, avec un ensemble de quatre autres critères de classement, sont ordonnées selon les priorités des enquêteurs. Finalement, une structure basée sur le semiring est proposée pour la représentation abstraite de CCTree. Le schéma abstrait de CCTree, nommé CCTree terme, est appliqué pour formali- ser la parallélisation du CCTree. Grâce à un certain nombre d’analyses mathématiques et de résultats expérimentaux, nous montrons l’efficience et l’efficacité du cadre proposé. iii Abstract Spam emails yearly impose extremely heavy costs in terms of time, storage space, and money to both private users and companies. To effectively fight the problem of spam emails, it is not enough to stop spam messages to be delivered to end user inbox or be collected in spam box. It is mandatory either to try to find and persecute the spammers, generally hiding be- hind complex networks of infected devices, which send spam emails against their user will, i.e. botnets; or analyze the spammer behavior to find appropriate strategies against it. However, such a task is difficult due to the camouflage techniques, which makes necessary a manual analysis of correlated spam emails to find the spammers. To facilitate such an analysis, which should be performed on large amounts of unclassified raw emails, we propose a categorical clustering methodology, named CCTree, to divide large amount of spam emails into spam campaigns by structural similarity. We show the effectiveness and efficiency of our proposed clustering algorithm through several experiments. Afterwards, a self-learning approach is proposed to label spam campaigns based on the goal of spammer, e.g. phishing. The labeled spam campaigns are used to train a classifier, which can be applied in classifying new spam emails. Furthermore, the labeled campaigns, with the set of four more ranking features, are ordered according to investigators priorities. A semiring-based structure is proposed to abstract CCTree representation. Through several theorems we show under some conditions the proposed approach fully abstracts the tree representation. The abstract schema of CCTree, named CCTree term, is applied to formalize CCTree parallelism. Through a number of mathematical analysis and experimental results, we show the efficiency and effectiveness of our proposed framework as an automatic tool for spam campaign detection, labeling, ranking, and formalization. iv Table des matières Résumé iii Abstract iv Table des matièresv Liste des tableaux vii Liste des figures ix Remerciements xii 1 Introduction1 1.1 Motivation .................................... 3 1.2 Main Contributions................................ 6 1.3 Thesis Outline .................................. 7 2 State of the Art9 2.1 Spam Emails Issues................................ 9 2.2 Clustering Spam emails into Campaigns.................... 12 2.3 Labeling and Ranking Spam Campaigns.................... 17 2.4 On the Formalization of Clustering and its Applications........... 18 3 Spam Campaign Detection 22 3.1 Introduction.................................... 22 3.2 Preliminary Notions ............................... 25 3.3 Related Works .................................. 28 3.4 Categorical Clustering Tree (CCTree) ..................... 30 3.5 Time Complexity................................. 32 3.6 Conclusion .................................... 33 4 Effectiveness and Efficiency of CCTree in Spam Campaign Detection 34 4.1 Introduction.................................... 34 4.2 Framework .................................... 36 4.3 Evaluation and Results.............................. 38 4.4 Discussion and Comparisons........................... 56 4.5 Related Work................................... 57 4.6 Conclusion .................................... 58 v 5 Labeling and Ranking Spam Campaigns 60 5.1 Introduction.................................... 60 5.2 Related Work................................... 62 5.3 Digital Waste Sorting .............................. 63 5.4 Results....................................... 75 5.5 Ranking Spam Campaigns............................ 82 5.6 Conclusion .................................... 86 6 Algebraic Formalization of CCTree 87 6.1 Introduction.................................... 87 6.2 Related work ................................... 89 6.3 Feature-Cluster Algebra............................. 90 6.4 Feature-Cluster (Family) Term Abstraction.................. 99 6.5 Relations on Feature-Cluster Algebra ..................... 109 6.6 CCTrees Parallelism ............................... 117 6.7 Conclusion .................................... 122 7 Conclusions and Future Work 124 7.1 Thesis Summary ................................. 124 7.2 Future work.................................... 126 A Appendix 130 A.1 Source Codes of Proposed Approach...................... 130 A.2 Tables of Attributes ............................... 138 Bibliography 144 Bibliographie 144 vi Liste des tableaux 4.1 Features extracted from each email. ........................ 37 4.2 CCTree Internal evaluation with fixed number of elements............. 41 4.3 Internal evaluation results of CCTree, COBWEB and CLOPE.......... 45 4.4 Silhouette values and number of clusters in function of µ for four email datasets. 50 4.5 Silhouette result, hamming distance, = 0:001, and µ changes.......... 52 4.6 Number of Clusters , = 0:001, and µ changes .................. 52 4.7 External evaluation results of CCTree, COBWEB and CLOPE.......... 55 4.8 Campaigns on the February 2015 dataset from five clustering methodologies. 57 5.1 Features extracted from each email. ........................ 71 5.2 Feature vectors of a spam email for each class. .................. 72 5.3 Classification results evaluated with K-fold validation on training set. 77 5.4 Classification results evaluated on test set...................... 77 5.5 Training set generated from small knowledge.................... 81 5.6 DWS classification results for the labeled spam campaigns. ........... 81 5.7 Set of ranking features................................ 82 5.8 Normalized score of spam campaigns label..................... 84 5.9 Three first ranked campaigns............................ 85 6.1 CCTree Rewriting System.............................. 114 6.2 Composition Rewriting System........................... 119 7.1 Table of Notations.................................. 129 A.1 Language of spam message and subject....................... 138 A.2 Type of Attachment................................. 138 A.3 Attachment Size ................................... 139 A.4 Number of attachment................................ 139 A.5 Average size of attachments............................. 139 A.6 Type of Message ................................... 139 A.7 Length of Message.................................. 140 A.8 IP-based links verification.............................. 140 A.9 Mismatch links.................................... 140 A.10 Number of links.................................... 141 A.11 Number of Domains ................................. 141 A.12 Average number of dots in links........................... 141 A.13 Hex character in links ................................ 141 A.14 Words in Subject................................... 142 vii A.15 Characters in subject................................. 142 A.16 Non ASCII characters in subject .......................... 142 A.17 Recipients of spam email............................... 142 A.18 Images in spam messages .............................. 143 viii Liste des figures 1.1 Steady volume of spam................................ 2 1.2 Mcafee Report 2015.................................. 3 1.3 The framework of thesis. .............................

Load more