Spam emails (unsolicited or junk emails) impose extremely heavy annual costs in time, storage space, and money on private users and companies. To fight the spam problem effectively, it is not enough to stop spam messages from being delivered to the user's inbox. It is necessary either to find and prosecute the spammers, who generally hide behind complex networks of infected devices, or to analyze spammer behavior in order to find appropriate defense strategies. However, such a task is difficult because of camouflage techniques, which make a manual analysis of correlated spam emails necessary to find the spammers. To facilitate such an analysis, which must be performed on large quantities of unclassified emails, we propose a categorical clustering methodology, named CCTree, that divides a large volume of spam emails into campaigns based on their structural similarity. We show the effectiveness and efficiency of the proposed clustering algorithm through several experiments. Next, a self-learning approach is proposed to label spam campaigns according to the spammer's goal, for example phishing. The labeled spam campaigns are used to train a classifier, which can be applied to classify new spam emails. In addition, the labeled campaigns, together with a set of four other ranking features, are ordered according to investigators' priorities. Finally, a semiring-based structure is proposed for the abstract representation of CCTree. The abstract schema of CCTree, named CCTree term, is applied to formalize the parallelization of CCTree. Through a number of mathematical analyses and experimental results, we show the efficiency and effectiveness of the proposed framework.

Abstract

Spam emails impose extremely heavy costs every year, in time, storage space, and money, on both private users and companies. To fight the problem of spam effectively, it is not enough to stop spam messages from being delivered to the end user's inbox or to collect them in a spam box. It is necessary either to find and prosecute the spammers, who generally hide behind botnets, i.e. complex networks of infected devices that send spam emails against their users' will, or to analyze spammer behavior to find appropriate strategies against it. However, such a task is difficult because of camouflage techniques, which make a manual analysis of correlated spam emails necessary to find the spammers. To facilitate such an analysis, which must be performed on large amounts of unclassified raw emails, we propose a categorical clustering methodology, named CCTree, to divide large amounts of spam emails into spam campaigns by structural similarity. We show the effectiveness and efficiency of the proposed clustering algorithm through several experiments. Afterwards, a self-learning approach is proposed to label spam campaigns based on the goal of the spammer, e.g. phishing. The labeled spam campaigns are used to train a classifier, which can be applied to classify new spam emails. Furthermore, the labeled campaigns, together with a set of four additional ranking features, are ordered according to investigators' priorities. A semiring-based structure is proposed to abstract the CCTree representation. Through several theorems we show that, under some conditions, the proposed approach fully abstracts the tree representation. The abstract schema of CCTree, named CCTree term, is applied to formalize CCTree parallelism. Through a number of mathematical analyses and experimental results, we show the efficiency and effectiveness of the proposed framework as an automatic tool for spam campaign detection, labeling, ranking, and formalization.

Table of Contents

Résumé
Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgments

1 Introduction
1.1 Motivation
1.2 Main Contributions
1.3 Thesis Outline

2 State of the Art
2.1 Spam Emails Issues
2.2 Clustering Spam Emails into Campaigns
2.3 Labeling and Ranking Spam Campaigns
2.4 On the Formalization of Clustering and its Applications

3 Spam Campaign Detection
3.1 Introduction
3.2 Preliminary Notions
3.3 Related Works
3.4 Categorical Clustering Tree (CCTree)
3.5 Time Complexity
3.6 Conclusion

4 Effectiveness and Efficiency of CCTree in Spam Campaign Detection
4.1 Introduction
4.2 Framework
4.3 Evaluation and Results
4.4 Discussion and Comparisons
4.5 Related Work
4.6 Conclusion

5 Labeling and Ranking Spam Campaigns
5.1 Introduction
5.2 Related Work
5.3 Digital Waste Sorting
5.4 Results
5.5 Ranking Spam Campaigns
5.6 Conclusion

6 Algebraic Formalization of CCTree
6.1 Introduction
6.2 Related Work
6.3 Feature-Cluster Algebra
6.4 Feature-Cluster (Family) Term Abstraction
6.5 Relations on Feature-Cluster Algebra
6.6 CCTrees Parallelism
6.7 Conclusion

7 Conclusions and Future Work
7.1 Thesis Summary
7.2 Future Work

A Appendix
A.1 Source Codes of Proposed Approach
A.2 Tables of Attributes

Bibliography

List of Tables

4.1 Features extracted from each email
4.2 CCTree internal evaluation with fixed number of elements
4.3 Internal evaluation results of CCTree, COBWEB and CLOPE
4.4 Silhouette values and number of clusters as a function of µ for four email datasets
4.5 Silhouette result, Hamming distance, ε = 0.001, and µ changes
4.6 Number of clusters, ε = 0.001, and µ changes
4.7 External evaluation results of CCTree, COBWEB and CLOPE
4.8 Campaigns on the February 2015 dataset from five clustering methodologies

5.1 Features extracted from each email
5.2 Feature vectors of a spam email for each class
5.3 Classification results evaluated with K-fold validation on training set
5.4 Classification results evaluated on test set
5.5 Training set generated from small knowledge
5.6 DWS classification results for the labeled spam campaigns
5.7 Set of ranking features
5.8 Normalized score of spam campaigns label
5.9 Three first ranked campaigns

6.1 CCTree Rewriting System
6.2 Composition Rewriting System

7.1 Table of Notations

A.1 Language of spam message and subject
A.2 Type of attachment
A.3 Attachment size
A.4 Number of attachments
A.5 Average size of attachments
A.6 Type of message
A.7 Length of message
A.8 IP-based links verification
A.9 Mismatch links
A.10 Number of links
A.11 Number of domains
A.12 Average number of dots in links
A.13 Hex character in links
A.14 Words in subject
A.15 Characters in subject
A.16 Non-ASCII characters in subject
A.17 Recipients of spam email
A.18 Images in spam messages

List of Figures

1.1 Steady volume of spam
1.2 McAfee Report 2015
1.3 The framework of thesis

3.1 Dataset 1
3.2 Dataset 2
3.3 Spam 1
3.4 Spam 2
3.5 A small CCTree

4.1 CCTree(0.001,1)
4.2 CCTree(0.01,1)
4.3 CCTree(0.1,1)
4.4 CCTree(0.5,1)
4.5 Internal evaluation at the variation of the ε parameter
4.6 COBWEB
4.7 CCTree(0.001,1)
4.8 CCTree(0.001,10)
4.9 CCTree(0.001,100)
4.10 CCTree(0.001,1000)
4.11 CLOPE
4.12 Silhouette as a function of the number of clusters for different values of µ
4.13 Silhouette (Hamming)
4.14 Generated clusters
4.15 Silhouette (Hamming)
4.16 Generated clusters

5.1 Advertisement
5.2 Portal
5.3 Fraud
5.4
5.5 Crypto ransomware volume
5.6 Phishing
5.7 DWS workflow
5.8 Insert new instance X in a CCTree
5.9 ROC curve / Advertisement
5.10 ROC curve / Portal Redirection
5.11 ROC curve / Fraud
5.12 ROC curve / Malware
5.13 ROC curve / Phishing

6.1 A small CCTree
6.2 Parallel clustering workflow

To my love, my family, and to anyone who looks for worldwide peace and happiness

Acknowledgments

Though only my name appears on the cover of this dissertation, a great many people have contributed to its production. I owe my gratitude to all those people who have made this dissertation possible.

First and foremost, I want to thank my supervisor, Professor Mohamed Mejri, for accepting me into his research group, which improved my view of life. I appreciate all his contributions of time, ideas, patience, and funding, which made my Ph.D. experience productive and stimulating. I thank him for allowing me to grow as a research scientist, and for all his patience and support. I would also like to express my deep thanks to my co-advisor, Professor Nadia Tawbi, who has always been there to listen and give advice. I thank her for all her kind moral and financial support and for helpful discussions at different stages of my Ph.D. studies. I gratefully acknowledge her support for my cooperation with the IIT-CNR research group, which changed my life. I really appreciate the insightful comments and constructive criticism of my advisor and co-advisor at different stages of my research, and their encouragement to use correct grammar and consistent notation in my writing.

Besides my advisors, I would like to thank the rest of my thesis committee: Prof. Fabio Martinelli, Prof. Raphael Khoury, and Dr. Ilaria Matteucci, for their insightful comments and encouragement. Special thanks go to Professor Fabio Martinelli for welcoming me into his research group at IIT-CNR, Italy, which enriched my research experience.

My time in Quebec was made enjoyable in large part by the many friends who became part of my life. I am grateful to my dearest Shadi, who supported me continuously during the three years of my stay in Quebec. With her presence in Quebec, I always felt I had a family member taking care of me. I am grateful to my kind friend Bahareh, whom I bothered several times from Italy to do something in Quebec on my behalf. Thanks to my other kind friends in Quebec: Elaheh, Afrooz, Sheyda, Soamyeh.

I am especially grateful to my best friend Sara, who, in the very difficult moments of my Ph.D., was always available from Iran to send me messages and to support, encourage, and motivate me. She was always there to hear me, despite the different time zones of Iran and Canada. I will always appreciate all her kind and continuous support. Many thanks to my other friends from Iran: Mahboobeh, for continually remembering me and praying for me, and Mahmoud, for always following my weblog and motivating me.

I would like to deeply thank my family for all their love and encouragement. To my father, who always motivated us to read, to know, and to follow our dreams, and who always loves us as we are. To my mother, who finally accepted my travel to Canada although she was never convinced, for all the worries she endured during my Ph.D. and for all her patience while I was following my dreams, even against her own. Thanks to my dearest sister, Mojgan, who was my link to Iran; she always took care of whatever I needed to be done in Iran, and always motivated me with her typically sweet words. Many thanks to my brother, Mohammad, who always supported me in all my pursuits and of whom we are always proud. I am also grateful to Hamed, my brother-in-law, who called me many times from Iran to tell me that they all love me and miss me. To my kindest aunt, Azra, who always teaches us that one can still smile when life is passing through its most difficult and challenging stage.

Most of all, I would like to express my deep gratitude to my colleague, my friend, my love, Andrea, who cleared many of the obstacles that I faced along my Ph.D. path, and who, from the first moments of my arrival in Italy, generously shared his research experience with me. Many thanks for all his faithful support, patience, and encouragement during the difficult stages of my Ph.D. thesis. Thanks for his presence in my life, for all the happiness he brought with him, and for making me feel that I am able to make all my dreams come true.

Mina Sheikhalishahi Laval University Quebec, Canada

Chapter 1

Introduction

The term spam became well known from a sketch in the comedy program “Monty Python’s Flying Circus”, in which a waitress offers dishes containing an unknown ingredient called spam, the name of a brand of canned meat produced by the American company Hormel Foods Corp. In the sketch, every dish in the restaurant is served with lots of spam, and the waitress repeats the word spam again and again while describing how much spam is on each plate. A group of “Vikings” in the corner then starts a song: “Spam, spam, spam, spam, spam, spam, spam, spam, lovely spam! Wonderful spam!” Hence, the term came to refer to something that keeps repeating, to great annoyance. Owing to the success of this program, and probably because the canned meat was the only nutritious food available in England during the Second World War, the term “SPAM” came to indicate something inevitably omnipresent. The name was later carried over to unwanted electronic messages. It is believed that the first spam email was sent on 1 May 1978 by Digital Equipment Corporation to advertise a new product; it went to all the ARPAnet users on the West Coast of the United States, a few hundred people. Only many years later, in January 1994, was the first large-scale unwanted commercial message distributed across USENET, titled “Global Alert for All: Jesus is Coming Soon”. It was posted to every newsgroup, and the term has since indicated unwanted messages sent massively to unwilling recipients.

More precise definitions of spam email were introduced later in the literature. [8] define spam email, also known as junk email or unsolicited bulk email, as an electronic message sent in bulk against the will of the receiver. [83] define spam email as an unwanted email sent indiscriminately by a sender who has no current relationship with the receiver.

Nowadays, spam emails are not just undesired advertisements. The problem of unsolicited emails imposes enormous costs on companies and private users [113], [83], [84]. Currently proposed approaches [30], [46], [123], although quite effective at stopping spam emails from being delivered to end users' inboxes [21], [89], do not offer a methodology for organizing huge volumes of messages so as to fight the root of the problem, i.e. the spammer.

Any effort in this regard requires a first analysis of a large amount of spam emails, mostly collected in honeypots. This first analysis demands grouping a huge amount of data into smaller groups, named spam campaigns, each of which is supposed to originate from the same source (spammer). Then, a classifier must be trained to label and group new spam emails. Furthermore, the large set of detected spam campaigns should be ordered automatically according to the investigators’ priorities.

Figure 1.1 – Steady volume of spam.

To this end, in the present thesis, we first propose a fast and effective categorical clustering algorithm, named CCTree, to detect spam campaigns on the basis of the structural similarity of messages. Afterwards, we propose a self-learning methodology to automatically label detected spam campaigns based on the goal of the spammer. The labeled campaigns are then ranked automatically according to a set of ranking priorities. A semiring-based formal method is proposed to abstract the CCTree representation. The abstract form is used to formalize the process of clustering spam emails on parallel computers, which may help to speed up spam campaign detection.

1.1 Motivation

Being incredibly cheap to send, spam messages are widely used by adversaries to steal money, distribute malware, advertise goods and services, etc. The 2015 Cisco report [36] shows that although adversaries develop ever more sophisticated techniques to breach network defenses, spam emails continue to play a major role in these attacks, and the worldwide volume of spam has remained relatively constant (Figure 1.1). Furthermore, the same report [36] shows that 4.5 billion emails are blocked every day. The Internet Threats Trend Report [114] estimates that 54 billion spam emails were sent per day in 2014. According to the McAfee 2015 report [100], unsolicited emails made up more than 70 percent of all email messages in 2014 (Figure 1.2).

Figure 1.2 – McAfee Report 2015.

Microsoft and Google [113] estimate that spam emails cost American firms and consumers up to 20 billion dollars per year. Ferris Research estimated the worldwide cost of spam in 2005 at $50 billion, and raised its estimate to $100 billion in 2007 and $130 billion in 2009 [112]. [83] report that 382 million mailing attempts resulted in 28 sales. Yahoo! data on similar “high ticket” items, sold at a marginal profit of more than $50, shows conversion rates of about 1 in 25,000 [112].
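To put the two conversion figures quoted above on the same scale, the following small sketch expresses each as "one sale per N attempts". It only rearranges the numbers cited from [83] and [112]; the function name `one_in` is introduced here for illustration, and the comparison is indicative only, since the two figures come from different settings.

```python
def one_in(successes, attempts):
    """Return N such that the conversion rate is 'one success per N attempts'."""
    if successes <= 0:
        raise ValueError("need at least one success")
    return attempts / successes

# Figures quoted in the text: 28 sales out of 382 million mailing attempts [83].
spam_one_in = one_in(28, 382_000_000)
# Conversion rate quoted for comparable "high ticket" items [112].
high_ticket_one_in = 25_000

ratio = spam_one_in / high_ticket_one_in
print(f"spam campaign: about one sale per {spam_one_in:,.0f} attempts")
print(f"that is roughly {ratio:,.0f} times lower than 1 in {high_ticket_one_in:,}")
```

Under these figures the spam campaign converts at roughly one sale per 13.6 million attempts, which is why only a negligible sending cost makes such campaigns worthwhile.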

The problem of undesired electronic messages has become a serious issue because of the many troubles that spam causes the Internet community. [5] categorize spam losses into three groups, named direct losses, indirect losses, and defense costs, and call the sum of these the society losses of spam. The losses proposed in [5] are listed below:

Direct losses by spam:

• “Money withdrawn from victim accounts
• Time and effort to reset account credentials (for banks and consumers)
• Distress suffered by victims
• Secondary costs of overdrawn accounts : deferred purchases, inconvenience of not having access to money when needed
• Lost attention and bandwidth caused by spam messages, even if they are not reacted to.”

Indirect losses by spam:

• “Loss of trust in online banking, leading to reduced revenues from electronic transaction fees, and higher costs for maintaining branch staff and cheque clearing facilities
• Missed business opportunity for banks to communicate with their customers by email
• Reduced uptake by citizens of electronic services as a result of lessened trust in online transactions
• Efforts to clean-up PCs infected with malware for a spam sending botnet”

Defense costs of spam:

• “Security products such as spam filters, antivirus, and browser extensions to protect users
• Security services provided to individuals, such as training and awareness measures
• Security services provided to industry, such as website take-down services
• Fraud detection, tracking, and recuperation efforts
• Law enforcement
• The inconvenience of missing messages falsely classified as spam”

The large amount of spam traffic among servers delays the delivery of legitimate emails; sorting out unsolicited messages takes time; and in classifying messages as spam or legitimate there is a risk of deleting an important email by mistake. Taken together, the problems caused by spam emails create an unbearable situation for everyone who uses the Internet.

To give better insight into the direct and indirect losses of spam, we briefly present some reports. Microsoft and Google [113] estimate that spam emails cost American firms and consumers up to 20 billion dollars per year, whilst [83], [84] show that a successful spam campaign can earn revenues of between $400k and $1000k. [133] estimated that the Cutwail botnet earned around $1.7 million to $4.2 million in one year for providing spam services. It has been calculated that a company with 1000 employees loses about $500,000 per year in productivity costs resulting from spam messages.

The most popular solution to the problem of spam is filtering [21]. Spam filtering can be defined as a methodology for dividing messages into spam and legitimate [21]. Currently, the most used approach for fighting spam emails consists in identifying and blocking them on the recipient machine through filters [30], [46], [123], which are generally based on machine learning techniques or content features [22], [138], [139].

Although existing filtering algorithms often show an accuracy of more than 90% in experimental evaluations [21], [89], they do not stop spammers from imposing considerable costs on users and companies [113]. We believe the reason may be that the spammer, the root of the problem, perceives only a minimal risk of being caught or traced.

To fight the problem of spam emails effectively, it is necessary to find and prosecute the spammers, who generally hide behind botnets, i.e. complex networks of infected devices that send spam emails against their users' will. Because of botnets, identifying the spammer is a difficult task, though a possible one [142], [149], [45].

To simplify this analysis, a huge amount of spam emails must first be divided into spam campaigns. A spam campaign is the set of messages spread by a spammer with a specific purpose [27], such as advertising a specific product, spreading ideas, or criminal intent, e.g. phishing. Grouping spam messages into spam campaigns reveals behaviors that may be difficult to infer when a large collection of spam emails is viewed as a whole [132]. It is worth noting that grouping a large amount of spam emails into smaller groups is an unsupervised learning task, because no labeled data is available at the outset to train a classifier. The proposed approach for clustering spam messages should rest on the premise that the general appearance of messages belonging to the same spam campaign remains mainly unchanged, although spammers usually insert random text or links [27]. The rationale is that two messages in the same format, i.e. similar language and size, the same number of attachments, the same number of links, etc., are more likely to originate from the same source (spammer) and thus to belong to the same campaign. Hence, the discriminative structural features of messages need to be selected correctly. Furthermore, the clustering algorithm should be fast and effective in grouping junk emails into spam campaigns.
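As an illustration of this premise, the following minimal sketch maps each raw email to a tuple of categorical structural features and groups messages whose tuples coincide. It is not the CCTree algorithm: exact-match grouping stands in for it purely for illustration, and the feature choice, the bucket thresholds, the link pattern, and the sample messages are all assumptions made for this example.

```python
import re
from collections import defaultdict
from email import message_from_string

URL_RE = re.compile(r"https?://\S+")  # simplistic link matcher (assumption)

def size_bucket(n_chars):
    # Hypothetical thresholds; any fixed discretization would do.
    if n_chars < 200:
        return "short"
    if n_chars < 2000:
        return "medium"
    return "long"

def link_bucket(n_links):
    if n_links == 0:
        return "none"
    return "few" if n_links <= 3 else "many"

def structural_features(raw):
    """Return (message type, size bucket, link bucket, attachment count)."""
    msg = message_from_string(raw)
    texts, n_attachments = [], 0
    for part in msg.walk():
        if part.is_multipart():
            continue
        if part.get_filename():  # a named part is counted as an attachment
            n_attachments += 1
        elif part.get_content_maintype() == "text":
            payload = part.get_payload(decode=True) or b""
            texts.append(payload.decode("utf-8", errors="replace"))
    body = "\n".join(texts)
    return (msg.get_content_type(), size_bucket(len(body)),
            link_bucket(len(URL_RE.findall(body))), n_attachments)

def group_by_structure(raw_emails):
    """Group email indices by identical feature tuple."""
    groups = defaultdict(list)
    for i, raw in enumerate(raw_emails):
        groups[structural_features(raw)].append(i)
    return dict(groups)

emails = [
    "From: a@example.com\nSubject: Deal 123\n\nBuy now http://a.example/x1 cheap!",
    "From: b@example.com\nSubject: Deal 987\n\nBuy now http://b.example/q9 today!",
    "From: c@example.com\nSubject: Hello\n\nNo links in this message.",
]
groups = group_by_structure(emails)
```

The first two sample messages differ in sender, subject, and link, yet share the same structural tuple and therefore fall into one group, while the third is separated by its link bucket. A real clustering algorithm would tolerate partial mismatches rather than require identical tuples.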

Afterwards, each campaign should be assigned a label describing the purpose of the spammer. This goal-based labeling makes it easier for investigators to analyze spam campaigns, possibly directed toward a specific cybercrime. Moreover, labeling spam campaigns by the goal of the spammer can help to rank them.

Ranking spam campaigns according to the investigator’s priorities provides an ordered set of campaigns from which the investigator can decide which ones to analyze first, a difficult task when a large number of detected spam campaigns is viewed as a whole.

It is not uncommon for a data mining process to take several days or weeks to complete. Parallel computing systems bring a significant benefit, namely high performance, to the implementation of massive databases [33]. Parallel clustering is a methodology proposed to alleviate the time and memory demands of clustering large amounts of data [94], [18]. Because of the huge number of spam emails received, which grows every hour (8 billion per hour) [110], [101], and because of the high variance that related emails may show owing to obfuscation techniques [108], it would be helpful to distribute the clustering process over several parallel computers. Parallel clustering speeds up the process of grouping unwanted messages into spam campaigns.

In the present thesis, we address all aforementioned issues related to spam campaign detection, analysis, labeling, and speeding up the process through parallelism with the use of formal methods. In what follows, the contribution of the thesis is explained in detail.

1.2 Main Contributions

The main contributions of this thesis can be summarized as follows:

• We propose a categorical clustering algorithm, named CCTree, designed to divide spam emails into smaller groups, named spam campaigns, based on structural similarity. The main hypothesis is that some parts of spam emails belonging to the same spam campaign remain unchanged. The CCTree has a tree-like structure in which the leaves represent the desired spam campaigns ([126]).
• A set of 21 categorical features is presented that characterizes the structure of spam emails. An extensible and portable framework is provided to automatically extract the proposed features from raw emails. These features represent the structure of an email well, and some of them hardly change when a spammer creates a spam campaign ([129]).
• We propose, and validate through the analysis of 200k spam emails, a methodology for choosing the optimal CCTree configuration parameters. The proposed technique shows that once the input parameters of CCTree are chosen for a dataset, they can be reused for similar datasets of comparable size ([129]).
• We show the effectiveness and efficiency of CCTree in clustering emails into campaigns through two well-known kinds of evaluation: internal evaluation, i.e. the ability of CCTree to obtain homogeneous clusters, and external evaluation, i.e. the ability to effectively classify similar elements (emails) when the classes are known beforehand ([129]).
• We propose a framework, named Digital Waste Sorter (DWS), which exploits a self-learning approach based on the goal of the spammer for spam email classification. The proposed approach aims at automatically classifying large amounts of raw, unclassified spam emails by dividing them into campaigns and labeling each campaign with its spammer goal. To this end, we propose five class labels that group spammer goals into five macro-groups, namely Advertisement, Portal Redirection, Advanced Fee Fraud, Malware Distribution, and Phishing ([128]).
• A ranking methodology is proposed to order sets of spam campaigns on the basis of investigator priorities. The proposed approach extracts five ranking features from each discovered spam campaign, according to investigator priorities. These features, including the spammer-goal label of the campaign, are used to automatically assign a grade to each spam campaign. The spam campaigns are then ordered by their grades.
• A semiring-based formal method, named Feature-Cluster Algebra, is proposed to abstract the representation of CCTree. The resulting term equivalent to a CCTree is called a CCTree term. Through several theorems we prove that the proposed algebraic structure, under some conditions, fully abstracts the tree representation. A rewriting system is proposed to automatically verify whether a term is a CCTree term ([127]).
• The abstract schema of CCTree is applied to formalize CCTree parallelism. The parallelism approach can be applied to speed up the clustering process on parallel computers. To formalize CCTree parallelism, a set of rewriting rules is provided to obtain a final CCTree from the CCTrees produced by the parallel computers. Through a set of examples and theorems, we show how the proposed approach works.

1.3 Thesis Outline

The present thesis is structured as follows. First, Chapter 2 synthesizes related work on spam campaign detection, labeling, and formalization. In Chapter 3, we propose a categorical clustering algorithm, named CCTree, to cluster spam emails based on structural similarity (step 1 in Figure 1.3); the result of this step is a set of spam campaigns, which are the leaves of the CCTree (step 2 of Figure 1.3). The effectiveness and efficiency of CCTree in spam campaign detection are presented in Chapter 4. In Chapter 5, we propose a self-learning approach to label spam campaigns on the basis of the spammers' goals (steps 3 and 4 of Figure 1.3) and rank the labeled spam campaigns (step 5 of Figure 1.3). These steps are sufficient to divide a large amount of spam emails into spam campaigns. On the other hand, one well-known technique for speeding up clustering algorithms is parallel clustering. In the rest of the thesis, we formalize CCTree parallelism, so that the whole dataset can be divided among parallel computers (steps 6 and 7 of Figure 1.3). In Chapter 6, we abstract the CCTree representation with the use of a well-known algebraic structure, the semiring. We prove that the proposed algebraic technique abstracts the tree representation. The formal representation of a CCTree is named a CCTree term. We propose a rewriting system to verify whether a term is a CCTree term. The CCTree term is used to formalize CCTree parallelism by means of a rewriting system (step 8 of Figure 1.3). The result of the final CCTree is the set of spam campaigns (step 10 of Figure 1.3), which can be passed to the previously explained parts of the framework to be labeled and ranked. We conclude with future directions in Chapter 7.

Figure 1.3 – The framework of thesis.

Chapter 2

State of the Art

In line with the growing concerns regarding spam messages, an increasing number of works have been dedicated to the problem, studying it from different angles. In this chapter, we present a comprehensive review of the literature on spam emails that is directly or indirectly related to our work. At the end of the chapter, we present studies on formal methods applied to the representation of feature models. We discuss how these formal approaches resemble, and differ from, the semiring-based formalization technique that we propose for abstracting a feature-based categorical clustering algorithm and, ultimately, for speeding up the clustering process through parallelism.

2.1 Spam Emails Issues

In this section we explain different problems of spam emails discussed in the literature.

Botnets are one of the main topics related to spam emails and have received wide attention in recent years. [76] report that more than 85% of worldwide spam is sent by botnets. The term botnet refers to a group of compromised host computers that are controlled by a small number of commander hosts referred to as command and control (C&C) servers. Compromised machines on the Internet are generally referred to as bots, and the set of bots controlled by a single entity is called a botnet [153]. In other words, a botnet is a network of “zombie” computers infected by malicious software (or “malware”) designed to enslave them to a master computer. The malware is installed in a variety of ways, such as downloading an attachment received in a spam email [25], [78], [35].

[146] perform a large-scale analysis of botnet characteristics and identify trends that can benefit future botnet detection and defense mechanisms. The proposed framework is based on the premise that botnet spam emails are mostly sent in an aggregate fashion, resulting in content prevalence similar to worm propagation. The focus of the research is on URLs embedded in email content. Using three months of spam emails collected from Hotmail, the proposed framework, named AutoRE, yielded several interesting results regarding the degree of email obfuscation, properties of botnet IP addresses, sending patterns, and their correlation with network scanning traffic. [79] present a platform, named Botlab, which continually monitors and analyzes the behavior of spam botnets. The result of this study shows that six botnets are responsible for 79% of the spam messages arriving at the University of Washington campus. [96] first discuss the fundamental concepts of botnets, including formation and exploitation, lifecycle, and two major kinds of topologies, and then introduce several related attacks together with detection, tracing, and countermeasures. [47] propose a spam zombie detection system, named SPOT, based on the Sequential Probability Ratio Test, which monitors the outgoing messages of a network. Using a two-month e-mail trace collected in a large US campus network, they show that SPOT is an effective and efficient technique for automatically detecting compromised machines in a network. [52] apply the PageRank approach, with an additional clustering algorithm, to efficiently detect stealthy botnets that communicate peer-to-peer. [133] provide interesting statistics about botnets: after two hours about 29.6% of bots are blacklisted, and 46.4% are blacklisted after three hours. By six hours, roughly 75.3% are blacklisted. The rate reaches 90% after a period of about 18 hours. [142], [149], [45] propose several approaches to find the botmaster through stepping stones. [13], [122], [116], [107] provide a brief look at existing botnet research, the evolution and future of botnets, and the goals and visibility of today's networks, in order to inform the field of botnet technology and defense.

Another topic related to the problem of spam emails is the cost of spam messages and the revenue of spammers. [119] argue that marketing based on spam emails has the advantage of costing the sender very little; hence, the sender sends a large number of messages to maximize the return. Several studies focus on what spammers earn from spam campaigns. The conversion rate of spam marketing is discussed in [83], while the underground economy of spam is analyzed in [133], [112], and [134]. [133] show that spam-as-a-service can be purchased for approximately $100–$500 per million emails sent. Botnets can also be rented to groups interested in sending larger amounts of tailored spam emails: a capacity of 100 million emails per day costs $10,000 per month. In the same study, the authors estimate that the Cutwail operators may have paid between $1,500 and $15,000 on a recurring basis to grow and maintain their botnet, and that the largest email address list (containing more than 1,596,093,833 unique addresses) is worth approximately $10,000–$20,000 at advertised prices. Finally, the profit of the Cutwail gang from providing spam services is estimated at around $1.7 million to $4.2 million since June 2009. They also observed that several individuals offer 10,000 malware installations for approximately $300–$800, and that rates for one million email addresses range from $25 to $50, with discounted prices for bulk purchases. [84] show that a successful spam campaign can earn revenues between $400k and $1000k. The other side of the cost of spam has been evaluated as productivity cost. To measure the cost of spam emails in terms of productivity, suppose that an employee earns $80k per year on average and works 220 days per year. Suppose further that he receives 100 messages per day, 40 of which are spam, and that reading and deleting a message takes 5 seconds on average. Then he earns about $45 per hour (assuming an eight-hour working day) and needs about 3 minutes per day just to delete spam emails, so he loses about $2.25 per day checking spam messages. This means that a company with 1000 employees loses about $500,000 per year in productivity because of spam messages.
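The productivity estimate above can be reproduced with a few lines of arithmetic. The following Python sketch is only illustrative: the function name and the eight-hour working day are assumptions made here (the text states the hourly wage directly). Without the intermediate rounding used in the text, the figures come out slightly higher than the quoted $2.25 per day and $500,000 per year.

```python
def spam_productivity_cost(annual_salary, working_days, spam_per_day,
                           seconds_per_spam, employees, hours_per_day=8.0):
    """Return (hourly wage, daily cost per employee, yearly cost for the company)."""
    # Hourly wage derived from the yearly salary (hours_per_day is an assumption).
    hourly_wage = annual_salary / (working_days * hours_per_day)
    # Time spent reading and deleting spam each day, expressed in hours.
    hours_lost_per_day = spam_per_day * seconds_per_spam / 3600.0
    daily_cost_per_employee = hourly_wage * hours_lost_per_day
    yearly_cost = daily_cost_per_employee * working_days * employees
    return hourly_wage, daily_cost_per_employee, yearly_cost


# Figures taken from the text: $80k salary, 220 days, 40 spam emails, 5 s each.
hourly, daily, yearly = spam_productivity_cost(80_000, 220, 40, 5, 1000)
print(f"hourly wage: ${hourly:.2f}")
print(f"daily loss per employee: ${daily:.2f}")
print(f"yearly loss for 1000 employees: ${yearly:,.0f}")
```

Rounding the daily time to 3 minutes and the wage to $45 per hour, as done in the text, gives the quoted $2.25 per day.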

The other main focus of research related to the problem of spam emails is spam filtering. Spam filtering is based on the analysis of message contents and additional information, and tries to distinguish spam messages from legitimate ones [143], [21]. Generally, a spam filter is an application that implements a function of the following form:

\[
f(m, \theta) =
\begin{cases}
C(\text{spam}) & \text{if the message } m \text{ is spam} \\
C(\text{leg}) & \text{if the message } m \text{ is legitimate}
\end{cases}
\]
where m is a message to be classified, θ is a vector of parameters, and C(spam) and C(leg) are the labels assigned to the message.

Spam filtering is mostly performed with machine learning algorithms, e.g. Naive Bayes approaches [9], [8] and other classifiers [75], [151], [90], [22], [138], [139]. The approaches proposed in the literature for filtering spam emails cover a variety of topics, and [29] presents an overview of them. Text analysis, which characterizes spam emails by their use of special words, is another applicable approach in the field of spam filtering. To this end, [48] apply lazy learning algorithms to tackle concept drift in spam filtering, while [80] use n-grams in a word-based anti-spam approach. Spammers then started to obfuscate the text in spam messages, or to embed the text in images, to avoid being identified through text filtering techniques. Image spam filtering methodologies [10], [20] were considered in order to block these kinds of spam messages.

Nevertheless, despite the growing research on spam filtering, which often reports accuracy above 90% [21], the evolution of spam messages is still considerable. In fact, a filter prevents end users from wasting their time on junk messages, but it does not stop the misuse of resources, since the messages are delivered anyway [21]. We believe the reason could be that the spammer, the root of the problem, perceives only a minimal risk of being caught.

To effectively fight the problem of spam emails, it is mandatory to find and prosecute the spammers, who generally hide behind complex networks of infected devices that send spam emails against their users' will, i.e. botnets. Because of botnets, identifying the spammer is a difficult task, although a possible one [142], [149], [45]. To this end, it is first required to divide the huge amount of spam emails, efficiently and effectively, into groups that help to catch the spammer.

2.2 Clustering Spam emails into Campaigns

Detecting a spammer, analyzing his behavior, and deciding which spammers should be pursued first constitute an extremely challenging task, owing to the huge amount of spam emails, which increases vastly every hour (8 billion per hour) [110], [101], and to the high variance that related emails may show because of obfuscation techniques [108]. To this end:

• First of all, a fast and effective clustering algorithm is required to divide the huge amount of spam messages into smaller groups, each representing a spam campaign originating from the same source (spammer).

In the research field of spam emails, several works exist which cluster spam emails into spam campaigns.

The basic idea in [87] for identifying spam campaigns relies on keywords or strings that stand for specific types of campaigns. For example, all templates containing the string linksh are defined as a type of self-propagation campaign. Several campaign types related to the same spammer purpose constitute a campaign class. The purpose of a spam campaign is identified on the basis of keywords in the text or subject. The set of messages containing no text, and only the image feature, belongs to the image campaign. Finally, 10 spam campaign classes are presented: 1) Image spam, 2) Job ads, 3) Other ads, 4) Personal ads, containing fake dating/matchmaking advance money scams, 5) Pharma, containing pointers to web sites selling Viagra, Cialis, etc., 6) Phishing, which lures victims into entering sensitive information, 7) Political campaigning, 8) Self-prop, i.e. spam messages that trick victims into executing Storm binaries, 9) Stock scam, which tricks victims into buying a particular penny stock, and 10) Other. Manual selection of keywords repeatedly requires too much effort, while spammers frequently change the keywords they use. Moreover, spammers continuously fight keyword-based approaches by means of obfuscation techniques.

[87] infer that 65 percent of instances last less than 2 hours; the longest-lived ones are pharmaceutical campaigns, which were active for months, and a crucial self-propagation campaign, which ran for 12 days. Three large campaigns, namely Pharma, self-propagation and stock storm, have a large number of unique headers in their templates, but Pharma and self-propagation actually have few different bodies. The authors suggest that it may be better to cluster on headers to identify these three campaigns and then try to identify the other campaigns using other techniques.

Although the authors of [54] focus on the analysis of spam URLs on Facebook, their study of URLs and their clustering of spam messages are similar to our goal concerning spam emails. First, all wall posts that share the same URL are clustered together; then the descriptions of the wall posts are analyzed, and if two wall posts have the same description their clusters are merged. In this study, factors such as bursty activity and distributed communication are also taken into consideration. The distributed property refers to the number of users who send spam messages in a cluster: for spam emails it is usually computed from the IP addresses of the senders, while for Facebook spam messages it refers to the users' unique IDs. The bursty property comes from the rationale that most spam campaigns act within a short period of time. The threshold values for the distributed and bursty properties in this study are set to 5 and 1.5 hours, respectively. This means that if a spammer sends spam messages to fewer than 5 different accounts, or the interval between sent messages is greater than 1.5 hours, he is considered to have no important effect on the system.

Furthermore, the authors found that the techniques spammers use to attract people's attention can mostly (88.2%) be classified into three types: 1) they promise free gifts, 2) they use phrases that trigger curiosity, e.g. that someone likes the recipient, 3) they describe a product for sale.

It has been discovered that approximately 80 percent of malicious accounts are active for less than one hour and about 10 percent are active for longer than one day. In each time zone, most malicious wall posts were sent around 3 am to avoid detection, and among 187 million wall posts of 3.5 million Facebook users, 200,000 malicious wall posts were attributed to 57,000 malicious accounts.

[92] argue that spam emails with identical URLs are highly clusterable and mostly sent in bursts. In their method, if the same URL appears in spam emails from source A and source B, each with a unique IP address, the two sources are connected by an edge, and the connected components of the resulting graph are the desired clusters. It is also observed that a spammer associated with multiple groups has a higher probability of sending more spam emails in the near future. Furthermore, the authors found that a very small fraction of the active spammers account for a large portion of the total spam emails, and that the spam emails from the same group of spammers are sent in bursts.
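The grouping rule described for [92] (sources sharing a URL are linked, and connected components are the clusters) can be illustrated with a small union-find sketch. This is not the implementation of [92]: the function name, the input format (pairs of source IP and URL), and the example addresses are assumptions chosen for illustration.

```python
from collections import defaultdict


def cluster_sources_by_shared_url(emails):
    """Group source IPs into connected components, linking sources that share a URL.

    `emails` is an iterable of (source_ip, url) pairs (hypothetical input format).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_source_for_url = {}
    for source_ip, url in emails:
        find(source_ip)  # register the source even if it shares no URL
        if url in first_source_for_url:
            # An edge between this source and an earlier source with the same URL.
            union(first_source_for_url[url], source_ip)
        else:
            first_source_for_url[url] = source_ip

    groups = defaultdict(set)
    for source_ip in parent:
        groups[find(source_ip)].add(source_ip)
    return sorted(sorted(group) for group in groups.values())


example_emails = [
    ("10.0.0.1", "http://a.example/x"),
    ("10.0.0.2", "http://a.example/x"),
    ("10.0.0.2", "http://b.example/y"),
    ("10.0.0.3", "http://b.example/y"),
    ("10.0.0.4", "http://c.example/z"),
]
source_clusters = cluster_sources_by_shared_url(example_emails)
print(source_clusters)
```

In this example the first three sources end up in one cluster because they are linked through shared URLs, while the fourth source forms a cluster of its own.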

Spamscatter [4] is a method that automatically clusters the destination web sites extracted from URLs in spam emails with the use of image shingling. In image shingling, images are divided into blocks and the blocks are hashed; two images are then considered similar if 70 percent of the hashed blocks are the same. The lifetime of each detected spam campaign is computed from the first and the last spam message (in terms of time) in the campaign. The results show that over 40% of the malicious scams persist for less than 120 hours, whereas the lifetime for the same percentage of shopping scams is 180 hours and the median for all scams is 155 hours.
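The image shingling comparison can be sketched as follows. This is a simplified illustration and not the Spamscatter implementation: the image representation (rows of 0-255 pixel values), the block size, the hash function, and the way the shared fraction is computed are all assumptions; only the 70 percent threshold comes from the text.

```python
import hashlib


def block_hashes(image, block_size):
    """Split an image (list of equal-length rows of 0-255 ints) into blocks and hash each."""
    height, width = len(image), len(image[0])
    hashes = []
    for top in range(0, height, block_size):
        for left in range(0, width, block_size):
            block = bytes(
                image[row][col]
                for row in range(top, min(top + block_size, height))
                for col in range(left, min(left + block_size, width))
            )
            hashes.append(hashlib.sha256(block).hexdigest())
    return hashes


def images_similar(image_a, image_b, block_size=2, threshold=0.70):
    """Return True if at least `threshold` of the blocks of image_a also occur in image_b."""
    hashes_a = block_hashes(image_a, block_size)
    hashes_b = set(block_hashes(image_b, block_size))
    shared = sum(1 for h in hashes_a if h in hashes_b)
    return shared / len(hashes_a) >= threshold


base = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
one_block_changed = [row[:] for row in base]
one_block_changed[0][0] = 255   # alters 1 of the 4 blocks: 75% still match
two_blocks_changed = [row[:] for row in one_block_changed]
two_blocks_changed[3][3] = 255  # alters a second block: only 50% match

print(images_similar(base, one_block_changed))
print(images_similar(base, two_blocks_changed))
```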

[150] cluster spam messages based on the images they contain in order to trace the origins of spam emails. To this end, spam images are separated into two parts: the foreground, which comprises the text and/or illustrations, and the background, which consists of the colors and/or textures. Spam emails are considered visually similar if their illustrations, text, layouts, and/or background textures are similar. The clustering proceeds in two stages: first, Optical Character Recognition is used to recognize texts, whose bounding boxes represent the text layout; afterwards, the illustrations are separated from the background by detecting the background. The authors mention that the proposed approach needs to be combined with other methods to obtain better results.

[130] focus on clustering spam emails based on the IP addresses resolved from the URLs inside the body of these emails. The rationale is that, according to the authors, in many cases it is not easy to change the IP addresses, since this requires compromising a lot of computers. In this study, two emails belong to the same cluster if the IP addresses resolved from their URLs are exactly the same. Afterwards, the relationship between the spam sending systems and the malicious Web servers connected to the URLs is studied, and information such as the number of unique URLs and unique domain names is provided.
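The exact-match rule described for [130] amounts to grouping emails by the set of IP addresses resolved from their URLs. The sketch below assumes that the resolution has already been done and is given as a mapping from a hypothetical email identifier to a set of addresses; it performs no DNS lookups, and the names and addresses are illustrative only.

```python
from collections import defaultdict


def cluster_by_resolved_ips(resolved_ips_by_email):
    """Group emails whose URL-resolved IP address sets are exactly the same.

    Emails with no resolved address (e.g. no URL in the body) cannot be
    clustered by this rule and are returned separately.
    """
    clusters = defaultdict(list)
    unclustered = []
    for email_id, ips in resolved_ips_by_email.items():
        if not ips:
            unclustered.append(email_id)
            continue
        clusters[frozenset(ips)].append(email_id)
    return sorted(sorted(ids) for ids in clusters.values()), sorted(unclustered)


resolved_example = {
    "m1": {"192.0.2.10"},
    "m2": {"192.0.2.10"},
    "m3": {"192.0.2.10", "198.51.100.7"},
    "m4": set(),  # an email without any URL
}
ip_clusters, ip_unclustered = cluster_by_resolved_ips(resolved_example)
print(ip_clusters, ip_unclustered)
```

The separate list of unclustered emails makes visible the limitation that emails without URLs cannot be handled by this kind of approach.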

By examining three weeks of spam messages gathered on their SMTP server, the authors conclude that the proposed methodology outperforms clustering techniques based on domain names and URLs. The claim is justified by the facts that the domain names associated with a scam change frequently, that the period during which a URL is active is too short to perform an investigation, and that most of the time the URLs used in spam emails are unique.

In all the aforementioned works on clustering spam emails into campaigns, a pairwise comparison of every two emails is required, so the time complexity is quadratic. Furthermore, spam campaign detection is limited to one or two features of spam emails; if the spam messages do not contain the relevant feature, the methodology fails to cluster them. For example, the approaches of [130] and [150] fail for emails without URLs and without images, respectively.

Other limitations of the former approaches have been identified in [132], which shows that considering only the IP addresses resolved from URLs is insufficient for dividing emails into spam campaigns. More precisely, since web servers host many domains under the same IP address, every spam campaign identified by such means (as in [130]) is in fact made up of a large amount of spam emails sent by different controlling entities.

Thus, [130] propose a new technique for spam campaign detection, named O-means clustering, which is based on the K-means clustering algorithm. The distance between two spam messages is calculated from 12 features extracted from the emails, which are expressed as numbers, and the distance is computed with the Euclidean measure. The 12 features are 1) size of email, 2) number of lines, 3) number of unique URLs, 4) average length of unique URLs, 5) average length of domain names, 6) average length of query, 7) average number of key-value pairs, 8) average length of path, 9) average length of keys, 10) average length of values, 11) average number of dots in domains, 12) number of global top 100 URLs.

The limitation of O-means is that it requires the number of clusters to be known from the beginning, which is generally not a realistic hypothesis. Moreover, the applied features are treated as numerical, which does not represent reality well, especially when the distance between two emails is based on the number of links taken numerically: for example, an email with one link is considered closer to an email with 10 links than to one with 11 links.
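The effect of treating the number of links as a numerical feature can be made concrete with a short sketch. The Euclidean distance is the measure named in the text; the mismatch distance shown next to it is a hypothetical categorical alternative added here for contrast and is not taken from the cited work.

```python
import math


def euclidean_distance(features_a, features_b):
    """Euclidean distance between two numerical feature vectors."""
    return math.dist(features_a, features_b)


def mismatch_distance(features_a, features_b):
    """Illustrative categorical distance: the number of features whose values differ."""
    return sum(1 for a, b in zip(features_a, features_b) if a != b)


# One-dimensional feature vectors holding only the number of links.
one_link, ten_links, eleven_links = (1,), (10,), (11,)

print(euclidean_distance(one_link, ten_links))     # 9.0
print(euclidean_distance(one_link, eleven_links))  # 10.0
print(mismatch_distance(one_link, ten_links))      # 1
print(mismatch_distance(one_link, eleven_links))   # 1
```

Under the Euclidean measure the email with one link is closer to the email with 10 links than to the one with 11, whereas the categorical view only records that the values differ.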

After clustering spam emails with the O-means method, [131] found that the 10 largest clusters had sent about 90 percent of all spam emails. Hence, the authors investigate these 10 clusters and apply a heuristic analysis to select the significant features among the 12 features used in the previous work. As a result, they select the four most important features, which effectively separate these 10 clusters from each other: “Size of emails”, “Number of lines”, “Length of URLs” and “Number of dots”. However, the authors mention that this is not the best method for selecting the most significant features, since it is based on an analysis of the top 10 clusters only. Nevertheless, it achieves almost the same clustering accuracy as the previous method with 12 features: the accuracy changes from 86.63 percent to 86.33 percent, a negligible difference, while the execution time decreases from 28,772 seconds to 6,124 seconds.

[144] first extract eleven features from each spam email: “Message Id”, “Sender IP address”, “Sender Email”, “Subject”, “Body Length”, “Word Count”, “Attachment File Name”, “Attachment MD5”, “Attachment Size”, “Body URL”, and “Body URL Domain”. Some attributes are broken down into two sub-attributes, for example “Body URL” into “Machine Name” and “Path”.

Afterwards, two clustering algorithms are applied to divide the spam messages. First, an agglomerative hierarchical algorithm [66] is used to cluster the whole data set based on a comparison of message subjects. At the beginning, each email is a cluster by itself, and then clusters sharing a common subject are merged. The distance D(i, j) between two clusters i and j is equal to 0 if they share a common value of an attribute and equal to 1 otherwise; thus, when the distance between two clusters is 0, the two clusters are merged. The authors find that after the first merge based on the subject, 67% of the messages are attributed to one cluster. To reduce the false positive rate for big clusters, a connected components algorithm with weighted edges is applied. A connected component [12] of an undirected graph is a set A of vertices such that, for each vertex v ∈ A, the set of vertices reachable from v by a path is exactly A. The weight of an edge represents the strength of the connection between two vertices. In this approach, edges connect two spam emails based on the eleven attributes, and the desired clusters are the connected components of this graph whose weights are above a specified threshold.
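The weighted connected components step can be sketched as below. The text does not specify how edge weights are computed, so this sketch assumes that the weight of an edge is the number of attributes on which two emails agree, and that an edge is kept when this weight reaches a minimum value; the attribute names, the `min_shared` parameter, and the example data are hypothetical.

```python
from itertools import combinations


def shared_attribute_count(email_a, email_b):
    """Assumed edge weight: the number of attributes with identical, non-empty values."""
    return sum(
        1 for key, value in email_a.items()
        if value is not None and email_b.get(key) == value
    )


def weighted_components(emails, min_shared):
    """Connected components of the graph whose edges have weight >= min_shared."""
    ids = list(emails)
    neighbours = {email_id: set() for email_id in ids}
    # Pairwise comparison of all emails (quadratic in the number of emails).
    for a, b in combinations(ids, 2):
        if shared_attribute_count(emails[a], emails[b]) >= min_shared:
            neighbours[a].add(b)
            neighbours[b].add(a)

    seen, components = set(), []
    for start in ids:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            component.append(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        components.append(sorted(component))
    return components


sample = {
    "m1": {"subject": "Cheap meds", "sender_ip": "192.0.2.1", "body_url_domain": "a.example"},
    "m2": {"subject": "Cheap meds", "sender_ip": "192.0.2.2", "body_url_domain": "a.example"},
    "m3": {"subject": "Cheap meds", "sender_ip": "192.0.2.3", "body_url_domain": "b.example"},
}
print(weighted_components(sample, min_shared=1))  # subject alone links all three
print(weighted_components(sample, min_shared=2))  # a stricter threshold splits off m3
```

With the lower threshold a shared subject is enough to merge all three emails, which mirrors the oversized cluster obtained from subject-only merging; raising the threshold separates the email that shares only the subject.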

The main drawback of this methodology is that it cannot be applied to large datasets, since the pairwise comparison is performed several times for each pair of emails in the dataset.

The basic hypothesis in [27] for clustering spam emails is that some parts of spam messages are static with respect to recognizing a spam campaign. In this work, as an improvement on [92], URLs are not the only feature considered for clustering. To identify spam campaigns, several features are extracted from spam emails, namely “language of email”, “message layout”, “type of message”, “URLs” and “subject”. Afterwards, the frequencies of these features in a large dataset are computed in order to cluster spam messages with the use of an FP-Tree. The Frequent Pattern Tree (FP-Tree), proposed by [67], is a signature-based method in which each node after the root depicts a feature extracted from the spam message that is shared by the subtrees beneath it. Thus, each path in this tree shows a set of features that co-occur in messages, ordered by non-increasing frequency of occurrence.
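The construction of an FP-Tree can be sketched in a few lines: features are counted over the whole dataset, each message's features are sorted by non-increasing global frequency, and the sorted list is inserted as a path in a prefix tree. This is a generic, minimal sketch of the data structure and not the implementation of [27]; the class name, the tie-breaking rule, and the feature strings are assumptions.

```python
from collections import Counter


class FPNode:
    """A node of the tree: a feature, its count, and child nodes keyed by feature."""

    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}


def build_fp_tree(messages):
    """Build a frequent-pattern tree from a list of feature lists."""
    frequency = Counter(f for features in messages for f in set(features))
    root = FPNode()
    for features in messages:
        # Most frequent features first; ties are broken alphabetically (an assumption).
        ordered = sorted(set(features), key=lambda f: (-frequency[f], f))
        node = root
        for feature in ordered:
            node = node.children.setdefault(feature, FPNode(feature))
            node.count += 1
    return root


messages = [
    ["lang=en", "type=html", "layout=TU"],
    ["lang=en", "type=html", "layout=UT"],
    ["lang=en", "type=text", "layout=T"],
]
tree = build_fp_tree(messages)
english = tree.children["lang=en"]
print(english.count, sorted(english.children))
```

Messages that share their most frequent features share a prefix of the path from the root, so a subtree below a sufficiently frequent node can be read as one candidate campaign.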

Applying an FP-Tree for spam campaign detection, as in [27] and [44], has several limitations. First of all, regarding URL similarity, since each token of a URL is considered as a feature, it fails to distinguish dynamic URLs in emails belonging to the same campaign [27]. Moreover, considering the tokens of URLs as features means that a spam email containing several URLs may be assigned to several campaigns.

Moreover, regarding layout detection, the FP-Tree is too sensitive to very small changes in the layout. More precisely, the FP-Tree approach reads each message line by line and represents the layout as a string of letters, e.g. UTBUUB, where the i-th letter in the string represents the i-th line of the spam message; e.g. if U is the first letter of the layout string, the first line of the message contains a URL. Considering that spammers use several techniques for random text and URL obfuscation, two very similar emails belonging to the same spam campaign may be considered to have two different layouts in the FP-Tree, just because the random text wraps onto the next line in one email but not in the other.
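The sensitivity of the layout string can be seen in a small sketch. The text only states that U marks a line containing a URL; reading T as a text line and B as a blank line, as well as the URL test used below, are assumptions made for this illustration.

```python
def layout_string(message_lines):
    """Encode a message as one letter per line: U (URL), B (blank), T (other text)."""
    letters = []
    for line in message_lines:
        stripped = line.strip()
        if not stripped:
            letters.append("B")
        elif "http://" in stripped or "https://" in stripped:
            letters.append("U")
        else:
            letters.append("T")
    return "".join(letters)


# Two near-identical emails: the random padding wraps onto a second line in email_b.
email_a = ["Buy now cheap pills", "http://a.example/x", ""]
email_b = ["Buy now cheap", "pills qzxv", "http://a.example/y", ""]

print(layout_string(email_a))  # TUB
print(layout_string(email_b))  # TTUB
```

Since the two layout strings differ, a method that requires identical layouts would place these emails in different groups even though they differ only in line wrapping.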

In summary, the previous works on clustering spam emails can mainly be divided into two categories: the first group focuses on the pairwise comparison of emails, for example URL comparison, while the second group uses a clustering algorithm, for example O-means clustering. In general, the aforementioned works suffer from at least one of the following problems: 1) they consider only one or two features for grouping spam messages, which decreases the accuracy; 2) they use pairwise comparison, with quadratic time complexity; 3) they require the number of clusters as prior knowledge; 4) they do not focus on the features that create pure clusters. In our proposed methodology for clustering spam emails into campaigns, we try to address these problems.

2.3 Labeling and Ranking Spam Campaigns

• In the next step, to address the spam message problem, an approach is required to label the detected spam campaigns, in order to train a classifier on the labeled set of messages, and then to establish an order among the detected spam campaigns according to the investigator's priorities.

In the literature, spam campaigns are usually labeled based on characteristic strings (keywords) representing individual campaign types, as in [44], [88] and [55]. As explained, in these works the occurrence of a specific string in a spam message means that the spam is labeled as a pre-identified type of spam campaign. For example, all templates containing the string linksh are defined as a type of self-propagation campaign. First of all, manual string selection requires a lot of time, while spammers frequently change the set of words in the body of messages by applying obfuscation techniques. Moreover, it is worth noticing that many spammers use the same words, like “viagra”, to deceive the victims. Hence, training a classifier on word-based labels is not helpful for spam campaign detection when a spam campaign is defined according to our need, i.e. as originating from the same source.

[106] label spam campaigns on the basis of the contact information in the body of messages. To this end, the URLs, phone numbers, Skype IDs, and mail IDs used as contact information are considered for clustering spam emails into similar groups, and the contact information is taken as the label of the detected spam campaign. This methodology is effective only against emails reporting contacts, which are only a subset of all the spam emails found in the wild.

There are several approaches in the literature in which the spammer's goal is considered. However, these approaches mainly focus on detecting phishing emails and do not consider other spammer purposes. Phishing email [3], a special type of spam message, has become an enormous threat to all Internet-based commercial operations, causing non-negligible financial losses to organizations and individual users. A phisher attempts to redirect users to fake websites designed to illegally obtain a person's data, such as usernames, passwords, and credit card details, in an electronic communication [3]. In this regard, a set of features representing the structure of a phishing email is usually proposed, and a machine learning algorithm is then used to classify a set of emails as phishing or legitimate.

[50] applied 10 email features to discern phishing emails from ham (good) emails. These 10 features are: 1) IP-based URLs, 2) age of linked-to domain names, 3) nonmatching URLs, 4) “Here” links to non-modal domain, 5) HTML emails, 6) number of links, 7) number of domains, 8) number of dots, 9) containing JavaScript, 10) spam-filter output.

[17] propose a similar methodology with additional features to train a classifier for filtering phishing emails. Advanced email features are generated by adaptively trained Dynamic Markov Chains and latent Class-Topic Models. The features are divided into three main groups: basic features, dynamic Markov chain features, and latent topic model features. The basic features themselves comprise several kinds of features, e.g. structural features and link features.

[34] propose a methodology to detect phishing emails based on both machine learning and heuristics. The proposed heuristic anti-phishing system employs Gestalt and decision theory concepts in modeling similarity. [3] provide a survey of different techniques for filtering phishing emails, while Gansterer et al. [53] compare different machine learning algorithms for phishing detection. Furthermore, the latter authors propose a technique that refines previous phishing filtering approaches, in which three types of messages, namely ham, spam and phishing, are distinguished automatically. Nevertheless, the category of spam emails is not precisely characterized.

A number of works discuss different aspects of spam email attacks, ranging from the network of malware distribution [104] and PageRank spam analysis [1] to the total revenues of a range of spam-advertised campaigns [84], [83]. However, these works also analyze only specific aspects of one type of spam attack, and the detection of different types of spam attacks is not discussed. Regarding the ranking of spam campaigns, [44] consider elements relevant to Canadian law enforcement, e.g. Canadian IP addresses, “.ca” top-level domain names, and IP ranges of Canadian IP addresses.

To the best of our knowledge, the present work is the first effort to label spam campaigns according to the different goals of spammers, based on the structural features of messages, and to use the goal-based label of each campaign to order the set of detected, labeled spam campaigns.

2.4 On the Formalization of Clustering and its Applications

• As the next step, we formalize CCTree as an effective and efficient categorical clustering algorithm. The formal schema is used to formalize CCTree parallelism with the use of a rewriting system.

It is hard to find studies in the literature on the formalization of different concepts related to clustering algorithms.

[58] formalize hierarchical clustering as an Integer Linear Programming (ILP) problem with a natural objective function. The dendrogram properties of hierarchical clustering are enforced as linear constraints. The proposed formalization has the benefit that relaxing the constraints may yield novel problem variations, such as overlapping clusterings.

[103] formally define the problem of clustering in a Multi-Criteria Decision Aid (MCDA) system. As in most MCDAs, the preferences of a decision maker are modeled based on a set of decision alternatives. To find the optimal solution, the authors propose a heuristic approach, which is validated through tests on a large set of artificially generated benchmarks.

[2] propose an approach, based on set theory, to formalize the problem of data streams in clustering algorithms. A data stream refers to an infinite sequence of data. The formalization scheme makes it possible to identify and propose basic properties for the design and comparison of data stream clustering algorithms. To this end, they extend Kleinberg’s properties [86] to represent clustering partitions evolving according to the data stream behavior. They find that it is difficult to find an algorithm that complies with the expressiveness property in a data stream context.

[41] apply a predicate logic language, in the form of sets of if-then rules, to formalize heuristic rules in clustering algorithms. With this approach, it is possible to describe traditional clustering algorithms, such as k-means. However, none of the few works on formalizing clustering algorithms uses an algebraic methodology to abstract the representation of a clustering algorithm. In what follows, we present several techniques and methodologies used to formalize feature models.

Feature models are information models in which a set of products, e.g. software products or DVD player products, is represented as a hierarchical arrangement of features, with different relationships among the features [15]. Feature models are used in many applications because they can model complex systems, are interpretable, and can handle both ordered and unordered features [105]. Benavides et al. [15] argue that designing a family of software systems in terms of features makes it easier for all stakeholders to understand than when it is expressed in terms of objects or classes. The representation of feature models as a tree of features was first introduced by Kang et al. in [82], for use in software product lines. Some studies [31], [32] show that tree models combined with ensemble techniques lead to accurate performance on a variety of domains. In a feature model tree, unlike in CCTree, the root is the desired product, the nodes are the features, and the different representations of edges denote the mandatory or optional presence of features.

Hofner et al. [73], [74] were the first to apply idempotent semirings as the basis for the formalization of tree models of products, which they called feature algebra. The concept of a semiring is used to meet the needs of product families for an abstract form of expression, refinements, multi-view reconciliation, and product development and classification. The elements of the semiring in the proposed methodology are sets of products, or product families.

To give better insight into how feature algebra works, we present a brief history of product families, from definition to formalization. Furthermore, we explain that, despite our inspiration from the concept of feature algebra in formalizing tree model systems, our proposed approach differs in several aspects.

Feature-Oriented Domain Analysis (FODA) used feature models as the means to express the mandatory, optional and alternative concepts within a domain [81], [115]. For example, in a car, the transmission system is a mandatory feature and air conditioning is an optional feature, whilst the transmission system can be either manual or automatic. The part of the FODA feature model most related to formalization work is the proposed feature diagram. It builds a tree of features and captures the mandatory, optional, and alternative relationships among features.

[82] perform an analysis of the commonalities among applications in a particular domain in terms of services, operating environments, domain technologies and implementation techniques. Afterwards, they construct a model, named feature model, to capture the commonalities as an AND/OR graph. The AND nodes in this graph denote mandatory features and the OR nodes denote alternative features chosen from different applications.

[39] proposed a feature model represented by a hierarchically arranged diagram in which a parent feature is composed of a combination of some or all of its children. A parent feature and its children in this diagram can have one of the following relationships:
– And relationship, which indicates that all children must be considered in the composition of the parent feature;
– Alternative relationship, which indicates that only one child forms the parent feature;
– Or relationship, which shows that one or more children features can be involved in the composition of the parent feature;
– Mandatory relationship, which indicates that children features are required;
– Optional relationship, which shows that children features are optional.
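The five relationships listed above can be read as constraints on which children of a parent feature may be selected in a product. The sketch below encodes this reading; the function name, the string labels, and the car-related feature names are hypothetical, and treating "and" and "mandatory" identically at the level of a single parent is a simplifying assumption.

```python
def children_selection_is_valid(relationship, children, selected):
    """Check a selection of features against one parent-children relationship."""
    chosen = set(children) & set(selected)
    if relationship in ("and", "mandatory"):
        return chosen == set(children)   # every child must be present
    if relationship == "alternative":
        return len(chosen) == 1          # exactly one child
    if relationship == "or":
        return len(chosen) >= 1          # one or more children
    if relationship == "optional":
        return True                      # any subset, including none
    raise ValueError(f"unknown relationship: {relationship}")


transmission = ["manual", "automatic"]
print(children_selection_is_valid("alternative", transmission, {"manual"}))
print(children_selection_is_valid("alternative", transmission, {"manual", "automatic"}))
print(children_selection_is_valid("optional", ["air_conditioning"], set()))
```

In the car illustration, a product with exactly one kind of transmission is accepted, a product with both kinds is rejected, and air conditioning may be left out.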

Lopez-Herrejon, Batory, and Lengauer model features as functions and feature composition as function composition [97], [95].

To give better insight into how feature algebra works, we refer to an example of a product line provided in [24]. Suppose that an electronics company has a family of three product lines: MP3 Players, DVD Players and Hard Disk Recorders. All members share the set of features given in the Commonalities. A member can contain some mandatory features and might contain some optional features that another member of the same product line does not have. For instance, one product could be a DVD player that is able to play music CDs, whilst another does not have this feature. However, all the DVD players of the DVD Player product line must contain the Play DVD feature. Furthermore, it is possible to have a DVD player that is able to play several DVDs at the same time.

Different researchers have proposed different views of what a feature is or should be. A definition that is common to most (if not all) of them in Feature-Oriented Software Development (FOSD) is that “a feature is a structure that extends and modifies the structure of a given program in order to satisfy a stakeholder’s requirement, to implement a design decision, and to offer a configuration option” [72].

Mostly, a set of features are composed to create a final program, which is itself considered as a feature. Under this assumption, a feature is either a complete program which can be executed or a program increment that requires further features to lead to a complete program. The structure of a basic feature is modeled as a tree, called feature structure tree (FST), which builds the feature’s structural elements, e.g., classes, fields, or methods, hierarchically. A specified name and type information is assigned to each node of an FST, which helps to prevent the composition of incompatible nodes during feature composition [72].

The concept of product families entered the software development process from the hardware industry [72]. The reason is that software developers also prefer not to build just a single product but a family of similar products, sharing some functionalities whilst having some well-identified variabilities. These elements, known as features, can be characterized in a software family as requirements, architectural properties, components, middleware, or code. Because systems are characterized by their features, the authors of [72] call their proposed methodology feature algebra. Idempotent semirings are the basis of feature algebra, which allows a formal treatment of the aforementioned elements as well as calculations with them. Sets of products are particular models of the proposed feature algebra, which in its extended form covers product lines, refinement, product development and product classification.

The tree-like structure formalized in product family problems differs from the CCTree. In the product family structure, unlike the CCTree, the edges of the tree have no labels; only the nodes have them. Furthermore, different representations of edges convey different concepts, whilst in the CCTree there is only one edge representation.

To the best of our knowledge, we are the first to apply an algebraic structure to abstract the representation of a categorical clustering algorithm and to formalize the interesting concepts related to it, i.e. clustering parallelism. To this end, we attribute an algebraic representation to a tree structure and then, through several theorems and examples, we show that the proposed algebraic term fully abstracts the tree representation. Calling the term resulting from a CCTree the CCTree term, a rewriting system is proposed to automatically verify whether a term represents a CCTree structure or not. Furthermore, a set of rewriting rules is provided to parallelize the result of parallel clustering.

Chapter 3

Spam Campaign Detection

Spam emails constitute a fast-growing and costly problem associated with the Internet today. To fight effectively against spammers, it is not enough to block spam messages. Instead, it is necessary to analyze the behavior of spammers and to catch them. This analysis is extremely difficult if the huge amount of spam messages is considered as a whole. Clustering spam emails into smaller groups according to their inherent similarity facilitates discovering the spam campaigns sent by a spammer, in order to analyze the spammer's behavior. In this chapter, we propose a methodology to group a large amount of spam emails into spam campaigns on the basis of categorical attributes of spam messages. A new informative clustering algorithm, named Categorical Clustering Tree (CCTree), is introduced to cluster and characterize spam campaigns. The complexity of the algorithm is also analyzed and its efficiency is proved ([126]).

3.1 Introduction

Nowadays, the problem of receiving spam messages leaves no one untouched. According to a McAfee report [100], out of the 191.4 billion emails sent worldwide per day on average [110], more than 70% are spam emails. Microsoft and Google [113] estimate that spam emails cost American firms and consumers up to 20 billion dollars per year. Moreover, the Cisco report [136] shows that spam volume increased by 250 percent from January 2014 to November 2014. Spam emails cause problems ranging from direct financial losses to misuse of traffic, storage space and computational power.

Given the relevance of the problem, several approaches have already been proposed to tackle this issue. Currently, the most used approach for fighting spam emails consists in identifying and blocking them [30], [46], [123], on the recipient machine through filters, which generally are based on machine learning techniques or content features [22], [138], [139]. Alternative approaches are based on the analysis of spam botnets [79],[91],[146], [152].

Though some mechanisms to block spam emails already exist, spammers still impose non-negligible costs on users and companies [113]. Thus, the analysis of spammers' behavior and the identification of spam-sending infrastructures are of capital importance in the effort to define a definitive solution to the problem of spam emails.

Such an analysis, which is based on the structural dissection of raw emails, constitutes an extremely challenging task, due to the following factors:
– The amount of data to be analyzed is huge and grows rapidly every hour.
– New attack strategies are constantly designed, and the immediate understanding of such strategies is paramount in fighting criminal attacks carried out through spam emails (e.g. phishing).

To simplify this analysis, the huge amount of spam emails should be divided into spam campaigns. A spam campaign is the set of messages spread by a spammer with a specific purpose [27], like advertising a specific product, spreading ideas, or criminal intent, e.g. phishing. Grouping spam messages into spam campaigns reveals behaviors that may be difficult to infer when a large collection of spam emails is considered as a whole [132]. According to [27], in order to characterize the strategies and traffic generated by different spammers, it is necessary to identify groups of messages that are generated following the same procedure and that are part of the same campaign.

It is worth noting that the problem of grouping a large amount of spam emails into smaller groups is an unsupervised learning task, because no labeled data is available for training a classifier at the beginning. More specifically, supervised learning requires classes to be defined in advance and a training set with elements for each class. In several classification problems this knowledge is not available, and unsupervised learning is used instead. Unsupervised learning refers to the problem of finding hidden structure in unlabeled data [57]. The best-known unsupervised learning methodology is clustering, which divides data into groups (clusters) of objects such that objects in the same group are more similar to each other than to those in other groups [77].

However, dividing spam messages into spam campaigns is not a trivial task, for the following reasons:
– Spam campaign classes are not known beforehand, which means we need an unsupervised machine learning technique.
– Feature extraction is difficult. Finding the elements that best characterize an email is an open problem addressed differently in various research works [50], [17], [150], [132].

For these reasons, the most used approach to classifying spam emails is clustering them on the basis of their similarities [4], [111], [132].

However, the accuracy of current solutions is still somewhat limited and further improvements are needed. Some categorical attributes, for example the language of a spam message, are primary, discriminative and outstanding characteristics for specifying a spam campaign. Nevertheless, in previous works [87], [92], [4], [130], [131], [144], [28], these categorical features are not considered, or the homogeneity of the resulting campaigns is not based on these features.

In this chapter, after a thorough literature review on the clustering and classification of spam emails, we propose a preliminary work on the design of a categorical clustering algorithm for grouping spam emails, which is based on structural features of emails such as language, number of links, email size, etc. The rationale behind this approach is that two messages in the same format, i.e. with similar language and size, the same number of attachments, the same number of links, etc., are more likely to originate from the same source and thus to belong to the same campaign. To this aim, we extract categorical features (attributes) from spam emails which are representative of their structure and which should clearly shape the differences between emails belonging to different campaigns.

The proposed clustering algorithm, named Categorical Clustering Tree (CCTree), builds a tree starting from a whole set of spam messages. At the beginning, the root node of the tree contains all data points, which constitute a skewed dataset where unrelated data are mixed together. Then, the proposed clustering algorithm divides the data points step by step, clustering together data that are similar and obtaining homogeneous subsets of data points. The measure of similarity of clustered data points at each step of the algorithm is given by an index called node purity. If the level of purity is not sufficient, the data points belonging to this node are not sufficiently homogeneous, and they should be divided into different subsets (nodes) based on the characteristic (attribute) that yields the highest value of entropy. The rationale behind this choice is that dividing data on the basis of the attribute which yields the greatest entropy helps in creating more homogeneous subsets in which the overall value of entropy is consistently reduced. This approach aims at reducing the time needed to obtain homogeneous subsets. This division of non-homogeneous sets of data points is repeated iteratively until all sets are sufficiently pure or the number of elements belonging to a node is less than a specific threshold set in advance. These pure sets are the leaves of the tree and represent the different spam campaigns. The use of categorical attributes is crucial for the proposed approach, which exploits the Shannon entropy [125], a measure that yields good results on nominal attributes.

After detailing the CCTree algorithm and briefly presenting categorical features for categorizing spam emails, we discuss the efficiency of the algorithm, proving its linear complexity.

The rest of this chapter is structured as follows. Section 3.2 provides some preliminary notions of the topic. Section 3.3 reports a literature review concerning the previous techniques used for clustering spam emails into campaigns. In Section 3.4, we describe the proposed categorical clustering algorithm for clustering spam messages. In Section 3.5 the analysis of the proposed methodology is discussed. Finally, Section 3.6 is a brief conclusion and a sketch of some future directions.

3.2 Preliminary Notions

In this section we briefly present some preliminary notions required by our proposed process for clustering spam emails into campaigns.

Clustering Let X be a dataset consisting of data points (or objects, instances, cases, patterns, tuples, transactions, elements) $x_i = (x_{i1}, x_{i2}, \ldots, x_{id})$ in attribute space A, i.e. each $x_{ij} \in A$, $1 \le i \le n$, $1 \le j \le d$, where n is the number of points belonging to X and d is the number of attributes. Furthermore, each $x_{ij}$ is a numerical or categorical attribute (or feature, value, component). Such a point-by-attribute data representation conceptually corresponds to an $n \times d$ matrix. The ultimate goal of clustering [18] is to assign points to a finite set of k subsets $C_1, C_2, \ldots, C_k$, named clusters. Usually the subsets do not intersect (although this assumption is sometimes violated), and their union is equal to the full dataset with the possible exception of outliers:

$$X = C_1 \cup C_2 \cup \ldots \cup C_k \cup C_{outlier}, \qquad C_i \cap C_j = \emptyset, \quad \forall\, 1 \le i \ne j \le k$$

Clustering groups data points into subsets in such a manner that similar instances are grouped together, while different points belong to different groups [117]. Because clustering groups similar instances, some measure is required that can determine whether two objects are similar or dissimilar. Many clustering techniques use distance measures to determine the similarity or dissimilarity between any pair of objects. The distance between two points $x_i$ and $x_j$ is usually denoted by $d(x_i, x_j)$. A valid distance measure should be symmetric and take its minimum value (usually zero) in the case of identical vectors. The distance measure is called a metric distance measure if it also satisfies the following properties:

$$d(x_i, x_k) \le d(x_i, x_j) + d(x_j, x_k) \quad \forall x_i, x_j, x_k \in X$$

$$d(x_i, x_j) = 0 \Leftrightarrow x_i = x_j \quad \forall x_i, x_j \in X$$
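As an illustration of these properties on categorical data, the following minimal sketch checks them by brute force on a small sample. The simple matching distance (the number of attributes on which two points differ) and the example tuples are assumptions introduced here for illustration; they are not taken from the thesis.

```python
from itertools import product

def simple_matching_distance(x, y):
    """Number of attribute positions where two categorical tuples differ."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of attributes")
    return sum(a != b for a, b in zip(x, y))

def is_metric_on(points, dist):
    """Brute-force check of symmetry, identity and the triangle inequality on a finite sample."""
    for x, y in product(points, repeat=2):
        if dist(x, y) != dist(y, x):
            return False
        if (dist(x, y) == 0) != (x == y):
            return False
    for x, y, z in product(points, repeat=3):
        if dist(x, z) > dist(x, y) + dist(y, z):
            return False
    return True

# Hypothetical emails described by (language, body type, attachment).
emails = [
    ("English", "html", "no_attachment"),
    ("English", "text", "no_attachment"),
    ("French", "html", "pdf_attachment"),
]
print(is_metric_on(emails, simple_matching_distance))
```

Such a check can only refute the metric properties on the sample it is given; it does not prove them for the whole attribute space.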

Shannon Entropy In information theory, entropy is a measure of the uncertainty of a random variable. More specifically, the Shannon entropy [125], as a measure of uncertainty, for a random variable X with k distinct outcomes $\{x_1, x_2, \ldots, x_k\}$ is defined as follows:

$$H(X) = -\sum_{i=1}^{k} p(x_i) \log(p(x_i))$$

where $p(x_i) = \frac{N_i}{N}$, $N_i$ is the number of occurrences of outcome $x_i$, and N is the total number of elements of X. The Shannon entropy is maximal when all outcomes are equally likely, i.e. when the number of elements for each value is almost the same, and it reaches its minimum, i.e. zero, when all data belonging to a set are identical. Thus, the closer the entropy is to zero, the purer the dataset.

To give better insight into how Shannon entropy measures the purity of a dataset, Figures 3.1 and 3.2 are provided. At first glance, it is clear that dataset 2 is purer, or more homogeneous, than dataset 1. The following two equations show that Shannon entropy returns the minimum possible value, i.e. zero, for the completely pure dataset 2.

Figure 3.1 – dataset 1

Figure 3.1: $H(\text{dataset}_1) = -(0.4 \log(0.4) + 0.3 \log(0.3) + 0.3 \log(0.3)) = 0.4729$

Figure 3.2: $H(\text{dataset}_2) = -\left(\frac{10}{10} \log\left(\frac{10}{10}\right)\right) = 0$

[38] and [93] show that entropy works well as a distance measure in clustering algorithms.
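The two values can be reproduced with a few lines of code. This is a minimal sketch: the function name and the labels "A", "B", "C" are hypothetical stand-ins for the items drawn in the figures, and the logarithm is taken in base 10, which is the base that yields the reported value 0.4729.

```python
import math
from collections import Counter

def shannon_entropy(values, base=10):
    """Shannon entropy of a list of categorical values (base 10 by default)."""
    total = len(values)
    counts = Counter(values)
    return sum(-(c / total) * math.log(c / total, base) for c in counts.values())

# Ten elements with proportions 0.4, 0.3, 0.3 (as in dataset 1) and ten identical elements (dataset 2).
dataset1 = ["A"] * 4 + ["B"] * 3 + ["C"] * 3
dataset2 = ["A"] * 10

print(round(shannon_entropy(dataset1), 4))
print(shannon_entropy(dataset2))
```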

Spam Campaign A spam campaign is the set of messages spread by a spammer with a specific purpose [27], like advertising a specific product, spreading ideas, or criminal intent, e.g. phishing. The premise of our spam campaign detection, as in [27], is that spammers generally keep some parts of the message static, whilst other parts are changed systematically through automated text, image, or dynamic link generation. To show how two spam emails can belong to the same campaign, we refer the reader to Figures 3.3 and 3.4. Although the text, images, and dynamic links differ between these two emails, it is obvious that both are generated from the same source, or designed by the same spammer. The rationale behind our spam campaign detection is to focus on features that remain almost unchanged when a spam campaign is created, e.g. the language of the message, the number of images, etc.

Figure 3.2 – dataset 2

Figure 3.3 – Spam 1

Figure 3.4 – Spam 2

3.3 Related Works

To the best of our knowledge just a few works exist related to the problem of clustering spam emails into campaigns.

In [87], campaigns are identified through keywords that stand for specific types of campaigns. In this study, campaigns are first found manually based on keywords, and then some interesting results are extracted from the groups of campaigns. Because it requires manual scanning of spam, this approach is not suitable for large datasets.

In [54], although the authors focus on the analysis of spam URLs in Facebook, the study of URLs and the clustering of spam messages are similar to our goal concerning spam emails. First, all wall posts that share the same URLs are clustered together; then the descriptions of the wall posts are analyzed, and if two wall posts have the same description their clusters are merged.

In [92], the authors argue that spam emails with identical URLs are highly clusterable and mostly sent in bursts. In their method, if the same URL exists in spam emails from source A and source B, and each has a unique IP address, the two are connected by an edge, and the connected components are the desired clusters.

Spamscatter [4] is a method that automatically clusters the destination web sites extracted from URLs in spam emails with the use of image shingling. In image shingling, images are divided into blocks and the blocks are hashed; two images are then considered similar if 70 percent of the hashed blocks are the same. In [150], spam emails are clustered based on their images in order to trace the origins of the spam emails. Emails are considered visually similar if their illustrations, text, layouts, and/or background textures are similar. J. Song et al. [130] focus on clustering spam emails based on the IP addresses resolved from URLs inside the body of these emails. Two emails belong to the same cluster if the IP address sets resolved from their URLs are exactly the same.

In these previous works, pairwise comparison of emails is required for finding the clusters. This kind of comparison has two problems: the time complexity is quadratic, which is not suitable for clustering big data, and the clusters are found based on just one or two features of the messages, which decreases precision. In the works that follow, spam emails are grouped with the use of clustering algorithms.
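To make the cost of pairwise comparison concrete, the sketch below groups emails that share at least one URL by comparing every pair and taking connected components. It is a simplified illustration only, not the implementation of any of the cited works (for instance, it ignores the IP-address condition of [92]); the function name, email identifiers and URLs are hypothetical.

```python
from itertools import combinations

def cluster_by_shared_url(email_urls):
    """Group emails sharing a URL; returns (clusters, number of pairwise comparisons)."""
    ids = list(email_urls)
    parent = {e: e for e in ids}

    def find(e):
        # Follow parent links to the representative, shortening the path on the way.
        while parent[e] != e:
            parent[e] = parent[parent[e]]
            e = parent[e]
        return e

    comparisons = 0
    for a, b in combinations(ids, 2):  # n * (n - 1) / 2 pairs
        comparisons += 1
        if email_urls[a] & email_urls[b]:
            parent[find(a)] = find(b)

    groups = {}
    for e in ids:
        groups.setdefault(find(e), set()).add(e)
    return list(groups.values()), comparisons

emails = {
    "m1": {"http://a.example/x"},
    "m2": {"http://a.example/x", "http://b.example/y"},
    "m3": {"http://b.example/y"},
    "m4": {"http://c.example/z"},
}
clusters, comparisons = cluster_by_shared_url(emails)
print(clusters, comparisons)
```

With n emails the loop performs n(n − 1)/2 comparisons, which is the quadratic growth referred to in the text.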

In [132], the same authors as [130] note that considering only IP addresses resolved from URLs is insufficient for clustering. Since a web server can host many web sites with the same IP address, each IP cluster in [130] consists of a large amount of spam emails sent by different controlling entities. Thus, the authors cluster spam emails by IP addresses resolved from URLs using their new method, called O-means clustering, which is based on the K-means clustering method. The distance is based on 12 features in the body of an email, expressed as numbers, and the Euclidean distance is used to measure the distance between two emails. In [131], after clustering spam emails with the O-means method, the authors found that the 10 largest clusters had sent about 90 percent of all spam emails in their dataset. Hence, the authors investigate these 10 clusters to implement a heuristic analysis for selecting significant features among the 12 features used in the previous work. As a result, they select the four most important features, which could effectively separate these 10 clusters from each other. Because the approach is based on K-means clustering, the underlying problem is computationally NP-hard. It also requires the number of clusters to be known from the beginning.

In [144], the authors focus on a set of eleven attributes extracted from messages to cluster spam emails. Two clustering methods are used: first, an agglomerative hierarchical algorithm clusters the whole dataset; next, for clusters containing too many emails, a connected-components algorithm with weighted edges is used to address the false positive rate. With agglomerative clustering [66], a global clustering is done based on common features of the email attributes. In the beginning, each email is a cluster by itself, and then clusters sharing common features are merged. In this model, edges connect two nodes (spam emails) based on the eleven attributes. The desired clusters are the connected components of this graph with weight above a specified threshold. This method is not practical for large datasets, since the pairwise comparison requires quadratic time.

The basic hypothesis of the FP-Tree method [27] for clustering spam emails is that some parts of spam messages are static from the point of view of recognizing a spam campaign. In this work, as an improvement over [92], not only URLs are considered for clustering.

For identifying spam campaigns, a Frequent Pattern Tree (FP-Tree), as a signature-based method, is constructed from some features extracted from spam emails. These features are: language of the email, message layout, type of message, URL and subject. In this tree, each node after the root depicts a feature extracted from the spam message that is shared by the subtrees beneath. Thus, each path in this tree shows sets of features that co-occur in messages, in non-increasing order of frequency of occurrence. The problem of the FP-Tree is that it is based on the frequency of features rather than on creating clusters that are pure in terms of homogeneity. Redundant features are also removed when specifying a campaign according to the frequency property, while in our method redundant features are characterized based on the purity or homogeneity of campaigns. However, the greatest problem results from the sensitivity of the FP-Tree to dynamic URL and text generation in layout detection. The reason is that the layout is extracted line by line, which means that two very similar emails differing in one line will be attributed to two different layouts.

In summary, the previous works on clustering spam emails can be divided into two main categories: the first group focuses on pairwise comparison of emails, for example URL comparison, and the second group consists of those in which a clustering algorithm is used, for example O-means clustering. In general, the aforementioned works suffer from at least one of the following problems: 1) they consider only one or two features for grouping spam messages, which decreases the accuracy; 2) pairwise comparison is used, with quadratic time complexity; 3) the number of clusters is required as prior knowledge; 4) the features that create a pure cluster are not the focus. In our proposed algorithm, we try to solve these problems.

3.4 Categorical Clustering Tree (CCTree)

The general idea for the construction comes from a supervised learning algorithm called Induction Decision Tree (ID3) [109]. To create the CCTree, a set of objects is given in which each data point is described in terms of a set of categorical attributes, e.g. the language of a message. Each attribute represents the value of an important feature of the data and is limited to a set of discrete, mutually exclusive values, e.g. Language as an attribute can take the values (features) English or French. Then, a tree is constructed in which the leaves are the desired clusters, while the other nodes contain non-pure data that need an attribute-based test to separate them. The separation is shown with a branch for each possible value of the selected attribute. Each branch, or edge, leaving the parent node is labeled with the selected value that directs data to the child node. The attribute for which the Shannon entropy is maximal is selected to divide the data. A purity function on a node, based on Shannon entropy, is defined; it represents how homogeneous the data belonging to a node are. A required threshold of node purity is specified. When a node is pure enough with respect to this threshold (its purity value, being entropy-based, does not exceed the threshold), or the number of elements in a node is less than a given threshold, the node is labeled as a leaf or terminal node.

The precise process of CCTree construction can be formalized as follows:

– Input: Let D be a set of data points, containing N tuples on a set A of d attributes, and let S be a set of stop conditions.

Attributes An ordered set of d attributes $A = \{A_1, A_2, \ldots, A_d\}$ is given, where each attribute is an ordered set of mutually exclusive values. Thus, the j'th attribute can be written as $A_j = \{v_{1j}, v_{2j}, \ldots, v_{(r_j)j}\}$, where $r_j$ is the number of features of attribute $A_j$. For example, one attribute could be the Language of a spam email, with the set of possible values {English, French, Spanish}.

Data Points A set D of N data points is given, where each data point is a vector $D_i = (v_{i1}, v_{i2}, \ldots, v_{id})$ whose elements are features of the attributes, i.e. $v_{ik} \in A_k$ is the feature of the k'th attribute taken by the i'th data point. For example, we may have spam 1 = (English, excel attachment, image based).

Stop Conditions A set of stop conditions $S = \{\mu, \varepsilon\}$ is given. µ is the “minimum number of elements in a node”, i.e. when the number of elements in a node is less than µ, the node is not divided even if it is not pure enough. ε represents the “minimum desired purity” for each cluster, i.e. when the purity of a node is better than or equal to ε, the node is considered a leaf. To calculate the node purity, a function based on Shannon entropy is defined as follows.

Let $N_{kji}$ represent the number of elements having the k'th value of the j'th attribute in node i, and let $N_i$ be the number of elements in node i. Thus, considering $p(v_{kji}) = \frac{N_{kji}}{N_i}$, the purity of node i, denoted by $\rho(i)$, is defined as follows:

$$\rho(i) = -\sum_{j=1}^{d} \sum_{k=1}^{r_j} p(v_{kji}) \log(p(v_{kji}))$$

where d is the number of attributes and $r_j$ is the number of features of the j'th attribute. Since $\rho$ is a sum of entropies, a smaller value corresponds to a purer node.

– Output: A set of clusters, which are the leaves of the categorical clustering tree.

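[Figure 3.5: tree diagram. The root S is split by the edges labeled red and blue into the nodes Sr and Sb; Sb is further split by the edges labeled small and large into the leaves Sb.s and Sb.l.]

Figure 3.5 – A Small CCTree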
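The node purity ρ(i) defined above can be computed directly from value counts. The following is a minimal sketch under stated assumptions: a node is represented as a list of equal-length tuples of categorical values, the function name is hypothetical, and base 10 is assumed for the logarithm because the text does not fix it.

```python
import math
from collections import Counter

def node_purity(points, base=10):
    """rho(i): sum, over all attributes, of the Shannon entropy of the value distribution in the node."""
    n = len(points)
    d = len(points[0])
    total = 0.0
    for j in range(d):
        counts = Counter(p[j] for p in points)  # N_kji for each value k of attribute j
        total += sum(-(c / n) * math.log(c / n, base) for c in counts.values())
    return total

# Hypothetical nodes described by (color, size), echoing the attributes of Figure 3.5.
pure_node = [("red", "small")] * 3
mixed_node = [("red", "small"), ("red", "large"), ("blue", "small"), ("blue", "large")]

print(node_purity(pure_node))
print(node_purity(mixed_node))
```

A node whose points are all identical has purity value zero, and the value grows as the attribute values become more evenly mixed.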

We report in the following the process of creating the CCTree :

At the beginning, all data points, as the set of N tuples, are assigned to the root of the tree. The root is the first new node. The clustering process is applied iteratively to each newly created node. For each new node of the tree, the algorithm checks whether a stop condition is satisfied, i.e. whether the number of data points is less than the threshold µ or the purity value is less than or equal to ε. In this case, the node is labeled as a leaf; otherwise the node should be split.

In order to find the best attribute for dividing the cluster, the Shannon entropy of the distribution of values of each attribute is calculated. The attribute for which the Shannon entropy is maximal is selected. The reason is that the attribute with the most equiprobable distribution of values generates the highest amount of chaos (non-homogeneity) in a node. For each possible value of the selected attribute, a branch labeled with that value leaves the node, directing the data having that value to the corresponding child node. Then the process is iterated until each node is either a parent node or is labeled as a leaf. At the last step, all final nodes, or leaves, of the tree form the set of desired clusters, named $\{C_1, C_2, \ldots, C_k\}$. Figure 3.5 depicts an example of a small CCTree, whilst a formal description of the algorithm is given in Algorithm 1.

The source code is provided in A.1.

Algorithm 1: Categorical Clustering Tree (CCTree) algorithm

Input: data points Dk, attributes Al, attribute values Vm, node_purity_threshold, max_num_elem
Output: clusters Ck

Root node N0 takes all data points Dk
for each node Ni != leaf node do
    if node_purity_i < node_purity_threshold || num_elem_i < max_num_elem then
        label Ni as leaf
    else
        for each attribute Aj do
            if Aj yields max Shannon entropy then
                use Aj to divide data of Ni
                generate new nodes Ni1, ..., Nit with t = size of V for attribute Aj
            end
        end
    end
end
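The construction can be sketched in a few lines of Python. This is an illustrative sketch, not the source code referred to in A.1: the function and parameter names are hypothetical, base 10 is assumed for the logarithm, the purity test uses "less than or equal to" as in the textual description, and `min_elements` plays the role of the threshold µ (called max_num_elem in Algorithm 1). Only values that actually occur in a node produce a child.

```python
import math
from collections import Counter

def attribute_entropy(points, j, base=10):
    """Shannon entropy of the values of attribute j among the points of a node."""
    n = len(points)
    counts = Counter(p[j] for p in points)
    return sum(-(c / n) * math.log(c / n, base) for c in counts.values())

def node_purity(points, base=10):
    """Entropy-based purity value of a node: lower means purer."""
    return sum(attribute_entropy(points, j, base) for j in range(len(points[0])))

def cctree(points, purity_threshold, min_elements, path=()):
    """Return the leaves as (path, cluster) pairs; path lists the (attribute index, value) edge labels."""
    if not points:
        raise ValueError("a node must contain at least one data point")
    if purity_threshold < 0:
        raise ValueError("purity_threshold must be non-negative")
    # Stop conditions: too few elements, or pure enough.
    if len(points) < min_elements or node_purity(points) <= purity_threshold:
        return [(path, points)]
    # Split on the attribute whose value distribution has maximal entropy.
    split = max(range(len(points[0])), key=lambda j: attribute_entropy(points, j))
    children = {}
    for p in points:
        children.setdefault(p[split], []).append(p)
    leaves = []
    for value, subset in children.items():
        leaves += cctree(subset, purity_threshold, min_elements, path + ((split, value),))
    return leaves

# Hypothetical data over (color, size), chosen to mirror the shape of Figure 3.5.
points = [("red", "small")] * 3 + [("blue", "small")] * 2 + [("blue", "large")] * 2
leaves = cctree(points, purity_threshold=0.1, min_elements=2)
for path, cluster in leaves:
    print(path, len(cluster))
```

On this toy input the root is split on color, the red node is already pure, and the blue node is split once more on size, giving three leaves.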

3.5 Time Complexity

The proposed structure-based methodology for clustering spam emails into campaigns, respecting the aforementioned requirements of our problem, is linear in terms of complexity. This property becomes more significant when compared with the complexity of previous works for grouping spam emails into campaigns, which are mostly based on pairwise comparison of spam messages and therefore suffer from quadratic time complexity.

Here, we briefly discuss the precise time complexity of the proposed methodology. Let n be the number of elements in the whole dataset, $n_i$ the number of elements in node i, m the total number of features, $v_l$ the number of features of attribute $A_l$, d the number of attributes, and $v_{max} = \max\{v_l\}$ (l = 1, 2, ..., d).

For constructing a CCTree, an $n_i \times m$ matrix must be created from the data belonging to each non-leaf node i, which takes $O(m \times n_i)$ time. Finding the appropriate attribute on which to divide the data requires constant time. Dividing the $n_i$ points based on the $v_l$ features of the selected attribute ($A_l$) takes $O(n_i \times v_l)$ time. This process is repeated in each non-leaf node. Thus, if K = m + 1 is the maximum number of non-leaf nodes, which arises in a complete tree, then the maximum time required for constructing a CCTree with n elements is $O(K \times (n \times m + n \times v_{max}))$. Recalling that the number of features m, and consequently K = m + 1, are constants, we conclude that the running time is linear in the number of data points.
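The step from the per-node costs to the overall bound can be written out explicitly. The derivation below is a sketch that uses only the quantities defined above, together with the facts that $n_i \le n$ and $v_l \le v_{max}$ for every node; $l_i$ (an index introduced here for illustration) denotes the attribute selected at node i.

```latex
\begin{align*}
T(n) &= \sum_{i \,\in\, \text{non-leaf nodes}} O\left(m \, n_i + n_i \, v_{l_i}\right) \\
     &\le \sum_{i \,\in\, \text{non-leaf nodes}} O\left(m \, n + n \, v_{max}\right)
        && \text{since } n_i \le n \text{ and } v_{l_i} \le v_{max} \\
     &\le O\left(K \, (n \, m + n \, v_{max})\right)
        && \text{at most } K \text{ non-leaf nodes} \\
     &= O(n)
        && \text{when } K,\ m \text{ and } v_{max} \text{ are constants.}
\end{align*}
```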

3.6 Conclusion

Spam emails impose a non-negligible cost, damaging users and companies by several million dollars each year. To fight spammers effectively, catch them or analyze their behavior, it is not sufficient to stop spam messages from being delivered to the final recipient. Instead, characterizing a spam campaign sent by a specific spammer is necessary to analyze the spammer's behavior. Such an analysis can be used to tailor a more specific prevention strategy, which could be more effective in tackling the issue of spam emails. Considering a large set of spam emails as a whole makes the definition of spam campaigns an extremely challenging task. Thus, we argue that a clustering algorithm is required to group this huge amount of data based on message similarities.

In this chapter we proposed a new categorical clustering algorithm, named CCTree, that we argue to be useful in the problem of clustering spam emails. This algorithm allows an easy analysis of data based on an informative structure. The CCTree algorithm introduces an easy-to-understand representation, from which the criteria used to group spam emails into clusters can be inferred at first glance. This information can be used, for example, by officers to track and prosecute a specific subset of spam emails which may be related to an important crime. Here, we have mainly presented the theoretical results of our approach; the implementation of the CCTree algorithm and its use in clustering spam emails are presented in the following chapter.

Chapter 4

Effectiveness and Efficiency of CCTree in Spam Campaign Detection

Every year, spam emails impose extremely heavy costs in terms of time, storage space and money on both private users and companies. Finding and prosecuting spammers and possible spam email stakeholders should make it possible to tackle the root of the problem directly. To facilitate such a difficult analysis, which should be performed on large amounts of unclassified raw emails, in this chapter we propose a framework to quickly and effectively divide large amounts of spam emails into homogeneous campaigns through structural similarity. The framework exploits a set of 21 features representative of the email structure and a novel categorical clustering algorithm named Categorical Clustering Tree (CCTree). The methodology is evaluated and validated through standard tests performed on three datasets amounting to more than 200k real recent spam emails ([129]).

4.1 Introduction

Spam emails constitute a notorious and persistent problem still far from being solved. In the last year, out of the 191.4 billion emails sent worldwide per day on average, more than 70% were spam emails [110]. Spam emails cause several problems, ranging from direct financial losses to misuse of Internet traffic, storage space and computational power [113]. Moreover, spam emails are becoming a tool to perpetrate different cybercrimes, such as phishing, malware distribution, or social engineering-based fraud.

Given the relevance of the problem, several approaches have already been proposed to tackle the spam email issue. Currently, the most used approach for fighting spam emails consists in identifying and blocking them on the recipient machine through filters, which are generally based on machine learning techniques or content features, such as keywords or non-ASCII characters [30][46][123][22]. Unfortunately, these countermeasures only slightly mitigate the problem, which still imposes non-negligible costs on users and companies [113].

To effectively fight the problem of spam emails, it is mandatory to find and prosecute the spammers, who generally hide behind complex networks of infected devices that send spam emails against their users' will, i.e. botnets. Thus, information useful for finding the spammer should be inferred by analyzing the text, attachments and other elements of the emails, such as links. Therefore, the early analysis of correlated spam emails is vital [44][4]. However, such an analysis constitutes an extremely challenging task, due to the huge amount of spam emails, which grows hourly (8 billion per hour) [110], and to the high variance that related emails may show because of obfuscation techniques [108]. To simplify this analysis, huge amounts of spam emails, generally collected through honeypots, should be divided into spam campaigns [132]. A spam campaign is the set of messages spread by a spammer with a specific purpose [27], such as advertising a product, spreading ideas, or pursuing criminal intents.

In this chapter, we propose to use the algorithm presented in Chapter 3 on a set of 21 attributes to quickly and effectively group large amounts of spam emails by structural similarity. The 21 discriminative structural features are chosen so as to obtain homogeneous email groups, which identify different spam campaigns. Grouping spam emails on the basis of their similarities is a known approach. However, previous works mainly focus on the analysis of a few specific parameters [4][111][132][139], showing results whose accuracy is still somewhat limited. Our approach is based on applying CCTree, a tree-like structure whose leaves represent the various spam campaigns. The algorithm clusters (groups) emails through structural similarity, verifying at each step the homogeneity of the obtained clusters and dividing the groups that are not homogeneous (pure) enough on the basis of the attribute which yields the greatest variance (entropy). The effectiveness of the proposed approach has been tested against 10k spam emails extracted from a real recent dataset¹ and compared with other well-known categorical clustering algorithms, reporting the best results in terms of clustering quality (i.e. purity and accuracy) and time performance.
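The splitting rule described above can be illustrated with a minimal Python sketch. It is not the MATLAB implementation used in this work: the function names are hypothetical, node impurity is assumed here to be the mean Shannon entropy of the attributes, and a node is assumed to become a leaf when it holds fewer than mu elements or its impurity is at most eps. The exact purity measure is the one defined in Chapter 3.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of categorical values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def cctree(rows, eps, mu):
    """Split rows (tuples of categorical values) into a list of leaf clusters.

    Assumptions for this sketch: node impurity is the mean attribute entropy,
    and a node is a leaf when it has fewer than mu rows or impurity <= eps.
    """
    n_attrs = len(rows[0])
    ents = [entropy([r[a] for r in rows]) for a in range(n_attrs)]
    split = max(range(n_attrs), key=lambda a: ents[a])  # highest-entropy attribute
    if len(rows) < mu or sum(ents) / n_attrs <= eps or ents[split] <= 0:
        return [rows]
    groups = {}
    for r in rows:
        groups.setdefault(r[split], []).append(r)  # one child per attribute value
    return [leaf for g in groups.values() for leaf in cctree(g, eps, mu)]


# Toy feature vectors: (IsHtml, Language)
rows = [("html", "en"), ("html", "en"), ("text", "en"), ("text", "it")]
leaves = cctree(rows, eps=0.001, mu=1)
```

On this toy input the root is split on the first attribute, the two identical HTML emails form one pure leaf, and the two text emails are separated by language.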

The contributions of the present chapter can be summarized as follows:
- We introduce a set of 21 categorical features representative of email structure, briefly discussing the discretization procedure for numerical features, which are used for applying CCTree.
- The performance of CCTree has been thoroughly evaluated through internal evaluation, to estimate its ability to obtain homogeneous clusters, and external evaluation, to estimate its ability to effectively classify similar elements (emails) when classes are known beforehand. Internal and external evaluation have been performed respectively on a dataset of 10k unclassified spam emails and on 276 emails manually divided into classes.
- We propose and validate, through analysis on 200k spam emails, a methodology to choose the optimal CCTree configuration parameters based on the detection of the maximum curvature point (knee) on a graph of homogeneity against number of clusters.
- We compare the proposed methodology with two general categorical clustering algorithms and with other methodologies specific to clustering spam emails.

1. http ://

The rest of this chapter is structured as follows. Section 4.2 describes the proposed framework, detailing the extracted features and reporting implementation details. Section 4.3 reports the experiments to evaluate the ability of CCTree in clustering spam emails, comparing the results with those of two well-known categorical clustering algorithms. The methodology to set the CCTree parameters is also reported and validated. Section 4.4 discusses limitations and advantages of the proposed approach, comparing its results with some related work. Other related work on clustering spam emails is presented in Section 4.5. Finally, Section 4.6 briefly concludes and proposes future directions.

4.2 Framework

The presented framework acts in two steps. First, raw emails are analyzed by a parser to extract vectors of structural features. Afterward, the collected vectors (elements) are clustered through the CCTree algorithm. This section details the proposed framework for analyzing and clustering spam emails, together with the extracted features.

4.2.1 Feature Extraction and Definition

To describe spam emails, we have selected a set of 21 categorical attributes which are representative of the structural properties of emails. The reason is that the general appearance of messages belonging to the same spam campaign mainly remains unchanged, although spammers usually insert random text or links [27]. The selected attributes extend the set of structural features proposed in [99] to label emails as spam or ham.

The attributes and a brief description are presented in Table 4.1.

Since the clustering algorithm is categorical, all selected features are categorical as well. It is worth noting that some features, e.g. AttachmentSize, are meant to represent numerical values rather than categorical ones. However, it is always possible to turn these features from numerical into categorical by defining intervals and assigning a feature value to each interval. We chose these intervals on the basis of the ChiMerge discretization method [85], which returns outstanding results for discretization in decision tree-like problems [56].
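As a small illustration of the interval mapping described above, the following Python sketch turns a numerical value into a categorical interval index. The cut points are hypothetical placeholders, not the ChiMerge intervals actually used in this work, and the function name is ours.

```python
from bisect import bisect_right

# Hypothetical cut points in bytes, for illustration only; the real
# intervals are produced by ChiMerge and are not reproduced here.
ATTACHMENT_SIZE_CUTS = [1, 10_000, 100_000, 1_000_000]


def discretize(value, cuts):
    """Return the index of the interval that contains value.

    Interval 0 is (-inf, cuts[0]); interval i is [cuts[i-1], cuts[i]);
    the last interval is [cuts[-1], +inf).
    """
    return bisect_right(cuts, value)


category = discretize(25_000, ATTACHMENT_SIZE_CUTS)  # falls in [10_000, 100_000)
```

Each email attribute with a numerical domain can be mapped this way before the feature vectors are handed to the clustering step.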

The details of the discretization results are provided in Tables A.1 through A.18.

Features of particular interest are the ones that report the number of pictures in the email (ImagesNumber), the presence of HTML tags (IsHtml), and the number of links (NumberOfLinks). Through these features, in fact, it is possible to determine whether the email is raw text, contains several images, or is presented in the form of a web page, characteristics which mostly remain unchanged when a spammer designs a spam campaign to be sent in a burst.

| Attribute | Description |
|---|---|
| RecipientNumber | Number of recipient addresses. |
| NumberOfLinks | Total links in email text. |
| NumberOfIPBasedLinks | Links shown as an IP address. |
| NumberOfMismatchingLinks | Links with a text different from the real link. |
| NumberOfDomainsInLinks | Number of domains in links. |
| AvgDotsPerLink | Average number of dots per link in text. |
| NumberOfLinksWithAt | Number of links containing “@”. |
| NumberOfLinksWithHex | Number of links containing hex chars. |
| SubjectLanguage | Language of the subject. |
| NumberOfNonAsciiLinks | Number of links with non-ASCII chars. |
| IsHtml | True if the mail contains HTML tags. |
| EmailSize | The email size, including attachments. |
| Language | Email language. |
| AttachmentNumber | Number of attachments. |
| AttachmentSize | Total size of email attachments. |
| AttachmentType | File type of the biggest attachment. |
| WordsInSubject | Number of words in subject. |
| CharsInSubject | Number of chars in subject. |
| ReOrFwdInSubject | True if subject contains “Re” or “Fwd”. |
| NonAsciiCharsInSubject | Number of non-ASCII chars in subject. |
| ImagesNumber | Number of images in the email text. |

Table 4.1 – Features extracted from each email.

4.2.2 Implementation Details

On the implementation side, an email parser has been developed in Java to automatically analyze the raw email text and extract the features in the form of vectors. The software exploits JSoup [69] for HTML parsing and the LID² Python tool for language recognition. The LID software uses n-grams to recognize the language of a text. For each language that LID has to recognize, a database of words must be provided to the software in order to extract n-grams. The languages on which LID has been trained are the following: English, Italian, French, German, Spanish, Portuguese, Chinese, Japanese, Persian, Arabic, Croatian. We have implemented the CCTree algorithm using the MATLAB³ software, which takes as input the matrix of email features extracted by the parser.

It is worth noting that the complete framework, i.e. the feature extraction and clustering modules, is fully portable across operating systems. In fact, both the feature extraction module and the clustering module (i.e. MATLAB) are Java-based and executable on the vast majority of general-purpose operating systems (UNIX, iOS, etc.). The Python module for language analysis is portable as well. Moreover, LID has been integrated as an optional component, i.e. if the Python interpreter is missing, the analysis is not stopped. For emails whose language cannot be inferred, the UKNOWN_LANGUAGE value is used for the attribute instead.

2. http ://
3.

4.3 Evaluation and Results

This section reports the experimental results used to evaluate the quality of the CCTree algorithm on the problem of clustering spam emails. A first set of experiments has been performed on a dataset of 10k recent spam emails (February 2015), to estimate the capability of the CCTree algorithm to obtain homogeneous clusters. This evaluation is known as Internal Evaluation and estimates the quality of the clustering algorithm by measuring how similar each element of a resulting cluster is to the elements of the same cluster and how dissimilar it is from the elements of other clusters. A second set of experiments aims at assessing the capability of CCTree to correctly classify data, using a small dataset with benchmark classes known beforehand. This evaluation is named External Evaluation and measures the similarity between the clusters produced by a specific algorithm and the desired clusters (classes) of the pre-classified dataset. For external evaluation, CCTree has been tested against a dataset of 276 emails, manually labeled in 29 classes⁴. The emails have been manually divided by looking at both the structure and the semantics of the message. Thus, emails belonging to one class can be considered part of a single spam campaign.

The results of CCTree are compared with those of two categorical clustering algorithms, namely COBWEB and CLOPE, which are well known for being accurate and fast, respectively. The comparison has been done for both internal and external evaluation on the aforementioned datasets. A time performance analysis is also reported. It is worth noting that the three algorithms are all implemented on Java-based tools, which supports the validity of the time comparison.

In what follows, we briefly introduce these two algorithms:

COBWEB: COBWEB, proposed in [51], is a categorical clustering algorithm which builds a dendrogram where each node is associated with a conditional probability that summarizes the attribute-value distributions of the objects belonging to that node. Unlike the CCTree algorithm, it also includes a merging operation to join two separate nodes into a single one. COBWEB is computationally demanding and time consuming, since it re-analyzes every single data point at each step. COBWEB employs the following four operations:
- Merging two nodes: the two nodes are replaced by a node whose children are the children of the original nodes and which summarizes the attribute-value distribution of the elements classified under them.
- Splitting a node: a node is split by replacing it with its own children.
- Inserting a new node: a new node is created for a new datum inserted into the tree.
- Passing a datum down the tree: the datum is placed in the node that best fits it.

Nevertheless, the COBWEB algorithm is used in several fields for its good accuracy, to the point that its similarity measure, named Category Utility, is used to evaluate categorical clustering accuracy [7]. It is formally defined as follows.

Definition 4.1. Category Utility (CU): The category utility [60] is defined as the difference between the expected number of attribute values which can be guessed correctly given the clustering, and the expected number of correct guesses without this knowledge. Let $\{C_1, \ldots, C_k\}$ be the set of clusters and let $v_{ij}$ (for all possible $j$) be the values of attribute $A_i$; then CU is defined as follows:

$$CU = \sum_{C_l} \frac{|C_l|}{k} \sum_{i} \sum_{j} \left[ P(A_i = v_{ij} \mid C_l)^2 - P(A_i = v_{ij})^2 \right]$$

The WEKA [65] implementation of COBWEB has been used for our experiments.

4. Available at : http ://

CLOPE: CLOPE [148] is a fast categorical clustering algorithm which maximizes the number of elements with the same value for a subset of attributes, attempting to increase the homogeneity of each obtained cluster. In this algorithm, a global criterion function is proposed to increase the intra-cluster overlap by increasing the height-to-width ratio of the cluster histogram. The clustering with the maximum height-to-width ratio over all cluster histograms is the optimum result. Formally, the CLOPE clustering is defined as follows. Let $X = \{x_1, x_2, \ldots, x_n\}$ be the set of $n$ tuples, where all the features of a data point $x_i$, $1 \le i \le n$, are categorical. Suppose $C = \{C_1, C_2, \ldots, C_k\}$ represents the division of $X$ into $k$ clusters, and let $H_i = D(C_i)$ denote the statistical histogram of $C_i$ with respect to the categorical attributes. Two measure functions are introduced in this method:

$$S(C_i) = \sum_{x_j \in C_i} |x_j|$$

where $|x_j|$ is the dimensionality of $x_j$, and

$$W(C_i) = |H_i|$$

where $|H_i|$ is the number of bins in histogram $H_i$. Then, the criterion function of CLOPE is defined as:

$$\max \left\{ \mathit{Profit}(C) = \frac{1}{n} \sum_{i=1}^{k} \frac{S(C_i)}{W(C_i)^2} \, |C_i| \right\}$$

where $|C_i|$ is the number of elements in cluster $C_i$.

Also for CLOPE we have used the WEKA [65] implementation for the performed experiments.
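To make the CLOPE criterion concrete, the Python sketch below evaluates the profit of a given partition exactly as written in the formula above, with exponent 2. It is not the WEKA implementation; the function name is ours, and each histogram bin is assumed to correspond to one distinct (attribute, value) pair occurring in the cluster.

```python
def clope_profit(clusters):
    """Profit of a partition given as a list of non-empty clusters.

    Each cluster is a list of tuples of categorical values. A histogram bin
    is taken to be one distinct (attribute index, value) pair (assumption).
    """
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        size = sum(len(x) for x in c)                                # S(C_i)
        width = len({(a, v) for x in c for a, v in enumerate(x)})    # W(C_i)
        total += size / width ** 2 * len(c)
    return total / n


split = [[("a", "b"), ("a", "b")], [("a", "c")]]
merged = [[("a", "b"), ("a", "b"), ("a", "c")]]
profit_split, profit_merged = clope_profit(split), clope_profit(merged)
```

In this toy case keeping the two identical tuples in their own cluster gives a higher profit than merging all three, which matches the intent of favoring tall, narrow histograms.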

4.3.1 Internal Evaluation

When the result of a clustering algorithm is evaluated on the very data that was clustered, the evaluation is called internal. Internal evaluation measures the ability of a clustering algorithm to obtain homogeneous clusters. A high score on internal evaluation is given to clustering algorithms which maximize the intra-cluster similarity, i.e. elements within the same cluster are similar, and minimize the inter-cluster similarity, i.e. elements from different clusters are dissimilar. The cluster dissimilarity is measured by computing the distances between elements (data points) in various clusters. The distance function depends on the specific problem. In particular, for elements described by categorical attributes, the common geometric distances, e.g. the Euclidean distance, cannot be used. Hence, in this work the Hamming and Jaccard distance measures [66] are applied. The Hamming distance considers two elements closer when they have the same value for a higher number of attributes. The Jaccard distance, on the other hand, is defined as one minus the size of the intersection of the attribute values of two elements divided by the size of their union. Internal evaluation can be performed directly on the dataset on which the clustering algorithm operates, i.e. the knowledge of the classes (desired clusters) is not a prerequisite. The indexes used for internal evaluation are the Dunn index [19] and the Silhouette [118], which are defined as follows:
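The two distance measures can be sketched in Python as follows. The function names are ours, the Hamming distance is normalized by the number of attributes (an assumption; the unnormalized count is equally common), and the Jaccard distance is computed on the sets of (attribute, value) pairs of the two elements.

```python
def hamming(x, y):
    """Fraction of attributes on which the two elements differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


def jaccard(x, y):
    """One minus |X ∩ Y| / |X ∪ Y| over sets of (attribute, value) pairs."""
    sx, sy = set(enumerate(x)), set(enumerate(y))
    return 1 - len(sx & sy) / len(sx | sy)


x = ("html", "en", "0")
y = ("html", "it", "0")
d_h, d_j = hamming(x, y), jaccard(x, y)
```

Here the two elements differ on one of three attributes, so they share two attribute-value pairs out of four distinct ones.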

Dunn index: Let $\Delta_i$ be the diameter of cluster $C_i$, defined as the maximum distance between elements of $C_i$:

$$\Delta_i = \max_{x, y \in C_i,\; x \neq y} \{ d(x, y) \}$$

where $d(x, y)$ measures the distance between $x$ and $y$, which can be any distance specified by the user, e.g. the Hamming distance, and $|C|$ denotes the number of elements belonging to cluster $C$. Also, let $\delta(C_i, C_j)$ be the inter-cluster distance between clusters $C_i$ and $C_j$, which is calculated from the pairwise distances between the elements of the two clusters. Then, on a set of $k$ clusters, the Dunn index [64] is defined as:

$$DI_k = \min_{1 \le i \le k} \left\{ \min_{1 \le j \le k,\; j \neq i} \left\{ \frac{\delta(C_i, C_j)}{\max_{1 \le t \le k} \Delta_t} \right\} \right\}$$

A higher Dunn index value means a better cluster quality. It is worth noting that the value of the Dunn index is negatively affected by the greatest diameter among all generated clusters ($\max_{1 \le t \le k} \Delta_t$). Hence, even a single resulting cluster with poor quality (non-homogeneous) will cause a low value of the Dunn index. Conversely, higher values of this index mean that the overall homogeneity of the resulting clusters is high.
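A minimal Python sketch of the Dunn index follows. It is an illustration rather than the code used for the experiments: the inter-cluster distance is assumed here to be the minimum pairwise distance between the two clusters, the normalized Hamming distance is used as d, and all names are ours.

```python
from itertools import combinations, product


def hamming(x, y):
    """Fraction of attributes on which the two elements differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


def dunn_index(clusters, dist=hamming):
    """Smallest inter-cluster distance divided by the largest diameter.

    Assumption: delta(Ci, Cj) is the minimum pairwise distance between
    elements of Ci and Cj. Requires at least two clusters.
    """
    max_diameter = max(
        (dist(x, y) for c in clusters for x, y in combinations(c, 2)),
        default=0.0,
    )
    min_separation = min(
        dist(x, y)
        for ci, cj in combinations(clusters, 2)
        for x, y in product(ci, cj)
    )
    return min_separation / max_diameter if max_diameter > 0 else float("inf")


clusters = [
    [("a", "a", "a", "a"), ("a", "a", "a", "b")],
    [("b", "b", "b", "b"), ("b", "b", "b", "a")],
]
di = dunn_index(clusters)
```

Both toy clusters have diameter 0.25 and the closest pair across clusters is at distance 0.75, so a single loose cluster would immediately lower the ratio.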

Silhouette: The dissimilarity of a point $x_i$ from a cluster $C$ is the average distance from $x_i$ to the points of $C$. Dissimilarity generally refers to a distance measure; for categorical attributes, the Hamming distance can be used. Let $d(i)$ be the average dissimilarity of data point $x_i$ with the other data points within the same cluster. Also, let $d'(i)$ be the lowest average dissimilarity of $x_i$ to any other cluster, i.e. any cluster except the one $x_i$ belongs to. Then, the silhouette [118] $s(i)$ of $x_i$ is defined as:

$$s(i) = \frac{d'(i) - d(i)}{\max\{d(i), d'(i)\}} = \begin{cases} 1 - \dfrac{d(i)}{d'(i)} & \text{if } d(i) < d'(i) \\ 0 & \text{if } d(i) = d'(i) \\ \dfrac{d'(i)}{d(i)} - 1 & \text{if } d(i) > d'(i) \end{cases}$$

so that $s(i) \in [-1, 1]$. The closer $s(i)$ is to 1, the more appropriately the data point $x_i$ is clustered. The average value of $s(i)$ over all data of a cluster shows how tightly related the data within the cluster are. Hence, the closer the average value of $s(i)$ is to 1, the better the clustering result. For easy interpretation, the silhouette of all clustered points is also represented through a silhouette plot.
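The silhouette of a single point can be computed with the short Python sketch below. It follows the formula above with the normalized Hamming distance; the names are ours, at least two clusters are assumed, and for a point alone in its cluster d(i) is taken to be 0, which is a convention we assume here.

```python
def hamming(x, y):
    """Fraction of attributes on which the two elements differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


def silhouette(i, points, labels, dist=hamming):
    """Silhouette s(i) of points[i], given one cluster label per point."""
    by_cluster = {}
    for j, lab in enumerate(labels):
        if j != i:
            by_cluster.setdefault(lab, []).append(dist(points[i], points[j]))
    own = by_cluster.pop(labels[i], [])
    d = sum(own) / len(own) if own else 0.0  # singleton: d(i) = 0 (assumption)
    d_prime = min(sum(v) / len(v) for v in by_cluster.values())
    top = max(d, d_prime)
    return (d_prime - d) / top if top > 0 else 0.0


points = [("a", "a"), ("a", "b"), ("b", "b")]
labels = [0, 0, 1]
scores = [silhouette(i, points, labels) for i in range(len(points))]
```

Averaging such scores over all points of a clustering gives the aggregate silhouette values reported in the tables of this section.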

Performance Comparison

As discussed in Chapter 3, the CCTree algorithm requires two stop conditions as input, i.e. the minimum number of elements in a node to be split (µ) and the minimum purity in a cluster (ε). Henceforth, the notation CCTree(ε, µ) will be used to refer to the specific configuration of the CCTree algorithm. To choose the stop conditions, we first fix the minimum number of elements µ = 1, and then we change the node purity to see how the internal indexes are affected. It is worth noticing that when µ is fixed to 1, the only stop condition affecting the result is ε. Table 4.2 shows the result of internal evaluation when µ is fixed to 1 and ε takes five different values: 0.0001, 0.001, 0.01, 0.1 and 0.5.

Table 4.2 – CCTree Internal evaluation with fixed number of elements.

| CCTree (µ = 1) | ε = 0.0001 | ε = 0.001 | ε = 0.01 | ε = 0.1 | ε = 0.5 |
|---|---|---|---|---|---|
| Silhouette (Hamming) | 0.9772 | 0.9772 | 0.9642 | 0.7124 | 0.1040 |
| Silhouette (Jaccard) | 0.9777 | 0.9777 | 0.9650 | 0.7110 | 0.0820 |
| Dunn (Hamming) | 0.5 | 0.5 | 0.2 | 0.1111 | 0.0714 |
| Dunn (Jaccard) | 0.25 | 0.25 | 0.1571 | 0.1032 | 0.0704 |

For further insight, we report in Figures 4.1, 4.2, 4.3 and 4.4 the Hamming distance silhouette plots for the CCTrees with the parameters of Table 4.2. The graphs are horizontal histograms in which every bar represents the silhouette result, s(i) ∈ [−1, 1], for each data point xi, according to the aforementioned definition. It can be seen that neither CCTree(0.001,1) (Fig. 4.1) nor CCTree(0.01,1) (Fig. 4.2) shows negative values for any data point, hence the high silhouette values close to 1. The first row of Table 4.2 shows the average silhouette result over all the points on which the CCTree is constructed, for the given input parameters. The white spaces in the plots correspond to the points for which the silhouette result equals one.

Figure 4.1 – CCTree(0.001,1)

Figure 4.2 – CCTree(0.01,1)

Figure 4.3 – CCTree(0.1,1)

Figure 4.4 – CCTree(0.5,1)

Figure 4.5 graphs the internal evaluation measurements of CCTree for five different values of ε, when the minimum number of elements µ has been set to 1. It is worth noting that if µ = 1, the only stop condition affecting the result is the node purity. This is the reason why we first fix µ = 1 to find the most suitable node purity for our dataset.

Figure 4.5 – Internal evaluation at the variation of the ε parameter.

As shown in Figure 4.5, the homogeneity indexes reach their maximum and stabilize when ε = 0.001. Stricter purity requirements (i.e., ε < 0.001) do not further increase the precision. This value of ε will be fixed for the following evaluations. More precisely, we first fix one of the input parameters in a way that it does not affect the result, i.e. µ = 1, and assign different values to the other parameter. We see that beyond a certain point a stricter parameter no longer affects the overall homogeneity result.

Fixing the node purity ε = 0.001, we then look for the best value of the µ parameter, in order to compare the performance of CCTree with the accurate COBWEB and the fast CLOPE. To this end, we consider four different values of the minimum number of elements in a cluster. Table 4.3 presents the Silhouette and Dunn index results for the proposed values of µ, namely 1, 10, 100, and 1000. In addition, the last two rows of Table 4.3 report the resulting number of clusters and the time required to generate the clusters.

Table 4.3 – Internal evaluation results of CCTree, COBWEB and CLOPE.

| Index | COBWEB | CCTree(0.001,1) | CCTree(0.001,10) | CCTree(0.001,100) | CCTree(0.001,1000) | CLOPE |
|---|---|---|---|---|---|---|
| Silhouette (Hamming) | 0.9922 | 0.9772 | 0.9264 | 0.7934 | 0.5931 | 0.2801 |
| Silhouette (Jaccard) | 0.9922 | 0.9777 | 0.9290 | 0.8021 | 0.6074 | 0.2791 |
| Dunn (Hamming) | 0.1429 | 0.5 | 0.1 | 0.0769 | 0.0769 | 0 |
| Dunn (Jaccard) | 0.1327 | 0.25 | 0.1 | 0.0879 | 0.0857 | 0 |
| Clusters | 1118 | 619 | 392 | 154 | 59 | 55 |
| Time (s) | 17.81 | 0.6027 | 0.3887 | 0.1760 | 0.08610 | 3.02 |

Table 4.3 also reports the comparison with the two categorical clustering algorithms COBWEB and CLOPE, previously described. The first two columns from the left show comparable results in terms of clustering precision for the silhouette index. In fact, COBWEB and CCTree both have a good precision when the CCTree purity is set to ε = 0.001 and the minimum number of elements is set to µ = 1 (CCTree(0.001,1)). COBWEB performs slightly better on the silhouette index, for both distance measures. However, the difference (less than 2 percent) is negligible considering that COBWEB creates almost twice as many clusters as CCTree(0.001,1).

It can be inferred that a higher number of small clusters improves the internal homogeneity (e.g., a cluster with one element is totally homogeneous). However, as will be detailed in the following, a number of clusters much greater than the expected number of groups is not desirable. In fact, it can be inferred from the Silhouette definition that, in case every element xi is unique, the maximum theoretical value is achieved if each cluster contains only one element.

Moreover, CCTree(0.001,1) returns a better result for the Dunn index than COBWEB. We recall that the value of the Dunn index is strongly affected by the homogeneity of the worst resulting cluster. The value returned for CCTree(0.001,1) shows that all the returned clusters globally have a good homogeneity compared to COBWEB, i.e. the worst cluster for CCTree(0.001,1) is much more homogeneous than the worst cluster for COBWEB.

The rightmost column of Table 4.3 reports the results for the CLOPE clustering algorithm. CLOPE is a categorical clustering algorithm known to be fast in creating clusters that are as pure as possible. The accuracy of CLOPE is quite limited for Silhouette, and zero for the Dunn index, whilst CCTree(0.001,1000), with almost the same number of clusters, is 35 times faster than CLOPE.

A graphical description of the accuracy difference between the clustering of Table 4.3 can be inferred from the Hamming Silhouette plots of Figures 4.6, 4.7, 4.8, 4.9, 4.10, and 4.11. The plots are horizontal histograms in which every bar represents the silhouette result, s(i) ∈ [−1, 1], for each data point xi, as from the aforementioned definition.

Figure 4.6 – COBWEB

Both COBWEB and CCTree(0.001,1) show no negative values, with the majority of data points scoring s(i) = 1. For CCTree(0.001,1000) the worst data points do not score less than −0.5, whilst for CLOPE some data points have a silhouette of −0.8, which indicates strong non-homogeneity in their clusters. This also justifies the better value of the Dunn index for CCTree(0.001,1000), which we recall is affected by the non-homogeneity of the worst cluster. Also, the number of data points with positive values is much higher for CCTree(0.001,1000) than for CLOPE, even if CLOPE returns some points whose Silhouette value is 1. Overall, CCTree(0.001,1000) returns a better Silhouette than CLOPE.

The outstanding point is that the runtime of CCTree(0.001,1000) is about 35 times lower than that of CLOPE (0.0861 s against 3.02 s in Table 4.3), which is itself regarded as a fast categorical clustering algorithm.

Finally, Table 4.3 also reports the time elapsed for the clustering performed by the algorithms. It can be observed that COBWEB pays for its accuracy with an elapsed time of more than 17 seconds on the dataset of 10k emails, against the 3 seconds of the much less accurate CLOPE. The CCTree algorithm outperforms both COBWEB and CLOPE, requiring only 0.6 seconds in the most accurate configuration (CCTree(0.001,1)).

From internal evaluation we can thus conclude that the CCTree algorithm obtains clusters whose quality is comparable with the ones of COBWEB, requiring even less computational time than the fast but inaccurate algorithm CLOPE.

Figure 4.7 – CCTree(0.001,1)

Figure 4.8 – CCTree(0.001,10)

Figure 4.9 – CCTree(0.001,100)

Figure 4.10 – CCTree(0.001,1000)

Figure 4.11 – CLOPE

4.3.2 CCTree Parameters Selection

Through internal evaluation and the results reported in Table 4.3 and Figures 4.6 to 4.11, we showed the dependence of the internal evaluation indexes and of the number of clusters on the values of the µ and ε parameters. We briefly discuss here some guidelines to correctly choose the CCTree parameters so as to maximize the clustering effectiveness.

Concerning the ε parameter, we showed in Section 4.3 that it is possible to find the optimal value of ε by setting µ = 1 and varying ε to find the fixed point in terms of accuracy, i.e. the optimal ε is the one below which smaller values of ε no longer improve the accuracy.

Once the parameter ε is fixed, the parameter µ must be selected to balance the accuracy against the number of generated clusters. As the number of clusters is affected by the µ parameter, it is possible to choose the optimal value of µ knowing the optimal number of clusters. The problem of estimating the optimal number of clusters for hierarchical clustering algorithms has been solved in [120] by determining the point of maximum curvature (knee) on a graph showing the inter-cluster distance as a function of the number of clusters.

Recalling that the silhouette index is inversely related to the inter-cluster distance, it is sound to find the knee on the graph of Figure 4.12, which plots the silhouette (Hamming) computed on the dataset used for internal evaluation for seven different values of µ. For the sake of readability, the graph does not show values of µ greater than 100.

Figure 4.12 – Silhouette as a function of the number of clusters for different values of µ.

Figure 4.13 – Silhouette (Hamming).

Applying the L-method described in [120], it is possible to find that the knee is located at µ = 10. It is worth recalling from Table 4.3 that the knee value for µ gives a silhouette value higher than 90%, keeping the number of generated clusters much lower than the ones obtained when µ = 1.
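The idea behind the L-method can be sketched in a few lines of Python: every candidate knee splits the curve into a left and a right part, a straight line is fitted to each part, and the split with the lowest size-weighted fitting error is retained. This is a simplified illustration with hypothetical function names and synthetic data, not the procedure or the data used to obtain the value above.

```python
def line_fit_rmse(xs, ys):
    """Root-mean-square error of the least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    intercept = my - slope * mx
    return (sum((y - slope * x - intercept) ** 2 for x, y in zip(xs, ys)) / n) ** 0.5


def l_method_knee(xs, ys):
    """Return the x value at which two fitted lines meet with the least error.

    xs must be sorted and contain at least four points.
    """
    n = len(xs)
    best_x, best_err = None, float("inf")
    for c in range(2, n - 1):  # left = first c points, right = the rest (>= 2 each)
        err = (c / n) * line_fit_rmse(xs[:c], ys[:c]) \
            + ((n - c) / n) * line_fit_rmse(xs[c:], ys[c:])
        if err < best_err:
            best_x, best_err = xs[c - 1], err
    return best_x


# Synthetic curve: a steep segment followed by a flat one.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [10, 8, 6, 4, 3.0, 2.9, 2.8, 2.7]
knee = l_method_knee(xs, ys)
```

On the synthetic curve the two regimes change after the fourth point, which is where the sketch places the knee.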

Table 4.4 – Silhouette values and number of clusters in function of µ for four email datasets.

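| Data | Silhouette (µ = 1) | Clusters (µ = 1) | Silhouette (µ = 10) | Clusters (µ = 10) | Silhouette (µ = 100) | Clusters (µ = 100) | Silhouette (µ = 1000) | Clusters (µ = 1000) |
|---|---|---|---|---|---|---|---|---|
| February | 0.9772 | 610 | 0.9264 | 385 | 0.7934 | 154 | 0.5931 | 59 |
| March I | 0.9629 | 635 | 0.9095 | 389 | 0.7752 | 149 | 0.6454 | 73 |
| March II | 0.9385 | 504 | 0.8941 | 306 | 0.8127 | 145 | 0.6608 | 74 |
| March III | 0.9397 | 493 | 0.8926 | 296 | 0.8102 | 131 | 0.6156 | 44 |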

Further insight can be gained from the results of Table 4.4 and Figures 4.13 and 4.14, which report the analysis on three additional datasets of spam emails coming from three different weeks of March 2015 from the spam honeypot⁵. The sets have a size comparable with that of the dataset used for internal evaluation (first week of February 2015), with respectively 10k, 10k and 9k spam emails.

Figure 4.14 – Generated Clusters.

From both the table and the graphs it is possible to infer that the same trend for both silhouette value and number of clusters holds for all the tested datasets. Hence, we verify (i) the validity of the knee methodology and (ii) the possibility of using the same CCTree parameters for datasets with the same data type and comparable size. The results of Table 4.4 are reported graphically in Figures 4.13 and 4.14.

From both figures it is visible that the four datasets follow the same trends for the internal evaluation indexes and the number of clusters for the same values of the µ parameter.

To give statistical validity to the performed analysis on parameter determination, we have analyzed a dataset of more than 200k emails collected from October 2014 to May 2015 from the honeypot⁶. The emails have been divided into 21 datasets, containing 10k spam emails each. Each set represents one week of spam emails.

Tables 4.5 and 4.6 report the result of silhouette index and number of clusters for 6 months, from October 2014 to May 2015, except February and March which were already reported in Tables 4.4 and 4.3.

5. http ://
6. http ://

Table 4.5 – Silhouette result, Hamming distance, ε = 0.001, and µ changes

| Month (week) | µ = 1 | µ = 10 | µ = 100 | µ = 1000 |
|---|---|---|---|---|
| Oct1 | 0.9264 | 0.88 | 0.7936 | 0.5738 |
| Oct2 | 0.9223 | 0.8625 | 0.7299 | 0.5557 |
| Oct3 | 0.9071 | 0.8555 | 0.7474 | 0.6623 |
| Nov1 | 0.9228 | 0.8706 | 0.7616 | 0.5593 |
| Nov2 | 0.9655 | 0.9083 | 0.7873 | 0.5054 |
| Nov3 | 0.9702 | 0.9064 | 0.7951 | 0.5078 |
| Dec1 | 0.9566 | 0.9012 | 0.7736 | 0.6264 |
| Dec2 | 0.9626 | 0.9108 | 0.7784 | 0.651 |
| Dec3 | 0.9787 | 0.942 | 0.8451 | 0.6739 |
| Jan1 | 0.9697 | 0.9387 | 0.8876 | 0.6675 |
| Jan2 | 0.9683 | 0.9369 | 0.8776 | 0.7407 |
| Jan3 | 0.9739 | 0.9441 | 0.8923 | 0.662 |
| Apr1 | 0.9706 | 0.9161 | 0.7894 | 0.6894 |
| Apr2 | 0.9694 | 0.9174 | 0.8234 | 0.6378 |
| Apr3 | 0.9738 | 0.9361 | 0.8344 | 0.6866 |
| May1 | 0.9675 | 0.9184 | 0.7679 | 0.5553 |
| May2 | 0.964 | 0.9128 | 0.7712 | 0.4434 |
| May3 | 0.9668 | 0.9299 | 0.819 | 0.5068 |

Table 4.6 – Number of clusters, ε = 0.001, and µ changes

| Month (week) | µ = 1 | µ = 10 | µ = 100 | µ = 1000 |
|---|---|---|---|---|
| Oct1 | 507 | 310 | 141 | 31 |
| Oct2 | 652 | 376 | 143 | 61 |
| Oct3 | 562 | 333 | 124 | 64 |
| Nov1 | 564 | 312 | 128 | 50 |
| Nov2 | 689 | 399 | 161 | 56 |
| Nov3 | 685 | 391 | 172 | 61 |
| Dec1 | 586 | 359 | 135 | 66 |
| Dec2 | 583 | 343 | 127 | 64 |
| Dec3 | 437 | 273 | 118 | 50 |
| Jan1 | 366 | 237 | 132 | 44 |
| Jan2 | 366 | 216 | 110 | 43 |
| Jan3 | 344 | 208 | 118 | 37 |
| Apr1 | 574 | 341 | 127 | 54 |
| Apr2 | 528 | 304 | 131 | 53 |
| Apr3 | 408 | 242 | 101 | 47 |
| May1 | 622 | 419 | 159 | 73 |
| May2 | 578 | 372 | 133 | 60 |
| May3 | 474 | 313 | 131 | 48 |

To show that the silhouette values and numbers of clusters of the spam campaigns in Tables 4.3, 4.4, 4.5 and 4.6 follow the same trends across datasets of comparable size, we first briefly recall what standard deviation means.

Standard Deviation: In statistics, the standard deviation (usually denoted σ) [11] is a measure for quantifying the amount of variation of a set of values. A standard deviation close to 0 shows that the data points tend to be very close to the mean, whilst a high standard deviation indicates that the data points are spread out over a wide range of values. Formally, let X be a random variable with mean value E(X); the standard deviation of X equals:

$$\sigma = \sqrt{E(X^2) - (E(X))^2}$$

It means that the standard deviation σ is the square root of the variance of X.
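As a worked example of this formula, the Python sketch below computes σ for the four silhouette values reported for µ = 1 in Table 4.4 and cross-checks it against the population standard deviation from the standard library. It only illustrates the computation on those four values; it does not reproduce the statistics over the full set of weekly datasets.

```python
from math import sqrt
from statistics import fmean, pstdev

# Silhouette (Hamming) for mu = 1 on the four datasets of Table 4.4.
silhouettes = [0.9772, 0.9629, 0.9385, 0.9397]

mean = fmean(silhouettes)                                       # E(X)
sigma = sqrt(fmean([x * x for x in silhouettes]) - mean ** 2)   # sqrt(E(X^2) - E(X)^2)
reference = pstdev(silhouettes)                                 # population standard deviation
```

The two quantities agree up to floating-point rounding, since `pstdev` divides by the number of samples exactly as the expectation in the formula does.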

Figures 4.15 and 4.16 show the average values for number of clusters and silhouette computed on the 21 datasets, varying the µ parameter over the values of the former experiments (i.e. 1, 10, 100, 1k). The standard deviation (defined above) is also reported as error bars. It is worth noting that the standard deviation for the values of µ = {1, 10} on 10 datasets is slightly higher than 2%, while it reaches 4% for µ = 100 and 8% for µ = 1000, which is in line with the results of Table 4.4. Comparable results are also obtained for the number of clusters, where the highest value of the standard deviation is, as expected, for µ = 1, amounting to 108, which again is in line with the results of Table 4.4. Thus, for all the 21 analyzed datasets, spanning eight months of spam emails, we can always locate the knee for silhouette and number of clusters at µ = 10.

Figure 4.15 – Silhouette (Hamming).

Figure 4.16 – Generated Clusters.

4.3.3 External Evaluation

External evaluation is a standard technique to measure the capability of a clustering algorithm to correctly classify data. To this end, external evaluation is performed on a small dataset whose classes, i.e. the desired clusters, are known beforehand. This small dataset must be representative of the operative reality, and it is generally separate from the dataset used for clustering. The set of pre-classified items is often labeled by human experts, and external measures evaluate how close the result of the clustering algorithm is to these predetermined labels. A common index used for external evaluation is the F-measure [98], which coalesces into a single index the performance on correctly classified and misclassified elements.

Formally, let the sets $\{C_1, C_2, \ldots, C_k\}$ be the desired clusters (classes) for the dataset D, and let $\{C'_1, C'_2, \ldots, C'_l\}$ be the set of clusters returned by applying a clustering algorithm on D. Then, the F-measure F(i) for each cluster $C_i$ is defined as follows:

$$F(i) = \max_{1 \le j \le l} \frac{2\,|C_i \cap C'_j|}{|C_i| + |C'_j|}$$

To compute the overall F-measure on all desired clusters as an aggregated index, the weighted average of the F-measures of all predetermined clusters is computed as:

$$F_c = \sum_{i=1}^{k} \frac{|C_i|}{\left|\bigcup_{j=1}^{k} C_j\right|}\, F(i)$$

The F-measure lies in the range [0,1], where 1 represents the ideal situation in which each cluster $C_i$ is exactly equal to one of the resulting clusters. More precisely, the elements of each predetermined class are compared with the elements of all resulting clusters, and the maximum similarity, equal to 1, is returned when one of the resulting clusters is identical to the predetermined class.
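The definition above can be sketched in a few lines of Python. This is a minimal illustration rather than the evaluation code used for the experiments: clusters are represented as sets of item identifiers, the function name `f_measure` and the toy data are ours, and both cluster lists are assumed to be non-empty.

```python
def f_measure(desired, returned):
    """Weighted F-measure of `returned` clusters against `desired` classes.

    Both arguments are non-empty lists of sets of item identifiers.
    """
    total = len(set().union(*desired))  # size of the union of desired classes
    overall = 0.0
    for c in desired:
        # F(i): best match of class c among all returned clusters.
        best = max(2 * len(c & r) / (len(c) + len(r)) for r in returned)
        # Weight each class by its share of the labeled items.
        overall += len(c) / total * best
    return overall


desired = [{1, 2, 3}, {4, 5}]
perfect = [{1, 2, 3}, {4, 5}]
split = [{1, 2}, {3}, {4, 5}]  # first class split into two clusters

print(f_measure(desired, perfect))
print(f_measure(desired, split))
```

In the second call the first class is matched best by the cluster {1, 2}, so its individual score drops below 1 while the second class still scores 1.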

Experimental Results

For the sake of external evaluation, 276 spam emails collected from different spam folders of different mailboxes have been manually analyzed and classified. The emails have been divided into 29 groups (classes) according to the structural similarity of the raw email messages. The external evaluation set has no intersection with the one used for internal evaluation.

Table 4.7 – External evaluation results of CCTree, COBWEB and CLOPE.

Algorithm    COBWEB    CCTree (ε = 0.001)                       CLOPE
                       µ = 1     µ = 5     µ = 10    µ = 50
F-Measure    0.3582    0.62      0.6331    0.6330    0.6       0.0076
Clusters     194       102       73        50        26        15

The results of external evaluation are reported in Table 4.7. Building on the results of internal evaluation, the value of node purity has been set to ε = 0.001 to obtain homogeneous clusters. The values of µ have been chosen according to the following rationale. µ = 1 represents a CCTree instantiation in which the µ parameter does not affect the result. On the other hand, µ = 50 returns a number of clusters comparable with the 29 manually collected clusters. Higher values of µ do not modify the result for a dataset of this size.

The best results are returned for µ = {5, 10}. The F-measure for these two values is higher than 0.63, with a negligible difference between them, even though the number of generated clusters is higher than the expected one. The F-measure, in fact, considers as correctly classified the elements of a same original cluster C even if they are divided among several clusters $C'_1, \ldots, C'_n$, as long as these do not contain data from clusters other than C.

Table 4.7 also reports the comparison with the COBWEB and CLOPE algorithms. As shown, the CCTree algorithm outperforms COBWEB and CLOPE on the F-measure index, thus showing a higher capability in correctly classifying spam emails. We recall that for internal evaluation, COBWEB returned slightly better results than CCTree. This difference stems from the number of resulting clusters. COBWEB, in fact, always returns a high number of clusters (Table 4.7). This generally yields a high cluster homogeneity (several small and homogeneous clusters). However, it does not necessarily imply a good classification capability. In fact, as shown in Table 4.7, COBWEB returns almost 200 clusters on a dataset of 276 emails, which is more than six times the expected number of clusters (i.e., 29 clusters). This explains the lower F-measure score of the COBWEB algorithm. It is worth noting that CCTree outperforms COBWEB even if the minimum number of elements per node is not considered (i.e., µ = 1). On the other hand, CLOPE also performs poorly on F-measure for the 276-email dataset. The CLOPE algorithm, in fact, only produced 15 clusters, i.e. about half of the expected ones, with an F-measure score of 0.0076.

4.4 Discussion and Comparisons

Kleinberg [86] introduced three properties related to clustering algorithms, named scale invariance (requiring that the output of a clustering be invariant to uniform scaling of the distance function), consistency (requiring that if within-cluster distances are decreased, and between-cluster distances are increased, then the output of a clustering function does not change) and richness (requiring that by modifying the distance function, any partition of the underlying data set can be obtained). In his famous theorem, Kleinberg proved that “independent of any particular algorithm, objective function, or generative data model”, no clustering function simultaneously satisfies the three proposed properties. The Kleinberg theorem is referred to in the literature [147], [102], [137], [124] to justify that no clustering algorithm stands as a perfect function for a specific problem; instead, an algorithm is required to respect, as much as possible, the specified and desired properties of the associated problem.
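For reference, the three properties can be stated compactly as follows. This is a sketch in our own notation (the symbols S, f, d, d′ and α are not used elsewhere in this chapter), following the usual formulation of Kleinberg's axioms.

```latex
Let $S$ be a set of $n \ge 2$ points and let $f$ map a distance function $d$
on $S$ to a partition $f(d)$ of $S$.

\begin{itemize}
  \item \textbf{Scale invariance:} $f(\alpha \cdot d) = f(d)$ for every
        distance function $d$ and every $\alpha > 0$.
  \item \textbf{Richness:} $\mathrm{Range}(f)$ is the set of all partitions
        of $S$.
  \item \textbf{Consistency:} if $d'$ satisfies $d'(x,y) \le d(x,y)$ whenever
        $x, y$ lie in the same cluster of $f(d)$, and $d'(x,y) \ge d(x,y)$
        whenever they lie in different clusters, then $f(d') = f(d)$.
\end{itemize}
```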

The presented methodology, based on a set of 21 categorical features and a novel categorical clustering algorithm named CCTree, shows the capability of dividing spam emails into very homogeneous clusters, with a good accuracy in discerning different campaigns. The comparison with other categorical clustering algorithms showed the effectiveness and efficiency of CCTree when applied to the same set of features. We recall that CCTree is an unsupervised machine learning technique. Unsupervised learning algorithms do not provide results with the same accuracy as supervised learning techniques (i.e. trained classifiers). However, they have the advantage of not requiring any training procedure, thus they can also be applied to datasets for which no previous knowledge is available. This often represents the reality in the analysis of spam emails, where it is necessary to cope with the large amount of emails produced daily and collected by honeypots.

By combining the analysis of 21 features, the proposed methodology becomes suitable to analyze almost any kind of spam email. This is one of the main advantages with respect to other approaches, which mainly exploit one or two features to cluster spam emails into campaigns. These features are, alternatively, links [4], [92], keywords [22], [30], [46], [123], or images [150]. The analysis of these methodologies remains limited to those spam emails that actually contain the considered features. However, emails without links and/or images are a considerable percentage of spam emails.

In fact, from the analysis of the dataset used for internal evaluation, 4561 emails out of 10165 do not contain any link. Furthermore, only 810 emails contain images. To verify the clustering capability of these methodologies, we have implemented three programs to cluster the emails of the internal evaluation dataset on the basis of the contained URLs, the reported domains and the links of remote images. The emails without links and pictures have not been considered.

Table 4.8 – Campaigns on the February 2015 dataset from five clustering methodologies.

Cluster Methodology        Analyzed emails   Generated Campaigns
Link Based Clustering      4561              4449
Domain Based Clustering    4561              4547
Image Based Clustering     810               807
COBWEB (21 Features)       10165             1118
CCTree(0.001,10)           10165             392

Table 4.8 reports the number of clusters generated by each methodology. It is worth noting that on a large dataset the link-, domain- and image-based methodologies are highly inaccurate: they generate a number of campaigns close to the number of analyzed elements, hence almost every cluster contains a single element.
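The behaviour reported in Table 4.8 can be illustrated with a toy sketch of link-based grouping. This is not the program used for the comparison: the regular expression, the exact-match grouping rule, the function name `cluster_by_url` and the sample emails (with placeholder `example.com` links) are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Simplistic URL matcher, sufficient for this illustration only.
URL_PATTERN = re.compile(r"https?://[^\s<>]+")


def cluster_by_url(emails):
    """Group email ids by the exact set of URLs found in their body."""
    clusters = defaultdict(list)
    for email_id, body in emails.items():
        urls = tuple(sorted(set(URL_PATTERN.findall(body))))
        if urls:  # emails without links are not considered
            clusters[urls].append(email_id)
    return dict(clusters)


# Hypothetical emails from one campaign with dynamically generated links.
emails = {
    "a": "Cheap meds http://shop.example.com/p?id=8f3a",
    "b": "Cheap meds http://shop.example.com/p?id=c41d",
    "c": "Cheap meds http://shop.example.com/p?id=8f3a",
    "d": "No link here",
}

clusters = cluster_by_url(emails)
for urls, ids in clusters.items():
    print(urls, ids)
```

Emails "a" and "b" have the same layout but differ in a generated token, so an exact link comparison places them in different campaigns; with one fresh token per message, the number of clusters approaches the number of emails.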

For comparison purposes we reported the results of the most accurate implementation of CCTree and of COBWEB, which, we recall, is able to produce extremely homogeneous clusters, reporting a Silhouette value of 99%. We point out that comparing Silhouette values is meaningless here, due to the different sets of features used. Comparisons with other methodologies such as the FP-Tree-based approaches [27], [44], which require the extraction and analysis of a different set of features, are left as future work.

4.5 Related Work

As discussed in Section 4.4, there are several works in the literature which cluster spam emails through pairwise comparisons of URLs, IP addresses resolved from URLs, domains and image links. The main weakness of these approaches is that they cannot be applied to emails not presenting the required features. Also, the pairwise comparison imposes a quadratic complexity, against the linear one of CCTree. Another clustering approach, exploiting pairwise comparisons of email subjects, is presented in Wei et al. [144]. The proposed methodology introduces a set of eleven features to cluster spam emails through two clustering algorithms. At first, an agglomerative hierarchical algorithm is used to cluster the whole data set based on pairwise subject comparison. Afterward, the connected component graph algorithm is used to improve the performance.

The authors of [132] applied a methodology based on the k-means algorithm, named O-means clustering, which exploits twelve features extracted from each email. The O-means algorithm works on the hypothesis that the number of clusters is known beforehand, which does not always hold. Furthermore, the authors use the Euclidean distance, which does not carry meaningful information for several of the features they apply. Differently from this approach, CCTree exploits a more general measure, i.e. Shannon entropy. Moreover, CCTree does not require the desired number of clusters as an input parameter.
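To make the contrast with the Euclidean distance concrete, the sketch below computes the Shannon entropy of a categorical attribute, which needs only value frequencies and no numeric encoding of the categories. It is a generic illustration: the function name, the hypothetical attribute values and the way the measure is used here are ours, and the sketch does not reproduce how CCTree applies entropy internally.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of a list of categorical values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical values of one categorical feature across two groups of emails.
homogeneous = ["html", "html", "html", "html"]
mixed = ["html", "plain", "html", "plain"]

print(shannon_entropy(homogeneous))
print(shannon_entropy(mixed))
```

A group in which every email shares the same value has zero entropy, while an evenly mixed group of two values has an entropy of one bit; no ordering or distance between "html" and "plain" has to be invented.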

The Frequent Pattern Tree (FP-Tree) is another technique applied to detect spam campaigns in large datasets. The authors of [27], [44] extract a set of features from each spam message. The FP-Tree is built based on the frequency of the features. The sensitive representation of both message layout and URL features causes two spam emails with a small difference to be assigned to different campaigns. For this reason, the FP-Tree approach is vulnerable to text obfuscation techniques [108], used to deceive anti-spam filters, and to emails with dynamically generated links. Our methodology, based on categorical features which do not consider text and link semantics, is more robust against these deceptions.

4.6 Conclusion

In this chapter, we proposed a methodology, based on a categorical clustering algorithm named CCTree, to cluster large amounts of spam emails into campaigns, grouping them by structural similarity. To this end, the set of features representing the message structure has been carefully chosen, and the intervals for each feature have been found through a discretization approach. The CCTree algorithm has been extensively tested on two datasets of spam emails, to measure both the capability of generating homogeneous clusters and the specificity in recognizing predefined groups of spam emails. A guideline for selecting the CCTree parameters is provided, whilst the determinacy of the selected parameters for similar datasets of the same size has been shown statistically. To this end, several datasets of spam emails, each containing a large amount of spam messages (each set contains almost 10k spam emails) and gathered from the same untroubled honeypot, are clustered with the CCTree algorithm. In this experiment, the same stopping criteria are applied. Through tables and diagrams we show that CCTree preserves the same trend when the datasets have (almost) the same size. Considering the accuracy and efficiency shown, the proposed methodology may stand as a valuable tool to help authorities in rapidly analyzing large amounts of spam emails, with the purpose of finding and prosecuting spammers. To the best of our knowledge, we are the first to show the effectiveness and efficiency of a proposed algorithm for clustering spam emails into campaigns, whilst the techniques proposed in previous works were not evaluated.

Chapter 5

Labeling and Ranking Spam Campaigns

Fast analysis of correlated spam emails may be vital in the effort of finding and prosecuting spammers performing cybercrimes such as phishing and online frauds. In this chapter we present a self-learning framework to automatically divide and classify large amounts of spam emails into correlated labeled groups. Building on large datasets collected daily through honeypots, the emails are first divided into homogeneous groups of similar messages (campaigns), which can be related to a specific spammer. Each campaign is then associated with a class which specifies the goal of the attacker, e.g. phishing, advertisement, etc. The proposed framework exploits a categorical clustering algorithm to group similar emails, and a classifier to subsequently label each email group. The main advantage of the proposed framework is that it can be used on large spam email datasets for which no prior knowledge is provided. The approach has been tested on more than 3200 real and recent spam emails, divided into more than 60 campaigns, reporting a classification accuracy of 97% on the classified data. Afterwards, a ranking approach is proposed to automatically rank spam campaigns according to investigator priorities [128].

5.1 Introduction

At the end of 2014, emails are still one of the most common forms of communication on the Internet. Unfortunately, emails are also the main vector for sending unsolicited bulk messages, generally for commercial purposes, commonly known as spam. The research community has investigated the problem for several years, proposing tools and methodologies to mitigate this issue. However, a definitive solution to the problem of spam emails still has to be found. Moreover, the problem of spam emails is not only related to unsolicited advertisement. Spam emails have become a vector to perform different kinds of cybercrimes, including phishing, cyber-frauds and the spreading of malware.

Motivation: Filtering spam emails at the user end is not enough to fight this kind of attack, which moves the effect of unsolicited spam emails from illicit behavior to real crime. Finding the spammers becomes important not only to tackle the problem of spam emails at the source, but also to legally prosecute those responsible for cybercrimes carried out through spam emails other than undesired advertisement. To identify spammers, the early analysis of huge amounts of messages, to find correlated spam emails with a specific spammer purpose, is vital. Several papers in the literature observed that forensic analysis, which plays a major role in finding and prosecuting spammers for cybercrimes, needs a proactive mechanism or tool able to perform a fast, multi-staged analysis of emails in a timely fashion [63], [44], [144], [40]. To this end, large amounts of spam emails, generally collected through honeypots, should first be divided into groups of similar emails, which could be related to the same spammer (i.e., spam campaigns). Afterward, each campaign should be assigned a label describing the purpose of the spammer. This goal-based labeling facilitates for investigators the analysis of spam campaigns, possibly directed toward a specific cybercrime. Moreover, labeling spam campaigns based on the goal of the spammer can help to rank them. However, this analysis generally appears to be a challenging task. In fact, considering the number of produced spam emails and their variance, spam email datasets are huge and very difficult to handle. In particular, human analysis is almost impossible, considering the amount of spam emails caught daily by a spam honeypot [141], [144]. On the other hand, an automated and accurate analysis requires the usage of correctly trained computational intelligence tools, i.e. classifiers, whose training requires accurately chosen datasets, which present to the classifier a good description of the reality in which it will be employed. Moreover, due to the high variance of spam emails, a valid training set may become obsolete in a few weeks, and a new up-to-date training set must be generated in a short period of time.

Though previous work largely improved the state of the art in the analysis of spam emails for forensic purposes, more improvement is still needed. In particular, previous work either focuses on a specific cybercrime only, especially phishing [50], or exploits in the analysis a small set of features which is not effective in identifying some cybercrime emails. For example, the analysis of email text words [63] or link domains [44] is not effective in identifying emails used to distribute malware, which often do not contain text, or spam emails with dynamic links [16].

Contribution: In this chapter, we propose Digital Waste Sorter (DWS), a framework which exploits a self-learning approach, based on the goal of the spammer, for spam email classification. The proposed approach aims at automatically classifying large amounts of raw unclassified spam emails, dividing them into campaigns and labeling each campaign with its spammer goal. To this end we propose five class labels to group spammer goals into five macro-groups, namely Advertisement, Portal Redirection, Advanced Fee Fraud, Malware Distribution and Phishing. Moreover, a set of 21 categorical features representative of the email structure is proposed to perform a multi-feature analysis aimed at identifying emails related to a large range of cybercrimes. DWS is based on the cooperation of unsupervised and supervised learning algorithms. Given a set of classes describing different spammer goals and a dataset of unclassified spam emails, the proposed approach first automatically creates a valid training set for a classifier by exploiting a categorical clustering algorithm named CCTree (Categorical Clustering Tree). In more detail, this clustering algorithm divides the dataset into structurally similar groups of emails, named spam campaigns [26]. DWS is built on the results of CCTree, which is effective in dividing spam emails into homogeneous clusters. Afterward, significant spam campaigns useful for the generation of the training set are selected through their similarity with a small set of known emails, representative of each spam class. Then, a classifier is trained using the selected campaigns as training set, and is used to classify the remaining unclassified emails of the dataset.
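The workflow described above can be sketched as follows. This is a toy illustration under explicit assumptions, not the DWS implementation: the campaigns are taken as already produced by the clustering step, emails are tuples of categorical feature values, campaign selection uses an average Hamming similarity with a hypothetical threshold of 0.8, and a 1-nearest-neighbour rule stands in for the trained classifier. All names and data below are ours.

```python
def hamming_similarity(a, b):
    """Fraction of positions in which two equal-length feature tuples agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def label_campaigns(campaigns, seeds, threshold=0.8):
    """Label the campaigns that are similar enough to one seed class.

    campaigns: dict campaign id -> list of feature tuples (clustering output).
    seeds: dict class label -> list of known representative feature tuples.
    """
    labeled = {}
    for cid, emails in campaigns.items():
        best_label, best_score = None, 0.0
        for label, examples in seeds.items():
            # Average, over the campaign, of the best similarity to a seed.
            score = sum(
                max(hamming_similarity(e, s) for s in examples) for e in emails
            ) / len(emails)
            if score > best_score:
                best_label, best_score = label, score
        if best_score >= threshold:
            labeled[cid] = best_label
    return labeled


def build_training_set(campaigns, labeled):
    """Training pairs (email, label) taken from the selected campaigns only."""
    return [(e, label) for cid, label in labeled.items() for e in campaigns[cid]]


def classify(email, training_set):
    """Stand-in classifier: label of the most similar training email."""
    return max(training_set, key=lambda pair: hamming_similarity(email, pair[0]))[1]


# Hypothetical clustering output and seed emails (attachment, links, text length).
campaigns = {
    0: [("attach", "links0", "short"), ("attach", "links0", "short")],
    1: [("noattach", "links1", "short"), ("noattach", "links1", "medium")],
    2: [("noattach", "links0", "long")],
}
seeds = {
    "Malware": [("attach", "links0", "short")],
    "Phishing": [("noattach", "links1", "short")],
}

labeled = label_campaigns(campaigns, seeds)
training_set = build_training_set(campaigns, labeled)
print(labeled)
print(classify(("noattach", "links1", "long"), training_set))
```

Note that the seed emails are used only to select and label campaigns; the training pairs themselves come from the dataset being analyzed, which is the point of the self-learning step.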

To further meet the needs of forensic investigators, who have limited time and resources to perform email examinations [40], the DWS methodology does not require prior knowledge of the dataset, except the desired classes (i.e. spammer goals) and a small set of emails representative of each class. It is worth noting that this email set cannot be used to train the classifier. In fact, this set contains a small number of emails not belonging to the dataset to be classified, and is thus not necessarily descriptive of the reality in which the classifier will operate.

In the following we describe the DWS framework in detail, explaining the process of division into campaigns, training set generation and campaign classification. The effectiveness of the framework has been evaluated against a set of 3200 recent raw spam emails extracted by a honeypot. DWS reported a classification accuracy of 97.8% on this preliminary dataset. To justify the classifier selection, an analysis of the performance of different classifiers is presented. Furthermore, we propose five features, including the label of the campaigns discovered with DWS, to automatically rank a set of spam campaigns according to investigator priorities.

The rest of the chapter is organized as follows. Section 5.2 reports related work on email classification. Section 5.3 presents the DWS framework and workflow in detail, and gives brief background information on the clustering algorithm. Section 5.4 presents the results of the analysis on a real dataset of spam emails, as well as a comparison of the classification results of four different classifiers. In Section 5.7 a technique is proposed to rank spam campaigns. Finally, Section 5.6 briefly concludes, reporting planned future extensions.

5.2 Related Work

In the literature, spam campaigns are usually labeled based on characteristic strings (keywords) representing individual campaign types, as in [44], [88] and [55]. These approaches are weak against spam emails which do not contain keywords or which use word obfuscation techniques. The authors of [106] label spam campaigns on the basis of the URLs, phone numbers, Skype IDs, and Mail IDs used as contact information. This methodology is effective only against emails reporting contacts, which are only a subset of all the spam emails found in the wild. This means that the proposed approach fails in detecting spam campaigns not containing the aforementioned contact information.

There are several approaches in the literature in which the spammer goal is considered. However, these approaches are mainly focused on detecting phishing emails, not considering other spammer purposes. Fette et al. [50] applied 10 email features to discern phishing emails from ham (good) emails. Bergholz et al. [17] propose a similar methodology with additional features to train a classifier in order to filter phishing emails. Almomani et al. [3] provide a survey on different techniques for filtering phishing emails, while Gansterer et al. [53] compare different machine learning algorithms in phishing detection. Furthermore, the authors propose a technique which refines the previous phishing filtering approaches. In this work, three types of messages, named ham, spam and phishing, are distinguished automatically. Nevertheless, the category of emails containing spam is not precisely characterized. In [34] a methodology to detect phishing emails based on both machine learning and heuristics is proposed. These approaches report accuracies ranging from 92% to 96%, where the classifiers have been trained on labeled datasets. In contrast, DWS generates the training set on the fly, without requiring a pre-trained classifier. Nevertheless, in the performed experiments DWS shows comparable accuracy.

5.3 Digital Waste Sorting

DWS is a Java-based framework which takes as input datasets of unclassified spam emails. DWS first divides the emails into campaigns by means of a hierarchical clustering algorithm, then labels each campaign through a classifier. The classifier is trained on the fly, through a training set generated by DWS directly from the unlabeled input dataset, exploiting the knowledge generated by the clustering algorithm.

This section describes the DWS framework and methodology in detail. First we present the classes used to label each spam campaign. Then, we present the feature extraction process from raw emails, discussing the relevance of the features in describing structural elements of an email and their relation to each spam class. The framework is then presented, briefly introducing the clustering algorithm and the methodology for the generation of the training set. Finally, the classification process is presented.

5.3.1 Definition of Classes

As anticipated, spam emails can be sent with different intentions, spanning from common advertisement to vectors of different cybercrimes. We argue that spam emails can be divided into five well-known macro-groups which represent the main targets of spammers, and can thus be used to label spam campaigns.

Figure 5.1 – Advertisement

Advertisement: The advertisement class contains those emails whose target is convincing a user to buy a specific product [84]. Advertisement emails embody the most typical idea of spam messages, advertising any kind of product which could be of interest to companies or private users. Generally these emails only constitute a hindrance to the users, who have to spend time removing them from the inbox. Notwithstanding, some campaigns provide revenues of up to 1M US dollars per month to spammers [84]. The main requirements for a commercial email to be legal according to the Federal Trade Commission [49] are that it uses no deceptive subject lines, provides correct and complete header information and the real physical location of the business, offers an opt-out choice, and honors opt-out requests within 10 business days. In the present work, we consider as advertisement emails both the ones which comply with the legal requirements and the ones that do not, given that their purpose is clearly advertising a product. The first time that spam came under consideration as a business was in April 1994. Two lawyers from Phoenix, named Canter and Siegel, hired a programmer to distribute their “Green Card Lottery Final One!” message to as many newsgroups as possible. The interesting point in this act was that they did not hide the fact that they were the spammers; on the contrary, they were proud of it. Canter and Siegel decided to write a book with the title “How to Make a Fortune on the Information Superhighway: Everyone's Guerrilla Guide to Marketing on the Internet and Other On Line Services”. Moreover, they planned to open a consulting company to teach and help others to post their own advertisements, which never took off. However, in 2015 spam emails are still one of the most popular tools for advertising goods. Figure 5.1 shows a typical sample of an advertisement spam email, containing several photos and prices; clicking on a photo directs the user to the spammer's website, which tries to convince him to buy a product or service.

Portal Redirection: Portal redirection spam emails are the enablers of an evolved advertisement methodology. These spam emails are characterized by a minimal structure, generally reporting one or more links to one or more websites. Once the user clicks on a link, she is redirected several times to different pages whose addresses are dynamically generated. The final target page is mostly an advertisement portal with several links divided by categories, generally related to common user needs (e.g., medical insurance). This strategy is useful in reducing the legal responsibility for spam emails of the companies which are advertising a product. The rationale is that the advertised company cannot be sued because another website, i.e. the portal, links to it. As an example, the opt-out clause of advertisement emails [83] does not apply. Moreover, the multi-redirection strategy with dynamic links makes it difficult to track the responsible websites. It is worth noting that the portal redirection strategy is also used to redirect users to websites which intend to defraud them or to distribute malicious code, and to increase the visits of a web page. The first redirect service¹, in 1999, took advantage of top-level domains (TLDs) like ".to" (Tonga), ".at" (Austria) and ".is" (Iceland). The aim was to make URLs memorable. The first redirect service return to that redirected 4 million users at its peak in 2000. The success of was resulted from the fact that it contained a wide variety of short memorable domains including “”, “”, “”, “” and “”. As the sales price of top-level domains fell from $70.00 per year to less than $10.00, the use of redirection services declined. , is an attack with the purpose of indexing web page. The goal of a web designer is to create a web page that will obtain favorable rankings in the search engines, and designers build their web pages according to the standards that they think will succeed. Spam emails are a convenient place to embed links to pages whose number of visits should be raised. To this end, the portal redirection technique can be applied to redirect users to several desired web pages. Figure 5.2 shows a typical portal redirection spam email, containing several hyperlinks hidden under luring text that deceives the user into clicking.

Advanced Fee Fraud: An advanced fee fraud or confidence trick spam email (synonyms include confidence scheme or scam) attempts to defraud a person after first gaining their confidence, used in the classical sense of trust [71]. Confidence trick spam exploits social engineering to trick the user into paying, of her own will, a certain amount of money to the attacker. Scammers may use several techniques to deceive the user into paying money, generally exploiting sentimental relations or promising a large amount of money in return. Confidence trick emails are mostly written as a long, friendly text, to draw the victim into the interaction. These kinds of emails usually do not redirect users to other web pages; they mainly contain an email address. 419 scams [61] are one of the most common types of confidence tricks, dating back to the late 18th century. The confidence trick scam typically promises the victim a significant share of a large sum of money, for which the spammer requires a small advance payment. If a victim provides the payment, the fraudster either asks for further fees or simply disappears. In these cases, the subject line of the email often contains something like “From the desk of barrister [Name]”, “Your assistance is needed”, and so on. When the victim's confidence has been gained, the scammer introduces a problem that prevents the deal from occurring as planned, such as “To transmit the money, we need to bribe a bank official. Could you help us with a loan?” or similar. Although it is difficult to evaluate the success rate of a fraud spam campaign, one individual estimated that he sent 500 emails per day and received about seven replies, mentioning that when he received a reply, he was 70 percent certain that he would get the money². The lottery scam is another well-known example of confidence trick spam. It contains a fake notice of a lottery win. The winner is usually asked to send sensitive information such as name, residential address, occupation/position, lottery number etc. to a free email account. Then the spammer informs the victim that releasing the money requires some small fee (insurance, registration, or shipping). When the victim sends the fee, the scammer asks for another fee to be paid³. In the UK, lottery scams became such a big problem that many legitimate lottery sites dedicated pages to the subject to address the issue⁴. Figure 5.3 shows a typical confidence trick spam email, which tries to earn the victim's confidence through a long, friendly text.

Figure 5.2 – Portal

Figure 5.3 – Fraud

1. http ://

Malware : Emails are an important vector for spreading malicious software or malware. Generally the malware is sent as email attachment, while the email structure is very simple, with a small text which encourages the reader to open the attachment or no text at all 5. Once opened the malware infects the user device, showing different pos- sible malicious behaviors. Commonly, the malware transform the victim’s device in a bot which is used to send spam messages to other prospective victims, which can be chosen by the spammer, or even being part of the victim’s contact list. To this category belongs Command and Control malware and Worms [104]. Often the malicious file is camouflaged, inserted in a zip file or with a modified extension, which allows to deceive basic anti-virus control implemented by some spam filters. Figure 5.4 shows a typical representation of malware spam email, where mostly contains an attachment, convincing the user with luring sentences to open it. Notice that it is also possible that malware campaign be designed in the format of portal redirection. By the way, here, when we talk about malware campaign, we mean that from the layout of spam messages, we are almost sure that the spam campaign has been designed for malware distribution. One of the very well-known malware spam campaign, is titled “Melissa.A”, a virus with a woman’s name, appearing on 26 March 1999 in the United States. Melissa.A came with the message “Here is the document you asked me for ... do not show it to anyone”. The virus came through email including an MS Word attachment. once opened, it was 2. 3. 4. 5. http ://

Figure 5.4 – Malware

emailing itself to the first 50 people in the MS Outlook contact list. In a few days, it became one of the most important cases of massive infection in history, causing damage of more than 80 million dollars to American companies; companies like Microsoft, Intel and Lucent Technologies blocked their Internet connections to protect themselves from Melissa.A. The virus infected up to 20% of computers worldwide. The ILOVEYOU virus is another malware spam campaign, which many consider the most damaging virus ever written. It spread by email in 2000 through an attachment named “LOVE-LETTER-FOR-YOU.txt.vbs”. Once a user opened the attachment, the virus loaded itself into memory and the computer became infected. It then infected executable files, image files, audio files, etc. Afterwards, it sent itself to the addresses contained in the MS Outlook contact list. It caused billions of dollars in damages. CryptoLocker is a more recent malware campaign, a ransomware trojan targeting computers running Microsoft Windows, first distributed on the Internet on 5 September 2013. When activated, the malware encrypts several types of files stored on local and mounted network drives using RSA public-key cryptography, with the private key stored only on the malware’s control server. The malware then displays a message offering to decrypt the data if a payment (through either Bitcoin or a pre-paid cash voucher) is made by a stated deadline, and threatens to delete the private key if the deadline passes. Figure 5.5 shows the increase of crypto ransomware from 2013 to 2014.

Figure 5.5 – Crypto ransomware volume

Phishing : Phishing emails attempt to redirect users to websites designed to illegally obtain credentials or financial data such as usernames, passwords, and credit card details [3]. Generally, these emails pretend to be sent by a banking organization or by a service accessed through a username and password, e.g. social networks or instant messaging, and report fake security issues that require the user to confirm her data in order to access the service again. To this end, phishing emails are mostly well presented, with a well-organized structure, and even report contact information such as phone numbers and email addresses. The representative phishing emails used in this research contain a short, well-written text giving the victim some important news. Mostly there is a single link, which directs the user to a very well designed fake bank website that asks the victim to provide her credit card information.

On 26 January 2004, the U.S. Federal Trade Commission filed the first lawsuit against a suspected phisher, a Californian teenager who created a webpage designed to look like the America Online website and used it to steal credit card information. Other countries have followed this lead by tracing and arresting phishers. A phishing kingpin, Valdir Paulo de Almeida, was arrested in Brazil for leading one of the largest phishing crime rings, which in two years stole between $18 million and $37 million. As of 2015, phishing is still one of the most dangerous and effective kinds of spam, and fighting it requires extensive effort. Figure 5.6 shows a typical phishing spam email, designed to look as much as possible like a real message from the organization it pretends to come from.

Figure 5.6 – Phishing


5.3.2 Feature Extraction

DWS parses raw spam emails (eml files) and extracts a set of 21 categorical features, building a numerical vector readable by clustering and classification algorithms. The extracted features are reported in Table 5.1. Note that Table 5.1 is identical to Table 4.1; it is repeated here to relate the spammer goals to the set of features, as follows.

The “number of recipients” in the To and Cc fields of the email differentiates between emails that should look strictly personal, e.g. communications from a bank (phishing), and those that pretend to be sent to several recipients, such as some kinds of fraud or advertisement. The structure of the links in the email text provides several pieces of information useful in determining the email goal. Portal redirection and advertisement emails generally show a high “Number of links”, in the first case to redirect the user to different portal websites, in the second to redirect the user to the website where she can buy the products. Generally, fraud emails do not contain links except for “IP based links”. These links are expressed through IP addresses, without domain names, to reduce the likelihood of being tracked or to make the email text, which generally discusses a secret money transaction, look more legitimate. The “number of domains in links” is the number of different domains found across all the links in the email text. Advertisement and phishing emails generally have just a single domain, respectively the website where the advertised product can be bought and the website of the authority that the message pretends to come from. On the other hand, portal redirection emails may contain several domains to redirect the reader to different portal websites. Moreover, links in portal redirection emails generally have a high “average number of dots in links” (i.e. sub-domains) and, being dynamically generated, are likely to contain hexadecimal or non-ASCII characters. Non-ASCII characters in the links are also typical of some advertisement emails redirecting to foreign websites. It is worth noting that all these link-based features consider the real destination address, not the clickable text shown to the user. If the clickable text (hyperlink) shows an address different from the destination address (“click here”-like text is not considered), the link is considered mismatching and counted by the feature “mismatching links”. Phishing and portal redirection emails make extensive use of mismatching links to deceive the user.

Table 5.1 – Features extracted from each email.

| Attribute | Description |
|---|---|
| RecipientNumber | Number of recipient addresses. |
| NumberOfLinks | Total links in email text. |
| NumberOfIPBasedLinks | Links shown as an IP address. |
| NumberOfMismatchingLinks | Links with a text different from the real link. |
| NumberOfDomainsInLinks | Number of domains in links. |
| AvgDotsPerLink | Average number of dots in link in text. |
| NumberOfLinksWithAt | Number of links containing “@”. |
| NumberOfLinksWithHex | Number of links containing hex chars. |
| NumberOfNonAsciiLinks | Number of links with non-ASCII chars. |
| IsHtml | True if the mail contains html tags. |
| EmailSize | The email size, including attachments. |
| Language | Email language. |
| AttachmentNumber | Number of attachments. |
| AttachmentSize | Total size of email attachments. |
| AttachmentType | File type of the biggest attachment. |
| WordsInSubject | Number of words in subject. |
| CharsInSubject | Number of chars in subject. |
| ReOrFwdInSubject | True if subject contains “Re” or “Fwd”. |
| SubjectLanguage | Language of the subject. |
| NonAsciiCharsInSubject | Number of non-ASCII chars in subject. |
| ImagesNumber | Number of images in the email text. |

Advertisement and phishing emails may look like a web page. In this case, the email contains HTML tags. On the other hand, fraud, malware and portal emails are rarely presented in HTML format. The size of an email is another important structural feature. Confidential trick and portal redirection emails are generally quite small, since they are raw text. Advertisement, malware and some kinds of phishing emails generally have a more complex structure, including images and/or attachments, which makes the message size grow noticeably. “Attachment Number”, “Attachment Size” and “Attachment Type” are structural features mainly used to distinguish the attachments of malware emails from those of advertisement and phishing emails, which attach images for correct visualization. The “Number of Images” in an email determines the global look of the message. Images are typical of some advertisement and phishing emails. Finally, three features are used for the analysis of the subject. For example, some advertisement emails use several one-character words or non-ASCII characters to deceive typical spam detection techniques based on keywords [123]. It is worth noting that non-ASCII characters are rarely used in phishing emails, so that they look more legitimate. Moreover, some fraud and phishing emails use a deceptive subject with the “Re:” or “Fwd:” keyword to look like part of a conversation started by the victim. Furthermore, some fraud emails are characterized by a difference between the email “Language” and the “Subject Language”. Many scam emails are, in fact, translated by automatic software that ignores the subject, causing this language duality.

Table 5.2 – Feature vectors of a spam email for each class.

| Class | NumAttach | Typeattach | NumLinks | NumImages | NumDomains | EmailSize | SubjLang | CharsInSubj | Lang |
|---|---|---|---|---|---|---|---|---|---|
| Advert. | 0 | 0 | 11 | 12 | 2 | 14 | 10 | 3 | 10 |
| Portal | 0 | 0 | 10 | 0 | 1 | 10 | 1 | 3 | 1 |
| Fraud | 0 | 0 | 0 | 0 | 0 | 10 | 1 | 1 | 1 |
| Malware | 1 | 5 | 0 | 0 | 0 | 21 | 1 | 2 | 1 |
| Phishing | 0 | 0 | 2 | 0 | 2 | 9 | 1 | 3 | 1 |

For further insight, Table 5.2 shows the vectors of some selected features extracted from the five emails of Figures 5.1, 5.2, 5.3, 5.4 and 5.6.
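To make the link-based features more concrete, the following minimal sketch computes a few of them from a list of destination URLs. It is not the DWS implementation: the function name `link_features`, the use of the host name as the "domain", and the choice to count dots over the whole URL string are assumptions made for illustration only.

```python
import ipaddress
from urllib.parse import urlsplit


def link_features(urls):
    """Compute a few link-based counts from destination URLs (not anchor text)."""
    # Host part of each link; empty string when the URL has no host.
    hosts = [(urlsplit(u).hostname or "") for u in urls]

    def is_ip(host):
        # A link is "IP based" when its host parses as an IPv4/IPv6 address.
        try:
            ipaddress.ip_address(host)
            return True
        except ValueError:
            return False

    return {
        "NumberOfLinks": len(urls),
        "NumberOfIPBasedLinks": sum(is_ip(h) for h in hosts),
        # Assumption: distinct non-IP host names stand in for "domains".
        "NumberOfDomainsInLinks": len({h for h in hosts if h and not is_ip(h)}),
        # Assumption: dots are counted over the whole URL string.
        "AvgDotsPerLink": (sum(u.count(".") for u in urls) / len(urls)) if urls else 0.0,
        "NumberOfLinksWithAt": sum("@" in u for u in urls),
        "NumberOfNonAsciiLinks": sum(not u.isascii() for u in urls),
    }


example = link_features([
    "http://192.168.0.1/pay",
    "http://shop.example.com/a",
    "http://user@shop.example.com/b",
])
```

In this toy input, one of the three links is IP based and the other two share a single host name, which is the kind of pattern the text associates with fraud and advertisement emails respectively.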

5.3.3 DWS Classification Workflow

After the email features have been extracted, the resulting feature vectors are given as input to the DWS classification workflow. This process aims at dividing the unclassified spam emails into campaigns and labeling them through a classifier trained on the fly. The classifier can then be applied to label new spam emails. The workflow of the proposed approach is depicted in Figure 5.7.

The main part of the workflow aims at generating a valid training set from the dataset of unclassified emails by applying a hierarchical clustering algorithm that divides the emails into campaigns (step 1 in Figure 5.7). The chosen algorithm, named Categorical Clustering Tree (CCTree), generates a tree-like structure (step 2), which is exploited to associate a campaign with each email coming from a small dataset of labeled emails. The campaign receives the label of the email associated with it (step 3). This set of labeled campaigns is then used as the training set for a classifier (step 4), which is subsequently used to label all the remaining campaigns (steps 5 and 6).

The framework is based on a clustering algorithm (CCTree) and a classifier. As discussed, classifiers are generally more accurate than clustering algorithms, thanks to the supervised learning approach. However, the major drawback of classifiers is that a valid training set, with enough elements and representative of reality, is not always available. We argue that it is possible to create such a training set by exploiting the CCTree algorithm and a small set of classified emails (C). The set C contains emails representative of each class; however, the number of elements is not enough to constitute a valid training set for a classifier.

Figure 5.7 – DWS Workflow.

Starting from the dataset D, the CCTree algorithm generates a decision-tree-like structure whose leaves are the final, unlabeled clusters. By following the CCTree structure, it is possible to place the emails of the set C in the unlabeled clusters of the set D, so as to find similarly structured emails. In fact, in the problem of clustering spam emails, each cluster represents a set of homogeneous spam emails, i.e. a spam campaign. Thus, for the purpose of goal-based labeling, all emails belonging to the same cluster receive the same label. Finally, the emails of these homogeneous clusters can be used as the training set for the supervised learning classifier. After the classifier has been trained, it is used to classify the remaining leaves of the CCTree that were not reached by any email of the set C.

Figure 5.7 schematically depicts the typical operative workflow of the proposed framework. In the following, the six steps of the DWS workflow, grouped into three phases, are described in detail.

Phase 1 : Clustering Spam Emails into Campaigns

The first step performed by the DWS framework is to divide large amounts of unclassified spam emails (constituting the set D) into smaller groups of similar messages (steps 1 and 2 in Figure 5.7). Emails are clustered by structural similarity exploiting the CCTree algorithm.

Figure 5.8 – Insert new instance X in a CCTree

Phase 2 : Training Set Generation

In order to label the campaigns, it is necessary to train a classifier to recognize emails coming from the five predefined spam classes (steps 3 and 4 in Figure 5.7). To this end, the classifier must be given a good training set, representative of the reality in which the classifier has to operate. For this reason, the training set is extracted from the unclassified email dataset D itself. More specifically, the CCTree structure generated in the previous step is exploited to label a small number of the generated spam campaigns with the help of a small set of labeled emails C. This set contains a small number of manually selected spam emails, equally distributed among the five classes and all structurally different. These spam emails do not come from the set D. The emails in C have to be chosen carefully on the basis of the emails that the investigators are interested in. For example, Italian police investigators interested in following a phishing case should put in C some emails with Italian text and bank names. After the feature values have been extracted from the emails in C, the emails are fed one by one to the CCTree generated on D. Following the CCTree structure, each email ci is eventually inserted in a campaign Cj (Figure 5.8). The campaign Cj is then labeled with the class of ci, and all its emails are added to the training set.

If the same spam campaign is reached by two or more emails of different classes, the campaign is discarded and the emails are re-evaluated to be sent to other campaigns. It is worth noting that such an event is unlikely, due to the high homogeneity of the clusters generated through CCTree. Furthermore, if an email in C does not reach any campaign, i.e. a specific attribute value of the email is not present in the CCTree, the email is inserted in the most similar campaign. To this end, the node purity of each campaign is calculated before and after the insertion of the email ci. The email is then assigned to the campaign for which the difference between the two purities, weighted by the number of elements, is smallest.
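The two rules of this phase can be sketched as follows. This is only an illustration under stated assumptions: emails are represented as dictionaries of categorical attributes, node purity is taken to be the sum over attributes of the Shannon entropy (0 meaning a perfectly pure node), and the weighted difference is computed as (n + 1) · purity_after − n · purity_before. The function names are hypothetical, and the sketch simply drops conflicting campaigns without re-evaluating the seed emails.

```python
import math
from collections import Counter, defaultdict


def label_campaigns(seed_hits):
    """seed_hits: iterable of (campaign_id, seed_label) pairs.

    Returns {campaign_id: label} for campaigns reached by seeds of a single
    class; campaigns reached by conflicting classes are discarded.
    """
    labels = defaultdict(set)
    for campaign_id, seed_label in seed_hits:
        labels[campaign_id].add(seed_label)
    return {cid: next(iter(ls)) for cid, ls in labels.items() if len(ls) == 1}


def purity(emails):
    """Assumed node purity: sum over attributes of Shannon entropy (0 = pure)."""
    if not emails:
        return 0.0
    n = len(emails)
    total = 0.0
    for attr in emails[0]:
        counts = Counter(e[attr] for e in emails)
        total -= sum((c / n) * math.log2(c / n) for c in counts.values())
    return total


def most_similar_campaign(campaigns, email):
    """Pick the campaign whose size-weighted purity changes least on insertion."""
    def cost(members):
        before = len(members) * purity(members)
        after = (len(members) + 1) * purity(members + [email])
        return after - before

    return min(campaigns, key=lambda cid: cost(campaigns[cid]))
```

For example, an English HTML email would be assigned to a campaign made only of English HTML emails rather than to one made of Italian plain-text emails, because inserting it there leaves the purity unchanged.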

Phase 3 : Labeling Spam Campaigns

Feeding the training set to the classifier, we are able to classify all the remaining campaigns generated by the CCTree (steps 5 and 6 in Figure 5.7). To this end, each campaign produced by CCTree is given to the classifier, which labels each email of the campaign on the basis of the spammer’s purpose. DWS considers a spam campaign as non classified under either of two conditions.

Firstly, it is possible that emails belonging to the same campaign receive different labels, e.g. phishing and portal redirection. In such a case, calling “majority class” the label shared by the most emails in the cluster, the campaign is considered non classified if the emails of the majority class amount to less than 90% of all the emails in the campaign.

The second condition is related to the prediction error reported by the classifier on each element of a campaign. The predicted error is computed as 1 − P (ei ∈ Ωj), where P (ei ∈ Ωj) is the probability that the element ei belongs to the class Ωj, i.e. the label assigned to ei. The DWS framework considers a campaign as non classified if the average predicted error is more than 30%. If the non classified campaigns and/or elements make up a considerable percentage, it is possible to restart the classification process by running the CCTree algorithm with tighter criteria for node purity.
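The two conditions can be combined into a single acceptance rule, sketched below. The function name `campaign_label` and the representation of the classifier output as (label, probability) pairs are assumptions made for illustration; the 90% and 30% thresholds are those stated in the text.

```python
from collections import Counter


def campaign_label(predictions, min_majority=0.90, max_avg_error=0.30):
    """predictions: list of (label, probability) pairs, one per email.

    Returns the majority label, or None when the campaign is non classified.
    """
    if not predictions:
        return None
    counts = Counter(label for label, _ in predictions)
    majority, majority_count = counts.most_common(1)[0]
    # Condition 1: the majority class must cover at least 90% of the emails.
    if majority_count / len(predictions) < min_majority:
        return None
    # Condition 2: the average predicted error 1 - P must not exceed 30%.
    avg_error = sum(1.0 - p for _, p in predictions) / len(predictions)
    if avg_error > max_avg_error:
        return None
    return majority
```

A campaign with nine emails labeled phishing and one labeled portal, all with probability 0.9, is accepted as phishing, whereas a campaign whose emails all receive the same label with probability 0.6 is left non classified.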

5.4 Results

This section presents the experimental results of the DWS framework. First we discuss the classifier selection process, exploiting two small datasets of manually labeled spam emails. Afterward, we present the results for a real use case of the DWS framework on a recent dataset of spam emails.

5.4.1 Classifier Selection

In this first set of experiments we compare the performance of three different classifiers. To this end, two sets of real spam emails are used as training and test sets. These two datasets are extracted from emails collected by the untroubled honeypot in January and February 2015. The emails have been manually analyzed and labeled for standard supervised learning classification and performance evaluation. The manual analysis and labeling process has been performed rigorously, analyzing text and images and following the links in each email. Only the emails whose class was certain have been inserted in the datasets. For a spam email, the label is certain if the email matches the label description given in Section 5.3.1 and the label is verified through manual analysis. For example, Portal Redirection emails are labeled with certainty if the links really redirect to a portal website. The first dataset, used as training set, is made of 160 spam emails; the second one, used as test set, is made of 80 emails.

Experiments have been run on all the classifiers offered by the WEKA library for classifying categorical data. For the sake of brevity and clarity, we only report the classifier with the best results in each classifier group. More specifically, the chosen classifiers are K-star from the Lazy group, Random Forest from the Tree group and Bayes Network from the Bayes group. Among these three classifiers, the best one is used by the DWS framework in its operative phase.

Dataset Dimensioning

The process of manual analysis and labeling is time consuming. However, a well-balanced dataset, without duplicates and representative of the five classes, is needed to correctly assess classifier performance. Given the complexity of the manual analysis procedure, it is not possible to choose training and test sets of extremely large size. Thus, standard dimensioning techniques have been used for both the training and the test set. A general rule for the minimum size of a training set is six times the number of features used [140]. The training set of 160 elements satisfies this condition (6 × 21 = 126 < 160). However, in a multi-class problem, the size of the data should also provide good results in terms of sensitivity and specificity, i.e. true positive rate (TPR) and (1 − false positive rate (FPR)) respectively, when K-fold validation is applied [14]. This must be done while keeping the relative frequencies of the classes balanced. As shown in the following, the provided training set returns, under K-fold validation, a Receiver Operating Characteristic Area Under Curve (ROC-AUC, or AUC for short) higher than 90% for all tested classifiers.

Concerning the test set, it is important that its intersection with the training set is empty and that the relative frequencies of the classes are balanced. In [14], the minimum size for a test set to provide meaningful results in a classification problem with five classes is estimated to be 75, which is smaller than the provided test set of 80 spam emails.

Classification Results

We now report the classification results for the three tested classifiers on the two aforementioned datasets. The first set has been used as training set for the classifiers. Following the methodology in [14], a first performance evaluation has been done through the K-fold (K = 5) validation method, classifying the data K times, each time using (K − 1)/K of the dataset as training set and the remaining elements as test set. The evaluation indexes used are the True Positive Rate (TPR), the False Positive Rate (FPR) and the Receiver Operating Characteristic Area Under Curve (ROC-AUC or simply AUC). The AUC is defined in the interval [0, 1] and measures the performance of a classifier as a threshold parameter T, specific to the classifier itself, varies, according to the following formula:

$$\mathrm{AUC} = \int_{-\infty}^{\infty} \mathrm{TPR}(T) \cdot \mathrm{FPR}'(T)\, dT$$

where FPR′ = 1 − FPR. When the value of AUC is equal to 1, the classifier is considered “perfect” for the classification problem.

Table 5.3 – Classification results evaluated with K-fold validation on training set.

| Algorithm | K-star | RandomForest | BayesNet |
|---|---|---|---|
| True Positive Rate | 0.956 | 0.937 | 0.95 |
| False Positive Rate | 0.01 | 0.019 | 0.013 |
| Area Under Curve | 0.996 | 0.992 | 0.996 |
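The K-fold procedure described above can be illustrated with a minimal index-splitting sketch. The function name `k_fold_indices` is hypothetical, and the split is neither shuffled nor stratified by class, which a real evaluation (for instance the one performed through WEKA) would normally take care of.

```python
def k_fold_indices(n_items, k=5):
    """Yield (train_idx, test_idx) pairs; each item is tested exactly once."""
    # Item j goes to fold j mod k, so fold sizes differ by at most one.
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i, test_idx in enumerate(folds):
        # Training indices are all the folds except the one held out.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(160, k=5))
```

With 160 emails and K = 5, each round trains on 128 emails, i.e. (K − 1)/K of the dataset, and tests on the remaining 32.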

Table 5.3 reports the TPR, FPR and AUC of the three classifiers over the five classes for the K-fold test on the first dataset (160 spam emails). As shown, all the classifiers correctly classify more than 90% of the elements.

Table 5.4 – Classification results evaluated on test set.

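| Class | K-star TPR | K-star FPR | K-star AUC | RandomForest TPR | RandomForest FPR | RandomForest AUC | BayesNet TPR | BayesNet FPR | BayesNet AUC |
|---|---|---|---|---|---|---|---|---|---|
| Advertisement | 1.000 | 0.031 | 0.998 | 1.000 | 0.000 | 1.000 | 1.000 | 0.031 | 0.967 |
| Portal | 0.786 | 0.000 | 0.996 | 0.786 | 0.016 | 0.985 | 0.929 | 0.000 | 0.998 |
| Fraud | 1.000 | 0.016 | 0.992 | 1.000 | 0.016 | 0.951 | 1.000 | 0.016 | 0.928 |
| Malware | 0.938 | 0.016 | 0.995 | 0.938 | 0.016 | 0.908 | 0.938 | 0.016 | 0.957 |
| Phishing | 0.947 | 0.017 | 0.977 | 0.947 | 0.051 | 0.963 | 0.842 | 0.017 | 0.907 |
| Average | 0.9342 | 0.016 | 0.9916 | 0.9342 | 0.019 | 0.9614 | 0.9418 | 0.016 | 0.9514 |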

Afterward, the whole first dataset has been used to train the three classifiers, whilst the second dataset has been used as test set. Table 5.4 reports the detailed classification results, where the classifiers are trained on the training set (160 spam emails) and evaluated on the test set (80 spam emails). The results are reported for the five classes in terms of TPR, FPR and AUC.
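Per-class TPR and FPR values such as those in Table 5.4 are commonly obtained by treating one class as positive and all the others as negative. The sketch below illustrates this one-vs-rest computation; the function name `one_vs_rest_rates` is hypothetical, and whether WEKA computes the reported figures in exactly this way is an assumption.

```python
def one_vs_rest_rates(true_labels, predicted_labels, target):
    """Return (TPR, FPR) for `target` treated as the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == target and p == target for t, p in pairs)  # true positives
    fn = sum(t == target and p != target for t, p in pairs)  # false negatives
    fp = sum(t != target and p == target for t, p in pairs)  # false positives
    tn = sum(t != target and p != target for t, p in pairs)  # true negatives
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr


true = ["portal"] * 4 + ["fraud"] * 4
pred = ["portal", "portal", "portal", "fraud", "fraud", "fraud", "fraud", "portal"]
portal_rates = one_vs_rest_rates(true, pred, "portal")
```

In this toy example three of the four portal emails are recognized and one of the four fraud emails is wrongly labeled portal, so the portal class has TPR 0.75 and FPR 0.25.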

For a further insight, we report in Figures 5.9, 5.10, 5.11, 5.12, and 5.13 the comparison of the ROC curves of the three classifiers for the five classes, measured on the test set.

It is worth noting that in all cases the area under the ROC curve is close to 1; hence, in general, the classifiers show good performance on the test set for each class.

As can be observed in Table 5.3, on average the K-star and Bayes Net classifiers give slightly better K-fold results. Moreover, the K-star classifier yields the best average AUC on the test set (Table 5.4). Therefore, K-star is the classifier we adopt in the DWS framework.

Figure 5.9 – ROC curve / Advertisement

Figure 5.10 – ROC curve / Portal Redirection

Figure 5.11 – ROC curve / Fraud

Figure 5.12 – ROC curve / Malware

Figure 5.13 – ROC curve / Phishing

5.4.2 DWS Application

The second set of experiments aims at assessing the capability of the framework to cluster and label large amounts of spam emails. To this end, the DWS framework has been tested on a set of 3230 recent spam emails, extracted from the honeypot collection for the first week of March 2015. The emails have been manually analyzed and labeled for performance analysis.

Phase 1 : Clustering with CCTree

In the first step, CCTree has been used to divide the emails into campaigns. The CCTree parameters have been chosen by finding the optimal values for the number of generated clusters and homogeneity, using the knee method described in Chapter 4. Applying CCTree, 135 clusters have been generated, of which 73 contain only one element. Clusters with a single element have not been considered: these emails are, in fact, outliers that do not belong to any spam campaign. The remaining 3149 emails, divided into 62 clusters, have been used in the following steps.

Phase 2 : Training Set Generation

To generate the training set we used a small dataset made of three representative emails for each of the five classes. These 15 emails have been manually selected from different datasets of real spam emails, including the personal spam inboxes of the authors. To facilitate the manual analysis of the classified spam emails, the 15 emails of the set C are written in English. Each email has been assigned to one of the 62 spam campaigns, following the CCTree structure, as described in Section 5.3.3. The campaigns associated with the emails are used as training set.

Table 5.5 – Training set generated from small knowledge.

| Class | Number of Emails | Number of Campaigns |
|---|---|---|
| Advert. | 29 | 2 |
| Portal | 66 | 3 |
| Fraud | 113 | 3 |
| Malware | 27 | 1 |
| Phishing | 17 | 1 |
| Total | 252 | 10 |

The generated training set (Table 5.5) is composed of 252 emails, contained in 10 campaigns. It is worth noting that the 15 emails have not been added to the associated clusters after the CCTree classification, so as not to alter the decision on the following emails.

Phase 3 : Labeling Spam Campaigns

After training the classifier with the generated training set, we label the remaining (52 out of 62) unlabeled spam campaigns of CCTree. The classification results are reported in Table 5.6.

Table 5.6 – DWS classification results for the labeled spam campaigns.

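| Class | Campaigns correct | Campaigns wrong | Emails correct | Emails wrong | TPR | FPR | Accuracy |
|---|---|---|---|---|---|---|---|
| Advert. | 5 | 0 | 137 | 0 | 1 | 0 | 1 |
| Portal | 26 | 0 | 1331 | 0 | 1 | 0.03 | 0.9935 |
| Fraud | 10 | 2 | 1032 | 43 | 0.96 | 0.01 | 0.9788 |
| Malware | 3 | 0 | 31 | 0 | 1 | 1 | 1 |
| Phishing | 7 | 1 | 213 | 18 | 0.915 | 0 | 0.994 |
| Total | 51 | 3 | 2744 | 61 | 0.975 | 0.008 | 0.9782 |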

The table reports, for each class, the number of campaigns and corresponding emails classified correctly or incorrectly. Moreover, we report for the emails the statistics on TPR, FPR and Accuracy (i.e., the ratio of correctly classified elements). The global accuracy (last row of the table) is 97.82%. However, we point out that, due to the conditions on predicted error reported in Subsection 5.3.3, 8 campaigns out of 62, containing 344 spam emails, are considered unclassified. For the sake of accuracy, if these 8 campaigns are counted as misclassified, the total accuracy for emails on the dataset is 87.14%. This accuracy is in line with previous works on classifying emails into phishing and ham [34], [50], [17].
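Both accuracy figures follow directly from the email totals of Table 5.6 and the 344 emails of the unclassified campaigns, as the short check below shows. The variable names are chosen for illustration only.

```python
# Totals taken from the last row of Table 5.6.
correct, wrong = 2744, 61
# Emails contained in the 8 campaigns considered unclassified.
unclassified = 344

# Accuracy over the classified emails only (about 0.978).
accuracy_classified = correct / (correct + wrong)

# Accuracy when the unclassified emails are counted as errors (about 0.871).
accuracy_overall = correct / (correct + wrong + unclassified)
```

The denominator of the second ratio is 2744 + 61 + 344 = 3149, i.e. all the emails retained after clustering.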

Concerning the 8 non classified campaigns, 3 campaigns containing 68 spam emails were correctly labeled as portal. However, they are considered unclassified since the average predicted error is higher than 30% in all 3 campaigns. 4 campaigns containing 258 spam emails belong to the phishing class. 2 of them, with 116 messages, were correctly identified but did not satisfy the predicted error condition. The other 2 campaigns were incorrectly classified as fraud; they, too, are considered unclassified due to high predicted error. The last campaign, with 18 elements, belongs to the advertisement class but was incorrectly classified as fraud; again, the predicted error condition is not satisfied. It is worth noting how the condition on predicted error helps increase the overall accuracy on classified data.

From Table 5.6 it is possible to infer that a large portion of spam messages belongs to the portal and fraud classes. Even if these preliminary results are related to a relatively small dataset, they are indicative of the current trend in spam email distribution, which may provide the spammer with the greatest result at the smallest risk.

5.5 Ranking Spam Campaigns

Because the number of spam emails collected daily is enormous, even after clustering spam emails into smaller groups of similar messages (spam campaigns), a methodology is still required to automatically order spam campaigns according to investigator priorities. To this end, in this section we provide several features (including the label of the campaign) and feature weights used to attribute a grade to each spam campaign. The campaigns are then ordered by grade. More features can be added to the provided set depending on the case study.

Ranking spam campaigns helps the investigator decide which set of spam messages, assigned to a specific spammer, needs to be analyzed and prosecuted first. Furthermore, if the investigator pursues a specific goal, for example dangerous spam campaigns targeting Canada, our proposed ranking methodology can be applied.

5.5.1 Ranking Features

In this section, we propose a set of ranking features, containing five features, to order spam campaigns. The ranking features are presented in Table 5.7. Afterwards, we explain in detail what each ranking feature means and how it is normalized to the interval [0,1].

- Number of Data belonging to a Spam Campaign (N)
- Domain of URLs (U)
- Language of spam message (L)
- Burst Property, Analysis of Distribution of Data in a period of Time (B)
- Class (Label) of Campaigns (C)

Table 5.7 – Set of ranking features

Number of Data (N): The number of data in a campaign is the number of spam emails belonging to that campaign. It is normalized by the number of elements in the largest campaign. More precisely, suppose the campaign with the maximum number of elements contains $n_{\max}$ spam emails. The number of data of the i-th campaign, containing $n_i$ elements, is normalized as $N_i = n_i / n_{\max}$. Hence, $N_i \in [0, 1]$.

URL Domain of Campaign (U): The URL domain of a spam message is a boolean feature, which equals 1 if one of the desired domains occurs among the URLs in the body of the message, and 0 otherwise. The URL domain of a spam campaign is the proportion of spam messages in the campaign whose URL domain equals 1. For example, suppose the investigator is interested in emails oriented to Canada. In this case, the appearance of URLs with domains like “.ca” in the body of the messages of a campaign makes it more interesting than other campaigns. To this end, a set of interesting domains is provided as X = {X1, . . . , Xk}; for each message in the spam campaign containing one of the interesting domains, the URL domain of the message equals 1. The URL domain of the spam campaign equals the number of messages for which the result is 1, divided by the total number of messages in the campaign. By definition, the feature U is normalized to [0,1].
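The URL domain feature can be sketched as follows. The function name `url_domain_feature`, the representation of a campaign as one list of URLs per message, and the matching of interesting domains as host-name suffixes (such as ".ca") are assumptions made for illustration. The language feature L can be computed in the same way, replacing the per-message test with a check on the message language.

```python
from urllib.parse import urlsplit


def url_domain_feature(campaign_urls, interesting_suffixes):
    """campaign_urls: one list of URLs per message in the campaign.

    Returns U, the fraction of messages with at least one link whose host
    ends with an interesting suffix such as ".ca".
    """
    if not campaign_urls:
        return 0.0

    def hit(urls):
        # Boolean per-message feature: 1 if any link matches, 0 otherwise.
        hosts = [(urlsplit(u).hostname or "") for u in urls]
        return any(h.endswith(s) for h in hosts for s in interesting_suffixes)

    return sum(hit(urls) for urls in campaign_urls) / len(campaign_urls)


campaign = [
    ["http://bank.example.ca/login"],
    ["http://example.com/a", "http://shop.example.ca/"],
    ["http://example.org/"],
    [],
]
u_value = url_domain_feature(campaign, [".ca"])
```

In this toy campaign two of the four messages contain a ".ca" link, so U = 0.5.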

Language (L): The language of the message is another criterion that helps an investigator interested in spam campaigns oriented to a specific region, e.g. Canada. To this end, a set of desired languages is provided; e.g. for Canada the set may contain English and French. The language of a message equals 1 if it is written in one of the desired languages, and 0 otherwise. The language of a campaign (L) is the proportion of messages whose language equals 1. By definition, the criterion L is normalized to [0,1].

Burst Property (B): A spam campaign whose number of spam messages decreases as time passes is less dangerous than one whose number of spam emails is increasing. We call this growth in the number of elements of a spam campaign the burst property of the campaign. To compute it, we divide the time span between the first and the last email of the campaign (in terms of time) into two parts. If the number of emails in the second part is greater than in the first part, we say that the spam campaign respects the burst property and we set B to 1; otherwise, it does not respect the burst property and we set B to 0.
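A minimal sketch of the burst property is given below. It assumes that the time span is split into two equal halves and that an email falling exactly on the midpoint is counted in the first half; neither detail is fixed by the definition above, and the function name `burst_property` is hypothetical.

```python
from datetime import datetime


def burst_property(timestamps):
    """Return 1 if more emails fall in the second half of the campaign's time span."""
    if len(timestamps) < 2:
        return 0
    first, last = min(timestamps), max(timestamps)
    # Assumption: the span is split at its midpoint into two equal parts.
    midpoint = first + (last - first) / 2
    second_half = sum(t > midpoint for t in timestamps)
    first_half = len(timestamps) - second_half
    return 1 if second_half > first_half else 0


growing = [datetime(2015, 3, 1), datetime(2015, 3, 5),
           datetime(2015, 3, 6), datetime(2015, 3, 7)]
fading = [datetime(2015, 3, 1), datetime(2015, 3, 2),
          datetime(2015, 3, 3), datetime(2015, 3, 7)]
```

Here `growing` has three of its four emails after the midpoint (4 March) and respects the burst property, while `fading` has only one and does not.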

Class (label) of Campaign (C): The label of the campaign (C) is returned by the DWS framework. Here, we propose an approach to attribute a score to each label. Note that the proposed score for each label can be modified according to the investigator’s priorities. Phishing spam messages are the most dangerous kind of spam emails, stealing the victim’s important information through a very well presented format. After phishing, malware spam emails are the most dangerous ones, in the sense that the end user’s computer is mostly affected without his awareness. Fraud emails, while dangerous enough, are less dangerous than phishing and malware. The reason is that fraud spam messages mostly reach their goal through several rounds of communication, and during this interaction the victim may become aware of the risk of continuing the communication, or a filtering service may stop it before the requested money is transferred. Portal emails, mostly not well presented, are generally recognized by the user as spam and are hence not as dangerous as the previous groups. Finally, advertisement spam emails, which mostly propose a real product, are the least dangerous spam campaigns. Considering that campaigns with an unknown label are not regarded as really dangerous, we score the phishing, malware, fraud, portal, advertisement and unknown campaigns as 6, 5, 4, 3, 2, 1, respectively. The score of the campaign label is normalized by dividing each score by 6 (Table 5.8).

Table 5.8 – Normalized score of spam campaign labels

Label             Phishing  Malware  Fraud  Portal  Advert.  Unknown
Normalized score  1         0.83     0.66   0.5     0.33     0.16

5.5.2 Spam Campaign Grade

To attribute a grade to each spam campaign after extracting its ranking features, a weight must be provided for each ranking feature. The weight of a feature is set by an expert and may vary from one case to another. The weights of the features should be normalized to [0, 1], which can simply be achieved by dividing each weight by the sum of the weights. The weights express the importance of each feature in ranking spam campaigns.

We define the grade of a campaign C, written grade(C), as follows:

grade(C) = ω1 · C + ω2 · N + ω3 · U + ω4 · L + ω5 · B

where C, N, U, L and B are the extracted ranking features of campaign C, and ωi ∈ [0, 1] for 1 ≤ i ≤ 5. Since each feature lies in [0, 1] and the normalized weights sum to 1, it follows from the definition that grade(C) ∈ [0, 1].

5.5.3 Ranking Application

In this section, we propose an approach to order a set of spam campaigns according to their grades. To this end, we first present a simple ranking methodology, named dense ranking. In dense ranking, objects having the same score receive the same rank. Afterwards, we explain the experiment of ranking the spam campaigns resulting from Section 5.4.2.

Table 5.9 – Three first ranked campaigns

            Number of Data  URL Domain  Language  Burst  Label  Grade
Campaign 1  1               0.96        0.98      1      0.5    0.88
Campaign 2  0.78            0.86        0.97      1      0.66   0.854
Campaign 3  0.15            0.91        1         1      1      0.812

Definition 5.1 (Dense Ranking ("1223" ranking)). In dense ranking, objects having the same score receive the same ranking number, and the next object(s) receive the immediately following ranking number. Hence, the ranking number of each object is 1 plus the number of objects ranked above it that are distinct with respect to the ranking order. For example, if A ranks ahead of B and C, where B and C rank equal, and both rank ahead of D, then A gets ranking number 1, B and C each get ranking number 2, and finally D gets ranking number 3, i.e. A = 1, B = 2, C = 2, D = 3.

To apply dense ranking to a set of spam campaigns, it is enough to first compute the grade of each spam campaign. Afterwards, the campaigns are ranked according to their grades: the greater the grade, the lower the ranking number.
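Dense ranking by grade can be sketched in a few lines. The function name `dense_rank` and the sample grades below are hypothetical and only mirror the A, B, C, D example of Definition 5.1.

```python
def dense_rank(grades):
    """Map each name to its dense rank; a higher grade gets a smaller rank."""
    # Distinct grades, best first; ties collapse into a single rank.
    distinct = sorted(set(grades.values()), reverse=True)
    rank_of = {g: i + 1 for i, g in enumerate(distinct)}
    return {name: rank_of[g] for name, g in grades.items()}

# Hypothetical grades: B and C tie, so they share rank 2 and D gets rank 3.
ranks = dense_rank({"A": 0.9, "B": 0.7, "C": 0.7, "D": 0.4})
print(ranks)
```

Ranking over the set of distinct grades, rather than over the campaigns themselves, is what keeps the rank numbers consecutive after a tie.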

To order the 62 spam campaigns labeled in Section 5.4.2, we first extract, for each campaign, the four other ranking features explained in Section 5.5.1.

Concerning the features U and L, we consider the sets of interest to be the domains {.ca, .com} and the languages {English, French}, respectively. Using an equal weight for each feature, i.e. ωi = 0.2 for 1 ≤ i ≤ 5, we calculate the grade of each campaign. The largest of the 62 campaigns is a portal campaign containing 407 spam emails. Hence, the number of elements of each campaign is normalized by dividing it by 407.

In Table 5.9, we report the properties of the first three ranked campaigns, where the grade of each campaign is calculated as follows:

grade(campaign1) = 0.2 · (1 + 0.96 + 0.98 + 1 + 0.5) = 0.888 (truncated to 0.88 in Table 5.9)
grade(campaign2) = 0.2 · (0.78 + 0.86 + 0.97 + 1 + 0.66) = 0.854
grade(campaign3) = 0.2 · (0.15 + 0.91 + 1 + 1 + 1) = 0.812
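The arithmetic above can be reproduced with a short script. This is only a sketch: the function name `grade` and the dictionary layout are assumptions, while the feature values are copied from Table 5.9 and the equal weights from the text.

```python
def grade(features, weights):
    """Weighted sum of ranking features; weights are normalized to sum to 1."""
    total = sum(weights)
    return sum((w / total) * x for w, x in zip(weights, features))

# Feature values copied from Table 5.9, in the column order of the table.
campaigns = {
    "Campaign 1": (1.0, 0.96, 0.98, 1.0, 0.5),
    "Campaign 2": (0.78, 0.86, 0.97, 1.0, 0.66),
    "Campaign 3": (0.15, 0.91, 1.0, 1.0, 1.0),
}
weights = (0.2, 0.2, 0.2, 0.2, 0.2)  # equal weights, as in the text

grades = {name: round(grade(f, weights), 3) for name, f in campaigns.items()}
print(grades)
```

With these inputs the first campaign evaluates to 0.888, which Table 5.9 lists as 0.88.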

The set of first-ranked campaigns is the set of campaigns that the investigators need to analyze and follow first. The process is performed automatically; hence, vital information is provided in a short period of time, which is almost impossible to achieve by considering a huge amount of spam emails as a whole.

5.6 Conclusion

Spam emails constitute a constant threat to both companies and private users. Not only are these emails unwanted, occupying storage space and requiring time to be deleted, but they have also become vectors of security threats and are used to perform cybercrimes such as phishing and malware distribution. In this chapter, we have presented a framework, named DWS, for the analysis of large amounts of spam emails collected through honeypots. We argue that DWS can provide a helpful tool for police and investigators in the forensic analysis of spam emails. In fact, DWS automatically clusters and classifies large amounts of spam emails into labeled campaigns, to eventually help the investigator focus on the campaigns related to a specific cybercrime and filter out the uninteresting spam emails. Moreover, DWS is self-learning and does not require any preexisting knowledge of the dataset to be analyzed. Instead, a small set of data, named small knowledge, is provided. To update the small knowledge, the investigators can add newly discovered templates to the previous set of small knowledge. Preliminary tests performed on a first dataset of more than 3200 emails showed a good accuracy of the DWS framework. Furthermore, a ranking methodology is proposed to order a set of spam campaigns based on the investigator's priorities. The first-ranked campaigns are the ones which should be analyzed first.

Chapter 6

Algebraic Formalization of CCTree

Although clustering is one of the most common approaches in unsupervised data analysis, very little literature exists on the formalization of clustering algorithms. In this chapter we propose a semiring-based methodology, named Feature-Cluster Algebra, which abstracts the representation of a labeled tree structure representing a hierarchical categorical clustering algorithm, named CCTree ([127]). Through several theorems and examples we show that the abstract schema fully abstracts the tree structure. Full abstraction provides the interesting property that an algebraic term and a tree structure can be used in place of each other when needed. This means that it is possible to use well-established concepts on the algebraic form of the clustering algorithm to get the equivalent result in the semantic form. We apply the abstract schema of CCTree to formalize CCTree parallelism with the use of rewriting systems. To this end, a set of functions and relations is defined on feature-cluster algebra. Then, we first propose a rewriting system to automatically identify whether a term represents a CCTree term or not. Afterwards, a rewriting system is proposed to automatically get a final CCTree term from the addition of two (or more) CCTree terms. The final CCTree term is used to homogenize the structure of all CCTrees on parallel devices.

6.1 Introduction

Clustering is a very well-known tool in unsupervised data analysis, which has been the focus of significant research in different domains of computer security, spanning from intrusion detection [145] and spam campaign detection, as explained in previous chapters, to clustering Android malware [121]. The problem of clustering becomes more challenging when data are described in terms of categorical attributes, for which, differently from numerical attributes, it is hard to establish an ordering relationship [6]. The difficulty arises from the fact that the similarity of elements cannot be computed with the use of well-known geometric distances, e.g. the Euclidean distance. In categorical clustering, each attribute contains a domain of discrete, mutually exclusive features, where each feature represents a value of an element. For example, the attribute color may contain the features red and blue.

Clustering algorithms are widely applied to real-world problems, including security problems; in the present thesis they have already been applied to spam campaign detection. Nevertheless, very few works exist that express and solve the problems of clustering algorithms in terms of formal methods. Formal methods are mathematically based languages, techniques, and tools to specify general rules on a system, where the desired properties of the system can be verified easily on the basis of the identified rules [37].

In the present work, we argue that using formal methods in CCTree, as a specific form of categorical clustering algorithm, provides an abstract representation of clusters, which facilitates the analysis of cluster properties while avoiding the need to confront the large amount of data in each cluster. The proposed formal scheme is used to formalize a challenging task in categorical clustering algorithms, named parallel clustering.

CCTree (Categorical Clustering Tree) has a decision tree-like structure, which iteratively divides the data of a node on the basis of an attribute, or domain of features, yielding the greatest entropy. The division of data is shown with edges going from a parent node to its children, where the edges are labeled with the associated features. A node respecting the identified stop conditions is considered a leaf. The leaves of the tree are the desired clusters. Since the features, i.e. the edge labels, are central to CCTree construction, a CCTree has a feature-based structure.

Feature algebra [74] is a semiring-based formal method proposed to formalize feature-based product lines, e.g. software products. We import the idea of feature algebra to formalize the feature-based CCTree structure and call our proposed semiring-based algebraic structure "Feature-Cluster Algebra". The notion of feature-cluster algebra is used to abstract the CCTree representation as a term. The CCTree term is applied to formalize CCTree parallelism on the basis of rewriting systems. Parallel clustering is a methodology proposed to alleviate the problem of time and memory usage in clustering large datasets [42].

The contributions of this chapter can be summarized as follows:

- A semiring-based formal method, named Feature-Cluster Algebra, is proposed to abstract the representation of a categorical clustering algorithm, named CCTree ([127]). Abstraction theory is a mathematical concept which constructs a brief sketch of the original representation of a problem in order to deal with it more easily. More precisely, abstraction is the process of mapping a representation of a problem, called the ground (semantic) representation, onto a new representation, called the abstract (syntax) representation, such that the problem in the original space can be handled while certain desirable properties are preserved, and in a simpler way, since the abstract representation is constructed from the ground representation by removing unwanted detail [59].
- Through several theorems and examples, we show that the proposed approach fully abstracts the CCTree representation under some conditions. Full abstraction is a property of an abstract mapping which guarantees that the ground (semantic) representation and the abstract (syntax) representation of a problem can be used interchangeably.
- A rewriting system is proposed to automatically verify whether a term is a CCTree term or not. A rewriting system is a set of directed equations on a set of objects. The objects in a rewriting system are usually called terms and the directed equations are called rewriting rules. The rewriting rules are applied to compute new terms by repeatedly replacing subterms of a given term until the simplest possible form is obtained. A rewriting system thus automatically produces a desired final term, provided the rewriting rules are correctly specified [43].
- The abstract form of the CCTree is applied to formalize the process of parallelizing CCTree clustering on parallel computers with the use of a rewriting system. The proposed rewriting system contains a set of rewriting rules which lead from a non-CCTree term to a CCTree term, representing a CCTree into which all CCTrees on the parallel devices can be merged.
- We prove that the proposed rewriting systems are confluent. Termination and confluence are two important properties of a rewriting system. Termination guarantees that the system does not contain a loop of rules, which would cause the process of applying the rewriting rules never to terminate. Confluence guarantees that applying the rewriting rules to a given term results in a unique term.

This chapter is organized as follows. In Section 6.2, we present a review of the literature on formalization methods applied to feature-based problems. In Section 6.3, the process of transforming a CCTree into its equivalent algebraic expression is explained in terms of semirings. In Section 6.4, we show that the proposed algebraic structure fully abstracts the tree representation. The relations on feature-cluster algebra are introduced in Section 6.5. In Section 6.6, we apply the abstract CCTree representation to formalize CCTree parallel clustering in terms of rewriting systems. We conclude and point to future directions in Section 6.7.

6.2 Related work

Feature models are information models in which a set of products, e.g. software products or DVD player products, is represented as a hierarchical arrangement of features, with different relationships among the features [15]. Feature models are used in many applications because they are able to model complex systems, are interpretable, and can handle both ordered and unordered features [105]. The authors of [15] argue that designing a family of software systems in terms of features makes it easier for all stakeholders to understand than other forms of representation. The representation of feature models as a tree of features was first introduced by [82], to be used in software product lines. Some studies [31], [32] show that tree models combined with ensemble techniques lead to accurate performance on a variety of domains. In a feature model tree, differently from CCTree, the root is the desired product, the nodes are the features, and different representations of the edges indicate the mandatory or optional presence of features. The authors of [73], [74] were the first to apply idempotent semirings as the basis for the formalization of tree models of products, and called the result feature algebra. The concept of semiring is used to answer the needs of product families: abstract forms of expression, refinement, multi-view reconciliation, product development, and classification. The elements of the semiring in the proposed methodology are sets of products, or product families. To the best of our knowledge, we are the first to apply an algebraic structure to abstract the representation of a categorical clustering algorithm and to formalize the associated issues.

6.3 Feature-Cluster Algebra

In this section, we introduce our proposed semiring-based formal method, named feature-cluster algebra, to abstract the CCTree representation. To this end, we first explain precisely what a semiring is. Then, the process of transforming a tree structure into its equivalent term is presented.

6.3.1 Semiring

In abstract algebra, the term algebraic structure generally refers to a set of elements together with one or more finitary operations respecting specified properties [68]. In particular, a semiring is an algebraic structure containing two binary operations on a set of elements. More precisely, a semiring is defined as follows.

Definition 6.1 (Semiring). A semiring is a set S with two binary operations “+” and “·”, called addition and multiplication, respectively, such that (S, +) is a commutative monoid with identity element 0, and (S, ·) is a monoid with identity element 1. Multiplication distributes over addition from the left and from the right, and multiplication by 0 annihilates the elements of S. A semiring for which multiplication is commutative is called a commutative semiring [68]. More precisely, S equipped with two binary operations “+” and “·”, such that 0 and 1 are the identity elements of “+” and “·”, respectively, is a semiring if, for all a, b, c ∈ S, the following laws are satisfied:

(a + b) + c = a + (b + c)
0 + a = a + 0 = a
a + b = b + a
(a · b) · c = a · (b · c)
1 · a = a · 1 = a
a · (b + c) = (a · b) + (a · c)
(a + b) · c = (a · c) + (b · c)
0 · a = a · 0 = 0

Briefly, we write that (S, +, ·, 0, 1) is a semiring. A semiring (S, +, ·, 0, 1) is called an idempotent semiring if, for any a ∈ S, we have:

a + a = a

Semiring of Features

Let a set of disjoint sorts, denoted as A, be given, where the carrier set of each sort Ai ∈ A is denoted by VAi. In our context, we call the given set of sorts the set of attributes, and we call the union of the carrier sets of the sorts, denoted as V = ⋃Ai∈A VAi, the set of values or features.

Example 6.1. We may consider the set of attributes as A = {color, size}, where the carrier set of each attribute can be considered as Vcolor = {red, blue} and Vsize = {small, large}. In this case, we have V = {red, blue, small, large}.

Definition 6.2 (Sort). We define the sort function, which receives a set of features and returns the union of the carrier sets (sorts) associated with the received features, as follows:

sort : P(V) → P(V)

sort({f}) = VA for f ∈ VA

sort(V1 ∪ V2) = sort(V1) ∪ sort(V2)

Example 6.2. In the following, we apply the sort function to sets of features from Example 6.1:

sort({red}) = {red, blue}
sort({red, small}) = sort({red}) ∪ sort({small}) = {red, blue, small, large}
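The sort function can be sketched directly over the attributes of Example 6.1. The names `CARRIERS` and `sort_of` are hypothetical and chosen only for this illustration.

```python
# Carrier sets of Example 6.1, keyed by attribute name.
CARRIERS = {
    "color": {"red", "blue"},
    "size": {"small", "large"},
}

def sort_of(features):
    """Union of the carrier sets of the attributes the features belong to."""
    result = set()
    for f in features:
        for carrier in CARRIERS.values():
            if f in carrier:
                result |= carrier
    return result

print(sort_of({"red"}))
print(sort_of({"red", "small"}))
```

Because the sorts are disjoint, each feature contributes exactly one carrier set, so the loop over carriers never adds more than one set per feature.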

Let F = P(P(V)) be the power set of the power set of V, and let us denote 1 = {∅} and 0 = ∅. We define the operations “+” and “·” as the choice and composition operators on F as follows:

· : F × F → F

F1 · F2 = {X ∪ Y : X ∈ F1 ,Y ∈ F2}

+ : F × F → F

F1 + F2 = F1 ∪ F2

We say that F belongs to the power set of features F if it respects one of the following syntax forms:

F := 0 | {{f}} | F · F | F + F | 1 (6.2)

where f ∈ V.

Example 6.3. In the following, some elements of F on V = {red, blue, small, large} are presented :

F1 = {{red, large}, {blue}}

F2 = {{small}}

F1 · F2 = {{red, large, small}, {blue, small}}

F1 + F2 = {{red, large}, {blue}, {small}}

In the problem of formalizing categorical clustering, the set {{red, large}, {blue}} may represent two clusters, where the elements of the cluster {red, large} have the features red and large in common, and the elements of the cluster {blue} are all blue. This means that we use addition to separate clusters, and multiplication to take more features into account in identifying a cluster. Clearly, not every combination of sets of features necessarily represents a clustering.
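The choice and composition operators translate naturally into operations on sets of sets. The sketch below reproduces Example 6.3; the helper names `fset`, `compose` and `choice` are assumptions, and frozensets are used only because Python sets cannot contain mutable sets.

```python
def fset(*features):
    """One set of features, e.g. {red, large}."""
    return frozenset(features)

ZERO = frozenset()               # 0 = ∅
ONE = frozenset({frozenset()})   # 1 = {∅}

def compose(f1, f2):
    """The "·" operator: pairwise union of the member sets."""
    return frozenset(x | y for x in f1 for y in f2)

def choice(f1, f2):
    """The "+" operator: union of the two families."""
    return f1 | f2

F1 = frozenset({fset("red", "large"), fset("blue")})
F2 = frozenset({fset("small")})

composed = compose(F1, F2)   # {{red, large, small}, {blue, small}}
chosen = choice(F1, F2)      # {{red, large}, {blue}, {small}}
print(len(composed), len(chosen))
```

Composing with `ONE` returns the argument unchanged and composing with `ZERO` returns `ZERO`, matching the roles of 1 and 0 in the definition.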

Proposition 6.4. It is easy to verify that the two operations “+” and “·” respect the following properties for every F1, F2, F3 ∈ F:

(F1 + F2) + F3 = F1 + (F2 + F3) (6.3)

F1 + F2 = F2 + F1 (6.4)

F1 · F2 = F2 · F1 (6.5)

(F1 · F2) · F3 = F1 · (F2 · F3) (6.6)

F1 · (F2 + F3) = (F1 · F2) + (F1 · F3) (6.7)

(F1 + F2) · F3 = (F1 · F3) + (F2 · F3) (6.8)

1 · F1 = F1 · 1 = F1 (6.9)

0 · F1 = F1 · 0 = 0 (6.10)

0 + F1 = F1 + 0 = F1 (6.11)

F1 + F1 = F1 (6.12)

Theorem 6.5. The quintuple (F, +, ·, 0, 1) constitutes an idempotent commutative semiring.

Proof. The proof follows directly from Proposition 6.4.
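The laws of Proposition 6.4 can also be checked exhaustively on a small instance. This does not replace the proof; it is a sanity check under the assumption of a two-feature set V, for which P(P(V)) has only 16 elements. All names in the sketch are hypothetical.

```python
from itertools import chain, combinations, product

def powerset(items):
    """All subsets of `items`, each as a frozenset."""
    items = list(items)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))
    return [frozenset(s) for s in subsets]

V = {"red", "small"}            # small illustrative feature set (assumption)
F = powerset(powerset(V))       # P(P(V)): 16 elements for |V| = 2
ZERO = frozenset()              # 0 = ∅
ONE = frozenset({frozenset()})  # 1 = {∅}

def add(a, b):
    return a | b

def mul(a, b):
    return frozenset(x | y for x in a for y in b)

def semiring_laws_hold():
    """Check identities, annihilation, idempotence, associativity,
    commutativity and left distributivity over all of F."""
    for a in F:
        if add(ZERO, a) != a or mul(ONE, a) != a or mul(ZERO, a) != ZERO:
            return False
        if add(a, a) != a:
            return False
    for a, b, c in product(F, repeat=3):
        if add(add(a, b), c) != add(a, add(b, c)):
            return False
        if mul(mul(a, b), c) != mul(a, mul(b, c)):
            return False
        if mul(a, add(b, c)) != add(mul(a, b), mul(a, c)):
            return False
        if add(a, b) != add(b, a) or mul(a, b) != mul(b, a):
            return False
    return True

print(semiring_laws_hold())
```

Right distributivity is not tested separately because it follows from left distributivity once multiplication is known to be commutative.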

Definition 6.3. Let |.| return the number of elements in a set. We say that F ∈ F belongs to the set Fn if |F| = n. Under this definition, F1, i.e. the subset of F in which each element contains just one set of features, is the one of interest for our problem. In this case, for F ∈ F1, we remove the brackets and separate the features belonging to the same set by multiplication. Hence, we consider F ∈ F1 if it can be written in one of the syntax forms 0 | f | F1 · F2 | 1, for f ∈ V.

Note that when two elements of F1 are added or multiplied, they follow the same properties as in the main semiring defined on F. In the following example, we show how this simpler representation is used in the rest of the chapter.

Example 6.6. We simplify the elements of Example 6.3 according to Definition 6.3, as the following :

F1 = {{red, large}, {blue}} = {{red, large}} + {{blue}} = red · large + blue

F2 = {{small}} = small

F1 · F2 = {{red, large, small}, {blue, small}} = {{red, large, small}} + {{blue, small}} = red · large · small + blue · small

F1 + F2 = {{red, large}, {blue}, {small}} = {{red, large}} + {{blue}} + {{small}} = red · large + blue + small

The semiring of features can be used to represent different feature-based clustering algorithms. In our context, since we plan to address parallel clustering, we also need to discuss the different datasets that the clusters originate from. To this end, in the upcoming subsection we present the semiring of elements.

Semiring of Elements

Let us consider that the set of sorts, or the set of attributes A with an order among the attributes, is given. Suppose |A| = k, and without loss of generality let A1, A2, ..., Ak be the ordered sorts which range over A. We say that s belongs to the set of elements S if s ∈ VA1 × VA2 × ... × VAk × N, where the carriers of the attributes are arbitrarily ordered (and then fixed) for each problem, and N is the set of natural numbers. Hence, S ⊆ VA1 × VA2 × ... × VAk × N, i.e. s ∈ S can be written as s = (x1, x2, ··· , xk, n), where xi ∈ VAi for 1 ≤ i ≤ k, and n ∈ N is a natural number representing the ID of an element. For the sake of simplicity, we may use the alternative representation xi ∈ Ai instead of xi ∈ VAi. In our problem, S is the set of all elements to be clustered. Since the problem of parallel clustering involves different sets of elements to be clustered, we define a semiring on the power set of all elements. In this case, if we have for example two datasets of elements, say S1 and S2, then S = S1 ∪ S2.

Example 6.7. Suppose that in Example 6.3 the Cartesian product of the carriers of the attributes is “color × size”. Then the set of tuples S = {(red, small, 1), (blue, small, 2), (red, large, 3)} is a set of elements on V to be clustered in a specific problem.

We formally define two operations “+” and “·” as union and intersection of elements of P(S) (the power set of S), as the following :

· : P(S) × P(S) → P(S)

S1 · S2 = S1 ∩ S2

+ : P(S) × P(S) → P(S)

S1 + S2 = S1 ∪ S2

Formally, we say that S belongs to the set of elements, S ∈ P(S), if it respects one of the following forms:

S := ∅ | S′ | S + S | S · S | S (6.13)

where S′ ⊆ S.

Proposition 6.8. It is easy to verify that the operations “+” and “·” respect the following properties for every S1, S2, S3 ∈ P(S):

(S1 + S2) + S3 = S1 + (S2 + S3) (6.14)

∅ + S1 = S1 + ∅ = S1 (6.15)

S1 + S2 = S2 + S1 (6.16)

(S1 · S2) · S3 = S1 · (S2 · S3) (6.17)

S · S1 = S1 · S = S1 (6.18)

S1 · (S2 + S3) = (S1 · S2) + (S1 · S3) (6.19)

(S1 + S2) · S3 = (S1 · S3) + (S2 · S3) (6.20)

∅ · S1 = S1 · ∅ = ∅ (6.21)

S1 + S1 = S1 (6.22)

S1 · S1 = S1 (6.23)

S1 · S2 = S1 if S1 ⊆ S2 (6.24)

Theorem 6.9. The quintuple (S, +, ·, ∅, S) is an idempotent commutative semiring.

Proof. The proof follows directly from Proposition 6.8.

Note: The operations “+” and “·” are overloaded according to the kind of elements they are applied to. This means that when the operation “+” is used between two sets of elements, it refers to the addition operation of the semiring of elements, and when it is applied between two sets of features, it refers to the addition operation of the semiring of features. The same holds for the multiplication operation “·”.

Semiring of Terms

In Section 6.3.1 we introduced two semirings, on the set of features and on the set of elements, respectively. The reason underlying this choice is that, in our context, 1) categorical clusters are generally specified by a set of features, and 2) in formalizing parallel clustering we have several datasets, and it must be clearly specified which dataset of elements we refer to. In what follows, we construct the semiring of terms with the use of the previous semirings; it will be used to abstract the tree structure and to formalize parallel clustering. In the rest of the chapter, we use the same notions and symbols introduced in Section 6.3.1.

Recall that a cluster in CCTree can be uniquely identified by a set of elements respecting a set of features. We define the satisfaction relation to formally express the concept of cluster.

Definition 6.4 (Satisfaction Relation ⊨). Recall that when the elements of F contain just one set of features, we remove the brackets (Definition 6.3). We define the satisfaction relation, denoted by ⊨, as follows:

⊨ : F × P(S) → P(S)

⊨(f, {(x1, x2, ··· , xk, n)}) = {(x1, x2, ··· , xk, n)} if ∃i, 1 ≤ i ≤ k, s.t. xi = f

⊨(f, {(x1, x2, ··· , xk, n)}) = ∅ if ∄i, 1 ≤ i ≤ k, s.t. xi = f

⊨(f, S1 ∪ S2) = ⊨(f, S1) ∪ ⊨(f, S2)

⊨(F1 · F2, S) = ⊨(F1, S) ∩ ⊨(F2, S)

When ⊨(F, S) ≠ ∅, we say that S satisfies F. For the sake of simplicity, we use the alternative representation F ⊨ S instead of ⊨(F, S) when ⊨(F, S) ≠ ∅.

We consider that the multiplication “·” and the addition “+” over ⊨ respect the following properties:

(F1 ⊨ S1) · (F2 ⊨ S2) = (F1 · F2) ⊨ S2 if S1 · S2 = S2 (6.25)

(F1 ⊨ S1) + (F2 ⊨ S2) = (F1 + F2) ⊨ S2 if S1 + S2 = S2 (6.26)

where S1 · S2 = S2 means S2 ⊆ S1, and S1 + S2 = S2 means S1 ⊆ S2. In the case that neither set is a subset of the other, the multiplication and addition return the received elements unchanged. It should be noted that “·” and “+” are overloaded to their own definitions in the semiring of features and the semiring of elements when they are applied between two sets of features and two sets of elements, respectively.

Roughly speaking, these properties can be interpreted as follows. Multiplication yields the tuples in the intersection of two clusters built on two sets where one is a subset of the other; addition refers to the union of two clusters, where again one set is a subset of the other. In our context, property 6.25 is applied to express the division of a cluster into new, smaller clusters. In this case, each new smaller cluster satisfies the features of the main cluster plus additional, more restrictive features. Moreover, property 6.26 is used to obtain the simpler form of clusters according to Definition 6.3.
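Operationally, the satisfaction relation of Definition 6.4 acts as a filter that keeps the elements carrying every feature of a product of features. The sketch below illustrates this on the dataset of Example 6.10; the function name `satisfy` and the encoding of a product f1 · f2 as a Python set of features are assumptions of this illustration.

```python
def satisfy(features, elements):
    """Elements whose attribute values include every feature in `features`.

    Each element is a tuple (x1, ..., xk, n) whose last component is its ID,
    so only the first k components are compared against the features."""
    return {e for e in elements if all(f in e[:-1] for f in features)}

# Dataset of Example 6.10.
S = {("red", "small", 1), ("blue", "small", 2), ("red", "large", 3)}

red_only = satisfy({"red"}, {("red", "small", 1), ("blue", "small", 2)})
red_small = satisfy({"red", "small"}, S)
blue_large = satisfy({"blue", "large"}, S)
print(red_only, red_small, blue_large)
```

Filtering on a set of features is equivalent to intersecting the results obtained for each feature separately, which is the rule given for a product F1 · F2 in the definition.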

Example 6.10. Let the set of elements S = {(red, small, 1), (blue, small, 2), (red, large, 3)} on the set of features V = {red, blue, small, large} be given. The following examples represent different clusters on this dataset in terms of the satisfaction relation ⊨:

⊨(red, {(red, small, 1)}) = {(red, small, 1)}

⊨(red, {(blue, small, 2)}) = ∅

⊨(red, {(red, small, 1), (blue, small, 2)}) = ⊨(red, {(red, small, 1)}) ∪ ⊨(red, {(blue, small, 2)}) = {(red, small, 1)} ∪ ∅ = {(red, small, 1)}

⊨(red · small, {(red, small, 1)}) = ⊨(red, {(red, small, 1)}) ∩ ⊨(small, {(red, small, 1)}) = {(red, small, 1)} ∩ {(red, small, 1)} = {(red, small, 1)}

Proposition 6.11. For F1, F2 ∈ F and S ∈ P(S), the symbol “⊨” satisfies the following properties with respect to “+” and “·”:

(F1 · F2) ⊨ S = (F1 ⊨ S) · (F2 ⊨ S) (6.27)

(F1 + F2) ⊨ S = (F1 ⊨ S) + (F2 ⊨ S) (6.28)

Proof. The proof follows directly from properties 6.25 and 6.26, since we have S · S = S and S + S = S.

Actually, the equations 6.27 and 6.28 express how we can transform the different forms of F ∈ F to the form of F ∈ F1.

Example 6.12. The following equation shows how 6.27 and 6.28 transform a set of features into the form F ∈ F1 of Definition 6.3.

{{f1, f2}, {f3}} ⊨ S = {{f1, f2}} ⊨ S + {{f3}} ⊨ S = f1 · f2 ⊨ S + f3 ⊨ S

The form of F ∈ F1, is a particular desired representation of the set of features which will be used in our context. Hence, we attribute a specific name to it as what follows.

Definition 6.5 (Feature-Cluster (Family) Term). The set of feature-cluster family terms on V and S, denoted as FCV,S (or simply FC if it is clear from the context), is the smallest set containing the elements satisfying the following conditions:

if S ⊆ S then S ∈ FC

if F ∈ F1, S ⊆ S then F ⊨ S ∈ FC

if τ1 ∈ FC, τ2 ∈ FC then τ1 + τ2 ∈ FC

if τ1 ∈ FC, τ2 ∈ FC then τ1 + τ2 ∈ FC

In this case, we call S and F ⊨ S feature-cluster terms, and the addition of one or more feature-cluster terms is called a feature-cluster family term. We may simply use FC-term to refer to a feature-cluster family term. We define the block function, which receives a feature-cluster family term and returns the set of its blocks. Formally, we have:

block : FC → P(FC)

block(S) = {S}

block(F ⊨ S) = {F ⊨ S}

block(τ1 + τ2) = block(τ1) ∪ block(τ2)

In the case that no feature specifies S directly, it is called an atomic term. The set of all atomic terms is denoted as A.

Example 6.13. In the following, some examples of FC-terms are presented:

S ∈ FC

red · small ⊨ S ∈ FC

red · small ⊨ S + blue ⊨ S ∈ FC

Example 6.14. Suppose that the term τ = red ⊨ S + blue ⊨ S is given. Applying the block function to τ results in:

block(red ⊨ S + blue ⊨ S) = {red ⊨ S, blue ⊨ S}

Definition 6.6 (FC-Term Comparison). We say that two FC-terms τ1 and τ2 are equal, denoted by τ1 ≡ τ2, if, for the different representations of FC-terms, they satisfy the following relations:

S1 ≡ S2 ⇔ S1 = S2

F1 ⊨ S1 ≡ F2 ⊨ S2 ⇔ S1 = S2, F1 = F2

τ ≡ τ′ ⇔ block(τ) = block(τ′)

Example 6.15. The following examples show two simple equivalences of FC-terms:

red · small ⊨ S ≡ small · red ⊨ S

red · small ⊨ S + blue ⊨ S ≡ blue ⊨ S + small · red ⊨ S
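The block function and the comparison of FC-terms can be sketched with plain sets. In this illustration, which is an assumed encoding rather than the thesis notation, a feature-cluster term is a pair (set of features, set of elements) and a family term is a list of such pairs; the names `fc_term`, `block` and `equivalent` are hypothetical.

```python
def fc_term(features, elements):
    """One feature-cluster term: a product of features over a set of elements."""
    return (frozenset(features), frozenset(elements))

def block(family):
    """Set of blocks of a family term given as a sequence of fc_terms."""
    return set(family)

def equivalent(t1, t2):
    """Two family terms are equal when they have the same set of blocks."""
    return block(t1) == block(t2)

S = {("red", "small", 1), ("blue", "small", 2), ("red", "large", 3)}

left = [fc_term(["red", "small"], S), fc_term(["blue"], S)]
right = [fc_term(["blue"], S), fc_term(["small", "red"], S)]
other = [fc_term(["red"], S)]
print(equivalent(left, right), equivalent(left, other))
```

Storing features and blocks in sets makes the commutativity of “·” and “+” automatic, which is why the two orderings of Example 6.15 compare as equal.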

Definition 6.7 (Term). We call τ a term if it has one of the following forms:

τ := S | F ⊨ S | τ + τ | τ · τ (6.29)

where

S := ∅ | S′ | S + S | S · S | S (6.30)

F := 0 | {{f}} | F + F | F · F | 1 (6.31)

in which 6.30 and 6.31 satisfy the properties specified in Section 6.3.1 for the semiring of elements and the semiring of features, respectively.

The set of terms on S and F is denoted as CS,F, or abbreviated as C when it is known beforehand on which datasets it has been constructed. As previously discussed, when an element of F contains just one set of features, we remove the brackets and use “·” to separate the features belonging to that set.

Example 6.16. In the following, some examples of terms on V = {red, blue, small, large} and dataset S are presented:

red · small ⊨ S

red · small ⊨ S + blue ⊨ S′

(red · small ⊨ S) · (blue ⊨ S′)

{{red, large}, {blue}} ⊨ S

Theorem 6.17. The two identity elements of C with respect to “+” and “·” are 0 ⊨ ∅ and 1 ⊨ S, respectively.

Proof. From properties 6.25 and 6.26, and the term definition 6.7, which considers the commutativity of multiplication and addition among terms, we have:

(1 ⊨ S) · (F ⊨ S) = (1 · F) ⊨ S = F ⊨ S (6.32)

(0 ⊨ ∅) · (F ⊨ S) = (0 · F) ⊨ ∅ = 0 ⊨ ∅ (6.33)

(0 ⊨ ∅) + (F ⊨ S) = (0 + F) ⊨ S = F ⊨ S (6.34)

For the other elements of C, the proof follows directly from the above equations and properties 6.25 and 6.26.

Theorem 6.18. The quintuple (C, “+”, “·”, 0 ⊨ ∅, 1 ⊨ S) is an idempotent commutative semiring.

Proof. The proof follows directly from the semiring definition (Definition 6.1) and the properties of the semirings of features and elements given in Section 6.3.1.

Definition 6.8 (Feature-Cluster Algebra). The semiring (C, “+”, “·”, 0 ⊨ ∅, 1 ⊨ S) is called a feature-cluster algebra.

Note that in the present work the terms used in the following sections mostly belong to the set of feature-cluster family terms FC ⊆ C. This means that, as elements of the semiring (C, “+”, “·”, 0 ⊨ ∅, 1 ⊨ S), they follow the same operations and properties as the other elements of the proposed feature-cluster algebra.

6.4 Feature-Cluster (Family) Term Abstraction

In this section, we relate the concept of feature-cluster algebra to the tree structure. To this end, we first present some preliminary notions related to graphs, abstraction and rewriting systems. Graph theory notions are used to formally represent the tree structure. Abstraction theory is used to prove that the syntax form of trees is (under some conditions) equivalent to the semantic form of the tree structure. This property is desirable in the sense that we are able to apply several algebraic calculations to the syntax forms, while it remains possible, whenever required, to transform the result into its equivalent semantic structure, preserving the same properties as if the calculations had been applied to the semantic forms. Moreover, a rewriting system is applied to automatically verify whether a term represents a CCTree or not. Finally, we can automatically obtain a homogenized CCTree term from the addition of several CCTree terms.

6.4.1 Preliminary Notions

Graph Theory Preliminaries In graph theory [62], a tree is an undirected graph in which any two vertices are connected by exactly one path. A forest is a disjoint union of trees. A tree is called a rooted tree if one vertex has been designated the root, which means that the edges have a natural orientation, towards or away from the root [62]. A node directly connected to another node when moving away from the root is the child node. In a rooted tree, every node except the root has one parent node, called its predecessor. Moreover, a child node in a rooted tree is called a successor. A node without successors in a rooted tree is called a leaf. A tree is a labeled tree if its edges are labeled. A branch of a tree refers to the path between the root and a leaf in a rooted tree [62]. A descendant tree of an edge f in a rooted tree T is the subtree of T following edge f.

Definition 6.9 (Graph Homomorphism, Graph Isomorphism). A graph homomorphism ζ from a graph G = (V, E) to a graph G′ = (V′, E′), written as ζ : G → G′, is a mapping ζ : V → V′ from the vertex set of G to the vertex set of G′ such that {u, v} ∈ E implies {ζ(u), ζ(v)} ∈ E′ [70]. If the homomorphism ζ : G → G′ is a bijection whose inverse function is also a graph homomorphism, then ζ is a graph isomorphism. In our context it is important that both {u, v} ∈ E and {ζ(u), ζ(v)} ∈ E′ have the same edge label. Under this condition, we say that two graphs G = (V, E) and G′ = (V′, E′) are equivalent, denoted as G ≈ G′, if V = V′, E = E′, for every {u, v} ∈ E and {ζ(u), ζ(v)} ∈ E′ we have {u, v} = {ζ(u), ζ(v)}, and finally G and G′ are isomorphic.

Definition 6.10 (Tree Structure). In our context, a graph structure is a triple (F, Q, ω) where: F represents the set of edge labels; Q is the set of states or nodes; and ω is the transition function, ω : Q × F → Q. A graph structure is a tree structure if there is no cycle in the transitions. In this case, the transitions are written such that each parent node is connected to its children, moving away from the root.

We write a transition ω(s1, f) = s2 as a triple (s1, f, s2). Hence, the set of transitions in our context is a set of triples, where the first component is a parent node (predecessor), the last component is a child (successor) of the first component, and the middle component is the edge label (feature) leading from the parent node to its child.

Note : It is worth noticing that a CCTree is a tree structure, which in our context can be formally presented as a triple where the first component (F ) contains the set of edge labels, the second component (Q) contains the nodes of CCTree, and the last component is the set of transitions between a parent node through edge labels to its children. We label the root node as the main dataset desired to be clustered.

Abstraction Theory Preliminaries What does abstraction mean in general? Some of the synonyms of the word “abstract” are “brief”, “synopsis” and “sketch”; some of the synonyms of the verb “to abstract” are “to detach” and “to separate”. The intuition that comes out of this list of synonyms is that the process of abstraction is related to the process of separating, extracting from a representation of an object or subject an “abstract” representation that consists of a brief sketch of the original representation [59].

More precisely, abstraction is the process of mapping a representation of a problem, called the “ground” representation, onto a new representation, called the “abstract” representation, such that it helps to deal with the problem in the original search space by preserving certain desirable properties, and is simpler to handle as it is constructed from the ground representation by “not considering the details” [59]. The most common use of abstraction is in theorem proving: the goal is abstracted, its abstracted version is proved, and the structure of the resulting proof is then used to help construct the proof of the original goal. This is based on the assumption that the structure of the abstract proof is equivalent to the structure of the proof of the goal. The other main use of abstraction theory has been to study the formal properties of abstractions and the operations, like composition and ordering, which can be defined upon them [59].

An abstraction can formally be written as a function [[.]] : X → Y from the ground representation (semantic form) of an object to its abstract form (syntax form). We say [[.]] adequately abstracts X if from the equivalence of two elements of semantic forms, we get the equivalence of their equivalent syntax forms. Formally, if the equivalence of elements in X is denoted by ≃ and the equivalence of elements in Y is denoted by ≅, then:

[[X1]] ≅ [[X2]] ⇒ X1 ≃ X2 (6.35)

We say [[.]] abstracts X if we have:

X1 ≃ X2 ⇒ [[X1]] ≅ [[X2]] (6.36)

When 6.35 and 6.36 are both satisfied, we say [[.]] fully abstracts X, i.e. we have:

[[X1]] ≅ [[X2]] ⇔ X1 ≃ X2

Rewriting System Terminology A rewriting system is given by a set of directed equations on a set of objects. The objects in a rewriting system are usually called terms and the directed equations are called rewriting rules. The rewriting rules are applied to compute new terms by repeatedly replacing subterms of a given term until the simplest possible form is obtained [43]. More precisely, a rewriting rule is an ordered pair of terms x and y, written as x → y. Similar to equations, rules are applied to replace instances of x by corresponding instances of y. Unlike equations, rules are not applied to replace instances of the right-hand side y [43]. A term over symbols G, constants K, and variables X is either a variable x ∈ X, a constant k ∈ K, or an expression of the form g(t1, t2, . . . , tn), where g ∈ G is a function symbol of n arguments and the ti are terms [43]. A derivation for a rule → is a sequence of the form t0 → t1 → .... The term t is reducible with respect to the rule → if there is a term u such that t → u; otherwise it is irreducible. A rewrite system R is a set of rewrite rules x → y, where x and y are terms. The term u is a → normal form of t if t →∗ u and u is irreducible via →, where →∗ means that the rule → is applied n times (n ∈ N). A relation → is terminating if there is no infinite derivation t0 → t1 → ..., which means that every derivation reaches a normal form after finitely many steps. A relation → is confluent if there is an element v such that s →∗ v and t →∗ v whenever u →∗ s and u →∗ t for some elements s, t and u. A relation → is convergent if it is terminating and confluent. Convergent rewriting systems are interesting because all derivations lead to a unique normal form [43]. A conditional rule is an equational implication in which the term in the conclusion is reached only if the conditions are satisfied. We use the form x1 = u1 ∧ ... ∧ xn = un | x → y to show that under the conditions x1 = u1 ∧ ... ∧ xn = un we have x → y.
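The notions of derivation, irreducibility and normal form can be illustrated with a minimal Python sketch. The string rule `ba → ab` and the names `swap_ba` and `normal_form` are hypothetical choices made only for this illustration; they are not part of the CCTree rewriting system.

```python
from typing import Callable, List, Optional

# A rule returns the rewritten term, or None when it does not apply.
Rule = Callable[[str], Optional[str]]


def swap_ba(term: str) -> Optional[str]:
    """Hypothetical rule 'ba -> ab', applied to the leftmost match."""
    return term.replace("ba", "ab", 1) if "ba" in term else None


def normal_form(term: str, rules: List[Rule], max_steps: int = 1000) -> str:
    """Apply the rules until none applies; the result is irreducible."""
    for _ in range(max_steps):
        for rule in rules:
            rewritten = rule(term)
            if rewritten is not None:
                term = rewritten
                break
        else:
            return term  # no rule applied: term is a normal form
    raise RuntimeError("no normal form reached within max_steps")
```

With this rule every derivation is finite, and both `"bab"` and `"bba"` reach the same irreducible string, which is the behaviour expected of a convergent system.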

6.4.2 Graph Structure and Feature-Cluster Family Terms

In this subsection, we explain how graph structure and feature-cluster family term can be transformed to each other. To this end, we first present the “meaning” relation to transform a feature-cluster family term to a labeled graph structure. Afterwards, we present a function to get a feature-cluster family term from a labeled tree structure. Then, we prove in a theorem that if two labeled trees are equivalent, they return equal terms. However, we show that the two equal feature-cluster family terms do not necessarily return two equivalent graph structures. We prove that under the condition of considering a fixed order among the features, the latter requirement will also be respected.

In the provided examples, attributes Color = {r(ed), b(lue)}, Size = {s(mall), l(arge)}, and Shape = {c(ircle), t(riangle)} are used to describe the terms.

To avoid the confusion of different representations of an FC-term, in what follows we present the definitions of factorized and non factorized terms.

Definition 6.11 (Factorized Term). We define the factorization rewriting rule through an attribute A ∈ A, denoted as −A→, from an FC-term to its factorized form as follows:

f · τ1 + f · τ2 −A→ f · (τ1 + τ2) for f ∈ A

We denote the normal form obtained by applying the factorization rewriting rule to a term τ through attribute A as τ ↓A, and the set of factorized forms of the terms of FC is denoted by FC ↓. A term after factorization is called a factorized term.

Definition 6.12 (Defactorization). We define the defactorization rewriting rule on an FC-term as follows:

f · (τ1 + τ2) →d f · τ1 + f · τ2

A normal form resulting from the defactorization rewriting rule is called a non factorized term. The non factorized form of the term τ is denoted as τ ↑. The set of non factorized terms of FC is denoted by FC ↑.

Example 6.19. In what follows, we show how factorization and defactorization work. For factorization we have:

(r · s  S + r · c  S + b · s  S) ↓color = r · (s  S + c  S) + b · s  S

and for defactorization:

r · (s  S + c  S) + b · s  S →d r · s  S + r · c  S + b · s  S
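Factorization and defactorization can be sketched in Python under a simplifying assumption: a non factorized term over a single dataset S is modeled as a list of blocks, each block being a tuple of features, and only the leading feature of a block is factored out. The names `factorize` and `defactorize` are hypothetical.

```python
COLOR = {"r", "b"}

# r · s + r · c + b · s, all over the same dataset S (left implicit here).
tau = [("r", "s"), ("r", "c"), ("b", "s")]


def factorize(term, attribute):
    """Group blocks that share a leading feature belonging to `attribute`."""
    groups = {}   # leading feature -> list of remaining feature tuples
    rest = []     # blocks whose leading feature is not in `attribute`
    for block in term:
        if block and block[0] in attribute:
            groups.setdefault(block[0], []).append(block[1:])
        else:
            rest.append(block)
    return list(groups.items()), rest


def defactorize(groups, rest):
    """Distribute each leading feature back over its grouped tails."""
    blocks = [(f,) + tail for f, tails in groups for tail in tails]
    return blocks + rest


groups, rest = factorize(tau, COLOR)
```

Here `groups` pairs `r` with the tails `s` and `c`, and `b` with the tail `s`, mirroring r · (s + c) + b · s; `defactorize(groups, rest)` returns the original list of blocks.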

From Feature-Cluster Family Term to Tree Structure Applying the same notions presented in previous sections, in what follows we define three functions, which return the set of edge labels, the set of nodes, and the set of transitions from a received FC-term, respectively. These three functions are used in our context to get a forest structure from an FC-term. We define the function of feature, denoted by Θ, which gets a non factorized FC-term and returns a set of features as follows :

Θ : FC ↑ → P(V)
Θ(S) = ∅
Θ(f  S) = {f}
Θ(f · F  S) = {f} ∪ Θ(F  S)
Θ(τ1 + τ2) = Θ(τ1) ∪ Θ(τ2)

We define the function of states, denoted as Φ, which gets a non factorized FC-term and returns a set of FC-terms, as follows:

Φ : FC ↑ → P(FC ↑)
Φ(S) = {S}
Φ(f  S) = {f  S, S}
Φ(f · F  S) = {f · F  S} ∪ Φ(F  S)
Φ(τ1 + τ2) = Φ(τ1) ∪ Φ(τ2)

Moreover, we define the transition function, denoted as Ω, which gets a non factorized FC-term and returns the set of triples representing the transitions from the associated nodes, as follows:

Ω : FC ↑ → P(FC ↑ × V × FC ↑)
Ω(S) = ∅
Ω(f  S) = {(S, f, f  S)}
Ω(f · F  S) = {(F  S, f, f · F  S)} ∪ Ω(F  S)
Ω(τ1 + τ2) = Ω(τ1) ∪ Ω(τ2)

Now we are ready to introduce the meaning relation which gets a non factorized FC-term and returns a forest structure.

Definition 6.13. The meaning relation, denoted as [[.]], gets a non factorized FC-term and returns a triple representing a forest (or tree) structure, as follows:

[[.]] : FC ↑ → GV,FC
[[τ]] = (Θ(τ), Φ(τ), Ω(τ))

where GV,FC is the set of all possible forest structures on the set of edge labels V and the set of node labels FC.

Example 6.20. In what follows, we show how a feature-cluster family term is transformed to its equivalent graph structure according to the above rules :

[[r  S + b · l  S + b · s  S]] = ({b, r, l, s}, {S, r  S, b  S, b · l  S, b · s  S}, {(S, r, r  S), (b  S, l, b · l  S), (S, b, b  S), (b  S, s, b · s  S)})
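The three functions and the meaning relation can be sketched in Python as follows. This is an illustration under stated assumptions, not the thesis implementation: a term over one dataset S is a list of blocks, a block is a tuple of features read from the root towards a leaf (the reading used in Example 6.20), and the empty tuple `()` stands for S. The function names are hypothetical.

```python
def theta(term):
    """Θ: the set of edge labels (features) occurring in the term."""
    return {f for block in term for f in block}


def phi(term):
    """Φ: the set of nodes, i.e. every prefix of every block, including S."""
    return {block[:i] for block in term for i in range(len(block) + 1)}


def omega(term):
    """Ω: the transitions (parent, feature, child) along every block."""
    return {(block[:i], block[i], block[:i + 1])
            for block in term for i in range(len(block))}


def meaning(term):
    """[[τ]] = (Θ(τ), Φ(τ), Ω(τ))."""
    return theta(term), phi(term), omega(term)


# r + b · l + b · s over the dataset S, as in Example 6.20.
tau = [("r",), ("b", "l"), ("b", "s")]
labels, nodes, transitions = meaning(tau)
```

For `tau` the sketch yields four edge labels, five nodes and four transitions, matching the triple of Example 6.20 once `()` is read as S.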

From Tree Structure to a Feature-Cluster Family Term We define the function root, denoted as r, which gets a tree and returns the root of the tree. Formally :

r : TV,FC → Q

r(T) = {s | ∪si∈Q {(si, f, s)} = ∅}

where TV,FC is the set of rooted trees on V and FC. We define the set of edge labels of the children of r(T) as follows:

δ(T) = {f | ∃ s′ ∈ Q s.t. (r(T), f, s′) ∈ ω}

Moreover, in a tree T, the descendant tree directly after edge f, called the derivative tree of T following edge f, is denoted by ∂f(T). We define the function Ψ, which gets a tree structure T and returns its features as follows:

Ψ(T) = Σ_{f ∈ δ(T)} f · Ψ(∂f(T)) (6.37)

where Ψ(T) = 1 when δ(T) = 1, that is, when the root of T has no children. We represent f · 1 as f. We define the transform function, denoted by ψ, which gets a set of k labeled trees (a forest) and returns an FC-term as follows:

ψ : GV,FC → FC
ψ(∅) = 0
ψ(T1 ∪ T2) = Ψ(T1)  r(T1) + Ψ(T2)  r(T2)

Example 6.21. Suppose the following tree is given:

M = ({f1, f2}, {s, s1, s2}, {(s, f1, s1), (s, f2, s2)})

Then the only state to which there is no transition is node s. Hence, we have:

Ψ(M) = f1 · Ψ(∅, {s1}, ∅) + f2 · Ψ(∅, {s2}, ∅) = f1 · 1 + f2 · 1 = f1 + f2

and the resulting term is equal to:

ψ(M) = Ψ(M)  s

Definition 6.14. A term resulting from a CCTree structure, or equivalently transformable to a tree structure representing a CCTree, is called a CCTree term.

Example 6.22. Suppose the CCTree of Figure 6.1 is given. The tree structure of this CCTree can be written as the following :

({red, blue, small, large}, {S, Sr, Sb, Sb·s, Sb·l}, {(S, red, Sr), (S, blue, Sb), (Sb, small, Sb·s), (Sb, large, Sb·l)})

The CCTree term resulting from this CCTree is:

red  S + blue · small  S + blue · large  S
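The transformation from a tree structure to a term can be sketched in Python. The sketch assumes the tree is given as a list of nodes and a list of (parent, feature, child) transitions, and it returns the non factorized form directly, as the list of feature sequences along the branches, rather than the factorized sum of equation 6.37. The names `root` and `branches` are hypothetical.

```python
def root(nodes, transitions):
    """Return the only node that is not the target of any transition."""
    targets = {child for _, _, child in transitions}
    (r,) = [n for n in nodes if n not in targets]
    return r


def branches(node, transitions):
    """Return the feature sequences on all paths from `node` to a leaf."""
    out = [(f, child) for parent, f, child in transitions if parent == node]
    if not out:
        return [()]  # a leaf contributes the neutral element 1
    return [(f,) + tail
            for f, child in out
            for tail in branches(child, transitions)]


# The CCTree of Example 6.22.
nodes = ["S", "Sr", "Sb", "Sb.s", "Sb.l"]
transitions = [("S", "red", "Sr"), ("S", "blue", "Sb"),
               ("Sb", "small", "Sb.s"), ("Sb", "large", "Sb.l")]

term = branches(root(nodes, transitions), transitions)
```

Each tuple in `term` is one block over the root dataset, so the result reads as red + blue · small + blue · large over S.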

Proposition 6.23. For each non factorized FC-term τ, there exists at least one forest structure in GV,FC that represents τ. Moreover, for each labeled forest structure T in GV,FC, there exists a unique term that represents T .

Proof. The proof is straightforward from the proposed methodology of transforming a forest structure to a term and vice versa.

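[Figure: a tree with root S; the edges red and blue lead from S to the nodes Sr and Sb; the edges small and large lead from Sb to the leaves Sb·s and Sb·l.]

Figure 6.1 – A Small CCTree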

Theorem 6.24. The meaning relation [[.]] adequately abstracts a graph structure resulting from a feature-cluster (family) term on V and the same fixed dataset of elements S ⊆ S. This means that for two non factorized FC-terms τ and τ′ we have:

[[τ]] ≈ [[τ′]] ⇒ τ ≡ τ′ (6.38)

Intuitively, relation 6.38 expresses that if two forest structures resulting from two FC-terms are equal, we can certainly conclude that the original terms were equal as well. In other words, if τ ≢ τ′ then we can conclude that [[τ]] ≉ [[τ′]].

Proof. From Proposition 6.23, for each non factorized FC-term there exists a forest structure representing it. This means that [[τ]] and [[τ′]] certainly return a structure. Now, suppose that the left hand side of 6.38 is satisfied. Hence, we have:

[[τ]] ≈ [[τ′]] ⇒ Θ(τ) = Θ(τ′), Φ(τ) = Φ(τ′), Ω(τ) = Ω(τ′) (6.39)
⇒ block(τ) = block(τ′) ⇒ τ ≡ τ′ (6.40)

where 6.39 results from the equivalent graph structures of τ and τ′, and 6.40 follows from Φ(τ) = Φ(τ′) and the fact that the two terms originated from the same dataset.

The following example shows that the converse of relation 6.38 does not hold in general.

Example 6.25. The two following feature-cluster family terms are equivalent in terms of term comparison 6.6, i.e. we have:

f1 · f2  S + f1 · f3  S ≡ f2 · f1  S + f3 · f1  S

but their tree representations are not equivalent, since we have:

[[f1 · f2  S + f1 · f3  S]] = ({f1, f2, f3}, {S, f1  S, f1 · f2  S, f1 · f3  S}, {(f1  S, f2, f1 · f2  S), (f1  S, f3, f1 · f3  S), (S, f1, f1  S)})

[[f2 · f1  S + f3 · f1  S]] = ({f1, f2, f3}, {S, f2  S, f3  S, f2 · f1  S, f3 · f1  S}, {(f2  S, f1, f2 · f1  S), (S, f2, f2  S), (f3  S, f1, f3 · f1  S), (S, f3, f3  S)})

where the first one contains four nodes, whilst the second one contains five nodes. This means that they are not isomorphic graphs.

This example shows that commutativity of “·” is not an appropriate property for full abstraction. In what follows, we show that the converse of 6.38 holds if an order is fixed on the set of features, which solves the problem of the commutativity of multiplication (“·”).

Definition 6.15 (Ordered Features). We say that the set of features V is an ordered set of features if there is an order relation “<” on V such that (V, <) is a totally ordered set. This means that for any two distinct f1, f2 ∈ V we either have f1 < f2 or f2 < f1. We say F1 is exactly equal to F2, denoted by F1 ≅ F2, if considering the order of features they are equal.

Definition 6.16 (Order Rewriting Rule). Let an ordered set of features (V, <) be given. We say an FC-term is an ordered FC-term on (V, <), if it is the normal form of applying the following rewriting rule :

f1 · f2  S →O f2 · f1  S if f1 < f2, ∀ f1, f2 ∈ V

Moreover, we define a rewriting rule which orders the features of an FC-term based on an attribute A ∈ A as follows:

f2 · f1  S −A→O f1 · f2  S if f1 ∈ A

We denote the normal form of a term τ under the above rewriting rule, based on attribute A, as τ ⇓A.

Example 6.26. Suppose that the set of features V1 = {red, blue, small, large} is given. Without loss of generality, fixing a strict order “<” among them as “red < blue < small < large” results in (V1, <) being a totally ordered set. The following examples show how ordered FC-terms on V1 are obtained by applying the order rewriting rule:

red · small  S → small · red  S
red · small  S + blue · large  S → small · red  S + large · blue  S

Moreover: red · small ≇ small · red, whilst red · small ≅ red · small.

Definition 6.17 (Ordered FC-term Comparison). Exact equality of two ordered FC-terms on (V, <), denoted by ∼, is the smallest relation for which the terms respect one of the following relations:

1. if S1 = S2 then S1 ∼ S2
2. if S1 = S2 ∧ F1 ≅ F2 then F1  S1 ∼ F2  S2
3. if ∀τi ∈ block(τ) ∃τj ∈ block(τ′) s.t. τi ∼ τj, and ∀τj ∈ block(τ′) ∃τi ∈ block(τ) s.t. τj ∼ τi, then τ ∼ τ′

Example 6.27. Suppose the ordered set of features of Example 6.26 is given. The following examples show how two ordered FC-terms are compared:

red · small  S ≁ small · red  S
red · small  S ∼ red · small  S
red · small  S + blue  S ∼ blue  S + red · small  S
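The order rewriting rule and the exact comparison can be sketched in Python. The sketch assumes that both terms are over the same dataset S, that a term is a list of blocks, and that a block is a tuple of features; `RANK`, `order_block`, `order_term` and `exactly_equal` are hypothetical names.

```python
# The strict order of Example 6.26: red < blue < small < large.
RANK = {"red": 0, "blue": 1, "small": 2, "large": 3}


def order_block(block):
    """Normal form of f1 · f2 -> f2 · f1 if f1 < f2: features in descending order."""
    return tuple(sorted(block, key=RANK.__getitem__, reverse=True))


def order_term(term):
    """Apply the order rewriting rule to every block of the term."""
    return [order_block(block) for block in term]


def exactly_equal(term1, term2):
    """Blocks must match feature by feature; the order of the blocks is irrelevant."""
    return set(term1) == set(term2)
```

Comparing blocks as tuples makes the position of each feature significant, which is what distinguishes exact equality from the commutative comparison used for unordered terms.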

Theorem 6.28. Let (V, <) be a totally ordered set of features and S ⊆ S. The meaning relation [[.]] abstracts the forest (tree) structure resulting from the ordered non factorized FC-terms on V and S. This means that, for two arbitrary ordered non factorized FC-terms τ and τ′ on (V, <) and S ⊆ S, we have:

τ ∼ τ′ ⇒ [[τ]] ≈ [[τ′]] (6.41)

Proof. Suppose the left side of 6.41 holds. This means that for each feature-cluster term τi ∈ τ there exists a feature-cluster term τj ∈ τ′ such that τi and τj are exactly equal. This property causes the set of transitions of [[τi]] to be equal to the set of transitions of [[τj]]. Consequently, 6.41 holds. More precisely, we have:

τ ∼ τ′ ⇒ ∀τi ∈ block(τ) ∃τj ∈ block(τ′) s.t. τi ∼ τj (⇒ [[τi]] ≈ [[τj]]), (6.42)
∀τj ∈ block(τ′) ∃τi ∈ block(τ) s.t. τj ∼ τi (⇒ [[τi]] ≈ [[τj]])
⇒ [[τ]] ≈ [[τ′]] (6.43)

Now we are ready to present the main theorem of this section, which provides the conditions of full abstraction.

Theorem 6.29 (Main Theorem). Let the ordered set of features (V, <) and the set of elements S ⊆ S be given. The meaning function [[.]] fully abstracts the ordered feature-cluster family terms on (V, <) and S. This means that for two arbitrary ordered feature-cluster family terms τ and τ′ on V and S, we have:

[[τ]] ≈ [[τ′]] ⇔ τ ∼ τ′ (6.44)

Proof. The proof follows directly from the proofs of Theorems 6.24 and 6.28.

6.5 Relations on Feature-Cluster Algebra

In this section, we define several relations on feature-cluster algebra and discuss the properties of the proposed relations. Here, we will use the same notions and symbols introduced in 6.3.1, 6.3.1, and 6.3.1.

Definition 6.18 (Attribute Division). Attribute division (DA) is a function from A × FC ↑ to {True, False}, which gets an attribute and a non factorized FC-term as input; it returns True or False as follows:

DA : A × FC ↑ → {True, False}
DA(A, S) = False
DA(A, f  S) = True if f ∈ A
DA(A, f  S) = False if f ∉ A
DA(A, f · F  S) = DA(A, f  S) ∨ DA(A, F  S)
DA(A, τ1 + τ2) = DA(A, τ1) ∧ DA(A, τ2)

The concept of attribute division is used to order the attributes present in a term, which will be discussed later.

Example 6.30. The following shows how attribute division works:

DA(color, r · s  S + r · c  S + b · s  S) = DA(color, r · s  S) ∧ DA(color, r · c  S) ∧ DA(color, b · s  S) = True
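Attribute division translates almost literally into Python: a disjunction over the features of a block and a conjunction over the blocks of a term. The sketch below assumes the list-of-tuples modeling of terms over a single dataset S; the names `divides_block` and `attribute_division` are hypothetical.

```python
COLOR = {"r", "b"}
SIZE = {"s", "l"}
SHAPE = {"c", "t"}


def divides_block(attribute, block):
    """DA on one block: True if some feature of the block belongs to the attribute."""
    return any(f in attribute for f in block)


def attribute_division(attribute, term):
    """DA on a sum of blocks: every block must be divided by the attribute."""
    return all(divides_block(attribute, block) for block in term)


# r · s + r · c + b · s over the dataset S, as in Example 6.30.
tau = [("r", "s"), ("r", "c"), ("b", "s")]
```

For `tau`, only `COLOR` divides the whole term: the block r · c contains no size feature and the blocks r · s and b · s contain no shape feature. A bare dataset, modeled as the empty tuple, is never divided, in line with DA(A, S) = False.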

Definition 6.19 (Initial). We define the initial (δ) function from P(FC ↑) to P(F), which gets a set of ordered non factorized terms on (V, <), and returns a set of the first features of each term as follows :

δ : P(FC ↑) → P(F)
δ(∅) = {0}
δ({S}) = {1}
δ({f · F  S}) = {f}
δ({τ1 + τ2}) = δ({τ1}) ∪ δ({τ2})
δ({τ1, τ2}) = δ({τ1}) ∪ δ({τ2})

with the following property:

δ({X, Y}) = δ(X) ∪ δ(Y)

where X, Y ∈ P(FC ↑). In the case that the input set contains just one term, we remove the brackets, i.e. δ({τ}) = δ(τ), when |{τ}| = 1. Moreover, when the output set also contains just one element, for the sake of simplicity we remove the brackets, i.e. δ(X) = {f} = f for X ∈ P(FC ↑).

Example 6.31. The following shows the result of the initial function on a pair of terms:

δ({S, r · s  S}) = {1, r}

Definition 6.20 (Derivative). The Brzozowski derivative [23], denoted as u−1S, of a set S of strings and a string u is defined as the set of all remaining strings obtainable from a string in S by cutting off its prefix u. In our context, importing the idea of Brzozowski, we define the derivative, denoted by ∂, as a function which gets an ordered non factorized FC-term on (V, <) and returns the term (set of terms) obtained by cutting off the first features, as follows:

∂ : FC ↑ → P(FC)
∂(S) = ∅
∂(f  S) = {S}
∂(f · F  S) = {F  S}
∂(τ1 + τ2) = ∂(τ1) ∪ ∂(τ2)

Note: The functions initial (δ) and derivative (∂) are overloaded: their meaning depends on whether the input is a tree or a term.
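On terms, the initial and derivative functions can be sketched in Python as follows. The sketch assumes that a term is a list of blocks over one dataset S, that the empty tuple `()` stands for S, and that the constants 0 and 1 of the definition are represented by the integers `0` and `1`; the names `initial` and `derivative` are hypothetical.

```python
def initial(terms):
    """δ on a collection of terms: the set of first features of their blocks."""
    if not terms:
        return {0}  # δ(∅) = {0}
    # A bare dataset S (empty block) contributes the neutral element 1.
    return {block[0] if block else 1 for term in terms for block in term}


def derivative(term):
    """∂ on a term: cut the first feature off every block; a bare S yields nothing."""
    return {block[1:] for block in term if block}


tau = [("r", "s"), ("r", "c"), ("b", "s")]
```

With this modeling, `initial([[()], [("r", "s")]])` reproduces Example 6.31, and `derivative(tau)` returns the two remaining blocks s and c, each over S.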

Definition 6.21 (Order of Attributes). We say attribute B is smaller than or equal to attribute A on the non factorized term τ ∈ FC ↑, denoted as B ⪯τ A, if the number of blocks of τ that B divides is less than or equal to the number of blocks that A divides. Formally, B ⪯τ A implies that:

|{τi ∈ block(τ) | DA(B, τi) = True}| ≤ |{τi ∈ block(τ) | DA(A, τi) = True}|

Given a set of attributes A and a term τ, the set (A, ⪯τ) is a lattice. We denote the upper bound of this set as uA,τ. This means that we have: ∀ A ∈ A ⇒ A ⪯τ uA,τ.

Example 6.32. The following shows how the order of attributes of a term is identified. Suppose the term τ = r · s  S + r · c  S + b · s  S is given. We have:

block(τ) = {r · s  S, r · c  S, b · s  S}

consequently,

|{τi ∈ block(τ) | DA(shape, τi) = True}| = 1
≤ |{τi ∈ block(τ) | DA(size, τi) = True}| = 2
≤ |{τi ∈ block(τ) | DA(color, τi) = True}| = 3

which means that we have:

shape ⪯τ size ⪯τ color
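The counting behind the order of attributes can be sketched in Python. The dictionary `ATTRIBUTES` and the names `divided_blocks` and `attribute_order` are hypothetical; terms are again modeled as lists of feature tuples over one dataset S.

```python
ATTRIBUTES = {
    "color": {"r", "b"},
    "size": {"s", "l"},
    "shape": {"c", "t"},
}


def divided_blocks(attribute, term):
    """Number of blocks of the term containing a feature of the attribute."""
    return sum(1 for block in term if any(f in attribute for f in block))


def attribute_order(term, attributes):
    """Attribute names sorted from the smallest to the greatest on the term."""
    return sorted(attributes, key=lambda name: divided_blocks(attributes[name], term))


tau = [("r", "s"), ("r", "c"), ("b", "s")]
order = attribute_order(tau, ATTRIBUTES)
```

Because Python's `sorted` is stable, attributes that divide the same number of blocks keep their dictionary order; this is one arbitrary way of fixing the strict order that the text leaves open.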

Recall that the lack of a predefined order among features creates a problem for the full abstraction of terms. We therefore propose a way to order the set of features that is appropriate to our problem. First, given a feature-cluster family term τ, we find the order of attributes according to Definition 6.21; if two attributes A and A′ divide the same number of blocks, we choose, without loss of generality, a strict order among them, say A ≺ A′. Then, within each attribute, we order the features arbitrarily. It is important that the features of a smaller attribute always be smaller than the features of a greater attribute. For example, if size ≺ color, we may take the order of features to be small < large < blue < red, so that all the features of color are greater than all the features of size.

Definition 6.22 (Ordered Unification). Ordered unification (F) is a partial function from P(A) × FC ↑ to FC ↓, which gets a set of attributes and a non factorized term; it returns the normal form of applying the rewriting rule −A→O introduced in Definition 6.16, iteratively, based on the order of attributes on the received term, as follows:

F : P(A) × FC ↑ → FC
F(∅, τ ↑) = τ
F({A}, τ ↑) = τ ⇓A
F(A, τ) = F(uA,τ, F(A − {uA,τ}, τ ↑))

The normal form of ordered unification is called a unified term. By F ∗(τ) we mean that F is performed iteratively on the set of ordered attributes on τ to get the unified term.

Example 6.33. To find the unified form of τ1 = r · s  S + r · c  S + b · s  S, we have:

F∗(τ1) = F({shape, color, size}, τ1 ↑) = F(color, F(size, F(shape, τ1))) = r · s  S + r · c  S + b · s  S
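Ordered unification can be sketched in Python as a sequence of stable reorderings, one per attribute, applied from the smallest attribute to the greatest so that the features of the greatest attribute end up in front. This is a sketch under the list-of-tuples modeling of terms; `order_by` and `unify` are hypothetical names, and ties between attributes are broken by dictionary order.

```python
ATTRIBUTES = {
    "color": {"r", "b"},
    "size": {"s", "l"},
    "shape": {"c", "t"},
}


def divided_blocks(attribute, term):
    """Number of blocks of the term containing a feature of the attribute."""
    return sum(1 for block in term if any(f in attribute for f in block))


def order_by(attribute, block):
    """τ ⇓A on one block: move the features of the attribute to the front."""
    front = tuple(f for f in block if f in attribute)
    back = tuple(f for f in block if f not in attribute)
    return front + back


def unify(term, attributes):
    """F*: apply ⇓A for each attribute, smallest first, greatest last."""
    names = sorted(attributes, key=lambda n: divided_blocks(attributes[n], term))
    for name in names:
        term = [order_by(attributes[name], block) for block in term]
    return term


tau1 = [("r", "s"), ("r", "c"), ("b", "s")]
```

The term `tau1` is already unified, so `unify` returns it unchanged, as in Example 6.33; a scrambled variant such as s · r + c · r + s · b is rewritten to the same unified form.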

Definition 6.23 (Component relation). Given two ordered non factorized FC-terms τ1 and τ2 on (V, <), we define the component relation, denoted by ∼1, as the first level comparison of terms as the following :

τ1 ∼1 τ2 ⇔ δ(τ1) = δ(τ2)

Proposition 6.34. The component relation is an equivalence relation on the set of ordered non factorized FC-terms.

Proof. For ordered non factorized FC-terms τ1, τ2 and τ3, we have:

Reflexivity: τ1 ∼1 τ1, since δ(τ1) = δ(τ1).

Symmetry: if τ1 ∼1 τ2 then τ2 ∼1 τ1, since δ(τ1) = δ(τ2) implies δ(τ2) = δ(τ1).

Transitivity: if τ1 ∼1 τ2 and τ2 ∼1 τ3 then τ1 ∼1 τ3, since δ(τ1) = δ(τ2) and δ(τ2) = δ(τ3) imply δ(τ1) = δ(τ3).

Definition 6.24 (Component). Let the ordered term τ ∈ FC ↑ on (V, <) be given. The equivalence class of τ′ ∈ block(τ) is called a component of τ, and it is formally defined as:

[τ′]τ = {τi ∈ block(τ) | τ′ ∼1 τi}

The set of all components of the term τ through the equivalence relation ∼1 is denoted by block(τ)/∼1 or simply τ/∼1, i.e. we have:

τ/∼1 = {[τi]τ | τi ∈ block(τ)}

Definition 6.25 (Component Order). Let X and Y be two sets of ordered non factorized FC-terms on (V, <). We say X is smaller than Y , denoted as X < Y , if :

X < Y ⇔ ∀f′ ∈ δ(X), ∀f″ ∈ δ(Y): f′ < f″

Specifically, let τ be an ordered non factorized FC-term on (V, <). We order the components of τ according to the order of features in V as follows:

[τ′]τ < [τ″]τ ⇔ ∀f′ ∈ δ([τ′]), ∀f″ ∈ δ([τ″]): f′ < f″

Note that |δ([τ′])| = |δ([τ″])| = 1 for all τ′, τ″ ∈ block(τ), since the first features of all elements in a component are equal.

We denote the i’th component of τ/ ∼1 as [τ]i. Due to the fact that the features are ordered strictly, the term components are also ordered strictly.

Definition 6.26 (Well formed term). The well formed function, denoted as W, is a Boolean function from FC ↑ to {True, False}, which gets a unified non factorized FC-term; it returns True if the set of first features of its components is equal to a sort of A to which these features belong; it returns False otherwise. Formally:

W(τ) = True if δ(τ/∼1) = sort(δ([τi]τ)) ∀τi ∈ block(τ), and W(τ) = False otherwise

where δ(τ/∼1) = sort(δ([τ1]τ)) means that the set of the first features of the components of the term τ is equal to the attribute that the first feature belongs to. A unified term τ is called a well formed term if W(τ) = True. An atomic term is a well formed term.

Example 6.35. The unified term of Example 6.33, τ = r · s  S + r · c  S + b · s  S, is a well formed term, since we have:

δ(τ/∼1) = δ({{r · s  S, r · c  S}, {b · s  S}}) = {r, b}
sort(δ([r · s  S])) = sort(δ({r · s  S, r · c  S})) = sort({r}) = {r, b}

consequently W(τ) = True.
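Components and the well formed check can be sketched in Python. The sketch assumes a unified term given as a list of feature tuples over one dataset S, with the empty tuple standing for S; `components`, `sort_of` and `well_formed` are hypothetical names.

```python
ATTRIBUTES = {
    "color": {"r", "b"},
    "size": {"s", "l"},
    "shape": {"c", "t"},
}


def components(term):
    """Group the blocks of a term by their first feature (the relation ∼1)."""
    groups = {}
    for block in term:
        first = block[0] if block else None  # None marks a bare dataset S
        groups.setdefault(first, []).append(block)
    return groups


def sort_of(feature, attributes):
    """Return the set of features of the attribute that `feature` belongs to."""
    for values in attributes.values():
        if feature in values:
            return values
    return None


def well_formed(term, attributes):
    """W: the first features of the components form exactly one attribute."""
    if all(len(block) == 0 for block in term):
        return True  # an atomic term is well formed
    firsts = set(components(term))
    return all(sort_of(f, attributes) == firsts for f in firsts)


tau = [("r", "s"), ("r", "c"), ("b", "s")]
```

For `tau` the components are {r · s, r · c} and {b · s}, whose first features {r, b} coincide with the attribute color, so the check succeeds. A term whose first features cover only part of an attribute, or mix two attributes, fails it.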

It is noticeable that in an ordered CCTree term all first features belong to the same attribute. Hence, in what follows we exploit the concept of well formed term to identify whether a term represents a CCTree term or not.

6.5.1 CCTree Term Schema

We know that each CCTree term is a feature-cluster family term. The converse does not hold: a feature-cluster family term does not necessarily represent a CCTree term. It is therefore of interest to characterize which feature-cluster family terms represent a CCTree term. This knowledge lets us apply the rules on CCTree terms iteratively.

Theorem 6.36. A unified term represents a CCTree term, or it is transformable to a CCTree structure, if and only if it can be written in the following form:

F∗(τ) = Σ_i fi · τi (6.45)

such that W(F∗(τ)) = True, i.e. the unified form of the received term is a well formed term, and the unified form of each τi is a well formed term as well (W(τi) = True) which respects the above formula.

Proof. First we show that a unified term obtained from a CCTree structure satisfies equation 6.45. In a CCTree, the attribute used for division at the root has the greatest number of occurrences in the non factorized CCTree term (all blocks of the CCTree term contain one of the features of this attribute). According to 6.37, for transforming the tree to a term, the first features of components are specified from δ(T) = {f | ∃ s′ ∈ Q s.t. (sT, f, s′) ∈ ω}, where in a CCTree all belong to the same sort, i.e. we have:

δ(T) = {f | ∃ s′ ∈ Q s.t. (sT, f, s′) ∈ ω} = sort({f}) ⇒ W(ψ(T)) = True

We call the tree following a child of the root a new tree. Each new tree is a CCTree by itself; hence, it respects 6.45. By considering the trees following the new tree as new trees themselves, the aforementioned process is repeated for all new trees, due to the iterative structure of CCTree, i.e. from 6.37, we have:

W(ψ(∂f(T))) = True ∀ f ∈ δ(T)

This means that if the input tree structure is a CCTree, then the obtained term respects the above formula.

On the other hand, a unified term that respects equation 6.45 can be converted to a CCTree structure. To this end, the τi are the components of τ with their first features (the fi) separated. The set of the first features of the components of the term constitutes the transitions of the first division from the root of the CCTree, i.e.:

Ω( Σ_{[τi] ∈ τ/∼1} δ([τi]) · Σ_{τk ∈ [τi]} ∂(τk) ) = ∪_{[τi] ∈ τ/∼1} {(S, δ([τi]), Σ_{τk ∈ [τi]} ∂(τk))}

where S is the main dataset the term originates from. Since the term is well formed, the labels of the children are guaranteed to belong to the same sort, as required by CCTree. Applying the same rule to the successive components, the structure of the CCTree is constructed iteratively. Note that the condition that the first features of the components equal a sort guarantees that, in the process of transforming the term to its equivalent tree structure, all the features of a selected attribute exist.

With the use of above theorem, we propose a rewriting system which is applied to automatically check if a term represents a CCTree term or not.

CCTree Rewriting System

To verify automatically whether a term is a CCTree term, a set of conditional rewriting rules is provided in Table 6.1. The term ∅ in this table refers to a null term. The CCTree rewriting system is applied to a received term; the term is a CCTree term if the only irreducible term is ∅.

In this rewriting system, ⟦f(τ)⟧ means that the semantics of f(τ) is substituted, whilst the result is considered as one unique term, not several terms. Furthermore, τ1 : τ2 contains two terms τ1 and τ2, whilst each one is considered as a new term. Moreover, [τ]i refers to the i’th component of τ/∼1.

(1) (τ ∈ A) | τ → ∅
(2) (τ ≠ F∗(τ)) | τ → ⟦F∗(τ)⟧
(3) (τ = F∗(τ)) ∧ (W(τ)) ∧ (τ ∉ A) | τ → ⟦Σ_{τk ∈ [τ]1} ∂(τk)⟧ : ... : ⟦Σ_{τk ∈ [τ]|τ/∼1|} ∂(τk)⟧

Table 6.1 – CCTree Rewriting System

The first rule of Table 6.1 specifies that if a term is an atomic term, it is rewritten to ∅. The second rule expresses that if a term is not in unified form, it must be transformed to its unified representation. The third rule specifies that if a non atomic unified term is well formed, it is divided into the derivatives of its components. The last rule is used to verify whether the CCTree conditions hold for the following components or not. These rules follow the structure of Theorem 6.36 in identifying whether a term is a CCTree term or not.

Example 6.37. Suppose that the term τ1 = a1  S + b1  S, with the set of attributes A = {a1, a2}, B = {b1, b2}, is given. We apply the CCTree rewriting rules to automatically verify whether τ1 is a CCTree term or not. The term τ1 is not atomic. Moreover, we have τ1 = F∗(τ1) and W(τ1) = False. There is no CCTree rewriting rule which can be applied, whilst this term is not ∅. This means that the received term τ1 is not a CCTree term.

Example 6.38. With the use of the CCTree rewriting system, we show that the term τ2 = a1  S + a2  S, with the set of attributes A = {a1, a2}, B = {b1, b2}, is a CCTree term.

(τ2 = F∗(τ2)) ∧ (W(τ2)) | a1  S + a2  S −(3)→ S : S −(1)→ ∅ : ∅

There is no irreducible term except ∅; hence, τ2 is a CCTree term.
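The three rules of Table 6.1 can be sketched as a recursive Python check. This is an illustration under stated assumptions rather than the thesis implementation: a term is a list of feature tuples over one dataset S, the empty tuple stands for S, an atomic term is taken to be a bare dataset, and ties between attributes are broken by dictionary order. All function names are hypothetical.

```python
def is_atomic(term):
    """Assumed reading of 'atomic': the term is just the dataset S."""
    return all(len(block) == 0 for block in term)


def divided_blocks(attribute, term):
    return sum(1 for block in term if any(f in attribute for f in block))


def unify(term, attributes):
    """F*: move attribute features to the front, greatest attribute last."""
    names = sorted(attributes, key=lambda n: divided_blocks(attributes[n], term))
    for name in names:
        values = attributes[name]
        term = [tuple(f for f in b if f in values) +
                tuple(f for f in b if f not in values) for b in term]
    return term


def components(term):
    groups = {}
    for block in term:
        groups.setdefault(block[0] if block else None, []).append(block)
    return groups


def well_formed(term, attributes):
    """W: the first features of the components form exactly one attribute."""
    firsts = set(components(term))
    return any(values == firsts for values in attributes.values())


def is_cctree_term(term, attributes):
    if is_atomic(term):                      # rule (1): rewrite to ∅
        return True
    term = unify(term, attributes)           # rule (2): use the unified form
    if not well_formed(term, attributes):    # stuck on a term other than ∅
        return False
    # rule (3): recurse on the derivative of every component
    return all(is_cctree_term([b[1:] for b in comp], attributes)
               for comp in components(term).values())


AB = {"A": {"a1", "a2"}, "B": {"b1", "b2"}}
CS = {"color": {"red", "blue"}, "size": {"small", "large"}}
```

On the terms of Examples 6.37 and 6.38 the check returns False and True respectively, and it accepts the CCTree term of Example 6.22 regardless of the order in which the features of each block are written.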

6.5.2 Termination and Confluent Rewriting System

In the present section, we first explain what termination and confluence of a rewriting system mean. Then, through several theorems, we prove that our proposed rewriting system is terminating and confluent. Termination and confluence are desirable properties of a rewriting system: they guarantee, first, that applying the rewriting rules of the proposed system never leads to an infinite sequence of rule applications and, second, that applying the rewriting rules always yields a unique result.

Termination and Confluence of Rewriting System

A rewriting system R is terminating if there is no infinite derivation a1 → a2 → a3 → ... in R. This implies that every derivation eventually ends in a normal form [43]. Lankford's theorem states that a rewriting system R is terminating if, for some reduction ordering >, x > y for all rules x → y ∈ R. An ordering is a reduction ordering if it is well-founded, monotonic, and fully invariant [43]. A relation is monotonic if it preserves the order when a term is added to or removed from both sides, and it is fully invariant if it preserves the order when a term is substituted in both sides of the relation [43].

An element a in the rewriting system R is locally confluent if for all b, c ∈ R such that a → b and a → c, there exists d ∈ R such that b →∗ d and c →∗ d. If every a ∈ R is locally confluent, then → is called locally confluent. Newman's lemma states that a terminating rewriting system is confluent if and only if it is locally confluent [43].
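For a finite rewriting relation, local confluence can be checked directly from its definition. The sketch below is a generic illustration and is not tied to the CCTree rules: the relation is assumed to be given as a dictionary mapping each element to the set of its one-step successors, and the names `reachable`, `is_locally_confluent` and `normal_forms` are hypothetical.

```python
def reachable(relation, start):
    """All elements reachable from start in zero or more rewriting steps."""
    seen, stack = {start}, [start]
    while stack:
        current = stack.pop()
        for successor in relation.get(current, ()):
            if successor not in seen:
                seen.add(successor)
                stack.append(successor)
    return seen


def is_locally_confluent(relation):
    """Check that any two one-step successors of an element have a common descendant."""
    for successors in relation.values():
        ordered = list(successors)
        for i, b in enumerate(ordered):
            for c in ordered[i + 1:]:
                if not (reachable(relation, b) & reachable(relation, c)):
                    return False
    return True


def normal_forms(relation, start):
    """Irreducible elements reachable from start."""
    return {x for x in reachable(relation, start) if not relation.get(x)}


joinable = {"a": {"b", "c"}, "b": {"d"}, "c": {"d"}, "d": set()}
diverging = {"a": {"b", "c"}, "b": set(), "c": set()}
print(is_locally_confluent(joinable), normal_forms(joinable, "a"))
print(is_locally_confluent(diverging), sorted(normal_forms(diverging, "a")))
```

In the first relation both branches from a meet again in d, so a has a single normal form; in the second relation the branches never meet and a has two normal forms.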

Theorem 6.39. The CCTree rewriting system is terminating.

Proof. To prove this theorem, we first define a reduction order on the rules of the CCTree rewriting system. To this end, we define the size function, which takes an FC-term and returns the number of features appearing in the term, as follows:

size : FC → N
size(S) = 1
size(f  S) = 1
size(F · τ) = |F| + size(τ)
size(τ1 + τ2) = size(τ1) + size(τ2)

where we consider size(∅) = 0 and size(τ1 : τ2) = size(τ1) + size(τ2). We say that FC-term τ1 is less than FC-term τ2, denoted by τ1 ≤ τ2, if the number of features in τ1 is at most the number of features in τ2, or equivalently size(τ1) ≤ size(τ2). This partial ordering is well-founded, since there is no infinite descending chain (the number of features is finite). It is monotonic, because the order on the number of features of two terms is preserved when a term is added to or removed from both sides. Furthermore, substitution in the left and right sides preserves the order on the number of features, i.e. it is fully invariant. Therefore, the proposed ordering is a reduction ordering.

Considering that ∅ is a null term containing no feature, in the first rule we have atomic term > ∅. In the second rule, the conditional rule is applied only when the term is not equal to its unified form; the ordered unification function, when applied, does not change the number of features, i.e. τ ≥ F∗(τ) for τ ≠ F∗(τ), since size(τ) = size(F∗(τ)). It is worth noting that this rule is a one-step rule: once the term is unified, only the other rules are applicable. In the third rule, the first features of all components of the left-hand term are removed, i.e. the size (number of features) of the left-hand term is greater than the size of the right-hand one. Hence, the proposed reduction ordering ≤ on the CCTree rewriting system shows that the system is terminating.
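The size function used in the proof can be written as a short recursive function. The sketch below is illustrative only: the classes `Cluster`, `Sat`, `Prod` and `Sum` are hypothetical names for the four term shapes that appear in the definition of size, and the sample term is the one of Example 6.42.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Cluster:  # an atomic term S
    name: str


@dataclass(frozen=True)
class Sat:  # a single feature f applied to a cluster S
    feature: str
    cluster: Cluster


@dataclass(frozen=True)
class Prod:  # a set of features F multiplied by a term
    features: Tuple[str, ...]
    term: "Term"


@dataclass(frozen=True)
class Sum:  # the addition of two terms
    left: "Term"
    right: "Term"


Term = Union[Cluster, Sat, Prod, Sum]


def size(term):
    """Follow the four defining equations of the size function."""
    if isinstance(term, (Cluster, Sat)):
        return 1
    if isinstance(term, Prod):
        return len(term.features) + size(term.term)
    if isinstance(term, Sum):
        return size(term.left) + size(term.right)
    raise TypeError(f"not an FC-term: {term!r}")


S = Cluster("S")
example = Sum(
    Sum(Prod(("r",), Sat("s", S)), Prod(("r",), Sat("c", S))),
    Prod(("b",), Sat("s", S)),
)
print(size(example))  # 6
```

Each of the three blocks contributes |F| + 1 = 2, which gives the total of 6.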

Theorem 6.40. The CCTree rewriting system is locally confluent.

Proof. In the CCTree rewriting system, all rules are conditional and there is no term for which two (or more) conditions are satisfied at the same time. This means that the case τ → τ1 and τ → τ2 with τ1 ≠ τ2 cannot occur. Hence, the rewriting system is locally confluent.

Theorem 6.41. The CCTree rewriting system is confluent.

Proof. According to Newman's lemma, the CCTree rewriting system, being terminating (Theorem 6.39) and locally confluent (Theorem 6.40), is confluent.

6.6 CCTrees Parallelism

It is not uncommon for a data mining process to require several days or weeks to complete. Parallel computing systems bring significant benefits, notably high performance, when working with massive databases [33]. Parallel clustering is a methodology proposed to alleviate the problem of time and memory usage in clustering large amounts of data [94], [18].

SPMD (Single Program Multiple Data) parallelism is the most common approach in parallel computation [135]. In an SPMD parallel algorithm, multiple computers run the same algorithm on different subsets of the data and exchange the partial results, which are merged into a final result.

In the present work, we propose SPMD parallelism of CCTrees in terms of a rewriting system. To this end, a large amount of data to be clustered is divided among two (or more) parallel computers, where each computer clusters the received dataset with the CCTree algorithm. The result of each CCTree is transformed to its equivalent CCTree term. The resulting CCTree terms are reported to a master computer for composition. The CCTree terms are composed automatically based on our proposed composition rewriting rules (Table 6.2). The composition result is reported to each computer to homogenize all the CCTree terms, and consequently the structure of all CCTrees (Figure 6.2).

Figure 6.2 – Parallel Clustering Workflow.

Obtaining a CCTree term from the composition of the received terms provides two advantages. First, the process of parallelism can be continued iteratively. Second, it explains how the sets of clusters resulting from two (or more) CCTrees can be merged.

To address the composition process, a set of composition rewriting rules (Table 6.2) is proposed to automatically obtain a CCTree term when a term is not a CCTree term.
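The SPMD workflow described in this section can be sketched in a few lines. This is a simplified illustration under explicit assumptions: `local_cluster` is a hypothetical stand-in that merely groups records by their feature tuple instead of running CCTree, the master-side merge adds up counts instead of applying the composition rewriting rules, and threads replace separate computers only to keep the sketch short.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def local_cluster(records):
    """Stand-in for the CCTree run on one device: group records by feature tuple."""
    return Counter(records)


def parallel_cluster(records, devices=2):
    """Split the data, run the same program on every part, merge at the master."""
    chunks = [records[i::devices] for i in range(devices)]
    with ThreadPoolExecutor(max_workers=devices) as pool:
        partial_results = list(pool.map(local_cluster, chunks))
    merged = Counter()
    for partial in partial_results:
        merged.update(partial)  # master-side merge of the partial results
    return partial_results, merged


records = [("r", "s"), ("b", "s"), ("r", "s"), ("r", "c"), ("b", "s")]
partial_results, merged = parallel_cluster(records, devices=2)
print(merged[("r", "s")])  # 2
```

The point of the sketch is the shape of the computation: every device runs the same function on its own share of the data, and only the compact partial results travel to the master.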

The split relation, the fourth rule of Table 6.2, is added to the rules of Table 6.1 to obtain a CCTree term from a non CCTree term.

Definition 6.27 (Split). Let a unified term τ ∈ FC ↑ on (V, <) and the set of attributes A be given. Considering uA,τ as the upper bound attribute of τ, we define the split relation as follows:

\[
split(\tau) =
\begin{cases}
\tau & \text{if } W(\tau) = True \\
\sum_{\tau_i \in block(\tau)} \zeta(\tau_i) & \text{if } W(\tau) = False
\end{cases}
\]

where:

\[
\zeta(\tau_i) =
\begin{cases}
\tau_i & \text{if } DA(A, \tau_i) = True \\
\left(\sum_{a_i \in u_{A,\tau}} a_i\right) \cdot \tau_i & \text{if } DA(A, \tau_i) = False
\end{cases}
\]

This means that every block of τ which does not contain any feature of uA,τ is multiplied by the sum of the features of uA,τ.

In the following examples we show how split relation is applied.

Example 6.42. Let us consider the term τ1 = r · s  S + r · c  S + b · s  S. We have W(r · s  S + r · c  S + b · s  S) = True, i.e. τ1 is a well formed term, which results in:

split(r · s  S + r · c  S + b · s  S) = r · s  S + r · c  S + b · s  S

Example 6.43. Suppose the term τ2 = r · s  S + c  S + b  S is given. We have :

W(r · s  S + c  S + b  S) = False

hence, τ2 is not a well formed term. Considering uA,τ2 = color we have:

DA(color, r · s  S) = True

DA(color, c  S) = False

DA(color, b  S) = True

which results in:

split(r · s  S + c  S + b  S) = r · s  S + (r + b) · c  S + b  S = r · s  S + r · c  S + b · c  S + b  S
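The two split examples can be reproduced with a short sketch. It rests on assumptions that are not part of Definition 6.27: a term is a list of blocks, each block a tuple of features with the cluster left implicit; the features of the upper bound attribute are passed in by the caller; and well-formedness is assumed to mean that all blocks start with a feature of one common attribute. The names `is_well_formed`, `split` and `attribute_of` are hypothetical.

```python
def is_well_formed(blocks, attribute_of):
    """Assumed test: all blocks are non-empty and start with features of one attribute."""
    if any(len(block) == 0 for block in blocks):
        return False
    return len({attribute_of[block[0]] for block in blocks}) == 1


def split(blocks, upper_bound_features, attribute_of):
    """Multiply every block lacking an upper-bound feature by the sum of those features."""
    if is_well_formed(blocks, attribute_of):
        return list(blocks)
    result = []
    for block in blocks:
        if any(feature in upper_bound_features for feature in block):
            result.append(block)  # the block already carries an upper-bound feature
        else:
            # one copy of the block per upper-bound feature
            result.extend((feature,) + block for feature in upper_bound_features)
    return result


attribute_of = {"r": "color", "b": "color", "s": "shape", "c": "shape"}
color = ("r", "b")
tau1 = [("r", "s"), ("r", "c"), ("b", "s")]  # the term of Example 6.42
tau2 = [("r", "s"), ("c",), ("b",)]          # the term of Example 6.43
print(split(tau1, color, attribute_of))
print(split(tau2, color, attribute_of))
```

The first call returns the term unchanged, and the second returns the four blocks r·s, r·c, b·c and b, matching the result of Example 6.43.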

It is worth noting that when a term is not a CCTree term, this can be inferred from its unified form when the first features of its components do not belong to the same attribute. Therefore, the split rule is proposed to create a well formed term from a non CCTree term.

In what follows, we add the split rule to the previous rewriting system, which is used when a term is not a CCTree term to obtain a CCTree term.

6.6.1 Composition Rules

The composition rewriting rules used to obtain a CCTree term from a non CCTree term are presented in Table 6.2. In the proposed rewriting system, ⟦f(τ)⟧ means that f(τ) is replaced by its semantics, and the result is considered as one unique term, not several terms. Furthermore, τ1 : τ2 contains two terms τ1 and τ2, each of which is considered as a new term. Moreover, [τ]i refers to the i'th component of τ/∼1.

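\[
\begin{array}{ll}
(1) & (\tau \in A) \;\mid\; \tau \to \emptyset \\
(2) & (\tau \neq F^{*}(\tau)) \;\mid\; \tau \to \llbracket F^{*}(\tau) \rrbracket \\
(3) & (\tau = F^{*}(\tau)) \wedge (W(\tau)) \wedge (\tau \notin A) \;\mid\; \tau \to \llbracket \sum_{\tau_k \in [\tau]_1} \partial(\tau_k) \rrbracket : \dots : \llbracket \sum_{\tau_k \in [\tau]_{|\tau/\sim_1|}} \partial(\tau_k) \rrbracket \\
(4) & (\tau = F^{*}(\tau)) \wedge (\sim W(\tau)) \;\mid\; \tau \to \llbracket split(\tau) \rrbracket
\end{array}
\]

Table 6.2 – Composition Rewriting System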

Compared to Table 6.1, only the fourth rule (the split rule) is added. This rule specifies how, when a term is not a CCTree term, a CCTree term may be obtained by splitting the term based on the upper bound attribute.

6.6.2 CCTree Term From Composition Rewriting Rules

Here we briefly explain how to find a CCTree term from a non CCTree term with the use of the composition rewriting system. First of all, the set of attributes A describing the received term τ is provided. Note that in a categorical clustering algorithm, the set of attributes is known beforehand. The set of attributes and the non CCTree term are given to the composition rewriting system. When the conditions of the rule $(\tau = F^{*}(\tau)) \wedge (W(\tau)) \mid \tau \to \llbracket \sum_{\tau_k \in [\tau]_1} \partial(\tau_k) \rrbracket : \dots : \llbracket \sum_{\tau_k \in [\tau]_{|\tau/\sim_1|}} \partial(\tau_k) \rrbracket$ hold for a term τ, we save τ. Then every $\llbracket \sum_{\tau_k \in [\tau]_i} \partial(\tau_k) \rrbracket$ of τ is replaced by its own successive term respecting this rule. This process is repeated iteratively until an atomic term is reached in all components of the term. The result of this process is the desired CCTree term.

Example 6.44. Suppose that the addition of two CCTree terms is given as τ = a1  S + a2  S + b1  S′ + b2  S′, with the set of attributes A = {a1, a2}, B = {b1, b2}. It is easy to verify from the rules of Table 6.1 that τ is not a CCTree term. We are interested in finding a CCTree term from the received non CCTree term τ with the use of the composition rewriting system. To this end we have:

(i) (τ = F∗(τ)) ∧ (∼W(τ)) | τ −(4)→ ⟦split(τ)⟧

(ii) ⟦split(τ)⟧ = τ′ = a1  S + a2  S + (a1 + a2) · b1  S′ + (a1 + a2) · b2  S′

(iii) (τ′ ≠ F∗(τ′)) | τ′ −(2)→ ⟦F∗(τ′)⟧ = a1 · (S + b1  S′) + a2 · (S + b1  S′) = τ″

(iv) (τ″ = F∗(τ″)) ∧ (W(τ″)) | τ″ −∗(3)∗→ S + b1  S′ (I) : S + b1  S′ (II)

(I) S + b1  S′ −(4)→ (b1 + b2) · S + b1  S′ −(2)→ b1 · (S + S′) + b2  S −∗(3)∗→ S + S′ : S −(1)→ ∅ : ∅

(II) S + b1  S′ −(4)→ (b1 + b2) · S + b1  S′ −(2)→ b1 · (S + S′) + b2 · S −∗(3)∗→ S + S′ : S −(1)→ ∅ : ∅

To find the resulted CCTree term, we consider the terms respecting the rule (3), shown with ∗(3)∗. Hence, we have them as follows :

(∗) a1 · (S + b1  S′) + a2 · (S + b1  S′)
(∗∗) b1 · (S + S′) + b2  S
(∗∗∗) b1 · (S + S′) + b2 · S

Then, since (∗∗) results from the first term S + b1  S′ inside (∗), and (∗∗∗) from the second term S + b1  S′ inside (∗), we replace these terms by (∗∗) and (∗∗∗), respectively:

a1 · (b1 · (S + S′) + b2  S) + a2 · (b1 · (S + S′) + b2 · S)

Since there is no more term respecting rule (3), the above term is the desired CCTree term. It is easy to automatically verify that the resulted term is a CCTree term according to Table 6.1.
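The combined effect of splitting, unifying and taking derivatives can be sketched as one recursive function. This is a hedged illustration and not the rewriting system itself: a term is a list of (feature set, cluster name) pairs, the upper bound attribute is assumed to be the first attribute, in a caller-supplied order, that is used by some block, and the result is returned as a nested dictionary rather than as a term. The function name `compose` and the color/shape input are hypothetical.

```python
def compose(blocks, attributes):
    """Build a nested tree from blocks given as (set of features, cluster name) pairs."""
    # Leaf: no feature is left in any block, so only cluster names remain.
    if all(not features for features, _ in blocks):
        return sorted({cluster for _, cluster in blocks})
    # Assumption: the upper bound attribute is the first listed attribute in use.
    for index, (_, values) in enumerate(attributes):
        if any(features & set(values) for features, _ in blocks):
            break
    else:
        raise ValueError("a block uses a feature of no listed attribute")
    tree = {}
    for value in values:
        children = []
        for features, cluster in blocks:
            own = features & set(values)
            if not own or value in own:
                # Blocks without this attribute are copied under every value (split);
                # the chosen feature is removed before recursing (derivative).
                children.append((features - {value}, cluster))
        if children:
            tree[value] = compose(children, attributes[index + 1:])
    return tree


attributes = [("color", ("r", "b")), ("shape", ("s", "c"))]
# Hypothetical input: one tree split on color over S, one tree split on shape over S'.
blocks = [({"r"}, "S"), ({"b"}, "S"), ({"s"}, "S'"), ({"c"}, "S'")]
print(compose(blocks, attributes))
```

For this input every color branch contains both shape branches, and every leaf lists the two clusters S and S' whose elements end up in it.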

6.6.3 CCTree Homogenization

After the final CCTree term, resulting from the composition of two (or more) CCTree terms, is returned to the parallel devices, the CCTree term of each computer has to be extended to the final CCTree term. Extending each CCTree term to the final CCTree term homogenizes the structure of all CCTrees. To this end, it is enough to add the CCTree term to the final CCTree term. Then, all split rules applied to the CCTree term in the process of its composition with the final CCTree term show the splits required in the associated CCTree structure, following the procedure for transforming a term to a tree provided in Section 6.4.2.

Note. It is worth noting that after homogenizing all the CCTrees to the final CCTree, the data respecting the same set of features go to the same cluster of the final CCTree. However, merging many data points from different clusters of different CCTrees into one cluster may cause the final nodes not to respect the required purity. To solve this issue, after merging the data, the purity of each final node should be computed; if a node is not pure enough, it must be split according to the CCTree construction rules.

Theorem 6.45. The composition rewriting system is terminating.

Proof. The only rule added to the composition rewriting system, compared to the CCTree rewriting system, is the split rule. We show that the split rule does not contradict the termination of the rewriting system. First of all, the split rule is a one-step rule, i.e. the result of the split rule, after one application, is taken as the premise of the other rules (which decrease the term). Moreover, on each term, the split rule is applied at most as many times as the number of attributes, which is finite. Hence, since split is itself a one-step rule, and for each term it is called finitely many times, the composition rewriting system is terminating.

Theorem 6.46. The composition rewriting system is locally confluent.

Proof. There is no term that satisfies two (or more) conditions of the composition rewriting system at the same time, i.e. there is no term τ for which τ → τ1 and τ → τ2, where τ1 ≠ τ2. This means that the composition rewriting system is locally confluent.

Theorem 6.47. The composition rewriting system is confluent.

Proof. From Theorems 6.45 and 6.46, the composition rewriting system is terminating and locally confluent, respectively. Hence, by Newman's lemma, the composition rewriting system is confluent.

6.6.4 Time Complexity

Here we present a theorem which gives the time complexity of constructing several CCTrees on parallel devices.

Theorem 6.48. Let n be the total number of elements to be clustered, m be the total number of features, r be the number of attributes, vmax be the maximum number of values in an attribute, and K be the maximum number of non leaf nodes. The time complexity of constructing CCTrees in t parallel devices equals:

\[
\frac{1}{t} \cdot O\big(K \times (n \times m + n \times v_{max})\big)
\]

Proof. In Section 3.5, we explained how to calculate the time complexity of constructing a CCTree. To recall, let n be the number of elements in the whole dataset, ni the number of elements in node i, m the total number of features, vl the number of features of attribute Al, r the number of attributes, and vmax = max{vl} (1 ≤ l ≤ r). For constructing a CCTree, if K = m + 1 is the maximum number of non leaf nodes, which arises in a complete tree, then the maximum time required for constructing a CCTree with n elements equals O(K × (n × m + n × vmax)). Now, if we equally divide the dataset containing n points among t devices, it takes O(K × ((n/t) × m + (n/t) × vmax)) = (1/t) · O(K × (n × m + n × vmax)) to create t CCTrees, i.e. the whole required time is divided by the number of devices. The remaining algebraic calculations require constant time.
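The bound in the proof can be evaluated numerically to see the effect of the number of devices. The sketch below only evaluates the expression inside the O(·) with K = m + 1; the function names and the sample values of m and vmax are hypothetical, and the result is an operation count up to a constant factor, not a running time.

```python
def cctree_cost_bound(n, m, v_max):
    """Expression inside the O(.) bound for one CCTree, with K = m + 1."""
    k = m + 1
    return k * (n * m + n * v_max)


def parallel_cost_bound(n, m, v_max, t):
    """The same expression when each of the t devices receives n / t elements."""
    return cctree_cost_bound(n / t, m, v_max)


n, m, v_max = 200_000, 50, 5  # m and v_max are illustrative values only
sequential = cctree_cost_bound(n, m, v_max)
for t in (1, 2, 4):
    print(t, parallel_cost_bound(n, m, v_max, t) / sequential)
```

The printed ratios are 1, 1/2 and 1/4: in this model the bound scales inversely with the number of devices.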

6.7 Conclusion

In this chapter, a semiring-based formal method, named Feature-Cluster Algebra, is proposed to abstract the representation of a categorical clustering algorithm, named CCTree. Abstraction constructs a brief sketch of the original representation of a problem so that it is easier to deal with. More precisely, abstraction is the process of mapping a representation of a problem, called the ground (semantic) representation, onto a new representation, called the abstract (syntax) representation, in such a way that it is possible to deal with the problem in the original space while preserving certain desirable properties, and in a simpler way, since the abstract representation is constructed from the ground representation by removing unwanted detail. The abstraction process is performed with the use of a powerful algebraic structure, named semiring. Through several theorems and examples, we show that the proposed approach, under some conditions, fully abstracts the CCTree structure. The full abstraction property guarantees that the semantic and syntax forms of a problem can be used interchangeably, whilst preserving the required properties.

Furthermore, we presented a set of functions and relations on feature-cluster algebra, which is used to present the CCTree schema in general. We provided a rewriting system which automatically identifies whether a term represents a CCTree or not.

The CCTree abstract representation is used in CCTree parallel clustering. Generally, the process of clustering requires time and space, especially when a large amount of data is to be analyzed. The problem of time and precision in clustering becomes more challenging in security issues, where fast and precise analysis is required to find strategies against intruders. We proposed a rewriting system which automatically returns a CCTree term to which all CCTrees in parallel devices can be generalized. The termination and confluence of the proposed rewriting system have been proved, which guarantees, first, that there is no loop in applying the proposed rewriting system and, moreover, that the resulting final term is unique.

To the best of our knowledge, the technique proposed in this chapter is a novel methodology for applying an algebraic structure to formalize a clustering algorithm representation and address the associated issues. The proposed approach can be extended to other feature-based clustering and classification algorithms.

Chapter 7

Conclusions and Future Work

In this chapter, we first summarize what we presented in this work, and afterwards we present future directions for continuing the present study.

7.1 Thesis Summary

Current strategies to minimize the impact of spam messages mostly focus on preventing spam messages from being delivered to the end user's inbox. This kind of analysis, although quite effective in decreasing the cost of spam emails, does not stop spammers, who still impose non-negligible costs on users and companies. The reason could be that the spammer, the root of the problem, faces minimal risk of being pursued, whilst he can send millions of messages in a short period of time at minimal expense. Therefore, analyzing spammer behavior in order to find counter-strategies, and possibly to prosecute the spammer, becomes an important issue in spam forensics. However, such an effort first requires the analysis of a huge amount of spam messages, collected in a short period of time in honeypots, whose volume keeps growing within minutes.

To address this issue, in this thesis we first proposed a categorical clustering algorithm, named CCTree, to group a large amount of spam messages into smaller groups, based on structural similarity. CCTree has a tree-like structure, where the root node of the tree contains all spam messages. The CCTree divides spam messages step by step, grouping together similar data and obtaining homogeneous subsets of data points. The similarity of the clustered data points at each step of the algorithm is measured by an index called node purity. If the level of purity is not sufficient, the spam messages belonging to this node are not sufficiently homogeneous, and they should be divided into different subsets (nodes) based on the characteristic (attribute) that yields the highest value of entropy. The rationale behind this choice is that dividing data on the basis of the attribute which yields the greatest entropy helps in creating more homogeneous subsets, where the overall value of entropy is consistently reduced. This approach aims at reducing the time needed to obtain homogeneous subsets. The division process of non homogeneous sets of data points is repeated iteratively until all sets are sufficiently pure or the number of elements belonging to a node is less than a specific threshold set by the user. These pure sets are the leaves of the tree and represent the desired spam campaigns.
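The construction summarized above can be sketched compactly. The code below is a minimal sketch under stated assumptions and not the implementation used in this work: node purity is taken as the weighted sum of the Shannon entropies of the attributes (so a lower value means a more homogeneous node), uniform weights are assumed by default, and the function and parameter names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def shannon_entropy(values):
    """Shannon entropy (base 2) of the values of one attribute inside a node."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())


def node_purity(rows, weights=None):
    """Weighted sum of attribute entropies and the index of the highest-entropy attribute."""
    n_attr = len(rows[0])
    weights = weights or [1.0 / n_attr] * n_attr
    entropies = [shannon_entropy([row[i] for row in rows]) for i in range(n_attr)]
    purity = sum(w * e for w, e in zip(weights, entropies))
    return purity, max(range(n_attr), key=entropies.__getitem__)


def cctree(rows, purity_threshold, min_elements):
    """Split nodes on their highest-entropy attribute until they are pure or small."""
    if not rows:
        return []
    leaves, pending = [], [rows]
    while pending:
        node = pending.pop()
        purity, attribute = node_purity(node)
        children = defaultdict(list)
        for row in node:
            children[row[attribute]].append(row)
        if purity > purity_threshold and len(node) > min_elements and len(children) > 1:
            pending.extend(children.values())
        else:
            leaves.append(node)
    return leaves


rows = [("r", "s"), ("r", "s"), ("r", "c"), ("b", "s")]
leaves = cctree(rows, purity_threshold=0.1, min_elements=1)
print(sorted(len(leaf) for leaf in leaves))  # [1, 1, 2]
```

On this toy input the root is split on the first attribute, the larger child is split again on the second attribute, and three leaves remain.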

To apply CCTree in clustering a large amount of spam emails into spam campaigns, we provided a set of 21 categorical features representative of email structure. Then, through analysis of 200k spam emails, we proposed and validated a methodology to choose the optimal CCTree parameters based on detection of the maximum curvature point (knee) on a homogeneity versus number of clusters graph. We showed the effectiveness of CCTree in spam campaign detection through internal evaluation, to estimate its ability to obtain homogeneous clusters, and through external evaluation, to estimate its ability to effectively classify similar elements (emails) when classes are known beforehand. The efficiency of CCTree has been shown through comparison with one of the fastest well-known categorical clustering algorithms.

We proposed a framework, named Digital Waste Sorter (DWS), which exploits a self-learning approach based on the goal of the spammer for spam email classification. The proposed approach aims at automatically classifying a large amount of raw unclassified spam emails, dividing them into campaigns and labeling each campaign with its spammer goal. To this end, we proposed five class labels to group spammer goals in five macro-groups, namely Advertisement, Portal Redirection, Advanced Fee Fraud, Malware Distribution and Phishing. Moreover, a set of 21 categorical features representative of email structure is proposed to perform a multi-feature analysis aimed at identifying emails related to a large range of cybercrimes. DWS is based on the cooperation of unsupervised and supervised learning algorithms. Given a set of classes describing different spammer goals and a dataset of unclassified spam emails, the proposed approach first automatically creates a valid training set for a classifier by exploiting CCTree. DWS is built on the result of CCTree, which is effective in dividing spam emails into homogeneous clusters. Afterwards, significant spam campaigns useful for the generation of the training set are selected through similarity with a small set of known emails, representative of each spam class. A classifier is then trained using the selected campaigns as training set, and is used to classify the remaining unclassified emails of the dataset. Furthermore, we proposed six features, including the label of campaigns discovered with DWS, to automatically rank a set of spam campaigns according to investigator priorities.

Finally, to abstract the CCTree representation, we proposed a semiring-based approach, named feature-cluster algebra. Several relations and functions are defined on the abstract schema of CCTree, named CCTree term. The concept of CCTree term is applied in the formalization of CCTree parallelism, which is expressed in terms of a rewriting system. Clustering parallelism can be used to speed up the process of grouping a large amount of data on parallel devices.

To summarize, what we proposed in this thesis can be used as a tool for cybercrime investigators to automatically organize a huge amount of spam messages in a short period of time. This tool provides the investigator with a prioritization of the most dangerous spammers to be pursued, through the best ranked spam campaigns.

7.2 Future work

This thesis can be extended in several directions. In what follows, we present the extensions we plan to pursue.

The technique that we proposed in this thesis can be applied as a tool for fast automatic detection of the most dangerous spam campaigns. To show the efficiency and effectiveness of our proposed approach, we plan to apply it to a huge amount of spam messages containing one of the most dangerous current spam campaigns, e.g. the CryptoWall 3.0 malware. We plan to show that our approach detects it automatically among other campaigns.

To speed up the process of clustering spam emails into campaigns, we expect to apply several sampling algorithms. In statistics, sampling is concerned with the selection of a subset of elements that preserves the statistical properties of the dataset, and it is applied to estimate characteristics of the whole population. In the context of spam messages, since we always encounter a large amount of data, finding the best strategy for sampling data from the whole dataset, one which preserves its main characteristics, may help to speed up the analysis.

Furthermore, we plan to apply the proposed methodology to detecting, labeling, and ranking social spam campaigns, e.g. on Facebook or Twitter. To this end, first of all, the representative features of social spam campaigns should be identified. Afterwards, the most popular cybercrimes in social networks should be characterized as the labels of discovered spam campaigns to train a classifier. Finally, the ranking features need to be identified to order the set of social spam campaigns.

Another area of research in which we are interested in applying our proposed methodology is botnet detection and finding the botmaster, the root of the problem. Although many efforts have been made to prosecute the botmaster through the botnet, we expect our proposed approach to work well in botnet detection through precise spam campaign detection, and consequently to help catch the spammer. The reason is that we believe the proposed mechanism is able to precisely identify the zombies (bots) controlled by the same spammer (botmaster).

On the formalization side, there are many directions in which to extend our proposed approach, since it is among the very first efforts to apply formal methods to clustering algorithms. First, we plan to extend the idea of using a semiring to abstract the representation of other well-known categorical clustering algorithms. Then, we plan to apply the abstract schema to concepts related to feature analysis, parallel clustering, etc. Furthermore, we plan to apply further properties of semirings to address more issues in categorical clustering algorithms. For example, semiring homomorphisms can be applied to automatically identify whether two categorical clusterings are identical or not.

Table 7.1 – Table of Notations

 , Node purity
µ , Minimum number of elements in a node
A , The set of sorts (attributes)
VA , The carrier set of sort A
V , The union set of carrier sets of A
sort , A function which returns a set of carrier sets of received features
F , The power set of the power set of V
F1 , A subset of F in which each set contains just one element
S , The set of records (elements)
 , Satisfaction relation
F  S , The set of elements of S that satisfy the set of features F
FC , Set of feature-cluster terms
A , Atomic terms
block , A function which returns a set of feature-cluster terms
≡ , FC-term comparison
C , The set of terms
A −→ , Factorization rewriting rule
−→d , Defactorization rewriting rule
FC ↓ , The set of factorized FC-terms
FC ↑ , The set of non factorized FC-terms
(Σ, Q, δ) , Graph structure
[[.]] , A function which returns a tree from a received feature-cluster family term
Ψ , A function which returns a feature-cluster family term from a received forest (tree)
GV,FC , The set of all possible forests on the set of edge labels V and node labels FC
≈ , Ordered FC-terms comparison
DA , Attribute division function
δ , Initial function
B ≺τ A , Attribute B is smaller than attribute A
k F , Ordered unification function
∂ , Derivative function
[τ]i , The i'th component of τ
W(τ) , Well formed term
F∗(A, τ) , Unified term
split(τ) , Split function

Appendix A


A.1 Source Codes of Proposed Approach

In what follows, some important source code used in CCTree construction, labeling, and evaluation is provided.

Shannon entropy

% entropy = shannon_entropy(attribute_vals)
% INPUT:
%   attribute_vals: [1*N] INTEGER
%     Vector with the values of one attribute inside a cluster
% OUTPUT:
%   entropy: [1*1] DOUBLE
%     The entropy for the specific attribute
function entropy = shannon_entropy(attribute_vals)
    ordered_vect = sort(attribute_vals);  % order the array to group the different values of the attribute
    vector_size = size(attribute_vals);

    dim = [];  % number of occurrences of each distinct value
    i = 0;
    while isempty(ordered_vect) == 0
        % find the number of elements for each attribute value in the vector
        i = i + 1;
        index = find(ordered_vect == ordered_vect(1));
        temp = size(index);
        dim(i) = temp(2);
        ordered_vect(index) = [];
    end

    entropy = 0;
    counter = size(dim);
    for j = 1:counter(2)
        % compute the entropy
        entropy = entropy - ((dim(j)/vector_size(2)) * log2(dim(j)/vector_size(2)));
    end

The Shannon Entropy of a Cluster:

function e = clustering_entropy(A, ci)
    num_clusters = size(A);
    num_clusters = num_clusters(2);
    result = 0;
    for i = 1:num_clusters
        if not(isempty(A{i}))
            num_cols_ai = size(A{i});
            num_cols_ai = num_cols_ai(2);
            vect_ai = A{i}(:, num_cols_ai)';
            num_cols_ci = size(ci);
            num_els_ci = num_cols_ci(1);
            num_cols_ci = num_cols_ci(2);
            vect_ci = ci(:, num_cols_ci - 1)';
            intersection = intersect(vect_ai, vect_ci);
            dim = size(intersection);
            dim = dim(2);
            if (dim ~= 0)
                result = result + (dim/num_els_ci) * log(dim/num_els_ci);
            end
        end
    end
    e = -result;
end

Node purity

function [np, max_entropy_attribute] = node_purity(data, weight)
    n_attr = size(data);
    n_attr = n_attr(2) - 3;
    if nargin < 2
        weight = ones(1, n_attr) * 1/n_attr;
    end
    np = 0;
    max_entropy = 0;
    max_entropy_attribute = 1;
    for i = 1:n_attr-1
        temp_entropy = shannon_entropy(data(:, i)');
        if temp_entropy > max_entropy
            max_entropy = temp_entropy;
            max_entropy_attribute = i;
        end
        np = np + weight(i) * temp_entropy;
    end

CCTree function :

function [clusters , labels] = CCTree (data , node_purity_threshold , max_num_elem) t i c num_elem = size(data); num_elem = num_elem(1); associate_vector = 1:num_elem; associate_vector = associate_vector ’; %count the email lines data = [data,associate_vector ]; level = 0; %initialize data structures nodes_per_level = {}; nodes_next_level = {}; all_nodes ={}; l e a v e s = {} ;

[current_node_purity , current_attribute] =node_purity(data); %compute node purity of the whole dataset num_elem_curr_node = size (data);

num_elem_curr_node = num_elem_curr_node(1);   % number of elements in the current node

% Split the root only if it is NOT pure AND holds too many elements
% (a purity value above the threshold means the node is not pure enough).
if current_node_purity > node_purity_threshold && num_elem_curr_node > max_num_elem
    [nodes_per_level, labels] = CCTreeSplit(data, current_attribute);   % nodes_per_level contains the various clusters
    level = 1;
else
    clusters = data;
    labels = [];
    return;
end

while 1
    num_nodes_curr_level = size(nodes_per_level);
    num_nodes_curr_level = num_nodes_curr_level(2);
    new_level = 0;   % boolean to check if there is a new level
    pd3 = nodes_per_level;

    for i = 1:num_nodes_curr_level   % for all nodes in this level
        temp_node = nodes_per_level{i};   % extract a cluster
        num_elem_curr_node = size(temp_node);
        num_elem_curr_node = num_elem_curr_node(1);
        [current_node_purity, current_attribute] = node_purity(temp_node);   % compute purity

        if current_node_purity > node_purity_threshold && num_elem_curr_node > max_num_elem
            % node is NOT pure AND has too many elements: split it
            [temp_cell_array, temp_label] = CCTreeSplit(temp_node, current_attribute);   % split and assign the new clusters to a temp variable
            nodes_next_level = [nodes_next_level, temp_cell_array];   % add the new nodes to a deeper level
            new_level = 1;
        else
            % node is pure OR has too few elements: create a leaf
            leaves = [leaves; temp_node];   % add it to the leaf collection
        end
    end

    clusters = leaves;   % assign the leaves to the results
    all_nodes = [all_nodes nodes_per_level];
    nodes_per_level = nodes_next_level;   % next level becomes current level
    nodes_next_level = {};

    if new_level == 0   % stop if all nodes are leaves
        break;
    end
    level = level + 1;
end
toc
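The level-by-level construction in the listing above can be illustrated with a short Python sketch. It is not the thesis implementation. It assumes that the purity of a node is the sum of the Shannon entropies of its attributes and that the split attribute is the one with the highest entropy. The function names (`attribute_entropy`, `node_purity`, `split`, `cctree`) are hypothetical.

```python
from collections import Counter
from math import log2


def attribute_entropy(rows, col):
    """Shannon entropy of the values found in column `col`."""
    counts = Counter(row[col] for row in rows)
    total = len(rows)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def node_purity(rows):
    """Assumed definition: summed entropy, plus the highest-entropy column."""
    entropies = [attribute_entropy(rows, col) for col in range(len(rows[0]))]
    best_col = max(range(len(entropies)), key=entropies.__getitem__)
    return sum(entropies), best_col


def split(rows, col):
    """Group the rows by their value in column `col`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[col], []).append(row)
    return list(groups.values())


def cctree(rows, purity_threshold, max_num_elem):
    """Return the leaves obtained by splitting level by level."""
    level, leaves = [rows], []
    while level:
        next_level = []
        for node in level:
            impurity, col = node_purity(node)
            children = split(node, col)
            # Split only if the node is NOT pure AND has too many elements;
            # the len(children) check guards against a split that changes nothing.
            if (impurity > purity_threshold and len(node) > max_num_elem
                    and len(children) > 1):
                next_level.extend(children)
            else:
                leaves.append(node)
        level = next_level
    return leaves


rows = [(1, 0), (1, 0), (2, 0), (2, 1)]
leaves = cctree(rows, purity_threshold=0.1, max_num_elem=1)
```

With these four rows, the root is split on the first attribute, and the second child is then split again on the second attribute, which yields three leaves.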

CCTree Labeling Function:

function M = CreateCCTreeLabelledMatrix(c)
    iter = size(c);
    iter = iter(1);
    M = [];
    for i = 1:iter
        numofelements = size(c{i});
        numofelements = numofelements(1);
        vect = i * ones(numofelements, 1);
        tempmat = c{i};
        tempmat = [tempmat, vect];
        if (i == 1)
            M = tempmat;
        else
            M = [M; tempmat];
        end
    end
end
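As an illustration of what the labeling function produces, the following Python sketch appends the 1-based cluster index to every row and stacks the clusters into one list. The name `create_labelled_rows` and the sample data are hypothetical.

```python
def create_labelled_rows(clusters):
    """Append the 1-based index of its cluster to every row."""
    return [
        list(row) + [label]
        for label, cluster in enumerate(clusters, start=1)
        for row in cluster
    ]


clusters = [[[3, 0], [3, 1]], [[5, 2]]]
labelled = create_labelled_rows(clusters)
# labelled == [[3, 0, 1], [3, 1, 1], [5, 2, 2]]
```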

Precision Cluster:

function p = precision_cluster(Ai, Cj)
    num_cols_ai = size(Ai);
    %num_el_ai = num_cols_ai(1);
    num_cols_ai = num_cols_ai(2);
    num_cols_cj = size(Cj);
    num_el_cj = num_cols_cj(1);
    num_cols_cj = num_cols_cj(2);
    vect_ai = Ai(:, num_cols_ai)';
    vect_cj = Cj(:, num_cols_cj - 1)';
    intersection = intersect(vect_ai, vect_cj);
    result = size(intersection);
    p = result(2) / num_el_cj;
end

Recall Cluster:

function r = recall_cluster(Ai, Cj)
    num_cols_ai = size(Ai);
    num_el_ai = num_cols_ai(1);
    num_cols_ai = num_cols_ai(2);
    num_cols_cj = size(Cj);
    num_cols_cj = num_cols_cj(2);
    vect_ai = Ai(:, num_cols_ai)';
    vect_cj = Cj(:, num_cols_cj - 1)';
    intersection = intersect(vect_ai, vect_cj);
    result = size(intersection);
    r = result(2) / num_el_ai;
end

Find Clusters by Purity:

function [index, purity] = FindClusterByPurity(data, leaves)
    num_of_leaves = size(leaves);
    num_of_leaves = num_of_leaves(1);
    tot_el = size(cell2mat(leaves));
    tot_el = tot_el(1);
    min_purity = Inf;
    index = -1;
    nattr_leaf = size(leaves{1});
    nattr_leaf = nattr_leaf(2);
    nattr_data = size(data);
    nattr_data = nattr_data(2);
    size_diff = nattr_leaf - nattr_data;
    data = [data, zeros(1, size_diff)];   %add two empty values to match the size of leaf
    for i = 1:num_of_leaves
        num_of_elements = size(leaves{i});
        num_of_elements = num_of_elements(1);
        if (num_of_elements > 1)   %do not consider nodes with a single element
            purity_old = node_purity_mod(leaves{i});
            purity_new = node_purity_mod([leaves{i}; data]);   %add data and compute new purity
            difference = (purity_new - purity_old);
            difference = difference * (num_of_elements);
            %do not consider node whose purity is increased
            if difference < min_purity
                min_purity = difference;
                index = i;
            end
        end
    end
    purity = min_purity;
end


function f = FMeasure_Clusters(Ai, c)
    results = 0;
    num_of_clusters = size(c);
    num_of_clusters = num_of_clusters(1);
    for i = 1:num_of_clusters
        op = 2 * precision_cluster(Ai, c{i}) * recall_cluster(Ai, c{i}) ...
            / (precision_cluster(Ai, c{i}) + recall_cluster(Ai, c{i}));
        results = max(results, op);
    end
    f = results;
end
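A small worked example may help to read the three evaluation functions above together. The Python sketch below follows the same definitions: precision divides the number of shared element identifiers by the cluster size, recall divides it by the class size, and the F-measure of a class is the best harmonic mean over all clusters. It assumes each element is represented only by its identifier. The function names and the sample identifiers are hypothetical.

```python
def precision(class_ids, cluster_ids):
    """Share of the cluster that belongs to the class."""
    return len(set(class_ids) & set(cluster_ids)) / len(cluster_ids)


def recall(class_ids, cluster_ids):
    """Share of the class that is found in the cluster."""
    return len(set(class_ids) & set(cluster_ids)) / len(class_ids)


def f_measure(class_ids, clusters):
    """Best harmonic mean of precision and recall over all clusters."""
    best = 0.0
    for cluster_ids in clusters:
        p = precision(class_ids, cluster_ids)
        r = recall(class_ids, cluster_ids)
        if p + r > 0:  # skip clusters that share nothing with the class
            best = max(best, 2 * p * r / (p + r))
    return best


class_ids = [1, 2, 3, 4]
clusters = [[1, 2, 3, 9], [4, 5]]
score = f_measure(class_ids, clusters)
```

For the first cluster, precision and recall are both 3/4, so its F-measure is 0.75. For the second cluster, precision is 1/2 and recall is 1/4, which gives 1/3. The class therefore scores 0.75.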

A.2 Tables of Attributes

The following tables present the set of features of each attribute used in the CCTree algorithm, together with the range of each feature. Each table represents one attribute: the first column lists the features of that attribute, and the second column shows the number assigned to each feature in the same row. The two binary attributes Linkwithat(@) and LinkswithnonASCIIcharacter are not presented in tables. For these two attributes, if the body of the spam message contains no link with (@) or no link with a non-ASCII character, we assign the number 0 to the message; otherwise, the assigned number is 1.
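The encoding of the two binary attributes can be sketched in Python as follows. This is only an illustration: it assumes the links have already been extracted from the message body as strings, and the function name `binary_link_attributes` is hypothetical.

```python
def binary_link_attributes(links):
    """Return (Linkwithat(@), LinkswithnonASCIIcharacter) as 0/1 values."""
    link_with_at = int(any("@" in link for link in links))
    link_with_non_ascii = int(any(not link.isascii() for link in links))
    return link_with_at, link_with_non_ascii


plain = binary_link_attributes(["http://example.com/a"])
suspicious = binary_link_attributes(
    ["http://user@example.com/", "http://exämple.com/"]
)
# plain == (0, 0) and suspicious == (1, 1)
```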

Table A.1 – Language of spam message and subject

| Language | Attributed Number |
|---|---|
| Unknown language | 0 |
| English language | 1 |
| Italian language | 2 |
| French language | 3 |
| German language | 4 |
| Spanish language | 5 |
| Chinese language | 6 |
| Arabic language | 7 |
| Persian language | 8 |
| Japanese language | 9 |
| Russian language | 10 |
| Croatian language | 11 |
| Portuguese language | 12 |
| Indian language | 13 |

Table A.2 – Type of Attachment

| Attachment Type | Attributed Number |
|---|---|
| None | 0 |
| PDF | 1 |
| EXEC | 2 |
| DOC | 3 |
| PIC | 4 |
| TXT | 5 |
| ZIP | 6 |
| Other | 7 |

Table A.3 – Attachment Size

| Attachment Size | Attributed Number |
|---|---|
| Attachment Size 0 kb | 0 |
| Attachment Size 1-100 kb | 1 |
| Attachment Size 100-500 kb | 2 |
| Attachment Size 500-1000 kb | 3 |
| Attachment Size 1000-more kb | 4 |

Table A.4 – Number of attachments

| Attachment Number | Attributed Number |
|---|---|
| No attachment | 0 |
| 1 attachment | 1 |
| 2 attachments | 2 |
| 3 attachments | 3 |
| 4 attachments and more | 4 |

Table A.5 – Average size of attachments

| Average Attachment Size | Attributed Number |
|---|---|
| average size of attachment 0 | 0 |
| average size of attachment 1-100 | 1 |
| average size of attachment 100-500 | 2 |
| average size of attachment 500-1000 | 3 |
| average size of attachment 1000 and more | 4 |

Table A.6 – Type of Message

| Message Type | Attributed Number |
|---|---|
| Plain Text | 1 |
| HTML based | 2 |
| Image based | 3 |
| Links Only | 4 |
| Others | 5 |

Table A.7 – Length of Message

| Message Size | Attributed Number |
|---|---|
| Length Class 0-100 kb | 0 |
| Length Class 100-200 kb | 1 |
| Length Class 200-300 kb | 2 |
| Length Class 300-400 kb | 3 |
| Length Class 400-500 kb | 4 |
| Length Class 500-600 kb | 5 |
| Length Class 600-700 kb | 6 |
| Length Class 700-800 kb | 7 |
| Length Class 800-900 kb | 8 |
| Length Class 900-1000 kb | 9 |
| Length Class 1000-5000 kb | 10 |
| Length Class 5000-10000 kb | 11 |
| Length Class 10000-20000 kb | 12 |
| Length Class 20000-30000 kb | 13 |
| Length Class 30000-40000 kb | 14 |
| Length Class 40000-50000 kb | 15 |
| Length Class 50000-60000 kb | 16 |
| Length Class 60000-70000 kb | 17 |
| Length Class 70000-80000 kb | 18 |
| Length Class 80000-90000 kb | 19 |
| Length Class 90000-100000 kb | 20 |
| Length Class 100000-more kb | 21 |
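The length classes of Table A.7 can be assigned with a binary search over the class boundaries, as in the Python sketch below. The table lists shared boundaries (for example 100 appears in two classes), so the sketch assumes half-open intervals in which a boundary value belongs to the higher class. That choice, the constant `LENGTH_BOUNDS` and the function name `encode_message_length` are assumptions, not taken from the thesis code.

```python
from bisect import bisect_right

# Upper bounds of classes 0..20, in the unit used by Table A.7.
LENGTH_BOUNDS = (
    list(range(100, 1001, 100))          # 100, 200, ..., 1000
    + [5000, 10000]
    + list(range(20000, 100001, 10000))  # 20000, ..., 100000
)


def encode_message_length(length):
    """Map a message length to its Table A.7 class number (0..21)."""
    if length < 0:
        raise ValueError("length must be non-negative")
    # Assumption: a boundary value falls into the higher class.
    return bisect_right(LENGTH_BOUNDS, length)


codes = [encode_message_length(n) for n in (0, 99, 100, 999, 1000, 5000, 250000)]
# codes == [0, 0, 1, 9, 10, 11, 21]
```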

Table A.8 – IP-based links verification

| IP based Verification | Attributed Number |
|---|---|
| No IP based links | 0 |
| Contain IP based links | 1 |

Table A.9 – Mismatch links

| Mismatch Links | Attributed Number |
|---|---|
| No Mismatch link | 0 |
| 1 Mismatch link | 1 |
| 2 Mismatch links | 2 |
| 3 Mismatch links and more | 3 |

Table A.10 – Number of links

| Number of Links | Attributed Number |
|---|---|
| No link | 0 |
| 1 link | 1 |
| 2 links | 2 |
| 3 links | 3 |
| 4 links | 4 |
| 5 links | 5 |
| 6 links | 6 |
| 7 links | 7 |
| 8 links | 8 |
| 9 links | 9 |
| 10-100 links | 10 |
| more than 100 links | 11 |
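The mapping of Table A.10 from a raw link count to its attributed number can be written as a short Python function. The name `encode_number_of_links` is hypothetical; the thresholds are the ones listed in the table.

```python
def encode_number_of_links(count):
    """Map a raw link count to its Table A.10 attributed number."""
    if count < 0:
        raise ValueError("count must be non-negative")
    if count <= 9:
        return count  # 0 to 9 links keep their own value
    if count <= 100:
        return 10     # 10-100 links
    return 11         # more than 100 links


link_codes = [encode_number_of_links(n) for n in (0, 9, 10, 100, 101)]
# link_codes == [0, 9, 10, 10, 11]
```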

Table A.11 – Number of Domains

| Number of Domains | Attributed Number |
|---|---|
| No domain | 0 |
| 1 domain in links | 1 |
| 2 domains in links | 2 |
| 3 domains in links | 3 |
| 4 domains in links | 4 |
| 5 domains in links | 5 |
| 6-10 domains in links | 6 |
| more than 10 domains in links | 7 |

Table A.12 – Average number of dots in links

| Average Number of Dots in Links | Attributed Number |
|---|---|
| 0 dot per link | 0 |
| 1 dot per link | 1 |
| 2 dots per link | 2 |
| 3 dots per link | 3 |
| more than 3 dots per link | 4 |

Table A.13 – Hex character in links

| Number of Links with Hex | Attributed Number |
|---|---|
| No link with Hex character | 0 |
| 1 link with Hex character | 1 |
| 2 links with Hex character | 2 |
| 3 links with Hex character | 3 |
| 4 links with Hex character | 4 |
| 5 links with Hex character | 5 |
| 6-10 links with Hex character | 6 |
| more than 10 links with Hex character | 7 |

Table A.14 – Words in Subject

| Number of Words in Subject | Attributed Number |
|---|---|
| No word in subject | 0 |
| 1-5 words in subject | 1 |
| 6-10 words in subject | 2 |
| more than 10 words in subject | 3 |

Table A.15 – Characters in subject

| Number of Characters in Subject | Attributed Number |
|---|---|
| No character in subject | 0 |
| 1-10 characters in subject | 1 |
| 10-20 characters in subject | 2 |
| more than 20 characters in subject | 3 |

Table A.16 – Non ASCII characters in subject

| Number of Non ASCII Characters in Subject | Attributed Number |
|---|---|
| No non ASCII character in subject | 0 |
| 1 non ASCII character in subject | 1 |
| 2-5 non ASCII characters in subject | 2 |
| 6-10 non ASCII characters in subject | 3 |
| more than 10 non ASCII characters in subject | 4 |

Table A.17 – Recipients of spam email

| Number of Recipients | Attributed Number |
|---|---|
| No recipient | 0 |
| 1 recipient | 1 |
| 2 recipients and more | 2 |

Table A.18 – Images in spam messages

| Number of Images | Attributed Number |
|---|---|
| No image | 0 |
| 1 image | 1 |
| 2 images | 2 |
| 3 images | 3 |
| 4 images | 4 |
| 5 images | 5 |
| 6 images | 6 |
| 7 images | 7 |
| 8 images | 8 |
| 9 images | 9 |
| 10-20 images | 10 |
| 21-30 images | 11 |
| 31-40 images | 12 |
| 41-50 images | 13 |
| 51-100 images | 14 |
| 101-500 images | 15 |
| 501-1000 images | 16 |
| more than 1000 images | 17 |

