Clonal Selection Method for Immunity-Based Intrusion Detection Systems

by Kasthurirangan Parthasarathy

Abstract This paper describes an intrusion detection approach modeled on two bio-inspired concepts, namely negative selection and clonal selection. The negative selection mechanism of the immune system can detect foreign patterns in the complement (nonself) space. The clonal selection principle is used to explain the basic features of an adaptive immune response to an antigenic stimulus. It establishes the idea that only those cells that recognize the antigens are selected to proliferate. The selected cells are subject to an affinity maturation process, which improves their affinity to the selective antigens. The MODCLONALG algorithm described in this paper is a special refinement of the clonal selection principle that attempts to implement the negative selection mechanism. A detailed discussion of how MODCLONALG generates the rule sets representing the negative selection (nonself) space is provided. Finally, an Intrusion Detection System model is proposed that incorporates a knowledge base constructed by CLONALG using negative selection and uses CLONALG to recognize malicious activities in the system.

Keywords Artificial immune systems, negative selection principle, clonal selection principle, evolutionary algorithms, optimization, intrusion detection.

1 Introduction

Computer security is a field that has gained significance over the past few years, especially with the widespread internetworking of computers. One of its important aspects is the detection of intrusions and attacks, and a considerable amount of research has been dedicated to exploring possible detection methods. Of late, intrusion detection systems modeled on the Artificial Immune System have gained prominence because they promise feasible and efficient detection mechanisms [4]. The Artificial Immune System is in turn modeled on the Natural Immune System found in living organisms.

In this paper, an Intrusion Detection System is proposed that uses the negative selection mechanism of the immune system along with the clonal selection principle. The main objective is to combine the clonal selection method with negative selection to obtain a comprehensive definition of the nonself space. The nonself space represents the set of activities in a system that are considered abnormal or undesirable. It is hence the complement of the self space, which represents the set of activities considered normal in the system.

The clonal selection method is a refined form of the evolutionary approaches, which are stochastic search processes. The definitions of self and nonself have combinatorially many possibilities, and hence the search space is vast. Conventional deterministic approaches cannot provide complete coverage of such a search space in real time. Hence the evolutionary approach, with its stochastic nature, provides a reasonably efficient method for developing a representation of self or nonself that can cover the enormous search space [1]. These evolutionary algorithms are modeled on natural evolution, wherein the fittest individuals are selected for reproduction and recombine to produce unique offspring, with mutation adding diversity to their characteristics. The clonal selection principle makes use of these features, but with a slight modification that allows it to handle multimodal optimization much more efficiently [2].

MODCLONALG, discussed in this paper, is a specialized form of the clonal selection algorithm that generates a set of rules characterizing the complement (nonself) space. The proposed Intrusion Detection System model uses this algorithm to build its knowledge base. The pattern recognition tasks of the system are performed by CLONALG, which is another form of the clonal selection algorithm.

2 Artificial Immune Systems

Artificial Immune Systems (AIS) form the basis of solutions for various real-world problems, intrusion detection in particular. AIS aim at using ideas gleaned from immunology to develop systems capable of performing a wide range of tasks in various research areas. An AIS is essentially a refinement of the natural immune system of living organisms, directed specifically at information processing.

When a pathogen (a germ) enters the body of an organism, the immune system immediately recognizes that the pathogen's cell formulation is different from that of the body's cells. The Germinal Center (GC) plays an important role in this activity and takes over to tackle the situation. It is one of the functional modules of the natural immune system, which evolves in some organs and plays a major role in the immune response. A GC develops through a complex process and is formed dynamically when antigen-activated B cells migrate into the primary follicles of the peripheral lymphoid organs. The formation of a GC requires the activation of B cells, the migration of B cells, interactions between T cells and B cells, and the availability of a network of follicular dendritic cells (FD). From the information-processing point of view, the germinal center can serve as a pattern-matching model, particularly for distinguishing between known and novel patterns. Several Artificial Immune System concepts, such as clonal selection, affinity maturation, somatic mutation and receptor editing, are discussed in this paper in relation to the detection of intrusions.

3 Intrusion Detection

Intrusion detection is a problem in the field of computer security, wherein computer systems are guarded against malicious activities and attacks. Security breaches take various forms, such as virus activity, masqueraded users and denial-of-service attacks. Intrusion detection systems continually monitor the state of the system and raise alarms when suspicious activities occur. These systems are best modeled by drawing heavily on Artificial Immune Systems. Conventional intrusion detection systems are built by incorporating a knowledge base that provides a comprehensive definition of normal activities, called self, so that the system can use it to distinguish unusual activities, which are categorized as nonself. The system activities can be classified into:

1. Normal

2. Attack

3. Error

4. Abnormalities

The normal activities can be recorded and, on the basis of the collected information, a definition of normal behaviour in the system is created as a statistical average of all the recorded activities. The system can then be monitored continuously to record the current activities, which are compared with the normal behaviour and, depending upon the level of deviation, classified further as attacks, errors or abnormalities.

3.1 Representation of the Self and Non-Self

The main purpose of intrusion detection is to identify which states of a system are normal and which are abnormal. The states of a system can be represented by a set of features.

System State Space [1]: A state of the system is represented by a vector of features $x^i = (x^i_1, \ldots, x^i_n) \in [0.0, 1.0]^n$. The space of states is represented by the set $S \subseteq [0.0, 1.0]^n$. It includes the feature vectors corresponding to all possible states of the system.

Normal Subspace [1]: A set of feature vectors $Self \subseteq S$ represents the normal states of the system. Its complement is called Nonself and is defined as $Nonself = S - Self$. Thus the characteristic function for the Self definition would be:

$\chi_{Self} : [0.0, 1.0]^n \to \{0, 1\}$

If various levels of normalcy are to be represented, then the definition can be modified such that:

$\chi_{Self} : [0.0, 1.0]^n \to [0.0, 1.0]$

The problem [1]: Given a set of normal samples $Self' \subseteq Self$, a good estimate of the normal space characteristic function $\chi_{Self}$ needs to be built. This function should be able to decide whether or not the observed state of the system is anomalous.
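The two forms of the characteristic function can be illustrated with a minimal sketch. The estimator used here (distance to the nearest normal sample) and the parameters `radius` and `scale` are assumptions made only for illustration; the source does not prescribe how the estimate is built.

```python
import math

def chi_self_crisp(x, self_samples, radius=0.1):
    """Crisp form: maps a state vector to {0, 1}.

    Assumption: a state counts as normal (1) when it lies within
    `radius` of at least one recorded normal sample.
    """
    nearest = min(math.dist(x, s) for s in self_samples)
    return 1 if nearest <= radius else 0

def chi_self_graded(x, self_samples, scale=0.5):
    """Graded form: maps a state vector to [0.0, 1.0].

    Assumption: normalcy decays linearly with the distance to the
    nearest normal sample and reaches 0 at distance `scale`.
    """
    nearest = min(math.dist(x, s) for s in self_samples)
    return max(0.0, 1.0 - nearest / scale)
```

Any other estimator with the same signature could be substituted; the point is only the difference between the crisp and the graded codomain.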

3.2 Negative Selection Approach in Intrusion Detection

Intrusion detection systems need to distinguish between the self and nonself spaces. In addition, the elements of the nonself space must be further categorized in order to determine a specific response for protection against, and recovery from, different attacks. The negative selection mechanism in the immune system works in such a way that when the generated cells bind to a self protein they are destroyed, so only those cells that do not bind to self proteins are allowed to proliferate. This approach has been extended to the intrusion detection process, yielding a negative selection algorithm [1]. The negative selection algorithm [1] can be summarized as:

1. Define self as a collection S of strings of length l over a finite alphabet, a collection that needs to be monitored.

2. Generate a set R of detectors, each of which fails to match any string in S.

3. Monitor S for changes by continually matching the detectors in R against S. If any detector ever matches, then a change is known to have occurred, as the detectors are designed not to match any of the original strings in S.

These detectors can be represented as a set of real-valued rules. These rules constitute a complement of the normal values of the feature vectors (described below). A rule is considered good if it does not cover positive (self) samples and its area is large. Thus, an evolutionary algorithm can be used to evolve rules that cover the nonself space, guided by this criterion. The basic structure of the detector rules is as follows:

$R^1$: If $Cond_1$ then Level 1
...
$R^i$: If $Cond_i$ then Level 1
$R^{i+1}$: If $Cond_{i+1}$ then Level 2
...
$R^j$: If $Cond_j$ then Level 2
...

where

$Cond_i = x_1 \in [low^i_1, high^i_1]$ and $\ldots$ and $x_n \in [low^i_n, high^i_n]$;

$(x_1, \ldots, x_n)$: feature vector;

$[low^j_i, high^j_i]$: lower and upper values for the feature $x_i$ in the condition part of the rule $R^j$.

The condition part of each rule defines a hypercube in the descriptor space ($[0.0, 1.0]^n$). Thus, this set of rules attempts to cover the nonself space with hypercubes. The different levels indicate the levels of deviation from the normal definitions that are allowed. These definitions are ordered hierarchically such that level 1 contains level 2, level 2 contains level 3, and so forth. This means that a descriptor can be matched by more than one rule, but the highest level reported will be assigned. This set of rules generates a noncrisp characteristic function for the nonself space:

$\mu_{nonself}(\vec{x}) = \max(\{\, l \mid \exists R^j \in R,\ \vec{x} \in R^j \text{ and } l = level(R^j) \,\} \cup \{0\})$ (1)

where $level(R^j)$ represents the deviation level reported by the rule $R^j$.
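The rule structure and the characteristic function (1) can be sketched as follows. The class name `Rule`, the function name `mu_nonself` and the two example rules are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

@dataclass(frozen=True)
class Rule:
    """A detector rule: a hypercube [low_k, high_k] per feature plus a deviation level."""
    low: Tuple[float, ...]
    high: Tuple[float, ...]
    level: int

    def matches(self, x: Sequence[float]) -> bool:
        # The condition part holds when every feature lies inside its range.
        return all(lo <= xk <= hi for lo, xk, hi in zip(self.low, x, self.high))

def mu_nonself(x: Sequence[float], rules: Sequence[Rule]) -> int:
    """Highest level among the rules matching x, or 0 when no rule matches."""
    return max((r.level for r in rules if r.matches(x)), default=0)

# Hypothetical nested rules: the level-2 hypercube lies inside the level-1 one.
rules = [
    Rule(low=(0.6, 0.0), high=(1.0, 1.0), level=1),
    Rule(low=(0.8, 0.0), high=(1.0, 1.0), level=2),
]
print(mu_nonself((0.9, 0.5), rules))  # 2: matched by both rules, highest level wins
print(mu_nonself((0.7, 0.5), rules))  # 1: matched by the level-1 rule only
print(mu_nonself((0.2, 0.5), rules))  # 0: not covered, treated as self
```

The `default=0` argument plays the role of the union with {0} in equation (1): a descriptor covered by no rule is assigned level 0.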

Figure 1: A schematic representation of the self, nonself space definition with variability parameter v resulting in various levels

The following sections discuss how these rule sets can be generated.

4 Evolutionary Algorithms

Evolutionary Algorithms derive their nature from Darwin's theory of natural selection. The theory emphasizes the concept of survival of the fittest, in which only the healthier and relatively fitter individuals in a population get to reproduce, so that the resulting offspring inherit a comprehensive set of traits. This eventually leads to a situation wherein most of the individuals in a population have healthy traits, which in turn leads to a healthy population.

Evolutionary Algorithms apply these stochastic evolutionary techniques to solve various real-world problems. They take a set of samples from the search space and seek the global optimum, using recombination and mutation to direct the search. This process is predominantly influenced by the knowledge accumulated from the fitter members of the population. This knowledge guides the Evolutionary Algorithm towards the optimum, while the mutation operations help it avoid getting trapped in local optima. The fitness of the members is decided on the basis of how well they perform in the defined environment. Evolutionary Algorithms have certain desirable properties, such as:

1. Minimal knowledge is required

2. Ability to solve difficult problems

3. Availability of Solution

4. Robustness: the ability to provide the correct solution even in faulty operating conditions

But Evolutionary Algorithms have their own disadvantages like:

1. It is difficult to optimize the parameters involved

2. It is computationally intensive

3. The fitness function and the genetic operators involved are not obvious

4. They can result in premature convergence in which case we would have to contend with a local optimum value
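The general loop described in this section (fitness-based selection, recombination, mutation) can be sketched as follows. This is a minimal illustration under stated assumptions, not a method taken from the source: real-valued genes in [0, 1], truncation selection of the better half, one-point crossover and Gaussian mutation are all choices made here for brevity, and `evolve` is a hypothetical name.

```python
import random

def evolve(fitness, n_genes, pop_size=30, generations=100,
           mutation_rate=0.1, seed=0):
    """Maximize `fitness` over vectors in [0, 1]^n_genes (requires n_genes >= 2)."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(n_genes)] for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: the fitter half of the population survives and reproduces.
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            # Recombination: one-point crossover between two distinct parents.
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_genes)
            child = a[:cut] + b[cut:]
            # Mutation: occasional Gaussian perturbation, clipped to [0, 1].
            child = [
                min(1.0, max(0.0, gene + rng.gauss(0.0, 0.1)))
                if rng.random() < mutation_rate else gene
                for gene in child
            ]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)
```

Because the parents are carried over unchanged, the best fitness in the population never decreases from one generation to the next; the parameters `pop_size`, `generations` and `mutation_rate` are exactly the kind of settings that the first disadvantage above describes as difficult to tune.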

5 Clonal Selection

The clonal selection functioning of the immune system can be interpreted as a remarkable microcosm of Charles Darwin's law of evolution, with the three major principles of repertoire diversity, genetic variation and natural selection. Repertoire diversity can be maintained if the immune system produces far more antibodies than will be effectively used in binding with an antigen. In fact, it appears that the majority of the antibodies produced do not play any active role whatsoever in the immune response. Natural variation is provided by the variable gene regions responsible for the production of a highly diverse population of antibodies, and selection occurs such that only antibodies able to successfully bind with an antigen will reproduce and be maintained as memory cells. The similarity between adaptive biological evolution and the production of antibodies is even more striking when one considers that the two central processes involved in the production of antibodies, genetic recombination and mutation, are the same ones responsible for the biological evolution of sexually reproducing species. Thus, cumulative blind variation (mutation) and natural selection form the basis of the clonal selection principle. The clonal selection algorithm (CLONALG), described later in the text, aims at demonstrating that this cumulative blind variation can generate high-quality solutions to complex problems.

5.1 Clonal Selection Theory

When an animal is exposed to an antigen, some subpopulation of its bone-marrow-derived cells (B lymphocytes) responds by producing antibodies (Ab). Each cell secretes a single type of antibody, which is relatively specific for the antigen. By binding to these antibodies (cell receptors), and with a second signal from accessory cells such as the T-helper cell, the antigen stimulates the B cell to proliferate (divide) and mature into terminal (non-dividing) antibody-secreting cells, called plasma cells. The process of cell division (mitosis) generates a clone, i.e., a cell or set of cells that are the progeny of a single cell. While plasma cells are the most active antibody secretors, large B lymphocytes, which divide rapidly, also secrete antibodies, albeit at a lower rate. T cells, on the other hand, play a central role in the regulation of the B cell response and are preeminent in cell-mediated immune responses, but they are not explicitly accounted for in the development of our model.

Lymphocytes, in addition to proliferating and/or differentiating into plasma cells, can differentiate into long-lived B memory cells. Memory cells circulate through the blood, lymph and tissues and, when exposed to a second antigenic stimulus, commence to differentiate into large lymphocytes capable of producing high-affinity antibodies, pre-selected for the specific antigen that stimulated the primary response. Fig. 2 depicts the clonal selection principle. The main features of the clonal selection theory that are made use of in intrusion detection are:

• Proliferation and differentiation on stimulation of cells with antigens;

• Generation of new random genetic changes, subsequently expressed as diverse antibody patterns, by a form of accelerated somatic mutation (a process called affinity maturation); and

• Elimination of newly differentiated lymphocytes carrying low affinity antigenic receptors.

Figure 2: Clonal Selection Principle

A brief introduction to the various concepts in the clonal selection principle is provided in the following sections. A detailed discussion of these concepts can be found in [2].

5.1.1 Reinforcement Learning

Learning in the immune system involves raising the relative population size and affinity of those lymphocytes that have proven themselves valuable by having recognized a given antigen. In the clonal selection method only a small set of the best individuals is maintained, so that the problem can be solved with the minimal resources available. This method tries to emulate the natural immune system, in which the initial exposure to an antigen is handled by a set of low-affinity clones of B cells, and the effectiveness of the response to secondary encounters is considerably increased by the presence of the memory cells associated with the first exposure. This scheme, which depends on associative memory, is referred to as reinforcement learning.

Some characteristics of associative memories are particularly interesting in the context of artificial immune systems:

1. The stored pattern is recovered through the presentation of an incomplete or corrupted version of the pattern.

2. They are usually robust, not only to noise in the data, but also to failure in the components of the memory.

5.1.2 Somatic Hyper Mutation and Receptor Editing

Somatic hypermutation is a process that introduces random changes to the gene structure so that the antibodies produced are different from those of the first reaction. This ensures that diversity is introduced into the repertoire and that there is a chance of exploring antibodies with a better binding capacity (to the antigen). Hence this process aids climbing in the right direction of a hill, if the problem is modeled so that the optimal solution lies at the highest peak of a landscape. The suboptimal mutations fall away because of their lower fitness.

Receptor editing allows the low-fitness antibodies to be modified completely so as to give them a chance to improve. Thus, somatic hypermutation helps in fine-tuning the solution, whereas receptor editing makes it possible to overcome the local maxima problem.
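The contrast between the two operators can be made concrete with a small sketch. It assumes antibodies are real-valued vectors in [0, 1] and that affinity has been normalized to [0, 1]; the function names and the `max_step` parameter are hypothetical.

```python
import random

def hypermutate(antibody, affinity, rng, max_step=0.2):
    """Somatic hypermutation: small random changes around the current antibody.

    Assumption: the perturbation range shrinks linearly as the normalized
    affinity (0..1) grows, so good antibodies are only fine-tuned.
    """
    step = max_step * (1.0 - affinity)
    return [min(1.0, max(0.0, g + rng.uniform(-step, step))) for g in antibody]

def receptor_edit(antibody, rng):
    """Receptor editing: discard the antibody and draw a completely new one."""
    return [rng.random() for _ in antibody]
```

Hypermutation keeps the search close to a promising point (local hill climbing), while receptor editing jumps to an unrelated region of the space, which is what allows the search to leave a local maximum.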

5.2 Clonal Selection Algorithm

This algorithm makes use of the concepts discussed previously. It involves affinity maturation, wherein only the antibodies that exhibit better binding are given higher fitness values. The notation used in this algorithm is as follows: Ab denotes the antibodies and Ag the antigens, both of which are matrices; $Ag_j$ is a vector; and S corresponds to a coordinate axis in the shape-space. The other notations used are:

• $Ab$: available antibody repertoire ($Ab \in S^{N \times L}$, $Ab = Ab_{\{r\}} \cup Ab_{\{m\}}$);

• $Ab_{\{m\}}$: memory antibody repertoire ($Ab_{\{m\}} \in S^{m \times L}$, $m \le N$);

• $Ab_{\{r\}}$: remaining antibody repertoire ($Ab_{\{r\}} \in S^{r \times L}$, $r = N - m$);

• $Ag_{\{M\}}$: population of antigens to be recognized ($Ag_{\{M\}} \in S^{M \times L}$);

• $f_j$: vector containing the affinity of all antibodies with relation to $Ag_j$;

• $Ab^j_{\{n\}}$: n antibodies from $Ab$ with highest affinities to $Ag_j$;

• $C^j$: population of $N_c$ clones generated from $Ab^j_{\{n\}}$;

• $C^{j*}$: population $C^j$ after the affinity maturation (hypermutation) process;

• $Ab_{\{d\}}$: set of d new molecules that will replace d low-affinity antibodies from $Ab_{\{r\}}$;

• $Ab^*_j$: candidate, from $C^{j*}$, to enter the pool of memory antibodies.

The CLONALG algorithm works as follows:

1. Establish a set of antibodies Ab in the repertoire, maintaining a separate subset of them as memory, along with the set of antigens Ag.

2. Randomly choose an antigen and present it to all the antibodies in the repertoire.

3. Determine the affinity vector of all the antibodies in the repertoire with respect to the antigen.

4. Select the n highest-affinity antibodies from Ab to compose a new set of high-affinity antibodies.

5. Clone the antibodies in this set independently and in proportion to their antigenic affinities, generating a repertoire of clones; the higher the antigenic affinity, the higher the number of clones generated.

6. Submit the repertoire of clones to the affinity maturation phase, in which hypermutation of the antibodies occurs; antibodies with higher affinity have a smaller mutation rate.

7. Determine the affinity of the matured clones with respect to the antigen.

8. From this set of mature clones, select the fittest one and add it to the set of memory antibodies.

9. Finally, replace the d lowest-affinity antibodies in the set of antibodies with new individuals.
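Steps 2 to 9 for a single antigen presentation can be sketched as below. This is an illustrative reading of the steps, not the reference implementation of [2]: the affinity measure, the rank-based clone count `round(beta * N / i)`, the mutation range and the choice to keep the memory as a separate list are all assumptions made here, and the names `affinity` and `clonalg_step` are hypothetical.

```python
import random

def affinity(ab, ag):
    # Assumed measure: 1 minus the mean absolute difference (1.0 = perfect match).
    return 1.0 - sum(abs(a - g) for a, g in zip(ab, ag)) / len(ag)

def clonalg_step(Ab, memory, ag, n=5, beta=1.0, d=2, rng=random):
    """Present antigen `ag` once; returns the updated (Ab, memory)."""
    N, L = len(Ab), len(ag)
    # Steps 3-4: rank the repertoire by affinity and keep the n best.
    ranked = sorted(Ab, key=lambda ab: affinity(ab, ag), reverse=True)
    selected = ranked[:n]
    matured = []
    for i, ab in enumerate(selected, start=1):
        # Step 5: better-ranked antibodies (small i) receive more clones.
        n_clones = max(1, round(beta * N / i))
        # Step 6: higher affinity means a smaller mutation range.
        step = 0.5 * (1.0 - affinity(ab, ag))
        for _ in range(n_clones):
            matured.append(
                [min(1.0, max(0.0, g + rng.uniform(-step, step))) for g in ab]
            )
    # Steps 7-8: the fittest matured clone enters the memory set.
    best = max(matured, key=lambda c: affinity(c, ag))
    memory = memory + [best]
    # Step 9: the d lowest-affinity antibodies are replaced by random newcomers.
    survivors = ranked[: N - d]
    newcomers = [[rng.random() for _ in range(L)] for _ in range(d)]
    return survivors + newcomers, memory
```

A full run would repeat `clonalg_step` over randomly chosen antigens for several generations; step 1 (building the initial repertoire) is left to the caller.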

The schematic representation of the algorithm is as follows:

Figure 3: CLONALG: A schematic representation of the clonal selection algorithm

5.3 Generation of Rule Sets: The MODCLONALG

CLONALG, which is a refinement of the evolutionary approach, can be applied to the problem of differentiating between the self and nonself spaces described at the beginning of this paper. The clonal selection algorithm has been shown to perform multimodal optimization [2], which is vital for covering multiple points in the self or nonself search space. In this approach, a set of rules is substituted for the antibodies and the normal descriptors (Section 3.1) are substituted for the antigens. We initially start with a random set of rules and, by repeating the clonal selection algorithm through several generations, obtain a set of rules comprehensive enough to cover the nonself region. This requires a specific fitness function that evaluates the population of rules on the basis of whether they overlap with the normal descriptors. If overlapping occurs, the rules can be classified as having the lowest affinity. In this way, the rules that have the maximum overlap with the normal descriptors are evicted from contention.

The normal descriptors can be represented as a set of feature vectors $S = \{x^1, \ldots, x^m\}$. Each element $x^j$ is an n-dimensional vector $x^j = (x^j_1, \ldots, x^j_n)$. Each $x^j$ can be considered as an antigen. The antibody set will initially contain a random set of rules. CLONALG can then be modified to perform an optimization of the rule set, since there is no specific antigen to be recognized but rather a substantial region of the nonself space to be covered. A few modifications have to be made to CLONALG to perform optimization, which results in MODCLONALG. The changes required are as follows:

• In Step 1, there is no explicit antigen population to be recognized, but an objective function g(x) to be optimized (maximized or minimized). An antibody's affinity thus corresponds to the evaluation of the objective function for that antibody: each antibody $Ab_i$ represents an element of the input space. In addition, as there is no specific antigen population to be recognized, the whole antibody population Ab composes the memory set and, hence, it is no longer necessary to maintain a separate memory set $Ab_{\{m\}}$; and

• In Step 8, n antibodies are selected to compose the set Ab, instead of selecting the single best individual $Ab^*$.

Since the optimization process aims at locating multiple optima within a single population of antibodies, two parameters may assume default values:

1. Assign n = N, i.e. all antibodies from the population Ab will be selected for cloning in Step 4; and

2. The affinity-proportionate cloning is not necessarily applicable, meaning that the number of clones generated for each of the N antibodies is assumed to be the same, and the total number of clones becomes

$N_c = \sum_{i=1}^{N} round(\beta \cdot N)$

The second statement above implies that each antibody is viewed locally and has the same clone size as the others, with none privileged for its affinity. The antigenic affinity (corresponding to the objective function value) is only taken into account to determine the hypermutation rate for each antibody, which still depends on its affinity.
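As a worked instance of the clone-count formula, under the purely illustrative assumption of N = 10 antibodies and β = 0.5 (values not taken from the source), every antibody receives the same number of clones:

```latex
\mathrm{round}(\beta \cdot N) = \mathrm{round}(0.5 \cdot 10) = 5,
\qquad
N_c = \sum_{i=1}^{10} 5 = 50
```

The total therefore grows with the square of the population size for a fixed β, which is worth keeping in mind when choosing N.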

Figure 4: MODCLONALG: A schematic representation of the clonal selection algorithm used in Intrusion Detection

Note: In order to maintain the best antibodies for each clone during evolution, it is possible to keep one original (parent) antibody for each clone unmutated during the maturation phase (Step 6).

Thus the function g(y) to be optimized favors the set of rules that overlaps least with the self space, each rule of which strives to cover a maximum area of the nonself space. This can be represented as:

$g(y) = -(R^y - x^j) + A(R^y)$

where $R^y - x^j$ is the number of elements $x^j_k$ such that $x^j_k \in [low^y_k, high^y_k]$, and

$A(R^y) = [high^y_1 - low^y_1] + \ldots + [high^y_n - low^y_n]$

is a measure of the area covered by that particular rule.

The mutation operator then works on the ranges $[low^y_k, high^y_k]$ to modify a particular rule. The elements of the rule that are varied are those ranges with the higher inclusion of normal descriptor elements. Thus, over repeated optimization runs, the best rule sets can be evolved.
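The objective function g(y) can be sketched directly from its two terms. The function names are hypothetical, and the overlap term follows the element-wise reading of the definition above (each feature value falling inside its range counts once); a stricter variant that counts only whole descriptors lying inside the hypercube would be an equally plausible assumption.

```python
def overlap(low, high, self_samples):
    """Number of feature values x^j_k that fall inside the k-th range of the rule."""
    return sum(
        1
        for x in self_samples
        for k, xk in enumerate(x)
        if low[k] <= xk <= high[k]
    )

def coverage(low, high):
    """A(R^y): sum of the range widths, used as the measure of area covered."""
    return sum(h - l for l, h in zip(low, high))

def g(low, high, self_samples):
    """Objective to maximize: wide rules that include few normal values."""
    return -overlap(low, high, self_samples) + coverage(low, high)
```

With the normal descriptors (0.2, 0.3) and (0.25, 0.9), the rule [0.5, 1.0] x [0.5, 1.0] is wide but includes one normal value, while the narrower rule [0.5, 1.0] x [0.0, 0.2] includes none and therefore scores higher.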

5.4 Discussion

The performance of CLONALG on optimization problems has been encouraging [2]. CLONALG performs its search through the mechanisms of somatic mutation and receptor editing, balancing the exploitation of the best solutions with the exploration of the search space. A detailed analysis of the performance of CLONALG is provided in [2], along with a comparison of CLONALG with conventional genetic algorithms employing a niching technique [3]. The results suggest that when the separation among the peaks is non-uniform, fitness sharing overlooked the peaks distributed with less uniformity and privileged the highest ones (Fig. 5), while CLONALG located a larger number of local optima, including the global maximum of the function (Fig. 6). Thus it can be seen that CLONALG provides better coverage than a conventional GA.

This observation can also be extended to MODCLONALG, since it basically works the same way as CLONALG except that it employs a specialized fitness function and relinquishes the memory set. Hence, the observations made on the results of applying CLONALG to optimization problems [2] may hold true for MODCLONALG as well. It is therefore reasonable to expect that MODCLONALG has a better chance of providing comprehensive coverage of the nonself space than a conventional GA, since the nonself space involves optimization over peaks with varying uniformity. Implementations and experiments are still needed in order to verify that this conclusion holds.

Figure 5: GA maximizing an arbitrary function f(x)

6 The Intrusion Detection System

The Intrusion Detection System can now be designed with a comprehensive definition of the malicious activities obtained in the form of a rule set. This rule set can be incorporated into the detection system, which will continuously monitor the activities of the system. If any activity exceeds a threshold set by the regulations of the system, the detection system will match the system activity against its rule set. If a substantial match is obtained, it confirms the occurrence of an intrusion and, depending upon the level of deviation, corresponding actions will be invoked. CLONALG has been observed to perform efficiently [2] on pattern matching problems as well. Hence, it can be used to match the descriptor of the system's current activities (a state vector similar in representation to a normal descriptor) against the rule set to determine which pattern it belongs to. If the resemblance to any of the rules is substantial, then the decision support system may be invoked to take the necessary action. The system activities can be monitored at various levels, namely:

1. User level

2. System level

3. Process level

4. Packet level

Figure 6: CLONALG maximizing an arbitrary function f(x)

Thus the rule base of the system may contain rules that are specialized for each level. A detailed discussion of the development of such an intrusion detection system can be found in [4].

Figure 7: Different Modules of an Intrusion Detection System

7 Conclusion

In this paper, the development of an Intrusion Detection System based on the negative selection and clonal selection principles is considered, and a model system has been proposed. The negative selection approach substantially reduces the storage requirements of the self or nonself definitions [1] by maintaining a minimal rule set that covers only the nonself space. Negative selection thus avoids storing the combinatorial possibilities of the self definitions, which resolves the storage issues that are critical for real-time applications. The minimal rule set also allows for faster matching and improves the efficiency of the system response. The clonal selection algorithm serves to create a comprehensive rule set that can detect most intrusions. It has been observed to perform better than conventional evolutionary algorithms [2] on multimodal optimization, and hence it can provide better coverage of the nonself space. I therefore hope that an Intrusion Detection System built on these principles will be able to perform efficiently in real-time environments.

In future work, I propose that a specific implementation of CLONALG for generating the rule set characterizing negative selection be carried out, to evaluate its trustworthiness in a real-time environment. A substantial amount of work can also be dedicated to the design of the fitness function used by CLONALG to evolve rules that can almost completely cover the nonself space. Previous experiments with implementations of the negative selection approach and the clonal selection method, in isolation, have provided encouraging results [2],[3]. Thus, this Intrusion Detection System model holds a lot of promise, and further investigation may provide interesting results.

References

[1] Dipankar Dasgupta and Fabio Gonzalez. An Immunity-Based Technique to Characterize Intrusions in Computer Networks, IEEE Transactions on Evolutionary Computation (June 2002), Vol. 6, No. 3.

[2] Leandro N. de Castro and Fernando J. Von Zuben. Learning and Optimization Using the Clonal Selection Principle, IEEE Transactions on Evolutionary Computation (2001).

[3] Stephanie Forrest and Brenda Javornik. Genetic Algorithms to Explore Pattern Recognition in the Immune System, Evolutionary Computation (1993), Vol. 1, No. 3.

[4] Dipankar Dasgupta and Fabio Gonzalez. An Intelligent Decision Support System for Intrusion Detection and Response, Intelligent Security Systems Research Lab, Department of Computer Science, University of Memphis (2001).

[5] S. A. Hofmeyr and S. Forrest. Immunity by Design: An Artificial Immune System, in Proc. of GECCO'99 (1999).

[6] —–, The Clonal Selection Theory of Acquired Immunity, Cambridge University Press (1959).
