<<

Role Classification of Hosts within Enterprise Networks Based on Connection Patterns

Godfrey Tan Massimiliano Poletto John Guttag and Frans Kaashoek

MIT Mazu Networks MIT

[email protected] [email protected] guttag,kaashoek ¡ @mit.edu

Abstract unlimited access to external networks while oth- ers may have restricted access. Some users may be Role classification involves grouping hosts into re- limited in the amount of bandwidth they may con- lated roles. It exposes the logical structure of a net- sume, and so on. The number of policies is open- work, simplifies network management tasks such as pol- ended. icy checking and network segmentation, and can be used Monitoring network performance. Almost every to improve the accuracy of network monitoring and anal- complex network suffers from various localized ysis algorithms such as intrusion detection. performance problems. Network managers must This paper defines the role classification problem detect these problems and take action to correct and introduces two practical algorithms that group hosts them. based on observed connection patterns while dealing with changes in these patterns over time. The algorithms Detecting and responding to security violations. have been implemented in a commercial network moni- Increasingly, networks are coming under attack. toring and analysis product for enterprise networks. Re- Sometimes the targets are chosen at random, as sults from grouping two enterprise networks show that in most virus-based attacks, and in other cases the number of groups identified by our algorithms can they are picked intentionally, as with most denial- be two orders of magnitude smaller than the number of of-service attacks. These attacks often involve hosts and that the way our algorithms group hosts highly compromised within the enterprise reflect the logical structure of the networks. network. Early detection of attacks plays a critical role in reducing the damage. 1 Introduction Conducting these activities on a host-by-host basis is not feasible for large networks. Network managers need Today, many enterprises have internal networks (in- to extract structure from their networks so that they can tranets) that are as or more complicated than the entire think about them and make decisions at larger levels of of a few years ago. Managing these networks granularity. Today, this structuring is most often done is increasingly costly, and the business cost of network in an ad hoc manner that relies on administrators’ best problems increasingly high. guesses about the computers, services, and users on the Managing an enterprise network involves a number network. Obviously, this method has scaling problems. of inter-related activities, including: This paper presents two algorithms that, used to- gether, partition the hosts on an enterprise network into Establishing a topology. A network’s topology has a groups in a way that exposes the logical structure of a significant impact on its cost, security, and perfor- network. The grouping algorithm classifies hosts into mance. An increasingly important aspect of topol- groups, or “roles,” based on their connection habits. The ogy design is network segmentation. In an effort correlation algorithm correlates groups produced by dif- to provide fault isolation and mitigate the spread ferent runs of the classification algorithm. of worms like Nimda [3] and Code Red [2], enter- The two algorithms together provide the following prises segment their networks using firewalls [4], properties: routers, VLANs [7], and other technologies. 1. They guarantee that a host is only grouped with Establishing policies. Different users of a network other hosts that have the strongest degree of sim- have different privileges. Some users may have ilarity in connection habits. 2. They provide a mechanism to merge groups, and give network administrators fine-grained control Source Revision over the merging process, so that meaningful re- Control sults can be achieved. Sales-1 Eng-1 3. They deal with transient changes in connection pat- Mail terns by analyzing the profiled data over long peri- ods.

Web 4. They respond to non-transient changes in connec- Sales-N Eng-M tion patterns by producing a new partitioning and describing the differences between the new parti- Sales tioning and the previous partitioning. Database

5. Their run time grows quadratically with the number of hosts in the enterprise network. Figure 1. Grouping of related hosts based on con- nection patterns. Edge indicates that nodes com- As we demonstrate in Section 6, the algorithms re- municate regularly. The dashed circle represents duce the number of logical units that a network adminis- the group boundary. trator must deal with by one or two orders of magnitude. The algorithms are implemented as part of an enterprise monitoring and analysis system that is in production use at several large enterprises. Section 2 outlines the system in which the algo- might indicate that hosts Sales-1 to Sales-N communi- rithms operate, and introduces an example scenario that cate with three servers: Mail , Web server, and will be used throughout the paper. Section 3 describes SalesDatabase server. Similarly, the patterns might in- the models used to develop practical solutions. Sec- dicate that hosts Eng-1 to Eng-M communicate mostly tion 4 and Section 5 explain the two practical algorithms with Mail server, Web server, and SourceRevisionCon- for solving the role classification problem. Section 6 trol server. presents preliminary results, and Section 7 discusses re- Based on this information the grouping algorithm can lated work. We conclude with discussions of our current logically divide all machines into five groups: (i) the and future work in Section 8. sales group consisting of hosts Sales-1 to Sales-N, (ii) the engineering group consisting of hosts Eng-1 to Eng- M, (iii) the common server group consisting of Mail 2 System Overview and Web, (iv) the sales server group consisting of Sales- Database and (v) the engineering server group consist- The role classification algorithms are implemented as ing of SourceRevisionControl. part of a system designed to detect and respond to secu- The results of the grouping algorithm are currently rity violations in large enterprise networks. Such net- being used in two major ways: works commonly consist of tens of thousands of com- puters, spread over different geographic locations. The 1. The Mazu network monitoring and detection sys- security system consists of probes and a central aggre- tem decides whether a host’s behavior matches the gator. The probes analyze packets on the link or links expected policy setting, partly based on the history they are attached to, and send relevant information (in- of the host’s group membership. For example, if a cluding IP address/port tuples) to the aggregator. host in the engineering group were to suddenly start The aggregator is a scalable system that consists of opening connections to the SalesDatabase server, it one or more CPUs. It periodically runs several analysis might be a cause for alarm. algorithms on the data it has received from the probes. It 2. The network administrators review the grouping re- uses the role classification algorithms to refine its anal- sults to better understand the structure of their net- yses and to allow the administrators to describe group- works and to get useful insights for conducting net- based policies. work re-organization tasks such as consolidating Figure 1 presents a simple enterprise network and a servers and network segmentation. partitioning of computers into groups that the aggrega- tor might produce based on the communication patterns The system allows a network manager to label each observed by the probes. The communication patterns identified group with descriptive roles and set policies

  per group. The system continuously monitors the com- ¥  ,

munication patterns, adjusts groups as computers come

§ §

© ¥  ¤¥ ©¥

and go, flags policy violations, and raises alerts about – similarity ¤¦¥ similarity

¤¦¥§ © ¥  ¤¥ ©¥  potential security violations. Because all this informa- – similarity similarity  tion is presented on the level of groups (instead of indi-

vidual hosts), a network manager is able to understand We extend this definition of similarity to define the

§ 

and process the changes and alerts more easily. The al- average similarity between a host ¥ and a group ,

§

© 

gorithms also provide network administrators with flex- avg similarity ¤¦¥ , as the ratio of the sum of the



¥  ibility to control the grouping process to achieve results similarity between ¥§ and each to the num- that highly reflect their intuitive notion of the network ber of hosts in  :

structure. #%$&(')*&

§

© ¥ 

The algorithms presented in this paper are solely similarity ¤¥

§

© "!

avg similarity ¤¥  +¢ based on the connection patterns of hosts such as the ¢

set of neighboring hosts. However, the algorithms can ¡

A partitioning  of respects avg similarity if for

easily be extended to use other information such as pro-

§, § - § §

   ¤¥ © . all ¥ and , avg similarity

tocols and port numbers used and bytes transferred to

§

©  avg similarity ¤¦¥ . achieve desired results. For instance, some network ad- Respecting similarity or avg similarity is not suffi- ministrators may desire that Mail and Web servers be put ¡ cient to generate a useful partitioning of . After all, in different groups. In this case, the protocol informa- a partitioning that puts all the nodes in one group or tion can be used to keep the role classification algorithm one that puts each in a separate group respects from grouping together hosts that use different sets of similarity. We therefore provide a parameter that can protocols. We are currently exploring ways to expand be used by network administrators to control how ag- the capability of the grouping and correlation algorithms ¡ gressive the algorithm is in partitioning into groups. by providing network administrators with more flexibil-

ity to achieve desired results.

Let /10243 , the similarity threshold, be an integer The algorithms assume that the connection patterns £ greater . A partitioning respects similarity and

of hosts highly reflect the logical roles that they play.

/ ¥¨§ ¥ 05263 if it respects similarity and if, for and

For some networks where this is not true, the algorithms

 ¤¥¨§7© ¥ 8/ in , similarity 05243 .

will not do a good job. However, we believe that hosts in

¡

a typical enterprise network that share the same logical A partitioning  of is said to be maximal with re- / role will demonstrate similar connection patterns. spect to similarity and 05263 if it respects similarity

and /¨05243 and there does not exist another parti- ¡

3 Model tioning of that respects similarity and /*05263 and

has fewer groups. By adjusting /*05263 , one gets a In this section, we develop a model for thinking about maximal grouping with fewer groups in which the the grouping problem. We define the problem in the members of each group are more similar to each abstract, providing a model with several functions and other. parameters that can be adjusted to meet various goals. Later in the paper, we present and evaluate instantiations 3.1 Defining Similarity of these parameters.

We use connection behavior as a basis for host group- ¡

Let be the set of hosts in an enterprise network. ing, because that information is easily available by just

¡ ¡ ¢ We will use ¢ to denote the number of hosts in . monitoring the network. To group hosts, we need to de-

Let similarity be a commutative function from pairs fine similarity in a way that captures the extent to which of hosts in ¡ to an integer greater than or equal to pairs of hosts establish connections to the same set of £ other hosts. We start by defining similarity between

. Roughly speaking, if similarity ¤¦¥¨§ © ¥  is high, then we would like our grouping algorithm to place hosts as a function of the number of common hosts with

which they communicate. Intuitively, hosts that share

§ ¥ the hosts ¥ and in the same group. Defining similarity so that it is both efficient to compute and the same logical role communicate with similar sets of yields a good grouping is at the heart of the problem hosts. addressed in this paper. A connection is a pair consisting of a source host ad-

dress and a destination host address. The connection set

¡ ¡



9:¤¦¥ ; <¢=<

A partitioning  of respects similarity if for all of a host, , is the set, and there is a

§  §  § §?

©  ¥ © ¥  ¥ <> ¥ 9:¤¦¥  distinct groups  , , and connection between and . If , then

 § §

9:¤¥  ¤¥ ©¥ 

¥ . We define the relation neighbor 3. The grouping results that network administrators

§ § 

! ¥ ¥ 9:¤¦¥  to be true if and only if ¥ or . For desire may vary from network to network and

later use, we extend the definition of neighbor to groups therefore the role classification algorithm must pro-

 §7© 

by defining neighbor ¤ to be true if and only if vide flexibility for them to control its mechanics so



 §

there exists a host ¥¨§ that is a neighbor of another that meaningful grouping results can be achieved.

  host ¥ . We can use the notion of a connection set to provide a simple definition of similarity: In a typical network setting for a technology company,

each lab or test machine may be dedicated to a single en-

¡

! ¢ 9:¤¥§ 9:¤¥  ¢ similarity ¤¥§ © ¥  (1) gineer. Thus, each of these lab machines, despite sharing

the same role, can have a connection pattern that is very

§

© ¥  That is to say that similarity ¤¥ is equal to the num-

different from the rest of the lab machines. To be able ¥ ber of neighbors that ¥§ and have in common. to correctly group such machines together, the group- We are now in a position to specify the requirements ¡ ing algorithm must take into account the potential roles of a grouping algorithm. Given a set of hosts, and a

of neighboring hosts rather than comparing the neighbor / similarity threshold, 0243 , it must find a partitioning,

¡ sets.

 , of that is maximal with respect to avg similarity

and /¨05263 , i.e., Furthermore, some hosts may potentially be classi- fied into more than one role. For instance, there could

1.  respects avg similarity. This constraint guaran- exist a machine in the network in Figure 1 that com- tees that each host is within the group with which it municates with both sets of machines with which many has the strongest average similarity. engineering machines and sales machines communicate

respectively. In such cases, the connection patterns of

 

   ¤¦¥1© 5 2. For all ¥ and , avg similarity hosts must be evaluated carefully to ensure that each

/¨05263 . This requirement guarantees that each host in a group is sufficiently closely related to every host is grouped with other hosts with which it has the other host in the group, thus ensuring that groups strongest similarity in connection habits. are not too large. The role classification problem is not trivial for the

¡ aforementioned reasons. Not only does the computation 3. There is no other partitioning  of that meets of the similarity measure matter, but the process of how the first two requirements and has a larger average nodes are grouped based on the similarity values among group size. This guarantees that groups are not too node pairs is also important. small. The grouping algorithm consists of two phases: i) the This specification is independent of the definition group formation phase and ii) the group merging phase. of avg similarity. For some networks, such as the The group formation phase identifies each group of hosts one represented in Figure 1, the above definition of that have similar sets of neighbors using a simple sim- avg similarity yields excellent results. However, for oth- ilarity measure such as the one described in Section 3. ers a slightly more complex definition works better. We The purpose of the group formation phase is two-fold: present such a definition in Section 4.2. i) to efficiently identify various groups of hosts, each of which has drastically different overall connection pat- 4 Role Classification terns, and ii) to prepare for the second phase of the al- gorithm. The formation phase of the algorithm can effi- The role classification problem is not difficult to solve ciently find the desired partitioning for the example net- in ideal situations, such as the network shown in Fig- work in Figure 1 but may fail for many networks since ure 1, in which two nodes that share the same logical it does not take into account the potential roles of neigh- role communicate with the identical set of machines. boring hosts as explained earlier. In general, the group Clearly, such a situation does not reflect the connec- formation phase may generate a partitioning that con- tion patterns in typical enterprise networks. Three major tains more groups than desired. challenges of the role classification problem are: The group merging phase decides whether groups, 1. Two hosts that share the same logical role may produced by the formation phase, can further be merged communicate with drastically different sets of ma- using a much more sophisticated similarity measure. chines. This phase provides network administrators with fine- grained control over the merging process so that the 2. A host may potentially be classified into more than grouping results reflect their intuition of the network one role. structure.

) )

¡  4.1 Forming Groups and  is . ( will be used later to com- pute the degree of similarity between groups.)

Group formation can be thought of as a graph the- ¡© ¥

¡ (e) For each ungrouped host , where



 

ory problem. From the connection sets of , one can £

9:¤¥  ¢  ¢ and , create a new group generate a neighborhood graph, nbh-graph, where each containing only ¥ as described in 2d.

node represents a host and each edge with weight rep-

resents that there are common neighbors between the The algorithm runs iteratively over conn-graph un-

¡ £

hosts. Thus, a neighborhood graph captures the extent til no ungrouped node remains or ! . At each step to which pairs of hosts communicate with the same set multiple BCCs may be identified simultaneously and a of other hosts. We use an undirected graph since almost single node could be a part of several BCCs indicating all communication between hosts in the intranets is bi- that it may share multiple roles. In this case, the node directional. However, in certain situations, directional- becomes a part of a BCC with the largest size. If more ity may be used to improve the quality of the grouping than one such BCC exists, we choose one randomly. By results and we continue to investigate this issue. iterating over ¡ from high to low, the algorithm asso-

One approach to the grouping problem is to treat it as ciates each host ¥ with other hosts with the strongest a k-clique problem where nbh-graph is partitioned into similarity. cliques of size ¡ in which each edge in the clique has a In the grouping algorithm, the minimum number of

weight greater than or equal to some constant ¢ . Once nodes required to form a BCC is two. Technically, the ¡

a -clique is identified, one assigns all the nodes in the minimum number of nodes to form a BCC is  , since we ¡

-clique to one group, since they all share at least ¢ com- do not allow duplicate edges between any two nodes. mon neighbors. This approach is problematic, because Nevertheless, we allow two isolated nodes connected by

(i) the ¡ -clique problem is NP-complete [25], and (ii) an edge to form a group. ¡ requiring that each host pair in a group has exactly Since a BCC is not a clique, some node pairs in the common neighbors is too strong a requirement. BCC may not have edges between them allowing node Another approach is to treat grouping as related to pairs that do not share at least ¡ common neighbors to be the problem of identifying bi-connected components in the same group. However, any two nodes in a BCC (BCCs). A BCC is a connected component in which have at least two disjoint paths along which two suc- any two edges lie in a simple cycle. Thus, there exist cessive nodes share at least ¡ common neighbors. In

at least two disjoint paths between any two nodes in a other words, any two nodes in a group demonstrate in at ¡

BCC. Unlike the -clique problem, BCC can be solved least two different ways that they have strong similarity

:¤¥¤§¦©¨  ¤ ¨ in £ , where and are the number of nodes in connection habits, significantly reducing the possibil- and edges in the graph respectively [9, 27]. Moreover, ity that they may serve different roles. This observation all nodes in the BCC need not be connected to each other is a major reason why we believe BCCs are suitable for directly. This approach is the one we use. forming groups. The group formation phase operates as follows: When a set of hosts is placed into a group, the nodes representing those hosts are removed from conn-graph 1. Generate the connectivity graph, conn-graph, and replaced by one node (called the group node) rep-

based on the observed connection patterns. resenting the entire group. There are edges connecting

¡ ¡

¡ that group node to each node to which one of the hosts

0 0 2. For ! down to 1, where is the max- imum number of hosts with which a single host in the group was connected. communicates: In some cases where a node may have connection pat- terns so different from any other nodes, the node should Repeat until no new groups can be assigned: form a group by itself. Step 2e forms a new group with

only ¥ in it if there exist no nodes that have the number

 

(a) From conn-graph, build the k-neighborhood of common neighbors greater than or equal to 9:¤¥ .



£ 

graph k-nbh-graph. We set ! and find that it works well with various (b) Remove group nodes (see 2d) from k-nbh- networks. graph. Figure 2 illustrates the evolution of the grouping al- gorithm, in terms of k-nbh-graph, for the network de-

(c) Generate all BCCs in k-nbh-graph. picted in Figure 1. The first group is formed when

¡

!"!#¦$¤ !

(d) For each BCC  , replace in conn-graph the , where is the number of hosts used

 ¤

nodes in  by a new group node rep- by sales personnels and is the number of hosts used

! !

resenting those nodes. Label  by a pair by engineers. For specificity, let us assume that

) ) )

¡ ¡

©  ¤ !% ¤ , where is a unique identifier . As shown in the picture, the 6-nbh-graph con-

¡£¢¥¤¦¡¨§ © PROCEDURE MEETCO¢ NNECTIONREQ( )

Sales-1 

  

&

 

& 

a1 §

3 3 

&



  

  a2 Source Sales-2 Sales-3 Mail Revision return (a1 is within  percent of a2)

3

Control ¡¨§ M + N ¡£¢

ROCEDURE EET IMILARITY EQ © +*

P M S & R ( , )

¤

  (

kmax ! #"%$'&)(

¢ § ¡

Web ¡

 ,

Eng-1 SIMILARITY( , ) 0/

Sales 0/

,

2 -.1 3 3 Database if (kmax -.( and ) return true; Eng-2 Eng-3 else

3 0/

,

2

-.14 5

return (kmax 3.( and )

¡£¢ ¡¨§ ©

a) k = M+N b) k = 3 c) k = 1 PROCEDURE SIMILARITY(  , )



=<>@?

67

  

&



¨ 8:9

&

¦;





=<>@?

&



6A

 

Figure 2. Evolution of the grouping algorithm at var- 

¨ 8:9

;

¡

¢ §

¡ ¡

ious values. ¡B

& 

For each common neighbor group of and

*

JI  JI 

H<' ? H<' ?

&

¤

 

, ,DC

     

' '

#E)FG&

 ¢

*

§LK ¢ §

, C



&GM M

N N

K 7OPO return ,

tains two hosts, Mail and Web, and the algorithm puts

¡  them in one group. When ! , the algorithm iden- tifies two additional BCCs, one containing all the sales Figure 3. Pseudo-code for determining the similar- machines and the other, all the engineering machines. ity and connection requirements. Finally, because of the bootstrap condition (see Step 2e), the algorithm creates two groups, one containing Sales-

Database and the other, SourceRevisionControl, when

¡  

 £  ! ! . The algorithm repeatedly merges two groups that meet the two requirements and have the highest similarity

4.2 Merging Groups measure until no groups can be merged. The  value of a newly merged group is set to the minimum number The aforementioned group formation algorithm that of connections a host in the group has. uses a simple definition of similarity tends to produce Figure 3 depicts the pseudo-code for determining the too many groups in many situations. Consider, for ex- average connection requirement and the similarity re-

ample, the network in Figure 1 modified so that Sales-1 quirement. The procedure MEETCONNECTIONREQ de-

§  communicates with the Mail and SalesDatabase servers cides whether the two groups,  and , meet the con- but not the Web server. The grouping algorithm in Sec- nection requirement. This requirement keeps a group tion 4 will create two groups for the sales hosts, one that with a large number of connections from merging with only contains Sales-1 and one that contains the rest of another group with a much smaller number of connec- the sales hosts. This might be appropriate, but it is prob- tions.

ably not what a network administrator would want. The procedure MEETSIMILARITYREQ determines $

The group merging phase builds on the results gen- whether $ the two groups meet the similarity require-

2 2VU

/RQTS / /RQTS erated by the group formation phase. It merges groups ment. / and , , are similarity thresh-

that are similar in connection habits in a way that allows olds that can be set by the user to control the merging $

users to control the process so that more meaningful re- process. Which threshold$ is used depends upon whether

) )*&

2 2

¥ ©    

sults can be achieved. max ¤ , where is a constant in-

) 

During the grouping phase, we merge two groups if tended to define whether a  value is “high.” The sim-

they meet the following two requirements: ilarity threshold for merging groups is higher for groups )

with a high  value, those groups whose member hosts Similarity requirement. The similarity measure be- share a high number of common neighbors. This is be- tween the two groups exceeds a user-specified cause merging two groups can change the relationships threshold. between other groups in a way that induces additional Connection requirement. The average number of con- undesirable merges. nections of each group is comparable. Again, consider the groups in the example network illustrated in Figure 1. Notice that if ¤ is large, the simi- 5.1 Challenges larity measure between the SalesDatabase group and the Mail and Web group will be large. Similarly, for large For the rest of this section, we assume that there ex-

! , the SourceRevisionControl group will be highly sim- ists a unique host identifier that never changes. We note ilar to the Mail and Web group. If all three groups were that the IP address may not be a good use when Dynamic to merge, it would effectively cause the sales group and Host Control Protocol (DHCP) is used since a host’s IP the engineering group to merge, resulting in a partition- address may change over time. For smaller networks, ing with two groups: one containing all the servers, and a simple solution such as using DNS names as unique one containing all other hosts. In most situations, this identifiers and dynamically updating the changes of IP grouping would be undesirable since the network ad- addresses may be sufficient [26]. This problem of as- ministrators lose the important separation between the signing a unique identifier to each host within enterprise Sales machines and the Eng machines. For these rea- ) networks is beyond the scope of this paper. sons, groups with high  values are required to have The connection habits of a host may change as a re- a higher similarity measure to merge. We discuss how sult of the following events: i) new host arrivals, ii) ex-

best to choose the constants in Section 6. isting host removals, and iii) role changes by existing £

SIMILARITY computes the similarity (between hosts. Due to a combination of these events, some exist-

££

and ) of connection patterns between two groups. ing hosts may communicate with different sets of hosts

 ¤¢¡§ ©£¡ 

9 returns the total number of connections be- and thus the results of the grouping algorithm before and

¡§ ¡ ¡ tween and , where 2 could either be a host or a after these events may be different as: i) new groups are group. Two groups are considered similar if they have formed, ii) existing groups are deleted, iii) the member many common neighboring groups and similar average compositions of some groups change, and iv) the con-

numbers of connections. For example, if the set of nection sets of some groups change. The changes affect

§  neighbors of  is a subset of the set of neighbors of , not only the hosts directly involved in the aforemen-

it increases the desirability of merging these two groups. tioned events but also to other hosts whose connection §

However, if the average numbers of connections of  habits have not changed in a logical sense. and  are quite different, the desirability of merging Hypothetically, if we know the exact sequence of ev- them is lessened. ery single change event that happened between two ex- ecutions of the role classification algorithm, the results 5 Role Correlation of the first execution could be incrementally updated to achieve the new results. Having such a change log, al- though not impossible, can complicate the network data Over time, connection habits may evolve as new gathering process. More importantly, a detailed change servers and employees are added while some existing log cannot always lead to correct ID correlations. ones leave. Sometimes hosts may behave erratically as Consider the example network in Figure 1. As- a result of being victims or villains of denial of service sume that Sales-1 and Eng-1 switch roles as a result (DOS) attacks. Due to any of these behaviors, the group- of personnels switching jobs or changing machines. ing algorithm may produce a different set of groups than Sales-1 now communicates with SourceRevisionCon- the one produced by the algorithm a few days ago. As trol whereas Eng-1 communicates with SalesDatabase. explained in Section 4, the grouping algorithm assigns From the change log, it would seem that the con- an integer ID to each group of hosts that it identifies. nection sets of both SourceRevisionControl and Sales- There is no guarantee that the sets of IDs produced by Database change whereas in reality, their logical roles two runs of the grouping algorithm will have any cor- never changed. The difficulty here is in distinguishing relation between them. This situation is clearly unde- which changes in connection patterns are the primary sirable to the users who may want to associate logical causes that result in differences in group formations be- names and policy settings to the group IDs and preserve tween two executions of the grouping algorithm. Fur- these group specific data throughout the executions of thermore, there may also be natural changes in connec- the grouping algorithm at various times. tion patterns of many nodes. For instance, an existing In this section, we describe in detail the group cor- server machine may be replaced by two new machines relation algorithm that takes the two sets of results pro- that do load sharing among machines. The log- duced by the grouping algorithm and correlates the IDs ical roles of the client machines have not changed but of one set with that of the other so that the two groups, their observed connection patterns have. The rest of this one in each set of resulting groups, will have the same section describes the role correlation algorithm that does ID if and only if the machines in both groups are highly not rely on the change log but rather uses the same set of likely to share the same logical role. information (i.e. only connection sets) made available to the grouping algorithm. by heuristically computing the time-varying similarity measure and comparing against the pre-defined thresh- 5.2 The Role Correlation Algorithm old. The group correlation algorithm operates as fol- lows:

The correlation algorithm operates by comparing the

  £¢¨§

results of two executions of the grouping algorithms. 1. For each group  , identify and as the

 £¢1§

same group if i)  has the strongest time-

¡ £¢¨§ ¡

Let  and be the group sets generated by the



¤¦¥ ¤

grouping algorithm at time and respectively. The varying similarity with  , among all the groups

$

£¢¨§

in  and ii) the average number of connections §

correlation algorithm updates the ID set of  , so that

2

) )

¡ ¡ 

is at least within  percent of the average number

¨ ¨ ©

 £¢¨§  £¢¨§

!   

, where  and ,



£¢¨§

of connections of  .

£¢¨§  if and only if  and considered to represent the

same logical role. More specifically, the connection pat-

 © £¢¨§

2. For each group pair ¤ that remain uncor-

£¢¨§ 

terns of the members of  and those of are very

 £¢1§ related, decide whether  and represent

similar. The groups correlation algorithm correlates the

) )

 ¡ 

¡ the same logical group based on how similar the ¨ ©

¨ and in a meaningful manner and thus al-



 connection patterns between and its neighbor

low applications to preserve data specific to a particular

£¢¨§ groups are to those between  and its neighbor group. groups.

The role correlation algorithm:

£¢¨§

  Isolates the primary events, such as node arrivals Step 1 decides whether the two groups and and removals, that directly affects the connection are identical based on the time varying similarity mea- habits of groups, sure. As in Section 4.2, we compute the similarity mea- sure based on the average number of connections be-

Identifies nodes that have not changed their neigh- tween the groups and their common neighbors. How-

bors, ever, finding the common neighbor set between  and

 £¢1§

 is not trivial. This is because we cannot simply



 9:¤ 

Heuristically computes the time-varying similar- assume that a neighbor ¥ and a neighbor



 £¢¨§ 9:¤  £¢1§  ity between the connection habits of two groups ¥

) are the same host even if they have the

¡ 



¤ ¤¦¥ ¨ formed at times and , and assigns !

) same host identifier. We use the following techniques to

¡

¨ © if and only if the role of hosts (in terms on identify the common neighbor set:



£¢1§

their connection patterns) in  can be consid-





ered identical to the role of hosts in . If a neighbor ¥ of shares the same host identi-

£¢1§ £¢¨§  fier with the neighbor ¥ of and both have

First, the correlation algorithm eliminates the differ- ¡

¡ been considered highly likely to be the same host



£¢1§

ences between the two host sets, and , so that it

£¢¨§ 

0 ©¥ ¥ (i.e. ¥ ), we assume is the

can compare the connection patterns meaningfully. The

£¢1§ £¢¨§

¥  neighbor to  in the same way as is to .

algorithm computes the set of nodes that existed at time

¡ ¡



¤¦¥ ¤

£¢1§

 (©¥ £¢¨§(

but have been removed in time ( ), and For each neighbor pairs ¤¥ that are not con-

¡ ¡

¤

£¢1§

the set of nodes that only appear at time (  ). sidered as highly likely to be the same host, we as-

¡

 

All new nodes are removed from and deleted nodes sume ¥ is the neighbor to in the same way as

¡

£¢¨§

£¢1§ £¢1§ 

are removed from . Thus, the changes in connec- ¥ is to if and only if the following con-

$

£¢¨§

tion set of each host is only as a direct result of changing dition is true. The connection set size of ¥ is

2

¥

connection patterns between the host and its neighbors within  percent of that of and no other neigh-

¤

£¢¨§

(which existed at time ). bor of  has the connection set size closer to

Second, the algorithm heuristically identifies the set, that of ¥ . 

same, of nodes that are very like to play the same log-



¤¦¥ ¤ 

ical roles from to . We say that the two nodes ¥ The algorithm then computes the time-varying similarity

£¢1§

¥

£¢¨§

©¥  and are highly likely to be the same machine (i.e. measure between each neighbor pair ¤¥ , which

it hasn’t changed its logical role) if they have the identi- meets the aforementioned requirements, as the mini-



£¢¨§ 

! ;7¥ ¢  ¥ 

cal connection sets. Specifically, same mum of the average number of connections between ¥

¡

£¢¨§ £¢1§

©9:¤¦¥  ! 9:¤¦¥  >

¥ £¢1§  £¢1§

. We will explain shortly how and  and between and . If the sum of

  we use the fact that a host ¥ same to our advantage in the similarity measures for all common neighbor pairs

computing the time varying similarity measure. within the bounds of the specified thresholds , the algo-

£¢1§ 

The role correlation algorithm will determine rithm declares that groups  and mean the same

£¢1§  whether the two groups  and are the same group group. 6 Results the figure). On the other hand, almost every sales host

in group ©¥ communicates with both the Microsoft Ex- 

In this section, we evaluate the performance of the change server and the NT sever from group ¥ , but not algorithms using traces gathered over a day at two cor- with the Unix mail server nor the source revision control

porate networks. We show that the algorithms operate server. In fact, there are just two connections between

£ well for both networks and examine the effects of user- group §¥ and each of groups and . The four engi- defined thresholds on the results of the role classification neering hosts in §¥ had connection patterns very sim- algorithm. ilar to those of sales hosts, so they were grouped ac- We call the two test networks Mazu and BigCom- cordingly. Most probably, these machines are used by pany. Mazu is part of the corporate network at Mazu engineering managers who do not perform engineering- Networks, Inc., in Cambridge, MA. It consists of related tasks such as coding, and use the Exchange mail 110 hosts, including engineering workstations, several server instead of the Unix mail server. servers, and laptops. Mazu develops various software If we have the perfect knowledge of the logical struc- products in the area of network security and monitoring. ture of the network, we can use that knowledge to quan- The BigCompany network consists of 3638 hosts, in- tify the resulting quality of the groups produced by cluding workstations, servers, and many IP phones. For grouping algorithms. One simple yet effective metric privacy reasons, BigCompany must remain anonymous. used in the cluster validation literature [16, 12] is Rand Statistic, which is based on testing whether a pair of ob- 6.1 Effectiveness of the Grouping Algorithm jects belongs to the same group as decided by the group-

ing algorithm and according to our knowledge. Let

 We evaluate the effectiveness of the role classifica-  and be the partitionings of hosts produced by a

tion algorithm by comparing the groups formed by the grouping algorithm and based on our knowledge respec-    

/ /

algorithm against the logical roles that hosts play as de- tively. Let /"/ , , and be the numbers of host



termined by knowledgeable network administrators. For pairs that belong to the same group in both  and ,

  $

all the experiments, unless$ otherwise noted, we set user- to the same group in and to different groups in , to

£

2 2



different groups in  and to the same group in , and

!¡ © /RQTS £¢¤¢! ©  !£¥

defined thresholds, / and .



 /

We examine how these thresholds affect the results in to different groups in both  and respectively.



 

later sections. and / are indicative of how different is from .

£

! ©§

Figure 4 shows some of the groups formed by the Rand Statistic  is between and



©¦¦§

 role classification algorithm running on the Mazu data . The higher the value, the more similar  and are. and configured with the default parameters. Each cir- For the Mazu network, we were able to ascertain the

cle in the figure represents a group and lists its members logical roles of all except 8 hosts. We worked closely and its connections with other groups. Where possible, with the Mazu network administrator to obtain  , the we have indicated the logical role of each host, which ideal partitioning of hosts. We find that the partition-

we obtained by asking the Mazu network administrator. ing produced by the grouping algorithm (with default

 

£

!©¢¤ ©/ !¥ © / !

(Of course, this logical information was not used in con- parameters) achieves /"/

 

 £ 

 © ! ¤ ¦¢  !   structing the grouping.)  and . This shows that Observe that the role classification algorithm placed the results of the grouping algorithm reflect to a high almost all engineering (eng) machines in a single group, degree our intuitive notion of the underlying structure of

¦¢ . Also note that the number of connections of an en- the network. We note that the reason for having a rela-  gineering host varies from 4 to 9. Similarly, most ma- tively high / is because the algorithm identifies sub- chines used by sales, management (admin) and opera- sets of hosts within large groups as separate groups. For

tions (ops) were placed in a single group, §¥ . The largest instance, the grouping algorithm produces a few differ- £

group, , contains new machines and test machines in ent groups, each containing a single eng machine instead £ the lab. of putting them in group (not shown in the figure).

However, four hosts that are identified as engineering This is because those eng machines have the total num- ¦¢

machines are placed in group §¥ rather than group . ber of connections far greater than the average number £

The reason is that these machines do not communicate of connections that a host in group does. Such dis-

with a set of hosts that engineering machines in group tinction may prove useful in certain situations. £

¨ communicate with. As shown in Figure 4, each engi- Table 1 lists the five largest groups produced by run- £

neering machine in group ¨ has, on average, one con- ning the grouping algorithm on the BigCompany net- £ nection with group  , which consists of a Unix mail work. Again, we relied on information generated by the server, and one connection with group , which con- network administrator to help us understand whether the sists of a source revision control server (not shown in groupings generated by the algorithm matched the log- Group 80(1) 10.7.1.54 (new_machine): 2 10.7.1.12 (new_machine): 3 Group 87(2) 10.7.1.14 (new_machine): 2 10.0.0.81 (?): 3 10.7.1.51 (new_machine): 2 10.0.0.70 (sales): 3 10.7.1.53 (new_machine): 2 Group 85(4) 10.0.0.72 (?): 2 10.7.1.11 (new_machine): 2 10.0.0.94 (eng): 8 10.0.0.110 (admin): 5 10.7.1.13 (new_machine): 2 10.0.0.102 (eng): 4 10.0.0.157 (sales): 5 Group 71(25) 10.0.0.19 (lab): 1 10.0.0.91 (eng): 10 10.0.0.132 (eng): 6 10.0.0.18 (mazu_nt): 48 10.2.0.18 (lab): 1 10.0.0.124 (ops): 11 10.0.0.115 (admin): 5 10.0.0.26 (ms_exchange): 25 10.0.0.52 (lab): 1 10.0.0.118 (eng): 4 10.0.0.162 (sales): 5 10.2.0.20 (lab): 1 10.0.0.114 (eng): 9 10.0.0.122 (eng): 4 ----comm with---- 10.1.0.69 (lab): 1 10.0.0.92 (eng): 7 10.0.0.140 (admin): 4 Group 10(9) Group 6: 0.5 10.1.0.23 (lab): 1 10.0.0.93 (eng): 4 10.8.0.11 (test_bed): 3 10.0.0.16 (unix_mail): 16 Group 8: 1 10.2.0.42 (lab): 1 10.0.0.103 (eng): 5 10.0.0.120 (ops): 6 Group 9: 1 10.2.0.17 (lab): 1 10.0.0.109 (eng): 5 10.0.0.116 (admin): 4 ----comm with---- Group 10: 1 10.2.0.19 (lab): 1 10.0.0.123 (eng): 4 10.0.0.101 (ops): 6 Group 8: 1 Group 17: 0.5 10.1.0.70 (lab): 1 10.0.0.143 (eng): 5 Group 9: 1 Group 21: 0.5 10.7.100.39 (new_machine): 1 ----comm with---- 10.0.0.134 (admin): 3 Group 17: 1 Group 22: 1 10.1.0.14 (lab): 1 Group 4: 0.909091 10.0.0.146 (eng): 3 Group 71: 2 Group 28: 0.5 10.7.7.250 (new_machine): 2 Group 6: 0.909091 Group 85: 10 Group 69: 0.5 10.0.0.21 (build): 4 Group 8: 0.0909091 ----comm with---- Group 87: 1 Group 71: 0.5 10.1.0.71 (lab): 3 Group 10: 0.909091 Group 4: 0.823529 Group 76: 0.5 10.1.2.18 (lab): 3 Group 21: 0.272727 Group 6: 0.0588235 Group 80: 0.5 10.2.0.24 (lab): 2 Group 27: 0.181818 Group 10: 0.0588235 Group 84: 4 10.0.0.29 (lab): 3 Group 67: 0.181818 Group 22: 0.117647 Group 85: 7.5 10.2.0.23 (lab): 2 Group 71: 1.36364 Group 28: 0.0588235 Group 87: 16.5 10.1.0.12 (lab): 2 Group 76: 0.363636 Group 30: 0.0588235 Group 80: 0.363636 Group 54: 0.0588235 ----comm with---- Group 84: 0.636364 Group 64: 0.0588235 Group 4: 1 Group 86: 0.272727 Group 71: 1.94118 Group 6: 0.296296 Group 76: 0.764706 Group 8: 0.037037 Group 84: 0.117647 Group 9: 0.259259 Group 87: 0.0588235 Group 71: 0.037037 Group 85: 0.148148

Figure 4. Grouping results based on data gathered over one day at Mazu. The number in parentheses next to )

the group ID is the group’s  . The number next to each host is a count of the host’s connections. Each line after “comm with” denotes a neighbor group and the average number of connections between the group and that

neighbor.

£ £

ical structure of the network. Group  consists of idle), and that is to a host that opens connections to about

desktops whose IP addresses are managed by the DHCP 1,600 machines. With our help, BigCompany is cur-

£ £

server. Almost every machine in group  communi- rently investigating why this host scans about 45% of all

cates with approximately 85 of the machines in group the machines on the network. This example is another

£

¥¤¢ , and vice-versa. This pattern suggests that it was good use of how the role classification algorithm can

appropriate for the grouping algorithm to combine the be applied to understand networks and detect anomalous

£

machines in group ¥¤¢ , which use static IPs, into a behavior. group. Most machines in both groups run Microsoft Table 2 summarizes the grouping results of the two Windows. The high number of connections between networks. Observe that the number of groups in the the groups is due to Windows file sharing, which uses BigCompany network is 26 times smaller than the num- the NetBIOS network protocol. File sharing creates a ber of hosts. Unfortunately, we cannot use Rand Statis- large number of connections between the hosts in the tic to quantify the quality of the groups produced by two groups, even though in both groups there is little the grouping algorithm since we don’t have the perfect intra-group communication. We continue to investigate knowledge of the logical roles of each machine in the this interesting relationship between the two groups. It BigCompany network. Nevertheless, the network ad- is striking, and further proof of the need for better analy- ministrators at BigCompany report that they find them sis tools, that the network administrators we have talked both useful and consistent with their intuitions about to themselves don’t know why the groups are partitioned their networks. We are also in the process of analyzing in this way. a larger network owned by HugeCompany that consists of 49,041 hosts.

The grouping algorithm also correctly classifies all

£¤¨ 

¦ IP phones into one group,  . Group consists

of web servers and other servers that desktops in group 6.2 Effectiveness of the Correlation Algorithm

£ £ £

  regularly communicate with. Group is the largest group, with 1519 members. Most machines in This subsection shows that for a specific scenario, the this group have a single connection (hence the role name role correlation algorithm associates new groups with Group 82(1): old 80 10.7.1.54 (new_machine): 2 10.7.1.12 (new_machine): 3 Group 88(2): old 87 10.7.1.14 (new_machine): 2 10.0.0.81 (?): 3 10.7.1.51 (new_machine): 2 Group 87(4): old 85 10.0.0.70 (sales): 3 10.7.1.53 (new_machine): 2 10.0.0.94 (eng): 8 10.0.0.72 (?): 2 10.7.1.11 (new_machine): 2 10.0.0.200 (new-eng): 4 10.0.0.157 (sales): 5 10.7.1.13 (new_machine): 2 10.0.0.102 (eng): 4 10.0.0.132 (eng): 6 Group 71(24): old 71 10.7.7.250 (new_machine): 2 10.0.0.91 (eng): 10 10.0.0.115 (admin): 5 10.0.1.18 (new-mazu_nt): 48 10.2.0.18 (lab): 1 10.0.0.124 (ops): 11 10.0.0.162 (sales): 5 10.0.0.16 (ms_exchange): 24 10.0.0.52 (lab): 1 10.0.0.118 (eng): 4 10.0.0.122 (eng): 4 10.2.0.20 (lab): 1 10.0.0.114 (eng): 9 10.0.0.140 (admin): 4 ----comm with---- 10.1.0.69 (lab): 1 10.0.0.92 (eng): 7 10.8.0.11 (test_bed): 3 Group 10(10): old 10 Group 6: 0.5 10.1.0.23 (lab): 1 10.0.0.93 (eng): 4 10.0.0.120 (ops): 6 10.0.0.26 (unix_mail): 17 Group 8: 1 10.2.0.42 (lab): 1 10.0.0.103 (eng): 5 10.0.0.116 (admin): 4 Group 9: 1 10.2.0.17 (lab): 1 10.0.0.109 (eng): 5 10.0.0.101 (ops): 6 ----comm with---- Group 10: 1 10.2.0.19 (lab): 1 10.0.0.123 (eng): 4 10.0.0.143 (eng): 5 Group 8: 1 Group 16: 0.5 10.1.0.70 (lab): 1 10.0.0.134 (admin): 3 Group 9: 1 Group 21: 0.5 10.7.100.39 (new_machine): 1 Host Previous New ----comm with---- 10.0.0.146 (eng): 3 Group 16: 1 Group 22: 1 10.1.0.14 (lab): 1 Group 4: 0.833333 Group 71: 2 Group 29: 0.5 10.0.0.19 (lab): 2 Group 6: 0.916667 ----comm with---- role role Group 87: 11 Group 69: 0.5 10.0.0.21 (build): 4 Group 8: 0.0833333 Group 4: 0.8125 Group 88: 1 Group 71: 0.5 10.1.0.71 (lab): 3 Group 10: 0.916667 Group 6: 0.0625 Group 80: 0.5 10.1.2.18 (lab): 3 Group 21: 0.25 Group 10: 0.0625 10.0.0.16 unix mail ms exchange Group 82: 0.5 10.2.0.24 (lab): 2 Group 28: 0.166667 Group 22: 0.125 Group 86: 4 10.0.0.29 (lab): 3 Group 67: 0.166667 Group 29: 0.0625 Group 87: 8 10.2.0.23 (lab): 2 Group 71: 1.33333 Group 31: 0.0625 10.0.0.26 ms exchange unix mail Group 88: 15.5 10.1.0.12 (lab): 2 Group 80: 0.333333 Group 55: 0.0625 Group 82: 0.416667 Group 65: 0.0625 ----comm with---- 10.0.0.18 mazu nt Group 86: 0.583333 Group 71: 1.9375 - Group 4: 1 Group 89: 0.25 Group 80: 0.6875 Group 6: 0.296296 Group 86: 0.125 Group 8: 0.037037 10.0.1.18 - mazu nt Group 88: 0.0625 Group 9: 0.259259 Group 71: 0.037037 10.0.0.200 - eng Group 87: 0.185185 10.0.0.110 admin -

Figure 5. The grouping results on the Mazu network with several changes (see table) to the connection patterns. The number next to “old” represents the ID of the correlated group shown in Figure 4.

Group ID Members Logical Role how the member compositions of these four groups 1043 1490 Idle change from the ones in Figure 4. Both unix mail and 1020 158 DHCP-Desktops ms exchange are correctly identified in the same fashion 1138 396 Servers as in Figure 4 despite their role reversal. The new NT 1092 167 IP-Phones server (new-nt server) appropriately takes the place of 1075 156 StaticIP-Desktops the old one. Similarly, a new eng host is grouped with other eng machines. Despite various changes to the con-

Table 1. The five largest groups classified in Big- nection patterns, the role correlation algorithm was able

¤ Company network that consists of  hosts. to correctly associate each new group with an existing Logical role is identified by knowledgeable network one. We continue to investigate the limits of the role administrators at BigCompany. correlation algorithm under rigorous changes in connec- tion patterns.

6.3 Configuration existing ones in an appropriate way. Figure 5 lists The algorithms use two internal constants that we be- the scenario we investigate. In the Mazu network, we lieve are not sensitive to particular network connection swapped the roles of unix mail and ms exchange by patterns. The group formation phase of the role classi-

switching their IP addresses. We also replaced the old fication algorithm (see Section 4.1) requires a constant



   NT server, called mazu nt (10.0.0.18), with a new server £ to keep a host ¥ from forming groups with (10.0.1.18). Finally, we removed an old admin ma- other hosts that have less than a fraction  of the num- chine (10.0.0.110) and brought in a new eng machine ber of connections that ¥ has. The group merging phase

(10.0.0.200). Although the specific scenario is just one keeps the two groups from merging if the average num-

¡   of many possible ones, it includes the types of changes ber of connections of a group is not within £

that could happen in a real network. of the other’s (see Figure 3).



£  £ 

! ¢ The modified connection patterns were used as in- We set ! and . Our experiments with puts to the role classification algorithm. The role corre- both Mazu and BigCompany networks indicate that the lation algorithm then correlated the new grouping results default values work well on at least two rather differ- with the original results. Every group in the new results ent networks. We believe that, in general, it will not be is correlated with an old group. Figure 5 depicts the necessary to adjust these constants. Nevertheless, we four groups that are affected by the changes. Observe plan to expose these parameters to network administra- 500 Mazu split. The extent to which such splits occur varies from 450 BigCompany network to network. A knee in the curve indicates that 400 350 the algorithm can expose the logical structure of the net- 300 work in two significantly different manners. Consider

250 QTS again the network in Figure 1. If / is too low, Mail, 200 Web, SalesDatabase, and SourceRevisionControl will all 150 Number of Groups 100 be placed in one group, whereas all sales and engineer- 50 ing machines will be placed in another. In some cases, 0 0 10 20 30 40 50 60 70 80 90 100 such grouping might be more appropriate than the one Similarity Threshold achieved in Figure 1. Network administrators should

compare the grouping results on both sides of the knee

RQ S Figure 6. Number of Groups vs. /

and decide which one better suits their needs. $

£

2

$



Our experiments show that as long as / ,

2 $ 500 changes to / hardly affect the grouping results. There-

Mazu 2 $ 450 BigCompany fore, we suggest that / be fixed.

400 2 On the other hand, the choice of  has a significant 350 impact and $ should probably vary from network to net-

300 2

250 work. If  is set to the maximum number of connec-

200 tions that any host has, the similarity measure$ between £ 150 2

Number of Groups

$

QTS  ! hosts is only compared against / . If , the sim-

100

2 $

50 ilarity measure is only compared against / . Ideally, 2

0  should be set at a value that partitions the hosts in 0 5 10 15 20

K Threshold the network into two groups, one containing all server- $

like machines, and one containing$ all others.

2

2  Figure 7. Number of Groups vs. Figure 7 shows how  affects the number of groups formed. For any two data points with the same number of groups, the grouping results are identical.

tors so that they can adjust them along with the similarity Clearly, the grouping$ results do not change for the Mazu

2

  thresholds to achieve grouping results that most reflect network when  . Similarly, the grouping re-

their intuition of the network structure. sults$ hardly change for the BigCompany network when

2

$



 . This implies that it is not too difficult to find

2 $

6.4 Effects of Similarity Thresholds an appropriate  for a particular network. By default,

2

!£¥ we set  and believe that this value will be suit-

able for most networks. Nevertheless,$ we are currently $

In this subsection, we examine$ how the choice of the

2

2 2

working on automatically setting  .

Q S /  user-defined thresholds, / , , and , affect the number of groups formed by the role classification algo-

rithm. Recall that the two groups are merged if and only 6.5 Run Time

$ QTS

if their similarity measure is %/ . Furthermore, if the

)

2

$ Table 2 shows the time taken to run the role classi-

  maximum  associated with the groups is , they

2 fication algorithm on the Mazu and BigCompany net- $

are not merged unless their similarity measure is ,/ .



£   ££

2 works. We performed our experiments on a Linux ma-

RQ S / We require that / .

chine equipped with a 2GHz Intel Xeon processor and QTS Figure 6 illustrates how / affects the total number of groups formed for both Mazu and BigCompany net- 4GB of memory. The run time achieved by the algo-

rithms grows quadratically with the number of nodes QTS works. The number of groups increases with / . Again,

and is acceptable for use in commercial enterprise net- QTS a large / value keeps more groups from merging and as a result, the total number of groups remains large. work monitoring and analysis tools. We continue to fur-

The number of groups may not increase smoothly ther improve the performance of the algorithms. QTS with the increase in / . For instance, there is steeper

incline (knee) in the number of groups of BigCompany 7 Related Work QTS

network when / is increased from 70 to 90. The

Q S reason is that the increase in / causes some groups The work described in this paper was implemented with high numbers of connections to split, since they in part using Click [21], a modular system that no longer meet the stronger similarity requirement to makes it easy to build efficient packet processing de- merge. This in turn causes several neighboring groups to vices on commodity PC hardware. The grouping al- Network Hosts Groups Run time(s) As explained in Section 2, role grouping is well- Mazu 110 25 0.069 suited to improving network monitoring and policy man- BigCompany 3638 137 63 agement. An entire industry [8, 15, 17] caters to en- HugeCompany 49041 1374 2101 terprises’ network management needs, and much liter- ature is devoted to network monitoring, traffic report- Table 2. The summarized grouping results for Mazu ing, and performance measurement [13, 20, 23, 24]. All and BigCompany networks. this work differs significantly from ours. The commer- cial network management systems are primarily inte- gration and alerting tools, intended to provide operators with a unified view of disparate devices on the network. gorithm only requires information about connections They serve as conduits for the raw data, but do not ex- among hosts, so it can obtain data from a variety of tract higher-level semantics such as role relationships. sources, from summary formats like RMON [28] and Academic work has focused on network monitoring and Netflow [6] to packet-level sniffers like tcpdump [18]. techniques for performance measurement, but again, the Part of the role classification algorithm can be viewed interpretation of data is generally left to humans. as a data clustering algorithm. Data clustering has been Another tool that can help operators understand their an active area of research for a few decades [14, 19] and networks is network visualization [1, 5, 11]. Visual- is known to be a difficult problem combinatorially. The ization focuses on graphic design and automated lay- techniques used to cluster data vary widely according out algorithms to help users digest the vast amount of to the assumptions, and contexts specific to application data generated by network monitoring tools. Unlike the domains and many existing techniques are specifically grouping algorithm, these techniques have no notion of developed for pattern recognition and image analysis. the logical structure of the network. However, they can In general, a data clustering algorithm attempts to clus- complement grouping, exposing grouping information ter data points or patterns, each of which is represented to the user and using grouping information to make bet- by a vector of real numbers. Patterns that are similar ter decisions about visual layout. to each other are clustered together. The most popular metric for similarity measure is the Euclidean distance. 8 Summary One well-known clustering technique is the hierarchi- cal agglomerative clustering technique. The idea is to This paper has presented two practical algorithms merge clusters based on the pair-wise similarity measure (grouping and correlation) that group hosts on an enter- of patterns. The merging process is stopped according to prise network into roles according to their observed con- some predefined similarity thresholds. In this aspect, the nection patterns. The first algorithm partitions hosts on group merging phase of the role classification algorithm the network into groups based on connection data. The can be classified as a hierarchical agglomerative cluster- second algorithm meaningfully correlates the results ob- ing technique. tained by running the first algorithm at different times, The main reason why traditional data clustering al- taking into account the evolution of connection patterns gorithms cannot be easily extended for our application over time. domain is because it is difficult to represent the con- To our knowledge, the problem of automatically nection pattern of each host with a vector of numbers grouping and classifying hosts based on their behavior in such a way that the widely used Euclidean distance on the network has not been addressed before. This to measure the similarity between two connection pat- paper formulates the problem by presenting an abstract terns makes sense. Furthermore, we can leverage the model in addition to the concrete algorithm specifica- communication patterns found in typical enterprise net- tions. The general framework we have developed ac- works, such as client-server communications, to achieve commodates other classification algorithms in addition more meaningful grouping results. We also note that to the ones we have described. traditional data clustering techniques do not deal with Grouping hosts according to their connection habits temporal correlation of clusters as the role correlation exposes the logical structure of the network, and can algorithm does. serve to improve understanding of the network and to The role classification algorithm is applicable to net- simplify a variety of network management tasks. It can work intrusion detection. For example, grouping infor- also improve the accuracy of automated tools, such as mation provides context that can be used by intrusion systems for network monitoring and intrusion detection. detection systems [10, 22] (IDS) to help determine how Experience with the algorithms on two corporate net- unusual (and hence potentially suspicious) a certain net- works, one with about 100 hosts and one with over 3600 work behavior is (see Section 2). hosts, indicates that they work well. They are easy to tune, and produce results that are meaningful and con- [11] EICK, S. G., AND WILLS, G. J. Navigating large net- sistent with the intuition of experienced network admin- works with hierarchies. In Proceedings of the IEEE Con- istrators. Importantly, our experience on the corporate ference in Visualization (25–29 1993), pp. 204–210. networks has shown that automated classification algo- [12] HALKIDI, M., BATISTAKIS, Y., AND VAZIRGIANNIS, rithms such as these can play an important role in as- M. On clustering validation techniques. Journal of In- sisting network administrators. The algorithms are also telligent Information Systems 17, 2-3 (2001), 107–145. fairly efficient, and their performance remains practical [13] HARRISON, H. E., MITCHELL, M. C., AND SHAD- even for networks with several thousand hosts. DOCK, M. E. Pong: A flexible network services moni- Much work remains to be done. We plan to continue toring system. In Proceedings of the USENIX 8th System improving the performance of the algorithm. The ideal Administration Conference (LISA VIII) (San Diego, CA, solution should be better than quadratic time complex- Sept. 1994). ity, since that could eventually be the limiting factor on [14] HARTIGAN, J. Clustering Algorithms. John Wiley, 1988. very large networks. We will also explore other defini- [15] HEWLETT-PACKARD COMPANY. OpenView. http: tions of host similarity for grouping. For instance, one //openview.hp.com. could consider incorporating services (such as TCP or [16] HUBERT, L., AND ARABIE, P. Comparing Partitions. UDP port information) or protocols into the definition of Journal of Classification 2 (1985), 193–218. a connection, so that a web server would not be grouped [17] INTERNATIONAL BUSINESS MACHINES. Tivoli Soft- with a mail server. In addition, we have yet to ex- ware. http://tivoli.com. plore many of the applications of automatically-derived [18] JACOBSON, V., LERES, C., AND MCCANNE, S. tcp- grouping information, which include network manage- dump. http://www-nrg.ee.lbl.gov,1991. ment, provisioning, security, and perhaps others. [19] JAIN, A. K., MURTY, M. N., AND FLYNN, P. J. Data clustering: a review. ACM Computing Surveys 31, 3 References (1999), 264–323. [20] KEYS, K., MOORE, D., KOGA, R., LAGACHE, E., TESCH, M., AND CLAFFY, K. The archi- [1] BECKER, R. A., EICK, S. G., AND WILKS, A. R. Visu- tecture of CoralReef: an Internet traffic monitor- alizing network data. IEEE Transactions on Visualization ing software suite. In PAM2001 Workshop on Pas- and Graphics 1, 1 (1995), 16–28. sive and Active Measurements (Apr. 2001), CAIDA. [2] CERT COORDINATION CENTER. CERT Advisory CA- http://www.caida.org/tools/measurement/coralreef/. 2001-19: Code Red Worm Exploiting Buffer Overflow [21] KOHLER, E., MORRIS, R., CHEN, B., JANNOTTI, J., In IIS Indexing Service DLL, July 2001. AND KAASHOEK, M. F. The Click modular router. ACM [3] CERT COORDINATION CENTER. CERT Advisory Transactions on Computer Systems 18, 3 (August 2000), CA-2001-26: Nimda Worm. http://www.cert. 263–297. org/advisories/CA-2001-26.html, Septem- [22] MUKHERJEE, B., HEBERLEIN, L., AND LEVITT, K. ber 2001. Network intrusion detection. IEEE Network 8, 3 (May- [4] CHESWICK, B., AND BELLOVIN, S. M. Firewalls and June 1994), 26–41. Internet Security: Repelling the Wily Hacker. Addison [23] PATARIN, S., AND MAKPANGOU, M. Pandora: a flexi- Wesley, 1994. ble network monitoring platform. In Proceedings of the [5] CHESWICK, B., BURCH, H., AND BRANIGAN, S. Map- 2000 USENIX Annual Technical Conference (San Diego, ping and visualizing the Internet. In Proceedings of the CA, June 2000). 2000 USENIX Annual Technical Conference (San Diego, [24] PLONKA, D. Flowscan: A network traffic flow re- CA, June 2000). porting and visualization tool. In Proceedings of the [6] CISCO SYSTEMS. Cisco IOS NetFlow. http:// USENIX 14th System Administration Conference (LISA cisco.com/warp/public/732/netflow. XIV) (New Orleans, LA, December 2000). [7] CISCO SYSTEMS. Dynamic VLANs. http:// [25] SIPSER, M. Introduction to the Theory of Computation. cisco.com/univercd/cc/td/doc/product/ PWS Publishing Company, 1997. lan/cat5000/rel_5_2/config/vmps.htm. [26] SNOEREN, A. C., AND BALAKRISHNAN, H. An End- [8] COMPUTER ASSOCIATES INTERNATIONAL. Unicenter. to-End Approach to Host Mobility. In Proceedings of the http://cai.com/unicenter. ACM Conference on Mobile Computing and Networking (Boston, MA, 2000). [9] CORMEN, T. H., LEISERSON, C. E., AND RIVEST, [27] TARJAN, R. Data Structures and Network Algorithms. R. L. Introduction to Algorithms. The MIT Press and Society for Industrial and Applied Mathematics, 1983. McGraw-Hill, 1990. [28] WALDBUSSER, S. Remote Network Monitoring Man- [10] DENNING, D. E. An intrusion-detection model. IEEE agement Information Base Version 2. Network Working Transactions on Software Engineering 13, 2 (Feb. 1987), Group RFC 2021, Jan. 1997. 222–232.