Improviesed Genetic Algorithm for Fuzzy Overlapping Community Detection in Social Networks
International Journal of Advanced Computational Engineering and Networking, ISSN: 2320-2106, Volume-5, Issue-9, Sep.-2017, http://iraj.in

1HARISH KUMAR SHAKYA, 2KULDEEP SINGH, 3BHASKAR BISWAS
1,2,3Indian Institute of Technology, Varanasi (UP), India

Abstract: Community structure identification is an important task in social network analysis. Social communication takes place within a social context, and communities are a fundamental form of social context. Social network analysis is an application of web mining, which is itself an application of data mining. A social network is a structure made up of a set of social actors such as persons or organizations, sets of pairwise ties, and other social interactions between actors. Community detection in social networks is currently a very active area of research. In this paper, we use an improvised genetic algorithm for community detection in social networks, combining roulette wheel selection with the square quadratic knapsack problem. We ran experiments on several datasets, including Zachary’s karate club [31], American college football [39] and the Dolphin social network [32], among others. All are verified and well-known datasets in social network analysis research. Experimental results show an improvement in the convergence rate of the proposed algorithm, and the discovered communities are of high quality.

Index terms: community detection, genetic algorithm, roulette wheel selection

I. INTRODUCTION

Social network analysis is a branch of data mining that involves finding structure or patterns among a set of individuals, groups and organizations. A social network represents these societies as a graph, with the individuals as the vertices and the relationships among the individuals represented by the edges [1]. Certain individuals of a social network are said to belong to a community if they have a large number of interconnections. Conversely, two distinct communities have relatively few connections between them. In general terms, community structure is defined as groups of vertices such that the number of edges among the individuals of each group is greater than the number of edges that connect the individuals of this group to the rest of the groups in the network. The complete process of finding the community structure of a given social network is called community detection in social networks [2].

Community structure in a social network indicates important patterns that may remain hidden under ordinary analysis, and thus helps us to better understand many processes and phenomena of social networks and communities. It is also useful when building applications on top of a social network and its communities.

The validity of the methods used for community detection in social networks is based on the “goodness” of the communities found, which is evaluated with the help of a quality function. The problem of identifying communities in a given social network has thus been turned into an optimization problem involving a quality function first defined by Newman and Girvan [3]. This quality function, called modularity, gives an approximate measure of the validity of the community structure formed. A partition is considered better when the number of links inside each module exceeds the number expected in the random case. The best partition of a given social network is therefore the one for which the modularity has its maximum value.

Optimizing the modularity of a community structure is a challenging task, because as the size of the social network increases, the number of possible partitions grows at least exponentially with the size. It has recently been proved that optimizing the modularity of a community structure for a given social network is an NP-complete problem [4], so no polynomial-time algorithm for finding the exact optimum of modularity is known. A good approximate solution can nevertheless be found in polynomial time using various techniques, such as greedy agglomeration [5][6], simulated annealing [7][8][9], extremal optimization [10] and spectral division [11].

Community detection in social networks can be done with the help of a genetic algorithm [12]. The basic structure of a genetic algorithm involves three steps: the selection operation, the crossover operation and the mutation operation [13]. The selection operation involves selecting a certain fraction of individuals as they are and copying them to the next generation. This selection favours the best communities.
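As a rough illustration of the selection step described above, the following Python sketch copies the fittest fraction of a population unchanged into the next generation. It is only a minimal sketch under stated assumptions: the encoding of an individual as a list of community labels (one per node), the function name `elitist_selection`, the `elite_fraction` parameter and the toy fitness function are all hypothetical and are not taken from the paper.

```python
def elitist_selection(population, fitness, elite_fraction=0.1):
    """Copy the fittest fraction of the population unchanged (best first).

    population: list of individuals; here each individual is assumed to be
        a list of community labels, one per node (hypothetical encoding).
    fitness: callable mapping an individual to a number (higher is better).
    elite_fraction: share of the population to keep, in (0, 1].
    """
    if not 0 < elite_fraction <= 1:
        raise ValueError("elite_fraction must be in (0, 1]")
    # Always keep at least one individual.
    n_elite = max(1, int(len(population) * elite_fraction))
    # Rank individuals from most fit to least fit.
    ranked = sorted(population, key=fitness, reverse=True)
    # Return copies so later crossover or mutation cannot alter the originals.
    return [list(individual) for individual in ranked[:n_elite]]


# Toy usage: the fitness here is just the sum of the labels, a stand-in
# for a real quality function such as modularity.
population = [[0, 0, 1], [1, 1, 1], [0, 0, 0], [0, 1, 1]]
survivors = elitist_selection(population, fitness=sum, elite_fraction=0.5)
print(survivors)  # [[1, 1, 1], [0, 1, 1]]
```

In practice the fitness callable would be a community quality function, and the remaining slots of the next generation would be filled by crossover and mutation.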
Keeping the best individuals is consistent with biological genetics, where a certain fraction of the best or most fit individuals is assumed to survive. The crossover operation is like reproduction in biology: it mixes the genes of two individuals. In community detection, two individuals are crossed over to form a new individual in the next generation, which has some of the genes of one parent and some of the other. The mutation operation resembles biological mutation, in which some individuals adapt to form new species with changed genetic material. Similarly, in community detection using a genetic algorithm, we modify the genes or characteristics of some percentage of individuals to form new individuals in the next generation.

Three types of partitions are possible for communities: possibilistic partitions, fuzzy partitions, and crisp (hard) partitions. For fuzzy community partition, fuzzy c-means clustering [14] is used.

II. LITERATURE REVIEW

Community detection is an important research topic in the field of complex networks. Genetic algorithms have been used as an effective optimization technique to solve this problem. The earliest method was the minimum-cut method. Hierarchical clustering came next, followed by the Girvan-Newman algorithm, which was further optimized using modularity maximization. The algorithms in detail are:

Minimum-Cut Method [15]: In this method the number of partitions is predetermined and the network is then divided. The division is made so that the communities are of approximately the same size and the number of intergroup edges is minimized. This method is less than ideal because it finds only a fixed number of communities.

Hierarchical Clustering [16]: In hierarchical clustering, a similarity measure is defined that quantifies a topological type of similarity between nodes. The measures used include cosine similarity, the Jaccard index and the Hamming distance between rows of the adjacency matrix. Similar nodes are grouped into one community. Two methods are used for this grouping: single linkage clustering and complete linkage clustering. In the former, two groups are different if and only if all pairs of nodes in different groups have similarity less than a given value. In the latter, all pairs of nodes within a group have similarity greater than a threshold.

Girvan-Newman Algorithm: Edges that lie between two communities are identified and removed. Identification is performed by a betweenness measure, which assigns a number to each edge. This number is large if the edge lies between many pairs of nodes. This algorithm has time complexity O(m^2 n) for a graph with n vertices and m edges. Its steps are: (1) compute the centrality of all edges, (2) remove the edge with the largest centrality, (3) recalculate the centralities, and (4) repeat the cycle from step 2.

Modularity Maximization [17]: This is the most widely used method. The modularity function measures the quality of the community structure. In the modularity maximization method, we search over all possible divisions of the network and obtain the division with high modularity. This exhaustive search is intractable, so approximate methods are used instead. These approximate optimization methods include greedy algorithms, simulated annealing and spectral optimization. The first greedy algorithm was proposed by Newman; it tends to form large communities. In simulated annealing, a probabilistic procedure is used to search for the global optimum of a function. Modularity optimization fails to detect clusters smaller than a particular size called the resolution limit.

In fuzzy community detection algorithms, we define a soft membership vector, or belonging vector, for each node. It quantifies the strength of the node's association with each community. The following work has been done in this field:

Nepusz used a simulated annealing [18] method. He converted the problem into a non-linear constrained optimization problem.

Zhang used a spectral clustering [19] framework and proposed an algorithm for effective division of a social network into communities. He used Fuzzy C-Means clustering [20] (FCM) to obtain the soft assignment. Users specify an upper bound on the number of communities. If this upper bound is k, then only the top k-1 eigenvectors are computed. Both accuracy and time complexity depend on the value of k.

In 2007, Newman and Leicht used mixture models [21] and provided an appropriate fuzzy community detection method. This was possible because of the probabilistic nature of the algorithm. Here the number of communities is the same as the number of mixture components, and this number needs to be specified in advance.

Research is also being carried out on fuzzy community detection using evolutionary algorithms. The following models have been developed in the past:

Modularity-based Model: In 2004, Newman and Girvan devised an evaluation criterion called modularity, denoted by Q. This criterion rewards partitions in which the number of edges within communities is high and the number of inter-community edges is low. The higher the value of modularity, the better the solution.