Qanet: Tensor Decomposition Approach for Query-Based Anomaly
Total Page:16
File Type:pdf, Size:1020Kb
1 QANet: Tensor Decomposition Approach for Query-based Anomaly Detection in Heterogeneous Information Networks Vahid Ranjbar, Mostafa Salehi, Pegah Jandaghi, and Mahdi Jalili, Senior Member, IEEE Abstract— Complex networks have now become integral parts of modern information infrastructures. This paper proposes a user-centric method for detecting anomalies in heterogeneous information networks, in which nodes and/or edges might be from different types. In the proposed anomaly detection method, users interact directly with the system and anomalous entities can be detected through queries. Our approach is based on tensor decomposition and clustering methods. We also propose a network generation model to construct synthetic heterogeneous information network to test the performance of the proposed method. The proposed anomaly detection method is compared with state-of-the-art methods in both synthetic and real-world networks. Experimental results show that the proposed tensor-based method considerably outperforms the existing anomaly detection methods. Index Terms— Anomaly Detection, Heterogeneous Information Networks, Query Based Anomaly Detection, Tensor Decomposition —————————— —————————— 1 INTRODUCTION ANY real systems usually consist of interactions ten find fraudulent activities by examining unusual M between various components and entities and can transaction patterns [10]. Network intrusion detection be modeled as networked structures [1]. Examples in- methods detect potential network attacks by comparing clude social activities of humans, ecological systems, traffic signatures with incoming traffic and finding unu- communication and computer networks and biological sual patterns in incoming traffic [11]. Abnormalities can systems. Information networks are everywhere and have be a node (entity), an edge (connection) or a subnet (a become a vital component of modern information infra- group of entities) that should not exist in the network but structure. In recent years, analysis of information net- exist. Types of attributes and features associated with works has attracted scholars across disciplines including nodes and edges that are used to detect abnormalities, computer science, social sciences, mathematics and phys- can be of any kind in relation to the entities or relation- ics [2]. Modeling real-world data as information networks ship between them. is a new tool that can often provide richer information as So far, anomaly detection methods have mainly fo- compared to traditional modeling techniques such as cused on homogeneous information networks and un- multidimensional modeling [3]. Representing data as in- structured multidimensional data [12], [13]. Anomaly formation networks makes it possible to model relation- detection in heterogeneous information networks is a ships between entities. In some systems the nodes and/or challenging task mainly due to specific characteristics of edges are not from the same type and various types might such networks [14]. Most of the methods developed orig- coexist in the system; such systems are often modeled by inally for homogeneous networks do not work in the case heterogeneous (or multilayer) networks [4], [5], [6], [7]. of heterogeneous one. Often, one would like to find One of the major challenges in the analysis of infor- anomalies in certain type of nodes/edges in heterogene- mation networks is to discover anomaly and abnormal ous networks. For example, in bibliographic networks in components. Anomaly detection is a branch of data min- which nodes can be authors, papers or venues, one might ing that is associated with discovery of abnormal occur- want to find author(s) that are the most different (abnor- rences in the data. It has many applications in areas such mal) with others in the way they publish papers (i.e. the as security, finance, biology, healthcare, and law en- topic of papers and/or venues published). In this exam- forcement [8]. For example, online social media detects ple, the difference in behavior should be a particular au- spam reviews by finding unusual patterns [9]. Banks of- thor chosen as reference. Such anomaly detection is re- ———————————————— ferred to as query-based anomaly detection method [14]. V. Ranjbar and M. Salehi are with the Faculty of New Sciences and This paper introduces a novel anomaly detection Technologies, University of Tehran, Tehran, Iran and also with the method based on tensor decomposition named QANet. In School of Computer Science, Institute for Research in Fundamental Sci- the proposed method, various meta-paths each node has ence (IPM), P.o.Box 19395-5746, Tehran, Iran. E-mail: [email protected]; [email protected]. with others are considered as properties of that node. P. Jandaghi is with the department of Computer Science, University of Then, features are extracted using tensor decomposition, Southern California, California, USA, E-mail: [email protected]. and clustering techniques are used to detect anomalies. M. Jalili is with the School of Engineering, RMIT University, Mel- bourne, Australia. E-mail: [email protected]. 2 We use synthetic and real datasets to evaluate perfor- mance of QANet. Specific contributions of our manu- script are as follows: We provide a user-centric anomaly detection approach that uses tensors to store meta-paths in heterogeneous in- formation networks and also uses tensor decomposition techniques to extract nodal features from a tensor. We introduce a network generation model and an anomaly Fig. 1. (a) An instance of heterogeneous information network for bibliographic network, and (b) bibliographic network schema. injection method to construct synthetic heterogeneous in- formation networks to test the performance of the pro- DEFINITION 3. (META-PATH) [2], [39] A meta path 풫 is posed anomaly detection method. a path defined on a schema 푇퐺 = (퐴, 푅), and is denoted in 푅1 푅2 푅푙 We create queries from two real-world datasets including the form of 퐴1 → 퐴2 → … → 퐴푙+1, with length of 푙. For sim- IMDB and DBLP as well as the constructed synthetic da- plicity, we can also use node types to denote the meta taset and compare the proposed method with state-of-the- path if there are no multiple relation types between the art methods. same pair of node types: 풫 = (퐴1퐴2 … 퐴푙+1). APV and The rest of the paper is organized as follows. Section 2 APA are meta-paths for heterogeneous information net- presents the preliminaries including a number of defini- work in Fig. 1. tions and the problem statement. We discuss the related A meta path 풫 = (퐴1퐴2 … 퐴푙) can be reversed where −1 work in section 3. We discuss the tensor decomposition the reversed path is denoted as 풫 = (퐴푙퐴푙−1 … 퐴1). If 풫 −1 model and describe our approach for the ranking candi- is equal to 풫 , then 풫 is a symmetric meta-path. For ex- date set of user query in section 4. A set of comprehensive ample, APA and APVPA are symmetric meta-paths. experiments is performed to evaluate the effectiveness of DEFINITION 4. (META-PATH INSTANTIATION) [39]. QANet in section 5. Section 6 draws the conclusions. If for each 푎푖 ∈ 푉 we have 퐹1(푎푖) = 퐴푖 and for each edge 푒푖(푎푖푎푖+1) that belong to relation 푅푖 in meta-path 풫 = 2 PRELIMINARIES (퐴푖퐴i+1 … 퐴푙), the path 푝(푎푖푎푖+1 … 푎푙) is denoted as meta- path instance for 풫. It can be denoted as 푝 ∈ 풫. Mike- This section provides some preliminaries including a paper1-VLDB is a meta-path instance for APV meta-path number of definitions required to formally state the prob- in the network of Fig. 1. lem of query-based anomaly detection in heterogeneous DEFINITION 5. (ANOMALY QUERY). An anomaly que- information networks. ry Q is denoted by 푄 = (푞 , 푞 ) where 푞 is a query on DEFINITION 1. (HETEROGENEOUS INFORMATION 퐶 푅 푐 network entities. 푆 ⊂ 푉 is the output indicating the outli- NETWORK) [2]. A heterogeneous information network 퐶 ers, known as the candidate set. 푞 is also a query on consists of multi-type entities that can have different type 푅 network entities whose output is 푆 ⊂ 푉 serving as the relationships between them, which are defined by a di- 푅 reference of normal nodes. The types of referenced and rected graph. Without loss of generality, the information candidate sets are the same. Candidate entities can be a network 퐺 can be defined as 퐺 = (푉, 퐸, 퐹 , 퐹 ), where 푉 is 1 2 separate set or sub-set of the reference sets. the set of nodes and 퐸 is set of edges. Function 퐹 : 푉 → 퐴 1 DEFINITION 6. (ANOMALY SCORE). The degree of is a function that maps each node to its type, where 퐴 = structural difference between a node and those in the ref- {퐴 , 퐴 , … , 퐴 } is the set of node types and 푇 is the number 1 2 푇 erence set is the anomaly score (Ω ) of that node rela- of node types. Each node 푣 ∈ 푉 maps to a particular type 푄퐴푁푒푡 tive to the reference set. The structural difference means in entity type set 퐴: 퐹 (푣) ∈ 퐴. Function 퐹 : 퐸 → 푅 is also a 1 2 the difference in the formation of node relationship with function that maps each edge to its type from set 푅 = others. {푅1, 푅2, … , 푅푙} where 푙 is number of edge types. Each edge 푒 ∈ 퐸 is mapped to a special type in edge type set 3 RELATED WORKS 푅: 퐹2(푒) ∈ 푅. Fig. 1(a) shows a heterogeneous information network on bibliographic data. This network includes Various methods have been proposed for anomaly detec- three types of node: Papers (P), Authors (A) and Venues tion which can be used in certain applications. Classifica- (V). Each paper has link to a set of authors and a venue tion methods [10], [15], [16], [17], [18] require labeled data where these links belong to a set of link types. and usually provide a label for test data, which are not To better understand the type of entities and relation- appropriate for applications that require a rating for ab- ship between them in a heterogeneous information net- normality.