Exploiting Edges Semantics in Citation Graph Data using an Efficient Vertical Association Rule Mining Model

Imad Rahal, Dongmei Ren, Weihua Wu, Anne Denton, Chris Basemann and William Perrizo

Computer Science Department, North Dakota State University, Fargo, ND 58105, USA
[email protected]

ABSTRACT

Graphs are increasingly becoming a vital source of information within which a great deal of semantics is embedded. As the sizes of graphs increase, the ability to arrive at the embedded semantics becomes more difficult. One important type of hidden semantics is that which is embedded in the edges of directed graphs. Citation graphs serve as a good example in this context. This paper attempts to understand temporal aspects of publication trends through citation graphs by identifying patterns in the subjects of scientific publications using an efficient vertical association rule mining model. Such patterns can (a) indicate the original topics that lead into or participate in the development of other target subject(s) later in time, (b) highlight the evolution of subjects, and (c) give insights on the potential effects of current research on future research. We highlight how our work differs from previous work on graph mining, citation mining, and web structure mining; propose an efficient vertical data representation model; introduce a new subjective interestingness measure for evaluating patterns, with a special focus on those patterns that signify strong associations between properties of cited papers and citing papers (referred to as citee and citer papers in this paper, respectively); and present an efficient algorithm for discovering the rules of interest, with detailed analysis.

KEYWORDS

Association rule mining, frequent itemset mining, data mining, graph data, link analysis, citation analysis.

1. INTRODUCTION

From a data mining perspective, this paper attempts a special kind of association rule mining (ARM) [2, 3]. Standard ARM, when applied to scientific publications, would consider each publication as a transaction. A transaction includes a variable number of items drawn from a predefined item space. Originally, ARM was applied to shopping carts or market baskets that contain combinations of items bought from a store; the item space is the set of all items that are for sale in the store. Borrowing this concept over to the publications context, the presence of a property, such as a subject matter, can be viewed as an item. In this paper, we view the set of all subject matters present in the publications represented by the citation graph under consideration as a well-defined item space, where a certain paper can have more than one subject matter. The reader is advised that we will use the terms "property", "subject" and "subject matter" interchangeably hereinafter. So far, since all properties relate to a single publication without any knowledge of the citation graph, applying ARM could only highlight which subject matters are most commonly covered in conjunction with which others. To learn about the future impact of publications or their evolution, we have to relate their properties to the properties of the papers that cite them.

The rest of this paper is organized as follows. In section 2, we formally state the problem of association rule mining. In sections 3 and 4, we give an overview of the current research literature on web structure and citation mining and on graph mining, respectively, and show how our work differs. Section 5 discusses the vertical data model proposed in our work. In section 6, we introduce a new subjective interestingness measure and compare it to other measures in the literature; we also present the notions of intra- and inter-node rules. Section 7 gives a detailed description of our algorithm along with a performance analysis study and an investigation of the results. Section 8 concludes this work with potential future extensions.

2. ASSOCIATION RULE MINING

Formally, the problem of association rule mining can be stated as follows. Let I be a set of items (also called an itemset), and let D be a set of transactions (with cardinality |D|), each containing a list of items. An itemset containing k items is referred to as a k-itemset. The number of transactions containing all the items of an itemset is defined to be the support of the itemset [2, 3]. Given an absolute support threshold, minisupp, where 0 < minisupp < |D|, an itemset X in I is said to be frequent with respect to D and minisupp if and only if the support of X is greater than or equal to minisupp.

Every frequent itemset may produce one or more association rules, each of which has a confidence value [2, 3] greater than a minimum user-specified confidence threshold, miniconf, where 0 <= miniconf <= 1 (in which case we say that the rule is strong). An association rule is a rule of the form A → C where A, the antecedent of the rule, and C, the consequent of the rule, are itemsets in I. The confidence of an association rule is defined as the support of the itemset formed by the union of A and C divided by the support of A (i.e. it gives the proportion of the transactions in D containing A that also contain C). Association rule mining finds all strong rules from all frequent itemsets.
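As a concrete illustration of the definitions above, the following minimal sketch (not from the paper; the toy transactions are made up) computes support and confidence directly over a set of transactions:

```python
# Illustrative sketch of the section 2 definitions: support counts the
# transactions containing an itemset, and confidence of A -> C is
# support(A union C) / support(A).

def support(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t)

def confidence(antecedent, consequent, transactions):
    return support(antecedent | consequent, transactions) / \
           support(antecedent, transactions)

D = [{"databases", "data mining"},
     {"databases"},
     {"databases", "data mining", "machine learning"}]
print(support({"databases", "data mining"}, D))       # 2
print(confidence({"databases"}, {"data mining"}, D))  # 0.666... (= 2/3)
```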
3. WEB STRUCTURE AND CITATION MINING

Our work is similar in essence to other research in the areas of web structure mining and citation mining. The area of citation analysis has been studied in information retrieval (IR) for many purposes, one of which is discovering influential journals by giving journals "influence weights" [10] based on the number of influential journals citing them. Web structure mining (aka web link mining) borrows this idea over to the newer context of the World Wide Web. Formally, web structure mining is the process of discovering influential and "authoritative" pages over the Web. Two types of web pages can be distinguished: authorities and hubs. Authorities are web pages that are linked to by many other pages because they contain important information, while hubs are web pages that contain links to many authorities on some subject. [16] proposes an approach for analyzing the link structure of the World Wide Web and finding information of "high quality" in response to broad-topic search queries, which can have thousands of potentially relevant web page matches. In order to provide effective searching for users, only the subset of authoritative pages on the topics of interest should be returned. [7] discusses a newer approach for the same purpose that is based on [16].
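The paper only surveys this area, but the mutual reinforcement between hubs and authorities can be sketched as follows. This is background in the spirit of the iterative approach of [16], not part of the paper's method; the toy graph and iteration count are made-up assumptions:

```python
# Background sketch: hub/authority mutual reinforcement. A page's
# authority score grows with the hub scores of pages linking to it,
# and a page's hub score grows with the authority of pages it links to.

def hits(links, rounds=20):
    """links: dict mapping a page to the list of pages it links to."""
    pages = set(links) | {q for tgts in links.values() for q in tgts}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(rounds):
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, []))
                for p in pages}
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        na = sum(auth.values()) or 1.0   # normalize each round
        nh = sum(hub.values()) or 1.0
        auth = {p: v / na for p, v in auth.items()}
        hub = {p: v / nh for p, v in hub.items()}
    return hub, auth

hub, auth = hits({"h1": ["a1", "a2"], "h2": ["a1"]})
print(max(auth, key=auth.get))  # 'a1': linked to by both hubs
```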
[18] introduces the novel concept of citation mining, which combines citation bibliometrics and text mining. Their work focuses on analyzing and documenting the impacts of research on the development of real-life applications, technology, and other research over time, along with the pathways through which that can be achieved; in addition, it attempts to recognize and highlight the different characteristics related to the user population. In other words, the work studies the different impacts of research along with the impacted population. They represent the studied research by papers indexed by the Science Citation Index database(1) and the user population by other papers in the database citing the studied research papers as well as their future extensions. Citation bibliometrics [22] is utilized to highlight the characteristics of the impacted user population by analyzing the user papers. The text mining component of their approach analyzes the relations among technical areas in population papers and between them and other areas in research papers using intelligent and feasible text mining.

(1) A database providing access to a very large number of bibliographies, abstracts, and references found in more than 3500 journals in over 100 disciplines.

Research in citation analysis and web structure mining such as [16] and [7] has focused primarily on discerning entities such as journals, papers or web pages that are deemed influential to or authoritative on certain topics of interest. In contrast, our work has a totally different entity of focus, the publication subject matter, which, in a way, could be viewed as the query input in web structure mining. In addition, our work differs from [18] and other citation mining research in that we attempt to discover research trends by understanding the semantics hidden in the edges of citation graph data. Those edges are used to relate publications, represented by their corresponding subject matters, where the edge direction embodies a time element. We adopt a new approach that uses a popular data-mining application to arrive at rules associating subject matters from publications written at different points in time. The objective of our study is to discover rules capable of uncovering subject matter extensions and evolution over time, as well as providing a "potential" framework for predicting future subject matters that could be affected by current research, where applicable. As will be discussed later, a vertical data representation model is employed to enable the analysis of huge citation graphs where other approaches might fail. We also propose a new interestingness measure to guide us through the enormous rule space in the quest for the rule subset of interest.

4. GRAPH MINING

A particular interest in graph data is that of interactions. Many objects in the world have special interaction relationships where objects influence one another. One area in which interactions are involved is the study of proteomics in biology [31]. Most work in cells is not isolated; rather, it is done by interacting proteins. Past techniques in protein interactions include binary networks [4], Bayesian networks [31], and support vector machines [6]. However, there is still a wide range of graph-based techniques that needs to be explored.

In association rule mining of graph data, the idea of what constitutes a transaction and what defines the support of an itemset depends on the graph model. In some instances, a transaction can be viewed as an individual graph, and the support of an itemset (which is a sub-transaction or a subgraph) is based on the number of transactions (or graphs) in the graph database [14] supporting the itemset. The task of finding frequent patterns is then represented as frequent subgraph or sub-transaction searching. Searching occurs by comparing subgraphs to check for similarity, a problem defined as the subgraph isomorphism problem, which is known to be NP-complete [14, 19]. To alleviate the complexity of matching subgraphs, various heuristics and methods are utilized. Common among these is a form of canonical labeling which allows label codes attached to graph nodes or edges to be compared rather than the actual graph structures, which invariably takes less time [14, 19]; however, methods that use these labeling techniques suffer when the data has few unique labels. Two popular algorithms for frequent itemset generation applied to graph data are summarized next. Other algorithms can be found in [21] and [41].

Apriori-based graph mining (AGM) [14] mines association rules from frequent substructures or subgraphs. In AGM, graphs take the shape of adjacency matrices. Canonical matrix coding is used in order to alleviate the isomorphism problem. Candidate generation [2, 3] occurs when two subgraph matrices share all but the last column. These two matrices are joined to form the next-level candidate, which has all the common columns in addition to the differing ones. This is in direct comparison to the way itemsets are joined in typical Apriori-based algorithms [2, 3].
After generating a candidate subgraph, the graph set is then scanned to determine the support of that subgraph.

The Frequent Sub-Graph algorithm (FSG) [19] is another algorithm based on Apriori. FSG's approach seeks to limit the space by considering only connected subgraphs. A sparse graph representation is used to reduce storage and computational costs. As with AGM, FSG adds one edge at a time during candidate generation. However, it was noted more prominently that each join for a candidate does not necessarily result in a single candidate, as is the case in classical ARM.

All previous work on graph mining focused on the discovery of patterns within the graph structure. Node information played a secondary role, if any, in the process. Our approach, on the other hand, is centered on the data available in the graph nodes. Graph structures are exclusively used to relate the information between nodes that are connected by citation edges. Some work in bioinformatics has used a similar approach but focused on biological aspects [23, 30]. In addition, because we are dealing with citation data, only directed graph (digraph) structures are considered in our work, where nodes represent papers and edges represent citation relationships. Even though such graphs are viewed as acyclic, in practice they might contain cycles due to factors such as "preprints" and "self-citations". Research on the association rule mining of graph data has considered more general and even more complicated graph structures such as "hyper" graphs.

5. DATA REPRESENTATION

The set of citer-citee relationships is usually embodied in a citation graph. For every citation, we have a directed edge connecting the citer to the citee.
Figure 1 depicts part of a citation graph drawn using a modified version of the TouchGraph software [38]. The figure shows three disconnected components. In the larger component, a paper having a property 12.10.-g(2) and another paper having a property 11.25.Mj are involved in a citation relationship where the latter paper cites the former (in the figure, citation relations go from lower to higher nodes).

(2) Those properties are subject codes drawn from the Physics and Astronomy Classification Scheme (PACS) numbers used in the physics domain. More on this issue in section 7.5.

In this work, we utilize a path-based approach to represent graph data in tabular or relational format, where we consider each path in the graph as a transaction. In the simplest version, we can limit ourselves to paths of length one. This means that each edge will result in a single transaction. This transaction will contain information on both of the participating nodes. The corresponding ARM problem can then be phrased in terms of the relation that results from twice joining the relation containing the node data with the citation graph edge relation. If the frequency of a particular property is evaluated on this joined table, the result will differ from the frequency in the original relation containing only the node data. It is important to recognize that the concept of a frequent itemset will depend on the length of the paths used as transactions – recall that we can represent paths of length x as transactions instead of limiting ourselves to edges (paths of length one) only. Note that there is a lot of redundancy in the resulting relation, since the information from a particular node will be repeated once for every incoming or outgoing edge.

For example, suppose that each paper has n properties, P1 to Pn, where each property value can be either 1 or 0, denoting the existence or absence of that property in the corresponding paper, respectively. Figure 2 depicts the framework discussed in this section that we will use to represent a citation graph as a set of transactions. The first half of this table is referred to as the citee set, while the second is referred to as the citer set. The first row shows a paper with visible properties P1 and P2 being cited by a paper with property P2.

Figure 1. An example citation graph.

Citee                 Citer
P1  P2  ...  Pn       P1  P2  ...  Pn
 1   1  ...   0        0   1  ...   0
 1   0  ...   0        0   0  ...   0
 0   1  ...   1        1   1  ...   1
 0   0  ...   0        1   0  ...   1

Figure 2. Framework for representing graph edges as transactions.
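The conversion from edges to binary transactions can be pictured with the following minimal sketch (not the authors' code; the paper ids and property names are made up for illustration):

```python
# Illustrative sketch: turning each citation edge (a path of length
# one) into one binary transaction laid out as in Figure 2 - citee
# properties first, citer properties second.

def edge_to_row(citer, citee, properties, all_props):
    """properties: dict mapping paper id -> set of subject codes."""
    citee_half = [int(p in properties[citee]) for p in all_props]
    citer_half = [int(p in properties[citer]) for p in all_props]
    return citee_half + citer_half

props = {"A": {"P1", "P2"}, "B": {"P2"}}   # paper B cites paper A
print(edge_to_row("B", "A", props, ["P1", "P2"]))
# [1, 1, 0, 1] -> citee has P1 and P2; citer has only P2
```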

As we shall describe later, we utilize a bitmap-based vertical model to represent the data shown in Figure 2. Unlike [42], which also uses bitmaps to encode the existence or absence of properties, our model uses bit-based compression. Compression helps assuage problems when the number of edges in the graph (i.e. transactions in our model) grows enormously, especially in cases where the data is sparse. The reader is referred to section 7.1 for more details on how compression is achieved.

In the context of macromolecule substructure discovery, [24] models paths in three-dimensional coordinate graphs, where nodes represent atoms and edges represent chemical bonds, as transactions similar to the way we model graph edges. Because macromolecules can have a large number of atoms, their corresponding graphs tend to have a massive number of paths to be represented as transactions, which can clearly become infeasible. The authors also emphasize that the expected number of atoms in interesting patterns is at least five but is usually more than that. As a result, they propose an interesting and intuitive approach to reduce the large number of transactions by pruning all atom-pair interactions whose length (i.e. the distance between the two interacting atoms) is above a certain user-defined threshold. They justify this step by the fact that the distance between two interacting atoms is inversely proportional to the strength of the chemical bond between them.

In our case, the data represents citations and is limited to paths of length one (i.e. to edges only). The efficiency degradation resulting from the increase in the number of transactions can be circumvented by using compression, as we shall demonstrate in our experiments over large datasets. In cases where compression does not prove to be very effective, we suggest utilizing temporal constraints on the data by limiting the nodes in the graph, which represent publications, to those published between certain dates and ignoring all other nodes along with their incoming and outgoing edges. Another suggestion is to limit the edges in the graph to only those involving nodes containing specific subjects of interest. Due to time limitations, this paper does not attempt to analyze the effects of those suggestions; such analysis is to be considered in future extensions of this work.
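Before any compression is applied, the vertical model amounts to slicing the edge-transaction table into one bit group per column. A minimal sketch (not the authors' implementation, which compresses these bit groups into P-trees as described in section 7.1):

```python
# Illustrative sketch of the vertical, bitmap-style layout: each column
# of the edge-transaction table becomes one bit group.

def verticalize(table, column_names):
    """table: list of rows, each a list of 0/1 values.
    Returns {column name -> list of bits down that column}."""
    return {name: [row[i] for row in table]
            for i, name in enumerate(column_names)}

rows = [[1, 1, 0, 0, 1, 0],
        [1, 0, 0, 0, 0, 0],
        [0, 1, 1, 1, 1, 1]]
cols = ["citee_P1", "citee_P2", "citee_Pn",
        "citer_P1", "citer_P2", "citer_Pn"]
print(verticalize(rows, cols)["citee_P2"])  # [1, 0, 1]
```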
6. RULE "INTERESTINGNESS" MEASURE

6.1 On Interestingness Measures

A number of studies in the literature have analyzed the notion of interestingness in data mining. In general, interestingness measures of patterns fall into one of two classes: objective measures and subjective measures [24, 35, 36]. Objective interestingness measures are data-centric in that they define the interestingness of a pattern in terms of the data used in the mining process. They also depend on the structure of the patterns. For example, in ARM, the two ubiquitous objective measures are support and confidence, both of which highlight the statistical strength and significance of the discovered rules. Due to the many complexities arising in the pattern discovery process, objective measures usually discover a large number of patterns and thus fall short of their purpose(3), especially when the notion of interestingness depends on additional factors such as the decision-maker. A number of subjective measures have been proposed for the latter scenario. In general, subjective measures endeavor to generate a smaller, tailored set of patterns that is potentially more interesting and useful to the pattern examiner.

(3) Note that the purpose of data mining is to discover useful and comprehensible knowledge from huge amounts of data.

As discussed in [36], subjective measures depend on two main factors to discover patterns, namely, actionability and unexpectedness. Actionability states that a pattern is considered interesting to the examiner if it calls for action on his or her behalf. Unexpectedness focuses more on the surprise factor of the pattern with respect to the examiner, i.e., how much the pattern surprises the examiner. In order for the unexpectedness factor to be integrated into a subjective measure, a system of beliefs [36] must be defined first. Such a system defines the standard knowledge expected by the examiner. The discovery process then captures all deviations from such standards as unexpected and thus as interesting to the examiner. In general, beliefs can be of two types: hard and soft. Hard beliefs represent knowledge that the examiner is not willing to change even in the light of newly discovered contradictory evidence; the validity of the discovered patterns, and sometimes of the original data, is questioned instead. On the other hand, soft beliefs could be changed by the examiner if suggested by new patterns. A user-defined measure of strength, referred to as a degree of belief, is usually associated with every belief in the system. A number of subjective interestingness measures for association rules are presented next.

[29] uses a probabilistic approach to discover unexpected rules in the form of "rule-pairs". Their work is domain-independent in that it requires no prior knowledge in the form of beliefs against which the unexpectedness factor is measured.

[20] proposes the use of syntactic comparison between rules and beliefs (which are also represented as rules) to arrive at interesting rules. A rule, R, is considered to be different from a belief, B, if R and B have similar consequents but very dissimilar antecedents, or vice versa. In this context, their distance-based similarity is based on the syntactic structures of rules and beliefs.

[36] defines the interestingness of a pattern by the degree with which it "shakes" the belief system. Their work emphasizes the importance of rule actionability along with the complexity associated with formulating it and integrating it into the discovery process. To alleviate this problem, they assume that most actionable rules are unexpected and use rule unexpectedness as the focal point in their interestingness measure. A number of approaches were suggested to define the degrees of beliefs, such as the frequency approach, which is limited to beliefs in the form of rules, and the conditional approach, which can be applied to more general forms of beliefs and thus is chosen by the authors. Their work can be applied in dynamic environments where the data changes often, thus affecting the degrees of beliefs and, consequently, the outcome of interesting patterns.

[37] defines a subjective interestingness measure for the analysis of healthcare insurance claims. The concern here is to find "deviations from the norms" which can call for corrective actions to reinstate them back to standards. The actions are pre-defined by domain experts for each class of deviations, which is clearly only possible for very domain-specific applications.

[24] associates a degree with every belief in the system, where beliefs are coded as rules. Degrees of beliefs are defined by the examiner and can be updated using a "revision procedure". As in [36], they focus on the unexpectedness factor of interestingness and define a rule, R, to be unexpected with respect to a belief, B, if (1) the antecedents of R and B are logically contradicting, and (2) the number of tuples intersecting R and B (i.e. the subset of tuples where the antecedents of R and B are both true) is "statistically" large. In order to arrive at the subset of interesting rules, the authors assume the validity of what they refer to as the "monotonicity of beliefs", which states that if a belief holds on some data with some degree then it must also hold on large subsets of that data.
6.2 Intra- And Inter-Node Rules

To define our interestingness measure, we introduce the concepts of inter- and intra-node rules. Intra-node rules relate properties within papers of the same category.

Definition 6.1: (Intra-Node Rule) An intra-node rule is a rule whose antecedent and consequent are properties drawn from either the citer or the citee set of papers but not from both simultaneously, whose support is greater than the minimum support threshold, and whose confidence is greater than the minimum confidence threshold.

In general, intra-node rules could be derived without knowledge of the graph. Those rules would, however, differ from rules that are derived in the path-based setting described in the previous section, namely, inter-node rules. Inter-node rules depend fundamentally on the knowledge derived from the graph structure. Here, a rule might have its antecedent drawn from the citer and its consequent from the citee, or vice versa. Different formats of rules could be derived in this manner, but we will limit ourselves to only one form, in which the antecedent is drawn from the citee while the consequent is drawn from the citer. As we shall discuss later, this type of rule can give insights into research publication trends by associating properties of papers written at different points in time.

Definition 6.2: (Inter-Node Rule) An inter-node rule is a rule whose antecedent is drawn from the citee set of papers, whose consequent is drawn from the citer set of papers, whose support is greater than the minimum support threshold, and whose confidence is greater than the confidence of the corresponding intra-node rule (defined next).

Definition 6.3: (Corresponding Intra-Node Rule) The corresponding intra-node rule of an inter-node rule is a "potential" rule having the same antecedent as the inter-node rule (drawn from the citee set of papers), a consequent whose properties are the same as those in the consequent of the inter-node rule but drawn from the citee set of papers, and a confidence greater than the minimum confidence threshold.

Rules could be simply derived by using minimum support and confidence thresholds; however, since it is very difficult to estimate a minimum threshold for confidence that would yield interesting rules, we consider an inter-node rule R to be of interest if there exists a corresponding intra-node rule, R', such that the confidence of R is larger than or equal to the confidence of R'. As described in definition 6.3, R' has the same antecedent as R (drawn from the citee set) along with a consequent having the same properties, drawn from the citer set for R and from the citee set for R'. Note that in definition 6.3 we say that a corresponding intra-node rule is a "potential" rule to emphasize the point that we are not interested in its support; we just use its confidence, which should be greater than the minimum specified threshold, for testing the inter-node rule at hand.

As an example, consider an inter-node rule, R: Citee_Prop1 → Citer_Prop6, Citer_Prop9. This rule associates property 1 from the citee set with the combination of properties 6 and 9 from the citer set. The corresponding intra-node rule, R', would be Citee_Prop1 → Citee_Prop6, Citee_Prop9. When R is at least as confident as R', this kind of information tells us that inter-node properties are exhibiting stronger associations than corresponding intra-node ones, which deserves attention. As a result, the reader is advised that the notion of a constant, ubiquitously-known confidence threshold is not directly utilized in our work; we substitute it with a tailored form, derived for each rule dynamically and separately from the confidence of the corresponding intra-node rule.
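The rule-pair test of definitions 6.2 and 6.3 reduces to a simple comparison once the two confidences are known. A minimal sketch (not the authors' code; the confidence values below are invented for illustration):

```python
# Illustrative sketch: an inter-node rule is kept only if it is at
# least as confident as its corresponding intra-node rule, which
# itself must meet the minimum confidence threshold, miniconf.

def is_interesting(conf_inter, conf_intra, miniconf):
    """conf_inter: confidence of the citee -> citer rule.
    conf_intra: confidence of the corresponding citee -> citee rule."""
    return conf_intra >= miniconf and conf_inter >= conf_intra

# R : Citee_Prop1 -> Citer_Prop6, Citer_Prop9 with confidence 0.52
# R': Citee_Prop1 -> Citee_Prop6, Citee_Prop9 with confidence 0.40
print(is_interesting(0.52, 0.40, miniconf=0.30))  # True: R is reported
```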
To a domain expert in the publications field, we expect most of the rules to be semantically interpretable; however, surprises may arise. We consider those a form of unexpected knowledge which could be interesting. From a data-mining perspective, inter-node rules provide valuable knowledge embodied in associations between earlier work, represented by citee properties, and later work, represented by citer properties.

Given a subject S of interest, the derived associations can be used to show S's future subject extensions (i.e. what subjects have directly or indirectly extended from S). This could be done by matching S against the antecedents of the rules and viewing the subjects in the consequents as extensions of S. A rule such as R: Subject_X_Citee → Subject_Y_Citer says that a large number of papers with subject Y cite papers with subject X, which could indicate that Y is an extension of X. A more concrete example would be a rule stating that subject "databases" implies subject "data mining", since the latter is an extension of, or a development from, the former(4).

(4) As a matter of fact, data mining as a research area can be viewed as a combination of a number of areas such as databases, machine learning, statistics, artificial intelligence, and others.

Another use of such associations is highlighting the evolution of subjects, which can be viewed as the opposite of the first use. By matching a subject of interest, S, against the consequents of the rules, we can view the subjects in the antecedents as the original subjects from which S extended or derived. The rationale for this is that papers written later in time usually follow the trend of citing the original and seminal papers that started a certain research direction. As an example, almost all papers involving subject matter "association rule mining" cite Agrawal's work [2, 3], implying that their application areas could be viewed as extensions of the market basket research (MBR) area. This is also true for this paper, where we model citation graph data as MBR transactions to enable the application of ARM; as a result, we can view MBR as an original subject which we (and many others) have extended.

A third use can be to predict the "potential" impact that current papers might have on future papers. We do realize that this last use might not be applicable all the time; nevertheless, it might provide valuable knowledge when it does. For example, a rule such as R: Subject_X_Citee → Subject_Y_Citer, Subject_Z_Citer might tell us that having a set of papers, S, involving subject matter X implies that future papers involving subject matters Y and Z might cite S with a certain support (R's support) and confidence. Looking at the same issue from a different angle, using inter-node rules such as R may help us in determining what future publication subject matters might get affected by current publication subject matters. For example, from R, we can conclude that it is probable for current papers with subject matter X to be cited by future papers with subject matters Y and Z; thus, we can say that subject matter X will probably have an effect on subject matters Y and Z. Notice that, from this last observation, we might also be able to gain more insight on future subjects given current ones.
We envision the usefulness of such inter-node rules as they present associations between the properties of two sets of papers written at different points in time (the citee set and the citer set) that are more confident than similar associations between properties of the same set of papers.

In this work, we are primarily interested in patterns showing stronger associations among inter-node properties than among intra-node properties, i.e., associations between properties of citee papers and citer papers that are stronger than similar associations between properties drawn from citee papers alone. In a sense, our work is similar to [29] in that we too discover specific pairs of rules, where each pair is composed of an inter-node rule along with its corresponding intra-node rule, and use the conditional probabilities of those rules to decide upon the interestingness of the inter-node rule. Another similarity with [29] is that we do not utilize a user-defined system of beliefs like other work on subjective interestingness measures such as [20, 24, 36, 37]; the corresponding intra-node rules are used to set the norms. Any inter-node rule that deviates from this norm by having a confidence larger than the confidence of its corresponding intra-node rule is unexpected and thus interesting.

The rationale for this is based on our belief that, in general, one usually expects intra-node properties to exhibit the strongest associations because their nodes play the same role in the graph, which happens to be the citee role in this work, as we are only focusing on citee intra-node rules. If we consider the derived relation resulting from our data representation for a moment, we can realize that for an intra-node rule to have a high confidence value, the subjects participating in the rule consequent must exist together with those in the antecedent part quite often; in order for that to happen, those subjects must coexist in the same publications. In view of the fact that we are more interested in understanding subject evolution, extensions and potential predictivity, the time element plays a crucial role in our notion of interestingness, a requirement that coexisting subjects do not satisfy. As a result, we view inter-node rules that are not stronger than their corresponding intra-node rules as not interesting.
In order for a statistically large number of papers P_Citer with a certain subject S_Citer to cite a set of papers P_Citee involving a subject S_Citee where S_Citer is not prominently spread in P_Citee, S_Citer, in its current form, ought to be rather newer than S_Citee, which explains its scarcity in P_Citee. For example, a large number of papers on "data mining" cite the "machine learning" literature, which is an older subject that forms one of the roots of "data mining". Inter-node associations adhering to this justification can clearly highlight subject evolution as well as future extensions, as discussed previously. Note that, sometimes, S_Citee and S_Citer could be disciplinarily unrelated subjects, and the associations can be of interdisciplinary value. An example would be the "Nash Equilibrium" work that the famous Dr. John Nash did at Princeton University in the area of game theory in 1950, which awarded him the Nobel Prize in economics in 1994. Almost all economics literature on equilibrium cites Nash's work or its extensions, even though economics and game theory are disciplinarily unrelated fields of research.

Another scenario could be that S_Citer is older than S_Citee but became a "hot" research subject after the introduction of S_Citee. In the event that this is true, a justification for the observed citation phenomenon could be that research in S_Citee has led to important advancements or findings with high applicability to S_Citer. Biological research started way before the introduction of computers; however, due to the advancements in the fields of data mining and machine learning, a large number of papers on biology, especially those focusing on the "in-silico" analysis (i.e. through the use of computers) of biological data [31], cite the newer data mining and machine learning literature.

The above discussion elucidates that our subjective interestingness measure discovers unexpected rules capable of highlighting subject extensions and evolution, which also could give insights on the potential effects of current research on future research where applicable. We focus largely on the unexpectedness factor of interestingness simply because, at this stage, we are interested in understanding the semantics embedded in citation graphs. This is in direct comparison with [35], which also limits the definition of interestingness to the unexpectedness factor, but only because the actionability factor is hard to formulate and integrate into the discovery process, as suggested.

7. THE PROPOSED MINING APPROACH

7.1 The P-tree Technology

Our implementation for this work is based on a novel and efficient data structure, the P-tree (Predicate or Peano tree). P-trees are tree-like data structures that store relational data in a lossless, compressed, column-wise format by splitting each attribute into bits, grouping the bits at each bit position, and representing each bit group by a P-tree. P-trees provide a lot of information and are structured to facilitate fast data-mining processes [9, 26] – an aspect greatly appreciated when dealing with vast amounts of raw data, as is the case in most data-mining applications.

P-trees can be applied to binary data as well as to numeric data stored in relational format, though we will limit ourselves to the former as this happens to be the format of the data we are dealing with in this paper. Note that data partitioning here is attribute- (or column-) based and not row-based, as is the case with the ubiquitous relational data representation.

To create P-trees from relational binary data, we store all bit values in each binary attribute for all the transactions separately. In other words, we group all bits, for all transactions t in the table, in each binary attribute A, separately. Each such group of bits is called a bit group. Figure 3 shows the process of creating P-trees from a binary relational table. Part b) of the figure shows the creation of two bit groups, one for each attribute in a). Parts c) and d) show the resulting P-trees, P1 and P2, one for each of the bit groups in b). P1 and P2 are constructed by recursively partitioning the bit groups into halves. Each P-tree records the total number of 1s in the corresponding bit group on the root level. The second level in the tree gives the number of 1s in each of the halves of the bit group: the first node from the left on the second level gives the number of 1s in the first half of the bit group and, similarly, the second node gives the number of 1s in the second half. This logic is continued throughout the tree, with each node giving the number of 1s in either the first or the second half (depending on whether it is the left or the right node) of the bit group represented by the parent node. In part d) of Figure 3, the root of P2 is 6, which is the 1-bit count of the entire bit group. The second level of P2 contains the 1-bit counts of the two halves separately, 4 and 2. Since the first half contains only 1s, it is considered pure (referred to as pure-1) and thus there is no need to partition it further. This last aspect is called P-tree compression [9, 26] and is one of the most important characteristics of the P-tree technology. Similarly, nodes representing halves containing only 0s, like the left node on the third level of the tree in d), are considered pure-0 and are not partitioned further; the second half on that level, with count 2, is pure-1 and need not be partitioned further either.

For efficiency purposes, we do not directly use the basic P-trees shown in Figure 3; instead, we use a variation of P-trees called Pure-1 trees (or P1-trees). P1-trees contain nodes labeled with either 0 or 1. A node in a P1-tree is labeled with 1 if and only if the corresponding bit group it represents is made up entirely of 1s (i.e. it is pure-1). Nodes labeled with 0 can be either pure-0 nodes or mixed (i.e. neither pure-0 nor pure-1). Note that we can easily differentiate between pure-0 nodes and mixed nodes by the fact that pure-0 nodes have no children (i.e. they are leaf nodes). Figure 4 shows the P1-trees corresponding to the P-trees in Figure 3.

P1-trees are manipulated using fast operations such as AND, OR, NOT and ROOTCOUNT (the count of the number of 1s in the bit group represented by the tree) in order to query the underlying data [9, 26]. The NOT operation is a straightforward swap of each pure node: pure-1 nodes become pure-0 nodes and vice versa, while mixed nodes stay as they are. The AND operation ANDs the nodes at the same position in the operand trees, while the OR operation ORs them. Note that ANDing a pure-0 node with anything results in a pure-0 node, and ORing a pure-1 node with anything results in a pure-1 node. These observations, which can be attributed to P-tree compression, are exploited to achieve fast P-tree ANDing and ORing. Figure 5 depicts the AND and OR results of P1 and P2 from Figure 4. An extensive body of literature exists on the P-tree technology and its applications [8, 15, 27, 28, 32, 33, 40].
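The following is a minimal sketch of these ideas, assuming the half-splitting scheme of section 7.1; the class name P1Tree and the helper p1_and are our own, and real P-tree implementations [9, 26] differ in fan-out and storage details:

```python
# Illustrative sketch of a P1-tree: a bit vector is recursively halved;
# a node is pure-1 iff its segment is all 1s, pure-0 iff all 0s, and
# pure segments are not expanded further. ROOTCOUNT and AND exploit
# the purity shortcuts described above.

class P1Tree:
    def __init__(self, bits):
        self.n = len(bits)
        self.pure1 = all(bits)
        self.pure0 = not any(bits)
        self.kids = None
        if not (self.pure1 or self.pure0) and self.n > 1:
            mid = self.n // 2
            self.kids = (P1Tree(bits[:mid]), P1Tree(bits[mid:]))

    def rootcount(self):
        if self.pure0:
            return 0
        if self.pure1:
            return self.n          # pure node: count without descending
        return sum(k.rootcount() for k in self.kids)

def p1_and(a, b):
    """AND of two P1-trees built over bit vectors of equal length."""
    if a.pure0 or b.pure1:
        return a                   # all-0 dominates; all-1 is neutral
    if b.pure0 or a.pure1:
        return b
    out = object.__new__(P1Tree)   # both mixed: recurse on the halves
    out.n = a.n
    left = p1_and(a.kids[0], b.kids[0])
    right = p1_and(a.kids[1], b.kids[1])
    out.kids = (left, right)
    out.pure0 = left.pure0 and right.pure0
    out.pure1 = left.pure1 and right.pure1
    return out

px = P1Tree([0, 0, 0, 1, 1, 1, 1, 1])   # bit group for item X
py = P1Tree([1, 1, 1, 1, 1, 1, 0, 0])   # bit group for item Y
# Support of itemset {X, Y} = ROOTCOUNT(Px AND Py)
print(p1_and(px, py).rootcount())        # 3
```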
Figure 3. Construction of basic P-trees: a) a 2-column table; b) the resulting 2 bit groups; c) P1; d) P2.

Figure 4. Pure-1 trees: a) P1; b) P2.

Figure 5. AND and OR operations: a) P1 AND P2; b) P1 OR P2.

7.2 The P-tree-based Algorithm

We used our P-tree-based ARM implementation for this work to analyze a subset of the dataset available for the KDD Cup 2003 competition, a knowledge discovery and data mining competition held in conjunction with the Ninth Annual ACM SIGKDD Conference (http://www.cs.cornell.edu/projects/kddcup/). The subset under consideration deals with citation graphs and publication subjects represented by PACS (Physics and Astronomy Classification Scheme) numbers, which are used to represent the subject matters of publications in the physics domain. The total number of PACS numbers available in the given dataset is 828. We have 828 citee PACS numbers and 828 citer PACS numbers, amounting to a total of 1656 attributes or columns in the derived table. The total number of transactions (i.e. edges) considered is 1448 out of a possible 352,807 edges in the original dataset; the reason for this reduction is that we selected only the subset of papers participating in the citation graph (i.e. nodes in it) and having PACS numbers. As aforementioned, the possible attributes in each transaction are the 828 citee PACS numbers followed by the 828 citer PACS numbers. We used item indexes 1 to 828 for citee attributes and 829 to 1656 for citer properties. Each transaction records the item indexes of the PACS numbers existing in its participating nodes. The same file could also be represented in binary, where attribute values record the existence or absence of the corresponding PACS number in the participating nodes of every edge. In our case, we used the latter format to help expedite the process of P-tree creation as described in the previous subsection.

After representing the data vertically using P-trees, we divide the ARM task into four steps, the first two of which can be done in parallel. First, we mine all frequent itemsets from the citee part of the dataset; those itemsets satisfy the minimum specified support threshold, minisupp (i.e. they have support greater than or equal to minisupp). Second, we mine all frequent itemsets from the citer part. Note that representing the data in P-tree format has the advantage of speeding up the frequent itemset mining process because, after creating the P-trees, which can be done offline, no database scans are ever needed [9, 26], just logical operations on compressed bitmaps. To get the support of an itemset containing items X and Y, all we have to do is AND Px and Py and issue a ROOTCOUNT operation on the resulting tree (cheaper than a database scan [9, 26]). Note that, as in [34], we take advantage of our vertical data representation to utilize memory efficiently by materializing only the P-trees related to the part of the dataset we are dealing with. Even while mining each part separately, all P-trees for non-frequent itemsets are unloaded from memory as they won't be useful anymore, resulting in better memory utilization. In addition, the two steps described so far are independent, giving us the ability to perform them asynchronously (i.e. they could be done in parallel [34]).

After mining all the frequent itemsets in both parts of the dataset, we perform a join step. The reason for this is the format of the desired inter-node rules: each rule must have its antecedent drawn from the citee part and its consequent from the citer part. By definition, the support of the rule must be greater than or equal to minisupp (i.e. the support of the union of the antecedent and consequent must be greater than or equal to minisupp, and so must the supports of each considered separately); thus, instead of mining all frequent itemsets across the whole dataset, which could result in an exponential blowup in the number of itemsets that must be generated and tested (because of doubling the feature space), and then pruning all itemsets that do not contain items from both parts of the dataset, we follow a divide-and-conquer approach by mining each part separately. The join step is straightforward and takes advantage of the anti-monotonicity of support with respect to itemset size, which states that any itemset has support greater than or equal to the support of any of its supersets, and thus no itemset can be frequent unless all of its subsets are also frequent. For example, if the result of joining two frequent itemsets I_citer and I_citee is a non-frequent itemset, then there is no need to join I_citer or any of its supersets with I_citee or any of its supersets. At this point, we have all frequent itemsets containing both citee and citer parts. Each itemset produces only one rule, because all of its citee items should reside in the antecedent while the citer items reside in the consequent. As a result, producing rules is fast and requires almost no processing (no enumeration of the different rules that could be derived from an itemset) other than the confidence test.

The fourth and last step is to produce the inter-node rules. To do that, we have to compare the confidence of each inter-node rule with the confidence of its corresponding intra-node rule (which in turn must satisfy miniconf) and mark all the inter-node rules that match this criterion. As mentioned previously, we do not care about the support of the corresponding intra-node rule; we just use its confidence for testing purposes. Computing the confidence of a rule is rather straightforward because of P-trees: the confidence of a rule is equal to the ROOTCOUNT of the P-tree representing the union of the rule antecedent and the rule consequent divided by the ROOTCOUNT of the P-tree representing the antecedent. As aforementioned, each such inter-node rule provides us with valuable information, as it associates subject matters of publications written at different points in time.
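The join step described above can be pictured with the following minimal sketch (not the authors' code); supports are taken here from precomputed row-id sets standing in for P-trees, and the anti-monotonicity-based superset skipping is omitted for brevity:

```python
# Illustrative sketch of the join step: every frequent citee itemset is
# paired with every frequent citer itemset, and the pair is kept only
# if the combined support still meets minisupp.

def join(citee_frequent, citer_frequent, minisupp):
    """Inputs map frozenset(itemset) -> set of supporting edge ids."""
    joined = {}
    for a, rows_a in citee_frequent.items():
        for c, rows_c in citer_frequent.items():
            rows = rows_a & rows_c        # stands in for P-tree AND
            if len(rows) >= minisupp:     # stands in for ROOTCOUNT
                joined[a | c] = rows
    return joined

citee = {frozenset({"citee:P1"}): {0, 1, 3}}
citer = {frozenset({"citer:P2"}): {0, 3, 4}}
print(join(citee, citer, minisupp=2))
# {frozenset({'citee:P1', 'citer:P2'}): {0, 3}}
```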
The formal description of our P-tree-based algorithm is given in Figure 6.

PROC Mine Frequent Itemsets
  Input:
    - A set of P-trees, each representing a column, where Pi is the P-tree for column i
    - minisupp
  Output:
    - All frequent itemsets
  Method:
    - For every P-tree, check if its ROOTCOUNT is greater than minisupp; the set of columns satisfying this criterion is called Frequent_1_Itemset
    - Initialize an empty vector of frequent itemsets called Frequent_Itemsets
    - For each X in Frequent_1_Itemset:
      - Append X to Frequent_Itemsets
      - For each Y before X in Frequent_Itemsets such that Y (or any of its subsets) is not on the taboo list maintained for X:
        - Let PXY = PX AND PY (notice that an itemset can now have 1 or more items; e.g. the P-tree for itemset {5,9} can be derived by ANDing P5 and P9)
        - If ROOTCOUNT(PXY) >= minisupp, insert PXY in Frequent_Itemsets
        - Else, insert Y in the taboo list maintained for X

PROC Mine Association Rules
  Input:
    - A set of frequent itemsets, each composed of two parts: citee and citer
    - miniconf
  Output:
    - All inter-node rules such that confidence(inter-node rule) >= confidence(corresponding intra-node rule) >= miniconf
  Method:
    - For every k-itemset IS, where k > 1, check the confidence of the rule A → C such that (a) A ∩ C = Ø, (b) A ∪ C = IS, (c) A is the citee part of the itemset, and (d) C is the citer part of the itemset
    - Output the inter-node rule A → C if its confidence is greater than or equal to the confidence of its corresponding intra-node rule, which in turn must have a confidence greater than or equal to miniconf

PROC Association Rule Mining
  Input:
    - Two sets of P-trees, one for the citee part and one for the citer part, such that each P-tree represents a column, where Pi is the P-tree for column i
    - miniconf
    - minisupp
  Output:
    - All inter-node/intra-node rule combinations such that confidence(inter-node) >= confidence(intra-node) >= miniconf and support(inter-node) >= minisupp
  Method:
    - S1 = PROC Mine Frequent Itemsets(citee P-trees, minisupp)
    - S2 = PROC Mine Frequent Itemsets(citer P-trees, minisupp)
    - S = S1 Join S2
    - PROC Mine Association Rules(S, miniconf)

Figure 6. P-tree-based Algorithm.

All the procedures in Figure 6 are self-explanatory; we will highlight some important points in relation to our frequent itemset mining procedure (PROC Mine Frequent Itemsets). The procedure does a depth-first enumeration of the itemsets in the item space, I, testing the support of an itemset only after processing all of its subsets. For example, suppose that for I = {a, b, c, d, e} we have Frequent_1_Itemsets = {a, b, c, d}. The way entries are tested and inserted into the Frequent_Itemsets vector is as follows (assuming everything is frequent): a, b, ba, c, ca, cb, cba, d, da, db, dba, dc, dca, dcb, dcba. First we insert a; then we insert b and try it with everything on its left (from left to right) in the Frequent_Itemsets vector, so we try ba. Similarly we insert c, then ca, cb, and cba, etc. Note that we first insert the new frequent item in the vector and then try it with all itemsets preceding it; as a result, before testing the frequency of an itemset, we can be sure that all of its subsets are frequent and in the Frequent_Itemsets vector, thus exploiting the anti-monotonicity property of support with respect to itemset size as introduced in [2, 3]. For example, we do not try the new item c with ba if either a or b is not in the Frequent_Itemsets vector, simply because ba would cease to exist in this case; however, we would still try the new item d with cba even if cd is not frequent, because cba is in the Frequent_Itemsets vector.

To rectify this problem, we associate a temporary itemset list, referred to as a taboo list (TL), with every new frequent item, I, inserted into the vector, where we save all the itemsets that produce infrequent itemsets when joined with I. Going back to the previous example, if cd is infrequent then we append itemset c to the taboo list of item d (TL_d for short). In later steps, we can skip all supersets of c. In general, before computing the support of any new candidate itemset, X, containing item d, we check if any subset of X is in TL_d. If so, we discard X. This may seem infeasible; however, an efficient implementation has been devised, as we shall discuss in more detail in the next subsection.
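The enumeration order itself is easy to reproduce. A minimal sketch (not the authors' code) in which the frequency checks are stubbed out, so everything is assumed frequent as in the example above:

```python
# Illustrative sketch of the depth-first enumeration: each new frequent
# item is tried against every itemset already in the vector, so all
# subsets of a candidate are processed before the candidate itself.

def enumerate_itemsets(frequent_1):
    vector = []
    for x in frequent_1:
        snapshot = list(vector)      # entries present before x arrives
        vector.append((x,))
        for entry in snapshot:
            vector.append(entry + (x,))
    return vector

print(enumerate_itemsets(["a", "b", "c"]))
# [('a',), ('b',), ('a','b'), ('c',), ('a','c'), ('b','c'), ('a','b','c')]
```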

7.3 Implementation Details

For every new frequent item, I, we maintain a TL_I which stores the itemsets whose supports, when joined with I, are less than minisupp, and thus the supports of their supersets when joined with I need not be computed. In our implementation, each TL_I is a P-tree having a size equal to the number of itemsets in the Frequent_Itemsets vector preceding I (i.e. all entries that need to be joined with I). A value of 1 is used for itemsets which, when joined with I, result in infrequent itemsets; thus, none of their supersets need to be joined later with I. The remaining entries will be 0s. For example, for the set of items I = {a, b, c, d}, suppose that the entries in the Frequent_Itemsets vector created so far are: a, b, ab, c, ac, bc, abc. For item d, we initialize a TL_d having 7 entries, all containing 0s initially. If the union of item d with node b results in an infrequent itemset, then the second entry in TL_d, which corresponds to itemset b, is flagged with a 1, and so are all entries containing b (i.e. the entries pointing to ab, bc and abc). But how can we efficiently tell which other entries contain item b? We maintain for each item an index list stored as a P-tree (referred to as an index P-tree) that has a 1 value at every position where this item exists. For example, item c will have the following index list (it is stored as a P-tree, but we list the entries here for convenience): 0,0,0,1,1,1,1. In other words, item c occurs in the Frequent_Itemsets vector in positions 4, 5, 6 and 7. Every new itemset added to the vector results in the expansion of all index P-trees by either a 1, if the corresponding item is in the newly added node, or a 0 otherwise. The TL of the current item is also expanded by a 0.

Going back to the previous scenario, where joining item d with itemset b results in an infrequent itemset and thus node b needs to be added to TL_d, we simply OR the index P-tree of item b with TL_d and store the result in TL_d. In general, if we want to add itemset xy…z to some taboo list, say TL_I, we AND the index P-trees for all items in the itemset (i.e. AND the index P-tree of x with that of y … with that of z). The result gives us where itemset xy…z and all of its supersets occur in the Frequent_Itemsets vector. Then we OR the resulting index P-tree with TL_I and store the result in TL_I, which results in appending node xy…z to TL_I.

We maintain the taboo lists and index lists as P-trees, as this provides faster logical operations in addition to compression. Note that in the case of taboo lists, compression can speed up traversal, especially in cases where there are many consecutive 1s. For example, suppose the entries in a taboo list are: 1111 1100. Figure 7 depicts the corresponding P-tree of the given taboo list. In this example, instead of going through the first four nodes sequentially and then skipping them because they are flagged with 1s, using a P-tree to represent the taboo list we can directly skip the first 4 entries because they form a pure-1 node(5) on the 2nd level of the P-tree.

(5) Refer to section 7.1 for more details on pure-1 nodes.
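The index-list and taboo-list bookkeeping can be pictured with the following minimal sketch (not the authors' code), using plain Python integers as stand-ins for uncompressed P-trees; bit i refers to the i-th entry of the Frequent_Itemsets vector:

```python
# Illustrative sketch of section 7.3: an index list has bit i = 1 iff
# the item occurs in vector[i]; ANDing index lists locates an itemset
# and all its supersets, and ORing the result into a taboo list flags
# them all at once.

vector = ["a", "b", "ab", "c", "ac", "bc", "abc"]   # Frequent_Itemsets
index = {
    "a": 0b1010101,   # a occurs in entries 1, 3, 5, 7 (bit 0 = entry 1)
    "b": 0b1100110,   # b occurs in entries 2, 3, 6, 7
    "c": 0b1111000,   # c occurs in entries 4, 5, 6, 7
}

def flag(itemset, taboo):
    """Append `itemset` to a taboo list: AND its items' index lists,
    then OR the resulting mask into the list."""
    mask = -1
    for item in itemset:
        mask &= index[item]
    return taboo | mask

taboo_d = 0                      # TL_d for the new item d, initially 0s
taboo_d = flag({"b"}, taboo_d)   # {b, d} turned out infrequent

for pos, entry in enumerate(vector):
    if not (taboo_d >> pos) & 1:             # skip flagged positions
        print("test candidate:", entry, "+ d")
# Only a, c and ac survive; b, ab, bc and abc are skipped.
```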

Figure 7. The resulting taboo list in P-tree format.

Note that the index P-trees are an exact replica of the Frequent_Itemsets vector but in binary format (i.e. all the items of every itemset in the vector have a 1 value in their index P-trees at the position of the itemset), so we need not physically store the Frequent_Itemsets vector; all frequent itemsets can be derived from the index P-trees. In addition, each TL is only stored for the life of the processing of the corresponding item, after which it can be discarded. Also, during its life it is stored as a P-tree and thus is compressed when possible. As a result, our approach has no storage overhead other than the temporary taboo lists, each lasting for the duration of the processing of the corresponding item, after which it is discarded.

Our approach also eliminates the need for the two steps required by Apriori to generate candidate itemsets, namely, the join and prune steps. Apriori joins any two k-itemsets sharing the first k-1 items in the join step and prunes all those (k+1)-itemsets that have at least one infrequent subset in the prune step. Pruning can be accomplished either by searching, which incurs an execution time cost, or by using special data structures to store frequent itemsets, such as the hash trees suggested in [3], which incurs a space cost. In our approach, we combine those two steps into one, because we only form a candidate itemset if and only if all subsets of that itemset are frequent.

7.4 Performance Analysis

To the best of our knowledge, no work before has attempted to discover rules for the analysis of citation graphs as we do in this work. To give the reader a clearer view of our efficiency, we developed an implementation of the work suggested herein using P-trees (called P-ARM) and compared it with two contemporary association rule mining approaches, both of which use a horizontal data representation, namely, FP-Growth (FPG) [12], which is known to be one of the best state-of-the-art approaches, and Depth-First Apriori (DFA) [17], which also uses a depth-first traversal of the search space. It has often been noted in the data mining literature that experimental comparison results are dependent on the implementations used; as a result, we used publicly available implementations of the approaches we compared with. For FPG, we used the popular implementation by Goethals, which is available for download at http://www.cs.helsinki.fi/u/goethals/software/index.html. We used the Kosters et al. implementation of DFA from the Frequent Itemset Mining Implementations Repository [11].

Our implementation was coded in C++ and executed on an Intel Pentium-4 2.4GHz workstation with 2GB RAM running Debian Linux. To show our efficiency over large datasets, we generated a number of synthetic datasets using IBM's Quest synthetic data generator [13] because, to the best of our knowledge, such large datasets are not publicly available, e.g. on the UCI data repository [39] or on the Frequent Itemset Mining Implementations Repository [11]. Table 1 below briefly describes each of those datasets, where ITEMS is the number of items in the whole dataset, TRANS is the number of transactions in millions, and AVG is the average number of items per transaction.
7.4 Performance Analysis

To the best of our knowledge, no previous work has attempted to discover rules for the analysis of citation graphs as we do in this work. To give the reader a clearer view of our efficiency, we developed an implementation of the work suggested herein using P-trees (called P-ARM) and compared it with two contemporary association rule mining approaches, both of which use horizontal data representation: FP-Growth (FPG) [12], which is known to be one of the best state-of-the-art approaches, and Depth-First Apriori (DFA) [17], which also uses a depth-first traversal of the search space. It has often been noted in the data mining literature that experimental comparison results depend on the implementations used; as a result, we used publicly available implementations of the approaches we compared with. For FPG, we used the popular implementation of Goethals, which is available for download at http://www.cs.helsinki.fi/u/goethals/software/index.html. We used the implementation of DFA by Kosters et al. from the Frequent Itemset Mining Implementations Repository [11].

Our implementation was coded in C++ and executed on an Intel Pentium-4 2.4GHz workstation with 2GB RAM running Debian Linux. To show our efficiency over large datasets, we generated a number of synthetic datasets using IBM's Quest synthetic data generator [13] because, to the best of our knowledge, such large datasets are not publicly available, e.g., on the UCI data repository [39] or on the Frequent Itemset Mining Implementations Repository [11]. Table 1 below briefly describes each of those datasets, where ITEMS is the number of items in the whole dataset, TRANS is the number of transactions in millions, and AVG is the average number of items per transaction.

Table 1. Datasets descriptions.

      ITEMS  TRANS  AVG
DS1   100    20     10
DS2   500    10     10
DS3   1000   5      100

The experiments presented herein are designed to study the performance of our approach against contemporary approaches. We tried to create datasets with characteristics similar to citation graph datasets represented in the form suggested in this work, especially in terms of depth and width. We focus especially on large datasets that are relatively sparse in order to demonstrate the efficiency of our approach. We take "sparse" to mean few items in most transactions of the dataset. The rationale for focusing on relatively sparse datasets is that, using our data representation model, even though the number of transactions representing edges in the graph can grow enormously, the number of items per transaction (i.e. the number of subjects a publication can belong to) is estimated to be rather small. The average number of items per transaction in DS1 and DS3 is 10% of the total number of items in the item space. In DS2, this number drops to 2%, making DS2 sparser than DS1 and DS3 (both of which are still considered relatively sparse). Note that we attempted to expand the last dataset, DS3, to ten million records, but FPG ran out of memory and DFA took so much time that we had to terminate it manually.

There are two important parts that need to be highlighted in our approach: (1) the efficiency of the overall algorithm, which utilizes a divide-and-conquer methodology, and (2) the efficiency of mining frequent itemsets, which is the dominant factor in almost all ARM algorithms. For (1), we compare with DFA and FPG, which mine the frequent itemsets over the whole dataset and retain only those that contain both citee and citer items. For (2), we compare with another implementation using P-trees that is based on Apriori (called P-APRIORI). P-APRIORI uses the same overall algorithm that we use, including mining frequent itemsets from the citee and citer parts separately, except that a vertical Apriori implementation based on P-trees, as in [8], substitutes for our frequent itemset mining algorithm. For all the experiments, we focus on mining all the frequent itemsets, varying only the minimum support threshold.
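To make the divide-and-conquer setup concrete before turning to the measurements, the toy sketch below mines Boolean item columns on the citee side and the citer side separately and then intersects only cross-side combinations, instead of mining the whole item space at once. The data, threshold, and helper names are fabricated for illustration; P-ARM itself is a compressed C++ implementation over P-trees.

```python
# Sketch of the divide-and-conquer idea: items are Boolean columns
# over the edge transactions; citee-side and citer-side frequent
# itemsets are mined separately, and only their cross combinations
# are tested. Toy data and naive search, for illustration only.

from itertools import combinations

def support(cols):
    """Count transactions containing all items: AND the bit columns."""
    acc = cols[0]
    for c in cols[1:]:
        acc = [a & b for a, b in zip(acc, c)]
    return sum(acc)

def frequent_itemsets(items, minsup):
    """Naive levelwise search restricted to one side (citee or citer)."""
    freq = []
    for k in range(1, len(items) + 1):
        level = [s for s in combinations(sorted(items), k)
                 if support([items[i] for i in s]) >= minsup]
        if not level:          # anti-monotonicity: no larger sets can qualify
            break
        freq.extend(level)
    return freq

# Toy data: 6 edge transactions, two citee items and two citer items.
citee = {"A_CITEE": [1, 1, 1, 0, 1, 0], "B_CITEE": [1, 1, 0, 0, 1, 0]}
citer = {"X_CITER": [1, 1, 1, 1, 0, 0], "Y_CITER": [1, 0, 1, 0, 0, 1]}

minsup = 2
for cs in frequent_itemsets(citee, minsup):      # mined on the citee part only
    for rs in frequent_itemsets(citer, minsup):  # mined on the citer part only
        cols = [citee[i] for i in cs] + [citer[i] for i in rs]
        s = support(cols)
        if s >= minsup:                          # candidate inter-node itemset
            print(cs, "+", rs, "support =", s)
```

The AND-and-count step is where the vertical representation pays off: with P-trees the intersection operates on compressed nodes rather than on raw bits.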
Figures 8, 9 and 10 show comparison results for the four approaches described previously on the given datasets at various support thresholds. As is evident from the figures, the two P-tree-based approaches, which are based on the divide-and-conquer algorithm proposed herein, show better results over all the given datasets, with notable improvements of more than half an order of magnitude in the case of DS2. On the other, denser datasets, DS1 and DS3, improvements vary between two times and more than seven times. This demonstrates the effectiveness of the overall algorithm proposed in this work over large and relatively sparse datasets, which is the expected format of the citation graph data targeted by our work. Compared to P-APRIORI, which does not use the proposed frequent itemset mining, our approach reduces the time by more than one half over DS1 and DS3 and by less than one half over DS2. Again, this supports our claims regarding the effectiveness of the frequent itemset mining approach integrated into this work.

Performance shows an improvement of more than half an order of magnitude at low support thresholds, where the number of itemsets produced is potentially very large. The improvements of our approach can be mainly attributed to the divide-and-conquer methodology utilized and to the Boolean vertical data representation used, which is complemented by compressed data structures resulting in very fast itemset intersection operations. In addition, our approach enumerates subsets in a very efficient manner that eliminates non-frequent candidate itemsets without extra storage or time overhead, thus resulting in a faster frequent itemset mining algorithm.

Figure 8. Performance analysis results over DS1 (time in seconds versus support %; curves for P-ARM, FPG, DFA, and P-APRIORI).

Figure 9. Performance analysis results over DS2 (time in seconds versus support %; curves for P-ARM, FPG, DFA, and P-APRIORI).

Figure 10. Performance analysis results over DS3 (time in seconds versus support %; curves for P-ARM, FPG, DFA, and P-APRIORI).

It is worth noting that results on smaller publicly available datasets, such as those on [11] and [39], were not as encouraging, especially when compared with FPG. A justification for this is that FPG views transactions as ordered strings of items and proceeds by storing the transactions in a trie data structure, creating an in-memory version of the dataset, and then traverses the trie to derive all frequent itemsets. After creating the trie, no database scans are needed. In general, this is very feasible when the dataset is relatively small, so that it fits in memory, and fairly dense, where it compresses well due to the use of tries (after ordering all items in transactions so as to increase the overlap; a toy illustration of this compression effect appears at the end of this subsection). This observation is demonstrated in our experiments, where FPG does not perform well on very large, relatively sparse datasets, unlike the case of small dense datasets and even small sparse datasets, as shown in Figures 11 and 12, respectively, which we include for completeness.

Figure 11 shows the performance results of the above four approaches over the "Chess" dataset [11], which has 75 items and 3196 transactions each containing 37 items (i.e. very dense). FPG clearly outperforms all other approaches on both datasets due to the factors mentioned above. A justification for our performance degradation is not being able to amortize the cost of creating and processing P-trees over very small datasets, where even compression does not prove to be effective. Note that, using P-trees, we can also compress very dense datasets because of the large numbers of consecutive 1s. Our approach ranks second, showing better results than P-APRIORI and DFA, which rank third and fourth, respectively, on both datasets.

Figure 11. Performance analysis results over Chess (time in seconds versus support %; curves for P-ARM, FPG, DFA, and P-APRIORI).

Figure 12 depicts the performance results over the "BMS-POS" dataset from [11]. This dataset has 1657 items and 515597 transactions with a maximum of 165 items per transaction (i.e. relatively sparse). To some extent, P-ARM performs better than FPG, which in turn performs better than DFA over this dataset; however, even though the performance of FPG is comparatively good, it has degraded compared to its performance over the "Chess" dataset. This supports our previous statement regarding the better performance of FPG over dense datasets, where the trie data structure compresses very well. The performance degradation of P-APRIORI suggests poor performance for the Apriori algorithm over this dataset, which could not be circumvented by our divide-and-conquer approach. For P-ARM, results have improved greatly from those over the "Chess" dataset, which could be justified by the bigger size of this dataset (though still considered relatively small) and its sparsity, thus supporting our previous claims.

Figure 12. Performance analysis results over BMS-POS (time in seconds versus support %; curves for P-ARM, FPG, DFA, and P-APRIORI).
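The toy illustration promised above shows why trie storage favors dense data: after sorting each transaction, shared prefixes collapse into shared trie paths, so a dense database needs far fewer nodes than its raw item occurrences, while a sparse one does not. This is a schematic of the idea only, not Goethals' FPG code.

```python
# Transactions are sorted so shared prefixes overlap, then stored in
# a dictionary-based trie. Dense data collapses into few nodes; on
# sparse data the trie is roughly as large as the raw database.

def insert(trie, transaction):
    node = trie
    for item in sorted(transaction):   # ordering maximizes prefix overlap
        node = node.setdefault(item, {})

def count_nodes(trie):
    return sum(1 + count_nodes(child) for child in trie.values())

dense = [["a", "b", "c", "d"], ["a", "b", "c", "e"], ["a", "b", "c", "d"]]
sparse = [["a", "f"], ["b", "g"], ["c", "h"]]

for name, db in (("dense", dense), ("sparse", sparse)):
    trie = {}
    for t in db:
        insert(trie, t)
    raw = sum(len(t) for t in db)
    print(name, "raw item occurrences:", raw, "trie nodes:", count_nodes(trie))
    # dense: 12 occurrences stored in 5 nodes; sparse: 6 in 6
```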

7.5 Result Analysis

We ran our algorithm several times on the dataset described in section 2 using different support threshold values and noted the impact on the number of produced intra- and inter-node rules. Miniconf (the minimum confidence threshold) was set to zero. Figure 13 depicts a graphical representation of the number of intra- and inter-node rules produced versus the support value chosen. The figure also depicts a table showing the exact measures in both cases.

Figure 13. Effect of support on the number of intra- and inter-node rules (number of rules versus minimum support).

Minimum support    1   2   3   4   5  10  15  20  25  30  35  40  45  50  55
Inter-node rules  20  28  18  42  36  14   4   1   1   1   1   1   1   1   0
Intra-node rules  85  35  23  18  10  19  13  87  81  59  59  37  34   3   0

Though the number of intra-node rules is, in most cases, larger than that of the inter-node rules, it is interesting that whenever we have intra-node rules, at least one inter-node rule shows. Another noteworthy observation is that, in the support range of [20, 50], only one inter-node rule shows (support is given in absolute values). The rule is 11.30.Pb_CITEE --> 11.30.Er_CITER, 11.27.+d_CITER (conf>=0.294798, supp=94/1448); it associates the 11.30.Pb citee PACS number with the 11.30.Er and 11.27.+d citer PACS numbers with a confidence of approximately 29.5% and a support of approximately 6.5%. This rule has the highest support among all inter-node rules produced by our ARM algorithm. Note that no rule, neither intra-node nor inter-node, is produced when support exceeds 55.

Confidence values for inter-node rules fluctuate between 0.004 and 1 (i.e. from 0.4% to 100%). The following rule has a confidence of 100% and is worth mentioning: 11.20.Dj_2_CITEE --> 03.65.Db_2_CITER, 03.80.+r_2_CITER (conf=100%, supp=0.35%).

Figure 14 presents some of the inter-node rules that were produced at different support values. We include them for completeness. PACS numbers drawn from the citee set and the citer set are appended with _CITEE and _CITER, respectively.
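As a sanity check on the reported measures, the short computation below reproduces the support of the strongest inter-node rule. The antecedent count used for the confidence is not reported in our results; it is a hypothetical value chosen only to be consistent with the stated confidence bound.

```python
# Support is reported in absolute form (94 of 1448 edge transactions);
# confidence is the support of the whole rule divided by the support
# of the antecedent (the citee item).

total_transactions = 1448
rule_count = 94                # edges containing the citee item and both citer items

support = rule_count / total_transactions
print(f"support = {support:.3%}")        # about 6.5%

antecedent_count = 318         # hypothetical count of 11.30.Pb_CITEE, inferred
                               # so that 94/318 ~ 0.2956 >= the reported 0.294798
confidence = rule_count / antecedent_count
print(f"confidence = {confidence:.4f}")  # about 29.5%
```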

Figure 14. Subset of the set of rules generated using different support values.

Support = 20/1448 to Support = 50/1448 (~1.38% to 3.45%):
11.30.Pb_CITEE --> 11.30.Er_CITER, 11.27.+d_CITER (conf>=0.294798, supp=94/1448)

Support = 15/1448 (~1%):
11.30.Rd_CITEE --> 12.39.Fe_CITER, 12.38.Lg_CITER (conf>=0.25, supp=15/1448)
12.38.Lg_CITEE --> 11.30.Rd_CITER, 12.39.Fe_CITER (conf>=0.267857, supp=16/1448)
12.10.-g_CITEE --> 04.50.+h_CITER, 11.25.Mj_CITER (conf>=0.475, supp=20/1448)
11.30.Pb_CITEE --> 11.30.Er_CITER, 11.27.+d_CITER (conf>=0.294798, supp=94/1448)

Figure 15. Rules after replacing PACS numbers with their definitions.

Support = 20/1448 to Support = 50/1448 (~1.38% to 3.45%):
"Supersymmetry" --> "Charge conjugation, parity, time reversal, and other discrete symmetries", "Extended classical solutions; cosmic strings, domain walls, texture" (conf>=0.294798, supp=51/1448)

Support = 15/1448 (~1%):
"Chiral symmetries" --> "Chiral Lagrangians", "Other nonperturbative calculations" (conf>=0.25, supp=15/1448)
"Other nonperturbative calculations" --> "Chiral symmetries", "Chiral Lagrangians" (conf>=0.267857, supp=15/1448)
"Unified field theories and models" --> "Gravity in more than four dimensions, Kaluza-Klein theory, unified field theories; alternative theories of gravity", "Compactification and four-dimensional models" (conf>=0.475, supp=19/1448)
"Supersymmetry" --> "Charge conjugation, parity, time reversal, and other discrete symmetries", "Extended classical solutions; cosmic strings, domain walls, texture" (conf>=0.294798, supp=51/1448)

In order to understand the semantics of the produced rules, we consulted the description of the PACS numbers available from the American Institute of Physics (AIP) [5] at http://www.aip.org/pacs/pacs03/all.txt. This description associates every PACS number with its corresponding subject matter. For example, the rule with the highest support among inter-node rules, 11.30.Pb_CITEE --> 11.30.Er_CITER, 11.27.+d_CITER (conf>=0.294798, supp=94/1448), can now be semantically rewritten as: "Supersymmetry" --> "Charge conjugation, parity, time reversal, and other discrete symmetries" and "Extended classical solutions; cosmic strings, domain walls, texture" (conf>=29.5%, supp=6.5%).
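A minimal sketch of this rewriting step follows, assuming the AIP description file pairs each PACS number with its subject on a single line; the inline sample imitates that layout, and the actual file should be consulted for the exact format.

```python
# Build a PACS-number-to-subject lookup table and substitute codes in
# a rule. The sample mimics the assumed layout of the AIP file at
# http://www.aip.org/pacs/pacs03/all.txt.

sample = """\
11.30.Pb Supersymmetry
11.30.Er Charge conjugation, parity, time reversal, and other discrete symmetries
11.27.+d Extended classical solutions; cosmic strings, domain walls, texture
"""

pacs = {}
for line in sample.splitlines():
    code, _, subject = line.partition(" ")   # first token is the PACS number
    pacs[code] = subject

def describe(item):
    """Map e.g. '11.30.Pb_CITEE' to '"Supersymmetry"_CITEE'."""
    code, _, role = item.partition("_")
    return f'"{pacs.get(code, code)}"_{role}'

antecedent = ["11.30.Pb_CITEE"]
consequent = ["11.30.Er_CITER", "11.27.+d_CITER"]
print(", ".join(describe(i) for i in antecedent), "-->",
      ", ".join(describe(i) for i in consequent))
```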

According to our analysis, subject matter "Supersymmetry" has an impact on the subject matters "Charge conjugation, parity, time reversal, and other discrete symmetries" and "Extended classical solutions; cosmic strings, domain walls, texture", where the latter subject matters have extended from the former in some way. Figure 15 lists the rules from Figure 14 with the PACS numbers replaced by their semantic equivalents from AIP.

The discovered rules were analyzed and evaluated by experts in physics field theory. One observation was that most of the rules have older and more general antecedents (citee part) than consequents (citer part), possibly indicating that physics researchers tend to cite the entire history of development of a subject, going back several decades if necessary, thereby connoting that older and more general subjects are more likely to be cited often with time. It is clear that old and general subjects form the ground for most research subjects that come later, thus supporting our claims regarding the ability of the discovered rules to highlight subject extensions and evolution over time. In addition, for most of the rules, it could be confirmed that consequent subject matters have, in fact, extended in some form from antecedent subject matters. However, some of the rules require further investigation.

By forming chains of rules, matching the antecedent of one rule with the consequent of another, we were able to look many hops backwards in subject evolution (or forward in subject extensions) and to understand how subjects form and what their future impacts are. Thus, even though we focused only on direct citations by limiting ourselves to the use of single-edge paths as the transactions in our data model, we are able to gain deeper insights on subject evolution and extensions, and on future impacts, by forming rule chains. All of these observations fit, to a large extent, our motivation for the format of the desired rules, the subjective notion of interestingness, and our claims regarding their usability.
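The chaining procedure just described can be sketched as follows; the subject names in the toy rule set are placeholders for illustration, not rules actually produced by our experiments.

```python
# Link rules whose citer (consequent) subject matches another rule's
# citee (antecedent) subject, tracing multi-hop subject evolution.

rules = [
    ("gauge theory", "supersymmetry"),   # (citee subject, citer subject)
    ("supersymmetry", "cosmic strings"),
    ("cosmic strings", "brane cosmology"),
]

def chains(rules, start, path=None):
    """Enumerate maximal forward chains starting from a subject."""
    path = path or [start]
    extended = False
    for citee, citer in rules:
        if citee == path[-1] and citer not in path:   # avoid cycles
            extended = True
            yield from chains(rules, start, path + [citer])
    if not extended:
        yield path

for chain in chains(rules, "gauge theory"):
    print(" --> ".join(chain))
# gauge theory --> supersymmetry --> cosmic strings --> brane cosmology
```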
8. CONCLUSION

In this paper, we have proposed an efficient model for understanding temporal aspects in publication trends through citation graphs by identifying patterns in the subjects of scientific publications. Our approach has shown good improvements when compared to other contemporary approaches, especially on very large and relatively sparse data. Patterns of interest could reveal the original subject matters from which other subjects of interest might have extended later in time, the evolution of subjects, and the potential effects of current research on future research. We hope to have initiated a novel research direction which will lead to a better understanding of how citation graph data ought to be analyzed.

We proposed an efficient vertical ARM model for representing citation graph data and generating rules capable of associating research subjects of publications written at different points in time. In the future, we plan to study the efficacy of the suggested temporal and subject constraints proposed to reduce the number of considered graph edges by focusing only on the nodes (along with their incoming or outgoing edges) satisfying the given constraints. We would also like to analyze the usefulness of the concepts presented herein when applied to other potential domains such as citation mining and web structure mining.

In addition, in our citation data analysis, we have exploited the time factor embedded in the directionality of the edges in the citation graph in an efficient manner. An edge from node X to node Y in a citation graph implies that paper X cites paper Y and, more importantly, that Y was written before X (ignoring factors such as "preprints" and "self-citations"). The time factor is perhaps one of the main reasons we restricted our analysis to citation graph data; nevertheless, we believe our techniques could be generalized to other application domains. Another future direction in this area would be to analyze different types of directed graphs with the aim of exploiting factors other than the time factor.

REFERENCES

[1] A Roadmap to Text Mining and Web Mining. http://www.cs.utexas.edu/users/pebronia/text-mining. July 2004.
[2] R. Agrawal, T. Imielinski, and A. Swami, Mining association rules between sets of items in large databases. Proceedings of the ACM SIGMOD International Conference on Management of Data (Washington, D.C.), 1993.
[3] R. Agrawal and R. Srikant, Fast algorithms for mining association rules. Proceedings of the VLDB International Conference on Very Large Databases (Santiago, Chile), 1994.
[4] T. Akutsu, S. Miyano, and S. Kuhara, Identification of genetic networks from a small number of gene expression patterns under the Boolean network model. 1999. http://citeseer.nj.nec.com/akutsu99identification.html. June 2003.
[5] American Institute of Physics. http://www.aip.org/. June 2004.
[6] M. Brown et al., "Knowledge-based analysis of microarray gene expression data by using support vector machines." Proceedings of the National Academy of Sciences, 97(1): 262-267, 2000.
[7] S. Chakrabarti, B. E. Dom, D. Gibson, J. Kleinberg, R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins, "Mining the Link Structure of the World Wide Web." IEEE Computer, 32(8): 60-67, August 1999.
[8] Qin Ding, Qiang Ding, and W. Perrizo, Association Rule Mining on Remotely Sensed Images Using P-trees. Proceedings of the PAKDD Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer-Verlag, Lecture Notes in Artificial Intelligence 2336, 66-79, May 2002.
[9] Q. Ding, M. Khan, A. Roy, and W. Perrizo, The P-tree algebra. Proceedings of the ACM SAC Symposium on Applied Computing (Madrid, Spain), 2002.
[10] L. Egghe and R. Rousseau, "Introduction to Informetrics." Elsevier, 1990.
[11] Frequent Itemset Mining Implementations Repository, http://fimi.cs.helsinki.fi.
[12] J. Han, J. Pei and Y. Yin, Mining Frequent Patterns without Candidate Generation. Proceedings of the ACM SIGMOD International Conference on Management of Data (Dallas, Texas), 1-12, 2000.
[13] IBM Quest Synthetic Data Generator, http://www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html.
[14] A. Inokuchi, T. Washio, and H. Motoda, An apriori-based algorithm for mining frequent substructures from graph data. Proceedings of the PKDD European Conference on Principles and Practice of Knowledge Discovery in Databases (Lyon, France), 2000.
[15] M. Khan, Q. Ding, and W. Perrizo, K-nearest neighbor classification on spatial data streams using P-trees. Proceedings of the PAKDD Pacific-Asia Conference on Knowledge Discovery and Data Mining (Taipei, Taiwan), 2002.
[16] J. M. Kleinberg, Authoritative Sources in a Hyperlinked Environment. Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (San Francisco, California), 1998.
[17] W. A. Kosters and W. Pijls, APRIORI, A Depth First Implementation. Proceedings of the IEEE ICDM FIMI Workshop on Frequent Itemset Mining Implementations (Melbourne, Florida), 2003.
[18] R. N. Kostoff, J. A. del Rio, J. A. Humenik, L. E. O. Garcia, and L. A. M. Ramirez, "Citation Mining: integrating text mining and bibliometrics for research user profiling." Journal of the American Society for Information Science and Technology, 52(13), November 2001. ISSN: 1532-2882, John Wiley & Sons, Inc., New York, USA.
[19] M. Kuramochi and G. Karypis, An efficient algorithm for discovering frequent subgraphs. 2002. http://www.users.cs.umn.edu/~karypis/publications/Papers/PDF/fsg2.pdf. February 2003.
[20] B. Liu and W. Hsu, Post-Analysis of Learned Rules. Proceedings of the AAAI National Conference on Artificial Intelligence (Portland, Oregon), 828-834, August 1996.
[21] T. Matsuda, H. Motoda, T. Yoshida, and T. Washio, Mining patterns from structured data by beam-wise graph-based induction. 2002. http://www.ai.ijs.si/SasoDzeroski/MRDM2002/proceedings/motoda.pdf. February 2003.
[22] F. Osareh, "Bibliometrics, Citation Analysis and Co-Citation Analysis: A Review of Literature I." Libri, 46(4): 149-158, September 1996.
[23] T. Oyama, K. Kitano, K. Satou, and T. Ito, Mining association rules related to protein-protein interactions. 2000. http://citeseer.nj.nec.com/459605.html. July 2003.
[24] B. Padmanabhan and A. Tuzhilin, "Unexpectedness as a Measure of Interestingness in Knowledge Discovery." Decision Support Systems, 27(3), 1999.
[25] S. Parthasarathy and M. Coatney, Efficient Discovery of Common Substructures in Macromolecules. Proceedings of the IEEE ICDM International Conference on Data Mining (Maebashi City, Japan), December 2002.
[26] W. Perrizo, Peano count tree technology lab notes. Technical Report NDSU-CS-TR-01-1, 2001. http://www.cs.ndsu.nodak.edu/~perrizo/classes/785/pct.html. January 2003.
[27] W. Perrizo, Qin Ding, A. Denton, K. Scott, Qiang Ding, and M. Khan, PINE - Podium Incremental Neighbor Evaluator for spatial data using P-trees. Proceedings of the ACM SAC Symposium on Applied Computing (Melbourne, Florida), 2003.
[28] W. Perrizo, Q. Ding, and A. Roy, Deriving high confidence rules from spatial data using Peano count trees. Proceedings of the WAIM International Conference on Web-Age Information Management (Xi'an, China), 91-102, July 2001.
[29] G. Piatetsky-Shapiro and C. J. Matheus, The Interestingness of Deviations. Proceedings of the AAAI Workshop on Knowledge Discovery in Databases (Seattle, Washington), 25-36, August 1994.
[30] P. Pipenbacker et al., ProClust: Improved clustering of protein sequences with an extended graph-based approach. 2002. http://citeseer.nj.nec.com/inokuchi00aprioribased.html. July 2003.
[31] L. Pray, "Unraveling Protein-Protein Interactions." The Scientist, 17(2), January 2003.
[32] I. Rahal and W. Perrizo, An Optimized Approach for KNN Text Categorization using P-trees. Proceedings of the ACM SAC Symposium on Applied Computing (Nicosia, Cyprus), March 2004.
[33] M. Serazi, A. Perera, Q. Ding, V. Malakhov, I. Rahal, F. Pan, D. Ren, W. Wu and W. Perrizo, DataMIME™. Proceedings of the ACM SIGMOD International Conference on Management of Data (Paris, France), June 2004.
[34] P. Shenoy, J. Haristsa, S. Sudatsham, G. Bhalotia, M. Baqa and D. Shah, Turbo-charging vertical mining of large databases. Proceedings of the ACM SIGMOD International Conference on Management of Data (Austin, Texas), 22-29, May 2000.
[35] A. Silberschatz and A. Tuzhilin, On Subjective Measures of Interestingness in Knowledge Discovery. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Montreal, Canada), 275-281, 1995.
[36] A. Silberschatz and A. Tuzhilin, "What Makes Patterns Interesting in Knowledge Discovery Systems." IEEE Transactions on Knowledge and Data Engineering, Special Issue on Data Mining, 8(6): 970-974, 1996.
[37] E. Suzuki, Autonomous Discovery of Reliable Exception Rules. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Newport Beach, California), 259-262, August 1997.
[38] TouchGraph LLC, www.touchgraph.com. July 2004.
[39] University of California Irvine data repository, http://kdd.ics.uci.edu.
[40] B. Wang, F. Pan, D. Ren, Y. Cui, Q. Ding, and W. Perrizo, Efficient OLAP operations for spatial data using Peano trees. Proceedings of the ACM SIGMOD Workshop on Data Mining and Knowledge Discovery (San Diego, California), 2003.
[41] X. Yan and J. Han, gSpan: Graph-Based Substructure Pattern Mining. 2003. http://citeseer.nj.nec.com/yan02gspan.html. February 2003.
[42] M. Zaki and K. Gouda, Fast Vertical Mining Using Diffsets. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Washington D.C., USA), August 2003.
