Finding Patterns in an Unknown Graph
Roni Stern a, Meir Kalech a and Ariel Felner a
a Department of Information System Engineering, Ben Gurion University of the Negev, Beer-Sheva, 84105, Israel
E-mails: [email protected], [email protected], [email protected]

Abstract. Solving a problem in an unknown graph requires an agent to iteratively explore parts of the searched graph. Exploring an unknown graph can be very costly, for example, when the exploration requires activating a physical sensor or performing network I/O. In this paper we address the problem of searching for a given input pattern in an unknown graph, while minimizing the number of required exploration actions. This problem is analyzed theoretically. Then, algorithms that choose which part of the environment to explore next are presented. Among these are adaptations of existing algorithms for finding cliques in a known graph as well as a novel heuristic algorithm (Pattern∗). Additionally, we investigate how probabilistic knowledge of the existence of edges can be used to further minimize the required exploration. With this additional knowledge we propose a Markov Decision Problem (MDP) formulation and a Monte-Carlo based algorithm (RPattern∗) which greatly reduces the total exploration cost. As a case study, we demonstrate how the different heuristic algorithms can be implemented for the k-clique pattern as well as for the complete bipartite pattern. Experimental results are provided that demonstrate the strengths and weaknesses of the proposed approaches on random and scale-free graphs as well as on an online web crawler application searching in Google Scholar. In all the experimental settings we have tried, the proposed heuristic algorithms were able to find the searched pattern with substantially less exploration cost than random exploration.

Keywords: Heuristic search, Unknown graphs, Subgraph isomorphism

AI Communications ISSN 0921-7126, IOS Press. All rights reserved

1. Introduction

Many real-life problems can be modeled as problems on graphs, where one needs to find subgraphs that have a specific structure or attributes. Examples of such subgraph structures are shortest paths, any path, shortest traveling salesperson tours and cliques. Most classical algorithms that solve such graph problems assume that the structure of the graph is given either explicitly, in data structures such as an adjacency list or adjacency matrix, or implicitly, with a start state and a set of computable operators (e.g., moving a tile in a sliding tile puzzle state). The complexity of such search algorithms is therefore measured with respect to CPU time and memory demands. We refer to such problems as problems on known graphs.

By contrast, there are domains that can be modeled as graphs, where the graph structure is not known a priori, and exploring vertices and edges requires a different type of resource, that is, neither CPU nor memory. One example is a robot navigating in an unknown terrain, where vertices and edges correspond to physical locations and roads, respectively. Acquiring knowledge about the vertices and edges of the searched graph may require activating a physical sensor and possibly mobilizing the robot, incurring a cost of fuel (or any other energy resource). Another example is an agent searching the World Wide Web, where the web sites and hypertext links represent the vertices and edges of the searched graph, respectively. Since the web is extremely large and dynamic, accessing vertices requires sending and receiving network packets (e.g., HTTP request/response). We refer to such problems as problems on unknown graphs.

Solving problems in unknown graphs (as well as any other type of graph problem) requires exploring some parts of the graph. We define an exploration action for a vertex as an action that discovers all its outgoing edges and neighboring vertices. Such exploration actions are associated with a cost, denoted hereafter as exploration cost. This exploration cost is often conceptually different from the traditional computational effort (of CPU and memory). In the web graph domain, for example, the exploration can correspond to sending an HTTP request, retrieving an HTML page and parsing all the hypertext links in it. The hypertext links are the outgoing edges, and the connected web sites are the neighboring vertices. The associated exploration cost includes the network I/O of sending and receiving IP packets. For a physical domain, where a robot is navigating in an unknown terrain, the exploration is done by using sensors at a location to discover the nearby area, e.g., the outgoing edges and the neighboring vertices in the map. The associated exploration cost includes the cost of activating sensors at a vertex. In both cases the CPU and memory costs are often negligible in comparison to the other exploration costs. An important task, which is addressed in this paper, is to solve the problem while minimizing the exploration cost. This is especially important when CPU cost is of lesser importance and can be neglected as long as the running time remains tractable.

In this paper, we address the problem of searching for a specific pattern of vertices and edges in an unknown graph while aiming to minimize the exploration cost. Starting from a single known vertex, the search is performed in a best-first search manner. In every step, if the desired pattern does not exist in the known part of the graph, a single "best" vertex is chosen and explored. This process is repeated until the desired pattern is found or until the entire unknown graph is explored.

Several general heuristic algorithms are proposed for choosing which vertex to explore next: KnownDegree, Pattern∗ and RPattern∗. KnownDegree is a straightforward adaptation of a common known graph heuristic, in which the vertex with the highest degree is explored first. Pattern∗ exploits the structure of the searched pattern by choosing to explore the vertex that is "closest" to being a part of the searched pattern. A metric for "closeness" of a vertex to a pattern is presented. With this "closeness" metric, Pattern∗ has the property of returning a tight lower bound on the number of exploration steps required to find the searched pattern. For scenarios where probabilistic knowledge of the unknown graph is available, we propose the RPattern∗ heuristic algorithm. RPattern∗ is a randomized heuristic algorithm that chooses the next vertex to explore by applying a Monte-Carlo sampling procedure in combination with Pattern∗ as a default heuristic.

To demonstrate the applicability of the proposed heuristic algorithms, we describe how to implement them for two specific patterns: a k-clique and a complete (p-q)-bipartite graph. We develop the concept of a potential k-clique and a potential complete (p-q)-bipartite graph, along with supporting corollaries that allow efficient implementation of the Pattern∗ heuristic algorithm for these patterns. Empirical evaluations were performed on the k-clique pattern, by applying the proposed heuristic algorithms when searching random and scale-free graphs. Results show that the performance of Clique∗ and RClique∗ (the k-clique variants of Pattern∗ and RPattern∗, respectively), in terms of exploration cost, is equal to, and often much better than, that of an adaptation of the state-of-the-art clique search algorithm. The strengths and weaknesses of the different heuristic algorithms are evaluated. We also implemented and evaluated the algorithms on a web crawler application, where the papers that are accessible via Google Scholar are the vertices of the searched unknown graph. Results show that using Clique∗, cliques are found more often and with less exploration cost compared to KnownDegree and random exploration.

Beyond the value of investigating such a basic problem in the unknown graph setting, finding patterns in an unknown graph has practical applications in real-world domains. For example, finding a set of physical locations forming a clique suggests the existence of a metropolitan area. Another example is a set of scientific papers, where finding a set of papers that reference each other suggests resemblance in content. Therefore, finding such a cluster of referencing papers can be useful in a data mining context (complemented by a textual data mining approach), where the goal is to find a set of scientific papers from a given subject. Section 8 describes experimental results of such a web crawler application where a k-clique of referencing papers is searched in Google Scholar.

This paper is organized as follows. First we formally define the problem of finding a pattern in an unknown graph (Section 2) and list related work (Section 3). Then, a best-first search framework for solving this problem is described (Section 4). Several deterministic heuristic algorithms are given for this best-first search framework (Section 5), as well as a heuristic algorithm that can exploit probabilistic knowledge of the searched graph (Section 6). We then analyze the proposed best-first search framework theoretically (Section 7) and compare the described heuristic algorithms experimentally (Section 8). The paper concludes with a discussion and future work (Section 9).

A preliminary version of this paper already appeared [1], focusing on searching for the k-clique pattern in an unknown graph. In this paper we take a big step beyond that work, as detailed in Section 3.

2. Problem Definition

Following are several definitions and notations required for formally describing the problem of finding a specific pattern in an unknown graph.