Internet Technology Evolution

Querical Data Networks

Cyrus Shahabi and Farnoush Banaei-Kashani

University of Southern California (USC), Computer Science Department

INTRODUCTION

Recently, a family of massive self-organizing data networks has emerged. These networks mainly serve as large-scale distributed query processing systems. We term these networks Querical Data Networks (QDNs). A QDN is a federation of a dynamic set of peer, autonomous nodes communicating through a transient-form interconnection. Data is naturally distributed among the QDN nodes at an extra-fine grain, where a few data items are dynamically created, collected, and/or stored at each node. Therefore, the network scales linearly with the size of the dataset. With a dynamic dataset, a dynamic and large set of nodes, and a transient-form communication infrastructure, QDNs should be considered the new generation of distributed database systems, with significantly less constraining assumptions than their ancestors. Peer-to-peer networks (Daswani, 2003) and sensor networks (Estrin, 1999; Akyildiz, 2002) are well-known examples of QDNs.

QDNs can be categorized as instances of “complex systems” (Bar-Yam, 1997) and studied using the complex system theory. Complex systems are (mostly natural) systems that are hard (or complex) to describe information-theoretically and hard to analyze computationally. QDNs share these characteristics and, in particular, bear a significant similarity to a dominant subset of complex systems that are most properly modeled as large-scale interconnections of functionally similar (or peer) entities. The links in the model represent some kind of system-specific entity-to-entity interaction. Social networks (networks of interacting people) and cellular networks (networks of interacting cells) are two instances of such complex systems. With these systems, complex global system behavior (e.g., a social revolution in a society, or food digestion in the stomach) is an emergent phenomenon, emerging from simple local interactions. Various fields of study, such as sociology, physics, biology, and chemistry, were founded to study different types of initially simple systems and have gradually matured to analyze and describe instances of incrementally more complex systems. An interdisciplinary field of study, the complex system theory [1], was recently founded based on the observation that analytical and experimental concepts, tools, techniques, and models developed to study an instance of a complex system in one field can be adopted, often almost unchanged, to study other complex systems in other fields. More importantly, the complex system theory can be considered a unifying meta-theory that explains common characteristics of complex systems. One can extend the application of the complex system theory to QDNs by:

1. Adopting models and techniques from a number of impressively similar complex systems to design and analyze QDNs, as an instance of engineered complex systems; and
2. Exporting the findings from the study of QDNs (which are engineered and, hence, more controllable) to other complex system studies.

[1] Go to the New England Complex Systems Institute (http://necsi.org/) for more information about the complex system theory.

This article is organized in two parts. In the first part, we provide an overview, where we 1) define and characterize QDNs as a new family of data networks with common characteristics and applications, and 2) review possible database-like architectures for QDNs as query processing systems and enumerate the most important QDN design principles. In the second part of the article, as the first step toward realizing the vision of QDNs as complex distributed query-processing systems, we focus on a specific problem, namely the problem of effective data location (or search) for efficient query processing in QDNs. We briefly explain two parallel approaches, both based on techniques/models borrowed from the complex system theory, to address this problem.

BACKGROUND

Here, we enumerate the main componental characteristics and application features of a QDN.

Componental Characteristics

A network is an interconnection of nodes via links, usually modeled as a graph. Nodes of a QDN are often massive in number and bear the following characteristics:

• Peer functionality: All nodes are capable of performing a restricted but similar set of tasks in interaction with their peers and the environment, although they might be heterogeneous in terms of their physical resources. For example, joining the network and forwarding search queries are among the essential peer tasks of every node in a peer-to-peer network.
• Autonomy: Aside from the peer tasks mentioned above, QDN nodes are autonomous in their behavior. Nodes are either self-governing or governed by uncertainties beyond anyone's control. Therefore, to be effective and applicable, QDN engineering should avoid imposing requirements on, and making assumptions about, the QDN nodes [2]. For example, strict regulation of connectivity (e.g., enforcing the number and/or the targets of connections) might be an undesirable feature for a QDN design.
• Intermittent presence: Nodes may frequently join and leave the network, whether by autonomous decision, due to failures, or for other reasons.

[2] One can consider peer tasks as rules of federation, which govern the QDN but do not violate the autonomy of individual nodes.

On the other hand, links in various QDNs stand for different forms of interaction and communication. Links may be physical or logical, and they are fairly inexpensive to rewire. Therefore, a QDN is a large-scale federation of a dynamic set of autonomous peer nodes building a transient-form interconnection. Conventional approaches developed to model and analyze traditional distributed database systems (and classical networks, as their underlying communication infrastructure) are either too weak (oversimplifying) or too complicated to be effective with large-scale and topology-transient QDNs. The complex system theory (Bar-Yam, 1997), on the other hand, provides a set of conceptual, experimental, and analytical tools to contemplate, measure, and analyze systems such as QDNs.

Application Features

A QDN is applied as a distributed source of data (a data network) with nodes that are specialized for cooperative query processing and data retrieval. The node cooperation can be as trivial as forwarding the queries, or as complicated as in-network data analysis. To enable such an application, a QDN should support the following features:

• Data-centric naming, addressing, routing, and storage: With a QDN, queries are declarative; i.e., a query refers to the names of data items and is independent of the location of the data. The data may be replicated and located anywhere in the data network, the data holders are unknown to the querier and are only intermittently present, and the querier is interested in the data itself rather than its location. Therefore, QDN nodes should naturally be named and addressed by their data content rather than by an identifier in a virtual name space such as the IP address space. Consequently, with data-centric naming and addressing of the QDN nodes (Heidemann, 2001), routing (Ratnasamy, 2001) and storage (Ratnasamy, 2003) in a QDN are also based on the content. It is interesting to note that non-procedural query languages such as SQL also support declarative queries and are appropriate for querying data-centric QDNs.
• Self-organization for efficient query processing: QDNs should be organized for efficient query processing. A QDN can be considered a database system with the data network as the database itself (see the next section). QDN nodes cooperate in processing the queries by retrieving, communicating, and preferably processing on the fly the data distributed across the data network. To achieve efficient query processing with high resource utilization and good performance (e.g., response time and query throughput), a QDN should be organized appropriately. Examples of organization are intelligent partitioning of a query into a set of sub-queries to enable parallel processing, and collaborative maintenance of the data catalogue across the QDN nodes. However, the peer tasks of the QDN nodes should be defined such that the nodes self-organize into the appropriate organization. In other words, organization must be a collective behavior that emerges from local interactions among nodes; otherwise, the dynamic nature and large scale of a QDN render any centralized micro-management unscalable and impractical.
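The data-centric addressing described above can be illustrated with a minimal Python sketch in which a query names a data item and the network, not the querier, resolves which node is responsible for it. Everything specific here is an assumption made for illustration: the hash-based mapping, the identifier space of size 2**16, and the helper names `name_to_key` and `responsible_node` are hypothetical and do not describe the mechanism of any particular QDN.

```python
import bisect
import hashlib

ID_SPACE = 2 ** 16  # hypothetical size of the flat identifier space


def name_to_key(name: str) -> int:
    """Hash a data name (not a node address) to a key in the identifier space."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % ID_SPACE


def responsible_node(key: int, node_ids: list[int]) -> int:
    """Return the first node id at or after key on the ring (node_ids sorted)."""
    i = bisect.bisect_left(node_ids, key)
    return node_ids[i % len(node_ids)]  # wrap around past the largest id


# Eight hypothetical nodes, placed on the ring by hashing their labels.
nodes = sorted(name_to_key(f"node-{i}") for i in range(8))

# The querier only knows the name of the data it wants.
key = name_to_key("speed/Highway No. 88")
owner = responsible_node(key, nodes)
```

The querier never supplies a node address; any node that knows the sorted identifiers can resolve the same name to the same owner, which is the property that content-based routing and storage rely on.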

VISION: A DATABASE QUERYING FRAMEWORK FOR QDNs

In the previous section, we defined a Querical Data Network (QDN) as a distributed data source and a query processing system. A database system (DBS), in turn, is designed specifically as a general framework for convenient and efficient querying of static or dynamic collections of interrelated data. Thus, querying of QDNs can be designed, developed, and executed by adopting the DBS as the general-purpose querying framework, leveraging its rich abstractions, theories, and processing methods (Govindan, 2002; Harren, 2002). In particular, adopting the DBS framework potentially results in 1) convenient and rapid application development for users at the conceptual level, by providing well-known and transparent abstractions independent of the implementation of the querying, and 2) efficient query processing at the physical level, by providing a general-purpose querying component that adopts and customizes query processing methods from the database literature as well as from related fields such as distributed computing. Here, we depict and compare potential architectures for the DBS framework, define a taxonomy of approaches to generalize this querying framework for the entire family of QDNs, and enumerate some important design principles for query processing in QDNs.

Architecture

The DBS querying framework for QDNs suggests a two-level architecture comprising a conceptual level and a physical level. At the conceptual level, queries are defined based on the conceptual schema of the data network, independent of the physical implementation of the query processing. This physical data independence allows rapid development of QDN applications, similar to database application development. For instance, a peer-to-peer application that monitors violations of the speed limit on a highway can pose the following query to a (hypothetical) mobile peer-to-peer network of vehicles:

SELECT vehicle_id FROM Cars WHERE (speed > 70) AND (location = 'Highway No. 88')

Similarly, a heat alert application may pose the following query to a heat detection sensor field: “Report the outlier heat data in all offices during the night.”

Queries are executed at the physical level. There are two extreme design choices for query processing at the physical level: centralized and decentralized; a hybrid design may also be meaningful for particular applications. With a centralized design, a potential querier (i.e., one of the QDN nodes or an outsider) receives the query from the QDN application at the conceptual level. The query is disseminated to all QDN nodes, where it is interpreted based on the local conceptual schema, and all raw data required to process the query are transmitted back to the querier via the network (see Figure 1-a). The querier treats the collected data as a centralized database, and processes and analyzes the data to respond to the query. With this scheme, data sourcing and query processing are completely decoupled; the data network maintains and communicates the data, and the querier processes the query individually. The cooperation among QDN nodes is limited to forwarding the query and the raw data, and the network is used only as a point-to-point communication infrastructure to communicate the data between the sources and the querier.

[Figure 1 panels: (a) Centralized Design, with labels Conceptual Level, Physical Level, Information, Database, Querical Data Network (QDN); (b) Decentralized Design, with labels Conceptual Level, Physical Level, Information, Database Querical Data Network (DB-QDN).]

Fig. 1. Database System Framework for Querying Querical Data Networks (QDNs)

A centralized design is simple to implement. The centralized query processing approach is also resource-efficient for trivial queries such as typical peer-to-peer search queries. However, with complex queries, where, for example, two 10,000-record tables must be joined to retrieve a few records, the overhead of transmitting the entire content of the tables to the querier for central processing is overwhelming and renders the centralized approach impractical. With most QDNs (e.g., sensor networks), this overhead is particularly intolerable because communication costs several orders of magnitude more than computation in typical QDN nodes. In contrast, in-network and on-the-fly processing of the query potentially eliminates the redundant communication of the data; hence, it is more efficient and scalable.

The decentralized design adopts the latter approach. With the decentralized design, the QDN is itself both the source of the data and the query-processing unit (see Figure 1-b). The data sourcing/communication and data analysis tasks are integrated, and QDN nodes cooperate to perform both tasks within the network. With this approach, queries are processed based on the following general scheme: the query is disseminated to a selected set of QDN nodes; the QDN nodes exploit their computing power to process the query locally, cooperatively, in parallel, and in a distributed fashion, to extract the required information from the raw data; eventually, the extracted pieces of information merge to form the final query result while traveling toward the querier through the network. The efficiency of the decentralized design is due to in-network processing. With this approach, communication of the raw data is restricted to short-range (hence, less costly) communications among local cooperative-analysis groups of QDN nodes, which process the voluminous raw data and extract the concise information required to respond to the query.

Although the decentralized design potentially promises an efficient and scalable querying framework for QDNs, realizing this design is a challenging endeavor and requires designing efficient distributed and cooperative query processing mechanisms that comply with the specific characteristics of QDNs.
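To make the communication argument above concrete, the following Python sketch counts transmissions for a simple counting query (how many readings exceed a limit) under both designs. It is a toy model under stated assumptions: the routing tree `parent`, the per-node `readings`, and the cost measure (one transmission per value per hop) are all hypothetical and chosen only for illustration.

```python
# Hypothetical routing tree (child -> parent); node 0 is the querier.
parent = {1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 5}
# Hypothetical raw speed readings stored at each node.
readings = {
    1: [62, 75], 2: [55], 3: [81, 66, 90], 4: [70], 5: [72, 68], 6: [95],
}
LIMIT = 70


def depth(node: int) -> int:
    """Number of hops from node to the querier (node 0)."""
    hops = 0
    while node != 0:
        node = parent[node]
        hops += 1
    return hops


def centralized() -> tuple[int, int]:
    """Ship every raw reading hop by hop to the querier, then count there."""
    transmissions = sum(len(vals) * depth(n) for n, vals in readings.items())
    count = sum(v > LIMIT for vals in readings.values() for v in vals)
    return count, transmissions


def in_network() -> tuple[int, int]:
    """Each node counts locally and forwards one partial count to its parent."""
    partial = {n: sum(v > LIMIT for v in vals) for n, vals in readings.items()}
    partial[0] = 0
    # Visit the deepest nodes first so children are merged before parents send.
    for n in sorted(parent, key=depth, reverse=True):
        partial[parent[n]] += partial[n]
    return partial[0], len(parent)  # one transmission per tree edge


print(centralized())  # (5, 18)
print(in_network())   # (5, 6)
```

Both designs return the same answer, but in this toy tree the in-network variant sends one partial count per link instead of every raw reading over every hop. The gap widens as nodes hold more data or sit deeper in the tree.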

Taxonomy

Based on the two fundamentally distinct design choices for the physical level of the DBS framework (i.e., centralized and decentralized), one can recognize two approaches to implement a DBS-based querying system for QDNs:

1. Database for QDN: This approach corresponds to the querying systems with centralized query processing. These systems are similar to other centralized database applications, where data are collected from some data sources (depending on the host application, the data sources can be text documents, media files, and in this case, QDN data) to be separately and centrally processed.
2. Database-QDN (DB-QDN): The systems that are designed based on this approach strive to implement query processing in a decentralized fashion within the QDN; hence, in these systems “QDN is the database”.

Design Principles

By definition, QDNs tend to be large-scale systems, and their potential benefits increase as they grow in size. Therefore, between the two types of DBS-based querying systems for QDNs, the database-QDNs (DB-QDNs) are more promising because they are scalable and efficient. Among the most important design principles for distributed query processing in DB-QDNs, one can distinguish the following:

1. In-network query processing: In-network query processing is the main distinction of DB-QDNs. In-network query processing techniques should be implemented in a distributed fashion, ensuring minimal communication overhead and optimal load balance.
2. Transaction processing with relaxed properties: Due to the dynamic nature of QDNs, requiring ACID-like properties for transaction processing in DB-QDNs is too costly to be practical and severely limits the scalability of such processing techniques. Hence, transaction-processing properties should be relaxed for DB-QDNs.
3. Adaptive query optimization: Since QDNs are inherently dynamic structures, optimizing query plans for distributed query execution in DB-QDNs should also be a dynamic/adaptive process. Adaptive query optimization techniques have previously been studied in the context of central query processing systems (Avnur, 2000).
4. Progressive query processing: Distributed query processing tends to be time-consuming. With real-time queries, the user may prefer receiving a rough estimate of the query result quickly rather than waiting a long time for the final result. The rough estimate is progressively refined into the accurate, final result. This approach, termed progressive query processing (Schmidt, 2002-a), allows users to rapidly obtain a general understanding of the result, to observe the progress of the query, and to control its execution (e.g., by modifying the selection condition of the query) on the fly.
5. Approximate query processing: Approximation techniques such as wavelet-based query processing can effectively decrease the cost of the query while producing highly accurate results (Schmidt, 2002-b). The inherent uncertainty of the QDN data, together with the relaxation of the query semantics, justifies the application of approximation techniques to achieve efficiency.
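The idea behind progressive query processing (principle 4) can be sketched with a much simpler stand-in than the wavelet-based methods cited above: an average that is re-estimated each time another node reports. The function name `progressive_average` and the (sum, count) partial results are hypothetical and serve only to illustrate how an early rough answer converges to the final one.

```python
from typing import Iterable, Iterator


def progressive_average(
    partials: Iterable[tuple[float, int]],
) -> Iterator[float]:
    """Yield a refined estimate of the global average after each partial result.

    Each partial result is a (sum, count) pair reported by one node.
    """
    total, n = 0.0, 0
    for part_sum, part_count in partials:
        total += part_sum
        n += part_count
        if n:  # skip the estimate while no values have been seen
            yield total / n


# Hypothetical (sum, count) pairs arriving from three nodes in turn.
estimates = list(progressive_average([(130.0, 2), (80.0, 1), (150.0, 2)]))
print(estimates)  # [65.0, 70.0, 72.0]
```

Because the function is a generator, a consumer can display each estimate as it arrives and stop iterating early once the answer is accurate enough, which mirrors the user control described in the principle.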

FUTURE TRENDS

One of the most fundamental functionalities required to realize a DB-QDN is the search primitive. Efficient location of the data within a QDN, a large-scale and dynamic system with a distributed and dynamic dataset, is a challenging task vital to QDN query processing. For the remainder of this section, we briefly explain two parallel approaches one can adopt from the complex system theory to address the QDN search problem.

First, we discuss a self-organizing mechanism that structures the topology of the QDN into a search-efficient topology. This topology can be considered a distributed index structure that organizes the nodes, and therefore the data content of the nodes, for efficient search. For the design of the search-efficient QDN topology as well as the search dynamics, we are inspired by the “small-world” models. Small-worlds are models proposed to explain efficient communication in a social network, which is a semi-structured complex system.

Second, we propose an efficient query flooding mechanism for QDNs. Flooding is required not only for broadcast queries in all QDNs, but also for unicast and multicast queries in unstructurable/unindexable QDNs. With these QDNs, the extreme dynamism of the QDN topology and the extreme autonomy of the QDN nodes render any attempt to impose even a semi-structure on the network through an index-like structure inefficient and/or impossible. We use percolation theory, an analytical tool borrowed from the complex system theory, to formalize and analyze such an efficient flooding mechanism.

Probabilistic Indexing of QDNs for Efficient Approximate Query Processing

Considering a QDN as a database (with every node of the QDN as a potential entry point of the query), the QDN should be “indexed” for efficient processing of the queries, as traditional databases are. To process approximate queries [3], we propose self-organizing the interconnection of the QDN based on the data content of the QDN nodes. With this organization, the network distance between any two nodes is, with high probability, positively correlated with the dissimilarity of their data content; i.e., nodes with similar content tend to be close together in the network. This approach results in an indexed network with distinguishable data localities, allowing efficient routing of the queries toward the nodes holding the result set of the query. The similarity measurements performed by each node while joining the QDN to select an appropriate set of neighbors can be thought of as the off-line pre-computations required to create the index for efficient on-line query processing. Also, the topology of the generated interconnection should be compared with the tree-like topologies of the traditional hierarchical index structures in centralized databases. In addition to allowing efficient navigation/traversal of the dataset, this topology should support the dynamism of the dataset and the network and, more importantly, should avoid assuming a central entry point (the root node in hierarchical indices) for the query, in order to balance the query load among all the nodes of the QDN.

[3] Considering the transience of the QDN structure and the dynamism of the dataset, exact query processing with zero false dismissal is not a practical option.

It turns out that a probabilistic “small-world” model, a topology proposed to explain efficient communication in social networks, is a well-suited candidate topology for indexing QDNs. With our searchable QDN model (Banaei-Kashani, 2003), we propose a self-organization mechanism that generates a QDN with a small-world topology based on a recently developed small-world model (Watts, 2002). We complement the generated small-world network topology (i.e., the index) with a query forwarding mechanism (i.e., the index lookup technique) that effectively routes partial-match queries toward the QDN nodes that store the matching data items. Currently, we are focusing on extending this query routing technique to support more challenging queries such as range queries and nearest-neighbor queries.
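The combination of a content-organized topology and a forwarding rule can be sketched in Python as follows. This is a deliberately simplified toy, not the searchable QDN model cited above: content is reduced to a single integer per node, similarity is distance on a ring, each node gets one random long-range link, and the names `ring_distance`, `build_topology`, and `route` are hypothetical.

```python
import random

N = 64  # hypothetical number of nodes; node i holds content value i


def ring_distance(a: int, b: int) -> int:
    """Dissimilarity of two content values on a ring of size N."""
    d = abs(a - b) % N
    return min(d, N - d)


def build_topology(seed: int = 1) -> dict[int, set[int]]:
    """Link each node to its two most similar nodes plus one random shortcut."""
    rng = random.Random(seed)
    links = {i: {(i - 1) % N, (i + 1) % N} for i in range(N)}
    for i in range(N):
        links[i].add(rng.randrange(N))  # long-range link (may be redundant)
    return links


def route(links: dict[int, set[int]], source: int, target: int) -> list[int]:
    """Greedily forward the query to the neighbor most similar to the target."""
    path = [source]
    while path[-1] != target:
        nxt = min(links[path[-1]], key=lambda v: ring_distance(v, target))
        path.append(nxt)
    return path


path = route(build_topology(), source=0, target=31)
```

Each node uses only its own neighbor list, so no central entry point is needed. The two ring links guarantee that every hop strictly reduces the distance to the target, while the random shortcuts are what can shorten the path well below the ring distance.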

Criticality-based Probabilistic Flooding

Flooding is a common mechanism used in many networks, including QDNs, to broadcast a piece of information (e.g., an alert or a search query) from a source node to the other nodes of the network. With normal flooding, each node always forwards the received information to all its neighbors (i.e., directly connected nodes). In spite of its many beneficial features, such as providing broad coverage and guaranteeing minimum delay, normal flooding is not a scalable communication mechanism, mainly because of the communication overhead it imposes on the system. To alleviate this problem, we introduce probabilistic flooding (Banaei-Kashani, 2003). With probabilistic flooding, unlike normal flooding, a node forwards the information to each of its neighbors with probability p. By changing the probability value p, we can control the effective connectivity of the network while information is forwarded. The idea is to tune p to a critical operating point (the phase transition point) such that, statistically, the network remains connected (to preserve full reachability) while redundant paths are eliminated. Percolation theory (Stauffer, 1992) is an analytical tool from the complex system theory that is extensively used to study probabilistic diffusion-like physical phenomena, e.g., the diffusion of oil inside porous rocks in an oil reservoir, a physical complex system. We use percolation theory to formalize the probabilistic flooding approach as a query-diffusion problem, and to find its critical (optimal) operating point rigorously. Our formal analysis shows that the critical value of p can be as low as 1%, which translates to a 99% reduction in the communication overhead of flooding and, hence, to scalable flooding.
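The trade-off between reachability and overhead described above can be explored with a small simulation. The Python sketch below is an illustrative experiment only, not the percolation analysis of the cited work: the topology (a ring plus random extra links), the function names `make_graph` and `flood`, and all parameter values are assumptions, and for simplicity a node may also send the message back to the neighbor it received it from.

```python
import random
from collections import deque


def make_graph(n: int, extra_links: int, seed: int) -> dict[int, set[int]]:
    """Hypothetical connected topology: a ring plus random extra links."""
    rng = random.Random(seed)
    adj = {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}
    for _ in range(extra_links):
        a, b = rng.sample(range(n), 2)
        adj[a].add(b)
        adj[b].add(a)
    return adj


def flood(adj: dict[int, set[int]], source: int, p: float,
          seed: int = 0) -> tuple[int, int]:
    """Flood from source; each node forwards to each neighbor with probability p.

    Returns (number of nodes reached, number of messages sent).
    """
    rng = random.Random(seed)
    reached = {source}
    queue = deque([source])
    messages = 0
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if rng.random() < p:
                messages += 1
                if neighbor not in reached:  # only first receipt triggers forwarding
                    reached.add(neighbor)
                    queue.append(neighbor)
    return len(reached), messages


graph = make_graph(n=200, extra_links=1000, seed=42)
full_reach, full_cost = flood(graph, source=0, p=1.0)        # normal flooding
partial_reach, partial_cost = flood(graph, source=0, p=0.5)  # probabilistic
```

Sweeping p from 1.0 downward and plotting the reached fraction against the message count is one way to observe where reachability starts to collapse on a given topology; the value at which that happens depends on the topology and is not fixed by this sketch.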

CONCLUSION

In this article, we identified Querical Data Networks (QDNs) as a family of data networks, recently emerging as a new generation of distributed database systems with significantly less constraining assumptions. We envision a QDN as a distributed query processing system with a database-like architecture. In search of an effective approach to design and analyze Database-QDNs, we find the complex system theory, a theory that explains a family of systems whose characteristics bear significant similarity to those of QDNs, extremely helpful. As an example application of this approach, we provide two parallel solutions for the QDN search problem inspired by models adopted from the complex system theory.

ACKNOWLEDGMENTS

This research has been funded in part by NSF grants EEC-9529152 (IMSC ERC), IIS-0082826 (ITR), IIS-0238560 (CAREER), IIS-0324955 (ITR) and IIS-0307908, and unrestricted cash gifts from Okawa Foundation and Microsoft. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

REFERENCES

Akyildiz, I.F., Su, W., Sankarasubramaniam, Y., and Cayirci, E. (2002). A Survey on Sensor Networks. IEEE Communications Magazine, 40(8), 102-114.

Avnur, R., and Hellerstein, J. (2000). Eddies: Continuously Adaptive Query Processing. Proceedings of ACM International Conference on Management of Data, 261-272.

Banaei-Kashani, F., and Shahabi, C. (2003). Searchable Querical Data Networks. Lecture Notes in Computer Science, Springer Verlag, 2944, 17-32.

Banaei-Kashani, F., and Shahabi, C. (2003). Criticality-based Analysis and Design of Unstructured Peer-to-Peer Networks as Complex Systems. Proceedings of the Third International Workshop on Global and Peer-to-Peer Computing (GP2PC) in conjunction with CCGrid, 351-359.

Bar-Yam, Y. (1997). Dynamics of Complex Systems. Westview Press.

Daswani, N., Garcia-Molina, H., and Yang, B. (2003). Open Problems in Data-Sharing Peer-to-Peer Systems. Proceedings of the 9th International Conference on Database Theory, 1-15.

Estrin, D., Govindan, R., Heidemann, J., and Kumar, S. (1999). Next Century Challenges: Scalable Coordination in Sensor Networks. Proceedings of International Conference on Mobile Computing and Networks, 256-262.

Govindan, R., Hellerstein, J., Hong, W., Madden, S., Franklin, M., and Shenker, S. (2002). The Sensor Network as a Database. Technical Report 02-771, University of Southern California.

Harren, M., Hellerstein, J., Huebsch, R., Loo, B.T., Shenker, S., and Stoica, I. (2002). Complex Queries in DHT-based Peer-to-peer Networks. Proceedings of the 1st International Workshop on Peer-to-Peer Systems.

Heidemann, J., Silva, F., Intanagonwiwat, C., Govindan, R., Estrin, D., and Ganesan, D. (2001). Building Efficient Wireless Sensor Networks With Low-Level Naming. Proceedings of the Symposium on Operating Systems Principles, 146-159.

Ratnasamy, S., Francis, P., Handley, M., Karp, R., and Shenker, S. (2001). A Scalable Content Addressable Network. Proceedings of the ACM SIGCOMM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, 161-172.

Ratnasamy, S., Karp, B., Shenker, S., Estrin, D., Govindan, R., Yin, L., and Yu, F. (2003). Data-Centric Storage in Sensornets with GHT, a Geographic Hash Table. Mobile Networks and Applications, 8(4), 427-442.

Schmidt, R., and Shahabi, C. (2002-a). How to Evaluate Multiple Range-Sum Queries Progressively. 21st ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, 133-141.

Schmidt, R., and Shahabi, C. (2002-b). Propolyne: A Fast Wavelet-Based Algorithm for Progressive Evaluation of Polynomial Range-Sum Queries. Eighth Conference on Extending Database Technology, 664-681.

Stauffer, D., and Aharony, A. (1992). Introduction to Percolation Theory. Taylor and Francis, second edition.

Watts, D.J., Dodds, P.S., and Newman, M.E.J. (2002). Identity and search in social networks. Science, 296, 1302-1305.

Key Terms

Peer-to-peer (P2P) Networks: A peer-to-peer network is a distributed, self-organized federation of peer entities, where the system entities collaborate by sharing resources and performing cooperative tasks for mutual benefit. It is often assumed that such a federation lives, changes, and expands independent of any distinct service facility with global authority.

Sensor Networks: A sensor network is a network of low-power, small form-factor sensing devices that are embedded in a physical environment and coordinate amongst themselves to achieve a larger sensing task.

Distributed Hash Tables (DHTs): A distributed index structure with hash table-like functionality for information location in Internet-scale distributed computing environments. Given a key from a pre-specified flat identifier space, a DHT computes (in a distributed fashion) and returns the location of the node that stores the key.

Complex Systems: Complex Systems is a new field of science studying how the parts of a complex system give rise to the collective behaviors of the system. Complexity (information-theoretical and computational) and the emergence of collective behavior are the two main characteristics of such complex systems. Social systems formed (in part) out of people, the brain formed out of neurons, molecules formed out of atoms, and the weather formed out of air flows are all examples of complex systems. The field of Complex Systems cuts across all traditional disciplines of science, as well as engineering, management, and medicine.

Small World Models: It is believed that almost any pair of people in the world can be connected to one another by a short chain of intermediate acquaintances, of typical length about six. This phenomenon is colloquially referred to as the “six degrees of separation,” or equivalently, the “small-world” effect. Sociologists propose a number of topological network models, the small-world models, for the social network to explain this phenomenon.

Percolation Theory: Assume a grid of nodes where each node is occupied with probability p and empty with probability (1-p). Percolation theory is a quantitative (statistical-theoretical) and conceptual model for understanding and analyzing the statistical properties (e.g., size, diameter, and shape) of the clusters of occupied nodes as the value of p changes. Many concepts associated with complex systems, such as clustering, fractals, diffusion, and particularly phase transitions, are modeled as percolation problems. The significance of the percolation model is that many different problems can be mapped to the percolation problem, e.g., forest-fire spread, oil field density estimation, and diffusion in disordered media.
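The grid model in the definition above can be simulated directly. The following Python sketch is a minimal illustration under stated assumptions (a square grid, 4-neighbor connectivity, and a hypothetical function name `largest_cluster`); it occupies each cell with probability p and measures the largest cluster of occupied cells.

```python
import random


def largest_cluster(size: int, p: float, seed: int = 0) -> int:
    """Occupy each cell of a size x size grid with probability p and return
    the size of the largest cluster of 4-connected occupied cells."""
    rng = random.Random(seed)
    occupied = [[rng.random() < p for _ in range(size)] for _ in range(size)]
    seen = [[False] * size for _ in range(size)]
    best = 0
    for r in range(size):
        for c in range(size):
            if not occupied[r][c] or seen[r][c]:
                continue
            # Iterative depth-first search over one cluster.
            stack, count = [(r, c)], 0
            seen[r][c] = True
            while stack:
                y, x = stack.pop()
                count += 1
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < size and 0 <= nx < size
                            and occupied[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            best = max(best, count)
    return best


# Fraction of the grid covered by the largest cluster for several values of p.
fractions = {p: largest_cluster(50, p) / 2500 for p in (0.3, 0.5, 0.6, 0.7, 1.0)}
```

Plotting such fractions against p on larger grids is a common way to visualize the phase transition: for site percolation on the square lattice, the largest cluster starts to span the grid near p ≈ 0.59.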

Content-Centric Networks: A content-centric network is a network where various functionalities such as naming, addressing, routing, storage, etc., are designed based on the content. This is in contrast with classical networks that are node-centric.