A Multi-Attribute Resource and Query
Total Page:16
File Type:pdf, Size:1020Kb
DISSERTATION ENHANCING COLLABORATIVE PEER-TO-PEER SYSTEMS USING RESOURCE AGGREGATION AND CACHING: A MULTI-ATTRIBUTE RESOURCE AND QUERY AWARE APPROACH Submitted by H. M. N. Dilum Bandara Department of Electrical and Computer Engineering In partial fulfillment of the requirements For the Degree of Doctor of Philosophy Colorado State University Fort Collins, Colorado Fall 2012 Doctoral Committee: Advisor: Anura P. Jayasumana V. Chandrasekar Daniel F. Massey Indrajit Ray Copyright by H. M. N. Dilum Bandara 2012 All Rights Reserved ABSTRACT ENHANCING COLLABORATIVE PEER-TO-PEER SYSTEMS USING RESOURCE AGGREGATION AND CACHING: A MULTI-ATTRIBUTE RESOURCE AND QUERY AWARE APPROACH Resource-rich computing devices, decreasing communication costs, and Web 2.0 technologies are fundamentally changing the way distributed applications communicate and collaborate. With these changes, we envision Peer-to-Peer (P2P) systems that will allow for the integration and collaboration of peers with diverse capabilities to a virtual community thereby empowering it to engage in greater tasks beyond what can be accomplished by individual peers, yet are beneficial to all the peers. Collaborations involving application-specific resources and dynamic quality of service goals will stress current P2P ar- chitectures that are designed for best-effort environments with pair-wise interactions among nodes with similar resources. These systems will share a variety of resources such as processor cycles, storage capac- ity, network bandwidth, sensors/actuators, services, middleware, scientific algorithms, and data. However, very little is known about the specific characteristics of real-world resources and queries as well as their impact on resource aggregation in these collaborative P2P systems. We developed resource discovery, caching, and distributed data fusion solutions that are more suitable for collaborative P2P systems while characterizing real-world resource, query, and user behavior. The contributions of this research are: (1) a detailed analysis of real-world resource, query, and user characteristics and their impact on resource dis- covery solutions, (2) a tool to generate large synthetic traces of multi-attribute resources and range que- ries, (3) resource and query aware P2P-based multi-attribute resource discovery solution that is both effi- cient and load balanced, (4) a community-based caching solution that enhances both the communitywide and system-wide lookup performance in large-scale P2P systems, and (5) demonstrated the applicability of NDN (Named Data Networking) for DCAS (Distributed Collaborative Adaptive Sensing) systems by developing a distributed multi-user, multi-application, and multi-sensor data fusion solution based on ii CASA (Collaborative Adaptive Sensing of the Atmosphere). Proposed solutions and the analysis are ap- plicable to a wide variety of contexts such as DCAS systems, P2P clouds, grid/cloud computing, GENI (Global Environment for Network Innovations), and mobile social networks. Next, each of the five con- tributions is described briefly. First, we derived an equation to capture the cost of multi-attribute resource advertising and query- ing. The nature of parameters in the equation under different systems is determined by analyzing datasets from PlanetLab, SETI@home, EGI grid, and a distributed campus computing facility. These datasets ex- hibit several noteworthy features that affect the performance. The attributes of both the resources and que- ries are highly skewed and correlated. Attribute values have different marginal distributions and change at different rates. Queries are less specific where each query tends to specify only a small subset of the available attributes and large ranges of attribute values. These properties of resources and queries are then used to qualitatively and quantitatively evaluate the fundamental design choices for P2P-based multi- attribute resource discovery. Design choices are evaluated based on the cost of advertising/querying, load balancing, and routing table size. Compared to uniform queries, real-world queries are relatively easier to resolve using unstructured, superpeer, and single-attribute-dominated-query-based structured P2P solu- tions. However, they introduce significant load balancing issues to existing designs. Cost of RD in struc- tured P2P systems is effectively O(N) (N is the number of nodes) as most range queries are less specific. Second, a set of mechanisms is presented to generate realistic synthetic traces of multi-attribute resources (with both static and dynamic attributes) and range queries using the statistical behavior learned from real-world datasets. Such traces are useful in large-scale performance studies of resource discovery solutions, job schedulers, etc., in collaborative P2P systems as well as grid, cloud, and volunteer compu- ting. Random vectors of static attributes are generated using empirical copulas that capture the entire de- pendence structure of multivariate distribution of attributes. Time series of dynamic attributes are ran- domly drawn from a library of multivariate time-series segments extracted from the datasets. These segments are identified by detecting the structural changes in time series corresponding to a selected at- tribute. Time series corresponding to rest of the attributes are split at the same breakpoints and randomly iii drawn together to preserve their contemporaneous correlation. Correlation among static and dynamic at- tributes is preserved by grouping the time-series segments based on their static attributes. Multi-attribute range queries are generated using a probabilistic finite state machine that preserves the popularity of at- tributes and correlations among attribute values. A tool is developed to automate the synthetic data gener- ation process. It is independent of the dataset hence data from any other platform may be used as the basis for trace statistics. Third, a resource and query aware P2P-based multi-attribute resource discovery solution is pre- sented that is both efficient and load balanced. The solution consists of five heuristics that can be execut- ed independently and distributedly. The first heuristic tries to maintain a minimum number of nodes in the overlay while pruning nodes that do not significantly contribute to the range-query resolution. Removing nonproductive nodes reduces the cost (e.g., hops and latency) of advertising resources and resolving que- ries. The second and third heuristics dynamically balance the key and query load distributions by transfer- ring keys to neighbors as well as by adding new neighbors when existing ones are insufficient. The last two heuristics, namely fragmentation and replication, form cliques of nodes that are placed orthogonal to the overlay ring. Such a node placement dynamically balances the highly skewed key and query loads while reducing the query cost. By applying these heuristics in the presented order, a resource discovery solution that better responds to real-world resource and query characteristics is developed. Efficacy of the solution is demonstrated using a simulation-based analysis under a variety of single and multi-attribute resource and query distributions derived from real workloads. Fourth, we developed a distributed caching solution that exploits P2P communities to improve the communitywide and system-wide lookup performance. The solution consists of a sub-overlay formation scheme and a Local-Knowledge-based Distributed Caching (LKDC) algorithm. Sub-overlays enable communities to forward queries through their members. While queries are forwarded, LKDC algorithm causes members to identify and cache resources of interests to them, resulting in faster resolution of que- ries for popular resources within each community. Distributed local caching requires global information (e.g., hop count and popularity of contents) that is difficult and costly to obtain. Moreover, the problem is iv NP-complete when the size of contents/resources varies. However, by relaxing the content size constraint (which is acceptable for the purpose of improving lookup performance), and by means of an analysis of globally optimal behavior and structural properties of the overlay, we developed the LKDC algorithm that not only relies on purely local information but also provides close-to-optimal caching performance. The caching solution automatically adapts to changing popularity and user interests. It works with any skewed distribution of queries in addition to introducing minimal modifications and overhead to the overlay net- work. For example, simulations based on Chord overlay show a 40% reduction in average path length using only 20 cache entries per node, and individual communities gained a 17-24% improvement com- pared to system-wide caching. Fifth, we present a proof of concept solution that demonstrates the applicability of NDN for mul- ti-user, multi-application, and multi-sensor DCAS systems such as CASA. In this example, a network of weather radars name data based on their geographic location and weather feature (e.g., reflectivity of clouds or wind velocity) independent of the radar(s) that generated them. This enables end users to speci- fy an area of interest for a particular weather feature while being oblivious to the placement of radars and associated computing facilities. Conversely, the DCAS system can use its knowledge about the underly- ing system to decide the best radar scanning and data processing strategies. Such sensor-independent names also