An Approach for Unified and Efficient Querying of Heterogeneous Web Data Sources
Total Page:16
File Type:pdf, Size:1020Kb
UniQue: An Approach for Unified and Efficient Querying of Heterogeneous Web Data Sources Markku Laine, Jari Kleimola and Petri Vuorimaa Department of Computer Science, Aalto University, Espoo, Finland Keywords: Mashup Applications, Web Querying, Web Standards, XML Technologies. Abstract: Governments, organizations, and people are publishing open data on the Web more than ever before. To consume the data, however, requires substantial effort from web mashup developers, as they have to familiarize themselves with a diversity of data formats and query techniques specific to each data source. While several solutions have been proposed to improve web querying, none of them covers aforementioned aspects in a developer friendly and efficient manner. Therefore, we devised a unified querying (UniQue) approach and a proxy-based implementation that provides a uniform and declarative interface for querying heterogeneous data sources across the Web. Besides hiding the differences between the underlying data formats and query techniques, UniQue heavily embraces open W3C standards to minimize the learning effort required by developers. Pursuing this further, we propose Unified Query Language (UQL) that combines the expressiveness of CSS Selectors and XPath into a single and flexible selector language. We show that the adoption of UniQue and UQL can effectively streamline web querying, leverage developers’ existing knowledge, and reduce generated network traffic compared to the current state-of-the-art approach. 1 INTRODUCTION as a solution to uniform data access, they are not broadly adopted due to their inherent complexity and Traditional database experts rely on well-defined unfamiliarity among developers. Indeed, the data formats and query language APIs. The query majority of web data providers are still relying on API, such as SQL, provides precisely focused access more mainstream data formats and developers on to the data. Although the persistent data and its less complex selector languages. structure vary between development projects, the Therefore, we propose a unified web querying data format and the query API remain the same. approach that a) builds on these prevailing practices Database experts are thus able to leverage their and b) takes full advantage of developers’ existing previous experience, tools, and reusable code experience with standard web technologies and snippets from past development efforts. associated tools. This approach, which we call Web mashup developers (hereinafter referred to UniQue, provides a uniform and declarative query as “developers”) are less fortunate. First, web data is interface that allows developers to perform precise served in a variety of data formats, and it is not selection queries against heterogeneous data on the uncommon to mix and match XML, JSON, or even Web. We focus on efficient querying of text-based HTML and CSV in a single mashup application. data originating from web data sources and services Second, to access the data, developers need to (e.g., static files and Web APIs, respectively). conform to several proprietary APIs. This requires Querying of big data sets and databases is beyond (re-)learning, since even if data providers offer the scope of this work. RESTful APIs, the query string parameters are likely The main contributions of this paper are: to differ. The APIs are often also inefficient and . We propose an approach (UniQue) for unified return substantial amount of redundant data, which and efficient querying of heterogeneous web data needs to be filtered on the client side with yet sources. another API. We propose improvements to the guidelines for While semantic web technologies (e.g., RDF and mapping JSON and CSV data into a single form SPARQL) have been proposed (Harth et al., 2011) (XML); a special focus is put on 84 Laine, M., Kleimola, J. and Vuorimaa, P. UniQue: An Approach for Unified and Efficient Querying of Heterogeneous Web Data Sources. In Proceedings of the 12th International Conference on Web Information Systems and Technologies (WEBIST 2016) - Volume 1, pages 84-94 ISBN: 978-989-758-186-1 Copyright c 2016 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved UniQue: An Approach for Unified and Efficient Querying of Heterogeneous Web Data Sources friendliness and round-trippability. We propose a unified query language (UQL), which extends CSS Selectors with XPath expressions for improved expressiveness. We report on the results of an experimental study, in which we compared the UniQue approach with the current state-of-the-art approach. The rest of this paper is organized as follows. Section 2 presents a motivating example of querying heterogeneous web data sources and the challenges related to it. Section 3 defines the requirements for a unified and efficient web querying approach. Section 4 introduces the design of the proposed approach, while Section 5 describes our proxy-based implementation. Section 6 evaluates the approach Figure 1: Web mashup application accessing multiple web and presents the results with discussion. Section 7 data sources and services. compares our approach to related work and Section 8 draws conclusions. Table 1: Information about the data sources used by the web mashup application. 2 MOTIVATING EXAMPLE AND CHALLENGES As a motivating example, consider Alice, who is a developer and an avid climber. Alice has decided to build a web mashup application that provides a 2.1.1 Challenge 1: Heterogeneous Data central place for climbers alike to stay updated on all Formats the latest from elite rock climbers. Figure 1 shows a basic design for Alice’s Web data exists in many different formats. As a mashup application. According to the design, the result, Alice needs to learn and understand the application needs to connect to the total of eight web differences and similarities between multiple data data sources in order to retrieve all the required data. models, and deal with the impedance mismatch Moreover, the data sources are completely different problem. from each other, as shown in Table 1. First, Alice needs to scrape data from hardclimbs.info to 2.1.2 Challenge 2: Heterogeneous Query generate a list of world’s top 10 climbers, and later Techniques to retrieve their hardest ascents. She also needs to read climbers’ social media account IDs and The heterogeneity of web data formats leads to a competition results from static files. This requires situation where Alice needs to master a diverse writing custom data wrappers or data range of query techniques, including web scraping, transformations. Finally, to retrieve climbers’ basic writing custom data wrappers, and consuming Web information, tweets, and photos from social media APIs. Each query technique in turn involves specific services, Alice needs get familiar with the technologies and specifications that Alice needs to documentation and query parameters of each Web get familiar with. API endpoint. 2.1.3 Challenge 3: Inefficient Use of 2.1 Challenges Network Resources We identified three major challenges that developers Many web data sources (e.g., static files, web pages, (i.e., Alice) face in building web mashup and web feeds) lack of an API. Therefore, Alice applications. needs to pull the full response data to the client, even if all that she needs is a fraction of that data. Web 85 WEBIST 2016 - 12th International Conference on Web Information Systems and Technologies APIs partially suffer from the same problem, as their R6: Flexible Query Language. The query filtering capabilities are usually rather limited. language must be simple yet expressive In summary, these challenges not only decrease enough for retrieving data of interest (less Alice's productivity as a developer, but also increase data to be transferred). Moreover, it must be the application complexity for data access and possible to extend its expressiveness while querying. keeping the learning curve to a minimum (MacLean et al., 1990). 3 REQUIREMENTS 3.4 System Requirements Based on our motivating example and the literature R7: Client Independence. The system must review (cf. Section 7), we derived a list of support various clients (e.g., web browsers requirements that fall within the scope of this paper. and HTTP client software/libraries), without requiring any additional software installation. 3.1 Technology Requirements R8: Server-Side Processing. The system must perform all processing on the server side in R1: Ease of Authoring. Technologies must be order to elicit reductions in response data. familiar to a broad range of developers. R9: Data Source Server Independence. The system must support existing web data source Generally, declarative languages are considered easier to reason about than servers, without requiring any modifications imperative ones (Van Roy and Haridi, 2004), to them. and thus should be preferred. R2: Web Integration. Technologies must be open and standardized to enable 4 THE UNIQUE APPROACH interoperability with existing tools, broad adoption, and future uses on the Web. In this section, we present our unified querying approach to meet the requirements presented in 3.2 Data Format Requirements Section 3. We set out to give an overview of the proposed approach and continue with a detailed R3: Uniform Data Representation. description of its essential parts. Heterogeneous web data must be converted into a single, human and machine-readable 4.1 The Approach in a Nutshell format in order to minimize the knowledge required by developers. We propose the UniQue approach to help developers R4: Friendly and Round-Trippable Mappings. in accessing and querying JSON / CSV / HTML / Mapping original data to a single format must XML data originating from heterogeneous web data produce data that is both easy to consume and sources, such as static files and Web APIs. The query, i.e., friendly. Additionally, it must be design of the proposed approach fulfills the possible to map the data back to its original requirements R1-R6, while the system format, i.e., round-trippable, for data updates. implementation focuses on the requirements R7-R9. Focus should be put on friendliness, while The basic idea is to expose a uniform query preserving round-trippability.