Recurring Retrieval Needs in Diverse and Dynamic Dataspaces: Issues and Reference Framework

Barbara Catania, Francesco De Fino, Giovanna Guerrini
University of Genoa, Italy
{firstname.lastname}@dibris.unige.it

ABSTRACT
Processing information requests over heterogeneous dataspaces is very challenging because it aims at guaranteeing user satisfaction with respect to conflicting requirements on result quality and response time. In [3], it has been argued that, in dynamic contexts preventing substantial user involvement in interpreting requests, information on similar requests recurring over time can be exploited during query processing.
In this paper, referring to a graph-based modeling of dataspaces and requests, we propose a preliminary approach in this direction centered on the enabling concept of Profiled Graph Query Pattern (PGQP) as an aggregation of information on past evaluations of a set of previously executed queries. The information maintained in a PGQP is not query results, as in materialized queries, but can include different kinds of data and metadata.

©2017, Copyright is with the authors. Published in the Workshop Proceedings of the EDBT/ICDT 2017 Joint Conference (March 21, 2017, Venice, Italy) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0

1. INTRODUCTION

Motivations. Recent years have been characterized by a tremendous growth of available information, coming from information sources that are highly heterogeneous in terms of structure, semantic richness, and quality. Sources can indeed contain unstructured data as well as data with well-defined structure, strongly correlated and semantically complex but relatively static data (e.g., Linked Open Data), and highly dynamic user-generated data. This information is, however, a resource currently exploited much below its potential because of the difficulties users face in accessing it. Users would like to find answers to complex requests expressing relationships among the entities of their interest, but they are able to specify requests only vaguely because they cannot reasonably know the format and structure of the data that encode the information relevant to them.

As an example, consider a smart city explorer, that is, a set of smart information services offered by a municipality to users that, e.g., retrieve information about attractions, points of interest, and shops. A user might, for instance, ask for the authors of the figurative artworks she is watching (i.e., they are located close to her position), together with information about other places where such authors are currently exhibiting. In this specific context, data may come from different datasets and are heterogeneous and very dynamic, as the user may quickly change her position or the environment itself (including data) might be different at different instants of time.

Processing complex requests on diverse and dynamic sources requires: (i) request interpretation; (ii) source selection; (iii) actual processing of the request on the relevant, and usually large, sources. The overall process is costly and, nevertheless, it may not guarantee user satisfaction on the obtained result since the request (i) could be incorrectly interpreted, (ii) could be processed on inaccurate, incomplete, unreliable data, or (iii) could require a processing time inadequate to its urgency.

To make processing time more adequate to the urgency of the request, approximate query processing approaches can be considered, which provide a fast answer at the cost of a lower accuracy [4]. Additionally, to improve the interpretation of the request, and the related source selection, one possibility is to rely on user involvement. Novel paradigms for exploratory user-data interactions, which emphasize user context and interactivity with the goal of facilitating exploration, interpretation, retrieval, and assimilation of information, are being developed. Such solutions, however, do not seem adequate in dynamic contexts, characterized by urgent requests and by communication means hampering user interaction.

Other innovative approaches should therefore be devised, relying on different kinds of information for finding possibly approximate answers to complex information needs, even vaguely and imprecisely specified, operating on the full spectrum of relevant content in a dataspace of highly heterogeneous and poorly controlled sources. In [3], to overcome difficulties related to heterogeneity and dynamic nature, the exploitation of additional information, in terms of user profile and request context, data and processing quality, and similar requests recurring over time, is suggested.

In this paper, we focus on similar requests recurring over time for improving approximate processing of requests that we assume already interpreted and represented according to a given formalism. Recurring retrieval needs are very common in dynamic contexts, such as, for instance, during or after an exceptional event (environmental emergencies or flash mob initiatives), or in the context of users belonging to the same community or that are in the same place, possibly at different times. The information needs are widespread among different users, because they are induced by the event, the interests of the community, and the place, respectively. We aim at taking advantage of the experience gained by prior processing in new searches for limiting interpretation errors and response time, thus reducing the possibility of producing unsatisfactory answers.

Recurring retrieval needs have been considered in query processing for a very long time in terms of materialized views (see, e.g., [7, 8]). The idea is to precompute the results of some recurring queries as materialized views, select some of such views (view selection problem) and re-use them (view-based query processing) for the execution of new requests. Unfortunately, the usage of materialized views in the contexts described above suffers from some problems: (i) views associate a result with a given query; however, in the reference contexts we may be interested in associating other or additional information (e.g., the set of used data sources or, under pay-as-you-go integration approaches [6], query-to-data mappings); (ii) view updates are very frequent in dynamic environments, reducing the efficiency of the overall system; (iii) view-based query processing techniques usually rely on a precise semantics while heterogeneity tends to favour approximation-based approaches.

Alternative usages of the concept of recurring retrieval needs in query processing have been recently provided but, also in this case, precise query processing is taken into account. For example, in [13], common needs are used with the aim of expediting the processing of graph queries against a database of graphs, by reducing the number of subgraph isomorphism tests to be performed.

Contributions. This paper presents the preliminary results of an ongoing work aiming at exploiting information about similar requests recurring over time in approximate query processing of requests executed over diverse and dynamic dataspaces. We refer to a [...] correspond to previously computed results; rather, it ranges from query results to any other higher level information collected during query processing (e.g., query-to-data mappings). In specifying the framework, we identify the need for PGQP-aware approximate query processing techniques and PGQP management issues raised by the framework. Each instantiation of the framework will lead to the definition of a specific PGQP-aware query processing approach for a given set of information associated with PGQPs. Approximate results can be generated for two main reasons: due to the dynamic nature of the dataspace, information associated with PGQPs may change between two distinct executions of the same query; PGQPs are selected based on similarity criteria with respect to the query at hand.

The paper finally proposes a specific instantiation of the framework, in the context of which the various components and concepts are formalized, and algorithms for the most relevant tasks are specified. In this instantiation, we formally define PGQPs but we leave to future work the association of PGQPs with information collected in prior executions.

Paper organization. The proposed framework is presented in Section 2 and a specific instantiation of the framework is defined in Section 3. Section 4 concludes by discussing a number of open issues we are investigating in this work-in-progress.

2. PGQP-BASED FRAMEWORK

The framework we propose is based on a graph-based representation of dataspaces, interpreted as a collection of (possibly) schemaless data sources, and of user retrieval needs, expressed as entity-relationship queries over such dataspaces [14]. Specifically, we assume to deal with data sources represented in terms of partially specified graphs, where some information, associated with nodes, may be unknown [2]. Similarly, each request is represented in terms of a graph, specifying entities and relationships of interest, which may contain node variables, in order to characterize information to be retrieved [2]. As an example, Figure 2 (a) presents a graph query Q corresponding to the information need "the authors of the figurative artworks the user is watching (i.e., they are located close to her position), together with such artworks, a biography of the authors, and information about places in Genoa
