A Scalable Search Engine for Mass Storage Smart Objects
Total Page:16
File Type:pdf, Size:1020Kb
A Scalable Search Engine for Mass Storage Smart Objects Nicolas Anciaux 1,2 Saliha Lallali 1,2 Iulian Sandu Popa1,2 Philippe Pucheral1,2 1INRIA Rocquencourt 2Université de Versailles Saint-Quentin-en-Yvelines 78153 Le Chesnay Cedex, France 78035 Versailles Cedex, France [email protected] [email protected] ABSTRACT built by querying these data. So far, all these data end up on This paper presents a new embedded search engine designed for centralized servers where they are analyzed and queried. This smart objects. Such devices are generally equipped with centralized model however has two main drawbacks. First, the extremely low RAM and large Flash storage capacity. To tackle ever increasing amount of data transferred from their source (the these conflicting hardware constraints, conventional search smart objects) to their destination (the servers) has a dramatic engines privilege either insertion or query scalability but cannot negative impact on environmental sustainability (i.e., energy meet both requirements at the same time. Moreover, very few waste) and costs (i.e., data transfer). Hence, significant energy solutions support document deletions and updates in this context. and bandwidth savings can be obtained by storing data locally in In this paper, we introduce three design principles, namely Write- smart objects, especially when dealing with large data and seldom Once Partitioning, Linear Pipelining and Background Linear used records [8, 19]. Second, centralizing data sensed about Merging, and show how they can be combined to produce an individuals introduces an unprecedented threat on data privacy embedded search engine reconciling high insert/delete/update rate (e.g., at 1Hz granularity, an electricity smart meter can reveal the and query scalability. We have implemented our search engine on precise activity of the inhabitants [3]). a development board having a hardware configuration representative for smart objects and have conducted extensive experiments using two representative datasets. The experimental results demonstrate the scalability of the approach and its superiority compared to state of the art methods. Figure 1. Smart objects endowed with MCUs and NAND Flash 1. INTRODUCTION This explains the growing interest for transposing traditional data With the continuous development of sensor networks, smart management functionalities directly into the smart objects. Simple personal portable devices and Internet of Things, the tight query evaluation facilities have been recently proposed for sensor combination of computing power and large storage capacities nodes equipped with large Flash memory [8, 9] to enable filtering (Gigabytes) into many kinds of smart objects becomes reality. On operations. Relational database operations like selection, the one hand, large Flash memories (GBs) are now adjunct to projection and join for new generations of SIM cards with large many small computing devices, e.g., SD cards plugged into Flash storage capacities have been proposed in [4, 18]. More sensors [19] or raw Flash chips superimposed on SIM cards [4]. complex treatments such as facial recognition and the related On the other hand, microcontrollers are integrated into many indexing techniques have been investigated also. memory devices, e.g., Flash USB keys, SD/micro-SD cards and In this paper, we make a step further in this direction by contactless badges. Examples of smart objects combining addressing the traditional problem of information retrieval queries computing power and large storage capacity (GBs of Flash over large file collections. A file can be any form of document, memory) are pictured in Figure 1. picture or data stream associated with a set of terms. A query can As smart objects gain the capacity to acquire, store and process be any form of keyword search using a ranking function (e.g., tf- data, new services emerge. Camera sensors tag photographs and idf) identifying the top-k most relevant files. The proposed search provide search capabilities to retrieve them [22]. Smart objects engine can be used in sensors to search for relevant objects in maintain the description of the surrounding objects [23] enabling their surroundings [17, 23], in cameras to search pictures by using the Internet of Things [1], e.g. shops like bookstores can be tags [22], in personal smart dongles to secure the querying of queried directly by customers in search of a certain product. Users documents and files hosted in an untrusted Cloud [4, 12] or in securely store their personal files (documents, photos, emails) in smart meters to perform analytic tasks (i.e., top-k queries) over Personal Data Servers [3, 4, 12, 18]. Smart meters record energy sets of events (i.e., terms) captured during time windows (i.e., consumption events and embedded GPS devices track locations files) [3]. Hence, this engine can be thought of as a generalized and moves. A myriad of new applications and services are being Google desktop or Spotlight embedded in smart objects. This work is licensed under the Creative Commons Attribution- Designing such embedded search engine is however challenging NonCommercial-NoDerivs 3.0 Unported License. To view a copy of this due to a combination of severe and conflicting hardware license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/. Obtain constraints. Smart objects are usually equipped with a tiny RAM permission prior to any use beyond those covered by the license. Contact and their persistent storage is made of NAND Flash badly adapted copyright holder by emailing [email protected]. Articles from this volume were invited to present their results at the 41st International Conference on Very to random fine-grain updates. Unfortunately, state-of-the-art Large Data Bases, August 31st – September 4th 2015, Kohala Coast, Hawaii. indexing techniques either consume a lot of RAM or produce a Proceedings of the VLDB Endowment, Vol. 8, No. 9 large quantity of random fine-grain updates. Few pioneer works Copyright 2015 VLDB Endowment 2150-8097/15/05. already considered the problem of embedding a search engine in sensors equipped with Flash storage [17, 20, 22, 23], but they 910 target small data collections (i.e., hundreds to thousands of files) 2.2 Search Engine Requirements with a query execution time which remains proportional to the As in [17], we consider that the search engine of interest in this number of indexed documents. By construction these search paper has similar functionality as a Google Desktop for smart engines cannot meet insertion performance and query scalability objects. Hence, we use the terminology introduced in the at the same time. Moreover none of them support file deletions Information Retrieval literature for full-text search. Then, a and updates, however useful in a number of scenarios. document refers to any form of data files, terms refers to any In this paper, we make the following contributions: forms of metadata elements, term frequencies refer to metadata element weights and a query is equivalent to a full-text search. • We introduce three design principles, namely Write-Once Partitioning, Linear Pipelining and Background Linear Full-text search has been widely studied by the information Merging, to devise an inverted index capable of tackling the retrieval community (see [24] for a recent survey). The core conflicting hardware constraints of smart objects. problem is, given a collection of documents and a user query • Based on these principles, we propose a novel inverted index expressed as a set of terms {ti}, to retrieve the k most relevant structure and related algorithms to support all the basic index documents according to a ranking function. In the wide majority operations, i.e., search, insertion, deletion and update. of the related works, the tf-idf score, i.e., term frequency-inverse • We validate our design through an extensive experimentation on document frequency, is used to rank the query results. A a real smart object platform and with two representative datasets document can be of many types (e.g., text file, image, etc.) and is and show that query scalability and insertion/deletion/update associated with a set of terms (or keywords) describing its content performance can be met together in various contexts. and weights indicating their respective importance in the document. For text documents, the terms are words composing the The rest of the paper is organized as follows. Section 2 details the document and their weight is their frequency in the document. For smart objects' hardware constraints, analyses the state-of-the-art images, the terms can be tags, metadata or visterms describing solutions and derives from this analysis a precise problem image subparts [22]. For a query Q={t}, the tf-idf score of each statement. Section 3 introduces our three design principles, while document d containing at least a query term can be computed as: Sections 4 and 5 detail the proposed inverted index structure and N algorithms derived from these principles. Section 6 is devoted to tf idf (d ) log f d ,t 1 log F the tricky case of file deletions and updates. We present the tQ t experimental results in Section 7 and conclude in Section 8. where fd,t is the frequency of term t in document d, N is the total number of indexed documents and Ft is the number of documents 2. PROBLEM STATEMENT that contain t. This formula is given for illustrative purpose, the weight between fd,t and N/Ft varying depending on the proposals. 2.1 Smart Objects’ Hardware Constraints Whatever their form factor and usage, smart objects share strong commonalities in terms of data management architecture. Indeed, a large NAND Flash storage is used to persistently store data and indexes, and a microcontroller (MCU) executes the embedded code, both being connected by a bus. Hence, the architecture inherits hardware constraints from both the MCU and the Flash. The MCUs embedded in smart objects usually have a low power CPU, a tiny RAM (few KB) and a few MB of persistent memory Figure 2.