<<

A Scalable Search Engine for Mass Storage Smart Objects

Nicolas Anciaux 1,2, Saliha Lallali 1,2, Iulian Sandu Popa 1,2, Philippe Pucheral 1,2
1 INRIA Rocquencourt, 78153 Le Chesnay Cedex, France
2 Université de Versailles Saint-Quentin-en-Yvelines, 78035 Versailles Cedex, France
[email protected], [email protected]

ABSTRACT
This paper presents a new embedded search engine designed for smart objects. Such devices are generally equipped with extremely low RAM and large Flash storage capacity. To tackle these conflicting hardware constraints, conventional search engines privilege either insertion or query scalability but cannot meet both requirements at the same time. Moreover, very few solutions support document deletions and updates in this context. In this paper, we introduce three design principles, namely Write-Once Partitioning, Linear Pipelining and Background Linear Merging, and show how they can be combined to produce an embedded search engine reconciling a high insert/delete/update rate and query scalability. We have implemented our search engine on a development board having a hardware configuration representative of smart objects and have conducted extensive experiments using two representative datasets. The experimental results demonstrate the scalability of the approach and its superiority compared to state-of-the-art methods.

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/. Obtain permission prior to any use beyond those covered by the license. Contact copyright holder by emailing [email protected]. Articles from this volume were invited to present their results at the 41st International Conference on Very Large Data Bases, August 31st - September 4th 2015, Kohala Coast, Hawaii. Proceedings of the VLDB Endowment, Vol. 8, No. 9. Copyright 2015 VLDB Endowment 2150-8097/15/05.

1. INTRODUCTION
With the continuous development of sensor networks, smart personal portable devices and the Internet of Things, the tight combination of computing power and large storage capacities (Gigabytes) into many kinds of smart objects becomes reality. On the one hand, large Flash memories (GBs) are now adjunct to many small computing devices, e.g., SD cards plugged into sensors [19] or raw Flash chips superimposed on SIM cards [4]. On the other hand, microcontrollers are integrated into many memory devices, e.g., Flash USB keys, SD/micro-SD cards and contactless badges. Examples of smart objects combining computing power and large storage capacity (GBs of Flash memory) are pictured in Figure 1.

As smart objects gain the capacity to acquire, store and process data, new services emerge. Camera sensors tag photographs and provide search capabilities to retrieve them [22]. Smart objects maintain the description of the surrounding objects [23], enabling the Internet of Things [1]; e.g., shops like bookstores can be queried directly by customers in search of a certain product. Users securely store their personal files (documents, photos, emails) in Personal Data Servers [3, 4, 12, 18]. Smart meters record energy consumption events and embedded GPS devices track locations and moves. A myriad of new applications and services are being built by querying these data. So far, all these data end up on centralized servers where they are analyzed and queried. This centralized model however has two main drawbacks. First, the ever increasing amount of data transferred from their source (the smart objects) to their destination (the servers) has a dramatic negative impact on environmental sustainability (i.e., energy waste) and costs (i.e., data transfer). Hence, significant energy and bandwidth savings can be obtained by storing data locally in smart objects, especially when dealing with large data and seldom used records [8, 19]. Second, centralizing data sensed about individuals introduces an unprecedented threat on data privacy (e.g., at 1Hz granularity, an electricity smart meter can reveal the precise activity of the inhabitants [3]).

Figure 1. Smart objects endowed with MCUs and NAND Flash.

This explains the growing interest for transposing traditional data management functionalities directly into the smart objects. Simple query evaluation facilities have been recently proposed for sensor nodes equipped with large Flash memory [8, 9] to enable filtering operations. Relational database operations like selection, projection and join for new generations of SIM cards with large Flash storage capacities have been proposed in [4, 18]. More complex treatments such as facial recognition and the related indexing techniques have been investigated also.

In this paper, we make a step further in this direction by addressing the traditional problem of information retrieval queries over large file collections. A file can be any form of document, picture or data stream associated with a set of terms. A query can be any form of keyword search using a ranking function (e.g., tf-idf) identifying the top-k most relevant documents. The proposed search engine can be used in sensors to search for relevant objects in their surroundings [17, 23], in cameras to search pictures by using tags [22], in personal smart dongles to secure the querying of documents and files hosted in an untrusted Cloud [4, 12] or in smart meters to perform analytic tasks (i.e., top-k queries) over sets of events (i.e., terms) captured during time windows (i.e., files) [3]. Hence, this engine can be thought of as a generalized Google Desktop or Spotlight embedded in smart objects.

Designing such an embedded search engine is however challenging due to a combination of severe and conflicting hardware constraints. Smart objects are usually equipped with a tiny RAM and their persistent storage is made of NAND Flash badly adapted to random fine-grain updates. Unfortunately, state-of-the-art indexing techniques either consume a lot of RAM or produce a large quantity of random fine-grain updates.

Few pioneer works already considered the problem of embedding a search engine in sensors equipped with Flash storage [17, 20, 22, 23], but they target small data collections (i.e., hundreds to thousands of files) with a query execution time which remains proportional to the number of indexed documents. By construction these search engines cannot meet insertion performance and query scalability at the same time. Moreover, none of them supports file deletions and updates, however useful in a number of scenarios.

In this paper, we make the following contributions:
• We introduce three design principles, namely Write-Once Partitioning, Linear Pipelining and Background Linear Merging, to devise an inverted index capable of tackling the conflicting hardware constraints of smart objects.
• Based on these principles, we propose a novel inverted index structure and related algorithms to support all the basic index operations, i.e., search, insertion, deletion and update.
• We validate our design through an extensive experimentation on a real smart object platform and with two representative datasets and show that query scalability and insertion/deletion/update performance can be met together in various contexts.

The rest of the paper is organized as follows. Section 2 details the smart objects' hardware constraints, analyses the state-of-the-art solutions and derives from this analysis a precise problem statement. Section 3 introduces our three design principles, while Sections 4 and 5 detail the proposed inverted index structure and algorithms derived from these principles. Section 6 is devoted to the tricky case of file deletions and updates. We present the experimental results in Section 7 and conclude in Section 8.

2. PROBLEM STATEMENT

2.1 Smart Objects' Hardware Constraints
Whatever their form factor and usage, smart objects share strong commonalities in terms of data management architecture. Indeed, a large NAND Flash storage is used to persistently store data and indexes, and a microcontroller (MCU) executes the embedded code, both being connected by a bus. Hence, the architecture inherits hardware constraints from both the MCU and the Flash.

The MCUs embedded in smart objects usually have a low power CPU, a tiny RAM (a few KB) and a few MB of persistent memory (ROM, NOR or EEPROM) used to store the embedded code. The NAND Flash component (either raw NAND Flash chip or SD/micro-SD card) also exhibits strong limitations. In NAND Flash, the minimum unit for a read and a write operation is the page (usually 512 Bytes, also called a sector). Pages must be erased before being rewritten but the erase operation must be performed at a block granularity (e.g., 256 pages). Erases are then costly and a block wears out after about 10^4 repeated write/erase cycles. In addition, the pages have to be written sequentially in a block. Therefore, NAND Flash badly supports random writes. We observed this same bad behavior both with raw NAND Flash chips and SD/micro-SD cards. Our own measurements shown in Table 1 corroborate the ones published in [6].

Table 1. Measured performance of common SD cards

Time (ms) per I/O    | Read | Seq. Write | Rand. Write
Kingston µSDHC       | 1.3  | 6.3        | 160
Lexar SDMI4GB-715    | 0.8  | 2.1        | 180
Samsung µSDHC Plus   | 1.3  | 2.9        | 315
SiliconPower SDHC    | 1.4  | 3.4        | 40

2.2 Search Engine Requirements
As in [17], we consider that the search engine of interest in this paper has similar functionality as a Google Desktop for smart objects. Hence, we use the terminology introduced in the Information Retrieval literature for full-text search. Then, a document refers to any form of data file, terms refer to any form of metadata elements, term frequencies refer to metadata element weights and a query is equivalent to a full-text search.

Full-text search has been widely studied by the information retrieval community (see [24] for a recent survey). The core problem is, given a collection of documents and a user query expressed as a set of terms {ti}, to retrieve the k most relevant documents according to a ranking function. In the wide majority of the related works, the tf-idf score, i.e., term frequency-inverse document frequency, is used to rank the query results. A document can be of many types (e.g., text file, image, etc.) and is associated with a set of terms (or keywords) describing its content and weights indicating their respective importance in the document. For text documents, the terms are words composing the document and their weight is their frequency in the document. For images, the terms can be tags, metadata or visterms describing image subparts [22]. For a query Q={t}, the tf-idf score of each document d containing at least one query term can be computed as:

tf\text{-}idf(d) = \sum_{t \in Q} \log(f_{d,t} + 1) \times \log\left(\frac{N}{F_t}\right)

where fd,t is the frequency of term t in document d, N is the total number of indexed documents and Ft is the number of documents that contain t. This formula is given for illustrative purpose, the weighting between fd,t and N/Ft varying depending on the proposals.

Figure 2. Typical inverted index structure.

Classically, full-text search queries are evaluated efficiently using an inverted index, named I hereafter (see Figure 2). Given D={di} a set of documents, the inverted index I over D consists of two main components [24]: (i) a search structure I.S (also called dictionary) which stores, for each term t appearing in the documents, the number Ft of documents containing t and a pointer to the inverted list of t; (ii) a set of inverted lists {I.Lt} where each list stores for a term t the list of (d, fd,t) pairs where d is a document identifier in D that contains t and fd,t is the weight of the term t in the document d (typically the frequency of t in d). The dictionary is constituted by all the distinct terms t of the documents in D, and is large in practice, which requires organizing it into a search-efficient structure such as a B-tree.

A query Q={t} is traditionally evaluated by: (i) accessing I.S to retrieve, for each query term t, the inverted list elements {I.Lt}, t in Q; (ii) allocating in RAM one container for each unique document identifier in these lists; (iii) computing the score of each of these documents using a weight function, e.g., tf-idf; (iv) ranking the documents according to their score and producing the k documents with the highest scores.
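To make this baseline concrete, the sketch below shows a minimal in-memory inverted index and the four-step top-k evaluation described above. It is illustrative only (Python, a dictionary stands in for the B-tree of I.S, the tf-idf weighting is the one given for illustration above); names such as InvertedIndex and search are ours, not part of the embedded implementation described in this paper.

```python
import math
import heapq
from collections import defaultdict

class InvertedIndex:
    """Illustrative in-memory inverted index: I.S maps term -> F_t,
    I.L maps term -> list of (doc_id, f_dt) pairs."""
    def __init__(self):
        self.S = {}                    # dictionary: term -> F_t
        self.L = defaultdict(list)     # inverted lists: term -> [(doc_id, f_dt)]
        self.N = 0                     # total number of indexed documents

    def insert(self, doc_id, term_frequencies):
        """term_frequencies: dict term -> frequency of the term in the document."""
        self.N += 1
        for t, f in term_frequencies.items():
            self.S[t] = self.S.get(t, 0) + 1          # one more document contains t
            self.L[t].append((doc_id, f))

    def search(self, query_terms, k):
        """Classical evaluation: (i) fetch lists, (ii) one RAM container per candidate
        document, (iii) score with tf-idf, (iv) rank and return the k best."""
        scores = defaultdict(float)
        for t in query_terms:
            Ft = self.S.get(t, 0)
            if Ft == 0:
                continue
            for doc_id, f_dt in self.L[t]:
                scores[doc_id] += math.log(f_dt + 1) * math.log(self.N / Ft)
        return heapq.nlargest(k, scores.items(), key=lambda x: x[1])

# Usage example (hypothetical documents)
idx = InvertedIndex()
idx.insert(1, {"flash": 3, "nand": 1})
idx.insert(2, {"flash": 1, "ram": 2})
idx.insert(3, {"ram": 5, "mcu": 1})
print(idx.search({"flash", "ram"}, k=2))
```

Note that step (ii) is precisely what a smart object cannot afford: the scores container grows with the number of candidate documents, which motivates the bounded-RAM pipeline introduced in Sections 3 and 4.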

2.3 State-of-the-Art Solutions
Data management embedded in smart objects is no longer a new topic. Many proposals from the database community tackle this problem in the context of the Internet of Things, strengthening the idea that smart objects must now be considered as first-class data sources. However, many works [3, 4, 19] consider a traditional database context and do not address the full-text search problem, leading to different query processing techniques and indexing structures. Therefore, we focus below on works specifically addressing embedded search engines and then extend the review to a few works related to Flash-based indexing when the way they tackle the MCU or Flash constraints can enlighten the discussion.

Embedded search engines. A few pioneer works recently demonstrated the interest of embedding search engine techniques into smart objects equipped with extended Flash storage to manage collections of files stored locally [16, 17, 20, 22]. These works rely on a similar design of the embedded inverted index as proposed in [17]. Instead of maintaining one inverted list per term in the dictionary, each term is hashed to a bucket and a single inverted list is built for each bucket. The inverted lists are stored sequentially in Flash memory, within chained pages, and only a small hash table referencing the head of each bucket is kept in RAM. The number of buckets is kept small such that the main part of the RAM can be used as an insertion buffer for the inverted list elements, i.e., (t, d, fd,t) triples. This approach complies with a small RAM and suits well the Flash constraints by precluding fine-grain random (re)writes in Flash. However, each inverted list corresponds to a large number of different terms, which leads to a high query evaluation cost that increases proportionally with the size of the data collection. The less RAM available, the smaller the number of hash buckets and the more severe the problem is. In addition, these techniques do not support document deletions, but only data aging mechanisms, where old index entries automatically expire when overwritten by new ones. A similar design is proposed in [22] that builds a distributed search engine to retrieve images captured by camera sensors. A local inverted index is embedded in each sensor node to retrieve the relevant images locally, before conducting the distributed search. However, this work considers powerful sensor nodes (with tens of MB of local RAM) equipped with custom SD card boards (with specific performances). At the same time, the underlying index structure is based on inverted lists organized in a similar way as in [17]. All these methods are highly efficient for document insertions, but fail to provide scalable query processing for large collections of documents. Therefore, their usage is limited to applications that require storing only a small number (a few hundreds) of documents.

B-tree indexing in NAND Flash. In the database context, adapting the B-tree to NAND Flash has received great attention. Indeed, the B-tree is a very popular index and its standard implementation performs poorly in Flash [21]. Many recent proposals [2, 13, 21] tackle this problem or the closely related problem of a key-value store in Flash [7]. The key idea in these approaches is to buffer the updates in log structures that are written sequentially and to leverage the fast (random) read performance of Flash memory to compensate the loss of optimality of the lookups. When the log is large enough, the updates are committed into the B-tree in a batch mode, to amortize the Flash write cost. The log must be indexed in RAM to ensure performance. The different proposals vary in the way the log and the in-memory index are managed, and in the impact this has on the commit frequency. To amortize the write cost by a significant factor, the log must be seldom committed, which requires more RAM. Conversely, limiting the RAM size leads to increasing the commit frequency, thus generating more random writes. The RAM consumption and the random write cost are thus conflicting parameters. Under severe RAM limitations, virtually no reduction of random writes can be obtained.

Partitioned indexes. In another line of work, partitioned indexes have been extensively employed in environments with insert-intensive workloads and concurrent queries on magnetic disks. Prominent examples are the LSM-tree (i.e., the Log-Structured Merge-tree) [14] and its many variants (e.g., the Partitioned Exponential file [10] and the bLSM-tree [15] to name but a few). The LSM-tree consists in one in-memory B-tree component to buffer the updates and one on-disk B+-tree component that indexes the disk-resident data. Periodically, the two components are merged to integrate the in-memory data and free the memory. The benefit is twofold. First, the updates are integrated in batch, which amortizes the write cost per update. Second, the merge operation uses sequential I/Os, which reduces the disk arm movements and thus increases the throughput. Note that Google's BigTable and Facebook's Cassandra employ a similar partitioning approach to optimize the indexing of key-value stores in massively parallel and distributed architectures. The search engine proposed in this paper shares the general idea of index partitioning. However, the similarity stops at this general level because of major differences regarding the targeted hardware platforms (embedded systems versus high-end servers) and the queries of interest (top-k keyword search versus classical key-value search). Consequently, our solution differs from the above mentioned ones in a number of aspects (e.g., RAM usage, index partitioning organization, management of updates and deletions, way of computing the top-k with a minimal RAM consumption).

As a conclusion, tiny RAM and NAND Flash persistent storage introduce conflicting constraints and lead to split state-of-the-art solutions in two families. The insert-optimized family reaches insertion scalability thanks to a small index structure buffered in RAM and sequentially flushed in Flash, thereby precluding costly random writes in Flash. This good insertion behavior is however obtained to the detriment of query scalability, the performance of searches being roughly linear with the index size in Flash. Conversely, the query-optimized family reaches query scalability by adapting traditional indexing structures to Flash storage, to the detriment of insertion scalability, the number of random (re)writes in Flash (linked to the log commit frequency) being roughly inversely proportional to the RAM capacity. In addition, we are not aware of works addressing the crucial problem of random document deletions in the context of an embedded search engine.

2.4 Problem Formulation
In the light of the preceding sections, the problem addressed in this paper can be formulated as designing an embedded full-text search engine that has the following two properties:

• Bounded RAM agreement: the proposed engine must be able to respect a predefined RAM consumption bound (RAM_Bound), precluding any solution where this consumption depends on the size of the document set.
• Full scalability: it must be scalable for queries and updates (insertion, deletion of documents) without distinction.

The Bounded RAM agreement is required to comply with the widest population of smart objects. The consequence is that the full-text search engine must remain functional even when very little RAM (a few KB) is made available to it. Note that the RAM_Bound size is a subpart of the total physical RAM capacity of a smart object, considering that the RAM resource is shared by all software components running in parallel on the platform, including the operating system. The RAM_Bound property is also mandatory in a co-design perspective where the hardware resources of a given platform must be precisely calibrated to match the requirements of a particular application domain.

The Full scalability property guarantees the generality of the approach. By avoiding to privilege a particular workload, the index can comply with most applications and datasets. To achieve update scalability, the index maintenance needs to be processed without generating random writes, which are badly supported by the Flash memory. At the same time, achieving query scalability means obtaining query execution costs in the same order of magnitude as the ideal query costs provided by a classical inverted index I.

3. DESIGN PRINCIPLES
Satisfying the Bounded RAM agreement and Full scalability properties simultaneously is challenging, considering the conflicting MCU and Flash constraints mentioned above. To tackle this challenge, we propose in this paper an indexing method that relies on the following three design principles.

P1. Write-Once Partitioning: Split the inverted index structure I in successive partitions such that a partition is flushed only once in Flash and is never updated.

By precluding random writes in Flash, Write-Once Partitioning aims at satisfying update scalability. Considering the Bounded RAM agreement, the consequence of this principle is to parse documents and maintain I in a streaming way. Conceptually, each partition can be seen as the result of indexing a window of the document input flow, the size of which is limited by the RAM_Bound. Therefore, I is split in an infinite sequence of partitions <I1, I2, ..., Ip>, each partition Ii having the same internal structure as I. When the size of the current Ii partition stored in RAM reaches RAM_Bound, Ii is flushed in Flash and a new partition Ii+1 is initialized in RAM for the next window.

A second consequence of this design principle is that document deletions have to be processed similarly to document insertions since the partitions cannot be modified once they are written. This means adding compensating information in each partition that will be considered by the query process to produce correct results.

P2. Linear Pipelining: Compute each query Q with respect to the Bounded RAM agreement in such a way that the execution cost of Q over <I1, I2, ..., Ip> is in the same order of magnitude as the execution cost of Q over I.

Linear Pipelining aims at satisfying query scalability under the Bounded RAM agreement. A unique structure I as the one pictured in Figure 2 is assumed to satisfy query scalability by nature and is considered hereafter as providing a lower bound in terms of query execution time. Hence, the objective of Linear Pipelining is to keep the performance gap between Q over <I1, I2, ..., Ip> and Q over I both small and predictable (bounded by a given tuning parameter). Computing Q as a set-oriented composition of a set of Qi over Ii (with i=1,...,p) would unavoidably violate the Bounded RAM agreement as p increases, since it would require storing all Qi's intermediate results in RAM. Hence the necessity to organize the processing in pipeline such that the RAM consumption remains independent of p, and therefore of the number of indexed documents. Also, the term linear pipelining conveys the idea that the query processing must preclude any iteration (i.e., repeated accesses) over the same data structure to reach the expected level of performance. This disqualifies brute-force pipeline solutions where the tf-idf scores of documents are computed one after the other, at the price of reading the same inverted lists as many times as the number of documents they contain.

However, Linear Pipelining alone cannot prevent the performance gap between Q over <I1, I2, ..., Ip> and Q over I from increasing with p since (i) multiple searches in several small Ii.S are more costly than a single search in a large I.S and (ii) the inverted lists in <I1, I2, ..., Ip> are likely to occupy only fractions of Flash pages, multiplying the number of Flash I/Os to access the same amount of data. A third design principle is then required.

P3. Background Linear Merging: To limit the total number of partitions, periodically merge partitions compliantly with the Bounded RAM agreement and without hurting update scalability.

The objective of partition merging is therefore to obtain a lower number of larger partitions to avoid the drawbacks mentioned above. Partition merging must meet three requirements. First, the merge must be performed in pipeline to comply with the Bounded RAM agreement. Second, since its cost can be significant (i.e., proportional to the total size of the merged partitions), the merge must be processed in background to avoid locking the index structure for unbounded periods of time. Since multi-threading is not supported by the targeted platforms, background processing can simply be understood as the capacity to interrupt and recover the merging process at any time. Third, update scalability requires that the total cost of a merge run be always smaller than the time to fill out the next bunch of partitions to be merged.

Taken together, principles P1 to P3 reconcile the Bounded RAM agreement and Full scalability index properties. The technical solutions to implement these three principles are presented next. To ease the presentation, we introduce first the foundation of our solution considering only document insertions and queries. The trickier case of document deletions is postponed to Section 6.

4. WRITE-ONCE PARTITIONING AND LINEAR PIPELINING
These two design principles are discussed together because the complexity comes from their combination. Indeed, Write-Once Partitioning is straightforward on its own. It simply consists in splitting I in a sequence <I1, I2, ..., Ip> of small indexes called partitions, each one having a size bounded by RAM_Bound. The difficulty is to implement a linear pipeline execution of any query Q on this sequence of partial indexes.
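The sketch below illustrates principle P1 in isolation, under the stated assumptions: documents are parsed in a streaming way into an in-RAM partition that mirrors I's structure, and the partition is flushed sequentially to Flash as soon as its size reaches RAM_Bound. It is a minimal illustration (Python, size accounting reduced to entry counts, an append-only list standing in for sequential Flash writes); class and function names are ours, not the paper's.

```python
class RAMPartition:
    """In-RAM partition Ii: same internal structure as I (dictionary + inverted lists)."""
    def __init__(self):
        self.S = {}       # term -> local F_t
        self.L = {}       # term -> [(doc_id, f_dt)] in insertion (doc id) order
        self.size = 0     # simplified size accounting (number of list entries)

    def add_document(self, doc_id, term_frequencies):
        for t, f in term_frequencies.items():
            self.S[t] = self.S.get(t, 0) + 1
            self.L.setdefault(t, []).append((doc_id, f))
            self.size += 1

class WriteOncePartitioner:
    """Principle P1: partitions are written once to Flash and never updated."""
    def __init__(self, ram_bound, flash):
        self.ram_bound = ram_bound
        self.flash = flash             # append-only storage (stand-in for sequential Flash writes)
        self.current = RAMPartition()  # partition Ii currently built in RAM

    def insert(self, doc_id, term_frequencies):
        self.current.add_document(doc_id, term_frequencies)
        if self.current.size >= self.ram_bound:
            self.flush()

    def flush(self):
        # The partition is flushed sequentially and becomes immutable (write-once).
        self.flash.append({"S": self.current.S, "L": self.current.L})
        self.current = RAMPartition()  # start Ii+1 for the next parsing window
```

In the real engine the partition is serialized page by page to NAND Flash and document deletions are handled by inserting compensating (d, -fd,t) entries (Section 6); both aspects are omitted in this sketch.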

Executing Q over I would lead to evaluate:

Top_k\left(\sum_{t \in Q} W\left(f_{d,t}, \frac{N}{F_t}\right)\right), \quad d \in D

where Top_k selects the k documents d in D having the largest tf-idf scores, each score being computed as the sum, for all terms t in Q, of a given weight function W taking as parameters the frequency fd,t of t in d and the inverse document frequency N/Ft. Our objective is to remain agnostic regarding W and we therefore leave the precise form of this function open. Let us now consider how each term of this expression can be evaluated by a linear pipelining process on a sequence <I1, I2, ..., Ip>.

Computing N. We assume that the number of documents is a global metadata maintained at insertion/deletion time and needs not be recomputed for each Q.

Figure 3. Consecutive index partitions (overlapping docs).

Computing Ft. Ft should be computed only once for each term t since Ft is constant for Q. This is why Ft is materialized in the dictionary part of the index ({t, Ft} in I.S), as shown in Figure 2. When I is split in <I1, I2, ..., Ip>, the global value of Ft should be computed as the sum of the local Ft of all partitions. The complexity comes from the fact that a same document d may cross several partitions, with the consequence of contributing several times to the global Ft if a simple sum is performed. The Bounded RAM agreement precludes maintaining in RAM a history of all the terms already encountered for a given document d across the parsing windows, the size of this history being unbounded. Accessing the inverted lists {Ii.Lt} of successive partitions to check whether they intersect for a given d would also violate the Linear Pipelining principle since these same lists will be accessed again when computing the tf-idf score of each document. The solution is then to store in the dictionary of each partition the boundary of that partition, namely the identifiers of the first and last documents considered in the parsing window. Then, two bits firstd and lastd are added in the dictionary for each inverted list to register whether this list contains one (or both) of these documents, i.e., {t, Ft, firstd, lastd} in I.S. As illustrated in Figure 3, this is sufficient to detect the intersection between the inverted lists of a same term t in two successive partitions. Whenever an intersection between two lists is detected, the sum of their respective Ft is decremented by 1. Hence, the correct global value of Ft can easily be computed without physically accessing the inverted lists. During the Ft computation phase, the dictionary of each partition is read only once and the RAM consumption sums up to one buffer to read each dictionary, page by page, and one RAM variable to store the current value of each Ft.
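As an illustration of this step, the sketch below sums the local Ft of each partition and decrements the sum whenever the dictionaries of two consecutive partitions indicate that the same boundary document appears in both lists of t. It is a simplified model (Python, dictionaries stand in for the page-by-page scan of each Ii.S, and the firstd/lastd bits are reduced to explicit flags); the encoding of the flags is our assumption, based on the description above.

```python
def global_Ft(partition_entries):
    """partition_entries: per-partition dictionary entries for one term t, oldest
    partition first. Each entry is None (t absent from that partition) or a dict:
      {"Ft": local_Ft, "firstd": bool, "lastd": bool}
    where firstd/lastd tell whether the inverted list of t contains the first/last
    document of the partition's parsing window."""
    total = 0
    prev = None
    for entry in partition_entries:
        if entry is None:
            prev = None
            continue
        total += entry["Ft"]
        # A document overlapping two consecutive parsing windows is the last document
        # of window i and the first document of window i+1: if both lists flag it,
        # it was counted twice, so compensate by 1.
        if prev is not None and prev["lastd"] and entry["firstd"]:
            total -= 1
        prev = entry
    return total

# Example: a document overlapping partitions I1 and I2 contributes to both local Ft values.
entries = [{"Ft": 4, "firstd": False, "lastd": True},
           {"Ft": 3, "firstd": True,  "lastd": False},
           None]
print(global_Ft(entries))   # 4 + 3 - 1 = 6
```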

Figure 4. Linear Pipeline computation of Q over terms ti and tj.

Computing fd,t. If a document d overlaps two consecutive partitions Ii and Ii+1, the inverted list Lt of a queried term t in Q may also overlap these two partitions. In this case the fd,t score of d is simply the sum of the (last) fd,t value in Ii.Lt and the (first) fd,t value in Ii+1.Lt. To get the fd,t values, the inverted lists Ii.Lt have to be accessed. The pointers referencing these lists are actually stored in the dictionary, which has already been read while computing Ft. According to the Linear Pipelining principle, we avoid reading the dictionary again by storing these pointers in RAM during the Ft computation. The extra RAM consumption is minimal and bounded by the fact that the number of partitions is itself bounded thanks to the merging process (see Section 5).

Computing Topk. Traditionally, a RAM variable is allocated to each document d to compute its tf-idf score by summing the results of W(fd,t, N/Ft) for all terms t in Q [24]. Then, the k best scores are selected. Unfortunately, this approach conflicts with the Bounded RAM agreement since the size of the document set is likely to be much larger than the available RAM. Hence, we organize the query processing in a pure pipeline way, allocating a RAM variable only to the k documents having currently the best scores. This forces the complete computation of tf-idf(d) to be done for each d, one after the other. To meet this requirement while precluding any iteration on the inverted lists, these lists are maintained sorted on the document id. Note that if document ids reflect the insertion ordering, the inverted lists are naturally sorted. Hence, the tf-idf computation sums up to a simple linear pipeline merging process of the inverted lists for all terms t in Q in each partition (see Figure 4). The RAM consumption for this phase is therefore restricted to one variable for each of the current k best tf-idf scores and to one buffer (i.e., a RAM page) per query term t to read the corresponding inverted lists Ii.Lt (i.e., the Ii.Lt are read in parallel for all t, the inverted lists for the same t being read in sequence). Figure 4 summarizes the data structures maintained in RAM and in Flash to handle this computation.
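The following sketch illustrates this pipeline under the stated assumptions: every inverted list is sorted on document id, the lists of all partitions for one term form one logical stream, and only k score variables plus one cursor per query term are kept in RAM. It is an illustrative model (Python, in-memory lists instead of page-buffered Flash reads, ascending id order for brevity whereas the SSF of Section 5 is scanned in descending order); names are ours and deletions (Section 6) are ignored.

```python
import heapq, math

def topk_pipeline(partitions, query_terms, N, k,
                  W=lambda f, idf: math.log(f + 1) * math.log(idf)):
    """partitions: ordered list of partitions, each {term: [(doc_id, f_dt), ...]} with
    lists sorted on doc_id (insertion order). Returns the k best (score, doc_id)."""
    # One logical stream per query term: that term's lists concatenated over all partitions.
    streams = {t: [pair for p in partitions for pair in p.get(t, [])] for t in query_terms}
    Ft = {t: len({d for d, _ in streams[t]}) for t in query_terms}   # global F_t (see above)
    cursors = {t: 0 for t in query_terms}                            # one cursor per term
    topk = []                                                        # k best (score, doc_id)
    while True:
        # Next document to score: the smallest doc id under the cursors; each document
        # is fully scored before moving on (pure pipeline, no iteration over the lists).
        heads = [streams[t][cursors[t]][0] for t in query_terms if cursors[t] < len(streams[t])]
        if not heads:
            break
        d = min(heads)
        score = 0.0
        for t in query_terms:
            f_dt = 0
            # A document overlapping two partitions contributes two pairs whose
            # frequencies are summed before applying the weight function W.
            while cursors[t] < len(streams[t]) and streams[t][cursors[t]][0] == d:
                f_dt += streams[t][cursors[t]][1]
                cursors[t] += 1
            if f_dt > 0:
                score += W(f_dt, N / Ft[t])
        if len(topk) < k:
            heapq.heappush(topk, (score, d))     # keep only the k best scores in RAM
        elif score > topk[0][0]:
            heapq.heapreplace(topk, (score, d))
    return sorted(topk, reverse=True)
```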
5. BACKGROUND LINEAR MERGING
The background merging process aims at achieving scalable query costs by timely merging several small indexes into a larger index structure. As mentioned in Section 3, the merge must be a pipeline process in order to comply with the Bounded RAM agreement while keeping a cost compatible with the update rate. Moreover, the query processing should continue to be executed in Linear Pipelining (see Section 4) on the structure resulting from the successive merges. Therefore, the merges have to preserve the global ordering of the document ids within the index structures.

To meet these requirements, we introduce a Sequential and Scalable Flash structure, called SSF, pictured in Figure 5.

The SSF consists in a hierarchy of partitions of exponentially increasing size. Specifically, each new index partition is flushed from RAM into the first level of the SSF, i.e., L0. The merge operation is triggered automatically when the number of partitions in a level becomes b, the branching factor of the SSF, which is a predefined index parameter. The merge combines the b partitions at level Li of the SSF, denoted by I1^i, ..., Ib^i, into a new partition at level Li+1, denoted by Ij^(i+1), and then reclaims all partitions at level Li.

Figure 5. The Scalable and Sequential Flash structure.

The merge is directly processed in pipeline as a multi-way merge of all partitions at the same level. This is possible since the dictionaries of all the partitions are already sorted on terms, while the inverted lists in each partition are also sorted on document ids. So are the dictionary and the inverted lists of the resulting partition at the upper level. More precisely, the algorithm works in two steps. In the first step, the I.L part of the output partition is produced. Given b partitions in the index level Li, b+1 RAM pages are necessary to process the merge in linear pipeline: b pages to merge the inverted lists in I.L of all b partitions and one page to produce the output. The indexed terms are treated one after the other in alphabetic order. For each term t, the head of its inverted lists in each partition is loaded in RAM. These lists are then consumed in pipeline by a multi-way merge. Document ids are encountered in descending order in each list and the output list resulting from the merge is produced in the same order. A particular case must be distinguished when two pairs (d, f1d,t) and (d, f2d,t) are encountered in separate lists for the same d; this means that document d overlaps two partitions and these two pairs are aggregated into a single (d, f1d,t + f2d,t) before being added to I.L. In the second step, the metadata I.M is produced (see Figure 3), by setting the value of firstd (resp. lastd) with the firstd (resp. lastd) value of the first (resp. last) partition to be merged, and the I.S structure is constructed sequentially, with an additional scan of I.L. The I.S tree is built from the leaves to the root. This step requires one RAM page to scan I.L, plus one RAM page per I.S tree level. For each list encountered in I.L, a new entry (t, Ft, presence_flags) is appended to the lowest level of I.S; the value Ft is obtained by summing the fd,t fields of all (d, fd,t) pairs in this list; the presence flags reflect the presence in the list of the firstd or lastd document. Upper levels of I.S are then trivially filled sequentially. This Background Merging process generates only sequential writes in Flash and previous partitions are reclaimed in large blocks after the merge. This pipeline process sequentially scans each partition only once and produces the resulting partition also sequentially. Hence, assuming b+1 is strictly lower than RAM_Bound, one RAM buffer (of one page) can be allocated to read each partition and the merge is I/O optimal. If b is larger than RAM_Bound, the algorithm remains unchanged but its I/O cost increases since each partition will be read by page fragments rather than by full pages.
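A minimal sketch of the first merge step (producing the I.L part of the output partition) is given below, under the stated assumptions: dictionaries are sorted on terms, each inverted list is sorted on document id, and pairs for the same document coming from two merged partitions are aggregated. It is illustrative only (Python, in-memory lists instead of page-buffered Flash reads, ascending id order for brevity, and only a crude stand-in for the second step); names are ours.

```python
import heapq

def merge_partitions(partitions):
    """Multi-way merge of b partitions of the same SSF level into one larger partition.
    Each partition is {term: [(doc_id, f_dt), ...]} with lists sorted on doc_id.
    Returns the merged partition: its I.L part, plus a simplified dictionary."""
    merged = {}
    all_terms = sorted(set().union(*[p.keys() for p in partitions]))   # terms in alphabetic order
    for t in all_terms:
        lists = [p[t] for p in partitions if t in p]        # one "head" cursor per partition
        out = []
        for doc_id, f_dt in heapq.merge(*lists):            # streams consumed in pipeline
            if out and out[-1][0] == doc_id:
                # Same document present in two consecutive partitions (overlapping doc):
                # aggregate the two pairs into a single one.
                out[-1] = (doc_id, out[-1][1] + f_dt)
            else:
                out.append((doc_id, f_dt))
        merged[t] = out
    # Second step, sketched: rebuild a dictionary entry per list with one extra scan of I.L
    # (here simply the number of documents per term; boundary flags are omitted).
    dictionary = {t: len(lst) for t, lst in merged.items()}
    return {"S": dictionary, "L": merged}
```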
Search queries can be evaluated in linear pipeline by accessing the partitions one after the other, from partitions b to 1 in level 1 up to level n. Thus, the inverted lists are scanned in descending order of the document ids, from the most recently inserted document to the oldest one, and the query processing remains exactly the same as the one presented in Section 4, with the same RAM consumption. The merging and the querying processes could be organized in the opposite order (i.e., in ascending order of the document ids) with no impact. However, order matters as soon as deletions are considered (see Section 6). The SSF provides scalable query costs since the amount of indexed documents grows exponentially with the number of levels, while the number of partitions increases only linearly with the number of levels.

Note that merges in the upper levels are exponentially rare (one merge in level Li for b^i merges in L0) but also exponentially costly. To mitigate this problem, we perform the merge operations in background (i.e., in a non-blocking manner). Since the merge may consume up to b pages of RAM, we launch/resume it each time after a new partition is flushed in L0 of the SSF, the RAM being empty at this time. A small quantum of time (a few hundred milliseconds in practice) is allocated to the merging process. Each time this quantum expires, the merge is interrupted and its execution status (i.e., a cursor indicating the current Flash page position in each partition) is memorized. The quantum of time is chosen so that the merge of a given SSF level ends before the next merge of the same level needs to be triggered. In this way, the cost of a merge operation is spread among the flush operations and remains almost transparent. This basic strategy is simple and does not make any assumption regarding the index workload. However, it could be improved in certain contexts, by taking advantage of the idle time of the platform.

6. DOCUMENT DELETIONS
To the best of our knowledge, our proposal is the first embedded search index to implement document deletions. This problem is actually of primary importance because deletions are required in many practical scenarios. Unfortunately, index updating increases significantly the complexity of the index maintenance by reintroducing the need for random updates in the index structure. We extend here the index structure to support the deletion of documents without generating any random write in Flash.

6.1 Solution Outline
Implementing the delete operation is challenging, mainly because of the Flash memory constraints which proscribe the straightforward approach of updating the inverted index in place. The alternative to updating in place is compensation, i.e., the deleted documents' identifiers (DDIs) are stored in an appropriate way and used as a filter to eliminate the ghost documents retrieved by the query evaluation process.

A basic solution could be to organize the DDIs as a sorted list in Flash and to intersect this list at query execution time with the inverted lists in the SSF corresponding to the query terms. However, this solution raises several problems. First, the documents are deleted in random order, according to users' and applications' decisions. Hence, maintaining a sorted list of DDIs in Flash would violate the Write-Once Partitioning principle since the list has to be rewritten each time a set (e.g., a page) of new DDIs is flushed from RAM. Second, the computation of the Ft for each query term t during the first step of the query processing can no longer be achieved without an additional merge operation to subtract the sorted list of DDIs from the inverted lists of the SSF. Third, the full DDI list has to be scanned for each query regardless of the query selectivity. These two last elements make the query cost dependent on the total number of deleted documents and then conflict with the Linear Pipelining principle.

Therefore, instead of compensating the query evaluation process, we propose a solution based on compensating the indexing structure itself. In particular, a document deletion is treated similarly to a document insertion, i.e., by re-inserting the metadata (terms and frequencies) of all deleted documents in the SSF. The objective is threefold: (i) to be able to compute, as presented in Section 4, the Ft for each term t of a query based on the metadata only (of both existing and deleted documents), (ii) to have a query performance that depends on the query selectivity (i.e., the number of inserted and deleted documents relevant to the query) and not on the total number of deleted documents and (iii) to effectively purge the indexing structure from the largest part of the deleted documents at Background Merging time, while remaining compliant with the Linear Pipelining principle. We present in the following the required modifications of the index structure to integrate this form of compensation.

6.2 Impact on Write-Once Partitioning
As indicated above, a document deletion is treated similarly to a document insertion. Assuming a document d is deleted in the time window corresponding to a partition Ii, a pair (d, -fd,t) is inserted in each list Ii.Lt for the terms t present in d, and the Ft value associated to t is decremented by 1 to compensate the prior insertion of that document. To distinguish between an insertion and a deletion, the frequency value fd,t for the deleted document id is simply stored as a negative value, i.e., -fd,t.

6.3 Impact on Linear Pipelining
Executing a query Q over our compensated index structure sums up to evaluating:

Top_k\left(\sum_{t \in Q} W\left(f_{d,t}, \frac{N}{F_t}\right)\right), \quad d \in (D^+ - D^-)

where D+ (resp. D-) represents the set of inserted (resp. deleted) documents.

Computing N. As presented earlier, N is a global metadata maintained at update time and thus already integrates all insert and delete operations.

Computing Ft. The global Ft value for a query term t is computed as usual since the local Ft values are compensated at deletion time (see above). The case of deleted documents that overlap several consecutive partitions is treated in the same way as for the inserted documents.

Computing fd,t. The fd,t of a document d for a term t is computed as usual, with the salient difference that a document which has been deleted appears twice: with the value (d, fd,t) (resp. (d, -fd,t)) in the inverted lists of the partition Ii (resp. partition Ij) where it has been inserted (resp. deleted). By construction i < j since a document cannot be deleted before being inserted.

Computing Topk. Integrating deleted documents makes the computation of Topk more subtle. Following the Linear Pipelining principle, the tf-idf scores of all documents are computed one after the other, in descending order of the document ids, thanks to a linear pipeline merging of the insert lists associated to the queried terms. To this end, the algorithm introduced in Section 4 uses k RAM variables to maintain the current k best tf-idf scores and one buffer (i.e., a RAM page) per query term t to read the corresponding inverted lists. Some elements present in the inverted lists actually correspond to deleted documents and must be filtered out. The problem comes from the fact that documents are deleted in random order. Hence, while inverted lists are sorted with respect to the insertion order of documents, a pair of the form (d, -fd,t) may appear anywhere in the lists. In case a document d has been deleted, the unique guarantee is to encounter the pair (d, -fd,t) before the pair (d, fd,t) if the traversal of the lists follows a descending order of the document ids. However, maintaining in RAM the list of all encountered deleted documents in order to filter them out during the follow-up of the query processing would violate the Bounded RAM agreement.

The proposed solution works as follows. The tf-idf score of each document d is computed by considering the modulus of the frequency values |±fd,t| in the tf-idf score computation, regardless of whether d is a deleted document or not. Two lists are maintained in RAM: Topk = {(d, score(d))} contains the current k best tf-idf scores of documents which exist with certainty (no deletion has been encountered for these documents); Ghost = {(d, score(d))} contains the list of documents which have been deleted (a pair (d, -fd,t) has been encountered while scanning the inverted lists) and have a score better than the smallest score in Topk. The Topk and Ghost lists are managed as follows. If the score of the current document d is worse than the smallest score in Topk, it is simply discarded and the next document is considered (step 2 in Figure 6). Otherwise, two cases must be distinguished. If d is a deleted document (a pair (d, -fd,t) is encountered), then it enters the Ghost list (step 3); else it enters the Topk list unless its id is already present in the Ghost list (step 4). Note that this latter case may occur only if the id of d is smaller than the largest id in Ghost, making the search in Ghost useless in many cases. An important remark is that the Ghost list has to register only the deleted documents which may compete with the k best documents, to filter them out when these documents are later encountered, which makes this list very small in practice.
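The sketch below illustrates the Topk/Ghost mechanism under the stated assumptions: documents are visited in descending id order, deletions are recognized by negative frequencies, and only the k best scores plus the small Ghost list are kept in RAM. It is a simplified model (Python, pre-aggregated per-document scores instead of the list merge of Section 4); function and variable names are ours.

```python
def topk_with_deletions(scored_docs, k):
    """scored_docs: iterable of (doc_id, score, is_deletion) visited in descending doc_id
    order, where score is computed from |f_dt| and is_deletion is True when a (d, -f_dt)
    pair was encountered for this occurrence of the document.
    Returns the k best (doc_id, score) among documents that are not deleted."""
    topk = {}    # doc_id -> score: documents that exist with certainty (at most k entries)
    ghost = {}   # doc_id -> score: deleted documents that could compete with the top-k
    for doc_id, score, is_deletion in scored_docs:
        worst = min(topk.values()) if len(topk) == k else float("-inf")
        if len(topk) == k and score <= worst:
            continue                      # step 2: cannot compete with the current top-k
        if is_deletion:
            ghost[doc_id] = score         # step 3: remember the deletion, filter it out later
        elif doc_id not in ghost:
            topk[doc_id] = score          # step 4: enters the top-k (not a known ghost)
            if len(topk) > k:             # keep only the k best scores in RAM
                topk.pop(min(topk, key=topk.get))
        # step 5 (purge, not shown): ghost entries that can no longer compete may be dropped
    return sorted(topk.items(), key=lambda x: -x[1])

# Example: doc 9 was deleted (its -f pair is met before its +f pair in descending id order).
events = [(10, 2.0, False), (9, 3.1, True), (8, 1.5, False), (9, 3.1, False), (7, 2.4, False)]
print(topk_with_deletions(events, k=2))   # doc 9 is filtered out -> [(7, 2.4), (10, 2.0)]
```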
While simple in its principle, this algorithm deserves a deeper discussion in order to evaluate its real cost. This cost actually depends on whether the Ghost list can entirely reside in RAM or not. Let us compute the nominal size of this list in the case where the deletions are uniformly distributed over the document set. For illustration purpose, let us assume k=10 and the percentage of deleted documents δ=10%.

Among the first 11 documents encountered during the query processing, 10 will enter the Topk list and 1 is likely to enter the Ghost list. Among the next 11 documents, 1 is likely to be deleted but the probability that its score is in the 10 best scores is roughly 1/2. Among the next 11 ones, this probability falls to about 1/3 and so on and so forth. Hence, the nominal size of the Ghost list is

\delta \cdot k \cdot \sum_{i=1}^{n} \frac{1}{i}

which can be approximated by \delta \cdot k \cdot (\ln(n) + \gamma). For 10,000 queried documents, n is about 1000 and the size of the Ghost list is only \delta \cdot k \cdot (\ln(n) + \gamma) \approx 10 elements, far below the available RAM. In addition, the probability that the score of a Ghost list element competes with the Topk ones decreases over time, giving the opportunity to continuously purge the Ghost list (step 5 in Figure 6). In the very improbable case where the Ghost list overflows (step 6 in Figure 6), it is sorted in descending order of the document ids, and the entries corresponding to low document ids are flushed. This situation remains however highly improbable and will concern rather unusual queries (none of the 300 queries we evaluated in our experiment produced this situation, while allocating a single RAM page for the Ghost list).

Figure 6. Linear pipeline computation of Q with deletions.

6.4 Impact on Background Pipeline Merging
The main purpose of the Background Merging principle, as presented in Section 5, is to keep the query processing scalable with the indexed collection size. The introduction of deletions has actually a marginal impact on the merge operation, which continues to be efficiently processed in linear pipeline as before. Moreover, given the way the deletions are processed in our structure, i.e., by storing couples (d, -fd,t) for the deleted documents, the merge acquires a second function which is to absorb the part of the deletions that concern the documents present in the partitions that are merged. Indeed, let us come back to the Background Merging process described in Section 5. The main difference when deletes are considered is the following. When inverted lists are merged during step 1 of the algorithm, a new particular case may occur, that is when two pairs (d, fd,t) and (d, -fd,t) are encountered in separate lists for the same d; this means that document d has actually been deleted; d is then purged (the document deletion is absorbed) and will not appear in the output partition. Hence, the more frequent the Background Merging, the smaller the number of deleted entries in the index.

Taking into account the supplementary function of the merge, i.e., to absorb the data deletions, we can adjust the absorption rate of deletions by tuning the branching factor of the last index level, since most of the data is stored in this index level. By setting a smaller value for the branching factor b' of the last level, the merge frequency in this level increases and consequently the absorption rate also increases. Therefore, in our implementation we use a smaller value for the branching factor of the last index level (i.e., b'=3 for the last level and b=10 for the other levels). Typically, about half of the total number of deletions will be absorbed for b'=3 if we consider that the deletions are uniformly distributed over the data insertions.

7. EXPERIMENTAL EVALUATION

7.1 Experimental Setup
Hardware platform. All the experiments have been conducted on a development board ST3221G-EVAL (see www.st.com/web/en/catalog/tools/PF251702) equipped with the MCU STM32-F217IG (see www.st.com/web/catalog/mmc/FM141/SC1169/SS1575/LN9/PF250172) connected to a micro-SD card slot. This hardware configuration is representative of typical smart objects [4, 17, 19]. The board runs the embedded operating system FreeRTOS 7.0.1 (see freertos.svn.sourceforge.net/viewvc/freertos/tags/V7.1.0/). The search engine code is stored on the internal NOR Flash memory of the MCU, while the inverted index is stored on a micro-SD NAND Flash card. We used for data storage two commercial micro-SD cards (i.e., Kingston MicroSDHC Class 10 4GB and Silicon Power SDHC Class 10 4GB) exhibiting different performance (see lines 1 and 4 of Table 1). The MCU has 128KB of available RAM. However, the search engine only uses a maximum amount of 5KB of RAM, to validate our design whatever the available RAM of existing smart objects and the RAM consumption of the OS and the communication drivers.

Table 2. Desktop and synthetic datasets and query sets

                               | Desktop       | Synthetic
Number of documents            | 27000         | 100000
Total Raw Text                 | 63 MB         | 129 MB
Total Unique Words             | 337952        | 10000
Total Word Occurrences         | 35624875      | 10000000
Average Occurrences per Word   | 26            | 988
Frequent Words                 | 20752         | 1968
Infrequent Words               | 317210        | 8032
Frequent Word Occurrences      | 6.14%         | 19.68%
Infrequent Word Occurrences    | 93.85%        | 80.32%
KB per document (avg, max)     | 8, 647        | 1.3, 1.3
Words per document (avg, max)  | 1304, 105162  | 100, 100
Total number of queries        | 837           | 1000
Queries with 1, 2 & 3 terms    | 85, 255, 272  | 200, 200, 200
Queries with 4 & 5 terms       | 172, 82       | 200, 200

Datasets and queries. Selecting a representative data and query set to evaluate our solution is challenging considering the diversity and quick evolution of smart object usages, which explains the absence of recognized benchmarks in this area. We then validate our proposal using two use-cases where an embedded keyword-based search engine is called to play a central role and which exhibit different requirements in terms of document indexing, with the objective to assess the versatility of the solution.

Table 3. Statistics of the flush and merge operations

                                      | Flush [RAM->L0] | Merge [L0->L1] | Merge [L1->L2] | Merge [L2->L3] | Merge [L3->L4] | Merge [L4->L5]
Number of Read IOs                    | 1 (1)*          | 90 (92)        | 503 (617)      | 2027 (2570)    | 11010 (15211)  | 50997 (73026)
Number of Write IOs                   | 9 (9)           | 71 (100)       | 339 (548)      | 1485 (2085)    | 9409 (14027)   | 47270 (66335)
Exec. time on Kingston (sec.)         | 0.008 (0.0084)  | 0.58 (0.77)    | 2.9 (4.44)     | 13.2 (19.1)    | 84.6 (124.4)   | 436 (615)
Exec. time on Silicon Power (sec.)    | 0.004 (0.0045)  | 0.38 (0.48)    | 1.94 (2.84)    | 8.67 (11.7)    | 54.5 (79.3)    | 278 (393)
Total number of occurrences           | 73277           | 9160           | 1145           | 143            | 18             | 2
Inserted docs between flushes/merges  | 0.42 (16)       | 3 (42)         | 24 (232)       | 189 (1193)     | 1453 (6496)    | 8906 (10547)
* The numbers given in brackets are maximum values; the other values are average values.

The first use-case relates to the Personal Cloud, where a secure smart object embedding a Personal Data Server [3, 4, 18] is used to securely store, query and share personal user's files (documents, photos, emails). These files can be stored encrypted on a regular server (e.g., in the Cloud or in a PC) but the metadata describing them (keywords extracted from the file content, date, type, tags set by the user herself, etc.) are indexed and stored by the embedded Personal Data Server acting as a Google Desktop or Spotlight for the user's dataspace [12]. Indeed, the metadata are as sensitive as the files themselves and must be managed in the secure smart object (e.g., a mass storage smartcard connected to a PC or an internet gateway) to prevent any privacy breach. This use-case is representative of situations where the indexed documents have a rich content (tens to hundreds of thousands of terms) and document updates and deletes can be performed randomly. To capture the behavior of our solution in such a context, we use the pseudo-desktop data collection and query set provided in [11], which is considered as representative of a personal desktop where searches, updates and deletes are performed.

The second use-case is in the smart sensor context. Sensors can be smart meters deployed at home to enable a new generation of energy services, home gateways capturing a variety of events issued by a growing number of smart appliances, or car trackers registering our locations and driving habits to compute insurance fees and carbon tax [3]. In this case, documents are time windows and terms are events occurring during this time window. Top-k queries are useful for analytic tasks and executing them at the sensor side helps reducing evaluation time, energy consumption linked to data transmission costs and the risk of private information leakage. We consider that in this use-case the indexed documents have a poorer content (hundreds to thousands of terms/event types per time window). We are not aware of publicly available datasets for this context and then generate a synthetic one.

The statistics of the desktop dataset and query set are given in Table 2. This dataset contains five representative types of personal files (i.e., email, html, pdf, doc and ppt). Desktop search is an important topic in the IR community, but real personal collections of desktop files cannot be published for evident privacy issues. Instead, the authors in [11] propose a method to generate pseudo desktop collections and show that such collections have the same properties as real collections. As recommended in [11], we preprocess the files in this collection by removing the stop words and stemming the remaining terms using the Krovetz stemmer. The obtained number of terms in the vocabulary is large, i.e., 337952. We also use a set of 837 queries prepared for this dataset and provided in [11]. In contrast, the synthetic dataset has a much smaller vocabulary, i.e., 10000 terms, and an average of 100 terms per document, chosen using a zipfian distribution with a skew of 0.7. For this dataset, we generated a query set of 1000 random queries with up to five terms per query. In the experiments, we analyze the cost of insertions and queries separately, one operation at a time. Indeed, smart objects, contrary to central servers, rarely support parallel or multi-task processing. Moreover, the RAM consumption increases linearly with the number of operations executed in parallel, a serious constraint in our context. Due to space limitation, we focus in this paper on the results obtained with the desktop dataset and provide results issued from the synthetic dataset only when the differences are significant. The reader interested in the full set of measures can refer to a technical report [5]. This report also gives the pseudo-code of the measured algorithms and discusses results obtained using a third, larger dataset (the ENRON email dataset available at https://www.cs.cmu.edu/~enron/) which actually gives similar results as those presented here.

7.2 Index Maintenance
According to the algorithms presented earlier, the insertions and deletions of documents produce a sequence of index partitions which are subsequently merged in the SSF. Given the RAM_Bound of 5KB, we set the branching factor b (intermediate levels in the SSF) to 10 to decrease the merge frequency, and the branching factor b' (last level in the SSF) to 3, to absorb the document deletions faster. The insertion or deletion of a single document is very efficient, since the document metadata is preliminarily inserted in RAM. Also, given the small size of the RAM_Bound, flushing the RAM content into the level L0 of the SSF is fast; it takes on average 6ms to write a partition in L0 in all our experiments (see Table 3).

Table 3 also shows the SSF merge cost, which is periodically triggered (i.e., each time the number of flushed partitions in Li reaches the branching factor b). The table presents the number of IOs for the flush and merge operations performed in the different SSF levels, and their execution times for the two tested SD cards, while inserting the 27K documents of our dataset and randomly deleting 10% of them. In our experiments, the deletions are uniformly distributed over the inserted documents and uniformly interleaved with the insertions. All these operations lead to an SSF with six levels. As expected, the merge time grows exponentially from L0 to L5, since the size of the partitions also increases by (nearly) a factor of b. It requires a few seconds to merge the partitions in the levels L0 to L3 and up to a few minutes in L4 to L5. The merge time is basically linear with the size of the merged partitions in the number of reads and writes. The merge time can vary especially in the first three levels of the SSF, depending on the term distribution in the indexed documents.

The merge time can vary, especially in the first three levels of the SSF, depending on the term distribution in the indexed documents. However, the partitions in L3 contain most of the term dictionary, and the variation of the merge time in the upper levels is less significant.

Table 3 indicates that the most costly merges are also the least frequent. Only 20 merges costing more than 15 seconds are triggered while inserting the complete set of documents and deleting 10% of them. However, blocking the index for a long duration may be problematic for some applications, which justifies the non-blocking merge implementation presented in Section 5. Table 4 compares the maximum and average insertion/deletion times in the index for the blocking and non-blocking merge implementations. The time is measured as the RAM flush time plus the merge time, if a merge is triggered (for the blocking merge) or is currently in progress (for the non-blocking merge). We observe that a blocking merge in L5 can take up to 393 seconds with Silicon Power, whereas this cost is spread among the next 13844 insert/delete operations (each time the RAM is flushed) in the non-blocking case. This leads to an extremely large gap between the maximum and average insertion times in the blocking case. The insertion of the synthetic dataset also leads to an SSF with six levels [5]. The average insertion time is similar to the desktop dataset, e.g., with a non-blocking merge, the average cost is 0.29s for Kingston and 0.20s for Silicon Power.

Table 4. Blocking vs. non-blocking merge performance (sec.)
Strategy (Kingston / Silicon P.) | Max. | Avg.
Blocking merge | 615 / 393 | 0.16 / 0.10
Non-blocking merge | 0.23 / 0.15 | 0.21 / 0.13
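The non-blocking strategy can be pictured as follows. The sketch below is our own illustration of the principle rather than the implementation of Section 5: once a level is full, the pending merge is not executed at once but advanced by a bounded amount of work at every subsequent RAM flush, so that each insertion or deletion pays a small, predictable surcharge instead of occasionally waiting for minutes (this is what makes the maximum insertion time drop from 393s to 0.15s in Table 4).

    class NonBlockingMerge:
        """Spread a pending merge over the next RAM flushes (illustrative only)."""

        def __init__(self, partitions, work_per_flush=1000):
            # Inverted lists still to be copied into the new partition of the next level.
            self.pending = [(term, ids) for p in partitions for term, ids in p.items()]
            self.output = {}
            self.work_per_flush = work_per_flush    # lists copied at each RAM flush

        def step(self):
            """Called once per RAM flush; returns True when the merge is complete."""
            for term, ids in self.pending[:self.work_per_flush]:
                self.output.setdefault(term, []).extend(ids)
            self.pending = self.pending[self.work_per_flush:]
            return not self.pending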
7.3 Index Search Performance
We evaluated the search performance of our index on our test board with the two SD cards, with both the blocking and non-blocking merge implementations. Due to the similarity of the results, we present hereafter only the results obtained with the Silicon Power card. Figure 7 shows the average query time for the 837 search queries of the query set as a function of the number of documents. The curves present the query cost before ("max" curve) and after ("min" curve) each merge occurring in the higher index levels, i.e., from level 3 to level 5. We can observe that, locally, the query cost increases linearly with the number of partitions in the lower levels, and then decreases significantly after every merge operation. The large variations in the query cost correspond to the creation of a new partition in the fifth level of the SSF, while the intermediary peaks correspond to the creation of a partition in level 4 of the SSF (see the arrows in Figure 7). Globally, the query time increases linearly with the number of indexed documents, but with a low factor. For example, after inserting 27K documents and deleting 10% of them, the average query execution time is only 0.18s (maximum of 0.35s) with the non-blocking merge implementation. The query times are lower with a blocking merge, i.e., an average execution time of 0.14s and a maximum of 0.28s. Indeed, in the non-blocking merge implementation, the number of lower-level partitions can temporarily exceed b, so that more partitions have to be visited. In our setting, the increase is on average about 25% and represents approximately 0.04s. This appears to be a fair trade-off for applications that cannot accept unpredictable or unbounded index update latencies.

Figure 7. Query times with the non-blocking merge strategy

With the synthetic dataset [5], our index exhibits an average execution time of only 0.21s and a maximum of 0.42s. The slightly higher query times compared to the desktop dataset are explained by the larger index size (see Table 6) and by the smaller vocabulary (which reduces the average query selectivity).

7.4 Impact of the Deletion Rate
Table 5 shows the average query performance for different deletion rates with the Silicon Power storage (we obtained similar results with the Kingston storage). We considered two cases. First, we inserted the whole dataset while deleting a number of documents corresponding to the deletion rate (first line in Table 5). In this case, the higher the deletion rate, the lower the query time, since a good part of the deleted documents (approximately 50%) are purged from the index, which decreases the query processing time. In the second case, the total number of active documents in the index is the same (i.e., 13500) regardless of the deletion rate (second line in Table 5). Hence, the higher the deletion rate, the more documents we insert to compensate for the deletions. In this case, a higher deletion rate leads to larger query times, since part of the deletions are still present in the index and have to be processed by the queries. However, the increase of the query times remains relatively small compared to the case with no deletions, i.e., less than 16% for deletion rates up to 50%. Globally, the index is robust to a high number of deletions in both cases.

Table 5. Avg. query time (in sec.) with deletion (Silicon P.)
Deletion rate | 0% | 10% | 30% | 50%
Avg. query time (27k docs) | 0.18 | 0.17 | 0.16 | 0.16
Avg. query time (13k docs) | 0.12 | 0.13 | 0.13 | 0.14

Table 6. Index (I.L/I.S) size (MB) varying the deletion rate
Deletion rate | 0% | 10% | 30% | 50%
Desktop (SSF) | 81/1.24 | 76/1.14 | 60/0.82 | 55/0.73
Desktop (classic) | 81/0.4 | 73/0.4 | 57/0.4 | 40/0.4
Synthetic (SSF) | 78/0.97 | 74/0.88 | 66/0.78 | 58/0.69
Synthetic (classic) | 78/0.13 | 70/0.13 | 55/0.13 | 40/0.13

Table 6 shows the index size for the desktop and synthetic datasets after the insertion of all the documents in the collection and the uniform deletion of a certain percentage of the indexed documents. In each table cell, the first number indicates the cumulated size in MB of all the I.L parts of the SSF (i.e., the global size of the inverted lists), while the second number gives the cumulated size of all the I.S parts of the SSF (i.e., the global size of the search structures). Also, Table 6 gives, for each dataset and deletion rate, the index size of the classical inverted index used as reference (i.e., a single standard B+-Tree built with no consideration for the Flash cost). Without deletions, the SSF index size is comparable to the classical inverted index size. Indeed, the size of the I.L part (inverted lists) is similar in both indexes, and the overhead introduced by SSF for the I.S part (each partition having its own search structure) is negligible, since the search structure represents less than 1% of the global index size. As the deletion rate augments, the difference between SSF and the classical index increases. The explanation is that the deleted documents are reinserted in SSF, which temporarily increases the index size until a merge is triggered and absorbs part of the deleted documents. Typically, we observed that about 45% to 55% of the deletions are not absorbed after a high number of document insertions and deletions. This makes the SSF index size at most 40% larger than the inverted index size.
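One way to picture why only part of the deletions are absorbed is sketched below. The layout details are assumptions of ours (the only behavior taken from the SSF design is that deleted documents are reinserted and purged during merges): a deletion is flushed like a regular insertion but flagged as a tombstone, and a merge drops a posting only when its matching tombstone ends up in the same set of merged partitions; the remaining tombstones keep occupying space until a later merge reaches them.

    def merge_with_absorption(partitions):
        """Merge partitions while absorbing matching posting/tombstone pairs."""
        postings, tombstones = {}, set()
        for p in partitions:
            for term, entries in p.items():
                for doc_id, payload in entries:
                    if payload == "DEL":                 # tombstone written at deletion time
                        tombstones.add((term, doc_id))
                    else:
                        postings.setdefault(term, []).append((doc_id, payload))
        merged = {}
        for term, entries in postings.items():
            kept = [(d, pl) for d, pl in entries if (term, d) not in tombstones]
            if kept:
                merged[term] = sorted(kept)
        return merged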

7.5 Comparison with the State-of-the-Art
This section compares our search engine with state-of-the-art indexing methods. We choose the classical inverted index (see Figure 2) to represent the query-optimized index family (although it has not been designed with embedded constraints in mind) and Microsearch [17] to represent the insert-optimized index family. Note that the other embedded search engines presented in Section 2.3 rely on index structures similar to that of Microsearch. We used the same test conditions as above, i.e., a RAM_Bound equal to 5KB. The insertions in the classical inverted index are first buffered in RAM until the RAM_Bound is reached, then applied in batch to the index structure in Flash, generating random writes at this time. To be able to evaluate the queries under the RAM constraint, the inverted lists have to be maintained sorted on document ids, which permits applying a linear pipeline query processing similar to that of the SSF. In the case of Microsearch, we used a hash function with 8 buckets, since this value leads to the most balanced query-insert performance given the 5KB of RAM. Besides, we only consider data insertions and queries in the tests below, since Microsearch does not support deletions.
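As an illustration of the linear pipeline query evaluation, the sketch below (simplified, with names of our own; the actual ranking function and top-k algorithm may differ) merges the document-id-sorted inverted lists of each query term across all partitions in a single pass, derives the global Ft of a term from the cumulated lengths of its lists, and accumulates classical tf-idf scores. Since Ft is known before the postings are scanned, one pass is enough, whereas Microsearch needs a first pass to compute Ft and a second one to score the documents.

    import heapq
    import math

    def query(partitions, terms, num_docs, k=10):
        """Top-k tf-idf search over a list of partitions.

        Each partition maps a term to a list of (doc_id, term_frequency) pairs
        sorted on doc_id, as required by the linear pipeline.
        """
        scores = {}
        for t in terms:
            lists = [p[t] for p in partitions if t in p]
            ft = sum(len(lst) for lst in lists)          # global doc frequency of t
            if ft == 0:
                continue
            idf = math.log(num_docs / ft)
            # Linear pipeline: consume the doc-id-sorted lists in one merge pass.
            for doc_id, tf in heapq.merge(*lists):
                scores[doc_id] = scores.get(doc_id, 0.0) + tf * idf
        return heapq.nlargest(k, scores.items(), key=lambda item: item[1])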
Fig. 8. Insert time comparison (Silicon P.)
Fig. 9. Query time comparison (Silicon P.)
Fig. 10. Overall performance comparison

Insertion performance. Figure 8 shows the average insertion time for the three methods (i.e., SSF, Microsearch and the classical Inverted Index) while inserting the 27,000 documents of the desktop dataset. Microsearch and SSF have similar insert performance. On average, a document insertion takes about 0.08s in Microsearch and 0.33s in SSF (with Silicon Power). The insertion time in the classical Inverted Index is at least two orders of magnitude higher (30s with Silicon Power) because of the costly random writes in Flash memory, clearly dismissing this method in the context of smart objects. For the synthetic dataset, the average insertion times are 0.02s, 0.07s and 15s for Microsearch, SSF and the inverted index respectively (with Silicon Power). The higher insertion cost of SSF compared to Microsearch is generated by the SSF merges, but this gap seems acceptable for most applications and is outweighed by the query performance and scalability benefits of SSF (see next).

Query performance. Figure 9 shows the query execution time for the three methods as a function of the number of indexed documents. Unsurprisingly, the Inverted Index reaches the best query performance. SSF appears as a good challenger (0.18s on average to process a query, compared to 0.07s with the Inverted Index, with Silicon Power), this difference being explained by the fragmentation of the SSF index structure. Microsearch exhibits the worst query performance and definitely cannot scale to a large number of documents. On average, Microsearch takes 880 seconds (nearly 15 minutes) to process a query. Even for a low number of documents, SSF outperforms Microsearch. The first reason is that Microsearch merges many terms in the same inverted lists, so that a large part of the index has to be scanned. Second, Microsearch requires two passes over the inverted lists, one to compute the global Ft value of the term and a second one to compute the tf-idf score of the documents containing the term. For the synthetic dataset, the average query times are 0.05s, 0.2s and 355s for the inverted index, SSF and Microsearch respectively (with Silicon Power).

Overall performance. Figure 10 shows the speedup of SSF (i.e., the ratio between the throughput of SSF and that of its competitors) with Silicon Power for workloads containing insertions and queries (with the pseudo-desktop collection) in different ratios. We obtained similar results with the other dataset and storage card. In most cases, SSF has a (much) better throughput with both insert- and query-oriented workloads, while being the sole versatile method. Practically, SSF will be the preferred index method unless the expected workload contains, in an overwhelming proportion, either insertions or queries.
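The speedup of Figure 10 can be approximated from the per-operation costs reported above. The small computation below is only an illustration, assuming that the throughput of a method is the inverse of its average time per operation for a given insert/query mix (the exact workload definition behind Figure 10 may differ).

    # Average per-operation times (sec.) on Silicon Power, from Figures 8 and 9.
    ssf      = {"insert": 0.33, "query": 0.18}
    inverted = {"insert": 30.0, "query": 0.07}

    def throughput(times, insert_ratio):
        """Operations per second for a workload with the given fraction of inserts."""
        avg = insert_ratio * times["insert"] + (1 - insert_ratio) * times["query"]
        return 1.0 / avg

    for ratio in (0.1, 0.5, 0.9):
        speedup = throughput(ssf, ratio) / throughput(inverted, ratio)
        print(f"{int(ratio * 100)}% inserts: SSF vs. inverted index speedup = {speedup:.1f}x")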

Other concerns. The pros and cons of the SSF approach compared to its competitors originate from the specific way the index grows (by partitioning and merging) and the way deletes are managed (by reinserting deleted documents and absorbing deletes during merges). While performance has been extensively studied and compared above, the specificities of SSF may impact other aspects of an indexing structure that are more difficult to weigh. First, partitioning introduces some variability in the query cost (see the stairway-like curves in Figures 8 and 9), which could be prejudicial for real-time applications and could perturb a query optimizer. Note however that the occurrences of the stairs are fully predictable, although this complexifies the optimization process. Second, the way deletes are processed leads to an increase of the index size and, consequently, of the query cost. Nevertheless, this negative effect is limited by the merge operations, which permit purging some of the deleted documents (see Table 6). Moreover, we do not see how to manage deletions differently without violating any of our design principles. Finally, we do not consider the problem of concurrent accesses in SSF, i.e., multiple processes that query/update the index at the same time. While this does not seem to be a primary concern today, the problem may deserve deeper interest as smart objects become more powerful.

8. ACKNOWLEDGMENTS
This work is partially funded by project ANR-11-INSE-0005 KISS.

9. CONCLUSION
This paper presents the design of an embedded search engine for smart objects equipped with extremely low RAM and large Flash storage capacity. This work contributes to the current trend to endow smart objects with more and more powerful data management techniques. Our proposal is founded on three design principles, which are combined to produce an embedded search engine reconciling a high insert/delete rate with query scalability for very large datasets. By satisfying a RAM_Bound agreement, our search engine can accommodate a wide population of smart objects, including those having only a few KBs of RAM. Satisfying this agreement is also a means to fulfill co-design perspectives, i.e., calibrating a new hardware platform with the hardware resources strictly necessary to meet a given performance requirement. The proposed search engine has been implemented on a hardware platform representative of smart objects, and the experimental evaluation validates its efficiency and scalability. We feel that our three design principles may have a wider applicability and could pave the way to the definition of other embedded indexing techniques. It is part of our future work to validate this assumption.

10. REFERENCES
[1] Aggarwal, C. C., Ashish, N. and Sheth, A. The internet of things: A survey from the data-centric perspective. Managing and Mining Sensor Data, Springer, 2013.
[2] Agrawal, D., Ganesan, D., Sitaraman, R., Diao, Y. and Singh, S. Lazy-adaptive tree: An optimized index structure for flash devices. PVLDB 2, 1 (2009), 361-372.
[3] Anciaux, N., Bonnet, P., Bouganim, L., Nguyen, B., Sandu Popa, I. and Pucheral, P. Trusted cells: A sea change for personal data services. In Conference on Innovative Data Systems Research (CIDR'13), 2013.
[4] Anciaux, N., Bouganim, L., Pucheral, P., Guo, Y., Folgoc, L. L. and Yin, S. MILo-DB: a personal, secure and portable database machine. Distributed and Parallel Databases, 32, 1 (2014), 37-63.
[5] Anciaux, N., Lallali, S., Sandu Popa, I. and Pucheral, P. A Scalable Search Engine for Mass Storage Smart Objects. PRISM Technical Report. http://www.prism.uvsq.fr/~isap/files/RT.pdf
[6] Bjorling, M., Bonnet, P., Bouganim, L. and Jonsson, B. T. uFLIP: Understanding the energy consumption of flash devices. IEEE Data Eng. Bull., 33, 4 (2010), 48-54.
[7] Debnath, B., Sengupta, S. and Li, J. SkimpyStash: RAM space skimpy key-value store on flash-based storage. In ACM SIGMOD, 2011, 25-36.
[8] Diao, Y., Ganesan, D., Mathur, G. and Shenoy, P. J. Rethinking data management for storage-centric sensor networks. In Conference on Innovative Data Systems Research (CIDR'07), 2007.
[9] Huang, Y.-M. and Lai, Y.-X. Distributed energy management system within residential sensor-based heterogeneous network structure. In Wireless Sensor Networks and Ecological Monitoring, 3 (2013), 35-60.
[10] Jermaine, C., Omiecinski, E. and Yee, W. G. The partitioned exponential file for database storage management. The VLDB Journal, 16, 4 (2007), 417-437.
[11] Kim, J. Y. and Croft, W. B. Retrieval experiments using pseudo-desktop collections. In ACM CIKM, 2009, 1297-1306.
[12] Lallali, S., Anciaux, N., Sandu Popa, I. and Pucheral, P. A Secure Search Engine for the Personal Cloud. In ACM SIGMOD, 2015. http://dx.doi.org/10.1145/2723372.2735376
[13] Li, Y., He, B., Yang, R. J., Luo, Q. and Yi, K. Tree indexing on solid state drives. PVLDB 3, 1-2 (2010), 1195-1206.
[14] O'Neil, P. E., Cheng, E., Gawlick, D. and O'Neil, E. J. The Log-Structured Merge-Tree (LSM-Tree). Acta Informatica, 33, 4 (1996).
[15] Sears, R. and Ramakrishnan, R. bLSM: a general purpose log structured merge tree. In ACM SIGMOD, 2012, 217-228.
[16] Tan, C. C., Sheng, B., Wang, H. and Li, Q. Microsearch: When search engines meet small devices. In Pervasive Computing, 2008, 93-110.
[17] Tan, C. C., Sheng, B., Wang, H. and Li, Q. Microsearch: A search engine for embedded devices used in pervasive computing. ACM Transactions on Embedded Computing Systems, 9, 4 (2010).
[18] To, Q.-C., Nguyen, B. and Pucheral, P. Privacy-preserving query execution using a decentralized architecture and tamper resistant hardware. In EDBT, 2014, 487-498.
[19] Tsiftes, N. and Dunkels, A. A database in every sensor. In ACM Embedded Networked Sensor Systems (SenSys'11), 2011, 316-332.
[20] Wang, H., Tan, C. C. and Li, Q. Snoogle: A search engine for pervasive environments. IEEE Transactions on Parallel and Distributed Systems, 21, 8 (2010), 1188-1202.
[21] Wu, C.-H., Kuo, T.-W. and Chang, L.-P. An efficient B-tree layer implementation for flash-memory storage systems. ACM Transactions on Embedded Computing Systems, 6, 3 (2007).
[22] Yan, T., Ganesan, D. and Manmatha, R. Distributed image search in camera sensor networks. In ACM Embedded Networked Sensor Systems (SenSys'08), 2008, 155-168.
[23] Yap, K.-K., Srinivasan, V. and Motani, M. MAX: Wide area human-centric search of the physical world. ACM Transactions on Sensor Networks, 4, 4 (2008), 1-34.
[24] Zobel, J. and Moffat, A. Inverted files for text search engines. ACM Computing Surveys, 38, 2 (2006).
