University of Nebraska - Lincoln DigitalCommons@University of Nebraska - Lincoln

Department of Computer Science and Engineering: CSE Conference and Workshop Papers

11-2009

SmartStore: A New Metadata Organization Paradigm with Semantic-Awareness for Next-Generation File Systems

Yu Hua Huazhong University of Science & Technology, Wuhan, China, [email protected]

Hong Jiang University of Nebraska-Lincoln, [email protected]

Yifeng Zhu University of Maine, Orono, [email protected]

Dan Feng Huazhong University of Science & Technology, Wuhan, China, [email protected]

Lei Tian University of Nebraska-Lincoln, [email protected]



Hua, Yu; Jiang, Hong; Zhu, Yifeng; Feng, Dan; and Tian, Lei, "SmartStore: A New Metadata Organization Paradigm with Semantic-Awareness for Next-Generation File Systems" (2009). CSE Conference and Workshop Papers. 156. https://digitalcommons.unl.edu/cseconfwork/156

Published in SC '09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, Portland, Oregon, USA, November 14–20, 2009. Copyright © 2009 ACM. Used by permission.

SmartStore: A New Metadata Organization Paradigm with Semantic-Awareness for Next-Generation File Systems Yu Hua,1 Hong Jiang,2 Yifeng Zhu,3 Dan Feng,1 and Lei Tian 1,2

1. Huazhong University of Science & Technology, Wuhan, China 2. University of Nebraska–Lincoln, Lincoln, NE, USA 3. University of Maine, Orono, ME, USA Corresponding author — D. Feng

Email: [email protected], [email protected], [email protected]; [email protected], [email protected]; [email protected]

Abstract

Existing storage systems using a hierarchical directory tree do not meet scalability and functionality requirements for exponentially growing datasets and increasingly complex queries in Exabyte-level systems with billions of files. This paper proposes a semantic-aware organization, called SmartStore, which exploits metadata of files to judiciously aggregate correlated files into semantic-aware groups by using information retrieval tools. The decentralized design improves system scalability and reduces query latency for complex queries (range and top-k queries), and is conducive to constructing semantic-aware caching as well as conventional filename-based query. SmartStore limits the search scope of a complex query to a single or a minimal number of semantically related groups and avoids or alleviates brute-force search in the entire system. Extensive experiments using real-world traces show that SmartStore improves system scalability and reduces query latency over basic database approaches by one thousand times. To the best of our knowledge, this is the first study implementing complex queries in large-scale file systems.

1. Introduction

Fast and flexible metadata retrieving is a critical requirement in next-generation data storage systems serving high-end computing [1]. As the storage capacity is approaching Exabytes and the number of files stored is reaching billions, directory-tree based metadata management widely deployed in conventional file systems [2, 3] can no longer meet the requirements of scalability and functionality. For the next-generation large-scale storage systems, new metadata organization schemes are desired to meet two critical goals: (1) to serve a large number of concurrent accesses with low latency and (2) to provide flexible I/O interfaces to allow users to perform advanced metadata queries, such as range and top-k queries, to further decrease query latency.

In the next-generation file systems, metadata accesses will very likely become a severe performance bottleneck as metadata-based transactions not only account for over 50% of all file system operations [4, 5] but also result in billions of pieces of metadata in directories. Given the sheer scale and complexity of the data and metadata in such systems, we must seriously ponder a few critical research problems [6, 7] such as "How to efficiently extract useful knowledge from an ocean of data?", "How to manage the enormous number of files that have multi-dimensional or increasingly higher dimensional attributes?", and "How to effectively and expeditiously extract small but relevant subsets from large datasets to construct accurate and efficient data caches to facilitate high-end and complex applications?". We approach the above problems by first postulating the following.

• First, while a high-end or next-generation storage system can provide a Petabyte-scale or even Exabyte-scale storage capacity containing an ocean of data, what the users really want for their applications is some knowledge about the data's behavioral and structural properties. Thus, we need to deploy and organize these files according to semantic correlations of file metadata in a way that would easily expose such properties.

• Second, in real-world applications, cache-based structures have proven to be very useful in dealing with indexing among massive amounts of data. However, traditional temporal or spatial (or both) locality-aware methods alone will not be effective to construct and maintain caches in large-scale systems to contain the working datasets of complex data-intensive applications. It is thus our belief that semantic-aware caching, which leverages metadata semantic correlation and combines pre-processing and prefetching based on range queries (that identify files whose attribute values are within given ranges) and top-k Nearest Neighbor (NN) queries¹ (that locate k files whose attributes are closest to given values), will be sufficiently effective in reducing the working sets and increasing cache hit rates.

Although state-of-the-art research, such as Spyglass [8], reveals that around 33% of searches can be localized into a subspace by exploiting the namespace property (e.g., home or project directory), it clearly indicates that a larger portion of queries must still be answered by potentially searching the entire file system in some way. The lack of effectiveness of exploiting spatial and temporal localities alone in metadata queries lies in the fact that such localities, while generally effective in representing some static properties (e.g., directory and namespace) and access patterns of files, fail to capture the higher dimensions of localities and correlations that are essential for complex queries.

1. Given a clear context in the paper, we will simply use top-k queries in place of top-k NN queries.

For example, after installing or updating software, a system administrator may hope to track and find the changed files, which exist in both system and user directories, to ward off malicious operations. In this case, simple temporal (e.g., access history) or spatial locality (e.g., directory or namespace) alone may not efficiently help identify all affected files, because such requests for a complex query (range or top-k query) in turn need to check multi-dimensional attributes.

In a small-scale storage system, conventional directory-tree based design and I/O interfaces may support these complex queries through exhaustive or brute-force searches. However, in an Exabyte-scale storage system, complex queries need to be judiciously supported in a scalable way since exhaustive searches can result in prohibitively high overheads. Bigtable [9] uses a static three-level B+-tree-like hierarchy to store tablet location information, but is unable to carry out and optimize complex queries as it relies on user selection and does not consider multiple replicas of the same data. Furthermore, the inherent performance bottleneck imposed by the directory-tree structure in conventional file system design can become unacceptably severe in an Exabyte-scale system. Thus, we propose to leverage the semantic correlation of file metadata, which exploits higher dimensional static and dynamic attributes, or higher-dimensional localities, than the simple temporal or spatial locality utilized in existing approaches.

Semantic correlation [10] comes from the exploitation of high-dimensional attributes of metadata. To put things in perspective, a linear brute-force approach uses 0-dimensional correlation while spatial/temporal locality approaches, such as Nexus [11] and Spyglass [8], use 1-dimensional correlation, which can be considered as special cases of our proposed approach that considers higher dimensional correlation. The main benefit of using semantic correlation is the ability to significantly narrow the search space and improve system performance.

1.1 Semantic Correlation

Semantic correlation extends conventional temporal and spatial locality and can be defined within a multi-dimensional attribute space as a quantitative measure. Assuming that a group G_i (1 ≤ i ≤ t) from t ≥ 1 groups contains a file f_j, semantic correlation can be measured by the minimum of Σ_{i=1..t} Σ_{f_j ∈ G_i} (f_j − C_i)², where C_i is the centroid of group G_i, i.e., the average values of the D-dimensional attributes. The value of (f_j − C_i)² represents the Euclidean distance in the D-dimensional attribute space. Since the computational costs for all attributes are unacceptably high in practice, we use a simple but effective semantic tool, i.e., Latent Semantic Indexing (LSI) [12, 13], to generate semantically correlated groups as shown in Section 3.

The notion of semantic correlation has been used in many system designs, optimizations and real-world applications. In what follows we list some examples from recent studies by other researchers and by our group, as well as our preliminary experimental results, to evidence the strong presence and effective use of semantic correlation of file metadata.

The semantic correlation widely existing in real systems has been observed and studied by a sizeable body of published work. Spyglass [8] reports that the locality ratios are below 1% in many given traces, meaning that correlated files are contained in less than 1% of the directory space. Filecules [14] reveals the existence of file grouping by examining a large set of real traces where 45% of all 11,568,086 requests visit only 6.5% of all 65,536 files that are sorted by file popularity. Measurement of large-scale workloads [15] further verifies that fewer than 1% of clients issue 50% of the file requests and over 60% of re-open operations take place within one minute. Semantic correlation can be exploited to optimize system performance. Our research group has proposed metadata prefetching algorithms, Nexus [11] and FARMER [16], in which both file access sequences and semantic attributes are considered in the evaluation of the correlation among files to improve file metadata prefetching performance. The probability of inter-file access is found to be up to 80% when considering four typical file system traces. Our preliminary results based on these and the HP [17], MSN [18], and EECS [19] traces further show that exploiting semantic correlation of multi-dimensional attributes can help prune up to 99.9% of the search space [20].

Therefore, in this paper we propose a novel decentralized semantic-aware metadata organization, called SmartStore [21], to effectively exploit semantic correlation to enable efficient complex queries for users and to improve system performance in real-world applications. Examples of the SmartStore applications include the following.

From a user's viewpoint, range queries can help answer questions like "Which experiments did I run yesterday that took less than 30 minutes and generated files larger than 2.6GB?"; whereas top-k queries may answer questions like "I cannot accurately remember a previously created file but I know that its file size is around 300MB and it was last visited around Jan. 1, 2008. Can the system show 10 files that are closest to this description?".

From a system's point of view, SmartStore may help optimize storage system designs such as de-duplication, caching and prefetching. Data de-duplication [22, 23] aims to effectively and efficiently remove redundant data and compress data into a highly compact form for the purpose of data backup and archiving. One of the key problems is how to identify multiple copies of the same contents while avoiding linear brute-force search within the entire system. SmartStore can help identify the duplicate copies that often exhibit similar or approximate multi-dimensional attributes, such as file size and creation time. SmartStore exploits the semantic correlations existing in the multi-dimensional attributes of file metadata and efficiently organizes them into the same or adjacent groups where duplicate copies can be placed together with high probability to narrow the search space and further facilitate fast identification.

On the other hand, caching [24] and prefetching [25] are widely used in storage systems to improve I/O performance by exploiting spatial or temporal access locality. However, their performance in terms of hit rate varies largely from application to application and heavily depends on the analysis of access history. SmartStore can help quickly identify correlated files that may be visited in the near future and can be prefetched in advance to improve the hit rate. Taking a top-k query as an example, when a file is visited, we can execute a top-k query to find its k most correlated files to be prefetched. In SmartStore, both top-k and range queries can be completed within zero or a minimal number of hops since correlated files are aggregated within the same or adjacent groups to improve cache accuracy, as shown in Section 5.3.

1.2 SmartStore's Contributions

This paper makes the following key contributions.

• Decentralized semantic-aware organization scheme of file system metadata: SmartStore is designed to support complex query services and improve system performance by judiciously exploiting the semantic correlation of file metadata and effectively utilizing semantic analysis tools, i.e., Latent Semantic Indexing (LSI) [13]. The new design is different from the conventional hierarchical architecture of file systems based on a directory-tree data structure in that it removes the latter's inherent performance bottleneck and thus can avoid its disadvantages in terms of file organization and query efficiency. Additionally and importantly, SmartStore is able to provide the existing services of conventional file systems while supporting new complex query services with high reliability and scalability. Our experimental results based on a SmartStore prototype implementation show that its complex query performance is more than one thousand times higher and its space overhead is 20 times smaller than those of current database methods, with a very small false probability.
• Multi-query services: To the best of our knowledge, this is the first study to design and implement a storage architecture that supports complex queries, such as range and top-k queries, within the context of ultra-large-scale distributed file systems. More specifically, our SmartStore can support three query interfaces for point, range and top-k queries. Conventional query schemes in small-scale file systems are often concerned with filename-based queries that will soon be rendered inefficient and ineffective in next-generation large-scale distributed file systems. The complex queries will serve as an important portal or browser, like the web browser for the Internet or a city map for a tourist, for query services in an ocean of files. Our study is a first attempt at providing support for complex queries directly at the file system level.

The rest of the paper is organized as follows. Section 2 describes the SmartStore system design. Section 3 presents details of design and implementation. Section 4 discusses some key issues. Section 5 presents the extensive experimental results. Section 6 presents related work. Section 7 concludes the paper.

2. SmartStore System

The basic idea behind SmartStore is that files are grouped and stored according to their metadata semantics, instead of the directory namespace, as shown in Figure 1, which compares the two schemes.

Figure 1. Comparisons with conventional file system.

This is motivated by the observation that metadata semantics can guide the aggregation of highly correlated files into groups that in turn have a higher probability of satisfying complex query requests, judiciously matching the access pattern of locality. Thus, query and other relevant operations can be completed within one or a small number of such groups, where one group may include several storage nodes, rather than by linearly searching via brute force on almost all storage nodes as in a directory namespace approach. On the other hand, the semantic grouping can also improve system scalability and avoid access bottlenecks and single-point failures since it renders the metadata organization fully decentralized, whereby most operations, such as insertion/deletion and queries, can be executed within a given group.

We further present the overview of the proposed SmartStore system and its main components, respectively from the user and system views, along with automatic configuration to match query patterns.

2.1 Overview

A semantic R-tree, as shown on the right of Figure 1, is evolved from the classical R-tree [26] and consists of index units (i.e., non-leaf nodes) containing location and mapping information and storage units (i.e., leaf nodes) containing file metadata, both of which are hosted on a collection of storage servers. One or more R-trees may be used to represent the same set of metadata to match query patterns effectively. SmartStore supports complex queries, including range and top-k queries, in addition to simple point query. Figure 2 shows a logical diagram of SmartStore, which provides multi-query services for users while organizing metadata to enhance system performance by using decentralized semantic R-tree structures.

Figure 2. SmartStore system diagram.

SmartStore has three key functional components: 1) the grouping component that classifies metadata into storage and index units based on the LSI semantic analysis; 2) the construction component that iteratively builds semantic R-trees in a distributed environment; and 3) the service component that supports insertion and deletion in R-trees and multi-query services. Details of these and other components of SmartStore are given in Section 3.

2.2 User View

A query in SmartStore works as follows. Initially, a user sends a query randomly to a storage unit, i.e., a leaf node of the semantic R-tree. The chosen storage unit, called the home unit for this request, then retrieves semantic R-tree nodes by using either an on-line multicast-based approach or an off-line pre-computation-based approach to locate the corresponding R-tree node. Specifically, for a point query, the home unit checks Bloom filters [27] stored locally in a way similar to the group-based hierarchical Bloom-filter array approach [28] and, for a complex query, the home unit checks the Minimum Bounding Rectangles (MBR) [26] to determine the membership of the queried file within the checked servers. An MBR represents the minimal approximation of the enclosed data set by using multi-dimensional intervals of the attribute space, showing the lower and the upper bounds of each dimension. After obtaining query results, the home unit returns them to the user.
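The MBR test just described is the core routing decision for complex queries: a node's subtree needs to be visited only if its bounding rectangle overlaps the query range in every attribute dimension. The following Python sketch illustrates that check; the class and function names and the three-attribute example are illustrative assumptions, not code from the SmartStore prototype.

```python
# Minimal sketch of MBR-based routing for a range query, assuming each
# semantic R-tree node stores per-dimension lower/upper bounds.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class MBR:
    lower: List[float]  # per-dimension lower bounds
    upper: List[float]  # per-dimension upper bounds

    def intersects(self, q_low: List[float], q_high: List[float]) -> bool:
        # A node can hold matching metadata only if its MBR overlaps the
        # query rectangle in every attribute dimension.
        return all(l <= qh and ql <= u
                   for l, u, ql, qh in zip(self.lower, self.upper, q_low, q_high))

def route_range_query(nodes: List[Tuple[str, MBR]],
                      q_low: List[float], q_high: List[float]) -> List[str]:
    """Return the identifiers of nodes whose MBRs overlap the query range."""
    return [name for name, mbr in nodes if mbr.intersects(q_low, q_high)]

# Example with 3 attributes (last-modified hour, read MB, write MB).
nodes = [("unit-A", MBR([9.0, 10.0, 1.0], [12.0, 40.0, 6.0])),
         ("unit-B", MBR([17.0, 55.0, 9.0], [20.0, 90.0, 15.0]))]
print(route_range_query(nodes, [10.0, 30.0, 5.0], [16.3, 50.0, 8.0]))  # ['unit-A']
```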

2.3 System View

The most critical component in SmartStore is semantic grouping, which efficiently exploits metadata semantics, such as file physical and behavioral attributes, to classify files into groups iteratively. These attributes exhibit different characteristics. For example, attributes such as access frequency, file size, and the volume of "read" and "write" operations change frequently, while some other attributes, such as filename and creation time, often remain unchanged. SmartStore identifies the correlations between different files by examining these and other attributes, and then places strongly correlated files into groups. All groups are then organized into a semantic R-tree. These groups may reside in multiple metadata servers. By grouping correlated metadata, SmartStore exploits their affinity to boost the performance of both point query and complex queries.

Figure 3 shows the basic steps in constructing a semantic R-tree. Each metadata server is a leaf node in our semantic R-tree and can also potentially hold multiple non-leaf nodes of the R-tree. In the rest of the paper, we refer to the semantic R-tree leaf nodes as storage units and the non-leaf nodes as index units.

Figure 3. Storage and index units.

2.4 Automatic Configuration to Match Query Patterns

The objective of the semantic R-tree constructed by examining the semantic correlation of metadata attributes is to match the patterns of complex queries from users. Unfortunately, in real-world applications, the queried attributes will likely exhibit unpredictable characteristics, meaning that a query request may probe an arbitrary d-dimensional (1 ≤ d ≤ D) subset of the D-dimensional metadata attributes. For example, we can construct a semantic R-tree by leveraging three attributes, i.e., file size, creation time and last modification time, and then queries may search files according to (file size), (file size & creation time), or other combinations of these three attributes. Although using a single semantic R-tree can eventually lead to the queried files, the system performance can be greatly reduced as a result of more frequently invoking the brute-force-like approach after each failed R-tree search. The main reason is that a single semantic R-tree representing three attributes may not work efficiently if queries are generated in an unpredictable way.

In order to efficiently support complex queries with unpredictable attributes, we develop an automatic configuration technique to adaptively construct one or more semantic R-trees to improve query accuracy and efficiency. More R-trees, each associated with a different combination of multi-dimensional attributes, provide much better query performance but require more storage space. The automatic configuration technique thus must optimize the tradeoff between storage space and query performance. Our basic idea is to configure one or more semantic R-trees to adaptively satisfy complex queries associated with an arbitrary subset of attributes.

Assume that D is the maximum number of attributes in a given file system. The automatic configuration first constructs a semantic R-tree according to the available D-dimensional attributes to group file metadata, and counts the number of index units, NO(I_D), generated in this R-tree. It then constructs another semantic R-tree using a subset (i.e., d attributes) and records the number of generated index units, NO(I_d). When the difference in the number of index units between the two semantic R-trees, |NO(I_D) − NO(I_d)|, is larger than some predetermined threshold, we conjecture that these two semantic R-trees are sufficiently different, and thus are saved to serve future queries. Otherwise, the R-tree constructed from d attributes will be deleted. We repeat the above operations on all subsets of available attributes to configure one or more semantic R-trees to accurately cover future query patterns. For a future query, SmartStore will obtain query results from the semantic R-tree that has the same or similar attributes. Although the cost of automatic configuration seems to be relatively high, we use the number of index units as an indicator to constrain the costs. Some subsets of available attributes may produce the same or approximate (by checking their difference) semantic R-trees, and redundant R-trees can be deleted. In addition, the configuration operation occurs relatively infrequently on the entire file system.

These multiple R-trees covering different common subsets of all attributes will thus be able to serve the vast majority of queries. For the unlikely queries with attributes beyond these common subsets, the semantic R-tree constructed from the D-dimensional attributes will be used to produce a superset of the queried results. The penalty is to further refine these results by either brute-force pruning or utilizing extra attributes that, however, are generally unknown in advance.

3. Design and Implementation

In this section, we present the design and implementation of SmartStore, including semantic grouping, system reconfigurations such as node insertion and deletion, and point and complex queries.

3.1 Semantic Grouping

3.1.1 Statement and Tool

Statement 1 (semantic grouping of metadata). Given file metadata with D attributes, find a subset of d attributes (1 ≤ d ≤ D), representing special interests, and use the correlation measured in this subset to partition similar file metadata into multiple groups so that:

• A file in a group has a higher correlation with other files in this group than with any file outside of the group;

• Group sizes are approximately equal.

Semantic grouping is an iterative process. In the first iteration, we compute the correlation between files and cluster all files whose correlations are larger than a predetermined admission constant ε1 (0 ≤ ε1 ≤ 1) into groups. All groups generated in the first iteration are used as leaf nodes to construct a semantic R-tree. The composition of the selected d-dimensional attributes produces a grouping predicate, which serves as the grouping criterion. The semantic grouping process can be recursively executed by aggregating groups in the (i − 1)th level into the ith-level nodes of the semantic R-tree with the correlation value εi (0 ≤ εi ≤ 1, 1 ≤ i ≤ H), until reaching the root, where H is the depth of the constructed R-tree.

More than one predicate may be used to construct semantic groups. Thus, multiple semantic R-trees can be obtained and maintained concurrently in a distributed manner in a large-scale distributed file system where most files are of interest to arguably only one or a small number of applications or application environments. In other words, each of these semantic R-trees may possibly represent a different application environment or scenario. Our objective is to identify a set of predicates that optimize the query performance.
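As a rough illustration of Statement 1 and the level-by-level aggregation with admission thresholds ε1, ..., εH, the sketch below greedily groups semantic vectors whose correlation with a group centroid exceeds the current threshold and then repeats the process on the group centroids. It uses cosine similarity as a simple stand-in for the paper's LSI-derived correlation values, so it is a schematic of the iteration under that assumption, not the actual grouping algorithm.

```python
import numpy as np

def group_by_correlation(vectors: np.ndarray, epsilon: float):
    """Greedily cluster row vectors: a vector joins the most correlated existing
    group if its similarity to that group's centroid exceeds epsilon, otherwise
    it starts a new group."""
    groups, centroids = [], []
    for i, v in enumerate(vectors):
        v = v / np.linalg.norm(v)          # cosine similarity stands in for LSI correlation
        best, best_sim = None, epsilon
        for g, c in enumerate(centroids):
            sim = float(v @ (c / np.linalg.norm(c)))
            if sim > best_sim:
                best, best_sim = g, sim
        if best is None:
            groups.append([i])
            centroids.append(v.copy())
        else:
            groups[best].append(i)
            n = len(groups[best])
            centroids[best] = (centroids[best] * (n - 1) + v) / n   # incremental centroid
    return groups

def build_semantic_levels(leaf_vectors: np.ndarray, thresholds):
    """Aggregate level by level with thresholds epsilon_1 ... epsilon_H until one root remains."""
    level, levels = leaf_vectors.astype(float), []
    for eps in thresholds:
        groups = group_by_correlation(level, eps)
        levels.append(groups)
        level = np.stack([level[g].mean(axis=0) for g in groups])
        if len(groups) == 1:
            break
    return levels

# Example: 6 metadata nodes described by 3 attributes, two aggregation levels.
rng = np.random.default_rng(1)
print(build_semantic_levels(rng.random((6, 3)), thresholds=[0.95, 0.0]))
```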

File metadata with D attributes can be represented as a D-dimensional semantic vector Sa = [S1, S2, ..., SD]. Similarly, a point query can also be abstracted as Sq = [S1, S2, ..., Sd] (1 ≤ d ≤ D). In the semantic R-tree, each node represents all metadata that can be accessed through its children nodes. Each node can be summarized by a geometric centroid of all the metadata it represents. The attributes used to form semantic vectors can be either physical ones, such as creation time and file size, or behavioral ones, such as process ID and access sequence. Our previous work [16] shows that combining physical and behavioral attributes improves the identification of file correlations to help improve cache hit rate and prefetching accuracy.

We propose to use Latent Semantic Indexing (LSI) [12, 13] as a tool to measure semantic correlation. LSI is a technique based on the Singular Value Decomposition (SVD) [29] to measure semantic correlation. SVD reduces a high-dimensional vector into a low-dimensional one by projecting the large vector into a semantic subspace. Specifically, SVD decomposes an attribute-file matrix A, whose rank is r, into the product of three matrices, i.e., A = U Σ V^T, where U = (u1, ..., ur) ∈ R^{t×r} and V = (v1, ..., vr) ∈ R^{d×r} are orthogonal, Σ = diag(σ1, ..., σr) ∈ R^{r×r} is diagonal, and σi is the i-th singular value of A. V^T is the transpose of matrix V. LSI utilizes an approximate solution by representing A with a rank-p matrix that keeps only the p largest singular values, i.e., A_p = U_p Σ_p V_p^T. A metadata query for attribute i can also be represented as a semantic vector of size p, i.e., the i-th row of U_p ∈ R^{t×p}. In this way, LSI projects a query vector q ∈ R^{t×1} onto the p-dimensional semantic space in the form of q̂ = U_p^T q or q̂ = Σ_p^{-1} U_p^T q. The inverse of the singular values is used to scale the vector. The similarity between semantic vectors is measured as their inner product. Due to space limitations, this paper only gives a basic introduction to LSI; more details can be found in [12, 13].

While there are other tools available for grouping, such as K-means [30], we choose LSI because of its high efficiency and easy implementation. The K-means [30] algorithm exploits multi-dimensional attributes of n items to cluster them into K (K ≤ n) partitions. While the process of iterative refinement can minimize the total intra-cluster variance that is assumed to approximately measure the cluster, the final results' heavy dependence on the distribution of the initial set of clusters and the input parameter K may potentially lead to poor quality of the results.

The semantic grouping approach is scalable to support aggregation operations on multiple types of inputs, such as unit vectors and file vectors. Although the following sections mainly show how to insert/delete units and aggregate correlated units into groups, the approach is also applicable to aggregating files based on their multi-dimensional attributes that construct file vectors.

3.1.2 Basic Grouping

We first use LSI to determine the semantic correlation of file metadata and group them accordingly. Next we present how to formulate and organize the groups into a semantic R-tree.

First, we calculate the correlations among these servers, each of which is represented as a leaf node (i.e., storage unit). Given N metadata nodes storing D-dimensional metadata, a semantic vector with d attributes (1 ≤ d ≤ D) is constructed by using LSI to represent each of the N metadata nodes. Then, using the semantic vectors of these N nodes as input to the LSI tool, we obtain the semantic correlation value between any two nodes, x and y, among these N nodes.

Next, we build parent nodes, i.e., the first-level non-leaf nodes (index units), in the semantic R-tree. Nodes x and y are aggregated into a new group if their correlation value is larger than a predefined admission threshold ε1. When a node has correlation values larger than ε1 with more than one node, the one with the largest correlation value will be chosen. These groups are recursively aggregated until all of them form a single one, the root of the R-tree. In the semantic R-tree, each tree node uses a Minimum Bounding Rectangle (MBR) to represent all metadata that can be accessed through its children nodes.

The above procedures aggregate all metadata into a semantic R-tree. For complex queries, the query traffic is very likely bounded within one or a small number of tree nodes due to metadata semantic correlations and similarities. If each tree node is stored on a single metadata server, such query traffic is then bounded within one or a small number of metadata servers. Therefore, the proposed SmartStore can effectively avoid or minimize the brute-force searches that must be used in conventional directory-based file systems for point and complex queries.

3.2 System Reconfigurations

3.2.1 Insertion

When a storage unit is inserted into a semantic group of storage units, the semantic R-tree is adaptively adjusted to balance the workload among all storage units within this group. An insertion operation involves two steps: group location and threshold adjustment. Both steps only access a small fraction of the semantic R-tree in order to avoid message flooding in the entire system.

When inserting a storage unit as a leaf node of the semantic R-tree, we need to first identify the group that is the most closely related to this unit. The semantic correlation value between this new node and a randomly chosen group is computed by using LSI analysis over their semantic vectors. If the value is larger than a certain admission threshold, the group accepts the storage unit as a new member. Otherwise, the new unit will be forwarded to adjacent groups for admission checking. After a storage unit is inserted into a group, the MBR will be updated to cover the new unit.

The admission threshold is one of the key design parameters to balance load among multiple storage units within a group. It directly determines the semantic correlation, membership, and size of a semantic group. The initial value of this threshold is determined by a sampling analysis. After inserting a new storage unit into a semantic group, the threshold is dynamically adjusted to keep the semantic R-tree balanced.

3.2.2 Deletion

The deletion operation in the semantic R-tree is similar to a deletion in a conventional R-tree [26]. Deleting a given node entails adjusting the semantic correlation of that group, including the value of the group vector and the multi-dimensional MBR of each group node. If a group contains too few storage units, the remaining units of this group are merged into its sibling group. When a group becomes a child node of its former grandparent in the semantic R-tree as a result of becoming the only child of its father due to group merging, its height adjustment is propagated upwardly.

3.3 On-line Query Approach

We first present on-line approaches to satisfying range, top-k and point query requests and then accelerate query operations by preprocessing.

3.3.1 Range Query

A range query is to find files satisfying multi-dimensional range constraints. A range query can be easily supported in the semantic R-tree, which contains an MBR on each tree node, with a time complexity of O(log N) for N storage units. A range query request can be initially sent to any storage unit, which then multicasts query messages to its father and sibling nodes in a semantic R-tree to identify correlated target nodes that contain results with high probability.
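The LSI machinery above can be reproduced in a few lines with a standard SVD routine. The sketch below truncates the decomposition to the p largest singular values and folds a query vector into the semantic subspace via q̂ = Σ_p^{-1} U_p^T q, then ranks items by inner product. The matrix values are made-up placeholders rather than real metadata, so this only illustrates the mechanics of the projection.

```python
import numpy as np

# Attribute-by-item matrix A (t = 3 attributes, 4 items); values are made up.
A = np.array([[3.0, 0.0, 1.0, 0.0],
              [0.0, 2.0, 0.0, 4.0],
              [1.0, 1.0, 2.0, 0.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
p = 2                                   # keep only the p largest singular values
Up, Sp, Vp = U[:, :p], np.diag(s[:p]), Vt[:p, :].T

# Fold a query vector q (one entry per attribute) into the p-dimensional
# semantic subspace: q_hat = Sp^{-1} Up^T q.
q = np.array([2.0, 0.0, 1.0])
q_hat = np.linalg.inv(Sp) @ Up.T @ q

# Row j of Vp gives item j's coordinates in the same subspace; similarity is
# measured by the inner product, so the most correlated item maximizes Vp @ q_hat.
scores = Vp @ q_hat
print("scores:", np.round(scores, 3), "best item:", int(np.argmax(scores)))
```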

3.3.2 Top-K Query

A top-k query aims to identify the k files with attribute values that are closest to the desired query point q. The main operations are similar to those of a range query. After a storage unit receives a query request, it first checks its father node, i.e., an index node, to identify a target node in the semantic R-tree that is most closely associated with the query point q. After checking the target node, we obtain a MaxD that is used to measure the maximum distance between the query point q and all obtained results. MaxD also serves as a threshold to improve the query results. Its value is updated if a better result is obtained. By multicasting query messages, the sibling nodes of the target node are further checked to verify whether the current MaxD represents the smallest distance to the query point. This is to determine whether there are still better results. The top-k query results are returned when the parent node of the target node cannot find files with a smaller distance than MaxD.

3.3.3 Point Query

Filename-based indexing is very popular in existing file systems and will likely remain popular in future file systems. A point query for filenames is to find some specific file, if it exists, among the storage units. A simple but bandwidth-inefficient solution is to send the query request to a sequence of storage units to ascertain the existence and location of the queried file by following the semantic R-tree directly. This method suffers from long delays and high bandwidth overheads.

In SmartStore, we deploy a different approach to supporting point query. Specifically, Bloom filters [27], which are space-efficient data structures for membership queries, are embedded into storage and index units to support fast filename-based query services. A Bloom filter is built for each leaf node to represent the filenames of all files whose metadata are stored locally. The Bloom filter of an index unit is obtained by the logical union operations of the Bloom filters of its child nodes, as shown in Figure 4. A filename-based query will be routed along the path on which the corresponding Bloom filters report positive hits, thus significantly reducing the search space.

Figure 4. Bloom filters used for filename-based query.

A possible drawback of the above multi-query operations is that they may suffer from the potentially heavy message traffic necessary to locate the most correlated nodes that contain the queried files with high probabilities, since a query request is randomly directed to a storage unit that may not be correlated with the request. This drawback can be overcome by the following proposed off-line pre-processing.

3.4 Query Acceleration by Pre-processing

To further accelerate queries, we utilize a duplication approach to perform off-line pre-processing. Specifically, each storage unit locally maintains a replica of the semantic vectors of all index units to speed up the queries. Our motivation is to strike a tradeoff between accuracy and maintenance costs, as shown in Figure 5. We deploy the replicas of first-level index units, e.g., D, E, I, in storage units to obtain a good tradeoff, which is verified by our experiments presented in Section 5.5. After formulating each arriving request into a request vector based on its multi-dimensional attributes, we use the LSI tool over the request vector and the semantic vectors of existing index units to check which index unit is the most closely correlated with the request. In this way we can discover the target index unit that has the highest probability of successfully serving the request. The request is then forwarded directly to the target index unit, in which a local search is performed.

Figure 5. On-line and off-line queries.

Off-line pre-processing utilizes lazy updating to deal with the information staleness occurring among storage units that store the replicas of the first-level index units. When inserting or deleting files in a storage unit, its associated first-level index unit executes a local update to maintain up-to-date information about the storage units that it covers. When the number of changes is larger than some threshold, the index unit multicasts its latest replicas to other storage units.

4. Key Design Issues

This section discusses key design issues in SmartStore, including node split/merge, unit mapping and attribute updating based on versioning.

4.1 Node Split and Merge

The operations of splitting and merging nodes in the semantic R-tree follow the classical algorithms in the R-tree [26]. A node will be split when the number of child nodes of a parent node is larger than a predetermined threshold M. On the other hand, a node is merged with its adjacent neighbor when the number of child nodes of a parent node is smaller than another predetermined threshold m. In our design, the parameters m and M can be defined such that m ≤ M/2, and m can be tuned depending on the workload.

4.2 Mapping of Index Units

Since index units are stored in storage units, it is necessary and important to map the former to the latter in a way that balances the load among storage units while enhancing system reliability. Our mapping is based on a simple bottom-up approach that iteratively applies random selection and labeling operations, as shown in Figure 6 with an example of the process that maps index units to storage units. An index unit in the first level can be first randomly mapped to one of its child nodes in the R-tree (i.e., a storage unit from the covered semantic group). Each storage unit that has been mapped by an index node is labeled to avoid being mapped by another index node. After all the first-level index units have been mapped to storage units, the same mapping process is applied to the second-level index units, which are mapped to the remaining storage units. This mapping process repeats iteratively until the root node of the semantic R-tree is mapped. In practice, the number of storage units is generally much larger than that of index units, as evidenced by the experiments in Section 5.5, and thus each index unit can be mapped to a different storage unit.

Figure 6. Mapping operations for index units.

Our semantic grouping scheme aggregates correlated metadata into semantic-aware groups that can satisfy query requests with high probability. The experimental results in Section 5 show that most requests can obtain query results by visiting one or a very small number of groups. The root node hence will not likely become a performance bottleneck.

4.3 Multi-mapping of the Root Node

The potential single point of failure posed by the root node can be a serious threat to system reliability. Thus, we utilize a multi-mapping approach to enhance system reliability through redundancy, by allowing the root node to be mapped to multiple storage units. In this multi-mapping of the root node, the root is mapped to a storage unit in each group of the storage units that cover a different subtree of the semantic R-tree, so that the root can be found within each of the subtrees.

Since each parent node in the semantic R-tree maintains an MBR to cover all child nodes while the root keeps the attribute bounds of the files of the entire system (or application environment), a change to a file or its metadata will not necessarily lead to an update of the root node representation, unless it results in a new attribute value that falls outside of any attribute bound maintained by the root. Thus, most changes to metadata in a storage unit will not likely lead to an update of the root node, which significantly reduces the cost of maintaining consistency among the multiple replicas of the root node, which would otherwise need to multicast changes to the replicas in other nodes.

Mapping the root node to all semantic groups at a certain level of the semantic R-tree facilitates fast query services and improves system reliability. It can help speed up the query services by quickly answering query requests for non-existing files through checking the root to determine if the query range falls outside of the root range.

4.4 Consistency Guarantee via Versioning

SmartStore uses a multi-replica technique to support parallel and distributed indexing, which can potentially lead to information staleness and inconsistency between the original and replicated information due to the lack of immediate updating. SmartStore provides a consistency guarantee among multiple replicas by utilizing a versioning technique that can efficiently aggregate incremental index updates. A newly created version attached to its correlated replica temporarily contains aggregated real-time changes that have not been directly updated in the original replicas. This method eliminates many small, random and frequent visits to the index and has been widely used in most versioning file systems [8, 9, 31].

In order to maintain semantic correlation and locality, SmartStore creates versions for every group, represented as the first-level index unit that has been replicated to other index units. At time t0, SmartStore sends the replicas of the original index units to others, and from ti−1 to ti, updates are aggregated into the ti-th version that is attached to its correlated index unit. These updates include insertion, deletion and modification of file metadata, which are appropriately labeled in the versions. In order to adapt to system changes, SmartStore allows the groups to have different numbers and sizes of attached versions.

Versioning may introduce extra overhead due to the need to check the attached versions in addition to the original information when executing a query. However, since the versions only maintain changes that require small storage overheads and can be fully stored in memory, the extra latency of searching is usually small. In practice, we propose to roll the version changes backwards, rather than forwards as in Spyglass [8], and a query first checks the original information and then its versions from ti to t0. The direct benefit of checking backwards is to timely obtain the most recent changes, since version ti usually contains newer information than version ti−1.

SmartStore removes attached versions when reconfiguring index units. The frequency of reconfiguration depends on the user requirements and environment constraints. Removing versions entails two operations. We first apply the changes of a version to its attached original index unit, which will be updated according to these changes in the attached versions, such as inserting, deleting or modifying file metadata. On the other hand, the version is also multicast to other remote index units that have stored the replica of the original index unit, and these remote index units then carry out similar operations for local updating. Since the attached versions only need to maintain changes of file metadata and remain small in size, SmartStore may multicast them as replicas to other remote servers to guarantee information consistency while requiring not too much bandwidth to transmit the small-size changes, as shown in Section 5.6.

5. Performance Evaluation

This section evaluates SmartStore through its prototype by using representative large file system-level traces, including HP [17], MSN [18], and EECS [19]. We compare SmartStore against two baseline systems that use database techniques. The evaluation metrics considered are query accuracy, query latency and communication overhead. Due to space limitations, additional performance evaluation results are omitted but can be found in our technical report [20] and work-in-progress report [21].

5.1 Prototype Implementation

The SmartStore prototype is implemented in Linux and our experiments are conducted on a cluster of 60 storage units. Each storage unit has an Intel Core 2 Duo CPU, 2GB memory, and high-speed network connections. We carry out the experiments for 30 runs each to validate the results according to the evaluation guidelines for file and storage systems [5]. The used attributes display access locality and skewed distribution, especially for multi-dimensional attributes.
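To make the point-query path of Section 3.3.3 concrete, the sketch below builds per-leaf Bloom filters and forms an index unit's filter as the bitwise OR (logical union) of its children's filters, then routes a filename lookup only into subtrees that report a positive hit. The 1024-bit, k = 7, MD5 sizing mirrors the setting quoted later in the evaluation setup, but the double-hashing construction over two 64-bit MD5 halves is a simplification of the paper's four 32-bit MD5 slices.

```python
import hashlib

M, K = 1024, 7   # bits per filter and number of hash functions

def _bit_positions(name: str):
    # Derive K positions by double hashing over the two 64-bit halves of the
    # MD5 digest (a simplified variant of splitting MD5 into 32-bit values).
    d = hashlib.md5(name.encode()).digest()
    h1 = int.from_bytes(d[:8], "little")
    h2 = int.from_bytes(d[8:], "little") | 1
    return [(h1 + i * h2) % M for i in range(K)]

class BloomFilter:
    def __init__(self, bits: int = 0):
        self.bits = bits
    def add(self, name: str) -> None:
        for pos in _bit_positions(name):
            self.bits |= 1 << pos
    def might_contain(self, name: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in _bit_positions(name))
    def union(self, other: "BloomFilter") -> "BloomFilter":
        # An index unit's filter is the logical union of its children's filters.
        return BloomFilter(self.bits | other.bits)

# Two storage units (leaf nodes) and their parent index unit.
leaf_a, leaf_b = BloomFilter(), BloomFilter()
leaf_a.add("exp_results.dat")
leaf_b.add("trace_2008.log")
index_unit = leaf_a.union(leaf_b)

# A point query descends only into subtrees whose filters report a hit.
for name in ("trace_2008.log", "missing.txt"):
    hit = index_unit.might_contain(name)
    targets = [u for u, leaf in (("unit-a", leaf_a), ("unit-b", leaf_b))
               if hit and leaf.might_contain(name)]
    print(name, "candidate units:", targets)
```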

Table 1. Scaled-up HP.

                            Original    TIF=80
  request (million)         94.7        7576
  active users              32          2560
  user accounts             207         16560
  active files (million)    0.969       77.52
  total files (million)     4           320

Table 2. Scaled-up MSN.

                            Original    TIF=100
  # of files (million)      1.25        125
  total READ (million)      3.30        330
  total WRITE (million)     1.17        117
  duration (hours)          6           600
  total I/O (million)       4.47        447

Table 3. Scaled-up EECS.

                             Original    TIF=150
  total READ (million)       0.46        69
  READ size (GB)             5.1         765
  total WRITE (million)      0.667       100.05
  WRITE size (GB)            9.1         1365
  total operations (million) 4.44        666

In order to emulate the I/O behaviors of the next-generation storage systems, for which no realistic traces exist, we scaled up the existing I/O traces of current storage systems both spatially and temporally. Specifically, a trace is decomposed into sub-traces. We add a unique sub-trace ID to all files to intentionally increase the working set. The start time of all sub-traces is set to zero so that they are replayed concurrently. The chronological order among all requests within a sub-trace is faithfully preserved. The combined trace contains the same histogram of file system calls as the original one but presents a heavier workload (higher intensity). The number of sub-traces replayed concurrently is denoted as the Trace Intensifying Factor (TIF), as shown in Tables 1, 2 and 3. Similar workload scale-up approaches have also been used in other studies [28, 32].

We compare SmartStore with two baseline systems. The first one is a popular database approach that uses a B+-tree [33] to index each metadata attribute, denoted as DBMS, which here does not take into account database optimization. The second one is a simple, non-semantic R-tree-based database approach that organizes each file based on its multi-dimensional attributes without leveraging metadata semantics, denoted as R-tree. On the other hand, each Bloom filter embedded within an R-tree node for point query is assigned 1024 bits with k = 7 hash functions to fit memory constraints. We select MD5 [34] as the hash function for its relatively fast implementation. The value of an attribute is hashed into 128 bits by calculating its MD5 signature, which is then divided into four 32-bit values. We set thresholds of 10% and 5%, respectively, for the automatic configuration described in Section 2.4 and the lazy updating of the off-line pre-processing of Section 3.4.

While filename-based point query is very popular in most file system workloads, no file system I/O traces representing requests for complex queries are publicly available. In this paper, we use a synthetic approach to generating complex queries within the multi-dimensional attribute space. The key idea of synthesizing complex queries is to statistically generate random queries in a multi-dimensional space. The file static attributes and behavioral attributes are derived from the available I/O traces. More specifically, a range query is formed by points along multiple attribute dimensions, and a top-k query must specify the multi-dimensional coordinate of a given point and the k value. For example, a range query aiming to find all the files that were revised between time 10:00 and 16:20, with the amount of "read" data ranging from 30MB to 50MB and the amount of "write" data ranging from 5MB to 8MB, can be represented by two points in a 3-dimensional attribute space, i.e., (10:00, 30, 5) and (16:20, 50, 8). Similarly, a top-k query in the form of (11:20, 26.8, 65.7, 6) represents a search for the top-6 files that are closest to the description of a file that was last revised at time 11:20, with the amounts of "read" and "write" data being approximately 26.8MB and 65.7MB, respectively. Therefore, it is reasonable and justifiable for us to utilize random numbers as the coordinates of queried points, which are assumed to follow either the Uniform, Gauss, or Zipf distribution, to comprehensively evaluate the complex query performance. Due to space limitations, we mainly present the results for the Zipf distribution.

5.2 Performance Comparisons between SmartStore and Baseline Systems

We compare the query latency between SmartStore and the two baseline systems described earlier in Section 5.1, labeled DBMS and R-tree respectively. Table 4 shows the latency comparisons of point, range and top-k queries using the MSN and EECS traces. It is clear that SmartStore not only significantly outperforms but is also much more scalable than the two database-based schemes. The reason behind this is that the former's semantic grouping is able to significantly narrow the search scope, while DBMS must check each B+-tree index for each attribute, resulting in linear brute-force search costs. Although the non-semantic R-tree approach improves over DBMS in query performance by using a multi-dimensional structure to allow parallel indexing on all attributes, its query latency is still much higher than SmartStore's as it completely ignores semantic correlations.

Table 4. Query latency (in seconds) comparisons of SmartStore, R-tree, and DBMS using the MSN and EECS traces.

                          MSN Trace                          EECS Trace
  Query Types   TIF   DBMS      R-tree   SmartStore   DBMS     R-tree   SmartStore
  Point Query   120   146.7     32.6     0.108        26.4     8.6      0.074
                160   378.6     122.5    0.179        168.9    42.1     0.136
  Range Query   120   1516.5    242.5    1.63         685.2    126.3    1.56
                160   3529.6    625.7    3.41         1859.1   293.1    2.87
  Top-k Query   120   4651.8    492.5    2.48         2076.1   196.8    2.25
                160   11524.6   1528.4   4.02         6519.3   571.7    3.47

We also examined the space overhead per node when using SmartStore, R-tree and DBMS, as shown in Figure 7. SmartStore consumes much less space than the R-tree and DBMS approaches, due to its decentralized scheme and multi-dimensional representation. SmartStore stores the index structure, i.e., the semantic R-tree, across multiple nodes, while R-tree is a centralized structure. Additionally, SmartStore utilizes the multi-dimensional attribute structure, i.e., the semantic R-tree, while DBMS builds a B+-tree for each attribute. As a result, DBMS has a large storage overhead. Since SmartStore has a small space overhead and can be stored in memory on most servers, it allows the query to be served at the speed of memory access.

5.3 Grouping Efficiency

The grouping efficiency determines how effectively SmartStore can bound a query within a small set of semantic groups to improve the overall system scalability. Figure 8 shows that most operations, between 87.3% and 90.6%, can be served by one group, i.e., a 0-hop routing distance. This confirms the effectiveness of our semantic grouping. In addition, since the semantic vector of one group, i.e., the first-level index unit in the semantic R-tree, can accurately represent the aggregated metadata, these vectors are replicated to other storage units in order to perform fast and accurate queries locally, as mentioned in Section 3.4. The observed results prove the feasibility of the off-line pre-processing scheme, which can quickly direct a query request to the most correlated index units.

Figure 7. Space overheads of SmartStore, R-tree, and DBMS.

Figure 8. The number of hops of routing distance.

5.4 Query Accuracy

We evaluate the accuracy of complex queries by using the "Recall" measure and that of point query by the false probability of the Bloom filters.

5.4.1 Point Query

SmartStore can support point query, i.e., filename-based indexing, through multiple Bloom filters stored in index units as described in Section 3.3.3. Although Bloom filter-based search may lead to false positives and false negatives due to hash collisions and information staleness, the false probability is generally very small. In addition, these false positives and false negatives are identified when the target metadata is accessed. Figure 9 shows the hit rate for point query. It is observed that over 88.2% of query requests can be served accurately by the Bloom filters.

Figure 9. Average hit rate for point query.

5.4.2 Complex Queries

We adopt "Recall" as a measure of complex query quality from the field of information retrieval [35]. Given a query q, we denote by T(q) the ideal set of K nearest objects and by A(q) the actual neighbors reported by SmartStore. We define recall as

recall = |T(q) ∩ A(q)| / |T(q)|

Figure 10. Recall of complex queries using the HP trace: (a) Top-8 NN query; (b) Range query.

Figure 10 shows the recall values of complex queries, including range and top-k (k = 8) queries, for the HP trace. We observe that a top-k query generally achieves higher recall than a range query. The main reason is that a top-k query in essence is a similarity search, thus targeting a relatively smaller number of files. We also notice that requests following a Zipf or Gauss distribution obtain much higher recall values than those following a uniform distribution. This is because under a Zipf or Gauss distribution, files are mutually associated to a higher degree than under a uniform distribution.

5.5 System Scalability

We study the impact of system size on the optimal thresholds, as shown in Figure 11. Recall that Section 1.1 defines a quantitative measure of semantic correlation, denoted by Σ_{i=1..t} Σ_{f_j ∈ G_i} (f_j − C_i)², that, when minimized using LSI-based computation, results in the corresponding optimal threshold. Figure 11(a) shows the optimal threshold as a function of the number of storage units. Figure 11(b) shows the optimal thresholds at different levels of the semantic R-tree.

Figure 11. Optimal thresholds: (a) System scale; (b) Tree levels for 60 nodes.

We examine the query accuracy by measuring the recall when executing 2000 requests composed of 1000 range and 1000 top-k queries, as shown in Figure 12. These requests are generated based on the Gauss and Zipf distributions, respectively. Experimental results show that SmartStore maintains a high query accuracy as the number of storage units increases, demonstrating the scalability of SmartStore.
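The recall measure used throughout this evaluation is a direct transcription of the definition in Section 5.4.2; a minimal sketch with a hypothetical 8-NN example:

```python
# recall(q) = |T(q) ∩ A(q)| / |T(q)|, where T(q) is the ideal result set and
# A(q) the set actually returned by the system.
def recall(ideal: set, reported: set) -> float:
    return len(ideal & reported) / len(ideal) if ideal else 1.0

# Example: an 8-NN query for which the system finds 7 of the 8 true neighbors.
T = {"f1", "f2", "f3", "f4", "f5", "f6", "f7", "f8"}
A = {"f1", "f2", "f3", "f4", "f5", "f6", "f7", "f9"}
print(recall(T, A))   # 0.875
```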

We compare the on-line and off-line query performance in terms of query latency and the number of messages as a function of the system scale, as shown in Figure 13. Figure 13(a) compares the query latency of the two methods, described in Section 3.4, under a Zipf distribution. The on-line method identifies the most correlated storage unit for the query requests by multicasting messages, whereas the off-line method stores the semantic vectors of the first-level index units in advance and executes off-line LSI-based preprocessing to quickly locate the most correlated index unit. Figure 13(b) compares the number of internal network messages produced by the on-line and off-line approaches when performing complex queries. We observe that the off-line approach can significantly reduce the total number of network messages.

Figure 13. Performance comparisons using the on-line and off-line methods.
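The off-line method described above keeps the first-level index units' semantic vectors locally and picks the most correlated unit without multicasting. The sketch below is a hypothetical illustration of that idea using a truncated SVD (the standard LSI construction) and cosine similarity; the matrix contents, the rank k, and the use of NumPy are assumptions made for illustration rather than the paper's implementation.

```python
import numpy as np

# Illustrative term-by-unit matrix: rows are metadata terms/attributes,
# columns are first-level index units; the counts are made up.
A = np.array([
    [3.0, 0.0, 1.0, 0.0],
    [2.0, 0.0, 0.0, 1.0],
    [0.0, 4.0, 0.0, 2.0],
    [0.0, 3.0, 1.0, 2.0],
])

# Rank-k truncated SVD (the usual LSI step); k is an assumed parameter.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
unit_vectors = (np.diag(s[:k]) @ Vt[:k, :]).T   # one k-dim semantic vector per unit

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def offline_route(raw_query):
    """Fold a raw query (term space) into the LSI space and return the index of
    the most correlated unit, using only the locally stored semantic vectors."""
    q_vec = raw_query @ U[:, :k] @ np.linalg.inv(np.diag(s[:k]))
    scores = [cosine(q_vec, unit_vectors[i]) for i in range(unit_vectors.shape[0])]
    return int(np.argmax(scores))

print(offline_route(np.array([2.0, 1.0, 0.0, 0.0])))
```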

5.6 Overhead and Efficiency of Versioning

Using versioning to maintain consistency among multiple replicas of the root and index nodes of the semantic R-tree, as described in Section 4.4, introduces some extra costs, i.e., extra space and latency, since SmartStore needs to store versions that are checked in order to quickly locate query results.

Similar to the evaluation of versioning file systems [31], we adjust the version ratio, i.e., the file modification-to-version ratio, to examine the overhead introduced by versioning. Figure 14 shows the versioning overhead in terms of the required space and the latency of checking the versions. Due to space limits, this paper only presents the performance under the MSN and EECS traces.

Figure 14. Versioning overhead in space and access latency: (a) space overhead; (b) extra latency.

Figure 14(a) shows the average required space in each index unit. The space overhead is tightly associated with the version ratio. If the ratio is 1, it is called comprehensive versioning, and every change results in a version, thus requiring the largest storage space. When the ratio is increased, changes are aggregated to produce a version, which reduces the space overhead. The extra space overhead is acceptable on the whole, since most existing computers can be expected to provide at least 2GB of memory, which is sufficient for the versions.

Figure 14(b) shows the extra latency incurred by verifying query results against the versions. Compared with the entire query latency, the additional versioning latency is no more than 10%. The reason is that all versions only need to record small changes stored in memory, and we use rolling backward to reduce unnecessary checking of stale information.

SmartStore uses versioning and aggregates updates to maintain consistency and improve query accuracy. Tables 5 and 6 show the recalls of range and top-k queries with and without versioning, as a function of the number of queries, for the MSN and EECS traces. Experimental results confirm that SmartStore with versioning can significantly improve query accuracy.

Table 5. Recall (%) of range and top-k (K = 8) queries using the MSN trace.

Number of queries                   1000    2000    3000    4000    5000
Uniform  Range query                86.2    85.7    84.5    83.2    82.8
         Range query + versioning   93.5    92.7    92.2    91.6    91.1
         K = 8                      90.5    89.7    87.4    86.2    85.8
         K = 8 + versioning         96.7    96.4    96.2    95.8    95.6
Gauss    Range query                90.5    89.3    88.6    87.7    86.4
         Range query + versioning   96.8    95.9    95.2    94.8    94.3
         K = 8                      95.8    94.2    93.5    92.4    91.6
         K = 8 + versioning         100     99.6    99.3    99.1    98.8
Zipf     Range query                91.2    90.5    89.3    88.7    87.3
         Range query + versioning   100     99.2    98.8    98.6    98.5
         K = 8                      96.5    95.1    94.3    93.6    92.6
         K = 8 + versioning         100     100     100     99.8    99.6

Table 6. Recall (%) of range and top-k (K = 8) queries using the EECS trace.

Number of queries                   1000    2000    3000    4000    5000
Uniform  Range query                87.3    86.5    84.6    83.2    81.5
         Range query + versioning   95.4    95.2    94.8    94.6    94.3
         K = 8                      91.5    90.2    89.8    87.4    85.6
         K = 8 + versioning         97.6    97.3    97.1    96.6    96.2
Gauss    Range query                89.7    88.2    87.5    85.5    83.1
         Range query + versioning   96.6    96.3    96.1    95.7    95.5
         K = 8                      96.7    95.1    94.2    92.3    91.1
         K = 8 + versioning         100     100     99.8    99.5    99.1
Zipf     Range query                90.2    89.6    87.5    86.7    84.8
         Range query + versioning   100     99.7    99.4    98.9    98.6
         K = 8                      97.3    96.2    94.8    93.5    92.7
         K = 8 + versioning         100     100     100     100     99.7
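As a rough sketch of the version-ratio mechanism discussed in Section 5.6, the following illustrative class (our own construction; names and structure are assumed, not taken from the paper) aggregates every N metadata changes into one stored version, where N is the version ratio, and checks versions rolling backward from the newest. A ratio of 1 thus corresponds to comprehensive versioning, while larger ratios trade finer-grained versions for less space.

```python
class VersionLog:
    """Aggregate metadata changes into versions according to a version ratio:
    ratio = 1 keeps every change (comprehensive versioning); larger ratios
    batch that many changes into a single stored version."""
    def __init__(self, version_ratio=1):
        self.ratio = max(1, version_ratio)
        self.pending = []
        self.versions = []            # each version is a list of aggregated changes

    def record_change(self, change):
        self.pending.append(change)
        if len(self.pending) >= self.ratio:
            self.versions.append(list(self.pending))
            self.pending.clear()

    def check_recent_first(self, predicate):
        """Roll backward from the newest version so stale entries are examined
        as late as possible (a cheap consistency check)."""
        for version in reversed(self.versions):
            for change in reversed(version):
                if predicate(change):
                    return change
        return None

log = VersionLog(version_ratio=4)
for i in range(10):
    log.record_change({"file": f"f{i}", "op": "update"})
print(len(log.versions))                                   # 2 versions of 4 changes each
print(log.check_recent_first(lambda c: c["file"] == "f5"))
```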

6. Related Work

We compare SmartStore with state-of-the-art approaches in content-based search, directory subtree partitioning, and database solutions.

6.1 Content-based Search

One of the most prevalent metadata queries is the content-based query, which examines the contents and pathnames of files, as in the attribute-based naming of the Semantic file system [36] and the content-based search tool in Google Desktop [37]. However, the efficiency of content-based search heavily depends on files containing explicitly understandable contents, while ignoring the file context that most users exploit when organizing and searching their data [38]. Furthermore, techniques that are successful for web search, such as the HITS algorithm [39] and the Google search engine [40], leverage tagged and contextual links that do not inherently, let alone explicitly, exist in large-scale file systems.

6.2 Directory-based Subtree Partitioning

Subtree-partitioning based approaches have been widely used in recent studies, such as Ceph [3], GIGA+ [41], Farsite [2], and Spyglass [8]. Ceph [3] maximizes the separation between data and metadata management by using a pseudo-random data distribution function to support a scalable and decentralized placement of replicated data. Farsite [2] improves distributed directory service by utilizing tree-structured file identifiers that support dynamic partitioning of metadata at arbitrary granularity. GIGA+ [41] extends classic hash tables to build file system directories and uses bitmap encoding to allow hash partitions to split independently, thus obtaining high update concurrency and parallelism.
Spyglass [8] exploits the locality of the file namespace and the skewed distribution of metadata to map the namespace hierarchy into a multi-dimensional K-D tree, and uses multi-level versioning and partitioning to maintain consistency. However, in its current form, Spyglass focuses on indexing on a single server and cannot support distributed indexing across multiple servers.

In contrast, SmartStore uses bottom-up semantic grouping and configures a file organization scheme from scratch, which is in essence different from the above subtree-partitioning approaches that often exploit the semantics of already-existing file systems to organize files. Specifically, SmartStore leverages the semantics of multi-dimensional attributes, of which the namespace is only a part, to adaptively construct distributed semantic R-trees based on metadata semantics and to support complex queries with high reliability and fault tolerance. This self-configuration benefit allows SmartStore to flexibly construct one or more semantic R-trees to accurately match query patterns.

6.3 Database Solution

Researchers in the database field aim to bring database capacity to Petabyte scales with billions of records. Some database vendors have developed parallel databases to support large-scale data management, such as Oracle's Real Application Cluster database [42] and IBM's DB2 Parallel Edition [43], using a complete relational model with transactions. Although successful for managing relational databases, existing database management systems (DBMS) do not fully satisfy the requirements of metadata search in large-scale file systems.

• Application Environments: DBMS often assumes dedicated high-performance hardware devices, such as CPUs, memory, disks, and high-speed networks. Unfortunately, real-world applications, such as portable storage and personal devices, provide only limited capacity to support complex queries for managing metadata.

• Attribute Distribution: DBMS treats file attributes equally and assumes a uniform distribution of their values, ignoring the skewed distribution of file metadata. A case in point is that DBMS considers file pathnames as a flat string attribute and ignores the locality of the namespace.

• Access Locality: Database techniques generally cannot take full advantage of important characteristics of file systems, such as access locality and "hot spot" data, to enhance system performance.

The database research community has argued that existing DBMS for general-purpose applications are not a "one size fits all" solution [44] and that improvements may result from semantic-based designs [45].

7. Conclusion

This paper presents a new paradigm for organizing file metadata for next-generation file systems, called SmartStore, which exploits file semantic information to provide efficient and scalable complex queries while enhancing system scalability and functionality. The novelty of SmartStore lies in matching the actual data distribution and physical layout with their logical semantic correlation, so that a complex query can be successfully served within one or a small number of storage units. Specifically, this paper makes three main contributions. (1) A semantic grouping method is proposed to effectively identify files that are correlated in their physical attributes or behavioral attributes. (2) SmartStore can very efficiently support complex queries, such as range and top-k queries, which will likely become increasingly important in next-generation file systems. (3) Our prototype implementation shows that SmartStore is highly scalable and can be deployed in a large-scale distributed storage system with a large number of storage units.

Acknowledgement — This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant 60703046, the National Basic Research 973 Program under Grant 2004CB318201, the US National Science Foundation (NSF) under Grants 0621493 and 0621526, HUST-SRF No. 2007Q021B, and the Program for Changjiang Scholars and Innovative Research Team in University No. IRT-0725.

8. References

[1] J. Nunez, "High End Computing File System and I/O R&D Gaps Roadmap," High Performance Computer Science Week, ASCR Computer Science Research, August 2008.
[2] J. R. Douceur and J. Howell, "Distributed Directory Service in the Farsite File System," Proc. OSDI, pp. 321–334, 2006.
[3] S. A. Weil, S. A. Brandt, E. L. Miller, D. D. E. Long, and C. Maltzahn, "Ceph: A scalable, high-performance distributed file system," Proc. OSDI, 2006.
[4] D. Roselli, J. Lorch, and T. Anderson, "A comparison of file system workloads," Proc. USENIX Conference, pp. 41–54, 2000.
[5] A. Traeger, E. Zadok, N. Joukov, and C. Wright, "A nine year study of file system and storage benchmarking," ACM Transactions on Storage, no. 2, pp. 1–56, 2008.

[6] A. Szalay, "New Challenges in Petascale Scientific Databases," Keynote Talk, Scientific and Statistical Database Management Conference (SSDBM), 2008.
[7] M. Seltzer and N. Murphy, "Hierarchical File Systems are Dead," Proc. HotOS, 2009.
[8] A. W. Leung, M. Shao, T. Bisson, S. Pasupathy, and E. L. Miller, "Spyglass: Fast, Scalable Metadata Search for Large-Scale Storage Systems," Proc. FAST, 2009.
[9] F. Chang, J. Dean, S. Ghemawat, W. Hsieh, D. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. Gruber, "Bigtable: A distributed storage system for structured data," Proc. OSDI, 2006.
[10] D. K. Gifford, P. Jouvelot, M. A. Sheldon, and J. W. O'Toole, "Semantic file systems," Proc. SOSP, 1991.
[11] P. Gu, Y. Zhu, H. Jiang, and J. Wang, "Nexus: A Novel Weighted-Graph-Based Prefetching Algorithm for Metadata Servers in Petabyte-Scale Storage Systems," Proc. CCGrid, 2006.
[12] S. Deerwester, S. Dumas, G. Furnas, T. Landauer, and R. Harsman, "Indexing by latent semantic analysis," J. American Society for Information Science, pp. 391–407, 1990.
[13] C. Papadimitriou, P. Raghavan, H. Tamaki, and S. Vempala, "Latent Semantic Indexing: A Probabilistic Analysis," Journal of Computer and System Sciences, vol. 61, no. 2, pp. 217–235, 2000.
[14] S. Doraimani and A. Iamnitchi, "File Grouping for Scientific Data Management: Lessons from Experimenting with Real Traces," Proc. HPDC, 2008.
[15] A. Leung, S. Pasupathy, G. Goodson, and E. Miller, "Measurement and analysis of large-scale network file system workloads," Proc. USENIX Conference, 2008.
[16] P. Xia, D. Feng, H. Jiang, L. Tian, and F. Wang, "FARMER: A Novel Approach to File Access coRrelation Mining and Evaluation Reference model for Optimizing Peta-Scale File Systems Performance," Proc. HPDC, 2008.
[17] E. Riedel, M. Kallahalla, and R. Swaminathan, "A framework for evaluating storage system security," Proc. FAST, 2002.
[18] S. Kavalanekar, B. Worthington, Q. Zhang, and V. Sharda, "Characterization of storage workload traces from production Windows servers," Proc. IEEE International Symposium on Workload Characterization (IISWC), 2008.
[19] D. Ellard, J. Ledlie, P. Malkani, and M. Seltzer, "Passive NFS Tracing of Email and Research Workloads," Proc. FAST, 2003.
[20] Y. Hua, H. Jiang, Y. Zhu, D. Feng, and L. Tian, "SmartStore: A New Metadata Organization Paradigm with Metadata Semantic-Awareness for Next-Generation File Systems," Technical Report TR-UNL-CSE-2008-0012, University of Nebraska-Lincoln, November 2008.
[21] Y. Hua, H. Jiang, Y. Zhu, D. Feng, and L. Tian, "SmartStore: A New Metadata Organization Paradigm with Semantic-Awareness," FAST Work-in-Progress Report and Poster Session, February 2009.
[22] B. Zhu, K. Li, and H. Patterson, "Avoiding the disk bottleneck in the data domain deduplication file system," Proc. FAST, 2008.
[23] M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar, G. Trezise, and P. Camble, "Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality," Proc. FAST, 2009.
[24] X. Liu, A. Aboulnaga, K. Salem, and X. Li, "CLIC: CLient-Informed Caching for Storage Servers," Proc. FAST, 2009.
[25] M. Li, E. Varki, S. Bhatia, and A. Merchant, "TaP: Table-based prefetching for storage caches," Proc. FAST, 2008.
[26] A. Guttman, "R-trees: A dynamic index structure for spatial searching," Proc. SIGMOD, 1984.
[27] B. Bloom, "Space/time trade-offs in hash coding with allowable errors," Communications of the ACM, vol. 13, no. 7, pp. 422–426, 1970.
[28] Y. Hua, Y. Zhu, H. Jiang, D. Feng, and L. Tian, "Scalable and Adaptive Metadata Management in Ultra Large-scale File Systems," Proc. ICDCS, 2008.
[29] G. Golub and C. Van Loan, Matrix Computations, Johns Hopkins University Press, 1996.
[30] J. Hartigan and M. Wong, "Algorithm AS 136: A K-means clustering algorithm," Applied Statistics, pp. 100–108, 1979.
[31] C. A. N. Soules, G. R. Goodson, J. D. Strunk, and G. R. Ganger, "Metadata efficiency in versioning file systems," Proc. FAST, 2003.
[32] Y. Zhu, H. Jiang, J. Wang, and F. Xian, "HBA: Distributed Metadata Management for Large Cluster-based Storage Systems," IEEE Transactions on Parallel and Distributed Systems, vol. 19, no. 4, pp. 1–14, 2008.
[33] D. Comer, "The ubiquitous B-tree," ACM Computing Surveys, vol. 11, no. 2, pp. 121–137, 1979.
[34] A. J. Menezes, P. C. van Oorschot, and S. A. Vanstone, Handbook of Applied Cryptography, CRC Press, 1997.
[35] B. Piwowarski and G. Dupret, "Evaluation in (XML) information retrieval: Expected precision-recall with user modelling (EPRUM)," Proc. SIGIR, pp. 260–267, 2006.
[36] D. K. Gifford, P. Jouvelot, M. A. Sheldon, and J. W. O. Jr., "Semantic file systems," Proc. SOSP, 1991.
[37] "Google Desktop," http://desktop.google.com/
[38] C. Soules and G. Ganger, "Connections: Using context to enhance file search," Proc. SOSP, 2005.
[39] J. Kleinberg, "Authoritative Sources in a Hyperlinked Environment," Journal of the ACM, vol. 46, no. 5, pp. 604–632, 1999.
[40] "Google," http://www.google.com/
[41] S. Patil and G. Gibson, "GIGA+: Scalable Directories for Shared File Systems," Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-08-110, 2008.
[42] "Oracle.com," http://www.oracle.com/technology/products/database/clustering/index.html, product page.
[43] C. Baru, G. Fecteau, A. Goyal, H. Hsiao, A. Jhingran, S. Padmanabhan, G. Copeland, and W. Wilson, "DB2 parallel edition," IBM Systems Journal, vol. 34, no. 2, pp. 292–322, 1995.
[44] M. Stonebraker and U. Cetintemel, "One size fits all: An idea whose time has come and gone," Proc. ICDE, 2005.
[45] M. Franklin, A. Halevy, and D. Maier, "From databases to dataspaces: A new abstraction for information management," ACM SIGMOD Record, vol. 34, no. 4, pp. 27–33, 2005.