Size Matters: Improving the Performance of Small Files in Hadoop In: (Pp
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Red Hat Data Analytics Infrastructure Solution
TECHNOLOGY DETAIL RED HAT DATA ANALYTICS INFRASTRUCTURE SOLUTION TABLE OF CONTENTS 1 INTRODUCTION ................................................................................................................ 2 2 RED HAT DATA ANALYTICS INFRASTRUCTURE SOLUTION ..................................... 2 2.1 The evolution of analytics infrastructure ....................................................................................... 3 Give data scientists and data 2.2 Benefits of a shared data repository on Red Hat Ceph Storage .............................................. 3 analytics teams access to their own clusters without the unnec- 2.3 Solution components ...........................................................................................................................4 essary cost and complexity of 3 TESTING ENVIRONMENT OVERVIEW ............................................................................ 4 duplicating Hadoop Distributed File System (HDFS) datasets. 4 RELATIVE COST AND PERFORMANCE COMPARISON ................................................ 6 4.1 Findings summary ................................................................................................................................. 6 Rapidly deploy and decom- 4.2 Workload details .................................................................................................................................... 7 mission analytics clusters on 4.3 24-hour ingest ........................................................................................................................................8 -
Globalfs: a Strongly Consistent Multi-Site File System
GlobalFS: A Strongly Consistent Multi-Site File System Leandro Pacheco Raluca Halalai Valerio Schiavoni University of Lugano University of Neuchatelˆ University of Neuchatelˆ Fernando Pedone Etienne Riviere` Pascal Felber University of Lugano University of Neuchatelˆ University of Neuchatelˆ Abstract consistency, availability, and tolerance to partitions. Our goal is to ensure strongly consistent file system operations This paper introduces GlobalFS, a POSIX-compliant despite node failures, at the price of possibly reduced geographically distributed file system. GlobalFS builds availability in the event of a network partition. Weak on two fundamental building blocks, an atomic multicast consistency is suitable for domain-specific applications group communication abstraction and multiple instances of where programmers can anticipate and provide resolution a single-site data store. We define four execution modes and methods for conflicts, or work with last-writer-wins show how all file system operations can be implemented resolution methods. Our rationale is that for general-purpose with these modes while ensuring strong consistency and services such as a file system, strong consistency is more tolerating failures. We describe the GlobalFS prototype in appropriate as it is both more intuitive for the users and detail and report on an extensive performance assessment. does not require human intervention in case of conflicts. We have deployed GlobalFS across all EC2 regions and Strong consistency requires ordering commands across show that the system scales geographically, providing replicas, which needs coordination among nodes at performance comparable to other state-of-the-art distributed geographically distributed sites (i.e., regions). Designing file systems for local commands and allowing for strongly strongly consistent distributed systems that provide good consistent operations over the whole system. -
Big Data Storage Workload Characterization, Modeling and Synthetic Generation
BIG DATA STORAGE WORKLOAD CHARACTERIZATION, MODELING AND SYNTHETIC GENERATION BY CRISTINA LUCIA ABAD DISSERTATION Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate College of the University of Illinois at Urbana-Champaign, 2014 Urbana, Illinois Doctoral Committee: Professor Roy H. Campbell, Chair Professor Klara Nahrstedt Associate Professor Indranil Gupta Assistant Professor Yi Lu Dr. Ludmila Cherkasova, HP Labs Abstract A huge increase in data storage and processing requirements has lead to Big Data, for which next generation storage systems are being designed and implemented. As Big Data stresses the storage layer in new ways, a better understanding of these workloads and the availability of flexible workload generators are increas- ingly important to facilitate the proper design and performance tuning of storage subsystems like data replication, metadata management, and caching. Our hypothesis is that the autonomic modeling of Big Data storage system workloads through a combination of measurement, and statistical and machine learning techniques is feasible, novel, and useful. We consider the case of one common type of Big Data storage cluster: A cluster dedicated to supporting a mix of MapReduce jobs. We analyze 6-month traces from two large clusters at Yahoo and identify interesting properties of the workloads. We present a novel model for capturing popularity and short-term temporal correlations in object re- quest streams, and show how unsupervised statistical clustering can be used to enable autonomic type-aware workload generation that is suitable for emerging workloads. We extend this model to include other relevant properties of stor- age systems (file creation and deletion, pre-existing namespaces and hierarchical namespaces) and use the extended model to implement MimesisBench, a realistic namespace metadata benchmark for next-generation storage systems. -
Unlock Bigdata Analytic Efficiency with Ceph Data Lake
Unlock Bigdata Analytic Efficiency With Ceph Data Lake Jian Zhang, Yong Fu, March, 2018 Agenda . Background & Motivations . The Workloads, Reference Architecture Evolution and Performance Optimization . Performance Comparison with Remote HDFS . Summary & Next Step 3 Challenges of scaling Hadoop* Storage BOUNDED Storage and Compute resources on Hadoop Nodes brings challenges Data Capacity Silos Costs Performance & efficiency Typical Challenges Data/Capacity Multiple Storage Silos Space, Spent, Power, Utilization Upgrade Cost Inadequate Performance Provisioning And Configuration Source: 451 Research, Voice of the Enterprise: Storage Q4 2015 *Other names and brands may be claimed as the property of others. 4 Options To Address The Challenges Compute and Large Cluster More Clusters Storage Disaggregation • Lacks isolation - • Cost of • Isolation of high- noisy neighbors duplicating priority workloads hinder SLAs datasets across • Shared big • Lacks elasticity - clusters datasets rigid cluster size • Lacks on-demand • On-demand • Can’t scale provisioning provisioning compute/storage • Can’t scale • compute/storage costs separately compute/storage costs scale costs separately separately Compute and Storage disaggregation provides Simplicity, Elasticity, Isolation 5 Unified Hadoop* File System and API for cloud storage Hadoop Compatible File System abstraction layer: Unified storage API interface Hadoop fs –ls s3a://job/ adl:// oss:// s3n:// gs:// s3:// s3a:// wasb:// 2006 2008 2014 2015 2016 6 Proposal: Apache Hadoop* with disagreed Object Storage SQL …… Hadoop Services • Virtual Machine • Container • Bare Metal HCFS Compute 1 Compute 2 Compute 3 … Compute N Object Storage Services Object Object Object Object • Co-located with gateway Storage 1 Storage 2 Storage 3 … Storage N • Dynamic DNS or load balancer • Data protection via storage replication or erasure code Disaggregated Object Storage Cluster • Storage tiering *Other names and brands may be claimed as the property of others. -
A Decentralized Cloud Storage Network Framework
Storj: A Decentralized Cloud Storage Network Framework Storj Labs, Inc. October 30, 2018 v3.0 https://github.com/storj/whitepaper 2 Copyright © 2018 Storj Labs, Inc. and Subsidiaries This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 license (CC BY-SA 3.0). All product names, logos, and brands used or cited in this document are property of their respective own- ers. All company, product, and service names used herein are for identification purposes only. Use of these names, logos, and brands does not imply endorsement. Contents 0.1 Abstract 6 0.2 Contributors 6 1 Introduction ...................................................7 2 Storj design constraints .......................................9 2.1 Security and privacy 9 2.2 Decentralization 9 2.3 Marketplace and economics 10 2.4 Amazon S3 compatibility 12 2.5 Durability, device failure, and churn 12 2.6 Latency 13 2.7 Bandwidth 14 2.8 Object size 15 2.9 Byzantine fault tolerance 15 2.10 Coordination avoidance 16 3 Framework ................................................... 18 3.1 Framework overview 18 3.2 Storage nodes 19 3.3 Peer-to-peer communication and discovery 19 3.4 Redundancy 19 3.5 Metadata 23 3.6 Encryption 24 3.7 Audits and reputation 25 3.8 Data repair 25 3.9 Payments 26 4 4 Concrete implementation .................................... 27 4.1 Definitions 27 4.2 Peer classes 30 4.3 Storage node 31 4.4 Node identity 32 4.5 Peer-to-peer communication 33 4.6 Node discovery 33 4.7 Redundancy 35 4.8 Structured file storage 36 4.9 Metadata 39 4.10 Satellite 41 4.11 Encryption 42 4.12 Authorization 43 4.13 Audits 44 4.14 Data repair 45 4.15 Storage node reputation 47 4.16 Payments 49 4.17 Bandwidth allocation 50 4.18 Satellite reputation 53 4.19 Garbage collection 53 4.20 Uplink 54 4.21 Quality control and branding 55 5 Walkthroughs ............................................... -
Key Exchange Authentication Protocol for Nfs Enabled Hdfs Client
Journal of Theoretical and Applied Information Technology 15 th April 2017. Vol.95. No 7 © 2005 – ongoing JATIT & LLS ISSN: 1992-8645 www.jatit.org E-ISSN: 1817-3195 KEY EXCHANGE AUTHENTICATION PROTOCOL FOR NFS ENABLED HDFS CLIENT 1NAWAB MUHAMMAD FASEEH QURESHI, 2*DONG RYEOL SHIN, 3ISMA FARAH SIDDIQUI 1,2 Department of Computer Science and Engineering, Sungkyunkwan University, Suwon, South Korea 3 Department of Software Engineering, Mehran UET, Pakistan *Corresponding Author E-mail: [email protected], 2*[email protected], [email protected] ABSTRACT By virtue of its built-in processing capabilities for large datasets, Hadoop ecosystem has been utilized to solve many critical problems. The ecosystem consists of three components; client, Namenode and Datanode, where client is a user component that requests cluster operations through Namenode and processes data blocks at Datanode enabled with Hadoop Distributed File System (HDFS). Recently, HDFS has launched an add-on to connect a client through Network File System (NFS) that can upload and download set of data blocks over Hadoop cluster. In this way, a client is not considered as part of the HDFS and could perform read and write operations through a contrast file system. This HDFS NFS gateway has raised many security concerns, most particularly; no reliable authentication support of upload and download of data blocks, no local and remote client efficient connectivity, and HDFS mounting durability issues through untrusted connectivity. To stabilize the NFS gateway strategy, we present in this paper a Key Exchange Authentication Protocol (KEAP) between NFS enabled client and HDFS NFS gateway. The proposed approach provides cryptographic assurance of authentication between clients and gateway. -
HFAA: a Generic Socket API for Hadoop File Systems
HFAA: A Generic Socket API for Hadoop File Systems Adam Yee Jeffrey Shafer University of the Pacific University of the Pacific Stockton, CA Stockton, CA [email protected] jshafer@pacific.edu ABSTRACT vices: one central NameNode and many DataNodes. The Hadoop is an open-source implementation of the MapReduce NameNode is responsible for maintaining the HDFS direc- programming model for distributed computing. Hadoop na- tory tree. Clients contact the NameNode in order to perform tively integrates with the Hadoop Distributed File System common file system operations, such as open, close, rename, (HDFS), a user-level file system. In this paper, we intro- and delete. The NameNode does not store HDFS data itself, duce the Hadoop Filesystem Agnostic API (HFAA) to allow but rather maintains a mapping between HDFS file name, Hadoop to integrate with any distributed file system over a list of blocks in the file, and the DataNode(s) on which TCP sockets. With this API, HDFS can be replaced by dis- those blocks are stored. tributed file systems such as PVFS, Ceph, Lustre, or others, thereby allowing direct comparisons in terms of performance Although HDFS stores file data in a distributed fashion, and scalability. Unlike previous attempts at augmenting file metadata is stored in the centralized NameNode service. Hadoop with new file systems, the socket API presented here While sufficient for small-scale clusters, this design prevents eliminates the need to customize Hadoop’s Java implementa- Hadoop from scaling beyond the resources of a single Name- tion, and instead moves the implementation responsibilities Node. Prior analysis of CPU and memory requirements for to the file system itself. -
A Searchable-By-Content File System
A Searchable-by-Content File System Srinath Sridhar, Jeffrey Stylos and Noam Zeilberger May 2, 2005 Abstract Indexed searching of desktop documents has recently become popular- ized by applications from Google, AOL, Yahoo!, MSN and others. How- ever, each of these is an application separate from the file system. In our project, we explore the performance tradeoffs and other issues encountered when implementing file indexing inside the file system. We developed a novel virtual-directory interface to allow application and user access to the index. In addition, we found that by deferring indexing slightly, we could coalesce file indexing tasks to reduce total work while still getting good performance. 1 Introduction Searching has always been one of the fundamental problems of computer science, and from their beginnings computer systems were designed to support and when possible automate these searches. This support ranges from the simple-minded (e.g., text editors that allow searching for keywords) to the extremely powerful (e.g., relational database query languages). But for the mundane task of per- forming searches on users’ files, available tools still leave much to be desired. For the most part, searches are limited to queries on a directory namespace—Unix shell wildcard patterns are useful for matching against small numbers of files, GNU locate for finding files in the directory hierarchy. Searches on file data are much more difficult. While from a logical point of view, grep combined with the Unix pipe mechanisms provides a nearly universal solution to content-based searches, realistically it can be used only for searching within small numbers of (small-sized) files, because of its inherent linear complexity. -
A Semantic File System for Integrated Product Data Management', Advanced Engineering Informatics, Vol
Citation for published version: Eck, O & Schaefer, D 2011, 'A Semantic File System for Integrated Product Data Management', Advanced Engineering Informatics, vol. 25, no. 2, pp. 177-184. https://doi.org/10.1016/j.aei.2010.08.005 DOI: 10.1016/j.aei.2010.08.005 Publication date: 2011 Document Version Publisher's PDF, also known as Version of record Link to publication University of Bath Alternative formats If you require this document in an alternative format, please contact: [email protected] General rights Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights. Take down policy If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately and investigate your claim. Download date: 02. Oct. 2021 Advanced Engineering Informatics 25 (2011) 177–184 Contents lists available at ScienceDirect Advanced Engineering Informatics journal homepage: www.elsevier.com/locate/aei A semantic file system for integrated product data management ⇑ Oliver Eck a, , Dirk Schaefer b a Department of Computer Science, HTWG Konstanz, Germany b Systems Realization Laboratory, Woodruff School of Mechanical Engineering, Georgia Institute of Technology, Atlanta, GA, USA article info abstract Article history: A mechatronic system is a synergistic integration of mechanical, electrical, electronic and software tech- Received 31 October 2009 nologies into electromechanical systems. Unfortunately, mechanical, electrical, and software data are Received in revised form 10 June 2010 often handled in separate Product Data Management (PDM) systems with no automated sharing of data Accepted 17 August 2010 between them or links between their data. -
Decentralising Big Data Processing Scott Ross Brisbane
School of Computer Science and Engineering Faculty of Engineering The University of New South Wales Decentralising Big Data Processing by Scott Ross Brisbane Thesis submitted as a requirement for the degree of Bachelor of Engineering (Software) Submitted: October 2016 Student ID: z3459393 Supervisor: Dr. Xiwei Xu Topic ID: 3692 Decentralising Big Data Processing Scott Ross Brisbane Abstract Big data processing and analysis is becoming an increasingly important part of modern society as corporations and government organisations seek to draw insight from the vast amount of data they are storing. The traditional approach to such data processing is to use the popular Hadoop framework which uses HDFS (Hadoop Distributed File System) to store and stream data to analytics applications written in the MapReduce model. As organisations seek to share data and results with third parties, HDFS remains inadequate for such tasks in many ways. This work looks at replacing HDFS with a decentralised data store that is better suited to sharing data between organisations. The best fit for such a replacement is chosen to be the decentralised hypermedia distribution protocol IPFS (Interplanetary File System), that is built around the aim of connecting all peers in it's network with the same set of content addressed files. ii Scott Ross Brisbane Decentralising Big Data Processing Abbreviations API Application Programming Interface AWS Amazon Web Services CLI Command Line Interface DHT Distributed Hash Table DNS Domain Name System EC2 Elastic Compute Cloud FTP File Transfer Protocol HDFS Hadoop Distributed File System HPC High-Performance Computing IPFS InterPlanetary File System IPNS InterPlanetary Naming System SFTP Secure File Transfer Protocol UI User Interface iii Decentralising Big Data Processing Scott Ross Brisbane Contents 1 Introduction 1 2 Background 3 2.1 The Hadoop Distributed File System . -
Collective Communication on Hadoop
Harp: Collective Communication on Hadoop Bingjing Zhang, Yang Ruan, Judy Qiu Computer Science Department Indiana University Bloomington, IN, USA Abstract—Big data tools have evolved rapidly in recent years. MapReduce is very successful but not optimized for many important analytics; especially those involving iteration. In this regard, Iterative MapReduce frameworks improve performance of MapReduce job chains through caching. Further Pregel, Giraph and GraphLab abstract data as a graph and process it in iterations. However, all these tools are designed with fixed data abstraction and have limitations in communication support. In this paper, we introduce a collective communication layer which provides optimized communication operations on several important data abstractions such as arrays, key-values and graphs, and define a Map-Collective model which serves the diverse communication demands in different parallel applications. In addition, we design our enhancements as plug-ins to Hadoop so they can be used with the rich Apache Big Data Stack. Then for example, Hadoop can do in-memory communication between Map tasks without writing intermediate data to HDFS. With improved expressiveness and excellent performance on collective communication, we can simultaneously support various applications from HPC to Cloud systems together with a high performance Apache Big Data Stack. Fig. 1. Big Data Analysis Tools Keywords—Collective Communication; Big Data Processing; Hadoop Spark [5] also uses caching to accelerate iterative algorithms without restricting computation to a chain of MapReduce jobs. I. INTRODUCTION To process graph data, Google announced Pregel [6] and soon It is estimated that organizations with high-end computing open source versions Giraph [7] and Hama [8] emerged. -
Maximizing Hadoop Performance and Storage Capacity with Altrahdtm
Maximizing Hadoop Performance and Storage Capacity with AltraHDTM Executive Summary The explosion of internet data, driven in large part by the growth of more and more powerful mobile devices, has created not only a large volume of data, but a variety of data types, with new data being generated at an increasingly rapid rate. Data characterized by the “three Vs” – Volume, Variety, and Velocity – is commonly referred to as big data, and has put an enormous strain on organizations to store and analyze their data. Organizations are increasingly turning to Apache Hadoop to tackle this challenge. Hadoop is a set of open source applications and utilities that can be used to reliably store and process big data. What makes Hadoop so attractive? Hadoop runs on commodity off-the-shelf (COTS) hardware, making it relatively inexpensive to construct a large cluster. Hadoop supports unstructured data, which includes a wide range of data types and can perform analytics on the unstructured data without requiring a specific schema to describe the data. Hadoop is highly scalable, allowing companies to easily expand their existing cluster by adding more nodes without requiring extensive software modifications. Apache Hadoop is an open source platform. Hadoop runs on a wide range of Linux distributions. The Hadoop cluster is composed of a set of nodes or servers that consist of the following key components. Hadoop Distributed File System (HDFS) HDFS is a Java based file system which layers over the existing file systems in the cluster. The implementation allows the file system to span over all of the distributed data nodes in the Hadoop cluster to provide a scalable and reliable storage system.