UNIVERSIDAD DE VALLADOLID

ESCUELA TÉCNICA SUPERIOR

INGENIEROS DE TELECOMUNICACIÓN

MASTER'S THESIS (TRABAJO FIN DE MÁSTER)

MÁSTER UNIVERSITARIO EN INVESTIGACIÓN

EN TECNOLOGÍAS DE LA INFORMACIÓN Y LAS COMUNICACIONES

HDFS File Formats: Study and Performance Comparison

Author: D. Álvaro Alonso Isla
Advisors: Dr. D. Miguel Ángel Martínez Prieto, Dr. D. Aníbal Bregón Bregón
Valladolid, 29 June 2018

TITLE: HDFS File Formats: Study and Performance Comparison

AUTHOR: D. Álvaro Alonso Isla

ADVISORS: Dr. D. Miguel Ángel Martínez Prieto, Dr. D. Aníbal Bregón Bregón

DEPARTMENT: Departamento de Informática

Examination Committee

PRESIDENT: Dr. D. Pablo de la Fuente Redondo
MEMBER: Dr. D. Alejandra Martínez Monés
SECRETARY: Dr. D. Guillermo Vega Gorgojo

DATE: 29 June 2018
GRADE:

Summary

The Hadoop distributed system is becoming increasingly popular for storing and processing large amounts of data (Big Data). Since it is composed of many machines, its file system, called HDFS (Hadoop Distributed File System), is also distributed. Because HDFS is a storage system different from traditional ones, new file storage formats have been developed to fit its particular characteristics. In this work we study these new file formats, paying special attention to their particular characteristics, in order to be able to anticipate which of them will perform better given the needs of our data. To reach this goal, we have proposed a theoretical comparison framework with which to easily recognize which formats fit our needs. In addition, we have carried out an experimental study to reinforce the previous comparison and to draw conclusions about which formats are the best for the cases presented. For this purpose, two datasets with different characteristics and a group of basic queries have been selected; the queries are executed with MapReduce jobs written in Java and through the Hive tool. The final goal of this work is to be able to identify the different strengths and weaknesses of the storage formats.

Keywords: Big Data, Hadoop, HDFS, MapReduce, Storage Formats.

Abstract

The distributed system Hadoop has become very popular for storing and processing large amounts of data (Big Data). As it is composed of many machines, its file system, called HDFS (Hadoop Distributed File System), is also distributed. But since HDFS is not a traditional storage system, plenty of new file formats have been developed to take advantage of its features. In this work we study these new formats to find out their characteristics and to be able to decide which ones can be better given the needs of our data. For that goal, we have built a theoretical framework to compare them and easily recognize which formats fit our needs. We have also carried out an experimental study to find out how the formats behave in some specific situations, selecting two very different datasets and a set of simple queries, resolved with MapReduce jobs written in Java or run using the Hive tool. The final goal of this work is to be able to identify the different strengths and weaknesses of the file formats.

Keywords: Big Data, Hadoop, HDFS, MapReduce, File Formats.

Acknowledgements

This work has been possible thanks to the collaboration of Boeing Research & Technology Europe (BR&T-E) and is part of the AIRPORTS project. This project is funded by CDTI and MINECO (FEDER), ref. IDI-20150616, CIEN 2015. I want to thank all the members of BR&T-E for letting us use their Hadoop cluster to run the tests of this project, and especially David Scarlatti for providing me with all the information I have needed about the characteristics of the system.

I owe special thanks to my colleague Iván García, with whom I have shared long conversations about the progress we were making in our respective works, which helps a lot to cope with the bad moments. Moreover, on more than one occasion I have had to ask him for advice on certain aspects that he knew better than I did. I also want to thank my advisors Miguel Ángel Martínez and Aníbal Bregón for having devoted so much time to helping me and for closely following my progress. Thanks to Miguel Ángel, who once again lent me one of his ideas to carry out and turn into a project, even though on this occasion the result is quite different from what the initial idea proposed. And I especially want to thank Aníbal for his dedication: in the final stages of the work, when Miguel Ángel was not available, he made a superhuman effort to find time for my work, advise me and guide me, despite it being a topic in which he is not specialized.

Finally, and by no means least, thanks to my family and my partner Ainoa, who have had to listen to my problems for so many hours without knowing what to answer because they did not understand them. Thank you for all the support and for those moments when you helped me disconnect and rest.

Contents

1 Introduction 1
1.1 Motivation ...... 2
1.2 Objectives and Results of the Work ...... 3
1.3 Methodology ...... 4
1.4 Document structure ...... 4

2 Distributed file systems 7
2.1 Basic concepts ...... 7
2.2 Characteristics ...... 8
2.2.1 Transparency ...... 8
2.2.2 Naming ...... 8
2.2.3 Scalability ...... 9
2.2.4 Remote Multiple Access ...... 10
2.3 Architectures ...... 10
2.4 Fault tolerance ...... 11

3 Hadoop Distributed File System 13
3.1 GFS ...... 14
3.2 HDFS Architecture ...... 15
3.3 HDFS Writing and Reading ...... 17
3.4 HDFS Blocks Robustness ...... 18
3.4.1 Replication ...... 18
3.4.2 Reading/writing Failure ...... 19
3.4.3 Namenode Failure ...... 20

4 HDFS storage formats 21
4.1 Text files ...... 23
4.1.1 CSV ...... 23
4.1.2 JSON ...... 23
4.2 SequenceFiles ...... 24
4.3 Avro ...... 26
4.4 RCFiles ...... 27
4.5 ORC ...... 29
4.6 Parquet ...... 30
4.7 Comparison ...... 33

5 Experimental results 37
5.1 Methodology ...... 37
5.1.1 Mapreduce ...... 37
5.1.2 Datasets ...... 38
5.1.3 Considered Formats ...... 39
5.1.4 Queries ...... 40
5.1.5 Experimentation environment ...... 41
5.1.6 Used Parameters ...... 41
5.2 Results ...... 42
5.2.1 Opensky ...... 42
5.2.2 Github log ...... 57

6 Conclusions 79
6.1 Future work ...... 80

I APPENDICES 85

A Opensky Schema 87

B GitHub log Schema 89

List of Figures

3.1 HDFS architecture ...... 16
3.2 HDFS writing operation ...... 17
3.3 HDFS reading operation ...... 18
3.4 HDFS replication example ...... 19

4.1 SequenceFile record compression ...... 24
4.2 SequenceFile block compression ...... 25
4.3 Avro structure ...... 27
4.4 RCFile structure ...... 28
4.5 ORC structure ...... 30
4.6 Parquet schema example ...... 31
4.7 Parquet definition level example ...... 32
4.8 Parquet repetition level example ...... 32
4.9 Parquet file structure ...... 33

5.1 Opensky format sizes ...... 42
5.2 Opensky query 1 performance ...... 44
5.3 Opensky query 2 performance ...... 47
5.4 Opensky query 3 performance ...... 50
5.5 Opensky query 4 performance ...... 53
5.6 Opensky query 5 performance ...... 55
5.7 GitHub log format sizes ...... 57
5.8 GitHub log query 1 performance ...... 59
5.9 GitHub log query 2.1 performance ...... 62
5.10 GitHub log query 2.2 performance ...... 65
5.11 GitHub log query 3 performance ...... 68
5.12 GitHub log query 4 performance ...... 71
5.13 GitHub log query 5.1 performance ...... 73
5.14 GitHub log query 5.2 performance ...... 76

List of Tables

4.1 File formats comparison table ...... 34

5.1 Opensky format sizes (GB) ...... 42
5.2 Opensky query 1 results ...... 45
5.3 Opensky query 2 results ...... 48
5.4 Opensky query 3 results ...... 51
5.5 Opensky query 4 results ...... 54
5.6 Opensky query 5 results ...... 56
5.7 GitHub log format sizes (GB) ...... 58
5.8 GitHub log query 1 results ...... 60
5.9 GitHub log query 2.1 results ...... 63
5.10 GitHub log query 2.2 results ...... 66
5.11 GitHub log query 3 results ...... 69
5.12 GitHub log query 4 results ...... 72
5.13 GitHub log query 5.1 results ...... 74
5.14 GitHub log query 5.2 results ...... 77

Chapter 1

Introduction

For several years now, the Big Data concept has been becoming more and more popular, but it is not something new. We can say that this concept reached maturity in 2013, when it was introduced as a new term in the Oxford English Dictionary (OED) [1], where it is defined as “Extremely large data sets that may be analysed computationally to reveal patterns, trends, and associations, especially relating to human behaviour and interactions”. However, this is not a very accurate definition; it only refers to the volume of the data and some fields in which it can be used. Traditionally, Big Data has been characterized by the so-called 3 Vs [2], i.e., volume, velocity and variety.

Volume is the most famous characteristic of Big Data. It is simply the storage space that the data uses. Years ago, a dataset between 10GB and 100GB could have been considered Big Data. Nowadays the typical size at which a dataset is considered Big Data is about 1TB. This threshold is determined by current computation capacity, because Big Data can also be defined as “data that exceeds the processing capacity of conventional database systems” [3]. Thus, if the capacity of conventional databases increases, a dataset considered Big Data some years ago may no longer fall into that group.

The next V is velocity. This concept refers to the speed at which the data is generated. These days, the amount of data produced each second is immense. For example, YouTube receives 300 hours of video every minute [4]. For systems that process data in real time, it is very hard to keep up with these data generation speeds. Because of this, it is necessary to use new techniques to be able to process data as it is generated.

The last V is variety. This characteristic can be understood in two different ways. The first one is the variety in data form and type, as it is possible to mix natural language text with multimedia, with structured data, or with any other type of information. The second aspect of variety is related to data structure. Most of the data generated nowadays is produced by people, and it is written in natural language. Thus, the majority of the data is unstructured, which is much more difficult to process than data that follows a structure.

Some authors add two more Vs to the previous three, thus talking about the 5 Vs of Big Data [5]. These new characteristics are Veracity and Value. Veracity refers to the data quality or trustworthiness. It is important to take this into account because nowadays the majority of the data is generated by people, and maybe not all of it can be trusted. Value means that the data has to produce a benefit. If there is no possibility of bringing value to companies or society, it is not worth processing that data.

The last two Vs are the newest ones, and they refer to the meaning of the data, not just to its numerical characteristics. They are mostly related to where the data comes from and whether it has any utility.

As we have mentioned, volume is the most significant of the Big Data characteristics; however, volume brings more problems than just having enough storage capacity. The data is usually queried, and those queries must be resolved as fast as possible. With very large amounts of data, this task is not easy.

One popular solution to the problems caused by volume is to use distributed file systems. A distributed file system is just a set of machines that plays the role of a single file system. Having many machines, the storage capacity is much higher, and it is possible to greatly increase the parallelization of tasks. One distributed file system that is very common these days is the one used within Apache Hadoop.

Apache Hadoop [6, 7] is an open-source software framework, written in Java, that allows distributed processing and storage of very large volumes of data on a computer cluster composed of commodity hardware1. Hadoop can be divided into a storage part and a processing part. The storage system of Hadoop is called HDFS (Hadoop Distributed File System) [8, 9], and the processing part is MapReduce [10].
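Since the experiments later in this work are run as MapReduce jobs written in Java, a minimal sketch of such a job may help fix the idea. It is the classic word-count example built on the standard org.apache.hadoop MapReduce API; the input and output paths passed on the command line are illustrative assumptions, not paths used in this project.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // The mapper receives one line of its input split at a time and emits (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // The reducer receives all counts emitted for the same word and sums them.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The map phase runs in parallel on the nodes that hold the input blocks, which is exactly the "move computation to the data" idea discussed later for HDFS.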

1.1 Motivation

In this project we focus only on the storage part of Hadoop, studying the different file formats that have been developed for this environment. These new formats have different characteristics, and depending on the structure of the data and its final use, one format can be more suitable for some tasks than the others. But how is it possible to know which format is better in a specific situation? Although there have been different studies that compare the formats with each other and try to answer this question, there is no published work covering and comparing all kinds of formats in an exhaustive way.

For example, D. Plase [11] made a comparison between the two best-known file formats, Avro and Parquet. In her work the rest of the file formats were discarded because of their theoretical limitations. For the selected ones, there is a comparison of compression capacity and execution performance for a fairly large number of datasets and queries, taken from the TPC-H benchmark [12], some of them modified. The queries were executed using Hive-MapReduce. However, the theoretical comparison is very short, discarding all the other formats without giving them an opportunity in the experimentation.

Another work comparing two file formats in the Hadoop environment is the thesis written by Gavin So [13]. In this case, the Parquet and JSON formats are compared. For the experimentation part, the processing framework used is not the native one of Hadoop (MapReduce), but a different one [14].

Z. Baranowski [15] makes a comparison including more than just two formats. In this case, different ways of storing data are compared, and not just file formats, including Hadoop databases like HBase [16]. However, the focus is on the experimental comparison, without a big effort in the theoretical part.

1 Commodity hardware refers to devices or device components that are easy to obtain, widely available, and not expensive. Usually they are easily interchangeable with other hardware of the same type.

Apart from the previously mentioned works, there have been several presentations [17, 18] and some blog posts [19]. In these works the number of file formats considered is larger, and almost all the most relevant formats are taken into account. Nevertheless, there is very little theoretical basis for comparing the formats with each other without relying on numbers, and almost all the content focuses on the experimental results.

With all this information, we can conclude that the trend is to take one or more datasets and try to make an experimental comparison between the formats. But the majority of these studies either are not exhaustive when selecting the compared file formats, or do not make a theoretical analysis that is reflected in the experimental results.

As we have previously mentioned, depending on the data and its future use, a format can be better or worse for a particular purpose than another. Thus, we think that a theoretical comparison is very important, because with the theory it is feasible to have a general idea of which formats are better for any possible situation. With an experimental comparison, only the specific scenario tested (or another very similar one) is taken into account.

1.2 Objectives and Results of the Work

In this work we want to cover this lack of a theoretical comparison between all the most important formats. To reach this goal, we have designed a comparison framework that includes some of the most important aspects to take into account in a Hadoop environment. For this purpose we formulate the following research hypothesis:

Hypothesis: It is possible to figure out which file formats will perform better in a specific situation of data structure and queries.

However, we think that the experimental part is still very important. For this reason, we have also carried out an experimental comparison using two very different datasets and a set of basic queries. With this experimentation we can enrich the theoretical comparison with some conclusions, useful for scenarios similar to the one used, and test whether the theory is reflected in practice. To test the hypothesis, there are two main objectives in this project:

• To make a theoretical comparison of the formats using a proposed framework. This framework has to cover the most important aspects of HDFS that can affect performance. Apart from the comparison, it is also important to make a detailed description of each format's characteristics, as they can add new information that affects performance.

• To make an experimental comparison of those formats with two different datasets. With this second objective we intend to verify whether the conclusions drawn from the theoretical framework can predict the results obtained in this experimental comparison. It also makes it possible to enrich the theory.

1.3 Methodology

Given the two objectives of this project, we can distinguish between two very different methodologies, one qualitative and the other quantitative. According to Guba and Lincoln [20], this covers the two possible paradigms of a research methodology, calling the qualitative methodology the naturalistic paradigm and the quantitative one the rationalistic paradigm.

The first objective of creating a theoretical framework involves the qualitative methodology. This means that no specific scenario is considered, just the characteristics of each format, independently of the type of data stored. The focus is to have a theoretical idea of each format and the differences between them. For this goal, it is very important to make an effort to study the specific particularities of each file format, and the aspects in which they differ from the rest. This comparison can be considered the main goal of the project, as it allows us to have a global idea of which format will work better with any specific data and queries. This is useful for every situation, and allows data scientists to know which format can have better performance on their datasets given their data needs.

After this first theoretical part, we have proposed an experimental environment (quantitative methodology) to test the file formats. Although we can theoretically suspect which format is better for a known scenario, in some cases the decision is not so easy and doubts are possible. Thus, by making an experimental study, the general knowledge of the formats can be enriched, deducing more particularities of each format, or discovering how these particularities affect performance in scenarios similar to the ones used. This second approach is based on the rationalistic paradigm. This means that we focus on the study of particular scenarios, trying to make generalizations from those specific situations. These generalizations could be less useful than the naturalistic paradigm in some situations, because the data or queries at hand may be very different from the ones used.

With all this, we can propose a research question that we try to answer in this project: is it possible to figure out which format is better knowing the scenario? This is a very general question that we have divided into two sub-questions, each one related to one of the two objectives. Is it possible to build a theoretical framework that specifies the differences between the formats? And, is it possible to generalize the file formats' performance by running experiments in specific scenarios?

1.4 Document structure

The present work is divided into 6 chapters. The first one is the introduction we have just presented. In Chapter 2 we review the general aspects of a distributed file system, the most typical architectures, and how fault tolerance is managed.

Once it is explained what a distributed file system is, we have to study the Hadoop storage system (HDFS). To do so, we first study an earlier file system by Google (called the Google File System, GFS), which forms the basis of HDFS. By having a general idea of GFS, it is easy to understand the HDFS architecture, how the read and write operations

are done, and how the cluster robustness is managed. All these aspects are explained in Chapter 3.

Up to this point, all the necessary background theory is explained, so we focus on the file formats comparison. First, the theoretical comparison is done in Chapter 4. In this chapter all the considered file formats are explained, and at the end of the chapter we summarize their similarities and differences with a proposed framework. After the theoretical comparison, we present the experimentation part in Chapter 5, where we explain the experimental framework used, including the datasets and queries, as well as the characteristics of our cluster. Once the environment is described, the obtained results are shown. Finally, we present a summary of all the conclusions obtained along the whole project, including the future work, in Chapter 6.

Chapter 2

Distributed file systems

As we mentioned in Chapter 1, distributed file systems are a solution to the problems derived from the volume of Big Data. A distributed file system (DFS) [21] is simply a set of remote computers that acts as a single file system. It has advantages, such as the ease of sharing files, or the ability to scale by just adding new computers to the system. As a file system, it has to be able to store data and to allow users to retrieve it. A DFS must provide the same operations as a traditional local file system. The primitive operations of a file system are reading, writing, modifying and deleting files.

In this chapter we explain the main characteristics of DFSs and how they work. First, we define some basic concepts related to distributed file systems. After that, in the following section, we detail their characteristics, including transparency, scalability, remote access, multiple access, naming and fault tolerance. Once we know the main characteristics of DFSs, we explain the principal types of architectures they can have, distinguishing between centralized and decentralized architectures. Finally, we focus on one of the main characteristics of a DFS, fault tolerance.

2.1 Basic concepts

In distributed file systems we can find three main actors: a client, a server and a service. A service is software running on one or more machines that provides some functionality to a client. The client is a process that requests a service from a server. The server is software running on one machine that receives requests from clients and orders the execution of services. Using this terminology we can say that a DFS provides file services to clients. Each machine of the system is usually called a node. These nodes are the ones responsible for storing the files. A file need not be stored in only one node; it can be replicated, split into different parts, or both. The file services are the primitive operations mentioned before (read, write, modify and delete), applied to the files stored in the nodes.

Depending on where the files are stored we can talk about local resources or remote resources. A local resource means that the files needed to satisfy the client request are stored in the same machine where the service is running. If that is not the case, and the files are stored in another one, we talk about remote resources. When dealing with remote resources it is necessary to move the

files between nodes, and this transfer means worse performance. Usually the clients do not know where the remote resources are stored. For this task there is a namespace, which is responsible for linking the names of the files with their data location. It can be a file with a direct description of the location for each file name, but there are also other ways to manage the namespace, as we will see later.
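A minimal way to picture the namespace is as a mapping from file names to the nodes that hold their data. The sketch below is only an illustration of that idea; the class name, file paths and node names are invented for the example and do not correspond to any concrete DFS.

```java
import java.util.List;
import java.util.Map;

// Illustrative namespace: each file name maps to the list of nodes storing its data.
public class Namespace {
  private final Map<String, List<String>> fileToNodes;

  public Namespace(Map<String, List<String>> fileToNodes) {
    this.fileToNodes = fileToNodes;
  }

  // Resolve a file name to the nodes that can serve it; null if unknown.
  public List<String> locate(String fileName) {
    return fileToNodes.get(fileName);
  }

  public static void main(String[] args) {
    Namespace ns = new Namespace(Map.of(
        "/data/flights.csv", List.of("node-03", "node-07", "node-12"),
        "/data/events.json", List.of("node-01", "node-05")));
    System.out.println(ns.locate("/data/flights.csv")); // [node-03, node-07, node-12]
  }
}
```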

2.2 Characteristics

2.2.1 Transparency

We have defined a DFS as a system acting like a conventional, centralized file system. Thus, the multiplicity and dispersion of servers and storage devices have to be transparent to clients. We can distinguish different types of transparency in a distributed file system.

The main one is network transparency, which means that the interaction of clients is the same as in a local file system. The clients do not need to know the details of where the files are stored, or even that there are multiple machines. This dimension includes access transparency and location transparency. Access transparency means that resources are requested by clients in a uniform, single way, regardless of how the file access has to be performed on each node. Location transparency allows the clients to be unaware of where the file is stored. Thus, the request made by a client will always be the same, regardless of which machine stores the file. It is worth noting that there is a strong dependency between these two transparency properties.

The other transparency type is user mobility. The client can access the DFS from any machine belonging to the system, and will have the same interaction regardless of the machine used for access.

In addition, all the other issues resulting from the distribution, such as multiple access to a file, or machine or communication faults, are completely transparent to the clients. All these problems are managed by the DFS, and therefore the client does not need to take care of them.

2.2.2 Naming

Naming refers to the mapping between the names of the files and their data location. It is the way the DFS knows which data corresponds to a certain name. The namespace is used for this task. There are different approaches to implement it; here we explain three of them.

The first technique, and the easiest one, is to simply omit the namespace. With this approach the file name explicitly indicates the location of the data, usually using the syntax "host:local file path". This technique has one main disadvantage: location transparency is lost. Thus, we do not consider this approach valid for DFSs, as we have considered the transparency characteristic mandatory.

The second one is based on linking directories in each individual machine. The idea is the same as the mount command in UNIX. Once a directory is attached locally in a

machine, its files can be named in a local way. Using this approach we find that the shared namespace is not the same for all the machines. Usually, this fact is perceived as a serious problem for DFSs.

The last technique is to use a single namespace stored in a file. Every machine of the system must have access to this file. The idea is to query this global namespace to find the location of a resource, knowing its name, just like in a conventional file system. The main problem of this approach is making the same namespace visible to every node. One possibility is to maintain the whole namespace in one specific machine, and query it there whenever it is needed, as in a master-slave architecture.

An alternative to storing the namespace in a single machine is to replicate the namespace file to every node of the DFS. With this approach the main problem is keeping all the replicas updated, and deciding what to do when some nodes have different values in their namespace.

This last technique has a variation that can avoid problems when the namespace grows too much. It consists of having different levels. The idea is to split the namespace file into parts and store them in different nodes. To know where these parts are stored, a "parent" namespace file is created that points to the location of the parts of the original namespace file.
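The multi-level variant can be pictured as a lookup in two steps: the parent namespace says which node holds the relevant namespace partition, and that partition is then queried for the file. The sketch below is only a schematic of that idea under simplified assumptions (partitioning by path prefix); all class, path and node names are invented.

```java
import java.util.Map;

// Schematic two-level namespace lookup: parent namespace -> partition node -> file location.
public class TwoLevelNamespace {
  // Parent namespace: path prefix -> node that stores that namespace partition.
  private final Map<String, String> parent;
  // Per-node partitions: node -> (file name -> node holding the file data).
  private final Map<String, Map<String, String>> partitions;

  public TwoLevelNamespace(Map<String, String> parent,
                           Map<String, Map<String, String>> partitions) {
    this.parent = parent;
    this.partitions = partitions;
  }

  public String locate(String path) {
    String prefix = path.substring(0, path.indexOf('/', 1) + 1); // e.g. "/logs/"
    String partitionNode = parent.get(prefix);          // step 1: find the partition
    return partitions.get(partitionNode).get(path);     // step 2: query that partition
  }

  public static void main(String[] args) {
    TwoLevelNamespace ns = new TwoLevelNamespace(
        Map.of("/logs/", "node-02", "/data/", "node-09"),
        Map.of("node-02", Map.of("/logs/2018-06-29.log", "node-14"),
               "node-09", Map.of("/data/flights.csv", "node-03")));
    System.out.println(ns.locate("/logs/2018-06-29.log")); // node-14
  }
}
```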

2.2.3 Scalability

One of the main reasons for the creation of DFSs is their scalability, as they can store much more and compute much faster than a single machine.

Storage scalability refers to the capacity to store enormous amounts of data. This is possible because the distributed file system has a lot of machines, each one with some storage capacity. Even if the amount of data grows enough to take up all the storage capacity, it is always possible to add new machines to the system, adding their storage capacity to the distributed file system.

The only problem regarding storage scalability appears, in some cases, when managing the namespace. When there is one single namespace file for the whole system (the third technique explained in the previous section) and the number of files grows too much, the namespace also grows. A very big namespace may be difficult to manage if it is stored in only one machine, or the queries to it can be slow because of its large volume. A solution to this problem is the "parent namespace" variation.

The other type of scalability is computational scalability. Computation may not be a typical concern of file systems (it is more related to higher levels like data analytics). However, the reading and writing tasks can be considered computation. If a request is made to read or write a very big file, the time needed to serve it can be very high using a single machine. But if we spread the file across different machines, it becomes possible to read or write in parallel, greatly reducing the time taken to finish the task. This parallelization is possible if these large files are spread across different machines. This spreading makes the naming task more difficult, as one file has several locations, and it can cause a great increase in the namespace size.
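As a rough illustration of this parallel speed-up, the sketch below computes how many fixed-size blocks a large file occupies and how long a full scan would take on one machine versus several in parallel. The block size, file size and per-node throughput are arbitrary example values chosen for the arithmetic, not figures measured on any real cluster, and the parallel estimate ignores all overheads.

```java
// Back-of-the-envelope illustration of parallel reads in a DFS (example values only).
public class ParallelReadEstimate {
  public static void main(String[] args) {
    long fileSizeMB = 1_000_000;          // a 1 TB file, expressed in MB
    long blockSizeMB = 128;               // assumed block size
    long throughputMBs = 100;             // assumed sequential read speed per node (MB/s)
    int nodes = 50;                       // assumed number of nodes holding blocks

    long blocks = (fileSizeMB + blockSizeMB - 1) / blockSizeMB; // ceil(size / blockSize)
    double singleNodeSeconds = (double) fileSizeMB / throughputMBs;
    double parallelSeconds = singleNodeSeconds / nodes;         // ideal case, no overheads

    System.out.printf("blocks=%d, single node=%.0f s, %d nodes in parallel=%.0f s%n",
        blocks, singleNodeSeconds, nodes, parallelSeconds);
    // prints: blocks=7813, single node=10000 s, 50 nodes in parallel=200 s
  }
}
```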

2.2.4 Remote Multiple Access

The other main characteristic of DFSs is the possibility of accessing them from multiple points and the ability to easily share files.

Having remote access to the file system means that it is not necessary to be physically near the data to access it. When making a request (i.e., read or write) for a remote file, it is first necessary to find which machine stores it (using the namespace). Once the machine storing the file (the server) is known, a data transfer between the client and the server takes place, so the request can be satisfied.

There is one alternative to transferring the whole file from the server to the client. It consists in calling a remote service. Instead of transferring the complete file, a preprocessing step is performed on the server. This is possible when the requested data is not the entire file, for example when only some records of the file are requested. The idea is to extract only the necessary data from the file and send it to the client.

Regardless of this possibility, it is very important to make appropriate use of cache memories. It is possible to avoid many file reads from disk by using the caches, because the files may have been accessed before and the needed data may still be stored locally in the cache. Thus, a read from disk is avoided and the request is answered more quickly, since retrieving data from the cache is faster.

As well as being able to access the system from any of its machines, more than one client or machine can access the same file at the same time (multiple access). This can become a problem when some of these accesses modify the file. In the case of two simultaneous writes, the modifications made by the first client can be lost, because the second one does not take the previous changes into account. And if one client reads a file just before it is modified by another one, it will be working with an out-of-date version of the file, without the modifications.

These problems can be solved using permissions and locks. Thus, two or more different processes can read the same file, but if one is writing, the file is locked and no one can read or write it until the first process finishes the writing. These permissions and locks can be more or less restrictive, depending on the system's needs.
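The "many readers, one writer" rule described above is the classic readers-writer locking scheme. A minimal sketch of the idea, using Java's standard ReentrantReadWriteLock rather than the locking mechanism of any particular DFS, could look like this:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustration of the "many readers / single writer" rule with a standard readers-writer lock.
public class SharedFile {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private String contents = "";

  // Any number of readers may hold the read lock at the same time.
  public String read() {
    lock.readLock().lock();
    try {
      return contents;
    } finally {
      lock.readLock().unlock();
    }
  }

  // A writer takes the exclusive write lock, blocking readers and other writers.
  public void write(String newContents) {
    lock.writeLock().lock();
    try {
      contents = newContents;
    } finally {
      lock.writeLock().unlock();
    }
  }
}
```

Real systems make this more or less restrictive (for example, allowing stale reads), which is exactly the trade-off mentioned in the paragraph above.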

2.3 Architectures

We have explained what a DFS is and its characteristics. However, there are many ways in which a file system can be built, depending on how its architecture is organized. We distinguish two main groups of architectures: centralized and decentralized. These architectures refer only to the server side; the clients simply access these systems.

The centralized architectures are characterized by having one main node called the master. This master node is responsible for storing all the metadata of the file system, especially the namespace, and for receiving the clients' requests. In this type of architecture the master does not store any file data, and the rest of the nodes are the ones that do this work. The responsibility of the master is to receive the clients' requests and answer them by just indicating the node where the requested data is. Thus, its main job is naming. Once the client knows in which node the data is stored (or in which node it can write the data), it

just makes the request directly to that node. The master could also be responsible for making the requests to the nodes itself and then answering the client with the requested data, but this approach would lead to a possible overloading of the master node (because it must perform many more operations), and to unnecessary data movement (the data has to be moved from the node where it is stored to the master, and then to the client). The two main examples of centralized architectures are the Hadoop Distributed File System [9] and the Google File System [22]. These two file systems will be explained in the next chapter.

There is an alternative within the centralized architectures, and it consists in having more than one node acting as the master. The concept is the same as when there is one single master, but making a sub-distributed system for the master. This is useful because, in some cases, the namespace can grow a lot and storing it in only one node can become a bottleneck. An example of this approach is implemented in the Ceph distributed file system [23].

On the other hand, the decentralized architectures are the ones in which there is no master, and all the nodes are equivalent. In this case, the main problem is where the namespace is stored, and where the client has to make its requests.

An easy option is to not store the namespace and make the user responsible for knowing where each file is stored. Then it is only necessary to make an FTP (File Transfer Protocol) [24] connection to the node. This approach is not suitable, as transparency is completely lost. Using our definition of DFS we cannot consider this option.

The solution to this problem is to store the namespace inside the nodes. It is obvious that the complete namespace cannot be stored in every single node, as it can grow large enough to be a problem of space and, more importantly, whenever the namespace changes, those changes must be updated on every node. This makes the alternative of storing the namespace in each node not viable.

The proper way of storing the namespace inside the nodes is to store only a piece of it in each node. This may not be easy, as it is necessary to find a proper way to split the namespace and, when a query is made, to know which part to access. An approach to this splitting, based on connecting a directory to only one node and splitting the namespace in a tree form, is taken by the Newcastle Connection [25].

Finally, there is the option of not storing a namespace at all and determining file locations in other ways, such as calculating a hash code to decide where to store each file. This approach can be applied to entire files, or even to blocks or pieces of files (in the cases in which the files are large). One example of a DFS that uses this technique is GlusterFS [26].
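A minimal sketch of the hash-based placement idea is shown below. It derives the storing node from the file name alone, so no namespace lookup is needed; this is only an illustration of the principle, not GlusterFS's actual algorithm, and all names in it are invented.

```java
import java.util.List;

// Toy hash-based placement: the node that stores a file is derived from the file name alone.
public class HashPlacement {
  private final List<String> nodes;

  public HashPlacement(List<String> nodes) {
    this.nodes = nodes;
  }

  public String nodeFor(String fileName) {
    int bucket = Math.floorMod(fileName.hashCode(), nodes.size());
    return nodes.get(bucket);
  }

  public static void main(String[] args) {
    HashPlacement placement =
        new HashPlacement(List.of("node-01", "node-02", "node-03", "node-04"));
    System.out.println(placement.nodeFor("/data/flights.csv")); // always the same node for this name
  }
}
```

Note that a scheme this simple reshuffles almost every file when a node is added or removed, which is why real systems use more elaborate (consistent or elastic) hashing.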

2.4 Fault tolerance

Once we know how a DFS architecture works, we have to take into account that a lot of nodes are needed in the system to take advantage of its characteristics. Having a lot of machines has the benefit of great storage scalability, but when this number grows too much, the probability of failure of any of them grows as well. At some point it is certain that machine failures will occur. Thus, distributed file systems must be able to maintain file consistency even if some machines fail.

Svobodova [27] defined two file properties related to fault tolerance: “A file is recoverable if it is possible to revert it to an earlier consistent state when an operation on it fails or is aborted by the client. A file will be called robust if it is guaranteed to survive crashes of the storage device and decays of the storage medium”. A DFS should be recoverable and robust.

The most probable fault in a distributed file system is a machine crash, losing all its data. As there are a lot of nodes making up the distributed file system, the crash of any of them is very common. It is not recommended to try to avoid device crashes, but simply to assume them. The main solution to this problem is the creation of replicas of the data in other nodes. Even the namespace must be replicated, as it can also be lost. This makes the files robust.

Another possible fault is the corruption of files. As with the previous problem, it is commonly resolved using replicas; file corruption usually affects only one of the replicas. The corruption is solved by just removing the corrupted replica and creating a new one. Another possibility is to add additional information to the files to make it possible to recover them to a previous stable state. This alternative is very useful when more than one replica is corrupted and it is not possible to be sure which of the replicas is the good one. These solutions ensure the files are robust and recoverable.

Finally, there can be another problem in a DFS: communication faults. This problem occurs when a request is made and there is no answer. It can happen because the receiver machine has crashed or because of connection problems. Whatever the problem is, there are two main solutions: resend the request or try to connect to another machine. The first solution is obvious, just resend and wait. If there is still no answer, the second solution is used, and the request is sent to another node. If the request is a read, it is sent to a machine that stores a replica of the requested file; if the request is a write, it can be sent to any machine that can store it, and in this case the namespace must be updated.
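The retry-then-failover behaviour for reads can be sketched as a small loop over the replica locations. The interface and names below are invented for illustration and do not correspond to any concrete DFS client; the point is only the order of attempts (resend to the same node, then move to the next replica).

```java
import java.io.IOException;
import java.util.List;

// Illustration of the "resend, then try another replica" strategy for read requests.
public class FailoverReader {

  // Hypothetical transport: asks one node for the bytes of a file, throwing on timeout/crash.
  interface NodeClient {
    byte[] read(String node, String fileName) throws IOException;
  }

  private final NodeClient client;
  private final int retriesPerNode;

  public FailoverReader(NodeClient client, int retriesPerNode) {
    this.client = client;
    this.retriesPerNode = retriesPerNode;
  }

  public byte[] read(String fileName, List<String> replicaNodes) throws IOException {
    IOException lastError = null;
    for (String node : replicaNodes) {             // try each replica in turn
      for (int attempt = 0; attempt < retriesPerNode; attempt++) {
        try {
          return client.read(node, fileName);      // success: return the data
        } catch (IOException e) {
          lastError = e;                           // resend to the same node, then move on
        }
      }
    }
    throw new IOException("all replicas failed for " + fileName, lastError);
  }
}
```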

Chapter 3

Hadoop Distributed File System

Up to now we have explained the general characteristics of a DFS. In this chapter we focus on a specific system, the Hadoop Distributed File System (HDFS) [8, 9]. HDFS is a distributed file system based on a centralized architecture, in which there is one single master node and a lot of slave nodes that store the data. It is a system based on the Google File System (GFS); thus, many characteristics are shared by both file systems. The main shared characteristic is that they are designed to run on commodity hardware. Commodity hardware refers to devices that are inexpensive and easy to obtain; the main example is personal computers (PCs).

In this chapter we explain the architecture of HDFS, how the reading and writing operations are done, and how it maintains the robustness of the files. But, since HDFS is strongly based on GFS, we will first explain GFS prior to going into the details of HDFS. This first file system makes some assumptions that HDFS has also inherited. Most of them are related to the general DFS characteristics:

• It is assumed that there will be hardware failures in the nodes. As these file systems use many devices, it is usual for some of them to fail. It is not worthwhile to make an effort trying to avoid this problem; it is simply assumed.

• As they are DFSs, there will be multiple reads and writes at the same time. This means that two or more different clients can be modifying the same file simultaneously, so these simultaneous accesses need to be managed. Even so, neither file system tries to solve the problem of losing data or reading out-of-date files; this can be managed by upper layers or tools running on top of the file system. This is a hard problem for these file systems to solve on their own, so they do not try to do it.

• The files stored in the file systems will usually be very large, meaning by "very large" of the order of gigabytes or even terabytes in size. This leads to two different problems: the simple fact that very big files are difficult to store in one single node, and the great disk storage capacity that is required. For example, there are clusters that store petabytes of data [28].

• It is better to move computation rather than data. That is, it is preferred to perform the computation on a distant node that has the data locally rather than to move the data to the nearest node. This is because moving computation sends very little data through the network, whereas moving the data (taking into account that files are very big) is very expensive.

• The data reads will be either for large pieces of data in streaming fashion or for very small pieces of random data. The streaming reads are not a problem, because the data is normally allocated contiguously. For the writes, it is assumed that the majority of them will be similar to the large streaming reads, and very few will be small writes. Once a file is written, it is rarely modified; modification is possible, but as it is very uncommon it is not optimized.

3.1 GFS

It is well known that Google is one of the biggest companies in the world, and some of its products provide storage and processing via the Internet (such as Google Drive). Because of the number of users Google has, they must deal with an enormous data processing demand. A distributed file system called the Google File System (GFS) [22] is the solution to this problem. GFS provides very scalable storage and processing capacities. Nowadays many GFS clusters are deployed for different purposes. The biggest ones have over 1000 nodes and more than 300TB of disk storage.

As mentioned in the previous chapter, a distributed file system must provide sharing capacities. GFS supports simultaneous connections of clients to the same data. For concurrent reads and writes on files, GFS uses file snapshots and atomic record append operations.

GFS is designed as a master-slave architecture with two types of nodes: one master node and many chunkservers. There is only one master and it does not contain any data; it only has metadata, including the namespace. The chunkservers are the ones which store the data. Typically the chunkservers are Linux machines running a user-level server process. Data files are divided into chunks of fixed size (usually 64MB), which are stored only on the chunkservers' local disks. The chunk size is not necessarily the size of the files or of the operating system blocks. Each chunk has a 64-bit identifier called the chunk handle.

GFS assumes that hardware failures are common, so it is possible that a chunkserver fails and loses all its data. Because of this possibility, every chunk must have several replicas, typically 3. A replica is a copy of the chunk that is stored in a different node from the original chunk or any other replica. The replicas must be updated if the chunk data changes.

As the GFS architecture is master-slave, all the requests go to the master, and there is only one. Because of this, it is important to give as little work as possible to this node. For example, the master never performs read or write tasks. For these read and write operations, the client asks the master for the identity of the chunkservers that store the data (read) or where the data can be written (write); so the master just needs to access the namespace and answer with the location of the corresponding nodes.

As we have mentioned before, the master node only contains metadata of the file system. There are three types of metadata inside the master: the namespace and file-to-chunk mapping, the chunkserver status, and a log. The namespace contains the information about where each file, and its chunks, is allocated, to make reads possible. It is necessary to

know where the file is in order to tell the chunkserver to serve the read. This namespace is stored in memory, so the master can locate the chunkservers as quickly as possible, without reading from disk. Because of this in-memory storage, the size of the namespace is limited by the master's memory size (once the memory is full it is not possible to store more namespace information). This is not considered a big problem because this type of metadata is compact.

In the namespace metadata the master must store some specific information about chunks, in order to provide data consistency. For each file, the information of every chunk and the chunkserver where it is stored is saved, in the order of the replicas. Each chunk has a version number (if two chunks of the same data have different version numbers, then something is wrong). For chunk modifications (or chunk mutations), all the changes made are temporarily saved in an ordered list, and these modifications are applied to all the replicas in the same order.

The master node also stores the chunkserver status. This job is done using heartbeat messages sent every few minutes. Each chunkserver sends a heartbeat message periodically to the master node. With these messages, the master can detect when a chunkserver fails (if it does not receive heartbeats from that node). This data is also stored in main memory, and there are no size problems because it is usually very little data with a small storage volume.

The log is the only persistent data stored in the master node. All the critical metadata changes are stored here, so it is possible to know all the changes made and, if necessary, make a metadata backup. The read and write operations are not stored in this log, only the metadata changes.

Finally, to verify data consistency, in a similar way to the heartbeats, there is a periodic handshaking process between the master and every chunkserver. In this handshake the master can check whether the chunks are synchronized or not (whether every replica of a chunk produces the same handshake). If the handshaking determines that the data is not consistent, it is necessary to solve it. If the data consistency cannot be restored, a rollback to a consistent previous state is made (only in extreme situations).
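To make the three kinds of master metadata more concrete, the sketch below models them as plain Java classes. The class and field names are invented for illustration and do not come from the GFS paper or its implementation; only the grouping (namespace and file-to-chunk mapping, chunkserver status, operation log) follows the description above.

```java
import java.util.List;
import java.util.Map;

// Schematic view of the metadata a GFS-style master keeps (all names are illustrative).
public class MasterMetadata {

  // A chunk: 64-bit handle, version number, and replica locations in replica order.
  static class ChunkInfo {
    long chunkHandle;
    long version;
    List<String> replicaChunkservers;
  }

  // 1) Namespace and file-to-chunk mapping, kept in main memory.
  Map<String, List<ChunkInfo>> fileToChunks;

  // 2) Chunkserver status, refreshed by periodic heartbeats (last heartbeat time per server).
  Map<String, Long> lastHeartbeatMillis;

  // 3) Operation log: the only persistent part, recording critical metadata changes.
  List<String> operationLog;
}
```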

3.2 HDFS Architecture

Once we know what the architecture of GFS looks like, it is easy to understand HDFS's architecture [29], since it is strongly based on Google's distributed file system. HDFS has a master-slave architecture. There is a single Namenode, which stores the file system namespace and receives clients' requests, and a large number of Datanodes. As in GFS, the Namenode (master) does not store the file data, and the Datanodes are the ones where the files are stored. The machines typically run a GNU/Linux operating system.

The Datanodes are responsible for storing the data. In the same way as the chunks of GFS, this data is divided into blocks, by default of 128MB. Originally the block size was 64MB, as in GFS, but from Hadoop 2.7.3 it was changed to 128MB. This large block size is normally used because of the assumption of large reads of contiguous data, favoring the seek time but slowing down small reads, as the blocks are read in their entirety. This large block size is also due to the initial assumption of having very large files in the file system, and very few small ones. What is more, one of the main reasons for enlarging

the block size is to improve the performance of the Namenode, as it has to manage a smaller number of blocks.

[Figure 3.1: HDFS architecture]

As in the GFS architecture, the blocks must be replicated on different Datanodes (by default, 3 replicas per block). This is done because of the high probability of hardware failure in the nodes. Note that this replication only applies to Datanodes; we will see what happens when the Namenode fails in Section 3.4.3.

The HDFS namespace is designed as a hierarchy of files and directories. These files and directories are represented in the Namenode by inodes, which record attributes like permissions, modification and access times, and disk space quotas. The Namenode maintains the namespace tree and the mapping between file blocks and Datanodes (the physical location of file data). The namespace in HDFS is stored entirely in main memory, and it is kept only in the Namenode. This means that the namespace, and therefore the number of files, is limited by the amount of memory the Namenode has. The inode data and the list of blocks belonging to each file are called the image. The image is saved periodically on persistent storage in the Namenode; this copy of the image is called the checkpoint. The Namenode also stores a modification log of the image, called the journal, on its local disk. The location where the block replicas are stored is not part of the image, because this data is updated in real time through heartbeats sent periodically from the Datanodes to the Namenode.

Apart from the Datanodes and the Namenode, it is important to notice that there are HDFS clients that request reading and writing operations. Again, as in GFS, the Namenode does not deal with read and write requests from the clients. Instead, it only answers with the Datanodes where the data is stored (for a reading request), or with the Datanode where the client can write, including where to allocate the replicas (for a writing request).

[Figure 3.2: HDFS writing operation]

An overview of the HDFS architecture is shown in Figure 3.1. It is important to remember again that the clients directly request reads and writes from the Datanodes, freeing the Namenode from that responsibility. Although in the figure the master is outside any rack, it can be inside one; there is no restriction on where it must be located.

3.3 HDFS Writing and Reading

If we look carefully at Figure 3.1 we can see that the client is the one in charge of requesting the writing and reading tasks. As mentioned in the previous section, the Namenode does not perform these tasks, and the Datanodes are the ones that receive the reading and writing requests.

The writing task is illustrated in Figure 3.2. First, the client sends a message to the Namenode requesting the writing of a new file (1). The Namenode performs various checks to ensure that the file does not already exist and that the client has permission to write; if it does not have permission or the file exists, the writing process fails. If this step does not fail, the Namenode sends an acknowledgement message to the client (2) and makes a record of the new file with a new inode.

Once the client has the approval of the Namenode, it splits the file into blocks and, for each one, asks the Namenode for the addresses of the Datanodes where the block can be written (3). This request is made to the Namenode because it stores the location and information about the Datanodes and how many blocks they are storing, whereas the client does not know where it is best to write the data.

When the Namenode sends the client the addresses of all the Datanodes where the data will be stored (one Datanode for each replica of the block) (4), the client sends a writing request to the nearest node (5), including the addresses of the rest of the nodes where the replicas will be stored. This first Datanode saves the block on its local disk and sends a writing request to the next Datanode on the list to store the next replica (6). This process is repeated with the next Datanode until all the replicas have been saved. When the last Datanode stores the last block replica, it sends an acknowledgement message to the previous node (7), and this one another to the previous, and so on. When the first Datanode receives its acknowledgement message, it sends another one to the client

to inform it that the writing process has been completed successfully (8), and the writing task is finished. Steps (3) to (8) are repeated for every block of the file.

[Figure 3.3: HDFS reading operation]

The reading task is easier than the writing one. An example of reading a 2-block file is shown in Figure 3.3. As in the writing operation, the first step is a request from the client to the Namenode asking which blocks make up the file and where they are stored (1). The Namenode sends the location of the blocks (including all the replicas) to the client. Once the client knows which blocks it needs and where to find them, it sends a reading request to the nearest Datanode that contains a replica of the first block (2). When the client receives the data of the first block, it repeats the same process with the next blocks in order (3).

A reading or writing operation can fail for multiple reasons. What is done in these situations is explained in Section 3.4.2.
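From the client's point of view, all of this exchange is hidden behind the standard HDFS Java API: the application just opens streams, and the client library talks to the Namenode and Datanodes as described above. A minimal sketch follows; the path is an illustrative assumption, while the classes are the usual ones from org.apache.hadoop.fs.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Writing and reading a file through the HDFS client API; the block pipeline and the
// Namenode/Datanode dialogue described above happen inside these calls.
public class HdfsReadWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/user/example/hello.txt"); // illustrative path

    // Write: the client library splits the data into blocks and drives the replica pipeline.
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
    }

    // Read: the client asks the Namenode for block locations and reads from the nearest replica.
    byte[] buffer = new byte[1024];
    try (FSDataInputStream in = fs.open(path)) {
      int n = in.read(buffer);
      System.out.println(new String(buffer, 0, n, StandardCharsets.UTF_8));
    }
  }
}
```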

3.4 HDFS Blocks Robustness

Knowing that Hadoop is built assuming that there will be hardware failures, it is very important to make sure that the data will always be available, even if a node fails. The main solution to this problem is to replicate everything.

3.4.1 Replication

As explained in the previous section, HDFS uses blocks to store the data files, and these blocks are equivalent to the chunks of GFS, but instead of 64MB, by default they are 128MB. The files are broken into pieces of this block size and spread through the cluster. But we know that a node storing a block can fail. Because of this, the data blocks must be replicated; typically each block has 3 replicas (as in GFS).

The Namenode makes all decisions regarding the replication of blocks. The replication philosophy in HDFS is to store the replicas as near as possible (trying to reduce the data flow through the cluster), while guaranteeing the smallest possible replica loss when a hardware

failure occurs. By default, the first replica is stored in the client node (the Datanode where the client is), as it is the nearest one; the second one in a different Datanode of the same rack; and the last one in a Datanode of a different rack. Thus, if an entire rack fails, there will be at least one replica available. If there are additional replicas, they are placed randomly. As an example, Figure 3.4 shows how three different blocks would be stored using 3 replicas each.

[Figure 3.4: HDFS replication example]

The Namenode must know the location of all the replicas, but it does not ask the Datanodes for this directly. The Datanodes are the ones that periodically send heartbeat messages to the Namenode with the information of the blocks they are storing. If one Datanode does not send any heartbeat for a long time, the Namenode assumes that the Datanode has failed and orders another replica of the blocks that the broken node was storing to be written.

The heartbeat messages do not only contain which blocks the Datanodes are storing; they also include a checksum for each block, to guarantee that they are not corrupted. If the Namenode discovers that one checksum is different from those of the other replicas, it orders the Datanode with the wrong checksum to delete the block, and orders another replica of the block to be written (not necessarily in the same Datanode). If one Datanode has several wrong checksums in its blocks, it is considered broken and a shut-down order is sent to it. In this case, all the blocks of that Datanode are rewritten in other nodes, using their replicas so as not to rewrite corrupted blocks.
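On the client side, the replication factor is just another per-file setting, and the resulting placement can be inspected. The sketch below uses the standard HDFS API for both; the path is an illustrative assumption.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Inspecting and changing replication through the HDFS client API (illustrative path).
public class ReplicationExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path path = new Path("/user/example/hello.txt");

    fs.setReplication(path, (short) 3);            // ask for 3 replicas of each block

    FileStatus status = fs.getFileStatus(path);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      // Each entry lists the Datanodes currently holding a replica of that block.
      System.out.println(String.join(", ", block.getHosts()));
    }
  }
}
```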

3.4.2 Reading/writing Failure

As we have seen in Section 3.3, the read and write operations can fail, usually because a Datanode does not answer.

Reading failures occur when a block is requested from a Datanode and it does not answer. This has an easy solution: just read the block from another Datanode that contains a replica of the block. If the Datanode is broken, the Namenode will notice when it stops receiving heartbeat messages from that node. The client does not need to report anything to the Namenode, because the problem could be a network failure, which does not mean that the Datanode is broken.

The writing failure is not so easy. When a Datanode fails, it is just skipped and the

When a Datanode fails, it is just skipped and the block is only written into the rest of the Datanodes. In this case, the block that is being written gets a new identity, and this new identity is notified to the Namenode, so it can change the block identity in the namespace. With this new identity, if the failed Datanode has actually stored the block (it is not broken, but a connection problem prevented the acknowledgement), the Namenode will notice this when it receives the corresponding heartbeat message. When this happens, the block with the old identity is ordered to be deleted.

3.4.3 Namenode Failure

We have seen what happens when a Datanode fails. But what happens if it is the Namenode that fails? It is possible to restore a Namenode by using another node. This auxiliary Namenode is called the Backup node. The Backup node receives a real-time copy of the checkpoint and journal mentioned in Section 3.2, and they are stored in persistent memory (disk). This real-time copy is not very difficult to maintain, since all the changes done by the Namenode are notified to the Backup node. Using the checkpoint and journal, it is possible to restore the Namenode by copying the checkpoint into main memory to restore the image and applying the changes stored in the journal. The placement of the block replicas is restored easily because the Datanodes periodically report the blocks they have, using the heartbeats. Even this last part of restoring the location of the blocks can be avoided by sending the heartbeats both to the Namenode and to the Backup node. This means that the Datanodes and the clients send their messages to the Namenode and to the Backup node. Thus, the Backup node will have the same data as the Namenode, and when the Namenode fails, the Datanodes and clients can continue acting the same way without knowing that the Namenode has failed (it is transparent for them).

Chapter 4

HDFS storage formats

Because of HDFS characteristics, new formats for storing files have been developed, trying to use the distributed file system features to their advantage. The main facts to take into account are the large size of the files, and that they are split into blocks spread around different nodes. These file formats are designed to store structured data that can be described with a schema (more or less complex). The storage of unstructured data (natural language) is not so common and has been less studied in this environment. Thus, we can assume that, in these file formats, data can be described with rows (or records) and columns. The mentioned schema describes the structure of the rows. One simple example of this type of data is a plain table, but it can be more complex, such as data with nested structures. Knowing that structured data can be described as rows and columns, there are two possibilities when storing it: writing one row after another, or storing each complete column together. The traditional way is storing by rows, and the formats that do this are classified as row formats. However, in many cases better reading performance can be obtained if the data is organized in a columnar format, where the values of the columns are stored together. A columnar format can read just the required columns without passing through the unnecessary ones. What is more, storing by columns keeps together values of the same type, which are probably homogeneous; thus, it is easier to achieve a more efficient storage and get better results with compression. Nevertheless, it has some disadvantages, like difficulties for adding new records to the file after it has been written, or taking a lot of time for a complete scan of the data (data scanning is usually required by rows). Each storage orientation can work better than the other one depending on the situation, being very good and fast for some operations and very slow for others. As a middle ground there is a mixed possibility, which consists of first splitting the file by rows, and then storing each group of rows in a columnar fashion. This alternative takes advantage of the columnar layout inside each row group, and mitigates its problems thanks to the row splits. Because of this balanced performance, this mixed approach is the one used by the column-oriented formats in HDFS. In this chapter we are going to study the main file formats used in HDFS. After explaining each one we will compare them. To make this comparison, we have designed a framework with some important features for HDFS formats. This framework includes the following aspects:

• Storage orientation. The orientation refers to the type of format, row or columnar.

• Splitability. If the format allows to split the file into different blocks.

• Compression. The types of compression that the file format allows. In Hadoop, the most commonly used codecs include Deflate [31], Snappy [33] and LZO [34], among others [30, 32].

• Human readable. This property refers to formats that store the data in a way that humans can read. The opposite is a binary serialization, not readable by humans but readable by machines. Binary serialization has the advantage that it can avoid unnecessary text, resulting in a smaller storage volume for the file.

• Schema storage. As we have mentioned, structured data can be described with a schema. Some of the formats explicitly store this schema, but some others do not, and the client must know the schema in order to make queries.

• Schema evolution. The schema does not need to be fixed; it can change over time. For some formats it is possible to change the schema, and with the changed schema new records can be added with the new row structure. This possibility of changing the schema is called schema evolution.

• Complex data types. We have mentioned that the schema can be more or less complex. The simplest schema would be one with only primitive type values (numeric types, string, boolean and byte), and a more complicated one would have complex data in a column (inner structs, arrays, ...). Some formats can natively store data with complex types; others can not.

• Data append. This characteristic refers to the possibility of adding new data to the file without rewriting it entirely. Updating or removing records is not taken into account because, in this environment, data is usually modified by processing the old data and creating a completely new dataset.

• Random access. Some queries only need to access a few specific values, and some formats make it possible to know in which blocks those values can be. Thus, reading unnecessary blocks is avoided.

• MapReduce compatibility. As MapReduce is the original computation framework of Hadoop, it is important for the file formats to be usable with this framework. There are more computation frameworks that can be used in Hadoop, but MapReduce is the most important and the most used one.

At the end of the chapter there is a final table to compare all the formats with this framework.

4.1 Text files

The most widely used file formats on any platform are text files. Text formats store the data in a human-readable way. In this section we explain two of the most popular text file formats, CSV and JSON. Both of them can be used with MapReduce easily.

4.1.1 CSV

A CSV file (comma-separated values) is a delimited text format. It uses the comma to separate the different columns of each row. A file stored as CSV can be represented as a plain table. It is a row-oriented format, and can be easily split by dividing the file into groups of rows. Each block (row group) can be compressed with any compression supported by Hadoop. Some compression codecs allow splits, so the entire CSV file can be compressed and split afterwards. In some cases the schema is defined in the first row of the file, but only the names of the columns are represented (not the types), and in many cases this information does not appear at all. Thus, we do not consider that CSV explicitly stores the schema. The data structure can not change over time; the number and type of the columns must always be the same. The data stored in a CSV file must have primitive types. Complex types are not supported by default (although it is possible to represent enum values). The access to the data has to be done sequentially, without being able to skip any block; there is no metadata to know anything about the values inside the blocks. Despite these disadvantages, the CSV format has some advantages. Data appends are possible: new rows can be added at the end of the file, with the only restriction of not changing the number of columns.
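Because a CSV file is read as plain lines of text, a MapReduce job typically parses each row inside the map function. The following sketch (assuming the default TextInputFormat and a purely illustrative column index) shows the idea:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Receives one CSV line per call and emits the value of a single column;
// the column index 3 is purely illustrative.
public class CsvColumnMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split(",", -1);   // keep empty trailing fields
        if (fields.length > 3) {
            context.write(new Text(fields[3]), NullWritable.get());
        }
    }
}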

4.1.2 JSON

JSON is another human-readable file format. It is a row-oriented, key-value text format. Again, it can be easily split by making row groups. JSON files can be compressed and then split, or each split can be compressed individually. As with CSV, the compression codecs that can be used with this format are any allowed in Hadoop. JSON files do not store any explicit schema, but it is possible to know the column names as they are written in every row of the file (the key of each key-value pair is the column name). The data do not even have to follow a specific schema. This means that schema evolution is possible without any restriction, but it has the disadvantage that the data can lose its structure and become very heterogeneous (there can be different records with no similarities in their structure). To add new records, they are just appended at the end of the file. JSON allows any data type, including complex data like structs or arrays. However, there is no metadata in these files. Thus, access to specific records or values requires a sequential read; there is no possibility of skipping unnecessary blocks.


Figure 4.1: SequenceFile record compression

4.2 SequenceFiles

A SequenceFile [35, 36] is a flat file format based on key-value pairs, with one key and value per row. One of the main uses of SequenceFiles is to put small files into a larger single file, using one key-value pair to store each small file (the key being the file name, and the value the file content). For example, if there are 100,000 files of 10KB, it is possible to store all of them into one single SequenceFile. This is a big advantage, as the Hadoop Namenode does not need to store the metadata of all those small files; it only has information about the SequenceFile. Besides this use, the temporary outputs of MapReduce jobs are stored using SequenceFiles. Thus, it is obvious that it is possible to use this format with the MapReduce framework. A SequenceFile has a header with the file information, such as the compression type (or whether it is compressed), the version number, the classes of the keys and values, etc. After that header, the list of records is written, with a sync marker every few records. These records contain metadata about the length of the whole record and of the key, and the raw data of the key and value. This structure can be seen in Figure 4.1. The sync marker permits seeking to a random point in a SequenceFile. This is important to make SequenceFiles splittable. The splits can not break records. For the compression there are two possibilities, record compression and block compression. Record compression just compresses the value of each record, as the example in Figure 4.1 shows. The compression techniques can be any supported by Hadoop. In block compression, a set of records is compressed into a single block. Each block stores the number of records, the length of all the keys, the compressed keys, the length of the values, and the compressed values. Thus, block compression can be seen as something similar to a column-oriented storage, as all the keys are stored together, and likewise the values. These blocks are not related to the HDFS blocks. Between each block a sync marker is written. This structure is shown in Figure 4.2. The splittability using block compression is the same as with record compression, as long as no block is divided: a block must be stored in a single split.


Figure 4.2: SequenceFile block compression

Block compression provides a better compression ratio than record compression. Therefore, block compression is generally preferred when using SequenceFiles. SequenceFile is considered a row-oriented format. However, in the very special situation in which the data schema has only two columns (one stored in the key and the other in the value) and block compression is used, the storage would be like a columnar format. But as this is a very strange and uncommon situation, SequenceFile is simply considered a row format. Talking about a schema in SequenceFiles is not so obvious. A SequenceFile just stores a key and a value; if there is structured data inside one of them, it is not known by this format. The only "schema" stored by SequenceFiles is the key and value classes. In some cases, if we are storing data with only 2 columns, these key and value classes can act as a complete data schema; but this is not the usual scenario. Most of the time it is necessary to know how the data is structured, and there is no schema stored in the SequenceFile. Thus, we can not consider schema evolution, there being no schema in the first place. Even if we consider the key and value classes as a schema, these classes can not be changed after being established. Also, we will not consider that SequenceFiles support complex data types, because the classes allowed for key and value are just simple types (such as string, bytes, integer values). As with JSON and CSV files, data can be added without any problem: any key-value pair can be written by just appending it at the end. However, unlike text files, SequenceFiles do store some metadata. It is little, but there are sync markers that can be used to jump to a specific record or block. This can be useful if the keys are ordered and can be used as an index. Thus, in some very specific cases SequenceFiles can help random access queries to data. However, we will not consider that they provide random access to data, as these situations are very uncommon. The SequenceFile is the base for other types of files, such as the MapFile.

A MapFile is a directory that contains two SequenceFiles, a data file and an index file. The idea is to have the data in one SequenceFile and an index for faster access in another SequenceFile. The data file contains all the key-value records; each key must be greater than, or equal to, the previous one. The index is just a list of the keys with the starting byte position of the corresponding key. The index is loaded entirely into main memory, but if it is too large, it is possible to load only a fraction of it.
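As an illustration of how a SequenceFile is produced programmatically, the sketch below (a minimal example, assuming the Hadoop 2.x SequenceFile API and a hypothetical output path) writes a few key-value pairs using block compression with Snappy:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.SnappyCodec;

public class SequenceFileWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/data/example/output.seq");   // hypothetical path
        SequenceFile.Writer writer = null;
        try {
            writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(path),
                    SequenceFile.Writer.keyClass(IntWritable.class),
                    SequenceFile.Writer.valueClass(Text.class),
                    // block compression groups keys and values together, as in Figure 4.2
                    SequenceFile.Writer.compression(CompressionType.BLOCK, new SnappyCodec()));
            for (int i = 0; i < 100; i++) {
                writer.append(new IntWritable(i), new Text("record-" + i));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}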

4.3 Avro

Apache Avro [37, 38] is a data serialization system, but it is also considered a file format. Similar to the JSON format, Avro is prepared to store more complex data than simple comma-separated values (CSV) or key-value pairs (SequenceFiles). It supports fields with complex types such as structs or arrays. Avro is a nice file format to represent complex data structures within a MapReduce job. The record structure is defined using a schema that is always available. This schema is usually written in JSON and is stored in the metadata header of the Avro files, making them self-describing. Therefore an Avro file can be divided into schema and data. Avro was one of the first formats to include schema evolution. The schema of an Avro file can be changed to a new one after data has been written to disk using an older version of that schema. This is very useful when appending new records, making it possible to add new columns easily without rewriting the whole file. There are some restrictions on the schema changes, such as that data types can not be changed. The idea is to be able to add new fields when writing new records, or to read old ones without some unnecessary columns. Appending new records is easy: they are just added at the end of the file. Avro serializes the data using a binary encoding. The data is not tagged with any type information, because the writing schema is always available; the schema is required to parse the data. It is possible to use a JSON serialization for the data, but it is only used for testing, as it is slower and consumes more storage space. The details of the encoding are described in the Apache Avro Documentation [38]. Apart from the binary encoding, it is possible to compress the data of an Avro file. The compression codecs allowed by Avro are Deflate and Snappy. Avro files are also splittable, by dividing the data into blocks (groups of rows). Each block is compressed independently, if compression is selected. Each one has a header including the number of records inside it and the block length. The structure of Avro is the one shown in Figure 4.3. There is a file header that contains, first, four bytes (ASCII 'O', 'b', 'j', followed by 1), then the file metadata, such as the schema or the compression used, and a sync marker. After the header the blocks are stored, with their respective block headers. Although Avro stores metadata in its files, there is no information about the values of the data. Thus, it is not possible to skip blocks when just a few specific records are requested; it is not possible to know whether a record fits a query or not, so it is necessary to read all the blocks. Inside the blocks the records are stored using a row orientation. This means that entire records are stored one after another. Avro stores the rows in a similar way as JSON does, but replacing the text with a binary serialization. Thus, it is possible to store complex types, like nested structures or arrays.

Figure 4.3: Avro structure

Avro also offers the possibility of using columnar storage through its variation, Trevni. Trevni [39] is the columnar format of Avro. To allow scalable, distributed query evaluation, datasets are partitioned into row groups, containing distinct collections of rows. Each row group is organized in column-major order, while row groups form a row-major partitioning of the entire dataset. Each row group is a separate file, and all values of a column are stored together. By default, each Trevni file has the same size as an HDFS block. This moderates the memory impact on the Namenode since no small files are stored. Each column in Trevni is composed of blocks of around 64KB. The sequence of blocks is indexed, and thus random access within the column is possible. A header with metadata is stored at the file, column and block levels. This format is not widely used, as other columnar formats are preferred because they have been designed from the beginning to be column-oriented; Trevni is just an adaptation of Avro.
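As a small sketch of how an Avro data file is created with the Java API (assuming the Avro library; the two-field schema, the record values and the output file are purely illustrative):

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteExample {
    public static void main(String[] args) throws Exception {
        // The schema (written in JSON) is stored in the file header, making the file self-describing.
        String schemaJson = "{\"type\":\"record\",\"name\":\"Position\",\"fields\":["
                + "{\"name\":\"id\",\"type\":\"string\"},"
                + "{\"name\":\"altitude\",\"type\":\"int\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.setCodec(CodecFactory.snappyCodec());       // Deflate or Snappy are the allowed codecs
            writer.create(schema, new File("positions.avro")); // hypothetical output file
            GenericRecord record = new GenericData.Record(schema);
            record.put("id", "osky-0001");                     // illustrative values
            record.put("altitude", 31000);
            writer.append(record);
        }
    }
}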

4.4 RCFiles

RCFile (Record Columnar File) [40] is one of the first column-oriented formats of the Hadoop environment, designed for systems using MapReduce jobs. It is considered a columnar format, but as Trevni does, it applies the concept of first making a horizontal partition and then storing those partitions by columns (in general, all the column-oriented formats use this split by rows). Thus, it combines the advantages of both storage types. The row partitions are the equivalent of the Trevni blocks, so it is guaranteed that any partition is stored entirely in the same node.


Figure 4.4: RCFile structure

As there are row partitions, it is obvious that RCFiles are splittable. These partitions are called row groups. That is to say, all the records stored in an RCFile are divided into row groups, all of the same size. It is possible to store one or multiple row groups in an HDFS block. A row group is composed of a sync marker, to separate contiguous row groups inside an HDFS block; a metadata header with the version, the number of records stored in the row group, the compression used, and how many bytes there are in each column; and finally the data of the records, stored by columns. This is shown in Figure 4.4. RCFiles allow compression. The header and the data are compressed independently. The header is encoded using the RLE (Run Length Encoding) algorithm [41]. The data part is not compressed as a whole unit. To take advantage of the columnar storage, each column is compressed independently using the Gzip algorithm. Even though it is a heavy algorithm, dividing the file into row groups makes the decompression phase faster. The Gzip algorithm can be replaced by another of the codecs supported by Hadoop, but by default this is the compression codec used. It is also possible to store the columns without compression, which means that they will be written in plain text. The column concept of RCFile is very simple. There is not even information about the data type being stored: all the values of a column are treated as a stream of bytes, regardless of their original types. Complex types, like arrays or structs, are not managed by RCFiles and are stored as if they were a simple string. Thus, RCFile does not support complex type data. Besides not knowing the original data types and storing the data as a stream of bytes, there is no other information about the schema, not even the column names; columns are managed using their order. For example, the first column in one row group is always the first column in every row group. RCFile does not allow arbitrary data writing operations. It is only possible to add records at the end of the file. When a record is added, it is first stored in memory. When there are enough records in memory, each column of them is written at the end of each column holder, and the file header is updated. When to write the records from memory depends on the number of records and the limit size of the memory buffer.

As is to be expected, schema evolution is not considered in RCFiles; the number of columns is fixed. When processing a row group, RCFile does not need to fully read the whole content of the row group into memory. Rather, it only reads the metadata header and the needed columns of the row group. Thus, it can skip unnecessary columns and take advantage of column storage. Moreover, RCFile does not decompress all the loaded columns; it uses lazy decompression. This means that a column will not be decompressed until it is certain that its data is necessary for the execution. This is useful due to the WHERE conditions in a query: if a condition cannot be satisfied by any of the records in a row group, then RCFile does not decompress any additional column to resolve the query. Even so, RCFiles are not the best alternative for random access queries affecting very few records, due to the fact that the entire columns in each row partition have to be decompressed and it is not possible to access directly only one value of a column. Also, there is no metadata information about the values of the columns, so it is not possible to skip row groups when searching for a specific record or a few records.

4.5 ORC

Apache ORC (Optimized Row Columnar) [42], as its name says, is an optimization of the RCFile format. RCFile manages all data as bytes, whatever the data type is; thus, it is not possible to take advantage of the type to make storage more efficient. Also, RCFiles do not allow schema evolution. What is more, RCFile decompression can be very slow, as it is necessary to decompress a whole column of a row partition to access any data in it. ORC files were designed to solve these handicaps. It was also developed with the idea of being efficient when used with Hive and MapReduce. Like RCFile, it is a column-oriented format, but it also makes row partitions to make the file easily splittable. These row splits of ORC files are called stripes. Inside each stripe, the storage is columnar. The stripes are, by default, 256MB. This size is much larger than the RCFile row groups, to avoid many small reads when analyzing a whole column or a big amount of data. A complete ORC file can be divided into stripes, a footer and a postscript, as shown in Figure 4.5. Each stripe has an index, the data and a stripe footer. The stripe index stores statistics about the values in each column of the stripe. The data are just the columns, each one divided into a set of streams; one column inside a stripe can have multiple streams. In each column there is an index entry every 10,000 rows, making it possible to jump to a particular row very quickly. It stores statistics for those rows, specifically the minimum and maximum values, making queries requesting a specific record easier. Finally, there is a stripe footer that stores the location where each stream is located, and the encoding used for each column (as each column can use a different encoding depending on its type). At the end of the file, after all the stripes, there is a file footer. This footer has the location of each stripe, the type description (the schema), and the count, min, max and sum statistics for every column across the whole file. The column types can be complex, allowing struct, list, map and union types. All the stripes and the footer can be compressed, and to know which compression technique is used there is a postscript, stored after the footer. This postscript contains the compression codec used and the compressed footer size.


Figure 4.5: ORC structure

The compression codec used can be Snappy, Zlib [43] or none. Every part of the ORC file except for the postscript is compressed with that codec. In addition to this generic file compression, there is a lightweight compression over each stream. This compression depends on the data types, and can be configured. Thus, it is possible to use different compression techniques on integer columns and string ones, for example, to improve the compression efficiency. Apart from the lightweight compression, each column is serialized according to its type. For example, integer columns are serialized using RLE; string columns use dictionaries; struct columns are not directly serialized and create a child column for each field of the struct; list columns have one child column to store all the list values. This serialization is made at stream level, storing the data in a non-human-readable way. On top of this serialization, the lightweight compression is applied depending on the column data type. In its latest versions, ORC allows some schema evolution, such as deleting columns, reordering them, or changing some column types. Adding columns is possible, but by appending the new columns' data at the end of the file. However, this format was developed with a write-once file format in mind, so edits were implemented by using auxiliary files where insert, update, and delete operations are recorded. Adding new records is also possible, using the same technique used in RCFiles: the records are first stored in memory, and when there are several of them they are added to the file as a new stripe.
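The following sketch (a minimal illustration based on the standalone ORC core Java API, org.apache.orc; the path and the two-column schema are purely illustrative) shows how an ORC file with a Snappy-compressed, 256MB-stripe layout could be written:

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.CompressionKind;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class OrcWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        TypeDescription schema = TypeDescription.fromString("struct<id:string,altitude:int>");
        Writer writer = OrcFile.createWriter(new Path("positions.orc"),   // hypothetical path
                OrcFile.writerOptions(conf)
                        .setSchema(schema)                    // schema is stored in the file footer
                        .compress(CompressionKind.SNAPPY)     // generic compression over the serialization
                        .stripeSize(256L * 1024 * 1024));     // default-like 256MB stripes
        VectorizedRowBatch batch = schema.createRowBatch();
        BytesColumnVector idCol = (BytesColumnVector) batch.cols[0];
        LongColumnVector altCol = (LongColumnVector) batch.cols[1];
        int row = batch.size++;
        idCol.setVal(row, "osky-0001".getBytes(StandardCharsets.UTF_8));  // illustrative values
        altCol.vector[row] = 31000;
        writer.addRowBatch(batch);
        writer.close();
    }
}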

4.6 Parquet

The previously seen columnar formats were developed thinking only of the MapReduce framework, but there are plenty more tools and computation frameworks to query data in the Hadoop environment. Apache Parquet [44] is another columnar format that has interoperability as one of its main characteristics.


Figure 4.6: Parquet schema example

Parquet can store nested data structures in a flat columnar format by using the technique explained in the Dremel paper from Google [45]. Parquet is language independent, so it can be used both from Java and C++. What is more, there are converters for a lot of tools such as Hive, HBase, Impala, Pig, etc. These converters just translate between the Parquet format and the tool models, and vice versa. Parquet is an excellent format to store nested data, meaning by nested data that which does not have a flat schema and has complex types such as lists or structs. This nested schema can be represented graphically as a tree. The columns that are finally stored correspond to the leaf nodes of that tree; it does not matter if a leaf node represents a list type field. In the example of Figure 4.6 the columns would be name, employers.name, employers.emails, contact.emails, contact.phone_number.extension and contact.phone_number.number. With this representation there are two main problems: what happens when there are nested lists (a list of lists)? And what if a field has a NULL value and it is not a leaf node? (For example, using the schema of Figure 4.6, we can have a contact with a NULL value.) For both situations the solution is to store two integers for each value, called the definition level and the repetition level [46]. The definition level indicates at which level of the schema the NULL value occurs. It can take values from 0 up to the maximum level of the column. It is easier to explain using the example of Figure 4.6, with the contact.phone_number.number column. If the NULL value is at contact, the definition level will be 0; if the NULL value is at contact.phone_number, the definition level will be 1; and if the NULL value is contact.phone_number.number, it will be 2. If the value is not NULL, the definition level will be 3 (the maximum level of the column). Thus, it is possible to know at which level the NULL field is. This example is shown in Figure 4.7. For the other case, in which there are nested lists, the repetition level indicates when to start a new list and at which level. Suppose we have two nested lists (level1, which contains level2). A 0 at the repetition level indicates the first element of level1 and a new level2 list (it represents the first element of a record). A value of 1 marks a new element of level1, and a new level2 list. And finally, a repetition level of 2 means that it is just a new element of the level2 list. This is extensible to any level of nesting. This example with the employers.emails column is shown in Figure 4.8, using employers as the level1 list and emails as level2.


Figure 4.7: Parquet definition level example


Figure 4.8: Parquet repetition level example

Both the definition and repetition levels are optional; in some cases they are not needed. In a flat table they are not necessary and can be omitted to avoid storing that extra space. Once these definition and repetition levels are in place, complex columns can be stored easily. However, as with the other columnar formats seen, Parquet does not use a pure columnar storage. Instead, it first makes row groups that internally store the data by columns. The row group size is configurable, and is usually between 50MB and 1GB. Each of the columns in a row group is called a column chunk, and inside each one the data is stored using smaller blocks called pages. These pages are compressed independently (like the ORC streams), so they have to be large enough for the compression to be efficient. The pages are the minimum unit to read when accessing a single record. A page contains a metadata header (with the number of records, maximum and minimum values, ...), the repetition and definition levels, and the data. To find the column start locations, the schema and other metadata, there is a footer at the end of the Parquet file. This whole file structure can be seen in Figure 4.9. As can be deduced, Parquet files are splittable by just dividing the file by row groups, with each partition containing one or more entire row groups.


Figure 4.9: Parquet file structure

Regarding the encoding, each page is encoded independently, precisely to improve random access to the data. The definition and repetition levels inside the pages are encoded using RLE. For the data inside the pages, different techniques are used depending on the data type (as done in ORC files); thus, the serialization is optimized for the data type. The encoding used for each page is noted in the page header. Even though each page is encoded individually, it is possible to compress entire blocks (row groups) using any compression codec allowed by Parquet. It is possible to use the Gzip, LZO and Snappy codecs, or not to use any compression codec on top of the Parquet internal serialization. As the design of Parquet is very similar to that of ORC files, it is easy to deduce that the schema evolution of Parquet is limited, as it is also developed with write-once data in mind. A Parquet file can only change its schema by adding new columns at the end of the file. However, adding new records to the file is not taken into account, so to append new data it is necessary to create a new file.
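As an illustration of how such a file can be produced from Java, the sketch below (an example using the parquet-avro binding; the path and the two-field schema are hypothetical, reusing the illustrative record from the Avro section) writes a record into a Snappy-compressed Parquet file:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ParquetWriteExample {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Position\",\"fields\":["
                + "{\"name\":\"id\",\"type\":\"string\"},"
                + "{\"name\":\"altitude\",\"type\":\"int\"}]}");

        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(new Path("positions.parquet"))   // hypothetical path
                .withSchema(schema)
                .withCompressionCodec(CompressionCodecName.SNAPPY)       // row groups compressed with Snappy
                .build()) {
            GenericRecord record = new GenericData.Record(schema);
            record.put("id", "osky-0001");                               // illustrative values
            record.put("altitude", 31000);
            writer.write(record);
        }
    }
}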

4.7 Comparison

Up to now, we have explained each HDFS file format individually. In this section we will compare them, using the framework described at the beginning of the chapter. An overview of the comparison is shown in Table 4.1. First of all, it is very important to know that there is no perfect format, better than all the others. Depending on the data structure and the queries done to the data, one format can be better or worse. As we are working in a Hadoop environment, it is obvious that we will need to split the files and spread them over different nodes. Fortunately, all the considered formats are splittable, and can be partitioned without any problem. Knowing that all of our file formats can be split, perhaps one of the main aspects to take into account is whether we prefer columnar or row storage.

                     CSV   JSON  SequenceFile  Avro             RCFile  ORC           Parquet
Storage orientation  Row   Row   Row (1)       Row              Column  Column        Column
Splitability         Yes   Yes   Yes           Yes              Yes     Yes           Yes
Compression          All   All   All           Deflate, Snappy  All     Zlib, Snappy  Gzip, LZO, Snappy
Human readable       Yes   Yes   Yes           No               No (2)  No            No
Schema storage       No    No    No            Yes              No      Yes           Yes
Schema evolution     No    Yes   No            Yes              No      Limited       Limited
Complex data types   No    Yes   No            Yes              No      Yes           Yes
Data append          Yes   Yes   Yes           Yes              Yes     Yes           No
Random access        No    No    No            No               No      Yes           Yes
MapReduce            Yes   Yes   Yes           Yes              Yes     Yes           Yes

Table 4.1: File formats comparison table

If the data has many columns and very few of them are involved in most of the queries, a column-oriented format would be better; that can be RCFile, ORC or Parquet. Otherwise, a row-oriented format could be better, because columnar formats are slower when many (or all) of the columns are read. Another very important aspect when working with Big Data is compression, because with a good compression it is possible to greatly reduce the storage volume of the files without losing much querying performance. Most of the formats allow all the typical Hadoop compressions, except Avro, ORC and Parquet. In Avro and ORC only 2 compression codecs are allowed: Snappy and another codec based on Deflate (Zlib for ORC and directly Deflate for Avro). Parquet allows Gzip, LZO and Snappy. This is not a very big disadvantage, as these are good compression options; but if another codec is preferred or even needed, there is no possibility of using it with these file formats. Leaving compression aside, each format performs a serialization to store the data. As mentioned at the beginning of the chapter, the data can be serialized in a human-readable way or as binary data. Human-readable serialization is not recommended, as it requires writing a lot of unnecessary information, not needed when the data is read by a machine. The text file formats (CSV, JSON) and SequenceFile do not make a binary serialization of the data. All the other formats make a binary serialization by default, trying to reduce the amount of written data and save storage space. Regardless of how the data is stored, it is important to know how the data is structured in order to query it. In many situations, the client will not know exactly what the data schema is, and it is a big advantage to have that schema explicitly stored in the file; otherwise the client is forced to know the data structure before making the queries. The text formats and SequenceFile do not store information about the data schema or the types.

(1) In some very specific situations SequenceFile can be stored as a columnar format; specifically, when it has a two-column schema (one column in the key, and the other one in the value) and block compression is selected. However, this is a very uncommon situation, and thus SequenceFile is not considered a column-oriented format. (2) RCFile is not considered human readable because columns are compressed by default with the Gzip codec. But if it is configured not to compress the columns, they will be written in plain text, which means it will be human readable.

RCFile stores the column data types, but there is no more information; thus, we do not consider that RCFile stores a schema. In Avro, Parquet and ORC files a complete explicit schema is stored, with all the necessary information about the data structure; if we do not want the client to be bound to know the data schema in advance, we should use one of these three formats. In this schema the column types are specified. Depending on the format we use, we can store complex data types, like inner structures or arrays, or only be able to use columns with primitive types. The formats that do not allow complex data types are CSV, SequenceFile and RCFile. If we want to store complex columns, we have to use JSON, Avro, ORC or Parquet. Related to the schema, in some formats it is possible to change it without rewriting the entire file. This concept, called schema evolution, was first introduced by Avro, which allows changing the schema as we want; the only restriction is not to change a column type. JSON can offer a similar schema evolution, since each row is totally independent from the rest; and as there is no column type specification, types can also be changed. Parquet and ORC (in its latest versions) also allow some schema evolution, with not as much freedom as Avro, because it is not possible to remove or change data, just to add new columns. This restriction exists because these columnar formats were designed with write-once data in mind. This write-once philosophy is stronger in Parquet: this format does not allow adding new records to the file; it only allows adding new columns, appending them at the end of the file (not very efficient). The rest of the columnar formats do allow adding new records, by storing them in main memory until there are enough to write a new block. But this is not efficient if a client just wants to add one or very few rows. To store data that is going to grow little by little, it is better to use a row-oriented format. Turning now to data access, and taking into account that one of the assumptions of Hadoop is that data reads involve very large pieces of data, one aspect that is not usually optimized is the access to a specific record, or very few records. This is not considered in many cases, because the typical request is to process big amounts of data that are written contiguously. Thus, accessing one single record usually requires scanning the entire file sequentially. This type of query usually performs better in column-oriented formats, as they only need to search through the requested columns and have more facilities for adding indexes. Among the formats seen, the ones that make these specific record searches easier are ORC and Parquet, which include metadata about the values of the columns inside each partition, indicating the maximum and minimum value of the columns in that piece of data. This metadata allows skipping entire blocks if the searched value is not in that range. Finally, all the file formats studied are compatible with MapReduce, and can be used as input or output of MapReduce jobs. However, it is important to highlight that RCFile and ORC have been specifically designed for MapReduce, and they will probably have better performance than other formats designed for many more frameworks. Thus, if we need to store data that will be accessed using different computation frameworks and tools, it could be better to use other formats that allow more interoperability, like Parquet.
With all this, we can confirm what we said at the beginning of the section: there is no perfect format. Depending on the data structure and what the data will be used for, one format can be better than another.

For example, if we want to store a file with a very complex structure (nested columns, arrays, etc.) that will be written just once, and we want to search for very specific records often, we will probably choose ORC or Parquet. Now that the theoretical study and comparison of the HDFS file formats has been done, it is time to start with the experimental one. The next chapter presents the results obtained with the experimentation, so we can check whether the conclusions obtained here are correct.

Chapter 5

Experimental results

Now that we have a general idea of how each file format works, tests will be performed in different scenarios to contrast the experimental results against our theoretical conclusions. In this chapter we will show how such experimentation has been carried out, the results we have obtained and the main conclusions.

5.1 Methodology

Before starting with the numerical results, it is necessary to explain how the experimentation has been done. First of all, the computation framework selected for resolving the queries has been MapReduce, as it is the main framework in Hadoop.

5.1.1 MapReduce

Hadoop MapReduce [10] is a software framework designed to ease the development of applications that have to process large amounts of data (Big Data) in big clusters of commodity hardware, like Hadoop. For this goal, it is based on parallelization. First, the input data is divided into independent pieces (for example, HDFS blocks). These blocks of data are processed in parallel in a first phase called map; the processing applied to each piece of data is called a map task. Once all the map tasks have finished, the results of this phase are sorted (merging all the outputs), and after that, this data is once again split into pieces. The next phase, called reduce, takes these splits as input. Each one of these pieces of data is processed independently by a reduce task. The outputs of this phase are written into the file system (HDFS). This complete process is called a MapReduce job. The main idea of the MapReduce framework is the independence of each map or reduce task. Thus, it is possible to completely parallelize the map and the reduce phases: different map tasks can be executed in distinct nodes, and the same happens with the reduce tasks. Since we are working on a Hadoop cluster, the nodes that execute the map and reduce tasks are the same ones that store the data, the Datanodes. Each of these nodes runs a TaskTracker, the worker process that executes the tasks assigned to it. The TaskTrackers are managed by the JobTracker, a master that creates

and runs the jobs, assigning tasks to the TaskTrackers. There is only a single JobTracker in the cluster. The JobTracker is responsible for resource management, tracking resource availability and the TaskTrackers' life cycle. It is also in charge of rescheduling failed tasks (map or reduce tasks that have failed) and monitoring them. The TaskTracker executes the tasks as scheduled by the JobTracker. A map task has as input a list of key-value pairs. That list is processed in the map phase and returned as another list, also of key-value pairs. The output pairs can be of a different type than the input ones. By default, a map task executes one function (called the map function) for each pair. Its output is a set of key-value pairs, not necessarily only one. The output of all the maps is grouped into lists of values that have the same key. Then, a reduce function is executed for each of those groups. Each reduce function produces a list of values. The output of the reduce tasks is the final output of the MapReduce job, and it is written into HDFS. Thus, a MapReduce job transforms a list of key-value pairs into a list of values. The most typical example of MapReduce is the "word count". In this example we have a text, and we want to know the number of occurrences of each word in that text. The map inputs have the lines as values and a numerical key (the key can take any value, as we will not use it in this case). Each map function breaks the line into words, and outputs a key-value pair for each word, with the word as key and the number of occurrences of the word in the line as value. For the reduce input, the pairs are grouped by word, and each reduce function just needs to sum all the values it has as input. Finally, each reduce obtains the number of occurrences of one word, having one function for each different word.
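A minimal sketch of this word count job with the Hadoop Java API could look as follows (class names and paths are illustrative; for simplicity the mapper emits a 1 for every word occurrence rather than a per-line count, which makes no difference since the reducer just sums the values):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: for every word in the input line, emit the pair (word, 1).
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: all pairs with the same word arrive together; sum their counts.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}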

Hive

Programming MapReduce tasks is not straightforward, as it is necessary to think in terms of key-value pairs and how the data is managed with those pairs. The map output and reduce input can sometimes be a bit tricky. Also, it is required to write low-level programs in Java to use this framework. A solution to these difficulties is Apache Hive. Apache Hive [47, 48] is a data warehouse software that makes data reading, writing and management with MapReduce easier, by using an SQL-like syntax (called HQL). Thus, it is not necessary to be aware of the framework details; just by knowing SQL it is possible to take advantage of all its benefits. For this reason, Hive has been selected for the majority of the queries used in this project's experimentation. The remaining ones are written as custom Java MapReduce scripts, as this has been more convenient; the reasons for this decision are explained later on.

5.1.2 Datasets

After explaining the computation framework in which the experimentation will be done, we are going to describe the benchmark we have used. We decided to use two datasets with very different characteristics. The first of the selected datasets is composed of ADS-B (Automatic Dependent Surveillance - Broadcast) [49] messages. ADS-B messages are sent by aircraft to report their position (and some other data), determined via satellite navigation.

The aircraft broadcasts these messages periodically, and thus it is possible to track it. Currently, we can find many ADS-B data providers, both open and private. One of the most important providers is Opensky1 [50], since it offers free access to large amounts of data, collected through its network, and makes them accessible to researchers worldwide. The ADS-B messages used in this project have been collected from Opensky. The data collected from Opensky has been parsed to a CSV format. Originally it is stored using JSON, but it has a plain table structure and is easily transformable to CSV. We have used the data from the month of January 2018. The entire data in CSV without compression has a volume of 364.16GB, with 2,150,817,499 rows, divided into files of approximately 1GB. The schema of this dataset is very simple, with very few columns (just 20). There are no complex types in any column; all of the types are primitive. This schema is available in Appendix A. The other dataset is the GitHub log [51]. This log archives the public GitHub timeline in JSON files (one file per hour). Basically, it contains all the public actions done on GitHub. It has a much more complex structure than the Opensky data. Originally it had more than 700 columns (considering as a column each field with a primitive type), but since we have carried out our research using a proprietary cluster, as we will describe later, with a default and fixed configuration of Hive, we had no possibility of storing a column with more than 40000 characters in its type (a column called "payload" stores almost all the columns inside it and it is not possible to store its type metadata). Thus, we have made a schema reduction to allow Hive to store that very complex column. The final result is a schema with 236 different columns, including nested structures and arrays of structures. This schema is shown in Appendix B. Using this reduction in the schema, we have used the log of the months from January to April of 2018. This data has been "cut" to fit our reduced schema. The final result is a dataset of 195.46GB without compression and 163,939,857 records. This dataset is divided into a large number of small files, each one of approximately 200MB. Neither dataset is very large, but we have considered their size enough for a proof of concept. The Opensky dataset is larger as it has simpler data, and the GitHub log is smaller because it has a more complex structure.

5.1.3 Considered Formats

Knowing the datasets, we have to decide which formats we are going to use. As we are trying to make a complete comparison, we have selected the most popular formats, which are text, SequenceFile, Avro, RCFile, ORC and Parquet. For the text format, CSV is used with Opensky and JSON with the GitHub log; thus we do not need to transform from CSV to JSON or vice versa. Another aspect to consider is the compression. As there are many possibilities, we have to select only some of them to be able to carry out the experimentation in a reasonable time. Thus we have decided to use the Gzip and Snappy compressions, and also the formats without any compression. Gzip is one of the most important and widely used compression codecs, and Snappy is available for every file format.

1https://opensky-network.org/

Also, it is important to know the difference between using compression or not, and thus we also use the formats without any codec. It is worth noting that Avro and ORC do not allow compressing their format with Gzip; the only compression codec used in these cases is Snappy. Thus, for each of the two datasets we have 16 different combinations of file format and compression codec: three for text files, SequenceFile, RCFile and Parquet (Snappy, Gzip and no compression), and two for Avro and ORC (Snappy and no compression).

5.1.4 Queries

Once we have selected the datasets and formats that we are going to use, it is time to define the queries. For each of the combinations we will make more or less the same queries, varying very little between the two datasets. We can differentiate two types of queries: table scans and SELECT queries. The table scans make a complete dataset read, row by row, or an entire column scan. These queries are made with MapReduce scripts, as making them with Hive would produce a very big output, and Hive stores it as temporary data; this means that the data size can even be duplicated. With a custom MapReduce script we are able to just read the data and produce a null output for the MapReduce job. Thus, we avoid writing very large amounts of unnecessary data. The SELECT queries are queries made in SQL language, and have a very small output. These queries are done using Hive, as their output is very small and the written temporary files do not have a big impact. Including both types, it is possible to differentiate five queries. Query 1 is a complete scan of the dataset, made row by row (MapReduce script). Query 2 is an entire column read (MapReduce script). Query 3 is a count of how many records the dataset has (Hive). Query 4 is a search for a specific record, selecting it by its id (Hive). And query 5 is a search for the ids of the records that fulfill a simple, very restrictive "greater than" condition on a column (Hive). The GitHub log data has two versions of queries 2 and 5. This is because these queries involve a column read, and it is possible to test what happens if this column is a deeply nested one. The GitHub log has many nested structures and allows us to make this test.
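A minimal sketch of how such a full-scan job can avoid writing output (assuming the standard Hadoop MapReduce API; job and class names are illustrative, and for the binary formats a format-specific InputFormat would be configured instead of the default text input) is to use a map-only configuration with NullOutputFormat:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class FullScanJob {

    // Reads every row but emits nothing, so the job measures pure read cost.
    public static class ScanMapper extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            value.getLength();   // touch the value; no output is written
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "full scan");
        job.setJarByClass(FullScanJob.class);
        job.setMapperClass(ScanMapper.class);
        job.setNumReduceTasks(0);                          // map-only job
        job.setOutputFormatClass(NullOutputFormat.class);  // discard all output
        FileInputFormat.addInputPath(job, new Path(args[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}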

Opensky

• Query 1: Complete dataset read.

• Query 2: Read the column ’id’.

• Query 3: SELECT count(*) FROM opensky

• Query 4: SELECT * FROM opensky WHERE id=”osky-abf2ea-1522188075”

• Query 5: SELECT id,altitude FROM opensky WHERE altitude>30000

GitHub log

• Query 1: Complete dataset read.

• Query 2.1: Read the column ’id’.

• Query 2.2: Read the column 'payload.pull_request.base.repo.owner.login'.

• Query 3: SELECT count(*) FROM githublog

• Query 4: SELECT * FROM githublog WHERE id=”7044822816”

• Query 5.1: SELECT id FROM githublog WHERE id>”7607000000”

• Query 5.2: SELECT id,payload.pull_request.base.repo.owner.id FROM githublog WHERE payload.pull_request.base.repo.owner.id>37000000

5.1.5 Experimentation environment

Finally, it is important to describe where the experimentation has been done, so the results can be contextualized. The environment used is a Cloudera Hadoop system, property of Boeing Research and Technology Europe. The versions of the software used are Cloudera 5.14.2, HDFS 2.6.0, MapReduce 2.6.0 and Hive 1.1.0. The cluster uses 7 different nodes for MapReduce jobs (we will call them node0 to node6). Node0, node1, node2 and node3 each have 128GB of main memory, storage with 8 SSDs of 1TB and 5 HDDs of 5TB, and 40 cores. Node4 has 256GB of main memory, storage with 8 SSDs of 1TB and 5 HDDs of 5TB, and 40 cores. Node5 and node6 have 256GB of main memory, storage with 8 SSDs of 4TB, and 48 cores.

5.1.6 Used Parameters

The performance aspects that we have considered are the execution time of the entire MapReduce job, the number of map tasks launched, and the sum of all the map tasks' memory peaks, which is equivalent to the job memory peak value. We do not consider the reduce tasks, as the selected queries only need the map phase and do not require a reduce. Each query is executed three times for each of the combinations of dataset, file format and compression codec, and the final performance result is the average value of the three executions. To guarantee the validity of the experiment, we calculate the variation coefficient (VC), the standard deviation expressed as a percentage of the mean, for each of the considered measures. If the VC is greater than 10% for any of them, the query is repeated another three times, and the previous values are discarded.
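Assuming the usual definition of the coefficient of variation (the text does not specify whether the population or sample estimator of the standard deviation is used; the population form over the three executions $x_1, x_2, x_3$ is shown here), the value checked against the 10% threshold would be:

$$VC = \frac{\sigma}{\mu} \times 100\%, \qquad \mu = \frac{1}{3}\sum_{i=1}^{3} x_i, \qquad \sigma = \sqrt{\frac{1}{3}\sum_{i=1}^{3}\left(x_i - \mu\right)^2}$$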


Figure 5.1: Opensky format sizes

              Plain    Snappy   Gzip
CSV           364.16   163.32    99.333
SequenceFile  381.99   167.87   102.57
Avro          362.48   160.94        -
RCFile        318.25   120.99    68.413
ORC           260.02   105.53        -
Parquet       131.27    86.01    61.918

Table 5.1: Opensky format sizes (GB)

5.2 Results

Now that we know the whole experimental environment and all the tests that we are running, it is time to see the performance results for each of the queries and datasets.

5.2.1 Opensky

Before starting with the queries, it is very important to take a look at the storage size that each file format uses. Depending on the format serialization and the selected compression codec, we can store the same amount of data with different sizes. The storage volumes for each file format and compression are represented in the graphic of Figure 5.1 and in Table 5.1. In this graphic we take the CSV format as reference, since it represents the original data. We can appreciate that the best compression rates are obtained with the column-oriented file formats. This is because storing all column values together eases a better compression rate, since values within the same column are similar. The dataset stored with SequenceFile requires 4.89%, 2.78% and 3.25% more storage space than CSV (in plain, Snappy and Gzip respectively). This is because SequenceFile stores the same rows, adding one key per row; thus, we have the same data plus the keys, which adds even more volume to the files.

Thus, we have the same data plus the keys, which adds even more volume to the files. Avro requires 0.47% less space with no compression and 1.46% less with Snappy than CSV; this is more or less the same size as the text format. This is due to the fact that the Avro serialization does not compress much, and it even adds keys inside each row. We can guess that Avro was not designed to shrink CSV-like files, but rather to handle JSON-style structures.

RCFile does not provide a very big advantage when used without any compression codec, as it is only 12.61% smaller. But we have to take into account that its default configuration uses the Gzip codec. With Snappy or Gzip compression we obtain much better rates than with the row-oriented formats, with a compression of 25.92% with Snappy and 31.13% with Gzip. Here it is possible to see that column-oriented formats ease the compression by storing similar values together.

The ORC serialization applies its lightweight compression before the general one. This brings the advantage of reaching better compression ratios, with files 28.6% smaller with no compression. But it is necessary to use a compression codec on top of the ORC serialization to reduce the file volume notably, as we obtain better ratios that way (35.38% smaller files with Snappy).

The last column-oriented format, Parquet, is the one that reaches the smallest sizes, with files 63.93% smaller with no compression, 47.34% with Snappy, and 37.67% with Gzip. In the theoretical study we saw that, like ORC, it also applies a previous encoding to each page depending on the data type. But comparing the compression rates, we can see that without any compression codec Parquet obtains much better storage results.

Finally, talking about compression codecs in general, we can observe that applying compression on top of the file formats yields much smaller files, reaching up to a quarter of the original size with Gzip, or almost one third with Snappy. Thus, if the storage capacity of our system is not very large, it is a very good idea to use compression codecs on the files.

Now that we have seen the storage sizes of each of the file formats and compression codecs, we start with the performance of the queries for each of them.

Query 1

The first of the queries is the complete scan of the files, made by rows. This query is executed with a MapReduce script. The performance of this query is shown in the three graphics of Figure 5.2 and in Table 5.2. The first graphic shows the average execution time for each file format and compression codec, the second one the average memory peak, and the third one the number of map tasks. The table shows all the results obtained from the three executions of each format, including execution time, memory peak and number of map tasks, together with the average value, the standard deviation and the variation coefficient.

We can observe that the most restrictive compression codecs provide much better performance than the file formats without any compression, with the Gzip-compressed formats getting the best results; for example, with the CSV format, Gzip is 53.37% faster than no compression. This may seem counter-intuitive, because compressed data has to be decompressed before it can be accessed. But MapReduce uses a parallelization process in which the data is split into different map tasks.

[Figure: bar charts of execution time (seconds), memory peak (GB) and launched map tasks per file format and compression codec (Plain, Snappy, Gzip).]

Figure 5.2: Opensky query 1 performance

44 run 1 run 2 run 3 Average Desv. VC time (sec) 243 236 233 237.3333 5.1316 0.0216 CSV-Plain mem.(GB) 1130.7030 1137.0915 1154.0796 1140.6247 12.0822 0.0106 maps 3040 3040 3040 3040 0 0 time (sec) 441 430 425 432 8.1854 0.032 CSV-Snappy mem.(GB) 168.6873 165.7842 163.1847 165.8854 2.7527 0.0059 maps 1250 1250 1250 1250 0 0 time (sec) 109 109 114 110.6667 2.8868 0.0261 CSV-Gzip mem.(GB) 367.8700 370.6483 373.2353 370.5845 2.6832 0.0072 maps 830 830 830 830 0 0 time (sec) 324 290 310 308 17.088 0.0555 SeqF-Plain mem.(GB) 1209.2886 1221.0791 1197.7321 1209.3666 11.6737 0.0097 maps 3221 3221 3221 3221 0 0 time (sec) 158 153 154 155 2.6458 0.0171 SeqF-Snappy mem.(GB) 578.4998 568.7232 561.7734 569.6654 8.4029 0.0148 maps 1524 1524 1524 1524 0 0 time (sec) 83 85 87 85 2 0.0235 SeqF-Gzip mem.(GB) 422.0018 426.9561 427.5538 425.5039 3.0476 0.0072 maps 1064 1064 1064 1064 0 0 time (sec) 420 375 396 397 22.5167 0.0567 Avro-Plain mem.(GB) 1234.7048 1212.8505 1214.4151 1220.6568 12.1910 0.0100 maps 3036 3036 3036 3036 0 0 time (sec) 317 319 322 319.3333 2.5166 0.0079 Avro-Snappy mem.(GB) 704.6404 700.2435 694.5995 699.8278 5.0333 0.0072 maps 1575 1575 1575 1575 0 0 time (sec) 357 370 385 370.6667 14.0119 0.0378 RCFile-Plain mem.(GB) 1088.0207 1102.2412 1122.5886 1104.2835 17.3742 0.0157 maps 2700 2700 2700 2700 0 0 time (sec) 235 229 234 232.6667 3.2146 0.0138 RCFile-Snappy mem.(GB) 488.9639 498.2054 506.6350 497.9348 8.8386 0.0178 maps 1156 1156 1156 1156 0 0 time (sec) 213 183 190 195.3333 15.695 0.0803 RCFile-Gzip mem.(GB) 326.7258 324.3799 328.4282 326.5113 2.0326 0.0062 maps 760 760 760 760 0 0 time (sec) 320 317 322 319.6667 2.5166 0.0079 ORC-Plain mem.(GB) 638.9058 648.3744 651.2856 646.1886 6.4729 0.01 maps 770 770 770 770 0 0 time (sec) 256 265 261 260.6667 4.5092 0.0173 ORC-Snappy mem.(GB) 370.9429 374.0291 370.618 371.8633 1.8827 0.0051 maps 416 416 416 416 0 0 time (sec) 929 921 906 918.6667 11.6762 0.0127 Parquet-Plain mem.(GB) 1042.7736 1053.024 1051.2971 1049.0316 5.4879 0.0052 maps 1183 1183 1183 1183 0 0 time (sec) 726 738 735 733 6.245 0.0085 Parquet-Snappy mem.(GB) 716.6175 726.9941 738.6478 727.4198 11.0213 0.0152 maps 823 823 823 823 0 0 time (sec) 706 697 690 697.6667 8.0208 0.0115 Parquet-Gzip mem.(GB) 658.3433 670.2001 660.4688 663.0041 6.3219 0.0095 maps 759 759 759 759 0 0

Table 5.2: Opensky query 1 results

By having smaller data, we have fewer map tasks, needing less time to schedule them or to repeat the failed ones (fewer map tasks implies fewer failed tasks). Also, the blocks contain more data, and it takes less time to read an entire block and decompress it than to read many small blocks without decompressing them.

Another remarkable aspect is that Parquet is the format that performed the worst in this query, with an execution time 3.87 times slower than CSV without compression. This is because the query is resolved reading by rows, and being a column-oriented format is a handicap for this type of query. However, RCFile and ORC do not take so much time to execute the job.

After Parquet, Avro is the next worst format in terms of execution time, being 1.67 times slower than the text format when no compression is used. It seems that reading a file in Avro takes more time than reading it using the plain text format; and it does not provide a lower memory peak or fewer map tasks than the rest of the formats.

The SequenceFile format shows the expected results, as it just takes a little more time (29.95% more with no compression), with a slightly higher memory peak and a few more map tasks. This is because it stores the same data as the CSV format, just adding a key that only implies more storage volume.

The other two column-oriented formats, RCFile and ORC, have had a very good performance, taking into account that this query is hard for them, as they need to reassemble every row from the fields stored in each column. Their executions are just 1.56 times (RCFile) and 1.34 times (ORC) slower, without compression. It is also remarkable how few resources ORC uses: it launched very few maps (only 770 with no compression, versus 3040 for CSV) and it has the lowest memory peak values (56.65% lower than CSV without compression).

Finally, we would like to emphasize that the format with the smallest execution time in this full scan query has been CSV, with an average of 237.33 seconds, despite being supposedly one of the worst formats, as it is human readable. Moreover, it does not consume more resources than the majority of the other formats. Nevertheless, we have to take into account that a full scan without doing anything else is not a very common operation.
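Queries 1 and 2 are resolved with map-only MapReduce jobs written in Java. The following is a minimal, hypothetical sketch of such a map-only full scan over a text (CSV or JSON) input; class names are illustrative and the actual scripts used in the experiments are not reproduced here. Scanning the binary formats would mainly require swapping the InputFormat and the map input types.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class FullScanJob {

  // Map-only scan: every record is read by the input format and simply counted.
  public static class ScanMapper
      extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
    @Override
    protected void map(LongWritable offset, Text row, Context context)
        throws IOException, InterruptedException {
      // The record reader has already materialized the line into 'row';
      // counting it is enough to force a full read of the dataset.
      context.getCounter("scan", "rows").increment(1);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "full-scan");
    job.setJarByClass(FullScanJob.class);
    job.setMapperClass(ScanMapper.class);
    job.setNumReduceTasks(0);                        // map-only, as in the experiments
    job.setInputFormatClass(TextInputFormat.class);  // plain text rows; other formats need their own InputFormat
    job.setOutputFormatClass(NullOutputFormat.class);
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(NullWritable.class);
    TextInputFormat.addInputPath(job, new Path(args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}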

Query 2

Now it is time for the other MapReduce scripted query, the complete read of a single column. In this query the best formats should be the column-oriented ones, as the values of a column are stored together. The results of this query are shown in Figure 5.3 and Table 5.3.

Again we see that the best performance results are the ones of the compressed datasets, with CSV execution times 41.51% and 52.63% faster with Snappy and Gzip compression than with no compression. Thus we can guess that in MapReduce jobs the size and number of blocks matter more than the time taken to decompress the data.

In this query the results are the expected ones. The fastest execution times have been obtained by RCFile, ORC and Parquet, with RCFile with Gzip reaching an execution time of only 89 seconds; and the slowest ones have been CSV and SequenceFile, with CSV with Gzip taking 137.67 seconds.

[Figure: bar charts of execution time (seconds), memory peak (GB) and launched map tasks per file format and compression codec (Plain, Snappy, Gzip).]

Figure 5.3: Opensky query 2 performance

47 run 1 run 2 run 3 Average Desv. VC time (sec) 294 288 290 290.6667 3.0551 0.0105 CSV-Plain mem.(GB) 1147.9906 1154.9474 1137.8888 1146.9422 8.5775 0.0075 maps 3040 3040 3040 3040 0 0 time (sec) 181 168 161 170 10.1489 0.0597 CSV-Snappy mem.(GB) 577.8688 579.0908 581.0590 579.3395 1.6096 0.0028 maps 1250 1250 1250 1250 0 0 time (sec) 136 143 134 137.6667 4.7258 0.0343 CSV-Gzip mem.(GB) 377.0761 377.8835 381.0979 378.6858 2.1276 0.0056 maps 830 830 830 830 0 0 time (sec) 379 335 367 360.3333 22.745 0.0631 SeqF-Plain mem.(GB) 1281.8952 1285.2409 1273.3653 1280.1671 6.1235 0.0048 maps 3221 3221 3221 3221 0 0 time (sec) 188 187 190 188.3333 1.5275 0.0081 SeqF-Snappy mem.(GB) 614.7622 617.4241 613.275 615.1538 2.1021 0.0034 maps 1524 1524 1524 1524 0 0 time (sec) 121 112 111 114.6667 5.5076 0.048 SeqF-Gzip mem.(GB) 449.9415 455.8492 446.7915 450.8607 4.5983 0.0102 maps 1064 1064 1064 1064 0 0 time (sec) 209 233 228 223.3333 12.6623 0.0567 Avro-Plain mem.(GB) 1148.8208 1126.8438 1131.6216 1135.7621 11.5587 0.0102 maps 3036 3036 3036 3036 0 0 time (sec) 151 160 160 157 5.1962 0.0331 Avro-Snappy mem.(GB) 606.7438 607.2778 603.7252 605.9156 1.9156 0.0032 maps 1575 1575 1575 1575 0 0 time (sec) 220 205 228 217.6667 11.6762 0.0536 RCFile-Plain mem.(GB) 1028.62 1042.6504 1042.6504 1037.9736 8.1004 0.0078 maps 2700 2700 2700 2700 0 0 time (sec) 129 125 133 129 4 0.031 RCFile-Snappy mem.(GB) 428.0213 435.2849 439.3113 434.2058 5.7218 0.0132 maps 1156 1156 1156 1156 0 0 time (sec) 88 91 88 89 1.7321 0.0195 RCFile-Gzip mem.(GB) 289.0431 285.7162 284.6591 286.4728 2.2878 0.008 maps 760 760 760 760 0 0 time (sec) 209 205 208 207.3333 2.0817 0.01 ORC-Plain mem.(GB) 685.8854 675.7823 665.0035 675.5571 10.4427 0.0155 maps 770 770 770 770 0 0 time (sec) 173 169 180 174 5.5678 0.032 ORC-Snappy mem.(GB) 372.9662 368.6809 374.2737 371.9736 2.9256 0.0079 maps 416 416 416 416 0 0 time (sec) 165 145 157 155.6667 10.0664 0.0647 Parquet-Plain mem.(GB) 888.5021 885.6411 894.2849 889.476 4.4035 0.005 maps 1183 1183 1183 1183 0 0 time (sec) 109 116 116 113.6667 4.0415 0.0356 Parquet-Snappy mem.(GB) 565.8508 557.397 558.5341 560.594 4.5879 0.0082 maps 823 823 823 823 0 0 time (sec) 101 103 117 107 8.7178 0.0815 Parquet-Gzip mem.(GB) 437.0639 435.556 438.5178 437.0459 1.481 0.0034 maps 759 759 759 759 0 0

Table 5.3: Opensky query 2 results

However, it is remarkable that Avro has a good performance in this single column read, taking into account its bad performance in the previous full scan query. Avro is even 9.77% faster than ORC when both use Snappy compression.

If we had to select the single format with the best performance, it would be RCFile with Gzip compression: it has the fastest execution time, with 89 seconds, only 760 map tasks, and a low memory peak, under 300GB. ORC also has a very low usage of resources, but with the Snappy compression codec its execution time is not very good, being even 2.35% slower than the CSV format.

It is worth noting that the number of map tasks is the same as in query 1, so this number is probably related to the file sizes and the format used, and not to the query. In theory this seems obvious, as the processing of the queries is done inside the map tasks, and the number of these tasks is defined in advance.
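For the row-oriented text formats, the single-column read of query 2 amounts to splitting each record and emitting only the requested field. The mapper below is a minimal sketch of that idea for a CSV input; the column index, the delimiter and the class name are illustrative assumptions, and the job setup would mirror the previous sketch with Text as the output key.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits only one field of each CSV row; map-only, no reducer needed.
public class SingleColumnMapper
    extends Mapper<LongWritable, Text, Text, NullWritable> {

  private static final int COLUMN_INDEX = 0;   // assumed position of the 'id' column
  private final Text column = new Text();

  @Override
  protected void map(LongWritable offset, Text row, Context context)
      throws IOException, InterruptedException {
    // Split the row and keep only the requested column; -1 keeps trailing empty fields.
    String[] fields = row.toString().split(",", -1);
    if (fields.length > COLUMN_INDEX) {
      column.set(fields[COLUMN_INDEX]);
      context.write(column, NullWritable.get());
    }
  }
}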

Query 3

The next query is the count. This query is made with Hive, and it is just a count of all the rows in the dataset. For this query the metadata stored by the format is very important, as it allows the job to read only the block metadata, instead of the data itself, to know how many rows there are. The performance of this query is shown in Figure 5.4 and Table 5.4.

For this query there is a single format better than any other: Parquet. It has the fastest execution time, with no more than 53 seconds for any compression option, the lowest memory peak, with only 225.84GB when no compression is used, and a very low number of map tasks. This means that in this case Parquet makes very good use of its metadata, not reading the entire blocks and just taking the headers into account.

ORC and RCFile also have good execution times, with ORC 54.44% and RCFile 21.44% faster than CSV without compression. However, it looks like they use too many resources for just reading the metadata: their memory peaks are 59.5% and 152.8% higher (ORC and RCFile respectively) than Parquet's without compression. We presume this could be because the entire blocks are loaded into main memory, although only the metadata is read and nothing is done with the block data. It would be better if there were no need to load the entire blocks into memory, loading just the metadata.

Now talking about the row-oriented formats, CSV is the fastest of the three and Avro the slowest, with an execution time 43% higher than the text format without compression. All of them have a much higher execution time than the column-oriented formats. This is probably because of the lack of metadata, which forces them to read each record just to add one to the count. Avro does contain metadata about how many records there are in each of its blocks, but it seems that this information is not used in this query. SequenceFile, as before, has a slightly worse performance than the text format in every aspect.

In this query the number of maps is different from the first two queries, probably because the Hive configuration differs from the MapReduce default one.

[Figure: bar charts of execution time (seconds), memory peak (GB) and launched map tasks per file format and compression codec (Plain, Snappy, Gzip).]

Figure 5.4: Opensky query 3 performance

50 run 1 run 2 run 3 Average Desv. VC time (sec) 186 179 176 180.3333 5.1316 0.0285 CSV-Plain mem.(GB) 651.1885 643.8105 633.4194 642.8061 8.927 0.0139 maps 1082 1082 1082 1082 0 0 time (sec) 136 138 130 134.6667 4.1633 0.0309 CSV-Snappy mem.(GB) 258.5566 261.7808 262.1054 260.8143 1.9619 0.0075 maps 413 413 413 413 0 0 time (sec) 79 72 76 75.667 3.5119 0.0464 CSV-Gzip mem.(GB) 158.3127 161.384 164.1259 161.2742 2.9082 0.018 maps 213 213 213 213 0 0 time (sec) 248 236 259 247.6667 11.5036 0.0464 SeqF-Plain mem.(GB) 716.5535 702.294 716.0801 711.6425 8.0995 0.0114 maps 1121 1121 1121 1121 0 0 time (sec) 105 110 104 106.3333 3.2146 0.0302 SeqF-Snappy mem.(GB) 299.8765 305.5052 304.4909 303.2909 3.0001 0.0099 maps 472 472 472 472 0 0 time (sec) 90 90 94 91.3333 2.3094 0.0253 SeqF-Gzip mem.(GB) 186.4595 185.3464 183.1129 184.9729 1.7043 0.0092 maps 293 293 293 293 0 0 time (sec) 260 254 261 258.3333 3.7859 0.0147 Avro-Plain mem.(GB) 697.4 698.209 702.9777 699.5289 3.014 0.0043 maps 1069 1069 1069 1069 0 0 time (sec) 184 191 209 194.6667 12.8970 0.0663 Avro-Snappy mem.(GB) 369.623 371.083 376.7791 372.495 3.7813 0.0102 maps 590 590 590 590 0 0 time (sec) 135 150 140 141.6667 7.6376 0.0539 RCFile-Plain mem.(GB) 564.6362 572.208 572.7287 569.8576 4.5294 0.0079 maps 942 942 942 942 0 0 time (sec) 74 72 75 73.6667 1.5275 0.0207 RCFile-Snappy mem.(GB) 210.0078 210.138 210.3691 210.1716 0.183 0.0009 maps 334 334 334 334 0 0 time (sec) 41 46 45 44 2.6458 0.0601 RCFile-Gzip mem.(GB) 117.5749 119.4525 117.5031 118.1768 1.1054 0.0094 maps 186 186 186 186 0 0 time (sec) 94 104 102 100 5.2915 0.0529 ORC-Plain mem.(GB) 358.7742 361.9996 357.4637 359.4125 2.3343 0.0065 maps 686 686 686 686 0 0 time (sec) 50 43 42 45 4.3589 0.0969 ORC-Snappy mem.(GB) 171.8806 171.2257 170.8405 171.3156 0.5259 0.0031 maps 288 288 288 288 0 0 time (sec) 50 57 53 53.3333 3.5119 0.0658 Parquet-Plain mem.(GB) 225.6203 227.6035 224.2918 225.8385 1.6666 0.0074 maps 355 355 355 355 0 0 time (sec) 40 41 37 39.3333 2.0817 0.0529 Parquet-Snappy mem.(GB) 138.715 140.4669 141.8997 140.3605 1.595 0.0114 maps 219 219 219 219 0 0 time (sec) 35 36 37 36 1 0.0278 Parquet-Gzip mem.(GB) 104.1195 104.258 103.5866 103.988 0.3545 0.0034 maps 167 167 167 167 0 0

Table 5.4: Opensky query 3 results

Query 4

Query 4 is the single record search. This means that the query filters by the column "id" and returns only one record. Even though only one column is needed for the restriction, all of them are returned in the resulting record. Again, in this query it is important to have metadata in order to skip entire blocks when it is certain that the record cannot be inside them. The results of this query are shown in Figure 5.5 and Table 5.5.

Again, Avro is the one with the worst performance, having the highest execution time (241.67 seconds without compression) and a big use of resources, especially with Snappy compression, reaching almost 600GB. This is probably because it needs to read the entire file to know whether a record satisfies the restriction used (in this case, having a specific value in the column "id"). CSV and SequenceFile also have to do this complete read, but they are faster at making a full scan, as we saw in query 1.

In theory, the formats that should be faster in this query are ORC and Parquet, which have metadata about the maximum and minimum values of their column blocks. However, RCFile is as fast as ORC and Parquet in this query, being even 44.91% faster than Parquet with Gzip compression. This could be because it has not been possible to skip many blocks with their metadata.

With all this, we can say that the best format for this query on this dataset has been RCFile with Gzip compression. This is not an expected result, as we had considered that it does not ease random access to the data. Also, the memory usage of this format has been very low, being the only one under 100GB.

Here the number of map tasks is the same as in query 3, so we can now be almost sure that the number of map tasks depends only on the file sizes.
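Conceptually, the block skipping that ORC and Parquet can apply in this kind of search relies on the per-block minimum and maximum statistics: a block whose value range cannot contain the searched key is never read. The following tiny sketch (purely illustrative, not taken from either format's reader) expresses that check for a numeric column.

// Decides whether a block can be skipped for an equality predicate "column = searched",
// given the min/max statistics stored in the block (row group / stripe) metadata.
public final class BlockSkipping {

    static boolean canSkipBlock(long blockMin, long blockMax, long searched) {
        // If the searched value lies outside [min, max], no row of the block can match.
        return searched < blockMin || searched > blockMax;
    }

    public static void main(String[] args) {
        // Hypothetical statistics for two blocks of a numeric column.
        System.out.println(canSkipBlock(1_000L, 5_000L, 7_500L)); // true: 7500 is outside [1000, 5000]
        System.out.println(canSkipBlock(5_000L, 9_000L, 7_500L)); // false: the block may contain 7500
    }
}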

Query 5

The last query done for this dataset has been query 5. This query requests the id of the records that have an altitude higher than 30000 meters. This is an extremely high value that only very special flights can reach; thus, very few records satisfy this restriction. The performance results for this query are shown in Figure 5.6 and Table 5.6. For this query the formats' metadata is again important to be able to skip blocks.

Once again we can see that Avro is the one having the worst performance, reaching the highest execution time, with almost 250 seconds with no compression and more than 150 with Snappy, and also a high usage of resources. The CSV and SequenceFile formats are better than Avro, using fewer resources (37.17% and 31.22% less respectively, when using Snappy), except for SequenceFile without compression, and doing the job faster (48.36% faster for CSV and 44.26% for SequenceFile, with Snappy).

As expected, the formats with better performance are the column-oriented ones. Now the RCFile format is 43.75% slower than ORC and 21.05% slower than Parquet, using Snappy compression. This is probably because ORC and Parquet have now made a better use of their metadata, being able to skip more blocks and avoid unnecessary readings. However, the memory usage does not reflect this fact, as Parquet consumes 67.17% and 89.74% more memory than RCFile when Snappy and Gzip compression are used. This should be because Parquet is loading the entire blocks into memory, even if they are skipped later.

[Figure: bar charts of execution time (seconds), memory peak (GB) and launched map tasks per file format and compression codec (Plain, Snappy, Gzip).]

Figure 5.5: Opensky query 4 performance

53 run 1 run 2 run 3 Average Desv. VC time (sec) 157 153 163 157.6667 5.0332 0.0319 CSV-Plain mem.(GB) 415.0875 421.8078 425.9711 420.9555 5.4916 0.013 maps 1082 1082 1082 1082 0 0 time (sec) 104 109 114 109 6.0277 0.0433 CSV-Snappy mem.(GB) 158.4278 160.0691 162.7903 160.429 2.2034 0.0137 maps 413 413 413 413 0 0 time (sec) 69 68 75 70.667 3.7859 0.0536 CSV-Gzip mem.(GB) 107.7545 105.9577 104.9237 106.212 1.4324 0.0135 maps 213 213 213 213 0 0 time (sec) 198 200 199 199 1 0.005 SeqF-Plain mem.(GB) 443.2609 441.6829 449.1871 444.7103 3.9565 0.0089 maps 1121 1121 1121 1121 0 0 time (sec) 145 133 140 139.3333 5 0.0459 SeqF-Snappy mem.(GB) 185.2279 186.6246 186.5723 186.1416 0.7917 0.0043 maps 472 472 472 472 0 0 time (sec) 91 96 99 95.3333 4.0415 0.0424 SeqF-Gzip mem.(GB) 116.1417 116.669 114.9504 115.9204 0.8804 0.0076 maps 293 293 293 293 0 0 time (sec) 245 244 236 241.6667 4.9329 0.0204 Avro-Plain mem.(GB) 435.4667 427.367 427.4054 430.0797 4.6653 0.0108 maps 1069 1069 1069 1069 0 0 time (sec) 194 183 182 186.3333 6.6583 0.0357 Avro-Snappy mem.(GB) 268.0768 271.1302 275.6012 271.6028 3.7844 0.0139 maps 590 590 590 590 0 0 time (sec) 187 176 179 180.6667 5.6862 0.0315 RCFile-Plain mem.(GB) 363.5963 360.1676 359.4725 361.0788 2.2078 0.0061 maps 942 942 942 942 0 0 time (sec) 80 81 85 82 2.6458 0.0323 RCFile-Snappy mem.(GB) 130.5852 128.2059 128.7931 129.1947 1.2394 0.0096 maps 334 334 334 334 0 0 time (sec) 53 50 54 52.3333 2.0817 0.0398 RCFile-Gzip mem.(GB) 72.5933 71.7396 70.3801 71.571 1.1162 0.0156 maps 186 186 186 186 0 0 time (sec) 140 143 144 142.3333 2.0817 0.0146 ORC-Plain mem.(GB) 409.6368 413.5612 409.5248 410.9076 2.2987 0.0056 maps 686 686 686 686 0 0 time (sec) 79 75 83 79 4 0.0506 ORC-Snappy mem.(GB) 177.1493 178.9722 177.0178 177.7131 1.0924 0.0061 maps 288 288 288 288 0 0 time (sec) 169 182 175 175.3333 6.5064 0.0371 Parquet-Plain mem.(GB) 343.8616 346.7982 342.921 344.5269 2.0224 0.0059 maps 355 355 355 355 0 0 time (sec) 111 122 114 115.6667 5.6862 0.0492 Parquet-Snappy mem.(GB) 225.1929 220.7882 222.9916 222.9909 2.2024 0.0099 maps 219 219 219 219 0 0 time (sec) 92 95 98 95 3 0.0316 Parquet-Gzip mem.(GB) 166.9893 165.3912 167.1807 166.5204 0.9826 0.0059 maps 167 167 167 167 0 0

Table 5.5: Opensky query 4 results

[Figure: bar charts of execution time (seconds), memory peak (GB) and launched map tasks per file format and compression codec (Plain, Snappy, Gzip).]

Figure 5.6: Opensky query 5 performance

55 run 1 run 2 run 3 Average Desv. VC time (sec) 164 174 162 166.6667 6.4291 0.0386 CSV-Plain mem.(GB) 426.9127 433.009 437.2222 432.3813 5.1833 0.012 maps 1082 1082 1082 1082 0 0 time (sec) 94 90 100 94.6667 5.0332 0.0532 CSV-Snappy mem.(GB) 168.9436 170.1211 169.2552 169.4399 0.6101 0.0036 maps 413 413 413 413 0 0 time (sec) 68 79 75 74 5.5678 0.0752 CSV-Gzip mem.(GB) 110.6496 113.3134 110.3602 111.4411 1.628 0.0146 maps 213 213 213 213 0 0 time (sec) 204 212 208 208 4 0.0192 SeqF-Plain mem.(GB) 458.5225 465.2582 470.2504 464.6770 5.8855 0.0127 maps 1121 1121 1121 1121 0 0 time (sec) 100 108 99 102.3333 4.9329 0.0482 SeqF-Snappy mem.(GB) 187.7981 185.6609 183.4868 185.6486 2.1556 0.0116 maps 472 472 472 472 0 0 time (sec) 90 84 93 89 4.5826 0.0515 SeqF-Gzip mem.(GB) 120.9506 119.0239 116.7041 118.8929 2.1263 0.0179 maps 293 293 293 293 0 0 time (sec) 248 246 246 246.6667 1.1547 0.0047 Avro-Plain mem.(GB) 441.3357 437.3284 434.7875 437.8172 3.3014 0.0075 maps 1069 1069 1069 1069 0 0 time (sec) 192 179 180 183.6667 7.2342 0.0394 Avro-Snappy mem.(GB) 266.9053 272.0406 269.5324 269.4928 2.5679 0.0095 maps 590 590 590 590 0 0 time (sec) 135 130 132 132.3333 2.5166 0.019 RCFile-Plain mem.(GB) 373.4748 379.118 375.2776 375.9568 2.8823 0.0077 maps 942 942 942 942 0 0 time (sec) 67 69 70 68.6667 1.5275 0.0222 RCFile-Snappy mem.(GB) 133.0986 130.9384 129.6539 131.2303 1.7408 0.0133 maps 334 334 334 334 0 0 time (sec) 50 49 48 49 1 0.0204 RCFile-Gzip mem.(GB) 78.4674 78.2689 79.2707 78.669 0.5305 0.0067 maps 186 186 186 186 0 0 time (sec) 84 89 94 89 5 0.0562 ORC-Plain mem.(GB) 317.9462 317.5583 315.0083 316.8376 1.596 0.005 maps 686 686 686 686 0 0 time (sec) 48 44 53 48.3333 4.5092 0.0933 ORC-Snappy mem.(GB) 112.8209 112.109 111.4543 112.1281 0.6835 0.0061 maps 288 288 288 288 0 0 time (sec) 92 87 86 88.3333 3.2146 0.0364 Parquet-Plain mem.(GB) 337.4675 331.0759 334.3138 334.2858 3.1959 0.0096 maps 355 355 355 355 0 0 time (sec) 56 59 55 56.6667 2.0817 0.0367 Parquet-Snappy mem.(GB) 217.6367 221.3561 218.3700 219.1210 1.9701 0.009 maps 219 219 219 219 0 0 time (sec) 49 52 49 50 1.7321 0.0346 Parquet-Gzip mem.(GB) 147.6998 147.4945 149.0786 148.091 0.8614 0.0058 maps 167 167 167 167 0 0

Table 5.6: Opensky query 5 results

[Figure: bar chart of storage size (GB) per file format (JSON, SequenceFile, Avro, RCFile, ORC, Parquet) and compression codec (Plain, Snappy, Gzip).]

Figure 5.7: GitHub log format sizes

This worse performance of Parquet has no obvious reason, as we saw in query 3 that Parquet was the best at managing metadata and at not loading unnecessary blocks. This contradiction may be caused by the different types of metadata it is managing. Again, the number of maps is the same as in query 4, so we can conclude that the number of map tasks of a query depends only on the configuration and the file sizes.

5.2.2 GitHub log

Opposite to the Opensky dataset, we have considered the GitHub log one, which is smaller but has a much more complex structure. It has many columns, and a lot of them have complex types, including inner structures and arrays of structures.

We must take into account that some of the formats do not allow complex types in their columns, so some processing is needed to be able to perform read and write operations with them. With SequenceFile the solution is very simple: just use the entire row as the value and treat it as a string (the same thing done with Opensky). But with RCFile the solution is not so easy. As this format only allows storing plain tables, we have had to flatten the structure (a minimal sketch of this kind of flattening is given at the end of this section). Thus we lose the nested structure and just manage a plain table. We need to keep this fact in mind throughout the results.

To analyze the results obtained with this dataset we will follow the same dynamic as with Opensky. First of all, the storage volume of each format is one of the main aspects that we have to care about. The data sizes for each format are shown in Figure 5.7 and Table 5.7.

We can observe that the graphic is quite different from the one showing the sizes of the Opensky dataset. Here the JSON and SequenceFile formats stand out clearly, as for every compression option they have the biggest storage volumes (almost 200GB with no compression, over 50GB with Snappy, and over 30GB with Gzip). SequenceFile is slightly bigger than the text file, because it is stored the same way as JSON but adding a numeric key for each row; for example, with no compression SequenceFile has 2% more storage volume than JSON.

              Plain    Snappy   Gzip
JSON          195.46   54.339   34.604
SequenceFile  199.38   56.334   36.321
Avro          115.06   41.287        -
RCFile        101.86   38.679   24.996
ORC            68.976  32.829        -
Parquet        87.95   42.945   28.893

Table 5.7: GitHub log format sizes (GB)

Avro achieves much better compression rates than it did with the CSV data, reducing the original data by 41.02% with no compression and 24.03% with Snappy. This is possible because the Avro structure is very similar to a JSON one. Thus, it is easier for this format to take advantage of its binary serialization, maintaining the same structure while compressing the data; for example, Avro does not need to write the entire key names each time they occur.

However, the smallest sizes are reached by the column-oriented formats. Among them, ORC has the best rate without using any compression codec (reaching a volume 64.71% smaller), unlike what happened with the Opensky data, where Parquet was the best one. It seems that ORC's lightweight compression performs better when it deals with many columns, in contrast to Parquet, which is 27.51% larger than ORC. This can happen because of the definition and repetition levels stored by Parquet, or because of better column serialization techniques in ORC.

When using compression codecs, RCFile is the one that reaches the smallest size when used with Gzip, with a size of less than 25GB. Nevertheless, it is probable that ORC could reach better compression rates if it were possible to use it with Gzip, as with Snappy compression ORC is 15.12% smaller than RCFile.

It is worth noting that there is a huge difference in storage volume between using compression codecs or not, as happened with Opensky; for example, with JSON the Gzip compression obtains a size 5.64 times smaller than without any compression. Thus, we can say that if the storage capacity is a problem, the usage of a compression codec is almost compulsory.

Query 1

First we tested query 1, the full scan of the entire dataset, made by rows. The performance of this query for each format is represented graphically in Figure 5.8 and listed in Table 5.8.

The most striking aspect of the graphics is the fact that all the formats launch the same number of map tasks. This is probably because we are working with a dataset composed of many small files, and each file is processed entirely by a single map. Thus, we have 2880 map tasks for every format and compression codec. This is a good opportunity to get a better idea of which format uses more memory inside each map task.

The highest memory peaks are reached by ORC and Parquet, being 132.38% and 160.36% larger, respectively, than the one obtained by JSON when no compression is used. Even though they are the ones with the smallest file sizes, their memory usage is very high.

[Figure: bar charts of execution time (seconds), memory peak (GB) and launched map tasks per file format and compression codec (Plain, Snappy, Gzip).]

Figure 5.8: GitHub log query 1 performance

59 run 1 run 2 run 3 Average Desv. VC time (sec) 229 228 228 228.3333 0.5774 0.0025 JSON-Plain mem.(GB) 1111.3218 1079.638 1080.7392 1090.5664 17.9832 0.0165 maps 2880 2880 2880 2880 0 0 time (sec) 277 261 264 267.3333 8.5049 0.0318 JSON-Snappy mem.(GB) 1114.6334 1093.8567 1102.531 1103.6737 10.4354 0.0095 maps 2880 2880 2880 2880 0 0 time (sec) 155 141 163 153 11.1355 0.0728 JSON-Gzip mem.(GB) 1132.434 1197.8774 1235.2272 1188.5129 52.0325 0.0438 maps 2880 2880 2880 2880 0 0 time (sec) 254 233 253 246.6667 11.8462 0.048 SeqF-Plain mem.(GB) 1146.8443 1149.677 1130.1325 1142.2179 10.5617 0.0092 maps 2880 2880 2880 2880 0 0 time (sec) 249 261 262 257.3333 7.2342 0.0281 SeqF-Snappy mem.(GB) 1126.719 1144.6001 1102.639 1124.6527 21.0567 0.0187 maps 2880 2880 2880 2880 0 0 time (sec) 249 251 247 249 2 0.008 SeqF-Gzip mem.(GB) 1118.4939 1084.2121 1075.7878 1092.8313 22.6202 0.0207 maps 2880 2880 2880 2880 0 0 time (sec) 257 244 241 247.3333 8.5049 0.0344 Avro-Plain mem.(GB) 1624.933 1672.9498 1664.6185 1654.1671 25.6579 0.0155 maps 2880 2880 2880 2880 0 0 time (sec) 397 374 396 389 13 0.0334 Avro-Snappy mem.(GB) 1521.3261 1537.2088 1485.7575 1514.7641 26.3458 0.0174 maps 2880 2880 2880 2880 0 0 time (sec) 237 223 220 226.6667 9.0738 0.04 RCFile-Plain mem.(GB) 1202.5244 1239.0932 1282.3127 1241.3101 39.9403 0.0322 maps 2880 2880 2880 2880 0 0 time (sec) 214 180 196 196.6667 17.0098 0.0865 RCFile-Snappy mem.(GB) 1221.6956 1270.0992 1290.8145 1260.8698 35.4717 0.0281 maps 2880 2880 2880 2880 0 0 time (sec) 245 234 220 233 12.53 0.0538 RCFile-Gzip mem.(GB) 1207.1254 1225.5703 1194.8697 1209.1885 15.4539 0.0128 maps 2880 2880 2880 2880 0 0 time (sec) 352 336 322 336.6667 15.0111 0.0446 ORC-Plain mem.(GB) 2624.4902 2546.989 2429.3436 2533.6076 98.2591 0.0388 maps 2880 2880 2880 2880 0 0 time (sec) 225 228 219 224 4.5826 0.0205 ORC-Snappy mem.(GB) 1907.1739 1948.1781 1961.3867 1938.9129 28.2691 0.0146 maps 2880 2880 2880 2880 0 0 time (sec) 681 692 684 685.6667 5.6862 0.0083 Parquet-Plain mem.(GB) 2937.2068 2836.5781 2741.6662 2838.4837 97.7842 0.0344 maps 2880 2880 2880 2880 0 0 time (sec) 539 513 522 524.6667 13.2035 0.0252 Parquet-Snappy mem.(GB) 2218.6109 2295.641 2316.6691 2276.9737 51.6256 0.0227 maps 2880 2880 2880 2880 0 0 time (sec) 491 510 514 505 12.2882 0.0243 Parquet-Gzip mem.(GB) 1982.775 2079.1378 2170.308 2077.4069 93.7785 0.0451 maps 2880 2880 2880 2880 0 0

Table 5.8: GitHub log query 1 results

Also, these columnar formats do not have very good execution times, with Parquet taking the most time to make the full scan by a very big margin, its lowest value being over 500 seconds, using the Gzip codec. As it is a columnar format, this is understandable. RCFile, despite also being a column-oriented format, has a very good performance, with a low execution time, around 200 seconds for each compression option, and a not very high memory usage, under 1500GB. This is not very common, as columnar formats do not usually perform well when reading by rows. Something similar happened with Opensky, where both ORC and RCFile had good execution times. Thus, we can intuit that they have been optimized so as not to be so harmed by row scans. It is important to remember that here RCFile does not manage complex columns and all of them are stored as simple ones.

The row-oriented formats have more predictable results, the human-readable ones being the fastest, as they do not need to deserialize the data. Avro has a worse execution time when used with the Snappy codec, lasting 57.48% longer than when no compression is used.

In contrast to the results obtained with Opensky, using compression with the formats does not reduce the execution times. Since we are dealing with very small files, compression cannot lower the number of map tasks and, thus, there are no better execution times. Nevertheless, the compression codecs do not give worse performance (except with Avro), so it remains beneficial to use them, as the reduction in file size is very significant.

Also, it is important to remark that the memory peaks are even higher than the ones seen with the Opensky dataset, even though the GitHub log data is much smaller. This can be because of the large number of map tasks.

Query 2.1

The next query is query 2, which reads a complete column. For this dataset we have made two variations, requesting first a simple column and, in the next query, a very nested one. Firstly we are going to see the simplest one, which we have called query 2.1. The column we are reading now is "id". In Figure 5.9 and Table 5.9 we can see the results obtained for this query.

Again we have the same number of map tasks for all of the formats (2880), as we are working with very small files and, with the default MapReduce configuration, there is only one file per map task. As in the previous query, we can see that using a compression codec does not make the execution time shorter. But, as we said previously, it is still recommended to use one, because there is a great reduction in storage resources and the consumption of time and memory is not increased.

Looking at the execution times, the results are the expected ones. The fastest formats are the columnar ones, always under 300 seconds, and the row-oriented ones are the slowest, always taking more than 300 seconds, except plain Avro. It is necessary to highlight that Avro has a good performance when used without any compression codec. It is even the second fastest without compression, just 6.13% slower than RCFile. But when using it with Snappy, the time taken to resolve the query is 87.86% longer.

[Figure: bar charts of execution time (seconds), memory peak (GB) and launched map tasks per file format and compression codec (Plain, Snappy, Gzip).]

Figure 5.9: GitHub log query 2.1 performance

62 run 1 run 2 run 3 Average Desv. VC time (sec) 303 305 290 299.3333 8.1445 0.0272 JSON-Plain mem.(GB) 1288.1599 1329.4583 1304.3581 1307.3254 20.8085 0.0159 maps 2880 2880 2880 2880 0 0 time (sec) 360 372 360 364 6.9282 0.019 JSON-Snappy mem.(GB) 1280.8716 1277.9128 1221.7229 1260.1691 33.3282 0.0264 maps 2880 2880 2880 2880 0 0 time (sec) 222 206 213 213.6667 8.0208 0.0375 JSON-Gzip mem.(GB) 1311.7926 1311.0449 1331.235 1318.0242 11.447 0.0087 maps 2880 2880 2880 2880 0 0 time (sec) 307 307 299 304.3333 4.6188 0.0152 SeqF-Plain mem.(GB) 1288.3797 1343.3806 1378.107 1336.6224 45.2438 0.0338 maps 2880 2880 2880 2880 0 0 time (sec) 324 328 327 326.3333 2.0817 0.0064 SeqF-Snappy mem.(GB) 1291.4937 1338.2846 1306.139 1311.9724 23.9346 0.0182 maps 2880 2880 2880 2880 0 0 time (sec) 319 313 317 316.3333 3.0551 0.0097 SeqF-Gzip mem.(GB) 1276.848 1272.9409 1275.4613 1275.0834 1.9808 0.0016 maps 2880 2880 2880 2880 0 0 time (sec) 173 174 173 173.3333 0.5774 0.0033 Avro-Plain mem.(GB) 1480.1266 1484.6262 1434.2083 1466.3203 27.9007 0.019 maps 2880 2880 2880 2880 0 0 time (sec) 331 313 333 325.6667 11.0151 0.0338 Avro-Snappy mem.(GB) 1449.9317 1478.4229 1464.3875 1464.2474 14.2461 0.0097 maps 2880 2880 2880 2880 0 0 time (sec) 166 161 163 163.3333 2.5166 0.0154 RCFile-Plain mem.(GB) 972.533 965.4238 923.09 953.6823 26.7311 0.028 maps 2880 2880 2880 2880 0 0 time (sec) 144 121 135 133.3333 11.5902 0.0869 RCFile-Snappy mem.(GB) 966.1714 923.5923 905.8316 931.8651 31.0089 0.0333 maps 2880 2880 2880 2880 0 0 time (sec) 195 170 163 176 16.8226 0.0956 RCFile-Gzip mem.(GB) 954.704 977.5978 940.6935 957.6651 18.6295 0.0195 maps 2880 2880 2880 2880 0 0 time (sec) 234 245 210 229.6667 17.8979 0.0779 ORC-Plain mem.(GB) 2608.1496 2717.9267 2599.5066 2641.861 66.0164 0.025 maps 2880 2880 2880 2880 0 0 time (sec) 150 162 167 159.6667 8.7369 0.0547 ORC-Snappy mem.(GB) 1909.6842 1975.4155 1878.6202 1921.24 49.4215 0.0257 maps 2880 2880 2880 2880 0 0 time (sec) 279 269 278 275.3333 5.5076 0.02 Parquet-Plain mem.(GB) 1293.4583 1342.8166 1277.1932 1304.4894 34.1742 0.0262 maps 2880 2880 2880 2880 0 0 time (sec) 236 223 217 225.3333 9.7125 0.0431 Parquet-Snappy mem.(GB) 1334.9071 1369.9484 1331.3432 1345.3996 21.3344 0.0159 maps 2880 2880 2880 2880 0 0 time (sec) 212 222 225 219.6667 6.8069 0.031 Parquet-Gzip mem.(GB) 1323.4625 1332.5017 1320.4293 1325.4645 6.2803 0.0047 maps 2880 2880 2880 2880 0 0

Table 5.9: GitHub log query 2.1 results

This almost doubles the execution time; thus we do not consider Avro a good option, as it needs to be used with no compression to obtain a good performance.

The rest of the row-oriented formats show the expected performance. They have an execution time longer than the sequential reading, as they have to read the entire file and just take the requested column; for example, plain JSON is 31.14% slower than in query 1. Similarly to what happened in all the previous queries, SequenceFile has the same results as the text file, but slightly worse, being a little slower and using almost the same memory.

Now focusing on the column-oriented formats, Parquet is the slowest, with more than 200 seconds for every compression option, and RCFile the fastest, with an execution time of 133.33 seconds with Snappy compression. This is probably because both RCFile and ORC have been designed just with MapReduce in mind, while Parquet has, as one of its main characteristics, compatibility with many more tools and frameworks. Also, RCFile is the simplest format and, in some cases, like these simple queries reading one column, the simpler the file format is, the better it performs.

ORC with Snappy compression also seems to be a good option for this query if we just look at the execution time (159.67 seconds). But the memory used by this format is much larger than the rest, reaching a peak of almost 2000GB, while no other format even reaches 1500GB.

The formats that still seem to take advantage of the compression codecs, referring just to execution time and memory peaks, are the columnar ones, with execution times 18.19% and 20.45% faster with Snappy and Gzip respectively.

Query 2.2

After making a query that just reads a very simple column, it is interesting to make another query that reads a column that is not at the first level, but deeply nested. The selected column for this query has been "payload.pull_request.base.repo.owner.login". This column has a simple type (string), but it is at the sixth nesting level. It is also a column that contains a lot of null values, even at higher levels (many records have a NULL "pull_request"). Figure 5.10 and Table 5.10 show the performance of this query 2.2.

The results we have obtained are very similar to the ones seen in the previous query. Thus we can assume that the location of a column, in terms of nesting level, does not affect the performance of the query. As could be expected, the use of resources is the same as in query 2.1.

It is easy to guess that in the column-oriented formats there is no difference between requesting a simple column or a very nested one, as the values of each column are stored together and there is no need to search through different levels of nested structures to find them. The row-oriented formats do have a slightly worse performance than in the simple column read; for example, plain JSON is 6.24% slower than in query 2.1. This may be because they need to search for the requested value through the data structures, going level by level. JSON (and thus SequenceFile storing JSON) and Avro are based on a DOM tree, and finding one value inside a lot of structures is not so straightforward. However, the difference is not very significant.

[Figure: bar charts of execution time (seconds), memory peak (GB) and launched map tasks per file format and compression codec (Plain, Snappy, Gzip).]

Figure 5.10: GitHub log query 2.2 performance

65 run 1 run 2 run 3 Average Desv. VC time (sec) 328 316 309 317.6667 9.609 0.0302 JSON-Plain mem.(GB) 1288.0167 1317.7313 1276.5917 1294.1132 21.2366 0.0164 maps 2880 2880 2880 2880 0 0 time (sec) 367 364 361 364 3 0.0082 JSON-Snappy mem.(GB) 1283.982 1238.0154 1296.4993 1272.8322 30.7949 0.0242 maps 2880 2880 2880 2880 0 0 time (sec) 218 218 236 224 10.3923 0.0464 JSON-Gzip mem.(GB) 1312.5817 1314.1962 1386.8844 1337.8874 42.4403 0.0317 maps 2880 2880 2880 2880 0 0 time (sec) 323 312 298 311 12.53 0.0403 SeqF-Plain mem.(GB) 1295.4929 1274.8298 1252.6477 1274.3235 21.4271 0.0168 maps 2880 2880 2880 2880 0 0 time (sec) 329 327 332 329.3333 2.5166 0.0076 SeqF-Snappy mem.(GB) 1294.8008 1244.692 1267.6565 1269.0498 25.0834 0.0198 maps 2880 2880 2880 2880 0 0 time (sec) 344 316 318 326 15.6205 0.0479 SeqF-Gzip mem.(GB) 1280.12 1267.562 1273.152 1273.6114 6.2916 0.0049 maps 2880 2880 2880 2880 0 0 time (sec) 197 212 228 212.3333 15.5027 0.073 Avro-Plain mem.(GB) 1443.8832 1485.8713 1434.3562 1454.7036 27.4092 0.0188 maps 2880 2880 2880 2880 0 0 time (sec) 347 343 345 345 2 0.0058 Avro-Snappy mem.(GB) 1440.9226 1371.1675 1489.986 1434.0253 59.7088 0.0416 maps 2880 2880 2880 2880 0 0 time (sec) 173 165 162 166.6667 5.6862 0.0341 RCFile-Plain mem.(GB) 968.4505 1017.2895 1049.3545 1011.6982 40.7407 0.0403 maps 2880 2880 2880 2880 0 0 time (sec) 140 124 138 134 8.7178 0.0651 RCFile-Snappy mem.(GB) 959.1721 1003.8599 956.4677 973.1666 26.6156 0.0273 maps 2880 2880 2880 2880 0 0 time (sec) 175 165 165 168.3333 5.7735 0.0343 RCFile-Gzip mem.(GB) 940.3637 979.2196 996.5811 972.0548 28.7854 0.0296 maps 2880 2880 2880 2880 0 0 time (sec) 280 267 256 267.6667 12.0139 0.0449 ORC-Plain mem.(GB) 2587.2373 2529.7489 2491.4485 2536.1449 48.2136 0.019 maps 2880 2880 2880 2880 0 0 time (sec) 170 168 162 166.6667 4.1633 0.025 ORC-Snappy mem.(GB) 1908.429 1867.2642 1940.1249 1905.2727 36.5327 0.0192 maps 2880 2880 2880 2880 0 0 time (sec) 279 275 290 281.3333 7.7675 0.0276 Parquet-Plain mem.(GB) 1292.8724 1260.7316 1279.8947 1277.8329 16.1693 0.0127 maps 2880 2880 2880 2880 0 0 time (sec) 228 229 237 231.3333 4.9329 0.0213 Parquet-Snappy mem.(GB) 1326.3562 1367.3937 1406.5012 1366.7504 40.0763 0.0293 maps 2880 2880 2880 2880 0 0 time (sec) 227 225 219 223.6667 4.1633 0.0186 Parquet-Gzip mem.(GB) 1319.0871 1380.6884 1432.5885 1377.4547 56.8198 0.0412 maps 2880 2880 2880 2880 0 0

Table 5.10: GitHub log query 2.2 results

66 Query 3

Now we start with the Hive queries. The first one is query 3. This is the same count query done on the Opensky dataset. To achieve good performance it is important to have metadata in the file headers. The results of this query are shown in Figure 5.11 and Table 5.11.

First of all we can notice that the number of map tasks is now different from the ones obtained in the MapReduce queries. This is because Hive does not use the same configuration as the MapReduce default; here, each map task processes more than a single file. Thus the number of map tasks drops below 200 for almost all the formats when compression is used.

Considering general performance, once again the best formats for this counting query are the columnar ones, because of their metadata. The human-readable formats do not have any information about the number of rows inside each block; thus, they need to read the records to count how many there are. We can notice that the human-readable formats have a better performance, in execution time, memory peak and number of maps, when they are used with a compression codec. This is because the number of map tasks is much lower when using compression codecs; in the previous queries this did not happen because the number of map tasks was the same for all. Also, the data does not need to be decompressed: it is only necessary to count the records, we do not need to know what is inside them.

Avro is again different from the rest of the formats, as its execution time with Snappy compression is very similar to the one with no compression, although the memory usage and the number of maps are much smaller when using the codec. Even though it has information about the number of records inside each block, it seems that it does not use that metadata, as it has almost the same execution time as JSON without compression (just 5.37% faster), and is 95.34% slower when using Snappy.

The best formats in every aspect for this query are the column-oriented ones. They have the smallest files and they carry information about the number of records inside their metadata. The slowest of the three is Parquet (over 100 seconds with no compression), unlike with the Opensky data, where it was the fastest. This could be because the record count information is inside the metadata of each page, and there is at least one page for every column chunk, and one column chunk for each column inside a row group; ORC and RCFile keep this information in the row group metadata. Thus, Parquet has to read more than 200 page headers for each row group, while RCFile and ORC just read one value.

If we had to select a single best format for this query, it would be ORC. It is the fastest one, taking less than 50 seconds with or without compression, so it manages its metadata well, and it uses very little memory (the only one under 200GB with no compression, and under 100GB with Snappy).

The use of resources is much smaller than in the three previous queries, with all memory peaks under 600GB, unlike the MapReduce queries, where the memory peaks were even higher than 2000GB. This is mostly because of the much smaller number of map tasks.

[Figure: bar charts of execution time (seconds), memory peak (GB) and launched map tasks per file format and compression codec (Plain, Snappy, Gzip).]

Figure 5.11: GitHub log query 3 performance

68 run 1 run 2 run 3 Average Desv. VC time (sec) 184 181 195 186.6667 7.3711 0.0395 JSON-Plain mem.(GB) 531.8244 514.0508 511.7376 519.2043 10.9904 0.0212 maps 716 716 716 716 0 0 time (sec) 93 86 81 86.6667 6.0277 0.0696 JSON-Snappy mem.(GB) 168.3842 162.8932 168.184 166.4871 3.114 0.0187 maps 219 219 219 219 0 0 time (sec) 64 63 60 62.3333 2.0817 0.0334 JSON-Gzip mem.(GB) 109.563 111.2262 108.2453 109.6782 1.4938 0.0136 maps 142 142 142 142 0 0 time (sec) 199 200 195 198 2.6458 0.0134 SeqF-Plain mem.(GB) 540.9447 547.1223 526.058 538.0417 10.828 0.0201 maps 726 726 726 726 0 0 time (sec) 107 93 100 100 7 0.07 SeqF-Snappy mem.(GB) 176.6223 180.9496 190.7824 182.7848 7.2562 0.0397 maps 227 227 227 227 0 0 time (sec) 70 68 68 68.6667 1.1547 0.0168 SeqF-Gzip mem.(GB) 115.3382 116.5712 117.1774 116.3623 0.9372 0.0081 maps 149 149 149 149 0 0 time (sec) 181 171 177 176.3333 5.0332 0.0285 Avro-Plain mem.(GB) 366.4475 357.301 363.5788 362.4424 4.6779 0.0129 maps 444 444 444 444 0 0 time (sec) 169 169 165 167.6667 2.3094 0.0138 Avro-Snappy mem.(GB) 136.9774 139.9114 143.6536 140.1808 3.3463 0.0239 maps 168 168 168 168 0 0 time (sec) 66 59 62 62.3333 3.5119 0.0563 RCFile-Plain mem.(GB) 275.4747 290.7746 293.6242 286.6245 9.7605 0.0341 maps 396 396 396 396 0 0 time (sec) 33 37 35 35 2 0.0571 RCFile-Snappy mem.(GB) 110.262 109.5541 114.2463 111.3542 2.5296 0.0227 maps 158 158 158 158 0 0 time (sec) 27 26 23 25.3333 2.0817 0.0822 RCFile-Gzip mem.(GB) 71.6812 69.3358 67.5511 69.5227 2.0714 0.0298 maps 103 103 103 103 0 0 time (sec) 44 44 38 42 3.4641 0.0825 ORC-Plain mem.(GB) 190.3954 184.8663 179.1133 184.7917 5.6414 0.0305 maps 274 274 274 274 0 0 time (sec) 22 22 21 21.6667 0.5774 0.0266 ORC-Snappy mem.(GB) 62.7575 64.3835 65.5450 64.2287 1.4002 0.0218 maps 135 135 135 135 0 0 time (sec) 159 158 141 152.6667 10.116 0.0663 Parquet-Plain mem.(GB) 413.0379 398.437 403.3696 404.9482 7.4273 0.0183 maps 346 346 346 346 0 0 time (sec) 103 99 102 101.3333 2.0817 0.0205 Parquet-Snappy mem.(GB) 207.1419 210.1475 210.3051 209.1982 1.7825 0.0085 maps 174 174 174 174 0 0 time (sec) 89 88 97 91.3333 4.9329 0.054 Parquet-Gzip mem.(GB) 138.6923 134.3401 139.7849 137.6058 2.8804 0.0209 maps 119 119 119 119 0 0

Table 5.11: GitHub log query 3 results

Query 4

The next query is query 4, the single record search. To find the row we use the column "id", more specifically the record with the value "7044822816". The performance results of this query can be seen in Figure 5.12 and Table 5.12.

We can observe that, as there are very few columns involved, the most efficient usage of resources and time is reached by the column-oriented formats. It is obvious that, if very few columns are needed to find the record, these formats can avoid the unnecessary read of all the other columns, even if all the fields are requested for the final record.

Again, the number of map tasks is directly proportional to the execution time and to the memory usage, except for Avro, which lasts 6% longer with the Snappy codec than with no compression. Taking into account all the previous queries, this is probably because Snappy decompression is very slow for Avro-serialized data with many complex columns; we do not know whether this statement can be generalized to other scenarios, as this worsening with compression did not happen with the Opensky dataset.

The row-oriented formats are much slower than the column-oriented ones. The columnar formats have execution times under 50 seconds (except for RCFile with no compression), and the row-oriented ones never go below that time. This is because the columnar files do not need to read more than 200 columns just to access one or two of them. Thus, RCFile, ORC and Parquet read only the column "id" and, only if they find the requested record, read the rest of the columns to return the entire row. This allows them to skip a great number of unnecessary data reads. The row formats cannot skip columns because of their storage orientation.

Within the columnar formats group, ORC is the fastest one, being 36.92% faster than Parquet and 66.33% faster than RCFile, both without compression. Moreover, ORC with Snappy compression is faster than any other combination (only 17 seconds). The reason is probably that it was specifically designed for Hive and MapReduce, so it is plausible that ORC is faster than Parquet; and, as it is an optimization of RCFile, it is not surprising that ORC is also faster than it.

Query 5.1

Query 5 is divided into two different queries, as we previously did with query 2. In this query 5.1 we filter by the column "id", and in the next one we will filter by a very nested column. The filter applied here selects the ids higher than "7607000000". This is a very restrictive query that very few records satisfy. The rest of the columns are not requested. The results of this query can be found in Figure 5.13 and Table 5.13.

The performance of this query is very similar to that of the previous ones. The best execution times and usage of resources are obtained by the column-oriented formats, and the row-oriented ones are up to 5.63 times slower (comparing, for example, ORC and JSON with Snappy compression). Avro continues to have a worse execution time with Snappy (9.52% slower), which reinforces the theory of slow decompression of its serialization for this type of data.

[Figure: bar charts of execution time (seconds), memory peak (GB) and launched map tasks per file format and compression codec (Plain, Snappy, Gzip).]

Figure 5.12: GitHub log query 4 performance

71 run 1 run 2 run 3 Average Desv. VC time (sec) 175 172 182 176.3333 5.1316 0.0291 JSON-Plain mem.(GB) 354.1893 347.7608 353.4189 351.7897 3.5103 0.01 maps 716 716 716 716 0 0 time (sec) 81 79 71 77 5.2915 0.0687 JSON-Snappy mem.(GB) 114.8108 111.9635 107.5566 111.4436 3.6549 0.0328 maps 219 219 219 219 0 0 time (sec) 63 56 57 58.6667 3.7859 0.0645 JSON-Gzip mem.(GB) 75.0492 75.2893 73.9718 74.7701 0.7017 0.0094 maps 142 142 142 142 0 0 time (sec) 175 175 167 172.3333 4.6188 0.0268 SeqF-Plain mem.(GB) 359.5041 360.9134 345.506 355.3078 8.5178 0.024 maps 726 726 726 726 0 0 time (sec) 84 80 81 81.6667 2.0817 0.0255 SeqF-Snappy mem.(GB) 121.1713 127.3365 128.4838 125.6639 3.9327 0.0313 maps 227 227 227 227 0 0 time (sec) 60 61 60 60.3333 0.5774 0.0096 SeqF-Gzip mem.(GB) 78.2657 75.4434 74.3065 76.0052 2.0385 0.0268 maps 149 149 149 149 0 0 time (sec) 150 156 156 154 3.4641 0.0225 Avro-Plain mem.(GB) 250.3505 249.6621 247.8245 249.279 1.3058 0.0052 maps 444 444 444 444 0 0 time (sec) 162 168 163 164.3333 3.2146 0.0196 Avro-Snappy mem.(GB) 93.7144 94.5494 91.3228 93.1955 1.6747 0.018 maps 168 168 168 168 0 0 time (sec) 85 82 84 83.6667 1.5275 0.0183 RCFile-Plain mem.(GB) 181.4096 189.798 197.7733 189.6603 8.1827 0.0431 maps 396 396 396 396 0 0 time (sec) 33 37 37 35.6667 2.3094 0.0647 RCFile-Snappy mem.(GB) 72.9717 73.7518 78.1289 74.9508 2.7798 0.0371 maps 158 158 158 158 0 0 time (sec) 29 28 26 27.6667 1.5275 0.0552 RCFile-Gzip mem.(GB) 47.912 50.2697 52.4891 50.2236 2.2889 0.0456 maps 103 103 103 103 0 0 time (sec) 27 28 27 27.3333 0.5774 0.0211 ORC-Plain mem.(GB) 126.8697 130.6313 131.8044 129.7685 2.5781 0.0199 maps 274 274 274 274 0 0 time (sec) 18 17 16 17 1 0.0588 ORC-Snappy mem.(GB) 62.7628 59.9316 59.3725 60.689 1.8177 0.03 maps 135 135 135 135 0 0 time (sec) 39 46 45 43.3333 3.7859 0.0874 Parquet-Plain mem.(GB) 158.1578 150.8525 155.8548 154.955 3.7348 0.0241 maps 346 346 346 346 0 0 time (sec) 31 32 28 30.3333 2.0817 0.0686 Parquet-Snappy mem.(GB) 80.8347 80.5275 80.4824 80.6149 0.1917 0.0024 maps 174 174 174 174 0 0 time (sec) 23 22 21 22 1 0.0455 Parquet-Gzip mem.(GB) 55.3364 56.1145 58.3085 56.5865 1.5413 0.0272 maps 119 119 119 119 0 0

Table 5.12: GitHub log query 4 results

Figure 5.13: GitHub log query 5.1 performance (bar charts of execution time in seconds, memory peak in GB and launched map tasks for JSON, SequenceFile, Avro, RCFile, ORC and Parquet with Plain, Snappy and Gzip).

Format-Codec     Metric       run 1      run 2      run 3      Average    Desv.     VC
JSON-Plain       time (sec)   178        173        170        173.6667   4.0415    0.0233
                 mem. (GB)    353.9083   366.3234   357.8907   359.3742   6.3391    0.0176
                 maps         716        716        716        716        0         0
JSON-Snappy      time (sec)   87         93         79         86.3333    7.0238    0.0814
                 mem. (GB)    114.6645   111.7646   117.1561   114.5284   2.6983    0.0236
                 maps         219        219        219        219        0         0
JSON-Gzip        time (sec)   59         56         50         55         4.5826    0.0833
                 mem. (GB)    75.0104    77.2089    81.3612    77.8602    3.2251    0.0414
                 maps         142        142        142        142        0         0
SeqF-Plain       time (sec)   174        170        173        172.3333   2.0817    0.0121
                 mem. (GB)    359.4693   357.388    370.5398   362.4657   7.0694    0.0195
                 maps         726        726        726        726        0         0
SeqF-Snappy      time (sec)   82         88         78         82.6667    5.0332    0.0609
                 mem. (GB)    120.6574   118.2539   115.2089   118.0401   2.7306    0.0231
                 maps         227        227        227        227        0         0
SeqF-Gzip        time (sec)   62         63         63         62.6667    0.5774    0.0092
                 mem. (GB)    77.8691    81.5406    82.9186    80.7761    2.6101    0.0323
                 maps         149        149        149        149        0         0
Avro-Plain       time (sec)   147        146        148        147        1         0.0068
                 mem. (GB)    250.5735   246.5192   238.638    245.2436   6.0691    0.0247
                 maps         444        444        444        444        0         0
Avro-Snappy      time (sec)   158        164        161        161        3         0.0186
                 mem. (GB)    93.9889    92.6881    95.6732    94.1168    1.4966    0.0159
                 maps         168        168        168        168        0         0
RCFile-Plain     time (sec)   46         52         45         47.6667    3.7859    0.0794
                 mem. (GB)    184.2929   180.8098   177.4051   180.836    3.444     0.019
                 maps         396        396        396        396        0         0
RCFile-Snappy    time (sec)   24         28         36         29.3333    6.1101    0.2083
                 mem. (GB)    74.4698    78.2291    76.6113    76.4367    1.8857    0.0247
                 maps         158        158        158        158        0         0
RCFile-Gzip      time (sec)   21         20         19         20         1         0.05
                 mem. (GB)    48.0471    46.7311    48.0045    47.5942    0.7478    0.0157
                 maps         103        103        103        103        0         0
ORC-Plain        time (sec)   25         28         30         27.6667    2.5166    0.091
                 mem. (GB)    126.4434   126.0325   122.2162   124.8974   2.331     0.0187
                 maps         274        274        274        274        0         0
ORC-Snappy       time (sec)   15         15         16         15.3333    0.5774    0.0377
                 mem. (GB)    62.7809    65.0781    67.8062    65.2217    2.5157    0.0386
                 maps         135        135        135        135        0         0
Parquet-Plain    time (sec)   46         45         51         47.3333    3.2146    0.0679
                 mem. (GB)    164.5961   163.7599   163.0394   163.7985   0.7791    0.0048
                 maps         346        346        346        346        0         0
Parquet-Snappy   time (sec)   34         34         33         33.6667    0.5774    0.0171
                 mem. (GB)    83.546     80.5893    76.9982    80.3778    3.279     0.0408
                 maps         174        174        174        174        0         0
Parquet-Gzip     time (sec)   25         26         25         25.3333    0.5774    0.0228
                 mem. (GB)    55.7879    57.4599    60.0766    57.7748    2.1616    0.0374
                 maps         119        119        119        119        0         0

Table 5.13: GitHub log query 5.1 results

Once again, ORC is the format with the best overall performance, with the best execution time and one of the lowest memory peaks and numbers of map tasks. However, RCFile also performs well, considering that this format has no metadata describing the range of values inside a column: it is almost as fast as Parquet, being only 0.7% slower with no compression. Parquet and ORC do store the maximum and minimum values of each column inside each block, yet Parquet is 71.08% slower than ORC with no compression.
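The min/max block metadata mentioned above only helps if the query engine pushes the filter down to the reader. The Hive sketch below illustrates the kind of settings involved; whether these options were enabled in our experiments is not stated here, and their names and defaults depend on the Hive version, so this is only an illustrative assumption.

    -- Predicate pushdown lets ORC/Parquet readers skip blocks whose min/max
    -- statistics cannot satisfy the filter (illustrative, version-dependent settings).
    SET hive.optimize.ppd=true;           -- general predicate pushdown
    SET hive.optimize.index.filter=true;  -- use ORC row-group indexes for filtering

    SELECT id
    FROM github_log                         -- assumed table name
    WHERE CAST(id AS BIGINT) > 7607000000;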

Query 5.2 Now that we have seen what happens with a simple column, we will filter by a very nested one. In this query we request the id of those records whose "payload.pull request.base.repo.owner.id" value is over "37000000". Again this is a very restrictive query, which very few records fulfill; moreover, this column contains a lot of null values. It is worth noting that this query involves two columns (the id and payload.pull request.base.repo.owner.id), but as there are more than 200 columns in total, the overall differences should not be significant. The results are shown in Figure 5.14 and Table 5.14.

The obtained performance is slightly different from that of query 5.1. While the numbers for the row-oriented formats are roughly the same, those of the columnar ones are not. The first fact to highlight is that Parquet performs much worse, being 2.79 times slower and using 139.74% more memory than in the previous query when no compression is used. Even though Parquet is supposed to work well with complex data schemas, when we request a very nested column the time taken to resolve the query doubles, and the same happens with the memory peaks. The other important observation is that RCFile is faster than ORC in this case (16.12% and 34.69% faster for Plain and Snappy respectively), and it also uses less memory (34.82% and 47.36% less with no compression and with Snappy respectively). This is due to the already mentioned way this format stores complex types: it simulates a flat table with no complex-type data. Thus, RCFile does not need to deal with a very nested column; for this format it is just a simple one.
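For reference, a hedged HiveQL sketch of query 5.2 follows. The table name github_log and the underscores in the nested column path are assumptions (the extracted schema in Appendix B shows the names with spaces rather than underscores).

    -- Hypothetical Hive version of query 5.2: filter on a deeply nested struct field.
    -- Columnar formats must reassemble the nested path, while RCFile treats it as a flat column.
    SELECT id
    FROM github_log                                                            -- assumed table name
    WHERE CAST(payload.pull_request.base.repo.owner.id AS BIGINT) > 37000000;  -- assumed underscored names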

Figure 5.14: GitHub log query 5.2 performance (bar charts of execution time in seconds, memory peak in GB and launched map tasks for JSON, SequenceFile, Avro, RCFile, ORC and Parquet with Plain, Snappy and Gzip).

Format-Codec     Metric       run 1      run 2      run 3      Average    Desv.     VC
JSON-Plain       time (sec)   178        174        173        175        2.6458    0.0151
                 mem. (GB)    352.8366   355.1195   354.0435   353.9999   1.1421    0.0032
                 maps         716        716        716        716        0         0
JSON-Snappy      time (sec)   88         85         98         90.3333    6.8069    0.0754
                 mem. (GB)    112.9644   113.4298   109.5652   111.9865   2.1097    0.0188
                 maps         219        219        219        219        0         0
JSON-Gzip        time (sec)   60         57         59         58.6667    1.5275    0.026
                 mem. (GB)    73.8045    73.986     75.3829    74.3911    0.8637    0.0116
                 maps         142        142        142        142        0         0
SeqF-Plain       time (sec)   174        169        170        171        2.6458    0.0155
                 mem. (GB)    357.8866   368.2331   377.542    367.8872   9.8323    0.0267
                 maps         726        726        726        726        0         0
SeqF-Snappy      time (sec)   87         76         80         81         5.5678    0.0687
                 mem. (GB)    119.8389   119.0264   121.3271   120.0641   1.1668    0.0097
                 maps         227        227        227        227        0         0
SeqF-Gzip        time (sec)   61         64         62         62.3333    1.5275    0.0245
                 mem. (GB)    77.4336    75.9879    76.0396    76.487     0.8202    0.0107
                 maps         149        149        149        149        0         0
Avro-Plain       time (sec)   148        146        150        148        2         0.0135
                 mem. (GB)    250.549    247.8456   237.9565   245.4504   6.6291    0.027
                 maps         444        444        444        444        0         0
Avro-Snappy      time (sec)   158        160        164        160.6667   3.0551    0.019
                 mem. (GB)    93.8958    90.8912    90.7165    91.8345    1.7873    0.0195
                 maps         168        168        168        168        0         0
RCFile-Plain     time (sec)   56         52         48         52         4         0.0769
                 mem. (GB)    184.434    190.1238   193.932    189.4966   4.7799    0.0252
                 maps         396        396        396        396        0         0
RCFile-Snappy    time (sec)   32         31         33         32         1         0.0313
                 mem. (GB)    72.0013    69.8125    69.8907    70.5682    1.2418    0.0176
                 maps         158        158        158        158        0         0
RCFile-Gzip      time (sec)   24         26         23         24.3333    1.5275    0.0628
                 mem. (GB)    47.4683    46.7259    49.35      47.8481    1.3527    0.0283
                 maps         103        103        103        103        0         0
ORC-Plain        time (sec)   64         61         61         62         1.7321    0.0279
                 mem. (GB)    290.5569   286.5443   295.2008   290.7673   4.3321    0.0149
                 maps         274        274        274        274        0         0
ORC-Snappy       time (sec)   48         49         50         49         1         0.0204
                 mem. (GB)    130.3437   135.6395   133.2441   133.0758   2.6519    0.0199
                 maps         135        135        135        135        0         0
Parquet-Plain    time (sec)   129        123        145        132.3333   11.3725   0.0859
                 mem. (GB)    395.1833   382.8418   400.0735   392.6995   8.8803    0.0226
                 maps         346        346        346        346        0         0
Parquet-Snappy   time (sec)   101        95         93         96.3333    4.1633    0.0432
                 mem. (GB)    199.4257   205.0834   198.0962   200.8684   3.7103    0.0185
                 maps         174        174        174        174        0         0
Parquet-Gzip     time (sec)   87         82         83         84         2.6458    0.0315
                 mem. (GB)    134.7944   133.2604   139.501    135.8519   3.2519    0.0239
                 maps         119        119        119        119        0         0

Table 5.14: GitHub log query 5.2 results

Chapter 6

Conclusions

In this work we have studied distributed file systems, focusing on HDFS and its file formats. In particular, we have studied the characteristics of each of the main HDFS formats, comparing them within a proposed framework. With this theoretical study we are able to anticipate which format is better for a given combination of data and queries. To check whether the conclusions obtained are valid, we have carried out an experimental study using two different datasets: one very simple, and another with complex data types and nested structures. For each dataset we have run a set of simple queries using each format with several compression options (Gzip and Snappy). This experimentation has enriched and verified the conclusions obtained from the theory.

First of all, the most important lesson is that there is no perfect format. No format is better than the rest in every situation, for every kind of data, usage and data structure.

In the first part of this work, in which we studied the characteristics of the file formats, we learned to make a first filter of which ones can fit our needs. For example, if our data has complex-type columns, RCFile is not suitable, as it does not support complex types. This theoretical study also allowed us to make predictions about the expected performance of the different formats. For example, if we are going to store data to be processed with MapReduce jobs, and only a few columns are needed by each job, the theoretically best formats should be ORC and RCFile. But if we want to use that data with many frameworks and tools, we would probably prefer Parquet, as it is easily usable with plenty of tools and frameworks.

However, this qualitative study is not enough in many situations. In some cases the theory does not reflect reality, because small details are omitted or not considered, or there is even a theoretical tie between some formats. Thus, it is helpful to have performance numbers for some specific situations. From those results we generalized some conclusions that enrich the previous theoretical information.

With the experimentation we have learned some new things about the performance of the formats in a MapReduce environment. First of all, the number of map tasks is very important. Although having many maps allows a high degree of parallelization, the number of maps should not be much greater than the parallelization capacity of the cluster. If we have too many map tasks, we will lose a lot of time in scheduling and in starting and ending tasks. Compression therefore becomes crucial when working with big amounts of data: by using a compression codec we will have fewer map tasks, and thus faster jobs that use fewer resources,

as well as significant savings in disk storage. Our conclusion is to always use compression when working with MapReduce.

Another very important aspect to consider is the number of columns, and how many of them will usually be requested when querying the data. If our queries need very few columns, the columnar formats perform better than the row-oriented ones. However, if the data is stored to be fully scanned, it is probably better to use a row-oriented format.

Focusing now on specific formats, it is important to highlight that in our experimentation Avro did not perform very well in many cases: its serialization did not compress the data much, and its execution times were even slower than those of the text formats. What is more, it seems that the block metadata is not used, since the execution times for the count queries were very slow even though Avro stores the number of records per block. On the opposite side, ORC achieved perhaps the best results when used with Hive. Nevertheless, when used with our MapReduce scripts the memory peaks were much higher, which suggests that Hive's configuration for ORC is more optimized than that of our default MapReduce jobs. Therefore, if we want to use a column-oriented format with Hive, ORC could be our best option.

This change in performance between Hive and the default MapReduce scripts led us to another conclusion: in many cases the configuration of a MapReduce job can completely change its performance. An example of this is the difference between processing one small file per map task and processing several. With several files per map task, the number of tasks is smaller, which can mean better performance and faster queries; a sketch of the settings involved is given below.
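To make these two conclusions concrete, the following Hive sketch shows (i) how a table can be declared with a columnar format plus compression, and (ii) the kind of settings that group several small files into one map task. The table names and property values are only illustrative assumptions; the actual configuration of the experimental cluster is not reproduced here.

    -- (i) Columnar storage with compression: fewer blocks and fewer map tasks to schedule.
    CREATE TABLE github_log_orc               -- hypothetical table name
    STORED AS ORC
    TBLPROPERTIES ('orc.compress' = 'SNAPPY')
    AS SELECT * FROM github_log;              -- assumed source table

    -- (ii) Let one map task process several small files instead of one each
    -- (illustrative values; defaults depend on the Hive/Hadoop version).
    SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
    SET mapreduce.input.fileinputformat.split.maxsize=268435456;   -- ~256 MB per split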

6.1 Future work

Despite all the conclusions we have drawn, there are many scenarios that we have not considered, and we do not know which format is better in those cases.

We have selected relatively small datasets, which do not even reach 1 TB of data, the size we would consider Big Data nowadays. The data size was chosen because of existing restrictions on time and cluster resource consumption, as several projects, not only ours, share the same cluster. A new experimental study with larger datasets will be carried out soon on a dedicated cluster with more resources in order to, first, confirm our current conclusions and, second, perform scalability experiments with different sizes of the same datasets.

We have also selected very simple queries, which in most cases only involve one or two columns. Other, more complex queries could be of interest, as simple queries are not always enough; for example, using more than one restriction could lead to different performance results.

Another characteristic of the selected experimental scenario is that we have focused only on MapReduce and Hive. However, there are other computation frameworks that could be included in the study. For example, Spark [14] has become popular, as it performs better for iterative jobs and interactive analytics.

Bibliography

[1] Oxford english dictionary (oed). http://www.oed.com/. Accessed: 2018-06-26.

[2] Doug Laney. 3d data management: Controlling data volume, velocity and variety. META Group Research Note, 6(70), 2001.

[3] Edd Dumbill. What is big data? An introduction to the big data landscape. oreilly.com, https://www.oreilly.com/ideas/what-is-big-data, 2012.

[4] Youtube. Press - youtube. https://www.youtube.com/yt/about/press/. Accessed: 2018-06-26.

[5] Bernard Marr. Big data: The 5 vs everyone must know. LinkedIn Pulse, 6, 2014.

[6] Tom White. Hadoop: The definitive guide. O'Reilly Media, Inc., 2012.

[7] Apache hadoop. http://hadoop.apache.org/. Accessed: 2018-06-26.

[8] Konstantin V Shvachko. Hdfs scalability: The limits to growth. ;login: the magazine of USENIX & SAGE, 35(2):6–16, 2010.

[9] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The hadoop distributed file system. In Mass storage systems and technologies (MSST), 2010 IEEE 26th symposium on, pages 1–10. IEEE, 2010.

[10] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.

[11] Daiga Plase, Laila Niedrite, and Romans Taranovs. A comparison of hdfs compact data formats: Avro versus parquet. Mokslas–Lietuvos ateitis/Science–Future of Lithuania, 9(3):267–276, 2017.

[12] Transaction Processing Performance Council. Tpc-h benchmark specification. Published at http://www.tcp.org/hspec.html, 21:592–603, 2008.

[13] Gavin So. Comparing Performance of Parquet and JSON Files for a Particle Swarm Optimization Algorithm on Apache Spark. PhD thesis, California State University, Northridge, 2016.

[14] Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, and Ion Stoica. Spark: Cluster computing with working sets. HotCloud, 10(10-10):95, 2010.

[15] Zbigniew Baranowski, Luca Canali, Rainer Toebbicke, Julius Hrivnac, and Dario Barberis. A study of data representation in hadoop to optimize data storage and search performance for the atlas eventindex. In Journal of Physics: Conference Series, volume 898, page 062020. IOP Publishing, 2017.

[16] Lars George. HBase: the definitive guide: random access to your planet-size data. O'Reilly Media, Inc., 2011.

[17] Owen O'Malley. File format benchmark - avro, json, orc and parquet. https://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet. Accessed: 2018-06-26.

[18] Stephen O'Sullivan. Choosing an hdfs data storage format: Avro vs parquet and more. https://www.slideshare.net/StampedeCon/choosing-an-hdfs-data-storage-format-avro-vs-parquet-and-more-stampedecon-2015. Accessed: 2018-06-26.

[19] S. Oliveros S. O'Sullivan. How to choose a data format. https://www.svds.com/how-to-choose-a-data-format/. Accessed: 2018-06-26.

[20] Egon G Guba and Yvonna S Lincoln. Epistemological and methodological bases of naturalistic inquiry. ECTJ, 30(4):233–252, 1982.

[21] Eliezer Levy and Abraham Silberschatz. Distributed file systems: Concepts and examples. ACM Computing Surveys (CSUR), 22(4):321–374, 1990.

[22] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The google file system. In ACM SIGOPS operating systems review, volume 37, pages 29–43. ACM, 2003.

[23] Sage A Weil, Scott A Brandt, Ethan L Miller, Darrell DE Long, and Carlos Maltzahn. Ceph: A scalable, high-performance distributed file system. In Proceedings of the 7th symposium on Operating systems design and implementation, pages 307–320. USENIX Association, 2006.

[24] Jon Postel and Joyce Reynolds. File transfer protocol. 1985.

[25] David Brownbridge, Lindsay F Marshall, and Brian Randell. The newcastle connection or unixes of the world unite! Software: Practice and Experience, 12(12):1147–1162, 1982.

[26] Alex Davies and Alessandro Orsaria. Scale out with glusterfs. Linux Journal, 2013(235):1, 2013.

[27] Liba Svobodova. File servers for network-based distributed systems. ACM Computing Surveys (CSUR), 16(4):353–398, 1984.

[28] KV Shvachko and AC Murthy. Scaling hadoop to 4000 nodes at yahoo! Yahoo! Developer Network Blog, 2008.

[29] Dhruba Borthakur. The hadoop distributed file system: Architecture and design. Hadoop Project Website, 11(2007):21, 2007.

[30] L Peter Deutsch. Gzip file format specification version 4.3. 1996.

[31] L Peter Deutsch. Deflate compressed data format specification version 1.3. 1996.

[32] Julian Seward. bzip2 and libbzip2. http://www.bzip.org, 1996. Accessed: 2018-06-26.

[33] Snappy: A fast compressor/decompressor. http://google.github.io/snappy/. Accessed: 2018-06-26.

[34] M. F. Oberhumer. Lzo. http://www.oberhumer.com/opensource/lzo/. Accessed: 2018-06-26.

[35] Cloudera Engineering Blog. Hadoop i/o: Sequence, map, set, array, bloommap files. http://blog.cloudera.com/blog/2011/01/hadoop-io-sequence-map-set-array-bloommap-files. Accessed: 2018-06-26.

[36] Hadoop Wiki. Sequencefile. https://wiki.apache.org/hadoop/SequenceFile. Accessed: 2018-06-26.

[37] Apache avro. https://avro.apache.org/. Accessed: 2018-06-26.

[38] Apache avro 1.8.2 specification. http://avro.apache.org/docs/current/spec.html. Accessed: 2018-06-26.

[39] Trevni specification. http://avro.apache.org/docs///1.8.1/trevni/spec.html. Accessed: 2018-06-26.

[40] Yongqiang He, Rubao Lee, Yin Huai, Zheng Shao, Namit Jain, Xiaodong Zhang, and Zhiwei Xu. Rcfile: A fast and space-efficient data placement structure in mapreduce-based warehouse systems. In Data Engineering (ICDE), 2011 IEEE 27th International Conference on, pages 1199–1208. IEEE, 2011.

[41] Dick Pountain. Run-length encoding. Byte, 12(6):317–319, 1987.

[42] Apache orc. https://orc.apache.org/. Accessed: 2018-06-26.

[43] Peter Deutsch and Jean-Loup Gailly. Zlib compressed data format specification version 3.3. Technical report, 1996.

[44] Apache parquet. https://parquet.apache.org/. Accessed: 2018-06-26.

[45] Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, and Theo Vassilakis. Dremel: interactive analysis of web-scale datasets. Proceedings of the VLDB Endowment, 3(1-2):330–339, 2010.

[46] Blog. Dremel made simple with parquet. https://blog.twitter.com/engineering/en_us/a/2013/dremel-made-simple-with-parquet.html. Accessed: 2018-06-26.

[47] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Antony, Hao Liu, and Raghotham Murthy. Hive-a petabyte scale data warehouse using hadoop. In Data Engineering (ICDE), 2010 IEEE 26th International Conference on, pages 996–1005. IEEE, 2010.

[48] Apache hive. https://hive.apache.org/. Accessed: 2018-06-26.

[49] Ads-b technologies. http://www.ads-b.com/. Accessed: 2018-06-26.

[50] Matthias Schäfer, Martin Strohmeier, Vincent Lenders, Ivan Martinovic, and Matthias Wilhelm. Bringing up opensky: A large-scale ads-b sensor network for research. In Proceedings of the 13th international symposium on Information processing in sensor networks, pages 83–94. IEEE Press, 2014.

[51] Gh archive. https://www.gharchive.org/. Accessed: 2018-06-26.

Part I

APPENDICES


Appendix A

Opensky Schema

opensky {
  string id;
  string time;
  string icao24;
  string callsign;
  string origin country;
  double time position;
  double time velocity;
  double longitude;
  double latitude;
  double altitude;
  boolean on ground;
  double velocity;
  double heading;
  double vertical rate;
  string sensors;
  double baro altitude;
  double squawk;
  boolean spi;
  string position source;
  string date;
}
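The schema above maps almost directly onto a Hive table definition. The sketch below is only a hypothetical example: the underscored column names and the choice of Parquet are assumptions, not the exact DDL used in the experiments.

    -- Hypothetical Hive DDL for the OpenSky schema; column names are assumed to use
    -- underscores (they appear with spaces in the extracted listing above).
    CREATE TABLE opensky (
      id              STRING,
      time            STRING,
      icao24          STRING,
      callsign        STRING,
      origin_country  STRING,
      time_position   DOUBLE,
      time_velocity   DOUBLE,
      longitude       DOUBLE,
      latitude        DOUBLE,
      altitude        DOUBLE,
      on_ground       BOOLEAN,
      velocity        DOUBLE,
      heading         DOUBLE,
      vertical_rate   DOUBLE,
      sensors         STRING,
      baro_altitude   DOUBLE,
      squawk          DOUBLE,
      spi             BOOLEAN,
      position_source STRING,
      `date`          STRING   -- "date" is a reserved word in some Hive versions
    )
    STORED AS PARQUET;         -- any of the studied formats could be used here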

Appendix B

GitHub log Schema

message { struct actor { string avatar url; string display login; string gravatar id; string id; string login; string url; } string created at; string id; struct org { string avatar url; string gravatar id; string id; string login; string url; } struct payload { string action; array commits { struct author { string name; } string message; } struct forkee

89 { string archive url; string assignees url; string blobs url; string branches url; string clone url; string collaborators url; string comments url; string commits url; string compare url; string contents url; string contributors url; string created at; string default branch; string deployments url; string description; string downloads url; string events url; boolean fork; int32 forks; int32 forks count; string forks url; string full name; string git commits url; string git refs url; string git tags url; string git url; boolean has downloads; boolean has issues; boolean has pages; boolean has projects; boolean has wiki; string homepage; string hooks url; string html url; string id; string issue comment url; string issue events url; string issues url; string keys url; string labels url; string language; string languages url; string merges url; string milestones url;

90 string mirror url; string name; string notifications url; int32 open issues; int32 open issues count; struct owner { string avatar url; string events url; string followers url; string following url; string gists url; string gravatar id; string html url; string id; string login; string organizations url; string received events url; string repos url; boolean site admin; string starred url; string subscriptions url; string type; string url; } boolean private; boolean public; string pulls url; string pushed at; string releases url; int32 size; string ssh url; string stargazers url; string statuses url; string subscribers url; string subscription url; string svn url; string tags url; string trees url; string updated at; string url; int32 watchers; int32 watchers count; } string head;

91 struct issue { struct assignee { string avatar url; string events url; string gists url; string html url; string id; string login; string organizations url; string repos url; boolean site admin; string starred url; string subscriptions url; string type; string url; } array assignees { string avatar url; string events url; string followers url; string following url; string gists url; string gravatar id; string html url; string id; string login; string organizations url; string received events url; string repos url; boolean site admin; string starred url; string subscriptions url; string type; string url; } string author association; string body; string closed at; int32 comments; string comments url; string created at; string events url;

92 string html url; string id; array labels { string color; boolean default; string id; string name; string url; } string labels url; boolean locked; struct milestone { string closed at; int32 closed issues; string created at; struct creator { string avatar url; string events url; string followers url; string following url; string gists url; string gravatar id; string html url; string id; string login; string organizations url; string received events url; string repos url; boolean site admin; string starred url; string subscriptions url; string type; string url; } string description; string due on; string html url; string id; string labels url; int32 number; int32 open issues; string state;

93 string title; string updated at; string url; } int32 number; struct pull request { string diff url; string html url; string patch url; string url; } string repository url; string state; string title; string updated at; string url; struct user { string avatar url; string events url; string followers url; string following url; string gists url; string gravatar id; string html url; string id; string login; string repos url; boolean site admin; string starred url; string subscriptions url; string type; string url; } } string master branch; int32 number; array pages { string action; string title; } struct pull request {

94 struct base { string ref; struct repo { string id; struct owner { string id; string login; } string url; } struct user { string id; string login; } } string id; struct user { string id; string login; } } string push id; struct release { array assets { string id; string label; string name; int32 size; struct uploader { string id; string login; string url; } string url; } struct author {

95 string id; string login; } string id; } int32 size; } boolean public; struct repo { string id; string name; string url; } string type; }
