
Universidade do Minho
Escola de Engenharia
Departamento de Informática

Master’s Thesis
Master in Informatics Engineering

Science Data Vaults in MonetDB: A Case Study

João Nuno Araújo Sá

Supervisors:
Prof. Dr. José Orlando Pereira, Departamento de Informática, Universidade do Minho
Prof. Dr. Martin Kersten, Centrum Wiskunde & Informatica, Amsterdam

July 2011

Declaration

Name: João Nuno Araújo Sá

Email: [email protected]

Telephone: +351964508853

ID Card: 13171868

Thesis Title: Science Data Vaults in MonetDB: A case study

Supervisors: Prof. Dr. José Orlando Pereira, Prof. Dr. Martin Kersten, Dra. Milena Ivanova

Year of Completion: 2011

Designation of Master: Master in Informatics Engineering

FULL REPRODUCTION OF THIS THESIS IS AUTHORIZED FOR RESEARCH PURPOSES ONLY, BY MEANS OF A WRITTEN DECLARATION OF THE INTERESTED PARTY, WHO COMMITS TO THAT CONDITION.

University of Minho, 13th July 2011

João Nuno Araújo Sá

Experience is what you get when you didn’t get what you wanted.

Randy Pausch (The Last Lecture)

Acknowledgments

To Dr. José Orlando Pereira, for agreeing to be my supervisor and for giving me the possibility to do my master thesis in Amsterdam. Despite the distance, all the emails and recommendations were very helpful and I am thankful.

To Dr. Martin Kersten, for receiving me in such a renowned place as CWI and for providing me the opportunity to be responsible for this project.

To Bart Scheers, for all the patience talking about astronomical concepts and for suggesting ideas to build a robust and solid use case.

A special thanks to Milena Ivanova for all the support, advice, inspiration and friendship. Her help was essential during all our meetings, and for all the thorough corrections of the text I am extremely grateful.

To all the people at CWI, in particular the INS-1 group, for making me feel like one of them through their kindness and professionalism.

To my father, José Manuel Araújo Fernandes Sá, and my mother, Fernanda da Conceição Pereira de Araújo Sá, for all the support, understanding and longing during this year I spent abroad. To all my family, my old friends from Viana do Castelo, and the friends I made in Braga.

To all my Amsterdam friends, for making this year memorable. I will never forget the moments we had together.

Resumo

Nowadays, the amount of data generated by scientific instruments (captured data) and by computer simulations (generated data) is very large. The data volumes are becoming ever larger, both through improvements in the precision of new instruments and through the growing number of stations collecting the data. This requires new scientific methods to analyse and organize the data.

However, it is not easy to deal with these data and with all the steps they have to go through (capture, organization, analysis, visualization and publication). Much data is collected (capture), but not curated (organization, analysis) or published.

In this thesis we focus on astronomical data, which are generally stored in FITS (Flexible Image Transport System) files. We investigate the access to these data, and the querying of the information they contain, using database technology. The target database is MonetDB, an open-source column-store database that has already proved successful in analytical workloads and scientific applications (SkyServer).

Given the results obtained during the experiments, the clear superiority shown by MonetDB over the STILTS tool when more computation is required, and finally the success in executing the set of tests proposed by the astronomer working at CWI, we can state that MonetDB is a strong and robust alternative for manipulating and accessing information contained in FITS files.

Abstract

Nowadays, the amount of data generated by scientific instruments (captured data) and computer simulations (generated data) is very large. The data volumes are getting bigger, due to the improved precision of new instruments and to the increasing number of collecting stations. This requires new scientific methods to analyse and organize the data.

However, it is not so easy to deal with this data and with all the steps that the data have to go through (capture, organize, analyze, visualize, and publish). A lot of data is collected (captured), but not curated (organized, analyzed) or published.

In this thesis we focus on astronomical data, which are typically stored in FITS (Flexible Image Transport System) files. We investigate the access to and querying of this data by means of database technology. The target database system is MonetDB, an open-source column-store database with a record of successful application to analytical workloads and scientific applications (SkyServer).

Given the results of the experiments, the clear superiority shown by MonetDB over STILTS when more computation is required, and the success obtained during the execution of the use case proposed by an astronomer working at CWI, we can state that MonetDB is a powerful and robust alternative for manipulating and accessing information contained in FITS files.

Contents

1 Introduction
  1.1 The need to integrate with repositories
  1.2 Assumptions
  1.3 Contributions
  1.4 Approach
  1.5 Project Objectives
  1.6 Outline of report

2 Background
  2.1 Introduction to MonetDB
  2.2 Introduction to FITS
    2.2.1 Applications of the FITS
    2.2.2 The structure of a FITS file

3 Contribution to MonetDB
  3.1 Overview of the vaults
  3.2 Architecture of the vault
  3.3 Attach a file
  3.4 Attach all FITS files in the directory
  3.5 Attach all FITS files in the directory, using a pattern
  3.6 Table loading
    3.6.1 Search for the ideal batch size
    3.6.2 BAT size representation of Strings in MonetDB
  3.7 Export a table

4 Case Study
  4.1 Overview
  4.2 Attach a file
  4.3 Attach all FITS files in the directory
  4.4 Attach all FITS files in the directory, giving a pattern
  4.5 Load a table
  4.6 Export a table
  4.7 Cross-matching astronomical surveys
    4.7.1 Query 1: Distribution of distances between sources in both surveys
    4.7.2 Distribution of the distances smaller than 45 arc seconds
    4.7.3 Normal Distribution of all the data
    4.7.4 Frequency of the distances smaller than 5 arc seconds
    4.7.5 Frequency of the value between sources in both surveys
    4.7.6 Query 2: extract & compare brightness in different frequencies
    4.7.7 Query 3: extract the spectral index
    4.7.8 Distribution of the spectral index
    4.7.9 Normal distribution of the spectral indexes

5 Performance Experiments
  5.1 Experimental Setting
  5.2 Test Files
  5.3 Delegation experiments
    5.3.1 Selection and Filter delegation for Group number 1
    5.3.2 Range Delegation for Group number 1
    5.3.3 Statistics Delegation for Group number 1
    5.3.4 Selection and Filter Delegation for Group number 2
    5.3.5 Range delegation for Group number 2
    5.3.6 MonetDB problem
    5.3.7 Projection delegation for Group number 2
    5.3.8 Statistical Delegation for Group number 2
    5.3.9 Summary of the tests for the first and second groups
    5.3.10 Equi-join delegation for Group number 3
    5.3.11 Band-join delegation for Group number 3

6 Related Work
  6.1 CFITSIO
    6.1.1 Fv
  6.2 STIL
    6.2.1 TOPCAT
    6.2.2 STILTS
  6.3 Comparison between tools
  6.4 Astronomical data formats
    6.4.1 HDF5 Array Database
    6.4.2 VOTable
    6.4.3 Comparison between file formats

7 Conclusion
  7.1 Results and Overview
  7.2 Future Work

Bibliography

List of Figures

3.1 Three layers of a vault
3.2 Load of all the numerical types
3.3 BAT and File sizes for each one of the numerical types
3.4 Batch 1 for the strings with 4 bytes
3.5 Batch 1 for the strings with 8 bytes
3.6 Batch 1 for the strings with 20 bytes
3.7 Batch 10 for the strings with 4 bytes
3.8 Batch 10 for the strings with 8 bytes
3.9 Batch 10 for the strings with 20 bytes
3.10 Batch 20 for the strings with 4 bytes
3.11 Batch 20 for the strings with 8 bytes
3.12 Batch 20 for the strings with 20 bytes
3.13 Batch 50 for the strings with 4 bytes
3.14 Batch 50 for the strings with 8 bytes
3.15 Batch 50 for the strings with 20 bytes
3.16 Batch 100 for the strings with 4 bytes
3.17 Batch 100 for the strings with 8 bytes
3.18 Batch 100 for the strings with 20 bytes
3.19 Batch 1000 for the strings with 4 bytes
3.20 Batch 1000 for the strings with 8 bytes
3.21 Batch 1000 for the strings with 20 bytes
3.22 Loading strings with 4 bytes
3.23 Loading strings with 8 bytes
3.24 Loading strings with 20 bytes
3.25 Representation of the space occupied by the strings with the size of 4 bytes
3.26 Representation of the space occupied by the strings with the size of 8 bytes
3.27 Representation of the space occupied by the strings with the size of 20 bytes
3.28 Export the numerical types into a FITS file
3.29 Export the strings into a FITS file
4.1 Distribution of distances between sources
4.2 Normal Distribution
4.3 Frequency of distances between sources that are less than 5 arc seconds apart from each other
4.4 of the brightness in different frequencies
4.5 Calculation of the spectral index
4.6 Plot of the Normal distribution of the spectral index
5.1 Performance of MonetDB and STILTS in Point Query operations with numerical types
5.2 Performance of MonetDB and STILTS in Point Query operations for short type
5.3 Performance of MonetDB and STILTS in Point Query operations for integer type
5.4 Performance of MonetDB and STILTS in Point Query operations for long type
5.5 Performance of MonetDB and STILTS in Point Query operations for float type
5.6 Performance of MonetDB and STILTS in Point Query operations for double type
5.7 Performance of MonetDB and STILTS in Point Query operations with different string sizes
5.8 Performance of MonetDB and STILTS in Point Query operations with single 4-byte string column
5.9 Performance of MonetDB and STILTS in Point Query operations with single 8-byte string column
5.10 Performance of MonetDB and STILTS in Point Query operations with single 20-byte string column
5.11 Performance of MonetDB and STILTS in Range operations with numerical types
5.12 Performance of MonetDB and STILTS in Range operations for short type
5.13 Performance of MonetDB and STILTS in Range operations for integer type
5.14 Performance of MonetDB and STILTS in Range operations for long type
5.15 Performance of MonetDB and STILTS in Range operations for float type
5.16 Performance of MonetDB and STILTS in Range operations for double type
5.17 Performance of MonetDB and STILTS in statistical operations for short type
5.18 Performance of MonetDB and STILTS in statistical operations for short type
5.19 Performance of MonetDB and STILTS in statistical operations for integer type
5.20 Performance of MonetDB and STILTS in statistical operations for long type
5.21 Performance of MonetDB and STILTS in statistical operations for float type
5.22 Performance of MonetDB and STILTS in statistical operations for double type
5.23 Performance of Monet and STILTS in Point Query operations
5.24 Performance of Monet and STILTS in Point Query operations
5.25 Performance of Monet and STILTS in Point Query operations
5.26 Performance of Monet and STILTS in Range operations
5.27 Performance of Monet and STILTS in Range operations
5.28 Problem on Monet with the table of 1G
5.29 Performance of MonetDB and STILTS in Point Query Operations
5.30 Performance of MonetDB and STILTS in Range Operations
5.31 Performance of MonetDB and STILTS in Range Operations
5.32 Performance of Monet and STILTS in Projection operations
5.33 Performance of Monet and STILTS in Statistics operations
5.34 Performance of Monet and STILTS in Statistics operations
5.35 Performance for the different fan-out factors
5.36 Percentage of memory consumed
5.37 Performance for the different fan-out factors
5.38 Percentage of memory consumed
5.39 Performance for the different fan-out factors
5.40 Percentage of memory consumed

List of Tables

3.1 First group of FITS files
5.1 Second group of FITS files
5.2 Third group of FITS files
6.1 List of operations performed by the tools
6.2 Tasks performed for each one of the file formats

Chapter 1

Introduction

In the past, scientific data was collected and stored predominantly in files. Using the file system is easy and practical, but it has a number of disadvantages. First, files have no metadata, they do not benefit from the evolution of data analysis tools, they do not have a high-level query language, and the query methods will not do parallel, associative, temporal, or spatial search. One strategy used by scientists to overcome the lack of metadata in files is to provide extra information in the file name, allowing the data of interest to be filtered efficiently. For example, the file name "January2010Hubble" indicates that the data was captured in January 2010 by the Hubble space telescope. The disadvantages of this approach are that only a limited number of parameters can be encoded in the file name, that specific software needs to be written to understand the names, and that file names can become very long. Nevertheless, scientists prefer to use files instead of databases [17]. When asked why they do not use databases, they give a wide range of answers:

• They see no benefit in using them

• The cost of learning the tools is not worth it

• Databases do not offer good visualization/plotting of the results, or the scientists prefer to use their own programming languages

• There are incompatible scientific data types, such as N-dimensional arrays and spatial data, which are very difficult to support

• Databases are too slow (loading takes too long, and sometimes it is not the data they need)


• Once the scientific data is loaded, it can no longer be manipulated using standard applications.

In other words, database systems were not initially built to support science’s core data types. Therefore, a considerable evolution is needed before scientists will take a second look. However, things are different now. The datasets are becoming bigger and bigger (peta-scale), file-FTP will not work with such a huge amount of data, and scientists need databases for their data analysis, for non-procedural query analysis, automatic parallelism, and search tools. Another way to access the data is needed. In the book "The 4th Paradigm" [17], Jim Gray describes those problems, emphasizing that better tools need to be produced that support the whole research cycle, from data capturing and data curation to data analysis and visualization. He also affirmed that science evolved through the following four paradigms:

• The first paradigm dates back thousands of years, when science was purely experimental and empirical.

• The second emerged some hundreds of years ago, when science turned theoretical, with its equations, laws and models.

• In the last few decades, the theoretical models became so complex and complicated to solve analytically that simulation was needed. With this, computational science was born, resulting in a huge amount of data generated by the simulations.

• Data exploration, where the data is either captured by instruments or generated by simulations before being processed by software and finally stored in computers. The scientists only look at the data at the end of this whole process.

1.1 The need to integrate with repositories

It is essential to have good metadata, describing data in standard terms, so that people and programs can understand it. In the scientific community, data must be correctly documented and must be published in a way that allows easy access and automated manipulation. The main reasons why we need to integrate with repositories are as follows. Firstly, they already have their own metadata. Secondly, they have their own way to structure the data. Thirdly, they were built with the purpose of supplying answers to a specific community of scientists who already have their own unique demands. Database integration of the repository will extend functionality with minimal investment.

1.2 Assumptions

The first assumption of this work is that it is limited to a specific scientific community, astronomy, and to a particular format used in astronomy: FITS. FITS is the standard astronomical data format endorsed by both NASA (National Aeronautics and Space Administration) and the IAU (International Astronomical Union). It also has the data structured in a standard scheme, and it is extremely easy to get access to FITS files. They are available on thousands of websites; [6] and [5] are some examples. There are also some well-known surveys that produce data using FITS files as output: FIRST [26], SDSS [4], and UKIDSS [7], for example. The last assumption is that we limit ourselves to MonetDB, a powerful column-store database that is being developed at CWI, Amsterdam. It has the advantage of being open source and built for analytical applications. Being open source brings advantages. One of them, as referenced in [15], is the availability of the source code and the right to modify it, enabling the unlimited tuning and improvement of a software product. Applied to our case, it allowed the creation of a new module and the development of new functionality inside the MonetDB code.

1.3 Contributions

Our contribution to this project is a seamless integration of FITS vaults with MonetDB, through the development of a module inside the MonetDB code that provides a set of functionalities and a powerful alternative for accessing and manipulating FITS files. We support only FITS tables. Once the FITS vault is understood, through the analysis of the metadata and the data model, it can be expressed in a relational database system. The described alternative is based on a set of experiments which are handled using MonetDB. It uses a fully new integration of FITS files with the database world that is ready to support an intricate set of astronomical questions that only databases can answer. Databases are able to process this data because they are developed and maintained for this purpose. The experiments will be undertaken as a set of stress tests using MonetDB.

1.4 Approach

We do a bottom-up evaluation of the primitives needed to access and manipulate FITS files. The context is limited to FITS files, but we must keep other formats in mind: how could we work with other formats that are able to store astronomical data, how can we access their metadata, and which properties do we have to understand in order to derive the data model in the relational database system? Another important factor that must be considered is how those formats differ from FITS. All these questions will be answered in the Related Work chapter, where we compare FITS with other astronomical file formats.

1.5 Project Objectives

The main objective of this thesis is to study what extensions of a modern database architecture are needed to provide the system with the ability to understand scientific data in external formats. Such an ability will provide the user-scientists with:

• A view over a repository of data files in the FITS standard format. Such a view presents the metadata of the files and allows users to locate data of interest by posing queries against the metadata;

• Automatic attachment of files and data of interest. Using this feature the user can avoid manual loading of high volumes of data, which is time- and labour-consuming. The system automatically ships the data to the database. Furthermore, only pre-selected data of interest will be touched;

• Declarative SQL queries for analysis;

• Data Integration. The system allows for complex analysis that includes combining, comparing, correlating data from different FITS files and repositories (through SQL join queries, missing in FITS tools).

1.6 Outline of report

In Chapter 2 we provide some background knowledge required to understand the thesis work. We describe some important features of MonetDB and FITS files. In Chapter 3 we present our contribution to MonetDB; more precisely, we describe the FITS vault module that was developed and some important decisions made during the design and development process. In Chapter 4 we illustrate the functionality of the module by conducting an astronomy use case. In Chapter 5 we conduct experiments, comparing the performance of MonetDB with existing tools that operate on FITS files. In Chapter 6 we enumerate some libraries that provide a set of functionalities to manipulate and access FITS files; we list some programs that use those libraries and investigate what they can and cannot do with respect to our use case. Finally, in Chapter 7, we discuss the results and give an overview of the entire project. We end with some suggestions for future work.

Chapter 2

Background

In this chapter we introduce the main concepts that underlie our project. We first introduce the MonetDB system, focusing on the enhancements that it brings to the database world and on its benefit for the astronomy community, as a database engine designed and prepared to answer routine astronomical queries. Thereafter, we present the FITS files, exploring their history, their characteristics and how they are structured.

2.1 Introduction to MonetDB

MonetDB [2] is an open-source column-oriented database management system developed since 1993 at CWI, Amsterdam. The column-store idea was born as a dependable solution to the main bottleneck faced by the majority of database systems: main-memory access. Vertical fragmentation is identified as the solution for database data structures, which leads to optimal memory cache usage [11]. It is implemented by storing each column of a relational table in a separate binary table, called a Binary Association Table (BAT). A BAT is represented in memory as an array of fixed-size two-field records [OID,value], or Binary UNits (BUNs). Their width is typically 8 bytes. MonetDB executes a relational algebra called the BAT Algebra, which is programmed in the MonetDB Assembly Language (MAL). With a binary table for each column of the relational table, the query execution model is also different from mainstream systems, executing one operator at a time over entire columns. All these changes in the database architecture led to the creation of a brand new software stack, innovating all layers of the Database Management System.
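As a purely conceptual illustration (these are not MonetDB’s actual source definitions), a BAT over an integer column can be pictured as an array of such two-field records:

/* Conceptual sketch only: a BUN pairs a row identifier (OID) with a value. */
typedef unsigned long oid_t;

typedef struct {
    oid_t head;   /* object identifier of the row  */
    int   tail;   /* the column value for that row */
} bun_int;        /* a BAT is, conceptually, an array of these records */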

The three layers that compose the MonetDB software stack are the query-language parser, the set of optimizer modules, and the back-end (the MAL interpreter). The query-language parser has an optimizer that reduces the amount of data produced by intermediates and exploits the catalogue on join-indices; its output is a logical plan expressed in MAL. The optimizer modules take cost-based decisions, and this layer is crucial for the efficiency of the database. The MAL interpreter maintains properties over the objects accessed to gear the selection of subsequent algorithms; for example, a Select operator can benefit from sortedness. This provides a simplification of the database kernel, the ability to fully materialize intermediate results, efficient query processing, and high performance when dealing with complex queries on sizeable amounts of data.

This database system has already proved to be an asset for real-life astronomy applications, such as the SkyServer project [19], whose goal is to provide public access to the Sloan Digital Sky Survey warehouse for astronomers and the wider public. MonetDB, and the column-store approach, have shown themselves to be promising for the scientific domain. We will use MonetDB to store the data imported from external file formats and to serve the data when requests are made.

The MonetDB code can be extended with new functionality. Functions need to be compiled and brought into the MAL level so that they can be used at the SQL level. This extensibility is achieved through a create procedure statement [21] at the SQL level, which enables the user to call the functions and access the functionality provided by the module.

2.2 Introduction to FITS

It was in Holland and in the United States of America that the first high-quality images of the radio sky were produced. The decade was the 1970s and the pioneers were the Westerbork Synthesis Radio Telescope (WSRT) in Westerbork, Holland, and the Very Large Array (VLA) in New Mexico, United States of America. Their wish was to combine the data derived from both instruments. The main problem was that the two groups were observing at different frequencies. As a consequence, it was complex to exchange information, either because the institutions had their own way to organize the data (internal storage format), or because there were considerable differences in the architecture of their machines, using, for instance, a distinct internal representation for the same number. Lacking a standard format for the transport of images, every time an astronomer needed to take data from an observatory to a home institution, some special software had to be created in order to convert (restructure and perform bit manipulations on) the data from the original machine to the format that was being used by the home institution.

With all these setbacks, a single interchange format for transporting digital images between institutions was needed, in order to avoid this whole heavy process. The idea was that each institution would need only two software packages: one that would translate the transfer format into the internal format used by the institution, and one that would transform the internal format into the transfer format. The Flexible Image Transport System (FITS) [13] was created with the intention of providing such a transfer format.

2.2.1 Applications of the FITS

The inaugural applications of FITS were the exchange of radio astronomy images between Westerbork and the VLA and the exchange of optical image data among Kitt Peak, the VLA and Westerbork. Afterwards, the use of FITS expanded and it is now being explored as a data structure in a diversity of NASA-supported projects:

• X-ray data from the Einstein High Energy Astrophysics Observatory (HEAO-2)

• Compton Gamma Ray Observatory

• Roentgen Satellite (ROSAT)

• Ultraviolet and visible data from the International Ultraviolet Explorer (IUE) and the Hubble Space Telescope

• Infrared data from the Infrared Astronomical Satellite (IRAS) and the Cosmic Background Explorer (COBE)

It is also used as a standard for ground-based radio and optical observations, by organizations such as the National Radio Astronomy Observatory (NRAO), the National Optical Astronomy Observatories (NOAO) and the European Southern Observatory (ESO).

2.2.2 The structure of a FITS file

A FITS file is composed of a sequence of Header Data Units (HDUs), which can be followed by a set of special records. Each HDU is composed of one header and the data that follows. The header is a sequence of 36 80-byte ASCII card images containing keyword = value statements. There are three special classes of keywords: required keywords, reserved keywords and the ones that are defined by the user. The data that follows (also called data records) is binary data, structured as the header specifies. The size of each logical record is 23040 bits, equivalent to 2880 bytes. Each HDU consists of one or more logical records. The last record of the header is filled with ASCII blanks so that it fills the full 23040-bit length. The first HDU of a FITS file is called the Primary HDU, and the HDUs that follow it are called extensions. When a FITS file contains one or more extensions, it is most likely that the Primary HDU does not contain any data. When the FITS file does not contain extensions it is called a Basic FITS, that is, a file containing only the primary header followed by a single primary data array. The Primary HDU is composed of one header (the Primary Header) and the data that follows. It is not usual (except for FITS images) for a Primary HDU to contain any data, but if it does, the data has to be a matrix of values in binary format, called the Primary Array. The extensions have the same overall organization as all HDUs (one header and the data that follows) and they come after the Primary HDU, respecting the structure of the FITS file. The extensions brought some new functionalities to the FITS files:

• Transfer new types of data structures: Images, ASCII Tables and Binary Tables

• Transfer collections of related data structures

• The data to be transported do not always fit conveniently into an array format

• Transport of auxiliary Information

The tables are used to store the astronomical data that is collected, and they contain rows and columns of data. In FITS files there are two types of tables: ASCII tables and binary tables. As the name says, the ASCII tables store the data values in an ASCII representation. The data appear as a character array, in which the rows represent the lines of a table and the columns represent the characters that make up the tabulated items. Each member of the array is one character or digit. Each character string or ASCII representation of a number is in FORTRAN-77 format. The binary tables, in turn, store the data in a binary representation. Binary tables are more efficient and compact (about half the size for the same information content), support more features, and the time spent converting to ASCII is eliminated. The display, however, is not as direct as for ASCII tables. The data types that can be stored in FITS tables are:

• L: Logical value: 1 byte

• B: Unsigned byte: 1 byte

• I: 16-bit integer: 2 bytes

• J: 32-bit integer: 4 bytes

• K: 64-bit integer: 8 bytes

• A: Character: 1 byte

• E: Single precision floating point: 4 bytes

• D: Double precision floating point: 8 bytes

• C: Single precision complex: 8 bytes

• M: Double precision complex: 16 bytes

• P: Array Descriptor (32-bit): 8 bytes

• Q: Array Descriptor (64-bit): 16 bytes
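To make this structure concrete, the following minimal C sketch walks through the HDUs of a FITS file and reports, for each table extension, its number of rows and columns. It uses the CFITSIO library [16] on which the module of Chapter 3 is built; the file name is only a placeholder and error handling is reduced to the essentials.

#include <stdio.h>
#include <fitsio.h>   /* CFITSIO */

int main(void)
{
    fitsfile *fptr;
    int status = 0, nhdus, hdutype, ncols, i;
    long nrows;

    /* "example.fits" is just a placeholder file name */
    if (fits_open_file(&fptr, "example.fits", READONLY, &status)) {
        fits_report_error(stderr, status);
        return 1;
    }
    fits_get_num_hdus(fptr, &nhdus, &status);
    for (i = 1; i <= nhdus; i++) {            /* HDU numbering starts at 1 */
        fits_movabs_hdu(fptr, i, &hdutype, &status);
        if (hdutype == ASCII_TBL || hdutype == BINARY_TBL) {
            fits_get_num_rows(fptr, &nrows, &status);
            fits_get_num_cols(fptr, &ncols, &status);
            printf("HDU %d: table with %ld rows and %d columns\n", i, nrows, ncols);
        } else {
            printf("HDU %d: primary HDU or image\n", i);
        }
    }
    fits_close_file(fptr, &status);
    return status;
}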

Chapter 3

Contribution to MonetDB

In this chapter we present our contribution to MonetDB, through the development of a vault module that provides a set of functionalities concerning FITS files. We will first give a short overview of the vault concept. Further, we will describe the architecture of a vault. Finally, we will list the set of procedures and functions that were developed to make the integration between FITS files and MonetDB possible.

3.1 Overview of the vaults

A vault can be defined as a safety deposit box or as a repository for valuable information. By analogy, in computer science terms a data vault can be seen, for instance, as a folder that contains only images. Inside the same data vault, the objects have one important factor in common: the metadata. It is the metadata that they have in common that allows a distinction between different kinds of vaults, and even the ability to create a completely new data vault based on similar parameters of the objects. What distinguishes the objects in the same vault is the data that they carry. For example, if the object is an image, we know that it will have pixels, a height and a width; however, the values assigned to each of these attributes differ for each image. Knowing that, we can create a vault based on a directory of files. We just need to understand their metadata (using appropriate tools that allow us to access it), what metadata they have in common, and what their data model is. Comprehending the data model, we can decide how it will be represented in the relational database system. We will apply the term vault to our case study, creating a vault over a directory of FITS files that share the same metadata but where each file contains its own information. The system needs to understand the external formats that contain the scientific data


(FITS). Once this is understood, there will be a distinction between loading data and attaching data. The idea of loading high volumes of data will be abandoned, as it is time-consuming and, for the most part, not what the scientist requires. This concept allows for the attachment of data (automatic attachment of files to the database), providing the metadata to the scientists and giving them the opportunity to decide what is relevant. It will be a selective load, and it will take less time.

3.2 Architecture of the vault

Figure 3.1: Three layers of a vault

In Figure 3.1 we can identify three distinct layers: the metadata wrapper, the data wrapper and the functionality wrapper. The metadata wrapper reads and understands the metadata of the vault and decides how the data model can be efficiently represented in a relational database system. Looking at our FITS vault example, there are three functionalities that support this demand: Attach a file, Attach all the files in the directory, and Attach all the files in the directory, given a pattern. The data wrapper loads a fragment of data of interest that was explicitly required into the database. Once again, analyzing our FITS vault case, there are two procedures that support this requirement: Load a table and Export a table. This avoids loading huge amounts of data at once, restricting the load to small portions that might be of interest to the user. Once inside the database, the data can be queried, manipulated or even exported as a completely new FITS file. The functionality wrapper is still future work and includes load on demand and synchronization when the repository is updated. For the load-on-demand procedure, the idea is to enable the user to type a query to the system; the functionality wrapper will then decide which FITS files have to be attached and which tables have to be loaded in order to give the right answer to the user.

In the following sections we will describe each one of the functionalities listed before. All of them were implemented by building a module inside the MonetDB code that accesses FITS files and the FITS catalogue through a library called CFITSIO [16]. The CFITSIO library is written in C and provides a powerful interface for accessing, reading and writing FITS files. However, before using this library the user must have a general knowledge of the structure of a FITS file. All the requests to the CFITSIO library are made using the C language. If a request demands some data filtering on the tables, a SQL query needs to be translated into MAL statements within MonetDB for execution.

3.3 Attach a file

We start with the attachment procedure, which opens a FITS file given its absolute path. After checking that the file has the FITS format, it scans the metadata present in its HDUs and inserts descriptions of the table extensions into an internal FITS catalog. The catalog is composed of the following tables:

fits_files

id: Primary key. Unique number that identifies the file

name: absolute path to the attached file

fits_tables

id: Primary key. Unique number that identifies the table

name: name of the table that coincides with the name of the HDU. If the name is already taken by another table, it creates a new name, concatenating the name of the file, underscore and the number of the HDU which corresponds to the respective table

columns: number of columns present in the table

file: id of the file, that can be identified in the table fits_files

hdu: number of the HDU that the table represents in the FITS file (the number 1 is always reserved for the primary HDU)

date: this information is not always present in the FITS file; when it is, it stores the date when the FITS file was created

origin: similar to the date column, this is data that is not always present; it stores information about the station responsible for the creation of the FITS file

comment: sometimes FITS files have a field reserved for additional information or comments that might be appropriate to supply

fits_columns

id: Primary key. Unique number that identifies the column

name: name that was given to the column within the table

type: the type of the column represented in FITS format. For instance, 8A represents a string with 8 characters

units: extra information about the units of the stored data: meters, kilometers, etc.

number: number of the column within the table

table_id: id of the table in which the column is present. It can be identified in the table fits_tables

fits_table_properties

gives extra information about a table, such as extension, bitpix, stilvers (the version of the product generating the file) and stilclas.
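As an illustration of where this catalog information comes from, the following hedged C sketch reads the kind of per-extension metadata that the attach step stores. It is not the module’s actual code: printing stands in for the catalog inserts, and the handling of missing keywords is reduced to the minimum.

#include <stdio.h>
#include <string.h>
#include <fitsio.h>

/* Read the metadata of one table extension: its name and, per column,
 * the TTYPEn / TFORMn / TUNITn keywords that feed the fits_columns table. */
static void describe_extension(fitsfile *fptr, int hdunum)
{
    int status = 0, hdutype, ncols, j;
    char extname[FLEN_VALUE], ttype[FLEN_VALUE], tform[FLEN_VALUE], tunit[FLEN_VALUE];
    char key[FLEN_KEYWORD];

    fits_movabs_hdu(fptr, hdunum, &hdutype, &status);
    if (hdutype != ASCII_TBL && hdutype != BINARY_TBL)
        return;                                   /* only table extensions are cataloged */

    if (fits_read_key(fptr, TSTRING, "EXTNAME", extname, NULL, &status)) {
        strcpy(extname, "unnamed");               /* the name is not always present */
        status = 0;
    }
    fits_get_num_cols(fptr, &ncols, &status);
    printf("extension %s (HDU %d), %d columns\n", extname, hdunum, ncols);

    for (j = 1; j <= ncols; j++) {
        snprintf(key, sizeof(key), "TTYPE%d", j);
        fits_read_key(fptr, TSTRING, key, ttype, NULL, &status);
        snprintf(key, sizeof(key), "TFORM%d", j);
        fits_read_key(fptr, TSTRING, key, tform, NULL, &status);
        snprintf(key, sizeof(key), "TUNIT%d", j);
        if (fits_read_key(fptr, TSTRING, key, tunit, NULL, &status)) {
            strcpy(tunit, "");                    /* units are optional */
            status = 0;
        }
        printf("  column %d: name=%s type=%s units=%s\n", j, ttype, tform, tunit);
    }
}

The actual attach procedure stores these values in the catalog tables described above instead of printing them.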

3.4 Attach all FITS files in the directory

This procedure enables the attachment of all FITS files that are present in a specific directory, which is explicitly given as a parameter. If the directory cannot be opened, a proper error message is returned to the user. This procedure is useful because it avoids attaching the FITS files within a directory one by one, which is time-consuming and exhausting if the directory contains thousands of FITS files.

3.5 Attach all FITS files in the directory, using a pattern

This procedure is an extension of the previous one. It adds the possibility of giving a pattern, in conjunction with the name of the directory, in order to limit the number of FITS files that are attached to the database. It works similarly to the UNIX ls program. The advantage of this procedure follows the same idea as the previous one.

However, it adds the option to give a pattern that will filter the FITS files to be attached. For example, suppose that there is a directory with two thousand files, of which we are interested in only one thousand. If it is possible to build a pattern that matches only those files of interest, we should call this procedure.
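A minimal sketch of the idea behind such a filter, using the POSIX opendir/readdir and fnmatch routines, is shown below. The module’s real implementation may differ; the directory and the shell-style pattern are placeholders, and printing stands in for the attachment of each matching file.

#include <stdio.h>
#include <dirent.h>
#include <fnmatch.h>

/* List the FITS files in a directory whose names match a shell-style
 * pattern, e.g. "survey_2010*.fits"; each match would then be attached. */
int attach_matching(const char *dir, const char *pattern)
{
    DIR *d = opendir(dir);
    struct dirent *entry;
    int nmatched = 0;

    if (d == NULL) {
        fprintf(stderr, "cannot open directory %s\n", dir);
        return -1;
    }
    while ((entry = readdir(d)) != NULL) {
        if (fnmatch(pattern, entry->d_name, 0) == 0) {
            printf("would attach: %s/%s\n", dir, entry->d_name);
            nmatched++;
        }
    }
    closedir(d);
    return nmatched;
}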

3.6 Table loading

Another feature designed was the load procedure, which loads the table with a given name. It searches for the name in the FITS catalog, opens the corresponding file, moves to the matching HDU, creates an SQL table and loads the data into it. If the table is not described in the catalog, or is already loaded, an appropriate error message is returned. It calls a function from the CFITSIO library called fits_read_col. This function reads a column of the table, and the data read is moved into an internal binary structure of MonetDB. The function is called once for each column present in the table.

This is an important and mandatory functionality, because it brings the data into the database system so it can be queried, processed and manipulated. The data type that takes the most time to be loaded into MonetDB is the string type, due to its unique structure in the database architecture; more details on that will be studied in the following sections. This feature is in fact a bottleneck when compared to existing tools that directly query, process and manipulate data present in files: to make it work in the database world, the data first needs to be loaded from the files before any answers can be given to the users. A fast mechanism to load the data into the database is therefore essential. With it, a transition from the old and slow world of software tools to the more powerful and faster world of database systems becomes possible.

In the following sections we will analyze the loading of different types of data into MonetDB. We will focus on one type at a time, building a set of FITS files with a single column that contains the target type. Each FITS file has only two HDUs: the Primary HDU, which is mandatory in each FITS file, and one extension, which contains a binary table with the type that we want to study. With this group of tests, we can get a clear and succinct idea of how each of the different types behaves. We will start with the column types that can be used to represent numbers.

• Short: any number between 0 and 32767 that occupies 2 bytes in MonetDB representation

• Integer: any number between 0 and 15000000 that occupies 4 bytes in MonetDB representation

• Long: any number between 0 and 2147483647 that occupies 8 bytes in MonetDB representation

• Float: any floating point number between 0 and 5 that occupies 4 bytes in MonetDB representation

• Double: any floating point number between 0 and 5 that occupies 8 bytes in MonetDB representation

Finally, there is the column type String. Three different sets of FITS files will be created for this type: files that contain strings with a size of 4 bytes, files that contain strings with a size of 8 bytes, and files that contain strings with a size of 20 bytes. The idea is to apply distinct benchmarks, with the same implementation, to the strings of different sizes. Knowing the different types that will be studied, it is now time to enumerate the number of rows that each set of FITS files will contain:

FITS File    Number of Rows
1            30000
2            2942000
3            29420000
4            58840000
5            117680000
6            205940000
7            235360000
8            470720000

Table 3.1: First group of FITS files

During the experiments, it was verified that the strings took the longest time to be loaded into MonetDB and that the Short type was the fastest, as we will see in the tests below. The measurements were done using GDKms() calls in the FITS load routine, measuring the time taken per column.
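The per-column loading described at the start of this section boils down to a call like the one in the following hedged C sketch. The column number, row count and data type are placeholders, and the module’s real code appends the values to a MonetDB BAT instead of a plain array.

#include <stdlib.h>
#include <fitsio.h>

/* Read an entire double-precision column of a binary-table HDU into a C array,
 * the way the load procedure does for each column of the requested table.     */
double *read_double_column(fitsfile *fptr, int colnum, long nrows, int *status)
{
    double *values = malloc(nrows * sizeof(double));
    int anynul = 0;

    if (values == NULL)
        return NULL;
    /* datatype TDOUBLE, start at row 1 / element 1, read nrows values */
    fits_read_col(fptr, TDOUBLE, colnum, 1, 1, nrows,
                  NULL /* no substitute value for nulls */, values, &anynul, status);
    return values;
}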

Figure 3.2: Load of all the numerical types

Figure 3.2 shows the time it took to load each of the types from each of the files. It can be noticed that the short type is the fastest and the double is the slowest. The short type is the fastest because it has a size of 2 bytes in the MonetDB representation. There is a similarity between the loading time behavior of the integers and that of the floats; this happens because both use 4 bytes in their MonetDB internal representation. Nevertheless, floating-point values are more complex to store, leading to a worse performance when loading the tables, compared to integers. The same scenario takes place for the doubles and the longs: both have 8 bytes in their MonetDB internal representation and both show a similar loading time behavior, but the double type is used to store floating-point values, so there is a little additional time due to their complexity. In order to understand some later behaviors, and taking advantage of the fact that the numerical types are already loaded into MonetDB, we can consult their BAT sizes in the MonetDB internal representation and also the sizes of the original FITS files, where the data were in the first place.

Figure 3.3: BAT and File sizes for each one of the numerical types

Figure 3.3 represents the BAT and file sizes of each one of the numerical types. We realized that the BAT sizes in the internal representation of MonetDB are the same as the sizes of the FITS files that contain the original data, for all the types tested. So, instead of having two bars, representing the BAT size and the file size, for each one of the types, we draw only one bar, that represents both BAT and file size. These results can be explained as follows:

• the short type takes 2 bytes in its representation. Multiplying 2 bytes by 470 million tuples (the last file of the test) we get 940 MB, which is precisely the size of the last file in the set of files tested for the short type.

• the integer and float types use 4 bytes in their representation. As a consequence, the sizes of the BATs and of the files double (1.8 GB in the last file) compared to the short type.

• the long and double types occupy 8 bytes, which is twice the size of integers and floats, a fact that can easily be verified by checking the last BAT and file size of the respective types (3.7 GB).

The next tests evaluate the loading of strings. We separate the strings from the numerical types because strings are stored in a different way in MonetDB. In their internal representation, strings are composed of two different arrays:

• Tail: that contains pointers to the strings (it starts with a 4-byte size pointer)

• Heap: that contains the actual strings
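Conceptually (this is only an illustrative sketch, not MonetDB’s actual definitions), the layout can be pictured as a fixed-width tail of references placed next to a variable-length heap:

/* Illustrative sketch of a string column: fixed-size entries in the tail
 * reference variable-length, '\0'-terminated strings stored in the heap. */
typedef struct {
    size_t *tail;      /* one reference (offset into the heap) per row */
    char   *heap;      /* the actual string bytes, concatenated        */
    size_t  heap_free; /* next free position in the heap               */
} string_column_sketch;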

During the experiments we found out that the string type needs the longest time to be loaded. This can be seen by comparing the loading times of the strings in the following set of graphs with the results of any of the numerical types shown in Figure 3.2. Even for the strings with 4 bytes, the loading time is worse than the loading time of the doubles, which are represented with 8 bytes. To load the numerical and string types from the FITS files into the MonetDB internal structures, two distinct steps need to be performed:

• the call of the FITS library;

• the creation of MonetDB temporary BAT (appending time).

In order to improve the time to load the strings, several alternatives were implemented. The first alternative creates one big array with all the strings, the second performs a single call per string, and the third is a middle ground between the first and the second approach, reading a vector of strings of a given size at a time and then loading the data. The best alternative is the third, because it avoids the first case, where a lot of memory is used, causing swapping to disk, and it also prevents the many calls to the FITS load function that occur in the second approach, which lead to a big overhead. For example, for a vector of size 20, 19 FITS load calls are avoided compared to the second approach. The following tests aim to find the optimal batch size, measuring the time it takes to load the strings. Those measurements involve three different components:

• FITS library function calls;

• Append to MonetDB BAT structure;

• Total time to load the strings.

We did not run these tests for the numerical types because the idea here is to optimize the time needed to load the slowest type: the strings.
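Before turning to the measurements, the third (batched) alternative can be sketched as follows in C. This is a hedged illustration, not the module’s actual code: the batch size, column number and maximum string width are placeholders, and the step that would append the strings to a MonetDB BAT is left as a comment.

#include <stdlib.h>
#include <fitsio.h>

/* Read a string column in batches of `batch` rows per fits_read_col call,
 * so that neither one huge buffer nor one CFITSIO call per row is needed.
 * `width` is the declared column width (e.g. 20 for a TFORM of "20A").    */
int read_string_column_batched(fitsfile *fptr, int colnum, long nrows,
                               long batch, int width, int *status)
{
    char **buf = malloc(batch * sizeof(char *));
    long row, i, n;
    int anynul = 0;

    if (buf == NULL)
        return -1;
    for (i = 0; i < batch; i++)
        buf[i] = malloc(width + 1);              /* room for the terminating '\0' */

    for (row = 1; row <= nrows; row += batch) {
        n = (nrows - row + 1 < batch) ? nrows - row + 1 : batch;
        fits_read_col(fptr, TSTRING, colnum, row, 1, n,
                      NULL, buf, &anynul, status);
        /* ... here the n strings in buf would be appended to the BAT ... */
    }

    for (i = 0; i < batch; i++)
        free(buf[i]);
    free(buf);
    return 0;
}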

3.6.1 Search for the ideal batch size

Figure 3.4: Batch 1 for the strings with 4 bytes

Figure 3.5: Batch 1 for the strings with 8 bytes

Figure 3.6: Batch 1 for the strings with 20 bytes

We start with the batch size of 1, represented in Figure 3.4, Figure 3.5 and Figure 3.6. These tests correspond to the second approach mentioned before, performing a single call to the FITS library per string. It can be noticed that loading the data, through the FITS library calls, always takes more time than the actual appending into the MonetDB data structure. For example, in Figure 3.4, for the last file, with 470 million tuples, the loading time is 143.3 seconds and the appending time is 111.2 seconds. The total time to load the strings into MonetDB is 323.1 seconds, meaning that a lot more computation is done in the background. As a final remark, for the file with 470 million tuples, the total time needed to load the strings with 4 bytes is 323.1 seconds, for the strings with 8 bytes 371.4 seconds, and for the strings with 20 bytes 458.3 seconds. The time needed to load, through the FITS library, the strings with 4 bytes in the same file is 143.3 seconds, for the strings with 8 bytes 163.9 seconds, and for the strings with 20 bytes 220.7 seconds. Finally, the time needed to append into MonetDB the strings with 4 bytes is 111.2 seconds, for the strings with 8 bytes 138.8 seconds, and for the strings with 20 bytes 168.4 seconds. We can easily see that the times increase as the number of bytes that represent the strings grows.

Figure 3.7: Batch 10 for the strings with 4 bytes

Figure 3.8: Batch 10 for the strings with 8 bytes

Figure 3.9: Batch 10 for the strings with 20 bytes

For the tests with a batch size of 10, represented in Figure 3.7, Figure 3.8 and Figure 3.9, we got better results. This approach is in fact the first one using the third alternative mentioned before, reading a vector of 10 strings at a time. The times are indeed faster than those of the tests in Figure 3.4, Figure 3.5 and Figure 3.6, which use the batch size of 1. In the last file, with 470 million tuples, the total time needed to load the strings with 4 bytes is 167.9 seconds, much faster than the 323.1 seconds in Figure 3.4. Responsible for this decrease is the loading task performed by the FITS library: in the test with the batch size of 1 (Figure 3.4) it took 143.3 seconds, whereas in the test with the batch size of 10 (Figure 3.7) it took 53.26 seconds. The appending time also gets faster, being 111.2 seconds in Figure 3.4 and 107.7 seconds in Figure 3.7. These are significant differences, and they are reflected in the other two tests, for the strings with 8 and 20 bytes (Figure 3.8 and Figure 3.9). Nevertheless, for these last two tests an interesting fact occurs. Note that in the transition from the file with 235 million tuples to the file with 470 million tuples, represented in Figure 3.8, the appending time starts to be faster than the loading time. This happens because MonetDB stops checking whether a duplicate value exists in the dictionary of strings once the Heap that stores the strings reaches its maximum size. The maximum size of the Heap corresponds to the size of the main memory available in the system; when it is reached, MonetDB simply inserts at the end of the Heap. The insertion of values becomes faster, as we can see in the graph, but lookup operations become worse, since they can no longer use the dictionary as a help. As for Figure 3.9, the same happens, but the Heap gets full earlier, in this case in the transition from the file with 205 million tuples to the file with 235 million tuples. This happens because the strings grew from 8 to 20 bytes, filling up the main memory faster.

Figure 3.10: Batch 20 for the strings with 4 bytes

Figure 3.11: Batch 20 for the strings with 8 bytes

Figure 3.12: Batch 20 for the strings with 20 bytes

The tests with a vector of 20 strings read at a time are represented in Figure 3.10, Figure 3.11 and Figure 3.12. For the last file, with 470 million tuples, the total time needed to load the strings with 4 bytes is 148.2 seconds. This is an improvement, since the same test for the batch size of 10, represented in Figure 3.7, needed 167.9 seconds. It is also an improvement for the strings with 8 bytes, represented in Figure 3.11. For the strings with 20 bytes, there is an increase of 4 seconds compared to the test with the batch size of 10, represented in Figure 3.9. However, the intersection of the appending time with the loading time observed in Figure 3.8 does not occur in Figure 3.11. This happens because the loading time through the FITS library gets faster with the batch size of 20 while the appending time into MonetDB keeps the same behavior. For example, with the batch size of 10, loading the strings of 8 bytes in the last file with 470 million tuples takes 114.9 seconds, whereas with the batch size of 20 the same test takes only 102.7 seconds. That difference is enough to avoid the intersection of the times. In Figure 3.12 the intersection happens again because the loading time once more becomes slow and is exceeded by the appending time. Nevertheless, as said before, this is not enough to be considered an improvement compared to the same test with the batch size of 10.

Figure 3.13: Batch 50 for the strings with 4 bytes

Figure 3.14: Batch 50 for the strings with 8 bytes

Figure 3.15: Batch 50 for the strings with 20 bytes

In the tests with the batch size of 50, represented in Figure 3.13, Figure 3.14 and Figure 3.15, the results improved for all the string sizes. In conclusion, these are the best results obtained so far.

Figure 3.16: Batch 100 for the strings with 4 bytes

Figure 3.17: Batch 100 for the strings with 8 bytes

Figure 3.18: Batch 100 for the strings with 20 bytes

For the vector size of 100, represented in Figure 3.16, Figure 3.17 and Figure 3.18, the results start to get worse because the vector begins to be too large. While the batch sizes of 20 and 50 needed on average 146, 218.5 and 323.0 seconds to load the strings of 4, 8 and 20 bytes respectively for the file with 470 million tuples, the tests with a batch size of 100 take 165.9, 219.0 and 324.4 seconds for the same file. This batch size is therefore considered a bad choice and will not be selected.

Figure 3.19: Batch 1000 for the strings with 4 bytes

Figure 3.20: Batch 1000 for the strings with 8 bytes

Figure 3.21: Batch 1000 for the strings with 20 bytes

For the last test, with a batch size of 1000, represented in Figure 3.19, Figure 3.20 and Figure 3.21, there are no remarkable differences in the times that would make us look for even bigger batch sizes in search of the perfect one.

Figure 3.22: Loading strings with 4 bytes

Figure 3.23: Loading strings with 8 bytes

Figure 3.24: Loading strings with 20 bytes

With these experiments we were trying to find a close-to-optimal batch size. Figure 3.22, Figure 3.23 and Figure 3.24 show that the new approach of reading in batches improved the loading times in all cases, but the best batch size seems to differ between file sizes. In Figure 3.22, where the strings with 4 bytes were loaded, the batch size of 50 is slightly better than the other batch sizes. As for Figure 3.23 and Figure 3.24, there is no conclusive answer about which batch size is actually better: they are all very close to each other and all of them show the same behavior during the tests.

3.6.2 BAT size representation of Strings in MonetDB

As was done for the numerical types, a graph representing the BAT size in the internal representation of MonetDB will also be produced for the string type, with some additional differences. In the following graphs, the following information is given:

• Tail size: Total size occupied by the string pointers

• Heap size: Total size occupied by the strings

• Tail + Heap: Sum of the previous two sizes, representing the actual space used to store the strings

• Memory: Line that represents the total memory of the system

• File size: the size of the FITS file where the data was originally stored

Figure 3.25: Representation of the space occupied by the strings with the size of 4 bytes

Analysing Figure 3.25, we can focus on five main observations. First, the FITS file size grows linearly, obtained by multiplying the number of tuples by the 4 bytes of each string. Second, there is a transition in the Tail size when going from the file with 235 million tuples to the file with 470 million tuples. This happens because 4 bytes are no longer enough for the pointer to the string, so the pointer size has to grow to 8 bytes. With 8 bytes there is space to represent all the pointers, but the space needed to store them is bigger. Third, the Heap size, which represents the space used to store the strings themselves, also grows linearly; it needs 17 bytes to store each 4-byte string. Fourth, the memory line is exceeded when passing from the file with 235 million tuples to the file with 470 million tuples. This leads to future swapping operations, which will greatly increase the time to execute some queries. The fifth and last fact that needs to be emphasized is the very high overhead of MonetDB for the short strings with 4 bytes: a FITS file of 2 GB takes about 12 GB internally.
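As a rough cross-check of this overhead (treating the per-string sizes quoted above as exact and ignoring smaller auxiliary structures), the internal size of the last file is approximately:

$470 \times 10^{6} \times (8\,\mathrm{B}_{\mathrm{tail}} + 17\,\mathrm{B}_{\mathrm{heap}}) \approx 11.8\ \mathrm{GB}, \qquad \text{versus a FITS file of } 470 \times 10^{6} \times 4\,\mathrm{B} \approx 1.9\ \mathrm{GB}.$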

Figure 3.26: Representation of the space occupied by the strings with the size of 8 bytes

For the results presented in Figure 3.26, one occurrence must be emphasized when compared to Figure 3.25. The transition in the Tail size that was observed in Figure 3.25 between the files with 235 and 470 million tuples now happens between the files with 117 and 205 million tuples. This is because the strings have 8 bytes instead of 4, consuming more heap space, so the address space for the pointers in the tail has to grow earlier. As for the Heap size, it requires 25 bytes to store each 8-byte string, 8 bytes more per string than in Figure 3.25.

Figure 3.27: Representation of the space occupied by the strings with the size of 20 bytes

In Figure 3.27 the transition in the Tail size occurs in the same interval as the one observed in Figure 3.26; however, the strings now take 20 bytes, which leads to a bigger space needed to store them. That is reflected in the size of the FITS files, which is bigger than the memory for the file with 470 million tuples. The same happens in the internal representation of MonetDB, which exceeds the memory size in the transition between the files with 117 and 205 million tuples. As for the Heap size, it requires 34 bytes to store each 20-byte string, 9 bytes more per string than in Figure 3.26.

3.7 Export a table

This procedure allows the user to create a completely new FITS file, based on an existing table present inside the database. The file will have the name of the table, with the proper ".fits" extension. Once this function is called, the system checks whether the table actually exists, giving a proper error message if not.

Assuming that the table exists, once the function is invoked the properties of the table are consulted: number of columns, types of the columns and number of tuples. This information is used to create the header of the HDU containing the metadata about the table. With this information, we then need to collect the BATs that belong to each one of the columns of the table. The optimal number of rows to write at once into the FITS file depends on the width of the table and on the data types of the stored values. CFITSIO provides a routine, fits_get_rowsize, that returns the optimal number of rows to write for a given table when given a FITS file as a parameter, and we use it to obtain that value. Each time the block of rows is full, the fits_write_col routine is called and the rows that compose the block are added to the table present in the FITS file. This block-wise method is much more efficient than calling the function only once for the whole table, which would cause a large memory consumption: the block is cleared every time it is full and repopulated with the new data that needs to be added.

As we did before for the loading procedure, we measure the time needed to export each one of the types. We start with the numerical types and finally export to a FITS file the three kinds of strings already studied before. The idea is to get a perspective on the types that take more time to be exported, so that a proper future comparison with existing tools that provide the same functionality becomes manageable. The time that is measured is the one related to the invocation of the fits library function fits_write_col. The tables that are exported are the same ones that were loaded into MonetDB in the loading section, with the same number of tuples and with only one column, representing the target type being studied.

In Figure 3.28 we can see that this strategy of writing the values in blocks of optimal size is in fact very efficient. The fastest type to be exported is the short type, due to the 2 bytes of its MonetDB internal representation, and the slowest is the double type, due to its 8 bytes. Another important point is the linear growth of all types until they reach the file with 235 million tuples. The growth behavior changes for the last file, with 470 million tuples, because it requires more memory and swapping operations, which leads to a worse result. As a last remark, note the behavior of the two groups Integers/Floats and Longs/Doubles: within each group the behavior is the same up to the 235 million tuples file. For the file with 470 million tuples, the types that require more computation (Floats and Doubles) need two or three additional seconds compared to the numerical types that share the same number of bytes (Integers and Longs, respectively).

Figure 3.28: Export the numerical types into a FITS file

Figure 3.29: Export the strings into a FITS file

In Figure 3.29 we export the strings with 4, 8 and 20 bytes. For the strings with 4 bytes, the growth is linear until the file with 235 million tuples. After that it grows faster because, in the MonetDB internal representation, the space needed to store the 4-byte strings for the file with 470 million tuples exceeds the memory capacity, as we can see in Figure 3.25. For the strings with 8 and 20 bytes that change occurs earlier, because more space is required to store these strings, as shown in Figure 3.26 and Figure 3.27. Consequently, they also take more time to export.

Chapter 4

Case Study

In this section we will demonstrate, using a concrete example taken from astronomy, that the MonetDB-FITS combination works. This is done through a step-by-step tutorial, guiding the astronomy user and explaining in detail what needs to be done in order to get everything working and ready to be used.

4.1 Overview

We will present to the astronomer how a specific FITS file can be attached, accessing its metadata and spreading the information along several tables present in the SQL catalog. We will also show how the astronomer can attach all the FITS files present in a directory, or attach only the files in the directory that match a specific pattern. Further, we will explain how to load a specific table present in a FITS file that is already attached into the SQL catalog, and how to export a table present in our SQL catalog to a brand new FITS file. Finally, once all the astronomical data is present in the database, we want to do something with it. Following a suggestion by an astronomer working at CWI, we will demonstrate how to execute a set of common astronomical queries in MonetDB. This is the added value of this section, where we present MonetDB as a capable database engine that is ready to fulfil a set of astronomical demands. We assume that the user already has the latest version of MonetDB installed in the system, downloaded from the Mercurial repository.

4.2 Attach a file

For this initial task, a FITS file must be present in the file system. Let us assume that the FITS file is called testattach.fit and that its directory is ’/ufs/fits/’. To attach the file into MonetDB, the user needs to type:

• call fitsattach(’/ufs/fits/testattach.fit’);

If there are no problems (wrong file name, wrong directory, or the attachment procedure was already called for the input file), the user can verify that all the information about the FITS file is spread along the tables fits_files, fits_tables, fits_columns and fits_properties.
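For instance, the attached file can be inspected directly from SQL (a minimal sketch; the exact set of columns exposed by these vault tables may vary between MonetDB versions):

-- list the FITS files that are currently attached
select * from fits_files;

-- list the tables (HDUs) and columns that the attached files expose
select * from fits_tables;
select * from fits_columns;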

4.3 Attach all FITS files in the directory

Let us assume that there is a folder called astro inside the ’/ufs/fits/’ directory, containing the files testdir1.fit and testdir2.fit. To attach both files with a single operation, the user needs to type:

• call fitslist(’/ufs/fits/astro/’);

The user will then see that two lines were added to the table fits_files, corresponding to the files that were attached by the call of the function.

4.4 Attach all FITS files in the directory, giving a pattern

The directory is once again ’/ufs/fits/’, with the folder astro containing the files testdir1.fit and testdir2.fit. If the user types:

• call fitslistpatt(’/ufs/fits/astro/’,’*1.fit’);

With the given pattern, only the file named testdir1.fit will be attached. The user can confirm it by checking the table fits_files.

4.5 Load a table

Having already attached three files into our FITS catalog (testattach.fit, testdir1.fit and testdir2.fit), we are now able to load one of the tables present in those files. By typing:

• select * from fits_tables;

We can see that testdir1.fit has a table called table1. This table can be the target of our loading operation, through the command:

• call fitsload(’table1’);

The table named table1 has become a table of the database, with its metadata, columns and data ready to be queried.
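As a quick sanity check (a minimal sketch, assuming the table keeps the name table1 used above), the freshly loaded table can be queried like any other relational table:

select count(*) from table1;
select * from table1 limit 10;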

4.6 Export a table

Assume that during our querying of table1 some metadata, columns and data underwent changes, leading to the creation of a new table called export1. That table can be exported through the command:

• call fitsexport(’export1’);

This will create a new FITS file called export1.fit, with only one table, the one that was the aim of our call.

4.7 Cross-matching astronomical surveys

What we have seen so far is a set of mandatory procedures that must be done to load the data into the database for further manipulation and querying. Our objective in this section is to continue working with the data that was loaded and use a concrete astronomical use case proposed by an astronomer working at CWI. We want to demonstrate that it is possible to delegate to MonetDB a set of astronomical queries performed by astronomers in their daily activities. The idea is to present an efficient and fast alternative to the existing tools. This is needed because this kind of test requires a lot of computation time, deals with big amounts of data and uses sophisticated algorithms that only databases are able to provide. As an example of a complex query we have the cross-matching test, which takes two or more surveys as input and searches for astronomical point sources that respect a specific set of restrictions, like the distance between sources, calculated using the Cartesian coordinates (x, y, z) that represent a unit-length vector on the unit sphere. The idea is to combine results and information about the sources reported in the different surveys. With those matching data, information like position and brightness of the sources can easily be compared. But we have to be aware that, in most cases, we face a substantial number of differences in object detections between surveys, and between observations taken at different times within the same survey or instrument. One difference is the sensitivity, which is a measure of the minimum signal that a telescope can distinguish above the random background noise: the more sensitive a telescope, the more light it can gather from faint objects. The other distinction that should be considered is the resolution of the survey, that is, the ability of a telescope to distinguish (resolve) close objects: the higher the resolution of a telescope, the more details we can see in the images obtained with it [12]. For the first tests, we start with data captured by two distinct astronomical surveys:

• FIRST [26]: stands for Faint Images of the Radio Sky at Twenty-cm and it works at the frequency of 1.4 GHz, has a resolution of 5 arc seconds and a sensitivity of 0.15 mJy

• SDSS [4]: stands for Sloan Digital Sky Survey. It is one of the most ambitious and influential surveys in the history of astronomy and it works at the frequency of 4.866×10⁵ GHz

As we can see, the different surveys work at distinct frequencies. Cross-matching the information between them allows us to compare the values of the brightness (light flux) of the same source, measured at different frequencies. Those values allow the astronomer to study the behavior of the source across the spectrum and to classify the stars based on their spectral characteristics. To calculate the brightness of the same object at different frequencies, we need to join the two tables derived from FIRST and SDSS and extract the brightness measured in each of them. The join condition represents a spatial match between the two astronomical surveys. The FIRST dataset is a subset with 18 columns and 285 rows. It was extracted from a vast repository that contains 816331 sources. It was supplied by an astronomer and it covers the region (RA: 0.048 to 14.80 degrees, DEC: -0.100 to 0.100 degrees). From those columns we are only interested in the following ones:

• cx(Double): unit vector of spherical co-ordinates

• cy(Double): unit vector of spherical co-ordinates

• cz(Double): unit vector of spherical co-ordinates

• ra(Double): units: degrees; J2000 Celestial Right Ascension

• ra_error(Double): units: degrees; error associated with the Right Ascension

• dec(Double): units: degrees; J2000 Celestial Declination

• dec_error(Double): units: degrees; error associated with the Declination

• fInt(Double): represents the integrated brightness of the object. It is an alternative to the peak brightness method, which collects the maximum value of the brightness of the source; the integrated brightness instead integrates the area around the peak value. It is expressed in mJy (millijansky). A Jansky is a non-SI unit of electromagnetic flux density, equal to 10⁻²⁶ watts per square metre per hertz, and it is suited to very weak signals.
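For reference, the unit conversion implied above is:

$1\ \mathrm{Jy} = 10^{-26}\ \mathrm{W\,m^{-2}\,Hz^{-1}}, \qquad 1\ \mathrm{mJy} = 10^{-3}\ \mathrm{Jy} = 10^{-29}\ \mathrm{W\,m^{-2}\,Hz^{-1}}$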

As for the SDSS dataset, it has 446 columns and 77104 rows, making it bigger than the FIRST catalog. The idea is to make sure that the entire FIRST dataset is covered, trying to match all the sources of the FIRST catalog with sources of the SDSS catalog. This catalog covers the area of the sky equivalent to (RA: 1.69e-05 to 15.00 degrees, DEC: -0.100 to 0.100 degrees). From the 446 columns we are interested in the following ones:

• cx(Double): unit vector of spherical co-ordinates

• cy(Double): unit vector of spherical co-ordinates

• cz(Double): unit vector of spherical co-ordinates

• ra(Double): units: degrees; J2000 Celestial Right Ascension

• ra_error(Double): units: degrees; error associated with the Right Ascension

• dec(Double): units: degrees; J2000 Celestial Declination

• dec_error(Double): units: degrees; error associated with the Declination

• sky_r(Float): represents the brightness of the object. It is expressed in maggies/arcsec² and has to be converted to mJy by dividing the value by 3631000

4.7.1 Query 1: Distribution of distances between sources in both surveys

As an initial query, we will join the two tables of both surveys and get all the possible combinations of the sources. Having that, it is possible to obtain all the distances between the sources, given in arc seconds, together with some statistical information (distance average and standard deviation). With those two values, one can calculate the Normal Distribution of the distances, which gives extra information about how the data is distributed. To calculate the distance between two sources we use the Euclidean distance in three-dimensional space. Assuming that the points $a$ and $b$ have coordinates $a = (x_1, y_1, z_1)$ and $b = (x_2, y_2, z_2)$, the distance between them is given by:

$d(a, b) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}$

However, we want the distance over the surface of the sphere, which is obtained from:

$\sin\tfrac{1}{2}\theta = \tfrac{1}{2}\, d(a, b)$

That can be simplified to:

$\theta = 2 \arcsin\!\left(\tfrac{1}{2}\, d(a, b)\right)$

The problem is that the value of θ is given in radians and we want arc seconds. Since π radians correspond to 180 degrees, the conversion to degrees is easy. An arc minute is one sixtieth (1/60) of a degree and an arc second is one sixtieth of an arc minute, so the value in degrees is further multiplied by 3600 (an overall factor of about 206265 from radians to arc seconds). Once we have all the distances available, we can calculate the Normal Distribution of the distances between the objects. The formula for the Normal Distribution is:

$f(x) = \dfrac{1}{\sqrt{2\pi\sigma^2}}\; e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

Where the parameter µ is the mean (location of the peak) and σ² is the variance. The mean is the sum of the values divided by the number of values. The standard deviation σ is used as a measure of how far a set of numbers is spread out, and it can be calculated using the formula:

$\sigma = \sqrt{\dfrac{\sum_{k=1}^{n} (x_k - \mu)^2}{n}}$

Where µ is the average and n is the number of values. Information like the distance between objects and the brightness measurements will be required in the SQL queries that follow. In order to accomplish that goal and give a faster answer to the queries, an SQL table that provides all those parameters will be built:

CREATE TABLE distances (
    htmid_first BIGINT,
    cx_first FLOAT(51),
    cy_first FLOAT(51),
    cz_first FLOAT(51),
    brightness_first FLOAT(51),
    htmid_sdss BIGINT,
    cx_sdss FLOAT(51),
    cy_sdss FLOAT(51),
    cz_sdss FLOAT(51),
    brightness_sdss FLOAT(51),
    distance FLOAT(51)
);

Once the table is created, we need to insert all the data of interest, which is spread along both tables and requires some calculation and computation.

INSERT INTO distances (
    htmid_first, cx_first, cy_first, cz_first, brightness_first,
    htmid_sdss, cx_sdss, cy_sdss, cz_sdss, brightness_sdss,
    distance
)
SELECT
    tbl1.htmid, tbl1.cx, tbl1.cy, tbl1.cz, tbl1.FInt,
    tbl2.htmid, tbl2.cx, tbl2.cy, tbl2.cz, (tbl2.sky_r/3631000),
    2*asin(0.5 * sqrt( ((tbl1.cx-tbl2.cx)^2)
                     + ((tbl1.cy-tbl2.cy)^2)
                     + ((tbl1.cz-tbl2.cz)^2) )) * 180 * 3600 / pi()
FROM first_2 as tbl1, sdss_2 as tbl2;

With this query, that took only 34.9 seconds, all the information needed for the further queries is already computed. As a consequence, the answers will be much faster and require less memory.

4.7.2 Distribution of the distances smaller than 45 arc seconds

As a first filter on possible candidates, we will calculate the distribution of distances smaller than 45 arc seconds between the sources of FIRST and SDSS. At such separations it is unlikely that two sources are associated, but with this first test we just want to get a general idea of the distribution of the distances between the objects. Keep in mind that the resolution of the survey is an important factor and will impact the number of sources: the lower the resolution, the larger the number of sources that fall within a given distance. The idea is to create a histogram, plotting the number of sources against the distance range separating them, for example the number of sources that are separated from each other within bins of 5 arc seconds width. To achieve this goal, we first create and populate a table with the necessary bins for our test:

CREATE TABLE bins (min_value INT, max_value INT);
INSERT INTO bins values (0,5);
INSERT INTO bins values (5,10);
INSERT INTO bins values (10,15);
INSERT INTO bins values (15,20);
INSERT INTO bins values (20,25);
INSERT INTO bins values (25,30);
INSERT INTO bins values (30,35);
INSERT INTO bins values (35,40);
INSERT INTO bins values (40,45);

Having the table with the bins and the table with the distances available, we can easily create a histogram of the distances, using the following SQL query in MonetDB:

SELECT bins.min_value, count(*) as total
FROM bins left outer join distances
ON distances.distance BETWEEN bins.min_value and bins.max_value
GROUP BY bins.min_value;

+-----------+-------+
| min_value | total |
+===========+=======+
|         0 |   110 |
|         5 |     0 |
|        10 |     0 |
|        15 |     0 |
|        20 |     0 |
|        25 |     0 |
|        30 |     0 |
|        35 |     0 |
|        40 |     0 |
+-----------+-------+

Having this information, the histogram containing the bins and the total number of sources can be plotted.

Figure 4.1: Distribution of distances between sources

In Figure 4.1 we can see the histogram that plots the distribution of the distances between sources. It starts by counting the sources that are between 0 and 5 arc seconds apart and finishes with the number of sources that are between 40 and 45 arc seconds apart. With this information we have a well-formed idea of how the distances are dispersed: we can easily see that the distances between the sources are concentrated in the interval between 0 and 5 arc seconds.

4.7.3 Normal Distribution of all the data

To obtain the Normal Distribution of the distances between the sources, three steps are needed: calculate the average of the distances (µ), the total number of distances (n) and the standard deviation (σ). To calculate the average in MonetDB we use the following SQL query:

• SELECT avg(distance) FROM distances WHERE distance <45;

+-----------------------+
| average (arc seconds) |
+=======================+
|   0.71174942511540928 |
+-----------------------+

To calculate the total number of sources that are at a distance smaller than 45 arc seconds, the SQL query in MonetDB is:

• SELECT count(*) as total FROM distances WHERE distance <45;

+-------+
| total |
+=======+
|   110 |
+-------+

Finally, we need to calculate the standard deviation, with the following SQL query:

SELECT sqrt( sum((distance-0.71174942511540928)^2) / 110 ) as stdev FROM distances WHERE distance < 45;

+---------------------+
| stdev               |
+=====================+
| 0.57459821453517357 |
+---------------------+

Having all the values available, it is now possible to plot the Normal Distribution of the distances.
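Substituting the values just computed (µ ≈ 0.7117 arc seconds and σ ≈ 0.5746) into the density above gives the curve shown in Figure 4.2:

$f(x) = \dfrac{1}{\sqrt{2\pi\,(0.5746)^{2}}}\; e^{-\frac{(x-0.7117)^{2}}{2\,(0.5746)^{2}}}$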

Figure 4.2: Normal Distribution

Now that we have a general idea about the data set we can refine our query for smaller distances. The next section will modify the restriction to 5 arc seconds in the distance between the sources, instead of 45 arc seconds.

4.7.4 Frequency of the distances smaller than 5 arc seconds

The idea is to do the same as we did with the previous data set, but changing the restriction on the distance between the sources to 5 arc seconds. We also need to change the bin width to 0.5 arc seconds, starting at 0 and ending at 5.
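A possible way to do this is to reuse the bin/left-outer-join pattern from the previous section with a new, finer bin table (the name bins_small and the FLOAT bounds are illustrative; the thesis does not show this step explicitly):

CREATE TABLE bins_small (min_value FLOAT, max_value FLOAT);
INSERT INTO bins_small values (0.0,0.5);
INSERT INTO bins_small values (0.5,1.0);
INSERT INTO bins_small values (1.0,1.5);
INSERT INTO bins_small values (1.5,2.0);
INSERT INTO bins_small values (2.0,2.5);
INSERT INTO bins_small values (2.5,3.0);
INSERT INTO bins_small values (3.0,3.5);
INSERT INTO bins_small values (3.5,4.0);
INSERT INTO bins_small values (4.0,4.5);
INSERT INTO bins_small values (4.5,5.0);

SELECT bins_small.min_value, count(*) as total
FROM bins_small left outer join distances
ON distances.distance BETWEEN bins_small.min_value and bins_small.max_value
GROUP BY bins_small.min_value;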

Figure 4.3: Frequency of distances between sources that are less than 5 arc seconds apart from each other

4.7.5 Frequency of the r value between sources in both surveys

In his PhD thesis [22], Bart Scheers presents his idea of source association. He argues that it can be measured through the right ascension (RA) and declination (DEC), taking into account their respective errors. These values are to the sky what longitude and latitude are to the surface of the Earth: RA corresponds to the east/west direction (like longitude), while DEC measures the north/south direction, like latitude. Their values and the respective errors are measured in degrees. RA varies between 0 and 360 degrees, while DEC varies between -90 and +90 degrees. There are 60 arc minutes in a degree, and 60 arc seconds in an arc minute. The formula that Bart uses in his thesis is:

$r_{i,j} = \sqrt{\dfrac{(\Delta\alpha)_{i,j}^{2}}{\sigma_{\Delta\alpha,i,j}^{2}} + \dfrac{(\Delta\delta)_{i,j}^{2}}{\sigma_{\Delta\delta,i,j}^{2}}}$

Where α represents the RA, δ the DEC and σ² their respective errors. In order to store the values of interest, we will create a table with only the fields that we are interested in:

CREATE TABLE rvalues (
    htmid_first BIGINT,
    ra_first FLOAT(51),
    dec_first FLOAT(51),
    ra_error_first FLOAT(51),
    dec_error_first FLOAT(51),
    htmid_sdss BIGINT,
    ra_sdss FLOAT(51),
    dec_sdss FLOAT(51),
    ra_error_sdss FLOAT(51),
    dec_error_sdss FLOAT(51),
    rvalue FLOAT(51)
);

Once we have the table, we can populate it with values.

INSERT INTO rvalues (
    htmid_first, ra_first, dec_first, ra_error_first, dec_error_first,
    htmid_sdss, ra_sdss, dec_sdss, ra_error_sdss, dec_error_sdss,
    rvalue
)
SELECT
    tbl1.htmid, tbl1.ra, tbl1."dec", tbl1.ra_error, tbl1.dec_error,
    tbl2.htmid, tbl2.ra, tbl2."dec", tbl2.ra_error, tbl2.dec_error,
    sqrt( (((tbl1.ra-tbl2.ra)^2) / (tbl1.ra_error^2 + tbl1.dec_error^2))
        + (((tbl1."dec"-tbl2."dec")^2) / (tbl2.ra_error^2 + tbl2.dec_error^2)) )
FROM first_2 as tbl1, sdss_2 as tbl2;

This query took only 23.8 seconds. Once again, all the information needed for further queries is already computed. We will not proceed with further tests here; we only want to demonstrate how to compute these values, so they can be used in the future, avoiding extra and unnecessary computation.

4.7.6 Query 2: extract & compare brightness in different frequencies

In the introduction of this section we talked about the brightness and how it changes with frequency. In this section we will join the two tables from the FIRST and SDSS surveys and extract their brightness measurements. The SQL query used in MonetDB, once the table is loaded into the database management system, is:

SELECT brightness_first as brightness1,
       log10(brightness_first) as log1,
       brightness_sdss as brightness2,
       log10(brightness_sdss) as log2
FROM distances
WHERE distance < 1;

Where brightness1 and brightness2 represent the brightness measurement in each one of the surveys, and log1 and log2 are their logarithms, calculated in order to provide both scales in the graphical representation. The distance between the sources was changed to 1 arc second because with this value it is more likely that the two detections correspond to the same object, and the number of lines in the graph is smaller, keeping it understandable and readable.

Figure 4.4: Measure of the brightness in different frequencies

Figure 4.4 represents the brightness of each one of the objects that are separated by less than 1 arc second, measured at different frequencies. We can easily see that the brightness of the objects is higher at the frequency of 1.4 GHz and lower at the frequency of 4.866×10⁵ GHz.

4.7.7 Query 3: extract the spectral index

Having the values of the brightness of each object, measured at different frequencies, it is now possible to calculate the spectral index. In astronomy, the spectral index of a source is a measure of the dependence of the radiative flux density on frequency. Given a frequency ν and a radiative flux S, the spectral index α can be calculated as:

$\alpha = -\,\dfrac{\log(S_f / S_s)}{\log(\nu_f / \nu_s)}$

Where $S_f$ and $S_s$ represent the brightness of the source in the FIRST and SDSS surveys respectively, and $\nu_f$ and $\nu_s$ the corresponding frequencies. As we did for the distances and the brightness, we create a table to store the spectral index:

CREATE TABLE spectral_indexes (spectral_index FLOAT(51), distance FLOAT(51));

Once the table is created, we insert the values of the spectral index and the distance between the objects into the table.

INSERT INTO spectral_indexes (spectral_index, distance)
SELECT -( (log10(brightness_first)/(log10(brightness_sdss)))
        / (log10(1400)/(log10(486281359))) ),
       distance
FROM distances
WHERE distance < 1;

4.7.8 Distribution of the spectral index

Figure 4.5: Calculation of the spectral index

Figure 4.5 represents the frequency of the calculated spectral indexes. The same procedure was used as for the frequency of the distances, in order to build a histogram. The minimum value of the spectral index is -0.016980525624847961 and the maximum value is 0.66733249647620008. Knowing these two values, the histogram can be created: it has 10 intervals, adding 0.07 units at each step. We can notice that, for this experiment, the most frequent spectral indexes lie between -0.02 and 0.19.
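The 0.07 step follows directly from those extremes:

$\dfrac{0.6673 - (-0.0170)}{10} \approx 0.068 \approx 0.07$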

4.7.9 Normal distribution of the spectral indexes

As a next step, we will calculate the Normal Distribution of the spectral index. As in the previous sections, we need the average of the spectral index and how much the objects deviate from it. For that, first we need a SQL query that calculates the average:

SELECT avg(spectral_index) as average FROM spectral_indexes WHERE distance < 1;

+---------------------+
| average             |
+=====================+
| 0.13056802784870311 |
+---------------------+

Knowing the average, it is now possible to calculate the standard deviation and the total number of tuples. That is done through the following SQL queries:

SELECT count(*) as total FROM spectral_indexes WHERE distance < 1;

+-------+
| total |
+=======+
|    83 |
+-------+

SELECT sqrt( sum((spectral_index-0.71174942511540928)^2) / 83 ) as stdev FROM spectral_indexes WHERE distance < 1;

+---------------------+
| stdev               |
+=====================+
| 0.57459821453517357 |
+---------------------+

Figure 4.6: Plot of the Normal distribution of the spectral index

Chapter 5

Performance Experiments

In this chapter a stress test of MonetDB and STILTS was done to check the performance of both systems when accessing data present in FITS files. STILTS stands for Starlink Tables Infrastructure Library Tool Set [24] and it is a set of command-line tools for table manipulation. More details about this program are explored in the Related Work section; for now we only present the tool and explore it in more detail where appropriate.

5.1 Experimental Setting

The tests were done on a machine with 8 GB of memory, 4 CPUs at 2.83 GHz and 300 GB of free disk space. For each one of the tests, the performance of the tools is checked with hot and cold memory. Hot memory means that the data is already loaded in memory; cold memory means that the memory is clean and all data first needs to be brought from disk to memory to produce the final result. Some relational algebra expressions are applied to the tables present in the FITS files. They consist of a set of operations that take one or two tables as input and produce a new table as output. Some examples of those operations are listed below; a combined SQL illustration follows the list:

• The select operation: selects tuples that satisfy a given predicate or condition. In STILTS, they are identified by filtering operations

• The project operation: returns only the columns of interest

• The cartesian product operation: allows combining information from two tables

• The natural join operation: a binary operation that combines certain selections and a cartesian product into one operation


• The aggregation operation: in SQL it is called group by and includes five aggregate functions: Sum, Count, Average, Maximum and Minimum

• The sorting operation: in SQL it is called order by and its goal is to sort the tuples of an input column
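As an illustration only (the table sources and its columns are hypothetical), a single SQL statement can combine several of the operations listed above, namely selection, projection, aggregation and sorting:

SELECT type, count(*) as n, avg(flux) as mean_flux   -- projection and aggregation
FROM sources                                          -- hypothetical input table
WHERE flux > 0.15                                     -- selection (filtering)
GROUP BY type                                         -- aggregation groups
ORDER BY mean_flux DESC;                              -- sorting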

STILTS is a powerful tool that can be used to manipulate and access astronomical data present in a variety of different formats. In this chapter we are only interested in the FITS format and in how efficient this tool is with FITS files. In STILTS, for the filtering operations, the table data is streamed through the pipeline using the command tpipe. One row at a time is checked, either in sequential mode or in random mode; the mode can be set explicitly on the command line. We tested both modes and the results were similar for both small and big files, so we simply chose the default mode to run the tests. Since only one method (checking a row at a time) is used for this kind of operation, we expect similar results for STILTS in each one of the tests. As for MonetDB, the purpose of these tests is to load the tables with the data present in the FITS files and to perform some main relational algebra operations on them. MonetDB needs to invest time to load FITS data into its internal data structures. This operation is done only once, and the time it takes appears only in the graphs that measure the cold memory operations, because it involves bringing all the data to main memory. For the hot memory tests, MonetDB already has the data present in its tables and the processing times are amortized. The methods that MonetDB uses to perform the relational algebra select operations vary between sequential scans and hash table structures. For that, four different times were measured:

• the time for MonetDB to load the table

• the time for MonetDB to execute the Query

• the sum of both times (Query + Load)

• the time for STILTS to access the data and execute the Query

5.2 Test Files

Experiments are divided in three groups. The first group was already presented and is described in Table 3.1. We reuse this set of FITS files because we want to study the behavior of an individual specific type. For the types that represent numerical values, the tests that will be performed are: Point Query, Range Selection of data and Statistical information (average, minimum and maximum). For the String type, only the Point Query test will be performed, since the range and statistical tests cannot be performed on strings. The second group of tests is based on the creation of a large file that contains a table with all the column types described in the first group. The set of FITS files that is created is based on the number of rows used in the first group, resulting in the following sizes:

Megabytes    Number of Rows
1            30000
100          2942000
1000         29420000
2000         58840000
4000         117680000
7000         205940000
8000         235360000
16000        470720000

Table 5.1: Second group of FITS files

For this large file, represented in Table 5.1, the tests that will be performed are: Point Query, Range Selection, Projection of a specific column and Statistical information (average, minimum and maximum) of each column. For the third and last group, a comparison of the join features of STILTS and MonetDB will be performed. The size and the number of rows of each file are described in the following table:

Megabytes    Number of Rows
1            30000
10           300000
30           900000
50           1500000

Table 5.2: Third group of FITS files

In Table 5.2 the files are smaller than the files used in groups one and two because a lot more computation is needed for join operations, and also because it is simply impossible to perform such queries in STILTS when the files reach a particular size. Another change compared to the previous two groups is that only hot memory tests were considered. The reason is that the files are relatively small, and the time to bring them to memory does not result in a substantial difference between hot and cold memory. STILTS comes with a set of routines (tmatch1, tmatch2, tmatchn) for various join operations over tables. The routines are parameterized with the type of the join condition, the so-called matcher, and can be grouped in the following categories:

• exact matcher: corresponds to the equi-join in DB terms.

• isotropic (1d, 2d) and anisotropic (2d) matcher: considers two rows as matching if the difference between the values of their keys is less than a given error. It corresponds to the more general theta-join.

• sky and skyerr matchers: apply join conditions customized for astronomical spatial searches.

In the STILTS documentation, joins are referred to as crossmatching operations, where the goal is to identify different rows, which may be in the same or in different tables, that refer to the same item. A crossmatch is performed in three steps. First, define the condition that must be satisfied for two rows to be considered matching; the condition can be one of the matchers presented before. Second, decide what happens if, for a given row, more than one match can be found; the user can choose between Best Match Only and All Matches. And third, decide what to do with the matching rows; the user can count them, present them as an output table or create a new FITS file with the produced table. Matching can in general be a computationally intensive process. The algorithm used by the tmatch* tasks in STILTS scales as O(N log N) (loglinear complexity), where N is the total number of rows in all the tables being matched. No preparation (such as sorting) is required on the tables prior to invoking the matching operation. It is quite fast for small tables; the same does not hold for bigger tables, where it can easily run out of memory.
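For comparison, the equi-join that the exact matcher corresponds to would be expressed in MonetDB as follows (a sketch with hypothetical table and key names):

SELECT count(*)
FROM survey_a as a, survey_b as b
WHERE a.objid = b.objid;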

5.3 Delegation experiments

In this section we illustrate the effect of delegating to MonetDB some of the operations that can be done by existing tools that work with FITS files. We start with selection, filter and statistical delegation, performed on the first and second groups of FITS files, and then move to theta- and equi-join delegation, performed on the third group of FITS files.

5.3.1 Selection and Filter delegation for Group number 1

For the first group, the first test is the Point Query. Starting with the types that are used to represent numbers, the general structure of the SQL query in MonetDB is:

• select count(*) from binary_table where Value=Number;

and in STILTS:

• ./stilts tpipe in=binary_table.fit cmd=’select Value==Number’ omode=count

Figure 5.1: Performance of MonetDB and STILTS in Point Query operations with numerical types

Figure 5.1 represents the Point Query test for the numerical types, in hot memory, for MonetDB and STILTS. We grouped all types together, calculating the average behavior of STILTS and MonetDB, because the times needed by the tools were very similar across the different types. Both tools show a linear behavior during the tests. However, we have to highlight the performance of MonetDB, which is always better than STILTS, reaching a query time of one second only in the last test, while STILTS takes an average of 64 seconds to perform the last test. As for the cold memory tests, we assign a graph to each one of the types, because they involve loading the tables and MonetDB behaves differently for each one of the numerical types. The first type is the Short.

Figure 5.2: Performance of MonetDB and STILTS in Point Query operations for short type

Figure 5.2 represents the Point Query test for the type short, in cold memory, for MonetDB and STILTS. Both tools show a linear behavior during the tests. MonetDB is superior in all the tests, even when the loading time is added to the query time. As the second type used to represent numbers, we will use Integers.

Figure 5.3: Performance of MonetDB and STILTS in Point Query operations for integer type

Analysing Figure 5.3, two main points can be identified: the increased loading time when compared to the type short, and the behavior of the MonetDB query time for the interval of files starting with the one with 29 million tuples and ending with the one with 205 million tuples. The increase in the loading time is expected, since the size of the data grows from 2 bytes, for the short type, to 4 bytes, for the integer type. As for the behavior of the MonetDB query, a trace was done to check which algorithms were being used. It turns out that most of the time goes to the uselect statement, which is the one responsible for building the hash table. Once the hash table is created the queries become much faster, because the result is already available in memory structures as a hash table. Knowing that, and observing the times in Figure 5.3, we can deduce that, for the files between 29 and 205 million tuples, MonetDB tries to build the hash table and fails, tries again and fails again, three or four times; we say this because the results are 4-5 times bigger than expected. For the other files, it builds the hash table at the first try. The reason for this failure still needs more debugging of the MonetDB code and is for now unclear.

The third type chosen to represent numbers is the type Long.

Figure 5.4: Performance of MonetDB and STILTS in Point Query operations for long type

Figure 5.4 represents the Point Query test for the type long, in cold memory. The same problem that was found for the integer type, described in Figure 5.3, also occurs in the long type tests. The peak in the computation time is even more accentuated in this test and the loading time of the table is also bigger, due to the 8-byte size that the long type requires. The combination of both factors leads to a total time spent by MonetDB (load plus SQL query) worse than the computation time needed by STILTS.

Figure 5.5: Performance of MonetDB and STILTS in Point Query operations for float type

The Float type is used to represent floating point numbers and it occupies 4 bytes. Figure 5.5 shows the performance of MonetDB and STILTS in the Point Query test for the type float, in cold memory. There is a similarity between the loading time behavior of the integers, described in Figure 5.3, and the loading time of the floats, described in Figure 5.5. This happens because both types use 4 bytes in their MonetDB internal representation. Nevertheless, the floating points are more complex to store, leading to worse performance in the loading of the tables when compared to integers. As a consequence, adding the loading time to the query time, the performance of MonetDB is worse than the performance of STILTS for the last test, with the 470 million tuples file. For the other files, the performance of MonetDB is always better than STILTS. The last Point Query test takes the type double as an input.

Figure 5.6: Performance of MonetDB and STILTS in Point Query operations for double type

As mentioned before, the similarity between the loading time behavior of the floats and the integers happens because both have a MonetDB internal representation of 4 bytes. The same holds for doubles and longs: both have 8 bytes in their MonetDB internal representation and both show a similar behavior in their loading process. However, the double type is used to store floating point values, so there is a little additional time with doubles due to their complexity. Having finished with the numerical types, we now proceed to the string type. Once again, all the information is gathered in one graph for the hot memory tests, which makes it simpler and comparable across the different string sizes. As for the cold memory tests, we keep one graph for each of the string sizes. The general structure of the SQL query in MonetDB is:

• select count(*) from binary_table where A=’string’;

and in STILTS:

• ./stilts tpipe in=binary_table.fit cmd=’select A=="string"’ omode=count

Figure 5.7: Performance of MonetDB and STILTS in Point Query operations with different string sizes

Figure 5.7 represents the Point Query test for the strings with different sizes, in hot memory, for MonetDB and STILTS. For the tests with the 4-byte strings, we conclude that MonetDB is always faster than STILTS, for all file sizes. The behavior of STILTS during the tests is linear. However, the time in MonetDB grows sharply when moving from the 235 million tuples file to the 470 million tuples file. This happens because the system runs out of memory and needs to perform swapping operations. This can also be seen in Figure 3.25, which displays the BAT size in MonetDB for the strings with 4 bytes: for the last file, with 470 million tuples, the total size needed to represent the strings exceeds the memory capacity and the system needs to execute swapping operations, which makes the query time grow. As for the tests with the 8-byte strings, MonetDB also runs out of memory for the last file, because once again the BAT size of the strings exceeds the memory capacity (Figure 3.26), while the performance of STILTS stays similar, since it takes as input the file present in the file system, which is smaller than the memory capacity. This leads to a worse performance of MonetDB in the last test compared to STILTS. Finally, for the test with the 20-byte strings, the BAT size representation in Figure 3.27 shows that the memory capacity is exceeded in the interval between the files with 205 and 235 million tuples, and it is precisely in that interval that the time grows sharply. Having completed the tests in hot memory, we now proceed to the string tests in cold memory.

Figure 5.8: Performance of MonetDB and STILTS in Point Query operations with single 4-byte string column

Figure 5.8 represents the Point Query test for the strings with the size of 4 bytes, in cold memory, for MonetDB and STILTS. MonetDB is better than STILTS in almost all tests, even when the loading time of the table is added to the query time. Nevertheless, due to the sharp increase of the query time in the last test, already observed in the hot memory test for the 470 million tuples file, added to the loading time of the table, the total time of MonetDB gets worse than STILTS. The next Point Query test involves strings with the size of 8 bytes. Figure 5.9 depicts the Point Query test in cold memory for the 8-byte strings. The performance of STILTS is linear and similar to the previous tests. As for the performance of MonetDB, it is worse compared to Figure 5.8 because both loading and query times increased, leading to a bigger total time and, consequently, a worse result compared to STILTS. The same holds for the tests with the string size of 20 bytes, represented in Figure 5.10.

Figure 5.9: Performance of MonetDB and STILTS in Point Query operations with single 8-byte string column

Figure 5.10: Performance of MonetDB and STILTS in Point Query operations with single 20-byte string column

5.3.2 Range Delegation for Group number 1

For the range delegation, only the types that represent numerical values will be the target of the Range test: the Short, Integer, Long, Float and Double types. For each table, the query collects 5% of all the data. For example, the numbers that represent the short type vary between 0 and 32760; if we apply a range that collects the numbers between 0 and 1638, approximately 5% of all data is gathered. The queries differ according to the type. The SQL query used in MonetDB for the type Short is:

• select count(*) from binary_table where i>=0 and i<=1638;

and in STILTS:

• ./stilts tpipe in=binary_table.fit cmd=’select (i>=0); select (i<=1638);’ omode=count

For the Integer type, the SQL query is:

• select count(*) from binary_table where j>=0 and j<=750000;

and in STILTS:

• ./stilts tpipe in=binary_table.fit cmd=’select (j>=0); select (j<=750000);’ omode=count

For the Long type, the SQL expression in MonetDB is:

• select count(*) from binary_table where k>=0 and k<=107374182;

and in STILTS:

• ./stilts tpipe in=binary_table.fit cmd=’select (k>=0); select (k<=107374182);’ omode=count

For the Float type, the SQL expression in MonetDB is:

• select count(*) from binary_table where e>=0 and e<=0.25;

and in STILTS:

• ./stilts tpipe in=binary_table.fit cmd=’select (e>=0); select (e<=0.25);’ omode=count

The last type to be tested is the Double. The SQL expression in MonetDB is:

• select count(*) from binary_table where d>=0 and d<=0.25; and in STILTS:

• ./stilts tpipe in=binary_table.fit cmd=’select (d>=0); select (d<=0.25);’ omode=count 5.3. DELEGATION EXPERIMENTS 67


Figure 5.11: Performance of MonetDB and STILTS in Range operations with numerical types

Figure 5.11 represents the Range test for the numerical types, in hot memory, for MonetDB and STILTS. Once again, we grouped all types together for the same reason as we did in Figure 5.1. Both tools have a linear behavior during the tests and STILTS takes an average of 128 seconds to perform the last test. As for MonetDB, it reaches one second in the last test.


Figure 5.12: Performance of MonetDB and STILTS in Range operations for short type


Figure 5.13: Performance of MonetDB and STILTS in Range operations for integer type


Figure 5.14: Performance of MonetDB and STILTS in Range operations for long type


Figure 5.15: Performance of MonetDB and STILTS in Range operations for float type


Figure 5.16: Performance of MonetDB and STILTS in Range operations for double type

As for the cold memory tests, each type has its own graph; they are represented in Figure 5.12, Figure 5.13, Figure 5.14, Figure 5.15 and Figure 5.16. For all of them the performance of MonetDB is always better, even when the loading time of the table is added to the total time.

5.3.3 Statistics Delegation for Group number 1

For the statistical tests, the generic SQL Query in MonetDB is:

• select min(J),max(J),avg(J) from binary_table;

As for STILTS:

• ./stilts tpipe binary_table.fit omode=stats;


Figure 5.17: Performance of MonetDB and STILTS in statistical operations in hot memory

Figure 5.17 presents the statistical tests in hot memory, for MonetDB and STILTS. We observe that STILTS has two different behaviors. The first one is when dealing with shorts, integers and floats; it is the fastest one, reaching 26 seconds for the file with 470 million tuples. The second one is when dealing with longs and doubles; it is the slowest one, taking an average of 46.5 seconds in the last test, with 470 million tuples. As for MonetDB, it has only one behavior for all the numerical types: it is linear and it needs 2 seconds to answer the last query.


Figure 5.18: Performance of MonetDB and STILTS in statistical operations for short type


Figure 5.19: Performance of MonetDB and STILTS in statistical operations for integer type


Figure 5.20: Performance of MonetDB and STILTS in statistical operations for long type


Figure 5.21: Performance of MonetDB and STILTS in statistical operations for float type


Figure 5.22: Performance of MonetDB and STILTS in statistical operations for double type

As for the cold memory tests, represented in Figure 5.18, Figure 5.19, Figure 5.20, Figure 5.21 and Figure 5.22, the results of MonetDB get worse, due to the computation required in statistical operations and due to the loading time that needs to be added to the total time to execute the query.

5.3.4 Selection and Filter Delegation for Group number 2

Having finished the tests for the first group of FITS files, we now perform the tests for the second group of FITS files, described in Table 5.1. The first test is the Point Query in hot memory. In MonetDB, it can be represented as:

• select count(*) from binary_table where I=4;

and in STILTS:

• ./stilts tpipe binary_table.fit cmd=’select (I==4)’ omode=count

This test filters data, taking a table as input and returning the number of rows that satisfy the boolean expression. Figure 5.23 shows that MonetDB is faster than STILTS in all tests. The behavior of STILTS is linear, growing on average by a factor of 1.7 per iteration, until it reaches the 7GB file; the time then starts to grow steeply for the files with 8GB and 16GB. This happens because the system is running out of memory. As mentioned above, STILTS performs a row-at-a-time scan.


Figure 5.23: Performance of Monet and STILTS in Point Query operations (hot memory)


Figure 5.24: Performance of Monet and STILTS in Point Query operations (hot memory, zoom on MonetDB)

Figure 5.24 shows a zoom into the MonetDB performance to better understand its behaviour. It is also linear, growing on average by a factor of 2 per iteration, and it does not run into the same memory problem as STILTS. After running the mserver with the algorithms option, we noticed that MonetDB builds a hash table in memory, taking less than one second in all the Point Query operations. For the cold memory tests, represented in Figure 5.25, the time for MonetDB and STILTS to execute the query is worse when compared to the hot memory tests. For STILTS, this holds until it reaches the 8GB file; after that, the time is the same as in the hot memory tests, because the system runs out of memory and has to perform swap operations. The time for MonetDB to execute the query is more or less the same as for STILTS in all the tests, except for the 16GB file, where MonetDB is better. The problem, as mentioned above, arises when the loading time is added to the total time of MonetDB: STILTS then achieves a much better performance.


Figure 5.25: Performance of Monet and STILTS in Point Query operations (cold memory)

5.3.5 Range delegation for Group number 2

In the sequence of tests that follows we explore different ranges of data selected from the tables, more precisely 1%, 5%, 50%, 99% and 100%. When the files are created, a random number between 0 and 32767 is generated for the column named I. Knowing that, it is easy to choose the cut-off value for each range of data to be selected; a small worked example of the calculation is shown below.
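A minimal sketch of the cut-off arithmetic, written as ordinary SQL that MonetDB can evaluate (this is only an illustration of the calculation, not one of the benchmark queries):

-- Cut-off for selecting roughly p percent of a column uniformly distributed
-- over [0, 32767]; with p = 0.05 this evaluates to 1638, the bound used below.
select floor(0.05 * 32767);
-- Likewise, 0.01 * 32767 gives 327 (1%) and 0.99 * 32767 gives 32439 (99%).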

For the first test, 1% of the data is collected. The SQL expression for MonetDB is:

• select count(*) from table where I>=0 and I<=327;

And the expression for STILTS is:

• ./stilts tpipe table.fit cmd=’select (I>=0); select (I<=327)’ omode=count

For the second test, 5% of the data is collected. The SQL expression used in MonetDB is:

• select count(*) from table where I>=0 and I<=1638;

And the expression in STILTS is:

• ./stilts tpipe table.fit cmd=’select (I>=0); select (I<=1638)’ omode=count

The third test collects 50% of the data. The following SQL expression is used for MonetDB:

• select count(*) from table where I>=0 and I<=16380;

And the expression for STILTS is:

• ./stilts tpipe table.fit cmd=’select (I>=0); select (I<=16380)’ omode=count

The fourth test collects 99% of the data. The SQL expression used in MonetDB is:

• select count(*) from table where I>=0 and I<=32439;

And the expression in STILTS is:

• ./stilts tpipe table.fit cmd=’select (I>=0); select (I<=32439)’ omode=count

The fifth test collects 100% of the data. The SQL expression used in MonetDB is:

• select count(*) from table where I>=0 and I<=32767;

And the expression in STILTS is:

• ./stilts tpipe table.fit cmd=’select (I>=0); select (I<=32767)’ omode=count


Figure 5.26: Performance of Monet and STILTS in Range operations (hot memory)

Figure 5.26 represents all the range tests in hot memory for MonetDB and STILTS. The behaviour of STILTS is similar to the Point Query test in hot memory, for all the range tests. This happens because, as said before, STILTS checks one row at a time for all the filtering operations.

As for MonetDB, there are some variations in the behaviour of the system during the different range tests. For the ranges of 1% and 5%, the results differ from the Point Query tests in hot memory. This happens because a range operation requires more computation than a Point Query operation: instead of using the hash table structure, MonetDB uses a sequential scan, leading to a worse result, although still better than STILTS. For the first time MonetDB exceeded one second in a hot memory operation. While tracing the ranges of 1% and 5% of the data, we verified that MonetDB spends most of the time in uselect and leftjoin operations. As for the selections of 50% and 99% of the data, the values on the graph diverge from the observations in the previous tests. The time that MonetDB took was 20.0 seconds for the 1GB file, 62 seconds for the 2GB file and 1267 seconds for the 4GB file. Those are really high values, since MonetDB only needed a few milliseconds to perform the previous tests. We encountered a problem with MonetDB performance. A comparison between the execution traces of the Range 5% selection and the Range 50% selection showed that both have the same plan. However, the Range 50% selection takes much more time executing uselect and leftjoin operations. This happens due to a combination of several factors:

• the count operator is executed on the first column of the table, which is of type string

• the mitosis optimizer splits it into chunks

• operations over the chunks lead to copying the string heaps instead of using the original ones, which makes them slow

For the last test, which collects 100% of the data, an interesting fact emerges. In the previous tests with hot memory, the time of MonetDB and STILTS for the Range operations was more or less the same for the 16GB file. That is not what happens in the test with a range of 100% of the data: the performance of MonetDB is much faster than in the other tests. The reason is that MonetDB returns the number of tuples (count(*)) without touching them. For the next round of tests, the same sequence of experiments will be performed, but this time with cold memory.


Figure 5.27: Performance of Monet and STILTS in Range operations (cold memory)

Figure 5.27 represents all the range tests in cold memory for MonetDB and STILTS. Compared to the Point Query test in cold memory, the performance of STILTS is much the same, and it behaves the same for all the range tests, due to the row-at-a-time check for all the filtering operations. As for MonetDB, we inspect the performance of the system for each one of the range tests. For the ranges of 1% and 5%, the performance of MonetDB is worse compared to the Point Query test in cold memory. First, the time that MonetDB takes to execute the query on the 7GB and 8GB files is actually bigger than for STILTS. Second, although MonetDB again gets faster than STILTS for the 16GB file, that difference is not as pronounced as in the Point Query test in cold memory. Moreover, the loading time of the tables, which is the same for all the tests in cold memory, also needs to be added to the total time of MonetDB to return the result. Summarizing, the performance of STILTS is better than that of MonetDB for the 1% and 5% Range tests in cold memory. For the ranges of 50% and 99%, the same problem that was reported for the tests with hot memory remains for the tests in cold memory, with an even more pronounced curve due to the loading of the data. As for the test that collects 100% of the data, the same effect that occurred in the test with hot memory happens in this test. In the previous tests with cold memory, the time that MonetDB took to perform the tests was more or less the same; that is not what happens in the test with a range of 100% of the data. The reason was already explained for the test with hot memory.

5.3.6 MonetDB problem


Figure 5.28: Range selection problem in MonetDB for the table of 1GB

Figure 5.28 shows the problem of range selection performance for the table of size 1GB: different ranges of data were collected from the 1GB table, together with the respective times of each operation. When the range of data grows from 12% to 13%, the time jumps to 20-22 seconds. This problem can be worked around in two ways. The first is to use an alternative query, which could be:

• select count(I) from feb1g_2 where I>=0 and I<=16380;

When this query is executed, instead of picking the column with the string type, a column with the short type is selected. The result is in fact faster, taking only a few milliseconds, as we can see in Figure 5.28, in the line Monet without touching the string column. Another way to work around this bug is by disabling the mitosis optimizer. Doing that, the performance of the system changes, also taking a few milliseconds to execute the query, as we can see in Figure 5.28, in the line Monet without mitosis optimizer. We can notice that the first alternative, which selects without touching the string column, is more efficient than the second alternative, which disables the mitosis optimizer. This happens because in the first case MonetDB takes advantage of all the optimizers available to execute the query, while in the second case one of them is disabled, leading to a worse performance. A sketch of how the optimizer can be disabled for a session is shown below.
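A minimal sketch of the second workaround, assuming a MonetDB build that ships the no_mitosis_pipe optimizer pipeline (the exact pipeline names may differ between releases):

-- Switch the current SQL session to a pipeline without the mitosis optimizer,
-- rerun the problematic range selection, and restore the default pipeline.
set optimizer = 'no_mitosis_pipe';
select count(*) from feb1g_2 where I>=0 and I<=16380;
set optimizer = 'default_pipe';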

Because the Point Query and Range tests are the ones that have the problem with the string column, we will repeat those tests with a count(I) instead of a count(*), making sure that the string column is not touched. For the Point Query test in hot memory the performance of MonetDB is the same with count(I) and count(*). As explained before, MonetDB builds a hash table in memory, taking less than one second in all the Point Query operations.


Figure 5.29: Performance of MonetDB and STILTS in Point Query operations (cold memory, count(I) vs. count(*))

For the Point Query test with cold memory, performing a count(I) is much more efficient than a count(*), as we can see in Figure 5.29. Analysing the trace of each one of the alternatives, we realized that the two operations that take the most computation time in both queries are an uselect and a leftjoin. The time of the uselect is the same for both count(I) and count(*). The leftjoin operation, however, takes significantly more time in the count(*) query, because it chooses the string column to perform the join, instead of the integer column picked in the count(I) query, which is faster and easier to operate on.


Figure 5.30: Performance of MonetDB and STILTS in Range operations with count(I) (hot memory)

Figure 5.30 shows the performance of MonetDB and STILTS for the range operations with hot memory; however, this time the query is count(I) instead of count(*). Comparing to the performance of MonetDB in Figure 5.26, we can easily conclude that count(I) is much more efficient, once again because the string column is avoided. For the tests that collect 1% and 5% of the data, MonetDB only took 0.5 seconds for the 16GB file, whereas the same test with count(*) took 280 seconds. The same scenario is reflected in the following range tests. For the range of 50%, MonetDB takes 6 seconds for the 16GB file, and these values keep growing as the collected range increases, until the test that collects 99% of the data, where it takes 26 seconds for the 16GB file. For the 100% range test, MonetDB realizes that all the data is being collected and it takes less time, as explained before. Observing the traces of both queries, count(I) and count(*), we realized that the most expensive operations are the uselects and leftjoins. The computation time of the uselect is the same in both queries, because both take only the integer column and perform the select operation on it. The computation times of the leftjoin operations are responsible for the worse time of count(*) because, as explained before, the string column is taken in order to perform the join and it takes more time than the integer column picked in the count(I) query.


Figure 5.31: Performance of MonetDB and STILTS in Range operations with count(I) (cold memory)

Figure 5.31 shows the performance of MonetDB in all the range tests in cold memory. As in Figure 5.30, count(I) was used instead of count(*). We can notice that the difference from the count(*) tests represented in Figure 5.27 is substantial: count(I) takes considerably less time to compute the final result. As an example of that difference, in the count(*) tests the ranges of 1% and 5% on the 16GB file take 288 seconds, while the same queries with count(I) take 12.3 seconds. The same happens for the other range tests, taking more time as the amount of collected data increases. The time grows until the 99% range test, where 47.4 seconds are needed for the 16GB file. After that, for the 100% range test, MonetDB realizes that it is collecting all the data of the table and takes less time to compute the final result. As for Figure 5.30, the traces of the count(*) and count(I) queries were compared and, once again, the leftjoin operations are the ones that take the most time, due to the string column, as explained before.

5.3.7 Projection delegation for Group number 2

For the projection operations with the second group of FITS files, the following SQL expression is used in MonetDB:

• select count(A) from binary_table;

And for STILTS:

• ./stilts tpipe feb1g.fit cmd=’keepcols "A"’ omode=count


Figure 5.32: Performance of Monet and STILTS in Projection operations

Figure 5.32 shows the performance of both tools, with hot and cold memory, in projection operations. Projection is a very efficient operation for the column-store MonetDB and for STILTS.

5.3.8 Statistical Delegation for Group number 2

For the Statistics tests, the following SQL expression is used in MonetDB:

• SELECT min(I), max(I), avg(I), min(J), max(J), avg(J), min(K), max(K), avg(K), min(E), max(E), avg(E), min(D), max(D), avg(D) FROM binary_table;

For STILTS:

• ./stilts tpipe binary_table.fit omode=stats;


Figure 5.33: Performance of Monet and STILTS in Statistics operations (hot memory)


Figure 5.34: Performance of Monet and STILTS in Statistics operations (cold memory)

Figure 5.33 shows the first time that MonetDB is actually slower than STILTS in the hot memory tests; however, this happens only for the 16GB file. The behavior of STILTS is the same as in the previous tests because, as said before, it processes one row at a time. As for MonetDB, a trace was done to check which algorithms were being used. It turns out that MonetDB uses a scan select and performs a lot of intermediate calculations (for example aggr.sum, aggr.count, aggr.max and aggr.min) to produce the min, the max and the mean of each column. Examining the trace of MonetDB, a total of 591 operations are performed for the 16GB file and 449 for the 8GB file; in the previous tests, Point Query and Range operations, the trace took only a few lines. At this point, something could be done to improve the performance of MonetDB in statistics operations:

• add statistical information to the metadata of each column

• scan the column once and save the potential statistics results (min, max, count, sum and average) instead of reading the entire column every time a statistical operation is requested; a sketch of this idea follows
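A minimal sketch of the second idea, expressed in plain SQL on top of the existing schema (the summary table and its column names are invented for this example, and MonetDB's CREATE TABLE ... AS ... WITH DATA syntax is assumed):

-- One full scan computes and stores the statistics of column J.
create table binary_table_stats as
    select min(J) as min_j, max(J) as max_j, avg(J) as avg_j, count(*) as cnt_j
    from binary_table
with data;

-- Later statistics requests read the one-row summary instead of rescanning J.
select min_j, max_j, avg_j from binary_table_stats;

Such a summary would of course have to be refreshed whenever the underlying table changes.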

As expected, STILTS is better than MonetDB in the statistics tests with cold memory, as can be seen in Figure 5.34. This reflects what was explained above for the hot memory tests.

5.3.9 Summary of the tests for the first and second groups

All the tests performed so far show that MonetDB is better than STILTS in the hot memory tests, with the exception of the Range 50% and Range 99% tests, due to the MonetDB problem already explained, and of the statistics tests for the 16GB file. The scenario changes when the tests with cold memory are executed, because MonetDB needs not only the time to execute the query (which is more or less the same as for STILTS) but also the time to load the table into the database system. When this second time is added to the total time of MonetDB, its performance is in fact worse in these tests. After discovering the problem with MonetDB we decided to perform the Point Query and Range tests one more time, this time without touching the string column, using a count(I) instead of a count(*). The performance of MonetDB increased significantly with the count(I) query, being always better than STILTS for all the hot and cold memory tests. We can also notice that STILTS always performs a sequential scan over all the data for these filtering operations, leading to a linear behavior in all the tests, because all the data has to be touched and all the tests run the queries through the same sequence of files. As for MonetDB, either a hash table structure is built or a sequential scan is performed over the columns of the table participating in the query: the first case applies to the Point Query tests and the second to the Range tests and the Statistics tests (which also need some extra computation time besides the sequential scan). The following tests concern the third group of FITS files, where we study the equi- and band joins.

5.3.10 Equi-join delegation for Group number 3

We performed two sets of equi-join experiments to study the effect of file sizes and join fan-out factors on the performance. First, we created a basic set of files, setting a column K to be the primary key. The values of K start at 1 and end at the respective number of rows. A random number could have been generated instead, but in that case we would lose control of the fan-out factor produced in the final result. Those files are used as the left-hand join operand. In order to control the join operations and the expected number of rows affected, a fan-out factor of 1, 3, 5 or 10 was attributed to each one of the files. In the first set of experiments we vary both the fan-out factor and the operand and result sizes. For each file of size S and fan-out factor f we created a right-hand join operand file of size f ∗ S such that each row in the left-hand operand file matches exactly f tuples in the right-hand one. For a fan-out factor of 1 we perform a self-join on the unique column. Note that, along with the fan-out factor, this way of constructing the files increases the size of both the right-hand operand and the result set; a miniature sketch of the construction is given after the join commands below. The test involves two tables, and in MonetDB it can be represented as:

• select count(*)

from mar1mjoin1, mar1mjoin2

where mar1mjoin1.K=mar1mjoin2.K;

Where mar1mjoin1 is the left table with 1MB and mar1mjoin2 is the right table with 3MB. The resulting table is of size 3MB.

And in STILTS:

• ./stilts tmatch2 matcher=exact in1=table1.fit in2=table2.fit values1=K values2=K progress=none find=all omode=count
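A miniature illustration of the fan-out construction, using toy tables with invented names rather than the experiment files: the left operand holds unique keys, the right operand repeats each key f = 3 times, and the equi-join therefore returns exactly f times as many rows as the left operand has.

create table left_op  (k integer);
create table right_op (k integer);
insert into left_op  values (1), (2);
insert into right_op values (1), (1), (1), (2), (2), (2);
-- 2 left rows with fan-out 3: the join returns 2 * 3 = 6 rows.
select count(*) from left_op l, right_op r where l.k = r.k;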

Figure 5.35 and Figure 5.36 present the performance and the memory consumption of the equi-join tests, with different fan-out factors, in each one of the tools.


Figure 5.35: Performance for the different fan-out factors

It can be observed in Figure 5.35 that for the fan-out factor of 1 MonetDB is two orders of magnitude faster than STILTS: even the biggest value of MonetDB, for the 50MB file, is smaller than the smallest value of STILTS, for the 1MB file. As for the memory consumption, in Figure 5.36, we can clearly notice that MonetDB uses less memory than STILTS. In the worst case, STILTS uses 16.0% of the memory while MonetDB uses 0.4%, less than the minimum that STILTS ever uses. Both tools have to access the data and apply a join algorithm; the join algorithm performed in MonetDB was the hash join, which only builds a hash table, consuming much less memory than the algorithm used by STILTS.


Figure 5.36: Percentage of memory consumed

For STILTS, as said before, the complexity of the join operations is O(N log(N)), where N is the total number of rows in all the tables being matched. Algorithms with this complexity are called loglinear, and typical examples are array sorting algorithms, some of which follow:

• heapsort: begins by building a heap out of the data set, and then removing the largest item and placing it at the end of the partially sorted array. After removing the largest item, it reconstructs the heap, removes the largest remaining item, and places it in the next open position from the end of the partially sorted array. This is repeated until there are no items left in the heap and the sorted array is full. Elementary implementations require two arrays - one to hold the heap and the other to hold the sorted elements.

• quicksort: sorts by employing a divide and conquer strategy to divide a list into two sub-lists. The steps are:

- pick an element, called a pivot, from the list.

- reorder the list so that all elements with values less than the pivot come before the pivot, while all elements with values greater than the pivot come after it (equal values can go either way). After this partitioning, the pivot is in its final position. This is called the partition operation.

- recursively sort the sub-list of lesser elements and the sub-list of greater elements.

• mergesort: It is also a divide and conquer algorithm and the steps are:

- divide the unsorted list into two sublists of about half the size

- sort each sublist recursively by re-applying the merge sort

- merge the two sublists back into one sorted list

For the fan-out factor of 3, both tools take more time and consume more memory than in the tests with fan-out factor of 1. This is to be expected because the size of the right table gets 3 times bigger, as does the final result. The join algorithm applied by MonetDB is, once again, the hash join. And, once again, we can notice in Figure 5.36 that the biggest value of MonetDB is smaller than the smallest value of STILTS. For the fan-out factor of 5, the memory consumption gets even bigger, reaching 67% in the worst case for STILTS. In the previous tests, the worst-case memory consumption for STILTS was 16% for the fan-out factor of 1 and 41.7% for the fan-out factor of 3. The last test is with the fan-out factor of 10. It can be noticed that the performance of STILTS starts getting worse when it has to deal with bigger files. The last test of the fan-out 1:10 involves a file with 50MB on the left and a file with 500MB on the right, creating a new table with 500MB. It can be noted that for STILTS, the last test does not follow the same line as for the previous fan-outs of 1, 3 and 5. On the other hand, the behavior of MonetDB is more or less the same in all the tests involving different fan-outs, and its performance is always better than that of STILTS. The fan-out of 10 is the one that demands the most computation from the tools. As can be seen in Figure 5.36, STILTS starts to perform really badly with the 30MB and 50MB files, which are joined with 300MB and 500MB files, producing tables of 300MB and 500MB respectively. In the last test, STILTS consumes 95.6% of the memory, the maximum it can take because the rest of the memory is being used by other processes. In the second set of experiments we fix the size of the right-hand join operand and the result set to be equal to the size of the left-hand operand, and vary only the fan-out factor. For each file of size S and fan-out factor f we created a right-hand join operand file of size S such that every f-th row in the left-hand operand file matches exactly f tuples in the right-hand one.

The results for 1:1 are the same as above, because we are dealing with the same files, so they are skipped. We can notice that there are no substantial differences between the fan-out factors represented in Figure 5.37: both systems have the same behavior for all the fan-outs. This happens because the table on the right side and the result table have the same size as the table on the left. Hence, the processing time is affected more substantially by the sizes of the operands, and not so much by the fan-out variation of files with the same size. As expected, the memory consumption is also similar across the fan-out experiments, represented in Figure 5.38, with STILTS always being the tool that consumes more memory.


Figure 5.37: Performance for the different fan-out factors


Figure 5.38: Percentage of memory consumed

5.3.11 Band-join delegation for Group number 3

We reuse the set of basic files from Section 5.3.10 and perform a self-join on the unique column with a varying error bound, which also determines the fan-out factor. An error of 1 gives a fan-out of 3, an error of 2 gives a fan-out of 5 and an error of 4 gives a fan-out of 9. In these band-join tests, the join is performed between files with the same size, producing a result 3 times bigger in the case of the fan-out factor of 3, 5 times bigger in the case of the fan-out factor of 5 and roughly 9 times bigger in the case of the largest fan-out factor. We will start with the fan-out factors that increase the file size. The first test is the fan-out factor of 3, with an error of 1. The SQL expression in MonetDB is:

• select count(*) from tst as t1, tst as t2 where t1.id between t2.id-1 and t2.id+1;

And in STILTS, the expression is:

• ./stilts tmatch2 matcher=1d in1=mar1mjoin1.fit in2=mar1mjoin1.fit values1=K values2=K params=1 progress=none find=all omode=count

The second test is the fan-out factor of 5, with an error of 2. The SQL expression in MonetDB is:

• select count(*) from tst as t1, tst as t2 where t1.id between t2.id-2 and t2.id+2;

And in STILTS, the expression is:

• ./stilts tmatch2 matcher=1d in1=mar1mjoin1.fit in2=mar1mjoin1.fit values1=K values2=K params=2 progress=none find=all omode=count

The third and last test is the fan-out factor of 9, with an error of 4. The SQL expression in MonetDB is:

• select count(*) from tst as t1, tst as t2 where t1.id between t2.id-4 and t2.id+4;

And in STILTS, the expression is:

• ./stilts tmatch2 matcher=1d in1=mar1mjoin1.fit in2=mar1mjoin1.fit values1=K values2=K params=4 progress=none find=all omode=count


Figure 5.39: Performance for the different fan-out factors


Figure 5.40: Percentage of memory consumed

Figure 5.39 and Figure 5.40 represent the performance and the memory consumption of the band-join tests, with different fan-out factors, for each one of the tools. For the fan-out factors of 3 and 5, Figure 5.39 shows that the band-join is less demanding than the equi-join: all the results in all the experiments (both MonetDB and STILTS) for the fan-out factor of 3 take less time in Figure 5.39 compared to the tests in Figure 5.35. The band-join tests also consume less memory than the equi-join tests. The main reason for this is the size of the files. As explained before, the band-joins always perform a self-join, varying the fan-out with the error factor and producing a bigger result, whereas the equi-joins work with a smaller file on the left-hand side and a bigger file on the right-hand side, and produce a file with the same size as the right-hand file used for the join operation. Apart from that, MonetDB is always better than STILTS, both in the performance tests and in the percentage of memory consumed. What happened for the fan-out factor of 10 in the equi-join tests also occurs in the tests of Figure 5.39: the performance of STILTS starts getting worse when it has to deal with bigger files and, for the last test, the same line that appeared in Figure 5.35 also occurs here. The same holds for the memory consumption, where STILTS consumes 95.6% of the memory.

For the second set of experiments, with files of the same size but different fan-outs, we do not repeat the measurements, because we already concluded that the processing time is affected more substantially by the sizes of the operands.

Chapter 6

Related Work

There are several tools that can access and manipulate FITS files. In this chapter we present them and attempt to answer a set of questions in order to analyse their strengths and weaknesses. The questions are the following: Do they offer functional completeness with respect to the use case? What is the programming model? What are the maturity and the user community? What is the vitality of the product team? What is the positioning with respect to database technology? What is the ability to scale to the requirements posed? Is multiple access to the same FITS file possible? How many FITS files can be simultaneously opened? What is the maximum FITS file size supported? Can it read and output FITS files in UNIX compress format? And, finally, which library do the tools use? We will also investigate the alternatives for storing astronomical data which compete with the FITS format. By building a user matrix that determines a set of tasks, we are able to present the similarities and differences between the alternatives, demonstrating which formats are better for storing astronomical data.

6.1 CFITSIO

CFITSIO [16] is a library of routines, written in the C language, for reading and writing data files in the FITS data format. It was initially developed by the HEASARC (High Energy Astrophysics Science Archive Research Center) with the purpose of converting astronomical data sets into FITS format. With the contributions of the Integral Science Data Center in Switzerland and the XMM/ESTEC project in The Netherlands, a second version of the package was created. That version is available on NASA's software website [3]. It is machine-independent and runs on most commonly used computers and workstations. CFITSIO will probably run on most Unix platforms.

Cray supercomputers are currently not supported. The small example below demonstrates basic CFITSIO routines:

#include <string.h>
#include <stdio.h>
#include "fitsio.h"

int main(int argc, char *argv[])
{
    fitsfile *fptr;               /* FITS file pointer */
    char card[FLEN_CARD];         /* buffer for one header record */
    int status = 0, nkeys, i;     /* CFITSIO status must start at 0 */

    fits_open_file(&fptr, argv[1], READONLY, &status);
    fits_get_hdrspace(fptr, &nkeys, NULL, &status);

    for (i = 1; i <= nkeys; i++) {                  /* header keywords are numbered from 1 */
        fits_read_record(fptr, i, card, &status);   /* read keyword */
        printf("%s\n", card);
    }
    printf("END\n\n");            /* terminate listing with END */
    fits_close_file(fptr, &status);

    if (status)                   /* print any error messages */
        fits_report_error(stderr, status);
    return status;
}

CFITSIO is the result of many years of development, providing 100% complete support for the FITS Standard. Notably, CFITSIO supports simultaneous read and write access to multiple HDUs in the same FITS file. Thus, one can open the same FITS file twice within a single program, move to 2 different HDUs in the file, and then read and write data or keywords to the 2 extensions just as if one were accessing 2 completely separate FITS files. The maximum number of FITS files that may be simultaneously opened by CFITSIO is set by NMAXFILES as defined in fitsio2.h; it is currently set to 300 by default. On some systems it has been found that gcc supports a maximum of 255 opened files.

The current maximum FITS file size supported by CFITSIO is about 6 terabytes (containing 2^31 FITS blocks, each 2880 bytes in size). Currently, support for large files in CFITSIO has been tested on the Linux, Solaris, and IBM AIX operating systems. FITS files can be read and written in shared memory, which can potentially achieve better data I/O performance compared to reading and writing the same FITS files on magnetic disk. Compressed FITS files in gzip or Unix compress format can be directly read, and output FITS files can be written directly in compressed gzip format, thus saving disk space. The I/O speed depends, of course, on the type of computer system that the CFITSIO library is running on. Average workstations can achieve speeds of 2-10 MB/s when reading or writing FITS tables; the speed can go even higher if the FITS file is in main memory (30 MB/s or more). For small transfers of data, CFITSIO maintains a set of internal Input-Output (IO) buffers in main memory, each one containing a FITS block (2880 bytes) of data. When data in a FITS file needs to be accessed, CFITSIO first transfers the FITS block containing the bytes into one of the IO buffers in memory; the next time it needs to access that data it can use the fast IO buffer rather than a slower disk access. The number of available IO buffers is determined by the NIOBUF parameter (in fitsio2.h) and is currently set to 40 by default. Knowing that, there are two aspects that must be considered. First, whenever CFITSIO reads or writes data it first checks whether the block is already loaded into one of the IO buffers. If not, and if there is an empty IO buffer available, it will load that block into the IO buffer (when reading a FITS file) or initialize a new block (when writing to a FITS file). Second, when all the IO buffers are being used, the system must decide which one to reuse (normally the one that has been accessed least recently). For big transfers of data, CFITSIO simply bypasses this IO buffer mechanism and reads or writes the bytes of interest directly to disk through a read or write routine. The minimum threshold for the number of bytes to read or write this way is set by the MINDIRECT parameter and is currently set to 3 FITS blocks = 8640 bytes. Using this strategy, transfer rates of 5-10 MB/s or greater can be achieved. Note that this fast direct IO process is not applicable when accessing columns of data in a FITS table, because the bytes are generally not contiguous since they are interleaved by the other columns of data in the table. This explains why the speed for accessing FITS tables is generally slower than for accessing FITS images. Summarizing, the first strategy is used when dealing with FITS tables and the second strategy is used when dealing with FITS images. Reading or writing small chunks of data via the IO buffers is in fact less efficient because it requires an extra copy operation and bookkeeping steps. Conventionally, it is more efficient to read or write as large an array as possible. Nevertheless, if the array becomes so large that it cannot all be stored in RAM, swapping operations are needed, degrading the performance. As explained before, the IO buffer strategy is used to read and write FITS tables; in order to make it efficient, a single pass through the FITS file is required. An example of a poor program design is to read a large, 4-column table by sequentially reading the entire first column, then the second, the third and finally the fourth column. This design requires 4 passes through the file, which could quadruple the execution time of an IO-limited program. The tactic used by the CFITSIO library consists in reading or writing as many rows of the table as possible into the available IO buffers and then proceeding to the next range of rows. The optimal number of rows to read or write at one time in a given table depends on the width of the table row and on the number of IO buffers that have been allocated in CFITSIO. There is a routine in CFITSIO, fits_get_rowsize, that returns the optimal number of rows for a given table, given a FITS file as input. Using a very small value, however, can also lead to poor performance because of the overhead from the larger number of subroutine calls.

6.1.1 Fv

Fv [18] is a tool that is built on top of the CFITSIO library. It is a graphical program, easy to use, focused on viewing and editing FITS images and tables. After reading the Fv documentation and experimenting with the tool we realized that it is a powerful program, but some functionalities are missing. It provides a good visualization of the data present in the tables and allows the user to click on a particular element of the table and edit it right away. The same editing can be applied to an entire column, by selecting the name of the column to modify and entering the expression that applies the changes to all the elements of the column. An entirely new table column can also be created based on an expression. Insertion, and deletion by selecting the rows or columns to delete, are also possible; the delete feature can also be driven by an expression. Sorting in ascending or descending order is also permitted. The entire table can be copied to a plain ASCII file, which can then be printed or edited with an ordinary text editor. As for the statistical and projection operations, the information is equally well presented and easy to obtain, by choosing which column to project or to compute statistics on (number of values, minimum value, maximum value, mean and standard deviation). Finally, the most recent functionality provided by Fv is the histogram tool, where the information about how the values in the column are distributed can be inspected, together with the bin size to apply. However, this tool is still missing some features, which prevents functional completeness with respect to our use case. It does point query filtering and range selection operations, but it only highlights the rows within the table that fulfill the criteria, instead of creating a complete new table with the rows that match the parameter. If the table is small then we can assimilate the rows efficiently; however, if the tables are too large and the matching rows are only a portion of the table, then the output becomes unreadable and too complex to analyse. Finally, joins cannot be undertaken, because this tool only operates on a single table, performing all the operations that we described before.

6.2 STIL

Similar to CFITSIO, STIL [23], which stands for Starlink Tables Infrastructure Library, is a library that allows the input, manipulation and output of tabular data and metadata. It is written in Java and has been developed for use in astronomy. Besides FITS, it also supports many other tabular formats: VOTable, text-based formats and SQL databases. All of them can be received as input and exported as output. If the user has a table format which is unsupported by STIL, a new input handler and a new output handler can be written; the user just has to follow the instructions in the documentation. It works only with tables and can provide the user with information about:

• table metadata: all the information related to the table name, the location, the number of columns, the number of rows, the description and some additional comments

• column metadata: provide information about each column, such as name, type of the values, units of the values and a small description of the column

• table cell data: multi-dimensional array data of numerical, string or other types

STIL uses two different methods to access the data in its tables: random access and sequential access. Sequential access starts with the first row and reads one row at a time, until the last one. Here is an example of how to sum the values in one of the numeric columns of a table. Since only one value is required from each row, getCell is used:

double sumColumn( StarTable table, int icol ) throws IOException {

    // Check that the column contains values that can be cast to Number.
    ColumnInfo colInfo = table.getColumnInfo( icol );
    Class colClass = colInfo.getContentClass();
    if ( ! Number.class.isAssignableFrom( colClass ) ) {
        throw new IllegalArgumentException( "Column not numeric" );
    }

    // Iterate over rows accumulating the total.
    double sum = 0.0;
    RowSequence rseq = table.getRowSequence();
    while ( rseq.next() ) {
        Number value = (Number) rseq.getCell( icol );
        sum += value.doubleValue();
    }
    rseq.close();
    return sum;
}

With random access it is possible to access the cells of a table in any order. Before the random access methods are called, the user needs to make sure that the table is a random-access table. In more detail, STIL supports the following input formats:

• FITS: As we described in the Background section, FITS tables can be divided into binary and ASCII tables, and both are supported by the library. If only a single extension is required, this is indicated by giving the extension number after a # at the end of the table location. For example, astro.fits#3 refers to the third extension (fourth HDU) in the file astro.fits.

• VOTable: VOTable is an XML-based format for tabular data. It can read tables in which the cell data are included in-line as XML elements (VOTable/TABLEDATA format), or included/referenced as a FITS table (VOTable/FITS), or included/referenced as a raw binary stream (VOTable/BINARY).

• ASCII text file: In many cases tables are stored in some sort of unstructured plain text format, with cells separated by spaces or some other delimiters. There is a wide variety of such formats depending on what delimiters are used, how columns are identified, whether blank values are permitted and so on. It is impossible to cope with them all, but STIL attempts to make a good guess about how to interpret

a given ASCII file as a table, which in many cases is successful. CSV (Comma-Separated Values), TST (Tab-Separated Table), IPAC [1] (Infrared Processing and Analysis Center) and WDC [8] (World Data Center) are some examples of formats that the STIL library supports.

As for the output, STIL supports the following formats:

• FITS: When saving in FITS format a new file is written consisting of N+1 HDUs (Header+Data Units) for N tables: the primary HDU (required by the FITS standard that has no interesting content), and subsequent ones (the extensions) are of type BINTABLE, one for each output table.

• VOTable: When a table is saved to VOTable format, it will write a well-formed VOTable document with a single resource element holding one or more table ele- ments.

• ASCII text file: Tables can be written using a format which is compatible with the ASCII input format. It writes as plainly as possible, so should stand a good chance of being comprehensible to other programs which require some sort of plain text rendition of a table. Some more examples of output formats can be: CSV, TST, Human-readable text, HTML, LaTeX and Mirage (a powerful standalone Java tool for analysis of multidimensional data).

With a proper configuration, STIL can read and write tables from a relational database. The result of an SQL query on a database table can be handled by STIL, and a new table can be stored into an existing database. Note that this does not allow you to work on the database live. The classes that control these operations are contained in the jdbc package. In short, what the user needs to do is define the "jdbc.drivers" system property to include the names of the JDBC drivers which the user wishes to use. For instance, to enable use of MySQL with the Connector/J driver the user might start up java with a command line like this: java -classpath /my/jars/mysql-connector-java-3.0.8-stable-bin.jar:myapp.jar -Djdbc.drivers=com.mysql.jdbc.Driver my.path.MyApplication

STIL supports the following RDBMSs and drivers:

• MySQL: has been tested on Linux with the Connector/J driver; tested versions are server 3.23.55 with driver 3.0.8 and server 4.1.20 with driver 5.0.4

• PostgreSQL: the version 7.4.1 works with its own JDBC driver. Note the performance of this driver appears to be rather poor, at least for writing tables

• Oracle: uses the JDBC driver

• SQL server: There is more than one JDBC driver known to work with SQL Server, including jTDS and the Microsoft JDBC driver. Some evidence suggests that jTDS may be the better choice

• Sybase ASE: There has been a successful use of Sybase 12.5.2 and jConnect (jconn3.jar) using a JDBC driver

To read a result of an SQL query on a relational database as a table, the query string is as follows:

jdbc:&lt;driver-specific-url&gt;#&lt;SQL query&gt;

Here is an example for a MySQL database:

jdbc:mysql://localhost/astro?user=mb#SELECT ra, dec FROM swa WHERE vmag<18

To write a new table in an SQL RDBMS, the general form of the string which specifies the destination of the table being written is:

jdbc:&lt;driver-specific-url&gt;#&lt;new table name&gt;

Here is an example for a MySQL database using Connector/J:

jdbc:mysql://localhost/astro?user=mbt#table1

This query will write a new table called table1 in the MySQL database astro. When a document is accessed for the first time, the data present in the file is parsed and stored in STIL internal structures so that parsing operations can be avoided later. The obvious thing to do is to store such data in object arrays or lists in memory. However, if the tables get very large this is no longer appropriate, because memory will fill up and the application will fail with an OutOfMemoryError. So sometimes it is better to store the data in a temporary disk file. There may be other decisions to make as well, for instance whether the data will be stored per row or per column. Those decisions are based on a set of storage policies, listed below:

• PREFER_MEMORY: Stores table data in memory. Currently implemented using an ArrayList of Object[] arrays

• PREFER_DISK: Generally attempts to store data in a temporary disk file, using row-oriented storage (elements of each row are mostly contiguous on disk)

• ADAPTIVE: Stores table data in memory for relatively small tables, and in a tem- porary disk file for larger ones. Storage is row-oriented

• SIDEWAYS: Stores data in temporary disk files using column-oriented storage (elements of each column are contiguous on disk). This may be more efficient for certain access patterns for tables which are very large and, in particular, very wide. It is generally more expensive on system resources than PREFER_DISK, however (it writes and maps one file per column), so it is only the best choice in rather specialised circumstances.

• DISCARD: Metadata is stored but the rows are thrown away.

The default policy is not specified explicitly, so each time STIL needs to know which policy to use, it calls the method StoragePolicy.getDefaultPolicy(). In the next paragraphs we analyse the strategies used by STIL for table processing. The manual advises skipping the sections related to table processing if one is only interested in reading tables in or writing them out, but in order to have a clear idea of how the library works, we present here the strategies and decisions made by the tool. The first one is called Writable Table and writes all the data present in the table into memory; it is ideal if the data all fit in memory. The second is called Wrap It Up and makes solid use of "pull-model" processing, in which the work of turning one table into another is not done at the time such a transformation is specified, but only when the transformed table data are actually required, for instance when writing to disk as a new FITS file or displaying the information in a Graphical User Interface component. One big advantage of this approach is that calculations which are never used never need to be done. The second advantage is that we can process very large tables without allocating large amounts of memory. The concept here is to build a "wrapper" around the table that decides which portions of the table should be touched and brought to main memory. Working with wrappers can often be more efficient than, for instance, doing a calculation which goes through all the rows of a table computing new values and storing them in a new table. The STIL library provides a set of wrapper classes:

• ColumnPermutedStarTable: views the table with the columns in a different order

• RowPermutedStarTable: views the table with the rows in a different order

• RowSubsetStarTable: views the table with only some of the rows showing

• JoinStarTable: sticks a number of tables together side-by-side

• ConcatStarTable: sticks a number of tables together top-to-bottom

The wrapper classes can also be used and adapted to perform useful table processing. If the user follows a set of steps proposed by the STIL documentation, examples of what can be achieved include: sorting a table, turning a set of arrays into a new table (useful for gathering information spread across different tables), and adding a new column that contains the sum of all the numeric cells in each row.

6.2.1 TOPCAT

TOPCAT, which stands for Tool for Operations on Catalogues and Tables [25], is an interactive graphical viewer and editor for tabular data, based on the STIL library. In its main window it shows a list of the tables that have been brought into the program. Those tables can be accessed (by visualising the data they contain) and edited, and rows can be added or deleted. New tables can be created based on existing ones. When a table in the list is selected, general information about the table is displayed and some action can be taken. The changes that are made do not directly modify the tables on disk; however, if the user wants to save the changes, the modified table can be written to a new location on disk. Since TOPCAT is built on top of the STIL Java library, the user has full access to the table cell data, the table metadata and the column metadata. As for the table input and output formats, it supports the same file formats as the STIL library. Some additional functionalities are also provided by this tool:

• Row Subset: reduces the number of displayed rows of the table. It can be done through three distinct methods: define a new row subset containing all selected rows, define a new row subset containing all the unselected rows, or define an algebraic expression. All these subsets can be viewed as independent tables, which is an advantage compared with the Fv tool, which does not allow this.

• Row order: sorts the table rows according to ascending or descending value of the contents of a column. Only available if some kind of order (e.g. numeric or alphabetic) can sensibly be applied to the column.

• Column Set: during the lifetime of the table within TOPCAT, the list of columns can be changed by adding new columns, hiding existing columns, and changing their order. The current state of which columns are present and visible, and in what order they are, is collectively known as the Column Set, and affects the display of the table. The current Column Set is always reflected in the order in which columns are displayed.

• Statistics: displays statistics of each column: mean, standard deviation, minimum value, maximum value and the number of non-blank cells

• Histogram: information about the distribution of the data, allowing the possibility to choose the bin size

• Display of data: the data can be displayed in a two-dimensional plot, a three-dimensional plot, or even a spherical polar plot. The last one gives a pleasant and friendly idea of how the sources are spread over the surface of the Earth.

In contrast with Fv, this tool allows the user to join two or more tables together in order to produce a new one. The join can be done top-to-bottom or side-by-side. A top-to-bottom join, also known as concatenation, just requires the user to decide which columns in one table correspond to which columns in the other (in database terms it is called a union). A side-by-side join needs some sort of matching between rows in different tables. It can be a Pair match, a Triple match or a Quadruple match, which use two, three and four tables respectively and return a new table with the tuples that match across the two, three or four tables. The result depends on the algorithm chosen to implement the match criteria. The available algorithms were already presented in the Performance Experiments section. If the chosen algorithm is Exact Value, then it is similar to the equi-join used in database terms; on the other hand, if the algorithm is 1-d Cartesian, then it is similar to a band join.

Conceptually, the match is done by looking at each row in the first table, identifying which rows in the second table "refer to the same thing", and putting a new row in the joined table which consists of all the fields of the first table, followed by all the fields of the matched row in the second table. The complication here is to define what is meant by "refer to the same thing", which may not be straightforward. There is also the problem of actually identifying these matches in a relatively efficient way (without explicitly comparing each row in one table with each row in the other, which would be far too slow for large tables). Subsequently we present an example given by the TOPCAT documentation that is interesting for a comparison with our use case. Suppose we have the following catalogues:

Xpos      Ypos      Vmag
--------  --------  ----
1134.822   599.247  13.8
 659.68   1046.874  17.2
 909.613   543.293   9.3

and

x         y         Bmag
--------  --------  ----
 909.523   543.800  10.1
1832.114   409.567  12.3
1135.201   600.100  14.6
 702.622  1004.972  19.0

and we wish to combine them to create one new catalogue with a row for each object that appears in both tables. To do this, you have to specify what counts as a match. In this case let us say that a row in one table matches (refers to the same object as) a row in the other if the distance between the positions indicated by their X and Y coordinates is within one unit, i.e. sqrt((Xpos - x)^2 + (Ypos - y)^2) <= 1. Then the catalogue we will end up with is:

Xpos      Ypos     Vmag  x         y        Bmag
--------  -------  ----  --------  -------  ----
1134.822  599.247  13.8  1135.201  600.100  14.6
 909.613  543.293   9.3   909.523  543.800  10.1
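In MonetDB the same positional match can be expressed as an ordinary SQL join. The sketch below assumes that the two catalogues have already been loaded into hypothetical tables vcat(xpos, ypos, vmag) and bcat(x, y, bmag):

SELECT v.xpos, v.ypos, v.vmag, b.x, b.y, b.bmag
FROM   vcat AS v, bcat AS b
WHERE  SQRT((v.xpos - b.x) * (v.xpos - b.x) +
            (v.ypos - b.y) * (v.ypos - b.y)) <= 1;

This is essentially the Cartesian-join-plus-filter approach expressed declaratively, leaving it to the database to decide how to evaluate the predicate.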

There are a number of variations on this kind of match. However, the match criteria of TOPCAT might involve sky coordinates instead of Cartesian ones (or not be physical coordinates at all). The match window of TOPCAT allows the user to specify which tables will be matched, what the criterion for matching rows is and which rows will be included in the output table. Here comes one of the main limitations of TOPCAT: it can only compare and match sources by their celestial coordinates (right ascension and declination). As a consequence, the use case studied in the Functionality section cannot be directly undertaken by this tool. However, there are two alternatives that can be followed in order to accomplish that goal. The first is to convert the spatial coordinates (x,y,z) into celestial coordinates (ra,dec) and then invoke the sky match join. The second alternative is to join the two tables with a Cartesian join. This gives all the possible combinations between the two tables and also access to all the columns present in the joined table. With this ability we can apply a filter to the joined table, calculating the distance between the two sources and filtering out the rows that do not fulfill the condition. This alternative can be heavy on the system and may require a large amount of memory.

The basic algorithm for matching is based on dividing up the space of possibly-matching rows into an (indeterminate) number of bins. These bins will typically correspond to disjoint cells of a physical or notional coordinate space, but need not do so. In the first step, each row of each table is assessed to determine which bins might contain matches to it; this will generally be the bin that it falls into and any "adjacent" bins within a distance corresponding to the matching tolerance. A reference to the row is associated with each such bin. In the second step, each bin is examined, and if two or more rows are associated with it, every possible pair of rows in the associated set is assessed to see whether it does in fact constitute a matched pair. This identifies all and only those row pairs which are related according to the selected match criteria. During this process a number of optimisations may be applied, depending on the details of the data and the requested match. This means that the matching algorithm is essentially an O(N log(N)) process, where N is the total number of rows in all the tables participating in a match. It has loglinear complexity, as is the case for the quicksort and merge sort algorithms. This is good news, since the naive approach would be O(N^2) (quadratic complexity). This can break down, however, if the matching tolerance is such that the number of rows associated with some or most bins gets large, in which case an O(M^2) component can come to dominate, where M is the number of rows per bin. The average number of rows per bin is reported in the logging while a match is proceeding, so the user can keep an eye on this.
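To make the bin-based strategy more concrete, the following self-contained sketch applies the same two-step idea to two-dimensional Cartesian positions. It is an illustration only, not TOPCAT's actual implementation; the tables are represented by plain coordinate arrays and all names are invented:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BinMatchSketch {

    // Grid cell of a position; the cell side equals the matching tolerance, so
    // two points closer than tol always lie in the same or in adjacent cells.
    private static String cell(double x, double y, double tol) {
        return Math.floor(x / tol) + "_" + Math.floor(y / tol);
    }

    // Returns the index pairs (i, j) for which distance(a[i], b[j]) <= tol.
    public static List<int[]> match(double[][] a, double[][] b, double tol) {
        // Step 1: register every row of table A under its own cell and the
        // eight neighbouring cells.
        Map<String, List<Integer>> bins = new HashMap<String, List<Integer>>();
        for (int i = 0; i < a.length; i++) {
            for (int dx = -1; dx <= 1; dx++) {
                for (int dy = -1; dy <= 1; dy++) {
                    String key = cell(a[i][0] + dx * tol, a[i][1] + dy * tol, tol);
                    List<Integer> rows = bins.get(key);
                    if (rows == null) {
                        rows = new ArrayList<Integer>();
                        bins.put(key, rows);
                    }
                    rows.add(i);
                }
            }
        }
        // Step 2: each row of table B is compared only against the A rows
        // registered in its own bin; the distance test makes the final decision.
        List<int[]> pairs = new ArrayList<int[]>();
        for (int j = 0; j < b.length; j++) {
            List<Integer> candidates = bins.get(cell(b[j][0], b[j][1], tol));
            if (candidates == null) {
                continue;
            }
            for (int i : candidates) {
                double dx = a[i][0] - b[j][0];
                double dy = a[i][1] - b[j][1];
                if (Math.sqrt(dx * dx + dy * dy) <= tol) {
                    pairs.add(new int[] { i, j });
                }
            }
        }
        return pairs;
    }
}

The bins only restrict which candidate pairs are examined; the explicit distance test in the second step decides which pairs are actually reported, so the result is the same as that of the naive quadratic comparison.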
TOPCAT is capable of dealing with large datasets. In fact, it does not read entire files into memory in order to do its work, so it is not mandatory to use files which fit into the Java virtual machine heap memory or into the physical memory of the machine. The program works with tables containing millions of rows at a reasonable speed. However, the way the user invokes the program affects how well it copes with large tables. The user may still get an OutOfMemoryError, and there are several things that can be done about it.

Increasing the Java heap memory is one alternative. When a Java program runs, it has a fixed maximum amount of memory that it will use, typically 64 MB by default. The -Xmx flag can be used in this case; for example, topcat -Xmx256M allows the memory to grow up to 256 megabytes. The user should not specify a heap size larger than the physical memory of the machine that is running TOPCAT. The second alternative is the use of FITS files, as a consequence of the way they are organised and compressed. As well as speeding things up, using FITS files also reduces the need to use the -disk or -Xmx flags. The third alternative is the use of the -disk flag. The way TOPCAT stores table data is configurable. The default storage policy is adaptive, which means that the data for relatively small tables is stored in memory and for larger ones in temporary disk files. This usually works fairly well, but the user can save some memory by encouraging TOPCAT to store all table data on disk, by specifying the -disk flag on the command line. The fourth alternative is to run in 64-bit mode. If the user is working with a file or files whose total size approaches or exceeds about 2 GB, a 64-bit version of Java should be used. This requires a 64-bit operating system and a 64-bit version of the Java Virtual Machine. The fifth and last alternative is the use of column-oriented storage. For really large tables, storing them in the colfits output format can significantly improve performance. This format stores all the elements of a single column contiguously on disk, which means that scanning through all the values in one or a few columns can proceed with much less unnecessary I/O than in the normal (row-oriented) FITS format. It makes the most difference when the table is larger than the amount of physical memory available and the table has many columns. Be aware, however, that operations which require all the cells in all the rows (for instance, calculating row statistics) may be somewhat slower using this format.

To conclude, the memory size of the machine makes the difference. If the dataset fits into unused physical memory then everything will run very quickly, because the operating system can cache the data in memory. On the other hand, if the dataset is larger than the memory, the data has to keep being re-read from disk and most operations will be much slower.
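Putting some of these pieces together, a large catalogue might, for example, be opened with an enlarged heap and disk-based storage on a single command line (the file name is hypothetical):

topcat -Xmx1024M -disk survey_catalogue.fits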

6.2.2 STILTS

The STILTS tool [24] was already introduced in the Performance Experiments section, where some stress tests were done in order to compare its performance against MonetDB. It stands for Starlink Tables Infrastructure Library Tool Set and it is a set of command-line tools for table manipulation. Like TOPCAT, it is built on top of the STIL Java library, so the user has full access to the table cell data, the table metadata and the column metadata. As for the table input and output formats, it supports the same file formats as the STIL library.

Most of the functionalities of STILTS were already explored during the tests, so in this section we rather go into a comparison with TOPCAT and also explore the limitations of this tool. It is better to use STILTS instead of TOPCAT when the user only wants to examine the metadata, a few rows, or a statistical summary of the table, without having to load the whole thing into TOPCAT or some other table viewer application. Another advantage of STILTS is format conversion: if the user wants to convert a table from one file format to another, it is better to do it with a streaming application that executes the job easily and efficiently on the fly, and STILTS provides that. This is done by explicitly describing on the command line which table is the input and in which format it is, and which table is the output and in which format it is; the formats can be any of the available formats described in the STIL library. One case where TOPCAT is better than STILTS is plotting, because it was built with that purpose: it gives the user a friendly interface so the data can be easily viewed in a variety of ways, while STILTS only provides 2D and 3D plots and histograms. However, STILTS allows plots to be made from datasets of unlimited size. While TOPCAT has an effective limit of a few million rows, STILTS can stream data from tables to do its plotting, so a plot can be made representing an unlimited number of rows without large memory requirements. The same limitation studied for the TOPCAT tool, when trying to undertake the use case studied in the Functionality test, also applies to STILTS, for the same reasons and with the same alternatives. STILTS also allows the user to change the heap memory size (64 MB by default), provide the -disk flag (the storage policy is adaptive by default) and run in 64-bit mode.
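As a concrete illustration of the streaming conversion described above, combined with the -disk flag, a FITS table could be converted to a VOTable with the tcopy task (the file names are hypothetical):

stilts -disk tcopy in=catalogue.fits ifmt=fits out=catalogue.vot ofmt=votable

The ifmt and ofmt parameters can name any of the input and output formats supported by the STIL library.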

6.3 Comparison between tools

Tool     Selection  Projection  Join  Union  Difference  Sort  Group by
Fv       X          X           -     -      -           X     -
TOPCAT   X          X           X     X      -           X     -
STILTS   X          X           X     -      -           X     -

Table 6.1: List of operations performed by the tools

In Table 6.1 we can see what the three different tools that we studied before can and cannot accomplish. All of them were analysed in detail in the previous sections; this table only tries to give a general overview of the maturity of the tools. MonetDB, as a powerful and robust database, can perform all of these operations, which reinforces our belief that this project brings innovation and can credibly be seen as a future alternative for accessing and manipulating astronomical data.

6.4 Astronomical data formats

In this section we give a brief introduction to the other existing formats that deal with astronomical data. We also build a matrix listing a set of tasks that can or cannot be undertaken by the different formats.

6.4.1 HDF5 Array Database

HDF5 [9, 14] can be seen as a data model, a library or a file format, and its main goal is to store and manage scientific data. HDF5 can store two primary kinds of objects:

• Datasets: multidimensional arrays of data elements that can store almost any kind of scientific data structure, such as images, arrays of vectors, and structured and unstructured grids. The array variables and the respective elements of the multidimensional array can be stored and organized in two different ways:

Contiguous: as a single sequence in the HDF5 array database

Chunked: as a collection of fixed-size regular sub-arrays

• Groups: allow the data to be organized in a tree-like structure, where the root is the group "/", which serves as an entry or reference point

An application that uses the HDF5 format for analysing, managing, manipulating and viewing data typically wants flexibility and efficiency in its I/O and the ability to store high volumes of data, supporting an unlimited variety of datatypes. An application, a tool or a high-level API interacts with an HDF5 array database through the HDF5 library API, which allows access to and management of the items within the HDF5 array database, called HDF5 array variables. Some applications of the HDF5 file format are:

• NeXus: data format for neutron, x-ray and muon science

• HDF: data format used by NASA's Earth Observing System (EOS), which gathers environmental data for future research on global climate change. The EOS program will contain more than 15 petabytes of data in 2015

• JPSS: stands for Joint Polar Satellite System. It is the successor of the EOS program and will be used for climate and weather prediction, space weather observations, and search and rescue detection

• LOFAR: stands for Low Frequency Array. It is a multi-purpose sensor array and its main application is astronomy at low frequencies (10-250 MHz)

HDF5 is the recommended standard format for storing earth science data. After a talk with K. Anderson, an astronomer who works at Science Park, Amsterdam, and one of the authors of the document LOFAR: Data Format Representation [10], some of the differences between HDF5 and FITS were clarified. Firstly, FITS has been used for decades and is the most widely used standard for astronomical data. HDF5 is a newer format, still being developed for some applications (for example, LOFAR); for other applications, outside the astronomical field, a standard exists and HDF5 is already being used. Secondly, FITS is a strong and robust standard with a vast number of libraries that can access and manipulate the data, whereas the libraries for HDF5 are still being implemented. As a third and last point, FITS has a scalability limitation, because all the data has to be aggregated in one file. That is unacceptable when the size of the images or tables reaches hundreds of terabytes. HDF5 has the concept of groups, enabling the data to be spread across different nodes instead of being stored in one single file.

6.4.2 VOTable

The second format that we study is the VOTable [20]. VO stands for Virtual Observatory, which is a collection of archives containing astronomical data and software tools that work together to form a scientific research environment, allowing astronomical research programs to be developed. The VOTable format is an XML standard for the interchange of data represented as a set of tables. A table can be seen as a set of rows, each with a uniform structure that is specified in the table description (the metadata). Each row in a table is a sequence of table cells, and each of those cells contains either a primitive data type or an array of such primitives. The main goal of this format is to provide flexible storage and interoperability of astronomical data, encouraged by the vast number of applications using XML.

The data in a VOTable can be represented using one of three different formats: TABLEDATA, FITS and BINARY. The most common is TABLEDATA. TABLEDATA is a pure XML format and has the advantage that XML tools can manipulate and present the table data directly; the metadata and the elements of the table are all present in the XML document. The VOTable/FITS format makes a VOTable compatible with the FITS binary table format. Given a FITS file that represents a binary table, the header of the FITS file, which contains the metadata, is converted into a VOTable: each of the FITS keywords is converted to a PARAM, and the data itself is remotely stored and gzipped at an FTP site, with access done by streaming. The VOTable specification does not define the behaviour when the parser has to read the metadata twice; the parser can either ignore the FITS metadata or compare it with the VOTable metadata for consistency:

[Excerpt of a VOTable using the FITS serialization; the recoverable fragments include the text "Original Epoch of the coordinates" and a DESCRIPTION reading "The very first Virtual Telescope observation made in 2002".]


The third and last format is BINARY. It is a sequence of bytes whose length is specified by the FIELD elements in the metadata. The encoding attribute is a string that indicates to the parser how to undo the encoding that has been applied:

AAAAA00zMUAlAQYk3S8bQESiDEm6XjUAAAADTTU3QHG0DEm6XjVAQIQ5WBBiTgAA AANNODJAYpgYk3S8akBRa7ZFocrB

All VOTable files are written in XML, which makes them readable by any text editor. However, the data itself can only be visualized and read by humans if the format is TABLEDATA. For the other two formats (FITS and BINARY) an application needs to be called in order to read the data and present it to the user. The disadvantage of the TABLEDATA format is that it is built to handle small tables and it is not very efficient.
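To give an impression of the TABLEDATA serialization, a minimal hand-written example might look as follows; the field names and values are invented for illustration:

<VOTABLE version="1.2">
  <RESOURCE>
    <TABLE name="sources">
      <FIELD name="ra"   datatype="double" unit="deg"/>
      <FIELD name="dec"  datatype="double" unit="deg"/>
      <FIELD name="vmag" datatype="float"/>
      <DATA>
        <TABLEDATA>
          <TR><TD>228.3175</TD><TD>-1.1358</TD><TD>13.8</TD></TR>
          <TR><TD>228.4230</TD><TD>-1.0021</TD><TD>17.2</TD></TR>
        </TABLEDATA>
      </DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>

The FIELD elements carry the column metadata, while each TR element holds one row of the table, which is what makes this serialization readable in a plain text editor but comparatively verbose for large tables.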

6.4.3 Comparison between file formats

Task                                              FITS           VOTable  HDF5
Metadata and data stored separately               -              X        -
Easy to stream                                    -              X        -
Specification of the number of rows in the table  X              -        -
Uses XML                                          -              X        -
See data in a text editor                         yes for ASCII  X        -
Array can be stored as a table cell               X              X        X
Binary                                            X              X        -
Astronomical data                                 X              X        X
Quality check                                     -              X        -
Easy to detect a broken table                     X              -        -

Table 6.2: Tasks performed for each one of the file formats

Storing the data and the metadata separately is clearly more efficient, because if the metadata can be read first, applications are able to "get ready" for the input data and to organise some sort of parallel transfer of the data. In a Grid scenario, working with large tables and streaming data between processors, with flows being filtered, joined, selected, etc., would be very difficult if the number of rows of the table were required in the header (as in FITS files). In such cases the whole table has to be streamed to a cache, the number of rows has to be computed, and the table streamed again for further computation. This obstacle makes streaming FITS files a difficult task: it is a blocking operation for pipelined execution. In contrast, for other operations it may be preferable to know the size in advance; for instance, when loading data the application can allocate memory more efficiently if the size is known.

FITS files and HDF5 files do not use XML, and an application needs to be called in order to read the data and present it to the user. Multidimensional arrays are one of the most complex structures that can be stored in a table cell, and they are supported by all three file formats. As for the binary format, it is not supported by HDF5. The binary format is the most efficient way to process, store, access and transmit large amounts of data, and additional libraries are not required; in the BINARY format of VOTables, no FITS library is required and the streaming paradigm is supported.

FITS and VOTable were built to deal with astronomical data: VOTable more precisely with astronomical tables, and FITS with astronomical tables and also astronomical images. The complex semantics and the large number of conventions that FITS has make it able to cope with the increasing complexity of astronomical instrumentation. The same cannot be said of VOTable.

Chapter 7

Conclusion

In this thesis we have explored the possibilities of an integration between astronomical data and databases. Astronomical data has been present in our lives for many years; it is used by a whole community of scientists, in different formats and with distinct purposes. However, a complete integration with the database world is not easy, because of the obvious complexity that it brings. It also requires deep research, trying to understand how the data present in astronomical files can be described and brought into the relational database world. We focused on the study of that integration, identifying its advantages and disadvantages, and also the consequences of some of the decisions made.

7.1 Results and Overview

We started this project by exploring the structure of the FITS vaults in depth, investigating which functionalities a database module should have to make the integration possible. We wanted to know what it takes to efficiently integrate a FITS file into MonetDB. In order to achieve a successful integration it is important to answer the following questions: firstly, what is a FITS vault? Secondly, what metadata do the files inside the same vault have in common? And finally, how can the data model be efficiently expressed in a relational database system? All these questions were answered, and an understanding of the FITS vault concept, together with how it can be represented in the SQL catalog, was acquired.

After a successful integration of FITS files with MonetDB, through the development of a module inside the MonetDB code that provides a set of functionalities concerning that integration, we started to think further. We performed an in-depth comparison between the performance of a well-known tool called STILTS and the performance of MonetDB.


The idea was to demonstrate that a delegation of work is possible, and when and why MonetDB is faster or slower than STILTS. The delegation of work was proved, showing how the same result can be obtained through STILTS and also through MonetDB. Point Query, Range operations, Projection, Statistics and Join operations are some examples of that delegation of work.

The results of the experiments were helpful, enabling us to discover some bugs in MonetDB, to give some architectural suggestions and to realize that STILTS and MonetDB compete side by side in the simple query tests when the memory is empty and the data needs to be loaded (a situation that does not occur when the data is already loaded in main memory, where MonetDB is much faster than STILTS). As for the complex queries (Join operations), the performance of MonetDB is outstanding compared to the performance of STILTS: while MonetDB only needs a few milliseconds, STILTS takes thousands of seconds. These tests were proof that databases are in fact more powerful in answering complex queries that need more computation time and wise decisions about which algorithms should be executed.

As a final task, we took a set of tasks proposed by an astronomer working at CWI. Our aim was to demonstrate that these proposals could be undertaken using MonetDB. We accomplished this by building a tutorial with all the steps that an astronomer should follow in order to get his results, and we made suggestions allowing for faster computation. As an example, we suggested that an auxiliary table should be created, containing the distances and brightnesses of all the objects. This allows further filtering of results, making the process faster and avoiding the need to recalculate them every time the query is posed.

7.2 Future Work

Integrating a vault into a database system is a complex topic where many problems must be addressed. We have studied that integration, but there is still room for further work and research.

Encapsulating functionality of external libraries During the tests we realised that STILTS is in fact a powerful tool for manipulating and accessing data present in FITS files. This fact is shown by the Statistical tests for the second group of FITS files, where STILTS is actually faster than MonetDB in both the hot and cold memory tests for the last file, with 470 million tuples. We made some suggestions in order to improve the performance of MonetDB. However, there is another viable alternative: encapsulating functionality of external libraries, like the one used by STILTS in the Statistical tests. This requires a deep understanding of the STIL Java library, which is the basis of STILTS.

Load on demand It would be interesting if the user could pose a query to the system and the system itself had the ability to decide automatically which tables should be loaded in order to answer it. This functionality is helpful in cases where we have thousands of attached files and we do not know which tables should be loaded in order to answer our question. However, it is a feature that requires a deep study. Concepts like logic and artificial intelligence need to be brought into the discussion, in order to let the system decide by itself which tables should be loaded. If the user is looking for information about the planet Earth, the system and the database must be aware of the concepts of planets, the solar system and so on.

Synchronization when the repository is updated Sometimes we add, delete and rename the files in our directory. It would be appealing if, whenever a change is applied to the directory, the system recognised it and performed that change also inside the SQL catalog. For this task, a repository with all the information about the files needs to be added to the system. It would store the current name of the file, the directory and also a flag that is activated every time a change occurs in a file; that change would then be reflected in the SQL catalog.

Creation of a new FITS file based on the result of a SQL query Instead of creating a new table and inserting into it the results of a particular query, the idea is to export the result of the query to a completely new FITS file, which will contain the primary HDU and an extension. The extension will be the table that resulted from the query. For this exercise, we need to bring the results of the query into the code, in the form of BAT structures. Afterwards, these BATs can be accessed and the data taken out, in order to create a new FITS file.

Bibliography

[1] Infrared Processing and Analysis Center. http://www.ipac.caltech.edu/. [Online, accessed 2010-10-07].

[2] MonetDB. http://monetDB.cwi.nl/. [Online, accessed 2010-10-07].

[3] NASA’s HEASARC: Software. http://heasarc.gsfc.nasa.gov/fitsio/. [Online, accessed 2010-10-07].

[4] SDSS. http://www.sdss.org/. [Online, accessed 2010-10-07].

[5] SkyServer SDSS. http://cas.sdss.org/astro/en/tools/search/IQS.asp. [Online, accessed 2011-07-05].

[6] SpaceGuarduk. http://www.spaceguarduk.com/download-fits. [Online, accessed 2011-07-05].

[7] UKIDSS. http://www.ukidss.org/surveys/surveys.html. [Online, accessed 2010-10-07].

[8] World Data Center System. http://www.ngdc.noaa.gov/wdc/wdcmain.html. [Online, accessed 2010-10-07].

[9] HDF5 User’s Guide, 1.8.7 edition, May 2011.

[10] L. Bahren, K. Alexov, A. Anderson, and J. Griemeier. LOFAR Data Format ICD: Representations of World Coordinates, 2.05.05 edition, May 2011.

[11] P. Boncz, S. Manegold, and M. Kersten. Database Architecture Optimized for the New Bottleneck: Memory Access. In Proc. of the 25th VLDB Conference, Edinburgh, Scotland, 1999.

[12] CSIRO. Resolution and Sensitivity. http://outreach.atnf.csiro.au/education/senior/astrophysics/resolution_sensitivity.html. [Online, accessed 2010-10-07].


[13] FITS Working Group. Definition of the Flexible Image Transport System (FITS), 3.0 edition, November 2010.

[14] M. Folk, G. Heber, Q. Koziol, E. Pourmal, and D. Robinson. An Overview of the HDF5 Technology Suite and its Applications. Uppsala, Sweden, March 2011.

[15] J. M. Gonzalez. Free Software / Open Source: Information Society Opportunities for Europe?, 1.2 edition, April 2000.

[16] HEASARC. CFITSIO User’s Reference Guide, 3.2 edition, December 2010.

[17] T. Hey, S. Tansley, and K. Tolle. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, 2009.

[18] B. Irby. Fv: The Interactive FITS File Editor, 5.3 edition, July 2009.

[19] M. Ivanova, N. Nes, R. Goncalves, and M. Kersten. MonetDB/SQL Meets SkyServer: the Challenges of a Scientific Database. In Proc. of the 19th International Conference on Scientific and Statistical Database Management (SSDBM), 2007.

[20] F. Ochsenbein, R. Williams, C. Davenhall, D. Durand, P. Fernique, D. Giaretta, R. Hanisch, T. McGlynn, A. Szalay, M. Taylor, and A. Wicenec. VOTable Format Definition, 1.2 edition, November 2009.

[21] B. Pribyl, S. Feuerstein, and C. Dawes. Oracle PL/SQL Language Pocket Reference. O’Reilly Media, 1999.

[22] B. Scheers. Transient and Variable Radio Sources in the LOFAR sky. PhD thesis, Universiteit van Amsterdam, 2011.

[23] M. Taylor. STIL - Starlink Tables Infrastructure Library, 3.0-2 edition, June 2011.

[24] M. Taylor. STILTS - Starlink Tables Infrastructure Library Tool Set, 2.3-1 edition, June 2011.

[25] M. Taylor. TOPCAT - Tool for OPerations on Catalogues and Tables, May 2011.

[26] R. L. White. FIRST. http://sundog.stsci.edu/. [Online, accessed 2010-10-07].