Building a Scalable Global Data Processing Pipeline for Large Astronomical Photometric Datasets
by Paul F. Doyle MSc. BSc.

Supervised by: Dr. Fredrick Mtenzi, Dr. Niall Smith, Professor Brendan O'Shea
School of Computing, Dublin Institute of Technology

arXiv:1502.02821v1 [astro-ph.IM] 10 Feb 2015

A thesis submitted for the degree of Doctor of Philosophy
January 2015

Declaration

I certify that this thesis, which I now submit for examination for the award of Doctor of Philosophy, is entirely my own work and has not been taken from the work of others, save and to the extent that such work has been cited and acknowledged within the text of my work. This thesis was prepared according to the regulations for postgraduate study by research of the Dublin Institute of Technology and has not been submitted in whole or in part for an award in any other third-level institution. The work reported on in this thesis conforms to the principles and requirements of the DIT's guidelines for ethics in research. DIT has permission to keep, lend or copy this thesis in whole or in part, on condition that any such use of the material of the thesis be duly acknowledged.

Signature                              Date

Acknowledgements

The completion of this thesis has been possible through the support of my family, friends and colleagues who have helped and encouraged me over the last four years. I would like to acknowledge the support and guidance of my supervisors, who remained ever confident that this journey would be enlightening and fulfilling; my children Connor, Oisín and Cillian, who shared their hugs of encouragement; and my wife Orla, who gave me understanding, space and time to finish, which was no small sacrifice. To my colleagues and friends who generously gave their time and opinions, I can assure you that this thesis was very much the better for your contribution. Special thanks to my friends and colleagues at HEAnet, the Blackrock Castle Observatory, the Cork Institute of Technology, and the Institute of Technology Tallaght, who provided services and support which were fundamental to the research performed within this thesis. This research was also supported in part by Amazon Web Services Ireland, who provided extensive online virtual infrastructure resources. Finally, I would like to express my deepest thanks to my parents, who first set me on the path to a higher education. I will endeavour to pass the same values on to my own children.

Abstract

Astronomical photometry is the science of measuring the flux of a celestial object. Since its introduction in the 1970s, the CCD has been the principal method of measuring flux to calculate the apparent magnitude of an object. Each CCD image taken must go through a process of cleaning and calibration prior to its use. As the number of research telescopes increases, the overall computing resources required for image processing also increase. As data archives grow to Petabytes in size, image processing approaches must evolve so that processing rates continue to exceed the growing data capture rate. Existing processing techniques are primarily sequential in nature, requiring increasingly powerful servers, faster disks and faster networks to process data. Existing High Performance Computing solutions involving high-capacity data centres are both complex in design and expensive to maintain, while providing resources primarily to high-profile science projects.
This research describes three distributed pipeline architectures: a virtualised cloud-based IRAF; the Astronomical Compute Node (ACN), a private cloud-based pipeline; and NIMBUS, a globally distributed system. The ACN pipeline processed data at a rate of 4 Terabytes per day, demonstrating data compression and upload to a central cloud storage service at a rate faster than data generation. The primary contribution of this research, however, is NIMBUS, which is rapidly scalable, resilient to failure and capable of processing CCD image data at a rate of hundreds of Terabytes per day. This pipeline is implemented using a decentralised web queue to control the compression of data, the upload of data to distributed web servers, and the creation of web messages identifying the location of the data. Using these distributed web queue messages, images are downloaded by computing resources distributed around the globe. Rigorous experimental evidence is presented verifying the horizontal scalability of the system, which demonstrated a processing rate of 192 Terabytes per day with clear indications that higher processing rates are possible.
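To make the message-driven design concrete, the short sketch below outlines the shape of a NIMBUS-style worker: it polls a shared web queue, fetches the CCD image that each message points to, processes it locally, and deletes the message once the work is complete. This is an illustrative outline rather than the pipeline code itself; the queue name nimbus-work-queue, the JSON field image_url, and the placeholder process_image() routine are assumptions standing in for the actual queue configuration and the photometry step described in the later chapters.

    # Illustrative sketch only (assumed names throughout): pull a message from
    # a shared web queue, download the CCD image it references, process it,
    # and delete the message once the work has completed successfully.
    import json
    import tempfile
    import urllib.request

    import boto3  # AWS SDK; Amazon SQS plays the role of the decentralised web queue here


    def process_image(path: str) -> None:
        """Placeholder for the pixel-cleaning and photometry step."""
        print(f"processed {path}")


    def run_worker(queue_name: str = "nimbus-work-queue") -> None:
        queue = boto3.resource("sqs").get_queue_by_name(QueueName=queue_name)
        while True:
            # Long polling keeps idle workers from flooding the queue service.
            for msg in queue.receive_messages(MaxNumberOfMessages=1, WaitTimeSeconds=20):
                body = json.loads(msg.body)   # assumed JSON message format
                url = body["image_url"]       # web location of the FITS image
                with tempfile.NamedTemporaryFile(suffix=".fits") as tmp:
                    urllib.request.urlretrieve(url, tmp.name)  # fetch from a distributed web server
                    process_image(tmp.name)
                msg.delete()  # acknowledge only after the image has been processed


    if __name__ == "__main__":
        run_worker()

Because each worker does nothing more than pull a message and fetch an image over the web, capacity can be added simply by starting further worker processes anywhere with network access, which is the property the horizontal scalability experiments in Chapter 5 examine.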
Contents

List of Figures
List of Tables
Associated Publications

Chapter 1  Introduction
  1.1 Background
    1.1.1 Definitions
    1.1.2 Photometry
    1.1.3 Charge Coupled Devices
    1.1.4 The Data Processing Challenge
    1.1.5 Research Scope
  1.2 Research Hypothesis
  1.3 Thesis Contributions
  1.4 Structure of this Thesis

Chapter 2  Astronomical Photometry Data Processing
  2.1 Introduction
  2.2 Standard Image Reduction Techniques
    2.2.1 Noise Sources
    2.2.2 Bias Frames
    2.2.3 Dark Current
    2.2.4 Flat Fielding
    2.2.5 Image Reduction
  2.3 Photometry using CCD Images
    2.3.1 Centroid Algorithm
    2.3.2 Sky Background Estimation
    2.3.3 Calculating Flux Intensity Values
    2.3.4 Calculating Instrumental Magnitude
  2.4 Data Sources
    2.4.1 Optical Space Telescopes
      2.4.1.1 Hubble Space Telescope and the James Webb Space Telescope
      2.4.1.2 Kepler Mission
      2.4.1.3 Global Astrometric Interferometer for Astrophysics
    2.4.2 Large Ground-Based Telescopes
    2.4.3 Survey Projects
    2.4.4 Radio Astronomy
  2.5 Data Reduction Software
    2.5.1 FITS Formats and APIs
    2.5.2 IRAF
    2.5.3 NHPPS
    2.5.4 ESO: Common Pipeline Library (CPL)
    2.5.5 OPUS
    2.5.6 IUE
    2.5.7 Other Pipelines
  2.6 Distributed Computing
    2.6.1 Scientific Projects Overview
    2.6.2 SETI@home
  2.7 The Data Challenge
    2.7.1 Sequential versus Distributed Data Processing
  2.8 Conclusions

Chapter 3  Research Methodology
  3.1 Dataset
    3.1.1 Performance Analysis
    3.1.2 Parallel Data Processing
  3.2 System Designs
    3.2.1 Pixel Calibration - FEBRUUS Pilot
      3.2.1.1 Generate Master Bias
      3.2.1.2 Generate Master Dark
      3.2.1.3 Generate Master Flat
      3.2.1.4 Pixel Cleaning Image Files
      3.2.1.5 Supporting Tools
    3.2.2 Virtual IRAF Instances - Design 1
    3.2.3 The ACN Pipeline - Design 2
    3.2.4 NIMBUS Pipeline - Design 3
    3.2.5 Conclusion

Chapter 4  The Astronomical Compute Node (ACN) Pipeline
  4.1 Overview
  4.2 System Architecture
    4.2.1 Storage Control
    4.2.2 Queue Control
    4.2.3 Worker Nodes
      4.2.3.1 Aperture Photometry in ACN-APHOT
      4.2.3.2 Centroid Algorithm
      4.2.3.3 Sky Background Algorithm
      4.2.3.4 Partial Pixel Algorithm
    4.2.4 Node Control
  4.3 Experimental Methodology
  4.4 Results and Discussion
    4.4.1 ACN1: ACN-APHOT Performance
    4.4.2 ACN2: Disk Storage Testing
    4.4.3 ACN3: Data Compression
    4.4.4 ACN4: Data Transfer
    4.4.5 ACN5: Pipeline Limits
  4.5 Conclusion

Chapter 5  The NIMBUS Pipeline
  5.1 Overview
  5.2 System Architecture
    5.2.1 Data Archive Cloud
    5.2.2 Distributed Worker Queue Cloud
    5.2.3 System Monitoring
    5.2.4 Processing Cloud
  5.3 Experimental Methodology
    5.3.1 Experiment Metrics System
      5.3.1.1 Experimental Parameters
  5.4 Results and Discussion
    5.4.1 Simple Queue Service (SQS) Performance
      5.4.1.1 Exp:NIM1-1 SQS Write Performance Single Node
      5.4.1.2 Exp:NIM1-2 SQS Write Performance Multi-Node
      5.4.1.3 Exp:NIM1-3 SQS Distributed Read Performance
      5.4.1.4 Exp:NIM1-4 SQS Queue Read Rates
      5.4.1.5 Analysis
    5.4.2 Single Instance Node Performance
      5.4.2.1 Exp:NIM2-1 Single Instance Webserver Performance
      5.4.2.2 Exp:NIM2-2 Single Instance Performance by Type
      5.4.2.3 Exp:NIM2-3 Single Instance Multi-worker Performance
      5.4.2.4 Analysis ...