Investigating Interoperability of the LSST Data Management Software Stack with Astropy
Total Page:16
File Type:pdf, Size:1020Kb
Investigating interoperability of the LSST Data Management software stack with Astropy Tim Jennessa, James Boschb, Russell Owenc, John Parejkoc, Jonathan Sicka, John Swinbankb, Miguel de Val-Borrob, Gregory Dubois-Felsmannd, K-T Lime, Robert H. Luptonb, Pim Schellartb, K. Simon Krughoffc, and Erik J. Tollerudf aLSST Project Management Office, Tucson, AZ, U.S.A. bPrinceton University, Princeton, NJ, U.S.A. cUniversity of Washington, Seattle, WA, U.S.A dInfrared Processing and Analysis Center, California Institute of Technology, Pasadena, CA, U.S.A. eSLAC National Laboratory, Menlo Park, CA, U.S.A. fSpace Telescope Science Institute, 3700 San Martin Dr, Baltimore, MD, 21218, USA ABSTRACT The Large Synoptic Survey Telescope (LSST) will be an 8.4 m optical survey telescope sited in Chile and capable of imaging the entire sky twice a week. The data rate of approximately 15 TB per night and the requirements to both issue alerts on transient sources within 60 seconds of observing and create annual data releases means that automated data management systems and data processing pipelines are a key deliverable of the LSST construction project. The LSST data management software has been in development since 2004 and is based on a C++ core with a Python control layer. The software consists of nearly a quarter of a million lines of code covering the system from fundamental WCS and table libraries to pipeline environments and distributed process execution. The Astropy project began in 2011 as an attempt to bring together disparate open source Python projects and build a core standard infrastructure that can be used and built upon by the astronomy community. This project has been phenomenally successful in the years since it has begun and has grown to be the de facto standard for Python software in astronomy. Astropy brings with it considerable expectations from the community on how astronomy Python software should be developed and it is clear that by the time LSST is fully operational in the 2020s many of the prospective users of the LSST software stack will expect it to be fully interoperable with Astropy. In this paper we describe the overlap between the LSST science pipeline software and Astropy software and investigate areas where the LSST software provides new functionality. We also discuss the possibilities of re-engineering the LSST science pipeline software to build upon Astropy, including the option of contributing affiliated packages. Keywords: LSST, Astropy, Astronomy Software, Python, Code Reuse 1. INTRODUCTION One major component of the Data Management system1 for the Large Synoptic Survey Telescope (LSST)2{4 is a pair of software pipelines capable of reducing LSST images to catalogs, one to produce alerts while observing and the other to produce data releases at regular intervals. The alert pipeline must be able to process data from the 3.2 gigapixel camera∗ and issue alerts on transient sources within 60 seconds of observing.5 The data release pipeline must be able to efficiently process the 11 data releases that will be made during the ten year survey. Further author information: (Send correspondence to T.J.) T.J.: E-mail: [email protected] ∗This equates to approximately 15 TB per night. See http://lsst.org/scientists/keynumbers for additional num- bers. Command-line Driver Scripts Cluster execution middleware Tasks (ISR, Detection, Co-adding, …) Camera abstraction Measurement Image Processing layer (obs_* Algorithms (meas_*) (ip_*) packages) Application Framework (comp. Middleware (I/O, intensive C++, SWIG) configuration, …) External C/C++ libraries (Boost, External Python Modules FFTW, Eigen, CUDA, ndarray, …) (numpy, pyfits, matplotlib, …) Figure 1. Structure of the LSST data management software. At the bottom there are external packages (black), on the lower left (red) are the computationally intensive packages that are mainly written in C++ with SWIG wrappers making the code available in Python, and the remainder (blue) are mostly Python. In addition to efficiency, the LSST pipelines must also provide unprecedented levels of control over systematic errors across a broad range of scientific domains. Most consumers of LSST data will likely do no image processing of their own, and even those that do will rely heavily on catalogs generated by the LSST pipeline. Furthermore, much of LSST data analysis is expected to happen in LSST's \Level 3 Environment," a combination of hardware and software that will allow scientists to run their own code and customized versions of LSST pipeline algorithms on LSST's middleware and hardware. While algorithmically advanced pipelines and public catalog releases have become the norm for large surveys over the past 20 years, LSST's will be the first for which the software must be accessible to and usable by scientists outside the pipeline team. The difficulty of sharing code between telescopes has long been a topic of discussion in the astronomy software community.6{9 Historically there have been some successes in fundamental libraries, such as SOFA,10 CFITSIO11 and wcslib,12 and in environments, such as IRAF,13 AIPS14 and Starlink.15 Despite these, there is still a tendency for a new project to embark on the creation of a new environment or software infrastructure if there are slight requirements mismatches and no external political will to adopt a specific software suite. The growth in the Open Source software movement16 and recent explosion in support tools, such as git and GitHub for code sharing,17 makes the process of software reuse far easier than it has been in the past. The LSST data processing pipeline software has been in development as an open source project since 200418{20 and during these 12 years of development much has changed in the Python, C++ and astronomy software world. 2. THE LSST SCIENCE PIPELINE SOFTWARE As with other software developed in the early 2000s, such as CASA,21 the LSST science pipeline software is based on a C++ core with a Python control layer, where for LSST the Python/C++ interface layer is handled via SWIG22 bindings. The general layout of the code is shown in Fig.1. At the time of writing the LSST science pipelines codebase is split between C++ and Python with of order 105,000 lines of Python and 125,000 lines of C++y. The software is spread among many different LSST root packages: afw Provides core LSST utility classes including geometric shapes, images, source detection, convolution and high-level exposure objects. daf The data access framework, also known as the Data Butler, provides an abstraction layer hiding data formats, data transport and locations from the user and providing them with standardized Python objects yThis total does not include comment or blank lines, nor does it include the 80,000 lines of Python and 1.6 million lines of C++ generated for the SWIG wrappers. representing the data. It provides facilities for handling simple versioning and time-based data access with validity ranges. ip Low-level image processing algorithms such as instrument signature removal and image differencing. meas Algorithms focused on processing individual sources in an image, including PSF modeling, WCS fitting, and source detection, deblending, and measurement. obs Observatory plugins to teach the system how to handle files from different instruments. pipe base Pipeline processing infrastructure. This allows discrete units of computation to be linked together to form pipelines. pipe tasks High-level algorithms, including initial exposure-level processing and calibration, image coaddition (with sophisticated artifact rejection and coaddition of PSF models), and consistent multi-passband pro- cessing of coadds to measure colors. The Task framework provides a way of packaging science algorithms so that they can be executed in a data-parallel fashion, with further extensions to allow MPI-based tightly- synchronized execution. The framework includes a Python-based configuration language that has default values, overrides from code as well as the end user, and hooks for validation of configured values. The LSST codebase currently contains components that are not of short-term interest for Astropy integration but might become so in the longer run given related work by Astropy contributors in some of these areas.23{25 These include the Data Butler and the Task framework. In terms of overlap with Astropy and affiliated packages the afw and meas packages are most relevant for immediate investigation as they provide some functionality that is of broader interest to the Astropy community but also have functionality that is also present within Astropy. The algorithms in the LSST codebase are not yet at the level of scientific quality ultimately needed for LSST, but they are already being used to process data from the Hyper Suprime-Cam Survey on the Subaru telescope,26, 27 one of the major precursors surveys for LSST, and generally represent the current state-of-the- art. 3. ASTROPY Python has become the dominant programming language for astronomy28{32 and is now the default choice for new projects and for teaching. The Astropy33 project grew out of a support mailing list in 2011 to become a very successful hub for the development of Python packages for astronomy. At the time of writing Astropy consists of approximately 105,000 lines of Python code. Astropy has two key approaches to drive adoption. The first is to place generic astronomy code into the core package that is brought in via import astropy to provide a solid, well-tested base for astronomy software development; the second is to support and encourage the concept of affiliated packages, which use the astropy package and follow the same philosophy on distribution and openness. Together these aim to provide a consistent environment of highly-usable, well-documented, easy-to-install code for astronomers using Python in their research. 4. SYNERGIES The LSST project is constrained by the requirements of the funding agencies to deliver a data management system on schedule and on budget.