
Data Reduction
Parallel Storage Systems, 2021-06-08
Jun.-Prof. Dr. Michael Kuhn ([email protected])
Parallel Computing and I/O, Institute for Intelligent Cooperating Systems
Faculty of Computer Science, Otto von Guericke University Magdeburg
https://parcio.ovgu.de

Outline
• Review
• Motivation
• Recomputation
• Deduplication
• Compression
• Advanced Compression
• Summary

Optimizations (Review)
• Why is it important to repeat measurements?
  1. Warm up the file system cache
  2. Randomize experiments for statistical purposes
  3. Eliminate systematic errors
  4. Eliminate random errors
• What does the queue depth for asynchronous operations refer to?
  1. Size of the operations
  2. Number of operations in flight
  3. Maximum size of an operation
• What is a context switch?
  1. The application switches between two open files
  2. The application switches between two I/O operations
  3. The operating system switches between two processes
  4. The operating system switches between two file systems
• Which is the fastest I/O setup in terms of throughput?
  1. One client communicating with one server
  2. One client communicating with ten servers
  3. Ten clients communicating with one server
  4. Ten clients communicating with ten servers

Outline: Motivation

Performance Development (Motivation)
• Hardware improves exponentially, but at different rates
  • Storage capacity and throughput are lagging behind computation
  • Network and memory throughput are even further behind
• Transferring and storing data has become very costly
  • Data can be produced even more rapidly
  • Often impossible to keep all of the data indefinitely
• Consequence: higher investment costs for storage hardware
  • Leads to less money being available for computation
  • Alternatively, systems have to become more expensive overall
• Storage hardware can be a significant part of the total cost of ownership (TCO)
  • Approximately 20 % of total costs at DKRZ, ≈ €6,000,000 in procurement costs

Performance Development... (Motivation)
[Figure: performance improvement (1× to 100,000,000×, logarithmic scale) over 30 years for computation, storage capacity, storage throughput, network throughput and memory throughput]
• Computation: 300× every 10 years (based on TOP500)
• Storage capacity: 100× every 10 years
• Storage throughput: 20× every 10 years
• (A short sketch after the DKRZ comparison below shows how these rates compound over a few decades)

Example: DKRZ [Kunkel et al., 2014] (Motivation)

                      2009       2015       Factor
Performance           150 TF/s   3 PF/s     20×
Node Count            264        2,500      9.5×
Node Performance      0.6 TF/s   1.2 TF/s   2×
Main Memory           20 TB      170 TB     8.5×
Storage Capacity      5.6 PB     45 PB      8×
Storage Throughput    30 GB/s    400 GB/s   13.3×
HDD Count             7,200      8,500      1.2×
Archive Capacity      53 PB      335 PB     6.3×
Archive Throughput    9.6 GB/s   21 GB/s    2.2×
Energy Consumption    1.6 MW     1.4 MW     0.9×
Procurement Costs     30 M€      30 M€      1×
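A minimal sketch of how the decade factors quoted above compound, assuming growth stays exponential at exactly those rates (an illustrative assumption, not measured data):

```python
# Extrapolate the decade-over-decade improvement factors from the
# performance-development slide. Assumption: constant exponential growth.

FACTORS_PER_DECADE = {
    "computation": 300,        # based on TOP500
    "storage capacity": 100,
    "storage throughput": 20,
}

def improvement(factor_per_decade, years):
    """Total improvement after `years` of constant exponential growth."""
    return factor_per_decade ** (years / 10)

for years in (10, 20, 30):
    comp = improvement(FACTORS_PER_DECADE["computation"], years)
    tput = improvement(FACTORS_PER_DECADE["storage throughput"], years)
    print(f"After {years} years: computation {comp:,.0f}x, "
          f"storage throughput {tput:,.0f}x, gap {comp / tput:,.0f}x")
```

Under this simple model, the gap between computation and storage throughput grows to more than three orders of magnitude after 30 years, which is why simply buying more storage hardware is not a sustainable answer.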
Example: DKRZ... [Kunkel et al., 2014] (Motivation)

                      2020       2025        Exascale (2020)
Performance           60 PF/s    1.2 EF/s    1 EF/s
Node Count            12,500     31,250      100k–1M
Node Performance      4.8 TF/s   38.4 TF/s   1–15 TF/s
Main Memory           1.5 PB     12.8 PB     3.6–300 PB
Storage Capacity      270 PB     1.6 EB      0.15–18 EB
Storage Throughput    2.5 TB/s   15 TB/s     20–300 TB/s
HDD Count             10,000     12,000      100k–1M
Archive Capacity      1.3 EB     5.4 EB      7.2–600 EB
Archive Throughput    57 GB/s    128 GB/s    —
Energy Consumption    1.4 MW     1.4 MW      20–70 MW
Procurement Costs     30 M€      30 M€       200 M$

Approaches (Motivation)
• Storage costs should be kept stable if possible
  • It is necessary to reduce the amount of data produced and stored
• First step: determine the actual storage costs
  • It is important to differentiate between different types of costs
  • Cost model for computation, storage and archival [Kunkel et al., 2014]
• Analysis of different data reduction techniques
  • Recomputation, deduplication and compression

CMIP5 [Kunkel et al., 2014] (Motivation)
• Part of the AR5 of the IPCC (Coupled Model Intercomparison Project Phase 5)
  • Climate model comparison for a common set of experiments
• More than 10.8 million processor hours at DKRZ
  • 482 runs, simulating a total of 15,280 years
  • Data volume of more than 640 TB, refined into 55 TB via post-processing
• Prototypical low-resolution configuration:
  • A year takes about 1.5 hours on the 2009 system
  • A month of simulation accounts for 4 GB of data
  • Finishes by creating a checkpoint (4 GB); another job restarts from it
  • Every 10th checkpoint is kept and archived
• Data was stored on the file system for almost three years
  • Archived for 10 years

CMIP5... [Kunkel et al., 2014] (Motivation)

System            2009    2015   2020   2025
Computation       10.50   0.55   0.03   0.001
Procurement       45.02   5.60   0.93   0.16
Storage Access    0.09    0.01   0      0
Metadata          0.04    0      0      0
Checkpoint        0       0      0      0
Archival          10.35   1.66   0.41   0.10
Total             66.01   7.82   1.38   0.26

• 2009: computation costs are roughly the same as archival costs
  • Storage costs are more than four times higher
• Storage and archival are getting (relatively) more expensive

HD(CP)2 [Kunkel et al., 2014] (Motivation)
• High Definition Clouds and Precipitation for Climate Prediction
  • Improve the understanding of cloud and precipitation processes
  • Analyze their implications for climate prediction
• A simulation of Germany with a grid resolution of 416 m
  • The run on the DKRZ system from 2009 needs 5,260 GB of memory
  • Simulates 2 hours in a wallclock time of 86 minutes
• Model results are written every 30 model minutes
  • A checkpoint is created when the program terminates
• Output has to be kept on the global file system for only one week

HD(CP)2... [Kunkel et al., 2014] (Motivation)

System            2009     2015    2020   2025
Computation       165.07   8.72    0.44   0.02
Procurement       2.37     0.30    0.05   0.01
Storage Access    0.94     0.07    0.01   0
Metadata          0        0       0      0
Checkpoint        0.33     0.02    0      0
Archival          86.91    13.91   3.48   0.87
Total             255.29   22.99   3.97   0.90

• Higher computation costs than CMIP5
  • 2009: computation costs are roughly two times the archival costs
• Lower storage costs because data is archived faster
• (The sketch below compares the computation-to-archival cost ratio of both projects)
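A minimal sketch that recomputes the computation-to-archival cost ratio from the two tables above (values copied verbatim from the tables; the absolute cost unit cancels out in the ratio):

```python
# Ratio of computation cost to archival cost, taken from the CMIP5 and
# HD(CP)2 cost tables above, for each DKRZ system generation.

costs = {
    "CMIP5":   {"computation": [10.50, 0.55, 0.03, 0.001],
                "archival":    [10.35, 1.66, 0.41, 0.10]},
    "HD(CP)2": {"computation": [165.07, 8.72, 0.44, 0.02],
                "archival":    [86.91, 13.91, 3.48, 0.87]},
}
systems = [2009, 2015, 2020, 2025]

for project, c in costs.items():
    ratios = [comp / arch for comp, arch in zip(c["computation"], c["archival"])]
    line = ", ".join(f"{year}: {r:.2f}" for year, r in zip(systems, ratios))
    print(f"{project}: {line}")
```

For CMIP5 the ratio falls from roughly 1 in 2009 to about 0.01 in 2025, and for HD(CP)2 from roughly 2 to about 0.02: computation becomes cheap relative to archival, which is exactly the argument for recomputation in the next section.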
Data Reduction Techniques (Motivation)
• There are different concepts to reduce the amount of data to store
• We will take a closer look at three in particular
  1. Recomputing results instead of storing them
     • Not all results are stored explicitly but recomputed on demand
  2. Deduplication to reduce redundancies
     • Identical blocks of data are only stored once
  3. Compression
     • Data can be compressed within the application, the middleware or the file system

Outline: Recomputation

Overview [Kunkel et al., 2014] (Recomputation)
• Do not store all produced data
  • Data will be analyzed in situ, that is, at runtime
  • Requires a careful definition of the analyses
• Post-mortem data analysis is impossible
  • A new analysis requires repeated computation
• Recomputation can be attractive
  • If the costs for keeping data are substantially higher than the recomputation costs
  • The cost of computation is higher than the cost for archiving the data in 2009
  • Computational power continues to improve faster than storage technology

Analysis [Kunkel et al., 2014] (Recomputation)
• 2015
  • Recomputation pays off if data is accessed only once (HD(CP)2) or less than 13 times (CMIP5)
• 2020
  • HD(CP)2: recomputation is worth it if data is accessed less than eight times
  • CMIP5: archival is cheaper if data is accessed more than 44 times
• 2025
  • Recompute if data is accessed less than 44 (HD(CP)2) or 260 (CMIP5) times
• (The sketch at the end of this section shows the underlying break-even rule)

Quiz (Recomputation)
• How would you archive your application so that it can be executed in five years?
  1. Keep the binaries and rerun it
  2. Keep the source code and recompile it
  3. Put it into a container/virtual machine

Problems: Preserve Binaries [Kunkel et al., 2014] (Recomputation)
• Keep binaries of applications and all their dependencies
  • Containers and virtual machines have made this much easier
• Effectively impossible to execute the application on differing future architectures
  • x86-64 vs. POWER, big endian vs. little endian
  • Emulation usually has significant performance impacts
• Recomputation on the same supercomputer appears feasible
  • Keep all dependencies (versioned modules) and link statically

Problems: Source Code [Kunkel et al., 2014] (Recomputation)
• All components can be compiled even on different hardware architectures
  • Most likely requires additional effort from developers
  • Different operating systems, compilers etc. could be incompatible
  • Might still require preserving all dependencies
• Changes to minute details could lead to differing results
  • Different processors, network technologies etc. could change results
  • Can be ignored in some cases as long as results are "statistically equal"

Summary (Recomputation)
• Recomputation can be worth it given current performance developments
• Computation is developing much faster than storage
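A minimal sketch of the break-even rule behind the Analysis slide, under the simplifying assumption that archiving is a one-off cost for the retention period while every access to non-archived data triggers one full recomputation; the cost values below are hypothetical placeholders, not the numbers from [Kunkel et al., 2014]:

```python
# Simplified recomputation-versus-archival trade-off.
# Assumption: archival is a one-off cost for the retention period, while
# every access to non-archived data requires one full recomputation.

def break_even_accesses(recompute_cost: float, archive_cost: float) -> float:
    """Access count at which archival becomes cheaper than recomputation."""
    return archive_cost / recompute_cost

def cheaper_to_recompute(recompute_cost: float, archive_cost: float,
                         accesses: int) -> bool:
    """True if recomputing on every access is cheaper than archiving once."""
    return accesses * recompute_cost < archive_cost

# Hypothetical example: recomputing costs 0.5 units, archiving costs 5 units,
# so archival only pays off once the data is read back more than 10 times.
print(break_even_accesses(0.5, 5.0))       # 10.0
print(cheaper_to_recompute(0.5, 5.0, 3))   # True
print(cheaper_to_recompute(0.5, 5.0, 20))  # False
```

Because the recomputation cost falls much faster than the archival cost from one system generation to the next, the break-even access counts on the Analysis slide keep growing (13, then 44, then 260 accesses for CMIP5).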