Benefits of Data Archiving in Data Warehouses
IBM Software White Paper

Contents
Executive summary
Typical reasons for rapid data growth
Challenges associated with data warehouse growth
Traditional data growth solutions that do not work
Understanding data archiving
Benefits of data archiving
Guiding principles and technology requirements
Managing data growth responsibly with data warehouse archiving

Executive summary
Data warehouses are the pillars of business intelligence and analytics systems, often integrating data from multiple data sources in an organization to provide historical, current or even predictive analysis of the business. Information from multiple internal or external transactional systems is extracted, transformed and loaded into data warehouses as atomic data. This cumulative data and the analytics systems that leverage it provide the technology and methodology that help organizations discover and develop meaningful insights.

Due to the consolidated nature of data warehouses, these data stores often suffer from rapid growth. Typical reasons for this phenomenon include expansion of data warehouses with new subject areas or data marts, compounded data growth from organic or inorganic business growth, or a “let’s keep it all, someone might need it” attitude toward historical data.

This unchecked data growth often results in ever-increasing infrastructure and operational costs, poor data warehouse performance, and an inability to support complex data retention and legal hold requirements.

A data archiving solution helps organizations address these challenges by allowing IT staff to intelligently move (and purge) historical and inactive data from production databases into a more cost-effective location while still providing the capabilities to query, search or even restore data if needed. A tiered archiving strategy provides additional benefits in terms of managing performance and cost-effectiveness. Data archiving can also alleviate data growth issues by:

• Removing or relocating inactive and dormant data out of the database to improve data warehouse performance
• Reducing the infrastructure and operational costs typically associated with data growth
• Leveraging proven policies and processes to cost-effectively manage multi-temperature data
• Improving disaster recovery and backup/restore plans to consistently meet service-level agreements (SLAs)
• Supporting compliance with data retention, purge or hold policies

This paper describes a data lifecycle management strategy for data warehouses that is designed to manage high-volume data growth cost-effectively, and avoid performance degradation.

Typical reasons for rapid data growth
The data warehouse is commonly an organization’s largest database. This is due to several factors:

Big data and the explosion in data volume: With the advent of big data technologies that help organizations generate insight from large information assets, companies are keeping unstructured and structured data that might have been thrown away in the past. Apache Hadoop and similar technologies continue to gain momentum and adoption, and will provide new ways of processing large amounts of such data, extracting intelligence from multi-structured data sources, and integrating the results into existing data warehouses for further analysis and reporting.

The “data tomb” effect: Data warehouses may become the dumping ground for historical data from various transactional systems, with little regard to the true value of the business intelligence within this dead data. This “data tomb” effect may be caused by the lack of an optimal archiving and data retention strategy in the originating transactional system itself.

Expansion into new subject areas: Companies frequently expand data warehouses with new subject areas and new data sources, making them part of a central repository for the enterprise or interconnected data marts. While this expansion can provide insights for crucial business activities, it can also lead to significant data expansion.

Business growth: Larger organizations are often subject to compounded data growth from mergers and acquisitions, as well as organic business growth. Consolidation of multiple implementations into one results in a larger system.

Lack of retention and disposal policies: Unfortunately, the business side of an organization may not provide IT teams with enough clarity on data retention and disposal policies. Most organizations have a “let’s keep it all, someone might need it later” mentality for historical data, which prevents them from exploring cost-effective data retention, hold or purge processes.

Each of these factors provides an impetus for IT organizations to adopt data lifecycle management strategies and efficiently manage categories of data according to their value in a data warehousing architecture.

Challenges associated with data warehouse growth
High-volume data growth and large warehouse implementations present multiple IT challenges and business risks. While many data warehouse solutions and architecture choices exist in the market, every approach poses several common challenges (see Figure 1).

Cost of ownership
The impact of exponential data growth on infrastructure and operational costs can be huge, often taking up most of an organization’s data warehousing budget. Larger amounts of data require larger capacity, resulting in more hardware and storage requirements, as well as higher costs to maintain, monitor and administer this infrastructure. Large data warehouses generally require bigger servers and appliances, which may also increase software licensing costs for the database, database tooling, integration or business intelligence (BI) tools.

[Figure 1. Performance and capacity challenges associated with data warehouse growth. Labels: Performance, Database size, Hardware capacity.]

In addition, IT departments must factor in the costs of a mirrored disaster recovery system, the data backup infrastructure, processes to copy large data sets within the SLA window and replicas of the database across test environments.

Performance and availability
Large volumes of data and varying workloads can put a lot of stress on data warehouse systems. With a majority of production data typically in an inactive state, the performance and system availability of data warehouses suffer greatly as a result of unchecked data growth.

When the response time of critical queries and reporting processes starts to degrade, extract/transform/load (ETL) loads take longer and may extend past the SLA windows. Database backups run endlessly and the IT staff must operate in reactive mode to contain these issues. These situations pose a significant risk to business continuity and system availability, because downtime can result in a lengthy system recovery period.

Cost-effective compliance
Many data warehouses also feed data back into the transactional systems, acting as systems of record in these cases. These systems may be subject to audits, retention, legal hold or e-discovery requests. Simply purging historical data is not acceptable as a method for keeping up with data growth because compliance regulations may require data to be retained for a certain number of years, put on legal hold to satisfy discovery requests, or audited. Keeping all of the data in production databases is not a cost-effective way to retain data for compliance reasons. Also, if a data warehouse was used to make business decisions, it may be targeted for legal disclosure under e-discovery rules.

Traditional data growth solutions that do not work
IT organizations may try to use conventional methods for managing data growth, but these methods are habitually ineffective or fail to generate a cost-effective solution. Common techniques include:

Hardware upgrades: Trying to keep up with data growth has a huge impact on capital expenditure and frequent hardware upgrades. The traditional solution is to add more server nodes, or perform forklift upgrades to replace the data warehouse infrastructure. While hardware upgrades are inevitable, there are other ways to defer these costs and reap better performance from existing infrastructure, which may amount to huge savings.

Traditional backups: Large, monolithic backups are highly redundant with historical and inactive data taking up most of the space. Backups are not substitutes for archives; archives are online or near-line and queryable. Backups cannot solve data growth problems because they require creating a replica of the production data, and need to be taken frequently (on a weekly or monthly basis), which adds more overhead to the growth problem. If IT teams use backups to archive data, it can be difficult to retrieve the data within a short period of time. Information retrieval also poses a challenge when the data schema in the original system has evolved.

Database partitioning: IT departments sometimes try to manage data growth by implementing a partitioning schema in the traditional database management system (DBMS) to separate active data from historical data. However, partitioning in this way still may not reduce the overhead on the database

Understanding data archiving
Data lifecycle management is a policy- and process-oriented approach to efficiently control the flow of an information system’s data throughout its lifecycle, from requirement to retirement. Data lifecycle