IBM Software White Paper

Benefits of archiving in data warehouses

Contents

Executive summary
Typical reasons for rapid data growth
Challenges associated with data warehouse growth
Traditional data growth solutions that do not work
Understanding data archiving
Benefits of data archiving
Guiding principles and technology requirements
Managing data growth responsibly with data warehouse archiving

Executive summary

Data warehouses are the pillars of business intelligence and analytics systems, often integrating data from multiple data sources in an organization to provide historical, current or even predictive analysis of the business. Information from multiple internal or external transactional systems is extracted, transformed and loaded into data warehouses as atomic data. This cumulative data and the analytics systems that leverage it provide the technology and methodology that help organizations discover and develop meaningful insights.

Due to the consolidated nature of data warehouses, these data stores often suffer from rapid growth. Typical reasons for this phenomenon include expansion of data warehouses with new subject areas or data marts, compounded data growth from organic or inorganic business growth, or a “let’s keep it all, someone might need it” attitude toward historical data.

This unchecked data growth often results in ever-increasing infrastructure and operational costs, poor performance, and an inability to support complex data retention and legal hold requirements.

A data archiving solution helps organizations address these challenges by allowing IT staff to intelligently move (and purge) historical and inactive data from production databases into a more cost-effective location while still providing the capabilities to query, search or even restore data if needed. A tiered archiving strategy provides additional benefits in terms of managing performance and cost-effectiveness. Data archiving can also alleviate data growth issues by:

• Removing or relocating inactive and dormant data out of the database to improve data warehouse performance
• Reducing the infrastructure and operational costs typically associated with data growth
• Leveraging proven policies and processes to cost-effectively manage multi-temperature data
• Improving disaster recovery and backup/restore plans to consistently meet service-level agreements (SLAs)
• Supporting compliance with data retention, purge or hold policies

This paper describes a data lifecycle management strategy for data warehouses that is designed to manage high-volume data growth cost-effectively and avoid performance degradation.

Typical reasons for rapid data growth

The data warehouse is commonly an organization’s largest database. This is due to several factors:

Big data and the explosion in data volume: With the advent of big data technologies that help organizations generate insight from large information assets, companies are keeping unstructured and structured data that might have been thrown away in the past. Apache Hadoop and similar technologies continue to gain momentum and adoption, and will provide new ways of processing large amounts of such data, extracting intelligence from multi-structured data sources, and integrating the results into existing data warehouses for further analysis and reporting.

The “data tomb” effect: Data warehouses may become the dumping ground for historical data from various transactional systems, with little regard to the true value of the business intelligence within this dead data. This “data tomb” effect may be caused by the lack of an optimal archiving and data retention strategy in the originating transactional system itself.

Expansion into new subject areas: Companies frequently expand data warehouses with new subject areas and new data sources, making them part of a central repository for the enterprise or interconnected data marts. While this expansion can provide insights for crucial business activities, it can also lead to significant data expansion.

Business growth: Larger organizations are often subject to compounded data growth from mergers and acquisitions, as well as organic business growth. Consolidation of multiple implementations into one results in a larger system.

Lack of retention and disposal policies: Unfortunately, the business side of an organization may not provide IT teams with enough clarity on data retention and disposal policies. Most organizations have a “let’s keep it all, someone might need it later” mentality for historical data, which prevents them from exploring cost-effective data retention, hold or purge processes.

Each of these factors provides an impetus for IT organizations to adopt data lifecycle management strategies and efficiently manage categories of data according to their value in a data warehousing architecture.

Challenges associated with data warehouse growth

High-volume data growth and large warehouse implementations present multiple IT challenges and business risks. While many data warehouse solutions and architecture choices exist in the market, every approach poses several common challenges (see Figure 1).

Cost of ownership

The impact of exponential data growth on infrastructure and operational costs can be huge, often taking up most of an organization’s data warehousing budget. Larger amounts of data require larger capacity, resulting in more hardware and storage requirements, as well as higher costs to maintain, monitor and administer this infrastructure. Large data warehouses generally require bigger servers and appliances, which may also increase software licensing costs for the database, database tooling, integration or business intelligence (BI) tools.

[Figure 1 labels: Performance, Database size, Hardware capacity]

Figure 1. Performance and capacity challenges associated with data warehouse growth.

In addition, IT departments must factor in the costs of a mirrored disaster recovery system, the data backup infrastructure, processes to copy large data sets within the SLA window and replicas of the database across test environments.

Performance and availability

Large volumes of data and varying workloads can put a lot of stress on data warehouse systems. With a majority of production data typically in an inactive state, the performance and system availability of data warehouses suffer greatly as a result of unchecked data growth.

When the response time of critical queries and reporting processes starts to degrade, extract/transform/load (ETL) loads take longer and may extend past the SLA windows. Database backups run endlessly and the IT staff must operate in reactive mode to contain these issues. These situations pose a significant risk to business continuity and system availability, because downtime can result in a lengthy system recovery period.

Cost-effective compliance

Many data warehouses also feed data back into the transactional systems, acting as systems of record in these cases. These systems may be subject to audits, retention, legal hold or e-discovery requests. Simply purging historical data is not acceptable as a method for keeping up with data growth because compliance regulations may require data to be retained for a certain number of years, put on legal hold to satisfy discovery requests, or audited. Keeping all of the data in production databases is not a cost-effective way to retain data for compliance reasons. Also, if a data warehouse was used to make business decisions, it may be targeted for legal disclosure under e-discovery rules.

Traditional data growth solutions that do not work

IT organizations may try to use conventional methods for managing data growth, but these methods are often ineffective or fail to generate a cost-effective solution. Common techniques include:

Hardware upgrades: Trying to keep up with data growth has a huge impact on capital expenditure and leads to frequent hardware upgrades. The traditional solution is to add more server nodes, or perform forklift upgrades to replace the data warehouse infrastructure. While hardware upgrades are inevitable, there are other ways to defer these costs and reap better performance from existing infrastructure, which may amount to huge savings.

Traditional backups: Large, monolithic backups are highly redundant, with historical and inactive data taking up most of the space. Backups are not substitutes for archives; archives are online or near-line and queryable. Backups cannot solve data growth problems because they require creating a replica of the production data, and need to be taken frequently (on a weekly or monthly basis), which adds more overhead to the growth problem. If IT teams use backups to archive data, it can be difficult to retrieve the data within a short period of time. Information retrieval also poses a challenge when the data schema in the original system has evolved.

Database partitioning: IT departments sometimes try to manage data growth by implementing a partitioning schema in the traditional database management system (DBMS) to separate active data from historical data. However, partitioning in this way still may not reduce the overhead on the database because the indexes remain the same size. Partitioning does not help reduce overall storage costs and maintenance windows; it also makes it difficult to restore or re-create selective data records located in a dropped partition from the time when the database was on an older version. Certain analytical DBMSs don’t even support database partitioning.

Homegrown solutions: Building a mature data archiving and purging solution in-house can be a very expensive and time-consuming effort. The scripts and code require proper handling of database referential integrity, error recoverability, high-performance execution and consistent application of business rules and policies across a potentially large number of systems. Despite the huge investment, these solutions are hard to maintain and do not provide much longevity in typical organizations where people and technology change regularly.

Purging data: In many industries, companies must keep large amounts of historical information (especially financial information) for compliance reasons. Data is subject to the same SLAs (including those for data retention) as the transactional system itself, and for that reason must be covered by information lifecycle policies for standard corporate data.

Understanding data archiving

Data lifecycle management is a policy- and process-oriented approach to efficiently control the flow of an information system’s data throughout its lifecycle, from requirement to retirement. Data lifecycle management policies include ensuring optimal application performance and archiving historical data to manage data growth while ensuring access to both production and archived data. Before archiving data, it is important to classify everything based on usage activity.

Data assessment and classification

It is not uncommon for organizations to have millions or even billions of records across different fact tables that hold many years of accumulated information. However, it is quite common for users and DBAs to find that the most active data is typically located within the last six months to two years of transactions. Anything earlier is queried infrequently.

Data in the warehouse can be classified according to its temperature: the access frequency, volatility and query performance of the data. Hot data is frequently accessed and updated, and users expect optimal performance when accessing this data. As data ages, it tends to “cool off,” meaning that the probability of users accessing this data significantly decreases.

Archiving typically targets cold data and relocates it to a more cost-effective storage medium (see Figure 2). However, the data must still be available for regulatory requests, audits and long-term analysis, so the archived data should be queryable and restorable (in the original location or a staged location). Data assessment and classification based on business usage is an important factor in an effective archiving strategy.

Data archiving

Archiving in its simplest form involves the migration of information or data (typically historical) from an online application to a secondary (online, near-line or offline) system, making it accessible as a long-term storage repository. As a recognized information lifecycle management best practice, archiving segregates inactive application data from current activity and safely moves it to a different tier based on its value to the business. Consequently, smaller databases tend to deliver higher service levels with lower maintenance and operational overhead.

[Figure 2 labels: temperature bands Hot, Warm, Cold, Colder, Coldest; time axis Current, Year 1, Year 3, Year 5, Year 7; access types Update access, Reporting access, Ad hoc access]

Figure 2. Multi-temperature data classification based on access requirement.
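The multi-temperature classification shown in Figure 2 can be illustrated with a minimal sketch. It assumes that record age alone is an acceptable proxy for temperature, whereas the text also names access frequency, volatility and query performance as factors. The band boundaries, the function names and the choice to keep hot and warm data in production are all hypothetical; they are not taken from the paper or from any product.

```python
from datetime import date

# Hypothetical age thresholds in years for each temperature band.
# The band names follow Figure 2; the exact cut-offs are assumptions.
TEMPERATURE_BANDS = [
    (1.0, "hot"),
    (3.0, "warm"),
    (5.0, "cold"),
    (7.0, "colder"),
]


def classify_temperature(record_date: date, today: date) -> str:
    """Return a temperature label based only on the age of a record."""
    age_years = (today - record_date).days / 365.25
    for upper_bound, label in TEMPERATURE_BANDS:
        if age_years < upper_bound:
            return label
    return "coldest"


def archive_candidates(records, today, keep=("hot", "warm")):
    """Yield (record_id, label) for records whose band is not kept in production."""
    for record_id, record_date in records:
        label = classify_temperature(record_date, today)
        if label not in keep:
            yield record_id, label
```

In practice the thresholds would come from an assessment of actual query and update activity rather than from fixed ages, and different subject areas may warrant different bands.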

Archiving in dimensional data warehousing or data marts

Data warehousing uses different methods of data modeling. One popular approach, dimensional data warehousing, involves fact and dimension tables, whereas others use a more normalized data model. There are two types of history tracking in dimensional data warehousing:

1. Fact data changes: Granular fact records about a business event (such as a sale or transaction) are linked to a certain point in time; they are history-tracked and grow in large numbers over time. These high-volume, historical and detailed records are good candidates for archiving.

2. Dimension data changes: Data in dimension tables may also change over time and is known as slowly changing dimension (SCD) data. In this case, attribute changes in a dimension (such as a customer phone number or address) may be tracked over time and often result in a sizeable amount of historical data. The larger the dimension tables in volume and number of attributes, the larger the data grows in SCD records. However, fact records typically grow faster than SCD records.

Tiered storage archiving strategies

Database archiving involves extracting a predefined set of historical data (often time-based) from a set of tables while maintaining its referential integrity; moving this data set into either a secondary archive data warehouse or a file-based data archive; and purging the historical transactional data from the source database. For higher query performance and access to larger volumes of data, warm data may be stored in another data warehouse instance, ideally on a lower-cost infrastructure. For rarely accessed data, storing this “cold” data in compressed and queryable data archive files may provide a more cost-effective solution compared to higher-tier storage.

Organizations may leverage a combination of these archive stores to balance access performance requirements and cost-effectiveness (see Figure 3). The archived systems would leverage lower-cost storage devices such as Serial ATA (SATA), network-attached storage (NAS), content-addressable storage (CAS), optical disks, tapes or cloud storage.

[Figure 3 labels: Production data warehouse (hot data, tier 1); Archive data warehouse (warm data, tier 2); Data archive files (cold data, tier 3); flows labeled Archive and Restore; data labeled Current data, Historical data, Contextual data, Complete data sets]

Figure 3. A three-tier archiving strategy designed to optimize cost-effectiveness and performance of specific data sets on different tiers of storage.
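The extract, move and purge sequence described under tiered storage archiving can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions, not how any archiving product works: the table names (orders, order_items and their archive_ counterparts), the columns and the date cutoff are hypothetical, and the archive tables live in the same database purely to keep the example self-contained, where a real deployment would target a second-tier warehouse or archive files. Parent rows and their child rows are moved together in one transaction so that referential integrity is preserved.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    """Create hypothetical production and archive tables with matching layouts."""
    conn.executescript(
        """
        CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT);
        CREATE TABLE order_items (
            item_id INTEGER PRIMARY KEY,
            order_id INTEGER REFERENCES orders(order_id),
            amount REAL
        );
        CREATE TABLE archive_orders (order_id INTEGER PRIMARY KEY, order_date TEXT);
        CREATE TABLE archive_order_items (
            item_id INTEGER PRIMARY KEY,
            order_id INTEGER,
            amount REAL
        );
        """
    )


def archive_orders(conn: sqlite3.Connection, cutoff: str) -> int:
    """Move orders dated before `cutoff` (ISO date string) and their items.

    Returns the number of orders archived.
    """
    with conn:  # one transaction: commit on success, roll back on error
        # Extract and move the parent rows, then their child rows.
        conn.execute(
            "INSERT INTO archive_orders "
            "SELECT * FROM orders WHERE order_date < ?",
            (cutoff,),
        )
        conn.execute(
            "INSERT INTO archive_order_items "
            "SELECT i.* FROM order_items AS i "
            "JOIN orders AS o ON o.order_id = i.order_id "
            "WHERE o.order_date < ?",
            (cutoff,),
        )
        # Purge children before parents so no line item is left orphaned.
        conn.execute(
            "DELETE FROM order_items WHERE order_id IN "
            "(SELECT order_id FROM orders WHERE order_date < ?)",
            (cutoff,),
        )
        cur = conn.execute("DELETE FROM orders WHERE order_date < ?", (cutoff,))
        return cur.rowcount
```

Calling archive_orders(conn, "2010-01-01") on a populated database would move every order dated before 2010 along with its line items. Because the archive tables remain ordinary SQL tables, the moved rows stay queryable after the purge.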

Access to archived data

While archiving strategy and architecture may look different for each implementation, there may be infrequent requirements to access the archived data. Archiving removes data from the production system, but this data is not lost; it is simply relocated based on its business value. In cases where a separate instance of an archive data warehouse is used, the queries could be directed to the archive instance directly. For scenarios where combined reporting with production data is needed, data federation technologies could be leveraged as well.

The data archive files created by the archiving solution should allow access using industry-standard interfaces such as ODBC/JDBC, XML or SQL, via any standard reporting tool. Users can then browse or search the archives using browser-based or other standard reporting mechanisms for auditing or compliance reasons. For heavier analytical requirements on larger sets of historical data, the archiving solution should allow users to restore archived data sets back to the original location or a staged location. In general, because archived data is infrequently accessed, restoring data is rarely required.

“Organizations which fail to deploy strategies to address data complexity and volume issues for their analytics by 2012 will experience more than doubling costs of ownership for their data warehouse and mart environments in disorganized attempts to meet this new demand.”
Gartner, “Does the 21st-Century ‘Big Data’ Warehouse Mean the End of the Enterprise Data Warehouse?” 25 August 2011

Benefits of data archiving

Lower total cost of ownership

Data archiving can have a great impact on reducing total cost of ownership for the data warehouse and help with IT cost-savings initiatives. By deferring hardware upgrades in production and disaster recovery environments, archiving enables companies to make the most efficient use of the existing infrastructure in a controlled data growth environment. Archiving strategies help reduce the amount of data DBAs must actively manage and the amount of time they spend tuning or adjusting storage requirements, freeing them up to focus on more strategic projects. Archiving also holds the potential to reduce software costs (such as warehouse and database licensing costs) associated with larger data warehouses.

By leveraging lower-cost storage tiers or lower-cost data warehouse appliances with a tiered archiving strategy, organizations can purge inactive historical data once it has been archived, reclaiming space in production data warehouse servers. In addition, archiving helps control the cost of capital and operational expenses related to database backup processes because the redundant and static historical data will be reduced in periodic backups.

Improved performance and availability

Archiving and purging inactive data helps significantly improve query performance by reducing the amount of data and the number of indexes and table scans that must be processed. Smaller data warehouses also perform better with batch processing, long-running reports and ETL jobs, avoiding overruns into other production usage requests. Archiving makes performing periodic maintenance tasks easier and faster, and it streamlines restoration from backups in the event of a failure for better system uptime and user productivity.

Streamlined risk and compliance management

Data archiving helps organizations comply with data retention and purge policies while providing queryable archives for audit or e-discovery requests into historical data. The technology and processes also support data legal-hold requirements. Plus, archiving enables organizations to apply business policies to govern data retention and disposal and provides long-term solutions for storing historical data.

IBM InfoSphere Optim: A single, enterprise-scale data lifecycle management solution

IBM InfoSphere Optim™ software provides a central data management solution designed to scale to meet enterprise needs. Whether addressing a single application, a data warehouse environment or a global data center, organizations can use InfoSphere Optim solutions to streamline data management with a consistent strategy.

The unique relationship engine in InfoSphere Optim provides a single point of control to guide data processing activities such as archiving, subsetting, migrating and retrieving data. Reusable data management templates enable consistency and scalability, while advanced security features provide support for role-based access and activity permissions.

InfoSphere Optim supports major data warehouse environments, including IBM PureData System for Analytics, IBM InfoSphere Warehouse, Teradata and Oracle. It also supports enterprise databases and operating systems, including IBM DB2, IBM Informix, IBM IMS™, IBM Virtual Storage Access Method (VSAM), IBM z/OS, Oracle Database, Sybase, Microsoft SQL Server, Microsoft Windows, UNIX and Linux. In addition, InfoSphere Optim supports key ERP and CRM packaged applications such as Oracle E-Business Suite, PeopleSoft Enterprise, JD Edwards EnterpriseOne, Siebel CRM, Amdocs CRM and SAP applications, as well as many custom applications.

Guiding principles and technology requirements

An enterprise-grade data archiving solution should meet four key technology requirements:

1. Enterprise architecture

Most enterprises rely on heterogeneous information assets, solutions and platforms from multiple vendors. A single, scalable data lifecycle management solution must support all of these major technologies, providing a common and reusable interface and processes. The solution should also be optimized for high-performance connectivity to multiple data warehousing solutions (such as IBM® PureData™ System for Analytics, which leverages IBM Netezza® technology; IBM DB2® and IBM InfoSphere® Warehouse; IBM Informix®; Teradata; Oracle; Microsoft SQL Server; and Sybase) with support for major operating systems including IBM z/OS®, IBM i, Linux, UNIX and Microsoft Windows.

Such an enterprise solution should also support a tiered storage architecture for optimal balance between storage cost, performance and access requirements. Pre-built integration with hierarchical storage management (HSM) systems like IBM Tivoli® or EMC Centera also helps ease implementation of a tiered archive strategy.

2. Complete business objects

From a database perspective, a business object represents a group of related rows from related tables across one or more applications, together with its related metadata (information about the structure of the database and about the data itself). Capturing the complete business object offers a complete view of the business activity surrounding a particular transaction.

Data warehouses are required to represent these relationships accurately, whether in a star schema, snowflake or hybrid data model. When the high-level entity, such as an order, is archived, the corresponding line items should be archived as well. If this does not happen, the connection between the order and its line items is lost. Such connections form a complete business object, so enterprises should look for archiving software that represents and preserves such complex entities in a simple, easy-to-manage and high-performing way.

3. Discovery and understanding data structures

To archive complete business objects, enterprises need archiving solutions with robust data discovery and metadata mining capabilities. The solution should be able to discover, analyze and document data models with accurate schema and data relationships from data warehousing systems in multiple ways. It should allow IT staff to reverse-engineer a model from an existing source database by mining the database catalog. Without this, the data model representation would have to be built manually. If there is no physical or documented data model representation, the solution should have automated capabilities for analyzing data values and data patterns to identify relationships that offer greater accuracy and reliability than manual analysis.

Archiving solutions must also provide the ability to import an existing logical model and make changes to it for database archiving. They should also provide an easy way for IT staff to incorporate logical data relationships manually for any custom relationships not represented at the physical layer.

4. Universal access to archives

The archiving solution should offer universal access to archived data using industry-standard interfaces such as ODBC/JDBC, XML or SQL, and reporting tools using these interfaces, such as IBM Cognos®, SAP Crystal Reports, Microsoft Excel and others.

Managing data growth responsibly with data warehouse archiving

Data warehouses should not be allowed to grow into large, expensive historical data repositories. Managing data growth with data warehouse archiving helps reduce costs, improve performance and increase availability for business-critical analytics and BI solutions while maintaining compliance with data retention requirements. Together with IBM, organizations can make a case for archiving in their data warehouse implementations and evaluate the business value of managing data growth.

For more information

To learn more about IBM data archiving solutions and best practices, contact your IBM representative or visit: ibm.com/software/data/optim

© Copyright IBM Corporation 2013

IBM Corporation Software Group Route 100 Somers, NY 10589

Produced in the United States of America February 2013

IBM, the IBM logo, ibm.com, Cognos, DB2, IMS, Informix, InfoSphere, Optim, PureData, Tivoli and z/OS are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at “Copyright and trademark information” at ibm.com/legal/copytrade.shtml

Netezza is a trademark or registered trademark of IBM International Group B.V., an IBM Company.

Linux is a registered trademark of Linus Torvalds in the United States, other countries or both.

Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries or both.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates.

This document is current as of the initial date of publication and may be changed by IBM at any time. Not all offerings are available in every country in which IBM operates.

The client examples cited are presented for illustrative purposes only. Actual performance results may vary depending on specific configurations and operating conditions. THE INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS” WITHOUT ANY WARRANTY, EXPRESS OR IMPLIED, INCLUDING WITHOUT ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND ANY WARRANTY OR CONDITION OF NON-INFRINGEMENT. IBM products are warranted according to the terms and conditions of the agreements under which they are provided.

The client is responsible for ensuring compliance with laws and regulations applicable to it. IBM does not provide legal advice or represent or warrant that its services or products will ensure that the client is in compliance with any law or regulation.

IMW14686-USEN-00