WHITE PAPER

Rethinking Your Retention Strategy to Better Exploit the Big Data Explosion

Sponsored by: Dell

Richard L. Villars Marshall Amaldas October 2011

IDC OPINION

The continued generation of business-critical semistructured data (including large volumes of machine-generated data [MGD] from smart sensors and mobile devices) is changing the storage dynamic in a wide range of industries and organizations. Making investments to extract value from this expanding pool of information is fast becoming a core business mandate, but such efforts can quickly lead to spiraling IT costs and growing corporate risk without the right data retention and long-term archiving strategy.

Making the wrong choice in a technology decision (e.g., deciding between an OLTP, OLAP, or OLDR approach to data storage) can lead to significantly high data management and retention costs in both the short run and the long run. It can also jeopardize compliance and governance standards for data such as call detail records (CDRs) and trading records. IT organizations need to deploy active archival storage solutions that address the total cost of ownership (TCO) for archival data at many layers. Specifically, such a solution:

 Provides a semistructured archive platform that's significantly less expensive than archiving that same information on individual database, data warehouse, or file systems

 Maximizes the utilization of that hardware with intelligent data management/reduction software

 Reduces the ongoing operational burden of the archival storage environment

When selecting a storage and data management partner to help you manage the "Big Data" challenge, you will need a partner that can address the entire spectrum of data assessment, data retention, and data use requirements of this new environment. Dell, as a leading designer and provider of IT solutions optimized for Big Data analytics, is also providing enterprise-class solutions that address the cost, performance, and intelligence requirements at the heart of Big Data retention and active archiving.


INFORMATION EVERYWHERE, BUT WHERE'S THE KNOWLEDGE?

For the first 40 years of the IT industry, the main data challenge for most organizations was enabling/recording more and faster business transactions, often referred to as structured data. Today, much of the focus is on more and faster exchanges of information (e.g., documents, medical images, movies, gene sequences, data streams, tweets) from scale-out cloud clusters to systems, PCs, mobile devices, and living rooms. This information is often categorized as unstructured data (e.g., image, audio, or video files) or semistructured data (e.g., logs, call detail records). Semistructured data is often overlooked, but with the advent of RFID tracking, smart sensors, mobile devices with geospatial information, and a growing array of data collection devices, MGD will be a leading driver of the data explosion.

The business challenge for the next decade will be finding ways to better analyze, monetize, and capitalize on all this MGD (see Figure 1). It will be the age of Big Data. For the IT organization, the challenge will be to implement an archival storage system that ensures that this information is reliably and efficiently ingested, protected, organized, accessed, and preserved.

FIGURE 1

Changing Business Priorities in a Fast-Shifting World

Companies rely on a growing range of devices, data sources, and applications to compete in today's evolving business environment.

[Figure 1 graphic: quadrants labeled More Devices, More Applications, More Content, and More Data, with examples such as Facebook, VMware, salesforce.com, and Apple]

The range of information created, accessed, and retained affects how companies organize datacenters and retain information.

Source: IDC, 2011


The Ongoing Data Explosion

Data creation is occurring at a record rate. In 2010, the world generated over 1 zettabyte (ZB) — that's 1 million petabytes (PB) — of data; by 2014, we will generate 7ZB a year. While much of this is "unsaved" or highly duplicated data like personal photos or copies of music/videos, one of the fastest-growing and most important sources of growth is machine-generated data:

 Financial transactions. With the consolidation of global trading environments and the greater use of programmed trading, the volume of transactions that need to be collected and analyzed can double or triple, transaction volumes fluctuate faster, more widely, and more unpredictably, and competition among firms forces trading decisions to be made at ever smaller intervals.

 Smart instrumentation. The use of intelligent meters in "smart grid" energy systems that shift from a monthly meter read to an "every 15 minutes" meter read can translate into a multi-thousandfold increase in data generated. Similar data bursts are looming in healthcare, where low-cost gene sequencing will have a profound impact on medical data volumes.

 Mobile devices. Until quite recently, the main data generated on landline and traditional mobile phones was limited to CDRs with caller, receiver, and length of call data. With smartphones and tablets, additional CDR data to harvest includes geographic location, text messages, browsing history, and (thanks to the addition of accelerometers) even motions.

All of this data creates new opportunities to "extract more value" in sectors such as energy, human genomics, healthcare, retail, online search, surveillance, and finance, as well as many other areas. IDC believes that organizations that are best able to make real-time business decisions based on machine-generated data streams at the lowest possible cost will thrive, while those that are unable to embrace and make use of this expanding data source will increasingly find themselves at a competitive disadvantage in the market. This situation will be particularly true in industries that are experiencing high rates of business change and aggressive consolidation.

Big Data Value: What's in It for Me?

Regardless of industry or sector, the ultimate value of Big Data implementations will be judged based on one or more of three criteria:

 Does it provide more useful information? For example, a major retailer might implement a digital video system throughout its stores, not only to monitor theft but also to feed a Big Data pattern detection system that analyzes the flow of shoppers — including demographic information such as gender and age — through the stores at different times of the day, week, and year.

 Does it improve the fidelity of the information? For example, a number of earth science and medical epidemiological research teams are using Big Data systems to monitor and assess the quality of data being collected from remote sensor systems; they are using Big Data not just to look for patterns but also to identify and eliminate false data caused by malfunctions, user error, or temporary environmental anomalies.

 Does it improve the timeliness of the response? Consumer products companies can use kiosks like Coca-Cola's Freestyle to collect real-time consumer taste preferences in different regions. This move makes it easier to tune promotions and control inventory levels on a regional or even store-by-store basis.

Big Data Analytics Versus Retention: Distinct Solutions for Distinct Needs

Today, a number of Big Data analytics solutions use a combination of open source software frameworks such as Hadoop and MPP (massively parallel processing) hardware architectures to support compute- and data-intensive applications that can consume multiple petabytes of disk storage across thousands of individual server nodes. Both the hardware and software components of such analytics systems are optimized for performance; the data distributed over the nodes is kept redundant for resiliency and high availability.

The MPP architecture–based systems are designed such that compute and storage are tightly coupled to minimize contention for resources. While these solutions are best suited to run complex large-scale analytics where performance is the prime objective, they are not suitable targets for long-term retention of Big Data content.
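To make the analytics side of this division of labor concrete, the sketch below shows a minimal Hadoop Streaming-style job in Python that tallies machine-generated records (here, hypothetical call detail records keyed by hour). The record layout and script name are assumptions for illustration only; they are not tied to any specific vendor platform.

```python
#!/usr/bin/env python3
# Minimal Hadoop Streaming-style job over machine-generated records.
# Assumption for illustration: each input line is a comma-separated CDR whose
# second field is an ISO 8601 timestamp such as 2011-10-03T14:15:00Z.
import sys


def mapper():
    # Emit one (hour, 1) pair per call detail record read from stdin.
    for line in sys.stdin:
        fields = line.rstrip("\n").split(",")
        if len(fields) < 2:
            continue  # skip malformed records
        hour = fields[1][:13]  # e.g., "2011-10-03T14"
        print(f"{hour}\t1")


def reducer():
    # Hadoop Streaming delivers mapper output sorted by key, so a running
    # total per key is sufficient.
    current_key, total = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current_key:
            if current_key is not None:
                print(f"{current_key}\t{total}")
            current_key, total = key, 0
        total += int(value)
    if current_key is not None:
        print(f"{current_key}\t{total}")


if __name__ == "__main__":
    mapper() if "map" in sys.argv[1:] else reducer()
```

Because the mapper and reducer are just scripts run over retained record files, the same data can be pushed through new jobs later — which is exactly the repeated reanalysis pattern discussed next.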

A key element in all these use cases is that organizations must be able to continually go back and reanalyze the same machine-generated data sets over and over again. They need to continually look for patterns stretching over hours, days, months, and years. If it's too expensive to retain the needed historical data or too difficult to organize the data for timely, ad hoc retrieval, organizations won't be able to capitalize on their collected information. The key question you need to be asking is whether your current storage environment can handle this new data explosion and the data retention challenges it will create. Traditionally, MGD was treated like either structured or unstructured data sets:

1. It was maintained in a database or data warehouse (leveraging SAN-attached storage), which is very expensive and can significantly impact performance, unless an organization used the archiving functions (not always provided) for each application. In this approach, the data is also trapped in a single application environment and is difficult to repurpose/reuse.

2. It was pushed down as a blob (sometimes aptly called a TARball) onto a file system to be retained. In this approach, an organization sacrificed the structure detail, significantly impacting the querying and analytical ability and, once again, the ability to repurpose/reuse. Because MGD was often linked to a tape library, it posed significant data retrieval burdens.

3. It was kept as a set of personal files on a file server or NAS device and then either orphaned (when the owner left) or deleted. In both cases, the ability to access the data and to manage its retention/disposal for regulatory reasons was severely compromised.

Failing to make use of systems that are built specifically for meeting the long-term retention and compliance needs of MGD pools will make Big Data analytics ambitions cost prohibitive and risky. You need a Big Data retention solution like Dell's Big Data Retention, which recognizes MGD as historic at creation and immediately commits it to an intelligent, long-term, online retention pool. This approach eliminates much of the high cost associated with databases/SAN storage while maintaining the critical data context that is lost in NAS environments.

Table 1 provides a set of questions that can help you decide whether you have a Big Data retention problem. If you recognize any of these issues, you need to start thinking about your current data retention strategy and how you can move to a more active archival storage environment.

TABLE 1

Do You Have a Big Data Retention Problem?

Potential Pain Point: Extreme volumes of data that in many cases are caused by automated or machine-generated data sets
Key Indicators of a Problem:
 Do you have, or are you expecting, daily data volumes that outstrip the ingestion capability or storage capacity of your database or repository?
 Are you concerned about the impact of this data growth on storage budgets?

Potential Pain Point: Out-of-control costs associated with retaining, managing, and supporting these new data pools
Key Indicators of a Problem:
 Are some of the data sets that you are retaining automated or machine generated and typically historic upon creation?
 Do you store these data sets in traditional transactional databases/data warehouses or as big blobs or files?
 Do you need to keep more of this data for discovery or analytics purposes that require the preservation of the structured data sets (usually lost when they are stored as a file)?

Potential Pain Point: Difficulty managing and supporting disparate silos (archives and data stores) across the enterprise
Key Indicators of a Problem:
 Do you need to archive email, files, SharePoint, and other unstructured data in addition to databases or other forms of structured data?
 Are you concerned about the long-term TCO for either or both archive platforms used for each environment?

Potential Pain Point: The need to back up (protect) these new data sources is threatening backup windows for existing applications and stressing existing backup facilities
Key Indicators of a Problem:
 Are you having a hard time consistently meeting your backup and recovery SLA requirements, or are your backups failing altogether?

Source: IDC, 2011

THE CHANGING NATURE OF ARCHIVAL STORAGE IN THE WORLD OF BIG DATA

Data retention via archiving has a long history as an IT practice, stretching back decades. Archiving of structured data was required for regulatory/contractual purposes or as a method of maintaining high levels of system performance (removing inactive data to free up capacity and I/O performance on databases). The archive data was rarely, if ever, accessed again and was stored on inactive media (e.g., tapes). Moving the data was expensive, time consuming, and often tied to the backup process, which made retrieval after a relatively short time (e.g., 30 to 90 days) time consuming and difficult.

The rapid growth in digital data triggered by the arrival/explosion of the Internet, along with a series of business scandals, increased the scope of the data retention/disposition problem. For IT managers in the past decade, regulation and compliance requirements mandated that organizations retain semistructured data (e.g., CDRs and trading records) for even longer periods of time. They also mandated that this archived material be more quickly accessible for eDiscovery purposes.

Concurrently, the move to online collaboration began to generate larger amounts of emails, office documents, and rich media data, which must also be retained and archived. These mandates added a further layer of complexity because just storing the data was not enough. Organizations also needed to preserve the data in context.

Both of these developments drove organizations to greater reliance on active (HDD-based) storage for their long-term data storage needs. IDC estimates that in 2010, organizations around the world deployed 4,465 petabytes of new disk storage capacity just to store copies of their data for availability and retention purposes; by 2015, they will be deploying 16,538 petabytes (see Figure 2). While solving the access time issue, this addition of another disk storage tier often poses even greater storage asset management challenges.

FIGURE 2

Worldwide Enterprise Disk Storage Consumption, 2010–2015

[Figure 2 graphic: petabytes of disk storage shipped per year, 2010–2015, broken out into structured data, replicated data, and unstructured and semistructured data]

Source: IDC's Enterprise Disk Storage Consumption Model, September 2011

Machine-Generated Data and Big Data Reframe the Data Retention Challenge

Today, the rate of data growth and the diversity of data types are reaching unprecedented levels. The traditional archiving jobs of preservation and active application offload remain daunting challenges. At the most basic level, the sheer data volume increase associated with new and fast-growing machine-generated environments can pose significant archival challenges. The shift to intelligent meters as part of a smart grid energy system would lead to a roughly 3,000-fold increase in the machine-generated data that a utility would be collecting on a monthly basis.
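That multiplier is easy to verify with back-of-the-envelope arithmetic (a sketch, not utility-specific figures): moving from one read per month to one read every 15 minutes yields roughly 3,000 readings per meter per month.

```python
# Back-of-the-envelope check on the smart meter example: one read per month
# versus one read every 15 minutes over a 31-day month.
reads_per_hour = 4                        # a read every 15 minutes
reads_per_month = reads_per_hour * 24 * 31
print(reads_per_month)                    # 2976 readings vs. 1 -- roughly a 3,000-fold increase
```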

Certain industries such as financial services (market, trading, and tick data) and telecommunications (logs, CDRs for lawful intercept) are continuously generating vast quantities of data at a rate of billions of MGD records a day. With retention requirements ranging from a few years to indefinitely, the demand for raw storage will only accelerate unless we figure out how to be smarter about how data is retained.

What's different in the new world of machine-generated data and Big Data analytics is the need to continually go back and mine this data over and over again. You're not just retaining it; you're continually reusing it.

Standard database and data warehouse applications aren't optimized to handle ingestion of such volumes of data, and they are even less suitable platforms from a cost and performance standpoint when it comes to archiving. At the same time, the previously mentioned utility can't just park the data on some tapes. Organizations are constantly deploying new sets of analytic applications that continually go back and analyze behaviors (and then make real-time adjustments) on an hourly, daily, weekly, monthly, and even yearly basis.

The context and techniques for mining that data will change and evolve. Any data retention solution that locks the retained data into a traditional hierarchical database or file structure severely impacts the long-term cost of storing the MGD. More important, it greatly reduces the long-term value of the data.

Active Archiving for Big Data

The primary data management challenge associated with Big Data is to ensure that the data is retained (to satisfy compliance needs at the lowest possible costs) while also keeping up with the unique and fast-evolving scaling requirements associated with new business analytics efforts. Organizations that strike this balance will boost efficiency, drive down cost, and be in a far better position to capitalize on Big Data innovations.

Firms must be able to mine their historical data to analyze and extract data for market intelligence, product planning, and inventory planning. In R&D environments, reuse of historical information can yield vast savings in time and effort, which in turn saves money and in some cases provides competitive advantage by shrinking the time required to bring products to market.

Today, many of these Big Data projects are best described as "junior science projects" with a small core of servers and storage assets. From a business and an IT governance standpoint, however, these kinds of "junior science projects" can quickly turn into the next "Manhattan Project," with companywide and industrywide business, organizational, and legal consequences. IT organizations need to deploy active archival storage solutions that address several major requirements:

 Rapid, continuous, and intelligent movement of "instantly historical" data from the data-generating devices/applications onto the active archive system. This ability ensures that the source application continues to run at maximum efficiency in terms of performance and reliability and that the underlying IT assets (servers and primary storage systems) aren't compromised by having to support multiple, incompatible workloads.

 Flexibility in data ingest capability. The amount of machine-generated data can vary significantly from time to time, depending on the amount of activity that is experienced by a monitoring system. Financial trade monitoring systems can experience very high levels of activity due to an external event that causes panic, which in turn could trigger a sudden surge in the number of trades. The active archive target should be able to accommodate such variation and be able to ingest data at different rates as required.

 Rapid, nondisruptive scalability of archival storage capacity and I/O performance. This modularity makes it easier to launch initial, limited machine-generated data capabilities without jeopardizing the ability to meet rapidly expanding requirements for capacity and performance. You may outgrow a specific module, but you never want to outgrow the archival platform. When you're talking about hundreds of terabytes (TB) to multiple petabytes of information, migrating to a new platform should never become necessary.

 Built-in efficiency. Unchecked data growth is bound to become a burden over time, even on an archive tier. IT organizations need to look for solutions that take full advantage of proven efficiency technologies that are purpose designed to make the most of machine-generated data and to meet capacity and cost targets.

 Flexible, nonhierarchical data organization based on an object-based storage foundation. This flexibility is critical because one of the key tenets of Big Data applications is the ability to deal with new and unpredictable data patterns. Machine-generated data required for one purpose today may prove absolutely critical for enabling some new analytic algorithm tomorrow. An object-based approach to storing information eliminates the risk that a data-organizing approach that makes sense now will render the data difficult to extract, or even useless, in the future (a generic sketch of this object-plus-metadata approach follows this list).

 Preservation of metadata. The metadata attributes of machine-generated data are much richer than those of other content types, which makes them very useful for analytical purposes. The ability to make sense of relationships between different data sets using common metadata and attributes is a key analytical value of the data. IT acquisition decision makers need to make sure that their choice of storage system does not diminish this value.
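For illustration only, the sketch below shows what the object-plus-metadata pattern looks like from client code: each record is written once and its descriptive attributes travel with it as metadata rather than being locked into a fixed schema. The endpoint, naming scheme, and header names are hypothetical and are not the DX platform's documented API.

```python
# Generic sketch of writing a machine-generated record to an object store with
# attached metadata. All names below (URL, object key, x-meta-* headers) are
# hypothetical illustrations, not a vendor-documented interface.
import json
import requests

record = {"meter_id": "A-1042", "reading_kwh": 3.7, "ts": "2011-10-03T14:15:00Z"}

response = requests.put(
    "https://archive.example.com/objects/meter-readings/A-1042-20111003T1415",
    data=json.dumps(record),
    headers={
        "Content-Type": "application/json",
        "x-meta-source-device": "smart-meter",     # attribute preserved with the object
        "x-meta-retention-class": "regulatory-7y",
    },
    timeout=10,
)
response.raise_for_status()
```

Because the attributes ride along with the object instead of living in a rigid table definition, the same records can be reorganized or queried against new attributes later without a schema migration.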

Because most organizations are new to the subject of active archiving for machine-generated content, purchase decision makers need to look for solutions and vendors that place a high emphasis on providing complete service and support throughout the implementation.

Don't Overlook Data/Information Security/Privacy

As in the case of other content types, regulation and compliance are also important considerations for machine-generated data. For example, the USA stipulates strict retention requirements for CDRs. Telecommunications organizations need to make sure that this information is stored such that it cannot be modified from the time it is created.

If the data involved is sensitive for reasons of privacy, enterprise security, or regulatory requirement, then misplacement or misuse of retained data can represent a serious security breach. More traditional database management systems support security policies that are quite granular, protecting data at both the coarse-grained level and the fine-grained level from inappropriate access.

Today, Big Data applications generally have no such safeguards. Enterprises that include any sensitive data in Big Data operations must ensure that the data itself is secure and that the same policies that apply to the data when it exists in databases or files are also enforced in the Big Data context. Failure to do so can have serious negative consequences.

The archival storage environment, as the common retention point for all machine-generated data, must enable advanced, yet easy-to-leverage, data/information security capabilities. It must include:

 The ability to automatically place specified records on disks that have WORM capability

 Monitoring and reporting capabilities, which will help IT administrators make informed infrastructure and policy decisions proactively

The remainder of this white paper examines how well Dell's Big Data Retention solution addresses the need for compliant, enterprise-class Big Data/MGD retention and on-demand access.

DELL'S BIG DATA RETENTION AND ACTIVE ARCHIVE STORAGE SOLUTION

Dell is a leading provider of IT products and services for organizations around the world. It provides the computing systems at the heart of machine-generated data devices. It is also a leader in designing and deploying servers optimized for Big Data analytics compute platforms that play a key role in monetizing the value of machine-generated data.

Now, Dell is also providing enterprise-class solutions that will be at the heart of Big Data retention and active archiving.

Dell set a goal of creating a complete archival solution (hardware, software, and professional services) that cost-effectively solves the "Big Data" retention/archive problem and enables a better way to retire and archive legacy applications. For Dell, cost-effectiveness means addressing the TCO of archival data at many layers:

 Providing an MGD/Big Data–optimized archive platform that is significantly less expensive (and more useful) than archiving that same information on individual database, data warehouse, or file systems

 Leveraging the least expensive hardware (without compromising performance/reliability)

 Maximizing the utilization of that hardware with intelligent data management/reduction software

 Reducing the ongoing operational burden (provisioning, migrating, and administrating) of the archival storage environment

The company introduced the Big Data Retention solution in 2011. It's designed to provide a low-cost, standard foundation for data ingestion, long-term retention, and on-demand data retrieval of historical data (machine generated and all other forms of semistructured data). Big Data Retention is a single platform for retaining structured, unstructured, and semistructured data across an unlimited number of data sources, formats, and types. It is based on a clustered Web storage service that utilizes a peer-scaling design that can start at 1TB and extend to exabytes. The DX Object Storage Platform abstracts underlying technology by integrating compute, network, and storage resources into one delivery unit.

Key characteristics of the Big Data Retention solution include:

 Fast deployment with minimal administration overhead and no special tuning: Minimal administration compared with the specialized DBA requirements of many traditional repositories, plus the ability to rapidly search and retrieve data using native SQL for seamless integration with existing systems in the enterprise

 Scalable performance with high data ingestion rates and fast queries: Ability to load big data volumes (billions of records/day, petabytes/month)

 Scalable, intelligent archival storage capacity to handle big data volumes: Ability to dedupe at the structured data value and pattern level, leveraging the resulting 40 to 1+ compression ratio, a 97%+ reduction in size (see the arithmetic check after this list)

 Integrated compliance features such as configurable retention rules and audits: Ability to set flexible configurable retention and expiry rules for the life cycle of the data with guaranteed read-only immutability and audited access
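As a quick arithmetic check (not a measured result), the 40-to-1 ratio and the 97%+ reduction cited in the capacity bullet above describe the same thing:

```python
# A 40:1 data reduction means the stored copy is 1/40 of the raw size.
ratio = 40
reduction = 1 - 1 / ratio
print(f"{reduction:.1%}")   # 97.5%, i.e., the "97%+ reduction in size"
```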

Dell provides customers with multiple deployment options for the Big Data Retention solution. IT organizations can deploy Big Data Retention as an on-premise system within their own datacenter. In addition, Dell will be providing a cloud-based solution running in Dell's cloud datacenters.

Dell DX Object Storage Platform Is Archival Storage at the Core of Big Data Retention

One of the key components of the Big Data Retention solution is the Dell DX Object Storage Platform. The DX allows IT organizations to effectively archive both structured data (e.g., from RDBMSs) and semistructured data (e.g., logs, call detail records, other MGD) while still supporting reliable query via SQL or any business intelligence tool using ODBC/JDBC. IT organizations can archive terabytes to petabytes of semistructured data on the Dell DX Object Storage Platform while minimizing TCO through the use of advanced data reduction capabilities built into the DX.
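As an illustration of what querying the archive over ODBC can look like from client code, the sketch below uses Python's pyodbc module against a hypothetical data source name and table; the DSN, table, and column names are assumptions for illustration, not Dell-documented objects.

```python
# Sketch of an ad hoc SQL query against archived call detail records over ODBC.
# "BigDataArchive", call_detail_records, and the column names are hypothetical.
import pyodbc

conn = pyodbc.connect("DSN=BigDataArchive")
cursor = conn.cursor()

cursor.execute(
    """
    SELECT caller_id, COUNT(*) AS calls, SUM(duration_sec) AS total_seconds
    FROM call_detail_records
    WHERE call_start >= ? AND call_start < ?
    GROUP BY caller_id
    ORDER BY total_seconds DESC
    """,
    "2011-01-01", "2011-02-01",
)

for caller_id, calls, total_seconds in cursor.fetchall():
    print(caller_id, calls, total_seconds)

conn.close()
```

The same connection could also serve reporting or BI tools that speak ODBC/JDBC, which is the point of preserving the structured form of the data in the archive.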

Key design characteristics of the DX platform include:

 System persistence. Adding/replacing hardware components (e.g., disks or controllers) is automated and nondisruptive and doesn't require any data organization changes. Only one physical migration (the original one) onto the DX platform is ever necessary.

 System resiliency. The system is self-healing. If a hardware element fails, you just plug in a replacement and the system automatically restores what needs to be restored.

 System flexibility. The system supports highly granular file-level management to enable great effectiveness of data reduction services and employs an open API so that your data won't be trapped in a single, proprietary environment.

The DX Object Storage Platform makes it possible for IT organizations to ingest billions of records per day, accumulating petabytes of data per month. More important, it ensures that this data is properly retained based on legal/governance or business analytics requirements.

Challenges/Opportunities for Dell

Given the continued rapid growth of machine-generated data and the increasing role of Big Data in organizations' new application and services plans, the storage and information management challenges posed by Dell's data-driven customers and prospects will only increase in the coming years. Dell needs to address a number of requirements as it expands its role in organizations' active archiving and data analysis environments:

 Continue to improve underlying storage hardware capacity, performance, and power management efficiencies through more tunable/intelligent automated data movement and support for even denser/more power-efficient HDD solutions

 Establish closer technical and business ties with leading analytics (Big Data) application suppliers that will make it easier for customers to fully exploit the information stored within the Big Data Retention archive

 Extend the reach of the Big Data Retention solution to better address the active archive and data mining needs of medium-sized and small businesses through further expansion of Dell's cloud-based offering

FINAL THOUGHTS

When assessing the impact of machine-generated data and supporting analytic applications on your IT infrastructure, you'll find that the challenges extend from data creation, to data collection, to data retention, and, finally, to ongoing analysis. This new environment represents both big opportunities and big challenges for CIOs. Almost every CIO dreams about making IT a more valued asset to the organization. Big Data projects are at the frontier of the business, where the majority of the most significant business expansion or cost reduction opportunities lie. Taking a lead in leveraging machine-generated data provides the CIO with a chance to be a strategic partner with business unit leaders.

Because speed is strategically important in many early efforts, it will be tempting for business unit teams to move forward without IT support. You will find, however, that Big Data issues emerge at surprisingly low data volumes. They manifest themselves when the balance between the value of data and the cost of retention becomes an issue.

Making the wrong choice in a technology decision (e.g., deciding between an OLTP, OLAP, or OLDR approach to data storage) will lead to significantly high data management and retention costs in both the short run and the long run.

It will also expose the organization to greater risks when it comes to IT and corporate governance. Your IT team needs to recognize that it must think differently (as well as quickly) and fight for a seat at the table as analytic and data archiving strategies are developed. You need to ensure that the solution:

 Reduces infrastructure cost by ingesting and querying large volumes of data on commodity infrastructure while cutting demand for physical storage through compression

 Retains huge volumes of records without a need to roll up or aggregate while managing record life cycles through configurable retention policies, preserving a secure and immutable data model

 Retrieves information speedily using standard SQL over ODBC/JDBC, enabling enterprises to leverage existing business information, reporting, and analytics investments

When selecting a storage and data management partner to help you, picking the best product isn't enough. You will need a partner that can address the entire spectrum of data assessment, data retention, and data use requirements of this new environment.

IDC believes that building successful business cases around the intersection of machine-generated data and Big Data analysis can be accomplished only through a tight alignment of critical thinking across both IT and the business. You will need a partner that can help you capitalize on new initiatives quickly and cost-effectively.

As a CIO, you want to be more involved in the business; creating the right data management infrastructure for the retention and active archiving of machine-generated data can bring your IT organization front and center in the next major business effort.

Copyright Notice

External Publication of IDC Information and Data — Any IDC information that is to be used in advertising, press releases, or promotional materials requires prior written approval from the appropriate IDC Vice President or Country Manager. A draft of the proposed document should accompany any such request. IDC reserves the right to deny approval of external usage for any reason.

Copyright 2011 IDC. Reproduction without written permission is completely forbidden.
