WHITE PAPER

Rethinking Your Retention Strategy to Better Exploit the Big Data Explosion

Sponsored by: Dell

Richard L. Villars Marshall Amaldas October 2011

IDC OPINION

The continued generation of business-critical semistructured data (including large volumes of machine-generated data [MGD] from smart sensors and mobile devices) is changing the storage dynamic in a wide range of industries and organizations. Making investments to extract value from this expanding pool of information is fast becoming a core business mandate, but such efforts can quickly lead to spiraling IT costs and growing corporate risk without the right data retention and long-term archiving strategy.

Making the wrong choice in a technology decision (e.g., deciding between an OLTP, OLAP, or OLDR approach to data storage) can lead to significantly high data management and retention costs in both the short run and the long run. It can also jeopardize compliance and governance standards for data such as call detail records (CDRs) and trading records. IT organizations need to deploy active archival storage solutions that address the total cost of ownership (TCO) for archival data at many layers. Specifically, such a solution:

 Provides a semistructured archive platform that's significantly less expensive than archiving that same information on individual database, data warehouse, or file systems

 Maximizes the utilization of that hardware with intelligent data management/reduction software

 Reduces the ongoing operational burden of the archival storage environment

When selecting a storage and data management partner to help you manage the "Big Data" challenge, you will need a partner that can address the entire spectrum of data assessment, data retention, and data use requirements of this new environment. Dell, as a leading designer and provider of IT solutions optimized for Big Data analytics, is also providing enterprise-class solutions that address the cost, performance, and intelligence requirements at the heart of Big Data retention and active archiving.


INFORMATION EVERYWHERE, BUT WHERE'S THE KNOWLEDGE?

For the first 40 years of the IT industry, the main data challenge for most organizations was enabling/recording more and faster business transactions, often referred to as structured data. Today, much of the focus is on more and faster exchanges of information (e.g., documents, medical images, movies, gene sequences, data streams, tweets) from scale-out cloud clusters to systems, PCs, mobile devices, and living rooms. This information is often categorized as unstructured data (e.g., image, audio, or video files) or semistructured data (e.g., logs, call detail records). Semistructured data is often overlooked, but with the advent of RFID tracking, smart sensors, mobile devices with geospatial information, and a growing array of data collection devices, MGD will be a leading driver of the data explosion.

The business challenge for the next decade will be finding ways to better analyze, monetize, and capitalize on all this MGD (see Figure 1). It will be the age of Big Data. For the IT organization, the challenge will be to implement an archival storage system that ensures that this information is reliably and efficiently ingested, protected, organized, accessed, and preserved.

FIGURE 1

Changing Business Priorities in a Fast-Shifting World

Companies rely on a growing range of devices, data sources, and applications to compete in today's evolving business environment.

[Figure 1 graphic: quadrants labeled More Devices, More Applications, More Content, and More Data, with examples such as Facebook, VMware, salesforce.com, and Apple]

The range of information created, accessed, and retained affects how companies organize datacenters and retain information.

Source: IDC, 2011


The Ongoing Data Explosion

Data creation is occurring at a record rate. In 2010, the world generated over 1 zettabyte (ZB) — that's 1 million petabytes (PB) — of data; by 2014, we will generate 7ZB a year. While much of this is "unsaved" or highly duplicated data like personal photos or copies of music/videos, one of the fastest-growing and most important sources of growth is machine-generated data:

 Financial transactions. With the consolidation of global trading environments and the greater use of programmed trading, the volume of transactions that need to be collected and analyzed can double or triple, transaction volumes fluctuate faster, more widely, and more unpredictably, and competition among firms forces trading decisions to be made at ever smaller intervals.

 Smart instrumentation. The use of intelligent meters in "smart grid" energy systems that shift from a monthly meter read to an "every 15 minutes" meter read can translate into a multi-thousandfold increase in data generated. Similar data bursts are looming in healthcare, where low-cost gene sequencing will have a profound impact on medical data volumes.

 Mobile devices. Until quite recently, the main data generated on landline and traditional mobile phones was limited to CDRs with caller, receiver, and length of call data. With smartphones and tablets, additional CDR data to harvest includes geographic location, text messages, browsing history, and (thanks to the addition of accelerometers) even motions.

All of this data creates new opportunities to "extract more value" in sectors such as energy, human genomics, healthcare, retail, online search, surveillance, and finance, as well as many other areas. IDC believes that organizations that are best able to make real-time business decisions based on machine-generated data streams at the lowest possible cost will thrive, while those that are unable to embrace and make use of this expanding data source will increasingly find themselves at a competitive disadvantage in the market. This situation will be particularly true in industries that are experiencing high rates of business change and aggressive consolidation.

Big Data Value: What's in It for Me?

Regardless of industry or sector, the ultimate value of Big Data implementations will be judged based on one or more of three criteria:

 Does it provide more useful information? For example, a major retailer might implement a digital video system throughout its stores, not only to monitor theft but also to feed a Big Data pattern detection system that analyzes the flow of shoppers — including demographic information such as gender and age — through the stores at different times of the day, week, and year.

 Does it improve the fidelity of the information? For example, a number of earth science and medical epidemiological research teams are using Big Data systems to monitor and assess the quality of data being collected from remote sensor systems; they are using Big Data not just to look for patterns but also to identify and eliminate false data caused by malfunctions, user error, or temporary environmental anomalies.

 Does it improve the timeliness of the response? Consumer products companies can use kiosks like Coca-Cola's Freestyle to collect real-time consumer taste preferences in different regions. This move makes it easier to tune promotions and control inventory levels on a regional or even store-by-store basis.

Big Data Analytics Versus Retention: Distinct Solutions for Distinct Needs

Today, a number of Big Data analytics solutions use a combination of open source software frameworks such as Hadoop and MPP (massively parallel processing) hardware architectures to support compute- and data-intensive applications that can consume multiple petabytes of disk storage across thousands of individual server nodes. Both the hardware and software components of such analytics systems are optimized for performance; the data distributed over the nodes is kept redundant for resiliency and high availability.

The MPP architecture–based systems are designed such that compute and storage are tightly coupled to minimize contention for resources. While these solutions are best suited to run complex large-scale analytics where performance is the prime objective, they are not suitable targets for long-term retention of Big Data content.
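To make the analytics side of this division of labor concrete, the sketch below shows a minimal Hadoop Streaming-style job in Python that tallies machine-generated records (here, hypothetical call detail records keyed by hour). The record layout and script name are assumptions for illustration only; they are not tied to any specific vendor platform.

```python
#!/usr/bin/env python3
# Minimal Hadoop Streaming-style job over machine-generated records.
# Assumption for illustration: each input line is a comma-separated CDR whose
# second field is an ISO 8601 timestamp such as 2011-10-03T14:15:00Z.
import sys


def mapper():
    # Emit one (hour, 1) pair per call detail record read from stdin.
    for line in sys.stdin:
        fields = line.rstrip("\n").split(",")
        if len(fields) < 2:
            continue  # skip malformed records
        hour = fields[1][:13]  # e.g., "2011-10-03T14"
        print(f"{hour}\t1")


def reducer():
    # Hadoop Streaming delivers mapper output sorted by key, so a running
    # total per key is sufficient.
    current_key, total = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current_key:
            if current_key is not None:
                print(f"{current_key}\t{total}")
            current_key, total = key, 0
        total += int(value)
    if current_key is not None:
        print(f"{current_key}\t{total}")


if __name__ == "__main__":
    mapper() if "map" in sys.argv[1:] else reducer()
```

Because the mapper and reducer are just scripts run over retained record files, the same data can be pushed through new jobs later — which is exactly the repeated reanalysis pattern discussed next.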

A key element in all these use cases is that organizations must be able to continually go back and reanalyze the same machine-generated data sets over and over again. They need to continually look for patterns stretching over hours, days, months, and years. If it's too expensive to retain the needed historical data or too difficult to organize the data for timely, ad hoc retrieval, organizations won't be able to capitalize on their collected information. The key question you need to be asking is whether your current storage environment can handle this new data explosion and the data retention challenges it will create. Traditionally, MGD was treated like either structured or unstructured data sets:

1. It was maintained in a database or data warehouse (leveraging SAN-attached storage), which is very expensive and can significantly impact performance, unless an organization used the archiving functions (not always provided) for each application. In this approach, the data is also trapped in a single application environment and is difficult to repurpose/reuse.

2. It was pushed down as a blob (sometimes aptly called a TARball) onto a file system to be retained. In this approach, an organization sacrificed the structure detail, significantly impacting the querying and analytical ability and, once again, the ability to repurpose/reuse. Because MGD was often linked to a tape library, it posed significant data retrieval burdens.

3. It was kept as a set of personal files on a file server or NAS device and then either orphaned (when the owner left) or deleted. In both cases, the ability to access the data and to manage its retention/disposal for regulatory reasons was severely compromised.

Failing to make use of systems that are built specifically for meeting the long-term retention and compliance needs of MGD pools will make Big Data analytics ambitions cost prohibitive and risky. You need a Big Data retention solution like Dell's Big Data Retention, which recognizes MGD as historic at creation and immediately commits it to an intelligent, long-term, online retention pool. This approach eliminates much of the high cost associated with databases/SAN storage while maintaining the critical data context that is lost in NAS environments.

Table 1 provides a set of questions that can help you decide whether you have a Big Data retention problem. If you recognize any of these issues, you need to start thinking about your current data retention strategy and how you can move to a more active archival storage environment.

TABLE 1

Do You Have a Big Data Retention Problem?

Potential Pain Point: Extreme volumes of data that in many cases are caused by automated or machine-generated data sets
Key Indicators of a Problem:
 Do you have, or are you expecting, daily data volumes that outstrip the ingestion capability or storage capacity of your database or repository?
 Are you concerned about the impact of this data growth on storage budgets?

Potential Pain Point: Out-of-control costs associated with retaining, managing, and supporting these new data pools
Key Indicators of a Problem:
 Are some of the data sets that you are retaining automated or machine generated and typically historic upon creation?
 Do you store these data sets in traditional transactional databases/data warehouses or as big blobs or files?
 Do you need to keep more of this data for discovery or analytics purposes that require the preservation of the structured data sets (usually lost when they are stored as a file)?

Potential Pain Point: Difficulty managing and supporting disparate silos (archives and data stores) across the enterprise
Key Indicators of a Problem:
 Do you need to archive email, files, SharePoint, and other unstructured data in addition to databases or other forms of structured data?
 Are you concerned about the long-term TCO for either or both archive platforms used for each environment?

Potential Pain Point: The need to back up (protect) these new data sources is threatening backup windows for existing applications and stressing existing backup facilities
Key Indicators of a Problem:
 Are you having a hard time consistently meeting your backup and recovery SLA requirements, or are your backups failing altogether?

Source: IDC, 2011

THE CHANGING NATURE OF ARCHIVAL STORAGE IN THE WORLD OF BIG DATA

Data retention via archiving has a long history as an IT practice, stretching back decades. Archiving of structured data was required for regulatory/contractual purposes or as a method of maintaining high levels of system performance (removing inactive data to free up capacity and I/O performance on databases). The archive data was rarely, if ever, accessed again and was stored on inactive media (e.g., tapes). Moving the data was expensive, time consuming, and often tied to the backup process, which made retrieval after a relatively short time (e.g., 30 to 90 days) time consuming and difficult.

The rapid growth in digital data triggered by the arrival/explosion of the Internet, along with a series of business scandals, increased the scope of the data retention/disposition problem. For IT managers in the past decade, regulation and compliance requirements mandated that organizations retain semistructured data (e.g., CDRs and trading records) for even longer periods of time. They also mandated that this archived material be more quickly accessible for eDiscovery purposes.

Concurrently, the move to online collaboration began to generate larger amounts of emails, office documents, and rich media data, which must also be retained and archived. These mandates added a further layer of complexity because just storing the data was not enough. Organizations also needed to preserve the data in context.

Both of these developments drove organizations to greater reliance on active (HDD-based) storage for their long-term data storage needs. IDC estimates that in 2010, organizations around the world deployed 4,465 petabytes of new disk storage capacity just to store copies of their data for availability and retention purposes; by 2015, they will be deploying 16,538 petabytes (see Figure 2). While solving the access time issue, this addition of another disk storage tier often poses even greater storage asset management challenges.

FIGURE 2

Worldwide Enterprise Disk Storage Consumption, 2010–2015

[Figure 2 graphic: petabytes of disk storage shipped per year, 2010–2015, broken out into structured data, replicated data, and unstructured and semistructured data]

Source: IDC's Enterprise Disk Storage Consumption Model, September 2011

Machine-Generated Data and Big Data Reframe the Data Retention Challenge

Today, the rate of data growth and the diversity of data types are reaching unprecedented levels. The traditional archiving jobs of preservation and active application offload remain daunting challenges. At the most basic level, the sheer data volume increase associated with new and fast-growing machine-generated environments can pose significant archival challenges. The shift to intelligent meters as part of a smart grid energy system would lead to a roughly 3,000-fold increase in the machine-generated data that a utility would be collecting on a monthly basis.
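That multiplier is easy to verify with back-of-the-envelope arithmetic (a sketch, not utility-specific figures): moving from one read per month to one read every 15 minutes yields roughly 3,000 readings per meter per month.

```python
# Back-of-the-envelope check on the smart meter example: one read per month
# versus one read every 15 minutes over a 31-day month.
reads_per_hour = 4                        # a read every 15 minutes
reads_per_month = reads_per_hour * 24 * 31
print(reads_per_month)                    # 2976 readings vs. 1 -- roughly a 3,000-fold increase
```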

Certain industries such as financial services (market, trading, and tick data) and telecommunications (logs, CDRs for lawful intercept) are continuously generating vast quantities of data at a rate of billions of MGD records a day. With retention requirements ranging from a few years to indefinitely, the demand for raw storage will only accelerate unless we figure out how to be smarter about how data is retained.

What's different in the new world of machine-generated data and Big Data analytics is the need to continually go back and mine this data over and over again. You're not just retaining it; you're continually reusing it.

Standard database and data warehouse applications aren't optimized to handle ingestion of such volumes of data, and they are even less suitable platforms from a cost and performance standpoint when it comes to archiving. At the same time, the previously mentioned utility can't just park the data on some tapes. Organizations are constantly deploying new sets of analytic applications that continually go back and analyze behaviors (and then make real-time adjustments) on an hourly, daily, weekly, monthly, and even yearly basis.

The context and techniques for mining that data will change and evolve. Any data retention solution that locks the retained data into a traditional hierarchical database or file structure severely impacts the long-term cost of storing the MGD. More important, it greatly reduces the long-term value of the data.

Active Archiving for Big Data

The primary data management challenge associated with Big Data is to ensure that the data is retained (to satisfy compliance needs at the lowest possible costs) while also keeping up with the unique and fast-evolving scaling requirements associated with new business analytics efforts. Organizations that strike this balance will boost efficiency, drive down cost, and be in a far better position to capitalize on Big Data innovations.

Firms must be able to mine their historical data to analyze and extract data for market intelligence, product planning, and inventory planning. In R&D environments, reuse of historical information can yield vast savings in time and effort, which in turn saves money and in some cases provides competitive advantage by shrinking the time required to bring products to market.

Today, many of these Big Data projects are best described as "junior science projects" with a small core of servers and storage assets. From a business and an IT governance standpoint, however, these kinds of "junior science projects" can quickly turn into the next "Manhattan Project," with companywide and industrywide business, organizational, and legal consequences. IT organizations need to deploy active archival storage solutions that address several major requirements:

 Rapid, continuous, and intelligent movement of "instantly historical" data from the data-generating devices/applications onto the active archive system. This ability ensures that the source application continues to run at maximum efficiency in terms of performance and reliability and that the underlying IT assets (servers and primary storage systems) aren't compromised by having to support multiple, incompatible workloads.

 Flexibility in data ingest capability. The amount of machine-generated data can vary significantly from time to time, depending on the amount of activity that is experienced by a monitoring system. Financial trade monitoring systems can experience very high levels of activity due to an external event that causes panic, which in turn could trigger a sudden surge in the number of trades. The active archive target should be able to accommodate such variation and be able to ingest data at different rates as required.

 Rapid, nondisruptive scalability of archival storage capacity and I/O performance. This modularity makes it easier to launch initial, limited machine-generated data capabilities without jeopardizing the ability to meet rapidly expanding requirements for capacity and performance. You may outgrow a specific module, but you never want to outgrow the archival platform. When you're talking about hundreds of terabytes (TB) to multiple petabytes of information, migrating to a new platform should never become necessary.

 Built-in efficiency. Unchecked data growth is bound to become a burden over time, even on an archive tier. IT organizations need to look for solutions that take full advantage of proven efficiency technologies that are purpose designed to make the most of machine-generated data and to meet capacity and cost targets.

 Flexible, nonhierarchical data organization based on an object-based storage foundation. This flexibility is critical because one of the key tenets of Big Data applications is the ability to deal with new and unpredictable data patterns. Machine-generated data required for one purpose today may prove absolutely critical for enabling some new analytic algorithm tomorrow. An object-based approach to storing information eliminates the risk that a data-organizing approach that makes sense now will render the data difficult to extract, or even useless, in the future (a generic sketch of this object-plus-metadata approach follows this list).

 Preservation of metadata. The metadata attributes of machine-generated data are much richer than those of other content types, which makes them very useful for analytical purposes. The ability to make sense of relationships between different data sets using common metadata and attributes is a key analytical value of the data. IT acquisition decision makers need to make sure that their choice of storage system does not diminish this value.
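For illustration only, the sketch below shows what the object-plus-metadata pattern looks like from client code: each record is written once and its descriptive attributes travel with it as metadata rather than being locked into a fixed schema. The endpoint, naming scheme, and header names are hypothetical and are not the DX platform's documented API.

```python
# Generic sketch of writing a machine-generated record to an object store with
# attached metadata. All names below (URL, object key, x-meta-* headers) are
# hypothetical illustrations, not a vendor-documented interface.
import json
import requests

record = {"meter_id": "A-1042", "reading_kwh": 3.7, "ts": "2011-10-03T14:15:00Z"}

response = requests.put(
    "https://archive.example.com/objects/meter-readings/A-1042-20111003T1415",
    data=json.dumps(record),
    headers={
        "Content-Type": "application/json",
        "x-meta-source-device": "smart-meter",     # attribute preserved with the object
        "x-meta-retention-class": "regulatory-7y",
    },
    timeout=10,
)
response.raise_for_status()
```

Because the attributes ride along with the object instead of living in a rigid table definition, the same records can be reorganized or queried against new attributes later without a schema migration.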

Because most organizations are new to the subject of active archiving for machine-generated content, purchase decision makers need to look for solutions and vendors that place a high emphasis on providing complete service and support throughout the implementation.

Don't Overlook Data/Information Security/Privacy

As in the case of other content types, regulation and compliance are also important considerations for machine-generated data. For example, the USA stipulates strict retention requirements for CDRs. Telecommunications organizations need to make sure that this information is stored such that it cannot be modified from the time it is created.

If the data involved is sensitive for reasons of privacy, enterprise security, or regulatory requirement, then misplacement or misuse of retained data can represent a serious security breach. More traditional database management systems support security policies that are quite granular, protecting data at both the coarse-grained level and the fine-grained level from inappropriate access.

Today, Big Data applications generally have no such safeguards. Enterprises that include any sensitive data in Big Data operations must ensure that the data itself is secure and that the same policies that apply to the data when it exists in databases or files are also enforced in the Big Data context. Failure to do so can have serious negative consequences.

The archival storage environment, as the common retention point for all machine-generated data, must enable advanced, yet easy-to-leverage, data/information security capabilities. It must include:

 The ability to automatically place specified records on disks that have WORM capability

 Monitoring and reporting capabilities, which will help IT administrators make informed infrastructure and policy decisions proactively

The remainder of this white paper examines how well Dell's Big Data Retention solution addresses the need for compliant, enterprise-class Big Data/MGD retention and on-demand access.

DELL'S BIG DATA RETENTION AND ACTIVE ARCHIVE STORAGE SOLUTION

Dell is a leading provider of IT products and services for organizations around the world. It provides the computing systems at the heart of machine-generated data devices. It is also a leader in designing and deploying servers optimized for Big Data analytics compute platforms that play a key role in monetizing the value of machine-generated data.

Now, Dell is also providing enterprise-class solutions that will be at the heart of Big Data retention and active archiving.

Dell set a goal of creating a complete archival solution (hardware, software, and professional services) that cost-effectively solves the "Big Data" retention/archive problem and enables a better way to retire and archive legacy applications. For Dell, cost-effectiveness means addressing the TCO of archival data at many layers:

 Providing an MGD/Big Data–optimized archive platform that is significantly less expensive (and more useful) than archiving that same information on individual database, data warehouse, or file systems

 Leveraging the least expensive hardware (without compromising performance/reliability)

 Maximizing the utilization of that hardware with intelligent data management/reduction software

 Reducing the ongoing operational burden (provisioning, migrating, and administrating) of the archival storage environment

The company introduced the Big Data Retention solution in 2011. It's designed to provide a low-cost, standard foundation for data ingestion, long-term retention, and on-demand data retrieval of historical data (machine generated and all other forms of semistructured data). Big Data Retention is a single platform for retaining structured, unstructured, and semistructured data across an unlimited number of data sources, formats, and types. It is based on a clustered Web storage service that utilizes a peer-scaling design that can start at 1TB and extend to exabytes. The DX Object Storage Platform abstracts underlying technology by integrating compute, network, and storage resources into one delivery unit.

Key characteristics of the Big Data Retention solution include:

 Fast deployment with minimal administration overhead and no special tuning: Minimal administration compared with the specialized DBA requirements of many traditional repositories, plus the ability to rapidly search and retrieve data using native SQL for seamless integration with existing systems in the enterprise

 Scalable performance with high data ingestion rates and fast queries: Ability to load big data volumes (billions of records/day, petabytes/month)

 Scalable, intelligent archival storage capacity to handle big data volumes: Ability to dedupe at the structured data value and pattern level, leveraging the resulting 40 to 1+ compression ratio, a 97%+ reduction in size (see the arithmetic check after this list)

 Integrated compliance features such as configurable retention rules and audits: Ability to set flexible configurable retention and expiry rules for the life cycle of the data with guaranteed read-only immutability and audited access
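As a quick arithmetic check (not a measured result), the 40-to-1 ratio and the 97%+ reduction cited in the capacity bullet above describe the same thing:

```python
# A 40:1 data reduction means the stored copy is 1/40 of the raw size.
ratio = 40
reduction = 1 - 1 / ratio
print(f"{reduction:.1%}")   # 97.5%, i.e., the "97%+ reduction in size"
```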

Dell provides customers with multiple deployment options for the Big Data Retention solution. IT organizations can deploy Big Data Retention as an on-premise system within their own datacenter. In addition, Dell will be providing a cloud-based solution running in Dell's cloud datacenters.

Dell DX Object Storage Platform Is Archival Storage at the Core of Big Data Retention

One of the key components of the Big Data Retention solution is the Dell DX Object Storage Platform. The DX allows IT organizations to effectively archive both structured data (e.g., from RDBMSs) and semistructured data (e.g., logs, call detail records, other MGD) while still supporting reliable query via SQL or any business intelligence tool using ODBC/JDBC. IT organizations can archive terabytes to petabytes of semistructured data on the Dell DX Object Storage Platform while minimizing TCO through the use of advanced data reduction capabilities built into the DX.
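As an illustration of what querying the archive over ODBC can look like from client code, the sketch below uses Python's pyodbc module against a hypothetical data source name and table; the DSN, table, and column names are assumptions for illustration, not Dell-documented objects.

```python
# Sketch of an ad hoc SQL query against archived call detail records over ODBC.
# "BigDataArchive", call_detail_records, and the column names are hypothetical.
import pyodbc

conn = pyodbc.connect("DSN=BigDataArchive")
cursor = conn.cursor()

cursor.execute(
    """
    SELECT caller_id, COUNT(*) AS calls, SUM(duration_sec) AS total_seconds
    FROM call_detail_records
    WHERE call_start >= ? AND call_start < ?
    GROUP BY caller_id
    ORDER BY total_seconds DESC
    """,
    "2011-01-01", "2011-02-01",
)

for caller_id, calls, total_seconds in cursor.fetchall():
    print(caller_id, calls, total_seconds)

conn.close()
```

The same connection could also serve reporting or BI tools that speak ODBC/JDBC, which is the point of preserving the structured form of the data in the archive.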

Key design characteristics of the DX platform include:

 System persistence. Adding/replacing hardware components (e.g., disks or controllers) is automated and nondisruptive and doesn't require any data organization changes. Only one physical migration (the original one) onto the DX platform is ever necessary.

 System resiliency. The system is self-healing. If a hardware element fails, you just plug in a replacement and the system automatically restores what needs to be restored.

 System flexibility. The system supports highly granular file-level management to enable great effectiveness of data reduction services and employs an open API so that your data won't be trapped in a single, proprietary environment.

The DX Object Storage Platform makes it possible for IT organizations to ingest billions of records per day, accumulating petabytes of data per month. More important, it ensures that this data is properly retained based on legal/governance or business analytics requirements.

Challenges/Opportunities for Dell

Given the continued rapid growth of machine-generated data and the increasing role of Big Data in organizations' new application and services plans, the storage and information management challenges posed by Dell's data-driven customers and prospects will only increase in the coming years. Dell needs to address a number of requirements as it expands its role in organizations' active archiving and data analysis environments:

 Continue to improve underlying storage hardware capacity, performance, and power management efficiencies through more tunable/intelligent automated data movement and support for even denser/more power-efficient HDD solutions

 Establish closer technical and business ties with leading analytics (Big Data) application suppliers that will make it easier for customers to fully exploit the information stored within the Big Data Retention archive

 Extend the reach of the Big Data Retention solution to better address the active archive and data mining needs of medium-sized and small businesses through further expansion of Dell's cloud-based offering

FINAL THOUGHTS

When assessing the impact of machine-generated data and supporting analytic applications on your IT infrastructure, you'll find that the challenges extend from data creation, to data collection, to data retention, and, finally, to ongoing analysis. This new environment represents both big opportunities and big challenges for CIOs. Almost every CIO dreams about making IT a more valued asset to the organization. Big Data projects are at the frontier of the business, where the majority of the most significant business expansion or cost reduction opportunities lie. Taking a lead in leveraging machine-generated data provides the CIO with a chance to be a strategic partner with business unit leaders.

Because speed is strategically important in many early efforts, it will be tempting for business unit teams to move forward without IT support. You will find, however, that Big Data issues emerge at surprisingly low data volumes. They manifest themselves when the balance between the value of data and the cost of retention becomes an issue.

Making the wrong choice in a technology decision (e.g., deciding between an OLTP, OLAP, or OLDR approach to data storage) will lead to significantly high data management and retention costs in both the short run and the long run.

It will also expose the organization to greater risks when it comes to IT and corporate governance. Your IT team needs to recognize that it must think differently (as well as quickly) and fight for a seat at the table as analytic and data archiving strategies are developed. You need to ensure that the solution:

 Reduces infrastructure cost by ingesting and querying large volumes of data on commodity infrastructure while cutting demand for physical storage through compression

 Retains huge volumes of records without a need to roll up or aggregate while managing record life cycles through configurable retention policies, preserving a secure and immutable data model

 Retrieves information speedily using standard SQL over ODBC/JDBC, enabling enterprises to leverage existing business information, reporting, and analytics investments

When selecting a storage and data management partner to help you, picking the best product isn't enough. You will need a partner that can address the entire spectrum of data assessment, data retention, and data use requirements of this new environment.

IDC believes that building successful business cases around the intersection of machine-generated data and Big Data analysis can be accomplished only through a tight alignment of critical thinking across both IT and the business. You will need a partner that can help you capitalize on new initiatives quickly and cost-effectively.

As a CIO, you want to be more involved in the business; creating the right data management infrastructure for the retention and active archiving of machine-generated data can bring your IT organization front and center in the next major business effort.

Copyright Notice

External Publication of IDC Information and Data — Any IDC information that is to be used in advertising, press releases, or promotional materials requires prior written approval from the appropriate IDC Vice President or Country Manager. A draft of the proposed document should accompany any such request. IDC reserves the right to deny approval of external usage for any reason.

Copyright 2011 IDC. Reproduction without written permission is completely forbidden.
