The Illusion of Space and the Science of Compression and Deduplication

Bruce Yellin

EMC Proven Professional Knowledge Sharing 2010

Bruce Yellin
Advisory Technology Consultant
EMC Corporation
[email protected]

Table of Contents
What Was Old Is New Again
The Business Benefits of Saving Space
Data Compression Strategies
Data Compression Basics
Compression Bakeoff
Data Deduplication Strategies
Deduplication - Theory of Operation
File Level Deduplication - Single Instance Storage
Fixed-Block Deduplication
Variable-Block Deduplication
Content-Aware Deduplication
Delta Block Optimization
Primary and Secondary Storage Optimization
In-Line versus Post-processing
Primary Storage Optimization - In-line versus Post-Processing
Secondary Storage Optimization - In-line versus Post-Processing
Source versus Target
Secondary Storage Optimization - Who Makes What?
Optimization Software versus Hardware
Data Communications Optimization
The Law and Storage Optimization
Storage Optimization "Gotchas" and Misadventures
Before You Purchase an Optimization System
Conclusion
Appendix I – MD5 and SHA-1
Appendix II – SNIA Deduplication and VTL Features
Footnotes

Disclaimer: The views, processes or methodologies published in this compilation are those of the authors. They do not necessarily reflect EMC Corporation’s views, processes, or methodologies

2009 was a turbulent year for the IT world. Faced with severe macroeconomic pressures and uncertainty, we have all participated in cost containment and reduction plans trying to do more with less. In spite of all the grim financial news, the information age continues to expand1 at an incredible rate forcing data storage to grow 40-60% a year. For example, Americans sent 110 billion text messages in a single month2 with a projected annual rate of 2 trillion messages.

The growth is attributed in part to the adoption of business intelligence suites, enterprise and web applications3, and additional regulations. In our own everyday lives, we create and share multimedia documents, depend on email, tweet, text, and instant messages.

Responding to the growth, companies add (virtual) servers, network devices, and more storage. Like an iceberg where the bulk of the mass lies beneath the surface, each additional terabyte of primary data might need 5-30 times that amount in additional tape or disk capacity. More is spent on floor space, power and cooling. There have been substantial increases in networking costs, especially as a greater emphasis is placed on disaster recovery (DR). And of course, personnel are placed under greater stress, especially when jobs do not complete on time. This is clearly not the "green" way to go!

The outlook for 2010 and beyond does not provide any relief according to a recent study4. IDC says information is increasing by a factor of 5 while budgets are increasing by only 20% and IT staff is increasing by only 10%. They also found the administrative and overhead storage costs are 4-7X the capital expense over the next four years! [Chart: IDC projects roughly five-fold growth of the digital universe in four years, from about 486 exabytes in 2008 to nearly 2,500 exabytes in 2012, driven by everything from DVDs, RFID, digital TV, MP3 players, digital cameras, camera phones, VoIP, medical imaging and laptops to sensors, GPS, email, instant messaging, videoconferencing, CAD/CAM, security systems and appliances.] Improving storage operational efficiency is imperative for organizations with flat or slightly increasing budgets while they try to also improve performance and reliability. IT leaders have begun to tier data, use thin provisioning, set hard quotas, and even archive inactive data. Most managers are also placing big bets on compression and the hottest concept, deduplication.

Compression and deduplication algorithms are powerful weapons to use against "evil" data sprawl. They both effectively reduce physical storage requirements by leveraging CPU cycles. The secret is to know how, when, and where to use them. Over the millennia, the common element has been to save space. Less space saves money, time, and even fosters new concepts that might ordinarily be considered impractical.

What Was Old Is New Again

During World War II, a blivet was a slang expression conveying the idea that it was impossible to get “ten pounds of manure in a five pound bag”. A blivet, also synonymous for a seemingly intractable problem, might have also been an apt description of trying to store 4TB of data on a 2TB disk drive, except for the science of compression. We live in a miraculous world where every so often, technology changes the rules of the game. Space compression is today’s game changer allowing us to reduce data by 30%, 60% and even 90%!

Compression, a gem of the IT world, is one of those rare technologies with a lineage going back thousands of years. The ancient Greeks wrote their documents on papyrus, a thick paper-like material from the pith of the papyrus plant5. It was scarce and scriptio continua ("continuous script" in Latin) allowed people to squeeze more words into smaller areas6 by removing spaces in sentences. For example, here is part of a 181 C.E. “letter from Apollonius and Herminus to Herodes and other managers of the public bank, authorizing them to receive the tax on the sale of a slave.”7. Look closely and you will not see a single space!

The Romans made extensive use of Latin abbreviations and left out spaces practicing scriptio continua on coins and other mementos to save space. For example, the coin to the right dating back to the Roman Empire uses these 32 letters: “NEROCLAVDCAESARAVGGERPMTRPIMPPP”

Inserting spaces gives us these individual words:
• NERO - his name
• CLAVD - part of his name, standing for Claudius
• CAESAR - an Imperial title with its roots in the family name of Julius Caesar
• AVG - AVGVSTVS (Augustus), the highest authority in Rome
• GER - GERMANICVS, Ruler or Conqueror of Germania
• PM - PONTIFEX MAXIMVS, or supreme priest
• TR P - TRIBVNICIA POTESTAS, Power or Potency of the Tribunate
• IMP - IMPERATOR, meaning the ruler is the Commander-In-Chief of the armed forces
• PP - PATER PATRIAE, or father of the country

The translation has 198 letters, spaces and punctuation: "Caesar Augustus Nero Claudius, High Priest and Ruler of Rome and Germania, Supreme Commander of the armies of Rome, the father of his country, leader of the Triumvirate for as long as he shall live."8 This produced a compression ratio of 198/32 = 6:1.

COMPRESSION OR DEDUPLICATION RATIO - The reduction in space expressed as a ratio. If we shrink 30GB down to 10GB, it has a 3:1 ratio:

Ratio = original size / compressed size = 30GB / 10GB = 3:1

I am not sure who invented the shot glass, but next time you visit a pub, notice how they benefit from compression when stacked. One night, I happily measured how their trapezoidal shape reduced their stacked height from 12” to almost 6”– a 2:1 reduction.

My favorite example of physical compression occurred while my family and I were sitting at a very small table in a Parisian café years ago. Using stemware, the waiter was easily able to maximize the table's small surface area because the bottom of the glasses fit neatly under the elevated edge of the plates. In the photo on the left, the plate and glass take up about 14.5", but when the glass stem is under the plate, they take up 12.5", a reduction of 2".

Getting back to data optimization, you may have used compression products such as WinZip or PKZip, or had data processed with advanced methods called data deduplication (which also employs data compression). Let’s journey into the world of data compression and deduplication.

DATA COMPRESSION – the digital representation of information with fewer bits stored in a smaller space or transmitted with less bandwidth. Compression takes a large data sequence and replaces it with a smaller one. Data compression produces a 2-3:1 reduction and is reconstituted on demand.

DATA DEDUPLICATION - a phrase created by database administrators wanting to remove duplicate records. The definition now includes the intelligent removal of redundant data at the file or sub-file level. File level deduplication and compression typically yields a 4-6:1 savings while fixed or variable sub-block approaches can reach 10-20:1 or higher ratios. Data is reassembled on demand.

The Business Benefits of Saving Space

The reason companies want to implement some form of space optimization is to save space, achieve faster processing, or cost effectively provide new business services. Saving space is obvious - if your file server is almost full, optimization could postpone the purchase of more storage, reduce or eliminate magnetic tape, or even gain network bandwidth efficiency. Less obvious is the increased performance possible without buying new servers, storage or bandwidth. It could even be a fundamental technology - for example, HDTV would be impossible without compression.

"Data will expand to occupy all available space. In other words, the more hard drive space and memory a system has, the larger programs become and the more memory they use." - Parkinson's Law of Data

Oracle includes data compression in its 10g and 11g releases and documented9 that full table scans of a non-compressed table took 12 seconds and only 6 seconds when compressed. They noted that servers with fast multi-core CPUs and relatively slow disk subsystems ran compressed transactions faster because the overhead from mechanical disks was reduced. Testers compressed data as much as 12:1. "…queries against compressed tables incur virtually no query performance penalty. Our full table scan experiments show that compression decreases elapsed time by 50% while increasing total CPU time slightly (5%)."

When I bought my first PC, if there was data on my floppy disk, it is likely I put it there. For the most part, with a small amount of data housekeeping, I kept growth in check. These days, I tend to be like "Oscar Madison" when it comes to disk clutter. A lot of my storage is based on email, spreadsheets, Visio drawings and PowerPoint. I have become a "data packrat" spending most of my time creating new content and very little time managing it.

DOUBTING THOMAS - every company has one. He might ask "If storage is so cheap, why are you bothering to save money on it?" True – in the late 1970's, a diskette cost $4 and $2,000 got you a 5MB hard drive10 ($400/MB, or roughly $400 million/TB!) Today, a 1TB drive costs under $100 (1/100th of a penny/MB). Your response? The concern is not the individual PC but the centralized storage supporting a company. Controlling the data sprawl of 5,000 employees is difficult with strained budgets. Running out of space is not an option. Add backups, snapshots and clones, off-site protection, record keeping, developing and testing new applications, and soon you realize space compression can be a savior.

Are you a fellow corporate data packrat? If so, you put a tremendous burden on your back-end infrastructure. Do you create documents and email them to colleagues? As time goes on, that document gets backed up weekly, placed on web sites, reused in newer efforts and even archived. This illustration shows how a 1MB document takes up 51MB when it gets backed up and might exist for years. My customers and I often discuss how none of us really knows how to curb our appetite for data, and that we are fighting a losing battle to control its growth.

With drive prices falling and wages hopefully rising, it is more cost effective to add capacity than to manage it. One estimate says that storage management can be 10X the storage cost11. The focus has shifted from the price per terabyte to the cost of everything else that touches storage.

Saving space automatically reduces or postpones future CAPEX12. As requirements grow, squeezing more data into a smaller space means you avoid buying additional storage. Many companies realize space reductions of 50% or more and achieve a compelling return on investment (ROI)13 since deferring storage purchases is immediate. OPEX14 are controlled when the cost to manage the data stays the same – i.e., the same floor space, power, cooling, etc. With tight budgets, improving storage utilization rates helps support new initiatives.

"I have made this [letter] longer, because I have not had the time to make it shorter." - Blaise Pascal

For new applications, a smaller, "greener" footprint can sometimes make or break a project. For example, 15 x 1TB disks supply about 11TB of space after formatting. With 3:1 compression, they provide 33TB of space in the same rack space with the same power and cooling requirements. EMC15, HP16 and others offer power calculators to help create a compression ROI. These tools need cents per kilowatt-hour17 power costs as shown in this "flag" chart. [Chart: average commercial power price by U.S. state, 2009, in cents per kilowatt-hour, ranging from roughly 6¢ to over 20¢.]

When primary storage is reduced, secondary storage (backups and replication of information) is also reduced for “instant savings.” Business continuity (BC) data, snapshots and backups are smaller and accomplished in far less time. When data protection time is reduced, so is risk. Reducing backup windows through space compression can be a major operational benefit that may be impossible to achieve by just “throwing hardware” at the problem.

Reducing secondary storage allows data to be transmitted faster using smaller capacity network links making DR scenarios more affordable. One of my customers saved a lot of money sending deduped data over a lower cost OC1 WAN circuit instead of a more expensive OC3. Remote office users load documents faster from the central site and benefit from a common backup infrastructure. For example, downloading a 20MB PowerPoint over a T-1 line takes about 100 seconds. Deduplicating with a 6:1 ratio cuts the amount of time to 17 seconds.

Other benefits from space compression include the ability to control the number of full-time employees (FTEs) needed to manage storage. Their OPEX includes salary, raises, healthcare, retirement benefits, desk space, etc. Lastly, if you outsource storage administration, you may be able to control management fees if your costs are based on capacity used versus raw capacity.

Space saving technology costs money, so naturally there is a great deal to consider. The next two sections explore the pros and cons of compression and deduplication strategies to help you decide on a direction. While various systems are used to illustrate the technology, this discussion is not a substitute for performing your own detailed product review.

Data Compression Strategies

Data compression is an established and routinely practiced technique that safely stores more data in the same physical space or transmits it with less bandwidth. While a blivet-like 2:1 data compression might have been unheard of a hundred years ago, the math magic is easy to explain: a short code is substituted for a longer data pattern. There are two compression categories - lossy and lossless.

LOSSY COMPRESSION – the original file is altered and cannot be recreated because some data is permanently removed. While this seems like a bad idea, it can be "good enough" if the experience is acceptable. For example, a DVD holds 4.7-8.5GB of data, yet it cannot hold a 2 hour uncompressed movie. MPEG-2 encodes onto DVDs by discarding pieces of audio and video information. DVD players use MPEG-2 decoders to decompress the movie.

LOSSLESS COMPRESSION – the original data can be reconstituted. Example - WinZip compresses and decompresses text without any dropout.

JPEG photographs use lossy compression. Cameras that allow you to select image quality and file size discard information. For example, 8 megapixel cameras create 8MB RAW image photos. Using JPEG to save 2MB images trades permanent resolution loss for storing more photos on a memory card. This is not an issue when printing photos since printers have lower resolution than compressed photos. Compare the original image on the left to the degraded one on the right. Too much data was removed when aggressive compression was used.

The balance of this paper focuses on lossless compression since it is unacceptable to lose data.

Data Compression Basics

Data is stored with fixed-size 8-bit ASCII codes where the upper-case "A" is "01000001" and a lower-case "a" is "01100001". Samuel Morse developed a communication method using very short codes for letters like "e" which occur frequently in English (1 out of every 8 letters) and longer codes for less frequent letters like "z" (1 out of 1350 letters). Please see "Static Model" for a full frequency list. [Table: Morse code time units per letter - E 1, I 2, T 3, S 3, A 4, N 4, H 4, U 5.]

In the 1830's, Samuel Morse and Alfred Vail18 developed a lossless electromagnetic device and code to send letters and numbers triggered by short and long presses of a simple switch. A single short click was an "e", a single long click was a "t", and the infrequent letter "z" was 3 short clicks followed by a space and another short click. There were no lower and upper case letters – just letters. For example, "MORSE CODE" is sent as: -- --- •-• ••• • (space) -•-• --- -•• •

Compression works when data follows a non-uniform distribution. If the characters in this paper had the same frequency and were uniformly random, compression would fail. In English, letters like “a”, “e”, “i”, “o” and “u” appear often and “x”, “q” and “z” appear infrequently.

In this section, we explore the statistical and dictionary approaches that leverage these distributions. They each encode incoming data to reduce the size according to one of these models: static, semi-adaptive and adaptive. The model follows a flow where source data is compressed and either put on disk or transmitted, and decompressed on demand.

Statistical Compression – (also known as entropy coding) assigns a probability for each character, symbol or string based on the data content. For example, if the letter "a" appeared frequently in the text, the probability would be higher and yield greater compression than a less frequent letter. When the overall coding probability matches the original data, the compression is optimal and can achieve better results than dictionary compression. Various approaches use fixed or variable length substitution codes. There are three types of statistical compression modeling – static, semi-adaptive and adaptive.

Dictionary Compression – uses an index method to scan a coded dictionary for a group of characters or a phrase and replaces it with a unique pointer to the location in the dictionary. The pointer is much smaller than the set of characters or phrases and as a result, achieves a considerable amount of compression. The dictionary contains phrases that are found in the incoming text or expected to be found. This approach is sometimes called substitution coding. Dictionary compression also comes in static, semi-adaptive and adaptive forms.

Here is a description of these three models:

1. Static Model – a fixed, rigid model or codebook that never changes. It is very efficient for predicted data and requires one pass to perform the compression or decompression. It is not suggested for general purpose use. For example, compressing English text using a static codebook might never be optimal. "H" occurs 6.1% of the time in English and 7.9% of the time in "The Brothers Grimm Fairy Tales", yielding sub-optimal compression. It is even less efficient on Spanish text since "H" occurs only 0.7% of the time. Some text is over optimized while other text is under optimized. A case where static modeling works well is when a word-based Bible codebook compresses a Pastor's sermon. That same book performs poorly with "War and Peace". [Table: letter frequencies of English compared to "The Brothers Grimm Fairy Tales" and Spanish - for example, "E" appears 12.7%, 12.9% and 13.7% of the time respectively, while "H" appears 6.1%, 7.9% and 0.7%.]

Approaches that use statistical modeling with a static codebook include Static Huffman, Morse Code, and Golomb-Rice coding19. Approaches using dictionary compression with a static codebook include digram coding (2 letter groupings). The most common English digrams are TH, HE, AN, IN, HA, OR, ND, RE, ER, ET, EA, and OU.20

2. Semi-Adaptive Model – the codebook is generated statistically during the first pass of the data stream. In the second pass, data is encoded with that codebook. The final codebook is included with the compressed data so a decoder receiving the stream can decode it. This method is optimal for just a specific data source. While larger codebooks increase efficiency, the codebook can overwhelm the size of the original data with a short data stream becoming larger after compression than the original (i.e., negative compression). This effect is easily seen with random (uniform) data. Two pass methods are generally not used with real-time applications or for data communications.

A design that uses statistical modeling with a semi-adaptive codebook is semi-adaptive Huffman21, and a design that uses dictionary compression with a semi-adaptive codebook is WLZW compression (a word-based modification of classic LZW)22
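To make the two-pass idea concrete, here is a minimal Huffman sketch in Python (the language is my choice for illustration; this is not any product's implementation). The first pass counts character frequencies to build the codebook, and the second pass measures how many bits the encoded text would need compared to fixed 8-bit ASCII.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman codebook from the character frequencies observed in 'text'."""
    freq = Counter(text)                                   # pass 1: gather statistics
    # Each heap entry: (frequency, tie-breaker, {char: code-so-far})
    heap = [(f, i, {ch: ""}) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                                     # degenerate case: one distinct symbol
        return {ch: "0" for ch in heap[0][2]}
    while len(heap) > 1:
        f1, i1, c1 = heapq.heappop(heap)                   # two least frequent subtrees
        f2, i2, c2 = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in c1.items()}
        merged.update({ch: "1" + code for ch, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, min(i1, i2), merged))
    return heap[0][2]

text = "now is the time for all good men to come to the aid of their country"
codes = huffman_codes(text)
encoded_bits = sum(len(codes[ch]) for ch in text)          # pass 2: size of the encoded stream
print(f"8-bit ASCII: {len(text) * 8} bits, Huffman: {encoded_bits} bits")
print(f"Ratio: {len(text) * 8 / encoded_bits:.2f}:1")
```

Frequent letters like "e", "t" and the space receive the shortest codes, exactly the behavior Morse exploited by hand.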

3. Adaptive Model – the most popular general method is an adaptive codebook. The statistics can be built on top of a basic initial set of values or from scratch. As data flows into the model and is broken into phrases, it is statistically analyzed to construct the probabilistic codebook at that moment. The phrase is sent uncompressed until the same phrase is found later in the input stream, at which point a much smaller pointer is sent in its place. There is no need to send the codebook as it is automatically regenerated by the decoder using the same encoder process – i.e., data is dynamically decompressed as a function of the incoming data stream. Unlike semi-adaptive, the codebook optimally reflects the incoming data and it works in a single pass.

The UNIX "compact" program and adaptive Huffman coding are examples of statistical compression with an adaptive codebook. Dictionary adaptive compression schemes are usually faster than statistical adaptive compression because they use fewer CPU cycles and less memory. Examples of dictionary compression with an adaptive codebook are the popular LZ77, LZ78, WinZip, PKZip, and Gzip coding models23.
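The adaptive dictionary idea behind the LZ family can be sketched in a few lines. This is a minimal LZW-style encoder for illustration only (not how any of the products above implement it): the dictionary starts with single bytes and grows as phrases repeat, so the decoder can rebuild the same dictionary without ever receiving a codebook.

```python
def lzw_compress(data: bytes):
    """Minimal LZW encoder: returns a list of dictionary codes."""
    dictionary = {bytes([i]): i for i in range(256)}       # seed with all single bytes
    phrase, output = b"", []
    for byte in data:
        candidate = phrase + bytes([byte])
        if candidate in dictionary:
            phrase = candidate                             # keep extending the current phrase
        else:
            output.append(dictionary[phrase])              # emit code for longest known phrase
            dictionary[candidate] = len(dictionary)        # learn the new phrase
            phrase = bytes([byte])
    if phrase:
        output.append(dictionary[phrase])
    return output

sample = b"TOBEORNOTTOBEORTOBEORNOT"
codes = lzw_compress(sample)
print(f"{len(sample)} input bytes -> {len(codes)} output codes: {codes}")
```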

There are also “specialty” data compression methods like Run-Length Encoding (RLE). It produces great results when data contains repetitive characters as in the case of images and audio streams. FAX machines use RLE and static Huffman compression. It works by replacing repeating symbols with a single symbol and a count at the byte or bit level.
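A minimal byte-level RLE sketch (an illustration only, not the author's 10-line program described next) shows the substitution of a count and a symbol for each run:

```python
def rle_encode(data: str) -> str:
    """Replace runs of a repeated character with <count><char>, e.g. 'XXXX.' -> '4X1.'."""
    out, i = [], 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i]:
            run += 1
        out.append(f"{run}{data[i]}")
        i += run
    return "".join(out)

def rle_decode(encoded: str) -> str:
    """Reverse of rle_encode; assumes the original text contains no digit characters."""
    out, count = [], ""
    for ch in encoded:
        if ch.isdigit():
            count += ch                  # accumulate multi-digit counts
        else:
            out.append(ch * int(count))
            count = ""
    return "".join(out)

line = "XXXXXXXX..XXXXXXXXX"
packed = rle_encode(line)
print(packed, rle_decode(packed) == line)   # prints: 8X2.9X True
```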

I wrote a simple 10 line program to read an ASCII file and compress it with RLE. I was able to reduce a familiar ASCII pattern from 912 characters to 441, or about 2:1. With a bit more time, I could have achieved a 3:1 reduction. [Figure: the original ASCII art and the RLE-encoded string the program produced.] My program created the encoded string shown in the figure. For example, "11X15 12;" codes the first line as blank space (11 times), X (15 times), blank space (12 times); the semicolon marks a new line.

In today’s world, you will likely see the statistically adaptive Huffman code and the dictionary adaptive LZ24 code (created by Abraham Lempel and Jacob Ziv). There are hundreds of other coding designs that leverage Huffman or LZ methods.

Compression Bakeoff

New compression algorithms are tested using datasets from the Calgary and Canterbury Corpuses25. While these older collections are useful, a more modern set of test data can be found on Werner Bergman's website26. From this data, I charted each method against the original source. With 316GB of data from 510 files of various data types, here are some methods (out of 245) that "model 'real-world' performance of lossless data compressors." All of these methods use a base of the dictionary-based LZ compression and then add additional algorithms, such as statistical Huffman compression.

Program  | Compressed Size (GB) | Comp Ratio % | Comp time (sec) | Decomp time (sec) | Efficiency (lower is better) | Method(s)
ARJ      | 116 | 63% | 35  | 10  | 17,469 | LZ, Huff
GZIP     | 115 | 64% | 16  | 10  | 9,224  | LZ
JAR      | 101 | 68% | 37  | 9   | 3,386  | LZ, Huff
LHARK    | 113 | 64% | 27  | 10  | 10,517 | LZ
LZ2A     | 114 | 64% | 138 | 19  | 45,869 | LZ
LZPX(J)  | 92  | 71% | 110 | 111 | 5,927  | LZ, …
LZTurbo  | 114 | 64% | 6   | 9   | 4,601  | LZ
PKZIP    | 115 | 64% | 16  | 12  | 8,955  | LZ
SLUG     | 112 | 65% | 4   | 7   | 2,857  | LZ
Tornado  | 106 | 67% | 14  | 8   | 2,690  | LZ, Arithmetic Coding
WinACE   | 90  | 72% | 74  | 12  | 1,787  | LZ, Huff
WinRAR   | 90  | 71% | 46  | 10  | 1,243  | LZ, Huff, Prediction Partial Match
WINZIP   | 86  | 73% | 95  | 32  | 1,786  | LZ, Huff, Shannon-Fano, Prediction Partial Match
TestSet  | 316 | 0%  | 0   | 0   | 0      | test data

From this chart, WINZIP has the best ratio, reducing 316GB to 86GB (73%) with a combination of LZ, Huffman, Shannon-Fano27, and Prediction Partial Match28 methods. PKZIP (64% and 115GB) produces average results using primarily the LZ method. [Chart: compressed size (GB) and compression ratio (%) for each program.]

We get an efficiency ranking by factoring in the time it takes to compress and decompress the data. PKZip (compress 16 seconds, decompress 12 seconds) was 6X faster than WinZip (compress 95 seconds, decompress 32 seconds), although WinZip achieves a higher compression ratio. [Chart: compression and decompression times in seconds for each program.] WinZip would give you storage efficiency without regard to processing time, but if you wanted real-time communications, you would pick PKZip. The chart also shows that LZPX(J) and LZ2A are probably unacceptable for general use.

The same website offers an efficiency calculation (left to the reader to examine) which takes into account compression scores and timing. The lower value represents the more efficient method. While it depends on how "efficiency" is defined, WinZip is favored over PKZip. [Chart: efficiency score per program (lower is better).]

Compression success is all about ratios. To the untrained eye, there is a big difference between 100:1 compression and a 500:1 ratio, but in reality they are very close. For example, 100:1 is a 99% reduction, so a 1TB allocation is reduced to just 10GB. A 500:1 reduction translates to 2GB. How much value should you place in an additional 8GB savings? If 1TB cost $100, 2:1 compression drops the effective cost to $50/TB, 3:1 drops it to $33/TB, and 10:1 drops it to $10/TB. The difference between 100:1 and 500:1 saves you only about 80¢/TB more ($1.00 versus $0.20). You should decide if paying more for a significantly higher ratio has value. [Table: space reduction and effective cost per terabyte at ratios from 1:1 to 500:1, assuming $100 per raw TB.]
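The arithmetic behind that comparison is easy to reproduce. This quick sketch assumes the $100/TB example price used above:

```python
def effective_cost_per_tb(raw_cost_per_tb: float, ratio: float) -> float:
    """Cost per logical TB stored once data is reduced by 'ratio':1."""
    return raw_cost_per_tb / ratio

for ratio in (1, 2, 3, 4, 5, 10, 100, 500):
    reduction = (1 - 1 / ratio) * 100            # e.g. 10:1 -> 90% reduction
    cost = effective_cost_per_tb(100, ratio)
    print(f"{ratio:>4}:1  reduction {reduction:5.1f}%   ${cost:7.2f}/TB")
```

The curve flattens quickly: most of the savings arrive by 10:1, which is why chasing extreme ratios rarely changes the business case.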

Some applications, such as MS Office 2007, use compression to store documents – Word ".docx", Excel ".xlsx" and PowerPoint ".pptx" use LZ "zip"29.

Data Type | Compress? | Dedupe? | Reduction | Why?
MS-Office Files | Yes | Yes | ~3:1 | Observed results
MS-Office Files (XML) | No | Yes (1) | Varies (2) | Data is compressed by Office when saved
Video Surveillance Data | No | No | None | Data is compressed at the camera; time stamps on each frame prevent block-based deduplication
Video Files (avi, wmv, etc.) | No | Yes (1) | Varies (2) | When blocks are part of identical video files; store deduplicated files in multiple folders or directories
Virtual Server Images | Yes | Yes | Very High | Cloned images require the same amount of physical storage space as the template from which they were created, but their content is largely reduced

1. Multiple occurrences of the same Microsoft Office files, video files and other file types deduplicate effectively if duplicate files are stored using different file names or in different folders.
2. Overall capacity reduction for these data types depends upon the number of occurrences.

Compression is also part of various file and operating systems. In 1990, Stac Electronics released "Stacker" to double disk capacity allowing a 20MB hard drive to provide 40MB of capacity. In 1993, Microsoft released DoubleSpace with MS-DOS 6.0. Both products were based on LZ compression. Decades later, the Stacker is still in use in Cisco routers offering the IP Payload Compression Protocol.

In 1995, Microsoft incorporated LZ77 compression in Windows NT 3.51 as part of NTFS. All the administrator needs to do is check a box on a property page to enable it. Sun Microsystems incorporates transparent LZJB compression in their open-source ZFS30 file system. “In addition to reducing space usage by 2-3X, compression also reduces the amount of I/O by 2-3X. For this reason, enabling compression actually makes some workloads go faster.”31

Back then, compression fell out of favor because it ran slowly and put a strain on the CPU. These days, compression’s comeback is a result of faster multi-core processors which mask slow disk and network speeds. Users find it makes sense to trade CPU cycles for fewer I/Os or smaller bandwidth. Modern CPUs “…are 100 times faster than those of the early ‘90s, but the disk I/O channel is only 16 times bigger than it was then.”32

Data compression is also vital for lowering communication costs. For example, Cisco’s 7000 router series offer 2.3:1 hardware compression ratios33. Please keep in mind that ratios are difficult to predict since they vary by the second as different data types are transmitted. Vendors recommend against comparing product ratios since it leads to false expectations – you should do your own testing with real data. Cutting bandwidth requirements in half or more can make remote communications affordable.

Compression and deduplication (see the next section) significantly reduce transmission time. As shown, sending a 1TB file over an OC-3 circuit takes 930 minutes or 15½ hours. It could be sent in 310 minutes (about 5 hours) with a 3:1 reduction or half that time with a 6:1 reduction. [Table: time to transmit data in minutes (4KB block size). For 1TB - no compression: 100Base-T 1,645, GigE 164, 10GigE 16, 4Gb FC 874, T-1 106,530, T-3 3,655, DS-1 107,295, DS-3 3,677, OC-1 3,173, OC-3 930; with 3:1 compression: 548, 55, 5, 291, 35,510, 1,218, 35,765, 1,226, 1,058, 310; with 6:1 deduplication and compression: 274, 27, 3, 146, 17,755, 609, 17,882, 613, 529, 155. The original also includes rows for 1GB, 100GB and 10TB.]
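The table's numbers can be approximated with a simple calculation. This sketch assumes a nominal OC-3 line rate, binary terabytes and no protocol overhead, so its results differ slightly from the figures above:

```python
def transfer_minutes(size_gib: float, line_mbps: float, reduction: float = 1.0) -> float:
    """Minutes to move size_gib (binary GB) over a link of line_mbps after a 'reduction':1 shrink."""
    bits = size_gib * 8 * 1024**3 / reduction
    return bits / (line_mbps * 1_000_000) / 60

OC3 = 155.52   # nominal OC-3 rate in Mbps (assumption; usable payload rate is a bit lower)
for label, ratio in (("no reduction", 1), ("3:1 compression", 3), ("6:1 dedupe + compression", 6)):
    print(f"1 TB over OC-3, {label:<24}: {transfer_minutes(1024, OC3, ratio):5.0f} minutes")
```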

Data Deduplication Strategies

As we discussed, primary data is growing rapidly which in turn creates a backup burden. Available bandwidth is also an issue when data is sent to another site. Enter data deduplication. Deduplication is aimed at identical data patterns. Typically, the change rate for primary data is less than 5%34 a day. When it is backed up daily or weekly, 95% or more has not changed since the last copy, so a twenty-fold or greater space reduction is possible. That far exceeds what compression can achieve. Dedupe's reduction rates can lower both capital and operating expenses for backup (secondary) storage. That is why deduplication is a hot topic these days.

Deduplication is not strictly tied to secondary storage. File server consolidation increases the amount of redundant data making dedupe practical for primary storage. Network bandwidth can be reduced when redundant data is removed before it is sent and added back in when received.

Deduplication does not work in every case. For example, one of my customers tried it on encrypted data and it did not deduplicate or compress. By design, encrypted data appears random so identical documents look unique. Other environments where deduplication will not make sense are those where the change rate is high, or the data is fairly unique or temporary in nature.

The benefits of deduplication resemble those found with compression:
1. Provides more usable space from the same set of storage devices.
2. Simplifies management.
3. Backups run faster and can make BC/DR practical.
4. Saves money, power, and labor by postponing additional storage purchases.

Deduplication - Theory of Operation

Deduplication is easily illustrated with a building block example35. A house is built from plans (metadata) with pieces arranged in a specific order. Reduce the house down to basic shapes (the end state for deduplication) and you could precisely reconstruct the house at will.


Deduplication could not exist without pointers. At a high level, pointers are addresses or values that give the location of data blocks on a disk system. It is no different than looking up a name in a phone book to get an address. Once a deduplication algorithm knows the same data already exists, it creates a pointer to that block in the new data stream and removes the duplicate block. Some people get confused when pointers are discussed and get that "deer in the headlights" look. They are often surprised to learn that Windows and other operating systems rely on pointers and do not write their MS-Word documents to disk as a single image or as contiguous blocks. Pointers also support data growth and are part of everyday processing. Working with metadata, they help prevent duplicate blocks from being stored or transmitted a 2nd or 3rd time.

In this pointer example, "Myfile.doc" has 4 blocks (similar to paragraphs in a document). The hypothetical directory layout is:

Filename | Start Block | End Block | # Blocks
Myfile.doc | 0 | 1 | 4
Myfile(2).doc | 2 | 7 | 4
Yourfile.doc | 8 | 15 | 4

Please follow the pointers to reassemble the blocks for Myfile.doc:
• "Start Block" is the first pointer and points to block #0,
• which then has a pointer to block #4,
• followed by block #5,
• and finally block #1 which is under "End Block".

When Microsoft Word opens "Myfile.doc", the operating system retrieves all 4 blocks [0,4,5,1] in the proper pointer sequence.

Deduplication replaces redundant blocks with pointers to unique blocks. This diagram36 shows how three documents look "before" and "after" deduplication. Source data is fed into a dedupe algorithm where the most popular approaches create a unique code or "signature" for the entire file or blocks of the file. Only identical files (or identical blocks) can have the same signature. [Diagram: the same three documents before and after deduplication, with duplicate numbered blocks replaced by references to a single stored copy.]

Signatures and reference counts (used to track how many duplicate files or blocks exist) are kept in a database or "master index." An entry in the master index means a file or block (at times called a chunk) has already been stored on the target disk and should not be stored a second time. Duplicates are detected when an incoming signature matches an existing signature in the master index. With a match, a very small pointer to the original file or block is written to the target disk instead of the much larger file or block. In addition, a reference counter is incremented so the system can track how many duplicates are in use. If there are multiple instances of the same file or block, the master index tracks it with multiple pointers. Without a match, the data is compressed, written to the target disk, and the master index is updated with the signature, pointer, and an initialized reference counter – all awaiting a future match. On demand, data is reconstituted or "hydrated" to its original pristine form by combining unique data with deduplicated data as indicated by the master index, pointers and metadata.
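To make that bookkeeping concrete, here is a minimal sketch of a master index (an illustration only, not any vendor's implementation), using Python's hashlib for signatures and zlib as a stand-in compression engine:

```python
import hashlib, zlib

class DedupeStore:
    """Toy block store: fingerprint each chunk, keep one compressed copy plus a reference count."""
    def __init__(self):
        self.index = {}      # signature -> {"data": compressed bytes, "refs": count}

    def write(self, chunk: bytes) -> str:
        sig = hashlib.sha1(chunk).hexdigest()            # the block "signature"
        entry = self.index.get(sig)
        if entry:                                        # duplicate: just bump the counter
            entry["refs"] += 1
        else:                                            # unique: compress and store once
            self.index[sig] = {"data": zlib.compress(chunk), "refs": 1}
        return sig                                       # the "pointer" kept in file metadata

    def read(self, sig: str) -> bytes:
        return zlib.decompress(self.index[sig]["data"])  # hydrate on demand

    def delete(self, sig: str):
        entry = self.index[sig]
        entry["refs"] -= 1
        if entry["refs"] == 0:                           # no file references the block any more
            del self.index[sig]                          # space reclamation

store = DedupeStore()
pointers = [store.write(b"A" * 16384), store.write(b"B" * 16384), store.write(b"A" * 16384)]
print(len(store.index), "unique blocks stored for", len(pointers), "writes")
```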

CAUTION - it is not enough to check for the same file name, size or date modified. You need to know their contents are identical. In this example, could you be 100% sure these files were the same by looking at the obvious attributes?

Deduplication fundamentally differs from compression when the file or sub-file contains data previously deduped. Compression does not track which files it already processed while deduplication knows all about the data that it has transformed. The next time it processes an identical file or sub-file, it concludes it is a duplicate and substitutes pointers for the actual data. For example, if you have ten identical files to compress, you would get ten compressed files each about 1/3 the original size. If you deduplicated ten identical files, you could easily get a single file (which would then be compressed by about 2/3) and nine pointers to that file.

To restore deduplicated data, you need the same amount of space it originally had - hydration can create an “insufficient storage” situation.

Deduplication systems allow the creation of tape copies of hydrated data. Some systems, like Commvault’s Simpana, support tape copies of deduplicated data.

From a different perspective, a full backup of 1,000 Windows XP users has 1,000 copies of the 10MB Movie Maker directory. Given that directory never changes, the backup should really only have one deduped and compressed 5MB directory and 999 pointers – a 2,000:1 reduction! Then multiply that savings by the number of other files that never change!

In this dedupe example37, assume I had 3 x 4MB documents, each with 4 x 1MB blocks of data:
• "Myfile.doc" is my original Word document and internally contains unique blocks 1-4.
• I email you Myfile.doc and you store it on the same server as "Myfile(2).doc". It is exactly the same text as Myfile.doc and logically has blocks 1-4. The deduplication system knows it already processed these blocks because the sub-file hash codes are in the master index, so it just stores very small pointers to those unique blocks. For now, think of a hash code as a unique numerical identifier for a text block just like each U.S. citizen has a personal social security number.
• You then change my name to yours and save it as "Yourfile.doc". It is very close to the original document but differs only by the name change. Deduplication realizes the changes are just in block 5 and stores it along with pointers to blocks 2-4.

Here is a view of simple deduplication math. With a 1GB block, Myfile.doc is 4GB. Once Myfile(2).doc is deduped, the 1GB blocks are replaced by tiny pointers saving a sizable amount of space. Myfile(2).doc now occupies only 0.000000004GB instead of 4GB. Yourfile.doc has a unique block #5 from the name change in the previous example, so it occupies 1GB followed by pointers to the unchanged blocks 2-4; Yourfile.doc therefore occupies 1.000000003GB of space. [Table: logical versus deduplicated sizes - Myfile.doc stores blocks #1-#4 at 1.0GB each (4.0GB); Myfile(2).doc stores four pointers of 0.000000001GB each (0.000000004GB); Yourfile.doc stores block #5 at 1.0GB plus three pointers (1.000000003GB).]

Pointers and signatures are shown in this diagram of the altered Yourfile.doc. A unique digital signature is created based on whether file or sub-file deduplication is used. The signature is created from a hash algorithm and is checked against values already collected by the deduplication system. In this example, Yourfile.doc has one unique block (#5) and pointers to blocks 2-4. [Diagram: the hash index is computed on either the whole file or on fixed/variable chunks; if the hash is already in the master index, store a pointer and increment the counter, else store the block, create a pointer and initialize the counter. The end result is a linked list of blocks and pointers to deduplicated disk blocks.]

Similar to how a file system works, when a user deletes a deduped file, just the pointers to the blocks are deleted. A reference counter is also decremented so the system can determine if there are other active files still using those blocks. When the counter reaches zero, the block is marked for deletion and the space is made available after a reclamation process is run.

2010 EMC Proven Professional Knowledge Sharing 19 MESSAGE-DIGEST ALGORITHM 5 (MD538) AND SECURE HASH ALGORITHM (SHA-139) - Creating unique dedupe signatures for files or blocks is critical. Using a 68 character example "Now is the time for all good men to come to the aid of their country“, MD5 produces the unique 128-bit, 32 digit 59195b3b3f080275f3a6af7acdd31a5c hexadecimal hash code. Changing a single letter, such as the capitol letter “N” to a lower case “n” yields the new value 60a5d53255faa66033c9851adb40947a. The Declaration of Independence is 5a5e2718adb958ac52e1680a665aa96f.

SHA-1 produces a 160-bit, 40 hexadecimal digit code. Our original text yields 48770a59c6a5b8215239a8569534082b6fbbb734. SHA-512 uses 512 bits.

These are public domain algorithms. Please see the appendix for more details.
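Digests like these are easy to reproduce with a standard library. Whether the output matches the printed values depends on the exact byte string that was hashed, so treat this Python snippet as an illustration of the mechanism rather than a verification of the paper's figures:

```python
import hashlib

text = b"Now is the time for all good men to come to the aid of their country"
print("MD5  :", hashlib.md5(text).hexdigest())     # 128-bit digest, 32 hex digits
print("SHA-1:", hashlib.sha1(text).hexdigest())    # 160-bit digest, 40 hex digits

# Flip a single letter and the digest changes completely (the avalanche effect).
print("MD5  :", hashlib.md5(b"now" + text[3:]).hexdigest())
```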

Let's backup Myfile.doc weekly with dedupe. For simplicity, we will ignore compression. A full backup of three 4GB files takes 12GB of space. If these files are unchanged, backing them up for 10 weeks uses 120GB of space. Deduping week #1 uses 5.000000007GB (just blocks 1-5 get backed up once). Week #2 uses only 0.000000012GB and contains just pointers - blocks 1-5 did not change and pointers replaced the 1GB blocks. After 10 weeks, the backup is just over 5GB compared to the original 120GB – a 24X reduction! [Table: weekly backup sizes - a 12GB full backup each week grows to 120GB after 10 weeks, while the deduplicated backup stores 5.000000007GB in week #1 and adds 0.000000012GB of pointers each later week, reaching 5.000000115GB by week #10.] Here is a visual representation of what you might expect using dedupe on a theoretical first backup compared to the second backup. Only red blocks are backed up, showing the impact dedupe makes40. [Figure: three backup streams on the first backup (ABCDEFAG, HIJKAHLM, NOIAPDQR) and on the second backup (ABCDSFAG, HITKAHLM, NOIAPDQR); only the changed blocks, shown in red, are stored the second time.]

Deduplication can occur at the file level or at the sub-file level. Sub-file dedupe is performed in one of four ways: fixed block, variable block, content-aware and delta block. Each has its pros and cons. For example, byte level analysis can yield a high dedupe ratio at the expense of processing time and memory utilization.

[Diagram: your original data can be deduplicated with compression at the source or the target, in-line or post-processed, using file-level, fixed block, variable block, content aware, or delta block deduplication.]

File Level Deduplication - Single Instance Storage

File level deduplication (often called single-instance storage or SIS) proves whether two files are identical, and if they are, just one is kept. As we have seen, it is not sufficient to check if file names or sizes are the same. A bit-for-bit comparison could conclude they were the same, but an easier and faster way is to compute the file’s hash signature and compare it to a master index of previously processed signatures. If the computed signature exists in the master index, it is a duplicate and only a very small pointer needs to be stored on disk or transmitted. In addition, a counter in the master index is incremented and other metadata is logged.
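A minimal sketch of the single-instance idea (an illustration only; real SIS implementations also handle links, locking and reparse points): hash whole files and consult a master index of signatures already seen.

```python
import hashlib
from pathlib import Path

def find_duplicate_files(root: str):
    """Walk 'root', fingerprint every file, and report files whose contents already appeared."""
    master_index = {}                        # SHA-1 signature -> first path seen
    duplicates = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        try:
            signature = hashlib.sha1(path.read_bytes()).hexdigest()
        except OSError:
            continue                         # skip unreadable files
        if signature in master_index:
            duplicates.append((path, master_index[signature]))   # candidate for a SIS link
        else:
            master_index[signature] = path
    return duplicates

for dup, original in find_duplicate_files("."):
    print(f"{dup} duplicates {original}")
```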

If there is no match, the file is unique and stored on disk or transmitted. The new signature is placed in the master index along with a disk pointer, an initialized counter and other metadata. (Microsoft was assigned patent 5813008 in 1998 for "single instance storage of information".)

Should the source file ever get modified or altered by just a single bit, as in the case of putting your own name on a colleague’s PowerPoint presentation, it becomes a unique file in its own right even though the majority of the PPT has not changed.

I used a free “Easy Duplicate Finder” utility on my laptop and learned that 19% of my files were duplicates. Since Windows XP does not deduplicate files, I left it alone (I told you I was like Oscar Madison). Saving space may not be cost effective on laptops, but dedupe makes sense on file servers with hundreds of thousands of files supporting hundreds of users. To locate duplicate files, try asking your dedupe vendors for a free analysis or invest in a tool like Northern’s Storage Reporter41.

Some operating systems incorporate file level deduplication. Windows 2000 implemented it through “SIS links” and “groveler” modules. “The [SIS] filter keeps the data that backs SIS links in files in a special directory called the SIS Common Store. SIS links may be created in two ways: A user may explicitly request a SIS copy of a file by issuing the SIS_COPYFILE file system control, or SIS may detect that two files have identical contents and merge them. The groveler (so called because it grovels through the file system contents) detects duplicate files.”42


"Microsoft IT reports that SIS has reduced storage on its file servers by 25 percent to 40 percent, depending upon the type of content stored."43 This table shows sizeable SIS savings with most of it coming from duplicate system files. Up through Exchange 2007, SIS helped control the storage proliferation caused by mailing the same message or attachment to multiple people (it was removed from Exchange 2010). [Table: SIS space savings by server type - client software install shares saved 24-33%, server software install shares 48%, international version product shares 42%, archived products 63%, and remote installation services 40%, totaling nearly 8,000GB across 274 servers.]

File-level dedupe is highly efficient because the entire object is removed with just a few CPU cycles. Popular use cases include file archiving, data governance control, and graphical or previously compressed data.

An ESG Group study examined file-level versus variable-block deduplication for archiving primary storage and found only a small difference between the two methods44. Clearly, you should examine this study in more detail, but it does point out that the nature of the data can dictate which method to use.

Some applications are focused on processing objects - a use case for file level dedupe. For example, with email you often take the entire object and send it to another person or mailing list. Archived or "aged-out" files, such as email that must be legally retained, original legal documents with signatures, blueprints, etc., also do well with file level dedupe. While backup does not alter your primary copy of data, archiving replaces the "gold" file copy in your primary system with a pointer. Many companies want archived documents treated as complete objects and not sub-file deduplicated. There are also legal concerns – please see the section on the law and storage optimization.

Some data types do not lend themselves to sub-file dedupe yet benefit from file-level dedupe. For example, medical images, MPEG4 video, digital photos, CAD drawings, and audio already use compressed formats and appear too unique to benefit from sub-file level processing. File level dedupe is effective with primary storage when multiple copies of the same file exist and with secondary storage backups of information that does not change, as in the case of medical X-rays which are never altered after they are created.

Fixed-Block Deduplication

As we’ve seen, file level deduplication is great with an exact match between a new document and one already processed, but it does not address duplication within or between documents. Fixed-block dedupe is a refinement that works at the sub-file level. For example, a Windows NTFS file system uses fixed size clusters allowing dedupe to work efficiently on pre-parsed blocks or “chunks”. Each block is assigned a unique signature just like file level dedupe.

In this behind the scenes example, a 128KB data file is first broken into 8 blocks45. Each 16KB block is processed and given a unique hash signature for comparison against the master index. A match causes a pointer to be stored on the target disk or transmitted instead of the entire block. Blocks 1 and 5 have the same codes so block 5 is replaced with a pointer, thus saving 16KB of space. A reference counter in the master index is updated and other metadata is stored. Without a match, the hash, a disk block pointer, and an initialized counter are put in the master index. The disk block is then compressed and stored on the target disk or transmitted.
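The chunking step itself is easy to sketch. Here, fixed 16KB blocks (the example's size, not a standard) are fingerprinted and only unique blocks are kept:

```python
import hashlib

def fixed_blocks(data: bytes, block_size: int = 16 * 1024):
    """Split data into fixed-size chunks and fingerprint each one."""
    for offset in range(0, len(data), block_size):
        chunk = data[offset:offset + block_size]
        yield offset, hashlib.sha1(chunk).hexdigest(), chunk

data = bytes(128 * 1024)             # a 128KB buffer of zeros: every block is identical
index = {}
for offset, sig, chunk in fixed_blocks(data):
    index.setdefault(sig, chunk)     # store each unique block only once
print(f"{len(data) // (16 * 1024)} blocks, {len(index)} unique")   # 8 blocks, 1 unique
```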

When the block size is too large, you begin to approximate the inefficiency of file level deduplication. The smaller the block size, the more likely the duplication match, but the higher the CPU and memory overhead. The vendor community has not standardized on a block size.

NetApp’s WAFL operating system uses 4KB blocks upon which they also assign deduplication hash codes46. In contrast, Symantec uses 128KB blocks for its PureDisk dedupe product47. Hifn’s real-time deduplication and LZ compression PCI adapter for primary and secondary storage servers uses a configurable fixed size approach48. Their BitWackr™ SHA-1 algorithm “produces a 160-bit hash from fixed length data blocks” which are set from 4-32KB at file system creation time. The adapter, used by companies such as Quantum, offloads processing from the host CPU. OEM prices start at $500.


Fixed-block works well with daily and weekly backups because the backup application already performs the chunking. It also does a good job with structured files like databases.

There are drawbacks to fixed block dedupe. First, adding just a letter in an initial block can throw off ensuing blocks causing them to be unique. For example, deleting a single word on this page causes this block and subsequent blocks to change. The second drawback is that block level designs use more processing resources to maintain the master index than a file-level approach. Since the index retains unique hash codes for each block, a system supporting 100TB of data would need 6.25 billion objects with a 16KB block or 25 billion objects with a 4KB block. Based on the hash code size, the smallest index needs an additional 175-700GB of disk.

Variable-Block Deduplication

What differentiates variable-block from fixed-block is the additional efficiency gained by moving from pre-set boundary chunking to a largest common substring method. As shown49, variable-block dedupe increases the likelihood that duplicate blocks are found even with changes to parts of the data. Similar to how certain compression designs work best against specific data types, explicit pattern analysis methods break data into optimal variable blocks by scanning for semantic breaks. Should the variable-block size equal the entire file (very inefficient), you get file level deduplication.

"Clipper Notes"50 ran an article that showed the power of variable-block over fixed-block. Using our earlier example, you will see in the bottom two groups of variable blocks that the word "THE" is separated from the word "THEIR", which does not occur elsewhere in the blocks. [Figure: "NOW IS THE TIME FOR ALL GOOD MEN TO COME TO THE AID OF THEIR COUNTRY." broken first into fixed blocks (each block gets a unique hash value) and then into variable blocks, where "THE" is detected and placed in its own block. When the first word "NOW" is changed to "TODAY", only the first variable block changes; variable blocks are very resilient and accommodate changes without disturbing ensuing blocks.]

2010 EMC Proven Professional Knowledge Sharing 24 The UNIX function is a popular substring matching method. A output of diff source target diff source target -y aaaaa aaaaa aaaaa aaaaa simple example using “source” and “target” files, diff –y produces bbbbb bbbb bbbbb | bbbb ccccc ccccc ccccc ccccc ddddd 12abc ddddd | 12abc the results in the yellow column. You can see how this concept lends eeeee fffff > fffff eeeee eeeee eeeee itself to variable length substring matching and deduplication. >

The Rabin-Karp fingerprint algorithm is an advanced string matching method found in IBM's TSM dedupe product51. It is a simple design for quickly matching two strings and lends itself to the deduplication process. Intuitively, we know the "brute force" method for finding a string inside another – start at the beginning and try to find the smaller string inside the larger string. Harvard Professor Michael Rabin and Richard Karp discovered an efficient rolling hash method for string searching. This flowchart shows my simplified interpretation of the design52.
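To illustrate the general idea (a simplified sketch, not Rocksoft's, Avamar's, or IBM's patented method), a rolling hash over a small window can mark a chunk boundary wherever the hash matches a chosen bit pattern, so boundaries follow the content rather than fixed byte offsets:

```python
def variable_chunks(data: bytes, window: int = 16, mask: int = 0x3F):
    """Content-defined chunking sketch: a Rabin-Karp style rolling hash over the
    last 'window' bytes marks a boundary whenever its low bits are all ones,
    giving an average chunk of roughly (mask + 1) bytes."""
    BASE, MOD = 257, (1 << 31) - 1
    pow_w = pow(BASE, window, MOD)                      # BASE**window, to drop the oldest byte
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = (h * BASE + byte) % MOD
        if i >= window:
            h = (h - data[i - window] * pow_w) % MOD    # slide the window forward one byte
        if i + 1 >= window and (h & mask) == mask:      # boundary depends only on local bytes
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

text = b"now is the time for all good men to come to the aid of their country. " * 20
edited = b"TODAY" + text[3:]                            # change only the first word
a, b = variable_chunks(text), variable_chunks(edited)
print(len(a), "vs", len(b), "chunks;", len(set(a) & set(b)), "identical chunk contents")
```

Because each boundary decision depends only on the bytes in the local window, the chunk boundaries resynchronize shortly after the edit, which is exactly why variable-block dedupe tolerates inserted or changed data better than fixed blocking.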

In some cases, approaches used to select breakpoints are covered by patents. Dr. Ross Neil Williams, an Australian computer scientist, published a 1991 thesis titled “Adaptive Data Compression”53 which led to U.S. patent 599081054 in 1999. To market his work, he formed Rocksoft Pty Ltd in 2001. The name for his technology is “blocklet”. A blocklet is formed when an algorithm scans the source data and breaks it into intelligent pieces using pattern recognition. Pointers are created with a hash code table for matching purposes.

Dr. Williams's blocklet journey did not end with Rocksoft. His company was acquired by tape library maker ADIC in March, 2006, which was then acquired by Quantum in August, 2006. In late 2006, Quantum threatened to sue Data Domain for patent infringement for allegedly trying to use the Rocksoft approach. The matter was resolved when Data Domain cross-licensed Quantum's patent along with the payment of 390,000 shares of Data Domain stock valued at almost $3M55. On August 14, 2007, Quantum sued Riverbed Technology56 for using its design and Riverbed counter-sued, claiming Quantum infringed upon their patent 711624957. These lawsuits ended in 2008 with Riverbed agreeing to pay Quantum $11M58. In March, 2008, Quantum signed an OEM license with EMC, and by July, 2009, EMC acquired Data Domain and ceased selling the Quantum-based product.

Meanwhile, Avamar (formerly Undoo.com and an EMC company since November, 2006) was granted patent 6810398 in October, 2004 for a "sticky byte factoring" approach summarized as "A system and method for un-orchestrated determination of data sequences using "sticky byte" factoring to determine breakpoints in digital sequences such that common sequences can be identified. Sticky byte factoring provides an efficient method of dividing a data set into pieces that generally yields near optimal commonality. This is effectuated by employing a rolling hashsum…"59 This diagram60 shows data being broken into variable-blocks when a "threshold" is crossed. Some of Avamar's patents reference Dr. Williams's original work.61

So if variable-block deduplication produces great results, why not use it everywhere?
1. Compared to other approaches, variable-block uses a great deal of CPU and memory to work efficiently. With low change rates, fixed-block can usually achieve acceptable results with less overhead. Conversely, high change rates favor variable-block dedupe.
2. It may not produce enough extra space compared to the resources consumed or the cost, especially when small file systems are involved.
3. Not all data streams lend themselves to being broken apart.

Content-Aware Deduplication

Another approach is content-aware deduplication (sometimes called application-aware or byte-level deduplication). By leveraging the original data format, deduplication can intelligently choose block boundaries instead of applying "after the fact" fixed or variable-block methods. For example, if you knew you were deduplicating PowerPoint slides, your intelligent break could be on a slide boundary. With Word, you could break on pages or paragraphs. Deduplicate email and you would be sensitive to message boundaries.
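
As a toy illustration of the idea, the following Python sketch splits a document on a format-defined boundary (a blank line between paragraphs) and fingerprints each logical piece. Real content-aware products parse actual application formats; the helper names and boundary choice here are assumptions for illustration only.

    import hashlib

    def paragraph_chunks(document: str):
        """Yield paragraphs, using the blank line as a 'semantic' break point."""
        for paragraph in document.split("\n\n"):
            if paragraph.strip():
                yield paragraph

    def fingerprints(document: str):
        """Map each logical chunk to its SHA-1 fingerprint for duplicate lookups."""
        return {hashlib.sha1(p.encode("utf-8")).hexdigest(): p for p in paragraph_chunks(document)}

    doc_v1 = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
    doc_v2 = "First paragraph.\n\nSecond paragraph, edited.\n\nThird paragraph."
    unchanged = fingerprints(doc_v1).keys() & fingerprints(doc_v2).keys()
    print(len(unchanged))   # 2 of the 3 paragraphs dedupe even though the file changed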

Used in communications, Silver Peak's patented (7571344) "Network Memory" can detect the result of simply opening and closing a file without making any changes. For example, Excel modified bytes (shown in red in the original illustration) even though the user made no updates62. With so much change, variable-block designs might treat the data as new instead of as pointers to existing blocks. Their content-aware dedupe handles small changes using "instructions" such as retrieve 20 bytes from location "a" to location "b" with 1 delta "XX" to track changes. This produces 19 duplicate bytes and 1 unique byte for efficient communications since it clearly shows "where repetitive data starts and stops, and where changes occur within the data."

Ocarina Networks makes primary storage optimization products that leverage content-aware deduplication to expand previously compressed files, such as .zip files or images, and then deduplicate (and compress) them for greater space savings. Being content-aware, Ocarina detects commonality between embedded objects such as JPEG images that are used in Word documents and PowerPoint slides, even if the images have different dimensions.

Content-aware deduplication is also found in products from NEC, Commvault and ExaGrid. NEC's HYDRAstor leverages the IBM TSM, EMC NetWorker, CommVault Simpana and Symantec NetBackup63 formats to optimally store just the changed blocks. CommVault says Simpana is "able to more accurately find and reduce common patterns in the data stream across disparate applications, file systems and data types"64. ExaGrid's products are content-aware and use delta block deduplication65 (which is explained in the next section).

This approach is resource intensive. As a result, many content-aware backup products first write source data to a holding area to complete the backup, and then dedupe it with post-processing.

Lastly, since content-aware dedupe understands data formats, application upgrades should be reviewed with your backup vendor, such as when you move from MS Exchange 2003 to 2007.

Delta Block Optimization

Delta block, also called delta differential, is a very simple technology (in contrast to variable-block or content-aware deduplication) focused on backing up individual files. The first time it is used, the entire file is backed up. Since it knows what it processed previously, only the newly changed blocks are backed up. Delta block works well when a fraction of the files are changed every day or only a few small modifications are made to the same block.
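
A minimal Python sketch of the delta-block idea, assuming we keep the block hashes from the previous backup of the same file and store only blocks whose hash changed; the block size and helper names are illustrative, not any product's design.

    import hashlib

    BLOCK = 4096

    def block_hashes(data):
        return [hashlib.sha1(data[i:i + BLOCK]).hexdigest() for i in range(0, len(data), BLOCK)]

    def delta_backup(current, previous_hashes):
        """Return (changed_blocks, new_hashes); only changed blocks need to be stored."""
        changed, hashes = {}, block_hashes(current)
        for idx, h in enumerate(hashes):
            if idx >= len(previous_hashes) or previous_hashes[idx] != h:
                changed[idx] = current[idx * BLOCK:(idx + 1) * BLOCK]
        return changed, hashes

    v1 = b"x" * BLOCK * 8                          # first backup: the whole file is stored
    _, baseline = delta_backup(v1, [])
    v2 = bytearray(v1)
    v2[3 * BLOCK : 3 * BLOCK + 4] = b"EDIT"        # overwrite 4 bytes in place inside block 3
    changed, _ = delta_backup(bytes(v2), baseline)
    print(sorted(changed))                         # [3] -- only the modified block is backed up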

Working at the file level, it does not deduplicate multiple similar files, but "prevents" some duplicate data from being backed up. For example, if I email Myfile.doc to a colleague, both copies are treated as separate documents. IBM's TSM product uses delta block to record only data changes after an initial full backup. In contrast, deduplication uses hash codes and a master index to track unique and duplicate blocks system wide.

Delta block is a mature design popular with individual users and small to medium size firms. Products such as Iron Mountain's Connected, Seagate's EVault and EMC/Decho's Mozy back up changed blocks, so it is practical to back up computers over the Internet or WAN. To restore a file, the original file is retrieved and the changed blocks are applied. Interestingly, storing the changed blocks back into the original backup copy would prevent that earlier file from being restored if the need arose. File restores are slower than backups since the whole file is restored, not just a few blocks. Restoration time depends on network speed and distance from the central backup server.

While a good fit for remote offices, delta block can consume a lot of CPU and memory resources, especially in virtualized environments. Virtualization taps into unused CPU cycles with multiple “guests” running on the same physical server, so CPU intensive backups can cause applications to experience slowdowns. If this happens, you might use off-hour backups.

There are other delta block approaches, some working at the byte level instead of the block level such as HyperFactor from IBM/Diligent Technologies66. Shipping since 2005, HyperFactor is part of the IBM ProtecTier VTL. The Taneja Group claims it provides a 25:1 savings. Diligent’s history is very interesting. As the story goes, in the late 1990’s, EMC employed Doron Kempel (a former Israeli intelligence officer) and Moshe Yanai (a former Israeli tank commander). Mr. Kempel had a falling out with EMC67 and was subsequently blocked from becoming Chairman and CEO of SANgate Systems because of an EMC non-compete contract. (SANgate eventually became SEPATON, which spells “NO TAPES” backwards). Meanwhile, Mr. Yanai, known by many as the creator of the EMC Symmetrix storage array, became a co-founder of Diligent along with Mr. Kempel, who is now the Chairman and CEO of Diligent. Some of Diligent’s initial products were designed in the EMC Israeli development center through 2002 and to this day, some of their products are sold through EMC. Mr. Yanai’s XIV storage company was purchased by IBM in 2007 and Diligent was purchased by IBM in 2008.

VIRTUAL TAPE LIBRARY (VTL), NAS AND OST are approaches to backing up your environment. Backup devices often support all three options using a common deduplication data pool regardless of the approach. Software emulation allows a VTL with virtual tapes and drives to act as a physical fibre-channel tape library. For example, NetBackup can "think" a VTL is a STK library. The NAS (NFS or CIFS) approach makes a backup device appear as a mount-point or file share to your server, so files are copied to the backup device. Symantec's OST (OpenStorage), found in NetBackup, uses a set of APIs over IP to tightly integrate backups with the backup device.

Some vendors combine delta block with content-aware deduplication. Unlike universal block deduplication, which treats all data the same, content-aware approaches (see the previous section) understand specific application formats. An example of a content-aware delta block technology is SEPATON's DeltaStor. The DeltaStor architecture compares "Word documents to Word documents or Oracle databases to Oracle databases to identify the objects containing duplicate data."68 The premise is that there is little overlap, for example, between Word and Oracle.

Primary and Secondary Storage Optimization

Now that we have explored how compression works and deduplication finds extraneous data, this section examines how these techniques are used for primary and secondary storage. The next section covers the optimization of data communications. The Taneja Group uses two terms – Primary Storage Optimization (PSO) and Secondary Storage Optimization (SSO) 69.

PRIMARY STORAGE OPTIMIZATION (PSO) – the optimization of primary storage environments such as NAS using compression and sometimes deduplication. Low latency is imperative and fast response time is critical.

SECONDARY STORAGE OPTIMIZATION (SSO) – in contrast to user-demand primary environments, SSO targets backup and archive, often employing deduplication and compression. For extra protection, the data may be sent to a secondary site.

The Taneja Group created a chart70 showing where PSO and SSO fit into a storage infrastructure. Primary storage has less data redundancy compared to repetitive backups of primary data stored on secondary storage. In addition, primary data needs to provide fast application response time and be highly available, while secondary data is the "insurance policy" that protects primary data should any be lost, so response time is not a critical issue. For every terabyte of expensive primary storage, there could be 10-30X or more terabytes of secondary storage. And with a DR plan, the replication workload would also increase.

We have learned that data is retrieved from a disk system using pointers whether it is deduped or not. As a result, pointers do not play a significant performance role when data is hydrated. The concern is simultaneous access to the same data. Since deduplication stores only unique information, it uses fewer disks to store it. That is where the science of I/O profiles comes in. How many I/O operations are needed for good performance, and how many physical drives are needed to deliver that performance? A good drive is capable of about 200 I/Os per second, so an application that might have used ten striped drives has a profile of 2,000 I/Os per second. When the data is reduced to a single drive, that drive's 200 I/O-per-second profile could cause a performance issue.
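
A quick sketch of the I/O-profile arithmetic above; the 200 IOPS-per-drive figure comes from the text and the workload numbers are illustrative.

    IOPS_PER_DRIVE = 200

    def drives_needed(required_iops):
        return -(-required_iops // IOPS_PER_DRIVE)        # ceiling division

    workload_iops = 10 * IOPS_PER_DRIVE                   # application sized for ten striped drives
    print(drives_needed(workload_iops))                   # 10 drives before dedupe
    print(f"After dedupe onto 1 drive: {workload_iops} IOPS needed against "
          f"{IOPS_PER_DRIVE} available -> potential bottleneck")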

With the nature of primary and secondary storage being so dissimilar, different techniques are needed for each. Primary storage growth is kept in check by deleting unwanted data, archiving, compression and sometimes deduplication. The easiest way to control the growth of secondary data is through deduplication (and compression). When you compress or deduplicate primary storage, it is fairly easy to predict how quickly a single file could be hydrated for application use, but care should be taken when many users need to access the same data. It should be safe to dedupe storage that has not been modified in the last 6-12 months, but care should be taken with data that has annual cycles or needs to be frequently searched.

Virtual machines are primary storage applications that work well with deduplication and compression. Primary storage can have hundreds of VMDKs (files that comprise the VM). If you examined these VMDKs, you would see the same version of Windows or Linux over and over again. Think of the space savings you could derive if all your VMDKs pointed to a single copy of Linux versus their own copy. If done correctly, the downside is just a small performance impact.

SSO works wonders with backups of primary storage. Deduplication allows for the frugal backup of primary storage to secondary storage within operational windows and enhances BC/DR. Backup deduplication efficiency is directly related to primary data change rates, the data types, retention policies, and the amount of data.

Deciding between PSO and SSO is a tough question. PSO might reach 2-3X with compression or even 5X with deduplication and compression. Meanwhile, SSO can yield a 20-30X (95%-97%) reduction. It may come down to which area has more value. It is nearly impossible to predict which approach yields optimal cost benefits based on your company's data. Vendors can provide a preliminary analysis and supply sizing and estimated TCO, CAPEX and OPEX savings using their proprietary tools, but testing with real data is needed to judge effectiveness and ease of use. While modern dedupe systems impose a small overhead on primary storage (as do concepts like "thin provisioning"), a performance issue would more likely arise from reduced disk spindles and high I/O requirements, or trying to reduce active data or databases.

In-Line versus Post-processing

After choosing PSO or SSO, you need to choose between in-line or post-process data optimization. Here are some basic definitions:

IN-LINE DEDUPLICATION – data deduplication performed as it enters the device and before it is written to disk or transmitted.

POST-PROCESS DEDUPLICATION – data deduplication performed after the data has been initially stored.

If you need to get backup data off-site sooner, then you might want in-line processing. If you want a shorter backup session without regard to completing compression or deduplication or even replication, then you may favor post-processing. In-line could be too slow for primary data yet it could be fast enough for data communications.


In terms of backup, in-line processing (also called in-band) achieves the first objective. It is a concurrent process that leverages powerful multi-core processors and places the master index in memory so it can compress or dedupe data in real-time before storing or transmitting it. As a result, data is only processed once. In contrast, post-processing lands the data on disk and once the last backup job completes, it starts deduplication/compression before finally writing it to disk. Without a landing area, in-line backup needs fewer disks - space is used only for pointers and unique compressed data. Fewer disks mean less rack space, lower CAPEX, and reduced power and cooling requirements.

For example, post-processing 30 TB of source data can require 30 TB of target capacity before deduplication begins. In-line needs a fraction of this space. When transmitting backup data off-site71, in-line sends it after deduping it on a first-in, first-out basis rather than waiting for the backup to finish before deduping and transmitting it. This can result in a shorter RTO.

RECOVERY TIME OBJECTIVE – (RTO) the amount of time once a disaster is declared until a failed system is returned to service at a particular RPO.

RECOVERY POINT OBJECTIVE – (RPO) the data loss expressed as a point in time. For example, if it is 5 P.M. and you recover from your 3 A.M. backup, your RPO was 14 hours ago. Shorter RPOs and RTOs typically cost more.

However, if real-time in-line backup processing falls behind, it slows down the incoming data stream resulting in a longer backup session than post-processing. This may not be an issue if it is faster than the tape system it is replacing or achieves an overall lower time (backup, dedupe, and transmit). With data communications, as data enters a device, it is deduped and sent.

With the second objective, the goal is to get backup data to the security of an alternate device as soon as possible. Post-processing (also called out-of-band) works serially with the entire incoming data stream landing on disk, thereby completing a backup session. It generally achieves the fastest possible backup even though it has yet to dedupe/compress/transmit the data before re-writing it to disk. It could restore a file faster if it has not yet been deduplicated. However, it uses more disk space and takes more time than in-line's concurrent approach.

The goal of in-line and post-processing compression and deduplication appliances is to “set it and forget it” – a phrase TV’s inventor extraordinaire Ron Popeil uses for his rotisserie ovens. Easier solutions tend to lower the administrative burden (OPEX) on your staff. Let’s review both approaches in more detail.

Primary Storage Optimization - In-line versus Post-Processing

As discussed, various compression techniques have been available for many years. Some methods experienced more success than others. Utilities like PKZIP produced incredible results for free or low cost, but were left to individual users for their own files (post-processing) because they were hard to use with everyday transactions. Other approaches like DoubleSpace and compressed file systems (in-line) have met with varying degrees of success. LZ hardware compression has been offered by companies like Cisco and Ciena for years.

Some companies use compression algorithms to build products that work seamlessly with primary storage. Founded in 2004, StorWize uses in-line LZ compression and masks the complexity from the user by placing a CIFS and NFS appliance (a server running its software) in front of your file server. Files are reduced "up to 15X depending upon the file type."72 In 2008, "Storage Magazine" quoted a StorWize customer who increased their NetApp usable capacity 360%, from 300GB to 1.1TB73. The device compresses data and stores it on your NAS system. As data is needed, the process reverses and data is hydrated into its original form. Their partners include industry giants EMC, Hitachi, IBM, HP, and NetApp.

A company formed in 2008 called WhipTail Technologies (WhipTailTech.com) sells a 100% solid-state disk (SSD) storage appliance called Racerunner. It is notable because it uses in-line deduplication and compression on primary storage. With HiFn cards, it achieves a 4-10:1 reduction. Depending on the data, its usable capacity could well exceed its 1.5-6TB of SSD.

Other companies are also starting to offer in-line PSO products. For example, founded in 2007, GreenBytes ships in-line deduplication and compression storage appliances with SAS drives74.

Ingesting data at 950MB/s (3.2TB/h), their GB-4000 unit is faster than either a mid-tier Quantum or Data Domain SSO solution. Bell Data announced an in-line compression and deduplication product called BridgeSTOR based on the HiFn deduplication and compression technology.75

Ocarina, founded in 2007, uses LZ compression and dedupe in a post-processing solution. Their out-of-band "Optimizer" device shrinks a file by reading it, deduping and compressing it, and storing it back on the server. Users retrieve data through an in-line "ECOreader" software driver on their host system or through a hardware appliance. The diagram76 shows out-of-band Optimizer appliances processing data after it is written to an Isilon IQ NAS device, with NAS clients reading through ECOreader. Using a content-aware approach, they amazingly dedupe data types such as photos, music, and video files that other solutions can not dedupe or compress. Data is broken down into logical pieces and processed with dedicated algorithms. For example, if a picture was found in a Word document and the same but larger picture was in a PowerPoint presentation, only a single copy would be retained.

One of Ocarina's press releases (which I believe should be regarded with skepticism and is hotly debated by others) states "An independent study has found that the Ocarina ECOsystem delivers as much as 57X better data reduction than industry leader NetApp. A head-to-head trial confirmed that on every single data set tested, the complete Ocarina solution had significantly better results than NetApp dedupe. Results ranged from 181% to well over 2000% greater reductions."77 You are encouraged to research the merits of this study.

NAS storage vendors NetApp and EMC both provide post-processing deduplication of primary storage, but do it differently. NetApp pioneered PSO with a file level dedupe "A-SIS" product in 200778 (later renamed NetApp Deduplication for FAS). They post-process data after it lands on disk with a command-line utility run on a scheduled (preferably off-hours) or on-demand basis. Recently, they reduced the VMFS footprint of 9 VMware servers 71%, from 234GB to 67GB, through dedupe79. As data is needed, it is hydrated with only a small performance penalty.

In the graph to the right80, NetApp notes space savings for various data types. They conservatively recommend that deduplication not be used for "…customers in environments where performance is paramount. Also, if the system deploying deduplication sustains heavy read/write activity without any off-peak hours for post-processing, we recommend that deduplication not be deployed on that system."81 I would further not recommend it for active multi-user databases. For more on NetApp's deduplication design, please see the section on file level deduplication.

In early 2009, EMC shipped their unified platform with PSO file level deduplication82 based on their "…Avamar acquisition, and LZ77 data-generic compression from their RecoverPoint acquisition…".83 Unlike NetApp's approach, Celerra also compresses the data, achieving a 2:1 reduction independent of duplicate files. With a Celerra NX484, this chart compares the effect of different deduplication and compression methods on a 900GB data set:

    Technology                      Typical space saving
    File-level deduplication        10%
    Fixed block deduplication       20%
    Variable block deduplication    28%
    Compression                     40%-50%

EMC determined that file level dedupe and compression offered the most cost effective, least impacting solution while still delivering 50-60% efficiency. In a different demo85, a 470MB CIFS file system was transparently reduced 43% in seconds to 270MB. (Note: Do not compare the NetApp to EMC savings since the data each system was deduping was completely different.)

Both NetApp and EMC believe in post-process deduplication of VMware server environments. They point out a large number of common components can be reduced when consolidating physical servers into virtual servers. For example, 30 guest Linux environments could mean 29 “extra” copies of Red Hat components representing hundreds of gigabytes of duplicate storage. Thinking back to the blivet story, the notion of squeezing virtual machines onto physical servers is amazing, but the story gets even better when the storage footprint can be reduced 30:1!

Clearly, the cost of implementing primary storage deduplication is of major importance. To offset these costs, the solution needs to provide benefits such as:
1. Controlling the growth of primary storage.
2. Handling the complex environments found in distributed organizations. "…75% of midsize organizations…having an average of six branch office locations."86
3. Being easy to use and administer.
4. Using less bandwidth to support DR transmissions.

Optimization of primary storage is just beginning to take hold. Innovation will be spurred on by increased email use with larger attachments, applications with audio and video tracks, and an increased reliance on collaborative tools such as MS-SharePoint. You will likely see content-aware deduplication (previously discussed) become a NAS feature over the next few years.

Secondary Storage Optimization - In-line versus Post-Processing

For over two decades, the magnetic tape industry has been using in-line and post-processing hardware compression to minimally double usable capacity. In 1984, IBM offered in-line hardware compression on their 3480 tape systems. In 1992, StorageTek’s Iceberg compressed data on disk prior to copying the data to tape – if you could afford the multi-million dollar price tag. The largest tape systems available today from IBM use SLDC compression (variant of LZ) and can fit 3TB of data on a single cartridge with 3:1 compression.

With vendors promoting disk-based backups, the debate is whether SSO should be performed as the backup data arrives in the VTL or after it completes. Some solutions start deduping after some data has landed and the first backup job completes – a hybrid approach.

In the past, in-line backup products were too slow to keep up with enterprise backup activity. These days, high end products like Data Domain's DD880 can handle 5.4TB/hour with multiple incoming streams (1.28TB/hr with a single stream), due in part to the processing power of a quad-socket quad-core Xeon motherboard. Other vendor designs are grid based to address scalability goals or leverage solid-state disk. In the future, some will use 8-core Intel Nehalem-EX processors and even multi-terabyte memory. Data Domain increased "…performance by nearly 36X and in capacity by more than 56X over the last 5 years87".

Post-processing's simpler design, which prevents backup degradation, has won over many fans. From a performance perspective, these fans believe the issue is how fast the incoming data stream lands on disk, given that the rest of the time can be used for deduplicating, compressing and transmitting. While post-processing requires enough space to hold the entire backup, the "extra" space is subsequently reused. So if you have an 8 hour backup window, the most critical issue is whether the deduplication processing completes in the next 16 hours before the next backup session starts. If you need to create an immediate tape copy once the backup lands on disk, then you also favor post-processing because there is no hydration step - the data has yet to be deduplicated or compressed so the tape process takes less time. Also, some post-processing solutions can import older backup tapes (such as Spectra Logic's nTier VTL88).


There is also an in-line/post-processing hybrid solution. Systems from FalconStor and Quantum have the flexibility to start deduping shortly after the logical backup process starts or the first backup job completes, without impacting the incoming backup stream. After some or all of the deduplication is complete, they start replicating the data remotely.

Source versus Target

Another hotly debated issue that pertains only to secondary storage is whether backup deduplication (SSO) should occur at the source (backup client or server) before transmission over the LAN/WAN or at the target (VTL)89.

SOURCE DEDUPLICATION – identifies duplicate data at the source and sends only unique data over the backup network to a central store. Network efficiency makes this appealing for remote backups.

TARGET DEDUPLICATION – all data is transferred over the backup network to the backup device where it is deduplicated using one of several methods.

Backing up and deduplicating at the source side is accomplished through a software agent (and perhaps a different backup application) on the host. The agent breaks the backup data into blocks and creates their hash codes. The codes are compared to the master index, and if they exist, the blocks are not transmitted to the central location.
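
A minimal Python sketch of that source-side flow, with an in-memory set standing in for the master index lookup; the block size, hash choice, and names are illustrative assumptions rather than any vendor's agent.

    import hashlib

    BLOCK = 16 * 1024
    master_index = set()                     # hashes already known to the central store

    def backup(data):
        """Return how many bytes of actual block data would cross the backup network."""
        sent = 0
        for off in range(0, len(data), BLOCK):
            block = data[off:off + BLOCK]
            digest = hashlib.sha1(block).hexdigest()
            if digest not in master_index:   # unknown block: ship it and index it
                master_index.add(digest)
                sent += len(block)
            # known block: only the hash/pointer is referenced, not the data
        return sent

    data = b"".join(bytes([i]) * BLOCK for i in range(10))   # ten distinct blocks
    print(backup(data))   # first backup: 163840 bytes sent
    print(backup(data))   # unchanged second backup: 0 bytes sent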

Target side deduplication works in a similar manner, except the hash code is created either as the data enters the backup appliance or after all the data has landed on the device, or a combination of these approaches.

The source-side champions point out:
1. When data change rates are low, the source-side approach uses very little bandwidth since the only data sent is pointers and (compressed) unique blocks. As a result, backup time is greatly reduced. For example, with 10TB of primary data, if 1% was new and was compressed 2:1, only 50GB (10TB * 1%)/2 is backed up. Advocates say it is senseless to target backup 10TB when a source backup reduces it to less than 100 GB.

2. Source-side approaches also tend to use content-aware algorithms allowing for higher deduplication ratios than "brute force" fixed or variable-block methods.

3. When consolidating servers, you want to avoid creating a backup bottleneck. A server has just a few NIC cards for backup, so if it supports 10 guests, the same cards back up 10X the data in the same amount of time. This example shows Avamar (source-side dedupe) reducing VMware backup times significantly90:

    Avamar VMware VCB Backup Results
    Backup #   Elapsed Time (HH:MM:SS)   Protected (GB)   Stored (GB)   Reduction Factor (protected/stored)
    1          7:36:00                   1,700            164.6         10X
    2          1:29:00                   2,000            8.0           250X
    3          0:13:26                   2,000            0.2           10,000X

4. A problem facing small remote offices is the lack of backup and recovery IT staff. Since source-side deduplication reduces bandwidth cost and time, backups can run over the WAN. There is no need to worry if backups were done, or whether the tapes were put into a fire-proof box or taken offsite (by whom?). CAPEX and OPEX savings are also possible if tape backup equipment and staff in remote offices can be saved or reduced.
5. Bi-directional replication with source-side dedupe allows for a hybrid model where offices back up to each other. This BC/DR strategy can save money by replacing third party tape storage with replication to another office.

Target-side deduplication proponents point out:
1. To accomplish source-side deduplication, you may have to use a different backup application or deploy additional agents which can lead to higher CAPEX and OPEX and possibly a security issue if not properly maintained. Target-side is easy to deploy since nothing changes on the server side and the same backup software can be used.
2. This approach approximates the time tested tape-backup methodology.
3. Incorporating legacy backups with source-side deduplication requires the legacy data to be restored first and then deduplicated. Target-side deduplication allows legacy data to be fed into the deduplication device as a data stream.
4. Source-side approaches typically do not support mainframes, System i (AS/400, iSeries), and other environments. This could lead to multiple backup architectures.
5. Target-based deduplication, especially when it tracks backup data throughout a (global) enterprise, allows for more data to be analyzed usually resulting in higher dedupe ratios.
6. If data changes significantly from one backup to the next, the source-side client can be burdened by a large processing overhead which could slow down production.

Secondary Storage Optimization - Who Makes What?

EMC, IBM, NEC and others offer in-line backup (secondary storage) deduplication devices, while FalconStor, NetApp, SEPATON, Sun and others sell post-processing solutions. Products differ in their use of fixed or variable-block deduplication, ingestion rates, cost, source or target side dedupe, etc. SNIA’s “Data Protection Initiative Members Buyer’s Guide”91 includes a set of definitions and summaries of many of these solutions (note – not all vendors are listed). Please see Appendix II for Deduplication and VTL Features.

HP offers both in-line and post-processing solutions. Quantum's "adaptive mode" is a policy-based approach that dedupes in-line; should the data arrive too quickly, the Quantum unit stores it for post-processing so it can catch up. EMC's DL3D products were based on Quantum systems, but were discontinued when EMC acquired Data Domain.

NOTE - unlike sample data compression streams, the lack of deduplication benchmarks makes it almost impossible to validate vendor claims. Use your own unique data to test and evaluate effective ratios and timings.

The internet is full of comparison lists. For some links, please see the footnote section92 or try a Google query with this phrase +"product roundup" +"data deduplication", etc.

Optimization Software versus Hardware

When selecting a dedupe solution, should you choose a software approach like ArcServe from Computer Associates, which uses your own disk, or a dedicated storage appliance like Data Domain's? They both use in-line variable-block hash dedupe, but ArcServe uses your backup media server's CPUs and memory to carry out the calculations.

First, let's discuss the obvious - there is no such thing as a 100% software solution. Software must run on a server. The issue comes down to running dedupe software on your own server and storage, or buying it supplied in a purpose-built, powerful appliance with dedicated storage. There are also hybrid products such as EMC's NetWorker backup software, which is integrated with their Avamar source-side or target-side Data Domain hardware deduplication products.

Some of the criteria to keep in mind include:
1. Scalability – with data growing, high performance is critical to complete backups within a time window, and that means scalability is key. Software solutions may need multiple general purpose servers or additional capacity to scale. Make sure the single master index can grow or share hash information over a network with other global repositories.
2. Performance – data dedupe is processor, memory and I/O intensive, especially when ingestion, restoration and replication occur on the same platform. Dedicated appliances tend to perform specialized tasks more efficiently than general purpose systems.
3. TCO – software solutions can have lower CAPEX, especially as add-ons to backup products. For example, CommVault has a dedupe plug-in license for Simpana 8. OPEX costs may be lower if the staff already uses the backup product. ESG's Lauren Whitehouse says cost, integration with existing backup processes, and performance93 are the top criteria companies use when choosing dedupe solutions.

Some software solutions, like Acronis' Backup & Recovery 10 product, are quite sophisticated offering block level dedupe at the source or target. This diagram94 shows it keeps CPU usage to a minimum through a "quick hash calculation" to find a match. To keep memory usage low when processing hash codes, software approaches may require fast mechanical or solid-state disks.

Appliances tend to be much easier to manage and expand since the vendor does most of the work. With software solutions, you need to manage your own hardware and software, especially when performance issues crop up, replication issues arise, or you need to scale it to support hundreds of terabytes. A trade-off is lower acquisition cost versus a possibly higher risk if you run into an issue. For example, while I trust this is not always the case, a user posted a blog question on October 13, 2009 asking why his ArcServe backup with dedupe ran almost twice as long as his previous version without dedupe. “When I had normal file system device on BrightStor 11.5 the data migration used to finish by 7 hrs and now it takes an additional 6 hrs to complete.”95 A person answered the question 2½ months later “There may be nothing you can do with the current configuration to speed things up…deduplication has a lot of overhead”.

Generally, software solutions are used in small backup situations with limited or homogeneous servers, and to keep costs low. Large businesses with scalability issues, tight backup windows, diverse support needs and higher performance requirements select hardware solutions.

Data Communications Optimization

Data communications and data growth are like “oil and water” – they don’t get along very well. Increasing WAN circuit capacity can be a major budget expense depending on circuit length. For example, a T1 (1.54 Mb/s) can cost $250-$500/month96 while a T3 (45 Mb/s) can cost $4,000-$16,000/month. The information explosion also hinders WAN DR protection because of high circuit cost. Using dedupe, circuit requirements can be dramatically reduced when new or changed compressed blocks and small pointers are transmitted.

Data transmission typically uses Lempel-Ziv (LZ) compression for rates of 2-3:1, and when deduplication and compression are combined, 5-10:1. Let's look at an example. Assume we send 1TB of data across a WAN. With 2:1 compression, 1TB shrinks to ½ TB (549,755,813,888 bytes). Using a T-3 circuit (38.3 megabits/sec of usable bandwidth), the data is sent in 30 hours. With a 1% daily change rate (10% daily x 10% incremental), the T-3 sends 5,497,558,139 bytes (5.12 GB) in 18 minutes since only changed (deduplicated) blocks are sent. These calculations represent the "steady state" – i.e., the first full copy likely takes longer since compression and minor deduplication will only reduce the data by perhaps 2-3:1.

    WAN Backup Example                              TB to    Effective         T-1 Link    T-3 Link     OC-1 Link    OC-3 Link
                                                    Backup   Bytes             1.5 Mb/s    45.0 Mb/s    51.8 Mb/s    155.5 Mb/s
    Usable portion estimate (85%)                                              1.3 Mb/s    38.3 Mb/s    44.0 Mb/s    132.2 Mb/s
    With Just Compression (2:1)
      Weekly Backup                                 1        549,755,813,888   37.0 days   30.5 hours   26.5 hours   8.8 hours
      Incremental Backup (10%)                      0.1      54,975,581,389    3.7 days    3.0 hours    2.6 hours    53 min
    With Deduplication (10:1) and Compression (2:1)
      Change Rate of Incremental (10%)              1        5,497,558,139     8.9 hours   18 min       16 min       5 min
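
The arithmetic behind the example can be reproduced with a few lines of Python. The sketch follows the table's assumptions (85% of the nominal link rate is usable, and megabits treated as 2**20 bits so the results line up with the chart); it is a rough estimate, not a circuit-sizing tool.

    def hours_to_send(data_bytes, link_mbps, usable=0.85):
        usable_bps = link_mbps * 2**20 * usable        # bits per second actually available
        return data_bytes * 8 / usable_bps / 3600

    compressed_tb = 549_755_813_888                    # 1 TB after 2:1 compression
    print(f"T-3, weekly full:    {hours_to_send(compressed_tb, 45.0):.1f} hours")          # ~30.5
    print(f"T-3, 1% daily delta: {hours_to_send(compressed_tb * 0.01, 45.0) * 60:.0f} min") # ~18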

Aside from transmitting backup data, deduplication can produce a big savings when sending primary data. As discussed in the content-aware deduplication section, Silver Peak performs real-time deduplication when sending data between sites. The first time through the appliance, very little data is deduplicated. The benefits appear in subsequent transmissions as data is cataloged in the master index and newly compressed data and pointers are passed down the wire. For example, Silver Peak says they reduce EMC SRDF/A traffic by 82-90% (5.5-10:1)97

Using our earlier Excel illustration from the content-aware section, Silver Peak claims their Network Memory byte-aware dedupe with byte offsets is far more network efficient than block-based hash methods when data changes every few bytes. For example, "…representing 10,000 bytes of data with a 16 byte token yields data reduction factors up to 625 (10,000 / 16).98" With byte-level optimization, "representing anything smaller than 16 bytes with a token (hash code) would actually expand rather than reduce the amount of information transferred."

NOTE - when sending or copying deduplicated data to a device that does not support the same deduplication methodology (i.e. a tape drive), the data must first be hydrated. This can negatively impact the deduplication ROI.

Riverbed employs hashed variable-length blocks99, content-aware methods and other concepts to reduce WAN traffic. Their Steelhead device can be used in-band100 or out-of-band, and should reduce SRDF/A traffic 80% (5:1)101. As mentioned, content-aware solutions are sensitive to file formats, with various reports claiming Riverbed had performance issues transmitting AutoCAD 2007 formatted files. The issue was fixed when the AutoCAD 2010 format was released.102

Cisco's WAAS uses data redundancy elimination (DRE) which is somewhat similar to Silver Peak's approach. "DRE inspects incoming TCP traffic and identifies data patterns. Patterns are identified and added to the DRE database, and they can then be used in the future as a compression history, and repeated patterns are replaced with very small signatures that tell the distant device how to rebuild the original message."103 Cisco says they achieve a 90% traffic reduction (10:1) in VMware environments.

Some companies use replicated data dedupe as a “poor man’s” DR plan. With data landing on a remote site, it can be loaded on standby systems faster than full tape restores. If asynchronous storage frame replication is too expensive, data dedupe with replication may be your answer.

Brocade, Ciena, F5, NetEx and others also offer data communication optimization products.

The Law and Storage Optimization

Data governance is one of the many areas where deduplication and the law intersect. It deals with corporate regulatory obligations for managing certain documents. These duties generally ensure the data remains available, usable, secure and un-modified for a specified time period. With a legal mandate that files not be changed on a go-forward basis and a goal to save space, companies generally prefer to use file level deduplication (for reasons we will shortly explore). The notion is that the electronic document should be treated as though it were a paper document, and not as paragraphs that could be reassembled, however properly, if needed.

IS HASH CODING REPLACING THE BATES STAMP? Invented by Edwin Bates in 1893, the Bates stamp is a page and document sequential numbering system used by lawyers to track documents104. In our digital age, there is interest in replacing the hand stamp with a unique hash code displayed on the document. The Federal Rules of Civil Procedure accept the “digital” Bates document stamp.

Attorneys also benefit from deduplication. With massive amounts of digital data they charge their clients to review, they adopted "Bates" hash coding to uniquely identify documents so they are read just once. Lawyers define two terms – deduplication by custodian, which "results only in the removal of duplicates within one person's data set" and deduplication by case which deals with "the removal of duplicate documents within the data set for the entire case"105.

Lawyers paid to review documents are naturally in favor of deduplication by custodian (the individual owner of the information) because fewer duplicates are removed - i.e., more documents to review equals more billable hours. So an email I send you gets reviewed twice, even if you just add "Thanks" to your reply. The court views deduplication by custodian as a lawyer presenting a client's own documents they had in their possession – a simple, logical argument. It is confusing to explain the deduplication by case argument because, generally speaking, it is the custodian of a document who is legally able to authenticate it; a prerequisite for admissibility.

Deduplication by case costs less as the number of duplicates increases – i.e. fewer originals. That means fewer billable hours are needed for the overall review. However, it may be difficult to prove who created the document without the underlying metadata. Deduplication can confuse a non-technical jury if the proponent of admissibility is unable to demonstrate the document's authenticity. That’s what happened in the case of Nursing Home Pension Fund v. Oracle. In a lawsuit filed in 2001, dismissed in 2003, reinstated in 2004, dismissed in June, 2009, and appealed in July, 2009, the “defendants produced more than 1,665 emails to or from Oracle CEO Larry Ellison, but only 15 of those emails were actually identified as having come from Ellison's email box “106. The court found spoliation (that is, the court found that the party seeking Ellison's emails would be entitled to a favorable jury instruction about the email evidence) because it was not proven if Mr. Ellison ever read the other emails. The judge reasoned that it could have been valuable for the party requesting the emails to have them directly from Ellison's email box rather than from other employees’ email boxes after dedupe. Why? Because Ellison could argue that he never received them and therefore never read that particular email. The case might have had a different outcome if the correct email metadata was produced.

Another interesting legal aspect concerns document immutability and file versus sub-file deduplication. A jury understands how you can prove two documents are the same, but they may not comprehend how variable-block deduplication understands that certain 0’s and 1’s are unique while others are duplicates.

Language in "Criminal Penalties for Altering Documents", Sec 802 of the Sarbanes-Oxley Act of 2002, makes corporations unwilling to risk sub-file deduplication in court. It is not clear if the courts have ruled whether sub-file deduped, electronically reproduced and printed documents legally represent an original. Proof beyond a doubt may be needed to support the claim that an original document can be recreated or that "proprietary" hash algorithms and methods properly detect duplicates and do not mistakenly delete unique blocks. The risk of a document being declared inadmissible may be a hurdle to your legal department's adoption of deduplication practices. They may not want to sub-file dedupe certain types of documents. There is no such argument for file level deduplication – it is easy to understand since the document is either the same or it is not.

Interestingly, every document stored on disk is clearly changed. Operating systems store ASCII bytes in blocks with pointers leading to non-contiguous allocations, or break documents into packets when they are sent as email. You would need a forensic data analyst to testify that the data has not changed, regardless of how it is stored. So before implementing deduplication, you should review your legal department's e-discovery and data governance policies. For example, should email be file-level deduped while other data is variable-block deduped? There is no issue with compression.

Storage Optimization “Gotchas” and Misadventures

While storage optimization can be highly beneficial for your organization, be sure your plan is well thought out. No one will want to hear how your company went out of business because you saved some disk space. In the end, there is truth behind “too much of a good thing”.

So with that in mind, here are some of the "gotchas" you want to avoid:
1. Earlier, there was an example of Oracle running faster when compressed, but the opposite can also happen. By compressing or deduping a busy primary storage I/O workload onto fewer disks, it could run slower since the remaining disks must perform more work. For example, VMware uses VMDK files to run virtual machines (VM). Suppose you have 100 Windows VMs and each needs 50 IOPS107 (5,000 IOPS total). One 10,000 RPM disk delivers about 130 IOPS, so if half of the VMs are active, you need (100 VMs x 50 IOPS per VM x 50% active) / 130 IOPS per 10,000 RPM drive = 19 drives to supply the proper performance profile. Deduping VMDKs means fewer copies, or just one copy, of core files spread on far fewer drives.

NetApp addresses this issue with a Performance Acceleration Module II (PAM) that puts frequently used data into cache so it is not retrieved from spinning drives. EMC's Celerra dedupes relatively inactive files and supports active workloads with cache and even solid-state drives. However, if your goal is to save money, you don't want to overspend to increase performance on data that has been slowed by over-zealous deduplication.
2. If you invested a great deal of time and money making sure your primary data was highly available, be sure the optimization solution does not have single points of failure.
3. You want to do thorough testing because you know you can not count on 5000:1 deduplication ratios. Factors that affect ratios are:
   - File systems, email, and databases fare well with deduplication.
   - Full versus incremental backups – the more "fulls", the higher the ratio. Conversely, "incrementals" by definition will have less duplication.
   - Transient data – it does not make sense to deduplicate temporary data, yet it still needs to be backed up. Adjust your policy to just use compression for this data type.
   - Encrypted data doesn't dedupe or compress. Deduping compressed data, except at the file level, is hit or miss and probably not worth trying.
   - Imagery and audio do not sub-file deduplicate. Applications like time stamped video can not be block deduplicated since the time is coded on each video frame.
   - High change rates – deduplication systems do better with lower change rates. If your data is very "seasonal" or experiences significant changes from one day to the next, it may not be a good deduplication candidate.
   - Longer retention periods result in higher ratios because they increase the likelihood the same file or block will be referenced.
4. Hydrating data could cause you to run out of space in your primary storage area. While it sounds obvious, the less free space you have, the greater the risk you will run out of it.
5. You can run out of secondary storage space, especially with post-processing deduplication. If you have more data to deduplicate in a 24 hour period than your system can handle, what will happen when the next day's backup begins? While deduplication gains efficiency as it processes more data, it takes time to ramp up.
6. When creating tape copies of deduplicated data, the likely hydration could result in a large number of tapes. Also consider how much time it could take to hydrate all the data necessary to make a full set of tapes. If you need tapes, look into systems that offer a tape-out feature using deduplicated data, such as those from CommVault.

7. Make sure you can select which primary data should not be deduped. The legal aspects of deduplication (see the previous section) may require different dedupe policies.
8. If your backup time window is already too short, then an in-line dedupe system has to be very powerful or it could increase your backup time. You may want to first use archiving to reduce your primary storage so you have less data to back up.
9. While deduplication and WAN replication strategies save time and money, determine how much bandwidth and time it requires should you need to do a full remote restore. If the amount of data is large or your network is small, it could take days or weeks to transmit. Available bandwidth is a serious issue that may need larger circuits "just in case". If you have a disaster and need to restore from the remote appliance, you may want to move it next to the primary appliance to accomplish the restore.
10. Make sure you thoroughly investigate the TCO and ROI for any deduplication solution. In the end, they can be expensive, so they must demonstrate significant savings.

Before You Purchase an Optimization System

If you have been in the IT industry for a while, you know questions are sometimes answered with "it depends". This is also the answer when purchasing optimization equipment because no one knows how effective it will be with your data. Claims are useful solely as indicators of general performance since test data will almost never resemble your production data. In the end, it is not like buying a car based on which vehicle has more front seat legroom.

Metrics Can Be Confusing

Manufacturers do not want to confuse their customers, but products often use different capacity ratings and metrics which, if you are not careful, can be misleading. For example, the "cough medicine" industry labels their bottles with both ounces and milliliters, yet their dosage is in teaspoons (1 teaspoon = .16 ounces or 4.92 mL). LTO-4 tape drives have a maximum transfer rate of 120 MB/s without compression and 240 MB/s with 2:1 compression (equivalent to .82TB/hr)108. If a tape library has 12 tape drives, it can handle up to 9.8TB/hr compressed. In contrast, appliances such as the Data Domain DD 880 transfer up to 5.4 TB/hr. So rather than get confused between MB/s and TB/hr, make a unit comparison chart so you can evaluate "apples to apples":

    LTO-4 tape drive      raw        compressed 2:1
    MB/s                  120        240
    MB/minute             7,200      14,400
    MB/hour               432,000    864,000
    TB/hour               0.41       0.82

    Appliance (maximum ingest rate)    TB/h    MB/hour      MB/minute    MB/s
    Data Domain DD 880                 5.4     5,662,310    94,372       1,573
    Data Domain DDX (16 x DD 880)      86.4    90,596,966   1,509,949    25,166
    Quantum DXi7500 (adaptive)         1.8     1,887,437    31,457       524
    Quantum DXi7500 (deferred)         3.2     3,355,443    55,924       932
    EMC DL 4400                        4.0     4,194,304    69,905       1,165
    EMC DL 4400 (compressed 2:1)       7.9     8,283,750    138,063      2,301

For example, to learn how many LTO-4 tape drives are needed to equal the maximum ingestion rate of a DD 880, use this formula: 5.4 TB/hour / 0.82 TB/hour = 6.6, so 7 drives.

To understand how these devices could improve your backup environment, you need to know how fast you can drive the rest of your infrastructure, since the fastest target deduplicating device can not help a backup issue if the feeding circuits are too small. The math can be confusing when you use a LAN for backup data. For example, if your environment uses a single GigE LAN (1,000 Mb/s), the most uncompressed data you could pump through it is less than 0.43 TB/hour. Likewise, a 10 GigE LAN (10,000 Mb/s) can not exceed 4.29 TB/hour. Generally, a circuit is not capable of 100% of its theoretical rating and experts suggest the maximum could be 80-85% of its advertised capability. So how many GigE circuits do you really need to drive the DD 880 at full capacity? 5.4 TB/hour / (80% x 0.43 TB/hour) = 15.7 circuits. With 10 GigE circuits, replace 0.43 with 4.29. This comes in handy when you have a backup window problem to solve. Also remember your results could be further reduced by signal noise and other latency issues.
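
The same conversions can be scripted so the comparison chart stays consistent. This Python sketch follows the chart's convention of 1 TB/hour = 1,048,576 MB/hour and the 80% circuit utilization figure above; it is a comparison helper, not a vendor sizing tool.

    MB_PER_TB = 1_048_576                              # convention used in the chart above

    def mbs_to_tbh(mb_per_s):
        """Convert a MB/s rating to TB/hour."""
        return mb_per_s * 3600 / MB_PER_TB

    lto4_compressed = mbs_to_tbh(240)                  # ~0.82 TB/h for an LTO-4 drive at 2:1
    dd880 = 5.4                                        # TB/h ingest for a DD 880
    print(f"LTO-4 (2:1): {lto4_compressed:.2f} TB/h")
    print(f"Tape drives to match a DD 880: {dd880 / lto4_compressed:.1f} -> 7")
    gige_tbh = mbs_to_tbh(1000 / 8)                    # ~0.43 TB/h for a GigE link
    print(f"GigE circuits at 80% utilization: {dd880 / (0.80 * gige_tbh):.1f} -> 16")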

Testing

For data compression, the de facto test datasets include an assortment of text and binary files: an older collection from the 1980's and 1990's, and a newer one. Cisco uses both of these datasets when testing their WAN data compression. You can use these files with your compression solution to get a feel for how efficient it is, but given your data will most likely not match either set, you are better off testing with your own data.

Deduplication likely costs more than compression for primary storage, so does the space saving justify the added expense? Conversely, compression may not save as much space but may cost considerably less. Also, determine which solution is easier to operate. Can you access your data if hardware decompression fails – i.e. does software decompression take over?

Your evaluation should take into account data types, how often data changes and the degree of redundant data. With WAN deduplication and compression, the effectiveness is tied to the type of data, circuit quality and distance. For example, if a particular approach reduces WAN traffic 6:1 with Word or AutoCAD documents, it might only deliver 3:1 using storage frame replication.

Test PSO with your current backup and archive solutions. Does the data need to be hydrated before it is backed up to a tape or archived? Does hydration need extra space? If a substantial portion of the data is compressed or deduplicated, where will it fit when it is hydrated?

For secondary storage, use practical retention periods. The place to start is with an assessment by the solution provider. Assessments take some time and can be hard to compare since you can not run two in parallel. Investigate the result when making small or large changes to flat files, databases, images, etc. There are some great success stories out there. For instance, Chris Babcock109 with older Data Domain gear squeezed 26 TB of backup data in 4 TB of space. With an 84% reduction in less than two months of use, his comment was “I have to say WOW!!!”

How “painful” is the first backup? Does it complete before the next one needs to start? Don’t forget to test restoration speed and steps while backing up new data. What changes will you need to make to your backup run book with this new equipment? Test the performance impact for both single and multi-user operations. How does the solution run when the CPU is nearing its limit? How well are CIFS and NFS handled? How efficient is replication?

Vendors love users who test with production data because keeping the unit is tempting, but should you decide not to purchase it, how will you disengage it from your production data? While tests with real data make sense, do you want to risk testing in a production environment?

Work with your vendor to test what happens when power supplies fail. Try to simulate a drive (or two) going bad. Test the solution for a few weeks. You may want to attend training before the test begins. When replicating, what happens if the WAN goes down or has line noise?

You want to understand the ease of integrating the new technology into your environment. Must you replace technologies you already own or retrain your staff on new processes? Every time you change or add something, you need to test it, and that means risk. To that end, does it support your existing software titles? Does it interface with any existing APIs that are in use? Can you use your existing storage, or do you need to buy more and perhaps a different brand (which could mean higher support costs)? If you make changes to your environment, how much effort does it take to keep your deduplication or compression system in lock-step?

Cost, ROI and TCO

Understand the hardware, software, training and multi-year maintenance cost and compare it to your projected CAPEX and OPEX savings. While every situation has special considerations, this chart110 lists the savings categories for data deduplication over tape backup:

Effects of Deduplication Storage on TCO Components

| TCO Category | TCO Component | Effect of Deduplication Storage | Calculation of Costs/Savings |
| Hardware | Tape Backup Hardware & Maintenance | Reduction or elimination of need for any or additional tape libraries, drives, or media servers in local and/or remote offices. | No additional tape hardware, possible elimination of current hardware, and avoidance of future hardware. |
| Hardware | Dedupe Backup Hardware & Maintenance | Incremental cost of deduplication hardware for storage and WAN vaulting/replication. | Incremental initial costs plus any additional required over 3 years. |
| Software | Backup Software Licenses & Maintenance | With deduplication storage, no additional software. Avoids cost of additional backup licenses. | Subtract cost of additional licenses required by tape backup. |
| Support | Labor (Backup Admin FTEs) | Reduced labor in tape mounting, handling, and transporting from remote offices. | Number of hours saved per week. |
| Support | Labor (Sysadmin, Backup Admin FTEs) | Time saved due to faster restores. | Number of restores per week times number of hours saved per restore due to data being kept online. |
| Support | Labor (Sysadmin, Backup Admin FTEs) | End-user time saved per year due to faster restores. | Number of users affected, times number of restores per week. |
| Supplies | Tape Media | Reduction in number of tapes. | Reduced number of tapes (in inventory and added per year after implementing D2D) times cost of tape. |
| Services | Offsite Tape Storage & Transportation | Reduction in storage, transportation, and tape recall costs. Potential elimination of service contracts at remote sites. | Average reduction in invoiced costs after implementing D2D. |

The web has many ROI and TCO documents, and your finance department may have a methodology to calculate these values. In general, you should evaluate in-line solutions against post-processing solutions at the same discounted price point. It becomes an academic exercise to compare performance, efficiency and flexibility with list prices.
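As one illustrative example of how the chart's categories might roll up into an annual figure, the sketch below uses entirely made-up placeholder numbers; your finance department's model will be far more detailed.

    # Illustrative only: roll the chart's TCO categories into a single annual figure.
    # All inputs are made-up placeholders - substitute your own costs.
    savings = {
        "tape_hardware_avoided": 120_000,      # libraries, drives, media servers not purchased
        "backup_licenses_avoided": 15_000,     # tape-only licenses no longer needed
        "admin_labor": 10 * 52 * 75,           # 10 hrs/week saved x $75/hr
        "tape_media": 400 * 60,                # 400 fewer tapes x $60 each
        "offsite_storage_transport": 18_000,   # reduced vaulting/courier invoices
    }
    costs = {
        "dedupe_hardware_and_maintenance": 150_000,  # incremental purchase, spread over 3 years below
        "wan_replication": 12_000,                   # annual circuit/replication cost
    }

    annual_benefit = sum(savings.values())
    annual_cost = costs["dedupe_hardware_and_maintenance"] / 3 + costs["wan_replication"]
    print(f"Net annual savings ~ ${annual_benefit - annual_cost:,.0f}")
    print(f"Simple benefit/cost ratio ~ {annual_benefit / annual_cost:.1f}x")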

You should gauge how easy it is to operate as well as its resiliency in everyday use. Consider the WAN replication costs of sending deduplicated data to another site. If you can’t test replication, then understand how easy it is to implement replication at a later stage. Remember to add upgrade costs to your calculations - who performs the expansion and what is the cost?

Learning how much space will be saved when you implement the solution may require a test environment or using it in production. What is the reduction in backup time? Do you need intelligent backup software for databases and other complex structures to detect which records have or have not been modified? Check with your backup software supplier.

This diagram111 shows the best use case for dedupe is for data with “less frequent” access coupled with medium-term retention. If your data has short retention, then dedupe may not get you a substantial ROI and you might want a compression VTL. Likewise, you may want to couple dedupe with magnetic tape for infrequent access or long multi-year retention periods. Tape is “green” since storing it in a drawer consumes no energy.


Try standardizing on a single PSO deduplication solution. Multiple vendors in small IT shops can often become a support nightmare when something goes wrong. Always implement cross training so the business is supported when a staffer is unreachable on vacation.

Scalability and Availability

If your primary data doubles or triples, will you need to buy another device or will the unit scale? What is the cost? Are there features you can add, and will the upgrade be non-disruptive? With SSO units, at what point will you "outgrow" the solution and what steps will you need to take?

If your backup plan requires multiple devices, as in the case of a large enterprise, you do not want each device working independently. Global dedupe allows these units to be aggregated so unique data is only stored once. A critical issue is how much data is processed during a backup window. For example, Data Domain's DD880 does not currently support global dedupe, but it can process 129 TB/day (24 hours x 5.4 TB/hr). If you back up more than 129 TB of data a day, you should think about global dedupe. This table lists products and suppliers in this space; the source matrix112 also tracks, for each product, support for global dedupe, integrated dedupe, tape, tape virtualization, and OST.

| Product | Supplier | Type |
| Copan Revolution | FalconStor Software Inc. | VTL |
| EMC/Data Domain DDX Series | Data Domain Inc. | Both |
| EMC Disk Library 1500/3000 | Quantum Corp. | VTL/NAS |
| EMC Disk Library 4000 | FalconStor | VTL |
| ExaGrid EX Series | ExaGrid Systems Inc. | NAS |
| FalconStor FDS | FalconStor | NAS |
| FalconStor VTL | FalconStor | VTL |
| Fujitsu Eternus CS | Fujitsu | VTL |
| GreenBytes GB-X Series | GreenBytes Inc. | NAS/iSCSI |
| Gresham Clareti Storage Director | Gresham Storage Solutions | VTL |
| HP D2D Backup System Series | HP | NAS |
| HP Virtual Library System | Sepaton Inc. | VTL |
| IBM ProtecTIER | IBM | VTL |
| NEC Hydrastor | NEC Corp. of America | NAS |
| NetApp NearStore VTL | NetApp Inc. | VTL |
| Overland REO Series | Overland Storage Inc. | VTL |
| Quantum DXi-Series | Quantum | Both |
| Sepaton S2100 Series | Sepaton | VTL |
| Sun StorageTek VTL | FalconStor | VTL |
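A back-of-the-envelope check like the sketch below shows how the backup-window arithmetic drives the single-device versus global-dedupe decision. The rates and volumes are placeholders modeled on the text, not a sizing tool.

    # Hypothetical sizing check: can one appliance absorb the nightly backup
    # within the window, or is an aggregated (global dedupe) pool needed?
    import math

    device_rate_tb_per_hr = 5.4      # e.g. a single DD 880 (from the text)
    backup_window_hr = 8             # your nightly window - placeholder
    daily_backup_tb = 60             # data to protect each night - placeholder

    capacity_in_window = device_rate_tb_per_hr * backup_window_hr
    if daily_backup_tb <= capacity_in_window:
        print(f"One device fits: {daily_backup_tb} TB <= {capacity_in_window:.0f} TB per window")
    else:
        devices = math.ceil(daily_backup_tb / capacity_in_window)
        print(f"Need {devices} devices - without global dedupe each keeps its own copy of duplicates")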

In a similar vein, make sure the solution is rugged, and that means RAID protection, hot-spares, active-active or active-passive clustering, dual power feeds, and redundant components.

Service, Support and References

What happens if the unit breaks? Who services it, how long does a repair take, and is help available 24x7 with 2-hour service? Can you fix it yourself in return for lower maintenance costs? How are parts supplied, and for how many years will they be available? Is it repaired under power or must it be shut off? Is there an automatic notification of a part or system failure?

Not completing backups on time is just not acceptable. What are the management reports and diagnostics like? Who will help you troubleshoot a performance problem? Are software patches and upgrades handled non-disruptively or must you take the system down or reprocess data?

Is your PSO or SSO solution supported by your storage or backup vendor? If you buy a target deduplication device, can you use your existing backup software? To avoid finger pointing on escalation issues, ask if joint service agreements are in place. For example, if you place a StorWize unit in front of a NetApp appliance, will NetApp still service their equipment?

No one wants to be the “first kid on the block”. Ask for references and have conversations with them to see how long they have been using the solution, issues they faced with support, the ease of doing business with the vendor, etc.

Conclusion

If only information wasn’t growing at such a rapid pace, or the economic times were better, or even if staff and other costs were lower, you probably wouldn’t be considering compression or deduplication. But these harsh realities are here. The good news is these problems are not insurmountable - they can be addressed.

The best advice is to reduce primary storage. Marc Staimer says the burdened cost of a “…terabyte of primary storage is now $43,000 a year” and Gartner says 20-40% of corporate data is junk (“no business value”)113. Archiving or deleting data permanently frees up primary space and greatly reduces the backup burden placed on secondary space and WAN bandwidth. Who knows how much space could be saved with just 5 minutes of disk housekeeping a day – it would also make a good deterrent for downloading the entire internet.

Beyond deletion and archiving, the key to selecting the best solution is based on identifying the problem you want to solve. For example, if you have a small amount of data, individual utilities could be your answer. If you are running out of space on your primary file servers, compression and even deduplication solutions are becoming popular. A corporate-wide backup issue where hundreds of terabytes need to be backed up within a finite timeframe requires a deduplication solution. There are also hybrids combining various features that even include magnetic tape.

Compressing primary storage is a no-brainer – it does not alter the data and works regardless of data change rates. Given the speed of processors today (versus the slow speed of disk I/O), you probably won't see a performance difference using compression. Faster processors are setting the stage for the deduplication of primary storage, but you need to be careful using it – deduplication of active primary storage databases could cause you to find a new employer.

With static data endlessly backed up, duplication wastes limited funds and is a significant drain on resources. Deduplication is a mature, “must have” technology that solves the challenges of backups, budgets and risk, and can usually pay for itself in a year or two. Solutions can address remote office data issues, poor performance, business continuity, and security concerns.

When it comes to backups, you have lots of choices for where they should be done, how quickly they are processed, and whether the backup data is electronically replicated. While it is easy to transport tapes using the Chevy Truck Access Method (CTAM), you do not want to lose unencrypted tapes. If you stay with a tape backup strategy, you will continue to stream redundant data114 and face long restoration efforts. Tape cannot scale with tight backup windows, media costs are high, the labor to handle tapes keeps increasing, transportation and offsite storage are expensive and risky, and with a disaster, recovery is time consuming and uncertain.

As technology evolves, compression and deduplication will become standard features rather than add-on products. It is no different from buying a car; air conditioning used to be an option but is now considered standard equipment. Speaking of cars, George Peck, the President of the Michigan Savings Bank in 1903, advised Henry Ford's lawyer against investing in the now famous car company, saying "The horse is here to stay but the automobile is only a novelty, a fad."115 Eventually the car replaced the horse. The first TiVo was introduced in the late 1990's, and it took less than 20 years before people stopped buying VCR tapes. Storage optimization is the new car or the new TiVo. The data optimization journey is really just beginning and where it is headed is anyone's guess. Years from now you will likely look back and say that the science

of data optimization really took off in the early 2000's. Perhaps someday, all data will be stored only once! Evaluate its effectiveness for your own environment; the ROI is compelling!

Appendix I – MD5 and SHA-1 Algorithms

MD5 was developed in 1991 by Professor Ronald Rivest of MIT116 as a cryptographic message digest for securing messages. With storage and communication optimization, the focus changed to using Message Digest 5 (MD5) algorithm codes to determine whether data is unique. In 2004, reports surfaced that MD5 hash codes, assumed to be unique, could in fact generate duplicates – i.e., a hash collision. At that time, the EMC Centera was a highly visible product using MD5 to help store unique content, so the fear was that two unique documents could incorrectly have the same hash number. For example, if two patients had X-rays archived, a patient could receive the wrong image. Fear ensued even though experts claimed this was a mathematical improbability and was blown out of proportion. Truth be told, the only way to guarantee uniqueness is a full byte-by-byte comparison. In the end, EMC took steps to further ensure the uniqueness of the two documents and even transitioned to the SHA-256 algorithm in part to get past the concern.

However, few issues are black and white, and computing is full of estimates. For example, the calculation of 1/3 is only an approximation in binary math. The question is whether the MD5, SHA-1 or SHA-256 algorithm is good enough. Let's examine these algorithms and judge for ourselves.

The mathematics behind MD5 is beyond the scope of this paper; however, the basic steps to create a hash number are117:
1. Pad the message so its length is 448 mod 512
2. Append a 64-bit original length value to the message
3. Initialize the 4-word (128-bit) MD buffer
4. Process the message in 16-word (512-bit) blocks:
   - using 4 rounds of 16 operations on the message block and buffer
   - adding the output to the buffer input to form the new buffer value
5. The output hash value is the final buffer value
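To see these digests in action, the short sketch below uses Python's standard hashlib library to fingerprint blocks of data with MD5, SHA-1 and SHA-256, roughly the way a deduplication engine might index chunks. The chunk size, the sample data, and the fingerprints helper are illustrative assumptions; a real dedupe engine adds an index store, collision handling and, often, variable-size chunking.

    # Illustrative only: fingerprint fixed-size chunks and keep each digest once.
    import hashlib

    CHUNK_SIZE = 8 * 1024  # 8 KB chunks - an assumption for this example

    def fingerprints(data: bytes):
        seen = {}
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()  # 2^256 possible codes
            seen.setdefault(digest, chunk)              # store only unseen chunks
        return seen

    sample = b"x" * CHUNK_SIZE * 25 + b"a different tail"
    unique = fingerprints(sample)
    print(f"{len(sample)} bytes reduced to {len(unique)} unique chunk(s)")

    print("MD5    :", hashlib.md5(sample).hexdigest())      # 128-bit digest
    print("SHA-1  :", hashlib.sha1(sample).hexdigest())     # 160-bit digest
    print("SHA-256:", hashlib.sha256(sample).hexdigest())   # 256-bit digest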

MD5 can create 340,282,366,920,938,463,463,374,607,431,768,211,456 (2^128) unique codes, meaning that roughly 2^64 (about 18,446,744,073,709,600,000) hashes would be needed before a collision becomes likely. It is more likely an asteroid will cause a global disaster (1:100,000,000)118.

When SHA-1 is used, the number of codes goes from 2^128 to 2^160. As a number, it is 1,461,501,637,330,900,000,000,000,000,000,000,000,000,000,000,000.

To reduce the collision risk even further, SHA-224, SHA-256, SHA-384 or even SHA-512 can be used. If you want to feel secure, then use SHA-256. It makes the chance of a collision extremely remote. It can generate 2^256 unique codes, which written out is 115,792,089,237,316,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000. In fact, the number of codes it can generate is "significantly more than the number of atoms in our galaxy (around 2^226)119." If that is not enough, work began on the SHA-3 family of functions120 in 2007.
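For those who like to see the arithmetic, the sketch below applies the standard birthday-bound approximation, under the assumption of uniformly random digests, to estimate the collision probability for a given number of fingerprinted chunks; the chunk count is a made-up example.

    # Rough birthday-bound estimate of hash-collision risk for n random digests.
    def collision_probability(n_chunks, digest_bits):
        # birthday approximation: p ~= n^2 / 2^(bits+1), valid while p << 1
        return (n_chunks ** 2) / 2.0 ** (digest_bits + 1)

    # Example: one trillion chunks (at 8 KB each, roughly 8 PB of unique data)
    n = 10 ** 12
    for name, bits in (("MD5", 128), ("SHA-1", 160), ("SHA-256", 256)):
        print(f"{name:8s} collision probability ~ {collision_probability(n, bits):.1e}")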

So at what point do we reach “good enough”?

Appendix II – SNIA Deduplication and VTL Features

DEDUPLICATION PRODUCT MATRIX

| Company | Product / Series | Design Approach | Granularity | When Deduplication Occurs | Where Deduplication Occurs | Compression |
| Data Domain | Appliance Series DD120, DD510, DD530, DD565, DD690, DD880 & ES20 Expansion Shelves | Appliance | Variable | Inline | Target | Yes |
| EMC | EMC Avamar Data Store | Grid Storage | Variable | Inline | Source | Yes |
| EMC | EMC Avamar Virtual Edition for VMware | Software | Variable | Inline | Source | Yes |
| Exar (Hifn Technology) | Hifn BitWackr for Linux; Hifn BitWackr for Windows | Component | Fixed | Inline | Target | Yes |
| FalconStor | FalconStor Virtual Tape Library | Appliance, Gateway, Software | Fixed | Post-process | Target | Yes |
| FalconStor | FalconStor Virtual Tape Library Virtual Appliance | Software | Fixed | Post-process | Target | Yes |
| FalconStor | FalconStor File-interface Deduplication System | Appliance, Gateway, Software | Fixed | Post-process | Target | Yes |
| FalconStor | FalconStor File-interface Deduplication System (FDS) Virtual Appliance | Software | Fixed | Post-process | Target | Yes |
| HP | HP Accelerated Deduplication | Appliance | Differencing | Post-process | Target | Yes |
| HP | HP Dynamic Deduplication | Appliance | Variable | Inline | Target | Yes |
| Hitachi | ProtecTIER | Software | Variable | Inline | Target | Yes |
| NEC | HYDRAstor HS8-2000 | Grid Storage | Variable | Inline | Target | Yes |
| NetApp | Deduplication For FAS | Storage System | Fixed | Post-process | Target | No |
| NetApp | Deduplication For V-Series | Gateway | Fixed | Post-process | Target | No |
| NetApp | Deduplication For VTL | Appliance | Variable | Post-process | Target | Yes |
| Permabit | Enterprise Archive Business Series | Grid Storage | Variable | Inline | Target | Yes |
| Permabit | Enterprise Archive Data Center Series | Grid Storage | Variable | Inline | Target | Yes |
| Quantum | DXi2500-D | Appliance | Variable | Policy-based | Target | Yes |
| Quantum | DXi3500 | Appliance | Variable | Inline | Target | Yes |
| Quantum | DXi7500 Express, DXi7500 Enterprise | Appliance | Variable | Policy-based | Target | Yes |
| SEPATON | DeltaStor | Appliance, Grid Storage | Differencing | Post-process | Target | Yes |
| SUN | Virtual Tape Library Plus with ECO | Appliance | Fixed | Post-process | Target | Yes |

VTL PRODUCT MATRIX

| Company | Product / Series | Implementation | Virtual Drives | Partitions / Virtual Libraries | Replication | Compression |
| Data Domain | Appliance Series VTL Option: DD120, DD510, DD530, DD565, DD690, DD880, & ES20 Expansion Shelves | Appliance | 1-128 | 1-64 Virtual Libraries | Replication | Yes |
| Data Domain | Gateway Series | Gateway | 1-128 | 1-64 Virtual Libraries | Replication | Yes |
| Data Domain | DDX Array Series VTL - 4, 8, & 16 | Software | up to 2048 | 4 to 1024 Virtual Libraries | Replication | Yes |
| EMC | EMC Disk Library 4106 | Appliance | 1024 | 1 to 128 | Replication | Yes |
| EMC | EMC Disk Library 4206 | Appliance | 2048 | 1 to 256 | Replication | Yes |
| EMC | EMC Disk Library 4406 | | | | | |
| EMC | EMC Disk Library for Mainframe (DLm120) | Appliance | 512 | 32,768 | Via special request | Yes |
| EMC | EMC Disk Library for Mainframe (DLm960) | Appliance | 1536 | 98,304 | Via special request | Yes |
| FalconStor | FalconStor Virtual Tape Library | Appliance | 1 to 1024 | 1 to 128 | Export to Tape, Replication | Yes |
| FalconStor | FalconStor Virtual Tape Library | Gateway | 1 to 1024 | 1 to 128 | Export to Tape, Replication | Yes |
| FalconStor | FalconStor Virtual Tape Library | Software | 1 to 1024 | 1 to 128 | Export to Tape, Replication | Yes |
| FalconStor | FalconStor Virtual Tape Library Virtual Appliance | Appliance | 1 to 16 | 1 to 4 | Replication | Yes |
| Hitachi | ProtecTIER | Software | 256 | 1 to 16 | No | Yes |
| Hewlett-Packard | HP StorageWorks 12000 Virtual Library System EVA Gateway | Gateway | 1024 | 1 to 128 | Replication | Yes |
| Hewlett-Packard | HP StorageWorks 9000 Virtual Library System | Appliance | 1024 | 1 to 128 | Replication | Yes |
| Hewlett-Packard | HP StorageWorks 6000 Virtual Library System | Appliance | 128 | 1 to 16 | Replication | Yes |
| Hewlett-Packard | HP StorageWorks D2D4000 Backup System | Appliance | 1 to 4 | 1 to 24 | Replication | Yes |
| Hewlett-Packard | HP StorageWorks D2D2500 Backup System | Appliance | 1 to 4 | 1 to 8 | Replication | Yes |
| NetApp | NetApp VTL300 | Appliance | 1 - 1500 | 1 - 256 | Yes | Yes |
| NetApp | NetApp VTL700 | Appliance | 1 - 1500 | 1 - 256 | Yes | Yes |
| NetApp | NetApp VTL1400 | Appliance | 3000 | 512 | Yes | Yes |
| Quantum | DXi7500 Express; DXi7500 Enterprise | Appliance | 160 | 64 | Replication | Yes |
| Quantum | DXi3500 | Appliance | 64 | 16 | Replication | Yes |
| Quantum | DX3000 & DX5000 | Appliance | 64 | 16 | NA | Yes |
| Quantum | DX30 & DX100 | Appliance | 500 | 64 | NA | Yes |
| SEPATON | 2100-ES2 Series 1000 | Appliance/Grid | 3,072 | 3,072 | Yes - with partner | Yes |
| Sun Microsystems | Sun StorageTek VTL Plus 1200, 1202, 2600, 3600 | Appliance | 2048 | 256 | Secure Tape, export to tape | Yes |
| Sun Microsystems | Sun StorageTek VSM | Appliance | 65536 | 256 | Yes | Yes |

Footnotes

1 http://www.emc.com/collateral/demos/microsites/idc-digital-universe/iview.htm
2 http://www.latimes.com/news/nation-and-world/la-na-census-texting16-2009dec16,0,2439930.story
3 http://www.symantec.com/content/en/us/about/media/DataCenter08_Report_Global.pdf
4 http://www.emc.com/collateral/demos/microsites/idc-digital-universe/iview.htm
5 http://en.wikipedia.org/wiki/Papyrus
6 “Adaptive Data Compression” by Ross Neil Williams, Page 2
7 http://special.lib.gla.ac.uk/teach/papyrus/oxyrhynchus.html
8 http://ancientcoinsforeducation.org/pdf/coin_anat.pdf
9 http://www.vldb.org/conf/2003/papers/S28P01.pdf, page 8
10 http://oldcomputers.net/ibm5120.html
11 http://download.microsoft.com/download/9/8/f/98f3fe47-dfc3-4e74-92a3-088782200fe7/TWWI05022_WinHEC05.ppt
12 Capital Expenses (CAPEX) are associated with assets like hardware and software owned for a long period of time, e.g. 3-4 years.
13 Return On Investment (ROI) - a performance measurement to evaluate or compare the efficiency of an investment(s). To calculate ROI, the benefit (return) of an investment is divided by the cost of the investment; the result is expressed as a percentage or a ratio. http://dictionary.reference.com/browse/return+on+investment
14 Operating Expenses (OPEX) are associated with everyday costs such as maintenance, power, cooling, training, wages, etc.
15 http://powercalculator.emc.com
16 http://h30099.www3.hp.com/configurator/calc/Power%20Calculator%20Catalog.xls
17 http://www.eia.doe.gov/cneaf/electricity/epm/table5_6_a.html
18 http://historywired.si.edu/detail.cfm?ID=306
19 Introduction to Data Compression - Morgan Kaufmann
20 “Data Compression - The Complete Reference” - David Salomon, page 3
21 http://www.fadden.com/techmisc/hdc/lesson04.htm
22 http://frakira.fi.muni.cz/izaak/MEMICS-proceedings-2008/papers/p138/compWBcomp.pdf
23 Introduction to Data Compression - Morgan Kaufmann
24 http://en.wikipedia.org/wiki/LZ77
25 http://corpus.canterbury.ac.nz/descriptions/
26 Data taken from http://www.maximumcompression.com/data/summary_mf2.php. For more info, please see http://www.maximumcompression.com/index.html
27 http://en.wikipedia.org/wiki/Shannon%E2%80%93Fano_coding
28 http://www.cs.waikato.ac.nz/~ihw/papers/95JC-WT-IHW-Unbound.pdf
29 http://articles.techrepublic.com.com/5100-10878_11-6180692.html
30 An acronym for the Zettabyte File System
31 http://www.opensolaris.org/os/community/zfs/whatis/
32 http://analytics.informationweek.com/abstract/24/753/Storage-Server/data-doubles-down.html
33 http://www.cisco.com/warp/public/cc/pd/ifaa/pa/czsvaa/prodlit/csa70_ds.pdf
34 http://www.datadomain.com/calc/faq.html
35 http://senior-net.com.ar/chile_tecnico/descargar2.php?idarchivo=58&archivo=Virtualizacion_de_Cintas_y_Deduplicacion_de_datos_Parte1.pdf
36 http://www.ibm.com/developerworks/wikis/download/attachments/106987789/TSMDataDeduplication.pdf?version=1
37 Adapted from this EMC presentation: http://finland.emc.com/collateral/campaign/global/forums/deduplication.pdf
38 http://en.wikipedia.org/wiki/MD5
39 http://en.wikipedia.org/wiki/Sha-1
40 http://viewer.bitpipe.com/viewer/viewDocument.do?accessId=10772107
41 http://www.northern.net/en/Storage-Management/Reporting/
42 http://www.usenix.org/events/usenix-win2000/full_papers/bolosky/bolosky.pdf
43 http://download.microsoft.com/download/8/a/e/8ae7f07d-b888-4b17-84c3-e5a1976f406c/SingleInstanceStorage.doc
44 http://www.emc.com/collateral/analyst-reports/010208-esg-emc-centera-ease-of-use.pdf
45 Storage Area Networks For Dummies, page 342
46 http://dstelevaulting.com/pressroom/print_uk_press/computerwire_article.htm
47 http://www.networkcomputing.com/informationweekreports_doc_2008_207602796.pdf, page 62
48 http://www.hifn.com/uploadedFiles/Products/Solution_Bundles/Data_De_Dupe/HifnWP-BitWackr-2.pdf
49 http://blog.druvaa.com/2009/01/09/understanding-data-deduplication/
50 http://www.clipper.com/research/TCG2007016.pdf
51 http://www.xmailserver.org/rabin.pdf
52 http://www.sparknotes.com/cs/searching/hashtables/section4.rhtml
53 http://books.google.com/books?id=lBY6CcXU59EC&printsec=frontcover&dq=%22adaptive+data+compression%22&cd=1
54 http://www.freepatentsonline.com/5990810.pdf
55 http://www.datadomain.com/products/faq-patents.html
56 http://dockets.justia.com/docket/court-candce/case_no-5:2007cv04161/case_id-194998/
57 http://www.freepatentsonline.com/7116249.pdf
58 http://itknowledgeexchange.techtarget.com/storage-soup/quantum-riverbed-bout-ends-with-11m-handshake/
59 http://www.freepatentsonline.com/6810398.pdf
60 http://storageconference.org/STORAGECONFERENCE/2003/presentations/C08-Olsen.pdf


61 http://searchdatabackup.techtarget.com/news/article/0,289142,sid187_gci1299830,00.html#
62 http://www.silver-peak.com/assets/download/pdf/wp_SilverPeak_byte-level_deduplication.pdf
63 http://www.hpcwire.com/topic/storage/NEC-Primes-HYDRAstor-for-100-Years-of-Storage-Archive-59365007.html?page=2
64 http://www.commvault.com/solutions-deduplication.html
65 http://www.exagrid.com/resources/ExagridByteLevelWhitepaper.pdf
66 http://whitepapers.pcmag.com/index.php?option=com_categoryreport&task=thankyou&title=2416&pathway=no&vid=830
67 http://www.storagenewsletter.com/news/business/ibm-diligent-acquisition
68 http://www.lighthousecs.com/ui/user/File/SEPATON_Deduplication_Cost_Reductions.pdf
69 http://tanejagroup.blogspot.com/2009/08/enter-primary-storage-optimization.html
70 http://online.qmags.com/IFS0109/Default.aspx
71 http://www.datadomain.com/pdf/DataDomain-TechBrief-Inline-Deduplication.pdf
72 http://storwize.com/Public/Inhouse_PDFs/Corporate_Brochure.pdf
73 http://searchstorage.techtarget.com/magazineFeature/0,296894,sid5_gci1334543_mem1,00.html
74 http://getgreenbytes.web7.hubspot.com/Default.aspx?app=LeadgenDownload&shortpath=docs%2fgb-x_fact_sheet.pdf
75 http://public.hifn.com/Hifn%20Document%20Library/Bell_Data_BridgStor_091609.pdf.pdf
76 http://www.isilon.com/pdfs/brochures/Isilon_Ocacina_print_3_02_09.pdf
77 http://ocarinanetworks.com/company/background
78 http://www.reference.com/browse/wiki/NetApp
79 http://partners.netapp.com/go/techontap/matl/downloads/asis/flash/ASIS.html
80 http://www.stemmer.de/service/workshops/sbb2008sep/download/netapp.pdf
81 http://communities.netapp.com/servlet/JiveServlet/previewBody/1060-102-1-1030/FAQ%20FAS%20Deduplication%2003_07_08.pdf
82 http://www.emc.com/about/news/press/2009/20090223-01.htm
83 http://onlinestorageoptimization.com/index.php/emc-dedupe-beyond-data-domain/
84 http://www.dell.com/downloads/global/products/pvaul/en/nx4-dedup.pdf
85 http://www.youtube.com/watch?v=Ti_KE8Aj9Yc
86 http://download.microsoft.com/download/8/a/e/8ae7f07d-b888-4b17-84c3-e5a1976f406c/SingleInstanceStorage.doc, page 4
87 http://www.datadomain.com/news/press_rel_072009.html
88 http://www.spectralogic.com/index.cfm?fuseaction=home.displayFile&DocID=2352
89 http://www.misa.on.ca/en/conferences/resources/CDWEMCPresentation.pdf
90 http://www.industryedge.com/collateral/analyst-reports/esg-avamar-lab-validation-report.pdf
91 http://www.snia.org/forums/dmf/knowledge/white_papers_and_reports/SNIA_DMF_V6_DPI_Guide_4WEB.pdf
92 http://searchstorage.techtarget.co.uk/news/article/0,289142,sid181_gci1346197,00.html OR http://searchdatabackup.techtarget.com/generic/0,295582,sid187_gci1346356_mem1,00.html OR http://media.techtarget.com/searchStorage/downloads/StoragemagOnlineNovDec2009_final.pdf
93 http://info.commvault.com/forms/ESGBriefOnDataDeduplicationDiversity
94 http://reg.accelacomm.com/servlet/Frs.frs?Context=LOGENTRY&Source=source&Source_BC=99&Script=/LP/50603808/reg&
95 http://www.backupcentral.com/phpBB2/two-way-mirrors-of-backup-central-mailing-lists-4/ca-brightstor-arcserve-7/ca-arcserve-12-5-data-deduplication-101176/
96 http://www.infobahn.com/research-information.htm
97 http://www.sidepath.com/wan/files/Improve%20EMC%20SRDFA%20with%20Silver%20Peak%20Systems.pdf
98 http://www.silver-peak.com/assets/download/pdf/wp_SilverPeak_byte-level_deduplication.pdf
99 http://www.dciginc.com/2008/10/riverbed-dedupes-data-domain-managing-encrypt.html
100 In-band and out-of-band refer to their placement in a traffic flow. There are pros and cons to each approach.
101 http://www.riverbed.com/docs/SolutionBriefRiverbedEMCSRDFA.pdf
102 http://www.theregister.co.uk/2009/05/27/riverbed_autocad_2010/
103 http://www.cisco.com/en/US/solutions/collateral/ns340/ns517/ns224/ns377/white_paper_c11-550350.pdf
104 http://ralphlosey.files.wordpress.com/2007/09/hasharticlelosey.pdf
105 http://www.law.com/jsp/legaltechnology/pubArticleLT.jsp?id=1202433318789
106 http://www.deathbyemail.com/2008/10/is-de-duplication-legal.html
107 Input-Output Operations per Second (IOPS) is a performance measurement based on platter rotation speed, head movement and other factors.
108 The LTO-5 is expected to offer a maximum transfer rate of 0.96 TB/hr
109 http://www.symantec.com/connect/forums/de-dup-quantum-vs-data-domain#comment-1539571
110 http://forms.datadomain.com/go/datadomain/WS_HP_WP_ROITCO
111 http://salestools.quantum.com/querydocretriever_inc.cfm?ext=.pdf&mime=application/pdf&filename=576635.pdf
112 http://media.techtarget.com/searchStorage/downloads/StoragemagOnlineNovDec2009_final.pdf
113 http://viewer.media.bitpipe.com/970677580_377/1253542304_853/White-Paper---Six-Easy-Ways-to-Control-Storage-Growth_5995EF.pdf
114 Hierarchical tape technologies (e.g. IBM’s TSM) allow for new and modified files and sub-files to be backed up – similar to deduplication.
115 http://www.networkworld.com/columnists/2010/012210-backspin.html
116 http://en.wikipedia.org/wiki/MD5
117 http://www.cs.northwestern.edu/~ychen/classes/cs395-w05/lectures/class4.ppt
118 http://www.amnh.org/education/resources/rfl/pdf/cosmic_insert_glossary.pdf
119 http://blog.permabit.com/?p=6
120 http://en.wikipedia.org/wiki/SHA_hash_functions
