Beginners Guide to Corruption and How to Avoid It

© 2019 Veeam Software. Confidential information. All rights reserved. All trademarks are the property of their respective owners. Beginners Guide to Corruption and How to Avoid It

Contents

“But the backup was successful!”

Types of corruption

Failed to decompress LZ4 block (and similar)

All instances of storage metadata are corrupted

Internal VM issues

Misconfigurations

Tools and tips

3-2-1 backup strategy

SureBackup

Health Check

Veeam Validator

Recommended job settings

What about the Agents?

Conclusion

About Veeam Software

When choosing a backup solution, one of the most important decision factors is reliability. Understandably, when it comes to restores, backup administrators expect not only flexibility from the software, but also a guarantee that data can be restored. Given how multifaceted and nuanced the topic of data corruption is, it is safe to say that if a vendor guarantees 100% reliability of backups, they are not telling the whole truth.

Still, many backup administrators simply assume that backups are fail-safe. As a result, a situation where data cannot be restored from a backup can come as a huge shock and is often seen as the backup vendor failing to provide what was promised. In reality, there are different kinds of data corruption with different causes, and it is a misconception to put the blame purely on backup software. (As you will see in this white paper, Veeam® Backup & Replication™ cannot be blamed for any type of corruption described. If it could, the Veeam team would have fixed it a long time ago.) On the bright side, many backup providers, including us here at Veeam, provide a number of tools that can help reduce the risk of running into an unrecoverable backup.

My hands-on experience in Veeam Support puts me at the center of some of these situations. In this white paper, I will examine different corruption types and provide advice on countermeasures, based on experience from working with customers on various types of Veeam Backup & Replication infrastructures. For administrators thinking of buying Veeam Backup & Replication and testing out our trial, I hope that this white paper achieves two goals:

1. Set the expectations straight on what Veeam offers and what we cannot promise, if we want to be honest.

2. Show that Veeam Backup & Replication has all the tools that, if used right, can make permanent data loss a very unlikely event.

For existing customers, I encourage you to read the white paper to understand the potential risks and review your Veeam Backup & Replication setups to make sure you are using the product to its maximum potential.

Disclaimer

Much of the guidance in this content comes from first-hand experience working with support cases. This white paper is not intended as a definitive guide, as it is not possible to cover all possible situations. New threats might also arrive in the future. If you have a potential corruption issue, it is always advisable to open a Veeam support case so that the issue can be analyzed and resolved correctly.

“But the backup was successful!”

This is a phrase that support teams sometimes hear from clients when we must give the bad news. For us support engineers, this means one thing: a fundamental misunderstanding of the backup process and of what Veeam Backup & Replication offers as a product. It is a big mistake to shift responsibility for hardware, operating system and application health from proper monitoring tools to Veeam Backup & Replication. Admittedly, sometimes Veeam Backup & Replication does seem to have such capabilities. It requires many components to work properly and uses many third-party APIs, so in my support practice I’ve seen countless cases where errors in Veeam Backup & Replication revealed underlying infrastructure issues that clients did not realize were there. However, this is no more than a positive side effect.

Before we discuss corruption types and the related countermeasures in more depth, it’s important to highlight some fundamental principles that might already help to reveal potential risks for backup corruption. The main point is this: Veeam Backup & Replication does an image-level backup of a VM and saves this information to a backup file. If the VM contains corrupted data (for example, one of its volumes turned into raw space), it will appear like that in the backup. If the VM was configured incorrectly (for example, using an independent disk or physical RDM), that will translate into data missing in the backup file. If something happens to the backup file (due to storage problems, virus attacks or manual deletion), this backup will not be usable anymore.

All these examples may seem very obvious, but they describe some of the complaints that we get in our everyday support practice! So be your own counsellor, be wary of the quality of data that you are backing up and try to look at the core of the potential issue within your setup first.

Types of corruption

In this section, we will examine the most common situations that may lead to an inability to restore data from backups.

Failed to decompress LZ4 block (and similar)

How to reveal corruption: SureBackup®, Health Check, Veeam Validator, attempt to decompress corrupted block.

KB on topic: https://www.veeam.com/kb1795

Data inside a backup file (.VBK, .VIB, .VRB) is stored in compressed blocks. A block can be saved incorrectly due to underlying issues with the storage. I will pass the mic to our Senior VP, Anton Gostev, who described it in the following manner:

In human language, the issues look like:

1) We ask storage to write “MOM,” but it writes “DAD” instead and returns success.

2) We ask storage to write “MOM,” and it writes “MOM” and returns success, but if you try to read the data block, you get “MAM.”

3) We ask storage to commit the write of “MOM” to disks, and it returns success, but does not actually write data to disks, keeping it in buffer for a short period of time for performance optimization purposes.

Answering your question, we can only judge on these reported successes, so we mark the job as successful. This is why, it is very important to use SureBackup to verify that what was written into the backup file is what we asked, especially once you get this error at least once and your backup storage becomes a suspect.

Even if data is saved correctly, there is still a risk that it can eventually be corrupted (an issue known as “bit rot”). No storage vendor can guarantee absolute data integrity; it is more a question of the number of errors per amount of data. Our only recommendation is to stay away from cheap low-end NAS devices that use dubious optimization techniques to show better performance and suffer every now and then from firmware bugs that can result in data corruption. Once again, Anton Gostev said it all years ago on the Veeam forum.

If a couple of 0s and 1s inside a compressed block get swapped, an attempt to decompress the block (typically during restore) will fail. There is both good and bad news here. The good news is that the backup is still restorable. Veeam support can provide special modified agents that allow you to skip the corrupted blocks. So, if corruption was minimal, you have a very high chance of restoring your data. The bad news, however, is that such corruption can be hard to discover. Most operations in Veeam Backup & Replication do not require block decompression. A corrupted block can travel from one backup file to another through merges, synthetic fulls or backup copy and not be discovered. The countermeasure here is regular backup verification and some tricks with job settings, both of which we’ll discuss later in this white paper.

Note that the error message can be different, depending on what part of the backup file suffered from corruption. It is impossible to describe every error here, so be sure to open a support ticket if you are experiencing issues, as Veeam support may be able to help you.
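As a rough illustration of why a flipped bit usually breaks decompression, here is a minimal Python sketch. It is not Veeam code: zlib from the Python standard library stands in for LZ4 (an assumption made only to keep the example dependency-free), and the helper names are hypothetical.

```python
import zlib


def flip_bit(data: bytes, byte_index: int, bit: int = 0) -> bytes:
    """Return a copy of data with a single bit inverted."""
    corrupted = bytearray(data)
    corrupted[byte_index] ^= 1 << bit
    return bytes(corrupted)


def try_decompress(block: bytes) -> bool:
    """Return True if the block decompresses cleanly, False otherwise."""
    try:
        zlib.decompress(block)
        return True
    except zlib.error:
        return False


# Build a compressible "block" and store it compressed (zlib stands in for LZ4).
original = b"".join(f"record {i:04d}: MOM\n".encode() for i in range(200))
stored_block = zlib.compress(original)

# Simulate bit rot: one bit changes somewhere in the middle of the stored block.
damaged_block = flip_bit(stored_block, len(stored_block) // 2)

print("intact block readable: ", try_decompress(stored_block))
print("damaged block readable:", try_decompress(damaged_block))
```

Note that nothing in the sketch touches the damaged block until something tries to decompress it, which mirrors why this kind of corruption can stay hidden until a restore or an explicit verification.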

All instances of storage metadata are corrupted

How to reveal corruption: SureBackup, Health Check, Veeam Validator, practically any attempt to use the backup file.

KB on topic: https://www.veeam.com/kb1886

Inside every backup file (.VBK, .VIB, .VRB) there is special metadata that records what is stored where in the file (akin to the Master File Table of an NTFS volume). Without this data, the content of the backup file becomes meaningless 0s and 1s. Given its importance, metadata is saved twice in a backup file at separate offsets, and these instances are never updated at the same time.

If one instance gets corrupted, nothing bad will happen. Veeam will continue operations normally and eventually the metadata will be updated with proper information. However, if both instances get corrupted, the part of the backup chain that depends on the corrupted file becomes unusable. Unfortunately, there is no way to restore data from such a backup. The backup chain will need to be restarted by performing an active full.
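The idea of keeping two independently checked metadata instances can be sketched as follows. This is an illustration only: the on-disk layout, the CRC32 checksum and every name below are assumptions, not the actual Veeam backup file format.

```python
import json
import zlib
from typing import Optional


def encode_metadata(meta: dict) -> bytes:
    """Serialize metadata and prepend a CRC32 so damage can be detected."""
    payload = json.dumps(meta, sort_keys=True).encode()
    return zlib.crc32(payload).to_bytes(4, "big") + payload


def decode_metadata(raw: bytes) -> Optional[dict]:
    """Return the metadata, or None if the stored checksum does not match."""
    if len(raw) < 4:
        return None
    checksum, payload = raw[:4], raw[4:]
    if zlib.crc32(payload).to_bytes(4, "big") != checksum:
        return None
    return json.loads(payload)


def load_metadata(copy_a: bytes, copy_b: bytes) -> dict:
    """Use the first healthy instance; fail only when both are damaged."""
    for raw in (copy_a, copy_b):
        meta = decode_metadata(raw)
        if meta is not None:
            return meta
    raise ValueError("All instances of the storage metadata are corrupted")


meta = {"block_offsets": {"0": 4096, "1": 8192}}
good = encode_metadata(meta)
bad = good[:-1] + bytes([good[-1] ^ 0xFF])  # damage the last byte of one instance

print(load_metadata(bad, good) == meta)  # one healthy instance is enough
```

With one damaged instance the reader silently falls back to the other; only when both fail their check is there nothing left to interpret the rest of the file with.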

If you encounter such corruption, it is likely a result of a serious issue in the storage, as corruption must happen on two different portions of the file in a short period of time. We will talk about tools that can help you to check storage health in the next chapter.

As usual, there is both good and bad news here. The bad news is that such a backup file is completely unusable. Unfortunately, even Veeam support won’t be able to help here. A slight consolation is that if the backup chain is actively used, such corruption will not stay hidden for long. Even if all periodic checks are off, just an attempt to use the corrupted file (for example, during incremental runs or by a backup copy job) will result in an error saying: “All instances of the storage metadata are corrupted. Failed to download disk. Reconnectable protocol device was closed. Failed to upload disk. Agent failed to process method {DataTransfer.SyncDisk}.”

Corruption due to CBT bugs

How to reveal corruption: SureBackup or manual restore.

KBs on topic: https://www.veeam.com/kb1940, https://www.veeam.com/kb2075, https://kb.vmware.com/s/article/55800

CBT corruption was a nightmare come true for many vSphere users and became the epitome of “silent data corruption.” Indeed, the backup was running fine and the health check did not reveal the corruption, because from its perspective, the backup file was absolutely healthy. However, if you feed a backup solution rubbish, you will get rubbish in the backup as well. And CBT was giving Veeam Backup & Replication rubbish instead of proper data. CBT corruption could only be found by SureBackup, which actually starts VMs in an isolated environment, or by doing a restore (and at this point it could already be too late).

Several versions of vSphere contained CBT bugs. It happened in version 5.5, in version 6.0 (twice) and most recently in 6.5 and 6.7 when vVols are used. So, even though things look stable right now, there is no guarantee that it won’t happen again. This highlights the necessity of setting up a proper verification process, which will be discussed later.
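To see why a change-tracking bug slips past file-level checks, consider this toy Python model. It is a deliberately simplified sketch: the disk is a list of blocks, the "CBT" is just a set of block indexes, and all names are hypothetical.

```python
import hashlib


def take_incremental(disk, changed_ids):
    """Copy only the blocks that change tracking reports as modified."""
    return {i: disk[i] for i in changed_ids}


def increment_is_healthy(increment, recorded_hashes):
    """File-level check: every stored block still matches its recorded hash."""
    return all(
        hashlib.sha256(block).hexdigest() == recorded_hashes[i]
        for i, block in increment.items()
    )


def restore(full, increments):
    """Replay increments on top of the full backup, oldest first."""
    image = list(full)
    for increment in increments:
        for i, block in increment.items():
            image[i] = block
    return image


disk = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
full = list(disk)            # full backup of the original state

disk[1] = b"bbbb"            # two blocks change in production...
disk[3] = b"dddd"
reported = {1}               # ...but buggy change tracking reports only one

increment = take_incremental(disk, reported)
recorded = {i: hashlib.sha256(b).hexdigest() for i, b in increment.items()}

healthy = increment_is_healthy(increment, recorded)
restored = restore(full, [increment])

print("backup file passes hash check:", healthy)
print("restored image matches source:", restored == disk)
```

Every block in the increment is exactly what was written, so a hash-based check has nothing to complain about. The problem is the block that was never written, which only shows up when the restored machine is actually started or compared against the source.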

Internal VM issues

How to reveal corruption: SureBackup, manual restore, proper monitoring tool on guest OS.

So far, we have assumed that data corruption comes from outside of the VM. But this is far from the truth. Data inside a VM is just as susceptible to corruption. It is not rare for a volume with data to turn into raw space or for a rarely used database to become inconsistent. Without proper monitoring, some time may pass before the problem gets noticed. It is especially sad if no long-term retention is implemented, so that by the time corruption is revealed, the last valid restore point has already been deleted.

Can backup software be blamed for this? In my opinion, no. A healthy backup implies healthy source data (trash in, trash out, remember!). Veeam Backup & Replication does not offer application or OS monitoring, though, admittedly, some job options may indirectly help with that (more on that later). But shifting responsibility like that is a recipe for disaster.

Misconfigurations

Though they are not corruption as such, misconfigurations can lead to data loss, so they must be mentioned. Most misconfigurations come from a poor understanding of hypervisor-level backups and particular settings. Typically, the result is data that is missing from the backup or was saved in an unrecoverable state. Here are some examples from support practice:

1. Independent disks and physical RDM. These cannot be snapshotted and consequently will be excluded from the backup. Veeam Backup & Replication gives a note about it in job statistics, but sometimes it goes unnoticed until it’s too late.

2. In-guest iSCSI disks. These are not visible on hypervisor level, so they won’t be present in the backup.

3. Exclusions in backup jobs. Veeam Backup & Replication allows the exclusion of certain disks or whole VMs (if VMs are added as part of container). Due to human error, such exclusions can be forgotten about.

4. VMs deleted by retention. If a VM is not backed up for a certain period (for example, the VM is added to a job as part of a container and then migrated, so it is no longer part of this container), data may eventually get lost. If your job is working in forward incremental mode, the VM will be lost when the last restore point containing this VM is deleted. For other backup modes, keep in mind the “remove deleted items” option. If it is set too low, you risk accidentally losing your data!

5. Stub files. Some vendors offer software that moves files to a special storage behind the scenes, leaving instead a “stub” file on the disk (essentially a pointer). When backing up a VM with stub files, you are not backing up “real” files. Instead, you will get a backup with “empty” files.

6. Storage Spaces in VM. Such a setup is supported neither by Veeam nor by Microsoft. See https://www.veeam.com/kb1989.

There are many other scenarios that make support engineers scratch their heads. The golden rule is to try out different restores beforehand to avoid surprises at the critical moment.

Tools and tips

Now that we have covered some of the challenges, let’s see what Veeam Backup & Replication can offer to minimize the risk of having unrecoverable backups.

3-2-1 backup strategy

If you think about it, all the checks that Veeam Backup & Replication offers are there to give an early warning that a backup is unusable. They won’t help in a situation when data is already missing in production. Keeping more than one backup copy will help you get the data back. So implementing the 3-2-1 Rule should be the number one priority if you want to greatly minimize risks. Luckily, Veeam Backup & Replication supports a variety of secondary destinations for a backup job: backup copy (including cloud), tape and storage snapshots (if you have compatible storage).
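The 3-2-1 Rule is commonly stated as: at least three copies of the data, on at least two different types of media, with at least one copy off site. A small Python sketch of that check is shown below. The class and field names are hypothetical, and counting the production data as one of the three copies is an assumption of this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCopy:
    name: str
    media: str      # for example "disk", "tape" or "cloud"
    offsite: bool


def satisfies_3_2_1(copies):
    """Check the 3-2-1 Rule: 3 copies, 2 media types, 1 copy off site."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.media for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


production_only = [
    DataCopy("production", "disk", offsite=False),
    DataCopy("primary backup", "disk", offsite=False),
]
with_secondary = production_only + [
    DataCopy("tape copy", "tape", offsite=True),
]

print(satisfies_3_2_1(production_only))  # False
print(satisfies_3_2_1(with_secondary))   # True
```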

SureBackup

SureBackup is Veeam Backup & Replication’s ultimate backup verification tool. It allows you to actually make sure that the guest OS starts and that applications respond to commands. (Veeam Backup & Replication contains a number of predefined scripts for some applications, but you can create your own as well.)

Health Check

As a relatively recent addition, Health Check is a post-job activity that recalculates the hash of each block in the backup and compares it to what is stored in the file metadata. A few things to note:

1) As mentioned, Health Check is a post-job activity, meaning it will only run after the job has created a new restore point. If you schedule it for a day when the job does not run, Health Check will run the next time the job runs.

2) Health Check is a very laborious task, so expect it to take a long time. The exact duration will depend on storage performance. Expect a longer time on NAS devices and an especially long time on deduplication appliances.

3) If the job is using forward incremental mode and the backup chain consists of several subchains (VBK+VIBs), only the most recent subchain will be verified.

4) Health Check is affected by Backup Window settings. Keeping in mind the expected long duration of Health Check, setting a stringent Backup Window can result in constant Health Check failures.

If Health Check finds corruption, it will inform the user (the job session will be indicated as failed) and immediately start the repair process. Several scenarios are possible depending on the backup mode, the type of corruption and the types of restore points:

Decompression error in incremental backup. Consider the image below. During the original check, corruption was found in incremental backup corbckpD2018-09-30T100803.vib. Consequently, the next two incremental backups were also rendered unusable. Then, the repair process starts. It will create a snapshot of the VM in production and will transfer the corrupted blocks from the VM to the latest restore point, “reconnecting” the latest VIB to the most recent valid restore point. The corrupted points will remain in the chain but will be left unusable. They still contain some valid blocks though, so we do not delete them, in order to minimize the traffic during repair (however, Veeam Backup & Replication will still have to read the whole disk content of the VM during the repair stage to find the necessary blocks).

Decompression error in full backup. Similar to the previous scenario, however since the VBK is corrupted, Veeam Backup & Replication has to invalidate the whole chain. Next, missing blocks are loaded into the latest VIB, making it possible to restore the VM to the latest state.

Metadata corruption in incremental backup. Since metadata corruption renders backup files completely unusable, information about a corrupted increment and all dependent increments gets deleted from the chain. After that, Health Check creates a new increment from scratch and links it to the last valid point in the chain. Keep in mind that though corrupted points are deleted from the Veeam Backup & Replication database, the files remain on the repository (for analysis purposes). You will need to delete them manually to free up the space.

Metadata corruption in full backup. If metadata is corrupted in the VBK, the whole chain becomes invalid. The job will run an active full to recreate the VBK.

Reverse incremental chain. Health Check does not verify rollbacks (.VRB). Only VBK is verified. Depending on the type of corruption, it is updated with valid blocks or is completely recreated.
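The detection half of Health Check (recalculating block hashes and comparing them with the values kept in metadata) can be sketched in a few lines of Python. This is only an illustration of the principle: the block size, the hash function and the function names are assumptions, and the repair logic described in the scenarios above is not modeled.

```python
import hashlib

BLOCK_SIZE = 4  # unrealistically small, to keep the example readable


def split_blocks(data, size=BLOCK_SIZE):
    """Cut a byte string into fixed-size blocks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def build_hash_table(blocks):
    """Hashes recorded in metadata at the time the blocks were written."""
    return [hashlib.sha256(block).hexdigest() for block in blocks]


def health_check(blocks, stored_hashes):
    """Return the indexes of blocks whose recomputed hash no longer matches."""
    return [
        i
        for i, (block, expected) in enumerate(zip(blocks, stored_hashes))
        if hashlib.sha256(block).hexdigest() != expected
    ]


stored = split_blocks(b"MOMMOMMOMMOM")   # three blocks of four bytes
table = build_hash_table(stored)

damaged = list(stored)
damaged[1] = b"OAMO"                     # block 1 silently changes on disk

print(health_check(stored, table))   # []
print(health_check(damaged, table))  # [1]
```

Because every block has to be read back and hashed, the cost of such a check grows with the size of the backup, which is consistent with the long run times mentioned above.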

Veeam Validator

While Health Check can only run after a new restore point has been created, Validator is a standalone tool that allows you to do “ad hoc” validation. You can find details on the usage in the following KB: https://www.veeam.com/kb2086. The verification process that Validator uses is identical to Health Check. However, unlike Health Check, Validator does not have a repair mechanism in case corruption is detected. If verification constantly shows points being corrupted, you are likely facing a storage issue. You can use an open source tool, Corruption Finder (CoFi), created by Veeam support engineer Vsevolod Zubarev, to perform an independent test. CoFi reproduces Veeam Backup & Replication behavior by writing blocks of data to the storage and calculating their hashes. Later, the hashes are recalculated to check whether the data has changed. Depending on how often you encounter backup corruption on average, you can set CoFi to write and verify a certain amount of data. You can download CoFi at https://github.com/yandexx/cofi.
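The write-then-verify principle behind such a storage test can be sketched as follows. This is not CoFi and does not reproduce its options or behavior; it is a minimal Python illustration with hypothetical function names. A real test would write far more data and would need to make sure the re-read comes from the storage itself rather than from an operating system cache.

```python
import hashlib
import os
import tempfile


def write_blocks(path, count, block_size):
    """Write random blocks and return the hash of each block as written."""
    hashes = []
    with open(path, "wb") as f:
        for _ in range(count):
            block = os.urandom(block_size)
            f.write(block)
            hashes.append(hashlib.sha256(block).hexdigest())
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to the device
    return hashes


def verify_blocks(path, hashes, block_size):
    """Re-read every block and return the indexes whose hash has changed."""
    mismatches = []
    with open(path, "rb") as f:
        for i, expected in enumerate(hashes):
            if hashlib.sha256(f.read(block_size)).hexdigest() != expected:
                mismatches.append(i)
    return mismatches


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "probe.bin")
    hashes = write_blocks(path, count=8, block_size=1024)
    clean = verify_blocks(path, hashes, 1024)

    # Simulate the storage silently altering one byte inside block 3.
    with open(path, "r+b") as f:
        f.seek(3 * 1024)
        byte = f.read(1)
        f.seek(3 * 1024)
        f.write(bytes([byte[0] ^ 0x01]))

    damaged = verify_blocks(path, hashes, 1024)

print(clean)    # []
print(damaged)  # [3]
```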

Recommended job settings

Picking the right job settings can make your backups more resilient, directly or indirectly.

Backup mode. Each of the backup modes has pros and cons, but it is undeniable that forward incremental mode is the most reliable. By doing a regular full backup you are dividing the chain into several “subchains,” lowering the chances of losing the backup chain completely. Full backups can be created using the synthetic or the active full method. As noted earlier, a corrupted block can be copied into a synthetic full from a previous point, which makes active fulls more trustworthy. If you don’t want to do regular active fulls, you can have a mixed scenario. For example, schedule synthetic fulls weekly and active fulls monthly. If both a synthetic and an active full fall on the same day, Veeam Backup & Replication will not create two full backups; only the active full will be created.
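The benefit of subchains can be made concrete with a small Python sketch. It models a forward incremental chain as a list of "full" and "inc" entries, oldest first, and lists which restore points stop being usable when one file is corrupted. The representation and the function name are illustrative assumptions, and the Health Check repair process is not taken into account.

```python
def lost_points(chain, corrupted_index):
    """Return the indexes of restore points that depend on a corrupted file.

    In a forward incremental chain, every increment depends on all earlier
    files back to the nearest full backup, so the damage spreads forward
    until the next full backup starts a new subchain.
    """
    lost = [corrupted_index]
    for i in range(corrupted_index + 1, len(chain)):
        if chain[i] == "full":
            break
        lost.append(i)
    return lost


single_chain = ["full", "inc", "inc", "inc", "inc", "inc"]
with_periodic_full = ["full", "inc", "inc", "full", "inc", "inc"]

# The same increment (index 1) is corrupted in both chains.
print(lost_points(single_chain, 1))        # [1, 2, 3, 4, 5]
print(lost_points(with_periodic_full, 1))  # [1, 2]
```

In the second chain, the periodic full backup at index 3 limits the damage to the older subchain, and the most recent restore points stay usable.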

Application-aware processing. On Windows machines, AAP uses the VSS framework, which in turn works with volumes and VSS-aware applications. AAP is primarily used for creating transaction-consistent backups. However, since VSS is part of Windows functionality, VSS errors can sometimes help uncover underlying issues with the VM or its applications. But don’t forget to set it to “require success!”

Swap\deleted blocks exclusion. Once again, the primary goal of these features is to decrease the size of a backup file. However, since the exclusion is done by reading the MFT of a volume, its failure may mean that something is wrong with the volume. Exclusions work only for NTFS volumes.

Backup copy job. You can turn backup copy jobs into a makeshift verification tool! A backup copy job allows you to choose the compression method. By default, it is set to automatic, meaning the backup copy job will use the compression mechanism of the source job. That also means that a corrupted block can be copied “as is” and will end up in the secondary chain. By explicitly choosing a compression mechanism that is different from the source, you will force the backup copy job to recompress the blocks, which will cause a decompression error if a block is corrupted. It is not recommended to use extreme or high compression, as they are very CPU-intensive. Try using optimal or dedupe-friendly compression (especially if you are using a deduplication appliance as a secondary target!).
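The difference between copying a block "as is" and recompressing it can be shown with a toy Python example. The two standard library codecs (zlib for the source, lzma for the target) merely stand in for two different compression settings, and the function names are hypothetical; this is not how the backup copy job is implemented.

```python
import lzma
import zlib


def copy_as_is(block):
    """Same compression on both sides: the bytes are copied untouched."""
    return block


def copy_with_recompression(block):
    """Different target compression: the block must be decompressed first."""
    return lzma.compress(zlib.decompress(block))


data = b"MOM" * 1000
healthy = zlib.compress(data)

# Damage the final byte, which belongs to zlib's integrity checksum.
damaged = healthy[:-1] + bytes([healthy[-1] ^ 0x01])

# Copying as is succeeds silently and carries the damage to the second chain.
print(copy_as_is(damaged) == damaged)  # True

# Recompression has to read the block, so the damage surfaces immediately.
try:
    copy_with_recompression(damaged)
except zlib.error as exc:
    print("copy failed:", exc)
```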

Remove deleted items after. If the job runs in any mode other than forward incremental, be careful not to set this option too low. If you set it to one day, for example, and then for some reason the VM is excluded from the job for one day, all the data for this VM will be scrubbed from the backup chain.
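The risk of a low threshold is easy to see in a small Python sketch. The exact rule the product applies is not described here, so the comparison below (purge once the gap reaches the configured number of days) and all names are assumptions made for illustration.

```python
from datetime import date


def vms_to_purge(last_backed_up, today, remove_after_days):
    """Return the VMs whose last successful backup is too old to be kept.

    last_backed_up maps a VM name to the date of its latest restore point.
    """
    return sorted(
        vm
        for vm, last in last_backed_up.items()
        if (today - last).days >= remove_after_days
    )


last_backed_up = {
    "db01": date(2019, 3, 2),   # accidentally excluded from the job since then
    "web01": date(2019, 3, 3),  # backed up today
}
today = date(2019, 3, 3)

print(vms_to_purge(last_backed_up, today, remove_after_days=1))   # ['db01']
print(vms_to_purge(last_backed_up, today, remove_after_days=14))  # []
```

With a one-day threshold, a single missed run is enough to remove everything stored for db01, while a two-week threshold leaves time to notice and fix the exclusion.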

Guest scripts. Though this kind of functionality can easily be achieved without Veeam Backup & Replication using native OS tools like Windows Task Scheduler, Veeam Backup & Replication technically allows you to execute user-written scripts on the guest OS as part of job execution. For example, you can write a script to run Check Disk (chkdsk) and generate a report.
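A guest script of that kind could look roughly like the Python sketch below. It assumes a Windows guest with Python installed and an elevated context for chkdsk; the function names and report format are hypothetical, and how the script is attached to a job and how its result is interpreted are outside the scope of the sketch. The exit code meanings follow Microsoft's documentation for chkdsk.

```python
import subprocess

# Exit codes documented by Microsoft for chkdsk.
CHKDSK_EXIT_CODES = {
    0: "no errors found",
    1: "errors found and fixed",
    2: "cleanup performed, or skipped because /f was not specified",
    3: "could not check the disk, or errors were left unfixed",
}


def describe_exit_code(code):
    """Translate a chkdsk exit code into a short human-readable message."""
    return CHKDSK_EXIT_CODES.get(code, f"unexpected exit code {code}")


def build_report(volume, code, output):
    """Format a one-volume report suitable for writing to a log file."""
    status = "OK" if code == 0 else "ATTENTION"
    summary = describe_exit_code(code)
    return f"[{status}] chkdsk on {volume} - {summary}\n{output.strip()}\n"


def run_chkdsk(volume):
    """Run a read-only check (no /f switch) and return (exit code, output)."""
    result = subprocess.run(["chkdsk", volume], capture_output=True, text=True)
    return result.returncode, result.stdout


# Example usage on a Windows guest:
#   code, output = run_chkdsk("C:")
#   print(build_report("C:", code, output))
```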

What about the Agents?

Though recently released, Veeam Agent for Microsoft Windows and Veeam Agent for Linux have already gained popularity and appreciation from Veeam clients. It would be wrong not to mention these products as well. Since Veeam Agent for Microsoft Windows and Veeam Agent for Linux are designed to back up physical machines, they operate directly at the OS level, so some of the issues described in this white paper are not applicable to them. However, two of the most common issues, decompression errors and metadata corruption, can happen with Agent backup files as well. So, what does Veeam have to offer as safety measures?

Veeam Backup & Replication 9.5 Update 3 brought a much-awaited feature — agent management from the Veeam Backup & Replication console. That also means that integrated agents can benefit from some of Veeam Backup & Replication’s existing functionality. Agent management is available in two forms — as a “backup policy” and a “backup job.” “Backup policy” is just a configuration that will be sent over to the locally installed agent. Though you can select Veeam Backup & Replication repository as a target, this gives only a limited integration. To get full integration, you need to create a “backup job” by choosing the “Managed by backup server” option.

This will allow you to enable periodic Health Checks under Storage — Advanced — Maintenance. Agent backups can also be used as a source for backup copy or backup to tape job, enabling the 3-2-1 Rule. Unfortunately, because agents are working with physical machines, our most effective tool, SureBackup, is not available for agent backups (at least for the moment).

Veeam Backup & Replication console integration is available only for paid editions of the Veeam Agents.

Conclusion

Backup corruption is an ever-present risk. It is the duty of every diligent backup administrator to minimize this risk. Luckily, Veeam Backup & Replication has all the tools to make a permanent data loss a very unlikely event. We hope that this white paper gave you a better understanding of where corruption is coming from and ideas of what can be improved in your environment.

One of the most effective ways to avoid corruption with Veeam is to use SureBackup, as mentioned earlier. This can catch issues with the surrounding environment (such as storage) as well as potential system issues with how Veeam backups are written.

About Veeam Software

Veeam® recognizes the new challenges companies across the globe face in enabling Intelligent Data Management for the Hyper-Available Enterprise™, where business must operate 24.7.365. To address this, Veeam has pioneered a new market of Availability for the Enterprise™ by helping organizations meet recovery time and point objectives (RTPO™) of < 15 minutes for all applications and data, through a fundamentally new kind of solution that delivers high-speed recovery, data loss avoidance, verified protection, leveraged data and complete visibility. Veeam Availability Suite™, which includes Veeam Backup & Replication™, leverages virtualization, storage, and cloud technologies that enable the modern data center to help organizations save time, mitigate risks, and dramatically reduce capital and operational costs.

Founded in 2006, Veeam currently has more than 294,000 customers worldwide, and adds an average of 4,000 new customers each month. Veeam’s global headquarters are located in Baar, Switzerland, and the company has offices throughout the world. To learn more, visit http://www.veeam.com.
