WHITE PAPER

Active Directory Forest Disaster Recovery: What You Don’t Know Will Hurt You

By Gary L. Olsen

Sponsored by

Table of Contents

AD Disasters
AD Forest Failure: Causes and Case Studies
The Danger of the Domain Admin Account
Improper Authoritative Restore
Reverse Time Change
Schema Change
Forest Functional Mode reversal
USN Rollback
RID Pool Depletion
Failure of all DCs in forest root
The Case for Backups
Recovery
Recovery Manager for Forest Edition
About the Author


There are many forms of AD disasters that will cause a significant impact on your business. Because AD is the core authentication method for many enterprises, mission-critical (and standard) applications, access to files, and many other functions that businesses rely on are all impacted by AD failure. A speedy and efficient recovery from AD failure is critical to minimizing business impact. Today, with companies reducing IT staff to save cost, IT staff members must wear many hats. Only large enterprises can afford a full Active Directory administration staff, which makes it difficult not only to monitor AD but to repair problems. In the case of AD disasters, it is critical to get everything back on line as soon as possible. A speedy recovery of mistakenly deleted objects, for example, is not only critical to keeping business functions such as email, file access, authentication and critical applications on line for all users; it is good job security for the IT staff. As an IT professional, you don’t want to be reading a step-by-step recovery procedure while the CIO watches over your shoulder, waiting for his account to be restored so he can send out some critical email.

AD DISASTERS

While AD is amazingly self-healing, there are AD failures that require recovery. They are well documented not only by Microsoft, but by many authors in books, whitepapers, and internet articles. These failures can take the form of:

• Domain Controller failure
• Accidental bulk deletion of objects (e.g. deleting an OU with 10,000 users)
• Domain failure
• Forest failure

Of the failure types in this list, only a complete forest failure should make you lose sleep.

A domain controller failure in a properly configured domain, with a minimum of two and preferably three DCs, is not a big deal. The other DCs pick up the load, and while you may have a performance hit, it won’t be a disaster. Repair or replace the server, re-promote or restore from backup, case closed.

Recovery from accidental bulk deletion of objects, more difficult in a multi-domain forest, has been mitigated with NTDSutil features in Windows 2003 and the default condition where deletion of an OU object is prohibited (Figure 1). While recovery of objects with native Windows tools is well documented, and can be done, it is not for the faint of heart. Remember: you get what you pay for. One of the most time-tested and efficient tools for AD recovery is Recovery Manager for Active Directory (RMAD) from Dell Software. RMAD takes all the procedures and best practices and presents them in an easy-to-use, menu-driven application. No worrying about restoring backlinks, users before groups, etc. Remember that recovery of AD objects is time sensitive for your company.

Figure 1

Any discussion about AD disaster recovery usually describes domain failure as the next level of DR. In a single domain forest, this equates to a forest recovery and requires all DCs to be restored to a known good state. In a multiple domain forest, recovering a single domain in the forest is more complex for reasons such as:

• Global Catalogs need to be rebuilt
• Global Catalogs in other domains may hold orphaned objects of the broken domain
• Lingering objects could occur in the restore of the domain
• External trusts must be re-established

Even so, domain failures can be recovered. For single domain forests, the domain recovery is the same as the forest recovery.

Recovery of the entire forest is a challenge in a multiple domain forest and is the focus of this whitepaper. While there are many articles written about AD forest recovery, only a few even touch on causes, and then they only mention some abstract idea, without example. It is important to understand specific causes of AD forest failure in order to properly prepare for it.


AD FOREST FAILURE: CAUSES AND CASE STUDIES

While a complete forest failure is somewhat rare, it can happen. The danger often comes from apathetic IT staff who fail to recognize the causes of such a failure and fail to prepare proactively for such a disaster. Even if a forest failure occurs only once in the history of the company, failure to recover could mean the business failure of that company. It should be obvious that preparation for such a disaster is worthwhile.

For a complete forest failure to happen, the entire AD database would have to be corrupt and irrecoverable, or all DCs in all domains would have to fail at the same time with no valid backup of any of them. (In a single domain forest, the domain is the forest.) At first blush, it would seem that these conditions would be rare if not impossible. Consider the following conditions that could exist, requiring a forest recovery. Note that I’ve included actual experience and case studies to reinforce these cases.

The danger of the domain admin account

Few IT professionals who administer Active Directory understand the power of the domain admin account. This account cannot be trumped or limited. Its rights can’t be diminished or restricted. Microsoft has long taught that the only security for the domain admin account is to “trust” the administrator. I’ve worked many issues with IT managers who want to protect the AD environment from domain admins they are not sure they can trust. Tongue in cheek, I suggested polygraph tests. In one case, a domain admin was disabling auditing, performing some destructive operations, and then re-enabling auditing. All we could see was that the auditing change was made and when, but not by whom.

In a more serious case, I was reviewing the AD design for a global company that revealed it had no idea how many users had domain admin privileges. This is a disaster waiting to happen. In looking for causes of forest failure, remember that a rogue, or at least untrained, domain admin could be the root cause of such a failure. This will be noted in the examples shown in this whitepaper.

It is best practice to limit the number of users with domain admin privileges, but even so a disgruntled or inexperienced employee with domain admin privileges can cause catastrophic failure.

Improper Authoritative Restore

One of the “tools” Microsoft provides is the Authoritative Restore, available in the NTDSUtil command line tool. Authoritative Restore, typically used to recover from bulk deletion of objects, rolls AD back in time for a single object, a subtree of objects or the entire AD database. In one case, I worked with a university that had a single forest but about 80 domains. Each college in the university had its own domain so as to be autonomous. This worked well until one day a domain admin decided to do an authoritative restore. Probably due to either ignorance or mistakenly choosing a wrong option, this admin recovered the entire configuration naming context from a two-week-old backup, in addition to the domain objects he was trying to recover. Because the university’s 3rd level IT staff had removed and added several DCs after that backup was made, the restore had the disastrous effect of reanimating deleted DCs and losing existing DCs. Since this information is held in the configuration naming context, it affected many of their domains. This did not result in a complete forest failure, but it caused a widespread outage across many domains. Fortunately we had the forest structure intact, but it took many days to repair the damage. Of course this happened at the beginning of the semester, so the impact of the outage was greatly magnified.

The results of this outage were: 1) the administrator responsible was fired and 2) the IT manager wanted to know why a domain admin had the ability to restore the configuration naming context. This is an example of the power of the domain admin account and demonstrates that misuse of the power of this account can indeed cause a failure that requires forest recovery procedures. Note that to use the powerful NTDSutil tool to perform an authoritative restore, delete server objects and take other destructive actions, the DC must be booted into Directory Services Restore Mode (DSRM), which has its own local administrative account and password. If a domain admin doesn’t know that password, he or she can simply log in with their domain admin account, use NTDSutil to change the DSRM password, reboot to DSRM and log in. Again, the domain admin has all power.

Figure 2

Reverse Time Change

Active Directory uses the Kerberos authentication protocol, which relies heavily on accurate time between all domain clients and servers, with the domain controllers being time servers. Figure 2 shows the time services hierarchy in a multiple domain forest. Each domain has a PDC Emulator (PDC) that serves as authoritative time server for all DCs in the domain. When clients join the domain, the DC they authenticate to becomes their authoritative time server. In the multiple domain configuration, the child domain PDCs use the root domain’s PDC as their authoritative time server. As a best practice, the root domain PDC synchronizes to a reliable external time source. Kerberos requires each client computer in the domain (and forest) to be in sync with DCs and servers, within 5 minutes by default in Windows domains. This is configurable, but changing it is not recommended. Thus, if a client computer tries to log on to a domain and the client’s system time is skewed more than 5 minutes from the authenticating DC’s system time, logon will be denied. This applies to applications, file access, and all server resources using authentication for access. The “system” time is always measured in UTC and is not affected by time zones. Time skew failures can cause:

• Computer account password failures
• Trust account password failures
• AD Replication failures
• Failure of any application requiring authentication using user or computer account passwords

In one instance a company was experiencing a widespread authentication failure of all clients in all domains. The root cause was found in a system log event shown in Figure 3. The external time source had mistakenly set the time for the root domain PDC back one year. This caused massive errors in the system logs, the netlogon log and the Directory Services logs. These included Kerberos service ticket errors notifying of expired passwords, target principal name errors, etc. The rogue time server, about an hour later, reset the time a year forward. This cycle repeated several times. We were able to restore service after many hours of troubleshooting, but then the time reset cycle was repeated and we had to do it over again before finally discovering the cause, which we solved by changing to a reliable external time server. We also discovered a way to prevent this situation, described in KB 884776 http://support.microsoft.com/kb/884776. Setting two registry keys (a maximum and minimum value) prevents a time change greater than a defined amount. We set these keys so that the time could not be changed more than 15 minutes either way. Note that the pre-2008 default was 0xFFFFFFFF, which would accept any change. Windows 2008 reduced this to a 48 hour default. Good protection.

Figure 3

While we recovered from this instance, what if the time source had not corrected itself? There would have been widespread lingering and orphaned objects because garbage collection would have purged the tombstoned objects prematurely. Trying to recover those objects is very difficult and time consuming and can cause replication failures, authentication failures, etc., depending on the objects involved. There have been instances reported where the time could not be rolled back (or forward, depending on the circumstance), which required a forest recovery. This could be a true disaster.

Schema Change

Most discussions around forest corruption mention schema changes. While the ability to modify classes and attributes in the schema is limited to the Schema Admins group (in the forest root domain), accounts can be added to that group. I discussed the possibility of schema corruption with one company, and they were not worried because they had no developers or others who make schema modifications. I then asked about popular AD-aware applications: not only Exchange but 3rd party applications and troubleshooting tools. We made a long list of these applications, and they were surprised at how many applications had made changes to the AD schema. This is a great feature, allowing AD extensions to leverage AD authentication and tightly bind the application to AD, but the downside is that they modify the schema in an unknown fashion. And what about uninstalling those applications and tools? The classes and attributes they created cannot be purged (though they can be defuncted). Since all DCs hold a copy of the schema, the only way to “fix” schema problems is to roll back the entire forest to a known good state before the schema problems occurred. Schema issues that can require a forest recovery include:

• An application with a schema extension that corrupts the schema
• An application or programmatic change that modifies a schema component that other applications or system functions use and stops them from operating properly
• Attempts to “clean up” schema extensions (there is no purge capability)

It is a best practice to test any application schema extension or programmatic schema change in a test lab that mimics the production environment. Thorough testing of these changes will prevent surprises when they are introduced into production. Don’t rely on vendor reports.

Forest Functional Mode reversal

Beginning with Windows Server 2003, the Forest Functional Mode (FFM) was introduced, adding to the Domain Functional Mode (DFM) from Windows 2000 Server. In order to take advantage of new features that apply to the configuration naming context (such as the replication performance improvements in Windows 2003), the forest had to be at Windows 2003 FFM. In order to switch to FFM 2003, all domains must be at Windows 2003 DFM. In order for a domain to be at Windows 2003 DFM, all DCs in the domain must be Windows Server 2003. This is referred to as Native Mode. While a domain or forest can be in “mixed mode” and contain down-level versions of Windows Server, not all the functionality of the newer version of Windows will be available. This is true for Windows Server 2008 and 2008R2 as well. (Note: this applies to domain controllers only, not member servers or clients.) Raising the functional mode is one-way. Once the DFM in all domains and the FFM in the forest are switched to Windows 2008R2 mode, for example, the change cannot be reversed.

Occasionally there will be reasons, such as a legacy application or perhaps even a licensing or funding issue, that prevent some DCs from being upgraded for a time, so sometimes it isn’t as easy as flipping the switch. If the DFM or FFM is switched to Windows 2008R2 and it is discovered later that one or more DCs have to be at Windows 2008, the only way to reverse this and put the DFM and FFM back to Windows 2008 mode is to restore the forest. Granted, this would be the result of poor planning, and the need would have to be highly critical to justify a restore, but it could happen. Remember the discussion on the power of the domain admin? This change could be performed by a rogue or uninformed domain admin.

USN Rollback

The USN rollback is essentially an unintended authoritative restore. It is caused by a DC being restored from a backup that was not created with an AD-aware backup utility, so the invocation ID is not reset. Microsoft KB875495 http://support.microsoft.com/875495 describes this. It can also happen if a virtual DC is restored by backing up and restoring the virtual machine file rather than backing it up inside the virtual machine instance itself. The result is that the DC loses touch with the other DCs for a time and does not replicate with its partners. This DC reports to its partners that it is up to date (though it’s not) and the partners don’t replicate updates to it. The only repair is to manually demote the DC, clean up the AD and promote it again.

Granted, a forest failure would have to involve all DCs in a domain or forest, but if there is a domain that has all virtual DCs and they are improperly backed up and restored one by one for various reasons, the problem could spread over time until no DCs are replicating, requiring a forest recovery.

Failure of all DCs in forest root

In the case of the multi-domain forest, the forest root holds the forest together. There are certain components that only exist in the forest root domain, including but not limited to:

• GC and Domain DNS resource records in the _MSDCS DNS domain in the forest root domain
• Trust objects for trust relationships with the other domains
• Security groups such as Schema Administrators
• Root level objects


Obviously, if all DCs in the forest root become inoperable with no live DCs available, the entire forest would require recovery. While this would seem to be unlikely, I actually assisted a company in recovering from just such an incident. Of course this does require some failure of best practices to occur, but it can occur. In this case the IT staff had a root domain we’ll call Corp.net, with two child domains, America.corp.net and EMEA.corp.net. Each had a delegated zone, and all user accounts, member server accounts, etc. were in the child domains. The root Corp.net domain was a “placeholder” and held administrative accounts only.

While the child domains each had two domain controllers, the root domain had only one. But, they reasoned, they had a RAID 5 configuration across 4 drives so they were safe. Things were fine until one catastrophic day when two drives in the RAID set failed, and when they went to get the backup tape it was 11 months old. In spite of the complete unavailability of the root domain, there was no impact to the users. Of course it had to be fixed. I contacted a number of colleagues, but no one had experienced this and none could offer a solution other than a forest recovery. Because users were not affected, the staff spent over a week building a lab to reconstruct what the production environment had, then performed the following steps:

1. Build the lab using current backups on a private network
2. Set the system time on the root DC to the date of the backup
3. Restore the 11 month old backup to the root DC
4. Set the Tombstone Lifetime to 365 days to prevent lingering objects
5. Set the Strict Replication Consistency key to “1” to prevent lingering objects. This can be done with the native Repadmin tool using:
   Repadmin /regkey DC_List +strict
   Note that best practice dictates this key be set to strict in normal production conditions.
6. Uncheck the GC option on Corp-DC1 so it is no longer a GC (it is re-enabled in step 10, after recovery is complete)
7. Health check the DCs to make sure there are no errors (verify trusts, check event logs, etc.)
8. Let replication take place
9. Test cross domain authentication
10. Re-enable the DC as a GC
11. Promote at least one additional DC in the root domain
12. Repeat the process in production

In my opinion, this company was very lucky, but the incident demonstrates that it is not all that hard for a forest recovery to be required. There are a number of things that could have made it impossible to recover the way we did. Note that this took about three weeks to complete. It could have been prevented with a second DC or a current backup.

Figure 4

The Case for Backups

From the case studies described previously, especially the forest root failure, it should be obvious that backups are critical. Making daily backups is a fundamental operation of any data center of any size. Many organizations even require users to perform daily backups. Yet in the 12+ years I’ve worked in Active Directory support, I’ve been amazed at how many admins who call don’t have a valid backup. Of course those that do have backups usually don’t need to make a support call.

Backing up domain controllers requires special considerations that normal server backups don’t require. Failure to perform these backups properly will negate even the most diligent recovery efforts. Here are the criteria for making successful domain controller (AD) backups:

• Use backup software that is “AD Aware”. That is, the invocation ID must be reset on the restored DC. Without this feature, a USN rollback will occur.


• It is only necessary to back up and restore system state.
• Backups are only good for the Tombstone Lifetime (TSL). If the TSL is 180 days, backups are only good for 180 days. Restoring backups older than the TSL will inject lingering objects into Active Directory.
• Validate backups: occasionally test restoration in a lab. Make sure the backups work!
• Provide for offsite storage of backups. Natural and man-made disasters such as 9/11, Hurricane Katrina and others should be enough evidence that offsite storage (perhaps out of city or out of state) is in order.

Figure 5

Recovery

Microsoft authored a forest recovery whitepaper in 2006 that was updated in 2011. http://technet.microsoft.com/en-us/library/planning-active-directory-forest-recovery(v=ws.10).aspx

Like any other whitepaper from Microsoft, it uses native tools and describes the general idea required to restore an entire forest. While it does represent excellent work on behalf of Microsoft, it is very complex and probably about 20 pages long in its entirety. It covers planning phases, how to determine whether you need a forest recovery and how to recover the forest. While this whitepaper has outlined many causes of AD failure that would require forest recovery, the Microsoft document is an excellent read for additional background. Certainly the decision of whether to perform a forest recovery is critical to the process, and Microsoft is lending us their experience in this area.

Figure 6

Microsoft describes forest recovery as containing several key steps. Of course this assumes that current, valid backups are available.

• Physically isolate all DCs from the network
• Use backup to restore a DC in the forest root domain. This DC should not have been a GC before the failure, should be a writeable DC restored from the most recent backup that does not contain the problem requiring the forest recovery, and ideally should be the PDC, since it will have the latest password changes and serves as the authoritative time server. If it is also a DNS server and holds other FSMO roles, that will save time later.
• Make this DC a GC
• Use DCpromo to create at least one other DC in the root domain
• Repeat this process for each of the child domains
• Enable GCs in all domains
• Perform post recovery steps, including recovering or rebuilding any user accounts that were created after the backup was taken. Computer accounts, password resets, etc. could also be required.
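
The recovery steps above assume a valid backup, and the backup criteria tie validity to the Tombstone Lifetime. The following is a minimal Python sketch of that age check, not part of any Microsoft or Dell tool. The function name, the 180-day default (taken from the example in the criteria) and the sample dates are illustrative assumptions.

```python
from datetime import date


def backup_is_restorable(backup_date, restore_date, tombstone_lifetime_days=180):
    """Return True if the backup is no older than the Tombstone Lifetime (TSL).

    Hypothetical helper for illustration only. A backup dated after the
    restore date is treated as invalid.
    """
    age_days = (restore_date - backup_date).days
    return 0 <= age_days <= tombstone_lifetime_days


# Assumed dates approximating the 11 month old backup in the forest root case study.
backup_taken = date(2012, 1, 1)
restore_day = date(2012, 12, 1)  # 335 days after the backup

print(backup_is_restorable(backup_taken, restore_day))       # False with a 180-day TSL
print(backup_is_restorable(backup_taken, restore_day, 365))  # True with a 365-day TSL
```

With a 180-day TSL the 11 month old backup fails the check. That is consistent with the case study, where the Tombstone Lifetime was set to 365 days as part of restoring that backup.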


The Microsoft procedure sounds pretty straightforward. However, using DCpromo to restore each DC, then performing a full sync of all GCs in the forest, can be problematic. If the AD sites are in a small geographic region and the AD database is small, this could be a valid procedure, and it certainly allows a clean build. However, if the AD database is 5-8 GB, as is the case for many environments, or if the sites are scattered across the globe, or if there are slow network links, using DCpromo across the network could take a long time.

In addition, this is a complex process with a lot of steps. And again, do you want the CIO watching over your shoulder while you and your team attempt to restore this? What is needed is a tool to do this, and Dell Software offers it in the form of the Forest Recovery module for their Recovery Manager for Active Directory (RMAD).

Recovery Manager for Active Directory: Forest Recovery Module

The Forest Recovery module for AD is an add-on to the RMAD tool, which has been around for some time and has a large installed base, so chances are you have seen it, heard about it or even own it. Dell Software designed this tool using Microsoft’s forest recovery whitepaper as a guide and basically wrapped all the steps, configurations and best practices into a menu-driven tool that allows a very neat, clean forest recovery and ensures that the proper steps are followed. Dell has used the Forest Recovery Module very successfully in critical situations for their customers to perform actual recovery of AD forests due to causes such as those described previously, and it works well.

Reviewing the tool, the first thing that jumped out at me was the backup management capability. One of the challenges with AD disaster recovery, as has been described previously, is to create timely, valid backups. A recovery on any level is only as good as the backups to recover it from. Backups not only have to be taken regularly and frequently, but they must be local to the DC that is to be recovered. Imagine a tool that will permit configuration of backups on all DCs in all locations, then use those backups to restore the DCs locally when needed, for forest recovery or otherwise. RMAD does the backup management and configuration, and the Forest Recovery module uses those backups to restore the DCs.

RMAD permits full configuration and control of backups. Computer collections are created based on management criteria such as backups, and backup agents are deployed from the console. The properties for each collection permit:

• System State configuration
• Scheduling (including purging of old backups)
• Password encryption
• Storage on a DC (Figure 4), console or network share (unlike the native Install From Media)
• Sending email and other notification formats in case of backup failure

The properties for each DC in the collections show the backup history of the DC.

Running a test in the lab, Computer Collections were made for the root domain (Acme.lab) and for two child domains (Finance and Sales). With valid backups created, use the Forest Recovery Console and:

1. Use “Backup Criteria” to define the backup to be used (Figure 5)
   a. Optionally configure alerts to send notification on progress
   b. Optionally configure pauses to create delays at specified points
2. Select Verify to verify backup configuration
3. Select Recovery

Figure 6 shows a warning displayed at the beginning of the recovery process. This is a nice feature to remind the admin of critical steps that should be taken before the recovery is begun. Figure 7 shows a snapshot of the recovery in progress. I was impressed with the detail of the console. Note that each DC in each domain is displayed, showing the site, FSMO roles, backup used and age of the backup. In the lower portion, individual progress can be shown for each selected DC. Note in Figure 7 that the DSRM administrator password is reset (configurable).

Figure 7

Figure 8 shows the final stages of the recovery. The progress display is helpful for knowing what is actually being done. It’s easy to see that the tool is taking care of all the recovery details required, eliminating not only the time required by the admin but human error as well.

Figure 8

All of the actions noted here can be done from a single console or multiple consoles. It might be desirable to install a console in multiple datacenters (such as a data center in each region). These consoles can be tied together so that if one is not available, all DCs (backups and recovery) can be managed from any of the other consoles, making the consoles themselves redundant.

Overall, Active Directory forests can fail for a variety of reasons, including the ones described in this paper. When a full forest recovery is required, it is critical to the business of the company to restore AD as soon as possible. While there are manual processes expertly defined by Microsoft to perform a successful recovery, the recovery can be a long, time consuming process and is subject to human error. In addition, having properly validated backups taken regularly is crucial to such a recovery. You can’t recover what you don’t have!

The Recovery Manager for Active Directory and the Forest Recovery Console combine to present a powerful tool to aid the administrator in a forest recovery, helping to make sure that backups are valid and that best practices are followed for a successful recovery.

ABOUT THE AUTHOR

GARY OLSEN holds a BS in industrial education and a master’s degree in computer-aided manufacturing from Brigham Young University. He has worked in the IT industry for more than 20 years, and has served on Microsoft’s Beta Technical Support Teams for both Windows Server 2003 and Windows 2000 Server; as an HP/Compaq consultant on Active Directory design; and as a provider of advanced technical support through HP’s Customer Support Center. A frequent speaker at Windows and HP technical conferences, he is author of Windows 2000: Active Directory Design and Deployment.

