I’ll borrow an introduction from a blog post I wrote a few days ago titled NFS and Name Resolution because it pretty much applies to this blog post as well:
Sometimes I take things for granted. For instance, the health and integrity of the lab environment. Although it is “lab”, I do run some workloads which are key to keep online on a regular basis. Primarily the web server which this blog is served from, the email server which is where I do a lot of collaboration, and the Active Directory Domain Controllers/DNS Servers which provide the authentication mechanisms, mailbox access, external host name resolution to fetch resources on the internet, and internal host name resolution.
The workloads and infrastructure in my lab are 100% virtualized. The only “physical” items I have are type 1 hypervisor hosts, storage, and network. By this point I’ll assume most are familiar with the benefits of consolidation. The downside is that when the wheels come off in a highly consolidated environment, the impacts can be severe as they fan out and tip over downstream dependencies like dominoes.
Due to my focus on VMware virtualization, the Microsoft Active Directory Domain Controllers hadn’t been getting the care and feeding they needed. Quite honestly, there have been several “lights out” situations in the lab due to one reason or another. The lab infrastructure VMs and their underlying operating systems have taken quite a beating but continued running. Occasionally a Windows VM would detect a need for a CHKDSK. Similarly, Linux VMs wanted an FSCK. But they would faithfully return to a login prompt.
A week ago today, the DCs succumbed to the long term abuse. Symptoms were immediately apparent in that I could not connect to the Exchange 2010 server to access my email and calendar. In addition, I had lost access to the network drives on the file server. Given the symptoms, I knew the issue was Active Directory related; however, I quickly found out the typical short term remedies weren’t working. I looked at the Event Logs for both DCs. Both were a disaster and, looking at the history, they had been ill for quite a long time. I was going to have to really dig in to resolve this problem.
I spent several of the following evenings trying to resolve the problem. As each day passed, anxiety was building because I was without email, which is where I do a lot of my work. I had cleaned up AD metadata on both DCs, I had removed DCs to narrow the problem down, and I had examined DNS, checking the integrity of the AD-integrated SRV records. I had restored the DCs to an isolated network from prior backups to no avail. Although AD was performing some base authentication, there were a handful of symptoms remaining which indicated AD was still not happy. A few of the big ones were:
- Exchange Services would either not start or would hang on starting
- SYSVOL and NETLOGON shares were not online on the DCs
- NETDIAG and DCDIAG tests on the DCs both had major failures, primarily inability to locate any DCs, Global Catalog Servers, time servers, or domain names
All of these problems ultimately tied to an error in the File Replication Service log on the DCs:
Event Type: Warning
Event Source: NtFrs
Event Category: None
Event ID: 13566
Date: 6/10/2010
Time: 9:15:56 PM
User: N/A
Computer: OBIWAN
Description:
File Replication Service is scanning the data in the system volume. Computer OBIWAN cannot become a domain controller until this process is complete. The system volume will then be shared as SYSVOL.
To check for the SYSVOL share, at the command prompt, type:
net share
When File Replication Service completes the scanning process, the SYSVOL share will appear.
The initialization of the system volume can take some time. The time is dependent on the amount of data in the system volume.
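The event text suggests checking for the share with net share. As a rough sketch of automating that check, the Python below scans captured net share output for the SYSVOL and NETLOGON shares. The function name and the sample text are my own illustration, not output from the lab, and the parsing simply assumes the share name is the first whitespace-separated token on each line.

```python
def shares_online(net_share_output, required=("SYSVOL", "NETLOGON")):
    """Report which of the required shares appear in `net share` output.

    Assumption: each share is listed on its own line with the share name
    as the first whitespace-separated token.
    """
    names = set()
    for line in net_share_output.splitlines():
        parts = line.split()
        if parts:
            # Share names are case-insensitive on Windows, so normalize.
            names.add(parts[0].upper())
    return {name: name.upper() in names for name in required}


# Hypothetical sample text for illustration only.
sample = (
    "Share name   Resource                     Remark\n"
    "-----------------------------------------------\n"
    "C$           C:\\                          Default share\n"
    "SYSVOL       C:\\WINDOWS\\SYSVOL\\sysvol     Logon server share\n"
    "The command completed successfully.\n"
)

print(shares_online(sample))
```

On a DC, the text could come from something like subprocess.run(["net", "share"], capture_output=True, text=True).stdout. A DC stuck in the state described above would be expected to report both shares as missing.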
I had waited a long period of time for the scan to complete, but it had become apparent that the scan was never going to complete on its own. After quite a bit of searching, I came up with Microsoft KB Article 263532, “How to perform a disaster recovery restoration of Active Directory on a computer with a different hardware configuration”. Specifically, step 3j provided the answer to solving the root cause of the problem. There is a registry value called BurFlags located in HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NtFrs\Parameters\Backup/Restore\Process at Startup\. The value needs to be set to d4 to allow SYSVOL to be shared out.
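For anyone who would rather script the change than edit the registry by hand, here is a minimal sketch in Python that generates the text of a .reg file for the BurFlags value. The helper name is made up for this example, and I am assuming BurFlags is a DWORD, with d4 being the authoritative restore and d2 the non-authoritative one as described in KB 290762. Treat it as an illustration rather than the procedure from the KB article.

```python
# Registry key quoted in the post; "Backup/Restore" really does contain a slash.
NTFRS_KEY = (
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NtFrs"
    r"\Parameters\Backup/Restore\Process at Startup"
)


def burflags_reg_file(value=0xD4):
    """Build .reg file text that sets BurFlags (hypothetical helper).

    Assumption: only 0xD4 (authoritative) and 0xD2 (non-authoritative)
    are meaningful here, so anything else is rejected.
    """
    if value not in (0xD2, 0xD4):
        raise ValueError("BurFlags is expected to be 0xD2 or 0xD4")
    return (
        "Windows Registry Editor Version 5.00\r\n"
        "\r\n"
        f"[{NTFRS_KEY}]\r\n"
        f'"BurFlags"=dword:{value:08x}\r\n'
    )


print(burflags_reg_file())
```

The generated text can be saved to a file and imported with regedit. As far as I understand the KB 290762 procedure, the File Replication Service should be stopped before the value is set and started again afterwards (net stop ntfrs, then net start ntfrs) so the flag is processed at service startup.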
Once this registry value was set, all of the problems I was experiencing went away. Exchange services started and I had access to my email after a four day inbox vacation. I had been through a few instances of AD metadata cleanup but this turned out to be a more complex problem than that. I am thankful for internet search engines because I probably would never have solved this problem without the MS KB Article. I was actually coming close to wiping my current AD and starting over, although I knew that would be pretty painful considering the integration of other components like Exchange, SQL, Certificate Services, DNS, Citrix, etc. that were tied to it.
Jason
I too have had similar issues with regard to file system corruptions on both Windows and Linux guests. Every few days they would require a chkdsk, which ran and succeeded but was darn annoying. It turned out to be storage related, as I suspected, and ended up causing a lot more pain than it initially appeared. The VMware logs also pinpointed the problem pretty well and I seem to recall a KB article which detailed my issue. Glad you got it all back up.
I experienced a similar issue to this on a customer site earlier in the year, whereby a disk problem on a DC had introduced inconsistencies to the FRS data, leading to corruption of the SYSVOL share and its local Jet DB. The BurFlags key did the trick, and is a very useful thing to be aware of when troubleshooting FRS.
A word of caution when using the BurFlags registry key though: select the authoritative server for the FRS replication data and database restore very carefully. I’ve seen people read through articles on TechNet, then rush through the process of marking servers as authoritative/non-authoritative without thinking through the consequences of their actions. The end result can be (and has been for several people I know) a wonderfully working FRS replica set, but with data which is either inconsistent with AD (such as outdated GPOs and login scripts), or which is missing altogether!
Lots more useful information on the process of using BurFlags here:
http://support.microsoft.com/kb/290762