A lovely day at work
I had a wonderful day at work this Wednesday. I rebooted a server running under VMware ESX after installing some updates. After the reboot, the server for some reason decided it wanted to do some time travel and reset its clock to February 14, 2008. Odd as it is, this kind of issue usually resolves itself after a few minutes, when the server synchronizes its clock through the domain hierarchy. However, the server in question was the root domain controller (PDC emulator), the server that sits at the top of the time synchronization scheme in the domain. So, in a matter of minutes, most of the other servers and, as far as I know, virtually all clients in the domain (~450) followed suit and also set their clocks back to February. This is when things started going wrong. The phone started ringing and people were coming into my room asking me what was wrong with the network.
At this point, I was starting to panic a bit. So I logged into VMware VirtualCenter and looked at the clock on the ESX servers. Two of them were accurate, but the third was indeed set to February 14, 2008 and NTP wasn’t running. I tinkered a bit with this and managed to force it to the correct date. Due to a bug in ESX, however, I was unable to enable NTP via the Infrastructure Client, but that’s another story.
Then something even stranger happened. All of a sudden, the top domain controller was convinced the date was September 29, 2009 (!). Before I noticed this, the date had begun replicating all over the network again, and now things really started to break. People were unable to log in, network printing failed, mapped network drives became unavailable, software licenses expired, etc.
And now, the grand finale. As a security measure, we have developed a Windows service that runs on the network and disables user accounts in Active Directory that haven’t been used in 6 months or more. Guess what this service did when it thought the date was September 2009? Well, it disabled all user accounts in Active Directory of course. All 6500 of them! When I got the e-mail from the service, I really started to panic.
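To see why a clock in the future takes out every account at once, here is a minimal sketch of the kind of check such a service might make. This is not our actual service: the function name, the sample accounts, the 182-day approximation of "6 months" and the "real" date of December 10, 2008 are all assumptions made up for the illustration.

```python
from datetime import datetime, timedelta

# Roughly six months; the exact threshold here is an assumption.
INACTIVITY_LIMIT = timedelta(days=182)


def stale_accounts(last_logons, now):
    """Return the sorted names of accounts not used within INACTIVITY_LIMIT of `now`."""
    return sorted(
        name for name, last_logon in last_logons.items()
        if now - last_logon >= INACTIVITY_LIMIT
    )


# Hypothetical accounts and last logon dates.
last_logons = {
    "alice": datetime(2008, 12, 9),
    "bob": datetime(2008, 11, 20),
    "carol": datetime(2008, 3, 1),
}

real_now = datetime(2008, 12, 10)    # assumed real date, for illustration only
skewed_now = datetime(2009, 9, 29)   # the date the domain controller believed in

print(stale_accounts(last_logons, real_now))    # only the genuinely unused account
print(stale_accounts(last_logons, skewed_now))  # every account looks abandoned
```

With the real date, only the account that has actually been idle is flagged. With the skewed date, every last logon is more than six months "old", so everything gets disabled.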
I immediately reset the clock on the domain controller and started working on a way to re-enable all the user accounts that had been disabled. This turned out to be a lot easier than expected, so that particular problem was quickly remedied.
Not everything was fine and dandy, though. Replication between the domain controllers was royally f*!#&d. Because the time had previously been set back to February 2008, the domain controllers believed replication had not taken place in about 10 months. This led to this lovely error:
It has been too long since this machine last replicated with the named source machine. The time between replications with this source has exceeded the tombstone lifetime.
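The arithmetic behind that error can be sketched roughly as follows. This is only an illustration: the function name and the "current" date are hypothetical, and the 60 and 180 day values are the tombstone lifetime defaults as I understand them (which one applies depends on how the forest was created), not values read from our directory.

```python
from datetime import datetime, timedelta


def exceeds_tombstone_lifetime(last_replication, now, tombstone_days):
    """True if the gap since the last replication is longer than the tombstone lifetime."""
    return now - last_replication > timedelta(days=tombstone_days)


last_replication = datetime(2008, 2, 14)  # timestamp recorded while the clock was wrong
now = datetime(2008, 12, 10)              # assumed real date, for illustration only

for tombstone_days in (60, 180):          # assumed default lifetimes
    too_old = exceeds_tombstone_lifetime(last_replication, now, tombstone_days)
    print(tombstone_days, too_old)
```

A gap of about 10 months is far beyond either value, which is why the controllers refused to replicate with each other.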
The proposed solution is to demote the domain controller that has lost connection with the forest, remove inconsistent deleted objects with repadmin and then restart replication. I wish! For some reason, the domain controllers were refusing to talk to each other, so we were unable to demote the controller that was out of whack. We had to forcibly seize the roles from the broken controller, remove it from the domain completely and then remove all traces (metadata) of the old controller on the functioning controller using ntdsutil. After many hours of scratching our heads, and with some help, we were successful. We ended up with only one domain controller, but we can live with that for a few days.
It’s on days like this you really feel that you earn your pay.