Introduction to the “Don’t be THAT guy” blog:
I come in to work every day and get calls with these exact scenarios. Of course, the names have been changed to protect the innocent and to keep the guilty from having to find new jobs. I wanted to start detailing some of these cases to provide cautionary tales for anyone in the position of network or server administrator. So read this, which I hope will be another of many blog entries to come, so that you might avoid being “THAT guy”.
Don’t be THAT guy ….The case of “Assault By Security Template”.
The call came in from a secretive government agency saying that their DNS zone had gone missing. I was immediately reminded of some Scottish gents and their bad luck. (See the last DBTG installment here: http://blogs.technet.com/networking/archive/2008/08/08/don-t-be-that-guy-the-case-of-the-missing-dns-zone.aspx). It seemed like I wasn’t going to be that lucky this time. Since I was talking to a secretive government agency, I knew that there would be NO data coming in for me to analyze, no Live Meeting for me to see, only my ability to ask the right questions to probe their ailing DNS.
I asked them what events they were getting and they informed me that they were receiving the following:
Event Type: Warning
Event Source: DNS
Event Category: None
Event ID: 4013
Date: 7/17/2008
Time: 11:12:17 AM
User: N/A
Computer: SERVER
Description: The DNS server was unable to open the Active Directory. This DNS server is configured to use directory service information and can not operate without access to the directory. The DNS server will wait for the directory to start. If the DNS server is started but the appropriate event has not been logged, then the DNS server is still waiting for the directory to start.

Event Type: Error
Event Source: DNS
Event Category: None
Event ID: 4000
Date: 7/17/2008
Time: 11:12:17 AM
User: N/A
Computer: SERVER
Description: The DNS server was unable to open Active Directory. This DNS server is configured to obtain and use information from the directory for this zone and is unable to load the zone without it. Check that the Active Directory is functioning properly and reload the zone. The event data is the error code.
I suspected that this might be more than just a DNS problem and would therefore involve more than just me in the networking group. I quickly dialed our Directory Services (DS) group and asked them if they could help me make sure that Active Directory (AD) was healthy.
While the DS engineer dug into AD, I asked the customer if they had any other DNS servers that were working and he said yes. We then used another DC/DNS server that still had the zone populated to create an alternate DNS server that people could point to for DNS. This provided the customer with some relief for name resolution while we were still digging for the cause of the first server loss.
While we continued our research using the information we had, the customer started to report new and different problems and errors. They were beginning to have problems logging onto client machines and they were being prompted for credentials when trying to access file shares. I could understand how the client logon issues could be related to a DNS server being down if clients were not yet pointing to the new DNS server for proper name resolution. But when the file shares started to require the users to type in their credentials I knew something else was quite wrong.
At about this time things started to really go south as users that had been logged onto machines started losing their ability to connect to the Exchange server. Even after we set the Exchange and non-working clients to point at the working DNS server the problems persisted. We were no longer looking at a simple DNS server down issue. We were looking at a cascading failure of many systems within the domain. The pressure was mounting for us to find a solution.
As the Directory Services engineer and I were researching all of the clues that we had been provided, we heard one of the customers state, “Hey, look. The KDC service is stopped and disabled.”
At the exact same time the DS engineer and I said, “WHAT!?”
“The KDC service on this Domain Controller (DC) is stopped and set to disabled.”
We both requested that he set that service to Automatic and start it on the DC that had initially had the problem loading the DNS zone.
Once the KDC service was started and the DC was able to get fresh Kerberos tickets, it was able to load the proper DNS zones from AD. A quick inspection of all of the DCs in their environment showed that the KDC service had indeed been disabled on every one of them. This raises the obvious question: “How did this happen?”
Enter: THAT guy. The previous day a junior administrator had been given the task of applying security templates to a set of servers. This template was meant for member servers, and one of its settings set the KDC service startup type to “Disabled”. Unfortunately, THAT guy applied the template to more than just the intended servers; s/he (I guess it could have been THAT gal, too) also applied it to all of the DCs. I never found out the details of what happened so I could forewarn you, dear reader, but such is the nature of secretive government agencies.
Now even though I do not know exactly HOW the templates got applied (there are a couple of different ways: manually or with ADM files), I can at least put the pieces together to determine what happened afterward. Once THAT guy applied the template to each server, it started a 10-hour countdown (the default Kerberos ticket lifetime) during which the existing Kerberos tickets would expire. As these tickets started to expire, the customer experienced the symptoms we witnessed:
- The DC did not have access to AD in order to load the AD-integrated DNS zone because it did not have a valid Kerberos ticket.
- Users could not access shares without being prompted for credentials. This was due to Kerberos failing and then NTLM coming into play.
This also explains why things started to get worse over time. The gradual expiration of tickets across the domain led to an increasing number of issues.
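The gradual failure can be sketched with a little date arithmetic. This is only an illustration: the issue times and the moment the KDC stopped are made-up values, not data from the case, and the sketch assumes the 10-hour lifetime mentioned above, no ticket renewal, and no new tickets once the KDC is down.

```python
from datetime import datetime, timedelta

# Ticket lifetime cited in the post (assumed to apply to every ticket here).
TICKET_LIFETIME = timedelta(hours=10)


def expiry_times(issue_times, lifetime=TICKET_LIFETIME):
    """Return each ticket's expiry time, earliest first."""
    return sorted(t + lifetime for t in issue_times)


def failure_window(issue_times, kdc_disabled_at, lifetime=TICKET_LIFETIME):
    """Return (first, last) expiry among tickets issued before the KDC stopped."""
    still_valid = [t for t in issue_times if t <= kdc_disabled_at]
    expiries = expiry_times(still_valid, lifetime)
    return expiries[0], expiries[-1]


# Hypothetical timeline: three tickets issued during the day,
# KDC service disabled at 3:00 PM.
kdc_disabled_at = datetime(2008, 7, 16, 15, 0)
issued = [
    datetime(2008, 7, 16, 6, 30),
    datetime(2008, 7, 16, 9, 0),
    datetime(2008, 7, 16, 14, 45),
]

first, last = failure_window(issued, kdc_disabled_at)
print("First ticket expires:", first)
print("Last ticket expires: ", last)
```

Under these assumptions the first machine starts failing well before the last one does, and nothing survives more than 10 hours past the moment the KDC went away.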
Now, since you are reading this I am assuming you are at least geeky enough to be familiar with Kerberos authentication. If not, here are some links to give you a quick primer:
Below is a screenshot of the service in its disabled state. I used the HiSecDc template, which does NOT have this service disabled, but I set it to disabled here to show how it might have been done in a custom template.
Now, what could have been done to possibly prevent this from happening? And yes, I can already hear folks blurting out, “Don’t let junior admins loose with security templates!”
But that can’t be the answer, because we all need to learn somewhere and sometime; otherwise there would never be any Senior Admins. No, in this case I have to reiterate my call for the use of a separate test environment. As painful and expensive as many people think those are, just imagine how painful this incident was for the affected users; how do you measure the cost of all the lost productivity from this?
In reality, a beefy server with LOTS of RAM can handle several virtual servers to host test environments for most situations. This situation could at least have been prevented by testing on a virtual DC and a few virtual member servers and clients.
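A cheap sanity check before a template goes anywhere near a DC is to scan it for services it would disable. The sketch below is a rough illustration, not a vetted tool. It assumes the template text has a [Service General Setting] section whose entries look like name, startup mode, security descriptor, and that a startup mode of 4 means Disabled. The function names and the sample template are hypothetical, and the sample is heavily simplified (empty security descriptors).

```python
import csv

DISABLED = "4"  # assumed startup-mode code for "Disabled" in a template


def service_settings(template_text):
    """Yield (service_name, startup_mode) pairs from [Service General Setting]."""
    in_section = False
    for raw in template_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";"):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            in_section = line[1:-1].strip().lower() == "service general setting"
            continue
        if in_section:
            fields = next(csv.reader([line]))  # handles the quoted fields
            if len(fields) >= 2:
                yield fields[0].strip().lower(), fields[1].strip()


def disabled_critical_services(template_text, critical=("kdc",)):
    """List services a DC depends on that this template would disable."""
    return [
        name
        for name, mode in service_settings(template_text)
        if mode == DISABLED and name in critical
    ]


# Hypothetical, simplified template content.
SAMPLE = '''
[Unicode]
Unicode=yes
[Service General Setting]
"kdc",4,""
"spooler",2,""
'''

print(disabled_critical_services(SAMPLE))
```

A non-empty result is a reason to stop and ask whether the template was really meant for the machines you are about to apply it to.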
Another thing I want to suggest isn’t in the prevention category but can help with recovery in situations like this. Get some of the common tools to monitor your environment and get familiar with what NORMAL looks like. If you don’t know what is normal in your environment, how do you expect to spot what is BROKEN right away? I bring this up because of the lack of data that we had to troubleshoot this case. Had we been given a network trace, we would have seen tons of Kerberos errors, and that would have been a huge clue to resolving the issue faster. Had someone on site been able to look at the traces we asked for and see lots of Kerberos errors that were not NORMALLY there, the pain would have stopped far sooner than it did.
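As a minimal sketch of what "knowing normal" can mean in practice, the snippet below compares a current count of Kerberos errors against a baseline of earlier counts. The numbers are invented for illustration, the three-sigma threshold is just one common rule of thumb, and how you collect the counts (trace, event log, monitoring tool) is left open.

```python
from statistics import mean, stdev


def is_abnormal(baseline_counts, current_count, sigmas=3.0):
    """Flag a count that sits far above the historical baseline."""
    mu = mean(baseline_counts)
    # Floor the spread at 1.0 so a very quiet baseline does not flag every blip.
    spread = max(stdev(baseline_counts), 1.0)
    return current_count > mu + sigmas * spread


# Hypothetical hourly Kerberos error counts from a normal week.
baseline = [2, 0, 3, 1, 2, 4, 1]

print(is_abnormal(baseline, 3))    # an ordinary hour
print(is_abnormal(baseline, 250))  # an hour during an outage like this one
```

The point is not the statistics but the habit: without the baseline list there is nothing to compare the bad day against.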
- Test environments are not only cool, they can help you keep a job. Use the argument that it is cheaper to buy a server to run Hyper-V than it is to recover the lost business from your next network outage.
- Get to know what Normal looks like in your environment with tools like NetMon, etc.
- Other tips from my previous blog also apply: If you have any doubt, have someone else look at what you just configured before you hit apply. Also, as painful as change controls can be, they help stop things like this.
So ends another cautionary tale. And remember: Don’t be THAT guy.