Some days, it just doesn't pay to get out of bed. Unfortunately you never know if a given day is one of them until you have someone walk in with a "quick problem". Usually these interruptions are easily deflected with a simple "What is the problem you are having?" Hopefully the answer is "DNS" or "Active Directory", though "SQL Server" or "Hardware Question" work just as well, since there are others who are experts at these areas. But no, this time the answer is "I have a network sniff I need someone to look at". Since I actually do quite a bit of network sniff analysis, I can't really punt this one off so I say "sure send it to me".
A few minutes later, I get an e-mail with a Netmon capture file. The engineer stops back by and explains the problem: "If I use the NLB cluster's virtual IP address, SQL connections are slow." Dang, if he had used the word SQL the first time I could have punted this, but nnNNNOOOoooo, he had to say "network sniff". Well, since I already agreed to look at this capture I can't back out now. As a common delaying ploy (while I try to think of some way to pawn this off on someone else) I restate the problem: "So, when you hit the vip it takes a long time but if you hit the dip it doesn't?" (vip == Virtual IP Address, dip == the specific machines IP address). "Yes, and the capture file is the failure case." The response isn't long enough for me to come up with a good excuse to send them to someone else, so we crack open the capture file (I really need to learn to get a realistic cough on cue).
HHhhhmmmm, the capture file is interesting. We see a TCP/IP connection and some communication then we see a few NETBIOS name lookup calls, followed by some more TCP/IP communication. The interesting point is the fact that the NETBIOS calls all fail "Destination Unreachable" - which also seems to be the cause of some latency. We then hop on the boxes and get a network capture of the working case for comparison sake. As we compare, we note that the NETBIOS calls don't happen in the working case, so things are slightly different.
Looking closer we see that SQL is sending back a SSL_Self_Signed_Fallback cert. We assume (remember that word) that obviously there must be something in the SQL reply that includes the server name since that is the big difference in the client usage. To verify this we map the server name to the cluster virtual IP address in the client machines hosts file and pull the other machines out of the cluster. What do you know, hit the machine name (now cluster name by the hosts file), no delay.
So begins the 8-10 hour investigation into "How do you get SQL to send a more friendly certificate?" There are a number of kb articles and websites that are valuable to this investigation. Example: http://support.microsoft.com/kb/318605 specifically teaches you that the certificate needs have the machines FQDN in order to be used, if the registry value "HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL.x\MSSQLServer\SuperSocketNetLib" is zero length or is missing, however "If the registry value is set, SQL Server tries to use that certificate". Great! All we have to do is get a common cert, get it in the registry entry and off we go, problem solved...
Ok, problem number one, I don't actually know where to get a valid cert from within MSCOM for machines within MSCOM, so we have to go find the security expert, Mr. Cert. We explain our issue and he says no problem, he can whip up a self signed cert for us. Within minutes we have a cert and have put the friendly name in the registry. Restart SQL - SQL Fails to start. Ok, lets try the serial number. Restart SQL - SQL Fails to start. Ok, lets try the thumbprint. Restart SQL - SQL Fails to start. Ok, lets try changing the value from a string to a binary (since the KB says binary value). Failure, Failure, Failure. At this point I have invested somewhere around 6 hours in this investigation... just my hours. I am not counting the hour that I spent explaining this to my lead to ask for advice, the two hours spent by a peer understanding and making suggestions, the 2 hours spend by the Security Lead when his office was closer than Mr. Cert's office. Nor the time spent by Mr. Cert or the SE who originally brought up the issue after investigating for a while. All we know is that no matter what we put in that registry setting, SQL fails to start.
At this point Mr. Cert runs back to his office to spend the next 4 hours trying to get SQL to accept the CERT. We can't even get SQL Configuration manager to see the cert unless we use the machine FQDN. Which leaves me holding the bag, trying to make more sense out of the network trace; since that is about all I have left.
After spending way too many hours on this (spread across multiple days), I hijacked a peer who makes the random comment "maybe the problem isn't the cert but is the fact that it is an IP address without a NETBIOS name"? HHmmm... simple check, change the hosts file to give the name FOOOO to the VIP. Ah crap, it works flawlessly. Verify with a quick check of the network capture, no delay, happy with the cert, everything is fine. AAAAARRRRGGGGHHHHH (people stop me from throwing chairs out the window and jumping the half story to my death). To bad we pulled the other machines out of the cluster before we did the earlier test of mapping the machine name to the IP address (since we assumed that it was the name that was special).
Yes, the fix is a simple one, which a review of the SQL Client code confirms. SQL Client requires IP addresses to have a name associated with them at some point. Doesn't really matter what the name is, they just want to do a reverse DNS and have it succeed - they don't actually seem to do anything with this name. No, this is not frustrating, this is learning!!! ;^)
So, off the SE goes, happy that he has a solution, though I suspect feeling both sheepish for the simplicity of the fix as well as a bit enlightened regarding the aptitude of a certain debug engineer. On the bright side, Mr. Cert did finally beat SQL into submission and can now tell us how to get other certificates to work... oh well.
So, what did we (and hopefully you) really learn?
· Don't assume anything (something I learned before but apparently forgot).
· Make sure you get a full description of the problem before you agree to look at something.
· SQL Server 2005 secures all login attempts, possibly using a fallback self generated certificate.
· SQL Client will attempt to do a reverse lookup on the IP address if an IP address is used.
· SQL Client doesn't seem to care what the IP address maps to, as long as it can successfully complete the reverse lookup.
· The NETBIOS lookup failures can add up to 30 seconds of delay for very little apparent impact (everything works eventually, even if the lookup fails).
· SQL Server requires a string key at the appropriate place (HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL.x\MSSQLServer\SuperSocketNetLib) or SQL Server will fail to start.
o If that key contains anything invalid, SQL Server fails to start.
o Even though you use the SQL Configuration UI to modify the registry key, if the cert doesn't have proper permissions SQL Server fails to start.
o There is a way to get a valid entry in there, and SQL Server won't fail in that one case. Hopefully Mr. Cert will write us a blog that teaches us how to do that.
· Some days it doesn't pay to get out of bed, and this was one of them.