Hi, David here. Today I wanted to talk about something that we see all the time here in Directory Services, but that doesn’t usually get a lot of press. It’s a condition we call port exhaustion, and it’s a problem that will cause TCP and UDP communications with other machines over the network to fail.
Port exhaustion can cause all kinds of problems for your servers. Here’s a list of some symptoms:
– Users won’t be able to connect to file shares on a remote server
– DNS name registration might fail
– Authentication might fail
– Trust operations might fail between domain controllers
– Replication might fail between domain controllers
– MMC consoles won’t work or won’t be able to connect to remote servers.
That’s just a sample of the most common symptoms that we see. But here’s the big one: You reboot the server(s) involved, and the problem goes away – temporarily. A few hours or a few days later, it comes back.
So what is port exhaustion? You might think that it’s where the ports on the computer get tired and just start responding slower over time – but, well, computers aren’t human, and they certainly aren’t supposed to get tired. The truth is much more insidious. What port exhaustion really means is that we don’t have any more ports available for communication.
Now, some administrators out there are going to suspect a memory leak of some kind when this problem happens, and it’s true that memory leaks can cause the same type of issues (I’ll explain why in a moment). But most of the time memory isn’t the issue, and you can end up trying to troubleshoot memory problems that aren’t there.
In order to understand port exhaustion, you need to first understand that everything I listed above requires servers to be able to initiate outbound connections to other servers. It’s the word outbound that’s important. We usually think of network connectivity requirements in inbound terms – our clients need to connect to a server on a specific TCP or UDP port, like port 80 for web browsing or port 445 for file shares (SMB). But we very rarely think about the other side of that, which is that the communication has to have a source port available to use.
As you might know, there are 65,535 ports available for TCP and UDP connections in TCP/IP. The first 1024 of those are reserved for specific services and protocols to use as senders or listeners. For example, DHCP requests always come from UDP port 68 on a client, and the DHCP service (the server component) always listens on UDP port 67. In other words, each side of the conversation has a fixed, reserved port that it uses for that protocol. Beyond that, ports get dynamically assigned to services and applications for either inbound or outbound use as needed. A port can normally only do one thing: we can either use it to listen for connections from other machines on the network, or we can use it to initiate connections to other machines on the network, but we usually can’t do both (some services cheat and use ports bi-directionally, but this is relatively rare).
So 65535–1024 is still 64511 ports. That’s a lot! We should almost never run out, right? You’d think so, but there’s another limitation here that you might not be aware of, and that limitation is that we don’t actually use the full range of ports for any dynamic communications. Dynamic communication is any sort of network communication that doesn’t already have a port specifically reserved for sending or receiving it – in other words, the vast majority of network traffic that a Windows computer generates.
By default in the Windows operating system, we only have a limited number of ports available for outbound communications. We sometimes call these user ports, because user-mode processes are what we really expect to be using these things most often. For example, when you connect to a file server to access a file, you’re connecting to (usually) either port 445 or port 139 on the other side to retrieve that file. However, in order to negotiate the session, you need a client port on your computer to use for this, and so the application making the connection (Windows Explorer, in the case of browsing files) gets a dynamically-assigned port to use.
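If you want to see a dynamically-assigned client port for yourself, here is a minimal Python sketch. It is only an illustration: the function name show_source_port is made up for this example, and it talks to a throwaway listener on the loopback address rather than to a real file server. The point is that the client never picks its own local port; the operating system hands one out when the connection is made.

```python
import socket


def show_source_port():
    """Connect to a throwaway loopback listener and return (source_port, dest_port)."""
    # Listener side: binding to port 0 asks the OS to pick any free port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))
        listener.listen(1)
        dest_port = listener.getsockname()[1]

        # Client side: we never choose a local port here.
        # The OS assigns one from its dynamic range at connect time.
        with socket.create_connection(("127.0.0.1", dest_port), timeout=5) as client:
            source_port = client.getsockname()[1]
            return source_port, dest_port


if __name__ == "__main__":
    src, dst = show_source_port()
    print(f"outbound connection used source port {src} to reach port {dst}")
```

Run it a few times and the source port changes on each run, while every one of those connections still needed a free port on the client side.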
Since we only have a limited number of ports available by default, you can run out of them – and when you run out, you’re no longer able to make new outbound connections from your computer to other computers on the network. This can cause an awful lot of communication to break down – including the communication that’s needed to authenticate users with Kerberos.
In Windows XP/2003 (and earlier) the dynamic port range that we use for this was 1025-5000 by default. So, you had a little less than 4000 ports available for outbound network communication. Ports above that range were generally reserved for application listeners. In Windows Vista and 2008, we changed that range to 49152-65535 to be more in line with IANA recommendations. If you’re curious, you can read the KB article here. The upshot of the changes is that we actually have a larger default dynamic range in Vista and 2008 (16,384 ports), but we also messed up everyone who’s ever configured internal firewalls to block high ports (which, by the way, is something we don’t recommend doing on an internal network). Either way, the end result is that you’ve got roughly four times as many ports available to use by default in Vista and 2008.
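To put numbers on those two defaults, here is a quick bit of arithmetic in Python. The ranges are my assumption of the documented defaults (1025 through 5000 for XP/2003, and 49152 through 65535 for Vista/2008); if someone has changed the range on your servers, your numbers will differ. The helper name range_size is just for illustration.

```python
# Inclusive default dynamic port ranges (assumed documented defaults;
# an administrator may have changed them on any given machine).
DEFAULT_DYNAMIC_RANGES = {
    "Windows XP / Server 2003": (1025, 5000),
    "Windows Vista / Server 2008": (49152, 65535),
}


def range_size(first_port, last_port):
    """Return how many ports are in an inclusive port range."""
    return last_port - first_port + 1


for name, (first, last) in DEFAULT_DYNAMIC_RANGES.items():
    print(f"{name}: {first}-{last} -> {range_size(first, last)} ports")
```

That works out to 3,976 ports on the older systems and 16,384 on the newer ones.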
Even so, it’s still possible to run out of ports. And when this happens, communication starts to break down. We run into this scenario a lot more often than you might think, and it causes the types of issues I detailed above. 99% of the time when someone has this problem, it happens because an application has been grabbing those ports and not releasing them properly. So, over time, it uses up more and more ports from the dynamic range until we run out.
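Here is a small Python sketch of what "grabbing ports and not releasing them" looks like from the application side. This is a toy, not a reproduction of any real product’s bug: the names grab_ports and release are invented for the example, and it only takes 20 ports on the loopback address. Each socket that stays open holds on to its own port until something closes it.

```python
import socket


def grab_ports(count):
    """Bind `count` TCP sockets to OS-assigned ports and return them still open."""
    held = []
    for _ in range(count):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        held.append(s)
    return held


def release(sockets):
    """Close every socket so the OS can hand the ports out again."""
    for s in sockets:
        s.close()


held = grab_ports(20)
ports = {s.getsockname()[1] for s in held}
print(f"holding {len(ports)} distinct ports")

# A well-behaved application does this when it is done.
# A leaky one skips it, and the number of held ports only ever grows.
release(held)
```

Scale that loop up to a few thousand sockets that never get closed and you have used up the whole default range on an older server.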
In most networks there are potentially dozens, if not hundreds, of different applications that might be communicating with other servers over the network: security tools, management and monitoring tools, line of business applications, internal server processes, and so on. So when you have a problem like this, narrowing down which application is causing the problem can be a challenge. Fortunately, there are a couple of tools that make this easier, and the best part is that one of them comes with the operating system.
The first tool is NETSTAT. Netstat queries the network stack and shows you the state of your network connections, including the ports you’re using. Netstat can tell you which ports are in use, where the communication is going, and what application has the port open.
Another cool tool is Port Reporter. Port Reporter does everything that Netstat does, but it logs port activity continuously instead of giving you a point-in-time snapshot the way Netstat does. Netstat is included in Windows, but you can download Port Reporter for free from our website. (All my examples in this blog will use Netstat).
So, if you suspect that you might have a port exhaustion problem, then you’d want to run this command:
netstat -anob > netstat.txt
This runs Netstat and dumps the output to a text file. You’d want to use a text file since trying to look at the output inside a command prompt is a quick way to give yourself a migraine. Once you’ve done this, you can examine the text file output, and you’ll be able to see what processes are using up ports. What you want to look for is entries where the same process is using a lot of ports on the machine. That is the most likely culprit.
Here’s an example of what you get with netstat (I’ve snipped it for length)
Notice that you can see the port you’re using locally, the one you’re talking to remotely, and what the state of the connection is. You can also get the process ID (that’s the o switch in the netstat command), and you can even have netstat try to grab the name of the process (that’s the b switch, as in netstat -anob, which needs an elevated command prompt).
What you’re looking for in the output is a single process that is using up a large number of ports locally. So for example, on my machine above we can see that PID 608 is using several ports. Usually what will happen when you run into port exhaustion is that you will see that one (or two) processes are using 90-95% of the dynamic range. The other piece of information to look at is where they’re talking to remotely, and what the state of the connection is. So, if you see a process that’s using up a lot of ports, talking to a single remote address or several remote addresses, and the state of the connection is something like TIME_WAIT, that’s usually a dead giveaway that this process is having a problem and not releasing those ports properly.
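If the text file is too big to eyeball, you can let a script do the counting. Here is a rough Python sketch that tallies TCP rows per process ID and connection state. A few assumptions to be aware of: the function name count_tcp_ports_by_pid is made up, the sample rows are invented for illustration (they are not real output from my machine), and the parser expects the five-column TCP rows that netstat prints with the o switch. Anything else, including the executable-name lines that the b switch adds, is simply skipped.

```python
from collections import Counter


def count_tcp_ports_by_pid(netstat_text):
    """Return a Counter mapping (pid, state) -> number of TCP rows."""
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # A TCP row has five columns: Proto, Local, Foreign, State, PID.
        # Headers, UDP rows, and executable-name lines do not match and are skipped.
        if len(fields) == 5 and fields[0] == "TCP" and fields[4].isdigit():
            state, pid = fields[3], int(fields[4])
            counts[(pid, state)] += 1
    return counts


# Invented sample rows, for illustration only.
SAMPLE = """\
  Proto  Local Address          Foreign Address        State           PID
  TCP    10.0.0.5:1030          10.0.0.9:445           ESTABLISHED     4
  TCP    10.0.0.5:1041          10.0.0.7:389           TIME_WAIT       608
  TCP    10.0.0.5:1042          10.0.0.7:389           TIME_WAIT       608
  TCP    10.0.0.5:1043          10.0.0.7:389           TIME_WAIT       608
  UDP    0.0.0.0:123            *:*                                    1020
"""

# Busiest (pid, state) pairs first.
for (pid, state), n in count_tcp_ports_by_pid(SAMPLE).most_common():
    print(f"PID {pid:>6}  {state:<12} {n} ports")
```

To use it on a real capture, read netstat.txt into a string and pass that in instead of SAMPLE. The PID at the top of the list is the first one to go look up in Task Manager.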
Once you have this information, you can usually get things working again by turning off the offending process – but that’s only a temporary fix. Odds are, whatever was causing the problem was a legitimate piece of software that you want to have running. Usually when you get to this stage we recommend contacting the vendor of that application, or taking a look at whatever other servers the application might be communicating with, in order to get a permanent fix.
I mentioned above that memory leaks can cause this behavior too – why is that exactly? What happens is that in order to get a port to use for an outbound connection, processes need to acquire a handle to that port. That handle comes out of non-paged pool memory. So, if you have a memory leak, and you run out of non-paged pool, processes that need to talk to other machines on the network won’t be able to get the handle, and therefore won’t be able to get the port they need. So if you’re looking at that Netstat output and you’re just not seeing anything useful, you might still have a memory issue on the server.
At this point you really should be contacting us, since finding and fixing it is going to require some debugging. Cases that get this far are rare, however, and most of the time the Netstat output is going to give you the smoking gun you need to find the offending piece of software.
– David “Fallout 3 Rules” Beach