I recently worked on a case where my customer was getting the following, very common error:

500.13 "Server Too Busy"
Description: An unhandled exception occurred during the execution of the current request. Please review the stack trace for more information about the error and where it originated in the code.

[HttpException (0x80004005): Server Too Busy]
System.Web.HttpRuntime.RejectRequestInternal(HttpWorkerRequest wr) +148

Framework Version: 1.1.4322.2407
As we know, errors with status 500 are associated with crashes in web applications and represent server errors. Causes vary: stack overflow, stack corruption, heap corruption, divide by zero, null pointer dereferences, unhandled exceptions, and more. So troubleshooting is needed.
For the initial troubleshooting, I asked for the following information and checked the following items:
1) What is the application about? It is a portal for user management and service requests, serving all of the company's national branches.
2) What recent changes have been made to the application? None. This answer is very important because it rules out application changes as the cause.
3) What recent changes have been made to the hardware? None.
4) Has the number of users increased? No, which rules out a capacity problem.
5) Has the thread pool tuning from KB 821268 been applied, this being a .NET 1.1 application? Yes.
6) Is debug="false" set in every web.config file? Yes.
7) What is the kernel queue limit on the application pool? 1000. This is something that could be raised on most production applications, but I have never seen it cause a system outage.
8) Since when has the application been behaving like this? Five days. From this we could infer that something in the environment changed around that time, even though the application, hardware, and load had not.
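For reference, the debug flag from item 6 lives in the compilation element of web.config. A minimal sketch of the production setting (element and attribute names only; the rest of the file is omitted):

```xml
<configuration>
  <system.web>
    <!-- debug="false" avoids debug compilation and its extra
         memory and performance overhead in production -->
    <compilation debug="false" />
  </system.web>
</configuration>
```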
Workarounds to temporarily stabilize the server:
1) In the meantime, while the customer set up a tool to capture a dump, setting a recycling rule based on the number of requests (a threshold must be defined) gave the server some reliability.
2) Raise the kernel queue limit to 4500 or 5000; this is recommended for very busy sites.
3) Raise the httpRuntime appRequestQueueLimit in the web.config file from the default of 100 to 5000.
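The third workaround is a one-line change in web.config (the attribute is appRequestQueueLimit on the httpRuntime element; 5000 is the value suggested above for this case, not a universal recommendation):

```xml
<configuration>
  <system.web>
    <!-- default in .NET 1.1 is 100; requests beyond this queue limit
         are rejected with "Server Too Busy" -->
    <httpRuntime appRequestQueueLimit="5000" />
  </system.web>
</configuration>
```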
To do the troubleshooting, we need to check all of the following if we want to see what happened at the moment of failure:
1) IIS logs, including the substatus and Win32 status codes.
2) Check how many worker processes are running (whether the web garden setting has been changed to more than 1).
3) Check the kernel queue limit.
4) Check performance counters, especially ASP.NET v1.1.4322\Requests Queued and ASP.NET Apps v1.1.4322\Requests In Application Queue.
5) From the SQL perspective: queries that take a long time are suspects.
6) Disable all recycling rules and capture a dump at the moment users get the Server Too Busy error. Taking a dump and analyzing it tends to be the fastest way to find the root cause of a problem, but of course it requires some deep knowledge once the dump is open.
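As a sketch of what "opening the dump" involves when DebugDiag's automated analysis is not enough: load the dump in WinDbg with the .NET 1.1 SOS extension and walk the managed stacks. The commands below are standard WinDbg/SOS commands; the SOS path varies with the Framework installation:

```text
.load <path to sos.dll>   load the SOS extension for .NET 1.1
!threadpool               thread pool utilization and queued work items
~*e !clrstack             dump the managed call stack of every thread
!syncblk                  owners of managed locks, if threads block on them
```

In a hang like this one, most threads show the same remoting or socket call at the top of their stacks, which points at the external dependency.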
After making a simple analysis of the dump using DebugDiag, I got:
The following threads in IIS_CO~2.DMP are making a .NET remoting call and waiting on the remote server to respond
These threads are also waiting on data to be returned from another server via WinSock.
The call to WinSock originated from 0x01e8b113 and is destined for port 9080 at IP address 10.160.223.40
( 11 18 22 25 27 29 30 31 32 33 34 36 37 38 39 40 42 43 44 45 46 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 74 75 77 79 )
threads blocked (48 threads)
In this case there was a third machine (with IP address 10.160.223.40), used for monitoring purposes, connected to the server; that machine had become unresponsive, leaving the threads in a waiting state.
So this is one of those cases where the problem is not in the server exhibiting the symptoms but in an external component.
So the most important lesson here is:
You need to see the big picture: get a deployment diagram of everything that is part of the whole system, because the root cause can come from a point that is not necessarily the one showing the symptoms.