Isolating problems that seems to be related to the ISA Server – Part III

1. Introduction

On the last session, back in July last year we saw an issue that on the first perspective looks like an ISA Server issue, however it end up to be another component. The scenario that I’m going to describe now is really something that could be interpreted as an issue with ISA Server, however only was analysis was it possible to prove that ISA Server was actually the victim.

2. Scenario – Server’s Performance Gradually Become Less Responsive Until it Stops Accepting Incoming Connections

That was the description that customer gave us, and additionally he said that if he stops the Microsoft Firewall Service the server starts accepting new connections. During the time of the issue the following event was also appearing in the Application Log:

Event Type: Error

Event Source: Microsoft Firewall

Event Category: None

Event ID: 21280

Date: 2/10/2008

Time: 11:23:25 PM

User: N/A

Computer: ISASRV

Description:

The size of the free non-paged pool fell below the system-defined minimum. ISA Server will reject new connections unless they are initiated by the ISA Server computer. In addition, the timeout for idle TCP connections was reduced. This alert may indicate an attack on the ISA Server computer. Use the log viewer to examine ISA Server traffic. This event may also be raised when the ISA Server computer does not have the resources needed to handle legitimate traffic. If this is the case, you may need to add more memory to the system.

As the event says, this could be potentially an issue with ISA Server, but usually this event is caused by something outside of ISA. we needed to narrow it down to understand why we were receiving this error.

2.1. Collecting Data

Since we were dealing with a server’s performance degradation in general, we were still unsure that ISA Server was the real root cause for this problem. We had to collect data from the operating system to analyze the performance in general. Here what the action plan was:

Preparing the Server

1) Configure the system to get a Kernel memory dump when the server became unresponsive. To do this configure the server according to KB244139.

2) Configure the system to get PoolMon. Use KB177415.

3) Configure PerfMon. Use the objects suggested by the article “Monitoring and Troubleshooting Performance” on Microsoft TechNet.

When the issue happens

1) Use the keyboard combination to get the dump as mentioned on KB244139. This will generate a blue screen (which is expected).

2) Right after the server comes up run the Setup/Perf MPSReports.

3) Get the PoolMon result.

4) Get the Perfmon result.

3.3. Analyzing

Even without having the Microsoft Internal symbols you will still be able to review some key data using the public symbols. Download the debugging tools and configure the symbols according to the Debugging Tools and Symbols: Getting Started Page in Windows Hardware Developer Central.

Reviewing the dump one of the main points that came up was the excessive use of Non Paged Pool, there the result:

0: kd> !vm

*** Virtual Memory Usage ***

      Physical Memory: 917316 ( 3669264 Kb)

      Page File: \??\C:\pagefile.sys

      Current: 2095104 Kb Free Space: 2085964 Kb

      Minimum: 2095104 Kb Maximum: 4190208 Kb

      Page File: \??\E:\pagefile.sys

      Current: 4193280 Kb Free Space: 4187508 Kb

      Minimum: 4193280 Kb Maximum: 4193280 Kb

      Available Pages: 452869 ( 1811476 Kb)

      ResAvail Pages: 835185 ( 3340740 Kb)

      Locked IO Pages: 685 ( 2740 Kb)

      Free System PTEs: 185982 ( 743928 Kb)

      Free NP PTEs: 2036 ( 8144 Kb)

      Free Special NP: 0 ( 0 Kb)

      Modified Pages: 213 ( 852 Kb)

      Modified PF Pages: 213 ( 852 Kb)

      NonPagedPool Usage: 63071 ( 252284 Kb)

      NonPagedPool Max: 65280 ( 261120 Kb)

      ********** Excessive NonPaged Pool Usage *****

      PagedPool 0 Usage: 24312 ( 97248 Kb)

      PagedPool 1 Usage: 846 ( 3384 Kb)

      PagedPool 2 Usage: 824 ( 3296 Kb)

      PagedPool 3 Usage: 866 ( 3464 Kb)

      PagedPool 4 Usage: 859 ( 3436 Kb)

      PagedPool Usage: 27707 ( 110828 Kb)

      PagedPool Maximum: 90112 ( 360448 Kb)

      Shared Commit: 4379 ( 17516 Kb)

      Special Pool: 0 ( 0 Kb)

      Shared Process: 5009 ( 20036 Kb)

      PagedPool Commit: 27771 ( 111084 Kb)

      Driver Commit: 1899 ( 7596 Kb)

      Committed pages: 486017 ( 1944068 Kb)

      Commit limit: 2450007 ( 9800028 Kb)

 

In addition to checking the pool usage in the kernel dump (using !poolused) you can also use the PoolMon tool to see which drivers might be leaking. On this particular case the tags were:

Pool Tag: Vmem

Type: Pool_Paged

Leaking?: *YES*

Mapped Driver: Unknown Driver

Pool Tag: Ddk

Type: Pool_Non-Paged

Leaking?: *YES*

Mapped Driver: <unknown>

Those tags do not belong to Microsoft products and you can find that by using the article “How to find pool tags that are used by third-party drivers” at Microsoft Help and Support.

To analyze the Performance Monitor data you can follow the recommendations of the article “Analyzing performance data” at Microsoft TechNet and review the Operating System side. In this particular case, the ISA Server process (wspsrv.exe) was not showing any kind of leaking behavior or excessive use of memory or CPU.

4. The Culprit

With those pieces of the puzzle in hands we could match such behavior with the following two KB articles:

  • 947475 When TrendMicro OfficeScan is installed on a Windows Server 2003-based computer, event ID 2020 occurs, and the computer may stop responding, or the computer may respond slowly at Microsoft Help and Support.
  • 923125 The computer may stop responding after you install Trend Micro OfficeScan 8 on a computer that is running Microsoft Windows 2000 Server or Windows Server 2003 at Microsoft Help and Support.

This particular product was installed on the server and the problem was starting to make sense at this point.

4. Conclusion

It makes sense that after the customer stopped the Microsoft Firewall service the server started working normally. The fact is that if the Operating System itself runs out of the Non Paged Pool Resources it will trigger the LowNonPagedPoolCondition which will cause ISA Server to stop to accept new connections. This doesn’t cause issues only in ISA Server, this also happens with other products, such as IIS (per KB933844).

The ISA Server was just a victim in this situation and the workaround was to uninstall the antivirus product for immediate relief from this problem in this customer’s environment (as per KB923125). TrendMicro has released a hotfix that addresses this problem.

Yuri Diogenes

Security Support Engineer – ISA Server Team – Microsoft Texas