Windows 2003 Scalable Networking pack and its possible effects on Exchange


Note: Please see this blog post for the most updated recommendations on this subject.

EDIT: On 3/12/2008, we have posted a new blog post that talks about the hotfix that we have released for this problem: http://msexchangeteam.com/archive/2008/03/12/448421.aspx

Previously I did not mention what network cards were affected, but the majority of the cases that we have seen have dealt with Broadcom cards or network cards that have a Broadcom chipset in them.

Broadcom has provided updated drivers, available from http://www.broadcom.com/support/ethernet_nic/netxtremeii.php, that help resolve the problems with these offloading features. To determine whether you have the correct update, check that the version numbers are 3.7.19 or later; anything lower does not contain the fixes. ExBPA is being updated with a new rule to detect whether the Scalable Networking Pack features are enabled on the server.
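The version check described above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and it assumes the driver version is reported as a purely numeric dotted string such as 3.7.19.

```python
# Minimal sketch: compare a dotted driver version against the 3.7.19
# threshold quoted in the post. Names here are hypothetical.

MINIMUM_FIXED_VERSION = (3, 7, 19)


def parse_version(text):
    """Split a dotted version string such as '3.7.19' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def has_offload_fixes(driver_version):
    """Return True when driver_version is 3.7.19 or later."""
    # Tuples compare element by element, so (3, 10, 0) > (3, 7, 19).
    return parse_version(driver_version) >= MINIMUM_FIXED_VERSION
```

Comparing tuples of integers rather than raw strings avoids the trap where "3.10.0" sorts before "3.7.19" alphabetically.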

NOTE: Before applying the update directly from Broadcom’s site, I recommend checking with your server manufacturer first, as doing so may affect your ability to apply future updates through each vendor’s integrated update utilities.

Currently there are only a handful of manufacturers that have updated the drivers on their sites, but over time, you will see the updates available for downloading.

I have personally dealt with cases in which having the Scalable Networking features enabled with the latest drivers has not caused any connectivity issues to or from Exchange servers, so you can now take advantage of these features, which may help increase overall network performance on your servers.

Possibly impacted Operating System versions:

With the release of the Scalable Networking Pack that is included with Windows 2003 SP2, we in Exchange support have been seeing some connectivity issues once the new networking features are enabled. These new features are enabled by default and are only used if your network card driver supports them. Some of the new architectural additions that were introduced with the Scalable Networking Pack are TCP Chimney Offload, Receive-side Scaling (RSS) and NetDMA. These were introduced because of the Microsoft Scalable Networking Initiative that was designed to help reduce OS bottlenecks caused by network packet processing. More information regarding the Scalable Networking initiative can be found at www.microsoft.com/snp.

Essentially, these features offload TCP/IP packet processing to the network card, freeing up valuable CPU cycles for your applications. The throughput increases that you can get from having them enabled are quite significant.

To support these new features, the NDIS miniport driver model had to be redesigned, and thus NDIS 6.0 was born. For more information regarding the updated NDIS miniport driver, see http://msdn2.microsoft.com/en-us/library/ms798546.aspx from the Windows Driver Kit NDIS Miniport Driver Reference. With the NDIS 5.1 driver, the operating system was limited to processing network traffic on a single CPU, which hurt performance significantly. The new NDIS driver design allows this processing to be spread across multiple processors, which improves performance significantly.

It appears that this would actually increase the performance of the operating system, so what does this have to do with Exchange? Some of the problems are documented in 936594, and a short list of those that may affect Exchange is below.

  • You cannot create a Remote Desktop Protocol (RDP) connection to the server.
  • You cannot connect to shares on the server from a computer on the local area network.
  • You cannot connect to Microsoft Exchange Server from a computer that is running Microsoft Outlook.
  • You can only connect to Web sites that are hosted on the server or on the Internet by using a secure sockets layer (SSL) connection. In this scenario, you cannot connect to a Web site that does not use SSL encryption.
  • You experience slow network performance.
  • You cannot create an outgoing FTP connection from the server.
  • You experience intermittent RPC communications failures.
  • Some Outlook clients may be unable to connect to Exchange.
  • You cannot run the Configure E-mail and Internet Connection Wizard successfully.
  • Microsoft Internet Security and Acceleration (ISA) Server blocks RPC communications.
  • You cannot browse Internet Information Services (IIS) Virtual Directories.

In Support Services, we are also seeing some of the following behaviors when clients are trying to connect to the Exchange server.

  • 32 MAPI sessions exceeded (9646 errors), which prevents Outlook clients from connecting to the Information Store. This can occur more frequently with VPN-connected clients, and we have also seen scenarios where Exchange 2007 is affected when running on Windows 2003 SP2.
  • Non-paged pool memory leaks caused by having the Chimney feature enabled. Sometimes you can’t even start the IIS services when non-paged pool gets below 20MB.
  • The inability to log on to Outlook Web Services, or even IIS for that matter, either locally or remotely.
  • Networking throughput is decreased when these features are turned on. This is the opposite of what the Networking Pack is supposed to do.
  • Cluster ISAlive checks fail randomly
  • TCP Connections are reset when RSS is enabled.
  • TCP port exhaustion

Wow, that is a lot of things that could possibly go wrong, so why have this enabled by default? The default assumes that the network card drivers are up to date and support the new networking features. Most of the issues that we see are due to outdated network card drivers, and simply updating the network card driver to the latest release has provided relief in most situations. Another cause that we have seen is outdated Storport/SAN drivers consuming higher than normal non-paged pool memory, so ensure that you have the latest storport.sys and SAN drivers installed on your servers to help mitigate this problem. If you are still having problems, then I would highly recommend opening a case with Support Services for further assistance.

So how does one go about troubleshooting something like this? One would think that Netmon would show you what is going on, but the truth of the matter is that once the data is handed off to the network card, the Netmon filter can no longer log that traffic. Unfortunately, this makes the problem harder to troubleshoot.

You can actually see what connections are offloaded by using the netstat -t command. This -t switch is only available if the networking pack is installed on a server. An offloaded network connection can be in one of the following states:

  • In Host – the network connection is being handled by the host CPU
  • Offloading – the network connection is in the process of being transferred to the offload target
  • Uploading – the network connection is in the process of being transferred back to the host CPU
  • Offloaded – the network connection is being handled by the offload target.
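To get a quick picture of how many connections are in each state, you could tally the output of netstat -t with a small script. The sketch below is an assumption-laden illustration: it assumes the offload state is the last column of each TCP line, the sample text is made up for this example rather than captured from a real server, and the function names are hypothetical.

```python
from collections import Counter

# The four states listed above, written without spaces so that
# "In Host" and "InHost" are treated the same way.
OFFLOAD_STATES = ("InHost", "Offloading", "Uploading", "Offloaded")


def offload_state(line):
    """Return the offload state found at the end of a TCP line, or None."""
    fields = line.split()
    if len(fields) < 2 or fields[0].upper() != "TCP":
        return None  # header lines, blank lines, non-TCP rows
    if fields[-2:] == ["In", "Host"]:
        return "InHost"
    return fields[-1] if fields[-1] in OFFLOAD_STATES else None


def count_offload_states(netstat_output):
    """Tally offload states across all TCP lines of the given text."""
    counts = Counter()
    for line in netstat_output.splitlines():
        state = offload_state(line)
        if state is not None:
            counts[state] += 1
    return counts


# Hypothetical sample; the real column layout may differ.
SAMPLE = """
  Proto  Local Address   Foreign Address   State        Offload State
  TCP    10.0.0.5:135    10.0.0.9:49200    ESTABLISHED  InHost
  TCP    10.0.0.5:445    10.0.0.7:51234    ESTABLISHED  Offloaded
  TCP    10.0.0.5:691    10.0.0.8:50011    ESTABLISHED  Offloaded
"""
```

Feeding it the saved output of netstat -t before and after disabling a feature gives a simple way to confirm that connections have moved back to the host CPU.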

The most common issues we have seen involve the TCP Chimney offload feature, and the resolution we have most often provided to customers to date is to disable that feature to see whether it helps in their specific scenario.

To disable TCP Chimney, you can do this one of two ways. I prefer the latter as it is instantaneous and does not require a reboot.

To disable TCP Chimney through the registry, navigate to the following registry key and set the value to 0. Note: You have to reboot the server after this registry change.

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters]
"EnableTCPChimney"=dword:00000000

Or you can use the netsh command, which I prefer because it does not require a reboot:

Netsh int ip set chimney DISABLED

If that does not help with the situation, you could also try setting the following values under the same registry key to disable the remaining offloading features, including RSS.

"EnableTCPA"=dword:00000000
"EnableRSS"=dword:00000000
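If you need to apply the same three registry values to several servers, one option is to generate a .reg file rather than editing each value by hand. The following is a minimal sketch, not a supported tool: the function name is hypothetical, it only builds the text of the file, and how you save and import it (and the reboot that follows) is left to you.

```python
# Minimal sketch: build the text of a .reg file that sets the three
# Scalable Networking values from this post to 0. Names are hypothetical.

TCPIP_PARAMETERS_KEY = (
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"
)
SNP_VALUES = ("EnableTCPChimney", "EnableTCPA", "EnableRSS")


def build_reg_file(values=SNP_VALUES, data=0):
    """Return .reg file text that sets each named DWORD value to data."""
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        "[" + TCPIP_PARAMETERS_KEY + "]",
    ]
    for name in values:
        # DWORD values are written as eight hexadecimal digits.
        lines.append('"%s"=dword:%08x' % (name, data))
    return "\r\n".join(lines) + "\r\n"
```

Calling build_reg_file(data=1) produces the file that re-enables the features, which is handy once the network card drivers have been updated.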

If any of the above options help with the issue, the fix may be as simple as updating your network card drivers to the latest version and then re-enabling those features to see whether the problem still occurs. If there are still questions regarding whether or not your network card supports these new features, contact the hardware vendor for your card for more information.

To help with some of the HTTP resource constraints that may occur when nonpaged pool memory is less than 20MB on an Exchange Server with these features enabled, you can enable the AggressiveMemoryUsage registry key per 934878 to help HTTP services continue to function in low nonpaged pool memory conditions. Setting this same registry key may also allow your cluster HTTP IsAlive polls to keep functioning in low memory conditions that would normally result in a failover. Of course, this is only a short-term workaround; the root cause of the paged pool memory issues should be identified and fixed by following KB articles 177415, 912376 and 317249 to help determine what might be causing the problem.

One other thing to mention here is that TCP Chimney offload and NetDMA will not work with the following features enabled:

  1. Windows Firewall
  2. Internet Protocol security (IPsec)
  3. Internet Protocol Network Address Translation (IPNAT)
  4. Third-party firewalls
  5. NDIS 5.1 intermediate drivers

If any one of these features is turned on, TCP Chimney offload and NetDMA will not work regardless of the registry settings.
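The rule above boils down to a simple check: offloading only happens when the feature is enabled and none of the listed blockers is active. A small sketch of that logic follows; the feature labels and function name are hypothetical and exist only for this illustration, so detecting which features are actually active on a server is outside its scope.

```python
# Minimal sketch of the rule above. Labels and names are hypothetical.

BLOCKING_FEATURES = frozenset({
    "Windows Firewall",
    "IPsec",
    "IPNAT",
    "Third-party firewall",
    "NDIS 5.1 intermediate driver",
})


def chimney_can_offload(chimney_enabled, active_features):
    """Return (can_offload, blockers) for the given server configuration.

    chimney_enabled: True if TCP Chimney is turned on in the registry/netsh.
    active_features: iterable of feature labels active on the server.
    """
    blockers = sorted(BLOCKING_FEATURES.intersection(active_features))
    return (bool(chimney_enabled) and not blockers, blockers)
```

Returning the list of blockers alongside the yes/no answer makes it easier to explain why netstat -t shows every connection as In Host even though the registry says the feature is on.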

Using offloading with any Microsoft or 3rd party NAT solution will also cause known connectivity issues with Exchange Servers, SBS, ISA servers, etc.

As you can see, there are many variables that can affect the performance and stability of your network card, so you just need to keep in mind the new networking features during your troubleshooting efforts.

I hope this helps shed some light on these new features and some of the issues that we in support are seeing here today. It may even prevent a future support call somewhere down the road.

For a list of partners that have tested Scalable Networking with their networking products, see Scalable Networking Partners.

References:

You cannot host TCP connections when Receive Side Scaling is enabled in Windows Server 2003 with Service Pack 2
http://support.microsoft.com/default.aspx?scid=kb;EN-US;927695

You may experience network-related problems after you install Windows Server 2003 SP2 or the Scalable Networking Pack on a Windows Small Business Server 2003-based computer that has an advanced network adapter
http://support.microsoft.com/default.aspx?scid=kb;EN-US;936594

TCP traffic stops after you enable both receive-side scaling and Internet Connection Sharing in Windows Vista or in Windows Server 2003 with Service Pack 1
http://support.microsoft.com/default.aspx?scid=kb;EN-US;927168

Mike Lagase

Comments (29)
  1. Concerned user says:

    That Q Article above states the following conditions exist:

    A problem exists that affects NAT when you have Receive Side Scaling enabled.

    Adjusting the MTU or max segment size would fix the issue that have been related to the support calls.

    Netmon can capture from remote computer.

  2. mike says:

    Thanks for putting this together, Mike.

  3. mike says:

    Also, will a rule be added to the BPA to check for these keys and direct folks here or to the KB articles for more information?

  4. Exchange says:

    Mike,

    Indeed, this is being evaluated/worked on.

  5. WantanExchange says:

    Don’t mean to hijack the thread, but I am not sure where else to complain.

    I am a veteran of Linux, and have been using it since 1998. I am no stranger to the command line, and I find it extremely useful when implemented well.

    However, Exchange 2007 has all the disadvantages of a poor GUI and a poor command shell. Sure, the command line is useful for large enterprises who have 18 Exchange admins, like Bank of America where I used to work (luckily when still using Exchange 2003), but not having the ability to change user mailbox permissions in the GUI?

    That’s just asinine, and unconscionable. This alienates many administrators, and makes my life harder, and increases the time it takes me to do simple Exchange tasks from a few seconds to, in some cases, several days.

    This is insane.

    I’ve noticed that anyone who disagrees with the Microsofties on other forums reference the ridiculous EMC is accused of being a troll and spreading FUD.

    Whatever.

    I’ve been in IT since the 1990s. I now control a lot of money and make a lot of decisions as to what companies should use for their IT infrastructure. If there is any possible way that I can convince the small- to medium-sized companies that I consult for to explore other email solutions, I will do so — even if it’s Linux-based, as at least the command line there makes some sense.

    It’s not that I am not comfortable with Powershell/EMS. I am. It’s just that I hate it and think it mars the good name of Exchange, and also makes day-to-day administration of most tasks much, much harder.

    Old days: User calls up and wants me to add some permissions to a mailbox: Click! Click! Click! Done. 8 seconds.

    Today: User calls up and wants me to add some permissions to a mailbox: Oh, what’s that command? (Looks it up.) Oh yeah. Ok, hmm, wrong syntax. Let me try that again. Oops, wrong syntax again. To user: Can I call you back in an hour? I have to wade through 20 pages of tech docs to figure this out.

    User: Confused because it used to take 8 seconds.

    If there weren’t so very much removed from the EMC, I’d be more pleased with it. What a waste of my damn time.

    I will not be recommending Microsoft Exchange in the future to anyone until this is corrected, and though I am sure MSFT will not notice it in ther bottom line, I am not the only one by far who feels this way.

    What a terrible decision for a product that I really liked, and what a terrible response to complaints.

  6. Chad says:

    We have been fighting with this on two clients. Couldn’t find a ryme or reason. Thank you.

  7. Sven says:

    Same here, after disabling the chimney feature, non-paged pool usage dropped from 120 MB to 70 MB. I’m feeling much better now. :-)

    A million thanks for your blog!

  8. Mark D says:

    Please educate PSS on this.  We have 3 tickets about this and ended up having to revert to SP1.  We found during our production outage of 35000 users that turning of /3gb and /ua in boot.ini got the system into a usable state with SP2.  It allowed us to schedule an outage after hours to uninstall.

  9. Rick says:

    Does this only affect Small Business Server, as mentioned in the article, or are other Windows Server versions at risk?

    Thanks

  10. Exchange says:

    Rick,

    Any flavor of Windows Server 2003 SP2 could be affected, depending on the mix of drivers installed on the machine.

  11. Mike Lagase says:

    This can also happen on servers running SP1 with 912222 installed.

  12. Kevin says:

    We were definitely experiencing symptoms of this on our Exchange cluster and we feel it may explain recent client-DC issues.  We’re speculating that aberrations in the network traffic exposed another issue in Exchange Microsoft is working with us on and may be fixed with an unreleased patch addressing store.exe cache flushing and database page allocation.  

    I wanted to add that I did quite a bit of testing with disabling the TCPA/Chimney/RSS features at the OS level, which showed improvement, but we didn’t get fully proper behaviour until I also disabled all TCP offload and RSS on the driver configuration for our HP DL385G2 servers.  It defies normal network troubleshooting because narrowing down traffic issues which seem node specific fail to indicate any issue with the end node nor intervening links.  Other traffic traveling to the same endpoint via the same path may pass just fine.

  13. Ankur Kothari says:

    The Scalable Network Pack or SP2 will also affect at least ISA 2004/2006 machines.  Behavior there is slightly different as it will result in slow login times, inability to run gpupdate, and on EE Arrays a random communication loss between servers.

    Disabling the TCPChimney and TCPA (and sometimes RSS) will help these issues as well.  

  14. Bill Nitz says:

    You need to get those KB articles updated. I spent two days fighting this issue with ISA Server 2006 only to get a 5 minute resolution from PSS.

    They gave me the KB#, but the product list is far from complete.

  15. Paul S. says:

    A connection with a state of ‘In Host’ is not truly offloaded, correct?

  16. larry heier says:

    Hello,

    I had a couple customers with weird connectivity problems after Windows 2003 SP2 and the three registry changes solved all the problems.

    One client was simply Exchange 2003 on new Dell hardware with Windows 2003 SP2 recently applied and connectivity slowed down.

    The other client was a new install of Exchange 2007 on HP ProLiant DL360/380G5 servers and we saw most of the problems at a remote site over a T1 (no local DC) that initially looked like firewall port issues until I saw this post.  Basically logins over the WAN went from a minute or two to 8-10 minutes, WSUS 3.0 wouldn’t pull in the PC’s at this office, and the first Exchange 2007 mailboxes moved were very slow and mail in the outbox was stuck at times.  We also saw some slowness at the hub site.  Setting the three items in the registry to 0 on the two new DC’s and Exchange 2007 solved all problems.

    I think Microsoft needs to communicate this problem even better as I am sure it is affecting more than just SBS customers (note: mine were not SBS customers).

    thanks,

    Larry H

  17. Mike Lagase says:

    Paul S,

    That is correct, In Host means that the CPU is currently handling the connection. The other 3 states are when the connection is being offloaded to/from the card and when it has been completely offloaded to the card.

    Hope that helps.

    Mike

  18. Christopher Lewis says:

    So what’s the exposure to this?  Does MS have a hot list of Hardware/Drivers version where this is evident?

    And is it an immediately obvious error?

    We’re looking at deploying SP2 to 50,000 servers and are trying to figure out if we should force these setting off…

  19. TimH says:

    Kevin with the DL385 and other guys running AMD may want to take a look at KB 838448. I had an issue with time stamp counter drift on some two-way opteron machines that initially looked like the issues in this article.

    -Tim-

  20. Mike Lagase says:

    Christopher,

    Currently there is not a list of hardware/drivers where this is evident. What I would recommend is that if you are having problems after upgrading a couple of machines, I would do one of the following.

    1. Open a case with PSS with our networking team to help determine if this is a problem with the network card driver or firmware that you are using. This is the recommended avenue so that we can get to the bottom of these issues.

    2. Disable the SNP features to get you out of a pickle for the time being.

    Mike

  21. ixchanger says:

    TimH,

    the correct KB for the AMD issue: 938448

    ixchanger

  22. Karsten says:

    We are using DL380 G4 with 2 ql2300 HBAs (Connection to SAN). Now since installing Win2003SP2 we get from ESE Warning 507/508/509/510 about once a week sometimes more often (only on one of our two Cluster Knodes). These are saying Exchange couldn’d get data in reasonable time (took more than 60 sec). After a while we got 467 Error (Database Error).

    Has anybody else had these Problems after applieing SP2 and has solved the case?

    We updated to the newest drivers (7.9A including HP StorageWorks Fibre Channel Adapter Kit for QLAxxxx Storport Driver for Windows Server 2003 1.4.0.1 )

  23. abhinaw says:

    Microsoft release a patch (http://support.microsoft.com/kb/936594) on Aug 6, 2007. We are struggling with one client for about a month now and we asked our client to download this patch. we also disabled the three REG entries. But the problem still exists. Is there anything else that we should do? Client is using BroadCom NetXtreem gigabit nic, driver version is 7.100.0.0.

    thanks

    – abhi

  24. David says:

    Running the "Netsh int ip set chimney DISABLED" helped me failover to one of my nodes after I applied SP2 to it. Thank you.

  25. nick says:

    thank for the info

  26. professorX13 says:

    Karsten – we seem to have exactly the same issue here at the moment and we are still working on the problem. Have you managed to resolve your issues yet?

  27. Gary Liddington says:

    Similar problem here. FTP on a multihomed Windows 2003

    SP2 install. FTP put to a remote site works fine, FTP get

    times out after about 10 seconds (on a 50meg file)

    Small 100k files are fine.

    Tried patch, registry changes and netsh command, all to

    no avail :-(

  28. steve h says:

    Karsten – professorX13

    we have the exact same issue, with server 2003 SP1 and SP2  connected to DMX 3 SANs running exchange 2003 sp2, we have microsoft and EMC in next week to investigate,

    The issue appeared after critical patches were applied in August to SP1 and on servers with SP2 installed, we have updated all firmware and drivers to recommended levels however this has made no difference.

    the issue is worse when online maintaince runs or when a clone is taken of the stores, however we still see the error outside of these times.

    for info we do not get the issue on both nodes of a cluster, both nodes are built the same with identical hardware and software. the only common factor is the DMX’s

    has anyone else seen this issue?

  29. Brian.Kronberg says:

    I had this issue at a client.  They had intermittent errors with their TSM backup.  It would work whenever they did a manual test but would fail when scheduled at night.

    Ran the "Netsh int ip set chimney DISABLED" command and all the servers have backed up perfectly ever since.
