LBFO Dynamic Teaming mode may drop send packets in Windows Server 2012 R2

UPDATE: The hotfix is now available for this issue! Get it at https://support.microsoft.com/en-us/kb/3137691
This hotfix applies to Windows Server 2012 R2 servers that use load balancing and failover (LBFO) Dynamic Teaming, and it should prevent the specific problem discussed below from occurring.

---------------

My name is Ajay Sarkaria and I am a Supportability Program Manager at Microsoft. I want to highlight an issue you may experience if you are using LBFO Dynamic teaming mode on Windows Server 2012 R2 servers.

NIC teaming is available in all editions of Windows Server 2012 R2 including Server Core. NIC teaming in Windows Server 2012 R2 supports the following traffic load distribution algorithms:
  

  • Hyper-V switch port
  • Address Hashing
  • Dynamic

Dynamic load balancing is a new feature in Windows Server 2012 R2. Windows Server 2012 R2 defaults to Dynamic mode when a NIC team is created on a physical computer.

This algorithm takes the best aspects of each of the other two modes and combines them into a single mode.

  • Outbound loads are distributed based on a hash of the TCP ports and IP addresses. Dynamic mode also rebalances loads in real time, so a given outbound flow may move back and forth between team members.
  • Inbound loads are distributed as though Hyper-V port mode were in use.

The outbound loads in this mode are dynamically balanced based on the concept of flowlets. Just as human speech has natural breaks at the ends of words and sentences, TCP flows (TCP communication streams) also have naturally occurring breaks. The portion of a TCP flow between two such breaks is referred to as a flowlet. When the dynamic mode algorithm detects that a flowlet boundary has been encountered, i.e., a break of sufficient length has occurred in the TCP flow, the algorithm will opportunistically rebalance the flow to another team member if appropriate. The algorithm may also periodically rebalance flows that do not contain any flowlets if circumstances require it. As a result, the affinity between TCP flow and team member can change at any time as the dynamic balancing algorithm works to balance the workload of the team members.
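The flowlet idea described above can be illustrated with a small model. This is only a sketch under stated assumptions, not the LBFO implementation: the `FLOWLET_GAP` threshold, the CRC32 hash, the "least bytes sent" rebalancing rule, and all class and function names are hypothetical choices made for illustration.

```python
import zlib
from dataclasses import dataclass, field

# Assumed idle gap (seconds) that counts as a flowlet boundary; the real
# threshold used by LBFO is not given in this post.
FLOWLET_GAP = 0.05


def flow_hash(src_ip, src_port, dst_ip, dst_port):
    """Hash the address/port 4-tuple into a stable integer."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key)


@dataclass
class DynamicBalancer:
    members: list                                   # team member names
    bytes_sent: dict = field(default_factory=dict)  # member -> bytes sent
    flows: dict = field(default_factory=dict)       # flow -> (member, last send time)

    def pick_member(self, flow, now, size):
        """Return the team member that carries this packet."""
        if flow in self.flows:
            member, last = self.flows[flow]
            # A long enough pause marks a flowlet boundary, so the flow may
            # move to the least-loaded member at this point.
            if now - last >= FLOWLET_GAP:
                member = min(self.members, key=lambda m: self.bytes_sent.get(m, 0))
        else:
            # New flow: initial placement comes from the address/port hash.
            member = self.members[flow_hash(*flow) % len(self.members)]
        self.flows[flow] = (member, now)
        self.bytes_sent[member] = self.bytes_sent.get(member, 0) + size
        return member
```

In this model, packets sent back to back stay on the same team member, while a packet that follows a pause of at least `FLOWLET_GAP` may be moved to whichever member has carried the least traffic so far. The periodic rebalancing of flows that contain no flowlets is not modeled here.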

There is a scenario where LBFO may drop send packets when using the default dynamic mode. Packet loss increases as processor cores are added and network traffic increases.

Windows Server 2012 R2 computers configured with teamed NICs in Dynamic mode may experience the following symptoms:

  • Cluster nodes are unexpectedly removed from cluster groups
  • SQL servers experience database transaction failures
  • TCP connection aborts

LBFO Event 8 may be logged in the LBFO Event Log:

Event location: [Applications and Services] -> [Microsoft] -> [MsLbfoProvider] -> [Operational] view

Log Name: Microsoft-Windows-MsLbfoProvider/Operational
Source: Microsoft-Windows-MsLbfoEventProvider
Event ID: 8
Level: Warning
Keywords: (4398046511104),(1099511627776)
Computer: SERVER.contoso.com
Description:
Failing NBL send on TeamNic <some value>
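To check a server for this event without opening Event Viewer, the log can be queried from the command line. The sketch below only builds a `wevtutil` command for the log name and event ID shown above; the function name, the default count, and the choice of text output are illustrative assumptions.

```python
LBFO_LOG = "Microsoft-Windows-MsLbfoProvider/Operational"


def lbfo_event_query(event_id=8, count=20):
    """Build a wevtutil command that lists the newest matching LBFO events."""
    return [
        "wevtutil", "qe", LBFO_LOG,
        f"/q:*[System[(EventID={event_id})]]",  # XPath filter on the event ID
        f"/c:{count}",                          # limit the number of events returned
        "/rd:true",                             # newest events first
        "/f:text",                              # human-readable output
    ]
```

On a Windows server, the returned list can be passed to `subprocess.run(..., capture_output=True, text=True)` from an elevated prompt. An empty result suggests that no Event 8 warnings have been recorded in that log.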

Microsoft-Windows-FailoverClustering events 1177 and 1135 may also be logged.
Note: These events may be caused by dropped packets or by other root causes.

Event logged on the node that is removed because of dropped packets:

Log Name: System
Source: Microsoft-Windows-FailoverClustering
Event ID: 1177
Task Category: Quorum Manager
Level: Critical
User: SYSTEM
Description:
The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

Event logged on the active nodes in the same cluster group that noticed the removal of the node identified in event 1177:

Log Name: System
Source: Microsoft-Windows-FailoverClustering
Event ID: 1135
Task Category: Node Mgr
Level: Critical
User: SYSTEM
Description:
Cluster node '<nodename>' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

If you are using LBFO in the default Dynamic mode and are experiencing the above symptoms, one of the following workarounds may provide relief until a public fix is released:

  • Use a teaming configuration other than Dynamic Active-Active, such as Dynamic Active-Standby or Address Hash mode.
  • If the LBFO team has only two NICs, configure one of them as the Standby adapter.
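The workarounds above can also be applied from PowerShell instead of the NIC Teaming UI. The sketch below only builds the command lines; it assumes the `Set-NetLbfoTeam` and `Set-NetLbfoTeamMember` cmdlets from the NetLbfo module, which this post does not mention, and the team and adapter names are hypothetical placeholders. `TransportPorts` is used here as the PowerShell value corresponding to Address Hash.

```python
def workaround_commands(team, standby_member=None):
    """Return PowerShell command lines for one of the two workarounds.

    team           -- name of the LBFO team (hypothetical placeholder)
    standby_member -- adapter to place in standby; None selects Address Hash
    """
    if standby_member is None:
        # Workaround 1: switch the team from Dynamic to address hashing.
        return [
            f'Set-NetLbfoTeam -Name "{team}" -LoadBalancingAlgorithm TransportPorts'
        ]
    # Workaround 2: keep the Dynamic algorithm but put one member in standby.
    return [
        f'Set-NetLbfoTeamMember -Name "{standby_member}" -Team "{team}" '
        "-AdministrativeMode Standby"
    ]
```

For example, `workaround_commands("Team1", standby_member="NIC2")` returns the single command that places the adapter named NIC2 in standby. Review the generated command before running it in an elevated PowerShell session, since changing team settings can briefly interrupt connectivity.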

Note: The workarounds may reduce network throughput, so apply them only if you are seeing the above symptoms and are using LBFO in the default Dynamic mode. Once we have a public fix released, we will update the blog with the public KB details. We highly recommend changing the mode back to Dynamic after you install the fix.

As a general recommendation, when using LBFO always use the latest network interface card (NIC) driver and firmware versions. You can check for the latest versions on your OEM or NIC vendor websites.

Thanks,

Ajay Sarkaria
Supportability Program Manager - Windows