Core Network Stack Features in the Creators Update for Windows 10

By: Praveen Balasubramanian and Daniel Havey

This blog is the sequel to our first Windows Core Networking features announcements post.  It describes the second wave of core networking features in the Windows Redstone series.  The first wave of features is described here: Announcing: New Transport Advancements in the Anniversary Update for Windows 10 and Windows Server 2016.  We encourage the Windows networking enthusiast community to experiment and provide feedback.  If you are interested in Windows Transport please follow our Facebook feedback and discussion page: @Windows.10.Data.Transport.

 

TCP Improvements:

TCP Fast Open (TFO) updates and server side support

In the modern age of popular Web services and e-commerce, latency is a killer when it comes to page responsiveness.  We’re adding support in TCP for TCP Fast Open (TFO) to cut down on round trips that can severely impact how long it takes for a page to load.  Here’s how it works: TFO establishes a secure TFO cookie in the first connection using a standard 3-way handshake.  Subsequent connections to the same server use the TFO cookie to send data without waiting for the 3-way handshake to complete (zero RTT).  This means TCP can carry data in the SYN and SYN-ACK.

What we found together with others in the industry is that middleboxes are interfering with such traffic and dropping connections. Together with our large population of Windows enthusiasts (that’s you!), we conducted experiments over the past few months, and tuned our algorithms to avoid usage of this option on networks where improper middlebox behavior is observed.  Specifically, we enabled TFO in Edge using a checkbox in about:flags.

To harden against such challenges, Windows automatically detects and disables TFO on connections that traverse these problematic middleboxes.  For our Windows Insider Program community, we enabled TFO in Edge (about:flags) by default for all Insider flights in order to get a better understanding of middlebox interference issues and to find more problems with anti-virus and firewall software.  The data helped us improve our fallback algorithm, which detects typical middlebox issues.  We intend to continue our partnership with our Windows Insider Program (WIP) professionals to improve our fallback algorithm and identify unwanted anti-virus, firewall and middlebox behavior.  Retail and non-WIP releases will not participate in the experiments.  If you operate infrastructure or software components such as middleboxes or packet processing engines that make use of a TCP state machine, please incorporate support for TFO.  In the future, the combination of TLS 1.3 and TFO is expected to be more widespread.

The Creators Update also includes a fully functional server side implementation of TFO. The server side implementation also supports a pre-shared key for cases where a server farm is behind a load balancer. The shared key can be set by the following knob (requires elevation):

reg add HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters /v TcpFastopenKey /t REG_BINARY /f /d 0123456789abcdef0123456789abcdef
netsh int tcp reload
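
To illustrate why a pre-shared key matters behind a load balancer, here is a minimal sketch in Python.  It assumes the cookie is a keyed hash of the client address; the post does not describe the actual Windows cookie algorithm, so HMAC-SHA256 truncated to 8 bytes is only a stand-in, and the function names are hypothetical.

```python
import hashlib
import hmac
import ipaddress

# Same example value as in the registry command above (not a real secret).
SHARED_KEY = bytes.fromhex("0123456789abcdef0123456789abcdef")

def make_cookie(key: bytes, client_ip: str) -> bytes:
    """Derive an 8-byte cookie from the client address (illustrative only)."""
    packed = ipaddress.ip_address(client_ip).packed
    return hmac.new(key, packed, hashlib.sha256).digest()[:8]

def validate_cookie(key: bytes, client_ip: str, cookie: bytes) -> bool:
    """Recompute the cookie and compare it in constant time."""
    return hmac.compare_digest(make_cookie(key, client_ip), cookie)

# Server A issues the cookie on the first connection.
cookie = make_cookie(SHARED_KEY, "203.0.113.7")
# Server B, configured with the same key, accepts it on a later connection.
accepted_by_peer = validate_cookie(SHARED_KEY, "203.0.113.7", cookie)
# A server holding a different key cannot validate the cookie.
rejected_other_key = validate_cookie(bytes(16), "203.0.113.7", cookie)
```

Without a shared key, a client whose second connection lands on a different server in the farm would present a cookie that server cannot validate, and the connection would fall back to a regular handshake.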

We encourage the community to test both client and server side functionality for interop with other operating system network stacks.  Subsequent releases of Windows Server will include TFO functionality, allowing deployment of IIS and other web servers that can take advantage of reduced connection setup times.

 

Experimental Support for the High Speed CUBIC Congestion Control Algorithm

CUBIC is a TCP Congestion Control (CC) algorithm featuring a cubic congestion window (Cwnd) growth function.  CUBIC is a high-speed TCP variant that uses the amount of time since the last congestion event, instead of ACK clocking, to advance the Cwnd.  In large bandwidth-delay product (BDP) networks, CUBIC takes advantage of available bandwidth much faster than ACK-clocked CC algorithms such as NewReno.  There have been reports that CUBIC can cause bufferbloat in networks with unmanaged queues (LTE and ADSL).  In the Creators Update, we are introducing a Windows native implementation of CUBIC.  We encourage the community to experiment with CUBIC and send us feedback.
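
The time-based growth described above can be sketched with the window function from the CUBIC specification (RFC 8312).  This is a minimal illustration in Python; the constants are the RFC defaults, and the Windows implementation may use different values or additional logic.

```python
# Constants from RFC 8312; assumed here, not confirmed for Windows.
C = 0.4      # scaling constant
BETA = 0.7   # multiplicative decrease factor after a congestion event

def cubic_k(w_max: float) -> float:
    """Seconds needed to grow back to w_max after a congestion event."""
    return (w_max * (1 - BETA) / C) ** (1 / 3)

def cubic_window(t: float, w_max: float) -> float:
    """Congestion window (in segments) t seconds after the last congestion event.

    w_max is the window size just before that event.
    """
    return C * (t - cubic_k(w_max)) ** 3 + w_max

# Window trajectory after a loss at w_max = 100 segments.
samples = [cubic_window(t, 100.0) for t in range(0, 9)]
```

The curve restarts at BETA * w_max, flattens as it approaches the previous maximum, and then accelerates beyond it.  Note that the window depends only on elapsed time, not on how many ACKs arrived, which is what lets CUBIC ramp up quickly on long-RTT paths.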

The following commands can be used to enable CUBIC globally and to return to the default Compound TCP (requires elevation):

netsh int tcp set supplemental template=internet congestionprovider=cubic
netsh int tcp set supplemental template=internet congestionprovider=compound

*** The Windows implementation of Cubic does not have the “Quiescence bug” that was recently uncovered in the Linux implementation.

 

Improved Receive Window Autotuning

TCP autotuning logic computes the “receive window” parameter of a TCP connection as described in TCP autotuning logic.  High speed and/or long delay connections need this algorithm to achieve good performance characteristics.  The takeaway from all this is that using the SO_RCVBUF socket option to specify a static value for the receive buffer is almost universally a bad idea.  For those of you who choose to do so anyway, please remember that calculating the correct size for TCP send/receive buffers is complex and requires information that applications do not have access to.  It is far better to allow the Windows autotuning algorithm to size the buffer for you.  We are working to identify such suboptimal usage of the SO_RCVBUF/SO_SNDBUF socket options and to convince developers to move away from fixed window values.  If you are an app developer and you are using either of these socket options, please contact us.
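
A quick back-of-the-envelope calculation shows why a static receive buffer hurts.  The sketch below uses the standard bandwidth-delay product relationship; the path parameters and the 64 KiB buffer are hypothetical values chosen for illustration.

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bytes that must be in flight to keep the path full."""
    return bandwidth_bps * rtt_s / 8

def window_limited_bps(window_bytes: int, rtt_s: float) -> float:
    """Throughput ceiling when at most one window is delivered per RTT."""
    return window_bytes * 8 / rtt_s

# Hypothetical path: 1 Gbit/s with a 100 ms round trip.
needed = bdp_bytes(1e9, 0.100)
# Ceiling for an app that pins its receive buffer at 64 KiB on that path.
ceiling = window_limited_bps(64 * 1024, 0.100)
```

On this path the receiver needs roughly 12.5 MB of window to fill the link, while a fixed 64 KiB buffer caps throughput at about 5.2 Mbit/s.  An application cannot know the bandwidth or RTT of every path it will run over, which is why the sizing is better left to autotuning.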

In parallel to our developer education effort, we are improving the autotuning algorithm.  Before the Creators Update, the TCP receive window autotuning algorithm depended on correct estimates of the connection’s bandwidth and RTT.  There are two problems with this method.  First, the TCP RTT estimate is only measured on the sending side as described in RFC 793.  However, there are many examples of receive-heavy workloads, such as OS updates.  The RTT estimate taken on the receive-heavy side could be inaccurate.  Second, there could be a feedback loop between altering the receive window (which can change the estimated bandwidth) and then measuring the bandwidth to determine how to alter the receive window.

These two problems caused the receive window to constantly vary over time.  We eliminated the unwanted behavior by modifying the algorithm to use a step function to converge on the maximum receive window value for a given connection.  The step function algorithm results in a larger receive buffer size.  However, the advertised receive window is not backed by a non-paged pool memory allocation, and system resources are not used unless data is received and queued, so the larger size is fine.  Based on experimental results, the new algorithm adapts to the BDP much more quickly than the old algorithm.  We encourage users and system administrators to also take note of our earlier post: An Update on Windows TCP AutoTuningLevel.  This should clear up misconceptions that autotuning and receive window scaling are bad for performance.

TCP stats API

The Estats API requires elevation and enumerates statistics for all connections.  This can be inefficient, especially on busy servers with lots of connections.  In the Creators Update we are introducing a new API called SIO_TCP_INFO.  SIO_TCP_INFO allows developers to query rich information on individual TCP connections using a socket option.  The SIO_TCP_INFO API is versioned and we plan to add more statistics over time.  In addition, we plan to add SIO_TCP_INFO to .NET NCL and HTTP APIs in subsequent releases.

The MSDN documentation for this API will be up soon and we will add a link here as soon as it is available.

IPv6 improvements

The Windows networking stack is dual stack and has supported both IPv4 and IPv6 by default since Windows Vista.  Across the Windows 10 releases, we have been actively working on improving the support for IPv6.  The following are some of the advancements in the Creators Update.

RFC 6106 support

The Creators Update includes support for RFC 6106 which allows for DNS configuration through router advertisements (RAs).  RDNSS and DNSSL ND options contained in router advertisements are validated and processed as described in the RFC.  The implementation supports a max of 3 RDNSS and DNSSL entries each per interface.  If there are more than 3 entries available from one or more routers on an interface, then entries with greater lifetime are preferred.  In the presence of both DHCPv6 and RA DNS information, Windows gives precedence to DHCPv6 DNS information, in accordance with the RFC.
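
The selection rule above (at most 3 entries per interface, longest lifetime preferred) can be sketched as follows.  This is an illustrative Python model, not the Windows implementation; the record type and function names are hypothetical, and tie-breaking between equal lifetimes is left unspecified.

```python
from typing import List, NamedTuple

MAX_ENTRIES = 3  # per-interface limit stated in this post

class Rdnss(NamedTuple):
    address: str   # DNS server address from the RDNSS option
    lifetime: int  # lifetime in seconds from the RA
    router: str    # router that advertised the entry

def select_rdnss(entries: List[Rdnss]) -> List[Rdnss]:
    """Keep the MAX_ENTRIES usable entries with the greatest lifetime."""
    # Per RFC 6106, a lifetime of zero means the server must no longer be used.
    usable = [e for e in entries if e.lifetime > 0]
    return sorted(usable, key=lambda e: e.lifetime, reverse=True)[:MAX_ENTRIES]

# Two routers on the same interface advertise five servers in total.
advertised = [
    Rdnss("2001:db8::1", 600, "fe80::1"),
    Rdnss("2001:db8::2", 1800, "fe80::1"),
    Rdnss("2001:db8::3", 0, "fe80::1"),
    Rdnss("2001:db8::4", 1200, "fe80::2"),
    Rdnss("2001:db8::5", 300, "fe80::2"),
]
chosen = [e.address for e in select_rdnss(advertised)]
```

The same rule would apply to DNSSL entries, with a search domain in place of the server address.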

In Windows, the lifetime processing of RA DNS entries deviates slightly from the RFC.  In order to avoid implementing timers to expire DNS entries when their lifetime ends, we rely on the periodic Windows DNS service query interval (15 minutes) to remove expired entries, unless a new RA DNS message is received, in which case the entry is updated immediately.  This enhancement eliminates the complexity and overhead of kernel timers while keeping the DNS entries fresh.

The following command can be used to control this feature (requires elevation):
netsh int ipv6 set interface <ifindex> rabaseddnsconfig=<enabled | disabled>

Flow Labels

Before the Creators Update, the FlowLabel field in the IPv6 header was set to 0.  Beginning with the Creators Update, outbound TCP and UDP packets over IPv6 have this field set to a hash of the 5-tuple (Src IP, Dst IP, Src Port, Dst Port, Protocol).  Middleboxes can use the FlowLabel field to perform ECMP for un-encapsulated native IPv6 traffic without having to parse the transport headers.  This will make IPv6-only datacenters doing load balancing or flow classification more efficient.
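
As a rough illustration of the idea, the sketch below derives a 20-bit flow label (the width of the IPv6 FlowLabel field) from a 5-tuple.  The hash function Windows uses is not stated in this post; SHA-256 is an arbitrary stand-in, and the function name is hypothetical.

```python
import hashlib
import ipaddress
import struct

FLOW_LABEL_MASK = 0xFFFFF  # the IPv6 FlowLabel field is 20 bits wide

def flow_label(src_ip: str, dst_ip: str, src_port: int,
               dst_port: int, protocol: int) -> int:
    """Map a 5-tuple to a 20-bit label (stand-in hash, not the Windows one)."""
    data = (
        ipaddress.ip_address(src_ip).packed
        + ipaddress.ip_address(dst_ip).packed
        + struct.pack("!HHB", src_port, dst_port, protocol)
    )
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest[:4], "big") & FLOW_LABEL_MASK

# Every packet of one TCP connection (protocol 6) gets the same label.
label_a = flow_label("2001:db8::10", "2001:db8::20", 50000, 443, 6)
label_b = flow_label("2001:db8::10", "2001:db8::20", 50000, 443, 6)
```

Because the label is stable for a flow, an ECMP device can keep a flow on one path by hashing only fixed-offset IPv6 header fields (source address, destination address, flow label), without walking extension headers to find the ports.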

The following knob can be used to control this feature (requires elevation); flow labels are enabled by default:
netsh int ipv6 set global flowlabel=<enabled | disabled>

ISATAP and 6to4 disabled by default

IPv6 continues to see uptake and IPv6 only networks are no longer a rarity. ISATAP and 6to4 are IPv6 transition technologies that have been enabled by default in Windows since Vista/Server 2008. As a step towards future deprecation, the Creators Update will have these technologies disabled by default. There are administrator and group policy knobs to re-enable them for specific enterprise deployments. An upgrade to the Creators Update will honor any administrator or group policy configured settings. By disabling these technologies, we aim to increase native IPv6 traffic on the Internet. Teredo is the last transition technology that is expected to be in active use because of its ability to perform NAT traversal to enable peer-to-peer communication.

Improved 464XLAT support

464XLAT was originally designed for cellular only scenarios since mobile operators are some of the first ISPs with IPv6 only networks.  However, some apps are not IP-agnostic and still require IPv4 support.  Since a major use case for mobile is tethering, 464XLAT should provide IPv4 connectivity to tethered clients as well as to apps running on the mobile device itself. Creators Update adds support for 464XLAT on desktops and tablets too. We also enabled support for TCP Large Send Offload (LSO) over 464XLAT improving throughput and reducing CPU usage.

Multi-homing improvements

Devices with multiple network interfaces are becoming ubiquitous.  The trend is especially prevalent on mobile devices, but 3G and LTE connectivity is also becoming common on laptops, hybrids and many other form factors.  For the Creators Update we collaborated with the Windows Connection Manager (WCM) team to make the WiFi to cellular handover faster and to improve performance when a mobile device is docked with wired Ethernet connectivity and then undocked, causing a failover to WiFi.

Dead Gateway Detection (DGD)

Windows has always had a DGD algorithm that automatically transitions connections over to another gateway when the current gateway is unreachable, but that algorithm was designed for server scenarios.  For the Creators Update we improved the DGD algorithm to respond to client scenarios such as switching back and forth between WiFi and 3G or LTE connectivity.  DGD signals WCM whenever transport timeouts suggest that the gateway has gone dead.  WCM uses this data to decide when to migrate connections over to the cellular interface.  DGD also periodically re-probes the network so that WCM can migrate connections back to WiFi.  This behavior only occurs if the user has opted in for automatic failover to cellular.

Fast connection teardown

In Windows, TCP connections are preserved for about 20 seconds to allow for fast reconnection in the case of a temporary loss of wired or wireless connectivity.  However, in the case of a true disconnection, such as docking and undocking, this is an unacceptably long delay.  Using the Fast Connection Teardown feature, WCM can signal the Windows transport layer to instantly tear down TCP connections for a fast transition.

Improved diagnostics using Test-NetConnection

Test-NetConnection (alias tnc) is a built-in PowerShell cmdlet that performs a variety of network diagnostics.  In the Creators Update we have enhanced this cmdlet to provide detailed information about both route selection and source address selection.

The following command, when run elevated, will describe the steps to select a particular route per RFC 6724.  This can be particularly useful on multi-homed systems or when there are multiple IP addresses on the system.

Test-NetConnection -ComputerName "www.contoso.com" -ConstrainInterface 5 -DiagnoseRouting -InformationLevel "Detailed"