Windows network performance suffering from bad buffering

Daniel Havey, Praveen Balasubramanian

Windows telemetry indicates that a significant number of data connections use the SO_RCVBUF and/or SO_SNDBUF Winsock options to statically allocate TCP buffers. Many websites recommend setting the TCP buffers with these options in order to improve TCP performance. This is a myth. Using the Winsock options (SO_RCVBUF and/or SO_SNDBUF) to statically allocate TCP buffers will not make the Windows networking stack “faster”. In fact, static allocation of the TCP buffers will degrade performance in terms of both how quickly the connection responds (latency) and how much data it delivers (bandwidth). The Windows transports team officially recommends against doing this.

TCP buffers need to be allocated dynamically, in proportion to the Bandwidth Delay Product (BDP) of the TCP connection. There are two good reasons to let the Windows networking stack set the TCP buffers for us rather than setting them statically at the application layer: 1.) the application does not know the BDP (TCP does), so it cannot size the buffers correctly, and 2.) dynamic buffer management requires complex algorithmic control, which TCP already implements. In summary, Windows 10 has autotuning for TCP. Let the autotuning algorithm manage the TCP buffers.

Example: I am going to use the Cygwin application as an example since it recently fixed its buffering (thank you, Corinna). The experiment is conducted across the Internet, from my desk in Redmond to an iperf server in France.

Experiment 1 — Cygwin (Bad buffering):

Pinging 178.250.209.22 with 32 bytes of data:
Reply from 178.250.209.22: bytes=32 time=176ms TTL=35
Reply from 178.250.209.22: bytes=32 time=173ms TTL=35
Reply from 178.250.209.22: bytes=32 time=173ms TTL=35
Reply from 178.250.209.22: bytes=32 time=172ms TTL=35
Ping statistics for 178.250.209.22:
Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 172ms, Maximum = 176ms, Average = 173ms
------------------------------------------------------------
Client connecting to 178.250.209.22, TCP port 5001
TCP window size: 208 KByte (default)
------------------------------------------------------------
[  3] local 10.137.196.108 port 56758 connected with 178.250.209.22 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec   512 KBytes  4.19 Mbits/sec
[  3]  1.0- 2.0 sec  1.50 MBytes  12.6 Mbits/sec
[  3]  2.0- 3.0 sec  1.50 MBytes  12.6 Mbits/sec
[  3]  3.0- 4.0 sec  1.25 MBytes  10.5 Mbits/sec
[  3]  4.0- 5.0 sec  1.50 MBytes  12.6 Mbits/sec
[  3]  5.0- 6.0 sec  1.50 MBytes  12.6 Mbits/sec
[  3]  6.0- 7.0 sec  1.50 MBytes  12.6 Mbits/sec
[  3]  7.0- 8.0 sec  1.25 MBytes  10.5 Mbits/sec
[  3]  8.0- 9.0 sec  1.50 MBytes  12.6 Mbits/sec
[  3]  9.0-10.0 sec  1.50 MBytes  12.6 Mbits/sec
[  3]  0.0-10.1 sec  13.6 MBytes  11.3 Mbits/sec
We can see that the RTT is essentially the same for Experiments 1 and 2, about 173 ms. However, in Experiment 1 Cygwin has bad buffering: the throughput averages 11.3 Mbps and tops out at 12.6 Mbps. This is because in Experiment 1 Cygwin was using SO_RCVBUF to statically allocate 278,775 bytes for the TCP receive buffer, and the throughput is buffer limited to roughly 12.6 Mbps.

Experiment 2 — Cygwin (Good buffering):

Pinging 178.250.209.22 with 32 bytes of data:
Reply from 178.250.209.22: bytes=32 time=172ms TTL=35
Reply from 178.250.209.22: bytes=32 time=172ms TTL=35
Reply from 178.250.209.22: bytes=32 time=172ms TTL=35
Reply from 178.250.209.22: bytes=32 time=173ms TTL=35
Ping statistics for 178.250.209.22:
Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 172ms, Maximum = 173ms, Average = 172ms
------------------------------------------------------------
Client connecting to 178.250.209.22, TCP port 5001
TCP window size: 64.0 KByte (default)
------------------------------------------------------------
[  3] local 10.137.196.108 port 56898 connected with 178.250.209.22 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec   768 KBytes  6.29 Mbits/sec
[  3]  1.0- 2.0 sec  11.8 MBytes  98.6 Mbits/sec
[  3]  2.0- 3.0 sec  18.0 MBytes   151 Mbits/sec
[  3]  3.0- 4.0 sec  16.6 MBytes   139 Mbits/sec
[  3]  4.0- 5.0 sec  16.4 MBytes   137 Mbits/sec
[  3]  5.0- 6.0 sec  18.0 MBytes   151 Mbits/sec
[  3]  6.0- 7.0 sec  18.0 MBytes   151 Mbits/sec
[  3]  7.0- 8.0 sec  18.0 MBytes   151 Mbits/sec
[  3]  8.0- 9.0 sec  15.6 MBytes   131 Mbits/sec
[  3]  9.0-10.0 sec  17.4 MBytes   146 Mbits/sec
[  3]  0.0-10.0 sec   151 MBytes   126 Mbits/sec
In Experiment 2 we see Cygwin perform without static application-level buffering. The average throughput is 126 Mbps and the maximum is 151 Mbps, which is the true unloaded line speed of this connection. By statically allocating the receive buffer with SO_RCVBUF we limited ourselves to a top speed of 12.6 Mbps; by letting Windows TCP autotuning dynamically allocate the buffers we achieved the true unloaded line rate of 151 Mbps. That is about an order of magnitude better performance. Static allocation of TCP buffers at the app level is a bad idea. Don’t do it.

Sometimes there are corner cases where, as a developer, you might think there is justifiable cause to statically allocate the TCP buffers. Let’s take a look at three of the most common reasons for thinking this:

1.) Setting the buffers for performance’s sake. Don’t. TCP autotuning is a kernel-level algorithm and can do a better job than any application-layer algorithm.
2.) Setting the buffers because you are trying to rate limit traffic. Be careful! The results may not be what you expect. In the Cygwin example the connection is buffer limited to 12.6 Mbps. However, if the RTT were to drop to about 40 ms, the same buffer would allow roughly 56 Mbps. You cannot reliably set a bandwidth cap in this manner, because the cap is the buffer size divided by the RTT, and the RTT is not under your control (see the BDP equations).
3.) Setting the buffers for some other reason.  Let’s have a discussion.  Please comment on the post and we will respond.