How to Troubleshoot Office 365 Service Slowness?

Hi Folks,
Lakshman Hariharan here with a post to share with you one reason why Office 365 services might be perceived as “slow.” 
In my role as a Premier Field Engineer, I perform many “Office 365 Network Performance Assessments” for our Premier customers. If you haven’t heard of the assessment and are planning to use or already are using Office 365, ask your Technical Account Manager (TAM) about the offering. It has helped most of the customers I’ve worked with and I would highly recommend it.

Oftentimes, I am asked to perform the assessment after the fact, due to sluggish performance to Office 365 online services.  We help identify potential issues on the customer’s network that could be causing performance problems.

One issue I often find in these assessments, and one that can slow down not just Office 365 but any traffic encrypted using SSL, is proxy servers or other intermediary devices performing SSL inspection, that is, decrypting and re-encrypting the traffic before handing the packets off to their respective destinations.
One of the primary causes of “slowness” is latency from a client to the destination. As is probably evident to most people reading this post, the higher the latency, the slower the performance. A good way to measure latency to an endpoint from a client is to capture a network trace while accessing the endpoint from an application that uses TCP. If we capture a network trace during the three-way handshake, the time delta between the Syn and Syn-Ack packets of the handshake gives us the round-trip time.
Below are two frames from a network capture taken while I navigated to https://microsoft.sharepoint.com using the Edge browser on my Windows 10 machine.
  • Speaking of Windows 10 (and Edge), have you upgraded yet?  It’s free!

The second column of each frame is the time delta field. As we can see from the Syn-Ack frame below, the latency to the destination server is 19ms.
MicrosoftEdgeCP.exe   0.0000000  TCP  192.168.1.70    191.234.10.14     TCP:Flags=……S., SrcPort=59728, DstPort=HTTPS(443), PayloadLen=0, Seq=2566998163, Ack=0, Win=65535 ( Negotiating scale factor 0x8 ) = 65535
MicrosoftEdgeCP.exe   0.0191718  TCP  191.234.10.14   192.168.1.70     TCP:Flags=…A..S., SrcPort=HTTPS(443), DstPort=59728, PayloadLen=0, Seq=4151039979, Ack=2566998164, Win=8192 ( Negotiated scale factor 0x8 ) = 2097152
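The same reading can be done programmatically. The following is a minimal Python sketch, not part of the original assessment tooling. It assumes the capture has been exported as text in the layout shown above, with the time delta as the second whitespace-separated column; the function names are hypothetical.

```python
# Frames copied from the capture above, truncated after the addresses.
SYN_FRAME = "MicrosoftEdgeCP.exe   0.0000000  TCP  192.168.1.70    191.234.10.14"
SYN_ACK_FRAME = "MicrosoftEdgeCP.exe   0.0191718  TCP  191.234.10.14   192.168.1.70"


def time_delta_seconds(frame: str) -> float:
    """Return the time delta, assumed to be the second column of the frame line."""
    return float(frame.split()[1])


def handshake_rtt_ms(syn_ack_frame: str) -> float:
    """Round-trip time in milliseconds.

    The Syn-Ack's time delta is measured from the preceding frame (the Syn),
    so it already equals the handshake round-trip time.
    """
    return time_delta_seconds(syn_ack_frame) * 1000.0


print(f"Handshake RTT: {handshake_rtt_ms(SYN_ACK_FRAME):.1f} ms")
```

This only holds when the Syn-Ack immediately follows its Syn in the filtered view; if other frames sit in between, subtract absolute timestamps instead.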
Now this is my home network and there aren’t any proxy servers, packet shapers, WAN optimizers and such. Many corporate networks do utilize such devices and a proxy server is the most common one. When traffic is routed through proxy servers they can do one of two things with the traffic:

1.       Straight-through proxy without modifying the packets in any way.

or

2.       Terminate and recreate all the TCP connections that are proxied through them.

In addition to #2 above, some proxy servers also perform packet inspection on SSL traffic.
In cases where the proxy server is terminating and recreating the TCP sessions, as in #2 above, running a network trace while launching https://microsoft.sharepoint.com or any destination will not capture latency to the destination but will give you the latency to the proxy server because the TCP sessions are terminated and recreated at the proxy server.
Below are a couple of frames from a network capture where the proxy server is terminating and recreating TCP sessions. This capture was taken from a client in the Central region of the United States going to a server on the East Coast of the United States. There are a couple of ways to tell that the proxy is terminating and recreating the TCP sessions and I discuss them after the snip from the network capture. Note that the IP addresses in the captures are fictitious.
1001 0.0000000  TCP  10.11.12.13     191.232.210.32  TCP:Flags=……S., SrcPort=62542, DstPort=HTTPS(443), PayloadLen=0, Seq=970000008, Ack=0, Win=8192 ( Negotiating scale factor 0x8 ) = 8192
1002 0.0024863  TCP  191.232.210.32  10.11.12.13     TCP:Flags=…A..S., SrcPort=HTTPS(443), DstPort=62542, PayloadLen=0, Seq=3467183491, Ack=970000009, Win=14320 ( Negotiated scale factor 0x7 ) = 1832960
One way to tell that a nearby device is terminating the session is by looking at the time delta field (the second column) on the Syn-Ack packet from the server (second frame). It is roughly 2.5ms, which is nearly impossible going from a client in the Central region to the East Coast of the United States given the distance involved. However, that is just an educated guess.
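To put a rough number on that educated guess, here is a small Python sketch of the physical lower bound on round-trip time. It assumes light travels through optical fiber at about 200,000 km per second (roughly two-thirds of its speed in a vacuum) and uses 1,000 miles as an illustrative Central-to-East-Coast distance; both figures are assumptions, and real paths are longer and slower than the straight-line bound.

```python
# Assumption: signal speed in optical fiber, about two-thirds the speed of light.
FIBER_KM_PER_SEC = 200_000.0
KM_PER_MILE = 1.609344


def min_rtt_ms(one_way_miles: float) -> float:
    """Lowest possible round-trip time (ms) over a straight fiber run."""
    one_way_km = one_way_miles * KM_PER_MILE
    return 2.0 * one_way_km / FIBER_KM_PER_SEC * 1000.0


def max_one_way_miles(rtt_ms: float) -> float:
    """Farthest a responder could be (miles) given an observed round-trip time."""
    one_way_km = (rtt_ms / 1000.0) * FIBER_KM_PER_SEC / 2.0
    return one_way_km / KM_PER_MILE


# Illustrative distance (assumption) and the Syn-Ack delta from the capture.
print(f"Minimum RTT over 1,000 miles: {min_rtt_ms(1000):.1f} ms")
print(f"Maximum distance for 2.4863 ms: {max_one_way_miles(2.4863):.0f} miles")
```

Under these assumptions, 1,000 miles cannot be covered in much less than 16ms round trip, and a 2.5ms response must come from a device within roughly 150 miles, which is consistent with a device on the local network answering on the server's behalf.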
Another way would be to look at the Time to Live (TTL) value in the IP header of the frames. Below are the IP header fields from the same two frames that we looked at above, one from the Syn packet from the client and the other from the Syn-Ack supposedly back from the server. I say supposedly because, as I will discuss, it isn’t coming from the server but from the proxy.
TimeToLive: 128 (0x80)
NextProtocol: TCP, 6(0x6)
Checksum: 0 (0x0)
SourceAddress: 10.11.12.13
DestinationAddress: 191.232.210.32

TimeToLive: 62 (0x3E)
NextProtocol: TCP, 6(0x6)
Checksum: 5384 (0x1508)
SourceAddress: 191.232.210.32
DestinationAddress: 10.11.12.13
If you note the TTL values above, you will see that the Syn packet has a TTL of 128, which is the default value for a Windows machine. The TTL value on the second packet is 62, which means that the Syn-Ack is actually coming from a device that is most likely two hops away. The reason we know that is most Linux and/or Unix based devices and appliances have a default TTL of 64, and as the packet traverses each hop the TTL is decremented by one.
Thus, we can draw the conclusion that the Syn-Ack packet originated from a device that is two hops away. It is nearly impossible for a client in a corporate network to reach a server that is almost a thousand miles away on the internet in a mere two hops, because in most corporate environments it takes at least that many hops just to get to the edge of the corporate managed network.
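The TTL arithmetic can be sketched in a few lines of Python. This is an illustration only: it assumes the sender started from one of the common initial TTL values (64 for most Linux/Unix-based systems, 128 for Windows, 255 for some network equipment), and the function name is hypothetical. A device configured with a non-default TTL would make the estimate wrong.

```python
# Common initial TTL values (assumption: the sender uses one of these defaults).
COMMON_INITIAL_TTLS = (64, 128, 255)


def estimate_hops(observed_ttl: int) -> tuple[int, int]:
    """Return (assumed initial TTL, estimated hop count) for an observed TTL.

    The initial TTL is taken to be the smallest common default that is not
    below the observed value; each router on the path decrements it by one.
    """
    if not 1 <= observed_ttl <= 255:
        raise ValueError("TTL must be between 1 and 255")
    initial = min(t for t in COMMON_INITIAL_TTLS if t >= observed_ttl)
    return initial, initial - observed_ttl


print(estimate_hops(128))  # the client's own Syn, captured before leaving the machine
print(estimate_hops(62))   # the Syn-Ack that claims to come from the server
```

For the two frames above this yields an initial TTL of 128 with zero hops for the client's Syn, and an initial TTL of 64 with two hops for the Syn-Ack.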
Now that we’ve established that the proxy is indeed terminating the connections and recreating them, let’s establish that the proxy is also performing SSL inspection.
Below are two frames from the SSL session setup. The first is a Client Hello packet and the second is the Server Hello when the server responds with the certificate.
1007 0.0007042  TLS  10.11.12.13 191.232.210.32 TLS:TLS Rec Layer-1 HandShake: Client Hello.
1009 0.2251383  TLS  191.232.210.32 10.11.12.13 TLS:TLS Rec Layer-1 HandShake: Server Hello. Certificate.
Note again the time delta field (the second column). According to it, the Server Hello arrives after 225ms. That latency is quite high given the distance, which could mean one of two things:

1.       That the connection to the destination over the internet is going through a high-latency path, where the ISP is routing the traffic to the East Coast via a circuitous route

OR

2.       An intermediary device, likely the proxy server, is intercepting the packet and causing the delay.

This is when we go back to the TTL field in the IP portion of the packets and look at the values. Below are the TTL values from the Client Hello packet and Server Hello packet.
TimeToLive: 128 (0x80)
NextProtocol: TCP, 6(0x6)
Checksum: 0 (0x0)
SourceAddress: 10.11.12.13
DestinationAddress: 191.232.210.32

TimeToLive: 62 (0x3E)
NextProtocol: TCP, 6(0x6)
Checksum: 0 (0x0)
SourceAddress: 191.232.210.32
DestinationAddress: 10.11.12.13
Note the TTL values. The Client Hello has a TTL of 128, which is expected given that the client is a Windows machine. The TTL on the Server Hello packet is 62, which, as we saw earlier, indicates that it is coming from the nearby proxy server rather than from the destination. Given this, and given that the latency between the client and the proxy server is only a couple of milliseconds as we saw earlier, we can quite confidently conclude that the bulk of the 225ms latency (excluding the actual latency to the destination and back of a few tens of milliseconds) can be attributed to the proxy server intercepting the SSL packets and performing SSL inspection.
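The reasoning above can be condensed into a rough check. The Python sketch below is only an illustration of the logic: the class and function names are hypothetical, the 30ms expected round trip to the destination stands in for the "few tens of milliseconds" mentioned above, and the two thresholds are arbitrary assumptions that would need tuning for a real network.

```python
from dataclasses import dataclass


@dataclass
class ClientSideEvidence:
    syn_ack_delta_ms: float             # time delta on the Syn-Ack frame
    server_hello_delta_ms: float        # time delta on the Server Hello frame
    expected_destination_rtt_ms: float  # analyst's estimate (assumption)


def unexplained_tls_delay_ms(e: ClientSideEvidence) -> float:
    """Server Hello delay not accounted for by the expected path latency."""
    return max(0.0, e.server_hello_delta_ms - e.expected_destination_rtt_ms)


def looks_like_ssl_inspection(e: ClientSideEvidence,
                              nearby_ms: float = 5.0,
                              excess_ms: float = 100.0) -> bool:
    """Heuristic: TCP is answered nearby, yet the TLS reply is far too slow.

    nearby_ms and excess_ms are illustrative thresholds, not recommended values.
    """
    terminated_nearby = e.syn_ack_delta_ms < nearby_ms
    return terminated_nearby and unexplained_tls_delay_ms(e) > excess_ms


# Values from frames 1002 and 1009 of the capture; 30 ms is an assumed estimate.
evidence = ClientSideEvidence(2.4863, 225.1383, 30.0)
print(f"Unexplained delay: {unexplained_tls_delay_ms(evidence):.1f} ms")
print("Likely SSL inspection:", looks_like_ssl_inspection(evidence))
```

A positive result is a prompt for further investigation with the network team, not proof; the TTL comparison described above remains the stronger indicator of which device actually answered.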
Bear in mind this assumes that you only have access to the client-side network captures and have no access to, or knowledge of, the proxy server configuration or network captures from the proxy server.
In summary, this is in essence how we can establish, using only client-side network traces, whether the traffic you are sending via SSL to a destination is being inspected by an intermediary device.
If you are seeing sluggish performance to SSL sites and/or Office 365 and are seeing similar symptoms on a client-side trace, work with your network team to investigate further and, hopefully, resolve the issue.
Happy tracing y’all.

Lakshman Hariharan