How to measure the Network Round Trip Time to Office 365

One of the jobs I do fairly often is helping customers work out the performance of their network connection to Office 365 from various sites, to ensure it's within the limits that give good user performance with Outlook, SharePoint etc.

We have various tools available that do some level of network check for you, such as https://em1-fasttrack.cloudapp.net/o365nwtest (for EMEA). Beware that this tool uses Java, which may not be permitted on some customer sites. I tend to do my checks manually using a variety of tools as it gives me more granular detail, so here is how I do it.

 

How do I find the IP address I need to connect to so as to check this?

There are multiple ways we can get this information:

 

  • Ping the name you are trying to connect to.

 

ping mytennant.sharepoint.com

Pinging prodnet47-48ipv4a.sharepointonline.com.akadns.net [157.55.232.50] with 32 bytes of data:

Alternatively, for Exchange, you can ping outlook.office365.com. Not only will this tell you the IP you'll hit, it'll also tell you which datacenter you are connecting to.

This is a good check to do to ensure efficiency. We're pretty clever with connecting Outlook traffic in that we use geo-DNS to find out where in the world you are, then provide you with the DNS address of your nearest datacenter. You connect to a CAS server there, which will then route you through our high-speed backbone between the datacenters to wherever your mailbox is located.

This means, for example, that if your mailbox is located in an EMEA tenant and you travel to the US, you'll be connected to a CAS server in a US datacenter. That server then pulls the data you need for Outlook over our network, which should mean a much better service. https://technet.microsoft.com/en-us/library/dn741250.aspx outlines this in more detail.

The catch is that if your DNS server (or your proxy, for that matter) is located in a different region from you, you may be routed inefficiently, so it's always a good check to run.

For example, if I ping from Seattle I get:

ping outlook.office365.com

Pinging outlook-namnorthwest.office365.com [132.245.92.41] with 32 bytes of data

Whereas here at home in the UK I get one of the EMEA datacenters.

ping outlook.office365.com

Pinging outlook-emeawest.office365.com [132.245.229.114] with 32 bytes of data:

 

  • Network Sniffing

     

    The other method is to use a packet sniffer to trace a connection occurring and this will show you the IP. This method also gives me the advantage of seeing if we're going via a proxy to connect out.

    Personally I use Netmon but other tools such as Wireshark will do the job well depending on your preference.

    Start the tool tracing (run as admin and attach to the correct NIC first) then either launch Outlook or open the SharePoint site you want to access.

    Once you've accessed the data you want, stop the trace and have a look in Netmon for the process you used. In this case it was IE connecting to my O365 SharePoint site.

    Firstly I can see the DNS call go out (use "DNS" in the filter column).

    189    15:56:56 08/05/2014    15:56:56 08/05/2014    11.6029790    8.3358920        192.168.0.8    192.168.0.1    DNS    DNS:QueryId = 0xA8BD, QUERY (Standard query), Query for mytennant.sharepoint.com of type A on class Internet    {DNS:55, UDP:54, IPv4:8}    

    203    15:56:57 08/05/2014    15:56:57 08/05/2014    12.1115268    0.0006276        192.168.0.1    192.168.0.8    DNS    DNS:QueryId = 0xA8BD, QUERY (Standard query), Response - Success, 49, 0 ...     {DNS:55, UDP:54, IPv4:8}

    ARecord: prodnet47-48ipv4a.sharepointonline.com.akadns.net of type A on class Internet: 157.55.232.50    

    Netmon handily breaks down the connections per process, then the connections made by that process. Here we can see IE is connecting to 157.55.232.50 on port 443, which is the IP address we got back for my SharePoint site. If you close all IE windows and use just one to connect, you know the IP address in use is the one for SharePoint.
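If you'd rather not read the address off the ping output by eye, the same information can be pulled out with a few lines of Python. This is only a sketch: `parse_ping_header` and `resolve` are names made up for illustration, the regular expression assumes the Windows `Pinging name [address]` header format shown above, and `resolve` simply asks whatever DNS server the machine is configured to use (IPv4 only).

```python
import re
import socket

# Matches the first line of Windows ping output, for example:
# "Pinging outlook-emeawest.office365.com [132.245.229.114] with 32 bytes of data:"
_PING_HEADER = re.compile(r"Pinging\s+(\S+)\s+\[([^\]]+)\]")


def parse_ping_header(line):
    """Return (resolved_name, ip_address) from a ping header line, or None."""
    match = _PING_HEADER.search(line)
    if match is None:
        return None
    return match.group(1), match.group(2)


def resolve(hostname):
    """Return (canonical_name, [ipv4_addresses]) using the local DNS settings."""
    canonical_name, _aliases, addresses = socket.gethostbyname_ex(hostname)
    return canonical_name, addresses
```

Calling `resolve("outlook.office365.com")` from different sites and comparing the canonical names is one quick way to spot a site that is being handed a datacenter in the wrong region.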

     

    So now you have the IP address and the port used, what do you do next?

     

    Measuring the Round Trip Time to the address

    As ICMP is often blocked at firewalls, a standard ping is not an effective tool for this task.

    Thankfully we have an alternative. I use a great tool by Mark Russinovich in the Sysinternals Suite, PsPing: https://technet.microsoft.com/en-us/sysinternals/jj729731.aspx

    We can use this tool to measure latency internally (by running a server version on the remote end). As this is O365 we can't do that, but it still does a great job of measuring the RTT on the three-way handshake.

    It completes a TCP connection to the IP address and port provided, so we can accurately measure the time between the SYN and SYN-ACK on that connection. And as it uses a port we know is open, it won't be blocked by the firewall.

    The syntax for this is:

    psping -n 20 157.55.232.50:443

    So we're doing 20 pings, to the IP address derived above, on TCP port 443.

    PsPing v2.01 - PsPing - ping, latency, bandwidth measurement utility

    Copyright (C) 2012-2014 Mark Russinovich

    Sysinternals - www.sysinternals.com

     

    TCP connect to 157.55.232.50:443:

    21 iterations (warmup 1) connecting test:

     Connecting to 157.55.232.50:443 (warmup): 15.01ms
    
     Connecting to 157.55.232.50:443: 15.44ms
    
     Connecting to 157.55.232.50:443: 15.73ms
    
     Connecting to 157.55.232.50:443: 15.55ms
    
     Connecting to 157.55.232.50:443: 15.54ms
    
     Connecting to 157.55.232.50:443: 15.70ms
    
     Connecting to 157.55.232.50:443: 14.97ms
    
     Connecting to 157.55.232.50:443: 14.70ms
    
     Connecting to 157.55.232.50:443: 16.02ms
    
     Connecting to 157.55.232.50:443: 16.53ms
    
     Connecting to 157.55.232.50:443: 15.39ms
    
     Connecting to 157.55.232.50:443: 15.38ms
    
     Connecting to 157.55.232.50:443: 15.95ms
    
     Connecting to 157.55.232.50:443: 15.99ms
    
     Connecting to 157.55.232.50:443: 16.82ms
    
     Connecting to 157.55.232.50:443: 16.10ms
    
     Connecting to 157.55.232.50:443: 15.55ms
    
     Connecting to 157.55.232.50:443: 16.30ms
    
     Connecting to 157.55.232.50:443: 16.03ms
    
     Connecting to 157.55.232.50:443: 15.55ms
    
     Connecting to 157.55.232.50:443: 14.81ms
    

     

    TCP connect statistics for 157.55.232.50:443:

    Sent = 20, Received = 20, Lost = 0 (0% loss),

    Minimum = 14.70ms, Maximum = 16.82ms, Average = 15.70ms

    So we can see the RTT to my SharePoint site averages 15.70ms with a maximum of 16.82ms, meaning there is no real fluctuation in the RTT.

    Compare this to the network tool at https://em1-fasttrack.cloudapp.net/o365nwtest and it's almost identical. However, by using PsPing it's easier to combine this with a network trace so we can check other network-level settings such as MTU and TCP window scaling.
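Where PsPing isn't available, the same idea (timing how long a TCP connect takes to complete) can be roughed out in Python. This is a sketch of the technique and not a reimplementation of PsPing: `tcp_connect_rtts` and `summarize` are illustrative names, and the timing is taken in user space, so it includes a little overhead beyond the raw SYN to SYN-ACK time. Pass an IP address and not a hostname, otherwise the DNS lookup is included in each reading.

```python
import socket
import time


def tcp_connect_rtts(host, port, count=20, warmup=1, timeout=5.0):
    """Time `count` TCP connects to host:port and return the results in ms.

    The first `warmup` connects are made but not recorded, mirroring the
    warmup iteration in the PsPing output above.
    """
    samples = []
    for i in range(warmup + count):
        start = time.perf_counter()
        # create_connection returns once the three-way handshake has completed
        with socket.create_connection((host, port), timeout=timeout):
            elapsed_ms = (time.perf_counter() - start) * 1000.0
        if i >= warmup:
            samples.append(elapsed_ms)
    return samples


def summarize(samples):
    """Return (minimum, maximum, average) of a non-empty list of readings."""
    return min(samples), max(samples), sum(samples) / len(samples)
```

For example, `summarize(tcp_connect_rtts("157.55.232.50", 443))` would give figures comparable to the minimum, maximum and average lines above, using the address derived earlier.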

     

     

    So what if I'm not going direct and am using a proxy?

    This particular method comes into its own when a proxy is in use, which is a common scenario I see at customer sites.

    If we are connecting via a proxy server then we have to change tack a little as we're not communicating directly to the O365 node. We have a TCP connection to the proxy then another one is set up from the proxy out to the O365 endpoint.

    Firstly, we repeat the test we did above, but to the address of our proxy server. Once we have the RTT to the proxy, which is invariably at the network perimeter, we then need to get a reading from that point out to O365.

    We therefore run PSPING on the proxy, or a machine in front of the proxy which has a direct internet connection, to the IP address of the O365 endpoint. From these two figures we have an overall RTT and also, an idea where the majority of our latency is occurring.

    If you have the ability to bypass the proxy, repeat the test on the same client but with the direct connection. Subtract the RTT you measured to the proxy from this direct figure (presuming the exit point and routing are similar) and you'll have your internal and external RTT.

    If it's not possible to run PsPing on the proxy, most proxies allow for a packet capture. We can use this to get the RTT by looking at the time delta between the SYN and SYN-ACK, or between any other packet and a response we can match to it, such as the SSL handshake.
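The arithmetic behind the two approaches is simple enough to write down. The sketch below uses made-up function names and, in the usage note, made-up readings purely for illustration: `split_rtt` covers the case where you measured both legs separately, and `external_from_direct` covers the bypass case, under the same assumption as above that the exit point and routing are similar.

```python
def split_rtt(client_to_proxy_ms, proxy_to_o365_ms):
    """Combine the two measured legs into an overall picture."""
    total_ms = client_to_proxy_ms + proxy_to_o365_ms
    return {
        "internal_ms": client_to_proxy_ms,
        "external_ms": proxy_to_o365_ms,
        "total_ms": total_ms,
        # Fraction of the overall RTT spent outside your environment
        "external_share": proxy_to_o365_ms / total_ms,
    }


def external_from_direct(direct_ms, client_to_proxy_ms):
    """Estimate the external leg from a direct (proxy-bypassed) reading."""
    return direct_ms - client_to_proxy_ms
```

With hypothetical readings of 5ms to the proxy and 15ms from the proxy to O365, `split_rtt(5.0, 15.0)` reports a 20ms total with three quarters of it on the external leg.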
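As a rough illustration of reading the RTT out of a capture, the sketch below works on packets that have already been exported into simple records. The `Packet` layout, the flag strings and the function name are all assumptions made for this example and do not correspond to any particular capture tool's export format.

```python
from typing import Iterable, NamedTuple, Optional


class Packet(NamedTuple):
    timestamp: float  # seconds, as exported from the capture
    src: str          # "ip:port" of the sender
    dst: str          # "ip:port" of the receiver
    flags: str        # "S" for SYN, "SA" for SYN-ACK, "A" for ACK


def handshake_rtt_ms(packets: Iterable[Packet], client: str, server: str) -> Optional[float]:
    """Return the SYN to SYN-ACK delta in ms for one client/server pair."""
    syn_time = None
    for pkt in packets:
        if pkt.flags == "S" and pkt.src == client and pkt.dst == server:
            # Keep the first SYN only; a retransmitted SYN would inflate the figure
            if syn_time is None:
                syn_time = pkt.timestamp
        elif pkt.flags == "SA" and pkt.src == server and pkt.dst == client:
            if syn_time is not None:
                return (pkt.timestamp - syn_time) * 1000.0
    return None  # no complete SYN / SYN-ACK pair found
```

When taken on the proxy, `client` would be the proxy's outbound address and port, and `server` the O365 endpoint on port 443.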

     

     

     

    If the readings clearly show that the poor RTT is outside the customer's environment, on the ISP link to Office 365, and this RTT is unexpected, the customer can engage their ISP to investigate.

    We've tested SharePoint on Office 365 to play nicely up to 300ms, and I've personally seen it work well at higher latency than this, as long as all other network settings (TCP window scaling, packet sizes, proxy performance etc.) are optimal. With Outlook the effect of latency is less pronounced as the software does a good job of masking it, but if it's too high, actions such as switching calendars/mailboxes may show a delay.

     

    So now I have my RTT, but what is a good one and what is a bad one?

    That's always the key question, but the answer is always subjective. In short, it depends on the scenario: the longer the distance, the higher the RTT will be.

    • Internal to your environment, look for <100ms, ideally much less.
    • From a UK site to an EMEA datacenter, <100ms total should be the aim, ideally much less. For example, my home connection above shows 16ms (admittedly with very little of the network equipment in the way that you'll see on a corporate network).
    • Australia <> EMEA can be done in 300ms, as a reference.
    • Verizon have a handy table of latency between various endpoints on their network at https://www.verizonenterprise.com/about/network/latency/ which gives you a good idea of the latency to expect between these points with such a carrier, and something to compare yours to.
    • Having both internal and external RTT allows us to accurately identify whether a network latency issue is inside your environment or outside.
    • It's useful to run this test as a baseline during normal operation. If issues occur, we can refer back to it and work out whether any segment has increased RTT.
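The baseline comparison in the last point can be as simple as the sketch below. The segment names, the readings in the usage note and the 1.5x tolerance are all hypothetical; pick a threshold that suits how much your own readings normally fluctuate.

```python
def segments_over_baseline(baseline_ms, current_ms, tolerance=1.5):
    """Return the segments whose current RTT exceeds baseline * tolerance.

    Both arguments map a segment name to an average RTT in milliseconds.
    Segments missing from `current_ms` are skipped.
    """
    return [
        name
        for name, base in baseline_ms.items()
        if name in current_ms and current_ms[name] > base * tolerance
    ]
```

For instance, with a baseline of 5ms internal and 15ms external, current readings of 5.5ms and 60ms would flag only the external segment, pointing the investigation outside your environment.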

     

    What about application level latency?

    With this method we're only looking at network-level latency. As the traffic to O365 is encrypted, you'll need other methods to measure application-level latency.

    Outlook:

    For Outlook, the software itself measures RPC latency, and a Microsoft colleague, Neil Johnson, has a great blog post describing how to look at this:

    https://blogs.technet.com/b/neiljohn/archive/2012/01/23/outlook-performance-troubleshooting-including-office-365.aspx

    SharePoint:

    For SharePoint, we can use a couple of third-party tools to look at the encrypted application-level requests in the clear.

    My tool of choice is the great HTTPWatch, which plugs into your browser and shows the page load times in great detail. It helps you troubleshoot which element of a page is taking a long time, clearly showing the timing between a GET request and its response, and the rendering time in the browser. I'll write a more detailed blog post on this tool when I get time.

    Another tool is Fiddler, which also does a great job of showing you inside encrypted traffic. I tend not to use it often as I find it can alter the behaviour of the receiving application. For example, if we're doing byte-range requests for a file (which allows us to pause and resume downloads), it can disable that feature and just send one GET request for the file, which changes the user behaviour I'm trying to baseline/troubleshoot.

     

    That's all for now. I hope this helps with making sure your O365 solution is connecting as quickly as possible!