Where are my packets? Analyzing a packet drop issue...

One of the most common reasons for network connectivity or performance problems is packet drop. In this blog post, I’ll be talking about analyzing a packet drop issue, please read on.

One of customers was complaining about remote SCCM agent policy updates and it was suspected a network packet drop issue. Then we were involved in to analyze the problem from networking perspective. Generally such problems might stem from the following points:

a) A problem on the source client (generally at NDIS layer or below). For additional information please see a previous blog post.

b) A problem stemming from network itself (links, firewalls, routers, proxy devices, encryption devices, switches etc). This is the most common problem point in such cases.

c) A problem on the target server (generally at NDIS layer or below). For additional information please see a previous blog post.

In packet drop issues, the most important logs are simultaneous network traces collected on the source and target systems.

You’ll find below more details about how we got to the bottom of the problem:

NETWORK TRACE ANALYSIS:

=======================

In order for a succesful network trace analysis, you need to be familiar with the technology that you’re troubleshooting. At least you should have some prior knowledge on the network activity that the related action (like how SCCM agent retrieves policies from SCCM server in general for this exampl) would generate. In this instance, after a short discussion with an SCCM colleague of mine, I realized that SCCM agent sends an HTTP POST request to the SCCM server to retrieve the policies. I analyzed the network traces in the light of this fact:

How we see the problematic session at SCCM agent side trace (10.1.1.1):

No. Time Source Destination Protocol Info

  1917 104.968750 10.1.1.1 172.16.1.1 TCP rmiactivation > http [SYN] Seq=0 Win=65535 Len=0 MSS=1460

   1918 105.000000 172.16.1.1 10.1.1.1 TCP http > rmiactivation [SYN, ACK] Seq=0 Ack=1 Win=16384 Len=0 MSS=1460

   1919 105.000000 10.1.1.1 172.16.1.1 TCP rmiactivation > http [ACK] Seq=1 Ack=1 Win=65535 [TCP CHECKSUM INCORRECT] Len=0

   1920 105.000000 10.1.1.1 172.16.1.1 HTTP CCM_POST /ccm_system/request HTTP/1.1

   1921 105.000000 10.1.1.1 172.16.1.1 HTTP Continuation or non-HTTP traffic

   1922 105.000000 10.1.1.1 172.16.1.1 HTTP Continuation or non-HTTP traffic

   1925 105.140625 172.16.1.1 10.1.1.1 TCP http > rmiactivation [ACK] Seq=1 Ack=282 Win=65254 Len=0 SLE=1742 SRE=2328

   1975 107.750000 10.1.1.1 172.16.1.1 HTTP [TCP Retransmission] Continuation or non-HTTP traffic

   2071 112.890625 10.1.1.1 172.16.1.1 HTTP [TCP Retransmission] Continuation or non-HTTP traffic

   2264 123.281250 10.1.1.1 172.16.1.1 HTTP [TCP Retransmission] Continuation or non-HTTP traffic

   2651 144.171875 10.1.1.1 172.16.1.1 HTTP [TCP Retransmission] Continuation or non-HTTP traffic

   3475 185.843750 10.1.1.1 172.16.1.1 HTTP [TCP Retransmission] Continuation or non-HTTP traffic

   4392 234.937500 172.16.1.1 10.1.1.1 TCP http > rmiactivation [RST, ACK] Seq=1 Ack=282 Win=0 Len=0

How we see the problematic session at SCCM Server side trace (172.16.1.1):

No. Time Source Destination Protocol Info

  13587 100.765625 10.1.1.1 172.16.1.1 TCP rmiactivation > http [SYN] Seq=0 Win=65535 Len=0 MSS=1460

  13588 0.000000 172.16.1.1 10.1.1.1 TCP http > rmiactivation [SYN, ACK] Seq=0 Ack=1 Win=16384 Len=0 MSS=1460

  13591 0.031250 10.1.1.1 172.16.1.1 TCP rmiactivation > http [ACK] Seq=1 Ack=1 Win=65535 Len=0

  13598 0.031250 10.1.1.1 172.16.1.1 HTTP CCM_POST /ccm_system/request HTTP/1.1

  13611 0.078125 10.1.1.1 172.16.1.1 HTTP [TCP Previous segment lost] Continuation or non-HTTP traffic

  13612 0.000000 172.16.1.1 10.1.1.1 TCP http > rmiactivation [ACK] Seq=1 Ack=282 Win=65254 [TCP CHECKSUM INCORRECT] Len=0 SLE=1742 SRE=2328

  30509 129.812500 172.16.1.1 10.1.1.1 TCP http > rmiactivation [RST, ACK] Seq=1 Ack=282 Win=0 Len=0

 

Explanation on color coding:

a) GREEN packets are the ones that we both see at client and server side traces. In other words, those are packets that were either sent by the client and successfully received by the server or sent by the server and successfully received by the client.

b) RED packets are the ones that were sent by the client but not received by the server. To further provide details on this part:

- The frame #1921 that we see at the client side trace (which is a continuation packet for the HTTP POST request) is not visible at the server side trace

- The frames #1975, 2071, 2264, 2651 and 3475 are retransmissions of frame #1921. We cannot even see those 5 retransmissions of the same original TCP segment at the server side trace which means most likely all those retransmission segments were lost on the way to the server even though there’s still a minor chance that they might have been physically received by the server but dropped by an NDIS level driver (see this post for more details)

The server side forcefully closes the session approximately after 2 minutes. (Client side frame #4392 and server side frame #30509)

RESULTS:

========

1) The simultaneous network trace analysis better indicated that this was a packet drop issue.

2) The problem most likely stems from one of these points:

a) An NDIS layer or below problem at the SCCM agent side (like NIC/teaming driver, NDIS layer firewalls, other kind of filtering drivers)

b) A network related problem (link or network device problem like router/switch/firewall/proxy issue)

c) An NDIS layer or below problem at the SCCM server side (like NIC/teaming driver, NDIS layer firewalls, other kind of filtering drivers)

3) In such scenarios we may consider doing the following at the client and server side:

a) Updating NIC/teaming drivers

b) Updating 3rd party filter drivers (AV/Firewall/Host IDS software/other kind of filtering devices) or temporarily uninstalling them for the duration of troubleshooting

4) Most of the time such problems stem from network itself. And you may consider taking the following actions on that:

a) Checking cabling/switch port

b) Checking port speed/duplex settings (you may try manually setting to 100 MB/ FD for example)

c) Enabling logging on Router/Switch/firewall/proxy or similar devices and check interface statistics or check problematic sessions to see if there are any packet drops

d) Alternatively you may consider collecting network traces at more than 2 points (sending & receiving ends). For example, 4 network traces could be collected: at the source, at the target and at two intermediate points. So that you can follow the session and may isolate the problem to a certain part of your network.

For your information, the specific problem I mentioned in this blog post was stemming from a router running in between the SCCM agent and server.

Hope this helps

Thanks,

Murat