Network Capture Best Practices

Hi Diddly Doodly readers. Michael Rendino here again with a follow-up to my “Basic Network Capture Methods” blog, this time to give some best practices on network capture collection when troubleshooting. As you may have guessed, one of my favorite tools, due to my years in networking support, is the network capture. It can provide a plethora of information about exactly what was transpiring when systems were trying (and possibly failing) to communicate. I don’t really concern myself with the tool used, be it Network Monitor, Wireshark, Message Analyzer, Sniffer or any other tool. My biggest point to stress is what I mentioned previously: a capture shows the communication on the network.

The important point to take from that is that a trace collected from a single point doesn’t provide the full picture. While I will take a single-sided trace over no trace at all, the best scenario is to get one from every point involved in the transaction. With something like SharePoint, this could be a number of machines: the client running the browser, the web front end (WFE), the SQL back end and then multiple domain controllers. It sounds like a daunting task to get the captures from every location, but I would rather have too much data than too little. To add to that point, please don’t apply a capture filter unless absolutely necessary! By only capturing data between two select points, you could be omitting a critical piece of information.

Following is a perfect example of both of these points. I was engaged to troubleshoot an issue that was described as a problem with a SharePoint web front end talking to the SQL server. I got the captures from the two servers, which fortunately were not filtered. If I had gone just on the problem description, I would typically have opened the capture from the SQL box and applied the filter ipv4.address==<Web Front End IP> (ipv4 because I was using Network Monitor; it would be ip.addr== for you Wireshark fans) to locate the traffic from that box. In fact, I did that to start and saw that all traffic to and from the WFE appeared completely normal.

9873    9:37:54 AM     WFE    49346 (0xC0C2)    SQL    1433 (0x599)    TCP    TCP:Flags=...A...., PayloadLen=0, Seq=3198716784, Ack=438404416, Win=510

10093    9:37:55 AM     WFE    49346 (0xC0C2)    SQL    1433 (0x599)    TDS    TDS:RPCRequest, SPID = 0, PacketID = 1, Flags=...AP..., PayloadLen=201, Seq=3198716784 - 3198716985, Ack=438404416, Win=510

10094    9:37:55 AM     SQL    1433 (0x599)    WFE    49346 (0xC0C2)    TDS    TDS:Response, SPID = 117, PacketID = 1, Flags=...AP..., SrcPort=1433, DstPort=49346, PayloadLen=61, Seq=438404416 - 438404477, Ack=3198716985, Win=255

10188    9:37:55 AM     WFE    49346 (0xC0C2)    SQL    1433 (0x599)    TCP    TCP:Flags=...A...., PayloadLen=0, Seq=3198716985, Ack=438404477, Win=509

To me, it looked like clean SQL traffic, moving quickly and without errors. All good, so I needed to look elsewhere. To move on, it’s important to know what else happens on the network when using SharePoint. Other than the SQL traffic, the WFE also has to communicate with the client, perform name resolution and communicate with a domain controller. I first applied the filter “dns or nbtns” (again, this was Network Monitor, although I typically use multiple tools for my analysis) and again, everything looked “clean.” I then moved on to examine the authentication traffic. I applied the filter “Kerberosv5” and lo and behold, the issue jumped right out at me. Appearing over and over in the trace was this:

97    9:38:46 AM     0.0000000    WFE    52882 (0xCE92)    DC    88 (0x58)    TCP    TCP:Flags=......S., SrcPort=52882, DstPort=Kerberos(88), PayloadLen=0, Seq=2542638417, Ack=0, Win=8192 ( Negotiating scale factor 0x8 ) = 8192

98    9:38:46 AM     0.0004965    DC    88 (0x58)    WFE    52882 (0xCE92)    TCP    TCP:Flags=...A..S., SrcPort=Kerberos(88), DstPort=52882, PayloadLen=0, Seq=4098142762, Ack=2542638418, Win=65535 ( Negotiated scale factor 0x1 ) = 131070

99    9:38:46 AM     0.0000200    WFE    52882 (0xCE92)    DC    88 (0x58)    TCP    TCP:Flags=...A...., SrcPort=52882, DstPort=Kerberos(88), PayloadLen=0, Seq=2542638418, Ack=4098142763, Win=513 (scale factor 0x8) = 131328

100    9:38:46 AM     0.0000599    WFE    52882 (0xCE92)    DC    88 (0x58)    KerberosV5    KerberosV5:AS Request Cname: farmsvc Realm: CONTOSO.COM Sname: krbtgt/CONTOSO.COM

102    9:38:46 AM     0.0022497    DC    88 (0x58)    WFE    52882 (0xCE92)    KerberosV5    KerberosV5:KRB_ERROR - KDC_ERR_CLIENT_REVOKED (18)

KDC_ERR_CLIENT_REVOKED means that the client’s credentials have been revoked, which usually points to an account that is locked out, disabled or expired. We checked Active Directory and sure enough, the account used for the WFE service (farmsvc in the AS Request above) was locked. We then learned that they had recently changed the password for that service account, which resulted in said lockout. One thing to note: in Network Monitor (and you can do this with Wireshark as well) I had all Kerberos traffic highlighted in green, so it stood out quickly.
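If you have the frame summaries exported as text, the same "what keeps repeating?" check can be scripted. The following is only a minimal Python sketch, assuming summary lines shaped like the ones shown above; the function name count_kerberos_errors and the regular expression are my own illustration, not part of Network Monitor or Wireshark.

```python
import re
from collections import Counter

# Matches summaries such as "KerberosV5:KRB_ERROR - KDC_ERR_CLIENT_REVOKED (18)".
# The separator may be a hyphen or a typographic dash, depending on the export.
KRB_ERROR_RE = re.compile(r"KRB_ERROR\s*[-\u2013\u2014]\s*([A-Z0-9_]+)\s*\((\d+)\)")


def count_kerberos_errors(lines):
    """Return a Counter keyed by (error_name, error_code) for each KRB_ERROR summary."""
    counts = Counter()
    for line in lines:
        match = KRB_ERROR_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


if __name__ == "__main__":
    # Two summaries taken from the trace above (columns shortened for readability).
    sample = [
        "100  WFE  DC  KerberosV5  KerberosV5:AS Request Cname: farmsvc Realm: CONTOSO.COM",
        "102  DC  WFE  KerberosV5  KerberosV5:KRB_ERROR – KDC_ERR_CLIENT_REVOKED (18)",
    ]
    for (name, code), hits in count_kerberos_errors(sample).most_common():
        print(f"{name} ({code}): {hits} occurrence(s)")
```

An error that shows up over and over, like the one in this case, floats straight to the top of that list.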

So what did we learn? We know that if the trace had just been taken from the SQL server, I wouldn’t have found the issue. We also know that if the WFE trace had been filtered to just include SQL traffic or SQL and client traffic, I wouldn’t have found the issue. Remember, more is better! Even if I get gigabytes of captures, I can always parse them or break them into smaller, bite-sized (no pun intended) chunks for faster filtering. Happy tracing!
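As for breaking a huge capture into smaller pieces, here is a rough Python sketch of the idea. It assumes the capture has been saved in the classic libpcap (.pcap) format, not Network Monitor's native .cap or pcapng, and the function name split_pcap is purely illustrative. Wireshark's editcap utility can split by packet count too (its -c option); this sketch just shows what is going on under the hood.

```python
import struct

PCAP_GLOBAL_HEADER_LEN = 24   # file header, repeated at the start of every chunk
PCAP_RECORD_HEADER_LEN = 16   # ts_sec, ts_frac, incl_len, orig_len


def split_pcap(data, packets_per_chunk):
    """Split the bytes of a classic .pcap file into a list of smaller .pcap byte strings."""
    if packets_per_chunk < 1:
        raise ValueError("packets_per_chunk must be at least 1")
    if len(data) < PCAP_GLOBAL_HEADER_LEN:
        raise ValueError("truncated pcap global header")
    global_header = data[:PCAP_GLOBAL_HEADER_LEN]

    # The magic number reveals the byte order used for the record headers
    # (microsecond and nanosecond timestamp variants are both accepted).
    magic = global_header[:4]
    if magic in (b"\xd4\xc3\xb2\xa1", b"\x4d\x3c\xb2\xa1"):
        endian = "<"
    elif magic in (b"\xa1\xb2\xc3\xd4", b"\xa1\xb2\x3c\x4d"):
        endian = ">"
    else:
        raise ValueError("not a classic pcap file")

    chunks = []
    current = [global_header]
    count = 0
    offset = PCAP_GLOBAL_HEADER_LEN
    while offset + PCAP_RECORD_HEADER_LEN <= len(data):
        header = data[offset:offset + PCAP_RECORD_HEADER_LEN]
        _ts_sec, _ts_frac, incl_len, _orig_len = struct.unpack(endian + "IIII", header)
        end = offset + PCAP_RECORD_HEADER_LEN + incl_len
        if end > len(data):
            break  # final packet is truncated; stop rather than emit a partial record
        current.append(data[offset:end])
        count += 1
        offset = end
        if count == packets_per_chunk:
            chunks.append(b"".join(current))
            current = [global_header]
            count = 0
    if count:
        chunks.append(b"".join(current))
    return chunks
```

Each chunk starts with a copy of the original file header, so it opens as a normal capture on its own. For multi-gigabyte files you would want to stream from disk instead of holding everything in memory; the in-memory version just keeps the sketch short.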