Delayed Write Failure Trace Study

In this "Trace Study”, we'll look at a case where the customer is seeing delayed write failures logged in the event log. Delayed write failures are reported when a file being written over the network is inaccessible for a time. Based on a trace taken at the same time as the error was logged, we will determine the cause.

Zooming In

Since we know the file name reported in the event log error, we'll use that name to find where in the trace we are accessing this file. We start by building a filter that uses a property we set for any SMB frame which references a file.

Property.SMBFileName.Contains("dir.txt")

This displays a bunch of frames that reference the "dir.txt" file, but this does not represent the entire conversation. To get the entire conversation, right click any frame and select Find Conversation->SMB. Then remove your display filter and now you will see all the frames associated with this particular SMB conversation. An SMB conversation is usually all operations involving a single file.

The next step is to look for an error of some kind. We do this by creating a color filter (https://blogs.technet.com/netmon/archive/2007/06/28/color-filtering-error-messages.aspx) to make SMB error frames stand out. We'll use this color filter:

(smb.DOSError.Error != 0 AND smb.DOSError.Error != 22)

OR

(smb.NTStatus.Code != 0 && smb.NTStatus.Code!= 22)

I made my color filter have a red background and a white foreground, a color scheme I use to identify errors.

With this color filter enabled, I simply scroll through the trace looking for a red frame to stand out. As they pop up you'll have to look at the specific error and see if it applies. In my case I see a STATUS_NETWORK_SESSION_EXPIRED. Following this traffic I see a Session Setup, and then continued SMB Writes before and after.

 

SMB:C; Write Andx, FID = 0x400C (filesdir.txt@#1644), 1 bytes at Offset 32780

SMB:R; Write Andx, FID = 0x400C (filesdir.txt@#1644), 1 bytes

SMB:C; Transact2, Query File Info, Query File Standard Info, FID = 0x400C (filesdir.txt@#1644)

SMB:R; Transact2, Query File Info, FID = 0x400C (filesdir.txt@#1644) - NT Status: System - Error, Code = (860) STATUS_NETWORK_SESSION_EXPIRED

SMB:C; Session Setup Andx, Krb5ApReq (0x100)

SMB:R; Session Setup Andx, Krb5ApRep (0x200)

SMB:C; Write Andx, FID = 0x400C (filesdir.txt@#1644), 1 bytes at Offset 32780

SMB:R; Write Andx, FID = 0x400C (filesdir.txt@#1644), 1 bytes

Obviously this is not normal traffic for SMB. Session Setups occur when you first make a connection to a share, but not in the middle of a file transfer. What caused this session to expire?

Zooming Out

When we used the "Find Conversation->SMB" above, we narrowed down the traffic to just one SMB conversation. But something happened on another network conversation in between our Session Setup and the last error. To figure out where to go next, we'll have to zoom out and look at the rest of the traffic around the error in question. I'll select the error frame to keep my context and then click on "All Traffic" at the top of the conversation tree to remove the SMB conversation filter. When I do, I see the following traffic:

 

SMB:C; Write Andx, FID = 0x400C (filesdir.txt@#1644), 1 bytes at Offset 32780

SMB:R; Write Andx, FID = 0x400C (filesdir.txt@#1644), 1 bytes

SMB:C; Transact2, Query File Info, Query File Standard Info, FID = 0x400C (filesdir.txt@#1644)

SMB:R; Transact2, Query File Info, FID = 0x400C (filesdir.txt@#1644) - NT Status: System - Error, Code = (860) STATUS_NETWORK_SESSION_EXPIRED

KerberosV5:TGS Request Realm: CORP1.LOCAL Sname: cifs/c01e3n01ads.corp1.local

TCP:Flags=...A...., SrcPort=1162, DstPort=Microsoft-DS(445), PayloadLen=0, Seq=1084491174, Ack=239237167, Win=4163

KerberosV5:TGS Response Cname: Kevin

KerberosV5:TGS Request Realm: CORP1.LOCAL Sname: krbtgt/CORP1.LOCAL

KerberosV5:TGS Response Cname: Kevin

KerberosV5:AS Request Cname: Kevin Realm: CORP1.LOCAL Sname: krbtgt/CORP1.LOCAL

KerberosV5:AS Response Ticket[Realm: CORP1.LOCAL, Sname: krbtgt/CORP1.LOCAL]

KerberosV5:TGS Request Realm: CORP1.LOCAL Sname: krbtgt/CORP1.LOCAL

KerberosV5:TGS Response Cname: Kevin

SMB:C; Session Setup Andx, Krb5ApReq (0x100)

Kerberos Ticket Expired

Once the UI has completed updating the frame summary, my current selection remains on the SMB Error frame which keeps my place. But now some new Kerberos frames show up. This information together with the "Session Expired" message tells us the whole story.

The expired SMB session means we need to re-authenticate. In this case the Kerberos ticket expired and a new ticket had to be issued to us by the server. If we had the original setup traffic, we would be able to see the initial Kerberos ticket with its expiration time. Once this Kerberos negotiation completes, the SMB session is reset using the new Kerberos ticket and the SMB traffic continues where it left off. This authentication interruption in the traffic is what caused our "Delayed Write Failure" event log error message in the first place.

Getting to the Bottom of Things

In this case the Delay Write failure is easily explained. But there are many ways a delayed write failure can be triggered. You can use these same steps to zoom in and zoom out of a trace to understand this type of problem. Next time you see a Delay Write failure in your event log, I hope you can use these steps to figure out why it occurred.