When Idle Threads Bugcheck (Or, when APCs attack)

Step in the architect, John Sheehan.... John is one of the architects who came up with some of the SoftGrid features and has now started to dump his mind into a blog! Fantastic, I hear you all shout! John is an avid presenter and it should be great to have him blogging! He wrote a great article the other day about a bug they found in the File System feature of the product. I shall be adding a little link and am sure to be referring back to John in the future! John was a great source of information when I met up with him in Boston, and the ideas this chap has are amazing! Great to have him on our team......

-----------------------------------------------------------

REQUIRED DISCLAIMER: I’m the architect for the Microsoft Application Virtualization product (formerly known as SoftGrid). However, all information, opinions, anecdotes, facts, pseudo-facts, exaggerations and flat-out lies represent my personal view of the world and not those necessarily held by my team or Microsoft. You’ve been warned.

The Situation

Several weeks ago, we were trying to ship the Microsoft Application Virtualization Beta when we encountered a very strange bugcheck in our File System. It was one we had never seen before and the initial investigation didn’t look very promising. Now, as anyone in software knows all too well, software products love to wait for these special moments to decide to blow up in new and interesting ways. I personally believe that, just like disobedient children, they hate their parents and want to make them miserable. Unfortunately, you can’t put your product in time-out. And, just as with children, beatings don’t really seem to accomplish much. :)

Initial Analysis

When investigating an issue like this, you first start by trying to figure out what action caused the system to bugcheck. You can then use this information to help reconstruct the situation that led to the crash. Normally, the first place you start is with the ‘!analyze –v’ command. It does a basic analysis of the dump file and gives you its best guess at what caused the crash. In this particular case, !analyze pointed its finger at our File System driver. But, not much else made sense. Here is the stack that was output:

ChildEBP RetAddr
a9750c80 806ffa16 nt!KiTrap0E+0x238
a9750d5c a9338ae2 hal!KeAcquireInStackQueuedSpinLock+0x26
a9750d6c a9339c79 sftfsXP!SoftFSProcessRequest+0x6c
a9750dac 80574128 sftfsXP!SoftFSQueueSyncWorker+0x39
a9750ddc 804ec781 nt!PspSystemThreadStartup+0x34
00000000 00000000 nt!KiThreadStartup+0x16

In this particular situation, the stack doesn’t seem to make a whole lot of sense. It looks as if our File System is calling hal!KeAcquireInStackQueuedSpinLock. Now, sometimes functions in the DDK are really macros for some internal function that you’ve never heard of. But, when we looked through the code in this function, this did not seem to be the case. This was a very simple dispatch function that did very little. So, even though this was a release build and, therefore, was optimized, it still didn’t seem likely that the code could be calling hal!KeAcquireInStackQueuedSpinLock. Could we be facing stack corruption? Only time would tell.

NOTE: At this point, lots of analysis of the dump file occurred. Since none of it pointed to the culprit, I will omit the details to prevent this article from turning into a novel (which it is already in danger of becoming). I will also omit lots of other details that were uncovered during this long debugging session.

At this point, we figured we would try to catch the corruption in the act, so that we would have a live system to debug. Here is what we learned:

1. It only happened on a single test system, which happened to be running Windows XP

2. It did not happen with a checked build of the product

3. It did not happen if you had kernel mode debugging enabled

4. It did not happen on an identical hardware setup that was built using the same OS image

5. It only happened during the first launch of a package

6. If you rebooted after install, it did not occur

So, we’ve hit software’s equivalent of the Observer effect. Basically, the Observer effect states that the act of observing a phenomenon changes the phenomenon being observed. This usually points to some kind of timing issue that only occurs when the system is running at full speed. Given all of the other caveats around the issue, this certainly appears to be a timing issue. The fact that it happens on one system and not another (practically identical) system points to some kind of environmental issue. Unfortunately, it is impossible to tell at this point whether it is a hardware or software difference. In my experience, it is almost always a software difference when you are dealing with identically configured systems (software just varies so much more readily than hardware). So, it looks like we’ll have to continue our investigation using the crash dump that we started with.

At this point, reports started coming in from some of our TAP customers who had started playing with a Beta build. They were seeing the same issue. In every case, it happened on XP. Vista did not seem to be vulnerable to this particular race.

Debugging the Issue

Even though the stack didn’t make complete sense, it did give up some clues as to the scenario that was causing the bugcheck. The thread in question was a kernel thread that was part of a pool of worker threads that our File System uses to perform any deferred work that it needs to accomplish. This actually complicates things, since this one thread could have been performing any one of over a dozen different operations. However, all was not lost. We knew that this bugcheck occurred during the very first launch of a virtual application. We also knew that this worker pool was primarily used during initialization / shutdown of virtual application volumes. Since these requests all come from our NT service (named sftlist.exe), this is a good place to look for the initial request that this thread was processing. After all, it isn’t likely that this thread decided to do work all on its own. A quick scan of the threads in our NT service found the following interesting stack (using the command ‘!process 0 F sftlist.exe’):

ChildEBP RetAddr
a9fe5904 804e1bd2 nt!KiSwapContext+0x2f
a9fe5910 804e1c1e nt!KiSwapThread+0x8a
a9fe5938 a9339042 nt!KeWaitForSingleObject+0x1c2
a9fe5964 a933985e sftfsXP!SoftFSInsertQueueItem+0x12a
a9fe5984 a93149bb sftfsXP!SoftFSVolMngrGlobalPackageInitRequest+0x88
a9fe59e8 a9340b0b sftfsXP!SoftFSConnectSession+0x33d

Here, a thread has just posted a job to our thread pool using sftfsXP!SoftFSInsertQueueItem and is awaiting the response. Now, this does not absolutely prove that the thread in question was performing this task. But, it does strongly point in that direction. Ideally, we would be able to correlate the request objects in both threads. But, in this case, the other thread’s stack was not intact enough to perform this analysis.
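To make the shape of this hand-off concrete, here is a rough sketch of how a caller might queue a job and then block until the worker reports back. All of the names and structures below are hypothetical stand-ins of mine, not our actual code; the point is simply the insert-signal-wait sequence that produces the KeWaitForSingleObject frame in the stack above.

#include <ntddk.h>

// Hypothetical job shape, purely to illustrate the hand-off.
typedef struct _JOB {
    LIST_ENTRY ListEntry;
    KEVENT     Done;         // signaled by the worker once the job is finished
    ULONG      RequestType;
    NTSTATUS   Result;
} JOB, *PJOB;

NTSTATUS InsertJobAndWait(PKSPIN_LOCK QueueLock, PLIST_ENTRY JobList,
                          PKEVENT QueueEvent, PJOB Job)
{
    KIRQL oldIrql;

    KeInitializeEvent(&Job->Done, NotificationEvent, FALSE);

    // Queue the job under the pool's lock.
    KeAcquireSpinLock(QueueLock, &oldIrql);
    InsertTailList(JobList, &Job->ListEntry);
    KeReleaseSpinLock(QueueLock, oldIrql);

    // Wake a sleeping worker thread...
    KeSetEvent(QueueEvent, IO_NO_INCREMENT, FALSE);

    // ...and block until it signals completion of this particular job.
    KeWaitForSingleObject(&Job->Done, Executive, KernelMode, FALSE, NULL);
    return Job->Result;
}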

Unfortunately, this thread pool also does some periodic work that could be the culprit. (We actually went down that path, at one point.) But, given the strong correlation between the two events, we proceeded with the assumption that these were related. This is one of the techniques you need to use when doing post-mortem debugging. Sometimes you just have to create a hypothesis and then follow it to its logical conclusion. The key to being a good debugger is to come up with solid hypotheses that can be proven / disproven quickly. If you can figure out you’re wrong quickly, you can move on to another hypothesis. Rinse…repeat.

If we look at all of the work that this type of request does, it is a little disheartening. This is the most complicated request that this worker thread can perform. The code path represents several thousand lines (if you follow all of the different potential branches). So, code review does not seem to be the quickest route to finding the issue. Still, it is good to at least familiarize yourself with the code to help inform the various hypotheses you will come up with during the debugging process. This particular code did the following operations (this is an incomplete list):

1. Creates the global package volume files, if they do not already exist (there are two)

2. Ensures there is space in the cache for the package, pre-allocating if necessary

3. Registers the global package volume in the file system cache structures

4. Creates the temporary working versions of the volumes, copying the existing ones, if necessary

5. Initializes the volumes with package data, if they have not yet been initialized

Wow. So, there is a lot going on in this one request. Each of these operations represents many sub-operations, any one of which could be the culprit of the bugcheck.

Logging to the Rescue

Since we have a system where we can reproduce the issue, we decided to add some logging to see what requests the worker pool was handling during the reproduction run. Choosing what information to log is the key to success. You need to log enough information to further guide your debugging. However, if you log too much, you can change the timing of the system sufficiently to prevent the reproduction from being successful (another form of the Observer effect). For this scenario, we decided to log each job that was submitted to the worker pool (a rough sketch of these trace points appears after the list), including:

1. The type of request being handled

2. Its basic parameters

3. Which thread it was running on

4. The return code of the operation (after it was done being handled)

We also logged some basic information about each thread state change, such as when it went to sleep to wait for new requests, when it was woken up to do some work, etc. One very appealing aspect of this approach is that it should tell us which type of request was in process when the bugcheck occurred.
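As a concrete illustration of those trace points, the per-job logging amounted to something like the sketch below. The real driver logs through WMI tracing rather than DbgPrintEx, and every name here is a stand-in; the worker sleep/wake messages were similar one-liners.

#include <ntddk.h>

// Illustrative stand-ins for the trace points: one message when a job is
// queued and one when it has been handled, tagged with the current thread.
VOID LogJobQueued(ULONG RequestType, PVOID Parameters)
{
    DbgPrintEx(DPFLTR_IHVDRIVER_ID, DPFLTR_INFO_LEVEL,
               "sftfs: job type=%lu params=%p queued from thread %p\n",
               RequestType, Parameters, PsGetCurrentThreadId());
}

VOID LogJobDone(ULONG RequestType, NTSTATUS Result)
{
    DbgPrintEx(DPFLTR_IHVDRIVER_ID, DPFLTR_INFO_LEVEL,
               "sftfs: job type=%lu handled on thread %p, status=0x%08X\n",
               RequestType, PsGetCurrentThreadId(), Result);
}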

With the logging in place, we reproduced the issue. To our utter amazement, there were no requests in process when the bugcheck occurred! At first, we didn’t believe it. Since we use WMI logging in our driver, we assumed that the log messages for the running thread were just in the memory buffer and hadn’t been written to disk yet. To check this, we used the ‘!wmitrace.logdump’ command to dump the contents of the in-memory buffer. To our surprise, there were no other logging events in our buffer. All of the requests appeared to have been completed. In fact, the thread that bugchecked logged a message a full minute before the crash stating that it was going to wait on the job queue for the next job. The next line of code after the log message was:

KeWaitForSingleObject(queueEvent, Executive, KernelMode, FALSE, 0);  // non-alertable, kernel-mode wait, no timeout

The line after this line was another log message stating that the thread had woken up. So, to our disbelief, the thread in question had gone to sleep on an event and, a minute after sleeping, had bugchecked without ever waking up. To double-check, we verified that no jobs had been put in the queue since this last job succeeded. Sure enough, this was the last job put in the queue. So, this thread really did bugcheck a minute after going to sleep without ever being woken up. What was going on here?!!!

Making Sense of the Results

The only way that a sleeping thread can execute code without being woken up is if an Asynchronous Procedure Call (APC) is posted to it. However, if you look at the parameters that were passed to the KeWaitForSingleObject call, you will see that the Alertable parameter was set to FALSE. So, we shouldn’t be woken up, right? We’ll soon see.

If we go back to the original stack that we saw at the beginning of the investigation, we can try to gather some more information. Debuggers (e.g. kd, windbg) do their best to reconstruct the thread’s stack based on what information is available. Since they are working with imperfect information, they don’t always get it right. If you are concerned that the stack being displayed is not correct, you can attempt to walk it yourself to gather more information. A great command for this is the ‘kd’ command. ‘Kd’ outputs the raw stack with each DWORD on a separate line. For each line, it attempts to locate any symbol information for the location, effectively doing an ‘ln’ on each DWORD. This is a great way to get detailed information about the stack.

Here is the output of ‘kd’ for the dump in question. I have removed all of the lines that did not resolve to a particular symbol to make it easier to read.

a9750c84 806ffa16 hal!KeAcquireInStackQueuedSpinLock+0x26
a9750c90 804f2bb5 nt!IopCompleteRequest+0x32f
a9750cac a9330d77 sftfsXP!SoftFSReleaseSoftVolume+0x33
a9750cd0 804e0030 nt!VdmFixEspEbp+0x60
a9750ce8 806ffa16 hal!KeAcquireInStackQueuedSpinLock+0x26
a9750cf4 804f2dc4 nt!KiDeliverApc+0xc2
a9750cfc 806ff410 hal!KfLowerIrql
a9750d18 804f2a72 nt!IopCompleteRequest
a9750d30 804f2df4 nt!KiSwapThread+0xa8
a9750d48 804e1c1e nt!KeWaitForSingleObject+0x1c2
a9750d60 a9338ae2 sftfsXP!SoftFSProcessRequest+0x6c
a9750d70 a9339c79 sftfsXP!SoftFSQueueSyncWorker+0x39
a9750da0 a9346b72 sftfsXP!except_handler3
a9750da4 a9373e70 sftfsXP!__safe_se_handler_table+0x910
a9750db0 80574128 nt!PspSystemThreadStartup+0x34

As we can see from this listing, it looks like an APC is being delivered to our thread, which is currently sleeping in nt!KeWaitForSingleObject. This lines up with the rest of the data we have gathered so far. (It also answers the earlier question: an I/O completion for a kernel-mode request is delivered via a special kernel APC, and special kernel APCs are delivered even to a non-alertable wait.) So, the next question is: where is this APC coming from?

Finding the APC

From the stack, it appears that a Completion Request is being delivered to this thread. That’s strange. All IRPs that we send to other drivers during processing, such as to read / write a volume file, are done by the same routine, which waits for IRPs to be completed before continuing processing. So, it doesn’t seem like we should be getting any APCs on this thread.
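For reference, that common routine follows the standard "build an IRP, send it, wait for it" pattern. Here is a simplified sketch of what such a routine typically looks like; the function name and the specific builder call are my illustration of the pattern, not necessarily what our code does.

#include <ntddk.h>

// Illustrative only: send a write IRP to a lower device and do not return
// until the I/O has actually completed.
NTSTATUS SendWriteAndWait(PDEVICE_OBJECT Device, PVOID Buffer,
                          ULONG Length, PLARGE_INTEGER Offset)
{
    KEVENT          event;
    IO_STATUS_BLOCK iosb;
    PIRP            irp;
    NTSTATUS        status;

    KeInitializeEvent(&event, NotificationEvent, FALSE);

    irp = IoBuildSynchronousFsdRequest(IRP_MJ_WRITE, Device, Buffer, Length,
                                       Offset, &event, &iosb);
    if (irp == NULL) {
        return STATUS_INSUFFICIENT_RESOURCES;
    }

    status = IoCallDriver(Device, irp);
    if (status == STATUS_PENDING) {
        // The lower driver pended the request; wait for it to signal our event.
        KeWaitForSingleObject(&event, Executive, KernelMode, FALSE, NULL);
        status = iosb.Status;   // the final status of the I/O itself
    }
    return status;
}

Because the event and the IO_STATUS_BLOCK are not abandoned until the wait returns, their stack storage stays valid for the entire lifetime of the request. Keep that property in mind; it is exactly what the code we are about to look at turned out to lack.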

Another interesting data point was that this had only started happening fairly recently (on the order of months). We would have liked to do a binary search of mainline builds to determine which build introduced the issue. Unfortunately, other issues were blocking this test during the period of time when this issue was introduced, so we could not narrow it down that way. However, we knew the changes made to our File System during this time, which vastly reduced the amount of code that would need to be reviewed. Therefore, we went back and looked at the changes. Most of them seemed pretty straightforward. However, there was one change that looked a little different from the rest.

Tracking Down the Culprit

As I mentioned previously, our File System creates an IRP when it wants to read / write from a volume file on disk. This common code has been around for years and seems to work fine. It ensures that the request is complete before it returns to the caller. While doing our targeted code review, we found one place where this common function was not being used. In this one case, ZwWriteFile was being used to write to the file. This function was writing a single NUL byte past the current end of the file in order to extend its length. Now, upon first examination, nothing seemed strange. Here is the basic shape of the call that was made:

IO_STATUS_BLOCK iosb;
status = ZwWriteFile(file, NULL, NULL, NULL, &iosb, buffer, length, offset, 0);
if (NT_ERROR(status)) … // handle error

This particular call sets all of the APC parameters to NULL, since it does not want to handle this call asynchronously. The code expects that the call will complete before returning. (This assumption is the heart of the problem.) Unfortunately, the code used the NT_ERROR macro to determine if the ZwWriteFile call succeeded. Although not completely clear in the documentation, it turns out that ZwWriteFile can return STATUS_PENDING if the call has not yet completed. Since STATUS_PENDING is a success code, the caller missed the fact that the call hadn’t completed yet. Another issue with this code is that the return code of ZwWriteFile is not the final status of the file operation. To determine whether a file operation succeeded, you need to check the Status member of the IO_STATUS_BLOCK structure. Even if the code had detected that the request was still pending, there was no way for it to determine when the request finished, since it did not pass in an Event that the underlying driver could signal upon completion.
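For illustration, here is roughly what correct handling of the asynchronous case would have required: give the call an event to signal, wait if it pends, and take the real result from the IO_STATUS_BLOCK. This is only a sketch (it reuses the variables from the snippet above, shows the event creation, and omits cleanup and most error handling); it is not the fix we ultimately shipped.

HANDLE          eventHandle;
IO_STATUS_BLOCK iosb;
NTSTATUS        status;

// An event the I/O manager can signal when the write finally completes.
status = ZwCreateEvent(&eventHandle, EVENT_ALL_ACCESS, NULL,
                       NotificationEvent, FALSE);
if (!NT_SUCCESS(status)) … // handle error

status = ZwWriteFile(file, eventHandle, NULL, NULL, &iosb,
                     buffer, length, offset, NULL);
if (status == STATUS_PENDING) {
    // The file system queued the write; block until it signals the event.
    ZwWaitForSingleObject(eventHandle, FALSE, NULL);
    status = iosb.Status;   // the true result of the write operation
}
if (!NT_SUCCESS(status)) … // handle error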

I found it a little strange that the FS was willing to pend the request, even though there was no way for it to communicate the completion of the request back to the caller (since all parameters that could be used for this purpose were NULL). But, I guess users of the NTAPI are supposed to know what they’re doing. In particular, the file handle being used to make this write call was not opened for synchronous operations (i.e. none of the FILE_SYNCHRONOUS_IO_* flags were set in the ZwCreateFile call). Therefore, the user of this handle had specifically stated that they were willing to handle asynchronous operations.

There is one other piece of the puzzle that we need to understand. What was the completion routine doing that was causing the bugcheck? The answer is right in the code that calls ZwWriteFile. Can you spot it?

Final Cause Analysis

In the final analysis, it turns out that we were running into a form of stack corruption, albeit not the one we originally thought we were. The IO_STATUS_BLOCK that is handed into ZwWriteFile is used by the underlying file system driver to save the status of the write operation. The IO_STATUS_BLOCK handed in by the code above was on the stack. Once this function returns, this stack location is no longer valid. The stack location will eventually get reused by the next function call that extends this far down the stack. It turns out that the IoCompletion routine of the file system driver was writing its final status onto the stack at a location that was now being used for something else (in this case, an In-Stack Queued SpinLock). This corrupted the stack and caused the bugcheck. Once you figure out what is happening, it makes so much sense.
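Spelled out in code, the hazard is purely one of lifetimes. The function name below is hypothetical, but the shape matches the snippet we saw earlier:

#include <ntddk.h>

NTSTATUS ExtendFileByOneByte(HANDLE file, PVOID buffer, ULONG length,
                             PLARGE_INTEGER offset)
{
    IO_STATUS_BLOCK iosb;   // lives in this stack frame, and only this frame

    NTSTATUS status = ZwWriteFile(file, NULL, NULL, NULL, &iosb,
                                  buffer, length, offset, NULL);

    // STATUS_PENDING is a success code, so NT_ERROR() lets it sail through...
    if (NT_ERROR(status)) {
        return status;
    }
    return STATUS_SUCCESS;
    // ...and the frame unwinds here. When the write completes a little later,
    // the completion path writes the final status through &iosb, which by now
    // points into whatever has reused this region of the stack. In our dump,
    // that was an in-stack queued spin lock, and the bugcheck followed.
}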

Conclusion

Well, at least we now know what is going on. While implementing a new feature that pre-allocated cache space for a virtual application, we triggered a situation where, under certain conditions, the file extension request could not be finished synchronously and was being pended. The code in question was not ready to handle this situation. So, it ended up assuming that the operation was complete and went back to sleep. When the IoCompletion routine was later triggered for the request, it attempted to write the status of the write operation onto a stack location that was no longer valid, causing stack corruption.

The fix was easy. All we had to do was call the same write function that we use everywhere else in our code. This was sufficient to prevent the situation from occurring. The one nagging question was why we didn’t see this issue on other test systems and why customers were able to hit it so quickly.

After a little more digging, we believe we discovered why the two almost-identical systems had such different behavior. It turns out that the system that experienced the issue had a hard drive that was highly fragmented. The system that could not reproduce the issue had very little fragmentation. So, our hypothesis is that NTFS was able to complete the call synchronously if it could find a single range to handle the extension. If the drive was highly fragmented, the request would take a long time and would end up getting pended. This may also explain why we did not see this on Vista machines, since Vista automatically defrags the hard drive on a regular basis.

Lessons Learned

I think the biggest lesson to be gained from this experience is that, when using the NTAPI to write to files in kernel mode, you either need to specify one of the FILE_SYNCHRONOUS_IO_* flags when creating / opening the file or you need to pass in an Event object and handle the case where file operations return STATUS_PENDING. Doing neither of these leads to disaster (and long nights).
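If a handle is only ever going to be used for simple, blocking reads and writes, the easier of the two options is to open it for synchronous I/O in the first place. A minimal sketch, with a hypothetical path and pared-down parameters:

#include <ntddk.h>

// Open (or create) the file for synchronous I/O so that Zw* calls on this
// handle cannot return STATUS_PENDING to us. SYNCHRONIZE access is required
// when requesting one of the FILE_SYNCHRONOUS_IO_* options.
NTSTATUS OpenPackageFileSynchronously(PHANDLE File)
{
    UNICODE_STRING    path = RTL_CONSTANT_STRING(L"\\??\\C:\\example\\package.dat");
    OBJECT_ATTRIBUTES attrs;
    IO_STATUS_BLOCK   iosb;

    InitializeObjectAttributes(&attrs, &path,
                               OBJ_CASE_INSENSITIVE | OBJ_KERNEL_HANDLE,
                               NULL, NULL);

    return ZwCreateFile(File,
                        GENERIC_READ | GENERIC_WRITE | SYNCHRONIZE,
                        &attrs, &iosb,
                        NULL,                          // AllocationSize
                        FILE_ATTRIBUTE_NORMAL,
                        0,                             // no sharing
                        FILE_OPEN_IF,
                        FILE_SYNCHRONOUS_IO_NONALERT,  // the flag that matters here
                        NULL, 0);
}

With this flag set, the I/O manager itself waits for each read or write on the handle to finish before returning, so a stack-based IO_STATUS_BLOCK stays valid for the duration of the call.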