In a nutshell, it is about speeding up the launch of Windows client applications by optimizing the startup IO.
When an application is launched, the required code and data pages are unlikely to be read from contiguous locations on the storage device. More often than not, part of one file is read, then other parts of the same file, then parts of other required files, and maybe further parts of the original binary.
This is undesirable because jumping between distant locations on storage slows application startup considerably. Random IO can cost significantly more than sequential IO.
Well, take a look at notepad:
There are parts of 30 binaries in the notepad address space. 30 files we needed to find on the disk!
Consider the (worst case) scenario where none, or very few, of the binaries already reside in memory and none of them sit contiguously on disk. Let's say the disk has 10ms of seek latency...
30 x 10ms = 300ms of seek latency just to find the pages required to start the application because we are jumping all over the disk looking for the pieces required by the process.
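If you want to play with the numbers, here is the same back-of-the-envelope sum as a tiny Python sketch. The function name is made up for illustration, and the model is deliberately naive: one full seek per binary, nothing cached, nothing contiguous.

```python
# Naive worst-case model: every binary costs one full seek.
# worst_case_seek_ms is a hypothetical helper, not part of any Windows API.

def worst_case_seek_ms(file_count, seek_ms):
    """Total seek latency if each file needs its own seek."""
    return file_count * seek_ms


# The notepad example: 30 binaries, 10ms per seek.
print(worst_case_seek_ms(30, 10))  # 300
```

Plug in the binary count for a bigger application and the total grows in a straight line.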
Of course, the example is oversimplified: there may be pages already resident in memory which can be soft faulted into the process address space, not all the process binaries in the diagram are needed for launch, and there are other optimizations at play. (Also, even with prefetch, the IO is not necessarily contiguous in terms of physical location. The IO is, however, issued sequentially.)
But the example gives you an idea of the problem application prefetch is intended to solve. Also keep in mind that notepad is a simple application... take a look at outlook.exe or Internet Explorer to see a real list of binaries in a process address space.
(Note: Solid State Disks [SSDs] come up a lot when discussing this feature. Issuing sequential IO is still very beneficial to SSDs even without the seek latency mentioned above. This is especially true on a busy device where we want to get all the startup IO into the queue together.)
Anyway, to improve the application launch experience, the Applaunch Prefetch component was implemented way back in Windows XP as a component of the Sysmain Windows service (%systemroot%\System32\Sysmain.dll).
Applaunch prefetch attempts to improve startup time by optimizing the startup IO of Windows applications. Prefetch optimizes for more sequential IO by reading the initial application IO in large efficient batches.
Prefetch achieves this optimization by monitoring the code and data required for an application to launch. The prefetcher monitors up to the first 10 seconds of application startup with heuristics to stop sooner if the launch is completed sooner (when the application stops faulting pages).
The prefetch trace data is then written to a per-application file in %systemroot%\Prefetch.
The trace file names in the prefetch directory consist of the application name, a hyphen, a hexadecimal hash of the file path, and a .pf extension.
For example: c:\windows\Prefetch\WINWORD.EXE-2437DA78.pf.
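To make the naming convention concrete, here is a minimal Python sketch that splits a trace file name into its two parts. The helper name is hypothetical, and it only takes the name apart; it does not attempt to reproduce how Windows computes the path hash.

```python
from pathlib import PureWindowsPath


def split_prefetch_name(path):
    """Split '<APPNAME>-<HASH>.pf' into (application name, hash string).

    Hypothetical helper for illustration only.
    """
    name = PureWindowsPath(path).name            # e.g. WINWORD.EXE-2437DA78.pf
    stem, _, ext = name.rpartition(".")
    if ext.lower() != "pf" or not stem:
        raise ValueError("not a .pf trace file name: " + name)
    # The application name may itself contain hyphens, so split on the last one.
    app, _, path_hash = stem.rpartition("-")
    if not app or not path_hash:
        raise ValueError("unexpected trace file name: " + name)
    int(path_hash, 16)                           # raises ValueError if not hexadecimal
    return app, path_hash


print(split_prefetch_name(r"c:\windows\Prefetch\WINWORD.EXE-2437DA78.pf"))
```

For the example above this yields the application name WINWORD.EXE and the hash 2437DA78.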
Trace files (.pf) describe the pages that were historically accessed by the application during launch.
You can use Mark Russinovich's Strings utility to inspect a prefetch file if you are interested.
This is the notepad.exe example from above:
When an application is launched the prefetcher looks in the prefetch directory to determine whether a .pf file is present for the application.
If the .pf file exists, the kernel component of the prefetcher issues asynchronous IOs to prefetch metadata, data, and image pages described in the trace file (.pf file).
In an effort to further optimize performance, every 72 hours during system idle time the Sysmain service provides information that can be used by the system defragmenter. The information is provided to enable defrag to physically order files and directories in the order they are referenced during application launch. This order is stored in the %systemroot%\Prefetch\Layout.ini file.
The Sysmain service does not run defragmentation; it only provides data for systems that have defragmentation enabled (SSDs have defrag off).
And that's the bulk of it.
The impact is pretty big. In some recent tests we saw between 10% and 80% improvement in application launch times across multiple devices, multiple applications, with multiple configurations doing all sorts of different things.
Interestingly, the biggest improvements are when devices are under a heavy IO load. The smallest gains are on SSD devices with no load at all. Issuing the IO in large contiguous chunks is very beneficial for application launch because the application is not constantly waiting in line behind other IO every time it needs to jump to a different binary.
You could think of it as one person waiting in line to buy movie tickets for the whole family rather than each person doing it individually. (Especially if they turn up a few milliseconds apart and a random stranger jumps into the line between you.)
In the good scenario, you turn up at the counter with your five ticket requests and they are all completed at once. You head on in to the cinema, happy days.
In the bad scenario, you line up and get your ticket. Then you have to wait on your wife, who is behind some dude deciding between popcorn and a choc top. Then, you both have to wait on your son who is behind three other folk in the line... etc. It's going to take a while before you get in to the cinema because you all have to go together.
Well, it's roughly the same deal for your application launch IO. We can't start till we fetch all the pages we need for launch.
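If you prefer numbers to movie tickets, here is a toy Python model of the same idea. Everything in it is an assumption made for illustration: a single first-in-first-out queue, a fixed service time per request, and a constant number of competing requests ahead of us each time we join the line. It is not a measurement and it ignores seek cost entirely.

```python
SERVICE_MS = 1.0  # assumed time for the device to service one request


def one_at_a_time_ms(app_reads, others_ahead):
    """Each read is issued only after the previous one completes,
    so every read waits behind the competing requests again."""
    return app_reads * (others_ahead + 1) * SERVICE_MS


def batched_ms(app_reads, others_ahead):
    """All reads are queued together, so we wait behind the
    competing requests only once."""
    return (others_ahead + app_reads) * SERVICE_MS


for load in (0, 5, 20):
    print(load, one_at_a_time_ms(30, load), batched_ms(30, load))
```

In this model the two approaches cost the same on an idle device and drift further apart as the queue gets busier, which is the same shape as the pattern described above.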
Have a great weekend.