Designing Applications for High Performance – Part III

Rick Vicik – Architect, Windows Server Performance Team


The third, and final, part of this series covers I/O Completions and Memory Management techniques.  I will go through the different ways to handle I/O completions with some recommendations and optimizations introduced in Vista and later releases of Windows.  I will also cover tradeoffs associated with designing single and multiple process applications.  Finally, I will go through memory fragmentation, heap management, and provide a list of the new and expanded NUMA topology APIs.   


Some common I/O issues

It is recommended to use asynchronous I/O to avoid switching threads and to maximize performance.  Asynchronous I/O is more complex because it needs to determine which of the many outstanding I/Os completed.  Those I/Os may be to different handles, mixed file and network I/O, etc.  There are many different ways to handle I/O completion, not all of which are suitable for high performance (e.g. WaitForMultipleObjects, I/O Completion Routines).  For highest performance, use I/O Completion Ports.  Prior to Vista, scanning the pending overlapped structures was necessary to achieve the highest performance, but the improvements in Vista have made that technique obsolete.  However, it should be noted that an asynchronous write can block when extending a file and there is no asynchronous OpenFile. 


The old DOS SetFilePointer API is an anachronism.  One should specify the file offset in the overlapped structure even for synchronous I/O.  It should never be necessary to resort to the hack of having private file handles for each thread.


Overview of I/O Completion Processing

The processor that receives the hardware interrupt runs the interrupt service routine (ISR).  Interrupts are either distributed across all processors or the interrupts from a specific device can be bound to a particular processor or set of processors.  The ISR queues a DPC (usually to the same processor, otherwise an IPI is required) to perform the part of I/O completion processing that doesn’t need access to the user address space of the process that issued the I/O.  The DPC queues a “Special Kernel APC” to the thread that issued the I/O to copy status and byte count information to the overlapped structure in the user process.  In the case of buffered I/O, the APC also copies data from the kernel buffer to the user buffer.  In the case of I/O Completion Routines (not ports), the “Special Kernel APC” queues a user APC to itself to call the user-specified function.  Moreover, prior to Vista every I/O completion required a context-switch to the thread that issued the I/O in order to run the APC.


These APCs are disruptive because as the number of processors and threads increases, the probability that the APC will preempt some other thread also increases.  This disruption is less likely to happen by fully partitioning the application.  That includes having per processor threads and binding interrupts to specific processors. 


What’s new for I/O in Vista and Above

·         Ability to flag an I/O as low priority.  This reduces the competition between background and foreground tasks, and improves I/O bandwidth utilization.  Low priority IO is exposed via SetPriorityClass PROCESS_MODE_BACKGROUND_BEGIN, also by NtSetInformationProcess(process,ProcessIoPriority,…

·         There are no disruptive APCs running when using I/O Completion Ports.  Also, this can be accomplished for Overlapped structs if the user locks them in memory by using the SetFileOverlappedRange call.

·         Ability to retrieve up to ‘n’ I/O completions with a single call to GetQueuedCompletionStatusEx. 

·         The option to skip setting the event in the file handle and skip queuing a dummy completion if the I/O completes in-line (i.e. did not return PENDING status).  These can be done by making a call to SetFileCompletionNotificationModes. 

·         The Dispatcher lock is not taken when a completion is queued to a port and no threads are waiting on that port.  Similarly, no lock gets taken when removing a completion if there are items in the queue when GetQueuedCompletionStatus is called because again the thread does not need to wait for an item to be inserted.  If the call to GetQueuedCompletionStatus was made with zero timeout, then no waiting takes place.  On the other hand, the lock is taken if queuing a completion wakes a waiting thread or if calling GetQueuedCompletionStatus results in a thread waiting.


I/O Completion Port Example

Let’s take an example where the main thread loops on GetQueuedCompletionStatus and calls the service function which was specified when the I/O was issued (passed via an augmented Overlapped structure).  The service functions issue only asynchronous I/O and do not wait, therefore the only wait in the main thread is really on the call made to GetQueuedCompletionStatus.  The following are some examples of “events” whose completion we wait on and suggestions on what to do next once they complete:


          If the completion is for a new connection establishment, set up a session structure and issue an asynchronous network receive. 

          If the completion is for a network receive, parse the request to determine the file name and issue a call to TransmitFile API. 

          If the completion is for a network send, log the request and issue an asynchronous network receive. 

          If the completion is for a user signal (from PostQueuedCompletionStatus), call the routine specified in the payload.


The timeout parameter on GetQueuedCompletionStatus (GQCS) can cause it to wait forever, return after the specified time, or return immediately.  Completions are queued and processed FIFO, but threads are queued and released LIFO.  That favors the running thread and treats the others as a stack of “filler” threads.  Because in Vista the Completion Ports are integrated with the thread pool and scheduler, when a thread that is associated with a port waits (except on the port) and the active thread limit hasn’t been exceeded, another thread is released from the port to take the place of the one that waited.  When the waiting thread runs again, the active thread count of the port is incremented.  Unfortunately, there is no way to “take back” a thread that is released this way.  If the threads can wait and resume in many places besides GQCS (as is usually the case), it is very common for too many threads to be active.


PostQueuedCompletionStatus allows user signals to be integrated with I/O completion handling which allows for a unified state machine.


Characteristics of I/O Completion Ports

An I/O Completion Port can be associated with many file (or socket) handles, but not the reverse.  The association cannot be changed without closing and reopening the handle.  It is possible to create a port and associate it with a file handle using a single system call, but additional calls are needed to associate a port with multiple handles. 


While you don’t need an event in the Overlapped structure when using Completion Ports because the event is never waited on, if you leave it out, the event in the file handle will be set and that incurs extra locking.


In Vista, applications that use Completion Ports get the performance benefit of eliminating the IO Completion APC without any code changes or even having to recompile.  This is true even if buffered IO is used.  The other way to get the benefit of IO Completion APC elimination (locking the overlapped structure) requires code changes and cannot be used with buffered IO.


Even if the I/O completes in-line (and PENDING is not returned), the I/O completion event is set and a completion is queued to the port unless the SetFileCompletionNotificationModes option is used.


if( !ReadFile(fh,buf,size,&actual,&ioreq)){

    // could be an error or asynchronous I/O successfully queued

    if( GetLastError() == ERROR_IO_PENDING ) {

      // asynchronous I/O queued and did not complete “in-line”

    } else {

      // asynchronous I/O not queued or was serviced in-line and failed


} else {

    // completed in-line, but still must consume completion

    // unless new option specified



 Memory Management Issues

When designing an application, developers are often faced with questions like – Should the application be designed with a single process or multiple processes?  Should there be a separate process for each processor or node?  In this section, we try to answer some of these questions while providing the advantages and disadvantages for each approach.


Advantages for designing applications with multiple processes include isolation, reliability and security.  First, an application can take advantage of more than 4GB of physical memory because each process can use up to 2GB for user-mode data.  Second, if memory is corrupted by bad code in one of the processes, the others are unaffected (unless shared memory is corrupted) and the application as a whole does not need to be terminated.  Also, separate address spaces provide isolation that can’t be duplicated with multiple threads in a single process.


Some disadvantages of using multiple processes include higher cost of a context switch compared to a thread switch in the same process due to the TLB getting flushed.  Also there are possible performance bottlenecks introduced by the mechanism chosen for Inter-Process Communication (IPC).  Examples of IPC include RPC, pipes, ALPC, and shared memory, so it is important that the right kind of IPC is chosen. Some estimates for round trip cost to send 100 bytes via RPC: 27,000 cycles, local named pipes: 26,000 cycles, ALPC: 13,000.  IPC via shared memory is the fastest but it erodes the isolation benefit of separate processes because bad code can potentially corrupt data in the shared memory.  Also with shared memory it is not always possible to use the data “in-place” and copying incurs an added cost of 2.5 cycles per byte copied.


Advantages for designing applications with a single process include not needing cross-process communication, cross process locks, etc.  Single process application can also approximate some of the advantages associated with multiple processes via workarounds.  For instance, exception-handing can trap a failing thread making it unnecessary to terminate the entire process.  The 2GB user virtual address limit is gone on x64 and can be worked around to some degree on 32bits using Address Windowing Extension (AWE) or the 3GB switch to change the user/kernel split of the 4GB virtual address space from 2:2 to 3:1. 


Shared Memory Issues

Shared memory is the fastest IPC but sacrifices some of the isolation that was the justification for using separate processes.  Shared memory is secure to outsiders once set up, but the mechanism by which the multiple processes gain access to it has some vulnerability.  Either a name must be used that is known to all the processes (which is susceptible to “name squatting”) or else a handle must be passed to the processes that didn’t create the shared memory (using some other IPC mechanism). 


Managing updates to shared memory:

1. Data is read-write to all processes, use cross-process lock to guard data or lock-free structures. 

2. Data is read-only to all but 1 process which does all updates (w/o allowing readers to see inconsistencies)

3. Same as 2 but kernel does updates

4. Data is read-only, unprotect briefly to update (suffering TLB flush due to page protection change).


Global Data defined in a DLL is normally process-wide but the linker “/SECTION:.MySeg,RWS” option can be used to make it system-wide if that is what is needed.  Just loading the DLL causes it to be set up as opposed to the usual CreateSection/MapViewOfSection APIs.  The downside is that the size is fixed at compile time.


Memory Fragmentation – What is it?

Fragmentation can occur in the Heap or in the process Virtual Address Space.  It is a consequence of the “best fit” memory allocation policy and a usage pattern that mixes large, short-lived allocations with small, long-lived ones.   It leaves a trail of free blocks (each too small to be used) which cannot be coalesced because of the small, long-lived allocations between them.  It cannot happen if all allocations are the same size or if all are freed at the same time.  Avoid fragmentation by not mixing wildly different sizes and lifetimes.  Large allocations and frees should be infrequent and batched.  Consider rolling your own “zone” heap for frequent, small allocations that are freed at the same time (e.g. constructing a parse tree).  Obtain a large block of memory and claim space in it using InterlockedExchangeAdd (to avoid locking).  If the zone is per-thread, there is no need for even the interlocked instruction.  Use the Low Fragmentation Heap whenever possible.  It is NUMA-aware and lock-free in most cases.  It replaces the heap look-asides and covers up to 16KB allocations.  It combines the management of free and uncommitted space to eliminate linear scans.  It is enabled automatically in Vista or by calling HeapSetInformation on the heap handle.


Best practices when managing the Heap

·         Don’t use GlobalAlloc, LocalAlloc, WalkHeap, ValidateHeap, or LockHeap.  GlobalAlloc and LocalAlloc are old functions which may take an additional lock even if the underlying heap uses lock-free look-asides.  The WalkHeap and ValidateHeap functions disable the heap look-asides.


·         Don’t “shrink to fit” (i.e. allocate a buffer for largest possible message, then realloc to actual size) or “expand to fit” (i.e. allocate typical size, realloc ‘n’ bytes at a time until it fits).  These are fragmentation-producing allocation patterns and realloc often involves copying the data.


·         Don’t use the heap for large buffers (>16KB).  Buffers obtained from the heap are not aligned on a natural boundary.  Use VirtualAlloc for large buffers, but do it infrequently, carve them up yourself and recycle them.


·         New in Vista: dynamic kernel virtual address allocation… no longer need to manually juggle the sizes of the various kernel memory pools when one of them runs out of space (e.g. desktop heap).


·         New in Vista:  Prefetch API – PreFetchCacheLine(adr,options).  The API has a large dependency on the hardware’s support for prefetch.


NUMA Support

APIs to get topology information (depends on hardware for the information):



  GetNumaProcessorNode (specified processor is on which node)

  GetNumaNodeProcessorMask (which processors are on the specified node)

  GetNumaAvailableMemory (current free memory on specified node)


Use existing affinity APIs to place threads on desired nodes.


Memory is allocated on node where thread is running when the memory is touched for the first time (not at allocation time).  For better control over where memory is allocated, use new “ExNuma” versions of the memory allocation APIs.  The additional parameter specifies node.  It is a preferred, not absolute specification and it is 1-based because 0 signifies no preference.


  VirtualAllocExNuma(…, Node)

  MapViewOfFileExNuma(…, Node)

  CreateFileMappingExNuma(…, Node)


Large pages and the TLB

The Translation Look-aside Buffer (TLB) is a critical resource on machines with large amounts of physical memory.  Server applications often have large blocks of memory which should be treated as a whole by the memory manager (instead of 4KB or 8KB chunks), e.g. a database cache.  Using large pages for those items reduces the number of TLB entries, improves TLB hit ratio, and decreases CPI.  Also, the data item is either all in memory or all out.  


  minsize = GetLargePageMinimum();

  p = VirtualAlloc(null, n*minsize, MEM_LARGE_PAGES, …);

Comments (15)

  1. Anonymous says:

    Author: Carl Rabeler Contributers: Eric Jacobsen, Thomas Kejser Technical Reviewers: Brad Daniels, Lindsey

  2. Anonymous says:

    Len, it is true that there is nothing you can do to make SKIP_COMPLETION_PORT_ON_SUCCESS work correctly when "MoreData" can be returned.

    Is memory the resource that is leaked?  Let me know and I will add it to the bug.  

  3. Anonymous says:

    There is a bug with the SKIP_COMPLETION_PORT_ON_SUCCESS feature because “MoreData” is a warning, but due to the current design, it must be classified as either SUCCESS or FAILURE.  Simply considering “MoreData” a failure instead of success will fix this problem but could introduce backward compatibility issues which is why the fix may take some time.

  4. Anonymous says:

    Are you sure it is ACCESS_DENIED and not INVALID_PARAMETER?  It will return INVALID_PARAMETER if called on a file opened without the OVERLAPPED flag.  The only way I can see it returning ACCESS_DENIED is if you really don’t have read/write access to the file.  Try SetFileShortName (see MSDN for usage).  It exercises the same underlying code to get access to the file before branching to the function-specific code.

  5. Anonymous says:

    Windows has a host of synchronous file system APIs, which include ones like CreateFile and GetFileAttributes.  Application writers can often get major performance boosts with careful use of multithreading, but optimizations for APIs like CreateFile are being investigated as well.

  6. John Davis says:

    Rick, why does non-blocking i/o outperform IOCP on UDP sockets when sending/receiving small packets (ie 300 bytes)  This includes using send/recv overlap depths of up to 128 combined with GetQueueCompletionStatusEx with an array of 128 completion notification

    Since you are using GQCSEx, it must be Vista or above. 
    By “non-blocking i/o” I assume you mean receiving IO completion notification by calling WaitForSingleObject on the EventObject in the IOSB (although you might be scanning the status field in the IOSBs like SQL). 
    Since depth is 128, WaitForMultiple probably isn’t used. 


    It is surprising that GQCS isn’t outperforming WaitForSingleObject because there is no IO completion APC and no dispatcher lock
    acquire per IO.  That usually makes a significant difference with many cpus, but it should be at least equal on UP.


    It is possible that setting the “don’t queue a phony completion if the IO completes in-line” option might help, but there is probably
    some other interaction going on.  Is it possible to get an ETW trace (with CSWITCH+LOADER+PROC_THREAD+DISPATCHER+DPC and stacks)?




  7. John Davis says:

    How do I do an ETW trace, etc.?

    There is information on ETW at 


    Also, there is a lot of built-in help:

    xperf –help stackwalk

    xperf –help symbols


    Here is what I use for a quick look (must run as admin):

    xperf -on CSWITCH+LOADER+PROC_THREAD+DPC+DISPATCHER -maxfile 100 -buffersize 512 -stackwalk CSWITCH+READYTHREAD


    xperf -d cswitch.bin

    xperf -i cswitch.bin -symbols -a dumper >mytrace.txt


  8. Don Burn says:

    I am trying to use SetFileIoOverlappedRange and SetFileCompletionNotificationModes in a program to drive my device.  But whatever I do I keep getting ERROR_ACCESS_DENIED on both calls.  Unfortunately the documentation is a little lite, and I am at a loss.

  9. Len Holgate says:

    I’m setting the FILE_SKIP_COMPLETION_PORT_ON_SUCCESS flag using  SetFileCompletionNotificationModes() for a datagram (UDP) socket connection and I’m seeing some unexpected behaviour in my unit tests. In situations where I am attempting a WSARecv() with a buffer that is too small to contain the entire datagram I get the following results when FILE_SKIP_COMPLETION_PORT_ON_SUCCESS is NOT set: The WSARecv() call returns SOCKET_ERROR and a subsequent call to WSAGetLastError() returns ERROR_IO_PENDING as expected. The thread servicing the IOCP has a call to GetQueuedCompletionStatus() return FALSE and the subsequent call to GetLastError() returns ERROR_MORE_DATA. This sequence of events occurs whether the WSARecv() call is made before or after the datagram arriving.

    However, if FILE_SKIP_COMPLETION_PORT_ON_SUCCESS IS set on the socket then I get the same results as above if the WSARecv() is made before the datagram arrives BUT if the WSARecv() call is made after the datagram arrives I get a SOCKET_ERROR and ERROR_IO_PENDING but the WSARecv() never completes. Note that the WSARecv() completion doesn’t occur even if I close the socket.

    I expect that there’s something wrong with my code, or my tests, but given that you guys can see the code involved when this flag is set it would be good to know that this particular situation works as expected on a fully patched Windows 7 machine. Note that I haven’t tried running my unit tests on a Vista box yet to see if I get the same results there.

  10. Len Holgate says:

    OK, thanks for confirming the bug. Am I correct in thinking that from my point of view there is nothing I can do to make my code operate correctly with FILE_SKIP_COMPLETION_PORT_ON_SUCCESS enabled for datagram sockets when a "MoreData" error might be generated?

    The backward compatibility issue is, IMHO, not really an issue as the "More data" is considered an error when FILE_SKIP_COMPLETION_PORT_ON_SUCCESS is NOT enabled and the current implementation causes a resource leak when it IS enabled; thus reclassifying "More data" as an error state reinstates the behaviour that’s found when FILE_SKIP_COMPLETION_PORT_ON_SUCCESS is NOT enabled and fixes the resource leak when it IS enabled… But I expect I’m missing how the current behaviour could be considered as anything other than a bug from a user’s perspective.

  11. Cory Nelson says:

    Is an async CreateFile on any TODO list?  This could be really useful when you’re dealing with many files, or with files over a network share, etc.

    Maybe split into something like CreateFileHandle() and CreateFileEx(HANDLE, …, OVERLAPPED*).

  12. Len Holgate says:

    The resources leaked are simply any resources that are relying on the I/O operation to actually complete at some point. The results of my tests show that the I/O operation never completes; not even when the socket is closed. Since I use reference counting for my per device and per operation data these never get cleaned up because they always have an I/O request pending.

  13. Greg Hazel says:

    I’m experiencing a problems on Vista and up with and without FILE_SKIP_COMPLETION_PORT_ON_SUCCESS, which I’ve written very simple repros for here:

    Basically, without FILE_SKIP_COMPLETION_PORT_ON_SUCCESS I believe WSARecv is returning 0 and setting lpNumberOfBytesRecvd to 0 when the socket should not actually be closed.

    With FILE_SKIP_COMPLETION_PORT_ON_SUCCESS I believe GetQueuedCompletionStatus is timing out when there are completions to process.

    Contact info is in the paste, let me know if there’s anything else I can do to demonstrate the problems or work around them.

  14. Len Holgate says:

    "There is a bug with the SKIP_COMPLETION_PORT_ON_SUCCESS feature because “MoreData” is a warning, but due to the current design, it must be classified as either SUCCESS or FAILURE.  Simply considering “MoreData” a failure instead of success will fix this
    problem but could introduce backward compatibility issues which is why the fix may take some time."

    Any update on this? Is there a Microsoft Connect issue open for it?

    This issue has been investigated, and the following workaround has been recommended: post receive buffers of 1500 bytes using the
    or WSARecvFrom