Designing Applications for High Performance - Part III

Rick Vicik - Architect, Windows Server Performance Team

 

The third, and final, part of this series covers I/O completions and memory management techniques. I will go through the different ways to handle I/O completions, with recommendations and the optimizations introduced in Vista and later releases of Windows. I will also cover the tradeoffs of designing single-process versus multiple-process applications. Finally, I will go through memory fragmentation and heap management, and provide a list of the new and expanded NUMA topology APIs.

Some common I/O issues

It is recommended to use asynchronous I/O to avoid switching threads and to maximize performance. Asynchronous I/O is more complex because the application must determine which of the many outstanding I/Os completed. Those I/Os may be to different handles, a mix of file and network I/O, etc. There are many ways to handle I/O completion, not all of which are suitable for high performance (e.g. WaitForMultipleObjects, I/O Completion Routines). For highest performance, use I/O Completion Ports. Prior to Vista, scanning the pending overlapped structures was necessary to achieve the highest performance, but the improvements in Vista have made that technique obsolete. Note, however, that an asynchronous write can still block when extending a file, and there is no asynchronous OpenFile.

 

The old DOS-style SetFilePointer API is an anachronism. Specify the file offset in the overlapped structure, even for synchronous I/O. It should never be necessary to resort to the hack of keeping a private file handle per thread.
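For example, a read at a specific file position can be issued by filling in the offset fields of the overlapped structure instead of calling SetFilePointer (a minimal sketch; the handle, buffer, and position variables are placeholders):

  OVERLAPPED ov = {0};
  ov.Offset     = (DWORD)(pos & 0xFFFFFFFF);   // low 32 bits of the file offset
  ov.OffsetHigh = (DWORD)(pos >> 32);          // high 32 bits of the file offset
  if( !ReadFile(fh, buf, size, NULL, &ov) &&
      GetLastError() != ERROR_IO_PENDING ) {
      // the read failed outright
  }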

Overview of I/O Completion Processing

The processor that receives the hardware interrupt runs the interrupt service routine (ISR). Interrupts are either distributed across all processors or the interrupts from a specific device can be bound to a particular processor or set of processors. The ISR queues a DPC (usually to the same processor, otherwise an IPI is required) to perform the part of I/O completion processing that doesn’t need access to the user address space of the process that issued the I/O. The DPC queues a “Special Kernel APC” to the thread that issued the I/O to copy status and byte count information to the overlapped structure in the user process. In the case of buffered I/O, the APC also copies data from the kernel buffer to the user buffer. In the case of I/O Completion Routines (not ports), the “Special Kernel APC” queues a user APC to itself to call the user-specified function. Moreover, prior to Vista every I/O completion required a context-switch to the thread that issued the I/O in order to run the APC.

These APCs are disruptive because, as the number of processors and threads increases, the probability that the APC will preempt some other thread also increases. The disruption can be reduced by fully partitioning the application, which includes having per-processor threads and binding interrupts to specific processors.

What’s new for I/O in Vista and Above

· Ability to flag an I/O as low priority. This reduces the competition between background and foreground tasks and improves I/O bandwidth utilization. Low-priority I/O is exposed via SetPriorityClass with PROCESS_MODE_BACKGROUND_BEGIN, and also via NtSetInformationProcess(process, ProcessIoPriority, ...).

· No disruptive APCs are run when using I/O Completion Ports. The same benefit is available for Overlapped structures outside of ports if the user locks them in memory with the SetFileOverlappedRange call.

· Ability to retrieve up to ‘n’ I/O completions with a single call to GetQueuedCompletionStatusEx.

· The option to skip setting the event in the file handle, and to skip queuing a dummy completion when the I/O completes in-line (i.e. does not return ERROR_IO_PENDING). Both are enabled by calling SetFileCompletionNotificationModes (see the sketch after this list).

· The Dispatcher lock is not taken when a completion is queued to a port and no threads are waiting on that port. Similarly, no lock gets taken when removing a completion if there are items in the queue when GetQueuedCompletionStatus is called because again the thread does not need to wait for an item to be inserted. If the call to GetQueuedCompletionStatus was made with zero timeout, then no waiting takes place. On the other hand, the lock is taken if queuing a completion wakes a waiting thread or if calling GetQueuedCompletionStatus results in a thread waiting.
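A sketch of how two of the options above might be used together follows. The flag names are the documented ones; the file handle, port, and buffer variables are placeholders.

  // Skip the handle event and skip queuing a completion when I/O completes in-line.
  SetFileCompletionNotificationModes(fh,
      FILE_SKIP_COMPLETION_PORT_ON_SUCCESS | FILE_SKIP_SET_EVENT_ON_HANDLE);

  // Drain up to 64 completions with a single call instead of one at a time.
  OVERLAPPED_ENTRY entries[64];
  ULONG            count;
  if( GetQueuedCompletionStatusEx(port, entries, 64, &count, INFINITE, FALSE) ) {
      for( ULONG i = 0; i < count; i++ ) {
          // entries[i].lpOverlapped and dwNumberOfBytesTransferred identify the I/O
      }
  }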

I/O Completion Port Example

Let’s take an example where the main thread loops on GetQueuedCompletionStatus and calls the service function that was specified when the I/O was issued (passed via an augmented Overlapped structure). The service functions issue only asynchronous I/O and never wait, so the only wait in the main thread is the call to GetQueuedCompletionStatus. The following are some examples of completions we might wait on and suggestions on what to do when each completes:

- If the completion is for a new connection establishment, set up a session structure and issue an asynchronous network receive.

- If the completion is for a network receive, parse the request to determine the file name and issue a call to TransmitFile API.

- If the completion is for a network send, log the request and issue an asynchronous network receive.

- If the completion is for a user signal (from PostQueuedCompletionStatus), call the routine specified in the payload.

The timeout parameter on GetQueuedCompletionStatus (GQCS) can cause it to wait forever, return after the specified time, or return immediately. Completions are queued and processed FIFO, but threads are queued and released LIFO. That favors the running thread and treats the others as a stack of “filler” threads. Because in Vista the Completion Ports are integrated with the thread pool and scheduler, when a thread that is associated with a port waits (except on the port) and the active thread limit hasn’t been exceeded, another thread is released from the port to take the place of the one that waited. When the waiting thread runs again, the active thread count of the port is incremented. Unfortunately, there is no way to “take back” a thread that is released this way. If the threads can wait and resume in many places besides GQCS (as is usually the case), it is very common for too many threads to be active.

PostQueuedCompletionStatus allows user signals to be integrated with I/O completion handling, which makes a unified state machine possible.
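Such a loop might look roughly like the sketch below. The IOREQ structure and its service-function field are illustrative (one way to realize the "augmented Overlapped structure" idea, not code from the original article), and the completion port is assumed to have been created and associated with the handles already.

  // Augmented Overlapped structure: carries the function to call on completion.
  typedef struct _IOREQ {
      OVERLAPPED ov;                                      // must be the first member
      void (*ServiceFn)(struct _IOREQ *req, DWORD bytes); // e.g. OnAccept, OnReceive, OnSend
  } IOREQ;

  for(;;) {
      DWORD       bytes;
      ULONG_PTR   key;
      OVERLAPPED *pov;
      BOOL ok = GetQueuedCompletionStatus(port, &bytes, &key, &pov, INFINITE);
      if( !ok && pov == NULL )
          continue;                // GQCS itself failed; nothing was dequeued
      IOREQ *req = (IOREQ *)pov;   // also works for PostQueuedCompletionStatus "signals"
      req->ServiceFn(req, bytes);  // if !ok, the I/O failed; ServiceFn can check GetLastError
  }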

Characteristics of I/O Completion Ports

An I/O Completion Port can be associated with many file (or socket) handles, but a handle can be associated with only one port. The association cannot be changed without closing and reopening the handle. A port can be created and associated with a file handle in a single system call, but associating additional handles requires one call per handle.
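For example (a sketch with placeholder handle names), the first call below creates the port and associates one handle with it; each additional handle requires its own call that passes the existing port:

  // Create the port and associate fh1 in one call; the completion key identifies the handle.
  HANDLE port = CreateIoCompletionPort(fh1, NULL, (ULONG_PTR)1, 0);

  // Additional handles are associated one call at a time.
  CreateIoCompletionPort(fh2, port, (ULONG_PTR)2, 0);
  CreateIoCompletionPort((HANDLE)sock, port, (ULONG_PTR)3, 0);  // sockets work too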

You don’t need an event in the Overlapped structure when using Completion Ports because the event is never waited on; however, if you leave it out, the event in the file handle is set on each completion, which incurs extra locking.

In Vista, applications that use Completion Ports get the performance benefit of eliminating the IO Completion APC without any code changes or even having to recompile. This is true even if buffered IO is used. The other way to get the benefit of IO Completion APC elimination (locking the overlapped structure) requires code changes and cannot be used with buffered IO.

Even if the I/O completes in-line (and PENDING is not returned), the I/O completion event is set and a completion is queued to the port unless the SetFileCompletionNotificationModes option is used.

if( !ReadFile(fh, buf, size, &actual, &ioreq) ) {
    // could be an error or asynchronous I/O successfully queued
    if( GetLastError() == ERROR_IO_PENDING ) {
        // asynchronous I/O queued and did not complete "in-line"
    } else {
        // asynchronous I/O not queued, or was serviced in-line and failed
    }
} else {
    // completed in-line, but still must consume the completion
    // unless the new option is specified
}

Memory Management Issues

When designing an application, developers are often faced with questions like - Should the application be designed with a single process or multiple processes? Should there be a separate process for each processor or node? In this section, we try to answer some of these questions while providing the advantages and disadvantages for each approach.

Advantages of designing an application with multiple processes include isolation, reliability, and security. First, on 32-bit systems the application can take advantage of more than 4GB of physical memory because each process can use up to 2GB for user-mode data. Second, if memory is corrupted by bad code in one of the processes, the others are unaffected (unless shared memory is corrupted) and the application as a whole does not need to be terminated. Separate address spaces provide isolation that cannot be duplicated with multiple threads in a single process.

Disadvantages of using multiple processes include the higher cost of a context switch compared to a thread switch within the same process, because the TLB gets flushed. There are also possible performance bottlenecks introduced by the mechanism chosen for Inter-Process Communication (IPC), so it is important to choose the right kind of IPC. Examples include RPC, named pipes, ALPC, and shared memory. Rough estimates of the round-trip cost to send 100 bytes: RPC about 27,000 cycles, local named pipes about 26,000 cycles, ALPC about 13,000 cycles. IPC via shared memory is the fastest, but it erodes the isolation benefit of separate processes because bad code can corrupt data in the shared memory. Also, with shared memory it is not always possible to use the data "in-place", and copying incurs an added cost of roughly 2.5 cycles per byte copied.

Advantages of designing an application as a single process include no need for cross-process communication, cross-process locks, etc. A single-process application can also approximate some of the advantages of multiple processes through workarounds. For instance, exception handling can trap a failing thread, making it unnecessary to terminate the entire process. The 2GB user virtual address limit is gone on x64 and can be worked around to some degree on 32-bit systems using Address Windowing Extensions (AWE) or the /3GB switch, which changes the user/kernel split of the 4GB virtual address space from 2:2 to 3:1.

Shared Memory Issues

Shared memory is the fastest IPC, but it sacrifices some of the isolation that was the justification for using separate processes. Shared memory is secure from outsiders once set up, but the mechanism by which multiple processes gain access to it has some vulnerability: either a name must be used that is known to all the processes (which is susceptible to "name squatting"), or a handle must be passed to the processes that did not create the shared memory (using some other IPC mechanism).
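A minimal sketch of the named-section approach follows. The section name and size are placeholders; in practice the name and its ACL need care precisely because of the name-squatting concern above.

  // Creating process: a 1MB pagefile-backed section with a well-known name.
  HANDLE hMap = CreateFileMapping(INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE,
                                  0, 1 << 20, TEXT("Local\\MyAppSharedData"));
  void *base = MapViewOfFile(hMap, FILE_MAP_ALL_ACCESS, 0, 0, 0);

  // Other processes: open the same section by name (read-only here).
  HANDLE hMap2 = OpenFileMapping(FILE_MAP_READ, FALSE, TEXT("Local\\MyAppSharedData"));
  void  *base2 = MapViewOfFile(hMap2, FILE_MAP_READ, 0, 0, 0);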

Managing updates to shared memory:

1. Data is read-write to all processes; use a cross-process lock or lock-free structures to guard the data.

2. Data is read-only to all but one process, which does all updates (without allowing readers to see inconsistencies).

3. Same as 2, but the kernel does the updates.

4. Data is read-only; unprotect it briefly to update (suffering a TLB flush due to the page-protection change), as in the sketch after this list.
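For instance, option 4 might look like the sketch below in the updating process (UpdateSharedData is a hypothetical routine; the protection change is what causes the TLB flush mentioned above):

  DWORD oldProt;
  VirtualProtect(shared, regionSize, PAGE_READWRITE, &oldProt);   // briefly writable
  UpdateSharedData(shared);                                       // hypothetical update
  VirtualProtect(shared, regionSize, PAGE_READONLY, &oldProt);    // back to read-only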

Global data defined in a DLL is normally private to each process, but the linker option "/SECTION:.MySeg,RWS" can be used to make it system-wide (shared) if that is what is needed. Simply loading the DLL sets up the shared section, as opposed to using the usual CreateSection/MapViewOfSection APIs. The downside is that the size is fixed at compile time.
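For example, with the Microsoft compiler a shared segment in a DLL can be declared as follows (the segment name matches the linker option; variables placed in it must be initialized or the compiler will put them elsewhere):

  #pragma data_seg(".MySeg")
  volatile LONG g_SharedCounter = 0;   // initialized so it lands in .MySeg
  #pragma data_seg()
  #pragma comment(linker, "/SECTION:.MySeg,RWS")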

Memory Fragmentation – What is it?

Fragmentation can occur in the heap or in the process virtual address space. It is a consequence of the "best fit" memory allocation policy combined with a usage pattern that mixes large, short-lived allocations with small, long-lived ones. It leaves a trail of free blocks (each too small to be used) which cannot be coalesced because of the small, long-lived allocations between them. It cannot happen if all allocations are the same size or if all are freed at the same time. Avoid fragmentation by not mixing wildly different sizes and lifetimes; large allocations and frees should be infrequent and batched.

Consider rolling your own "zone" heap for frequent, small allocations that are freed at the same time (e.g. constructing a parse tree). Obtain a large block of memory and claim space in it using InterlockedExchangeAdd (to avoid locking). If the zone is per-thread, even the interlocked instruction is unnecessary.

Use the Low Fragmentation Heap whenever possible. It is NUMA-aware and lock-free in most cases. It replaces the heap look-asides and covers allocations up to 16KB. It combines the management of free and uncommitted space to eliminate linear scans. It is enabled automatically in Vista, or by calling HeapSetInformation on the heap handle.
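A minimal sketch of such a zone heap, along the lines described above (the names are illustrative and error handling is omitted), plus the HeapSetInformation call that enables the Low Fragmentation Heap on releases where it is not already the default:

  typedef struct {
      char *base;    // start of the zone (e.g. from VirtualAlloc)
      LONG  used;    // bytes claimed so far
      LONG  size;    // total bytes in the zone
  } ZONE;

  void *ZoneAlloc(ZONE *z, LONG n)
  {
      n = (n + 15) & ~15;                             // keep allocations 16-byte aligned
      LONG offset = InterlockedExchangeAdd(&z->used, n);
      if( offset + n > z->size )
          return NULL;                                // zone exhausted
      return z->base + offset;
  }
  // The whole zone is freed (or recycled by resetting 'used' to 0) in one operation
  // when the parse tree, request, etc. is no longer needed.

  // Enable the Low Fragmentation Heap explicitly (automatic in Vista and later).
  ULONG lfh = 2;
  HeapSetInformation(GetProcessHeap(), HeapCompatibilityInformation, &lfh, sizeof(lfh));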

Best practices when managing the Heap

· Don’t use GlobalAlloc, LocalAlloc, WalkHeap, ValidateHeap, or LockHeap. GlobalAlloc and LocalAlloc are old functions which may take an additional lock even if the underlying heap uses lock-free look-asides. The WalkHeap and ValidateHeap functions disable the heap look-asides.

· Don’t “shrink to fit” (i.e. allocate a buffer for largest possible message, then realloc to actual size) or “expand to fit” (i.e. allocate typical size, realloc ‘n’ bytes at a time until it fits). These are fragmentation-producing allocation patterns and realloc often involves copying the data.

· Don’t use the heap for large buffers (>16KB). Buffers obtained from the heap are not aligned on a natural boundary. Use VirtualAlloc for large buffers, but do it infrequently, carve them up yourself, and recycle them (see the sketch after this list).

· New in Vista: dynamic kernel virtual address allocation... no longer need to manually juggle the sizes of the various kernel memory pools when one of them runs out of space (e.g. desktop heap).

· New in Vista: Prefetch API - PreFetchCacheLine. Its benefit depends heavily on the hardware’s support for prefetching.
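As an illustration of the VirtualAlloc guidance above, here is a sketch of a simple recycled buffer pool (the sizes are arbitrary and the free list is an interlocked SList; 64KB buffers carved from a page-aligned block satisfy the SList alignment requirement):

  #define BUF_SIZE  (64 * 1024)
  #define BUF_COUNT 256

  SLIST_HEADER freeList;

  void InitBufferPool(void)
  {
      InitializeSListHead(&freeList);
      // One large, infrequent VirtualAlloc, carved into fixed-size buffers.
      char *block = (char *)VirtualAlloc(NULL, (SIZE_T)BUF_SIZE * BUF_COUNT,
                                         MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
      for( int i = 0; i < BUF_COUNT; i++ )
          InterlockedPushEntrySList(&freeList, (PSLIST_ENTRY)(block + i * BUF_SIZE));
  }

  void *GetBuffer(void)    { return InterlockedPopEntrySList(&freeList); }
  void  PutBuffer(void *p) { InterlockedPushEntrySList(&freeList, (PSLIST_ENTRY)p); }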

NUMA Support

APIs to get topology information (the quality of the information depends on what the hardware reports):

  GetNumaHighestNodeNumber

  GetNumaProcessorNode (which node the specified processor is on)

  GetNumaNodeProcessorMask (which processors are on the specified node)

  GetNumaAvailableMemory (current free memory on specified node)

Use existing affinity APIs to place threads on desired nodes.

Memory is allocated on the node where the thread is running when the memory is touched for the first time (not at allocation time). For better control over where memory is allocated, use the new "ExNuma" versions of the memory allocation APIs, which take an additional parameter specifying the node. The node is a preference, not an absolute requirement, and it is 1-based because 0 signifies no preference.

  VirtualAllocExNuma(..., Node)

  MapViewOfFileExNuma(..., Node)

  CreateFileMappingExNuma(..., Node)
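A sketch of how these calls might fit together (the variable names are placeholders, and the node-parameter comment follows the 1-based convention described above):

  ULONG highestNode;
  GetNumaHighestNodeNumber(&highestNode);

  // Run this thread on node 0's processors so "first touch" lands on node 0.
  ULONGLONG nodeMask;
  GetNumaNodeProcessorMask(0, &nodeMask);
  SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)nodeMask);

  // Explicitly prefer the first node for this 1MB allocation
  // (1-based per the convention above; 0 would mean no preference).
  void *p = VirtualAllocExNuma(GetCurrentProcess(), NULL, 1 << 20,
                               MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE, 1);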

Large pages and the TLB

The Translation Look-aside Buffer (TLB) is a critical resource on machines with large amounts of physical memory. Server applications often have large blocks of memory which should be treated as a whole by the memory manager (instead of as 4KB or 8KB chunks), e.g. a database cache. Using large pages for those items reduces the number of TLB entries, improves the TLB hit ratio, and decreases CPI (cycles per instruction). Also, the data item is either all in memory or all out.

  minsize = GetLargePageMinimum();
  // requires SeLockMemoryPrivilege; the size must be a multiple of minsize
  p = VirtualAlloc(NULL, n * minsize, MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES, PAGE_READWRITE);