Designing Applications for High Performance - Part II

Rick Vicik - Architect, Windows Server Performance Team

The second part of this series covers Data Structures and Locks. I will provide general guidance on which data structures to use under certain circumstances and how to use locks without having a negative impact on performance. Finally, there will be examples covering common problems/solutions and a simple cookbook detailing functional requirements and recommendations when using data structures and locks.

In order to avoid cache line thrashing and a high rate of lock collisions, the following are suggested guidelines when designing an application:

· Minimize the need for a large number of locks by partitioning data amongst threads or processors.

· Be aware of the hardware cache-line size because accessing different data items that fall on the same cache-line is considered a collision by the hardware.

· Use the minimum mechanism necessary in data structures. For example, don’t use a doubly-linked list to implement a queue unless it is necessary to remove from the middle or scan both ways.

· Don’t use a FIFO queue for a free list; a LIFO (such as the S-List) hands back the most recently touched, still-cache-warm blocks first.

· Use lock-free techniques when possible. For example, claim buffer space with InterlockedExchangeAdd and use an S-List for free lists or producer/consumer queues.

The “ABA” problem

The “ABA” problem occurs when InterlockedCompareExchange is used to implement lock-free data structures, because what is really needed is the ability to detect any change, even a set of changes that restores the original value. If 2 items are removed from an S-List and the first is put back, just comparing head pointers would not detect that the list has changed (and the local copy of the ‘next’ pointer taken from the first item is now “stale”). The solution is to include a version# in the comparison so that it fails after any change, even one that restored the old value. That gets tricky on x64 because a pointer alone already occupies the full 8 bytes that InterlockedCompareExchange can compare, leaving no room for the version#.
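To make the version# approach concrete, below is a minimal sketch of a versioned pop. It assumes 32-bit pointers, so the pointer and version# together fit in the 8 bytes InterlockedCompareExchange64 can swap (on x64 this is exactly where it gets tricky), and it assumes nodes are never returned to the OS. All names are illustrative.

#include <windows.h>

typedef struct _NODE {
    struct _NODE *Next;
} NODE;

typedef union _VHEAD {          /* pointer and version# viewed as one 8-byte unit */
    struct {
        NODE *First;            /* top of the list (4 bytes on x86) */
        LONG  Version;          /* bumped on every successful change */
    } s;
    LONGLONG Whole;
} VHEAD;

NODE *VersionedPop(VHEAD *Head)
{
    VHEAD oldH, newH;
    for (;;) {
        /* atomic 8-byte snapshot of pointer and version# together */
        oldH.Whole = InterlockedCompareExchange64(&Head->Whole, 0, 0);
        if (oldH.s.First == NULL)
            return NULL;
        newH.s.First   = oldH.s.First->Next;    /* may be a stale read... */
        newH.s.Version = oldH.s.Version + 1;
        /* ...but the exchange fails unless the pointer AND version# both match,
           so a pop/push pair that restored the same head is still detected */
        if (InterlockedCompareExchange64(&Head->Whole, newH.Whole,
                                         oldH.Whole) == oldH.Whole)
            return oldH.s.First;
    }
}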

The “Updating of Shared Data” problem

The primary concern when updating shared data is to be aware of all the ways an item can be reached. When removing from the middle of a doubly-linked list, an item becomes unreachable once the next and previous pointers of the adjacent items no longer point to it. The tricky part is that a thread may have made a local copy of the “next” pointer from the previous item before the removal but hasn’t yet accessed that “next” item. Such a “lurking reader” must make its presence known to others, either with a refcount or by taking per-item locks in a “crabbing” fashion as it traverses the list. The refcount can’t be on the target item because the lurking reader hasn’t gotten there yet; it must be on the pointer that was used to get there, or else apply logically to the list as a whole. The “crabbing” technique is usually too expensive, so in practice it is almost always necessary to have a lock which guards the list.

True Lock-Free Operations

The simple test for a true “lock-free” operation is whether or not a thread can die anywhere during the operation and not prevent other threads from doing the operation. It is commonly thought that replacing a pointer with a “keep out” indicator is better than using a lock, but it is really just folding the lock and pointer into the same data item. If the thread dies after inserting the “keep out” indicator, no other thread can get in.

Guidelines for good locking practices and things to avoid

· Hash Lookup (cache index or “timer wheel”)

A good general practice is to use an array of synonym list heads with a lock per list (or per set of lists), where the locks fall on different cache lines. Compared to a single-lock implementation, this does not increase the number of lock acquires per lookup, but it reduces the lock collision rate. Use a doubly-linked list for the synonyms to support removal from the middle (if lookup always precedes removal, a singly-linked list suffices). If the access pattern is mostly lookups with occasional insert/remove, use a Read/Write lock with a “writer priority” policy to provide fairness. If the entry is not found, an exclusive lock is needed to insert it; if the lock doesn’t support atomic promotion, the thread must drop the shared lock, reacquire it exclusive, and rescan. To avoid allocating memory (and possibly waiting) while holding the lock, allocate the new block between dropping and re-acquiring the lock. A sketch of the per-bucket locking follows.
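A minimal sketch of that design, with illustrative names and sizes, and each bucket padded to its own cache line (an SRWLock is used here; the lock types are described later in this post):

#include <windows.h>

#define NBUCKETS   256
#define CACHE_LINE 64

typedef struct _ENTRY {
    struct _ENTRY *Next;        /* doubly-linked so removal from the   */
    struct _ENTRY *Prev;        /* middle doesn't require a rescan     */
    ULONG Key;
} ENTRY;

typedef struct __declspec(align(CACHE_LINE)) _BUCKET {
    SRWLOCK Lock;               /* guards only this synonym list */
    ENTRY  *Head;
} BUCKET;

BUCKET Table[NBUCKETS];         /* zero-initialized: SRWLOCK init state is 0 */

ENTRY *Lookup(ULONG Key)        /* shared lock: lookups don't collide */
{
    BUCKET *b = &Table[Key % NBUCKETS];
    ENTRY *e;
    AcquireSRWLockShared(&b->Lock);
    for (e = b->Head; e != NULL && e->Key != Key; e = e->Next)
        ;
    ReleaseSRWLockShared(&b->Lock);
    return e;                   /* caller must keep the entry alive (e.g. refcount) */
}

void Insert(ENTRY *New)         /* exclusive lock: only this bucket is blocked */
{
    BUCKET *b = &Table[New->Key % NBUCKETS];
    AcquireSRWLockExclusive(&b->Lock);
    New->Prev = NULL;
    New->Next = b->Head;
    if (b->Head != NULL)
        b->Head->Prev = New;
    b->Head = New;
    ReleaseSRWLockExclusive(&b->Lock);
}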

· N:M Relationship

Suppose a company allows a many-to-many relationship between employees and projects. That requires a set of intersecting lists to represent the relationships, and a single lock can probably be used to guard them. That solution might be sufficient if most access is read-only, but it becomes a bottleneck if updates are frequent. A finer-grain solution is to have a lock on each instance of project and employee. Adding or removing a relationship then requires taking the 2 intersecting locks, slightly more work than the single-lock implementation. A number of optimizations are possible to avoid the resulting lock-ordering issues: deferred removal of intersection blocks, one-at-a-time insertion & removal, and InterlockedCompareExchange for removal. A conventional alternative is sketched below.
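One conventional way to take two peer locks without risking deadlock (a general technique, not specifically one of the optimizations listed above) is to acquire them in a fixed order, such as by address. A minimal sketch with illustrative types:

#include <windows.h>

typedef struct _EMPLOYEE { CRITICAL_SECTION Lock; /* project list head ... */ } EMPLOYEE;
typedef struct _PROJECT  { CRITICAL_SECTION Lock; /* employee list head ... */ } PROJECT;

/* Every thread takes the pair in address order, so no cycle can form. */
static void LockPair(CRITICAL_SECTION *a, CRITICAL_SECTION *b)
{
    if (a > b) { CRITICAL_SECTION *t = a; a = b; b = t; }
    EnterCriticalSection(a);
    EnterCriticalSection(b);
}

void AddRelationship(EMPLOYEE *e, PROJECT *p)
{
    LockPair(&e->Lock, &p->Lock);
    /* ... allocate and link the intersection block here ... */
    LeaveCriticalSection(&p->Lock);     /* release order doesn't matter */
    LeaveCriticalSection(&e->Lock);
}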

· Lock Convoy

FIFO locks guarantee fairness and forward progress at the expense of causing lock convoys. The term originally meant several threads executing the same part of the code as a group, resulting in a higher collision rate than if they were randomly distributed throughout the code (much like automobiles being grouped into packets by traffic lights). The particular phenomenon I’m talking about is worse because, once it forms, the implicit handoff of lock ownership keeps the threads in lock-step.

To illustrate, consider a thread that gets preempted while holding a lock. The result is that all the other threads pile up on the wait list for that lock. When the preempted thread (the lock owner) gets to run again and releases the lock, it automatically hands ownership of the lock to the first thread on the wait list. That thread may not run for some time, but the “hold time” clock is already ticking. The previous owner usually requests the lock again before the wait list has drained, perpetuating the convoy.

· Producer/Consumer Implementation

The first thing to ask when setting up a producer/consumer arrangement is what is gained by handing off the work. The amount of processing done per hand-off must be significantly greater than the cost of a context switch (~3k cycles direct cost, 10x or more indirect cost depending on cache impact). The only legitimate reasons for handing off work to another thread are the need for the isolation of a separate process, or the need for preemption rather than cooperative yielding.

The following are code snippets for a Producer/Consumer implementation which will be used to point out things to avoid when doing the design.

Producer

WaitForSingleObject(QueueMutex, ...);
InsertTailList(&QueueHead, Item);
SetEvent(WakeupEvent);
ReleaseMutex(QueueMutex);

Consumer

for (;;) {
    WaitForSingleObject(WakeupEvent, ...);
    WaitForSingleObject(QueueMutex, ...);
    item = RemoveHeadList(&QueueHead);
    ReleaseMutex(QueueMutex);
    ... process item ...
}

1. Don’t wake the consumer while holding the lock it will need. In our example, the Producer is still holding QueueMutex when it makes the SetEvent call.

2. Don’t make multiple system calls to process each item, even when no scheduling events occur (3 system calls in the producer and 3 in the consumer in this case).

3. Using WaitForMultipleObjects(ALL) on both QueueMutex and WakeupEvent seems like a clever solution because it avoids the extra context switch and combines the two WaitForSingleObject system calls into one. It really isn’t much better, because each time an event in the set is signaled, the waiting thread is awakened via APC to check the status of all the events in the set (resulting in just as many context switches).

4. Amortize the hand-off cost by not waking the consumer until ‘n’ items are queued, but then a timeout is needed to cap latency.

5. It is better to integrate background processing into the main state machine.

6. The consumer should be lower priority than the main thread so that it runs only when the main thread has nothing to do. The consumer should not cause preemption when delivering the “finished” notification.

7. In the example above, consider using PostQueuedCompletionStatus rather than the SetEvent API. The latter boosts the target thread’s priority by 1, which may cause it to preempt something else. A sketch using a completion port follows this list.

8. Don’t use the Windows GUI message mechanism for producer/consumer queues or inter-process communication. Use it only for GUI messages.
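Pulling those points together, here is a minimal sketch of the completion-port approach from item 7 (names are illustrative and error handling is omitted). One call queues or dequeues an item, so the mutex, the event, and most of the system calls disappear:

#include <windows.h>

HANDLE WorkPort;    /* a completion port acts as queue + lock + wakeup */

void InitQueue(void)
{
    WorkPort = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);
}

/* Producer: one system call queues the item and wakes a consumer
   only if one is actually waiting. */
void Produce(OVERLAPPED *Item)
{
    PostQueuedCompletionStatus(WorkPort, 0, 0, Item);
}

/* Consumer: one system call dequeues the next item or blocks. */
DWORD WINAPI Consumer(LPVOID Unused)
{
    DWORD bytes;
    ULONG_PTR key;
    OVERLAPPED *item;
    for (;;) {
        if (GetQueuedCompletionStatus(WorkPort, &bytes, &key, &item, INFINITE)) {
            /* ... process item ... */
        }
    }
}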

Synchronization Resources

Events are strictly FIFO, can be used cross-process by passing the name or handle, and have no spin option or logic to prevent convoys from forming. When an event is created, its signaling mode is set to either auto-reset or manual-reset; the mode dictates how APIs like SetEvent and PulseEvent interact with waiting threads.


· The SetEvent call on an auto-reset event releases a single waiting thread; if no thread is waiting, the event stays signaled until one arrives. On a manual-reset event, the call releases all threads waiting on the event and “keeps the door open” until the event is explicitly reset.

· The PulseEvent call on an auto-reset event releases a single thread, but only if one is already waiting. On a manual-reset event, the call releases all threads waiting on the event and “leaves the door closed”.

Note: The SignalObjectAndWait call allows a thread to signal one object and wait on another without being preempted by the signaled thread. The SetEvent call typically boosts the priority of the awakened thread by 1, so the awakened thread can preempt the signaling thread before it reaches its own wait. That pattern is a very common cause of excess context switches.
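A minimal sketch of the difference (handle names are illustrative):

#include <windows.h>

void HandOff(HANDLE hReady, HANDLE hDone)
{
    /* Instead of SetEvent(hReady) followed by WaitForSingleObject(hDone, ...),
       which opens a window for the boosted target to preempt this thread: */
    SignalObjectAndWait(hReady, hDone, INFINITE, FALSE);
}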

Semaphores are also strictly FIFO, can be used cross-process, and have no spin option and no logic to prevent convoys from forming. When it is necessary to release a specific number of threads, but not all of them, use the ReleaseSemaphore call with the appropriate release count.

The mutex exposed to user mode supports recursion and stores the current owning thread ID, whereas the event and the semaphore do not.

The ExecutiveResource is a reader/writer lock available in user and kernel mode. It is FIFO and has no spin or anti-convoy logic, but it does have a TryToAcquire option and anti-priority-inversion logic (it attempts to boost the owner(s) if an acquire takes over 500 ms). There are options regarding promotion to exclusive, demotion to shared, and whether readers or writers have priority (but not all options are available in both the user and kernel versions).

The CriticalSection is the most common lock in user mode. It is exclusive, supports recursion, has TryToAcquire and spin options, and is convoy-free because it is non-FIFO. Acquire and Release are very fast and do not enter the kernel when there is no contention. It cannot be used across processes. The backing event is created on first collision, and all critical sections in a process are linked together for debugging purposes, which requires a process-wide lock on creation and destruction (there is an option to skip the linking).

Read/Write locks (Slim Read/Write Lock (SRWLock) in user mode and PushLock in kernel mode)

This new, lighter-weight read/write lock is available in Vista and later releases of Windows. Like the CriticalSection, it makes no system call when there is no contention. It is non-FIFO and therefore convoy-free. The data structure is a single pointer and the initialized state is 0, so these locks are very cheap to create. Recursion is not supported. To start using SRWLocks, visit MSDN for the list of supported APIs.

ConditionVariables are also new in Vista. They are synchronization primitives that allow threads to wait until a specific condition occurs. They work together with SRWLocks or CriticalSections and cannot be shared across processes. To start using ConditionVariables, visit MSDN for the list of supported APIs. A sketch follows.
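A minimal sketch of a ConditionVariable paired with an SRWLock (the names and the bare queue-depth counter are illustrative stand-ins for a real queue):

#include <windows.h>

CONDITION_VARIABLE QueueNotEmpty = CONDITION_VARIABLE_INIT;
SRWLOCK            QueueLock     = SRWLOCK_INIT;
int                QueueDepth    = 0;

void Produce(void)
{
    AcquireSRWLockExclusive(&QueueLock);
    QueueDepth++;                       /* ... enqueue the real item here ... */
    ReleaseSRWLockExclusive(&QueueLock);
    WakeConditionVariable(&QueueNotEmpty);  /* signal after dropping the lock */
}

void Consume(void)
{
    AcquireSRWLockExclusive(&QueueLock);
    while (QueueDepth == 0)             /* re-test: wakeups can be spurious */
        SleepConditionVariableSRW(&QueueNotEmpty, &QueueLock, INFINITE, 0);
    QueueDepth--;                       /* ... dequeue the real item here ... */
    ReleaseSRWLockExclusive(&QueueLock);
}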

Lock-Free APIs (Recommended when possible)

Atomic updates are typically better than acquiring locks, but locking still happens at the hardware level, which causes bus traffic. It is also possible to have excessive retries (at the hardware or software level).

InterlockedCompareExchange compares a 4- or 8-byte data item with a test value and, if equal, atomically replaces it with a specified value. If the comparison fails, the data item is left unchanged and the old value is returned. The S-List is constructed using InterlockedCompareExchange: it atomically modifies the head of a singly-linked list to support push, pop, and grab operations, and it uses a version# to prevent the “ABA” problem mentioned earlier in the post. InterlockedCompareExchange can also be used to construct a true FIFO lock-free queue, essentially an S-List with a tail pointer that may be slightly out-of-date. It is not widely used because of the limitation that elements placed on the list can never be re-used for anything else. This is often worked around by using surrogates for the list elements; a common pool of them can be maintained in the application, but it can grow and never shrink.
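Because the OS S-List handles the version# internally, using it is far simpler than rolling your own. A minimal sketch of a free list built on the documented S-List APIs (type names are illustrative):

#include <windows.h>

/* Items must be aligned to MEMORY_ALLOCATION_ALIGNMENT (16 on x64). */
typedef struct __declspec(align(16)) _WORKITEM {
    SLIST_ENTRY Link;
    int Payload;
} WORKITEM;

SLIST_HEADER FreeList;

void InitPool(void)            { InitializeSListHead(&FreeList); }

void FreeItem(WORKITEM *w)     { InterlockedPushEntrySList(&FreeList, &w->Link); }

WORKITEM *AllocItem(void)
{
    SLIST_ENTRY *e = InterlockedPopEntrySList(&FreeList);
    return e ? CONTAINING_RECORD(e, WORKITEM, Link) : NULL;
}

/* "Grab": a consumer can drain the whole list in one interlocked operation. */
SLIST_ENTRY *GrabAll(void)     { return InterlockedFlushSList(&FreeList); }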

InterlockedIncrement, InterlockedDecrement, and InterlockedExchangeAdd are similar, but the new value is derived from the old rather than specified (add or subtract 1, or add a specified value; use a negative value to subtract). These can perform better than InterlockedCompareExchange because they cannot fail, so retries are eliminated in most cases. To add variety, InterlockedIncrement and InterlockedDecrement return the new value while InterlockedExchangeAdd returns the old value.
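For example, a minimal sketch of the “claim space in a buffer” guideline from the start of this post (the buffer and names are illustrative). Because the add cannot fail, there is no retry loop; only overflow has to be backed out:

#include <windows.h>

#define BUFSIZE 65536
static char Buffer[BUFSIZE];
static volatile LONG BufferUsed = 0;    /* bytes claimed so far */

char *Claim(LONG bytes)
{
    LONG old = InterlockedExchangeAdd(&BufferUsed, bytes);  /* returns OLD value */
    if (old + bytes > BUFSIZE) {
        InterlockedExchangeAdd(&BufferUsed, -bytes);        /* back out our claim */
        return NULL;
    }
    return &Buffer[old];    /* our exclusive region: [old, old+bytes) */
}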


The Locking Cookbook

The following table maps common functional requirements to a mechanism that meets them without hurting performance.

| Functional Requirement | Performance Recommendation |
| --- | --- |
| Maintain a reference count; drive a circular buffer; construct a barrier | InterlockedIncrement / InterlockedDecrement |
| Claim space in a buffer; roll up per-CPU counts; construct complex locks | InterlockedExchangeAdd |
| 8-byte mailbox; S-List & Queue | InterlockedCompareExchange |
| Free list or producer/consumer work queue | S-List |
| Implement a true FIFO with no need to reverse | Lock-free Queue |
| List that supports traversal and/or removal from the middle | Conventional lock (if >70% read, use a Reader/Writer lock) |
| Least Recently Used (LRU) list | Use a “clock” algorithm or deferred removal |
| Tree | Lock a sub-tree, or “crab” downward as when traversing a linked list |

Table 1: A locking cookbook

Conclusion

In conclusion, the following are common guidelines for designing high-performance applications that use locks and data structures to ensure data integrity through the available synchronization mechanisms.

· Minimize the frequency of lock acquires and the hold time. Don’t wait on objects, do I/O, call SetEvent, or make an RPC while holding a lock. Don’t call anything that may allocate memory while holding a lock. Note that taking a lock while holding another (i.e. nesting locks) inflates the hold time of the outer lock.


· Use a Reader/Writer lock if >70% of operations take the lock shared. It is incorrect to assume that any amount of shared access will be an improvement over an exclusive lock. Exclusive operations can be delayed by multiple shared operations even if alternating shared/exclusive fairness is implemented.


· Break up locks, but not so finely that typical operations need most of them. This is the tricky trade-off.

· Taking multiple locks can cause deadlocks. The typical solution is to always take locks in a pre-defined order, but in practice different parts of the application may start with a different lock. Use TryToAcquire first (it helps eliminate deadlocks because you never wait); if that fails, drop the lock that is held and re-acquire both in the defined order. Things can change between the drop and the re-acquire, so it may be necessary to recheck the data. Not all locks have a try-to-acquire option. WaitForMultipleObjects(ALL) is another way to solve the deadlock problem without defining a locking order (deadlocks are impossible if all locks are obtained atomically), but its expense is the downside.

· If you find a need for recursion on locks, it means you don’t know when the lock is held, and that lack of knowledge makes it impossible to minimize the lock hold time. This is a common problem with Object-Oriented design.