What are zombie users?


Pretty much anyone who has upgraded a 5.5 server to E2K has encountered the zombie user phenomenon.  Error log event 9551 is likely familiar.  The reason behind these errors has to do with how we changed Exchange security in Exchange 2000 compared with how it worked in Exchange 5.5 and earlier.  The early versions of Exchange were developed before the NT security model became widely adopted, so they rolled their own security.  Both the NT model and the Ex5.5 model made use of something called an ACL, or Access Control List, but the two formats are very different.  Having a different security model in Exchange than in the OS and other products was a nuisance and limited a lot of what we could do along the lines of storage convergence, but the main reason for making the change was that we were also integrating with the (then new) Active Directory, which used NT security descriptors.  This presented us with a major headache: how do we convert the 5.5 ACL’s to NTSD’s?
 
As an aside: the terms NTSD, DACL, and ACL are often used interchangeably although they mean slightly different things.  An NTSD is the full security descriptor of an object, including the owner, the primary group, the DACL (discretionary ACL, which controls who has access to the object) and the SACL (system ACL, which controls what security auditing is done when someone tries to access the object).  Because the DACL is the most important and complex part of the NTSD, the two terms are often interchanged.  Also, since the SACL is not very widely used and does not directly control access, the more generic term ACL is usually used to mean just the DACL.  In 5.5, ACL is the term used internally for what is really the equivalent of an NT DACL, because auditing did not exist and so there was no such thing as a SACL.
 
The reason conversion was difficult had to do with a number of factors related to the format of 5.5 ACL’s.  The format was actually pretty simple: each ACL consisted of a list of DN’s (distinguished names, which is how users and groups were stored in the 5.5 directory), each with a bit mask which showed what rights (read access, write access, etc.) that DN had and didn’t have to the object.  In NT (and E2K), the DACL consists of a list of SID’s (security identifiers, which are what the AD uses to uniquely identify a security entity), each with a rights mask.  NT also has two different types of entries: allows and denies (there are others as well, but they need not be considered for purposes of this discussion).  In NT, an entry that allows a user only read access to an object does not necessarily mean that the user is denied write access to that object.  In 5.5 it did.  This presented a problem with ordering the entries (called ACE’s, or access control entries) in the ACL.  NT called for one specific ordering, but in order to maintain the same security semantics as 5.5, we needed to use a different ordering.
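
To make the ordering problem concrete, here is a minimal Python sketch.  It is not the actual Exchange conversion logic: the rights bits, the class names, the first-match access check, and the assumption that a user's own entry should win over a group entry are all simplifications made up for illustration.  Each 5.5-style entry is expanded into an allow ACE plus a deny ACE for every right the entry did not grant, and the same ACEs are then evaluated in two different orders.

```python
from dataclasses import dataclass

# Hypothetical rights bits, for illustration only.
READ, WRITE = 0x1, 0x2
ALL_RIGHTS = READ | WRITE


@dataclass(frozen=True)
class LegacyEntry:
    """5.5-style entry: a trustee and the complete set of rights it holds."""
    trustee: str
    rights: int


@dataclass(frozen=True)
class Ace:
    """NT-style entry: an explicit allow or deny for some rights."""
    trustee: str
    kind: str  # "allow" or "deny"
    mask: int


def to_aces(entry):
    """Expand one legacy entry into an allow ACE plus a deny ACE for the rest."""
    aces = [Ace(entry.trustee, "allow", entry.rights)]
    denied = ALL_RIGHTS & ~entry.rights
    if denied:
        aces.append(Ace(entry.trustee, "deny", denied))
    return aces


def access_check(aces, identities, wanted):
    """Walk the ACEs in order; the first ACE that mentions a right decides it."""
    remaining = wanted
    for ace in aces:
        if ace.trustee not in identities:
            continue
        if ace.kind == "deny" and ace.mask & remaining:
            return False
        if ace.kind == "allow":
            remaining &= ~ace.mask
        if remaining == 0:
            return True
    return False


# alice has read and write; the group "staff" is read-only.
legacy = [LegacyEntry("alice", READ | WRITE), LegacyEntry("staff", READ)]

# Order 1: keep each entry's allow/deny pair together, in ACL order.
per_entry = [ace for entry in legacy for ace in to_aces(entry)]
# Order 2: the same ACEs with every deny moved ahead of every allow.
denies_first = sorted(per_entry, key=lambda ace: ace.kind != "deny")

alice = {"alice", "staff"}  # alice is also a member of staff
print(access_check(per_entry, alice, WRITE))
print(access_check(denies_first, alice, WRITE))
```

Under these assumptions the two orderings give different answers for the same user, which is the kind of mismatch the conversion had to avoid.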
 
That wasn’t the main problem though.  The main problem was that the DN’s needed to be converted to SID’s.  When converting from 5.5 to E2K, most rollouts had at some point a mixed environment of both types of servers.  The ADC was used to keep the 5.5 directory in sync with the AD.  In the AD, security entities have a special property called the legacy exchange DN.  This property was used to do the conversion.  The problem comes into play when the E2K system cannot find the legacy exchange DN in the AD for an ACL entry.  There are two possible reasons for this: the security entity no longer exists in the AD (for example, a user or group was ACL’d on a folder and then deleted), or the security entity does exist but has not yet been replicated to the AD from 5.5.  In the first case, it would make sense to simply drop the entry.  But in the second case, that could lead to incorrect security behavior.  An ACL entry whose DN cannot be resolved in this way is called a zombie user.
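
The lookup described above can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: the directory is modeled as a plain dictionary from legacy exchange DN to SID, the DNs and SIDs are made up, and the case-insensitive DN comparison is an assumption rather than documented behavior.  The point is that a failed lookup looks the same whether the object was deleted or simply has not replicated yet.

```python
def convert_acl(legacy_acl, sid_by_legacy_dn):
    """Convert (dn, rights) pairs to (sid, rights) pairs.

    Returns the converted entries plus the entries whose DN could not be
    resolved. The lookup alone cannot say whether an unresolved DN belongs
    to a deleted object or to one that has not been replicated yet.
    """
    converted, zombies = [], []
    for dn, rights in legacy_acl:
        # Assumption: DNs are compared case-insensitively in this sketch.
        sid = sid_by_legacy_dn.get(dn.lower())
        if sid is None:
            zombies.append((dn, rights))
        else:
            converted.append((sid, rights))
    return converted, zombies


# Made-up directory contents: legacy exchange DN -> SID.
directory = {
    "/o=example/ou=site1/cn=recipients/cn=alice": "S-1-5-21-1-2-3-1001",
}
acl = [
    ("/o=Example/ou=Site1/cn=Recipients/cn=Alice", 0x3),
    ("/o=Example/ou=Site1/cn=Recipients/cn=Bob", 0x1),
]

converted, zombies = convert_acl(acl, directory)
print(converted)  # Alice resolves to a SID
print(zombies)    # Bob is a zombie for now

# Once Bob's object replicates into the directory, a retry resolves the entry.
directory["/o=example/ou=site1/cn=recipients/cn=bob"] = "S-1-5-21-1-2-3-1002"
converted, zombies = convert_acl(acl, directory)
print(zombies)
```

Keeping the unresolved entries in a separate list, rather than dropping them, is what allows the second attempt to succeed after replication catches up.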
 
This issue was known fairly early on in the design of E2K but it was pretty much considered to be something we were stuck with.  It was thought (mistakenly as it turns out) that this problem could be minimized through the use of the DSIS consistency checker.  This tool went through an Exchange 5.5 database, examined each ACL, and removed any entries for users or groups that no longer existed in the directory.  As long as this was run first, it should have prevented zombies from being replicated to E2K.  This didn’t work out too well for a number of reasons.  One incorrect assumption on our part was that directories were fairly static things, at least with regard to zombie users.  On the contrary, directories are quite dynamic and new zombies can get created all the time.  Another was that we underestimated the amount of time customers would remain in mixed 5.5/E2K environments and didn’t account for the fact that as long as public folders remained in 5.5 these problems would occur over and over again.
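
A rough Python sketch shows why a one-time cleanup pass does not stay clean.  It is not the real tool: the function names, the set-based directory, and the short DNs are hypothetical stand-ins.

```python
def prune_acl(acl, directory_dns):
    """Keep only the (dn, rights) entries whose DN still exists in the directory."""
    return [(dn, rights) for dn, rights in acl if dn in directory_dns]


def orphaned(acl, directory_dns):
    """List the DNs in an ACL that the directory no longer knows about."""
    return [dn for dn, _ in acl if dn not in directory_dns]


# Made-up 5.5 directory and folder ACL.
directory_dns = {"cn=alice", "cn=bob"}
folder_acl = [("cn=alice", 0x1), ("cn=bob", 0x3), ("cn=carol", 0x1)]

# The cleanup pass removes the entry for carol, who is already gone.
cleaned = prune_acl(folder_acl, directory_dns)
print(orphaned(cleaned, directory_dns))

# The directory keeps changing: bob is deleted after the pass has run,
# so the "clean" ACL now contains a new orphan.
directory_dns.discard("cn=bob")
print(orphaned(cleaned, directory_dns))
```

The first check finds nothing left to remove, while the second finds a fresh orphan, so the pass would have to be repeated for as long as the 5.5 side keeps changing.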
 
A final complication regarding 5.5 ACL conversion was the complexity of the ACL’s.  It was thought that even very complex ACL’s would contain at most a couple of dozen entries.  As it turns out, I have seen ACL’s with 50-100 entries quite frequently.  Further, group conversions (from distribution lists to security groups) involve even more difficulties.
 
Our initial upgrade algorithm was fairly basic and had a number of problems.  We have since learned our lesson and have made a series of improvements to this logic in service packs and Exchange 2003.  Hopefully, this issue has now been greatly reduced and will no longer present much of an upgrade obstacle.
 
-
Jon Avner

Comments (3)
  1. Gary MacDonald says:

    Our site is still a mixed Ex55/Ex2K3 environment six months after the migration, and it will probably be another three months before we go native. We’re seeing scores if not hundreds of zombies in our event logs. These are very tedious to manually remove. Can anyone provide a pointer or two to a script-based solution for removing them?

  2. Jon Avner says:

    As long as you are not potentially generating new zombies by adding or modifying users and groups on the 5.5 side, you can try this:

    – Set the "Ignore zombie users" reg value to true (see KB:324323 for details).

    – If you are experiencing these on public folders, force ACL upgrades on all public folders by walking them and accessing the ACL’s. PFDAVAdmin can do this. See http://blogs.msdn.com/exchange/archive/2004/11/05/252979.aspx.

    – If these are occurring on private mailboxes, you are on your own for a tool that will walk all the folders in all the mailboxes.

    – Set the reg value back to false.

    You don’t want to leave this value on forever because dropping zombies could result in security holes under the right set of circumstances and some users may get additional access they weren’t supposed to have. But if you’re desperate to be rid of them, it’s something you can do.

  3. Vermyndax says:

    How strange… I just finished posting a blog entry about something similar. We’re seeing zombie user event log entries, but the users mentioned in the log entries exist fine in the directory.

    I’m sure it’s a replication problem, but by and large the nastiest replication problem I’ve experienced is the one I *just* blogged about here:

    http://blogs.galaxycow.com/vermyndax/archive/2004/12/01/1103.aspx

    In our situation, 5.5 and 2003 did not populate the "member" attribute of distribution list objects in active directory but DID populate the "memberOf" attribute. Nasty, nasty problem!

