The Effects Of Archival Stubs On Database Space Management

Update: The Exchange 2010 issue was resolved in SP2 RU1.

Recently, there have been some theories flying around the blogosphere about the way archive stubbing affects space reclamation in Exchange databases. Specifically, some have questioned whether Exchange will properly reclaim the space when the size of an item in the database shrinks. I contend that yes, Exchange will properly reclaim that space. But you don’t have to take my word for it. I will show you exactly how I tested it, so that you can test it yourself and draw your own conclusions.

First, however, I want to point out that we are currently investigating an issue with Exchange 2010 where database space is not reclaimed, but this issue does not necessarily have anything to do with archival or stubbing. We believe we have reproduced the behavior without those factors in the mix at all, and we are still trying to understand the behavior. This issue only affects Exchange 2010. As a result, I do not recommend trying to test this on Exchange 2010 at this time. The space may not be reclaimed as expected, and you could erroneously conclude that stubbing is at fault, when it’s actually due to the issue that we are investigating.

That said, let’s get started with the test. The point of my test was to determine whether Exchange frees up space in the database when a large message body is replaced with a small one. Because of the outstanding issue we’re looking at, this can’t be tested reliably on 2010, but it can easily be tested on 2007 or 2003. I decided to perform a simple test on Exchange 2007. Here’s what I did.

The Test

First, I created a brand new database on my Exchange 2007 server. On that database I put a single mailbox. Then, I wrote a simple script to send a bunch of messages with large bodies to that mailbox. Here’s the script:

# Send-SMTPMessages
# Sends a batch of test messages with large bodies to one recipient over SMTP.

param([string]$ServerName, [string]$RecipientAddress, [int]$NumberOfMessages)

# Build the message body.
# 25 = 5000 byte body (25 iterations * 2 bytes for each character * 100 characters)
$bodySB = new-object System.Text.StringBuilder
for ($x = 0; $x -lt 25; $x++)
{
    # Append returns the StringBuilder; assigning it keeps it out of the script output.
    $foo = $bodySB.Append("1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890")
}

$smtpClient = new-object System.Net.Mail.SmtpClient($ServerName)

# Send the requested number of messages, numbering the subjects from 1.
for ($x = 0; $x -lt $NumberOfMessages; $x++)
{
    ("Sending message " + $x.ToString())
    $message = new-object System.Net.Mail.MailMessage("sendsmtpmsgscript@contoso.com", $RecipientAddress, ("Message " + ($x+1).ToString()), $bodySB.ToString())
    $smtpClient.Send($message)
}

I ran the script with the following syntax:

.\Send-SMTPMessages MyServerName mytestmailbox@contoso.com 10000

As the script is written above, this will generate 10,000 messages (which I specified at the command line), with each of those messages having a 5000-byte message body. You can adjust the size of the message body by altering the number of iterations of the for loop, as I noted in the script. All those messages will end up in the Inbox of your test mailbox.
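As a quick check of the arithmetic in the script comment, here is a minimal sketch of the relationship between the loop count and the body size. It is written in Python only so that it can run anywhere. The function names are made up for illustration, and the 2 bytes per character figure is taken from the comment in the script rather than measured.

```python
# Sketch only: mirrors the arithmetic in the script comment.
# Assumes 100 characters per Append call and 2 bytes per stored character.
CHARS_PER_APPEND = 100
BYTES_PER_CHAR = 2


def body_size_bytes(iterations):
    """Approximate body size produced by the given number of loop iterations."""
    return iterations * CHARS_PER_APPEND * BYTES_PER_CHAR


def iterations_for(target_bytes):
    """Smallest loop count whose body is at least target_bytes."""
    per_iteration = CHARS_PER_APPEND * BYTES_PER_CHAR
    return -(-target_bytes // per_iteration)  # ceiling division


if __name__ == "__main__":
    print("25 iterations ->", body_size_bytes(25), "bytes")
    for target in (2000, 5000, 7000):
        print(target, "bytes ->", iterations_for(target), "iterations")
```

Under these assumptions, changing the 25 in the for loop to the value returned by iterations_for gives a body of roughly the target size.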

With that done, I let online defrag run to make sure there was no free space (I set the schedule and watched for the 1221 event), and then I dismounted the database and generated a space dump with eseutil /ms.

When you’re looking at space in an Exchange database, it’s important to understand how the database is structured. In all versions of Exchange prior to 2010, the database contained one big Msg table. That table held the bodies of every single message, for every single folder, for every single mailbox in the database. It was all stuffed into one huge table (attachments were stored in another huge table).

So, as expected, my space dump showed that the vast majority of the content of the database was in the Msg table. Here is a snippet of the actual output:

Name             Type  ObjidFDP  PgnoFDP  PriExt  Owned  Available
Msg              Tbl         19       50     2-m  11270         16
<Long Values>    LV          98       51     1-m  10046          7

“Owned” tells us how many pages belong to this table, and “Available” tells us how many of those pages are free. I had just created 10,000 new messages, so as you would expect, there was very little free space in the table.
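To put those page counts into more familiar units, the sketch below converts Owned and Available into megabytes and a percentage free. It assumes the 8192-byte page size of Exchange 2007 and uses binary megabytes (MiB). The function names are illustrative and are not part of eseutil.

```python
PAGE_SIZE_BYTES = 8192  # Exchange 2007 page size; Exchange 2003 uses 4096


def pages_to_mib(pages, page_size=PAGE_SIZE_BYTES):
    """Convert a page count from the space dump into MiB."""
    return pages * page_size / 2**20


def percent_free(owned, available):
    """Share of a table's owned pages that are currently available."""
    return 100.0 * available / owned


# (Owned, Available) values copied from the space dump above.
rows = {"Msg": (11270, 16), "<Long Values>": (10046, 7)}

for name, (owned, available) in rows.items():
    print(f"{name}: {pages_to_mib(owned):.1f} MiB owned, "
          f"{percent_free(owned, available):.2f}% free")
```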

With that done, I mounted the database and ran another script. I wrote this second script to simulate the act of stubbing. It iterates over every message in the Inbox and changes the body to simply say “Stubbed!”, reducing the body size from 5000 bytes to less than 100 (it actually comes out to something like 80 bytes, because this method of setting the body generates some HTML around the text). Here’s the script:

# Stub-AllMessagesInMailbox.ps1
# This script requires the EWS Managed API to be installed.

param($ServerName, $UserName, $Password, $DomainName)

Import-Module -Name "C:\Program Files\Microsoft\Exchange\Web Services\1.0\Microsoft.Exchange.WebServices.dll"

# Connect to EWS and bind to the Inbox of the mailbox that owns the credentials.
$exchService = new-object Microsoft.Exchange.WebServices.Data.ExchangeService([Microsoft.Exchange.WebServices.Data.ExchangeVersion]::Exchange2007_SP1)
$exchService.Credentials = new-object System.Net.NetworkCredential($UserName, $Password, $DomainName)
$exchService.Url = new-object System.Uri(("https://" + $ServerName + "/EWS/Exchange.asmx"))
$inbox = [Microsoft.Exchange.WebServices.Data.Folder]::Bind($exchService, [Microsoft.Exchange.WebServices.Data.WellKnownFolderName]::Inbox)

# Page through the Inbox 100 items at a time, replacing each body with a short stub.
$offset = 0
$itemView = new-object Microsoft.Exchange.WebServices.Data.ItemView(100)
while (($inboxItems = $inbox.FindItems($itemView)).Items.Count -gt 0)
{
    foreach ($item in $inboxItems)
    {
        ("Stubbing item: " + $item.Subject)
        $item.Body = "Stubbed!"
        $item.Update([Microsoft.Exchange.WebServices.Data.ConflictResolutionMode]::AlwaysOverwrite)
    }

    # Advance the view to the next page of results.
    $offset += $inboxItems.Items.Count
    $itemView = new-object Microsoft.Exchange.WebServices.Data.ItemView(100, $offset)
}

After all my items were “stubbed”, I immediately dismounted the database again and ran another space dump, without letting online defrag run. The result surprised me:

Name             Type  ObjidFDP  PgnoFDP  PriExt  Owned  Available
Msg              Tbl         19       50     2-m  11414         12
<Long Values>    LV          98       51     1-m  10054       7952

My Owned pages went up slightly, but I instantly had 7,952 pages free in the LV tree of the Msg table – about 62 MB. And this was before allowing online defrag to run. After online defrag I had even more free pages:

Name             Type  ObjidFDP  PgnoFDP  PriExt  Owned  Available
Msg              Tbl         19       50     2-m  11414         16
<Long Values>    LV          98       51     1-m  10050       8207
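To see how much of the free space appeared at each step, the Long Values rows from the three dumps can be compared directly. The sketch below is in Python and assumes only that Owned and Available are the last two whitespace-separated columns of a row, as in the snippets shown here. The helper names are made up for illustration.

```python
def owned_available(row):
    """Return (owned, available) from one space dump row (its last two columns)."""
    *_, owned, available = row.split()
    return int(owned), int(available)


# Long Values rows copied from the three dumps in this article.
LV_ROWS = [
    ("before stubbing",     "<Long Values> LV 98 51 1-m 10046 7"),
    ("after stubbing",      "<Long Values> LV 98 51 1-m 10054 7952"),
    ("after online defrag", "<Long Values> LV 98 51 1-m 10050 8207"),
]


def freed_per_step(rows):
    """Available pages gained between each pair of consecutive dumps."""
    free = [owned_available(row)[1] for _, row in rows]
    return [later - earlier for earlier, later in zip(free, free[1:])]


gains = freed_per_step(LV_ROWS)
for (label, _), gain in zip(LV_ROWS[1:], gains):
    print(f"{label}: +{gain} available pages")
print(f"share freed before online defrag ran: {gains[0] / sum(gains):.1%}")
```

On these numbers, nearly all of the pages that ended up available were already available before online defrag ran.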

I repeated this test using message bodies of 2000 bytes and 7000 bytes, and the results were the same. I tried stubbing 5000-byte items down to 1078 bytes instead of 100 bytes, and the results were the same. I even tried stubbing only every 5th message instead of all the messages. In all my tests, Exchange 2007 freed a ton of pages when I reduced the size of the message bodies.

Conclusions From Testing

From this test, it is quite clear that space is reclaimed after stubbing the items, and not just a small amount, either. In fact, we reclaimed more space than the size of the items alone would suggest: 5,000 bytes x 10,000 messages is roughly 50 MB, but we got back about 64 MB (8,207 pages x 8,192 bytes per page = 67,231,744 bytes). This is likely due to page fragmentation from before the stubbing. That is, if our items are 5k in size and we have 8k pages in Exchange 2007, we can’t fit two items on one page, so each of those pages carries some unused space. After we stubbed, we were able to pack items onto pages more efficiently, and the space wasted by fragmentation actually decreased.
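The fragmentation argument can be illustrated with a deliberately simplified packing model. The Python sketch below assumes that a page holds only whole items and ignores page headers, per-record overhead, and the way ESE actually lays out long values, so the absolute numbers are illustrative rather than a description of the real storage engine. The function names are made up for this example.

```python
def items_per_page(item_bytes, page_bytes=8192):
    """Whole items that fit on one page in this simplified model."""
    if not 0 < item_bytes <= page_bytes:
        raise ValueError("this model only covers items that fit on one page")
    return page_bytes // item_bytes


def pages_needed(item_count, item_bytes, page_bytes=8192):
    """Pages required to hold item_count items of equal size."""
    per_page = items_per_page(item_bytes, page_bytes)
    return -(-item_count // per_page)  # ceiling division


def wasted_bytes_per_page(item_bytes, page_bytes=8192):
    """Bytes left unused on a full page because another item will not fit."""
    return page_bytes - items_per_page(item_bytes, page_bytes) * item_bytes


if __name__ == "__main__":
    for size in (5000, 80):
        print(f"{size}-byte items: {items_per_page(size)} per page, "
              f"{pages_needed(10000, size)} pages for 10,000 items, "
              f"{wasted_bytes_per_page(size)} bytes unused per full page")
    print("reclaimed bytes:", 8207 * 8192)
```

In this model, 10,000 items of 5,000 bytes need 10,000 pages at one item per page, which is close to the 10,046 pages owned by the Long Values tree in the first dump. That is consistent with the fragmentation explanation, though it does not prove it. Passing page_bytes=4096 applies the same model to the 4k pages of Exchange 2003.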

Page fragmentation might also explain why the Msg table was about 90 MB in size before we stubbed, when we had only added 50 MB of message bodies. This is a great example of why you can’t just add up mailbox sizes to determine how big the database file should be. It has a lot to do with what sort of data is in the database.

So What’s The Deal With Stubbing?

A few years back, I worked on a series of Exchange 2003 cases where customers were using stub archival (also known as shortcuts), and they were concerned at the discrepancy between reported mailbox size and actual database size. Several of those customers sent me databases, and I spent a few months working with the archival software vendor to analyze the database contents and understand the reason for the difference. Our findings revealed that the mystery space was being taken up by two types of overhead.

The first type of overhead is properties that simply don't count toward the size of the message. When you look at the size of a message in Outlook, the size you see is not the total size of all the data that Exchange stores for that message. We store certain properties that are not reflected in the message size and are not counted against the user. These may be publicly documented properties such as PR_URL_NAME, or internal properties that are not publicly documented.

The second type of overhead is page fragmentation. A page in an Exchange 2003 database is 4k in size. Each record in the database has to be arranged onto these pages, and depending on how efficiently we are able to do so, there will be some space left on the page. This is page fragmentation, which results in empty space that cannot be reclaimed by online maintenance.

The more an Exchange database is made up of many tiny items, the more apparent these two kinds of overhead become. I worked with the vendor to set up a lab environment where we could test this theory with the vendor’s real archival software. We tested several scenarios, but I think the final test was the most interesting. We used LoadSim to fill a database with 1.4 million messages ranging in size from 1k to 4k with no attachments. Even before archiving, the database had about 30% overhead from fragmentation and 20% from properties that don't count toward message size. This showed that the behavior can be reproduced even without archival, simply by filling the database with lots of small items. After archiving everything and leaving shortcuts in place, the combined overhead rose to 70% (because all the 4k items became much smaller shortcuts), even though archiving did free up about 15% of the database. The same behavior would be expected from any archival product that is configured to leave shortcuts behind.

Exchange will continue to work just fine in this scenario. There's nothing inherently wrong with it. It's just that when you archive email messages and leave behind a shortcut, you are taking out the big properties like message body and attachments and leaving a small stub, but all the overhead associated with the item remains. Although you do save disk space by taking out the big stuff, the remaining tiny items still have the same amount of overhead as a normal item. The ratio of overhead to actual email rises, so the database size will be considerably larger than what you would expect from adding up the reported mailbox sizes.
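A small worked example shows why the ratio rises even though total disk usage falls. The byte counts in this Python sketch are hypothetical values chosen only to illustrate the effect. They are not measurements from the lab tests described above, and the function name is made up.

```python
def overhead_share(counted_bytes, overhead_bytes):
    """Fraction of an item's stored bytes that is overhead."""
    return overhead_bytes / (counted_bytes + overhead_bytes)


# Hypothetical sizes, for illustration only.
OVERHEAD_BYTES = 1000   # per-item overhead that is not counted in the mailbox size
FULL_ITEM_BYTES = 4000  # counted size of a message before archiving
SHORTCUT_BYTES = 200    # counted size of the shortcut left behind

before = overhead_share(FULL_ITEM_BYTES, OVERHEAD_BYTES)
after = overhead_share(SHORTCUT_BYTES, OVERHEAD_BYTES)

stored_before = FULL_ITEM_BYTES + OVERHEAD_BYTES
stored_after = SHORTCUT_BYTES + OVERHEAD_BYTES

print(f"overhead share before archiving: {before:.0%}")
print(f"overhead share after archiving:  {after:.0%}")
print(f"stored bytes per item: {stored_before} -> {stored_after}")
```

With these made-up numbers, each item takes far less space after archiving, yet most of what remains is overhead that the reported mailbox size does not show.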

We concluded that if you want to reduce the overhead, the best approach is to expire the shortcuts after some time limit. To the extent you expire the shortcuts, you should see the overhead come down. Of course, another option is to let all the shortcuts stay in the database, knowing that your database is perfectly healthy despite the size difference between the mailbox sizes and the file size.

So what about Exchange 2007? If anything, I would expect page fragmentation to be somewhat less of an issue in 2007 than 2003 due to the larger page size. That should make it easier to arrange several stubbed items onto a single page. But again, this will depend a lot on what sort of items you put in the database.

In Conclusion

I avoided discussing Exchange 2010 here, because the issue with space reclamation which we are currently investigating makes it very difficult to test and draw any conclusions about how 2010 will behave.

However, my findings in testing Exchange 2007 are clear: When the items shrink, we reclaim a lot of space. Background cleanup is able to grab much of the space even before online defrag comes along. Maybe someone can find fault with my testing, or maybe someone can come up with a reproducible scenario where we don’t reclaim the space. If you can, I would love to hear from you.

Let me clearly state: At this time, there is no known issue with archival stubs beyond the increased overhead ratio I described above.