How To Identify Bad Items In Public Folder Replication

As I’ve previously discussed here and here, there are various things that will prevent an item from replicating to an Exchange 2007 or Exchange 2010 public folder store – things like bad start/end dates on appointments, category names that contain certain characters, etc. However, once you’ve figured out that replication is failing because of one of these problems, it can still be tricky to identify which particular items are the ‘bad’ items. In this post, I want to share the approach we usually use to track these down.

Now, you may be thinking, “Well, I can just compare the source and the destination to see which items didn’t make it over, and those are my bad items, right?” Not quite.

When we replicate changes for a particular folder, we pack up as many changes as we can until we meet the Replication Message Size Limit configured on the public folder store. This applies to both the replication of new data and the backfill of existing data. As a result, when you observe a backfill response for a folder with a bunch of messages, you typically see a big list of items in the event:

Event Type:    Information
Event Source:    MSExchangeIS Public Store
Event Category:    Replication Outgoing Messages
Event ID:    3021
Description:
An outgoing replication message was issued.

Type: 0x80000004
Message ID: <003346D6BDE71245954899B0FC2F3702017DCF@bilongtimb1.bilong.test>
Folder: (1-16771) IPM_SUBTREE\FolderA\Subfolder1

Database "First Storage Group\Public Folder Store (BILONGTIMB1)".
CNSET: 1-1,1-1791E

CNSET (FAI): 1-1,1-1791E

Message IDs: 90
  1: 1-1687B, 1-16C45
--- : Test 10 : 5/19/2010 4:16:12 PM
2: 1-1687C, 1-16C46
--- : Test 10 : 5/19/2010 4:16:12 PM
3: 1-16886, 1-16C47
--- : Test 10 : 5/19/2010 4:16:12 PM
4: 1-16890, 1-16C48
--- : Test 10 : 5/19/2010 4:16:12 PM
5: 1-1689A, 1-16C49
--- : Test 10 : 5/19/2010 4:16:12 PM
6: 1-168A4, 1-16C4A
--- : Test 10 : 5/19/2010 4:16:12 PM
7: 1-168AE, 1-16C4B
--- : Test 10 : 5/19/2010 4:16:12 PM
8: 1-168B8, 1-16C4C
--- : Test 10 : 5/19/2010 4:16:12 PM
9: 1-168C2, 1-16C4D
--- : Test 10 : 5/19/2010 4:16:12 PM
10: 1-1687A, 1-16C4E
--- : Test 9 : 5/19/2010 4:16:09 PM
11: 1-1687D, 1-16C4F
--- : Test 9 : 5/19/2010 4:16:09 PM
12: 1-16887, 1-16C50
--- : Test 9 : 5/19/2010 4:16:09 PM
13: 1-16891, 1-16C51
--- : Test 9 : 5/19/2010 4:16:09 PM
14: 1-1689B, 1-16C52
--- : Test 9 : 5/19/2010 4:16:09 PM
15: 1-168A5, 1-16C53
--- : Test 9 : 5/19/2010 4:16:09 PM
16: 1-168AF, 1-16C54
--- : Test 9 : 5/19/2010 4:16:09 PM
17: 1-168B9, 1-16C55
--- : Test 9 : 5/19/2010 4:16:09 PM
18: 1-168C3, 1-16C56
--- : Test 9 : 5/19/2010 4:16:09 PM
19: 1-16879, 1-16C57
--- : Test 8 : 5/19/2010 4:16:05 PM
20: 1-1687E, 1-16C58
--- : Test 8 : 5/19/2010 4:16:05 PM
21: 1-16888, 1-16C59
--- : Test 8 : 5/19/2010 4:16:05 PM
22: 1-16892, 1-16C5A
--- : Test 8 : 5/19/2010 4:16:05 PM
23: 1-1689C, 1-16C5B
--- : Test 8 : 5/19/2010 4:16:05 PM
24: 1-168A6, 1-16C5C
--- : Test 8 : 5/19/2010 4:16:05 PM
25: 1-168B0, 1-16C5D
--- : Test 8 : 5/19/2010 4:16:05 PM
26: 1-168BA, 1-16C5E
--- : Test 8 : 5/19/2010 4:16:05 PM
27: 1-168C4, 1-16C5F
--- : Test 8 : 5/19/2010 4:16:05 PM
28: 1-16878, 1-16C60
--- : Test 7 : 5/19/2010 4:15:59 PM
29: 1-1687F, 1-16C61
--- : Test 7 : 5/19/2010 4:15:59 PM
30: 1-16889, 1-16C62
--- : Test 7 : 5/19/2010 4:15:59 PM
31: 1-16893, 1-16C63
--- : Test 7 : 5/19/2010 4:15:59 PM
32: 1-1689D, 1-16C64
--- : Test 7 : 5/19/2010 4:15:59 PM
33: 1-168A7, 1-16C65
--- : Test 7 : 5/19/2010 4:15:59 PM
34: 1-168B1, 1-16C66
--- : Test 7 : 5/19/2010 4:15:59 PM
35: 1-168BB, 1-16C67
--- : Test 7 : 5...
MIDSET Deleted: 1-1,1-16871

Server: /O=FIRST ORGANIZATION/OU=EXCHANGE ADMINISTRATIVE GROUP (FYDIBOHF23SPDLT)/CN=CONFIGURATION/CN=SERVERS/CN=BILONGE12MB1/CN=MICROSOFT PUBLIC MDB

One thing to notice is that the end of the list of items you just have a “…”. In fact, it doesn’t even give the whole date for that last item – the date is truncated. This “…” at the end means there was more stuff there, but we didn’t log the rest. Because of this, you can’t really see the entire list of items that were included in this backfill response.

Anyway, if any one of these items fails in content conversion, all the other items in this message also fail to replicate. Because of this, a single problem item can prevent many other items from replicating, even when those other items are just fine.

Fortunately, there’s an easy way to find the problem items. You can let the replication engine narrow down the problem for you just by changing one simple setting.

On the store that’s sending the backfill response (the store that has the data), set the replication message size limit to 1KB. Here are some example screenshots of the size limit from various tools:

sizelimit-Ti

sizelimit-E12 sizelimit-E12cmd

In the top two screenshots, I had already changed the limit to 1KB. In the bottom one I had not changed it yet, so it was still at its default of 300KB.

Once you’ve turned up logging, changed the limit, and re-mounted the public store, go to the server that is missing the data and force a backfill timeout on that folder. In Exchange 2007 and 2010, you do this with the Update-PublicFolder cmdlet, or by right-clicking on the folder and choosing “Update Content” in the PF management GUI. In Exchange 2003, you do this from ESM by going to the public folder tree, clicking on the folder in question in the left-hand pane, choosing the Status tab in the right-hand pane, right-clicking on the server that doesn’t have the data, and choosing “Synchronize Content”.

Because of the tiny replication size limit, instead of a few big replication messages, you’ll see a ton of tiny ones. Assuming you have logging turned up on Replication Incoming and Replication Outgoing, the application log on the server that has the data will look something like this:

applog1

That 3026 you see at the bottom is the backfill request from the other server. Then you have a huge series of 3021’s, one for each backfill response. Each of these responses will only contain one or two messages.

Once this process completes, most of the good messages will have successfully replicated. At that point, you can go and force a backfill timeout one more time. You’ll see another backfill request followed by some responses – but this time, since all the good stuff replicated, you’ll only see a few backfill responses. You can look at each of those response events to see the messages listed therein.

And with that information, you know which messages to focus on to determine what the problem is! You can force a backfill timeout over and over, and you should see the store repeatedly try to replicate those same problem items every time.

Once you’re done troubleshooting, you can change the replication message size limit back to 300KB.