Recently I was working with a customer who was experiencing database moves in their DAG. The "failovers" always happened at odd hours when no admins were available, and by the time the admins came in and tried to dump the cluster logs, the data was gone. (The cluster log is spread across three files, each of which operates in a circular fashion. Jeff Hughes explains this in greater detail here: http://blogs.technet.com/b/askcore/archive/2010/04/13/understanding-the-cluster-debug-log-in-2008.aspx) In this customer's case the logs were being overwritten within about 90 minutes.
I suggested increasing the size of the customer's cluster log, but to get the 72 hours Jeff recommends, their cluster log was going to need to be huge. We needed an alternative that would cover them if a problem occurred on Friday night and nobody picked it up until Monday morning.
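To put a rough number on "huge", here is a back-of-the-envelope sketch in Python. It assumes the log fills at a steady rate, so the required size scales linearly with the time you want to retain. The 100 MB starting size is a hypothetical figure for illustration only, not the customer's actual setting.

```python
def required_log_size_mb(current_size_mb, observed_hours, target_hours):
    """Estimate the log size needed to retain target_hours of history.

    Assumes a steady logging rate, so size scales linearly with retention.
    """
    if current_size_mb <= 0 or observed_hours <= 0 or target_hours <= 0:
        raise ValueError("all inputs must be positive")
    return current_size_mb * (target_hours / observed_hours)


# Logs wrapped in about 90 minutes (1.5 hours); the goal was 72 hours.
scale_factor = 72 / 1.5

# Hypothetical current size of 100 MB, used only to illustrate the scaling.
estimate_mb = required_log_size_mb(100, 1.5, 72)

print("Scale factor:", scale_factor)
print("Estimated size in MB:", estimate_mb)
```

Whatever the real starting size, the log would have to grow by a factor of about 48 to reach 72 hours at this logging rate, which is why we looked for another approach.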
The suggestion I gave them was simply to use the Windows Server 2008 Task Scheduler to generate their cluster logs whenever the problem occurred. In their scenario, the last event in the sequence before the server began restarting services was Event ID 2137 from MSExchangeRepl, so we used that event as the trigger. Within Task Scheduler we followed these steps:
- Under Actions select Create Task
- Go to the Triggers tab and click New
- Change the "Begin the task" selection to "On an event"
- We used the Basic settings and selected the Application log, our source (MSExchangeRepl) and the event ID (2137)
- We then chose to delay the task for 5 minutes (long enough for the services to restart) and clicked OK
- Go to the Actions tab and select New
- We chose to place our command to generate the cluster log (complete with all parameters) in a BAT file, so we specified that BAT file as the program to run and clicked OK. For details on how to generate the cluster log please see this blog: http://blogs.msdn.com/b/clustering/archive/2008/09/24/8962934.aspx
- On the General tab we gave our task a name and description
- We also selected "Run whether the user is logged on or not"
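We did everything through the GUI, but the same task can probably be created from the command line with schtasks.exe. The Python sketch below only builds the argument list; it does not run anything. The task name and BAT file path are hypothetical placeholders, and running the task as SYSTEM is my assumption for approximating "Run whether the user is logged on or not" (you may prefer a dedicated account with rights on the cluster).

```python
def build_event_query(source, event_id):
    """Build the XPath filter that matches one provider and event ID."""
    return "*[System[Provider[@Name='{}'] and EventID={}]]".format(source, event_id)


def build_schtasks_args(task_name, bat_path, channel, source, event_id,
                        delay_minutes, run_as="SYSTEM"):
    """Return the schtasks.exe arguments for an event-triggered task.

    run_as defaults to SYSTEM, which is an assumption and not what we
    configured in the GUI walkthrough.
    """
    return [
        "schtasks", "/Create",
        "/TN", task_name,                               # task name
        "/TR", bat_path,                                # program to run
        "/SC", "ONEVENT",                               # trigger on an event
        "/EC", channel,                                 # event log to watch
        "/MO", build_event_query(source, event_id),     # event filter
        "/DELAY", "{:04d}:00".format(delay_minutes),    # mmmm:ss
        "/RU", run_as,                                  # account to run under
    ]


# Hypothetical task name and BAT path, for illustration only.
args = build_schtasks_args(
    task_name="Capture Cluster Log",
    bat_path=r"C:\Scripts\GenerateClusterLog.bat",
    channel="Application",
    source="MSExchangeRepl",
    event_id=2137,
    delay_minutes=5,
)
print(args)
```

Building the arguments as a list rather than a single string avoids quoting problems with the XPath filter if you later pass the list to `subprocess.run` on the server.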
We left the task in place overnight and it successfully captured the cluster log for us. The only real challenge was selecting an event that was consistently associated with their failure, so that we would neither miss the failure nor generate a spurious log. The MSExchangeRepl event worked well in this instance because it was unique. I might have preferred using a 7031 from the System log, but the Basic trigger settings do not offer a way to filter on the text in the description. A trigger on 7031 alone would therefore not be unique, and we might have captured the wrong service failure.
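One possible way around that limitation, which we did not try with this customer, is the Custom option on the trigger, which accepts an XML event filter. The Python sketch below builds such a filter. It assumes the 7031 event stores the service display name in an EventData element named param1; check the XML view of a real event on your own server before relying on it. The display name used here is a placeholder.

```python
from xml.sax.saxutils import escape


def build_service_crash_filter(service_display_name):
    """Build a QueryList that matches event 7031 for one service only.

    Assumes the service name is stored in the Data element named 'param1'.
    """
    if "'" in service_display_name:
        raise ValueError("service name must not contain a single quote")
    system_part = "*[System[Provider[@Name='Service Control Manager'] and EventID=7031]]"
    data_part = "*[EventData[Data[@Name='param1']='{}']]".format(service_display_name)
    xpath = system_part + " and " + data_part
    return (
        '<QueryList><Query Id="0" Path="System">'
        '<Select Path="System">' + escape(xpath) + "</Select>"
        "</Query></QueryList>"
    )


# Placeholder display name; confirm the exact text in a real 7031 event.
print(build_service_crash_filter("Microsoft Exchange Replication"))
```

The filter matches on the exact value of the data element rather than searching the rendered description text, so the name has to match character for character.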
Later we were able to agree on an increased size for their cluster log, but when I last spoke to them they were still setting up the task, just in case they overran their cluster log and an admin didn't respond to their alerts in time.