I want to start off by saying that the issue I am about to talk about has been blogged on before. However, it seems to come up to me time and time again by customers and peers alike and I feel that there is a need to address it once again and widen the foot print of search results. So with that said, let’s talk about Windows Server 2008 and Windows Server 2008 R2 Failover Clustering and the way we handle cluster logging.
The Failover Clustering feature in Windows Server 2008 and Windows Server 2008 R2 has a diagnostic debug log which captures detailed information about cluster operations. This information is logged using the standard Event Tracing for Windows (ETW) infrastructure. To make viewing the output easier a text based log can be generated with the Get-ClusterLog cmdlet or the with the Cluster.exe Log /G command. The Cluster.log text log which is produced merges all available ETW logs into a single text log. At times the merging of these multiple ETW log files may produce gaps in the dates of logging that may not be intuitive. These gaps in time can be confusing and appear that logging data was lost, when that is not the case. This is merely cosmetic and not something to be concerned with; let’s discuss.
Cluster event tracing is enabled by default and the log files are stored in the %WinDir%\System32\winevt\logs\ directory in an .ETL format.
The log size is set to 100 MB by default and each log is handled in a circular logging scheme. Three separate ETL log files are kept, each time a node is rebooted a new ETL file is generated and logging is conducted to the new log file until the server is rebooted again. The ETL log file name is incremented each time and has a 00X suffix appended to it, once the maximum number of log files for retention has been hit, the first log file is overwritten, then the next in a circular fashion. Only one log file is actively being logged to at any given time. The ETL file naming is a little different depending on the OS version, in Windows Server 2008 it is ClusterLog.etl.00X and in Windows Server 2008 R2 it is in the format of Microsoft-Windows-FailoverClustering Diagnostic.etl.00X.
There are three important concepts to understand here:
1. Logging to each ETL is circular
2. Each time a node is rebooted it increments and logs to a new file
3. The Cluster.log output is a merging of all available ETL files
Here’s how this works… Node is up and running logging to an ETL file. That node is rebooted, to ensure the logging data is retained after reboot logging is now started to a new ETL file. The server is up and running logging to the 2nd ETL and that ETL file hits its 100 MB maximum size and wraps, then later the node is rebooted again and starts logging to a 3rd ETL file. Now a cluster.log file is generated, which is a single text file that appends all three ETL files together. Because the 2nd ETL file wrapped, when the different ETL’s are appended there “appears” to be a gap in the log and missing data. When the reality is that nothing is lost… this is perfectly fine.
The ETL.001 file is the active file being used by the live cluster service to write debug entries. Let’s say there were many entries being written to the current trace session file and we hit the 100MB limit at 3pm. At that point, events from 12am to 3am were truncated to make room for additional entries in the log. At that point, the ETL.001 log would look like this:
Now I want to view the cluster logs in some text readable format so I run either the ‘cluster log /gen’ command or the ‘Get-ClusterLog’ cmdlet
These commands take all the ETL.00X files and “glues” them together in the order they were created to make one contiguous text file.
<- Live Trace Session
The file that gets created is located in \%WinDir%\Cluster\Reports\ and is called CLUSTER.LOG
If you were looking at the debug entries in the CLUSTER.LOG file, you might notice that since we glued all three logs together, there is an apparent gap in the CLUSTER.LOG from 11:59pm on 1/2 to 3am on 1/3. The assumption may be that either there is information missing during that time or there was nothing being logged. The missing time is not in and of itself a problem, just a side effect of concatenating all three logs together.
So you see, this is why there are gaps in the data and you can understand how those phantom missing pieces could affect you in your endeavor of finding a root cause of your issue.
Again, I would like to thank Jeff Hughes here for his work on this topic and being kind enough to allow me to borrow his data to increase the foot print of this issue.
Please head over and read Jeff’s blog post as he goes into detail on how to determine the correct size of your log file to fit your environment. How to set the log file size using the command line and power shell and also how to determine/change the detail verbosity of what is logged to your cluster log files.
Jeff’s Blog is available –> here <-
A special thanks to Elden Christensen on this as well for the pointers and technical review.