Three node Availability Group set, with two asynchronous secondaries. Windows Server 2012 R2, with SQL Server 2014 RTM across all nodes. One of the secondaries is in read-only mode and carries an additional workload for reporting and analysis.
One of the asynchronous secondaries suffers an Out of Memory (OOM) condition and restarts. Investigations and root cause analysis required. GO…..
First place to start looking would be the SQL Server ERRORLOG. Investigations here showed a DBCC MEMORYSTATUS dump of the various memory clerks present within SQL Server, followed by the:
<Date><Time><spidID> Failed to allocate BUFs: FAIL_BUFFER_ALLOCATION 1
message and finally messages indicating a shutdown and restart of the instance.
Scanning through the list of memory clerks, one clerk in particular stood-out; the clerk called MEMORYCLERK_SQLLOGPOOL at a whopping 17GB+. The instance in question has 256GB of memory allocated to it, and with the workload which was present, coupled with the memory requirements for the SQLLOGPOOL – memory was completely consumed. So, in essence, the SQLLOGPOOL clerk has grown to such an extent that the other memory clerks (including the Buffer Pool) have been put under pressure to the point that no pages could be allocated and the OOM condition occurs. The question then becomes – “What has caused the SQLLOGPOOL clerk to grow to such a size”.
The SQLLOGPOOL clerk is a memory cache container for log blocks which are expected to be needed in the near future. Transaction log blocks get hardened to the transaction log, and also copied into the SQLLOGPOOL cache to make access much faster. These uses usually include:
· Transactional Replication requirements
· Change Data Capture requirements
· AlwaysOn asynchronous redo thread activity
· Change Tracking
So, with only AlwaysOn technologies in play, focus immediately falls onto the AlwaysOn asynchronous redo thread. In normal operation, we would expect the redo thread to process the log blocks in the SQLLOGPOOL and mark them as “no longer required”. These entries could then be cleaned up / over-written by newer entries. However, if the redo thread is struggling to keep up with the rate of transactional log activity, we can see the SQLLOGPOOL will start to grow in size*.
*As an aside… there is a trace flag (9023) to control the size of the SQLLOGPOOL cache, and place a 512MB upper limit on its size. This should prevent any further Out Of Memory incidents from this particular issue, but is unlikely to prevent the node from running behind. The redo will simply no longer find the required TLOG entries in the SQLLOGPOOL cache, will generate a cache miss event and then proceed to use the TLOG file from disk.
Bearing in mind that the platform is built upon a 3 node AlwaysOn Availability group set, where the secondaries are asynchronous replicas – they will always be out of sync with the primary node – by definition. However, the amount of skew seen between the replicas within the set will be aggravated by several factors:
1) The transaction volume present on the system
2) The rate of transmission of data from the primary to the replica(s) (i.e. network)
3) The rate at which the re-do thread on the replicas works to re-play the transaction log entries into the recovering (read-only) databases
Visibility of the lag can be seen in several ways:
1) Monitoring the Send Queue Length, the Send Queue Rate, the Redo Queue Length and Redo Queue rate (via script or AlwaysOn Dashboard)
2) Monitoring the LSNs and Commit times via a script
3) Monitoring the size of the SQL_LOGPOOL memory clerk
Running the scripts to determine the amount of log present on the replicas shows that one of the availability group nodes is skewed, with a large amount of log blocks awaiting replay via the re-do thread into the replica. We can also observe the SQLLOGPOOL clerk growing to a very large size (17GB+) in one case causing the Out Of Memory condition which originally triggered the investigation.
Looking into the AlwaysOn health extended event session, we can clearly see entries showing “redo_thread_blocked” on multiple occasions.
Now we are making progress.
So, something is managing to block the re-do thread, causing the log blocks on the replica to stack up, skew the Hardened LSNs and the replica commit times, and eventually bloat the SQLLOGPOOL clerk to such an extent that it starves the buffer pool of pages and triggers an OOM condition.
The only item which can block the redo thread is the presence of SCH_S / SCH_M locks (schema stability or schema modification locks). Given that the redo blocking is occurring on a read-only replica
– How could this be??
This is running a read-only workload right?!? So how could we get SCH_S / SCH_M locks?
Well….. Our original environment definition stated we had a read-only workload running on the node which is experiencing the issue. What could be generating the schema based locks which would block the redo?
How about statistics??
SQL Server employs two types of schema locks, both of which are taken when it updates the statistics for a table:
Sch-S: Schema Stability Lock
This lock ensures that a schema element, such as a table or index, will not be dropped while any session holds a schema stability lock on the schema element.
Sch-M-UPD-STATS: Schema Modification Lock
This is a non-blocking lock that is used by the system to ensure that only one automatic UPDATE STATISTICS process is run against a table at any given point in time. The sp_lock stored procedure will report this lock has having a type = TAB, resouce = UPD-STATS and mode = SCH-M.
Although the statistics within the read-only replica database will filter through as part of the transaction log changes in the normal manner – what about temporary statistics..? If we have a situation whereby the read-only workload *does not* have supporting statistics present within the database, the optimiser will create temporary statistics and store them in TempDB as described here:
So – how does this work?… If the query optimiser needs statistics on a particular column in order to generate an optimal plan (and auto create statistics / auto update statistics is turned on at the database level) it will scan the table to generate the statistics object. This scan will cause a SCH_S / SCH_M lock on the base table as the scan occurs (in turn blocking the redo thread and generating the extended event error and the increase in the SQL_LOGPOOL size), and will be held for the duration of the scan. The resulting statistics object is then saved into TempDB for continued use. If the server is re-started, the TempDB database will be flushed and re-created, and as such, the temporary statistics will be lost and also need to be re-created.
Looking into the read-only asynchronous node, we can see evidence that temporary statistics have been generated (see the is_temporary column on sys.stats), and evidence that system generated statistics are also being generated (statistic name like ‘_WA_sys_%’) with a stats_date of today.
So – final piece of proof required. Create two extended event sessions which show:
1) auto_stat events **
2) Sch_s / sch_m lock events
And use the information returned to correlate between the auto_stat event firing and the existence of the SCH_S / SCH_M locks which then block the redo thread.
** There is also a trace flag 8721 which can be applied to show statistics information, and have it dropped into the ERRORLOG with trace flag 3605. This is actually a little more useful than the extended event, as this also shows the duration that the statistics object took to rebuild.
The upshot of the analysis is that the optimiser is receiving queries as part of the read-only batch, and does not find the statistics objects which it needs to generate optimal plans. As such, the optimiser then generates a temporary statistics object, stores it in TempDB and uses that to generate the plan which is then executed. The generation of the temporary statistic takes time (the larger the table and scan, the longer it will take) and during this time, a SCH_S lock is held, with a SCH_M lock to swap the new statistics blob in at the end. These SCH_S and SCH_M locks block the redo thread and causes the growth of the SQL_LOGPOOL cache, the resulting lag between nodes and the OOM condition.
I would begin by investigating all of the temporary statistics from the read-only node, and create them as column level statistics in the primary replica. The optimiser obviously thinks it needs them, so creating these statistics in the primary replica will ensure that they filter through to the read-only replica and were then in place to support the read-only batch at run time. This should also safe guard against having to re-create them in the case where the server is re-started and TempDB is recreated. I would also look to do that same exercise with the system generated statistics present within the database and look to build column level statistics for them, as again, the optimised obviously thinks they are useful and they are used on a daily basis.
Download the code for this post here: DT_Appendix