SQL Server 2008 DR, Testing, Waits and Queues

Article
03/25/2009

Here are some notes on “SQL Server 2008 DR, Testing, Waits” I took while attending an advanced class on SQL Server taught by Ron Talmage (from https://www.solidq.com/na/MentorDetail.aspx?Id=38).

Please note that, although these notes were taken during the class, I might have added some of my own (mis)interpretation :-). Always check your facts on Books Online (I try to provide links when applicable). As with anything else you get from a blog, never use any of this in production before you thoroughly validate it a test environment. Please keep in mind that some of those will be hard to follow without some pre-requisite knowledge and the right context. Reading the post from top to bottom will help.

Disaster Recovery

RTO (Recovery Time Objective), RPO (Recovery Point Objective), RTA (Recovery Time Actual)
Backup Media (disk/tape) and Location (local, remote, offsite)
Recovery Technology (Backup/Restore, Mirroring, Log Shipping, Clustering, Replication)

DR and HA

What’s the difference between HA and DR?
How would you prepare for a disaster?
What happens if you lose the entire data center?
Primary data center, standby data center.
Consider risks, cost, time to invest.
Big question: What are you trying to protect against?

Class discussion

If you have an RPO that requires very small data loss, what to do?
Small RPO? Large RPO? Small RTO?
How far apart can the two nodes be?
Synchronous or asynchronous replication?
Consider: Clustering, Mirroring, SAN Replication, GeoClustering, Log Shipping
Planned downtime: New hardware, Upgrade, Patches, Versions, Consolidation
Unplanned downtime: Disasters
How to flip a switch to the DR site?
Virtualization support for SQL Server - https://support.microsoft.com/KB/956893
Server Virtualization Validation Program (SVVP) - https://technet.microsoft.com/en-us/library/ms143506.aspx
Windows Server 2008 Failover Cluster Configuration Program (FCCP) - See https://www.microsoft.com/windowsserver2008/en/us/failover-clustering-program-overview.aspx

Technologies

VDI - COM-based interface with SQL Server 7 and later
VDI - Supports snapshot backups using SAN-based technologies
VSS - https://blogs.technet.com/josebda/archive/2007/10/10/the-basics-of-the-volume-shadow-copy-service-vss.aspx
DPM - www.microsoft.com/dpm
Log Shipping - https://msdn.microsoft.com/en-us/library/ms187103.aspx
Database Mirroring - https://msdn.microsoft.com/en-us/library/bb934127.aspx

Benchmarking and baselining

Load-testing, benchmarking and baselining
Meet performance requirements
Verify correctness
Determine system limits
Determine impact of changes

Class Discussion

Baseline - Ground zero, known state, Point of comparison
Benchmark - Measurement under specified conditions, goal, used to estabilish baseline
TPC - https://www.tpc.org/, TPC-E for OLTP, TPC-H for DW/OLAP. Reports trans/sec and response time
Load test - Application of a load, looking for a sufficient or expected load
Stress test - Increase load. Done with concrete in construction.
Saturation test - Add load until it fails.
Endurance test – With specified load for a long time

White Paper

Tuning the Performance of Backup Compression in SQL Server 2008
https://sqlcat.com/technicalnotes/archive/2008/04/21/tuning-the-performance-of-backup-compression-in-sql-server-2008.aspx
Looking at the white paper
- Backup throughput to NUL with varying BUFFERCOUNT (what’s the baseline?)
- Disk throughput and latency during backup compression (disk saturated, increasing queue)
- Database snapshot creating time with varying workload (what’s the baseline? Load test.)

Performance counters

Tools

SQLIO, IOMeter, SQLIOSim, RML utilities, VSTS, 3rd party tools
SQLIO – https://www.microsoft.com/downloads/details.aspx?familyid=9A8B005B-84E4-4F24-8D65-CB53442D9E19&displaylang=en
IOMeter – https://www.iometer.org/
SQLIOSim - https://blogs.msdn.com/sqlserverstorageengine/archive/2006/10/06/SQLIOSim-available-for-download.aspx
SQLIOSim - correctness and stress tool, simulates data and log file activity, does not require SQL
See https://sqlblog.com/blogs/kevin_kline/archive/2007/06/28/understanding-sqliosim-output.aspx
RML utilities – OSTRESS, ReadTrace, Reporter - https://support.microsoft.com/kb/887057
See https://blogs.msdn.com/psssql/archive/2007/12/18/rml-utilities-for-microsoft-sql-server-released.aspx
See https://sqlcat.com/technicalnotes/archive/2008/02/01/precision-performance-for-microsoft-sql-server-using-rml-utilities-9-0.aspx

Waits and Queues

Wait occurs when thread has to wait for a resource
How waits reporting works on SQL Server – waiter queue and wait types
SQL Server 2000 – 76 types of waits
SQL Server 2005 – 201 types of waits
SQL Server 2008 – 484 types of waits
Query with sys.dm_os_wait_stats, reset with DBCC SQLPERF ('sys.dm_os_wait_stats', CLEAR)
See https://msdn.microsoft.com/en-us/library/ms179984.aspx
See https://blogs.msdn.com/sqlcat/archive/2005/09/05/461199.aspx
SQL Server 2005 Waits and Queues - White Paper
https://download.microsoft.com/download/4/7/a/47a548b9-249e-484c-abd7-29f31282b04d/Performance_Tuning_Waits_Queues.doc

From the white paper, regarding OLTP workloads

Database Design
- High Frequency queries having a high number of table joins (>4)
- Frequently updated tables having # indexes (>3)
- Big IOs - Table Scans, Range Scans (>1)
- Unused Indexes
CPU
- Signal Waits (>25%)
- Plan reuse (<90%)
- Parallelism: Cxpacket waits (>5%)
Memory
- Page life expectancy (<300 sec)
- Page life expectancy (Drops by 50%)
- Memory Grants Pending (>1)
- SQL cache hit ratio (<90%)
Disk
- Average Disk sec/read (>20 ms) – Comment: this should be lower (>5, >8)
- Average Disk sec/write (>20 ms) -- Comment: this should be lower (>5, >8)
- Big IOs - Table Scans, Range Scans (>1)
- ASYNCH_IO_COMPLETION, IO_COMPLETION, LOGMGR, WRITELOG, PAGEIOLATCH_x (in top 2)
- Low bytes per sec
Blocking
- Block percentage (>2%)
- Block process report (30 sec)
- Average Row Lock Waits (>100ms)
- LCK_M_% (in top 2)
- 5, High number of deadlocks (>5 per hour)
Network
- High network latency + many round trips to DB (Output queue length >2)
- Network bandwidth used up (Packets Outbound Discarded, Packets Outbound Errors, Packets Received Discarded, Packets Received Errors)

Troubleshooting Performance Problems in SQL Server 2005
https://download.microsoft.com/download/1/3/4/134644fd-05ad-4ee8-8b5a-0aed1c18a31e/TShootPerfProbs.doc

Diagnosing Transaction Log Performance Issues and Limits of the Log Manager
https://sqlcat.com/technicalnotes/archive/2008/12/09/diagnosing-transaction-log-performance-issues-and-limits-of-the-log-manager.aspx

SQL Server 2008 DR, Testing, Waits and Queues

Additional resources