SQL Server 2008 Database Checking

Here are some notes on "SQL Server 2008 Database Checking" I took while attending an advanced class on SQL Server taught by Paul Randal (https://sqlskills.com/AboutPaulSRandal.asp).

Please note that, although these notes were taken during the class, I might have added some of my own (mis)interpretation :-). Always check your facts on Books Online (I try to provide links when applicable). As with anything else you get from a blog, never use any of this in production before you thoroughly validate it a test environment. Please keep in mind that some of those will be hard to follow without the right pre-requisite knowledge and the right context. Reading the post from top to bottom will help.


Checking databases

  • Why check for corruption?
  • What to do when corruption happens?
  • Many people don't prepare for it and panic.
  • Monitor with agent alerts

IO Errors

  • 823 - hard I/O error - sql never got the data, just an error code
  • 824 - soft I/O error - sql got some data, but it came with error code
  • 825 - read/retry error
  • msdb..suspect_pages has a list

Read-Retry

  • SQL has retried a few times, error 825 goes to the log
  • Your I/O subsystems has transient problems
  • Eventual failure results in an IO error
  • Since SQL Server 2005, this was extended to data pages
  • This is not logged as an error, but you should keep an eye on it

SQLIOSim

  • Not a performance analysis tool.
  • This is to stress to system, before you introduce SQL, to check for problems
  • Documented in blog posts

Page protection options

  • Set using ALTER PAGE SET PAGE_VERIFY NONE/CHECKSUM/TORN_PAGE_DETECTION
  • TORN_PAGE_DETECTION - First two bits of each 512 bytes in the page
  • Alternating bit pattern helps identify data corruption condition

CHECKSUM

  • Simple checksum of all the bits on the page in page header
  • Enough to detect data corruption condition on the page (new in SQL 2005)
  • Last thing done to page before write, first thing checked after read
  • Always recommended to have at least one of them.
  • CHECKSUM has a CPU overhead, but it is recommended.
  • Checked:  when page is read, during CHECKDB, BACKUP with CHECKSUM
  • Available in tempdb starting with SQL Sever 2008

Automatic page repair

  • Mirroring is based on the log, so it does not mirror physical corruption
  • If you see a 823/824 error occurs and database is mirrored
  • Corrupt pages on the principal and mirror can be repaired
  • Principal and mirrored must be synchronized
  • Repairs are asynchronous, corrupt pages are unusable until repaired
  • "Database suspect data page" event class
  • Monitor last 100 corrupted pages with sys.dm_db_mirroring_auto_page_repair

DBCC CHECKDB

  • Checks the database. Only way to force reading all allocated pages in the DB.
  • May cause contention, but does not take any locks. Runs online by default.
  • 3 ways: CHECKDB WITH PHYSICAL_ONLY / CHECKDB / CHECKDB WITH EXTENDED_LOGICAL_CHECKS

Evolution of CHECKDB

  • In SQL 2000, introduced many ways to minimize runtime and run online
  • In SQL 2005, progress report, data purity, indexed view, last known good...
  • In SQL 2008, long running checks moved under WITH EXTENDED_LOGICAL_CHECKS
  • That's why, by default, SQL 2008 runs DBCC CHECKDB faster than SQL 2005
  • CHECKDB will use a lot of resources, CPU, tempdb, etc. Does not use log.

Running CHECKDB

  • By default, only returns the first 200 errors
  • Could return lot of distracting informational messages
  • Use DBCC CHECKDB WITH ALL_ERRORMSG, NO_INFOMSGS
  • If it's taking too longer than usual, it could just be that it found some repairable corruption (don't assume the worst)
  • Additional messages in SQL 2008 to inform during run that it found corruption

CHECKDB checks

How frequently to run CHECKDB

  • Depends on a number of things:
  • - stability of IO subsystem, backup strategy, downtime SLA, data loss SLA,
  • - can take additional CPU/IO, type of system (production/test/backup)
  • At least once a week
  • If you're running as part of maitanence, do it before the backup

How long does it take?

  • Depends on a number of things
  • - size of DB, current server load (CPU/IO), update activity
  • - number of CPUs, speed of tempdb disks, complexity of schema
  • - which options you use...

CHECKDB on VLDB

What to do?

  • Do not panic!
  • Check the extent of the problem - run full CHECKDB on another system
  • Verify what backups are available while it's running full checks
  • Find out for how long has that being going
  • Might be able to fix online with a restore

Repairable?

  • Repairable errors
  • - Errors in non-clustered indexes - Rebuild indexes, run CHECKDB again
  • Unrepairable errors
  • - Errors can prevent CHECKDB from running or doing repairs
  • - When you run DBCC CHECKCATALOG tells to run DBCC CHECKCATALOG

Restore or Repair?

  • Did CHECKDB fail? Is it a repairable error? Cannot repair...
  • Do you have a backup? How old is it? Is the transaction log damaged?
  • What will give you the least data loss in the shortest amount of data
  • You can do single page restore from your backup (if you know what you're doing)

Repair

  • Repairs are part of CHECKDB: REPAIR_FAST, REPAIR_REBUILD, REPAIR_ALLOW_DATA_LOSS
  • Repairs ranked - First the most intrusive things to repair
  • Repairs are never be online - it's hard enough to do it offline :-)
  • Beware of REPAIR_ALLOW_DATA_LOSS - it does what it says
  • If log is damaged, you can try emergency mode - no consistency guarantee
  • Demo: Repairing a suspect databases

Lessons learned

  • Backup your data frequently
  • Have multiple backups, keep some of them off-site
  • Validate that your backups are good by doing trial restores to alternate server
  • Run CHECKDB regularly if possibly in your case, know how long it regularly takes

Books Online: DBCC CHECKDB
https://msdn.microsoft.com/en-us/library/ms176064.aspx