Here are some notes on “SQL Server 2008 Database Checking” I took while attending an advanced class on SQL Server taught by Paul Randal (http://sqlskills.com/AboutPaulSRandal.asp).
Please note that, although these notes were taken during the class, I might have added some of my own (mis)interpretation :-). Always check your facts on Books Online (I try to provide links when applicable). As with anything else you get from a blog, never use any of this in production before you thoroughly validate it a test environment. Please keep in mind that some of those will be hard to follow without the right pre-requisite knowledge and the right context. Reading the post from top to bottom will help.
- Why check for corruption?
- What to do when corruption happens?
- Many people don’t prepare for it and panic.
- Monitor with agent alerts
- 823 – hard I/O error – sql never got the data, just an error code
- 824 – soft I/O error – sql got some data, but it came with error code
- 825 – read/retry error
- msdb..suspect_pages has a list
- SQL has retried a few times, error 825 goes to the log
- Your I/O subsystems has transient problems
- Eventual failure results in an IO error
- Since SQL Server 2005, this was extended to data pages
- This is not logged as an error, but you should keep an eye on it
- Not a performance analysis tool.
- This is to stress to system, before you introduce SQL, to check for problems
- Documented in blog posts
Page protection options
- Set using ALTER PAGE SET PAGE_VERIFY NONE/CHECKSUM/TORN_PAGE_DETECTION
- TORN_PAGE_DETECTION – First two bits of each 512 bytes in the page
- Alternating bit pattern helps identify data corruption condition
- Simple checksum of all the bits on the page in page header
- Enough to detect data corruption condition on the page (new in SQL 2005)
- Last thing done to page before write, first thing checked after read
- Always recommended to have at least one of them.
- CHECKSUM has a CPU overhead, but it is recommended.
- Checked: when page is read, during CHECKDB, BACKUP with CHECKSUM
- Available in tempdb starting with SQL Sever 2008
Automatic page repair
- Mirroring is based on the log, so it does not mirror physical corruption
- If you see a 823/824 error occurs and database is mirrored
- Corrupt pages on the principal and mirror can be repaired
- Principal and mirrored must be synchronized
- Repairs are asynchronous, corrupt pages are unusable until repaired
- “Database suspect data page” event class
- Monitor last 100 corrupted pages with sys.dm_db_mirroring_auto_page_repair
- Checks the database. Only way to force reading all allocated pages in the DB.
- May cause contention, but does not take any locks. Runs online by default.
- 3 ways: CHECKDB WITH PHYSICAL_ONLY / CHECKDB / CHECKDB WITH EXTENDED_LOGICAL_CHECKS
Evolution of CHECKDB
- In SQL 2000, introduced many ways to minimize runtime and run online
- In SQL 2005, progress report, data purity, indexed view, last known good…
- In SQL 2008, long running checks moved under WITH EXTENDED_LOGICAL_CHECKS
- That’s why, by default, SQL 2008 runs DBCC CHECKDB faster than SQL 2005
- CHECKDB will use a lot of resources, CPU, tempdb, etc. Does not use log.
- By default, only returns the first 200 errors
- Could return lot of distracting informational messages
- Use DBCC CHECKDB WITH ALL_ERRORMSG, NO_INFOMSGS
- If it’s taking too longer than usual, it could just be that it found some repairable corruption (don’t assume the worst)
- Additional messages in SQL 2008 to inform during run that it found corruption
- If it can’t read the system tables, it cannot do much
- Allocation checks/repairs – CHECKALLOC
- Logical checks/repairs for system tables – Checks done by page in allocation order
- Logical checks/repairs for all other tables
- Service broker data validations
- Metadata checks – CHECKCATALOG
- Indexed views and XML index cheks (if WITH EXTENDED_LOGICAL_CHECKS)
- WITH PHYSICAL ONLY – simpler check of system table, stops there
- See http://sqlskills.com/BLOGS/PAUL/post/CHECKDB-From-Every-Angle-Complete-description-of-all-CHECKDB-stages.aspx
How frequently to run CHECKDB
- Depends on a number of things:
- - stability of IO subsystem, backup strategy, downtime SLA, data loss SLA,
- - can take additional CPU/IO, type of system (production/test/backup)
- At least once a week
- If you’re running as part of maitanence, do it before the backup
How long does it take?
- Depends on a number of things
- - size of DB, current server load (CPU/IO), update activity
- - number of CPUs, speed of tempdb disks, complexity of schema
- - which options you use…
CHECKDB on VLDB
- CHECKDB on a multi-TB database can take hours
- Five options to reduce time
- - Don’t run it
- - Run WITH PHYSICAL_ONLY
- - Break up the checks
- - Use partitioning – run CHECKFILEGROUP
- - Use another system – use a BACKUP
- Don’t give up!
- See http://sqlskills.com/BLOGS/PAUL/post/CHECKDB-From-Every-Angle-Consistency-Checking-Options-for-a-VLDB.aspx
What to do?
- Do not panic!
- Check the extent of the problem – run full CHECKDB on another system
- Verify what backups are available while it’s running full checks
- Find out for how long has that being going
- Might be able to fix online with a restore
- Repairable errors
- - Errors in non-clustered indexes – Rebuild indexes, run CHECKDB again
- Unrepairable errors
- - Errors can prevent CHECKDB from running or doing repairs
- - When you run DBCC CHECKCATALOG tells to run DBCC CHECKCATALOG
Restore or Repair?
- Did CHECKDB fail? Is it a repairable error? Cannot repair…
- Do you have a backup? How old is it? Is the transaction log damaged?
- What will give you the least data loss in the shortest amount of data
- You can do single page restore from your backup (if you know what you’re doing)
- Repairs are part of CHECKDB: REPAIR_FAST, REPAIR_REBUILD, REPAIR_ALLOW_DATA_LOSS
- Repairs ranked – First the most intrusive things to repair
- Repairs are never be online – it’s hard enough to do it offline
- Beware of REPAIR_ALLOW_DATA_LOSS – it does what it says
- If log is damaged, you can try emergency mode – no consistency guarantee
- Demo: Repairing a suspect databases
- Backup your data frequently
- Have multiple backups, keep some of them off-site
- Validate that your backups are good by doing trial restores to alternate server
- Run CHECKDB regularly if possibly in your case, know how long it regularly takes
Books Online: DBCC CHECKDB