Please note that, although these notes were taken during the class, I might have added some of my own (mis)interpretation :-). Always check your facts on Books Online (I try to provide links when applicable). As with anything else you get from a blog, never use any of this in production before you thoroughly validate it a test environment. Please keep in mind that some of those will be hard to follow without some pre-requisite knowledge and the right context. Reading the post from top to bottom will help.
Disk history and basics
- Stone, paper, magnetic tape, magnetic disk 🙂
- Amount of store increasing rapidly
- Magnetic disk – Non volatile, random address, re-writable
- Discussion - Reading data later – how far ahead can you guess (10 years?), especially removable media
- Functions – Seek to, Random from, Write to, atomic access to single block (512 bytes)
- Physical geometry obscured with LBAs – Reality of the geometry could be completely different
- Discussion - For how long will we have mechanical drives?
- Structure of disks - platters, head, tracks, cylinders, sector, density changes, more data on outer tracks
- Discussion – Small drives or large drives
- Spindles – Seek to track or stay in the track, rotating spindles, actuators
- Discussion – Drive reliability, MTBF
- Removable disks – Disk packs. Spin up/load heads/seek/read/write. Alignment packs.
- Discussion – Engineering of old style removable disk.
- Fixed disk – Winchester. Heads land and take off, like a plane. Radial connections.
- Media – magnetic (disks/tapes), semiconductor (flash/RAM), optical (CD/DVD/BD), hybrid, holographic?
- Discussion – Trust the media? Keeping track of many tapes over time. How many copies?
- Discussion – Magnetic physical limitations. RAMSAN, FusionIO, SSD. Video = array of 24 SSD drives.
- Semiconductor limitations – SLC vs. MLC, wear leveling, block erases
- Magnetics x Semiconductor – Fast seeks and low power vs. magnetic resilience
- How/why disks fail – electrical/mechanical domains, wearing out, trying forever, data corruption, ECC
- “Not quite broken” – transitory errors, connection bus, vibration, heat, RFI, static
- Discussion - desktop drives x enterprise drives, error rates
- Disk terminology – tracks per inch, bits per inch, seek latency, access time, response time, IOPs, bandwidth
- IOPS – I/Os per second - max IOPS = bandwidth/IO size. IOPs per spindle. Controller capacity.
- Latency – How long does it take? – IOPs/Bandwith vs. Latency – RPM vs. Latency – 7.2K vs 10K vs 15K
- Discussion – How is “average seek” time calculated?
- Real-time – What is real time? SAN QOS, SSD, SQL snapshot isolation, Resource Governor
- Power – Lower power, lower heat, 15K disk vs. SSD, power is a significant part of the overall cost
- How much data – How many copies of the data do your keep? Online, backup, compliance, archive?
- Backup options – D2T, D2D, D2D2T, D2D2D, etc...
- Future: Storage Utility? Compute Utility? Services for data access, validity, recovery. Cloud?
- Discussion – Pods. Datacenters in containers. 3,500 servers/pod. 3 connections: power, network, cooling
- LBA or LBN is disk address, not head/cylinder/sector anymore
- It’s all a big bunch of sectors and blocks (512 bytes)
- Initators (find and mounts drives/LUNs) and targets
- Disk ordering of commands (fifo, seek time, latency, size of stream, direction)
- Parallel attachment – IDE, ATA, PATA, SCSI
- Serial attachment – FC, SATA, SAS, Ethernet
- Classes of disks – Desktop (SATA, 7.2K RPM), Enterprise (SAS/FC, 15K RPM)
- Size and density of disks – 3.5” (standard) and 2.5” (small form factor, SFF)
- Duty cycle
- Serial interfaces (SATA/SAS/FC)
- Channel Performance (USB/IDE/SATA/FC), Network Performance (10M/100M/1G/10G)
- RAID – Software, Direct-Attach RAID, External RAID (SAN)
- SSD – Low (no) latency, low (no) seek time, high read IOPs, good bandwidth per drive
- SSD – More expensive, lower capacity, higher cost,
- What is an IO? What is the path from the App to the physical disk?
- SQL Server – SQL Server protocol, query processor, SQL OS, storage engine, buffer manager, buffer pool, OS
- “Hardware” I/O – Driver, HBA controller, buffer, FC switch, “SAN OS”, controller, cache, backend switch, disk
- I/O request lifecycle – Should all happen in milliseconds. Better if can be done via buffer/cache.
- I/O Latency – Optimal – Log: <5ms (ideally <1ms), OLTP: <20ms (ideally <10ms), OLAP: <30ms
- I/O Latency – Depends on technology, should be consistent
- Retains information on power removal. Disk, tape, optical.
- Assumed to be reliable, fair amount of solution cost, regulatory requirements to keep data
- Acronyms – JBOD, RAID, DAS, NAS, SAN, CAS, COS, CIFS, SMB, NFS
- Storage: Initiator, Target, LUN
SQL Server storage
- 3 basic requirements: stable media, write ordering, torn I/O prevention
- Applied also to any storage replication scheme – sync or async
- The idea is to avoid data corruption and/or loss of transactional consistent
- Platters, heads, tracks and sectors – typical sector size is 512 bytes
- SQL Server uses 4KB sectors
- Disk sector – Preamble, data sync, user data (512 bytes), ECC (40 bits), gap
- Data is stored in sectors, accessed by LBN or block address
- For SQL, should always have some sort of RAID construct for fault tolerance
- NTFS – Default 4KB cluster size
- Consider using 64KB cluster - aligns with SQL extents, improves read-ahead, reduces split I/O
- Disable NTFS compression
- GPT vs. MBR – MBR is the old partitioning. GPT is newer and allows volumes >2TB.
- Beginning of disk used to store partition info, actual data portion start after that.
- Query with
- – Windows Server 2008 and later will automatically align at next 1MB boundary
- – Older versions of Windows will align at 63 sectors
- – You should re-align at 1MB if possible. It can have significant impact.
- See http://support.microsoft.com/kb/929491
- See http://blogs.msdn.com/jimmymay/archive/2008/10/14/disk-partition-alignment-for-sql-server-slide-deck.aspx
- See http://sqlblog.com/blogs/kevin_kline/archive/2008/10/08/how-to-improve-application-and-database-performance-up-to-40-in-one-easy-step.aspx
SQL Server I/O
- Write-ahead logging (WAL) - Key to make it ACID
- All log records must go to stable media BEFORE data pages does.
- If these are reordered and stopped in the middle, bad things can happen
- Stable media – must be able to survive failure. Cache is OK, if battery backed.
- Multiple channels, Load balancing, Mirroring, Remote Mirroring – Also need to guarantee order
- Torn I/O – two bits on each of the 512 sectors on the 8K page
- Page checksum – checksum for the entire page, replaces Torn I/O if enabled
- Force unit access – CreateFile with FILE_FLAG_WRITEHTROUGH flag
- Hardware read cache – In controller, reads more data than requested. Good for OLAP, reporting.
- Latching – Data synchronization, reader/writer “lock” in SQL Server when writing page
- Scatter-Gather I/O: Instead of doing individual I/Os, try to do multiple. For instance: entire extents.
- Lazy Writer: On memory pressure, free pages. LRU stays. Up to 256 pages.
- Checkpoint: Flush dirty pages in buffer pool to stable media. Up to 16 pages.
- Eager Write: Minimally-logged, goes to disk without going to the buffer pool
- Read-ahead: Read multiple pages ahead. Up to 128 pages (Enterprise Edition, up to 1024).
- Enterprise Edition also can prime the cache on start.
- Log manager: Writes log files, sequential. 8 pages (32 in 64-bit) and 480KB 32 outstanding log writes.
- Orderly shutdown: Will checkpoint all databases, close out internal structures.
- Don’t kill SQL Server process. You might need to increase shutdown time
- See http://support.microsoft.com/kb/146092
- Careful - Stale reads, lost writes, filter drivers, online file copy, drive fragmentation
- Error 823 – Error, some data returned - Usually, something close to being broken (or broken)
- Error 824 – Error, no data returned - Usually, something broken
- Error 832 – Stale I/O or checksum issue
- Error 833 – Hung I/O – nothing comes back (could be anti-virus, compression, network issue, etc)
- Stale Read Protection – Holds some info on written pages, checked on read – Limited to ~320MB of data on 64-bit
- Read Retry: Drives will try, OS will also try, SQL Server will retry on top of that - can mask issues
- Page and log audit – trace flags to do checks
- Idle server – Suspend/resume SQL Server – By default only on Express/Workgroup. Careful!
Scalable Shared Databases
- Multiple SQL Servers mounting the same files.
- Read-only databases and read-only files. SQL Server 2005 or later.
- See http://msdn.microsoft.com/en-us/library/ms345392.aspx
Instant File Initialization
- Both OS and FS must support IFI.
- SQL Server account needs SE_MANAGE_VOLUME_NAME.
- Used in data files, not logs.
- Careful – Files will not be initialized.
- SELECT * FROM sys.dm_io_pending_io_requests
- See http://msdn.microsoft.com/en-us/library/ms188762.aspx
- SELECT * FROM sys.dm_io_virtual_file_stats(<database>/NULL,<file>/NULL)
- Shows io_stall, io_stall_read_ms, io_stall_write_ms, num_of_reads, num_of_writes
- See http://msdn.microsoft.com/en-us/library/ms190326.aspx
- Storage Area Network. Fibre Channel or iSCSI. Comprised of fabric and end points.
- Terminology: HBA, array, frame, port, frames, PDU, fabrics, zoning, VSAN, masking, WWN, MPIO.
- See http://www.snia.org/images/tutorial_docs/Networking/MKJibbe-Wilson_Fibre_Channel_Technologies-rev.pdf
- Terminology: Write Atomicity, Restripe, Concatenation, Thin provisioning, LUN growth/shrink
- “Magic SAN dust” – Scalability, Availability, Redundancy, Data Management/Mobility, Snaps, Close, Mirroring
- Easy and flexible provisioning
- Generally used with Failover Clustering
- Careful – Shared storage performance – How to guarantee?
- Careful – RAID and stripe size – Alignment still matters
- Careful – LUN security
- Discussion – SAN vs. DAS – Pros and cons of each…
- Maintenance: Index rebuilds, updating statistics
- Fragmentation: Only important if you’re doing scans. Watching for them is important.
- Identify the few very large tables that need attention. Consider partitioning
- Backup and Restore strategy
- Minimize Log write waits
- Throughput vs. Latency
- Reduce SQL locking
- Add CPUs
- Reduce I/O path
- Capacity – Striping (RAID 0)
- Availability – Mirroring (RAID1, RAID10), Parity (RAID5, RAID6)
- Performance – Striping (RAID 0)
- Tempdb in RAID0? SQL Server needs tempdb to be there all the time…
- RAID5/RAID6 have a performance penalty
- Determine capacity of I/O system prior to deployment
- Tools: SQLIO, IOMeter, SQLIOSim (not for performance)
- See http://sqlblog.com/blogs/kevin_kline/archive/2007/06/28/understanding-sqliosim-output.aspx
- SQLIO: IOPS, MB/s, latency. One or many files. Single I/O type per run.
- IOMeter: IOPS, MB/s, latency. Can combine many IO types per run.
- IO sizes: SQL can use 8/64/128/256/1024KB. Random - look at IOPS. Sequential – look at MB/s.
- Workloads: OLTP - heavy reads, writes during checkpoints, 60K sequential read/write for logs
- Workloads: OLAP - sequential reads and writes, larger I/O sizes, more scans
- Determining Saturation Point – IOPS constant, latency increasing
- Careful – HBA can saturate before the SAN disks, consider multiple channels
- Careful – HBA drivers/configuration, controller cache configuration, volume alignment