SLA and Disaster Recovery Planning for Microsoft Business Intelligence

(Post courtesy of a Partner Technical Consultant specializing in Data Analytics)

Service Level Agreement Planning and Disaster Recovery Planning for the Microsoft Business Intelligence Stack with Microsoft SQL Server and Microsoft SharePoint Server

For a fully implemented Microsoft Business Intelligence stack, which might be composed of SQL Server, SSAS, SSIS, SSRS, SharePoint Server, MDS, and DQS, the question arises how to ensure a consistent state across all of these applications in case of a total or partial failure, especially when the different components are subject to varying maintenance schedules.

In the worst case, disaster recovery requires us to recover from a situation where the primary data center, which hosts the servers, is unable to continue operation. Even smaller disruptions like power outages, data corruption or accidental deletion of data can force us to restore data or configuration from our backups. It is well known that fault tolerance is achieved through redundancy, ideally not only at the data layer but in each and every component of the network and all services (switches, servers, Active Directory, SMTP, and so on).

In this blog post we would like to focus on the Microsoft Business Intelligence stack and provide an overview of what you need to consider when defining Service Level Agreements, and how to prepare for a fast resumption of operation after such an unwelcome event has occurred.

Balancing cost and risk of downtime

First, let's consider the two factors that determine the availability part of a Service Level Agreement. The whitepaper "High Availability with SQL Server 2008 R2" at https://technet.microsoft.com/en-us/library/ff658546.aspx explains it concisely:

"The two main requirements around high-availability are commonly known as RTO and RPO. RTO stands for Recovery Time Objective and is the maximum allowable downtime when a failure occurs. RPO stands for Recovery Point Objective and is the maximum allowable data-loss when a failure occurs. Apart from specifying a number, it is also necessary to contextualize the number. For example, when specifying that a database must be available 99.99% of the time, is that 99.99% of 24x7 or is there an allowable maintenance window?"

This means that both the RPO, i.e. the amount of data you are willing to lose, and the RTO, i.e. the acceptable duration of the outage, need to be determined individually, depending on your customer's specific needs. Their calculation follows an actuarial principle in that cost and risk need to be balanced. Keep in mind that the RTO does not only depend on how soon your services are back online; in certain circumstances it also encompasses the time needed to restore data from backups up to a certain point in time.

Assuming a 24x7x365 operation, the following calculation applies, taken from "Create a high availability architecture and strategy for SharePoint 2013" at https://technet.microsoft.com/en-us/library/cc748824.aspx:

   Availability class    Availability measurement    Annual downtime

   Two nines             99%                         3.7 days
   Three nines           99.9%                       8.8 hours
   Four nines            99.99%                      53 minutes
   Five nines            99.999%                     5.3 minutes

So now we start to appreciate what it means that in Windows Azure we receive a general SLA of 99.9% across services, and 99.95% for cloud services; see https://www.windowsazure.com/en-us/support/legal/sla.

And here is one more argument in favor of using Windows Azure as your secondary site and standby data center: if you back up your databases and transaction logs to Azure blob storage and take Hyper-V based snapshots of your virtual machines, which you then transfer to Azure blob storage, you only incur the inexpensive storage cost, yet you remain able to turn the VMs on whenever you decide to bring them online, and you start paying for them only while they are running. Windows Server 2012 Backup and Windows Azure Backup allow you to back up system state and files/folders to Windows Azure storage as well.

Alternatively, availability of a repairable system can be calculated from the expected time between failures and the expected time to repair, as per the following formula:

   Availability = MTTF / (MTTF + MTTR)

where MTTF = Mean Time To Failure and MTTR = Mean Time To Recover.
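
As a quick sanity check of these numbers, the following T-SQL sketch (with purely hypothetical MTTF and MTTR values) plugs them into the formula and converts the result into annual downtime, which here lands close to the "three nines" row of the table above:

   -- Hypothetical figures; replace MTTF/MTTR with your own measurements (both in hours).
   DECLARE @MTTF float = 2000.0;  -- mean time to failure
   DECLARE @MTTR float = 2.0;     -- mean time to recover
   DECLARE @Availability float = @MTTF / (@MTTF + @MTTR);
   SELECT CAST(@Availability * 100 AS decimal(7,4))        AS AvailabilityPercent,  -- ~99.9001 ("three nines")
          CAST((1 - @Availability) * 8760 AS decimal(7,2)) AS AnnualDowntimeHours;  -- ~8.75 hours per year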

Scope

A disaster recovery concept needs to encompass the whole architecture and all technologies involved, and it should include best practices for both functional and non-functional requirements (non-functional requirements describe qualities such as performance and security).

To summarize, partners need to define an RPO (recovery point objective) and an RTO (recovery time objective) in the SLA with their customers. For this, they are looking for a disaster recovery concept that takes into account:

- full, differential and transaction log backups (assuming the database uses the full recovery model)

- application backups

- any add-on components of the software

- Hyper-V virtual images of production servers

With that, let's take a detailed look at end-to-end disaster recovery planning for a Microsoft BI solution.

SQL Server databases

To begin with, how do the above concepts apply to the SQL Server databases?

Where would you look first to find out about the recovery time of all of your databases? Correct, it is the SQL Server error log, which can be read along a timeline.
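
One quick, if informal, way to do this is to filter the error log for the recovery completion messages that SQL Server writes for each database. A minimal sketch, using the undocumented but widely used sp_readerrorlog procedure:

   -- Read the current error log (first parameter 0) of the SQL Server instance (second parameter 1)
   -- and filter for database recovery messages, which include the recovery duration per database.
   EXEC sp_readerrorlog 0, 1, N'Recovery completed';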

To estimate the roll forward rate for a standalone system, one could use a test copy of a database and restore a transaction log from a high-load time period to it. The application design plays an important role as well: Short-running transactions reduce the roll forward time.

Upon failover in an AlwaysOn failover cluster instance, all databases need to be recovered on the new node, which means that transactions committed in the transaction log are rolled forward in the database, whereas uncommitted transactions are rolled back.

Side note: In a Failover Cluster Instance, the time for switchover is furthermore affected by factors such as storage regrouping or DNS/network name provisioning. On the client side, one can configure the connection timeout in order to shorten the time needed to reestablish a broken connection.

The new SQL Server 2012 Availability Groups make it easy to monitor RPO and RTO. For details, see "Monitor for RTO and RPO" at https://technet.microsoft.com/en-us/library/dn135338.aspx.
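
One common approach estimates the RTO from the redo queue and the redo rate on a secondary, and the RPO from the difference in last_commit_time between primary and secondary. The following query against the documented AlwaysOn DMVs is only a simplified, illustrative starting point for that kind of monitoring:

   -- Simplified sketch: per availability database, how much log is still waiting to be redone
   -- on a replica, and roughly how long the redo would take at the current redo rate (an RTO
   -- indicator); last_commit_time helps gauge potential data loss (RPO).
   SELECT ag.name                  AS availability_group,
          ar.replica_server_name,
          DB_NAME(drs.database_id) AS database_name,
          drs.redo_queue_size      AS redo_queue_kb,
          drs.redo_rate            AS redo_rate_kb_per_sec,
          CASE WHEN drs.redo_rate > 0
               THEN drs.redo_queue_size * 1.0 / drs.redo_rate
          END                      AS estimated_redo_seconds,
          drs.last_commit_time
   FROM sys.dm_hadr_database_replica_states AS drs
   JOIN sys.availability_replicas AS ar ON ar.replica_id = drs.replica_id
   JOIN sys.availability_groups   AS ag ON ag.group_id   = drs.group_id;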

Here are some tips for an efficient backup strategy of your SQL Server databases:

- Store the backups in a separate physical location.

- Have a schedule for carrying out regular backups, for example nightly full backups, a differential backup every 6 hours, and a transaction log backup every 30 minutes if you need point-in-time recovery (a minimal T-SQL sketch follows this list).

- Enable CHECKSUM on backups. This is the default with backup compression, which is available in Standard, Business Intelligence and Enterprise Edition.

- Test your backups periodically by restoring them, because you might unknowingly carry forward data corruption, which would render your backups useless. Monitor the suspect_pages table in msdb to determine when a page-level restore is sufficient.

- With regard to long-term archival, it is considered good practice to maintain three different retention periods. If you use three rotation schemes, for example creating full backups daily, weekly and monthly and storing each on a different media set, then you can regenerate your data from these if necessary. This is called the grandfather-father-son principle and allows media sets to be reused after their retention period expires; for example, a Monday backup overwrites that of some previous Monday, and so on. The screenshot at https://social.msdn.microsoft.com/Forums/en-US/92fbf076-3cd1-4ab2-97d2-1ae6c9e909c7/grandfatherfatherson-backup-scenario depicts these options very well.

- Filegroups for historical partitions can be marked as read-only, hence require only a one-time filegroup backup. A piecemeal restore of read-write filegroups can accelerate recovery.

- Use "SQL Server Backup to Windows Azure" to upload the backup files for offsite storage, optimally with compression and encryption, even for versions earlier than SQL Server 2014. Check out the "Microsoft SQL Server Backup to Microsoft Windows Azure Tool" at https://www.microsoft.com/en-us/download/details.aspx?id=40740.

- While the RBS FILESTREAM provider, which uses local disk storage, is integrated with SQL Server's backup and restore procedures, with a third-party RBS provider it is your responsibility to back up the Remote Blob Storage separately in a consistent manner; see "Plan for RBS in SharePoint 2013" at https://technet.microsoft.com/en-us/library/ff628583.aspx.
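
To make the schedule from the list above concrete, here is a minimal T-SQL sketch of the three backup types with CHECKSUM enabled, plus a verification step. Database name, file paths, storage account and credential name are placeholders that you would replace with your own:

   -- Nightly full backup (database name and paths are placeholders).
   BACKUP DATABASE [YourDatabase]
       TO DISK = N'\\backupserver\sql\YourDatabase_FULL.bak'
       WITH COMPRESSION, CHECKSUM, INIT;

   -- Differential backup, e.g. every 6 hours.
   BACKUP DATABASE [YourDatabase]
       TO DISK = N'\\backupserver\sql\YourDatabase_DIFF.bak'
       WITH DIFFERENTIAL, COMPRESSION, CHECKSUM, INIT;

   -- Transaction log backup, e.g. every 30 minutes (requires the full recovery model).
   BACKUP LOG [YourDatabase]
       TO DISK = N'\\backupserver\sql\YourDatabase_LOG.trn'
       WITH COMPRESSION, CHECKSUM, INIT;

   -- Periodically prove that the backup is restorable (ideally by a full test restore;
   -- VERIFYONLY at least validates the media and the page checksums).
   RESTORE VERIFYONLY
       FROM DISK = N'\\backupserver\sql\YourDatabase_FULL.bak'
       WITH CHECKSUM;

   -- Offsite copy to Windows Azure blob storage (SQL Server 2012 SP1 CU2 and later):
   -- BACKUP DATABASE [YourDatabase]
   --     TO URL = N'https://<storageaccount>.blob.core.windows.net/backups/YourDatabase_FULL.bak'
   --     WITH CREDENTIAL = N'AzureBackupCredential', COMPRESSION, CHECKSUM;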

Fortunately, the components of the Microsoft BI stack are built with availability in mind. With SQL Server Availability Groups in SQL Server 2012 and higher you get a highly available set of databases with secondary replicas for failover, for disaster recovery purposes, or to load-balance your read requests. Availability Groups are a feature of SQL Server Enterprise Edition, which comes with even more online features than the other editions to allow for higher availability and faster recovery, notably online page and file restore or the Database Recovery Advisor. The latter is helpful for point-in-time restores across sometimes complicated backup chains. For a concise list please see the table at https://msdn.microsoft.com/en-us/library/cc645993.aspx#High_availability.

With SQL Server Availability Groups spread out to Windows Azure virtual machines it is even possible to host your secondary database replicas in Azure and, for example, run your System Center Data Protection Manager and its agent in the cloud against them.

Marked transactions allow you to restore several databases on the same instance of SQL Server, for example the MDS database, the SSIS catalog and your corresponding user databases, consistently up to the very same point in time, which can be advantageous when a major event occurs, for example a merger of two companies' databases. See "Use Marked Transactions to Recover Related Databases Consistently (Full Recovery Model)" at https://technet.microsoft.com/en-us/library/ms187014.aspx.
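
As a brief illustration (the database, table and transaction names below are made up), the mark is written to the log of every database modified within the transaction and can later serve as a common recovery point:

   -- Place a named mark in the transaction log; the mark is recorded in every
   -- database modified within this transaction (names here are hypothetical).
   BEGIN TRANSACTION GoLiveMerge WITH MARK N'Before company merger load';
       UPDATE MDSDemo.dbo.SomeTable SET Flag = 1 WHERE Id = 1;
   COMMIT TRANSACTION GoLiveMerge;

   -- Later, restore each related database consistently up to (and including) that mark.
   RESTORE DATABASE MDSDemo FROM DISK = N'\\backupserver\sql\MDSDemo_FULL.bak'
       WITH NORECOVERY;
   RESTORE LOG MDSDemo FROM DISK = N'\\backupserver\sql\MDSDemo_LOG.trn'
       WITH STOPATMARK = N'GoLiveMerge', RECOVERY;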

SSAS

Since SQL Server Analysis Services is mainly a read-only system, you can do without transaction logging or differential backups. If the metadata (.xmla files) is available, that is sufficient to recreate and reprocess your cubes. If you additionally have valid database backups (.abf files), those can be restored and used.

It is possible to run a separate SSAS server, which has the same configuration settings, in a remote location and supply it regularly with the latest data via database synchronization.

Hints:

- When running SSAS in SharePoint mode (as the POWERPIVOT instance), the SharePoint content and service application databases contain the application and data files.

- If you host your secondary replica for read access in Windows Azure, you will want to place your SSAS instance running in an Azure VM within the same Availability Set.

SSIS

SQL Server Integration Services since version 2012 offers two deployment modes: package-based deployment for backward compatibility and the new project-based deployment. Backup and restore procedures depend on where the packages are stored. The package store can be folders in the file system or the msdb database. Any package files should be copied to a safe location together with a dtutil script for re-deploying them, as well as any centrally managed configuration files. Starting with SQL Server 2012, it is strongly recommended to use project deployment to the Integration Services server. The SSISDB catalog is a database that stores projects, packages, parameters, environments and operational history, and as such can be backed up into a .bak file. You also need to back up the master key of the SSISDB database, whereby the resulting file is encrypted with a password you specify. Unless the key changes, this is a one-time operation.
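
A minimal sketch of these two steps could look as follows; the file paths and the password are placeholders:

   -- One-time (unless the key changes): back up the SSISDB database master key.
   USE SSISDB;
   BACKUP MASTER KEY TO FILE = N'\\backupserver\keys\SSISDB_key'
       ENCRYPTION BY PASSWORD = N'<StrongPasswordHere>';
   GO

   -- Regular backup of the SSIS catalog database itself.
   BACKUP DATABASE SSISDB
       TO DISK = N'\\backupserver\sql\SSISDB_FULL.bak'
       WITH COMPRESSION, CHECKSUM, INIT;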

SSRS

SQL Server Reporting Services in native mode is a stateless service, so it is the ReportServer database which contains the metadata and the report snapshots with data. It can be protected as required by your SLA and RTO via full or differential backups; experience has shown that full backups alone are often fast enough. The ReportServerTempDB database can be recreated at any time. Do not forget to back up the encryption key that protects sensitive content stored in the database. This should be done at creation time and suffices unless the service identity or the computer name changes. If you use subscriptions, you need to back up the corresponding SQL Server Agent jobs as well, which can be accomplished by backing up the msdb database at the same time. For backing up the Report Server configuration and custom assemblies, please refer to the corresponding links in the final section of this blog post.
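
For native mode, a minimal T-SQL sketch could look as follows (paths are placeholders); the encryption key itself is backed up outside of T-SQL, for example with the rskeymgmt utility or the Reporting Services Configuration Manager:

   -- Full backup of the report server catalog; ReportServerTempDB can simply be recreated.
   BACKUP DATABASE ReportServer
       TO DISK = N'\\backupserver\sql\ReportServer_FULL.bak'
       WITH COMPRESSION, CHECKSUM, INIT;

   -- If subscriptions are used, back up msdb as well so the SQL Server Agent jobs are covered.
   BACKUP DATABASE msdb
       TO DISK = N'\\backupserver\sql\msdb_FULL.bak'
       WITH COMPRESSION, CHECKSUM, INIT;

   -- The symmetric encryption key is backed up separately, e.g.:
   --   rskeymgmt -e -f ReportServerKey.snk -p <password>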

Concerning SQL Server Reporting Services in SharePoint mode, the SharePoint 2013 built-in backup does not take care of the SSRS databases (including the additional Reporting Services alerting database), so the previous paragraph still applies: you must use SQL Server tools for SharePoint Server, or SQL Server (Express) tools for SharePoint Foundation. As for the application part, since SSRS in SharePoint mode is a true SharePoint service application, configuration occurs through Central Administration, and SharePoint Server's own backup and recovery applies.

SharePoint Server

The BI (also called Insights) features of SharePoint Server, such as Excel Services, Access Services, Visio Services and PerformancePoint Services, benefit from SharePoint Server's backup options for service applications. A restore of a service application database has to be followed by re-provisioning the service application. Please find further details in the TechNet articles referenced below.

MDS

Master Data Services consists of a database, in which all master data as well as the MDS system settings are stored, plus the Master Data Manager web application. Scheduling daily full backups and more frequent transaction log backups is recommended. MDSModelDeploy.exe is a useful tool for creating packages of your model objects and data.

Side note: In our experience, under high load the bottleneck tends to be the MDS database itself rather than the IIS-hosted web site. Hence, a scale-out would not necessarily consist of just several MDS web sites pointing to the same database, although this does provide redundancy and increased availability while web servers get updated. Rather, it would separate models out into different MDS databases. On the one hand, this increases the overhead for security accounts and administration, given that the metadata tables are completely isolated from each other; on the other hand, blocking is avoided and the databases can be managed independently.

DQS

Data Quality Services keeps its information in three databases, DQS_MAIN, DQS_PROJECTS and DQS_STAGING_DATA, and can therefore be integrated neatly into your SQL Server backup and restore processes. With the help of the DQSInstaller.exe command it is even possible to export all of the published knowledge bases from a Data Quality Server to a DQS backup file (.dqsb) in one go.
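
Since the three databases belong together logically, a minimal sketch of a coordinated backup (paths are placeholders) is straightforward:

   -- Back up the three DQS databases together so they stay consistent with each other.
   BACKUP DATABASE DQS_MAIN
       TO DISK = N'\\backupserver\sql\DQS_MAIN_FULL.bak'         WITH COMPRESSION, CHECKSUM, INIT;
   BACKUP DATABASE DQS_PROJECTS
       TO DISK = N'\\backupserver\sql\DQS_PROJECTS_FULL.bak'     WITH COMPRESSION, CHECKSUM, INIT;
   BACKUP DATABASE DQS_STAGING_DATA
       TO DISK = N'\\backupserver\sql\DQS_STAGING_DATA_FULL.bak' WITH COMPRESSION, CHECKSUM, INIT;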

Cross-cutting best practices

- Using a SQL Server alias for connections to your SQL Server computer eases the process of moving a database, for example when a SQL virtual cluster name changes. For instructions see, for example, "Install & Configure SharePoint 2013 with SQL Client Alias" at https://blogs.msdn.com/b/sowmyancs/archive/2012/08/06/install-amp-configure-sharepoint-2013-with-sql-client-alias.aspx. It shows how you gain flexibility over the SQL Server connection string by appropriately populating SQL Server Network Configuration and the SQL Server Client Network Utility. This procedure has a significant advantage over a DNS A record or CNAME alias in that the SQL alias does not change the Kerberos SPN format for connections: you continue to use the registered DNS host name (A record) in the Service Principal Name when connecting. Furthermore, it allows you to specify more than one alias pointing to the same instance. For example, you can create one alias for your content databases, another for your search databases, and so on, and thereby plan ahead for future scale-out.

- System Center Data Protection Manager can be used for both database backups and application server backups. For a list of protected workloads please see the left-hand navigation bar on the page "Administering and Managing System Center 2012 - Data Protection Manager" https://technet.microsoft.com/en-us/library/hh757851.aspx.

- In the context of private clouds, System Center comes into play with its Operations Manager to monitor SQL Server instances and virtual machines and its Virtual Machine Manager to quickly provision new virtual machines.

Closing words

SQL Server and SharePoint Server allow for robust disaster recovery routines as part of your business continuity plan. New hybrid and cloud based solutions enhance traditional possibilities greatly.

As has become clear, configuration changes that occur outside of user databases should always happen in a controlled manner, requiring a tight Change Management process.

Further reading

" Microsoft SQL Server AlwaysOn Solutions Guide for High Availability and Disaster Recovery" https://msdn.microsoft.com/en-us/library/hh781257.aspx.

With some good discussions: "Simple script to backup all SQL Server databases" https://www.mssqltips.com/sqlservertip/1070/simple-script-to-backup-all-sql-server-databases/.

"Back Up and Restore of System Databases (SQL Server)" https://technet.microsoft.com/en-us/library/ms190190.aspx.

"SQL Server AlwaysOn Architecture Guide: Building a High Availability and Disaster Recovery Solution by Using AlwaysOn Availability Groups" https://msdn.microsoft.com/en-us/library/jj191711.aspx.

"SQL Server AlwaysOn Architecture Guide: Building a High Availability and Disaster Recovery Solution by Using Failover Cluster Instances and Availability Groups" https://msdn.microsoft.com/en-us/library/jj215886.aspx .

"Backup and Restore of Analysis Services Databases" https://technet.microsoft.com/en-us/library/ms174874.aspx.

"Disaster Recovery for PowerPivot for SharePoint" https://social.technet.microsoft.com/wiki/contents/articles/22137.disaster-recovery-for-powerpivot-for-sharepoint.aspx.

"Package Backup and Restore (SSIS Service)" https://technet.microsoft.com/en-us/library/ms141699.aspx.

"Backup, Restore, and Move the SSIS Catalog" https://technet.microsoft.com/en-us/library/hh213291.aspx.

"Backup and Restore Operations for Reporting Services" https://technet.microsoft.com/en-us/library/ms155814.aspx.

"Migrate a Reporting Services Installation (Native Mode)" https://technet.microsoft.com/en-us/library/ms143724.aspx.

"Migrate a Reporting Services Installation (SharePoint Mode)" https://technet.microsoft.com/en-us/library/hh270316.aspx.

"Backup and Restore Reporting Services Service Applications" https://technet.microsoft.com/en-us/library/hh270316.aspx.

"Planning Disaster Recovery for Microsoft SQL Server Reporting Services in SharePoint Integrated Mode" https://msdn.microsoft.com/en-us/library/jj856260.aspx.

"Overview of backup and recovery in SharePoint 2013" https://technet.microsoft.com/en-us/library/ee663490.aspx.

"Plan for backup and recovery in SharePoint 2013" https://technet.microsoft.com/en-us/library/cc261687.aspx.

"Backup and restore SharePoint 2013" https://technet.microsoft.com/en-us/library/ee662536.aspx.

"Supported high availability and disaster recovery options for SharePoint databases (SharePoint 2013)" https://technet.microsoft.com/EN-US/library/jj841106.aspx.

"Database Requirements (Master Data Services)" https://technet.microsoft.com/en-us/library/ee633767.aspx.

"Web Application Requirements (Master Data Services)" https://technet.microsoft.com/en-us/library/ee633744.aspx.

"Export and Import DQS Knowledge Bases Using DQSInstaller.exe" https://technet.microsoft.com/en-us/library/hh548693.aspx.

"Using AlwaysOn Availability Groups for High Availability and Disaster Recovery of Data Quality Services (DQS)" https://msdn.microsoft.com/en-us/library/jj874055.aspx.

"Install SQL Server 2012 Business Intelligence Features" https://technet.microsoft.com/en-us/library/hh231681.aspx.

"SQLCATs Guide to High Availability and Disaster Recovery", "SQLCAT's Guide to BI and Analytics" https://blogs.msdn.com/b/sqlcat/archive/2013/10/23/sqlcat-com-ebook-downloads.aspx.

Case Study for failover to a standby data center: "High Availability and Disaster Recovery at ServiceU: A SQL Server 2008 Technical Case Study" https://technet.microsoft.com/en-us/library/ee355221.aspx.

"Business Continuity in Windows Azure SQL Database" https://msdn.microsoft.com/en-us/library/hh852669.aspx.

"SQL Server Managed Backup to Windows Azure" https://msdn.microsoft.com/en-us/library/dn449496.aspx.

"SQL Server Deployment in Windows Azure Virtual Machines" https://msdn.microsoft.com/en-us/library/windowsazure/dn133141.aspx.

Hybrid storage appliance "StorSimple cloud integrated storage" https://www.microsoft.com/en-us/server-cloud/products/storsimple.

This posting is provided "AS IS" with no warranties, and confers no rights.